# Association between Categorical Variables

Two categorical variables $X$ and $Y$ are called independent if the probability distribution of one variable does not depend on the value taken by the other.

In [1]:
data <- read.csv("2018TP2_TitanicData.csv")
str(data)

'data.frame':	891 obs. of  12 variables:
 $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
 $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
 $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
 $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
 $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
 $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
 $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...


In [2]:
attach(data)  # lets us refer to columns such as Survived and Pclass by name

In [3]:
Survived <- as.factor(Survived)

In [4]:
Pclass <- as.factor(Pclass)

In [5]:
levels(Survived) <- c("dead", "survived")
levels(Pclass) <- c("1st", "2nd", "3rd")

Creating one-way and two-way tables of __counts__ is the starting point for analyzing the relationship between two categorical variables. The rows are the categories of one variable and the columns are the categories of the second variable. We count how many observations fall in each combination of row and column categories. Generally, we put the explanatory variable in the rows and the response variable in the columns, so it helps to know which variable is which. This convention is not essential, though: swapping rows and columns does not change whether the variables are associated.

In [6]:
table(Survived)
round(prop.table(table(Survived)),2)*100

Survived
    dead survived 
     549      342 

Survived
    dead survived 
      62       38 

In [7]:
table(Pclass)
round(prop.table(table(Pclass)),2)*100

Pclass
1st 2nd 3rd 
216 184 491 

Pclass
1st 2nd 3rd 
 24  21  55 

In [8]:
table(Survived, Pclass)

          Pclass
Survived   1st 2nd 3rd
  dead      80  97 372
  survived 136  87 119

|          | 1st | 2nd | 3rd | Total |
|----------|-----|-----|-----|-------|
| dead     | 80  | 97  | 372 | 549   |
| survived | 136 | 87  | 119 | 342   |
| Total    | 216 | 184 | 491 | __891__ |
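
The table with totals above can be reproduced in R. The following is a minimal sketch that rebuilds the two-way table from the counts shown above, so that it runs without the CSV file; the name `counts` is introduced here for illustration only, and `addmargins()` labels the totals `Sum` rather than `Total`.

```r
# Rebuild the observed counts from the table above (filled column by column)
counts <- matrix(c(80, 136, 97, 87, 372, 119), nrow = 2,
                 dimnames = list(Survived = c("dead", "survived"),
                                 Pclass = c("1st", "2nd", "3rd")))
tb <- as.table(counts)

# Append row and column totals; the margins are labelled "Sum"
tb_total <- addmargins(tb)
tb_total
```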



However, counts are difficult to interpret, especially when the number of observations in each category is unequal. Percents are more understandable than counts for describing how two categorical variables are related.

Whether to use row percents or column percents depends on the question being asked.

In [9]:
# each cell count divided by the grand total (891)
prop.table(table(Survived, Pclass))

          Pclass
Survived          1st        2nd        3rd
  dead     0.08978676 0.10886644 0.41750842
  survived 0.15263749 0.09764310 0.13355780

## Row percents
Row percents are conditional percents: the percent of each row total that falls in each column category. Each row of percents sums to 100% (the rounded values shown may be off by one).

|          | 1st | 2nd | 3rd | Total |
|----------|-----|-----|-----|-------|
| dead     | 15  | 18  | 68  | 100   |
| survived | 40  | 25  | 35  | 100   |
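
As a sketch of how these row percents could be computed, `prop.table()` with `margin = 1` divides each cell by its row total. The table is rebuilt here from the counts shown earlier so the snippet runs on its own; `counts` and `row_pct` are names introduced for illustration.

```r
# Observed counts, filled column by column
counts <- matrix(c(80, 136, 97, 87, 372, 119), nrow = 2,
                 dimnames = list(Survived = c("dead", "survived"),
                                 Pclass = c("1st", "2nd", "3rd")))
tb <- as.table(counts)

# margin = 1 conditions on the rows: each row sums to 100
row_pct <- prop.table(tb, margin = 1) * 100
round(row_pct)
rowSums(row_pct)
```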


## Column percents
Column percents are conditional percents: the percent of each column total that falls in each row category. Each column of percents sums to 100%.

|          | 1st | 2nd | 3rd |
|----------|-----|-----|-----|
| dead     | 37  | 53  | 76  |
| survived | 63  | 47  | 24  |
| Total    | 100 | 100 | 100 |
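
The column percents can be obtained the same way with `margin = 2`, which divides each cell by its column total. This is a minimal sketch using the counts from the two-way table; `counts` and `col_pct` are illustrative names.

```r
# Observed counts, filled column by column
counts <- matrix(c(80, 136, 97, 87, 372, 119), nrow = 2,
                 dimnames = list(Survived = c("dead", "survived"),
                                 Pclass = c("1st", "2nd", "3rd")))
tb <- as.table(counts)

# margin = 2 conditions on the columns: each column sums to 100
col_pct <- prop.table(tb, margin = 2) * 100
round(col_pct)
colSums(col_pct)
```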


Two categorical variables are related if at least two rows __noticeably differ__ in their pattern of row percents. Equivalently, they are related if at least two columns __noticeably differ__ in their pattern of column percents.

In our example, survival is related to ticket class: 63% of first-class passengers survived, compared with 24% of third-class passengers. The two variables are not independent.

In [10]:
# column percents for the 1st-class column: dead and survived out of 216 passengers
round(c(80, 136)/216,2)*100
sum(round(c(80, 136)/216,2)*100)

## Chi-Square Test
The chi-square test for two-way tables is used to investigate the relationship between two categorical variables using a sample. Based on what we observe in the sample, we may infer something about the population. If the p-value for a chi-square test is less than 0.05, we call the observed relationship statistically significant.

* __Null hypothesis__ $H_0$: The two variables are independent.
* __Alternative hypothesis__ $H_1$: The two variables are dependent.
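
For reference, the standard Pearson statistic behind this test compares the observed counts with the counts expected under $H_0$. The notation is introduced here and is not part of the original notebook: $O_{ij}$ is the observed count in row $i$ and column $j$, $E_{ij}$ the expected count, $n$ the total number of observations, and $r$ and $c$ the numbers of rows and columns.

$$
E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{n},
\qquad
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
\text{df} = (r - 1)(c - 1)
$$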



In [11]:
tb <- table(Survived, Pclass)
tb

          Pclass
Survived   1st 2nd 3rd
  dead      80  97 372
  survived 136  87 119

In [12]:
chisq.test(tb)


	Pearson's Chi-squared test

data:  tb
X-squared = 102.89, df = 2, p-value < 2.2e-16


As the p-value is less than 0.05, we reject the null hypothesis. The data support the existence of a relationship between survival and ticket class.
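
To see where the reported statistic comes from, the sketch below recomputes it by hand from the observed counts, assuming the usual Pearson formula (expected count = row total × column total / grand total). The names `observed`, `expected`, `chi_sq`, `df`, and `p_value` are introduced here for illustration.

```r
# Observed counts from the Survived x Pclass table, filled column by column
observed <- matrix(c(80, 136, 97, 87, 372, 119), nrow = 2,
                   dimnames = list(Survived = c("dead", "survived"),
                                   Pclass = c("1st", "2nd", "3rd")))

n <- sum(observed)

# Expected counts under independence: row total * column total / n
expected <- outer(rowSums(observed), colSums(observed)) / n

# Pearson chi-square statistic and its degrees of freedom
chi_sq <- sum((observed - expected)^2 / expected)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of the chi-square distribution
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

round(chi_sq, 2)
df
```

The same expected counts are also available from the test object itself as `chisq.test(observed)$expected`.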

In [13]:
tb

          Pclass
Survived   1st 2nd 3rd
  dead      80  97 372
  survived 136  87 119

In [14]:
tb[,"1st"]  # counts for the 1st-class column

In [15]:
tb["dead",]  # counts for the "dead" row

In [16]:
tb1 <- table(Sex, Survived)
tb1

        Survived
Sex      dead survived
  female   81      233
  male    468      109

In [17]:
chisq.test(tb1)


	Pearson's Chi-squared test with Yates' continuity correction

data:  tb1
X-squared = 260.72, df = 1, p-value < 2.2e-16
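
The header of this output mentions Yates' continuity correction because `chisq.test()` applies it by default to 2×2 tables (`correct = TRUE`). The sketch below, which rebuilds the table from the counts shown above under the illustrative name `counts_sex`, compares the corrected and uncorrected versions of the test.

```r
# Observed counts from the Sex x Survived table, filled column by column
counts_sex <- matrix(c(81, 468, 233, 109), nrow = 2,
                     dimnames = list(Sex = c("female", "male"),
                                     Survived = c("dead", "survived")))

# Default for 2x2 tables: Yates' continuity correction is applied
corrected <- chisq.test(counts_sex)

# Same test without the continuity correction
uncorrected <- chisq.test(counts_sex, correct = FALSE)

corrected$statistic
uncorrected$statistic
```

The correction shrinks each |observed − expected| difference by 0.5 before squaring, so the corrected statistic is slightly smaller; with counts this large, the conclusion is the same either way.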


In [18]:
tb2 <- table(Sex, Pclass)
tb2

        Pclass
Sex      1st 2nd 3rd
  female  94  76 144
  male   122 108 347

In [19]:
chisq.test(tb2)


	Pearson's Chi-squared test

data:  tb2
X-squared = 16.971, df = 2, p-value = 0.0002064
