# Analysis of categorical data

## Contingency table

A contingency table is "A statistical table that shows the observed frequencies of data elements classified according to two variables, with the rows indicating one variable and the columns indicating the other variable". (American Heritage Dictionary of the English Language, 2009)

Since we have two variables (the outcome variable and the 'group' or 'exposure' variable) and each of these variables has two possible values (diseased/not diseased and exposed/not exposed), we can construct a 2 x 2 contingency table.  We call it a 2 x 2 table because there are two rows and two columns.  This is a special case of the R x C contingency table (R rows and C columns).

This type of table is very common in epidemiology, where one dimension conveys the disease category and the other the exposure category (the tables below put the groups in the rows and the outcome in the columns). The main goal is to compare the frequency of disease between the two groups (in a 2 x 2 setting), which are normally the exposed and unexposed individuals.

## 2 x 2 contingency table

Null hypothesis: $H_0: P_{disease|exposed} = P_{disease|unexposed}$.

Notation: $X_{ij}$ refers to the cell value in the ith row and jth column, e.g. $X_{11}$ is the cell in the 1st row and 1st column, $X_{12}$ is the cell in the 1st row and 2nd column, etc.

For a two-sample binomial variable, where p = probability of success:

|Group| Success | Failure |Total |
|--- | --- | --- | --- |
|Group 1 | x11 | x12 | n1 = x11 + x12 |
|Group 2 | x21 | x22 | n2 = x21 + x22 |
|Total | c1 = x11 + x21 | c2 = x12 + x22 | N = n1 + n2 = c1 + c2 |

Another way to denote this 2 x 2 contingency table is the a, b, c, d notation:

|-|Case|Control|
|---|---|---|
|Exposed|a|	b |
|Unexposed|c|d|
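The two notations can be lined up in R. The sketch below is illustrative only: it uses the smoking-by-gender counts that appear later in this notebook, and treating "female" as group 1 and "Yes" as the outcome of interest is an assumption made for the example.

```r
# Counts taken from the smoking-by-gender table used later in this notebook.
# Assumption: female = group 1, male = group 2, "Yes" = outcome of interest.
tab <- as.table(matrix(c(39, 279,
                         26, 310),
                       nrow = 2, byrow = TRUE,
                       dimnames = list(Group = c("female", "male"),
                                       Smoke = c("Yes", "No"))))

# Cell x_ij is row i, column j; the a, b, c, d labels follow the second table
x11 <- tab[1, 1]  # a
x12 <- tab[1, 2]  # b
x21 <- tab[2, 1]  # c
x22 <- tab[2, 2]  # d

# Row totals (n1, n2), column totals (c1, c2) and the grand total N
row_totals <- rowSums(tab)
col_totals <- colSums(tab)
N <- sum(tab)

# Same table with the margins appended, as in the first layout
addmargins(tab)
```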

### Let's define some terms:

    p1 = probability of developing disease in exposed (Group 1) individuals
    p2 = probability of developing disease in unexposed (Group 2) individuals

## Risk Difference

We can quantify the absolute difference in risk attributable to the exposure using the risk difference (also called the attributable risk). In this case we calculate the risk in the exposed group minus the risk in the unexposed group.

    RD = p1 - p2

In [1]:
Lung_DS = read.csv("LungCapData2.csv", header = TRUE)
head(Lung_DS)

| |Age|LungCap|Height|Gender|Smoke|
|---|---|---|---|---|---|
| |`<int>`|`<dbl>`|`<dbl>`|`<fct>`|`<fct>`|
|1|9|3.124|57.0|female|no|
|2|8|3.172|67.5|female|no|
|3|7|3.16|54.5|female|no|
|4|9|2.674|53.0|male|no|
|5|9|3.685|57.0|male|no|
|6|8|5.008|61.0|female|no|


In [2]:
Tab_Lung = table(Lung_DS$Gender,Lung_DS$Smoke)
Tab_Lung

        
          no yes
  female 279  39
  male   310  26

In [3]:
Tab_Lung_2 = cbind(Tab_Lung[,2],Tab_Lung[,1])
colnames(Tab_Lung_2) = c("Yes","No")
Tab_Lung_2

| |Yes|No|
|---|---|---|
|female|39|279|
|male|26|310|


In [None]:
# let's calculate p1 and p2

# here the two groups are the rows of Tab_Lung_2 (assumption: female = group 1, male = group 2)
# and the outcome is smoking, so each probability is the proportion of smokers within a group

p1 = Tab_Lung_2["female", "Yes"] / sum(Tab_Lung_2["female", ])

p2 = Tab_Lung_2["male", "Yes"] / sum(Tab_Lung_2["male", ])

p1 - p2

### Point and interval estimation for the Risk Difference

To obtain an unbiased point estimate of the risk difference we use the difference in sample proportions, $\hat{p}_1 - \hat{p}_2$. To obtain a confidence interval we use the equation below: because these are binary variables measured on two independent samples, we can use the normal approximation to the binomial distribution.

![title](Risk_difference.png)

where z is the critical value of the standard normal distribution and alpha is the significance level; alpha = 0.05 (z ≈ 1.96) gives a 95% confidence interval for the difference in proportions.

In [None]:
##Confidence interval

## check that n1*p1*q1 >= 5, and likewise that n2*p2*q2 >= 5

##solve confidence interval
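One possible way to work through the cell above is sketched here. It assumes the figure shows the usual Wald interval, $\hat{p}_1 - \hat{p}_2 \pm z_{1-\alpha/2}\sqrt{\hat{p}_1\hat{q}_1/n_1 + \hat{p}_2\hat{q}_2/n_2}$, and it hard-codes the counts from the `Tab_Lung_2` output so that it runs on its own; taking "female" as group 1 is an assumption.

```r
# Counts copied from the Tab_Lung_2 output (assumption: female = group 1)
x1 <- 39; n1 <- 39 + 279   # female smokers, all females
x2 <- 26; n2 <- 26 + 310   # male smokers, all males

p1 <- x1 / n1; q1 <- 1 - p1
p2 <- x2 / n2; q2 <- 1 - p2
rd <- p1 - p2

# Rule of thumb for the normal approximation to the binomial
approx_ok <- (n1 * p1 * q1 >= 5) && (n2 * p2 * q2 >= 5)

# Wald interval for the risk difference
alpha <- 0.05
z <- qnorm(1 - alpha / 2)
se_rd <- sqrt(p1 * q1 / n1 + p2 * q2 / n2)
ci_rd <- rd + c(-1, 1) * z * se_rd

approx_ok
rd
ci_rd
```

If the interval contains 0, the data are compatible with no difference in risk between the two groups at the chosen confidence level.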

## Risk Ratio

We can also obtain a point estimate for the risk ratio, which measures the strength of the association between exposure and outcome by comparing the risk in the exposed group with the risk in the unexposed group. A risk ratio of 1 indicates no difference in risk, a risk ratio less than 1 indicates lower risk in the exposed group (a protective exposure), and a risk ratio larger than 1 indicates increased risk in the exposed group.

    RR = p1/p2
    


In [None]:
#RR = p1/p2

p1/p2 

We can also calculate a confidence interval for the risk ratio. This CI again relies on the normal approximation to the binomial distribution and assumes that the groups evaluated are independent. Because the sampling distribution of a ratio is skewed, we work on the log scale: the log of the estimated risk ratio is closer to normally distributed than the risk ratio itself, so the interval is built for ln(RR) and its limits are then exponentiated.

![title](CI_RR.png)


Relative risk is another common name for the risk ratio. One disadvantage of the relative risk is that the range of values it can take is constrained by the probability in the denominator (the unexposed group): for example, if p2 = 0.5 the relative risk cannot exceed 2. To avoid this constraint we can use another measure for comparing the proportions in the exposed and unexposed groups, the odds ratio.

In [None]:
##Solve CI
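A possible solution for the cell above is sketched here. It assumes the figure shows the usual log-scale interval, with $SE[\ln(\hat{RR})] = \sqrt{1/x_1 - 1/n_1 + 1/x_2 - 1/n_2}$, where $x_1$ and $x_2$ are the numbers with the outcome in each group. The counts are copied from the `Tab_Lung_2` output, and taking "female" as group 1 is an assumption.

```r
# Counts copied from the Tab_Lung_2 output (assumption: female = group 1)
x1 <- 39; n1 <- 39 + 279
x2 <- 26; n2 <- 26 + 310

rr <- (x1 / n1) / (x2 / n2)

# Standard error of ln(RR), then a normal interval on the log scale
se_log_rr <- sqrt(1 / x1 - 1 / n1 + 1 / x2 - 1 / n2)
z <- qnorm(0.975)

# Exponentiate the limits to return to the ratio scale
ci_rr <- exp(log(rr) + c(-1, 1) * z * se_log_rr)

rr
ci_rr
```

Because the limits are exponentiated, the interval is not symmetric around the point estimate on the ratio scale.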

## Odds ratio

The odds in favor of success can be defined as:

    odds = p / (1-p)

In [None]:
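# q = 1 - p is written out here because q1 and q2 are not defined in the cells above
OR = (p1 * (1 - p2)) / (p2 * (1 - p1))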

The odds ratio is a very common measure of effect used in many disciplines. As we just defined, the odds of success can be calculated as $\frac {p}{(1-p)}$ where p is the probability of success for any discrete event.  So, if the probability of getting a cold is 0.40 then the odds of getting a cold $= \frac {0.40}{(1-0.40)}=  \frac {0.40}{0.60}$  or 2 to 3; if the probability equals 0.25 then the odds are 1 to 3; if the probability is 0.75 then the odds are 3 to 1, etc.
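The conversions in the paragraph above can be checked with a small helper. The function name `odds` is introduced here for illustration only.

```r
# Convert a probability (or a vector of probabilities) to odds
odds <- function(p) {
  stopifnot(all(p >= 0 & p < 1))  # the odds are undefined for p = 1
  p / (1 - p)
}

# The three probabilities from the text: 2 to 3, 1 to 3 and 3 to 1
cold_odds <- odds(c(0.40, 0.25, 0.75))
cold_odds
```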

### Odds ratio of the proportions p1 and p2

A common measure of effect in case-control studies is based on odds: the odds of exposure among the diseased, and the odds of exposure among the not diseased.  If we take the ratio of these two odds, we have the **odds ratio**:

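$$OR = \frac {P_{d|e}/(1-P_{d|e})}{P_{d|!e} /(1-P_{d|!e})}$$

where $P_{d|e}$ and $P_{d|!e}$ are the probabilities of disease among the exposed and the unexposed. This form uses the odds of disease rather than the odds of exposure; in a 2 x 2 table the two ratios take the same value. With $p_1 = P_{d|e}$, $p_2 = P_{d|!e}$ and $q = 1 - p$, the odds ratio is estimated by

$$\hat{OR} = \frac {\hat{p}_1\hat{q}_2}{\hat{p}_2\hat{q}_1}$$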


To calculate the odds ratio: construct a 2x2 contingency table with either cases in the first row and the presence of the risk factor in the first column (or the presence of the risk factor in the first row and cases in the first column).  You will see both scenarios in papers so you must learn it as the odds and not just as a, b, c and d.

|-|Case|Control|
|---|---|---|
|Exposed|a|	b |
|Unexposed|c|d|

Let $p_{case|exp}$ = probability of being a case among the exposed.  Then $p_{case|exp} = \frac {a}{(a+b)}$
	
then the odds of being a case among the exposed = $\frac {a/(a+b)}{1-[a/(a+b)]} = \frac {a}{b} $
 
Let $p_{case|no-exp}$ = probability of being a case among the non-exposed.  Then $p_{case|no-exp} =\frac  {c}{(c+d)} $

and the odds of being a case among the non-exposed = $\frac {c/(c+d)}{1-[c/(c+d)]} = \frac {c}{d} $

Thus, the odds ratio = $\frac {\text{odds of being a case among the exposed}}{\text{odds of being a case among the unexposed}} = \frac {a/b}{c/d} = \frac {ad}{bc}$
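The algebra above can be checked numerically. This is a sketch only: the a, b, c, d values are the smoking-by-gender counts from earlier in the notebook, relabelled as if "female" were the exposed row and "Yes" the case column, which is an assumption made for the example.

```r
# Illustrative cell counts (assumed mapping: a, b = first row; c, d = second row)
cells <- c(a = 39, b = 279, c = 26, d = 310)

# Probability of being a case within each row
p_exp   <- cells[["a"]] / (cells[["a"]] + cells[["b"]])
p_unexp <- cells[["c"]] / (cells[["c"]] + cells[["d"]])

# Odds ratio from the two probabilities (ratio of case odds)
or_from_probs <- (p_exp / (1 - p_exp)) / (p_unexp / (1 - p_unexp))

# Cross-product form ad / bc
or_cross <- (cells[["a"]] * cells[["d"]]) / (cells[["b"]] * cells[["c"]])

# Ratio of exposure odds: (a/c) among cases over (b/d) among controls
or_exposure <- (cells[["a"]] / cells[["c"]]) / (cells[["b"]] / cells[["d"]])

c(or_from_probs, or_cross, or_exposure)
```

All three expressions give the same number, which is why the odds ratio can be estimated whether the sampling was done on exposure or on disease status.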

As reviewed above, the OR can be read as the odds in favor of disease for the exposed group divided by the odds in favor of disease for the unexposed group, also known as the disease-odds ratio. However, the same calculation can be used for any 2x2 table.

One method (Woolf) for calculating the 95% CI for the OR is:

$$\exp\left[\ln(\hat{OR}) \pm z_{1-\alpha/2} \sqrt{\frac {1}{a} + \frac {1}{b}+\frac {1}{c}+\frac {1}{d}}\right]$$

Now, if the OR = 1 then the odds of disease among the exposed = odds of disease among the non-exposed which is equivalent to saying 'exposure does not appear to play a role in the disease.'  If the OR > 1 then exposed people are more likely to be diseased than those not exposed so it would appear that exposure is a risk.  If the OR < 1 then exposed people are less likely to be diseased than those that are unexposed, therefore the exposure has a protective effect (think medical treatment).

### Example:

A clinical trial of gamma globulin in the treatment of children with Kawasaki syndrome (a rare but potentially fatal condition) randomized approximately half of the patients to receive gamma globulin.  The other half received the standard treatment of aspirin.  Under the aspirin treatment, approximately one fourth of patients developed coronary abnormalities.  It was thought that gamma globulin would help prevent the development of coronary abnormalities.  Subjects were followed over a 7-week period.

H0:  Patients receiving gamma globulin will develop coronary abnormalities at the same or higher rate as those patients receiving the present standard of care which is aspirin. (OR ≥ 1)

Ha:  Patients receiving gamma globulin will develop fewer coronary abnormalities than those patients receiving the present standard of care which is aspirin. (OR < 1)


The results were:

|Treatment|Coronary abnormalities: Yes|Coronary abnormalities: No|Total|
| --- | --- | --- | --- |
|Gamma globulin|5|78|83|
|Aspirin|21|63|84|
|Total|26|141|167|

This is a 'case/control' scenario because Kawasaki syndrome is, fortunately, rare.  Thus, we cannot use the relative risk as a measure of effect but will use the odds ratio instead. 

$$\hat{OR}  =  \frac {5*63}{78*21}=0.19$$

The 95% CI for the OR:

$$\exp\left(\ln(0.19) \pm 1.96\sqrt {\frac{1}{5}+ \frac {1}{78}+ \frac {1}{21} + \frac {1}{63}}\right) = \exp(-1.66 \pm 1.03) = (0.0679, 0.5326)$$

Since the CI does not contain the null value (OR = 1), we would reject the null hypothesis and conclude that gamma globulin appears to protect Kawasaki patients from ancillary coronary abnormalities.
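The hand calculation can be reproduced in base R. The helper name `woolf_ci` and its argument names are introduced here for illustration; it is a minimal sketch of Woolf's interval and not a replacement for a packaged routine.

```r
# Woolf (log-scale) confidence interval for the odds ratio of a 2 x 2 table
# x11, x12 = first row (a, b); x21, x22 = second row (c, d)
woolf_ci <- function(x11, x12, x21, x22, conf.level = 0.95) {
  or <- (x11 * x22) / (x12 * x21)
  se_log_or <- sqrt(1 / x11 + 1 / x12 + 1 / x21 + 1 / x22)
  z <- qnorm(1 - (1 - conf.level) / 2)
  c(estimate = or,
    lower = exp(log(or) - z * se_log_or),
    upper = exp(log(or) + z * se_log_or))
}

# Gamma globulin: 5 yes, 78 no; aspirin: 21 yes, 63 no
kawasaki <- woolf_ci(5, 78, 21, 63)
kawasaki
```

The limits come out close to, but not identical with, the hand calculation, because the hand calculation rounds the OR and the margin of error before exponentiating. The conclusion is unchanged: the upper limit stays below 1.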

In [None]:
#install.packages("vcd")
library(vcd)
tabl1 = matrix(c(5, 78, 21, 63), ncol = 2, byrow = TRUE)
odds_rat = oddsratio(tabl1, log = FALSE)

In [None]:
summary(odds_rat)
confint(odds_rat)

In [None]:
data("CoalMiners")
CoalMiners

[z_table](http://www.z-table.com/)