
## Analysis of contingency tables

During lectures we will focus on:

+ measuring associations in contingency tables
+ computing odds, log-odds and odds ratios
+ visualising contingency tables with correspondence analysis

In [1]:
install.packages("vcd")

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘zoo’, ‘lmtest’




In [3]:
library(vcd)

Loading required package: grid



We create some data based on a real study.

In [4]:
data_vec <- c(1933, 1175, 1186, 646, 579, 671, 707, 780, 767, 768, 962, 1126)
data_mat <- matrix(data = data_vec, nrow = 4, ncol = 3, byrow = TRUE) ## fill row by row; TRUE is safer than T, which can be overwritten
rownames(data_mat) <- c("Rural", "city 20k", "city 20k-100k", "city 100k+")
colnames(data_mat) <- c("Realised","Refusals","Errors")
data_mat

| | Realised | Refusals | Errors |
|---|---|---|---|
| Rural | 1933 | 1175 | 1186 |
| city 20k | 646 | 579 | 671 |
| city 20k-100k | 707 | 780 | 767 |
| city 100k+ | 768 | 962 | 1126 |


In [7]:
sum(data_mat) ## grand total of the table

Let's calculate proportions with the `prop.table` function.

In [6]:
prop.table(data_mat)*100 ## each cell as a percentage of the grand total

| | Realised | Refusals | Errors |
|---|---|---|---|
| Rural | 17.106195 | 10.39823 | 10.495575 |
| city 20k | 5.716814 | 5.123894 | 5.938053 |
| city 20k-100k | 6.256637 | 6.902655 | 6.787611 |
| city 100k+ | 6.79646 | 8.513274 | 9.964602 |


In [8]:
prop.table(data_mat, margin = 1)*100 ## row percentages (each row sums to 100)

| | Realised | Refusals | Errors |
|---|---|---|---|
| Rural | 45.0163 | 27.36376 | 27.61993 |
| city 20k | 34.07173 | 30.53797 | 35.3903 |
| city 20k-100k | 31.36646 | 34.60515 | 34.02839 |
| city 100k+ | 26.89076 | 33.68347 | 39.42577 |


In [9]:
prop.table(data_mat, margin = 2)*100 ## column percentages (each column sums to 100)

| | Realised | Refusals | Errors |
|---|---|---|---|
| Rural | 47.6813 | 33.60984 | 31.62667 |
| city 20k | 15.93488 | 16.56178 | 17.89333 |
| city 20k-100k | 17.43957 | 22.31121 | 20.45333 |
| city 100k+ | 18.94425 | 27.51716 | 30.02667 |
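
As a quick check on what `margin` does, the sketch below uses base R only and rebuilds the same table. It confirms that row percentages sum to 100 within each row, that column percentages sum to 100 within each column, and that `margin = 1` is equivalent to dividing each row by its row total. The names `row_pct`, `col_pct` and `manual_row_pct` are introduced here purely for illustration.

```r
data_vec <- c(1933, 1175, 1186, 646, 579, 671, 707, 780, 767, 768, 962, 1126)
data_mat <- matrix(data_vec, nrow = 4, ncol = 3, byrow = TRUE,
                   dimnames = list(c("Rural", "city 20k", "city 20k-100k", "city 100k+"),
                                   c("Realised", "Refusals", "Errors")))

row_pct <- prop.table(data_mat, margin = 1) * 100  # percentages within each row
col_pct <- prop.table(data_mat, margin = 2) * 100  # percentages within each column

rowSums(row_pct)  # every row total should be 100
colSums(col_pct)  # every column total should be 100

# margin = 1 by hand: divide every row by its row total
# (rowSums() has one entry per row, so recycling lines up with the rows)
manual_row_pct <- data_mat / rowSums(data_mat) * 100
all.equal(row_pct, manual_row_pct)
```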


How can we check whether the two variables in this table are related? For this task we may use:

+ `chisq.test` -- performs both the $\chi^2$ goodness-of-fit test (does a variable follow a given distribution?) and the $\chi^2$ test of independence for a contingency table
+ `vcd::assocstats` -- calculates association statistics ($\chi^2$ tests, Cramér's V, etc.)

In [10]:
chisq.test(data_mat)


	Pearson's Chi-squared test

data:  data_mat
X-squared = 290.2, df = 6, p-value < 2.2e-16
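
To make the test less of a black box, here is a minimal sketch that recomputes the Pearson statistic by hand from the expected counts under independence, $E_{ij} = (\text{row total}_i \times \text{column total}_j) / n$. It rebuilds the table so that it runs on its own; the names `expected`, `chi2`, `df` and `p_value` are chosen here for illustration.

```r
data_vec <- c(1933, 1175, 1186, 646, 579, 671, 707, 780, 767, 768, 962, 1126)
data_mat <- matrix(data_vec, nrow = 4, ncol = 3, byrow = TRUE)

n <- sum(data_mat)

# Expected counts if place of residence and survey outcome were independent
expected <- outer(rowSums(data_mat), colSums(data_mat)) / n

# Pearson statistic: sum over cells of (observed - expected)^2 / expected
chi2 <- sum((data_mat - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(data_mat) - 1) * (ncol(data_mat) - 1)

# Upper-tail probability of the chi-squared distribution
p_value <- pchisq(chi2, df = df, lower.tail = FALSE)

c(chi2 = chi2, df = df)  # should agree with the chisq.test output
p_value
```

The result should agree with `chisq.test(data_mat)`, which applies no continuity correction to tables larger than 2×2.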


In [11]:
assocstats(data_mat)

                    X^2 df P(> X^2)
Likelihood Ratio 290.04  6        0
Pearson          290.20  6        0

Phi-Coefficient   : NA 
Contingency Coeff.: 0.158 
Cramer's V        : 0.113 

Output of the `assocstats` function:

+ Likelihood Ratio -- the likelihood-ratio ($G^2$) statistic, an alternative test of the same independence hypothesis
```
Likelihood Ratio 290.04  6        0
```
+ Pearson -- the Pearson $\chi^2$ statistic, the same value that `chisq.test` reports
```
Pearson          290.20  6        0
```

+ three statistics that measure the strength of the relationship (the Phi coefficient is `NA` here because it is reported only for 2×2 tables)
```
Phi-Coefficient   : NA 
Contingency Coeff.: 0.158 
Cramer's V        : 0.113 
```
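
The reported values can be reproduced with base R, which shows how they relate to each other. The sketch below assumes the usual textbook definitions: $G^2 = 2\sum O \log(O/E)$, the contingency coefficient $C = \sqrt{\chi^2/(\chi^2 + n)}$, and Cramér's $V = \sqrt{\chi^2/(n\,(k-1))}$, where $k$ is the smaller of the number of rows and columns. Variable names are illustrative.

```r
data_vec <- c(1933, 1175, 1186, 646, 579, 671, 707, 780, 767, 768, 962, 1126)
data_mat <- matrix(data_vec, nrow = 4, ncol = 3, byrow = TRUE)

n <- sum(data_mat)
expected <- outer(rowSums(data_mat), colSums(data_mat)) / n

# Pearson chi-squared and likelihood-ratio (G^2) statistics
chi2 <- sum((data_mat - expected)^2 / expected)
g2 <- 2 * sum(data_mat * log(data_mat / expected))

# Strength-of-association measures derived from chi-squared
k <- min(nrow(data_mat), ncol(data_mat))
contingency_coef <- sqrt(chi2 / (chi2 + n))
cramers_v <- sqrt(chi2 / (n * (k - 1)))

round(c(G2 = g2, Pearson = chi2, C = contingency_coef, V = cramers_v), 3)
```

Both coefficients are small even though the test is highly significant: with over eleven thousand observations, a weak association is enough to reject independence.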

## Odds and log-odds

Assume that we have the following random variable (passing the CDA lecture) that follows a Bernoulli distribution:

$$
X \sim \text{Bernoulli}(p = 0.7)
$$

In [13]:
0.7 / (1-0.7) * 10 ## odds of passing, expressed per 10 students who fail

An artificial example about COVID-19 vaccines: the probability of success (i.e. not falling ill with COVID-19 after vaccination) is 0.90. What are the odds?



In [15]:
0.9/0.1*1000 ## odds of success, expressed per 1000 people who fall ill
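
Both calculations use the same definition: for a success probability $p$, the odds are $p/(1-p)$ and the log-odds are $\log\{p/(1-p)\}$. A small sketch, where `odds()` is a hypothetical helper defined only for this illustration, and `qlogis()` and `plogis()` are the base R logit and inverse-logit functions:

```r
# Hypothetical helper: convert a probability into odds
odds <- function(p) p / (1 - p)

p_pass <- 0.7  # probability of passing the lecture
p_vacc <- 0.9  # probability of not falling ill after vaccination

odds(p_pass)  # roughly 2.33 passes for every fail
odds(p_vacc)  # 9 successes for every failure

# Log-odds: the natural logarithm of the odds
log(odds(p_pass))
qlogis(p_pass)  # same quantity via the base R logit function

# The inverse logit recovers the original probability
plogis(log(odds(p_pass)))
```

Odds range from 0 to infinity and equal 1 when $p = 0.5$; log-odds range over the whole real line and equal 0 when $p = 0.5$.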

Let's calculate the odds ratio for the following comparison:

1. we compare rural areas vs cities (all three city categories combined)
2. we compare participation in the study (realised interviews) vs refusals

In [17]:
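## realised interviews (part) and refusals (refu); the "Errors" column is left out
rural_part <- 1933
rural_refu <- 1175
city_part <- 646 + 707 + 768 ## sum over the three city categories
city_refu <- 579 + 780 + 962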

In [22]:
prob_part_rural <- rural_part/(rural_part + rural_refu)
prob_part_rural ## probability of participation (vs refusal) for people who live in rural areas
prob_part_city <- city_part / (city_part + city_refu)
prob_part_city ## probability of participation (vs refusal) for people who live in cities

Now, let's calculate the odds ratio:

$$
\text{odds ratio} = \frac{\text{odds for rural areas}}{\text{odds for cities}}
$$



In [25]:
odds_rural <- prob_part_rural / (1-prob_part_rural)
odds_city <- prob_part_city / (1-prob_part_city)
odds_ratio <- odds_rural/odds_city
odds_ratio

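**Interpretation of this result**: the odds ratio is greater than 1, which means that in this sample participation is associated with place of residence: the odds of participating rather than refusing are higher in rural areas than in cities. A value of about 1.8 means that the odds of participation in rural areas are about 1.8 times the odds in cities. For example, in rural areas there are about 165 participants per 100 refusals, whereas in cities there are about 91 participants per 100 refusals. An odds ratio of exactly 1 would indicate no association.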
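
The same odds ratio can be read directly off the 2×2 table as a cross-product ratio, $(a \cdot d)/(b \cdot c)$, without computing probabilities first. The sketch below rebuilds the collapsed table from the counts used above; `tab`, `odds_by_place` and `cross_product` are illustrative names.

```r
# 2x2 table: rural vs all cities combined, realised interviews vs refusals
tab <- matrix(c(1933,            1175,
                646 + 707 + 768, 579 + 780 + 962),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("Rural", "City"), c("Realised", "Refusals")))

# Odds of participation within each place of residence
odds_by_place <- tab[, "Realised"] / tab[, "Refusals"]
round(odds_by_place * 100)  # participants per 100 refusals

# Cross-product ratio: (a * d) / (b * c)
cross_product <- (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])
cross_product

# Identical to the ratio of the two odds
unname(odds_by_place["Rural"] / odds_by_place["City"])
```

Working from counts avoids the intermediate probabilities, because the row totals cancel in the ratio $p/(1-p)$.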


In [26]:
log(odds_ratio)
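
One reason to work with the log odds ratio is that it is symmetric around zero: swapping the reference group only flips the sign, whereas the odds ratio itself changes from about 1.8 to about 0.56. A minimal sketch using the same counts, with variable names chosen for illustration:

```r
odds_rural <- 1933 / 1175
odds_city <- (646 + 707 + 768) / (579 + 780 + 962)

# Rural vs city and city vs rural
log_or_rural_vs_city <- log(odds_rural / odds_city)
log_or_city_vs_rural <- log(odds_city / odds_rural)

log_or_rural_vs_city  # positive: higher odds of participation in rural areas
log_or_city_vs_rural  # same magnitude, opposite sign

# The log odds ratio is also the difference of the two log-odds
log(odds_rural) - log(odds_city)
```

A log odds ratio of 0 corresponds to an odds ratio of 1, i.e. no association.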