In 1946, Joseph Berkson, a biostatistician at the Mayo Clinic, pointed out a peculiarity of observational studies conducted in a hospital setting: even if two diseases have no relation to each other in the general population, they can appear to be associated among patients in a hospital. This effect is known as Berkson's paradox: two conditions that look correlated in a selected sample may be uncorrelated in the population the sample was drawn from. In statistical terms, selection can create an association that does not exist, or even flip its sign, so that two negatively correlated variables appear positively correlated. To understand Berkson’s observation, let’s start with a causal diagram. It’s also helpful to think of a very extreme possibility: neither Disease 1 nor Disease 2 is ordinarily severe enough to cause hospitalization, but the combination is. In this case, we would expect Disease 1 to be highly correlated with Disease 2 in the hospitalized population.

![10](img/10.png)
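The extreme case described above can be worked through numerically. The sketch below is only an illustration: the prevalences `p1` and `p2` and the background admission rate `h0` are assumed values, not taken from the data, and association is measured with the ratio P(both) / (P(Disease 1) P(Disease 2)), which equals 1 under independence.

```r
# Hypothetical values (assumptions for illustration, not from the data)
p1 <- 0.10   # P(Disease 1) in the general population
p2 <- 0.10   # P(Disease 2), independent of Disease 1
h0 <- 0.05   # assumed chance of admission for unrelated reasons

# Joint distribution in the general population under independence
joint <- c(both    = p1 * p2,
           only1   = p1 * (1 - p2),
           only2   = (1 - p1) * p2,
           neither = (1 - p1) * (1 - p2))

# Extreme case: the combination always leads to admission,
# every other group is admitted only at the background rate h0
p_admit <- c(both = 1, only1 = h0, only2 = h0, neither = h0)

# Distribution among hospitalized patients (Bayes' rule)
hosp <- joint * p_admit / sum(joint * p_admit)

# Ratio P(both) / (P(Disease 1) * P(Disease 2))
lift_of <- function(p) {
  unname(p["both"] / ((p["both"] + p["only1"]) * (p["both"] + p["only2"])))
}

lift_general <- lift_of(joint)  # 1 by construction, since the diseases are independent
lift_hosp    <- lift_of(hosp)   # above 1: selection alone creates the association
c(general = lift_general, hospital = lift_hosp)
```

Under these assumed numbers the ratio is 1 in the general population and well above 1 among hospitalized patients, even though the diseases were constructed to be independent.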


## a)

In [13]:
total_population = 17+207+184+2376
total_population

In [14]:
hosp_population = 219+18+15+5
hosp_population

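One way to measure the association between two events is the **lift**, which compares their joint probability with what independence would predict. A lift of 1 means the events are independent, a lift above 1 indicates a positive association, and a lift below 1 a negative one: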

$$ lift = \frac{P(A \cap B)}{P(A)P(B)} $$

In [15]:
# Marginal counts in the general population: 201 = 17 + 184 and 224 = 17 + 207
lift_population = (17/total_population) / ((201/total_population)*(224/total_population))
lift_population

In [16]:
# Marginal counts among hospitalized patients: 23 = 5 + 18 and 20 = 5 + 15
lift_hospital = (5/hosp_population) / ((23/hosp_population)*(20/hosp_population))
lift_hospital

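As the calculation shows, the lift rises from about 1.05 in the general population to about 2.79 in the hospital sample, so the association between the two diseases clearly **increases** among hospitalized patients.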
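To avoid retyping the marginal totals by hand, the lift can also be computed directly from the four cells of a 2x2 table. The helper `lift_2x2` below is a hypothetical name introduced for this sketch, not part of the original analysis.

```r
# Lift from the four cells of a 2x2 table of counts
# n11: both events, n10: first event only, n01: second event only, n00: neither
lift_2x2 <- function(n11, n10, n01, n00) {
  n <- n11 + n10 + n01 + n00
  p_both   <- n11 / n
  p_first  <- (n11 + n10) / n
  p_second <- (n11 + n01) / n
  p_both / (p_first * p_second)
}

lift_2x2(17, 207, 184, 2376)  # general population
lift_2x2(5, 15, 18, 219)      # hospitalized patients
```

Because lift is symmetric in the two events, swapping the two off-diagonal counts does not change the result.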

<hr/>
Another way to assess the association is the $\chi^2$ test. Here we use the general-population proportions to compute the expected counts for the hospitalized table.

In [17]:
tab <- matrix(nrow=2, ncol=2, byrow=TRUE)
colnames(tab) <- c('Yes', 'No')
rownames(tab) <- c( 'Yes', 'No')
tab[1,] = c(17*hosp_population/total_population,
            207*hosp_population/total_population)
            
tab[2,] = c(184*hosp_population/total_population,
            2376*hosp_population/total_population)
tab

| | Yes | No |
|---|---|---|
| Yes | 1.569325 | 19.10884 |
| No | 16.985632 | 219.33621 |


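Now we compute the $\chi^2$ statistic to quantify how far the observed hospital counts deviate from these expected counts:
$$ \chi^2 = \sum{\frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}} $$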

In [18]:
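# Observed hospital counts in the same layout as tab; expected counts are left unrounded
observed <- matrix(c(5, 15, 18, 219), nrow=2, byrow=TRUE)
chi2 = sum((observed - tab)^2 / tab)
chi2

The statistic comes out at roughly 8.44, and almost all of it (about 7.5) comes from the cell where both diseases are present: 5 patients observed against about 1.57 expected. The other three cells stay close to the values predicted from the general population, so the hospital sample differs mainly through an excess of patients with both diseases.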
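As a cross-check, the same statistic can be obtained from R's built-in `chisq.test` run as a goodness-of-fit test against the general-population proportions. This is a sketch under the assumption that the hospital cells are listed in the same order as the population cells. One expected count is below 5, so R warns that the $\chi^2$ approximation may be unreliable; the warning is suppressed here, and any p-value should be read with that caveat.

```r
observed   <- c(5, 15, 18, 219)        # hospital counts
population <- c(17, 207, 184, 2376)    # general-population counts, same cell order

# Expected hospital counts if the hospital mirrored the general population
expected <- sum(observed) * population / sum(population)

# Manual statistic: sum of (Observed - Expected)^2 / Expected
chi2_manual <- sum((observed - expected)^2 / expected)

# Built-in goodness-of-fit test against the population proportions
fit <- suppressWarnings(chisq.test(x = observed, p = population / sum(population)))
chi2_builtin <- unname(fit$statistic)

c(manual = chi2_manual, builtin = chi2_builtin)
```

The two values should agree up to floating-point error, since the goodness-of-fit statistic is defined by the same sum.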

## b)

**Null transactions**: Transactions that contain neither B nor C

The problem with measures like lift and $\chi^2$ is that they are not null-invariant: their values change with the number of null transactions, so a large number of nulls can distort the conclusion. Null invariance is therefore crucial for correlation analysis. For example, consider the test data in the table below:

![test1](img/test1.png)

$BC$ is much rarer than $B\bar{C}$ (1000) and $\bar{B}C$ (1000), while the null transactions $\bar{B}\bar{C}$ (100000) dominate the data. From the first three counts we would infer that B and C are unlikely to occur together.

Yet $lift(B, C) = 8.44$, which suggests that B and C are strongly positively correlated.

Likewise $\chi^2 = 670$: the observed count of $BC$ is far above its expected value, which again suggests a strong correlation.

One way to address this is to resample the data so that null transactions no longer dominate.

Another way is to use null-invariant measures such as **all_confidence**, **coherence**, ...
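The effect of null transactions can be seen by recomputing the measures with a different null count. The sketch below rests on two assumptions. First, the count of $BC$ is not stated in the text; the value 100 is used because it reproduces $lift(B, C) = 8.44$ with the other three counts. Second, all_confidence and coherence follow the definitions commonly given in the data-mining literature: $sup(BC) / \max(sup(B), sup(C))$ and $sup(BC) / (sup(B) + sup(C) - sup(BC))$. The function name `measures` is hypothetical.

```r
# Lift and two null-invariant measures from the four cells of a 2x2 table
measures <- function(n_bc, n_b_only, n_c_only, n_null) {
  n     <- n_bc + n_b_only + n_c_only + n_null
  sup_b <- n_bc + n_b_only   # transactions containing B
  sup_c <- n_bc + n_c_only   # transactions containing C
  c(lift           = (n_bc / n) / ((sup_b / n) * (sup_c / n)),
    all_confidence = n_bc / max(sup_b, sup_c),
    coherence      = n_bc / (sup_b + sup_c - n_bc))
}

# BC = 100 is an assumed count (see the note above)
with_many_nulls <- measures(100, 1000, 1000, 100000)
with_few_nulls  <- measures(100, 1000, 1000, 100)

rbind(with_many_nulls, with_few_nulls)
```

Only lift depends on the null count: it falls from above 1 to below 1 when the nulls are reduced, while all_confidence and coherence stay the same because the null count never enters their formulas.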
