<a href="https://colab.research.google.com/github/painterV/some_coding/blob/main/lab08_ohie_probability_continued.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# OHIE

The Oregon Health Insurance Experiment (OHIE) was a program run in
Oregon, USA in 2008 in which certain residents of that state were
offered the opportunity to enroll in a subsidized health insurance
program.  To allocate this opportunity fairly, interested people
were invited to participate in a lottery.  The people who won the
lottery ("treated") were then given the opportunity to apply to the insurance
program.  A subset of these people actually applied to the program,
and finally a subset of these applicants who were confirmed to be
eligible were granted insurance.

Since the opportunity to apply for insurance was allocated randomly with all households having equal probability of being selected,
this program is essentially a randomized experiment (although the
randomization was employed for fairness, not to facilitate
research).  In particular, there was great interest in the outcomes
over the subsequent several years of the people who were awarded
insurance, compared to those who participated in the lottery but
were not selected.  Since this selection was random, in principle
the treatment assignment should not be associated with any other baseline measurement. For example, treated households should be similar in age and income to the non-treated households.

In this notebook, we only consider the "baseline" information,
namely, characteristics of the individuals who applied to the
lottery.  We also know who "won" the lottery, who among those given
the opportunity to apply for insurance actually did so, and who
among those who applied for insurance were deemed to be eligible and
granted insurance.

A primary focus of this notebook is to use the OHIE data to
illustrate concepts from probability, including conditional
probabilities and conditional independence.

In [1]:
import os
import pandas as pd
import numpy as np

Load the OHIE data from a file:

In [3]:
section = "100"
#base = "/scratch/stats206s%sf22_class_root/stats206s%sf22_class/materials/data" % (section, section)
base = "./"
df = pd.read_csv(os.path.join(base, "oregonhie.csv.gz"))
df.columns

Index(['person_id', 'household_id', 'treatment', 'draw_treat', 'draw_lottery',
       'applied_app', 'approved_app', 'dt_notify_lottery', 'dt_retro_coverage',
       'dt_app_decision', 'postn_death', 'numhh_list', 'birthyear_list',
       'have_phone_list', 'english_list', 'female_list', 'first_day_list',
       'last_day_list', 'pobox_list', 'self_list', 'week_list',
       'zip_msa_list'],
      dtype='object')

## Setup

We start this lab by recreating a few steps from the previous lab.

In [4]:
df["female"] = df["female_list"] == "1: Female"
df["has_phone"] = df["have_phone_list"] == "Gave Phone Number"
df["treatment"] = df["treatment"] == "Selected"
df["applied_app"] = df["applied_app"] == "Submitted an Application to OHP"
df["approved_app"] = df["approved_app"] == "Yes"
df["zip_msa"] = df["zip_msa_list"] == "Zip code of residence in a MSA"

hh = df.groupby('household_id').agg({'treatment': 'first',
                                     'applied_app': 'max',
                                     'approved_app': 'max',
                                     'zip_msa': 'first',
                                     'birthyear_list': 'median',
                                     'female': 'mean',
                                     'has_phone': 'mean'
                                    })
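
To make the household-level aggregation concrete, here is a minimal sketch on made-up person-level rows (not OHIE data; the frame `people` and the result `toy_hh` are hypothetical names). It assumes, as the code above does by using `'first'`, that treatment is the same for everyone in a household.

```python
import pandas as pd

# Made-up person-level rows (not OHIE data): household 1 has two members.
people = pd.DataFrame({
    "household_id": [1, 1, 2],
    "treatment": [True, True, False],
    "applied_app": [False, True, False],
    "female": [True, False, True],
})

toy_hh = people.groupby("household_id").agg({
    "treatment": "first",   # value taken from the household's first row
    "applied_app": "max",   # True if any member applied
    "female": "mean",       # fraction of members who are female
})
print(toy_hh)
```

Taking the `max` of a boolean column acts like "any", and taking the `mean` gives a proportion, which is why household-level `female` and `has_phone` are fractions rather than True/False values.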

We also computed some probabilities for the different categories.

In [5]:
status = hh.groupby(["treatment", "applied_app", "approved_app"]).size()
status

treatment  applied_app  approved_app
False      False        False           41473
True       False        False            9494
           True         False            7672
                        True             7746
dtype: int64

And converted to probabilities:

In [6]:
status_prop = status / status.sum()
status_prop

treatment  applied_app  approved_app
False      False        False           0.624735
True       False        False           0.143014
           True         False           0.115568
                        True            0.116683
dtype: float64

We will start using an age group variable that we construct below.
First we obtain the age of each household (based on the median birth
year of its members) in the first year of the program, then ask
Pandas to group the households into three groups of roughly equal
size based on age.

In [7]:
hh["age"] = 2008 - hh["birthyear_list"]
hh["agegrp"] = pd.qcut(hh["age"], 3)

Here is the contingency table for age group and treatment status.

In [8]:
counts = hh.groupby(["agegrp", "treatment"]).size().unstack()
counts

treatment       False  True
agegrp
(19.999, 32.5]  13825  8378
(32.5, 47.0]    14207  8647
(47.0, 63.0]    13441  7887


In [9]:
probs = counts / counts.sum().sum()
probs

treatment          False      True
agegrp
(19.999, 32.5]  0.208255  0.126203
(32.5, 47.0]    0.214009  0.130255
(47.0, 63.0]    0.202470  0.118807


Now on to the new material.

# Conditional probabilities of treatment assignment

Recall that a conditional probability is a joint probability divided
by a marginal probability.  In general $P(A \mid B) = P(A,B) / P(B)$, where
A and B are two 'events'. In particular, we're interested in the conditional probabilities of winning the lottery for the different age groups:

$$P(L \mid G = g) = P(L, G = g) / P(G = g)$$

for the three age group categories $g$, where $L$ denotes winning the lottery and $G$ denotes the age group.

The only other possible outcome of the lottery is not winning, and
the probability of this happening is 1 minus the probability of
winning the lottery.  The conditional distribution of 'win lottery'
given 'age group' is the collection consisting of all probabilities
of either winning or not winning the lottery, for all of the age
groups.  That gives 6 separate ways to apply the above formula, since
we can set 'win lottery' to either false or true, and we can set
'age group' to any of the three age groups.
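
As a worked instance of the formula, the sketch below applies it by hand to a single age group, using the counts for the youngest group copied from the contingency table above. The variable names are illustrative only and are not used elsewhere in the lab.

```python
# Household counts for the youngest age group, copied from the
# contingency table above.
not_selected = 13825
selected = 8378
n_total = 66385  # sum of all six cells of the table

joint = selected / n_total                      # P(L, G = g)
marginal = (selected + not_selected) / n_total  # P(G = g)
cond_win = joint / marginal                     # P(L | G = g)
cond_lose = 1 - cond_win                        # P(not L | G = g)

print(round(joint, 6), round(marginal, 6), round(cond_win, 6), round(cond_lose, 6))
```

Note that `n_total` cancels in the ratio, so the conditional probability is simply the selected count divided by the total count for that age group.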

To compute the conditional probabilities by age group, we first
calculate the marginal probabilities by age group by summing across each row (use `axis = 1` in the `sum` function).

Call your variable `age_marg` and show the marginal probabilities for the three age categories.

In [10]:
age_marg = probs.sum(1)
age_marg

agegrp
(19.999, 32.5]    0.334458
(32.5, 47.0]      0.344265
(47.0, 63.0]      0.321277
dtype: float64

<details>

```
age_marg = probs.sum(1)
age_marg
```

</details>

In this case, these probabilities are approximately equal, since we
constructed the age groups using 'qcut' to have this property.
However, age is only measured to the nearest year and there are a lot
of people with the same age.  So it is not possible to perfectly
divide the sample into thirds based on age.
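
The effect of ties can be seen in a small sketch with made-up integer ages (not OHIE data; `ages`, `groups` and `sizes` are hypothetical names). With nine values we might hope for three groups of three, but the repeated value 21 cannot be split across bins.

```python
import pandas as pd

# Made-up integer ages with many ties (not OHIE data).
ages = pd.Series([20, 21, 21, 21, 21, 30, 40, 50, 60])

# Ask for three quantile-based groups.
groups = pd.qcut(ages, 3)

# Count how many values land in each bin, in bin order.
sizes = groups.value_counts().sort_index()
print(sizes)
```

All values equal to 21 fall in the same bin, so the first group ends up larger than the others.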

Next we construct the conditional probabilities, by dividing the
joint probabilities by the marginal probabilities. We cannot simply
use `probs / age_marg`, since this would divide a dataframe by a series,
and by default Pandas matches the index of the series against the
columns of the dataframe rather than its rows, which is not what we
want here.

The 'div' method divides a dataframe by a series, and
contains an additional argument so that 'x.div(y, 0)' means that
every column of 'x' is divided by 'y', and 'x.div(y, 1)' means that
every row of 'x' is divided by 'y'.  Note that in the first case,
the length of 'y' must be equal to the number of rows of 'x', and in
the second case the length of 'y' must be equal to the number of
columns of 'x'.
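
The difference between the two axis settings can be checked on a small made-up table (the names `x`, `by_row_label` and `by_col_label` are hypothetical and chosen only for this sketch). The last line also shows what the plain `/` operator does when the series is labelled like the rows.

```python
import pandas as pd

# Small made-up table: rows "a", "b"; columns "u", "v".
x = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], index=["a", "b"], columns=["u", "v"])
by_row_label = pd.Series([1.0, 2.0], index=["a", "b"])
by_col_label = pd.Series([1.0, 2.0], index=["u", "v"])

# axis=0: the series is matched to the row labels,
# so every column of x is divided by it.
cols_divided = x.div(by_row_label, axis=0)

# axis=1: the series is matched to the column labels,
# so every row of x is divided by it.
rows_divided = x.div(by_col_label, axis=1)

# Plain "/" matches the series to the column labels; "a" and "b" are not
# column labels here, so every entry of the result is NaN.
plain = x / by_row_label

print(cols_divided)
print(rows_divided)
print(plain)
```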

Divide the joint probabilities along the columns using the `age_marg` series. Call your variable `age_cond`.

In [11]:
age_cond = probs.div(age_marg, axis = 0)
age_cond

treatment          False      True
agegrp
(19.999, 32.5]  0.622664  0.377336
(32.5, 47.0]    0.621642  0.378358
(47.0, 63.0]    0.630204  0.369796


<details>
    
```
age_cond = probs.div(age_marg, axis = 0)
age_cond
```
    
</details>

To confirm that we have valid conditional probabilities, check that
the elements sum to 1 along the rows.

In [13]:
age_cond.sum(1)

agegrp
(19.999, 32.5]    1.0
(32.5, 47.0]      1.0
(47.0, 63.0]      1.0
dtype: float64

<details>

```
age_cond.sum(1)
```

</details>

What do we learn from these conditional distributions?

*Write your answer here.*

<details>
    
Within
each age band, the probability of being selected is between about
0.37 and 0.38. The conditional probability of selection therefore barely changes across the age categories, suggesting that age and treatment assignment are independent.
    
</details>

Construct the conditional probabilities of age group given the lottery outcome:

In [14]:
treatment_cond = probs.div(probs.sum(0), 1)
treatment_cond

treatment          False      True
agegrp
(19.999, 32.5]  0.333349  0.336304
(32.5, 47.0]    0.342560  0.347102
(47.0, 63.0]    0.324090  0.316594


<details>

```
treatment_cond = probs.div(probs.sum(0), 1)
treatment_cond
```

</details>

The fact that the conditional probabilities given treatment
assignment are roughly constant (at 1/3 per age band) is a
consequence of the random treatment assignment, and of the fact that
we defined the age bands to include equal fractions of the sample.

## Conditional probabilities of applying for insurance

Now let's consider a non-randomized variable -- whether a person who
is randomly selected to be given the opportunity to apply for
insurance actually submits the application.

Create a table `hh_selected` that holds the households that were selected (i.e. that won the
lottery), since those that were not selected cannot apply for insurance
under this program.

In [15]:
hh_selected = hh.loc[hh["treatment"]]

<details>

```
hh_selected = hh.loc[hh["treatment"]]
```

</details>

Next, calculate counts and proportions for each combination of age
and 'applied_app', which tells us whether each household submitted the
application to obtain insurance.

In [16]:
counts = hh_selected.groupby(["agegrp", "applied_app"]).size().unstack()
probs = counts / counts.sum().sum()
probs

applied_app        False      True
agegrp
(19.999, 32.5]  0.155548  0.180756
(32.5, 47.0]    0.126365  0.220737
(47.0, 63.0]    0.099189  0.217405


<details>

```
counts = hh_selected.groupby(["agegrp", "applied_app"]).size().unstack()
probs = counts / counts.sum().sum()
probs
```

</details>

Find the marginal probabilities for the age groups (`age_marg`) and for application status (`app_marg`).

In [17]:
age_marg = probs.sum(1)
app_marg = probs.sum(0)

<details>
    
```
age_marg = probs.sum(1)
app_marg = probs.sum(0)
```
    
</details>

Create the conditional probabilities for both measurements: `age_cond` conditions on age group and `app_cond` conditions on application status.

In [18]:
age_cond = probs.div(age_marg, axis = 0)
app_cond = probs.div(app_marg, axis = 1)

<details>

```
age_cond = probs.div(age_marg, axis = 0)
app_cond = probs.div(app_marg, axis = 1)
```

</details>

How are these values different from what we saw before?

*Write your answer here.*

<details>

```
age_cond
```

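If we print out the table we notice that as the groups get older, the conditional probability of completing the app increases (from about 0.54 in the youngest group to about 0.69 in the oldest), whereas the probability of being selected in the lottery was nearly constant across age groups.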
</details>

# Independence

Two events A and B are statistically independent if
$P(A, B) = P(A) \cdot P(B)$.  To illustrate this concept using the OHIE data, we
consider whether the event of applying for insurance is independent
of age, considering only people who were given the opportunity to
apply (i.e. who won the lottery).

If two random variables are independent, then the joint
probabilities can be constructed as the product of the marginal
probabilities. 

Use the `np.outer` function and the marginal probabilities from the previous section to compute the probabilities that we would observe if these two measurements were independent. Call the result `ind`.

In [19]:
ind = np.outer(age_marg, app_marg)
ind

array([[0.12816587, 0.20813792],
       [0.13228101, 0.21482079],
       [0.1206546 , 0.19593981]])

<details>

```
ind = np.outer(age_marg, app_marg)
ind
```

</details>

These joint probabilities represent the closest exactly independent
distribution to the observed distribution of our data.
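
One way to see what "exactly independent" means here is to check, as a rough sketch, that the outer-product table has the same marginals as the observed table and that its conditional rows are all identical. The values in `probs_obs` are copied from the rounded joint probabilities shown earlier; the names `probs_obs`, `age_m`, `app_m`, `ind_tab` and `cond_rows` are hypothetical and chosen to avoid overwriting the lab's variables.

```python
import numpy as np

# Observed joint probabilities (rows: age groups; columns: did not apply,
# applied), copied from the rounded table shown earlier.
probs_obs = np.array([[0.155548, 0.180756],
                      [0.126365, 0.220737],
                      [0.099189, 0.217405]])

age_m = probs_obs.sum(axis=1)      # marginal over age groups
app_m = probs_obs.sum(axis=0)      # marginal over application status
ind_tab = np.outer(age_m, app_m)   # joint table under exact independence

# The independent table reproduces both sets of marginals.
print(np.allclose(ind_tab.sum(axis=1), age_m, atol=1e-5))
print(np.allclose(ind_tab.sum(axis=0), app_m, atol=1e-5))

# Under independence, P(applied | age group) is the same in every row.
cond_rows = ind_tab / ind_tab.sum(axis=1, keepdims=True)
print(cond_rows)
```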

Next we can compare these exactly independent joint probabilities to
the observed joint probabilities. Compute the difference between the observed probabilities and `ind`.

In [20]:
resid = probs - ind
resid

applied_app         False      True
agegrp
(19.999, 32.5]   0.027382 -0.027382
(32.5, 47.0]    -0.005916  0.005916
(47.0, 63.0]    -0.021465  0.021465


<details>

```
resid = probs - ind
resid
```

</details>

We notice that the differences are not exactly zero, but would this lead to a very different sample if we were talking about counts? 

Using the `counts` from earlier, compute the total number of observations `n`.

Then scale the differences (`resid`) and the `ind` table up to the scale of counts.

In [21]:
n = counts.sum().sum()
diff_counts = resid * n
expected_ind = n * ind

<details>
    
```
n = counts.sum().sum()
diff_counts = resid * n
expected_ind = n * ind
```

</details>

Now divide the difference in counts by the square root of the expected counts (use `np.sqrt`).

In [22]:
(n * resid) / np.sqrt(n * ind)

applied_app         False      True
agegrp
(19.999, 32.5]  12.071961 -9.473019
(32.5, 47.0]    -2.567430  2.014695
(47.0, 63.0]    -9.753768  7.653904


<details>

```
(n * resid) / np.sqrt(n * ind)
```

</details>

Recall that the choice to submit an application *was not randomly assigned*, and we are seeing a tendency for younger people not to follow through after being selected in the lottery.
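
The scaled differences computed in the last step, (observed count - expected count) / sqrt(expected count), are often called Pearson residuals. As a minimal sketch, the whole calculation can be wrapped in one function; `pearson_residuals` is a hypothetical helper written for this illustration, and the two small tables are made-up counts, not OHIE data.

```python
import numpy as np
import pandas as pd

def pearson_residuals(table):
    """Return (observed - expected) / sqrt(expected) for a table of counts,
    where the expected counts assume independent rows and columns."""
    n = table.to_numpy().sum()
    row_marg = table.sum(axis=1) / n
    col_marg = table.sum(axis=0) / n
    expected = n * np.outer(row_marg, col_marg)
    return (table - expected) / np.sqrt(expected)

# Made-up counts (not OHIE data). In `indep` every row has the same 1:3 split,
# so rows and columns are exactly independent; in `dep` they are not.
indep = pd.DataFrame([[10, 30], [20, 60]], index=["g1", "g2"], columns=[False, True])
dep = pd.DataFrame([[30, 10], [20, 60]], index=["g1", "g2"], columns=[False, True])

res_indep = pearson_residuals(indep)
res_dep = pearson_residuals(dep)
print(res_indep)  # all entries are zero (up to rounding)
print(res_dep)
```

Residuals near zero indicate cells that are close to what independence predicts, while large positive or negative values flag cells with more or fewer observations than expected.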