# Homework: Data Wrangling with the Oregon Health Insurance Experiment (OHIE)
This exercise set is based on the `OHIE_12m.csv` dataset and is designed to take about 45 minutes.

In this exercise, you will:
- Recode and rename variables
- Create derived variables using `.apply()` and `lambda`
- Query the data using `.query()`
- Aggregate data using `.groupby()` and `.agg()`

Let's begin by importing pandas and loading the dataset:

In [1]:
import pandas as pd

# Make Google Drive available to the script
from google.colab import drive
drive.mount('/content/drive')

# Let's load the Oregon Health Insurance Experiment dataset
filename = 'drive/MyDrive/Colab Notebooks/Intro to Python for Epidemiologists/Data/OHIE_12m.csv'
OHIE = pd.read_csv(filename)
OHIE.head()

Mounted at /content/drive


Unnamed: 0,person_id,household_id,treatment,draw_treat,draw_lottery,applied_app,approved_app,dt_notify_lottery,dt_retro_coverage,birthyear_list,...,live_partner_12m,live_parents_12m,live_friends_12m,live_relatives_12m,live_other_12m,hhsize_12m,PHQ2_1,PHQ2_2,PHQ2_sum,PHQ2_cutoff
0,64350,164350,Not selected,,Lottery Draw 6,,,2008-07-14,2008-08-08,1974,...,No,Yes,No,No,No,2.0,3.0,3.0,6.0,True
1,55655,155655,Not selected,,Lottery Draw 7,,,2008-08-12,2008-09-08,1987,...,Yes,No,No,No,No,2.0,1.0,1.0,2.0,False
2,20087,128134,Selected,Draw 6: selected in lottery 07/01/2008,Lottery Draw 6,Submitted an Application to OHP,No,2008-07-14,2008-08-08,1963,...,No,No,No,Yes,No,7.0,0.0,1.0,1.0,False
3,70998,170998,Not selected,,Lottery Draw 7,,,2008-08-12,2008-09-08,1954,...,Yes,No,No,No,No,2.0,3.0,2.0,5.0,True
4,8839,108839,Selected,Draw 8: selected in lottery 09/02/2008,Lottery Draw 8,Did NOT submit an application to OHP,No,2008-09-11,2008-10-08,1964,...,No,No,Yes,No,No,4.0,2.0,2.0,4.0,True


# Start by orienting yourself on the dataset
Use .head, .shape, .info

In [2]:
OHIE.shape

(4000, 44)

In [3]:
OHIE.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 44 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   person_id            4000 non-null   int64  
 1   household_id         4000 non-null   int64  
 2   treatment            4000 non-null   object 
 3   draw_treat           2011 non-null   object 
 4   draw_lottery         4000 non-null   object 
 5   applied_app          2010 non-null   object 
 6   approved_app         2010 non-null   object 
 7   dt_notify_lottery    4000 non-null   object 
 8   dt_retro_coverage    4000 non-null   object 
 9   birthyear_list       4000 non-null   int64  
 10  female_list          4000 non-null   object 
 11  ins_any_12m          3939 non-null   object 
 12  weight_12m           4000 non-null   float64
 13  employ_12m           3868 non-null   object 
 14  edu_12m              3853 non-null   object 
 15  dep_sad_12m          3936 non-null   o

## Exercise 1: Renaming Columns
Rename the following columns to more readable names:

- `PHQ2_sum` → `depression_score`
- `PHQ2_cutoff` → `depressed`
- `smk_curr_12m` → `current_smoker`
- `edu_12m` → `education`

Store the renamed DataFrame as `df_renamed`.

In [4]:
# Your code here
df_renamed = OHIE.rename(columns = {'PHQ2_sum':'depression_score',
                                    'PHQ2_cutoff':'depressed',
                                    'smk_curr_12m':'current_smoker',
                                    'edu_12m':'education'})
df_renamed.columns

Index(['person_id', 'household_id', 'treatment', 'draw_treat', 'draw_lottery',
       'applied_app', 'approved_app', 'dt_notify_lottery', 'dt_retro_coverage',
       'birthyear_list', 'female_list', 'ins_any_12m', 'weight_12m',
       'employ_12m', 'education', 'dep_sad_12m', 'dep_interest_12m',
       'dep_rx_12m', 'current_smoker', 'smk_ever_12m', 'race_white_12m',
       'race_black_12m', 'race_hisp_12m', 'race_asian_12m',
       'race_amerindian_12m', 'race_pacific_12m', 'race_other_qn_12m',
       'chl_chk_12m', 'dia_chk_12m', 'mam_chk_12m', 'pap_chk_12m',
       'hhinc_cat_12m', 'hhinc_pctfpl_12m', 'live_alone_12m',
       'live_partner_12m', 'live_parents_12m', 'live_friends_12m',
       'live_relatives_12m', 'live_other_12m', 'hhsize_12m', 'PHQ2_1',
       'PHQ2_2', 'depression_score', 'depressed'],
      dtype='object')

## Exercise 2: Recoding Categorical Variables
Let's inspect the `current_smoker` variable. What are the levels? How many participants fall into each level? Recode such that participants who smoke even a little bit get a 1, and only participants who smoke not at all get a 0. Store the result in a new column called `smoker_binary`.

In [8]:
df_renamed['current_smoker'].value_counts()

Unnamed: 0_level_0,count
current_smoker,Unnamed: 1_level_1
not at all,2291
every day,1255
some days,338


In [9]:
# Your code here
df_renamed['smoker_binary'] = (df_renamed['current_smoker']
                               .replace({'every day':1,'some days':1,'not at all':0})
                               .astype('Int64'))
df_renamed[['current_smoker','smoker_binary']].head()

  .replace({'every day':1,'some days':1,'not at all':0})


Unnamed: 0,current_smoker,smoker_binary
0,not at all,0
1,not at all,0
2,not at all,0
3,not at all,0
4,some days,1


## Exercise 3: Creating a Derived Variable with `apply()`
Use the `birthyear_list` column to calculate age (assuming the year is 2008) and store it as a new column `age`.

In [10]:
# Your code here
df_renamed['age'] = df_renamed['birthyear_list'].apply(lambda x: 2008 - x)
df_renamed[['birthyear_list','age']].head()

Unnamed: 0,birthyear_list,age
0,1974,34
1,1987,21
2,1963,45
3,1954,54
4,1964,44


## Exercise 4: Create a Depression Severity Category
Using `depression_score`, create a new column `depression_category` with the following logic:

- 0–2: 'None'
- 3–4: 'Mild'
- 5–6: 'Moderate'
- 7–8: 'Severe'

Use `.apply()` with a custom function or a `lambda`. Store the result in an ordered categorical variable.

In [14]:
# Your code here
depr_levels = ['None', 'Mild','Moderate','Severe']
df_renamed['depression_category'] = pd.cut(df_renamed['depression_score'],
                                bins = [0, 3, 5, 7, 9],
                                labels = depr_levels)
df_renamed['depression_category'] = pd.Categorical(
    df_renamed['depression_category'],
    categories = depr_levels,
    ordered = True
)
df_renamed[['depression_score','depression_category']].head()

Unnamed: 0,depression_score,depression_category
0,6.0,Moderate
1,2.0,
2,1.0,
3,5.0,Mild
4,4.0,Mild


In [17]:
# Checking the ordering
df_renamed['depression_category'].dtype

CategoricalDtype(categories=['None', 'Mild', 'Moderate', 'Severe'], ordered=True, categories_dtype=object)

## Exercise 5: Querying the Data
Using `.query()`, filter the dataset to include only respondents who are:

- Female (`female_list == "1: Female"`) (mind the double quotes)
- Currently smoke (using the new `smoker_binary` variable)
- Have a `depression_score` of 5 or more

Store this subset as `df_query`.

In [22]:
# Your code here
df_query = df_renamed.query('female_list == "1: Female" and smoker_binary == 1 and depression_score >= 5')
df_query

  df_query = df_renamed.query('female_list == "1: Female" and smoker_binary == 1 and depression_score >= 5')


Unnamed: 0,person_id,household_id,treatment,draw_treat,draw_lottery,applied_app,approved_app,dt_notify_lottery,dt_retro_coverage,birthyear_list,...,live_other_12m,hhsize_12m,PHQ2_1,PHQ2_2,depression_score,depressed,smoker_binary,age,depr_cat,depression_category
22,28584,128584,Not selected,,Lottery Draw 3,,,2008-04-16,2008-05-08,1960,...,No,2.0,3.0,3.0,6.0,True,1,48,Moderate,Moderate
36,21051,121051,Selected,Draw 4: selected in lottery 05/01/2008,Lottery Draw 4,Submitted an Application to OHP,No,2008-05-09,2008-06-09,1955,...,No,2.0,3.0,3.0,6.0,True,1,53,Moderate,Moderate
72,64335,164335,Selected,Draw 1: selected in lottery 03/05/2008,Lottery Draw 1,Submitted an Application to OHP,Yes,2008-03-10,2008-03-11,1961,...,No,2.0,3.0,3.0,6.0,True,1,47,Moderate,Moderate
80,35463,135463,Not selected,,Lottery Draw 8,,,2008-09-11,2008-10-08,1968,...,No,2.0,3.0,3.0,6.0,True,1,40,Moderate,Moderate
89,1488,101488,Selected,Draw 2: selected in lottery 03/27/2008,Lottery Draw 2,Did NOT submit an application to OHP,No,2008-04-07,2008-04-08,1971,...,No,1.0,3.0,3.0,6.0,True,1,37,Moderate,Moderate
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3863,13053,113053,Selected,Draw 8: selected in lottery 09/02/2008,Lottery Draw 8,Submitted an Application to OHP,Yes,2008-09-11,2008-10-08,1972,...,No,3.0,2.0,3.0,5.0,True,1,36,Mild,Mild
3881,16361,150982,Selected,Draw 8: selected in lottery 09/02/2008,Lottery Draw 8,Submitted an Application to OHP,No,2008-09-11,2008-10-08,1981,...,Yes,5.0,3.0,3.0,6.0,True,1,27,Moderate,Moderate
3906,35172,154209,Not selected,,Lottery Draw 2,,,2008-04-07,2008-04-08,1956,...,No,2.0,3.0,3.0,6.0,True,1,52,Moderate,Moderate
3915,46867,146867,Selected,Draw 5: selected in lottery 06/02/2008,Lottery Draw 5,Submitted an Application to OHP,No,2008-06-11,2008-07-08,1967,...,No,3.0,3.0,3.0,6.0,True,1,41,Moderate,Moderate


## Exercise 6: Aggregating the Data
Using `.groupby()` and `.agg()`, compute the mean `depression_score` and mean `age` grouped by `treatment` group.

In [23]:
# Your code here
df_renamed.groupby('treatment').agg({'depression_score':'mean',
                                     'age':'mean'})

Unnamed: 0_level_0,depression_score,age
treatment,Unnamed: 1_level_1,Unnamed: 2_level_1
Not selected,2.205325,42.550025
Selected,1.933574,42.29637
