# Lab 8: Experiments and Models

Welcome to Lab 8! This week, we will complete the hiring test that Enroll America used to screen data analyst candidates. Enroll America was a non-profit group that used data science to help sign people up for health insurance under the Affordable Care Act.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Bring in model data for Part 1
data_model = pd.read_csv("ea_test_model_analysis.csv")
data_model_dict = pd.read_csv("model_data_dict.csv")
# Bring in experiment data for Part 2
data_experiment = pd.read_csv("ea_test_experiment_analysis.csv")
data_experiment_dict = pd.read_csv("experiment_data_dict.csv")

## Part 1: Model

The DataFrame `data_model`, loaded above, contains a dataset that was used in part to build a predictive model of an individual’s likelihood of being uninsured. A data dictionary is included in `data_model_dict`. The column `q1_healthcare` contains the responses to a survey question asking whether the respondent has health insurance.

Using the dataset, produce an evaluation of the model that will help someone understand how the model works. Some ideas to consider include basic summary information about the model, model validation, cross tabs of interesting demographic groups and useful visualizations.
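As one possible starting point for the cross tabs, the sketch below compares the observed uninsured rate with the average model score inside each demographic group. It assumes the `q1_healthcare` coding given in the data dictionary (1=Yes, 2=No, 3=Unknown, 4=Refuse) and assumes that `uninsured_score` is on a 0 to 100 scale, which is what the first few rows suggest. The helper name `group_summary` is hypothetical.

```python
import pandas as pd

def group_summary(df, group_col, score_col="uninsured_score",
                  outcome_col="q1_healthcare"):
    """Compare observed uninsured rates with mean model scores by group."""
    q1 = df[outcome_col]
    # 1.0 = uninsured (code 2), 0.0 = insured (code 1).
    # Codes 3 (Unknown) and 4 (Refuse) become NaN and are dropped.
    uninsured = (q1 == 2).astype(float).where(q1.isin([1, 2]))
    work = df.assign(uninsured=uninsured).dropna(subset=["uninsured", score_col])
    summary = work.groupby(group_col).agg(
        n=("uninsured", "size"),
        observed_pct=("uninsured", "mean"),
        mean_score=(score_col, "mean"),
    )
    # Assumption: scores run from 0 to 100, so put the observed rate in percent too.
    summary["observed_pct"] = 100 * summary["observed_pct"]
    return summary
```

For example, `group_summary(data_model, "ethnicity_4way")` would give one row per group. Groups where `mean_score` and `observed_pct` differ a lot are worth a closer look. Note that `age_5way` mixes upper and lower case in the preview (`a_18to26` and `A_18TO26`), so it may need normalizing, for instance with `.str.upper()`, before grouping.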

First, take a look at the dictionary and the data.

In [2]:
# Here is data_model
data_model.head()

Unnamed: 0,id,gender,age_5way,ethnicity_4way,q1_healthcare,married,modeled_income_bucket,uninsured_score,state_cd,voter_status,party_id,hh_size,gen_vote_2012,gen_vote_2010,support_score,turnout_score,county_fips,zip,census_neighborhood_type
0,MI-9738252,M,a_18to26,W,1,0,1-under 40k,48.88,MI,Active,Unknown,2,(null),(null),29.43,28.874,29,49712,(null)
1,TX-000002087318,F,b_27to34,W,1,0,1-under 40k,25.06,TX,Inactive,Unknown,1,(null),(null),20.29,59.5919,201,77055,(null)
2,IL-000005343750,M,C_35TO49,B,2,0,1-under 40k,40.19,IL,Active,(null),3,(null),(null),86.91,27.6557,197,60475,Urban
3,MO-000000184817,F,C_35TO49,B,1,0,3-over 80k,4.73,MO,Active,(null),1,(null),(null),81.41,83.4038,183,63367,(null)
4,PA-9509497,M,A_18TO26,W,1,1,2-40 - 80k,17.93,PA,Active,Democrat,8,(null),(null),36.39,42.0622,3,15024,Rural


In [3]:
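# Show full column contents; None replaces the deprecated -1, which recent pandas versions reject
pd.set_option('display.max_colwidth', None) #https://stackoverflow.com/questions/25351968/how-to-display-full-non-truncated-dataframe-information-in-html-when-convertin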
# Here is the dictionary
data_model_dict

Unnamed: 0,id,unique ID for survey vendor
0,phone,Phone number for consumer
1,gender,Indicator of individual's gender
2,age_5way,age buckets
3,ethnicity_4way,"A = Asian, B = Black, H = Hispanic, W = White"
4,q1_healthcare,"""Health insurance is complicated, and tens of millions of people in this country do not have it for various reasons. Please think carefully for a minute: are you currently covered by a health insurance plan?"" (1=Yes, 2=No, 3=Unknown, 4=Refuse)"
5,married,indicator for whether someone is believed to be married or not
6,modeled_income_bucket,Modeled income in 3 categories
7,uninsured_score,"Modeled probability of being uninsured. The score was developed using a predicitive model that used the variable ""q1_healthcare"" as the outcome."
8,state_cd,state code
9,voter_status,an individual's voter status


Below, evaluate and interpret the model. You need a mix of both code and textual interpretation.
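Two common validation checks are calibration (do people with higher scores actually report being uninsured more often?) and discrimination (does the score rank uninsured respondents above insured ones?). The sketch below is one way to compute both with pandas alone. It is not the method Enroll America used, and the helper names `calibration_table` and `rank_auc` are hypothetical. Both expect an outcome already recoded to 1 = uninsured and 0 = insured.

```python
import pandas as pd

def calibration_table(score, outcome, bins=10):
    """Mean score and observed outcome rate within quantile bins of the score."""
    df = pd.DataFrame({"score": score, "outcome": outcome}).dropna()
    # duplicates="drop" merges bins when many respondents share the same score.
    df["bin"] = pd.qcut(df["score"], q=bins, duplicates="drop")
    return df.groupby("bin", observed=True).agg(
        n=("outcome", "size"),
        mean_score=("score", "mean"),
        observed_rate=("outcome", "mean"),
    )

def rank_auc(score, outcome):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity."""
    df = pd.DataFrame({"score": score, "outcome": outcome}).dropna()
    ranks = df["score"].rank(method="average")  # ties share their average rank
    is_pos = df["outcome"] == 1
    n_pos = int(is_pos.sum())
    n_neg = int((df["outcome"] == 0).sum())
    rank_sum_pos = ranks[is_pos].sum()
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

An AUC near 0.5 means the score ranks respondents no better than chance, and values closer to 1 mean better ranking. In the calibration table, a well-calibrated model shows `observed_rate` rising with `mean_score` (after putting both on the same scale).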

## Part 2: Experiment

The DataFrame `data_experiment`, loaded above, contains the results of a randomized controlled experiment conducted by a non-profit organization working on health insurance enrollment. A data dictionary is included in `data_experiment_dict`.

For the experiment, the control group was suppressed from receiving any contact from the organization for a month. The treatment group was included in the regular program of the organization, which consists of two forms of contact. Most were called by a field organizer or volunteer and, if contacted, encouraged by phone to enroll in health insurance. Additionally, if an individual was subscribed to the email list they also received several emails a week encouraging them to enroll in health insurance. 

The most important outcome is whether an individual said they currently had health insurance when surveyed by phone 2 weeks after the end of the experiment. The responses to the follow-up survey are in columns Q1 through Q15. Using the dataset, evaluate whether the outreach efforts had a causal effect on insurance status in the treatment group.

First, take a look at the dictionary and the data.

In [4]:
# Here is data_experiment
pd.set_option('display.max_columns', 43) #https://stackoverflow.com/questions/47022070/display-all-dataframe-columns-in-a-jupyter-python-notebook
data_experiment.head()

Unnamed: 0,ea_id,spanish_speaking,state,modeled_income_bucket,race4way,gender,age_bucket,uninsured_score,registered_voter,gen_vote_2012,gen_vote_2010,party_id,uninsured_reported,subscribed,voted_2012,voted_2010,treatment,medicaid_name,chase_attempts,chase_conversation_counts,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,Q10A,Q10B,Q10C,Q10D,Q10E,Q10F,Q11,Q12,Q13,Q14,Q15,caller,date
0,3,0,Arizona,1-under 40k,H,F,45-54,30.06,1.0,E,E,Democrat,0.0,1.0,1.0,1.0,0.0,Access,0.0,0.0,2.0,2.0,,2.0,,,,2.0,1.0,2.0,,,,,,,1.0,1.0,2.0,1963.0,2.0,9033.0,4/23/14
1,6,0,Arizona,1-under 40k,H,M,35-44,36.81,1.0,,,Democrat,1.0,,0.0,0.0,0.0,Access,0.0,0.0,2.0,2.0,,1.0,2.0,1.0,1.0,1.0,1.0,4.0,,,,,,,,1.0,2.0,1962.0,1.0,11552.0,4/17/14
2,12,0,Arizona,1-under 40k,H,M,35-44,35.99,1.0,E,E,Democrat,1.0,1.0,1.0,1.0,1.0,Access,2.0,2.0,1.0,,1.0,2.0,,,,1.0,2.0,2.0,,,,,,,1.0,1.0,2.0,1971.0,1.0,11208.0,4/16/14
3,17,0,Arizona,1-under 40k,W,M,26-34,26.85,1.0,E,Y,Unaffiliated,1.0,,1.0,1.0,1.0,Access,1.0,1.0,2.0,,,,,,,,,,,,,,,,,,,,1.0,11471.0,4/17/14
4,19,0,Arizona,1-under 40k,W,F,45-54,22.98,1.0,E,Y,Democrat,1.0,1.0,1.0,1.0,1.0,Access,3.0,0.0,2.0,2.0,,1.0,2.0,2.0,2.0,1.0,2.0,2.0,,,,,,,,1.0,2.0,1964.0,2.0,7562.0,4/17/14


In [5]:
# Here is the dictionary
data_experiment_dict

Unnamed: 0,phone,Phone number for consumer
0,ea_id,unique ID for survey vendor
1,spanish_speaking,Tagged as Spanish speaking during a previous conversation with the organization
2,state,State of residence
3,modeled_income_bucket,Modeled income in 3 categories
4,race4way,"A = Asian, B = Black, H = Hispanic, W = White"
5,gender,Indicator of individual's gender
6,age_bucket,Modeled age in 6 categories
7,uninsured_score,Modeled probability of being uninsured
8,registered_voter,Binary indicator for registered voter as of the 2012 general election
9,gen_vote_2012,"Whether someone voted in the 2012 general election. A = absentee, E = early, M = mail, P = polls, Q = questionable/provisional, Y = voted"


Below, evaluate and interpret the experiment. You need a mix of both code and textual interpretation.
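Because assignment was randomized, a natural first estimate is the difference in insured rates between everyone assigned to treatment and everyone assigned to control (an intent-to-treat comparison). The sketch below computes that difference with a two-proportion z-test. It rests on two assumptions that should be checked against the full data dictionary: that `Q1` is the insurance question with the same coding as in Part 1 (1=Yes, 2=No), and that `treatment` is coded 1 for treatment and 0 for control. The helper name `itt_effect` is hypothetical.

```python
import math
import pandas as pd

def itt_effect(df, treatment_col="treatment", outcome_col="Q1"):
    """Difference in insured rates (treatment minus control) with a z-test."""
    q1 = df[outcome_col]
    # Assumed coding: 1 = insured, 2 = not insured; anything else is dropped.
    insured = (q1 == 1).astype(float).where(q1.isin([1, 2]))
    data = pd.DataFrame({"treated": df[treatment_col], "insured": insured}).dropna()

    treated = data.loc[data["treated"] == 1, "insured"]
    control = data.loc[data["treated"] == 0, "insured"]
    n_t, n_c = len(treated), len(control)
    p_t, p_c = treated.mean(), control.mean()
    diff = p_t - p_c

    # Unpooled standard error for the 95% confidence interval.
    se = math.sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
    # Pooled standard error for the test of "no difference".
    p_pool = (treated.sum() + control.sum()) / (n_t + n_c)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_t + 1 / n_c))
    z = diff / se_pool if se_pool > 0 else float("nan")
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value

    return {
        "n_treated": n_t, "n_control": n_c,
        "rate_treated": p_t, "rate_control": p_c,
        "diff": diff,
        "ci_low": diff - 1.96 * se, "ci_high": diff + 1.96 * se,
        "z": z, "p_value": p_value,
    }
```

Calling `itt_effect(data_experiment)` only uses people who answered the follow-up survey, so it is also worth checking whether response rates and baseline covariates look similar in the two groups before reading the difference as causal.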

## Part 3

Based on your analysis from Part 2, briefly propose a follow-up randomized controlled experiment. The proposed test should be designed to expand upon the knowledge derived from the first test.
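When proposing a follow-up experiment, it helps to estimate how many people each group would need. The sketch below uses the standard normal-approximation formula for comparing two proportions. The helper name `n_per_arm` is hypothetical, and the rates passed to it are placeholders to be replaced with values from your Part 2 analysis.

```python
import math
from statistics import NormalDist

def n_per_arm(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate sample size per group for a two-sided two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    p_bar = (p_control + p_treatment) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(
            p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
        )
    ) ** 2
    return math.ceil(numerator / (p_control - p_treatment) ** 2)
```

For instance, `n_per_arm(0.50, 0.55)` gives the number of completed surveys needed per group to detect a 5 percentage point difference under these placeholder rates. Smaller expected effects require much larger samples, which is useful to state in the proposal.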

**Enter your proposal here.**

# Congratulations!

You are done with the lab. Before you finish and submit, please fill out this brief evaluation:

- I spent around XXXX hours on this lab.
- This lab was (too easy, too hard, just about the right difficulty).

**To turn in your lab, you will need to submit a PDF through Canvas. You can download a notebook by opening it, turning Edit mode on, then navigating to File -> Download as -> PDF.**