# Predicting health insurance coverage in the US

This project draws on data from the American Community Survey (ACS) Public Use Microdata Sample (PUMS) API. The most recent survey publicly available through this API is from 2019. (Data from 2020 are available to download as CSV files.)

This project uses the Census Bureau Data API but is not endorsed or certified by the Census Bureau.

1. American Community Survey (ACS) (census.gov).
https://www.census.gov/programs-surveys/acs/
2. American Community Survey Data via API (census.gov).
https://www.census.gov/programs-surveys/acs/data/data-via-api.html
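
As a rough illustration of how the API described above might be queried, the sketch below only builds a request URL and does not send it. The endpoint path, the variable names `HICOV` and `AGEP`, and the helper `build_pums_url` are assumptions made for illustration and are not taken from this project; check them against the Census API documentation before use.

```python
from urllib.parse import urlencode

# Assumed endpoint for the 2019 ACS 1-year PUMS (verify against the Census API docs).
BASE_URL = "https://api.census.gov/data/2019/acs/acs1/pums"


def build_pums_url(variables, state_fips, api_key=None):
    """Hypothetical helper: build a query URL for some PUMS variables in one state."""
    params = {"get": ",".join(variables), "for": f"state:{state_fips}"}
    if api_key is not None:
        params["key"] = api_key
    # Keep commas and colons unescaped so the URL stays readable.
    return f"{BASE_URL}?{urlencode(params, safe=',:')}"


# Example: two assumed variable names for the state with FIPS code 06.
example_url = build_pums_url(["HICOV", "AGEP"], "06")
print(example_url)
```

The resulting URL could then be fetched with any HTTP client; the notebook itself works from CSV files saved earlier.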

The goal of this project is to predict whether an individual has health insurance (or not) based on demographic data in the PUMS. 

In [1]:
# import libraries
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
nsamples = 500
datadir  = os.path.join(os.getcwd(),"data")
new_df_fname = os.path.join(datadir,'allstates_subsample' + str(nsamples) + '_select_columns.csv')
df = pd.read_csv(new_df_fname)
datatype_fname = os.path.join(datadir,'allstates_subsample' + str(nsamples) + '_select_columns_datatype.csv')
datatype = pd.read_csv(datatype_fname)
datatype.set_index('var_name',inplace=True)

In [3]:
print('df n records: ' + str(df.shape[0]))
print('df n columns: ' + str(df.shape[1]))
print('data type rows ' + str(datatype.shape[0]))

n records: 25500
n columns: 16


In [4]:
df.drop('Unnamed: 0',axis=1,inplace=True)
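
The `Unnamed: 0` column typically appears when a CSV was written with its index and then read back without one. A minimal sketch of an alternative, assuming the first CSV column really is the saved index, is to pass `index_col=0` to `pd.read_csv`. The toy CSV text below is made up for illustration.

```python
import io

import pandas as pd

# Made-up CSV text that mimics a file saved with its index column.
csv_text = ",FDISP,not_insured\n0,1,0\n1,0,1\n"

# Default read: the saved index comes back as an 'Unnamed: 0' column.
with_extra = pd.read_csv(io.StringIO(csv_text))

# Treat the first column as the index instead, so nothing needs to be dropped.
without_extra = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(without_extra.columns))
```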

In [5]:
print(df.dtypes)
print(df.shape)
print(datatype.shape)

FPOBP          int64
FSCHP          int64
FMARHWP        int64
FDISP          int64
FDEYEP         int64
FCITP          int64
not_insured    int64
FPOWSP         int64
FLANXP         int64
FDDRSP         int64
FDEARP         int64
FMIGP          int64
FDREMP         int64
FDPHYP         int64
FDOUTP         int64
dtype: object
(25500, 15)
(15, 4)


In [6]:
datatype

| var_name | Unnamed: 0 | range_key | n_unq_vals | info_from_api |
|---|---|---|---|---|
| FPOBP | 26 | False | 2 | True |
| FSCHP | 97 | False | 2 | True |
| FMARHWP | 148 | False | 2 | True |
| FDISP | 162 | False | 2 | True |
| FDEYEP | 202 | False | 2 | True |
| FCITP | 203 | False | 2 | True |
| not_insured | 247 | False | 2 | True |
| FPOWSP | 284 | False | 2 | True |
| FLANXP | 320 | False | 2 | True |
| FDDRSP | 328 | False | 2 | True |


In [7]:
# Convert every variable that is not a numeric range (range_key is False) to a categorical dtype
print('converting data types...')
for var_name in datatype.index:
    if not datatype.loc[var_name, 'range_key']:
        df[var_name] = df[var_name].astype('category')

converting data types...


In [8]:
df.dtypes.unique()

array([CategoricalDtype(categories=[0, 1], ordered=False)], dtype=object)
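
The conversion loop above can also be written as a single `astype` call with a column-to-dtype mapping. The sketch below uses made-up stand-ins (`toy_df`, `toy_datatype`, and a hypothetical range variable `AGEP`) rather than the project data, so it only illustrates the pattern.

```python
import pandas as pd

# Toy stand-ins for df and datatype; the values are made up for illustration.
toy_df = pd.DataFrame({"FDISP": [0, 1, 0, 1], "AGEP": [34, 51, 27, 68]})
toy_datatype = pd.DataFrame(
    {"var_name": ["FDISP", "AGEP"], "range_key": [False, True]}
).set_index("var_name")

# Names of the variables that are not numeric ranges.
categorical_cols = toy_datatype[~toy_datatype["range_key"]].index

# Convert all of them at once; range variables keep their numeric dtype.
toy_df = toy_df.astype({col: "category" for col in categorical_cols})
print(toy_df.dtypes)
```

In this notebook every `range_key` is `False`, which is why all columns end up categorical.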

One-hot encode the categorical predictors (every column in df except the target, not_insured).

In [9]:
target_y = df['not_insured']
X = df.drop('not_insured',axis=1)
dummydf = pd.get_dummies(X)
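
Because each predictor here has two categories, `pd.get_dummies` produces two indicator columns per variable (14 predictors become 28 columns). The sketch below, on made-up toy data, shows that behavior and the optional `drop_first=True` argument, which keeps one indicator per binary variable. Whether dropping a level is appropriate depends on the model and is not something this notebook decides.

```python
import pandas as pd

# Made-up binary flags, converted to categorical as in the notebook.
toy_X = pd.DataFrame({"FDISP": [0, 1, 0, 1], "FCITP": [1, 1, 0, 0]}).astype("category")

# Default: one indicator column per category level.
full = pd.get_dummies(toy_X)

# drop_first=True omits the first level of each variable.
reduced = pd.get_dummies(toy_X, drop_first=True)

print(list(full.columns))     # ['FDISP_0', 'FDISP_1', 'FCITP_0', 'FCITP_1']
print(list(reduced.columns))  # ['FDISP_1', 'FCITP_1']
```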

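Split the data into training (75%) and test (25%) sets, stratified on the target so both sets keep the same share of uninsured individuals.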

In [10]:
X_train,X_test,y_train,y_test = train_test_split(dummydf,target_y,test_size=0.25,random_state=123,stratify=target_y)

In [12]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(19125, 28)
(6375, 28)
(19125,)
(6375,)
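
One way to confirm that `stratify` did its job is to compare the share of the positive class in the two splits. The sketch below does this on synthetic data with an assumed 10% positive rate; the numbers are made up and say nothing about the real uninsured rate.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data: one random binary feature and a target with 10% positives.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({"feature": rng.integers(0, 2, size=1000)})
toy_y = pd.Series([1] * 100 + [0] * 900, name="not_insured")

# Same split settings as the notebook cell above.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=123, stratify=toy_y
)

# With stratification the positive share should be nearly identical in both sets.
print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```

Running the same comparison on `y_train.mean()` and `y_test.mean()` would check the split of the real data.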
