## Data Explanation

Targets:
- h1n1_vaccine - Whether respondent received H1N1 flu vaccine.
- seasonal_vaccine - Whether respondent received seasonal flu vaccine.

Predictors:
- h1n1_concern - Level of concern about the H1N1 flu.
        0 = Not at all concerned; 1 = Not very concerned; 2 = Somewhat concerned; 3 = Very concerned.
- h1n1_knowledge - Level of knowledge about H1N1 flu.
        0 = No knowledge; 1 = A little knowledge; 2 = A lot of knowledge.
- behavioral_antiviral_meds - Has taken antiviral medications. (binary)
- behavioral_avoidance - Has avoided close contact with others with flu-like symptoms. (binary)
- behavioral_face_mask - Has bought a face mask. (binary)
- behavioral_wash_hands - Has frequently washed hands or used hand sanitizer. (binary)
- behavioral_large_gatherings - Has reduced time at large gatherings. (binary)
- behavioral_outside_home - Has reduced contact with people outside of own household. (binary)
- behavioral_touch_face - Has avoided touching eyes, nose, or mouth. (binary)
- doctor_recc_h1n1 - H1N1 flu vaccine was recommended by doctor. (binary)
- doctor_recc_seasonal - Seasonal flu vaccine was recommended by doctor. (binary)
- chronic_med_condition - Has any of the following chronic medical conditions: asthma or another lung condition, diabetes, a heart condition, a kidney condition, sickle cell anemia or other anemia, a neurological or neuromuscular condition, a liver condition, or a weakened immune system caused by a chronic illness or by medicines taken for a chronic illness. (binary)
- child_under_6_months - Has regular close contact with a child under the age of six months. (binary)
- health_worker - Is a healthcare worker. (binary)
- health_insurance - Has health insurance. (binary)
- opinion_h1n1_vacc_effective - Respondent's opinion about H1N1 vaccine effectiveness.
        1 = Not at all effective; 2 = Not very effective; 3 = Don't know; 4 = Somewhat effective; 5 = Very effective.
- opinion_h1n1_risk - Respondent's opinion about risk of getting sick with H1N1 flu without vaccine.
        1 = Very low; 2 = Somewhat low; 3 = Don't know; 4 = Somewhat high; 5 = Very high.
- opinion_h1n1_sick_from_vacc - Respondent's worry of getting sick from taking H1N1 vaccine.
        1 = Not at all worried; 2 = Not very worried; 3 = Don't know; 4 = Somewhat worried; 5 = Very worried.
- opinion_seas_vacc_effective - Respondent's opinion about seasonal flu vaccine effectiveness.
        1 = Not at all effective; 2 = Not very effective; 3 = Don't know; 4 = Somewhat effective; 5 = Very effective.
- opinion_seas_risk - Respondent's opinion about risk of getting sick with seasonal flu without vaccine.
        1 = Very low; 2 = Somewhat low; 3 = Don't know; 4 = Somewhat high; 5 = Very high.
- opinion_seas_sick_from_vacc - Respondent's worry of getting sick from taking seasonal flu vaccine.
        1 = Not at all worried; 2 = Not very worried; 3 = Don't know; 4 = Somewhat worried; 5 = Very worried.
- age_group - Age group of respondent. 
- education - Self-reported education level.
- race - Race of respondent.
- sex - Sex of respondent.
- income_poverty - Household annual income of respondent with respect to 2008 Census poverty thresholds.
- marital_status - Marital status of respondent.
- rent_or_own - Housing situation of respondent.
- employment_status - Employment status of respondent.
- hhs_geo_region - Respondent's residence using a 10-region geographic classification defined by the U.S. Dept. of Health and Human Services. Values are represented as short random character strings.
- census_msa - Respondent's residence within metropolitan statistical areas (MSA) as defined by the U.S. Census.
- household_adults - Number of other adults in household, top-coded to 3.
- household_children - Number of children in household, top-coded to 3.
- employment_industry - Type of industry respondent is employed in. Values are represented as short random character strings.
- employment_occupation - Type of occupation of respondent. Values are represented as short random character strings.


## Import Resources

In [1]:
import os

import numpy as np
import pandas as pd
import seaborn as sns
import xlrd

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import (
    BaggingClassifier,
    ExtraTreesClassifier,
    RandomForestClassifier,
    StackingRegressor,
    VotingClassifier,
)
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, r2_score
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

In [2]:
df_features = pd.read_csv('Data/training_set_features.csv')
df_targets = pd.read_csv('Data/training_set_labels.csv')

In [3]:
df = df_features.merge(df_targets)
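
Called without `on`, `merge` joins on every column name the two frames share. The sketch below spells the key out and asks pandas to verify a one-to-one match. It assumes `respondent_id` is the only shared column, and the toy frames and their values are made up for illustration; they are not the survey data.

```python
import pandas as pd

# Toy stand-ins for the features and labels files (values are invented).
features = pd.DataFrame({"respondent_id": [0, 1, 2], "h1n1_concern": [1.0, 3.0, 2.0]})
labels = pd.DataFrame({"respondent_id": [0, 1, 2], "h1n1_vaccine": [0, 1, 0]})

# Name the key explicitly; validate raises MergeError if an id repeats in either frame.
merged = features.merge(labels, on="respondent_id", how="inner", validate="one_to_one")

# An inner join silently drops unmatched rows, so check that none were lost.
assert len(merged) == len(features)
print(merged)
```

Stating the key makes the join robust if a second shared column is ever added to either file.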

In [None]:
df.columns

In [None]:
df.head()

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
for c in df.columns:
    print("---- %s ---" % c)
    print(df[c].value_counts())

In [None]:
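# Restrict to numeric columns; the string-valued categoricals cannot be correlated
df.select_dtypes(include='number').corr()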

## Preprocessing

In [29]:
df.isna().sum()

h1n1_concern                      92
h1n1_knowledge                   116
behavioral_antiviral_meds         71
behavioral_avoidance             208
behavioral_face_mask              19
behavioral_wash_hands             42
behavioral_large_gatherings       87
behavioral_outside_home           82
behavioral_touch_face            128
doctor_recc_h1n1                2160
chronic_med_condition            971
child_under_6_months             820
health_worker                    804
health_insurance               12274
opinion_h1n1_vacc_effective      391
opinion_h1n1_risk                388
opinion_h1n1_sick_from_vacc      395
age_group                          0
education                       1407
race                               0
sex                                0
income_poverty                  4423
marital_status                  1408
rent_or_own                     2042
employment_status               1463
hhs_geo_region                     0
census_msa                         0
...

In [4]:
# Columns to drop: the respondent ID, the seasonal-vaccine fields (this model
# targets h1n1_vaccine only), and the two employment code columns
drop = ['respondent_id', 'doctor_recc_seasonal', 'opinion_seas_vacc_effective', 'opinion_seas_risk',
        'opinion_seas_sick_from_vacc', 'seasonal_vaccine', 'employment_industry', 'employment_occupation']

# Columns with missing values, to be imputed with their most frequent value
most_frequent = ['h1n1_concern', 'h1n1_knowledge', 'behavioral_antiviral_meds', 'behavioral_avoidance',
                 'behavioral_face_mask', 'behavioral_wash_hands', 'behavioral_large_gatherings',
                 'behavioral_outside_home', 'behavioral_touch_face', 'doctor_recc_h1n1', 'chronic_med_condition',
                 'child_under_6_months', 'health_worker', 'health_insurance', 'opinion_h1n1_vacc_effective',
                 'opinion_h1n1_risk', 'opinion_h1n1_sick_from_vacc', 'income_poverty', 'marital_status',
                 'rent_or_own', 'employment_status', 'household_adults', 'household_children', 'education']

# Categorical columns to one-hot encode
cat_cols = ['age_group', 'education', 'race', 'sex', 'income_poverty', 'marital_status', 'rent_or_own',
            'employment_status', 'hhs_geo_region', 'census_msa']

# Numeric (binary, ordinal, and count) columns to min-max scale
num_cols = ['h1n1_concern', 'h1n1_knowledge', 'behavioral_antiviral_meds', 'behavioral_avoidance',
            'behavioral_face_mask', 'behavioral_wash_hands', 'behavioral_large_gatherings',
            'behavioral_outside_home', 'behavioral_touch_face', 'doctor_recc_h1n1', 'chronic_med_condition',
            'child_under_6_months', 'health_worker', 'health_insurance', 'opinion_h1n1_vacc_effective',
            'opinion_h1n1_risk', 'opinion_h1n1_sick_from_vacc',
            'household_adults', 'household_children']

In [5]:
df = df.drop(drop, axis=1)

In [6]:
X = df.drop('h1n1_vaccine', axis=1)
y = df['h1n1_vaccine']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [7]:
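# Learn the fill values from the training rows only, so nothing from the
# test split leaks into preprocessing
fill_values = X_train.mode().iloc[0]
X_train = X_train.fillna(fill_values)
X_test = X_test.fillna(fill_values)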
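
Most-frequent imputation can also be written with `SimpleImputer`, which is already imported above. The sketch below is one possible arrangement, not the notebook's method: the imputer is fitted on the training rows and then reused unchanged on the test rows. The two toy frames are invented and use numeric columns only; with mixed numeric and string columns the returned array would have object dtype.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Invented toy rows that reuse two column names from the data dictionary.
train = pd.DataFrame({
    "health_insurance": [1.0, 1.0, np.nan, 0.0],
    "h1n1_concern": [2.0, 2.0, 3.0, np.nan],
})
test = pd.DataFrame({
    "health_insurance": [np.nan, 0.0],
    "h1n1_concern": [np.nan, 1.0],
})

imputer = SimpleImputer(strategy="most_frequent")

# fit_transform learns each column's mode from the training rows only.
train_filled = pd.DataFrame(imputer.fit_transform(train), columns=train.columns, index=train.index)

# transform reuses those training modes; the test rows never influence them.
test_filled = pd.DataFrame(imputer.transform(test), columns=test.columns, index=test.index)

print(test_filled)
```

Because the imputer is an estimator, it could also be placed inside the `ColumnTransformer` or a `Pipeline`, so that cross-validation refits it on each training fold.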

In [8]:
ct = ColumnTransformer(transformers=[
    ('ohe', OneHotEncoder(drop='first'), cat_cols),
    ('scaler', MinMaxScaler(), num_cols)
], remainder='passthrough')

In [9]:
ct.fit(X_train)
X_train_trans = ct.transform(X_train)
X_test_trans = ct.transform(X_test)

In [10]:
# get_feature_names was removed in scikit-learn 1.2; get_feature_names_out (1.0+) replaces it
ohe_col_names = ct.named_transformers_['ohe'].get_feature_names_out(cat_cols)

In [11]:
X_train_trans = pd.DataFrame(X_train_trans,
             columns = [*ohe_col_names, *num_cols], # Using * to unpack lists
             index = X_train.index)

X_test_trans = pd.DataFrame(X_test_trans,
             columns = [*ohe_col_names, *num_cols], # Using * to unpack lists
             index = X_test.index)

In [30]:
X_train_trans.describe()

,age_group_35 - 44 Years,age_group_45 - 54 Years,age_group_55 - 64 Years,age_group_65+ Years,education_< 12 Years,education_College Graduate,education_Some College,race_Hispanic,race_Other or Multiple,race_White,...,doctor_recc_h1n1,chronic_med_condition,child_under_6_months,health_worker,health_insurance,opinion_h1n1_vacc_effective,opinion_h1n1_risk,opinion_h1n1_sick_from_vacc,household_adults,household_children
count,21365.0,21365.0,21365.0,21365.0,21365.0,21365.0,21365.0,21365.0,21365.0,21365.0,...,21365.0,21365.0,21365.0,21365.0,21365.0,21365.0,21365.0,21365.0,21365.0,21365.0
mean,0.142944,0.195834,0.208799,0.255277,0.08776,0.430049,0.2652,0.066136,0.061877,0.794477,...,0.203651,0.274561,0.079616,0.110414,0.935081,0.71279,0.335174,0.338942,0.296653,0.17694
std,0.350024,0.396851,0.40646,0.436027,0.282953,0.495094,0.44145,0.248526,0.240937,0.404093,...,0.402722,0.446303,0.270704,0.313413,0.246389,0.250402,0.319782,0.338335,0.250323,0.308851
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.5,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.75,0.25,0.25,0.333333,0.0
75%,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,...,0.0,1.0,0.0,0.0,1.0,1.0,0.75,0.75,0.333333,0.333333
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [31]:
X_train_trans.corr()

,age_group_35 - 44 Years,age_group_45 - 54 Years,age_group_55 - 64 Years,age_group_65+ Years,education_< 12 Years,education_College Graduate,education_Some College,race_Hispanic,race_Other or Multiple,race_White,...,doctor_recc_h1n1,chronic_med_condition,child_under_6_months,health_worker,health_insurance,opinion_h1n1_vacc_effective,opinion_h1n1_risk,opinion_h1n1_sick_from_vacc,household_adults,household_children
age_group_35 - 44 Years,1.0,-0.201535,-0.209797,-0.239104,-0.034038,0.088764,-0.019363,0.052205,0.022772,-0.051074,...,0.004666,-0.085249,0.010301,0.039167,-0.016139,-0.017282,0.036853,0.030482,0.037584,0.369461
age_group_45 - 54 Years,-0.201535,1.0,-0.253509,-0.288922,-0.040513,0.023746,0.011864,-0.007932,0.001521,-2.7e-05,...,-0.012909,-0.02478,-0.008764,0.031246,-0.010712,0.008214,0.007333,0.016948,0.065558,0.037049
age_group_55 - 64 Years,-0.209797,-0.253509,1.0,-0.300767,-0.033576,0.031297,-0.002102,-0.05701,-0.025348,0.051251,...,0.019592,0.050363,-0.009005,0.029557,0.013837,0.016439,-0.004398,-0.009197,-0.038508,-0.218125
age_group_65+ Years,-0.239104,-0.288922,-0.300767,1.0,0.087395,-0.097895,-0.02247,-0.092742,-0.047441,0.107571,...,-0.006054,0.161287,-0.077419,-0.10625,0.12551,0.004585,-0.056075,-0.042149,-0.217832,-0.304492
education_< 12 Years,-0.034038,-0.040513,-0.033576,0.087395,1.0,-0.269423,-0.186336,0.159082,0.015778,-0.15378,...,-0.013903,0.061603,0.041383,-0.065992,-0.088139,-0.028725,0.053696,0.074678,0.013509,0.007804
education_College Graduate,0.088764,0.023746,0.031297,-0.097895,-0.269423,1.0,-0.521846,-0.078236,0.008819,0.089221,...,0.007948,-0.103941,-0.043486,0.060788,0.124123,0.061313,-0.007563,-0.068728,-0.009059,0.051102
education_Some College,-0.019363,0.011864,-0.002102,-0.02247,-0.186336,-0.521846,1.0,-0.022496,0.001499,0.010888,...,0.021093,0.032628,0.008968,0.045805,-0.015995,-0.024736,-0.015782,0.003699,0.009528,-0.006481
race_Hispanic,0.052205,-0.007932,-0.05701,-0.092742,0.159082,-0.078236,-0.022496,1.0,-0.068346,-0.523225,...,0.015546,-0.036273,0.053226,-0.013831,-0.120982,-0.015736,0.08166,0.091475,0.094171,0.154679
race_Other or Multiple,0.022772,0.001521,-0.025348,-0.047441,0.015778,0.008819,0.001499,-0.068346,1.0,-0.504946,...,0.002785,-0.009999,-0.004487,-0.005559,-0.019063,-0.025066,0.015279,0.038999,0.039962,0.02144
race_White,-0.051074,-2.7e-05,0.051251,0.107571,-0.15378,0.089221,0.010888,-0.523225,-0.504946,1.0,...,-0.016041,0.013651,-0.039968,0.009916,0.108571,0.043072,-0.058409,-0.116731,-0.059259,-0.122661


## Decision Tree Gridsearch

In [46]:
treepipe = Pipeline([
    ('tree', DecisionTreeClassifier())
])

# Grid keys are prefixed with the pipeline step name: `<step>__<parameter>`
parameters = {
    'tree__criterion': ['gini', 'entropy'],
    'tree__max_depth': list(range(1, 10)),
    'tree__min_samples_leaf': list(range(1, 10)),
    'tree__random_state': [42]
}

cv_tree = GridSearchCV(treepipe, param_grid=parameters)

cv_tree.fit(X_train_trans, y_train)
y_pred_tree = cv_tree.predict(X_test_trans)
print(y_pred_tree)

[0 0 0 ... 1 0 0]


In [42]:
cv_tree.best_params_

{'tree__criterion': 'gini', 'tree__max_depth': 4, 'tree__min_samples_leaf': 1}

In [20]:
cv_tree.best_score_

0.829253451907325
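
`best_score_` is the mean cross-validated accuracy on the training folds. The test-set predictions are printed but never scored. The sketch below shows one way to score held-out predictions and compare them with a majority-class baseline. It runs on synthetic data from `make_classification` with an assumed 80/20 class split, so its numbers say nothing about the survey results.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with roughly 80% negatives (an assumption, not the survey).
X_toy, y_toy = make_classification(n_samples=500, n_features=8, weights=[0.8], random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

pipe = Pipeline([("tree", DecisionTreeClassifier(random_state=42))])
grid = GridSearchCV(pipe, param_grid={"tree__max_depth": [2, 4, 6]}, cv=5, scoring="accuracy")
grid.fit(Xtr, ytr)

# Accuracy of the refitted best estimator on rows it never saw during the search.
test_acc = accuracy_score(yte, grid.predict(Xte))

# Accuracy obtained by always predicting the majority class of the training labels.
majority = int(ytr.mean() >= 0.5)
baseline_acc = accuracy_score(yte, [majority] * len(yte))

print(f"cv accuracy={grid.best_score_:.3f}  test accuracy={test_acc:.3f}  baseline={baseline_acc:.3f}")
```

When one class dominates, the baseline figure shows how much of the reported accuracy the model actually earns.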

## Logistic Gridsearch

In [52]:
logpipe = Pipeline([
    ('logistic', LogisticRegression())
])

# Grid keys are prefixed with the pipeline step name: `<step>__<parameter>`
parameters = {
    'logistic__C': [1, 10, 100, 1000, 10000, 100000],
    'logistic__class_weight': [None, 'balanced'],
    'logistic__random_state': [42]
}

cv_log = GridSearchCV(logpipe, param_grid=parameters)

cv_log.fit(X_train_trans, y_train)
y_pred_log = cv_log.predict(X_test_trans)
print(y_pred_log)

[0 0 0 ... 1 0 0]


In [53]:
cv_log.best_params_

{'logistic__C': 1,
 'logistic__class_weight': None,
 'logistic__random_state': 42}

In [54]:
cv_log.best_score_

0.8336999765972386
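
The selected `C` of 1 is the smallest value in the grid, so the search never tried stronger regularization (`C` is the inverse of regularization strength). A log-spaced grid that extends below 1 is one way to check whether the optimum lies further down. This sketch uses synthetic data, and `max_iter=1000` is an assumed setting to give the solver room to converge at large `C`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in data; not the survey features.
X_toy, y_toy = make_classification(n_samples=400, n_features=10, random_state=42)

log_pipe = Pipeline([("logistic", LogisticRegression(max_iter=1000, random_state=42))])

# Seven log-spaced values from 0.001 to 1000, covering both sides of C = 1.
c_grid = {"logistic__C": np.logspace(-3, 3, 7)}

search = GridSearchCV(log_pipe, param_grid=c_grid, cv=5)
search.fit(X_toy, y_toy)
print(search.best_params_, round(search.best_score_, 3))
```

If the best value again lands on an edge of the grid, the range can be widened in that direction and the search rerun.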

## KNN Gridsearch

In [58]:
knnpipe = Pipeline([
    ('KNN', KNeighborsClassifier())
])

# Grid keys are prefixed with the pipeline step name: `<step>__<parameter>`
parameters = {
    'KNN__n_neighbors': [5, 7, 9, 11]
}

knn_log = GridSearchCV(knnpipe, param_grid=parameters)

knn_log.fit(X_train_trans, y_train)
y_pred_knn = knn_log.predict(X_test_trans)
print(y_pred_knn)

[0 0 0 ... 0 0 0]


In [59]:
knn_log.best_params_

{'KNN__n_neighbors': 11}

In [60]:
knn_log.best_score_

0.8043529136438099
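
The three searches report their scores in separate cells. A small loop can collect the cross-validated and held-out accuracy of each model into one table. This is a sketch on synthetic data with shortened, hypothetical grids; the `candidates` dictionary and `summary` frame are illustrative names, not part of the notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the survey features.
X_toy, y_toy = make_classification(n_samples=400, n_features=10, random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Each entry pairs an estimator with a (shortened) parameter grid.
candidates = {
    "tree": (DecisionTreeClassifier(random_state=42), {"max_depth": [2, 4, 6]}),
    "logistic": (LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}),
    "knn": (KNeighborsClassifier(), {"n_neighbors": [5, 11, 21]}),
}

rows = []
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, param_grid=grid, cv=5).fit(Xtr, ytr)
    rows.append({
        "model": name,
        "cv_accuracy": search.best_score_,                          # mean over training folds
        "test_accuracy": accuracy_score(yte, search.predict(Xte)),  # held-out rows
    })

summary = pd.DataFrame(rows).set_index("model").sort_values("cv_accuracy", ascending=False)
print(summary.round(3))
```

Listing both columns side by side makes it easy to spot a model whose cross-validated score does not carry over to the held-out rows.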