## Data Explanation

Targets:
- h1n1_vaccine - Whether respondent received H1N1 flu vaccine.
- seasonal_vaccine - Whether respondent received seasonal flu vaccine.

Predictors:
- h1n1_concern - Level of concern about the H1N1 flu.
        0 = Not at all concerned; 1 = Not very concerned; 2 = Somewhat concerned; 3 = Very concerned.
- h1n1_knowledge - Level of knowledge about H1N1 flu.
        0 = No knowledge; 1 = A little knowledge; 2 = A lot of knowledge.
- behavioral_antiviral_meds - Has taken antiviral medications. (binary)
- behavioral_avoidance - Has avoided close contact with others with flu-like symptoms. (binary)
- behavioral_face_mask - Has bought a face mask. (binary)
- behavioral_wash_hands - Has frequently washed hands or used hand sanitizer. (binary)
- behavioral_large_gatherings - Has reduced time at large gatherings. (binary)
- behavioral_outside_home - Has reduced contact with people outside of own household. (binary)
- behavioral_touch_face - Has avoided touching eyes, nose, or mouth. (binary)
- doctor_recc_h1n1 - H1N1 flu vaccine was recommended by doctor. (binary)
- doctor_recc_seasonal - Seasonal flu vaccine was recommended by doctor. (binary)
- chronic_med_condition - Has any of the following chronic medical conditions: asthma or another lung condition, diabetes, a heart condition, a kidney condition, sickle cell anemia or other anemia, a neurological or neuromuscular condition, a liver condition, or a weakened immune system caused by a chronic illness or by medicines taken for a chronic illness. (binary)
- child_under_6_months - Has regular close contact with a child under the age of six months. (binary)
- health_worker - Is a healthcare worker. (binary)
- health_insurance - Has health insurance. (binary)
- opinion_h1n1_vacc_effective - Respondent's opinion about H1N1 vaccine effectiveness.
        1 = Not at all effective; 2 = Not very effective; 3 = Don't know; 4 = Somewhat effective; 5 = Very effective.
- opinion_h1n1_risk - Respondent's opinion about risk of getting sick with H1N1 flu without vaccine.
        1 = Very low; 2 = Somewhat low; 3 = Don't know; 4 = Somewhat high; 5 = Very high.
- opinion_h1n1_sick_from_vacc - Respondent's worry of getting sick from taking H1N1 vaccine.
        1 = Not at all worried; 2 = Not very worried; 3 = Don't know; 4 = Somewhat worried; 5 = Very worried.
- opinion_seas_vacc_effective - Respondent's opinion about seasonal flu vaccine effectiveness.
        1 = Not at all effective; 2 = Not very effective; 3 = Don't know; 4 = Somewhat effective; 5 = Very effective.
- opinion_seas_risk - Respondent's opinion about risk of getting sick with seasonal flu without vaccine.
        1 = Very low; 2 = Somewhat low; 3 = Don't know; 4 = Somewhat high; 5 = Very high.
- opinion_seas_sick_from_vacc - Respondent's worry of getting sick from taking seasonal flu vaccine.
        1 = Not at all worried; 2 = Not very worried; 3 = Don't know; 4 = Somewhat worried; 5 = Very worried.
- age_group - Age group of respondent. 
- education - Self-reported education level.
- race - Race of respondent.
- sex - Sex of respondent.
- income_poverty - Household annual income of respondent with respect to 2008 Census poverty thresholds.
- marital_status - Marital status of respondent.
- rent_or_own - Housing situation of respondent.
- employment_status - Employment status of respondent.
- hhs_geo_region - Respondent's residence using a 10-region geographic classification defined by the U.S. Dept. of Health and Human Services. Values are represented as short random character strings.
- census_msa - Respondent's residence within metropolitan statistical areas (MSA) as defined by the U.S. Census.
- household_adults - Number of other adults in household, top-coded to 3.
- household_children - Number of children in household, top-coded to 3.
- employment_industry - Type of industry respondent is employed in. Values are represented as short random character strings.
- employment_occupation - Type of occupation of respondent. Values are represented as short random character strings.
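
The ordinal predictors above are stored as integer codes. As a minimal sketch of how those codes could be turned back into readable labels for inspection, the snippet below maps `h1n1_concern` codes using the scale listed above. The `toy` frame and the `concern_labels` name are hypothetical and exist only for illustration.

```python
import pandas as pd

# Code-to-label scale for h1n1_concern, copied from the list above.
concern_labels = {
    0: "Not at all concerned",
    1: "Not very concerned",
    2: "Somewhat concerned",
    3: "Very concerned",
}

# Hypothetical toy values standing in for the real survey column.
toy = pd.DataFrame({"h1n1_concern": [0, 3, 2, 1]})

# Series.map looks up each code in the dictionary.
toy["h1n1_concern_label"] = toy["h1n1_concern"].map(concern_labels)
print(toy)
```

The same pattern would apply to the 1 to 5 opinion scales, with one dictionary per scale.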


## Import Resources

In [56]:
import os

import numpy as np
import pandas as pd
import seaborn as sns
import xlrd
from sklearn.compose import ColumnTransformer
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import (BaggingClassifier, ExtraTreesClassifier,
                              RandomForestClassifier, StackingRegressor,
                              VotingClassifier)
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, r2_score
from sklearn.model_selection import (GridSearchCV, cross_val_score,
                                     train_test_split)
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

In [2]:
df_features = pd.read_csv('Data/training_set_features.csv')
df_targets = pd.read_csv('Data/training_set_labels.csv')

In [3]:
df = df_features.merge(df_targets)

In [None]:
df.columns

In [None]:
df.head()

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
for c in df.columns:
    print("---- %s ---" % c)
    print(df[c].value_counts())

In [None]:
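# Restrict to numeric columns; recent pandas versions raise on string columns.
df.select_dtypes('number').corr()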

## Preprocessing

In [None]:
df.isna().sum()

In [4]:
# Missing Data
drop = ['respondent_id', 'doctor_recc_seasonal', 'opinion_seas_vacc_effective', 'opinion_seas_risk',
                'opinion_seas_sick_from_vacc', 'seasonal_vaccine', 'employment_industry', 'employment_occupation']

# Columns to impute with their most frequent value (this list is not referenced by the fillna step below)
most_frequent = ['h1n1_concern', 'h1n1_knowledge', 'behavioral_antiviral_meds', 'behavioral_avoidance', 
                'behavioral_face_mask', 'behavioral_wash_hands', 'behavioral_large_gatherings', 
                'behavioral_outside_home', 'behavioral_touch_face', 'doctor_recc_h1n1', 'chronic_med_condition',
                'child_under_6_months', 'health_worker', 'health_insurance', 'opinion_h1n1_vacc_effective',
                'opinion_h1n1_risk', 'opinion_h1n1_sick_from_vacc', 'income_poverty', 'marital_status',
                'rent_or_own', 'employment_status', 'household_adults', 'household_children', 'education']

# Encoding and scaling
cat_cols = ['age_group', 'education', 'race', 'sex', 'income_poverty', 'marital_status', 'rent_or_own', 'employment_status',
       'hhs_geo_region', 'census_msa']

num_cols = ['h1n1_concern', 'h1n1_knowledge', 'behavioral_antiviral_meds', 'behavioral_avoidance', 
                'behavioral_face_mask', 'behavioral_wash_hands', 'behavioral_large_gatherings', 
                'behavioral_outside_home', 'behavioral_touch_face', 'doctor_recc_h1n1', 'chronic_med_condition',
                'child_under_6_months', 'health_worker', 'health_insurance', 'opinion_h1n1_vacc_effective',
                'opinion_h1n1_risk', 'opinion_h1n1_sick_from_vacc',
                'household_adults', 'household_children']

In [5]:
df = df.drop(drop, axis=1)

In [6]:
X = df.drop('h1n1_vaccine', axis=1)
y = df['h1n1_vaccine']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [7]:
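# Use training-set modes only, so no test-set information leaks into imputation.
train_modes = X_train.mode().iloc[0]
X_train = X_train.fillna(train_modes)
X_test = X_test.fillna(train_modes)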
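
The cell above fills each missing value with the most frequent value of its column. As a hedged sketch of the same idea, `SimpleImputer` (imported earlier in this notebook) can learn the fill values from the training rows and reuse them on the test rows. The `toy_train` and `toy_test` frames below are hypothetical stand-ins, not the survey data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy stand-ins for X_train / X_test.
toy_train = pd.DataFrame({
    "h1n1_concern": [2.0, 2.0, np.nan, 1.0],
    "education": ["College Graduate", np.nan, "Some College", "College Graduate"],
})
toy_test = pd.DataFrame({
    "h1n1_concern": [np.nan, 3.0],
    "education": [np.nan, "< 12 Years"],
})

# Learn the per-column modes from the training rows only.
imputer = SimpleImputer(strategy="most_frequent")
imputer.fit(toy_train)

# Apply the same learned fill values to both splits.
toy_train_filled = pd.DataFrame(
    imputer.transform(toy_train), columns=toy_train.columns, index=toy_train.index
)
toy_test_filled = pd.DataFrame(
    imputer.transform(toy_test), columns=toy_test.columns, index=toy_test.index
)
print(toy_test_filled)
```

Because the imputer is fit on the training frame alone, the test rows never influence which value is used as the fill.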

In [8]:
ct = ColumnTransformer(transformers=[
    ('ohe', OneHotEncoder(drop='first'), cat_cols),
    ('scaler', MinMaxScaler(), num_cols)
], remainder='passthrough')

In [9]:
ct.fit(X_train)
X_train_trans = ct.transform(X_train)
X_test_trans = ct.transform(X_test)

In [10]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
ohe_col_names = ct.named_transformers_['ohe'].get_feature_names_out(cat_cols)

In [11]:
X_train_trans = pd.DataFrame(X_train_trans,
             columns = [*ohe_col_names, *num_cols], # Using * to unpack lists
             index = X_train.index)

X_test_trans = pd.DataFrame(X_test_trans,
             columns = [*ohe_col_names, *num_cols], # Using * to unpack lists
             index = X_test.index)

In [12]:
X_train_trans.describe()

| | age_group_35 - 44 Years | age_group_45 - 54 Years | age_group_55 - 64 Years | age_group_65+ Years | education_< 12 Years | education_College Graduate | education_Some College | race_Hispanic | race_Other or Multiple | race_White | ... | doctor_recc_h1n1 | chronic_med_condition | child_under_6_months | health_worker | health_insurance | opinion_h1n1_vacc_effective | opinion_h1n1_risk | opinion_h1n1_sick_from_vacc | household_adults | household_children |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 21365.0 | 21365.0 | 21365.0 | 21365.0 | 21365.0 | 21365.0 | 21365.0 | 21365.0 | 21365.0 | 21365.0 | ... | 21365.0 | 21365.0 | 21365.0 | 21365.0 | 21365.0 | 21365.0 | 21365.0 | 21365.0 | 21365.0 | 21365.0 |
| mean | 0.142944 | 0.195834 | 0.208799 | 0.255277 | 0.08776 | 0.430049 | 0.2652 | 0.066136 | 0.061877 | 0.794477 | ... | 0.203651 | 0.274561 | 0.079616 | 0.110414 | 0.935081 | 0.71279 | 0.335174 | 0.338942 | 0.296653 | 0.17694 |
| std | 0.350024 | 0.396851 | 0.40646 | 0.436027 | 0.282953 | 0.495094 | 0.44145 | 0.248526 | 0.240937 | 0.404093 | ... | 0.402722 | 0.446303 | 0.270704 | 0.313413 | 0.246389 | 0.250402 | 0.319782 | 0.338335 | 0.250323 | 0.308851 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.75 | 0.25 | 0.25 | 0.333333 | 0.0 |
| 75% | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.75 | 0.75 | 0.333333 | 0.333333 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In [None]:
X_train_trans.corr()

## Feature Selection

In [61]:
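# Note: f_regression is designed for continuous targets; for a binary label such
# as h1n1_vaccine, f_classif is the conventional score function.
selector = SelectKBest(score_func=f_regression, k=10)
selector.fit(X_train_trans, y_train)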

# X_k_best_train = selector.transform(X_train_trans)
# X_k_best_test = selector.transform(X_test_trans)

cols = selector.get_support(indices=True)
features_df_new = X_train_trans.iloc[:,cols]

In [62]:
features_df_new

| | h1n1_concern | h1n1_knowledge | behavioral_wash_hands | behavioral_touch_face | doctor_recc_h1n1 | chronic_med_condition | health_worker | opinion_h1n1_vacc_effective | opinion_h1n1_risk | opinion_h1n1_sick_from_vacc |
|---|---|---|---|---|---|---|---|---|---|---|
| 24706 | 0.666667 | 0.5 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.00 | 0.75 | 0.25 |
| 5393 | 0.666667 | 0.5 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.00 | 0.75 | 0.75 |
| 20898 | 0.000000 | 0.5 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00 | 0.25 | 0.00 |
| 3429 | 0.333333 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.75 | 0.25 | 0.00 |
| 8731 | 0.333333 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.50 | 0.25 | 1.00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 21575 | 0.666667 | 0.5 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.75 | 0.25 | 0.25 |
| 5390 | 0.333333 | 0.5 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.75 | 0.25 | 0.00 |
| 860 | 0.666667 | 0.5 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.75 | 0.25 | 0.25 |
| 15795 | 0.666667 | 0.5 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.00 | 0.00 | 0.75 |
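
The selected columns in `features_df_new` are not passed to the models in the sections that follow, which are fit on the full `X_train_trans`. As a rough sketch of how univariate selection could feed a classifier directly, the snippet below places `SelectKBest` inside a `Pipeline`. It uses `f_classif` rather than the `f_regression` call above, and the data from `make_classification` is synthetic; `select_pipe`, `X_toy` and `y_toy` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Synthetic binary-classification data standing in for the survey features.
X_toy, y_toy = make_classification(
    n_samples=200, n_features=12, n_informative=4, random_state=42
)

# Selection is refit on whatever data the pipeline is fit on, so it can be
# cross-validated together with the classifier.
select_pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=5)),
    ("logistic", LogisticRegression(max_iter=1000)),
])
select_pipe.fit(X_toy, y_toy)

# Indices of the columns kept by the selector.
selected_idx = select_pipe.named_steps["select"].get_support(indices=True)
print(selected_idx)
```

Inside a grid search, `select__k` could then be tuned like any other step parameter.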


## Dummy Classifier

In [37]:
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train_trans, y_train)

dummy_preds = dummy.predict(X_test_trans)

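print(accuracy_score(y_test, dummy_preds))
# The dummy model never predicts the positive class, so precision is undefined;
# zero_division=0 reports it as 0.0 without raising a warning.
print(precision_score(y_test, dummy_preds, zero_division=0))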

0.7884687383002621
0.0


In [53]:
(unique, counts) = np.unique(dummy_preds, return_counts=True)

frequencies = np.asarray((unique, counts)).T

print(frequencies)

[[   0 5342]]
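
The two baseline numbers follow directly from the class balance: a model that always predicts 0 is right exactly as often as 0 occurs, and it has no positive predictions from which to compute precision. The arithmetic can be sketched as follows; the count of 4212 negatives is inferred from the accuracy and the 5342 test rows printed above, and the `precision` helper is hypothetical.

```python
# Counts inferred from the outputs above: 5342 test rows, all predicted as 0,
# and an accuracy of 0.7884687..., which corresponds to 4212 true negatives.
n_test = 5342
n_negative = 4212
n_positive = n_test - n_negative


def precision(true_positives, false_positives):
    """Precision = TP / (TP + FP); defined here as 0.0 when nothing is predicted positive."""
    predicted_positive = true_positives + false_positives
    return true_positives / predicted_positive if predicted_positive else 0.0


baseline_accuracy = n_negative / n_test
baseline_precision = precision(true_positives=0, false_positives=0)
print(baseline_accuracy, baseline_precision)
```

Any model below has to beat roughly 0.788 accuracy to improve on this baseline, while any positive prediction that turns out correct gives it a nonzero precision.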


## Decision Tree Gridsearch

In [13]:
treepipe = Pipeline([
    ('tree', DecisionTreeClassifier())
])

# Grid keys take the form `<step name>__<parameter>`, using the step name given in the Pipeline
parameters = {
    'tree__criterion': ['gini', 'entropy'],
    'tree__max_depth': list(range(1, 10)),
    'tree__min_samples_leaf': list(range(1, 10)),
    'tree__random_state': [42]
}

cv_tree = GridSearchCV(treepipe, param_grid=parameters, scoring='precision')

cv_tree.fit(X_train_trans, y_train)
y_pred_tree = cv_tree.predict(X_test_trans)
print(y_pred_tree)

[0 0 0 ... 0 0 0]
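
The size of a grid search grows multiplicatively with each parameter list. A quick way to check the cost before fitting is `ParameterGrid`, sketched here for the decision tree grid; `tree_grid` is a hypothetical copy of the dictionary above, and the fold count assumes the default `cv=5` of recent scikit-learn releases.

```python
from sklearn.model_selection import ParameterGrid

# Copy of the decision tree grid defined above.
tree_grid = {
    "tree__criterion": ["gini", "entropy"],
    "tree__max_depth": list(range(1, 10)),
    "tree__min_samples_leaf": list(range(1, 10)),
    "tree__random_state": [42],
}

# 2 criteria x 9 depths x 9 leaf sizes x 1 seed
n_candidates = len(ParameterGrid(tree_grid))
n_folds = 5  # assumed default cv for GridSearchCV
print(n_candidates, n_candidates * n_folds)  # 162 810
```

So this search evaluates 162 parameter combinations with 810 cross-validation fits, plus one final refit on the full training set.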


In [14]:
cv_tree.best_params_

{'tree__criterion': 'gini',
 'tree__max_depth': 2,
 'tree__min_samples_leaf': 1,
 'tree__random_state': 42}

In [15]:
cv_tree.best_score_

0.6885314980379216

In [16]:
accuracy_score(y_test, y_pred_tree)

0.8260950954698615

## Logistic Gridsearch

In [18]:
logpipe = Pipeline([
    ('logistic', LogisticRegression())
])

# Grid keys take the form `<step name>__<parameter>`, using the step name given in the Pipeline
parameters = {
    'logistic__C': [1, 10, 100, 1000, 10000, 100000],
    'logistic__class_weight': [None, 'balanced'],
    'logistic__random_state': [42]
}

cv_log = GridSearchCV(logpipe, param_grid=parameters, scoring='precision')

cv_log.fit(X_train_trans, y_train)
y_pred_log = cv_log.predict(X_test_trans)
print(y_pred_log)

[0 0 0 ... 1 0 0]


In [52]:
(unique, counts) = np.unique(y_pred_log, return_counts=True)

frequencies = np.asarray((unique, counts)).T

print(frequencies)

[[   0 5051]
 [   1  291]]


In [19]:
cv_log.best_params_

{'logistic__C': 1,
 'logistic__class_weight': None,
 'logistic__random_state': 42}

In [20]:
cv_log.best_score_

0.6812667659379484

## KNN Gridsearch

In [27]:
knnpipe = Pipeline([
    ('KNN', KNeighborsClassifier())
])

# Grid keys take the form `<step name>__<parameter>`, using the step name given in the Pipeline
parameters = {
    'KNN__n_neighbors': [7, 9, 11, 13, 15]
}

cv_knn = GridSearchCV(knnpipe, param_grid=parameters, scoring='precision')

cv_knn.fit(X_train_trans, y_train)
y_pred_knn = cv_knn.predict(X_test_trans)
print(y_pred_knn)

[0 0 0 ... 0 0 0]


In [51]:
(unique, counts) = np.unique(y_pred_knn, return_counts=True)

frequencies = np.asarray((unique, counts)).T

print(frequencies)

[[   0 5110]
 [   1  232]]


In [25]:
cv_knn.best_params_

{'KNN__n_neighbors': 11}

In [26]:
cv_knn.best_score_

0.6495931766409934

## Forest Gridsearch

In [42]:
import warnings
warnings.filterwarnings('ignore')  # "error", "ignore", "always", "default", "module" or "once"

In [43]:
forestpipe = Pipeline([
    ('forest', RandomForestClassifier())
])

# Grid keys take the form `<step name>__<parameter>`, using the step name given in the Pipeline
parameters = {
    'forest__n_estimators': [25, 50, 100, 150],
    # 'log_loss' requires scikit-learn 1.1+ and is equivalent to 'entropy'
    'forest__criterion': ['gini', 'entropy', 'log_loss'],
    'forest__max_depth': list(range(1, 5)),
    'forest__random_state': [42]
}

cv_forest = GridSearchCV(forestpipe, param_grid=parameters, scoring='precision')

cv_forest.fit(X_train_trans, y_train)
y_pred_forest = cv_forest.predict(X_test_trans)
print(y_pred_forest)

[0 0 0 ... 0 0 0]


In [48]:
(unique, counts) = np.unique(y_pred_forest, return_counts=True)

frequencies = np.asarray((unique, counts)).T

print(frequencies)

[[   0 5315]
 [   1   27]]


In [39]:
cv_forest.best_params_

{'forest__criterion': 'entropy',
 'forest__max_depth': 3,
 'forest__n_estimators': 50,
 'forest__random_state': 42}

In [40]:
cv_forest.best_score_

0.885024154589372

In [41]:
accuracy_score(y_test, y_pred_forest)

0.7927742418569824
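
Note that `best_score_` is the mean cross-validated precision on the training folds, while the `accuracy_score` calls use the held-out test split, so the two are not directly comparable. One way to put the models on the same footing is to compute the same test-set metrics for each set of predictions. The helper below is a hypothetical sketch, and the labels it is run on are toy values rather than the notebook's data.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score


def summarize_predictions(y_true, y_pred):
    """Return accuracy, precision and recall for one set of predictions."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
    }


# Hypothetical labels and predictions, not taken from the notebook's data.
toy_true = [0, 0, 0, 1, 1, 1, 0, 1]
toy_pred = [0, 0, 1, 1, 0, 0, 0, 1]
toy_summary = summarize_predictions(toy_true, toy_pred)
print(toy_summary)
```

In the notebook this could be called as `summarize_predictions(y_test, y_pred_forest)` and likewise for the tree, logistic and KNN predictions. Reporting recall alongside precision matters here because a model that predicts very few positives can score high precision while missing most vaccinated respondents.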