# Feature Engineering:

In [1]:
#import necessary packages
import os
import pandas as pd

In [2]:
#import previously cleaned data
onehot_train = pd.read_csv('data/onehot_train')
onehot_test = pd.read_csv('data/onehot_test')

## OneHot Encoding:

In [3]:
#want to see which columns are not numeric and still need encoding
onehot_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26707 entries, 0 to 26706
Data columns (total 38 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   respondent_id                26707 non-null  int64  
 1   h1n1_concern                 26707 non-null  float64
 2   h1n1_knowledge               26707 non-null  float64
 3   behavioral_antiviral_meds    26707 non-null  float64
 4   behavioral_avoidance         26707 non-null  float64
 5   behavioral_face_mask         26707 non-null  float64
 6   behavioral_wash_hands        26707 non-null  float64
 7   behavioral_large_gatherings  26707 non-null  float64
 8   behavioral_outside_home      26707 non-null  float64
 9   behavioral_touch_face        26707 non-null  float64
 10  doctor_recc_h1n1             26707 non-null  float64
 11  doctor_recc_seasonal         26707 non-null  float64
 12  chronic_med_condition        26707 non-null  float64
 13  child_under_6_mo

The health_insurance, race, hhs_geo_region, employment_industry and employment_occupation columns all contain some (or all) of their values as strings. Most machine learning models cannot work with strings directly, so all of these must be encoded in numeric form. When a feature is categorical but its categories have no meaningful order, we use one-hot encoding, which creates a new column for every possible category in that feature. Each row then has a 0 in all of that feature's new columns except one, which holds a 1. This binary data can then be fed to the machine learning model without issue.
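
As a minimal sketch of what one-hot encoding produces, the toy frame below (hypothetical values, not rows from the survey data) has a single string column. Passing `dtype=int` is my own choice here so the dummies print as 0/1 on any recent pandas version.

```python
import pandas as pd

# Hypothetical toy data; the 'race' values are illustrative only.
toy = pd.DataFrame({
    "respondent_id": [0, 1, 2],
    "race": ["White", "Black", "White"],
})

# String columns are replaced by one 0/1 column per category;
# numeric columns such as respondent_id pass through unchanged.
toy_encoded = pd.get_dummies(toy, dtype=int)
print(toy_encoded)
```

Each row ends up with exactly one 1 across the `race_*` columns.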

### Dummy features created for all nominal data columns:

In [4]:
#get_dummies will only affect columns with Dtype of object in my dataframe, automatically creating dummy variables for each category
encoded_train = pd.get_dummies(onehot_train)
encoded_test = pd.get_dummies(onehot_test)
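
One thing to watch when encoding the two splits separately: if a category appears in only one of them, the resulting dataframes will not have the same columns. The sketch below shows one possible safeguard using `reindex` on hypothetical toy frames. It assumes both frames hold the same feature columns before encoding; any target columns present only in the training data would need to be set aside first.

```python
import pandas as pd

# Hypothetical toy frames: the test split is missing the "C" category.
train = pd.DataFrame({"x": [1.0, 2.0, 3.0], "cat": ["A", "B", "C"]})
test = pd.DataFrame({"x": [4.0, 5.0], "cat": ["A", "B"]})

train_enc = pd.get_dummies(train, dtype=int)
test_enc = pd.get_dummies(test, dtype=int)

# Give the test frame exactly the training columns, in the same order;
# dummy columns absent from the test split are filled with 0.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_aligned)
```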

In [5]:
#new dataframe shape and Dtypes
encoded_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26707 entries, 0 to 26706
Data columns (total 96 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   respondent_id                   26707 non-null  int64  
 1   h1n1_concern                    26707 non-null  float64
 2   h1n1_knowledge                  26707 non-null  float64
 3   behavioral_antiviral_meds       26707 non-null  float64
 4   behavioral_avoidance            26707 non-null  float64
 5   behavioral_face_mask            26707 non-null  float64
 6   behavioral_wash_hands           26707 non-null  float64
 7   behavioral_large_gatherings     26707 non-null  float64
 8   behavioral_outside_home         26707 non-null  float64
 9   behavioral_touch_face           26707 non-null  float64
 10  doctor_recc_h1n1                26707 non-null  float64
 11  doctor_recc_seasonal            26707 non-null  float64
 12  chronic_med_condition           

The get_dummies method added 58 columns, which more than doubles the width of this dataframe. This is mostly due to the 2 employment features having over 20 categorical response options each. We can decide later whether these features are worth keeping, depending on how much they affect the accuracy of the machine learning models.
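
A quick way to see which features drive the column count is to count the distinct values in each non-numeric column before encoding. This is a sketch on a hypothetical toy frame standing in for `onehot_train`; the values are made up.

```python
import pandas as pd

# Hypothetical toy frame; values are illustrative only.
toy = pd.DataFrame({
    "age": [30.0, 41.0, 25.0, 60.0],
    "race": ["White", "Black", "White", "Hispanic"],
    "employment_industry": ["a", "b", "c", "d"],
})

# Distinct non-null values per non-numeric column, which is the number
# of dummy columns get_dummies would create for that feature.
dummy_counts = (
    toy.select_dtypes(exclude="number")
       .nunique()
       .sort_values(ascending=False)
)
print(dummy_counts)
```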

## Creating New Feature:

I felt that the total number of people in a household was more important than knowing how many were adults versus children. Having more people in close contact, regardless of age, is likely to increase the likelihood of catching colds and flus. This in turn would likely affect a person's probability of getting a vaccine, given how often they get sick and how easily they may spread a disease to loved ones.

In [6]:
#Adding household adults with household children to create a total column
encoded_train['household_total'] = encoded_train['household_adults'] + encoded_train['household_children']
encoded_test['household_total'] = encoded_test['household_adults'] + encoded_test['household_children']

## Saving Dataframe for ML Modeling

In [7]:
#saving for ML models
encoded_train.to_csv('data/ML_ready_train', index=False)
encoded_test.to_csv('data/ML_ready_test', index=False)

# Returning to Feature Engineering:

While in the modeling stage of my project, I was able to use a Random Forest Classifier to determine which of the features in my data were most and least important in predicting vaccinations. For both the seasonal vaccine and the H1N1 vaccine, the only features eliminated came from the 'employment_industry' and 'employment_occupation' dummy features. 13 of these dummies were removed from my seasonal model and 28 were removed from my H1N1 model. These features were also missing data for close to half the responses. With all this in mind, it does not seem these 45 extra dummy features are worth the computing power they use up, so I will remove them from the dataframe and from any final predictive model I present.
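
For reference, here is a minimal sketch of how a random forest can rank features, assuming scikit-learn's impurity-based `feature_importances_` attribute. The data is synthetic, and the column names `signal` and `noise` are hypothetical. This is not the exact procedure from my modeling notebook, only an illustration of the idea.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 'signal' determines the label, 'noise' does not.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "signal": rng.normal(size=500),
    "noise": rng.normal(size=500),
})
y = (X["signal"] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# One impurity-based importance per column; the values sum to 1.
importances = pd.Series(forest.feature_importances_, index=X.columns).sort_values()
print(importances)
```

Features sitting at the bottom of this ranking are candidates for removal.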

## Removing Features for Final Modeling

In [8]:
#dropping both employment columns from the original (pre-encoding) data
final_train = onehot_train.drop(columns=['employment_industry', 'employment_occupation'])
final_test = onehot_test.drop(columns=['employment_industry', 'employment_occupation'])

## Repeating Feature Engineering after Removal of Employment Columns

In [9]:
#re-creating dummy variables for the remaining object columns
final_train = pd.get_dummies(final_train)
final_test = pd.get_dummies(final_test)

In [11]:
#re-creating the household total column
final_train['household_total'] = final_train['household_adults'] + final_train['household_children']
final_test['household_total'] = final_test['household_adults'] + final_test['household_children']

In [12]:
#saving final dataframes for ML models
final_train.to_csv('data/ML_final_train', index=False)
final_test.to_csv('data/ML_final_test', index=False)
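
Since the same three steps (drop, encode, add the total) were run twice in this notebook, they could be wrapped in one function. The sketch below is one possible way to do that; `engineer_features` and its `drop_cols` parameter are hypothetical names of my own, and the toy frame is made-up data.

```python
import pandas as pd

def engineer_features(df, drop_cols=()):
    """Drop the given columns, one-hot encode the rest, add household_total."""
    out = df.drop(columns=list(drop_cols))
    out = pd.get_dummies(out)
    out["household_total"] = out["household_adults"] + out["household_children"]
    return out

# Hypothetical toy frame; values are illustrative only.
toy = pd.DataFrame({
    "household_adults": [1.0, 2.0],
    "household_children": [0.0, 3.0],
    "race": ["White", "Black"],
    "employment_industry": ["x", "y"],
})

result = engineer_features(toy, drop_cols=["employment_industry"])
print(result.columns.tolist())
```

Calling it with no `drop_cols` would reproduce the first pass, and calling it with the two employment columns would reproduce the final one.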