# THE H1N1 AND SEASONAL FLU VACCINES PROJECT

As the world struggles to vaccinate the global population against COVID-19, an understanding of how people’s backgrounds, opinions, and health behaviors are related to their personal vaccination patterns can provide guidance for future public health efforts. Your audience could be someone guiding those public health efforts.
This project follows the *Cross-Industry Standard Process for Data Mining (CRISP-DM)* methodology.

## BUSINESS UNDERSTANDING

Good questions for this stage include:

Who are the stakeholders in this project? Who will be directly affected by the creation of this project?

What business problem(s) will this Data Science project solve for the organization?

What problems are inside the scope of this project?

What problems are outside the scope of this project?

What data sources are available to us?

What is the expected timeline for this project? Are there hard deadlines (e.g. "must be live before holiday season shopping") or is this an ongoing project?

Do stakeholders from different parts of the company or organization all have the exact same understanding about what this project is and isn't?

## DATA UNDERSTANDING

Consider the following questions when working through this stage:

What data is available to us? Where does it live? Do we have the data, or can we scrape/buy/source the data from somewhere else?

Who controls the data sources, and what steps are needed to get access to the data?

What is our target?

What predictors are available to us?

What data types are the predictors we'll be working with?

What is the distribution of our data?

How many observations does our dataset contain? Do we have a lot of data? Only a little?

Do we have enough data to build a model? Will we need to use resampling methods?

How do we know the data is correct? How is the data collected? Is there a chance the data could be wrong?
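
Several of these questions can be answered with a few pandas calls. The sketch below uses a tiny made-up frame (the column names `feature_a` and `target` are placeholders, not columns from the survey files) to show the number of observations, the share of missing values per column, and the balance of a binary target.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; not taken from the survey files.
toy = pd.DataFrame({
    "feature_a": [1.0, np.nan, 3.0, 4.0],
    "target": [0, 0, 0, 1],
})

n_rows = len(toy)                                            # number of observations
missing_share = toy.isna().mean()                            # fraction of nulls per column
class_balance = toy["target"].value_counts(normalize=True)   # share of each target class

print(n_rows)
print(missing_share)
print(class_balance)
```

A strongly skewed `class_balance` is one hint that resampling or a stratified split may be worth considering.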

## DATA PREPARATION

During this stage, we'll want to handle the following issues:

Detecting and dealing with missing values

Data type conversions (e.g. numeric data mistakenly encoded as strings)

Checking for and removing multicollinearity (correlated predictors)

Normalizing our numeric data

Converting categorical data to numeric format through one-hot encoding
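
As a rough illustration of the normalization step, the sketch below applies the two scalers imported later in this notebook to a made-up column of 1 to 5 ratings. The values are illustrative only, and which scaler (if any) suits this project is left open.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy numeric column on a 1-5 scale (illustrative values only).
ratings = np.array([[1.0], [3.0], [5.0]])

# Min-max scaling maps the observed range onto [0, 1].
scaled_minmax = MinMaxScaler().fit_transform(ratings)

# Standardization centers the column on 0 with unit variance.
scaled_standard = StandardScaler().fit_transform(ratings)

print(scaled_minmax.ravel())
print(scaled_standard.ravel())
```

Tree-based models are largely insensitive to feature scaling, so this step matters mostly for distance-based or linear models.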

## MODELLING

Consider the following questions during the modeling step:

Is this a classification task? A regression task? Something else?

What models will we try?

How do we deal with overfitting?

Do we need to use regularization or not?

What sort of validation strategy will we be using to check that our model works well on unseen data?

What loss functions will we use?

What threshold of performance do we consider as successful?
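
One possible answer to the validation question is a simple hold-out split. The sketch below is a minimal example on synthetic data (the arrays `X` and `y` are generated here and are not the survey data); it limits tree depth as one way to curb overfitting and scores the held-out rows with ROC AUC.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration: the label depends on the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

# Hold out 20% of rows; stratify keeps the class ratio similar in both parts.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# A shallow tree is less prone to overfitting than an unconstrained one.
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)

# ROC AUC is computed from predicted probabilities of the positive class.
val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
print(f"validation ROC AUC: {val_auc:.3f}")
```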

## EVALUATION

During this step, we'll evaluate the results of our modeling efforts. Does our model solve the problems that we outlined during Business Understanding? Why or why not? Evaluating the results of the modeling step will often raise new questions or cause us to consider changing our approach to the problem. In the CRISP-DM process diagram, the "Evaluation" step is unique in that it points to both Business Understanding and Deployment. Data Science is an iterative process: given the new information our model has provided, we'll often want to start another iteration, armed with our newfound knowledge.

Perhaps the results of our model showed us something important that we had originally failed to consider about the goal or the scope of the project. Perhaps we learned that the model can't be successful without more data, or different data. Perhaps our evaluation shows us that we should reconsider our approach to cleaning and structuring the data, or how we frame the project as a whole (e.g. realizing we should treat the problem as a classification rather than a regression task). In any of these cases, revisiting the earlier steps is encouraged.

## DEPLOYMENT

During this stage, we'll focus on moving our model into production and automating as much as possible. Everything before this serves as a proof-of-concept or an investigation. If the project has proved successful, then you'll work with stakeholders to determine the best way to implement models and insights. For example, you might set up automated ETL (Extract-Transform-Load) pipelines that feed raw data into a database and reformat it so that it is ready for modeling. During the deployment step, you'll actively work to determine the best course of action for getting the results of your project into the wild, and you'll often be involved with building everything needed to put the software into production.

*******



Import the necessary libraries

In [9]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier,DecisionTreeRegressor
from sklearn.metrics import r2_score,roc_auc_score,accuracy_score,precision_score,classification_report
from sklearn.preprocessing import LabelEncoder,OneHotEncoder,StandardScaler,MinMaxScaler
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from IPython.display import display, HTML

Import the files

In [10]:
#submission file
sub=pd.read_csv(r'submission_format.csv')

In [96]:
#test
test=pd.read_csv(r'test_set_features.csv')

In [6]:
training_feat=pd.read_csv(r'training_set_features.csv')
training_feat.head(5)

Unnamed: 0,respondent_id,h1n1_concern,h1n1_knowledge,behavioral_antiviral_meds,behavioral_avoidance,behavioral_face_mask,behavioral_wash_hands,behavioral_large_gatherings,behavioral_outside_home,behavioral_touch_face,...,income_poverty,marital_status,rent_or_own,employment_status,hhs_geo_region,census_msa,household_adults,household_children,employment_industry,employment_occupation
0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,...,Below Poverty,Not Married,Own,Not in Labor Force,oxchjgsf,Non-MSA,0.0,0.0,,
1,1,3.0,2.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,...,Below Poverty,Not Married,Rent,Employed,bhuqouqj,"MSA, Not Principle City",0.0,0.0,pxcmvdjn,xgwztkwe
2,2,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,"<= $75,000, Above Poverty",Not Married,Own,Employed,qufhixun,"MSA, Not Principle City",2.0,0.0,rucpziij,xtkaffoo
3,3,1.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,...,Below Poverty,Not Married,Rent,Not in Labor Force,lrircsnp,"MSA, Principle City",0.0,0.0,,
4,4,2.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,...,"<= $75,000, Above Poverty",Married,Own,Employed,qufhixun,"MSA, Not Principle City",1.0,0.0,wxleyezf,emcorrxb


In [94]:
#training_label
training_label=pd.read_csv(r'training_set_labels.csv')
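
Before modeling, it can help to confirm that every feature row has exactly one label row. The sketch below shows one way to do that with a validated merge on `respondent_id`. The frames are toy stand-ins, and the label column names `h1n1_vaccine` and `seasonal_vaccine` are assumed for illustration since the labels file is not displayed in this notebook.

```python
import pandas as pd

# Toy stand-ins for the features and labels files. The label column names
# h1n1_vaccine and seasonal_vaccine are assumptions made for this example.
features = pd.DataFrame({"respondent_id": [0, 1, 2], "h1n1_concern": [1.0, 3.0, 2.0]})
labels = pd.DataFrame({
    "respondent_id": [2, 0, 1],
    "h1n1_vaccine": [0, 0, 1],
    "seasonal_vaccine": [1, 0, 1],
})

# validate="one_to_one" raises if a respondent_id is duplicated on either side.
merged = features.merge(labels, on="respondent_id", how="inner", validate="one_to_one")

# Every feature row should have found its label row.
assert len(merged) == len(features)
print(merged)
```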

In [38]:
df=pd.DataFrame(training_feat)

In [31]:
#load the columns
df.columns

Index(['respondent_id', 'h1n1_concern', 'h1n1_knowledge',
       'behavioral_antiviral_meds', 'behavioral_avoidance',
       'behavioral_face_mask', 'behavioral_wash_hands',
       'behavioral_large_gatherings', 'behavioral_outside_home',
       'behavioral_touch_face', 'doctor_recc_h1n1', 'doctor_recc_seasonal',
       'chronic_med_condition', 'child_under_6_months', 'health_worker',
       'health_insurance', 'opinion_h1n1_vacc_effective', 'opinion_h1n1_risk',
       'opinion_h1n1_sick_from_vacc', 'opinion_seas_vacc_effective',
       'opinion_seas_risk', 'opinion_seas_sick_from_vacc', 'age_group',
       'education', 'race', 'sex', 'income_poverty', 'marital_status',
       'rent_or_own', 'employment_status', 'hhs_geo_region', 'census_msa',
       'household_adults', 'household_children', 'employment_industry',
       'employment_occupation'],
      dtype='object')

In [32]:
#load the data description
df.describe()

Unnamed: 0,respondent_id,h1n1_concern,h1n1_knowledge,behavioral_antiviral_meds,behavioral_avoidance,behavioral_face_mask,behavioral_wash_hands,behavioral_large_gatherings,behavioral_outside_home,behavioral_touch_face,...,health_worker,health_insurance,opinion_h1n1_vacc_effective,opinion_h1n1_risk,opinion_h1n1_sick_from_vacc,opinion_seas_vacc_effective,opinion_seas_risk,opinion_seas_sick_from_vacc,household_adults,household_children
count,26707.0,26615.0,26591.0,26636.0,26499.0,26688.0,26665.0,26620.0,26625.0,26579.0,...,25903.0,14433.0,26316.0,26319.0,26312.0,26245.0,26193.0,26170.0,26458.0,26458.0
mean,13353.0,1.618486,1.262532,0.048844,0.725612,0.068982,0.825614,0.35864,0.337315,0.677264,...,0.111918,0.87972,3.850623,2.342566,2.35767,4.025986,2.719162,2.118112,0.886499,0.534583
std,7709.791156,0.910311,0.618149,0.215545,0.446214,0.253429,0.379448,0.47961,0.472802,0.467531,...,0.315271,0.3253,1.007436,1.285539,1.362766,1.086565,1.385055,1.33295,0.753422,0.928173
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0
25%,6676.5,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,1.0,3.0,1.0,1.0,4.0,2.0,1.0,0.0,0.0
50%,13353.0,2.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,...,0.0,1.0,4.0,2.0,2.0,4.0,2.0,2.0,1.0,0.0
75%,20029.5,2.0,2.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,...,0.0,1.0,5.0,4.0,4.0,5.0,4.0,4.0,1.0,1.0
max,26706.0,3.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,5.0,5.0,5.0,5.0,5.0,5.0,3.0,3.0


In [98]:
#load the training data info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26707 entries, 0 to 26706
Data columns (total 36 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   respondent_id                26707 non-null  int64  
 1   h1n1_concern                 26707 non-null  float64
 2   h1n1_knowledge               26707 non-null  float64
 3   behavioral_antiviral_meds    26707 non-null  float64
 4   behavioral_avoidance         26707 non-null  float64
 5   behavioral_face_mask         26707 non-null  float64
 6   behavioral_wash_hands        26707 non-null  float64
 7   behavioral_large_gatherings  26707 non-null  float64
 8   behavioral_outside_home      26707 non-null  float64
 9   behavioral_touch_face        26707 non-null  float64
 10  doctor_recc_h1n1             26707 non-null  float64
 11  doctor_recc_seasonal         26707 non-null  float64
 12  chronic_med_condition        26707 non-null  float64
 13  child_under_6_mo

In [88]:
#look at the null values
df_null=df.isnull().sum()
df_null

respondent_id                  0
h1n1_concern                   0
h1n1_knowledge                 0
behavioral_antiviral_meds      0
behavioral_avoidance           0
behavioral_face_mask           0
behavioral_wash_hands          0
behavioral_large_gatherings    0
behavioral_outside_home        0
behavioral_touch_face          0
doctor_recc_h1n1               0
doctor_recc_seasonal           0
chronic_med_condition          0
child_under_6_months           0
health_worker                  0
health_insurance               0
opinion_h1n1_vacc_effective    0
opinion_h1n1_risk              0
opinion_h1n1_sick_from_vacc    0
opinion_seas_vacc_effective    0
opinion_seas_risk              0
opinion_seas_sick_from_vacc    0
age_group                      0
education                      0
race                           0
sex                            0
income_poverty                 0
marital_status                 0
rent_or_own                    0
employment_status              0
hhs_geo_re

In [110]:
#check for multicollinearity (numeric columns only; string columns cannot be correlated)
df_corr=df.select_dtypes(include='number').corr()
print(df_corr)

                             respondent_id  h1n1_concern  h1n1_knowledge  \
respondent_id                     1.000000      0.017146        0.003626   
h1n1_concern                      0.017146      1.000000        0.071910   
h1n1_knowledge                    0.003626      0.071910        1.000000   
behavioral_antiviral_meds        -0.008458      0.089344       -0.008838   
behavioral_avoidance              0.011288      0.239102        0.092814   
behavioral_face_mask             -0.006427      0.152620        0.029502   
behavioral_wash_hands             0.010663      0.292980        0.093078   
behavioral_large_gatherings       0.004704      0.252863       -0.046827   
behavioral_outside_home           0.009080      0.243858       -0.065442   
behavioral_touch_face             0.007532      0.248585        0.090702   
doctor_recc_h1n1                 -0.002597      0.139284        0.095868   
doctor_recc_seasonal              0.000762      0.122682        0.075378   
chronic_med_
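
A full correlation matrix of this size is hard to scan by eye. The sketch below shows one way to list only the strongly correlated pairs. The helper name `high_correlation_pairs` and the 0.8 threshold are illustrative choices, not part of the original analysis, and the demonstration uses a toy frame.

```python
import numpy as np
import pandas as pd

def high_correlation_pairs(frame: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """Return column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.select_dtypes(include="number").corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (always 1.0) is excluded; other cells become NaN.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    # NaN never satisfies the comparison, so masked cells cannot pass the filter.
    return pairs[pairs > threshold].sort_values(ascending=False)

# Toy frame: b is an exact multiple of a, c is only weakly related to both.
toy = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [1, -1, 1, -1]})
result = high_correlation_pairs(toy)
print(result)
```

Calling `high_correlation_pairs(df)` on the survey frame would return an empty Series if no pair crosses the chosen threshold.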

*********

<span style="color:red;">health_insurance </span>

Insurance status cannot really be null: a person either has insurance or does not. We will replace null values with 0 to indicate no insurance.

<span style="color:red;">H1N1 concern</span>

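For h1n1_concern, if the respondent reported some knowledge of H1N1 (h1n1_knowledge of 1 or 2), we will fill the null values for concern with that knowledge level; if the respondent had no knowledge, we will fill the null values with 0.

<span style="color:red;">h1n1_knowledge</span>

For h1n1_knowledge, we will fill the null values with 1 if the respondent took antiviral medication and with 0 otherwise.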

<span style="color:red;">behavioral_antiviral_meds</span>

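For behavioral_antiviral_meds, we will fill the null values with the mode.

<span style="color:red;">behavioral_avoidance</span>

For behavioral_avoidance, we will replace null values based on H1N1 concern: respondents who were highly concerned (level 3) are assumed to have avoided close contact with people with flu-like symptoms and get 1; all others get 0.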

<span style="color:red;">behavioral_face_mask</span>

If the respondent had knowledge about H1N1 and avoided contact with others with flu-like symptoms, we will fill the null values for face mask with 1, otherwise with 0.

<span style="color:red;">behavioral_wash_hands</span>

For behavioral_wash_hands, a respondent who had knowledge and wore a mask probably also took the precaution of washing hands, so null values are filled with 1 in that case and with 0 otherwise.

<span style="color:red;">behavioral_large_gatherings</span>

For behavioral_large_gatherings, if the respondent avoided close contact with people who had flu-like symptoms, then their exposure to gatherings was probably reduced as well.

<span style="color:red;">behavioral_outside_home</span>

For behavioral_outside_home, if the respondent avoided contact with people with flu-like symptoms and avoided public gatherings, then they would probably have preferred to stay at home or reduce contact with people outside their household.

<span style="color:red;">behavioral_touch_face</span>

For behavioral_touch_face, if the respondent is knowledgeable, wears a face mask, and washes hands, then the respondent probably avoided touching their eyes, nose, or mouth.

<span style="color:red;">doctor_recc_h1n1</span>

For doctor_recc_h1n1, if the respondent is very concerned about H1N1, is knowledgeable, and has taken antiviral medication, then the H1N1 vaccine was probably also recommended by a doctor, either at the hospital or through the media.

<span style="color:red;">doctor_recc_seasonal</span>

For doctor_recc_seasonal, if the H1N1 vaccine was recommended by a doctor, then the seasonal flu vaccine was probably recommended by the doctor through the same medium.

<span style="color:red;">chronic_med_condition</span>

For chronic_med_condition, we will replace null values with 0.

**NB**
Null values in the remaining columns will be replaced with their mode values.
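
The rule-based fills described above share one pattern: fill a null with 1 when some condition on other columns holds, fill it with 0 otherwise, and leave observed answers alone. A minimal sketch of that pattern follows. The helper name `fill_nulls_by_rule` is hypothetical and the data is a toy frame, not the survey data.

```python
import numpy as np
import pandas as pd

def fill_nulls_by_rule(values: pd.Series, rule: pd.Series) -> pd.Series:
    """Fill nulls with 1 where rule is True and 0 elsewhere; keep observed answers."""
    fill_values = pd.Series(np.where(rule, 1.0, 0.0), index=values.index)
    return values.fillna(fill_values)

toy = pd.DataFrame({
    "h1n1_concern": [3.0, 1.0, 3.0, 0.0],
    "behavioral_avoidance": [np.nan, np.nan, 0.0, 1.0],
})

# Highly concerned respondents (level 3) are assumed to have avoided contact.
toy["behavioral_avoidance"] = fill_nulls_by_rule(
    toy["behavioral_avoidance"], toy["h1n1_concern"] == 3
)
print(toy["behavioral_avoidance"].tolist())  # [1.0, 0.0, 0.0, 1.0]
```

Note that the third row keeps its observed 0 even though the rule is True there; only nulls are touched.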





******

In [87]:
#fill the null with zero
#health insurance
df['health_insurance']=df['health_insurance'].fillna(0)

#h1n1_concern
df['h1n1_concern'] = np.where(
    df['h1n1_knowledge'] >= 1,  # Condition
    df['h1n1_concern'].fillna(df['h1n1_knowledge']),  # Value if condition is True
    df['h1n1_concern'].fillna(0))

#h1n1_knowledge
df['h1n1_knowledge']=np.where( 
    df['behavioral_antiviral_meds']>=1,
    df['h1n1_knowledge'].fillna(1),
    df['h1n1_knowledge'].fillna(0) )

#behavioral_antiviral_meds
mode_behavior=df['behavioral_antiviral_meds'].mode()[0]
df['behavioral_antiviral_meds']=df['behavioral_antiviral_meds'].fillna(mode_behavior)

#behavioral_avoidance
df['behavioral_avoidance']=np.where(df['h1n1_concern']==3,
                                   df['behavioral_avoidance'].fillna(1),
                                    df['behavioral_avoidance'].fillna(0))

#behavioral_face_mask
#if the respondent had knowledge about H1N1 and avoided contact with others with flu-like symptoms, fill the null values for face mask with 1, otherwise 0
df['behavioral_face_mask']= np.where((df['h1n1_knowledge']>= 1) & (df['behavioral_avoidance']==1),
                                    df['behavioral_face_mask'].fillna(1),
                                    df['behavioral_face_mask'].fillna(0))

#behavioral_wash_hands: a respondent who had knowledge and wore a mask probably also took the precaution of washing hands
df['behavioral_wash_hands'] = np.where((df['h1n1_knowledge']>= 1) & (df['behavioral_face_mask']==1),
                                      df['behavioral_wash_hands'].fillna(1),
                                       df['behavioral_wash_hands'].fillna(0) )

#behavioral_large_gatherings: if the respondent avoided close contact with people who had flu-like symptoms, their exposure to gatherings was probably reduced as well
df['behavioral_large_gatherings']= np.where(df['behavioral_avoidance']==1,
                                            df['behavioral_large_gatherings'].fillna(1),
                                             df['behavioral_large_gatherings'].fillna(0))

#behavioral_outside_home: if the respondent avoided contact with people with flu-like symptoms and avoided public gatherings, they probably preferred to stay at home or reduce contact with people outside their household
df['behavioral_outside_home']=np.where( (df['behavioral_avoidance']==1) & (df['behavioral_large_gatherings']==1),
                                      df['behavioral_outside_home'].fillna(1),
                                       df['behavioral_outside_home'].fillna(0))

#behavioral_touch_face: if the respondent is knowledgeable, wears a face mask and washes hands, they probably avoided touching their eyes, nose or mouth
df['behavioral_touch_face']= np.where((df['h1n1_knowledge']>=1) & (df['behavioral_face_mask']==1) & (df['behavioral_wash_hands']==1),
                                     df['behavioral_touch_face'].fillna(1),
                                      df['behavioral_touch_face'].fillna(0))

#doctor_recc_h1n1: if the respondent is very concerned about H1N1, is knowledgeable and has taken antiviral medication, the vaccine was probably also recommended by a doctor, either at the hospital or through the media
df['doctor_recc_h1n1']= np.where((df['h1n1_concern']==3) & (df['h1n1_knowledge']==2) & (df['behavioral_antiviral_meds']==1),
                                 df['doctor_recc_h1n1'].fillna(1),
                                 df['doctor_recc_h1n1'].fillna(0)) 

#doctor_recc_seasonal: if the H1N1 vaccine was recommended by a doctor, the seasonal flu vaccine was probably recommended through the same medium
df['doctor_recc_seasonal']= np.where(df['doctor_recc_h1n1']==1,
                                    df['doctor_recc_seasonal'].fillna(1),
                                      df['doctor_recc_seasonal'].fillna(0)  ) 

#chronic_med_condition: replace null values with 0
df['chronic_med_condition']=df['chronic_med_condition'].fillna(0)

#fill the remaining columns with their mode
#(assigning the result back avoids the deprecated chained fillna(..., inplace=True) pattern)
mode_fill_columns = [
    'child_under_6_months', 'health_worker',
    'opinion_h1n1_vacc_effective', 'opinion_h1n1_risk',
    'opinion_seas_vacc_effective', 'opinion_h1n1_sick_from_vacc',
    'opinion_seas_risk', 'opinion_seas_sick_from_vacc',
    'age_group', 'race', 'sex', 'income_poverty', 'marital_status',
    'rent_or_own', 'employment_status', 'hhs_geo_region', 'census_msa',
    'household_adults', 'household_children',
    'employment_industry', 'employment_occupation', 'education',
]
for col in mode_fill_columns:
    df[col] = df[col].fillna(df[col].mode()[0])
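
When the same cleaning is later applied to the test set, one common practice is to reuse the modes learned from the training data rather than recomputing them on the test rows. The sketch below illustrates the idea on toy frames; the variable names `train`, `test`, and `train_modes` are made up for this example.

```python
import numpy as np
import pandas as pd

# Toy train/test frames (illustrative values only).
train = pd.DataFrame({
    "sex": ["Female", "Female", "Male"],
    "health_worker": [0.0, 0.0, 1.0],
})
test = pd.DataFrame({
    "sex": [np.nan, "Male"],
    "health_worker": [np.nan, 1.0],
})

# mode() returns one row per tied mode; the first row holds one mode per column.
train_modes = train.mode().iloc[0]

# fillna with a Series fills each column with the value stored under its name.
test_filled = test.fillna(train_modes)
print(test_filled)
```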







In [92]:
#separate numerical from categorical columns
numericals=df.select_dtypes(include=['float64','int64'])
categoricals=df.select_dtypes(include=['object'])

numerical_columns = numericals.columns.tolist()
categorical_columns = categoricals.columns.tolist()

print("num_col:",numerical_columns)
print("cat_col:",categorical_columns)

num_col: ['respondent_id', 'h1n1_concern', 'h1n1_knowledge', 'behavioral_antiviral_meds', 'behavioral_avoidance', 'behavioral_face_mask', 'behavioral_wash_hands', 'behavioral_large_gatherings', 'behavioral_outside_home', 'behavioral_touch_face', 'doctor_recc_h1n1', 'doctor_recc_seasonal', 'chronic_med_condition', 'child_under_6_months', 'health_worker', 'health_insurance', 'opinion_h1n1_vacc_effective', 'opinion_h1n1_risk', 'opinion_h1n1_sick_from_vacc', 'opinion_seas_vacc_effective', 'opinion_seas_risk', 'opinion_seas_sick_from_vacc', 'household_adults', 'household_children']
cat_col: ['age_group', 'education', 'race', 'sex', 'income_poverty', 'marital_status', 'rent_or_own', 'employment_status', 'hhs_geo_region', 'census_msa', 'employment_industry', 'employment_occupation']
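
The numeric list above includes `respondent_id`, which is a row identifier rather than a survey answer. If it should not reach the model, it can be dropped when building the feature lists. A minimal sketch on a toy frame, assuming that is the desired behavior:

```python
import pandas as pd

# Toy frame mirroring the situation above: an ID column next to real features.
toy = pd.DataFrame({
    "respondent_id": [0, 1, 2],
    "h1n1_concern": [1.0, 3.0, 2.0],
    "sex": ["Female", "Male", "Female"],
})

numeric_features = (
    toy.select_dtypes(include="number")
       .drop(columns=["respondent_id"])   # identifier, not a predictor
       .columns.tolist()
)
categorical_features = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_features)      # ['h1n1_concern']
print(categorical_features)  # ['sex']
```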


In [97]:
df

Unnamed: 0,respondent_id,h1n1_concern,h1n1_knowledge,behavioral_antiviral_meds,behavioral_avoidance,behavioral_face_mask,behavioral_wash_hands,behavioral_large_gatherings,behavioral_outside_home,behavioral_touch_face,...,income_poverty,marital_status,rent_or_own,employment_status,hhs_geo_region,census_msa,household_adults,household_children,employment_industry,employment_occupation
0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,...,Below Poverty,Not Married,Own,Not in Labor Force,oxchjgsf,Non-MSA,0.0,0.0,fcxhlnwr,xtkaffoo
1,1,3.0,2.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,...,Below Poverty,Not Married,Rent,Employed,bhuqouqj,"MSA, Not Principle City",0.0,0.0,pxcmvdjn,xgwztkwe
2,2,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,"<= $75,000, Above Poverty",Not Married,Own,Employed,qufhixun,"MSA, Not Principle City",2.0,0.0,rucpziij,xtkaffoo
3,3,1.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,...,Below Poverty,Not Married,Rent,Not in Labor Force,lrircsnp,"MSA, Principle City",0.0,0.0,fcxhlnwr,xtkaffoo
4,4,2.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,...,"<= $75,000, Above Poverty",Married,Own,Employed,qufhixun,"MSA, Not Principle City",1.0,0.0,wxleyezf,emcorrxb
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26702,26702,2.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,...,"<= $75,000, Above Poverty",Not Married,Own,Not in Labor Force,qufhixun,Non-MSA,0.0,0.0,fcxhlnwr,xtkaffoo
26703,26703,1.0,2.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,...,"<= $75,000, Above Poverty",Not Married,Rent,Employed,lzgpxyit,"MSA, Principle City",1.0,0.0,fcxhlnwr,cmhcxjea
26704,26704,2.0,2.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,...,"<= $75,000, Above Poverty",Not Married,Own,Employed,lzgpxyit,"MSA, Not Principle City",0.0,0.0,fcxhlnwr,xtkaffoo
26705,26705,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,"<= $75,000, Above Poverty",Married,Rent,Employed,lrircsnp,Non-MSA,1.0,0.0,fcxhlnwr,haliazsg


In [105]:
#one-hot encode the categorical data
encoder=OneHotEncoder(drop='first',sparse_output=False)
data_cat= categoricals.copy()
encoder.fit(data_cat)
encoded_data=encoder.transform(data_cat)
#create a new dataframe
encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(data_cat.columns))
encoded_df

Unnamed: 0,age_group_35 - 44 Years,age_group_45 - 54 Years,age_group_55 - 64 Years,age_group_65+ Years,education_< 12 Years,education_College Graduate,education_Some College,race_Hispanic,race_Other or Multiple,race_White,...,employment_occupation_qxajmpny,employment_occupation_rcertsgn,employment_occupation_tfqavkke,employment_occupation_ukymxvdu,employment_occupation_uqqtjvyb,employment_occupation_vlluhbov,employment_occupation_xgwztkwe,employment_occupation_xqwwgdyp,employment_occupation_xtkaffoo,employment_occupation_xzmlyyjv
0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26702,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
26703,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
26704,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
26705,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [108]:
#append the encoded data to numerical to make one dataframe
new_training_feat=pd.concat([numericals,encoded_df],axis=1)
new_training_feat

Unnamed: 0,respondent_id,h1n1_concern,h1n1_knowledge,behavioral_antiviral_meds,behavioral_avoidance,behavioral_face_mask,behavioral_wash_hands,behavioral_large_gatherings,behavioral_outside_home,behavioral_touch_face,...,employment_occupation_qxajmpny,employment_occupation_rcertsgn,employment_occupation_tfqavkke,employment_occupation_ukymxvdu,employment_occupation_uqqtjvyb,employment_occupation_vlluhbov,employment_occupation_xgwztkwe,employment_occupation_xqwwgdyp,employment_occupation_xtkaffoo,employment_occupation_xzmlyyjv
0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,1,3.0,2.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,2,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,3,1.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,4,2.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26702,26702,2.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
26703,26703,1.0,2.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
26704,26704,2.0,2.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
26705,26705,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
