    Vaccine Usage Prediction
    
    Abstract:
        Subjects who receive the same vaccine often show different levels of immune response, and some experience adverse side effects. Systems vaccinology combines data with machine learning techniques to derive highly predictive signatures of vaccine immunogenicity and reactogenicity. Several machine learning methods are now available to researchers with no background in bioinformatics.
    
    
    Problem Statement:
        Predict how likely people are to take the H1N1 flu vaccine, using Logistic Regression.
    
    
    Dataset Information:
    
    Column	Description
    unique_id	Unique identifier for each respondent
    h1n1_worry	Worry about the H1N1 flu - (0,1,2,3) - 0=Not worried at all, 1=Not very worried, 2=Somewhat worried, 3=Very worried
    h1n1_awareness	Amount of knowledge or understanding the respondent has about the H1N1 flu - (0,1,2) - 0=No knowledge, 1=Little knowledge, 2=Good knowledge
    antiviral_medication	Has the respondent taken antiviral medication - (0,1)
    contact_avoidance	Has the respondent avoided close contact with people who have flu-like symptoms - (0,1)
    bought_face_mask	Has the respondent bought a face mask - (0,1)
    wash_hands_frequently	Washes hands frequently or uses hand sanitizer - (0,1)
    avoid_large_gatherings	Has the respondent reduced time spent at large gatherings - (0,1)
    reduced_outside_home_cont	Has the respondent reduced contact with people outside their own household - (0,1)
    avoid_touch_face	Avoids touching nose, eyes, mouth - (0,1)
    dr_recc_h1n1_vacc	Doctor has recommended the H1N1 vaccine - (0,1)
    dr_recc_seasonal_vacc	Doctor has recommended the seasonal flu vaccine - (0,1)
    chronic_medic_condition	Has any chronic medical condition - (0,1)
    cont_child_undr_6_mnths	Has regular contact with a child under the age of 6 months - (0,1)
    is_health_worker	Is the respondent a health worker - (0,1)
    has_health_insur	Does the respondent have health insurance - (0,1)
    is_h1n1_vacc_effective	Does the respondent think the H1N1 vaccine is effective - (1,2,3,4,5) - (1=Thinks not effective at all, 2=Thinks it is not very effective, 3=Doesn't know if it is effective or not, 4=Thinks it is somewhat effective, 5=Thinks it is highly effective)
    is_h1n1_risky	What the respondent thinks about the risk of getting ill with H1N1 in the absence of the vaccine - (1,2,3,4,5) - (1=Thinks it is very low risk, 2=Thinks it is somewhat low risk, 3=Doesn't know if it is risky or not, 4=Thinks it is somewhat high risk, 5=Thinks it is very high risk)
    sick_from_h1n1_vacc	Does the respondent worry about getting sick from taking the H1N1 vaccine - (1,2,3,4,5) - (1=Respondent not worried at all, 2=Respondent is not very worried, 3=Doesn't know, 4=Respondent is somewhat worried, 5=Respondent is very worried)
    is_seas_vacc_effective	Does the respondent think the seasonal vaccine is effective - (1,2,3,4,5) - (1=Thinks not effective at all, 2=Thinks it is not very effective, 3=Doesn't know if it is effective or not, 4=Thinks it is somewhat effective, 5=Thinks it is highly effective)
    is_seas_risky	What the respondent thinks about the risk of getting ill with seasonal flu in the absence of the vaccine - (1,2,3,4,5) - (1=Thinks it is very low risk, 2=Thinks it is somewhat low risk, 3=Doesn't know if it is risky or not, 4=Thinks it is somewhat high risk, 5=Thinks it is very high risk)
    sick_from_seas_vacc	Does the respondent worry about getting sick from taking the seasonal flu vaccine - (1,2,3,4,5) - (1=Respondent not worried at all, 2=Respondent is not very worried, 3=Doesn't know, 4=Respondent is somewhat worried, 5=Respondent is very worried)
    age_bracket	Age bracket of the respondent - (18 - 34 Years, 35 - 44 Years, 45 - 54 Years, 55 - 64 Years, 64+ Years)
    qualification	Qualification/education level of the respondent as per their response - (<12 Years, 12 Years, College Graduate, Some College)
    race	Respondent's race - (White, Black, Other or Multiple, Hispanic)
    sex	Respondent's sex - (Female, Male)
    income_level	Annual income of the respondent relative to the 2008 Census poverty thresholds - (<= $75,000 Above Poverty; > $75,000; Below Poverty)
    marital_status	Respondent's marital status - (Not Married, Married)
    housing_status	Respondent's housing status - (Own, Rent)
    employment	Respondent's employment status - (Not in Labor Force, Employed, Unemployed)
    census_msa	Respondent's residence relative to an MSA (metropolitan statistical area) - (Non-MSA; MSA, Not Principle City; MSA, Principle City)
    no_of_adults	Number of adults in the respondent's house - (0,1,2,3)
    no_of_children	Number of children in the respondent's house - (0,1,2,3)
    h1n1_vaccine	(Dependent variable) Did the respondent receive the H1N1 vaccine - (1,0) - (Yes, No)
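
    The dictionary above mixes numeric codes with text-valued columns such as sex and housing_status. The baseline cell later in this notebook simply drops the text columns. As a hedged alternative, they could be one-hot encoded so that a model can use them. The sketch below runs on a small invented frame (`toy` and its values are assumptions for illustration, not rows from the dataset):

```python
import pandas as pd

# Invented stand-in rows; only the column names come from the data dictionary.
toy = pd.DataFrame({
    "h1n1_worry": [1.0, 3.0, 2.0],
    "sex": ["Female", "Male", "Female"],
    "housing_status": ["Own", "Rent", "Own"],
})

# One indicator column per category; drop_first avoids a redundant
# (perfectly collinear) column for each encoded variable.
encoded = pd.get_dummies(
    toy, columns=["sex", "housing_status"], drop_first=True, dtype=int
)
print(encoded.columns.tolist())
```

    Dropping the first level matters for logistic regression in particular, because keeping every level alongside an intercept makes the coefficients non-identifiable.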
    
    
    
    Scope:
    ● Exploratory data analysis
    ● Data pre-processing
    ● Training a logistic regression model with MLE (maximum likelihood estimation) for prediction
    ● Tuning the model to improve performance
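
    For the tuning step, one common option is a cross-validated grid search over the regularisation strength. The sketch below uses synthetic data and an arbitrary grid of `C` values, both of which are assumptions for illustration. Note that scikit-learn's LogisticRegression applies an L2 penalty by default, so what is tuned here is a penalised likelihood rather than the plain MLE fit.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary-classification data (not the survey data).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

# Smaller C means stronger regularisation; the grid values are arbitrary.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, cv=5, scoring="accuracy"
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```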
    
    
    Learning Outcome:
        Students will gain a better understanding of how the variables relate to one another, how exploratory data analysis (EDA) yields insight into the data, and how to train a Logistic Regression model using maximum likelihood estimation (MLE).
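
    Maximum likelihood estimation picks the coefficients that make the observed 0/1 outcomes most probable. As a minimal sketch on synthetic data (the data, the helper name `fit_logistic_mle`, and its defaults are illustrative assumptions, not part of this notebook's code), Newton-Raphson updates can maximise the Bernoulli log-likelihood directly:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(beta, X, y):
    """Bernoulli log-likelihood: sum of y*z - log(1 + exp(z)), with z = X @ beta."""
    z = X @ beta
    return np.sum(y * z - np.logaddexp(0.0, z))

def fit_logistic_mle(X, y, n_iter=25, tol=1e-8):
    """Maximise the log-likelihood with Newton-Raphson steps."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        gradient = X.T @ (y - p)                       # score vector
        hessian = -(X * (p * (1 - p))[:, None]).T @ X  # negative definite
        step = np.linalg.solve(hessian, gradient)
        beta = beta - step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Synthetic data: an intercept column plus one feature.
rng = np.random.default_rng(0)
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
true_beta = np.array([-1.0, 2.0])
y = rng.binomial(1, sigmoid(X @ true_beta))

beta_hat = fit_logistic_mle(X, y)
print(beta_hat)  # should land close to true_beta
```

    Libraries such as statsmodels and scikit-learn use more robust optimisers, but the quantity being maximised is the same likelihood (plus a penalty term when regularisation is switched on).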

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [3]:
df = pd.read_csv('Dataset/h1n1_vaccine_prediction.csv')

In [7]:
print(df)

       unique_id  h1n1_worry  h1n1_awareness  antiviral_medication  \
0              0         1.0             0.0                   0.0   
1              1         3.0             2.0                   0.0   
2              2         1.0             1.0                   0.0   
3              3         1.0             1.0                   0.0   
4              4         2.0             1.0                   0.0   
...          ...         ...             ...                   ...   
26702      26702         2.0             0.0                   0.0   
26703      26703         1.0             2.0                   0.0   
26704      26704         2.0             2.0                   0.0   
26705      26705         1.0             1.0                   0.0   
26706      26706         0.0             0.0                   0.0   

       contact_avoidance  bought_face_mask  wash_hands_frequently  \
0                    0.0               0.0                    0.0   
1                    

In [5]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

import statsmodels.api as sm
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import roc_curve

In [6]:
df.columns

Index(['unique_id', 'h1n1_worry', 'h1n1_awareness', 'antiviral_medication',
       'contact_avoidance', 'bought_face_mask', 'wash_hands_frequently',
       'avoid_large_gatherings', 'reduced_outside_home_cont',
       'avoid_touch_face', 'dr_recc_h1n1_vacc', 'dr_recc_seasonal_vacc',
       'chronic_medic_condition', 'cont_child_undr_6_mnths',
       'is_health_worker', 'has_health_insur', 'is_h1n1_vacc_effective',
       'is_h1n1_risky', 'sick_from_h1n1_vacc', 'is_seas_vacc_effective',
       'is_seas_risky', 'sick_from_seas_vacc', 'age_bracket', 'qualification',
       'race', 'sex', 'income_level', 'marital_status', 'housing_status',
       'employment', 'census_msa', 'no_of_adults', 'no_of_children',
       'h1n1_vaccine'],
      dtype='object')

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26707 entries, 0 to 26706
Data columns (total 34 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   unique_id                  26707 non-null  int64  
 1   h1n1_worry                 26615 non-null  float64
 2   h1n1_awareness             26591 non-null  float64
 3   antiviral_medication       26636 non-null  float64
 4   contact_avoidance          26499 non-null  float64
 5   bought_face_mask           26688 non-null  float64
 6   wash_hands_frequently      26665 non-null  float64
 7   avoid_large_gatherings     26620 non-null  float64
 8   reduced_outside_home_cont  26625 non-null  float64
 9   avoid_touch_face           26579 non-null  float64
 10  dr_recc_h1n1_vacc          24547 non-null  float64
 11  dr_recc_seasonal_vacc      24547 non-null  float64
 12  chronic_medic_condition    25736 non-null  float64
 13  cont_child_undr_6_mnths    25887 non-null  flo

In [10]:
df.isna().sum()

unique_id                        0
h1n1_worry                      92
h1n1_awareness                 116
antiviral_medication            71
contact_avoidance              208
bought_face_mask                19
wash_hands_frequently           42
avoid_large_gatherings          87
reduced_outside_home_cont       82
avoid_touch_face               128
dr_recc_h1n1_vacc             2160
dr_recc_seasonal_vacc         2160
chronic_medic_condition        971
cont_child_undr_6_mnths        820
is_health_worker               804
has_health_insur             12274
is_h1n1_vacc_effective         391
is_h1n1_risky                  388
sick_from_h1n1_vacc            395
is_seas_vacc_effective         462
is_seas_risky                  514
sick_from_seas_vacc            537
age_bracket                      0
qualification                 1407
race                             0
sex                              0
income_level                  4423
marital_status                1408
housing_status      
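
Raw counts are easier to compare as a share of rows. From the output above, has_health_insur is missing in 12274 of 26707 rows (about 46%) and income_level in 4423 (about 17%). A small sketch of that calculation on an invented frame follows; `toy`, its values, and the 0.4 threshold are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Invented stand-in for df; only the column names come from the dataset.
toy = pd.DataFrame({
    "h1n1_worry": [1.0, 3.0, np.nan, 2.0],
    "has_health_insur": [1.0, np.nan, np.nan, 0.0],
    "sex": ["Female", "Male", "Male", "Female"],
})

# isna() gives booleans, so the column mean is the fraction of missing rows.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Columns missing in more than a chosen share of rows (threshold is arbitrary).
threshold = 0.4
sparse_columns = missing_share[missing_share > threshold].index.tolist()
print(sparse_columns)
```

Whether a column flagged this way should be dropped or imputed depends on how informative it is, so the threshold is a starting point for inspection rather than a rule.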

In [9]:
# Baseline: numeric survey columns only. The text-valued demographic columns are
# dropped, and so is unique_id, which is a row identifier rather than a feature.
X1 = df.drop(['h1n1_vaccine', 'unique_id', 'age_bracket', 'qualification',
              'race', 'sex', 'income_level', 'marital_status', 'housing_status',
              'employment', 'census_msa'], axis=1)
y1 = df['h1n1_vaccine']

X_train1, X_test1, y_train1, y_test1 = train_test_split(X1, y1, test_size=0.3, random_state=36)

# X1 still contains NaN values at this point, so fit() raises the ValueError shown below.
logistic_model_1 = LogisticRegression().fit(X_train1, y_train1)

# Check how well this model fits the data
predicted_y_train1 = logistic_model_1.predict(X_train1)
train_accuracy_1 = accuracy_score(y_train1, predicted_y_train1)

predicted_y_test1 = logistic_model_1.predict(X_test1)
test_accuracy_1 = accuracy_score(y_test1, predicted_y_test1)

# Rough diagnostic: the 0.7 and 0.10 cut-offs are rules of thumb, not formal tests
accuracy_gap = train_accuracy_1 - test_accuracy_1
if train_accuracy_1 < 0.7 or test_accuracy_1 < 0.7:
    print('Model is underfitting')
elif accuracy_gap >= 0.10:
    print('Model is overfitting')
elif accuracy_gap > 0:
    print('Model is slightly overfitting')
else:
    print('Model is a good fit')

print(f"Train_accuracy : {train_accuracy_1} \nTest_accuracy : {test_accuracy_1}")

ValueError: Input X contains NaN.
LogisticRegression does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
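
The error message itself points to one fix: put an imputer in front of the model inside a pipeline. The sketch below shows the pattern on synthetic data with NaNs injected at random; the data and the choice of the `most_frequent` strategy are assumptions for illustration, not the notebook's final preprocessing.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic survey-like features coded 0-3 (not the real dataset).
rng = np.random.default_rng(36)
X = rng.integers(0, 4, size=(500, 3)).astype(float)
y = (X[:, 0] + rng.normal(scale=1.0, size=500) > 1.5).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan  # blank out roughly 10% of the entries

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=36
)

# The imputer learns its fill values from the training split only and then
# applies the same values to the test split, which avoids leakage.
model = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {test_accuracy:.3f}")
```

Filling with the most frequent value suits the binary and ordinal survey codes used here; a median or a model-based imputer are alternatives worth comparing.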