# PREDICTING H1N1 FLU VACCINATION STATUS USING MACHINE LEARNING


## 1. Business Understanding

### 1.1 Overview

This project uses data from the National 2009 H1N1 Flu Survey (NHFS) to predict whether respondents received the H1N1 vaccine. Understanding past vaccination trends is crucial for interpreting patterns in more recent pandemics, such as COVID-19. Key factors influencing vaccination status include doctor recommendations for the H1N1 vaccine, health insurance coverage, opinions on the vaccine's effectiveness, and perceptions of the risk posed by H1N1.

I employed six machine learning models for prediction:

1. Decision Tree Classifier

2. Logistic Regression

3. Random Forest

4. K-Nearest Neighbors Classifier

5. Gradient Boosting Classifier

6. XGBoost Classifier

Among these, the Gradient Boosting Classifier achieved the highest accuracy and precision.

### 1.2 Business Problem

Vaccination stands as one of the most effective public health interventions ever implemented, leading to the elimination and control of diseases that were once widespread globally. Despite substantial medical evidence and the strong consensus among healthcare professionals supporting vaccination, skepticism has increased in many countries in recent years. This troubling trend has resulted in decreased immunization coverage, with several outbreaks of infectious diseases linked to undervaccinated communities. The growing issue of vaccine hesitancy has become so pervasive that it is now the subject of numerous studies aiming to understand the sources and correlations of attitudes toward vaccination.

This study aims to predict the likelihood of individuals receiving the H1N1 flu vaccine. We believe the predictive models and analyses from this study will provide public health professionals and policymakers with a clear understanding of the factors associated with low vaccination rates. This, in turn, will enable them to systematically address the barriers preventing people from getting vaccinated.

The methodologies employed in these models can serve as a reference for future work and can be compared with other models for performance evaluation. Given the nature of our data and our objectives, we implemented multiple machine learning classification models, including Logistic Regression, Decision Tree, Random Forest, k-Nearest Neighbors (kNN), Gradient Boosting, and XGBoost.
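The kind of side-by-side comparison described above can be sketched as follows. This is a minimal illustration, not the project's actual code: it uses synthetic data from `make_classification` as a stand-in for the survey, the hyperparameters are arbitrary, and XGBoost is left out to avoid the extra dependency.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (NOT the survey data)
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=42
)

# Candidate classifiers; settings here are illustrative assumptions
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "kNN": KNeighborsClassifier(),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=42),
}

# 5-fold cross-validation, scoring each model on accuracy and precision
results = {}
for name, model in models.items():
    cv = cross_validate(model, X, y, cv=5, scoring=["accuracy", "precision"])
    results[name] = {
        "accuracy": cv["test_accuracy"].mean(),
        "precision": cv["test_precision"].mean(),
    }

for name, scores in results.items():
    print(f"{name}: accuracy={scores['accuracy']:.3f}, precision={scores['precision']:.3f}")
```

Scoring every model with the same folds and metrics keeps the comparison like-for-like; the numbers printed for synthetic data say nothing about how the models rank on the survey itself.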

To reliably distinguish those who received the H1N1 flu shot from those who did not, we require models with high accuracy and high precision. High precision means that few of the people predicted as vaccinated are false positives (identified as vaccinated when they were not). Model performance will be further evaluated using the ROC curve, accuracy score, precision score, and confusion matrix.

**Target Audience**: Public health officials of the American Public Health Association (APHA)

**Objectives**:
1. Accurately predict who received the H1N1 vaccine and who did not. (Deliverable: Model)
2. Analyse the factors that influence whether people get the H1N1 vaccine. (Deliverable: Analysis)

**Context**:
- False negative: Saying people did not get the vaccine when they actually did.
  - Outcome: Not a big problem

- False positive: Saying people got the vaccine when they actually did not.
  - Outcome: Big problem


**Evaluation**:
We will focus on accuracy, F1, and precision scores across our model iterations in order to minimize false positives, because in our business context false positives are a much more costly mistake than false negatives.

- **Accuracy**
- **Precision**
- Recall
- **F1-Score**
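To make the four metrics concrete, here is a small sketch that derives them from confusion-matrix counts in plain Python. The labels are hypothetical toy values (1 = vaccinated, 0 = not vaccinated), and `classification_metrics` is an illustrative helper, not a function from the project.

```python
def classification_metrics(y_true, y_pred):
    """Return confusion-matrix counts and derived metrics for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly predicted vaccinated
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # predicted vaccinated, actually not
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # predicted not vaccinated, actually was
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correctly predicted not vaccinated

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1,
    }


# Hypothetical labels for ten respondents
y_true = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Precision is the metric that penalizes false positives directly: each respondent wrongly predicted as vaccinated increases the denominator `tp + fp` without increasing `tp`.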


## 2. Data Understanding

### 2.1 Import Libraries

In [2]:
# Common libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Libraries for model training

from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
import category_encoders as ce
from sklearn.model_selection import train_test_split, GridSearchCV, cross_validate, cross_val_score

# Classification algorithms

from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier


import xgboost     # extreme gradient boosting

# Evaluation metrics and diagnostic plots

from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
from sklearn.metrics import roc_auc_score, RocCurveDisplay, ConfusionMatrixDisplay
from sklearn.metrics import confusion_matrix

# Suppress warnings to keep the notebook output readable

import warnings
warnings.filterwarnings('ignore')

# Render plots inline in the notebook

%matplotlib inline

# Display up to 100 columns when showing a DataFrame
pd.options.display.max_columns = 100

### 2.2 Load Dataset

This data comes from the National 2009 H1N1 Flu Survey (NHFS), which asked respondents whether they received the seasonal flu vaccine and/or the H1N1 flu vaccine, along with questions about their demographic, behavioral, and health characteristics. The survey has roughly 26,000 respondents. In this project, I chose H1N1 vaccination status (`h1n1_vaccine`) as the target variable. I used all features in the survey and filled missing values using the Iterative Imputer.
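As a rough illustration of how iterative imputation works, the sketch below applies scikit-learn's `IterativeImputer` to a hypothetical toy matrix rather than the survey data. The settings (`max_iter=10`, `random_state=0`) are assumptions for the example, not the values used in this project.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the experimental estimator)
from sklearn.impute import IterativeImputer

# Hypothetical toy matrix: the second column is about twice the first
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, np.nan],
    [4.0, 8.0],
    [np.nan, 10.0],
])

# Each feature with missing values is modeled from the other features,
# and the estimates are refined over several rounds
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)

print(X_filled)
```

Observed entries are left unchanged and only the missing cells are estimated. Because the estimates come from a regression model, imputed values for binary or ordinal survey items may be non-integer and might need rounding, depending on the downstream model.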

In [5]:
# Reading in the data
Data1 = pd.read_csv('DATA/H1N1_Flu_Vaccines.csv')
Data1.head()

Unnamed: 0,respondent_id,h1n1_concern,h1n1_knowledge,behavioral_antiviral_meds,behavioral_avoidance,behavioral_face_mask,behavioral_wash_hands,behavioral_large_gatherings,behavioral_outside_home,behavioral_touch_face,doctor_recc_h1n1,doctor_recc_seasonal,chronic_med_condition,child_under_6_months,health_worker,health_insurance,opinion_h1n1_vacc_effective,opinion_h1n1_risk,opinion_h1n1_sick_from_vacc,opinion_seas_vacc_effective,opinion_seas_risk,opinion_seas_sick_from_vacc,age_group,education,race,sex,income_poverty,marital_status,rent_or_own,employment_status,hhs_geo_region,census_msa,household_adults,household_children,employment_industry,employment_occupation,h1n1_vaccine,seasonal_vaccine
0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,3.0,1.0,2.0,2.0,1.0,2.0,55 - 64 Years,< 12 Years,White,Female,Below Poverty,Not Married,Own,Not in Labor Force,oxchjgsf,Non-MSA,0.0,0.0,,,0,0
1,1,3.0,2.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,5.0,4.0,4.0,4.0,2.0,4.0,35 - 44 Years,12 Years,White,Male,Below Poverty,Not Married,Rent,Employed,bhuqouqj,"MSA, Not Principle City",0.0,0.0,pxcmvdjn,xgwztkwe,0,1
2,2,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,,,1.0,0.0,0.0,,3.0,1.0,1.0,4.0,1.0,2.0,18 - 34 Years,College Graduate,White,Male,"<= $75,000, Above Poverty",Not Married,Own,Employed,qufhixun,"MSA, Not Principle City",2.0,0.0,rucpziij,xtkaffoo,0,0
3,3,1.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,,3.0,3.0,5.0,5.0,4.0,1.0,65+ Years,12 Years,White,Female,Below Poverty,Not Married,Rent,Not in Labor Force,lrircsnp,"MSA, Principle City",0.0,0.0,,,0,1
4,4,2.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,,3.0,3.0,2.0,3.0,1.0,4.0,45 - 54 Years,Some College,White,Female,"<= $75,000, Above Poverty",Married,Own,Employed,qufhixun,"MSA, Not Principle City",1.0,0.0,wxleyezf,emcorrxb,0,0
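A quick way to check how much data is missing before imputing is sketched below. It rebuilds four columns from the five rows displayed above so that it runs on its own; with the full dataset, the same two expressions would be applied to the loaded DataFrame instead of `sample`, which is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Four columns copied from the five rows shown in the output above
sample = pd.DataFrame({
    "h1n1_concern": [1.0, 3.0, 1.0, 1.0, 2.0],
    "doctor_recc_h1n1": [0.0, 0.0, np.nan, 0.0, 0.0],
    "health_insurance": [1.0, 1.0, np.nan, np.nan, np.nan],
    "h1n1_vaccine": [0, 0, 0, 0, 0],
})

# Share of missing values per column, largest first
missing_share = sample.isna().mean().sort_values(ascending=False)
print(missing_share)

# Proportion of each class in the target column
class_balance = sample["h1n1_vaccine"].value_counts(normalize=True)
print(class_balance)
```

Five rows are far too few to say anything about the survey as a whole; the point is only the pattern of calls.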
