# Predicting H1N1 Flu Vaccine Uptake

# 1.0 Business Understanding

## 1.1 Business Overview

Vaccination is one of the most effective public health measures for preventing the spread of infectious diseases. In recent years, vaccines have also been developed for other pandemics, such as COVID-19. Vaccination protects not only the individuals who are immunised but also the wider community, by limiting the spread of the virus.

For this study, we use data from a survey conducted in 2009 during the H1N1 influenza pandemic, also known as "swine flu". The pandemic caused an estimated 151,000 to 575,000 deaths worldwide in its first year. To reduce this toll, an H1N1 vaccine was introduced in late 2009, alongside the seasonal flu vaccine that was already available.

The survey was used to understand the uptake of both vaccines. Respondents shared information on their health conditions, demographics, risk perception, and behaviours. By analyzing this dataset, we can better understand which factors influenced vaccine uptake. These insights can help healthcare professionals design more effective, targeted campaigns to improve vaccine acceptance and coverage in future pandemics.

## 1.2 Problem Statement

The study aims to predict, from survey data, whether an individual received the H1N1 vaccine, and to identify the key factors influencing uptake so that public health interventions can be targeted more effectively.


## 1.3 Business Objectives

### Main Objective

Build a predictive model that estimates the probability of individuals receiving the H1N1 vaccine based on features from the survey.

### Specific Objectives

- Identify which demographic, behavioral, and opinion factors are most strongly associated with vaccine uptake.
- Provide actionable insights to public health decision-makers for designing targeted awareness campaigns.

## 1.4 Success Criteria 
- Clearly show what factors make people more or less likely to get vaccinated, so healthcare professionals can improve vaccination plans in future pandemics.
- Identify which groups of people are less likely to get vaccinated, so campaigns can be directed where they are needed most.


# 2.0 Data Understanding

In [1]:
# importing the libraries 
import numpy as np 
import pandas as pd

In [2]:
# loading the training set labels dataset
train_labels = pd.read_csv("C:/Users/PC/Desktop/School work/Projects/Phase 3/Phase-3-Project/Data/training_set_labels.csv") 
train_labels.head(2)

Unnamed: 0,respondent_id,h1n1_vaccine,seasonal_vaccine
0,0,0,0
1,1,0,1


In [3]:
# loading the training set features dataset
train_feat = pd.read_csv("C:/Users/PC/Desktop/School work/Projects/Phase 3/Phase-3-Project/Data/training_set_features.csv") 
train_feat.head(2)

Unnamed: 0,respondent_id,h1n1_concern,h1n1_knowledge,behavioral_antiviral_meds,behavioral_avoidance,behavioral_face_mask,behavioral_wash_hands,behavioral_large_gatherings,behavioral_outside_home,behavioral_touch_face,...,income_poverty,marital_status,rent_or_own,employment_status,hhs_geo_region,census_msa,household_adults,household_children,employment_industry,employment_occupation
0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,...,Below Poverty,Not Married,Own,Not in Labor Force,oxchjgsf,Non-MSA,0.0,0.0,,
1,1,3.0,2.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,...,Below Poverty,Not Married,Rent,Employed,bhuqouqj,"MSA, Not Principle City",0.0,0.0,pxcmvdjn,xgwztkwe


In [4]:
# loading the test set features dataset
test_feat = pd.read_csv("C:/Users/PC/Desktop/School work/Projects/Phase 3/Phase-3-Project/Data/test_set_features.csv") 
test_feat.head(2)

Unnamed: 0,respondent_id,h1n1_concern,h1n1_knowledge,behavioral_antiviral_meds,behavioral_avoidance,behavioral_face_mask,behavioral_wash_hands,behavioral_large_gatherings,behavioral_outside_home,behavioral_touch_face,...,income_poverty,marital_status,rent_or_own,employment_status,hhs_geo_region,census_msa,household_adults,household_children,employment_industry,employment_occupation
0,26707,2.0,2.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,...,"> $75,000",Not Married,Rent,Employed,mlyzmhmf,"MSA, Not Principle City",1.0,0.0,atmlpfrs,hfxkjkmi
1,26708,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,Below Poverty,Not Married,Rent,Employed,bhuqouqj,Non-MSA,3.0,0.0,atmlpfrs,xqwwgdyp


In [5]:
train_feat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26707 entries, 0 to 26706
Data columns (total 36 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   respondent_id                26707 non-null  int64  
 1   h1n1_concern                 26615 non-null  float64
 2   h1n1_knowledge               26591 non-null  float64
 3   behavioral_antiviral_meds    26636 non-null  float64
 4   behavioral_avoidance         26499 non-null  float64
 5   behavioral_face_mask         26688 non-null  float64
 6   behavioral_wash_hands        26665 non-null  float64
 7   behavioral_large_gatherings  26620 non-null  float64
 8   behavioral_outside_home      26625 non-null  float64
 9   behavioral_touch_face        26579 non-null  float64
 10  doctor_recc_h1n1             24547 non-null  float64
 11  doctor_recc_seasonal         24547 non-null  float64
 12  chronic_med_condition        25736 non-null  float64
 13  child_under_6_mo

In [6]:
test_feat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26708 entries, 0 to 26707
Data columns (total 36 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   respondent_id                26708 non-null  int64  
 1   h1n1_concern                 26623 non-null  float64
 2   h1n1_knowledge               26586 non-null  float64
 3   behavioral_antiviral_meds    26629 non-null  float64
 4   behavioral_avoidance         26495 non-null  float64
 5   behavioral_face_mask         26689 non-null  float64
 6   behavioral_wash_hands        26668 non-null  float64
 7   behavioral_large_gatherings  26636 non-null  float64
 8   behavioral_outside_home      26626 non-null  float64
 9   behavioral_touch_face        26580 non-null  float64
 10  doctor_recc_h1n1             24548 non-null  float64
 11  doctor_recc_seasonal         24548 non-null  float64
 12  chronic_med_condition        25776 non-null  float64
 13  child_under_6_mo

## 2.1 Data Overview

The datasets are provided by the United States National Center for Health Statistics.

The three datasets are:
- Training Features (training_set_features.csv) – contains 26,707 respondents and 35 predictive features describing respondents’ demographics, behaviors, health status, and opinions, plus the respondent_id key.
- Training Labels (training_set_labels.csv) – contains the two target variables (h1n1_vaccine and seasonal_vaccine) for the same respondents. This study focuses on h1n1_vaccine.
- Test Features (test_set_features.csv) – contains the same feature columns for 26,708 respondents, without labels.

## 2.2 Variables Overview

#### Target Variable:
- h1n1_vaccine: (binary: 0 = not vaccinated, 1 = vaccinated)

#### Feature Types:
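- Binary variables (0, 1): behavioral and medical columns
- Ordinal and count variables: columns that take values above 1 in the previews, e.g. h1n1_concern, h1n1_knowledge, and household_adults
- Categorical columns: age_group, education, marital_status, etc.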
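
The grouping above can be checked programmatically. The following is a minimal sketch, assuming a small toy frame in place of `train_feat` (the values are invented for illustration) and a hypothetical helper `split_feature_types` that is not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for train_feat; the values are invented for illustration only.
toy = pd.DataFrame({
    "respondent_id": [0, 1, 2, 3],
    "h1n1_concern": [1.0, 3.0, 2.0, None],
    "behavioral_face_mask": [0.0, 0.0, 1.0, 0.0],
    "marital_status": ["Not Married", "Married", None, "Not Married"],
    "census_msa": ["Non-MSA", "MSA, Not Principle City", "Non-MSA", "Non-MSA"],
})


def split_feature_types(df, id_col="respondent_id"):
    """Group feature columns into binary, other numeric, and categorical lists."""
    features = df.drop(columns=[id_col])
    numeric = [c for c in features.columns
               if pd.api.types.is_numeric_dtype(features[c])]
    categorical = [c for c in features.columns if c not in numeric]
    # A numeric column counts as binary if its non-missing values are all 0 or 1.
    binary = [c for c in numeric
              if set(features[c].dropna().unique()) <= {0, 1}]
    other_numeric = [c for c in numeric if c not in binary]
    return {"binary": binary,
            "other_numeric": other_numeric,
            "categorical": categorical}


feature_types = split_feature_types(toy)
print(feature_types)
```

Calling `split_feature_types(train_feat)` on the real data would give the full lists, which can then drive the choice of encoding for each group.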

## 2.3 Initial Observations

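- Some variables (e.g., employment_industry, employment_occupation) contain missing values (NaNs). The `info()` output also shows gaps in numeric columns: doctor_recc_h1n1, for example, has 24,547 non-null values out of 26,707 in the training set.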
- Several variables are categorical with string values, requiring encoding before modeling.
- Many features represent attitudes or opinions, which may strongly influence vaccination decisions.
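
To quantify the first observation, missing values can be summarized per column. This is a minimal sketch on an invented toy frame; `missing_summary` is a hypothetical helper name, not something defined in the original notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_feat; the values are invented for illustration only.
toy = pd.DataFrame({
    "h1n1_concern": [1.0, 3.0, np.nan, 2.0],
    "employment_industry": [np.nan, "pxcmvdjn", np.nan, np.nan],
    "census_msa": ["Non-MSA", "Non-MSA", "Non-MSA", "Non-MSA"],
})


def missing_summary(df):
    """Return per-column missing counts and percentages, worst columns first."""
    summary = pd.DataFrame({
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(1),
        "dtype": df.dtypes.astype(str),
    })
    return summary.sort_values("n_missing", ascending=False)


summary = missing_summary(toy)
print(summary)
```

Running `missing_summary(train_feat)` would show which columns are candidates for imputation and which are sparse enough to consider dropping.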

# 3.0 Data Preparation

This section covers the following steps:
- Merge the training features with the labels (the target variables) on the respondent_id column
- Handle missing values
- Drop unnecessary columns
- One-hot encode the categorical columns
- Check for class imbalance and handle it
- Apply the same preparation to the test set
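
The steps above could be sketched as follows. This is one possible approach on invented toy data, not the notebook's final implementation: the helper `prepare`, median imputation for numeric columns, and a `"Missing"` category for string columns are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the real files; the values are invented for illustration.
feat = pd.DataFrame({
    "respondent_id": [0, 1, 2, 3],
    "h1n1_concern": [1.0, 3.0, np.nan, 2.0],
    "rent_or_own": ["Own", "Rent", None, "Rent"],
})
labels = pd.DataFrame({
    "respondent_id": [0, 1, 2, 3],
    "h1n1_vaccine": [0, 0, 1, 0],
    "seasonal_vaccine": [0, 1, 1, 0],
})
test_toy = pd.DataFrame({
    "respondent_id": [4, 5],
    "h1n1_concern": [2.0, np.nan],
    "rent_or_own": ["Rent", "Rent"],
})


def prepare(features, labels=None, target="h1n1_vaccine"):
    """Merge labels (if given), impute missing values, and one-hot encode."""
    df = features.copy()
    y = None
    if labels is not None:
        # validate= raises if respondent_id is duplicated on either side.
        df = df.merge(labels[["respondent_id", target]],
                      on="respondent_id", how="inner", validate="one_to_one")
        y = df.pop(target)
    X = df.drop(columns=["respondent_id"])
    for col in X.columns:
        if pd.api.types.is_numeric_dtype(X[col]):
            X[col] = X[col].fillna(X[col].median())  # assumed imputation choice
        else:
            X[col] = X[col].fillna("Missing")  # keep missingness as a category
    cat_cols = [c for c in X.columns
                if not pd.api.types.is_numeric_dtype(X[c])]
    X = pd.get_dummies(X, columns=cat_cols, dtype=int)
    return X, y


X_train, y_train = prepare(feat, labels)
X_test, _ = prepare(test_toy)
# Give the test set exactly the training columns; unseen dummies become 0.
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)

# Class balance of the target, as proportions.
print(y_train.value_counts(normalize=True))
```

Note that this sketch imputes each set with its own medians for brevity. In practice the medians learned on the training set would normally be reused for the test set, so that both are transformed identically.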

Methodology
- EDA
- Baseline: Logistic Regression for both targets, which gives probabilities and is simple to interpret.
- Tree-based model: Decision Tree
- Advanced model: Random Forest, tuned to maximize ROC AUC
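
As a rough illustration of how the three models in this plan could be compared on ROC AUC, the sketch below uses scikit-learn with synthetic data from `make_classification` in place of the prepared survey features. The hyperparameters (`max_depth=5`, `n_estimators=100`, the 80/20 class weights) are arbitrary assumptions, and only a single binary target is modeled.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the prepared features and one target.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           weights=[0.8, 0.2], random_state=42)

# Stratify so both splits keep the same class proportions.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # ROC AUC is computed from the predicted probability of the positive class.
    proba = model.predict_proba(X_val)[:, 1]
    scores[name] = roc_auc_score(y_val, proba)
    print(f"{name}: ROC AUC = {scores[name]:.3f}")
```

With the real data, `X` and `y` would be replaced by the prepared training features and the h1n1_vaccine column, and the same loop could be repeated for seasonal_vaccine.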

