# Flu Shot Learning: Predict H1N1 and Seasonal Flu Vaccines

# 1.0 Business Understanding

## 1.1 Business Overview

Vaccination is one of the most effective public health measures for preventing the spread of infectious diseases. In recent years, vaccines have also been developed in response to other pandemics, such as COVID-19. Vaccination protects not only the individuals who are immunised but also the wider community, by limiting the spread of the virus.

For this study, we use data from a survey conducted in 2009 during the H1N1 influenza pandemic, also known as "swine flu". The pandemic caused an estimated 151,000 to 575,000 deaths worldwide in its first year. To reduce this toll, an H1N1 vaccine was introduced in late 2009, alongside the seasonal flu vaccine that was already available.

The survey was used to understand the uptake of both vaccines. Respondents shared information on their health conditions, demographics, risk perception, and behaviours. By analyzing this dataset, we can better understand which factors influenced vaccine uptake. These insights can help healthcare professionals design more effective, targeted campaigns to improve vaccine acceptance and coverage in future pandemics.

## 1.2 Problem Statement

The study aims to predict whether individuals received the H1N1 and/or seasonal flu vaccines using survey data, and to identify the key factors influencing uptake in order to inform more effective public health interventions.


## 1.3 Business Objectives

### Main Objective

Build a predictive model that estimates the probability of individuals receiving the H1N1 and seasonal flu vaccines based on features from the survey.

### Specific Objectives

- Identify which demographic, behavioral, and opinion factors are most strongly associated with vaccine uptake.
- Provide actionable insights to public health decision-makers for designing targeted awareness campaigns.
- Evaluate differences between H1N1 and seasonal flu vaccine uptake patterns.

## 1.4 Success Criteria

- Clearly show what factors make people more or less likely to get vaccinated, so healthcare professionals can improve vaccination plans in future pandemics.
- Identify which groups of people are less likely to get vaccinated, so campaigns can be directed where they are needed most.


# 2.0 Data Understanding

In [1]:
# importing the libraries 
import numpy as np 
import pandas as pd

In [2]:
# loading the training set labels dataset
train_labels = pd.read_csv("C:/Users/PC/Desktop/School work/Projects/Phase 3/Phase-3-Project/Data/training_set_labels.csv") 
train_labels.head()

Unnamed: 0,respondent_id,h1n1_vaccine,seasonal_vaccine
0,0,0,0
1,1,0,1
2,2,0,0
3,3,0,1
4,4,0,0


In [3]:
# loading the training set features dataset
train_feat = pd.read_csv("C:/Users/PC/Desktop/School work/Projects/Phase 3/Phase-3-Project/Data/training_set_features.csv") 
train_feat.head()

Unnamed: 0,respondent_id,h1n1_concern,h1n1_knowledge,behavioral_antiviral_meds,behavioral_avoidance,behavioral_face_mask,behavioral_wash_hands,behavioral_large_gatherings,behavioral_outside_home,behavioral_touch_face,...,income_poverty,marital_status,rent_or_own,employment_status,hhs_geo_region,census_msa,household_adults,household_children,employment_industry,employment_occupation
0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,...,Below Poverty,Not Married,Own,Not in Labor Force,oxchjgsf,Non-MSA,0.0,0.0,,
1,1,3.0,2.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,...,Below Poverty,Not Married,Rent,Employed,bhuqouqj,"MSA, Not Principle City",0.0,0.0,pxcmvdjn,xgwztkwe
2,2,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,"<= $75,000, Above Poverty",Not Married,Own,Employed,qufhixun,"MSA, Not Principle City",2.0,0.0,rucpziij,xtkaffoo
3,3,1.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,...,Below Poverty,Not Married,Rent,Not in Labor Force,lrircsnp,"MSA, Principle City",0.0,0.0,,
4,4,2.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,...,"<= $75,000, Above Poverty",Married,Own,Employed,qufhixun,"MSA, Not Principle City",1.0,0.0,wxleyezf,emcorrxb


In [4]:
# loading the test set features dataset
test_feat = pd.read_csv("C:/Users/PC/Desktop/School work/Projects/Phase 3/Phase-3-Project/Data/test_set_features.csv") 
test_feat.head()

Unnamed: 0,respondent_id,h1n1_concern,h1n1_knowledge,behavioral_antiviral_meds,behavioral_avoidance,behavioral_face_mask,behavioral_wash_hands,behavioral_large_gatherings,behavioral_outside_home,behavioral_touch_face,...,income_poverty,marital_status,rent_or_own,employment_status,hhs_geo_region,census_msa,household_adults,household_children,employment_industry,employment_occupation
0,26707,2.0,2.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,...,"> $75,000",Not Married,Rent,Employed,mlyzmhmf,"MSA, Not Principle City",1.0,0.0,atmlpfrs,hfxkjkmi
1,26708,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,Below Poverty,Not Married,Rent,Employed,bhuqouqj,Non-MSA,3.0,0.0,atmlpfrs,xqwwgdyp
2,26709,2.0,2.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,...,"> $75,000",Married,Own,Employed,lrircsnp,Non-MSA,1.0,0.0,nduyfdeo,pvmttkik
3,26710,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,"<= $75,000, Above Poverty",Married,Own,Not in Labor Force,lrircsnp,"MSA, Not Principle City",1.0,0.0,,
4,26711,3.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,...,"<= $75,000, Above Poverty",Not Married,Own,Employed,lzgpxyit,Non-MSA,0.0,1.0,fcxhlnwr,mxkfnird


## 2.1 Data Overview

The datasets are provided by the United States National Center for Health Statistics.

The three datasets are:
- Training Features (training_set_features) – contains 35 variables describing respondents’ demographics, behaviors, health status, and opinions.
- Training Labels (training_set_labels) – contains the two target variables (h1n1_vaccine, seasonal_vaccine).
- Test Features (test_set_features) – contains the same variables as the training features, without the labels.
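
Because the features and labels live in separate files, they need to be aligned on `respondent_id` before modeling. The sketch below shows one way this could be done with `DataFrame.merge`. The two small frames (`toy_feat`, `toy_labels`) are hypothetical stand-ins for `train_feat` and `train_labels`, with only a few columns and made-up values; they are not the real survey data.

```python
import pandas as pd

# Hypothetical stand-ins for train_feat and train_labels (toy values only).
toy_feat = pd.DataFrame({
    "respondent_id": [0, 1, 2],
    "h1n1_concern": [1.0, 3.0, 1.0],
    "rent_or_own": ["Own", "Rent", "Own"],
})
toy_labels = pd.DataFrame({
    "respondent_id": [0, 1, 2],
    "h1n1_vaccine": [0, 0, 0],
    "seasonal_vaccine": [0, 1, 0],
})

# validate="one_to_one" raises an error if a respondent_id is duplicated
# on either side, which guards against silently multiplying rows.
toy_train = toy_feat.merge(
    toy_labels, on="respondent_id", how="inner", validate="one_to_one"
)

# Every feature row should have found exactly one label row.
assert len(toy_train) == len(toy_feat)
print(toy_train.shape)
```

On the real data, the same call could be applied to `train_feat` and `train_labels`, assuming `respondent_id` is unique in both files.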

## 2.2 Variables Overview

Target Variables:
- h1n1_vaccine (binary: 0 = not vaccinated, 1 = vaccinated)
- seasonal_vaccine (binary: 0 = not vaccinated, 1 = vaccinated)
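
Since both targets are coded 0/1, their class balance and overlap can be checked with a few lines of pandas. The following is a minimal sketch on a hypothetical `toy_labels` frame with made-up values; the real proportions would come from running the same calls on `train_labels`.

```python
import pandas as pd

# Hypothetical stand-in for train_labels (toy values only).
toy_labels = pd.DataFrame({
    "h1n1_vaccine": [0, 0, 0, 1, 0],
    "seasonal_vaccine": [0, 1, 0, 1, 1],
})

# The mean of a 0/1 column is the share of vaccinated respondents.
vaccinated_share = toy_labels[["h1n1_vaccine", "seasonal_vaccine"]].mean()

# Cross-tabulation of the two targets shows how often respondents
# received both vaccines, only one, or neither.
joint_counts = pd.crosstab(
    toy_labels["h1n1_vaccine"], toy_labels["seasonal_vaccine"]
)

print(vaccinated_share)
print(joint_counts)
```

A large gap between the two shares would suggest that the targets differ in class imbalance, which matters when choosing evaluation metrics.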

Feature Types:
- Numerical: household size, age group (ordinal categories), concern/knowledge/opinion scores (ordinal).
- Categorical: education, race, sex, marital status, income level, employment industry/occupation.
- Binary: behavioral indicators (mask use, handwashing), doctor recommendations, chronic conditions, health insurance, etc.
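
The grouping above can also be derived from the data rather than assigned by hand. The sketch below is one possible heuristic, not the method used in this project: non-numeric columns are treated as categorical, and numeric columns whose observed values are only 0 and 1 are treated as binary. The `toy_feat` frame is a hypothetical stand-in for `train_feat`.

```python
import pandas as pd

# Hypothetical stand-in for train_feat (toy values only).
toy_feat = pd.DataFrame({
    "h1n1_concern": [1.0, 3.0, 2.0, 0.0],
    "behavioral_face_mask": [0.0, 0.0, 1.0, 0.0],
    "household_adults": [0.0, 2.0, 1.0, 3.0],
    "marital_status": ["Not Married", "Married", "Married", "Not Married"],
})

# Non-numeric (string) columns are treated as categorical.
categorical_cols = toy_feat.select_dtypes(exclude="number").columns.tolist()
numeric_cols = toy_feat.select_dtypes(include="number").columns.tolist()

# Numeric columns whose non-missing values are only 0 and 1 are binary flags.
binary_cols = [
    col for col in numeric_cols
    if set(toy_feat[col].dropna().unique()) <= {0.0, 1.0}
]

# The remaining numeric columns are ordinal scores or counts.
ordinal_or_count_cols = [col for col in numeric_cols if col not in binary_cols]

print(categorical_cols, binary_cols, ordinal_or_count_cols)
```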

## 2.3 Initial Observations

- Some variables (e.g., employment_industry, employment_occupation) contain missing values (NaNs).

- Several variables are categorical with string values, requiring encoding before modeling.

- Many features represent attitudes or opinions, which may strongly influence vaccination decisions.
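
The first two observations can be quantified and acted on with pandas. The sketch below measures the share of missing values per column and then one-hot encodes the string columns. One-hot encoding with `pd.get_dummies` is an assumption made for illustration, since the encoding method has not been chosen at this stage, and `toy_feat` is a hypothetical stand-in for `train_feat` with made-up rows.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train_feat (toy values only).
toy_feat = pd.DataFrame({
    "h1n1_concern": [1.0, 3.0, np.nan, 2.0],
    "employment_status": [
        "Not in Labor Force", "Employed", "Employed", "Not in Labor Force"
    ],
    "employment_industry": [None, "pxcmvdjn", "rucpziij", None],
})

# Fraction of missing values per column, largest first.
missing_share = toy_feat.isna().mean().sort_values(ascending=False)

# One-hot encode the string columns. dummy_na=True adds a separate
# indicator column for missing values instead of dropping that information.
categorical_cols = ["employment_status", "employment_industry"]
encoded = pd.get_dummies(toy_feat, columns=categorical_cols, dummy_na=True)

print(missing_share)
print(encoded.columns.tolist())
```

Keeping an explicit "missing" indicator may be useful here because a missing employment industry can simply reflect that the respondent is not employed, as in the sample rows shown earlier.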