# Business Understanding

Vaccines protect against specific diseases by exposing the immune system to a weakened, inactivated, or partial form of a pathogen, which trains it to recognize the real infection. Flu vaccines, including seasonal and H1N1 vaccines, help prevent the spread of influenza. Seasonal vaccines are updated yearly, while the H1N1 vaccine was developed in response to the 2009 pandemic. Vaccination protects individuals and promotes herd immunity; however, vaccine hesitancy, misinformation, and limited access can hinder widespread uptake.

This project aims to predict H1N1 vaccine uptake and to identify the factors that influence it, using data from the National 2009 H1N1 Flu Survey. The primary stakeholders are public health officials and policymakers, who can use the insights to design targeted vaccination strategies and improve future public health campaigns.

The analysis will focus on demographic information, health behaviors, and prior vaccination history to understand which factors influence H1N1 vaccine uptake. Key features include age, race, employment status, chronic health conditions, and opinions about vaccines. By identifying patterns in these factors, the project seeks to provide actionable insights on which populations are most likely to get vaccinated and which groups may require additional outreach.

This is a binary classification problem, with the target variable indicating whether the respondent received the H1N1 vaccine (0 = No, 1 = Yes). The goal is to provide data-driven guidance to stakeholders, helping optimize public health strategies and improve vaccination uptake in future campaigns.



# Data Understanding

The data for this project comes from three datasets: `training_set_features.csv`, `test_set_features.csv`, and `training_set_labels.csv`. The primary target variable for this analysis is whether an individual received the H1N1 vaccine (`h1n1_vaccine`), while predictors include demographic, health-related, and behavioral features such as age, income, health concerns, and vaccination recommendations.

The datasets contain both categorical (e.g., age group, marital status) and numerical (e.g., `household_adults`, `household_children`) features, including some binary and ordinal variables. The size and distribution of the datasets will be examined during exploration. If the target variable turns out to be imbalanced, resampling techniques may be applied to address it.
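As a rough illustration of how target imbalance could be checked, the sketch below computes the share of each class in `h1n1_vaccine`. The frame `labels_demo` is a hypothetical stand-in with made-up values, not survey results; in the notebook the same call would be applied to the labels loaded from `training_set_labels.csv`.

```python
import pandas as pd

# Hypothetical stand-in for the labels file; these values are invented
# for illustration and are NOT taken from the survey.
labels_demo = pd.DataFrame({
    "respondent_id": [0, 1, 2, 3, 4],
    "h1n1_vaccine": [0, 0, 0, 1, 0],
    "seasonal_vaccine": [0, 1, 0, 1, 0],
})

# Proportion of respondents in each class of the target.
class_share = labels_demo["h1n1_vaccine"].value_counts(normalize=True)
print(class_share)
```

If one class accounts for most of the rows, resampling or class weighting may be worth considering before modeling.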

The data was collected via surveys, so it may include missing values, response biases, or inconsistencies. These issues will require careful cleaning and preprocessing before building predictive models.
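One way to gauge how much cleaning is needed is to measure the share of missing values per column. The following is a minimal sketch on a hypothetical frame, `features_demo`, whose column names are borrowed from the features file but whose values are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the features table; values are illustrative.
features_demo = pd.DataFrame({
    "h1n1_concern": [1.0, 3.0, np.nan, 1.0],
    "employment_industry": [None, "pxcmvdjn", "rucpziij", None],
    "household_adults": [0.0, 0.0, 2.0, 0.0],
})

# Fraction of missing entries in each column, most incomplete first.
missing_share = features_demo.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Columns with a large missing share are candidates for imputation, an explicit "missing" category, or removal, depending on what exploration shows.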


# Data Preparation

In [3]:
# import pandas
import pandas as pd

# Loading the training features 
training_features_df = pd.read_csv("./Data/training_set_features.csv")

# Displaying the first few rows
training_features_df.head()

| | respondent_id | h1n1_concern | h1n1_knowledge | behavioral_antiviral_meds | behavioral_avoidance | behavioral_face_mask | behavioral_wash_hands | behavioral_large_gatherings | behavioral_outside_home | behavioral_touch_face | ... | income_poverty | marital_status | rent_or_own | employment_status | hhs_geo_region | census_msa | household_adults | household_children | employment_industry | employment_occupation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | Below Poverty | Not Married | Own | Not in Labor Force | oxchjgsf | Non-MSA | 0.0 | 0.0 | NaN | NaN |
| 1 | 1 | 3.0 | 2.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | ... | Below Poverty | Not Married | Rent | Employed | bhuqouqj | MSA, Not Principle City | 0.0 | 0.0 | pxcmvdjn | xgwztkwe |
| 2 | 2 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | <= $75,000, Above Poverty | Not Married | Own | Employed | qufhixun | MSA, Not Principle City | 2.0 | 0.0 | rucpziij | xtkaffoo |
| 3 | 3 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | ... | Below Poverty | Not Married | Rent | Not in Labor Force | lrircsnp | MSA, Principle City | 0.0 | 0.0 | NaN | NaN |
| 4 | 4 | 2.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | ... | <= $75,000, Above Poverty | Married | Own | Employed | qufhixun | MSA, Not Principle City | 1.0 | 0.0 | wxleyezf | emcorrxb |


In [5]:
# Loading the training labels
training_labels_df = pd.read_csv("./Data/training_set_labels.csv")

# Displaying the first few rows
training_labels_df.head()

| | respondent_id | h1n1_vaccine | seasonal_vaccine |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 1 |
| 2 | 2 | 0 | 0 |
| 3 | 3 | 0 | 1 |
| 4 | 4 | 0 | 0 |
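Because features and labels arrive in separate files that share `respondent_id`, they need to be joined before modeling. The sketch below shows one possible approach on miniature frames built from the first three rows displayed above; the names `features_demo`, `labels_demo`, and `train_demo` are hypothetical, and the choice of an inner join with a one-to-one check is an assumption rather than the notebook's confirmed method.

```python
import pandas as pd

# Miniature frames mirroring the first three displayed rows.
features_demo = pd.DataFrame({
    "respondent_id": [0, 1, 2],
    "h1n1_concern": [1.0, 3.0, 1.0],
    "h1n1_knowledge": [0.0, 2.0, 1.0],
})
labels_demo = pd.DataFrame({
    "respondent_id": [0, 1, 2],
    "h1n1_vaccine": [0, 0, 0],
    "seasonal_vaccine": [0, 1, 0],
})

# Attach only the H1N1 target; validate raises if an id is duplicated
# on either side, which would silently multiply rows otherwise.
train_demo = features_demo.merge(
    labels_demo[["respondent_id", "h1n1_vaccine"]],
    on="respondent_id",
    how="inner",
    validate="one_to_one",
)
print(train_demo.shape)
```

Keeping `seasonal_vaccine` out of the merged frame avoids accidentally treating the second label as a predictor.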


In [6]:
# Loading the test features
test_features_df = pd.read_csv("./Data/test_set_features.csv")

# Displaying the first few rows
test_features_df.head()

| | respondent_id | h1n1_concern | h1n1_knowledge | behavioral_antiviral_meds | behavioral_avoidance | behavioral_face_mask | behavioral_wash_hands | behavioral_large_gatherings | behavioral_outside_home | behavioral_touch_face | ... | income_poverty | marital_status | rent_or_own | employment_status | hhs_geo_region | census_msa | household_adults | household_children | employment_industry | employment_occupation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 26707 | 2.0 | 2.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | ... | > $75,000 | Not Married | Rent | Employed | mlyzmhmf | MSA, Not Principle City | 1.0 | 0.0 | atmlpfrs | hfxkjkmi |
| 1 | 26708 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | Below Poverty | Not Married | Rent | Employed | bhuqouqj | Non-MSA | 3.0 | 0.0 | atmlpfrs | xqwwgdyp |
| 2 | 26709 | 2.0 | 2.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | > $75,000 | Married | Own | Employed | lrircsnp | Non-MSA | 1.0 | 0.0 | nduyfdeo | pvmttkik |
| 3 | 26710 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | <= $75,000, Above Poverty | Married | Own | Not in Labor Force | lrircsnp | MSA, Not Principle City | 1.0 | 0.0 | NaN | NaN |
| 4 | 26711 | 3.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | <= $75,000, Above Poverty | Not Married | Own | Employed | lzgpxyit | Non-MSA | 0.0 | 1.0 | fcxhlnwr | mxkfnird |
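With both feature sets loaded, a quick sanity check is to confirm that the training and test frames expose the same columns and that no respondent appears in both. The sketch below illustrates this on two-column miniatures built from the rows displayed above; `train_demo` and `test_demo` are hypothetical names, and in the notebook the same comparisons would run on `training_features_df` and `test_features_df`.

```python
import pandas as pd

# Two-column miniatures of the displayed training and test rows.
train_demo = pd.DataFrame({
    "respondent_id": [0, 1, 2, 3, 4],
    "h1n1_concern": [1.0, 3.0, 1.0, 1.0, 2.0],
})
test_demo = pd.DataFrame({
    "respondent_id": [26707, 26708, 26709, 26710, 26711],
    "h1n1_concern": [2.0, 1.0, 2.0, 1.0, 3.0],
})

# The two frames should share identical column names, in the same order.
same_columns = list(train_demo.columns) == list(test_demo.columns)

# No respondent should appear in both the training and the test set.
overlap = set(train_demo["respondent_id"]) & set(test_demo["respondent_id"])

print(same_columns, len(overlap))
```

A mismatch in columns or a non-empty overlap would point to a loading or file-version problem worth resolving before preprocessing.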
