# H1N1 and Seasonal Flu Vaccines

## Business Understanding

The COVID-19 pandemic reshaped our understanding of how society responds to a shared crisis and recentered the conversation around public health. To inform future public health efforts, this analysis revisits the 2009 public health response to the H1N1 influenza virus and examines how people’s backgrounds, opinions, and health behaviors relate to their vaccination patterns.

**H1N1**

Beginning in spring 2009, a pandemic caused by the H1N1 influenza virus, colloquially named "swine flu," swept across the world. Researchers estimate that in the first year, it was responsible for between 151,000 and 575,000 deaths globally. A vaccine for the H1N1 flu virus became publicly available in October 2009.

In late 2009 and early 2010, the United States conducted the National 2009 H1N1 Flu Survey. This phone survey asked respondents whether they had received the H1N1 and seasonal flu vaccines, along with questions about themselves. These additional questions covered their social, economic, and demographic background, their opinions on the risks of illness and on vaccine effectiveness, and their behaviors aimed at reducing transmission.




***ROADMAP NOTES***

- Targets
    - h1n1_vaccine - whether respondent received H1N1 flu vaccine
    - seasonal_vaccine - whether respondent received seasonal flu vaccine
    
- Classes - Binary
    - 0 = No
    - 1 = Yes

- Is there a class imbalance? If so, consider techniques to address it
    - SMOTE or `class_weight`, for example

- Determine the business context/problem for stakeholder
    - How is your model helping to address/solve the problem?

- Determine which evaluation metric/metrics are most important
    - THIS IS BIG
    - Think in terms of cost matrix, fp vs fn etc.
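
A minimal sketch of the class-imbalance check mentioned above. The data here is synthetic (a made-up label with roughly 20% positives and two hypothetical features, `feature_a` and `feature_b`), not the survey data, and `class_weight="balanced"` is shown as one possible option rather than the chosen approach.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the survey (NOT the real data): ~20% positives.
rng = np.random.default_rng(42)
n = 1000
X = pd.DataFrame({
    "feature_a": rng.normal(size=n),  # hypothetical feature
    "feature_b": rng.normal(size=n),  # hypothetical feature
})
y = pd.Series((rng.random(n) < 0.2).astype(int), name="h1n1_vaccine")

# Share of each class; a heavily lopsided split signals imbalance.
class_share = y.value_counts(normalize=True).sort_index()
print(class_share)

# class_weight="balanced" reweights samples inversely to class frequency,
# so errors on the minority class count for more during fitting.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X, y)
```

If the shares are close to even, reweighting or resampling is probably unnecessary; if they are lopsided, accuracy alone will be misleading and the metric choice below matters more.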

## Data Understanding

***ROADMAP NOTES***

- Basic summary info, .describe, null or missing values, column types

- What features are relevant for your model/analysis and why? 

- Relationships with target

- Exploratory visuals, correlations, pairplots, heatmaps, histograms etc…

- Determine how you will handle null values
    - Impute values, drop rows or columns etc…
    - Take a look at the different imputer types and strategies (`KNNImputer` is powerful but slow)

- Determine how you will handle categorical variables
    - Binary?
    - Ordinal or One-Hot encode?

- Will you need to scale the numeric data?
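
A rough sketch of the summary and null-handling steps listed above, assuming a small toy frame. The column names `concern_level` and `age_group` and all values are hypothetical placeholders, and `SimpleImputer` is used only as the simplest option to illustrate the idea.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with hypothetical columns and values (NOT the survey data).
df = pd.DataFrame({
    "concern_level": [1.0, 2.0, np.nan, 3.0, 0.0],
    "age_group": pd.Series(
        ["18 - 34 Years", np.nan, "65+ Years", "35 - 44 Years", "65+ Years"],
        dtype=object,
    ),
    "h1n1_vaccine": [0, 0, 1, 0, 1],
})

# Column types and missing-value counts in one table.
summary = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "n_missing": df.isna().sum(),
    "pct_missing": df.isna().mean().round(3),
})
print(summary)

# Median for numeric columns, most frequent value for categorical ones.
# In a real workflow, fit imputers on the training split only.
num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="most_frequent")
imputed_num = num_imputer.fit_transform(df[["concern_level"]])
imputed_cat = cat_imputer.fit_transform(df[["age_group"]])
```

Swapping in `KNNImputer` for the numeric columns follows the same `fit_transform` pattern, at a higher compute cost.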



## Data Preparation

***ROADMAP NOTES***

1. Develop ColumnTransformer → data preprocessing pipeline based on EDA above
- Sub-pipes for numeric and categorical columns
    - As many as you need if treating some columns differently
    - Keep in mind things like `handle_unknown='ignore'`
- Create one pipeline with numeric scaling, another without if needed
- Test your column transformer
    - `fit_transform` the train data
    - `transform` the test data
    - Should have the same number of columns after transformation

2. Determine an appropriate validation procedure
- Highly recommended: `cross_validate` with a pipeline, so preprocessing is refit inside each fold
- train_test_split your data
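
The two steps above can be sketched as follows. This is one possible layout, not a prescribed one: the data is synthetic, the column names (`concern_level`, `household_size`, `age_group`) are hypothetical, and the imputation strategies and `roc_auc` scoring are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data with hypothetical column names (NOT the survey data).
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "concern_level": rng.integers(0, 4, size=n).astype(float),
    "household_size": rng.integers(1, 6, size=n).astype(float),
    "age_group": pd.Series(rng.choice(["18-34", "35-64", "65+"], size=n), dtype=object),
})
X.loc[rng.choice(n, size=20, replace=False), "concern_level"] = np.nan
y = pd.Series(rng.integers(0, 2, size=n), name="h1n1_vaccine")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

numeric_cols = ["concern_level", "household_size"]
categorical_cols = ["age_group"]

# One sub-pipe per column type.
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("cat", categorical_pipe, categorical_cols),
])

# Test the transformer: train and test must end up with the same column count.
train_t = preprocessor.fit_transform(X_train)
test_t = preprocessor.transform(X_test)
print(train_t.shape, test_t.shape)

# Cross-validate the full pipeline so preprocessing is refit in every fold.
clf = Pipeline([("prep", preprocessor), ("model", LogisticRegression(max_iter=1000))])
cv_results = cross_validate(clf, X_train, y_train, cv=5, scoring="roc_auc")
print(cv_results["test_score"].mean())
```

Because the labels here are random, the cross-validated score should hover near chance; the point is the plumbing, not the number.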

## Modeling


**Iterative Approach to Modeling**

- Begin with a basic model, evaluate it, and then provide justification for and proceed to a new model
- After you finish refining your models, you should provide 1-3 paragraphs in the notebook discussing your final model.

With the additional techniques you have learned in Phase 3, be sure to explore:
1. Model features and preprocessing approaches
2. Different kinds of models (logistic regression, decision trees, etc.)
3. Different model hyperparameters

At minimum you must build three models:
- A dummy model for comparison using a simple strategy to predict
- A simple, interpretable model (logistic regression or single decision tree)
- A version of the simple model with tuned hyperparameters
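
As a hedged illustration of the three-model minimum, the sketch below builds a dummy baseline, a default logistic regression, and a tuned version on synthetic data from `make_classification`. The `roc_auc` scoring and the small grid over `C` are assumptions for the example, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

# Synthetic, imbalanced stand-in for the survey data (NOT the real data).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, weights=[0.8, 0.2], random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=1)

# 1. Dummy model: always predicts the majority class.
dummy = DummyClassifier(strategy="most_frequent")
dummy_auc = cross_val_score(dummy, X_train, y_train, cv=5, scoring="roc_auc").mean()

# 2. Simple, interpretable model with default hyperparameters.
simple = LogisticRegression(max_iter=1000)
simple_auc = cross_val_score(simple, X_train, y_train, cv=5, scoring="roc_auc").mean()

# 3. Same model with the regularization strength C tuned by grid search.
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=5,
    scoring="roc_auc",
)
grid.fit(X_train, y_train)

print(f"dummy:  {dummy_auc:.3f}")
print(f"simple: {simple_auc:.3f}")
print(f"tuned:  {grid.best_score_:.3f} with {grid.best_params_}")
```

The hold-out test set stays untouched here; all three comparisons use cross-validation on the training split.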

***ROADMAP NOTES***

1. Pipeline using ColumnTransformer and model algorithm
- Start with a DummyClassifier
    - Evaluate based on chosen metric/metrics → need to beat this
- Create and evaluate first simple model: Simple → Complex
    - Start with defaults, logistic regression or decision tree based on your data
    - Could try other algorithms if you want to explore a few rabbit holes
- Iterate over previous models
    - If overfit → reduce complexity, add regularization, prune tree
    - If underfit → increase complexity, reduce regularization, add new/more features (feature engineering)
    - Might require you to go back and adapt the pre-processing pipeline or tune hyperparameters

2. Final Model → Thursday AM you should have a good idea here
- Choose a final model based on validation scores of the chosen metric/metrics
- Fit final model to training set
- Evaluate the final model using your hold-out test set
- Discuss final model in the context of your business problem and stakeholder
    - Analyze your predictive power and results: where the model performs well and where it does not (confusion matrix)
    - This could include insights and recommendations from model
        - Coefs (logreg)
        - Feature importances (trees)
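
A short sketch of the final-model steps above, assuming logistic regression ends up as the final model. The data is synthetic and the `feature_0` ... `feature_5` names are placeholders; with the real survey features, the ranked coefficients are what would feed the insights and recommendations.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic data with placeholder feature names (NOT the survey data).
feature_names = [f"feature_{i}" for i in range(6)]
X, y = make_classification(n_samples=500, n_features=6, n_informative=3, random_state=7)
X = pd.DataFrame(X, columns=feature_names)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=7)

# Fit the final model on the full training set.
final_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Evaluate once on the hold-out test set.
tn, fp, fn, tp = confusion_matrix(y_test, final_model.predict(X_test)).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")

# Coefficients ranked by absolute size; the sign gives the direction of the effect.
coefs = pd.Series(final_model.coef_[0], index=feature_names)
coefs = coefs.sort_values(key=abs, ascending=False)
print(coefs)
```

Coefficient sizes are only comparable across features when the inputs are on a common scale, which is one reason to keep scaling in the preprocessing pipeline. For a tree model, `feature_importances_` plays the analogous role.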




***ROADMAP NOTES***

Explanatory visuals for presentation
- Back up any inference with visual
- Show final model vs. others (esp. Dummy)
- Spiced up confusion matrix, need to make sure it's non-technical enough



## Evaluation & Conclusion

**Findings and Recommendations**

A predictive finding might include:
- How well your model is able to predict the target
- What features are most important to your model

A predictive recommendation might include:
- The contexts/situations where the predictions made by your model would and would not be useful for your stakeholder and business problem
- Suggestions for how the business might modify certain input variables to achieve certain target results

**Classification Metrics**
- You must choose appropriate classification metrics and use them to evaluate your models.
- Choosing the right classification metrics is a key data science skill, and should be informed by data exploration and the business problem itself.
- You must then use this metric to evaluate your model performance using both training and testing data.
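
One way to report the chosen metrics on both training and testing data is sketched below. The helper `report` is hypothetical, the data is synthetic, and precision, recall, and ROC AUC are shown only as candidates; which one matters most depends on the relative cost of false positives and false negatives for the stakeholder.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the survey data (NOT the real data).
X, y = make_classification(
    n_samples=800, n_features=6, n_informative=3, weights=[0.8, 0.2], random_state=3
)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=3)
model = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)


def report(X_part, y_part):
    """Hypothetical helper: candidate metrics plus raw error counts for one split."""
    pred = model.predict(X_part)
    proba = model.predict_proba(X_part)[:, 1]
    tn, fp, fn, tp = confusion_matrix(y_part, pred).ravel()
    return {
        "precision": precision_score(y_part, pred, zero_division=0),
        "recall": recall_score(y_part, pred, zero_division=0),
        "roc_auc": roc_auc_score(y_part, proba),
        "false_positives": int(fp),
        "false_negatives": int(fn),
    }


train_report = report(X_train, y_train)
test_report = report(X_test, y_test)
print("train:", train_report)
print("test: ", test_report)
```

A large gap between the train and test rows points to overfitting; similar but weak scores on both point to underfitting.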