## Guess who got the shot

Let's try to predict whether people got H1N1 and seasonal flu vaccines using information they shared about their backgrounds, opinions, and health behaviors.


![vaccine](COVID.jpg "")

First we're going to import some libraries. In the intro to programming class, the functions we used were very basic and available by default. But Python has a wide variety of uses, so there are thousands of additional libraries that provide extra functionality. Some are built in to Python, and some are external.

To use these libraries, we must first import them. For now we're mainly going to need pandas and numpy.

In [None]:
import numpy as np
import pandas as pd

In [None]:
# This code will download the data we need. You don't need to worry about this for now.
!wget https://raw.githubusercontent.com/haritha-j/alchemize/main/flu/training_set_features.csv
!wget https://raw.githubusercontent.com/haritha-j/alchemize/main/flu/training_set_labels.csv
!wget https://raw.githubusercontent.com/haritha-j/alchemize/main/flu/test_set_featuress.csv
!wget https://raw.githubusercontent.com/haritha-j/alchemize/main/flu/submission_format.csv

## DATA DATA DATA

We will be using 4 main files.

**Training Features**: These are the input variables that your model will use to predict the probability that people received H1N1 flu and seasonal flu vaccines. There are 35 feature columns in total, each a response to a survey question. These questions cover several different topics, such as whether people observed safe behavioral practices, their opinions about the diseases and the vaccines, and their demographics. Check out the problem description page for more information.


**Training Labels**: These are the labels corresponding to the observations in the training features. There are two target variables: h1n1_vaccine and seasonal_vaccine. Both are binary variables, with 1 indicating that a person received the respective flu vaccine and 0 indicating that a person did not receive the respective flu vaccine. Note that this is what is known as a "multilabel" modeling task.


**Test Features**: These are the features for observations that you will use to generate the submission predictions after training a model. We don't give you the labels for these samples—it's up to you to generate them.


**Submission Format**: This file serves as an example for how to format your submission. It contains the index and columns for our submission prediction. The two target variable columns are filled with 0.5 and 0.7 as an example. Your submission to the leaderboard must be in this exact format (with different prediction values) in order to be scored successfully!

![data](data.jpg "")

Let's learn a little about our data.

The first column `respondent_id` is a unique and random identifier. The remaining 35 features are described below.

For all binary variables: `0` = No; `1` = Yes.

-   `h1n1_concern` - Level of concern about the H1N1 flu.
    -   `0` = Not at all concerned; `1` = Not very concerned; `2` = Somewhat concerned; `3` = Very concerned.
-   `h1n1_knowledge` - Level of knowledge about H1N1 flu.
    -   `0` = No knowledge; `1` = A little knowledge; `2` = A lot of knowledge.
-   `behavioral_antiviral_meds` - Has taken antiviral medications. (binary)
-   `behavioral_avoidance` - Has avoided close contact with others with flu-like symptoms. (binary)
-   `behavioral_face_mask` - Has bought a face mask. (binary)
-   `behavioral_wash_hands` - Has frequently washed hands or used hand sanitizer. (binary)
-   `behavioral_large_gatherings` - Has reduced time at large gatherings. (binary)
-   `behavioral_outside_home` - Has reduced contact with people outside of own household. (binary)
-   `behavioral_touch_face` - Has avoided touching eyes, nose, or mouth. (binary)
-   `doctor_recc_h1n1` - H1N1 flu vaccine was recommended by doctor. (binary)
-   `doctor_recc_seasonal` - Seasonal flu vaccine was recommended by doctor. (binary)
-   `chronic_med_condition` - Has any of the following chronic medical conditions: asthma or an other lung condition, diabetes, a heart condition, a kidney condition, sickle cell anemia or other anemia, a neurological or neuromuscular condition, a liver condition, or a weakened immune system caused by a chronic illness or by medicines taken for a chronic illness. (binary)
-   `child_under_6_months` - Has regular close contact with a child under the age of six months. (binary)
-   `health_worker` - Is a healthcare worker. (binary)
-   `health_insurance` - Has health insurance. (binary)
-   `opinion_h1n1_vacc_effective` - Respondent's opinion about H1N1 vaccine effectiveness.
    -   `1` = Not at all effective; `2` = Not very effective; `3` = Don't know; `4` = Somewhat effective; `5` = Very effective.
-   `opinion_h1n1_risk` - Respondent's opinion about risk of getting sick with H1N1 flu without vaccine.
    -   `1` = Very Low; `2` = Somewhat low; `3` = Don't know; `4` = Somewhat high; `5` = Very high.
-   `opinion_h1n1_sick_from_vacc` - Respondent's worry of getting sick from taking H1N1 vaccine.
    -   `1` = Not at all worried; `2` = Not very worried; `3` = Don't know; `4` = Somewhat worried; `5` = Very worried.
-   `opinion_seas_vacc_effective` - Respondent's opinion about seasonal flu vaccine effectiveness.
    -   `1` = Not at all effective; `2` = Not very effective; `3` = Don't know; `4` = Somewhat effective; `5` = Very effective.
-   `opinion_seas_risk` - Respondent's opinion about risk of getting sick with seasonal flu without vaccine.
    -   `1` = Very Low; `2` = Somewhat low; `3` = Don't know; `4` = Somewhat high; `5` = Very high.
-   `opinion_seas_sick_from_vacc` - Respondent's worry of getting sick from taking seasonal flu vaccine.
    -   `1` = Not at all worried; `2` = Not very worried; `3` = Don't know; `4` = Somewhat worried; `5` = Very worried.
-   `age_group` - Age group of respondent.
-   `education` - Self-reported education level.
-   `race` - Race of respondent.
-   `sex` - Sex of respondent.
-   `income_poverty` - Household annual income of respondent with respect to 2008 Census poverty thresholds.
-   `marital_status` - Marital status of respondent.
-   `rent_or_own` - Housing situation of respondent.
-   `employment_status` - Employment status of respondent.
-   `hhs_geo_region` - Respondent's residence using a 10-region geographic classification defined by the U.S. Dept. of Health and Human Services. Values are represented as short random character strings.
-   `census_msa` - Respondent's residence within metropolitan statistical areas (MSA) as defined by the U.S. Census.
-   `household_adults` - Number of *other* adults in household, top-coded to 3.
-   `household_children` - Number of children in household, top-coded to 3.
-   `employment_industry` - Type of industry respondent is employed in. Values are represented as short random character strings.
-   `employment_occupation` - Type of occupation of respondent. Values are represented as short random character strings.

Next we need to load our data. This should be a little familiar to you if you went through the additional exercise you were given.

In [None]:
features_df = pd.read_csv("training_set_features.csv", index_col="respondent_id")
labels_df = pd.read_csv("training_set_labels.csv", index_col="respondent_id")

In [None]:
print("features_df.shape", features_df.shape)


Let's take a quick look at our data.

In [None]:
features_df.head()

In [None]:
features_df.dtypes

Now let's look at the labels.

In [None]:
print("labels_df.shape", labels_df.shape)

In [None]:
labels_df.head()

## VISUALIZE VISUALIZE 

Before we do anything with our data, we need to get familiar with it. The best way to do this is to visualize our data.

We're going to need one extra library for this.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
fig, ax = plt.subplots(2, 1, sharex=True)

n_obs = labels_df.shape[0]

(labels_df['h1n1_vaccine']
    .value_counts()
    .div(n_obs)
    .plot.barh(title="Proportion of H1N1 Vaccine", ax=ax[0])
)
ax[0].set_ylabel("h1n1_vaccine")

(labels_df['seasonal_vaccine']
    .value_counts()
    .div(n_obs)
    .plot.barh(title="Proportion of Seasonal Vaccine", ax=ax[1])
)
ax[1].set_ylabel("seasonal_vaccine")

fig.tight_layout()

![anti](anti.jpg "")

It looks like roughly half of people received the seasonal flu vaccine, but only about 20% of people received the H1N1 flu vaccine. In terms of class balance, we say that the seasonal flu vaccine target has balanced classes, but the H1N1 flu vaccine target has moderately imbalanced classes.
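As a rough sketch of how class balance can be read off as numbers rather than from a chart, the snippet below uses a small made-up label column (`toy_labels` is hypothetical, not the survey data) and asks pandas for proportions directly.

```python
import pandas as pd

# Toy labels invented for illustration: 2 of 10 respondents vaccinated
toy_labels = pd.Series([0, 0, 0, 0, 1, 0, 0, 1, 0, 0], name="toy_vaccine")

# normalize=True returns the share of each class instead of raw counts
class_proportions = toy_labels.value_counts(normalize=True)
print(class_proportions)
```

Running the same call on a real label column such as `labels_df['h1n1_vaccine']` should give the proportions that the bar chart displays.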

Are the two target variables independent? Let's take a look.

In [None]:
pd.crosstab(
    labels_df["h1n1_vaccine"], 
    labels_df["seasonal_vaccine"], 
    margins=True,
    normalize=True
)

In [None]:
# Measure Pearson correlation for two binary variables
(labels_df["h1n1_vaccine"]
     .corr(labels_df["seasonal_vaccine"], method="pearson")
)

These two variables have a phi coefficient of 0.37, indicating a moderate positive correlation. We can see that in the cross-tabulation as well. Most people who got an H1N1 flu vaccine also got the seasonal flu vaccine. While a minority of people who got the seasonal vaccine got the H1N1 vaccine, they got the H1N1 vaccine at a higher rate than those who did not get the seasonal vaccine.
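To see why a Pearson correlation between two binary columns can be called a phi coefficient, here is a minimal sketch on made-up data (the columns `a` and `b` are hypothetical). It computes phi from the cells of the 2x2 table and compares it with the value pandas returns.

```python
import numpy as np
import pandas as pd

# Toy binary columns invented for illustration
toy = pd.DataFrame({
    "a": [1, 1, 1, 0, 0, 0, 0, 0, 1, 0],
    "b": [1, 1, 0, 0, 0, 1, 0, 0, 1, 1],
})

# Cell counts of the 2x2 table: n10 means a=1 and b=0, and so on
table = pd.crosstab(toy["a"], toy["b"])
n00, n01 = table.loc[0, 0], table.loc[0, 1]
n10, n11 = table.loc[1, 0], table.loc[1, 1]

# Phi coefficient computed directly from the table
phi = (n11 * n00 - n10 * n01) / np.sqrt(
    (n10 + n11) * (n00 + n01) * (n01 + n11) * (n00 + n10)
)

# Pearson correlation of the same two columns
pearson = toy["a"].corr(toy["b"], method="pearson")
print(phi, pearson)
```

The two numbers agree, which is why the correlation computed above can be reported as a phi coefficient.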

### Features

Next, let's take a look at our features. From the problem description page, we know that the feature variables are almost all categorical: a mix of binary, ordinal, and nominal features. Let's pick a few and see how the rates of vaccination may differ across the levels of the feature variables.

First, let's combine our features and labels into one dataframe.

In [None]:
joined_df = features_df.join(labels_df)
print(joined_df.shape)
joined_df.head()

### Plotting

Next, let's see how the features are correlated with the target variables. We'll start with trying to visualize if there is simple bivariate correlation. If a feature is correlated with the target, we'd expect there to be different patterns of vaccination as you vary the values of the feature.

Jumping straight to the final visualization is hard. We can instead pick one feature and one target and work our way up to a prototype, before applying it to more features and both targets. We'll use h1n1_concern, the level of concern the person showed about the H1N1 flu, as the feature and h1n1_vaccine as the target variable.

First, we'll get the count of observations for each combination of those two variables.

![charts](charts.png "")

In [None]:
counts = (joined_df[['h1n1_concern', 'h1n1_vaccine']]
              .groupby(['h1n1_concern', 'h1n1_vaccine'])
              .size()
              .unstack('h1n1_vaccine')
         )
counts

Let's visualize this.

In [None]:
ax = counts.plot.barh()
ax.invert_yaxis()
ax.legend(
    loc='center right', 
    bbox_to_anchor=(1.3, 0.5), 
    title='h1n1_vaccine'
)

### What's wrong with this picture?



Let's try something a little different.

In [None]:
h1n1_concern_counts = counts.sum(axis='columns')
h1n1_concern_counts

In [None]:
props = counts.div(h1n1_concern_counts, axis='index')
props

In [None]:
ax = props.plot.barh()
ax.invert_yaxis()
ax.legend(
    loc='center left', 
    bbox_to_anchor=(1.05, 0.5),
    title='h1n1_vaccine'
)

Now we have a clearer picture of what's happening! In this plot, each pair of blue (no vaccine) and orange (received vaccine) bars add up to 1.0. We can clearly see that even though most people don't get the H1N1 vaccine, they are more likely to if they have a higher level of concern. It looks like h1n1_concern will be a useful feature when we get to modeling.

Since every pair of bars adds up to 1.0 and we only have two bars, this is actually a good use case for a stacked bar chart, to make it even easier to read.

In [None]:
ax = props.plot.barh(stacked=True)
ax.invert_yaxis()
ax.legend(
    loc='center left', 
    bbox_to_anchor=(1.05, 0.5),
    title='h1n1_vaccine'
)
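The proportions plotted above were built in two steps: count, then divide each row by its total. As an alternative sketch, `pd.crosstab` with `normalize="index"` should give the same row-wise proportions in one call. The toy columns below are made up for illustration.

```python
import pandas as pd

# Toy data invented for illustration
toy = pd.DataFrame({
    "concern": [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "vaccine": [0, 0, 0, 1, 0, 0, 0, 1, 1, 1],
})

# Two-step route, mirroring the cells above: count, then divide by row totals
counts = toy.groupby(["concern", "vaccine"]).size().unstack("vaccine")
props_manual = counts.div(counts.sum(axis="columns"), axis="index")

# One-step route: crosstab normalizes each row so it sums to 1
props_crosstab = pd.crosstab(toy["concern"], toy["vaccine"], normalize="index")

print(props_manual)
print(props_crosstab)
```

Either table can be passed to `.plot.barh(stacked=True)` in the same way as `props`.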

### Back to functions

Remember how we learnt that we can define our own functions so that we won't have to keep writing the same code over and over?
Let's make one to visualize our data.

In [None]:
def vaccination_rate_plot(col, target, data, ax=None):
    """Stacked bar chart of vaccination rate for `target` against 
    `col`. 
    
    Args:
        col (string): column name of feature variable
        target (string): column name of target variable
        data (pandas DataFrame): dataframe that contains columns 
            `col` and `target`
        ax (matplotlib axes object, optional): matplotlib axes 
            object to attach plot to
    """
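    # use the `data` argument rather than the global joined_df
    counts = (data[[target, col]]
                  .groupby([target, col])
                  .size()
                  .unstack(target)
             )
    group_counts = counts.sum(axis='columns')
    props = counts.div(group_counts, axis='index')

    # plot returns the axes it drew on, so this also works when ax is None
    ax = props.plot(kind="barh", stacked=True, ax=ax)
    ax.invert_yaxis()
    ax.legend().remove()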

### Your turn 
Use this function to draw some graphs, and discover some trends!

In [None]:
column = 'h1n1_concern'

fig, ax = plt.subplots(
    1, 2, figsize=(9, 2.5)
)
vaccination_rate_plot(
    column, 'h1n1_vaccine', joined_df, ax=ax[0]
)
vaccination_rate_plot(
    column, 'seasonal_vaccine', joined_df, ax=ax[1]
)

ax[0].legend(
    loc='lower center', bbox_to_anchor=(0.5, 1.05), title='h1n1_vaccine'
)
ax[1].legend(
    loc='lower center', bbox_to_anchor=(0.5, 1.05), title='seasonal_vaccine'
)

Now let's use this function to plot a whole bunch of charts together.

In [None]:
cols_to_plot = [
    'h1n1_concern',
    'h1n1_knowledge',
    'opinion_h1n1_vacc_effective',
    'opinion_h1n1_risk',
    'opinion_h1n1_sick_from_vacc',
    'opinion_seas_vacc_effective',
    'opinion_seas_risk',
    'opinion_seas_sick_from_vacc',
    'sex',
    'age_group',
    'race',
]

fig, ax = plt.subplots(
    len(cols_to_plot), 2, figsize=(9,len(cols_to_plot)*2.5)
)
for idx, col in enumerate(cols_to_plot):
    vaccination_rate_plot(
        col, 'h1n1_vaccine', joined_df, ax=ax[idx, 0]
    )
    vaccination_rate_plot(
        col, 'seasonal_vaccine', joined_df, ax=ax[idx, 1]
    )
    
ax[0, 0].legend(
    loc='lower center', bbox_to_anchor=(0.5, 1.05), title='h1n1_vaccine'
)
ax[0, 1].legend(
    loc='lower center', bbox_to_anchor=(0.5, 1.05), title='seasonal_vaccine'
)
fig.tight_layout()

It looks like the knowledge and opinion questions have pretty strong signal for both target variables.

The demographic features have a stronger correlation with seasonal_vaccine, but much less so with h1n1_vaccine. In particular, we see a strong correlation between age_group and seasonal_vaccine but not h1n1_vaccine. It appears that with seasonal flu, people act in line with the fact that older people are more affected and have a higher risk of flu-related complications. It turns out, though, that H1N1 flu has an interesting relationship with age: even though older people have a higher risk of complications, they were less likely to get infected! While we can't conclude anything about causality from this analysis, it seems like the risk factors ended up being reflected in the vaccination rates.

## Predictions!

### Logistic regression
We will be using logistic regression, a simple and fast linear model for classification problems. Logistic regression is a great model choice for a first-pass baseline model when starting out on a problem.

In [None]:
# TODO

### Primer

In [None]:
# We're gonna need some additional imports here, from a library called scikit-learn
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

from sklearn.pipeline import Pipeline

from sklearn.model_selection import train_test_split

from sklearn.metrics import roc_curve, roc_auc_score

RANDOM_SEED = 6    # Why?

We will be using scikit-learn's logistic regression implementation.

Standard logistic regression only works with numeric input for features. Since this is a benchmark, we're going to build simple models only using the numeric columns of our dataset.

Categorical variables with non-numeric values take a little more preprocessing to prepare for many machine learning algorithms. We're not going to deal with them in this benchmark walkthrough, but there are many different ways to encode categorical variables into numeric values. 
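For the curious, here is a minimal sketch of one such encoding, often called one-hot encoding, using `pd.get_dummies`. This is not what the benchmark does; the toy columns and values below are made up for illustration.

```python
import pandas as pd

# Toy categorical data invented for illustration
toy = pd.DataFrame({
    "toy_sex": ["Female", "Male", "Female"],
    "toy_age": ["18 - 34 Years", "65+ Years", "18 - 34 Years"],
})

# One new 0/1 column is created for every distinct value of every column
encoded = pd.get_dummies(toy, columns=["toy_sex", "toy_age"], dtype=int)
print(encoded)
```

Each original column is replaced by as many numeric columns as it has distinct values, which is the kind of input a logistic regression can accept.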

In [None]:
features_df.dtypes != "object"

In [None]:
numeric_cols = features_df.columns[features_df.dtypes != "object"].values
print(numeric_cols)

### Processing features
 
Preprocessing is one of the key aspects of data analysis, and data analysts typically spend a lot of time on this step.

Data often comes with many missing values, errors and other issues that need to be fixed before anything useful can be inferred from the data.

For this intro, we will focus on only a couple of very common preprocessing techniques.

**Scaling:** Transform all features to be on the same scale. This matters when using regularization, which we will discuss in the next section. We will use StandardScaler, also known as Z-score scaling. This scales and shifts features so that they have zero mean and unit variance.
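As a small illustration of what Z-score scaling does, the sketch below applies `StandardScaler` to a made-up array whose two columns are on very different scales.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy features invented for illustration: the second column is 100x larger
toy_X = np.array([
    [1.0, 100.0],
    [2.0, 200.0],
    [3.0, 300.0],
    [4.0, 400.0],
])

scaler = StandardScaler()
scaled = scaler.fit_transform(toy_X)

# After scaling, each column should have mean ~0 and standard deviation ~1
print(scaled.mean(axis=0))
print(scaled.std(axis=0))
```

Both columns end up with identical scaled values here, because they differ only by a constant factor.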

**NA Imputation:** Logistic regression does not handle NA values. We will use median imputation, which fills missing values with the median from the training data, implemented with SimpleImputer.
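Here is a minimal sketch of median imputation on made-up numbers. It also shows that the median learned from the training data is reused when filling gaps in new data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column invented for illustration, with one missing value
toy_train = np.array([[1.0], [2.0], [np.nan], [10.0]])
toy_new = np.array([[np.nan], [5.0]])

imputer = SimpleImputer(strategy="median")
imputer.fit(toy_train)  # learns the median of the observed values 1, 2, 10

filled_train = imputer.transform(toy_train)
filled_new = imputer.transform(toy_new)  # reuses the training median

print(filled_train.ravel())
print(filled_new.ravel())
```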

![clean](clean.jpg "")

We are going to start using Scikit-Learn's built-in composition functionality to encapsulate everything into a pipeline. Building pipelines is a best practice for building machine learning models. Among other benefits, it makes the whole sequence of steps easy to reuse on new data (such as our test data). The great thing about pipelines is that they have the same interface as transformers and estimators, so you can treat them as if they were one.
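To make the "same interface" point concrete, here is a toy sketch (the data and step names are made up) of a pipeline that is fitted and used for prediction exactly like a single model.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data invented for illustration, with one missing value
toy_X = np.array([[0.0], [1.0], [np.nan], [3.0], [4.0], [5.0]])
toy_y = np.array([0, 0, 0, 1, 1, 1])

toy_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("imputer", SimpleImputer(strategy="median")),
    ("model", LogisticRegression()),
])

# One call fits every step in order
toy_pipeline.fit(toy_X, toy_y)

# New data passes through the same fitted preprocessing before prediction
toy_new = np.array([[np.nan], [4.5]])
preds = toy_pipeline.predict(toy_new)
print(preds)
```

Because the preprocessing lives inside the pipeline, there is no need to remember to scale and impute new data by hand.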

In the block below, we're going to first chain together the preprocessing steps (scaling and imputing) into one intermediate pipeline object numeric_preprocessing_steps. Then, we use that with Scikit-Learn's ColumnTransformer, which is a convenient way to grab columns out of a pandas data frame and then apply a specified transformer.

In [None]:
# chain preprocessing into a Pipeline object
# each step is a tuple of (name you chose, sklearn transformer)
numeric_preprocessing_steps = Pipeline([
    ('standard_scaler', StandardScaler()),
    ('simple_imputer', SimpleImputer(strategy='median'))
])

# create the preprocessor stage of final pipeline
# each entry in the transformer list is a tuple of
# (name you choose, sklearn transformer, list of columns)
preprocessor = ColumnTransformer(
    transformers = [
        ("numeric", numeric_preprocessing_steps, numeric_cols)
    ],
    remainder = "drop"
)

### Let's get to the good stuff

Now we're finally getting close to what we set out to do: predicting values.

We'll use scikit-learn's default hyperparameters (hyper what now?) for LogisticRegression of L2 (a.k.a. Ridge) regularization with C value (inverse regularization strength) of 1. Regularization is useful because it reduces overfitting.
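As a rough illustration of what `C` controls, the sketch below fits two models on made-up random data (an assumption for demonstration only). One uses a small `C` (strong regularization) and one a large `C` (weak regularization), and the sizes of their learned coefficients are compared.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data invented for illustration: only the first feature matters
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(200, 3))
toy_y = (toy_X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Small C means strong regularization; large C means weak regularization
strong = LogisticRegression(C=0.01).fit(toy_X, toy_y)
weak = LogisticRegression(C=100).fit(toy_X, toy_y)

strong_norm = np.linalg.norm(strong.coef_)
weak_norm = np.linalg.norm(weak.coef_)
print(strong_norm, weak_norm)
```

The strongly regularized model ends up with smaller coefficients, which is the mechanism by which regularization limits overfitting.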



Because we have two labels we need to predict, we can use Scikit-Learn's MultiOutputClassifier. This is a convenient shortcut for training two of the same type of model and having them run together.
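Here is a minimal sketch of that idea on made-up data with two binary labels. It shows that one model is fitted per label, and that `predict_proba` returns a list with one array of probabilities per label.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Toy data invented for illustration: two features, two binary labels
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(100, 2))
toy_Y = np.column_stack([
    (toy_X[:, 0] > 0).astype(int),
    (toy_X[:, 1] > 0).astype(int),
])

toy_model = MultiOutputClassifier(estimator=LogisticRegression())
toy_model.fit(toy_X, toy_Y)

# One fitted LogisticRegression per label
print(len(toy_model.estimators_))

# predict_proba returns a list: one (n_samples, n_classes) array per label
probas = toy_model.predict_proba(toy_X)
prob_first_label = probas[0][:, 1]  # probability that the first label is 1
print(prob_first_label[:5])
```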

In [None]:
estimators = MultiOutputClassifier(
    estimator=LogisticRegression(penalty="l2", C=1)
)


In [None]:
# now let's combine the preprocessing and the classifier.
full_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("estimators", estimators),
])

In [None]:
# let's take a look at what we've built
full_pipeline


## Training
![test](train.jpg "")