# THE H1N1 AND SEASONAL FLU VACCINES PROJECT

As the world struggles to vaccinate the global population against COVID-19, an understanding of how people’s backgrounds, opinions, and health behaviors are related to their personal vaccination patterns can provide guidance for future public health efforts. Your audience could be someone guiding those public health efforts.
The *Cross-Industry Standard Process for Data Mining (CRISP-DM)* methodology will be used in this project.

## BUSINESS UNDERSTANDING

Good questions for this stage include:

Who are the stakeholders in this project? Who will be directly affected by the creation of this project?

What business problem(s) will this Data Science project solve for the organization?

What problems are inside the scope of this project?

What problems are outside the scope of this project?

What data sources are available to us?

What is the expected timeline for this project? Are there hard deadlines (e.g. "must be live before holiday season shopping") or is this an ongoing project?

Do stakeholders from different parts of the company or organization all have the exact same understanding about what this project is and isn't?

## DATA UNDERSTANDING

Consider the following questions when working through this stage:

What data is available to us? Where does it live? Do we have the data, or can we scrape/buy/source the data from somewhere else?

Who controls the data sources, and what steps are needed to get access to the data?

What is our target?

What predictors are available to us?

What data types are the predictors we'll be working with?

What is the distribution of our data?

How many observations does our dataset contain? Do we have a lot of data? Only a little?

Do we have enough data to build a model? Will we need to use resampling methods?

How do we know the data is correct? How is the data collected? Is there a chance the data could be wrong?
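Several of the questions above (number of observations, data types, missing values, target distribution) can be answered with a few pandas calls. The following is a minimal sketch on a hypothetical toy table; the column names and the `summarize` helper are illustrative assumptions, not the project's actual schema or code.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real survey file.
# Column names are assumptions made for illustration only.
df = pd.DataFrame(
    {
        "age_group": ["18-34", "35-44", "65+", "18-34", None, "65+"],
        "doctor_recommended": [1.0, 0.0, np.nan, 1.0, 0.0, 1.0],
        "opinion_vaccine_effective": ["4", "5", "3", "2", "4", "5"],
        "got_vaccine": [1, 0, 1, 0, 0, 1],
    }
)


def summarize(frame: pd.DataFrame, target: str) -> dict:
    """Collect basic facts that answer the data-understanding questions."""
    return {
        # How many observations and predictors do we have?
        "n_rows": frame.shape[0],
        "n_predictors": frame.shape[1] - 1,
        # What data types are the columns?
        "dtypes": frame.dtypes.astype(str).to_dict(),
        # What share of each column is missing?
        "missing_fraction": frame.isna().mean().round(3).to_dict(),
        # How balanced is the target?
        "target_balance": frame[target].value_counts(normalize=True).to_dict(),
    }


summary = summarize(df, target="got_vaccine")
for key, value in summary.items():
    print(f"{key}: {value}")
```

A strongly imbalanced target or a high missing fraction in this summary would be an early signal that resampling or imputation needs to be planned for in the next stage.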

## DATA PREPARATION

During this stage, we'll want to handle the following issues:

Detecting and dealing with missing values

Data type conversions (e.g. numeric data mistakenly encoded as strings)

Checking for and removing multicollinearity (correlated predictors)

Normalizing our numeric data

Converting categorical data to numeric format through one-hot encoding
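The preparation steps listed above could be sketched roughly as follows. This is an illustration on hypothetical toy data, not the project's pipeline: the column names, the `prepare` helper, the median/"missing" imputation choices, and the 0.9 correlation cutoff are all assumptions. Plain pandas arithmetic and `pd.get_dummies` are used here in place of `StandardScaler` and `OneHotEncoder` to keep the example short.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are assumptions for illustration.
raw = pd.DataFrame(
    {
        "household_adults": ["1", "2", "0", "3", "1"],  # numbers stored as strings
        "household_total": [1.0, 2.0, 0.0, 3.0, 1.0],   # duplicates the column above
        "h1n1_concern": [3.0, np.nan, 1.0, 2.0, 0.0],
        "employment": ["employed", "unemployed", None, "employed", "retired"],
    }
)


def prepare(frame: pd.DataFrame, corr_threshold: float = 0.9) -> pd.DataFrame:
    out = frame.copy()

    # 1. Data type conversion: numeric data mistakenly encoded as strings.
    out["household_adults"] = pd.to_numeric(out["household_adults"], errors="coerce")

    # 2. Missing values: median for numeric columns, a "missing" label otherwise.
    numeric_cols = out.select_dtypes(include="number").columns
    categorical_cols = out.columns.difference(numeric_cols)
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())
    out[categorical_cols] = out[categorical_cols].fillna("missing")

    # 3. Multicollinearity: drop one column of each highly correlated pair.
    corr = out[numeric_cols].corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > corr_threshold).any()]
    out = out.drop(columns=to_drop)

    # 4. Normalization: rescale to zero mean and unit variance (z-scores).
    kept = [col for col in numeric_cols if col not in to_drop]
    out[kept] = (out[kept] - out[kept].mean()) / out[kept].std(ddof=0)

    # 5. One-hot encoding of the categorical columns.
    return pd.get_dummies(out, columns=list(categorical_cols), dtype=float)


prepared = prepare(raw)
print(prepared.columns.tolist())
```

In a real workflow the imputation values and scaling statistics would normally be computed on the training split only and then applied to the test split, so that no information leaks from unseen data.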

## MODELING

Consider the following questions during the modeling step:

Is this a classification task? A regression task? Something else?

What models will we try?

How do we deal with overfitting?

Do we need to use regularization or not?

What sort of validation strategy will we be using to check that our model works well on unseen data?

What loss functions will we use?

What performance threshold do we consider successful?
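To make the overfitting and validation questions concrete, here is a small sketch that compares a depth-limited decision tree with an unconstrained one, using a held-out test split plus 5-fold cross-validation. The data is synthetic, and the choice of ROC AUC as the metric, the `fit_and_score` helper, and the depth values are assumptions for illustration rather than the project's settled configuration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (an assumption, not the survey data):
# the label depends on the first two features plus noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Hold out a stratified test set that is never used during model selection.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)


def fit_and_score(max_depth):
    """Fit a tree and report cross-validated, train, and test ROC AUC."""
    model = DecisionTreeClassifier(max_depth=max_depth, random_state=42)
    cv_auc = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc").mean()
    model.fit(X_train, y_train)
    train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
    test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    return {"cv_auc": cv_auc, "train_auc": train_auc, "test_auc": test_auc}


# max_depth=None lets the tree grow until it memorizes the training data;
# max_depth=3 acts as a simple form of regularization.
results = {depth: fit_and_score(depth) for depth in (3, None)}
for depth, scores in results.items():
    print(depth, {name: round(value, 3) for name, value in scores.items()})
```

A large gap between the training score and the cross-validated or test score is the usual symptom of overfitting; the success threshold itself should come from the business understanding stage rather than from the code.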

## EVALUATION

During this step, we'll evaluate the results of our modeling efforts. Does our model solve the problems that we outlined back during step 1, Business Understanding? Why or why not? Evaluating the results of our modeling step will often raise new questions, or cause us to consider changing our approach to the problem.

In the CRISP-DM diagram, the "Evaluation" step is unique in that it points to both Business Understanding and Deployment. Data science is an iterative process: given the new information our model has provided, we'll often want to start over with another iteration, armed with our newfound knowledge.

Perhaps the results of our model showed us something important about the goal or the scope of the project that we had originally failed to consider. Perhaps we learned that the model can't be successful without more data, or different data. Perhaps our evaluation shows us that we should reconsider our approach to cleaning and structuring the data, or how we frame the project as a whole (e.g. realizing we should treat the problem as a classification rather than a regression task). In any of these cases, revisiting the earlier steps is encouraged.

## DEPLOYMENT

During this stage, we'll focus on moving our model into production and automating as much as possible. Everything before this serves as a proof-of-concept or an investigation. If the project has proved successful, then you'll work with stakeholders to determine the best way to implement models and insights. For example, you might set up an automated ETL (Extract-Transform-Load) pipeline that feeds raw data into a database and reformats it so that it is ready for modeling. During the deployment step, you'll actively work to determine the best course of action for getting the results of your project into the wild, and you'll often be involved with building everything needed to put the software into production.
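One common way to prepare a model for production is to bundle preprocessing and the estimator into a single object that can be saved and reloaded. The sketch below assumes a scikit-learn `Pipeline` serialized with the standard-library `pickle` module; the synthetic data and step names are hypothetical, and a real deployment might use a different persistence or serving mechanism.

```python
import pickle

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(int)

# Bundle scaling and the model so the same transformations are applied
# at training time and at prediction time.
pipeline = Pipeline(
    [
        ("scale", StandardScaler()),
        ("model", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ]
)
pipeline.fit(X, y)

# Serialize to bytes (this could equally be written to a file) and restore.
blob = pickle.dumps(pipeline)
restored = pickle.loads(blob)

# The restored pipeline should reproduce the original predictions.
same_predictions = bool(np.array_equal(pipeline.predict(X), restored.predict(X)))
print(same_predictions)
```

Pickled models should only be loaded from trusted sources, and they are generally tied to the library versions used to create them, so the environment needs to be pinned alongside the saved artifact.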

*******



Import the necessary libraries

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.metrics import r2_score, roc_auc_score, accuracy_score, precision_score
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler, MinMaxScaler
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline