# Pandas recap

Recall that Pandas is a Python library used to handle datasets in a spreadsheet-like manner, using "DataFrame" objects.  
[See the site used for exercise 1](https://www.w3resource.com/python-exercises/pandas/index.php) for some useful Pandas commands.
![Pandas DataFrames](pandas-data-structure.svg)

Having the Pandas documentation nearby is useful, especially the DataFrame page: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html  

In [None]:
# Import pandas (and optionally numpy) using shorthand alias
import pandas as pd
import numpy as np

Create an example DataFrame with 4 columns and 10 rows from a Python dictionary:

In [None]:
example_dict = {
    'name':     ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'],
    'score':    [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
    'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
    'qualify':  ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']
    }

# Create DataFrame using above data
df = pd.DataFrame(example_dict)
# Print df
df

Display only the first and last columns of your DataFrame:

In [None]:
# Index the DataFrame by position. Syntax:
# [<row id(s)>, <column id(s)>]
# Select all rows        --> :
# Select columns 0 and 3 --> [0, 3]
df.iloc[:, [0,3]]

# Alternative using column names instead of indices:
# df.loc[:, ['name', 'qualify']]
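
As a quick check that the two indexing styles select the same data, here is a minimal sketch on a small made-up table (the `toy` DataFrame and its values are illustrative assumptions, not part of the exercise data). It uses `-1` to refer to the last column by position, which avoids hard-coding the number of columns.

```python
import pandas as pd

# Small made-up table, only for illustration
toy = pd.DataFrame({
    "name": ["Ann", "Bob", "Cleo"],
    "score": [12.5, 9.0, 16.5],
    "qualify": ["yes", "no", "yes"],
})

# Position-based: all rows, first (0) and last (-1) column
by_position = toy.iloc[:, [0, -1]]

# Label-based: all rows, the same two columns by name
by_label = toy.loc[:, ["name", "qualify"]]

# Both selections contain the same columns and values
print(by_position.equals(by_label))
```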

# Machine Learning with scikit-learn
Now let's move on to machine learning with the scikit-learn (`sklearn`) Python library. We will use COVID-19 data for the USA and try to predict the number of deaths knowing only the number of cases (and, later, also the state).

Contents:
* What is Machine Learning?
  * Types of Machine Learning
* Train-Test split
* Use `sklearn` to build a linear regression model*
* One-Hot Encoding
* Pipelines
* Evaluation Metrics


**Disclaimer: Linear regression is not the most suitable algorithm for this dataset, but we are using it to illustrate how to use scikit-learn*

## What is Machine Learning?

If you have had little experience with Machine Learning before, refer to chapters 1 and 2 of today's reading. In short:

* Learning patterns in your data without being explicitly programmed
* A function that maps features to an output

<img src="https://brookewenig.com/img/DL/al_ml_dl.png" style="height: 350px; padding: 10px"/>

## 3 Types of Machine Learning and their subtypes
* Supervised Learning
  * Regression <img src="https://miro.medium.com/max/640/1*LEmBCYAttxS6uI6rEyPLMQ.png" style="height: 150px; padding: 10px"/>
  * Classification
    <img src="https://cdn3-www.dogtime.com/assets/uploads/2018/10/puppies-cover.jpg" style="height: 150px; padding: 10px"/>
    <img src="https://images.unsplash.com/photo-1529778873920-4da4926a72c2?ixlib=rb-1.2.1&w=1000&q=80" style="height: 150px; padding: 10px"/>
* Unsupervised Learning
<img src="https://www.iotforall.com/wp-content/uploads/2018/01/Screen-Shot-2018-01-17-at-8.10.14-PM.png" style="height: 150px; padding: 10px"/>
* Reinforcement Learning
<img src="https://brookewenig.com/img/ReinforcementLearning/Rl_agent.png" style="height: 150px; padding: 10px"/>

We will be referencing the [scikit-learn docs](https://scikit-learn.org/stable/user_guide.html) and [pandas docs](https://pandas.pydata.org/pandas-docs/stable/index.html) where relevant, and will be analyzing data from the New York Times COVID-19 US States dataset from https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv

Today we're going to start simple and focus on a supervised learning (regression) problem. Here we will use a linear regression model to predict the number of deaths resulting from COVID-19.

In [None]:
# Load csv directly from the internet
# Overwrite the previously defined 'df' variable
df = pd.read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv")
df


We see the first and last 5 rows of the DataFrame above. For instance, on 2020-01-21 there was 1 case and 0 deaths in the state of Washington (the FIPS state codes are not relevant here).

We check the number of rows and columns in the DataFrame:

In [None]:
df.shape

## Relationship between Cases & Deaths

Let's make a scatterplot with one point per state, showing its two variables: the number of cases and the number of COVID-19 deaths in the state.

In [None]:
# Filter to 2020-05-01
# Use dataframe mask to do so. First get the date column, then make a condition:
mask_05_01 = df["date"] == "2020-05-01"
df_05_01 = df[mask_05_01]

# Create an Axes object using Matplotlib indirectly
# Matplotlib naming explained: https://www.machinelearningplus.com/wp-content/uploads/2019/01/99_matplotlib_structure.png
ax = df_05_01.plot(x="cases", y="deaths", kind="scatter",
                   figsize=(12,8), s=100, title="Deaths vs Cases on 2020-05-01 - All States")

# Set a label for each point:
# iterate over all rows, unpacking each row into (cases, deaths, state),
# and pass these values to ax.text to label the observation
for index, (cases, deaths, state) in df_05_01[["cases", "deaths", "state"]].iterrows():

    # Set text for every (x,y) point in plot (cases, deaths) using "state" as text
    ax.text(cases, deaths, state)

## New York & New Jersey are Outliers

In [None]:
# Filter to states that are NOT New York and NOT New Jersey
mask_not_ny_nj = (df["state"] != "New York") & (df["state"] != "New Jersey") 

# Filter out unwanted rows using mask
not_ny = df[ mask_not_ny_nj ]

# Display first 5 rows
not_ny.head()
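
When more than a couple of values need to be excluded, chaining `!=` conditions with `&` gets long. A possible alternative, sketched here on made-up data (the `toy` DataFrame is an assumption for illustration), is to combine `isin` with the `~` (NOT) operator, which produces the same mask.

```python
import pandas as pd

# Made-up rows, only for illustration
toy = pd.DataFrame({
    "state": ["New York", "New Jersey", "Ohio", "Texas"],
    "deaths": [10, 5, 2, 3],
})

# Mask built from two explicit conditions, as in the cell above
mask_two_conditions = (toy["state"] != "New York") & (toy["state"] != "New Jersey")

# Equivalent mask: NOT (state is in the list)
mask_isin = ~toy["state"].isin(["New York", "New Jersey"])

# Keep only the rows where the mask is True
filtered = toy[mask_isin]
print(filtered)
```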

In [None]:
# Filter to 2020-05-01
mask_not_ny_05_01 = not_ny["date"] == "2020-05-01"
not_ny_05_01 = not_ny[ mask_not_ny_05_01 ]

# Create an Axes object using Matplotlib indirectly
# Matplotlib naming explained: https://www.machinelearningplus.com/wp-content/uploads/2019/01/99_matplotlib_structure.png
ax = not_ny_05_01.plot(x="cases", y="deaths", kind="scatter", 
                   figsize=(12,8), s=50, title="Deaths vs Cases on 2020-05-01 - All States but NY and NJ")

# Iterate over all rows, unpacking each row into (cases, deaths, state),
# and pass these values to ax.text to label the observation
for index, (cases, deaths, state) in not_ny_05_01[["cases", "deaths", "state"]].iterrows():

    # Set text for every (x,y) point in plot (cases, deaths) using "state" as text
    ax.text(cases, deaths, state)

## New York versus California COVID-19 deaths comparison
Let's plot how the number of deaths in these 2 states changes over time with a line plot. Let's first use a mask (filter) to get only the two states:

In [None]:
# Mask with boolean condition again
mask_ny_cali = (df["state"] == "New York") | (df["state"] == "California") 
df_ny_cali = df[ mask_ny_cali ]

# Get last observations
df_ny_cali.tail()

It does not look like it's easy to make a plot when the DataFrame has this structure. We need a structure like this:  

    date    	California	New York  
    2022-08-30	94973.0	       70374.0

That would make it easy to plot the change over time. To do that, let's "pivot" our df_ny_cali DataFrame so that we can plot deaths over time for both states:

In [None]:
df_ny_cali_pivot = df_ny_cali.pivot(index='date', columns='state', values='deaths')

# Fill missing values with 0
df_ny_cali_pivot = df_ny_cali_pivot.fillna(0)
df_ny_cali_pivot
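
To see what `pivot` and `fillna` do on a small scale, here is a minimal sketch with made-up numbers (the `long_df` values are assumptions, not real counts). Each distinct value in the `state` column becomes its own column, and a date with no row for a state ends up as a missing value that `fillna(0)` replaces.

```python
import pandas as pd

# "Long" format: one row per (date, state) pair; values are made up
long_df = pd.DataFrame({
    "date": ["2020-03-01", "2020-03-01", "2020-03-02"],
    "state": ["California", "New York", "New York"],
    "deaths": [1, 2, 3],
})

# "Wide" format: one row per date, one column per state
wide_df = long_df.pivot(index="date", columns="state", values="deaths")

# California has no row for 2020-03-02, so that cell is NaN until we fill it
wide_df = wide_df.fillna(0)
print(wide_df)
```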

Voilà! Now we are ready to plot:

In [None]:
df_ny_cali_pivot.plot(kind='line',
                    title="Deaths over time - CA and NY",
                    figsize=(12,8))

## Train-Test Split (aka Train-Dev split, aka Train-Validation split)
First we need to prepare our dataset before we go into linear regression.  
*For more information on this, see page 51 in Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems.*  
Basically, we set aside a part of our dataset that is used to see how good our model is.

![](https://brookewenig.com/img/IntroML/trainTest.png)

Often, you would select a random part of the data as the test set.  
Because this is temporal data, instead of doing a random split, we will use data from March 1 to April 7, 2020 to train our model, and test our model by predicting values for April 8 - 14, 2020.

Apart from splitting the data into training data and test data, we also split each of them into two pandas objects, the features (X) and the label (y):

In [None]:
# Train on March 1 to April 7, 2020; test on April 8 to 14, 2020
train_df = df[(df["date"] >= "2020-03-01") & (df["date"] <= "2020-04-07")]
test_df = df[(df["date"] >= "2020-04-08") & (df["date"] <= "2020-04-14")]

# Split the training data into features (a DataFrame) and label (a Series)
X_train = train_df[["cases"]]   # 1 feature
y_train = train_df["deaths"]    # target label to predict

# Do the same for our test data
X_test = test_df[["cases"]]
y_test = test_df["deaths"]
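
The date filters compare plain strings. That works here because dates written as `YYYY-MM-DD` sort alphabetically in the same order as chronologically. The sketch below, on a made-up table (the `toy` DataFrame is an assumption for illustration), shows a temporal split done with string comparisons and confirms that parsing the column with `pd.to_datetime` first gives the same training rows.

```python
import pandas as pd

# Made-up rows around the split boundaries, only for illustration
toy = pd.DataFrame({
    "date": ["2020-02-28", "2020-03-01", "2020-04-07",
             "2020-04-08", "2020-04-14", "2020-04-15"],
    "cases": [1, 2, 30, 40, 50, 60],
})

# Temporal split using string comparison on ISO-formatted dates
train = toy[(toy["date"] >= "2020-03-01") & (toy["date"] <= "2020-04-07")]
test = toy[(toy["date"] >= "2020-04-08") & (toy["date"] <= "2020-04-14")]

# The same training split after parsing the strings into real timestamps
parsed = pd.to_datetime(toy["date"])
train_dt = toy[(parsed >= pd.Timestamp("2020-03-01")) & (parsed <= pd.Timestamp("2020-04-07"))]

print(train)
print(test)
```

Parsing to timestamps is the safer choice when the date strings are not in ISO format, since other formats do not sort chronologically as text.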

Let's investigate what these objects contain:

In [None]:
X_train.head(2)

In [None]:
y_train.head(2)

## Linear Regression
If we know the number of cases on a given day, can we predict the number of deaths?  
Let's build a linear regression *model* to do that. It finds a linear relationship between deaths and COVID-19 cases. Is it true that more cases -> more deaths? We will see.   
* Goal: Find a line that best fits our set of datapoints. Equation for the line:
$$\hat{y} = w_0 + w_1x$$


$$y = \hat{y} + \epsilon$$
* *x*: feature (cases)
* *y*: label (deaths)
* $\hat{y}$: predicted label
* $\epsilon$: error, the part of *y* that the line does not explain

For example, here's some random data in a scatterplot, where 'x' is used to predict 'y' by reading off the position on the line: 

![](https://miro.medium.com/max/640/1*LEmBCYAttxS6uI6rEyPLMQ.png)
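
To make "best fits" concrete: ordinary least squares picks $w_0$ and $w_1$ so that the sum of squared errors is as small as possible. For a single feature this has a closed-form solution, sketched below with NumPy on made-up points (the arrays `x` and `y` are assumptions chosen to lie exactly on a line, so the result is easy to check by hand).

```python
import numpy as np

# Made-up points that lie exactly on the line y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Closed-form least squares solution for one feature:
#   w1 = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
#   w0 = mean(y) - w1 * mean(x)
w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0 = y.mean() - w1 * x.mean()

# Predictions from the fitted line
y_hat = w0 + w1 * x
print(f"y_hat = {w0:.1f} + {w1:.1f}*x")
```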

Here we will be fitting a [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) model from the scikit-learn python library.

In [None]:
from sklearn.linear_model import LinearRegression

# Create a LinearRegression object 
lr_model = LinearRegression()

# Fit (train) the model by giving it the training data
# (It learns the relationship between X and y)
lr_model.fit(X_train, y_train)

print('The equation to calculate the # of deaths:')
print(f"deaths = {lr_model.intercept_:.4f} + {lr_model.coef_[0]:.4f}*cases")

This is our linear regression model, which can be used to predict the number of deaths from the number of cases.  
What happens if we have 0 cases? How many deaths would that equal?  
Hmmm... if we have no cases, then there should be no deaths caused by COVID-19, so let's set the intercept to be 0.

In [None]:
# Set intercept to 0 with fit_intercept=False
lr_model = LinearRegression(fit_intercept=False).fit(X_train, y_train)
print(f"deaths = {lr_model.coef_[0]:.4f}*cases")
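
With the intercept fixed at 0, the line is forced through the origin and only the slope is learned. For a single feature, the least squares slope is then $w_1 = \sum x_i y_i / \sum x_i^2$. The sketch below checks this against scikit-learn on made-up numbers (the arrays `X` and `y` are assumptions for illustration, not real counts).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: one feature (cases) and one label (deaths)
X = np.array([[100.0], [200.0], [400.0]])
y = np.array([3.0, 8.0, 15.0])

# Fit a line through the origin
model = LinearRegression(fit_intercept=False).fit(X, y)

# Slope from the closed-form formula for a no-intercept fit
w1_formula = np.sum(X[:, 0] * y) / np.sum(X[:, 0] ** 2)

print(f"sklearn slope: {model.coef_[0]:.4f}, formula slope: {w1_formula:.4f}")
print(f"intercept: {model.intercept_}")
```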

So this model tells us the mortality rate. But we know that some states have higher mortality rates than others.   
Right now we only used 1 feature (cases) to predict the number of deaths. Let's also include the state as a feature!

## One-Hot Encoding
How do we handle non-numeric features, such as the state?  
Imagine if we only used *state* as a feature. Maybe the model would look like this:  
$$deaths = 0.0355*state$$

But! We cannot multiply by the state, e.g. "Washington", since it is a categorical feature.

One idea is to map each state to a number. For example:
  * 'New York' = 1, 'California' = 2, 'Louisiana' = 3
  
BUT this implies $$New\ York*2 = California$$

Better idea:
* Create a ‘dummy’ feature for each category (state)
* 'New York' => [1, 0, 0], 'California' => [0, 1, 0], 'Louisiana' => [0, 0, 1]

This technique is known as ["One Hot Encoding"](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)
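
Here is a minimal sketch of the idea on the three example states (the `toy` DataFrame is an assumption for illustration). Note that the encoder orders the categories alphabetically, so the position of the 1 differs from the hand-written vectors above, while the principle of one column per state stays the same.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up column with three distinct states (New York appears twice)
toy = pd.DataFrame({"state": ["New York", "California", "Louisiana", "New York"]})

encoder = OneHotEncoder()

# The default output is a sparse matrix; toarray() makes it a regular NumPy array
encoded = encoder.fit_transform(toy[["state"]]).toarray()

print(encoder.categories_[0])   # column order of the encoded array
print(encoded)
```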

In [None]:
from sklearn.preprocessing import OneHotEncoder

X_train = train_df[["cases", "state"]]
X_test = test_df[["cases", "state"]]

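# Note: scikit-learn >= 1.2 names this parameter sparse_output; older versions call it sparse
enc = OneHotEncoder(handle_unknown='ignore', sparse_output=False)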

# Encode all columns in the training data:
encoded_states = enc.fit(X_train).transform(X_train)

Let's check the shape

In [None]:
print("Before transforming we had", X_train.shape[0], "rows (observations) and", X_train.shape[1], "columns")
print("After transforming using one hot encoding we have", encoded_states.shape[0], "rows (observations) and", encoded_states.shape[1], "columns")

Yikes, way too many columns now! The encoder one-hot encoded the (numerical) cases variable too. We only wanted to encode the categorical feature (state).

We need sklearn's [column transformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) to only apply the one hot encoding to a single column.  
In Sklearn, the One Hot encoder `enc` we created above is called a transformer, since it has a `transform` method.

**Useful Sklearn jargon definitions:**  
*estimator(s)*: An object which manages the estimation and decoding of a model. Estimators must provide a fit method, and should provide set_params and get_params, although these are usually provided by inheritance from base.BaseEstimator.

*transformers*: An estimator supporting transform and/or fit_transform.

*predictor(s)*:
An estimator supporting predict and/or fit_predict. This encompasses classifier, regressor, outlier detector and clusterer.
Not to be confused with: In statistics, “predictors” refers to features.

*regressor(s)*:
A supervised (or semi-supervised) predictor with continuous output values.  
Regressors usually inherit from base.RegressorMixin, which sets their _estimator_type attribute.  
A regressor can be distinguished from other estimators with is_regressor.

A regressor must implement:
* fit
* predict
* score

See above and more with links on: https://scikit-learn.org/stable/glossary.html

We transform only the "state" column in X_train :

In [None]:
from sklearn.compose import ColumnTransformer

ct = ColumnTransformer([("encoder transformer", enc, ["state"])], remainder="passthrough")

# Print the result of the above transformation nicely using pandas:
pd.DataFrame(ct.fit_transform(X_train), dtype=int)

Before we had 2 columns in X_train; cases and state.  
How many features (columns) do we have now? 55+1! What are the 55 new ones? Each row has a single 1 in the column of its own state and zeroes in all the other state columns. Let's verify the count by printing how many states we have in the *training* dataset:

In [None]:
print("There are", len( X_train['state'].unique() ), "unique states in the training dataset")

So for every state, we have a new column. Let's use these columns to build an even better linear regression model that takes both state and cases into account when predicting the number of deaths.
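
Because the encoder only learns the states that appear in the training data, a state that first shows up at prediction time has no column of its own. With `handle_unknown='ignore'`, such a state is encoded as all zeroes instead of raising an error. The sketch below illustrates this on made-up data (both small DataFrames are assumptions for illustration).

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up data: "Guam" appears only in the test set
train_states = pd.DataFrame({"state": ["New York", "California"]})
test_states = pd.DataFrame({"state": ["California", "Guam"]})

# Learn the categories from the training data only
encoder = OneHotEncoder(handle_unknown="ignore").fit(train_states)

# Columns are [California, New York]; the unseen state becomes a row of zeroes
encoded_test = encoder.transform(test_states).toarray()
print(encoded_test)
```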

## Pipelines

We combine the data transformation step we did above with the linear model we created earlier in a "pipeline". Data "flows" from top to bottom. The beauty of pipelines is the modularity and clarity for the whole data science process.    
We can chain together a series of data transformations with a [pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html). This way we also ensure that whatever operations we apply to our training set, we also apply in the same order to our test set.

In [None]:
from sklearn.pipeline import Pipeline

# Define the pipeline steps. 
pipeline = Pipeline(steps=[
                            ("ct", ct),         # step 1 in pipeline: transform the columns we specified above (one-hot encode state)
                            ("lr", lr_model)    # step 2 in pipeline: Fit the linear model using the transformed data
                        ])


pipeline_model = pipeline.fit(X_train, y_train)

y_pred = pipeline_model.predict(X_test)

We now have a new model that contains a coefficient for each column.  
Thus, when calculating the number of deaths, the model uses the coefficient of each state column as a "weight" that shifts the prediction for that state.

In [None]:
# Get step 1 of the pipeline (a (name, estimator) tuple), then element 1 of that tuple (the LinearRegression object), then its coef_ attribute.
print("Our linear model contains", len(pipeline_model.steps[1][1].coef_), "coefficients")

In [None]:
# Try to print these one at a time to understand above code:
# print(pipeline_model.steps[1])
# print(pipeline_model.steps[1][1])
# print(pipeline_model.steps[1][1].coef_)
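
Indexing into `steps` by position works, but a pipeline step can also be looked up by the name it was given, through the `named_steps` attribute. The sketch below builds a small stand-alone pipeline on made-up data (all `toy_` names and values are assumptions for illustration) and shows that both ways return the same fitted object.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up data: 2 states, 4 observations
X = pd.DataFrame({
    "cases": [10, 20, 30, 40],
    "state": ["New York", "New York", "California", "California"],
})
y = pd.Series([1, 2, 1, 2])

# One-hot encode "state", pass "cases" through unchanged
toy_ct = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["state"])],
    remainder="passthrough",
)
toy_pipeline = Pipeline(steps=[("ct", toy_ct), ("lr", LinearRegression(fit_intercept=False))])
toy_pipeline.fit(X, y)

# Two ways to reach the fitted linear model inside the pipeline
lr_by_position = toy_pipeline.steps[1][1]
lr_by_name = toy_pipeline.named_steps["lr"]

# One coefficient per state column plus one for "cases"
n_coefficients = len(lr_by_name.coef_)
print(n_coefficients)
```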

## Evaluation Metrics
To measure how good our regression model is, we can use the following metrics:

<img src="https://brookewenig.com/img/IntroML/RMSE.png" style="height: 150px; padding: 10px"/>

So how "good" is our regression model?  
Let's compute the MSE and RMSE for our test dataset using [sklearn.metrics](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html?highlight=mean_squared_error).


In [None]:
from sklearn.metrics import mean_squared_error
import numpy as np

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print(f"MSE is {mse:.1f}, RMSE is {rmse:.1f}")
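
To see what these numbers mean, the MSE is the average of the squared differences between true and predicted values, and the RMSE is its square root, which brings the error back to the unit of the label (deaths). The sketch below computes both by hand on made-up values (the arrays are assumptions for illustration) and compares the result with scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true values and predictions
y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 9.0])

# Errors are 1, 0 and -2, so the squared errors are 1, 0 and 4
mse_manual = np.mean((y_true - y_hat) ** 2)
rmse_manual = np.sqrt(mse_manual)

# Same MSE computed by scikit-learn
mse_sklearn = mean_squared_error(y_true, y_hat)

print(f"manual MSE: {mse_manual:.3f}, sklearn MSE: {mse_sklearn:.3f}, RMSE: {rmse_manual:.3f}")
```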

The lower the error, the better. If we had spent time on optimizing our model, we could perhaps lower the error.

## Compare Predictions

It is difficult to say if this is a "good" model or not. Let's also judge it by some example predictions. We show the test dataset's true deaths and compare it to what the linear model predicted. 

In [None]:
# Concatenate the test dataset (what we hope to predict as close as possible) with the predictions made by our model:
pred = pd.concat([ 
                    test_df.reset_index(drop=True), 
                    pd.DataFrame(y_pred, columns=["predicted_deaths"])
                ], axis=1)
pred
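
One possible way to make such a comparison easier to scan is to add an error column and sort by it, so the worst predictions appear first. This is a sketch on a made-up table (the `toy_pred` DataFrame and the `abs_error` column name are assumptions for illustration, not results from the model above).

```python
import pandas as pd

# Made-up true and predicted values
toy_pred = pd.DataFrame({
    "state": ["Ohio", "Texas", "Utah"],
    "deaths": [10, 50, 5],
    "predicted_deaths": [12.0, 40.0, 5.5],
})

# Absolute difference between the true and the predicted value
toy_pred["abs_error"] = (toy_pred["deaths"] - toy_pred["predicted_deaths"]).abs()

# Largest errors first
worst_first = toy_pred.sort_values("abs_error", ascending=False)
print(worst_first)
```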

Voilà! You have successfully built a machine learning pipeline using scikit-learn!

This tutorial was based on https://github.com/databricks/tech-talks/blob/master/2020-04-22%20%7C%20Machine%20Learning%20with%20scikit-learn/Machine%20Learning%20with%20scikit-learn.ipynb