# Data Programming in Python | BAIS:6040
# Advanced Data Analytics: Machine Learning Pipelines

Instructor: Jeff Hendricks

Topics to be covered:
- Building a machine learning pipeline for regression
- Exercises to build a pipeline for classification

References:
- scikit-learn documentation (http://scikit-learn.org/stable/documentation.html)
- Introduction to Machine Learning with Python (http://shop.oreilly.com/product/0636920030515.do)


### Importing Modules

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import joblib

### Load the Dataset & Feature Engineering
- Add the Month and MonthText features, engineered from the Date column

In [2]:
# Load data & add month feature
weather = pd.read_csv('../../Data/weather.csv', parse_dates=['Date'])

weather['Month'] = pd.Categorical(weather.Date.dt.month)
weather['MonthText'] = pd.Categorical(weather.Date.dt.month_name())

weather.head(2)

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,...,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RISK_MM,RainTomorrow,Month,MonthText
0,2007-11-01,Canberra,8.0,24.3,0.0,3.4,6.3,NW,30.0,SW,...,1015.0,7,7,14.4,23.6,No,3.6,Yes,11,November
1,2007-11-02,Canberra,14.0,26.9,3.6,4.4,9.7,ENE,39.0,E,...,1008.4,5,3,17.5,25.7,Yes,3.6,Yes,11,November
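
Both month features come from the pandas `.dt` accessor on a datetime column. A minimal sketch on a hypothetical two-row frame (the `demo` name and its dates are illustrative and not taken from weather.csv):

```python
import pandas as pd

# Hypothetical two-row frame; the notebook itself reads weather.csv instead.
demo = pd.DataFrame({"Date": pd.to_datetime(["2007-11-01", "2008-02-15"])})

# .dt.month gives the month number, .dt.month_name() the English month name.
demo["Month"] = pd.Categorical(demo["Date"].dt.month)
demo["MonthText"] = pd.Categorical(demo["Date"].dt.month_name())

print(demo.dtypes)
print(demo)
```

Storing the values as `pd.Categorical` signals that the month is a label rather than a quantity, which is why MonthText is later treated as a categorical predictor.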


In [3]:
weather.isna().sum()

Date              0
Location          0
MinTemp           0
MaxTemp           0
Rainfall          0
Evaporation       0
Sunshine          3
WindGustDir       3
WindGustSpeed     2
WindDir9am       31
WindDir3pm        1
WindSpeed9am      7
WindSpeed3pm      0
Humidity9am       0
Humidity3pm       0
Pressure9am       0
Pressure3pm       0
Cloud9am          0
Cloud3pm          0
Temp9am           0
Temp3pm           0
RainToday         0
RISK_MM           0
RainTomorrow      0
Month             0
MonthText         0
dtype: int64

### Prepare Data for Modeling

- Divide the data into train and test subsets

In [4]:
from sklearn.model_selection import train_test_split

# Names of different columns
categorical_cols = ["WindGustDir", "RainToday", "MonthText"]
continuous_cols = ["Sunshine", "Humidity3pm", "MaxTemp"]

predictor_cols = categorical_cols + continuous_cols
target_col = "Rainfall"

X = weather[predictor_cols]
y = weather[target_col]

# test_size defaults to 0.25, so about 75% of the rows go to the training set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [5]:
X_train.shape

(274, 6)
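
When neither `test_size` nor `train_size` is passed, `train_test_split` holds out 25% of the rows. The sketch below uses synthetic data; the 366-row size is an assumption chosen because it is consistent with the 274 training rows shown above, not a value read from weather.csv.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; 366 rows is an assumption, not read from weather.csv.
X_demo = np.arange(366 * 2).reshape(366, 2)
y_demo = np.arange(366)

# With no test_size given, 25% of the rows (rounded up) go to the test set.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

print(X_tr.shape, X_te.shape)
```

Fixing `random_state` makes the shuffle reproducible, so the same rows land in each subset on every run.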

### Simple Imputer to Handle Missing Values
- This differs from last week's approach, where we handled the missing values during data exploration

Reference
- https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html

In [6]:
from sklearn.impute import SimpleImputer

imp_mean = SimpleImputer(strategy='mean')
imp_mf = SimpleImputer(strategy='most_frequent')

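# Fit the imputers on the training data only
imp_mean.fit(X_train.Sunshine.values.reshape(-1, 1))
imp_mf.fit(X_train.WindGustDir.values.reshape(-1, 1))

# Work on a copy to avoid a possible pandas SettingWithCopyWarning
X_train = X_train.copy()
X_train['WindGustDir'] = imp_mf.transform(X_train.WindGustDir.values.reshape(-1, 1))[:, 0]
X_train['Sunshine'] = imp_mean.transform(X_train.Sunshine.values.reshape(-1, 1))[:, 0]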

In [7]:
X_train.shape

(274, 6)

In [8]:
X_train.Sunshine.values.shape

(274,)

In [9]:
X_train.Sunshine.values.reshape(-1, 1).shape

(274, 1)

In [10]:
X_train.Sunshine.values.reshape(-1, 1)[:,0].shape

(274,)
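
The `reshape(-1, 1)` calls are there because `SimpleImputer` expects 2-D input, and `[:, 0]` turns the 2-D result back into a single column. Selecting with double brackets keeps the data 2-D without a reshape. A sketch on a hypothetical three-row frame (the `train` and `filled` names are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical training frame with one missing value per column.
train = pd.DataFrame({
    "Sunshine": [6.0, np.nan, 10.0],
    "WindGustDir": ["NW", "NW", np.nan],
})

# Double brackets keep the selection 2-D, so no reshape is needed.
imp_mean = SimpleImputer(strategy="mean").fit(train[["Sunshine"]])
imp_mf = SimpleImputer(strategy="most_frequent").fit(train[["WindGustDir"]])

filled = train.copy()  # work on a copy so the original frame is untouched
filled["Sunshine"] = imp_mean.transform(train[["Sunshine"]])[:, 0]
filled["WindGustDir"] = imp_mf.transform(train[["WindGustDir"]])[:, 0]

print(filled)
```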

### Build a Pipeline for Regression Modeling with Ridge

- Ridge is least squares with L2 regularization
- We're using the default parameters for the Ridge regressor
- Notice that we drop the first category of each feature in the one-hot encoding. This is common for linear models because it avoids perfectly collinear indicator columns
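
The effect of dropping the first category is easiest to see on a tiny example. This is a sketch on a hypothetical one-column array; it calls `.toarray()` on the default sparse output rather than passing a dense-output flag, since the name of that flag differs between scikit-learn versions.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single categorical column with three distinct values.
X_cat = np.array([["N"], ["S"], ["E"], ["S"]])

# drop='first' removes the first category (in sorted order) of each feature,
# so k categories become k - 1 indicator columns.
enc = OneHotEncoder(drop="first")
encoded = enc.fit_transform(X_cat).toarray()  # default output is sparse

print(enc.categories_)  # categories found, in sorted order
print(encoded)
```

Here "E" sorts first, so it becomes the baseline: a row of all zeros means "E", and the two remaining columns stand for "N" and "S".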

References

- sklearn.linear_model.Ridge: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html

- SimpleImputer: https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html
- OneHotEncoder: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
- MinMaxScaler: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html

In [11]:
from sklearn.model_selection import train_test_split

# Names of different columns (same as In [4]); re-splitting gives the pipeline
# the original training data, without the manual imputation from In [6]
categorical_cols = ["WindGustDir", "RainToday", "MonthText"]
continuous_cols = ["Sunshine", "Humidity3pm", "MaxTemp"]

predictor_cols = categorical_cols + continuous_cols
target_col = "Rainfall"

X = weather[predictor_cols]
y = weather[target_col]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [12]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge

num_transformer = Pipeline(steps=[('impute', SimpleImputer(strategy='mean'))
                                 ,('scale', MinMaxScaler())])

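# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; newer releases need sparse_output=False
cat_transformer = Pipeline(steps=[('impute',SimpleImputer(strategy='most_frequent'))
                                 ,('enc', OneHotEncoder(sparse=False, drop='first', handle_unknown='error'
                                                        ,dtype=np.int32))])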


preprocessor = ColumnTransformer(transformers=[('num', num_transformer, continuous_cols),
                                               ('cat', cat_transformer, categorical_cols)]
                                 ,remainder='passthrough')

pipe_ridge = Pipeline(steps=[('preprocess', preprocessor)
                            ,('rgr', Ridge())])

pipe_ridge.steps

[('preprocess',
  ColumnTransformer(remainder='passthrough',
                    transformers=[('num',
                                   Pipeline(steps=[('impute', SimpleImputer()),
                                                   ('scale', MinMaxScaler())]),
                                   ['Sunshine', 'Humidity3pm', 'MaxTemp']),
                                  ('cat',
                                   Pipeline(steps=[('impute',
                                                    SimpleImputer(strategy='most_frequent')),
                                                   ('enc',
                                                    OneHotEncoder(drop='first',
                                                                  dtype=<class 'numpy.int32'>,
                                                                  sparse=False))]),
                                   ['WindGustDir', 'RainToday', 'MonthText'])])),
 ('rgr', Ridge())]
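
The step names are more than labels: they can be used to look up a step or to change one of its parameters. A minimal sketch on a smaller, hypothetical pipeline that reuses the 'rgr' step name (the value 10.0 is arbitrary and only for illustration):

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Smaller hypothetical pipeline that reuses the 'rgr' step name from above.
pipe_demo = Pipeline(steps=[("scale", MinMaxScaler()), ("rgr", Ridge())])

# Steps are reachable by name; Ridge's default regularization strength is 1.0.
print(pipe_demo.named_steps["rgr"].alpha)

# Nested parameters use the <step name>__<parameter> convention.
pipe_demo.set_params(rgr__alpha=10.0)  # 10.0 is an arbitrary illustrative value
print(pipe_demo.named_steps["rgr"].alpha)
```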

#### Each component of the pipeline can be fit & transformed

In [13]:
print('X_train shape ', X_train.shape)
print('Preprocessed shape ', preprocessor.fit_transform(X_train).shape)

preprocessor.fit_transform(X_train)

X_train shape  (274, 6)
Preprocessed shape  (274, 30)


array([[0.625     , 0.5060241 , 0.37234043, ..., 0.        , 0.        ,
        0.        ],
       [0.42647059, 0.54216867, 0.28014184, ..., 0.        , 0.        ,
        0.        ],
       [0.68382353, 0.27710843, 0.37234043, ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.85294118, 0.1686747 , 0.73049645, ..., 0.        , 0.        ,
        0.        ],
       [0.63970588, 0.36144578, 0.47163121, ..., 0.        , 0.        ,
        0.        ],
       [0.74264706, 0.25301205, 0.46099291, ..., 0.        , 0.        ,
        0.        ]])

In [14]:
print('X_train continuous shape ', X_train[continuous_cols].shape)
print('num_transformer shape ', num_transformer.fit_transform(X_train[continuous_cols]).shape)

num_transformer.fit_transform(X_train[continuous_cols])

X_train continuous shape  (274, 3)
num_transformer shape  (274, 3)


array([[0.625     , 0.5060241 , 0.37234043],
       [0.42647059, 0.54216867, 0.28014184],
       [0.68382353, 0.27710843, 0.37234043],
       [0.83823529, 0.04819277, 0.87588652],
       [0.75      , 0.26506024, 0.64893617],
       [0.77205882, 0.25301205, 0.93971631],
       [0.83088235, 0.02409639, 0.92907801],
       [0.95588235, 0.19277108, 0.71985816],
       [0.88970588, 0.08433735, 0.70212766],
       [0.57352941, 0.42168675, 0.45744681],
       [0.69852941, 0.30120482, 0.14184397],
       [0.58088235, 0.5060241 , 0.4822695 ],
       [0.74264706, 0.15662651, 0.79432624],
       [0.19852941, 0.62650602, 0.26595745],
       [0.        , 0.96385542, 0.34751773],
       [0.44117647, 0.36144578, 0.41843972],
       [0.55882353, 0.37349398, 0.22695035],
       [0.66176471, 0.34939759, 0.32269504],
       [0.43382353, 0.56626506, 0.58865248],
       [0.86029412, 0.13253012, 0.26241135],
       [0.125     , 0.78313253, 0.16666667],
       [0.72794118, 0.56626506, 0.06028369],
       ...

In [15]:
print('X_train categorical shape ', X_train[categorical_cols].shape)
print('cat_transformer shape ', cat_transformer.fit_transform(X_train[categorical_cols]).shape)

cat_transformer.fit_transform(X_train[categorical_cols])

X_train categorical shape  (274, 3)
cat_transformer shape  (274, 27)


array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int32)
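
The jump from 3 categorical columns to 27 encoded columns follows from the encoding rule: with `drop='first'`, each feature contributes its number of categories minus one. A count of 27 would be consistent with 16 wind directions, 2 RainToday values, and 12 months (15 + 1 + 11), although the notebook does not print the category counts. The sketch below checks the rule on a hypothetical frame:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical frame: 3 wind directions, 2 RainToday values, 2 months.
cats = pd.DataFrame({
    "WindGustDir": ["NW", "E", "S", "E"],
    "RainToday": ["No", "Yes", "No", "No"],
    "MonthText": ["May", "May", "June", "June"],
})

enc = OneHotEncoder(drop="first").fit(cats)

# With drop='first', each feature yields (number of categories - 1) columns.
n_expected = sum(len(c) - 1 for c in enc.categories_)
n_actual = enc.transform(cats).shape[1]

print(n_expected, n_actual)
```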

### Fit the Pipeline

- Fitting the pipeline to the training dataset runs fit_transform for each of the preprocessing steps in the pipeline
- It then fits the model algorithm to the transformed training dataset

In [16]:
pipe_ridge.fit(X_train, y_train)

Pipeline(steps=[('preprocess',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('num',
                                                  Pipeline(steps=[('impute',
                                                                   SimpleImputer()),
                                                                  ('scale',
                                                                   MinMaxScaler())]),
                                                  ['Sunshine', 'Humidity3pm',
                                                   'MaxTemp']),
                                                 ('cat',
                                                  Pipeline(steps=[('impute',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('enc',
                                                                   ...

In [17]:
# what's the score (R2) against the training set
pipe_ridge.score(X_train, y_train)

0.5179593839415586

In [18]:
# what's the score (R2) against the test set
pipe_ridge.score(X_test, y_test)

0.4895266992909896

#### Wait, don't I need to transform the test dataset before scoring it?
- Calling the `score` or `predict` method on the pipeline runs the `transform` method of each preprocessing step before scoring or predicting.
- It does not `fit` the pipeline to the test dataset. Fitting on the test data would cause data leakage.
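
The difference between fitting and transforming shows up clearly with a scaler. In this sketch on hypothetical one-column data, the scaler learns its minimum and maximum from the training rows only and then reuses them on the test rows:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical one-column train and test sets.
X_train_demo = np.array([[0.0], [5.0], [10.0]])
X_test_demo = np.array([[2.5], [12.0]])

# fit learns min and max from the training data only.
scaler = MinMaxScaler().fit(X_train_demo)

# transform reuses the training min and max, so a test value above the
# training maximum maps outside the [0, 1] range.
X_test_scaled = scaler.transform(X_test_demo)
print(X_test_scaled)
```

A scaled test value above 1 is expected here and is not a bug: it shows that nothing was learned from the test set.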

In [19]:
import math
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

preds = pipe_ridge.predict(X_test)

mae = mean_absolute_error(y_test, preds)
rmse = math.sqrt(mean_squared_error(y_test, preds))
r2 = r2_score(y_test, preds)
    
print("MAE = {:.3f} | RMSE = {:.3f} | R2 = {:.5f}".format(mae, rmse, r2))

MAE = 1.746 | RMSE = 3.427 | R2 = 0.48953
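
For reference, the three metrics can be written out with NumPy. This sketch uses hypothetical targets and predictions, and compares each hand-computed value with the scikit-learn function used above:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical targets and predictions.
y_true = np.array([0.0, 2.0, 4.0, 6.0])
y_pred = np.array([1.0, 2.0, 3.0, 8.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))         # mean absolute error
rmse = np.sqrt(np.mean(errors ** 2))  # root mean squared error
# R2 = 1 - (residual sum of squares / total sum of squares)
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# Each hand-computed value should agree with scikit-learn.
print(np.isclose(mae, mean_absolute_error(y_true, y_pred)))
print(np.isclose(rmse, np.sqrt(mean_squared_error(y_true, y_pred))))
print(np.isclose(r2, r2_score(y_true, y_pred)))
```

The R2 value is the same quantity that the pipeline's `score` method returns for a regressor, which is why In [18] and In [19] report the same number.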


### Now let's get a prediction

In [20]:
pipe_ridge.predict(X_test.head(1))

array([-0.19935467])

### Serialize The Pipeline

In [21]:
# joblib.dump accepts a file path directly; the target directory must already
# exist, otherwise the call raises FileNotFoundError
joblib.dump(pipe_ridge, "../../Data/ridge_pipe.pkl")
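
A saved pipeline is only useful if it can be loaded again. The round trip can be sketched as follows, assuming a small stand-in pipeline on synthetic data and a temporary directory in place of `../../Data` (the file name `ridge_pipe_demo.pkl` is hypothetical):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Small stand-in pipeline fitted on synthetic data.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0.0, 1.0, 2.0, 3.0])
pipe_demo = Pipeline(steps=[("scale", MinMaxScaler()), ("rgr", Ridge())])
pipe_demo.fit(X_demo, y_demo)

# A temporary directory stands in for ../../Data so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "ridge_pipe_demo.pkl")  # hypothetical file name
    joblib.dump(pipe_demo, path)   # joblib accepts a path directly
    restored = joblib.load(path)

# The restored pipeline should predict exactly like the original.
round_trip_ok = np.allclose(restored.predict(X_demo), pipe_demo.predict(X_demo))
print(round_trip_ok)
```

Because the preprocessing steps are serialized together with the model, the loaded object can score raw, untransformed rows directly.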

# Exercises for Pipeline (10 questions)
- Use the weather dataset and build a pipeline to predict whether it will rain tomorrow using a logistic regression model

1\. Build a list called categorical_cols to indicate you intend to use WindGustDir, RainToday, and MonthText as categorical variables.

In [None]:
# Your answer here


2\. Build a list called continuous_cols to indicate you intend to use Sunshine, Humidity3pm, and MaxTemp as numeric variables. Then concatenate the two lists into a list called predictor_cols

In [None]:
# Your answer here


3\. Assign the name of the column you're trying to predict to a variable called target_col

In [None]:
# Your answer here


4\. Split <i>X</i> (the predictor columns) and <i>y</i> (the target column) into two training datasets <i>X_train</i> and <i>y_train</i> and two test datasets <i>X_test</i> and <i>y_test</i>. Set `test_size` to 0.25 and `random_state` to 0.

In [None]:
# Your answer here


5\. Build a pipeline for __Logistic Regression__. Use the mean to replace missing values of numeric variables and the most frequent value to replace missing values for categorical variables.  The numeric variables should be scaled with MinMaxScaler and the categorical variables should be One Hot Encoded (first value dropped). You can use the default parameter values for Logistic Regression.

In [None]:
# Your answer here


6\. Fit the pipeline to the training dataset.

In [None]:
# Your answer here


7\. What's the accuracy score against the training dataset?

In [None]:
# Your answer here


8\. What's the accuracy score against the test dataset?

In [None]:
# Your answer here


9\. Get a prediction for the first observation in the test dataset.

In [None]:
# Your answer here


10\. What's the predicted probability of rain tomorrow for the first observation in the test dataset?

In [None]:
# Your answer here
