### Module 13-1 Learning Notebook: Intro to ML Pipelines
The goal of this teaching notebook is to explain how a pipeline works in machine learning (ML). <P>
    
A machine learning pipeline is a way to codify and automate the workflow it takes to produce a machine learning model. Machine learning pipelines consist of multiple sequential steps that do everything from data extraction and preprocessing to model training and deployment. This notebook covers only the basics: scaling the features and training a classifier.

**Data:**
    
The data used in this notebook is a simplified gene expression dataset for predicting cancer in people. It is based on this dataset:
- http://archive.ics.uci.edu/ml/datasets/gene+expression+cancer+RNA-Seq

In this data, there are 6 genes, each represented by a floating-point number. The target value is 'cancer_detected', where 0 = false and 1 = true.<P>
    
The goal is to use a classification algorithm to predict cancer based on the values of the genes.<P>
    
Our method:
1. Load, isolate and split the data
2. Define the steps in the pipeline
3. Create the pipeline
4. Use the pipeline to scale your data and train your model
5. Evaluate the result
6. Bring it all together using a different set of steps

In [2]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.pipeline import Pipeline
import boto3
import pandas as pd
import numpy as np

### 1. Load, isolate and split the data

In [3]:
# Load df from S3 .csv
sess = boto3.session.Session()
s3 = sess.client('s3') 
source_bucket = 'machinelearning-read-only'
source_key = 'data/gene-cancer-small.csv'
response = s3.get_object(Bucket=source_bucket, Key=source_key)
df = pd.read_csv(response.get("Body"))
df.head(5)

Unnamed: 0,gene1,gene2,gene3,gene4,gene5,gene6,cancer_detected
0,0.759334,27.342287,118.878384,-29.80047,641.214491,-12.905525,0
1,3.726902,16.190669,122.51987,-56.616092,239.28903,107.212129,1
2,2.234535,19.345805,128.827574,-90.478848,374.4595,55.037188,1
3,4.922451,20.416719,57.906599,-62.897717,398.818805,146.694338,0
4,1.227942,26.41599,87.027782,-38.962616,581.078233,26.624324,1


In [4]:
# Notice the scales of the features
df.describe()

Unnamed: 0,gene1,gene2,gene3,gene4,gene5,gene6,cancer_detected
count,100.0,100.0,100.0,100.0,100.0,100.0,100.0
mean,4.117973,20.732453,82.764879,-50.00312,408.900251,112.913532,0.32
std,1.540117,4.111156,32.476013,21.672628,164.818999,48.643582,0.468826
min,-0.253117,10.72402,15.323295,-127.53962,26.890555,-12.905525,0.0
25%,3.238394,17.776597,61.06833,-60.886868,284.871681,83.468788,0.0
50%,4.58848,20.462301,81.359593,-48.004696,407.236999,120.271416,0.0
75%,5.173925,23.642621,106.615862,-39.194564,526.621934,151.744827,1.0
max,7.146552,29.3893,150.491443,9.13385,813.139351,206.951441,1.0
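The summary above shows features on very different scales (gene1 spans roughly 0 to 7, gene5 roughly 27 to 813). As a minimal sketch of what the two scalers used later in this notebook compute, the plain-Python functions below apply the min-max formula (the default behaviour of `MinMaxScaler`) and the standardization formula (`StandardScaler`, which uses the population standard deviation) to the five gene5 values shown by `df.head(5)`. The function names are hypothetical helpers for illustration, and because only five rows are used, the results differ from what the scalers would produce on the full training set.

```python
from statistics import mean, pstdev

# First five gene5 values, copied from the df.head(5) output
gene5 = [641.214491, 239.28903, 374.4595, 398.818805, 581.078233]


def min_max_scale(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift and scale to mean 0 and population standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


normalized = min_max_scale(gene5)
standardized = standardize(gene5)

print([round(v, 3) for v in normalized])
print([round(v, 3) for v in standardized])
```

Either way, every feature ends up on a comparable scale, so no single gene dominates simply because its raw numbers are larger.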


In [5]:
# Features
X = df.drop(['cancer_detected'],axis = 1)
# Target
y = df['cancer_detected']
# Split into train/test
# Reserve 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20,random_state = 42)
# Verify the sizes of the split datasets
print('X_train:', X_train.shape)
print('y_train:', y_train.shape)
print('X_test:', X_test.shape)
print('y_test:', y_test.shape)

X_train: (80, 6)
y_train: (80,)
X_test: (20, 6)
y_test: (20,)


### 2. Define the steps in the pipeline

In [6]:
# Pipelines consist of sequential steps. They are technically a 'list of tuples',
# where each tuple is (step name, transformer or estimator).
#
#    https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
#
# Step 1: scale the data using the normalization scaler (MinMaxScaler)
# Step 2: fit a logistic regression model
#

norm_scaler = MinMaxScaler()
logReg = LogisticRegression()

steps = [('Normalizer', norm_scaler), ('LogRegClassifier', logReg)]
steps

[('Normalizer', MinMaxScaler(copy=True, feature_range=(0, 1))),
 ('LogRegClassifier',
  LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                     intercept_scaling=1, l1_ratio=None, max_iter=100,
                     multi_class='auto', n_jobs=None, penalty='l2',
                     random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                     warm_start=False))]

### 3. Create the pipeline

In [7]:
# Create the pipeline
pipe = Pipeline(steps)
pipe # Show parameters

Pipeline(memory=None,
         steps=[('Normalizer', MinMaxScaler(copy=True, feature_range=(0, 1))),
                ('LogRegClassifier',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=100,
                                    multi_class='auto', n_jobs=None,
                                    penalty='l2', random_state=None,
                                    solver='lbfgs', tol=0.0001, verbose=0,
                                    warm_start=False))],
         verbose=False)
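The names given in the step tuples are more than labels. As a small sketch, the pipeline below is rebuilt with the same step names so the example is self-contained; it then looks up a step through `named_steps` and changes one of its parameters with `set_params`, using the `<step name>__<parameter>` convention. The variable name `demo_pipe` and the value `C=0.5` are arbitrary choices for illustration, not a recommended setting.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Rebuild a pipeline with the same step names used in this notebook
demo_pipe = Pipeline([
    ('Normalizer', MinMaxScaler()),
    ('LogRegClassifier', LogisticRegression()),
])

# Look up a step by the name given in its tuple
scaler = demo_pipe.named_steps['Normalizer']
print(type(scaler).__name__)

# Change a parameter of one step: '<step name>__<parameter>'
# (C=0.5 is an arbitrary illustrative value)
demo_pipe.set_params(LogRegClassifier__C=0.5)
print(demo_pipe.named_steps['LogRegClassifier'].C)
```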

#### What is happening?
A pipeline is a list of data transforms followed by a final estimator.<P>
    
Remember, our normalization scaler has two methods: **.fit()** and **.transform()**. Every step except the last must provide both.<P>

Calling **.fit()** on the pipeline applies the transforms in order and then fits the final estimator on the transformed data. Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and transform methods. The final estimator only needs to implement fit.
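To make the fit/transform contract concrete, here is a toy sketch in plain Python. It is not scikit-learn's implementation; `ToyMinMaxScaler`, `ToyThresholdClassifier` and `ToyPipeline` are hypothetical classes that work on a single feature, written only to show how the steps are chained.

```python
class ToyMinMaxScaler:
    """Hypothetical one-feature transformer: implements fit and transform."""

    def fit(self, X, y=None):
        self.lo_, self.hi_ = min(X), max(X)
        return self

    def transform(self, X):
        return [(x - self.lo_) / (self.hi_ - self.lo_) for x in X]


class ToyThresholdClassifier:
    """Hypothetical final estimator: predicts 1 above the midpoint of the class means."""

    def fit(self, X, y):
        ones = [x for x, label in zip(X, y) if label == 1]
        zeros = [x for x, label in zip(X, y) if label == 0]
        self.threshold_ = (sum(ones) / len(ones) + sum(zeros) / len(zeros)) / 2
        return self

    def predict(self, X):
        return [1 if x > self.threshold_ else 0 for x in X]


class ToyPipeline:
    """Chains (name, step) tuples: transformers first, one estimator last."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y):
        # Fit each transformer, then pass its output to the next step
        for _, transformer in self.steps[:-1]:
            X = transformer.fit(X, y).transform(X)
        # The last step only needs fit
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        # New data is only transformed, never re-fitted
        for _, transformer in self.steps[:-1]:
            X = transformer.transform(X)
        return self.steps[-1][1].predict(X)


toy = ToyPipeline([('scale', ToyMinMaxScaler()), ('clf', ToyThresholdClassifier())])
toy.fit([10, 20, 30, 40], [0, 0, 1, 1])
predictions = toy.predict([15, 35])
print(predictions)  # [0, 1]
```

The detail to notice is in `predict`: the scaler reuses the minimum and maximum it learned during `fit`, so nothing about the new data leaks into the preprocessing.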

### 4. Use the pipeline to scale your data and train your model

In [8]:
# Now, perform the pipeline steps on the data
# The end result is a trained model
pipe.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('Normalizer', MinMaxScaler(copy=True, feature_range=(0, 1))),
                ('LogRegClassifier',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=100,
                                    multi_class='auto', n_jobs=None,
                                    penalty='l2', random_state=None,
                                    solver='lbfgs', tol=0.0001, verbose=0,
                                    warm_start=False))],
         verbose=False)
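For comparison, the sketch below performs the same two steps by hand and checks that the predictions match the pipeline's. It uses synthetic data from `make_classification` as a stand-in (an assumption, so the example runs without the S3 file); the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption: not the gene dataset)
X_demo, y_demo = make_classification(n_samples=100, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42)

# With a pipeline: one fit call, one predict call
demo_pipe = Pipeline([('Normalizer', MinMaxScaler()),
                      ('LogRegClassifier', LogisticRegression())])
demo_pipe.fit(X_tr, y_tr)
pipe_pred = demo_pipe.predict(X_te)

# By hand: fit the scaler on the training data, then train the model
scaler = MinMaxScaler()
model = LogisticRegression()
X_tr_scaled = scaler.fit_transform(X_tr)
model.fit(X_tr_scaled, y_tr)
# The test data is only transformed with the already-fitted scaler
manual_pred = model.predict(scaler.transform(X_te))

same_predictions = np.array_equal(pipe_pred, manual_pred)
print(same_predictions)
```

The manual version needs the scaled arrays to be tracked and the test set to be transformed separately; the pipeline handles that bookkeeping.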

### 5. Evaluate the result

In [9]:
# Treat the pipe object just like a trained model
y_pred = pipe.predict(X_test)
print('Accuracy:', pipe.score(X_test, y_test))
confusion_matrix(y_test, y_pred)

Accuracy: 0.9


array([[16,  0],
       [ 2,  2]])
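Accuracy alone can hide how the model treats the positive class. The short calculation below reads the confusion matrix printed above, using scikit-learn's layout (rows are actual classes, columns are predicted classes), and derives accuracy, recall and precision from it. The variable names are illustrative.

```python
# Confusion matrix copied from the output above.
# Layout: rows = actual class (0, 1), columns = predicted class (0, 1).
matrix = [[16, 0],
          [2, 2]]

tn, fp = matrix[0]  # actual 0: true negatives, false positives
fn, tp = matrix[1]  # actual 1: false negatives, true positives

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)      # share of actual cancer cases that were detected
precision = tp / (tp + fp)   # share of cancer predictions that were correct

print('Accuracy:', accuracy)
print('Recall:', recall)
print('Precision:', precision)
```

The accuracy matches the 0.9 reported by `pipe.score`, but the recall shows that only 2 of the 4 actual cancer cases in the test set were detected.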

### 6. Bring it all together using a different set of steps
Trying a different scaler or classifier only requires changing the list of steps; the fit, predict and score calls stay the same.

In [10]:
# Different steps: standardize the data and use a gradient boosting classifier (GBC)

stand_scaler = StandardScaler()
GB_classifier = GradientBoostingClassifier()
#
steps = [('std_scaler', stand_scaler), ('gbc', GB_classifier)]
#
# Combine creation and fit together
pipe = Pipeline(steps).fit(X_train, y_train)
# Accuracy
y_pred = pipe.predict(X_test)
print('Accuracy:', pipe.score(X_test, y_test))
confusion_matrix(y_test, y_pred)

Accuracy: 0.95


array([[16,  0],
       [ 1,  3]])
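Because each variant is just a different list of steps, several can be compared in a loop. The sketch below is one possible way to do that, assuming synthetic stand-in data from `make_classification` rather than the gene dataset; the dictionary keys and step names are arbitrary labels.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in data (assumption: not the gene dataset)
X_demo, y_demo = make_classification(n_samples=100, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42)

# Each candidate is a list of (name, step) tuples
candidates = {
    'minmax + logreg': [('scaler', MinMaxScaler()),
                        ('clf', LogisticRegression())],
    'standard + gbc': [('scaler', StandardScaler()),
                       ('clf', GradientBoostingClassifier(random_state=0))],
}

# Build, fit and score every candidate with identical code
scores = {}
for name, steps in candidates.items():
    scores[name] = Pipeline(steps).fit(X_tr, y_tr).score(X_te, y_te)

for name, score in scores.items():
    print(name, score)
```

With a test set of only 20 rows, small differences in accuracy between candidates should be read with caution.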

Those are the basics. A pipeline streamlines the steps needed to prepare the data and to train and evaluate a model.