### Module 13-2 Learning Notebook: Column transformation and Pipelines

This notebook is a step up in complexity, but once you are comfortable with both the ColumnTransformer and the Pipeline, you can write short, efficient code to explore many algorithms quickly.

**Data:**
    
We'll use the same 'gene expression' dataset, except I've included a 'gender' column with it.

The data used in this problem is a simplified "gene expression" dataset for predicting cancer in people. It is 
based on this dataset:

    http://archive.ics.uci.edu/ml/datasets/gene+expression+cancer+RNA-Seq
    
**Method:**
1. Load the data
2. Introduction to a column transformer
3. Column Transformer missing values (impute)
4. Convert categories to numbers
5. Putting it all together: do all transformers at once
6. Combine ColumnTransformer with the use of the Pipeline 

In [2]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer, ColumnTransformer
from sklearn.pipeline import Pipeline
import boto3
import pandas as pd
import numpy as np
# Prevent pandas from displaying in scientific notation
pd.set_option('display.float_format', lambda x: '%.3f' % x)

### 1. Load the data

In [3]:
# Load df from S3 .csv
sess = boto3.session.Session()
s3 = sess.client('s3') 
source_bucket = 'machinelearning-read-only'
source_key = 'data/gene-cancer-gender.csv'
response = s3.get_object(Bucket=source_bucket, Key=source_key)
df = pd.read_csv(response.get("Body"))
# Notice scales, missing data and categorical (string) data
df.head(5)

Unnamed: 0,gene1,gene2,gene3,gene4,gene5,gene6,gender,cancer_detected
0,0.759,27.342,118.878,-29.8,641.214,-12.906,male,0
1,3.727,16.191,122.52,-56.616,239.289,,male,1
2,2.235,19.346,128.828,-90.479,374.46,,female,1
3,4.922,20.417,57.907,-62.898,398.819,,male,0
4,1.228,26.416,87.028,-38.963,581.078,26.624,female,1


In [4]:
df.describe() # This won't include 'gender' since it is not numeric
# Notice gene6 is missing values

Unnamed: 0,gene1,gene2,gene3,gene4,gene5,gene6,cancer_detected
count,100.0,100.0,100.0,100.0,100.0,75.0,100.0
mean,4.118,20.732,82.765,-50.003,408.9,114.232,0.32
std,1.54,4.111,32.476,21.673,164.819,47.338,0.469
min,-0.253,10.724,15.323,-127.54,26.891,-12.906,0.0
25%,3.238,17.777,61.068,-60.887,284.872,86.485,0.0
50%,4.588,20.462,81.36,-48.005,407.237,120.822,0.0
75%,5.174,23.643,106.616,-39.195,526.622,152.787,1.0
max,7.147,29.389,150.491,9.134,813.139,206.951,1.0


### 2. Introduction to a column transformer

A ColumnTransformer applies transformers to one or more columns of a pandas DataFrame.

This estimator allows different columns or column subsets of the input to be transformed separately and the features generated by each transformer will be concatenated to form a single feature space. This is useful for heterogeneous or columnar data, to combine several feature extraction mechanisms or transformations into a single transformer.
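
As a rough illustration of the "concatenated" part, here is a minimal sketch on a made-up two-column frame (`toy`, `toy_ct` and `toy_out` are hypothetical names, not part of the gene dataset). It assumes only that each transformer's output is stacked side by side in the order the transformers are listed, so listing 'b' first puts it in output column 0.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy data (not the gene dataset)
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 40.0]})

# 'b' is listed first, so its standardized values land in output column 0
toy_ct = ColumnTransformer(
    transformers=[
        ("stand", StandardScaler(), ["b"]),
        ("norm", MinMaxScaler(), ["a"]),
    ]
)
toy_out = toy_ct.fit_transform(toy)  # a numpy array, not a DataFrame

print(toy_out.shape)   # one output column per input column here
print(toy_out[:, 1])   # min-max scaled 'a': 0, 0.5, 1
```

The output order depends on the transformer list, not on the column order of the input frame, which matters later when rebuilding a DataFrame from the array.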

#### Review: scale the whole dataset using a scaler:

In [5]:
# Recall from previous lesson on scaling data:
#
X = df.drop(['gender', 'cancer_detected'], axis = 1) # Drop some columns
# This creates a scaler
norm_scaler = MinMaxScaler()
# Compute the minimum and maximum to be used for later scaling.
norm_scaler.fit(X)
# Do the scaling, this returns a numpy array
norm_scaled_array = norm_scaler.transform(X) 
# Create a new data frame from the scaled values
X_scaled = pd.DataFrame(data = norm_scaled_array, columns = X.columns)
X_scaled.describe()

Unnamed: 0,gene1,gene2,gene3,gene4,gene5,gene6
count,100.0,100.0,100.0,100.0,100.0,75.0
mean,0.591,0.536,0.499,0.567,0.486,0.578
std,0.208,0.22,0.24,0.159,0.21,0.215
min,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.472,0.378,0.338,0.488,0.328,0.452
50%,0.654,0.522,0.489,0.582,0.484,0.608
75%,0.733,0.692,0.675,0.646,0.636,0.754
max,1.0,1.0,1.0,1.0,1.0,1.0


#### Scale only one or a set of columns:

In [6]:
# We can do it only on a column or a set of columns
#
# Create a fresh set of the data, only withhold the target column
X = df.drop('cancer_detected', axis = 1)
X.head(3)

Unnamed: 0,gene1,gene2,gene3,gene4,gene5,gene6,gender
0,0.759,27.342,118.878,-29.8,641.214,-12.906,male
1,3.727,16.191,122.52,-56.616,239.289,,male
2,2.235,19.346,128.828,-90.479,374.46,,female


In [7]:
from sklearn.compose import ColumnTransformer

# Identify columns to scale in different ways
norm_columns = ['gene1', 'gene2', 'gene3']
stand_columns = ['gene4', 'gene5'] # Leave gene6 alone for now

# Create the column transformer using the right scaler
ct = ColumnTransformer(
    transformers = [
    ('norm', MinMaxScaler(), norm_columns), # 0-1
    ('stand', StandardScaler(), stand_columns) # Mean = 0, std_dev = 1
    ], 
    remainder = 'passthrough') # Let the other columns ('gene6', 'gender') pass through unchanged

# Prepare the ct to transform the data
ct.fit(X)

ColumnTransformer(n_jobs=None, remainder='passthrough', sparse_threshold=0.3,
                  transformer_weights=None,
                  transformers=[('norm',
                                 MinMaxScaler(copy=True, feature_range=(0, 1)),
                                 ['gene1', 'gene2', 'gene3']),
                                ('stand',
                                 StandardScaler(copy=True, with_mean=True,
                                                with_std=True),
                                 ['gene4', 'gene5'])],
                  verbose=False)

In [8]:
# Perform the transformation
transformed_array = ct.transform(X)  # Returns a numpy array
# Print the first 5
for row in transformed_array[0:5]:
    print(row,'\n')

[0.13682382844443777 0.890330448269027 0.7661204972831208
 0.9368696470604763 1.416612146906083 -12.90552460113543 'male'] 

[0.5378644677950846 0.2928779251687599 0.7930609165447469
 -0.30666730808115206 -1.0342599583793624 nan 'male'] 

[0.33618422107090584 0.4619156392077014 0.8397265238705481
 -1.877005328467732 -0.21001375265276553 nan 'female'] 

[0.6994323337892093 0.5192903128294121 0.31503948534484494
 -0.5979689019582491 -0.0614749160394467 nan 'male'] 

[0.20015209591008581 0.8407037029387529 0.5304836118863425
 0.5119879279002758 1.0499116216970483 26.6243243517391 'female'] 



In [9]:
# This is typically not necessary, but just to show the results, create a dataframe from the array
X_transformed_df = pd.DataFrame(data = transformed_array, columns = X.columns)
# Need to convert these columns from objects to floats
# (infer_objects() leaves the string 'gender' column as an object column):
X_transformed_df = X_transformed_df.infer_objects()
X_transformed_df.describe() # Ignore 'gender' for now. It is not a numeric column
# We have successfully scaled columns 1-3 (normalized) and 4-5 (standardized); 'gene6' and 'gender' were passed through unchanged

Unnamed: 0,gene1,gene2,gene3,gene4,gene5,gene6
count,100.0,100.0,100.0,100.0,100.0,75.0
mean,0.591,0.536,0.499,0.0,-0.0,114.232
std,0.208,0.22,0.24,1.005,1.005,47.338
min,0.0,0.0,0.0,-3.596,-2.329,-12.906
25%,0.472,0.378,0.338,-0.505,-0.756,86.485
50%,0.654,0.522,0.489,0.093,-0.01,120.822
75%,0.733,0.692,0.675,0.501,0.718,152.787
max,1.0,1.0,1.0,2.742,2.465,206.951


### 3. Column Transformer missing values (impute)
We are missing 25 values in the 'gene6' column.

Be aware of how the transformed array is returned below: the columns come back in a different order.

In [10]:
# Impute the missing values using the column transformer
impute_columns = ['gene6']

# Create the column transformer using the imputer
impute_ct = ColumnTransformer(    
    transformers = 
    [
    # This is the same imputer we used back in the lesson on imputing
    ('impute', SimpleImputer(missing_values=np.nan, strategy='mean'), impute_columns)
    ],
    remainder = 'passthrough',
    )

# Prepare the ct to transform the data
impute_ct.fit(X)
# Perform the transformation
transformed_array = impute_ct.transform(X)  # Returns a numpy array
# Show the first 5 rows
for row in transformed_array[0:5]:
    print(row,'\n')
# This numpy array is all the columns, but they are out of order:
#    They are ordered like this: gene6, gene1, gene2, gene3, gene4, gene5, gender
#    This is because the transformed column is returned first

[-12.90552460113543 0.7593336360642229 27.342287359757897
 118.87838370542934 -29.800470404715927 641.2144908190065 'male'] 

[114.23174731226524 3.7269018931959352 16.190668631240115
 122.51987028373456 -56.61609162376188 239.2890296354452 'male'] 

[114.23174731226524 2.23453468762236 19.34580492210508 128.82757400189527
 -90.47884846942158 374.45950044501 'female'] 

[114.23174731226524 4.92245073152067 20.41671927958799 57.90659871761008
 -62.897716926182376 398.8188051515788 'male'] 

[26.6243243517391 1.2279418963505029 26.415990254527344 87.02778230264246
 -38.96261648874555 581.0782326303199 'female'] 



In [11]:
# Just to show the results, create a dataframe from the array
new_cols = ['gene6', 'gene1','gene2','gene3','gene4','gene5','gender']
X_imputed_df = pd.DataFrame(data = transformed_array, columns = new_cols)
X_imputed_df.head(5) 

Unnamed: 0,gene6,gene1,gene2,gene3,gene4,gene5,gender
0,-12.906,0.759,27.342,118.878,-29.8,641.214,male
1,114.232,3.727,16.191,122.52,-56.616,239.289,male
2,114.232,2.235,19.346,128.828,-90.479,374.46,female
3,114.232,4.922,20.417,57.907,-62.898,398.819,male
4,26.624,1.228,26.416,87.028,-38.963,581.078,female


In [12]:
# Why only 4 stats here?
X_imputed_df.describe()

Unnamed: 0,gene6,gene1,gene2,gene3,gene4,gene5,gender
count,100.0,100.0,100.0,100.0,100.0,100.0,100
unique,76.0,100.0,100.0,100.0,100.0,100.0,2
top,114.232,0.759,27.342,118.878,-29.8,641.214,female
freq,25.0,1.0,1.0,1.0,1.0,1.0,55


In [13]:
# Here is why: 
X_imputed_df.dtypes # The data types got changed to objects

gene6     object
gene1     object
gene2     object
gene3     object
gene4     object
gene5     object
gender    object
dtype: object

In [14]:
# Convert them back to floats so we can look at stats
# Remember, our goal was to impute missing values for 'gene6'
X_imputed_df = X_imputed_df.infer_objects() # 'gender' stays an object column, so describe() leaves it out
X_imputed_df.describe()

Unnamed: 0,gene6,gene1,gene2,gene3,gene4,gene5
count,100.0,100.0,100.0,100.0,100.0,100.0
mean,114.232,4.118,20.732,82.765,-50.003,408.9
std,40.927,1.54,4.111,32.476,21.673,164.819
min,-12.906,-0.253,10.724,15.323,-127.54,26.891
25%,99.475,3.238,17.777,61.068,-60.887,284.872
50%,114.232,4.588,20.462,81.36,-48.005,407.237
75%,143.868,5.174,23.643,106.616,-39.195,526.622
max,206.951,7.147,29.389,150.491,9.134,813.139


#### We have successfully imputed missing values in 'gene6'

### 4. Convert categories to numbers
Convert a category 'string' column to a numeric column

In [18]:
# get a fresh copy of the data
X = df.drop('cancer_detected', axis = 1)
X.head()

Unnamed: 0,gene1,gene2,gene3,gene4,gene5,gene6,gender
0,0.759,27.342,118.878,-29.8,641.214,-12.906,male
1,3.727,16.191,122.52,-56.616,239.289,,male
2,2.235,19.346,128.828,-90.479,374.46,,female
3,4.922,20.417,57.907,-62.898,398.819,,male
4,1.228,26.416,87.028,-38.963,581.078,26.624,female


In [19]:
category_columns = ['gender']
#
# Create the column transformer using the OrdinalEncoder
#   https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html
#
category_ct = ColumnTransformer(
    transformers = [
    ('cat', OrdinalEncoder(categories = [['male','female']]), category_columns)
    ],
    remainder = 'passthrough')

# Prepare the ct to transform the data
category_ct.fit(X)
# perform the transformation
transformed_array = category_ct.transform(X)
for row in transformed_array[0:5]:
    print(row,'\n')
# Gender is now in the first column, since it was transformed; the other columns were passed through.

[  0.           0.75933364  27.34228736 118.87838371 -29.8004704
 641.21449082 -12.9055246 ] 

[  0.           3.72690189  16.19066863 122.51987028 -56.61609162
 239.28902964          nan] 

[  1.           2.23453469  19.34580492 128.827574   -90.47884847
 374.45950045          nan] 

[  0.           4.92245073  20.41671928  57.90659872 -62.89771693
 398.81880515          nan] 

[  1.           1.2279419   26.41599025  87.0277823  -38.96261649
 581.07823263  26.62432435] 



In [20]:
# Create a dataframe from the array
new_cols = ['gender','gene1','gene2','gene3','gene4','gene5','gene6']
X_cat_df = pd.DataFrame(data = transformed_array, columns = new_cols)
X_cat_df.head()
# Successfully transformed gender to 0 = 'male' and 1 = 'female'

Unnamed: 0,gender,gene1,gene2,gene3,gene4,gene5,gene6
0,0.0,0.759,27.342,118.878,-29.8,641.214,-12.906
1,0.0,3.727,16.191,122.52,-56.616,239.289,
2,1.0,2.235,19.346,128.828,-90.479,374.46,
3,0.0,4.922,20.417,57.907,-62.898,398.819,
4,1.0,1.228,26.416,87.028,-38.963,581.078,26.624
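
The codes come from the position of each label in the `categories` list, which is why 'male' became 0 and 'female' became 1. The sketch below shows that mapping in isolation. The `handle_unknown` and `unknown_value` settings are an assumption added for illustration (the cell above does not use them). They tell the encoder to emit a sentinel code instead of raising when it sees a label it was not given, such as the made-up 'other' here.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# The position of each label in `categories` becomes its code: male -> 0, female -> 1
enc = OrdinalEncoder(
    categories=[["male", "female"]],
    handle_unknown="use_encoded_value",  # assumption: unseen labels should not raise
    unknown_value=-1,                    # hypothetical sentinel code for unseen labels
)
enc.fit(np.array([["male"], ["female"]], dtype=object))

codes = enc.transform(np.array([["female"], ["male"], ["other"]], dtype=object))
print(codes.ravel())  # codes are 1, 0 and -1 (as floats)
```

Without these two options, the default behavior is to raise an error on an unseen label at transform time.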


### 5. Putting it all together: do all transformers at once

In [21]:
# Fresh dataframe
X = df.drop('cancer_detected', axis = 1)
#
# First, the normalized columns
norm_columns = ['gene1', 'gene2', 'gene3']
norm_transformer = MinMaxScaler()
#
# Second, the standardized columns
stand_columns = ['gene4', 'gene5', 'gene6']
# But, we need a two step process: Impute, then standardize
steps=[
    ("imputer", SimpleImputer(missing_values=np.nan, strategy='mean')),
    ("scaler", StandardScaler())]
# Use the pipeline to perform the steps in order
stand_transformer = Pipeline(steps)
#
# Third, the categories
categorical_columns = ['gender']
categorical_transformer = OrdinalEncoder(categories = [['male','female']])

In [22]:
# Now, use the Column Transformer to perform the 3 different groups of transformations
preprocessor = ColumnTransformer(
    transformers=[
        ("norm", norm_transformer, norm_columns), # This is a single-step transformer: only norm scale
        ("stand", stand_transformer, stand_columns), # This is a 2-step transformer: impute, then stand scale
        ("cat", categorical_transformer, categorical_columns), # This is a single step, just convert categories
    ]
)
# Show the details of the object
preprocessor

ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
                  transformer_weights=None,
                  transformers=[('norm',
                                 MinMaxScaler(copy=True, feature_range=(0, 1)),
                                 ['gene1', 'gene2', 'gene3']),
                                ('stand',
                                 Pipeline(memory=None,
                                          steps=[('imputer',
                                                  SimpleImputer(add_indicator=False,
                                                                copy=True,
                                                                fill_value=None,
                                                                missing_values=nan,
                                                                strategy='mean',
                                                                verbose=0)),
                                                 ('scaler',


In [23]:
# Fit the column transformer
preprocessor.fit(X)
# Transform the entire dataframe
transformed_array = preprocessor.transform(X)
# Build the dataframe. Output columns follow the order of the transformers (norm, stand, cat),
# which here matches the original column order, so X.columns can be reused
X_transformed = pd.DataFrame(data = transformed_array,columns = X.columns)
X_transformed.head()

Unnamed: 0,gene1,gene2,gene3,gene4,gene5,gene6,gender
0,0.137,0.89,0.766,0.937,1.417,-3.122,0.0
1,0.538,0.293,0.793,-0.307,-1.034,0.0,0.0
2,0.336,0.462,0.84,-1.877,-0.21,0.0,1.0
3,0.699,0.519,0.315,-0.598,-0.061,0.0,0.0
4,0.2,0.841,0.53,0.512,1.05,-2.151,1.0


In [24]:
# We have successfully prepared the data exactly how we wanted:
# genes 1,2,3: normalized (0-1)
# genes 4,5,6: imputed missing values, then standardized
# gender: converted to a number
X_transformed.describe()

Unnamed: 0,gene1,gene2,gene3,gene4,gene5,gene6,gender
count,100.0,100.0,100.0,100.0,100.0,100.0,100.0
mean,0.591,0.536,0.499,0.0,-0.0,0.0,0.55
std,0.208,0.22,0.24,1.005,1.005,1.005,0.5
min,0.0,0.0,0.0,-3.596,-2.329,-3.122,0.0
25%,0.472,0.378,0.338,-0.505,-0.756,-0.362,0.0
50%,0.654,0.522,0.489,0.093,-0.01,0.0,1.0
75%,0.733,0.692,0.675,0.501,0.718,0.728,1.0
max,1.0,1.0,1.0,2.742,2.465,2.277,1.0
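
The 0.0 entries for gene6 in the transformed rows follow from the two-step order: a missing value is first replaced by the column mean, and standardizing then subtracts that same mean. A minimal sketch on a made-up three-value column (`col`, `toy_steps` and `scaled` are hypothetical names) works through the arithmetic:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical single column with one missing value
col = np.array([[1.0], [np.nan], [3.0]])

toy_steps = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),  # NaN -> mean of 1 and 3, which is 2
    ("scaler", StandardScaler()),                 # then (x - 2) / std of [1, 2, 3]
])
scaled = toy_steps.fit_transform(col)
print(scaled.ravel())  # the imputed middle value ends up at exactly 0.0
```

StandardScaler divides by the population standard deviation, while `describe()` reports the sample standard deviation, which is why the std row above shows 1.005 rather than exactly 1 for 100 rows.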


### 6. Combine ColumnTransformer with the use of the Pipeline 

In [30]:
# Now, let's combine all of this with the pipeline to build a machine learning model

In [25]:
# Features
X = df.drop(['cancer_detected'],axis = 1)
# Target
y = df['cancer_detected']
# Split into train/test
# Reserve 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20,random_state = 42)
# Verify the sizes of the split datasets
print('X_train:', X_train.shape)
print('y_train:', y_train.shape)
print('X_test:', X_test.shape)
print('y_test:', y_test.shape)

X_train: (80, 7)
y_train: (80,)
X_test: (20, 7)
y_test: (20,)


#### Use the preprocessor we setup above

In [26]:
preprocessor

ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
                  transformer_weights=None,
                  transformers=[('norm',
                                 MinMaxScaler(copy=True, feature_range=(0, 1)),
                                 ['gene1', 'gene2', 'gene3']),
                                ('stand',
                                 Pipeline(memory=None,
                                          steps=[('imputer',
                                                  SimpleImputer(add_indicator=False,
                                                                copy=True,
                                                                fill_value=None,
                                                                missing_values=nan,
                                                                strategy='mean',
                                                                verbose=0)),
                                                 ('scaler',


#### Create a new Logistic Regression classification model and use it in a pipeline

In [27]:
# Create the pipeline with our preprocess and a new classifier model
#
lr = LogisticRegression() # Create a new model
# 
# Use the preprocessor with the model
#
pipe = Pipeline(
    steps=[("preprocessor", preprocessor), ("LogisticRegressor", lr)]
)

# Perform the preprocessing and the training of the model
pipe.fit(X_train, y_train)

# Treat the pipe object just like a trained model
y_pred = pipe.predict(X_test)
# Report the performance
print('Logistic Regression Accuracy:', pipe.score(X_test, y_test))
confusion_matrix(y_test, y_pred)

Logistic Regression Accuracy: 0.9


array([[16,  0],
       [ 2,  2]])

#### Raw data in, trained and evaluated model out!

In [28]:
# Want to do it again with a different algorithm?
#
gbc = GradientBoostingClassifier()
#
# Use the model with the preprocessor
#
pipe = Pipeline(
    steps=[("preprocessor", preprocessor), ("GBC_classifier", gbc)]
)
#
# Perform the preprocessing and the training of the model
pipe.fit(X_train, y_train)
#
# Treat the pipe object just like a trained model
y_pred = pipe.predict(X_test)
# Report the performance
print('GBC Accuracy:', pipe.score(X_test, y_test))
confusion_matrix(y_test, y_pred)

GBC Accuracy: 0.95


array([[16,  0],
       [ 1,  3]])
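
Since only the final step changes between the two cells above, the same pattern can be wrapped in a loop. The sketch below is one way that might look. It runs on synthetic stand-in data so it is self-contained, and every name in it (`demo`, `demo_pre`, `models`, `scores`) is hypothetical. `clone()` is used so each pipeline gets its own unfitted copy of the preprocessor, and `max_iter=1000` is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Synthetic stand-in data (hypothetical, not the gene dataset)
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(loc=50, scale=10, size=n),
    "gender": rng.choice(["male", "female"], size=n),
})
demo_y = (demo["x1"] + rng.normal(scale=0.5, size=n) > 0).astype(int)
demo.loc[::10, "x2"] = np.nan  # knock out some values so the imputer has work to do

demo_pre = ColumnTransformer(transformers=[
    ("num", Pipeline([("imputer", SimpleImputer(strategy="mean")),
                      ("scaler", StandardScaler())]), ["x1", "x2"]),
    ("cat", OrdinalEncoder(categories=[["male", "female"]]), ["gender"]),
])

Xtr, Xte, ytr, yte = train_test_split(demo, demo_y, test_size=0.2, random_state=42)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "GradientBoosting": GradientBoostingClassifier(random_state=0),
}
scores = {}
for name, model in models.items():
    # clone() gives each pipeline its own unfitted copy of the preprocessor
    pipe_i = Pipeline(steps=[("preprocessor", clone(demo_pre)), ("model", model)])
    pipe_i.fit(Xtr, ytr)
    scores[name] = pipe_i.score(Xte, yte)  # accuracy on the held-out 20%
print(scores)
```

The accuracies printed here describe the synthetic data only and say nothing about the gene dataset.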

### What we did:

1. Loaded the data
2. Used a ColumnTransformer to scale/transform one or more columns
3. Used a ColumnTransformer to deal with (impute) missing values
4. Used a ColumnTransformer to convert categories ('male'/'female') to numbers (0, 1)
5. Put several transformers together to do all transformations with a single ColumnTransformer
6. Combined the ColumnTransformer with a Pipeline to transform raw data and fit/evaluate a model