# Steps to Make a Pipeline:

0. Import Necessary Libraries.

1. Load Data.

2. Inspect the Data.

3. Validation Split.

4. Instantiate the Transformers.

5. Instantiate the Pipeline Using Transformers.

6. Fit Pipeline on the Training Data.

7. Transform Both the Training and Testing Data.

8. Inspect the Result.

# Import Necessary Libraries

In [1]:
# Imports
import pandas as pd
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import set_config
set_config(display='diagram')


# Load Data

In [2]:
path = './Life Expectancy Data (all_numeric) - Life Expectancy Data (all_numeric).csv'
life_df = pd.read_csv(path, index_col='CountryYear')
life_df.info()
life_df.head()


<class 'pandas.core.frame.DataFrame'>
Index: 2928 entries, Afghanistan2015 to Zimbabwe2000
Data columns (total 20 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Status                           2928 non-null   int64  
 1   Life expectancy                  2928 non-null   float64
 2   Adult Mortality                  2928 non-null   int64  
 3   infant deaths                    2928 non-null   int64  
 4   Alcohol                          2735 non-null   float64
 5   percentage expenditure           2928 non-null   float64
 6   Hepatitis B                      2375 non-null   float64
 7   Measles                          2928 non-null   int64  
 8   BMI                              2896 non-null   float64
 9   under-five deaths                2928 non-null   int64  
 10  Polio                            2909 non-null   float64
 11  Total expenditure                2702 non-null   float64
 12  Dip

CountryYear,Status,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,BMI,under-five deaths,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
Afghanistan2015,0,65.0,263,62,0.01,71.279624,65.0,1154,19.1,83,6.0,8.16,65.0,0.1,584.25921,33736494.0,17.2,17.3,0.479,10.1
Afghanistan2014,0,59.9,271,64,0.01,73.523582,62.0,492,18.6,86,58.0,8.18,62.0,0.1,612.696514,327582.0,17.5,17.5,0.476,10.0
Afghanistan2013,0,59.9,268,66,0.01,73.219243,64.0,430,18.1,89,62.0,8.13,64.0,0.1,631.744976,31731688.0,17.7,17.7,0.47,9.9
Afghanistan2012,0,59.5,272,69,0.01,78.184215,67.0,2787,17.6,93,67.0,8.52,67.0,0.1,669.959,3696958.0,17.9,18.0,0.463,9.8
Afghanistan2011,0,59.2,275,71,0.01,7.097109,68.0,3013,17.2,97,68.0,7.87,68.0,0.1,63.537231,2978599.0,18.2,18.2,0.454,9.5


# Inspect Data

In [3]:
print(life_df.isna().sum())

Status                               0
Life expectancy                      0
Adult Mortality                      0
infant deaths                        0
Alcohol                            193
percentage expenditure               0
Hepatitis B                        553
Measles                              0
BMI                                 32
under-five deaths                    0
Polio                               19
Total expenditure                  226
Diphtheria                          19
HIV/AIDS                             0
GDP                                443
Population                         644
thinness  1-19 years                32
thinness 5-9 years                  32
Income composition of resources    160
Schooling                          160
dtype: int64
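
The raw counts are easier to compare as a share of all rows. The following is a minimal sketch that reuses a few of the counts printed above together with the 2928 rows reported by `info()`; the variable names are illustrative only. On the real data, `life_df.isna().mean()` should give the same shares for every column.

```python
import pandas as pd

# Row count and missing-value counts copied from the outputs above
n_rows = 2928
missing_counts = pd.Series(
    {"Alcohol": 193, "Hepatitis B": 553, "GDP": 443, "Population": 644}
)

# Fraction of rows with a missing value, largest first
missing_share = (missing_counts / n_rows).sort_values(ascending=False)
print(missing_share.round(3))
```

Population and Hepatitis B are each missing in roughly a fifth of the rows, so the choice of imputation strategy has the most influence on those columns.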


We can see that several columns are missing data. We will want to impute the missing data before we scale the data, so our pipeline will be ordered as:

Step 1. Imputer

Step 2. Scaler.

All of our data is numeric, so we don't need to one-hot encode the data. We can also use median imputation or mean imputation on all of the columns.

If we wanted to, we COULD use ColumnTransformer to split the columns by integers and floats and apply mean imputation to the floats and median imputation to the integers, and then scale them all. 
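
As a rough sketch of that ColumnTransformer alternative (not the approach used in this lesson), the columns could be routed by dtype. The toy DataFrame, its column names, and the variable names below are hypothetical stand-ins for the life expectancy features.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one integer column, one float column with a gap
toy_X = pd.DataFrame({
    "int_col": [1, 2, 3, 10],
    "float_col": [1.0, np.nan, 3.0, 5.0],
})

# Median imputation for integer columns, mean imputation for float columns
by_dtype = ColumnTransformer([
    ("ints", SimpleImputer(strategy="median"),
     make_column_selector(dtype_include="int64")),
    ("floats", SimpleImputer(strategy="mean"),
     make_column_selector(dtype_include="float64")),
])

# Scale everything after the imputers have run
split_pipeline = make_pipeline(by_dtype, StandardScaler())
toy_processed = split_pipeline.fit_transform(toy_X)
print(toy_processed)
```

Note that a pandas `int64` column cannot hold `NaN`, which is consistent with the counts above: none of the integer columns have missing values, so the median imputer would have nothing to fill in this dataset.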

However, for this lesson we will use only a median imputer.

# Validation Split

In [4]:
# divide features and target and perform a train/test split.
target = 'Life expectancy'
X = life_df.drop(columns=[target])
y = life_df[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)
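
`train_test_split` is called here without `test_size`, so scikit-learn's default of a 25% test set applies. A small sketch on row numbers alone (a stand-in for the real data, assuming the 2928 rows reported by `info()`) shows the resulting sizes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 2928 rows of life_df
rows = np.arange(2928)

# No test_size given, so 25% of the rows go to the test set
train_rows, test_rows = train_test_split(rows, random_state=42)
print(len(train_rows), len(test_rows))  # 2196 732
```

The 2196 training rows match the row count reported for the processed training data later in this notebook.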


# Instantiate the Transformers

In [5]:
#instantiate an imputer and a scaler
median_imputer = SimpleImputer(strategy='median')
scaler = StandardScaler()



# Instantiate the Pipeline

In [6]:
# combine the imputer and the scaler into a pipeline
# order matters: steps run in the order given (impute first, then scale)
# make_pipeline accepts any number of steps; every step except the last must be a transformer
preprocessing_pipeline = make_pipeline(median_imputer, scaler)
preprocessing_pipeline
# the pipeline imputes missing data and then scales the numeric data in a single call
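
To make the "order matters" point concrete, here is a minimal sketch on hypothetical toy data showing that a two-step pipeline gives the same result as running the imputer and then the scaler by hand. The array and variable names are illustrative only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with one missing value in each column
toy = np.array([[1.0, 10.0],
                [2.0, np.nan],
                [np.nan, 30.0],
                [4.0, 40.0]])

# Pipeline version: impute, then scale
pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
via_pipeline = pipe.fit_transform(toy)

# Manual version: the same two steps applied by hand, in the same order
imputed = SimpleImputer(strategy="median").fit_transform(toy)
via_manual = StandardScaler().fit_transform(imputed)

print(np.allclose(via_pipeline, via_manual))  # True
print([name for name, _ in pipe.steps])       # ['simpleimputer', 'standardscaler']
```

`make_pipeline` names each step after its lowercased class name, which is how individual steps can be looked up later through `named_steps`.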


# Fit the Pipeline on the Training Data

In [7]:
#fit pipeline on training data
# always fit on the training data, never on the test data
preprocessing_pipeline.fit(X_train)
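
Fitting is the step where the pipeline learns its statistics (the medians used for imputing, the means and standard deviations used for scaling). The sketch below, on randomly generated data that merely stands in for the real features, checks that the learned medians come from the training rows alone.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 200 rows, 3 numeric columns, about 10% missing
rng = np.random.default_rng(42)
toy = rng.normal(loc=50, scale=10, size=(200, 3))
toy[rng.random(toy.shape) < 0.1] = np.nan

toy_train, toy_test = train_test_split(toy, random_state=42)

pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
pipe.fit(toy_train)  # the test rows are never seen during fitting

# Medians stored by the imputer vs. medians computed from the training rows
learned_medians = pipe.named_steps["simpleimputer"].statistics_
train_medians = np.nanmedian(toy_train, axis=0)
print(np.allclose(learned_medians, train_medians))
```

Because the test rows play no part in these statistics, transforming the test set later reuses the training medians, means, and standard deviations, which is what keeps test information from leaking into preprocessing.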



# Transform both the training data and the testing data

In [8]:
#transform train and test sets
X_train_processed = preprocessing_pipeline.transform(X_train)
X_test_processed = preprocessing_pipeline.transform(X_test)

# X_train_processed and X_test_processed are NumPy arrays

# Inspect the Result

In [9]:
#inspect the result of the transformation
print(np.isnan(X_train_processed).sum(), 'missing values \n')
X_train_processed


0 missing values 



array([[ 0.        , -0.81229166, -0.26366021, ..., -0.87868801,
         1.19451878,  1.92222335],
       [ 0.        ,  1.43809769,  0.15576412, ...,  0.58477555,
         0.22791761,  0.08271906],
       [ 0.        ,  2.02690924, -0.18501814, ...,  0.87303352,
        -0.68443553, -0.80637468],
       ...,
       [ 0.        , -1.10266448, -0.11511409, ..., -0.10260885,
        -0.88170108, -1.17427554],
       [ 0.        , -0.73163255, -0.24618419, ..., -0.96738278,
         0.97259504,  0.87983758],
       [ 0.        ,  1.43003177, -0.20249416, ...,  1.07259673,
        -3.11080174, -2.24731971]])

In [11]:
X_train_processed_df = pd.DataFrame(X_train_processed, columns=X_train.columns)

X_train_processed_df.head()

Unnamed: 0,Status,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,BMI,under-five deaths,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
0,0.0,-0.812292,-0.26366,0.576498,3.98043,0.397596,-0.210002,0.915824,-0.267716,0.709313,1.528662,0.706285,-0.324668,3.027465,-0.196608,-0.868293,-0.878688,1.194519,1.922223
1,0.0,1.438098,0.155764,-1.158369,-0.374618,0.397596,0.502075,-0.96222,0.227591,-3.349408,-0.082413,-3.242927,-0.167257,-0.364664,-0.16793,0.650505,0.584776,0.227918,0.082719
2,0.0,2.026909,-0.185018,-0.534022,-0.374618,0.397596,0.110118,-1.778543,-0.178815,-3.306685,-1.362161,-3.201357,0.993643,-0.364664,-0.16793,0.967865,0.873034,-0.684436,-0.806375
3,0.0,-0.594512,-0.26366,1.7254,3.530549,0.397596,-0.209829,0.840702,-0.267716,0.452973,1.60321,0.456861,-0.324668,3.168864,-0.088476,-0.822956,-0.856514,1.312878,1.40103
4,0.0,-0.731633,-0.26366,2.078515,4.503751,0.397596,-0.205086,0.960897,-0.267716,0.452973,1.135212,0.456861,-0.324668,4.102906,-0.107131,-1.026973,-1.033904,1.367126,1.79959
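
The DataFrame above gets a fresh 0, 1, 2, ... index because the NumPy array returned by the pipeline carries no row labels. If the original `CountryYear` labels are wanted, one option is to pass the index along as well. A minimal sketch on a hypothetical three-row frame (the labels and values are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame with a labeled index, like life_df
toy_X = pd.DataFrame(
    {"Alcohol": [0.01, np.nan, 1.5], "BMI": [19.1, 18.6, np.nan]},
    index=pd.Index(["A2015", "A2014", "B2015"], name="CountryYear"),
)

pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
toy_processed = pipe.fit_transform(toy_X)  # plain NumPy array, no labels

# Reattach both the column names and the original row labels
toy_processed_df = pd.DataFrame(
    toy_processed, columns=toy_X.columns, index=toy_X.index
)
print(toy_processed_df)
```

In this notebook the equivalent would be adding `index=X_train.index` to the `pd.DataFrame` call.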


In [12]:
X_train_processed_df.info()
# all columns are floats now

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2196 entries, 0 to 2195
Data columns (total 19 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Status                           2196 non-null   float64
 1   Adult Mortality                  2196 non-null   float64
 2   infant deaths                    2196 non-null   float64
 3   Alcohol                          2196 non-null   float64
 4   percentage expenditure           2196 non-null   float64
 5   Hepatitis B                      2196 non-null   float64
 6   Measles                          2196 non-null   float64
 7   BMI                              2196 non-null   float64
 8   under-five deaths                2196 non-null   float64
 9   Polio                            2196 non-null   float64
 10  Total expenditure                2196 non-null   float64
 11  Diphtheria                       2196 non-null   float64
 12  HIV/AIDS            

In [13]:
X_train_processed_df.describe()

Unnamed: 0,Status,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,BMI,under-five deaths,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
count,2196.0,2196.0,2196.0,2196.0,2196.0,2196.0,2196.0,2196.0,2196.0,2196.0,2196.0,2196.0,2196.0,2196.0,2196.0,2196.0,2196.0,2196.0,2196.0
mean,0.0,-6.794808e-17,4.853434e-18,-4.5298720000000006e-17,2.103155e-17,-1.19718e-16,-1.0515770000000001e-17,-2.91206e-17,-1.9413740000000002e-17,-1.451986e-16,1.156735e-16,-1.92924e-16,3.235623e-18,3.073842e-17,2.2649360000000003e-17,3.8827470000000006e-17,-1.100112e-16,2.556142e-16,-1.261893e-16
std,0.0,1.000228,1.000228,1.000228,1.000228,1.000228,1.000228,1.000228,1.000228,1.000228,1.000228,1.000228,1.000228,1.000228,1.000228,1.000228,1.000228,1.000228,1.000228
min,0.0,-1.32851,-0.2636602,-1.158369,-0.3746177,-3.557306,-0.210002,-1.843649,-0.2677158,-3.392132,-2.289875,-3.326069,-0.3246676,-0.4963791,-0.1971945,-1.07231,-1.056078,-3.110802,-3.688265
25%,0.0,-0.7396985,-0.2636602,-0.8794599,-0.371889,-0.03700868,-0.210002,-0.9522037,-0.2677158,-0.2306012,-0.6425617,-0.1666991,-0.3246676,-0.4548794,-0.1882915,-0.7322811,-0.7456459,-0.6499141,-0.5611074
50%,0.0,-0.1670187,-0.2374462,-0.2141715,-0.3408743,0.397596,-0.2086653,0.2472404,-0.2423154,0.452973,-0.08241293,0.4152901,-0.3246676,-0.3646644,-0.1679298,-0.3469146,-0.3465194,0.2279176,0.08271906
75%,0.0,0.494386,-0.07142405,0.7274676,-0.1523422,0.5714378,-0.1800339,0.8957917,-0.08832579,0.6238665,0.6009479,0.6231433,-0.1869336,-0.1421662,-0.09684581,0.5144931,0.5182545,0.6927246,0.6345703
max,0.0,4.446683,15.46475,3.411651,9.598923,0.7018192,18.08848,1.972537,15.6075,0.7093133,4.846068,0.7062846,9.611853,8.198382,25.75888,5.184229,5.263424,1.564392,2.658025
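
The scaled columns show a standard deviation of 1.000228 rather than exactly 1. This is expected: `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas' `describe()` and `std()` report the sample standard deviation (`ddof=1`). The sketch below reproduces the effect on randomly generated data with the same number of rows; the data itself is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

n_rows = 2196  # same row count as the training set above
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x": rng.normal(size=n_rows)})

scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

population_std = scaled["x"].std(ddof=0)  # what StandardScaler normalizes to 1
sample_std = scaled["x"].std()            # pandas default, ddof=1
expected_ratio = np.sqrt(n_rows / (n_rows - 1))

print(round(float(population_std), 6))  # 1.0
print(round(float(sample_std), 6))      # 1.000228
```

The two differ by a factor of sqrt(n / (n - 1)), which for 2196 rows is about 1.000228.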


# Summary

When you are preparing your data for modeling, you will often need to perform multiple preprocessing steps. Pipelines bundle transformers into one transparent preprocessing object that applies them in sequence. Using preprocessing pipelines in your machine learning workflow will save you time, reduce errors, and help prevent data leakage, because every step is fit on the training data only and then applied unchanged to the test data. It also prepares your workflow for machine learning tools you will use later.