# Scikit-learn Pipeline tutorial

What is a pipeline?
- A pipeline is a chain of processing elements arranged so that the output of each element is the input of the next.

What is a scikit-learn `Pipeline`? It sequentially applies a list of transforms followed by a final estimator.

A typical data science workflow looks like this:

    1. Get the training data
    2. Clean/preprocess the data
    3. Train a machine learning model
    4. Evaluate and optimise the model
    5. Clean/preprocess new data
    6. Apply the trained model to the new data to make predictions

A scikit-learn `Pipeline` makes it easy to apply the same preprocessing to the training data, the test data, and future predictions.
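
As a minimal sketch of that point, the example below uses synthetic data from `make_regression` instead of the housing data. The data, the `LinearRegression` model, and all variable names are assumptions chosen for illustration. It checks that the scaler's statistics are learned from the training split during `fit` and that `predict` reuses them on unseen data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for illustration, not the housing data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

demo_pipe = Pipeline(steps=[("scaler", StandardScaler()), ("model", LinearRegression())])
demo_pipe.fit(X_tr, y_tr)

# The scaler's means come from the training split only
scaler_means = demo_pipe.named_steps["scaler"].mean_
train_means = X_tr.mean(axis=0)

# predict() runs the already-fitted scaler, then the model;
# doing the two steps by hand should give the same predictions
manual_pred = demo_pipe.named_steps["model"].predict(
    demo_pipe.named_steps["scaler"].transform(X_te)
)
pipe_pred = demo_pipe.predict(X_te)

print(np.allclose(scaler_means, train_means))
print(np.allclose(manual_pred, pipe_pred))
```

Because the test split never influences the scaler, the evaluation is not contaminated by information from the held-out data.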



In [1]:
import numpy as np
import pandas as pd
import sklearn

### Loading dataset
The Boston housing dataset is loaded from scikit-learn's built-in datasets.

*Number of cases*: The dataset contains a total of 506 cases.

*Variables*: Each case has 14 attributes (13 features plus the target, MEDV):

CRIM    - per capita crime rate by town

ZN      - proportion of residential land zoned for lots over 25,000 sq.ft.

INDUS   - proportion of non-retail business acres per town.

CHAS    - Charles River dummy variable (1 if tract bounds river; 0 otherwise)

NOX     - nitric oxides concentration (parts per 10 million)

RM      - average number of rooms per dwelling

AGE     - proportion of owner-occupied units built prior to 1940

DIS     - weighted distances to five Boston employment centres

RAD     - index of accessibility to radial highways


TAX     - full-value property-tax rate per $10,000

PTRATIO - pupil-teacher ratio by town

B       - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town

LSTAT   - % lower status of the population

MEDV    - Median value of owner-occupied homes in $1000's

In [2]:
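# Boston housing dataset bundled with scikit-learn.
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn version.
from sklearn.datasets import load_boston
data = load_boston()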

In [3]:
# Convert to DataFrames: X holds the features, y the target variable to predict
X = pd.DataFrame(data=data.data, columns=data.feature_names)
y = pd.DataFrame(data=data.target, columns=['MEDV'])
print(X.head())
print(y.head())
# Flatten y to a 1-D array, the shape scikit-learn estimators expect
y = np.ravel(y)

      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  \
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0   
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0   
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0   
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0   
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0   

   PTRATIO       B  LSTAT  
0     15.3  396.90   4.98  
1     17.8  396.90   9.14  
2     17.8  392.83   4.03  
3     18.7  394.63   2.94  
4     18.7  396.90   5.33  
   MEDV
0  24.0
1  21.6
2  34.7
3  33.4
4  36.2


## Splitting data

In [4]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

## Creating pipeline

A pipeline is built from a list of `(name, step)` tuples, with the estimator as the last step:

`Pipeline(steps=[('nameOfPreprocessor', preprocessor), ..., ('nameOfMLmodel', MLmodel())])`
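
The step names are more than labels. As a rough sketch, the snippet below shows two things they enable: looking up a step through `named_steps`, and changing a step's hyperparameter with the `<step name>__<parameter>` convention in `set_params`. The pipeline `example` and the parameter values are hypothetical, chosen only for illustration.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical pipeline used only to illustrate step names
example = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("regressor", RandomForestRegressor(n_estimators=10, random_state=0)),
])

# Steps keep the order in which they were listed
step_names = [name for name, _ in example.steps]
print(step_names)

# Parameters of a step are addressed as <step name>__<parameter>
example.set_params(regressor__n_estimators=50, imputer__strategy="median")

# Each step can be retrieved by its name
print(example.named_steps["regressor"].n_estimators)
print(example.named_steps["imputer"].strategy)
```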
                

In [5]:
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor


#### Preprocessor

In [6]:
# Impute missing values with the column mean, then standardize each feature
preprocessor = Pipeline(steps=[('imputer', SimpleImputer(strategy='mean')), ('scaler', StandardScaler())])

#### Estimator

In [7]:
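# Chain preprocessing with the final estimator (no random_state is set, so results vary slightly between runs)
pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('regressor', RandomForestRegressor())])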


In [8]:
rf_model = pipeline.fit(X_train, y_train)
print(rf_model)

Pipeline(steps=[('preprocessor',
                 Pipeline(steps=[('imputer', SimpleImputer()),
                                 ('scaler', StandardScaler())])),
                ('regressor', RandomForestRegressor())])


## Prediction and Evaluation

In [9]:
y_predict=rf_model.predict(X_test)

In [10]:
from sklearn.metrics import r2_score
# R^2 on the held-out test set (1.0 would be a perfect fit)
print(r2_score(y_test, y_predict))

0.7828458355278474
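
The "evaluate and optimise" step of the workflow can also be run on the whole pipeline at once. The sketch below is one possible approach, not the method used in this notebook: it passes a nested pipeline of the same shape to `GridSearchCV`, on synthetic data from `make_regression`. The data, the parameter grid, and the variable names are all assumptions for illustration. Nested steps are addressed by joining step names with double underscores.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for illustration)
X_demo, y_demo = make_regression(n_samples=150, n_features=6, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

prep = Pipeline(steps=[("imputer", SimpleImputer(strategy="mean")), ("scaler", StandardScaler())])
pipe = Pipeline(steps=[("preprocessor", prep), ("regressor", RandomForestRegressor(random_state=0))])

# Hypothetical grid: keys follow <step>__<nested step>__<parameter>
param_grid = {
    "preprocessor__imputer__strategy": ["mean", "median"],
    "regressor__n_estimators": [10, 30],
}

# Each candidate refits the preprocessing inside every cross-validation fold
search = GridSearchCV(pipe, param_grid, cv=3, scoring="r2")
search.fit(X_tr, y_tr)

best_params = search.best_params_
test_r2 = search.score(X_te, y_te)  # R^2 of the refitted best pipeline
print(best_params)
print(test_r2)
```

Searching over the pipeline rather than the bare model keeps imputation and scaling inside each fold, so the validation folds never leak into the preprocessing statistics.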


## Saving the model

In [11]:
import joblib
joblib.dump(rf_model, './rf_model.pkl')

['./rf_model.pkl']

*Note*

In another notebook, we can make predictions with `rf_model` simply by loading the pickle file:

rf_model = joblib.load('PATH/TO/rf_model.pkl')

new_prediction = rf_model.predict(new_data)
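
The save-and-load round trip can be sketched end to end as follows. This is a minimal example on synthetic data with a temporary directory. The file name `demo_model.pkl`, the `LinearRegression` model, and the variable names are assumptions for illustration. It checks that the restored pipeline, preprocessing included, gives the same predictions as the original.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for illustration)
X_demo, y_demo = make_regression(n_samples=100, n_features=4, random_state=0)
model = Pipeline(steps=[("scaler", StandardScaler()), ("regressor", LinearRegression())])
model.fit(X_demo, y_demo)

# Write the fitted pipeline to a temporary file and read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.pkl")  # hypothetical file name
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored object carries the fitted scaler and the fitted model
same_predictions = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

Since the whole pipeline is stored in one file, new data can be passed to `predict` in raw form without repeating the preprocessing by hand.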