
# Introduction

**Pipelines** are a simple way to keep your data preprocessing and modeling code organized. Specifically, a pipeline bundles preprocessing and modeling steps so you can use the whole bundle as if it were a single step.

Many data scientists hack together models without pipelines, but pipelines have some important benefits. Those include:

1. Cleaner Code: Accounting for data at each step of preprocessing can get messy. With a pipeline, you won't need to manually keep track of your training and validation data at each step.
2. Fewer Bugs: There are fewer opportunities to misapply a step or forget a preprocessing step.
3. Easier to Productionize: It can be surprisingly hard to transition a model from a prototype to something deployable at scale. We won't go into the many related concerns here, but pipelines can help.
4. More Options for Model Validation: A pipeline can be passed directly to validation utilities such as cross-validation, as the last code cell of this notebook does.
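To make the "single step" idea concrete, here is a minimal sketch on synthetic data. The arrays `X_demo` and `y_demo` and the name `demo_pipeline` are made up for illustration and are not part of the housing example that follows.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data (an assumption for illustration): 100 rows, 3 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Inject missing values into the first column after the target is computed.
X_demo[::10, 0] = np.nan

# Imputation, scaling, and the model are bundled into one estimator.
demo_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", MinMaxScaler()),
    ("model", LinearRegression()),
])

# One fit call learns the imputer, the scaler, and the model on the training rows;
# one predict call applies the same preprocessing to the held-out rows.
demo_pipeline.fit(X_demo[:80], y_demo[:80])
demo_preds = demo_pipeline.predict(X_demo[80:])
print(demo_preds.shape)
```

The held-out rows never influence the imputer or scaler statistics, because those are only fitted inside `fit`.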

In [57]:
!pip install opendatasets



In [58]:
import opendatasets as od
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

In [59]:
od.download('https://www.kaggle.com/camnugent/california-housing-prices')

# Relative path matches the download location reported below (on Colab this resolves to /content)
raw_df = pd.read_csv('./california-housing-prices/housing.csv')
raw_df.head()

Skipping, found downloaded files in "./california-housing-prices" (use force=True to force download)


| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


In [60]:
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


In [61]:
# Drop columns that may not be relevant for training, along with the target column
X = raw_df.drop(['longitude', 'latitude', 'median_house_value'], axis='columns').copy()
y = raw_df.median_house_value

In [62]:
numeric_cols = X.select_dtypes(include=np.number).columns.tolist()
categorical_cols = X.select_dtypes('object').columns.tolist()

print(numeric_cols)
print()
print(categorical_cols)

['housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income']

['ocean_proximity']


# Splitting the data into train and test sets

In [63]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

## Implementing Imputer

In [64]:
from sklearn.impute import SimpleImputer

# strategy='constant' with no fill_value replaces missing numeric values with 0
imputer = SimpleImputer(strategy='constant').fit(X_train[numeric_cols])
X_train.loc[:, numeric_cols] = imputer.transform(X_train[numeric_cols])
X_test.loc[:, numeric_cols] = imputer.transform(X_test[numeric_cols])

In [65]:
X_test[numeric_cols].isna().sum()

housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
dtype: int64
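The imputer in this section uses `strategy='constant'` without a `fill_value`, while the pipeline later in this notebook uses `strategy='mean'`, so the two versions do not preprocess `total_bedrooms` identically. A small sketch of the difference on a made-up column (the array `col` is hypothetical):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One numeric column with a single missing value (illustrative data only).
col = np.array([[1.0], [3.0], [np.nan]])

# 'constant' with the default fill_value fills numeric gaps with 0.
const_filled = SimpleImputer(strategy="constant").fit_transform(col)

# 'mean' fills gaps with the mean of the observed values, here (1 + 3) / 2.
mean_filled = SimpleImputer(strategy="mean").fit_transform(col)

print(const_filled.ravel())  # missing value becomes 0.0
print(mean_filled.ravel())   # missing value becomes 2.0
```

Which strategy is preferable depends on the data; the point is only that the manual steps and the pipeline should use the same one if their scores are to be compared directly.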

## Scaling Numeric Features

In [66]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler().fit(X_train[numeric_cols])

X_train.loc[:, (numeric_cols)] = scaler.transform(X_train[numeric_cols])
X_test.loc[:, (numeric_cols)] = scaler.transform(X_test[numeric_cols])

In [67]:
X_train[numeric_cols].describe().loc[['min', 'max']]

| | housing_median_age | total_rooms | total_bedrooms | population | households | median_income |
|---|---|---|---|---|---|---|
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


## Encoding Categorical Data

In [68]:
X_train[categorical_cols]

| | ocean_proximity |
|---|---|
| 5088 | <1H OCEAN |
| 17096 | NEAR OCEAN |
| 5617 | <1H OCEAN |
| 20060 | INLAND |
| 895 | <1H OCEAN |
| ... | ... |
| 11284 | <1H OCEAN |
| 11964 | INLAND |
| 5390 | <1H OCEAN |
| 860 | <1H OCEAN |


In [69]:
from sklearn.preprocessing import OneHotEncoder

imputer = SimpleImputer(missing_values = np.nan, strategy = 'most_frequent').fit(X_train[categorical_cols])

X_train.loc[:, (categorical_cols)] = imputer.transform(X_train[categorical_cols])
X_test.loc[:, (categorical_cols)] = imputer.transform(X_test[categorical_cols])

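# On scikit-learn < 1.2 the dense-output argument is named `sparse` rather than `sparse_output`
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore').fit(X_train[categorical_cols])
# get_feature_names was removed in scikit-learn 1.2; get_feature_names_out replaces it
encoded_cols = list(encoder.get_feature_names_out(categorical_cols))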
X_train[encoded_cols] = encoder.transform(X_train[categorical_cols])
X_test[encoded_cols] = encoder.transform(X_test[categorical_cols])

In [70]:
X_train = X_train[numeric_cols + encoded_cols]
X_test = X_test[numeric_cols + encoded_cols]
X_train.head()

| | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | ocean_proximity_<1H OCEAN | ocean_proximity_INLAND | ocean_proximity_ISLAND | ocean_proximity_NEAR BAY | ocean_proximity_NEAR OCEAN |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 5088 | 0.352941 | 0.027004 | 0.048417 | 0.020264 | 0.045387 | 0.033172 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 17096 | 0.627451 | 0.08095 | 0.07185 | 0.028364 | 0.070054 | 0.256776 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 5617 | 0.803922 | 0.035556 | 0.040813 | 0.029177 | 0.039467 | 0.210266 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 20060 | 0.45098 | 0.048674 | 0.060366 | 0.047171 | 0.06101 | 0.079102 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 895 | 0.254902 | 0.156444 | 0.187151 | 0.076656 | 0.182042 | 0.240755 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [71]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13828 entries, 5088 to 15795
Data columns (total 11 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   housing_median_age          13828 non-null  float64
 1   total_rooms                 13828 non-null  float64
 2   total_bedrooms              13828 non-null  float64
 3   population                  13828 non-null  float64
 4   households                  13828 non-null  float64
 5   median_income               13828 non-null  float64
 6   ocean_proximity_<1H OCEAN   13828 non-null  float64
 7   ocean_proximity_INLAND      13828 non-null  float64
 8   ocean_proximity_ISLAND      13828 non-null  float64
 9   ocean_proximity_NEAR BAY    13828 non-null  float64
 10  ocean_proximity_NEAR OCEAN  13828 non-null  float64
dtypes: float64(11)
memory usage: 1.3 MB


## Implementing ML model

In [72]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

model = LinearRegression()

model.fit(X_train, y_train)

# model.score(X_test, y_test)

preds = model.predict(X_test)

# Evaluate the model
score = mean_absolute_error(y_test, preds)
print('MAE:', score)

MAE: 51059.69392248972


# The pipeline implementation

In [50]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [51]:
numeric_cols = X_train.select_dtypes(include=np.number).columns.tolist()
categorical_cols = X_train.select_dtypes('object').columns.tolist()
print(numeric_cols)
print()
print(categorical_cols)

['housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income']

['ocean_proximity']


## Pipeline creation

In [52]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Preprocessing for numerical data
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(missing_values = np.nan, strategy='mean')),
    ('scaler', MinMaxScaler())
])


# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(missing_values = np.nan , strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numeric_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
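To see what a `ColumnTransformer` of this shape produces, here is a small sketch on a toy frame. The values and the `toy_` names are hypothetical, and the categorical branch is reduced to a bare `OneHotEncoder` to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy frame (hypothetical values): two numeric columns, one categorical column.
toy_df = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, np.nan, 1274.0],
    "median_income": [8.3, 8.3, 7.3, 5.6],
    "ocean_proximity": ["NEAR BAY", "INLAND", "INLAND", "NEAR BAY"],
})

# Numeric branch: fill the gap with the column mean, then scale to [0, 1].
toy_numeric = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", MinMaxScaler()),
])

# Each branch is applied only to the columns listed for it.
toy_preprocessor = ColumnTransformer(transformers=[
    ("num", toy_numeric, ["total_rooms", "median_income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["ocean_proximity"]),
])

toy_out = toy_preprocessor.fit_transform(toy_df)
# The result may be a SciPy sparse matrix or a NumPy array; densify if needed.
if hasattr(toy_out, "toarray"):
    toy_out = toy_out.toarray()

# Columns are concatenated in transformer order: numeric first, then one-hot.
print(toy_out.shape)
```

The output has two scaled numeric columns followed by one column per category seen during fitting.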

## Defining the model and fitting the pipeline

In [53]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

model = LinearRegression()

In [56]:
# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', model)
])

# Preprocessing of training data, fit model
my_pipeline.fit(X_train, y_train)

# my_pipeline.score(X_test, y_test)

# Preprocessing of test data, get predictions
preds = my_pipeline.predict(X_test)

# Evaluate the model
score = mean_absolute_error(y_test, preds)
print('MAE:', score)


MAE: 50869.600575788994


In [55]:
## Implementing cross validation

from sklearn.model_selection import cross_val_score

# Multiply by -1 since sklearn calculates *negative* MAE
scores = -1 * cross_val_score(my_pipeline, X, y,
                              cv=5,
                              scoring='neg_mean_absolute_error')

print("MAE scores:\n", scores)

MAE scores:
 [5.88072304e+04 5.63761923e+04 1.73501063e+13 5.66366264e+04
 4.88699040e+04]
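The third fold's MAE (about 1.7e13) is many orders of magnitude larger than the other four, which suggests a numerical problem in that fold rather than a meaningful error estimate. With an integer `cv` and a regressor, `cross_val_score` uses `KFold` without shuffling, so each fold is a contiguous block of rows in file order. One possible contributing factor (an assumption, not verified against this dataset) is that a contiguous split leaves a rare category out of the training rows for one fold. Passing a shuffled `KFold` is one way to check. The sketch below uses synthetic stand-in data, and every `demo_` name is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Synthetic stand-in data (an assumption): rows are sorted by category,
# and one category ("ISLAND") is rare, loosely mimicking an ordered file.
rng = np.random.default_rng(0)
n = 200
demo_df = pd.DataFrame({
    "income": rng.uniform(1, 10, size=n),
    "area": np.repeat(["INLAND", "NEAR BAY", "NEAR OCEAN", "ISLAND"], [90, 60, 45, 5]),
})
demo_y = 50_000 * demo_df["income"] + rng.normal(scale=10_000, size=n)

demo_pre = ColumnTransformer(transformers=[
    ("num", MinMaxScaler(), ["income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["area"]),
])
demo_pipe = Pipeline(steps=[("preprocessor", demo_pre), ("model", LinearRegression())])

# shuffle=True mixes rows before splitting, so folds are no longer contiguous blocks.
shuffled_cv = KFold(n_splits=5, shuffle=True, random_state=42)
demo_scores = -1 * cross_val_score(demo_pipe, demo_df, demo_y,
                                   cv=shuffled_cv,
                                   scoring="neg_mean_absolute_error")
print("MAE scores:\n", demo_scores)
```

For the housing data, the same `shuffled_cv` object could be passed as `cv` in the call above in place of `cv=5`. Whether that removes the outlier fold would need to be confirmed by running it.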


# Reference

- https://www.kaggle.com/alexisbcook/pipelines