<a href="https://colab.research.google.com/github/hargurjeet/MachineLearning/blob/master/ML_Pipelines.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

**Pipelines** are a simple way to keep your data preprocessing and modeling code organized. Specifically, a pipeline bundles preprocessing and modeling steps so you can use the whole bundle as if it were a single step.

Many data scientists hack together models without pipelines, but pipelines have some important benefits. Those include:

1. Cleaner Code: Accounting for data at each step of preprocessing can get messy. With a pipeline, you won't need to manually keep track of your training and validation data at each step.
2. Fewer Bugs: There are fewer opportunities to misapply a step or forget a preprocessing step.
3. Easier to Productionize: It can be surprisingly hard to transition a model from a prototype to something deployable at scale. We won't go into the many related concerns here, but pipelines can help.
4. More Options for Model Validation: A pipeline can be passed directly to validation tools such as cross-validation. Section 5.3 shows an example.
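
As a minimal sketch of the "single step" idea, the snippet below chains an imputer, a scaler, and a linear model on a tiny made-up array. The data and the names `X_toy`, `y_toy`, and `toy_pipeline` are illustrative assumptions and are not part of the housing dataset used later.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Made-up data: a single feature with one missing value
X_toy = np.array([[1.0], [2.0], [np.nan], [4.0]])
y_toy = np.array([2.0, 4.0, 6.0, 8.0])

# Each step is a (name, estimator) pair; the last step is the model
toy_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', MinMaxScaler()),
    ('model', LinearRegression()),
])

# One fit call runs imputation, scaling, and model training in order
toy_pipeline.fit(X_toy, y_toy)

# One predict call applies the same fitted preprocessing to new raw rows
toy_preds = toy_pipeline.predict(np.array([[3.0], [np.nan]]))
print(toy_preds)
```

The caller never touches the intermediate arrays, which is the bookkeeping that the benefits above refer to.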

# **Table Of Contents**<a name="Top"></a>


---

  1. [About the Dataset](#dataset)
  2. [Performing Train Test Split](#splitting)
  3. [Preprocessing W/O pipeline](#Data-Pre)
    
    3.1 [Imputing Numeric Columns](#ImputeNum)
    
    3.2 [Scaling Numeric Columns](#Scaling-Num)
    
    3.3 [Imputing Categorical Columns](#Impute-Cat)
    
    3.4 [Encoding Categorical Columns](#Encoding-Cat)
  4. [Model Implementation W/O Pipelines](#Model-Implementation)
  5. [Pipeline](#Pipeline)
    
    5.1 [Pipeline Implementation](#Pipeline-Implementation)
    
    5.2 [Model Implementation with Pipelines](#Implementation-with-pipeline)
  6. [Summary](#Summary)
  7. [References](#References)

# **1: About the Dataset** <a name="dataset"></a>


---

The dataset comes from Kaggle and is available [here](https://www.kaggle.com/camnugent/california-housing-prices). It contains information from the 1990 California census.

The dataset contains the following columns:

1. longitude: A measure of how far west a house is; California longitudes are negative, so a more negative value is farther west

2. latitude: A measure of how far north a house is; a higher value is farther north

3. housingMedianAge: Median age of a house within a block; a lower number is a newer building

4. totalRooms: Total number of rooms within a block

5. totalBedrooms: Total number of bedrooms within a block

6. population: Total number of people residing within a block

7. households: Total number of households (groups of people residing within a home unit) within a block

8. medianIncome: Median income for households within a block of houses (measured in tens of thousands of US Dollars)

9. medianHouseValue: Median house value for households within a block (measured in US Dollars)

10. oceanProximity: Location of the house relative to the ocean or sea

medianHouseValue is the target column. In the CSV file the columns use snake_case names, for example `median_house_value`.



# **2: Performing Train Test Split** <a name="splitting"></a>


---


In [2]:
# Importing all the required libraries
!pip install opendatasets --quiet
import opendatasets as od
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

In [3]:
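# Access the dataset (opendatasets prompts for your Kaggle username and API key)
od.download('https://www.kaggle.com/camnugent/california-housing-prices')

# The archive is extracted into ./california-housing-prices
raw_df = pd.read_csv('./california-housing-prices/housing.csv')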
raw_df.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


In [4]:
# Some columns may not be relevant for training, so drop them along with the target column
X = raw_df.drop(['longitude', 'latitude', 'median_house_value'], axis='columns').copy()
y = raw_df.median_house_value

In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
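
The effect of `test_size` and `random_state` can be checked on a small stand-in array. This is only an illustrative sketch: `X_demo` and `y_demo` are made-up placeholders, not the housing data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up stand-in data: 100 rows with a single feature
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.arange(100)

# Two splits with the same random_state
split_a = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(len(X_tr), len(X_te))

# Fixing random_state makes the shuffle reproducible across runs
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```

Reproducible splits matter here because Section 5 repeats the split with the same arguments so that both approaches are evaluated on the same rows.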

# **3: Preprocessing W/O pipeline** <a name="Data-Pre"></a>


---


In [6]:
numeric_cols = X.select_dtypes(include=np.number).columns.tolist()
categorical_cols = X.select_dtypes('object').columns.tolist()

print(numeric_cols)
print()
print(categorical_cols)

['housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income']

['ocean_proximity']


## 3.1 Imputing Numeric Columns <a name="ImputeNum"></a>

In [7]:
from sklearn.impute import SimpleImputer

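# strategy='constant' with no fill_value fills missing numeric values with 0
# (the pipeline in Section 5.1 uses strategy='mean' instead)
imputer = SimpleImputer(strategy = 'constant').fit(X_train[numeric_cols])
X_train.loc[:, (numeric_cols)] = imputer.transform(X_train[numeric_cols])
X_test.loc[:, (numeric_cols)] = imputer.transform(X_test[numeric_cols])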

In [8]:
X_test[numeric_cols].isna().sum()

housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
dtype: int64
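
The choice of imputation strategy changes what ends up in the gaps. The sketch below compares `strategy='constant'` (used above) with `strategy='mean'` (used in the pipeline in Section 5.1) on a made-up three-row column; the array `col` is an illustrative assumption.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up column with one missing value
col = np.array([[1.0], [np.nan], [3.0]])

# 'constant' with no fill_value replaces missing numeric values with 0
constant_filled = SimpleImputer(strategy='constant').fit_transform(col)

# 'mean' replaces missing values with the column mean learned during fit
mean_filled = SimpleImputer(strategy='mean').fit_transform(col)

print(constant_filled.ravel())  # [1. 0. 3.]
print(mean_filled.ravel())      # [1. 2. 3.]
```

Because the two approaches in this notebook use different strategies, small differences between their MAE values are to be expected.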

## 3.2 Scaling Numeric Columns <a name="Scaling-Num"></a>

In [9]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler().fit(X_train[numeric_cols])

X_train.loc[:, (numeric_cols)] = scaler.transform(X_train[numeric_cols])
X_test.loc[:, (numeric_cols)] = scaler.transform(X_test[numeric_cols])

In [10]:
X_train[numeric_cols].describe().loc[['min', 'max']]

| | housing_median_age | total_rooms | total_bedrooms | population | households | median_income |
|---|---|---|---|---|---|---|
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
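
`MinMaxScaler` maps each column with `(x - min) / (max - min)`, where `min` and `max` come from the data it was fitted on. The sketch below, using made-up numbers (`train_col` and `test_col` are assumptions), checks that formula by hand and shows that test values outside the training range fall outside `[0, 1]`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up training and test columns
train_col = np.array([[10.0], [20.0], [30.0]])
test_col = np.array([[15.0], [40.0]])

# The scaler learns min and max from the training data only
scaler_demo = MinMaxScaler().fit(train_col)
scaled_test = scaler_demo.transform(test_col)

# The same computation written out by hand
manual = (test_col - train_col.min()) / (train_col.max() - train_col.min())

print(scaled_test.ravel())  # [0.25 1.5 ]
print(np.allclose(scaled_test, manual))
```

This is why the min/max table above is computed on `X_train`: only the training set is guaranteed to span exactly 0 to 1.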


## 3.3 Imputing Categorical Columns <a name="Impute-Cat"></a>

In [11]:
imputer = SimpleImputer(missing_values = np.nan, strategy = 'most_frequent').fit(X_train[categorical_cols])

X_train.loc[:, (categorical_cols)] = imputer.transform(X_train[categorical_cols])
X_test.loc[:, (categorical_cols)] = imputer.transform(X_test[categorical_cols])

## 3.4 Encoding Categorical Columns <a name="Encoding-Cat"></a>

In [12]:
from sklearn.preprocessing import OneHotEncoder

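# scikit-learn 1.2+ API: `sparse_output` replaces `sparse`, and
# `get_feature_names_out` replaces the removed `get_feature_names`
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore').fit(X_train[categorical_cols])
encoded_cols = list(encoder.get_feature_names_out(categorical_cols))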
X_train[encoded_cols] = encoder.transform(X_train[categorical_cols])
X_test[encoded_cols] = encoder.transform(X_test[categorical_cols])
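
To illustrate what `handle_unknown='ignore'` does, the sketch below fits an encoder on two categories and then transforms a row with a category it never saw. The small frames `train_cat` and `test_cat` are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up training and test frames; 'ISLAND' appears only in the test frame
train_cat = pd.DataFrame({'ocean_proximity': ['INLAND', 'NEAR BAY', 'INLAND']})
test_cat = pd.DataFrame({'ocean_proximity': ['NEAR BAY', 'ISLAND']})

# The encoder learns its categories from the training data only
enc_demo = OneHotEncoder(handle_unknown='ignore').fit(train_cat)

# The default output is a sparse matrix, so convert it for printing
encoded_test = enc_demo.transform(test_cat).toarray()

print(enc_demo.categories_[0])  # ['INLAND' 'NEAR BAY']
print(encoded_test)             # second row is all zeros for the unseen 'ISLAND'
```

Without `handle_unknown='ignore'`, transforming an unseen category raises an error instead of producing an all-zero row.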

# **4: Model Implementation W/O Pipelines** <a name="Model-Implementation"></a>


---


In [13]:
X_train = X_train[numeric_cols + encoded_cols]
X_test = X_test[numeric_cols + encoded_cols]
X_train.head()

| | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | ocean_proximity_<1H OCEAN | ocean_proximity_INLAND | ocean_proximity_ISLAND | ocean_proximity_NEAR BAY | ocean_proximity_NEAR OCEAN |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 5088 | 0.352941 | 0.027004 | 0.048417 | 0.020264 | 0.045387 | 0.033172 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 17096 | 0.627451 | 0.08095 | 0.07185 | 0.028364 | 0.070054 | 0.256776 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 5617 | 0.803922 | 0.035556 | 0.040813 | 0.029177 | 0.039467 | 0.210266 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 20060 | 0.45098 | 0.048674 | 0.060366 | 0.047171 | 0.06101 | 0.079102 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 895 | 0.254902 | 0.156444 | 0.187151 | 0.076656 | 0.182042 | 0.240755 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [15]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

model = LinearRegression()

model.fit(X_train, y_train)

preds = model.predict(X_test)

# Evaluate the model
score = mean_absolute_error(y_test, preds)
print('MAE:', score)

MAE: 51059.69392248972
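
Mean absolute error is the average of the absolute differences between the true and predicted values, so it is expressed in the same unit as the target (US dollars here). The sketch below verifies this against `mean_absolute_error` on made-up numbers; `y_true_demo` and `y_pred_demo` are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up true values and predictions, in US dollars
y_true_demo = np.array([100000.0, 200000.0, 300000.0])
y_pred_demo = np.array([110000.0, 190000.0, 330000.0])

# MAE by hand: mean of the absolute errors (10000, 10000, 30000)
manual_mae = np.mean(np.abs(y_true_demo - y_pred_demo))

# The same metric from scikit-learn
sklearn_mae = mean_absolute_error(y_true_demo, y_pred_demo)

print(manual_mae, sklearn_mae)
```

Read this way, the MAE above means the model's predictions are off by roughly 51,000 dollars on average.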


# **5: Pipeline** <a name="Pipeline"></a>


---


In [16]:
# Re-split the raw features so the pipeline starts from unprocessed data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [17]:
numeric_cols = X_train.select_dtypes(include=np.number).columns.tolist()
categorical_cols = X_train.select_dtypes('object').columns.tolist()
print(numeric_cols)
print()
print(categorical_cols)

['housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income']

['ocean_proximity']


## 5.1 Pipeline Implementation <a name="Pipeline-Implementation"></a>

In [18]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Preprocessing for numerical data
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(missing_values = np.nan, strategy='mean')),
    ('scaler', MinMaxScaler())
])


# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(missing_values = np.nan , strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numeric_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
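
A quick way to see what a `ColumnTransformer` like the one above produces is to run it on a handful of rows and look at the output shape. The sketch below rebuilds the same structure on a made-up four-row frame; `toy_df` and `demo_preprocessor` are illustrative names, and the values are invented.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Made-up frame: two numeric columns (one with a gap) and one categorical column
toy_df = pd.DataFrame({
    'total_rooms': [880.0, np.nan, 1467.0, 1274.0],
    'median_income': [8.3, 8.3, 7.3, 5.6],
    'ocean_proximity': ['NEAR BAY', 'INLAND', 'INLAND', 'NEAR BAY'],
})

numeric_demo = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', MinMaxScaler()),
])
categorical_demo = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

# Each transformer only sees the columns listed for it
demo_preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_demo, ['total_rooms', 'median_income']),
    ('cat', categorical_demo, ['ocean_proximity']),
])

transformed = demo_preprocessor.fit_transform(toy_df)

# 2 scaled numeric columns + 2 one-hot columns
print(transformed.shape)  # (4, 4)
```

The outputs of the individual transformers are concatenated column-wise in the order they are listed, numeric first and one-hot columns second.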

## 5.2 Model Implementation with Pipelines <a name="Implementation-with-pipeline"></a>

In [19]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()

In [20]:
# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', model)
])

# Preprocess the training data and fit the model
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_test)

# Evaluate the model
score = mean_absolute_error(y_test, preds)
print('MAE:', score)

MAE: 50869.600575788994
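
Once fitted, a pipeline accepts raw rows directly, including rows with missing values or unseen categories, and its steps stay accessible by name. The following sketch shows this on a small made-up frame; `raw_train`, `raw_target`, `new_rows`, and `demo_pipeline` are illustrative assumptions with invented values.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Made-up training data
raw_train = pd.DataFrame({
    'median_income': [8.3, 7.3, 5.6, 3.8, 2.1, 4.0],
    'ocean_proximity': ['NEAR BAY', 'NEAR BAY', 'INLAND', 'INLAND', 'INLAND', 'NEAR BAY'],
})
raw_target = pd.Series([452600.0, 352100.0, 341300.0, 342200.0, 150000.0, 250000.0])

demo_pipeline = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(transformers=[
        ('num', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='mean')),
            ('scaler', MinMaxScaler()),
        ]), ['median_income']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['ocean_proximity']),
    ])),
    ('model', LinearRegression()),
])
demo_pipeline.fit(raw_train, raw_target)

# New raw rows: one missing income, one category never seen during training
new_rows = pd.DataFrame({
    'median_income': [np.nan, 6.0],
    'ocean_proximity': ['INLAND', 'ISLAND'],
})
new_preds = demo_pipeline.predict(new_rows)
print(new_preds)

# Fitted steps are reachable by the names given in `steps`
coefs = demo_pipeline.named_steps['model'].coef_
print(coefs.shape)  # (3,): one scaled numeric column + two one-hot columns
```

This is the property that makes pipelines convenient to deploy: the object that serves predictions carries its own preprocessing.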


## 5.3 Bonus: Implementing Cross-Validation

In [21]:
## Implementing cross validation

from sklearn.model_selection import cross_val_score

# Multiply by -1 since sklearn calculates *negative* MAE
scores = -1 * cross_val_score(my_pipeline, X, y,
                              cv=5,
                              scoring='neg_mean_absolute_error')

print("MAE scores:\n", scores)

MAE scores:
 [5.88072304e+04 5.63761923e+04 1.73501063e+13 5.66366264e+04
 4.88699040e+04]
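
The third fold in the output above is many orders of magnitude larger than the other four, which suggests an unstable fit on that fold rather than a typical error, so it is worth investigating before averaging the scores. One thing to try, shown here only as a hedged sketch on synthetic data and not as a confirmed fix, is shuffling the rows before they are split into folds with `KFold(shuffle=True)`. The names `X_cv`, `y_cv`, and `cv_pipeline` are assumptions, and the data is randomly generated.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic regression data with a few injected missing values
rng = np.random.default_rng(0)
X_cv = rng.normal(size=(100, 3))
y_cv = X_cv @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
X_cv[::10, 0] = np.nan

cv_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', MinMaxScaler()),
    ('model', LinearRegression()),
])

# Shuffle rows before splitting so folds are not contiguous blocks of the file
folds = KFold(n_splits=5, shuffle=True, random_state=42)

# The whole pipeline is refitted on each training fold, so the imputer and
# scaler never see the held-out fold
cv_scores = -1 * cross_val_score(cv_pipeline, X_cv, y_cv,
                                 cv=folds,
                                 scoring='neg_mean_absolute_error')

print("MAE per fold:", cv_scores)
print("Mean MAE:", cv_scores.mean())
```

Passing the pipeline, rather than preprocessed arrays, to `cross_val_score` is what keeps the preprocessing inside each fold.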


# **6: Summary** <a name="Summary"></a>


---

- We imported the California housing dataset from Kaggle.
- We applied each preprocessing step by hand (imputing missing values, scaling, and encoding).
- We trained a linear regression model on the preprocessed data.
- We repeated the same preprocessing with a scikit-learn pipeline and trained the model again.
- We saw the benefits of pipelines, including the bonus use case of cross-validation.


# **7: References** <a name="References"></a>


---

- https://www.kaggle.com/alexisbcook/pipelines
