# Advanced Certification Programme in AI and MLOps
## A programme by IISc and TalentSprint
### Mini-Project: Regression and Modularization (Pipeline Building)

#### (Notebook-2)

## Problem Statement

Predict the bike rental count per hour based on the environmental and seasonal settings (such as weather, day, time, humidity, wind speed, season etc).

## Learning Objectives

At the end of the mini-project, you will be able to :

* create custom classes required for data processing
* implement pipeline and train the model
* save the model/pipeline
* make prediction using the saved model/pipeline

## Dataset Description

The dataset chosen for this mini-project is a modified version of [Bike Sharing Dataset](https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset).  This dataset contains the hourly and daily count of rental bikes between the years 2011 and 2012 in the capital bike share system with the corresponding weather and seasonal information. This dataset consists of 17379 instances of each 14 features.

<br>
<img src="https://cdn.iisc.talentsprint.com/AIandMLOps/Images/BikeShareSystem.jpg" width=400px>
<br><br>

Bike sharing systems are a new generation of traditional bike rentals where the whole process from membership, rental and return has become automatic. Through these systems, the user can easily rent a bike from a particular position and return to another position. Currently, there are about over 500 bike-sharing programs around the world which is composed of over 500 thousand bicycles. Today, there exists great interest in these systems due to their important role in traffic, environmental and health issues.

Apart from interesting real world applications of bike sharing systems, the characteristics of data being generated by these systems make them attractive for the research. As opposed to other transport services such as bus or subway, the duration of travel, departure and arrival position are explicitly recorded in these systems. This feature turns bike sharing system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that the most important events in the city could be detected via monitoring these data.

### Dataset Characteristics

* **dteday:** hourly date
* **season:**
    * spring
    * summer
    * fall
    * winter
* **hr:** hour
* **holiday:** whether the day is considered a holiday
* **weekday:** day of the week
* **workingday:** whether the day is neither a weekend nor holiday
* **weathersit:**
    * Clear, Few clouds, Partly cloudy, Partly cloudy
    * Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    * Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    * Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog<br>   
* **temp:** temperature in Celsius
* **atemp:** "feels like" temperature in Celsius
* **humidity:** relative humidity
* **windspeed:** wind speed
* **casual:** count of casual/non-registered users
* **registered:** count of registered users
* **cnt:** count of total rental bikes including both casual and registered [Target column]

In [1]:
#@title Download Dataset
!wget -qq https://cdn.iisc.talentsprint.com/AIandMLOps/MiniProjects/Datasets/bike-sharing-dataset.csv
!ls | grep ".csv"
print("Dataset downloaded successfully!")

bike-sharing-dataset.csv
Dataset downloaded successfully!


### Import Required Packages

In [2]:
# Loading the Required Packages
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

In [3]:
# ========== NEW IMPORTS FOR PIPELINE BUILDING ========

# to create pipeline
from sklearn.pipeline import Pipeline

# for including custom preprocessors within pipeline
from sklearn.base import BaseEstimator, TransformerMixin

## **1. Pre-Pipeline-Steps:**

### 1.1 Load, Explore, and Prepare the Data Set

* Load the dataset
* Understand different features in the training dataset
* Understand the data types of each columns
* Notice the columns of missing values

In [4]:
bike = pd.read_csv('bike-sharing-dataset.csv')
print(f"The shape of the dataset: {bike.shape}\n")

print(f"Understanding the data types of each column and missing values: \n")
bike.info()

The shape of the dataset: (17379, 14)

Understanding the data types of each column and missing values: 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17379 entries, 0 to 17378
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   dteday      17379 non-null  object 
 1   season      17379 non-null  object 
 2   hr          17379 non-null  object 
 3   holiday     17379 non-null  object 
 4   weekday     16504 non-null  object 
 5   workingday  17379 non-null  object 
 6   weathersit  16121 non-null  object 
 7   temp        17379 non-null  float64
 8   atemp       17379 non-null  float64
 9   hum         17379 non-null  float64
 10  windspeed   17379 non-null  float64
 11  casual      17379 non-null  int64  
 12  registered  17379 non-null  int64  
 13  cnt         17379 non-null  int64  
dtypes: float64(4), int64(3), object(7)
memory usage: 1.9+ MB


In [5]:
bike.head(2)

Unnamed: 0,dteday,season,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,2012-11-05,winter,6am,No,Mon,Yes,Mist,6.1,3.0014,49.0,19.0012,4,135,139
1,2011-07-13,fall,4am,No,Wed,Yes,Clear,26.78,28.9988,58.0,16.9979,0,5,5


### 1.2 Working on `dteday` column to extract year and month

- Create a function to extract year and month from the date column and create two another columns
  

In [6]:
def get_month_year(dataframe):

    df = dataframe.copy()
    df['dteday'] = pd.to_datetime(df['dteday'], format='%Y-%m-%d')
    df['Year'] = df['dteday'].dt.year
    df['Month'] = df['dteday'].dt.month_name()

    return df

In [7]:
bike = get_month_year(bike)

bike.head(2)

Unnamed: 0,dteday,season,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt,Year,Month
0,2012-11-05,winter,6am,No,Mon,Yes,Mist,6.1,3.0014,49.0,19.0012,4,135,139,2012,November
1,2011-07-13,fall,4am,No,Wed,Yes,Clear,26.78,28.9988,58.0,16.9979,0,5,5,2011,July


### 1.3 Find numerical and categorical variables

In [8]:
cols_delete = ['dteday', 'casual', 'registered']
target_col = ['cnt']

num_cols = []
cat_cols = []

for col in bike.columns:
    if col not in target_col + cols_delete:
        if bike[col].dtypes == 'float64':
            num_cols.append(col)
        else:
            cat_cols.append(col)

print(f"Total no. of numerical columns: {len(num_cols)} -> {num_cols}")
print(f"Total no. of categorical columns: {len(cat_cols)} -> {cat_cols}")

Total no. of numerical columns: 4 -> ['temp', 'atemp', 'hum', 'windspeed']
Total no. of categorical columns: 8 -> ['season', 'hr', 'holiday', 'weekday', 'workingday', 'weathersit', 'Year', 'Month']


## **2. Pipeline-Steps:**

Build custom classes which are compatible with Skearn pipeline for imputation, feature mapping, and any column specific operation.

### **A. Imputation**

#### Build a custom Imputation class compatible with Sklearn for handling missing values in `weekday` column.

- Find the number of NaN entries in the `weekday` column, and get their row indices
- Use the `dteday` column to extract day names
- Impute values for the missing row indices in `weekday` column with the day names extracted above

**Note that** the extracted day names will contain full names (eg. 'Monday'), and the `weekday` column contains only first three letters (eg. 'Mon').

In [9]:
class WeekdayImputer(BaseEstimator, TransformerMixin):
    """ Impute missing values in 'weekday' column by extracting dayname from 'dteday' column """

    def __init__(self, col_name: str):

        if not isinstance(col_name, str):
            raise ValueError("Column should be a string")

        self.col_name = col_name

    def fit(self, dataframe: pd.DataFrame, target: pd.Series = None):

        return self

    def transform(self, dataframe: pd.DataFrame):

        print("Imputing the day values from the extracted day information from date.")
        df = dataframe.copy()
        wkday_null_idx = df[df['weekday'].isnull() == True].index
        df.loc[wkday_null_idx, 'weekday'] = df.loc[wkday_null_idx, 'dteday'].dt.day_name().apply(lambda x: x[:3])

        return df

In [10]:
# Apply weekday imputer

wday = WeekdayImputer(col_name = 'weekday')

bike1 = wday.fit_transform(bike)
print(f"No. of missing values after transformation: {bike1['weekday'].isnull().sum()}")

Imputing the day values from the extracted day information from date.
No. of missing values after transformation: 0


In [11]:
class ColumnDropper(BaseEstimator, TransformerMixin):

    def __init__(self, col_list: list):

        if not isinstance(col_list, list):
            raise ValueError("Columns should be a list of strings")

        self.col_list = col_list

    def fit(self, dataframe = pd.DataFrame, target: pd.Series = None):

        return self

    def transform(self, dataframe = pd.DataFrame, target: pd.Series = None):

        df = dataframe.copy()
        df.drop(columns=self.col_list,inplace=True)

        return df

#### Build another custom Imputation class compatible with Sklearn for handling missing values in `weathersit` column.

- Fill in the missing rows in this column with the most frequent category

In [12]:
class WeathersitImputer(BaseEstimator, TransformerMixin):
    """ Impute missing values in 'weathersit' column by replacing them with the most frequent category value """

    def __init__(self, col_name: str):
        if not isinstance(col_name, str):
            raise ValueError("Column should be a string")

        self.col_name = col_name

    def fit(self, dataframe: pd.DataFrame, target: pd.Series = None):

        df = dataframe.copy()
        self.fill_value=df[self.col_name].mode()[0]
        return self

    def transform(self, dataframe: pd.DataFrame,):

        print("Imputing the mode value from the weather column.")
        df = dataframe.copy()
        df[self.col_name] = df[self.col_name].fillna(self.fill_value)

        return df

In [13]:
# Apply weathersit imputer

weat = WeathersitImputer(col_name = 'weathersit')
bike2 = bike1.copy(deep=True)
bike2 = weat.fit_transform(bike)
print(f"No. of missing values after transformation: {bike2['weathersit'].isnull().sum()}")

Imputing the mode value from the weather column.
No. of missing values after transformation: 0


### **B. Mapping**

#### Build a Mapper class for mapping `yr`, `mnth`, `season`, `weathersit`, `holday`, `workingday`, and `hr` columns.

In [14]:
class Mapper(BaseEstimator, TransformerMixin):
    """
    Ordinal categorical variable mapper:
    Treat column as Ordinal categorical variable, and assign values accordingly
    """

    def __init__(self, col_map: dict):

        if not isinstance(col_map, dict):
            raise ValueError("Mappings should be a dictionary of col, strings pair")
        self.col_map = col_map

    def fit(self, dataframe: pd.DataFrame, target: pd.Series = None):

        return self

    def transform(self, dataframe: pd.DataFrame):
        print("Ordinal encoding of categorical features.")
        df = dataframe.copy()

        for key, val in self.col_map.items():
            df[key] = df[key].map(val)

        return df

In [15]:
# Instantiate mapper for all ordinal categorical features

mapping_dict = {
'Year': {2011: 0, 2012: 1},
'Month': {'January': 0, 'February': 1, 'December': 2, 'March': 3, 'November': 4, 'April': 5,
                'October': 6, 'May': 7, 'September': 8, 'June': 9, 'July': 10, 'August': 11},
'season': {'spring': 0, 'winter': 1, 'summer': 2, 'fall': 3},
'weathersit': {'Heavy Rain': 0, 'Light Rain': 1, 'Mist': 2, 'Clear': 3},
'holiday': {'Yes': 0, 'No': 1},
'workingday': {'No': 0, 'Yes': 1},
'hr': {'4am': 0, '3am': 1, '5am': 2, '2am': 3, '1am': 4, '12am': 5, '6am': 6, '11pm': 7, '10pm': 8,
                '10am': 9, '9pm': 10, '11am': 11, '7am': 12, '9am': 13, '8pm': 14, '2pm': 15, '1pm': 16,
                '12pm': 17, '3pm': 18, '4pm': 19, '7pm': 20, '8am': 21, '6pm': 22, '5pm': 23}
                 }
clmapper = Mapper(mapping_dict)

In [16]:
# Map values for all ordinal categorical features
bike3 = bike2.copy(deep=True)

bike3 = clmapper.fit_transform(bike3)

bike3.head()

Ordinal encoding of categorical features.


Unnamed: 0,dteday,season,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt,Year,Month
0,2012-11-05,1,6,1,Mon,1,2,6.1,3.0014,49.0,19.0012,4,135,139,1,4
1,2011-07-13,3,0,1,Wed,1,3,26.78,28.9988,58.0,16.9979,0,5,5,0,10
2,2012-02-09,0,11,1,Thu,1,3,3.28,-0.9982,52.0,15.0013,4,95,99,1,1
3,2012-03-22,2,12,1,Thu,1,2,14.56,15.0002,100.0,6.0032,29,332,361,1,3
4,2011-11-08,1,17,1,Tue,1,3,16.44,17.0,52.0,8.9981,28,175,203,0,4


### **C. Class for Specific operation**

#### Build a Class for handling outliers in numerical columns

- Instead of removing the outliers, change their values
    - to upper-bound, if the value is higher than upper-bound, or
    - to lower-bound, if the value is lower than lower-bound respectively.

In [17]:
class OutlierHandler(BaseEstimator, TransformerMixin):
    """
    Change the outlier values:
        - to upper-bound, if the value is higher than upper-bound, or
        - to lower-bound, if the value is lower than lower-bound respectively.
    """

    def __init__(self, col_list: list):

        if not isinstance(col_list, list):
            raise ValueError("Columns should be a list of strings")
        self.col_list = col_list
        self.limit_dict = {}

    def fit(self, dataframe: pd.DataFrame, target: pd.Series = None):

        df = dataframe.copy()
        # limit_dict = {}

        for col in self.col_list:

            q1 = df.describe()[col].loc['25%']
            q3 = df.describe()[col].loc['75%']
            iqr = q3 - q1
            lower_bound = q1 - (1.5 * iqr)
            upper_bound = q3 + (1.5 * iqr)
            self.limit_dict[col] = [lower_bound, upper_bound]

        self.limits = self.limit_dict
        return self


    def transform(self, dataframe: pd.DataFrame):
        print("Handling outliers for numerical features.")
        df = dataframe.copy()

        for col in self.col_list:
            for i in df.index:

                if df.loc[i,col] > self.limits[col][1]:
                    df.loc[i,col]= self.limits[col][1]

                if df.loc[i,col] < self.limits[col][0]:
                    df.loc[i,col]= self.limits[col][0]

        return df

In [18]:
# Instantiate outlier handler for all numerical features

print(f"Numerical columns: {num_cols}")

outtreat = OutlierHandler(num_cols)

Numerical columns: ['temp', 'atemp', 'hum', 'windspeed']


In [19]:
# Handle outliers for all numerical columns

bike4 = bike3.copy(deep=True)

bike4 = outtreat.fit_transform(bike4)

bike4.head()

Handling outliers for numerical features.


Unnamed: 0,dteday,season,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt,Year,Month
0,2012-11-05,1,6,1,Mon,1,2,6.1,3.0014,49.0,19.0012,4,135,139,1,4
1,2011-07-13,3,0,1,Wed,1,3,26.78,28.9988,58.0,16.9979,0,5,5,0,10
2,2012-02-09,0,11,1,Thu,1,3,3.28,-0.9982,52.0,15.0013,4,95,99,1,1
3,2012-03-22,2,12,1,Thu,1,2,14.56,15.0002,100.0,6.0032,29,332,361,1,3
4,2011-11-08,1,17,1,Tue,1,3,16.44,17.0,52.0,8.9981,28,175,203,0,4


#### Build a Class to One-hot Encode `weekday` column

In [20]:

class WeekdayOneHotEncoder(BaseEstimator, TransformerMixin):
    """ One-hot encode weekday column """

    def __init__(self, col_list: list):

        if not isinstance(col_list, list):
            raise ValueError("Columns should be a list of strings")

        self.col_list = col_list
        self.categories_ = {}

    def fit(self, dataframe: pd.DataFrame, target: pd.Series = None):

        df = dataframe.copy()
        for col in self.col_list:
            self.categories_[col] = df[col].unique()

        return self

    def transform(self, dataframe: pd.DataFrame):
        print("One hot encoding for particular features.")
        if not self.categories_:
            raise ValueError("Must fit the transformer before transforming the data.")

        df = dataframe.copy()
        for col in self.col_list:
            categories = self.categories_[col]
            for category in categories:
                new_column_name = f"{col}_{category}"
                df[new_column_name] = (df[col] == category).astype(int)
            df = df.drop(col, axis=1)

        return df

In [21]:
# Treat 'weekday' column as a Categorical variable, perform one-hot encoding
cols_list = ['weekday']

onehot = WeekdayOneHotEncoder(cols_list)

bike5 = bike4.copy(deep=True)
bike5 = onehot.fit_transform(bike5)

bike5.head()

One hot encoding for particular features.


Unnamed: 0,dteday,season,hr,holiday,workingday,weathersit,temp,atemp,hum,windspeed,...,Year,Month,weekday_Mon,weekday_Wed,weekday_Thu,weekday_Tue,weekday_nan,weekday_Fri,weekday_Sun,weekday_Sat
0,2012-11-05,1,6,1,1,2,6.1,3.0014,49.0,19.0012,...,1,4,1,0,0,0,0,0,0,0
1,2011-07-13,3,0,1,1,3,26.78,28.9988,58.0,16.9979,...,0,10,0,1,0,0,0,0,0,0
2,2012-02-09,0,11,1,1,3,3.28,-0.9982,52.0,15.0013,...,1,1,0,0,1,0,0,0,0,0
3,2012-03-22,2,12,1,1,2,14.56,15.0002,100.0,6.0032,...,1,3,0,0,1,0,0,0,0,0
4,2011-11-08,1,17,1,1,3,16.44,17.0,52.0,8.9981,...,0,4,0,0,0,1,0,0,0,0


## **3. Build Pipeline**

Build a pipeline and implement all the above class transformers inside the pipeline along with the regressor.

In [22]:
bike_pipe = Pipeline([

    ## Weekday imputer ##
    ('weekday_imputation', WeekdayImputer('weekday')),

    ## Weather imputer ##
    ('weather_imputation', WeathersitImputer('weathersit')),

    ## Unused columns ##
    ('unused_column_dropper',ColumnDropper(cols_delete)),

    ## Mapping ##
    ('all_mapper', Mapper(mapping_dict)),

    ## Outlier handling ##
    ('all_outlier', OutlierHandler(num_cols)),

    ## One hot encoder ##
    ('one_hot_encoder', WeekdayOneHotEncoder(['weekday'])),

    ## scaler ##
    ('scaler', StandardScaler()),

    ## Model fit
    ('model_rf', RandomForestRegressor(n_estimators=150, max_depth=10, random_state=42))
])

## **4. Fit Pipeline**

- Separate target and prediction features
- Split data into train and test set
- Fit pipeline on train set
- Get prediction on test set
- Calculate the mse and r2_score

In [23]:
features = bike.drop('cnt', axis=1)
target = bike['cnt']

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.20, random_state=42)
X_train.shape, X_test.shape

((13903, 15), (3476, 15))

In [24]:
bike_pipe.fit(X_train, y_train)

Imputing the day values from the extracted day information from date.
Imputing the mode value from the weather column.
Ordinal encoding of categorical features.
Handling outliers for numerical features.
One hot encoding for particular features.


In [25]:
y_pred = bike_pipe.predict(X_test)

Imputing the day values from the extracted day information from date.
Imputing the mode value from the weather column.
Ordinal encoding of categorical features.
Handling outliers for numerical features.
One hot encoding for particular features.


In [26]:
# Calculate the score/error

print("R2 score:", r2_score(y_test, y_pred))
print("Mean squared error:", mean_squared_error(y_test, y_pred))

R2 score: 0.9194377452225064
Mean squared error: 2729.0113881981115


### Check for package versions may be used for requirements.txt file

In [27]:
!pip -qq install pydantic
!pip -qq install strictyaml
!pip -qq install ruamel.yaml

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m123.9/123.9 kB[0m [31m3.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m117.8/117.8 kB[0m [31m2.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m526.7/526.7 kB[0m [31m7.0 MB/s[0m eta [36m0:00:00[0m
[?25h

In [28]:
import numpy as np
import pandas as pd
import sklearn
import pydantic
import strictyaml
import ruamel.yaml
import joblib

In [29]:
print(np.__version__)
print(pd.__version__)
print(sklearn.__version__)
print(pydantic.__version__)
print(strictyaml.__version__)
print(ruamel.yaml.__version__)
print(joblib.__version__)

1.25.2
2.0.3
1.2.2
2.7.4
1.6.2
0.18.6
1.4.2


## **5. Modularize the application**

- Convert the above regression application to a production environment format (.py files) inside VS code.

- Create different modules specific to functionality:
    - requirements
    - configuration
    - data manager
    - feature engineering
    - pipeline building
    - pipeline training
    - predict
