# Advanced Certification Programme in AI and MLOps
## A programme by IISc and TalentSprint
### Mini-Project: Regression and Modularization (Pipeline Building)

#### (Notebook-2)

## Problem Statement

Predict the bike rental count per hour based on environmental and seasonal settings (such as weather, day, time, humidity, wind speed, and season).

## Learning Objectives

At the end of the mini-project, you will be able to:

* create custom classes required for data processing
* implement a pipeline and train the model
* save the model/pipeline
* make predictions using the saved model/pipeline

## Dataset Description

The dataset chosen for this mini-project is a modified version of the [Bike Sharing Dataset](https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset). This dataset contains the hourly and daily count of rental bikes between the years 2011 and 2012 in the Capital Bikeshare system, with the corresponding weather and seasonal information. It consists of 17,379 instances, each with 14 features.

<br>
<img src="https://cdn.iisc.talentsprint.com/AIandMLOps/Images/BikeShareSystem.jpg" width=400px>
<br><br>

Bike sharing systems are a new generation of traditional bike rentals in which the whole process of membership, rental, and return has become automatic. Through these systems, a user can easily rent a bike at one location and return it at another. Currently, there are over 500 bike-sharing programs around the world, comprising over 500 thousand bicycles. Today, there is great interest in these systems due to their important role in traffic, environmental, and health issues.

Apart from the interesting real-world applications of bike sharing systems, the characteristics of the data generated by these systems make them attractive for research. Unlike other transport services such as bus or subway, the duration of travel and the departure and arrival positions are explicitly recorded in these systems. This feature turns a bike sharing system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that the most important events in the city could be detected by monitoring these data.

### Dataset Characteristics

* **dteday:** hourly date
* **season:**
    * spring
    * summer
    * fall
    * winter
* **hr:** hour
* **holiday:** whether the day is considered a holiday
* **weekday:** day of the week
* **workingday:** whether the day is neither a weekend nor holiday
* **weathersit:**
    * Clear, Few clouds, Partly cloudy
    * Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    * Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    * Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
* **temp:** temperature in Celsius
* **atemp:** "feels like" temperature in Celsius
* **hum:** relative humidity
* **windspeed:** wind speed
* **casual:** count of casual/non-registered users
* **registered:** count of registered users
* **cnt:** count of total rental bikes including both casual and registered

In [1]:
#@title Download Dataset
!wget -qq https://cdn.iisc.talentsprint.com/AIandMLOps/MiniProjects/Datasets/bike-sharing-dataset.csv
!ls | grep ".csv"
print("Dataset downloaded successfully!")

bike-sharing-dataset.csv
Dataset downloaded successfully!


### Import Required Packages

In [1]:
# Loading the Required Packages
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

In [2]:
# ========== NEW IMPORTS FOR PIPELINE BUILDING ========

# to create pipeline
from sklearn.pipeline import Pipeline

# for including custom preprocessors within pipeline
from sklearn.base import BaseEstimator, TransformerMixin

## **1. Pre-Pipeline-Steps:**

### 1.1 Load, Explore, and Prepare the Data Set

* Load the dataset
* Understand different features in the training dataset
* Understand the data type of each column
* Note which columns have missing values

In [3]:
# YOUR CODE HERE
df = pd.read_csv('/content/bike-sharing-dataset.csv')
df.head()

Unnamed: 0,dteday,season,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,2012-11-05,winter,6am,No,Mon,Yes,Mist,6.1,3.0014,49.0,19.0012,4,135,139
1,2011-07-13,fall,4am,No,Wed,Yes,Clear,26.78,28.9988,58.0,16.9979,0,5,5
2,2012-02-09,spring,11am,No,Thu,Yes,Clear,3.28,-0.9982,52.0,15.0013,4,95,99
3,2012-03-22,summer,7am,No,Thu,Yes,Mist,14.56,15.0002,100.0,6.0032,29,332,361
4,2011-11-08,winter,12pm,No,Tue,Yes,Clear,16.44,17.0,52.0,8.9981,28,175,203


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17379 entries, 0 to 17378
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   dteday      17379 non-null  object 
 1   season      17379 non-null  object 
 2   hr          17379 non-null  object 
 3   holiday     17379 non-null  object 
 4   weekday     16504 non-null  object 
 5   workingday  17379 non-null  object 
 6   weathersit  16121 non-null  object 
 7   temp        17379 non-null  float64
 8   atemp       17379 non-null  float64
 9   hum         17379 non-null  float64
 10  windspeed   17379 non-null  float64
 11  casual      17379 non-null  int64  
 12  registered  17379 non-null  int64  
 13  cnt         17379 non-null  int64  
dtypes: float64(4), int64(3), object(7)
memory usage: 1.9+ MB


### 1.2 Working on `dteday` column to extract year and month

- Create a function (or transformer class) to extract the year and month from the date column and store them in two new columns
  

In [5]:
# YOUR CODE HERE
# def year_month(data: pd.DataFrame) -> pd.DataFrame:
#   df = data.copy()
#   df['dteday'] = pd.to_datetime(df['dteday'],format='%Y-%m-%d')
#   df['Year'] = df['dteday'].dt.year
#   df['Month'] = df['dteday'].dt.month_name()

#   return df

class year_month(BaseEstimator, TransformerMixin):
    """
    Date feature extractor:
    Parse the date column and add 'Year' and 'Month' columns derived from it
    """

    def __init__(self, variable: str):
        # YOUR CODE HERE
        self.variable = variable

    def fit(self, X: pd.DataFrame, y: pd.Series = None):
        # YOUR CODE HERE
        return self

    def transform(self,  X: pd.DataFrame) -> pd.DataFrame:
        # YOUR CODE HERE
        X = X.copy()
        X[self.variable] = pd.to_datetime(X[self.variable],format='%Y-%m-%d')
        X['Year'] = X[self.variable].dt.year
        X['Month'] = X[self.variable].dt.month_name()
        return X

In [9]:
ym = year_month(variable='dteday')
df = ym.fit_transform(df)
df.head()

Unnamed: 0,dteday,season,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt,Year,Month
0,2012-11-05,winter,6am,No,Mon,Yes,Mist,6.1,3.0014,49.0,19.0012,4,135,139,2012,November
1,2011-07-13,fall,4am,No,Wed,Yes,Clear,26.78,28.9988,58.0,16.9979,0,5,5,2011,July
2,2012-02-09,spring,11am,No,Thu,Yes,Clear,3.28,-0.9982,52.0,15.0013,4,95,99,2012,February
3,2012-03-22,summer,7am,No,Thu,Yes,Mist,14.56,15.0002,100.0,6.0032,29,332,361,2012,March
4,2011-11-08,winter,12pm,No,Tue,Yes,Clear,16.44,17.0,52.0,8.9981,28,175,203,2011,November


### 1.3 Find numerical and categorical variables

In [None]:
# YOUR CODE HERE
target = ['cnt']
drop = ['dteday','casual','registered']

num=[]
cat = []

for i in df.columns:
  if i not in target + drop:
    if df[i].dtypes == 'float64' or df[i].dtypes == 'int64':
      num.append(i)
    else:
      cat.append(i)

In [None]:
df[num].isnull().sum()

temp         0
atemp        0
hum          0
windspeed    0
Year         0
dtype: int64

In [None]:
df[cat].isnull().sum()

season           0
hr               0
holiday          0
weekday        875
workingday       0
weathersit    1258
Month            0
dtype: int64

### 1.4 Train-Test Split

In [6]:
df.sort_values(by='dteday',inplace=True,ascending=True)
df.reset_index(drop=True,inplace=True)
df.head()

Unnamed: 0,dteday,season,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,2011-01-01,spring,7pm,No,Sat,No,Light Rain,11.74,11.9972,88.0,16.9979,6,31,37
1,2011-01-01,spring,10am,No,Sat,No,Clear,9.86,9.9974,76.0,16.9979,12,24,36
2,2011-01-01,spring,8am,No,Sat,No,Clear,3.28,3.0014,75.0,0.0,1,7,8
3,2011-01-01,spring,4pm,No,Sat,No,Mist,11.74,11.9972,82.0,19.9995,41,52,93
4,2011-01-01,spring,9pm,No,Sat,No,Mist,10.8,11.0006,87.0,12.998,3,31,34


In [7]:
# Chronological 80/20 split (rows are already sorted by date)
split_idx = int(0.8 * len(df))
train = df.iloc[:split_idx].reset_index(drop=True)
test = df.iloc[split_idx:].reset_index(drop=True)
train.shape, test.shape

((13903, 14), (3476, 14))

## **2. Pipeline-Steps:**

Build custom classes that are compatible with the Sklearn pipeline for imputation, feature mapping, and any column-specific operation.

### **A. Imputation**

#### Build a custom Imputation class compatible with Sklearn for handling missing values in `weekday` column.

- Find the number of NaN entries in the `weekday` column, and get their row indices
- Use the `dteday` column to extract day names
- Impute values for the missing row indices in `weekday` column with the day names extracted above

**Note that** the extracted day names will contain full names (eg. 'Monday'), and the `weekday` column contains only first three letters (eg. 'Mon').
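
The abbreviation step described in the note can be sketched on a toy frame before wrapping it in a class. The two rows below are made up for illustration; only the column names `dteday` and `weekday` come from the dataset.

```python
import pandas as pd

# Toy frame (made-up rows) with one missing weekday.
toy = pd.DataFrame({
    "dteday": pd.to_datetime(["2011-01-01", "2011-01-03"]),
    "weekday": ["Sat", None],
})

# Full day names from the date column, shortened to three letters
# so that they match the existing 'weekday' values.
short_names = toy["dteday"].dt.day_name().str[:3]

# Fill only the rows where 'weekday' is missing.
toy["weekday"] = toy["weekday"].fillna(short_names)
print(toy["weekday"].tolist())
```

Filling from a Series aligns on the index, so rows that already have a value are left untouched.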

In [38]:
class WeekdayImputer(BaseEstimator, TransformerMixin):
    """ Impute missing values in 'weekday' column by extracting dayname from 'dteday' column """

    def __init__(self, variable: str):
        # YOUR CODE HERE
        self.variable = variable

    def fit(self,X: pd.DataFrame, y: pd.Series = None):
        return self

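    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        # YOUR CODE HERE
        X = X.copy()  # do not modify the caller's dataframe
        # Rows where the weekday is missing
        null_rows = X[self.variable].isnull()
        # Fill them with the three-letter day name derived from 'dteday'
        # ('dteday' must already be a datetime column)
        X.loc[null_rows, self.variable] = X.loc[null_rows, 'dteday'].dt.day_name().str[:3]
        # 'dteday' is not needed after this step
        X = X.drop(['dteday'], axis=1)

        return X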

In [None]:
# Apply weekday imputer
wday = WeekdayImputer(variable='weekday')
# YOUR CODE HERE
train = wday.fit_transform(train)
test = wday.transform(test)
train['weekday'].unique(), test['weekday'].unique()

(array(['Sat', 'Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri'], dtype=object),
 array(['Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun', 'Mon'], dtype=object))

#### Build another custom Imputation class compatible with Sklearn for handling missing values in `weathersit` column.

- Fill in the missing rows in this column with the most frequent category
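
As a minimal sketch of this idea, the fill value can be learned from the training data and then reused unchanged on the test data, which mirrors the `fit`/`transform` split of a transformer. The two small Series below are made up for illustration.

```python
import pandas as pd

# Made-up category values with gaps.
train_col = pd.Series(["Clear", "Mist", "Clear", None])
test_col = pd.Series([None, "Mist"])

# "fit": learn the most frequent category from the training data only.
fill_value = train_col.mode()[0]

# "transform": apply the same learned value to both splits.
train_filled = train_col.fillna(fill_value)
test_filled = test_col.fillna(fill_value)
print(fill_value, test_filled.tolist())
```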

In [9]:
class WeathersitImputer(BaseEstimator, TransformerMixin):
    """ Impute missing values in 'weathersit' column by replacing them with the most frequent category value """

    def __init__(self, variable: str):
        # YOUR CODE HERE
        self.variable = variable

    def fit(self,X: pd.DataFrame, y: pd.Series = None):
        # YOUR CODE HERE
        self.fill = X[self.variable].mode()[0]
        return self

    def transform(self,X: pd.DataFrame) -> pd.DataFrame:
        # YOUR CODE HERE
        X = X.copy()
        X[self.variable] = X[self.variable].fillna(self.fill)
        return X

In [None]:
# Apply weathersit imputer
wesit = WeathersitImputer(variable='weathersit')
# YOUR CODE HERE
train = wesit.fit_transform(train)
test = wesit.transform(test)
train['weathersit'].isnull().sum(), test['weathersit'].isnull().sum()

(0, 0)

In [None]:
train.head()

Unnamed: 0,dteday,season,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt,Year,Month
0,2011-01-01,spring,7pm,No,Sat,No,Light Rain,11.74,11.9972,88.0,16.9979,6,31,37,2011,January
1,2011-01-01,spring,10am,No,Sat,No,Clear,9.86,9.9974,76.0,16.9979,12,24,36,2011,January
2,2011-01-01,spring,8am,No,Sat,No,Clear,3.28,3.0014,75.0,0.0,1,7,8,2011,January
3,2011-01-01,spring,4pm,No,Sat,No,Mist,11.74,11.9972,82.0,19.9995,41,52,93,2011,January
4,2011-01-01,spring,9pm,No,Sat,No,Mist,10.8,11.0006,87.0,12.998,3,31,34,2011,January


### **B. Mapping**

#### Build a Mapper class for mapping `Year`, `Month`, `season`, `weathersit`, `holiday`, `workingday`, and `hr` columns.

In [None]:
for i in cat+['Year']:
  print(f'{i}: {df[i].unique()}')

season: ['spring' 'summer' 'fall' 'winter']
hr: ['7pm' '10am' '8am' '4pm' '9pm' '9am' '12pm' '6am' '1pm' '2am' '11am'
 '8pm' '7am' '3am' '3pm' '1am' '4am' '10pm' '2pm' '6pm' '5pm' '12am' '5am'
 '11pm']
holiday: ['No' 'Yes']
weekday: ['Sat' 'Sun' nan 'Mon' 'Tue' 'Wed' 'Thu' 'Fri']
workingday: ['No' 'Yes']
weathersit: ['Light Rain' 'Clear' 'Mist' nan 'Heavy Rain']
Month: ['January' 'February' 'March' 'April' 'May' 'June' 'July' 'August'
 'September' 'October' 'November' 'December']
Year: [2011 2012]


In [10]:
class Mapper(BaseEstimator, TransformerMixin):
    """
    Ordinal categorical variable mapper:
    Treat column as Ordinal categorical variable, and assign values accordingly
    """

    def __init__(self, variable: str, mappings: dict):
        # YOUR CODE HERE
        self.variable = variable
        self.mappings = mappings

    def fit(self, X: pd.DataFrame, y: pd.Series = None):
        # YOUR CODE HERE
        return self

    def transform(self,  X: pd.DataFrame) -> pd.DataFrame:
        # YOUR CODE HERE
        X = X.copy()
        X[self.variable] = X[self.variable].map(self.mappings).astype(int)
        return X

In [None]:
# Instantiate mapper for all ordinal categorical features
yr = Mapper('Year', {2011: 0, 2012: 1})
holiday = Mapper('holiday', {'No': 0, 'Yes': 1})
work_day = Mapper('workingday', {'No': 0, 'Yes': 1})
season = Mapper('season', {'winter': 0, 'fall': 1, 'spring': 2, 'summer': 3})
weathersit = Mapper('weathersit', {'Mist': 0, 'Clear': 1, 'Light Rain': 2, 'Heavy Rain': 3})
month = Mapper('Month', {'January': 0, 'February': 1, 'March': 2, 'April': 3, 'May': 4, 'June': 5,
                         'July': 6, 'August': 7, 'September': 8, 'October': 9, 'November': 10, 'December': 11})
hr = Mapper('hr', {'12am': 0, '1am': 1, '2am': 2, '3am': 3, '4am': 4, '5am': 5, '6am': 6, '7am': 7,
                   '8am': 8, '9am': 9, '10am': 10, '11am': 11, '12pm': 12, '1pm': 13, '2pm': 14, '3pm': 15,
                   '4pm': 16, '5pm': 17, '6pm': 18, '7pm': 19, '8pm': 20, '9pm': 21, '10pm': 22, '11pm': 23})
# YOUR CODE HERE
for mapper in [holiday, work_day, season, weathersit, month, hr, yr]:
  train = mapper.fit(train).transform(train)
  test = mapper.transform(test)

In [None]:
train.head()

Unnamed: 0,dteday,season,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt,Year,Month
0,2011-01-01,2,19,0,Sat,0,2,11.74,11.9972,88.0,16.9979,6,31,37,0,0
1,2011-01-01,2,10,0,Sat,0,1,9.86,9.9974,76.0,16.9979,12,24,36,0,0
2,2011-01-01,2,8,0,Sat,0,1,3.28,3.0014,75.0,0.0,1,7,8,0,0
3,2011-01-01,2,16,0,Sat,0,0,11.74,11.9972,82.0,19.9995,41,52,93,0,0
4,2011-01-01,2,21,0,Sat,0,0,10.8,11.0006,87.0,12.998,3,31,34,0,0


### **C. Class for Specific operation**

#### Build a Class for handling outliers in numerical columns

- Instead of removing the outliers, change their values
    - to upper-bound, if the value is higher than upper-bound, or
    - to lower-bound, if the value is lower than lower-bound respectively.
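
The text above does not say how the bounds are chosen. The sketch below assumes the common 1.5 × IQR rule, which is also what the `OutlierHandler` class in this notebook computes, and applies it to a made-up array.

```python
import numpy as np

# Made-up values with one obvious outlier.
values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Quartiles and interquartile range.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Assumed rule: bounds at 1.5 * IQR beyond the quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Cap values outside the bounds instead of dropping the rows.
capped = np.clip(values, lower, upper)
print(lower, upper, capped)
```

Here the quartiles are 2 and 4, so the bounds are -1 and 7, and the value 100 is replaced by 7 while the other values stay unchanged.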

In [11]:
class OutlierHandler(BaseEstimator, TransformerMixin):
    """
    Change the outlier values:
        - to upper-bound, if the value is higher than upper-bound, or
        - to lower-bound, if the value is lower than lower-bound respectively.
    """

    def __init__(self, variable: str):
      # YOUR CODE HERE
      if not isinstance(variable,str):
        raise ValueError('variable should be a str')
      self.variable = variable

    def fit(self,X: pd.DataFrame, y=None):
      # YOUR CODE HERE
      q1 = np.percentile(X[self.variable],25)
      q3 = np.percentile(X[self.variable],75)

      iqr = q3-q1

      self.lb = q1 - 1.5 * iqr
      self.ub = q3 + 1.5 * iqr
      return self


    def transform(self, X) -> pd.DataFrame:
      # YOUR CODE HERE
      X = X.copy() # so that we do not over-write the original dataframe
      X.loc[X[self.variable]<self.lb,self.variable] = self.lb
      X.loc[X[self.variable]>self.ub,self.variable] = self.ub
      return X

In [None]:
# Instantiate outlier handler for all numerical features
# YOUR CODE HERE
temp = OutlierHandler('temp')
atemp = OutlierHandler('atemp')
hum = OutlierHandler('hum')
windspeed = OutlierHandler('windspeed')

In [None]:
# Handle outliers for all numerical columns
# YOUR CODE HERE
for i in [temp,atemp,hum,windspeed]:
  train = i.fit_transform(train)
  test = i.transform(test)

#### Build a Class to One-hot Encode `weekday` column

In [12]:
class WeekdayOneHotEncoder(BaseEstimator, TransformerMixin):
    """ One-hot encode weekday column """

    def __init__(self, variable: str):
      # YOUR CODE HERE
      if not isinstance(variable,str):
        raise ValueError('variable should be a str')
      self.variable = variable

    def fit(self, X: pd.DataFrame, y: pd.Series = None):
        # YOUR CODE HERE
        self.ohe = OneHotEncoder(dtype=int,drop='first')
        self.ohe.fit(X[self.variable].values.reshape(-1,1))
        self.col = [self.variable + '_' + x[3:] for x  in self.ohe.get_feature_names_out()]
        return self

    def transform(self, X: pd.DataFrame, y: pd.Series = None) -> pd.DataFrame:
      # YOUR CODE HERE
      X = X.copy()
      encoded = self.ohe.transform(X[self.variable].values.reshape(-1,1)).toarray()
      # Reuse X's index so that concat aligns rows even when the index is not 0..n-1
      out = pd.DataFrame(encoded, columns=self.col, index=X.index)
      X = pd.concat([X,out],axis=1)
      X = X.drop([self.variable],axis=1)
      return X

In [None]:
# Treat 'weekday' column as a Categorical variable, perform one-hot encoding
# YOUR CODE HERE
weekday_encoder = WeekdayOneHotEncoder('weekday')
train = weekday_encoder.fit_transform(train)
test = weekday_encoder.transform(test)
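
When a transformer appends encoded columns with `pd.concat`, row alignment depends on the index of both frames. The sketch below uses a made-up two-row frame (the column name `weekday_Tue` is illustrative) to show what happens when the encoded block carries a default index, and how passing `index=X.index` avoids it.

```python
import pandas as pd

# Frame whose index is not 0..n-1, for example after filtering rows.
X = pd.DataFrame({"weekday": ["Mon", "Tue"]}, index=[5, 9])

# An encoded block built without an index gets a fresh RangeIndex (0, 1),
# so concat produces the union of both indexes with NaN gaps.
encoded_default = pd.DataFrame({"weekday_Tue": [0, 1]})
misaligned = pd.concat([X, encoded_default], axis=1)

# Reusing X.index keeps the encoded rows lined up with the original rows.
encoded_aligned = pd.DataFrame({"weekday_Tue": [0, 1]}, index=X.index)
aligned = pd.concat([X, encoded_aligned], axis=1)

print(misaligned.shape, aligned.shape)
```

In this notebook the train and test frames are reset to a 0..n-1 index before encoding, so the issue only shows up if the transformer is reused on a frame with a different index.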

### **D. Scaling**

In [None]:
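# 'dteday' may already have been dropped by WeekdayImputer, hence errors='ignore'
train_X, train_y = train.drop(['dteday','cnt'],axis=1,errors='ignore'), train.cnt.values
test_X, test_y = test.drop(['dteday','cnt'],axis=1,errors='ignore'), test.cnt.values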

In [None]:
scaler = StandardScaler()
train_X = scaler.fit_transform(train_X)
test_X = scaler.transform(test_X)

## 2.X Modelling

In [14]:
from sklearn.linear_model import SGDRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import  DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
import lightgbm as lgb
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from sklearn.ensemble import StackingRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import mean_squared_error

### Linear Model

In [None]:
grid = {'alpha':[10**i for i in range(-5,4)],'penalty':['l2', 'l1']}

reg_model = SGDRegressor(random_state=3,loss='squared_error')
search = RandomizedSearchCV(reg_model,grid,cv=3,verbose=2,n_jobs=-2,random_state=3,scoring = 'neg_root_mean_squared_error')
result = search.fit(train_X, train_y)

print(result.best_score_)
print(result.best_params_)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV] END ................................alpha=1, penalty=l1; total time=   0.1s
[CV] END ................................alpha=1, penalty=l1; total time=   0.2s
[CV] END ................................alpha=1, penalty=l1; total time=   0.1s
[CV] END ...........................alpha=0.0001, penalty=l2; total time=   0.1s
[CV] END ...........................alpha=0.0001, penalty=l2; total time=   0.3s
[CV] END ...........................alpha=0.0001, penalty=l2; total time=   0.1s
[CV] END ............................alpha=1e-05, penalty=l1; total time=   0.2s
[CV] END ............................alpha=1e-05, penalty=l1; total time=   0.3s
[CV] END ............................alpha=1e-05, penalty=l1; total time=   0.1s
[CV] END ..............................alpha=100, penalty=l2; total time=   0.2s
[CV] END ..............................alpha=100, penalty=l2; total time=   0.1s
[CV] END ..............................alpha=100

In [None]:
reg = SGDRegressor(penalty='l1',alpha=0.01)
reg.fit(train_X, train_y)

pred_train = reg.predict(train_X)
pred_test = reg.predict(test_X)

print(f"Train RMSE: {np.sqrt(mean_squared_error(train_y,pred_train))}")
print(f"Test RMSE: {np.sqrt(mean_squared_error(test_y,pred_test))}")

Train RMSE: 0.01372854680854058
Test RMSE: 0.017987605854684313


### Decision Tree

In [None]:
grid = {"splitter":["best","random"],
        "max_depth" : [1,3,5,7,9,11,12],
        "min_samples_leaf":[1,2,3,4,5,6,7,8,9,10],
        "min_weight_fraction_leaf":[0.1,0.2,0.3,0.4,0.5],
        "max_features":["auto","log2","sqrt"],
        "max_leaf_nodes":[None,10,20,30,40,50,60,70,80,90]
       }

dtr = DecisionTreeRegressor()

search = RandomizedSearchCV(dtr,grid,cv=3,verbose=1,n_jobs=-2,random_state=3,scoring='neg_root_mean_squared_error')
result = search.fit(train_X, train_y)

print(result.best_score_)
print(result.best_params_)

Fitting 3 folds for each of 10 candidates, totalling 30 fits




-59.98830092745584
{'splitter': 'best', 'min_weight_fraction_leaf': 0.1, 'min_samples_leaf': 1, 'max_leaf_nodes': 40, 'max_features': 'auto', 'max_depth': 5}




In [None]:
reg = DecisionTreeRegressor(splitter='best',min_weight_fraction_leaf=0.1,min_samples_leaf=1,max_leaf_nodes=40,
                            max_features='auto',max_depth=5)
reg.fit(train_X, train_y)

pred_train = reg.predict(train_X)
pred_test = reg.predict(test_X)

print(f"Train RMSE: {np.sqrt(mean_squared_error(train_y,pred_train))}")
print(f"Test RMSE: {np.sqrt(mean_squared_error(test_y,pred_test))}")

Train RMSE: 56.00014237389092
Test RMSE: 86.45953053046136




### Random Forest

In [None]:
grid = {'n_estimators':[100,200,300,500,700,800],
        'max_depth':[2,3,5,7,8,9,10],
        'min_samples_split':[2,3,5,7],
        'min_samples_leaf':[1,2,3,4]
       }

rfr = RandomForestRegressor()
search = RandomizedSearchCV(rfr,grid,cv=3,verbose=1,n_jobs=-2,random_state=3,scoring='neg_root_mean_squared_error')

result = search.fit(train_X, train_y)

print(result.best_score_)
print(result.best_params_)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
-14.870425332189106
{'n_estimators': 500, 'min_samples_split': 5, 'min_samples_leaf': 2, 'max_depth': 10}


In [None]:
reg = RandomForestRegressor(n_estimators=500,min_samples_split=5,min_samples_leaf=2,max_depth=10,n_jobs=-1)
reg.fit(train_X, train_y)

pred_train = reg.predict(train_X)
pred_test = reg.predict(test_X)

print(f"Train RMSE: {np.sqrt(mean_squared_error(train_y,pred_train))}")
print(f"Test RMSE: {np.sqrt(mean_squared_error(test_y,pred_test))}")

Train RMSE: 1.7700859925162753
Test RMSE: 8.640018219714236


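## Best Result: Linear Model (`SGDRegressor`)

The linear model's near-zero RMSE is a symptom of target leakage rather than of model quality: `casual` and `registered` are still among the features, and `cnt` is exactly their sum. Drop both columns before training to obtain a realistic comparison between the models.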
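
The effect of keeping `casual` and `registered` among the features, when the target `cnt` is their sum, can be reproduced on synthetic data. The sketch below uses randomly generated columns (not the bike-sharing data) and plain `LinearRegression` as a stand-in for the linear model; the `temp` column is deliberately unrelated to the target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic stand-ins for the dataset columns.
rng = np.random.default_rng(0)
casual = rng.integers(0, 50, size=200)
registered = rng.integers(0, 300, size=200)
temp = rng.normal(15, 8, size=200)  # unrelated to the target by construction
cnt = casual + registered           # target is the exact sum of two columns

# Feature set that leaks the target vs. one that does not.
X_leaky = np.column_stack([casual, registered, temp])
X_clean = temp.reshape(-1, 1)

r2_leaky = r2_score(cnt, LinearRegression().fit(X_leaky, cnt).predict(X_leaky))
r2_clean = r2_score(cnt, LinearRegression().fit(X_clean, cnt).predict(X_clean))
print(r2_leaky, r2_clean)
```

With the two component columns present, the linear model recovers the target almost exactly, which is the same pattern as the near-perfect scores reported in this notebook.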

## **3. Build Pipeline**

Build a pipeline and implement all the above class transformers inside the pipeline along with the regressor.

In [42]:
# YOUR CODE HERE
pipe = Pipeline([

    ##==========Extract Month and Year======##
    ('dteday', year_month(variable='dteday')),

    ##==========Imputer======##
    ('weekday', WeekdayImputer(variable='weekday')),
    ('weathersit', WeathersitImputer(variable='weathersit')),

    ##==========Mapper======##
    ('map_yr', Mapper('Year',{2011: 0, 2012: 1})),
    ('map_holiday', Mapper('holiday',{'No':0,'Yes':1})),
    ('map_workingday', Mapper('workingday',{'No':0,'Yes':1})),
    ('map_season', Mapper('season',{'winter':0,'fall':1,'spring':2,'summer':3})),
    ('map_weathersit', Mapper('weathersit',{'Mist':0,'Clear':1,'Light Rain':2,'Heavy Rain':3})),
    ('map_month', Mapper('Month',{'January':0,'February':1,'March':2,'April':3,'May': 4,'June': 5,'July': 6,'August': 7,'September': 8,'October': 9,\
                                 'November': 10,'December': 11})),
    ('map_hr', Mapper('hr',{'12am':0,'1am':1,'2am':2,'3am':3,'4am':4,'5am':5,'6am':6,'7am':7,'8am':8,'9am':9,'10am':10,'11am':11,\
                 '12pm':12,'1pm':13,'2pm':14,'3pm':15,'4pm':16,'5pm':17,'6pm':18,'7pm':19,'8pm':20,'9pm':21,'10pm':22,'11pm':23,})),

    ##=========Outlier======##
    ('out_temp',OutlierHandler(variable='temp')),
    ('out_atemp',OutlierHandler(variable='atemp')),
    ('out_hum',OutlierHandler(variable='hum')),
    ('out_windspeed',OutlierHandler(variable='windspeed')),

    ##=========OneHot======##
    ('onehot_weekday',WeekdayOneHotEncoder(variable='weekday')),


    ##=========Scaling======##
    ('scaler', StandardScaler()),

    ##=========Model======##
    ('model', SGDRegressor(penalty='l1',alpha=0.01))

])

## **4. Fit Pipeline**

- Separate the target from the predictor features
- Split the data into train and test sets
- Fit the pipeline on the train set
- Get predictions on the test set
- Calculate the MSE and R² score

In [40]:
# YOUR CODE HERE
train_X, train_y = train.drop(['cnt'],axis=1), train.cnt.values
test_X, test_y = test.drop(['cnt'],axis=1), test.cnt.values

In [48]:
pipe.fit(train_X, train_y)
y_pred = pipe.predict(test_X)
print(f"Test RMSE: {np.sqrt(mean_squared_error(test_y,y_pred))}")
print(f"Test MSE: {mean_squared_error(test_y,y_pred)}")
print(f"Test R2: {r2_score(test_y,y_pred)}")

Test RMSE: 0.015047992862263099
Test MSE: 0.0002264420891827212
Test R2: 0.999999995344593
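
The learning objectives also mention saving the pipeline and predicting with the saved copy. A minimal sketch with `joblib` follows; it uses a small stand-in pipeline on synthetic data so that it runs on its own, and the file name `bike_rental_pipeline.joblib` is an arbitrary choice. In the notebook, the fitted `pipe` would take the place of `demo_pipe`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in pipeline fitted on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5])

demo_pipe = Pipeline([("scaler", StandardScaler()), ("model", LinearRegression())])
demo_pipe.fit(X, y)

# Save the fitted pipeline to disk, then load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "bike_rental_pipeline.joblib")
    joblib.dump(demo_pipe, model_path)
    loaded_pipe = joblib.load(model_path)

# The loaded copy should reproduce the original predictions.
same_predictions = np.allclose(demo_pipe.predict(X), loaded_pipe.predict(X))
print(same_predictions)
```

Because `joblib` relies on pickling, a pipeline that contains custom classes such as `WeekdayImputer` can only be loaded where those class definitions are importable, which is one reason to move them into modules.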


### Check package versions for the requirements.txt file

In [45]:
!pip -qq install pydantic
!pip -qq install strictyaml
!pip -qq install ruamel.yaml

In [46]:
import numpy as np
import pandas as pd
import sklearn
import pydantic
import strictyaml
import ruamel.yaml
import joblib

In [47]:
# YOUR CODE HERE
!pip freeze > requirements.txt

## **5. Modularize the application**

- Convert the above regression application into a production-ready format (`.py` files) in VS Code.

- Create different modules specific to functionality:
    - requirements
    - configuration
    - data manager
    - feature engineering
    - pipeline building
    - pipeline training
    - predict
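
As one possible shape for the `predict` module, the sketch below wraps a fitted pipeline in a single function. The function name `make_prediction`, its return format, and the stand-in pipeline are hypothetical and only illustrate the idea; the real module would load the saved pipeline from disk instead.

```python
from typing import Union

import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.pipeline import Pipeline


def make_prediction(pipeline: Pipeline, input_data: Union[pd.DataFrame, dict]) -> dict:
    """Hypothetical predict-module helper: run a fitted pipeline on new rows."""
    frame = pd.DataFrame(input_data)
    predictions = pipeline.predict(frame)
    return {"predictions": predictions.tolist()}


# Stand-in pipeline so the sketch runs on its own: it always predicts
# the mean of the training targets (1.5 here).
stand_in = Pipeline([("model", DummyRegressor(strategy="mean"))])
stand_in.fit(pd.DataFrame({"temp": [10.0, 20.0]}), [1.0, 2.0])

result = make_prediction(stand_in, {"temp": [15.0]})
print(result)
```

Keeping prediction behind one function means callers pass raw rows and never touch the individual preprocessing steps.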
