<a href="https://colab.research.google.com/github/nitish6121999/Capstone-project-Bike-Sharing-Demand-prediction-project/blob/main/SUPERVISED_ML_BIKE_SHARING__PREDICTION.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Description**   



##### **Project Type**    - Supervised Machine Learning Regression
##### **Project Title -** - Bike Sharing Demand Prediction
##### **Contribution**    - Individual
##### **Created by -**    - Nitish N Naik


# **Project Summary -**

1. The goal of this project is to combine the historical bike usage patterns with
the weather data in order to forecast bike rental demand.

2. The dataset contains weather information (Temperature, Humidity, Windspeed,
Visibility, Dewpoint, Solar radiation, Snowfall, Rainfall), the number of bikes rented per hour and date information.




# **GitHub Link -**

https://github.com/nitish6121999/Capstone-project-Bike-Sharing-Demand-prediction-project



# **Problem Statement**


Bike sharing systems have become increasingly popular in urban areas, providing an efficient and eco-friendly mode of transportation. To ensure optimal bike availability and meet customer demand, bike sharing companies **need accurate predictions** of bike rental demand at different locations and times.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns



from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error,mean_absolute_error,r2_score
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.linear_model import ridge_regression
from sklearn.model_selection import GridSearchCV

from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import OneHotEncoder

import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

### Dataset First View

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
path= '/content/drive/MyDrive/CAPSTONE PROJECTS/Project supervised ML :Bike sharing prediction/dataset for project/SeoulBikeData.csv'

# dayfirst=True: the Date column is stored day-first (DD/MM/YYYY)
bike_df=pd.read_csv(path, encoding='unicode_escape', parse_dates=[0], dayfirst=True)

In [None]:
# Dataset First Look
bike_df


### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
bike_df.shape

### Dataset Information

In [None]:
# Dataset Info
bike_df.info()

Numerical features :- Rented Bike Count, Hour, Temperature, Humidity, Wind speed, Visibility, Dew point temperature, Solar radiation, Rainfall and Snowfall.

Categorical features :- Seasons, Holiday, Functioning Day

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

bike_df.duplicated().sum()


**There are no duplicated values present.**

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
bike_df.isnull().sum()

**There are no missing values present in the dataset**

In [None]:
# Visualizing the missing values
plt.figure(figsize=(20,5))
sns.heatmap(bike_df.isnull(), cbar=False, yticklabels=False, cmap='coolwarm')
plt.xlabel('name of the columns')
plt.title('location of missing values')

**The heatmap above shows no missing values.**

### What did you know about your dataset?

The dataset has mostly numerical columns and three categorical columns, and it has no duplicate values or missing values.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
bike_df.columns

In [None]:
# Dataset Describe
bike_df.describe(include='all').T

### Variables Description


🔶: Date :- year-month-day

🔶: Rented Bike count :- Count of bikes rented at each hour

🔶: Hour :- Hour of the day (0 to 23)

🔶: Temperature :-Temperature of the day in degree celsius

🔶: Humidity :- Humidity measurement in %

🔶: Windspeed :- windspeed in m/s

🔶: Visibility :- Visibility measurement in units of 10 m

🔶: Dew point temperature :- Dew point measurement in degree Celsius

🔶: Solar radiation :- Solar radiation measurement in MJ/m2 (i.e. megajoules per square meter)

🔶: Rainfall :- Rainfall measurement in mm

🔶: Snowfall :- Snowfall measurement in cm

🔶: Seasons :- Winter, Spring, Summer, Fall or Autumn

🔶: Holiday :- Holiday/No holiday

🔶: Functional Day :- No Func(Non Functional Hours), Fun(Functional hours)





### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
bike_df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
bike_df['Year']=bike_df['Date'].map(lambda x: x.year).astype('object')
bike_df['Month']=bike_df['Date'].dt.month_name()
bike_df['Day']=bike_df['Date'].dt.day_name()


In [None]:
bike_df.head()


**The Date column has been split into separate Year, Month and Day columns,**
**so it is no longer needed in the dataframe and we drop it.**


In [None]:
bike_df.drop(columns="Date", inplace=True)

In [None]:
#Changing "Hour" column -->'int' dtype into 'object' dtype

bike_df['Hour']=bike_df['Hour'].astype(object)

In [None]:
bike_df.info()

### What all manipulations have you done and insights you found?

Number of Numerical Features = 9

Number of Categorical Features = 7.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart 1: PAIRPLOT (for how the features are related with each other)

In [None]:
# Chart - 1 visualization code
sns.pairplot(bike_df, diag_kind= 'kde')

##### 1. Why did you pick the specific chart?

A pair plot shows how each feature is distributed and how the features relate to each other across the dataset.

**Storing Numerical and Categorical columns separately**

In [None]:
numeric_col= bike_df.describe().columns

categorical_col= bike_df.describe(include=['object','category']).columns

In [None]:
numeric_col

In [None]:
len(numeric_col)

In [None]:
categorical_col

In [None]:
len(categorical_col)

# **Chart - Analysis of Categorical columns**



In [None]:
# Charts

for col in categorical_col:
  counts = bike_df[col].value_counts().sort_index()
  fig = plt.figure(figsize=(10,8))
  ax = fig.gca()
  counts.plot.bar(ax=ax,color='orange')
  ax.set_title(col+'  counts distribution')
  ax.set_xlabel(col)
  ax.set_ylabel('frequency')

plt.show()

##### 1. Why did you pick the specific chart?

A column chart can easily explain all the categorical columns in one go.

##### 2. What is/are the insight(s) found from the chart?

From the above graphs

1.  The data is evenly distributed across all four seasons.

2.  In the Holiday column, most of the data comes from non-holidays.

3.  In the Functioning Day column, most of the data comes from functioning days.

4.  Most of the data is from the year 2018.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

According to the yearly data, 2018 was the opening year for bike demand, which suggests an upward trend in the coming years.

Bikes are in demand on functioning days and non-holidays, so this analysis can help the business respond to bike shortages at particular locations and times.

# **Chart - Understanding how the categorical columns are related with the dependent variable on the hourly basis.**

In [None]:
for col in categorical_col:
  if col == "Hour":
    pass

  else:
    plt.figure(figsize=(16,7))
    sns.pointplot(x=bike_df['Hour'], y=bike_df['Rented Bike Count'], hue=bike_df[col])
    plt.title(f'rented bike count for {col} wrt Hours')

  plt.show()


##### 1. Why did you pick the specific chart?

This chart can explain how the different columns relate to the dependent variable across the hours of the day.

##### 2. What is/are the insight(s) found from the chart?

Analysis of the rented bike count for the different columns with respect to time:


*   For the Seasons column, there is a clear difference in the rented bike count between summer (higher count) and winter (lower count).

*   The rented bike count is higher on non-holidays and functioning days.

*   Bike usage is higher in 2018 than in 2017, possibly because more people were aware of the rental service in 2018 than in 2017.

*   In the Month column, usage in November, December and January is much lower than in the rest of the months, and usage is highest in the summer months.


*   In the Day column, only the weekend days show a lower line; the remaining days are quite similar, as those are working days.

*   Across all the columns, the 8th and 18th hours are the peak times for bike usage because of office working hours.







     

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

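The insights above support a positive business impact: knowing the peak hours (8 and 18) and the high-demand seasons helps plan bike availability, while the low winter and weekend usage points to periods where demand falls and needs attention.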

#### Chart -4 Feature Engineering on Hour column

The Hour column contains values from 0 to 23. Grouping these hours into sessions (morning, afternoon, evening, night) makes it easier to analyze when the bike count is highest and lowest.

In [None]:
def hour(h):

  if h >= 7 and h <=10:
    return 'Morning'

  elif h>=11 and h<=16:
    return 'Afternoon'

  elif h>=17 and h<=22:
    return 'Evening'

  else:
    return 'Night'

In [None]:
bike_df['Hour']=bike_df['Hour'].apply(hour)

In [None]:
bike_df['Hour'].value_counts()


Checking whether outliers are present in the categorical features with respect to the dependent variable.

# **Chart  : To detect the outliers in the dataset**

In [None]:
# Chart - 5 visualization code

# One box plot per categorical column, each in its own figure,
# to inspect outliers in the rented bike count
for col in categorical_col:
  print('\n')
  print('='*70,col,'='*70)
  print('\n')
  plt.figure(figsize=(15,5))
  sns.boxplot(x=bike_df[col],y=bike_df["Rented Bike Count"])
  plt.tight_layout()
  plt.show()

The box plots above show some outliers, which could be removed with the IQR method. For the Seoul dataset we keep them, because these outliers do not cause much skewness in the data.
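For reference, the IQR rule mentioned above can be sketched as follows. This is a minimal illustration on made-up numbers, not the notebook's data; the helper name `iqr_bounds` and the conventional multiplier of 1.5 are assumptions.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical hourly rental counts with one extreme value
counts = np.array([120, 150, 180, 200, 220, 250, 300, 2500])
lower, upper = iqr_bounds(counts)

# Flag every value that falls outside the fences
outlier_mask = (counts < lower) | (counts > upper)
print("fences:", lower, upper)
print("flagged:", counts[outlier_mask])
```

Rows flagged this way could be dropped or capped; the notebook keeps them.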

##### 1. Why did you pick the specific chart?

To detect outliers in the dataset.

##### 2. What is/are the insight(s) found from the chart?

Some of the boxes in the columns have outliers.

# Categorizing weekends and weekdays

In [None]:
# day_name() returns capitalized names; a day is a weekend if it is Saturday OR Sunday
bike_df['week']=bike_df['Day'].apply(lambda x: 'weekend' if x in ('Saturday','Sunday') else 'weekdays')
bike_df.drop(columns=['Day'], inplace=True)
bike_df.head()


In [None]:
bike_df.info()

In [None]:
bike_df.shape

In [None]:
categorical_col

In [None]:
categorical_col=categorical_col.drop('Day')

In [None]:
categorical_col


# **Chart : Data distribution Percentage**

In [None]:
# Chart - 6 visualization code
n=1
plt.figure(figsize=(15,15))
for col in categorical_col:
  plt.subplot(4,2,n)
  n=n+1
  plt.pie(bike_df[col].value_counts(),labels = bike_df[col].value_counts().keys().tolist(),autopct='%.0f%%')
  plt.title(col)
  plt.tight_layout()


##### 1. Why did you pick the specific chart?

A pie chart displays the distribution of the data as percentages.

##### 2. What is/are the insight(s) found from the chart?

The data is well distributed across the categories, with no noisy distribution.

## **Till here we have understood how Categorical columns data is distributed and the relation with dependent variable.**

# Chart  : ANALYSE NUMERICAL COLUMNS

In [None]:
numeric_col

In [None]:
# Chart - 7 visualization code

for col in numeric_col:
    fig = plt.figure(figsize=(7, 6))
    ax = fig.gca()
    feature = bike_df[col]
    sns.histplot(data=bike_df,x=col ,ax = ax,color='Orange', kde=True)
    plt.title(col + '  Distribution')

plt.show()

##### 1. Why did you pick the specific chart?

Histograms with a KDE curve show how each numerical feature is distributed. From the above graphs we can see that many attributes are positively or negatively skewed.

Skewed feature distributions tend to give poorer results and make the model harder to understand.

To get better results, we need to bring the data closer to a normal distribution by using transformations.

##### 2. What is/are the insight(s) found from the chart?

These methods can deal with skewness in the data:

**square root for moderate skew** :- sqrt(x) for positively skewed data, sqrt(max(x+1) - x) for negatively skewed data

**log for greater skew** :- log10(x) for positively skewed data, log10(max(x+1) - x) for negatively skewed data

**inverse for severe skew** :- 1/x for positively skewed data, 1/(max(x+1) - x) for negatively skewed data

**Linearity and heteroscedasticity** :- First try a log transformation when the dependent variable increases more rapidly as the independent variable increases. If the data does the opposite (the dependent variable decreases more rapidly as the independent variable increases), first consider a square transformation.

The above methods can give good outcomes, but a different transformation has to be chosen for each kind of skew to make the data normally distributed.

Instead of this long process, we can apply a ***power transformation*** to the features to bring them closer to a normal distribution, which also gives a better visualisation.
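The effect of these transformations on skewness can be checked numerically. The sketch below uses synthetic log-normal data rather than the notebook's features; the `skewness` helper is a hypothetical name, and `PowerTransformer` is run with its Yeo-Johnson method.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

def skewness(x):
    """Sample skewness: the third standardized moment."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=2000)   # synthetic, positively skewed

sqrt_x = np.sqrt(x)                                 # for moderate skew
log_x = np.log10(x)                                 # for greater skew
pt = PowerTransformer(method="yeo-johnson")         # learns the power from the data
pt_x = pt.fit_transform(x.reshape(-1, 1)).ravel()

for name, values in [("raw", x), ("sqrt", sqrt_x), ("log10", log_x), ("power", pt_x)]:
    print(f"{name:>6}: skewness = {skewness(values):.2f}")
```

The closer the printed skewness is to zero, the more symmetric the transformed feature.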

# **Chart  : RELATION BETWEEN THE NUMERICAL COLUMNS AND DEPENDENT VARIABLE**


In [None]:
# Chart - 8 visualization code
for col in numeric_col[1:]:    # since rented bike count column is not included in this for loop
  fig = plt.figure(figsize=(9,6))
  ax = fig.gca()
  feature = bike_df[col]
  correlation = feature.corr(bike_df['Rented Bike Count'])
  plt.scatter(x=feature, y= bike_df['Rented Bike Count'])
  plt.xlabel(col)
  plt.ylabel('Rented Bike Count')
  ax.set_title('Rented Bike Count vs '+col+ ' with the correlation value : '+str(correlation))
  z= np.polyfit(bike_df[col],bike_df['Rented Bike Count'],1)
  y_hat = np.poly1d(z)(bike_df[col])
  plt.plot(bike_df[col],y_hat,'r--',lw=1)

plt.show()

##### 1. Why did you pick the specific chart?

To understand the correlation of each numerical column with the dependent variable.

##### 2. What is/are the insight(s) found from the chart?

The scatter plots with fitted regression lines show that some numeric features have a positive correlation with the dependent variable and some have a negative correlation.

Correlation of features with the dependent variable:

positive correlation = Temperature, Wind speed, Visibility, Dew point temperature, Solar radiation.

negative correlation = Humidity, Rainfall, Snowfall
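The same sign information can be read off as a sorted table of Pearson correlations with the target. The sketch below uses a small synthetic frame; the column names and coefficients are illustrative assumptions, not values from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500
temperature = rng.normal(15, 10, n)     # synthetic stand-in values
humidity = rng.uniform(20, 95, n)
noise = rng.normal(0, 50, n)

demo = pd.DataFrame({
    "temperature": temperature,
    "humidity": humidity,
    # target rises with temperature and falls with humidity by construction
    "rented": 700 + 30 * temperature - 5 * humidity + noise,
})

# Pearson correlation of every feature with the target, most negative first
corr_with_target = demo.corr()["rented"].drop("rented").sort_values()
print(corr_with_target)
```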

# **Chart : Correlation map**

In [None]:
plt.figure(figsize=(10,6))
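# correlate numeric columns only; object columns cannot be correlated
sns.heatmap(bike_df.select_dtypes(include='number').corr().abs(), cmap='coolwarm', annot=True)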

##### 1. Why did you pick the specific chart?

The correlation heatmap shows at a glance which features are strongly related. From this graph we can see that there is multicollinearity between the Temperature(°C) and Dew point temperature(°C) columns.

We need to remove it because multicollinearity reduces the precision of the estimated coefficients, which weakens the statistical power of our regression model.

To detect and remove multicollinearity, we can use the VIF (variance inflation factor) method.

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
def calc_vif(X):

   vif = pd.DataFrame()
   vif["variables"] = X.columns
   vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

   return(vif)

In [None]:
calc_vif(bike_df[[i for i in bike_df.describe().columns if i not in ['Rented Bike Count','Dew point temperature(°C)']]])

In general, a VIF (variance inflation factor) above 10 indicates high multicollinearity and is cause for concern.

A VIF of 3 or less means the variable is only weakly correlated with the others, and multicollinearity is not a problem in the regression model.

In our data, with Dew point temperature excluded, the features are only weakly correlated with each other, so multicollinearity is not a major concern.
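To make the thresholds concrete, the VIF of a feature can be computed by hand as 1 / (1 - R²), where R² comes from regressing that feature on all the others. The sketch below is a hedged illustration on synthetic data; the `vif` helper is hypothetical, and it fits an intercept, so its values need not match the `statsmodels` call used above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing column j on the other columns."""
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(1)
temp = rng.normal(15, 10, 1000)             # synthetic temperature
dew = temp - 5 + rng.normal(0, 2, 1000)     # built to track temperature closely
wind = rng.normal(2, 1, 1000)               # unrelated feature
X = np.column_stack([temp, dew, wind])

for name, j in [("temp", 0), ("dew", 1), ("wind", 2)]:
    print(f"{name}: VIF = {vif(X, j):.1f}")
```

In this toy setup the two temperature-like columns inflate each other's VIF well past 10, while the unrelated column stays near 1.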



## ***6. Feature Engineering & Data Pre-processing***

Categorical features are encoded using one-hot encoding (get dummies).

Feature encoding is performed on = [ Hour, Seasons, Holiday, Functioning Day, Year, Month, Week ]

In [None]:
bike_df1=pd.get_dummies(bike_df, drop_first=True, sparse=True )

In [None]:
bike_df1.info()


In [None]:
bike_df1.head()

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

The dependent variable is transformed because it is positively skewed.

In [None]:
# Transform Your data
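# distplot is deprecated in recent seaborn; histplot with kde=True draws the same kind of chart
sns.histplot(bike_df1['Rented Bike Count']**2, kde=True, color='red').set_title('Rented Bike Count after square transformation')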

In [None]:
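# distplot is deprecated in recent seaborn; histplot with kde=True draws the same kind of chart
sns.histplot(np.sqrt(bike_df1['Rented Bike Count']), kde=True, color='red').set_title('Rented Bike Count after square root transformation')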

The square root transformation brings the positively skewed dependent variable close to a normal distribution.
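When a model is trained on the square root of the target, its predictions are also on the square-root scale and have to be squared before errors are reported in bike counts. The sketch below illustrates this on synthetic data; all names and numbers are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))                    # synthetic feature
y = (3 + 2 * X[:, 0] + rng.normal(0, 1, 300)) ** 2       # skewed, non-negative target

model = LinearRegression().fit(X, np.sqrt(y))            # train on the sqrt scale
pred_sqrt = model.predict(X)                             # predictions in sqrt units

mae_sqrt_scale = mean_absolute_error(np.sqrt(y), pred_sqrt)   # error in sqrt units
mae_count_scale = mean_absolute_error(y, pred_sqrt ** 2)      # error in original units
print("MAE on sqrt scale    :", mae_sqrt_scale)
print("MAE on original scale:", mae_count_scale)
```

The two errors are in different units, so metrics from models trained on different target scales should be compared only after back-transforming.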

***DIVIDING THE DATA INTO INDEPENDENT AND DEPENDENT VARIABLES***

DEPENDENT VARIABLE   Y   = 'RENTED BIKE COUNT '

INDEPENDENT VARIABLE X   = 'ALL THE VARIABLE EXCEPT RENTED BIKE COUNT'

In [None]:
x= bike_df1.drop(columns=['Rented Bike Count'])
x=x.drop(columns=['Dew point temperature(°C)'])

In [None]:
y=np.sqrt(bike_df1['Rented Bike Count'])

SPLITTING THE DATA INTO TRAIN AND TEST DATA

In [None]:
xtrain,xtest,ytrain,ytest=train_test_split(x,y,test_size=0.20,random_state=10)

In [None]:
xtrain.shape,xtest.shape,ytrain.shape,ytest.shape

## ***7. ML Model Implementation***

In [None]:

# Lists that collect the evaluation metrics of every model
mean_absolut_error = []
mean_sq_error=[]
root_mean_sq_error=[]
training_score =[]
r2_list=[]
adj_r2_list=[]
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score


def score_metrix (model,X_train,X_test,Y_train,Y_test):

  '''
    train the model and gives mae, mse,rmse,r2,adj r2 score of the model

  '''
  #training the model
  model.fit(X_train,Y_train)

  # Training Score
  training  = model.score(X_train,Y_train)
  print("Training score  =", training)

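  try:
    # report the best hyperparameters when the model is a fitted search object (e.g. GridSearchCV)
    print(f"The best parameters found out to be :{model.best_params_} \nwhere model best score is:  {model.best_score_} \n")
  except AttributeError:
    # plain estimators have no best_params_ / best_score_
    pass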


  #predicting the Test set and evaluting the models

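  # Linear models are trained on sqrt(y), so they are scored after squaring back to the original scale.
  # The check is by type: comparing with == against a fresh estimator instance is always False.
  base_model = getattr(model, 'estimator', model)   # unwrap GridSearchCV if present
  if isinstance(base_model, (LinearRegression, Lasso, Ridge)):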
    Y_pred = model.predict(X_test)

    #finding mean_absolute_error
    MAE  = mean_absolute_error(Y_test**2,Y_pred**2)
    print("MAE :" , MAE)

    #finding mean_squared_error
    MSE  = mean_squared_error(Y_test**2,Y_pred**2)
    print("MSE :" , MSE)

    #finding root mean squared error
    RMSE = np.sqrt(MSE)
    print("RMSE :" ,RMSE)

    #finding the r2 score

    r2 = r2_score(Y_test**2,Y_pred**2)
    print("R2 :" ,r2)
    #finding the adjusted r2 score
    adj_r2=1-(1-r2_score(Y_test**2,Y_pred**2))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
    print("Adjusted R2 : ",adj_r2,'\n')

  else:
    # for tree base models
    Y_pred = model.predict(X_test)

    #finding mean_absolute_error
    MAE  = mean_absolute_error(Y_test,Y_pred)
    print("MAE :" , MAE)

    #finding mean_squared_error
    MSE  = mean_squared_error(Y_test,Y_pred)
    print("MSE :" , MSE)

    #finding root mean squared error
    RMSE = np.sqrt(MSE)
    print("RMSE :" ,RMSE)

    #finding the r2 score

    r2 = r2_score(Y_test,Y_pred)
    print("R2 :" ,r2)
    #finding the adjusted r2 score
    adj_r2=1-(1-r2_score(Y_test,Y_pred))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
    print("Adjusted R2 : ",adj_r2,'\n')

    try:

      # ploting the graph of feature importance

      best = model.best_estimator_
      features = X_train.columns
      importances = best.feature_importances_
      indices = np.argsort(importances)
      plt.figure(figsize=(10,15))
      plt.title('Feature Importance')
      plt.barh(range(len(indices)), importances[indices], color='red', align='center')
      plt.yticks(range(len(indices)), [features[i] for i in indices])
      plt.xlabel('Relative Importance')
      plt.show()

    except:
      pass

  # Here we appending the parameters for all models
  mean_absolut_error.append(MAE)
  mean_sq_error.append(MSE)
  root_mean_sq_error.append(RMSE)
  training_score.append(training)
  r2_list.append(r2)
  adj_r2_list.append(adj_r2)

  print('*'*80)
  # print the cofficient and intercept of which model have these parameters and else we just pass them
  try :
    print("coefficient \n",model.coef_)
    print('\n')
    print("Intercept  = " ,model.intercept_)
  except:
    pass
  print('\n')
  print('*'*20, 'ploting the graph of Actual and predicted only with 80 observation', '*'*20)

  # ploting the graph of Actual and predicted only with 80 observation for better visualisation which model have these parameters and else we just pass them
  try:
    # ploting the line graph of actual and predicted values
    plt.figure(figsize=(15,7))
    plt.plot((Y_pred)[:80])
    plt.plot((np.array(Y_test)[:80]))
    plt.legend(["Predicted","Actual"])
    plt.show()
  except:
    pass

# MODEL 1: LINEAR REGRESSION MODEL

In [None]:
score_metrix (LinearRegression(),xtrain,xtest,ytrain,ytest)

The linear regression model above gives reasonably good results.

----> The training score (about 0.70) is close to the test R2 (about 0.71), so the model is not overfitting; if anything, it is too simple to capture all the patterns in the data.

----> We can try to improve the model by scaling or transforming the training data with a min-max scaler, standard scaler or power transformer.

# ML Model - 2 Linear regression with Powertransformer

Power transformations are very useful when we have to deal with skewed features and our model is sensitive to the symmetry of the distributions.

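The fit(data) method learns the transformation parameters from the given data (for PowerTransformer, the power parameter of each feature plus the mean and standard deviation used for standardization). The transform(data) method then applies the transformation using the parameters learned by fit(). Fitting on the training set only and reusing those parameters on the test set avoids leaking test information into training.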

In [None]:
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer()
xtrain_trans = pt.fit_transform(xtrain)      # fit transform the training set
xtest_trans = pt.transform(xtest)

In [None]:
score_metrix(LinearRegression(),xtrain_trans,xtest_trans,ytrain,ytest)

# ML Model - 3 Linear regression with Standardscaler

StandardScaler removes the mean and scales each feature/variable to unit variance. This operation is performed feature-wise in an independent way. StandardScaler can be influenced by outliers (if they exist in the dataset) since it involves the estimation of the empirical mean and standard deviation of each feature.
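As a hedged illustration on synthetic data, the sketch below checks two things: that StandardScaler matches a manual z-score, and that ordinary least squares predictions do not change when features are standardized. The second point is consistent with the identical scores the notebook reports for plain and standardized linear regression.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# three synthetic features on very different scales
X = rng.normal(loc=[10, 200, 0.5], scale=[5, 80, 0.1], size=(200, 3))
y = X @ np.array([1.5, -0.02, 4.0]) + rng.normal(0, 1, 200)

scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# Manual z-score with the population standard deviation (ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

pred_raw = LinearRegression().fit(X, y).predict(X)
pred_scaled = LinearRegression().fit(X_scaled, y).predict(X_scaled)

print("scaler matches manual z-score:", np.allclose(manual, X_scaled))
print("predictions unchanged by scaling:", np.allclose(pred_raw, pred_scaled))
```

Scaling still matters for penalized models such as Ridge and Lasso, where the penalty depends on the size of the coefficients.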

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model
from sklearn.preprocessing import StandardScaler
sl = StandardScaler()
xtrain_strans = sl.fit_transform(xtrain)      # fit transform the training set
xtest_strans = sl.transform(xtest)

In [None]:
score_metrix(LinearRegression(),xtrain_strans,xtest_strans,ytrain,ytest)

# Model 4 :Linear Regression with polynomial features

Polynomial regression is a form of linear regression in which the relationship between the independent variables X and the dependent variable Y is modeled as an nth-degree polynomial in X. Here degree 2 is used, so squared terms and pairwise interaction terms are added to the features.
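A degree-2 expansion grows the feature count quickly, and the adjusted R2 penalizes that growth. The sketch below shows both effects on toy numbers; the helper name `adjusted_r2` and all values are illustrative assumptions, with the formula taken from the scoring function in this notebook.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R2: 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

X = np.arange(12, dtype=float).reshape(4, 3)    # 4 rows, 3 hypothetical features
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

# A degree-2 expansion of p features yields C(p + 2, 2) columns, bias column included
print("columns after expansion:", X_poly.shape[1], "expected:", comb(3 + 2, 2))

# Same R2, more features: the adjusted R2 is lower (illustrative numbers only)
few = adjusted_r2(0.83, 1000, 10)
many = adjusted_r2(0.83, 1000, 400)
print("adjusted R2 with 10 features :", few)
print("adjusted R2 with 400 features:", many)
```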

In [None]:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(2)                               #creating variable with degree 2
poly_xtrain = poly.fit_transform(xtrain_trans)                # fit the train set
poly_xtest = poly.transform(xtest_trans)                    #transform the test set


In [None]:
score_metrix(LinearRegression(),poly_xtrain,poly_xtest,ytrain,ytest)

# **Regularization**

Regularization is one of the most important concepts in machine learning. It is a technique that prevents the model from overfitting by adding a penalty on the size of the coefficients to the loss function.

Two regularization techniques are **1) Lasso (L1 norm) and 2) Ridge regression (L2 norm)**
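The practical difference between the two penalties can be seen on a small synthetic problem: the L1 penalty tends to set uninformative coefficients exactly to zero, while the L2 penalty only shrinks them. The data and `alpha` values below are assumptions chosen for illustration, not tuned settings from this notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                              # 10 synthetic features
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, 200)     # only the first two matter

ridge = Ridge(alpha=1.0).fit(X, y)    # L2 penalty: shrinks coefficients
lasso = Lasso(alpha=0.5).fit(X, y)    # L1 penalty: can zero coefficients out

n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
print("ridge zero coefficients:", n_zero_ridge)
print("lasso zero coefficients:", n_zero_lasso)
print("lasso coefficients:", np.round(lasso.coef_, 2))
```

This sparsity is why Lasso is sometimes used as a rough feature selector, while Ridge is preferred when all features are expected to contribute.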

# Model 5: Ridge Regression

In [None]:
L2 = Ridge()
parameters = {'alpha': [1e-15,1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1,5,10,20,30,40,45,50,55,60,100,0.5,1.5,1.6,1.7,1.8,1.9]}      # giving parameters
L2_cv = GridSearchCV(L2, parameters, scoring='r2', cv=5)                                                                    #using gridsearchcv and cross validate the model
score_metrix(L2_cv,xtrain_trans,xtest_trans,ytrain,ytest)

# Model : Lasso regression

In [None]:
L1 = Lasso() #creating variable
parameters = {'alpha': [1e-15,1e-13,1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1e-1,1,5,10,20,30,40,45,50,55,60,100,0.0014]} #lasso parameters
lasso_cv = GridSearchCV(L1, parameters, cv=5) #using gridsearchcv and cross validate the model

In [None]:
score_metrix(lasso_cv,xtrain_trans,xtest_trans,ytrain,ytest)

# Model : RANDOM FOREST REGRESSION

In [None]:
new_x=bike_df1.drop(columns='Rented Bike Count')
new_y=bike_df1['Rented Bike Count']

In [None]:
new_xtrain, new_xtest, new_ytrain, new_ytest = train_test_split(new_x, new_y, test_size= 0.20, random_state = 10)


new_xtrain.shape,   new_xtest.shape,   new_ytrain.shape,  new_ytest.shape

In [None]:
from sklearn.ensemble import RandomForestRegressor


# parameters for random forest regression model

rf_param_grid ={"n_estimators":[50,100,150],                    # candidate values for each hyperparameter
              'max_depth' : [10,15,20,25,None],                 # None (not the string 'none') means no depth limit
              'min_samples_split': [10,50,100],
              'max_features' :[24,35,40,49]}


# Using grid search cv
Ranom_forest_Grid_search = GridSearchCV(RandomForestRegressor(),param_grid=rf_param_grid,n_jobs=-1,verbose=2)


In [None]:
score_metrix(Ranom_forest_Grid_search, new_xtrain, new_xtest, new_ytrain, new_ytest)

# Model : DECISION TREE REGRESSION

All columns are included here.

Multicollinearity between features does not affect the accuracy of decision tree models, so the multicollinear feature that was removed for the linear models ("Dew Point Temperature") is kept for the tree-based models.


In [None]:
bike_df1.shape

In [None]:
from sklearn.tree import DecisionTreeRegressor


# Parameters for Decision Tree model
param_grid = {'criterion' : ["squared_error"],
              'splitter' : ["best", "random"],
              'max_depth' : [10,15,25, None],                   # None (not the string 'none') means no depth limit
              'min_samples_split': [10,50,100],
              'max_features' :[24,35,40,49]}

# Gridsearch CV
Dt_grid_search = GridSearchCV(DecisionTreeRegressor(),param_grid=param_grid,cv=2,n_jobs=-1)

In [None]:
score_metrix(Dt_grid_search,new_xtrain,new_xtest,new_ytrain,new_ytest)

# **Conclusion**

# **EDA Insights :-**

1) Bikes are used more often on weekdays than on weekends.

2) 98% of the bikes are rented on non-holidays, which suggests that most users rent bikes to commute to their workplaces.

3) The most bikes are rented in the summer season and the fewest in the winter season.

4) Most bikes are rented when there is no snowfall or rainfall.

5) Demand peaks at the 8th and 18th hours.

6) The rental count rises gradually in the morning from 6 to 10 am, which is likely when employees travel to work. After 10 am there is a slight decrease, and the count rises again from 16 to 20 (4 pm to 8 pm), which is likely when employees leave work and rent bikes to go home.

7) The highest number of bike rentals occurs in the 18th hour (6 pm) and the lowest in the 4th hour (4 am).

8) Most bike rentals are made when visibility is high.

9) Demand for rented bikes increased in 2018 compared with 2017, possibly because fewer people were aware of the rental service in 2017.

# **ML MODEL INSIGHTS**

# Linear regression with original x train test and y train test

    Training score  = 0.7025535504935366
    MAE : 5.272386430820542
    MSE : 44.56236594756176
    RMSE : 6.675504920795262
    R2 : 0.7148244596355067
    Adjusted R2 :  0.710358253376898


# **Linear regression with powertransformer**

    Training score  = 0.7350893725883549
    MAE : 5.081760981637516
    MSE : 41.82562872623285
    RMSE : 6.467273670275045
    R2 : 0.7323381283856463
    Adjusted R2 :  0.7281462081225445

# **Linear regression with Standardscaler**

    Training score  = 0.7025535504935366
    MAE : 5.272386430820545
    MSE : 44.56236594756178
    RMSE : 6.675504920795264
    R2 : 0.7148244596355066
    Adjusted R2 :  0.710358253376898

# **Linear regression with Polynomial features**

    Training score  = 0.8409849039057424
    MAE : 3.7984606042854523
    MSE : 26.260078443291366
    RMSE : 5.124458843945511
    R2 : 0.8319494061672605
    Adjusted R2 :  0.7812218663188648

# **Ridge Regression**

The best parameters found out to be :{'alpha': 10}

where model best score is:  0.7321484459424913

    Training score  = 0.7350857300036486
    MAE : 5.081672261352577
    MSE : 41.81620442680659
    RMSE : 6.466545014674111
    R2 : 0.7323984389105542
    Adjusted R2 :  0.7282074631858355     

# **Lasso Regression**


The best parameters found out to be :{'alpha': 0.01}

where model best score is:  0.7321939783512486

    Training score  = 0.7350625551639408
    MAE : 5.079811518964977
    MSE : 41.802365832489144
    RMSE : 6.465474911596916
    R2 : 0.7324869985848095
    Adjusted R2 :  0.7282974098155461

# **Random Forest Regression**


The best parameters found out to be :{'max_depth': 25, 'max_features': 24, 'min_samples_split': 10, 'n_estimators': 150}

where model best score is:  0.8237031376924666

    Training score  = 0.9416053146851839
    MAE : 169.6591645191808
    MSE : 68890.38858227018
    RMSE : 262.4697860369269
    R2 : 0.8303437741263449
    Adjusted R2 :  0.8275867373739001


# **Decision Tree Regression**


The best parameters found out to be :{'criterion': 'squared_error', 'max_depth': 10, 'max_features': 35, 'min_samples_split': 50, 'splitter': 'best'}

where model best score is:  0.7589229343218231

    Training score  = 0.8303549823331889
    MAE : 189.78425743618234
    MSE : 82961.48773177019
    RMSE : 288.03035904531
    R2 : 0.7956909056387869
    Adjusted R2 :  0.7923707346334973



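By observing the model results, we can conclude that:

1) Random Forest Regression is the best model for predicting the rented bike count, with an R2 of 0.8303 and the highest adjusted R2 (0.8276). Linear regression with polynomial features reaches a marginally higher R2 (0.8319), but its adjusted R2 drops to 0.7812 because of the large number of generated features.

2) The plain linear regression model (with or without StandardScaler) is the worst performing model, with an R2 of 0.7148 and an adjusted R2 of 0.7104.

3) The linear models were scored against the square-root-transformed target and the tree models against the raw counts, so the MAE, MSE and RMSE values listed above are in different units for the two groups and should not be compared directly.

Actual vs. predicted plots were produced for all eight models.

Temperature and Hour are the two most important factors according to the models, and they are very useful for predicting the rented bike count.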

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***