<a href="https://colab.research.google.com/github/paralkardhananjay/Bike-Sharing-Demand-Prediction-D-Paralkar/blob/main/Bike_Sharing_Demand_Prediction_%7CD_Paralkar.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -  Bike Sharing Demand Prediction



##### **Project Type**     Regression
##### **Contribution**    - Individual
##### **Team Member 1 -**  Dhananjay Paralkar


# **Project Summary -**

Rental bikes have become a popular addition to cities, improving mobility and comfort. To keep waiting times low, bikes must be available and accessible when riders need them, so the main challenge is maintaining a stable supply. Doing so depends on accurately predicting the number of bikes required in each hour. Accurate hourly predictions are what make a consistent rental bike service across the city possible.

# **GitHub Link -**

https://github.com/paralkardhananjay/Bike-Sharing-Demand-Prediction-D-Paralkar

# **Problem Statement**


**Write Problem Statement Here.**

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. For each chart, make sure the following format is followed.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help create a positive business impact? 
Are there any insights that lead to negative growth? Justify with a specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule. 

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. For each algorithm, address the following:


*   Explain the ML model used and its performance using an evaluation metric score chart.


*   Cross-validation and hyperparameter tuning.

*   Have you seen any improvement? Note the improvement with an updated evaluation metric score chart.

*   Explain what each evaluation metric indicates for the business and the business impact of the ML model used.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Importing required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import math
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler,StandardScaler
from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.svm import SVR
from sklearn.metrics import r2_score,mean_squared_error
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import StackingRegressor

%matplotlib inline

In [None]:
# Mount Google Drive to import the dataset
from google.colab import drive
drive.mount('/content/drive')

### Dataset Loading

In [None]:
# Load Dataset
Bike_share = pd.read_csv('/content/drive/MyDrive/SeoulBikeData.csv', encoding= 'unicode_escape')

In [None]:
# View the top 5 rows
Bike_share.head()

In [None]:
# View the data of bottom 5 rows 
Bike_share.tail()

In [None]:
#Getting the shape of dataset with rows and columns
print(Bike_share.shape)


In [None]:
#check details about the data set
Bike_share.info()

In [None]:
#Looking for the description of the dataset
Bike_share.describe()

In [None]:
#Getting all the columns
Bike_share.columns

### Dataset First View

In [None]:
# Dataset First Look

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
# finding total no of rows in dataset
print("The no of rows in the dataset is ",len(Bike_share))

### Dataset Information

In [None]:
# Dataset Info
Bike_share.info()

#### Duplicate Values

In [None]:
# Checking Duplicate Values
duplicates =len(Bike_share[Bike_share.duplicated()])
print(duplicates)

Convert the date column into three separate columns

In [None]:
import datetime as dt
Bike_share['date'] = Bike_share['Date'].apply(lambda x: dt.datetime.strptime(x,"%d/%m/%Y"))

In [None]:
Bike_share['year'] = Bike_share['date'].dt.year
Bike_share['month'] = Bike_share['date'].dt.month

Bike_share['day'] = Bike_share['date'].dt.day_name()


In [None]:
# Create a new "week" column that labels each day as "weekday" or "weekend"
Bike_share['week']=Bike_share['day'].apply(lambda x : "weekend" if x=='Saturday' or x=='Sunday' else "weekday" )


In [None]:
# Count the rows that fall on weekdays and on weekends
Bike_share['week'].value_counts()

In [None]:
# Drop the helper columns that are no longer needed
Bike_share=Bike_share.drop(columns=['date','day','year'])

In [None]:
Bike_share.head()

To enhance the prediction of bike count, the "date" column can be transformed into three separate columns: "year," "month," and "day." The "year" column contains only two unique numbers, representing the period from December 2017 to November 2018, which can be considered as one year. Therefore, this column can be dropped from the dataset. 

The "day" column, which holds the name of the day of the week, can be reduced to a flag indicating whether a day is a weekday or a weekend. This conversion keeps the relevant information while removing the unnecessary granularity of daily data. Consequently, the "day" column can also be dropped.

By eliminating the "year" and "day" columns, the dataset becomes more focused on the required factors for predicting bike count, such as the month and the distinction between weekdays and weekends. These modifications contribute to a more streamlined and efficient analysis process.
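The date handling described above can be sketched on a small toy frame. This is a minimal illustration only: it uses `pd.to_datetime` (a vectorised alternative to calling `strptime` row by row) and the name `toy` is hypothetical.

```python
import pandas as pd

# Toy frame using the same day/month/year text format as the 'Date' column
toy = pd.DataFrame({"Date": ["01/12/2017", "02/12/2017", "03/12/2017"]})

# Parse the whole column in one call
parsed = pd.to_datetime(toy["Date"], format="%d/%m/%Y")

toy["month"] = parsed.dt.month
# dayofweek: Monday=0 ... Sunday=6, so 5 and 6 are the weekend
toy["week"] = parsed.dt.dayofweek.map(lambda d: "weekend" if d >= 5 else "weekday")

print(toy)
```

Only `month` and `week` are kept as features here, mirroring the columns that survive in the notebook.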

In [None]:
#Change the selected columns to the category dtype
cols=['Hour','month','week']
for col in cols:
  Bike_share[col]=Bike_share[col].astype('category')

In [None]:
Bike_share.info()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_values_count = Bike_share.isnull().sum()

print(missing_values_count)

In [None]:
# Visualizing the missing values
Bike_share = pd.DataFrame(Bike_share)

# Create a heatmap of missing values
sns.heatmap(Bike_share.isnull(), cmap='viridis')

plt.title('Missing Values Heatmap')
plt.show()

# Exploratory Data Analysis

Chart 1

In [None]:
# analysis of the data by visualization
fig,ax=plt.subplots(figsize=(10,5))
sns.barplot(data=Bike_share,x='month',y='Rented Bike Count',ax=ax,capsize=.2)
ax.set(title='Count of rented bikes according to month')

The bar plot above shows that demand for rented bikes is higher from month 5 to month 10 (May to October) than in the other months.


Chart 2

In [None]:
# analysis of the data by visualization
fig,ax=plt.subplots(figsize=(10,8))
sns.barplot(data=Bike_share,x='week',y='Rented Bike Count',ax=ax,capsize=.2)
ax.set(title='Count of rented bikes: weekdays vs. weekends')

Chart 3

In [None]:
# analysis of the data by visualization
fig,ax=plt.subplots(figsize=(20,8))
sns.pointplot(data=Bike_share,x='Hour',y='Rented Bike Count',hue='week',ax=ax)
ax.set(title='Count of rented bikes by hour: weekdays vs. weekends')

Based on the line plot and bar plot, we can observe that weekdays, represented by the blue color, indicate a higher demand for rental bikes, likely due to office commutes. The peak times for bike rentals are from 7 am to 9 am and 5 pm to 7 pm during weekdays.

On the other hand, weekends, represented by the orange color, show a comparatively lower demand for rented bikes, particularly in the morning hours. However, as the evening approaches, starting from 4 pm to 8 pm, there is a slight increase in demand for rental bikes during weekends.

These patterns suggest that weekdays experience a higher demand for bike rentals, driven by office-related activities, while weekends show a relatively lower demand, with a slight increase in the evening hours.

Chart 4 (Functioning Day)

In [None]:
# analysis of the data by visualization
fig,ax=plt.subplots(figsize=(10,8))
sns.barplot(data=Bike_share,x='Functioning Day',y='Rented Bike Count',ax=ax,capsize=.2)
ax.set(title='Count of rented bikes according to Functioning Day')

In [None]:
# analysis of the data by visualization
fig,ax=plt.subplots(figsize=(20,8))
sns.pointplot(data=Bike_share,x='Hour',y='Rented Bike Count',hue='Functioning Day',ax=ax)
ax.set(title='Count of rented bikes by hour according to Functioning Day')

The bar plot shows two bars, one for each value of the "Functioning Day" variable: "Yes" and "No." This column records whether the rental service was operating, not whether the day was a weekday or a weekend.

1. "Yes" (service operating): essentially all rentals fall in this category, so its bar reflects the usual average hourly demand.

2. "No" (service not operating): the rental count is at or near zero, because bikes cannot be rented while the service is closed.

Chart 5 (According to Season)

In [None]:
# analysis of the data by visualization
fig,ax=plt.subplots(figsize=(15,8))
sns.boxplot(data=Bike_share,x='Seasons',y='Rented Bike Count',ax=ax)
ax.set(title='Count of rented bikes according to season')

In [None]:
# analysis of the data by visualization
fig,ax=plt.subplots(figsize=(15,8))
sns.barplot(data=Bike_share,x='Seasons',y='Rented Bike Count',ax=ax,capsize=.2)
ax.set(title='Count of rented bikes according to season')

The box plot analysis based on seasons indicates that Summer and Fall have a higher median count of rented bikes, suggesting increased demand during these seasons. In contrast, Spring and Winter show a relatively lower median count, implying lower bike rental activity.

Chart 6 (According to Holidays)

In [None]:
# select the target column before summing so that non-numeric columns are not aggregated
Bike_share.groupby('Holiday')['Rented Bike Count'].sum().plot.pie(radius=1.5)

In [None]:
# analysis of the data by visualization
fig,ax=plt.subplots(figsize=(20,8))
sns.pointplot(data=Bike_share,x='Hour',y='Rented Bike Count',hue='Holiday',ax=ax)
ax.set(title='Count of rented bikes by hour according to Holiday')

The pie chart illustrates that non-holidays have a higher count of rented bikes, while holidays show a lower count, indicating varying demand based on the holiday status.
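A pie chart of totals partly reflects how many hours fall into each group, since there are far more non-holiday hours than holiday hours. A minimal sketch, using hypothetical toy numbers rather than the real dataset, of reporting the per-hour mean alongside the total:

```python
import pandas as pd

# Hypothetical toy data: many non-holiday hours, few holiday hours
toy = pd.DataFrame({
    "Holiday": ["No Holiday"] * 6 + ["Holiday"] * 2,
    "Rented Bike Count": [100, 120, 80, 110, 90, 100, 90, 70],
})

# Total, mean per hour, and number of hours in each group
summary = toy.groupby("Holiday")["Rented Bike Count"].agg(["sum", "mean", "count"])
print(summary)
```

In this toy example the totals differ by a large factor while the per-hour means are much closer, which is why the mean is usually the fairer comparison between groups of unequal size.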

Chart 7 Analysis of numerical variables

In [None]:
numerical_columns=['Rented Bike Count','Temperature(°C)','Humidity(%)','Wind speed (m/s)','Visibility (10m)','Solar Radiation (MJ/m2)','Rainfall(mm)','Snowfall (cm)']

In [None]:
# checking the distribution of each numerical column
# (histplot with kde=True replaces the deprecated sns.distplot)
plt.figure(figsize=(10,10))
for index,item in enumerate(numerical_columns):
  plt.subplot(3,3,index+1)
  sns.histplot(Bike_share[item],kde=True)
plt.tight_layout()

In [None]:
# creating a dataframe containing the count of bikes rented in different temperature

df_temp = pd.DataFrame(Bike_share.groupby('Temperature(°C)')['Rented Bike Count'].sum())
df_temp.reset_index(inplace=True)

In [None]:
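# plot showing how total bike rentals are distributed across temperature
# (each temperature value is weighted by the number of bikes rented at it)

plt.figure(figsize=(8,6))
sns.histplot(data=df_temp,x='Temperature(°C)',weights='Rented Bike Count',kde=True)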

Above plot shows that people tend to rent bikes when the temperature is between -5 to 25 degrees.

In [None]:
# creating a dataframe containing the count of bikes rented in different visibility ranges

df_visibility = pd.DataFrame(Bike_share.groupby('Visibility (10m)')['Rented Bike Count'].sum())
df_visibility.reset_index(inplace=True)

In [None]:
# plot showing how total bike rentals are distributed across visibility
plt.figure(figsize=(8,6))
sns.histplot(data=df_visibility,x='Visibility (10m)',weights='Rented Bike Count',kde=True)

Above plot shows that people tend to rent bikes when the visibility is between 


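# One-Hot Encoding

In [None]:
# creating dummy variables for the categorical features --> Seasons, Holiday, Functioning Day, month, week
# a prefix keeps the new column names unique (the raw 'Holiday' values would otherwise
# produce a dummy column that is also called 'Holiday') and makes every name a string

seasons = pd.get_dummies(Bike_share['Seasons'], prefix='Seasons')

working_day = pd.get_dummies(Bike_share['Holiday'], prefix='Holiday')

F_day = pd.get_dummies(Bike_share['Functioning Day'], prefix='Functioning Day')

month = pd.get_dummies(Bike_share['month'], prefix='month')

week_day = pd.get_dummies(Bike_share['week'], prefix='week')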



In [None]:
Bike_share = pd.concat([Bike_share,seasons,working_day,F_day,month,week_day],axis=1)


In [None]:
# checking the data dummy variable is created or not

Bike_share.head()

In [None]:
## dropping columns for which dummy variables were created

Bike_share.drop(['Seasons','Holiday','Functioning Day','week','month'],axis=1,inplace=True)


In [None]:
Bike_share.drop(['Date'],axis=1,inplace=True) # dropping Date because month and week were already extracted from it


**Checking Multicollinearity**

In [None]:
# calculating the variance inflation factor (VIF) of each feature

from statsmodels.stats.outliers_influence import variance_inflation_factor
# use the features only (no target) and cast to float so that boolean and
# category columns are accepted by statsmodels
X = Bike_share.drop(columns=['Rented Bike Count']).astype(float)

# VIF dataframe
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns

# calculating VIF for each feature
vif_data["VIF"] = [variance_inflation_factor(X.values, i)
                          for i in range(len(X.columns))]
vif_data['VIF'] = round(vif_data['VIF'],2)


print(vif_data)
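For intuition about what the VIF measures, the sketch below computes it from its definition, VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing feature j on the remaining features. It is an illustration on synthetic data, not the statsmodels implementation: the helper name `vif` is hypothetical, and unlike `variance_inflation_factor` this version adds an intercept to each auxiliary regression.

```python
import numpy as np

def vif(X):
    """Return VIF_j = 1 / (1 - R^2_j) for each column j of a 2-D float array."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Regress column j on the remaining columns plus an intercept
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid.var() / target.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)   # nearly a copy of `a`

vifs = vif(np.column_stack([a, b, c]))
print(vifs.round(2))
```

The two nearly identical columns get a large VIF, while the independent column stays close to 1.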

In [None]:
Bike_share=Bike_share.drop(['Rainfall(mm)','Snowfall (cm)'],axis=1)

# Regression Plot

In [None]:
numerical_columns=['Rented Bike Count','Temperature(°C)','Humidity(%)','Wind speed (m/s)','Visibility (10m)','Solar Radiation (MJ/m2)']

In [None]:
#printing the regression plot of each numerical feature against the target
for col in numerical_columns[1:]:  # skip the target itself
  fig,ax=plt.subplots(figsize=(10,6))
  sns.regplot(x=Bike_share[col],y=Bike_share['Rented Bike Count'],scatter_kws={"color": 'orange'}, line_kws={"color": "black"},ax=ax)


In [None]:
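#Assign the features to X and the target to y
X = Bike_share.drop(columns=['Rented Bike Count'])
X.columns = X.columns.astype(str)  # scikit-learn expects string column names
X = X.astype(float)                # keep the features numeric (not strings) for scaling
# the models are trained on the square root of the rented bike count
y = np.sqrt(Bike_share['Rented Bike Count'])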
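Because the target is the square root of the count, every metric computed on `y` is expressed in square-root units. A minimal sketch, with hypothetical numbers, of converting predictions back to bikes per hour before reporting an error in the original units:

```python
import numpy as np

# Hypothetical hourly counts and hypothetical model outputs on the sqrt scale
y_true_counts = np.array([0.0, 100.0, 400.0, 900.0])
y_pred_sqrt = np.array([1.0, 11.0, 19.0, 31.0])

# Undo the transform: square the predictions to get bikes per hour
y_pred_counts = np.square(y_pred_sqrt)

# RMSE on the transformed scale and on the original scale
rmse_sqrt = np.sqrt(np.mean((np.sqrt(y_true_counts) - y_pred_sqrt) ** 2))
rmse_counts = np.sqrt(np.mean((y_true_counts - y_pred_counts) ** 2))
print(rmse_sqrt, rmse_counts)
```

An error of one unit on the square-root scale corresponds to a much larger error in bikes when the count is high, so the two RMSE values are not directly comparable.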


In [None]:
#Create the train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.25, random_state=0)
print(X_train.shape)
print(X_test.shape)

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train = scaler.fit_transform(X_train.values)
X_test = scaler.transform(X_test.values)


# Linear Regression Model

In [None]:
#import the packages
from sklearn.linear_model import LinearRegression
reg= LinearRegression().fit(X_train, y_train)

In [None]:
#check the score
reg.score(X_train, y_train)

In [None]:
#check the coefficeint
reg.coef_

In [None]:
#predict on the train and test sets
y_pred_train=reg.predict(X_train)
y_pred_test=reg.predict(X_test)

In [None]:
y_pred_test

In [None]:
#import the packages
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
#calculate MSE
MSE_lr= mean_squared_error((y_train), (y_pred_train))
print("MSE :",MSE_lr)

#calculate RMSE
RMSE_lr=np.sqrt(MSE_lr)
print("RMSE :",RMSE_lr)


#calculate MAE
MAE_lr= mean_absolute_error(y_train, y_pred_train)
print("MAE :",MAE_lr)



#import the packages
from sklearn.metrics import r2_score
#calculate r2 and adjusted r2 (train metrics, so use the training set's shape)
r2_lr= r2_score(y_train, y_pred_train)
print("R2 :",r2_lr)
Adjusted_R2_lr = (1-(1-r2_lr)*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1)) )

print("Adjusted R2 :",Adjusted_R2_lr)
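The adjusted R² expression recurs for every model in this notebook, so it may be easier to read as a small helper. This is a sketch only; the function name `adjusted_r2` is hypothetical, and `n_samples` and `n_features` should come from the same set (train or test) that produced the R² value.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n_samples - n_features - 1 <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Example: R^2 of 0.80 from 101 samples and 10 features
print(adjusted_r2(0.80, n_samples=101, n_features=10))
```

The penalty grows as the number of features approaches the number of samples, which is why the sample count passed in must match the data the R² was computed on.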

**Linear Regression with L2 Regularization**

In [None]:
#import the packages
from sklearn.linear_model import Ridge

ridge= Ridge(alpha=0.1)

In [None]:
#FIT THE MODEL
ridge.fit(X_train,y_train)

In [None]:
#check the score
ridge.score(X_train, y_train)

In [None]:
#get the X_train and X-test value
y_pred_train_ridge=ridge.predict(X_train)
y_pred_test_ridge=ridge.predict(X_test)

In [None]:
# evaluating metrics
#import the packages
from sklearn.metrics import mean_squared_error
#calculate MSE
MSE  = mean_squared_error(y_test,y_pred_test_ridge)
print("MSE :" , MSE)
#calculate RMSE
RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)
#calculate r2 and adjusted r2
r2_ridge_test = r2_score(y_test,y_pred_test_ridge)
print("R2 :" ,r2_ridge_test)
print("Adjusted R2 : ",1-(1-r2_score(y_test,y_pred_test_ridge))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)))




**Heteroscedasticity**

Heteroscedasticity occurs when the variance of the errors (residuals) in a regression model is not consistent across all levels of the independent variable(s). This contradicts one of linear regression's assumptions, which states that the variance of the errors should be constant (homoscedastic) across all levels of the independent variable(s). Heteroscedasticity is indicated if the residual plot has a funnel shape, with the spread of residuals growing or shrinking as the predicted values increase.
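To make the funnel shape concrete, the sketch below builds synthetic data whose noise grows with the predictor and compares the residual spread for low and high fitted values. All data here are simulated for illustration and are unrelated to the bike dataset.

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(1, 10, 400)
# Noise whose standard deviation grows with x: heteroscedastic by construction
y = 2.0 * x + rng.normal(scale=0.5 * x)

# Ordinary least-squares line and its residuals
slope, intercept = np.polyfit(x, y, deg=1)
fitted = slope * x + intercept
resid = y - fitted

# Compare the residual spread for low and high fitted values
low = resid[fitted < np.median(fitted)]
high = resid[fitted >= np.median(fitted)]
print(low.std(), high.std())
```

A clearly larger spread in the upper half is the numeric counterpart of the funnel seen in a residual plot.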

In [None]:
### Heteroscadacity - Residual plot 
plt.scatter((y_pred_test),(y_test)-(y_pred_test))
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.show()

In [None]:
# Actual vs. predicted values (square-root scale) for linear regression
plt.figure(figsize=(12,8))
plt.plot(y_pred_test)
plt.plot(np.array(y_test))
plt.legend(["Predicted","Actual"])
plt.xlabel('No of Test Data')
plt.show()

**RIDGE REGRESSION**

Ridge regression is a method for estimating the coefficients of a regression model when the independent variables are highly correlated. It combines the linear regression model with an L2 regularisation penalty on the coefficients.
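The effect of the L2 penalty can be seen with the closed-form solution on synthetic, strongly correlated features. This is an illustrative sketch, not the scikit-learn implementation: it omits the intercept (which `Ridge` fits by default), and the helper name `ridge_coefficients` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)      # strongly correlated with x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(size=n)

def ridge_coefficients(X, y, alpha):
    """Closed form without intercept: w = (X^T X + alpha * I)^-1 X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

w_ols = ridge_coefficients(X, y, alpha=0.0)     # alpha = 0 is plain least squares
w_ridge = ridge_coefficients(X, y, alpha=10.0)  # penalised solution
print(w_ols, w_ridge)
```

With any positive `alpha`, the coefficient vector is shrunk towards zero, which stabilises the estimates when features carry nearly the same information.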

In [None]:
#import the packages
from sklearn.linear_model import Ridge

ridge= Ridge(alpha=0.1)


In [None]:
#FIT THE MODEL
ridge.fit(X_train,y_train)

In [None]:
#check the score
ridge.score(X_train, y_train)
     

In [None]:
#predict on the train and test sets
y_pred_train_ridge=ridge.predict(X_train)
y_pred_test_ridge=ridge.predict(X_test)
     

In [None]:
y_pred_train_ridge

In [None]:
y_pred_test_ridge

In [None]:
#import the packages
from sklearn.metrics import mean_squared_error
#calculate MSE
MSE_r= mean_squared_error((y_train), (y_pred_train_ridge))
print("MSE :",MSE_r)

#calculate RMSE
RMSE_r=np.sqrt(MSE_r)
print("RMSE :",RMSE_r)


#calculate MAE
MAE_r= mean_absolute_error(y_train, y_pred_train_ridge)
print("MAE :",MAE_r)


#import the packages
from sklearn.metrics import r2_score
#calculate r2 and adjusted r2 (train metrics, so use the training set's shape)
r2_r= r2_score(y_train, y_pred_train_ridge)
print("R2 :",r2_r)
Adjusted_R2_r=(1-(1-r2_r)*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1)) )
print("Adjusted R2 :",Adjusted_R2_r)

     

The R2 score on the training set is 0.79, which means the model captures most of the variance in the data. Let's save the metrics for later comparison.

In [None]:
# storing the train set metrics in a dictionary for later comparison
dict1={'Model':'Ridge regression ',
       'MAE':round((MAE_r),3),
       'MSE':round((MSE_r),3),
       'RMSE':round((RMSE_r),3),
       'R2_score':round((r2_r),3),
       'Adjusted R2':round((Adjusted_R2_r ),2)}


In [None]:
#import the packages
from sklearn.metrics import mean_squared_error
#calculate MSE
MSE_r= mean_squared_error(y_test, y_pred_test_ridge)
print("MSE :",MSE_r)

#calculate RMSE
RMSE_r=np.sqrt(MSE_r)
print("RMSE :",RMSE_r)


#calculate MAE
MAE_r= mean_absolute_error(y_test, y_pred_test_ridge)
print("MAE :",MAE_r)


#import the packages
from sklearn.metrics import r2_score
#calculate r2 and adjusted r2
r2_r= r2_score((y_test), (y_pred_test_ridge))
print("R2 :",r2_r)
Adjusted_R2_r=(1-(1-r2_score((y_test), (y_pred_test_ridge)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )
print("Adjusted R2 :",1-(1-r2_score((y_test), (y_pred_test_ridge)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )

     

The r2_score for the test set is 0.80. This means our linear model is performing well on the data. Let us try to visualize our residuals and see if there is heteroscedasticity(unequal variance or scatter).

In [None]:
# storing the test set metrics value in a dataframe for later comparison
dict2={'Model':'Ridge regression ',
       'MAE':round((MAE_r),3),
       'MSE':round((MSE_r),3),
       'RMSE':round((RMSE_r),3),
       'R2_score':round((r2_r),3),
       'Adjusted R2':round((Adjusted_R2_r ),2)}

In [None]:
### Heteroscadacity - Residual plot 
plt.scatter((y_pred_test_ridge),(y_test)-(y_pred_test_ridge))
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.show()

In [None]:
#Plot the figure
plt.figure(figsize=(10,8))
plt.plot((y_pred_test_ridge))
plt.plot((np.array(y_test)))
plt.legend(["Predicted","Actual"])
plt.show()

# Linear Regression with Elastic Net

In [None]:
#import the packages
from sklearn.linear_model import ElasticNet
#a * L1 + b * L2
#alpha = a + b and l1_ratio = a / (a + b)
elasticnet = ElasticNet(alpha=0.1, l1_ratio=0.5)
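The comment above relates separate L1 and L2 weights to scikit-learn's `alpha` and `l1_ratio`. The scikit-learn documentation writes the combined penalty as `a * ||w||_1 + 0.5 * b * ||w||_2^2`, with `alpha = a + b` and `l1_ratio = a / (a + b)`. A small sketch of that mapping follows; the helper name `to_sklearn_params` is hypothetical.

```python
def to_sklearn_params(a, b):
    """Map penalty weights a (L1) and b (L2) to (alpha, l1_ratio)."""
    if a < 0 or b < 0 or a + b == 0:
        raise ValueError("a and b must be non-negative and not both zero")
    alpha = a + b
    l1_ratio = a / (a + b)
    return alpha, l1_ratio

# The setting alpha=0.1, l1_ratio=0.5 corresponds to a = b = 0.05
print(to_sklearn_params(0.05, 0.05))
```

Setting `b = 0` gives `l1_ratio = 1` (a pure L1 penalty, as in Lasso), and setting `a = 0` gives `l1_ratio = 0` (a pure L2 penalty).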

In [None]:
#FIT THE MODEL
elasticnet.fit(X_train,y_train)

In [None]:
#check the score
elasticnet.score(X_train, y_train)

In [None]:
#predict on the train and test sets
y_pred_train_en=elasticnet.predict(X_train)
y_pred_test_en=elasticnet.predict(X_test)

In [None]:
#import the packages
from sklearn.metrics import mean_squared_error
#calculate MSE
MSE_e= mean_squared_error((y_train), (y_pred_train_en))
print("MSE :",MSE_e)

#calculate RMSE
RMSE_e=np.sqrt(MSE_e)
print("RMSE :",RMSE_e)


#calculate MAE
MAE_e= mean_absolute_error(y_train, y_pred_train_en)
print("MAE :",MAE_e)


#import the packages
from sklearn.metrics import r2_score
#calculate r2 and adjusted r2 (train metrics, so use the training set's shape)
r2_e= r2_score(y_train, y_pred_train_en)
print("R2 :",r2_e)
Adjusted_R2_e=(1-(1-r2_e)*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1)) )
print("Adjusted R2 :",Adjusted_R2_e)


# Decision Tree

In [None]:
# training model

from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor(random_state=0)
regressor.fit(X_train, y_train)

In [None]:
Y_pred_train =regressor.predict(X_train)
Y_pred_train

In [None]:
Y_pred_test = regressor.predict(X_test)
Y_pred_test

In [None]:
# r2_score expects (y_true, y_pred) in that order
r2_score(y_train,Y_pred_train)

In [None]:
DT = r2_score(y_test,Y_pred_test)
DT

# Random Forest

In [None]:
#import the packages
from sklearn.ensemble import RandomForestRegressor
# Create an instance of the RandomForestRegressor
rf_model = RandomForestRegressor()

rf_model.fit(X_train,y_train)

In [None]:
#  Making predictions on train and test data

y_pred_train_r = rf_model.predict(X_train)
y_pred_test_r = rf_model.predict(X_test)


In [None]:
#import the packages
from sklearn.metrics import mean_squared_error
print("Model Score:",rf_model.score(X_train,y_train))

#calculate MSE
MSE_rf= mean_squared_error(y_train, y_pred_train_r)
print("MSE :",MSE_rf)

#calculate RMSE
RMSE_rf=np.sqrt(MSE_rf)
print("RMSE :",RMSE_rf)


#calculate MAE
MAE_rf= mean_absolute_error(y_train, y_pred_train_r)
print("MAE :",MAE_rf)


#import the packages
from sklearn.metrics import r2_score
#calculate r2 and adjusted r2 (train metrics, so use the training set's shape)
r2_rf= r2_score(y_train, y_pred_train_r)
print("R2 :",r2_rf)
Adjusted_R2_rf=(1-(1-r2_rf)*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1)) )
print("Adjusted R2 :",Adjusted_R2_rf)

     

In [None]:
# storing the train set metrics in a dictionary for later comparison
dict1={'Model':'Random forest regression ',
       'MAE':round((MAE_rf),3),
       'MSE':round((MSE_rf),3),
       'RMSE':round((RMSE_rf),3),
       'R2_score':round((r2_rf),3),
       'Adjusted R2':round((Adjusted_R2_rf ),2)}
     

In [None]:
#import the packages
from sklearn.metrics import mean_squared_error
#calculate MSE
MSE_rf= mean_squared_error(y_test, y_pred_test_r)
print("MSE :",MSE_rf)

#calculate RMSE
RMSE_rf=np.sqrt(MSE_rf)
print("RMSE :",RMSE_rf)


#calculate MAE
MAE_rf= mean_absolute_error(y_test, y_pred_test_r)
print("MAE :",MAE_rf)


#import the packages
from sklearn.metrics import r2_score
#calculate r2 and adjusted r2
r2_rf= r2_score((y_test), (y_pred_test_r))
print("R2 :",r2_rf)
Adjusted_R2_rf=(1-(1-r2_score((y_test), (y_pred_test_r)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )
print("Adjusted R2 :",1-(1-r2_score((y_test), (y_pred_test_r)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )
     

The r2_score for the test set is 0.91. This means our random forest model is performing well on the data. Let us try to visualize our residuals and see if there is heteroscedasticity (unequal variance or scatter).

In [None]:
# storing the test set metrics value in a dataframe for later comparison
dict2={'Model':'Random forest regression ',
       'MAE':round((MAE_rf),3),
       'MSE':round((MSE_rf),3),
       'RMSE':round((RMSE_rf),3),
       'R2_score':round((r2_rf),3),
       'Adjusted R2':round((Adjusted_R2_rf ),2)}

In [None]:
### Heteroscadacity- Residual plot
plt.scatter((y_pred_test_r),(y_test)-(y_pred_test_r))
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.show()

# Hyperparameter tuning

**Gradient Boosting Regressor with GridSearchCV**



*Provide the range of values for chosen hyperparameters*

---



In [None]:
# Number of trees
n_estimators = [50,80,100]

# Maximum depth of trees
max_depth = [4,6,8]

# Minimum number of samples required to split a node
min_samples_split = [50,100,150]

# Minimum number of samples required at each leaf node
min_samples_leaf = [40,50]

# HYperparameter Grid
param_dictory = {'n_estimators' : n_estimators,
              'max_depth' : max_depth,
              'min_samples_split' : min_samples_split,
              'min_samples_leaf' : min_samples_leaf}
              

In [None]:
param_dictory
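The cost of an exhaustive grid search grows with the product of the list lengths. A short sketch, using only the standard library, of counting the combinations in the grid listed above; the fold count of 3 is taken from the `cv=3` setting used for the grid search in this notebook.

```python
from itertools import product

# Same candidate values as listed in the grid above
grid = {
    "n_estimators": [50, 80, 100],
    "max_depth": [4, 6, 8],
    "min_samples_split": [50, 100, 150],
    "min_samples_leaf": [40, 50],
}

# Every combination of one value per hyperparameter
combos = [dict(zip(grid, values)) for values in product(*grid.values())]
cv_folds = 3
print(len(combos), "combinations,", len(combos) * cv_folds, "model fits")
```

Counting the fits up front helps decide whether a full grid is affordable or whether a smaller grid or a randomized search is the more practical choice.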

In [None]:
# Importing Gradient Boosting Regressor
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor

# Create an instance of the GradientBoostingRegressor
gb_model = GradientBoostingRegressor()

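# Grid search
# note: this smaller grid is the one actually searched; param_dictory defined above is not passed to GridSearchCV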
param_dict = {'learning_rate': [0.1, 0.01],
              'n_estimators': [50, 100],
              'max_depth': [3, 5]}

gb_grid = GridSearchCV(estimator=gb_model,
                       param_grid=param_dict,
                       cv=3, verbose=2, n_jobs=-1)

gb_grid.fit(X_train, y_train)

In [None]:
gb_grid.best_estimator_


In [None]:
gb_optimal_model = gb_grid.best_estimator_


In [None]:
gb_grid.best_params_


In [None]:
# Making predictions on train and test data

y_pred_train_g_g = gb_optimal_model.predict(X_train)
y_pred_g_g= gb_optimal_model.predict(X_test)
     

In [None]:
from sklearn.metrics import mean_squared_error
print("Model Score:",gb_optimal_model.score(X_train,y_train))
MSE_gbh= mean_squared_error(y_train, y_pred_train_g_g)
print("MSE :",MSE_gbh)

RMSE_gbh=np.sqrt(MSE_gbh)
print("RMSE :",RMSE_gbh)


MAE_gbh= mean_absolute_error(y_train, y_pred_train_g_g)
print("MAE :",MAE_gbh)


from sklearn.metrics import r2_score
# train metrics, so use the training set's shape for adjusted r2
r2_gbh= r2_score(y_train, y_pred_train_g_g)
print("R2 :",r2_gbh)
Adjusted_R2_gbh = (1-(1-r2_gbh)*((X_train.shape[0]-1)/(X_train.shape[0]-X_train.shape[1]-1)) )
print("Adjusted R2 :",Adjusted_R2_gbh)

     

In [None]:
# storing the train set metrics in a dictionary for later comparison
dict1={'Model':'Gradient Boosting gridsearchcv ',
       'MAE':round((MAE_gbh),3),
       'MSE':round((MSE_gbh),3),
       'RMSE':round((RMSE_gbh),3),
       'R2_score':round((r2_gbh),3),
       'Adjusted R2':round((Adjusted_R2_gbh ),2)
      }
     

In [None]:
# evaluating the tuned model on the test set
from sklearn.metrics import mean_squared_error
MSE_gbh= mean_squared_error(y_test, y_pred_g_g)
print("MSE :",MSE_gbh)

RMSE_gbh=np.sqrt(MSE_gbh)
print("RMSE :",RMSE_gbh)


MAE_gbh= mean_absolute_error(y_test, y_pred_g_g)
print("MAE :",MAE_gbh)


from sklearn.metrics import r2_score
r2_gbh= r2_score((y_test), (y_pred_g_g))
print("R2 :",r2_gbh)
Adjusted_R2_gbh = (1-(1-r2_score(y_test, y_pred_g_g))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )
print("Adjusted R2 :",1-(1-r2_score((y_test), (y_pred_g_g)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1)) )

     

# SVR using grid searchcv

In [None]:
from sklearn.svm import SVR

In [None]:
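# candidate hyperparameter values for SVR (a plain dict, without a trailing comma)
# note: 'degree' and 'coef0' have no effect on the 'rbf' kernel used below
param = {'C' : [1,5,10],'degree' : [3,8],'coef0' : [0.01,10,0.5],'gamma' : ['auto','scale']}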


In [None]:
#train the model
modelsvr = SVR(kernel='rbf')
grids = GridSearchCV(modelsvr,param,cv=3)
grids.fit(X_train,y_train)

In [None]:
# predicting for both train and test
y_pred_train3=grids.predict(X_train)
y_pred_test3=grids.predict(X_test)

In [None]:
# finding each of the metrics for training set

print('The MAE of training set = ',mean_absolute_error(y_train, y_pred_train3))
print('The MSE of training set = ',mean_squared_error(y_train, y_pred_train3))
print('The R2_score of training set = ',r2_score(y_train, y_pred_train3))

In [None]:
# finding each metrics for test set
svr = r2_score(y_test, y_pred_test3)
print('The MAE of test set = ',mean_absolute_error(y_test, y_pred_test3))
print('The MSE of test set = ',mean_squared_error(y_test, y_pred_test3))
print('The R2_score of test set = ',r2_score(y_test, y_pred_test3))

# **Conclusion**


Based on the observations and analyses above, we can draw the following conclusions.
1. The hour of the day is the most important factor for estimating bike rentals: demand varies considerably with the time of day.
2. Demand is seasonal. Rentals are highest in Summer and Autumn and lower in Spring and Winter.
3. Clear days have the most bike rentals, while snowy or rainy days have the fewest, so the weather has a significant impact on demand.
4. The top five features for predicting bike rentals are Season_Winter, Temperature, Hour, Season_Autumn, and Humidity. These variables have a large impact on the number of rented bikes.
5. There is less demand for rental bikes on holidays, which suggests that much of the usage is commuting.
6. People tend to rent bikes when the temperature is between -5 and 25 degrees Celsius. This suggests that cool to moderately warm weather suits bike rentals.
7. Bike demand peaks between 8am-9am and 6pm-7pm. These times correspond to typical commute hours.
8. People generally prefer to rent bikes in the summer rather than in the winter, probably because of the better weather conditions.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***