<a href="https://colab.research.google.com/github/khushijain822/bike_sharing/blob/main/seoul_bike_sharing_mlproject.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**-
# SEOUL BIKE SHARING DEMAND PREDICTION


##### **Contribution**    - Individual


# **Project Summary-**





Rental bike programs in cities make it easier to get around and are good for the environment. For these programs to work well, bikes need to be available when and where people need them. This requires predicting how many bikes will be needed at different times of the day.

By looking at data like past usage patterns and weather, advanced technology can help forecast bike demand. Real-time monitoring also helps keep track of where bikes are and move them as needed. Working with city planners and linking bike rentals with other public transportation can improve the system further.

In short, using smart predictions and good management ensures that rental bikes are always available, making the system reliable and user-friendly.

# **GitHub Link -** https://github.com/khushijain822/bike_sharing.git

# **Problem Statement**


Rental bikes have recently been introduced in numerous urban areas to improve mobility and convenience. Ensuring that these rental bikes are available and accessible to the public at the right times is essential, as it reduces waiting times. Consequently, maintaining a consistent supply of rental bikes throughout the city becomes a significant concern. The key challenge is accurately predicting the number of bikes needed at each hour to ensure a stable supply of rental bikes.

## <b> Data Description </b>

### <b> The dataset contains weather information (Temperature, Humidity, Windspeed, Visibility, Dewpoint, Solar radiation, Snowfall, Rainfall), the number of bikes rented per hour and date information.</b>


### <b>Attribute Information: </b>

* ### Date : year-month-day
* ### Rented Bike count - Count of bikes rented at each hour
* ### Hour - Hour of the day
* ### Temperature-Temperature in Celsius
* ### Humidity - %
* ### Windspeed - m/s
* ### Visibility - 10m
* ### Dew point temperature - Celsius
* ### Solar radiation - MJ/m2
* ### Rainfall - mm
* ### Snowfall - cm
* ### Seasons - 1 = Winter,2 = Spring, 3 = Fall, 4 = Summer
* ### Holiday - Holiday/No holiday
* ### Functional Day - NoFunc(Non Functional Hours), Fun(Functional hours)

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# importing libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from datetime import date

sns.set_style('darkgrid')
# Importing Minmaxscaler to scale data
from sklearn.preprocessing import MinMaxScaler,StandardScaler

#Import the Models
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor
from sklearn.linear_model import SGDRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# importing library called warning to ignore warnings.
import warnings
warnings.filterwarnings("ignore")

### Dataset Loading

In [None]:
# Load Dataset
# Mount drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# load & save data
data=pd.read_csv('/content/drive/MyDrive/Colab Notebooks/SeoulBikeData.csv',encoding='latin-1')

In [None]:
# creating copy so as to not disturb original dataset
df=data.copy()

### Dataset First View

In [None]:
# Dataset First Look
df.head().T

In [None]:
#checking bottom 5 rows
df.tail()

In [None]:
#checking random samples of data
df.sample(5)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
#rows& columns of data
df.shape


In [None]:
#total datapoints
df.size

### Dataset Information

In [None]:
# Dataset Info
#checking non null and datatypes
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
# Heatmap to see null values in dataset

plt.figure(figsize=(15,8))
sns.heatmap(df.isnull(),cbar=False,cmap="PiYG")
plt.title('Missing values display',fontsize=30,fontweight="bold")
plt.show()

### What did you know about your dataset?





1. The dataset has 8,760 rows and 14 columns.

2. No null values were found in the dataset.

3. The `Date` column is stored as `object`; we will convert it to a datetime type.

4. `Functioning Day`, `Seasons`, and `Holiday` are `object` columns that hold categories; we will treat them as categorical data and encode them, which machine learning algorithms require.
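
A minimal sketch of the categorical conversion mentioned above, on a tiny hand-made frame. The three rows and the name `toy` are illustrative assumptions, not rows from the dataset.

```python
import pandas as pd

# toy frame mimicking the three text columns; values are illustrative only
toy = pd.DataFrame({
    "Seasons": ["Winter", "Winter", "Spring"],
    "Holiday": ["No Holiday", "Holiday", "No Holiday"],
    "Functioning Day": ["Yes", "Yes", "No"],
})

# store low-cardinality text columns as pandas categoricals
for col in ["Seasons", "Holiday", "Functioning Day"]:
    toy[col] = toy[col].astype("category")

print(toy.dtypes)
print(toy["Seasons"].cat.categories.tolist())
```

The `category` dtype mainly saves memory and documents intent. scikit-learn estimators still need these columns encoded as numbers, which is what the Encoding section does later.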


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns.to_list()

In [None]:
# Dataset Describe
df.describe().round(2).T

In [None]:
df.describe(include='O').T

### Variables Description

1. **Seasons:** There are 4 unique seasons: Spring, Summer, Winter, and Fall. Spring is reported as the most frequent value, appearing 2,208 times.

2. **Holiday:** There are 2 unique values: Holiday and No Holiday. No Holiday is the most common, appearing 8,328 times.

3. **Functioning Day:** There are 2 unique values: Yes and No.

4. **Rented Bike Count:** The maximum number of bikes rented in an hour is 3,356 and the minimum is 0.

These details help us understand the data better and prepare it for further analysis.







### Check Unique Values for each variable.

In [None]:
# Check unique values for each variable, so that wrong entries such as #, @, %, ?, +, &
# in string or integer columns, which null-value detection cannot catch, become visible.
for num,col in enumerate(df.columns,1):
    print('\n')
    print(num,')\n','{} : {}'.format(col,df[col].unique().tolist()))

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# get sum of missing values in every column
df.isna().sum()

In [None]:
# sum of duplicated rows in dataset
df.duplicated().sum()

In [None]:
# extracting day,month,year from date
from datetime import date
df['Date']=pd.to_datetime(df['Date'], format="%d/%m/%Y")
df['year']=df['Date'].dt.year
df['month']=df['Date'].dt.month
df['day']=df['Date'].dt.day
df['day_name']=df['Date'].dt.day_name()

In [None]:
# Convert Hour to object dtype so it is treated as a categorical feature
df['Hour']=df['Hour'].astype('object')


In [None]:
df.info()

### What all manipulations have you done and insights you found?

1. Calculated the total number of missing values in each column.
2. Calculated the number of duplicated rows in the dataset.
3. Converted the 'Date' column to a datetime object and extracted year,
   month, day, and day name from the date.
4. Changed the data type of the 'Hour' column to `object` so that it is treated as a categorical feature (the values themselves remain integers, not strings).
5. Provided a concise summary of the DataFrame, including the number of
   non-null entries, column names, data types, and memory usage.   
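
The date handling in steps 3 and 4 can be sketched on two hand-written dates. The values and the names `parsed`, `parts`, and `hour` are assumptions for illustration only.

```python
import pandas as pd

dates = pd.Series(["01/12/2017", "15/06/2018"])  # illustrative day/month/year strings

# an explicit format avoids reading 01/12 as January 12
parsed = pd.to_datetime(dates, format="%d/%m/%Y")

parts = pd.DataFrame({
    "year": parsed.dt.year,
    "month": parsed.dt.month,
    "day": parsed.dt.day,
    "day_name": parsed.dt.day_name(),
})
print(parts)

# casting integers to object keeps them as Python ints; it does not create strings
hour = pd.Series([0, 1, 2]).astype("object")
print(type(hour.iloc[0]))
```

Because the `object` cast leaves integer values in place, `pd.get_dummies` later treats `Hour` as categorical, while numeric-only selections such as `select_dtypes(include=['int', 'float'])` skip it.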



## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

In [None]:
# Bar plot for Daily, Hourly, monthly & yearly Rented Bike count
cols = ['day','Hour','year','month']

n=1
plt.figure(figsize=(20,12))
for i in cols:
  plt.subplot(2,2,n)
  n=n+1
  sns.barplot(data=df,x=i,y='Rented Bike Count')
  plt.title(f"Average Rented Bike Count by {i}")
plt.show()

##### 1. Why did you pick the specific chart?

I chose bar plots for their clarity in comparing categorical data, making it easy to interpret variations in 'Rented Bike Count' across days, hours, years, and months. They effectively highlight trends and patterns in bike rentals. Bar plots are simple, straightforward, and well-suited for categorical comparisons.







##### 2. What is/are the insight(s) found from the chart?

The insights from the bar plots are:

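1. **Day**: `day` is the day of the month (1 to 31), so this panel shows how average rentals vary across the month rather than across weekdays; weekly patterns would need `day_name`.
2. **Hour**: There are clear peak hours for bike rentals, suggesting specific times of day when bikes are in higher demand.
3. **Year**: The data runs from December 2017 to November 2018, so the 2017 bar reflects a single winter month, and the gap between the two bars is mostly seasonal rather than a year-over-year trend.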
4. **Month**: Seasonal variations are evident, with some months having higher bike rentals, indicating the influence of weather or other seasonal factors.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights can positively impact business by optimizing resource allocation and marketing strategies. Negative growth indicators, like declining yearly trends, prompt proactive adjustments to sustain market competitiveness.
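
As a rough check on what the bars represent: `sns.barplot` aggregates with the mean by default, so each bar height is the average hourly count for that category. A minimal sketch on made-up numbers (the frame `toy` is an assumption, not dataset rows):

```python
import pandas as pd

toy = pd.DataFrame({
    "Hour": [8, 8, 18, 18, 3, 3],
    "Rented Bike Count": [900, 1100, 1500, 1700, 50, 150],
})

# the same aggregation seaborn's barplot applies by default: mean per category
bar_heights = toy.groupby("Hour")["Rented Bike Count"].mean()
print(bar_heights)
print("busiest hour:", bar_heights.idxmax())
```

Printing the grouped means next to the chart makes it possible to quote exact figures in the insights instead of reading them off the axis.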

## **Monthly Rented Bike count for 2017 & 2018**

#### Chart - 2

In [None]:
# Chart - 2 visualization code

In [None]:
plt.figure(figsize=(12,8))
sns.barplot(data=df,x='month',y='Rented Bike Count',hue ='year')
plt.title('Monthly Rented Bike count for 2017 & 2018')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a grouped bar chart (`sns.barplot`) with `month` on the x-axis and `Rented Bike Count` on the y-axis, differentiated by year using colors (`hue='year'`). This chart is suitable because it allows for easy comparison of monthly bike rental counts between 2017 and 2018, showing any seasonal patterns or trends.


##### 2. What is/are the insight(s) found from the chart?

The insights from the chart showing the monthly rented bike count for 2017 and 2018 are:

1. Seasonal Patterns: Identification of months with consistently high or low bike rental counts across both years, indicating seasonal trends in demand.

2. Yearly Comparison: Only December falls in 2017, so the chart places December 2017 next to January through November 2018; it cannot show month-by-month growth or decline between the two years.

3. Peak Months: Identification of peak months where bike rental counts are significantly higher, suggesting opportunities for targeted promotions or increased inventory during those periods.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

1. Positive Impact: Insights can optimize resource allocation and marketing strategies during peak months, enhancing operational efficiency and revenue generation.

2. Negative Growth Considerations: Declining rental counts between 2017 and 2018 may highlight challenges such as changing market dynamics or seasonal fluctuations, prompting proactive adjustments to sustain growth and competitiveness.







#### Chart - 3

In [None]:
# Chart - 3 visualization code

In [None]:
# Rented Bike count in every season
plt.figure(figsize=(8,8))
df.groupby('Seasons')['Rented Bike Count'].sum().plot.pie(autopct="%.2f%%")
plt.title(' Rented Bike count in every season')
plt.show()

**As seen earlier, demand for rented bikes is highest in summer (36.99%) and lowest in winter (only 7.8%).**


In [None]:
df.groupby('Hour')['Solar Radiation (MJ/m2)'].sum()

##### 1. Why did you pick the specific chart?

I chose a pie chart because it shows clearly how bike rentals are divided among different seasons, making it easy to compare which season has the most rentals. It's a simple way to see the proportion of rentals in each season at a glance.







##### 2. What is/are the insight(s) found from the chart?

The insights from the pie chart showing the distribution of rented bike counts in each season are:

1. Seasonal Demand: It reveals which seasons have the highest and lowest bike rental counts, indicating peak periods of demand.

2. Relative Contribution: The chart illustrates the proportion of bike rentals contributed by each season to the total, helping to understand the seasonal variation in rental activity throughout the year.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

1. Positive Impact: Insights into seasonal rental patterns help optimize resource allocation, staffing, and inventory management during peak demand periods, improving operational efficiency and customer satisfaction.

2. Negative Growth Considerations: Identification of low rental activity in certain seasons suggests potential challenges such as decreased revenue and underutilization of resources, prompting the need for targeted strategies to stimulate demand and maintain business growth.
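
The percentages in the pie chart are each season's share of total rentals. A minimal sketch of that calculation on made-up counts (the numbers below are illustrative assumptions, not the dataset's values):

```python
import pandas as pd

toy = pd.DataFrame({
    "Seasons": ["Winter", "Winter", "Summer", "Summer", "Spring", "Autumn"],
    "Rented Bike Count": [100, 100, 500, 300, 400, 600],
})

# total rentals per season, then each season's share of the grand total
season_total = toy.groupby("Seasons")["Rented Bike Count"].sum()
season_share = (season_total / season_total.sum() * 100).round(2)
print(season_share.sort_values(ascending=False))
```

Running the same two lines on `df` gives the exact shares that the pie chart labels display.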







**Hourly Solar radiation Season wise**

#### Chart - 4

In [None]:
# Chart - 4 visualization code

In [None]:
plt.figure(figsize=(20,5))
#df.groupby('Hour').sum()['Solar Radiation (MJ/m2)'].plot(kind='bar', color='red')
sns.pointplot(x='Hour',y='Solar Radiation (MJ/m2)',hue='Seasons',data=df)
plt.title('Hourly Solar radiation Season wise')
plt.show()


**Solar radiation peaks at around 1 pm, and the same hourly pattern is seen in every season.**


##### 1. Why did you pick the specific chart?

I chose a point plot because it's great for showing how solar radiation changes throughout the day across different seasons. It helps easily spot peak times of solar activity and compare patterns between seasons.







##### 2. What is/are the insight(s) found from the chart?

The insights from the point plot showing hourly solar radiation across different seasons are:

1. Peak Hours: Identification of hours during the day when solar radiation is highest, which is crucial for optimizing solar energy capture and usage.

2. Seasonal Variations: Comparison of solar radiation patterns across seasons reveals how sunlight intensity differs throughout the year, aiding in understanding seasonal changes in solar energy availability.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, insights into peak solar radiation hours can optimize energy usage and reduce costs. However, lower radiation during critical periods may pose challenges for meeting energy demands, requiring strategic adjustments in energy management.



**Rented Bike count on functioning day**

In [None]:
df['Functioning Day'].value_counts()

#### Chart - 5

In [None]:
# Chart - 5 visualization code

In [None]:
sns.barplot(data=df,x='Functioning Day',y='Rented Bike Count')
plt.title('Rented Bike count on functioning day')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a bar plot because it's great for comparing how many bikes are rented on days when the system is functioning versus when it's not. It makes it easy to see the difference in rental counts between these two types of days.







##### 2. What is/are the insight(s) found from the chart?

1. Usage Impact: It reveals how bike rentals vary significantly between days when the system is operational versus non-operational, indicating the system's impact on user demand.

2. Operational Efficiency: Insights into rental patterns can highlight days with higher or lower demand, guiding operational decisions to optimize system uptime and resource allocation.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, insights into rental patterns can optimize operations, but inconsistent bike availability on non-functioning days may hinder growth due to missed revenue and customer dissatisfaction.







**Rented Bike count on Holiday-non Holiday**

#### Chart - 6

In [None]:
# Chart - 6 visualization code

In [None]:
sns.barplot(data=df,x='Holiday',y='Rented Bike Count')
plt.title('Rented Bike count on Holiday-non Holiday')
plt.show()

**Non-holiday days have a higher rented bike count, which may indicate that customers use bikes for commuting to work on working days more than for trips on holidays.**



##### 1. Why did you pick the specific chart?

I chose a bar plot (sns.barplot) because it's effective for comparing numerical values (bike rental counts) across different categories (holidays versus non-holidays). It visually represents the differences in rental counts between these two categorical groups in a straightforward and easy-to-understand manner.

##### 2. What is/are the insight(s) found from the chart?

Holiday Impact: It reveals whether bike rentals significantly differ on holidays compared to non-holidays, indicating how holidays influence customer behavior regarding bike usage.

Operational Adjustments: Insights into rental patterns can guide businesses in adjusting operational strategies such as staffing levels and bike availability to meet varying customer demand during holidays versus regular days.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes, insights into holiday versus non-holiday rental patterns can optimize operations and increase revenue. However, lower rental activity during holidays may pose challenges for maximizing profitability and customer satisfaction, potentially impacting business growth negatively.







**Hourly distribution of Rented bike count on Holiday & non holiday**

#### Chart - 7

In [None]:
# Chart - 7 visualization code

In [None]:
plt.figure(figsize=(14,8))
sns.pointplot(data=df,y='Rented Bike Count',x='Hour',hue='Holiday')
plt.title('Hourly distribution of Rented bike count on Holiday & non holiday')
plt.show()

**On non-holidays we can see peaks at 7-9 am and 5-8 pm (17:00-20:00), which mark the daily high-demand periods for rented bikes.**

##### 1. Why did you pick the specific chart?

I chose a point plot because it shows hourly variations in bike rentals on holidays versus non-holidays, making it easy to spot peak demand times and compare trends across different days.







##### 2. What is/are the insight(s) found from the chart?

Peak Demand Hours: Identification of specific hours during the day (such as 7-9 am and 5-8 pm) when bike rentals are highest, indicating peak demand periods.

Holiday Influence: Comparison of rental patterns between holidays and non-holidays reveals how holiday status affects bike usage throughout the day, highlighting potential differences in commuter or recreational behaviors.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, insights into peak demand hours help optimize operations and increase revenue. However, lower rentals during critical times or holidays may impact profitability negatively if not managed effectively.
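
The peak hours read off the point plot can also be computed directly. The sketch below uses a small made-up frame (values are assumptions for illustration) and finds the hour with the highest mean demand for each holiday group.

```python
import pandas as pd

toy = pd.DataFrame({
    "Hour": [8, 13, 18, 8, 13, 18],
    "Holiday": ["No Holiday"] * 3 + ["Holiday"] * 3,
    "Rented Bike Count": [1000, 600, 1500, 300, 900, 800],
})

# mean rentals per hour, one column per holiday group
hourly = toy.pivot_table(index="Hour", columns="Holiday",
                         values="Rented Bike Count", aggfunc="mean")

# hour with the highest mean demand in each group
peak_hour = hourly.idxmax()
print(hourly)
print(peak_hour)
```

On the real data, `hourly.nlargest(3, "No Holiday")` would list the three busiest working-day hours, which is one way to back the 7-9 am and 5-8 pm observation with numbers.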







**Rented Bike count in every season hourly distribution**

#### Chart - 8

In [None]:
# Chart - 8 visualization code

In [None]:
plt.figure(figsize=(14,8))
sns.pointplot(data=df,y='Rented Bike Count',x='Hour',hue='Seasons')
plt.title('Rented Bike count in every season hourly distribution')
plt.show()

**A similar hourly pattern is seen in every season, so bike availability needs can be planned on an hourly basis. Irrespective of season, peaks occur at 8 am and 6 pm.**


In [None]:
plt.figure(figsize=(14,8))
sns.lineplot(data=df,y='Rented Bike Count',x='month')

##### 1. Why did you pick the specific chart?

I chose the hourly point plot because it succinctly reveals peak bike rental hours at 8 AM and 6 PM across all seasons, aiding in understanding consistent demand patterns crucial for managing bike availability effectively and optimizing operational strategies.









##### 2. What is/are the insight(s) found from the chart?

1. Peak bike rental hours are consistently observed at 8 AM and 6 PM across all seasons.
2. Similar hourly patterns are evident regardless of the season, indicating stable demand trends throughout the year.
3. This consistency highlights the critical need for bike availability management during these peak hours, optimizing resource allocation and service efficiency.






##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact: Efficient resource allocation during peak hours enhances service availability and customer satisfaction, potentially increasing rental turnover.

Negative Impact: Overemphasis on peak hours might neglect resource optimization during off-peak times, leading to inefficiencies and higher operational costs.







**Regplot – Relationship between Rental Bike count & numerical  variables**

#### Chart - 9

In [None]:
# Chart - 9 visualization code

In [None]:
numrical_var=['Temperature(°C)', 'Humidity(%)',
       'Wind speed (m/s)', 'Visibility (10m)', 'Dew point temperature(°C)',
       'Solar Radiation (MJ/m2)', 'Rainfall(mm)', 'Snowfall (cm)']

plt.figure(figsize=(12,10))
n=1
for i in numrical_var:
  plt.subplot(4,2,n)
  n += 1
  sns.regplot(x=df[i],y=df['Rented Bike Count'],scatter_kws={"color": "orange"}, line_kws={"color": "red"},lowess=True)
# adjust the layout once, after all subplots are drawn
plt.tight_layout()
plt.show()

**These regression plots show that some features trend upward with the target variable and others trend downward. Because `lowess=True` fits a local smoother, the curves also reveal where a relationship is not linear.**



##### 1. Why did you pick the specific chart?

I selected the regression plots because they visually depict how each numerical feature relates to 'Rented Bike Count', clearly showing whether the relationship is positive or negative, which is crucial for understanding factors affecting bike rental demand.







##### 2. What is/are the insight(s) found from the chart?

Positive Relationships: Features like 'Temperature(°C)' and 'Solar Radiation (MJ/m2)' positively correlate with 'Rented Bike Count', indicating increased rentals during warmer weather and higher solar radiation levels.

Negative Relationships: Variables such as 'Humidity(%)' and 'Rainfall(mm)' exhibit negative correlations with bike rentals, implying reduced demand during higher humidity and rainfall conditions.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact: Utilizing insights on weather-related correlations can optimize marketing and operations, enhancing rental demand and operational efficiency.

Negative Impact: Overreliance on weather-dependent insights and failure to adapt to seasonal variability may pose risks, potentially leading to revenue fluctuations and operational challenges.








# **Multicollinearity Detection**

#### Chart - 10

In [None]:
# Chart - 10 visualization code

In [None]:
numeric_columns = df.select_dtypes(include=['int', 'float'])

In [None]:
plt.figure(figsize=(12,10))
sns.heatmap(numeric_columns.corr(),annot=True,cmap='PuOr')
plt.title('Multicollinearity Detection by Heatmap')
plt.show()

**Observation:**

 We can see that there is **strong correlation** between the **temperature** and **dew point temperature** features which may cause trouble during the prediction. We will find/detect this type of multicollinearity in a different way ahead.

In [None]:
# detecting multicollinearity by VIF

from statsmodels.stats.outliers_influence import variance_inflation_factor
attributes = df[['Temperature(°C)','Dew point temperature(°C)','Humidity(%)','Wind speed (m/s)','Visibility (10m)','Solar Radiation (MJ/m2)','Rainfall(mm)','Snowfall (cm)']]
VIF = pd.DataFrame()
VIF["feature"] = attributes.columns
#calculating VIF
VIF["Variance Inflation Factor"] = [variance_inflation_factor(attributes.values, i)
                          for i in range(len(attributes.columns))]

print(VIF)

In [None]:
# watching correlation between target variable and remaining independent variable
numeric_columns = df.select_dtypes(include=['int', 'float'])
numeric_columns.corr()['Rented Bike Count']

Temperature has a stronger correlation with the dependent variable, so let's drop dew point temperature from the list and check the VIF again.

In [None]:
# detecting multicollinearity by VIF

from statsmodels.stats.outliers_influence import variance_inflation_factor
attributes = df[['Temperature(°C)','Humidity(%)','Wind speed (m/s)','Visibility (10m)','Solar Radiation (MJ/m2)','Rainfall(mm)','Snowfall (cm)']]
VIF = pd.DataFrame()
VIF["feature"] = attributes.columns
#calculating VIF
VIF["Variance Inflation Factor"] = [variance_inflation_factor(attributes.values, i)
                          for i in range(len(attributes.columns))]

print(VIF)

The VIF values are now in a much more reasonable range, so dropping dew point temperature is the better choice.
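
For intuition, the VIF of a feature is `1 / (1 - R²)`, where `R²` comes from regressing that feature on all the others. The sketch below computes it by hand with NumPy on synthetic data; the variables `temp`, `dew`, and `wind` are simulated assumptions, not the dataset's columns. It includes an intercept in each regression, whereas statsmodels' `variance_inflation_factor` uses the columns exactly as passed, so values computed without an added constant column can differ from these.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
temp = rng.normal(10, 10, n)
dew = temp - 5 + rng.normal(0, 1, n)   # nearly a linear function of temp
wind = rng.normal(2, 1, n)             # unrelated to the other two
X = np.column_stack([temp, dew, wind])

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing it on the other columns plus an intercept."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - resid.var() / y.var()
    return 1 / (1 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]
print(np.round(vifs, 2))
```

The two collinear columns get large VIFs while the independent one stays close to 1, which mirrors why removing one of temperature and dew point temperature brings the remaining values down.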

In [None]:
df.drop(['Dew point temperature(°C)'],axis=1,inplace=True)

### Columns remaining after dropping dew point temperature

In [None]:
df.columns.to_list()

# **Feature Transformation**

#### Chart - 11

In [None]:
# Chart - 11 visualization code

In [None]:
# checking distribution of continuous variables
numrical_col=['Rented Bike Count','Temperature(°C)', 'Humidity(%)',
       'Wind speed (m/s)', 'Visibility (10m)',
       'Solar Radiation (MJ/m2)', 'Rainfall(mm)', 'Snowfall (cm)']
plt.figure(figsize=(18,10))
n=1
for i in numrical_col:
  plt.subplot(3,3,n)
  n=n+1
  # histplot with kde=True replaces the deprecated sns.distplot
  sns.histplot(df[i],kde=True)

In [None]:
# checking skewness of features
df[numrical_col].skew().sort_values(ascending=False)

In [None]:
# applying power transformation
from sklearn.preprocessing import PowerTransformer
sc_X=PowerTransformer(method = 'yeo-johnson')
df[numrical_col]=sc_X.fit_transform(df[numrical_col])

In [None]:
# Data distribution after applying Power Transformer
plt.figure(figsize=(18,10))
n=1
for i in numrical_col:
  plt.subplot(3,3,n)
  n=n+1
  sns.histplot(df[i],kde=True)  # replaces the deprecated sns.distplot

In [None]:
# skewness after power transformation
df[numrical_col].skew().sort_values(ascending=False)

##### 1. Why did you pick the specific chart?

I chose histograms with a kernel density curve to visualize the distribution of the continuous variables because they succinctly present key characteristics like central tendency, spread, and shape, which makes skewness easy to spot before and after the transformation.



##### 2. What is/are the insight(s) found from the chart?

1. Normalization: The data distributions are more symmetric and bell-shaped after transformation.
2. Reduced Skewness: Skewness values are significantly lower, indicating more balanced data.
3. Improved Suitability: The transformed data is now better suited for statistical analysis and predictive modeling.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

1. Positive Impact:
Improved model accuracy and reliable statistical analysis enhance decision-making.
2. Negative Impact:
Misinterpretation of transformed data and insufficient documentation could lead to errors.
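
The cells above fit `PowerTransformer` on the whole frame, target included, before the train/test split. A common alternative, sketched here on synthetic skewed data (the arrays and names are assumptions, not the notebook's variables), is to estimate the transformation on the training rows only and reuse it on the test rows, so that no test information reaches the fitted parameters.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(42)
X = rng.exponential(scale=2.0, size=(1000, 2))   # synthetic right-skewed features
X_tr, X_te = train_test_split(X, test_size=0.2, random_state=42)

pt = PowerTransformer(method="yeo-johnson")
X_tr_t = pt.fit_transform(X_tr)   # lambdas estimated from training rows only
X_te_t = pt.transform(X_te)       # the same lambdas reused on the test rows

print("lambdas:", pt.lambdas_)
print("train mean after transform:", X_tr_t.mean(axis=0).round(3))
```

`PowerTransformer` also standardizes its output by default, so any column it transforms, including a target, ends up on a zero-mean, unit-variance scale. Error metrics computed on a transformed target are therefore in transformed units rather than bike counts.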






# **Encoding**

 ***Technique of converting categorical variables into numerical values so that it could be easily fitted to a machine learning model***
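
A minimal sketch of the two encodings used in this section, on a four-row toy frame (the rows are illustrative assumptions): a two-valued column is mapped to 0/1, and a multi-valued column is one-hot encoded with the first level dropped as the baseline.

```python
import pandas as pd

toy = pd.DataFrame({
    "Seasons": ["Winter", "Spring", "Summer", "Autumn"],
    "Holiday": ["No Holiday", "Holiday", "No Holiday", "No Holiday"],
})

# binary column: map the two labels to 0/1
toy["Holiday"] = toy["Holiday"].map({"No Holiday": 0, "Holiday": 1})

# multi-class column: one-hot encode, dropping the first level as the baseline
season_dummies = pd.get_dummies(toy["Seasons"], prefix="Seasons", drop_first=True)
encoded = toy.drop(columns="Seasons").join(season_dummies)
print(encoded)
```

With `drop_first=True` the alphabetically first level (here `Autumn`) has no column of its own and is represented by all remaining dummies being zero, which avoids perfectly collinear columns in linear models.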

In [None]:
# lets have look at dataset to know which columns need to be encoded
df.head().T

columns to encode

1. Seasons
2. Holiday
3. Functioning Day
4. day_name
5. year

Binary Encoding

In [None]:
df.replace({'Holiday': { 'No Holiday': 0,'Holiday': 1 },'Functioning Day': { 'Yes': 0,'No': 1},'year':{2017:0,2018:1}},inplace=True)

In [None]:
df1=df.copy()
df1.head()

In [None]:
# shape of data after binary encoding
df.shape

In [None]:
#df['Hour'].value_counts()

In [None]:
dummy_col=pd.get_dummies(df[['Seasons','day_name','Hour']],drop_first=True)

In [None]:
# dummy columns in data
dummy_col.columns

In [None]:
# HAVE A LOOK AT ENCODED DATA
df.head().T

In [None]:
df.shape

In [None]:
df.columns

In [None]:
len(df.columns)

In [None]:
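# dropping the date-derived columns and the original columns that were one-hot encoded ('Hour' is kept alongside its dummy columns)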
df.drop(['year','month','Date','Seasons','day','day_name'] , axis=1 , inplace=True)

In [None]:
# joining dummy features to dataframe df
df=df.join(dummy_col)

In [None]:
# HAVE A LOOK AT ENCODED DATA
df.head().T

In [None]:
df.columns

In [None]:
from sklearn.metrics import mean_absolute_error,mean_squared_error,r2_score

In [None]:
# X = independent variables , y = dependent variable

X=df.drop(columns=['Rented Bike Count'])
y=df['Rented Bike Count']

In [None]:
# train_test_split to divide data into training & testing
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X,y,test_size=0.2,random_state=42)

In [None]:
# checking shape of training data & testing data
X_train.shape , X_test.shape

In [None]:
X_train.head()

In [None]:
X_test.head()

In [None]:
numrical_col

In [None]:
# categorical (encoded) feature columns: the two binary flags plus the one-hot dummies created above
categorical_col = ['Holiday', 'Functioning Day'] + dummy_col.columns.to_list()

categorical_col

# **Scaling**



In [None]:
# Transform numerical features by scaling each feature to the range [0, 1].
scaler = MinMaxScaler()
scaling_cols = ['Temperature(°C)','Humidity(%)','Wind speed (m/s)','Visibility (10m)','Solar Radiation (MJ/m2)','Rainfall(mm)','Snowfall (cm)']
X_train[scaling_cols]=scaler.fit_transform(X_train[scaling_cols])
X_test[scaling_cols]=scaler.transform(X_test[scaling_cols])
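
To make the fit/transform split above concrete, here is a minimal sketch on made-up numbers (the values and the names `train_temps`, `test_temps`, and `toy_scaler` are hypothetical, not taken from the Seoul data). It shows that the minimum and maximum are learned from the training rows only, so a test value outside the training range can land outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy stand-ins for one numeric column (hypothetical values)
train_temps = np.array([[-5.0], [10.0], [25.0]])
test_temps = np.array([[0.0], [30.0]])

toy_scaler = MinMaxScaler()
train_scaled = toy_scaler.fit_transform(train_temps)  # learns min=-5 and max=25 from training rows
test_scaled = toy_scaler.transform(test_temps)        # reuses the training min and max

print(train_scaled.ravel())
print(test_scaled.ravel())  # 30 is above the training max, so its scaled value exceeds 1
```

Fitting on the training split alone keeps information about the test rows out of preprocessing, which is why `fit_transform` is called on `X_train` and only `transform` on `X_test`.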

In [None]:
# Shape of Testing data
X_test.shape

In [None]:
X_train.head()

# ML Model Implementation

### ML Model - 1

In [None]:
# Define a function that fits a model, prints evaluation metrics and the cross-validation score, and plots predictions

def fit_evaluate (model):
  model.fit(X_train,Y_train)
  y_pred=model.predict(X_test)

  MSE  = mean_squared_error(Y_test, y_pred)
  print("MSE:" ,round(MSE,2))
  MAE=mean_absolute_error(Y_test, y_pred)
  print("MAE :" ,round(MAE,2))

  RMSE = np.sqrt(MSE)
  print("RMSE :" ,round(RMSE,2))

  r2 = r2_score(Y_test, y_pred)
  print("R2 :" ,round(r2,2))
  Adjusted_R2 = 1-(1-r2_score(Y_test, y_pred))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
  print("Adjusted R2 : ",round(Adjusted_R2,2))

  # model.score returns R2 for regressors, so these are R2 values on the training and testing data
  print('                          ')
  print("-------Model R2 score-------")
  print(f"Training R2: {round(model.score(X_train,Y_train)*100)}%")
  print(f"Testing R2: {round(model.score(X_test,Y_test)*100)}%")
  print('                          ')
  print("-------cross_val_score-------")
  # 5-fold cross-validation on the training data; the default scoring for regressors is also R2
  accuracies = cross_val_score(estimator = model, X = X_train, y = Y_train, cv = 5)
  print("Cross Val R2: {:.2f} %".format(accuracies.mean()*100))

  # Plotting actual vs predicted for the first 100 test samples
  plt.figure(figsize=(20,10))
  plt.plot((y_pred)[:100])
  plt.plot((np.array(Y_test)[:100]))
  plt.legend(["Predicted","Actual"])
  plt.title(f'Difference in predicted & actual for {model}')
  plt.show()
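
The adjusted R2 expression inside `fit_evaluate` can be easier to read as a small standalone helper. The sketch below is illustrative only: `adjusted_r2` is a hypothetical name and the six target values are made up. It shows the same formula, 1 - (1 - R2) * (n - 1) / (n - p - 1), where n is the number of samples and p the number of predictors.

```python
from sklearn.metrics import r2_score

def adjusted_r2(y_true, y_pred, n_features):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    n = len(y_true)
    r2 = r2_score(y_true, y_pred)
    return 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

# Hypothetical targets and predictions, assuming a model with 2 predictors
y_true = [3.0, 5.0, 7.0, 9.0, 11.0, 13.0]
y_pred = [2.5, 5.5, 7.0, 9.5, 10.5, 13.0]

plain = r2_score(y_true, y_pred)
adjusted = adjusted_r2(y_true, y_pred, n_features=2)
print(round(plain, 4), round(adjusted, 4))
```

Because the correction factor (n - 1) / (n - p - 1) is greater than 1 whenever p > 0, the adjusted value is never higher than plain R2, and the gap widens as more predictors are added.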

In [None]:
# Shape of Training data
X_train.shape

### ML Model - 2

**LinearRegression**

In [None]:

lr= LinearRegression()

In [None]:
fit_evaluate(lr)

In [None]:
X_train.head()

In [None]:
# Applying Polynomial Linear Regression
# degree 2
poly = PolynomialFeatures(degree=2,include_bias=True)
X_train_trans = poly.fit_transform(X_train)
X_test_trans = poly.transform(X_test)

In [None]:
lr = LinearRegression()
lr.fit(X_train_trans,Y_train)
y_pred1 = lr.predict(X_test_trans)

In [None]:
training_score=lr.score(X_train_trans,Y_train)*100
testing_score=lr.score(X_test_trans,Y_test)*100
print(f"Training score: {training_score}")
print(f"testing score: {testing_score}")

In [None]:
MSE  = mean_squared_error(Y_test, y_pred1)
print("MSE :" , MSE)

MAE=mean_absolute_error(Y_test, y_pred1)
print("MAE :" ,MAE)

RMSE = np.sqrt(MSE)
print("RMSE :" ,RMSE)

r2 = r2_score(Y_test, y_pred1)
print("R2 :" ,r2)
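# The polynomial model uses the expanded feature set, so count its predictors
# (the constant bias column is excluded) rather than the original columns
n_poly_features = X_test_trans.shape[1]-1
Adjusted_R2 = 1-(1-r2)*((X_test.shape[0]-1)/(X_test.shape[0]-n_poly_features-1))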
print("Adjusted R2 : ",Adjusted_R2)

# Plotting actual vs predicted for the first 100 test samples
plt.figure(figsize=(20,10))
plt.plot((y_pred1)[:100])
plt.plot((np.array(Y_test)[:100]))
plt.legend(["Predicted","Actual"])
plt.title(f'Difference in predicted & actual for polynomial Regression')
plt.show()

In [None]:
poly_score = {'r2':r2,'Adjusted_R2':Adjusted_R2,'MSE':MSE,'RMSE':RMSE,'MAE':MAE,'Training_score':training_score,'testing_score':testing_score}

### ML Model - 3

**Ridge**

In [None]:
R = Ridge(alpha=9)
fit_evaluate(R)

**Decision Tree**

In [None]:
regressor=DecisionTreeRegressor(max_depth=18)
fit_evaluate(regressor)

**BaggingRegressor**

In [None]:
bag_regressor= BaggingRegressor(random_state=22)
fit_evaluate(bag_regressor)

**Random Forest**

In [None]:
random_forest=RandomForestRegressor(n_estimators=10,random_state=0)
fit_evaluate(random_forest)

**Randomized Search Cv**

In [None]:
from sklearn.model_selection import RandomizedSearchCV
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
# (1.0 means all features; the old 'auto' option was removed in scikit-learn 1.3)
max_features = [1.0, 'sqrt','log2']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 1000,10)]
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10,14]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4,6,8]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               # 'gini' is a classification criterion and is not valid for RandomForestRegressor
               'criterion':['friedman_mse', 'squared_error']}
print(random_grid)

In [None]:
random_forest_best=RandomForestRegressor()

In [None]:
model_randomcv=RandomizedSearchCV(estimator=random_forest_best,param_distributions=random_grid,n_iter=10,cv=3,verbose=2,
                               random_state=100,n_jobs=-1)
# Fit the randomized search on the training data
model_randomcv.fit(X_train,Y_train)

In [None]:
model_randomcv.best_params_
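
Instead of copying the reported parameters by hand, the fitted search object also exposes the refit model directly. The sketch below is a minimal, self-contained illustration on synthetic data (the dataset, the small `toy_grid`, and the names `toy_search` and `tuned_forest` are assumptions for the example, not the notebook's actual search).

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic regression data standing in for the bike-sharing features
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Deliberately tiny grid so the example runs quickly
toy_grid = {
    "n_estimators": [10, 20],
    "max_depth": [3, 5, None],
    "max_features": ["sqrt", 1.0],
}

toy_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=toy_grid,
    n_iter=4,
    cv=3,
    random_state=100,
)
toy_search.fit(Xtr, ytr)

# With the default refit=True, best_estimator_ is already retrained on all of Xtr
tuned_forest = toy_search.best_estimator_
print(toy_search.best_params_)
print(round(tuned_forest.score(Xte, yte), 2))
```

In this notebook the equivalent would be `model_randomcv.best_estimator_`, which avoids transcription mistakes when the best parameters change between runs.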

In [None]:
random_forest=RandomForestRegressor(n_estimators=600,min_samples_split=2,min_samples_leaf=1,max_features='sqrt',max_depth=120,criterion='squared_error',random_state=0)
fit_evaluate(random_forest)

**Adaboost**

In [None]:
# weak learner --> a model that is only slightly better than a naive baseline
# decision stump --> smallest decision tree, depth=1
# AdaBoost --> combines many weak learners sequentially into one strong learner
# default weak learner of AdaBoostRegressor --> DecisionTreeRegressor(max_depth=3), not a stump

In [None]:
ada_regressor= AdaBoostRegressor(random_state=22)
fit_evaluate(ada_regressor)
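
To use true decision stumps as the weak learner, the base tree can be passed in explicitly. The sketch below runs on synthetic data and uses hypothetical names (`stump_boost`, `default_boost`). The base estimator is passed positionally because its keyword name differs between scikit-learn versions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the bike-sharing features
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Boosting over depth-1 trees (decision stumps)
stump_boost = AdaBoostRegressor(DecisionTreeRegressor(max_depth=1), n_estimators=50, random_state=22)
stump_boost.fit(X_toy, y_toy)

# Boosting with the library default weak learner
default_boost = AdaBoostRegressor(n_estimators=50, random_state=22)
default_boost.fit(X_toy, y_toy)

print(stump_boost.estimators_[0].get_depth())   # depth of the first fitted stump
print(default_boost.estimators_[0].max_depth)   # depth limit of the default weak learner
```

Whether stumps or deeper trees work better here is an empirical question; comparing both with `fit_evaluate` would be the natural next step.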

**Gradientboost**

In [None]:
gb= GradientBoostingRegressor(random_state=22)
fit_evaluate(gb)

In [None]:
# defining function to save accuracy metrics for model evaluation summary
evaluation_summary=[]
def save_score (model):
  model.fit(X_train,Y_train)
  y_pred=model.predict(X_test)

  MSE  = mean_squared_error(Y_test, y_pred)
  MAE=mean_absolute_error(Y_test, y_pred)
  RMSE = np.sqrt(MSE)
  r2 = r2_score(Y_test, y_pred)
  Adjusted_R2 = 1-(1-r2_score(Y_test, y_pred))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
  training_score=round(model.score(X_train,Y_train)*100,2)
  testing_score=round(model.score(X_test,Y_test)*100,2)
  # Store the rounded metrics for this model (cross-validation is already reported by fit_evaluate)
  scores={'r2':round(r2,2),'Adjusted_R2':round(Adjusted_R2,2),'MSE':round(MSE,2),'RMSE':round(RMSE,2),'MAE':round(MAE,2),'Training_score':round(training_score,2),'testing_score':round(testing_score,2)}
  evaluation_summary.append(scores)




In [None]:
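# Note: random_forest holds the tuned parameters; random_forest_best is a RandomForestRegressor with default settings
algo=[lr,R,regressor,bag_regressor,random_forest,random_forest_best,ada_regressor,gb]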
for i in algo:
  save_score(i)
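
One way to keep each row of the summary tied to its model is to store the models in a dictionary keyed by name, rather than relying on list order. The sketch below is a simplified variant on synthetic data (the names `named_models`, `rows`, and `summary` are hypothetical, and only two metrics are collected for brevity).

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the bike-sharing features
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Each model is stored under the label that should appear in the summary table
named_models = {"lr": LinearRegression(), "ridge": Ridge(alpha=9)}

rows = {}
for name, est in named_models.items():
    est.fit(Xtr, ytr)
    pred = est.predict(Xte)
    rows[name] = {"r2": r2_score(yte, pred), "MAE": mean_absolute_error(yte, pred)}

# One row per model, best R2 first
summary = pd.DataFrame.from_dict(rows, orient="index").sort_values("r2", ascending=False)
print(summary.round(2))
```

With this layout the index labels cannot drift out of sync with the model list when a model is added or removed.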

In [None]:
for idx, summary in enumerate(evaluation_summary, 1):
    print(f"Model {idx}:")
    for metric, value in summary.items():
        print(f"{metric}: {value}")
    print()

In [None]:
poly_score={'r2':round(r2,2),'Adjusted_R2':round(Adjusted_R2,2),'MSE':round(MSE,2),'RMSE':round(RMSE,2),'MAE':round(MAE,2),'Training_score':round(training_score,2),'testing_score':round(testing_score,2)}

In [None]:
df=pd.DataFrame(evaluation_summary,index=['lr','R','decision_tree','bag_regressor','random_forest','random_forest_best','ada_regressor','gb']).rename_axis('model', axis=1).sort_values(by='r2',ascending=False)

In [None]:
df

In [None]:
poly_df=pd.DataFrame(poly_score,index=['poly'])

In [None]:
new_df=pd.concat([df,poly_df]).sort_values(by='r2',ascending=False)
new_df

# **Model summary**

In [None]:
new_df.round(2).style.background_gradient(cmap='Pastel2')

Alternatively, print the same table with `tabulate`:

In [None]:
from tabulate import tabulate
import pandas as pd

In [None]:
print(tabulate(new_df, headers = 'keys', tablefmt = 'fancy_grid'))


# **Conclusion**

1. Best Model: Random Forest with hyperparameters tuned using RandomizedSearchCV gave the most accurate predictions, with an R² score of 0.91.

2. Peak Times: Bike rentals peak at 8 a.m. and 6 p.m., most likely because people are commuting to and from work.

3. Seasonal Trends: More bikes are rented in summer and fewer in winter, which makes sense because fewer people bike in colder weather.

4. Weekday vs Weekend: Bike rentals are more popular on weekdays, probably because people use them for commuting to work.

5. Growth in Demand: We saw a big increase in bike rentals during 2018, suggesting that more people were using bike-sharing services.

6. Project Completion: The Machine Learning Capstone Project was completed successfully.


