<a href="https://colab.research.google.com/github/naman39910/Exploratory-Data-Analysis/blob/main/Seoul_Bike_Sharing_ML_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -



##### **Project Type**    - EDA/Regression
##### **Contribution**    - Individual
##### **Team Member 1 -**
##### **Team Member 2 -**
##### **Team Member 3 -**
##### **Team Member 4 -**

# **Project Summary -**

This project focuses on analyzing the Seoul Bike Sharing Dataset to gain a comprehensive understanding of the factors that influence the number of rented bikes. The dataset, sourced from "/content/SeoulBikeData.csv", contains detailed hourly information on bike rental counts in Seoul, alongside a variety of related features. These features include crucial environmental factors such as temperature, humidity, wind speed, visibility, rainfall, and snowfall, as well as temporal information like the date, hour, and seasonal indicators. Additionally, the dataset includes solar radiation, which could play a significant role in bike usage patterns.

The primary objective of this project is to conduct an in-depth exploratory data analysis (EDA) to uncover underlying patterns, trends, and relationships within the data. This involves visualizing the distribution of bike rental counts, examining how different environmental and temporal factors correlate with rental demand, and identifying any anomalies or interesting observations. For instance, we will investigate how bike rentals vary across different hours of the day, days of the week, and seasons of the year. We will also explore the impact of weather conditions, such as temperature, rainfall, and solar radiation, on the number of bikes rented.

Beyond the exploratory analysis, the project aims to leverage the insights gained to potentially build a regression model. This model will be designed to predict hourly bike rental demand based on the available features. The development of such a predictive model involves several steps, including feature engineering, where we might create new features from existing ones (e.g., combining date and hour to represent specific time periods), handling missing values if any exist (although the initial data inspection suggests none), addressing potential outliers, and encoding categorical variables appropriately.

Before model building, we will also consider data transformation and scaling techniques to ensure the data is in a suitable format for the chosen regression algorithm. The dataset will then be split into training and testing sets to train and evaluate the model's performance. Various regression algorithms will be considered, and their performance will be evaluated using appropriate metrics such as Mean Absolute Error (MAE), Mean Squared Error (MSE), and R-squared. Cross-validation and hyperparameter tuning techniques will be employed to optimize the chosen model and improve its predictive accuracy.

The insights derived from both the EDA and the regression model will be crucial for providing actionable recommendations to optimize bike allocation and management in Seoul. For example, understanding the peak hours and seasons for bike rentals can help the operating company ensure sufficient bike availability during these times. Similarly, knowing how weather conditions affect demand can inform strategies for adjusting bike distribution or offering incentives during unfavorable weather. The project aims to deliver a comprehensive analysis that not only explains the factors influencing bike rentals but also provides practical insights for improving the efficiency and effectiveness of the Seoul Bike Sharing program, ultimately contributing to better urban mobility and sustainability. The project adheres to general guidelines for structured code, proper documentation, and the inclusion of meaningful visualizations and model explanations to ensure clarity and reproducibility.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Rental bikes have now been introduced in many cities. The business problem is to ensure a stable supply of rental bikes in these cities by predicting the demand for bikes at each hour. By providing a stable supply of rental bikes, the system can enhance mobility comfort for the public and reduce waiting time, leading to greater customer satisfaction.

To address this problem, I need to develop a predictive model that takes into account various factors that may influence demand, such as time of day, seasonality, weather conditions, and holidays. By accurately predicting demand, the bike sharing system operators can ensure that there is an adequate supply of bikes available at all times, which can improve the user experience and increase usage of the bike sharing system. This can have a positive impact on the sustainability of urban transportation, as it can reduce congestion, air pollution, and greenhouse gas emissions.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that, for each chart, the following questions are answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure that, for each algorithm, the following questions are answered.


*   Explain the ML model used and its performance using an Evaluation Metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with an updated Evaluation Metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:

# Import Libraries
import pandas as pd
import numpy as np
from datetime import datetime as dt


# Import Visualization Libraries
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Import warnings
import warnings
warnings.filterwarnings('ignore')

# Import preprocessing libraries
from sklearn.preprocessing import MinMaxScaler,StandardScaler

# Import model selection libraries
from sklearn.model_selection import train_test_split,GridSearchCV,RandomizedSearchCV

# Import variance inflation factor (used to check multicollinearity)
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Import Model
from sklearn.linear_model import LinearRegression,Lasso,Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor
from xgboost import XGBRegressor
import xgboost as xgb

# Import evaluation metric libraries
from sklearn.metrics import mean_squared_error,r2_score,mean_absolute_error

# Import tree for visualization
from sklearn.tree import export_graphviz
from sklearn import tree
from IPython.display import SVG,display
from graphviz import Source


### Dataset Loading

In [None]:
# Load Dataset
dataset_url= "/content/SeoulBikeData.csv"
dataset = pd.read_csv(dataset_url, encoding='ISO-8859-1')

### Dataset First View

In [None]:
# Dataset First Look
dataset.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
dataset.shape

### Dataset Information

In [None]:
# Dataset Info
dataset.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
dataset.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
dataset.isnull().sum()

In [None]:
# Visualizing the missing values
sns.heatmap(dataset.isnull(), cbar=False)

### What did you know about your dataset?

From the checks above, we know the following about the dataset:

1. Number of Entries: The dataset consists of 8760 entries, ranging
from index 0 to 8759.

2. Columns & Rows: There are 14 columns and 8760 rows in the dataset.

3. Data Types:
Most of the columns (10 out of 14) are of the int64 or float64 data type.
Only the Date, Seasons, Holiday and Functioning Day columns are of the object data type.

4. Missing Values: There do not appear to be any missing values in the dataset, as each column has 8760 non-null entries.
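
The individual checks above can also be gathered into a single summary. The following is a minimal sketch, not part of the original notebook: `summarize_frame` is a hypothetical helper, and the four-column `sample` frame is made-up data standing in for the real CSV.

```python
import pandas as pd

def summarize_frame(df):
    """Collect row/column counts, column types, missing values and duplicates."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "numeric_columns": df.select_dtypes(include="number").columns.tolist(),
        "non_numeric_columns": df.select_dtypes(exclude="number").columns.tolist(),
        "missing_values": int(df.isnull().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    }

# Tiny hypothetical sample standing in for the real dataset
sample = pd.DataFrame({
    "Date": ["01/12/2017", "01/12/2017", "02/12/2017"],
    "Rented Bike Count": [254, 204, 173],
    "Hour": [0, 1, 0],
    "Seasons": ["Winter", "Winter", "Winter"],
})
summary = summarize_frame(sample)
print(summary)
```

Calling `summarize_frame(dataset)` on the loaded data would give the same facts in one place, which makes it easier to keep the written summary in step with the data.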

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
dataset.columns

In [None]:
# Dataset Describe
dataset.describe()


### Variables Description

The dataset contains weather information (Temperature, Humidity, Windspeed, Visibility, Dewpoint, Solar radiation, Snowfall, Rainfall), the number of bikes rented per hour and date information.

Attribute Information:

Date : The date of the observation.

Rented Bike Count : The number of bikes rented during the observation period.

Hour : The hour of the day when the observation was taken.

Temperature(°C) : The temperature in Celsius at the time of observation.

Humidity(%) : The percentage of humidity at the time of observation.

Wind speed (m/s) : The wind speed in meters per second at the time of observation.

Visibility (10m) : The visibility in units of 10 meters at the time of observation.

Dew point temperature(°C) : The dew point temperature in Celsius at the time of observation.

Solar Radiation (MJ/m2) : The amount of solar radiation in mega-joules per square meter at the time of observation.

Rainfall(mm) : The amount of rainfall in millimeters during the observation period.

Snowfall (cm) : The amount of snowfall in centimeters during the observation period.

Seasons : The season of the year when the observation was taken.

Holiday : Whether the observation was taken on a holiday or not.

Functioning Day : Whether the bike sharing system was operating normally or not during the observation period.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
dataset.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# converting date variable into datetime datatype
dataset['Date'] = dataset['Date'].apply(lambda x: dt.strptime(x, '%d/%m/%Y'))

In [None]:
# Creating new columns for day and month
dataset['Day of week'] = dataset['Date'].apply(lambda x : x.isoweekday())
dataset['Month'] = dataset['Date'].apply(lambda x : x.month)

In [None]:
# engineering new feature 'weekend' from day_of_week
dataset['weekend'] = dataset['Day of week'].apply(lambda x: 1 if x>5 else 0)

In [None]:
dataset.head()

In [None]:
# defining continuous independent variables separately
cont_var = ['Temperature(°C)', 'Humidity(%)', 'Wind speed (m/s)', 'Visibility (10m)', 'Dew point temperature(°C)', 'Solar Radiation (MJ/m2)', 'Rainfall(mm)', 'Snowfall (cm)']

In [None]:
# defining dependent variable
dependent_variable = ['Rented Bike Count']

In [None]:
# defining categorical independent variables separately
cat_var = ['Hour', 'Seasons', 'Holiday', 'Functioning Day','Month', 'Day of week', 'weekend']

### What all manipulations have you done and insights you found?

From the Date column, 'month' and 'day of the week' columns are created.

From the 'day of the week' column, 'weekend' column is created where 6 and 7 are the weekends (Saturday and Sunday).

We have also defined the continuous variables, dependent variable and categorical variables for ease of plotting graphs.
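
The same features can also be derived without row-wise `apply` calls. This is a minimal sketch on a hypothetical three-row `sample` frame, assuming the same `%d/%m/%Y` date format as in the wrangling code; it is an alternative, not what the notebook itself runs.

```python
import pandas as pd

# Hypothetical three-row sample; the notebook works on the full `dataset`
sample = pd.DataFrame({"Date": ["01/12/2017", "02/12/2017", "03/12/2017"]})

# Parse day/month/year strings in one vectorized call
sample["Date"] = pd.to_datetime(sample["Date"], format="%d/%m/%Y")

# dt.dayofweek runs from 0 (Monday) to 6 (Sunday); add 1 to match isoweekday()
sample["Day of week"] = sample["Date"].dt.dayofweek + 1
sample["Month"] = sample["Date"].dt.month

# Saturday (6) and Sunday (7) are flagged as weekend
sample["weekend"] = (sample["Day of week"] > 5).astype(int)
print(sample)
```

The `.dt` accessor yields the same values as the lambda-based version, and it tends to be faster on larger frames.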

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 : Dependent variable Distribution

In [None]:
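# Chart - 1 visualization code for distribution of target variable
# displot is figure-level and creates its own figure, so the size is set via height/aspect
sns.displot(data=dataset, x='Rented Bike Count', kde=True, height=6, aspect=8/6)
plt.title('Distribution of Rented Bike Count')
plt.show()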

##### 1. Why did you pick the specific chart?


A displot (distribution plot) combines a histogram with an optional kernel density estimate (KDE). It is useful because it provides a quick and easy way to check the distribution of the data, identify patterns or outliers, and compare the distribution of multiple variables. It also allows us to check whether the data follows a normal distribution.

Thus, I used this plot to analyse the distribution of the target variable over the whole dataset and to check whether it is symmetric.

##### 2. What is/are the insight(s) found from the chart?


From the distribution plot of the dependent variable (rented bike count) above, we can clearly see that the distribution is positively skewed (right-skewed).

This means that the distribution is not symmetric around the mean.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes, definitely. From this insight we learn that the target variable is not normally distributed. So, before fitting any model on this data, we need to transform it to bring the distribution closer to normal.
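
One way to check whether a transform actually reduces the skew noted above is to compare skewness before and after. The sketch below uses synthetic right-skewed counts, not the real `Rented Bike Count` column, and the square-root and `log1p` transforms are illustrative assumptions rather than the notebook's confirmed choice.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed counts; a stand-in for 'Rented Bike Count'
rng = np.random.default_rng(0)
counts = pd.Series(rng.gamma(shape=2.0, scale=350.0, size=5000)).round()

# Sample skewness: 0 for a symmetric distribution, positive for a long right tail
skew_raw = counts.skew()
skew_sqrt = np.sqrt(counts).skew()
skew_log = np.log1p(counts).skew()

print(f"raw: {skew_raw:.2f}, sqrt: {skew_sqrt:.2f}, log1p: {skew_log:.2f}")
```

Whichever transform brings the skewness closest to zero on the real column is a reasonable candidate. A log transform can over-correct and leave the data left-skewed, so it is worth comparing rather than assuming.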

#### Chart - 2 : Distribution/Box plot

In [None]:
# Histogram (with KDE) and box plot for each numerical column to see its distribution
for i in dataset.select_dtypes(include=np.number).columns:
    fig,axes = plt.subplots(nrows=1,ncols=2,figsize=(13,4))
    sns.histplot(dataset[i], ax = axes[0],kde=True)
    # passing the column as x draws the box plot horizontally
    sns.boxplot(x=dataset[i], ax = axes[1], showmeans=True,color='pink')
    fig.suptitle("Distribution plot of "+ i, fontsize = 12)
    plt.show()


##### 1. Why did you pick the specific chart?

A histplot is a type of chart that displays the distribution of a dataset. It is a graphical representation of the data that shows how often each value or group of values occurs. Histplots are useful for understanding the distribution of a dataset and identifying patterns or trends in the data. It is also useful when dealing with large data sets (greater than 100 observations). It can help detect any unusual observations (outliers) or any gaps in the data.

Thus, we used the histogram plot to analyse the variable distributions over the whole dataset and to check whether they are symmetric.

A boxplot is used to summarize the key statistical characteristics of a dataset, including the median, quartiles, and range, in a single plot. Boxplots are useful for identifying the presence of outliers in a dataset, comparing the distribution of multiple datasets, and understanding the dispersion of the data. They are often used in statistical analysis and data visualization.

Thus, for each numerical variable in the given dataset, we used a box plot to analyse the outliers and the interquartile range, along with the mean, median, maximum and minimum values.

##### 2. What is/are the insight(s) found from the chart?

From the univariate analysis of all continuous feature variables above, we learn that only the temperature and humidity columns look approximately normally distributed; the others show different distributions.

We can also see that there are outlier values in the snowfall, rainfall, wind speed & solar radiation columns.
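
The points drawn beyond the whiskers of a box plot follow the 1.5 × IQR rule, which can also be applied directly to count outliers. This is a minimal sketch: `iqr_outlier_mask` is a hypothetical helper, and the wind-speed-like values are made up for illustration.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR], the usual box plot whisker rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Hypothetical wind-speed-like values with one unusually high reading
wind = pd.Series([1.0, 1.2, 1.5, 1.7, 2.0, 2.2, 2.5, 2.8, 7.4])
mask = iqr_outlier_mask(wind)
print(wind[mask])
```

Applying the same function to each column in `cont_var` would give an outlier count per variable to set beside the box plots.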

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Histogram and Box plot cannot give us whole information regarding data. It's done just to see the distribution of the column data over the dataset.

#### Chart - 3 : Dependent variable with continuous variables (Bivariate)

In [None]:
# Analyzing the relationship between the dependent variable and the continuous variables
for i in cont_var:
    plt.figure(figsize=(8,6))
    # regplot is axes-level: it draws on the figure created above and adds a fitted regression line
    sns.regplot(x=i,y= dependent_variable[0],data=dataset,scatter_kws={'alpha':0.3},line_kws={'color':'red'})
    plt.title(f'Relationship between {dependent_variable[0]} and {i}')
    plt.show()

##### 1. Why did you pick the specific chart?

Regplot is used to create a scatter plot with linear regression line. The purpose of this function is to visualize the relationship between two continuous variables. It can help to identify patterns and trends in the data, and can also be used to test for linearity and independence of the variables.

To check the patterns between independent variable with our rented bike dependent variable we used this regplot.

##### 2. What is/are the insight(s) found from the chart?

From the regression plots above, we can see some linear relationship between the dependent variable (rented bike count) and temperature, solar radiation & dew point temperature.

The other variables do not show any clear pattern.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes, it helped a little. We learned that a few variables show some pattern with the dependent variable. These variables may be important features when predicting the rented bike count, so the business needs to focus on them.

#### Chart - 4 : Categorical variables with dependent variable(Bivariate)

In [None]:
# Analyzing the relationship between the dependent variable and the categorical variables
for i in cat_var:
    plt.figure(figsize=(8,6))
    sns.barplot(x=i,y= dependent_variable[0],data=dataset)
    plt.title(f'Relationship between {dependent_variable[0]} and {i}')
    plt.show()

##### 1. Why did you pick the specific chart?

Bar charts are used to compare the size or frequency of different categories or groups of data. Bar charts are useful for comparing data across different categories, and they can be used to display a large amount of data in a small space.

To show the distribution of the rented bike count with other categorical variables we used bar charts.

##### 2. What is/are the insight(s) found from the chart?

From above bar charts we got insights:

1. In the hour vs. rented bike chart, demand is highest at 8:00 in the morning and 18:00 in the evening.

2. From the season vs. rented bike chart, demand is highest in summer and lowest in winter.
From the day_of_week vs. rented bike chart, demand is higher on working days.

3. From month chart we know that there is high demand in mont

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes, these insights will have a positive business impact. By analysing demand across the categorical variables, we learn when bike demand is highest, so the business can focus on those periods.

#### Chart - 5 : Rented Bike vs Hour

In [None]:
# plotting line graph
avg_rent_per_hour = dataset.groupby('Hour')['Rented Bike Count'].mean()

# plot average rent over time(hrs)
plt.figure(figsize=(10,5))
sns.lineplot(x=avg_rent_per_hour.index, y=avg_rent_per_hour.values)
plt.xlabel('Hour')
plt.ylabel('Rented Bike Count')
plt.title('Average Rented Bike Count Over Time')
plt.show()

##### 1. Why did you pick the specific chart?

A line plot, also known as a line chart or line graph, is a way to visualize the trend of a single variable over time. It uses a series of data points connected by a line to show how the value of the variable changes over time.

Line plots are useful because they can quickly and easily show trends and patterns in the data. They are particularly useful for showing how a variable changes over a period of time. They are also useful for comparing the trends of multiple variables.

To see how rented bike demand is distributed over 24 hours time we used line plot.

##### 2. What is/are the insight(s) found from the chart?

From above line plot we can clearly see that there is high demand in the morning and in the evening.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, from above insight we know that there is high demand in morning and evening so business needs to focus more on that time slot, as well as try to meet the demand on that time slot.
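
The peak hours can also be read off the hourly averages programmatically instead of by eye. The following is a small sketch on a hypothetical two-day `sample`; the numbers are invented to mimic commuter-style peaks and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical two-day sample with commuter-style peaks at 8 and 18
sample = pd.DataFrame({
    "Hour": [0, 8, 12, 18, 0, 8, 12, 18],
    "Rented Bike Count": [100, 900, 500, 1400, 120, 1100, 540, 1600],
})

# Average demand per hour, as in the line plot above
avg_by_hour = sample.groupby("Hour")["Rented Bike Count"].mean()

# nlargest keeps the hours with the highest average demand
peak_hours = avg_by_hour.nlargest(2).index.tolist()
print(peak_hours)
```

On the real data, `avg_rent_per_hour.nlargest(2)` would return the two busiest hours together with their average counts.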

#### Chart - 6 Bike demand throughout the day (Multivariate)

In [None]:
# plotting point plot
for i in cat_var:
    if i == 'Hour':
        continue
    else:
        fig, ax = plt.subplots(figsize=(10,6))
        sns.pointplot(x= 'Hour',y='Rented Bike Count',data=dataset, hue =i)
        plt.title(f'Bike demand throughout the day for {i}')
        plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left',title=i)
        plt.show()

##### 1. Why did you pick the specific chart?


A point plot shows the mean of a variable at each level of a category and connects those means with lines, so it reads much like a line plot. Adding a hue draws one line per group, which makes it easy to compare how the trend differs between groups.

To show the demand for rented bikes throughout the day, broken down by each of the other categorical variables, we used a point plot with one line per category.

##### 2. What is/are the insight(s) found from the chart?


From above line plots we see that :

1. In the winter season, there is no significant demand, even in the morning or in the evening.

2. On non-holidays (i.e. 'No Holiday'), there are spikes in the morning and in the evening, but these are absent on holidays.

3. For around three months in the winter season (i.e. December, January & February), demand is low.

4. On weekends, demand is spread across almost the whole day.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, from this analysis we figure out some key factors such as high demand in morning and evening slot in all the seasons.

#### Chart - 7 : Pie plot seasons

In [None]:
# Chart - 7 visualization code
winter_data = dataset[dataset['Seasons'] == 'Winter']['Rented Bike Count'].sum()
spring_data = dataset[dataset['Seasons'] == 'Spring']['Rented Bike Count'].sum()
summer_data = dataset[dataset['Seasons'] == 'Summer']['Rented Bike Count'].sum()
autumn_data = dataset[dataset['Seasons'] == 'Autumn']['Rented Bike Count'].sum()

Bike_Seasons = [winter_data, spring_data, summer_data, autumn_data]
Seasons = ['Winter','Spring','Summer','Autumn']

plt.figure(figsize=(8, 8))
plt.pie(Bike_Seasons, labels=Seasons, autopct='%1.1f%%', startangle=140)
plt.title('Distribution of Rented Bike Count by Season')
plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()

##### 1. Why did you pick the specific chart?


Pie charts are generally used to show the proportions of a whole, and are especially useful for displaying data that has already been calculated as a percentage of the whole.

So, we used a pie chart to see the percentage distribution of rented bikes across seasons.

##### 2. What is/are the insight(s) found from the chart?


From above pie chart:

1. Over the year, summer contributes around 36% of rentals, followed by autumn at around 29%.

2. Demand is lowest in winter, which contributes only around 7%.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


These insights only describe each season's percentage contribution to the yearly total, which gives a clear indication of seasonal demand.
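
The percentages shown on the pie chart can be computed directly with a group-by, which avoids filtering the frame once per season. This sketch uses a hypothetical six-row `sample` with invented counts, so its shares do not match the real chart.

```python
import pandas as pd

# Hypothetical sample; the counts are invented for illustration
sample = pd.DataFrame({
    "Seasons": ["Winter", "Spring", "Summer", "Autumn", "Summer", "Winter"],
    "Rented Bike Count": [100, 300, 500, 400, 600, 100],
})

# Total rentals per season, then each season's share of the overall total
season_totals = sample.groupby("Seasons")["Rented Bike Count"].sum()
season_share = (season_totals / season_totals.sum() * 100).round(1)
print(season_share.sort_values(ascending=False))
```

The resulting series could also be passed to `plt.pie` with its index as labels, replacing the four separate filter-and-sum lines.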

#### Chart - 8  : Categorical plot for seasons

In [None]:
# Chart - 8 visualization code
sns.catplot(x='Seasons',y='Rented Bike Count',data=dataset,kind='box')
plt.show()

##### 1. Why did you pick the specific chart?


Catplot is used to create a categorical plot. Categorical plots are plots that are used to visualize the distribution of a categorical variable. They can be used to show how a variable is related to a categorical variable and can also be used to compare the distribution of multiple categorical variables.

To see the distribution of the rented bike on basis of season column we used catplot.

##### 2. What is/are the insight(s) found from the chart?

From above catplot we got to know that:

1. There is low demand in winter

2. In all seasons, the distribution is dense up to about 2500 rented bikes.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes, from this catplot we know that counts are concentrated below about 2500, so values above that may be outliers. The business needs to evaluate them.

#### Chart - 9 : Avg Rented Bike Count by Wind speed

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(10,5))
sns.lineplot(x='Wind speed (m/s)',y='Rented Bike Count',data=dataset)
plt.title('Avg Rented Bike Count by Wind speed')
plt.show()

##### 1. Why did you pick the specific chart?

Line charts are a useful tool for visualizing how a quantity changes along a continuous variable. They allow easy identification of patterns and changes (in this case, over wind speed).

##### 2. What is/are the insight(s) found from the chart?

Line charts are a useful tool for visualizing trends over time. It allows us in easy identification of patterns and changes over time (in this case over wind speed).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This may not be much helpful in creating positive business impact as
this is a natural phenomenon and we can't control it.

#### Chart - 10 : Avg Rented Bike Count by Humidity

In [None]:
# Chart - 10 visualization code
plt.figure(figsize=(10,5))
sns.lineplot(x='Humidity(%)',y='Rented Bike Count',data=dataset)
plt.title('Avg Rented Bike Count by Humidity')
plt.show()

##### 1. Why did you pick the specific chart?

Line charts are a useful tool for visualizing trends over time. It allows us in easy identification of patterns and changes over time (in this case over humidity).

##### 2. What is/are the insight(s) found from the chart?

Beyond a certain level, demand decreases as humidity increases. Very high humidity is generally caused by rain or snowfall, which, as we already saw, lead to a decrease in demand.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This may not be much helpful in creating positive business impact as
this is a natural phenomenon and we can't control it.

#### Chart - 11 : Avg Rented Bike Count by Visibility

In [None]:
# Chart - 11 visualization code
plt.figure(figsize=(10,5))
sns.lineplot(x='Visibility (10m)',y='Rented Bike Count',data=dataset)
plt.title('Avg Rented Bike Count by Visibility')
plt.show()

##### 1. Why did you pick the specific chart?

Line charts are a useful tool for visualizing how a quantity changes along a continuous variable. They allow easy identification of patterns and changes (in this case, over visibility).

##### 2. What is/are the insight(s) found from the chart?

Below a certain level of visibility, rental bike demand decreases, due to safety concerns and unpleasant riding conditions. At higher visibility, rental bike demand increases.

Also, the large fluctuations in visibility may be caused by the day-night cycle, as there is no sunlight at night.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This may not be much helpful in creating positive business impact as this is a natural phenomenon and we can't control it.

#### Chart - 12 : Temperature over time

In [None]:
# Chart - 12 visualization code
plt.figure(figsize=(10,5))
# hue already assigns one colour per month, so no single colour is set
sns.lineplot(x='Date',y='Temperature(°C)',hue = 'Month',data=dataset)
plt.title("Temperature by Date for each Month")
plt.show()

##### 1. Why did you pick the specific chart?

Line charts are a useful tool for visualizing trends over time. They allow easy identification of patterns and changes over time.

##### 2. What is/are the insight(s) found from the chart?

As expected, temperature rises during the summer months and falls in the winter months.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This may not be much helpful in creating positive business impact as this is a natural phenomenon and we can't control it.

#### Chart - 13 : Solar radiation over time

In [None]:
# Chart - 13 visualization code
plt.figure(figsize=(10,5))
# hue already assigns one colour per month, so no single colour is set
sns.lineplot(x='Date',y='Solar Radiation (MJ/m2)',hue = 'Month',data=dataset)
plt.title("Solar Radiation by Date for each Month")
plt.show()

##### 1. Why did you pick the specific chart?

Line charts are a useful tool for visualizing trends over time. They allow easy identification of patterns and changes over time.

##### 2. What is/are the insight(s) found from the chart?

Solar radiation is higher in the summer months than in the winter months. The large fluctuations in solar radiation may be caused by the day-night cycle, as there is no sunlight at night.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This may not be much helpful in creating positive business impact as this is a natural phenomenon and we can't control it.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
numeric_dataset = dataset.select_dtypes(include=np.number)
corr = numeric_dataset.corr()

# Boolean mask that hides the upper triangle (including the diagonal)
mask = np.triu(np.ones_like(corr, dtype=bool))

with sns.axes_style("white"):
    f, ax = plt.subplots(figsize=(14,7))
    sns.heatmap(corr, mask=mask, vmin=-1, vmax=1, annot=True, cmap="YlGnBu", ax=ax)
    plt.show()

##### 1. Why did you pick the specific chart?

The correlation coefficient is a measure of the strength and direction of a linear relationship between two variables. A correlation matrix is used to summarize the relationships among a set of variables and is an important tool for data exploration and for selecting which variables to include in a model. The range of correlation is [-1,1].

Thus, to see the correlation between all the variables along with the correlation coefficients, we used a correlation heatmap.

##### 2. What is/are the insight(s) found from the chart?

From above correlation map we can clearly see that:

1. There is high multicollinearity between independent variables (i.e. temperature & dew point temp, humidity & dew point temp, weekend & day of week).
2. Temperature, hour, dew point temp & solar radiation are correlated with the dependent variable, rented bike count.
3. Other than that, we did not see any notable correlation.
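
Multicollinearity of this kind is often quantified with the variance inflation factor (VIF). As a sketch, the VIF of each feature equals the corresponding diagonal entry of the inverse correlation matrix. The helper `vif_from_correlation` is hypothetical, and the three features below are synthetic stand-ins in which dew point is deliberately built to track temperature.

```python
import numpy as np
import pandas as pd

def vif_from_correlation(features):
    """VIF of each column, read from the diagonal of the inverse correlation matrix."""
    corr = features.corr().to_numpy()
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=features.columns)

# Synthetic stand-ins: dew point is constructed to follow temperature closely
rng = np.random.default_rng(42)
temperature = rng.normal(13, 12, size=1000)
dew_point = 0.9 * temperature - 8 + rng.normal(0, 2, size=1000)
wind_speed = rng.normal(1.7, 1.0, size=1000)

features = pd.DataFrame({
    "temperature": temperature,
    "dew_point": dew_point,
    "wind_speed": wind_speed,
})
vif = vif_from_correlation(features)
print(vif.round(2))
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign that a feature is largely explained by the others. The `variance_inflation_factor` function imported at the top of the notebook computes the same quantity one column at a time.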

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(dataset)
plt.show()

##### 1. Why did you pick the specific chart?

A pairplot, also known as a scatterplot matrix, is a visualization that allows you to visualize the relationships between all pairs of variables in a dataset. It is a useful tool for data exploration because it allows you to quickly see how all of the variables in a dataset are related to one another.

Thus, we used a pair plot to analyse the patterns in the data and the relationships between the features. It conveys information similar to the correlation map, but graphically, so non-linear patterns are also visible.

##### 2. What is/are the insight(s) found from the chart?

From the above pair plot we learned that there is no clear linear relationship between most variables. Other than among dew point temp, temperature & solar radiation, there is no visible relationship.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.


Based on the chart experiments above, I noticed that our dependent variable (Rented Bike Demand) does not seem to be normally distributed. For the statistical analysis I nevertheless work under the assumption of approximate normality, and I test the following three statements:

1. Rented Bike Demand in hot weather is higher compared to demand in cold weather.

2. Rented Bike Demand during rush hour (7-9AM & 5-7PM) and non-rush hour are different.

3. Average Rented Bike Demand is different in different seasons.

### Hypothetical Statement - 1

Rented Bike Demand in hot weather is higher compared to demand in cold weather.

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis: $H_0: \mu_{\text{hot}} = \mu_{\text{cold}}$

Alternate Hypothesis: $H_1: \mu_{\text{hot}} \neq \mu_{\text{cold}}$

Test Type: Two-sample t-test

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy import stats

# split the data into the hot and cold weather
hot_weather = dataset[dataset['Temperature(°C)'] >= 20]['Rented Bike Count']
cold_weather = dataset[dataset['Temperature(°C)'] < 20]['Rented Bike Count']

print("cold weather bike demand variance :", np.var(cold_weather))
print("hot weather bike demand variance :", np.var(hot_weather))


In [None]:
# Sample size for different weather groups
print("cold weather sample size :", len(cold_weather))
print("hot weather sample size :", len(hot_weather))

In [None]:
# Perform the t-test
t_statistic, p_value = stats.ttest_ind(hot_weather, cold_weather, equal_var=False)

print("t-statistic:", t_statistic)
if p_value < 0.05:
    print(f"Since p-value ({p_value}) is less than 0.05, we reject null hypothesis.\nHence, There is a significant difference in mean bike rentals between the 'hot' and 'cold' temperature groups.")
else:
    print(f"Since p-value ({p_value}) is greater than 0.05, we fail to reject null hypothesis.\nHence, There is no significant difference in mean bike rentals between the 'hot' and 'cold' temperature groups.")



##### Which statistical test have you done to obtain P-Value?

I have used a two-sample t-test to obtain the p-value. The null hypothesis was rejected: the mean Rented Bike Count differs between hot and cold temperatures.

##### Why did you choose the specific statistical test?


The two sample t-test is used to determine if there is a significant difference between the means of two groups, making it an appropriate test for comparing the mean number of Rented Bike Count between the hot and cold temperature groups.

We also know from previous charts that Rented Bike Count is right-skewed, but the samples are large (e.g., cold weather sample size: 5832; hot weather bike demand variance: 505140) and we don't know the population variance, which is the setting a t-test is designed for.
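
Passing `equal_var=False` to `scipy.stats.ttest_ind` selects Welch's t-test, which does not assume equal group variances. As a minimal sketch of what that call computes, the statistic can be reproduced by hand and compared with SciPy. The function name `welch_t` and the synthetic gamma-distributed samples are assumptions for illustration only. Because the statement is directional ("higher in hot weather"), the sketch also shows the one-sided variant through the `alternative` argument.

```python
import numpy as np
from scipy import stats


def welch_t(a, b):
    """Welch's t-statistic and Welch-Satterthwaite degrees of freedom."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    # Squared standard errors of the two sample means
    va, vb = a.var(ddof=1) / a.size, b.var(ddof=1) / b.size
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    return t, df


# Synthetic right-skewed samples (hypothetical stand-ins for the two groups)
rng = np.random.default_rng(42)
hot_demo = rng.gamma(shape=2.0, scale=500.0, size=3000)
cold_demo = rng.gamma(shape=2.0, scale=250.0, size=6000)

t_manual, df_manual = welch_t(hot_demo, cold_demo)
t_scipy, p_two_sided = stats.ttest_ind(hot_demo, cold_demo, equal_var=False)

# One-sided version of the same test: H1 is mean(hot) > mean(cold)
_, p_greater = stats.ttest_ind(hot_demo, cold_demo, equal_var=False,
                               alternative="greater")

print("manual t:", t_manual, "scipy t:", t_scipy, "df:", df_manual)
```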

### Hypothetical Statement - 2

Rented Bike Demand during rush hour (7-9AM & 5-7PM) is higher compared to non-rush hour.

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis: $H_0: \mu_{\text{rush}} = \mu_{\text{non-rush}}$

Alternate Hypothesis: $H_1: \mu_{\text{rush}} \neq \mu_{\text{non-rush}}$

Test Type: Two-sample t-test

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Create subsets of the data based on hour (rush hour: 7-9 AM and 5-7 PM, bounds inclusive)
is_rush_hour = dataset['Hour'].between(7, 9) | dataset['Hour'].between(17, 19)
rush_hour = dataset.loc[is_rush_hour, 'Rented Bike Count']
non_rush_hour = dataset.loc[~is_rush_hour, 'Rented Bike Count']

print("Rush Hour Bike Demand Variance: ", np.var(rush_hour))
print("Non-Rush Hour Bike Demand Variance: ", np.var(non_rush_hour))

In [None]:
import scipy
# Conduct a two-sample t-test to compare the mean bike rental demand during rush hour with the mean bike rental demand during non-rush hour times
t_stat, p_val = scipy.stats.ttest_ind(rush_hour, non_rush_hour, equal_var=False)

# Print the t-test results
# print('t-statistic:', t_stat)
# print('p-value:', p_val)

if p_val < 0.05:
    print(f"Since p-value ({p_val}) is less than 0.05, we reject null hypothesis.\nHence, There is a significant difference in mean bike rentals between the 'rush hour' and 'non-rush hour' times of day.")
else:
    print(f"Since p-value ({p_val}) is greater than 0.05, we fail to reject null hypothesis.\nHence, There is no significant difference in mean bike rentals between the 'rush hour' and 'non-rush hour' times of day.")


##### Which statistical test have you done to obtain P-Value?

I have used a two-sample t-test to obtain the p-value. The null hypothesis was rejected: the mean Rented Bike Count differs between rush hours and non-rush hours.

##### Why did you choose the specific statistical test?

The two sample t-test is used to determine if there is a significant difference between the means of two groups, making it an appropriate test for comparing the mean number of Rented Bike Count between the rush hours and non-rush hours.

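We also know from previous charts that Rented Bike Count is right-skewed, but the samples are large, the two groups have clearly different variances (rush hour bike demand variance: 651191; non-rush hour bike demand variance: 294088), and we don't know the population variance, which is the setting a t-test with unequal variances is designed for.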

### Hypothetical Statement - 3


Rented Bike Demand is different in different seasons with highest in summer and lowest in winter.



#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

 Null Hypothesis H₀:
 No significant difference between rented bike counts for different seasons.

Alternate Hypothesis H₁ :
 Significant difference between rented bike counts for different seasons.

Test Type: One-way ANOVA test

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# sample size for different seasons
dataset.groupby('Seasons')['Rented Bike Count'].count()

In [None]:
# group the data by season and calculate the mean number of bike rentals for each season
season_means = dataset.groupby('Seasons')['Rented Bike Count'].mean()

# conduct the ANOVA test
f_stat, p_value = scipy.stats.f_oneway(dataset.loc[dataset['Seasons']=='Spring', 'Rented Bike Count'],
                                  dataset.loc[dataset['Seasons']=='Summer', 'Rented Bike Count'],
                                  dataset.loc[dataset['Seasons']=='Autumn', 'Rented Bike Count'],
                                  dataset.loc[dataset['Seasons']=='Winter', 'Rented Bike Count'])

# Print the results
print('F-statistic:', f_stat)
print('p-value:', p_value)  # p_value comes from the ANOVA above (p_val belonged to the earlier t-test)
print()

# Conduct Tukey's HSD test for detailed difference b/w each groups
from statsmodels.stats.multicomp import pairwise_tukeyhsd
tukey_results = pairwise_tukeyhsd(dataset['Rented Bike Count'], dataset['Seasons'])

# Print the Tukey HSD test results
print(tukey_results)

##### Which statistical test have you done to obtain P-Value?

I have used a one-way ANOVA test to obtain the p-value. The null hypothesis was rejected: the mean Rented Bike Count is significantly different across seasons.

##### Why did you choose the specific statistical test?

The one-way ANOVA test is used to determine if there is a significant difference between the means of more than two groups, making it an appropriate test for comparing the mean number of Rented Bike Count between different seasons.

We also know from previous charts that Rented Bike Count is right-skewed, but the sample size in each season is large.
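
To make the ANOVA result less of a black box, the F-statistic can be rebuilt from the between-group and within-group sums of squares. This is a minimal sketch on synthetic data; the function name `one_way_f` and the four made-up "season" samples are assumptions for illustration and do not reproduce the notebook's numbers.

```python
import numpy as np
from scipy import stats


def one_way_f(*groups):
    """F-statistic of a one-way ANOVA computed from sums of squares."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    k, n = len(groups), all_values.size
    # Variation of the group means around the grand mean
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Four synthetic groups with different means (hypothetical stand-ins for seasons)
rng = np.random.default_rng(7)
season_samples = [rng.normal(loc=mu, scale=300.0, size=2000)
                  for mu in (700.0, 1000.0, 800.0, 250.0)]

f_manual = one_way_f(*season_samples)
f_scipy, p_anova = stats.f_oneway(*season_samples)
print("manual F:", f_manual, "scipy F:", f_scipy, "p-value:", p_anova)
```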

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
dataset.isnull().sum().sum()

#### What all missing value imputation techniques have you used and why did you use those techniques?

There are no missing values in this dataset, so no imputation was needed.

### 2. Handling Outliers

In [None]:
'''# Handling Outliers & Outlier treatments
# Removing outliers by using IQR method:
q1, q3, median = dataset['Rented Bike Count'].quantile([0.25,0.75,0.5])
lower_limit = q1 - 1.5*(q3-q1)
upper_limit = q3 + 1.5*(q3-q1)
dataset['Rented Bike Count'] = np.where(dataset['Rented Bike Count'] > upper_limit, median,np.where(dataset['Rented Bike Count'] < lower_limit,median,dataset['Rented Bike Count']))

# Removing outliers by Capping:
for col in ['Wind speed (m/s)','Solar Radiation (MJ/m2)','Rainfall(mm)','Snowfall (cm)']:
  upper_limit = dataset[col].quantile(0.99)
  dataset[col] = np.where(dataset[col] > upper_limit, upper_limit, dataset[col])'''

##### What all outlier treatment techniques have you used and why did you use those techniques?

Here I tried the IQR method and the capping method. With the IQR method, I set the upper and lower limits of the rented bike count and replaced values outside those limits with the median.

I also capped the outliers in the weather columns at the 99th percentile: values above it were replaced with the 99th-percentile value.

### Note :-
I tried treating the outliers, but model performance dropped by around 10% afterwards.
So I decided to train the models without outlier treatment, which is why the code above is left disabled inside a string literal.
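
For reference, the two treatments described above can be sketched as small reusable functions. This is only an illustration on a tiny hand-made series; the helper names `iqr_limits`, `replace_outliers_with_median`, and `cap_at_percentile` are hypothetical and are not used elsewhere in the notebook.

```python
import pandas as pd


def iqr_limits(values: pd.Series, k: float = 1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def replace_outliers_with_median(values: pd.Series) -> pd.Series:
    """Replace values outside the IQR fences with the median of the series."""
    lower, upper = iqr_limits(values)
    is_outlier = (values < lower) | (values > upper)
    return values.mask(is_outlier, values.median())


def cap_at_percentile(values: pd.Series, q: float = 0.99) -> pd.Series:
    """Clip values above the q-th quantile down to that quantile."""
    return values.clip(upper=values.quantile(q))


# Tiny hand-made example with one obvious outlier
demo_counts = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

cleaned = replace_outliers_with_median(demo_counts)
capped = cap_at_percentile(demo_counts)
print(iqr_limits(demo_counts))
print(cleaned.tolist())
print(capped.tolist())
```

Working on copies like this, rather than overwriting `dataset` in place, makes it easy to compare model performance with and without the treatment.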

### 3. Categorical Encoding

In [None]:
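# Encode your categorical columns
# Snowfall and rainfall become binary flags (1 if any was recorded in that hour)
dataset['Snowfall (cm)'] = dataset['Snowfall (cm)'].apply(lambda x: 1 if x>0 else 0)
dataset['Rainfall(mm)'] = dataset['Rainfall(mm)'].apply(lambda x: 1 if x>0 else 0)
# Visibility is in units of 10 m: 0 = low (< 4 km), 1 = medium (4 km to < 10 km), 2 = high (>= 10 km).
# The original condition `0 if x > 399` made the middle category unreachable, so the order is corrected here.
dataset['Visibility (10m)'] = dataset['Visibility (10m)'].apply(lambda x: 0 if x <= 399 else (1 if x <= 999 else 2))
dataset['Holiday'] = np.where(dataset['Holiday'] == 'Holiday',1,0)
dataset['Functioning Day'] = np.where(dataset['Functioning Day'] == 'Yes',1,0)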


In [None]:
# One hot encoding
dataset = pd.get_dummies(dataset, columns=[ 'Month', 'Hour', 'Day of week','Visibility (10m)'])

In [None]:
dataset.columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

Since there are very few days on which there was snowfall or rainfall, I converted these columns to binary categorical columns indicating whether there was rainfall or snowfall in that particular hour.

For visibility, I used the following thresholds:

Visibility >= 10 Km ---> Clear (high visibility)

4 Km <= Visibility < 10 Km ---> Haze (medium visibility)

Visibility < 4 Km ---> Fog (low visibility)

Visibility is converted based on the above threshold values. Since the categories are ordinal, we can encode them as 0 (low visibility), 1 (medium visibility), 2 (high visibility).

Functioning day and holiday each have only two categories, so we use 0 and 1 for them.

For hour, visibility, month & day of the week we use one hot encoding.
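
The visibility thresholds can also be expressed with `pd.cut`, which makes the bin edges explicit and avoids nested conditionals. The following is a minimal sketch on a few hand-picked values; the variable names are assumptions, and the edges (400 and 1000, in units of 10 m) follow the thresholds described above.

```python
import numpy as np
import pandas as pd

# Visibility is recorded in units of 10 m, so 400 -> 4 km and 1000 -> 10 km
visibility = pd.Series([27, 399, 400, 999, 1000, 2000], name="Visibility (10m)")

visibility_level = pd.cut(
    visibility,
    bins=[-np.inf, 400, 1000, np.inf],
    right=False,       # intervals are [low, high), so 400 falls in the middle bin
    labels=[0, 1, 2],  # 0 = low, 1 = medium, 2 = high visibility
).astype(int)

print(visibility_level.tolist())
```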

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:

# Manipulate Features to minimize feature correlation and create new features

# We see that the temperature and dew temperature are highly correlated

# Plotting Scatter plot to visualize the relationship between
# temperature and dew point temperature
plt.figure(figsize=(9,6))
sns.scatterplot(x='Temperature(°C)',y='Dew point temperature(°C)',data=dataset)
plt.xlabel('Temperature(°C)')
plt.ylabel('Dew point temperature(°C)')
plt.title('Temperature(°C) VS Dew point temperature(°C)')
plt.show()

In [None]:
# Correlation between temperature and dew point temperature
dataset[['Temperature(°C)', 'Dew point temperature(°C)']].corr()

In [None]:
# Creating a new temperature column as the average of temperature and dew point temperature
dataset['Temperature'] = (dataset['Temperature(°C)']+dataset['Dew point temperature(°C)'])/2

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting
features = [i for i in dataset.columns if i not in ['Rented Bike Count','Temperature(°C)', 'Dew point temperature(°C)']]
features

In [None]:
# Remove multicollinearity by using VIF technique
from statsmodels.stats.outliers_influence import variance_inflation_factor
def calc_vif(X):

    # Calculating VIF
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

    return(vif)

In [None]:
continuous_variables = ['Temperature(°C)', 'Humidity(%)', 'Wind speed (m/s)', 'Dew point temperature(°C)', 'Solar Radiation (MJ/m2)', 'Temperature']

In [None]:
continuous_feature_df = pd.DataFrame(dataset[continuous_variables])
continuous_feature_df

In [None]:
calc_vif(dataset[[i for i in continuous_feature_df]])

In [None]:
# Removing Temperature and dew point temperature
calc_vif(dataset[[i for i in continuous_feature_df if i not in ['Dew point temperature(°C)','Temperature(°C)']]])


In [None]:
# Dropping date, seasons, weekend, temperature and dew point temperature
dataset.drop(['Dew point temperature(°C)', 'Temperature(°C)','Seasons', 'Date', 'weekend'],axis=1, inplace=True)

In [None]:
dataset.head()

##### What all feature selection methods have you used  and why?

I have used the Pearson correlation coefficient to check the correlation between the independent variables and also with the dependent variable.

I also checked multicollinearity using VIF and removed the features with high VIF values.

##### Which all features you found important and why?

From the above methods I found a high correlation between temperature and dew point temperature. So I created a new variable 'Temperature' as the average of the two (50% of each).
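
As a cross-check on the VIF step, the VIF of each feature equals the corresponding diagonal entry of the inverse correlation matrix, which is the same as $1/(1-R_i^2)$ from regressing feature $i$ on the others with an intercept. The sketch below uses that identity on synthetic data; the function name `vif_from_correlation` and the three made-up columns are assumptions. As far as I know, statsmodels' `variance_inflation_factor` regresses on the columns exactly as passed, without adding an intercept, so its numbers can differ from these unless a constant column is included.

```python
import numpy as np


def vif_from_correlation(X):
    """VIF of each column: the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    return np.diag(np.linalg.inv(corr))


# Synthetic features (hypothetical): x2 is almost a copy of x1, x3 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=1000)
x2 = x1 + 0.1 * rng.normal(size=1000)
x3 = rng.normal(size=1000)

vif_before = vif_from_correlation(np.column_stack([x1, x2, x3]))
# Replacing the collinear pair with its average, as done for the two temperatures
vif_after = vif_from_correlation(np.column_stack([(x1 + x2) / 2, x3]))

print("before:", vif_before)
print("after: ", vif_after)
```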

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data
# Visualizing the distribution of the dependent variable - rental bike count
plt.figure(figsize=(9,5))
sns.displot(dataset[dependent_variable])
plt.xlabel('Rented Bike Count')
plt.title('Distribution of Rented Bike Count')
plt.axvline(dataset[dependent_variable[0]].mean(), color='red', linestyle='dashed', linewidth=2)
plt.axvline(dataset[dependent_variable[0]].median(), color='green', linestyle='dashed', linewidth=2)
plt.show()

In [None]:
# Skew of the dependent variable
dataset[dependent_variable[0]].skew()

In [None]:
# Visualizing the distribution of dependent variable after log transformation
# Note: np.log returns -inf for zero counts; np.log1p is a safer choice if the column contains zeros
plt.figure(figsize=(9,5))
sns.displot(np.log(dataset[dependent_variable[0]]))
plt.xlabel('log(Rented Bike Count)')
plt.title('Distribution of log(Rented Bike Count)')
plt.axvline(np.log(dataset['Rented Bike Count']).mean(), color='red', linestyle='dashed', linewidth=2)
plt.axvline(np.log(dataset['Rented Bike Count']).median(), color='green', linestyle='dashed', linewidth=2)

In [None]:
# Skew of the dependent variable after log transformation
np.log(dataset[dependent_variable[0]]).skew()

In [None]:
# Visualizing the distribution of dependent variable after sqrt transformation
plt.figure(figsize=(9,5))
sns.displot(np.sqrt(dataset[dependent_variable[0]]))
plt.xlabel('sqrt(Rented Bike Count)')
plt.title('Distribution of sqrt(Rented Bike Count)')
plt.axvline(np.sqrt(dataset[dependent_variable[0]]).mean(), color='red', linestyle='dashed', linewidth=2)
plt.axvline(np.sqrt(dataset[dependent_variable[0]]).median(), color='green', linestyle='dashed', linewidth=2)
plt.show()

In [None]:
# Skew of the dependent variable after sqrt transformation
np.sqrt(dataset[dependent_variable[0]]).skew()

In [None]:
# Defining dependent and independent variables
X = dataset.drop('Rented Bike Count',axis=1)
y = np.sqrt(dataset[dependent_variable])

In [None]:
features

I plotted the distribution and checked for normality, and found that the data is not normally distributed, so it needs a transformation.

First, I calculated the skewness and found that the Rented Bike Count is positively skewed, so I tried a log transformation, but it affected the distribution negatively.

So I finally used a square root transformation; the data now looks closer to normally distributed and the skewness is reduced.
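
The comparison between transformations can be summarised in one place by computing the skewness for each candidate. This is a minimal sketch on synthetic right-skewed counts that include zeros; the data and the dictionary name `skews` are assumptions for illustration, and the values it prints are not the notebook's results. `np.log1p` is used instead of `np.log` here because it stays finite at zero.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Synthetic right-skewed counts with some zeros (hypothetical stand-in for hourly demand)
counts = pd.Series(np.floor(rng.gamma(shape=2.0, scale=350.0, size=5000)))
counts.iloc[:50] = 0

skews = {
    "raw": counts.skew(),
    "sqrt": np.sqrt(counts).skew(),
    "log1p": np.log1p(counts).skew(),  # log1p(0) is 0, whereas log(0) is -inf
}
print(skews)

# A sqrt-transformed target is mapped back to the original scale by squaring
restored = np.square(np.sqrt(counts))
```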

### 6. Data Scaling

In [None]:
features = [i for i in dataset.columns if i not in ['Rented Bike Count']]

In [None]:
# Scaling your data
scaler = StandardScaler()
X = scaler.fit_transform(dataset[features])

##### Which method have you used to scale you data and why?


The independent features are on different scales, so I used the standard scaler method (`StandardScaler`) to bring them onto one common scale.
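
One design point worth noting: in this notebook the scaler is fitted on the full dataset before the train/test split, so the test rows contribute to the mean and standard deviation. A common alternative is to fit the scaler on the training rows only and reuse it for the test rows. The following is a minimal sketch of that alternative on synthetic data; all variable names here are assumptions and it is not the procedure used in the cells above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (hypothetical)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[10.0, 200.0], scale=[2.0, 50.0], size=(1000, 2))
y_demo = X_demo @ np.array([1.5, 0.02]) + rng.normal(0, 1, 1000)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

scaler_demo = StandardScaler().fit(X_tr)   # statistics come from the training rows only
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)  # test rows reuse the training mean and std

print(X_tr_scaled.mean(axis=0), X_tr_scaled.std(axis=0))
```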

### 7. Dimensionality Reduction

In [None]:
# Processed dataset shape(rows, columns)
dataset.shape

##### Do you think that dimensionality reduction is needed? Explain Why?

With 57 columns (independent features) and 8760 rows, and after the feature engineering steps above (removing multicollinearity, feature manipulation, and feature selection), I don't think dimensionality reduction is needed here.

Dimensionality reduction is mainly worthwhile where high dimensionality is itself a problem, or where a particular algorithm requires it.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Not applicable: no dimensionality reduction was performed on this dataset.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size = 0.2, random_state = 0)

##### What data splitting ratio have you used and why?

To train the model, I split the data into train and test sets using the train_test_split method.

There are two competing concerns: with less training data, the parameter estimates have greater variance; with less testing data, the performance statistic has greater variance. Broadly speaking, the data should be divided so that neither variance is too high, which has more to do with the absolute number of instances in each set than with the percentage.

If we had a total of 100 instances, we should probably stick with cross validation, as no single split would give satisfactory variance in our estimates. If we had 100,000 instances, it would not really matter whether we chose an 80:20 split or a 90:10 split.

An 80/20 split is a very commonly used ratio, often associated with the Pareto principle.

So, in this case I have split the data into 80% for training and 20% for testing.

### 9. Handling Imbalanced Dataset

In [None]:
# Distribution of Rented Bike Count (target variable)
_ = sns.displot(y, kde=True)
plt.xlabel('Rented Bike Count')
plt.show()

##### Do you think the dataset is imbalanced? Explain Why.

Looking at the distribution of the target variable (i.e., Rented Bike Count after the square root transformation), the values are not concentrated in a narrow range and are spread roughly normally across a wide range of values. So the dataset is not imbalanced.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Not applicable: the dataset is not imbalanced, so no balancing technique was used.

## ***7. ML Model Implementation***

In [None]:
# Defining a function to print evaluation metrics
def evaluate_model(model, y_test, y_pred):

  '''Takes a fitted model, y test and y pred values; prints evaluation metrics, plots the actual and
  predicted values, plots the top 20 important features, and returns a list of the model scores.
  Relies on the global X_train, y_train, X_test and features defined above.'''

  # Squaring y test and y pred back to the original scale, as we used a sqrt transformation
  y_t = np.square(y_test)
  y_p = np.square(y_pred)
  y_train2 = np.square(y_train)
  y_train_pred = np.square(model.predict(X_train))

  # Calculating evaluation metrics (MSE, RMSE and MAE are computed on the test data)
  mse = mean_squared_error(y_t,y_p)
  rmse = np.sqrt(mse)
  mae = mean_absolute_error(y_t,y_p)
  r2_train = r2_score(y_train2, y_train_pred)
  r2 = r2_score(y_t,y_p)
  r2_adjusted = 1-(1-r2)*((len(X_test)-1)/(len(X_test)-X_test.shape[1]-1))

  # Printing evaluation metrics
  print("MSE :" , mse)
  print("RMSE :" ,rmse)
  print("MAE :" ,mae)
  print("Train R2 :" ,r2_train)
  print("Test R2 :" ,r2)
  print("Adjusted R2 : ", r2_adjusted)


  # Plot actual and predicted values (first 100 test rows)
  plt.figure(figsize=(13,4))
  plt.plot((y_p)[:100])
  plt.plot((np.array(y_t)[:100]))
  plt.legend(["Predicted","Actual"])
  plt.title('Actual and Predicted Bike Count', fontsize=15)

  # Tree-based models expose feature_importances_; linear models expose coef_
  try:
    importance = model.feature_importances_
  except AttributeError:
    importance = model.coef_
  # coef_ can have shape (1, n_features) when y is 2-D, so flatten it to 1-D
  importance = np.absolute(importance).ravel()

  # Feature importances
  feat = pd.Series(importance, index=features)
  plt.figure(figsize=(9,7))
  plt.title('Feature Importances (top 20) for '+str(model), fontsize = 15)
  plt.xlabel('Relative Importance')
  feat.nlargest(20).plot(kind='barh')


  model_score = [mse,rmse,mae,r2_train,r2,r2_adjusted]
  return model_score

In [None]:
# Create a score dataframe
score = pd.DataFrame(index = ['MSE', 'RMSE', 'MAE', 'Train R2', 'Test R2', 'Adjusted R2'])
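
The manual squaring of predictions inside `evaluate_model` can also be delegated to scikit-learn. As a minimal sketch, `TransformedTargetRegressor` applies the square root when fitting and squares the predictions automatically, so metrics can be computed directly on the original scale. The synthetic data and variable names below are assumptions for illustration; this is an alternative design, not what the notebook does.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic positive target whose square root is linear in the features (hypothetical)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 2))
y_demo = (2.0 + X_demo @ np.array([1.0, 0.5]) + rng.normal(0, 0.2, 300)) ** 2

ttr = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.sqrt,            # applied to y before fitting
    inverse_func=np.square,  # applied to predictions, back on the original scale
)
ttr.fit(X_demo, y_demo)

pred_counts = ttr.predict(X_demo)    # already on the original (squared) scale
r2_original_scale = ttr.score(X_demo, y_demo)
print(round(r2_original_scale, 4))
```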

### ML Model - 1 : LinearRegression

In [None]:
# Instantiate the LinearRegression model
reg = LinearRegression()

# Fit the linear regression model to the training data
reg.fit(X_train, y_train)

# Predict on the model
y_pred_li = reg.predict(X_test)

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
linear_score = evaluate_model(reg, y_test,y_pred_li)
# Evaluation Metric Score chart
score['Linear regression'] = linear_score

In [None]:
score

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# Define the model
reg = LinearRegression()

# Define the hyperparameter grid
param_grid = {
    'fit_intercept': [True, False]}

# Perform grid search
grid_search = GridSearchCV(reg, param_grid, cv=5, scoring='r2', return_train_score = True)
grid_search.fit(X_train, y_train)

In [None]:
# Print the best parameters and the corresponding score
print("Best parameters: ", grid_search.best_params_)
print("Best R2 score: ", grid_search.best_score_)


In [None]:
# Use the best parameter to train the model
best_reg = grid_search.best_estimator_
best_reg.fit(X_train, y_train)

In [None]:
# Predict on test data
y_pred_li2 = best_reg.predict(X_test)

In [None]:
# Visualizing evaluation Metric Score chart
linear_score2 = evaluate_model(best_reg, y_test,y_pred_li2)

In [None]:
score['Linear regression tuned'] = linear_score2
score

##### Which hyperparameter optimization technique have you used and why?

GridSearchCV is used to find the best hyperparameters for a machine learning model by searching over a specified parameter grid. It helps to ensure that a model is not overfitting or underfitting by evaluating the model's performance using cross-validation techniques. GridSearchCV can save time and resources compared to manually tuning the parameters of a model.

To reduce time and effort, I have used GridSearchCV.
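
As a compact illustration of what the grid search does, the sketch below tries every value in a small grid with 5-fold cross-validation and reports the best one. `Ridge` and the `alpha` grid are used only because they give a meaningful hyperparameter to search; the synthetic data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (hypothetical)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(0, 0.5, 200)

alphas = [0.01, 0.1, 1.0, 10.0]
search = GridSearchCV(Ridge(), {"alpha": alphas}, cv=5, scoring="r2")
search.fit(X_demo, y_demo)  # fits 4 candidates x 5 folds, then refits the best on all data

print(search.best_params_, round(search.best_score_, 3))
```

Because `refit=True` is the default, `search.best_estimator_` is already fitted on the full training data, so calling `fit` on it again is not required.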

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

After tuning, the train R2 score is 0.784428 and the test R2 score is 0.789520, with an MSE of 88090.659090 and an MAE of 201.806803 (both computed on the test data by `evaluate_model`).

No improvement is seen on either the training or the testing data.

### ML Model - 2 : Lasso Regression

In [None]:
# Instantiate the Lasso regression model
lasso = Lasso()

# Fit the lasso regression model to your training data
lasso.fit(X_train, y_train)

# Predict on the model
y_pred_lasso1 = lasso.predict(X_test)

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
lasso_score = evaluate_model(lasso, y_test,y_pred_lasso1)
# Evaluation Metric Score chart
score['Lasso regression'] = lasso_score

In [None]:
score

With Lasso regression at its default settings, the performance of the model has dropped. So I will try to tune the model.

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# Define the model
lasso = Lasso()

# Define the parameters to be optimized & Perform grid search
parameters = {'alpha': [1e-15,1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1,5,10,20,30,40,45,50,55,60,100]}
lasso_regressor = GridSearchCV(lasso, parameters, scoring='neg_mean_squared_error', cv=5)

# Fitting model
lasso_regressor.fit(X_train,y_train)

In [None]:
# Getting optimum parameters
print("The optimum alpha value is found out to be :" ,lasso_regressor.best_params_)
print("\nUsing ",lasso_regressor.best_params_, " the negative mean squared error is: ", lasso_regressor.best_score_)

In [None]:
# Instantiate the Lasso regression model with the best alpha
lasso = Lasso(alpha = lasso_regressor.best_params_['alpha'])

# Fit the lasso regression model to your training data
lasso.fit(X_train, y_train)

# Predict the model
y_pred_lassocv = lasso.predict(X_test)


#Evaluation matrices for Lasso regression
lasso2 = evaluate_model(lasso, y_test,y_pred_lassocv)

name = 'Lasso with alpha = ' + str(lasso_regressor.best_params_['alpha'])

score[name] = lasso2

##### Which hyperparameter optimization technique have you used and why?

GridSearchCV is used to find the best hyperparameters for a machine learning model by searching over a specified parameter grid. It helps to ensure that a model is not overfitting or underfitting by evaluating the model's performance using cross-validation techniques. GridSearchCV can save time and resources compared to manually tuning the parameters of a model.

To reduce time and effort, I have used GridSearchCV.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

In [None]:
score

After tuning, I have seen an increase in performance (R2 score) from about 52% to about 78%.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

MSE measures the average squared difference between the predicted and actual values, and RMSE is its square root, so it is expressed in the same units as the target.

The R2 score, in turn, measures how well the model fits the data.

In a business context, a low RMSE and high R2 score would indicate that the model is making accurate predictions and is a good fit for the data. This would be desirable for a business because it would mean that the model is able to provide useful insights and make accurate predictions about future outcomes.
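
To make the metrics concrete, the sketch below computes them by hand on a tiny example using the same adjusted R2 formula as `evaluate_model`. The function name `regression_metrics` and the four data points are assumptions chosen so that the arithmetic is easy to follow.

```python
import numpy as np


def regression_metrics(y_true, y_pred, n_features):
    """MSE, RMSE, MAE, R2 and adjusted R2 computed from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)
    ss_res = np.sum(residuals ** 2)                 # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    r2 = 1 - ss_res / ss_tot
    n = y_true.size
    return {
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "MAE": np.mean(np.abs(residuals)),
        "R2": r2,
        "Adjusted R2": 1 - (1 - r2) * (n - 1) / (n - n_features - 1),
    }


# Errors are [1, 0, -1, 0], so MSE = 0.5 and MAE = 0.5; SS_res = 2 and SS_tot = 20, so R2 = 0.9
metrics = regression_metrics([3, 5, 7, 9], [2, 5, 8, 9], n_features=1)
print(metrics)
```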

### ML Model - 3 : Ridge Regression

In [None]:
# Ridge regressor class
ridge = Ridge()

# Fit the ridge regression model to your training data
ridge.fit(X_train, y_train)

# Predict on the model
y_pred_ridge1 = ridge.predict(X_test)

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
result = evaluate_model(ridge, y_test,y_pred_ridge1)
score['Ridge'] = result

In [None]:
score

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# Instantiate the Ridge regressor
ridge = Ridge()

# Define the parameters to be optimized & Perform grid search
parameters = {'alpha': [1e-15,1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1,5,10,20,30,40,45,50,55,60,100]}
ridge_regressor = GridSearchCV(ridge, parameters, scoring='neg_mean_squared_error', cv=5)

# Fitting model
ridge_regressor.fit(X_train,y_train)

In [None]:
# Getting optimum parameters
print("The best fit alpha value is found out to be :" ,ridge_regressor.best_params_)
print("\nUsing ",ridge_regressor.best_params_, " the negative mean squared error is: ", ridge_regressor.best_score_)

In [None]:
# Initiate ridge with best alpha
ridge = Ridge(alpha = ridge_regressor.best_params_['alpha'])

# Fit the ridge regression model to your training data
ridge.fit(X_train, y_train)

# Predict on model
y_pred_ridge = ridge.predict(X_test)


# Evaluation matrices for Ridge regression
result = evaluate_model(ridge, y_test,y_pred_ridge)

namer = 'Ridge with alpha = ' + str(ridge_regressor.best_params_['alpha'])

score[namer] = result

##### Which hyperparameter optimization technique have you used and why?

GridSearchCV is used to find the best hyperparameters for a machine learning model by searching over a specified parameter grid. It helps to ensure that a model is not overfitting or underfitting by evaluating the model's performance using cross-validation techniques. GridSearchCV can save time and resources compared to manually tuning the parameters of a model.

To reduce time and effort we have used GridSearchCV.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

In [None]:
score

### ML Model - 4 : Extreme Gradient Boosting Regressor

In [None]:
# Instantiate the Extreme Gradient Boosting Regressor
xgb_model = xgb.XGBRegressor(random_state=0,
                             objective='reg:squarederror')

# Fit the Extreme Gradient Boosting model to the training data
xgb_model.fit(X_train,y_train)

# Predict on the model
y_pred_xgb1 = xgb_model.predict(X_test)

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
result = evaluate_model(xgb_model, y_test,y_pred_xgb1)
score['Extreme Gradient Boosting Regressor'] = result

In [None]:
score

Using the Extreme Gradient Boosting Regressor, I got an R2 score of around 97% on train data and 92% on test data, which is the best result so far compared with the other algorithms.

So, let's tune it.

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
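# XG boost model
xgb_model = xgb.XGBRegressor(random_state=0,
                             objective='reg:squarederror')
# NOTE: `min_samples_leaf` is a scikit-learn tree parameter, not a documented XGBoost one
# (the closest XGBoost analogue is `min_child_weight`), so it is unlikely to affect the fit;
# in effect this search only sets n_estimators=500.
xgb_params = {'n_estimators':[500],
             'min_samples_leaf':np.arange(20,22)}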

In [None]:
# Perform the randomized search
xgb_search = RandomizedSearchCV(xgb_model,xgb_params,cv=6,scoring='neg_root_mean_squared_error',n_iter=100, n_jobs=-1)
xgb_search.fit(X_train,y_train)
xgb_best_params = xgb_search.best_params_

In [None]:
# Best parameters for XG boost Model
xgb_best_params

In [None]:
# Building a XG boost model with best parameters
xgb_model = xgb.XGBRegressor(n_estimators=xgb_best_params['n_estimators'],
                             min_samples_leaf=xgb_best_params['min_samples_leaf'],
                             random_state=0)

In [None]:
# Fitting model
xgb_model.fit(X_train,y_train)

In [None]:
# Predict on the model
y_pred_xgb = xgb_model.predict(X_test)

In [None]:
# Evaluation matrices for XGBRegressor
result = evaluate_model(xgb_model, y_test,y_pred_xgb)
score['Extreme Gradient Boosting Regressor Tuned'] = result

##### Which hyperparameter optimization technique have you used and why?

Randomized search cross-validation (CV) is used to efficiently explore the hyperparameter space of a machine learning model. It works by randomly sampling from the search space of hyperparameters, rather than exhaustively trying every possible combination. This allows for a more efficient search while still providing a good chance of finding good hyperparameter values. Additionally, by using cross-validation to evaluate the performance of each set of hyperparameters, one can ensure that the model is not overfitting to the training data.

Because of its random sampling technique, and to save time, I decided to use randomized search CV.
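
Randomized search pays off when parameters are sampled from distributions or from a grid larger than `n_iter`. When every parameter is given as a short list with fewer combinations than `n_iter`, scikit-learn simply tries all of them, which makes the search equivalent to a grid search. The following is a minimal sketch of sampling from a continuous distribution; `Ridge` stands in for the XGBoost model only to keep the example light, and the data and names are assumptions for illustration.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data (hypothetical)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(0, 0.5, 200)

rand_search = RandomizedSearchCV(
    Ridge(),
    param_distributions={"alpha": loguniform(1e-3, 1e2)},  # sampled, not enumerated
    n_iter=10,             # number of sampled candidates
    cv=5,
    scoring="neg_root_mean_squared_error",
    random_state=0,        # makes the sampled candidates reproducible
)
rand_search.fit(X_demo, y_demo)

print(rand_search.best_params_, round(-rand_search.best_score_, 3))
```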

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

In [None]:
score

After tuning the model, I got an R2 score of around 99% on train data and 93% on test data. This is a very good score, but a training score of almost 100% suggests that the model is overfitting.

So I would like to go with the XGBoost model before tuning, which scores 97% on train data and 92% on test data.

**Plot R2 scores for each model**

In [None]:
score.columns

In [None]:
# R2 Scores plot

models = list(score.columns)
train = score.loc['Train R2']   # select rows by label instead of by position
test = score.loc['Test R2']

X_axis = np.arange(len(models))

plt.figure(figsize=(20,8))
plt.bar(X_axis - 0.2, train, 0.4, label = 'Train R2 Score')
plt.bar(X_axis + 0.2, test, 0.4, label = 'Test R2 Score')


plt.xticks(X_axis,models, rotation=30)
plt.ylabel("R2 Score")
plt.title("R2 score for each model")
plt.legend()
plt.show()

**Plot of adjusted R2**

In [None]:
# Removing the overfitted models which have more than 5% gap between train and test values
score_t = score.transpose()            #taking transpose of the score dataframe to create new difference column
score_t['diff']=score_t['Train R2']-score_t['Test R2']                   #creating new column diff of train R2 and test R2 score
remove_models = list(score_t[score_t['diff']>=.05].index)                #creating a list of models which have difference more than .05 that is 5%
remove_models

adj = score_t['Adjusted R2'].drop(remove_models)                     #creating a new dataframe with required models and adjusted r2 score


# Visualizing a bar plot for adjusted R2 score
plt.figure(figsize=(12,7))
plots = sns.barplot(x=list(adj.index), y=adj)
for bar in plots.patches:
  plots.annotate(format(bar.get_height(),'.2f'),
                   (bar.get_x() + bar.get_width() / 2,
                    bar.get_height()), ha='center', va='center',
                   size=8, xytext=(0, 5),
                   textcoords='offset points')
plt.xticks(rotation=30)

plt.title("Adjusted R2 score", fontsize = 15)
plt.xlabel('Models', fontsize = 12)
plt.ylabel('Score', fontsize = 12)

# Setting limit of the y axis from 0 to 1
plt.ylim(0,1)
plt.show()

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Across all the models, I chose the R2 score as the evaluation metric. It measures the share of variance in rental demand that the model explains, which makes it a good indicator of whether the model is fit for use.
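
For reference, R2 and adjusted R2 can be computed by hand as shown below. This is a sketch of the standard formulas on made-up numbers, not the notebook's `evaluate_model` helper, whose implementation is not shown in this section. The function names `r2_score_manual` and `adjusted_r2` are hypothetical.

```python
def r2_score_manual(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))   # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)              # total variation
    return 1 - ss_res / ss_tot

def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Made-up values, for illustration only
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.5]

r2 = r2_score_manual(y_true, y_pred)
adj = adjusted_r2(r2, n_samples=len(y_true), n_features=1)
print("R2:", r2)
print("Adjusted R2:", adj)
```

Adjusted R2 is never higher than R2 when at least one feature is used, because it penalizes the score for every additional feature.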

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

I ran several models, including linear regression, decision tree, and extreme gradient boosting. Among them I selected the extreme gradient boosting regressor, which achieved an R2 score of 97% on the training data and 92% on the test data. Some models were overfitted, so I did not consider them.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

**SHAP (SHapley Additive exPlanations)**

In [None]:
import shap
# Shap explainer for xgb model (tree based)
explainer = shap.TreeExplainer(xgb_model, X_train, feature_names=features)
# Plotting
shap.initjs()
# Select an instance from the test set
instance = X_test[50, :]

# Compute the SHAP values for the instance
shap_values = explainer(instance)

# Create the SHAP force plot
shap.plots.force(shap_values)


The force plot shows the SHAP values for a particular instance.

Here I have used the row at index 50 of the test set. The predicted value is 22.54 (on the square-root scale). The plot shows how much each column contributes to that prediction.
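
Assuming the target was square-root transformed before training, as the "(sqrt value)" note suggests, the prediction maps back to a rental count by squaring it. The variable names below are hypothetical.

```python
sqrt_scale_prediction = 22.54                   # value read from the force plot
predicted_count = sqrt_scale_prediction ** 2    # undo the assumed square-root transform
print(round(predicted_count))
```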

In [None]:
# Get shap values of test data
shap_values = explainer(X_test)

In [None]:
# Plotting the SHAP summary plot
shap.summary_plot(shap_values, X_test)

The summary plot shows the top 20 columns and their impact on the prediction. Red indicates a high value of the column and blue indicates a low value.

For the categorical (one-hot encoded) columns, the values are zeros and ones, where zero is blue and one is red.

The horizontal position of each point is its SHAP value, that is, its impact on the prediction. Points to the right have a positive impact (they increase the predicted value) and points to the left have a negative impact (they decrease the predicted value).
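
To make the sign convention concrete, the sketch below computes exact Shapley values for a tiny hand-written model by averaging each feature's marginal effect over all feature orderings. This is not what `shap.TreeExplainer` does internally (it uses tree-specific algorithms and a background dataset); it is a brute-force illustration against a single baseline point. The model `toy_model`, its coefficients, and the feature names are hypothetical.

```python
from itertools import permutations

def toy_model(temp, hour_is_peak, rain):
    """Hypothetical demand model, for illustration only."""
    return 10 + 0.5 * temp + 6 * hour_is_peak - 4 * rain + 0.2 * temp * hour_is_peak

def shapley_values(model, instance, baseline):
    """Exact Shapley values relative to a single baseline point.

    Features not yet switched on keep their baseline value; each feature's
    contribution is its average marginal effect over all feature orderings.
    """
    n = len(instance)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        prev = model(*current)
        for i in order:
            current[i] = instance[i]       # switch feature i to the instance value
            new = model(*current)
            totals[i] += new - prev        # marginal effect of feature i in this ordering
            prev = new
    return [t / len(orderings) for t in totals]

instance = (25.0, 1, 1)   # warm, peak hour, raining
baseline = (10.0, 0, 0)   # cool, off-peak, dry
phi = shapley_values(toy_model, instance, baseline)
base_value = toy_model(*baseline)

for name, value in zip(["temp", "hour_is_peak", "rain"], phi):
    direction = "raises" if value > 0 else "lowers"
    print(f"{name}: {value:+.2f} ({direction} the prediction)")

# Additivity: base value plus the sum of contributions equals the prediction
print(base_value + sum(phi), toy_model(*instance))
```

Here `rain` gets a negative value and would sit on the left of a summary plot, while `temp` and `hour_is_peak` get positive values and would sit on the right.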

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
import pickle
# Save the model to a pickle file
with open("xgb_model.pkl", "wb") as f:
  pickle.dump(xgb_model, f)

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
with open("xgb_model.pkl", "rb") as f:
    loaded_xgb_model = pickle.load(f)

new_test_preds = loaded_xgb_model.predict(X_test)

# Sanity Check
mse = mean_squared_error(y_test, new_test_preds)
mae = mean_absolute_error(y_test, new_test_preds)
r2 = r2_score(y_test, new_test_preds)
print("MSE:", mse)
print("MAE:", mae)
print("R2 Score:", r2)
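
A stricter sanity check is to confirm that the reloaded model gives the same predictions as the in-memory one. The sketch below shows the pattern on a hypothetical stand-in model (plain coefficients in a dict, with a hand-written `predict` function), serialized in memory rather than to a file.

```python
import pickle

# Hypothetical stand-in for a fitted model: plain coefficients in a dict
model = {"weights": [0.5, -1.2, 3.0], "bias": 2.0}

def predict(model, rows):
    """Linear prediction for each row of features."""
    return [sum(w * x for w, x in zip(model["weights"], row)) + model["bias"]
            for row in rows]

rows = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.5]]

# Round-trip through pickle in memory (a file works the same way)
loaded_model = pickle.loads(pickle.dumps(model))

original_preds = predict(model, rows)
reloaded_preds = predict(loaded_model, rows)
assert original_preds == reloaded_preds, "reloaded model disagrees with the original"
print("Round-trip predictions match:", reloaded_preds)
```

In this notebook the analogous check would compare `new_test_preds` with `xgb_model.predict(X_test)`, for example with `np.allclose`.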

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

The project successfully demonstrated the feasibility of using machine learning techniques to predict bike demand in Seoul.

Some of the key points are:

* Demand is high in the morning and evening.
* Demand is lower in the winter season.
* Demand is highest in June.
* Temperature and dew point temperature are multicollinear.
* Linear regression, decision tree, random forest, gradient boosting, and extreme gradient boosting models were fitted. Extreme gradient boosting gave the best R2 score: 97% on train and 92% on test.
* Removing outliers did not help; it reduced model performance.

Overall, the project highlights the potential of machine learning in solving real-world problems and provides a roadmap for future research in this area. The findings of this project can be extended to other cities with similar bike sharing systems, leading to more effective and efficient bike sharing operations, and better outcomes for all stakeholders.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***