<a href="https://colab.research.google.com/github/manishaachary13/Bike-Sharing-Demand/blob/main/Bike_sharing_demand_pred_MLCapstone.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - **Bike Sharing Demand Prediction**



##### **Project Type**    - Regression
##### **Contribution**    - Individual


# **Project Summary -**

The **Bike-Sharing Demand Prediction** project aims to develop a machine learning model capable of accurately forecasting the hourly demand for rental bikes in urban areas. As cities increasingly rely on bike-sharing programs to enhance mobility, reduce traffic congestion, and promote sustainable transportation, ensuring a stable supply of rental bikes becomes a crucial challenge. This project seeks to address this challenge by predicting the number of bikes required each hour, allowing service providers to maintain optimal availability and meet the varying demands of users efficiently.

To achieve this, the project leverages historical data on bike rentals, along with influential factors such as **weather conditions** (temperature, humidity, wind speed, rainfall, etc.), **time-based features** (hour of the day, day of the week, and season), and **holiday or event indicators**. By analyzing these factors, the model aims to capture the complex, non-linear relationships that affect bike demand.

The process begins with data exploration and cleaning, followed by feature engineering to create meaningful inputs for the model. Advanced regression techniques, such as Random Forest Regression, Gradient Boosting, or XGBoost, will be employed to train the model and accurately predict hourly demand. Through rigorous testing and validation, the model will be optimized to ensure its accuracy and reliability.

The outcome of this project will be a robust predictive tool that allows bike-sharing companies to manage their fleet more efficiently, minimize customer wait times, and reduce operational costs. By providing accurate demand forecasts, this project will support better decision-making for fleet distribution and improve the overall user experience, ultimately contributing to the success and sustainability of urban bike-sharing programs.

---


# **GitHub Link -**

https://github.com/manishaachary13/Bike-Sharing-Demand

# **Problem Statement**




As urban populations grow, so does the need for efficient, sustainable transportation solutions. Bike-sharing systems have emerged as an increasingly popular option in many cities, offering a flexible, eco-friendly way to reduce traffic congestion and pollution. However, managing these systems presents unique challenges, particularly when it comes to maintaining an optimal balance between bike availability and user demand.

The central issue is **predicting the number of bikes needed at specific locations at different times of the day**. Demand for bike rentals fluctuates dramatically based on various factors, including the time of day, weather conditions, and whether it is a holiday or a regular working day. Without an accurate forecasting mechanism, bike-sharing providers risk under-serving customers during peak demand periods or wasting resources during off-peak times.

This problem is compounded by the influence of **weather conditions** (temperature, humidity, windspeed, etc.), which can significantly affect bike usage. For instance, a sudden drop in temperature or an unexpected rainstorm can lead to a sharp decline in rentals, while pleasant weather might result in an increase. Furthermore, factors like **holidays** or **city events** introduce irregular patterns in bike demand, making it even harder to predict usage accurately.

Given this complexity, developing a robust machine learning model that can accurately predict bike-sharing demand on an hourly basis is critical. The ability to forecast bike rental needs will not only help optimize bike availability across the city but also enhance customer satisfaction by minimizing wait times. Moreover, it will lead to better resource management, ensuring that bikes are neither over-supplied nor under-utilized, which in turn can reduce operational costs for the service provider.

The goal of this project is to design a **demand prediction model** that accounts for the various factors influencing bike rentals—such as time-based trends (e.g., morning or evening rush hours), weather conditions, and special events—by analyzing historical bike rental data. With this model, we aim to provide bike-sharing companies with a tool to **anticipate demand more effectively**, ensuring that the right number of bikes is available at the right time and place.

---


# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
# Basic libraries
import numpy as np
import pandas as pd
from google.colab import drive

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score

# For model tuning and validation
from sklearn.model_selection import GridSearchCV

# Additional libraries
import warnings
warnings.filterwarnings("ignore")  # Suppress warnings


### Dataset Loading

In [None]:
# Step 1: Mount Google Drive
drive.mount('/content/drive', force_remount=True)

# Step 2: Load the dataset
file_path = '/content/drive/MyDrive/SeoulBikeData.csv'

try:

    df = pd.read_csv(file_path, encoding='ISO-8859-1')
    print("File loaded successfully!")
except FileNotFoundError:
    print("File not found. Please check the file path.")
except pd.errors.ParserError:
    print("Error parsing the file. Please check the file format.")
except Exception as e:
    print(f"An error occurred: {e}")



### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_count = df.duplicated().sum()
print(f"Total duplicate rows: {duplicate_count}")

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

print(df.isnull().sum())


In [None]:
# Visualizing the missing values
sns.heatmap(df.isnull(), cbar=False)

### What did you know about your dataset?

The dataset has 8760 rows and 14 columns. It contains no missing values and no duplicate rows.
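
The row count fits a full year of hourly records: 365 days × 24 hours = 8760. The sketch below shows one way that completeness could be checked. It uses a synthetic frame with an arbitrary 365-day range rather than the real file, and `demo` and `incomplete_days` are illustrative names.

```python
import pandas as pd

# Synthetic stand-in for the real data: one row per (date, hour).
# The column names match the dataset; the date range is an arbitrary 365-day span.
dates = pd.date_range("2017-12-01", "2018-11-30", freq="D")
grid = pd.MultiIndex.from_product([dates, range(24)], names=["Date", "Hour"])
demo = grid.to_frame(index=False)

# 365 days x 24 hours
expected_rows = len(dates) * 24

# Any day whose row count differs from 24 has missing or repeated hours
rows_per_day = demo.groupby("Date")["Hour"].count()
incomplete_days = rows_per_day[rows_per_day != 24]

print(f"expected={expected_rows}, actual={len(demo)}, incomplete days={len(incomplete_days)}")
```

On the real data, the same `groupby` could be run on `df` once `Date` has been parsed.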

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe(include='all')

### Variables Description



1. **Date**: The specific day the bike rental data was recorded.
2. **Rented Bike Count**: The number of bikes rented during the hour.
3. **Hour**: The hour of the day when the rental record was taken (0-23).
4. **Temperature(°C)**: The recorded temperature in Celsius at the time of the rental.
5. **Humidity(%)**: The percentage of humidity in the air at the time.
6. **Wind speed (m/s)**: The wind speed in meters per second at the time.
7. **Visibility (10m)**: The distance of visibility in units of 10 meters.
8. **Dew Point Temperature(°C)**: The temperature at which moisture begins to condense.
9. **Solar Radiation (MJ/m2)**: The amount of solar energy reaching the ground, in megajoules per square meter.
10. **Rainfall(mm)**: The amount of rainfall in millimeters during the hour.
11. **Snowfall (cm)**: The amount of snowfall in centimeters during the hour.
12. **Seasons**: The season during which the data was recorded (Winter, Spring, Summer, Autumn).
13. **Holiday**: Indicates whether the day was a holiday (Holiday, No Holiday).
14. **Functioning Day**: Indicates whether bike rentals were operational (Yes or No).

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
print("No. of unique values in \n")
for i in df.columns.tolist():
  print(i,"=",df[i].nunique(),".")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Convert the Date column to datetime format, specifying that the day comes first
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)


In [None]:
# Create additional features such as day of the week, month, and year from the Date column
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['DayOfWeek'] = df['Date'].dt.dayofweek
df['Weekday'] = df['Date'].dt.weekday.apply(lambda x: 'Weekday' if x < 5 else 'Weekend')


In [None]:
# Convert Seasons, Holiday, and Functioning Day to categorical data types
df['Seasons'] = df['Seasons'].astype('category')
df['Holiday'] = df['Holiday'].astype('category')
df['Functioning Day'] = df['Functioning Day'].astype('category')


In [None]:
# Importing MinMaxScaler from scikit-learn to scale the features
from sklearn.preprocessing import MinMaxScaler

# Initialize MinMaxScaler to normalize the specified features between 0 and 1
scaler = MinMaxScaler()

# Apply MinMaxScaler to scale the specified columns
df[['Temperature(°C)', 'Wind speed (m/s)', 'Visibility (10m)',
    'Solar Radiation (MJ/m2)', 'Rainfall(mm)', 'Snowfall (cm)']] = scaler.fit_transform(
    df[['Temperature(°C)', 'Wind speed (m/s)', 'Visibility (10m)',
        'Solar Radiation (MJ/m2)', 'Rainfall(mm)', 'Snowfall (cm)']])
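
The cell above fits the scaler on the full dataset. If the scaled columns later feed a model that is evaluated on held-out rows, the minimum and maximum would normally be learned from the training rows only. The sketch below shows that idea with plain pandas on synthetic data. The helper names (`fit_min_max`, `apply_min_max`) and the 80/20 chronological split are assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

def fit_min_max(train, columns):
    """Learn per-column minimum and range from the training rows only."""
    col_min = train[columns].min()
    # Replace a zero range with 1 so constant columns do not divide by zero
    col_range = (train[columns].max() - col_min).replace(0, 1.0)
    return col_min, col_range

def apply_min_max(frame, columns, col_min, col_range):
    """Scale `columns` with previously learned statistics; other columns are untouched."""
    scaled = frame.copy()
    scaled[columns] = (frame[columns] - col_min) / col_range
    return scaled

# Synthetic weather-like data (values are made up)
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "Temperature(°C)": rng.normal(12, 10, size=200),
    "Wind speed (m/s)": rng.uniform(0, 7, size=200),
})
cols = list(demo.columns)

# Assumed chronological split: first 80% of rows for training, the rest for testing
cut = int(len(demo) * 0.8)
train, test = demo.iloc[:cut], demo.iloc[cut:]

col_min, col_range = fit_min_max(train, cols)
train_scaled = apply_min_max(train, cols, col_min, col_range)
test_scaled = apply_min_max(test, cols, col_min, col_range)  # may fall slightly outside [0, 1]
```

With scikit-learn the same pattern is `scaler.fit(X_train)` followed by `scaler.transform(X_test)`.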


In [None]:
# Capture temperature during the morning hours (before 12 PM)
df['Temp_Morning'] = df['Temperature(°C)'] * (df['Hour'] < 12).astype(int)

# Capture temperature during the afternoon hours (12 PM to 5:59 PM)
df['Temp_Afternoon'] = df['Temperature(°C)'] * ((df['Hour'] >= 12) & (df['Hour'] < 18)).astype(int)

# Capture temperature during the evening hours (6 PM and onwards)
df['Temp_Evening'] = df['Temperature(°C)'] * (df['Hour'] >= 18).astype(int)

# Display the 'Temp_Morning' column
df['Temp_Morning']


In [None]:
# Create interactions between rainfall and specific hours
df['Rain_Morning'] = df['Rainfall(mm)'] * (df['Hour'] < 12)
df['Rain_Evening'] = df['Rainfall(mm)'] * (df['Hour'] >= 18)


In [None]:
# Capture the interaction between wind speed and hours
df['Wind_Morning'] = df['Wind speed (m/s)'] * (df['Hour'] < 12)
df['Wind_Evening'] = df['Wind speed (m/s)'] * (df['Hour'] >= 18)


In [None]:
#  Create interactions between Seasons and Temperature
df['Winter_Temp'] = df['Temperature(°C)'] * (df['Seasons'] == 'Winter')
df['Summer_Temp'] = df['Temperature(°C)'] * (df['Seasons'] == 'Summer')
# The dataset labels this season 'Autumn' (see Variables Description), so match on that value
df['Fall_Temp'] = df['Temperature(°C)'] * (df['Seasons'] == 'Autumn')
df['Spring_Temp'] = df['Temperature(°C)'] * (df['Seasons'] == 'Spring')


In [None]:
#  Create an interaction between functioning days and specific peak hours
df['Peak_Hour_Functioning'] = (df['Hour'].isin([7, 8, 17, 18])) & (df['Functioning Day'] == 'Yes')


In [None]:
# Create an interaction between humidity and temperature
df['Humid_Temp'] = df['Humidity(%)'] * df['Temperature(°C)']


In [None]:
#  Create an interaction between visibility and morning/evening hours
df['Visibility_Morning'] = df['Visibility (10m)'] * (df['Hour'] < 12)
df['Visibility_Evening'] = df['Visibility (10m)'] * (df['Hour'] >= 18)


In [None]:
# Convert 'Date' column to datetime if not done already
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)

# Aggregate the rented bike count per day
daily_df = df.groupby('Date').agg({'Rented Bike Count': 'sum'}).reset_index()

# Rename the column to reflect daily aggregation
daily_df.rename(columns={'Rented Bike Count': 'Daily Rented Bike Count'}, inplace=True)

# Display the first few rows of the daily data
print(daily_df.head())


In [None]:
# Ensure the 'Date' column is a datetime object
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)

# Set 'Date' as the index to use time-based resampling
df.set_index('Date', inplace=True)

# Resample to weekly level (W) and sum the rented bike count for each week
weekly_df = df['Rented Bike Count'].resample('W').sum().reset_index()

# Rename the column to reflect weekly aggregation
weekly_df.rename(columns={'Rented Bike Count': 'Weekly Rented Bike Count'}, inplace=True)

# Display the first few rows of the weekly data
print(weekly_df.head())


In [None]:
# Mean of rented bike count per day
daily_mean = df.groupby('Date').agg({'Rented Bike Count': 'mean'}).reset_index()
daily_mean.rename(columns={'Rented Bike Count': 'Daily Mean Rented Bike Count'}, inplace=True)

# Median of rented bike count per day
daily_median = df.groupby('Date').agg({'Rented Bike Count': 'median'}).reset_index()
daily_median.rename(columns={'Rented Bike Count': 'Daily Median Rented Bike Count'}, inplace=True)


In [None]:
daily_median

In [None]:
# Minimum rented bike count per day
daily_min = df.groupby('Date').agg({'Rented Bike Count': 'min'}).reset_index()
daily_min.rename(columns={'Rented Bike Count': 'Daily Min Rented Bike Count'}, inplace=True)

# Maximum rented bike count per day
daily_max = df.groupby('Date').agg({'Rented Bike Count': 'max'}).reset_index()
daily_max.rename(columns={'Rented Bike Count': 'Daily Max Rented Bike Count'}, inplace=True)


In [None]:
# Standard deviation of rented bike count per day
daily_std = df.groupby('Date').agg({'Rented Bike Count': 'std'}).reset_index()
daily_std.rename(columns={'Rented Bike Count': 'Daily Std Rented Bike Count'}, inplace=True)


In [None]:
# Count of hourly entries per day (useful for identifying missing data)
daily_count = df.groupby('Date').agg({'Rented Bike Count': 'count'}).reset_index()
daily_count.rename(columns={'Rented Bike Count': 'Daily Count of Hours'}, inplace=True)
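
The daily statistics above are computed with one `groupby` per statistic. The sketch below shows how the same figures could be collected in a single pass with pandas named aggregation. The data is synthetic and the output column names are illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic hourly data for three days; the counts are made up
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Date": pd.date_range("2018-01-01", periods=3, freq="D").repeat(24),
    "Rented Bike Count": rng.integers(0, 500, size=72),
})

# One groupby, one output column per statistic
daily_stats = (
    demo.groupby("Date")["Rented Bike Count"]
    .agg(
        daily_sum="sum",
        daily_mean="mean",
        daily_median="median",
        daily_min="min",
        daily_max="max",
        daily_std="std",
        hours_recorded="count",  # should be 24 for a complete day
    )
    .reset_index()
)

print(daily_stats)
```

A single frame like `daily_stats` keeps the statistics aligned by date, so no merging is needed.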


In [None]:
# Running (cumulative) total of rented bikes across the whole dataset; it does not reset each day
df['Cumulative_Rented_Bikes'] = df['Rented Bike Count'].cumsum()


In [None]:
# Mode of rented bike count per day (finds the most common rental count)
# Select the target column first so the lambda is not applied to every column in df
daily_mode = df.groupby('Date')['Rented Bike Count'].agg(lambda x: x.mode().iloc[0]).reset_index()
daily_mode.rename(columns={'Rented Bike Count': 'Daily Mode Rented Bike Count'}, inplace=True)


In [None]:
# Rolling 7-day moving average of rented bike count
# The data is hourly, so a 7-day window spans 24 * 7 = 168 rows
df['7-day Moving Average'] = df['Rented Bike Count'].rolling(window=24 * 7).mean()


In [None]:
# Aggregating by seasons and calculating the mean bike rentals
seasonal_agg = df.groupby('Seasons').agg({'Rented Bike Count': 'mean'}).reset_index()
seasonal_agg.rename(columns={'Rented Bike Count': 'Average Rented Bike Count'}, inplace=True)


In [None]:
# Aggregating by hour to see peak bike rental hours
hourly_agg = df.groupby('Hour').agg({'Rented Bike Count': ['mean', 'sum']}).reset_index()
hourly_agg.columns = ['Hour', 'Mean Hourly Rentals', 'Total Hourly Rentals']


In [None]:
# Creating lag features for hourly rented bike counts
df['Lag_1'] = df['Rented Bike Count'].shift(1)  # Previous hour
df['Lag_2'] = df['Rented Bike Count'].shift(2)  # Two hours ago
df['Lag_24'] = df['Rented Bike Count'].shift(24)  # Same hour, previous day

# Dropping NaN rows created due to lagging
df.dropna(inplace=True)


In [None]:
# Rolling average and standard deviation
df['Rolling_Mean_3'] = df['Rented Bike Count'].rolling(window=3).mean()
df['Rolling_Std_3'] = df['Rented Bike Count'].rolling(window=3).std()
df['Rolling_Mean_7'] = df['Rented Bike Count'].rolling(window=7).mean()
df['Rolling_Std_7'] = df['Rented Bike Count'].rolling(window=7).std()

# Dropping rows with NaN caused by rolling windows
df.dropna(inplace=True)
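
A rolling window computed directly on `Rented Bike Count` includes the current hour's own value. If these columns are later used to predict that same value, the target leaks into the features. The sketch below shifts the series by one row before rolling, so each window covers only earlier hours. The column name `Past_Mean_3` is hypothetical and the data is a toy series.

```python
import numpy as np
import pandas as pd

# Toy hourly counts 1, 2, ..., 10 so the arithmetic is easy to follow
demo = pd.DataFrame({"Rented Bike Count": np.arange(1, 11, dtype=float)})
target = demo["Rented Bike Count"]

# Lag feature: the value from the previous hour
demo["Lag_1"] = target.shift(1)

# Shift first, then roll: the window covers the three hours BEFORE the current one
demo["Past_Mean_3"] = target.shift(1).rolling(window=3).mean()

# For comparison: the unshifted window includes the current hour's own value
demo["Rolling_Mean_3"] = target.rolling(window=3).mean()

# Drop the leading rows where the windows are not yet full
features = demo.dropna().reset_index(drop=True)
print(features.head())
```

For the hour with count 4, `Past_Mean_3` is the mean of 1, 2, 3, while `Rolling_Mean_3` is the mean of 2, 3, 4 and therefore already contains the value being predicted.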


In [None]:
# Interaction between temperature and humidity (discomfort index)
df['Temp_Humidity_Interaction'] = df['Temperature(°C)'] * df['Humidity(%)']

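# Interaction between solar radiation and seasons
# Note: cat.codes are arbitrary integer labels assigned in alphabetical order
# (Autumn=0, Spring=1, Summer=2, Winter=3), so this product is always 0 for Autumn rows
df['Seasonal_Radiation'] = df['Solar Radiation (MJ/m2)'] * df['Seasons'].cat.codes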


In [None]:
# Combining rare categories in Holidays
rare_categories = df['Holiday'].value_counts()[df['Holiday'].value_counts() < 50].index
df['Holiday'] = df['Holiday'].replace(rare_categories, 'Rare')


### What all manipulations have you done and insights you found?

**Manipulations Done**

1. **Data Cleaning**

* Checked for missing and duplicate values; the dataset contains neither.
* Dropped the rows left with missing values after creating lag and rolling features (`dropna`).

2. **Feature Transformation**

* **Type Conversion**: Parsed Date as a datetime and converted Seasons, Holiday, and Functioning Day to categorical types.
* **Scaling**: Applied MinMaxScaler to normalize Temperature, Wind Speed, Visibility, Solar Radiation, Rainfall, and Snowfall to the 0-1 range.
* **Cyclical Encoding**: Hour and Month are periodic, so cyclical (sine/cosine) encoding suits them; the wrangling cells above do not apply it yet.

3. **Feature Engineering**

* **Time-based Features**: Extracted Year, Month, Day of the Week, and Weekday/Weekend from the Date column, and aggregated bike rentals by day and week for temporal analysis.
* **Lag Features**: Created lagged variables Lag_1, Lag_2, and Lag_24 to capture temporal dependencies.
* **Rolling Features**: Generated rolling mean and standard deviation over 3-hour and 7-hour windows to identify short-term trends and volatility.
* **Feature Interactions**: Created interactions between variables, such as Temperature × Humidity as a discomfort index, Solar Radiation × Seasons for seasonal weather effects, and weather variables × time of day (morning, afternoon, evening).

4. **Handling Outliers**

* The IQR method can be used to cap extreme values in features like Temperature, Humidity, and Rented Bike Count; no capping is applied in the wrangling cells above.

5. **Category Optimization**

* Combined any Holiday categories with fewer than 50 rows into a single "Rare" category to ensure robustness in model training.

6. **Data Aggregation**

* Resampled data to daily and weekly levels for trend analysis.
* Calculated daily mean, median, min, max, and standard deviation for Rented Bike Count.
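
The summary lists cyclical encoding of Hour and Month, which the wrangling cells do not show. The sketch below illustrates the idea: each value is mapped to a point on a circle, so hour 23 ends up next to hour 0. The helper names `add_cyclical` and `circle_distance` are hypothetical, and this is not necessarily the encoding used elsewhere in the project.

```python
import numpy as np
import pandas as pd

def add_cyclical(frame, column, period):
    """Add sine/cosine columns that place `column` on a circle of length `period`."""
    angle = 2 * np.pi * frame[column] / period
    out = frame.copy()
    out[f"{column}_sin"] = np.sin(angle)
    out[f"{column}_cos"] = np.cos(angle)
    return out

def circle_distance(frame, column, a, b):
    """Euclidean distance between the encoded points for values a and b."""
    cols = [f"{column}_sin", f"{column}_cos"]
    pa = frame.loc[frame[column] == a, cols].to_numpy()[0]
    pb = frame.loc[frame[column] == b, cols].to_numpy()[0]
    return float(np.linalg.norm(pa - pb))

demo = add_cyclical(pd.DataFrame({"Hour": range(24)}), "Hour", period=24)

# Midnight and 11 PM are as close as midnight and 1 AM
print(circle_distance(demo, "Hour", 23, 0), circle_distance(demo, "Hour", 0, 1))
```

For Month, the same helper would be called with `period=12`.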

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 Line plot of Hourly Bike Rentals

hourly_agg = df.groupby('Hour').agg({'Rented Bike Count': 'mean'}).reset_index()
plt.figure(figsize=(10, 6))
sns.lineplot(data=hourly_agg, x='Hour', y='Rented Bike Count', marker='o')
plt.title('Average Hourly Bike Rentals')
plt.xlabel('Hour of Day')
plt.ylabel('Average Rented Bikes')
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?

To analyze bike rental trends across different times of the day and identify peak hours.

##### 2. What is/are the insight(s) found from the chart?

* Bike rentals show two distinct peaks during the day:
  * Morning (7–9 AM): high demand from commuters traveling to work or school.
  * Evening (5–7 PM): high demand as commuters return home.
* These peaks reflect predictable rush hours, with lower demand in other periods such as late night and midday.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Yes.** The insights can help in the following ways:

1. Operational Efficiency:
* Deploying more bikes during peak hours ensures that customer demand is met without delays.
* Reducing bike availability during off-peak hours helps save operational costs.
2. Infrastructure Optimization:
* Maintenance schedules can be planned during off-peak hours to minimize disruption to users.
* Stations near offices, schools, and public transport hubs can be prioritized during peak hours.
3. Customer Satisfaction:
* Meeting demand during rush hours reduces wait times, enhancing the user experience.

**Potential Negative Impacts:**

* Over-dependence on peak hours
* Rush-hour bottlenecks

**Justification:**

While the peak hours are vital, neglecting other hours or failing to optimize bike supply effectively can cause challenges like overcrowded docking stations or unmet demand during non-peak periods. Balancing availability across all hours is essential for consistent growth.

#### Chart - 2

In [None]:
# Chart - 2
# Bar plot for weekday/weekend analysis
weekday_agg = df.groupby('Weekday').agg({'Rented Bike Count': 'mean'}).reset_index()
sns.barplot(data=weekday_agg, x='Weekday', y='Rented Bike Count', palette='viridis')
plt.title('Bike Rentals: Weekdays vs Weekends')
plt.xlabel('Day Type')
plt.ylabel('Average Rented Bikes')
plt.show()


##### 1. Why did you pick the specific chart?

To compare bike rental patterns between weekdays and weekends.


##### 2. What is/are the insight(s) found from the chart?

1. **Higher Average Rentals on Weekdays:**
* Weekdays have an average of over 700 bike rentals, likely driven by commuting patterns during working days.
2. **Lower Average Rentals on Weekends:**
* Weekends see approximately 650 rentals, possibly due to a more relaxed schedule and fewer commuters.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Yes**. These insights can be leveraged as follows:

1. **Strategic Resource Allocation:**
* Deploy additional bikes on weekdays near business districts and public transport hubs to cater to commuter traffic.
* For weekends, focus on leisure destinations like parks, tourist spots, and recreational areas.
2. **Customized Marketing Campaigns:**
* Promote weekday commuter passes or subscriptions targeting office workers.
* Weekend deals or discounts can encourage more leisure riders, boosting demand.
3. **Station Management:**
* Weekday morning and evening stations near workplaces may need more bikes, while weekend midday stations near leisure spots should be stocked accordingly.

**Potential Negative Impacts:**

1. **Weekend Underperformance:**
* Lower weekend rentals may indicate underutilization of resources, leading to missed opportunities for revenue growth.
2. **Over-reliance on Weekday Commuters:**
* Heavy dependence on weekday commuters could pose a risk if work-from-home trends or alternative transport modes gain popularity.

**Justification:**

While weekday demand is strong, failing to capitalize on weekend opportunities or diversify customer segments could limit overall growth. Designing flexible strategies to balance weekday and weekend performance is essential.

#### Chart - 3

In [None]:
# Chart - 3
# Heatmap for hourly rentals by day of the week
heatmap_data = df.pivot_table(index='Hour', columns='DayOfWeek', values='Rented Bike Count', aggfunc='mean')
sns.heatmap(heatmap_data, cmap='coolwarm', annot=False)
plt.title('Hourly Bike Rentals by Day of the Week')
plt.xlabel('Day of Week')
plt.ylabel('Hour of Day')
plt.show()


##### 1. Why did you pick the specific chart?

To explore the combined effect of hour and weekday on bike rentals.


##### 2. What is/are the insight(s) found from the chart?

1. **Peak Rentals in the Early Morning (7-9) and Late Afternoon (17-19):**
* The heatmap clearly shows high bike rentals in the morning (7 AM to 9 AM) and late afternoon (5 PM to 7 PM). This suggests that these times are when commuters are most likely renting bikes for their daily routines, likely traveling to and from work or school.
2. **Low Rentals During Night Hours (10 PM - 6 AM):**
* The demand for bike rentals sharply declines after 9 PM, likely because fewer people are using bikes during late-night hours.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Yes.** These insights are valuable for business strategies:

1. **Resource Allocation and Fleet Management:**

* Increase bike availability in peak hours (7-9 AM and 5-7 PM): Ensure more bikes are available in the morning and late afternoon to meet high demand.
* Reduce fleet size in night hours: Fewer bikes may be needed during night hours (10 PM - 6 AM), reducing operational costs.
2. **Targeted Marketing and Promotions:**
* Morning and Evening Campaigns: Tailor marketing efforts (e.g., discounts for early riders) to cater to commuters during high-demand periods.
3. **Efficient Use of Resources:**
* Monitoring peak hours will help in managing bike availability, ensuring service levels match demand and reducing costs during off-peak hours.

**Potential Negative Impact:**

* **Under-utilization of Resources during Night Hours:**
The extremely low demand for rentals during night hours could lead to under-utilization of resources if bikes are stationed in locations with little traffic.
* **Potential Missed Opportunity for Late-Night Riders:**
There may still be potential for promoting late-night bike rentals for people involved in nightlife or shift work. Ignoring this segment might limit growth.

#### Chart - 4

In [None]:
# Chart - 4
# Scatter plot for temperature vs rentals
# Note: 'Temperature(°C)' was min-max scaled during data wrangling, so the x-axis runs from 0 to 1
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='Temperature(°C)', y='Rented Bike Count', alpha=0.6)
plt.title('Temperature vs Bike Rentals')
plt.xlabel('Temperature (min-max scaled)')
plt.ylabel('Rented Bikes')
plt.show()


##### 1. Why did you pick the specific chart?

To examine the relationship between temperature and bike rentals.

##### 2. What is/are the insight(s) found from the chart?

1. **Positive Correlation Between Temperature and Rentals:**
* The scatter plot shows a positive correlation between temperature and bike rentals. As the temperature increases (from lower to higher values on the X-axis), the number of rented bikes also increases. This suggests that people are more likely to rent bikes in warmer weather, which is expected for outdoor activities.
2. **Flat Low-Temperature Rentals:**
* At the cold end of the temperature range, the number of rentals stays low, indicating that cold weather likely discourages people from renting bikes.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Yes.** The insights gained can significantly improve business strategies:

1. **Seasonal Demand Planning**:

* Increase marketing efforts and bike availability during warmer months: Higher bike rentals during warmer weather indicate a peak demand in summer or spring months. The company can plan for higher fleet availability and promotions during these months.
* Promote winter packages or deals for colder months: For colder months, consider offering bundled deals, such as heated bike rentals, to keep customers engaged year-round.
2. **Pricing Strategies Based on Temperature**:

* Dynamic pricing model: Implement a pricing strategy where bike rental prices are lower on colder days and higher during warmer weather when demand spikes.
3. **Targeted Campaigns**:

* Design weather-dependent campaigns, such as discounts on cooler days or loyalty programs for regular customers who rent bikes in all seasons.

**Potential Negative Impact:**

* **Seasonal Dependence on Warm Weather:**
The high correlation between temperature and bike rentals means the business could struggle with lower revenues during colder months. Over-relying on peak seasons could hinder consistent growth.
* **Underestimating the Winter Market:**
If the business ignores the winter months due to lower bike rentals, it could lose out on potential customers who might prefer indoor biking or alternate transportation options during the colder season.

#### Chart - 5

In [None]:
# Chart - 5
# Box plot for seasonal rentals
sns.boxplot(data=df, x='Seasons', y='Rented Bike Count', palette='Set2')
plt.title('Seasonal Variation in Bike Rentals')
plt.xlabel('Season')
plt.ylabel('Rented Bikes')
plt.show()


##### 1. Why did you pick the specific chart?

To analyze seasonal variations in rentals and their spread.


##### 2. What is/are the insight(s) found from the chart?

1. **Bike rentals vary significantly across seasons:**
* Summer and Autumn exhibit the highest rental counts, with a broader interquartile range indicating higher demand and variability.
* Winter shows the lowest rentals, likely due to adverse weather conditions.
2. **Outliers are present in all seasons**, indicating occasional hours with exceptionally high rentals (each row is an hourly count).
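
Box plots mark as outliers the points lying more than 1.5 × IQR beyond the quartiles (seaborn's default is `whis=1.5`). The sketch below applies the same rule to a small made-up series, first to list the outliers and then to cap them. `iqr_bounds` is a hypothetical helper, and on the real data it would typically be applied per season.

```python
import numpy as np
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the first and third quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up hourly counts with one extreme value
counts = pd.Series([100, 120, 130, 150, 160, 180, 200, 2000], dtype=float)

low, high = iqr_bounds(counts)

# Points outside the fences are the ones a box plot draws individually
outliers = counts[(counts < low) | (counts > high)]

# Capping replaces out-of-range values with the nearest fence
capped = counts.clip(lower=low, upper=high)

print(f"fences=({low}, {high}), outliers={outliers.tolist()}")
```

Whether to cap the target itself is a modelling choice: capping `Rented Bike Count` hides the genuine peak-demand hours that the forecast is meant to anticipate.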

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Impact:**
* Focus marketing campaigns, promotions, and fleet availability in Summer and Autumn to maximize revenue.

**Negative Impact:**
* Winter poses challenges for business due to low demand, requiring cost optimization or alternative strategies during this season.


#### Chart - 6

In [None]:
# Chart - 6
# Correlation heatmap
# Select only numerical columns for the correlation matrix
numeric_df = df.select_dtypes(include=['float64', 'int64'])

# Plot the correlation heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(numeric_df.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap of Features')
plt.show()


##### 1. Why did you pick the specific chart?

To identify strong relationships between numerical features.

##### 2. What is/are the insight(s) found from the chart?

**Insights:**
1. Temperature and solar radiation positively correlate with bike rentals, indicating higher demand during warmer and sunnier conditions.
2. Rainfall and snowfall are negatively correlated with rentals, showing a decline in bike usage during adverse weather conditions.
3. Time-related features such as hour and lag variables also show notable correlations, reflecting seasonal or hourly patterns.
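
A full heatmap can be hard to scan when there are many engineered columns. The sketch below pulls out only the target's column of the correlation matrix and ranks features by absolute correlation. The data is synthetic, and the coefficients used to generate it are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic data: rentals rise with temperature and fall with rainfall (made-up coefficients)
rng = np.random.default_rng(1)
n = 500
temperature = rng.normal(15, 8, n)
rainfall = rng.exponential(1.0, n)
demo = pd.DataFrame({
    "Temperature(°C)": temperature,
    "Rainfall(mm)": rainfall,
    "Unrelated": rng.normal(0, 1, n),
    "Rented Bike Count": 30 * temperature - 120 * rainfall + rng.normal(0, 50, n),
})

target = "Rented Bike Count"

# Pearson correlation of every feature with the target
corr_with_target = demo.corr()[target].drop(target)

# Rank by strength, ignoring sign
ranked = corr_with_target.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

Pearson correlation only measures linear association, so a feature with a strong but non-monotonic effect (such as Hour, with its two daily peaks) can rank low here and still matter to a tree-based model.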


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact**

**Yes.**
* Focusing on temperature and solar radiation can help optimize bike availability and maintenance during peak weather conditions.
* Strategies like promoting rain gear or providing discounts during rainy or snowy days can mitigate negative impacts and boost rentals.

**Justification**
* High negative correlations with rainfall and snowfall suggest potential downtime or loss of rentals during adverse weather. Without proper mitigation, this could negatively impact revenue.

#### Chart - 7

In [None]:
# Chart - 7
# Histogram for rented bike count
plt.figure(figsize=(8, 6))
sns.histplot(data=df, x='Rented Bike Count', bins=30, kde=True)
plt.title('Distribution of Rented Bike Count')
plt.xlabel('Rented Bike Count')
plt.ylabel('Frequency')
plt.show()


##### 1. Why did you pick the specific chart?

To understand the distribution of bike rentals.


##### 2. What is/are the insight(s) found from the chart?

* The data is positively skewed, with most hourly observations having rental counts below 500.
* Rental counts above 2500 are rare but do occur, indicating high-demand hours likely tied to favorable weather or events.
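
The skew described above can be quantified rather than read off the histogram. This is a minimal sketch on synthetic right-skewed counts (the `counts` series is made up for illustration and is not the project data); on the real data, the same two lines would be applied to `df['Rented Bike Count']`.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed counts standing in for hourly rentals (illustrative only)
rng = np.random.default_rng(42)
counts = pd.Series(rng.lognormal(mean=6.0, sigma=0.8, size=5000)).round()

skew_raw = counts.skew()                  # > 0 means a long right tail
share_below_500 = (counts < 500).mean()   # fraction of observations under 500

print(f"Skewness: {skew_raw:.2f}")
print(f"Share of hours below 500 rentals: {share_below_500:.1%}")
```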

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Impact:**
* Understanding that most hours have low to moderate demand helps size the fleet and avoid overstocking.
* Special preparation for high-demand hours can further improve customer satisfaction and revenue.

**Negative Impact:**
* Because high-demand hours are infrequent, consistently provisioning inventory for peak scenarios would leave bikes idle most of the time.


#### Chart - 8

In [None]:
# Chart - 8
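# Pair plot for key features
g = sns.pairplot(data=df, vars=['Rented Bike Count', 'Temperature(°C)', 'Humidity(%)', 'Wind speed (m/s)'], diag_kind='kde')
# pairplot returns a PairGrid, so set a figure-level title (plt.title would only label the last subplot)
g.fig.suptitle('Pair Plot of Key Features', y=1.02)
plt.show()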


##### 1. Why did you pick the specific chart?

To observe pairwise relationships and distributions of key features.


##### 2. What is/are the insight(s) found from the chart?

1. A positive correlation exists between temperature (°C) and the rented bike count, suggesting more bikes are rented during warmer weather.
2. Humidity and wind speed show weak or no clear correlation with rented bike counts.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

 **Positive Business Impact:**
 * Identifying the relationship between temperature and bike rentals can guide seasonal marketing campaigns or bike availability adjustments.


#### Chart - 9

In [None]:
# Chart - 9
# Weekly rentals line plot
plt.figure(figsize=(10, 6))
sns.lineplot(data=weekly_df, x='Date', y='Weekly Rented Bike Count', marker='o', label='Weekly Rentals')
plt.title('Weekly Bike Rentals Over Time')
plt.xlabel('Date')
plt.ylabel('Weekly Rented Bikes')
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?

To observe long-term trends in bike rentals.

##### 2. What is/are the insight(s) found from the chart?

1. Weekly rentals steadily increase from January, peaking during summer and declining in autumn and winter.
2. Seasonality significantly impacts demand, with summer showing the highest rentals.
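
The weekly series plotted above comes from `weekly_df`, which is built elsewhere in the notebook. As a hedged illustration of how such a frame could be derived from hourly rows, the sketch below resamples synthetic data; `hourly` and `weekly_demo` are hypothetical names, and it assumes the `Date` column is a full timestamp.

```python
import numpy as np
import pandas as pd

# Two weeks of synthetic hourly data starting on a Monday; one rental per hour
hours = 14 * 24
hourly = pd.DataFrame({
    'Date': pd.Timestamp('2018-01-01') + pd.to_timedelta(np.arange(hours), unit='h'),
    'Rented Bike Count': 1,
})

# Sum hourly counts into calendar weeks (weekly bins end on Sunday by default)
weekly_demo = (
    hourly.set_index('Date')['Rented Bike Count']
    .resample('W')
    .sum()
    .rename('Weekly Rented Bike Count')
    .reset_index()
)
print(weekly_demo)
```

Summing (rather than averaging) keeps the weekly value comparable to total fleet usage; partial weeks at the start or end of the data will show lower totals and should be read with that in mind.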

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**
* Businesses can prepare for peak summer demand by increasing inventory and introducing promotions. They can also plan maintenance during low-demand months.

**Negative Growth with Justification:**
* The decline in rentals during winter months could negatively impact revenue. Without strategies to address this (e.g., promotions, targeted campaigns, or alternate services), the business may face underutilization of resources.

#### Chart - 10

In [None]:
# Chart - 10
# Bar plot for holiday rentals
holiday_agg = df.groupby('Holiday').agg({'Rented Bike Count': 'mean'}).reset_index()
sns.barplot(data=holiday_agg, x='Holiday', y='Rented Bike Count', palette='pastel')
plt.title('Bike Rentals: Holiday vs. Non-Holiday')
plt.xlabel('Holiday')
plt.ylabel('Average Rented Bikes')
plt.show()


##### 1. Why did you pick the specific chart?

To understand the impact of holidays on bike rentals.

##### 2. What is/are the insight(s) found from the chart?

1. **Lower Rentals on Holidays:**
* The average number of bike rentals on holidays is approximately 500, indicating reduced demand compared to regular days.
2. **Higher Rentals on Non-Holiday Days:**
* On non-holiday days, the average rentals are around 700, showing that people are more likely to rent bikes on regular days, possibly due to commuting or more routine activities.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Yes**. These insights can positively influence the business in several ways:

1. **Operational Adjustments:**
* Decrease fleet size during holidays: Lower rentals on holidays suggest a reduced demand for bikes, allowing the company to allocate fewer bikes to stations, saving on operational costs.
* Maintain or increase fleet on non-holidays: Since rentals are higher on non-holidays, ensuring bike availability during weekdays can prevent stockouts and maximize service levels.
2. **Promotional Strategies:**

* Special promotions during holidays: Offering discounts or special offers on holidays can help boost rentals during low-demand periods.
* Marketing on non-holidays: Leverage the consistent demand on non-holidays for targeted marketing campaigns (e.g., commuter programs, loyalty programs) to increase customer engagement.
3. **Maintenance and Service Planning:**
* Holidays may have fewer rentals, so this time can be used for bike maintenance, cleaning, and station upgrades without disrupting service.

**Potential Negative Impacts:**
* Under-utilization on Holidays
* Missed Revenue Opportunity on Holidays

**Justification:**

While non-holidays bring consistent demand, failing to tap into holiday traffic could limit growth. Adjusting fleet management and launching attractive offers on holidays can mitigate this underperformance and increase customer base during these periods.

#### Chart - 11

In [None]:
# Chart - 11
# Aggregating rentals by month
monthly_agg = df.groupby('Month').agg({'Rented Bike Count': 'mean'}).reset_index()

# Area plot for rentals by month
plt.fill_between(monthly_agg['Month'], monthly_agg['Rented Bike Count'], alpha=0.5, color='skyblue')
plt.plot(monthly_agg['Month'], monthly_agg['Rented Bike Count'], marker='o', color='blue')
plt.title('Monthly Average Bike Rentals')
plt.xlabel('Month')
plt.ylabel('Average Rented Bikes')
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?

To observe seasonal trends over the year.

##### 2. What is/are the insight(s) found from the chart?


1. **Peak Rentals in June (6th Month)**:
* The highest average rentals are observed in June (1200+ rentals), likely due to favorable weather conditions, summer holidays, and an increase in outdoor activities.
2. **Slight Decline in July (7th Month)**:
* July still shows high rentals (approximately 1000), though slightly lower than June, which could be attributed to the seasonal variation as the peak of summer passes.
3. **Lower Rentals in May (5th Month)**:
* May (about 900 rentals) sits below the June peak, consistent with demand still building through the transition from spring to summer.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Yes.** These insights are valuable for making strategic business decisions:

1. **Fleet Management:**

* Increase bike availability in June and July.
* Prepare for lower demand in late summer and fall.
2. **Seasonal Marketing Campaigns:**

* Targeted promotions in June and July.
* Early Bird campaigns in Spring (May): Promotions targeting early-season users (May) can help boost rentals before the summer peak, balancing seasonal demand.
3. **Operational Adjustments:**

* Maintenance and Upgrades in Fall: As bike rentals decrease after the summer, this period could be used for maintenance and infrastructure improvements to ensure smooth operations when demand picks up again.


**Potential Negative Impacts:**

1. Under-preparation for High Demand in June and July.
2. Over-reliance on Summer Months.

**Justification:**

While peak months like June and July provide a strong revenue boost, ignoring the off-peak months (May and Fall) could result in a seasonal business model with inconsistent income. The key is to optimize operations and marketing to smooth out demand fluctuations.

#### Chart - 12

In [None]:
# Chart - 12
# Violin plot for rentals by hour across seasons
plt.figure(figsize=(12, 6))
# split=True is dropped: older seaborn releases raise an error when it is used with more than two hue levels
sns.violinplot(data=df, x='Hour', y='Rented Bike Count', hue='Seasons', palette='muted')
plt.title('Hourly Rentals Across Seasons')
plt.xlabel('Hour of Day')
plt.ylabel('Rented Bikes')
plt.legend(title='Season')
plt.show()


##### 1. Why did you pick the specific chart?

To explore the distribution of bike rentals by hour, segmented by season.


##### 2. What is/are the insight(s) found from the chart?

* Rentals peak during early morning and evening hours, aligning with likely commute times.
* Summer and autumn exhibit higher demand variability, while winter shows consistently lower rentals.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**
* Allocating resources to align with peak rental hours (morning and evening) and focusing on high-demand seasons can improve operational efficiency and customer satisfaction.

**Negative Growth with Justification:**
* Low rentals during non-peak hours could result in wasted resources. Failing to optimize availability during these periods might limit potential revenue.

#### Chart - 13

In [None]:
# Chart - 13
# Stacked bar chart
func_agg = df.groupby(['DayOfWeek', 'Functioning Day'])['Rented Bike Count'].mean().unstack()
# unstack() sorts the status columns alphabetically (assumed here to be 'No', then 'Yes'),
# so colors follow that order and the legend is labeled from the column names
func_agg.plot(kind='bar', stacked=True, figsize=(12, 6), color=['red', 'green'])
plt.title('Bike Rentals by Day of Week and Functioning Status')
plt.xlabel('Day of Week')
plt.ylabel('Average Rented Bikes')
plt.legend(title='Functioning Day')
plt.show()


##### 1. Why did you pick the specific chart?

To analyze the impact of functioning vs. non-functioning days on bike rentals.

##### 2. What is/are the insight(s) found from the chart?

* Average bike rentals on functioning days are fairly consistent across the week, with slight variations.
* Rentals are relatively lower on day 6 (Sunday, with Monday as day 0) compared to other days, indicating reduced demand at the end of the weekend.

**Justification**
* The consistently lower rentals on day 6 suggest a potential missed opportunity for increasing weekend demand through marketing campaigns or promotions targeted at weekend riders.

#### Chart - 14

In [None]:
# Scatter plot for wind speed vs rentals
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='Wind speed (m/s)', y='Rented Bike Count', hue='Seasons', alpha=0.6)
plt.title('Wind Speed vs Bike Rentals (Season-wise)')
plt.xlabel('Wind Speed (m/s)')
plt.ylabel('Rented Bikes')
plt.legend(title='Season')
plt.show()


##### 1. Why did you pick the specific chart?

To explore how wind speed impacts rentals across different seasons.

##### 2. What is/are the insight(s) found from the chart?

**Insights:**
* Rentals tend to decrease as wind speed increases, indicating sensitivity to wind conditions.
* Summer and Autumn see consistently higher rentals even with moderate wind speeds.
* Winter rentals remain low regardless of wind conditions, further highlighting its low-demand nature.

**Positive Impact:**
* Focus fleet optimization on lower wind days and leverage favorable weather during peak seasons.

**Negative Impact:**
* Windy days may lead to operational inefficiencies, requiring flexibility in workforce deployment and service planning.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

1. **Hypothesis 1 (Day and Rental Patterns):**

* Null Hypothesis (H0): The average number of bike rentals does not differ significantly between weekdays and weekends.
* Alternative Hypothesis (Ha): The average number of bike rentals differs significantly between weekdays and weekends.

2. **Hypothesis 2 (Temperature Impact):**

* Null Hypothesis (H0): The temperature does not significantly affect the number of bike rentals.
* Alternative Hypothesis (Ha): The temperature significantly affects the number of bike rentals.

3. **Hypothesis 3 (Holiday Effect):**

* Null Hypothesis (H0): There is no significant difference in the average number of bike rentals between holidays and non-holidays.
* Alternative Hypothesis (Ha): There is a significant difference in the average number of bike rentals between holidays and non-holidays.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

1. Null Hypothesis (H0): The average number of bike rentals does not differ significantly between weekdays and weekends.
2. Alternative Hypothesis (Ha): The average number of bike rentals differs significantly between weekdays and weekends.

#### 2. Perform an appropriate statistical test.

In [None]:
df.columns


In [None]:
# Perform Statistical Test to obtain P-Value
import pandas as pd
from scipy.stats import ttest_ind

# Assuming 'df' is the DataFrame containing the data

# Define weekdays (0-4) and weekends (5-6)
df['DayType'] = df['DayOfWeek'].apply(lambda x: 'Weekday' if x < 5 else 'Weekend')

# Separate the data
weekdays = df[df['DayType'] == 'Weekday']['Rented Bike Count']
weekends = df[df['DayType'] == 'Weekend']['Rented Bike Count']

# Perform a two-sample t-test
t_stat, p_value = ttest_ind(weekdays, weekends, equal_var=False)  # Assuming unequal variances

# Output the results
print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")

# Interpretation
if p_value < 0.05:
    print("Reject the null hypothesis: There is a significant difference in bike rentals between weekdays and weekends.")
else:
    print("Fail to reject the null hypothesis: No significant difference in bike rentals between weekdays and weekends.")


##### Which statistical test have you done to obtain P-Value?

We performed an independent two-sample t-test with Welch's correction (`ttest_ind` with `equal_var=False`) to obtain the P-value.

##### Why did you choose the specific statistical test?

1. **Nature of the Data:**
* The test compares the means of two independent groups: weekdays and weekends.
* The target variable, bike rentals, is numeric and continuous.
2. **Assumptions of the t-test:**
* Independent Groups: Weekdays and weekends are mutually exclusive.
* Normally Distributed Data: Bike rentals within each group are right-skewed rather than normal, but with large samples the Central Limit Theorem makes the test on group means reasonably robust.
* Equal Variance: Not required here, because Welch's correction (`equal_var=False`) allows the two groups to have unequal variances.
3. **Suitability:**
* The t-test is appropriate for determining if there is a statistically significant difference in the means of two independent groups when the data is continuous and approximately normal.
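
To make the Welch statistic concrete, the sketch below computes it by hand and compares it with `scipy.stats.ttest_ind`. The two groups are synthetic (the means, spreads, and sizes are made-up stand-ins for weekday and weekend hours), so the numbers are not the project's results.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic groups with different means, variances, and sizes (illustrative only)
rng = np.random.default_rng(1)
group_a = rng.normal(720, 650, size=6000)   # stand-in for weekday hours
group_b = rng.normal(670, 600, size=2500)   # stand-in for weekend hours

# Welch's t statistic: difference in means over the unpooled standard error
var_a, var_b = group_a.var(ddof=1), group_b.var(ddof=1)
se = np.sqrt(var_a / group_a.size + var_b / group_b.size)
t_manual = (group_a.mean() - group_b.mean()) / se

t_scipy, p_scipy = ttest_ind(group_a, group_b, equal_var=False)
print(f"manual t = {t_manual:.4f}, scipy t = {t_scipy:.4f}, p = {p_scipy:.4g}")
```

The sign of the statistic follows the argument order: a positive value means the first group's mean is larger than the second's.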

**Statistical Test Results:**
* T-statistic: 3.5417
* P-value: 0.0004

**Conclusion:**
Since the P-value (0.0004) is less than the significance level (𝛼=0.05):

* Reject the null hypothesis (𝐻0).
* The data support the alternative hypothesis (𝐻𝑎): there is a significant difference in the average number of bike rentals between weekdays and weekends. Because `weekdays` is the first argument to `ttest_ind`, the positive t-statistic means the weekday mean is the higher of the two.

**Implications:**
* The difference in bike rentals suggests that user behavior and bike rental demand vary based on the day type. This could be due to factors like commuting patterns on weekdays versus leisure activities on weekends.

**Business Impact:**
1. Operational Strategy:
* Adjust bike availability and maintenance schedules by day type. The positive t-statistic (weekdays minus weekends) points to higher average demand on weekdays, so weekday availability takes priority and maintenance can lean toward weekends.
2. Marketing Campaigns:
* Tailor promotions for weekdays (e.g., "Bike to Work" discounts) or weekends (e.g., "Leisure Ride Discounts").

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

1. Null Hypothesis (𝐻0): There is no significant correlation between temperature and the number of bike rentals. (𝜌=0)
2. Alternate Hypothesis (𝐻𝐴): There is a significant correlation between temperature and the number of bike rentals. (𝜌≠0)

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import pearsonr

# Perform Pearson correlation test
correlation, p_value = pearsonr(df['Temperature(°C)'], df['Rented Bike Count'])

print(f"Correlation coefficient: {correlation}")
print(f"P-value: {p_value}")

# Conclusion
if p_value < 0.05:
    print("Reject the null hypothesis: There is a significant correlation between temperature and bike rentals.")
else:
    print("Fail to reject the null hypothesis: There is no significant correlation between temperature and bike rentals.")


##### Which statistical test have you done to obtain P-Value?

The appropriate test here is a **Pearson Correlation Test**, as it assesses the linear relationship between two continuous variables: **Temperature(°C) and Rented Bike Count.**

##### Why did you choose the specific statistical test?

* Nature of the data: Both variables (temperature and bike rentals) are continuous.
* Objective: To evaluate the strength and direction of the linear relationship.
* Pearson correlation is a standard test for analyzing correlations when variables are continuous and linearly related.


**Interpretation of the Results:**

* **Correlation coefficient (𝑟): 0.538**
This indicates a moderate positive correlation between temperature and bike rentals, suggesting that as temperature increases, bike rentals also tend to increase.
* **P-value: 0.0**
The printed value of 0.0 means the p-value is too small to be represented as a floating-point number, not that it is exactly zero. It is far below 0.05 and provides strong evidence to reject the null hypothesis.
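
A rough way to see why the p-value prints as 0.0 is to convert the correlation into its equivalent t statistic. In the sketch below, `r` is the coefficient reported above, while `n` is an assumption (a hypothetical sample size of one year of hourly rows), so the result is indicative only.

```python
import numpy as np
from scipy import stats

r = 0.538      # correlation coefficient reported above
n = 8760       # ASSUMPTION: hypothetical sample size (one year of hourly rows)

# Under H0 (rho = 0), r maps to a t statistic with n - 2 degrees of freedom
dof = n - 2
t_stat = r * np.sqrt(dof / (1 - r**2))
p_two_sided = 2 * stats.t.sf(abs(t_stat), dof)

print(f"t = {t_stat:.1f} on {dof} degrees of freedom")
print(f"two-sided p-value = {p_two_sided}")
```

With a moderate correlation and thousands of rows, the t statistic is so large that the tail probability falls below what a 64-bit float can hold, which is why reporting "p < 0.001" is more informative than "p = 0.0".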

**Conclusion:**

* The null hypothesis (𝐻0) that "there is no significant correlation between temperature and bike rentals" is rejected.
* There is a significant positive correlation between temperature and the number of bike rentals.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

1. **Null Hypothesis (𝐻0)**: There is no significant difference in the average number of bike rentals on holidays versus non-holidays.
2. **Alternate Hypothesis (𝐻𝐴)**: There is a significant difference in the average number of bike rentals on holidays versus non-holidays.


#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import ttest_ind

# Group bike rentals by holidays
holiday_rentals = df[df['Holiday'] == 'Holiday']['Rented Bike Count']
non_holiday_rentals = df[df['Holiday'] == 'No Holiday']['Rented Bike Count']

# Perform t-test
t_statistic, p_value = ttest_ind(holiday_rentals, non_holiday_rentals, equal_var=False)

print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")

# Conclusion
if p_value < 0.05:
    print("Reject the null hypothesis: There is a significant difference in bike rentals between holidays and non-holidays.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference in bike rentals between holidays and non-holidays.")


##### Which statistical test have you done to obtain P-Value?

**Two-sample t-test (independent t-test)** was performed to compare the average number of bike rentals between holidays and non-holidays.

##### Why did you choose the specific statistical test?

* Nature of the data: The data involves comparing two independent groups (holidays and non-holidays).
* Objective: To determine whether the means of bike rentals differ significantly between these two groups.
* Assumptions met: The t-test is appropriate when the dependent variable (bike rentals) is continuous, and the samples are independent.

**Interpretation of the Results:**

* T-statistic: −7.641
This large negative value indicates a substantial difference in the means of bike rentals between holidays and non-holidays. Because holiday rentals are the first argument to `ttest_ind`, the negative sign means the holiday mean is the lower of the two.
* P-value: 1.137×10^−13
 This extremely small p-value (much less than the threshold of 0.05) provides strong evidence to reject the null hypothesis.

**Conclusion:**

* The null hypothesis (𝐻0) that "there is no significant difference in bike rentals between holidays and non-holidays" is rejected.
* There is a significant difference in bike rentals between holidays and non-holidays, with non-holidays seeing more rentals on average, as shown by the negative sign of the t-statistic.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
# Check for missing values
missing_summary = df.isnull().sum()
print("Missing Values:\n", missing_summary)

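# Fill missing categorical values with mode
for col in df.select_dtypes(include=['object', 'category']):
    # Assign back instead of using inplace=True on a column selection,
    # which may not modify df in recent pandas versions
    df[col] = df[col].fillna(df[col].mode()[0])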




#### What all missing value imputation techniques have you used and why did you use those techniques?

In this dataset, no missing values were present in any of the columns. As a result, missing value imputation techniques were not required, and the mode-fill step above leaves the data unchanged.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
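# Function to cap outliers using IQR
def cap_outliers(col):
    Q1 = col.quantile(0.25)
    Q3 = col.quantile(0.75)
    IQR = Q3 - Q1
    # A zero IQR (e.g., a column that is mostly 0) would clip every value to Q1,
    # wiping out the feature, so such columns are returned unchanged
    if IQR == 0:
        return col
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return col.clip(lower=lower_bound, upper=upper_bound)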

# Apply to relevant columns
numerical_cols = ['Temperature(°C)', 'Humidity(%)', 'Wind speed (m/s)',
                  'Visibility (10m)', 'Solar Radiation (MJ/m2)',
                  'Rainfall(mm)', 'Snowfall (cm)']
for col in numerical_cols:
    df[col] = cap_outliers(df[col])


##### What all outlier treatment techniques have you used and why did you use those techniques?

* Outliers in numerical columns like Temperature(°C), Humidity(%), and Wind speed (m/s) can skew results, so they are capped using the IQR method: values more than 1.5 × IQR beyond the quartiles are clipped to the nearest bound.
* Capping Outliers: Keeps extreme values within acceptable ranges without dropping rows.
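
As a small worked example of IQR capping, the sketch below applies the 1.5 × IQR fences to two toy series. `iqr_bounds` is a hypothetical helper written for this illustration, and the values are made up; the second series shows what happens to a column that is mostly zero, as weather columns such as rainfall often are.

```python
import pandas as pd

def iqr_bounds(col):
    """Return the (lower, upper) 1.5 * IQR fences for a numeric Series."""
    q1, q3 = col.quantile(0.25), col.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# A single extreme value is pulled back to the upper fence
values = pd.Series([1, 2, 3, 4, 100])
low, high = iqr_bounds(values)
capped = values.clip(lower=low, upper=high)
print(capped.tolist())

# A mostly-zero column has IQR = 0, so both fences collapse to 0
mostly_zero = pd.Series([0, 0, 0, 0, 0, 0, 0, 5])
zero_low, zero_high = iqr_bounds(mostly_zero)
print(zero_low, zero_high)
```

In the first series the quartiles are 2 and 4, giving fences of -1 and 7, so only the value 100 changes. In the second, clipping to the collapsed fences would turn every non-zero reading into 0, which is worth checking for before capping sparse columns.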

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
# One-hot encode categorical columns
df = pd.get_dummies(df, columns=['Seasons', 'Holiday', 'Functioning Day'], drop_first=True)


#### What all categorical encoding techniques have you used & why did you use those techniques?

* Categorical columns like Seasons, Holiday, and Functioning Day need to be encoded for machine learning models.
* One-Hot Encoding: Converts categories into binary columns, preserving information while avoiding ordinal assumptions.
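
To show what `drop_first=True` does, here is a minimal sketch on a four-row toy frame (`demo` is illustrative and not the project data).

```python
import pandas as pd

demo = pd.DataFrame({'Seasons': ['Winter', 'Spring', 'Summer', 'Autumn']})

# drop_first=True drops the alphabetically first level ('Autumn'),
# which becomes the baseline: a row with all remaining dummies equal to 0
encoded = pd.get_dummies(demo, columns=['Seasons'], drop_first=True)
print(encoded.columns.tolist())
```

Dropping one level avoids perfectly collinear dummy columns, which matters for linear models; tree-based models work with or without it.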

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
# Interaction between temperature and humidity
df['Temp_Humidity_Interaction'] = df['Temperature(°C)'] * df['Humidity(%)']

# Interaction between visibility and wind speed
df['Visibility_Wind_Interaction'] = df['Visibility (10m)'] * df['Wind speed (m/s)']



In [None]:
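# Lag features to capture hourly dependencies
df['Lag_1'] = df['Rented Bike Count'].shift(1)  # Previous hour
df['Lag_24'] = df['Rented Bike Count'].shift(24)  # Same hour, previous day

# Rolling statistics for short-term trend analysis
# shift(1) keeps the current hour's count (the target) out of its own features
df['Rolling_Mean_3'] = df['Rented Bike Count'].shift(1).rolling(window=3).mean()
df['Rolling_Std_3'] = df['Rented Bike Count'].shift(1).rolling(window=3).std()

# The first rows have no history, so drop the NaN values created by shifting
df = df.dropna(subset=['Lag_1', 'Lag_24', 'Rolling_Mean_3', 'Rolling_Std_3']).reset_index(drop=True)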


##### What all feature selection methods have you used  and why?

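Interaction terms help capture non-linear relationships between variables that may influence bike demand (e.g., how high temperature and humidity together affect rentals). Lag features (`Lag_1`, `Lag_24`) and 3-hour rolling statistics add information about recent demand and the daily cycle.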
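
When lag and rolling features are computed from the target itself, each window should end before the current row so that the value being predicted never appears among its own inputs. A minimal sketch on a five-value toy series (`demand` is made up for illustration):

```python
import pandas as pd

demand = pd.Series([10, 20, 30, 40, 50], name='Rented Bike Count')

features = pd.DataFrame({
    'Lag_1': demand.shift(1),
    # shift(1) first, so the window covers the three hours before the current row
    'Rolling_Mean_3': demand.shift(1).rolling(window=3).mean(),
})
print(features)

# Rows whose history is incomplete contain NaN and are dropped before modeling
features_clean = features.dropna()
```

For the fourth row (demand 40), the rolling mean is the average of 10, 20, and 30, so it uses only past values.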

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data
df['Rented Bike Count'] = np.log1p(df['Rented Bike Count'])  # log(1+x) transformation


* `Rented Bike Count` is transformed with `log1p` (log(1 + x)), which remains defined for hours with zero rentals.
* The log transformation reduces the right skew of the target and stabilizes its variance, which helps regression models. Predictions are then on the log scale and need `np.expm1` to convert them back to bike counts.
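
The round trip between bike counts and the log scale can be sketched as follows; the `counts` array is made up for illustration.

```python
import numpy as np

counts = np.array([0, 9, 99, 999, 3500])

log_counts = np.log1p(counts)        # log(1 + x); defined for zero counts
restored = np.expm1(log_counts)      # inverse transform back to bike counts

print(np.round(log_counts, 3))
print(np.round(restored, 1))
```

Any error metric computed on `log_counts` (for example MSE) is in log units, so predictions should be passed through `np.expm1` before reporting errors in bikes.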

### 6. Data Scaling

In [None]:
# Scaling your data
from sklearn.preprocessing import StandardScaler

# Apply StandardScaler to numerical columns
numerical_cols = ['Temperature(°C)', 'Humidity(%)', 'Wind speed (m/s)',
                  'Visibility (10m)', 'Solar Radiation (MJ/m2)',
                  'Rainfall(mm)', 'Snowfall (cm)', 'Lag_1', 'Lag_24']
scaler = StandardScaler()
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])


##### Which method have you used to scale you data and why?

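* Numerical features are standardized to zero mean and unit variance. This rescales each feature but does not change the shape of its distribution.
* StandardScaler puts all features on a comparable scale, which matters for scale-sensitive models like SVMs or neural networks. Tree-based models such as Random Forest or Gradient Boosting are largely insensitive to it.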
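
One common refinement, shown here as a sketch rather than as the notebook's method, is to fit the scaler on the training rows only and reuse its statistics on the test rows, so that test data does not influence preprocessing. The feature matrix `X_demo` is synthetic.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix (illustrative only)
rng = np.random.default_rng(7)
X_demo = rng.normal(loc=20, scale=5, size=(200, 3))

X_tr, X_te = train_test_split(X_demo, test_size=0.2, random_state=42)

demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)   # learn mean and std from training rows only
X_te_scaled = demo_scaler.transform(X_te)       # reuse those statistics on the test rows

print(X_tr_scaled.mean(axis=0).round(6), X_tr_scaled.std(axis=0).round(6))
```

The scaled test set will generally not have exactly zero mean and unit variance, which is expected: it is measured against the training distribution.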

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
from sklearn.model_selection import train_test_split

# Split into training and testing sets
X = df.drop(columns=['Rented Bike Count'])
y = df['Rented Bike Count']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training set size:", X_train.shape)
print("Testing set size:", X_test.shape)


##### What data splitting ratio have you used and why?

* Split the dataset into training and testing sets.
* 80-20 Split: Ensures a significant portion of the data is used for training (80%) while leaving a portion for model evaluation (20%).
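
Because the features include lagged values of the target, a chronological split is an alternative worth considering: a random split can place neighboring hours in both the training and test sets. The sketch below shows the same 80-20 ratio with `shuffle=False`, using a toy array whose index stands in for time order; it assumes the rows are already sorted chronologically.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Rows are assumed to be in time order; the values stand in for the hour number
X_time = np.arange(100).reshape(-1, 1)
y_time = np.arange(100)

# shuffle=False keeps the first 80% for training and the last 20% for testing
X_tr, X_te, y_tr, y_te = train_test_split(X_time, y_time, test_size=0.2, shuffle=False)
print(y_tr.max(), y_te.min())
```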

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

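* Class imbalance is a property of classification targets. `Rented Bike Count` is a continuous regression target, so the relevant check is how its values are distributed (for example, how skewed they are) rather than how many rows fall in each class.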



In [None]:
# The target is continuous, so summarize its distribution instead of counting classes
print("Target summary:\n", df['Rented Bike Count'].describe())
print("Target skewness:", df['Rented Bike Count'].skew())


In [None]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Keep numeric/boolean columns only (drops text columns such as 'DayType')
# and remove rows with NaN left by the lag and rolling features,
# so that features and target stay aligned
model_df = df.select_dtypes(include=['number', 'bool']).dropna()

# Separate features and target
X = model_df.drop(columns=['Rented Bike Count'])
y = model_df['Rented Bike Count']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Gradient Boosting Regressor (Huber loss is less sensitive to extreme target values)
gb_regressor = GradientBoostingRegressor(n_estimators=100, random_state=42, loss='huber', max_depth=3)
model = gb_regressor.fit(X_train, y_train)

# Predict and evaluate model (MSE is on the log1p scale of the target)
y_pred_train = model.predict(X_train)
mse_train = mean_squared_error(y_train, y_pred_train)
print(f"Training MSE: {mse_train}")

y_pred_test = model.predict(X_test)
mse_test = mean_squared_error(y_test, y_pred_test)
print(f"Testing MSE: {mse_test}")


##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

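No resampling technique was applied. **SMOTE** (Synthetic Minority Over-sampling Technique) creates synthetic samples of a minority class, so it applies to classification problems rather than to a continuous target like `Rented Bike Count`. The skew in the target is instead handled by the `log1p` transformation, and the Gradient Boosting Regressor above is trained with Huber loss, which limits the influence of extreme values.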

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***