📓 Bike Rental Prediction – End-to-End Data Science Project


# 1. Problem Statement

## Business Objective

Bike sharing systems generate large volumes of data related to **urban mobility, weather, seasonality, and human behavior**.

The objective of this project is to:

* Analyze historical bike rental data
* Understand the impact of environmental & seasonal factors
* Predict **daily bike rental count (`cnt`)**

This prediction helps:

* Optimize bike availability
* Improve operational planning
* Support urban transport decisions

## Machine Learning Objective

> Predict the **total daily bike rental count (`cnt`)** using environmental and seasonal features.

This is a **supervised regression problem**.


# 2. Dataset Description

Dataset source:

```
https://************.zip
```

We will use:

* **day.csv** → Daily level prediction (preferred for business planning)
* **hour.csv** → Optional deeper behavioral analysis

---

## Target Variable

* `cnt` → Total bike rentals (casual + registered)

---

## Feature Categories

| Category | Features                                 |
| -------- | ---------------------------------------- |
| Temporal | season, yr, mnth, weekday, workingday    |
| Weather  | weathersit, temp, atemp, hum, windspeed  |
| Flags    | holiday                                  |
| Users    | casual, registered (excluded from model) |

# 3. Import Libraries & Load Data


In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_squared_error

from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

df = pd.read_csv("day.csv")
df.head()

# 4. Initial Data Inspection

In [None]:
df.info()
df.describe()
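`df.info()` and `df.describe()` give a first overview. A compact per-column summary of missing values and duplicate rows can complement them. The sketch below runs on a small made-up frame; the helper name `quality_report` and the toy values are illustrative assumptions, not part of the original notebook or the dataset.

```python
import numpy as np
import pandas as pd


def quality_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, missing-value count, and number of distinct values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "n_unique": frame.nunique(),  # NaN is not counted as a distinct value
    })


# Made-up rows that reuse three column names from day.csv.
toy = pd.DataFrame({
    "season": [1, 1, 2, 2],
    "temp": [0.34, np.nan, 0.52, 0.52],
    "cnt": [1000, 800, 1300, 1300],
})

report = quality_report(toy)
n_duplicate_rows = int(toy.duplicated().sum())  # fully identical rows

print(report)
print("duplicate rows:", n_duplicate_rows)
```

On the real data, the same helper could be called as `quality_report(df)`.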

# 5. Exploratory Data Analysis (EDA)

## 5.1 Target Distribution

In [None]:
sns.histplot(df['cnt'], kde=True)
plt.title("Distribution of Daily Bike Rentals")

## 5.2 Season vs Rentals

In [None]:
sns.boxplot(x='season', y='cnt', data=df)



## 5.3 Weather Impact

In [None]:
sns.barplot(x='weathersit', y='cnt', data=df)

## 5.4 Correlation Heatmap

In [None]:
plt.figure(figsize=(12,8))
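# numeric_only=True skips non-numeric columns such as dteday,
# which would otherwise raise an error in recent pandas versions
sns.heatmap(df.corr(numeric_only=True), cmap="coolwarm")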

# 6. Feature Selection & Engineering

## Drop Leakage & Non-Predictive Columns

In [None]:
df_model = df.drop(columns=[
    'instant',
    'dteday',
    'casual',
    'registered'
])
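`casual` and `registered` are dropped because together they determine `cnt`. A quick way to confirm that kind of leakage before dropping columns is to check whether the sum matches the target on every row. The sketch below uses a toy frame with made-up values; on the real data the same expression would be applied to `df`.

```python
import pandas as pd

# Toy rows with the same column names as day.csv; the values are made up
# for illustration and are not taken from the dataset.
toy = pd.DataFrame({
    "casual": [100, 250, 80],
    "registered": [900, 1500, 620],
    "cnt": [1000, 1750, 700],
})

# If the two user-type columns always add up to the target, keeping them
# as predictors would let a model reconstruct cnt trivially.
is_exact_sum = bool((toy["casual"] + toy["registered"] == toy["cnt"]).all())
print("casual + registered == cnt on every row:", is_exact_sum)
```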

## Train-Test Split

In [None]:
X = df_model.drop('cnt', axis=1)
y = df_model['cnt']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
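The split above shuffles rows at random. If the rows are in date order (an assumption here, suggested by the `instant` and `dteday` columns), a chronological hold-out is an alternative worth comparing, because it evaluates the model on days that come after the training period. A minimal sketch on synthetic rows, using the `shuffle=False` option of `train_test_split`:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for a date-ordered table; the values are synthetic.
demo = pd.DataFrame({"day_index": np.arange(10), "cnt": np.arange(10) * 100})

# shuffle=False keeps the row order, so the last 20% of rows form the test set.
train_part, test_part = train_test_split(demo, test_size=0.2, shuffle=False)

print(train_part["day_index"].tolist())
print(test_part["day_index"].tolist())
```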

## Feature Scaling

In [None]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
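The scaler is fit on the training split only, so no test-set statistics leak into it. If cross-validation is added later, one way to keep that property in every fold is to wrap the scaler and the model in a single pipeline. The following is a sketch on synthetic data; `X_demo`, `y_demo`, and the choice of `Ridge` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data with a known linear signal plus small noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# The pipeline refits StandardScaler inside each training fold,
# so validation folds never influence the scaling parameters.
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="r2")

print(scores.round(3), scores.mean().round(3))
```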

# 7. Model Building & Evaluation

## 7.1 Linear Regression

In [None]:
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)

y_pred_lr = lr.predict(X_test_scaled)

## 7.2 Ridge & Lasso

In [None]:
ridge = Ridge(alpha=1)
lasso = Lasso(alpha=0.01)

ridge.fit(X_train_scaled, y_train)
lasso.fit(X_train_scaled, y_train)

## 7.3 Tree-Based Models

In [None]:
dt = DecisionTreeRegressor(random_state=42)
rf = RandomForestRegressor(n_estimators=200, random_state=42)
gb = GradientBoostingRegressor(random_state=42)

dt.fit(X_train, y_train)
rf.fit(X_train, y_train)
gb.fit(X_train, y_train)

# 8. Model Comparison Report

In [None]:
models = {
    "Linear Regression": lr.predict(X_test_scaled),
    "Ridge Regression": ridge.predict(X_test_scaled),
    "Lasso Regression": lasso.predict(X_test_scaled),
    "Decision Tree": dt.predict(X_test),
    "Random Forest": rf.predict(X_test),
    "Gradient Boosting": gb.predict(X_test)
}

results = []

for name, pred in models.items():
    r2 = r2_score(y_test, pred)
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    results.append([name, r2, rmse])

results_df = pd.DataFrame(results, columns=["Model", "R2 Score", "RMSE"])
results_df.sort_values(by="R2 Score", ascending=False)
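R² and RMSE are the two metrics reported above. RMSE weights large errors more heavily, so it can help to look at MAE alongside it. The helper below is a hypothetical convenience function (`regression_report` is not part of the notebook) that returns all three metrics for one set of predictions; the input numbers are made up.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Return R2, RMSE, and MAE for one set of predictions."""
    return {
        "R2": float(r2_score(y_true, y_pred)),
        "RMSE": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "MAE": float(mean_absolute_error(y_true, y_pred)),
    }


# Made-up targets and predictions, only to show the call.
report = regression_report([100, 200, 300], [110, 190, 300])
print(report)
```

In the comparison loop, the same helper could be called as `regression_report(y_test, pred)` for each model.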

# 9. Best Model for Production

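## ✅ Recommended Model: **Gradient Boosting Regressor**

### Reasons:

* Captures non-linear relationships
* Handles weather & seasonal interactions well
* Typically generalizes better than a single decision tree
* Can be regularized (learning rate, tree depth, number of estimators) to limit overfitting; its train and test scores should be compared with Random Forest before the choice is finalized

✔ Suitable for real-world deployment

✔ Expected to perform consistently across seasons, since `season` and `mnth` are model inputs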
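One simple way to check an overfitting claim is to compare each model's training and test R². A large gap suggests the model memorizes the training data. The sketch below does this for the two ensembles on synthetic data; the data, the `candidates` dictionary, and the `gaps` name are assumptions for illustration, and the numbers it prints say nothing about the bike rental dataset.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic non-linear regression data.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(300, 4))
y_demo = (50 * np.sin(3 * X_demo[:, 0]) + 20 * X_demo[:, 1]
          + rng.normal(scale=5, size=300))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

candidates = {
    "Random Forest": RandomForestRegressor(n_estimators=200, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}

# Gap between train and test R2; smaller means less overfitting.
gaps = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    train_r2 = r2_score(y_tr, model.predict(X_tr))
    test_r2 = r2_score(y_te, model.predict(X_te))
    gaps[name] = float(train_r2 - test_r2)
    print(f"{name}: train={train_r2:.3f} test={test_r2:.3f}")
```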

# 10. Feature Importance (Optional)

In [None]:
feature_importance = pd.Series(
    gb.feature_importances_,
    index=X.columns
).sort_values(ascending=False)

feature_importance.plot(kind='bar')
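Impurity-based importances such as `feature_importances_` are computed from the training data and can favor features with many distinct values. Permutation importance on held-out data is a common cross-check. The sketch below uses scikit-learn's `permutation_importance` on synthetic data with one informative and one noise feature; the column names `signal` and `noise` are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only "signal" drives the target.
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    "signal": rng.uniform(0, 1, 400),
    "noise": rng.uniform(0, 1, 400),
})
y_demo = 10 * X_demo["signal"] + rng.normal(scale=0.1, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)
model = GradientBoostingRegressor(random_state=42).fit(X_tr, y_tr)

# Shuffle each column of the test set in turn and measure the score drop.
result = permutation_importance(
    model, X_te, y_te, n_repeats=10, random_state=42
)
perm_importance = pd.Series(
    result.importances_mean, index=X_demo.columns
).sort_values(ascending=False)

print(perm_importance)
```

With the notebook's objects, the equivalent call would take `gb`, `X_test`, and `y_test` as inputs.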

# 11. Challenges Faced & Solutions

## 1️⃣ Anonymized & Normalized Features

**Challenge:** Weather features such as `temp`, `atemp`, `hum`, and `windspeed` are stored as normalized values, so they lack real-world units
**Solution:** Focused on **relative impact & correlation patterns**

## 2️⃣ Data Leakage Risk

**Challenge:** `casual` + `registered` sum to `cnt`
**Solution:** Removed them from predictors


## 3️⃣ Non-Linear Relationships

**Challenge:** Linear models underperformed
**Solution:** Used ensemble tree models



## 4️⃣ Seasonal Bias

**Challenge:** Strong seasonal demand variation
**Solution:** Retained seasonal features instead of removing them
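Seasonal columns such as `season` are integer codes. Tree models can split on them directly, but a linear model reads the codes as an ordered quantity. One option to explore, not used in the notebook itself, is one-hot encoding the codes before fitting linear models. A minimal sketch with `pd.get_dummies` on a toy frame with made-up values:

```python
import pandas as pd

# Toy frame; season codes 1-4 mirror the day.csv column, values are made up.
toy = pd.DataFrame({
    "season": [1, 2, 3, 4, 1],
    "temp": [0.2, 0.5, 0.8, 0.4, 0.3],
})

# drop_first=True removes one category (season 1 here) to avoid
# perfectly collinear columns in a linear model with an intercept.
encoded = pd.get_dummies(
    toy, columns=["season"], prefix="season", drop_first=True
)

print(encoded.columns.tolist())
```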


# 12. Final Conclusion

* Bike rental demand is **highly dependent on weather and season**
* Tree-based ensemble models outperform linear models
* **Gradient Boosting** is the best choice for production
* The model can help cities:

  * Forecast demand
  * Allocate bikes efficiently
  * Reduce operational costs