<h1 align="center"><font color='green'> 🚴‍♂️ Exercise 1: Bike Rental Regression Task</font></h1>
<h3 align="left"> <font color='purple'>This project focuses on a regression task: predicting the total daily count of rental bikes (the 'cnt' column) from environmental and seasonal factors.
The objective is to preprocess the data, then implement and evaluate two regression algorithms, Decision Tree and Ridge Regression, using the standard metrics MAE, MSE, and R² score.
</font></h3>


In [None]:
#importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge 
from sklearn.tree import DecisionTreeRegressor 
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

#plots configuration 
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['axes.titlesize'] = 16
plt.rcParams['axes.labelsize'] = 12


<div class="alert alert-block alert-warning">  
<b>Data Loading and Preprocessing:</b> We load the dataset, drop irrelevant columns, and encode the categorical features. We then split the data into training and testing sets and scale the continuous features.
</div>


In [None]:
#Load the dataset
df = pd.read_csv("/kaggle/input/bike-sharing-dataset/day.csv")

#Initial information first
print("Original Dataset Shape:", df.shape)
print("\nFeatures and Data Types:")
print(df.info())

#--- PREPROCESSING ---
#The target 'cnt' is the sum of 'casual' and 'registered', so keeping them would leak the answer.
#'instant' is a row index and 'dteday' is the raw date; neither is used as a feature here.

column_drop = ['instant', 'dteday', 'casual', 'registered']
df_processed = df.drop(column_drop, axis=1)

print("\n-- Preprocessing Steps --")
print(f"Dropped index, date, and leakage columns: {column_drop}")

# --- Encoding Categorical Variables ---
#The columns in categ_column hold category codes, so they need one-hot encoding.
categ_column = ['season', 'mnth', 'weekday', 'weathersit']
df_encoded = pd.get_dummies(df_processed, columns=categ_column, drop_first=True)

print("Encoded categorical features using one-hot encoding (drop_first=True).")
print("Processed Dataset Shape:", df_encoded.shape)

#define Features (X) and Target (y)
X = df_encoded.drop('cnt', axis=1)
y = df_encoded['cnt']

#now split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Train set size: {X_train.shape[0]} records")
print(f"Test set size: {X_test.shape[0]} records")

# --- Scaling Numerical Features ---
numeric_column = ['temp', 'atemp', 'hum', 'windspeed']
scaler = StandardScaler()

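#Work on explicit copies so the assignments below do not trigger SettingWithCopyWarning
X_train, X_test = X_train.copy(), X_test.copy()
#Fit the scaler on the training set only, then apply the same transform to the test set
X_train[numeric_column] = scaler.fit_transform(X_train[numeric_column])
X_test[numeric_column] = scaler.transform(X_test[numeric_column])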

print("Scaled continuous features using StandardScaler.")

<div style="
    background-color: #ffd6eb;
    border-left: 6px solid #ff69b4;
    padding: 15px;
    border-radius: 8px;
    color: #880e4f;
    font-family: 'Helvetica Neue', sans-serif;
    box-shadow: 0 2px 6px rgba(255, 105, 180, 0.3);
">
<div class="alert alert-block alert-info">
<h1 align="center"> <font color='gray'>Model Training and Evaluation</font></h1>
<h3 align="center"> <font color='blue'> Train Decision Tree and Ridge Regression models and evaluate their performance on both the training and testing sets using MAE, MSE, and R².
</font></h3>
</div>
</div>


In [None]:
#1. Decision Tree (Non-linear prediction) and Ridge Regression (Robust linear baseline)
models = {
    "Ridge Regression (alpha=1)": Ridge(alpha=1.0, random_state=42),
    "Decision Tree": DecisionTreeRegressor(random_state=42, max_depth=10) #limiting depth (max_depth=10) helps reduce overfitting
}

results = []

def evaluate_model(name, model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)

    #Predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    #Calculate
    metrics = {
        'Model': name,
        'MAE_Train': mean_absolute_error(y_train, y_train_pred),
        'MAE_Test': mean_absolute_error(y_test, y_test_pred),
        'MSE_Train': mean_squared_error(y_train, y_train_pred),
        'MSE_Test': mean_squared_error(y_test, y_test_pred),
        'R2_Train': r2_score(y_train, y_train_pred),
        'R2_Test': r2_score(y_test, y_test_pred)
    }
    return metrics

for name, model in models.items():
    print(f"Training {name}...")
    metrics = evaluate_model(name, model, X_train, y_train, X_test, y_test)
    results.append(metrics)

results_datafr = pd.DataFrame(results)

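#Comparison table, sorted by test R2 score
print("\n--- Model Performance Comparison ---")
results_sorted = results_datafr.sort_values(by='R2_Test', ascending=False)
print(results_sorted.to_string(index=False))

#A Styler renders only as the last expression of a notebook cell; print() would show just its repr
results_datafr_styled = results_sorted.style.background_gradient(cmap='Blues', subset=['R2_Test'])
results_datafr_styled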

<div class="alert alert-block alert-info">
<h1 align="center"> <font color='gray'>Visualizations and interpretation</font></h1>
<h3 align="center"> <font color='blue'> Visualize the prediction results for the Decision Tree and analyze the error distribution.
</font></h3>
</div>

In [None]:
#Select the Decision Tree estimator from the models dictionary
model_pre = models["Decision Tree"]

#Re-train the Decision Tree for final prediction variables
model_pre.fit(X_train, y_train)
y_test_pred = model_pre.predict(X_test)

# --- Predicted vs. Actual Values Scatter Plot ---
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_test_pred, alpha=0.7, color='teal')
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--', lw=2, label='Perfect Prediction Line (y=x)')
plt.xlabel('Actual Daily Bike Rentals (cnt)')
plt.ylabel('Predicted Daily Bike Rentals (cnt)')
plt.title(f'{type(model_pre).__name__}: Predicted vs. Actual Values')
plt.legend()
plt.grid(True)
plt.show()

# --- Residual (Error) Distribution Plot ---
#Residuals = Actual Values - Predicted Values
residuals = y_test - y_test_pred

plt.figure(figsize=(10, 6))
sns.histplot(residuals, kde=True, bins=30, color='darkorange')
plt.axvline(x=0, color='red', linestyle='--', label='Zero Error')
plt.xlabel('Residuals (Actual - Predicted)')
plt.ylabel('Frequency')
plt.title(f'{type(model_pre).__name__}: Distribution of Prediction Errors (Residuals)')
plt.legend()
plt.show()

# --- Feature Importance ---
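#model_pre is an estimator object, so check its type rather than comparing it to a string
if isinstance(model_pre, DecisionTreeRegressor):
    feat_importances = pd.Series(model_pre.feature_importances_, index=X.columns)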
    feat_importances_sorted = feat_importances.sort_values(ascending=False)

    plt.figure(figsize=(10, 6))
    sns.barplot(x=feat_importances_sorted.head(10), y=feat_importances_sorted.head(10).index, palette='viridis')
    plt.title('Top 10 Feature Importances')
    plt.xlabel('Importance Score')
    plt.ylabel('Feature')
    plt.show()

### Interpretation of Visualizations

#### **Predicted vs. Actual Values Scatter Plot**
This plot compares the model's predictions on the test set (Y-axis) against the true daily bike counts (X-axis). 
* **Ideal Outcome:** The dots should cluster tightly around the red dashed line, which marks a perfect prediction (Predicted = Actual).
* **Finding:** The points cluster tightly around the line, which indicates that the model predicts well and generalizes to unseen data. The wider scatter at the higher end shows that peak-demand days are harder to predict precisely.

#### **Distribution of Prediction Errors (Residuals)**
This histogram shows the frequency of the model's errors (Residuals = Actual - Predicted). 
* **Ideal Outcome:** The distribution should look like a bell curve centered exactly at zero.
* **Finding:** The distribution is centered near zero, which suggests that the model is roughly unbiased overall. It is also slightly skewed, which suggests that the model underestimates high-demand days somewhat more often than it overestimates them.
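
The centering and skew read off the histogram can also be checked numerically. This is a minimal sketch, assuming hypothetical toy series `actual` and `predicted` that stand in for `y_test` and `y_test_pred`; the numbers are illustrative only and are not results from this notebook.

```python
import pandas as pd

# Hypothetical toy values standing in for y_test and y_test_pred
actual = pd.Series([1000, 2500, 4000, 5200, 6000, 8100])
predicted = pd.Series([1200, 2650, 4100, 5300, 6050, 7500])

# Same convention as the plot: Residual = Actual - Predicted
residuals = actual - predicted

mean_residual = residuals.mean()  # close to 0 suggests little overall bias
skewness = residuals.skew()       # > 0 means a longer tail of under-predictions

print(f"Mean residual: {mean_residual:.1f}")
print(f"Skewness: {skewness:.2f}")
```

A positive skewness here corresponds to a few large positive residuals, that is, days where the actual demand was well above the prediction.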

#### **Top 10 Feature Importances**
This bar chart shows the relative contribution of each feature to the Decision Tree's final prediction.
* **Finding:** **`temp`** and **`yr`** are the dominant predictors. This suggests that people are more likely to rent bikes in warmer weather, and the importance of `yr` (2011 versus 2012) reflects how the bike-sharing system became noticeably more popular over time.

<div class="alert alert-block alert-info">
<h1 align="center"><font color='green'>Summary of Regression Analysis on Bike Sharing Dataset</font></h1>
</div>


Exercise 1 addresses the prediction of total daily bike rentals (cnt). I used a dataset containing weather and temporal information, which suits a regression task because the target (cnt) is a numeric count. The dataset has 731 records and 16 columns, including the target.

#### Preprocessing and Feature Engineering
The first step was cleaning the data and preventing data leakage, which happens when the model accidentally "cheats" by seeing information that reveals the answer.
The 'instant' and 'dteday' columns were removed along with 'casual' and 'registered'. Removing the last two was crucial because together they make up the total count (cnt = casual + registered). Then, I used One-Hot Encoding to handle the categorical features 'season', 'mnth', 'weekday', and 'weathersit'.
Next, I split the data 80/20 into training and testing sets to check how well the models generalize. Finally, the continuous variables 'temp', 'atemp', 'hum', and 'windspeed' were standardized using StandardScaler, fitted on the training set only. This puts the features on a comparable scale, which matters especially for the Ridge model because its L2 penalty is sensitive to feature scale.
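
The preprocessing steps above can be sketched on a tiny example. This is only an illustration under stated assumptions: the `toy` frame is hypothetical data that mimics a few columns of `day.csv`, and the split size and seed are arbitrary.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame that mimics a few columns of day.csv
toy = pd.DataFrame({
    "season": [1, 2, 3, 4, 1, 2, 3, 4],
    "temp": [0.20, 0.45, 0.70, 0.35, 0.25, 0.50, 0.75, 0.30],
    "casual": [50, 300, 600, 120, 60, 350, 700, 100],
    "registered": [900, 2500, 3900, 2100, 1000, 2700, 4100, 2000],
})
toy["cnt"] = toy["casual"] + toy["registered"]

# Leakage check: the two columns reproduce the target exactly, so they must be dropped
leaks_target = (toy["casual"] + toy["registered"]).equals(toy["cnt"])
features = toy.drop(columns=["casual", "registered", "cnt"])

# drop_first=True turns the 4 season levels into 3 indicator columns
encoded = pd.get_dummies(features, columns=["season"], drop_first=True)

X_tr, X_te, y_tr, y_te = train_test_split(
    encoded, toy["cnt"], test_size=0.25, random_state=0
)
X_tr, X_te = X_tr.copy(), X_te.copy()

# Fit the scaler on the training rows only, then reuse its statistics on the test rows
toy_scaler = StandardScaler()
X_tr[["temp"]] = toy_scaler.fit_transform(X_tr[["temp"]])
X_te[["temp"]] = toy_scaler.transform(X_te[["temp"]])

print(encoded.columns.tolist())
```

Fitting the scaler before splitting would let test-set statistics influence the training features, which is a milder form of the same leakage problem.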


#### Model Implementation and Evaluation

I picked two models that suit this data:
1.  **Decision Tree Regressor:** I chose this model because it can capture non-linear relationships and feature interactions, which often gives it higher predictive accuracy than a linear model on this kind of data.
2.  **Ridge Regression:** This was my reliable linear baseline. It uses L2 regularization, which helps keep the coefficients stable when features are correlated (for example, 'temp' and 'atemp').
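
To illustrate why L2 regularization helps with correlated features, here is a small sketch on synthetic data. The arrays `X_syn` and `y_syn` are made up for this example (two nearly identical columns, similar in spirit to 'temp' and 'atemp'), and `LinearRegression` is used only as an unregularized comparison, not as one of the models in this exercise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)

# Two nearly identical synthetic features
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)
X_syn = np.column_stack([x1, x2])
y_syn = 3.0 * x1 + rng.normal(scale=0.5, size=200)

ols = LinearRegression().fit(X_syn, y_syn)
ridge = Ridge(alpha=1.0).fit(X_syn, y_syn)

# With near-duplicate columns, least squares can split the effect into large
# offsetting coefficients; the L2 penalty pulls the pair toward smaller values
print("OLS coefficients:  ", ols.coef_)
print("Ridge coefficients:", ridge.coef_)
print("OLS norm:  ", np.linalg.norm(ols.coef_))
print("Ridge norm:", np.linalg.norm(ridge.coef_))
```

The Ridge coefficient vector is never longer than the least-squares one, which is what keeps the fit stable when columns carry almost the same information.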

The models were evaluated using the standard regression metrics: **Mean Absolute Error (MAE)**, **Mean Squared Error (MSE)**, and the **R² score**.

| Model | R² Train | R² Test | MAE Test | MSE Test |
| :--- | :--- | :--- | :--- | :--- |
| Decision Tree | 0.941 | **0.871** | 473.4 | 425,000 |
| Ridge Regression | 0.816 | 0.826 | 639.1 | 599,000 |
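
For reference, the three metrics in the table can be computed by hand. This is a minimal sketch, assuming hypothetical toy arrays `y_true` and `y_pred`; the values are not taken from the bike dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical toy values, not taken from the bike dataset
y_true = np.array([3000.0, 4500.0, 5200.0, 6100.0])
y_pred = np.array([3200.0, 4300.0, 5000.0, 6500.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))                   # average absolute error, in bikes
mse = np.mean(errors ** 2)                      # average squared error, in bikes squared
ss_res = np.sum(errors ** 2)                    # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean
r2 = 1 - ss_res / ss_tot                        # share of variance explained

# Compare the manual values with scikit-learn's implementations
print(mae, mean_absolute_error(y_true, y_pred))
print(mse, mean_squared_error(y_true, y_pred))
print(r2, r2_score(y_true, y_pred))
```

Because MSE squares each error, a few large misses raise it much more than they raise MAE, which is why the two metrics are reported side by side.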

#### Results and Interpretation

The **Decision Tree Regressor** outperformed the linear baseline, achieving an $R^2$ score of **0.871** on the test data. This means it explains about **87.1%** of the variability in total daily bike rentals. The Ridge Regression model achieved an $R^2$ of **0.826**, making it a solid and interpretable linear baseline.

Insights:
1.  **Generalization:** The gap between the Decision Tree's training $R^2$ (0.941) and testing $R^2$ (0.871) suggests that the model has a tendency to **overfit**. I already capped the tree depth at 10 to manage this, but the gap shows how easily this non-linear model can memorize the training data. Ridge Regression, by contrast, shows almost no gap (0.816 versus 0.826).
2.  **Feature Importance:** As shown in the visualization, 'temp' and 'yr' are the strongest predictors. This reflects how strongly biking activity depends on warm weather, as well as the overall growth of the program.
3.  **Error Analysis:** The **Predicted vs. Actual** plot shows that predictions track the true values closely, and the residual distribution is centered near zero, which suggests the model is roughly unbiased. The one issue is that the residual distribution is slightly skewed, meaning the model tends to under-predict the days with the highest rental demand.
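
The train/test gap from the first insight could be explored by varying `max_depth` and watching how the two scores diverge. The sketch below uses synthetic data (`X_syn`, `y_syn` are assumptions made up for illustration, not the bike features), so its numbers say nothing about the results in the table above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)

# Synthetic non-linear data standing in for the bike features (an assumption)
X_syn = rng.uniform(0, 1, size=(500, 3))
y_syn = (
    4000 * np.sin(3 * X_syn[:, 0])
    + 1500 * X_syn[:, 1]
    + rng.normal(scale=400, size=500)
)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

gaps = {}
for depth in [2, 4, 6, 10, None]:  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42).fit(X_tr, y_tr)
    r2_train = tree.score(X_tr, y_tr)  # score() returns R^2 for regressors
    r2_test = tree.score(X_te, y_te)
    gaps[depth] = r2_train - r2_test
    print(f"max_depth={depth}: train R2={r2_train:.3f}, test R2={r2_test:.3f}")
```

A depth where the test score stops improving while the training score keeps rising is a reasonable place to cap the tree; cross-validation would give a more reliable choice than a single split.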


In conclusion, the **Decision Tree Regressor** offers the higher predictive accuracy of the two models, while **Ridge Regression** provides a stable and reliable linear model.
