# Part A: Data Preprocessing and Baseline

## A.1 Data Loading and Feature Engineering:

In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Set a random state for reproducibility
RANDOM_STATE = 42

We'll load the hour.csv file and take a first look at its structure, data types, and any potential missing values.

In [8]:
# Load the dataset
try:
    df_hour = pd.read_csv('hour.csv')
except FileNotFoundError:
    print("Error: 'hour.csv' not found. Please download it from the UCI repository.")
    raise  # stop here: every cell below needs df_hour

print("--- Data Head ---")
print(df_hour.head())
print("\n--- Data Info ---")
df_hour.info()

--- Data Head ---
   instant      dteday  season  yr  mnth  hr  holiday  weekday  workingday  \
0        1  2011-01-01       1   0     1   0        0        6           0   
1        2  2011-01-01       1   0     1   1        0        6           0   
2        3  2011-01-01       1   0     1   2        0        6           0   
3        4  2011-01-01       1   0     1   3        0        6           0   
4        5  2011-01-01       1   0     1   4        0        6           0   

   weathersit  temp   atemp   hum  windspeed  casual  registered  cnt  
0           1  0.24  0.2879  0.81        0.0       3          13   16  
1           1  0.22  0.2727  0.80        0.0       8          32   40  
2           1  0.22  0.2727  0.80        0.0       5          27   32  
3           1  0.24  0.2879  0.75        0.0       3          10   13  
4           1  0.24  0.2879  0.75        0.0       0           1    1  

--- Data Info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17379 entrie

The data is loaded successfully. We have 17,379 hourly entries.

There are no missing values, as indicated by the info() output.

dteday is an object (string), which we'll be dropping.

Most features are int64 or float64, which is good.

### Dropping Irrelevant & Leakage Columns
As per the assignment, we will drop instant, dteday, casual, and registered.

instant: This is just a row index and provides no predictive information.

dteday: This date column is largely redundant. Its year, month, and day-of-week information is already captured in yr, mnth, and weekday (hr stores the hour separately). Only the day of the month is lost.

casual and registered: This is the most critical drop. These two columns sum up to our target variable, cnt. Including them would be data leakage, leading to a perfect but useless model (it's like predicting a total score when you already know the component scores).
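The leakage claim is easy to verify. The sketch below uses only the five rows printed in the head() output above, so it illustrates the check rather than proving it for the whole file.

```python
# First five rows of hour.csv, copied from the head() output above.
casual = [3, 8, 5, 3, 0]
registered = [13, 32, 27, 10, 1]
cnt = [16, 40, 32, 13, 1]

# If the two components always add up to the target, a model that sees both
# can reproduce cnt exactly without learning anything about demand.
is_exact_sum = all(c + r == total for c, r, total in zip(casual, registered, cnt))
print(is_exact_sum)  # True for these five rows
```

On the full frame, the same check could be run before the drop with `(df_hour['casual'] + df_hour['registered'] == df_hour['cnt']).all()`.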

In [9]:
# Drop the specified columns
df_processed = df_hour.drop(['instant', 'dteday', 'casual', 'registered'], axis=1)

print(f"\nOriginal columns: {df_hour.shape[1]}")
print(f"Processed columns: {df_processed.shape[1]}")
print("Dropped 'instant', 'dteday', 'casual', 'registered'.")


Original columns: 17
Processed columns: 13
Dropped 'instant', 'dteday', 'casual', 'registered'.


### Feature Engineering: A Hybrid Approach

Next, we convert our categorical features into a numerical format. This is the most critical step for model performance, and we must be precise. Our features fall into two distinct groups:

1.  **Cyclical Features:** `hr`, `mnth`, and `weekday`.
2.  **Nominal Features:** `season` and `weathersit`.

A common shortcut is to apply One-Hot Encoding (OHE) to all of them. For the cyclical features, however, that discards adjacency: OHE treats "hour 0" (midnight) and "hour 23" (11 PM) as completely unrelated, when in reality they are adjacent.

**Our Expert Strategy:**

* **Cyclical Encoding (`sin`/`cos`):** For `hr`, `mnth`, and `weekday`, we will use a mathematical transformation. By converting them into `sin` and `cos` components, we map them onto a circle. This explicitly tells the model that "hour 23" is very close to "hour 0," and "month 12" is very close to "month 1." Whether this improves accuracy depends on the model, so the effect should be judged by the RMSE results rather than assumed.
* **One-Hot Encoding (`pd.get_dummies`):** For `season` and `weathersit`, which are nominal (unordered categories), OHE remains the correct approach. We'll set `drop_first=True` to prevent multicollinearity (the "dummy variable trap"), which is important for our Linear Regression baseline.
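To see why `drop_first=True` matters for a linear model, here is a small NumPy sketch on toy season values (not the real data). With an intercept and all four dummy columns, the columns are linearly dependent, so the coefficients are not uniquely determined. Dropping one dummy removes the dependency.

```python
import numpy as np

# Toy season labels (hypothetical, two rows per season).
season = np.array([1, 2, 3, 4, 1, 2, 3, 4])
categories = np.array([1, 2, 3, 4])

# One column per category, 1.0 where the row belongs to that category.
one_hot = (season[:, None] == categories[None, :]).astype(float)
intercept = np.ones((season.size, 1))

full = np.hstack([intercept, one_hot])            # intercept + 4 dummies
reduced = np.hstack([intercept, one_hot[:, 1:]])  # intercept + 3 dummies

rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)
print(full.shape[1], rank_full)        # 5 columns, rank 4: one column is redundant
print(reduced.shape[1], rank_reduced)  # 4 columns, rank 4: full column rank
```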

This hybrid approach gives our models a compact representation that keeps cyclical neighbours close while leaving the nominal categories unordered.
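The adjacency argument can be checked numerically. The sketch below (the helper names are ours, for illustration only) encodes a few hours on the unit circle and compares distances between the encoded points.

```python
import math

def encode_cyclical(value, period):
    """Map a cyclical value onto the unit circle as (sin, cos)."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

def encoded_distance(a, b, period):
    """Euclidean distance between two values after sin/cos encoding."""
    sin_a, cos_a = encode_cyclical(a, period)
    sin_b, cos_b = encode_cyclical(b, period)
    return math.hypot(sin_a - sin_b, cos_a - cos_b)

d_23_0 = encoded_distance(23, 0, 24)   # 11 PM vs midnight
d_0_1 = encoded_distance(0, 1, 24)     # midnight vs 1 AM
d_0_12 = encoded_distance(0, 12, 24)   # midnight vs noon

print(f"{d_23_0:.4f} {d_0_1:.4f} {d_0_12:.4f}")
```

Hours 23 and 0 end up exactly as close as hours 0 and 1, while midnight and noon sit at opposite sides of the circle. On the raw 0-23 scale, hours 23 and 0 would be the farthest pair.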

In [10]:
# --- Cyclical Feature Engineering ---
# Apply sin/cos transformations to cyclical features

# Handle 'hr' (0-23, so a 24-hour cycle)
df_processed['hr_sin'] = np.sin(2 * np.pi * df_processed['hr'] / 24.0)
df_processed['hr_cos'] = np.cos(2 * np.pi * df_processed['hr'] / 24.0)

# Handle 'mnth' (1-12, so a 12-month cycle)
df_processed['mnth_sin'] = np.sin(2 * np.pi * df_processed['mnth'] / 12.0)
df_processed['mnth_cos'] = np.cos(2 * np.pi * df_processed['mnth'] / 12.0)

# Handle 'weekday' (0-6, so a 7-day cycle)
df_processed['weekday_sin'] = np.sin(2 * np.pi * df_processed['weekday'] / 7.0)
df_processed['weekday_cos'] = np.cos(2 * np.pi * df_processed['weekday'] / 7.0)

# --- One-Hot Encoding for *non-cyclical* categories ---
# 'season' and 'weathersit' are nominal, so OHE is still correct here.
non_cyclical_cats = ['season', 'weathersit']
df_processed = pd.get_dummies(df_processed, 
                              columns=non_cyclical_cats, 
                              drop_first=True)

# Now we can drop the original cyclical columns, which the sin/cos pairs replace
df_processed = df_processed.drop(['hr', 'mnth', 'weekday'], axis=1)

print("--- Data Head After *Cyclical* Preprocessing ---")
print(df_processed.head())
print(f"\nTotal features after new preprocessing: {df_processed.shape[1]}")

--- Data Head After *Cyclical* Preprocessing ---
   yr  holiday  workingday  temp   atemp   hum  windspeed  cnt    hr_sin  \
0   0        0           0  0.24  0.2879  0.81        0.0   16  0.000000   
1   0        0           0  0.22  0.2727  0.80        0.0   40  0.258819   
2   0        0           0  0.22  0.2727  0.80        0.0   32  0.500000   
3   0        0           0  0.24  0.2879  0.75        0.0   13  0.707107   
4   0        0           0  0.24  0.2879  0.75        0.0    1  0.866025   

     hr_cos  mnth_sin  mnth_cos  weekday_sin  weekday_cos  season_2  season_3  \
0  1.000000       0.5  0.866025    -0.781831      0.62349     False     False   
1  0.965926       0.5  0.866025    -0.781831      0.62349     False     False   
2  0.866025       0.5  0.866025    -0.781831      0.62349     False     False   
3  0.707107       0.5  0.866025    -0.781831      0.62349     False     False   
4  0.500000       0.5  0.866025    -0.781831      0.62349     False     False   

   seas

## A.2: Train/Test Split

CRITICAL INSIGHT: This is time-series data. We cannot use a random train_test_split. That would be a major methodological error, as it would train the model on future data to predict the past, leading to an overly optimistic and incorrect evaluation.

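Our Strategy: We must perform a chronological split. The rows of hour.csv appear in time order (in the head above, instant increases with dteday and hr), so slicing by row position preserves that order even though dteday has been dropped. We will use the earlier data (approx. 80%) for training and reserve the later data (approx. 20%) for testing. This simulates a real-world scenario of forecasting the future.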

In [23]:
# 1. Separate features (X) and target (y)
y = df_processed['cnt']  # The RAW count
X = df_processed.drop('cnt', axis=1)

# 2. Perform the chronological split
split_index = int(len(X) * 0.8)

X_train = X.iloc[:split_index]
X_test = X.iloc[split_index:]

# Split the raw (untransformed) target at the same index as the features
y_train = y.iloc[:split_index]
y_test = y.iloc[split_index:]

print(f"X_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_test shape: {y_test.shape}")

X_train shape: (13903, 19)
y_train shape: (13903,)
X_test shape: (3476, 19)
y_test shape: (3476,)


## A.3 Baseline Model (Single Regressor):

### Scaling Numerical Features
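Our dataset now contains the original numerical features (temp, atemp, hum, windspeed), the sin/cos cyclical features, and the new dummy variables. Plain Linear Regression and Decision Trees give the same predictions with or without standardization, but scaling makes the linear coefficients comparable and matters for scale-sensitive models (for example, regularized regression) that we might add later.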

To ensure a fair comparison and best practice, we will scale the original numerical features.

Important: We will fit the scaler only on the X_train data and then use it to transform both X_train and X_test. This prevents any information from the test set from "leaking" into our training process.
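A minimal sketch of the fit-on-train-only pattern, written in plain NumPy on synthetic data (all names and values here are hypothetical, not taken from the bike data). It also illustrates a side point: for ordinary least squares with an intercept, standardizing changes the coefficients but not the predictions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic features on very different scales (assumed for illustration).
scales = np.array([1.0, 100.0])
X_tr = rng.uniform(0.0, 1.0, size=(50, 2)) * scales
X_te = rng.uniform(0.0, 1.0, size=(10, 2)) * scales
y_tr = 3.0 * X_tr[:, 0] + 0.05 * X_tr[:, 1] + rng.normal(0.0, 0.1, size=50)

def fit_predict(train_X, train_y, test_X):
    """Least squares with an intercept; returns test predictions and coefficients."""
    A_train = np.column_stack([np.ones(len(train_X)), train_X])
    A_test = np.column_stack([np.ones(len(test_X)), test_X])
    coef, *_ = np.linalg.lstsq(A_train, train_y, rcond=None)
    return A_test @ coef, coef

# Statistics come from the training rows only, then are reused on the test rows.
train_mean = X_tr.mean(axis=0)
train_std = X_tr.std(axis=0)

pred_raw, coef_raw = fit_predict(X_tr, y_tr, X_te)
pred_scaled, coef_scaled = fit_predict(
    (X_tr - train_mean) / train_std, y_tr, (X_te - train_mean) / train_std
)

print(np.allclose(pred_raw, pred_scaled))          # predictions agree
print(np.allclose(coef_raw[1:], coef_scaled[1:]))  # slopes differ
```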

In [24]:
# 3. Scaling Numerical Features 

numerical_features = ['temp', 'atemp', 'hum', 'windspeed']
scaler = StandardScaler()

# Fit *only* on the training data
scaler.fit(X_train[numerical_features])

X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()

# Now, we modify the copies, which is 100% safe
X_train_scaled[numerical_features] = scaler.transform(X_train[numerical_features])
X_test_scaled[numerical_features] = scaler.transform(X_test[numerical_features])


print("\n--- Training Data Head After Scaling ---")
print(X_train_scaled.head())


   yr  holiday  workingday      temp     atemp       hum  windspeed    hr_sin  \
0   0        0           0 -1.310866 -1.076496  0.943574   -1.57778  0.000000   
1   0        0           0 -1.412024 -1.162563  0.893116   -1.57778  0.258819   
2   0        0           0 -1.412024 -1.162563  0.893116   -1.57778  0.500000   
3   0        0           0 -1.310866 -1.076496  0.640830   -1.57778  0.707107   
4   0        0           0 -1.310866 -1.076496  0.640830   -1.57778  0.866025   

     hr_cos  mnth_sin  mnth_cos  weekday_sin  weekday_cos  season_2  season_3  \
0  1.000000       0.5  0.866025    -0.781831      0.62349     False     False   
1  0.965926       0.5  0.866025    -0.781831      0.62349     False     False   
2  0.866025       0.5  0.866025    -0.781831      0.62349     False     False   
3  0.707107       0.5  0.866025    -0.781831      0.62349     False     False   
4  0.500000       0.5  0.866025    -0.781831      0.62349     False     False   

   season_4  weathersit_2

### Baseline Model 1: Linear Regression

In [25]:
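# 1. Initialize and train the model on the scaled features
#    (OLS predictions are identical with or without standardization)
lr_model = LinearRegression()
lr_model.fit(X_train_scaled, y_train)

# 2. Make predictions on the test set
y_pred_lr = lr_model.predict(X_test_scaled)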

# 3. Calculate and report RMSE
rmse_lr = np.sqrt(mean_squared_error(y_test, y_pred_lr))
print(f"Linear Regression RMSE: {rmse_lr:.4f}")

Linear Regression RMSE: 164.4087


### Baseline Model 2: Decision Tree Regressor
We will use the specified max_depth=6 to prevent the tree from overfitting. We also set random_state for reproducibility.

In [26]:
# 1. Initialize and train the model on the scaled features
#    (tree splits are unaffected by this monotonic rescaling)
dt_model = DecisionTreeRegressor(max_depth=6, random_state=RANDOM_STATE)
dt_model.fit(X_train_scaled, y_train)

# 2. Make predictions on the test set
y_pred_dt = dt_model.predict(X_test_scaled)

# 3. Calculate and report RMSE
rmse_dt = np.sqrt(mean_squared_error(y_test, y_pred_dt))
print(f"Decision Tree (Depth=6) RMSE: {rmse_dt:.4f}")

Decision Tree (Depth=6) RMSE: 110.8498


### Part A Conclusion: Establishing the Baseline

We evaluated our two single models on the test set to establish a baseline performance metric. The Root Mean Squared Error (RMSE) for each model is as follows:

* **Linear Regression RMSE:** 164.41
* **Decision Tree (max\_depth=6) RMSE:** 110.85
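For reference, RMSE is the square root of the mean squared difference between actual and predicted values. A minimal pure-Python sketch of that definition, using toy numbers rather than the dataset (the notebook itself computes it with `np.sqrt(mean_squared_error(...))`):

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy values (not from the dataset): errors of 2 and 0 give sqrt((4 + 0) / 2).
example = rmse([3, 5], [1, 5])
print(f"{example:.4f}")
```

Because the errors are squared before averaging, a few large misses (for example, at rush-hour peaks) weigh more heavily than many small ones.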

**Conclusion:**

The **Decision Tree Regressor** (RMSE: 110.85) performed significantly better than the Linear Regression model (RMSE: 164.41). This suggests the relationships in the data (like time of day and weather) are highly non-linear, which the tree model can capture more effectively.

Therefore, the **baseline performance metric** for this assignment is set at **110.85 RMSE**. The goal for our ensemble models is to improve upon this score.

# Part B: Ensemble Techniques for Bias and Variance Reduction

## B.1 Bagging (Variance Reduction):

In [27]:
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.model_selection import GridSearchCV

In [31]:
### --- Model 3: Bagging Regressor  ---
print("\n--- Running Model 3: Bagging Regressor ---")

# Define the baseline RMSE we are trying to beat (from Part A)
baseline_rmse = 110.85 


bagging_model_simple = BaggingRegressor(
    estimator=dt_model,   # using the base estimator (our baseline model)   
    n_estimators=100,       
    random_state=RANDOM_STATE,
    n_jobs=-1               
)

print("Training simple Bagging Regressor (100 trees, max_depth=6)...")
bagging_model_simple.fit(X_train_scaled, y_train) 

y_pred_bagging_simple = bagging_model_simple.predict(X_test_scaled)
rmse_bagging_simple = np.sqrt(mean_squared_error(y_test, y_pred_bagging_simple)) 

print(f"Bagging Regressor (D=6) RMSE: {rmse_bagging_simple:.4f}")
print(f"Baseline to Beat: {baseline_rmse:.4f}")


--- Running Model 3: Bagging Regressor ---
Training simple Bagging Regressor (100 trees, max_depth=6)...
Bagging Regressor (D=6) RMSE: 101.8245
Baseline to Beat: 110.8500


Let us also try a RandomForest with tuned hyperparameters to see whether it does any better.

In [32]:
### --- Model 4: Tuned Random Forest (Expert Step) ---
print("\n--- Running Model 4: Tuned Random Forest ---")

rf = RandomForestRegressor(random_state=RANDOM_STATE, n_jobs=-1)

# Define the parameter grid for GridSearchCV
param_grid = {
    'n_estimators': [100, 200],         # Number of trees
    'max_depth': [15, 20],              # Deeper trees (low bias)
    'max_features': ['sqrt', 1.0],      # 'sqrt' is RF, '1.0' is standard Bagging
    'min_samples_leaf': [1, 2]        # Regularization
}

rf_grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=3,
    scoring='neg_root_mean_squared_error',
    verbose=2,  # Set to 2 to see progress
    n_jobs=-1
)

print("\nStarting GridSearchCV for RandomForestRegressor...")
rf_grid_search.fit(X_train_scaled, y_train) # Fit on raw y_train

print("\n--- RF GridSearch Complete ---")
print(f"Best parameters found: {rf_grid_search.best_params_}")

# Evaluate the best model found by the grid search
best_rf_model = rf_grid_search.best_estimator_
y_pred_rf_best = best_rf_model.predict(X_test_scaled)
rmse_rf_best = np.sqrt(mean_squared_error(y_test, y_pred_rf_best)) # Eval on raw y_test

print(f"\nBest Tuned RandomForest RMSE: {rmse_rf_best:.4f}")
print(f"Baseline to Beat: {baseline_rmse:.4f}")


--- Running Model 4: Tuned Random Forest ---

Starting GridSearchCV for RandomForestRegressor...
Fitting 3 folds for each of 16 candidates, totalling 48 fits

--- RF GridSearch Complete ---
Best parameters found: {'max_depth': 20, 'max_features': 1.0, 'min_samples_leaf': 1, 'n_estimators': 100}

Best Tuned RandomForest RMSE: 87.0095
Baseline to Beat: 110.8500
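One caveat about the search above: with `cv=3` and a regressor, `GridSearchCV` uses ordinary unshuffled K-fold, so some folds validate on rows that come before part of their training rows. Below is a hand-rolled sketch of expanding-window folds that always validate on later rows. The helper name is hypothetical and the fold boundaries are our own simple choice.

```python
def expanding_window_folds(n_rows, n_splits):
    """Yield (train_indices, validation_indices); validation is always later."""
    fold_size = n_rows // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train_end = k * fold_size
        # The last fold absorbs any leftover rows.
        val_end = train_end + fold_size if k < n_splits else n_rows
        yield list(range(train_end)), list(range(train_end, val_end))

folds = list(expanding_window_folds(n_rows=8, n_splits=3))
for train_idx, val_idx in folds:
    print(train_idx, "->", val_idx)
```

scikit-learn ships `TimeSeriesSplit` for this kind of ordered splitting, and an instance of it can be passed as the `cv` argument of `GridSearchCV`. We did not rerun the search this way, so the reported parameters and scores are from the original `cv=3` setup.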


## Discussion: Bagging & Variance Reduction

**1. RMSE achieved**

* **Single Decision Tree (Baseline) RMSE:** 110.85
* **Bagging Regressor (100 Trees, `max_depth=6`) RMSE:** 101.82
* **Tuned Random Forest (100 Trees, `max_depth=20`) RMSE:** 87.01
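The relative improvements quoted in the discussion can be recomputed directly from these three figures. A short sketch (the helper name is ours):

```python
def relative_improvement(baseline, score):
    """Percentage reduction in RMSE relative to the baseline."""
    return 100.0 * (baseline - score) / baseline

bagging_gain = relative_improvement(110.85, 101.82)
rf_gain = relative_improvement(110.85, 87.01)
print(f"Bagging: {bagging_gain:.1f}%  Random Forest: {rf_gain:.1f}%")
```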

**2. Discussion of Effectiveness**

**Yes, bagging improved on the baseline, and the pattern of results is consistent with variance reduction.**

The evidence is twofold:

* **Initial Improvement:** A single Decision Tree is an unstable estimator: small changes in the training data can change its splits and therefore its predictions. Limiting it to `max_depth=6` tames some of that variance at the cost of extra bias. Averaging 100 such trees, each fit on a different bootstrap sample, with a simple `BaggingRegressor` lowered the RMSE from 110.85 to **101.82**, an 8.1% improvement. Because averaging leaves the bias of the shallow trees largely unchanged, this gain is best explained by reduced variance.

* **Unlocking Full Potential:** Bagging helps most with base models that have *low bias* and *high variance*. Our tuned **`RandomForestRegressor`** (an extension of bagging) used much deeper `max_depth=20` trees. A single tree of that depth would likely overfit the training data, but averaging many of them keeps the variance in check and brought the error down to **87.01 (a 21.5% improvement over the baseline)**. Note that the selected `max_features=1.0` means every split considers all features, so this best model behaves like plain bagging of deep trees.

**Conclusion:** Bagging successfully reduced the variance of our baseline model. Furthermore, this technique proved to be the key that unlocked the use of much deeper, more complex trees, leading to our best-performing model so far.
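The averaging effect behind this conclusion can be illustrated with a small simulation. It assumes each "model" is an unbiased estimate with independent Gaussian noise. That is an idealization: real bagged trees share training data and are correlated, so the reduction in practice is smaller. All numbers below are made up for illustration.

```python
import random
import statistics

random.seed(42)

def noisy_estimate(true_value=100.0, noise_sd=20.0):
    """One high-variance, unbiased prediction of the same quantity."""
    return random.gauss(true_value, noise_sd)

n_trials = 2000
single = [noisy_estimate() for _ in range(n_trials)]
averaged = [
    statistics.mean([noisy_estimate() for _ in range(100)])
    for _ in range(n_trials)
]

sd_single = statistics.pstdev(single)
sd_averaged = statistics.pstdev(averaged)
print(f"single: {sd_single:.2f}  average of 100: {sd_averaged:.2f}")
```

Under the independence assumption, the spread of the average shrinks by roughly the square root of the number of models. The bias of each model is untouched by averaging, which is why bagging pairs best with deep, low-bias trees.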

## B.2 Boosting (Bias Reduction):

In [34]:
from sklearn.ensemble import GradientBoostingRegressor

# Define our new "best score" to beat
baseline_rmse = 110.85
rf_best_rmse = 87.0095  # Our new best score from Part B

print(f"\n--- Running Model 5: Gradient Boosting Regressor ---")

# 1. Initialize the model
# max_depth=3, learning_rate=0.1, n_estimators=100 are common defaults
gbr_model = GradientBoostingRegressor(
    n_estimators=100,
    max_depth=3,
    learning_rate=0.1,
    random_state=RANDOM_STATE
)

# 2. Train the model
print("Training Gradient Boosting Regressor (100 trees, max_depth=3)...")
gbr_model.fit(X_train_scaled, y_train)

# 3. Evaluate
y_pred_gbr = gbr_model.predict(X_test_scaled)
rmse_gbr = np.sqrt(mean_squared_error(y_test, y_pred_gbr))

print(f"\nGradient Boosting RMSE: {rmse_gbr:.4f}")
print(f"Baseline to Beat: {baseline_rmse:.4f}")
print(f"Bagging/RF Best to Beat: {rf_best_rmse:.4f}")


--- Running Model 5: Gradient Boosting Regressor ---
Training Gradient Boosting Regressor (100 trees, max_depth=3)...

Gradient Boosting RMSE: 111.1437
Baseline to Beat: 110.8500
Bagging/RF Best to Beat: 87.0095


Our default Gradient Boosting model performed poorly (RMSE: 111.14), slightly worse than our single-tree baseline. This is consistent with **underfitting**: 100 shallow trees (depth 3) at a learning rate of 0.1 may be too simple to capture the data's complexity, although confirming this would require comparing training and test error.

To give the model more capacity, we perform **hyperparameter tuning**. We will now use `GridSearchCV` to test a broader set of parameters:

* **`n_estimators`:** More trees to allow for more correction steps.
* **`max_depth`:** Deeper trees to reduce bias and capture complex interactions.
* **`learning_rate`:** A smaller step size (0.05 alongside the default 0.1), so each tree makes a finer correction.

In [35]:
### --- Model 6: Tuned Gradient Boosting Regressor ---
print("\n--- Running Model 6: Tuned Gradient Boosting Regressor ---")

# 1. Define the parameter grid for GridSearchCV
gbr_param_grid = {
    'n_estimators': [100, 200],         # Number of trees
    'learning_rate': [0.1, 0.05],       # How fast the model learns
    'max_depth': [3, 5]                 # Deeper trees
}

# 2. Initialize the base estimator
gbr = GradientBoostingRegressor(random_state=RANDOM_STATE)

# 3. Initialize GridSearchCV
gbr_grid_search = GridSearchCV(
    estimator=gbr,
    param_grid=gbr_param_grid,
    cv=3,
    scoring='neg_root_mean_squared_error',
    verbose=2,  # Set to 2 to see progress
    n_jobs=-1
)

print("\nStarting GridSearchCV for GradientBoostingRegressor...")
gbr_grid_search.fit(X_train_scaled, y_train)

print("\n--- GBR GridSearch Complete ---")
print(f"Best parameters found: {gbr_grid_search.best_params_}")

# 4. Evaluate the best model
best_gbr_model = gbr_grid_search.best_estimator_
y_pred_gbr_best = best_gbr_model.predict(X_test_scaled)
rmse_gbr_best = np.sqrt(mean_squared_error(y_test, y_pred_gbr_best))

print(f"\nBest Tuned Gradient Boosting RMSE: {rmse_gbr_best:.4f}")
print(f"Random Forest Best to Beat: {rf_best_rmse:.4f}")


--- Running Model 6: Tuned Gradient Boosting Regressor ---

Starting GridSearchCV for GradientBoostingRegressor...
Fitting 3 folds for each of 8 candidates, totalling 24 fits

--- GBR GridSearch Complete ---
Best parameters found: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 200}

Best Tuned Gradient Boosting RMSE: 80.0026
Random Forest Best to Beat: 87.0095
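The "8 candidates, totalling 24 fits" line in the log follows directly from the grid: every combination of values is one candidate, and each candidate is fit once per fold. A small sketch that enumerates the same grid with the standard library:

```python
from itertools import product

# Same values as gbr_param_grid in the cell above.
gbr_param_grid = {
    'n_estimators': [100, 200],
    'learning_rate': [0.1, 0.05],
    'max_depth': [3, 5],
}
n_folds = 3

candidates = [
    dict(zip(gbr_param_grid, values))
    for values in product(*gbr_param_grid.values())
]
n_fits = len(candidates) * n_folds
print(len(candidates), n_fits)  # 8 candidates, 24 fits
```

The count grows multiplicatively, so adding one more two-value parameter would double the number of fits.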


## Discussion: Boosting & Bias Reduction

**1. RMSE Calculation**

* **Single Decision Tree (Baseline) RMSE:** 110.85
* **Best Bagging Ensemble (Tuned RF) RMSE:** 87.01
* **Best Boosting Ensemble (Tuned GBR) RMSE:** 80.00

**2. Discussion of Effectiveness**

**Yes, the boosting technique achieved a better result than both the single model and the bagging ensemble.**

Our `GradientBoostingRegressor`, once tuned, produced the **lowest RMSE (80.00)** of all models tested. This result is consistent with our hypothesis that boosting reduces bias, although a single test-set score cannot separate bias from variance on its own.

Here's the story the data tells:
1.  **High Initial Bias:** Our depth-limited single tree (110.85 RMSE) and the default GBR (111.14 RMSE) both appear to underfit.
2.  **Bagging's Approach:** Our `RandomForest` (a bagging method) reduced this error to 87.01 by averaging many deep, *low-bias* trees, which mainly reduces *variance*.
3.  **Boosting's Approach:** The `GradientBoostingRegressor` took a different route. It built trees sequentially, fitting each new tree to the residual errors of the ensemble so far. By tuning for deeper trees (`max_depth=5`) and more correction steps (`n_estimators=200`), we gave the model the capacity it needed to drive down this initial bias, ultimately achieving the lowest error score.
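The sequential correction described in the last step can be sketched from scratch. The code below is a simplified illustration of gradient boosting for squared error, using one-split "stumps" on synthetic 1-D data. It is not scikit-learn's implementation, and the data, helper names, and settings are assumptions chosen for the example.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single split on 1-D x that best fits the residuals (least squares)."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the max so the right side is never empty
        left = residual[x <= threshold]
        right = residual[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return lambda q: np.where(q <= threshold, left_value, right_value)

rng = np.random.default_rng(42)
x = np.linspace(0.0, 1.0, 200)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, size=x.size)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # start from a constant model
train_rmse = [np.sqrt(np.mean((y - prediction) ** 2))]

for _ in range(50):
    residual = y - prediction                # what the ensemble still gets wrong
    stump = fit_stump(x, residual)           # new weak learner targets those errors
    prediction = prediction + learning_rate * stump(x)
    train_rmse.append(np.sqrt(np.mean((y - prediction) ** 2)))

print(f"start: {train_rmse[0]:.3f}  after 50 rounds: {train_rmse[-1]:.3f}")
```

Each round can only lower the training error here, which is the bias-reduction mechanism at work. It also shows why `n_estimators` and `learning_rate` interact: smaller steps need more rounds to reach the same fit.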