<a href="https://colab.research.google.com/github/mradulpatle03/Basic_ML_learning/blob/main/model_selection_framework.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd
import numpy as np

from sklearn.metrics import mean_absolute_error

In [3]:
df = pd.read_csv("/content/hour.csv")

In [4]:
df.head()

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,2011-01-01,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,3,13,16
1,2,2011-01-01,1,0,1,1,0,6,0,1,0.22,0.2727,0.8,0.0,8,32,40
2,3,2011-01-01,1,0,1,2,0,6,0,1,0.22,0.2727,0.8,0.0,5,27,32
3,4,2011-01-01,1,0,1,3,0,6,0,1,0.24,0.2879,0.75,0.0,3,10,13
4,5,2011-01-01,1,0,1,4,0,6,0,1,0.24,0.2879,0.75,0.0,0,1,1


In [5]:
X = df.drop(columns=[
    "cnt","casual","registered","instant","dteday"
])

y = df["cnt"]

**We convert the hour into sine and cosine components to preserve the cyclic nature of time.
This removes the artificial distance between adjacent hours such as 23 and 0, so models, especially linear ones, can learn smooth temporal patterns.**
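To make the "artificial distance" concrete, here is a minimal sketch that compares the gap between hours 23 and 0 before and after encoding. The helper name `encode_hour` is hypothetical and is not part of the notebook.

```python
import numpy as np

def encode_hour(hour):
    """Map an hour (0-23) to a (sin, cos) point on the unit circle."""
    angle = 2 * np.pi * hour / 24
    return np.array([np.sin(angle), np.cos(angle)])

# On the raw scale, 23:00 and 00:00 look as far apart as two hours can be.
raw_gap = abs(23 - 0)

# After encoding, 23 -> 0 is exactly as far apart as 0 -> 1.
cyclic_gap_23_0 = np.linalg.norm(encode_hour(23) - encode_hour(0))
cyclic_gap_0_1 = np.linalg.norm(encode_hour(0) - encode_hour(1))

print("raw gap 23 vs 0:   ", raw_gap)
print("cyclic gap 23 vs 0:", round(cyclic_gap_23_0, 4))
print("cyclic gap 0 vs 1: ", round(cyclic_gap_0_1, 4))
```

Every pair of neighbouring hours ends up the same distance apart, including the pair that wraps around midnight.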

WHEN TO USE CYCLIC ENCODING

✔ Hour of day
✔ Day of week
✔ Month of year
✔ Wind direction
✔ Angles

❌ When order matters but cycle does not (e.g., ranking)

**Using only sine is ambiguous because different times can map to the same sine value (for example, hours 2 and 10 both give 0.5).
Using both sine and cosine represents each value of a cyclic feature uniquely by placing it on a unit circle, preserving both continuity and direction.**
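The ambiguity can be checked directly. The sketch below is illustrative only: it shows that hours 2 and 10 share a sine value but differ in cosine, and that the 24 (sin, cos) pairs are all distinct while the sine values alone are not.

```python
import numpy as np

hours = np.arange(24)
angles = 2 * np.pi * hours / 24
sin_vals = np.sin(angles)
cos_vals = np.cos(angles)

# Hours 2 and 10 collide on sine but are separated by cosine.
print("sin:", round(sin_vals[2], 4), round(sin_vals[10], 4))
print("cos:", round(cos_vals[2], 4), round(cos_vals[10], 4))

# Count distinct encodings (rounded to absorb floating-point noise).
unique_sin = {round(float(s), 6) for s in sin_vals}
unique_pairs = {(round(float(s), 6), round(float(c), 6))
                for s, c in zip(sin_vals, cos_vals)}

print("distinct sine values:      ", len(unique_sin))
print("distinct (sin, cos) pairs: ", len(unique_pairs))
```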

In [9]:
X["hr_sin"]=np.sin(2* np.pi * X["hr"]/24)
X[["hr_sin","hr"]]

Unnamed: 0,hr_sin,hr
0,0.000000,0
1,0.258819,1
2,0.500000,2
3,0.707107,3
4,0.866025,4
...,...,...
17374,-0.965926,19
17375,-0.866025,20
17376,-0.707107,21
17377,-0.500000,22




Three questions to ask before encoding a feature cyclically:

1. Does it repeat?
2. Are the endpoints neighbors?
3. Would a circle represent it better than a line?

If the answer to all three is yes → use cyclic encoding.
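Once a feature passes the three questions above, the encoding itself is the same for any period. A minimal sketch of a reusable helper follows; the name `add_cyclic_features` and its parameters are assumptions made for illustration, not something the notebook defines.

```python
import numpy as np
import pandas as pd

def add_cyclic_features(frame, column, period, drop=True):
    """Return a copy of `frame` with `column` replaced by sin/cos features.

    `period` is the length of one full cycle (24 for hours, 7 for weekdays).
    """
    out = frame.copy()
    angle = 2 * np.pi * out[column] / period
    out[f"{column}_sin"] = np.sin(angle)
    out[f"{column}_cos"] = np.cos(angle)
    if drop:
        out = out.drop(columns=[column])
    return out

# Toy data, not the bike-sharing dataset.
demo = pd.DataFrame({"hr": [0, 6, 12, 18], "weekday": [0, 2, 4, 6]})
demo = add_cyclic_features(demo, "hr", period=24)
demo = add_cyclic_features(demo, "weekday", period=7)
print(demo.round(3))
```

Returning a copy rather than modifying the frame in place keeps the helper safe to re-run in a notebook cell.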

In [11]:
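# Encode hour (period 24) as a point on the unit circle, then drop the raw column.
X["hr_sin"] = np.sin(2 * np.pi * X["hr"] / 24)
X["hr_cos"] = np.cos(2 * np.pi * X["hr"] / 24)
X.drop(columns=["hr"], inplace=True)

# Encode month (values 1-12, period 12) the same way, so December sits next to January.
X["mnth_sin"] = np.sin(2 * np.pi * X["mnth"] / 12)
X["mnth_cos"] = np.cos(2 * np.pi * X["mnth"] / 12)
X.drop(columns=["mnth"], inplace=True)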

In [12]:
from sklearn.model_selection import TimeSeriesSplit

TimeSeriesSplit splits by row position, so as long as the rows are in chronological order, training data always precedes validation data in time. This prevents leakage from the future into training and gives more realistic performance estimates for time-dependent problems.
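A small sketch on toy data (12 rows, 3 splits, both chosen only for illustration) shows the expanding-window behaviour: each fold trains on everything before its validation block.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve rows standing in for twelve consecutive time steps.
toy_X = np.arange(12).reshape(-1, 1)
toy_splitter = TimeSeriesSplit(n_splits=3)

folds = list(toy_splitter.split(toy_X))
for fold, (train_idx, val_idx) in enumerate(folds, start=1):
    # The training window grows; the validation window always follows it.
    print(f"fold {fold}: train={train_idx.tolist()} validate={val_idx.tolist()}")
```

Because the split is positional, shuffling the rows beforehand would silently defeat the purpose.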

In [14]:
tscv = TimeSeriesSplit(n_splits=5)

# train → validate → extend train → validate → ...


In [15]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

models = {
    "Linear": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.001),
    "DecisionTree": DecisionTreeRegressor(max_depth=10, random_state=42),
    "RandomForest": RandomForestRegressor(
        n_estimators=100,
        max_depth=10,
        random_state=42,
        n_jobs=-1
    )
}


In [20]:
for train_idx, test_idx in tscv.split(X):
    print("Train indices:", train_idx)
    print("Test indices:", test_idx)
    break # Just show the first split as an example

Train indices: [   0    1    2 ... 2896 2897 2898]
Test indices: [2899 2900 2901 ... 5792 5793 5794]


In [23]:
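def evaluate_models(models, X, y, splitter):
    """Fit each model on every split and report mean/std of validation MAE."""
    results = {}

    for name, model in models.items():
        maes = []

        for train_idx, val_idx in splitter.split(X):
            # Positional indexing: the splitter yields row positions, not labels.
            X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
            y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]

            # fit() retrains from scratch on each fold's training window.
            model.fit(X_train, y_train)
            preds = model.predict(X_val)
            mae = mean_absolute_error(y_val, preds)
            maes.append(mae)

        results[name] = {
            "mean_mae": np.mean(maes),
            "std_mae": np.std(maes)
        }

    # One row per model, one column per metric.
    return pd.DataFrame(results).T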


In [24]:
results = evaluate_models(models, X, y, tscv)
results.sort_values("mean_mae")


Unnamed: 0,mean_mae,std_mae
RandomForest,56.854666,13.992851
DecisionTree,63.317388,12.597552
Ridge,105.123234,19.617324
Lasso,105.471894,19.554867
Linear,105.488039,19.550214


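RandomForest achieved the lowest mean MAE (about 56.9, versus about 105 for the three linear models), most likely because tree ensembles can capture the nonlinear effects and feature interactions inherent in human behavior, which linear models cannot represent without extra feature engineering. Its fold-to-fold spread is still sizeable (standard deviation of about 14), so the ranking should be read with that in mind. RandomForest also introduces risks related to interpretability, data drift, and maintenance.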
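To put the gap in relative terms, the following sketch computes each model's MAE reduction against plain linear regression. The numbers are copied from the results table above; treating `Linear` as the baseline is a choice made here for illustration.

```python
# Mean MAE values copied from the results table.
mean_mae = {
    "RandomForest": 56.854666,
    "DecisionTree": 63.317388,
    "Ridge": 105.123234,
    "Lasso": 105.471894,
    "Linear": 105.488039,
}

baseline = mean_mae["Linear"]

# Fraction of the baseline error removed by each model.
relative_reduction = {
    name: (baseline - mae) / baseline for name, mae in mean_mae.items()
}

for name, value in sorted(relative_reduction.items(), key=lambda kv: -kv[1]):
    print(f"{name:<13} {value:6.1%}")
```

RandomForest removes roughly 46% of the linear baseline's error, while Ridge and Lasso are almost indistinguishable from it.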