
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv("/content/hour.csv")

In [3]:
df.head()

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,2011-01-01,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,3,13,16
1,2,2011-01-01,1,0,1,1,0,6,0,1,0.22,0.2727,0.8,0.0,8,32,40
2,3,2011-01-01,1,0,1,2,0,6,0,1,0.22,0.2727,0.8,0.0,5,27,32
3,4,2011-01-01,1,0,1,3,0,6,0,1,0.24,0.2879,0.75,0.0,3,10,13
4,5,2011-01-01,1,0,1,4,0,6,0,1,0.24,0.2879,0.75,0.0,0,1,1


In [4]:
DROP_COLS = [
    "cnt",          # target
    "casual",       # leakage
    "registered",   # leakage
    "instant",      # id
    "dteday"        # raw date (we'll extract info later)
]

X = df.drop(columns=DROP_COLS)
y = df["cnt"]


In [5]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

In [6]:
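def evaluate_model(X, y):
    """Fit a linear regression on the first 80% of rows and return the test MAE."""
    # Ordered split (no shuffling): rows are kept in file order
    split_point = int(len(X) * 0.8)

    X_train = X.iloc[:split_point]
    X_test = X.iloc[split_point:]
    y_train = y.iloc[:split_point]
    y_test = y.iloc[split_point:]

    model = LinearRegression()
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    # Mean absolute error on the held-out last 20% of rows
    mae = mean_absolute_error(y_test, y_pred)

    return mae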

In [7]:
mae_raw = evaluate_model(X, y)
print("MAE with raw features:", mae_raw)


MAE with raw features: 138.29175310827318


In [8]:
categorical_features = [
    "season", "weathersit", "weekday",
    "holiday", "workingday", "mnth", "hr"
]

**One Hot Encoding using pandas**

**Encoding is essential because the data contains columns whose categorical values are represented by integers (0, 1, 2, ...), which can impose an unwanted order on nominal data.**
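
To make the point about unwanted order concrete, here is a minimal sketch on a toy frame. The values are made up for illustration and are not rows from `hour.csv`. It shows how `pd.get_dummies` replaces an integer-coded column with one indicator column per level, and how `drop_first=True` drops the first level so that it becomes the implicit reference category.

```python
import pandas as pd

# Toy frame (hypothetical values) mimicking the integer-coded `season` column
toy = pd.DataFrame({
    "season": [1, 2, 3, 4, 1],
    "temp": [0.24, 0.50, 0.70, 0.40, 0.22],
})

# One indicator column per season level; drop_first=True removes season_1,
# which becomes the reference level (all indicators False)
encoded = pd.get_dummies(toy, columns=["season"], drop_first=True)

print(encoded.columns.tolist())
```

After encoding, the model learns a separate coefficient for each season instead of a single slope that treats season 4 as "four times" season 1.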

In [10]:
X_cat=pd.get_dummies(X,columns=[
    "season","weathersit","weekday"
],drop_first=True)

In [11]:
mae_cat = evaluate_model(X_cat, y)
print("MAE after one-hot encoding:", mae_cat)

MAE after one-hot encoding: 136.66796396426935


**Not much difference in MAE: it drops only from about 138.3 to 136.7.**

In [12]:
X_cyclic = X_cat.copy()

X_cyclic["hr_sin"] = np.sin(2 * np.pi * X_cyclic["hr"] / 24)
X_cyclic["hr_cos"] = np.cos(2 * np.pi * X_cyclic["hr"] / 24)

X_cyclic.drop(columns=["hr"], inplace=True)


In [13]:
X_cyclic["mnth_sin"] = np.sin(2 * np.pi * X_cyclic["mnth"] / 12)
X_cyclic["mnth_cos"] = np.cos(2 * np.pi * X_cyclic["mnth"] / 12)

X_cyclic.drop(columns=["mnth"], inplace=True)


In [14]:
mae_cyclic = evaluate_model(X_cyclic, y)
print("MAE after cyclic features:", mae_cyclic)


MAE after cyclic features: 120.6377749791419


**MAE drops (from about 136.7 to 120.6) because hour and month are cyclic, not linear: hour 23 is adjacent to hour 0, and December is adjacent to January.**

Linear regression + cyclic encoding ≈ manual non-linearity
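
A small sketch of why the sine/cosine pair helps, using the same formula as the `hr_sin` / `hr_cos` cells. The helper name `encode_hour` is introduced here for illustration only. As a raw integer, hour 23 is 23 units away from hour 0; on the unit circle the two are as close as any other pair of neighbouring hours.

```python
import numpy as np

def encode_hour(hr):
    """Map an hour (0-23) to a point on the unit circle (hypothetical helper)."""
    angle = 2 * np.pi * hr / 24
    return np.array([np.sin(angle), np.cos(angle)])

# Distance between hours as raw integers
raw_gap = abs(23 - 0)

# Euclidean distances in (sin, cos) space
cyclic_gap_23_0 = np.linalg.norm(encode_hour(23) - encode_hour(0))
cyclic_gap_0_1 = np.linalg.norm(encode_hour(0) - encode_hour(1))
cyclic_gap_0_12 = np.linalg.norm(encode_hour(0) - encode_hour(12))

print("raw gap 23 vs 0:", raw_gap)
print("cyclic gap 23 vs 0:", cyclic_gap_23_0)
print("cyclic gap 0 vs 1:", cyclic_gap_0_1)
print("cyclic gap 0 vs 12:", cyclic_gap_0_12)
```

The gap between 23 and 0 matches the gap between 0 and 1, while midnight and noon sit at opposite ends of the circle. The linear model can then combine the two columns to fit a smooth daily wave, which a single slope on `hr` cannot do.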

In [15]:
for col in ["holiday", "weekday_6"]:
    if col in X_cyclic.columns:
        temp = X_cyclic.drop(columns=[col])
        print(col, evaluate_model(temp, y))


holiday 120.6377749791419
weekday_6 120.78081058894931


In [16]:
results = {
    "baseline_mean": 174.8,
    "raw_features": mae_raw,
    "categorical": mae_cat,
    "cyclic": mae_cyclic
}

results


{'baseline_mean': 174.8,
 'raw_features': 138.29175310827318,
 'categorical': 136.66796396426935,
 'cyclic': 120.6377749791419}
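
The `baseline_mean` value is hard-coded and the notebook does not show how it was obtained. One plausible way to compute such a baseline, assuming it means "always predict the training-set mean, scored on the same ordered 80/20 split", is sketched below. The helper `mean_baseline_mae` and the toy series are illustrative assumptions, not code from the notebook.

```python
import numpy as np
import pandas as pd

def mean_baseline_mae(y, train_frac=0.8):
    """MAE of always predicting the training mean (hypothetical helper)."""
    split_point = int(len(y) * train_frac)
    y_train, y_test = y.iloc[:split_point], y.iloc[split_point:]

    # Constant prediction: the mean of the training targets
    y_pred = np.full(len(y_test), y_train.mean())
    return float(np.mean(np.abs(y_test.to_numpy() - y_pred)))

# Toy target (made-up numbers, not the real `cnt` column)
toy_y = pd.Series([10, 20, 30, 40, 100])
print(mean_baseline_mae(toy_y))
```

Calling `mean_baseline_mae(df["cnt"])` on the real data would give a reference point that every feature set should beat; whether it reproduces the 174.8 figure depends on how that number was originally computed.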

**With good features, even linear models can reduce error substantially, which is why feature engineering is necessary.**

A simple model with good features can perform as well as a complex model with bad features.