# 04 Data Encoding

## Notebook Overview

This notebook transforms categorical time features into a machine-learning-friendly format using one-hot encoding.

**Key Steps:**

* **Input:** Cleaned hourly-level dataset (`data_cleaned.csv`)
* **Encoding:** One-hot encodes the `weekday` and `timing` columns so models can pick up day-of-week and time-of-day behavioral patterns
* **Output:** Saves the encoded dataset as `data_encoded.csv` for use in model training

> Purpose: Ensure temporal categorical features are numerically represented without introducing implicit ordering.

### Thoughts, Tradeoffs & Considerations

* **Avoided ordinal encoding:** `weekday` and `timing` may look ordered but aren’t numerically meaningful (e.g., Friday ≠ 5). One-hot encoding avoids injecting false structure into tree models.
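* **Kept full dummies (no drop-first):** Every category keeps its own column (`drop_first` stays at its default of `False`). For trees, multicollinearity isn’t a problem, and interpretability is better with all categories visible. A linear model with an intercept would typically need one column per group dropped.
* **Cyclical encoding for `hour`:** Considered sine/cosine transformations for `hour`. Since `timing` already captures coarse time-of-day behavior, they add limited new information. A later cell nevertheless adds `hour_sin` and `hour_cos` to the saved dataset.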
* **Prefix naming:** Added `wd_` and `time_` prefixes to avoid name clashes or confusion when inspecting feature importance later.
* **Sparse matrix not needed:** Although one-hot expands columns, the dataset is still small enough to keep in dense format without performance issues.

> Encoding time-based context like weekday and timing helps models capture behavioral patterns (e.g., higher energy use on Mondays or in the evening). Clean, explicit encoding here will help with interpretability and downstream feature selection.
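
To make the ordinal vs. one-hot tradeoff concrete, here is a minimal sketch on a toy frame. The `toy` data is hypothetical and not taken from `data_cleaned.csv`; only the `weekday` column name and the `wd` prefix mirror this notebook.

```python
import pandas as pd

# Toy data; values are illustrative, not taken from data_cleaned.csv.
toy = pd.DataFrame({"weekday": ["Monday", "Friday", "Sunday", "Monday"]})

# Ordinal codes impose an arbitrary order (alphabetical here:
# Friday=0, Monday=1, Sunday=2), which a model may read as magnitude.
ordinal = toy["weekday"].astype("category").cat.codes

# Full one-hot: one column per category, no implied order.
full = pd.get_dummies(toy, columns=["weekday"], prefix="wd")

# drop_first=True removes the first category (wd_Friday),
# which then becomes the implicit baseline.
reduced = pd.get_dummies(toy, columns=["weekday"], prefix="wd", drop_first=True)

print(ordinal.tolist())
print(list(full.columns))
print(list(reduced.columns))
```

With `drop_first=True`, a Friday row is represented only by zeros in every remaining `wd_` column, which is harder to read when inspecting feature importance.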

In [8]:
import pandas as pd
import numpy as np

In [9]:
# Show all rows and columns
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# widen the column width and overall display width
pd.set_option('display.max_colwidth', None)
pd.set_option('display.width', 0)

In [10]:
df: pd.DataFrame = pd.read_csv('../data/interim/data_cleaned.csv')

In [11]:
# One-hot encode 'weekday' and 'timing'
df_encoded = pd.get_dummies(df, columns=["weekday", "timing"], prefix=["wd", "time"])

# sanity check
print(df_encoded.filter(like="wd_").columns)
print(df_encoded.filter(like="time_").columns)

Index(['wd_Friday', 'wd_Monday', 'wd_Saturday', 'wd_Sunday', 'wd_Thursday',
       'wd_Tuesday', 'wd_Wednesday'],
      dtype='object')
Index(['time_Afternoon', 'time_Evening', 'time_Morning', 'time_Night'], dtype='object')
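
Beyond listing the column names, a stricter sanity check is that every row has exactly one active dummy per group. The sketch below uses a hypothetical `toy` frame so it runs on its own; in the notebook the same assertions could be applied to `df_encoded`. It selects columns with `startswith` instead of `filter(like=...)`, since `like` matches the substring anywhere in the column name.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's df; values are illustrative only.
toy = pd.DataFrame({
    "weekday": ["Monday", "Friday", "Sunday"],
    "timing": ["Morning", "Evening", "Night"],
    "use_house_overall": [1.0, 0.9, 0.7],
})
toy_encoded = pd.get_dummies(
    toy, columns=["weekday", "timing"], prefix=["wd", "time"]
)

# Select each dummy group by prefix.
wd_cols = [c for c in toy_encoded.columns if c.startswith("wd_")]
time_cols = [c for c in toy_encoded.columns if c.startswith("time_")]

# Each row should have exactly one active dummy per group.
wd_ok = bool((toy_encoded[wd_cols].sum(axis=1) == 1).all())
time_ok = bool((toy_encoded[time_cols].sum(axis=1) == 1).all())
print(wd_ok, time_ok)
```

A row that fails this check usually points to a missing value in the source column, because `get_dummies` leaves all dummies at zero for NaN by default.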


In [12]:
df_encoded.head()

Unnamed: 0,time,use_house_overall,generated_solar,dishwasher,homeoffice,fridge,winecellar,garagedoor,barn,well,microwave,livingroom,temperature,humidity,visibility,pressure,windspeed,cloudcover,windbearing,precipprobability,furnace,kitchen,year,month,day,weekofyear,hour,minute,wd_Friday,wd_Monday,wd_Saturday,wd_Sunday,wd_Thursday,wd_Tuesday,wd_Wednesday,time_Afternoon,time_Evening,time_Morning,time_Night
0,2016-01-01 05:00:00,1.04413,0.003307,6.4e-05,0.241814,0.037861,0.063351,0.013046,0.038881,0.001042,0.021652,0.001505,36.131,0.619667,10.0,1016.888,9.150333,0.75,282.1,0.0,0.393188,0.000274,2016.0,1.0,1.0,53.0,5,29.5,True,False,False,False,False,False,False,False,False,True,False
1,2016-01-01 06:00:00,0.918167,0.003422,9.9e-05,0.043294,0.075522,0.112942,0.012836,0.039181,0.001021,0.004216,0.001618,35.838667,0.61,10.0,1016.232,8.284,0.75,284.733333,0.0,0.456708,0.00025,2016.0,1.0,1.0,53.0,6,29.5,True,False,False,False,False,False,False,False,False,True,False
2,2016-01-01 07:00:00,0.714736,0.003448,4.3e-05,0.043416,0.059486,0.007184,0.013299,0.034439,0.001014,0.004246,0.001629,35.385,0.613,10.0,1015.989,7.927,0.75,279.4,0.0,0.37217,0.000242,2016.0,1.0,1.0,53.0,7,29.5,True,False,False,False,False,False,False,False,False,True,False
3,2016-01-01 08:00:00,0.960013,0.003447,0.000138,0.065014,0.060412,0.007045,0.012925,0.034195,0.001016,0.004274,0.001634,35.282,0.64,10.0,1016.042,5.684667,0.75,265.0,0.0,0.61637,0.000269,2016.0,1.0,1.0,53.0,8,29.5,True,False,False,False,False,False,False,False,False,True,False
4,2016-01-01 09:00:00,0.639836,0.003439,6e-05,0.043392,0.035106,0.007143,0.01322,0.03183,0.001014,0.004258,0.00165,35.451667,0.641667,10.0,1015.815,6.975,0.625,265.5,0.0,0.343842,0.000265,2016.0,1.0,1.0,53.0,9,29.5,True,False,False,False,False,False,False,False,False,True,False


In [13]:
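# Cyclical encoding of hour: map 0-23 onto the unit circle so that
# 23:00 and 00:00 end up adjacent instead of 23 units apart
df_encoded["hour_sin"] = np.sin(2 * np.pi * df_encoded["hour"] / 24)
df_encoded["hour_cos"] = np.cos(2 * np.pi * df_encoded["hour"] / 24)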
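
As a rough illustration of what the sine/cosine pair buys, the sketch below compares distances between a few hours once they are placed on the unit circle. The `hours` array and the helper `circle_distance` are hypothetical names used only for this example.

```python
import numpy as np

# A few illustrative hours of the day.
hours = np.array([0, 6, 12, 23])

# Same transformation as in the cell above, applied to the toy array.
hour_sin = np.sin(2 * np.pi * hours / 24)
hour_cos = np.cos(2 * np.pi * hours / 24)
points = np.column_stack([hour_sin, hour_cos])


def circle_distance(i: int, j: int) -> float:
    """Euclidean distance between two encoded hours (by index into `hours`)."""
    return float(np.linalg.norm(points[i] - points[j]))


d_23_0 = circle_distance(3, 0)   # 23:00 vs 00:00, one hour apart on the clock
d_12_0 = circle_distance(2, 0)   # 12:00 vs 00:00, opposite sides of the circle

print(d_23_0 < d_12_0)
```

On the raw `hour` column, 23 and 0 are the two most distant values; in the encoded space they are neighbors, while noon and midnight are the farthest apart.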

In [7]:
df_encoded.to_csv("../data/interim/data_encoded.csv", index=False)
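
A quick round-trip check can confirm that the boolean dummy columns survive the CSV export and that `index=False` keeps an extra index column out of the file. This is a minimal sketch using an in-memory buffer and a hypothetical `toy` frame rather than the real output path.

```python
import io

import pandas as pd

# Hypothetical two-row frame with one dummy column and one numeric column.
toy = pd.DataFrame({
    "wd_Friday": [True, False],
    "use_house_overall": [1.04, 0.92],
})

# Write to an in-memory buffer instead of disk, then read it back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(list(restored.columns))
print(restored.dtypes)
```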