# Electricity Consumption Prediction – Feature Engineering & Model Training

This notebook focuses on:

- Filtering electricity consumption data (meter = 0)
- Creating custom transformers
- Building preprocessing pipelines
- Training multiple regression models
- Hyperparameter tuning (GridSearchCV & RandomizedSearchCV)
- Building a weighted VotingRegressor ensemble
- Evaluating test performance (R² score)

This notebook corresponds to the preprocessing and model training steps for the ASHRAE Great Energy Predictor III dataset.


In [None]:
import numpy as np
import pandas as pd 
from sklearn.model_selection import train_test_split

# Load and Merge the Raw Datasets

We load:
- `train.csv` → meter readings
- `building_metadata.csv` → building characteristics
- `weather_train.csv` → weather data

Then merge them to build a complete dataset.


In [2]:
data = pd.read_csv('train.csv')
building = pd.read_csv('building_metadata.csv')
weather = pd.read_csv('weather_train.csv')
df1 = data.merge(building, on="building_id", how="left")
df2 = df1.merge(weather, on=["site_id", "timestamp"], how="left")
data = df2[df2['meter']==0]
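
As a quick sanity check on the merge logic, the sketch below reproduces the same two left joins and the `meter == 0` filter on tiny hand-made frames. All values and the names `readings`, `buildings`, `weather_toy`, `merged` and `electricity` are hypothetical stand-ins, not rows from the real files. The `validate="many_to_one"` argument is an optional safeguard that raises if the right-hand table has duplicate keys.

```python
import pandas as pd

# Toy stand-ins for train.csv, building_metadata.csv and weather_train.csv (hypothetical values)
readings = pd.DataFrame({
    "building_id": [0, 0, 1, 1],
    "meter": [0, 1, 0, 0],
    "timestamp": ["2016-01-01 00:00:00", "2016-01-01 00:00:00",
                  "2016-01-01 00:00:00", "2016-01-01 01:00:00"],
    "meter_reading": [10.0, 3.0, 25.0, 27.0],
})
buildings = pd.DataFrame({
    "building_id": [0, 1],
    "site_id": [0, 1],
    "square_feet": [7432, 2720],
})
weather_toy = pd.DataFrame({
    "site_id": [0, 1],
    "timestamp": ["2016-01-01 00:00:00", "2016-01-01 00:00:00"],
    "air_temperature": [25.0, 3.8],
})

# Left joins keep every meter reading, even when no weather row matches
merged = (
    readings
    .merge(buildings, on="building_id", how="left", validate="many_to_one")
    .merge(weather_toy, on=["site_id", "timestamp"], how="left", validate="many_to_one")
)

# Keep electricity readings only (meter == 0)
electricity = merged[merged["meter"] == 0]

print(len(readings), len(merged), len(electricity))
print(electricity[["building_id", "timestamp", "air_temperature"]])
```

The reading at 01:00 has no matching weather row, so its `air_temperature` is `NaN` after the join. This is why the numerical pipeline later in the notebook starts with an imputer.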

In [3]:
data.columns

Index(['building_id', 'meter', 'timestamp', 'meter_reading', 'site_id',
       'primary_use', 'square_feet', 'year_built', 'floor_count',
       'air_temperature', 'cloud_coverage', 'dew_temperature',
       'precip_depth_1_hr', 'sea_level_pressure', 'wind_direction',
       'wind_speed'],
      dtype='object')

# Train/Test Split

We split the filtered data into:
- `data_train` (80%)
- `data_test` (20%)

Both splits come from the filtered data: the training split is preprocessed and used for model training, and the test split is held out for the final evaluation.


In [4]:
data_train , data_test = train_test_split(data,test_size=0.2 , random_state = 77)

# Separate Features and Target

We separate:
- Features → `elect`
- Target → `elect_labels` (meter_reading)

This is required before building preprocessing pipelines.


In [5]:
elect = data_train.drop(['meter_reading'],axis = 1)
elect_labels = data_train['meter_reading'].copy()

# Custom Transformer: Extract Hour from Timestamp

We define a custom scikit-learn transformer to extract the hour of day from the timestamp.

This helps capture daily energy patterns.


In [6]:
from sklearn.base import BaseEstimator, TransformerMixin

class hours(BaseEstimator, TransformerMixin):
    """Extract the hour of day (0-23) from the 'timestamp' column."""
    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn
        return self
    def transform(self, X):
        # Returns a one-column DataFrame holding the hour of each timestamp
        return pd.to_datetime(X['timestamp']).dt.hour.to_frame()
    def get_feature_names_out(self, input_features=None):
        return ['hour']


      

In [7]:
h = hours()

In [8]:
h.fit_transform(elect)

          timestamp
3437061          15
13067537         20
18065561         12
8221186           5
10397264         21
...             ...
7134057          21
13022101          0
11257746          4
7961924          14


In [9]:
from sklearn.preprocessing import FunctionTransformer

#  Custom Transformer: Log1p Transformation for Skewed Features

We define a `LogTransformer` class to apply `log1p` on highly skewed features (e.g., square_feet).

Compressing the long right tail keeps a few very large buildings from dominating the scaled feature, which tends to help model stability.


In [10]:


class LogTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        return np.log1p(X)
    
    def get_feature_names_out(self, input_features=None):
        return np.array([f"{col}_log" for col in input_features])
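
To illustrate what `log1p` does to a skewed feature, here is a minimal sketch on made-up floor areas (the values in `square_feet` are assumptions, not taken from the dataset). It compares the spread before and after the transform and checks that `np.expm1` undoes it.

```python
import numpy as np

# Hypothetical, heavily right-skewed floor areas
square_feet = np.array([500.0, 2_000.0, 10_000.0, 50_000.0, 800_000.0])

# log1p(x) = log(1 + x); it is defined at x = 0, unlike a plain log
logged = np.log1p(square_feet)

# Ratio between largest and smallest value, before and after the transform
ratio_raw = square_feet.max() / square_feet.min()
ratio_log = logged.max() / logged.min()

# expm1 is the exact inverse of log1p
recovered = np.expm1(logged)

print(ratio_raw, ratio_log)
```

On the raw scale the largest building is 1600 times the smallest, while after `log1p` the ratio is a little over 2, so a later `StandardScaler` is no longer driven by a single extreme value.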


In [11]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

#  Numerical Preprocessing Pipeline

We build a pipeline for numerical inputs:
- Median imputation
- Standard scaling

Used for weather-related variables.


In [12]:
stand = Pipeline([('impute',SimpleImputer(strategy='median')),('scaler',StandardScaler(with_mean=True , with_std =True ))])

In [13]:
stand.fit(elect[['air_temperature']])
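
The effect of this pipeline can be seen on a small example. The sketch below applies the same two steps to a hypothetical temperature column with one missing value; `temps` and `toy_numeric` are illustrative names.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical temperatures with one missing reading
temps = np.array([[10.0], [12.0], [np.nan], [14.0], [30.0]])

toy_numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
scaled = toy_numeric.fit_transform(temps)

# The imputer stores the median it learned from the observed values
filled_value = toy_numeric.named_steps["impute"].statistics_[0]

print(filled_value)
print(scaled.ravel())
```

The median of the four observed values (10, 12, 14, 30) is 13, which is less affected by the outlier 30 than the mean would be. After scaling, the column has zero mean and unit standard deviation.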

In [14]:
elect.columns


Index(['building_id', 'meter', 'timestamp', 'site_id', 'primary_use',
       'square_feet', 'year_built', 'floor_count', 'air_temperature',
       'cloud_coverage', 'dew_temperature', 'precip_depth_1_hr',
       'sea_level_pressure', 'wind_direction', 'wind_speed'],
      dtype='object')

In [15]:
from sklearn.preprocessing import OneHotEncoder

#  Categorical Preprocessing Pipeline

We process:
- `primary_use`
- `site_id`

using:
- Most frequent imputation
- OneHotEncoder (ignore unknown categories)


In [16]:
# cat = categorical
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False)),
])
cat_pipeline.fit(elect[['primary_use']])

In [17]:
cat_pipeline.get_feature_names_out()


array(['primary_use_Education',
       'primary_use_Entertainment/public assembly',
       'primary_use_Food sales and service', 'primary_use_Healthcare',
       'primary_use_Lodging/residential',
       'primary_use_Manufacturing/industrial', 'primary_use_Office',
       'primary_use_Other', 'primary_use_Parking',
       'primary_use_Public services', 'primary_use_Religious worship',
       'primary_use_Retail', 'primary_use_Services',
       'primary_use_Technology/science', 'primary_use_Utility',
       'primary_use_Warehouse/storage'], dtype=object)

In [18]:
cat_pipeline.transform(elect[['primary_use']])

array([[1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
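
The `handle_unknown='ignore'` setting matters when the test data contains a category that was never seen during fitting. The sketch below shows the behaviour on toy data; the category `"Stadium"` and the names `train_toy`, `new_toy` and `toy_cat` are hypothetical. It leaves the encoder output sparse and calls `.toarray()` only for display.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy training column with one missing value
train_toy = pd.DataFrame(
    {"primary_use": pd.Series(["Office", "Education", "Office", np.nan], dtype=object)}
)
# "Stadium" is a made-up category that does not appear in the training column
new_toy = pd.DataFrame(
    {"primary_use": pd.Series(["Education", "Stadium"], dtype=object)}
)

toy_cat = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])
toy_cat.fit(train_toy)

categories = list(toy_cat.named_steps["encoder"].categories_[0])
encoded_new = toy_cat.transform(new_toy).toarray()

print(categories)
print(encoded_new)
```

The unseen category is encoded as a row of zeros instead of raising an error, so prediction on new data does not fail when a new building type appears.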

#  Log Transformation Pipeline (for square_feet)

We apply:
- median imputation
- log1p transform
- standard scaling


In [19]:
# transformation for square_feet
log_features = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('log', LogTransformer()),
    ('scaler', StandardScaler(with_mean=True, with_std=True)),
])

In [20]:
log_features.fit(elect[['square_feet']])

In [21]:
log_features.transform(elect[['square_feet']])

array([[-0.5658291 ],
       [ 0.00804435],
       [ 1.21795219],
       ...,
       [-1.58843104],
       [ 0.35961759],
       [ 0.96636127]])

In [22]:
from sklearn.compose import ColumnTransformer

In [23]:
elect['precip_depth_1_hr'].value_counts()

precip_depth_1_hr
 0.0      6749406
-1.0       504288
 3.0       111869
 5.0        54851
 8.0        32979
           ...   
 162.0         13
 47.0          13
 164.0         12
 103.0         12
 98.0          12
Name: count, Length: 128, dtype: int64

In [24]:
elect.columns 

Index(['building_id', 'meter', 'timestamp', 'site_id', 'primary_use',
       'square_feet', 'year_built', 'floor_count', 'air_temperature',
       'cloud_coverage', 'dew_temperature', 'precip_depth_1_hr',
       'sea_level_pressure', 'wind_direction', 'wind_speed'],
      dtype='object')

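# Combined Preprocessing with ColumnTransformer

We combine the individual pipelines into a single `ColumnTransformer`:
- `square_feet` → median imputation, log1p transform, standard scaling
- `timestamp` → hour of day
- `primary_use`, `site_id` → most frequent imputation, one-hot encoding
- `floor_count` → median imputation
- weather variables → median imputation, standard scaling

Columns not listed (`building_id`, `meter`, `year_built`) are dropped, since `remainder` defaults to `'drop'`.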


In [25]:
preprocessing = ColumnTransformer([
    ('log_square_feet', log_features, ['square_feet']),
    ('hour', hours(), ['timestamp']),
    ('cat', cat_pipeline, ['primary_use', 'site_id']),
    ('floor_count', SimpleImputer(strategy='median'), ['floor_count']),
    ('simple', stand, ['air_temperature', 'cloud_coverage', 'dew_temperature',
                       'precip_depth_1_hr', 'sea_level_pressure',
                       'wind_direction', 'wind_speed']),
])

In [26]:
preprocessing.fit(elect)

#  Fit Preprocessing on the Training Data

We fit the ColumnTransformer and generate:
- `X_tr` → training sample (15k rows)
- `X_all` → the full training set, transformed


In [27]:
sample = elect.sample(15000)
X_tr = preprocessing.transform(sample)


In [28]:

X_all = preprocessing.transform(elect)

In [29]:
preprocessing.get_feature_names_out()

array(['log_square_feet__square_feet_log', 'hour__hour',
       'cat__primary_use_Education',
       'cat__primary_use_Entertainment/public assembly',
       'cat__primary_use_Food sales and service',
       'cat__primary_use_Healthcare',
       'cat__primary_use_Lodging/residential',
       'cat__primary_use_Manufacturing/industrial',
       'cat__primary_use_Office', 'cat__primary_use_Other',
       'cat__primary_use_Parking', 'cat__primary_use_Public services',
       'cat__primary_use_Religious worship', 'cat__primary_use_Retail',
       'cat__primary_use_Services', 'cat__primary_use_Technology/science',
       'cat__primary_use_Utility', 'cat__primary_use_Warehouse/storage',
       'cat__site_id_0', 'cat__site_id_1', 'cat__site_id_2',
       'cat__site_id_3', 'cat__site_id_4', 'cat__site_id_5',
       'cat__site_id_6', 'cat__site_id_7', 'cat__site_id_8',
       'cat__site_id_9', 'cat__site_id_10', 'cat__site_id_11',
       'cat__site_id_12', 'cat__site_id_13', 'cat__site_id_14',
 

In [30]:
from sklearn.linear_model import LinearRegression


In [31]:
from sklearn.ensemble import RandomForestRegressor , VotingRegressor , ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV
sample2 = elect.sample(8000) 


#  Hyperparameter Tuning – ExtraTreesRegressor (GridSearchCV)

We tune:
- `max_features`  
- Number of estimators remains default  

ExtraTrees is used to capture non-linear relationships across the many one-hot and numeric features.


In [32]:
ext = ExtraTreesRegressor( n_jobs=-1, random_state = 47)
params_ext = {'max_features':np.arange(2,X_all.shape[1],2)}
search_ext = GridSearchCV(ext,params_ext , cv=5,scoring = 'neg_mean_squared_error' )
search_ext.fit(X_tr , elect_labels[sample.index])
search_ext.best_params_


{'max_features': np.int64(36)}

In [33]:
search_ext.best_estimator_.score(preprocessing.transform(sample2),elect_labels[sample2.index])

0.7964510363817976
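
The same search pattern can be tried on synthetic data to see how the pieces fit together. This is only a sketch: the data from `make_regression`, the reduced `n_estimators=20` and the four candidate values are assumptions chosen to keep it fast. Because the scoring is `neg_mean_squared_error`, `best_score_` is a negative MSE and has to be negated before taking the square root.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem standing in for the preprocessed features
X_toy, y_toy = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

grid = GridSearchCV(
    ExtraTreesRegressor(n_estimators=20, random_state=0),
    {"max_features": [2, 4, 6, 8]},
    cv=3,
    scoring="neg_mean_squared_error",
)
grid.fit(X_toy, y_toy)

# best_score_ is the mean negative MSE across folds for the best candidate
best_rmse = np.sqrt(-grid.best_score_)

print(grid.best_params_, best_rmse)
```

Note that `best_estimator_.score(...)` returns R², not the MSE used during the search, so the two numbers are on different scales.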

In [34]:
from sklearn.ensemble import GradientBoostingRegressor as GBR

#  Hyperparameter Tuning – GradientBoostingRegressor (RandomizedSearchCV)

We tune:
- n_estimators
- max_depth
- learning_rate

GradientBoosting helps capture complex non-linear relationships.


In [35]:
from sklearn.model_selection import RandomizedSearchCV
GB = GBR( random_state = 47)
params_GB = {'n_estimators':[ 200, 300,500,400,600] ,'max_depth':range(2,5) ,'learning_rate':[0.1,0.2,0.3,0.01]}
search_GB = RandomizedSearchCV(GB , params_GB ,n_iter = 8,  cv = 3, scoring = 'neg_mean_squared_error' )
search_GB.fit(X_tr , elect_labels[sample.index])
search_GB.best_params_

{'n_estimators': 600, 'max_depth': 4, 'learning_rate': 0.1}

In [36]:

search_GB.best_estimator_.score(preprocessing.transform(sample2),elect_labels[sample2.index])



0.7632213716070683
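
To put `n_iter = 8` in context, the sketch below counts how many combinations the grid above contains and how many a randomized search samples. The name `params_GB_grid` is a copy of the search space made for illustration, and `random_state=0` is an assumption (the original search does not fix one).

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same search space as the GradientBoosting search above
params_GB_grid = {
    "n_estimators": [200, 300, 500, 400, 600],
    "max_depth": list(range(2, 5)),
    "learning_rate": [0.1, 0.2, 0.3, 0.01],
}

# Full grid size: 5 * 3 * 4 combinations
n_total = len(ParameterGrid(params_GB_grid))

# A randomized search draws n_iter of them (without replacement for list-only spaces)
draws = list(ParameterSampler(params_GB_grid, n_iter=8, random_state=0))
fraction_tried = len(draws) / n_total

print(n_total, len(draws), round(fraction_tried, 3))
```

Only 8 of the 60 combinations (about 13%) are evaluated, so the reported best parameters are the best among those sampled, not necessarily the best in the grid.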

Spline model 

In [37]:
from sklearn.preprocessing import SplineTransformer
from sklearn.linear_model import Ridge

# Baseline Spline + Ridge Regression Model

We train a pipeline:
- Spline expansion (non-linear)
- Ridge regression (stabilized linear model)

Then tune:
- Spline degree
- Number of knots
- Ridge alpha


In [56]:
spline_model = Pipeline([('spline',SplineTransformer(degree= 4 , n_knots=8)) , ('ridge',Ridge(alpha=0.01))])
spline_model.fit(X_tr , elect_labels[sample.index])
spline_model.score(preprocessing.transform(sample2),elect_labels[sample2.index])
params_ridge = {'spline__degree':[3,4,5,6] ,'spline__n_knots':range(3,14) , 'ridge__alpha':[0.01,0.1,1,0.2,0.05]}
search_ridge = RandomizedSearchCV(spline_model , params_ridge , cv = 3 , scoring = 'neg_mean_squared_error')
search_ridge.fit(X_tr , elect_labels[sample.index])


In [45]:
sample_poids = elect.sample(8000) 
X_poids = preprocessing.transform(sample_poids)

In [40]:
search_ridge.best_estimator_.score(preprocessing.transform(sample2),elect_labels[sample2.index])

0.4713271693195802

In [42]:
search_ridge.best_params_

{'spline__n_knots': 11, 'spline__degree': 4, 'ridge__alpha': 0.01}

#  Compute RMSE for Each Model

We evaluate:
- ExtraTrees RMSE
- GradientBoosting RMSE
- SplineRidge RMSE

These RMSE scores are used to compute weights for the VotingRegressor ensemble.


Calculate predictions for the three models.

In [46]:

y_pred_ext = search_ext.best_estimator_.predict(X_poids)
y_pred_GB = search_GB.best_estimator_.predict(X_poids)
y_pred_ridge = search_ridge.best_estimator_.predict(X_poids)

RMSE for the three models

In [69]:
from sklearn.metrics import mean_squared_error, r2_score
# Score all three models on the same sample (y_true first, then y_pred)
y_poids = elect_labels[sample_poids.index]
RMSE_ext = np.sqrt(mean_squared_error(y_poids, y_pred_ext))
RMSE_GB = np.sqrt(mean_squared_error(y_poids, y_pred_GB))
RMSE_ridge = np.sqrt(mean_squared_error(y_poids, y_pred_ridge))
RMSE_ext, RMSE_GB, RMSE_ridge

(np.float64(205.19459529747388),
 np.float64(225.2424397352213),
 np.float64(294.9179221839175))
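
As a reminder of what these numbers mean, RMSE is the square root of the mean squared difference between true and predicted values, expressed in the same units as `meter_reading`. A minimal worked example on made-up values (`y_true_toy` and `y_pred_toy` are hypothetical):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical readings and predictions
y_true_toy = np.array([100.0, 200.0, 300.0])
y_pred_toy = np.array([110.0, 190.0, 330.0])

# Errors are 10, -10 and 30, so squared errors are 100, 100 and 900
rmse_manual = np.sqrt(np.mean((y_true_toy - y_pred_toy) ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(y_true_toy, y_pred_toy))

print(rmse_manual, rmse_sklearn)
```

The single error of 30 contributes most of the total, which shows how RMSE penalizes large misses more heavily than small ones.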

In [52]:
best_params_gb = search_GB.best_params_
best_params_ext = search_ext.best_params_
best_params_ridge = search_ridge.best_params_

#  Weighted VotingRegressor

We create an ensemble of:
- ExtraTrees
- GradientBoosting
- SplineRidge

Weights = inverse RMSE  
Better models receive higher influence.
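
To see what inverse-RMSE weighting amounts to, the sketch below plugs in the three RMSE values reported above and normalizes the weights so they sum to 1. The normalization is only for readability and is an addition here. The names `rmse` and `weights` are illustrative.

```python
import numpy as np

# RMSE values reported earlier in this notebook (ExtraTrees, GradientBoosting, SplineRidge)
rmse = np.array([205.19459529747388, 225.2424397352213, 294.9179221839175])

# Inverse-RMSE weights: smaller error gives a larger weight
raw_weights = 1.0 / rmse

# Normalized shares, so each model's contribution to the average is easy to read
weights = raw_weights / raw_weights.sum()

print(weights.round(4))
```

The shares come out at roughly 0.38, 0.35 and 0.27, so the weakest model still contributes more than a quarter of the final prediction. Inverse RMSE differentiates the models only mildly.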


Voting regressor 

In [60]:


VG = VotingRegressor(
    estimators=[('extratree', search_ext.best_estimator_),
                ('gradient_boosting', search_GB.best_estimator_),
                ('ridge_regression', search_ridge.best_estimator_)
               ],
    # Inverse-RMSE weights: lower error gives a larger share of the vote
    weights=[1/RMSE_ext, 1/RMSE_GB, 1/RMSE_ridge]
)

0.7523679791710571

In [None]:
general_pipe = Pipeline([('pre',preprocessing ) , ('voting',VG)])
elect_sample_sele = elect.sample(20000)
general_pipe.fit(elect,elect_labels)

In [67]:
X_test = data_test.drop(['meter_reading'],axis=1)
Y_test = data_test['meter_reading'].copy()

In [70]:
Y_pred_test = general_pipe.predict(X_test)


In [71]:
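# r2_score expects (y_true, y_pred); R² is not symmetric in its arguments.
# The value below was recorded with the reversed call r2_score(Y_pred_test, Y_test).
r2_score(Y_test, Y_pred_test)

0.6177444229243141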
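
Unlike MSE, R² depends on which array is treated as the ground truth, because it divides by the variance of the first argument. The toy example below (hypothetical values in `y_true_toy` and `y_pred_toy`) shows how much the result can change when the arguments are swapped.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical targets and predictions
y_true_toy = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_toy = np.array([2.0, 2.0, 3.0, 3.0])

# Correct order: residual sum of squares 2, total sum of squares 5
r2_correct = r2_score(y_true_toy, y_pred_toy)

# Swapped order: same residuals, but the total sum of squares is now 1
r2_swapped = r2_score(y_pred_toy, y_true_toy)

print(r2_correct, r2_swapped)
```

Here the correct call gives 0.6 and the swapped call gives -1.0, so the signature `r2_score(y_true, y_pred)` should be followed when reporting test performance.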