# Seattle non-residential buildings energy consumption

This notebook aims to predict the total energy consumption of non-residential buildings in the city of Seattle.
It relies on an official dataset available here: https://data.seattle.gov/dataset/2016-Building-Energy-Benchmarking/2bpz-gwpy
We use a version of the dataset that we modified ourselves, available at this address: [insert GitHub link]

Another, almost identical notebook deals with GHG emissions rather than energy consumption.

## Initialization

In [63]:
# Packages import
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [64]:
# Loading the dataset
data = pd.read_csv(r'./Data/seattle_cleaned_dataset.csv', sep = ',', low_memory = False)

In [65]:
data.describe()

Unnamed: 0,OSEBuildingID,ZipCode,Latitude,Longitude,NumberofBuildings,NumberofFloors,PropertyGFATotal,PropertyGFAParking,PropertyGFABuilding(s),ENERGYSTARScore,SiteEUIWN(kBtu/sf),SiteEnergyUse(kBtu),SiteEnergyUseWN(kBtu),SteamUse(kBtu),Electricity(kBtu),NaturalGas(kBtu),TotalGHGEmissions,GHGEmissionsIntensity,BuildingAge
count,1636.0,1620.0,1636.0,1636.0,1636.0,1636.0,1636.0,1636.0,1636.0,1078.0,1636.0,1636.0,1636.0,1636.0,1636.0,1636.0,1636.0,1636.0,1636.0
mean,16255.489609,98116.932716,47.616126,-122.332965,1.099633,4.118582,113748.1,12945.718826,100802.3,65.138219,75.018949,7970777.0,8122033.0,470024.8,5475530.0,1997737.0,180.552152,1.626333,52.827628
std,13854.775408,18.533239,0.048417,0.024662,1.161766,6.565362,194133.3,42535.544496,172846.4,28.379155,74.91048,21707280.0,22190250.0,5156614.0,13393650.0,9453094.0,708.466267,2.352106,32.526775
min,1.0,98006.0,47.49917,-122.41182,0.0,0.0,11285.0,0.0,3636.0,1.0,0.0,0.0,0.0,0.0,-115417.0,0.0,-0.8,-0.02,0.0
25%,577.75,98105.0,47.58516,-122.343335,1.0,1.0,29505.5,0.0,28523.25,48.0,36.099998,1251083.0,1322253.0,0.0,727268.8,0.0,20.4275,0.36,26.0
50%,21131.0,98110.0,47.61238,-122.33289,1.0,2.0,49712.0,0.0,47637.5,73.0,54.299999,2582214.0,2736046.0,0.0,1628064.0,514163.0,50.015,0.88,49.0
75%,24591.5,98125.0,47.64976,-122.321765,1.0,4.0,106010.2,0.0,95067.5,89.0,85.299997,6928335.0,7187220.0,0.0,4882877.0,1529470.0,144.87,1.91,85.0
max,50226.0,98199.0,47.73387,-122.25864,27.0,99.0,2200000.0,512608.0,2200000.0,100.0,834.400024,448385300.0,471613900.0,134943500.0,274532500.0,297909000.0,16870.98,34.09,115.0


The minimums of Electricity(kBtu) and TotalGHGEmissions are negative. As previously investigated, they correspond to the Bullitt Center, self-designated as "the Greenest Commercial Building in the World". Indeed, averaged over a given year, it produces more energy than it requires to operate.

Considering it is not a measurement error but an outlier, we'll keep it for the moment, and maybe run two models: one with and one without this observation.
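
To prepare the two variants mentioned above, the negative-valued observation can be isolated with a boolean mask. This is only a sketch on a toy frame: the rows are made up for illustration, and only the column names come from the dataset.

```python
import pandas as pd

# Toy data: illustrative values, only the column names match the dataset
toy = pd.DataFrame({
    "Electricity(kBtu)": [727268.8, -115417.0, 1628064.0],
    "TotalGHGEmissions": [20.43, -0.8, 50.02],
})

# Flag rows where either quantity is negative
is_negative = (toy["Electricity(kBtu)"] < 0) | (toy["TotalGHGEmissions"] < 0)

with_outlier = toy                   # data for a model that keeps the outlier
without_outlier = toy[~is_negative]  # data for a model that drops it

print(len(with_outlier), len(without_outlier))
```

Keeping both frames side by side makes it easy to fit the same pipeline twice and compare the scores.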

In [66]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1636 entries, 0 to 1635
Data columns (total 26 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   OSEBuildingID              1636 non-null   int64  
 1   PrimaryPropertyType        1636 non-null   object 
 2   PropertyName               1636 non-null   object 
 3   Address                    1636 non-null   object 
 4   ZipCode                    1620 non-null   float64
 5   Neighborhood               1636 non-null   object 
 6   Latitude                   1636 non-null   float64
 7   Longitude                  1636 non-null   float64
 8   NumberofBuildings          1636 non-null   float64
 9   NumberofFloors             1636 non-null   int64  
 10  PropertyGFATotal           1636 non-null   int64  
 11  PropertyGFAParking         1636 non-null   int64  
 12  PropertyGFABuilding(s)     1636 non-null   int64  
 13  ListOfAllPropertyUseTypes  1636 non-null   objec

All variables seem to be stored with the proper type.

We have some missing values for ZipCode, LargestPropertyUseType, and ENERGYSTARScore. We'll delete ZipCode (as well as other localization/id variables) but keep the other two.

## Preprocessing

In [67]:
# Exclude some 'id' and localization variables for now
data = data.drop(['OSEBuildingID', 'PropertyName', 'Address', 'ZipCode', 'ComplianceStatus'], axis = 1)

Some buildings have 0 as a value for SiteEnergyUse, and therefore for SiteEnergyUseWN; we delete them.

In [68]:
data = data[~(data['SiteEnergyUseWN(kBtu)'] == 0)]
data = data[~(data['SiteEnergyUse(kBtu)'] == 0)]

Also, a building has 0 consumption of steam, natural gas and electricity, which is not possible. We delete it as well.

In [69]:
data = data[~(data['SiteEnergyUseWN(kBtu)'] == 12843856.0)]
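
The filter above identifies the building through its SiteEnergyUseWN value. An alternative, shown here only as a sketch on made-up data, is to express the condition itself (all three energy sources equal to zero), which keeps the intent readable. The toy values and the names `toy`, `sources` and `no_source` are illustrative assumptions.

```python
import pandas as pd

# Toy data: the second row has no steam, electricity or natural gas at all
toy = pd.DataFrame({
    "SteamUse(kBtu)":    [0.0, 0.0, 500.0],
    "Electricity(kBtu)": [1200.0, 0.0, 800.0],
    "NaturalGas(kBtu)":  [0.0, 0.0, 300.0],
})

sources = ["SteamUse(kBtu)", "Electricity(kBtu)", "NaturalGas(kBtu)"]

# True for rows where the three energy sources are all zero
no_source = (toy[sources] == 0).all(axis=1)
toy_clean = toy[~no_source]

print(toy_clean)
```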

#### Numerical vs Categorical features decomposition

In [70]:
# Categorical features
categorical_features = data[['PrimaryPropertyType', 'Neighborhood', 
                             'ListOfAllPropertyUseTypes', 'LargestPropertyUseType']]
categorical_features.nunique()

PrimaryPropertyType           21
Neighborhood                  19
ListOfAllPropertyUseTypes    359
LargestPropertyUseType        54
dtype: int64

As seen in the data exploration notebook, ListOfAllPropertyUseTypes takes a very large number of distinct values (362 there, 359 after the filtering above). On the other hand, PrimaryPropertyType (with 21 unique values) may be a little too imprecise.  
In between, LargestPropertyUseType has 54 unique values (missing values excluded) and will be retained as our 'building type' variable. We therefore have to delete the four observations that have no value for this feature.

In [71]:
data = data[~(data['LargestPropertyUseType'].isnull())]

In [72]:
categorical_features = data[['Neighborhood', 'LargestPropertyUseType']]

In [73]:
# Numerical features
numerical_features = data[['Latitude', 'Longitude',
                           'NumberofBuildings', 'NumberofFloors', 'BuildingAge',
                           'PropertyGFATotal','PropertyGFAParking', 'PropertyGFABuilding(s)',
                           'SiteEUIWN(kBtu/sf)', 'SiteEnergyUseWN(kBtu)',
                           'SteamUse(kBtu)', 'Electricity(kBtu)', 'NaturalGas(kBtu)',
                           'TotalGHGEmissions', 'GHGEmissionsIntensity',
                           'ENERGYSTARScore']]

The retained numerical features certainly exhibit multicollinearity. For example, PropertyGFATotal is positively correlated with NumberofBuildings.
Including both is not a problem for the overall performance and predictive power of the model, but it blurs the explanatory impact of each variable. We should keep that in mind.
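
One quick way to see this kind of redundancy is a correlation matrix. The sketch below uses synthetic surfaces, not the Seattle data, and assumes that the total GFA is simply the building GFA plus the parking GFA.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic surfaces: total GFA is building GFA plus parking GFA (assumption)
building = rng.uniform(10_000, 200_000, size=200)
parking = rng.uniform(0, 20_000, size=200)
toy = pd.DataFrame({
    "PropertyGFABuilding(s)": building,
    "PropertyGFAParking": parking,
    "PropertyGFATotal": building + parking,
})

# Pairwise Pearson correlations; values close to 1 signal redundant features
corr = toy.corr()
print(corr.round(2))
```

A coefficient close to 1 between two features suggests that their individual coefficients in a linear model should be interpreted with care.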

ENERGYSTARScore will receive special treatment because of its relatively low number of observations.

Our target is the energy consumption. Two variables measure it: 'SiteEUIWN(kBtu/sf)' and 'SiteEnergyUseWN(kBtu)'.
We will take the gross energy consumption, 'SiteEnergyUseWN(kBtu)', as the variable to predict. The 'energy use intensity' ('SiteEUIWN(kBtu/sf)') will be dropped. Correspondingly, we use 'SiteEnergyUse(kBtu)' (not weather-normalized) to convert SteamUse(kBtu), Electricity(kBtu) and NaturalGas(kBtu) into their respective shares of energy use.

Similarly, the GHG emissions (raw and intensity) are also highly correlated with consumption, and must be removed.

In [74]:
data['elec_share'] = data['Electricity(kBtu)'] / data['SiteEnergyUse(kBtu)']
data['gas_share'] = data['NaturalGas(kBtu)'] / data['SiteEnergyUse(kBtu)']
data['steam_share'] = data['SteamUse(kBtu)'] / data['SiteEnergyUse(kBtu)']
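
As a sanity check on this transformation, the three shares should sum to roughly 1 whenever steam, electricity and natural gas account for all of a building's consumption. A minimal sketch on made-up values (the numbers are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "SiteEnergyUse(kBtu)": [1000.0, 2000.0],
    "Electricity(kBtu)":   [600.0, 2000.0],
    "NaturalGas(kBtu)":    [300.0, 0.0],
    "SteamUse(kBtu)":      [100.0, 0.0],
})

# Same transformation as in the cell above, written as a loop
for source, share in [("Electricity(kBtu)", "elec_share"),
                      ("NaturalGas(kBtu)", "gas_share"),
                      ("SteamUse(kBtu)", "steam_share")]:
    toy[share] = toy[source] / toy["SiteEnergyUse(kBtu)"]

# When the three sources cover all of the consumption, the shares sum to 1
total_share = toy[["elec_share", "gas_share", "steam_share"]].sum(axis=1)
print(total_share.tolist())
```

On the real data, rows whose total deviates strongly from 1 would be worth a closer look.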

#### Train & Test sets

In [75]:
from sklearn.model_selection import train_test_split

X = data[['Latitude', 'Longitude', 
          'NumberofBuildings', 'NumberofFloors', 'BuildingAge',
          'PropertyGFATotal','PropertyGFAParking', 'PropertyGFABuilding(s)',
          'elec_share', 'gas_share', 'steam_share',
          'Neighborhood', 'LargestPropertyUseType']]

y = data['SiteEnergyUseWN(kBtu)']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
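
The split above draws a different partition on every run, so the scores reported in this notebook can change between executions. Passing a fixed `random_state` to `train_test_split` makes the split repeatable. A minimal sketch on toy arrays (the seed 42 is an arbitrary choice):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Same seed -> same partition on every run (42 is an arbitrary choice)
split_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Index 1 of the returned list is the test part of X
same_test_rows = np.array_equal(split_a[1], split_b[1])
print(same_test_rows)
```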

#### Results table

In [91]:
Results_table = pd.DataFrame([], columns=['Algorithm','R²', 'MAE', 'RMSE'])

## Models

We will include all remaining preprocessing in pipelines.

First, we are going to run linear models: a LinearRegression (as the baseline model), then regularized variants (Ridge, Lasso and ElasticNet).
Then, we will implement two ensemble methods: a Random Forest and an XGBoost model.

All R², MAE and RMSE values will be stored in a table to compare results.  

In [76]:
# Selected Scikit-Learn modules
# Pipeline tools
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer

# Transformers
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import OneHotEncoder

# Linear models
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import ElasticNet

# Metrics
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

### Baseline model: Linear Regression

In [77]:
# Transformer
transformer = make_column_transformer(
                                      (RobustScaler(),  
                                       # We choose the robust scaler because we have outliers
                                       # that we want to keep
                                              ['Latitude', 'Longitude', 
                                               'NumberofBuildings', 'NumberofFloors', 'BuildingAge',
                                               'PropertyGFATotal','PropertyGFAParking', 'PropertyGFABuilding(s)',
                                               'elec_share', 'gas_share', 'steam_share'
                                              ]),   
                                      (OneHotEncoder(handle_unknown='ignore'),
                                       # we decide to ignore unknown categories because some building types are
                                       # unique, and therefore only exist
                                       # in either the test or the train set.
                                              ['Neighborhood', 'LargestPropertyUseType'])
                                     )

This transformer is going to be used in all linear models.
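
A small sketch of what `handle_unknown='ignore'` does in practice, on made-up categories: a category that was absent when the encoder was fitted is encoded as a row of zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up categories; 'Library' only appears in the test data
train = pd.DataFrame({"LargestPropertyUseType": ["Office", "Hotel", "Office"]})
test = pd.DataFrame({"LargestPropertyUseType": ["Hotel", "Library"]})

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# Columns follow the sorted training categories: ['Hotel', 'Office']
encoded = encoder.transform(test).toarray()
print(encoded)
```

The unseen 'Library' row carries no building-type information at prediction time, which is the price of keeping rare categories in the data.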

In [78]:
## Linear Regression
# Model fitting
model_lr = make_pipeline(transformer, LinearRegression())
model_lr.fit(X_train, y_train)

# Predictions
y_pred = model_lr.predict(X_test)

# Results (note: R² is computed on the training set, MAE and RMSE on the test set)
print("R² =", model_lr.score(X_train, y_train))
print("MAE =", mean_absolute_error(y_test, y_pred))
print("RMSE =", np.sqrt(mean_squared_error(y_test, y_pred)))



R² = 0.6004126973993781
MAE = 5128957.403165458
RMSE = 11809621.043526754


In [95]:
# One row per model, with the same columns as Results_table
Linear_Regression_results = pd.DataFrame([['Linear Regression',
                                           model_lr.score(X_train, y_train),
                                           mean_absolute_error(y_test, y_pred),
                                           np.sqrt(mean_squared_error(y_test, y_pred))]],
                                         columns=['Algorithm', 'R²', 'MAE', 'RMSE'])

In [96]:
Linear_Regression_results

Unnamed: 0,Algorithm,R²,MAE,RMSE
0,Linear Regression,0.600413,4247805.79863,9552474.605328


In [100]:
# Stack rows (axis=0) so the new results line up with the table's columns
Results_table = pd.concat([Results_table, Linear_Regression_results], axis=0, ignore_index=True)
Results_table

Unnamed: 0,Algorithm,R²,MAE,RMSE
0,Linear Regression,0.600413,4247805.79863,9552474.605328


The linear regression model performs quite poorly. We will try a cross-validation to see whether it brings any improvement before moving on to other models.

In [79]:
# Cross-validation - Linear Regression
from sklearn.model_selection import GridSearchCV

params = {'linearregression__fit_intercept' : [True, False]}
grid = GridSearchCV(model_lr, param_grid = params, cv=10)
grid.fit(X_train, y_train)

y_pred = grid.predict(X_test)
print("R² =",grid.score(X_train, y_train))
print("MAE =", mean_absolute_error(y_test, y_pred))
print("RMSE =", np.sqrt(mean_squared_error(y_test, y_pred)))

R² = 0.6004126973993781
MAE = 5128957.403165458
RMSE = 11809621.043526754


The 10-fold cross-validation over fit_intercept did not change the results: the scores are identical to those of the baseline. Let's try a more sophisticated model.
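
Note that `grid.score(X_train, y_train)` evaluates the refitted best estimator on the training data; the cross-validated score itself is available as `grid.best_score_`. To look at the per-fold scores directly, `cross_val_score` can be used. A minimal sketch on synthetic data (the data and coefficients are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# One R² per fold, each computed on the fold held out from fitting
scores = cross_val_score(LinearRegression(), X_toy, y_toy, cv=10, scoring="r2")
print(scores.mean(), scores.std())
```

The spread across folds gives an idea of how much a single train/test split can move the reported metrics.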

### Ridge regression

In [80]:
from sklearn.linear_model import Ridge

# Cross-validated - Ridge Regression
alphas = np.logspace(-5, 50, 100)
params = {'ridge__alpha' : alphas,
         }

model_ridge = make_pipeline(transformer, Ridge())

grid = GridSearchCV(model_ridge, param_grid = params, cv=10)
grid.fit(X_train, y_train)

y_pred = grid.predict(X_test)
print("R² =", grid.score(X_train, y_train))
print("MAE =", mean_absolute_error(y_test, y_pred))
print("RMSE =", np.sqrt(mean_squared_error(y_test, y_pred)))

R² = 0.035531006945403454
MAE = 6669196.985372566
RMSE = 9769007.261571793
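
The grid above spans alpha from 1e-5 to 1e50. At the upper end, the coefficients are shrunk to almost zero, so it can be worth checking which alpha the search actually selected and what its cross-validated score was. The sketch below shows how to read both values; the synthetic data and the narrower grid (1e-3 to 1e3) are illustrative assumptions, not a recommendation for this dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = X_toy @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(size=200)

# Narrower, hypothetical grid: 13 values from 1e-3 to 1e3
alphas = np.logspace(-3, 3, 13)
search = GridSearchCV(Ridge(), param_grid={"alpha": alphas}, cv=10)
search.fit(X_toy, y_toy)

best_alpha = search.best_params_["alpha"]  # selected regularization strength
cv_r2 = search.best_score_                 # mean R² over the held-out folds
print(best_alpha, cv_r2)
```

With the notebook's pipeline, the same information would be read from `grid.best_params_` and `grid.best_score_`.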


### Lasso Regression

In [81]:
from sklearn.linear_model import Lasso

# Cross-validated - Lasso Regression
alphas = np.logspace(-5, 50, 100)
params = {'lasso__alpha' : alphas,
         }

model_lasso = make_pipeline(transformer, Lasso())

grid = GridSearchCV(model_lasso, param_grid = params, cv=10)
grid.fit(X_train, y_train)

y_pred = grid.predict(X_test)
print("R² =", grid.score(X_train, y_train))
print("MAE =", mean_absolute_error(y_test, y_pred))
print("RMSE =", np.sqrt(mean_squared_error(y_test, y_pred)))

  model = cd_fast.sparse_enet_coordinate_descent(

R² = 0.5738341057563239
MAE = 4247805.798629896
RMSE = 9552474.605327912



In [82]:
from sklearn.linear_model import LassoCV

# Automatically cross-validated - Lasso Regression
alphas = np.logspace(-5, 50, 100)

model_lassoCV = make_pipeline(transformer, LassoCV(alphas = alphas))
model_lassoCV.fit(X_train, y_train)
y_pred = model_lassoCV.predict(X_test)

print("R² =", model_lassoCV.score(X_train, y_train))
print("MAE =", mean_absolute_error(y_test, y_pred))
print("RMSE =", np.sqrt(mean_squared_error(y_test, y_pred)))

  model = cd_fast.sparse_enet_coordinate_descent(

R² = 0.5738341057563239
MAE = 4247805.798629896
RMSE = 9552474.605327912
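
The `cd_fast.sparse_enet_coordinate_descent` output above most likely comes from scikit-learn convergence warnings: the coordinate-descent solver reached its iteration limit before meeting the requested tolerance. One common response is to give the solver a larger `max_iter`; in a pipeline built with `make_pipeline`, that parameter is addressed as `lasso__max_iter`. A minimal sketch on synthetic data (the data, `alpha=0.1` and the budget of 10,000 iterations are illustrative assumptions, and `StandardScaler` stands in for the notebook's transformer):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = X_toy @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(size=200)

# max_iter defaults to 1000; 10_000 is an arbitrary, larger budget
model = make_pipeline(StandardScaler(), Lasso(alpha=0.1, max_iter=10_000))
model.fit(X_toy, y_toy)

# n_iter_ is the number of coordinate-descent passes actually run
n_iter = model.named_steps["lasso"].n_iter_
print(n_iter)
```

If `n_iter_` equals `max_iter`, the solver stopped at the limit and the fit may still be unconverged.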



### ElasticNet

In [None]:
# Transformer
transformer = make_column_transformer(
                                      (RobustScaler(),  
                                              ['Latitude', 'Longitude', 
                                               'NumberofBuildings', 'NumberofFloors', 'BuildingAge',
                                               'PropertyGFATotal','PropertyGFAParking', 'PropertyGFABuilding(s)',
                                               'elec_share', 'gas_share', 'steam_share'
                                              ]),   
                                      (OneHotEncoder(handle_unknown='ignore'),
                                              ['Neighborhood', 'LargestPropertyUseType'])
                                     )

# Model fitting
model_EN = make_pipeline(transformer, ElasticNet())
model_EN.fit(X_train, y_train)

# Predictions
y_pred = model_EN.predict(X_test)

# Results
print("R² =", model_EN.score(X_train, y_train))
print("RMSE =", np.sqrt(mean_squared_error(y_test, y_pred)))

Without any tuning of the hyperparameters, the ElasticNet model is outperformed by the standard linear regression. We will now try to improve its results with a proper cross-validation.

In [None]:
# Cross-validation - ElasticNet Regression

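# Keep the grid small: every (alpha, l1_ratio) pair is fitted 10 times by the cross-validation
alphas = np.logspace(-5, 50, 100)
# np.arange(0, 1, 100) would return only [0.]; np.linspace builds an evenly spaced grid
l1_ratios = np.linspace(0.1, 1, 10)
params = {'elasticnet__tol' : [0.001],
          #'elasticnet__max_iter' : [10000],
          'elasticnet__alpha' : alphas,
          'elasticnet__l1_ratio' : l1_ratios}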

grid = GridSearchCV(model_EN, param_grid = params, cv=10)
grid.fit(X_train, y_train)

y_pred = grid.predict(X_test)
print("R² =", grid.score(X_train, y_train))
print("RMSE =", np.sqrt(mean_squared_error(y_test, y_pred)))

In [None]:
print(grid.best_params_)
print("R² =", grid.score(X_train, y_train))
print("RMSE =", np.sqrt(mean_squared_error(y_test, y_pred)))
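
As an alternative to a manual grid search, scikit-learn provides `ElasticNetCV`, which cross-validates alpha and l1_ratio together. The sketch below is run on synthetic data; the grids, the 5 folds and the data are illustrative assumptions, not the settings used in this notebook.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = X_toy @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(size=200)

# Hypothetical grids: a few L1/L2 mixes and 20 regularization strengths
l1_ratios = [0.1, 0.5, 0.9, 1.0]
alphas = np.logspace(-3, 1, 20)

model = ElasticNetCV(l1_ratio=l1_ratios, alphas=alphas, cv=5)
model.fit(X_toy, y_toy)

# Hyperparameters selected by cross-validation
print(model.alpha_, model.l1_ratio_)
```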

### Results table

In [None]:
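Results = pd.DataFrame([], columns=['Algorithm', 'R²', 'RMSE'])

# DataFrame.append was removed in pandas 2.0; add one row per fitted model with .loc instead
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
Results.loc[len(Results)] = ['ElasticNet', grid.score(X_test, y_test), rmse]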