# Feature Engineering Notebook

At the end of the previous notebook (05_models) I attempted some feature engineering. However, when I used these features in the model I got very strange results: the original features produced an RMSE of about 5,000, while the engineered features produced RMSEs of around 4-5. So either my feature engineering skills are top notch (not the case) or I made a mistake somewhere. I looked through that notebook multiple times but could not find any obvious mistakes, so I want to recreate the work here to make sure the feature engineering was done correctly.

I essentially added three groups of features.
- Time element (hour of day and week of year)
- Previous weather conditions (weather observations three hours prior to "current" observations)
- Rate of change of each weather condition between the past and current observations (three hour interval)

To gauge how these features affect model performance, I will test each iteration on a random forest model with default parameters, using RMSE as the metric.

## Layout of Notebook
- Random forest performance on original features
- Random forest performance with addition of time element
- Random forest performance with addition of previous weather observations
- Random forest performance with addition of derivative of each weather condition over three hour time period

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Scaling
from sklearn.preprocessing import StandardScaler

# Data Split/Cross Validation
from sklearn.model_selection import train_test_split, cross_val_score, KFold, GridSearchCV

# Model Metrics
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import r2_score, make_scorer

# Ensembles
from sklearn.ensemble import RandomForestRegressor

# Use functions from .py file
%load_ext autoreload
%autoreload 2
import os
import sys
module_path = os.path.abspath(os.path.join(os.pardir, os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)
    
import src.data_gathering as dg


In [2]:
# import energy and weather data
energy = dg.energy_data()
weather = dg.weather_data()

In [3]:
# make mse scorer object
mse_score = make_scorer(mse)
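
As a sanity check on the metric, here is a minimal sketch of how the RMSE is computed in each iteration below: cross-validated MSE per fold, averaged, then square-rooted. The synthetic data and the helper name `cv_rmse` are assumptions made for illustration and are not part of this project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import cross_val_score


def cv_rmse(model, X, y, cv=5):
    """Square root of the mean cross-validated MSE (hypothetical helper)."""
    scorer = make_scorer(mean_squared_error)  # returns plain, positive MSE
    fold_mse = cross_val_score(model, X, y, scoring=scorer, cv=cv)
    return float(np.sqrt(fold_mse.mean()))


# Synthetic data, purely for illustration
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

rmse_demo = cv_rmse(RandomForestRegressor(n_estimators=20, random_state=42), X_demo, y_demo)
print(rmse_demo)
```

Because `make_scorer` assumes that larger scores are better unless told otherwise, a scorer built like this is fine for reporting but should not be handed to `GridSearchCV` for model selection without `greater_is_better=False`.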

### Initial weather dataframe cleaning (will be added to data_gathering.py file after this)

Drop rows from weather dataframe at time 23:59:00 from each day. 

In [4]:
to_drop = weather[(weather.index.hour == 23) & (weather.index.minute == 59)].index

In [5]:
weather.drop(to_drop, axis=0, inplace=True)

Convert all instances of \* in dataframe to np.nan and all instances of '' in wind direction column to np.nan as well

In [6]:
# Replace '*' placeholders with NaN; indexing with .loc on the frame itself avoids chained assignment
for col in ['HourlyDewPointTemperature', 'HourlyDryBulbTemperature', 'HourlyRelativeHumidity', 'HourlyVisibility']:
    weather.loc[weather[col] == '*', col] = np.nan

In [7]:
weather['HourlyWindDirection'] = weather['HourlyWindDirection'].replace('', np.nan)

Convert weather dataframe to float

In [8]:
weather = weather.astype(float)
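
An alternative to replacing each placeholder by hand, sketched on a toy frame (the values are made up; only the column names come from this notebook), is `pd.to_numeric` with `errors="coerce"`, which converts to numbers and turns anything unparseable into NaN in one step.

```python
import pandas as pd

# Toy frame with the same kinds of placeholders ('*' and '') described above
raw = pd.DataFrame(
    {
        "HourlyVisibility": ["10.0", "*", "9.5"],
        "HourlyWindDirection": ["310", "", "320"],
    }
)

# errors="coerce" turns anything that cannot be parsed as a number into NaN
cleaned = raw.apply(pd.to_numeric, errors="coerce")
print(cleaned)
print(cleaned.dtypes)
```

The trade-off is that coercion is silent: any other malformed value would also become NaN, so it is worth comparing the NaN counts before and after.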

## Original Features

I need to import IterativeImputer, which is the imputation method I have been using to fill the missing values.

In [9]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

In [10]:
imp1 = IterativeImputer(random_state=42)

In [11]:
weather1_imp = imp1.fit_transform(weather)

In [12]:
weather_df1 = pd.DataFrame(index=weather.index, columns=weather.columns, data=weather1_imp)

Now I can pair up the energy data and weather data in a single dataframe. First I need to aggregate the weather data to hourly. To ensure the observations line up correctly (solar energy output with weather observations from three hours prior), I am going to add a 'time' column to the weather dataframe, which I can remove once I have confirmed the alignment is correct.

In [13]:
weather_hourly1 = weather_df1.resample('H').mean()

In [14]:
weather_hourly1['time'] = weather_hourly1.index

In [15]:
base_df1 = pd.concat([energy, weather_hourly1.shift(3)], axis=1)
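
The alignment can also be checked programmatically rather than by eye. The sketch below uses small made-up frames (`energy_toy` and `weather_toy` are illustrative names, not project data) and confirms that after `shift(3)` every energy row sits next to a weather row stamped exactly three hours earlier.

```python
import pandas as pd

# Toy hourly data; values are invented for illustration
idx = pd.date_range("2018-01-29 00:00", periods=6, freq=pd.Timedelta(hours=1))
weather_toy = pd.DataFrame({"temp": [30.0, 30.0, 30.0, 29.0, 28.0, 27.0]}, index=idx)
weather_toy["time"] = weather_toy.index  # bookkeeping column, as in the cell above

energy_toy = pd.DataFrame({"nexus_meter": [0.0, 0.0, 0.0, 5.0, 9.0, 12.0]}, index=idx)

# shift(3) moves each weather row three rows later, so the energy at
# hour t is paired with the weather observed at hour t - 3
paired = pd.concat([energy_toy, weather_toy.shift(3)], axis=1)

# Gap between the energy timestamp and the weather timestamp on each row
lag = paired.index.to_series() - paired["time"]
print(lag.dropna().unique())
```

`shift(3)` counts rows, not hours, so this only equals a three-hour lag when the index has no gaps. The hourly resample above produces such a regular index.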

In [16]:
base_df1.head()

DATE,nexus_meter,HourlyAltimeterSetting,HourlyDewPointTemperature,HourlyDryBulbTemperature,HourlyRelativeHumidity,HourlyStationPressure,HourlyVisibility,HourlyWindSpeed,HourlyWindDirection,HourlyPrecipitation,cloud_coverage,time
2018-01-29 00:00:00,,,,,,,,,,,,NaT
2018-01-29 01:00:00,,,,,,,,,,,,NaT
2018-01-29 02:00:00,,,,,,,,,,,,NaT
2018-01-29 03:00:00,,30.3425,28.0,30.0,92.25,29.5125,5.75,14.25,327.5,0.017317,80.0,2018-01-29 00:00:00
2018-01-29 04:00:00,,30.34,28.0,30.0,92.0,29.51,3.0,10.0,340.0,0.0,80.0,2018-01-29 01:00:00


Great, this dataframe is formatted correctly. Now I can drop the time column from the dataframe, along with the first 24 rows, as they contain no energy data.

In [17]:
base_df1.drop('time', axis=1, inplace=True)

In [18]:
base_df1.drop(base_df1[:'2018-01-29'].index, axis=0, inplace=True)

In [19]:
base_df1.head()

DATE,nexus_meter,HourlyAltimeterSetting,HourlyDewPointTemperature,HourlyDryBulbTemperature,HourlyRelativeHumidity,HourlyStationPressure,HourlyVisibility,HourlyWindSpeed,HourlyWindDirection,HourlyPrecipitation,cloud_coverage
2018-01-30 00:00:00,0.0,30.47,15.0,19.0,84.0,29.64,10.0,6.0,310.0,0.0,0.0
2018-01-30 01:00:00,0.0,30.48,13.0,17.0,84.0,29.65,10.0,6.0,320.0,0.0,0.0
2018-01-30 02:00:00,0.0,30.48,13.0,17.0,84.0,29.65,10.0,5.0,300.0,0.0,0.0
2018-01-30 03:00:00,0.0,30.46,14.0,17.0,88.0,29.63,10.0,6.0,310.0,0.0,0.0
2018-01-30 04:00:00,0.0,30.47,11.0,15.0,84.0,29.64,10.0,6.0,320.0,0.0,0.0


I want to focus my analysis on the time frame 5am to 9pm, as these are the hours where energy may be produced (not likely in winter months)

In [20]:
model_df1 = base_df1[(base_df1.index.hour >= 5) & (base_df1.index.hour <= 21)]

Great, now I can separate the features and perform a train test split before modeling on the random forest regressor.

In [21]:
# separate target and features

# target
y1 = model_df1['nexus_meter']

# features
X1 = model_df1.drop('nexus_meter', axis=1)

Since I am just testing the feature performance, I am only going to perform a single train test split.

In [22]:
X_train1, X_test1, y_train1, y_test1 = train_test_split(X1, y1, random_state=42, test_size=.3)

I use a test size of 0.3 so that the training set is roughly the size it would be after two successive splits.

Now I can scale the training data and fit the random forest model.

In [23]:
scaler1 = StandardScaler()

In [24]:
X_train1_scaled = scaler1.fit_transform(X_train1)

In [25]:
# instantiate random forest regressor with default parameters
rf1 = RandomForestRegressor(random_state=42)

Now I will perform a 5 split cross validation on the training data using RMSE as the metric.

In [26]:
np.sqrt(cross_val_score(rf1, X_train1_scaled, y_train1, scoring=mse_score, cv=5).mean())

4659.64242523034

The first feature iteration produced an RMSE of 4,659. Now I can move on to adding the time element to the feature set.
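
In the cells above the scaler is fit on the full training set before cross validation, so each validation fold has already influenced the scaling. One way to avoid that, sketched here on synthetic data (every name in the block is an illustrative assumption), is to put the scaler and the model in a pipeline so the scaler is re-fit inside each fold.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, purely for illustration
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 4))
y_demo = X_demo[:, 0] - 3.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=150)

# The scaler is re-fit on the training part of every fold
pipe = make_pipeline(
    StandardScaler(),
    RandomForestRegressor(n_estimators=20, random_state=42),
)

# scikit-learn reports this metric negated, so flip the sign back
scores = cross_val_score(pipe, X_demo, y_demo, scoring="neg_root_mean_squared_error", cv=5)
fold_rmse = -scores
print(fold_rmse.mean())
```

The mean of per-fold RMSE is not exactly the same number as the square root of the mean MSE used in this notebook, so the two should not be compared directly. For a random forest the effect of scaling is likely to be small in any case, since tree splits depend on the ordering of feature values rather than their scale.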

## Random Forest with time element

I will take the time features from the energy timestamps, because what matters is the hour and week in which the energy is produced, not the time at which the prediction is made (three hours earlier). I will add two columns, one for hour of day and one for week of year.

In [27]:
energy['hour'] = energy.index.hour
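# DatetimeIndex.week was removed in pandas 2.0; isocalendar() gives the ISO week number
energy['week'] = energy.index.isocalendar().week.astype(int).to_numpy()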
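
For reference, a small sketch of what these two columns look like on a made-up index (`energy_toy` and its values are assumptions for illustration). It uses `isocalendar()` for the week number, since the `.week` attribute of a `DatetimeIndex` is not available in recent pandas versions.

```python
import pandas as pd

# Four invented timestamps, eight hours apart, starting Monday 2018-01-29
idx = pd.date_range("2018-01-29 05:00", periods=4, freq=pd.Timedelta(hours=8))
energy_toy = pd.DataFrame({"nexus_meter": [0.0, 120.5, 3.2, 0.0]}, index=idx)

energy_toy["hour"] = energy_toy.index.hour
# isocalendar() returns year/week/day columns; keep only the ISO week number
energy_toy["week"] = energy_toy.index.isocalendar().week.astype(int).to_numpy()
print(energy_toy)
```

Both columns are plain integers here. Once they are concatenated with frames that introduce NaN rows, pandas will show them as floats, which is what the outputs below display.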

Next I will combine this new energy dataframe with the weather_hourly dataframe.

In [28]:
base_df2 = pd.concat([energy, weather_hourly1.shift(3)], axis=1)

In [29]:
base_df2.head()

DATE,nexus_meter,hour,week,HourlyAltimeterSetting,HourlyDewPointTemperature,HourlyDryBulbTemperature,HourlyRelativeHumidity,HourlyStationPressure,HourlyVisibility,HourlyWindSpeed,HourlyWindDirection,HourlyPrecipitation,cloud_coverage,time
2018-01-29 00:00:00,,,,,,,,,,,,,,NaT
2018-01-29 01:00:00,,,,,,,,,,,,,,NaT
2018-01-29 02:00:00,,,,,,,,,,,,,,NaT
2018-01-29 03:00:00,,,,30.3425,28.0,30.0,92.25,29.5125,5.75,14.25,327.5,0.017317,80.0,2018-01-29 00:00:00
2018-01-29 04:00:00,,,,30.34,28.0,30.0,92.0,29.51,3.0,10.0,340.0,0.0,80.0,2018-01-29 01:00:00


Everything is formatted correctly, so I can now drop the 'time' column and the first 24 rows again.

In [30]:
base_df2.drop('time', axis=1, inplace=True)
base_df2.drop(base_df2[:'2018-01-29'].index, axis=0, inplace=True)

Again I will subset this dataframe to only include the rows between 5am and 9pm.

In [31]:
model_df2 = base_df2[(base_df2.index.hour >= 5) & (base_df2.index.hour <= 21)]

Perform a train test split with same random state and test size as previous iteration

In [32]:
# separate target and features

# target
y2 = model_df2['nexus_meter']

# features
X2 = model_df2.drop('nexus_meter', axis=1)

In [33]:
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y2, random_state=42, test_size=.3)

I want to instantiate a new StandardScaler object to ensure nothing carries over from the previous fit. I know this might be unnecessary, but I just want to be sure.

In [34]:
scaler2 = StandardScaler()

In [35]:
X_train2_scaled = scaler2.fit_transform(X_train2)

In [36]:
# instantiate new random forest object with default parameters
rf2 = RandomForestRegressor(random_state=42)

In [37]:
# 5 split cross validation
np.sqrt(cross_val_score(rf2, X_train2_scaled, y_train2, scoring=mse_score, cv=5).mean())

2389.196932888855

Adding the time element alone cut the RMSE roughly in half (4,659 to 2,389). That makes me feel a lot better about the results I was getting in the 05_models notebook. Next I will add the weather observations from three hours before the current observations and see how that affects the model, before including the rate of change of each weather condition over that period.

## Random Forest with Time element and past weather conditions

The time features were extremely beneficial to model performance, so I am going to keep them. The next features I want to add are further weather conditions, taken three hours before the current observations. The dataframe will therefore contain the energy observations, the weather conditions three hours before each energy observation (the original weather features), and an additional set of weather features from three hours before those (six hours before energy production). My hope is that with the earlier observations included, the model can pick up on how the weather has changed up to the current observation, and how it might change up to the point of energy production.

To construct the lagged weather observations, I will simply shift the weather_hourly1 dataframe, exactly as I do when concatenating the weather and energy dataframes.

In [38]:
weather_lagged_df = weather_hourly1.shift(3)

In [39]:
weather_lagged_df.head()

DATE,HourlyAltimeterSetting,HourlyDewPointTemperature,HourlyDryBulbTemperature,HourlyRelativeHumidity,HourlyStationPressure,HourlyVisibility,HourlyWindSpeed,HourlyWindDirection,HourlyPrecipitation,cloud_coverage,time
2018-01-29 00:00:00,,,,,,,,,,,NaT
2018-01-29 01:00:00,,,,,,,,,,,NaT
2018-01-29 02:00:00,,,,,,,,,,,NaT
2018-01-29 03:00:00,30.3425,28.0,30.0,92.25,29.5125,5.75,14.25,327.5,0.017317,80.0,2018-01-29 00:00:00
2018-01-29 04:00:00,30.34,28.0,30.0,92.0,29.51,3.0,10.0,340.0,0.0,80.0,2018-01-29 01:00:00


I want to change the column names so I don't mix up the lagged weather and current weather.

In [40]:
weather_lagged_df.columns = ['lag3_altimeter', 'lag3_dew_point', 'lag3_temp', 'lag3_humidity', 'lag3_pressure', 
                            'lag3_visibility', 'lag3_wind_speed', 'lag3_wind_direction', 'lag3_precipitation', 'lag3_cloud_coverage', 'lag3_time']
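
The shift-and-rename step can be sketched compactly on a toy frame. The frame and its column names are made up for illustration, and `add_prefix` is used here as a shortcut that keeps the original names instead of the hand-written list above.

```python
import pandas as pd

# Toy hourly weather; values are invented
idx = pd.date_range("2018-01-29 00:00", periods=5, freq=pd.Timedelta(hours=1))
weather_toy = pd.DataFrame(
    {
        "temp": [30.0, 30.0, 29.0, 28.0, 27.0],
        "humidity": [92.0, 92.0, 90.0, 88.0, 85.0],
    },
    index=idx,
)

# Lag every column by three rows and mark the copies with a prefix
lagged_toy = weather_toy.shift(3).add_prefix("lag3_")

# Current and lagged observations side by side
both = pd.concat([weather_toy, lagged_toy], axis=1)
print(both)
```

Prefixing avoids having to keep a manual list of names in the same order as the columns, which is an easy place for a mismatch to creep in.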

In [41]:
weather_lagged_df.isna().sum()

lag3_altimeter         13
lag3_dew_point         13
lag3_temp              13
lag3_humidity          13
lag3_pressure          13
lag3_visibility        13
lag3_wind_speed        13
lag3_wind_direction    13
lag3_precipitation     13
lag3_cloud_coverage    13
lag3_time               3
dtype: int64

In [42]:
weather_hourly1.isna().sum()

HourlyAltimeterSetting       10
HourlyDewPointTemperature    10
HourlyDryBulbTemperature     10
HourlyRelativeHumidity       10
HourlyStationPressure        10
HourlyVisibility             10
HourlyWindSpeed              10
HourlyWindDirection          10
HourlyPrecipitation          10
cloud_coverage               10
time                          0
dtype: int64

In [43]:
weather_lagged_df[weather_lagged_df['lag3_altimeter'].isna() == True]

DATE,lag3_altimeter,lag3_dew_point,lag3_temp,lag3_humidity,lag3_pressure,lag3_visibility,lag3_wind_speed,lag3_wind_direction,lag3_precipitation,lag3_cloud_coverage,lag3_time
2018-01-29 00:00:00,,,,,,,,,,,NaT
2018-01-29 01:00:00,,,,,,,,,,,NaT
2018-01-29 02:00:00,,,,,,,,,,,NaT
2018-04-12 03:00:00,,,,,,,,,,,2018-04-12 00:00:00
2018-04-12 04:00:00,,,,,,,,,,,2018-04-12 01:00:00
2018-05-16 03:00:00,,,,,,,,,,,2018-05-16 00:00:00
2018-05-16 04:00:00,,,,,,,,,,,2018-05-16 01:00:00
2019-07-25 01:00:00,,,,,,,,,,,2019-07-24 22:00:00
2019-07-25 02:00:00,,,,,,,,,,,2019-07-24 23:00:00
2019-07-25 03:00:00,,,,,,,,,,,2019-07-25 00:00:00


Both weather_hourly1 and weather_lagged_df still contain missing values. Fortunately, they all fall at times that will not be included in the analysis, so I can ignore them because they won't be part of the modeling data. Now I want to combine these two dataframes before adding the energy data, because I want to investigate the collinearity between the two weather dataframes.

In [44]:
big_weather_df1 = pd.concat([weather_hourly1, weather_lagged_df], axis=1)

In [45]:
big_weather_df1.head()

DATE,HourlyAltimeterSetting,HourlyDewPointTemperature,HourlyDryBulbTemperature,HourlyRelativeHumidity,HourlyStationPressure,HourlyVisibility,HourlyWindSpeed,HourlyWindDirection,HourlyPrecipitation,cloud_coverage,...,lag3_dew_point,lag3_temp,lag3_humidity,lag3_pressure,lag3_visibility,lag3_wind_speed,lag3_wind_direction,lag3_precipitation,lag3_cloud_coverage,lag3_time
2018-01-29 00:00:00,30.3425,28.0,30.0,92.25,29.5125,5.75,14.25,327.5,0.017317,80.0,...,,,,,,,,,,NaT
2018-01-29 01:00:00,30.34,28.0,30.0,92.0,29.51,3.0,10.0,340.0,0.0,80.0,...,,,,,,,,,,NaT
2018-01-29 02:00:00,30.33,28.0,30.0,92.0,29.5,5.0,13.0,340.0,0.0,80.0,...,,,,,,,,,,NaT
2018-01-29 03:00:00,30.33,28.0,30.0,92.0,29.5,6.0,13.0,320.0,0.0,80.0,...,28.0,30.0,92.25,29.5125,5.75,14.25,327.5,0.017317,80.0,2018-01-29 00:00:00
2018-01-29 04:00:00,30.33,28.0,30.0,92.5,29.5,6.0,10.5,325.0,0.009066,80.0,...,28.0,30.0,92.0,29.51,3.0,10.0,340.0,0.0,80.0,2018-01-29 01:00:00


In [46]:
big_weather_df1.corr()

Unnamed: 0,HourlyAltimeterSetting,HourlyDewPointTemperature,HourlyDryBulbTemperature,HourlyRelativeHumidity,HourlyStationPressure,HourlyVisibility,HourlyWindSpeed,HourlyWindDirection,HourlyPrecipitation,cloud_coverage,lag3_altimeter,lag3_dew_point,lag3_temp,lag3_humidity,lag3_pressure,lag3_visibility,lag3_wind_speed,lag3_wind_direction,lag3_precipitation,lag3_cloud_coverage
HourlyAltimeterSetting,1.0,-0.434606,-0.36973,-0.146624,0.999904,0.110492,-0.22472,-0.062811,-0.13181,-0.210783,0.975846,-0.440683,-0.387236,-0.11679,0.975718,0.113905,-0.203265,-0.004632,-0.143586,-0.205365
HourlyDewPointTemperature,-0.434606,1.0,0.925492,0.083132,-0.434653,0.067593,-0.230326,-0.123985,0.10176,-0.072403,-0.418642,0.986729,0.918611,0.069932,-0.418689,0.063576,-0.247077,-0.150267,0.099764,-0.0829
HourlyDryBulbTemperature,-0.36973,0.925492,1.0,-0.293323,-0.369868,0.217558,-0.153317,-0.100561,0.034522,-0.184664,-0.3383,0.915897,0.965362,-0.227705,-0.338369,0.191524,-0.180979,-0.122538,0.038207,-0.194667
HourlyRelativeHumidity,-0.146624,0.083132,-0.293323,1.0,-0.146298,-0.477416,-0.183343,-0.053699,0.18921,0.327048,-0.183734,0.078985,-0.218888,0.788644,-0.183613,-0.398863,-0.155098,-0.06678,0.170354,0.325048
HourlyStationPressure,0.999904,-0.434653,-0.369868,-0.146298,1.0,0.110117,-0.224702,-0.063134,-0.131763,-0.210508,0.975743,-0.440727,-0.387382,-0.116401,0.975688,0.113487,-0.203145,-0.004872,-0.143454,-0.205152
HourlyVisibility,0.110492,0.067593,0.217558,-0.477416,0.110117,1.0,0.009911,0.067353,-0.381465,-0.385921,0.101823,0.071134,0.200956,-0.409129,0.101628,0.651922,0.034287,0.105882,-0.165008,-0.333767
HourlyWindSpeed,-0.22472,-0.230326,-0.153317,-0.183343,-0.224702,0.009911,1.0,0.233074,0.056054,0.243868,-0.22699,-0.215068,-0.180905,-0.06685,-0.226871,-0.04512,0.7393,0.114943,0.029438,0.236718
HourlyWindDirection,-0.062811,-0.123985,-0.100561,-0.053699,-0.063134,0.067353,0.233074,1.0,-0.017153,0.048051,-0.120503,-0.100985,-0.097983,0.002035,-0.120805,0.021333,0.203992,0.602288,-0.017722,0.077186
HourlyPrecipitation,-0.13181,0.10176,0.034522,0.18921,-0.131763,-0.381465,0.056054,-0.017153,1.0,0.222786,-0.124373,0.101167,0.05892,0.112931,-0.124382,-0.166389,0.033438,-0.052643,0.157707,0.15353
cloud_coverage,-0.210783,-0.072403,-0.184664,0.327048,-0.210508,-0.385921,0.243868,0.048051,0.222786,1.0,-0.211195,-0.077303,-0.186336,0.320372,-0.210831,-0.359529,0.209171,-0.00394,0.152706,0.69811


## NOTE
Different weather conditions are fairly correlated with one another, but the same condition measured three hours apart is often extremely correlated with itself (about 0.98 for altimeter setting, 0.99 for dew point and 0.97 for dry bulb temperature in the table above). For that reason, when I continue the model iterations I will need to perform PCA to account for this. I will also likely drop a few features that are either very similar to another feature or don't influence energy production much. But for now, I want to check the model on all of these features to make sure the concept holds up.
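
A more direct way to read off just the current-versus-lagged correlations is `corrwith`, which pairs columns by name. The sketch below runs on synthetic data: a slowly drifting series and a noisy one, named `pressure` and `precip` only to mirror the pattern in the table, not taken from the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
idx = pd.date_range("2018-01-29", periods=500, freq=pd.Timedelta(hours=1))

# Synthetic stand-ins: a slowly drifting series and pure noise
toy = pd.DataFrame(
    {
        "pressure": np.cumsum(rng.normal(scale=0.01, size=500)) + 30.0,
        "precip": rng.normal(size=500),
    },
    index=idx,
)
lagged = toy.shift(3)

# Correlation of each column with its own three-hour lag
self_corr = toy.corrwith(lagged)
print(self_corr)
```

Slowly changing quantities carry almost the same information in their lagged copy, while noisy ones do not, which is the pattern the full correlation table shows for pressure versus precipitation.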

Next I will combine this big weather dataframe with the energy dataframe and follow the same process as in previous iterations: separate features and target, train test split, scale, and model.

In [47]:
# combine energy and weather data
base_df3 = pd.concat([energy, big_weather_df1.shift(3)], axis=1)

In [48]:
base_df3.head()

DATE,nexus_meter,hour,week,HourlyAltimeterSetting,HourlyDewPointTemperature,HourlyDryBulbTemperature,HourlyRelativeHumidity,HourlyStationPressure,HourlyVisibility,HourlyWindSpeed,...,lag3_dew_point,lag3_temp,lag3_humidity,lag3_pressure,lag3_visibility,lag3_wind_speed,lag3_wind_direction,lag3_precipitation,lag3_cloud_coverage,lag3_time
2018-01-29 00:00:00,,,,,,,,,,,...,,,,,,,,,,NaT
2018-01-29 01:00:00,,,,,,,,,,,...,,,,,,,,,,NaT
2018-01-29 02:00:00,,,,,,,,,,,...,,,,,,,,,,NaT
2018-01-29 03:00:00,,,,30.3425,28.0,30.0,92.25,29.5125,5.75,14.25,...,,,,,,,,,,NaT
2018-01-29 04:00:00,,,,30.34,28.0,30.0,92.0,29.51,3.0,10.0,...,,,,,,,,,,NaT


In [49]:
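# drop the first 24 rows (2018-01-29), which contain no energy data
# NOTE: the recorded outputs below were produced with the cutoff mistyped as
# '2019-01-29', which also dropped the first year of data
base_df3.drop(base_df3[:'2018-01-29'].index, axis=0, inplace=True)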

In [50]:
base_df3.head(10)

DATE,nexus_meter,hour,week,HourlyAltimeterSetting,HourlyDewPointTemperature,HourlyDryBulbTemperature,HourlyRelativeHumidity,HourlyStationPressure,HourlyVisibility,HourlyWindSpeed,...,lag3_dew_point,lag3_temp,lag3_humidity,lag3_pressure,lag3_visibility,lag3_wind_speed,lag3_wind_direction,lag3_precipitation,lag3_cloud_coverage,lag3_time
2019-01-30 00:00:00,-48.9,0.0,5.0,30.11,-9.0,1.0,62.0,29.29,10.0,20.0,...,-7.0,4.0,60.0,29.28,10.0,17.0,300.0,0.0,0.0,2019-01-29 18:00:00
2019-01-30 01:00:00,-50.8,1.0,5.0,30.1,-9.0,0.0,65.0,29.28,10.0,14.0,...,-9.0,2.0,59.0,29.28,10.0,20.0,310.0,0.0,0.0,2019-01-29 19:00:00
2019-01-30 02:00:00,-50.2,2.0,5.0,30.11,-10.0,0.0,62.0,29.29,10.0,24.0,...,-9.0,2.0,59.0,29.29,10.0,22.0,300.0,0.0,0.0,2019-01-29 20:00:00
2019-01-30 03:00:00,-75.0,3.0,5.0,30.11,-7.5,-1.0,73.5,29.29,5.25,21.0,...,-9.0,1.0,62.0,29.29,10.0,20.0,300.0,0.0,0.0,2019-01-29 21:00:00
2019-01-30 04:00:00,-110.2,4.0,5.0,30.1075,-11.0,-2.25,65.75,29.2875,7.5,25.75,...,-9.0,0.0,65.0,29.28,10.0,14.0,300.0,0.0,0.0,2019-01-29 22:00:00
2019-01-30 05:00:00,-97.8,5.0,5.0,30.13,-13.5,-4.5,65.0,29.31,6.5,25.5,...,-10.0,0.0,62.0,29.29,10.0,24.0,270.0,0.0,80.0,2019-01-29 23:00:00
2019-01-30 06:00:00,-98.6,6.0,5.0,30.18,-19.0,-8.5,59.5,29.36,10.0,26.5,...,-7.5,-1.0,73.5,29.29,5.25,21.0,290.0,-0.001069,80.0,2019-01-30 00:00:00
2019-01-30 07:00:00,1013.2,7.0,5.0,30.22,-23.0,-11.0,54.0,29.39,10.0,24.0,...,-11.0,-2.25,65.75,29.2875,7.5,25.75,285.0,0.0,80.0,2019-01-30 01:00:00
2019-01-30 08:00:00,1002.2,8.0,5.0,30.246667,-24.333333,-12.666667,55.0,29.416667,10.0,29.333333,...,-13.5,-4.5,65.0,29.31,6.5,25.5,290.0,-0.006216,80.0,2019-01-30 02:00:00
2019-01-30 09:00:00,459.0,9.0,5.0,30.276667,-26.0,-14.333333,54.333333,29.453333,9.666667,23.333333,...,-19.0,-8.5,59.5,29.36,10.0,26.5,295.0,0.0,80.0,2019-01-30 03:00:00


The dataframe is formatted correctly, so now I can drop the time columns and perform a train test split.

In [51]:
base_df3.drop(['time', 'lag3_time'], axis=1, inplace=True)

In [52]:
model_df3 = base_df3[(base_df3.index.hour >= 5) & (base_df3.index.hour <= 21)]

In [53]:
model_df3[model_df3['lag3_altimeter'].isna() == True]

DATE,nexus_meter,hour,week,HourlyAltimeterSetting,HourlyDewPointTemperature,HourlyDryBulbTemperature,HourlyRelativeHumidity,HourlyStationPressure,HourlyVisibility,HourlyWindSpeed,...,lag3_altimeter,lag3_dew_point,lag3_temp,lag3_humidity,lag3_pressure,lag3_visibility,lag3_wind_speed,lag3_wind_direction,lag3_precipitation,lag3_cloud_coverage
2019-07-25 05:00:00,-30.52,5.0,30.0,30.17,52.0,55.0,90.0,29.35,10.0,5.0,...,,,,,,,,,,
2019-07-25 06:00:00,640.9,6.0,30.0,30.17,52.0,58.0,81.0,29.35,10.0,5.0,...,,,,,,,,,,
2019-07-25 07:00:00,3608.0,7.0,30.0,30.19,51.0,57.0,81.0,29.37,10.0,3.0,...,,,,,,,,,,
2020-02-20 05:00:00,-61.2,5.0,8.0,30.58,13.0,21.0,71.0,29.75,10.0,13.0,...,,,,,,,,,,
2020-02-20 06:00:00,-0.83,6.0,8.0,30.59,14.0,21.0,74.0,29.76,10.0,9.0,...,,,,,,,,,,


model_df3 is missing the lagged weather values for a few rows. I will use IterativeImputer again to fill these in. It is only 5 rows out of 15,000, so I think this method should suffice.

In [54]:
imp3 = IterativeImputer(random_state=42)

In [55]:
imputed3 = imp3.fit_transform(model_df3.drop('nexus_meter', axis=1))

In [56]:
model3_df = pd.DataFrame(index=model_df3.drop('nexus_meter', axis=1).index, columns=model_df3.drop('nexus_meter', axis=1).columns, data=imputed3)

In [57]:
model3_df['nexus_meter'] = model_df3['nexus_meter']
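
The imputation step above can be sketched on a toy frame to show what it does with a lagged column that is strongly related to its current counterpart. The data and the column names `current` and `lagged` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the next import)
from sklearn.impute import IterativeImputer

# Two closely related synthetic columns
rng = np.random.default_rng(7)
a = rng.normal(size=100)
toy = pd.DataFrame({"current": a, "lagged": a + rng.normal(scale=0.05, size=100)})
toy.loc[[3, 10], "lagged"] = np.nan  # knock out two lagged values

# Impute on the features only, then rebuild the frame with its labels
features_imputed = pd.DataFrame(
    IterativeImputer(random_state=42).fit_transform(toy),
    index=toy.index,
    columns=toy.columns,
)
print(features_imputed.loc[[3, 10]])
```

As in the cells above, the target is left out of the imputer so that it cannot leak into the features. Strictly speaking, fitting the imputer before the train test split lets test rows influence the filled values, although with only a handful of missing rows the effect should be negligible.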

In [58]:
# separate target and features

# target
y3 = model3_df['nexus_meter']

# features
X3 = model3_df.drop('nexus_meter', axis=1)

In [59]:
# train test split
X_train3, X_test3, y_train3, y_test3 = train_test_split(X3, y3, random_state=42, test_size=.3)

Next I will instantiate a new StandardScaler object to scale the training data

In [60]:
scaler3 = StandardScaler()

In [61]:
# scale training data
X_train3_scaled = scaler3.fit_transform(X_train3)

In [62]:
# instantiate new Random Forest object with default parameters
rf3 = RandomForestRegressor(random_state=42)

In [63]:
# 5 split cross val score on training data
np.sqrt(cross_val_score(rf3, X_train3_scaled, y_train3, scoring=mse_score, cv=5).mean())

2532.131485637282

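The addition of those features actually brought the RMSE up (2,532 versus 2,389). Note that the head(10) output above starts on 2019-01-30, so this iteration was scored on fewer rows than the previous one and the two scores are not strictly comparable. I am hoping that by adding the derivative, the model will be able to treat these features as meaningful rather than as noise.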

## Random Forest with all features

I have weather_hourly1 and weather_lagged_df; now I need to take the derivative of each value at each hour (index). This gives the slope (rate of change) of each weather condition over the preceding three hours. My hope is that with the derivative included, the model can see how the weather has changed over the past three hours and so better predict how it may change over the next three. To calculate the slope I will use the following formula: (current weather - past weather)/3.
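
As a worked example of that formula, the sketch below applies it to a made-up temperature series in one vectorized step (`weather_toy` and the `_change` suffix are illustrative choices, not the names used in this notebook).

```python
import pandas as pd

# Toy hourly temperatures; values are invented
idx = pd.date_range("2018-01-29 00:00", periods=6, freq=pd.Timedelta(hours=1))
weather_toy = pd.DataFrame({"temp": [30.0, 30.0, 30.0, 27.0, 24.0, 33.0]}, index=idx)

window = 3  # hours between the two observations

# (current - value three rows earlier) / 3, for every column at once
change_toy = (weather_toy - weather_toy.shift(window)) / window
change_toy = change_toy.add_suffix("_change")
print(change_toy)
```

The first three rows are NaN because they have no observation three hours earlier. The same result can be obtained with `weather_toy.diff(window) / window`.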

In [64]:
weather_hourly1.head()

DATE,HourlyAltimeterSetting,HourlyDewPointTemperature,HourlyDryBulbTemperature,HourlyRelativeHumidity,HourlyStationPressure,HourlyVisibility,HourlyWindSpeed,HourlyWindDirection,HourlyPrecipitation,cloud_coverage,time
2018-01-29 00:00:00,30.3425,28.0,30.0,92.25,29.5125,5.75,14.25,327.5,0.017317,80.0,2018-01-29 00:00:00
2018-01-29 01:00:00,30.34,28.0,30.0,92.0,29.51,3.0,10.0,340.0,0.0,80.0,2018-01-29 01:00:00
2018-01-29 02:00:00,30.33,28.0,30.0,92.0,29.5,5.0,13.0,340.0,0.0,80.0,2018-01-29 02:00:00
2018-01-29 03:00:00,30.33,28.0,30.0,92.0,29.5,6.0,13.0,320.0,0.0,80.0,2018-01-29 03:00:00
2018-01-29 04:00:00,30.33,28.0,30.0,92.5,29.5,6.0,10.5,325.0,0.009066,80.0,2018-01-29 04:00:00


In [65]:
weather_lagged_df.head()

DATE,lag3_altimeter,lag3_dew_point,lag3_temp,lag3_humidity,lag3_pressure,lag3_visibility,lag3_wind_speed,lag3_wind_direction,lag3_precipitation,lag3_cloud_coverage,lag3_time
2018-01-29 00:00:00,,,,,,,,,,,NaT
2018-01-29 01:00:00,,,,,,,,,,,NaT
2018-01-29 02:00:00,,,,,,,,,,,NaT
2018-01-29 03:00:00,30.3425,28.0,30.0,92.25,29.5125,5.75,14.25,327.5,0.017317,80.0,2018-01-29 00:00:00
2018-01-29 04:00:00,30.34,28.0,30.0,92.0,29.51,3.0,10.0,340.0,0.0,80.0,2018-01-29 01:00:00


I am going to drop the first three rows of each dataframe since the lagged df only has NaN values for those rows. Also these rows are dropped when combined with the energy dataframe

In [66]:
weather_hourly1.drop(weather_hourly1.iloc[:3].index, axis=0, inplace=True)
weather_lagged_df.drop(weather_lagged_df.iloc[:3].index, axis=0, inplace=True)

In [72]:
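# one name per weather column, in the same order as weather_hourly1 (10 columns)
weather_change_df_cols = ['altimeter_change', 'dew_point_change', 'temp_change', 'humidity_change', 'pressure_change', 'visibility_change', 
                            'wind_speed_change', 'wind_direction_change', 'precip_change', 'cloud_coverage_change']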

In [73]:
# errors='ignore' keeps this cell safe to re-run once the columns are already gone
weather_hourly1.drop('time', axis=1, inplace=True, errors='ignore')
weather_lagged_df.drop('lag3_time', axis=1, inplace=True, errors='ignore')

In [85]:
weather_change_df = pd.DataFrame(index=weather_hourly1.index, columns=weather_hourly1.columns)

In [87]:
weather_lagged_df.columns = weather_hourly1.columns

In [89]:
# slope over the three-hour window, row by row: (current - lagged) / 3
for idx in weather_change_df.index:
    weather_change_df.loc[idx] = (weather_hourly1.loc[idx] - weather_lagged_df.loc[idx])/3
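
One caveat with applying this formula to every column: wind direction is a compass bearing, so a plain subtraction treats a shift from 350 to 10 degrees as -340 rather than +20. A possible adjustment, which this notebook does not apply, is to wrap the difference into the range [-180, 180). The helper name `angle_change` below is hypothetical.

```python
import numpy as np


def angle_change(current_deg, past_deg):
    """Smallest signed difference between two compass bearings, in degrees."""
    diff = np.asarray(current_deg, dtype=float) - np.asarray(past_deg, dtype=float)
    # Shift by 180, wrap into [0, 360), then shift back to get [-180, 180)
    return (diff + 180.0) % 360.0 - 180.0


# Crossing north clockwise and counter-clockwise, plus an ordinary small change
print(angle_change(10.0, 350.0))
print(angle_change(350.0, 10.0))
print(angle_change(320.0, 327.5))
```

Dividing the wrapped difference by three would then give a per-hour change comparable to the other slope columns.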

In [90]:
weather_change_df.head(25)

DATE,HourlyAltimeterSetting,HourlyDewPointTemperature,HourlyDryBulbTemperature,HourlyRelativeHumidity,HourlyStationPressure,HourlyVisibility,HourlyWindSpeed,HourlyWindDirection,HourlyPrecipitation,cloud_coverage
2018-01-29 03:00:00,-0.00416667,0.0,0.0,-0.0833333,-0.00416667,0.0833333,-0.416667,-2.5,-0.00577225,0.0
2018-01-29 04:00:00,-0.00333333,0.0,0.0,0.166667,-0.00333333,1.0,0.166667,-5.0,0.00302193,0.0
2018-01-29 05:00:00,0.00333333,-0.333333,0.0,-1.33333,0.00333333,1.66667,0.333333,0.0,-0.000375227,0.0
2018-01-29 06:00:00,0.01,-0.666667,-0.333333,-1.0,0.01,-0.333333,1.0,10.0,0.0,0.0
2018-01-29 07:00:00,0.0133333,-1.0,-0.583333,-1.91667,0.0133333,1.0,1.33333,9.16667,-0.00308974,0.0
2018-01-29 08:00:00,0.0111111,-1.0,-0.666667,-1.0,0.0111111,-0.111111,1.0,3.33333,0.000122437,0.0
2018-01-29 09:00:00,0.0133333,-1.0,0.0,-3.66667,0.0133333,1.66667,-0.333333,0.0,0.0,0.0
2018-01-29 10:00:00,0.0133333,-0.666667,0.25,-2.91667,0.0133333,-0.333333,0.833333,-7.5,6.78004e-05,0.0
2018-01-29 11:00:00,0.00888889,-0.333333,0.333333,-2.33333,0.00888889,-1.22222,0.333333,-6.66667,0.000252789,0.0
2018-01-29 12:00:00,-0.00333333,-0.333333,0.0,-1.0,-0.00333333,0.0,0.333333,-6.66667,0.0,0.0


The weather change dataframe is now constructed. Next I will combine all three weather dataframes before adding them to the energy data, just to make sure everything is formatted correctly. First I want to change the column names of the different dataframes so I can tell them apart.

In [92]:
weather_change_df = weather_change_df.astype(float)

In [94]:
weather_change_df.columns = ['altimeter_change', 'dew_point_change', 'temp_change', 'humidity_change', 'pressure_change', 'visibility_change', 
                            'wind_speed_change', 'wind_direction_change', 'precip_change', 'cloud_coverage_change']

In [95]:
weather_lagged_df.columns = ['lag3_altimeter', 'lag3_dew_point', 'lag3_temp', 'lag3_humidity', 'lag3_pressure', 
                            'lag3_visibility', 'lag3_wind_speed', 'lag3_wind_direction', 'lag3_precipitation', 'lag3_cloud_coverage']

In [96]:
big_weather_df2 = pd.concat([weather_hourly1, weather_lagged_df, weather_change_df], axis=1)

In [97]:
big_weather_df2.head()

DATE,HourlyAltimeterSetting,HourlyDewPointTemperature,HourlyDryBulbTemperature,HourlyRelativeHumidity,HourlyStationPressure,HourlyVisibility,HourlyWindSpeed,HourlyWindDirection,HourlyPrecipitation,cloud_coverage,...,altimeter_change,dew_point_change,temp_change,humidity_change,pressure_change,visibility_change,wind_speed_change,wind_direction_change,precip_change,cloud_coverage_change
2018-01-29 03:00:00,30.33,28.0,30.0,92.0,29.5,6.0,13.0,320.0,0.0,80.0,...,-0.004167,0.0,0.0,-0.083333,-0.004167,0.083333,-0.416667,-2.5,-0.005772,0.0
2018-01-29 04:00:00,30.33,28.0,30.0,92.5,29.5,6.0,10.5,325.0,0.009066,80.0,...,-0.003333,0.0,0.0,0.166667,-0.003333,1.0,0.166667,-5.0,0.003022,0.0
2018-01-29 05:00:00,30.34,27.0,30.0,88.0,29.51,10.0,14.0,340.0,-0.001126,80.0,...,0.003333,-0.333333,0.0,-1.333333,0.003333,1.666667,0.333333,0.0,-0.000375,0.0
2018-01-29 06:00:00,30.36,26.0,29.0,89.0,29.53,5.0,16.0,350.0,0.0,80.0,...,0.01,-0.666667,-0.333333,-1.0,0.01,-0.333333,1.0,10.0,0.0,0.0
2018-01-29 07:00:00,30.37,25.0,28.25,86.75,29.54,9.0,14.5,352.5,-0.000203,80.0,...,0.013333,-1.0,-0.583333,-1.916667,0.013333,1.0,1.333333,9.166667,-0.00309,0.0


Everything looks properly formatted, so next I can combine this dataframe with the energy data. To check the alignment in the base_df, I will add a time column to the weather dataframe so I can confirm that each weather observation is three hours behind the energy observation it is paired with.

In [99]:
big_weather_df2['time'] = big_weather_df2.index

In [111]:
base_df4 = pd.concat([energy, big_weather_df2.shift(3)], axis=1)

In [108]:
base_df4.head()

DATE,nexus_meter,hour,week,HourlyAltimeterSetting,HourlyDewPointTemperature,HourlyDryBulbTemperature,HourlyRelativeHumidity,HourlyStationPressure,HourlyVisibility,HourlyWindSpeed,...,dew_point_change,temp_change,humidity_change,pressure_change,visibility_change,wind_speed_change,wind_direction_change,precip_change,cloud_coverage_change,time
2018-01-29 03:00:00,,,,,,,,,,,...,,,,,,,,,,NaT
2018-01-29 04:00:00,,,,,,,,,,,...,,,,,,,,,,NaT
2018-01-29 05:00:00,,,,,,,,,,,...,,,,,,,,,,NaT
2018-01-29 06:00:00,,,,30.33,28.0,30.0,92.0,29.5,6.0,13.0,...,0.0,0.0,-0.083333,-0.004167,0.083333,-0.416667,-2.5,-0.005772,0.0,2018-01-29 03:00:00
2018-01-29 07:00:00,,,,30.33,28.0,30.0,92.5,29.5,6.0,10.5,...,0.0,0.0,0.166667,-0.003333,1.0,0.166667,-5.0,0.003022,0.0,2018-01-29 04:00:00
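
The alignment check above can also be done programmatically instead of by eye. This is a minimal sketch on toy data, assuming an hourly index; the names `weather`, `energy`, `temp` and `meter` are made up for the example and are not the notebook's dataframes.

```python
import pandas as pd

# Toy hourly data; column names are illustrative, not the notebook's.
idx = pd.date_range("2018-01-29 03:00", periods=6, freq=pd.Timedelta(hours=1))
weather = pd.DataFrame({"temp": [30.0, 30.0, 30.0, 29.0, 28.25, 27.0]}, index=idx)
weather["time"] = weather.index  # remember when each observation was taken
energy = pd.DataFrame({"meter": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}, index=idx)

# Shift the weather rows down by three positions so each energy row
# is paired with the weather observed three hours earlier.
combined = pd.concat([energy, weather.shift(3)], axis=1)

# Difference between the energy timestamp and the weather timestamp.
lag = combined.index.to_series() - combined["time"]

# The first three rows have no weather (NaT); every other row should lag by 3 hours.
print(lag.dropna().eq(pd.Timedelta(hours=3)).all())
```

Note that `shift(3)` moves rows by position, so this only equals a three hour lag when the index has no missing hours.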


All the weather data lines up with the future energy data. Now I can drop the time column and the leading rows (everything through 2019-01-29), because they contain no energy data.

In [112]:
base_df4.drop('time', axis=1, inplace=True)
base_df4.drop(base_df4[:'2019-01-29'].index, axis=0, inplace=True)

In [113]:
base_df4.head()

DATE,nexus_meter,hour,week,HourlyAltimeterSetting,HourlyDewPointTemperature,HourlyDryBulbTemperature,HourlyRelativeHumidity,HourlyStationPressure,HourlyVisibility,HourlyWindSpeed,...,altimeter_change,dew_point_change,temp_change,humidity_change,pressure_change,visibility_change,wind_speed_change,wind_direction_change,precip_change,cloud_coverage_change
2019-01-30 00:00:00,-48.9,0.0,5.0,30.11,-9.0,1.0,62.0,29.29,10.0,20.0,...,0.003333,-0.666667,-1.0,0.666667,0.003333,0.0,1.0,0.0,0.0,0.0
2019-01-30 01:00:00,-50.8,1.0,5.0,30.1,-9.0,0.0,65.0,29.28,10.0,14.0,...,0.0,0.0,-0.666667,2.0,0.0,0.0,-2.0,-3.333333,0.0,0.0
2019-01-30 02:00:00,-50.2,2.0,5.0,30.11,-10.0,0.0,62.0,29.29,10.0,24.0,...,0.0,-0.333333,-0.666667,1.0,0.0,0.0,0.666667,-10.0,0.0,26.666667
2019-01-30 03:00:00,-75.0,3.0,5.0,30.11,-7.5,-1.0,73.5,29.29,5.25,21.0,...,0.0,0.5,-0.666667,3.833333,0.0,-1.583333,0.333333,-3.333333,-0.000356,26.666667
2019-01-30 04:00:00,-110.2,4.0,5.0,30.1075,-11.0,-2.25,65.75,29.2875,7.5,25.75,...,0.0025,-0.666667,-0.75,0.25,0.0025,-0.833333,3.916667,-5.0,0.0,26.666667


Next I need to filter this dataframe to only include the timestamps from 5am through 9pm (hours 5 to 21 inclusive). I will likely change this in the future, but this is the time frame I have been using for past iterations, so it is the one I need to use now.

## NOTE
Do more investigation into which times are likely to produce energy, then filter the dataframe that way.
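
One possible way to do that investigation is to look at the average meter reading per hour of day and keep only the hours where the average is positive. This is just a sketch on synthetic data: the toy values, the `> 0` threshold and the names `toy`, `hourly_mean` and `productive_hours` are all assumptions, not results from the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic two-day hourly series: output only between 8am and 4pm (assumed).
idx = pd.date_range("2019-01-30", periods=48, freq=pd.Timedelta(hours=1))
hours = idx.hour
meter = np.where((hours >= 8) & (hours <= 16), 100.0, 0.0)
toy = pd.DataFrame({"nexus_meter": meter}, index=idx)

# Average reading for each hour of the day.
hourly_mean = toy.groupby(toy.index.hour)["nexus_meter"].mean()

# Hours whose average is above the (assumed) threshold of zero.
productive_hours = hourly_mean[hourly_mean > 0].index

# Keep only rows that fall in those hours.
filtered = toy[toy.index.hour.isin(productive_hours)]
print(list(productive_hours))
```

On the real data the threshold would need some thought, since the meter readings shown above can be negative.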

In [114]:
model_df4 = base_df4[(base_df4.index.hour >= 5) & (base_df4.index.hour <= 21)]

In [115]:
model_df4.isna().sum()

nexus_meter                  0
hour                         0
week                         0
HourlyAltimeterSetting       0
HourlyDewPointTemperature    0
HourlyDryBulbTemperature     0
HourlyRelativeHumidity       0
HourlyStationPressure        0
HourlyVisibility             0
HourlyWindSpeed              0
HourlyWindDirection          0
HourlyPrecipitation          0
cloud_coverage               0
lag3_altimeter               5
lag3_dew_point               5
lag3_temp                    5
lag3_humidity                5
lag3_pressure                5
lag3_visibility              5
lag3_wind_speed              5
lag3_wind_direction          5
lag3_precipitation           5
lag3_cloud_coverage          5
altimeter_change             5
dew_point_change             5
temp_change                  5
humidity_change              5
pressure_change              5
visibility_change            5
wind_speed_change            5
wind_direction_change        5
precip_change                5
cloud_co

I am going to impute these missing values with IterativeImputer, just as I did before. In future iterations I want to explore either dropping these rows or imputing them differently, but for now I want to stay consistent with past iterations.

In [116]:
imp4 = IterativeImputer(random_state=42)

In [117]:
model4_imputed = imp4.fit_transform(model_df4.drop('nexus_meter', axis=1))

In [118]:
features4 = model_df4.drop('nexus_meter', axis=1)
model4_df = pd.DataFrame(data=model4_imputed, index=features4.index, columns=features4.columns)

In [119]:
model4_df['nexus_meter'] = model_df4['nexus_meter']
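
The imputation steps above can be sketched end to end on a toy frame to confirm that the index and column names survive the round trip through the imputer, which returns a plain NumPy array. The toy values and the names `toy`, `features` and `imputed` are made up for illustration; the `enable_iterative_imputer` import is required by scikit-learn because the estimator is still marked experimental.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy frame with missing lagged values in the first three rows (illustrative).
idx = pd.date_range("2019-01-30 05:00", periods=8, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame(
    {
        "nexus_meter": np.arange(8, dtype=float),
        "temp": [1.0, 0.0, 0.0, -1.0, -2.25, -3.0, -3.5, -4.0],
        "lag3_temp": [np.nan, np.nan, np.nan, 1.0, 0.0, 0.0, -1.0, -2.25],
    },
    index=idx,
)

# Impute the features only, so the target is never used to fill gaps.
features = toy.drop("nexus_meter", axis=1)
imputer = IterativeImputer(random_state=42)
imputed = pd.DataFrame(
    imputer.fit_transform(features),
    index=features.index,
    columns=features.columns,
)

# Re-attach the untouched target column.
imputed["nexus_meter"] = toy["nexus_meter"]
print(imputed.isna().sum().sum())
```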

Now I can perform a train test split and run the random forest model on this dataset.

In [120]:
# separate target and features

# target
y4 = model4_df['nexus_meter']

# features
X4 = model4_df.drop('nexus_meter', axis=1)

In [121]:
# train test split
X_train4, X_test4, y_train4, y_test4 = train_test_split(X4, y4, random_state=42, test_size=.3)

In [122]:
# instantiate new StandardScaler object
scaler4 = StandardScaler()

In [123]:
# scale training data
X_train4_scaled = scaler4.fit_transform(X_train4)

In [124]:
# instantiate random forest regressor with default parameters
rf4 = RandomForestRegressor(random_state=42)

In [125]:
# cross val score over 5 splits
np.sqrt(cross_val_score(rf4, X_train4_scaled, y_train4, scoring=mse_score, cv=5).mean())

2577.9910534384926
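
For reference, here is a sketch of how an RMSE like the one above can be computed from cross-validated MSE. It assumes `mse_score` was defined earlier as `make_scorer(mean_squared_error)`, which I have not confirmed here, and it uses synthetic data with a small forest to keep it fast. It also shows that the square root of the mean fold MSE is not quite the same number as the mean of the per-fold RMSEs.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import cross_val_score

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 10.0 * X[:, 0] + rng.normal(size=200)

# Assumed definition of the scorer: plain (positive) MSE per fold.
mse_scorer = make_scorer(mean_squared_error)

model = RandomForestRegressor(n_estimators=20, random_state=42)
fold_mse = cross_val_score(model, X, y, scoring=mse_scorer, cv=5)

# Same calculation as in the cell above: sqrt of the mean fold MSE.
rmse_of_mean_mse = np.sqrt(fold_mse.mean())

# Alternative: average the RMSE of each fold.
mean_of_fold_rmse = np.sqrt(fold_mse).mean()

print(rmse_of_mean_mse, mean_of_fold_rmse)
```

The first value is never smaller than the second, so the two should not be mixed when comparing iterations.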

So something did go wrong in the 05_models notebook. I didn't think I would really get the RMSE down to 4, so I was very confused when that happened. Just to double check everything, I want to run all of these feature combinations on a random forest model with the same parameters as the best performing one from 05_models.

## Random Forest With Tuned Hyperparameters

In [127]:
rf5 = RandomForestRegressor(random_state=42, max_depth=15, min_samples_leaf=25, n_estimators=100)

Now I can calculate the cross val score over 5 splits for each of the training data combinations.

### Base Features

In [128]:
np.sqrt(cross_val_score(rf5, X_train1_scaled, y_train1, scoring=mse_score, cv=5).mean())

4861.238871525712

In [132]:
rf5.fit(X_train1_scaled, y_train1)

RandomForestRegressor(max_depth=15, min_samples_leaf=25, random_state=42)

In [134]:
np.sqrt(mse(y_train1, rf5.predict(X_train1_scaled)))

4473.119484793107

### Include time element

In [129]:
np.sqrt(cross_val_score(rf5, X_train2_scaled, y_train2, scoring=mse_score, cv=5).mean())

2592.966985729268

In [135]:
rf5.fit(X_train2_scaled, y_train2)

RandomForestRegressor(max_depth=15, min_samples_leaf=25, random_state=42)

In [136]:
np.sqrt(mse(y_train2, rf5.predict(X_train2_scaled)))

2349.9764480720455

### Add past weather conditions

In [130]:
np.sqrt(cross_val_score(rf5, X_train3_scaled, y_train3, scoring=mse_score, cv=5).mean())

2792.360258877181

In [137]:
rf5.fit(X_train3_scaled, y_train3)

RandomForestRegressor(max_depth=15, min_samples_leaf=25, random_state=42)

In [138]:
np.sqrt(mse(y_train3, rf5.predict(X_train3_scaled)))

2501.1801611695814

### Add how weather changed

In [131]:
np.sqrt(cross_val_score(rf5, X_train4_scaled, y_train4, scoring=mse_score, cv=5).mean())

2829.224429679723

In [142]:
rf5.fit(X_train4_scaled, y_train4)

RandomForestRegressor(max_depth=15, min_samples_leaf=25, random_state=42)

In [143]:
np.sqrt(mse(y_train4, rf5.predict(X_train4_scaled)))

2497.2588592954285

## Conclusions

The best performing feature combination was the base features plus the time element (week of year/hour of day), with a cross-validated RMSE of about 2,593 on the tuned model. Those are the features I am going to proceed with for now.
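
To make the comparison easier to read, the tuned model's RMSE values reported above can be collected into one small table. The numbers are copied from the cell outputs in this notebook; the row labels and the `gap` column (cross-validated RMSE minus training RMSE) are my own additions for illustration.

```python
import pandas as pd

# RMSE values copied from the tuned random forest cells above.
results = pd.DataFrame(
    {
        "cv_rmse": [
            4861.238871525712,
            2592.966985729268,
            2792.360258877181,
            2829.224429679723,
        ],
        "train_rmse": [
            4473.119484793107,
            2349.9764480720455,
            2501.1801611695814,
            2497.2588592954285,
        ],
    },
    index=["base", "+ time", "+ lagged weather", "+ weather change"],
)

# How much worse the model does on held-out folds than on the data it was fit on.
results["gap"] = results["cv_rmse"] - results["train_rmse"]

# Feature set with the lowest cross-validated RMSE.
best = results["cv_rmse"].idxmin()
print(results.round(1))
print(best)
```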