# A Case Study in Energy Efficiency
## Matthew T. Smith

### Problem Statement
A household’s energy usage fluctuates mostly with the heating and cooling that the weather demands. This analysis models and predicts a single household’s usage from historical weather data. The resulting compact model can be passed between an energy supplier or retailer and the consumer without handing over the user’s actual usage trends. Businesses can rely on normal meter reads and the model’s predictions to price their utility, and consumers can better see how they use their energy and budget or plan accordingly. This small example of “federated learning” keeps privacy at the forefront of the model’s architecture. Using the Keras deep learning library in Python on top of Google’s TensorFlow backend, a small feed-forward regression model is trained on a 10-month electricity usage report for an apartment in Wellington, New Zealand, paired with the weather (high/low temperature, relative humidity, and rainfall) from the same period. Predictions are then made on sample weather data to demonstrate probable energy consumption.

### Technology

#### Software
- **Keras Framework** [https://github.com/keras-team/keras]
- **TensorFlow Framework** [https://www.tensorflow.org/install/]
- **Python 3.7** [https://www.python.org/downloads/]
- **h5py** [http://docs.h5py.org/en/latest/build.html]
- **numpy** [https://scipy.org/install.html]
- **pandas** [https://pandas.pydata.org/pandas-docs/stable/install.html]
- **matplotlib** [https://scipy.org/install.html]

#### Hardware
- **Operating System:** macOS Mojave 10.14.4
- **Processor:** 2.3 GHz Intel Core i7
- **Memory:** 16 GB 1600 MHz DDR3
- **GPU**: N/A

### Data

#### Weather
- CliFlo – the web application providing access to New Zealand’s National Climate Database [https://cliflo.niwa.co.nz/]
- Collected from Wellington, NZ’s Greta Point Weather Station
    - Closest station to the home with complete data
- 10-month dataset of high and low temperatures (Celsius), relative humidity (percent), and rainfall (millimeters)

#### Energy Consumption
- Powershop NZ – Energy retailer for the home [https://www.powershop.co.nz/]
- Half-hourly consumption (kWh) fed from smart meter at the property
- The data downloaded is from my own home in NZ; consumption data is not freely available via the retailer

In [7]:
import pandas as pd

ENERGY = './data/energy_data_raw.csv'
WEATHER = './data/wellington_weather_raw.csv'
RAIN = './data/wellington_rain_raw.csv'

E = pd.read_csv(ENERGY)
W = pd.read_csv(WEATHER)
R = pd.read_csv(RAIN)

print("\nEnergy: \n")
print(E.head(3))
print("\nWeather: \n")
print(W.head(3))
print("\nRainfall: \n")
print(R.head(3))


Energy: 

         reading_start          reading_end  usage
0  29/03/2018 00:00:01  29/03/2018 00:30:00   0.05
1  29/03/2018 00:00:01  30/03/2018 00:00:00  11.51
2  29/03/2018 00:30:01  29/03/2018 01:00:00   0.04

Weather: 

                     Station         Date(UTC) Tmax(C) Period(Hrs)  Tmin(C)  \
Wellington   Greta Point Cws  29/03/2018 00:00    20.5           1     19.5   
Wellington   Greta Point Cws  29/03/2018 01:00    22.4           1     19.9   
Wellington   Greta Point Cws  29/03/2018 02:00    23.2           1     22.4   

            Period(Hrs).1 Tgmin(C) Period(Hrs).2 Tmean(C) RHmean(%)  \
Wellington              1        -             -     19.8        66   
Wellington              1        -             -     20.8        61   
Wellington              1        -             -     22.8        51   

           Period(Hrs).3 Freq  
Wellington             1    H  
Wellington             1    H  
Wellington             1    H  

Rainfall: 

                       Station

### Setup

#### Installation
The project and data can be downloaded from here: [https://github.com/matthew-t-smith/energy_by_weather]

From the home directory, run:

`~/energy_by_weather $ pip3 install -r requirements.txt`

#### Data Cleaning
The raw data already exists in the `/data/` directory, as do the cleaning script and its output data. To re-clean the data, from the data directory, run:

`~/energy_by_weather/data $ python3 data_prep.py`

#### Model Creation and Training
Running the script creates a new model, trains it, and saves it as an `.h5` file, overwriting the one already in the directory. The plots are also re-saved over the existing files. To re-create and re-train the model, from the home directory, run:

`~/energy_by_weather $ python3 model.py`

### Data Cleaning

In [8]:
import datetime as dt

# Read half-hourly energy data and merge usages to hourly increments
raw_data = pd.read_csv(ENERGY, skiprows=(lambda x: x == 49*(int(x/49)) + 2))
reads = []
for i in range(0, raw_data.shape[0] - 2, 2):
    new_row = [pd.to_datetime(raw_data.at[i, 'reading_start'], dayfirst=True).round(
        'H'), (raw_data.at[i, 'usage'] + raw_data.at[(i+1), 'usage'])]
    reads.append(new_row)

energy = pd.DataFrame(reads, columns=['Date(UTC)', 'usage'])

The energy data is first loaded, using a `lambda` function to skip the extra 24-hour cumulative reading that appears once in every block of 49 rows. The reading datetime and hourly usage are then built by adding the first thirty-minute read of each hour to the second (rounding the start time to the hour removes the one-second offset). We must do this because our weather data is only hourly.
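
To make the row arithmetic concrete, here is a minimal sketch of the two steps in plain Python, without pandas. The helper names `is_daily_total`, `round_to_hour`, and `merge_half_hours` and the sample readings are hypothetical, and the rounding helper only approximates `round('H')` for the one-second offsets seen in this file.

```python
from datetime import datetime, timedelta


def is_daily_total(line_no):
    """Same predicate as the skiprows lambda: true for file lines 2, 51, 100, ..."""
    return line_no == 49 * (line_no // 49) + 2


def round_to_hour(ts):
    """Round to the nearest hour (enough to absorb the one-second offset)."""
    shifted = ts + timedelta(minutes=30)
    return shifted.replace(minute=0, second=0, microsecond=0)


def merge_half_hours(reads):
    """Pair consecutive half-hour reads into (hour, summed usage) rows."""
    hourly = []
    for i in range(0, len(reads) - 1, 2):
        start, first_usage = reads[i]
        _, second_usage = reads[i + 1]
        hourly.append((round_to_hour(start), round(first_usage + second_usage, 2)))
    return hourly


# File line 0 is the header; each day then contributes 48 reads plus one total.
skipped_lines = [n for n in range(150) if is_daily_total(n)]

# Hypothetical half-hour reads after the daily totals have been skipped.
sample_reads = [
    (datetime(2018, 3, 29, 0, 0, 1), 0.05),
    (datetime(2018, 3, 29, 0, 30, 1), 0.04),
    (datetime(2018, 3, 29, 1, 0, 1), 0.06),
    (datetime(2018, 3, 29, 1, 30, 1), 0.07),
]
hourly_reads = merge_half_hours(sample_reads)
print(skipped_lines)
print(hourly_reads)
```

The skipped line numbers fall 49 apart because each day occupies 49 lines of the file: 48 half-hour reads plus the one daily total.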

In [9]:
# Timeslice rain, weather and energy data
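def time_slice(dataset):
    # Work on a copy so the caller's DataFrame is not modified
    dataset = dataset.copy()
    dataset['Date(UTC)'] = pd.to_datetime(dataset['Date(UTC)'], dayfirst=True)
    dataset = dataset.set_index('Date(UTC)')
    # Label slicing with .loc keeps rows from both end dates
    dataset = dataset.loc['2018-03-29':'2019-01-29']
    return dataset.reset_index()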


rain = pd.read_csv(RAIN, usecols=['Date(UTC)', 'Amount(mm)'], dayfirst=True)
rain = time_slice(rain)
weather = pd.read_csv(
    WEATHER, usecols=['Date(UTC)', 'Tmax(C)', 'Tmin(C)', 'RHmean(%)'], dayfirst=True)
weather = time_slice(weather)
energy = time_slice(energy)

The rain, weather, and energy data are then read (keeping only the relevant columns) and clipped to the same 10-month period, mostly to ensure that the three raw downloads cover identical date ranges.
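
Two details of this step are easy to miss: the raw timestamps are day-first, and the date window is meant to include the whole of the last day. The sketch below shows both in plain Python. The names `parse_day_first` and `in_window` and the sample timestamps are hypothetical, and it assumes the pandas label slice keeps every hour of the end date.

```python
from datetime import date, datetime

WINDOW_START = date(2018, 3, 29)
WINDOW_END = date(2019, 1, 29)


def parse_day_first(text):
    """Parse 'DD/MM/YYYY HH:MM', the day-first layout shown in the weather data."""
    return datetime.strptime(text, "%d/%m/%Y %H:%M")


def in_window(ts):
    """Keep timestamps from the start date through the end of the last day."""
    return WINDOW_START <= ts.date() <= WINDOW_END


# Hypothetical timestamps; '03/04/2018' must be read as 3 April, not 4 March.
samples = [
    "28/03/2018 23:00",
    "03/04/2018 12:00",
    "29/01/2019 23:00",
    "30/01/2019 00:00",
]
parsed = [parse_day_first(s) for s in samples]
kept = [ts for ts in parsed if in_window(ts)]
print(parsed[1].month)
print([ts.strftime("%Y-%m-%d %H:%M") for ts in kept])
```

Reading the same strings month-first would silently shift many rows to the wrong date, which is why `dayfirst=True` matters when the columns are converted.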

In [10]:
# Create a single DataFrame to compile feed-in data
final = []
for index, row in weather.iterrows():
    energy_filter = energy['Date(UTC)'] == row['Date(UTC)']
    rain_filter = rain['Date(UTC)'] == row['Date(UTC)']

    rainfall = rain[rain_filter]['Amount(mm)'].values
    try:
        rainfall = float(rainfall)
    except (TypeError, ValueError):
        rainfall = -1

    usage = energy[energy_filter]['usage'].values
    try:
        usage = float(usage)
    except (TypeError, ValueError):
        usage = -1

    try:
        row['RHmean(%)'] = float(row['RHmean(%)']) / 100
    except (TypeError, ValueError):
        row['RHmean(%)'] = -1

    entry = [row['Date(UTC)'], row['Tmax(C)'], row['Tmin(C)'],
             row['RHmean(%)'], rainfall, usage]
    final.append(entry)

data = pd.DataFrame(final, columns=[
    'DateTime', 'Tmax(C)', 'Tmin(C)', 'RHmean(%)', 'Rain(mm)', 'Usage(kWh)'])
data.fillna(-1, inplace=True)

We next iterate over the `weather` dataframe (since it contributes the most columns), look up the matching hour in the `rain` and `energy` dataframes, and collect the values into new rows of the `final` list, which becomes the `data` dataframe. Any missing, non-numeric, or otherwise empty value is flagged with `-1` so that its row can be dropped later.
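
The same pairing can also be expressed as a join. The sketch below is one possible alternative, not the project's method: it uses `DataFrame.merge` on a hypothetical three-hour sample, and it marks bad values with `NaN` instead of the `-1` flag.

```python
import pandas as pd

# Hypothetical three-hour sample shaped like the cleaned frames.
hours = pd.to_datetime(["2018-03-29 00:00", "2018-03-29 01:00", "2018-03-29 02:00"])
weather = pd.DataFrame({
    "Date(UTC)": hours,
    "Tmax(C)": [20.5, 22.4, 23.2],
    "Tmin(C)": [19.5, 19.9, 22.4],
    "RHmean(%)": ["66", "-", "51"],  # '-' marks a missing reading
})
rain = pd.DataFrame({"Date(UTC)": hours, "Amount(mm)": [0.0, 0.2, 0.0]})
energy = pd.DataFrame({"Date(UTC)": hours[[0, 2]], "usage": [0.09, 0.13]})

# Left joins keep every weather hour; unmatched hours become NaN.
merged = (
    weather
    .merge(rain, on="Date(UTC)", how="left")
    .merge(energy, on="Date(UTC)", how="left")
)

# Non-numeric placeholders become NaN, then humidity is scaled to a fraction.
merged["RHmean(%)"] = pd.to_numeric(merged["RHmean(%)"], errors="coerce") / 100

# Dropping NaN rows plays the role of the -1 flag in the loop.
clean = merged.dropna().reset_index(drop=True)
print(clean)
```

In this sample the middle hour is dropped because it has no humidity reading and no energy read, leaving two complete rows.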

In [11]:
print(data.shape)
drops = []
for index, row in data.iterrows():
    if (row.isin(['-', -1.0]).any() or (row['Usage(kWh)'] >= 10)):
        drops.append(index)
data.drop(drops, axis='index', inplace=True)
print(data.shape)
# Save alongside the raw files, where the model cell reads the data back
data.to_csv('./data/data.csv')

(7356, 6)
(7126, 6)


Finally, we drop the flagged rows: any row containing a `-1` flag or a `-` placeholder from the raw weather data, plus any row with an hourly usage of 10 kWh or more. That removes 230 rows and leaves 7126 valid rows to use as input.

### Model

In [12]:
import math
import numpy as np
from keras.layers import Dense, Dropout
from keras.models import Sequential
from keras.optimizers import Adam
import matplotlib.pyplot as plt

# Normalizing function for dataset
def norm(d):
    return (d - d.min()) / (d.max() - d.min())

def denorm(y, d):
    return (y * (d.max() - d.min()) + d.min())


# Load the data
data = pd.read_csv('./data/data.csv',
                   usecols=['Tmax(C)', 'Tmin(C)', 'RHmean(%)', 'Rain(mm)', 'Usage(kWh)'])
dataset = data.values.astype(float)
X = dataset[:, 0:4]
Y = dataset[:, 4]

Using TensorFlow backend.


Here we load the cleaned data and split it into the four weather features (`X`) and the usage target (`Y`).

In [13]:
# Normalize the dataset
normed_X = np.apply_along_axis(norm, axis=0, arr=X)
normed_Y = norm(Y)

# Split datasets into training and test
split = 0.8
val = math.floor(split*len(normed_X))
train_X = normed_X[:val]
train_Y = normed_Y[:val]
test_X = normed_X[val:]
test_Y = normed_Y[val:]

We then normalize each column to the range 0 to 1 and split the data 80-20 into training and test sets. The split follows row order rather than a random shuffle, so the test set is the last 20% of rows.
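
As a quick check of the scaling, the sketch below applies the same `norm` and `denorm` formulas to a few hypothetical usage values and confirms that the round trip recovers them. It also works out the split sizes for the 7126 cleaned rows reported earlier.

```python
import math

import numpy as np


def norm(d):
    return (d - d.min()) / (d.max() - d.min())


def denorm(y, d):
    return y * (d.max() - d.min()) + d.min()


# Hypothetical hourly usage values in kWh.
usage = np.array([0.2, 0.5, 1.4, 0.8, 2.2])
scaled = norm(usage)
restored = denorm(scaled, usage)

# Split point for the 7126 cleaned rows.
rows = 7126
train_rows = math.floor(0.8 * rows)
test_rows = rows - train_rows
print(scaled)
print(train_rows, test_rows)
```

Note that the minimum and maximum here come from the full dataset, so the test rows influence the scaling. Computing them from the training rows only is a common alternative if a stricter evaluation is wanted.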

In [14]:
# Build the model with layers
model = Sequential()
model.add(Dense(4, input_dim=4, activation='relu'))
model.add(Dropout(0.25))
model.add(Dense(4, activation='relu'))
model.add(Dropout(0.25))
model.add(Dense(1))
adam = Adam(lr=1e-5)
model.compile(loss='mean_squared_error',
              optimizer=adam, metrics=['mean_absolute_error', 'mean_squared_error'])

# Fit the model
history = model.fit(train_X,
                    train_Y,
                    epochs=200,
                    validation_data=(test_X, test_Y),
                    verbose=0)
model.save('energy.h5')



Lastly, we define our model. The first `Dense` layer takes the 4 weather inputs and is followed by a `Dropout` layer to limit overfitting, since our dataset is relatively small. A second `Dense` and `Dropout` pair repeats this before a single output node produces the usage prediction. We minimize `mean_squared_error` because the target is continuous rather than categorical. We will see the trends of training over these 200 epochs in the plots to follow.
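
To give a sense of how compact this network is, the short sketch below counts its trainable parameters by hand. The helper `dense_params` is hypothetical; the layer sizes are the ones defined in the cell above.

```python
def dense_params(inputs, units):
    """Weights plus one bias per unit for a fully connected layer."""
    return inputs * units + units


# (inputs, units) for the three Dense layers; Dropout adds no parameters.
layers = [(4, 4), (4, 4), (4, 1)]
per_layer = [dense_params(i, u) for i, u in layers]
total_params = sum(per_layer)
print(per_layer, total_params)
```

With only 45 parameters, the saved model is far smaller than the half-hourly usage history it was trained on, which is what makes it practical to pass around in place of the raw data.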

In [None]:
hist = pd.DataFrame(history.history)
hist['epoch'] = history.epoch

plt.figure()
plt.xlabel('Epoch')
plt.ylabel('Mean Abs Error')
plt.plot(hist['epoch'], hist['mean_absolute_error'],
         label='Train Error')
plt.plot(hist['epoch'], hist['val_mean_absolute_error'],
         label='Val Error')
plt.legend()
plt.savefig('./plots/mae.png')

plt.figure()
plt.xlabel('Epoch')
plt.ylabel('Mean Square Error')
plt.plot(hist['epoch'], hist['mean_squared_error'],
         label='Train Error')
plt.plot(hist['epoch'], hist['val_mean_squared_error'],
         label='Val Error')
plt.legend()
plt.savefig('./plots/mse.png')

plt.figure()
plt.xlabel('Usage (kWh)')
plt.ylabel('Temp (C)')
plt.plot(denorm(model.predict(test_X[-10:, :]),
                Y), X[-10:, 0], label='High')
plt.plot(denorm(model.predict(test_X[-10:, :]),
                Y), X[-10:, 1], label='Low')
plt.legend()
plt.savefig('./plots/predictions.png')

plt.show()

![mae](./plots/mae.png)
![mse](./plots/mse.png)
![predictions](./plots/predictions.png)

We see the `mean_squared_error` decrease steadily over the first 50 or so epochs before leveling off near 0. In the final plot, we can see the inverse relationship we hoped for, with usage increasing as the temperature decreases.
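
Because the target was normalized before training, the errors in these plots are in normalized units rather than kWh. The sketch below shows how they could be converted back, since min-max scaling divides every error by the usage range. The helper names and all numbers are hypothetical and are not results from this model.

```python
import math


def mae_to_kwh(mae_norm, y_min, y_max):
    """Scale a normalized mean absolute error back to kWh."""
    return mae_norm * (y_max - y_min)


def rmse_to_kwh(mse_norm, y_min, y_max):
    """Root of the normalized MSE, scaled back to kWh."""
    return math.sqrt(mse_norm) * (y_max - y_min)


# Hypothetical numbers for illustration only.
y_min, y_max = 0.0, 8.0
mae_kwh = mae_to_kwh(0.05, y_min, y_max)
rmse_kwh = rmse_to_kwh(0.0064, y_min, y_max)
print(mae_kwh, rmse_kwh)
```

Reporting the error in kWh per hour makes it easier to judge whether the predictions are accurate enough for budgeting.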

### Summary

#### Improvements

- The model is trained on minimal weather data (temperature, humidity, and rainfall); collecting more variables could improve predictions.
- Other inputs could later include whether anyone is home, since consumption is higher when the house is occupied, on top of a general baseline of appliances and electronics that run constantly.
- The trained model for this home could now be safely passed between Powershop NZ and the consumer to continue training as time goes on. If the half-hourly data were fed to the user rather than the retailer (unlike the data collection here), Powershop NZ would not be able to reverse-engineer the actual half-hourly usage from the model.
- Powershop could better estimate each home’s usage, anticipate spikes in demand, and price available electricity for future consumption.
- Similarly, the consumer can compare these predictions against a weather forecast for the coming week(s) or month(s) in order to budget accordingly.
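
The last point, comparing predictions against a forecast, could be sketched as follows. This is an illustration under stated assumptions: `predict_kwh`, `scale`, the bounds, and `stand_in_model` are all hypothetical, and the stand-in function only takes the place of the trained network so that the example runs without Keras. Relative humidity is given as a fraction because the cleaning step divides it by 100.

```python
def scale(value, low, high):
    """Min-max scale one value with the bounds seen during training."""
    return (value - low) / (high - low)


def predict_kwh(forecast_row, feature_bounds, usage_bounds, predict_fn):
    """Scale one forecast row, run the model, and convert back to kWh."""
    scaled = [scale(v, lo, hi) for v, (lo, hi) in zip(forecast_row, feature_bounds)]
    y_norm = predict_fn(scaled)
    u_min, u_max = usage_bounds
    return y_norm * (u_max - u_min) + u_min


# Hypothetical bounds for Tmax, Tmin, RH fraction, rain (mm), and usage (kWh).
feature_bounds = [(5.0, 30.0), (0.0, 25.0), (0.3, 1.0), (0.0, 20.0)]
usage_bounds = (0.0, 8.0)


def stand_in_model(scaled_row):
    """Placeholder for the trained network: colder hours give higher output."""
    return 0.5 * (1.0 - scaled_row[0])


# Forecast hours: high, low, humidity as a fraction, rainfall in mm.
cold_hour = predict_kwh([10.0, 6.0, 0.80, 1.0], feature_bounds, usage_bounds, stand_in_model)
warm_hour = predict_kwh([25.0, 18.0, 0.60, 0.0], feature_bounds, usage_bounds, stand_in_model)
print(cold_hour, warm_hour)
```

In practice the bounds would need to be the same minimum and maximum values used when the training data was normalized, so they would have to be stored alongside the `.h5` file.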

#### Looking Ahead

- Energy efficiency is just one small area of personal data that can benefit from a federated learning model, which offers more privacy on the user’s end and fewer resources for a business to manage.
- In an ideal world, corporations might collect a large set of models that are trained and personalized to each consumer rather than actual terabytes of private user data. These models could be deleted or restarted at a user’s request, but the corporations would still benefit from the personalized training done over time.
- Advertising, services, recommendations, and other “smart” metrics can gradually improve over time not only for individual users, but for a userbase as a whole if the collective models are then trained with each other.
- This type of learning becomes especially beneficial in high-sensitivity datasets like those found in healthcare. Diagnostic prediction models could benefit by training on large sets of cases without ever transporting that sensitive data outside of the hospital where it was collected in the first place.

### YouTube Links
- **2-Minute Version:** [https://youtu.be/287fuFMdb_4]
- **15-Minute Version:** [https://youtu.be/iY6oBf72PRU]