# Quantifying uncertainty in weather data
CityLearn comes with several dimensions of weather data: actual hourly measurements as well as 6-, 12-, and 24-hour point-estimate forecasts for Outdoor Drybulb Temperature, Relative Humidity, and Diffuse and Direct Solar Radiation.

The goal of this subproject is to identify the error distributions of these point estimates, so that we know when and how much to rely on each forecast.

## Loading the Data

In [13]:
# canonical imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# let's read in the data (this would probably be easier in SQL, but I don't want to set up a database)
weather = pd.read_csv("data/citylearn_challenge_2022_phase_1/weather.csv")
# the buildings contain per-row time data
building1 = pd.read_csv("data/citylearn_challenge_2022_phase_1/Building_1.csv")

pricing = pd.read_csv("data/citylearn_challenge_2022_phase_1/pricing.csv")


In [2]:
# let's look at the weather data
weather

Unnamed: 0,Outdoor Drybulb Temperature [C],Relative Humidity [%],Diffuse Solar Radiation [W/m2],Direct Solar Radiation [W/m2],6h Prediction Outdoor Drybulb Temperature [C],12h Prediction Outdoor Drybulb Temperature [C],24h Prediction Outdoor Drybulb Temperature [C],6h Prediction Relative Humidity [%],12h Prediction Relative Humidity [%],24h Prediction Relative Humidity [%],6h Prediction Diffuse Solar Radiation [W/m2],12h Prediction Diffuse Solar Radiation [W/m2],24h Prediction Diffuse Solar Radiation [W/m2],6h Prediction Direct Solar Radiation [W/m2],12h Prediction Direct Solar Radiation [W/m2],24h Prediction Direct Solar Radiation [W/m2]
0,20.0,84.0,0.0,0.0,18.3,22.8,20.0,81.0,68.0,81.0,25.0,964.0,0.0,100.0,815.0,0.0
1,20.1,79.0,0.0,0.0,19.4,22.8,19.4,79.0,71.0,87.0,201.0,966.0,0.0,444.0,747.0,0.0
2,19.7,78.0,0.0,0.0,21.1,22.2,19.4,73.0,73.0,87.0,420.0,683.0,0.0,592.0,291.0,0.0
3,19.3,78.0,0.0,0.0,22.2,22.8,19.4,71.0,71.0,90.0,554.0,522.0,0.0,491.0,153.0,0.0
4,18.9,78.0,0.0,0.0,21.7,22.2,18.9,73.0,71.0,90.0,778.0,444.0,0.0,734.0,174.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8755,20.6,84.0,26.0,130.0,20.1,19.4,20.6,79.0,79.0,73.0,0.0,201.0,27.0,0.0,444.0,143.0
8756,21.1,81.0,0.0,0.0,19.7,21.1,20.0,78.0,73.0,76.0,0.0,420.0,0.0,0.0,592.0,0.0
8757,21.7,79.0,0.0,0.0,19.3,22.2,20.6,78.0,71.0,70.0,0.0,554.0,0.0,0.0,491.0,0.0
8758,21.3,76.0,0.0,0.0,18.9,21.7,20.6,78.0,73.0,73.0,0.0,778.0,0.0,0.0,734.0,0.0


In [3]:
# Let's declare some helper variables to deal with the column names
pred_6 = "6h Prediction "
pred_12 = "12h Prediction "
pred_24 = "24h Prediction "

true_6 = "6h Lookahead "
true_12 = "12h Lookahead "
true_24 = "24h Lookahead "

temperature = "Outdoor Drybulb Temperature [C]"
humidity = "Relative Humidity [%]"
diffuse = "Diffuse Solar Radiation [W/m2]"
direct = "Direct Solar Radiation [W/m2]"
measures = [temperature, humidity, diffuse, direct]

# quick sanity check
assert pred_6+temperature == "6h Prediction Outdoor Drybulb Temperature [C]"

In [4]:
# Next, let's integrate the date information from the building data.
# The next line shows us that we don't have to deal with Daylight Savings Time:
building1[building1["Daylight Savings Status"] != 0] # returns 0 rows

Unnamed: 0,Month,Hour,Day Type,Daylight Savings Status,Indoor Temperature [C],Average Unmet Cooling Setpoint Difference [C],Indoor Relative Humidity [%],Equipment Electric Power [kWh],DHW Heating [kWh],Cooling Load [kWh],Heating Load [kWh],Solar Generation [W/kW]


In [5]:
# This means we can omit this column, and simply add Month, Hour and Day Type to our weather dataframe:
weathertime = building1[["Month", "Hour", "Day Type"]].join(weather)
weathertime.head()

Unnamed: 0,Month,Hour,Day Type,Outdoor Drybulb Temperature [C],Relative Humidity [%],Diffuse Solar Radiation [W/m2],Direct Solar Radiation [W/m2],6h Prediction Outdoor Drybulb Temperature [C],12h Prediction Outdoor Drybulb Temperature [C],24h Prediction Outdoor Drybulb Temperature [C],6h Prediction Relative Humidity [%],12h Prediction Relative Humidity [%],24h Prediction Relative Humidity [%],6h Prediction Diffuse Solar Radiation [W/m2],12h Prediction Diffuse Solar Radiation [W/m2],24h Prediction Diffuse Solar Radiation [W/m2],6h Prediction Direct Solar Radiation [W/m2],12h Prediction Direct Solar Radiation [W/m2],24h Prediction Direct Solar Radiation [W/m2]
0,8.0,0.0,1.0,20.0,84.0,0.0,0.0,18.3,22.8,20.0,81.0,68.0,81.0,25.0,964.0,0.0,100.0,815.0,0.0
1,8.0,1.0,1.0,20.1,79.0,0.0,0.0,19.4,22.8,19.4,79.0,71.0,87.0,201.0,966.0,0.0,444.0,747.0,0.0
2,8.0,2.0,1.0,19.7,78.0,0.0,0.0,21.1,22.2,19.4,73.0,73.0,87.0,420.0,683.0,0.0,592.0,291.0,0.0
3,8.0,3.0,1.0,19.3,78.0,0.0,0.0,22.2,22.8,19.4,71.0,71.0,90.0,554.0,522.0,0.0,491.0,153.0,0.0
4,8.0,4.0,1.0,18.9,78.0,0.0,0.0,21.7,22.2,18.9,73.0,71.0,90.0,778.0,444.0,0.0,734.0,174.0,0.0


In [6]:
# Also, let's add the actual future values so we have something to compare the predictions to.
predictions = weathertime[:-24]  # remove the last day of predictions, which has no ground truth for the 24h predictions

ground_truth_6 = weathertime[measures][6:-18]
ground_truth_12 = weathertime[measures][12:-12]
ground_truth_24 = weathertime[measures][24:]

# a bit of shenanigans to reset the index and rename the columns
ground_truth_6 = ground_truth_6.reset_index(drop=True)
ground_truth_12 = ground_truth_12.reset_index(drop=True)
ground_truth_24 = ground_truth_24.reset_index(drop=True)

ground_truth_6 = ground_truth_6.rename(columns=lambda lbl: true_6 + lbl)
ground_truth_12 = ground_truth_12.rename(columns=lambda lbl: true_12 + lbl)
ground_truth_24 = ground_truth_24.rename(columns=lambda lbl: true_24 + lbl)


predictions = predictions.join([ground_truth_6, ground_truth_12, ground_truth_24])
predictions.head()

Unnamed: 0,Month,Hour,Day Type,Outdoor Drybulb Temperature [C],Relative Humidity [%],Diffuse Solar Radiation [W/m2],Direct Solar Radiation [W/m2],6h Prediction Outdoor Drybulb Temperature [C],12h Prediction Outdoor Drybulb Temperature [C],24h Prediction Outdoor Drybulb Temperature [C],...,6h Lookahead Diffuse Solar Radiation [W/m2],6h Lookahead Direct Solar Radiation [W/m2],12h Lookahead Outdoor Drybulb Temperature [C],12h Lookahead Relative Humidity [%],12h Lookahead Diffuse Solar Radiation [W/m2],12h Lookahead Direct Solar Radiation [W/m2],24h Lookahead Outdoor Drybulb Temperature [C],24h Lookahead Relative Humidity [%],24h Lookahead Diffuse Solar Radiation [W/m2],24h Lookahead Direct Solar Radiation [W/m2]
0,8.0,0.0,1.0,20.0,84.0,0.0,0.0,18.3,22.8,20.0,...,25.0,100.0,22.8,68.0,964.0,815.0,20.0,81.0,0.0,0.0
1,8.0,1.0,1.0,20.1,79.0,0.0,0.0,19.4,22.8,19.4,...,201.0,444.0,22.8,71.0,966.0,747.0,19.4,87.0,0.0,0.0
2,8.0,2.0,1.0,19.7,78.0,0.0,0.0,21.1,22.2,19.4,...,420.0,592.0,22.2,73.0,683.0,291.0,19.4,87.0,0.0,0.0
3,8.0,3.0,1.0,19.3,78.0,0.0,0.0,22.2,22.8,19.4,...,554.0,491.0,22.8,71.0,522.0,153.0,19.4,90.0,0.0,0.0
4,8.0,4.0,1.0,18.9,78.0,0.0,0.0,21.7,22.2,18.9,...,778.0,734.0,22.2,71.0,444.0,174.0,18.9,90.0,0.0,0.0
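
The slicing and re-indexing above can also be expressed with `shift`. The following is a minimal sketch on a toy frame, not the code used in this notebook: the helper name `add_lookahead` and the toy data are assumptions made for illustration, and it presumes one row per hour with no gaps.

```python
import pandas as pd


def add_lookahead(df, columns, horizons=(6, 12, 24)):
    """Return a copy of df with '<h>h Lookahead <col>' columns (hypothetical helper).

    Each new column holds the value observed h rows later, so it assumes
    that consecutive rows are exactly one hour apart.
    """
    out = df.copy()
    for h in horizons:
        for col in columns:
            # shift(-h) moves the value from row i + h up to row i
            out[f"{h}h Lookahead {col}"] = df[col].shift(-h)
    # the last max(horizons) rows have no ground truth for the longest horizon
    return out.iloc[:-max(horizons)]


# toy data: 30 "hours" of a steadily rising temperature
toy = pd.DataFrame({"Outdoor Drybulb Temperature [C]": [float(i) for i in range(30)]})
aligned = add_lookahead(toy, ["Outdoor Drybulb Temperature [C]"])
print(aligned)
```

With 30 input rows and a longest horizon of 24 hours, only the first 6 rows keep a complete set of lookahead values.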


## Examine predictions and errors
Now that we have the data in a usable form, let's compare predictions with ground truths!

First, let's get the overall distribution of prediction errors. Then we can slice the errors by hour of day to see whether the distribution changes over the course of the day.

In [11]:
# count the rows where a prediction differs from the value actually observed that many hours later
for pred, true in zip([pred_6, pred_12, pred_24], [true_6, true_12, true_24]):
    for measure in measures:
        print(pred + measure + ": ", (predictions[pred + measure] != predictions[true + measure]).sum())
# Every count is 0: apparently the predictions are just the observations! Nice oracle you have there!

6h Prediction Outdoor Drybulb Temperature [C]:  0
6h Prediction Relative Humidity [%]:  0
6h Prediction Diffuse Solar Radiation [W/m2]:  0
6h Prediction Direct Solar Radiation [W/m2]:  0
12h Prediction Outdoor Drybulb Temperature [C]:  0
12h Prediction Relative Humidity [%]:  0
12h Prediction Diffuse Solar Radiation [W/m2]:  0
12h Prediction Direct Solar Radiation [W/m2]:  0
24h Prediction Outdoor Drybulb Temperature [C]:  0
24h Prediction Relative Humidity [%]:  0
24h Prediction Diffuse Solar Radiation [W/m2]:  0
24h Prediction Direct Solar Radiation [W/m2]:  0


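Wow, this was much ado about nothing: the "predictions" are exactly the values observed 6, 12, and 24 hours later, so there is no forecast error to model. I don't know why the dataset calls them "predictions" instead of using actual weather forecasts!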
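
For a dataset whose forecasts do contain errors, the analysis planned at the start of this section could look roughly like the sketch below. Everything in it is synthetic and assumed for illustration: the `truth` and `forecast` columns, the hour numbering from 0 to 23, and the noise model (larger errors in the afternoon) are made up and say nothing about CityLearn.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# 30 synthetic days of hourly data (assumption: hours are numbered 0..23)
n_hours = 24 * 30
hour = np.arange(n_hours) % 24
truth = 20 + 5 * np.sin(2 * np.pi * hour / 24)

# hypothetical forecast: the true value plus noise that is larger from 12:00 to 17:59
noise_scale = np.where((hour >= 12) & (hour < 18), 2.0, 0.5)
forecast = truth + rng.normal(0.0, noise_scale)

demo = pd.DataFrame({"Hour": hour, "truth": truth, "forecast": forecast})
demo["error"] = demo["forecast"] - demo["truth"]

# overall error distribution
overall = demo["error"].describe()
print(overall)

# error distribution sliced by hour of day
by_hour = demo.groupby("Hour")["error"].agg(["mean", "std"])
print(by_hour)
```

A per-hour standard deviation that varies over the day, as it does by construction here, would be the signal that a forecast deserves more trust at some hours than at others.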

## Pricing

In [16]:
pricing.groupby("Electricity Pricing [$]").count()

Electricity Pricing [$],6h Prediction Electricity Pricing [$],12h Prediction Electricity Pricing [$],24h Prediction Electricity Pricing [$]
0.21,4617,4617,4617
0.22,2318,2318,2318
0.4,170,170,170
0.5,1215,1215,1215
0.54,440,440,440
