<a href="https://colab.research.google.com/github/intelligent-environments-lab/occupant_centric_grid_interactive_buildings_course/blob/main/src/notebooks/homework/homework_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Homework 2 - Getting to know the Southern Company Smart Neighborhood dataset
---

In this homework, you will familiarize yourself with the [Southern Company Smart Neighborhood](https://www.georgiapower.com/residential/save-money-and-energy/smart-neighborhood.html) dataset from a neighborhood in Atlanta, Georgia, USA by combining your SQL skills with your data analysis and visualization skills. The entire dataset consists of advanced metering infrastructure (AMI) time series that provide hourly electricity bought from and sold to the grid, submeter time series of plug loads, thermostat data, water heater use data, dynamic electricity pricing time series, battery and solar-inverter time series, etc.

This homework uses only the AMI data to learn about the electricity consumption patterns in the 46 smart neighborhood buildings. There is also AMI data from 44 baseline buildings, which have floor plans similar to the smart neighborhood buildings but are less efficient and have no battery or PV. In this homework, you will compare the electricity consumption of buildings within the smart neighborhood and across both the smart and baseline neighborhoods, and also discover the typical daily profiles of each building using clustering.

You will notice in the AMI data that for each timestamp and building, there are `kWh R`, `kWh F` or `kWh F + R` values. Sometimes there are only `kWh R` and `kWh F`, or `kWh F` alone. `kWh R` is kWh Reverse: it tracks reverse power flow out of the home (the solar energy sold back to the utility by each home). It is typically 0, but there are times when it is not. `kWh F` is kWh Forward and is all the energy purchased from the utility. While you will insert data for all cases into the database, you should use only the `kWh F` values in this homework, which will be referred to as `net electricity consumption`.

Finally, you will build a regression model that predicts the net electricity consumption of any one building of your choice in either the smart or baseline neighborhood.

## Build database
---

Before the data analysis, you will build a SQLite database to hold the AMI data using the provided schema below and insert the data from the AMI `.csv` files. All columns in the tables should be set to `NOT NULL`.

<img style="background-color:#FFFFFF" src="https://github.com/intelligent-environments-lab/occupant_centric_grid_interactive_buildings_course/blob/main/figures/scsn_ami_data_sqlite_schema.jpg?raw=true" width=1000></img>
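As a starting point, the sketch below shows how `NOT NULL` tables can be declared with Python's built-in `sqlite3` module. The table layout here is only an assumption for illustration: the `building` table and the `name`, `neighborhood` and `unit` columns are hypothetical, so replace them with the tables and columns shown in the schema figure.

```python
import sqlite3

# In-memory database for illustration; pass a file path instead to persist it.
conn = sqlite3.connect(":memory:")

# HYPOTHETICAL schema: swap in the table and column names from the schema figure.
conn.executescript(
    """
    CREATE TABLE building (
        id INTEGER PRIMARY KEY NOT NULL,
        name TEXT NOT NULL,
        neighborhood TEXT NOT NULL
    );
    CREATE TABLE electricity_consumption_time_series (
        timestamp TEXT NOT NULL,
        building_id INTEGER NOT NULL REFERENCES building (id),
        unit TEXT NOT NULL,
        value REAL NOT NULL
    );
    """
)

conn.execute("INSERT INTO building (id, name, neighborhood) VALUES (1, 'B1', 'smart')")
conn.execute(
    "INSERT INTO electricity_consumption_time_series VALUES (?, ?, ?, ?)",
    ("2020-01-01 00:00:00", 1, "kWh F", 0.8),
)

# NOT NULL is enforced: inserting a NULL value raises sqlite3.IntegrityError.
try:
    conn.execute(
        "INSERT INTO electricity_consumption_time_series VALUES (?, ?, ?, ?)",
        ("2020-01-01 01:00:00", 1, "kWh F", None),
    )
    null_rejected = False
except sqlite3.IntegrityError:
    null_rejected = True

row_count = conn.execute(
    "SELECT COUNT(*) FROM electricity_consumption_time_series"
).fetchone()[0]
```

Because the columns reject `NULL`, rows with missing readings in the `.csv` files have to be dropped or handled before insertion.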

Create the database and define its tables:

In [1]:
# ******* CODE BELOW *******


Insert data from AMI csv files into database. Refer to Canvas for file URLs. You can uncomment the code below and replace `<replace with list of urls from Canvas>` with the list of URLs from Canvas if you want to read all the files into one dataframe:

In [2]:
# ******* CODE BELOW *******

# import pandas as pd

# df_list = []
# file_urls = <replace with list of urls from Canvas>

# for url in file_urls:
#     df = pd.read_csv(url)
#     df_list.append(df)

# df = pd.concat(df_list, ignore_index=True)
# del df_list

## Querying the database
---

In this part of the homework, you will query the database to extract and plot information about the buildings' energy behavior.

Execute a query that calculates the energy use intensity (EUI) of each building for each year of provided data. Follow this [link](https://my.matterport.com/show/?m=1oJSPeBrNuJ&brand=0) to estimate the total floor area of the buildings, assuming all buildings are similarly sized. How do the EUIs compare against what is expected for a typical residential building in the same location/climate zone?
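EUI is annual energy use divided by floor area. The sketch below shows one possible shape of such a query on a toy table. The `unit` column name and the floor area of 200 m² are placeholders, not values from the dataset; use your own schema and your own floor-area estimate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE electricity_consumption_time_series ("
    "timestamp TEXT NOT NULL, building_id INTEGER NOT NULL, "
    "unit TEXT NOT NULL, value REAL NOT NULL)"
)
conn.executemany(
    "INSERT INTO electricity_consumption_time_series VALUES (?, ?, ?, ?)",
    [
        ("2020-01-01 00:00:00", 1, "kWh F", 1.0),
        ("2020-06-01 00:00:00", 1, "kWh F", 3.0),
        ("2020-06-01 00:00:00", 1, "kWh R", 0.5),  # excluded by the unit filter
        ("2021-01-01 00:00:00", 1, "kWh F", 2.0),
    ],
)

FLOOR_AREA_M2 = 200.0  # PLACEHOLDER: replace with your own estimate

query = """
SELECT building_id,
       strftime('%Y', timestamp) AS year,
       SUM(value) / :area AS eui_kwh_per_m2
FROM electricity_consumption_time_series
WHERE unit = 'kWh F'
GROUP BY building_id, year
ORDER BY building_id, year
"""
eui = conn.execute(query, {"area": FLOOR_AREA_M2}).fetchall()
```

Years with only partial data coverage will give a lower EUI than a full year, so check how many hours each year contributes before comparing against benchmarks.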

In [3]:
# ******* CODE BELOW *******


Execute a query that returns the monthly net electricity consumption of each building (include only the last 12 months) and plot grouped bar graphs, one axes for each building. In which season is electricity consumption highest/lowest, and why?

In [4]:
# ******* CODE BELOW *******


Execute a query that returns the average weekly net electricity consumption profile, i.e. for hourly resolution data, you will have 168 values (24 hours per day * 7 days of the week (Mon-Sun)) per building. Plot the profile for each building on a separate axes. What patterns do you discover about the average profile for each building, and how do the profiles vary from one building to another?
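One way to obtain the 168 weekly slots is to group by day of week and hour using SQLite's `strftime`. Note that `%w` returns 0 for Sunday, so the sketch shifts it to make Monday 0. The toy table and the `unit` column name are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE electricity_consumption_time_series ("
    "timestamp TEXT NOT NULL, building_id INTEGER NOT NULL, "
    "unit TEXT NOT NULL, value REAL NOT NULL)"
)
conn.executemany(
    "INSERT INTO electricity_consumption_time_series VALUES (?, ?, ?, ?)",
    [
        ("2020-01-06 08:00:00", 1, "kWh F", 1.0),  # Monday
        ("2020-01-13 08:00:00", 1, "kWh F", 3.0),  # Monday
        ("2020-01-12 08:00:00", 1, "kWh F", 5.0),  # Sunday
    ],
)

query = """
SELECT building_id,
       -- strftime('%w') gives 0 = Sunday; shift so that 0 = Monday, 6 = Sunday
       (CAST(strftime('%w', timestamp) AS INTEGER) + 6) % 7 AS day_of_week,
       CAST(strftime('%H', timestamp) AS INTEGER) AS hour,
       AVG(value) AS mean_kwh
FROM electricity_consumption_time_series
WHERE unit = 'kWh F'
GROUP BY building_id, day_of_week, hour
ORDER BY building_id, day_of_week, hour
"""
weekly = conn.execute(query).fetchall()
```

The position on a 0-167 weekly axis is then `day_of_week * 24 + hour`.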

In [5]:
# ******* CODE BELOW *******


Execute a query that returns the average weekly net electricity consumption profile per neighborhood. Plot the profiles of the two neighborhoods on the same axes. Are there any differences in profiles between neighborhoods?

In [6]:
# ******* CODE BELOW *******


Execute a query that returns the average and standard deviation of the daily net electricity consumption profile, i.e. for hourly resolution data, you will have 24 values (24 hours per day) per building. Plot the profile for each building on a separate axes. What patterns do you discover about the average profile for each building, and how do the profiles vary from one building to another? What time does the peak occur for each building? How do the peaks in the smart neighborhood compare to those in the baseline neighborhood?
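SQLite does not ship a standard-deviation aggregate, so one workaround is to compute the population variance in SQL as E[x²] - (E[x])² and take the square root in Python. The sketch below does this on a toy table; the `unit` column name is an assumption.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE electricity_consumption_time_series ("
    "timestamp TEXT NOT NULL, building_id INTEGER NOT NULL, "
    "unit TEXT NOT NULL, value REAL NOT NULL)"
)
conn.executemany(
    "INSERT INTO electricity_consumption_time_series VALUES (?, ?, ?, ?)",
    [
        ("2020-01-01 08:00:00", 1, "kWh F", 1.0),
        ("2020-01-02 08:00:00", 1, "kWh F", 3.0),
        ("2020-01-01 09:00:00", 1, "kWh F", 2.0),
    ],
)

query = """
SELECT building_id,
       CAST(strftime('%H', timestamp) AS INTEGER) AS hour,
       AVG(value) AS mean_kwh,
       AVG(value * value) - AVG(value) * AVG(value) AS var_kwh
FROM electricity_consumption_time_series
WHERE unit = 'kWh F'
GROUP BY building_id, hour
ORDER BY building_id, hour
"""
# max(var, 0.0) guards against tiny negative values from floating-point rounding.
profile = [
    (building_id, hour, mean, math.sqrt(max(var, 0.0)))
    for building_id, hour, mean, var in conn.execute(query)
]
```

This yields the population standard deviation; multiply the variance by n / (n - 1) if you prefer the sample version.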

In [7]:
# ******* CODE BELOW *******


Execute a query that allows you to visualize the distribution of daily peak times for each building.
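A possible approach is to find each building's daily maximum in a common table expression, join it back to the time series to recover the hour at which it occurred, and count how often each hour is the peak. The sketch below uses a toy table and an assumed `unit` column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE electricity_consumption_time_series ("
    "timestamp TEXT NOT NULL, building_id INTEGER NOT NULL, "
    "unit TEXT NOT NULL, value REAL NOT NULL)"
)
conn.executemany(
    "INSERT INTO electricity_consumption_time_series VALUES (?, ?, ?, ?)",
    [
        ("2020-01-01 07:00:00", 1, "kWh F", 1.0),
        ("2020-01-01 18:00:00", 1, "kWh F", 4.0),
        ("2020-01-02 06:00:00", 1, "kWh F", 2.0),
        ("2020-01-02 19:00:00", 1, "kWh F", 3.0),
        ("2020-01-03 12:00:00", 1, "kWh F", 1.5),
        ("2020-01-03 18:00:00", 1, "kWh F", 5.0),
    ],
)

query = """
WITH daily_peak AS (
    SELECT building_id, date(timestamp) AS day, MAX(value) AS peak_kwh
    FROM electricity_consumption_time_series
    WHERE unit = 'kWh F'
    GROUP BY building_id, day
)
SELECT t.building_id,
       CAST(strftime('%H', t.timestamp) AS INTEGER) AS peak_hour,
       COUNT(*) AS n_days
FROM electricity_consumption_time_series AS t
JOIN daily_peak AS p
  ON p.building_id = t.building_id
 AND p.day = date(t.timestamp)
 AND p.peak_kwh = t.value
WHERE t.unit = 'kWh F'
GROUP BY t.building_id, peak_hour
ORDER BY t.building_id, peak_hour
"""
peak_counts = conn.execute(query).fetchall()
```

If two hours tie for the daily maximum, both are counted with this join, so decide whether that is acceptable for your histogram.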

In [8]:
# ******* CODE BELOW *******


## Data cleaning and outlier detection
---

In this section of the homework, you will clean the raw data and also look out for outliers. We want to spot missing values, outliers, and any other data quality issues you can think of. For outliers, you can use either the [IQR method](https://www.geeksforgeeks.org/interquartile-range-to-detect-outliers-in-data/) or any other method of your choice to identify potential outliers. However, note that you will still need to use your domain knowledge to decide whether a value is indeed an outlier.
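For reference, the IQR rule flags values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. The sketch below is a minimal version using only the standard library; the function name `iqr_outliers` and the sample values are made up for illustration.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return (index, value) pairs lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles of the sample
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [(i, v) for i, v in enumerate(values) if v < lower or v > upper]


# Made-up hourly readings with one suspicious spike.
readings = [1.0, 1.2, 0.9, 1.1, 1.0, 1.3, 0.8, 9.0]
flagged = iqr_outliers(readings)
```

A flagged value is only a candidate: a large reading on a hot afternoon may be genuine, which is why domain knowledge is still needed.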

Query the `electricity_consumption_time_series` table to find any negative values. Take note of the building, unit of measurement and timestamp.

In [9]:
# ******* CODE BELOW *******


Figure out a way to find out whether any timestamp for any building has a missing value for the `kWh F` unit of measurement. Take note of the building and timestamp.
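Since all columns are `NOT NULL`, a missing reading shows up as an absent row. One possible check is to walk the expected hourly grid between each building's first and last timestamp and report the hours with no row. The sketch assumes timestamps are stored as `YYYY-MM-DD HH:MM:SS` text and that the unit column is called `unit`; adjust both to your schema.

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE electricity_consumption_time_series ("
    "timestamp TEXT NOT NULL, building_id INTEGER NOT NULL, "
    "unit TEXT NOT NULL, value REAL NOT NULL)"
)
conn.executemany(
    "INSERT INTO electricity_consumption_time_series VALUES (?, ?, ?, ?)",
    [
        ("2020-01-01 00:00:00", 1, "kWh F", 1.0),
        ("2020-01-01 01:00:00", 1, "kWh F", 1.1),
        ("2020-01-01 03:00:00", 1, "kWh F", 1.2),  # 02:00 is absent
        ("2020-01-01 00:00:00", 2, "kWh F", 0.9),
        ("2020-01-01 01:00:00", 2, "kWh F", 0.8),
    ],
)

FMT = "%Y-%m-%d %H:%M:%S"  # ASSUMED storage format of the timestamp column

# Collect the observed kWh F timestamps per building.
observed = {}
rows = conn.execute(
    "SELECT building_id, timestamp FROM electricity_consumption_time_series "
    "WHERE unit = 'kWh F'"
)
for building_id, ts in rows:
    observed.setdefault(building_id, set()).add(datetime.strptime(ts, FMT))

# Walk the hourly grid between first and last observation and record the gaps.
missing = {}
for building_id, stamps in observed.items():
    current, end = min(stamps), max(stamps)
    gaps = []
    while current <= end:
        if current not in stamps:
            gaps.append(current.strftime(FMT))
        current += timedelta(hours=1)
    missing[building_id] = gaps
```

This only finds gaps inside each building's own time span; comparing against a shared start and end date would also reveal buildings whose data starts late or ends early.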

In [10]:
# ******* CODE BELOW *******


Investigate potential outliers in each building's `kWh F` time series. You can choose a qualitative approach of eyeballing the time series to find the obvious ones and/or a quantitative approach using the IQR method or any other outlier detection method of your choice. Take note of the building and timestamp where outliers occur.

In [11]:
# ******* CODE BELOW *******


Can you think of a way to replace the values you are convinced are outliers and the missing values?

In [12]:
# ******* CODE BELOW *******


Store your replacement values by creating a new table in the database called `electricity_consumption_time_series_replacment` that has the following columns: `timestamp`, `building_id`, `value`, `notes`. The `notes` column is a description of why you are replacing each value. Insert your replacement values for outliers and missing values in the new table.
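The sketch below shows one way the replacement table could be created and then combined with the original series at query time, so the raw data stays untouched. The toy values, the note texts and the `unit` column of the original table are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE electricity_consumption_time_series (
        timestamp TEXT NOT NULL, building_id INTEGER NOT NULL,
        unit TEXT NOT NULL, value REAL NOT NULL
    );
    CREATE TABLE electricity_consumption_time_series_replacment (
        timestamp TEXT NOT NULL, building_id INTEGER NOT NULL,
        value REAL NOT NULL, notes TEXT NOT NULL
    );
    """
)
conn.executemany(
    "INSERT INTO electricity_consumption_time_series VALUES (?, ?, ?, ?)",
    [
        ("2020-01-01 00:00:00", 1, "kWh F", 1.0),
        ("2020-01-01 01:00:00", 1, "kWh F", 50.0),  # treated as an outlier here
        ("2020-01-01 03:00:00", 1, "kWh F", 1.2),   # 02:00 is missing
    ],
)
conn.executemany(
    "INSERT INTO electricity_consumption_time_series_replacment VALUES (?, ?, ?, ?)",
    [
        ("2020-01-01 01:00:00", 1, 1.1, "outlier: replaced by interpolation"),
        ("2020-01-01 02:00:00", 1, 1.15, "missing: filled by interpolation"),
    ],
)

query = """
-- Original rows, overridden by a replacement when one exists
SELECT o.timestamp, o.building_id, COALESCE(r.value, o.value)
FROM electricity_consumption_time_series AS o
LEFT JOIN electricity_consumption_time_series_replacment AS r
  ON r.timestamp = o.timestamp AND r.building_id = o.building_id
WHERE o.unit = 'kWh F'
UNION ALL
-- Replacement rows for timestamps that are absent from the original table
SELECT r.timestamp, r.building_id, r.value
FROM electricity_consumption_time_series_replacment AS r
WHERE NOT EXISTS (
    SELECT 1 FROM electricity_consumption_time_series AS o
    WHERE o.timestamp = r.timestamp
      AND o.building_id = r.building_id
      AND o.unit = 'kWh F'
)
ORDER BY 2, 1
"""
cleaned = conn.execute(query).fetchall()
```

Keeping replacements in a separate table with a `notes` column makes every correction traceable and reversible.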

In [13]:
# ******* CODE BELOW *******


## Finding similarities in energy profile across buildings
---

Here, you will calculate correlations between building time series and find clusters of daily net electricity consumption profiles.

Calculate the Pearson correlation coefficient of any two buildings' net electricity consumption (use values where the unit of measurement is `kWh F`).
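The Pearson coefficient is the covariance of the two series divided by the product of their standard deviations, computed over timestamps both buildings share. The sketch below spells this out in plain Python; the function name and the toy series are made up, and in practice a library routine such as `numpy.corrcoef` does the same job.

```python
import math


def pearson(series_a, series_b):
    """Pearson r between two {timestamp: value} dicts, using shared timestamps only."""
    common = sorted(set(series_a) & set(series_b))
    x = [series_a[t] for t in common]
    y = [series_b[t] for t in common]
    n = len(common)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / math.sqrt(var_x * var_y)


# Toy kWh F series keyed by hour; building B lacks hour 4, so it is ignored.
building_a = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0, 4: 9.0}
building_b = {0: 2.0, 1: 4.0, 2: 6.0, 3: 8.0}
r = pearson(building_a, building_b)
```

Aligning on shared timestamps matters because buildings may have different gaps in their data.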

In [14]:
# ******* CODE BELOW *******


Use KMeans to find clusters of daily net electricity consumption profiles. Build one model that uses profiles from all buildings in both neighborhoods. You may use the [elbow method](https://www.geeksforgeeks.org/elbow-method-for-optimal-value-of-k-in-kmeans/) to determine the appropriate number of clusters, but other informed approaches are encouraged. Using your domain knowledge, you might also choose to build one model for each season and/or weekday/weekend to reduce the noise in the model. It is up to you what you decide, but remember "garbage in, garbage out" :)!
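To illustrate the mechanics, the sketch below runs scikit-learn's `KMeans` on synthetic 24-hour profiles (one row per building-day, one column per hour) and records the inertia for several values of k, which is the quantity plotted in the elbow method. The synthetic morning-peak and evening-peak shapes are invented for the example and say nothing about the real data.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
hours = np.arange(24)

# SYNTHETIC daily profiles: 50 evening-peak days and 50 morning-peak days.
evening = np.exp(-0.5 * ((hours - 19) / 2.0) ** 2)
morning = np.exp(-0.5 * ((hours - 7) / 2.0) ** 2)
profiles = np.vstack(
    [
        evening + rng.normal(0.0, 0.05, size=(50, 24)),
        morning + rng.normal(0.0, 0.05, size=(50, 24)),
    ]
)

# Elbow method: within-cluster sum of squares (inertia) for k = 1..5.
inertias = []
for k in range(1, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(profiles)
    inertias.append(model.inertia_)

# Fit the chosen model and assign each daily profile to a cluster.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(profiles)
```

With real data, consider whether to normalize each daily profile first: clustering raw kWh groups days by magnitude, while clustering normalized profiles groups them by shape.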

In [15]:
# ******* CODE BELOW *******


What percentage of each building's profiles falls under each cluster? Is there a clear distinction between smart neighborhood and baseline neighborhood clusters?
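Once each daily profile has a cluster label, a row-normalized cross-tabulation gives the percentage breakdown per building. The sketch below uses `pandas.crosstab` on a made-up assignment table; the column names are hypothetical.

```python
import pandas as pd

# MADE-UP cluster assignments: one row per (building, day) profile.
assignments = pd.DataFrame(
    {
        "building_id": [1, 1, 1, 1, 2, 2],
        "cluster": [0, 0, 0, 1, 1, 1],
    }
)

# normalize="index" makes each building's row sum to 1; scale to percent.
share = (
    pd.crosstab(assignments["building_id"], assignments["cluster"], normalize="index")
    * 100
)
```

Here `share.loc[1, 0]` is the percentage of building 1's profiles that fall in cluster 0.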

In [16]:
# ******* CODE BELOW *******


What are the unique features of each cluster? Try to explain the physical meaning of the generated clusters.

In [17]:
# ******* CODE BELOW *******


## Build hour-ahead net electricity consumption prediction model
---

In this last section of the homework, you will use any supervised machine learning algorithm of your choice and any building of your choice to build a model that predicts the hour-ahead net electricity consumption. Use the first 12 months of data as your training dataset and make predictions for the first 7 days of each month of the following 12 months.

You will need some feature engineering and are allowed to use publicly available datasets like weather to improve your model. Make sure you cross-validate your model to ensure it is generalizable and split the training data into train-validation-test sets.
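As an illustration of feature engineering for an hour-ahead model, the sketch below builds calendar and lag features from a synthetic hourly series with pandas and splits them chronologically. The feature choice, the 70/15/15 split and the synthetic data are assumptions for the example, not requirements of the homework.

```python
import numpy as np
import pandas as pd

# SYNTHETIC hourly series: two weeks with a daily sinusoidal pattern.
n_hours = 24 * 14
index = pd.Timestamp("2020-01-01") + pd.to_timedelta(np.arange(n_hours), unit="h")
kwh = pd.Series(1.0 + 0.5 * np.sin(2 * np.pi * np.asarray(index.hour) / 24), index=index)

features = pd.DataFrame(
    {
        "hour": index.hour,
        "day_of_week": index.dayofweek,  # 0 = Monday
        "lag_1": kwh.shift(1).to_numpy(),    # consumption one hour earlier
        "lag_24": kwh.shift(24).to_numpy(),  # same hour on the previous day
        "target": kwh.to_numpy(),
    },
    index=index,
).dropna()  # the first 24 rows have no lag_24 value

# Chronological split (no shuffling), so later data never leaks into training.
n = len(features)
train_end, val_end = int(0.70 * n), int(0.85 * n)
train = features.iloc[:train_end]
validation = features.iloc[train_end:val_end]
test = features.iloc[val_end:]
```

For time series, a randomly shuffled split lets the model see the future during training, which is why the rows are kept in time order here.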

Train and validate the model:

In [18]:
# ******* CODE BELOW *******


Test the model and plot the profiles of predicted and actual values:

In [19]:
# ******* CODE BELOW *******


Report the model's root mean square error (RMSE) for the train and test sets:
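For reference, RMSE is the square root of the mean squared difference between actual and predicted values. A minimal sketch with made-up numbers follows; the function name `rmse` is only illustrative, and library routines can be used instead.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values in kWh; only the last prediction is off, by 2 kWh.
error = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

RMSE is in the same unit as the target (kWh here), so a test RMSE much larger than the train RMSE is a sign of overfitting.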

In [20]:
# ******* CODE BELOW *******
