# Linear Regression

***Summary***
- [Energy Data](#energy-data) <br>
- [Radiation Data](#radiation-data) <br>
- [Fit Simple Linear Regression Model](#fit-simple-linear-regression-model) <br>
- [Visualize Results](#visualize-results) <br>

In this Jupyter Notebook, you get the chance to apply the knowledge of linear regression that you acquired in lectures 1 to 4.
We will regress the power production of the solar plants in St. Gallen on the global (solar) radiation measured near the solar plants.<br><br>
The data on the solar plants is provided by Open Data and can be found [here](https://daten.stadt.sg.ch/explore/dataset/stromproduktion-der-solaranlagen-der-stgaller-stadtwerke/table/?disjunctive.name&disjunctive.smart_me_name&disjunctive.modultyp&disjunctive.leistung_modul_in_wp).<br>
The weather data (global radiation in St. Gallen) was provided by [MeteoSchweiz](https://www.meteoschweiz.admin.ch/home.html?tab=overview) and is not publicly available.<br><br>
The intention is to find out if there is a relationship between the global radiation (measured in $W/m^2$) and the energy export of the solar plants (measured in $Wh$).
The granularity of the radiation data is 10 minutes and that of the energy export data is 15 minutes.

In [None]:
# Import libraries
%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

from google.colab import files
import io

<a id='energy-data'></a>
## I. Energy Data
We load the power plant data into a pandas dataframe.
Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, which is often used in machine learning to preprocess raw data.<br><br>
First, the timestamp of the measurements is converted to datetime and the rows are sorted by this timestamp.
`head(2)` is used to display the first two rows of the pandas dataframe.

In [None]:
# Upload stromproduktion-der-solaranlagen-der-stgaller-stadtwerke.csv as soon as `Choose Files` button appears
uploaded_data = files.upload()
df_e = pd.read_csv(io.BytesIO(uploaded_data['stromproduktion-der-solaranlagen-der-stgaller-stadtwerke.csv']), sep=';')

In [None]:
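# utc=True parses the UTC offsets and converts to UTC in one step (also robust to mixed offsets around DST changes)
df_e['DateTime (Local Time)'] = pd.to_datetime(df_e['DateTime (Local Time)'], utc=True)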
df_e.sort_values(by=['DateTime (Local Time)'], inplace=True)
df_e.head(2)
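
The conversion to UTC can be tried out on a toy example first. The sketch below assumes that the timestamps carry an explicit UTC offset (the use of `tz_convert` above suggests this); the two example values are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical local-time stamps with explicit offsets (CET = +01:00, CEST = +02:00)
raw = pd.Series(['2021-03-27T12:00:00+01:00', '2021-03-28T12:00:00+02:00'])

# utc=True converts every value to UTC, even when the offsets differ between rows
ts_utc = pd.to_datetime(raw, utc=True)
print(ts_utc)
```

Both rows show noon local time, but they map to different UTC hours because of the daylight saving change in between.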

As you can see below, there are 11 different solar power plants and 12 different electricity meters.<br>
In this tutorial, only the plant/meter combination with the most data is considered.
We therefore select the subset of data from the power plant `Kirche Halden` with meter ID `f1bd39a3-7324-8f4b-bd05-00ba6719ca6f` (df_e_subset_1).
Moreover, the columns that are not needed for the regression are dropped (df_e_subset_2).

In [None]:
sorted_locations = df_e.groupby(['Name','smart-me ID']).size().sort_values(ascending=False)
sorted_locations

In [None]:
max_count_name, max_count_id = sorted_locations.first_valid_index()
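
The counting and selection step can be illustrated on a small, made-up table. The plant names and meter IDs below are hypothetical; `idxmax()` is shown as an alternative way to obtain the label of the largest group.

```python
import pandas as pd

# Hypothetical miniature of the energy table: two plants, three meters
toy = pd.DataFrame({
    'Name': ['A', 'A', 'A', 'B', 'B', 'A'],
    'smart-me ID': ['m1', 'm1', 'm2', 'm3', 'm3', 'm1'],
})

# Number of rows per (plant, meter) combination, largest first
counts = toy.groupby(['Name', 'smart-me ID']).size().sort_values(ascending=False)

# The index is a MultiIndex, so the label of the largest group is a (Name, ID) tuple
top_name, top_id = counts.idxmax()
print(counts)
print(top_name, top_id)
```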

In [None]:
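# .copy() makes the subset independent of df_e, so the in-place calls below do not trigger a SettingWithCopyWarning
df_e_subset_1 = df_e.loc[(df_e['Name']==max_count_name) & (df_e['smart-me ID']==max_count_id)].copy()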
df_e_subset_1.sort_values(by=['DateTime (Local Time)'], inplace=True)
df_e_subset_1.reset_index(drop=True, inplace=True)
df_e_subset_1.head(2)

In [None]:
tmp = df_e_subset_1.loc[:,['DateTime (Local Time)','Additional Energy Export']].copy()
df_e_subset_2 = tmp.rename(columns={'DateTime (Local Time)':'date',
                                    'Additional Energy Export': 'energy'})
df_e_subset_2.set_index('date', drop=True, inplace=True)
df_e_subset_2.head()

We visualize the last week of recording of the resulting time series df_e_subset_2.
As you can see, there is a clear difference between day and night, which was to be expected.

In [None]:
mask = (df_e_subset_2.index > df_e_subset_2.index[-1]-pd.DateOffset(days=7))
df_e_subset_2.loc[mask].plot(figsize=(15,5), ylabel='Energy Export [Wh]')
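
The "last week" selection is a boolean mask on the datetime index. A minimal sketch on a synthetic 15-minute series (the dates and values are made up):

```python
import pandas as pd

# Synthetic series: 30 days of 15-minute samples (96 per day)
idx = pd.date_range('2021-06-01', periods=30 * 96, freq='15min', tz='UTC')
series = pd.Series(range(len(idx)), index=idx)

# Keep everything strictly newer than "last timestamp minus 7 days"
week_mask = series.index > series.index[-1] - pd.DateOffset(days=7)
last_week = series.loc[week_mask]
print(len(last_week))
```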

<a id='radiation-data'></a>
## II. Radiation Data
Next, we load and prepare the dataset containing the global radiation measured at a weather station in St. Gallen.
This data set was kindly provided by MeteoSwiss for this purpose only.<br><br>
Again, we load the dataset into a pandas dataframe, convert the timestamp to datetime and sort the rows.
The result is a time series called df_r_subset_1 which contains the radiation value ($W/m^2$) in St. Gallen, averaged over 10 minutes.

In [None]:
# Upload Global_Radiation_STG.csv as soon as `Choose Files` button appears
uploaded_data = files.upload()
df_r = pd.read_csv(io.BytesIO(uploaded_data['Global_Radiation_STG.csv']), sep=';')

In [None]:
df_r['time'] = pd.to_datetime(df_r['time'], format='%Y%m%d%H%M').dt.tz_localize('UTC')
df_r.sort_values(by=['time'], inplace=True)
df_r.head()

In [None]:
df_r_subset_1 = df_r.loc[:,['time', 'gre000z0']].copy()
df_r_subset_1.rename({'time':'date',
                      'gre000z0':'radiation'}, inplace=True, axis=1)
df_r_subset_1.set_index('date', drop=True, inplace=True)
df_r_subset_1.head(2)

If we visualize the time series df_r_subset_1 for the last week of the solar energy time series, we can see that the two time series look quite similar (up to a scaling factor).<br><br>
Because these two time series (radiation and energy) were captured with different time granularity (10min and 15min intervals), with different starting and ending times, we must first match the timestamps.
This is done by selecting the radiation sample whose timestamp is closest to the energy timestamp (matching the radiation dataframe to the energy dataframe), using the `reindex()` function.
We then merge these two time series into one data frame df_merge.

In [None]:
mask = (df_r_subset_1.index > df_e_subset_2.index[-1]-pd.DateOffset(days=7)) & (df_r_subset_1.index < df_e_subset_2.index[-1])
df_r_subset_1.loc[mask].plot(figsize=(15,5), ylabel='Radiation [W/m2]')

In [None]:
# Match each energy timestamp to the nearest radiation sample, at most 8 minutes away
df_r_subset_2 = df_r_subset_1.reindex(df_e_subset_2.index, method='nearest', tolerance=pd.Timedelta(minutes=8))
df_r_subset_2.head()
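
To see what nearest-neighbour reindexing with a tolerance does, consider the following sketch on made-up data. The values and timestamps are hypothetical; only the 10-minute and 15-minute spacing mirrors the two datasets.

```python
import pandas as pd

# Hypothetical radiation samples every 10 minutes
rad_index = pd.date_range('2021-06-01 12:00', periods=4, freq='10min', tz='UTC')
radiation = pd.Series([500.0, 510.0, 520.0, 530.0], index=rad_index)

# Hypothetical energy timestamps every 15 minutes; the last one lies beyond the radiation data
energy_index = pd.date_range('2021-06-01 12:00', periods=4, freq='15min', tz='UTC')

# Pick the nearest radiation sample for each energy timestamp, but only within 8 minutes
matched = radiation.reindex(energy_index, method='nearest', tolerance=pd.Timedelta(minutes=8))
print(matched)
```

The timestamp 12:45 has no radiation sample within 8 minutes, so it becomes `NaN` instead of silently reusing the 12:30 value. This is why the merged dataframe can contain missing values.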

In [None]:
df_merge = pd.DataFrame({'energy':df_e_subset_2.energy,
                         'radiation':df_r_subset_2.radiation}, index=df_r_subset_2.index)
df_merge.head()

Plotting these two time series side by side reveals that there is a shift error between them.
Do you know where this error comes from?
Can you fix this problem?


In [None]:
df_merge.iloc[:500].plot(figsize=(25,5), subplots=True)
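
One possible way to investigate such a shift is to correlate one series with time-shifted copies of the other and look for the lag with the highest correlation. The sketch below uses synthetic signals with a known shift of 4 samples; it is not the notebook's intended solution, and all names are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins: four days of 15-minute samples with a bell-shaped daytime curve
steps = np.arange(4 * 96)
toy = pd.DataFrame({'radiation': np.clip(-np.cos(2 * np.pi * (steps % 96) / 96), 0, None)})

# "Energy" is the same signal, scaled and delayed by 4 samples (1 hour)
toy['energy'] = 0.3 * toy['radiation'].shift(4)

# Correlate energy with radiation shifted by each candidate lag
corr = {lag: toy['energy'].corr(toy['radiation'].shift(lag)) for lag in range(-8, 9)}
best_lag = max(corr, key=corr.get)
print(best_lag, best_lag * pd.Timedelta(minutes=15))
```

A lag found this way only quantifies the offset. The cause (for example a time zone or an interval-labelling convention) still has to be checked against the data documentation.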

<a id='fit-simple-linear-regression-model'></a>
## III. Fit Simple Linear Regression Model
Next, we split the dataframe into a training dataset and a testing dataset with sklearn's [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function.<br><br>
We will train a simple linear regression model on the training dataset using [statsmodels](https://www.statsmodels.org/stable/regression.html) and display some aspects of the linear model fit using [statsmodels.graphics](https://www.statsmodels.org/stable/examples/notebooks/generated/regression_plots.html).

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test = train_test_split(df_merge, test_size=0.33, random_state=42)

In [None]:
X_train.plot(x='radiation', y='energy', kind='scatter')

In [None]:
import statsmodels.formula.api as smf

mod = smf.ols(formula='energy ~ radiation', data=X_train)
res = mod.fit()
print(res.summary())
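
For a model with a single predictor, the least-squares coefficients have a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. The sketch below checks this on synthetic data (made-up values, not the solar dataset) against `np.polyfit`.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1000, size=200)                 # hypothetical radiation values
y = 5.0 + 0.3 * x + rng.normal(0, 10, size=200)    # hypothetical energy values

# Closed-form least-squares estimates for y = b0 + b1 * x
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Cross-check: a degree-1 polynomial fit returns [slope, intercept]
slope, intercept = np.polyfit(x, y, 1)
print(b0, b1)
```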

In [None]:
# The formula interface adds the intercept itself, so no constant column is needed for prediction
ynewpred = res.predict(X_test)

# Root-mean-square error on the test set
rms_test = np.sqrt(np.mean(np.square(ynewpred - X_test.energy)))
rms_test
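
The test error can be reported as a mean squared error (MSE) or as its square root (RMSE). The RMSE has the same unit as the target (here $Wh$), which makes it easier to interpret. A small sketch with made-up numbers:

```python
import numpy as np

# Hypothetical true and predicted values
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 320.0])

mse = np.mean(np.square(y_pred - y_true))   # mean of the squared errors 100, 100 and 400
rmse = np.sqrt(mse)                         # back in the unit of y
print(mse, rmse)
```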

<a id='visualize-results'></a>
## IV. Visualize Results
Next, we visualize the resulting least-squares fit together with the training and the test dataset. A Component-Component+Residual plot (more details are provided [here](https://www.statsmodels.org/stable/examples/notebooks/generated/regression_plots.html)) is a useful complement to these plots.

In [None]:
fig, ax = plt.subplots()
ax.plot(X_train.radiation, X_train.energy, '.', label='Training Data')
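x_line = np.linspace(0, 1200, 100)
# Fitted line: intercept + slope * radiation, evaluated on the same x values that are plotted
ax.plot(x_line, res.params['Intercept'] + res.params['radiation']*x_line, label='Learned Model')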
ax.legend(loc="best")
ax.set_title('Training Dataset')
ax.set_xlabel(r'Radiation [$W/m^2$]')
ax.set_ylabel(r'Power [$Wh / \Delta t$]')

In [None]:
fig, ax = plt.subplots()
ax.plot(X_test.radiation, X_test.energy, '.', label='Test Data')
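x_line = np.linspace(0, 1200, 100)
# Fitted line: intercept + slope * radiation, evaluated on the same x values that are plotted
ax.plot(x_line, res.params['Intercept'] + res.params['radiation']*x_line, label='Learned Model')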
ax.legend(loc="best")
ax.set_title('Test Dataset')
ax.set_xlabel(r'Radiation [$W/m^2$]')
ax.set_ylabel(r'Power [$Wh / \Delta t$]')