## Wind Power Generation Prediction
Today we will try to predict wind power generation multiple hours in advance. We will make a plot of prediction score versus the number of hours in advance that we are predicting.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib notebook

Let's import the data.

In [3]:
# Load the hourly generation data, keeping only columns whose names start with 'utc' or 'DE'.
data_gen = pd.read_csv('https://raw.githubusercontent.com/mkrogius/ai4all_wind_generation_public/master/time_series_60min_singleindex_filtered.csv',
                   usecols=lambda s: s.startswith('utc') or s.startswith('DE'),
                   parse_dates=[0], index_col=0)
data_gen_2016 = data_gen.loc['20160101':'20170101', 'DE_wind_generation_actual']
# Load the hourly weather data and select the same date range.
data_wind = pd.read_csv('https://raw.githubusercontent.com/mkrogius/ai4all_wind_generation_public/master/weather_data_filtered.csv',
                       parse_dates=[0], index_col=0)
data_wind_2016 = data_wind.loc['20160101':'20170101']
# Drop rows whose timestamp repeats an earlier row.
data_wind_2016 = data_wind_2016[~data_wind_2016.index.duplicated()]
# Join weather and generation, keeping only timestamps present in both.
data_2016 = pd.concat([data_wind_2016, data_gen_2016], axis=1, join='inner')

Now let's split the data into train_x/train_y and test_x/test_y, using just the first four columns as we did on the first day.

In [4]:
weather_columns = ['DE_windspeed_10m','DE_temperature', 'DE_radiation_direct_horizontal', 'DE_radiation_diffuse_horizontal']
train_x = data_2016.loc['20160101':'20160630', weather_columns]
test_x = data_2016.loc['20160630':'20170101', weather_columns]
train_y = data_2016.loc['20160101':'20160630', 'DE_wind_generation_actual']
test_y = data_2016.loc['20160630':'20170101', 'DE_wind_generation_actual']

Now comes the part that is different. We want to predict the energy generation one hour in advance. We will do this by creating new arrays for train_x/test_x that have all the data except for the last hour, and new arrays for train_y/test_y that have all the data except for the first hour. This way the x data for, say, 10am will line up with the y data for 11am.

We will use the pandas .iloc indexer. [Method reference](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-integer)
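To see how the shift lines things up, here is a minimal sketch on a tiny made-up table. The values, the `hours_ahead` variable, and the `x_shifted`/`y_shifted` names are assumptions for illustration only, not the notebook's real data or required names.

```python
import pandas as pd

# Tiny synthetic stand-in for the real data (values are made up).
index = pd.date_range("2016-01-01", periods=6, freq="h")
x = pd.DataFrame({"DE_windspeed_10m": [3.0, 4.0, 5.0, 6.0, 7.0, 8.0]}, index=index)
y = pd.Series([30.0, 40.0, 50.0, 60.0, 70.0, 80.0], index=index,
              name="DE_wind_generation_actual")

hours_ahead = 1
x_shifted = x.iloc[:-hours_ahead]  # all inputs except the last hour
y_shifted = y.iloc[hours_ahead:]   # all targets except the first hour

# Position i in x_shifted now pairs with the target one hour later.
for x_time, y_time in zip(x_shifted.index, y_shifted.index):
    print(x_time, "->", y_time)
```

Note that the two pieces still carry their original timestamps, so they line up by position rather than by index. That is fine for sklearn, which ignores the pandas index.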

Now let's normalize the data using StandardScaler like last time.
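As a reminder of how StandardScaler is typically used, here is a short sketch on random stand-in arrays. The array names and sizes are assumptions, and fitting the scaler on the training data only is one common convention, not necessarily what we did last time.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Random stand-in arrays with four feature columns (not the real weather data).
rng = np.random.default_rng(0)
train_x_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 4))
test_x_demo = rng.normal(loc=5.0, scale=2.0, size=(50, 4))

scaler = StandardScaler()
train_x_scaled = scaler.fit_transform(train_x_demo)  # learn mean and std from the training data
test_x_scaled = scaler.transform(test_x_demo)        # reuse those training statistics on the test data

print(train_x_scaled.mean(axis=0))  # each column is close to 0
print(train_x_scaled.std(axis=0))   # each column is close to 1
```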

Now let's fit a linear model using sklearn's LinearRegression, just like last time.

The difference from last time is that we are now trying to predict the future energy generation. Does this model end up with a higher or lower score than yesterday's models?

Print out the score of the model (it should be around 0.74) and its coefficients. How have the coefficients changed from our model last time?
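If you need a refresher on the sklearn calls, the sketch below fits a LinearRegression on synthetic data and prints its score and coefficients. The data and the `true_coef` values are made up, so the numbers it prints say nothing about the wind data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: four features and a target that is a known linear mix plus noise.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(200, 4))
true_coef = np.array([2.0, -1.0, 0.5, 0.0])  # made-up coefficients
y_demo = x_demo @ true_coef + rng.normal(scale=0.1, size=200)

model = LinearRegression()
model.fit(x_demo[:150], y_demo[:150])          # train on the first 150 rows
score = model.score(x_demo[150:], y_demo[150:])  # R^2 on the held-out rows

print("score:", score)
print("coefficients:", model.coef_)
```

The `score` method of LinearRegression returns the coefficient of determination (R^2), and `coef_` holds one coefficient per input column, in column order.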

Make a plot of the predictions of this model.

Now we want to try predicting energy generation more than one hour in the future. Make a function that takes as an argument the number of hours in the future we want to predict.

This function should:
1) Create the shifted train_x/train_y/test_x/test_y arrays.
2) Scale the arrays.
3) Fit a linear model to the arrays.
4) Return the score of the model.
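One possible shape for such a function is sketched below. The name `shifted_linear_score`, its argument list, and the synthetic demo data are all assumptions. In the notebook you would pass in the real train_x/train_y/test_x/test_y instead.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler


def shifted_linear_score(train_x, train_y, test_x, test_y, hours_ahead):
    """Return the test score of a linear model predicting hours_ahead hours ahead."""
    # 1) Shift: inputs drop their last hours, targets drop their first hours.
    n_train = len(train_x) - hours_ahead
    n_test = len(test_x) - hours_ahead
    tr_x, tr_y = train_x.iloc[:n_train], train_y.iloc[hours_ahead:]
    te_x, te_y = test_x.iloc[:n_test], test_y.iloc[hours_ahead:]
    # 2) Scale, using statistics learned from the training inputs only.
    scaler = StandardScaler()
    tr_x_scaled = scaler.fit_transform(tr_x)
    te_x_scaled = scaler.transform(te_x)
    # 3) Fit a linear model.
    model = LinearRegression().fit(tr_x_scaled, tr_y)
    # 4) Return the R^2 score on the shifted test data.
    return float(model.score(te_x_scaled, te_y))


# Synthetic stand-in: a smoothed random "wind" signal and a target that follows it.
rng = np.random.default_rng(0)
index = pd.date_range("2016-01-01", periods=2000, freq="h")
wind = pd.Series(rng.normal(size=2000), index=index).rolling(12, min_periods=1).mean()
demo_x = pd.DataFrame({"wind": wind})
demo_y = 3.0 * wind + rng.normal(scale=0.05, size=2000)

scores = {h: shifted_linear_score(demo_x.iloc[:1000], demo_y.iloc[:1000],
                                  demo_x.iloc[1000:], demo_y.iloc[1000:], h)
          for h in (1, 6, 24)}
print(scores)
```

Computing the slice lengths explicitly (rather than writing `.iloc[:-hours_ahead]`) is a small design choice: it keeps the function correct even for `hours_ahead=0`, where `[:-0]` would return an empty selection.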

Use the function you created to make a plot of prediction score vs how many hours in the future we are predicting. We should expect that the model score decreases as we try to predict the energy generation further into the future.

Make a plot of the predictions from the model trained to predict 24 hours in advance. Is this model working well?

How can we get better predictions further in advance? First, let's try using a fancier model. Create another function for predicting scores, but this time use your favorite model from yesterday.

Make another plot of score vs hours in advance with two data series, one for this model and one for the linear model.

Which model is better?

Let's try to make even better predictions by doing some feature engineering. Currently we are using only the data from the current hour to predict future hours. Now we will use what is called a sliding window, where the data from the current hour and a few past hours are used together for prediction.

First let's set up the train_x/train_y/test_x/test_y arrays for this data. We will need to select the subset of rows needed similar to before, but now we will be selecting two different subsets and concatenating them.

We want train_x_a to hold the data from one hour before train_x_b, which in turn holds the data from one hour before train_y, and similarly for the test dataset.

Now we will concatenate train_x_a and train_x_b side by side using pd.concat with axis=1. But first we will want to delete the index of both train_x_a and train_x_b using .reset_index(drop=True) so that pandas does not try to put rows with the same timestamps together.
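Here is a minimal sketch of that two-hour window on a tiny made-up table. The values, the `x_a`/`x_b`/`y_win` names, and the use of `add_suffix` to keep the column names distinct are assumptions for illustration.

```python
import pandas as pd

# Tiny synthetic stand-in (values are made up).
index = pd.date_range("2016-01-01", periods=6, freq="h")
x = pd.DataFrame({"wind": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}, index=index)
y = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0, 60.0], index=index)

# x_a is one hour before x_b, which is one hour before the target.
x_a = x.iloc[:-2].reset_index(drop=True)
x_b = x.iloc[1:-1].reset_index(drop=True)
y_win = y.iloc[2:].reset_index(drop=True)

# With the timestamps dropped, concat pairs the rows by position.
x_window = pd.concat([x_a.add_suffix("_a"), x_b.add_suffix("_b")], axis=1)
print(x_window.assign(target=y_win))
```

Each row now holds two consecutive hours of inputs next to the target from the following hour.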



Now let's train a model using the sliding window training set, and print the score of the model on the sliding window test set. Does this do better than the predictions without a sliding window?

Final Challenge: Write a function that takes two arguments, first the number of hours in the future to predict, and second the number of previous hours of data to concatenate in the sliding window. It should return the score of your favorite model trained on this data.
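If you get stuck, the sketch below shows one way the pieces could fit together. The helper names `make_window` and `window_score`, the extra data arguments, and the synthetic demo data are assumptions. LinearRegression is only a placeholder default, so swap in your favorite model through the `model` argument.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler


def make_window(x, y, hours_ahead, n_previous):
    """Stack the current hour and n_previous earlier hours side by side."""
    n_rows = len(x) - n_previous - hours_ahead
    parts = []
    for lag in range(n_previous + 1):
        # lag == n_previous is the current hour; smaller lags are older hours.
        part = x.iloc[lag:lag + n_rows].reset_index(drop=True)
        parts.append(part.add_suffix(f"_t{lag - n_previous}"))
    x_win = pd.concat(parts, axis=1)
    # The target sits hours_ahead hours after the current hour.
    y_win = y.iloc[n_previous + hours_ahead:].reset_index(drop=True)
    return x_win, y_win


def window_score(train_x, train_y, test_x, test_y, hours_ahead, n_previous, model=None):
    """Return the test score of a model trained on sliding-window features."""
    model = LinearRegression() if model is None else model  # placeholder default
    tr_x, tr_y = make_window(train_x, train_y, hours_ahead, n_previous)
    te_x, te_y = make_window(test_x, test_y, hours_ahead, n_previous)
    scaler = StandardScaler()
    model.fit(scaler.fit_transform(tr_x), tr_y)
    return float(model.score(scaler.transform(te_x), te_y))


# Synthetic stand-in: the target at hour t is the sum of the inputs at t-1 and t-2,
# so a window with one previous hour should predict it almost perfectly.
rng = np.random.default_rng(0)
demo_x = pd.DataFrame({"wind": rng.normal(size=600)})
demo_y = (demo_x["wind"].shift(1) + demo_x["wind"].shift(2)).fillna(0.0)
args = (demo_x.iloc[:400], demo_y.iloc[:400], demo_x.iloc[400:], demo_y.iloc[400:])

score_no_window = window_score(*args, hours_ahead=1, n_previous=0)
score_window = window_score(*args, hours_ahead=1, n_previous=1)
print(score_no_window, score_window)
```

The demo data is built so that the window clearly helps. On the real wind data the gain may be smaller, which is exactly what this challenge asks you to find out.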

Make a plot of the score of the model versus the number of hours in advance that we are predicting.

Try to change the size of the sliding window, as well as the type of model you are using in order to improve your results!