## Wind Power Generation Prediction
If we know weather conditions, can we predict how much energy wind farms will generate?

In [6]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib notebook

Let's import the data.

In [7]:
data_gen = pd.read_csv('time_series_60min_singleindex_filtered.csv',
                   usecols=lambda s: s.startswith('utc') or s.startswith('DE'),
                   parse_dates=[0], index_col=0)

Now let's look at the first few rows of the data. Use [DataFrame.head](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html).

Print out the data type of each column using .dtypes (note the plural: a DataFrame has .dtypes, while a single column has .dtype).

We can also print out statistics for each column of the data by using .describe()
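
As a rough illustration of these three inspection calls, here is a minimal sketch on a small synthetic frame. The column name `DE_wind_generation` and the values are made up for the example and are not taken from the real CSV.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for data_gen; the column name and values are hypothetical.
idx = pd.date_range("2016-01-01", periods=48, freq="h")
rng = np.random.default_rng(0)
demo_gen = pd.DataFrame(
    {"DE_wind_generation": rng.uniform(0, 30000, size=48)}, index=idx
)

first_rows = demo_gen.head()     # first five rows by default
column_types = demo_gen.dtypes   # one dtype per column
summary = demo_gen.describe()    # count, mean, std, min, quartiles, max

print(first_rows)
print(column_types)
print(summary)
```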

For this project we are going to predict the total wind generation using data from 2016. Take the data for this column from 2016, and store it into a variable. You will need to use .loc['startdate':'enddate'] for this.
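
One possible way to do this kind of date slicing, sketched on synthetic hourly data (the column name `DE_wind_generation` is an assumption; use whichever column holds total wind generation in your file):

```python
import numpy as np
import pandas as pd

# Synthetic hourly data that spills over into 2015 and 2017.
idx = pd.date_range("2015-12-31", "2017-01-01 23:00", freq="h")
demo_gen = pd.DataFrame(
    {"DE_wind_generation": np.arange(len(idx), dtype=float)}, index=idx
)

# String slicing on a sorted DatetimeIndex includes both endpoints,
# and the end label '2016-12-31' covers that entire day.
wind_2016 = demo_gen.loc["2016-01-01":"2016-12-31", "DE_wind_generation"]

print(wind_2016.index.min(), wind_2016.index.max())
```

Selecting a single column label in `.loc` returns a Series, which is convenient for plotting one quantity over time.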

Now plot wind generation over time.

This is the data that we will try to predict. Now let's load in the weather data that we will use to predict it.

In [8]:
data_wind = pd.read_csv('weather_data_filtered.csv',
                       parse_dates=[0], index_col=0)

Examine the first few rows of this data, print out the column data types, and print out summary statistics for each column.

Now let's take the weather data for just 2016 and store it in a variable.

The dataset has wind, temperature, and sunshine measurements for all of Germany in the first four columns, and measurements of the same quantities by region in the remaining columns. Let's start with whatever analysis we can do using the first four columns to predict wind power generation. First, let's plot each of these four columns. Try using .plot(y='column_name'); the API documentation can be found [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html).

Now let's make a plot with DE_windspeed_10m on the x-axis and wind generation on the y-axis. Unfortunately, we can't simply plot these datasets against each other, since each dataset is missing different rows and some rows are duplicated.

First, let's figure out how to get a list of the hours for which we have data in each dataset. Try .index.
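
A small sketch of what `.index` gives you and how two indexes can be compared, using tiny synthetic frames (the frame and column names are hypothetical):

```python
import pandas as pd

hours = pd.date_range("2016-01-01", periods=6, freq="h")
demo_gen = pd.DataFrame({"generation": range(6)}, index=hours)
# Drop the third hour to mimic a gap in the weather data.
demo_weather = pd.DataFrame({"windspeed": range(5)}, index=hours.delete(2))

gen_hours = demo_gen.index          # DatetimeIndex of hours present
weather_hours = demo_weather.index

# Hours that exist in the generation data but not in the weather data.
missing_in_weather = gen_hours.difference(weather_hours)
print(missing_in_weather)
```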

Next, let's remove duplicated rows. Hint: .index.duplicated() returns a boolean mask that marks repeated index labels. This one might be tricky and require some searching to figure out.
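
One common approach, shown here on a four-row synthetic frame, is to invert the mask and use it to keep only the first occurrence of each timestamp. Keeping the first occurrence is a choice made for this sketch, not something the data requires.

```python
import pandas as pd

idx = pd.to_datetime([
    "2016-01-01 00:00",
    "2016-01-01 01:00",
    "2016-01-01 01:00",   # repeated hour
    "2016-01-01 02:00",
])
demo = pd.DataFrame({"windspeed": [3.0, 4.0, 4.0, 5.0]}, index=idx)

# True for every repeat after the first occurrence of a label.
dup_mask = demo.index.duplicated(keep="first")

# Boolean indexing with the inverted mask drops the repeats.
deduped = demo[~dup_mask]
print(deduped)
```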

Now let's merge the two dataframes with an inner join. An 'inner' join matches up rows from the same hour, and when one of the dataframes has no data for an hour, that hour is dropped entirely.

You can find the syntax [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html). Scroll down until you get to the inner join part.
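
To make the behaviour of an inner join concrete, here is a minimal sketch with two synthetic frames that overlap on only two hours. It uses `DataFrame.join` on the index; `pd.merge(..., left_index=True, right_index=True, how='inner')` would be an equivalent alternative.

```python
import pandas as pd

gen = pd.DataFrame(
    {"generation": [10.0, 20.0, 30.0]},
    index=pd.to_datetime(
        ["2016-01-01 00:00", "2016-01-01 01:00", "2016-01-01 02:00"]
    ),
)
weather = pd.DataFrame(
    {"windspeed": [4.0, 6.0, 8.0]},
    index=pd.to_datetime(
        ["2016-01-01 01:00", "2016-01-01 02:00", "2016-01-01 03:00"]
    ),
)

# Only hours present in BOTH frames survive an inner join.
merged = gen.join(weather, how="inner")
print(merged)
```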

Now let's make plots with wind generation on the y-axis and windspeed/temperature/radiation on the x-axis.

There are other columns in the data, but for now let's just try to predict wind generation from the country-wide columns. We will split our dataset into a training dataset for the first half of the year and a testing dataset for the second half of the year.

Let's create four arrays: train_x, train_y, test_x, and test_y. The train_x and test_x arrays should have two columns, for windspeed and temperature. The train_y and test_y arrays should have one column, containing the wind generation. You will need to use .loc as before, but now selecting columns as well as rows.
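
A sketch of this split on a synthetic merged frame. `DE_windspeed_10m` comes from the text above; `DE_temperature` and `DE_wind_generation` are assumed names for illustration, and the split date of June 30 is one reasonable reading of "first half of the year".

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2016-01-01", "2016-12-31 23:00", freq="h")
rng = np.random.default_rng(0)
merged = pd.DataFrame(
    {
        "DE_windspeed_10m": rng.uniform(0, 12, len(idx)),
        "DE_temperature": rng.uniform(-5, 30, len(idx)),     # assumed name
        "DE_wind_generation": rng.uniform(0, 30000, len(idx)),  # assumed name
    },
    index=idx,
)

features = ["DE_windspeed_10m", "DE_temperature"]
target = ["DE_wind_generation"]   # a list keeps the result two-dimensional

# .loc[row_slice, column_list] selects rows and columns in one step.
train_x = merged.loc["2016-01-01":"2016-06-30", features]
train_y = merged.loc["2016-01-01":"2016-06-30", target]
test_x = merged.loc["2016-07-01":"2016-12-31", features]
test_y = merged.loc["2016-07-01":"2016-12-31", target]

print(train_x.shape, train_y.shape, test_x.shape, test_y.shape)
```

These are DataFrames rather than NumPy arrays; scikit-learn accepts both, and `.to_numpy()` converts them if plain arrays are preferred.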

Now let's use [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) to normalize the data before we try fitting a model to it. This is important so that features are not assigned more or less importance because of their scale.

You should be calling .fit_transform on the training data and .transform on the test data. This ensures that the preprocessor is only learning the statistics of the training data.
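
The fit/transform distinction can be sketched as follows on random stand-in features (the numbers are synthetic and only loosely shaped like windspeed and temperature):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two synthetic feature columns with different scales.
train_x = rng.normal(loc=[6.0, 10.0], scale=[2.0, 8.0], size=(200, 2))
test_x = rng.normal(loc=[6.0, 10.0], scale=[2.0, 8.0], size=(100, 2))

scaler = StandardScaler()
# fit_transform learns the mean and standard deviation from the training data.
train_x_scaled = scaler.fit_transform(train_x)
# transform reuses those training statistics; nothing is learned from the test data.
test_x_scaled = scaler.transform(test_x)

print(train_x_scaled.mean(axis=0), train_x_scaled.std(axis=0))
```

After scaling, the training columns have mean 0 and standard deviation 1, while the test columns are only approximately standardized because they were shifted and scaled with the training statistics.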

Now use scikit-learn's LinearRegression to fit a model to your data.
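
A minimal sketch of the fitting step, assuming already-scaled features. The data here is synthetic, generated from made-up weights so the fit can be checked against them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Synthetic scaled features and a target that is linear plus noise.
train_x = rng.normal(size=(300, 2))
test_x = rng.normal(size=(100, 2))
true_w = np.array([5.0, -1.0])   # made-up weights for the illustration
train_y = train_x @ true_w + rng.normal(scale=0.5, size=300)

model = LinearRegression()
model.fit(train_x, train_y)            # learn coefficients and intercept

predictions = model.predict(test_x)    # one prediction per test row
print(model.coef_, model.intercept_)
```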

Now print out the score of the model (the R² value returned by .score()). It should be around 0.78. Also, make a plot of the model's output vs. the actual wind generation.

Since model.predict() returns a NumPy array instead of a pandas dataframe, we will have to use matplotlib (plt) functions directly for this plot.
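
One way this could look, sketched on synthetic data. The score printed here will not match the 0.78 quoted above, since that figure refers to the real dataset, and the axis labels are placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x = rng.normal(size=(200, 2))
y = x @ np.array([3.0, 0.5]) + rng.normal(scale=0.5, size=200)
train_x, test_x = x[:100], x[100:]
train_y, test_y = y[:100], y[100:]

model = LinearRegression().fit(train_x, train_y)
print(model.score(test_x, test_y))   # R^2 on the held-out half

predicted = model.predict(test_x)    # NumPy array, so it has no .plot() method

# Draw actual and predicted values on the same axes.
fig, ax = plt.subplots()
ax.plot(test_y, label="actual")
ax.plot(predicted, label="predicted")
ax.set_xlabel("test sample")
ax.set_ylabel("wind generation (synthetic units)")
ax.legend()
```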

We can get an idea of how important each input column is by looking at the coefficient the linear regression assigns to it. Note that this comparison is only meaningful because we scaled our input data. Use the model's .coef_ attribute.
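
As an illustration of reading coefficients after scaling, the sketch below builds a synthetic target that depends strongly on windspeed and weakly on temperature. The relationship and its numbers are invented for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
windspeed = rng.uniform(0, 12, 500)
temperature = rng.uniform(-5, 30, 500)
# Made-up relationship: generation is driven mostly by windspeed.
generation = (
    2000.0 * windspeed - 50.0 * temperature + rng.normal(scale=500.0, size=500)
)

features = np.column_stack([windspeed, temperature])
scaled = StandardScaler().fit_transform(features)
model = LinearRegression().fit(scaled, generation)

# With standardized inputs, coefficient magnitudes are directly comparable.
for name, coef in zip(["windspeed", "temperature"], model.coef_):
    print(name, coef)
```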

We can see that windspeed is the most important factor, which should match our intuition from the correlation plots we made earlier.

Since we know that windspeed is the most important factor, let's try training a model using only the windspeed.
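
A sketch of fitting on a single feature, again on invented synthetic data. The main practical detail is that scikit-learn expects a two-dimensional feature array even when there is only one column.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
windspeed = rng.uniform(0, 12, 400)
temperature = rng.uniform(-5, 30, 400)
# Made-up relationship used only for this illustration.
generation = (
    2000.0 * windspeed - 50.0 * temperature + rng.normal(scale=500.0, size=400)
)

both = np.column_stack([windspeed, temperature])
wind_only = windspeed.reshape(-1, 1)   # shape (n_samples, 1), not (n_samples,)

train, test = slice(0, 200), slice(200, 400)
score_both = (
    LinearRegression()
    .fit(both[train], generation[train])
    .score(both[test], generation[test])
)
score_wind = (
    LinearRegression()
    .fit(wind_only[train], generation[train])
    .score(wind_only[test], generation[test])
)
print(score_both, score_wind)
```

With a DataFrame, selecting the column with a list, as in `df[['DE_windspeed_10m']]`, keeps the result two-dimensional in the same way.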

Does this work as well as the first model?