# Wind Power forecasting for the day-ahead energy market
### by Compagnie Nationale du Rhône (CNR)

#  Table of Contents
* [Framing the Problem](#frame_problem)
    - [Context](#context)
    - [Goal](#goal)
    - [Data description](#data_desc)
        - [Complementary data](#comp_data)
        - [Metric](#metric)
* [Getting the Data](#getting_data)
    - [Converting the data](#convertion)
    - [Data split into training and test sets](#train_test)

# Framing the Problem
<a id="frame_problem"></a>

## Context
<a id="context"></a>
CNR is France's leading producer of exclusively renewable energy (water, wind, sun) and the concession holder of the Rhône river for hydroelectricity production, river navigation and agricultural irrigation. This challenge focuses on forecasting wind energy production. CNR currently owns around 50 wind farms (WF) with a total installed capacity of more than 600 MW. Every day, CNR sells its wind energy production for the following day on the energy market. To sell the right amount of energy, and to meet its legal obligations towards the French Transmission System Operator (TSO), which is in charge of the stability of the electrical network, CNR needs to know in advance how much energy its wind farms will produce the next day.

## Goal
<a id="goal"></a>
The goal of this challenge is to predict the energy production of six wind farms (WF) owned by CNR. Each WF production will be individually predicted, using meteorological forecasts as input. Predictions will focus on the day-ahead energy production (hourly production forecasts from day D+1 00h to day D+2 00h).

## Data Description
<a id="data_desc"></a>

Competitors have access to hourly production data for the six WF from 1 May 2018 to 15 January 2019 (8 months and 15 days). This defines the training dataset, since day-ahead hourly WF production is the prediction target (predictand). The hourly WF power production provided consists of the raw recordings of the DSO (Distribution System Operator) and should therefore be considered as a reference, even though it may contain erroneous or suspect data. Competitors are free to scrutinize this data using the complementary data provided separately.

For both the training and test periods, predictor variables (predictors) are given. They consist of hourly forecasts of meteorological variables, provided by various Numerical Weather Prediction (NWP) models. NWP models are meteorological models run by several national weather prediction services. For confidentiality reasons, the names of the NWP models are not disclosed; they are given the generic names NWP1, ... , NWPn.

Here is a description of all data provided in the input csv files:

- *ID*: This is the unique ID of each row in the csv files. One ID corresponds to one (Time, WF) pair. The IDs of the test set follow on consecutively from the IDs of the training set.

- *WF*: The considered Wind Farm. WF ranges from WF1 to WF6. It is crucial for the competitors to be aware that this prediction problem depends entirely on the WF considered. In other words, the statistical link between input variables and wind power production is completely different from one WF to another. Consequently, **it could be judicious to train specific prediction algorithms for each WF, instead of training a unique algorithm which could be unable to model the behavior of each WF**.

- *Time* (UTC): date and hour of the target timestep, i.e. corresponding to the observed Power production. Time zone is Coordinated Universal Time (UTC).

- *Meteorological variables*: Numerical Weather Predictions are provided by meteorological centers several times a day (updates), typically at 00h UTC, 06h UTC, 12h UTC and 18h UTC. We call these sets of forecasts "Runs". Consequently, if the input file contains forecasts arising from several runs, this implies that a single NWP is associated with several forecasts for the same forecasting time. Therefore, the information on the hour of run is provided.

The format of the header of the csv files for the meteorological variables is the following: *NWPi_HourOfTheRun_DayOfTheRun_Variable*, with:

- *NWPi*: the considered Numerical Weather Prediction model (meteorological model);

- *HourOfTheRun*: the hour (UTC) of the considered run. Depending on the NWP, it can be 00h, 06h, 12h or 18h (NWP with 4 runs per day) or only 00h or 12h (NWP with 2 runs per day);

- *DayOfTheRun*: the day of the considered run. We provide in the csv files predictions from the D-2 day runs (the day before yesterday), D-1 day runs (yesterday) and D day runs;

- *Variable*: the different meteorological variables forecasted by the NWP:

    - *U* and *V* components of the wind at 100m (or 10m) height (m/s). These are the zonal and meridional velocities of the wind, respectively. Both are given at a height of 100m above ground for NWP1, NWP2 and NWP3. U and V are given at a height of 10m for NWP4. Although these variables are given at an hourly timestep, **we draw competitors' attention to the fact that each value is representative of a 10-minute window ranging from H-10 min to H**.

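    - *T*: Air temperature. This is the temperature averaged over the entire hour (from H-1 to H). Wind power production is sensitive to air temperature since it affects the air density. This variable is provided only for NWP1 and NWP3. The challenge description gives the unit as °C, but the `_T` values in the sample rows of this notebook (around 273 to 288) look like kelvin.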

    - *CLCT*: Total cloud cover (%). This is the total cloud cover of the sky, ranging from 0% (clear sky, no cloud) to 100% (fully clouded sky). The value is an instantaneous value at hour H. This variable is provided only for NWP4.

    - *Observed Power Production (MW or MW.h)*: this is the observed total amount of energy injected by the WF to the electric network over the entire hour H-1 to H (MW.h). Equivalently, we can consider that this is the mean power output of the WF observed between H-1 and H (MW).
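The header convention described above can be unpacked programmatically. The following is a minimal sketch, not part of the challenge material: the names `parse_header` and `latest_forecast` and the toy values are assumptions made for illustration. It parses a column name into its four parts, picks the most recent non-missing run of one variable, and combines U and V into a wind speed. Whether a given run is actually available at the time a forecast must be issued is not addressed here.

```python
import re

import numpy as np
import pandas as pd

# Header pattern described above: NWPi_HourOfTheRun_DayOfTheRun_Variable,
# e.g. "NWP1_00h_D-2_U" or "NWP4_12h_D_CLCT".
HEADER_RE = re.compile(r"^(NWP\d+)_(\d{2})h_(D(?:-\d)?)_([A-Z]+)$")


def parse_header(column):
    """Return (nwp, run_hour, run_day_offset, variable), or None if no match."""
    match = HEADER_RE.match(column)
    if match is None:
        return None
    nwp, hour, day, variable = match.groups()
    day_offset = 0 if day == "D" else int(day[1:])  # "D-2" -> -2
    return nwp, int(hour), day_offset, variable


def latest_forecast(df, nwp, variable):
    """Row by row, most recent non-missing forecast of `variable` from `nwp`."""
    runs = []
    for col in df.columns:
        parsed = parse_header(col)
        if parsed and parsed[0] == nwp and parsed[3] == variable:
            runs.append(((parsed[2], parsed[1]), col))  # sort key: (day, hour)
    ordered = [col for _, col in sorted(runs, reverse=True)]  # newest run first
    # Back-filling along columns moves the first non-missing value
    # (that is, the newest available run) into the first column.
    return df[ordered].bfill(axis=1).iloc[:, 0]


# Made-up values: two runs of one model, the newer one missing in the second row.
toy = pd.DataFrame({
    "NWP1_00h_D-1_U": [3.0, 1.0],
    "NWP1_12h_D-1_U": [4.0, np.nan],
    "NWP1_00h_D-1_V": [0.0, 2.0],
    "NWP1_12h_D-1_V": [3.0, np.nan],
})
u = latest_forecast(toy, "NWP1", "U")
v = latest_forecast(toy, "NWP1", "V")
speed = np.hypot(u, v)  # wind speed (m/s) from zonal and meridional components
```

In the toy frame, the first row uses the 12h run and the second row falls back to the 00h run because the newer value is missing.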

### Complementary data
<a id="comp_data"></a>

We provide complementary data in the `.zip` supplementary files. These data may be used by the competitors to prepare or criticize WF hourly production data, but they are not predictors. The file `WindFarms_complementary_data.csv` contains the following hourly variables:

* Average power output for each wind turbine of the WF (MW)

* Cumulated energy produced by each wind turbine (MWh). This value could differ from the hourly average power output when the considered turbine has not been operational during the entire hour.

* Observed average wind direction at hub (nacelle) height for each wind turbine (°, from 0 to 359)

* Observed average wind speed at hub (nacelle) height for each wind turbine (m/s)

* Observed average nacelle direction for each wind turbine (°, from 0 to 359)

* Observed average rotational speed of each wind turbine (s$^{-1}$)
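As one illustration of how these variables might be used to scrutinize the production data, the sketch below flags hours in which a turbine's cumulated energy departs from its average power output. According to the description above, such a gap suggests the turbine was not operational during the entire hour. The column names `avg_power_mw` and `energy_mwh`, the helper name and the tolerance are hypothetical; the actual headers of `WindFarms_complementary_data.csv` are not listed in this section.

```python
import numpy as np
import pandas as pd


def flag_partial_hours(df, power_col, energy_col, rel_tol=0.05):
    """Flag rows where hourly energy (MWh) departs from average power (MW) x 1 h.

    Rows with a missing value in either column are flagged as well.
    """
    power = df[power_col].to_numpy(dtype=float)
    energy = df[energy_col].to_numpy(dtype=float)
    consistent = np.isclose(energy, power, rtol=rel_tol, atol=1e-3)
    return pd.Series(~consistent, index=df.index, name="suspect")


# Made-up values for one turbine over three hours.
toy = pd.DataFrame({
    "avg_power_mw": [2.0, 2.0, 0.0],
    "energy_mwh": [2.0, 1.0, 0.0],
})
suspect = flag_partial_hours(toy, "avg_power_mw", "energy_mwh")
```

Only the second hour is flagged in this toy example: 1 MWh was produced although the average power was 2 MW.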

### Metric
<a id="metric"></a>

The metric used to rank the predicting performance is a relative form of the absolute error. We call it the CAPE (Cumulated Absolute Percentage Error). The formulation of CAPE for one WF would be the following:

$\text{CAPE}_{k}\left( {\widehat{Y}}_{k},Y_{k} \right) = 100 \times \frac{\sum_{i = 1}^{N_{k}}\left| Y_{i,k} - {\widehat{Y}}_{i,k} \right|}{\sum_{i = 1}^{N_{k}}Y_{i,k}}$

With $\text{CAPE}_{k}$ the metric for the WF $k$ (%),
$N_{k}$ the length of the test sample for WF $k$ only,
$Y_{i,k}$ the observed production for WF $k$ and hour $i$ (MW or MW.h),
and $\widehat{Y}_{i,k}$ the predicted production for WF $k$ and hour $i$ (MW or MW.h).

For convenience, the data for the 6 WF have been grouped in the same train and test input files. Therefore, the metric used in the challenge is the overall CAPE for the 6 WF, calculated over the pooled samples as:

$\text{CAPE}\left( \widehat{Y},Y \right) = 100 \times \frac{\sum_{i = 1}^{M}\left| Y_{i} - {\widehat{Y}}_{i} \right|}{\sum_{i = 1}^{M}Y_{i}}$

With $M$ the length of the test sample for all the 6 WF ($M$ is the sum of $N_{k}$ for all $k$).

With this formulation, the WFs do not contribute equally to the final value of CAPE: because errors and observations are summed over all WFs before dividing, CAPE is more sensitive to the WFs with the highest energy production values.
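The two formulas can be written as one small function applied either per WF or to the pooled rows. This is a minimal sketch with made-up numbers; the function name `cape` and the column names `observed` and `predicted` are chosen here for illustration and are not taken from the challenge files.

```python
import numpy as np
import pandas as pd


def cape(y_true, y_pred):
    """Cumulated Absolute Percentage Error, in percent."""
    # Cast to float64 so the sums are not accumulated in a reduced-precision dtype.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.abs(y_true - y_pred).sum() / y_true.sum()


# Toy data (made-up): a small farm with errors, a ten times larger farm without.
toy = pd.DataFrame({
    "WF": ["WF1", "WF1", "WF2", "WF2"],
    "observed": [1.0, 1.0, 10.0, 10.0],
    "predicted": [0.5, 1.5, 10.0, 10.0],
})

# CAPE_k: one value per wind farm.
per_farm = {wf: cape(g["observed"], g["predicted"]) for wf, g in toy.groupby("WF")}

# Overall CAPE: all rows pooled, as in the challenge metric.
overall = cape(toy["observed"], toy["predicted"])
```

Here the per-farm values are 50% for WF1 and 0% for WF2, while the pooled value is about 4.5%: the large, well-predicted farm dominates the denominator, so errors on the small farm barely move the overall score.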

In [1]:
# Libraries

%load_ext autoreload
%autoreload 2

import os
import pandas as pd
import numpy as np
import datetime as dt
import gc
import missingno as msno
import pandas_profiling
import re

from src.functions import data_import as dimp
from src.functions import data_exploration as dexp
from src.functions import data_transformation as dtr

#visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly as pty

import plotly.graph_objs as go
from plotly.subplots import make_subplots
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import cufflinks as cf
cf.set_config_file(offline=True)

## Getting the data
<a id="getting_data"></a>

In [2]:
# Import data
X_train = dimp.import_data('../data/raw/X_train_v2.csv')
Y_train = dimp.import_data('../data/raw/Y_train.csv')
X_test = dimp.import_data('../data/raw/X_test_v2.csv')
extra_data = pd.read_csv('../data/external/WindFarms_complementary_data.csv', sep=';', parse_dates=['Time (UTC)'])

# Parse dates (day/month/year format in the raw files)
X_train['Time'] = pd.to_datetime(X_train['Time'], format='%d/%m/%Y %H:%M')
X_test['Time'] = pd.to_datetime(X_test['Time'], format='%d/%m/%Y %H:%M')

Memory usage of dataframe is 29.94 MB
Memory usage after optimization is: 7.72 MB
Decreased by 74.2%
Memory usage of dataframe is 0.57 MB
Memory usage after optimization is: 0.21 MB
Decreased by 62.5%
Memory usage of dataframe is 29.26 MB
invalid value encountered in less
Memory usage after optimization is: 19.47 MB
Decreased by 33.5%
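The log above reports the memory saved by `dimp.import_data`, whose implementation is not shown in this notebook. A comparable reduction can be sketched with `pd.to_numeric(..., downcast=...)`; the helper name `downcast_numeric` and the toy frame below are illustrative assumptions, not the author's code.

```python
import numpy as np
import pandas as pd


def downcast_numeric(df):
    """Return a copy whose numeric columns use the smallest dtype pandas allows."""
    out = df.copy()
    for col in out.select_dtypes(include=[np.integer]).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes(include=[np.floating]).columns:
        # pandas downcasts floats to float32 at the smallest.
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out


# Made-up frame: int64 and float64 columns by default, plus one text column.
toy = pd.DataFrame({
    "ID": [1, 2, 3],
    "WF": ["WF1", "WF1", "WF2"],
    "value": [0.5, np.nan, 2.0],
})
small = downcast_numeric(toy)
saved = toy.memory_usage(deep=True).sum() - small.memory_usage(deep=True).sum()
```

Smaller float types trade precision for memory, so it is worth checking that aggregate statistics remain acceptable after such a conversion.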


In [3]:
%store X_train
%store Y_train

Stored 'X_train' (DataFrame)
Stored 'Y_train' (DataFrame)


### Data split into training and test sets
<a id="train_test"></a>

To test the models locally before making submissions to the challenge, we split the `X_train` and `Y_train` dataframes loaded above into training and test dataframes. The test set contains every row whose `Time` is later than 2018-11-13 23:00 (roughly the last two months of the period) for all the WFs; the remaining rows are used for training.

In [4]:
def split_data_by_date(date, X, y):
    """
    Split X and y into training and test sets around a 'Time' cut-off.

    Rows with Time <= date go to training; rows with Time > date go to testing.
    y is matched to X through the 'ID' column.
        - date: cut-off as a string, format 'YYYY-MM-DD HH:MM:SS'
        - Return: a dictionary with the four sets
                  (X_train, y_train, X_test, y_test)
    """
    date_cut = dt.datetime.strptime(date, '%Y-%m-%d %H:%M:%S')

    X_train = X[X['Time'] <= date_cut]
    X_test = X[X['Time'] > date_cut]
    y_train = y[y.ID.isin(X_train.ID)]
    y_test = y[y.ID.isin(X_test.ID)]

    return {
        'X_train': X_train,
        'X_test': X_test,
        'y_train': y_train,
        'y_test': y_test,
    }

In [5]:
train_test_dfs = split_data_by_date('2018-11-13 23:00:00', X_train, Y_train)
X_train_2 = train_test_dfs.get('X_train')
X_test_2 = train_test_dfs.get('X_test')
Y_train_2 = train_test_dfs.get('y_train')
Y_test_2 = train_test_dfs.get('y_test')
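The properties one would expect from such a time-based split can be checked on a small made-up frame. The sketch below reproduces the same logic with a boolean mask; the names `X_tr`, `X_te`, `y_tr`, `y_te` and the toy values are illustrative assumptions and do not touch the real data.

```python
import pandas as pd

# Toy stand-ins for X_train / Y_train: two farms, six hourly rows each.
times = pd.date_range("2018-11-13 21:00", periods=6, freq=pd.Timedelta(hours=1))
X = pd.DataFrame({
    "ID": range(1, 13),
    "WF": ["WF1"] * 6 + ["WF2"] * 6,
    "Time": list(times) * 2,
})
y = pd.DataFrame({"ID": X["ID"], "Production": 0.1 * X["ID"]})

# Same rule as split_data_by_date: Time <= cut is training, Time > cut is test.
cut = pd.Timestamp("2018-11-13 23:00:00")
is_test = X["Time"] > cut
X_tr, X_te = X[~is_test], X[is_test]
y_tr, y_te = y[y["ID"].isin(X_tr["ID"])], y[y["ID"].isin(X_te["ID"])]

# Sanity checks: no row lost or duplicated, no temporal leakage,
# every farm represented in the test set, targets aligned with features.
assert set(X_tr["ID"]).isdisjoint(X_te["ID"])
assert len(X_tr) + len(X_te) == len(X)
assert X_tr["Time"].max() <= cut < X_te["Time"].min()
assert set(X_te["WF"]) == set(X["WF"])
assert len(y_tr) == len(X_tr) and len(y_te) == len(X_te)
```

The same assertions could be run on `X_train_2`, `X_test_2`, `Y_train_2` and `Y_test_2` to confirm the split behaves as intended on the real data.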

In [6]:
X_train_2.tail()

| | ID | WF | Time | NWP1_00h_D-2_U | NWP1_00h_D-2_V | NWP1_00h_D-2_T | NWP1_06h_D-2_U | NWP1_06h_D-2_V | NWP1_06h_D-2_T | NWP1_12h_D-2_U | ... | NWP4_00h_D-1_CLCT | NWP4_12h_D-1_U | NWP4_12h_D-1_V | NWP4_12h_D-1_CLCT | NWP4_00h_D_U | NWP4_00h_D_V | NWP4_00h_D_CLCT | NWP4_12h_D_U | NWP4_12h_D_V | NWP4_12h_D_CLCT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 35857 | 35858 | WF6 | 2018-11-13 19:00:00 | NaN | NaN | NaN | NaN | NaN | NaN | 0.599121 | ... | 79.8125 | 0.853027 | -2.751953 | 71.4375 | 1.03418 | -3.464844 | 95.0 | 0.871582 | -2.109375 | 42.53125 |
| 35858 | 35859 | WF6 | 2018-11-13 20:00:00 | NaN | NaN | NaN | NaN | NaN | NaN | 0.475586 | ... | 65.5625 | 0.824219 | -2.199219 | 62.0 | 1.120117 | -2.613281 | 76.9375 | 0.643555 | -1.414062 | 40.65625 |
| 35859 | 35860 | WF6 | 2018-11-13 21:00:00 | NaN | NaN | NaN | NaN | NaN | NaN | -0.066772 | ... | 39.3125 | 0.79541 | -1.462891 | 37.90625 | 0.778809 | -1.99707 | 50.34375 | 0.559082 | -1.168945 | 14.101562 |
| 35860 | 35861 | WF6 | 2018-11-13 22:00:00 | NaN | NaN | NaN | NaN | NaN | NaN | -0.5625 | ... | 38.0 | 0.421143 | -1.287109 | 40.21875 | 0.374512 | -1.577148 | 40.0625 | 0.415771 | -1.313477 | 22.671875 |
| 35861 | 35862 | WF6 | 2018-11-13 23:00:00 | NaN | NaN | NaN | NaN | NaN | NaN | -0.486328 | ... | 37.28125 | 0.085022 | -1.362305 | 54.03125 | 0.27832 | -1.176758 | 28.625 | 0.373779 | -1.395508 | 25.65625 |


In [7]:
X_test_2.head()

| | ID | WF | Time | NWP1_00h_D-2_U | NWP1_00h_D-2_V | NWP1_00h_D-2_T | NWP1_06h_D-2_U | NWP1_06h_D-2_V | NWP1_06h_D-2_T | NWP1_12h_D-2_U | ... | NWP4_00h_D-1_CLCT | NWP4_12h_D-1_U | NWP4_12h_D-1_V | NWP4_12h_D-1_CLCT | NWP4_00h_D_U | NWP4_00h_D_V | NWP4_00h_D_CLCT | NWP4_12h_D_U | NWP4_12h_D_V | NWP4_12h_D_CLCT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4726 | 4727 | WF1 | 2018-11-14 00:00:00 | -2.121094 | -2.443359 | 287.5 | -1.804688 | -3.511719 | 287.0 | -2.726562 | ... | 95.0625 | -0.558594 | -1.636719 | 56.28125 | -0.411865 | -1.547852 | 69.3125 | NaN | NaN | NaN |
| 4727 | 4728 | WF1 | 2018-11-14 01:00:00 | -1.703125 | -0.535645 | 287.0 | -1.742188 | -2.023438 | 286.75 | -2.023438 | ... | 81.75 | -0.671875 | -1.813477 | 56.125 | -0.644531 | -1.72168 | 70.0 | NaN | NaN | NaN |
| 4728 | 4729 | WF1 | 2018-11-14 02:00:00 | -1.638672 | 0.705566 | 287.0 | -1.274414 | -0.818359 | 286.5 | -1.580078 | ... | 68.1875 | -0.846191 | -1.924805 | 57.71875 | -0.875977 | -1.833984 | 72.9375 | NaN | NaN | NaN |
| 4729 | 4730 | WF1 | 2018-11-14 03:00:00 | -1.932617 | 1.438477 | 287.0 | -1.029297 | -0.264404 | 286.25 | -1.255859 | ... | 89.0625 | -0.915039 | -1.80957 | 72.3125 | -0.935547 | -2.050781 | 68.5625 | NaN | NaN | NaN |
| 4730 | 4731 | WF1 | 2018-11-14 04:00:00 | -2.136719 | 2.640625 | 286.75 | -1.157227 | 0.812012 | 286.25 | -1.219727 | ... | 99.75 | -0.908203 | -1.504883 | 80.875 | -1.035156 | -1.847656 | 94.5 | NaN | NaN | NaN |


In [8]:
X_test_2.tail()

| | ID | WF | Time | NWP1_00h_D-2_U | NWP1_00h_D-2_V | NWP1_00h_D-2_T | NWP1_06h_D-2_U | NWP1_06h_D-2_V | NWP1_06h_D-2_T | NWP1_12h_D-2_U | ... | NWP4_00h_D-1_CLCT | NWP4_12h_D-1_U | NWP4_12h_D-1_V | NWP4_12h_D-1_CLCT | NWP4_00h_D_U | NWP4_00h_D_V | NWP4_00h_D_CLCT | NWP4_12h_D_U | NWP4_12h_D_V | NWP4_12h_D_CLCT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 37370 | 37371 | WF6 | 2019-01-15 20:00:00 | NaN | NaN | NaN | NaN | NaN | NaN | -1.790039 | ... | -1.3e-05 | 0.675293 | -0.984375 | -1.5e-05 | 0.560547 | -0.938477 | -1.6e-05 | 0.614746 | -0.884277 | -2e-05 |
| 37371 | 37372 | WF6 | 2019-01-15 21:00:00 | NaN | NaN | NaN | NaN | NaN | NaN | -1.796875 | ... | -1.3e-05 | 0.499756 | -0.651855 | -1.5e-05 | 0.368896 | -0.654785 | -1.6e-05 | 0.322754 | -0.747559 | -2e-05 |
| 37372 | 37373 | WF6 | 2019-01-15 22:00:00 | NaN | NaN | NaN | NaN | NaN | NaN | -1.620117 | ... | -1.3e-05 | 0.261475 | -0.76709 | -1.5e-05 | 0.129761 | -0.636719 | -1.6e-05 | 0.118164 | -0.70752 | -2e-05 |
| 37373 | 37374 | WF6 | 2019-01-15 23:00:00 | NaN | NaN | NaN | NaN | NaN | NaN | -1.260742 | ... | -1.3e-05 | -0.030136 | -0.537109 | -1.5e-05 | 0.055542 | -0.546875 | -1.6e-05 | -0.160889 | -0.532715 | -2e-05 |
| 37374 | 37375 | WF6 | 2019-01-16 00:00:00 | -1.301758 | -3.533203 | 273.0 | -1.021484 | -4.1875 | 273.5 | -0.780273 | ... | -1.6e-05 | -0.255859 | -0.389404 | -2e-05 | -0.071838 | -0.601562 | -1.4e-05 | NaN | NaN | NaN |


In [9]:
Y_train_2

| | ID | Production |
|---|---|---|
| 0 | 1 | 0.020004 |
| 1 | 2 | 0.070007 |
| 2 | 3 | 0.219971 |
| 3 | 4 | 0.389893 |
| 4 | 5 | 0.409912 |
| ... | ... | ... |
| 35857 | 35858 | 0.479980 |
| 35858 | 35859 | 0.830078 |
| 35859 | 35860 | 0.810059 |
| 35860 | 35861 | 0.540039 |


In [10]:
Y_train_2['Production'].describe()

count    28302.000000
mean         1.505859
std          1.910156
min          0.000000
25%          0.189941
50%          0.810059
75%          2.089844
max         13.406250
Name: Production, dtype: float64

In [11]:
Y_test_2.head()

| | ID | Production |
|---|---|---|
| 4726 | 4727 | 0.0 |
| 4727 | 4728 | 0.0 |
| 4728 | 4729 | 0.0 |
| 4729 | 4730 | 0.0 |
| 4730 | 4731 | 0.010002 |


In [12]:
%store X_train_2
%store Y_train_2
%store X_test_2
%store Y_test_2

Stored 'X_train_2' (DataFrame)
Stored 'Y_train_2' (DataFrame)
Stored 'X_test_2' (DataFrame)
Stored 'Y_test_2' (DataFrame)
