## Data Dictionary

- Unnamed: 0: Timestamp, sampled at 10-minute intervals.
- ActivePower: Active power generated by the windmill.
- AmbientTemperatue: Ambient temperature.
- BearingShaftTemperature: Bearing shaft temperature of the motor.
- Blade1PitchAngle: Pitch angle of wind turbine blade 1.
- Blade2PitchAngle: Pitch angle of wind turbine blade 2.
- Blade3PitchAngle: Pitch angle of wind turbine blade 3.
- ControlBoxTemperature: Temperature of the wind turbine control box.
- GearboxBearingTemperature: Gearbox Bearing Temperature of the motor.
- GearboxOilTemperature: Gearbox Oil Temperature of the motor.
- GeneratorRPM: Measured RPM of the generator.
- GeneratorWinding1Temperature: Temperature of generator winding 1.
- GeneratorWinding2Temperature: Temperature of generator winding 2.
- HubTemperature: Temperature of the hub computer that controls the pitching of the blades.
- MainBoxTemperature: Temperature of main box.
- NacellePosition: Position of the nacelle.
- ReactivePower: Reactive power generated by the windmill.
- RotorRPM: RPM of the rotor.
- TurbineStatus: Status of the turbine.
- WTG: Wind turbine generator (the unit converts mechanical energy into electricity); the only value is G01.
- WindDirection: Wind direction in degrees.
- WindSpeed: Wind speed in km/h.


## 0.0 Imports

In [18]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

### 0.1 Helper Functions

In [19]:
# Function that converts camel style to snake style
def camel_to_snake(s):
    return ''.join(['_'+c.lower() if c.isupper() else c for c in s]).lstrip('_')
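
The helper above inserts an underscore before every uppercase letter, so acronyms come out split (for example, `GeneratorRPM` becomes `generator_r_p_m`). As a possible alternative, the sketch below uses two regular expressions to keep runs of capitals together. The name `camel_to_snake_acronyms` is hypothetical and not part of the original notebook.

```python
import re

# Hypothetical alternative to camel_to_snake that keeps acronyms in one piece
def camel_to_snake_acronyms(s):
    # Underscore between a lowercase letter or digit and the uppercase letter after it
    s = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', s)
    # Underscore between an acronym and a following capitalized word (e.g. 'RPMValue')
    s = re.sub(r'(?<=[A-Z])(?=[A-Z][a-z])', '_', s)
    return s.lower()

for name in ['GeneratorRPM', 'RotorRPM', 'WTG', 'Blade1PitchAngle']:
    print(name, '->', camel_to_snake_acronyms(name))
```

With a helper like this, a manual correction of the acronym columns after renaming would likely not be needed.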

In [20]:
# Function that calculates the descriptive statistics
def descriptive_analysis(numerical_df):
    # Compute the summary statistics for every numerical column of the dataset
    # Central tendency - Mean, median
    ct1 = pd.DataFrame(numerical_df.apply(np.mean)).T
    ct2 = pd.DataFrame(numerical_df.apply(np.median)).T

    # Dispersion - std, min, max, range, skew, kurtosis
    d1 = pd.DataFrame(numerical_df.apply(np.std)).T
    d2 = pd.DataFrame(numerical_df.apply(min)).T
    d3 = pd.DataFrame(numerical_df.apply(max)).T
    d4 = pd.DataFrame(numerical_df.apply(lambda x: x.max() - x.min())).T
    d5 = pd.DataFrame(numerical_df.apply(lambda x: x.skew())).T
    d6 = pd.DataFrame(numerical_df.apply(lambda x: x.kurtosis())).T

    # Concatenate
    m = pd.concat([ct1, ct2, d1,d2,d3,d4,d5,d6]).T.reset_index()
    m.columns = ["Attributes", "Mean", "Median", "Std", "Min", "Max", "Range", "Skew", "Kurtosis"]

    return m
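
The same summary table can probably be built more compactly with `DataFrame.agg`. The sketch below is one way to do it; `descriptive_analysis_agg` and the toy `demo` frame are hypothetical names used only for illustration. Note that pandas' `std` is the sample standard deviation (`ddof=1`), so the `Std` column may differ slightly from what the helper above reports.

```python
import pandas as pd

# Hypothetical compact variant of descriptive_analysis based on DataFrame.agg
def descriptive_analysis_agg(numerical_df):
    # One row per attribute, one column per statistic
    stats = numerical_df.agg(['mean', 'median', 'std', 'min', 'max', 'skew', 'kurt']).T
    stats['range'] = stats['max'] - stats['min']
    # Same column order as the original helper
    stats = stats[['mean', 'median', 'std', 'min', 'max', 'range', 'skew', 'kurt']]
    stats.columns = ["Mean", "Median", "Std", "Min", "Max", "Range", "Skew", "Kurtosis"]
    return stats.rename_axis("Attributes").reset_index()

# Toy data, not taken from the turbine dataset
demo = pd.DataFrame({'power': [0.0, 10.0, 20.0, 30.0, 90.0],
                     'temp': [20.0, 21.0, 22.0, 23.0, 24.0]})
summary = descriptive_analysis_agg(demo)
print(summary)
```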

### 0.2 Loading Data

In [21]:
df0 = pd.read_csv('../data/Turbine_Data.csv')

In [22]:
df0.head()

Unnamed: 0.1,Unnamed: 0,ActivePower,AmbientTemperatue,BearingShaftTemperature,Blade1PitchAngle,Blade2PitchAngle,Blade3PitchAngle,ControlBoxTemperature,GearboxBearingTemperature,GearboxOilTemperature,...,GeneratorWinding2Temperature,HubTemperature,MainBoxTemperature,NacellePosition,ReactivePower,RotorRPM,TurbineStatus,WTG,WindDirection,WindSpeed
0,2017-12-31 00:00:00+00:00,,,,,,,,,,...,,,,,,,,G01,,
1,2017-12-31 00:10:00+00:00,,,,,,,,,,...,,,,,,,,G01,,
2,2017-12-31 00:20:00+00:00,,,,,,,,,,...,,,,,,,,G01,,
3,2017-12-31 00:30:00+00:00,,,,,,,,,,...,,,,,,,,G01,,
4,2017-12-31 00:40:00+00:00,,,,,,,,,,...,,,,,,,,G01,,


## 1.0 Data Description

This section describes the data to build a better understanding of the dataset. Descriptive statistics summarize the data through graphical representations, measures of central tendency, and measures of variability, which makes it easier to generate insights from it.

In [23]:
df1 = df0.copy()

### 1.1 Rename Columns

In [24]:
# Rename timestamp column 
df1 = df1.rename(columns={'Unnamed: 0': 'Date'})

In [25]:
# Convert the camel case to snake case
cols_old = ['Date', 'ActivePower', 'AmbientTemperatue', 'BearingShaftTemperature','Blade1PitchAngle', 'Blade2PitchAngle', 
            'Blade3PitchAngle', 'ControlBoxTemperature', 'GearboxBearingTemperature','GearboxOilTemperature', 
            'GeneratorRPM', 'GeneratorWinding1Temperature', 'GeneratorWinding2Temperature', 'HubTemperature',
            'MainBoxTemperature','NacellePosition', 'ReactivePower', 'RotorRPM', 'TurbineStatus', 'WTG',
            'WindDirection', 'WindSpeed']

cols_new = list(map(camel_to_snake, cols_old))

df1.columns = cols_new

# Correction after snake transformation: acronyms are split letter by letter
cols_to_correct = ['generator_r_p_m', 'rotor_r_p_m', 'w_t_g']
cols_corrected = ['generator_rpm', 'rotor_rpm', 'wtg']

df1 = df1.rename(columns=dict(zip(cols_to_correct, cols_corrected)))

### 1.2 Data Dimensions

In [26]:
# Print the number of examples and features
print("Number of rows: {}".format(df1.shape[0]))
print("Number of columns: {}".format(df1.shape[1]))

Number of rows: 118224
Number of columns: 22


### 1.3 Data Types

In [27]:
# Show the type of each column
df1.dtypes

date                               object
active_power                      float64
ambient_temperatue                float64
bearing_shaft_temperature         float64
blade1_pitch_angle                float64
blade2_pitch_angle                float64
blade3_pitch_angle                float64
control_box_temperature           float64
gearbox_bearing_temperature       float64
gearbox_oil_temperature           float64
generator_rpm                     float64
generator_winding1_temperature    float64
generator_winding2_temperature    float64
hub_temperature                   float64
main_box_temperature              float64
nacelle_position                  float64
reactive_power                    float64
rotor_rpm                         float64
turbine_status                    float64
wtg                                object
wind_direction                    float64
wind_speed                        float64
dtype: object

**Observations**:
- date has dtype object (string) rather than datetime.

In [28]:
# Convert date column from string to datetime
df1['date'] = pd.to_datetime(df1['date'])
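
Once the column is a datetime, the 10-minute sampling period stated in the data dictionary can be checked by looking at the differences between consecutive timestamps. The sketch below does this on a small toy frame; the `demo` data and the variable names are assumptions made for illustration, not values from the dataset.

```python
import pandas as pd

# Toy timestamps with one missing sample (00:30 is absent)
demo = pd.DataFrame({'date': pd.to_datetime([
    '2017-12-31 00:00:00+00:00',
    '2017-12-31 00:10:00+00:00',
    '2017-12-31 00:20:00+00:00',
    '2017-12-31 00:40:00+00:00',
])})

# Time elapsed between each row and the previous one
steps = demo['date'].diff().dropna()

# Steps that deviate from the expected 10-minute period
irregular = steps[steps != pd.Timedelta(minutes=10)]

print(steps.value_counts())
print("Irregular steps:", len(irregular))
```

Applied to `df1['date']`, an empty `irregular` result would suggest the series is evenly sampled.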

### 1.4 Checking Missing Values

In [29]:
100* df1.isna().sum() / len(df1)

date                               0.000000
active_power                      19.855528
ambient_temperatue                20.644708
bearing_shaft_temperature         47.119028
blade1_pitch_angle                64.477602
blade2_pitch_angle                64.566416
blade3_pitch_angle                64.566416
control_box_temperature           47.421843
gearbox_bearing_temperature       47.100420
gearbox_oil_temperature           47.186696
generator_rpm                     47.307653
generator_winding1_temperature    47.196001
generator_winding2_temperature    47.177392
hub_temperature                   47.213764
main_box_temperature              47.128333
nacelle_position                  38.863513
reactive_power                    19.857220
rotor_rpm                         47.449756
turbine_status                    46.789146
wtg                                0.000000
wind_direction                    38.863513
wind_speed                        19.986636
dtype: float64

**Observations**:
- A large share of values is missing, reaching 64.57% for some features (the blade 2 and blade 3 pitch angles).
- There are several strategies for handling them: drop the columns with a high share of missing values; replace them with the mean/median; use a moving average; replace them with the previous or next non-null value.
- For this problem, I am going to use the ffill and bfill methods from pandas, which propagate the last non-null value forward and the next non-null value backward.
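
To make the fill behavior concrete, here is a minimal sketch on a toy series (the values are made up). It shows that a forward fill alone leaves leading gaps untouched, which is why a backward fill follows it, and how the `limit` parameter can cap the number of consecutive values that get filled.

```python
import numpy as np
import pandas as pd

# Toy series with a leading gap, an inner gap, and a trailing gap
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()           # carries the last valid value forward; the leading NaN stays
both = s.ffill().bfill()      # the remaining leading NaN takes the next valid value
capped = s.ffill(limit=1)     # fills at most one consecutive NaN per gap

print(pd.DataFrame({'raw': s, 'ffill': forward,
                    'ffill+bfill': both, 'ffill(limit=1)': capped}))
```

With a large share of missing values, long gaps end up filled with a single repeated value, so a `limit` may be worth considering when that would distort the statistics.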


### 1.5 Replace NaN values

In [30]:
# fillna(method=...) is deprecated in recent pandas versions; ffill()/bfill() are equivalent
df1 = df1.ffill().bfill()

In [31]:
100* df1.isna().sum() / len(df1)

date                              0.0
active_power                      0.0
ambient_temperatue                0.0
bearing_shaft_temperature         0.0
blade1_pitch_angle                0.0
blade2_pitch_angle                0.0
blade3_pitch_angle                0.0
control_box_temperature           0.0
gearbox_bearing_temperature       0.0
gearbox_oil_temperature           0.0
generator_rpm                     0.0
generator_winding1_temperature    0.0
generator_winding2_temperature    0.0
hub_temperature                   0.0
main_box_temperature              0.0
nacelle_position                  0.0
reactive_power                    0.0
rotor_rpm                         0.0
turbine_status                    0.0
wtg                               0.0
wind_direction                    0.0
wind_speed                        0.0
dtype: float64

### 1.6 Descriptive Statistics

In [32]:
num_attributes = df1.select_dtypes(include = ['float64'])
# The timestamps are timezone-aware (UTC), so 'datetimetz' is excluded as well
cat_attributes = df1.select_dtypes(exclude = ['float64', 'datetime', 'datetimetz'])
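
One caveat worth checking: the timestamps in this dataset carry a `+00:00` offset, so `pd.to_datetime` produces a timezone-aware column, which `select_dtypes` classifies as `'datetimetz'` and which may not be matched by a plain `'datetime64[ns]'` filter. The sketch below, on a toy frame with made-up values, shows a selection that excludes both naive and timezone-aware datetimes.

```python
import pandas as pd

# Toy frame mimicking the three kinds of columns in df1
demo = pd.DataFrame({
    'date': pd.to_datetime(['2017-12-31 00:00:00+00:00', '2017-12-31 00:10:00+00:00']),
    'active_power': [1.5, 2.5],
    'wtg': ['G01', 'G01'],
})

print(demo.dtypes)  # the date column is timezone-aware

# Exclude floats plus naive ('datetime') and timezone-aware ('datetimetz') datetimes
categorical = demo.select_dtypes(exclude=['float64', 'datetime', 'datetimetz'])
print(list(categorical.columns))
```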

#### 1.6.1 Numerical Attributes

In [33]:
descriptive_analysis(num_attributes)

Unnamed: 0,Attributes,Mean,Median,Std,Min,Max,Range,Skew,Kurtosis
0,active_power,639.934548,423.44588,630.141658,-38.524659,1779.032,1817.557,0.648454,-1.082583
1,ambient_temperatue,28.57425,28.126315,4.308102,0.0,42.4056,42.4056,0.281966,0.811206
2,bearing_shaft_temperature,44.96683,47.901936,6.404004,0.0,55.08866,55.08866,-3.534934,20.653067
3,blade1_pitch_angle,31.797893,45.736893,22.150311,-43.156734,90.14361,133.3003,-0.436169,-0.989127
4,blade2_pitch_angle,30.718228,43.699357,21.173457,-26.443415,90.01783,116.4612,-0.366678,-0.828616
5,blade3_pitch_angle,30.718228,43.699357,21.173457,-26.443415,90.01783,116.4612,-0.366678,-0.828616
6,control_box_temperature,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,gearbox_bearing_temperature,68.885245,75.066281,11.675754,0.0,82.23793,82.23793,-2.404865,9.730579
8,gearbox_oil_temperature,60.146597,63.304417,5.81292,0.0,70.76458,70.76458,-1.396449,3.684245
9,generator_rpm,1359.868503,1691.274462,527.575804,0.0,1809.942,1809.942,-1.351028,0.83008


**Observations**:

- **active_power** has an average of 639.93 kW and reaches a maximum of 1.78 MW. Its skew (0.65) and kurtosis (-1.08) are low, so it may be roughly normal, although the mean is well above the median (423.45 kW).
- **ambient_temperature** may follow a normal distribution, and the average value is 28.57 ºC.
- **bearing_shaft_temperature** has a range of 55.09 ºC, and its average is 44.97 ºC.
- The three **blade** pitch angle features are similar; blade 1 varies slightly more than the others, and blades 2 and 3 have identical statistics.
- **control_box_temperature** carries no information: all values are zero.
- **gearbox_bearing_temperature** is strongly left-skewed (skew -2.40), so it may not follow a normal distribution. The range is 82.24 ºC and the median is 75.07 ºC.
- **gearbox_oil_temperature** may follow the same distribution as the gearbox bearing temperature.
- The maximum value of **generator_rpm** is about 1810 RPM, while its minimum is zero (when the generator is off). The median value is 1691 RPM.
- **generator_winding_temperature 1 and 2** are similar; they may not follow a normal distribution, and their maximum values reach 1260 ºC, which likely indicates outliers.
- **hub_temperature** reaches 47.96 ºC with an average of 38.7 ºC.
- **main_box_temperature** is very similar to the hub temperature, but it reaches 54.25 ºC.
- **nacelle_position** ranges from 0 to 357º, with a median of 188º and an average of 201.71º.
- **reactive_power** has a minimum value of -203.18 VAr, which indicates moments when the load is more capacitive. Positive reactive power means an inductive load. Most values are positive, so the load is mostly inductive.
- **rotor_rpm** is very low compared to the generator RPM. The average value is 12.19 RPM.
- **wind_direction** also ranges from 0 to 357º, and the average value is 201.71º.
- **wind_speed** may follow a normal distribution; the average value is 6.03 km/h and the maximum is 22.9 km/h.
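
Suspicious extremes such as the 1260 ºC winding temperature can be flagged with a simple rule before deciding how to treat them. The sketch below uses the interquartile range (IQR) rule, which is an assumption on my part and not a method used elsewhere in this notebook; `iqr_outliers` and the toy `temps` series are hypothetical.

```python
import pandas as pd

# Hypothetical helper: values outside [Q1 - k*IQR, Q3 + k*IQR] are flagged as outliers
def iqr_outliers(series, k=1.5):
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Toy temperatures with one implausible reading
temps = pd.Series([60.0, 62.5, 61.0, 63.2, 59.8, 1260.0])
flagged = iqr_outliers(temps)
print(flagged)
```

The multiplier `k=1.5` is a common convention; a larger value would flag fewer points.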

#### 1.6.2 Categorical Attributes

- WTG is the only categorical attribute, and it carries no information because it takes a single value (G01). I will handle it later.

## 2.0 Feature Engineering