# ASSIGNMENT 1

## Lara Monteserín Placer

## María Ferrero Medina

## Introduction

The aim of this project is to design a machine learning model that is able to predict the energy produced by the Sotavento wind farm. For this purpose, a dataset with 555 features and 4748 instances is available.

The structure of the study will be the following:

1. Exploratory Data Analysis
2. Methodology
3. KNN model
4. Decision tree model
5. Ensemble method 1 model
6. Ensemble method 2 model
7. Selection and performance of the final model

For each of the models created, several steps have been followed to optimize it. First, a simple version of the model is created, with hyperparameters that seem reasonable, no feature selection and a basic imputation technique. The model is then improved sequentially through the following steps:

1. First model
2. Feature selection
3. Imputation techniques
4. Hyperparameter tuning


Things to add: - another idea (new library?, new idea...)


## 1. Exploratory Data Analysis

Before building the models, an Exploratory Data Analysis (EDA) is carried out as a first approach to understanding the dataset. In it, the number of instances and features is determined and the data type of each feature is verified. A brief summary of the missing values and of the columns with a constant value is also included.

### 1.1. Number of instances and features

This dataset has 4748 instances and 555 features.

### 1.2. Nature of the variables

This dataset contains information about the meteorological conditions at several locations, the time at which these conditions were measured and the energy produced at each moment.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


# Read the data that is compressed as a gzip
wind_ava = pd.read_csv('wind_available.csv.gzip', compression="gzip")

# Display the first rows of the dataset just to see it
wind_ava.head()

# Display the data type of each column
column_data_types = wind_ava.dtypes
print(column_data_types)

energy     float64
year         int64
month        int64
day          int64
hour         int64
            ...   
v100.21    float64
v100.22    float64
v100.23    float64
v100.24    float64
v100.25    float64
Length: 555, dtype: object


After checking the data types of all the features, it has been verified that there are:

- 551 numerical variables (real numbers). Of these 551, one is the energy, which is the output of the problem. The remaining 550 correspond to the 22 different meteorological conditions measured at the 25 different locations (22 × 25 = 550).

- 4 numerical variables (integers). These 4 variables are the year, month, day and hour of the day. They characterize the moment the measurements were taken.
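
The variable and location structure described above can be checked from the column names alone. The following is a minimal sketch, assuming that every meteorological column follows the `<variable>.<location>` naming pattern visible in the output (for example `v100.21`). The short `columns` list and the helper `group_by_variable` are hypothetical; in the notebook the list would be replaced by `list(wind_ava.columns)`.

```python
from collections import defaultdict

# Hypothetical subset of column names (stand-in for list(wind_ava.columns))
columns = ["energy", "year", "month", "day", "hour",
           "u100.1", "u100.13", "v100.1", "v100.13", "v100.25"]


def group_by_variable(names):
    """Map each variable prefix to the sorted list of location ids it appears at."""
    groups = defaultdict(list)
    for name in names:
        # Split on the LAST dot, so variable names that contain dots still work
        prefix, sep, suffix = name.rpartition(".")
        if sep and suffix.isdigit():  # keep only '<variable>.<location>' columns
            groups[prefix].append(int(suffix))
    return {var: sorted(locs) for var, locs in groups.items()}


groups = group_by_variable(columns)
print(groups)
print("Number of distinct variables:", len(groups))
```

On the full dataset, the number of keys would be expected to match the 22 meteorological conditions and each list the 25 locations, if the naming assumption holds.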

### 1.3. Check for missing values 

In [2]:
# Return the number of Null values for each column
null_values = wind_ava.isnull().sum()
# isna() is an alias of isnull() in pandas, so both counts are expected to be identical (kept as a check)
nan_values = wind_ava.isna().sum()

# Store in missing values the amount of Null and NaN values of each column
missing_values = pd.DataFrame({
    'Column': null_values.index,
    'Null Values': null_values.values,
    'NaN Values': nan_values.values
})

# Print the amount of Null and Nan values
print(missing_values)

# Identify columns with Null or NaN values
columns_with_null = wind_ava.columns[wind_ava.isnull().any()]
columns_with_nan = wind_ava.columns[wind_ava.isna().any()]

# Display the number of columns which have missing values
print("Number of columns with Null Values:", len(columns_with_null))
print("Number of columns with NaN Values:", len(columns_with_nan))

      Column  Null Values  NaN Values
0     energy            0           0
1       year            0           0
2      month            0           0
3        day            0           0
4       hour            0           0
..       ...          ...         ...
550  v100.21          261         261
551  v100.22          387         387
552  v100.23          569         569
553  v100.24          579         579
554  v100.25          436         436

[555 rows x 3 columns]
Number of columns with Null Values: 550
Number of columns with NaN Values: 550


All meteorological variables have missing values, in different instances. The 4 variables that characterize the moment the measurement was taken and the target variable 'energy' do not have missing values.


### 1.4. Check for constant columns


In [3]:
# Check for constant values in each column
constant_columns = wind_ava.columns[wind_ava.nunique() == 1]

# Print columns with constant values
print("Columns with constant values:", constant_columns)

Columns with constant values: Index([], dtype='object')



There are no constant columns. 


### 1.5. Type of problem

The objective of the model is to estimate the energy. As this is a continuous numerical value, this is a **regression problem**.

## 2. Methodology

This section explains the methodology that will be followed to evaluate the models: the evaluation techniques, which are the inner evaluation and the outer evaluation, and the metric.

- On the one hand, for the **inner evaluation**, cross-validation will be applied. Cross-validation will be used to determine the best combination of hyperparameters.

- On the other hand, for the **outer evaluation**, holdout evaluation will be used. This method will be used to estimate the future performance of the designed model.

Later, once the outer performance has been computed, the hyperparameters will be tuned again, this time using the whole dataset, to improve the performance of the final model.
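
The evaluation scheme described above can be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic (not the Sotavento dataset), the hyperparameter grid is hypothetical, and `TimeSeriesSplit` is an assumed choice of cross-validation scheme that keeps the temporal order of the rows. The text only states that cross-validation is used for the inner evaluation.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.neighbors import KNeighborsRegressor

# Synthetic, time-ordered stand-in data (assumption: NOT the Sotavento dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# Outer evaluation: holdout, keeping the last 20% of the rows as the test set
split_index = int(len(X) * 0.8)
X_train, X_test = X[:split_index], X[split_index:]
y_train, y_test = y[:split_index], y[split_index:]

# Inner evaluation: cross-validation on the training set only
inner_cv = TimeSeriesSplit(n_splits=3)      # assumed order-preserving folds
param_grid = {"n_neighbors": [3, 5, 11]}    # hypothetical grid
search = GridSearchCV(KNeighborsRegressor(), param_grid,
                      scoring="neg_mean_squared_error", cv=inner_cv)
search.fit(X_train, y_train)

# Estimate of future performance, computed on data never used for tuning
outer_mse = mean_squared_error(y_test, search.predict(X_test))
print("Best hyperparameters:", search.best_params_)
print("Outer (holdout) MSE:", outer_mse)

# Final model: hyperparameters are tuned again, this time on the whole dataset
final_search = GridSearchCV(KNeighborsRegressor(), param_grid,
                            scoring="neg_mean_squared_error", cv=inner_cv)
final_search.fit(X, y)
final_model = final_search.best_estimator_
```

The holdout MSE is computed before the final refit, so it remains an estimate obtained on data that the tuning procedure never saw.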

The objective function used to validate the models is the Mean Squared Error (MSE). Because it squares each error, this metric is more sensitive to outliers and to large deviations than metrics based on absolute errors. This is useful to penalize larger errors and to avoid selecting a model that occasionally produces them.
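
The effect of squaring can be seen on a toy example. With $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$, a single large error weighs much more than several small ones. The sketch below uses made-up values, and the Mean Absolute Error (MAE) appears only as a point of comparison; it is not part of the methodology described here.

```python
def mse(y_true, y_pred):
    """Mean Squared Error: average of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean Absolute Error: average of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up values, for illustration only
y_true = [100.0, 100.0, 100.0, 100.0]
small_errors = [110.0, 90.0, 110.0, 90.0]    # four errors of size 10
one_large = [100.0, 100.0, 100.0, 140.0]     # a single error of size 40

print(mae(y_true, small_errors), mse(y_true, small_errors))  # 10.0 100.0
print(mae(y_true, one_large), mse(y_true, one_large))        # 10.0 400.0
```

Both sets of predictions have the same MAE, but the MSE of the second one is four times larger, which is the behaviour that motivates the choice of MSE.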

**This explanation could be longer**

Note: as the variable that is going to be predicted is the 'energy', only the variables related to the meteorological characteristics will be used. It is considered that the energy produced does not depend on the moment of the day it is being produced.

## 3. KNN Regressor

First, the KNN algorithm is used for the predictions.

### 3.1. First model



In [4]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.impute import KNNImputer
from sklearn.metrics import r2_score, mean_squared_error
import pandas as pd
import time

# First, data will be divided into train and test set 
# Considering it is a time series, it must be split in an appropriate way
wind_ava['timestamp'] = pd.to_datetime(wind_ava[['year', 'month', 'day', 'hour']])
wind_ava = wind_ava.sort_values(by='timestamp')

train_size = 0.8  # Fraction of the data used for training
split_index = int(len(wind_ava) * train_size)
wind_ava = wind_ava.drop(columns=['timestamp'])

# Divide the data into X_train, y_train, X_test, y_test
train_data = wind_ava.iloc[:split_index]
test_data = wind_ava.iloc[split_index:]

X_train = train_data.drop('energy', axis=1)
y_train = train_data['energy']
X_test = test_data.drop('energy', axis=1)
y_test = test_data['energy']

# Now the first model will be created
first_knn = Pipeline([('imputer',KNNImputer(n_neighbors=3)),('regression',KNeighborsRegressor(n_neighbors=31))])
first_knn.fit(X_train,y_train)
y_predicted = first_knn.predict(X_test)
MSE = mean_squared_error(y_test,y_predicted)

print('MSE: ', MSE)


MSE:  376408.3075368221


## 4. Decision Tree

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.impute import KNNImputer
from sklearn.metrics import r2_score, mean_squared_error
import pandas as pd

# Baseline tree: KNN imputation (a multivariate technique) and no feature selection
# Define the first tree as a Pipeline with an imputation stage followed by a decision tree
first_tree = Pipeline([('imputer',KNNImputer(n_neighbors=3)),('regression',DecisionTreeRegressor())])
first_tree.fit(X_train,y_train)
y_predicted = first_tree.predict(X_test)

MSE = mean_squared_error(y_test,y_predicted)

print('MSE: ', MSE)


### 4.1. Feature selection

For the feature selection, three different cases have been considered:
- Selecting only the features related to the location of the wind farm (Sotavento). These are the features that contain the suffix 13.
- Selecting only the features related to the wind characteristics. These are the features that start with u or v (the eastward and northward components of the wind).
- Selecting only the features that satisfy both conditions, that is, the wind-related features at Sotavento.
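
The selection rules above can be tried out on a handful of column names before applying them to the dataset. This is a minimal sketch: the column names are hypothetical, and `re.search` is used because that is how `DataFrame.filter(regex=...)` matches labels.

```python
import re

# Hypothetical column names following the '<variable>.<location>' pattern
columns = ["energy", "year", "t2m.13", "u10.13", "v100.13",
           "u10.1", "v100.25", "t2m.1"]

# One regular expression per feature-selection case
patterns = {
    "location_13": r"\.13$",           # features measured at location 13
    "wind": r"^(u|v).*$",              # features whose name starts with u or v
    "wind_at_13": r"^(u|v).*\.13$",    # both conditions at once
}

selected = {
    case: [col for col in columns if re.search(regex, col)]
    for case, regex in patterns.items()
}

for case, cols in selected.items():
    print(case, cols)
```

Note that the wind pattern selects any column whose name begins with u or v, so it is worth checking the selected names on the real dataset to confirm that no unrelated variable is included.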

In [6]:
# FIRST OPTION - Selecting only the features that correspond to the location 13 (Sotavento)
X_1 = wind_ava.filter(regex=r'\.13$', axis=1)

# Split into train and test sets
y_1 = wind_ava['energy']

train_size = 0.8  # Fraction of the data used for training
split_index = int(len(X_1) * train_size)

# Divide the data into X_train, y_train, X_test, y_test
X_train_1 = X_1.iloc[:split_index]
X_test_1 = X_1.iloc[split_index:]
y_train_1 = y_1.iloc[:split_index]
y_test_1 = y_1.iloc[split_index:]

# Define the first model as a Pipeline with a preprocessing stage with knn
# first_tree = Pipeline([('imputer',KNNImputer(n_neighbors=3)),('feature_selection',SelectKBest(f_regression)),('regression',DecisionTreeRegressor())])
first_tree = Pipeline([('imputer',KNNImputer(n_neighbors=3)),('regression',DecisionTreeRegressor())])
first_tree.fit(X_train_1,y_train_1)
y_predicted_1 = first_tree.predict(X_test_1)
MSE_1 = mean_squared_error(y_test_1,y_predicted_1)

print('MSE: ', MSE_1)

MSE:  303632.6816254736


In [7]:
# SECOND OPTION - Selecting only the features related to the wind (the ones that start with u or v)
X_2 = wind_ava.filter(regex='^(u|v).*$', axis=1)

# Split into train and test sets
y_2 = wind_ava['energy']

train_size = 0.8  # Fraction of the data used for training
split_index = int(len(X_2) * train_size)

# Divide the data into X_train, y_train, X_test, y_test
X_train_2 = X_2.iloc[:split_index]
X_test_2 = X_2.iloc[split_index:]
y_train_2 = y_2.iloc[:split_index]
y_test_2 = y_2.iloc[split_index:]

# Define the first model as a Pipeline with a preprocessing stage with knn
second_tree = Pipeline([('imputer',KNNImputer(n_neighbors=3)),('regression',DecisionTreeRegressor())])
second_tree.fit(X_train_2,y_train_2)
y_predicted_2 = second_tree.predict(X_test_2)
MSE_2 = mean_squared_error(y_test_2,y_predicted_2)

print('MSE: ', MSE_2)

MSE:  279850.94727757893


In [8]:
from sklearn.metrics import mean_squared_error

# THIRD OPTION - Selecting the features that correspond to magnitudes related to the wind in Sotavento
X_3 = wind_ava.filter(regex=r'^(u|v).*\.13$', axis=1)

# Split into train and test sets
y_3 = wind_ava['energy']

train_size = 0.8  # Fraction of the data used for training
split_index = int(len(X_3) * train_size)

# Divide the data into X_train, y_train, X_test, y_test
X_train_3 = X_3.iloc[:split_index]
X_test_3 = X_3.iloc[split_index:]
y_train_3 = y_3.iloc[:split_index]
y_test_3 = y_3.iloc[split_index:]

# Define the first model as a Pipeline with a preprocessing stage with knn
third_tree = Pipeline([('imputer',KNNImputer(n_neighbors=3)),('regression',DecisionTreeRegressor())])
third_tree.fit(X_train_3,y_train_3)
y_predicted_3 = third_tree.predict(X_test_3)
MSE_3 = mean_squared_error(y_test_3,y_predicted_3)

print('MSE: ', MSE_3)

MSE:  302293.82269831584


The previous results suggest that the error of the model improves when feature selection is applied. The best result (MSE ≈ 279851) is obtained by selecting only the features related to the wind. The second best result (MSE ≈ 302294) is obtained by selecting only the wind-related features at Sotavento, closely followed by selecting all the features at Sotavento (MSE ≈ 303633).

Based on these results, for the decision tree model, only the features related to the wind will be selected (second option). From now on, X_2 and y_2 will be used.

### 4.2. Imputation techniques

Once the feature selection has been implemented, it is time to choose the imputation technique. Automatic techniques will be used (instead of manual imputation). Two types of techniques will be considered:
- Univariate imputation, which only uses the values of the feature that is being imputed. For this, the Simple Imputer will be used, which imputes all the missing values with the mean of the feature. We have chosen the mean instead of the median because there are not many outliers that could affect the distribution of the data.
- Multivariate imputation, which also uses values from other features. Two multivariate imputation techniques are used:
  - KNN Imputer, which is based on the KNN algorithm.
  - Iterative Imputer, which is based on iterative models that estimate the missing values.
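
The difference between the three imputers can be seen on a tiny matrix with a single missing value. This is only a sketch on made-up data: the values, `n_neighbors=2`, `max_iter=10` and `random_state=0` are assumptions chosen for the illustration and are not the settings used on the Sotavento dataset.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to import IterativeImputer)
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

# Toy matrix: the second column is twice the first, with one missing value
X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 6.0],
              [4.0, 8.0]])

imputers = {
    "mean": SimpleImputer(strategy="mean"),           # univariate
    "knn": KNNImputer(n_neighbors=2),                 # multivariate
    "iterative": IterativeImputer(max_iter=10, random_state=0),  # multivariate
}

# Keep only the value that each imputer writes into the missing cell
filled = {name: float(imp.fit_transform(X)[1, 1]) for name, imp in imputers.items()}

for name, value in filled.items():
    print(name, round(value, 3))
```

The univariate imputer only looks at the second column, so it fills the gap with the column mean regardless of the first column. The multivariate imputers can exploit the relationship between both columns.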

In [9]:
from sklearn.impute import KNNImputer, SimpleImputer

# FIRST: one univariate technique will be used for imputation, this is done with the Simple Imputer, using the mean

simple_tree = Pipeline([('imputer',SimpleImputer(strategy = 'mean')),('regression',DecisionTreeRegressor())])

simple_tree.fit(X_train_2,y_train_2)
y_predicted_2 = simple_tree.predict(X_test_2)
MSE_3 = mean_squared_error(y_test_2,y_predicted_2)

print('MSE with SimpleImputer (mean): ', MSE_3)

MSE with SimpleImputer (mean):  307378.97239999997


In [16]:
from sklearn.impute import KNNImputer, SimpleImputer

# SECOND: KNN

knn_tree = Pipeline([('imputer',KNNImputer(n_neighbors=3)),('regression',DecisionTreeRegressor())])

knn_tree.fit(X_train_2,y_train_2)
y_predicted_2 = knn_tree.predict(X_test_2)
MSE_3 = mean_squared_error(y_test_2,y_predicted_2)

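print('MSE with KNNImputer: ', MSE_3)

# Values observed in different runs: 280817.7 and 281559.36
# The MSE changes between runs because DecisionTreeRegressor is created without a
# random_state, so ties between equally good splits are broken at random.

MSE with KNNImputer:  281559.36892273684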


In [18]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

iterative_tree = Pipeline([('imputer',IterativeImputer(max_iter = 14, random_state = 100515585)),('regression',DecisionTreeRegressor())])

iterative_tree.fit(X_train_2,y_train_2)
y_predicted_2 = iterative_tree.predict(X_test_2)
MSE_3 = mean_squared_error(y_test_2,y_predicted_2)

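print('MSE with IterativeImputer: ', MSE_3)

# Values noted from other runs: 258502.783 and 281559.36
# The imputer is seeded, so the variation comes from DecisionTreeRegressor, which is
# created without a random_state (ties between equally good splits are broken at random).

MSE with IterativeImputer:  265675.5389817895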


In [None]:
# Update the version of scikit-learn if it returns an Error with IterativeImputer
!pip install --upgrade scikit-learn --user

## 5. Random Forest Regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor
t_0 = time.time()
# Create the ensemble model pipeline
random_forest = Pipeline([
    ('imputer',KNNImputer(n_neighbors=3)),
    ('regressor', RandomForestRegressor(n_estimators=100, random_state=42)) 
])

# Fit the model on the training data
random_forest.fit(X_train, y_train)

# Predict on the test data
y_pred = random_forest.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

## 6. Bagging Regressor with KNN models

In [None]:
from sklearn.ensemble import BaggingRegressor

# Create the ensemble model pipeline
knn_ensemble_pipeline = Pipeline([
    ('imputer',KNNImputer(n_neighbors=3)),
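    # The base model is passed positionally because the keyword was renamed from
    # base_estimator to estimator in scikit-learn 1.2
    ('regressor', BaggingRegressor(KNeighborsRegressor(n_neighbors=5), n_estimators=10, random_state=42))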
])

# Fit the model on the training data
knn_ensemble_pipeline.fit(X_train, y_train)

# Predict on the test data
y_pred = knn_ensemble_pipeline.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
