# Data Prep

In this notebook we:
- Load the data
- Check for missing values
- Check for duplicates
- Clean the data
- Split the data into train, validation and test sets

## Summary

#### Strategies for splitting the data

Even though most of the columns are floats, several of them have low cardinality (few distinct values). This matters when splitting the data into train, validation and test sets. With a random split, rows that share the same combination of low-cardinality values end up in both the training set and the validation/test sets. This is a form of data leakage: the validation and test scores then mostly measure how well the model predicts for combinations it has already seen, and say little about how it generalizes to unseen values.
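
As a rough illustration of the problem, the sketch below builds a small toy frame (the column names `a` and `b` and their values are made up, not taken from the dataset) and counts how many value combinations land on both sides of a purely random split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame (hypothetical values): two low-cardinality columns, 3 x 2 = 6 combinations
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "a": rng.choice([1.0, 2.0, 3.0], size=200),
    "b": rng.choice([10.0, 20.0], size=200),
})
toy["group_id"] = toy.groupby(["a", "b"]).ngroup()

# Purely random 80/20 split, ignoring the groups
train, test = train_test_split(toy, test_size=0.2, random_state=0)

# Combinations that appear on both sides of the split
shared = set(train["group_id"]) & set(test["group_id"])
print(f"{len(shared)} of {toy['group_id'].nunique()} combinations appear in both train and test")
```

With so few distinct combinations, a random split almost inevitably shares them between the two sets.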

To avoid this problem, we will use the following strategies to split the data:

1. **GroupKFold**: We will group the rows by their combination of values in the low-cardinality columns and split those groups into 5 folds. We will use 4 folds for training and split the remaining fold into validation and test sets. No group appears in more than one set, so the model is evaluated on combinations it has not seen. This gives approximately 80% of the data for training and 10% each for the validation and test sets.
2. **LeaveOneGroupOut**: We will use this strategy to evaluate the model's performance on unseen target values (for the low-cardinality targets `potenciaGeradaTG1_2` and `consumoEspecificoTG1_2`). We will leave two non-zero groups out for the validation and test sets.
3. **RandomSplit**: We will use this strategy to split the data into training, validation and test sets only for comparison purposes.
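
The second strategy is not implemented further down in this notebook, so here is a minimal sketch of how scikit-learn's `LeaveOneGroupOut` behaves, assuming the groups are defined by the distinct values of a low-cardinality target. The arrays are toy data invented for illustration.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Toy data (hypothetical): 8 samples, 4 groups of 2 samples each
X = np.arange(16).reshape(8, 2)
y = np.array([0.0, 0.0, 13.9, 13.9, 16.2, 16.2, 17.1, 17.1])
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3])

logo = LeaveOneGroupOut()
for train_idx, held_out_idx in logo.split(X, y, groups):
    held_out_group = np.unique(groups[held_out_idx])
    # The held-out group never appears among the training groups
    assert not set(held_out_group) & set(groups[train_idx])
    print(f"held-out group {held_out_group[0]}: "
          f"{len(train_idx)} train / {len(held_out_idx)} held-out samples")
```

The splitter yields one split per group. Holding out two groups, as described above, would mean picking two of them by hand (or using `LeavePGroupsOut` with `n_groups=2`).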

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import GroupKFold

In [2]:
df = pd.read_json('./data/raw/SimulationResult.json')
df.head()

Unnamed: 0,step,vazaoVapor,pressaoVapor,temperaturaVapor,cargaVaporTG1,cargaVaporTG2,habilitaTG1,habilitaTG2,potenciaGeradaTG1_2,potenciaGeradaTG2_2,potenciaGeradaTG2_1,potenciaGeradaTG1_1,vazaoVaporEscape,temperaturaVaporEscape,pressaoVaporEscape,consumoEspecificoTG2_2,consumoEspecificoTG2_1,consumoEspecificoTG1_2,consumoEspecificoTG1_1,status
0,0,273.0,57.0,718.0,107.0,53.0,0,0,0.0,0.0,0.0,0.0,298.992329,403.15,2.3,0.0,0.0,0.0,0.0,OK
1,1,273.0,57.0,718.0,107.0,53.0,0,1,0.0,2.743378,7.277625,0.0,294.620543,403.15,2.3,12.028968,7.282596,0.0,0.0,OK
2,2,273.0,57.0,718.0,107.0,53.0,1,0,13.876143,0.0,0.0,3.107132,254.569586,403.15,2.3,0.0,0.0,7.711077,34.436903,OK
3,3,273.0,57.0,718.0,107.0,53.0,1,1,13.876143,2.743378,7.277625,3.107132,254.569586,403.15,2.3,12.028968,7.282596,7.711077,34.436903,OK
4,4,273.0,57.0,718.0,107.0,61.375,0,0,0.0,0.0,0.0,0.0,288.558739,403.15,2.3,0.0,0.0,0.0,0.0,OK


As we discussed in the Exploratory Data Analysis, we are going to:
- Drop the rows with a failed simulation status (`status` == "Falha na simulação")
- Drop the rows with ``potenciaGeradaTG2_2`` < 0 (this column represents the power generated by the turbine TG2_2, which can't be negative)
- Drop the targets ``temperaturaVaporEscape`` and ``pressaoVaporEscape`` (they are constant, so they carry no information and only hinder the model from learning)
- Create a GroupKFold train-test split, as we learned that many columns of the dataset have low cardinality, and we want to make sure that the model can generalize well to unseen data (i.e., we want to test the model on groups it hasn't seen in training)

In [3]:
features = ['vazaoVapor', 'pressaoVapor', 'temperaturaVapor',
            'cargaVaporTG1', 'cargaVaporTG2', 'habilitaTG1', 'habilitaTG2']

In [4]:
targets = ['consumoEspecificoTG1_1', 'consumoEspecificoTG1_2',
           'consumoEspecificoTG2_1', 'consumoEspecificoTG2_2',
           'potenciaGeradaTG1_1', 'potenciaGeradaTG1_2',
           'potenciaGeradaTG2_1', 'potenciaGeradaTG2_2',
           'vazaoVaporEscape']

In [5]:
boolean_columns = ['habilitaTG1', 'habilitaTG2']

In [6]:
dataset = (df
        .query('status == "OK" and potenciaGeradaTG2_2 >= 0')
        .drop(columns=['status', 'step','temperaturaVaporEscape', 'pressaoVaporEscape']))

In [7]:
dataset.to_csv('./data/processed/dataset.csv', index=False)

In [8]:
# Check for duplicates
dataset.duplicated().sum()

0

In [9]:
# Cardinality of each column
dataset.nunique()

vazaoVapor                    9
pressaoVapor                  5
temperaturaVapor              5
cargaVaporTG1                 9
cargaVaporTG2                 9
habilitaTG1                   2
habilitaTG2                   2
potenciaGeradaTG1_2          10
potenciaGeradaTG2_2          34
potenciaGeradaTG2_1         826
potenciaGeradaTG1_1         226
vazaoVaporEscape          58054
consumoEspecificoTG2_2       34
consumoEspecificoTG2_1      826
consumoEspecificoTG1_2       10
consumoEspecificoTG1_1      226
dtype: int64

In [10]:
dataset.head()

Unnamed: 0,vazaoVapor,pressaoVapor,temperaturaVapor,cargaVaporTG1,cargaVaporTG2,habilitaTG1,habilitaTG2,potenciaGeradaTG1_2,potenciaGeradaTG2_2,potenciaGeradaTG2_1,potenciaGeradaTG1_1,vazaoVaporEscape,consumoEspecificoTG2_2,consumoEspecificoTG2_1,consumoEspecificoTG1_2,consumoEspecificoTG1_1
0,273.0,57.0,718.0,107.0,53.0,0,0,0.0,0.0,0.0,0.0,298.992329,0.0,0.0,0.0,0.0
1,273.0,57.0,718.0,107.0,53.0,0,1,0.0,2.743378,7.277625,0.0,294.620543,12.028968,7.282596,0.0,0.0
2,273.0,57.0,718.0,107.0,53.0,1,0,13.876143,0.0,0.0,3.107132,254.569586,0.0,0.0,7.711077,34.436903
3,273.0,57.0,718.0,107.0,53.0,1,1,13.876143,2.743378,7.277625,3.107132,254.569586,12.028968,7.282596,7.711077,34.436903
4,273.0,57.0,718.0,107.0,61.375,0,0,0.0,0.0,0.0,0.0,288.558739,0.0,0.0,0.0,0.0


In [11]:
low_cardinality_columns = [col for col in dataset.columns if dataset[col].nunique() < 15 and col not in boolean_columns]

print(f'{len(low_cardinality_columns)} columns have low cardinality')
print(f'Percentage of columns with low cardinality: {len(low_cardinality_columns) / len(dataset.columns) * 100:.2f}%')

7 columns have low cardinality
Percentage of columns with low cardinality: 43.75%


In [12]:
low_card_desc = (dataset[low_cardinality_columns]
 .nunique()
 .reset_index()
 .rename(columns={ 'index': 'column', 0: 'nunique' }))
low_card_desc['type'] = low_card_desc['column'].map(lambda x: 'feature' if x in features else 'target')
low_card_desc = low_card_desc[['column', 'type', 'nunique']]
low_card_desc

Unnamed: 0,column,type,nunique
0,vazaoVapor,feature,9
1,pressaoVapor,feature,5
2,temperaturaVapor,feature,5
3,cargaVaporTG1,feature,9
4,cargaVaporTG2,feature,9
5,potenciaGeradaTG1_2,target,10
6,consumoEspecificoTG1_2,target,10


In [13]:
# Add id to the group of low cardinality columns
dataset['group_id'] = dataset.groupby(low_cardinality_columns).ngroup()
n_groups = dataset['group_id'].nunique()
print(f'There are {n_groups:,} unique combinations of floats in the dataset ({n_groups/len(dataset):.1%} of the dataset)')

There are 36,450 unique combinations of floats in the dataset (50.6% of the dataset)
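
To make the `group_id` construction concrete, here is a small sketch of what `groupby(...).ngroup()` does on a toy frame. The column names `pressure` and `load` are hypothetical stand-ins for the low-cardinality columns.

```python
import pandas as pd

# Toy frame (hypothetical columns): 5 rows, 3 distinct (pressure, load) combinations
toy = pd.DataFrame({
    "pressure": [57.0, 57.0, 58.5, 58.5, 57.0],
    "load":     [107.0, 127.5, 107.0, 107.0, 107.0],
})

# Each distinct combination of values gets one integer id;
# rows with identical values share the same id
toy["group_id"] = toy.groupby(["pressure", "load"]).ngroup()
print(toy)
```

Rows 0 and 4 share one id and rows 2 and 3 share another, so a group-aware splitter will always keep each pair in the same set.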


In [14]:
X, y = dataset[features], dataset[targets]

In [22]:
group_kfold = GroupKFold(n_splits=5)
group_kfold.get_n_splits(X, y, dataset['group_id'])

5

In [16]:
for i, (train_index, test_index) in enumerate(group_kfold.split(X, y, dataset['group_id'])):
    print(f'Fold {i}:')
    print(f'Train: {len(train_index)} samples')
    print(f'Test: {len(test_index)} samples')
    print(f'Groups in train: {dataset.iloc[train_index]["group_id"].nunique()}')
    print(f'Groups in test: {dataset.iloc[test_index]["group_id"].nunique()}')
    print()

Fold 0:
Train: 57576 samples
Test: 14395 samples
Groups in train: 29160
Groups in test: 7290

Fold 1:
Train: 57577 samples
Test: 14394 samples
Groups in train: 29160
Groups in test: 7290

Fold 2:
Train: 57577 samples
Test: 14394 samples
Groups in train: 29160
Groups in test: 7290

Fold 3:
Train: 57577 samples
Test: 14394 samples
Groups in train: 29160
Groups in test: 7290

Fold 4:
Train: 57577 samples
Test: 14394 samples
Groups in train: 29160
Groups in test: 7290



In [17]:
# Train set and held-out set (later split into validation and test), split by group
train_index, test_index = next(group_kfold.split(X, y, dataset['group_id']))

X_train, y_train = X.iloc[train_index], y.iloc[train_index]
X_test, y_test = X.iloc[test_index], y.iloc[test_index]

assert len(X_train) + len(X_test) == len(X)
assert len(y_train) + len(y_test) == len(y)
assert len(X) == len(y)
assert len(X) == len(dataset)

print(f'Train: {len(X_train)} samples')

Train: 57576 samples


In [18]:
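# Split the held-out fold in half, by group, into validation and test sets
group_kfold = GroupKFold(n_splits=2)

# Recompute the group ids on the held-out fold only
X_test_group_id = pd.concat([X_test, y_test], axis='columns').groupby(low_cardinality_columns).ngroup()

# In the first split, the larger "train" half becomes the final test set
# and the "test" half becomes the validation set
test_index, val_index = next(group_kfold.split(X_test, y_test, X_test_group_id))

X_val, y_val = X_test.iloc[val_index], y_test.iloc[val_index]
X_test, y_test = X_test.iloc[test_index], y_test.iloc[test_index]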

print(f'Validation: {len(X_val)} samples')
print(f'Test: {len(X_test)} samples')

Validation: 7198 samples
Test: 7197 samples
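
A quick sanity check for this kind of two-stage split is to confirm that no group id appears in more than one set. The sketch below reproduces the same outer 5-fold and inner 2-fold procedure on synthetic data (the array sizes and the 50 group ids are assumptions made for illustration) and asserts that the three sets are pairwise disjoint.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Synthetic data (hypothetical): 1000 samples spread over up to 50 group ids
rng = np.random.default_rng(0)
groups = rng.integers(0, 50, size=1000)
X = rng.normal(size=(1000, 3))

# Outer split: roughly 4/5 of the data for training, 1/5 held out
train_idx, held_idx = next(GroupKFold(n_splits=5).split(X, groups=groups))

# Inner split: halve the held-out groups into test and validation
inner_test, inner_val = next(
    GroupKFold(n_splits=2).split(X[held_idx], groups=groups[held_idx])
)
val_idx, test_idx = held_idx[inner_val], held_idx[inner_test]

# No group id may appear in more than one of the three sets
train_g, val_g, test_g = (set(groups[i]) for i in (train_idx, val_idx, test_idx))
assert train_g.isdisjoint(val_g)
assert train_g.isdisjoint(test_g)
assert val_g.isdisjoint(test_g)
print(len(train_idx), len(val_idx), len(test_idx))
```

The same three `isdisjoint` checks could be applied to the real `group_id` column, restricted to the train, validation and test rows.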


In [19]:
low_card_values = {
    col: dataset[col].unique()
    for col in low_cardinality_columns
}

In [20]:
low_card_values

{'vazaoVapor': array([273.   , 290.875, 308.75 , 326.625, 344.5  , 362.375, 380.25 ,
        398.125, 416.   ]),
 'pressaoVapor': array([57. , 58.5, 60. , 61.5, 63. ]),
 'temperaturaVapor': array([718.  , 757.75, 797.5 , 837.25, 877.  ]),
 'cargaVaporTG1': array([107. , 127.5, 148. , 168.5, 189. , 209.5, 230. , 250.5, 271. ]),
 'cargaVaporTG2': array([ 53.   ,  61.375,  69.75 ,  78.125,  86.5  ,  94.875, 103.25 ,
        111.625, 120.   ]),
 'potenciaGeradaTG1_2': array([ 0.      , 13.876143, 16.187112, 17.105682, 18.317565, 20.189402,
        21.626417, 21.611867, 22.235001, 26.134906]),
 'consumoEspecificoTG1_2': array([ 0.      ,  7.711077,  7.876637,  8.652096,  9.198821,  9.361347,
         9.687226, 10.642301, 11.266021, 10.369274])}