# MLP Model — Emission Point Classification

This notebook develops a Multilayer Perceptron (MLP) model to classify the point of particle emission (E1, E2, E3) from aggregated features of simulated sensor data. The dataset used is `aggregated_dataset.csv`, where each row represents one simulation with statistical summaries of its sensor readings.

The primary goals of this notebook are to:

- Prepare the dataset for classification;
- Normalize and encode the data effectively;
- Apply robust training-validation strategies;
- Build and evaluate an efficient MLP architecture;
- Report classification performance using cross-validation.

---
## Notebook Structure

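1. Importing Libraries and Configuring GPU
2. Loading the Dataset and Preparing the Modeling Subset
3. Feature Selection, Label Encoding, and Normalization
4. Defining the Hyperparameter Grid and MLP Model Architecture
5. Grid Search with Stratified K-Fold Cross-Validation
6. Summary of Results and Evaluation of the MLP Model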

---
The implementation relies on stratified k-fold cross-validation for model validation and on a small grid search for hyperparameter tuning.

### 1. Importing Libraries and Configuring GPU

This block imports all the required libraries for data handling, preprocessing, evaluation, model building, and training. It includes tools for:

- Numerical and tabular data manipulation (`numpy`, `pandas`)
- Visualization (`matplotlib`, `seaborn`)
- Machine learning tools (`scikit-learn`)
- Deep learning with Keras and TensorFlow
- Progress tracking using `tqdm`
- Parameter grid creation with `product` from `itertools`

Additionally, the code checks whether a GPU is available and enables memory growth on the first detected GPU, so that TensorFlow allocates GPU memory incrementally instead of reserving all of it at startup. Training then runs on the GPU when one is present and falls back to the CPU otherwise.

In [1]:
from itertools import product

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import accuracy_score, f1_score, classification_report
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
from tqdm import tqdm

# GPU config
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    print("GPU detected! Device: ", gpus[0])
    tf.config.experimental.set_memory_growth(gpus[0], True)
else:
    print("No GPU detected. Using CPU instead.")

GPU detected! Device:  PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')


### 2. Loading the Dataset and Preparing the Modeling Subset

This section performs initial data preparation:

- Loads the full dataset (`complete_dataset.csv`) containing all sensor readings and environmental data.
- Removes constant or non-informative sensor columns based on a precompiled list (`sensor_constantes.txt`).
- Separates 10% of the dataset into a `df_gen` subset, which is reserved exclusively for final generalization testing and **never used during training, validation, or hyperparameter tuning**.
- The remaining 90% is stored in `df_modeling` and will be used for all modeling experiments and cross-validation.
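
The list in `sensor_constantes.txt` is precompiled, and how it was produced is not shown in this notebook. As a hedged illustration only, one plausible way to derive such a list is to flag every column that holds a single distinct value. The toy frame and its column names below are assumptions made up for the example.

```python
import pandas as pd

# Toy frame; the column names are hypothetical, not taken from the real dataset.
toy = pd.DataFrame({
    "sensor_a": [0.1, 0.4, 0.3, 0.9],
    "sensor_b": [0.0, 0.0, 0.0, 0.0],   # constant, carries no information
    "sensor_c": [5, 5, 5, 5],           # constant
    "classe":   ["E1", "E2", "E3", "E1"],
})

# A column is treated as constant when it has at most one distinct value.
n_unique = toy.nunique(dropna=False)
constant_cols = n_unique[n_unique <= 1].index.tolist()

# errors='ignore' mirrors the notebook: names missing from the frame are skipped.
toy_clean = toy.drop(columns=constant_cols, errors="ignore")

print(constant_cols)            # ['sensor_b', 'sensor_c']
print(list(toy_clean.columns))  # ['sensor_a', 'classe']
```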

In [2]:
# Load dataset
df = pd.read_csv('../data/processed/complete_dataset.csv')

# Removing constant sensors
with open('../data/processed/sensor_constantes.txt', 'r') as f:
    constant_sensors = [line.strip() for line in f.readlines()]

df = df.drop(columns=constant_sensors, errors='ignore')

# Split 10% for final generalization (do not touch during modeling)
df_full = df.copy()
df_gen = df_full.sample(frac=0.10, random_state=42)
df_modeling = df_full.drop(df_gen.index).reset_index(drop=True)
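
The split above depends on dropping the sampled rows by index before the index is reset. The sketch below checks that pattern on a toy frame (the column `x` and the class proportions are made up) and also shows a stratified variant with `train_test_split`. The stratified variant is an alternative named here for comparison, not what the notebook does.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 rows with a unique id column and an imbalanced label.
toy = pd.DataFrame({"x": np.arange(100), "classe": ["E1", "E2", "E3", "E1"] * 25})

# Same pattern as the notebook: sample 10%, then drop those rows by index.
holdout = toy.sample(frac=0.10, random_state=42)
modeling = toy.drop(holdout.index).reset_index(drop=True)

# The two subsets must partition the data with no shared rows.
overlap = set(holdout["x"]) & set(modeling["x"])
print(len(holdout), len(modeling), len(overlap))  # 10 90 0

# Alternative (not used in the notebook): keep class proportions in both parts.
modeling_s, holdout_s = train_test_split(
    toy, test_size=0.10, stratify=toy["classe"], random_state=42
)
print(len(holdout_s), len(modeling_s))  # 10 90
```

With plain random sampling the class shares in `df_gen` can drift from those in `df_modeling`; stratifying removes that source of variation at the cost of one extra dependency on the label column.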

### 3. Feature Selection, Label Encoding, and Normalization

This block performs three key preprocessing steps:

- **Feature Selection**: Drops the target column `classe`, the simulation identifier `tag`, and `Altura` (height) from the input features `X`. The `classe` column is kept separately as the label vector `y`.
- **Label Encoding**: Converts the categorical class labels (`E1`, `E2`, `E3`) into numerical format using `LabelEncoder`, resulting in `y_encoded`. This is necessary for model compatibility.
- **Normalization**: Applies `StandardScaler` to standardize all input features (`X_scaled`) with zero mean and unit variance. This step improves training convergence and model stability.

In [3]:
# Feature Selection
X = df_modeling.drop(columns=['classe', 'tag', 'Altura'], errors='ignore')
y = df_modeling['classe']

In [4]:
# Encode
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)
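
As a small illustration of what the encoder produces, the sketch below fits a `LabelEncoder` on a made-up list of labels. `LabelEncoder` sorts the distinct labels, so `E1`, `E2`, `E3` map to `0`, `1`, `2`, and `inverse_transform` recovers the original names from predicted integers.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["E2", "E1", "E3", "E1"]   # made-up sample of class labels

enc = LabelEncoder()
codes = enc.fit_transform(labels)

print(enc.classes_.tolist())                     # ['E1', 'E2', 'E3']
print(codes.tolist())                            # [1, 0, 2, 0]
print(enc.inverse_transform([2, 0]).tolist())    # ['E3', 'E1']
```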

In [5]:
# Normalization
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
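
Note that the scaler above is fit on all of `df_modeling` before any cross-validation split, so the rows later used for validation also contribute to the mean and variance. The effect is probably small, but a stricter variant fits the scaler inside each fold on the training rows only. The following is a minimal sketch of that variant on synthetic data; the array names are placeholders, and in this notebook `X.values` and `y_encoded` would take their place.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_raw = rng.normal(loc=5.0, scale=2.0, size=(60, 4))  # synthetic, unscaled features
y_toy = np.repeat([0, 1, 2], 20)                      # synthetic integer labels

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
train_means = []

for train_idx, val_idx in kfold.split(X_raw, y_toy):
    fold_scaler = StandardScaler()
    # Statistics come from the training rows of this fold only ...
    X_train = fold_scaler.fit_transform(X_raw[train_idx])
    # ... and are reused unchanged on the validation rows.
    X_val = fold_scaler.transform(X_raw[val_idx])
    train_means.append(X_train.mean(axis=0))

# Training features are centered exactly; validation features only approximately.
print(np.allclose(train_means, 0.0))  # True
```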

### 4. Defining the Hyperparameter Grid and MLP Model Architecture

In this section:

- A **hyperparameter grid** (`param_grid`) is defined for tuning the Multi-Layer Perceptron (MLP) model. It includes:
  - `units1`: number of neurons in the first hidden layer
  - `units2`: number of neurons in the second hidden layer
  - `lr`: learning rate for the optimizer

- The function `build_model()` constructs a Sequential MLP model with the specified architecture:
  - Two hidden layers with ReLU activation
  - An output layer with 3 neurons (for 3-class classification) and softmax activation
  - The model is compiled using the Adam optimizer and sparse categorical cross-entropy loss, suitable for integer-labeled targets.
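
To make the output layer and loss concrete, here is a NumPy sketch of softmax followed by sparse categorical cross-entropy, where the targets are integer class indices rather than one-hot vectors. It is a simplified illustration, not Keras's implementation; a framework would typically also guard against `log(0)`.

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum for numerical stability; the result is unchanged.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def sparse_categorical_crossentropy(y_true, probs):
    # Pick the predicted probability of the true class for each row.
    picked = probs[np.arange(len(y_true)), y_true]
    return -np.log(picked).mean()

logits = np.array([[2.0, 0.5, 0.1],    # fairly confident in class 0
                   [0.0, 0.0, 0.0]])   # no preference: uniform over 3 classes
y_true = np.array([0, 2])              # integer labels, as produced by LabelEncoder

probs = softmax(logits)
loss = sparse_categorical_crossentropy(y_true, probs)

print(probs.sum(axis=1))   # each row sums to 1
print(round(float(loss), 4))
```

A model that outputs uniform probabilities over three classes has a per-sample loss of ln 3 ≈ 1.0986, which is a useful reference point when reading training logs.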

In [6]:
param_grid = {
    'units1': [64, 128],
    'units2': [32, 64],
    'lr': [0.001, 0.0005]
}

def build_model(input_dim, units1, units2, lr):
    model = Sequential([
        Dense(units1, activation='relu', input_shape=(input_dim,)),
        Dense(units2, activation='relu'),
        Dense(3, activation='softmax')
    ])

    model.compile(optimizer=Adam(learning_rate=lr),
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

    return model
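
For a sense of how small these networks are, the trainable parameter count of the two-hidden-layer MLP can be worked out by hand: each `Dense` layer has `(inputs + 1) * units` parameters (weights plus one bias per unit). The sketch below applies that formula to the grid; `input_dim = 20` is a hypothetical value, since the real number of features is not stated here.

```python
def mlp_param_count(input_dim, units1, units2, n_classes=3):
    """Trainable parameters of Dense(units1) -> Dense(units2) -> Dense(n_classes)."""
    layer1 = (input_dim + 1) * units1   # weights + biases of the first hidden layer
    layer2 = (units1 + 1) * units2      # second hidden layer
    output = (units2 + 1) * n_classes   # softmax output layer
    return layer1 + layer2 + output

input_dim = 20  # hypothetical feature count, for illustration only

for u1 in (64, 128):
    for u2 in (32, 64):
        print(u1, u2, mlp_param_count(input_dim, u1, u2))
# 64 32 3523
# 64 64 5699
# 128 32 6915
# 128 64 11139
```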

### 5. Grid Search with Stratified K-Fold Cross-Validation

This block performs an exhaustive **grid search** over all combinations of the defined hyperparameters (`units1`, `units2`, and `lr`) to find the best MLP configuration.

For each combination:

- A **5-fold stratified cross-validation** is applied using `StratifiedKFold` to preserve class distribution across folds.
- The model is trained on each fold using the `build_model()` function.
- **Two evaluation metrics** are computed for each fold:
  - **Accuracy**
  - **Weighted F1-Score**, the mean of the per-class F1-scores weighted by class support, so that it reflects class imbalance
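
The relationship between the two metrics can be seen on a toy example. The labels below are made up: a predictor that always outputs the majority class reaches 60% accuracy, while its weighted F1-score is noticeably lower because two of the three classes score zero.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Imbalanced toy labels: 6 samples of class 0, 2 of class 1, 2 of class 2.
y_true = np.array([0] * 6 + [1] * 2 + [2] * 2)
y_pred = np.zeros(10, dtype=int)          # always predicts the majority class

acc = accuracy_score(y_true, y_pred)

# Per-class F1, then the support-weighted mean computed by hand.
per_class = f1_score(y_true, y_pred, average=None, zero_division=0)
support = np.bincount(y_true)
manual_weighted = float(np.sum(per_class * support) / support.sum())

# The same quantity straight from scikit-learn.
weighted = f1_score(y_true, y_pred, average="weighted", zero_division=0)

print(f"accuracy    = {acc:.3f}")              # accuracy    = 0.600
print(f"weighted F1 = {weighted:.3f}")         # weighted F1 = 0.450
print(f"manual      = {manual_weighted:.3f}")  # manual      = 0.450
```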

After evaluating all folds for a given configuration:
- The average F1-score and accuracy are stored.
- The configuration with the **highest average F1-score** is tracked as the best.

Progress is monitored using `tqdm`, and intermediate training information is printed for traceability.
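
The two building blocks of this search can be checked in isolation. The sketch below expands the grid into explicit configurations with `itertools.product` and verifies on synthetic labels (the 50/30/20 class counts are made up) that `StratifiedKFold` keeps the class proportions in every validation fold.

```python
from itertools import product

import numpy as np
from sklearn.model_selection import StratifiedKFold

param_grid = {"units1": [64, 128], "units2": [32, 64], "lr": [0.001, 0.0005]}

# Expand the grid into one dict per configuration: 2 x 2 x 2 = 8.
keys = list(param_grid)
configs = [dict(zip(keys, values)) for values in product(*param_grid.values())]
print(len(configs))   # 8
print(configs[0])     # {'units1': 64, 'units2': 32, 'lr': 0.001}

# Synthetic labels with a 50/30/20 class split.
y_toy = np.array([0] * 50 + [1] * 30 + [2] * 20)
X_toy = np.zeros((len(y_toy), 1))   # features are irrelevant for the split itself

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_counts = [
    np.bincount(y_toy[val_idx], minlength=3).tolist()
    for _, val_idx in kfold.split(X_toy, y_toy)
]
print(fold_counts)    # every fold holds [10, 6, 4]

# Total number of model fits in the search: configurations x folds.
print(len(configs) * kfold.get_n_splits())   # 40
```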

In [10]:
# Every (units1, units2, lr) combination in the grid: 2 x 2 x 2 = 8 configurations
param_combinations = list(product(param_grid['units1'], param_grid['units2'], param_grid['lr']))

results = []
best_score = 0
best_config = None

for u1, u2, lr in tqdm(param_combinations, desc="GridSearch Progress", total=len(param_combinations), dynamic_ncols=True):
    tqdm.write(f"Training configuration: units1={u1}, units2={u2}, lr={lr}")

    # Same seeded splitter for every configuration, so all are compared on identical folds
    kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    f1_scores = []
    acc_scores = []

    for fold, (train_idx, val_idx) in enumerate(kfold.split(X_scaled, y_encoded), start=1):
        tqdm.write(f"   → Fold {fold}/5")

        X_train, X_val = X_scaled[train_idx], X_scaled[val_idx]
        y_train, y_val = y_encoded[train_idx], y_encoded[val_idx]

        # A fresh model per fold, so no weights carry over between folds
        model = build_model(X_train.shape[1], u1, u2, lr)
        model.fit(X_train, y_train, epochs=30, batch_size=32, verbose=0)

        # Softmax probabilities -> predicted class index
        y_pred = model.predict(X_val)
        y_pred_class = np.argmax(y_pred, axis=1)

        acc = accuracy_score(y_val, y_pred_class)
        f1 = f1_score(y_val, y_pred_class, average='weighted')

        acc_scores.append(acc)
        f1_scores.append(f1)

    # Average the fold metrics for this configuration
    avg_acc = np.mean(acc_scores)
    avg_f1 = np.mean(f1_scores)

    results.append((u1, u2, lr, avg_acc, avg_f1))

    # Track the configuration with the highest mean weighted F1
    if avg_f1 > best_score:
        best_score = avg_f1
        best_config = (u1, u2, lr)

GridSearch Progress:   0%|                                                                       | 0/8 [00:00<?, ?it/s]

Training configuration: units1=64, units2=32, lr=0.001
   → Fold 1/5
   → Fold 2/5
   → Fold 3/5
   → Fold 4/5
   → Fold 5/5
GridSearch Progress:  12%|███████▍                                                   | 1/8 [37:35<4:23:11, 2255.96s/it]

Training configuration: units1=64, units2=32, lr=0.0005
   → Fold 1/5
   → Fold 2/5
   → Fold 3/5
   → Fold 4/5
   → Fold 5/5
GridSearch Progress:  25%|██████████████▎                                          | 2/8 [1:15:36<3:47:01, 2270.17s/it]

Training configuration: units1=64, units2=64, lr=0.001
   → Fold 1/5
   → Fold 2/5
   → Fold 3/5
   → Fold 4/5
   → Fold 5/5
GridSearch Progress:  38%|█████████████████████▍                                   | 3/8 [1:55:04<3:12:54, 2314.95s/it]

Training configuration: units1=64, units2=64, lr=0.0005
   → Fold 1/5
   → Fold 2/5
   → Fold 3/5
   → Fold 4/5
   → Fold 5/5
GridSearch Progress:  50%|████████████████████████████▌                            | 4/8 [2:26:36<2:23:11, 2147.94s/it]

Training configuration: units1=128, units2=32, lr=0.001
   → Fold 1/5
   → Fold 2/5
   → Fold 3/5
   → Fold 4/5
   → Fold 5/5
GridSearch Progress:  62%|███████████████████████████████████▋                     | 5/8 [2:56:01<1:40:29, 2009.89s/it]

Training configuration: units1=128, units2=32, lr=0.0005
   → Fold 1/5
   → Fold 2/5
   → Fold 3/5
   → Fold 4/5
   → Fold 5/5
GridSearch Progress:  75%|██████████████████████████████████████████▊              | 6/8 [3:25:21<1:04:09, 1924.90s/it]

Training configuration: units1=128, units2=64, lr=0.001
   → Fold 1/5
   → Fold 2/5
   → Fold 3/5
   → Fold 4/5
   → Fold 5/5
GridSearch Progress:  88%|███████████████████████████████████████████████████▋       | 7/8 [3:54:42<31:11, 1871.33s/it]

Training configuration: units1=128, units2=64, lr=0.0005
   → Fold 1/5
   → Fold 2/5
   → Fold 3/5
   → Fold 4/5
   → Fold 5/5
GridSearch Progress: 100%|███████████████████████████████████████████████████████████| 8/8 [4:24:07<00:00, 1980.96s/it]


In [11]:
df_results = pd.DataFrame(results, columns=['units1', 'units2', 'lr', 'Mean Accuracy', 'Mean F1-Score'])
df_results.sort_values(by='Mean F1-Score', ascending=False, inplace=True)

print("\n🏆 Best Configuration Found:")
print(f"   ➤ units1 = {best_config[0]}")
print(f"   ➤ units2 = {best_config[1]}")
print(f"   ➤ learning rate = {best_config[2]}")
print(f"   ➤ mean F1 = {best_score:.4f}")
print("\n📊 Full results table:")
print(df_results)


🏆 Best Configuration Found:
   ➤ units1 = 64
   ➤ units2 = 64
   ➤ learning rate = 0.001
   ➤ mean F1 = 0.3252

📊 Full results table:
   units1  units2      lr  Mean Accuracy  Mean F1-Score
2      64      64  0.0010       0.459579       0.325168
3      64      64  0.0005       0.459238       0.324734
1      64      32  0.0005       0.458868       0.324374
7     128      64  0.0005       0.458736       0.324349
4     128      32  0.0010       0.459001       0.324339
5     128      32  0.0005       0.458617       0.324259
0      64      32  0.0010       0.458755       0.323965
6     128      64  0.0010       0.458419       0.323790


### 6. Summary of Results and Evaluation of the MLP Model

After performing an extensive grid search with stratified 5-fold cross-validation, the best MLP configuration found was:

- **units1**: 64 neurons  
- **units2**: 64 neurons  
- **learning rate**: 0.001  
- **Average F1-Score**: 0.325  
- **Average Accuracy**: 0.459  

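Despite the validation strategy and hyperparameter tuning, the model's performance remained poor. The average weighted F1-score (0.325) is on par with the ≈33% expected from random guessing on a balanced 3-class problem, and the average accuracy (0.459) is only moderately higher. An accuracy well above the F1-score is consistent with predictions concentrating on one class, although per-class results are not reported here.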
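
The ≈33% figure is the right reference only when the three classes are equally frequent. A more informative baseline is a classifier that always predicts the most frequent class. The sketch below computes that baseline with scikit-learn's `DummyClassifier` on synthetic labels; the 46/30/24 class shares are hypothetical and chosen only for illustration, since the real class distribution is not shown in this notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Synthetic labels with made-up proportions (46% / 30% / 24%).
y_toy = np.array([0] * 46 + [1] * 30 + [2] * 24)
X_toy = np.zeros((len(y_toy), 1))   # features are ignored by this baseline

baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
y_base = baseline.predict(X_toy)

base_acc = accuracy_score(y_toy, y_base)
base_f1 = f1_score(y_toy, y_base, average="weighted", zero_division=0)

print(f"baseline accuracy    = {base_acc:.3f}")   # baseline accuracy    = 0.460
print(f"baseline weighted F1 = {base_f1:.3f}")    # baseline weighted F1 = 0.290
```

Running the same two lines with `y_encoded` in place of `y_toy` would give the actual majority-class baseline against which the cross-validated scores can be compared.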

#### 🔍 Key Findings:

- The model **failed to effectively distinguish between the emission classes (E1, E2, E3)**.
- The F1-score varies by less than 0.002 across the eight tested configurations (0.3238 to 0.3252), which suggests that **the MLP architecture is not the main limiting factor**.
- The input features used lacked temporal structure. Since the particle emission classes appear to differ primarily through **temporal patterns**, a purely tabular MLP was unable to capture these dynamics.
- The performance is likely also affected by **partial overlap between classes** (as identified in the EDA) and possibly **class imbalance**.
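
Which classes are being confused can be read from a confusion matrix and a per-class report, which this notebook does not compute. As a hedged sketch, the snippet below shows both on made-up predictions; in the grid-search loop, `y_val`, `y_pred_class`, and `encoder.classes_` would take the place of the toy arrays.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

class_names = ["E1", "E2", "E3"]

# Made-up validation labels and predictions, for illustration only.
y_val_toy = np.array([0, 0, 0, 1, 1, 2, 2, 2])
y_hat_toy = np.array([0, 0, 1, 0, 1, 0, 2, 2])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_val_toy, y_hat_toy, labels=[0, 1, 2])
print(cm)
# [[2 1 0]
#  [1 1 0]
#  [1 0 2]]

# Precision, recall, and F1 per class, plus macro and weighted averages.
print(classification_report(
    y_val_toy, y_hat_toy, labels=[0, 1, 2],
    target_names=class_names, zero_division=0,
))
```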

#### ❌ Conclusion:

The MLP model, even with tuned hyperparameters and rigorous validation, is **not suitable** for this task due to the **lack of temporal modeling** and the nature of the dataset.  
We will proceed to more appropriate architectures, such as **LSTM** and **CNN**, which are capable of learning from sequential patterns in the data.