**Ioannis Michalainas** (AEM: 10902) and **Maria Charisi** (AEM: 10727)

# PART D

We begin by importing the necessary libraries: **pandas** for loading the datasets and **numpy** for saving the predictions.

In [1]:
import pandas as pd
import numpy as np

Then, we import some functions from *scikit-learn*: **MLPClassifier** (Multilayer Perceptron) for the model, **StandardScaler** for scaling the data and **cross_val_score** for cross-validation.

In [2]:
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

We declare some *constants*: the number of features and the paths to the train and test sets.

In [3]:
NUM_FEATURES = 224
TRAIN_DATA = '../datasets/datasetTV.csv'
TEST_DATA = '../datasets/datasetTest.csv'

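We load and preprocess the **train** and **test** sets. The CSV files have no header row, so we name the columns ourselves; the last column of the train set holds the label.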

In [4]:
train = pd.read_csv(TRAIN_DATA, header=None)
test = pd.read_csv(TEST_DATA, header=None)

feature_columns = [f'feature_{i+1}' for i in range(NUM_FEATURES)]

train.columns = feature_columns + ['label']
test.columns = feature_columns

X_train = train[feature_columns]
y_train = train['label']

X_test = test[feature_columns]

As we can observe, our data lies in a high-dimensional space (224 dimensions), so we decided to use **feature selection**. Specifically, we use a *correlation-based* filter: features whose absolute correlation with the label is at most 0.1 are discarded, as they are unlikely to provide meaningful information for the classification task. Feature selection offers the following benefits:
- Reduces noise in the data
- Decreases the risk of **overfitting**, because the model focuses on learning the underlying patterns rather than memorizing noise
- Improves computational efficiency

In [5]:
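# corrwith returns one correlation coefficient per feature (a Series, not a matrix)
correlations = X_train.corrwith(y_train)
correlation_threshold = 0.1
# keep only the features whose absolute correlation exceeds the threshold
selected_features = correlations[correlations.abs() > correlation_threshold].index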

X_train_reduced = X_train[selected_features]
X_test_reduced = X_test[selected_features]
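
To illustrate how the correlation filter behaves, here is a minimal sketch on synthetic data. The column names `informative` and `noise`, the sample size and the random seed are assumptions made for this example only; they are not part of our dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
n = 5000
y = pd.Series(rng.integers(0, 2, size=n), name="label")
X = pd.DataFrame({
    "informative": y + rng.normal(0, 0.5, size=n),  # depends on the label
    "noise": rng.normal(0, 1, size=n),              # independent of the label
})

# Same filter as above: keep features with |correlation| > 0.1
correlations = X.corrwith(y)
selected = correlations[correlations.abs() > 0.1].index.tolist()

print(correlations.round(2))
print(selected)
```

With this many samples the independent column should fall well below the threshold, so only the label-dependent column is expected to survive the filter.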

We also **scale** the data, which is essential for models like **Neural Networks**. Scaling improves the *convergence* of *gradient descent* and prevents features with larger numerical ranges from dominating the learning process. The scaler is fitted on the train set only and then applied unchanged to the test set.

In [6]:
scaler = StandardScaler()
X_train_reduced_scaled = scaler.fit_transform(X_train_reduced)
X_test_reduced_scaled = scaler.transform(X_test_reduced)
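
The effect of standardization can be checked on a small synthetic example. The arrays `X_tr` and `X_te` below are hypothetical stand-ins with two features of very different ranges; the sketch shows that each training feature ends up with zero mean and unit variance, and that the test data is transformed with the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features with very different numerical ranges (hypothetical data)
rng = np.random.default_rng(0)
X_tr = rng.normal(loc=[0.0, 1000.0], scale=[1.0, 250.0], size=(200, 2))
X_te = rng.normal(loc=[0.0, 1000.0], scale=[1.0, 250.0], size=(50, 2))

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learns mean and std from the training data
X_te_scaled = scaler.transform(X_te)      # reuses the training mean and std

print("train mean per feature:", X_tr_scaled.mean(axis=0).round(6))
print("train std per feature: ", X_tr_scaled.std(axis=0).round(6))

# The test data is scaled with the *training* statistics
manual = (X_te - X_tr.mean(axis=0)) / X_tr.std(axis=0)
print("matches manual scaling:", np.allclose(X_te_scaled, manual))
```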

We performed a *Grid Search* to tune the hyperparameters of the MLPClassifier and found that the following combination works best on our training set:

In [7]:
model = MLPClassifier(hidden_layer_sizes=(200,), learning_rate_init=0.01, random_state=42)
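
The grid search itself is not included in this notebook. A minimal sketch of how such a search could look with scikit-learn's `GridSearchCV` is given below. The parameter grid, the synthetic dataset from `make_classification`, `max_iter=300` and `cv=3` are all assumptions chosen to keep the example small; they are not the values we actually searched over.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Small synthetic problem (hypothetical stand-in for the real training set)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=42
)

# Hypothetical grid: candidate values are illustrative only
param_grid = {
    "hidden_layer_sizes": [(50,), (200,)],
    "learning_rate_init": [0.001, 0.01],
}

search = GridSearchCV(
    MLPClassifier(max_iter=300, random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)  # may emit convergence warnings on this tiny setup

print("Best parameters:", search.best_params_)
print(f"Best cross-validation accuracy: {search.best_score_:.2f}")
```

`GridSearchCV` fits one model per parameter combination and fold, so the grid should stay small when each fit is expensive.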

We evaluate the accuracy of the model using **5-fold cross-validation**.

In [8]:
scores = cross_val_score(model, X_train_reduced_scaled, y_train, cv=5, scoring='accuracy')
print(f"Neural Network - Cross-validation accuracy: {scores.mean():.2f}")

Neural Network - Cross-validation accuracy: 0.84
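
Note that the scaler above was fitted on the full training set before cross-validation, so each validation fold has already influenced the scaling statistics. A possible variant, sketched here on synthetic data, wraps the scaler and the classifier in a scikit-learn `Pipeline` so that scaling is refitted inside every fold. The dataset, the smaller hidden layer and `max_iter=300` are assumptions made to keep the example fast; this is not the procedure used for the score reported above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (hypothetical)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=42
)

# The pipeline refits the scaler on the training part of each fold
pipeline = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(50,), max_iter=300, random_state=42),
)

fold_scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring="accuracy")
print(f"Accuracy: {fold_scores.mean():.2f} +/- {fold_scores.std():.2f}")
```

Reporting the standard deviation across folds alongside the mean also gives a rough idea of how stable the estimate is.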


Finally, we fit our model on the entire training set and predict the labels of the given test set.

In [9]:
model.fit(X_train_reduced_scaled, y_train)
labelsX = model.predict(X_test_reduced_scaled)

np.save('../results/labels3.npy', labelsX)

We also make sure that the saved *labelsX* array can be loaded using *numpy.load()* and that it is one-dimensional.

In [10]:
try:
    file_path = "../results/labels3.npy"
    labels = np.load(file_path)

    # Check that the loaded labels form a 1D array
    if labels.ndim == 1:
        print("File loaded successfully!")
        print(f"Number of samples (N): {labels.shape[0]}")
    else:
        print(f"Error: Expected 1D array, but got {labels.ndim}D array with shape {labels.shape}.")
except Exception as e:
    print(f"Error loading file: {e}")

File loaded successfully!
Number of samples (N): 6955
