**Ioannis Michalainas** (AEM: 10902) and **Maria Charisi** (AEM: 10727)

# PART D

We begin by importing the necessary libraries: **pandas** for loading the datasets and **numpy** for numeric operations.

In [1]:
import pandas as pd
import numpy as np

We use **scikit-learn** to test different classifiers on our data.

In [2]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

from sklearn.linear_model import LogisticRegression, Perceptron, RidgeClassifier, SGDClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier#, GradientBoostingClassifier

Next, we declare the *constants* used in the rest of the notebook: the number of features and the paths to the two datasets.

In [3]:
NUM_FEATURES = 224
TRAIN_DATA = '../datasets/datasetTV.csv'
TEST_DATA = '../datasets/datasetTest.csv'

We then load the **train** and **test** sets.

In [4]:
train = pd.read_csv(TRAIN_DATA, header=None)
test = pd.read_csv(TEST_DATA, header=None)
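Before going further, it can help to confirm that a loaded file has the expected layout. The sketch below is only an illustration on a tiny in-memory CSV: `NUM_FEATURES_DEMO`, `csv_text` and `demo` are hypothetical names, not part of the notebook, and the real files are assumed to hold `NUM_FEATURES` feature columns (plus one label column in the training set).

```python
import io

import pandas as pd

# Hypothetical toy file: 3 feature columns followed by 1 label column.
NUM_FEATURES_DEMO = 3
csv_text = "0.1,0.2,0.3,1\n0.4,0.5,0.6,2\n"

# header=None because the file has no header row, as in the notebook.
demo = pd.read_csv(io.StringIO(csv_text), header=None)

# A labelled set should have one column more than the number of features.
assert demo.shape[1] == NUM_FEATURES_DEMO + 1

# Count missing values so they can be handled before scaling.
n_missing = int(demo.isna().sum().sum())
print(f"rows={demo.shape[0]}, columns={demo.shape[1]}, missing={n_missing}")
```

The same two checks could be applied to `train` (with `NUM_FEATURES + 1` columns) and to `test` (with `NUM_FEATURES` columns).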

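We *preprocess* the data by naming the columns, separating the features from the labels, and standardizing the features with a scaler fitted on the training set only...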

In [5]:
feature_columns = [f'feature_{i+1}' for i in range(NUM_FEATURES)]

train.columns = feature_columns + ['label']
test.columns = feature_columns  # no 'label' column in the test set

# split train data into features and labels
X_train = train[feature_columns]
y_train = train['label']

# test data does not have labels
X_test = test[feature_columns]

# scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
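To make the scaling step concrete, here is a minimal NumPy sketch of what standardization does: the mean and standard deviation are estimated on the training data and then reused, unchanged, on the test data. The function `standardize` and the `toy_*` arrays are hypothetical and exist only for this illustration; the notebook itself relies on `StandardScaler`.

```python
import numpy as np


def standardize(train, test):
    """Z-score both arrays using statistics computed on `train` only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)  # population standard deviation (ddof=0)
    std[std == 0] = 1.0      # leave constant features unscaled
    return (train - mean) / std, (test - mean) / std


# Random toy data standing in for the real feature matrices.
rng = np.random.default_rng(0)
toy_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
toy_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

toy_train_scaled, toy_test_scaled = standardize(toy_train, toy_test)
```

After this transformation each training column has zero mean and unit variance, while the test columns are only approximately centred because they are scaled with the training statistics.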

...and define the **models** we want to test, along with their *parameters*.

In [6]:
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    
    "KNN(5NN)": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB(),

    # "LDA": LinearDiscriminantAnalysis(),
    # "QDA": QuadraticDiscriminantAnalysis(),
    "Logistic Regression": LogisticRegression(max_iter=300, random_state=42),
    "Perceptron": Perceptron(max_iter=300, random_state=42), # two class problems
    "Least Squares": RidgeClassifier(),
    # "SGD Classifier": SGDClassifier(max_iter=1000, tol=1e-3, random_state=42),
    
    "SVM": SVC(kernel='linear', random_state=42),
    
    "Neural Network": MLPClassifier(hidden_layer_sizes=(100,), max_iter=300, random_state=42),

    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    
    # "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
    "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=42)
}

We then estimate each **model**'s accuracy with **5-fold cross-validation**, refit it on the entire training set, and predict the labels of the test set.

In [7]:
results = {}
labelsX_dict = {}

for name, model in models.items():

    scores = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring='accuracy')
    results[name] = scores.mean()
    print(f"{name} - Cross-validation accuracy: {scores.mean():.2f}")

    # train model on the entire training set
    model.fit(X_train_scaled, y_train)
    # predict on test data
    labelsX = model.predict(X_test_scaled)
    # save the predicted labels
    labelsX_dict[name] = labelsX
    
    print(f"{name} - Prediction done")

KNN(5NN) - Cross-validation accuracy: 0.82
KNN(5NN) - Prediction done
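One caveat about the cross-validation above: the scaler was fitted on the whole training set beforehand, so every validation fold has already contributed to the scaling statistics. The effect is often small, but a common way to avoid it is to put the scaler and the classifier in a single pipeline, so that scaling is refitted inside each fold. The sketch below is one possible variant, not the procedure used in this notebook; `X_demo` and `y_demo` are synthetic stand-ins for the real data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the (unscaled) training features and labels.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=5, random_state=42
)

# The scaler is refitted on the training part of every fold,
# so the held-out part never influences the scaling statistics.
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

pipe_scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring='accuracy')
print(f"Pipeline KNN - Cross-validation accuracy: {pipe_scores.mean():.2f}")
```

With this design the unscaled `X_train` would be passed to `cross_val_score`, and the fitted pipeline could also predict directly on the unscaled `X_test`.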


Finally, we save the **predicted labels** of the model with the highest cross-validation *accuracy* in .npy format.

In [8]:
best_model_name = max(results, key=results.get)
best_labelsX = labelsX_dict[best_model_name]
np.save('results/labelsX.npy', best_labelsX)

print(f"\nPredictions from the best model ({best_model_name}) saved to 'labelsX.npy'")


Predictions from the best model (KNN(5NN)) saved to 'labelsX.npy'
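Note that `np.save` raises an error if the target directory does not exist. A minimal sketch of a more defensive save, followed by a reload to confirm that the file round-trips, is shown below. The helper `save_labels` and the array `demo_labels` are hypothetical, and a temporary directory is used so that the example does not touch the real `results` folder.

```python
import tempfile
from pathlib import Path

import numpy as np


def save_labels(labels, path):
    """Save `labels` to `path`, creating parent folders, and return the reloaded array."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create 'results/' if it is missing
    np.save(path, labels)
    return np.load(path)


# Hypothetical predictions standing in for best_labelsX.
demo_labels = np.array([1, 3, 2, 5, 4])

with tempfile.TemporaryDirectory() as tmp:
    reloaded = save_labels(demo_labels, Path(tmp) / 'results' / 'labelsX.npy')

print(np.array_equal(reloaded, demo_labels))
```

In the notebook, the equivalent call would be `save_labels(best_labelsX, 'results/labelsX.npy')`.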
