### EXERCISE 2

In this exercise you will work with data from an experiment in which electromyography (EMG) was used to measure muscle activity in the right arm of several participants while they performed one of seven possible hand movements (flex upward, flex downward, close the hand, stretch the hand, open the hand, grasp an object, no movement).

The first column holds the class (1, 2, 3, 4, 5, 6, or 7), the second column is ignored, and the remaining columns hold the features computed from the muscle response.
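To make the per-class reports further down easier to read, class numbers can be mapped to movement names. The sketch below assumes that class k is the k-th movement in the list above; the text does not confirm that ordering, and `MOVEMENTS` and `movement_name` are illustrative names that are not part of the notebook.

```python
# Hypothetical mapping: assumes class k is the k-th movement listed in the
# exercise description. The source does not state this ordering explicitly.
MOVEMENTS = {
    1: "Flex upward",
    2: "Flex downward",
    3: "Close the hand",
    4: "Stretch the hand",
    5: "Open the hand",
    6: "Grasp an object",
    7: "No movement",
}


def movement_name(label, zero_based=False):
    """Return the movement name for a class label.

    Set zero_based=True for labels that were shifted to the 0..6 range.
    """
    class_id = int(label) + 1 if zero_based else int(label)
    return MOVEMENTS[class_id]


print(movement_name(1.0))                  # label as stored in the file
print(movement_name(5, zero_based=True))   # label after subtracting 1
```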


In [1]:
import pandas as pd
import numpy as np

In [7]:
df_txt = np.loadtxt("data/M_3.txt")
df = pd.DataFrame(df_txt)
df.drop(columns = 1, inplace = True)

In [9]:
df.head()

Unnamed: 0,0,2,3,4,5,6,7,8,9,10,...,622,623,624,625,626,627,628,629,630,631
0,1.0,-1.558677,1.108312,0.36259,-0.471151,0.385332,0.316969,-1.481753,0.399394,1.148663,...,-0.513296,-0.044463,0.211337,0.40454,-1.020636,0.598305,0.68847,0.2921,-0.435294,1.384082
1,1.0,-1.978207,0.055298,-0.27335,-0.581922,-0.955664,-0.529833,-1.983923,-0.728869,0.253809,...,-1.029816,-0.305097,-0.19568,0.641412,-0.50898,0.785134,0.580631,0.134605,-0.663639,1.234545
2,1.0,-2.002521,0.267381,0.26317,-1.390732,-0.285152,-0.352082,-1.917751,-0.27061,0.568314,...,-0.531808,-0.482899,-0.068502,0.126576,-0.880254,0.53368,0.67203,0.207432,-0.563343,1.046445
3,1.0,-2.152173,0.251282,0.183969,-1.082561,-0.088401,0.023352,-2.048161,0.244747,0.376057,...,-0.453654,-0.107637,-0.62158,0.807897,0.047029,1.071315,1.204314,0.039504,-0.89966,1.491114
4,1.0,-2.590018,-0.091138,-0.265972,-1.763059,-0.181685,-0.666207,-2.570451,-0.535319,-0.03175,...,-0.737064,0.055352,0.197176,-1.152101,-1.551049,1.236166,0.134769,0.805767,-0.454501,1.101017


### Dataset is perfectly balanced

In [13]:
df.loc[:,0].value_counts()

0
1.0    90
2.0    90
3.0    90
4.0    90
5.0    90
6.0    90
7.0    90
Name: count, dtype: int64

In [65]:
X = df.drop(columns = 0)
y = df.loc[:, 0] - 1  # Shift labels from 1..7 to 0..6 to simplify later model training
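The first five rows shown above all belong to class 1, which suggests the rows are grouped by class, so the way the folds are built matters. The sketch below uses synthetic labels and assumes the file is fully sorted by class, which the excerpt does not confirm. It compares an unshuffled `KFold` with a shuffled `StratifiedKFold`.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Synthetic stand-in: 7 classes x 90 rows, sorted by class (assumption).
y_sorted = np.repeat(np.arange(7), 90)
X_dummy = np.zeros((y_sorted.size, 1))  # feature values do not affect the split

# Unshuffled KFold: the first test fold is simply rows 0..125.
train_idx, test_idx = next(KFold(n_splits=5).split(X_dummy))
print("KFold test classes:", np.unique(y_sorted[test_idx]))
print("Classes missing from KFold training set:",
      np.setdiff1d(np.arange(7), y_sorted[train_idx]))

# StratifiedKFold keeps the class proportions in every fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
train_idx, test_idx = next(skf.split(X_dummy, y_sorted))
print("Stratified test counts per class:", np.bincount(y_sorted[test_idx]))
```

On sorted data, the unshuffled split tests on a class that the model never saw during training, whereas the stratified split places 18 rows of every class in each test fold.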

### Function to evaluate models

In [18]:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report

In [69]:
def evaluate_model(X, y, classifier):
    """
    Evaluate a model with cross-validation, without subsampling or class-imbalance handling.

    Parameters:
    X (pd.DataFrame): Features
    y (pd.Series): Labels (seven classes, numbered 0 to 6)
    classifier: Classifier to evaluate

    Returns:
    None: Prints the classification report for the pooled cross-validation predictions.
    """

    # Stratified 5-fold cross-validation
    kf = StratifiedKFold(n_splits=5, shuffle=True)

    cv_y_test = []
    cv_y_pred = []

    # Iterate over the folds
    for train_index, test_index in kf.split(X, y):
        X_train = X.iloc[train_index]
        y_train = y.iloc[train_index]

        # Train the model without subsampling
        classifier.fit(X_train, y_train)

        # Test phase
        X_test = X.iloc[test_index]
        y_test = y.iloc[test_index]
        y_pred = classifier.predict(X_test)

        # Store the predictions and the true labels
        cv_y_test.append(y_test)
        cv_y_pred.append(pd.Series(y_pred, index=y_test.index))

    # Concatenate the predictions and the true labels from all folds
    y_test_concat = pd.concat(cv_y_test)
    y_pred_concat = pd.concat(cv_y_pred)

    # Print the classification report
    print("--- Reporte de clasificación ---")
    print(classification_report(y_test_concat, y_pred_concat, labels=[0, 1, 2, 3, 4, 5, 6]))
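The manual loop in `evaluate_model` pools the out-of-fold predictions before building a single report. scikit-learn's `cross_val_predict` performs the same pooling. The sketch below is not the notebook's code: it uses synthetic data from `make_classification` and a fixed `random_state`, both of which are assumptions added here so that the report is repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the EMG features (assumption: 7 classes, 30 features).
X_demo, y_demo = make_classification(
    n_samples=350, n_features=30, n_informative=10,
    n_classes=7, n_clusters_per_class=1, random_state=0,
)

# A fixed random_state makes the fold assignment identical between runs.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Each sample is predicted exactly once, by a model that did not train on it.
y_pred = cross_val_predict(KNeighborsClassifier(n_neighbors=5), X_demo, y_demo, cv=cv)

print(classification_report(y_demo, y_pred))
```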


In [72]:
# Eight Classifiers
from sklearn.svm import SVC # Linear and RBF
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from xgboost import XGBClassifier

#### K-nearest-neighbors

In [75]:
evaluate_model(X, y, KNeighborsClassifier(n_neighbors=5))

--- Reporte de clasificación ---
              precision    recall  f1-score   support

           0       0.99      0.99      0.99        90
           1       0.97      0.98      0.97        90
           2       0.96      0.90      0.93        90
           3       1.00      1.00      1.00        90
           4       1.00      1.00      1.00        90
           5       0.92      0.93      0.93        90
           6       0.95      0.99      0.97        90

    accuracy                           0.97       630
   macro avg       0.97      0.97      0.97       630
weighted avg       0.97      0.97      0.97       630



#### Linear Discriminant Analysis

In [77]:
evaluate_model(X, y, LinearDiscriminantAnalysis())

--- Reporte de clasificación ---
              precision    recall  f1-score   support

           0       0.91      0.98      0.94        90
           1       0.87      0.89      0.88        90
           2       0.87      0.86      0.86        90
           3       0.98      0.99      0.98        90
           4       0.99      0.89      0.94        90
           5       0.81      0.84      0.83        90
           6       0.94      0.90      0.92        90

    accuracy                           0.91       630
   macro avg       0.91      0.91      0.91       630
weighted avg       0.91      0.91      0.91       630



#### Gaussian Naive-Bayes

In [80]:
evaluate_model(X, y, GaussianNB())

--- Reporte de clasificación ---
              precision    recall  f1-score   support

           0       0.87      0.86      0.86        90
           1       0.71      0.80      0.75        90
           2       0.87      0.77      0.82        90
           3       0.94      0.92      0.93        90
           4       0.89      0.91      0.90        90
           5       0.67      0.64      0.66        90
           6       0.94      0.99      0.96        90

    accuracy                           0.84       630
   macro avg       0.84      0.84      0.84       630
weighted avg       0.84      0.84      0.84       630



#### Linear Support Vector Classifier

In [82]:
evaluate_model(X, y, SVC(kernel='linear'))

--- Reporte de clasificación ---
              precision    recall  f1-score   support

           0       1.00      0.99      0.99        90
           1       0.99      0.98      0.98        90
           2       0.97      0.94      0.96        90
           3       1.00      1.00      1.00        90
           4       0.99      1.00      0.99        90
           5       0.94      0.93      0.94        90
           6       0.94      0.98      0.96        90

    accuracy                           0.97       630
   macro avg       0.97      0.97      0.97       630
weighted avg       0.97      0.97      0.97       630



#### Radial Support Vector Classifier

In [84]:
evaluate_model(X, y, SVC(kernel='rbf'))

--- Reporte de clasificación ---
              precision    recall  f1-score   support

           0       0.99      0.99      0.99        90
           1       0.98      0.93      0.95        90
           2       0.99      0.87      0.92        90
           3       1.00      1.00      1.00        90
           4       0.99      0.99      0.99        90
           5       0.85      0.94      0.89        90
           6       0.94      0.99      0.96        90

    accuracy                           0.96       630
   macro avg       0.96      0.96      0.96       630
weighted avg       0.96      0.96      0.96       630



#### Decision Tree

In [86]:
evaluate_model(X, y, DecisionTreeClassifier())

--- Reporte de clasificación ---
              precision    recall  f1-score   support

           0       0.86      0.82      0.84        90
           1       0.70      0.66      0.68        90
           2       0.78      0.83      0.81        90
           3       0.92      0.87      0.89        90
           4       0.89      0.90      0.90        90
           5       0.64      0.70      0.67        90
           6       0.94      0.93      0.94        90

    accuracy                           0.82       630
   macro avg       0.82      0.82      0.82       630
weighted avg       0.82      0.82      0.82       630



#### Random Forest

In [92]:
evaluate_model(X, y, RandomForestClassifier())

--- Reporte de clasificación ---
              precision    recall  f1-score   support

           0       0.97      0.94      0.96        90
           1       0.87      0.91      0.89        90
           2       0.94      0.86      0.90        90
           3       1.00      0.98      0.99        90
           4       0.96      0.98      0.97        90
           5       0.82      0.82      0.82        90
           6       0.94      1.00      0.97        90

    accuracy                           0.93       630
   macro avg       0.93      0.93      0.93       630
weighted avg       0.93      0.93      0.93       630



#### XGBoost

In [94]:
evaluate_model(X, y, XGBClassifier())

--- Reporte de clasificación ---
              precision    recall  f1-score   support

           0       0.97      0.93      0.95        90
           1       0.86      0.84      0.85        90
           2       0.88      0.84      0.86        90
           3       1.00      0.97      0.98        90
           4       0.97      1.00      0.98        90
           5       0.75      0.78      0.77        90
           6       0.94      1.00      0.97        90

    accuracy                           0.91       630
   macro avg       0.91      0.91      0.91       630
weighted avg       0.91      0.91      0.91       630



### Selected models: Linear SVC and KNN

#### Hyperparameter grids

In [128]:
from sklearn.svm import LinearSVC

In [152]:
# Grid for KNN
param_grid_knn = {
    'n_neighbors': [3, 5, 7],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan']
}


# Grid for Linear SVC
param_grid_svc = {
    'C': [0.001, 0.01, 0.1, 1, 10],
    'loss': ['squared_hinge'],
    'dual': [False],
    'max_iter': [1000, 2000, 5000]
}
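The size of each grid determines the cost of the search. As a rough sketch, `ParameterGrid` can count the candidates. The helper `count_fits` is hypothetical; it assumes 5 inner and 5 outer folds, and that `GridSearchCV` refits the best candidate once per outer fold (its default `refit=True`).

```python
from sklearn.model_selection import ParameterGrid

param_grid_knn = {
    'n_neighbors': [3, 5, 7],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan'],
}

param_grid_svc = {
    'C': [0.001, 0.01, 0.1, 1, 10],
    'loss': ['squared_hinge'],
    'dual': [False],
    'max_iter': [1000, 2000, 5000],
}


def count_fits(param_grid, inner_folds=5, outer_folds=5):
    """Count model fits for a nested grid search (illustrative helper)."""
    n_candidates = len(ParameterGrid(param_grid))
    # Each candidate is fitted once per inner fold, plus one refit of the
    # best candidate on the whole outer-training set.
    fits_per_outer_fold = n_candidates * inner_folds + 1
    return n_candidates, fits_per_outer_fold * outer_folds


print(count_fits(param_grid_knn))  # (12, 305)
print(count_fits(param_grid_svc))  # (15, 380)
```

In the Linear SVC grid, `max_iter` triples the number of candidates. If the solver already converges within 1000 iterations, the larger values should yield the same model, so that entry could probably be reduced to a single value.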

#### CV function to find the best hyperparameters

In [107]:
from sklearn.model_selection import GridSearchCV

In [121]:
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.metrics import classification_report
import pandas as pd

def evaluate_model_with_nested_cv(X, y, classifier, param_grid):
    """
    Evaluate a model with nested cross-validation and a hyperparameter search.

    Parameters:
    X (pd.DataFrame): Features (already standardized)
    y (pd.Series): Labels (seven classes, numbered 0 to 6)
    classifier: Classifier to evaluate
    param_grid (dict): Hyperparameters to explore with GridSearchCV

    Returns:
    None: Prints the classification report for the pooled cross-validation predictions.
    """

    # Outer cross-validation with 5 stratified folds
    outer_cv = StratifiedKFold(n_splits=5, shuffle=True)

    cv_y_test = []
    cv_y_pred = []

    # Nested cross-validation (GridSearchCV runs inside each outer fold)
    for train_index, test_index in outer_cv.split(X, y):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

        # GridSearchCV looks for the best hyperparameters
        inner_cv = StratifiedKFold(n_splits=5, shuffle=True)
        grid_search = GridSearchCV(classifier, param_grid, cv=inner_cv, scoring='accuracy')

        # Fit GridSearchCV on the outer training set only
        grid_search.fit(X_train, y_train)

        # Predict on the outer test set with the best hyperparameters
        y_pred = grid_search.predict(X_test)

        # Store the predictions and the true labels
        cv_y_test.append(y_test)
        cv_y_pred.append(pd.Series(y_pred, index=y_test.index))

        # Print the best hyperparameters found in this outer fold
        print(f"Mejores hiperparámetros en este pliegue: {grid_search.best_params_}")

    # Concatenate the predictions and the true labels from all folds
    y_test_concat = pd.concat(cv_y_test)
    y_pred_concat = pd.concat(cv_y_pred)

    # Print the final classification report
    print("--- Reporte de clasificación con validación cruzada anidada ---")
    print(classification_report(y_test_concat, y_pred_concat, labels=[0, 1, 2, 3, 4, 5, 6]))
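When only a score per outer fold is needed, the same nested scheme can be written more compactly by passing the `GridSearchCV` object to `cross_val_score`. This is a sketch under assumptions rather than the notebook's method: the data are synthetic, the grid is reduced, and the inner loop uses 3 folds to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the EMG data (assumption: 7 classes, 20 features).
X_demo, y_demo = make_classification(
    n_samples=210, n_features=20, n_informative=8,
    n_classes=7, n_clusters_per_class=1, random_state=0,
)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: the grid search selects hyperparameters on each outer-training set.
search = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]},
    cv=inner_cv,
    scoring="accuracy",
)

# Outer loop: each score comes from data the search never saw.
scores = cross_val_score(search, X_demo, y_demo, cv=outer_cv, scoring="accuracy")
print(f"Nested CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Unlike the explicit loop, this form does not expose the best hyperparameters of each outer fold, so the loop remains the better choice when those need to be inspected.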


#### K-nearest neighbors

In [125]:
evaluate_model_with_nested_cv(X, y, KNeighborsClassifier(), param_grid_knn)

Mejores hiperparámetros en este pliegue: {'metric': 'manhattan', 'n_neighbors': 3, 'weights': 'distance'}
Mejores hiperparámetros en este pliegue: {'metric': 'euclidean', 'n_neighbors': 3, 'weights': 'distance'}
Mejores hiperparámetros en este pliegue: {'metric': 'manhattan', 'n_neighbors': 5, 'weights': 'distance'}
Mejores hiperparámetros en este pliegue: {'metric': 'manhattan', 'n_neighbors': 3, 'weights': 'distance'}
Mejores hiperparámetros en este pliegue: {'metric': 'euclidean', 'n_neighbors': 3, 'weights': 'uniform'}
--- Reporte de clasificación con validación cruzada anidada ---
              precision    recall  f1-score   support

           0       0.99      0.98      0.98        90
           1       0.96      0.94      0.95        90
           2       0.99      0.90      0.94        90
           3       1.00      0.99      0.99        90
           4       0.99      1.00      0.99        90
           5       0.90      0.97      0.93        90
           6       0.95     

#### Linear Support Vector Classifier

In [154]:
evaluate_model_with_nested_cv(X, y, LinearSVC(), param_grid_svc)

Mejores hiperparámetros en este pliegue: {'C': 0.001, 'dual': False, 'loss': 'squared_hinge', 'max_iter': 1000}
Mejores hiperparámetros en este pliegue: {'C': 0.01, 'dual': False, 'loss': 'squared_hinge', 'max_iter': 1000}
Mejores hiperparámetros en este pliegue: {'C': 0.001, 'dual': False, 'loss': 'squared_hinge', 'max_iter': 1000}
Mejores hiperparámetros en este pliegue: {'C': 0.01, 'dual': False, 'loss': 'squared_hinge', 'max_iter': 1000}
Mejores hiperparámetros en este pliegue: {'C': 0.001, 'dual': False, 'loss': 'squared_hinge', 'max_iter': 1000}
--- Reporte de clasificación con validación cruzada anidada ---
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        90
           1       0.98      0.97      0.97        90
           2       0.98      0.96      0.97        90
           3       1.00      1.00      1.00        90
           4       1.00      1.00      1.00        90
           5       0.95      0.93      0.94        90

### Preparing the models for production

#### K-nearest neighbors

In [162]:
clf_knn = KNeighborsClassifier(metric= 'manhattan', n_neighbors= 3, weights= 'distance')

In [164]:
clf_knn.fit(X, y)

In [None]:
# Save the model to a pickle file (left disabled inside a string literal)
'''
import pickle

with open('tuned_knn_model.pkl', 'wb') as model_file:
    pickle.dump(clf_knn, model_file)

print("KNN model saved to 'tuned_knn_model.pkl'")
'''

#### Linear SVC

In [174]:
clf_svc = LinearSVC(C = 0.001, dual = False, loss = 'squared_hinge', max_iter = 1000)

In [176]:
clf_svc.fit(X, y)

In [None]:
# Save the model to a pickle file (left disabled inside a string literal)
'''
import pickle

with open('tuned_svc_model.pkl', 'wb') as model_file:
    pickle.dump(clf_svc, model_file)

print("Linear SVC model saved to 'tuned_svc_model.pkl'")
'''
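Before relying on a saved model, it is worth checking that it survives a save-and-load round trip. The sketch below is a minimal example under assumptions: it trains on synthetic data and writes to a temporary directory, and the file name `demo_knn_model.pkl` is hypothetical.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the EMG data (assumption: 7 classes, 10 features).
X_demo, y_demo = make_classification(
    n_samples=140, n_features=10, n_informative=6,
    n_classes=7, n_clusters_per_class=1, random_state=0,
)

model = KNeighborsClassifier(metric="manhattan", n_neighbors=3, weights="distance")
model.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_knn_model.pkl")  # hypothetical name

    # Serialize the fitted model to disk.
    with open(model_path, "wb") as model_file:
        pickle.dump(model, model_file)

    # Load it back as a separate object.
    with open(model_path, "rb") as model_file:
        restored = pickle.load(model_file)

# The restored model should reproduce the original predictions exactly.
same_predictions = bool((restored.predict(X_demo) == model.predict(X_demo)).all())
print("Round trip preserved predictions:", same_predictions)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that produced them.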

### Answer the following questions:

- **Do you see a problem with the balance of the classes? Why?** No. Each of the seven classes has exactly 90 observations, so the dataset is perfectly balanced. The rows are not shuffled in the original file, however (the first rows shown all belong to class 1), so they must be shuffled before evaluation, as is done here with `shuffle=True` in `StratifiedKFold`.

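- **Which model or models were effective in classifying your data? Do you observe anything special about the models? Justify your answer.** The most effective models were the linear support vector classifier and K-nearest neighbors, both with 0.97 accuracy; the RBF support vector classifier followed closely at 0.96. The remaining models lost most of their accuracy on label 5 (class 6 in the original numbering), where their F1 scores ranged from 0.66 to 0.83. If the classes follow the order in which the movements are listed at the top, that label is the object-grasping movement rather than the hand-opening one (label 4), which every model classified with an F1 of at least 0.90. This result is surprising at first glance, since more flexible models such as Random Forest and XGBoost were also evaluated. Possible explanations include the high dimensionality of the data relative to its size (630 features for 630 observations) and classes that may be close to linearly separable in this feature space.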

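- **Do you observe any significant improvement when optimizing hyperparameters? Is this the result you expected? Justify your answer.** No. The baseline results were already very good, so there was little room to improve: the per-class scores shown after tuning differ from the baseline by at most a few hundredths, which is what one would expect when the baseline is already near the ceiling. Because the folds are shuffled without a fixed seed, differences of this size are hard to distinguish from run-to-run variation. Even so, a small gain in a machine learning model can still be worthwhile.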

- **What challenges are there in finding hyperparameters? Why?** For the linear support vector classifier, the hyperparameter grid must be chosen with care, because some settings may fail to converge and some combinations of hyperparameters are incompatible with each other. A second challenge, which did not arise here but can with more complex models, is that the search may become excessively time-consuming, since every value added to the grid multiplies the number of models to fit.
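The compatibility problem mentioned in the last answer can be illustrated with `LinearSVC`: to the best of my knowledge, `loss='hinge'` with the default L2 penalty is only supported in the dual formulation, so combining it with `dual=False` is rejected at fit time. The sketch below uses synthetic data; the name `param_grid_svc_safe` and its values are illustrative assumptions, not the grid used above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import ParameterGrid
from sklearn.svm import LinearSVC

# Small synthetic binary problem (assumption, used only to trigger the check).
X_demo, y_demo = make_classification(n_samples=100, n_features=10, random_state=0)

# An incompatible combination raises ValueError when fitting.
try:
    LinearSVC(loss="hinge", dual=False).fit(X_demo, y_demo)
    unsupported = False
except ValueError as err:
    unsupported = True
    print("Rejected combination:", err)

# A list of dicts keeps each sub-grid internally consistent, so the search
# never builds a candidate that mixes loss='hinge' with dual=False.
param_grid_svc_safe = [
    {"C": [0.01, 0.1, 1], "loss": ["squared_hinge"], "dual": [False]},
    {"C": [0.01, 0.1, 1], "loss": ["hinge"], "dual": [True]},
]
n_candidates = len(ParameterGrid(param_grid_svc_safe))
print("Valid candidates:", n_candidates)
```

`GridSearchCV` accepts the same list-of-dicts form for its `param_grid` argument, which avoids wasting fits on invalid candidates.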
