The goal of this code is to implement the GBM algorithm, splitting the data by subject.


In [None]:
from google.colab import drive
drive.mount('/content/drive')
%cd "/content/drive/My Drive/repo_tesis/entorno_tesis_Molina"
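# Note: each `!` command runs in its own subshell, so this activation does not persist to later cells
!source bin/activate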

Mounted at /content/drive
/content/drive/My Drive/repo_tesis/entorno_tesis_Molina


In [None]:
# !pip install comet_ml

# Import the libraries
from lightgbm import LGBMClassifier
import numpy as np
from joblib import load
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix
import seaborn as sns
import time
from comet_ml import Experiment
import joblib
import matplotlib.pyplot as plt

# Load the data
features = load("/content/drive/My Drive/repo_tesis/data/FEATURES_W200_I50.joblib")
label = load("/content/drive/My Drive/repo_tesis/data/label_W200_I50.joblib")

I build each dataset by splitting by subject.


*   Two subjects for test.
*   Two subjects for validation.
*   The rest for training.

I left this as the default, but it can easily be changed.

In [None]:
# Compute the number of subjects (the subject ID is in column 2 of label)
label = np.array(label)
cantSujetos = np.max(label[:, 2])

# Draw a random permutation to decide which subject goes to each set
sorteo = np.random.permutation(cantSujetos) + 1

# The first two subjects drawn go to test, the next two to val, and the rest to train
indices_test = list(np.where(label[:, 2]==sorteo[0])[0])
indices_val = list(np.where(label[:, 2]==sorteo[2])[0])

indices_test.extend(list(np.where(label[:, 2]==sorteo[1])[0]))
indices_val.extend(list(np.where(label[:, 2]==sorteo[3])[0]))

indices_train = []
for j in sorteo[4:]:
    indices_train.extend(np.where(label[:, 2]==j)[0])

# Convert to a NumPy array so the train, test and val sets can be indexed more conveniently
features = np.array(features)

X_train = features[indices_train, :]
y_train = label[indices_train, 1]
X_val = features[indices_val, :]
y_val = label[indices_val, 1]
X_test =  features[indices_test, :]
y_test = label[indices_test, 1]
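
The split above draws a new permutation on every run, so the partition changes between executions. Below is a minimal sketch of the same subject-wise split with a seeded generator. The function name `split_by_subject`, its `seed` argument, and the toy array are assumptions made for illustration; in this notebook the subject IDs would come from `label[:, 2]`.

```python
import numpy as np

def split_by_subject(subject_ids, n_test=2, n_val=2, seed=0):
    """Return (train, val, test) row indices with no subject shared between sets."""
    rng = np.random.default_rng(seed)
    # np.unique also handles subject IDs that are not contiguous or do not start at 1
    subjects = rng.permutation(np.unique(subject_ids))
    test_subjects = subjects[:n_test]
    val_subjects = subjects[n_test:n_test + n_val]
    train_subjects = subjects[n_test + n_val:]
    idx_train = np.flatnonzero(np.isin(subject_ids, train_subjects))
    idx_val = np.flatnonzero(np.isin(subject_ids, val_subjects))
    idx_test = np.flatnonzero(np.isin(subject_ids, test_subjects))
    return idx_train, idx_val, idx_test

# Toy data (hypothetical): 6 subjects with 5 windows each
toy_subjects = np.repeat(np.arange(1, 7), 5)
idx_train, idx_val, idx_test = split_by_subject(toy_subjects, seed=42)
```

With a fixed `seed`, rerunning the cell gives the same partition, which makes it easier to compare runs.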

I will use the implementation that ChatGPT gave me, since Renato's thesis does not discuss how to choose the parameters.

In [None]:
start_time = time.time()    # start timing

# Create an LGBMClassifier instance with default parameters
clf = LGBMClassifier()

# Train the classifier, evaluating log loss on the validation set
clf.fit(X_train, y_train, eval_set=[(X_val, y_val)], eval_metric='logloss')

# Stop timing and compute the training time
end_time = time.time()
training_time = end_time - start_time

# # Predict on the validation set
# y_pred = clf.predict(X_val)

# # Compute the accuracy
# accuracy = accuracy_score(y_val, y_pred)
# print(f'Accuracy: {accuracy}')

# Get the best iteration of the model (only meaningful when early stopping is used)
best_iteration = clf.best_iteration_

# No early stopping is configured in this cell, so best_iteration_ may be None
if best_iteration is None:
    print("best_iteration_ is None: no early stopping was configured")



[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.216863 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 50308
[LightGBM] [Info] Number of data points in the train set: 37655, number of used features: 240
[LightGBM] [Info] Start training from score -2.565481
[LightGBM] [Info] Start training from score -2.592793
[LightGBM] [Info] Start training from score -2.601708
[LightGBM] [Info] Start training from score -2.591020
[LightGBM] [Info] Start training from score -2.573805
[LightGBM] [Info] Start training from score -2.543614
[LightGBM] [Info] Start training from score -2.547339
[LightGBM] [Info] Start training from score -2.549036
[LightGBM] [Info] Start training from score -2.549036
[LightGBM] [Info] Start training from score -2.601708
[LightGBM] [Info] Start training from score -2.376846
[LightGBM] [Info] Start training from score -2.547339
[LightGBM] [Info] Start training from score -2.740986
Accurac

I will save the classifier and evaluate several metrics: accuracy, precision, and recall. I will also build a confusion matrix.
In addition, I will save the data partition so that the experiment is reproducible.

In [None]:
# Save the trained model to a file
joblib.dump(clf, 'baseline_gbm_sep_sub_r1.pkl')

# Predict on the test set
y_pred = clf.predict(X_test)

# Compute the performance metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
conf_matrix = confusion_matrix(y_test, y_pred)

# Show the metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)

# Plot the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, cmap="Blues", fmt="d", xticklabels=np.unique(y_test), yticklabels=np.unique(y_test))
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.title('Confusion Matrix')

# Save the confusion matrix image before plt.show(), which would otherwise leave an empty figure to save
plt.savefig("confusion_matrix.png")
plt.show()
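
To make explicit what `average='weighted'` computes, here is a small sketch that derives the same three metrics from a confusion matrix with NumPy. The function name `weighted_metrics` and the toy matrix are assumptions for illustration. The layout follows scikit-learn's convention (rows are true labels, columns are predicted labels).

```python
import numpy as np

def weighted_metrics(conf_matrix):
    """Accuracy, weighted precision and weighted recall from a confusion matrix
    whose rows are true labels and whose columns are predicted labels."""
    cm = np.asarray(conf_matrix, dtype=float)
    true_pos = np.diag(cm)
    support = cm.sum(axis=1)      # number of samples per true class
    predicted = cm.sum(axis=0)    # number of samples per predicted class
    # Per-class scores; a class that is never predicted gets precision 0
    precision_per_class = np.divide(true_pos, predicted,
                                    out=np.zeros_like(true_pos), where=predicted > 0)
    recall_per_class = np.divide(true_pos, support,
                                 out=np.zeros_like(true_pos), where=support > 0)
    weights = support / support.sum()
    accuracy = float(true_pos.sum() / cm.sum())
    return accuracy, float(weights @ precision_per_class), float(weights @ recall_per_class)

# Toy 3-class confusion matrix (hypothetical values)
toy_cm = np.array([[5, 1, 0],
                   [2, 3, 1],
                   [0, 2, 6]])
acc, prec, rec = weighted_metrics(toy_cm)
```

Because each class's recall is weighted by its support, weighted recall always reduces to the overall accuracy. This is why the Comet summary at the end of the notebook reports identical values for `accuracy` and `recall`.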

I will log the computed metrics to an experiment in Comet.

In [None]:
# Connect to Comet
API_KEY = 'ehXeElNypcj7Knar5zTmyjwSO' # Found under Settings (top right in Comet)

# Create an experiment with my API key
exp = Experiment(api_key=API_KEY,
                 project_name='tesis-experimentos', # Name of the project where experiments are logged
                 auto_param_logging=False)
exp.set_name('baseline_gbm_sep_sub_r1') # Name of this experiment
exp.add_tags(['baseline', 'gbm', 'sep_sub']) # Tags

exp.log_metric("accuracy", accuracy)
exp.log_metric("precision", precision)
exp.log_metric("recall", recall)
exp.log_metric("training_time", training_time)
exp.log_confusion_matrix(y_test, y_pred)
exp.log_parameter("partition_array", sorteo)   # Store the partition array in the experiment
exp.log_text("First two subjects --> test, third and fourth --> validation, rest --> train. \n This is the first round trained with these parameters. ")   # Experiment comment
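
The API key above is written directly in the notebook, so anyone with access to the file can use it. One possible alternative is to read it from an environment variable. The sketch below assumes the variable is called `COMET_API_KEY`, and the helper name `get_comet_api_key` is hypothetical.

```python
import os

def get_comet_api_key(env_var="COMET_API_KEY"):
    """Read the Comet API key from an environment variable instead of the source code."""
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable before creating the experiment.")
    return api_key

# Usage (commented out so the sketch does not need a real key):
# exp = Experiment(api_key=get_comet_api_key(),
#                  project_name='tesis-experimentos',
#                  auto_param_logging=False)
```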

In [None]:
# Upload the model
exp.log_model(name="baseline_gbm_sep_sub_r1", file_or_folder="baseline_gbm_sep_sub_r1.pkl")
exp.end()

I put everything together in a single code block.

In [None]:
%cd "/content/drive/My Drive/repo_tesis/archivos_generados_codigos"

for k in range(1,2):
  label = np.array(label)
  cantSujetos = np.max(label[:, 2])
  sujeto_test = k
  sorteo = np.random.permutation(cantSujetos) + 1
  sorteo_sin_test = np.delete(sorteo, np.where(sorteo == sujeto_test))
  indices_test = list(np.where(label[:, 2]==sujeto_test)[0])
  indices_val = list(np.where(label[:, 2]==sorteo_sin_test[0])[0])
  indices_val.extend(list(np.where(label[:, 2]==sorteo_sin_test[1])[0]))

  indices_train = []
  for j in sorteo_sin_test[2:]:
      indices_train.extend(np.where(label[:, 2]==j)[0])
  features = np.array(features)

  X_train = features[indices_train, :]
  y_train = label[indices_train, 1]
  X_val = features[indices_val, :]
  y_val = label[indices_val, 1]
  X_test =  features[indices_test, :]
  y_test = label[indices_test, 1]

  start_time = time.time()

  # Fit the GBM
  clf = LGBMClassifier()
  clf.fit(X_train, y_train, eval_set=[(X_val, y_val)], eval_metric='logloss')

  end_time = time.time()
  training_time = end_time - start_time
  # Get the best iteration of the model (only meaningful when early stopping is used)
  bst = clf.best_iteration_

  nombre = 'baseline_gbm_sep_sub' + str(k) + '_testing'

  joblib.dump(clf, nombre + '.pkl')

  y_pred = clf.predict(X_test)

  accuracy = accuracy_score(y_test, y_pred)
  precision = precision_score(y_test, y_pred, average='weighted')
  recall = recall_score(y_test, y_pred, average='weighted')
  conf_matrix = confusion_matrix(y_test, y_pred)
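  # Draw the confusion matrix before saving it; otherwise the saved figure is empty
  plt.figure(figsize=(8, 6))
  sns.heatmap(conf_matrix, annot=True, cmap="Blues", fmt="d")
  plt.savefig("confusion_matrix.png")
  plt.close()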

  API_KEY = 'ehXeElNypcj7Knar5zTmyjwSO'

  exp = Experiment(api_key=API_KEY,
                  project_name='tesis-experimentos', # Name of the project where experiments are logged
                  auto_param_logging=False)
  exp.set_name(nombre) # Name of this experiment
  exp.add_tags(['baseline', 'gbm', 'sep_sub', 'choose_test']) # Tags

  exp.log_metric("accuracy", accuracy)
  exp.log_metric("precision", precision)
  exp.log_metric("recall", recall)
  exp.log_metric("training_time", training_time)
  exp.log_confusion_matrix(y_test, y_pred)
  exp.log_parameter("partition_array", sorteo)   # Store the partition array in the experiment
  exp.log_text("The test subject is fixed and the rest are drawn at random. The first two subjects drawn are used for validation.")   # Experiment comment
  exp.log_model(name=nombre, file_or_folder=nombre + '.pkl')
  exp.end()

/content/drive/My Drive/repo_tesis/archivos_generados_codigos
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.541994 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 50345
[LightGBM] [Info] Number of data points in the train set: 45226, number of used features: 240
[LightGBM] [Info] Start training from score -2.565215
[LightGBM] [Info] Start training from score -2.600029
[LightGBM] [Info] Start training from score -2.605103
[LightGBM] [Info] Start training from score -2.578238
[LightGBM] [Info] Start training from score -2.572718
[LightGBM] [Info] Start training from score -2.544725
[LightGBM] [Info] Start training from score -2.544161
[LightGBM] [Info] Start training from score -2.547263
[LightGBM] [Info] Start training from score -2.552927
[LightGBM] [Info] Start training from score -2.595573
[LightGBM] [Info] Start training from score -2.382079
[LightGBM] [Info] Start training from score -2.553779

COMET INFO: Experiment is live on comet.com https://www.comet.com/manuelmolinach99/tesis-experimentos/bd0cddec9e424fa4bd23561cf03d600f

COMET INFO: ---------------------------------------------------------------------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------------------------------------------------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.com/manuelmolinach99/tesis-experimentos/bd0cddec9e424fa4bd23561cf03d600f
COMET INFO:   Metrics:
COMET INFO:     accuracy      : 0.25368837711406983
COMET INFO:     precision     : 0.26482242684700047
COMET INFO:     recall        : 0.25368837711406983
COMET INFO:     training_time : 113.34423160552979

<Figure size 640x480 with 0 Axes>
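
The loop above fixes one test subject per iteration and draws the validation subjects from the rest. A minimal sketch of how the index sets for every subject could be generated is shown below. The generator name `leave_one_subject_out`, its `seed` argument, and the toy array are assumptions for illustration; in this notebook the subject IDs would come from `label[:, 2]`.

```python
import numpy as np

def leave_one_subject_out(subject_ids, n_val=2, seed=0):
    """Yield (test_subject, idx_train, idx_val, idx_test) for every subject in turn."""
    rng = np.random.default_rng(seed)
    subjects = np.unique(subject_ids)
    for test_subject in subjects:
        # Shuffle the remaining subjects; the first n_val go to validation
        remaining = rng.permutation(subjects[subjects != test_subject])
        val_subjects = remaining[:n_val]
        train_subjects = remaining[n_val:]
        yield (test_subject,
               np.flatnonzero(np.isin(subject_ids, train_subjects)),
               np.flatnonzero(np.isin(subject_ids, val_subjects)),
               np.flatnonzero(subject_ids == test_subject))

# Toy data (hypothetical): 5 subjects with 4 windows each
toy_subjects = np.repeat(np.arange(1, 6), 4)
folds = list(leave_one_subject_out(toy_subjects))
```

Each element of `folds` can then feed one training run, with the per-subject metrics averaged at the end.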