# Análisis de características Radiométricas de Stanford

Se procesan las características radiométricas y se experimentan para obtener el resultado en modelos
**Roberto Araya**


In [64]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, roc_auc_score, confusion_matrix
from sklearn.feature_selection import SelectKBest, chi2, VarianceThreshold, SelectPercentile, f_regression
#import statsmodels.api as sm
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import GridSearchCV

In [65]:
# parameteres
binwidth = 5
sigma = [1,2,3]
normalize= True
imageTypes = ['Original', 'LoG', 'Square', 'SquareRoot']

sm_radiometrics_file = f'santamaria_data_all__binwidth_{binwidth}_sigma_{sigma}_imtype_{imageTypes}_normalize_{normalize}.csv'
sm_radiometrics_file='stanford_first_change.csv'
sm_radiometrics = pd.read_csv(sm_radiometrics_file)

In [66]:
print(sm_radiometrics['EGFR mutation status'].value_counts())
sm_radiometrics.head()

EGFR mutation status
Wildtype         126
Mutant            43
Unknown           33
Not collected      5
Name: count, dtype: int64


Unnamed: 0,Case ID,Patient affiliation,Age at Histological Diagnosis,Weight (lbs),Gender,Ethnicity,Smoking status,Pack Years,Quit Smoking Year,%GG,...,torax3d_wavelet-LLL_glszm_SmallAreaHighGrayLevelEmphasis,torax3d_wavelet-LLL_glszm_SmallAreaLowGrayLevelEmphasis,torax3d_wavelet-LLL_glszm_ZoneEntropy,torax3d_wavelet-LLL_glszm_ZonePercentage,torax3d_wavelet-LLL_glszm_ZoneVariance,torax3d_wavelet-LLL_ngtdm_Busyness,torax3d_wavelet-LLL_ngtdm_Coarseness,torax3d_wavelet-LLL_ngtdm_Complexity,torax3d_wavelet-LLL_ngtdm_Contrast,torax3d_wavelet-LLL_ngtdm_Strength
0,AMC-001,Stanford,34,Not Collected,Male,Not Recorded In Database,Nonsmoker,,,Not Assessed,...,,,,,,,,,,
1,AMC-002,Stanford,33,Not Collected,Female,Not Recorded In Database,Nonsmoker,,,Not Assessed,...,,,,,,,,,,
2,AMC-003,Stanford,69,Not Collected,Female,Not Recorded In Database,Nonsmoker,,,Not Assessed,...,,,,,,,,,,
3,AMC-004,Stanford,80,Not Collected,Female,Not Recorded In Database,Nonsmoker,,,Not Assessed,...,,,,,,,,,,
4,AMC-005,Stanford,76,Not Collected,Male,Not Recorded In Database,Former,30.0,1962.0,Not Assessed,...,,,,,,,,,,


## Procesamiento de datos
- Se eliminan columnas no relevantes.
- Se vectorizan las características compuestas por categorías de *strings* con el método *one-hot-encoding*.
- Se eliminan las columnas que posean alguna fila con valor nulo (INVESIGAR ESTOS CASOS, CASOS SIN EXAMENES TORAX3D).

In [67]:
egfr_mutation_status = sm_radiometrics['EGFR mutation status']

# Drop columns from 0 to 48
sm_radiometrics = sm_radiometrics.drop(sm_radiometrics.columns[0:49], axis=1)
#sm_radiometrics = sm_radiometrics.drop(sm_radiometrics.columns[1:2], axis=1)

# Add back 'EGFR mutation status' to the DataFrame
sm_radiometrics['EGFR mutation status'] = egfr_mutation_status


# Define columns related to images
columns_to_drop = ['diagnostics_Configuration_EnabledImageTypes', 'diagnostics_Configuration_Settings', 
                  'diagnostics_Image-original_Dimensionality', 'diagnostics_Mask-original_BoundingBox', 
                  'diagnostics_Versions_PyRadiomics', 'diagnostics_Mask-original_CenterOfMassIndex', 'diagnostics_Image-original_Hash', 
                  'diagnostics_Image-original_Size', 'diagnostics_Image-original_Spacing', 'diagnostics_Mask-original_Hash', 
                  'diagnostics_Mask-original_Size', 'diagnostics_Mask-original_Spacing', 'diagnostics_Versions_Numpy', 
                  'diagnostics_Versions_PyWavelet', 'diagnostics_Versions_Python', 'diagnostics_Versions_SimpleITK', 
                  'diagnostics_Image-interpolated_Size', 'diagnostics_Image-interpolated_Spacing', 'diagnostics_Mask-interpolated_BoundingBox', 
                  'diagnostics_Mask-interpolated_CenterOfMass', 'diagnostics_Mask-interpolated_CenterOfMassIndex', 'diagnostics_Mask-interpolated_Size',
                  'diagnostics_Mask-interpolated_Spacing', 'diagnostics_Mask-original_CenterOfMass']

# Define exam types
exam_types = ['body_', 'pet_', 'torax3d_']

# Generate columns related to images for each exam type
columns_to_drop = [exam + s for s in columns_to_drop for exam in exam_types]

# Remove specific column from the list
columns_to_drop.remove('body_diagnostics_Configuration_EnabledImageTypes')

# Drop specified columns while keeping 'EGFR Mutation Status'
sm_radiometrics = sm_radiometrics.drop(columns=columns_to_drop, errors='ignore')

sm_radiometrics['EGFR'] = sm_radiometrics['EGFR mutation status'].replace({'Mutant': 1, 'Wildtype': 0, 'Not collected': 2, 'Unknown': 3})
sm_radiometrics = sm_radiometrics.drop(columns=['EGFR mutation status'])

In [68]:
egfr_counts = sm_radiometrics['EGFR'].value_counts()

# Print the counts
print("EGFR Counts:")
print(egfr_counts)

# Filter rows where 'EGFR' is neither 0 nor 1
filtered_sm_radiometrics = sm_radiometrics[(sm_radiometrics['EGFR'] == 0) | (sm_radiometrics['EGFR'] == 1)]

# Display the filtered DataFrame
print("Filtered DataFrame:")
print(filtered_sm_radiometrics['EGFR'].value_counts())

EGFR Counts:
EGFR
0    126
1     43
3     33
2      5
Name: count, dtype: int64
Filtered DataFrame:
EGFR
0    126
1     43
Name: count, dtype: int64


In [69]:
null_threshold = 0.2

# Calculate the percentage of null values in each column
null_percentage = filtered_sm_radiometrics.isnull().mean()

# Filter columns with null percentage above the threshold
columns_to_drop = null_percentage[null_percentage >= null_threshold].index

print(columns_to_drop)

# Drop the identified columns
filtered_sm_radiometrics = filtered_sm_radiometrics.drop(columns=columns_to_drop)

# Display the filtered DataFrame
print("Filtered DataFrame:")

Index(['torax3d_diagnostics_Image-interpolated_Maximum',
       'torax3d_diagnostics_Image-interpolated_Mean',
       'torax3d_diagnostics_Image-interpolated_Minimum',
       'torax3d_diagnostics_Image-original_Maximum',
       'torax3d_diagnostics_Image-original_Mean',
       'torax3d_diagnostics_Image-original_Minimum',
       'torax3d_diagnostics_Mask-interpolated_Maximum',
       'torax3d_diagnostics_Mask-interpolated_Mean',
       'torax3d_diagnostics_Mask-interpolated_Minimum',
       'torax3d_diagnostics_Mask-interpolated_VolumeNum',
       ...
       'torax3d_wavelet-LLL_glszm_SmallAreaHighGrayLevelEmphasis',
       'torax3d_wavelet-LLL_glszm_SmallAreaLowGrayLevelEmphasis',
       'torax3d_wavelet-LLL_glszm_ZoneEntropy',
       'torax3d_wavelet-LLL_glszm_ZonePercentage',
       'torax3d_wavelet-LLL_glszm_ZoneVariance',
       'torax3d_wavelet-LLL_ngtdm_Busyness',
       'torax3d_wavelet-LLL_ngtdm_Coarseness',
       'torax3d_wavelet-LLL_ngtdm_Complexity',
       'torax3d_wavele

In [70]:
row_null_threshold = 10

print(filtered_sm_radiometrics.shape)
# Count null values in each row
null_counts_per_row = filtered_sm_radiometrics.isnull().sum(axis=1)

# Filter rows with more than the specified threshold of null values
sm_radiometrics = filtered_sm_radiometrics[null_counts_per_row <= row_null_threshold]

# Display the filtered DataFrame
print("Filtered DataFrame:")
print(sm_radiometrics.shape)

(169, 3031)
Filtered DataFrame:
(143, 3031)


## Entrenamiento de modelos y selección de características

Se sigue la metodología realizada por Hector Henriquez en el trabajo [EGFR mutation prediction using F18-FDG PET-CT based radiomics features in non-small cell lung cancer](https://arxiv.org/pdf/2303.08569.pdf) para evaluar el rendimiento de modelos en las caracterterísticas radiométricas.

Algunas configuraciones importantes:
- Se entrena con un modelo RandomForestClassifier.
- Se entrena y evalúa el rendimiento para KFold con $k=3$.
- Se obtienen los resultados en las métricas *accuracy*, *AUC*, *True Positive Rate*, *False Positive Rate*.

In [72]:
def evaluate_model(model, X, y):
    predictions = model.predict(X)
    accuracy = accuracy_score(y, predictions)
    
    # Predicted probabilities for class 1 (positive class) for AUC calculation
    probas = model.predict_proba(X)[:, 1]
    auc = roc_auc_score(y, probas)
    
    # Confusion matrix for true positives and false positives
    cm = confusion_matrix(y, predictions)
    true_positives = cm[1, 1]
    false_positives = cm[0, 1]
    
    # Calculate precision, recall, and avoid division by zero
    #print('true positives: ', true_positives)
    #print('true negatives: ', cm[0,0])
    precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) != 0 else 0
    recall = true_positives / (true_positives + cm[1, 0]) if (true_positives + cm[1, 0]) != 0 else 0
    
    return accuracy, auc, precision, recall


### I. Transformación y filtro de datos
- Se filtran las características de acuerdo a *Low Variance Filter*.
- Se filtran las características con **Selección Univariada de características** con *p-value>0.05*.
- Se estandarizan las variables al intervalo $[0,1]$.

In [19]:
# Separate features and target
X = sm_radiometrics.drop(columns=['EGFR'])
y = sm_radiometrics['EGFR']

# Step 1: Remove features with low variance
variance_threshold = 0.01  # You can adjust this threshold as needed
selector = VarianceThreshold(threshold=variance_threshold)
X_high_variance = selector.fit_transform(X)

# Convert X_high_variance to a DataFrame with the selected features
X_high_variance = pd.DataFrame(X_high_variance, columns=X.columns[selector.get_support()])

# Step 2: Univariate feature selection - SelectPercentile - Filter features with p-value less than 0.05
p_value_threshold = 0.05

percentile_selector = SelectPercentile(f_regression, percentile=100)  # You can adjust the percentile
X_percentile = percentile_selector.fit_transform(X_high_variance, y)
significant_percentile_features = X_high_variance.columns[percentile_selector.pvalues_ < p_value_threshold]
X_filtered_percentile = X_high_variance[significant_percentile_features]

# Step 3: Scale the features to be non-negative and keep the original column names
scaler = MinMaxScaler()
X = pd.DataFrame(scaler.fit_transform(X_filtered_percentile), columns=X_filtered_percentile.columns)

print(X.shape)

(143, 77)


In [21]:
X.head()

Unnamed: 0,body_logarithm_firstorder_InterquartileRange,body_squareroot_firstorder_InterquartileRange,body_squareroot_firstorder_Range,body_wavelet-HLH_firstorder_Skewness,body_wavelet-HLH_glszm_HighGrayLevelZoneEmphasis,body_wavelet-HLH_glszm_LowGrayLevelZoneEmphasis,body_wavelet-HLL_gldm_DependenceEntropy,body_wavelet-LHL_gldm_DependenceEntropy,body_wavelet-LLL_glcm_Autocorrelation,body_wavelet-LLL_glcm_ClusterProminence,...,pet_wavelet-LLH_gldm_LargeDependenceLowGrayLevelEmphasis,pet_wavelet-LLH_glszm_ZoneEntropy,pet_wavelet-LLL_glcm_JointEnergy,pet_wavelet-LLL_glcm_MCC,pet_wavelet-LLL_gldm_DependenceEntropy,pet_wavelet-LLL_glrlm_LongRunEmphasis,pet_wavelet-LLL_glrlm_RunVariance,pet_wavelet-LLL_glszm_GrayLevelNonUniformityNormalized,pet_wavelet-LLL_glszm_LowGrayLevelZoneEmphasis,pet_wavelet-LLL_glszm_SmallAreaLowGrayLevelEmphasis
0,0.014132,0.013588,0.354126,0.475221,0.479053,0.520947,0.76013,0.746909,0.0,0.0,...,0.651405,0.929798,0.159218,0.94236,0.65462,0.260295,0.373979,0.235045,0.14142,0.061424
1,0.020424,0.017036,0.000257,0.627934,0.460317,0.539683,0.894294,0.907272,0.0,0.0,...,0.810247,0.584442,0.023034,0.930306,0.688166,0.009302,0.01568,0.196224,0.476495,0.321294
2,0.014544,0.01327,0.123973,0.610127,0.239766,0.760234,0.72706,0.663205,0.0,0.0,...,0.591208,0.639288,0.204099,0.949451,0.575652,0.401489,0.515776,0.859998,0.934445,0.593552
3,0.307839,0.30981,0.330353,0.670973,0.555556,0.444444,0.854749,0.865872,0.042336,0.630887,...,0.708948,0.571515,0.330991,0.874456,0.450543,0.315107,0.33512,0.285301,0.552439,0.850909
4,0.086327,0.081392,0.060996,0.454608,0.494949,0.505051,0.8865,0.875052,0.0,0.0,...,0.458212,0.409835,0.959231,0.340772,0.056749,1.0,1.0,0.841178,0.936721,0.36944


### II. Búsqueda de los mejores hiperparámetros del modelo con Grid Search

In [81]:
# Separate features and target
X = sm_radiometrics.drop(columns=['EGFR'])
y = sm_radiometrics['EGFR']

# gridsearch
param_grid = {
    'n_estimators': [50, 100, 150, 200, 250, 300],
    'criterion': ["gini", "entropy"],
    'max_depth': [None, 10, 20, 30, 40, 50, 100],
    'max_features': ['sqrt'],
    'min_samples_split': [2, 5, 10, 20, 30],
    'min_samples_leaf': [1, 2, 4, 10, 20],
}

rf_model = RandomForestClassifier(random_state=42)
clf = GridSearchCV(rf_model, param_grid)
clf.fit(X, y)

best_estimator = clf.best_estimator_
best_params = clf.best_params_

In [82]:
print(best_params)
print(clf.best_score_)

{'criterion': 'gini', 'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 200}
0.7975369458128079


### III. Evaluación inicial del modelo

In [83]:
# Number of folds for cross-validation
k_folds = 3
kf = KFold(n_splits=k_folds, shuffle=True, random_state=4)

In [85]:
# Initial model performance
initial_results = []

for train_index, test_index in kf.split(X):
    # Train a RandomForestClassifier
    rf_model = RandomForestClassifier(random_state=42, **best_params)
    X_train, X_test, y_train, y_test = X.iloc[train_index], X.iloc[test_index], y.iloc[train_index], y.iloc[test_index]

    rf_model.fit(X_train, y_train)
    initial_results.append(evaluate_model(rf_model, X_test, y_test))

# Print initial results
initial_average_results = np.mean(initial_results, axis=0)

print("Initial Results:")
print(f"Accuracy: {initial_average_results[0]}")
print(f"AUC: {initial_average_results[1]}")
print(f"Precision: {initial_average_results[2]}")
print(f"Recall: {initial_average_results[3]}")

Initial Results:
Accuracy: 0.7557624113475176
AUC: 0.6131451914848096
Precision: 0.3833333333333333
Recall: 0.18472222222222223


### III. Selección de características con *backward selection*
Se filtran las características menos relevantes de acuerdo al proceso de *backward selection* usando el modelo *RandomForestClassifier* con KFold con *k=5*.

In [86]:
# Backward feature selection with k-fold cross-validation
selected_features = list(X.columns)
print(len(selected_features))

prev_accuracy = np.inf  # Initialize with a high value
tolerance = 1e-1  # Define a tolerance for stopping criterion

for _ in range(len(selected_features) - 1):
    # Store current performance
    best_accuracy = 0
    best_results = None
    feature_to_remove = None

    # Try removing each feature and evaluate the model using k-fold cross-validation
    for feature in selected_features:
        current_features = [f for f in selected_features if f != feature]
        accuracy_per_fold = []

        for train_index, test_index in kf.split(X):
            X_train, X_test = X[current_features].iloc[train_index], X[current_features].iloc[test_index]
            y_train, y_test = y.iloc[train_index], y.iloc[test_index]

            rf_model = RandomForestClassifier(random_state=42, **best_params)
            rf_model.fit(X_train, y_train)
            accuracy_per_fold.append(evaluate_model(rf_model, X_test, y_test))

        # Update best feature to remove
        if np.mean(accuracy_per_fold, axis=0)[0] > best_accuracy:
            best_accuracy = np.mean(accuracy_per_fold, axis=0)[0]
            best_results = np.mean(accuracy_per_fold, axis=0)
            feature_to_remove = feature

    # Stop if the change in accuracy is below the tolerance
    if prev_accuracy - best_accuracy < tolerance:
        print("Stopping criterion reached. No significant improvement.")
        break

    #prev_accuracy = best_accuracy

    # Remove the least important feature
    selected_features.remove(feature_to_remove)
    print(f"Removed feature {feature_to_remove}, Current results: {best_results}")

# Final selected features
print("Final selected features:", selected_features)

77
Removed feature body_wavelet-LHL_gldm_DependenceEntropy, Current results: [0.79772459 0.62969792 0.63333333 0.20555556]
Removed feature pet_log-sigma-1-mm-3D_glszm_LowGrayLevelZoneEmphasis, Current results: [0.79063239 0.65866083 0.61111111 0.25138889]
Removed feature pet_log-sigma-2-mm-3D_glrlm_LongRunLowGrayLevelEmphasis, Current results: [0.78368794 0.64577472 0.51111111 0.20555556]
Removed feature pet_squareroot_glszm_LowGrayLevelZoneEmphasis, Current results: [0.79772459 0.64866036 0.6        0.20555556]
Removed feature pet_wavelet-HLH_glcm_JointEntropy, Current results: [0.79063239 0.65327103 0.54444444 0.23888889]
Removed feature body_squareroot_firstorder_Range, Current results: [0.79078014 0.64132321 0.55555556 0.23888889]
Removed feature body_wavelet-LLL_glcm_Imc2, Current results: [0.79772459 0.6356049  0.6        0.23888889]
Removed feature body_wavelet-LLL_ngtdm_Coarseness, Current results: [0.79772459 0.6722227  0.63333333 0.20555556]
Removed feature pet_log-sigma-2-mm

### Resultados en el conjunto de Santa Maria

In [153]:
test_ds = pd.read_csv('santamaria_data_all__binwidth_5_sigma_[1, 2, 3]_normalize_True.csv')

original_drop_columns = ['SEXO_MASCULINO', 'EDAD', 'PATIENT_ID', 'FECHA_CIRUGIA', 'BIOPSIA_QX_PULMONAR', 'BIOPSIA_FBC-EBUS', 'BIOPSIA_OTRO_SITIO', 'RESULTADO_BP', 'BP_COMPLETA', 'HISTOLOGIA', 'MUTACION_EGFR', 'MUTACION_PDL-1', 'MUTACION_ROS', 'RECIDIVA', 'COMENTARIO', '3D_TORAX_SEG', 'PET_SEG', 'BODY_CT_SEG']
images_columns = ['diagnostics_Mask-original_CenterOfMass', 'diagnostics_Configuration_EnabledImageTypes', 'diagnostics_Configuration_Settings', 
                  'diagnostics_Image-original_Dimensionality', 'diagnostics_Mask-original_BoundingBox', 
                  'diagnostics_Versions_PyRadiomics', 'diagnostics_Mask-original_CenterOfMassIndex', 'diagnostics_Image-original_Hash', 
                  'diagnostics_Image-original_Size', 'diagnostics_Image-original_Spacing', 'diagnostics_Mask-original_Hash', 
                  'diagnostics_Mask-original_Size', 'diagnostics_Mask-original_Spacing', 'diagnostics_Versions_Numpy', 
                  'diagnostics_Versions_PyWavelet', 'diagnostics_Versions_Python', 'diagnostics_Versions_SimpleITK', 
                  'diagnostics_Image-interpolated_Size', 'diagnostics_Image-interpolated_Spacing', 'diagnostics_Mask-interpolated_BoundingBox', 
                  'diagnostics_Mask-interpolated_CenterOfMass', 'diagnostics_Mask-interpolated_CenterOfMassIndex', 'diagnostics_Mask-interpolated_Size',
                  'diagnostics_Mask-interpolated_Spacing']

exam_types = ['body_', 'pet_', 'torax3d_']
images_columns = [exam + s for s in images_columns for exam in exam_types]

# extra columns that are not relevant
extra_columns = ['ALK', 'MUTACION_ALK', 'PDL-1','ROS', 'ADENOPATIAS', 'STAGE', 'IV CONTRAST', 'TAMAÑO_BP_mm', 'TAMAÑO_CT_mm']

drop_columns = original_drop_columns+images_columns+extra_columns
test_ds = test_ds.drop(columns=drop_columns)
test_ds = test_ds.drop(index=34)

# Separate features and target
X_test = test_ds.drop(columns=['EGFR'])
X_test = X_test[X.columns]
y_test = test_ds['EGFR']

In [164]:
rf_model = RandomForestClassifier(random_state=15)
rf_model.fit(X, y)
print(evaluate_model(rf_model, X_test, y_test))

(0.38235294117647056, 0.4659090909090909, 0.36363636363636365, 1.0)


In [68]:
import mdfs

# Separate features and target
X = sm_radiometrics.drop(columns=['EGFR']).to_numpy()
y = sm_radiometrics['EGFR'].astype(np.intc).to_numpy()

result = mdfs.run(X, y, seed=0, n_contrast=50, dimensions=2, divisions=2, discretizations=6,
        range_=None, pc_xi=0.25, p_adjust_method='fdr_tsbh', level=0.3)

relevant_var = result['relevant_variables']

# Get the names of relevant variables from the original DataFrame
original_column_names = sm_radiometrics.drop(columns=['EGFR']).columns
relevant_column_names = original_column_names[relevant_var]

# Create a new DataFrame with the relevant variables and original column names
X = pd.DataFrame(X[:, relevant_var], columns=relevant_column_names)
y = sm_radiometrics['EGFR']

In [72]:
# Number of folds for cross-validation
k_folds = 3
kf = KFold(n_splits=k_folds, shuffle=True, random_state=4)

# Initial model performance
initial_results = []

for train_index, test_index in kf.split(X):
    # Train a RandomForestClassifier
    rf_model = RandomForestClassifier(random_state=42, **best_params)
    X_train, X_test, y_train, y_test = X.iloc[train_index], X.iloc[test_index], y.iloc[train_index], y.iloc[test_index]

    rf_model.fit(X_train, y_train)
    initial_results.append(evaluate_model(rf_model, X_test, y_test))

# Print initial results
initial_average_results = np.mean(initial_results, axis=0)

print("Initial Results:")
print(f"Accuracy: {initial_average_results[0]}")
print(f"AUC: {initial_average_results[1]}")
print(f"Precision: {initial_average_results[2]}")
print(f"Recall: {initial_average_results[3]}")

Initial Results:
Accuracy: 0.7557624113475176
AUC: 0.6091982130758807
Precision: 0.3611111111111111
Recall: 0.10972222222222222


### Preguntas
- Considerar las características mas estables.
- Chest-ct contrast, PET-CT contrast, columna de 1 para sm.
- Considerar metodos de seleccion de variables de multidimensionalidad.
- Reportar resultados en Cross-validation Santa Maria, Stanford y entre ellos.


- Revisar el procesamiento y aplicación del excel (filtros, normalización y otros) para la configuración del extractor - comparar con el paper de Hector.
- Consultar si las imágenes PET de los resultados del paper se realiza la normalización con el PET de liver.
- Se tienen que filtrar las columnas de torax3d porque para algunos pacientes no está aquella información.

### Falta
- Implementar para todos los filtros posibles.
- Normalizar imágenes PET (creo).
- hyperparameter search was performed with gridsearch and the performance metrics were
calculated with 100 repetitions of 5-fold cross-validation.
- Implementar lo anterior para que sea entrenado y validado en Stanford, para luego testear en Santa María.