# Building, Tuning, and Evaluating Machine Learning Models

Task: Build, tune, and evaluate Machine Learning models using appropriate techniques and algorithms.

Instructions:

* Select Machine Learning algorithms suited to the problem at hand.
* Train the models on the preprocessed data.
* Tune hyperparameters to optimize model performance.
* Evaluate the models using specific performance metrics.

Importance: This stage is essential for developing accurate, efficient models that can solve specific problems effectively.

We start by trying a basic Random Forest Classifier model.

In [37]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import numpy as np
import pandas as pd


In [41]:
data_df = pd.read_csv(r"C:\Users\USER\Documents\MLOPS-MNA-Equipo-17\data\processed\TCGA_GBM_LGG_Mutations_clean.csv")  # change path when testing
data_df.sample(5)


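| Unnamed: 0 | Grade | Gender | Age_at_diagnosis | Race | Tumor_Type | Tumor_Specification | IDH1 | TP53 | ATRX | PTEN | ... | FUBP1 | RB1 | NOTCH1 | BCOR | CSMD3 | SMARCA4 | GRIN2A | IDH2 | FAT4 | PDGFRA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 767 | 1 | 0 | 60.67 | 1 | 3 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 320 | 0 | 1 | 59.72 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 210 | 0 | 0 | 30.76 | 1 | 2 | 1 | 1 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 351 | 0 | 1 | 53.54 | 1 | 2 | 2 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 449 | 0 | 1 | 47.62 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |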


In [43]:
# Features: every column except the target
X = data_df.drop(["Grade"], axis=1)
# Target: the Grade column
y = data_df["Grade"]
y.sample(5)

239    0
438    0
773    1
488    0
341    0
Name: Grade, dtype: int64
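
The sample output above lists a column header `Unnamed: 0`, which pandas typically produces when a CSV was written together with its index. If that column really is present in `data_df`, it would stay in `X` as a feature. A minimal sketch of how it could be excluded, using a made-up toy frame (`toy_df`, `X_toy`, and `y_toy` are hypothetical names, and the values are invented for illustration):

```python
import pandas as pd

# Toy frame standing in for data_df; the values are invented for illustration.
toy_df = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "Grade": [0, 1, 0, 1],
    "Gender": [0, 1, 1, 0],
    "IDH1": [1, 0, 1, 0],
})

# Drop the target (must exist), then the leftover index column (only if present).
X_toy = toy_df.drop(columns=["Grade"]).drop(columns=["Unnamed: 0"], errors="ignore")
y_toy = toy_df["Grade"]

print(list(X_toy.columns))  # → ['Gender', 'IDH1']
```

Passing `errors="ignore"` keeps the second `drop` safe when the column does not exist, so the same line works either way.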

We split our data into training and test sets with train_test_split and fit the model.

In [45]:
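# Hold out 30% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=44)
# Random forest with 50 trees
rf_model = RandomForestClassifier(n_estimators=50, random_state=44)
rf_model.fit(X_train, y_train)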
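
The split above is purely random, so the share of each grade can differ slightly between the training and test sets. One option, not used in the original cell, is the `stratify` argument of `train_test_split`, which keeps the class proportions aligned. The sketch below uses synthetic data (`X_demo`, `y_demo`, and the 40% positive rate are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(44)

# Synthetic stand-in for the real data: 1000 rows, about 40% positives.
X_demo = pd.DataFrame({
    "feature_a": rng.normal(size=1000),
    "feature_b": rng.integers(0, 2, size=1000),
})
y_demo = pd.Series((rng.random(1000) < 0.4).astype(int), name="Grade")

# stratify=y_demo preserves the class ratio in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=44, stratify=y_demo
)

print(f"positives overall: {y_demo.mean():.3f}")
print(f"positives in train: {y_tr.mean():.3f}")
print(f"positives in test: {y_te.mean():.3f}")
```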

Here we can already see how the model predicts the grade based on its training.

In [46]:
predictions = rf_model.predict(X_test)
predictions

array([0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1,
       0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0,
       1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1,
       1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,
       1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0,
       0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0,
       1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0], dtype=int64)

In [47]:
rf_model.predict_proba(X_test)

array([[1.  , 0.  ],
       [0.94, 0.06],
       [1.  , 0.  ],
       [0.02, 0.98],
       [0.96, 0.04],
       [1.  , 0.  ],
       [0.  , 1.  ],
       [1.  , 0.  ],
       [1.  , 0.  ],
       [1.  , 0.  ],
       [0.02, 0.98],
       [0.98, 0.02],
       [0.02, 0.98],
       [0.  , 1.  ],
       [1.  , 0.  ],
       [0.96, 0.04],
       [0.96, 0.04],
       [0.96, 0.04],
       [1.  , 0.  ],
       [1.  , 0.  ],
       [1.  , 0.  ],
       [0.  , 1.  ],
       [0.9 , 0.1 ],
       [0.  , 1.  ],
       [0.  , 1.  ],
       [1.  , 0.  ],
       [0.96, 0.04],
       [0.86, 0.14],
       [0.02, 0.98],
       [0.  , 1.  ],
       [0.92, 0.08],
       [1.  , 0.  ],
       [1.  , 0.  ],
       [1.  , 0.  ],
       [1.  , 0.  ],
       [1.  , 0.  ],
       [1.  , 0.  ],
       [1.  , 0.  ],
       [0.24, 0.76],
       [1.  , 0.  ],
       [0.  , 1.  ],
       [1.  , 0.  ],
       [0.98, 0.02],
       [0.  , 1.  ],
       [0.96, 0.04],
       [0.  , 1.  ],
       [1.  , 0.  ],
       [0.  ,
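
Each row of `predict_proba` holds one probability per class, in the order given by the model's `classes_` attribute, and `predict` returns the class with the highest probability. With 50 trees grown until their leaves are pure, each tree typically votes 0 or 1, which would explain why the values above are multiples of 0.02. A small sketch on synthetic data (`X_demo`, `y_demo`, and `model` are illustrative names, not part of the notebook):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification problem, for illustration only.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=44)
model = RandomForestClassifier(n_estimators=50, random_state=44).fit(X_demo, y_demo)

proba = model.predict_proba(X_demo[:5])

# Column j of proba corresponds to model.classes_[j].
print(model.classes_)
print(proba)

# Taking the most probable class per row reproduces predict().
labels_from_proba = model.classes_[proba.argmax(axis=1)]
print(labels_from_proba)
print(model.predict(X_demo[:5]))
```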

Likewise, we can see which feature of our data is most important for the prediction process.

In [26]:
importances = rf_model.feature_importances_
columns = X.columns
# Print each feature's importance as a percentage, in column order
for column, importance in zip(columns, importances):
    print(f" The importance of feature '{column}' is {round(importance * 100, 2)}%.")

 The importance of feature 'Gender' is 0.29%.
 The importance of feature 'Age_at_diagnosis' is 6.52%.
 The importance of feature 'Race' is 0.51%.
 The importance of feature 'Tumor_Type' is 44.45%.
 The importance of feature 'Tumor_Specification' is 20.83%.
 The importance of feature 'IDH1' is 16.97%.
 The importance of feature 'TP53' is 0.68%.
 The importance of feature 'ATRX' is 1.65%.
 The importance of feature 'PTEN' is 2.04%.
 The importance of feature 'EGFR' is 0.89%.
 The importance of feature 'CIC' is 1.96%.
 The importance of feature 'MUC16' is 0.24%.
 The importance of feature 'PIK3CA' is 0.22%.
 The importance of feature 'NF1' is 0.17%.
 The importance of feature 'PIK3R1' is 0.19%.
 The importance of feature 'FUBP1' is 0.43%.
 The importance of feature 'RB1' is 0.31%.
 The importance of feature 'NOTCH1' is 0.45%.
 The importance of feature 'BCOR' is 0.19%.
 The importance of feature 'CSMD3' is 0.02%.
 The importance of feature 'SMARCA4' is 0.06%.
 The importance of feature 'G
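
The loop above prints features in column order. To see the ranking at a glance, the importances can be wrapped in a pandas Series and sorted. A minimal sketch on synthetic data (the `feature_0` to `feature_4` names are placeholders):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with named columns, for illustration only.
X_arr, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=44
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(5)])
model = RandomForestClassifier(n_estimators=50, random_state=44).fit(X_demo, y_demo)

# Pair each importance with its column name and sort from most to least important.
ranked = pd.Series(model.feature_importances_, index=X_demo.columns)
ranked = ranked.sort_values(ascending=False)

print((ranked * 100).round(2))
```

In the notebook, the same two lines would take `rf_model.feature_importances_` and `X.columns` as inputs.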

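The Random Forest model gave us 100% precision on the test set. A perfect score on held-out data is suspicious, and we attribute it to overfitting caused by the nature of our data: in the importances printed above, Tumor_Type and Tumor_Specification alone account for roughly 65%, so these columns may encode the grade almost directly. For our model we will therefore use a linear regression instead of the Random Forest.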

In [36]:
from sklearn.metrics import classification_report
y_true, y_pred = y_test, rf_model.predict(X_test)
print(classification_report(y_true, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       150
           1       1.00      1.00      1.00       108

    accuracy                           1.00       258
   macro avg       1.00      1.00      1.00       258
weighted avg       1.00      1.00      1.00       258
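
A perfect report on held-out data can come from different causes, and comparing training and test accuracy helps tell them apart. In classic overfitting the training score is high while the test score drops; when a feature nearly encodes the target, both scores are high. The sketch below illustrates both cases on synthetic data, with a hypothetical `leaky` column that copies the target (all names and data are assumptions for illustration, not the notebook's dataset):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(44)
n = 400

# Random binary target and pure-noise features, plus one column that copies the target.
y_demo = pd.Series(rng.integers(0, 2, size=n), name="target")
X_demo = pd.DataFrame({
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
    "noise_c": rng.normal(size=n),
    "leaky": y_demo.to_numpy(),  # hypothetical feature that encodes the target
})

def holdout_scores(features):
    """Return (train accuracy, test accuracy) for a 50-tree random forest."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, y_demo, test_size=0.3, random_state=44
    )
    model = RandomForestClassifier(n_estimators=50, random_state=44).fit(X_tr, y_tr)
    return model.score(X_tr, y_tr), model.score(X_te, y_te)

train_with, test_with = holdout_scores(X_demo)
train_without, test_without = holdout_scores(X_demo.drop(columns=["leaky"]))

print(f"with leaky column:    train={train_with:.2f}, test={test_with:.2f}")
print(f"without leaky column: train={train_without:.2f}, test={test_without:.2f}")
```

Applied to this notebook, the same check would mean comparing `rf_model.score(X_train, y_train)` with `rf_model.score(X_test, y_test)`, then retraining without the highest-ranked features to see whether the score stays perfect.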



In [32]:
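from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
# Base estimator; n_estimators is overridden by the values in param_grid
clf = RandomForestClassifier(n_estimators=50, random_state=44)
# 4 x 4 = 16 hyperparameter combinations
param_grid = {
    'n_estimators': [5, 10, 15, 20],
    'max_depth': [2, 5, 7, 9]
}
# Exhaustive search with 10-fold cross-validation on the training set
grid_clf = GridSearchCV(clf, param_grid, cv=10)
grid_clf.fit(X_train, y_train)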
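
Once the search has been fitted, `best_params_` and `best_score_` summarize the winning combination, and `cv_results_` can be loaded into a DataFrame, which is easier to read than the raw dictionary. A sketch on synthetic data with the same grid (`X_demo`, `y_demo`, `search`, and `summary` are illustrative names):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data, for illustration only.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=44)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=44
)

param_grid = {"n_estimators": [5, 10, 15, 20], "max_depth": [2, 5, 7, 9]}
search = GridSearchCV(RandomForestClassifier(random_state=44), param_grid, cv=10)
search.fit(X_tr, y_tr)

# Best combination and its mean cross-validated accuracy.
print(search.best_params_)
print(search.best_score_)

# Accuracy of the refitted best estimator on the held-out test set.
print(search.score(X_te, y_te))

# One row per combination, best rank first.
results = pd.DataFrame(search.cv_results_)
summary = results[
    ["param_n_estimators", "param_max_depth", "mean_test_score",
     "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(summary.head())
```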

In [35]:
grid_clf.cv_results_

{'mean_fit_time': array([0.00578463, 0.00937476, 0.0130651 , 0.01685462, 0.0053854 ,
        0.00987346, 0.01326435, 0.01735327, 0.00568457, 0.00967615,
        0.01386259, 0.01755285, 0.00558503, 0.00957408, 0.01376305,
        0.01765246]),
 'std_fit_time': array([0.00039905, 0.00048845, 0.00029922, 0.00029909, 0.00048852,
        0.00029924, 0.000457  , 0.00048868, 0.00045703, 0.00045372,
        0.00069822, 0.00048862, 0.00048858, 0.00048873, 0.00039898,
        0.00045713]),
 'mean_score_time': array([0.00119677, 0.00129659, 0.00169525, 0.00169556, 0.0013962 ,
        0.00109706, 0.00159576, 0.00179524, 0.00129666, 0.00139408,
        0.00149603, 0.00189502, 0.00139625, 0.00129657, 0.00139637,
        0.00169547]),
 'std_score_time': array([0.00039868, 0.00045701, 0.00045702, 0.00045701, 0.00048835,
        0.00029918, 0.00048854, 0.00039897, 0.00045707, 0.00048604,
        0.00049863, 0.00029924, 0.00048866, 0.00045703, 0.00048846,
        0.00045705]),
 'param_max_depth': masked