# **Exercise 4**
# **Single Cell Perturbations**
##  **Regression Models**
Consider the `Open Problems– Single-Cell Perturbations` dataset. Implement the regression version of each model studied in class, namely KNN and Linear Regression, on the supplied dataset. Build an error table with the usual regression metrics: MAPE, MAE, RMSE, MSE, and R2 (see Table 2). Use the Mean Rowwise Root Mean Squared Error (MRRMSE) metric during evaluation and validation to select the best regression model.



**Table 2: Regression model for wind speed**

| **Modelo**            | **MAPE** | **MAE** | **RMSE** | **MSE** | **R2** |
|-----------------------|----------|---------|----------|---------|--------|
| K-NN                  | ...      | ...      | ...       | ...      | ...     |
| Linear Regression     | ...      | ...      | ...       | ...      | ...     |



## **Required libraries and modules**

In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (
    make_scorer,
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
    r2_score,
)
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

### **Assigned metric**

This is the metric suggested by the assignment statement; it is also mentioned in the dataset's documentation.

In [2]:
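def MRRMSE(y_real, y_pred):
    # Convert to arrays so DataFrames and ndarrays can be mixed safely
    diff = np.asarray(y_real) - np.asarray(y_pred)
    rowwise_rmse = np.sqrt(np.mean(diff ** 2, axis=1))  # RMSE of each row
    return np.mean(rowwise_rmse)  # average of the row-wise RMSE values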


The **MRRMSE** (Mean Row-wise Root Mean Squared Error) metric measures the average error between predicted and actual values row by row, which makes it suitable for evaluating predictions on data organized in rows, such as gene expression matrices. It first computes the **RMSE** of each row and then averages those values, so every observation contributes its own error. The **MRRMSE** should be as low as possible: a low value indicates that the predictions are close to the actual values in each row, that is, a good model fit.
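
As a small worked illustration of the definition above, for a target matrix with $R$ rows and $n$ columns the metric can be written as

$$\mathrm{MRRMSE} = \frac{1}{R}\sum_{i=1}^{R}\sqrt{\frac{1}{n}\sum_{j=1}^{n}\left(y_{ij}-\hat{y}_{ij}\right)^{2}}$$

The sketch below evaluates it on a made-up 2×2 example (the arrays `y_true` and `y_pred` are hypothetical, not taken from the dataset) and compares it with the RMSE computed over all entries at once, to show that the two quantities generally differ.

```python
import numpy as np

# Hypothetical toy values: only one entry (row 0, column 1) is mispredicted
y_true = np.array([[1.0, 2.0], [3.0, 4.0]])
y_pred = np.array([[1.0, 4.0], [3.0, 4.0]])

sq_err = (y_true - y_pred) ** 2                 # [[0, 4], [0, 0]]
rowwise_rmse = np.sqrt(sq_err.mean(axis=1))     # [sqrt(2), 0]
mrrmse = rowwise_rmse.mean()                    # sqrt(2) / 2, about 0.7071
global_rmse = np.sqrt(sq_err.mean())            # sqrt(4 / 4) = 1.0

print(f"MRRMSE: {mrrmse:.4f}")
print(f"Global RMSE: {global_rmse:.4f}")
```

Because the square root is taken per row before averaging, a single badly predicted row weighs less in MRRMSE than in the global RMSE.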

## **Data**

After a descriptive analysis of the original dataset (detailed in Exercise 2), an encoded and prepared dataset was obtained and stored in a CSV file.

In [3]:
de_train_final = pd.read_csv('C:/Users/kamac/OneDrive/Desktop/MachineLearningUN/train.csv')

In [4]:
de_train_final

Unnamed: 0,cell_type_B cells,cell_type_Myeloid cells,cell_type_NK cells,cell_type_T cells CD4+,cell_type_T cells CD8+,cell_type_T regulatory cells,sm_name_5-(9-Isopropyl-8-methyl-2-morpholino-9H-purin-6-yl)pyrimidin-2-amine,sm_name_ABT-199 (GDC-0199),sm_name_ABT737,sm_name_AMD-070 (hydrochloride),...,ZUP1,ZW10,ZWILCH,ZWINT,ZXDA,ZXDB,ZXDC,ZYG11B,ZYX,ZZEF1
0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-0.227781,-0.010752,-0.023881,0.674536,-0.453068,0.005164,-0.094959,0.034127,0.221377,0.368755
1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-0.494985,-0.303419,0.304955,-0.333905,-0.315516,-0.369626,-0.095079,0.704780,1.096702,-0.869887
2,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,-0.119422,-0.033608,-0.153123,0.183597,-0.555678,-1.494789,-0.213550,0.415768,0.078439,-0.259365
3,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.451679,0.704643,0.015468,-0.103868,0.865027,0.189114,0.224700,-0.048233,0.216139,-0.085024
4,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.758474,0.510762,0.607401,-0.123059,0.214366,0.487838,-0.819775,0.112365,-0.122193,0.676629
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
609,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,-0.549987,-2.200925,0.359806,1.073983,0.356939,-0.029603,-0.528817,0.105138,0.491015,-0.979951
610,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-1.236905,0.003854,-0.197569,-0.175307,0.101391,1.028394,0.034144,-0.231642,1.023994,-0.064760
611,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.077579,-1.101637,0.457201,0.535184,-0.198404,-0.005004,0.552810,-0.209077,0.389751,-0.337082
612,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.005951,-0.893093,-1.003029,-0.080367,-0.076604,0.024849,0.012862,-0.029684,0.005506,-1.733112


Note that the current dataset has **614** observations and **18363** columns.

Next, the DataFrame `de_train_final` is split into two parts: the predictor variables (stored in `X`) and the target variables (stored in `y`). Specifically, the first 152 columns are assigned to `X`, the predictors or features, while the columns from position 152 onward are assigned to `y`, the responses or target variables.

In [5]:
X = de_train_final.iloc[:,:152]
y = de_train_final.iloc[:,152:]
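
The positional split above depends on the features occupying exactly the first 152 columns. A possible alternative, sketched here on a tiny hypothetical DataFrame (`toy` and its values are invented for illustration), is to select the feature columns by the `cell_type_` and `sm_name_` prefixes visible in the column names, so the split does not break if the column order changes.

```python
import pandas as pd

# Hypothetical miniature of the encoded table: one-hot features plus gene columns
toy = pd.DataFrame({
    "cell_type_NK cells": [1.0, 0.0],
    "cell_type_T cells CD4+": [0.0, 1.0],
    "sm_name_Vorinostat": [0.0, 1.0],
    "A1BG": [0.10, 0.92],
    "ZZEF1": [0.37, -0.87],
})

# str.startswith accepts a tuple of prefixes
feature_prefixes = ("cell_type_", "sm_name_")
feature_cols = [c for c in toy.columns if c.startswith(feature_prefixes)]
target_cols = [c for c in toy.columns if c not in feature_cols]

X_toy = toy[feature_cols]
y_toy = toy[target_cols]
print(X_toy.shape, y_toy.shape)
```

This assumes that no gene name starts with either prefix, which would need to be checked on the real data.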

In [6]:
X

Unnamed: 0,cell_type_B cells,cell_type_Myeloid cells,cell_type_NK cells,cell_type_T cells CD4+,cell_type_T cells CD8+,cell_type_T regulatory cells,sm_name_5-(9-Isopropyl-8-methyl-2-morpholino-9H-purin-6-yl)pyrimidin-2-amine,sm_name_ABT-199 (GDC-0199),sm_name_ABT737,sm_name_AMD-070 (hydrochloride),...,sm_name_Tivozanib,sm_name_Topotecan,sm_name_Tosedostat,sm_name_Trametinib,sm_name_UNII-BXU45ZH6LI,sm_name_Vandetanib,sm_name_Vanoxerine,sm_name_Vardenafil,sm_name_Vorinostat,sm_name_YK 4-279
0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
609,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
610,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
611,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
612,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [7]:
y

Unnamed: 0,A1BG,A1BG-AS1,A2M,A2M-AS1,A2MP1,A4GALT,AAAS,AACS,AAGAB,AAK1,...,ZUP1,ZW10,ZWILCH,ZWINT,ZXDA,ZXDB,ZXDC,ZYG11B,ZYX,ZZEF1
0,0.104720,-0.077524,-1.625596,-0.144545,0.143555,0.073229,-0.016823,0.101717,-0.005153,1.043629,...,-0.227781,-0.010752,-0.023881,0.674536,-0.453068,0.005164,-0.094959,0.034127,0.221377,0.368755
1,0.915953,-0.884380,0.371834,-0.081677,-0.498266,0.203559,0.604656,0.498592,-0.317184,0.375550,...,-0.494985,-0.303419,0.304955,-0.333905,-0.315516,-0.369626,-0.095079,0.704780,1.096702,-0.869887
2,-0.387721,-0.305378,0.567777,0.303895,-0.022653,-0.480681,0.467144,-0.293205,-0.005098,0.214918,...,-0.119422,-0.033608,-0.153123,0.183597,-0.555678,-1.494789,-0.213550,0.415768,0.078439,-0.259365
3,0.232893,0.129029,0.336897,0.486946,0.767661,0.718590,-0.162145,0.157206,-3.654218,-0.212402,...,0.451679,0.704643,0.015468,-0.103868,0.865027,0.189114,0.224700,-0.048233,0.216139,-0.085024
4,4.290652,-0.063864,-0.017443,-0.541154,0.570982,2.022829,0.600011,1.231275,0.236739,0.338703,...,0.758474,0.510762,0.607401,-0.123059,0.214366,0.487838,-0.819775,0.112365,-0.122193,0.676629
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
609,-0.014372,-0.122464,-0.456366,-0.147894,-0.545382,-0.544709,0.282458,-0.431359,-0.364961,0.043123,...,-0.549987,-2.200925,0.359806,1.073983,0.356939,-0.029603,-0.528817,0.105138,0.491015,-0.979951
610,-0.455549,0.188181,0.595734,-0.100299,0.786192,0.090954,0.169523,0.428297,0.106553,0.435088,...,-1.236905,0.003854,-0.197569,-0.175307,0.101391,1.028394,0.034144,-0.231642,1.023994,-0.064760
611,0.338168,-0.109079,0.270182,-0.436586,-0.069476,-0.061539,0.002818,-0.027167,-0.383696,0.226289,...,0.077579,-1.101637,0.457201,0.535184,-0.198404,-0.005004,0.552810,-0.209077,0.389751,-0.337082
612,0.101138,-0.409724,-0.606292,-0.071300,-0.001789,-0.706087,-0.620919,-1.485381,0.059303,-0.032584,...,0.005951,-0.893093,-1.003029,-0.080367,-0.076604,0.024849,0.012862,-0.029684,0.005506,-1.733112


Before fitting the models, the dataset is split into training and test parts. With `train_test_split`, `X` and `y` are split into `X_train` and `X_test` (features) and `y_train` and `y_test` (targets). In addition, a custom scorer is created with `make_scorer` from the MRRMSE metric defined earlier, where `greater_is_better=False` indicates that lower values of the metric are better: the smaller the error, the better the model.

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)
scoring = make_scorer(MRRMSE, greater_is_better = False) # A lower MRRMSE value is better
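
One consequence of `greater_is_better=False` is worth illustrating: the scorer returns the metric with its sign flipped, so that scikit-learn can always maximize scores. The sketch below uses synthetic data and a local copy of the metric (`mrrmse`, `X_toy`, and `Y_toy` are names introduced here for illustration) to show that the scorer's output is the negative of the raw error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer


def mrrmse(y_true, y_pred):
    """Mean of the row-wise RMSE values."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return np.mean(np.sqrt(np.mean(diff ** 2, axis=1)))


# Synthetic data: 30 samples, 4 features, 5 targets
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(30, 4))
Y_toy = rng.normal(size=(30, 5))

model = LinearRegression().fit(X_toy, Y_toy)
scorer = make_scorer(mrrmse, greater_is_better=False)

raw_error = mrrmse(Y_toy, model.predict(X_toy))  # positive error
score = scorer(model, X_toy, Y_toy)              # same magnitude, negative sign
print(raw_error, score)
```

For the same reason, `best_score_` from a grid search that uses this scorer is a negated MRRMSE and should be multiplied by -1 before being reported as an error.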


### **KNN Regressor model**

A **KNN Regressor** is trained inside a **pipeline** that first scales the features. A hyperparameter search is run with **GridSearchCV** over the number of neighbors (`n_neighbors`), the distance metric (`metric`), the weighting scheme (`weights`), and the `p` parameter of the Minkowski metric, and candidates are compared with the **MRRMSE** metric. The first cell uses `StandardScaler()` and **KFold** with 5 splits; the second repeats the search with `MinMaxScaler()`, **KFold** with 10 splits, and a wider range of neighbors.

In [None]:
# Pipeline: standardize the features, then fit the KNN regressor
pipeline_kr = Pipeline([('scaler', StandardScaler()),('knr', KNeighborsRegressor())])

# Note: 'p' only affects the 'minkowski' metric, so some combinations are equivalent
param_grid_kr = {'knr__n_neighbors': range(1 , 10),
                 'knr__metric': ['euclidean','minkowski', 'manhattan'],
                 'knr__weights': ['uniform', 'distance'],
                 'knr__p': [1 , 2, 3]}

kfo = KFold(n_splits = 5, shuffle = True, random_state = 11)

grid_knr = GridSearchCV(pipeline_kr, param_grid_kr, cv = kfo, scoring = scoring)
grid_knr.fit(X_train, y_train)

y_pred_knr = grid_knr.predict(X_test)
mrrmse_knn = MRRMSE(y_test, y_pred_knr)

# best_score_ is the mean cross-validated score; the scorer negates MRRMSE, so flip the sign
print(f'Best parameters: {grid_knr.best_params_}')
print(f'Cross-validated MRRMSE on the training set: {-grid_knr.best_score_}')
print(f'MRRMSE on the test set: {mrrmse_knn}')
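
The grid above mixes `metric` and `p`, but in scikit-learn `p` is only used by the Minkowski metric, and Minkowski with `p=1` and `p=2` corresponds to the Manhattan and Euclidean distances. A compact grid restricted to `'minkowski'` should therefore cover the same distinct models with fewer fits. The sketch below only counts the candidates with `ParameterGrid`; `compact_grid` is a suggested variant, not part of the original notebook.

```python
from sklearn.model_selection import ParameterGrid

# Grid as used in the cell above
original_grid = {
    'knr__n_neighbors': list(range(1, 10)),
    'knr__metric': ['euclidean', 'minkowski', 'manhattan'],
    'knr__weights': ['uniform', 'distance'],
    'knr__p': [1, 2, 3],
}

# Suggested variant: minkowski with p in {1, 2, 3} already includes
# manhattan (p=1) and euclidean (p=2)
compact_grid = {
    'knr__n_neighbors': list(range(1, 10)),
    'knr__metric': ['minkowski'],
    'knr__weights': ['uniform', 'distance'],
    'knr__p': [1, 2, 3],
}

n_original = len(ParameterGrid(original_grid))  # 9 * 3 * 2 * 3
n_compact = len(ParameterGrid(compact_grid))    # 9 * 1 * 2 * 3
print(n_original, n_compact)
```

With 5-fold cross-validation this would reduce the search from 810 to 270 model fits.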

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Pipeline: rescale the features to [0, 1], then fit the KNN regressor
pipeline = Pipeline([
    ('scaler', MinMaxScaler()),
    ('knn', KNeighborsRegressor())
])

kf = KFold(n_splits=10, shuffle=True, random_state=10)

param_grid = {
    'knn__n_neighbors': list(range(1, 200)),
    'knn__metric': ['minkowski', 'euclidean', 'manhattan'],
    'knn__p': [1, 2, 3],
    'knn__weights': ['uniform', 'distance']
}

# Lower MRRMSE is better, so the scorer returns the negated metric
mrrmse_scorer = make_scorer(MRRMSE, greater_is_better=False)

gs_knn = GridSearchCV(pipeline, param_grid, cv=kf, scoring=mrrmse_scorer)
gs_knn.fit(X_train, y_train)

mejor_modelo = gs_knn.best_estimator_

y_pred_train = mejor_modelo.predict(X_train)
y_pred_test = mejor_modelo.predict(X_test)

mrrmse_test = MRRMSE(y_test, y_pred_test)

# R2 scores of the best model on each split
trains = mejor_modelo.score(X_train, y_train)
tests = mejor_modelo.score(X_test, y_test)

# best_score_ is the mean cross-validated (negated) MRRMSE, so flip the sign
print(f'Best parameters: {gs_knn.best_params_}')
print(f'Cross-validated MRRMSE on the training set: {-gs_knn.best_score_}')
print(f'MRRMSE on the test set: {mrrmse_test}')


### **Linear regression model**

A **linear regression** model is trained using a **pipeline** that standardizes the data with `StandardScaler()`.

In [12]:
pipeline_lr = Pipeline([('scaler', StandardScaler()), ('lr', LinearRegression())])

pipeline_lr.fit(X_train, y_train)

# Error on the data used for fitting
y_pred_train_lr = pipeline_lr.predict(X_train)
mrrmse_train_lr = MRRMSE(y_train, y_pred_train_lr)

# Error on the held-out test data
y_pred_lr = pipeline_lr.predict(X_test)
mrrmse_lr = MRRMSE(y_test, y_pred_lr)


print(f'MRRMSE on the training set: {mrrmse_train_lr}')
print(f'MRRMSE on the test set: {mrrmse_lr}')


MRRMSE on the training set: 1.0340113450250037
MRRMSE on the test set: 1.3655802559553298
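
`LinearRegression` accepts a two-dimensional target, so the pipeline above fits every gene column in a single call. As a rough illustration on synthetic data (`X_toy`, `Y_toy`, and `W` are made up here, not taken from the dataset), the multi-output fit appears to coincide with fitting each target column on its own, which suggests the model does not share information across genes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: 20 samples, 3 features, 2 targets
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(20, 3))
W = np.array([[1.0, -2.0], [0.5, 0.0], [0.0, 3.0]])
Y_toy = X_toy @ W + 0.1 * rng.normal(size=(20, 2))

# One call fits all targets; coef_ has shape (n_targets, n_features)
multi = LinearRegression().fit(X_toy, Y_toy)
print(multi.coef_.shape)

# Fitting the first target alone gives the same coefficients
single = LinearRegression().fit(X_toy, Y_toy[:, 0])
same_coefficients = np.allclose(multi.coef_[0], single.coef_)
print(same_coefficients)
```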


### **Table 2**

Finally, predictions are made with the **K-NN** and **linear regression** models and evaluated with several error metrics: **MAPE** (mean absolute percentage error), **MAE** (mean absolute error), **RMSE** (root mean squared error), **MSE** (mean squared error), and **R2** (coefficient of determination).

In [None]:
# Predict on the test set so both models can be compared on every metric
knr_predict= grid_knr.predict(X_test)
lr_predict = pipeline_lr.predict(X_test)

# Evaluate the predictions against the test targets with each metric
tabla2 = {
    'Model': ['K-NN', 'Linear Regression'],
    'MAPE': [ # Mean absolute percentage error
        mean_absolute_percentage_error(y_test, knr_predict),
        mean_absolute_percentage_error(y_test, lr_predict)
    ],
    'MAE': [ # Mean absolute error
        mean_absolute_error(y_test, knr_predict),
        mean_absolute_error(y_test, lr_predict)
    ],
    'RMSE': [ # Root mean squared error
        np.sqrt(mean_squared_error(y_test, knr_predict)),
        np.sqrt(mean_squared_error(y_test, lr_predict))
    ],
    'MSE': [ # Mean squared error
        mean_squared_error(y_test, knr_predict), 
        mean_squared_error(y_test, lr_predict)  
    ],
    'R2': [ # Coefficient of determination
        r2_score(y_test, knr_predict),
        r2_score(y_test, lr_predict)
    ]
}

cuadro2 = pd.DataFrame(tabla2)
cuadro2
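
The MAPE column may need to be read with some caution on this kind of data: the metric divides each absolute error by the absolute true value, and differential expression values are often close to zero. The sketch below uses two made-up observations (`y_true_toy` and `y_pred_toy` are hypothetical) with identical absolute errors to show how a near-zero target can dominate MAPE while MAE stays small. Note that scikit-learn returns MAPE as a fraction, not as a percentage.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

# Both predictions are off by 0.1, but the second true value is near zero
y_true_toy = np.array([1.0, 0.001])
y_pred_toy = np.array([1.1, 0.101])

mae_toy = mean_absolute_error(y_true_toy, y_pred_toy)               # 0.1
mape_toy = mean_absolute_percentage_error(y_true_toy, y_pred_toy)   # (0.1 + 100) / 2 = 50.05

print(f"MAE: {mae_toy:.3f}")
print(f"MAPE: {mape_toy:.2f}")
```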

Reviewing the metrics in the table, the error metrics MAPE, MAE, MSE, and RMSE are all lower for the K-NN model, indicating a better fit. The K-NN model also has a higher coefficient of determination, 76.95%, meaning it explains about 77% of the variability in the differential expression of the genes. In contrast, the linear regression model shows larger errors and explains only 20% of the variability in the data.