<a href="https://colab.research.google.com/github/japarra27/ML_Techniques/blob/main/notebooks/Labs/lab1/cardiovascularDiseasesClassificationColab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning Techniques - ISIS4219

**Second Semester - 2021**

In [2]:
%%capture
!pip install scikit-optimize
!pip install plotly-express
!pip install jupyter-dash
!pip install jyquickhelper

In [4]:
from jyquickhelper import add_notebook_menu
add_notebook_menu(menu_id="main_menu", last_level=6)

## Team members

*   Jaime Andrés Parra Mora - **202107161**
*   Member 2.
*   Member 3.

## **Problem**

<p style='text-align: justify;'>
Cardiovascular diseases are the leading cause of death worldwide, claiming an estimated 17.9 million lives each year (WHO). Coronary artery disease is the most common type of heart disease and is caused by blockages (plaque) that build up inside the coronary arteries (the blood vessels that supply the heart muscle). Cardiologists use a range of imaging techniques and invasive blood pressure measurements to examine and monitor the severity of these blockages. <br>
The most important behavioral risk factors for these diseases are an unhealthy diet, physical inactivity, tobacco use, and harmful use of alcohol. The effects of these risk factors can show up in individuals as raised blood pressure, raised blood glucose, raised blood lipids, and overweight and obesity.
Identifying the people at highest risk of cardiovascular disease and ensuring they receive appropriate treatment can prevent premature deaths. With this goal in mind, machine learning techniques are used to build a model that predicts which patients may be at risk of this type of heart disease.</p><br>

**References.** <br>
[WHO (n.d.). “Cardiovascular diseases”.](https://www.who.int/health-topics/cardiovascular-diseases#tab=tab_1)
<br><br>
**Data Source.** <br>
[Kaggle](https://www.kaggle.com/agsam23/coronary-artery-disease/version/3)
<br>

### Data exploration and description

<p style='text-align: justify;'>
To better understand the information available for the analysis, the data is examined in the following stages:
</p>

* Data dictionary
* Dataset description

#### Data dictionary

<img src="https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F4289149%2Fa68afe56139dfa7bc4a45f35192b7a56%2Fcoronary.JPG?generation=1603132237334314&alt=media" >

In [5]:
# variable definitions (code -> label, following the data dictionary)

var_sex = {0: "female", 1: "male"}

# Labels follow the usual UCI Heart Disease coding for chest pain type
var_cp = {
    1: "typical angina",
    2: "atypical angina",
    3: "non-anginal pain",
    4: "asymptomatic",
}

var_fbs = {0: "<120mg/dL", 1: ">120mg/dL"}

var_restecg = {
    0: "normal",
    1: "having ST-T wave abnormal",
    2: "left ventricular hypertrophy",
}

var_exang = {0: "no", 1: "yes"}

var_slope = {1: "upsloping", 2: "flat", 3: "downsloping"}

var_thal = {3: "normal", 6: "fixed", 7: "reversible defect"}
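
The dictionaries above map numeric codes to readable labels. As a minimal sketch of how such a mapping can be applied, the example below decodes two columns with `Series.map`. The three-row `sample` frame is hypothetical and only reuses the column names and codes from the data dictionary.

```python
import pandas as pd

# Hypothetical three-row sample; the codes follow the data dictionary above.
sample = pd.DataFrame({"sex": [1, 0, 1], "exang": [0, 1, 0]})

var_sex = {0: "female", 1: "male"}
var_exang = {0: "no", 1: "yes"}

# Series.map replaces each code with its label; unmapped codes become NaN.
decoded = sample.assign(
    sex=sample["sex"].map(var_sex),
    exang=sample["exang"].map(var_exang),
)
print(decoded)
```

Keeping the codes in the DataFrame and decoding only for display leaves the numeric columns available for modeling.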

#### Data understanding and analysis

In [6]:
# Import libraries
import os
import warnings
import numpy as np
import pandas as pd
import plotly.express as px

# Data processing libraries
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.model_selection import train_test_split

# Libraries for model training and validation
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Libraries for hyperparameter optimization (Bayesian approach)
from skopt import gp_minimize
from skopt import BayesSearchCV
from skopt.utils import use_named_args
from skopt.space import Real, Integer, Categorical

# Libraries for model evaluation
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

In [7]:
# base directory configuration
#BASE_PATH = '/mnt/f/Projects/ML_Techniques'
#os.chdir(BASE_PATH)

# additional pandas configuration
pd.options.plotting.backend = "plotly"
pd.options.display.max_columns = 100

# warnings configuration
warnings.filterwarnings('ignore')

##### Loading and understanding the dataset

###### Loading the base file

In [8]:
df = pd.read_csv('data/Labs/lab1/data.csv')

###### Data verification

In [9]:
# dataset shape
df.shape

(303, 15)

In [10]:
# first rows
df.head()

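| | Unnamed: 0 | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 63.0 | 1.0 | 1.0 | 145.0 | 233.0 | 1.0 | 2.0 | 150.0 | 0.0 | 2.3 | 3.0 | 0.0 | 6.0 | 0 |
| 1 | 1 | 67.0 | 1.0 | 4.0 | 160.0 | 286.0 | 0.0 | 2.0 | 108.0 | 1.0 | 1.5 | 2.0 | 3.0 | 3.0 | 2 |
| 2 | 2 | 67.0 | 1.0 | 4.0 | 120.0 | 229.0 | 0.0 | 2.0 | 129.0 | 1.0 | 2.6 | 2.0 | 2.0 | 7.0 | 1 |
| 3 | 3 | 37.0 | 1.0 | 3.0 | 130.0 | 250.0 | 0.0 | 0.0 | 187.0 | 0.0 | 3.5 | 3.0 | 0.0 | 3.0 | 0 |
| 4 | 4 | 41.0 | 0.0 | 2.0 | 130.0 | 204.0 | 0.0 | 2.0 | 172.0 | 0.0 | 1.4 | 1.0 | 0.0 | 3.0 | 0 |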


In [11]:
# drop unneeded columns
df = df.drop('Unnamed: 0', axis=1)

In [12]:
# check data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    float64
 1   sex       303 non-null    float64
 2   cp        303 non-null    float64
 3   trestbps  303 non-null    float64
 4   chol      303 non-null    float64
 5   fbs       303 non-null    float64
 6   restecg   303 non-null    float64
 7   thalach   303 non-null    float64
 8   exang     303 non-null    float64
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    float64
 11  ca        303 non-null    object 
 12  thal      303 non-null    object 
 13  class     303 non-null    int64  
dtypes: float64(11), int64(1), object(2)
memory usage: 33.3+ KB


In [13]:
# dataset description
df.describe()

| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 |
| mean | 54.438944 | 0.679868 | 3.158416 | 131.689769 | 246.693069 | 0.148515 | 0.990099 | 149.607261 | 0.326733 | 1.039604 | 1.60066 | 0.937294 |
| std | 9.038662 | 0.467299 | 0.960126 | 17.599748 | 51.776918 | 0.356198 | 0.994971 | 22.875003 | 0.469794 | 1.161075 | 0.616226 | 1.228536 |
| min | 29.0 | 0.0 | 1.0 | 94.0 | 126.0 | 0.0 | 0.0 | 71.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 25% | 48.0 | 0.0 | 3.0 | 120.0 | 211.0 | 0.0 | 0.0 | 133.5 | 0.0 | 0.0 | 1.0 | 0.0 |
| 50% | 56.0 | 1.0 | 3.0 | 130.0 | 241.0 | 0.0 | 1.0 | 153.0 | 0.0 | 0.8 | 2.0 | 0.0 |
| 75% | 61.0 | 1.0 | 4.0 | 140.0 | 275.0 | 0.0 | 2.0 | 166.0 | 1.0 | 1.6 | 2.0 | 2.0 |
| max | 77.0 | 1.0 | 4.0 | 200.0 | 564.0 | 1.0 | 2.0 | 202.0 | 1.0 | 6.2 | 3.0 | 4.0 |


###### Initial validation analysis

Based on the initial metrics, the data contains no null values. However:

* The columns `ca` and `thal` have dtype object, but according to the data dictionary they should be int.
* Several columns that should be int are read as float.

For this reason, the data types are converted and the presence of null values or errors is checked again.
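
Before coercing an object column to numeric, it can help to see which entries fail to parse. The following is a minimal sketch on a hypothetical column; the values and the `?` placeholder token are assumptions chosen for illustration, not rows from the dataset.

```python
import pandas as pd

# Hypothetical column: numeric strings plus a non-numeric placeholder token.
raw = pd.Series(["0.0", "3.0", "?", "2.0"], name="ca")

# Coerce to numeric; anything that cannot be parsed becomes NaN.
parsed = pd.to_numeric(raw, errors="coerce")

# Entries that were present in the raw data but failed to parse.
unparseable = raw[parsed.isna() & raw.notna()]
print(unparseable.tolist())

# After coercion the column is float64 and the bad entries are missing values.
print(parsed.dtype, int(parsed.isna().sum()))
```

Listing the unparseable tokens first makes it explicit what `errors='coerce'` is about to turn into missing values.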

In [14]:
# cast values to numeric (unparseable entries become NaN)
df.thal = pd.to_numeric(df.thal, errors='coerce')
df.ca = pd.to_numeric(df.ca, errors='coerce')

In [15]:
# check dtypes and dataset size
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    float64
 1   sex       303 non-null    float64
 2   cp        303 non-null    float64
 3   trestbps  303 non-null    float64
 4   chol      303 non-null    float64
 5   fbs       303 non-null    float64
 6   restecg   303 non-null    float64
 7   thalach   303 non-null    float64
 8   exang     303 non-null    float64
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    float64
 11  ca        299 non-null    float64
 12  thal      301 non-null    float64
 13  class     303 non-null    int64  
dtypes: float64(13), int64(1)
memory usage: 33.3 KB


In [16]:
# drop rows with missing values
df = df.dropna()

In [17]:
# Columns to convert, grouped by target dtype
change_to_int = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'slope', 'ca', 'thal', 'class']
change_to_cat = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'ca', 'slope', 'thal']
change_to_float = ['oldpeak']

In [18]:
df[change_to_int] = df[change_to_int].astype("int16", errors='ignore')
df[change_to_cat] = df[change_to_cat].astype('category', errors='ignore')
df[change_to_float] = df[change_to_float].astype('float16', errors='ignore')

In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 297 entries, 0 to 301
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   age       297 non-null    int16   
 1   sex       297 non-null    category
 2   cp        297 non-null    category
 3   trestbps  297 non-null    int16   
 4   chol      297 non-null    int16   
 5   fbs       297 non-null    category
 6   restecg   297 non-null    category
 7   thalach   297 non-null    int16   
 8   exang     297 non-null    category
 9   oldpeak   297 non-null    float16 
 10  slope     297 non-null    category
 11  ca        297 non-null    category
 12  thal      297 non-null    category
 13  class     297 non-null    int16   
dtypes: category(8), float16(1), int16(5)
memory usage: 9.1 KB


The previous steps show that:

* The columns `ca` and `thal` were of type object because they contained incomplete entries, marked as `?`. Coercing them to numeric turns those entries into NA; since both columns are categorical, the affected rows are dropped instead of imputed.
* The data types are changed to reduce the dataset's size. According to `df.info()`, memory usage falls from 33.3 KB to 9.1 KB, roughly 73% (a small part of that comes from the six dropped rows). This step is taken because the initial float64 representations are wider than these values require.
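
The effect of narrower dtypes on memory can be checked directly with `DataFrame.memory_usage`. The sketch below uses a synthetic frame (random values, assumed column names borrowed from the notebook), so its numbers are illustrative and will not match the figures reported by `df.info()` above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 300  # roughly the size of this dataset; the values are synthetic

# Everything stored as float64, as pandas does after reading the CSV.
wide = pd.DataFrame({
    "age": rng.integers(29, 78, n).astype("float64"),
    "sex": rng.integers(0, 2, n).astype("float64"),
    "oldpeak": rng.uniform(0, 6.2, n),
})

# Narrower representations: small integers, a category, and float32.
compact = wide.astype({"age": "int16", "sex": "int16", "oldpeak": "float32"})
compact["sex"] = compact["sex"].astype("category")

before = wide.memory_usage(index=False).sum()
after = compact.memory_usage(index=False).sum()
print(f"{before} -> {after} bytes ({1 - after / before:.0%} smaller)")
```

This sketch uses `float32` instead of the notebook's `float16`: half precision keeps only about three significant decimal digits, which may or may not matter for a column like `oldpeak`.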

###### EDA

In [20]:
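# Count patients per category of a variable and per target class
def gen_class_plot(df: pd.DataFrame, var: str, var_type: dict) -> pd.DataFrame:
    # Copy to avoid modifying a slice of the original DataFrame
    df_var = df[[var, "class"]].copy()
    # Replace numeric codes with their labels from the data dictionary
    df_var[var] = df_var[var].map(var_type)
    # Count rows per (category, class); the "class" column now holds the count
    df_var = df_var.groupby([var, "class"]).agg({"class": "count"}).reset_index(level=0)
    # The remaining index is the original class; expose it as "idx_class"
    df_var.index.name = "idx_class"
    df_var = df_var.reset_index()
    return df_var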
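
The same kind of per-category, per-class count can also be obtained with `pd.crosstab`. This is an alternative sketch, not the notebook's helper; the six-row `toy` frame is hypothetical and only reuses the column names.

```python
import pandas as pd

# Hypothetical sample with the same column names as the notebook's DataFrame.
toy = pd.DataFrame({
    "sex": [1, 1, 0, 0, 1, 0],
    "class": [0, 2, 0, 0, 1, 0],
})
var_sex = {0: "female", 1: "male"}

# Rows: decoded category, columns: target class, cells: patient counts.
counts = pd.crosstab(toy["sex"].map(var_sex), toy["class"])
print(counts)

# Long format (one row per category/class pair), convenient for plotting.
long_counts = counts.reset_index().melt(
    id_vars="sex", var_name="class", value_name="patients"
)
print(long_counts)
```

The wide table is easy to read at a glance, while the long table has one column per plot dimension (category, class, count).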

In [21]:
# Age distribution
fig = px.histogram(df.sort_values('class', ascending=True), x="age", color="class", opacity=0.7, histnorm='probability')
fig.update_layout(title='Distribution - Age by class', barmode='stack')
fig.show()

In [22]:
# Resting blood pressure
fig = px.histogram(df.sort_values('class', ascending=True), x="trestbps", color="class", opacity=0.7, histnorm='probability')
fig.update_layout(title='Distribution - Resting blood pressure by class', barmode='stack')
fig.show()

In [23]:
# Serum cholesterol
fig = px.histogram(df.sort_values('class', ascending=True), x="chol", color="class", opacity=0.7, histnorm='probability')
fig.update_layout(title='Distribution of serum cholesterol in patients by class', barmode='stack')
fig.show()

In [24]:
# Maximum heart rate achieved
fig = px.histogram(df.sort_values('class', ascending=True), x="thalach", color="class", opacity=0.7, histnorm='probability')
fig.update_layout(title='Distribution of maximum heart rate in patients by class', barmode='stack')
fig.show()

In [25]:
# ST depression induced by exercise (oldpeak)
fig = px.histogram(df.sort_values('class', ascending=True), x="oldpeak", color="class", opacity=0.7, histnorm='probability')
fig.update_layout(title='Distribution of exercise-induced ST depression by class', barmode='stack')
fig.show()

In [26]:
# Patient sex by class
df_sex = gen_class_plot(df, var='sex', var_type=var_sex).sort_values('class',ascending=False)
fig = px.bar(df_sex, x='sex', y='class', color='idx_class', title='Patient sex by class')
fig.show()

In [27]:
# Chest pain types
df_cp = gen_class_plot(df, var='cp', var_type=var_cp)
fig = px.bar(df_cp, x='class', y='cp', color='idx_class', title='Patients by chest pain type and class')
fig.show()

In [28]:
# Fasting blood sugar
df_fbs = gen_class_plot(df, var='fbs', var_type=var_fbs)
fig = px.bar(df_fbs, x='class', y='fbs', color='idx_class', title='Fasting blood sugar in patients by class')
fig.show()

In [29]:
# Resting electrocardiographic results
df_restecg = gen_class_plot(df, var='restecg', var_type=var_restecg).sort_values('class',ascending=True)
fig = px.bar(df_restecg, x='class', y='restecg', color='idx_class', title='Resting electrocardiographic results by class')
fig.show()

In [30]:
# Exercise-induced angina
df_exang = gen_class_plot(df, var='exang', var_type=var_exang).sort_values('class',ascending=True)
fig = px.bar(df_exang, x='class', y='exang', color='idx_class', title='Exercise-induced angina by class')
fig.show()

In [31]:
# Slope of the peak exercise ST segment
df_slope = gen_class_plot(df, var='slope', var_type=var_slope).sort_values('slope',ascending=False)
fig = px.bar(df_slope, x='class', y='slope', color='idx_class', title='ST segment slope in patients by class')
fig.show()

In [32]:
# Defect type (thal)
df_thal = gen_class_plot(df, var='thal', var_type=var_thal).sort_values('class',ascending=False)
fig = px.bar(df_thal, x='class', y='thal', color='idx_class', title='Defect type in patients by class')
fig.show()

In [33]:
# Sunburst to understand chest pain types by sex and class
fig = px.sunburst(data_frame = df.replace({'sex':var_sex, 'cp':var_cp}),
                 path = [ 'sex','class','cp'],
                 color = 'class',
                 maxdepth = -1,
                 title = 'Sunburst chart: sex > class > chest pain type')
fig.update_traces(textinfo = 'label+percent parent')
fig.update_layout(margin=dict(t=0, l=0, r=0, b=0))
fig.show()

In [34]:
# Correlation plot to understand possible relationships
fig = px.imshow(df.corr(), title='Correlation matrix of the variables')
fig.update_xaxes(side="top")
fig.show()

###### EDA analysis

### Data preparation

This section prepares the data that will be used to train and validate the model.

In [35]:
# rename the target variable to `label` and separate it from the features
X = df.copy(deep=True)
X = X.rename(columns={'class': 'label'})
Y = X.pop('label')

In [36]:
# KBinsDiscretizer to group age into 7 ordinal bins (k-means strategy)
kbins = KBinsDiscretizer(n_bins=7,encode='ordinal',strategy='kmeans')
age = pd.DataFrame(X.pop('age'))
X['age'] = kbins.fit_transform(age).astype('int32')
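
To illustrate what the k-means binning strategy does, here is a minimal sketch on a hypothetical set of nine ages with three bins (the notebook uses seven bins on the real `age` column). The bin edges are learned from the data instead of being evenly spaced.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical ages forming three loose groups.
ages = np.array([[29], [31], [34], [50], [52], [55], [70], [74], [77]])

binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="kmeans")
codes = binner.fit_transform(ages).astype(int).ravel()

print(codes)                 # ordinal bin index per patient
print(binner.bin_edges_[0])  # learned boundaries between age groups
```

With `encode="ordinal"`, each age is replaced by the index of its bin, so the order between age groups is preserved.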

In [37]:
# Standardize continuous variables
standardScaler = StandardScaler()
cols_to_standarize = X.select_dtypes(['int16', 'float16']).columns
X[cols_to_standarize] = standardScaler.fit_transform(X[cols_to_standarize])

# One-hot encode categorical variables
X = pd.get_dummies(X, columns=['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal'] )
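
The cells above fit the scaler on the full dataset before the train/test split. A possible alternative, sketched below under stated assumptions, is to fit the preprocessing on the training rows only and then apply it to the test rows. The `toy` frame is synthetic, and the `ColumnTransformer` layout is an illustration, not the notebook's pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the cleaned DataFrame (column names match the notebook).
rng = np.random.default_rng(42)
n = 100
toy = pd.DataFrame({
    "chol": rng.normal(246, 50, n),
    "thalach": rng.normal(150, 23, n),
    "cp": rng.integers(1, 5, n),
    "label": rng.integers(0, 2, n),
})
features, target = toy.drop(columns="label"), toy["label"]

X_tr, X_te, y_tr, y_te = train_test_split(
    features, target, test_size=0.2, random_state=42, stratify=target
)

preprocess = ColumnTransformer([
    ("scale", StandardScaler(), ["chol", "thalach"]),
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["cp"]),
])

# Statistics (means, variances, categories) come from the training rows only.
X_tr_prepared = preprocess.fit_transform(X_tr)
X_te_prepared = preprocess.transform(X_te)
print(X_tr_prepared.shape, X_te_prepared.shape)
```

Fitting on the training rows only keeps test-set statistics out of the features the model is trained on, which tends to give a more honest test score.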

In [38]:
X.describe()

Unnamed: 0,trestbps,chol,thalach,oldpeak,age,sex_0,sex_1,cp_1,cp_2,cp_3,cp_4,fbs_0,fbs_1,restecg_0,restecg_1,restecg_2,exang_0,exang_1,slope_1,slope_2,slope_3,ca_0,ca_1,ca_2,ca_3,thal_3,thal_6,thal_7
count,297.0,297.0,297.0,297.0,297.0,297.0,297.0,297.0,297.0,297.0,297.0,297.0,297.0,297.0,297.0,297.0,297.0,297.0,297.0,297.0,297.0,297.0,297.0,297.0,297.0,297.0,297.0,297.0
mean,-1.814982e-08,-4.462195e-09,-4.3399e-09,1.708365e-08,3.259259,0.323232,0.676768,0.077441,0.164983,0.279461,0.478114,0.855219,0.144781,0.494949,0.013468,0.491582,0.673401,0.326599,0.468013,0.461279,0.070707,0.585859,0.218855,0.127946,0.06734,0.552189,0.060606,0.387205
std,1.001688,1.001688,1.001688,1.001688,1.448481,0.4685,0.4685,0.267741,0.371792,0.449492,0.500364,0.352474,0.352474,0.500818,0.115462,0.500773,0.469761,0.469761,0.499818,0.49934,0.256768,0.493404,0.414168,0.334594,0.251033,0.498108,0.239009,0.487933
min,-2.125634,-2.337704,-3.431849,-0.9067173,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,-0.6594306,-0.7002541,-0.7247694,-0.9067173,2.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,-0.09550636,-0.08380217,0.1484822,-0.2196856,3.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
75%,0.4684179,0.5519139,0.7160957,0.4673461,4.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0
max,3.851964,6.099981,2.287949,4.418408,6.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [39]:
X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 297 entries, 0 to 301
Data columns (total 28 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   trestbps   297 non-null    float64
 1   chol       297 non-null    float64
 2   thalach    297 non-null    float64
 3   oldpeak    297 non-null    float64
 4   age        297 non-null    int32  
 5   sex_0      297 non-null    uint8  
 6   sex_1      297 non-null    uint8  
 7   cp_1       297 non-null    uint8  
 8   cp_2       297 non-null    uint8  
 9   cp_3       297 non-null    uint8  
 10  cp_4       297 non-null    uint8  
 11  fbs_0      297 non-null    uint8  
 12  fbs_1      297 non-null    uint8  
 13  restecg_0  297 non-null    uint8  
 14  restecg_1  297 non-null    uint8  
 15  restecg_2  297 non-null    uint8  
 16  exang_0    297 non-null    uint8  
 17  exang_1    297 non-null    uint8  
 18  slope_1    297 non-null    uint8  
 19  slope_2    297 non-null    uint8  
 20  slope_3   

In [40]:
X.head()

Unnamed: 0,trestbps,chol,thalach,oldpeak,age,sex_0,sex_1,cp_1,cp_2,cp_3,cp_4,fbs_0,fbs_1,restecg_0,restecg_1,restecg_2,exang_0,exang_1,slope_1,slope_2,slope_3,ca_0,ca_1,ca_2,ca_3,thal_3,thal_6,thal_7
0,0.75038,-0.276443,0.017494,1.069652,5,0,1,1,0,0,0,0,1,0,0,1,1,0,0,0,1,1,0,0,0,0,1,0
1,1.596266,0.744555,-1.816334,0.381782,5,0,1,0,0,0,1,1,0,0,0,1,0,1,0,1,0,0,0,0,1,1,0,0
2,-0.659431,-0.3535,-0.89942,1.326346,5,0,1,0,0,0,1,1,0,0,0,1,0,1,0,1,0,0,0,1,0,0,0,1
3,-0.095506,0.051047,1.63301,2.099781,0,0,1,0,0,1,0,1,0,1,0,0,1,0,0,0,1,1,0,0,0,1,0,0
4,-0.095506,-0.835103,0.978071,0.296217,1,1,0,0,1,0,0,1,0,0,0,1,1,0,1,0,0,1,0,0,0,1,0,0


### Modeling

#### Training and validation dataset setup

In [41]:
# Train/test split for the training phase (80/20, stratified by class)
X_train, X_test, y_train, y_test = train_test_split(X.values, Y.values, test_size=0.20, random_state=42, stratify=Y)

In [42]:
print(f"the shape of X_train: {X_train.shape}")
print(f"the shape of X_test: {X_test.shape}")
print(f"the shape of y_train: {y_train.shape}")
print(f"the shape of y_test: {y_test.shape}")

the shape of X_train: (237, 28)
the shape of X_test: (60, 28)
the shape of y_train: (237,)
the shape of y_test: (60,)
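
The split above passes `stratify=Y`, which keeps the class proportions similar in both subsets. A minimal sketch of that effect on synthetic, imbalanced labels (not the real target column):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 80 zeros, 15 ones, 5 twos (not the real data).
labels = np.array([0] * 80 + [1] * 15 + [2] * 5)
features = np.arange(len(labels)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

# With stratify, each class keeps (approximately) its share in both subsets.
print(np.bincount(y_tr) / len(y_tr))
print(np.bincount(y_te) / len(y_te))
```

This matters most for the rare classes: without stratification, a class with only a handful of samples could end up missing from the test set.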


#### Hyperparameter configuration for model training

In [43]:
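# Hyperparameter search spaces to optimize
# Note: the option names below match the scikit-learn release used in 2021;
# later releases renamed or removed some of them.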
ada_search = {
    'model': [AdaBoostClassifier()],
    'model__learning_rate': Real(0.0005, 0.9, prior="log-uniform"),
    'model__n_estimators': Integer(1, 1000),
    'model__algorithm': Categorical(['SAMME', 'SAMME.R'])
}

gb_search = {
    'model': [GradientBoostingClassifier()],
    'model__learning_rate': Real(0.0005, 0.9, prior="log-uniform"),
    'model__n_estimators': Integer(1, 1000),
    'model__criterion': Categorical(['friedman_mse', 'mse', 'mae'])
}
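
Both search spaces use `prior="log-uniform"` for the learning rate. As a rough illustration of what that prior means (values are drawn uniformly in log space), the standard-library sketch below samples from the same range. `sample_log_uniform` is a hypothetical helper written for this example, not part of scikit-optimize.

```python
import math
import random

def sample_log_uniform(low: float, high: float, rng: random.Random) -> float:
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(42)
draws = [sample_log_uniform(0.0005, 0.9, rng) for _ in range(10_000)]

# Small learning rates are sampled far more often than under a plain
# uniform prior, where values below 0.01 would be about 1% of the draws.
small = sum(d < 0.01 for d in draws) / len(draws)
print(f"share of draws below 0.01: {small:.2f}")
```

A log-uniform prior is a common choice for parameters such as learning rates, where the order of magnitude matters more than the absolute value.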

#### Model training

In [None]:
# Pipeline definition (the 'model' step is a placeholder replaced by each search space)
pipe = Pipeline([
                 ('model', GradientBoostingClassifier())
])

# Bayesian search definition, with 5-fold cross-validation (cv=5)
opt = BayesSearchCV(
    pipe,
    [(ada_search, 100), (gb_search, 100)],
    cv=5,
    random_state=42
)

# Model training
opt.fit(X_train, y_train)

### Evaluation and interpretation of results

In [None]:
print(f"validation score: {opt.best_score_}")
print(f"test score: {opt.score(X_test, y_test)}")
print(f"best parameters: {str(opt.best_params_)}")

validation score: 0.6118143459915611
test score: 0.6
best parameters: OrderedDict([('model', AdaBoostClassifier(algorithm='SAMME', learning_rate=0.06036798208145149,
                   n_estimators=1000)), ('model__algorithm', 'SAMME'), ('model__learning_rate', 0.06036798208145149), ('model__n_estimators', 1000)])


In [None]:
Y_pred_opt = opt.predict(X_test)
Y_pred_opt.shape

(60,)

In [None]:
score_opt = round(accuracy_score(y_test, Y_pred_opt)*100,2)

print(f"The accuracy obtained by the model is: {score_opt}%")

The accuracy obtained by the model is: 60.0%


In [None]:
Y_train_opt = opt.predict(X_train)
Y_train_opt.shape
print(classification_report(y_train,Y_train_opt))

              precision    recall  f1-score   support

           0       0.83      0.93      0.88       128
           1       0.35      0.42      0.38        43
           2       0.50      0.32      0.39        28
           3       0.35      0.21      0.27        28
           4       0.43      0.30      0.35        10

    accuracy                           0.65       237
   macro avg       0.49      0.44      0.45       237
weighted avg       0.63      0.65      0.64       237



In [None]:
print(classification_report(y_test,Y_pred_opt))

              precision    recall  f1-score   support

           0       0.86      0.94      0.90        32
           1       0.33      0.45      0.38        11
           2       0.00      0.00      0.00         7
           3       0.17      0.14      0.15         7
           4       0.00      0.00      0.00         3

    accuracy                           0.60        60
   macro avg       0.27      0.31      0.29        60
weighted avg       0.54      0.60      0.57        60
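
To put the test accuracy in context, it can be compared with a classifier that always predicts the most frequent class. The sketch below only reuses the `support` column of the test report above; it does not retrain or re-evaluate any model.

```python
# Test-set class counts, taken from the support column of the test report.
support = {0: 32, 1: 11, 2: 7, 3: 7, 4: 3}

total = sum(support.values())
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / total

print(f"majority class: {majority_class}")
print(f"baseline accuracy: {baseline_accuracy:.3f}")
```

Because class 0 dominates the test set, accuracy alone says little about the minority classes; the macro-averaged scores in the report are more sensitive to them.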



In [None]:
confusion_matrix(y_test, Y_pred_opt)

array([[30,  1,  0,  1,  0],
       [ 5,  5,  1,  0,  0],
       [ 0,  5,  0,  2,  0],
       [ 0,  3,  3,  1,  0],
       [ 0,  1,  0,  2,  0]])
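
The per-class precision and recall in the test report can be recomputed from this confusion matrix. The following sketch copies the matrix as printed and uses plain Python; dividing by an empty row or column is treated as 0 here, which is an assumption of this example.

```python
# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class).
cm = [
    [30, 1, 0, 1, 0],
    [5, 5, 1, 0, 0],
    [0, 5, 0, 2, 0],
    [0, 3, 3, 1, 0],
    [0, 1, 0, 2, 0],
]

n_classes = len(cm)
accuracy = sum(cm[i][i] for i in range(n_classes)) / sum(map(sum, cm))

recalls, precisions = [], []
for k in range(n_classes):
    true_total = sum(cm[k])                 # row sum: samples of class k
    pred_total = sum(row[k] for row in cm)  # column sum: predictions of class k
    recalls.append(cm[k][k] / true_total if true_total else 0.0)
    precisions.append(cm[k][k] / pred_total if pred_total else 0.0)
    print(f"class {k}: precision={precisions[k]:.2f} recall={recalls[k]:.2f}")

print(f"accuracy={accuracy:.2f}")
```

Reading along a row shows where the samples of a given class end up: for instance, five of the seven class-2 samples are predicted as class 1 and none as class 2.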

In [None]:
len(y_train[y_train == 0])

128

In [None]:
len(y_test[y_test == 0])

32