# Integrative Project


## Tecnológico de Monterrey
### Master's in Applied Artificial Intelligence (MNA)
#### Progress Report 4
#### Team 7


* Jorge Arturo Federico Rivera – A01250724
* Marco Antonio Vázquez Morales – A01793704
* Alejandro Jesús Vázquez Navarro – A01793146

## Alternative Models


### Table of Contents
1. [Preparation](##Preparación)
2. [Comparison](##Comparativa)
3. [Fine-Tuning](##AjusteFino)

Project:

*Classifier model for maternal multimorbidity and predictor of perinatal outcomes based on clinical metabolic, genetic, and nutritional data from Mexican women*

May 23, 2024

# 1. Preparation

# **1.1 Loading Libraries**

In [8]:
#%pip install imbalanced-learn
#%pip install scikit-learn tensorflow
#%pip install scikeras
#%pip install --upgrade scikit-learn
#%pip uninstall scikit-learn imbalanced-learn
#%pip install scikit-learn==1.0.2 imbalanced-learn==0.8.1

import pandas as pd
import time
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from imblearn.over_sampling import SMOTE


# **1.2 Loading the Data**


In [None]:
file_path = r'data/data_limpia.csv'

# Drop columns we will not need, such as identifiers and the cluster number

cols_to_remove = ['id_gdg', 'origen_px', 'IndexMorbilidad', 'anticonceptivo_0.0',
       'anticonceptivo_0.6316526610644257', 'anticonceptivo_1.0',
       'anticonceptivo_2.0', 'cluster']

data = pd.read_csv(file_path, sep=";", encoding='utf-8', index_col=False)
data  = data.drop(cols_to_remove , axis=1)

data.head()

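| Unnamed: 0 | EscalaRiesgo | biopsias | obs_h | ichos_pregesta | hta_pregesta | sop | hipotiroidismo | hipertiroidismo | consumo_alcohol | consumo_tabaco | ... | sdg_parto | ant_aborto | macrosomia_rn_0 | macrosomia_rn_0.0 | macrosomia_rn_0/0 | macrosomia_rn_1 | macrosomia_rn_1.0 | macrosomia_rn_1/0 | macrosomia_rn_1/1 | macrosomia_rn_2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0.0 | -0.747297 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | C | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0.0 | -0.747297 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | C | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0.0 | -0.747297 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | B | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0.0 | -0.747297 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | C | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0.0 | 1.420203 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |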


In [None]:
# Separate the features (X) from the target column (y)
X = data.drop('EscalaRiesgo', axis=1)
y = data['EscalaRiesgo']



# **1.3 Class Balancing with SMOTE**


This technique is recommended when classes are imbalanced. Our problem has three classes:

- A: high risk (3 or 4 multimorbidity factors present)
- B: medium risk (1 or 2 multimorbidity factors present)
- C: no risk (no multimorbidity factors present)

In health applications it is important to train on balanced classes, which offers the following benefits:

- Improved model accuracy. Detecting class **A**, which carries a high morbidity risk, is crucial for the model to deliver value to medical staff.

- Equity in medical care. We must ensure that the model is fair and equitable, especially in health matters. Class imbalance can lead to bias and misleading readings.

- Generalization. A model trained on balanced classes tends to generalize better to new data, which is relevant in the human health domain.
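
To make the balancing step concrete, here is a minimal pure-Python sketch of the interpolation idea behind SMOTE: each synthetic sample lies on the segment between a minority-class point and one of its nearest minority-class neighbors. It is not the `imblearn` implementation used in this notebook. The function name `smote_like_oversample`, the toy points, and `k=2` are assumptions chosen for illustration.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between minority samples.

    Simplified illustration of the SMOTE idea (hypothetical helper, not imblearn).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point (excluding itself)
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        lam = rng.random()  # interpolation factor in [0, 1)
        # New point on the segment between the base point and its neighbor
        synthetic.append(
            tuple(b + lam * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic


# Toy minority class with 4 points in 2-D (made-up values)
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like_oversample(minority_points, n_new=6)

# Total number of minority samples after oversampling
print(len(minority_points) + len(new_points))
```

Because every synthetic point is a convex combination of two existing minority points, oversampling adds no information beyond what the original minority samples already contain. With only 9 class A samples in the training split, the balanced class A set is built from those few cases.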

# 2. Comparison

Evaluation with scikit-learn

In [None]:
def train_and_evaluate_algorithms(X, y):
    # Split the data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Apply SMOTE to balance the classes
    smote = SMOTE(random_state=42)
    X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)

    # Check the class distribution
    print("Original class distribution:")
    print(y_train.value_counts())
    print("\nBalanced class distribution:")
    print(y_train_balanced.value_counts())

    # Define the models to be used
    models = {
        'Logistic Regression': LogisticRegression(max_iter=1000, multi_class='multinomial', solver='lbfgs'),
        'Decision Tree': DecisionTreeClassifier(),
        'Support Vector Classifier': SVC(),
        'K-Nearest Neighbors': KNeighborsClassifier(),
        'Naive Bayes': GaussianNB(),
        'Linear Discriminant Analysis': LinearDiscriminantAnalysis()
    }

    # Initialize a list to store the results
    results = []

    # Train and evaluate each model
    for name, model in models.items():
        start_time = time.time()
        model.fit(X_train_balanced, y_train_balanced)
        training_time = time.time() - start_time

        # Make predictions
        y_pred = model.predict(X_test)

        # Calculate performance metrics
        accuracy = accuracy_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred, average='weighted')

        # Append the results
        results.append({
            'Model': name,
            'Accuracy': accuracy,
            'F1 Score': f1,
            'Training Time (s)': training_time
        })

    # Convert results to a DataFrame and display
    results_df = pd.DataFrame(results)
    print(results_df)

In [None]:
# Run the baseline comparison on the feature matrix X and the target y
train_and_evaluate_algorithms(X, y)


Original class distribution:
EscalaRiesgo
C    645
B    305
A      9
Name: count, dtype: int64

Balanced class distribution:
EscalaRiesgo
C    645
B    645
A    645
Name: count, dtype: int64
                          Model  Accuracy  F1 Score  Training Time (s)
0           Logistic Regression  0.963504  0.964074           0.386540
1                 Decision Tree  0.936740  0.935397           0.083869
2     Support Vector Classifier  0.861314  0.856127           0.136673
3           K-Nearest Neighbors  0.664234  0.674319           0.006041
4                   Naive Bayes  0.953771  0.962749           0.017494
5  Linear Discriminant Analysis  0.956204  0.956470           0.067245
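
Because class A is so rare, the weighted F1 score in the table is dominated by classes B and C. The sketch below computes per-class, weighted, and macro F1 in plain Python on made-up labels (not the project's predictions) to show how a model can miss every class A case and still obtain a high weighted score. The helper name `f1_per_class` and the label counts are hypothetical.

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """Return {label: F1} computed from true positives, false positives and false negatives."""
    scores = {}
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Made-up labels: 60 C, 36 B, 4 A; the "model" predicts every A as B
y_true = ["C"] * 60 + ["B"] * 36 + ["A"] * 4
y_pred = ["C"] * 60 + ["B"] * 36 + ["B"] * 4

per_class = f1_per_class(y_true, y_pred)
support = Counter(y_true)

# Weighted F1 averages per-class scores by class frequency; macro F1 treats classes equally
weighted_f1 = sum(per_class[c] * support[c] for c in per_class) / len(y_true)
macro_f1 = sum(per_class.values()) / len(per_class)

print(per_class)
print(f"weighted F1 = {weighted_f1:.3f}, macro F1 = {macro_f1:.3f}")
```

In this toy case the weighted F1 stays above 0.9 while the macro F1 falls below 0.7, even though no class A case is detected. With scikit-learn, the same comparison is available through `f1_score(..., average='macro')` or `average=None`.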


# 3. Fine-Tuning

According to the F1 score, the two best models are:
- Logistic Regression
- Naive Bayes

We tune the hyperparameters of these two models. The function below retrains both baseline models and then runs a grid search over the hyperparameters of each.
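
A grid search fits one model per hyperparameter combination and per cross-validation fold, so its cost grows quickly with the size of the grid. As a rough sketch, the snippet below enumerates a grid with the standard library and counts the fits for 5-fold cross-validation. The grid values mirror those used in the tuning function of this section; the helper name `expand_grid` is hypothetical.

```python
from itertools import product


def expand_grid(param_grid):
    """Yield one dict per combination of the values in param_grid."""
    names = list(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        yield dict(zip(names, values))


log_reg_params = {
    "C": [0.1, 1, 10, 100],
    "solver": ["newton-cg", "lbfgs", "saga"],
}
nb_params = {"var_smoothing": [1e-09, 1e-08, 1e-07, 1e-06, 1e-05]}

cv_folds = 5
for name, grid in [("Logistic Regression", log_reg_params), ("Naive Bayes", nb_params)]:
    combos = list(expand_grid(grid))
    # Each candidate is fitted once per fold during cross-validation
    print(f"{name}: {len(combos)} candidates x {cv_folds} folds = {len(combos) * cv_folds} fits")
```

By default, `GridSearchCV` also refits the best candidate once on the full training set, so the tuned training times cover many more fits than the single fit timed for each baseline model.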

In [11]:
def train_and_evaluate_algorithms(X, y):
    # Split the data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Apply SMOTE to balance the classes
    smote = SMOTE(random_state=42)
    X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)

    # Check the class distribution
    print("Original class distribution:")
    print(y_train.value_counts())
    print("\nBalanced class distribution:")
    print(y_train_balanced.value_counts())

    # Define the models to be used
    models = {
        'Logistic Regression': LogisticRegression(max_iter=1000, multi_class='multinomial', solver='lbfgs'),
        'Naive Bayes': GaussianNB(),
    }

    # Initialize a list to store the results
    results = []

    # Train and evaluate each model
    for name, model in models.items():
        start_time = time.time()
        model.fit(X_train_balanced, y_train_balanced)
        training_time = time.time() - start_time

        # Make predictions
        y_pred = model.predict(X_test)

        # Calculate performance metrics
        accuracy = accuracy_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred, average='weighted')

        # Append the results
        results.append({
            'Model': name,
            'Accuracy': accuracy,
            'F1 Score': f1,
            'Training Time (s)': training_time
        })

    # Hyperparameter tuning for Logistic Regression
    log_reg = LogisticRegression(max_iter=1000, multi_class='multinomial', solver='lbfgs')
    log_reg_params = {
        'C': [0.1, 1, 10, 100],
        'solver': ['newton-cg', 'lbfgs', 'saga']
    }
    log_reg_grid = GridSearchCV(log_reg, log_reg_params, cv=5, scoring='f1_weighted')
    start_time = time.time()
    log_reg_grid.fit(X_train_balanced, y_train_balanced)
    training_time = time.time() - start_time
    log_reg_best = log_reg_grid.best_estimator_

    # Make predictions
    y_pred = log_reg_best.predict(X_test)

    # Calculate performance metrics
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')

    # Append the results
    results.append({
        'Model': 'Logistic Regression (Tuned)',
        'Accuracy': accuracy,
        'F1 Score': f1,
        'Training Time (s)': training_time
    })

    # Hyperparameter tuning for Naive Bayes (Gaussian)
    naive_bayes = GaussianNB()
    nb_params = {
        'var_smoothing': [1e-09, 1e-08, 1e-07, 1e-06, 1e-05]
    }
    nb_grid = GridSearchCV(naive_bayes, nb_params, cv=5, scoring='f1_weighted')
    start_time = time.time()
    nb_grid.fit(X_train_balanced, y_train_balanced)
    training_time = time.time() - start_time
    nb_best = nb_grid.best_estimator_

    # Make predictions
    y_pred = nb_best.predict(X_test)

    # Calculate performance metrics
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')

    # Append the results
    results.append({
        'Model': 'Naive Bayes (Tuned)',
        'Accuracy': accuracy,
        'F1 Score': f1,
        'Training Time (s)': training_time
    })

    # Convert results to a DataFrame and display
    results_df = pd.DataFrame(results)
    print(results_df)

In [12]:
train_and_evaluate_algorithms(X, y)


Original class distribution:
EscalaRiesgo
C    645
B    305
A      9
Name: count, dtype: int64

Balanced class distribution:
EscalaRiesgo
C    645
B    645
A    645
Name: count, dtype: int64




                         Model  Accuracy  F1 Score  Training Time (s)
0          Logistic Regression  0.963504  0.964074           0.363115
1                  Naive Bayes  0.953771  0.962749           0.016687
2  Logistic Regression (Tuned)  0.961071  0.961925         110.308201
3          Naive Bayes (Tuned)  0.953771  0.962749           0.500227
