# Ensemble models with the Pima dataset

The Pima Indian Diabetes Dataset, originally from the National Institute of Diabetes and Digestive and Kidney Diseases, contains information on 768 women from a population near Phoenix, Arizona, USA. The outcome tested was diabetes: 268 women tested positive and 500 tested negative. There is therefore one target (dependent) variable and 8 attributes (TYNECKI, 2018): pregnancies, OGTT (Oral Glucose Tolerance Test), blood pressure, skin thickness, insulin, BMI (Body Mass Index), age, and diabetes pedigree function. The Pima population has been under study by the National Institute of Diabetes and Digestive and Kidney Diseases at intervals of 2 years since 1965.

### Key information
- Classes: Diabetes (1 = yes, 0 = no)
- Samples: 268 tested positive and 500 tested negative (768 in total)
- Features: pregnancies, OGTT (Oral Glucose Tolerance Test), blood pressure, skin thickness, insulin, BMI (Body Mass Index), age, diabetes pedigree function

## 1. EDA


In [25]:
import pandas as pd
import numpy as np

data = pd.read_csv('https://raw.githubusercontent.com/npradaschnor/Pima-Indians-Diabetes-Dataset/refs/heads/master/diabetes.csv')
data.columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
                'BMI', 'diabFunction', 'Age', 'Diabetes']
data.head(10)

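| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | diabFunction | Age | Diabetes |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| 5 | 5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 | 0 |
| 6 | 3 | 78 | 50 | 32 | 88 | 31.0 | 0.248 | 26 | 1 |
| 7 | 10 | 115 | 0 | 0 | 0 | 35.3 | 0.134 | 29 | 0 |
| 8 | 2 | 197 | 70 | 45 | 543 | 30.5 | 0.158 | 53 | 1 |
| 9 | 8 | 125 | 96 | 0 | 0 | 0.0 | 0.232 | 54 | 1 |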


In [6]:
# Quick overview
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Pregnancies    768 non-null    int64  
 1   Glucose        768 non-null    int64  
 2   BloodPressure  768 non-null    int64  
 3   SkinThickness  768 non-null    int64  
 4   Insulin        768 non-null    int64  
 5   BMI            768 non-null    float64
 6   diabFunction   768 non-null    float64
 7   Age            768 non-null    int64  
 8   Diabetes       768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [10]:
# Count how many records (instances or rows) there are per class
print("Number of samples per class: \n", data['Diabetes'].value_counts())

Number of samples per class: 
 Diabetes
0    500
1    268
Name: count, dtype: int64


The data is imbalanced. Majority class: no diabetes (0), with 500 of the 768 samples (about 65%).
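To put the imbalance in numbers, the sketch below takes the class counts printed above and derives each class's share, plus the accuracy of a trivial model that always predicts the majority class. The variable names are illustrative and not part of the original notebook.

```python
# Class counts reported by value_counts() above
counts = {0: 500, 1: 268}
total = sum(counts.values())

# Share of each class in the dataset
proportions = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the majority class
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(proportions)
print(f"Majority baseline accuracy: {baseline_accuracy:.3f}")
```

Any classifier trained later should be compared against this baseline, not against 50%.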

In [8]:
data.describe()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | diabFunction | Age | Diabetes |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


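Features whose values we should question: ['Glucose', 'BloodPressure', 'BMI']. Each has a minimum of 0, which is not a plausible measurement and most likely marks a missing value. SkinThickness and Insulin also have a minimum of 0 in the table above, but this notebook focuses on the first three.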

In [15]:
# Are there null or missing values?
datos_faltantes = data.isnull()

for i in datos_faltantes.columns.values.tolist():
    print(datos_faltantes[i].value_counts())


Pregnancies
False    768
Name: count, dtype: int64
Glucose
False    768
Name: count, dtype: int64
BloodPressure
False    768
Name: count, dtype: int64
SkinThickness
False    768
Name: count, dtype: int64
Insulin
False    768
Name: count, dtype: int64
BMI
False    768
Name: count, dtype: int64
diabFunction
False    768
Name: count, dtype: int64
Age
False    768
Name: count, dtype: int64
Diabetes
False    768
Name: count, dtype: int64


In [18]:
# Anomalous or zero values
cat_dudosas = ['Glucose', 'BloodPressure', 'BMI']

# Subset the data to the suspicious columns
data_sec = data[cat_dudosas]

# See which values are zero
data_cero = pd.DataFrame(data_sec == 0)  # Boolean mask: True where the value is zero

for i in data_cero.columns.values.tolist():
    print(data_cero[i].value_counts()) 

Glucose
False    763
True       5
Name: count, dtype: int64
BloodPressure
False    733
True      35
Name: count, dtype: int64
BMI
False    757
True      11
Name: count, dtype: int64
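The same zero counts can be obtained in one expression by summing the boolean mask. The helper below is a minimal sketch: `count_zeros` is a hypothetical name, and the small frame is made up so the block runs on its own.

```python
import pandas as pd

def count_zeros(df: pd.DataFrame, columns: list[str]) -> pd.Series:
    """Return the number of zero entries in each of the given columns."""
    # Comparing to 0 gives a boolean frame; summing counts the True values per column
    return (df[columns] == 0).sum()

# Tiny made-up frame, only to show the call; this is not the Pima data
toy = pd.DataFrame({"Glucose": [148, 0, 183], "BMI": [33.6, 26.6, 0.0]})
zeros = count_zeros(toy, ["Glucose", "BMI"])
print(zeros)
```

In the notebook, the equivalent call would be `count_zeros(data, cat_dudosas)`.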


## 2. Preprocessing (imputation and removal)

In [28]:
# Create a copy of the original dataset so we can try different things on it
data_mod = data.copy()
data_mod.head(5)

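| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | diabFunction | Age | Diabetes |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |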


In [20]:
### This is where you will impute the data
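
The imputation step is left open in this notebook. One possible approach, shown here only as a sketch, is to treat zeros in the suspicious columns as missing and replace them with the median of the remaining values. The function name `impute_zeros_with_median` and the toy frame are assumptions made for illustration. Other strategies, such as mean imputation or dropping rows, are equally valid choices.

```python
import numpy as np
import pandas as pd

def impute_zeros_with_median(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Return a copy of df where zeros in `columns` are replaced by the column median.

    The median is computed over the non-zero values only.
    """
    out = df.copy()
    for col in columns:
        # Treat 0 as "missing" so it does not influence the median
        values = out[col].replace(0, np.nan)
        out[col] = values.fillna(values.median())
    return out

# Made-up values, only to show the behaviour
toy = pd.DataFrame({"Glucose": [100, 0, 120, 140], "BMI": [30.0, 0.0, 20.0, 40.0]})
result = impute_zeros_with_median(toy, ["Glucose", "BMI"])
print(result)
```

In the notebook this could be applied as `data_mod = impute_zeros_with_median(data_mod, cat_dudosas)`. Strictly speaking, the median should be computed on the training split only, to avoid leaking information from the test set.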

## 3. Entrenar el modelo

In [29]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

# Build the training and test sets
X = data_mod.drop('Diabetes', axis=1)
y = data_mod['Diabetes']  # take the target from data_mod so X and y stay aligned if rows are removed

xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train the model
rf = RandomForestClassifier(n_estimators=100)  # The hyperparameters still need tuning here
rf.fit(xtrain, ytrain)

# Evaluate the model on the test set
ypred = rf.predict(xtest)
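
The comment in the cell above notes that the hyperparameters still need tuning. A minimal sketch of one way to do that is a cross-validated grid search with scikit-learn's `GridSearchCV`. The grid values, the F1 scoring choice, and the synthetic data from `make_classification` are all assumptions made so the block runs on its own. In the notebook, `xtrain` and `ytrain` would take the place of the synthetic split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for X and y so the sketch runs on its own
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# Illustrative grid; the values are arbitrary starting points
param_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1",  # F1 is one reasonable choice when classes are imbalanced
    cv=3,
)
search.fit(X_tr, y_tr)

print(search.best_params_)
best_rf = search.best_estimator_  # refit on the full training split by default
```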

In [31]:
mat_con = confusion_matrix(ytest, ypred)
print(mat_con)

[[124  27]
 [ 32  48]]
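
With scikit-learn's convention (rows are true classes, columns are predicted classes), the matrix above reads as 124 true negatives, 27 false positives, 32 false negatives and 48 true positives. The sketch below recomputes the usual scores by hand from those four numbers. The variable names are illustrative.

```python
# Confusion matrix from the output above: rows are true classes, columns are predictions
tn, fp = 124, 27
fn, tp = 32, 48

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted positives that are real
recall = tp / (tp + fn)                      # share of real positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

These values agree with what `accuracy_score`, `precision_score`, `recall_score` and `f1_score` print in the cells that follow.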


In [32]:
print(accuracy_score(ytest,ypred))

0.7445887445887446


In [34]:
print(precision_score(ytest,ypred))
print("Only 64 percent of the model's positive diagnoses are correct")

0.64
Only 64 percent of the model's positive diagnoses are correct


In [36]:
print(recall_score(ytest,ypred))
print("The model identifies only 60 percent of the positive cases; the other 40 percent go undetected")


0.6
The model identifies only 60 percent of the positive cases; the other 40 percent go undetected


In [38]:
print(f1_score(ytest,ypred))
print("The dataset is imbalanced")

0.6193548387096774
The dataset is imbalanced
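
Since the dataset is imbalanced, two common adjustments are a stratified split and class weighting. The sketch below shows both using scikit-learn's `stratify` argument and `class_weight="balanced"`. It runs on synthetic data with a roughly 65/35 class ratio, which is an assumption made to keep the block self-contained. Whether these changes improve the scores on the Pima data would have to be checked by rerunning the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 65% / 35% classes (stand-in for X and y)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.65, 0.35], random_state=42
)

# stratify keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# class_weight="balanced" weights each class inversely to its frequency
rf_balanced = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=42
)
rf_balanced.fit(X_tr, y_tr)

recall_balanced = recall_score(y_te, rf_balanced.predict(X_te))
print(recall_balanced)
```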
