
# Solved Exercise

## Introduction

In this exercise, we intend to predict whether a given patient is diabetic or not. To do this, we will take into account several characteristics of these patients, such as their age, blood pressure and body mass index. 

## Data

*   Number of observations = 768
*   Number of independent variables = 8
  1.   Pregnancies = number of pregnancies
  2.   Glucose = glucose level (mg/dL)
  3.   BloodPressure = diastolic blood pressure (mmHg)
  4.   SkinThickness = triceps skinfold thickness (mm)
  5.   Insulin = insulin level (μU/mL)
  6.   BMI = body mass index (kg/m^2)
  7.   DiabetesPedigreeFunction = genetic predisposition for diabetes (according to a specific function)
  8.   Age = age (years)
*   Number of dependent variables = 1
  1.   Outcome = tells us if a patient is diabetic (1) or not (0)

Data available at: https://raw.githubusercontent.com/pmarcelino/datasets/master/diabetes.csv

**Note:** The dependent and independent variables are defined specifically using the information in the exercise's introduction. If in the introduction we were asked, for example, to predict the insulin level using the other variables, the dependent variable would be 'Insulin' and not 'Outcome'.
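
As a minimal sketch of this point, the snippet below uses a small toy table (only a few of the dataset's columns, and only three rows) to show how the split into independent and dependent variables changes with the question being asked. The names `toy`, `X_outcome`, `X_insulin` and so on are illustrative and do not appear in the exercise.

```python
import pandas as pd

# Small toy table with a subset of the dataset's columns (illustration only)
toy = pd.DataFrame({
    "Glucose": [148, 85, 89],
    "Insulin": [0, 0, 94],
    "Age": [50, 31, 21],
    "Outcome": [1, 0, 0],
})

# Target as defined in this exercise: predict 'Outcome'
X_outcome = toy.drop("Outcome", axis=1)
y_outcome = toy["Outcome"]

# Alternative target: predict 'Insulin' from every other column
X_insulin = toy.drop("Insulin", axis=1)
y_insulin = toy["Insulin"]

print(list(X_outcome.columns))
print(list(X_insulin.columns))
```

In the second case, `Outcome` becomes one of the independent variables.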

## Example

The first observation in the dataset, which is the first row of the table displayed by `df` in the Solution, refers to a patient who:

*   Has had `6` pregnancies
*   Has a glucose level of `148` mg/dL
*   Has a diastolic blood pressure of `72` mmHg 
*   Has a triceps skinfold thickness of `35` mm 
*   Has an insulin level of `0` μU/mL
*   Has a body mass index of `33.6` kg/m^2
*   Has a genetic predisposition for diabetes of `0.627` 
*   Is `50` years old
*   Is `diabetic`
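
One way to read a single observation like this in code is to select it by position with `iloc`. The sketch below rebuilds the first two rows by hand from the table shown in the Solution, so that it runs without downloading anything. The names `df_head`, `first` and `label` are illustrative.

```python
import pandas as pd

# First two rows of the dataset, copied by hand from the table in the Solution
df_head = pd.DataFrame(
    [[6, 148, 72, 35, 0, 33.6, 0.627, 50, 1],
     [1, 85, 66, 29, 0, 26.6, 0.351, 31, 0]],
    columns=["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
             "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"],
)

# Select the first observation by position
first = df_head.iloc[0]

# Translate the 0/1 outcome into a readable label
label = "diabetic" if first["Outcome"] == 1 else "not diabetic"

print(first)
print(label)
```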




# Solution

1.   Prepare data
1.   Explore data
2.   Train the model
2.   Evaluate the model

In [None]:
# Import libraries
import pandas as pd  
import seaborn as sns

from sklearn.ensemble import RandomForestClassifier  
from sklearn.model_selection import cross_val_score

**Note:** If a `FutureWarning` appears while you are importing the libraries, do not worry. This message comes from one of the libraries and only informs you that some of its behavior will change in a future version.

## 1. Prepare data

1.   Import data
2.   Check for missing data and remove the affected observations, if any

In [None]:
# Import data
url = "https://raw.githubusercontent.com/pmarcelino/datasets/master/diabetes.csv"
df = pd.read_csv(url)
df

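| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 | 0 |
| 764 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.340 | 27 | 0 |
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 | 0 |
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 |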


*   The import was successful


In [None]:
# Check for missing data
pd.isnull(df).sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

*   There is no data missing
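
Note that `pd.isnull` only counts values that pandas treats as missing (such as `NaN`). A placeholder value like `0` is not flagged, so a count of zero here does not rule out invalid entries. A minimal sketch with a made-up table (`toy` and `missing_per_column` are illustrative names):

```python
import numpy as np
import pandas as pd

# Made-up table: 0 acts as a placeholder, NaN is a truly missing value
toy = pd.DataFrame({
    "Glucose": [148, 0, np.nan],
    "BMI": [33.6, 26.6, 0.0],
})

# Count missing values per column; zeros are not counted
missing_per_column = pd.isnull(toy).sum()
print(missing_per_column)
```

Here only the `NaN` in `Glucose` is counted. The zeros pass unnoticed, which is why the data exploration step is still needed.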

## 2. Explore data

1.   Detect errors and anomalies

In [None]:
# Check for errors or anomalies
df.describe()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


*   There are errors and anomalies
*   The following variables are not expected to have a minimum value of 0: 
  *   Glucose level
  *   Diastolic blood pressure
  *   Triceps skinfold thickness
  *   Insulin level
  *   Body mass index
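
To see how many observations are affected in each variable, the zeros can be counted column by column. The sketch below uses a small made-up table, and `suspect_columns` is an illustrative name for the list of variables in which 0 is not a plausible value.

```python
import pandas as pd

# Made-up table; 'Pregnancies' is included to show that 0 can be valid elsewhere
toy = pd.DataFrame({
    "Glucose": [148, 0, 89, 137],
    "BloodPressure": [72, 66, 0, 40],
    "Insulin": [0, 0, 94, 168],
    "Pregnancies": [6, 1, 0, 0],
})

# Only check the columns where a value of 0 is implausible
suspect_columns = ["Glucose", "BloodPressure", "Insulin"]

# Compare each value with 0 and add up the matches per column
zero_counts = (toy[suspect_columns] == 0).sum()
print(zero_counts)
```

On the real data, the same expression applied to `df` with all five variables listed above would show which variable is responsible for most of the removed rows.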

In [None]:
# Remove observations with errors and anomalies
df.drop(df[df['Glucose']==0].index, inplace=True)
df.drop(df[df['BloodPressure']==0].index, inplace=True)
df.drop(df[df['SkinThickness']==0].index, inplace=True)
df.drop(df[df['Insulin']==0].index, inplace=True)
df.drop(df[df['BMI']==0].index, inplace=True)

In [None]:
# Check for errors or anomalies
df.describe()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 392.0 | 392.0 | 392.0 | 392.0 | 392.0 | 392.0 | 392.0 | 392.0 | 392.0 |
| mean | 3.30102 | 122.627551 | 70.663265 | 29.145408 | 156.056122 | 33.086224 | 0.523046 | 30.864796 | 0.331633 |
| std | 3.211424 | 30.860781 | 12.496092 | 10.516424 | 118.84169 | 7.027659 | 0.345488 | 10.200777 | 0.471401 |
| min | 0.0 | 56.0 | 24.0 | 7.0 | 14.0 | 18.2 | 0.085 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 21.0 | 76.75 | 28.4 | 0.26975 | 23.0 | 0.0 |
| 50% | 2.0 | 119.0 | 70.0 | 29.0 | 125.5 | 33.2 | 0.4495 | 27.0 | 0.0 |
| 75% | 5.0 | 143.0 | 78.0 | 37.0 | 190.0 | 37.1 | 0.687 | 36.0 | 1.0 |
| max | 17.0 | 198.0 | 110.0 | 63.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


*   The observations previously identified as containing errors or anomalies have disappeared
*   The number of observations has gone from 768 to 392 (see `count`)
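
The same removal can also be written with a single boolean mask instead of one `drop` call per variable. This is only an alternative sketch, shown on a made-up table. The names `toy`, `suspect_columns`, `keep` and `cleaned` are illustrative.

```python
import pandas as pd

# Made-up table with a few zeros in the columns of interest
toy = pd.DataFrame({
    "Glucose": [148, 0, 89, 137, 78],
    "BloodPressure": [72, 66, 66, 0, 50],
    "SkinThickness": [35, 29, 23, 35, 32],
    "Insulin": [0, 0, 94, 168, 88],
    "BMI": [33.6, 26.6, 28.1, 43.1, 31.0],
    "Outcome": [1, 0, 0, 1, 1],
})

suspect_columns = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# A row is kept only if none of the suspect columns contains 0
keep = (toy[suspect_columns] != 0).all(axis=1)

# Build a new table instead of modifying the original in place
cleaned = toy[keep].copy()

print(len(toy), "->", len(cleaned))
```

Because this version creates a new table, the original data stays available for comparison.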

**Conclusions drawn from the data exploration**

*   There were obvious errors and anomalies, and the affected observations were removed

## 3. Train the model

In [None]:
# Define the independent and dependent variables
X = df.drop('Outcome', axis=1)
y = df['Outcome']

In [None]:
# Define the algorithm of the model
model = RandomForestClassifier(random_state=1143)

*   The Random Forest algorithm involves randomness: each tree is trained on a random (bootstrap) sample of the observations and considers a random subset of the variables at each split
*   In order for these random choices to be the same in every run, we define the `random_state` argument
*   That way, anyone who runs exactly the same code on the same data should get the same results
*   We used the number 1143, but we could have used any other number
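
The effect of `random_state` can be checked with a small experiment. The sketch below uses synthetic data from `make_classification` as a stand-in for the diabetes table, and a reduced number of trees to keep it fast. Both choices, as well as the helper `fit_and_predict`, are assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the diabetes table (assumption for this sketch)
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

def fit_and_predict(seed):
    """Train a small forest with the given seed and return its predicted probabilities."""
    forest = RandomForestClassifier(n_estimators=20, random_state=seed)
    forest.fit(X_toy, y_toy)
    return forest.predict_proba(X_toy)

# Two independent runs with the same seed
run_a = fit_and_predict(1143)
run_b = fit_and_predict(1143)

# True when both runs produced exactly the same probabilities
print(np.array_equal(run_a, run_b))
```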

## 4. Evaluate the model

In [None]:
# Evaluate the model using cross-validation (using the success rate)
score = cross_val_score(model, X, y, cv=5)
score

array([0.79746835, 0.67088608, 0.75641026, 0.80769231, 0.83333333])

*   We used five folds (`cv=5`), which is why `score` has five values (one for each fold)
*   Since we did not pass a `scoring` argument, scikit-learn uses the model's own `score` method. For `RandomForestClassifier`, this is the success rate (accuracy)
*   Since this is a classification problem and `cv` is an integer, scikit-learn automatically uses stratified cross-validation
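
These defaults can be made explicit. The sketch below, run on synthetic data as a stand-in for the diabetes table (an assumption for this illustration), compares the short call with one that names the metric and the splitter directly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data standing in for the diabetes table (assumption for this sketch)
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)
model = RandomForestClassifier(n_estimators=20, random_state=1143)

# Short form, relying on the defaults
implicit = cross_val_score(model, X_toy, y_toy, cv=5)

# Same evaluation with the metric and the splitter spelled out
explicit = cross_val_score(
    model, X_toy, y_toy,
    scoring="accuracy",
    cv=StratifiedKFold(n_splits=5),
)

# True when both calls give the same five scores
print(np.allclose(implicit, explicit))
```

Writing the splitter out is also the way to change it, for example to shuffle the observations before splitting.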


In [None]:
# Estimate the model's performance
score_mean = score.mean()
score_mean

0.7731580655631288

*   Our model has an average success rate of 77% (approximately) 
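
The mean alone hides how much the folds differ from each other. As a small complement, the sketch below recomputes the mean from the five values shown above and adds their standard deviation. `score_std` is an illustrative name.

```python
import numpy as np

# Fold scores copied from the cross-validation output above
score = np.array([0.79746835, 0.67088608, 0.75641026, 0.80769231, 0.83333333])

# Average performance and spread across the five folds
score_mean = score.mean()
score_std = score.std()

print(f"{score_mean:.3f} +/- {score_std:.3f}")
```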



In [None]:
# Evaluate the model using cross-validation (using sensitivity)
score = cross_val_score(model, X, y, scoring='recall', cv=5)
score

array([0.61538462, 0.46153846, 0.42307692, 0.65384615, 0.73076923])

*   If we consider diabetes to be a serious disease and we want to assess the model's ability to identify all positive cases, we can use another evaluation metric
*   In this context, an appropriate evaluation metric would be sensitivity 
*   To use this metric in cross-validation, we pass the `scoring='recall'` argument to the `cross_val_score` function (scikit-learn refers to sensitivity as recall)
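
To make the difference between the two metrics concrete, here is a small worked example with made-up labels: four diabetic patients (1) and four non-diabetic patients (0). The model finds only two of the four diabetic patients.

```python
from sklearn.metrics import accuracy_score, recall_score

# Made-up true labels and predictions (illustration only)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

# Sensitivity (recall): detected positives / all actual positives = 2 / 4
recall = recall_score(y_true, y_pred)

# Success rate (accuracy): correct predictions / all predictions = 5 / 8
accuracy = accuracy_score(y_true, y_pred)

print(recall, accuracy)
```

The success rate counts every correct prediction, while sensitivity looks only at the patients who really are diabetic.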



In [None]:
# Estimate the model's performance
score_mean = score.mean()
score_mean

0.576923076923077

*   Our model has a sensitivity of 58% (approximately)
*   The maximum would be 100%, which would mean that the model detects all positive cases in every fold. At 58%, the model misses roughly four out of ten diabetic patients, so its performance in terms of sensitivity is low
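
Both metrics can also be obtained from a single cross-validation run with scikit-learn's `cross_validate`, which accepts a list of metric names. The sketch below uses synthetic, mildly imbalanced data as a stand-in for the diabetes table, so the data and the resulting numbers are assumptions of this illustration and not results of the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic data with roughly one third positive cases (assumption for this sketch)
X_toy, y_toy = make_classification(
    n_samples=200, n_features=8, weights=[0.67, 0.33], random_state=0
)
model = RandomForestClassifier(n_estimators=20, random_state=1143)

# One cross-validation run, two metrics
results = cross_validate(model, X_toy, y_toy, scoring=["accuracy", "recall"], cv=5)

# Each metric is stored under the key 'test_<metric name>'
mean_accuracy = results["test_accuracy"].mean()
mean_recall = results["test_recall"].mean()

print(f"accuracy: {mean_accuracy:.2f}, recall: {mean_recall:.2f}")
```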