<p align="center">
  <img src="https://raw.githubusercontent.com/pachecowillians/svg-icons/24b0ca90f467a751be9f0e7e5fa50801f89e4d17/img/diabetes.svg" alt="Diabetes" width="100px">
</p>

This notebook builds a classifier that predicts whether a person has diabetes. It uses a dataset of health indicators such as gender, age, BMI, hypertension, heart disease, smoking history, HbA1c level, and blood glucose level. The model is trained on 70% of the data, evaluated on the remaining 30%, and then used to predict the outcome for a single hand-written example.

In [572]:
# Importing the pandas library
import pandas as pd

In [573]:
# Loading the diabetes prediction dataset
ds = pd.read_csv('dataset/diabetes_prediction_dataset.csv')

# Displaying the dataset
ds.head()

Unnamed: 0,gender,age,hypertension,heart_disease,smoking_history,bmi,HbA1c_level,blood_glucose_level,diabetes
0,Female,80.0,0,1,never,25.19,6.6,140,0
1,Female,54.0,0,0,No Info,27.32,6.6,80,0
2,Male,28.0,0,0,never,27.32,5.7,158,0
3,Female,36.0,0,0,current,23.45,5.0,155,0
4,Male,76.0,1,1,current,20.14,4.8,155,0


In [574]:
def generateIdToColumn(column):
    # Get the distinct values from the column, in order of first appearance
    unique_values = column.unique()

    # Map each distinct value to an integer id (0, 1, 2, ...)
    mapping = {value: index for index, value in enumerate(unique_values)}

    # Replace the values in the column with the corresponding id
    column = column.map(mapping)

    return column
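
As a quick check of what this helper does, the sketch below applies the same first-appearance mapping to a small made-up Series and compares the result with `pd.factorize`, which pandas provides for the same purpose. The sample values are illustrative and are not rows from the dataset.

```python
import pandas as pd

# Hypothetical sample values, not taken from the dataset
sample = pd.Series(["Female", "Male", "Female", "Other", "Male"])

# Same idea as generateIdToColumn: number the distinct values
# in order of first appearance
mapping = {value: index for index, value in enumerate(sample.unique())}
encoded = sample.map(mapping)

# pd.factorize returns the integer codes and the distinct values in one call
codes, uniques = pd.factorize(sample)

print(mapping)
print(encoded.tolist())
print(codes.tolist())
```

Both approaches assign ids by order of first appearance, so the ids depend on how the rows happen to be ordered in the file.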

In [575]:
# Converting the 'gender' column to numerical values using the 'generateIdToColumn' function
ds['gender'] = generateIdToColumn(ds['gender'])

# Displaying the first few rows of the DataFrame after the conversion
ds.head()

Unnamed: 0,gender,age,hypertension,heart_disease,smoking_history,bmi,HbA1c_level,blood_glucose_level,diabetes
0,0,80.0,0,1,never,25.19,6.6,140,0
1,0,54.0,0,0,No Info,27.32,6.6,80,0
2,1,28.0,0,0,never,27.32,5.7,158,0
3,0,36.0,0,0,current,23.45,5.0,155,0
4,1,76.0,1,1,current,20.14,4.8,155,0


In [576]:
# Converting the 'smoking_history' column to numerical values using the 'generateIdToColumn' function
ds['smoking_history'] = generateIdToColumn(ds['smoking_history'])

# Displaying the first few rows of the DataFrame after the conversion
ds.head()

Unnamed: 0,gender,age,hypertension,heart_disease,smoking_history,bmi,HbA1c_level,blood_glucose_level,diabetes
0,0,80.0,0,1,0,25.19,6.6,140,0
1,0,54.0,0,0,1,27.32,6.6,80,0
2,1,28.0,0,0,0,27.32,5.7,158,0
3,0,36.0,0,0,2,23.45,5.0,155,0
4,1,76.0,1,1,2,20.14,4.8,155,0
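
Integer ids give the categories an arbitrary order (here `never` = 0, `No Info` = 1, `current` = 2), and some models may treat that order as meaningful. One common alternative is one-hot encoding. The sketch below shows it with `pd.get_dummies` on a small made-up frame; it illustrates the technique and is not what this notebook does.

```python
import pandas as pd

# Hypothetical rows that reuse category labels shown in the preview above
toy = pd.DataFrame({
    "smoking_history": ["never", "No Info", "never", "current"],
    "bmi": [25.19, 27.32, 27.32, 23.45],
})

# One indicator column per category instead of a single integer id
one_hot = pd.get_dummies(toy, columns=["smoking_history"], dtype=int)

print(one_hot.columns.tolist())
print(one_hot)
```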


In [577]:
# Splitting the dataset into goal (target variable) and predict (feature variables)
goal = ds['diabetes']
predict = ds.drop('diabetes', axis=1)

In [578]:
# Importing the train_test_split function
from sklearn.model_selection import train_test_split

In [579]:
# Splitting the dataset into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(predict, goal, test_size=0.3)
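
The call above draws a different random split on every run, so the reported accuracy can vary slightly between runs. A minimal sketch of a reproducible, stratified split follows. It uses synthetic data, and the seed `42` and the toy columns are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features and target (100 rows, 10 positives)
features = pd.DataFrame({
    "age": np.arange(100),
    "bmi": np.linspace(18.0, 40.0, 100),
})
target = pd.Series([1] * 10 + [0] * 90)

# random_state fixes the shuffle; stratify keeps the class ratio
# the same in the training and test parts
x_tr, x_te, y_tr, y_te = train_test_split(
    features, target, test_size=0.3, random_state=42, stratify=target
)

print(x_tr.shape, x_te.shape)
print(y_tr.mean(), y_te.mean())
```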

In [580]:
# Printing the shapes of the datasets
print(ds.shape, x_train.shape, x_test.shape)

(100000, 9) (70000, 8) (30000, 8)


In [581]:
# Importing the ExtraTreesClassifier
from sklearn.ensemble import ExtraTreesClassifier

In [582]:
# Creating an instance of the ExtraTreesClassifier model
model = ExtraTreesClassifier()

# Training the model
training = model.fit(x_train, y_train)

In [583]:
# Calculating the accuracy of the trained model on the test set
result = model.score(x_test, y_test)

# Printing the accuracy
print("Accuracy:", result)

Accuracy: 0.9682333333333333
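
Accuracy alone can be hard to interpret when one class is much more common than the other, which is often the case for medical targets. The sketch below uses made-up labels (the 92/8 ratio is an assumption, not the ratio in this dataset) to show how a classifier that never predicts diabetes can still score high accuracy while its recall is zero.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical labels: 92 people without diabetes, 8 with diabetes
y_true = [0] * 92 + [1] * 8

# A trivial "model" that always predicts the negative class
y_pred = [0] * 100

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
matrix = confusion_matrix(y_true, y_pred)

print("Accuracy:", accuracy)
print("Recall:", recall)
print(matrix)
```

The same three functions could be applied to `y_test` and `model.predict(x_test)` to see how many actual diabetes cases the trained model misses.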


In [584]:
# Importing the random module (not used below: the sampling relies on pandas' DataFrame.sample)
import random

In [585]:
# Randomly sampling 10 rows from the DataFrame 'x_test'
random_sample = x_test.sample(n=10)

In [586]:
# Obtaining the corresponding 'y' values for the randomly sampled rows
random_sample_goal = y_test.loc[random_sample.index]

In [587]:
# Importing the NumPy library with the alias 'np'
import numpy as np

In [588]:
# Converting 'random_sample_goal' to a NumPy array
random_sample_goal = np.array(random_sample_goal)
print(random_sample_goal)

[0 0 1 0 1 0 0 0 0 0]


In [589]:
# Making predictions on the selected subset of test data
prediction = model.predict(random_sample)
print(prediction)

[0 0 1 0 1 0 0 0 0 0]


In [590]:
# Printing the comparison results
print(random_sample_goal == prediction)

[ True  True  True  True  True  True  True  True  True  True]


In [591]:
# Creating custom input data for a single person
# (values use the numeric ids assigned above)
person = {
    'gender': 1,           # 1 = Male
    'age': 21,
    'hypertension': 0,
    'heart_disease': 0,
    'smoking_history': 0,  # 0 = never
    'bmi': 18.91,
    'HbA1c_level': 6.10,
    'blood_glucose_level': 158.00
}

In [592]:
# Converting 'person' dictionary into a Pandas Series
person = pd.Series(person)

# Converting the Series into a DataFrame
person.to_frame()

Unnamed: 0,0
gender,1.0
age,21.0
hypertension,0.0
heart_disease,0.0
smoking_history,0.0
bmi,18.91
HbA1c_level,6.1
blood_glucose_level,158.0


In [593]:
# Predicting the diabetes label for the custom input (transposed into a one-row DataFrame)
has_diabetes = model.predict(person.to_frame().T)

In [594]:
# model.predict returns an array with one label per row; read the first (only) entry
if has_diabetes[0] == 1:
    print("The person has diabetes.")
else:
    print("The person does not have diabetes.")

The person does not have diabetes.
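
A single prediction depends on the input columns matching the columns the model was trained on. The sketch below is one way to build a one-row DataFrame directly from a dictionary, reorder it to the training columns, and read both the predicted label and the class probabilities. The toy training data, `toy_model`, and `new_person` are hypothetical names and values used only to keep the example runnable.

```python
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

# Tiny synthetic training set, only to make the example runnable
train = pd.DataFrame({
    "age": [20, 30, 60, 70],
    "bmi": [20.0, 22.0, 31.0, 33.0],
    "blood_glucose_level": [90, 100, 200, 220],
})
labels = [0, 0, 1, 1]
toy_model = ExtraTreesClassifier(n_estimators=10, random_state=0).fit(train, labels)

# Keys deliberately listed in a different order from the training columns
new_person = {"bmi": 21.0, "blood_glucose_level": 95, "age": 25}

# One-row DataFrame, reordered to match the columns seen during fit
row = pd.DataFrame([new_person])[list(train.columns)]

# predict returns one label per row; predict_proba returns one
# probability per class for each row
label = int(toy_model.predict(row)[0])
probabilities = toy_model.predict_proba(row)[0]

print(label, probabilities)
```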
