# Supervised Learning - Project

In this project, we carry out a complete supervised machine learning workflow on a diabetes dataset. The dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. Its objective is to predict diagnostically whether a patient has diabetes, based on the diagnostic measurements it contains.

[Kaggle Dataset](https://www.kaggle.com/datasets/akshaydattatraykhare/diabetes-dataset)

# Part I : EDA - Exploratory Data Analysis

  - For this task, conduct an exploratory data analysis on the diabetes dataset. You are free to choose the visualizations, but the analysis should cover most of the following questions:

1. Are there any missing values in the dataset?
1. How are the predictor variables related to the outcome variable?
1. What is the correlation between the predictor variables?
1. What is the distribution of each predictor variable?
1. Are there any outliers in the predictor variables?
1. How are the predictor variables related to each other?
1. Is there any interaction effect between the predictor variables?
1. What is the average age of the individuals in the dataset?
1. What is the average glucose level for individuals with diabetes and without diabetes?
1. What is the average BMI for individuals with diabetes and without diabetes?
1. How does the distribution of the predictor variables differ for individuals with diabetes and without diabetes?
1. Are there any differences in the predictor variables between males and females (if gender information is available)?

  - ##### Notebook: https://github.com/leoaugusto1976/LHL-Supervised-Learning-Project/blob/main/notebooks/2.data_analysis.ipynb
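
A few of the questions above (missing values, group averages, correlations) can be answered with a handful of pandas calls. The following is a minimal sketch, not the notebook's actual code: the column names `Glucose`, `BMI`, `Age`, and `Outcome` are assumed to match the Kaggle CSV, and a small synthetic frame stands in for the real file.

```python
import pandas as pd

# Synthetic stand-in for the dataset. With the real file you would load it
# with something like pd.read_csv("diabetes.csv") (hypothetical path).
df = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137, 116],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6],
    "Age": [50, 31, 32, 21, 33, 30],
    "Outcome": [1, 0, 1, 0, 1, 0],
})

# Missing values per column.
missing_per_column = df.isna().sum()

# Average age across all individuals.
mean_age = df["Age"].mean()

# Average glucose and BMI for individuals without (0) and with (1) diabetes.
group_means = df.groupby("Outcome")[["Glucose", "BMI"]].mean()

# Pairwise Pearson correlation between all numeric columns.
correlations = df.corr()

print(missing_per_column)
print(f"Mean age: {mean_age:.1f}")
print(group_means)
print(correlations.round(2))
```

The correlation matrix can be passed to a heatmap for a visual summary, and `df.describe()` gives a quick view of each predictor's distribution.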

# Part II : Preprocessing & Feature Engineering

- Perform preprocessing on the dataset. Consider the following tasks and carry out the steps that apply:
  - Handling missing values
  - Handling outliers
  - Scaling and normalization
  - Feature Engineering
  - Handling imbalanced data
<br><br>
- ##### For cleaning data, go to the notebook: https://github.com/leoaugusto1976/LHL-Supervised-Learning-Project/blob/main/notebooks/1.diabetes.ipynb
- ##### For feature engineering, go to the notebook: https://github.com/leoaugusto1976/LHL-Supervised-Learning-Project/blob/main/notebooks/3.feature_engineering.ipynb
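
The first three preprocessing tasks can be sketched as below. This is an illustration under stated assumptions, not the method used in the notebooks: median imputation, IQR-based clipping, and standard scaling are one common choice among several, and the two columns and their values are synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic data with one missing value and one extreme insulin reading.
df = pd.DataFrame({
    "Glucose": [148.0, 85.0, np.nan, 89.0, 137.0, 116.0],
    "Insulin": [0.0, 94.0, 168.0, 88.0, 846.0, 120.0],
})

# 1. Missing values: fill each column with its own median.
df = df.fillna(df.median())

# 2. Outliers: clip each column to [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
q1, q3 = df.quantile(0.25), df.quantile(0.75)
iqr = q3 - q1
df = df.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr, axis=1)

# 3. Scaling: standardize each column to zero mean and unit variance.
scaled = pd.DataFrame(
    StandardScaler().fit_transform(df),
    columns=df.columns,
)

print(df)
print(scaled.round(2))
```

In a real pipeline the imputation values, clipping bounds, and scaler would normally be fitted on the training split only and then applied to the test split, so that no information from the test set leaks into training.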

# Part III : Training ML Model

For this task, you are required to build a machine learning model to predict the outcome variable. This will be a binary classification task, as the target variable is binary. You should select at least two models, one of which should be an ensemble model, and compare their performance.

- Train the models: Train the selected models on the training set.
- Model evaluation: Evaluate the trained models on the testing set using appropriate evaluation metrics, such as accuracy, precision, recall, F1-score, and ROC-AUC.
- Model comparison: Compare the performance of the selected models and choose the best-performing model based on the evaluation metrics. You can also perform additional analysis, such as model tuning and cross-validation, to improve the model's performance.

- ##### Notebook: https://github.com/leoaugusto1976/LHL-Supervised-Learning-Project/blob/main/notebooks/4.machine_learning.ipynb
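
The train, evaluate, and compare steps can be sketched with scikit-learn as follows. This is a hedged example rather than the notebook's code: the data comes from `make_classification` instead of the diabetes dataset, and the hyperparameters (`max_iter`, `n_estimators`, `test_size`, `random_state`) are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the diabetes features.
X, y = make_classification(n_samples=400, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# One linear model and one ensemble model, as the task requires.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # ROC-AUC needs scores, so use the predicted probability of class 1.
    y_score = model.predict_proba(X_test)[:, 1]
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_score),
    }

for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

For a less split-dependent comparison, `sklearn.model_selection.cross_val_score` can be run on each model with the same scoring metric.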

# Part IV : Conclusion

From the machine learning models developed and the exploratory data analysis (EDA) conducted, generate four bullet points as your findings.

#### Result 1: BMI Category vs Patient has diabetes, Stacked by Age Group

- This result shows how individuals are distributed across BMI categories (Underweight, Normal, Overweight, Obese) by diabetes status, separated by age group (Young and Adult).
- The Obese category contains the most individuals in both the Young and Adult age groups, and a large share of them have diabetes.
- No individuals in the Underweight category have diabetes in either the Young or the Adult group.

#### Result 2: Insulin Level vs Patient has diabetes, Stacked by Age Group

- This result shows how insulin levels (low, normal, elevated, high) relate to the presence of diabetes, broken down by age group (Young and Adult).
- Individuals with "insulin normal" levels are the most common in both age groups, and a large share of them have diabetes.
- The "insulin elevated" group also contains a substantial number of individuals with diabetes.

#### Result 3: Glucose Level vs Patient has diabetes, Stacked by Age Group

- This result shows how glucose levels (low, normal, prediabetes, diabetes) relate to the presence of diabetes, broken down by age group (Young and Adult).
- Many individuals in the "glucose prediabetes" and "glucose diabetes" categories have diabetes in both age groups.
- The "glucose normal" category also contains a notable number of diabetes cases, especially in the Adult group.

These results describe how BMI, insulin, and glucose categories relate to the presence of diabetes in this dataset. They are a descriptive snapshot: further statistical analysis or modeling is needed to draw more robust conclusions or to make predictions, and the results should be interpreted in light of the context and objectives of the analysis.
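
Category counts like those in Results 1 to 3 can be produced by binning a numeric column and cross-tabulating it against the outcome. The sketch below is illustrative only: the BMI cut-offs follow the common 18.5 / 25 / 30 thresholds, the age boundary of 30 between Young and Adult is an assumption (the README does not state the boundaries used), and the data is synthetic.

```python
import pandas as pd

# Synthetic rows standing in for the diabetes dataset.
df = pd.DataFrame({
    "BMI": [17.0, 22.0, 27.5, 31.0, 35.2, 24.9, 30.0, 40.1],
    "Age": [22, 25, 41, 29, 52, 33, 23, 45],
    "Outcome": [0, 0, 1, 1, 1, 0, 0, 1],
})

# Bin BMI into categories; right=False makes each interval [low, high).
df["bmi_category"] = pd.cut(
    df["BMI"],
    bins=[0, 18.5, 25, 30, float("inf")],
    right=False,
    labels=["Underweight", "Normal", "Overweight", "Obese"],
)

# Assumed boundary: under 30 is "Young", 30 and over is "Adult".
df["age_group"] = pd.cut(
    df["Age"],
    bins=[0, 30, float("inf")],
    right=False,
    labels=["Young", "Adult"],
)

# Count individuals per age group and BMI category, split by outcome.
counts = pd.crosstab([df["age_group"], df["bmi_category"]], df["Outcome"])
print(counts)
```

Calling `counts.plot(kind="bar", stacked=True)` on a table like this gives a stacked bar chart of category against diabetes status.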

#### Machine Learning key findings

- **Balanced Class Performance**: Undersampling balanced the class distribution in the dataset and improved the performance of both Logistic Regression and the Random Forest Classifier. On the balanced data, both models were evaluated on accuracy, precision, recall, F1 score, and ROC AUC score for both classes (Outcome 0 and Outcome 1).

- **Logistic Regression Findings**:
  - The Logistic Regression model achieves an accuracy of 0.71 and an F1 score of 0.76, so it predicts diabetes outcomes reasonably well.
  - Its recall is higher for class 1 (individuals with diabetes), at 0.85, than for class 0, which means it is effective at identifying positive cases.

- **Random Forest Classifier Findings**:
  - The Random Forest Classifier achieves an accuracy of 0.87 and an F1 score of 0.89, which indicates strong predictive capability for both classes.
  - It reaches a high recall (0.92) for class 1, so it identifies most individuals with diabetes, while keeping a respectable recall (0.81) for class 0.

- **Model Selection**: The Random Forest Classifier outperforms the Logistic Regression model on accuracy, precision, recall, F1 score, and ROC AUC score. It is the preferred model for this balanced dataset because its metrics are both higher and well balanced across the two classes.

These findings indicate that undersampling was effective at balancing the dataset and improving model performance. The Random Forest Classifier in particular is a strong choice for this task, since it predicts both class 0 and class 1 outcomes accurately.
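
Random undersampling as described above can be sketched with pandas alone. This is a minimal illustration, not the project's implementation: the frame is synthetic, and it assumes undersampling is applied to the training split only, which the README does not state.

```python
import pandas as pd

# Synthetic, imbalanced training split: 8 negatives and 4 positives.
train = pd.DataFrame({
    "Glucose": range(100, 112),
    "Outcome": [0] * 8 + [1] * 4,
})

# Size of the smallest class; every class is sampled down to this size.
minority_size = int(train["Outcome"].value_counts().min())

# Draw the same number of rows from each class without replacement.
balanced = train.groupby("Outcome").sample(n=minority_size, random_state=42)

print(balanced["Outcome"].value_counts())
```

Undersampling discards majority-class rows, so on a small dataset it is worth comparing against alternatives such as class weights (for example `class_weight="balanced"` in scikit-learn estimators).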

#### Possible extension: adding gender
The dataset has no gender column, so gender-based differences could not be analyzed. Adding gender information would make it possible to explore whether diabetes risk, progression, or response to treatment differs between genders. With gender as a predictor variable, we could run more granular analyses, identify potential gender-specific risk factors, and tailor interventions for better diabetes management. As a basic demographic factor, gender may also improve the predictive models and inform personalized healthcare strategies.