# Predictive Analysis of Diabetes Dataset


## Introduction

## Preliminary exploratory data analysis

```r
library(repr)
library(tidyverse)
library(tidymodels)
library(ggplot2)

diabetes_data <- read_csv("diabetes.csv")
```


In this dataset:

- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.

Because each row holds one individual's values for every variable, the dataset is already tidy.

```r
# split the data frame into training and testing sets, stratified by Outcome
diabetes_split <- initial_split(diabetes_data, prop = 0.75, strata = Outcome)

diabetes_train <- training(diabetes_split)
diabetes_test <- testing(diabetes_split)

# summarize the mean of each predictor and the number of rows per class;
# missing_data_count is the total number of NA values in the whole
# training set (it is not computed per class)
diabetes_summary <- diabetes_train |>
  group_by(Outcome) |>
  summarize(mean_Pregnancies = mean(Pregnancies), mean_Glucose = mean(Glucose),
            mean_BloodPressure = mean(BloodPressure), mean_SkinThickness = mean(SkinThickness),
            mean_Insulin = mean(Insulin), mean_BMI = mean(BMI),
            mean_DiabetesPedigreeFunction = mean(DiabetesPedigreeFunction), mean_Age = mean(Age),
            n_count = n(), missing_data_count = sum(is.na(diabetes_train)))

diabetes_summary

# Scatter plots for numerical predictors against the outcome
numerical_predictors <- c("Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age")

# Create a list of ggplot objects for each numerical predictor
```

| Outcome | mean_Pregnancies | mean_Glucose | mean_BloodPressure | mean_SkinThickness | mean_Insulin | mean_BMI | mean_DiabetesPedigreeFunction | mean_Age | n_count | missing_data_count |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 3.162667 | 109.752 | 68.02933 | 19.77067 | 70.99467 | 30.07733 | 0.4385707 | 31.176 | 375 | 0 |
| 1 | 4.865672 | 139.4776 | 70.64179 | 22.37811 | 104.67164 | 35.03433 | 0.5647065 | 36.9602 | 201 | 0 |
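A list of per-predictor plots, as the last comment in the cell describes, could be built along the following lines. This is only a sketch under assumptions: `toy` is a small synthetic data frame standing in for the training data, the seed and the two-predictor subset are arbitrary, and the choice of a jittered scatter plot is ours rather than taken from the notebook.

```r
library(tidyverse)

set.seed(1)  # arbitrary seed, for a reproducible toy example

# Synthetic stand-in for the training data (values are made up)
toy <- tibble(
  Glucose = rnorm(100, mean = 120, sd = 30),
  BMI     = rnorm(100, mean = 32, sd = 6),
  Outcome = sample(c(0, 1), 100, replace = TRUE)
)

# Hypothetical subset of predictors, named so the resulting list is named too
predictors <- set_names(c("Glucose", "BMI"))

# One jittered scatter plot per predictor, with Outcome on the x-axis
predictor_plots <- map(predictors, \(p) {
  ggplot(toy, aes(x = factor(Outcome), y = .data[[p]])) +
    geom_jitter(width = 0.15, height = 0, alpha = 0.6) +
    labs(x = "Outcome", y = p)
})

predictor_plots$Glucose
```

Each element of `predictor_plots` is an ordinary ggplot object, so individual plots can be printed or restyled separately.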


## Methods

We used K-nearest neighbours (KNN) classification to predict the *Outcome* column, which records diabetes status as 0 (negative) or 1 (positive). Of the 8 columns other than the outcome, we chose *Glucose*, *BMI*, *Diabetes Pedigree Function*, and *Age* as predictors. *Glucose* and *BMI* were included as predictors because ... <span style="color:red">To be completed after Introduction section is finished ...</span>

The whole dataset was scaled and centred before being split into 75\% training and 25\% testing datasets. The KNN model was fitted to the training data using Euclidean distance. The number of nearest neighbours, $k$, was chosen by 5-fold cross-validation: the training data were divided into 5 equal subsets, and each subset in turn served as the validation set while the other four were used to fit the model for each candidate $k$. The value of $k$ was then selected to maximize the average validation **recall** across the 5 folds. We used recall (the fraction of truly positive cases that the model predicts as positive) as our evaluation metric because the model aims to detect as many diabetes patients as possible; false positives are acceptable because they can be handled by follow-up treatment procedures.
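The tuning procedure described above might be sketched with tidymodels as follows. Several things here are assumptions rather than part of the analysis: `toy` is synthetic data standing in for `diabetes.csv`, the seed and the grid of candidate $k$ values are arbitrary, and the `kknn` engine (which must be installed) is one possible KNN backend. The sketch also places centring and scaling inside a recipe, so the scaling statistics are estimated from the training folds only; this differs from scaling the whole dataset before splitting. The positive class is set as the first factor level because yardstick treats the first level as the event by default.

```r
library(tidymodels)

set.seed(2023)  # arbitrary seed, for reproducibility of the toy example

# Synthetic stand-in for diabetes.csv (values are made up for illustration)
n <- 200
toy <- tibble(
  Glucose = rnorm(n, mean = 120, sd = 30),
  BMI = rnorm(n, mean = 32, sd = 6),
  DiabetesPedigreeFunction = runif(n, min = 0.1, max = 1.5),
  Age = sample(21:70, n, replace = TRUE)
) |>
  mutate(Outcome = factor(
    if_else(Glucose + rnorm(n, sd = 20) > 130, "positive", "negative"),
    levels = c("positive", "negative")  # positive first, so recall targets it
  ))

toy_split <- initial_split(toy, prop = 0.75, strata = Outcome)
toy_train <- training(toy_split)

# Centre and scale the predictors as part of the recipe
knn_recipe <- recipe(Outcome ~ Glucose + BMI + DiabetesPedigreeFunction + Age,
                     data = toy_train) |>
  step_center(all_predictors()) |>
  step_scale(all_predictors())

# KNN with unweighted votes; kknn uses Euclidean distance by default
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

folds <- vfold_cv(toy_train, v = 5, strata = Outcome)
k_grid <- tibble(neighbors = seq(1, 15, by = 2))  # hypothetical candidates

# Mean recall over the 5 validation folds for each candidate k
cv_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = folds, grid = k_grid, metrics = metric_set(recall)) |>
  collect_metrics()

best_k <- cv_results |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)
```

`cv_results` holds one row per candidate $k$, and `best_k` is the candidate with the highest mean validation recall.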

The results were visualized as a confusion matrix, a table counting true versus predicted positive and negative cases, laid out as below:

|  | Predicted Positive | Predicted Negative |
| --- | --- | --- |
| **Actual Positive** | ... | ... |
| **Actual Negative** | ... | ... |
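A table of this kind, together with the recall value, could be computed with yardstick roughly as follows. The labels in `toy_predictions` are made up purely for illustration and are not results from the diabetes data. Note that `conf_mat()` puts predictions in rows and true classes in columns, which is the transpose of the layout shown above.

```r
library(yardstick)
library(tibble)

# Made-up labels for illustration only; not results from the diabetes data
toy_predictions <- tibble(
  truth = factor(
    c("positive", "positive", "positive", "positive",
      "negative", "negative", "negative", "negative"),
    levels = c("positive", "negative")
  ),
  .pred_class = factor(
    c("positive", "positive", "positive", "negative",
      "positive", "negative", "negative", "negative"),
    levels = c("positive", "negative")
  )
)

# Rows are predictions, columns are true classes
toy_conf_mat <- conf_mat(toy_predictions, truth = truth, estimate = .pred_class)

# Recall = true positives / (true positives + false negatives)
toy_recall <- recall(toy_predictions, truth = truth, estimate = .pred_class)

toy_conf_mat
toy_recall
```

In this toy example three of the four truly positive cases are predicted positive, so the recall is 3 / 4 = 0.75.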

## Expected outcomes and significance
