# Introduction

Diabetes is a chronic disease that affects millions of people worldwide, and it can lead to a range of serious complications if not managed properly.
By developing accurate predictive models for diabetes, healthcare providers can identify patients who are at high risk for developing complications or 
experiencing adverse health outcomes, and intervene early to prevent these outcomes.

Early identification of high-risk patients: By using predictive models, healthcare providers can identify patients who are at high risk for developing
complications such as kidney disease, neuropathy, or retinopathy, and intervene early to prevent or delay these complications.

More personalized care: Predictive models can help healthcare providers tailor their treatment plans to the specific needs of each patient, taking into 
account their individual risk factors, medical history, and other relevant factors.

Reduced healthcare costs: By identifying high-risk patients early and intervening to prevent complications, healthcare providers can reduce the overall
cost of care for diabetic patients.

Improved patient outcomes: By using predictive models to identify high-risk patients and intervene early, healthcare providers can improve patient outcomes
and quality of life.

Overall, studying predictive models for diabetic patients can help healthcare providers deliver more effective, personalized care, and ultimately improve
patient outcomes while reducing healthcare costs.

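We aim to find possible correlations between factors such as age, BMI, smoking habits, and income and a person's likelihood of having diabetes.

The dataset used for the project can be found at the following web address: 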
https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset

We used the second of the three files for this project. It was cleaned and wrangled from another data file, which can be found at the following web address: 
https://www.kaggle.com/cdc/behavioral-risk-factor-surveillance-system
This dataset was created from the annual telephone survey conducted by the Centers for Disease Control and Prevention (CDC) under the Behavioral Risk Factor Surveillance System (BRFSS) program. 

Finally, we adapted the dataset and published it online using GitHub. This is the dataset we read in our project. It can be found here:
https://drive.google.com/u/0/uc?id=1OAZCpZGdFPy70ll_Fo2ow5dpaM1sG_47&export=download

!!! Describe the dataset file we use: which factors and columns does it involve?
!!! More detailed question: how does a combination of factors such as Age, BMI, Smoking habits, and 
Income affect a non-diabetic person's chances of acquiring type-2 diabetes?

# Method and Result
    
While there are numerous ways to conduct such an analysis, we find that a K-nearest neighbors (KNN) classification model best suits our research purpose for two reasons: 
    reason 1: our target is a categorical variable (diabetic or non-diabetic), so this is a classification problem
    reason 2: KNN is a relatively simple algorithm to implement, since it only requires calculating distances between data points and selecting the k nearest neighbors.
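
To illustrate the second reason, here is a minimal base-R sketch of the K-nearest neighbors idea: compute distances, take the k closest points, and let them vote. The helper `knn_predict` and the toy points are hypothetical and only for illustration; the analysis itself relies on tidymodels rather than on this function.

```r
# Minimal K-nearest neighbors classifier (illustrative helper, not a library function)
knn_predict <- function(train_x, train_y, new_x, k = 3) {
  # Euclidean distance from the new point to every training point
  dists <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  # labels of the k closest training points
  nearest <- train_y[order(dists)[seq_len(k)]]
  # majority vote among those labels (ties go to the first level)
  names(which.max(table(nearest)))
}

# Toy training data (made-up values): two predictors, two classes
train_x <- matrix(c(1, 1,
                    1, 2,
                    2, 1,
                    6, 6,
                    7, 6,
                    6, 7), ncol = 2, byrow = TRUE)
train_y <- factor(c("0", "0", "0", "1", "1", "1"))

pred_high <- knn_predict(train_x, train_y, new_x = c(6.5, 6.5), k = 3)
pred_low  <- knn_predict(train_x, train_y, new_x = c(1.5, 1.5), k = 3)
```

In practice the predictors would be standardized first, so that no single variable dominates the distance.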

# should we improve these methods, and how?
We shall first run some preliminary analysis before deploying more sophisticated tools to analyze areas of interest. Our classifier model will take variables such as a person's age, BMI, smoking habits, and income into account to identify the non-diabetic respondents most at risk of acquiring diabetes. 
This can be done in several ways. For instance, the average distance to the 3 nearest diabetic patients can be used to estimate how likely a person is to acquire diabetes. Survey participants whose average distance to the 3 nearest diabetics is below a certain threshold can be classified as most at risk.
Another possible method is the K-nearest neighbors classification algorithm, although this method is better suited to classifying someone as diabetic or non-diabetic than to predicting which respondents are most at risk.
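
The average-distance idea can be sketched in base R as follows. This is only an illustration under stated assumptions: the helper `mean_dist_to_nearest_diabetics`, the column names, and the toy values (treated as already standardized) are all hypothetical, and the threshold that defines "at risk" is left open.

```r
# Toy data (made-up, already standardized): two predictors and a diabetes label
toy <- data.frame(
  bmi      = c(-1.2, -0.8, 0.1, 0.9, 1.1, 1.4, -0.3, 0.6),
  age      = c(-1.0, -0.5, 0.2, 0.8, 1.2, 0.9, -0.9, 0.4),
  diabetic = c(0, 0, 0, 1, 1, 1, 0, 1)
)

# For each non-diabetic row, average the Euclidean distance to the
# k nearest diabetic rows (hypothetical helper; assumes at least k diabetics)
mean_dist_to_nearest_diabetics <- function(data, predictors, k = 3) {
  d <- as.matrix(dist(as.matrix(data[, predictors])))
  diabetic_idx <- which(data$diabetic == 1)
  non_diabetic_idx <- which(data$diabetic == 0)
  sapply(non_diabetic_idx, function(i) {
    mean(sort(d[i, diabetic_idx])[seq_len(k)])
  })
}

# One score per non-diabetic respondent; smaller means closer to diabetics
risk_scores <- mean_dist_to_nearest_diabetics(toy, c("bmi", "age"), k = 3)
risk_scores
```

A smaller score means the respondent sits closer to known diabetics in predictor space; respondents below a chosen cutoff would be flagged as most at risk.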

# my draft for method section
1. We run some preliminary analysis to select factors from the raw dataset and identify potential predictors for our classification model.
2. To build a good-quality classification model, we split the data into a training dataset and a testing dataset; the testing dataset is used to evaluate the quality of the final classifier.
/// only preprocess the training dataset


3. We then tune the model on the training dataset using cross-validation. The classifier needs a value of K that maximizes accuracy. If we split the training data only once to evaluate it, the chosen K may depend strongly on that particular validation and sub-training split, which can lead to overfitting or underfitting.
4. In R, we can use the vfold_cv function to conduct cross-validation. To use this function, we need to specify the number of folds (v) and the categorical class variable to stratify on (strata).
5. Create a new model specification for K-nearest neighbors, but instead of fixing a value for K, set neighbors = tune().
6. Create a workflow() that combines our recipe with the new knn_tune model specification.
7. Plot accuracy against K as a line plot. We should pick a K that gives roughly optimal accuracy, is not too expensive to compute (not too large), and whose accuracy does not change much if K is changed to a nearby value.

8. Finally, put this K into our final classifier, train the model on the training dataset, and use the testing dataset to evaluate the predictive model.
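
The steps above can be sketched end to end with tidymodels. This is a minimal sketch on synthetic stand-in data, not the project's actual analysis: the `toy` tibble, its columns `x1`, `x2`, and `class`, and the grid of K values are assumptions made for illustration, and the `kknn` package is assumed to be installed.

```r
library(tidymodels)

set.seed(1)

# Synthetic stand-in data (hypothetical): two numeric predictors, binary class
n <- 200
toy <- tibble(
  x1 = c(rnorm(n / 2, mean = 0), rnorm(n / 2, mean = 2)),
  x2 = c(rnorm(n / 2, mean = 0), rnorm(n / 2, mean = 2)),
  class = factor(rep(c("0", "1"), each = n / 2))
)

# Step 2: split into training and testing sets, stratified by class
toy_split <- initial_split(toy, prop = 0.75, strata = class)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Preprocessing is defined on the training data only
toy_recipe <- recipe(class ~ x1 + x2, data = toy_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# Steps 3-5: cross-validation folds and a model spec with K left to tune
toy_vfold <- vfold_cv(toy_train, v = 5, strata = class)
knn_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")
k_vals <- tibble(neighbors = seq(from = 1, to = 15, by = 2))

# Step 6: workflow combining recipe and model, tuned over the K grid
knn_results <- workflow() |>
  add_recipe(toy_recipe) |>
  add_model(knn_tune) |>
  tune_grid(resamples = toy_vfold, grid = k_vals) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

# Step 7: accuracy versus K
accuracy_vs_k <- ggplot(knn_results, aes(x = neighbors, y = mean)) +
  geom_point() +
  geom_line() +
  labs(x = "Neighbors (K)", y = "Cross-validation accuracy estimate")

best_k <- knn_results |>
  arrange(desc(mean)) |>
  slice(1) |>
  pull(neighbors)

# Step 8: refit with the chosen K and evaluate on the test set
knn_best <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_fit <- workflow() |>
  add_recipe(toy_recipe) |>
  add_model(knn_best) |>
  fit(data = toy_train)

test_accuracy <- predict(knn_fit, toy_test) |>
  bind_cols(toy_test) |>
  metrics(truth = class, estimate = .pred_class) |>
  filter(.metric == "accuracy")
```

Here `best_k` is simply the K with the highest mean accuracy; as step 7 notes, it is worth inspecting `accuracy_vs_k` and preferring a K whose neighbors perform similarly.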


In [40]:
library(tidyverse)
library(repr)
library(tidymodels)
library(GGally)
library(ISLR)
#options(repr.matrix.max.rows = 6)
#source('tests.R')
#source('cleanup.R')

url <- "https://drive.google.com/u/0/uc?id=1OAZCpZGdFPy70ll_Fo2ow5dpaM1sG_47&export=download"
diabetes_data <- read_csv(url)

summary(diabetes_data)

Rows: 70692 Columns: 22
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (22): Diabetes_binary, HighBP, HighChol, CholCheck, BMI, Smoker, Stroke,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


 Diabetes_binary     HighBP          HighChol        CholCheck     
 Min.   :0.0     Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.0     1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:1.0000  
 Median :0.5     Median :1.0000   Median :1.0000   Median :1.0000  
 Mean   :0.5     Mean   :0.5635   Mean   :0.5257   Mean   :0.9753  
 3rd Qu.:1.0     3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
 Max.   :1.0     Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
      BMI            Smoker           Stroke        HeartDiseaseorAttack
 Min.   :12.00   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000      
 1st Qu.:25.00   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000      
 Median :29.00   Median :0.0000   Median :0.00000   Median :0.0000      
 Mean   :29.86   Mean   :0.4753   Mean   :0.06217   Mean   :0.1478      
 3rd Qu.:33.00   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:0.0000      
 Max.   :98.00   Max.   :1.0000   Max.   :1.00000   Max.   :1.0000      
  PhysActivit

In [41]:
# make more readable, slice the data
slice(diabetes_data, 1:10)

Diabetes_binary,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,Fruits,⋯,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
0,1,0,1,26,0,0,0,1,0,⋯,1,0,3,5,30,0,1,4,6,8
0,1,1,1,26,1,1,0,0,1,⋯,1,0,3,0,0,0,1,12,6,8
0,0,0,1,26,0,0,0,1,1,⋯,1,0,1,0,10,0,1,13,6,8
0,1,1,1,28,1,0,0,1,1,⋯,1,0,3,0,3,0,1,11,6,8
0,0,0,1,29,1,0,0,1,1,⋯,1,0,2,0,0,0,0,8,5,8
0,0,0,1,18,0,0,0,1,1,⋯,0,0,2,7,0,0,0,1,4,7
0,0,1,1,26,1,0,0,1,1,⋯,1,0,1,0,0,0,1,13,5,6
0,0,0,1,31,1,0,0,0,1,⋯,1,0,4,0,0,0,1,6,4,3
0,0,0,1,32,0,0,0,1,1,⋯,1,0,3,0,0,0,0,3,6,8
0,0,0,1,27,1,0,0,0,1,⋯,1,0,3,0,6,0,1,6,4,4


# ??? Do an exploratory data analysis to identify the potential predictors? 
# !! needs to be made more concise


# Finding a good subset of predictors

The whole dataset contains 21 explanatory variables. Because we use KNN classification, we keep only the numerical variables, which we expect to give more accurate predictions.
Variables that take only the values 0 or 1 may strongly influence the result after standardization.
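
A small base-R illustration of this concern follows. The indicator below is made up (1 for 6 of 100 respondents, roughly the rate the summary shows for Stroke); it is an assumption for illustration, not a column read from the dataset.

```r
# Hypothetical rare binary indicator: 1 for 6 of 100 respondents
indicator <- c(rep(1, 6), rep(0, 94))

# Standardize: subtract the mean and divide by the standard deviation
indicator_std <- as.numeric(scale(indicator))

# Distance between the two categories after standardizing;
# this equals 1 / sd(indicator), roughly 4.2 here
gap <- max(indicator_std) - min(indicator_std)
gap
```

After standardization, switching this indicator from 0 to 1 moves a point by about 4 standard deviations, as much as a very large difference in a continuous variable such as BMI, so a rare binary variable can dominate the distances KNN relies on.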


First, we mutate Diabetes_binary into a categorical (factor) variable.
Then we keep only the numerical explanatory variables, since KNN classification is a distance-based algorithm: it calculates distances between data points to determine the nearest neighbors. 
Finally, we slice the first 10 rows to keep the output readable.

We also exclude the Smoker variable, since it is a categorical variable. Although it may be related to diabetes, our further exploratory analysis uses forward selection, so we make sure that all predictors are numerical variables; only the response, Diabetes_binary, is categorical.
However, we will separately analyze the relationship between Smoker and diabetes using a visualization. 

In [42]:
diabetes_data <- diabetes_data |>
    # mutate(Smoker = as.logical(Smoker)) |>
    mutate(Diabetes_binary = as.factor(Diabetes_binary)) |>
    select(-Smoker,
           -PhysActivity,
          -Veggies,
          -HvyAlcoholConsump, -AnyHealthcare, -NoDocbcCost, 
          -DiffWalk,
          -Sex,-HighBP, -HighChol, -CholCheck,
          -Stroke,-HeartDiseaseorAttack, -Fruits) |>
      drop_na()  # drop rows with missing values, if any

    # select(BMI,Age,Income,Smoker,Diabetes_binary)

slice(diabetes_data, 1:10)

Diabetes_binary,BMI,GenHlth,MentHlth,PhysHlth,Age,Education,Income
<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
0,26,3,5,30,4,6,8
0,26,3,0,0,12,6,8
0,26,1,0,10,13,6,8
0,28,3,0,3,11,6,8
0,29,2,0,0,8,5,8
0,18,2,7,0,1,4,7
0,26,1,0,0,13,5,6
0,31,4,0,0,6,4,3
0,32,3,0,0,3,6,8
0,27,3,0,6,6,4,4


We now have only 7 explanatory variables and need to further identify the relevant predictors that maximize the classifier's accuracy. 

# Forward selection method
///result have NA??
We use forward selection (Efroymson 1966; Draper and Smith 1966) to build up a model by adding one predictor variable at a time. First, we create a model formula for each subset of predictors for which we want to build a model.
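
To give a sense of the cost of this search, the count below is a sketch assuming p candidate predictors and that each round tries every predictor not yet selected; p = 7 is read off the columns kept after the selection step. It compares forward selection with trying every non-empty subset of predictors.

```latex
\[
\underbrace{\sum_{i=1}^{p} (p - i + 1)}_{\text{forward selection}} = \frac{p(p+1)}{2},
\qquad
\underbrace{2^{p} - 1}_{\text{all non-empty subsets}}
\]
\[
p = 7: \quad \frac{7 \cdot 8}{2} = 28 \quad \text{versus} \quad 2^{7} - 1 = 127
\]
```

Each of these candidate models is itself tuned with cross-validation, so the total number of model fits is a multiple of this count.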

In [48]:
names <- colnames(diabetes_data |> select(-Diabetes_binary))
example_formula <- paste("Diabetes_binary", "~", paste(names, collapse="+"))

We need to create an empty tibble to store the results, adding one row for the best set of predictors of each size.
We then create the KNN model specification and the cross-validation folds used to evaluate each combination of predictors.

In [49]:
accuracies <- tibble(size = integer(), 
                     model_string = character(), 
                     accuracy = numeric())

knn_spec <- nearest_neighbor(weight_func = "rectangular", 
                             neighbors = tune()) |>
     set_engine("kknn") |>
     set_mode("classification")

diabetes_vfold <- vfold_cv(diabetes_data, v = 5, strata = Diabetes_binary)

n_total <- length(names)

selected <- c()

Here we use two for loops: one over increasing predictor set sizes (where you see for (i in 1:n_total) below), and another to check which predictor to add in each round (where you see for (j in 1:length(names)) below). For each set of predictors to try, we construct a model formula, pass it into a recipe, build a workflow, and tune the KNN classifier with cross-validation.

In [None]:
for (i in 1:n_total) {
    # for every predictor still not added yet
    accs <- list()
    models <- list()
    for (j in 1:length(names)) {
        # create a model string for this combination of predictors
        preds_new <- c(selected, names[[j]])
        model_string <- paste("Diabetes_binary", "~", paste(preds_new, collapse="+"))

        # create a recipe from the model string
        diabetes_recipe <- recipe(as.formula(model_string), 
                                data = diabetes_data) |>
                          step_scale(all_predictors()) |>
                          step_center(all_predictors())

        # tune the KNN classifier with these predictors, 
        # and collect the accuracy for the best K
        acc <- workflow() |>
          add_recipe(diabetes_recipe) |>
          add_model(knn_spec) |>
          tune_grid(resamples = diabetes_vfold, grid = 10) |>
          collect_metrics() |>
          filter(.metric == "accuracy") |>
          summarize(mx = max(mean))
        acc <- acc$mx |> unlist()

        # add this result to the dataframe
        accs[[j]] <- acc
        models[[j]] <- model_string
    }
    jstar <- which.max(unlist(accs))
    accuracies <- accuracies |> 
      add_row(size = i, 
              model_string = models[[jstar]], 
              accuracy = accs[[jstar]])
    selected <- c(selected, names[[jstar]])
    names <- names[-jstar]
}
accuracies
        

We can visualize the relationship between different variables and accuracy.
...

# ??? Visualization for the dataset after exploratory data analysis: use a diagram to show the relationship between each pair of factors? Maybe use ggpairs???
/// We also need to improve the visualization to show, concisely and clearly, whether Smoker and diabetes are related

# Using ggpairs to visualize the explanatory variables 


In [None]:
explanatory_variables_visualization <- ggpairs(diabetes_data,
        # plot the numerical explanatory variables, colored by diabetes status
        columns = c("Income",
                    "BMI",
                    "GenHlth",
                    "MentHlth",
                    "PhysHlth",
                    "Age",
                    "Education"),
        mapping = aes(color = Diabetes_binary))
        
explanatory_variables_visualization

# How should we interpret the plot?
# Do Age, GenHlth, and PhysHlth show a stronger relationship in the ggpairs plot?

# Building the classifier and tuning the model

In [None]:
set.seed(1)
diabetes_split <- initial_split(diabetes_data, prop = 0.75, strata = Diabetes_binary)
diabetes_train <- training(diabetes_split)
diabetes_test <- testing(diabetes_split)

# show the first rows to keep the output readable (the split already shuffles the data)
#should we show that the dataset is balanced and stra
slice(diabetes_train,1:10)
slice(diabetes_test,1:10)
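
One way to check that a stratified split keeps the classes balanced is to count the class proportions in the training set. The sketch below uses a synthetic stand-in (`toy`, with a made-up predictor `x` and a balanced `class` column) rather than the project data, so the names are assumptions for illustration.

```r
library(tidymodels)

set.seed(1)

# Synthetic stand-in data (hypothetical): a balanced binary class
toy <- tibble(
  x = rnorm(400),
  class = factor(rep(c("0", "1"), each = 200))
)

# Stratified split: sampling is done within each class
toy_split <- initial_split(toy, prop = 0.75, strata = class)

# Class counts and proportions in the training set
train_balance <- training(toy_split) |>
  count(class) |>
  mutate(prop = n / sum(n))

train_balance
```

The same `count()` and `mutate()` pattern applied to diabetes_train and diabetes_test would show whether both keep roughly the class proportions of the full dataset.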

In [None]:
# tune and build the classifier using cross-validation

In [None]:
# put the K with the best accuracy into the classifier and train it

In [None]:
# use the test dataset to evaluate the quality of the classifier, then interpret the result

In [None]:
# ?? create a visualization for the analysis? how to do that with a multivariable classifier?

# Discussion: (writing part)
    1. What did we find? Based on: 
        classifier accuracy
        do the predictors help to improve accuracy? ...
        
    2. Does it differ from our expectations?
    3. What is the impact of this finding?
    4. What further questions does it lead to? For instance, can a similar model be built for other diseases that do not have outright physical indicators? If models can be built to predict or classify people as healthy, at-risk, or patient, could a model also be built to help cure diabetes and similar diseases?