# Group 17 DSCI Project (Section 007)
### Predicting diabetes based on demographic attributes, medical history, and clinical measurements

Darby Quinn #19752633 <br>
Manav Parikh #13928775 <br>
Nitya Goel #89433221 <br>
Reeva Bansal #68061514 <br>

### Introduction

Diabetes is a common condition that affects people of all ages. In individuals with diabetes, the body either doesn't make enough insulin or can't use insulin properly. This can cause serious health problems such as heart disease, vision loss, kidney disease and high blood pressure (BP). Certain medical factors (e.g. BMI, age, insulin and glucose levels) can help predict whether an individual is likely to have or develop diabetes. Predicting diabetes from these factors can help ensure that precautions are taken or treatments are administered to manage its negative short- and long-term effects.

The question we are answering is: **Can we classify whether or not someone has diabetes based on their blood pressure, BMI and age?**

The dataset we are using to answer this question is a diabetes dataset from Kaggle with nine columns: number of pregnancies; blood glucose level; BP; skin thickness; blood insulin level; BMI; diabetes pedigree function; age; and outcome. It is provided as .csv files pre-divided into training and testing data, with no missing values. The dataset has previously been used to predict diabetes outcome with classification, as well as to study risk factors and diabetes management.

In [1]:
# loading libraries needed to perform classification and analysis
library(tidyverse)
library(dplyr)
library(repr)
library(tidymodels)

“package ‘ggplot2’ was built under R version 4.3.2”
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom

### Methods & Results

The general procedure we will follow is:
1. Reading the data into R from the web
2. Ensuring the data is in a tidy format
3. Summarizing the data to select the predictor variables
4. Determining the best K-value to use by cross-validation
5. Creating the K-nearest neighbors classifier with the training set
6. Finalizing the model and recipe needed to train the classifier
7. Determining the accuracy of the classifier
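
As a rough illustration of how steps 4 to 7 might fit together in tidymodels, the sketch below tunes K with 5-fold cross-validation and then scores a final model on held-out rows. It is not the project's analysis: `toy_data` is randomly generated stand-in data, the seed, the grid of K values, and the 75/25 split are arbitrary choices, and the `kknn` package is assumed to be installed as the engine.

```r
library(tidymodels)

set.seed(2024)  # arbitrary seed so the split and folds are reproducible

# Made-up stand-in data (NOT the Kaggle dataset); column names mirror the project
n_rows <- 200
toy_data <- tibble(
  BMI = rnorm(n_rows, mean = 32, sd = 7),
  Blood_Pressure = rnorm(n_rows, mean = 70, sd = 12),
  Age = round(runif(n_rows, min = 21, max = 80))
) |>
  mutate(Diabetes_Outcome = factor(if_else(BMI + rnorm(n_rows, sd = 5) > 33, "Yes", "No")))

# Hold out a quarter of the toy rows to play the role of the testing set
toy_split <- initial_split(toy_data, prop = 0.75, strata = Diabetes_Outcome)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Recipe: standardize predictors so no single variable dominates the distances
knn_recipe <- recipe(Diabetes_Outcome ~ ., data = toy_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# Model specification with K left open for tuning
knn_tune_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# Step 4: cross-validate a small, arbitrary grid of K values
folds <- vfold_cv(toy_train, v = 5, strata = Diabetes_Outcome)
k_grid <- tibble(neighbors = seq(from = 1, to = 21, by = 4))

cv_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_tune_spec) |>
  tune_grid(resamples = folds, grid = k_grid) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

best_k <- cv_results |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

# Steps 5 and 6: refit on the full training portion with the chosen K
final_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("classification")

final_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(final_spec) |>
  fit(data = toy_train)

# Step 7: accuracy on the held-out rows
toy_accuracy <- predict(final_fit, toy_test) |>
  bind_cols(toy_test) |>
  metrics(truth = Diabetes_Outcome, estimate = .pred_class) |>
  filter(.metric == "accuracy")
```

Stratifying the split and the folds on `Diabetes_Outcome` keeps the class proportions similar across pieces, which matters when one class is less common than the other.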

#### 1. Reading the data into R from the web
The dataset is hosted on Kaggle at https://www.kaggle.com/datasets/ehababoelnaga/diabetes-dataset/data. We load copies of the training and testing files from our GitHub repository into R using the `download.file` and `read.csv` functions, then display the first six observations of each set.

In [2]:
# reading in training data
training_url <- "https://raw.githubusercontent.com/nityag11/DSCI-100-group17-project/main/Training%20(1).csv"
download.file(training_url, "Training%20(1).csv")
training_data <- read.csv("Training%20(1).csv")
head(training_data)

# reading in testing data
testing_url <- "https://raw.githubusercontent.com/nityag11/DSCI-100-group17-project/main/Testing.csv"
download.file(testing_url, "Testing.csv")
testing_data <- read.csv("Testing.csv")
head(testing_data)

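| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<dbl>` | `<dbl>` | `<int>` | `<int>` |
| 1 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 2 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 3 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 4 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 5 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| 6 | 5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 | 0 |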


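| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<dbl>` | `<dbl>` | `<int>` | `<int>` |
| 1 | 9 | 120 | 72 | 22 | 56 | 20.8 | 0.733 | 48 | 0 |
| 2 | 1 | 71 | 62 | 0 | 0 | 21.8 | 0.416 | 26 | 0 |
| 3 | 8 | 74 | 70 | 40 | 49 | 35.3 | 0.705 | 39 | 0 |
| 4 | 5 | 88 | 78 | 30 | 0 | 27.6 | 0.258 | 37 | 0 |
| 5 | 10 | 115 | 98 | 0 | 0 | 24.0 | 1.022 | 34 | 0 |
| 6 | 0 | 124 | 56 | 13 | 105 | 21.8 | 0.452 | 21 | 0 |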
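
The previews contain no `NA` entries, but several `SkinThickness` and `Insulin` values are 0. Whether those zeros are genuine measurements or placeholders for unrecorded values is not stated in the source, so treating them as missing would be an assumption. A minimal sketch of how one might count both kinds of entry, using only the six training rows shown above (the `preview` tibble is a hand-copied subset for illustration, not the full dataset):

```r
library(dplyr)
library(tibble)

# Six training rows copied from the preview above (a subset of the columns)
preview <- tribble(
  ~Glucose, ~BloodPressure, ~SkinThickness, ~Insulin, ~BMI,
  148, 72, 35,   0, 33.6,
   85, 66, 29,   0, 26.6,
  183, 64,  0,   0, 23.3,
   89, 66, 23,  94, 28.1,
  137, 40, 35, 168, 43.1,
  116, 74,  0,   0, 25.6
)

# Count explicit missing values (NA) in each column
na_counts <- preview |>
  summarize(across(everything(), ~ sum(is.na(.x))))

# Count zeros in each column, which may or may not encode missing measurements
zero_counts <- preview |>
  summarize(across(everything(), ~ sum(.x == 0)))

zero_counts
```

Running the same two summaries on `training_data` and `testing_data` would show how common such zeros are in the columns chosen as predictors.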


In [37]:
# converting the outcome to a labelled factor and keeping only the chosen predictors
training_data <- training_data |>
    mutate(Outcome = as_factor(Outcome)) |>
    mutate(Outcome = fct_recode(Outcome, "Yes" = "1", "No" = "0")) |>
    select(BMI, BloodPressure, Age, Outcome) |>
    rename(Diabetes_Outcome = Outcome, Blood_Pressure = BloodPressure)
head(training_data)

ℹ In argument: `Outcome = fct_recode(Outcome, Yes = "1", No = "0")`.
! Unknown levels in `f`: 1, 0”


| | BMI | Blood_Pressure | Age | Diabetes_Outcome |
|---|---|---|---|---|
| | `<dbl>` | `<int>` | `<int>` | `<fct>` |
| 1 | 33.6 | 72 | 50 | Yes |
| 2 | 26.6 | 66 | 31 | No |
| 3 | 23.3 | 64 | 32 | Yes |
| 4 | 28.1 | 66 | 21 | No |
| 5 | 43.1 | 40 | 33 | Yes |
| 6 | 25.6 | 74 | 30 | No |
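
The `Unknown levels` warning printed with this cell is what `fct_recode` reports when the levels it is asked to rename are not present. One plausible cause, which we have not confirmed, is that the recoding was applied to a column whose levels were already `Yes` and `No`, for example after re-running the cell. A minimal reproduction on a hypothetical vector (the six outcome values are copied from the training preview):

```r
library(forcats)

# Outcome values from the first six training rows
outcome <- c(1L, 0L, 1L, 0L, 1L, 0L)

# First pass: levels "0" and "1" exist, so the recode succeeds silently
first_pass <- fct_recode(as_factor(outcome), "Yes" = "1", "No" = "0")

# Second pass on the already-recoded factor: "1" and "0" are no longer levels,
# so fct_recode warns and returns the factor unchanged
warning_text <- NULL
second_pass <- withCallingHandlers(
  fct_recode(first_pass, "Yes" = "1", "No" = "0"),
  warning = function(w) {
    warning_text <<- conditionMessage(w)  # capture the message for inspection
    invokeRestart("muffleWarning")
  }
)

warning_text
```

Because the second pass leaves the values untouched, the table above still shows the intended `Yes` and `No` labels despite the warning.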
