# Lab Three: Extending Logistic Regression

Team: Miro Ronac, Kirk Watson, Brandon Vincitore

## 1. Preparation and Overview

### 1.1 Task and Use-case

This dataset can help identify prediabetes or diabetes in patients and assist doctors in making accurate assessments from a variety of health indicators.

Every year, the CDC collects data from a health-related telephone survey called the Behavioral Risk Factor Surveillance System. The data gathered from these surveys include information on “health-related risk behaviors, chronic health conditions, and use of preventive services.” This dataset focuses on responses from 2015 and diabetes, a “prevalent chronic disease in the United States.”

Ultimately, the goal of analyzing this dataset is to identify patients with prediabetes or diabetes more efficiently and accurately. With this capability, a diabetes diagnosis can be reached faster than when a human makes the diagnosis alone. Diabetes is extremely common in the US: about 1 in 10 Americans have diabetes, and [about 1 in 5 people with diabetes don’t know they have it](https://www.cdc.gov/diabetes/library/spotlights/diabetes-facts-stats.html#:~:text=37.3%20million%20Americans%E2%80%94about%201,t%20know%20they%20have%20it.). In addition, 1 in 3 Americans have prediabetes, and [more than 8 in 10 adults with prediabetes don’t know they have it](https://www.cdc.gov/diabetes/library/spotlights/diabetes-facts-stats.html#:~:text=37.3%20million%20Americans%E2%80%94about%201,t%20know%20they%20have%20it.). Using this classifier, patients who might be at risk of diabetes can be brought to a doctor’s attention sooner, allowing for earlier medical care.

Dataset Source: https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset

### 1.2 Dataset Preparation

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv('diabetes_012_health_indicators_BRFSS2015.csv')
print('Size of dataset:', df.shape[0])
df.head()

Size of dataset: 253680


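|  | Diabetes_012 | HighBP | HighChol | CholCheck | BMI | Smoker | Stroke | HeartDiseaseorAttack | PhysActivity | Fruits | ... | AnyHealthcare | NoDocbcCost | GenHlth | MentHlth | PhysHlth | DiffWalk | Sex | Age | Education | Income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 1.0 | 1.0 | 40.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 5.0 | 18.0 | 15.0 | 1.0 | 0.0 | 9.0 | 4.0 | 3.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 25.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 1.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7.0 | 6.0 | 1.0 |
| 2 | 0.0 | 1.0 | 1.0 | 1.0 | 28.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 1.0 | 1.0 | 5.0 | 30.0 | 30.0 | 1.0 | 0.0 | 9.0 | 4.0 | 8.0 |
| 3 | 0.0 | 1.0 | 0.0 | 1.0 | 27.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 11.0 | 3.0 | 6.0 |
| 4 | 0.0 | 1.0 | 1.0 | 1.0 | 24.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 2.0 | 3.0 | 0.0 | 0.0 | 0.0 | 11.0 | 5.0 | 4.0 |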


In [2]:
# One-hot encode the original target column (Diabetes_012) into three binary
# columns for no diabetes (0), prediabetes (1), and diabetes (2),
# where 0 is False and 1 is True
target_columns = ['NoDiabetes', 'PreDiabetes', 'Diabetes']
for i, name in enumerate(target_columns):
    # Class i of Diabetes_012 becomes its own 0/1 column at position i
    df.insert(i, name, (df['Diabetes_012'] == i).astype(int))

# Drop the original multi-class column and store every value as an integer
df_target = df.drop('Diabetes_012', axis=1).astype(int)

columns = list(df_target.columns)
df_target.head()

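|  | NoDiabetes | PreDiabetes | Diabetes | HighBP | HighChol | CholCheck | BMI | Smoker | Stroke | HeartDiseaseorAttack | ... | AnyHealthcare | NoDocbcCost | GenHlth | MentHlth | PhysHlth | DiffWalk | Sex | Age | Education | Income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 1 | 1 | 40 | 1 | 0 | 0 | ... | 1 | 0 | 5 | 18 | 15 | 1 | 0 | 9 | 4 | 3 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 25 | 1 | 0 | 0 | ... | 0 | 1 | 3 | 0 | 0 | 0 | 0 | 7 | 6 | 1 |
| 2 | 1 | 0 | 0 | 1 | 1 | 1 | 28 | 0 | 0 | 0 | ... | 1 | 1 | 5 | 30 | 30 | 1 | 0 | 9 | 4 | 8 |
| 3 | 1 | 0 | 0 | 1 | 0 | 1 | 27 | 0 | 0 | 0 | ... | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 11 | 3 | 6 |
| 4 | 1 | 0 | 0 | 1 | 1 | 1 | 24 | 0 | 0 | 0 | ... | 1 | 0 | 2 | 3 | 0 | 0 | 0 | 11 | 5 | 4 |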
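
The one-hot encoding used for the target columns can be illustrated on a small toy example. The sketch below is not part of the lab code: the `labels` array is made up for illustration and assumes integer class codes (the raw `Diabetes_012` column stores them as floats, so it would need casting to `int` first).

```python
import numpy as np

# Hypothetical toy labels using the same coding as Diabetes_012:
# 0 = no diabetes, 1 = prediabetes, 2 = diabetes
labels = np.array([0, 2, 1, 0, 2])

# Row k of the 3x3 identity matrix is the one-hot vector for class k,
# so indexing the identity matrix by the labels encodes all rows at once
one_hot = np.eye(3, dtype=int)[labels]

# Sanity check: every row has exactly one active class
assert (one_hot.sum(axis=1) == 1).all()
print(one_hot)
```

Each column of `one_hot` plays the role of one binary target (`NoDiabetes`, `PreDiabetes`, `Diabetes`), which is what allows a separate binary classifier to be trained per class.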


### 1.3 Dataset Training and Testing Split

In [3]:
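targets = ['NoDiabetes', 'PreDiabetes', 'Diabetes']
# Feature columns are everything except the one-hot targets; building a new
# list (rather than removing in place) keeps this cell safe to re-run
columns = [col for col in columns if col not in targets]
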
# Splitting dataset
from sklearn.model_selection import train_test_split as tts

train, test = tts(df_target, test_size=.20, random_state=42, shuffle=True)

X_train = train[columns].to_numpy()
X_test  = test[columns].to_numpy()

y_train = {}
y_test  = {}

for col in targets:
    y_train[col] = train[col].to_numpy()
    y_test[col]  = test[col].to_numpy()

With a large dataset of over 250,000 instances, an 80/20 split is sufficient: our classifier has plenty of data for training (about 203,000 rows) and plenty for testing (about 51,000 rows). We could comfortably move our split to 75/25 or 70/30 if we wanted more data for testing. With such a large dataset, we could also set aside an additional validation set, using it for more frequent model evaluations during development and saving the test set for a final unbiased evaluation.
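
As a rough sketch of the validation-set idea, `train_test_split` can be called twice: once to hold out the test set, and again to carve a validation set out of what remains. The synthetic `X` and `y` arrays and the 60/20/20 proportions below are assumptions chosen for illustration, not the split used in this lab.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): 1000 rows, 5 features, 3 classes
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 3, size=1000)

# Step 1: hold out 20% of all rows as the final test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, shuffle=True)

# Step 2: take 25% of the remaining 80% (20% of the total) for validation
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42, shuffle=True)

print(len(X_train), len(X_val), len(X_test))
```

With 1000 rows this gives 600 training, 200 validation, and 200 test rows. The validation rows can be scored as often as needed while tuning, and the test rows are touched only once at the end.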

## 2. Modeling

## 3. Deployment

## 4. Exceptional Work