## <font style="font-weight: bold;"> Data Preparation </font>

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

#### Reproducibility

In [2]:
seed = 2024

# pandas, statsmodels, matplotlib and y_data_profiling rely on numpy's random generator, and thus, we need to set the seed in numpy
np.random.seed(seed)

### <font color='green'> Phrase 1: Connect to MySQL Database in Docker </font>

Connect to the database and get the data from the table into a pandas DataFrame

In [1]:
from sqlalchemy import create_engine

def get_data_from_db():
    engine = create_engine('mysql+pymysql://' + 
                           'user:password' +  # credentials (username:password)
                           '@127.0.0.1:3306/' + # host and port
                           'db') # database name

    query = "SELECT * FROM diabetes_data"
    df = pd.read_sql(query, con=engine)
    
    return df

# Load data from database
df = get_data_from_db()

df.head()

NameError: name 'pd' is not defined

# EDA

## Content
The Behavioral Risk Factor Surveillance System (BRFSS) is a health-related telephone survey that is collected annually by the CDC. Each year, the survey collects responses from over 400,000 Americans on health-related risk behaviors, chronic health conditions, and the use of preventative services. It has been conducted every year since 1984. For this project, a csv of the dataset available on Kaggle for the year 2015 was used. This original dataset contains responses from 441,455 individuals and has 330 features. These features are either questions directly asked of participants, or calculated variables based on individual participant responses.

## Objective

The objective of this dataset is to build a predictive model for diagnosing diabetes in patients. The model should predict whether a patient has diabetes (Outcome = 1) or does not have diabetes (Outcome = 0) based on several diagnostic measurements, blood pressure, cholestrol, BMI, smoking, diet, other chronic health conditions, and others.

## Dataset

The target variable Diabetes has 3 classes. 0 is for no diabetes or only during pregnancy, 1 is for prediabetes, and 2 is for diabetes. There is class imbalance in this dataset. The dataset has 21 feature variables

## About Columns

- **Diabetes**: 3 classes. 0 is for no diabetes or only during pregnancy, 1 is for prediabetes, and 2 is for diabetes.

- **HighBP**: Have you ever been informed by a health professional that you have high blood pressure? (Yes=1, No=0)

- **HighChol**: Have you ever been informed by a health professional that you have high cholesterol? (Yes=1, No=0)

- **CholCheck**: Cholesterol check within the past five years (Yes=1, No=0)

- **BMI**: Body Mass Index (BMI)

- **Smoker**: Have you smoked more than 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes] (Yes=1, No=0)

- **Stroke**: Have you ever been informed by a health professional that you had a stroke? (Yes=1, No=0)

- **HeartDiseaseorAttack**: Respondents who have ever reported having coronary heart disease (CHD) or myocardial infarction (MI) (Yes=1, No=0)

- **PhysActivity**: Adults who reported doing physical activity or exercise during the past 30 days other than their regular job (Yes=1, No=0)

- **Fruits**: Consume fruits at least once per day (Yes=1, No=0)

- **Veggies**: Consume vegetables at least once per day (Yes=1, No=0)

- **HvyAlcoholConsump**: Heavy drinker? (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week) (Yes=1, No=0)

- **AnyHealthcare**: Patients who have any kind of health care coverage, including health insurance, prepaid plans such as HMOs, or government plans such as Medicare, or Indian Health Service? (Yes=1, No=0)

- **NoDocbcCost**: Was there a time in the past 12 months when you needed to see a doctor but could not because of cost? (Yes=1, No=0)

- **GenHlth**: Would you say that in general your health is? Rate (1 ~ 5)

- **MentHlth**: Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good? (0 ~ 30)

- **PhysHlth**: Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good? (0 ~ 30)

- **DiffWalk**: Do you have serious difficulty walking or climbing stairs? (Yes=1, No=0)

- **Sex**: Indicate sex of respondent (0=Female, 1=Male)

- **Age**: Fourteen-level age category (1 ~ 14)

- **Education**: What is the highest grade or year of school you completed? (1 ~ 6)

- **Income**: Is your annual household income from all sources: (If respondent refuses at any income level, code "Refused.") (1 ~ 8)


### <font color='green'> Phase 2: Data Understanding </font>

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 253680 entries, 0 to 253679
Data columns (total 22 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   Diabetes_binary       253680 non-null  int64  
 1   HighBP                253680 non-null  int64  
 2   HighChol              253680 non-null  int64  
 3   CholCheck             253680 non-null  int64  
 4   BMI                   253680 non-null  float64
 5   Smoker                253680 non-null  int64  
 6   Stroke                253680 non-null  int64  
 7   HeartDiseaseorAttack  253680 non-null  int64  
 8   PhysActivity          253680 non-null  int64  
 9   Fruits                253680 non-null  int64  
 10  Veggies               253680 non-null  int64  
 11  HvyAlcoholConsump     253680 non-null  int64  
 12  AnyHealthcare         253680 non-null  int64  
 13  NoDocbcCost           253680 non-null  int64  
 14  GenHlth               253680 non-null  int64  
 15  

In [None]:
df.describe()

Unnamed: 0,Diabetes_binary,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,Fruits,...,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
count,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,...,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0
mean,0.296921,0.429001,0.424121,0.96267,28.382364,0.443169,0.040571,0.094186,0.756544,0.634256,...,0.951053,0.084177,2.511392,3.184772,4.242081,0.168224,0.440342,8.032119,5.050434,6.053875
std,0.69816,0.494934,0.49421,0.189571,6.608694,0.496761,0.197294,0.292087,0.429169,0.481639,...,0.215759,0.277654,1.068477,7.412847,8.717951,0.374066,0.496429,3.05422,0.985774,2.071148
min,0.0,0.0,0.0,0.0,12.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0
25%,0.0,0.0,0.0,1.0,24.0,0.0,0.0,0.0,1.0,0.0,...,1.0,0.0,2.0,0.0,0.0,0.0,0.0,6.0,4.0,5.0
50%,0.0,0.0,0.0,1.0,27.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,2.0,0.0,0.0,0.0,0.0,8.0,5.0,7.0
75%,0.0,1.0,1.0,1.0,31.0,1.0,0.0,0.0,1.0,1.0,...,1.0,0.0,3.0,2.0,3.0,0.0,1.0,10.0,6.0,8.0
max,2.0,1.0,1.0,1.0,98.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,5.0,30.0,30.0,1.0,1.0,13.0,6.0,8.0


#### Dataset Report

Detailed exploratory data analysis report for the dataset. This library generates a complete report about the data.

In [None]:
from ydata_profiling import ProfileReport

profile = ProfileReport(df, title="Pandas Profiling Report", explorative=True)
profile

Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]

Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]

Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]



In [None]:
df.shape

(253680, 22)

In [None]:
df.head()

Unnamed: 0,Diabetes,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,Fruits,...,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
0,0,1,1,1,40.0,1,0,0,0,0,...,1,0,5,18,15,1,0,9,4,3
1,0,0,0,0,25.0,1,0,0,1,0,...,0,1,3,0,0,0,0,7,6,1
2,0,1,1,1,28.0,0,0,0,0,1,...,1,1,5,30,30,1,0,9,4,8
3,0,1,0,1,27.0,0,0,0,1,1,...,1,0,2,0,0,0,0,11,3,6
4,0,1,1,1,24.0,0,0,0,1,1,...,1,0,2,3,0,0,0,11,5,4


Duplicates


In [None]:
df.duplicated().sum() / df.shape[0]

np.float64(0.09420923998738569)

No Missing values

In [None]:
df.isnull().sum()

Diabetes                0
HighBP                  0
HighChol                0
CholCheck               0
BMI                     0
Smoker                  0
Stroke                  0
HeartDiseaseorAttack    0
PhysActivity            0
Fruits                  0
Veggies                 0
HvyAlcoholConsump       0
AnyHealthcare           0
NoDocbcCost             0
GenHlth                 0
MentHlth                0
PhysHlth                0
DiffWalk                0
Sex                     0
Age                     0
Education               0
Income                  0
dtype: int64

#### Initial Analysis

### <font color='green'> Phase 2: Data Preparation </font>

Data Preparation mainly consists of two parts:
- **Data Cleaning** - the goal is assure data quality. This includes removing wrong/corrupt 
data entries and making sure the entries are standardized, e.g. enforcing certain encodings. 
- **Data Wrangling** - transforms the data in order to make it suitable for the modelling step. Sometimes, steps from Data Wrangling are incorporated into the automatized Pipeline.

### Data Cleaning


In [None]:
plt.figure(figsize = (15,15))
for i,col in enumerate(['BMI', 'GenHlth', 'MentHlth', 'PhysHlth', 'Age','Education', 'Income']):
    plt.subplot(4,2,i+1)
    sns.boxplot(x = col, data = df ,palette='Set2')
plt.show()


Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(x = col, data = df ,palette='Set2')

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(x = col, data = df ,palette='Set2')

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(x = col, data = df ,palette='Set2')

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(x = col, data = df ,palette='Set2')

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `l