<h1 style="background-color:#4CD5FF; color:white; padding:3px 6px; border-radius:4px; font-weight:bold;">
Diabetes Predictive Model
</h1>

<h2 style="background-color:#8EEAFF; color:white; padding:3px 6px; border-radius:4px; font-weight:bold;">
About Dataset
</h2>

*This dataset is originally from the **National Institute of Diabetes and Digestive and Kidney Diseases**. The objective is to predict based on diagnostic measurements whether a patient has diabetes.*
*Several constraints were placed on the selection of these instances from a larger database. In particular, **all patients here are females at least 21 years old** of Pima Indian heritage.*

- **Pregnancies** - Number of times pregnant;
- **Glucose** - Plasma glucose concentration at 2 hours in an oral glucose tolerance test;
- **BloodPressure** - Diastolic blood pressure (mm Hg);
- **SkinThickness** - Triceps skin fold thickness (mm);
- **Insulin** - 2-Hour serum insulin (mu U/ml);
- **BMI** - Body mass index (weight in kg/(height in m)^2);
- **DiabetesPedigreeFunction** - Diabetes pedigree function;
- **Age** - Age (years);
- **Outcome** - Class variable (0 or 1);


<h2 style="background-color:#8EEAFF; color:white; padding:3px 6px; border-radius:4px; font-weight:bold;">
Importing Libraries
</h2>

In [4]:
import sqlite3
import os
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
from math import ceil

# data partition
from sklearn.model_selection import train_test_split

from itertools import product
from scipy.stats import skewnorm

from datetime import datetime
from sklearn.impute import KNNImputer

# for better resolution plots; optionally, you can change 'retina' to 'svg'
%config InlineBackend.figure_format = 'retina'

# Setting seaborn style
sns.set()

<h2 style="background-color:#8EEAFF; color:white; padding:3px 6px; border-radius:4px; font-weight:bold;">
Importing DataFrame
</h2>

In [5]:
df_diabetes = pd.read_csv("C:/Users/mariana/Documents/Personal_MVF/diabetes.csv", delimiter=',', header=0, decimal='.', quotechar='"')

<h2 style="background-color:#8EEAFF; color:white; padding:3px 6px; border-radius:4px; font-weight:bold;">
Data Exploration 
</h2>

In [6]:
df_diabetes.head()

Unnamed: 0,pregnancies,glucose,bloodpressure,skinthickness,insulin,bmi,diabetespedigreefunction,age,outcome
0,6,148,72,35,0,33.6,0.627,50,True
1,1,85,66,29,0,26.6,0.351,31,False
2,8,183,64,0,0,23.3,0.672,32,True
3,1,89,66,23,94,28.1,0.167,21,False
4,0,137,40,35,168,43.1,2.288,33,True


In [7]:
df_diabetes.tail()

Unnamed: 0,pregnancies,glucose,bloodpressure,skinthickness,insulin,bmi,diabetespedigreefunction,age,outcome
763,10,101,76,48,180,32.9,0.171,63,False
764,2,122,70,27,0,36.8,0.34,27,False
765,5,121,72,23,112,26.2,0.245,30,False
766,1,126,60,0,0,30.1,0.349,47,True
767,1,93,70,31,0,30.4,0.315,23,False


In [8]:
df_diabetes.describe()

Unnamed: 0,pregnancies,glucose,bloodpressure,skinthickness,insulin,bmi,diabetespedigreefunction,age
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0


In [9]:
df_diabetes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   pregnancies               768 non-null    int64  
 1   glucose                   768 non-null    int64  
 2   bloodpressure             768 non-null    int64  
 3   skinthickness             768 non-null    int64  
 4   insulin                   768 non-null    int64  
 5   bmi                       768 non-null    float64
 6   diabetespedigreefunction  768 non-null    float64
 7   age                       768 non-null    int64  
 8   outcome                   768 non-null    bool   
dtypes: bool(1), float64(2), int64(6)
memory usage: 48.9 KB


*No missing values so far.*
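
The summary statistics above show a minimum of 0 for glucose, bloodpressure, skinthickness, insulin and bmi, so the absence of nulls in `info()` may not be the whole story. Below is a minimal sketch of how such zeros could be counted. It assumes (this is an assumption, not something stated in the dataset description) that a zero in these columns may be a placeholder for an unrecorded measurement. The small frame reuses the five rows shown by `head()`, and the names `sample` and `zero_suspect_cols` are hypothetical.

```python
import pandas as pd

# Illustrative frame built from the five rows shown by df_diabetes.head()
sample = pd.DataFrame({
    "glucose": [148, 85, 183, 89, 137],
    "bloodpressure": [72, 66, 64, 66, 40],
    "skinthickness": [35, 29, 0, 23, 35],
    "insulin": [0, 0, 0, 94, 168],
    "bmi": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Assumption: zeros in these columns may stand for unrecorded measurements
zero_suspect_cols = ["glucose", "bloodpressure", "skinthickness", "insulin", "bmi"]

# Count how many zeros each suspect column contains
zero_counts = (sample[zero_suspect_cols] == 0).sum()
print(zero_counts)
```

On the full dataset the same expression could be applied to `df_diabetes` to decide whether any imputation is worth considering.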

<h2 style="background-color:#8EEAFF; color:white; padding:3px 6px; border-radius:4px; font-weight:bold;">
Splitting Data
</h2>

*Defining the **independent variables in X** and defining the **dependent variable in y** (the target)*

In [10]:
X = df_diabetes.drop('outcome', axis = 1)
y = df_diabetes['outcome']

*Using the **train_test_split** function to split the dataset into a train dataset (70%) and a test dataset (30%)*

In [11]:
X_train, X_val, y_train, y_val = train_test_split(X,y, test_size = 0.3, 
                                                  random_state = 0, 
                                                  stratify = y,
                                                  shuffle = True)

*For the next preprocessing steps we will work on X_train, since X_val is held out to test whether the model is suitable for this dataset.*
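
As a rough illustration of what `stratify` does in the split above, the sketch below runs the same call on synthetic data and compares the share of positive cases in each part. The frame `X_demo`, the series `y_demo` and the 35% positive rate are made up for the example and are not taken from the diabetes data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 100 rows, 35 of them positive
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"feature": rng.normal(size=100)})
y_demo = pd.Series([True] * 35 + [False] * 65, name="outcome")

# Same arguments as the notebook's split, applied to the synthetic data
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo, shuffle=True
)

# With stratify, the share of positives should be nearly the same in both parts
print(len(X_tr), len(X_va))
print(round(y_tr.mean(), 3), round(y_va.mean(), 3))
```

Without `stratify`, a small dataset can end up with noticeably different class balances in the two parts, which makes validation scores harder to interpret.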

<h2 style="background-color:#8EEAFF; color:white; padding:3px 6px; border-radius:4px; font-weight:bold;">
Data types, duplicate/missing/unique values, typecasting, feature stats
</h2>

In [12]:
X_train.dtypes


pregnancies                   int64
glucose                       int64
bloodpressure                 int64
skinthickness                 int64
insulin                       int64
bmi                         float64
diabetespedigreefunction    float64
age                           int64
dtype: object

*We should correct the data type of 'glucose': it is stored as an integer, but as a concentration measurement it should be a float. The remaining types seem to be correctly assigned.*

In [13]:
X_train['glucose'] = X_train['glucose'].astype(float)

*Now the type of the variable 'glucose' is corrected*
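
One thing to watch with this kind of cast is that the training and validation sets end up with the same dtypes. The sketch below shows one possible way to apply the conversion to both with `astype`. The toy frames `X_train_demo` and `X_val_demo` are hypothetical stand-ins, and casting the validation set is a suggestion rather than something the notebook does.

```python
import pandas as pd

# Toy train/validation frames standing in for X_train and X_val
X_train_demo = pd.DataFrame({"glucose": [122, 158, 107], "age": [45, 66, 24]})
X_val_demo = pd.DataFrame({"glucose": [76, 91], "age": [41, 23]})

# Apply the same cast to both splits so their dtypes stay aligned
for frame in (X_train_demo, X_val_demo):
    frame["glucose"] = frame["glucose"].astype(float)

print(X_train_demo.dtypes)
print(X_val_demo.dtypes)
```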

In [14]:
X_train.replace("", np.nan, inplace=True) #Replacing empty strings with NaN values
X_train.isna().sum() #Checking the number of missing values in each feature of the dataset X

pregnancies                 0
glucose                     0
bloodpressure               0
skinthickness               0
insulin                     0
bmi                         0
diabetespedigreefunction    0
age                         0
dtype: int64

In [15]:
X_val.replace("", np.nan, inplace=True) 
X_val.isna().sum()

pregnancies                 0
glucose                     0
bloodpressure               0
skinthickness               0
insulin                     0
bmi                         0
diabetespedigreefunction    0
age                         0
dtype: int64

*Now we can confirm there are no missing values in either dataset*

In [16]:
print("\nDuplicated values:", X_train.duplicated().sum())


Duplicated values: 0


In [17]:
print("\nDuplicated values:", X_val.duplicated().sum())


Duplicated values: 0


*There are no duplicated rows in either dataset*
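
The checks above look for repeated rows inside each dataset separately. As a complementary sketch, the snippet below also looks for rows whose feature values appear in both parts, which `duplicated()` on a single frame cannot detect. The frames `train_demo` and `val_demo` are invented for illustration, and the merge-based overlap check is an optional extra, not part of the original notebook.

```python
import pandas as pd

# Toy frames: the third training row repeats the first one,
# and the first validation row also occurs in the training data
train_demo = pd.DataFrame({"glucose": [122.0, 158.0, 122.0], "age": [45, 66, 45]})
val_demo = pd.DataFrame({"glucose": [158.0, 91.0], "age": [66, 23]})

# Duplicates within one split: rows that repeat an earlier row
dup_within = train_demo.duplicated().sum()

# Validation rows whose feature values also occur in the training split
overlap = val_demo.merge(
    train_demo.drop_duplicates(), how="inner", on=list(val_demo.columns)
)

print(dup_within, len(overlap))
```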

In [18]:
for i in X_train.columns:
    if X_train[i].dtype != 'object':
        print(f"{i} unique values:{X_train[i].unique()}\n")

pregnancies unique values:[10  2  0 13  1  9  4  5  7  3  8  6 11 12 17 15 14]

glucose unique values:[122. 158. 107.  76.  91. 106. 125. 130. 109. 187. 145. 119. 165. 147.
 120. 139.  81.  90.  94. 155. 102. 101. 137. 188. 128.  99. 174.  83.
  89. 157. 159. 111. 103. 168. 129. 143. 199. 114. 118.  78. 100. 112.
 142. 195.  68.  74. 135. 149.  80.  84. 191. 108.  96. 151. 144.  97.
 138. 171. 136. 104. 193. 162. 133.  95. 124. 110.  57. 141. 113.  88.
 161. 166. 186.  77.  93. 117.  92. 148. 115. 127. 163. 105. 180. 126.
 177. 116. 183. 179. 175. 173. 169.  71. 196. 194.  73. 156.  98.  79.
 131.  85. 182. 134. 154. 121.   0. 189.  82. 176. 181. 197. 146.  75.
 132.  86.  61.  72. 123. 152. 164. 184. 170. 140. 172. 167.  87.  62.
 160.  44. 190.]

bloodpressure unique values:[ 78  90  76  60  54  52  70  82  62  50  38  68  88  85  72  65  44  40
  58  66  74  64  46  30  84  86  56  75   0  80 122  48 104  24  61  96
  94  92 100 110  55  95 108  98]

skinthickness unique value

*Everything seems to be fine with the data: there are no missing values and all values are positive or zero. The variable 'age' will be kept as it is, because I believe it has a non-linear relationship with diabetes, so it is more useful to compare it directly with the other variables. Converting 'age' to year of birth and then back again to correlate the variables would only make the model more complex, which is not the objective.*
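
The observation that all values are positive or zero can also be checked programmatically rather than by reading the unique values. A minimal sketch follows, using a toy frame `demo` built from the first three rows of `X_train` shown in the notebook; the variable names are hypothetical.

```python
import pandas as pd

# Toy frame built from a few X_train rows
demo = pd.DataFrame({
    "pregnancies": [10, 2, 0],
    "glucose": [122.0, 158.0, 107.0],
    "bmi": [27.6, 31.6, 45.3],
})

# Count negative entries in every numeric column
numeric = demo.select_dtypes(include="number")
negative_counts = (numeric < 0).sum()

print(negative_counts)
```

A total of zero across all columns would support the statement for the frame being checked.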

In [19]:
y_train = y_train.astype(int)
y_val = y_val.astype(int)

In [20]:
X_train.head()

Unnamed: 0,pregnancies,glucose,bloodpressure,skinthickness,insulin,bmi,diabetespedigreefunction,age
34,10,122.0,78,31,0,27.6,0.512,45
221,2,158.0,90,0,0,31.6,0.805,66
531,0,107.0,76,0,0,45.3,0.686,24
518,13,76.0,60,0,0,32.8,0.18,41
650,1,91.0,54,25,100,25.2,0.234,23


*For the future implementation of the predictive model, I made sure to convert the target from a boolean to an integer!*

<h2 style="background-color:#8EEAFF; color:white; padding:3px 6px; border-radius:4px; font-weight:bold;">
Data Understanding
</h2>

#### Histograms

In [21]:
sp_rows = 2
sp_cols = 4

