# EDA: Diagnosing Diabetes

### Question: How do certain diagnostic factors affect the diabetes outcome of female patients?

**Note**: Dataset from the National Institute of Diabetes and Digestive and Kidney Diseases. It contains the following columns:

- `Pregnancies`: Number of times pregnant
- `Glucose`: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- `BloodPressure`: Diastolic blood pressure
- `SkinThickness`: Triceps skinfold thickness
- `Insulin`: 2-Hour serum insulin
- `BMI`: Body mass index
- `DiabetesPedigreeFunction`: Diabetes pedigree function
- `Age`: Age (years)
- `Outcome`: Class variable (0 or 1)

### 1. Initial Inspection

In [5]:
import pandas as pd
import numpy as np
import seaborn as sns

# load in data
diabetes_data = pd.read_csv('diabetes.csv')
diabetes_data.head()

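| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |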


### 2. How many columns (features) and rows (observations) does the data contain?

In [6]:
# Number of columns
print(len(diabetes_data.columns))

# Number of rows
print(len(diabetes_data))

# Number of columns and rows
diabetes_data.shape

9
768


(768, 9)

### 3. Do any of the columns in the data contain null (missing) values?

In [7]:
diabetes_data.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

While it's technically true that none of the columns contain null values, that doesn't guarantee the data is complete: missing values may be encoded in some other way. When exploring data, you should always question your assumptions and try to dig deeper.

### 6. Calculate summary statistics

In [8]:
# perform summary statistics
diabetes_data.describe()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 |


Looking at the summary statistics:

- `Glucose`
- `BloodPressure`
- `SkinThickness`
- `Insulin`
- `BMI`

If you take a look at the minimum values for these five columns, you'll notice that they are all `0`.

How can blood pressure or BMI be `0`? That makes no sense! These values are also far from the respective medians and means, another indicator that something is off.

The most plausible interpretation is that `0` was recorded as a placeholder for measurements that are actually missing.
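To get a sense of how widespread these placeholder zeros are, you can count them per column. The following is a minimal sketch on a small, made-up DataFrame (`toy` and `suspect_cols` are hypothetical names, not part of the notebook); in the notebook you would apply the same expression to `diabetes_data`.

```python
import pandas as pd

# Hypothetical toy data standing in for diabetes_data
toy = pd.DataFrame({
    "Glucose": [148, 0, 183],
    "BloodPressure": [72, 66, 0],
    "BMI": [33.6, 0.0, 23.3],
    "Pregnancies": [6, 0, 8],  # 0 is a valid value here, so it is not checked
})

# Columns where a value of 0 is not physiologically plausible
suspect_cols = ["Glucose", "BloodPressure", "BMI"]

# Compare each entry to 0 and sum the resulting booleans per column
zero_counts = (toy[suspect_cols] == 0).sum()
print(zero_counts)
```

Note that `Pregnancies` is deliberately left out: a count of `0` there is a legitimate observation, not a placeholder.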

In addition to the `0` values that show up for the columns above, there appear to be additional outliers, such as:

- The maximum value of the `Insulin` column is `846`, which is abnormally high.
- The maximum value of the `Pregnancies` column is `17`. While having 17 pregnancies is not impossible, this case might be something to look further into to determine its accuracy.
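One rough way to judge whether a maximum such as `846` is unusually high is the 1.5 × IQR rule of thumb. This rule is an assumption brought in for illustration and is not something the notebook itself uses. The sketch below plugs in the `Insulin` quartiles reported by `describe()` above; keep in mind that those quartiles were computed while the placeholder zeros were still in the data, so the resulting fence is only indicative.

```python
# Insulin quartiles and maximum as reported in the describe() output above
q1 = 0.0
q3 = 127.25
insulin_max = 846.0

# Interquartile range and the conventional upper fence (Q3 + 1.5 * IQR)
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

print(upper_fence)                # 318.125
print(insulin_max > upper_fence)  # True
```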

### 7. Replace the instances of `0` with `NaN`

In [9]:
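# replace instances of 0 with NaN in columns where 0 is not a plausible measurement
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
zero_as_missing_cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
diabetes_data[zero_as_missing_cols] = diabetes_data[zero_as_missing_cols].replace(0, np.nan)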

### 8. Check for missing (null) values in all of the columns again

In [10]:
# find whether columns contain null values after replacements are made
diabetes_data.isnull().sum()

Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64

### 9. Let's take a closer look at these rows to get a better idea of _why_ some data might be missing

Print out all the rows that contain missing (null) values.

In [11]:
# print rows with missing values
diabetes_data[diabetes_data.isnull().any(axis=1)]

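| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148.0 | 72.0 | 35.0 | NaN | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85.0 | 66.0 | 29.0 | NaN | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183.0 | 64.0 | NaN | NaN | 23.3 | 0.672 | 32 | 1 |
| 5 | 5 | 116.0 | 74.0 | NaN | NaN | 25.6 | 0.201 | 30 | 0 |
| 7 | 10 | 115.0 | NaN | NaN | NaN | 35.3 | 0.134 | 29 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 761 | 9 | 170.0 | 74.0 | 31.0 | NaN | 44.0 | 0.403 | 43 | 1 |
| 762 | 9 | 89.0 | 62.0 | NaN | NaN | 22.5 | 0.142 | 33 | 0 |
| 764 | 2 | 122.0 | 70.0 | 27.0 | NaN | 36.8 | 0.340 | 27 | 0 |
| 766 | 1 | 126.0 | 60.0 | NaN | NaN | 30.1 | 0.349 | 47 | 1 |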


One thing you might notice is that many rows with missing data have missing values in more than one column. In the rows displayed above, every row with at least one missing value is also missing a value in the `Insulin` column, although this is worth confirming across the full DataFrame rather than assuming it from a truncated preview. This is a clue as to why the data is missing! If patients did not have their insulin measured, why might they also not have had these other measurements taken?
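A pattern spotted in a truncated preview can be checked directly. The sketch below uses a small, made-up DataFrame (`toy` is hypothetical) to show one way to look for rows that have a missing value somewhere while `Insulin` itself is present; run against `diabetes_data`, an empty result would support the pattern and a non-empty one would list the exceptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for diabetes_data
toy = pd.DataFrame({
    "Glucose": [148.0, np.nan, 183.0, 89.0],
    "Insulin": [np.nan, 23.0, np.nan, 94.0],
    "BMI": [33.6, 27.7, np.nan, 28.1],
})

# True for rows with at least one missing value in any column
has_missing = toy.isnull().any(axis=1)

# Rows that are missing something even though Insulin was recorded
missing_but_insulin_present = toy[has_missing & toy["Insulin"].notnull()]
print(missing_but_insulin_present)

# Number of missing values per row, to see how often gaps co-occur
missing_per_row = toy.isnull().sum(axis=1)
print(missing_per_row.tolist())
```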

Depending on how much data is missing, you might choose to remove specific rows or impute the missing values somehow.
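The choice between dropping and imputing can be sketched as follows. This is only an illustration on made-up data (`toy` is hypothetical), and median imputation is just one of several possible strategies; the notebook does not commit to any of them.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for diabetes_data
toy = pd.DataFrame({
    "Glucose": [148.0, 85.0, np.nan, 89.0],
    "Insulin": [np.nan, np.nan, 94.0, 168.0],
})

# Share of missing values per column, as a percentage
pct_missing = toy.isnull().mean() * 100
print(pct_missing)

# Option 1: drop rows, which is reasonable when only a few rows are affected
dropped = toy.dropna(subset=["Glucose"])

# Option 2: impute, here with the column median (other columns are left untouched)
imputed = toy.fillna({"Insulin": toy["Insulin"].median()})
print(imputed)
```

With the counts reported above, `Insulin` is missing in 374 of 768 rows, so dropping those rows would discard nearly half of the data, whereas `Glucose` is missing in only 5 rows.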

### 10. Look at the data types of each column

Does the result match what you would expect?

In [12]:
# print data types using .info() method
diabetes_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   763 non-null    float64
 2   BloodPressure             733 non-null    float64
 3   SkinThickness             541 non-null    float64
 4   Insulin                   394 non-null    float64
 5   BMI                       757 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    object 
dtypes: float64(6), int64(2), object(1)
memory usage: 54.1+ KB


### 11. To figure out why the `Outcome` column is of type `object` (string) instead of type `int64`, print out the unique values in the `Outcome` column.

In [13]:
# print unique values of Outcome column
diabetes_data.Outcome.unique()

array(['1', '0', 'O'], dtype=object)
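When a column has many distinct values, scanning the output of `.unique()` by eye becomes impractical. A more general sketch, shown here on a made-up Series (`outcome` is hypothetical), is to attempt a numeric conversion with `pd.to_numeric(..., errors="coerce")` and inspect the entries that fail to parse.

```python
import pandas as pd

# Hypothetical toy column standing in for diabetes_data["Outcome"]
outcome = pd.Series(["1", "0", "O", "1", "0"], name="Outcome")

# Frequency of each distinct value, to see how common the odd entry is
print(outcome.value_counts())

# Entries that cannot be parsed as numbers become NaN under errors="coerce"
parsed = pd.to_numeric(outcome, errors="coerce")
bad_values = outcome[parsed.isnull()]
print(bad_values.tolist())
```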

### 12. How might you resolve this issue?

A possible next step would be to replace instances of `'O'` with `0` and convert the `Outcome` column to type `int64`.

In [14]:
# replace instances of `O` with 0
diabetes_data[['Outcome']] = diabetes_data[['Outcome']].replace('O', 0)

In [15]:
diabetes_data.head(10)

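| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148.0 | 72.0 | 35.0 | NaN | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85.0 | 66.0 | 29.0 | NaN | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183.0 | 64.0 | NaN | NaN | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89.0 | 66.0 | 23.0 | 94.0 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137.0 | 40.0 | 35.0 | 168.0 | 43.1 | 2.288 | 33 | 1 |
| 5 | 5 | 116.0 | 74.0 | NaN | NaN | 25.6 | 0.201 | 30 | 0 |
| 6 | 3 | 78.0 | 50.0 | 32.0 | 88.0 | 31.0 | 0.248 | 26 | 1 |
| 7 | 10 | 115.0 | NaN | NaN | NaN | 35.3 | 0.134 | 29 | 0 |
| 8 | 2 | 197.0 | 70.0 | 45.0 | 543.0 | 30.5 | 0.158 | 53 | 1 |
| 9 | 8 | 125.0 | 96.0 | NaN | NaN | NaN | 0.232 | 54 | 1 |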


In [16]:
# print unique values of Outcome column
diabetes_data.Outcome.unique()

array(['1', '0', 0], dtype=object)

In [19]:
# convert the Outcome column from object (mixed strings and ints) to int64
diabetes_data['Outcome'] = diabetes_data['Outcome'].astype('int64')

In [20]:
# print unique values of Outcome column
diabetes_data.Outcome.unique()

array([1, 0])
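The replace-then-convert steps above can also be chained and followed by a sanity check. This is a minimal sketch on a made-up Series (`outcome` and `cleaned` are hypothetical names); it replaces `'O'` with the string `'0'` so the column never holds mixed types along the way.

```python
import pandas as pd

# Hypothetical toy column standing in for diabetes_data["Outcome"]
outcome = pd.Series(["1", "0", "O", "1"], name="Outcome")

# Fix the typo (letter O instead of digit 0), then convert to integers
cleaned = outcome.replace("O", "0").astype("int64")

# Sanity check: the class variable should only contain 0 or 1
assert cleaned.isin([0, 1]).all()
print(cleaned.unique())
```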