# EDA: Diagnosing Diabetes
The following is my response to the Codecademy **Data Wrangling & Tidying** module's project, which inspects, cleans and validates the [Pima Indians Diabetes Data](https://www.kaggle.com/uciml/pima-indians-diabetes-database) from the National Institute of Diabetes and Digestive and Kidney Diseases.

## Setup

In [1]:
import pandas as pd
import numpy as np

## Inspection

### Initial Inspection

In [2]:
diabetes = pd.read_csv('diabetes.csv')

In [3]:
diabetes.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [4]:
diabetes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


Finding how many *columns* & *rows* there are:

In [5]:
diabetes.shape

(768, 9)

### Missing Values

#### Identifying

In [6]:
diabetes.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

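Although `.isnull().sum()` reports 0 missing values for every column, summarising with `.describe()` shows that several columns have a minimum of 0, which is not a plausible measurement for fields such as `Glucose`, `BloodPressure` or `BMI`: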

In [7]:
diabetes.describe()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0
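
The `min` row only shows that at least one `0` exists in a column. As a minimal sketch of how the zeros could be counted per column, the snippet below uses a small made-up frame (`toy`, with illustrative values) in place of `diabetes`:

```python
import pandas as pd

# Hypothetical stand-in for `diabetes`; the values are illustrative only.
toy = pd.DataFrame({
    'Glucose': [148, 85, 0, 89],
    'Insulin': [0, 0, 0, 94],
    'Age': [50, 31, 32, 21],
})

# Comparing against 0 yields a boolean frame; summing counts the True
# entries in each column.
zero_counts = (toy == 0).sum()
print(zero_counts)
```

The same expression applied to the real data, `(diabetes == 0).sum()`, should list how many zeros each column holds.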


#### Resolving

Replacing every `0` with `NaN` in the 5 columns that Codecademy identifies as containing missing values, listed below as `columns_miss_values`:

In [8]:
columns_miss_values = ['Glucose','BloodPressure','SkinThickness','Insulin','BMI']

In [9]:
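# Work on a copy so the original frame stays untouched
diabetes_nan = diabetes.copy()

# np.nan is used because the capitalised alias np.NaN was removed in NumPy 2.0
diabetes_nan[columns_miss_values] = diabetes_nan[columns_miss_values].replace(0, np.nan)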

Alternatively, replacing the missing values with each column's mean:

*NB: if `.mean()` is calculated with the `0` values included, it will underestimate the mean. Thus, convert to `NaN` first, then impute the mean:*

In [16]:
diabetes_miss_mean = diabetes_nan.copy()

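for col in columns_miss_values:
    # Assign the result back instead of calling replace(..., inplace=True) on a
    # column selection, which can act on a temporary copy and leave the frame unchanged.
    diabetes_miss_mean[col] = diabetes_miss_mean[col].fillna(diabetes_miss_mean[col].mean())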

In [17]:
diabetes_miss_mean.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148.0,72.0,35.0,155.548223,33.6,0.627,50,1
1,1,85.0,66.0,29.0,155.548223,26.6,0.351,31,0
2,8,183.0,64.0,29.15342,155.548223,23.3,0.672,32,1
3,1,89.0,66.0,23.0,94.0,28.1,0.167,21,0
4,0,137.0,40.0,35.0,168.0,43.1,2.288,33,1
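
The note on underestimating the mean can be checked on a toy example. The sketch below uses a made-up four-value series (`s`, not taken from the dataset) to compare the mean with and without the placeholder zeros, then imputes with the corrected mean:

```python
import numpy as np
import pandas as pd

# Hypothetical readings; the two zeros stand in for missing values.
s = pd.Series([0, 0, 94, 168], dtype='float64')

# Mean with the zeros included: (0 + 0 + 94 + 168) / 4
mean_with_zeros = s.mean()

# Convert the zeros to NaN first; .mean() skips NaN by default,
# so this is (94 + 168) / 2
s_nan = s.replace(0, np.nan)
mean_without_zeros = s_nan.mean()

# Impute the missing entries with the corrected mean
imputed = s_nan.fillna(mean_without_zeros)

print(mean_with_zeros, mean_without_zeros)  # 65.5 131.0
```

Here the zeros halve the mean, which is why the conversion to `NaN` comes before imputation.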


#### Re-inspection

Using `isnull().sum()` again to detect `null` values:

In [12]:
diabetes_nan.isnull().sum()

Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64
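
These counts are easier to compare as a share of the 768 rows. A minimal sketch follows, re-entering the counts printed above by hand (the names `missing_counts` and `missing_pct` are introduced here for illustration):

```python
import pandas as pd

# Missing-value counts copied from the output above
missing_counts = pd.Series({
    'Glucose': 5,
    'BloodPressure': 35,
    'SkinThickness': 227,
    'Insulin': 374,
    'BMI': 11,
})
n_rows = 768  # from diabetes.shape

# Percentage of rows missing in each column, rounded to one decimal place
missing_pct = (missing_counts / n_rows * 100).round(1)
print(missing_pct)
```

On the frame itself, `diabetes_nan.isnull().mean() * 100` should give the same percentages without re-entering the counts.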

Printing out all rows that contain missing values:

In [13]:
diabetes_nan[diabetes_nan.isnull().any(axis=1)]

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148.0,72.0,35.0,,33.6,0.627,50,1
1,1,85.0,66.0,29.0,,26.6,0.351,31,0
2,8,183.0,64.0,,,23.3,0.672,32,1
5,5,116.0,74.0,,,25.6,0.201,30,0
7,10,115.0,,,,35.3,0.134,29,0
...,...,...,...,...,...,...,...,...,...
761,9,170.0,74.0,31.0,,44.0,0.403,43,1
762,9,89.0,62.0,,,22.5,0.142,33,0
764,2,122.0,70.0,27.0,,36.8,0.340,27,0
766,1,126.0,60.0,,,30.1,0.349,47,1


374 out of 376 rows with a missing value have `Insulin` missing.
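
A tally like this can be computed rather than read off the printed rows. The sketch below shows one possible way, using a small made-up frame (`toy_nan`) in place of `diabetes_nan`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `diabetes_nan`; the values are illustrative only.
toy_nan = pd.DataFrame({
    'Glucose': [148.0, np.nan, 183.0, 89.0],
    'Insulin': [np.nan, 94.0, np.nan, 168.0],
})

# Boolean mask: True for rows with at least one missing value
rows_with_missing = toy_nan.isnull().any(axis=1)

# Number of rows with any missing value
n_rows_missing = int(rows_with_missing.sum())

# Of those rows, how many are missing `Insulin` specifically
n_insulin_missing = int(toy_nan.loc[rows_with_missing, 'Insulin'].isnull().sum())

print(n_insulin_missing, 'out of', n_rows_missing)  # 2 out of 3
```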

## Exploring Columns

### Data Types

In [14]:
diabetes_nan.dtypes

Pregnancies                   int64
Glucose                     float64
BloodPressure               float64
SkinThickness               float64
Insulin                     float64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object

### Exploring Values

In [15]:
for col in diabetes_nan.columns:
    print(diabetes_nan[[col]].value_counts())
    print('\n')

Pregnancies
1              135
0              111
2              103
3               75
4               68
5               57
6               50
7               45
8               38
9               28
10              24
11              11
13              10
12               9
14               2
15               1
17               1
dtype: int64


Glucose
100.0      17
99.0       17
106.0      14
129.0      14
125.0      14
           ..
67.0        1
65.0        1
62.0        1
61.0        1
199.0       1
Length: 135, dtype: int64


BloodPressure
70.0             57
74.0             52
68.0             45
78.0             45
72.0             44
64.0             43
80.0             40
76.0             39
60.0             37
62.0             34
66.0             30
82.0             30
88.0             25
84.0             23
90.0             22
86.0             21
58.0             21
50.0             13
56.0             12
54.0             11
52.0             11
92.0              8
75.0  

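Because `.value_counts()` sorts by frequency, the rarest values appear at the end of each column's output above. These are candidates for outliers, but a rare value is not necessarily an extreme one, so each should be checked against the column's range.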
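
As a complement to reading the value counts, extreme values can be flagged numerically. The sketch below applies the common 1.5 × IQR rule, which is a swapped-in technique and not part of the Codecademy project. The helper `iqr_outliers` and the sample values are hypothetical:

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    s = series.dropna()  # ignore missing values
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s[(s < q1 - k * iqr) | (s > q3 + k * iqr)]

# Illustrative readings, not taken from the dataset
toy = pd.Series([70, 72, 74, 68, 76, 122, 24], dtype='float64')
flagged = iqr_outliers(toy)
print(flagged.tolist())  # [122.0, 24.0]
```

On the real data this could be called per column, for example `iqr_outliers(diabetes_nan['BloodPressure'])`. Whether a flagged value is an error or a genuine measurement still needs judgement.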