# EDA: Diagnosing Diabetes

In this project, you'll take on the role of a data scientist exploring how certain diagnostic factors relate to diabetes outcomes in female patients.

You will use your EDA skills to help inspect, clean, and validate the data.

**Note**: This [dataset](https://www.kaggle.com/uciml/pima-indians-diabetes-database) is from the National Institute of Diabetes and Digestive and Kidney Diseases. It contains the following columns:

- `Pregnancies`: Number of times pregnant
- `Glucose`: Plasma glucose concentration 2 hours into an oral glucose tolerance test
- `BloodPressure`: Diastolic blood pressure (mm Hg)
- `SkinThickness`: Triceps skinfold thickness (mm)
- `Insulin`: 2-hour serum insulin (mu U/ml)
- `BMI`: Body mass index (weight in kg / (height in m)^2)
- `DiabetesPedigreeFunction`: Diabetes pedigree function
- `Age`: Age (years)
- `Outcome`: Class variable (0 or 1)

Let's get started!

## Initial Inspection

1. First, familiarize yourself with the dataset [here](https://www.kaggle.com/uciml/pima-indians-diabetes-database).

   Look at each of the nine columns in the documentation.
   
   What do you expect each data type to be?

Expected data type for each column:

- `Pregnancies`: `int`
- `Glucose`: `int`
- `BloodPressure`: `int`
- `SkinThickness`: `int`
- `Insulin`: `int`
- `BMI`: `float`
- `DiabetesPedigreeFunction`: `float`
- `Age`: `int`
- `Outcome`: `bool`

2. Next, let's load in the diabetes data to start exploring.

   Load the data in a variable called `diabetes_data` and print the first few rows.
   
   **Note**: The data is stored in a file called `diabetes.csv`.

In [1]:
import pandas as pd
import numpy as np

# Load in data and preview the first few rows
diabetes_data = pd.read_csv('diabetes.csv')
diabetes_data.head()

3. How many columns (features) does the data contain?

In [2]:
# Print number of columns
print("This dataset contains {} columns.".format(len(diabetes_data.columns)))

This dataset contains 9 columns.


4. How many rows (observations) does the data contain?

In [3]:
# Print number of rows
print("This dataset contains {} rows.".format(diabetes_data.shape[0]))

This dataset contains 768 rows.
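
Both counts can also be read in one step from `.shape`, which returns a `(rows, columns)` tuple. The sketch below uses a small made-up frame called `toy` as a stand-in for `diabetes_data`; the values are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for diabetes_data (values are made up)
toy = pd.DataFrame({
    "Glucose": [148, 85, 183],
    "BMI": [33.6, 26.6, 23.3],
})

# .shape is a (rows, columns) tuple, so it can be unpacked directly
n_rows, n_cols = toy.shape
print(f"This dataset contains {n_rows} rows and {n_cols} columns.")
```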


## Further Inspection

5. Let's inspect `diabetes_data` further.

   Do any of the columns in the data contain null (missing) values?

In [4]:
# Select the rows that contain at least one null value (none are expected yet)
missing_rows = diabetes_data[diabetes_data.isnull().any(axis=1)]
missing_rows.head()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
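
The cell above filters rows, so an empty result means no row has a null. To answer the question column by column instead, `.isnull().any()` gives one `True`/`False` per column. This is a minimal sketch on a made-up frame (`toy`), not the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for diabetes_data with one deliberate gap
toy = pd.DataFrame({
    "Glucose": [148.0, np.nan, 183.0],
    "BMI": [33.6, 26.6, 23.3],
})

# One boolean per column: does this column contain any null value?
has_nulls = toy.isnull().any()
print(has_nulls)

# One integer per column: how many null values does it contain?
null_counts = toy.isnull().sum()
print(null_counts)
```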


6. If you answered no to the question above, not so fast!

   While it's technically true that none of the columns contain null values, that doesn't necessarily mean that the data isn't missing any values.
   
   When exploring data, you should always question your assumptions and try to dig deeper.
   
   To investigate further, calculate summary statistics on `diabetes_data` using the `.describe()` method.

In [5]:
# Perform summary statistics
diabetes_data.describe()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 |


7. Looking at the summary statistics, do you notice anything odd about the following columns?

   - `Glucose`
   - `BloodPressure`
   - `SkinThickness`
   - `Insulin`
   - `BMI`

**Your response to question 7**: <br>
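Each of these columns has a minimum value of 0.0. This is implausible: a living patient cannot have a glucose level, blood pressure, skinfold thickness, insulin level, or BMI of zero, so these zeros most likely stand in for measurements that were never recorded.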
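
To see how widespread the zeros are, one option is to compare the frame against `0` and sum the resulting booleans per column. The frame `toy` below is a made-up stand-in, so the counts it prints say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for diabetes_data (values are made up)
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "Insulin": [0, 0, 94, 168],
    "BMI": [33.6, 26.6, 0.0, 28.1],
})

# (toy == 0) is a frame of booleans; summing counts the True cells per column
zero_counts = (toy == 0).sum()
print(zero_counts)
```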

8. Do you spot any other outliers in the data?

**Your response to question 8**: <br>
A maximum `SkinThickness` of 99 seems alarmingly high to me. When the average is ~20 and the standard deviation is ~16, a value of 99 looks like a stark outlier. Likewise, a maximum `Insulin` of 846 sits well outside the expected range. Lastly, 17 pregnancies also looks like an outlier.
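
Eyeballing the summary table is one way to spot outliers. A common rule of thumb, which the original notebook does not use, flags values lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies it to a short made-up series; the threshold of 1.5 is a convention, not a property of this dataset.

```python
import pandas as pd

# Hypothetical skinfold measurements with one extreme value
skin = pd.Series([23, 29, 32, 35, 27, 30, 99], name="SkinThickness")

# Quartiles and interquartile range
q1 = skin.quantile(0.25)
q3 = skin.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR below Q1 and above Q3
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only the values that fall outside the fences
outliers = skin[(skin < lower) | (skin > upper)]
print(outliers)
```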

9. Let's see if we can get a more accurate view of the missing values in the data.

   Use the following code to replace the instances of `0` with `NaN` in the five columns mentioned:
   
   ```py
   diabetes_data[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']] = diabetes_data[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']].replace(0, np.nan)
   ```

In [6]:
# Replace instances of 0 with NaN
# np.nan is used because the np.NaN alias was removed in NumPy 2.0
zero_as_missing = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
diabetes_data[zero_as_missing] = diabetes_data[zero_as_missing].replace(0, np.nan)
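
The replacement is limited to five columns for a reason: in a column such as `Pregnancies`, zero is a legitimate value and should be left alone. A small sketch on a made-up frame (`toy`) shows that restricting the replacement to a column subset leaves the other columns untouched:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for diabetes_data (values are made up)
toy = pd.DataFrame({
    "Pregnancies": [6, 0, 8],   # 0 is a valid count here
    "Glucose": [148, 0, 183],   # 0 is not a plausible reading
})

# Only the listed columns have their zeros turned into NaN
cols = ["Glucose"]
toy[cols] = toy[cols].replace(0, np.nan)
print(toy)
```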

10. Next, check for missing (null) values in all of the columns just like you did in Step 5.

    Now how many missing values are there?

In [7]:
# Find whether columns contain null values after replacements are made
print(diabetes_data.isnull().sum())
print()
print("The most likely columns to be missing data in order are \
Insulin, SkinThickness, and BloodPressure. \nBMI and Glucose \
are also missing some values.")

Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64

The most likely columns to be missing data in order are Insulin, SkinThickness, and BloodPressure. 
BMI and Glucose are also missing some values.
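
Raw counts are easier to compare as proportions. In the output above, for example, 374 of 768 `Insulin` values are missing, which is roughly 49%. Taking the mean of the boolean null mask gives that share directly; the sketch below shows the idea on a made-up frame (`toy`).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for diabetes_data after zeros became NaN
toy = pd.DataFrame({
    "Glucose": [148.0, np.nan, 183.0, 89.0],
    "Insulin": [np.nan, np.nan, 94.0, 168.0],
})

# The mean of a boolean column is the fraction of True values
missing_share = toy.isnull().mean()
print(missing_share)
```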


11. Let's take a closer look at these rows to get a better idea of _why_ some data might be missing.

    Print out all the rows that contain missing (null) values.

In [8]:
# Print rows with missing values
missing_rows = diabetes_data[diabetes_data.isnull().any(axis=1)]
print("There are {} rows with missing data.".format(missing_rows.shape[0]))
print("Of these rows, only {} contain data for Insulin.".format(missing_rows.Insulin.notnull().sum()))
missing_rows.head()

There are 376 rows with missing data.
Of these rows, only 2 contain data for Insulin.


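| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148.0 | 72.0 | 35.0 | NaN | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85.0 | 66.0 | 29.0 | NaN | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183.0 | 64.0 | NaN | NaN | 23.3 | 0.672 | 32 | 1 |
| 5 | 5 | 116.0 | 74.0 | NaN | NaN | 25.6 | 0.201 | 30 | 0 |
| 7 | 10 | 115.0 | NaN | NaN | NaN | 35.3 | 0.134 | 29 | 0 |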


12. Go through the rows with missing data. Do you notice any patterns or overlaps between the missing data?

**Your response to question 12**: <br>
The missing data appears to cascade. A patient who is missing `Insulin` is often also missing `SkinThickness`, but a patient is rarely missing `SkinThickness` without also missing `Insulin`. The same pattern holds between `SkinThickness` and `BloodPressure`. This may suggest that when a patient's insulin wasn't measured, their skin thickness was also harder to obtain, and likewise for skin thickness and blood pressure, although the data alone cannot confirm why. Altogether, only 2 of the 376 patients in `missing_rows` have a value for `Insulin`. These two patients are missing `Glucose` and `BMI`, respectively.
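
An overlap like this can be checked with a cross-tabulation of two missingness indicators rather than by scanning rows. The sketch below is one possible approach on a made-up frame (`toy`); the counts it prints are not the real dataset's.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for diabetes_data after zeros became NaN
toy = pd.DataFrame({
    "SkinThickness": [35.0, 29.0, np.nan, np.nan, 23.0],
    "Insulin": [np.nan, 94.0, np.nan, np.nan, 168.0],
})

# Rows: is SkinThickness missing? Columns: is Insulin missing?
overlap = pd.crosstab(
    toy["SkinThickness"].isnull(),
    toy["Insulin"].isnull(),
    rownames=["SkinThickness missing"],
    colnames=["Insulin missing"],
)
print(overlap)
```

A zero in the cell for "SkinThickness missing, Insulin present" is what the cascade described above would look like in such a table.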

13. Next, take a closer look at the data types of each column in `diabetes_data`.

    Does the result match what you would expect?

In [9]:
# Print data types using .info() method
# .info() prints its report itself and returns None, so no print() is needed
diabetes_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
Pregnancies                 768 non-null int64
Glucose                     763 non-null float64
BloodPressure               733 non-null float64
SkinThickness               541 non-null float64
Insulin                     394 non-null float64
BMI                         757 non-null float64
DiabetesPedigreeFunction    768 non-null float64
Age                         768 non-null int64
Outcome                     768 non-null object
dtypes: float64(6), int64(2), object(1)
memory usage: 54.1+ KB


This result only partly matches my expectations. `Pregnancies` and `Age` are `int64`, and `BMI` and `DiabetesPedigreeFunction` are `float64`, as predicted. However, `Glucose`, `BloodPressure`, `SkinThickness`, and `Insulin` hold only whole numbers on Kaggle (every value here ends in `.0`), so I expected them to be `int`. They are `float64` because `NaN` is a floating-point value: replacing the zeros with `NaN` in Step 9 forced pandas to upcast those integer columns. Finally, `Outcome` is an `object` rather than the `bool` I expected.
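
If whole-number columns should stay integers despite the gaps, pandas offers a nullable integer dtype, `Int64` (capital "I"), which stores missing entries as `<NA>`. The original notebook does not use it; the sketch below only illustrates the upcast and one possible way back, on a made-up series.

```python
import numpy as np
import pandas as pd

# Hypothetical glucose readings where 0 marks a missing measurement
glucose = pd.Series([148, 85, 0, 183], name="Glucose")
print(glucose.dtype)    # int64

# NaN is a float, so introducing it upcasts the column
with_nan = glucose.replace(0, np.nan)
print(with_nan.dtype)   # float64

# The nullable Int64 dtype keeps whole numbers and marks gaps as <NA>
nullable = with_nan.astype("Int64")
print(nullable.dtype)   # Int64
```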

14. To figure out why the `Outcome` column is of type `object` (string) instead of type `int64`, print out the unique values in the `Outcome` column.

In [10]:
# Print unique values of Outcome column
print(diabetes_data.Outcome.unique())

['1' '0' 'O']


15. How might you resolve this issue?

**Your response to question 15**: <br>
I would change all instances of `O` to `0` and convert the values to integers.

In [11]:
# Change O to 0...
diabetes_data.Outcome = diabetes_data.Outcome.replace('O', '0')

# Convert values to integers...
diabetes_data.Outcome = diabetes_data.Outcome.astype('int')

# Print the unique values...
print(diabetes_data.Outcome.unique())

# Check the data types...
print(diabetes_data.dtypes)

[1 0]
Pregnancies                   int64
Glucose                     float64
BloodPressure               float64
SkinThickness               float64
Insulin                     float64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object
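
Here the stray value was easy to spot because `Outcome` has only a few unique values. For columns with many values, one alternative is `pd.to_numeric` with `errors="coerce"`, which turns anything unparseable into `NaN` so the offending entries can be listed. A minimal sketch on a made-up series:

```python
import pandas as pd

# Hypothetical Outcome values read as strings, including a letter "O"
outcome = pd.Series(["1", "0", "O", "1"], name="Outcome")

# Entries that cannot be parsed as numbers become NaN
as_numbers = pd.to_numeric(outcome, errors="coerce")

# Look up the original text of every entry that failed to parse
bad_entries = outcome[as_numbers.isnull()]
print(bad_entries.unique())
```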


## Next Steps

16. Congratulations! In this project, you saw how EDA can help with the initial data inspection and cleaning process. This is an important step as it helps to keep your datasets clean and reliable.

    Here are some ways you might extend this project if you'd like:
    - Use `.value_counts()` to more fully explore the values in each column.
    - Investigate other outliers in the data that may be easily overlooked.
    - Instead of changing the `0` values in the five columns to `NaN`, try replacing the values with the median or mean of each column.

In [12]:
# Let's use .value_counts()
print(diabetes_data.Pregnancies.value_counts())

1     135
0     111
2     103
3      75
4      68
5      57
6      50
7      45
8      38
9      28
10     24
11     11
13     10
12      9
14      2
15      1
17      1
Name: Pregnancies, dtype: int64
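
The third extension above, replacing the placeholder zeros with a typical value, could be sketched as follows. One detail matters: the median should be computed without the zeros, otherwise the placeholders pull it down. The frame `toy` and its values are made up, and whether median imputation suits this dataset is a judgment call the sketch does not settle.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for diabetes_data, with 0 marking missing values
toy = pd.DataFrame({
    "Glucose": [148.0, 85.0, 0.0, 183.0],
    "BMI": [33.6, 0.0, 23.3, 26.6],
})

# Mask the zeros first so they do not distort the medians
non_zero = toy.replace(0, np.nan)

# .median() skips NaN by default, giving one median per column
medians = non_zero.median()

# Fill each column's gaps with that column's median
imputed = non_zero.fillna(medians)
print(imputed)
```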
