# DS201 Final Project: Pima Indian Diabetes Prediction

<hr>

**Problem Description:**

This dataset was originally collected by the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the overarching study was to find the most likely risk factors that lead to diabetes. This dataset is a subset of that original study, restricted to females at least 21 years old who are of Pima Indian heritage.

We will use the dataset to predict the onset of diabetes, tracked in the ```Outcome``` column.
Your job is to do some EDA to find which features are most strongly correlated with having diabetes. Then, engineer some new features that may be better suited to the modelling process. Lastly, you will build a model around these variables: split the data into a train set and a test set, pick a series of appropriate models and fit them to your data, then evaluate the models to find the best-performing one.
<hr>

**Remember! Although these processes have been talked about as a series of steps:**

1) EDA

2) Cleaning

3) Feature Engineering

4) Modeling

5) Model Evaluation

**This is more of an iterative process!** 

You may build a model only to find your accuracy is low, which will require you to go back and engineer new features, or perform some more EDA to ensure that you've selected the most important features for the problem at hand.
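
To make the loop concrete, here is a minimal sketch of the split, fit, and evaluate steps, assuming scikit-learn is installed. The function name `compare_models`, the two candidate models, and the 25% test size are illustrative assumptions, not requirements of the assignment; swap in whatever models and metrics suit your analysis.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def compare_models(df, target="Outcome", test_size=0.25, seed=0):
    """Fit a few candidate classifiers and return test-set accuracy per model."""
    X = df.drop(columns=[target])
    y = df[target]
    # Stratify so both splits keep roughly the same share of each class.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
    # Candidate models are illustrative choices, not a recommendation.
    candidates = {
        "logistic_regression": make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=1000)
        ),
        "random_forest": RandomForestClassifier(n_estimators=200, random_state=seed),
    }
    scores = {}
    for name, model in candidates.items():
        model.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, model.predict(X_test))
    # Highest accuracy first.
    return pd.Series(scores).sort_values(ascending=False)
```

Once the data is loaded, `compare_models(db)` returns one accuracy per model. If every score is disappointing, that is your cue to revisit the EDA and feature engineering steps and then rerun the comparison.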

<hr>
<br>

## Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
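# Use the full option name; the short form 'max_columns' is ambiguous in recent pandas versions
pd.set_option('display.max_columns', 100)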

## Load Data

In [None]:
db = pd.read_csv('./data/pima_diabetes.csv')

## Basic EDA

**Each row represents a single patient; each patient has several features**

In [None]:
db.columns

### Dataset 5-num Summary and Description

In [None]:
db.describe().transpose()

### Column Data Types

In [None]:
db.dtypes

### View Sampling of Rows

In [None]:
db.sample(10)

<hr>
<br>

## Column Breakdown

### Target: ```Outcome```: no diabetes: ```0```, diabetes: ```1```

### ```Pregnancies```: Number of pregnancies each woman has had.
### ```DiabetesPedigreeFunction```: Quantifies a person's propensity for diabetes, based on family history.

### The rest of the features are self-explanatory
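
Even self-explanatory columns deserve a sanity check during cleaning. A measurement such as BMI cannot really be zero, so a zero in a column like that more likely marks a missing value than a true reading. The sketch below counts zeros per column; the helper name `zero_counts` is hypothetical, and the column list is an assumption (only `BMI` and `SkinThickness` are named in this notebook), so adjust it to match `db.columns`.

```python
import pandas as pd

# Assumed column names; only BMI and SkinThickness appear elsewhere in this notebook.
ZERO_SUSPECT_COLS = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def zero_counts(df, cols=ZERO_SUSPECT_COLS):
    """Count zero entries per column, skipping any column the frame lacks."""
    present = [c for c in cols if c in df.columns]
    return (df[present] == 0).sum()
```

Calling `zero_counts(db)` gives a quick picture of how many rows might need imputation or removal before modelling.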

<hr>
<br>

## Intermediate EDA

**Just to get you started**

### ```Outcome```

In [None]:
sns.countplot(x='Outcome', data=db, palette='hls', label='count')
plt.show()

In [None]:
db['Outcome'].value_counts()/len(db)
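
The class shares above also give you a baseline: a model that always predicts the most common class achieves an accuracy equal to that class's share, so any model you build should beat it. A minimal sketch, with `majority_baseline` as a hypothetical helper name:

```python
import pandas as pd


def majority_baseline(y):
    """Return the most common class and the accuracy of always predicting it."""
    shares = pd.Series(y).value_counts(normalize=True)
    return shares.idxmax(), float(shares.max())
```

For example, `majority_baseline(db['Outcome'])` returns the majority label together with the accuracy to compare your models against.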

### ```Age```

In [None]:
db['Age'].hist(figsize=(8,5))
plt.show()
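
A continuous column like ```Age``` is a natural candidate for feature engineering. One possible approach, sketched below, is to bin it into bands with `pd.cut`. The bin edges, the helper name `add_age_band`, and the new column name `AgeBand` are all illustrative assumptions; choose edges that reflect what the histogram shows.

```python
import pandas as pd


def add_age_band(df, bins=(20, 30, 40, 50, 120), col="Age"):
    """Return a copy of df with a categorical 'AgeBand' column (hypothetical name)."""
    out = df.copy()
    # Intervals are right-closed: (20, 30], (30, 40], and so on.
    out["AgeBand"] = pd.cut(out[col], bins=list(bins))
    return out
```

`add_age_band(db)` leaves `db` untouched and returns a new frame, which makes it easy to compare models with and without the engineered feature.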

### Correlations: ```BMI``` vs ```SkinThickness```

In [None]:
plt.scatter('BMI', 'SkinThickness', data=db)
# Label the axes so the plot can be read on its own
plt.xlabel('BMI')
plt.ylabel('SkinThickness')
plt.show()
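
Scatter plots compare one pair of features at a time. To rank every numeric feature by how strongly it correlates with the target, something like the sketch below may help. The helper name `correlation_with_target` is hypothetical, and Pearson correlation only captures linear relationships, so treat the ranking as a starting point rather than a verdict.

```python
import pandas as pd


def correlation_with_target(df, target="Outcome"):
    """Pearson correlation of each numeric feature with the target, strongest first."""
    corr = df.select_dtypes("number").corr()[target].drop(target)
    # Order by absolute value so strong negative correlations rank high too.
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)
```

`correlation_with_target(db)` returns a Series you can inspect directly or pass to a bar plot.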

# Good Luck!