<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

#  End-to-End Data Science: a Review

_Authors: B Rhodes (DC)_

## Learning Objectives

After this lesson, students will be able to:


- **Describe** the data science workflow.
- **Perform** EDA using pandas.
- **Implement** one or more algorithms using sklearn.
- **Evaluate** the performance of a machine learning algorithm using various metrics.

---

In [None]:
# Imports
import numpy as np
import pandas as pd

# sklearn imports
from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV, LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import roc_curve, auc, confusion_matrix
from sklearn.metrics import precision_score, recall_score, accuracy_score
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

from sklearn.preprocessing import StandardScaler
import joblib


# data
from sklearn.datasets import load_iris, load_diabetes

# Import visualization modules
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline



## Background

Diabetes is a chronic condition that causes an individual’s blood sugar level to become too high. Diabetics may suffer from one or more dangerous and sometimes life-threatening symptoms, among which are:

- High blood pressure
- Increased risk of infection
- Risk of heart disease
- Gastroparesis
- Damaged blood vessels
- Pancreas malfunctioning, etc. (World Health Organization, 2018)

According to the World Health Organization, the number of people with diabetes has risen from 108 million in 1980 to 422 million in 2014 and this trend continues. (World Health Organization, 2018)

The dataset used here is from a study of Pima Indians in Arizona. Due to a number of factors, the Pima Indians and other Native American populations suffer from worse health outcomes than the majority population. One factor is the switch to the high-fat diet common among white Americans and to a sedentary lifestyle, which together are making Native Americans sick at a faster rate than other Americans.

The dataset has been restricted to the following patient population: 

- All patients are female
- All patients are at least 21 years old
- All patients are of Pima Indian heritage

The data consists of the following features:

- Pregnancies: Number of times pregnant
- Glucose: Plasma glucose concentration over 2 hours in an oral glucose tolerance test
- BloodPressure: Diastolic blood pressure (mm Hg)
- SkinThickness: Triceps skin fold thickness (mm)
- Insulin: 2-Hour serum insulin (mu U/ml)
- BMI: Body mass index (weight in kg / (height in m)^2)
- DiabetesPedigreeFunction: Diabetes pedigree function (a function which scores likelihood of diabetes based on family history)
- Age: Age (years)
- Outcome: Class variable (0 if non-diabetic, 1 if diabetic)

Our goal is to build a model to predict whether someone has diabetes or not.

The data is located in this repo here: `./data/diabetes.csv`

Load the diabetes data.

In [None]:
# clean data is at './data/diabetes.csv'

diabetes_df = pd.read_csv('./data/diabetes_fix.csv')

diabetes_df.head()

## EDA
The basic steps of exploratory data analysis are:
1. Verify the data was read in correctly and is properly formatted.
2. Determine how much data we have and what the shape of the data is.
3. Identify the data types contained in the data.
4. Identify the target (dependent variable) and features (independent variables).
5. Look for missing data.
6. Perform statistical analysis of the data, look for outliers, low-variance features, etc.
7. Look for correlations between features, and between features and the target.
8. Visualize the data to see distributions, outliers, etc.

A key goal of EDA, beyond understanding our data, is to determine what preprocessing we might need to perform. This can include converting data types, one-hot encoding, dropping features, filling in or dropping missing data, feature engineering, or going out to collect more data.

### Pre-processing
All data must be numerical, and we need to fill in any missing data. We'll also do any feature engineering we think might help the process.


### **Q. What is our target? Which columns are our features (independent variables)?**

#### Answer: 

In [None]:
## Check how much data and what types

### Q. Other than `.info()` what other pandas commands would give us this information?



In [None]:
# Answer here for determining how much data and what types are in the data set.
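One possible answer, sketched on a small made-up frame: `.shape` reports the row and column counts and `.dtypes` lists the type of each column. The name `toy_df` and its values are illustrative assumptions, not the real diabetes data.

```python
import pandas as pd

# Toy stand-in for diabetes_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "Glucose": [148, 85, 183],
    "BMI": ["33.6", "26.6", "?"],   # stored as strings, like the lesson's BMI column
    "Outcome": [1, 0, 1],
})

n_rows, n_cols = toy_df.shape   # (number of rows, number of columns)
col_types = toy_df.dtypes       # one dtype per column

print(n_rows, n_cols)
print(col_types)
```

On the real data, the same two attributes would be read from `diabetes_df`.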

### Q. Look for missing data. What's a quick way to check for missing values across the whole dataset?


In [None]:
# Hint: chain two pandas methods together.
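A minimal sketch of the chained check, assuming a toy frame (`toy_df` is hypothetical): `isnull()` flags each cell, and `sum()` counts the flags per column. Note that a placeholder string such as `'?'` is not counted as missing.

```python
import numpy as np
import pandas as pd

# Toy data: one real NaN in Glucose, one '?' placeholder in BMI.
toy_df = pd.DataFrame({
    "Glucose": [148.0, np.nan, 183.0],
    "BMI": ["33.6", "26.6", "?"],
})

# isnull() marks each cell True/False; sum() counts the Trues per column.
missing_per_column = toy_df.isnull().sum()
print(missing_per_column)
```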

Does this mean we have no missing data?

Since the `BMI` column is made up of strings we should check they are all valid numerical values.


### Fix data type issues and missing data.

In [None]:
# write a lambda function to map '?'s to NaN

In [None]:
# use the apply method to apply the lambda function to BMI column

# these will still be strings so we can convert to float in this step. It handles the NaNs
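The two steps above could look roughly like this on a toy column. The frame `toy_df` and the helper name `to_nan` are assumptions made for the sketch.

```python
import numpy as np
import pandas as pd

# Toy BMI column containing a '?' placeholder.
toy_df = pd.DataFrame({"BMI": ["33.6", "?", "26.6"]})

# Map the '?' placeholder to NaN and leave every other value untouched.
to_nan = lambda value: np.nan if value == "?" else value

# Apply the lambda, then cast the remaining strings to float (NaN stays NaN).
toy_df["BMI"] = toy_df["BMI"].apply(to_nan).astype(float)
print(toy_df["BMI"])
```

An alternative worth knowing is `pd.to_numeric(series, errors="coerce")`, which turns any unparseable value into NaN in a single call.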

In [None]:
# Look at the value counts. We have to include NaNs

#### Convert Outcome Column to Numeric

In [None]:
# generate summary statistics on the data

In [None]:
# Let's fill the BMI = 0.0 with the median value.
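One way to do this, sketched on made-up values. The median here is computed over the whole column, zeros included, which is an assumption; computing it over the non-zero rows only is an equally reasonable choice.

```python
import pandas as pd

# Toy BMI column with one implausible 0.0 entry.
toy_df = pd.DataFrame({"BMI": [30.0, 0.0, 20.0, 40.0]})

# Median of all BMI values (assumption: the zero is included in the calculation).
bmi_median = toy_df["BMI"].median()

# Overwrite only the rows where BMI is exactly 0.0.
toy_df.loc[toy_df["BMI"] == 0.0, "BMI"] = bmi_median
print(toy_df)
```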

In [None]:
# grab the column names for later use

In [None]:
# Let's see if there are any strong correlations in our data
# Q What do we see here? Does anything stand out?
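A small sketch of the correlation check, using invented numbers (`toy_df` is hypothetical). `DataFrame.corr()` returns pairwise Pearson correlations, and pulling out the target column shows how each feature relates to `Outcome`.

```python
import pandas as pd

# Toy numeric data; the values are invented for illustration.
toy_df = pd.DataFrame({
    "Glucose": [90, 110, 130, 150, 170],
    "BMI": [22.0, 31.0, 25.0, 35.0, 29.0],
    "Outcome": [0, 0, 1, 1, 1],
})

# Pairwise Pearson correlations between all numeric columns.
corr_matrix = toy_df.corr()

# Correlation of each feature with the target, largest value first.
target_corr = corr_matrix["Outcome"].drop("Outcome").sort_values(ascending=False)

print(corr_matrix.round(2))
print(target_corr.round(2))
```

The resulting matrix can be passed to `sns.heatmap(corr_matrix, annot=True)` for a visual version.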

In [None]:
# visualize the correlation matrix
# Is this easier than looking at the matrix?

In [None]:
# Visualize the outcome variable

In [None]:
# We'll also get a count and ratio to look for imbalance
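A minimal sketch of the count and ratio, assuming a toy `Outcome` series: `value_counts()` gives raw counts, and `normalize=True` turns them into fractions.

```python
import pandas as pd

# Toy outcome column: five negatives and three positives.
outcome = pd.Series([0, 0, 0, 1, 0, 1, 0, 1], name="Outcome")

class_counts = outcome.value_counts()                 # raw count per class
class_ratios = outcome.value_counts(normalize=True)   # fractions that sum to 1

print(class_counts)
print(class_ratios)
```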

In [None]:
# We know we have 768 observations so we're going to split our data into 3 chunks.
# take the last 18 observations for a final validation DF

#confirm our split
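The hold-out step might be sketched with positional slicing as below. The frame, the names `holdout_df` and `working_df`, and the hold-out size of 3 are all assumptions for the toy example (the lesson itself holds out 18 rows).

```python
import pandas as pd

# Toy frame with 10 rows standing in for the full dataset.
toy_df = pd.DataFrame({"Glucose": range(100, 110), "Outcome": [0, 1] * 5})
n_holdout = 3  # the lesson uses 18; 3 keeps the toy example small

holdout_df = toy_df.iloc[-n_holdout:]   # last n_holdout rows, kept for final validation
working_df = toy_df.iloc[:-n_holdout]   # everything before them, for train/test

# Confirm the split: row counts should add up to the original.
print(working_df.shape, holdout_df.shape)
```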

In [None]:
# Let's split the bulk of the data into training and testing sets using an 80:20 split

# first define X, y


In [None]:
# Train test split the data 80/20 and stratify to ensure the outcomes are proportional in each segment.

In [None]:
# verify the split

In [None]:
# verify the outcome ratios of the train and test splits
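The define, split, and verify steps above can be sketched as follows on toy data. The frame and the `random_state` value are assumptions; `stratify=y` asks `train_test_split` to keep the class proportions roughly equal in both pieces.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 20 rows, 25% positive outcomes.
toy_df = pd.DataFrame({
    "Glucose": range(100, 120),
    "Outcome": [0, 0, 0, 1] * 5,
})

# Define features (X) and target (y).
X = toy_df[["Glucose"]]
y = toy_df["Outcome"]

# 80/20 split, stratified on the target.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Verify the sizes and the share of positives in each piece.
print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```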

### End EDA
We converted data types where needed, so all our data is now numerical.

## Build and Compare

Let's build a few models and compare them.

We're working with a blank slate here and I did not pre-build anything, so we are doing this from scratch. Expect some bumps and missteps along the way.

1. What type of problem is this: regression or classification?
2. What metrics and tools should we use to evaluate performance?

What model should we try first?

In [None]:
# instantiate a logistic regression model

In [None]:
#fit

In [None]:
# Predict

In [None]:
# Score
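The instantiate, fit, predict, and score steps might look like this. The sketch uses synthetic data from `make_classification` in place of the diabetes features, and `max_iter=1000` is an assumed setting chosen to avoid convergence warnings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the diabetes features.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

log_reg = LogisticRegression(max_iter=1000)      # instantiate
log_reg.fit(X_train, y_train)                    # fit on the training data
y_pred = log_reg.predict(X_test)                 # predict class labels
test_accuracy = log_reg.score(X_test, y_test)    # mean accuracy on the test set

print(test_accuracy)
```

For a classifier, `.score()` returns accuracy, so it should be compared against the baseline accuracy before drawing conclusions.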

In [None]:
# what else should we look at? Confusion matrix


In [None]:
# we need to unravel the confusion matrix # tn, fp, fn, tp = cm_log.ravel()

In [None]:
# true positive rate = TPR = tp/(tp+fn)


In [None]:
# TPR is also called recall, which we can score directly from sklearn
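A small worked example of the confusion matrix steps, using hand-made labels (the `y_true` and `y_pred` lists are invented). It shows that the manual true positive rate and `recall_score` agree.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hand-made labels: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

cm = confusion_matrix(y_true, y_pred)

# For binary labels, ravel() flattens the 2x2 matrix in this order.
tn, fp, fn, tp = cm.ravel()

tpr = tp / (tp + fn)                     # true positive rate computed by hand
recall = recall_score(y_true, y_pred)    # the same quantity from sklearn

print(cm)
print(tpr, recall)
```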


In [None]:
# As a reminder the baseline accuracy on test set is 65%

In [None]:
# Let's try KNN
# We have to scale our data first.

#instantiate


# fit the scaler


# Transform the training data


# Transform the testing data
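The scaling steps could be sketched as below on two tiny arrays (the numbers are invented). The scaler is fit on the training data only, and the same learned mean and standard deviation are then reused for the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training and testing arrays with two features on very different scales.
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

scaler = StandardScaler()                    # instantiate
scaler.fit(X_train)                          # learn mean and std from training data only
X_train_scaled = scaler.transform(X_train)   # transform the training data
X_test_scaled = scaler.transform(X_test)     # transform the testing data with the same statistics

print(X_train_scaled)
print(X_test_scaled)
```

Fitting on the training set alone keeps information about the test set from leaking into the model.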



In [None]:
# Find the best K

# pick a range for k


# define the parameter grid


# display them


In [None]:
# instantiate a KNN classifier

# instantiate a grid search GridSearchCV (use the hyper-parameters, scoring = 'recall', return_train_score=True)

In [None]:
# Fit it


In [None]:
# what is the best estimator, use .best_estimator_

In [None]:
# make predictions with the best estimator

In [None]:
# Recall
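Putting the KNN steps together, a rough end-to-end sketch might look like this. It runs on synthetic data rather than the diabetes set, and the range of odd `k` values, `cv=5`, and the `random_state` values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the diabetes features.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scale using statistics learned from the training data only.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Candidate values for k (odd numbers avoid ties in a two-class vote).
param_grid = {"n_neighbors": list(range(1, 16, 2))}

# Grid search over k, optimizing recall with 5-fold cross-validation.
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid,
    scoring="recall",
    cv=5,
    return_train_score=True,
)
grid.fit(X_train_s, y_train)

best_knn = grid.best_estimator_                # refit on the full training set
y_pred = best_knn.predict(X_test_s)            # predictions from the best estimator
knn_recall = recall_score(y_test, y_pred)      # recall on the test set

print(grid.best_params_, knn_recall)
```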