# Introduction to Machine Learning
## Identifying the causes of Diabetes
In this notebook you will implement your first machine learning algorithm to analyze a population health dataset: the Pima Indian diabetes dataset. The purpose of this analysis is to learn how to use machine learning to solve a very specific problem: identifying individuals at risk of diabetes and discovering potential causes of diabetes in the population.

## Contents
1. Import dataset
2. Data exploration
3. Feature engineering
4. Modeling
5. Model Evaluation

## How to use this notebook
- To execute any single block of code or markdown, use Ctrl+Enter, Shift+Enter or press the run arrow on the left of the box (only in Colaboratory)
- To reset the notebook select "Factory reset runtime" from the Runtime tab at the top of Colaboratory

## 1. Import dataset

In [None]:
# First let's import our data
import pandas as pd

url = 'https://raw.githubusercontent.com/jzhangab/DS101/master/1_Data/diabetes.csv'
df = pd.read_csv(url, sep = ',')

In [None]:
# Let's look at the first 5 rows to begin understanding what factors are available
df.head()

From the data frame we can see that the dataset consists of 8 different factors that may contribute to the risk of diabetes. The ground truth of whether an individual has diabetes is in the column "Outcome". We will use this information to train a machine learning model to understand how the different factors are connected.

## 2. Data Exploration
The purpose of data exploration is to understand the data. We will look primarily at histograms and scatterplots to see whether there are any interesting relationships.

In [None]:
# Let's take a look at the histograms of the dataframe to understand each factor.
import matplotlib.pyplot as plt

%matplotlib inline
fig = plt.figure(figsize = (15,15))
ax = fig.gca()
df.hist(ax = ax)

In [None]:
# Age is skewed young, but how does it relate to one of the more normally distributed factors such as BloodPressure?
# Try changing the x and y variables in the scatterplot declaration to view other relationships.
%matplotlib inline
df.plot.scatter(x = 'Age',
                y = 'BloodPressure')

In [None]:
# What about BMI vs. Outcome? Are higher BMI persons more likely to have diabetes?
# A single-factor analysis like this can tempt you to conclude that BMI causes diabetes, when it is only one of several factors
%matplotlib inline
import numpy as np
from sklearn.linear_model import LinearRegression

x = pd.DataFrame(df['BMI'])
y = pd.DataFrame(df['Outcome'])

# create a linear regression model
model = LinearRegression()
model.fit(x, y)

# predict y from the data
x_new = np.linspace(15, 60, 100)
y_new = model.predict(x_new[:, np.newaxis])

# plot the results
plt.figure(figsize=(8, 6))
ax = plt.axes()
ax.scatter(x, y)
ax.plot(x_new, y_new)

ax.set_xlabel('BMI')
ax.set_ylabel('Outcome')

ax.axis('tight')

plt.show()

## 3. Feature Engineering
The purpose of feature engineering is to prepare data for modeling. The diabetes dataset is well formatted and contains no text variables, so we will do only two things to prepare the data:

1. Handle missing data
2. Reduce multicollinearity

In [None]:
# By far the most important thing to understand about a dataset is how "clean" it is
# Missing data is a key part of cleanliness, so let's check how much data is missing for each factor
for col in list(df):
    num_na = len(df[col]) - df[col].count()
    print ("Percent null in column " + col + " is:", 100*num_na/len(df[col]))

In [None]:
# For some columns such as Glucose, BloodPressure, SkinThickness, Insulin, BMI, and Age we are not only concerned with null values but also 0 values
for col in list(df):
    num_0 = len(df.loc[df[col] == 0][col])
    print ("Percent 0 in column " + col + " is:", 100*num_0/len(df[col]))

In [None]:
# Let's do the following to clean our data

# 1. Replace all null values with 0
df.fillna(0, inplace=True)

# 2. Remove any data point where any of the following are 0: Glucose, BloodPressure, SkinThickness, Insulin, BMI, or Age
# List of factors we want to remove 0 values from
nonzero_factors = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'Age']
# Iterate over the list of columns and subset the main dataframe by each column where values are nonzero
for col in nonzero_factors:
    df = df.loc[df[col] != 0]

### Multicollinearity
When input factors are closely correlated (collinear), they carry overlapping information: the model cannot cleanly separate their individual effects, and its coefficient estimates become unstable. We need to check whether any of our factors are highly collinear and, if so, remove all but one of the collinear factors from the dataframe.
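
To make the idea concrete, here is a minimal sketch of the variance inflation factor (VIF) in the simplest case of exactly two predictors, where it reduces to 1 / (1 - r^2) with r the Pearson correlation between them. The helper names (`pearson_r`, `vif_two_predictors`) and the numbers are made up for illustration and are not taken from the diabetes dataset.

```python
import math

def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def vif_two_predictors(a, b):
    """VIF in the special case of exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(a, b)
    return 1.0 / (1.0 - r ** 2)

# Hypothetical values, chosen only to illustrate the two extremes
factor_a = [10, 20, 30, 40, 50]
tracks_a = [21, 24, 33, 38, 46]   # rises almost in lockstep with factor_a
unrelated = [3, 1, 2, 1, 3]       # has zero correlation with factor_a

vif_collinear = vif_two_predictors(factor_a, tracks_a)
vif_unrelated = vif_two_predictors(factor_a, unrelated)
print("VIF for the collinear pair:", vif_collinear)
print("VIF for the unrelated pair:", vif_unrelated)
```

A VIF close to 1 means a factor shares almost no information with the other one, while a large VIF means most of its variation is already explained elsewhere. With more than two predictors, r^2 is replaced by the R^2 from regressing each factor on all of the others.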

In [None]:
# We can check the pairwise correlation (Pearson's r) between variables using a correlation matrix
df.corr()

In [None]:
# To quantify multicollinearity, we will use variance inflation factor (VIF)
# Rule of thumb, VIF above 10 indicates a particular variable ought to be removed
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# A constant term needs to be added to measure VIF accurately for the other terms
# The VIF reported for the constant itself is often high and can be ignored
# Only the input factors are included; the Outcome label is left out
df_c = add_constant(df[[c for c in list(df) if c != 'Outcome']])

# Compute the VIF of every column with a list comprehension and show the results as a pandas Series
pd.Series([variance_inflation_factor(df_c.values, i)
           for i in range(df_c.shape[1])],
          index=df_c.columns)

## 4. Modeling
In the modeling step we will train a supervised machine learning model to understand relationships in the diabetes data set. We will then evaluate the model to see how well it predicts.

The particular model that we will use is Logistic Regression. This model is commonly used in binary classification for predictive analytics.

1. Split dataset into training and validation datasets
2. Train model
3. Predict outcomes of validation dataset
4. Calculate accuracy of validation dataset
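
Before training, it may help to see what a Logistic Regression model computes when it predicts: a weighted sum of the input factors passed through the sigmoid function, which gives a probability between 0 and 1. The sketch below uses made-up coefficients for two factors; they are assumptions for illustration, not values fitted from the diabetes data, and `predict_probability` is a hypothetical helper rather than part of scikit-learn.

```python
import math

def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(features, weights, intercept):
    """Probability of the positive class for one data point."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical coefficients for (Glucose, BMI); not fitted from real data
weights = [0.04, 0.09]
intercept = -8.0

low_risk = predict_probability([90, 22], weights, intercept)
high_risk = predict_probability([180, 40], weights, intercept)

# A probability above 0.5 is classified as 1 (positive), otherwise 0
for name, p in [("low_risk", low_risk), ("high_risk", high_risk)]:
    print(name, round(p, 3), "-> predicted class", int(p > 0.5))
```

Training the model amounts to finding the weights and intercept that make these probabilities agree with the known outcomes as closely as possible.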

In [None]:
# We will split the data 80%/20%, using 80% of the data to train the model and 20% to validate the accuracy of the model
# We can use pre-built functions from the machine learning package scikit-learn to do this task
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

# The Outcome column is not part of the input features, so we will use a list comprehension to create a new list and call it "features"
features = [col for col in list(df) if col not in ['Outcome']]

# The X input is df[features], which is every column of the dataframe named in the features list
# The y input is df['Outcome'], which is the binary label column
X = df[features]
y = df['Outcome']

# Generate the 4 datasets we need
# X_train and y_train to train the model
# X_test to generate predictions
# y_test to evaluate the accuracy of the predictions
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [None]:
# So what's the difference between X_train and X_test? Let's compare their sizes
print("Length of X_train ", len(X_train))
print("Length of X_test ", len(X_test))

In [None]:
# Declare and fit model
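# max_iter is raised from the default of 100 because the unscaled features may need more iterations to converge
model = LogisticRegression(random_state=0, max_iter=1000)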
model.fit(X_train, y_train)

In [None]:
# Predict using test set
y_pred = model.predict(X_test)

## 5. Model Evaluation
We will use several techniques to evaluate the strength of the model:

1. Accuracy
2. Confusion Matrix (false positive, true positive, false negative, true negative)
3. Receiver operating characteristic
4. Sigmoid probability visualization

In [None]:
# Compare y_test (true values) to y_pred (predicted values)
accuracy_score(y_test, y_pred)

In [None]:
# Let's take a look at the confusion matrix, which shows us false positives and false negatives
confusion_matrix(y_test, y_pred)
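
For binary labels 0 and 1, scikit-learn lays the confusion matrix out as `[[TN, FP], [FN, TP]]`. As a rough sketch of how those four counts relate to accuracy, precision and recall, the example below counts them by hand on a small set of made-up labels (the `_demo` lists and the `confusion_counts` helper are illustrative assumptions, not results from the diabetes model).

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels 0 and 1."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1   # correctly flagged as positive
        elif truth == 0 and pred == 1:
            fp += 1   # flagged as positive but actually negative
        elif truth == 1 and pred == 0:
            fn += 1   # missed positive
        else:
            tn += 1   # correctly left as negative
    return tn, fp, fn, tp

# Made-up labels for illustration only
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred_demo = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true_demo, y_pred_demo)
accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)   # of those predicted positive, how many are
recall = tp / (tp + fn)      # of the true positives, how many were found

print([[tn, fp], [fn, tp]])
print("accuracy:", accuracy, "precision:", precision, "recall:", recall)
```

In a health screening setting, false negatives (missed cases) and false positives usually carry different costs, which is why accuracy alone can be misleading.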

In [None]:
# Another method of evaluating a classifier is using the Receiver Operating Characteristic (ROC)
# ROC is a plot of the true positive rate vs. the false positive rate across all decision thresholds. We calculate the area under the curve (AUC)
# AUC = 1 indicates a perfect classifier, AUC = 0.5 means the classifier is no better than a coin flip
from sklearn.metrics import roc_curve, roc_auc_score

%matplotlib inline
y_pred_proba = model.predict_proba(X_test)[:, 1]
falseposrate, trueposrate, _ = roc_curve(y_test, y_pred_proba)
auc = roc_auc_score(y_test, y_pred_proba)
plt.plot(falseposrate, trueposrate, label="ROC curve, auc=" + str(auc))
plt.legend(loc=4)
plt.show()
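
One way to build intuition for AUC is that it equals the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case, with ties counting as half. The sketch below computes it that way on a few made-up scores; `auc_by_ranking` is a hypothetical helper written for illustration, not the method scikit-learn uses internally.

```python
def auc_by_ranking(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(positives) * len(negatives))

# Made-up labels and predicted probabilities for illustration only
labels_demo = [0, 0, 1, 1]
scores_demo = [0.1, 0.4, 0.35, 0.8]

auc_demo = auc_by_ranking(labels_demo, scores_demo)
print("AUC:", auc_demo)
```

Here three of the four positive/negative pairs are ordered correctly, so the AUC is 0.75.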

In [None]:
# With Logistic Regression it is possible to use a Sigmoid curve to visualize the probability function for a single variable
# Below you will see the Sigmoid probability function for Glucose. Where the curve rises above 0.5, a model based on that variable alone would predict the data point as a 1 (positive for diabetes)
# Try changing the x variable to visualize this decision function for each variable

import seaborn as sns
sns.regplot(x='Glucose', y='Outcome', data=df, logistic=True)