## Loading Libraries
First we need to load the required Python libraries. Libraries are like extensions to base Python that add functionality or help carry out specific tasks.

We will load some libraries that will boost your data handling capacity. The main ones are numpy and pandas - we will call them `np` and `pd`.

In [None]:
import numpy as np
import pandas as pd

## Loading the heart attack data set

Use `pd.read_csv` to read in the file (add `header=None`, because the file has no header row).

Use `heart.shape` to check the dimensions and `heart.head()` to take a look at the first few rows.

In [None]:
heart = pd.read_csv("./processed.cleveland.data.clean", header=None)

In [None]:
heart.shape

In [None]:
heart.head()
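
To see why `header=None` matters, here is a minimal sketch using a small in-memory CSV. The three-column data is made up for illustration and is not taken from the Cleveland file.

```python
import io
import pandas as pd

# Hypothetical two-row CSV with no header line (values are illustrative only).
raw = "63.0,1.0,145.0\n67.0,1.0,160.0\n"

# Without header=None, pandas consumes the first data row as column names.
with_header = pd.read_csv(io.StringIO(raw))

# With header=None, every row is kept as data and columns are numbered 0, 1, 2.
no_header = pd.read_csv(io.StringIO(raw), header=None)

print(with_header.shape)        # one data row is left
print(no_header.shape)          # both rows are kept
print(list(no_header.columns))  # integer column labels
```

Forgetting `header=None` on a headerless file silently drops one observation, which is easy to miss on a larger data set.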

What are we looking at?

The last column records the heart disease diagnosis: 0 means no disease, and 1, 2, 3, 4 are different grades of disease.

In [None]:
heart.columns = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num']

In [None]:
heart.head()

## Our goal: predict heart disease from the data

For now, we will treat the last column `num` as a binary outcome: 0 is no heart disease, and anything higher than zero is heart disease. We can do multi-class classification later!


In [None]:
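X = heart.iloc[:, 0:13]
# Binarize the outcome without modifying `heart` in place:
# 0 stays 0, and grades 1-4 all become 1. The Series keeps the name 'num'.
y = (heart.iloc[:, -1] > 0).astype(int)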

In [None]:
X.head()

In [None]:
y.head()

Here `y` holds the last column (`num`) as the outcome, and `X` keeps all the other columns in a separate DataFrame.

## Principal components analysis can reveal clustering

Use scikit-learn's `StandardScaler()` to scale all of the features, then use `seaborn` to produce a pairplot and a heatmap of the correlations.

In [None]:
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
X_scaled = pd.DataFrame(ss.fit_transform(X))
X_scaled.columns = X.columns
X = X_scaled
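
As a quick sanity check on what `StandardScaler` does, the sketch below scales a tiny made-up DataFrame (the values are illustrative, not from the heart data) and confirms that each column ends up with mean 0 and standard deviation 1.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data on very different scales.
toy = pd.DataFrame({"age": [40.0, 50.0, 60.0, 70.0],
                    "chol": [180.0, 220.0, 240.0, 300.0]})

# Keep the column names and index when wrapping the scaled array.
scaled = pd.DataFrame(StandardScaler().fit_transform(toy),
                      columns=toy.columns, index=toy.index)

# StandardScaler uses the population standard deviation (ddof=0).
print(np.allclose(scaled.mean(), 0.0))
print(np.allclose(scaled.std(ddof=0), 1.0))
```

Note that pandas' default `std()` uses `ddof=1`, so it will report values slightly above 1 on scaled data.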

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline


Pairplot:

In [None]:
sns.pairplot(X)


Heatmap:

In [None]:
corrmat = X.corr()

plt.figure(figsize=(12,8))
sns.heatmap(corrmat, 
            linewidths=0.5, 
            cmap="RdBu", 
            vmin=-1, 
            vmax=1, 
            annot=True)

plt.xticks(rotation=270);

Use scikit-learn `PCA` to calculate the PCs and display the variance explained.

In [None]:
from sklearn.decomposition import PCA

pca = PCA()
pca.fit(X)

# Number the components 1..n so the bars line up with PC1, PC2, ...
pc_numbers = np.arange(1, pca.n_components_ + 1)

# Variance explained by each component
sns.barplot(x=pc_numbers, y=pca.explained_variance_ratio_)

# Cumulative variance explained
plt.figure()
sns.barplot(x=pc_numbers, y=np.cumsum(pca.explained_variance_ratio_))
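
One common use of the cumulative variance is picking how many components to keep. The following is a minimal sketch on synthetic data; the 90% threshold and the toy feature scales are arbitrary assumptions, not values from this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 5 features with deliberately different spreads.
toy = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

pca_toy = PCA().fit(toy)
cumulative = np.cumsum(pca_toy.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the threshold.
threshold = 0.90  # assumed cut-off, chosen for illustration
n_needed = int(np.searchsorted(cumulative, threshold) + 1)

print(cumulative.round(3))
print(n_needed)
```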



Check out the `components_` attribute of the PCA object. What do the values mean?

In [None]:
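# Rows of pca.components_ are principal components; columns are the original features.
components = pd.DataFrame(pca.components_, 
                        columns = X.columns,
                        index   = ['PC' + str(i) for i in range(1, pca.n_components_ + 1)])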
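
To make the meaning of the components concrete, here is a small sketch on random data (not the heart data). It checks that each row of `components_` is a unit-length direction in feature space, and that `transform` is just the centered data projected onto those directions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
toy = rng.normal(size=(50, 3))  # synthetic data, for illustration only

pca_toy = PCA().fit(toy)

# Project the centered data onto each component (one component per row).
manual = (toy - pca_toy.mean_) @ pca_toy.components_.T
same_as_transform = np.allclose(manual, pca_toy.transform(toy))

# Each component is a unit vector of feature weights (loadings).
row_norms = np.linalg.norm(pca_toy.components_, axis=1)

print(same_as_transform)
print(row_norms)
```

A large absolute weight in a row means that feature contributes strongly to that principal component.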

## Bonus: display the components as a heatmap

In [None]:
plt.figure(figsize=(20, 16))
sns.heatmap(
        components.transpose(), 
        linewidths=0.5, 
        cmap="RdBu", 
        vmin=-1, vmax=1, annot=True )

Plot the first two principal components and overlay the disease state as the color to see if they cluster:

In [None]:
X_pca = pd.DataFrame(pca.transform(X))
X_pca.columns = ['PC' + str(i) for i in range(1, 14)]  # PC1..PC13, so PC1 is the first component
X_pca = X_pca.join(y)
X_pca.head()



In [None]:
sns.lmplot(x='PC1', y='PC2', data=X_pca, fit_reg=False, hue = 'num')


# Bonus (if you are ahead):

Load the iris data set using `load_iris` from scikit-learn and do PCA.

In [None]:
from sklearn.datasets import load_iris

In [None]:
iris = load_iris()
target = pd.DataFrame(iris.target)
iris = pd.DataFrame(iris.data, columns=iris.feature_names)
iris.head()

In [None]:
pca = PCA()
iris_pca = pd.DataFrame(pca.fit_transform(iris))
iris_pca.columns = ['PC' + str(i) for i in range(1, 5)]  # PC1..PC4
#iris_pca = iris_pca.join(target)

iris_pca['target'] = target
iris_pca.head()


In [None]:
sns.lmplot(x='PC1', y='PC2', data=iris_pca, fit_reg=False, hue = "target")

# Simple classifier using Logistic Regression

A logistic regression models a binary outcome using a linear model whose output is passed through a sigmoid (S-shaped) function, giving a probability between 0 and 1.

This is a very commonly used model and a great place to start. 
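
The idea can be sketched in a few lines of NumPy. The intercept and coefficient below are made-up values for illustration, not fitted parameters.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical one-feature model: score = intercept + coef * x
intercept, coef = -1.0, 2.0
x = np.array([-2.0, 0.0, 0.5, 3.0])

prob = sigmoid(intercept + coef * x)  # predicted probability of class 1
label = (prob >= 0.5).astype(int)     # threshold at 0.5 to get a class

print(prob.round(3))
print(label)
```

The linear score decides which side of the boundary a point falls on; the sigmoid only converts that score into a probability.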

First, use `train_test_split` to 'hide' some of our data to use for testing later on.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
print ("XTrain dimensions: ", X_train.shape)
print ("yTrain dimensions: ", y_train.shape)
print ("XTest dimensions: ", X_test.shape)
print ("yTest dimensions: ", y_test.shape)
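
As a small illustration of what `train_test_split` guarantees, the sketch below uses made-up arrays to show that the same `random_state` reproduces the same split and that no row lands in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 2 features, alternating labels.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0, 1] * 50)

a_train, a_test, ya_train, ya_test = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42)
b_train, b_test, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42)

# Same seed, same split.
reproducible = np.array_equal(a_test, b_test)

# The first column is unique per row, so it can serve as a row id.
overlap = set(a_train[:, 0]) & set(a_test[:, 0])

print(a_train.shape, a_test.shape)
print(reproducible, len(overlap))
```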

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

log_reg = LogisticRegression(class_weight='balanced')
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)

print (classification_report(y_test, y_pred))
print ("Overall Accuracy:", round(accuracy_score(y_test, y_pred), 2))

# Moving on to Random Forest

We will use a `RandomForestClassifier` to try to build a model to distinguish heart disease from healthy.

In [None]:
from sklearn.ensemble import RandomForestClassifier

Now, we can initialize the RandomForestClassifier and fit a model to our training data. Use `n_estimators=100` to use 100 different trees.

In [None]:
clf = RandomForestClassifier(max_depth=2, random_state=0, n_estimators=100)

In [None]:
clf.fit(X_train, y_train)

Use `X_test` as an input to the classifier to get `y_pred`. We can use the `classification_report` and `confusion_matrix` from scikit-learn to get an idea of how well we are performing.

In [None]:
y_pred = clf.predict(X_test)

In [None]:
print (classification_report(y_test, y_pred))
print ("Overall Accuracy: ", round(accuracy_score(y_test, y_pred), 2))

In [None]:
mat = confusion_matrix(y_test, y_pred) 
print (mat)
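
To read the confusion matrix, it helps to unpack it. This sketch uses made-up labels (not the model's predictions) and assumes the binary 0/1 coding used above, where rows are true classes and columns are predicted classes.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels, for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_hat  = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# For binary labels, ravel() returns the cells in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # of predicted positives, how many were right
recall = tp / (tp + fn)     # of actual positives, how many were found

print(tn, fp, fn, tp)
print(round(precision, 2), round(recall, 2))
```

These are the same precision and recall values that `classification_report` prints for class 1.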

Not bad, but we only ran 100 trees... if we create more than 100 trees, do we get better accuracy? Where does it stop being worth making more trees?

Create a plot for different values from 100 to 1000 trees. This may take a while!!


In [None]:
# Range of `n_estimators` values to explore.
min_estimators = 100
max_estimators = 1000

error_rate = []

for i in range(min_estimators, max_estimators + 1, 50):
    clf.set_params(n_estimators=i)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    
    error_rate.append( 1 - accuracy_score(y_test, y_pred))

# Plot the error rate against n_estimators.
# Note: this is the error on the held-out test set, not the out-of-bag (OOB) error.
plt.plot(range(min_estimators, max_estimators + 1, 50), error_rate)

plt.xlim(min_estimators, max_estimators)
plt.xlabel("n_estimators")
plt.ylabel("Test error rate")
plt.show()
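
The loop above measures error on the held-out test set. A true out-of-bag (OOB) estimate instead scores each sample using only the trees that did not see it during training. A minimal sketch, assuming synthetic data from `make_classification` rather than the heart data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, for illustration only.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

# oob_score=True asks the forest to compute accuracy on out-of-bag samples.
rf = RandomForestClassifier(n_estimators=200, max_depth=2,
                            oob_score=True, random_state=0)
rf.fit(X_toy, y_toy)

oob_error = 1 - rf.oob_score_
print(round(oob_error, 3))
```

The OOB estimate needs no separate test set, which is convenient on small data sets like this one.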


The `RandomForestClassifier` calculates a feature importance for each of the inputs. We can visualize these using seaborn barplot.

In [None]:
importance = clf.feature_importances_

In [None]:
sns.barplot(x = importance, y = X.columns)
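
Importances are often easier to read when sorted and labelled. The sketch below uses synthetic data with hypothetical feature names `f0`..`f4`; the same pattern should work with `X.columns` from the heart data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical feature names, for illustration only.
X_arr, y_toy = make_classification(n_samples=300, n_features=5,
                                   n_informative=2, random_state=0)
X_toy = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])

rf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
rf.fit(X_toy, y_toy)

# Pair each importance with its feature name and sort, largest first.
ranked = pd.Series(rf.feature_importances_, index=X_toy.columns)
ranked = ranked.sort_values(ascending=False)

print(ranked)
```

The importances sum to 1, so each value can be read as a share of the total.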

# (Bonus) Make a 'receiver operating characteristic' (ROC) curve and compare logistic regression and random forest

In [None]:
from sklearn.metrics import roc_curve

In [None]:
clf.set_params(n_estimators = 10000)
clf.fit(X_train, y_train)
log_reg.fit(X_train, y_train)

y_pred_rf = clf.predict_proba(X_test)[:, 1]
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_pred_rf)

y_pred_log_reg = log_reg.predict_proba(X_test)[:, 1]
fpr_log_reg, tpr_log_reg, _ = roc_curve(y_test, y_pred_log_reg)


In [None]:
plt.figure(1)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr_rf, tpr_rf, label='RF')
plt.plot(fpr_log_reg, tpr_log_reg, label='LogReg')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.legend(loc='best')
plt.show()
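
A ROC curve is often summarized by the area under it (AUC), where 0.5 is chance level and 1.0 is perfect ranking. A minimal sketch using `roc_auc_score` on synthetic data (not the heart data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data, for illustration only.
X_toy, y_toy = make_classification(n_samples=400, n_features=6, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy,
                                      test_size=0.33, random_state=42)

model = LogisticRegression().fit(Xtr, ytr)

# AUC is computed from predicted probabilities, not hard class labels.
auc = roc_auc_score(yte, model.predict_proba(Xte)[:, 1])
print(round(auc, 3))
```

Computing one AUC per model gives a single number for comparing the two curves.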

# (Double Bonus) Multi-class classification

At the beginning of the notebook, we took all of the different grades of heart disease (1,2,3,4) and made them all 1 to represent any type of heart disease.

Use RandomForest to do multi-class classification, predicting the grade itself. How does it do?

In [None]:
heart = pd.read_csv("./processed.cleveland.data.clean", header=None)
X = heart.iloc[:,0:13]
y = heart.iloc[:,-1]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)


In [None]:
clf = RandomForestClassifier(max_depth=2, random_state=0, n_estimators=1000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print (classification_report(y_test, y_pred))
print ("Overall Accuracy: ", round(accuracy_score(y_test, y_pred), 2))

print(confusion_matrix(y_test, y_pred))

# Recap

There are _dozens_ of different supervised learning techniques, all with different strengths, weaknesses, and applications.

- We learned two: logistic regression and random forests.
- These models are relatively simple to understand and often perform well. Random forests provide the additional benefit of being able to model non-linear relationships.
- More important than the models themselves are the principles of training and testing, and model evaluation. These will not change nearly as much as the underlying models!
