# BMI ML Bootcamp #1 - Supervised Learning

In [None]:
from IPython.display import SVG

import pandas as pd
import numpy as np
from graphviz import Source

import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from mpl_toolkits.mplot3d import Axes3D

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import export_graphviz

from sklearn.neighbors import KNeighborsClassifier

from sklearn.svm import SVC

For this first tutorial, we'll work with the standard scikit-learn iris dataset. It contains the sepal length, sepal width, petal length, and petal width (all in cm) of 150 flowers, 50 from each of three iris species (Setosa, Versicolour, and Virginica).

In [None]:
iris = load_iris()

In [None]:
iris_df = pd.DataFrame(data = np.c_[iris['data'], iris['target']], columns = iris['feature_names'] + ['target'])
iris_df.head()

#### Pick 3 data fields and create a 3D scatterplot with the points colored by target.

Did the classes separate out visually when plotted by the three fields you chose?
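
One possible starting point, as a minimal sketch: the helper name `scatter_3d` and the choice of columns 0, 2, and 3 are illustrative assumptions, not part of the exercise. Swap in whichever three fields you want to compare.

```python
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (only needed on older Matplotlib)
from sklearn.datasets import load_iris

iris = load_iris()


def scatter_3d(data, target, feature_names, cols=(0, 2, 3)):
    """Scatter three feature columns in 3D, colored by class label."""
    i, j, k = cols
    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    ax.scatter(data[:, i], data[:, j], data[:, k], c=target, cmap="viridis")
    ax.set_xlabel(feature_names[i])
    ax.set_ylabel(feature_names[j])
    ax.set_zlabel(feature_names[k])
    return fig, ax


# In a notebook the figure renders when the cell finishes.
fig, ax = scatter_3d(iris.data, iris.target, iris.feature_names)
```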

In [None]:
### Your code here ###

#### Split iris data up into test/training set

In [None]:
X = iris.data
y = iris.target

# What size test set is reasonable?
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = ?) 
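
As a hedged illustration of the split, the sketch below holds out 30% of the data. The 0.3 value is only an assumption for demonstration (anything from roughly 0.2 to 0.3 is common for a dataset this small), and `stratify` and `random_state` are optional extras that keep class proportions equal and make the split repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target

# test_size=0.3 is an illustrative choice, not the "right" answer.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

print(X_train.shape, X_test.shape)
```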

## 1. Random Forest Classifier 
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

Set up a Random Forest classifier. Some hyperparameters to test:

* n_estimators
* criterion
* max_depth
* max_features

There are many others; see the documentation for the full list.
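
To show how these hyperparameters are passed, here is a minimal sketch. The specific values (50 trees, entropy criterion, depth 3) and the 30% test split are arbitrary assumptions chosen for illustration; the point of the exercise is to vary them yourself.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Hyperparameter values below are illustrative, not tuned.
rfc = RandomForestClassifier(
    n_estimators=50,
    criterion="entropy",
    max_depth=3,
    max_features="sqrt",
    random_state=0,
)
rfc.fit(X_train, y_train)

accuracy = rfc.score(X_test, y_test)  # mean accuracy on held-out data
print(f"test accuracy: {accuracy:.3f}")
```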

In [None]:
# Create RandomForestClassifier and fit using your training data
rfc = RandomForestClassifier() 
rfc.fit(?, ?) 

In [None]:
# How well does your model perform on your test set?

rfc.score(?, ?)

Visualize a few of your trees using the code below. How do they differ from one another? Does changing the number of estimators, the split criterion, or the leaf-size settings (such as min_samples_leaf) noticeably change what individual trees look like?

In [None]:
tree = rfc.estimators_[?]
graph = Source(export_graphviz(tree, out_file=None, feature_names=iris.feature_names, class_names = iris.target_names))
SVG(graph.pipe(format='svg'))

#### Which features are most important? (Hint: look at the RandomForestClassifier attributes.)
What feature was the most important? Does this change as you modify the hyperparameters of your model? If you train without that feature, what accuracy do you achieve?
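
One way to approach these questions is sketched below, using the fitted forest's `feature_importances_` attribute and `np.delete` to drop the top-ranked column. The split, the default hyperparameters, and the variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, stratify=iris.target, random_state=0
)

rfc = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Impurity-based importances, one per feature, summing to 1.
importances = rfc.feature_importances_
order = np.argsort(importances)[::-1]  # most important first
for idx in order:
    print(f"{iris.feature_names[idx]}: {importances[idx]:.3f}")

# Retrain without the top-ranked feature and compare accuracy.
top = order[0]
X_train_reduced = np.delete(X_train, top, axis=1)
X_test_reduced = np.delete(X_test, top, axis=1)
rfc_reduced = RandomForestClassifier(n_estimators=100, random_state=0)
rfc_reduced.fit(X_train_reduced, y_train)
reduced_accuracy = rfc_reduced.score(X_test_reduced, y_test)
print(f"accuracy without {iris.feature_names[top]}: {reduced_accuracy:.3f}")
```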

## 2. k Nearest Neighbors
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

Create a KNeighborsClassifier and train it on your data.

In [None]:
neigh = ?
neigh.fit(?, ?)
neigh.score(?, ?)

Using the code below, plot the decision boundaries. Do they look reasonable? How do they change if you change k? How do they change if you use uniform weights vs. distance weights?

In [None]:
def plot_boundaries(model, training_X, training_y, n_neighbors):

    X = training_X[:, :2]
    y = training_y
    
    model.fit(X, y)
    
    h = .02 

    cmap_light = ListedColormap(['orange', 'cyan', 'cornflowerblue'])
    cmap_bold = ListedColormap(['darkorange', 'c', 'darkblue'])
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold,
                edgecolor='k', s=20)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.title("3-Class classification (k = %i)" % (n_neighbors))

    plt.show()
    
plot_boundaries(KNeighborsClassifier(n_neighbors=?), X_train, y_train, ?)

How well does this model perform on your test set? What value of k results in the highest accuracy?
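
A simple way to explore this is to loop over several values of k and both weighting schemes, as in the sketch below. The candidate k values and the 30% split are assumptions for illustration, and with a test set this small several settings may tie.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Test accuracy for each (k, weights) combination.
scores = {}
for k in (1, 3, 5, 7, 9, 11, 15):
    for weights in ("uniform", "distance"):
        model = KNeighborsClassifier(n_neighbors=k, weights=weights)
        model.fit(X_train, y_train)
        scores[(k, weights)] = model.score(X_test, y_test)

# max() returns the first best entry if several settings tie.
best = max(scores, key=scores.get)
print(f"best setting: k={best[0]}, weights={best[1]}, accuracy={scores[best]:.3f}")
```

Picking k on the test set makes the reported accuracy optimistic; cross-validation on the training set (for example with `cross_val_score`) is the more careful option.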

## 3. Support Vector Machines
https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

Create and train a support vector classifier, then test its accuracy on your test set.

Which points does your model consider to be the support vectors? Plot them on the 3D graph you made above. Do they seem reasonable?
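
As a rough sketch of how to find and plot them: a fitted `SVC` exposes `support_vectors_` (the points themselves), `support_` (their row indices in the training data), and `n_support_` (counts per class). The RBF kernel, the 30% split, and the plotted columns 0, 2, and 3 are assumptions for this example.

```python
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (only needed on older Matplotlib)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, stratify=iris.target, random_state=0
)

svc = SVC(kernel="rbf").fit(X_train, y_train)
sv = svc.support_vectors_
print("support vectors per class:", svc.n_support_)

# Training points colored by class, support vectors overlaid as black crosses.
i, j, k = 0, 2, 3  # illustrative choice of three features
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(X_train[:, i], X_train[:, j], X_train[:, k],
           c=y_train, cmap="viridis", alpha=0.4)
ax.scatter(sv[:, i], sv[:, j], sv[:, k], color="black", marker="x", s=60)
ax.set_xlabel(iris.feature_names[i])
ax.set_ylabel(iris.feature_names[j])
ax.set_zlabel(iris.feature_names[k])
```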

Try a few different kernels. Do they improve performance? Which worked the best?
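
The comparison could be sketched as a loop over the four kernels built into `SVC`. All other hyperparameters are left at their defaults here, which is an assumption; some kernels (sigmoid in particular) can be sensitive to feature scaling and to `C` and `gamma`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Test accuracy for each built-in kernel, default settings otherwise.
kernel_scores = {}
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    svc = SVC(kernel=kernel).fit(X_train, y_train)
    kernel_scores[kernel] = svc.score(X_test, y_test)

for kernel, score in sorted(kernel_scores.items(), key=lambda kv: -kv[1]):
    print(f"{kernel}: {score:.3f}")
```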

### Of the three classifiers, which classifier performed the best? Is it the one that you expected? Which performed the worst?

If you're curious, scikit-learn implements many other classifiers, such as AdaBoost and Naive Bayes, that you could try. Alternatively, you could use one of the features as the label instead and treat this as a regression task.
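
A minimal sketch of trying two of these, assuming `GaussianNB` as the Naive Bayes variant (a reasonable fit for continuous features) and `AdaBoostClassifier` with 50 estimators, both on the same illustrative 30% split:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Both models share the same fit/score interface as the classifiers above.
extra_models = {
    "GaussianNB": GaussianNB(),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=0),
}
extra_scores = {}
for name, model in extra_models.items():
    model.fit(X_train, y_train)
    extra_scores[name] = model.score(X_test, y_test)
    print(f"{name}: {extra_scores[name]:.3f}")
```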