# Intro to Scikit-Learn

Introduces Scikit-Learn through two simple supervised learning algorithms.

**Author**: Michael duPont

Created for Orlando Python ML Workshop

## Intro

Welcome to the magical world of machine learning. Well...it's not really magic, but it can feel like it sometimes. Machine learning is, at its very core, pattern recognition, something we do ourselves very well. The field can be confusing sometimes with a sea of algorithms, but there are a few high-level categories that help us whittle down the initial list of potential models.

![](http://datasentimentanalysis.com/wp-content/uploads/2015/01/ml_map1.png)

This flow chart, provided by Scikit-Learn (which we'll shorten to sk-learn), offers a quick road map to a potential algorithm. It doesn't include every algorithm available in sk-learn and likely won't be enough by itself, but it's a good place to start when working with new problems.

### Terminology

Before we go further, let's define some terminology we'll be using.

* **Model**: The algorithm which processes data and returns some prediction
* **Feature**: An input to the model. Think column of a spreadsheet
* **Target**: An output/desired result from a model
* **Train**: Fitting a model to a given subset of the features and targets
* **Test**: Measuring the accuracy of the trained model with the remainder of the features and targets
* **Dimension**: The number of features used with a model
* **Variability**: The (dis)similarity of data within a feature. Low variability ~ concentrated data
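
To make these terms concrete, here is a minimal sketch using a tiny, made-up table of people. The array names (`features`, `targets`) and all of the values are hypothetical and are not the workshop data. Variance is used as one common way to put a number on variability.

```python
import numpy as np

# Hypothetical toy data: each row is one person, each column is one feature
# (height in inches, shoe size). These values are made up for illustration.
features = np.array([
    [70.0, 10.5],
    [64.0, 7.0],
    [68.0, 9.0],
    [62.0, 6.5],
])
# One target per row: the value we want the model to predict
targets = np.array(['M', 'F', 'M', 'F'])

n_samples, dimension = features.shape
print('Samples:', n_samples)
print('Dimension (number of features):', dimension)

# Variance per feature is one common way to quantify variability
print('Variability of each feature:', features.var(axis=0))

# Train on the first three rows, hold the last one back for testing
train_features, test_features = features[:3], features[3:]
train_targets, test_targets = targets[:3], targets[3:]
print('Training rows:', len(train_features), 'Testing rows:', len(test_features))
```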

### Categories

There are three primary categories of machine learning algorithms:

* **Supervised**: Used when the desired output of a model is known and can be trained on
* **Unsupervised**: Used to deconstruct data and find unknown groupings
* **Reinforcement**: Similar to supervised but driven towards maximizing reward with no set target


### Subcategories

The four groups highlighted in the flow chart are the most used subcategories in machine learning.

* **Classification**: Supervised where the target value is binary or categorical
* **Regression**: Supervised where the target value is in a continuous range of numbers
* **Clustering**: Unsupervised used to identify and label groups of targets
* **Dimensionality Reduction**: Unsupervised used to combine features while preserving variability
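
As a rough sketch of how the four subcategories look in sk-learn, the block below runs one estimator from each on the same made-up data. The choice of `KMeans` and `PCA` is only an illustrative assumption; the flow chart lists many alternatives. Note that the two supervised models receive a target in `fit`, while the two unsupervised ones do not.

```python
import numpy as np
from sklearn.svm import SVC, SVR
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Made-up data: two well-separated blobs of 2D points
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
labels = np.array([0] * 20 + [1] * 20)   # categorical target
values = X[:, 0] + X[:, 1]               # continuous target

# Classification: supervised, categorical target
class_pred = SVC().fit(X, labels).predict(X)
# Regression: supervised, continuous target
value_pred = SVR().fit(X, values).predict(X)
# Clustering: unsupervised, no target is passed in
cluster_ids = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# Dimensionality reduction: unsupervised, 2 features combined into 1
X_reduced = PCA(n_components=1).fit_transform(X)

print('Classification output:', class_pred[:3])
print('Regression output:', value_pred[:3])
print('Cluster ids:', cluster_ids[:3])
print('Reduced shape:', X_reduced.shape)
```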

For this intro, we're going to do a quick implementation of the two supervised subcategories.

## Classification

We're going to start out with a binary classifier. We'll train a model to determine someone's sex based on their height.

### Data

By this point, I should have collected some sample data from the workshop attendees. Normally we'd split the dataset into training and testing data, but we're going to test on some people that weren't included in the first place.

In [None]:
import pandas as pd
from IPython.display import display
opug = pd.read_csv('opug-heights.csv', header=0, index_col=None)
print('Height (in) and Sex (M/F) of OPUG attendees')
display(opug)

### Model

Because of the small data size, we'll use a [Support Vector Machine](http://scikit-learn.org/stable/modules/svm.html#classification) as our classifier. We'll use the same one later, but it will be configured as a regressor. I won't go too in-depth about the algorithm. For our purposes, know that it creates a boundary between our two target values and returns which side of the boundary new points are on.
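
The "which side of the boundary" idea can be sketched with a handful of made-up heights (not the workshop data). This example assumes a linear kernel so the boundary is a single height threshold, whereas the classifier in this section uses the `SVC` defaults. `decision_function` returns a signed score relative to the boundary, and `predict` reports the matching class.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up heights (inches) and labels, NOT the workshop data
heights = np.array([[60], [62], [63], [64], [69], [70], [72], [74]])
labels = np.array(['F', 'F', 'F', 'F', 'M', 'M', 'M', 'M'])

# A linear kernel is assumed here to keep the boundary easy to picture
clf = SVC(kernel='linear').fit(heights, labels)

new_heights = np.array([[61], [73]])
# Signed score: negative is on the first class's side, positive the second's
scores = clf.decision_function(new_heights)
# predict reports which side of the boundary each point falls on
predictions = clf.predict(new_heights)

print('Class order:', clf.classes_)
for h, s, p in zip(new_heights.ravel(), scores, predictions):
    print('height={} score={:+.2f} predicted={}'.format(h, s, p))
```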

In [None]:
from sklearn.svm import SVC
import numpy as np

#Reshape our 1D column of heights into the 2D (samples, features) array sk-learn expects
height = opug.height.values.reshape(-1, 1)
sex = list(opug.sex)
print("Feature data (height)\n", height)
print("\nTarget data (sex)\n", sex)

#Create and train our classifier
clf = SVC()
clf.fit(height, sex)

### Predictions

Now that our model is trained, let's have some fun with it. We'll use it to predict the sex of other attendees that weren't included in the training set.

In [None]:
print(clf.predict([[68],[64],[71]]))

## Regression

Now we're going to work with a regression. I'm sure some people have created graphs using Excel, Sheets, Numbers, or any of the other spreadsheet applications. You usually have the option to include a "line of best fit" to help explain the data. In essence, this is a regression algorithm.

![](Regression.png)


### Data

We're going to generate data that roughly follows a sine wave in the x-y plane. We're going to introduce some noise and see how well our model can infer the pattern.

In [None]:
#Setting the seed / random_state allows us to tweak without the data changing
np.random.seed(312)
X = np.sort(5 * np.random.rand(40, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - np.random.rand(8))

Unlike our classification example, we have our full dataset already, so we'll need to split our existing data into training and testing sets.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=.20, random_state=42)
print('Number of items in the training set:', X_train.shape[0])

### Model

Again, we'll be using a [Support Vector Machine](http://scikit-learn.org/stable/modules/svm.html#regression) but as a regressor. We're going to show how the parameters of a model can affect its ability to represent the data. At its core, an SVM can only fit a linear boundary (or, for regression, a linear function). However, we can use something called the kernel trick to implicitly map the data into a space where a linear fit matches it better. Our three kernels will be linear, rbf, and poly.
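
For the curious, here is a small sketch of what the kernel trick means numerically. It assumes a degree-2 polynomial kernel of the form `(x*z + 1)**2`, chosen for illustration rather than taken from the SVR settings used in this section, and the helper `phi` is hypothetical. The kernel computed on the original 1D data gives the same numbers as ordinary dot products after mapping each point into three dimensions, so a linear fit in the mapped space acts like a curved fit in the original one.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

# A few made-up 1D data points
x = np.array([[0.5], [1.0], [2.0]])

def phi(x):
    """Explicit feature map matching the kernel (x*z + 1)**2 for 1D inputs"""
    return np.hstack([x ** 2, np.sqrt(2) * x, np.ones_like(x)])

# Kernel evaluated directly on the original 1D data
k_implicit = polynomial_kernel(x, x, degree=2, gamma=1, coef0=1)
# Same values from plain dot products in the 3D mapped space
k_explicit = phi(x) @ phi(x).T

print('Kernel matrix:\n', k_implicit)
print('Matches explicit mapping:', np.allclose(k_implicit, k_explicit))
```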

In [None]:
from sklearn.svm import SVR

svr_rbf = SVR(kernel='rbf', C=1e3, gamma=0.1)
svr_lin = SVR(kernel='linear', C=1e3)
svr_poly = SVR(kernel='poly', C=1e3, degree=2)
y_rbf = svr_rbf.fit(X_train, y_train).predict(X_test)
y_lin = svr_lin.fit(X_train, y_train).predict(X_test)
y_poly = svr_poly.fit(X_train, y_train).predict(X_test)

### Metrics

In the chart earlier, the regression line is accompanied by an r^2 score. The closer this value is to one, the more accurately it represents the data. We'll use sk-learn's metrics module to compute this score for each of our kernels.
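
As a quick sketch of what that score measures, r^2 is one minus the ratio of the model's squared error to the squared error of always predicting the mean. The values below are made up for illustration. A score can also drop below zero when a model fits worse than simply predicting the mean.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true values and predictions, for illustration only
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print('manual r^2:  {:.3f}'.format(r2_manual))
print('sklearn r^2: {:.3f}'.format(r2_score(y_true, y_pred)))
```

Both lines print 0.980, since the residual sum of squares is 0.1 and the total sum of squares is 5.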

In [None]:
from sklearn.metrics import r2_score

print('r^2 score for {}: {:.3f}'.format(svr_rbf.kernel, r2_score(y_test, y_rbf)))
print('r^2 score for {}: {:.3f}'.format(svr_lin.kernel, r2_score(y_test, y_lin)))
print('r^2 score for {}: {:.3f}'.format(svr_poly.kernel, r2_score(y_test, y_poly)))

Our rbf score is really good, and they all make sense. Even if you don't know what rbf or poly do, it would make sense that the linear model scores the lowest for a sinusoidal dataset. Let's see why below.

### Visualization

Finally, the fastest way to understand our data and models is to visualize them. We'll use pyplot from matplotlib to plot our training points, testing points, and three SVMs and visually compare them to the r^2 scores we just calculated.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

def sort_pair(a, b):
    """Sorts two lists based on the values in the first list"""
    srt = sorted(zip(a, b))
    return zip(*srt)

plt.scatter(X_train, y_train, c='k', label='train data')
plt.scatter(X_test, y_test, c='c', label='test data')
plt.plot(*sort_pair(X_test, y_rbf), c='g', label='RBF model')
plt.plot(*sort_pair(X_test, y_lin), c='r', label='Linear model')
plt.plot(*sort_pair(X_test, y_poly), c='b', label='Polynomial model')
plt.xlabel('data')
plt.ylabel('target')
plt.title('Support Vector Regression')
plt.legend()
plt.show()

Of our eight testing points, one was a moderate outlier. However, while it dragged down a small part of the graph, the outlier didn't interfere with the rbf model's ability to infer the sinusoidal shape. The linear and polynomial kernels are not as well suited to this dataset.