In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

# Fundamental Data Science for Data Scientist

# Supervised Learning Part 1 -- Classification

To visualize the workings of machine learning algorithms, it is often helpful to study one-dimensional or two-dimensional data, that is, data with only one or two features. In practice, datasets usually have many more features, but high-dimensional data is hard to plot on a two-dimensional screen.

We will illustrate some very simple examples before we move on to more "real world" data sets.


First, we will look at a two-class classification problem in two dimensions. We use synthetic data generated by the ``make_blobs`` function.

In [None]:
from sklearn.datasets import make_blobs

X, y = make_blobs(centers=2, random_state=0)

print('X ~ n_samples x n_features:', X.shape)
print('y ~ n_samples:', y.shape)

print('\nFirst 5 samples:\n', X[:5, :])
print('\nFirst 5 labels:', y[:5])

As the data is two-dimensional, we can plot each sample as a point in a two-dimensional coordinate system, with the first feature being the x-axis and the second feature being the y-axis.

In [None]:
plt.scatter(X[y == 0, 0], X[y == 0, 1], 
            c='blue', s=40, label='0')
plt.scatter(X[y == 1, 0], X[y == 1, 1], 
            c='red', s=40, label='1', marker='s')

plt.xlabel('first feature')
plt.ylabel('second feature')
plt.legend(loc='upper right');

Classification is a supervised task, and since we are interested in how the model performs on unseen data, we split our data into two parts:

1. a training set that the learning algorithm uses to fit the model
2. a test set to evaluate the generalization performance of the model

The ``train_test_split`` function from the ``model_selection`` module does that for us -- we will use it to split a dataset into 75% training data and 25% test data.

<img src="figures/train_test_split_matrix.svg" width="100%">


In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.25,
                                                    random_state=1234,
                                                    stratify=y)
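
The split above passes ``stratify=y``, which asks ``train_test_split`` to keep the class proportions roughly equal in both parts. As a minimal sketch (it regenerates the same ``make_blobs`` data so that it runs on its own), the sizes and class counts can be checked like this:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# Same synthetic data and split settings as in the cells above
X, y = make_blobs(centers=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1234, stratify=y)

# Sizes of the two parts (75% / 25% of the samples)
print('train size:', X_train.shape[0], 'test size:', X_test.shape[0])

# Number of samples per class in the full data and in each part
print('class counts, all data:', np.bincount(y))
print('class counts, training:', np.bincount(y_train))
print('class counts, test:', np.bincount(y_test))
```

With stratification, the two class counts in each part should differ by at most one sample for this balanced dataset.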

### The scikit-learn estimator API
<img src="figures/supervised_workflow.svg" width="100%">


Every algorithm is exposed in scikit-learn via an ''Estimator'' object. (All models in scikit-learn have a very consistent interface.) For instance, we first import the logistic regression class.

In [None]:
from sklearn.linear_model import LogisticRegression

Next, we instantiate the estimator object.

In [None]:
classifier = LogisticRegression()

In [None]:
X_train.shape

In [None]:
y_train.shape

To build the model from our data, that is, to learn how to classify new points, we call the ``fit`` function with the training data and the corresponding training labels (the desired output for each training data point):

In [None]:
classifier.fit(X_train, y_train)

(Some estimator methods such as `fit` return `self` by default. Thus, after executing the code snippet above, you will see the default parameters of this particular instance of `LogisticRegression`. Another way of retrieving the estimator's initialization parameters is to execute `classifier.get_params()`, which returns a parameter dictionary.)
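
As a small illustration of the note above, the following sketch checks that ``fit`` returns the estimator itself and looks up one entry of the parameter dictionary. The variable names ``clf``, ``returned`` and ``params`` are chosen here for illustration only.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Small synthetic dataset, as used earlier in this notebook
X, y = make_blobs(centers=2, random_state=0)

clf = LogisticRegression()
returned = clf.fit(X, y)

# fit returns the estimator object itself, which allows call chaining
print(returned is clf)

# get_params returns a dictionary of the initialization parameters
params = clf.get_params()
print(type(params))
print('C =', params['C'])
```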

We can then apply the model to unseen data and predict the outcome using the ``predict`` method:

In [None]:
prediction = classifier.predict(X_test)

We can compare these against the true labels:

In [None]:
print(prediction)
print(y_test)

We can evaluate our classifier quantitatively by measuring what fraction of predictions is correct. This is called **accuracy**:

In [None]:
np.mean(prediction == y_test)

There is also a convenience function, ``score``, that all scikit-learn classifiers have to compute this directly from the test data:

In [None]:
classifier.score(X_test, y_test)
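
The manual computation and the ``score`` method should agree. The sketch below rebuilds the same pipeline in one self-contained cell and also compares against ``accuracy_score`` from ``sklearn.metrics``, which is not used elsewhere in this notebook and is brought in here only for comparison.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_blobs(centers=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1234, stratify=y)

clf = LogisticRegression().fit(X_train, y_train)
prediction = clf.predict(X_test)

# Three ways of computing the fraction of correct predictions
manual = np.mean(prediction == y_test)
via_metric = accuracy_score(y_test, prediction)
via_score = clf.score(X_test, y_test)

print(manual, via_metric, via_score)
```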

It is often helpful to compare the generalization performance (on the test set) to the performance on the training set:

In [None]:
classifier.score(X_train, y_train)

``LogisticRegression`` is a so-called linear model, which means it creates a decision boundary that is linear in the input space. In 2D, this simply means it finds a line that separates the blue points from the red ones:

In [None]:
from figures import plot_2d_separator

plt.scatter(X[y == 0, 0], X[y == 0, 1], 
            c='blue', s=40, label='0')
plt.scatter(X[y == 1, 0], X[y == 1, 1], 
            c='red', s=40, label='1', marker='s')

plt.xlabel("first feature")
plt.ylabel("second feature")
plot_2d_separator(classifier, X)
plt.legend(loc='upper right');

**Estimated parameters**: All the estimated model parameters are attributes of the estimator object ending with an underscore. Here, these are the coefficients and the offset of the line:

In [None]:
print(classifier.coef_)
print(classifier.intercept_)
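
To connect these numbers to the separating line: for a two-class problem, a point is assigned to class 1 when ``w[0] * x0 + w[1] * x1 + b`` is positive, where ``w`` is the row in ``coef_`` and ``b`` the entry in ``intercept_``. The following sketch, which refits the model on the full blob data for simplicity, checks this against ``decision_function`` and ``predict``:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

X, y = make_blobs(centers=2, random_state=0)
clf = LogisticRegression().fit(X, y)

w = clf.coef_[0]        # one weight per feature
b = clf.intercept_[0]   # offset of the line

# Signed score of each sample relative to the line w . x + b = 0
scores = X @ w + b

# Points with a positive score fall on the class-1 side
manual_prediction = (scores > 0).astype(int)

print(np.allclose(scores, clf.decision_function(X)))
print(np.array_equal(manual_prediction, clf.predict(X)))
```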

Another classifier: K Nearest Neighbors
------------------------------------------------
Another popular and easy-to-understand classifier is K nearest neighbors (kNN). It has one of the simplest learning strategies: given a new, unknown observation, look up the observations in your reference database whose features are closest to it, and assign the predominant class among them.

The interface is exactly the same as for ``LogisticRegression`` above.
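
The lookup strategy described above can be sketched in a few lines of NumPy. This is only an illustration of the idea, not how scikit-learn implements it; the function name ``knn_predict`` and the toy data are made up for this example, and Euclidean distance is an assumption.

```python
from collections import Counter

import numpy as np


def knn_predict(X_ref, y_ref, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest reference points."""
    # Euclidean distance from x_new to every reference observation
    distances = np.linalg.norm(X_ref - x_new, axis=1)
    # Indices of the k closest reference observations
    nearest = np.argsort(distances)[:k]
    # The most frequent label among those neighbors wins
    votes = Counter(y_ref[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy reference database: two well-separated groups
X_ref = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_ref = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_ref, y_ref, np.array([0.3, 0.3])))  # prints 0
print(knn_predict(X_ref, y_ref, np.array([4.8, 5.0])))  # prints 1
```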

In [5]:
from time import time
import logging
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn import tree
from figures import plot_2d_separator

import numpy as np
import scipy as sp
import pandas as pd
import matplotlib as mpl
import seaborn as sns

iris = pd.read_csv("iris.csv")

iris



Unnamed: 0,sepal_l,sepal_w,petal_l,petal_w,species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
5,5.4,3.9,1.7,0.4,Iris-setosa
6,4.6,3.4,1.4,0.3,Iris-setosa
7,5.0,3.4,1.5,0.2,Iris-setosa
8,4.4,2.9,1.4,0.2,Iris-setosa
9,4.9,3.1,1.5,0.1,Iris-setosa


In [6]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(iris.iloc[:, 0:4], 
                                                    iris['species'], 
                                                    train_size=0.7, 
                                                    random_state=123)
print("Labels for training and testing data")
print(y_train)


Labels for training and testing data
114     Iris-virginica
136     Iris-virginica
53     Iris-versicolor
19         Iris-setosa
38         Iris-setosa
110     Iris-virginica
23         Iris-setosa
9          Iris-setosa
86     Iris-versicolor
91     Iris-versicolor
89     Iris-versicolor
79     Iris-versicolor
101     Iris-virginica
65     Iris-versicolor
115     Iris-virginica
41         Iris-setosa
124     Iris-virginica
95     Iris-versicolor
21         Iris-setosa
11         Iris-setosa
103     Iris-virginica
74     Iris-versicolor
122     Iris-virginica
118     Iris-virginica
44         Iris-setosa
51     Iris-versicolor
81     Iris-versicolor
149     Iris-virginica
12         Iris-setosa
129     Iris-virginica
            ...       
120     Iris-virginica
137     Iris-virginica
125     Iris-virginica
147     Iris-virginica
39         Iris-setosa
84     Iris-versicolor
2          Iris-setosa
67     Iris-versicolor
55     Iris-versicolor
49         Iris-setosa
68     Iris-versicol



In [7]:
from sklearn.neighbors import KNeighborsClassifier

This time we set a parameter of the ``KNeighborsClassifier`` to tell it we want to look at the three nearest neighbors:

In [8]:
knn = KNeighborsClassifier(n_neighbors=3)

We fit the model with our training data:

In [9]:
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')

In [12]:
knn.score(X_test, y_test)

0.9555555555555556

In [13]:
print("Predicting iris on the test set using K-NN")
t0 = time()
y_pred = knn.predict(X_test)
print("done in %0.3fs" % (time() - t0))

print(classification_report(y_test, y_pred, target_names=iris['species'].unique()))
print(confusion_matrix(y_test, y_pred))


Predicting iris on the test set using K-NN
done in 0.002s
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        18
Iris-versicolor       0.90      0.90      0.90        10
 Iris-virginica       0.94      0.94      0.94        17

    avg / total       0.96      0.96      0.96        45

[[18  0  0]
 [ 0  9  1]
 [ 0  1 16]]
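
The per-class figures in the report can be recovered from the confusion matrix printed above, in which rows are true classes and columns are predicted classes. The sketch below copies that matrix by hand and recomputes precision, recall, and overall accuracy:

```python
import numpy as np

# Confusion matrix copied from the K-NN output above
# (rows: true class, columns: predicted class)
cm = np.array([[18, 0, 0],
               [0, 9, 1],
               [0, 1, 16]])

correct = np.diag(cm)                 # correctly classified samples per class
precision = correct / cm.sum(axis=0)  # divide by column sums (predicted counts)
recall = correct / cm.sum(axis=1)     # divide by row sums (true counts, the "support")
accuracy = correct.sum() / cm.sum()   # overall fraction of correct predictions

print(np.round(precision, 2))  # [1.   0.9  0.94]
print(np.round(recall, 2))     # [1.   0.9  0.94]
print(accuracy)                # 43 correct out of 45
```

The accuracy computed here, 43/45, is the same value that ``knn.score(X_test, y_test)`` reported.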


# Naive Bayes Classification

In [14]:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X_train, y_train);

In [17]:
print("Predicting iris on the test set using Naive Bayes")
t0 = time()
y_pred = model.predict(X_test)
print("done in %0.3fs" % (time() - t0))

print(classification_report(y_test, y_pred, target_names=iris['species'].unique()))
print(confusion_matrix(y_test, y_pred))

Predicting iris on the test set using Naive Bayes
done in 0.001s
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        18
Iris-versicolor       0.83      1.00      0.91        10
 Iris-virginica       1.00      0.88      0.94        17

    avg / total       0.96      0.96      0.96        45

[[18  0  0]
 [ 0 10  0]
 [ 0  2 15]]


Exercise
=========
Apply the ``KNeighborsClassifier`` to the ``iris`` dataset. Play with different values of ``n_neighbors`` and observe how the training and test scores change.
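
One possible starting point is sketched below. It assumes the copy of the iris data bundled with scikit-learn (``load_iris``) instead of the ``iris.csv`` file used above, so the scores may differ from those shown earlier; the list of ``n_neighbors`` values is an arbitrary choice.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: use scikit-learn's bundled iris data rather than iris.csv
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=123)

results = {}
for k in [1, 3, 5, 10, 20]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # Store (training score, test score) for this value of n_neighbors
    results[k] = (knn.score(X_train, y_train), knn.score(X_test, y_test))
    print('n_neighbors=%2d  train=%.3f  test=%.3f' % (k, *results[k]))
```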