# COMS W1002: Computing in Context
---

## Lecture 19: A Brief Introduction to Machine Learning with scikit-learn
__Reading:__ [Online Documentation](https://scikit-learn.org/stable/index.html)


### What is Machine Learning? 

Three required ingredients:

* A well-defined task  
* A measure of success  
* Experience  

**Example:** Suppose we have one thousand headshots of students in this class and they are each correctly labeled with the student's name. Now suppose we have a new unlabeled headshot of a student in this class and we wish to assign a student name to it.  Correctly labeling an unlabeled headshot will count as success and labeled headshots will be our units of experience.

**Example:** Another example is playing chess against each student in this class. This time we can define success as the percentage of wins when we play exactly one game against each student. Games played against self or others can be units of experience.   

Machine learning problems are often divided into two broad categories.  
  
**Supervised Learning:** Supervised learning problems are problems for which we have some labeled data available that we can use to *train* our learning algorithm (fit our model). That is, we have labeled examples of data similar to what we will ultimately use to make predictions. The headshot example above is an example of supervised learning since we have an initial set of labeled headshots to work from. Two common types of supervised learning problems are *classification* and *regression*. In classification problems the labels we are trying to predict are drawn from a finite set of classes (Example: Given the length and weight of an animal, predict the type of animal from some finite set of categories like fish or horse). In regression problems we try to predict a continuous number (Example: Given the age of a bear and the time of year, predict its weight).   




**Unsupervised Learning**
In unsupervised learning problems we are not given any labeled training data to start with. One common type of unsupervised learning problem is clustering. Given some set of (unlabeled) data, we determine if the data separates into two or more clusters (clusters are groups of data that may be similar in some obvious or subtle way). Many problems in audio and language processing can also be framed as unsupervised learning problems, including the famous cocktail party problem, where we try to isolate a single voice or conversation in a noisy environment. 
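
To make clustering concrete, here is a minimal sketch using K-means, one common clustering algorithm (the lecture does not name a particular one, so this choice is an assumption). The data are made up: two well-separated blobs of points in two dimensions. Notice that no labels are ever passed to the algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Made-up data: two well-separated blobs of 50 points each in 2-D
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[10.0, 10.0], scale=0.5, size=(50, 2))
points = np.vstack([blob_a, blob_b])

# Only the points are passed to fit; there are no labels
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
cluster_ids = kmeans.labels_

print(cluster_ids[:50])   # cluster assigned to each point of the first blob
print(cluster_ids[50:])   # cluster assigned to each point of the second blob
```

The cluster numbers themselves are arbitrary. What matters is which points end up grouped together.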

Today we will focus on the supervised learning classification problem. We will use the wildly popular Python library scikit-learn to demonstrate some standard examples. The scikit-learn library is built on top of scipy and comes with the Anaconda Python distribution. 


**Wisconsin Breast Cancer Data**
The Wisconsin breast cancer dataset is one of many benchmark data sets in the ML community and is included as a standard data set in the `sklearn` datasets module. You can read about this dataset [here](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)), but basically it consists of 569 observations, each 30-dimensional (that just means each piece of data has 30 numbers associated with it) and labeled malignant or benign. Today we will build two types of classifier (a function that maps from the 30-dimensional space the data live in to *Malignant* or *Benign*) using this data set.

In [1]:
# here we create a dictionary-like object that encapsulates the raw data along with some metadata
from sklearn import datasets
wbcd=datasets.load_breast_cancer()
wbcd

{'data': array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
         1.189e-01],
        [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
         8.902e-02],
        [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
         8.758e-02],
        ...,
        [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
         7.820e-02],
        [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
         1.240e-01],
        [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
         7.039e-02]]),
 'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
        1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
        1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
        1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0

In [2]:
#if we just want to see the data
wbcd.data

array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
        1.189e-01],
       [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
        8.902e-02],
       [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
        8.758e-02],
       ...,
       [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
        7.820e-02],
       [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
        1.240e-01],
       [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
        7.039e-02]])

In [3]:
# and if we just want to see the labels
wbcd.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0,
       0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,
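
The labels are stored as integers, so it helps to check which integer stands for which class. The sketch below reloads the dataset so it runs on its own, then prints the shape of the data, the class names, and how many observations fall in each class. In this dataset 0 means malignant and 1 means benign.

```python
import numpy as np
from sklearn import datasets

wbcd = datasets.load_breast_cancer()

# (number of observations, number of features per observation)
print(wbcd.data.shape)

# target_names[i] is the class name for integer label i
print(list(wbcd.target_names))

# Count how many observations carry each integer label
class_counts = np.bincount(wbcd.target)
for name, count in zip(wbcd.target_names, class_counts):
    print(name, count)
```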

In [4]:
# Here is a description (metadata)
print(wbcd.DESCR)

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 0 is Mean Radi

**Pandas?**
And yes, we can transform the scikit-learn data set (a `Bunch` object) into a pandas DataFrame if we want.

In [5]:
import pandas as pd

wbcd_df = pd.DataFrame(wbcd.data,columns=wbcd.feature_names)
wbcd_df['target'] = pd.Series(wbcd.target)
wbcd_df.tail()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 564 | 21.56 | 22.39 | 142.0 | 1479.0 | 0.111 | 0.1159 | 0.2439 | 0.1389 | 0.1726 | 0.05623 | ... | 26.4 | 166.1 | 2027.0 | 0.141 | 0.2113 | 0.4107 | 0.2216 | 0.206 | 0.07115 | 0 |
| 565 | 20.13 | 28.25 | 131.2 | 1261.0 | 0.0978 | 0.1034 | 0.144 | 0.09791 | 0.1752 | 0.05533 | ... | 38.25 | 155.0 | 1731.0 | 0.1166 | 0.1922 | 0.3215 | 0.1628 | 0.2572 | 0.06637 | 0 |
| 566 | 16.6 | 28.08 | 108.3 | 858.1 | 0.08455 | 0.1023 | 0.09251 | 0.05302 | 0.159 | 0.05648 | ... | 34.12 | 126.7 | 1124.0 | 0.1139 | 0.3094 | 0.3403 | 0.1418 | 0.2218 | 0.0782 | 0 |
| 567 | 20.6 | 29.33 | 140.1 | 1265.0 | 0.1178 | 0.277 | 0.3514 | 0.152 | 0.2397 | 0.07016 | ... | 39.42 | 184.6 | 1821.0 | 0.165 | 0.8681 | 0.9387 | 0.265 | 0.4087 | 0.124 | 0 |
| 568 | 7.76 | 24.54 | 47.92 | 181.0 | 0.05263 | 0.04362 | 0.0 | 0.0 | 0.1587 | 0.05884 | ... | 30.37 | 59.16 | 268.6 | 0.08996 | 0.06444 | 0.0 | 0.0 | 0.2871 | 0.07039 | 1 |


**K-Nearest Neighbor**  
The first classifier we will use is called K-nearest neighbor (KNN). The way it works is that we plot the labeled data, which we will call *training data*, and then, when given a new unlabeled observation, we plot it and look at its nearest neighbors among the training data. That is, we look at the nearest *k* neighbors and let their labels determine our predicted label. So we will label the new observation *malignant* if a majority of its *k* nearest neighbors are labeled *malignant*, and we label it *benign* otherwise. 
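
The idea fits in a few lines of plain Python. This is only a sketch of the majority-vote rule described above, not how scikit-learn implements it. The function name `knn_predict` and the two-dimensional toy points are made up for illustration, and distance is assumed to be ordinary straight-line (Euclidean) distance.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, new_point, k=3):
    """Predict a label for new_point by majority vote among its k nearest training points."""
    # Order the training indices by straight-line distance to the new point
    order = sorted(range(len(train_points)),
                   key=lambda i: dist(train_points[i], new_point))
    nearest_labels = [train_labels[i] for i in order[:k]]
    # most_common(1) gives [(label, count)] for the most frequent label
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up 2-D training data (hypothetical, for illustration only)
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0),
                (8.0, 8.0), (9.0, 8.5), (8.5, 9.5)]
train_labels = ["benign", "benign", "benign",
                "malignant", "malignant", "malignant"]

print(knn_predict(train_points, train_labels, (2.0, 2.0)))  # prints benign
print(knn_predict(train_points, train_labels, (8.0, 9.0)))  # prints malignant
```

With two classes, choosing an odd *k* avoids tied votes.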

In [6]:
# we create the classifier clf by calling KNeighborsClassifier() in the neighbors module

from sklearn import neighbors
clf=neighbors.KNeighborsClassifier(n_neighbors=1)

# now we train or fit the classifier by giving it data and corresponding labels for the data. 
# notice that we have not given it the first two vectors in our dataset. We will reserve those
# to test to see if our classifier works
clf.fit(wbcd.data[2:],wbcd.target[2:])

KNeighborsClassifier(n_neighbors=1)

In [7]:
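# we use the predict method to predict the labels of the first twenty elements of our dataset
# (only the first two were held out of training; the other eighteen were part of the training data)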
clf.predict(wbcd.data[0:20])

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

In [8]:
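# of course we know the actual values and look, they match!
# (a match on a training point is expected with n_neighbors=1, since each training point is its own nearest neighbor)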
wbcd.target[0:20]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1])
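
In the cells above the classifier was fit on everything except the first two observations, so most of the twenty predictions are for points it had already seen. A stricter check, sketched below, holds out all twenty before fitting. The variable names are made up for illustration, and no output is shown because the result depends on running it.

```python
from sklearn import datasets, neighbors

wbcd = datasets.load_breast_cancer()

# Hold out the first twenty observations entirely and train on the rest
X_test, y_test = wbcd.data[:20], wbcd.target[:20]
X_train, y_train = wbcd.data[20:], wbcd.target[20:]

clf = neighbors.KNeighborsClassifier(n_neighbors=1)
clf.fit(X_train, y_train)

predictions = clf.predict(X_test)

# Fraction of held-out observations whose predicted label matches the true label
accuracy = (predictions == y_test).mean()
print(predictions)
print(accuracy)
```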

**Cross-Validation**  
More generally we would like to know how well we can expect our classifier to work in the future. To this end we will partition the data into five separate cells (mutually exclusive, similar-sized subsets of the data, often called *folds*). We will use four of these cells as labeled training data, and for the other cell (so 20% of the data) we will pretend we don't actually know the labels and will use the classifier to predict them. We call this last cell the *test set*. Since we actually do know the labels of the test set, we can measure our classifier's empirical performance on the test set. We don't just do this once. We do it five times, using each of the cells as the test set once while using the other four cells as training data. So each observation will be used as training data four times and in the test set once. We call this process 5-fold cross-validation and more generally *K-fold cross-validation*. Remember, what we're doing here is very meta. We are estimating the *performance* of our classifier. The hope is that future data looks like the labeled data set we start with, so if we observe our performance on that, we have evidence to suggest that future performance will be similar. 

In [9]:
# luckily scikit-learn comes with all of this set up for us already
from sklearn.model_selection import cross_val_score
scores=cross_val_score(clf,wbcd.data,wbcd.target,cv=5)
scores

array([0.85964912, 0.92982456, 0.9122807 , 0.9122807 , 0.91150442])

In [10]:
# there are even methods to provide us with summary statistics
scores.mean()

0.9051079024996118

In [11]:
scores.std()

0.023753852210021166
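
To see what `cross_val_score` is doing for us, here is a sketch that writes the five-fold loop out by hand. It uses `StratifiedKFold`, which is what scikit-learn uses for classifiers when `cv` is an integer. Stratified means each fold keeps roughly the same mix of malignant and benign cases as the full data set. The variable names are made up for illustration.

```python
import numpy as np
from sklearn import datasets, neighbors
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, cross_val_score

wbcd = datasets.load_breast_cancer()
X, y = wbcd.data, wbcd.target
clf = neighbors.KNeighborsClassifier(n_neighbors=1)

manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    # Start from a fresh, unfitted copy of the classifier for each fold
    fold_clf = clone(clf)
    fold_clf.fit(X[train_idx], y[train_idx])
    # score() returns the fraction of test observations labeled correctly
    manual_scores.append(fold_clf.score(X[test_idx], y[test_idx]))

manual_scores = np.array(manual_scores)
print(manual_scores)
print(manual_scores.mean())

# Compare against the one-line version
print(np.allclose(manual_scores, cross_val_score(clf, X, y, cv=5)))
```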

**Support Vector Machines:** 
Another popular and more modern classifier is the support vector machine. While the details of how this works are a bit too technical for this discussion, I'll try to give you the gist of what's going on. Imagine you have two-dimensional data and it's all plotted out in front of you. Suppose some of the data belongs to class A and some to class B. (Maybe the data represents weights and lengths of fishes and horses.) If you could draw a straight line that separates all of the class A data from the class B data, then given a new unlabeled observation you could plot it, see what side of the line it falls on, and classify it accordingly. If there is some separation between the two classes then you could probably draw lots of different lines that separate the data in this way. We'll choose one that leaves a bit of margin on either side. That is, if there's a gap between classes, we'll choose a line near the middle of the gap. Support vector machines work a little like this. They map the data into a high-dimensional space and look for a linear surface there that separates the data (a separating hyperplane). It's usually not possible to find a perfect separating plane, but we can choose a surface (a line in 2-d) that doesn't give us too many mistakes when applied to the labeled training data. This approach involves two parameters, gamma and C. We will discover a good (gamma, C) pair by trying many different possibilities and settling on the one that performs best in a cross-validation. 


In [12]:
#here's how easy it is to try it out!

from sklearn import svm
clf=svm.SVC(gamma=.00001,C=200)
scores=cross_val_score(clf,wbcd.data,wbcd.target,cv=5)
scores.mean()

0.9472752678155565
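
The search for a good (gamma, C) pair can be automated with scikit-learn's `GridSearchCV`, which runs a cross-validation for every combination in a grid and remembers the best one. The sketch below is one way to set that up. The candidate values in `param_grid` are illustrative assumptions, not the values tried in lecture, and no output is shown because the result depends on running it.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

wbcd = datasets.load_breast_cancer()

# Candidate values are illustrative assumptions; a real search would try more
param_grid = {
    "gamma": [1e-6, 1e-5, 1e-4],
    "C": [1, 10, 100],
}

# For each (gamma, C) pair, run 5-fold cross-validation and record the mean score
search = GridSearchCV(svm.SVC(), param_grid, cv=5)
search.fit(wbcd.data, wbcd.target)

print(search.best_params_)  # the pair with the highest mean cross-validation score
print(search.best_score_)   # that pair's mean cross-validation score
```

Because the winning pair was chosen using these same folds, its score is likely to be a slightly optimistic estimate of future performance.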

**What I want you to know**   
Takeaways from this lecture:
* What is machine learning?
* What is supervised versus unsupervised learning?
* What is classification?
* What is training data?
* What is the test set?
* What is K-fold cross-validation?
* What is k-nearest neighbor?

You don't have to memorize any of the Python and you don't have to know anything about support vector machines beyond the fact that they are another kind of classifier. Next time we'll apply some of this to some other popular data sets so we can get more familiar with *using* scikit-learn.