# Supervised learning vs unsupervised learning

We have seen that data analysis is largely about constructing models from given data, and that we use these models mostly to make predictions on new data points.

![model](images/model.png)

## Supervised learning

[Last week](../lecture-4/lecture-4.ipynb) we looked at [regression models](https://onlinecourses.science.psu.edu/stat501/node/250/). In the simplest cases, the data suitable for regression consists of instances of fixed-length sequences of predictors (independent variables) $(x_1,\ldots,x_n)$ together with the corresponding values of a response (dependent) variable $y$.  We then constructed a *linear model* of the form
$$ y \sim \alpha\cdot\mathbf{x} + \beta $$
where $\mathbf{x}$ denotes the vector of predictors, $y$ denotes the dependent variable, and $\alpha$ and $\beta$ are the model parameters.

If you look at how we constructed our model, you will see that we must have samples of **expected output** for a collection of inputs.  Any model that requires a collection of inputs together with their expected outputs is called a **supervised learning model**.  Because the expected outputs are known, we can also measure **the fit**. As a matter of fact, we constructed our model by minimizing the error:

$$ RSS(\alpha,\beta) = \sum_{i=1}^N\left(y_i - \beta - \sum_{j=1}^n \alpha_j x_{ij}\right)^2 $$

*The error* in this case is the residual sum of squares (RSS): the sum of the squares of the differences between the expected output and the output predicted by the model.
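To make the RSS formula concrete, here is a minimal sketch that evaluates it and finds the minimizing parameters with NumPy's least-squares solver. The toy data, the "true" parameters used to generate it, and the helper name `rss` are assumptions made for illustration; they do not come from last week's notebook.

```python
import numpy as np

# Hypothetical toy data: N=5 samples, n=2 predictors, generated from known parameters.
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 3.0], [5.0, 5.0]])
alpha_true, beta_true = np.array([2.0, -1.0]), 0.5
y = X @ alpha_true + beta_true

def rss(alpha, beta, X, y):
    """Residual sum of squares for the linear model y ~ alpha . x + beta."""
    residuals = y - beta - X @ alpha
    return float(np.sum(residuals ** 2))

# Append a column of ones so that the intercept beta is fitted together with alpha.
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
alpha_hat, beta_hat = coef[:-1], coef[-1]

print(alpha_hat, beta_hat, rss(alpha_hat, beta_hat, X, y))
```

Because the toy data is noise-free, the fitted parameters should recover the generating ones and the RSS should be zero up to rounding.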

The alternative to this model is an **unsupervised learning model**.

## Unsupervised learning

In unsupervised learning, we have a collection of data points but we do not know what the expected output is.  The learning process (the process of constructing a model) looks at the data and infers the model from its internal structure: the similarities and differences among the data points.  The only job that the analyst does in such cases is to set the *shape* of the model.  The rest is handled by the algorithm that this choice dictates.

![clusters](images/clusters.png)

## An example: k-nn vs k-means

Consider the example dataset whose picture is given above. We would like to separate the dataset into distinct clusters.

### k-nn (k-nearest neighbor)

The k-nearest neighbor is a supervised classification algorithm. A classification problem is about finding a function
$$ f\colon D \to \{label_1,\ldots, label_m \} $$
that labels any given data point $x\in D$ with a specific label $f(x)$.
The algorithm takes a collection of data points $(x_i)$, each carrying one of a fixed, finite set of $m$ labels $label_1,\ldots,label_m$.  It then constructs a model, i.e. a function, that assigns a label to any new point whose label is to be decided.

In the example above, we have a collection of data points in $\mathbb{R}^2$ and our labels are the colors these points are associated with:

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

In [2]:
points = pd.read_csv("data/clusters.csv")
points.head(10)

|    | xs | ys | cs |
|---:|---:|---:|:---|
| 1 | -1.060787 | -1.897693 | black |
| 2 | -1.284119 | -1.997824 | black |
| 3 | 0.239897 | -3.021372 | black |
| 4 | -0.493053 | -1.67142 | black |
| 5 | 0.226585 | -1.103103 | black |
| 6 | -1.125684 | -2.255711 | black |
| 7 | -0.196605 | -4.913048 | black |
| 8 | -1.339146 | -2.066089 | black |
| 9 | -1.560298 | -2.714427 | black |
| 10 | -1.538243 | -1.166403 | black |


The only parameters of the k-nearest neighbor algorithm are a positive integer k (often chosen odd to reduce ties in the vote) and a distance function.  Given an unlabelled point $\mathbf{x}$, the model finds the k closest points $\mathbf{x}'_1,\ldots,\mathbf{x}'_k$ whose labels are known. Then the algorithm takes a vote among these points, i.e. counts the occurrences of each label. Whichever label wins is assigned as the label of the point $\mathbf{x}$.
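The voting procedure can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how scikit-learn implements `KNeighborsClassifier`: it assumes Euclidean distance, and the function name `knn_predict` and the toy points are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Sort the training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    # Count the labels of the k closest points and return the most common one.
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data, not taken from clusters.csv.
pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (3.0, 3.0), (3.1, 2.9), (2.9, 3.2)]
lbl = ["black", "black", "black", "red", "red", "red"]

print(knn_predict(pts, lbl, (0.3, 0.3)))  # black
print(knn_predict(pts, lbl, (2.8, 3.0)))  # red
```

Note that nothing is "fitted" here: the model is the labelled data itself, and all the work happens at prediction time.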

In [3]:
# Features: only xs (iloc[:, 0:1]; use 0:2 to include ys as well); labels: cs.
Xtrain, Xtest, Ytrain, Ytest = train_test_split(points.iloc[:,0:1], points.iloc[:,2], test_size=0.33)

In [4]:
classifier = KNeighborsClassifier(n_neighbors=3, metric='euclidean')
classifier.fit(Xtrain, Ytrain)
predicted = classifier.predict(Xtest)
print(confusion_matrix(Ytest,predicted))
accuracy_score(Ytest,predicted)

[[42  3  9]
 [ 2 73  1]
 [12  1 22]]


0.8303030303030303
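
The accuracy score can be read directly off the confusion matrix: the diagonal holds the correctly classified test points, and every other entry is a misclassification. A small sketch using the matrix printed above (the variable names are only for illustration):

```python
# Confusion matrix printed by the cell above
# (rows: true labels, columns: predicted labels).
cm = [[42, 3, 9],
      [2, 73, 1],
      [12, 1, 22]]

correct = sum(cm[i][i] for i in range(len(cm)))  # diagonal entries: 137
total = sum(sum(row) for row in cm)              # all test points: 165
accuracy = correct / total

print(correct, total, accuracy)
```

Here 137 of the 165 test points are on the diagonal, which gives the accuracy of about 0.83 reported by `accuracy_score`.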

In [5]:
classifier = KNeighborsClassifier(n_neighbors=3, metric='minkowski')
classifier.fit(Xtrain, Ytrain)
predicted = classifier.predict(Xtest)
print(confusion_matrix(Ytest,predicted))
accuracy_score(Ytest,predicted)

[[42  3  9]
 [ 2 73  1]
 [12  1 22]]


0.8303030303030303
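
The two runs give identical results because the Minkowski distance of order $p=2$ is the Euclidean distance, and `KNeighborsClassifier` uses `p=2` by default. A minimal sketch of the distance itself, where the helper name `minkowski` and the sample points are hypothetical:

```python
import math

def minkowski(a, b, p=2):
    """Minkowski distance of order p between two points of equal length."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

a, b = (0.0, 0.0), (3.0, 4.0)

print(minkowski(a, b, p=2), math.dist(a, b))  # both are the Euclidean distance
print(minkowski(a, b, p=1))                   # p=1 gives the Manhattan distance
```

To actually compare two different metrics with the classifier, one could pass a different order, for example `p=1`, alongside `metric='minkowski'`.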

In [6]:
# The file has no header row; header=None is the supported way to say so.
iris = pd.read_csv("data/iris.csv", header=None)
iris.head(10)

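|    | 0 | 1 | 2 | 3 | 4 |
|---:|---:|---:|---:|---:|:---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |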


In [7]:
# Features: columns 0-2 (three of the four measurements); labels: column 4 (species).
Xtrain, Xtest, Ytrain, Ytest = train_test_split(iris.iloc[:,0:3], iris.iloc[:,4], test_size=0.5)

In [8]:
classifier = KNeighborsClassifier(n_neighbors=1, metric='euclidean')
classifier.fit(Xtrain, Ytrain)
predicted = classifier.predict(Xtest)
print(confusion_matrix(Ytest,predicted))
accuracy_score(Ytest,predicted)

[[26  0  0]
 [ 0 22  5]
 [ 0  1 21]]


0.92

### k-means

The next algorithm, k-means, is an unsupervised clustering algorithm. Clustering algorithms, unlike classification algorithms, do not have a preconceived notion of *labels*.  Instead, they try to split the data points into disjoint clusters by looking at their internal structures. Again, there are two parameters: the number of clusters and a distance function.

Here is how the algorithm works:

1. Initially, we randomly place $k$ cluster centers $c_1,\ldots,c_k$.
2. We go over all of the points $x\in D$ in our dataset. For each point, we determine which center $c_i$ is closest to $x$, and then assign cluster label $i$ to $x$.
3. When Step (2) is done, we recalculate the center of each cluster $i$ as the mean of the points assigned to it.
4. We repeat Steps (2) and (3) until the cluster centers stabilize.
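
The four steps above can be sketched in plain Python as follows. This is a minimal illustration, not scikit-learn's `KMeans` implementation: it assumes Euclidean distance and picks the initial centers by sampling k of the data points (one common way to "randomly place" them). The function name `kmeans`, its parameters, and the toy points are hypothetical.

```python
import math
import random

def kmeans(points, k, iterations=100, seed=0):
    """Basic k-means on a list of equal-length tuples; returns (centers, labels)."""
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial centers.
    centers = rng.sample(points, k)
    labels = []
    for _ in range(iterations):
        # Step 2: assign each point to its nearest center.
        labels = [min(range(k), key=lambda i: math.dist(p, centers[i]))
                  for p in points]
        # Step 3: move each center to the mean of the points assigned to it.
        new_centers = []
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                new_centers.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members)))
            else:
                new_centers.append(centers[i])  # keep the center of an empty cluster
        # Step 4: stop once the centers no longer move.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels

# Hypothetical toy data: two well-separated groups of three points.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(pts, k=2)
print(centers)
print(labels)
```

The cluster ids in `labels` are arbitrary: which group ends up as cluster 0 depends on the random initial centers, not on any meaning attached to the data.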

In [9]:
from sklearn.cluster import KMeans

In [10]:
classifier = KMeans(n_clusters=3,random_state=1)
classifier.fit(Xtest)
predicted = classifier.predict(Xtest)
labels = {"Iris-setosa":1, "Iris-versicolor":2, "Iris-virginica":0}
real = Ytest.map(lambda x: labels[x])
print(confusion_matrix(real,predicted))
accuracy_score(real,predicted)

[[ 0 14  8]
 [26  0  0]
 [ 0  4 23]]


0.30666666666666664
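
The low score is largely an artifact of how the labels are matched rather than of the clustering itself. The cluster ids that k-means assigns are arbitrary, so a hand-written mapping from species to ids may pair each species with the wrong cluster. One way to check this, sketched below under the assumption that each true label should be paired with exactly one cluster, is to try every pairing on the confusion matrix printed above and keep the best one. The helper name `matched` is hypothetical.

```python
from itertools import permutations

# Confusion matrix printed above (rows: mapped true labels, columns: cluster ids).
cm = [[0, 14, 8],
      [26, 0, 0],
      [0, 4, 23]]
total = sum(map(sum, cm))

def matched(perm):
    """Number of points that agree when true label i is paired with cluster perm[i]."""
    return sum(cm[i][perm[i]] for i in range(len(cm)))

best = max(permutations(range(len(cm))), key=matched)

print(matched((0, 1, 2)) / total)   # pairing used in the cell above
print(best, matched(best) / total)  # best pairing found by exhaustive search
```

With the best pairing, 63 of the 75 points agree with their species (0.84), compared with 23 of 75 for the pairing used above. This number uses the true labels after the fact, so it measures how well the clusters agree with the species, not the accuracy of a trained classifier.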

In [11]:
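# Features: only xs (iloc[:, 0:1]); labels: cs. The fixed random_state makes the split reproducible.
Xtrain, Xtest, Ytrain, Ytest = train_test_split(points.iloc[:,0:1], points.iloc[:,2], test_size=0.5, random_state=2)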

In [13]:
classifier = KMeans(n_clusters=3, random_state=9)
classifier.fit(Xtest)
predicted = classifier.predict(Xtest)
transform = {"red":0, "black":2, "green":1}
real = Ytest.map(lambda x: transform[x])
print(confusion_matrix(real,predicted))
accuracy_score(real,predicted)

[[ 50   4   5]
 [  0 124   0]
 [ 31   5  31]]


0.82

## Another example: Naive Bayes Classifier



In [14]:
import glob
import re
from collections import Counter

In [15]:
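spam = Counter()

# Count word occurrences across the spam messages (files named spm*).
for file in glob.glob('data/corpus/spm*'):
    with open(file, "r") as f:
        # Keep only letters and whitespace, then split on any whitespace.
        raw = re.sub(r'[^a-zA-Z\s]', '', f.read()).lower().split()
        spam.update(raw)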

In [None]:
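words = Counter()

# Count word occurrences across every message in the corpus.
for file in glob.glob('data/corpus/*'):
    with open(file, "r") as f:
        # Keep only letters and whitespace, then split on any whitespace.
        raw = re.sub(r'[^a-zA-Z\s]', '', f.read()).lower().split()
        words.update(raw)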

In [None]:
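def prob(w):
    # Fraction of the occurrences of w that come from spam; 0.0 for unseen words.
    total = words[w]
    return spam[w] / total if total else 0.0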

In [None]:
prob('buying')

In [None]:
prob('august')
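
The ratio computed by `prob` looks at one word at a time. A naive Bayes classifier combines such per-word evidence for a whole message by assuming that the words are independent given the class. The following is a minimal sketch of that idea, not the notebook's own implementation: the word counts are hypothetical toy values rather than counts from `data/corpus`, and the names `spam_counts`, `ham_counts`, `log_likelihood`, `spam_probability`, the add-one smoothing, and the equal prior are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical toy word counts; in the notebook these would come from the corpus.
spam_counts = Counter({"buy": 8, "now": 6, "meeting": 1})
ham_counts = Counter({"buy": 1, "now": 3, "meeting": 9})
vocabulary = set(spam_counts) | set(ham_counts)

def log_likelihood(tokens, counts, alpha=1.0):
    """Sum of log P(word | class) with add-alpha smoothing; unknown words are skipped."""
    total = sum(counts.values()) + alpha * len(vocabulary)
    return sum(math.log((counts[w] + alpha) / total)
               for w in tokens if w in vocabulary)

def spam_probability(tokens, prior_spam=0.5):
    """Posterior P(spam | tokens) under the naive independence assumption."""
    log_spam = math.log(prior_spam) + log_likelihood(tokens, spam_counts)
    log_ham = math.log(1 - prior_spam) + log_likelihood(tokens, ham_counts)
    # Normalize in log space to avoid underflow on long messages.
    m = max(log_spam, log_ham)
    e_spam, e_ham = math.exp(log_spam - m), math.exp(log_ham - m)
    return e_spam / (e_spam + e_ham)

print(spam_probability(["buy", "now"]))  # leans towards spam with these toy counts
print(spam_probability(["meeting"]))     # leans towards non-spam
```

Working with sums of logarithms instead of products of probabilities keeps the computation stable when a message contains many words.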