# Week 1

This week we will investigate ML basics and a variety of ML techniques as they are implemented in scikit-learn.  Throughout the rest of the class we will work with our final project dataset on every assignment, to see how the tools and techniques covered in that week apply to a specific ML problem.  On this assignment, however, we will experiment a bit with a dataset built into scikit-learn.  Scikit-learn has [a plethora of small datasets to play with](http://scikit-learn.org/stable/datasets/index.html), and we'll be making use of the digit recognition dataset.

## Goals
* Re-familiarize ourselves with Eider, and learn about Scikit-learn.
* Experiment with a couple of black-box ML techniques from Scikit-learn
* See the difference between supervised and unsupervised learning

## Resources
Here are some resources that might be of interest while working on this assignment.  They include the relevant chapters of our book and all the libraries you might wish to use during this, and all later, lectures.  Of particular note is [Eider Expo](https://eider.corp.amazon.com/expo), which lets you search for notebooks that demonstrate particular libraries or concepts.

* *Python Machine Learning*
    * Chapter 1: all pages
    * Chapter 3: all pages except the perceptron-related ones
    * Chapter 10: pages 277-290, 294-296
    * Chapter 11: pages 311-317, 320-321
* Scikit-learn
    * [Quick-Start tutorial](http://scikit-learn.org/stable/tutorial/basic/tutorial.html)
    * [User Guide](http://scikit-learn.org/stable/user_guide.html)
    * [Choosing the right estimator](http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)
    * [API](http://scikit-learn.org/stable/modules/classes.html)
* Pandas
    * [API](http://pandas.pydata.org/pandas-docs/version/0.18.1/api.html)
    * [Visualization](http://pandas.pydata.org/pandas-docs/version/0.18.1/visualization.html)
* matplotlib
    * [Plotting API](http://matplotlib.org/api/pyplot_summary.html)
    * [Gallery](http://matplotlib.org/gallery.html)
* NumPy
    * [API](https://docs.scipy.org/doc/numpy/reference/routines.html)
* Eider
    * [User guide documentation](https://w.amazon.com/index.php/Eider/Documentation)
    * [Expo](https://eider.corp.amazon.com/expo)

## Question 1

Load in the [handwritten digits dataset](http://scikit-learn.org/stable/datasets/index.html#optical-recognition-of-handwritten-digits-data-set) using [load_digits](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html#sklearn.datasets.load_digits).  This is the dataset we will be working with today.  Use matplotlib's ```imshow``` to take a look at the first 10 elements of this dataset to see what we are dealing with (don't forget to reshape them back into images).

In [0]:
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt

digits = load_digits()

# Render images in grayscale by default.
plt.gray()

def show_pictures(data, labels=None, fig_title=None):
    """Plot each flattened 64-pixel row of `data` as an 8x8 image, side by side."""
    num = len(data)
    fig, ax = plt.subplots(1, num, figsize=(14, 2.1))
    if fig_title:
        fig.suptitle(fig_title)
    for k in range(num):
        # Each example is stored as a flat vector of 64 pixels; reshape it back to 8x8.
        ax[k].imshow(data[k].reshape((8, 8)))
        if labels is not None:
            ax[k].set_title("This is: " + str(labels[k]))

show_pictures(digits.data[:10], labels=digits.target[:10], fig_title="First 10 digits from the sample set of %d with target label" % len(digits.data))

We will first experiment with the dataset as an unsupervised learning problem, that is, ignoring the class labels that we were provided.  In particular, we will attempt to cluster these digits using $K$-means.  We know that there are 10 classes to work with, so we will start with $K=10$.

## Question 2
Implement $K$-means clustering on the digits dataset with $K=10$ using [KMeans](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html).  The cluster centers are an average of all examples assigned to that cluster, and so can be thought of as the prototypical behavior of the digits in that cluster.  The output from this routine is somewhat stochastic, so try running it a couple of times and explain below what you see.  Are there clear cluster centers for all $10$ digit classes?  What happens if you have the number of cluster centers wrong, say 4 or 12?  Please be critical of these algorithms, as it is often tempting to see their power and turn a blind eye to their failings.
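
To make the "centers are averages" statement concrete, here is a minimal NumPy sketch of the classic Lloyd iteration behind $K$-means. It is an illustration only, not scikit-learn's implementation; the function name `lloyd_kmeans`, its `init_centers` argument, and the toy points are all made up for this example.

```python
import numpy as np

def lloyd_kmeans(X, init_centers, n_iter=100):
    """Illustrative Lloyd iteration: alternate assignment and averaging steps."""
    centers = np.asarray(init_centers, dtype=float).copy()
    for _ in range(n_iter):
        # Assignment step: index of the nearest center for every point.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points.
        new_centers = centers.copy()
        for j in range(len(centers)):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        converged = np.allclose(new_centers, centers)
        centers = new_centers
        if converged:
            break
    return centers, labels

# Toy data (hypothetical): two small groups of 2-D points.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
# Deliberately poor start: both initial centers sit in the first group.
centers, labels = lloyd_kmeans(X, init_centers=X[:2])
print(centers)
print(labels)
```

The result depends on where the centers start, which is why runs with different random seeds can give different clusters.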

In [0]:
from sklearn.cluster import KMeans
import numpy as np
import pandas as pd

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

def run_kmeans(clusters, rs=0):
    return KMeans(n_clusters=clusters, init='random', random_state=rs).fit(digits.data)

def kmeans_with_pictures(clusters, rs=0):
    kmeans = run_kmeans(clusters, rs=rs)
    show_pictures(sorted(kmeans.cluster_centers_, key=sum), fig_title="Sorted centers for %d clusters random seed: %d" % (clusters, rs))
    return kmeans

kmeans = kmeans_with_pictures(10,0)

# The seeds are arbitrarily selected.
# We explicitly select seeds to have a predictable output that matches our observations in the discussion.
kmeans_with_pictures(10, 65439793)
kmeans_with_pictures(10, 768827)
kmeans_with_pictures(4, 12396)
kmeans_with_pictures(12, 96234012)

## Discussion for Question 2

The cluster centers for KMeans - 10 look pretty close to the numbers themselves. Rerunning KMeans with different random seeds usually yields very similar clusters, although sometimes the differences can be big.

With KMeans - 4, the clusters get much blurrier. Number 9 still stands out, but the other clusters look like several numbers melted together.

With KMeans - 12, the clusters can be more 'precise', and more than one center can isolate the same number. The center for number 8 became clearer, and we have two centers for number 7 and probably also for number 1.

## Question 3
To be rigorous about the performance of the model, you can take the labels associated with each digit and see if they are all associated with the same cluster.  Use the ```predict``` function associated with ```KMeans``` to predict which cluster center each digit was clustered with.  Is there any particular digit which gets spread out over many clusters?
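
The check described above amounts to counting, for each true label, how many of its examples land in each cluster. A minimal sketch using only the standard library follows; the helper name `clusters_per_label` and the toy values are made up for illustration and do not come from the digits dataset.

```python
from collections import Counter, defaultdict

def clusters_per_label(labels, clusters):
    """Count how the examples of each true label are spread over clusters."""
    spread = defaultdict(Counter)
    for label, cluster in zip(labels, clusters):
        spread[label][cluster] += 1
    return spread

# Toy data (hypothetical): label 0 stays in one cluster, label 1 is split in two.
toy_labels = [0, 0, 0, 1, 1, 1, 1]
toy_clusters = [2, 2, 2, 0, 1, 0, 1]
spread = clusters_per_label(toy_labels, toy_clusters)
for label in sorted(spread):
    print(label, "appears in", len(spread[label]), "cluster(s):", dict(spread[label]))
```

A label whose counter has many keys is one that is spread out over many clusters.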

In [0]:
def run_cluster_stats(digits, algorithm):
    num_samples = len(digits.data)
    # predict accepts the whole data matrix, so no per-sample loop is needed
    prediction = algorithm.predict(digits.data)
    x = pd.DataFrame({ 'target': digits.target, 'cluster' : prediction, 'match_count' : np.ones(num_samples) })
    target_count = x[['target','match_count']].groupby('target',as_index=False).count()
    cluster_count = x[['cluster','match_count']].groupby('cluster',as_index=False).count()
    g = x.groupby(by=['target','cluster'],as_index=False).count()
    g = g.merge(target_count,on=['target'], suffixes=['','_target'])
    g = g.merge(cluster_count,on=['cluster'], suffixes=['','_cluster'])
    g.rename(columns={'match_count_target': 'target_count', 'match_count_cluster': 'cluster_count'}, inplace=True)
    g['pct_target'] = (g['match_count']/g['target_count'])*100.0
    g['pct_cluster'] = (g['match_count']/g['cluster_count'])*100.0
    return g
    
g = run_cluster_stats(digits, kmeans)
def print_cluster_stats_help():
    print("match_count - number of examples in this target-cluster pair")
    print("target_count - number of examples with this target value")
    print("cluster_count - number of examples assigned to this cluster")
    print("pct_target - what percentage of the target's examples falls into this target-cluster pair?")
    print("pct_cluster - what percentage of the cluster's examples falls into this target-cluster pair?\n\n")
print_cluster_stats_help()
g[g['pct_cluster'] > 40].sort_values(['target'],ascending=False)

In [0]:
def run_error_rate(clusters):
    kmeans = run_kmeans(clusters)
    g = run_cluster_stats(digits, kmeans)
    c_max = g[['cluster','pct_cluster']].groupby(['cluster'],as_index=False).max()
    g_max = g.merge(c_max, on=['cluster'], suffixes=['','_max'])
    g_max = g_max[g_max['pct_cluster'] == g_max['pct_cluster_max']]
    return (g_max['cluster_count'] - g_max['match_count']).sum()
n_clusters = range(2,50,2)
values = [run_error_rate(x) for x in n_clusters]
plt.scatter(n_clusters, values)

## Discussion for Question 3

We run a simple error summary that assigns each cluster its most common digit and counts the examples in that cluster that belong to any other digit. As you can see from the graph, the error starts out pretty big, which makes sense with few clusters, and begins to level off at around 12 clusters, which might indicate that 12 clusters is the first good approximation of the digit classes.
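
The error summary can be sketched in a few lines of plain Python. This is a simplified restatement of the idea rather than the exact pandas code used in the cell; the function name `majority_vote_errors` and the toy values are made up for illustration.

```python
from collections import Counter, defaultdict

def majority_vote_errors(labels, clusters):
    """Count examples whose label differs from the majority label of their cluster."""
    members = defaultdict(Counter)
    for label, cluster in zip(labels, clusters):
        members[cluster][label] += 1
    errors = 0
    for counts in members.values():
        # Size of the most common label inside this cluster.
        majority_size = counts.most_common(1)[0][1]
        errors += sum(counts.values()) - majority_size
    return errors

# Toy data (hypothetical): one example of label 0 ends up in the cluster dominated by label 1.
toy_labels = [0, 0, 0, 1, 1, 1, 2, 2]
toy_clusters = [0, 0, 1, 1, 1, 1, 2, 2]
toy_errors = majority_vote_errors(toy_labels, toy_clusters)
print(toy_errors)
```

Keep in mind that a clustering which puts every example in its own cluster scores zero errors on this measure, so a falling curve alone does not show that more clusters are better.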

As you undoubtedly see, $K$-means does an impressive, albeit flawed, job of separating the digits into their natural classes completely without labeled data.  Oftentimes, certain similar classes blur together, some classes may lack a definitive cluster altogether, and some classes that have high diversity may split into many clusters.

This is not particularly surprising.  We basically just told the algorithm, "Here is a pile of vectors without context, what do you see?"  Sometimes we need to do that if there are no labels available to us; here, however, we discarded the labels we were given.  Let's now try to fit a supervised model: a version of logistic regression made for multiple classes.

## Question 4
Implement [logistic regression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) to try to predict digit labels.  Much like the cluster centers from before, each class has an associated weight vector that is used to specify the class, so please plot those to see how the class decision is being made.  Can you identify the digit classes now from the weights?
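
To see how one weight vector per class turns into a class decision, here is a minimal NumPy sketch. It assumes the multinomial (softmax) formulation; depending on version and solver settings, scikit-learn may instead fit one-vs-rest models, so treat this as an illustration of the idea rather than a description of what ```LogisticRegression``` does internally. The function name `softmax_predict` and the toy weights are made up.

```python
import numpy as np

def softmax_predict(X, W, b):
    """Class probabilities and predicted class from per-class weight vectors."""
    # One score per (example, class): dot product with that class's weight vector.
    scores = X @ W.T + b
    # Subtract the row maximum so the exponentials cannot overflow.
    scores = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(scores)
    probs = exp_scores / exp_scores.sum(axis=1, keepdims=True)
    return probs, probs.argmax(axis=1)

# Toy model (hypothetical): 3 classes, 2 features.
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])  # shape (n_classes, n_features)
b = np.zeros(3)
X = np.array([[2.0, 0.0], [0.0, 3.0], [-1.0, -1.0]])
probs, predicted = softmax_predict(X, W, b)
print(predicted)
```

For a fitted multiclass scikit-learn model, the per-class weights live in ```coef_``` (one row per class) and the offsets in ```intercept_```, which is why each row of ```coef_``` can be reshaped and plotted as an 8x8 image.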

In [0]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
model = lr.fit(digits.data, digits.target)
show_pictures(model.coef_, fig_title="Coefficients of logistic regression")

## Discussion for Question 4

No, the weight vectors don't look anything like numbers. They look more like random images.

You most likely saw that the weights learned by the logistic regression look essentially like noise with only the slightest hints of the digits.  We do not dig into this in this assignment, but this noisy appearance is partially due to overfitting, where the model puts heavy weight on random fluctuations that happen to be indicative of the class label in our dataset.

However, now, we will see that our model does a much better job of classifying the digits it trained on.

## Question 5
As in **Question 3**, you can now take the labels predicted for each digit and see if they match the labels the model was trained on.  Is there any particular digit which gets spread out over many predicted classes now?

In [0]:
print_cluster_stats_help()
run_cluster_stats(digits, lr)

## Discussion for Question 5

The predicted classes are much more precise than the clusters were. Some of them have 100% coverage, meaning they contain only one exact number. Numbers 9, 1 and 8 also occur in other predicted classes, but with low percentages. The highest is 8, which is predicted as number 1 in 2.7% of cases.

You will have likely seen that it does an almost perfect job.  We don't investigate further here, but don't be fooled by this behavior!  As hinted at before, this model is overfitting severely, and if we had set aside some of the digits at the beginning of this notebook to test on later, we would see it does not do such a good job with previously unseen digits.  Being careful about this, however, is a topic for a later lecture.
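
Setting aside some digits for later testing can be sketched as below. This is a minimal NumPy illustration, not part of the original assignment; the function name `holdout_split`, its parameters, and the toy arrays are made up. Scikit-learn provides `train_test_split` in `sklearn.model_selection` for the same purpose.

```python
import numpy as np

def holdout_split(X, y, test_fraction=0.25, seed=0):
    """Shuffle the examples and set a fraction aside for testing."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(round(test_fraction * len(X)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data (hypothetical): 10 examples with 2 features each.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10)
X_train, X_test, y_train, y_test = holdout_split(X, y, test_fraction=0.3)
print(len(X_train), len(X_test))
```

The model would then be fit on `X_train` and `y_train` only, and its predictions compared against `y_test`, so that the reported accuracy reflects digits it has not seen.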