# Intro to ML with scikit-learn

### What is Machine Learning?

Machine Learning is about building programs with tunable parameters (typically an array of floating point values) that are adjusted automatically so as to improve their behavior by adapting to previously seen data.

Machine Learning can be considered a subfield of <b> Artificial Intelligence </b> since those algorithms can be seen as building blocks to make computers learn to behave more intelligently by somehow <b> generalizing </b> rather than just storing and retrieving data items like a database system would do.

We'll take a look at two very simple machine learning tasks here. The first is a <b> classification </b> task: the figure shows a collection of two-dimensional data, colored according to two different class labels. A classification algorithm may be used to draw a dividing boundary between the two clusters of points:


### We'll briefly go over supervised learning, then focus on unsupervised learning.

### Supervised Learning: Classification and Regression


In <b>Supervised Learning</b>, we have a dataset consisting of both features and labels. The task is to construct an estimator which is able to predict the label of an object given the set of features. A relatively simple example is predicting the species of iris given a set of measurements of its flower. Some more complicated examples are:
 - given a multicolor image of an object through a telescope, determine whether that object is a star, a quasar, or a galaxy.
 - given a photograph of a person, identify the person in the photo.
 - given a list of movies a person has watched and their personal rating of the movie, recommend a list of movies they would like (So-called recommender systems: a famous example is the Netflix Prize).
 
What these tasks have in common is that there are one or more unknown quantities associated with the object which need to be determined from other observed quantities.
Supervised learning is further broken down into two categories, classification and regression. In classification, the label is discrete, while in regression, the label is continuous. For example, in astronomy, the task of determining whether an object is a star, a galaxy, or a quasar is a classification problem: the label is from three distinct categories. On the other hand, we might wish to estimate the age of an object based on such observations: this would be a regression problem, because the label (age) is a continuous quantity.
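
The distinction can be seen in a small sketch on the iris data. The choice of petal width as a regression target is only an illustrative assumption, not something the dataset prescribes; the point is that a classifier returns discrete class indices while a regressor returns continuous values.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

iris = load_iris()

# Classification: the label is a discrete species index (0, 1, or 2)
clf = KNeighborsClassifier(n_neighbors=5).fit(iris.data, iris.target)
species = clf.predict(iris.data[:1])
print(species.dtype, species)

# Regression: the label is a continuous quantity. As an illustrative choice,
# predict petal width (column 3) from the other three measurements.
X_reg, y_reg = iris.data[:, :3], iris.data[:, 3]
reg = KNeighborsRegressor(n_neighbors=5).fit(X_reg, y_reg)
width = reg.predict(X_reg[:1])
print(width.dtype, width)
```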


### Classification Example

K nearest neighbors (kNN) is one of the simplest learning strategies: given a new, unknown observation, look up in your reference database which ones have the closest features and assign the predominant class.
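
To make the idea concrete, here is a minimal from-scratch sketch of the lookup-and-vote step in NumPy. The function name `knn_predict` and the toy points are made up for illustration; scikit-learn's own implementation is more elaborate (for example, it can use tree-based neighbor searches).

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Return the majority label among the k training points closest to x_new."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote among the neighbors' labels
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data (made up for illustration): two groups of 2-D points
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_train, y_train, np.array([0.3, 0.3]))
pred_b = knn_predict(X_train, y_train, np.array([4.8, 5.0]))
print(pred_a, pred_b)
```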


In [None]:
from sklearn import neighbors, datasets

iris = datasets.load_iris()
print(iris.target_names)
X, y = iris.data, iris.target

# create the model
knn = neighbors.KNeighborsClassifier(n_neighbors=5)

# fit the model
knn.fit(X, y)

# What kind of iris has 3cm x 5cm sepal and 4cm x 2cm petal?
# call the "predict" method:
result = knn.predict([[3, 5, 4, 2],])

print(iris.target_names[result])

In [None]:
knn.predict_proba([[3, 5, 4, 2],])

## Unsupervised Learning: Dimensionality Reduction and Clustering


<b> Unsupervised Learning </b> addresses a different sort of problem. Here the data has no labels, and we are interested in finding similarities between the objects in question. In a sense, you can think of unsupervised learning as a means of discovering labels from the data itself. Unsupervised learning comprises tasks such as <i> dimensionality reduction, clustering, and density estimation </i>. For example, in the iris data discussed above, we can use unsupervised methods to determine combinations of the measurements which best display the structure of the data. As we'll see below, such a projection of the data can be used to visualize the four-dimensional dataset in two dimensions. Some more involved unsupervised learning problems are:
- given detailed observations of distant galaxies, determine which features or combinations of features best summarize the information.
- given a mixture of two sound sources (for example, a person talking over some music), separate the two (this is called the blind source separation problem).
- given a video, isolate a moving object and categorize in relation to other moving objects which have been seen.
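
As a quick sketch of the iris projection mentioned above, the snippet below reduces the four measurements to two combined features. PCA is used here as one possible choice of method (an assumption for illustration); note that the labels are never passed to the model.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()

# Fit PCA on the features only; the labels are never shown to the model
pca = PCA(n_components=2)
X_2d = pca.fit_transform(iris.data)

print(iris.data.shape, '->', X_2d.shape)
# Fraction of the total variance retained by each of the two components
print(pca.explained_variance_ratio_)
```

The two columns of `X_2d` can then be passed straight to `plt.scatter` to visualize the dataset.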


### Optical Character Recognition
Let's consider OCR (Optical Character Recognition) – that is, recognizing hand-written digits. In the wild, this problem involves both locating and identifying characters in an image. Here we'll take a shortcut and use scikit-learn's set of pre-formatted digits, which is built into the library.

#### Loading and visualizing the digits data


In [None]:
from matplotlib import pyplot as plt
%matplotlib inline
import numpy as np
from sklearn import datasets
digits = datasets.load_digits()
digits.images.shape


In [None]:
fig, axes = plt.subplots(10, 10, figsize=(8, 8))
fig.subplots_adjust(hspace=0.1, wspace=0.1)

for i, ax in enumerate(axes.flat):
    ax.imshow(digits.images[i], cmap='binary', interpolation='nearest')
    ax.text(0.05, 0.05, str(digits.target[i]),
            transform=ax.transAxes, color='green')
    ax.set_xticks([])
    ax.set_yticks([])

In [None]:
print(digits.images.shape)
print(digits.images[0])


In [None]:
# The data for use in our algorithms
print(digits.data.shape)
print(digits.data[0])


In [None]:
# The target label
print(digits.target)
# So our data have 1797 samples in 64 dimensions.


### Unsupervised Learning: Dimensionality Reduction
We'd like to visualize our points within the 64-dimensional parameter space, but it's difficult to plot points in 64 dimensions! Instead we'll reduce the dimensions to 2, using an unsupervised method. Here, we'll make use of a manifold learning algorithm called Isomap, and transform the data to two dimensions.


In [None]:
from sklearn.manifold import Isomap

In [None]:
iso = Isomap(n_components=2)
data_projected = iso.fit_transform(digits.data)

In [None]:
data_projected.shape

In [None]:
# plt.get_cmap replaces plt.cm.get_cmap, which newer matplotlib releases removed
plt.scatter(data_projected[:, 0], data_projected[:, 1], c=digits.target,
            edgecolor='none', alpha=0.8, cmap=plt.get_cmap('nipy_spectral', 10))
plt.colorbar(label='digit label', ticks=range(10))
plt.clim(-0.5, 9.5)

We see here that the digits are fairly well-separated in the parameter space; this tells us that a supervised classification algorithm should perform fairly well. Let's give it a try.


### Classification on Digits
Let's try a classification task on the digits. The first thing we'll want to do is split the digits into a training and testing sample:


In [None]:
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(digits.data, digits.target,
                                                random_state=2)
print(Xtrain.shape, Xtest.shape)


Let's use a simple logistic regression which (despite its confusing name) is a classification algorithm:
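
To see where the name comes from, here is a small sketch on a two-class subset of iris (the subset and `max_iter=1000` are illustrative choices, not part of the digits example). The model computes a real-valued linear score, as a regression would, and then squashes it through the logistic function to obtain a class probability.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
# Keep only two species so the problem is binary (an illustrative simplification)
mask = iris.target < 2
X, y = iris.data[mask], iris.target[mask]

clf = LogisticRegression(max_iter=1000).fit(X, y)

# The underlying linear "regression": one real-valued score per sample
scores = clf.decision_function(X)
# Squash the scores through the logistic (sigmoid) function to get probabilities
probs = 1.0 / (1.0 + np.exp(-scores))

# Compare with the class-1 probability that scikit-learn reports
print(np.allclose(probs, clf.predict_proba(X)[:, 1]))
# The predicted class is whichever side of 0.5 the probability falls on
print(np.array_equal((probs > 0.5).astype(int), clf.predict(X)))
```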


In [None]:
from sklearn.linear_model import LogisticRegression
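# A higher max_iter gives the solver more room to converge on the unscaled pixel values
clf = LogisticRegression(penalty='l2', max_iter=1000)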
clf.fit(Xtrain, ytrain)
ypred = clf.predict(Xtest)


In [None]:
from sklearn.metrics import accuracy_score    #let's find out the accuracy of our classification
accuracy_score(ytest, ypred)

A single accuracy number doesn't tell us which digits are being confused with which. Let's look at the confusion matrix.
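
As a reminder of how to read one, here is a tiny sketch with made-up labels for a three-class problem. In scikit-learn's convention, rows are true classes and columns are predicted classes, so correct predictions sit on the diagonal.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

# Made-up labels for a three-class problem
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 0, 2]

cm = confusion_matrix(y_true, y_pred)
print(cm)

# Row i = true class i, column j = predicted class j,
# so accuracy is the diagonal sum divided by the total count
accuracy = np.trace(cm) / cm.sum()
print(accuracy, accuracy_score(y_true, y_pred))

# Per-class recall: fraction of each true class that was recovered
recall = np.diag(cm) / cm.sum(axis=1)
print(recall)
```

Off-diagonal entries show exactly which pairs of classes get mixed up, which is the information a single accuracy score hides.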

In [None]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(ytest, ypred))

In [None]:
# log1p(x) = log(1 + x) keeps empty cells finite (plain log would give -inf for zeros)
plt.imshow(np.log1p(confusion_matrix(ytest, ypred)),
           cmap='Blues', interpolation='nearest')
plt.grid(False)
plt.ylabel('true')
plt.xlabel('predicted');

Let's also look at the predictions themselves, with correct labels in green and wrong ones in red:

In [None]:
fig, axes = plt.subplots(10, 10, figsize=(8, 8))
fig.subplots_adjust(hspace=0.1, wspace=0.1)

for i, ax in enumerate(axes.flat):
    ax.imshow(Xtest[i].reshape(8, 8), cmap='binary')
    ax.text(0.05, 0.05, str(ypred[i]),
            transform=ax.transAxes,
            color='green' if (ytest[i] == ypred[i]) else 'red')
    ax.set_xticks([])
    ax.set_yticks([])

Ta-da! The misclassified digits could look confusing to us too :<

## Clustering: Intro to K-Means
Here we'll explore K Means clustering, an unsupervised technique: it finds clusters in data based on the data attributes alone (not the labels).

K Means is a relatively easy-to-understand algorithm. It searches for cluster centers which are the mean of the points within them, such that every point is closest to the cluster center it is assigned to.
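
One common way to search for such centers is to alternate two steps: assign every point to its nearest center, then move every center to the mean of its points. The sketch below is a bare-bones NumPy version of that loop, under simplifying assumptions (random initial centers taken from the data, a fixed iteration cap). The function name `kmeans` and the toy blobs are made up for illustration; scikit-learn's `KMeans` adds smarter initialization and other refinements.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain two-step iteration: alternate assignment and mean-update steps."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its points
        # (a center with no points is left where it is)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving, so we are done
        centers = new_centers
    return centers, labels

# Two well-separated toy blobs (made up for illustration)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.3, size=(50, 2)),
               rng.normal(5, 0.3, size=(50, 2))])
centers, labels = kmeans(X, k=2)
print(np.sort(centers[:, 0]))
```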

Let's look at how KMeans operates on the simple clusters we looked at previously. To emphasize that this is unsupervised, we'll not plot the colors of the clusters:


In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

In [None]:
# make_blobs is imported from sklearn.datasets (the samples_generator module was removed)
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=300, centers=4,
                  random_state=0, cluster_std=0.60)
plt.scatter(X[:, 0], X[:, 1], s=50);

By eye, you can easily pick out the four clusters.

In [None]:
from sklearn.cluster import KMeans
est = KMeans(n_clusters=4)  # 4 clusters
est.fit(X)
y_kmeans = est.predict(X)
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='rainbow');

The algorithm identifies the four clusters of points in a manner very similar to what we would do by eye!


### Application of KMeans to Digits

In [None]:
from sklearn.datasets import load_digits
digits = load_digits()

In [None]:
est = KMeans(n_clusters=10)
clusters = est.fit_predict(digits.data)
est.cluster_centers_.shape

We get 10 cluster centers in 64 dimensions. Let's see what they look like as 8x8 images.

In [None]:
fig = plt.figure(figsize=(8, 3))
for i in range(10):
    ax = fig.add_subplot(2, 5, 1 + i, xticks=[], yticks=[])
    ax.imshow(est.cluster_centers_[i].reshape((8, 8)), cmap=plt.cm.binary)


In [None]:
from scipy.stats import mode

labels = np.zeros_like(clusters)
for i in range(10):
    mask = (clusters == i)
    labels[mask] = mode(digits.target[mask])[0]

In [None]:
from sklearn.decomposition import PCA

X = PCA(2).fit_transform(digits.data)

# plt.get_cmap replaces plt.cm.get_cmap, which newer matplotlib releases removed
kwargs = dict(cmap=plt.get_cmap('rainbow', 10),
              edgecolor='none', alpha=0.6)
fig, ax = plt.subplots(1, 2, figsize=(8, 4))
ax[0].scatter(X[:, 0], X[:, 1], c=labels, **kwargs)
ax[0].set_title('learned cluster labels')

ax[1].scatter(X[:, 0], X[:, 1], c=digits.target, **kwargs)
ax[1].set_title('true labels');

Just for fun, let's see how accurate our K-Means classifier is with no label information.
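
One way to put a number on this is sketched below as a self-contained snippet. It repeats the clustering and relabeling steps, using `np.bincount` in place of `scipy.stats.mode` for the majority vote, and fixes `n_init` and `random_state` purely so the run is repeatable (both are assumptions of this sketch). The true labels are used only for relabeling and scoring, never for fitting.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score

digits = load_digits()
# n_init and random_state are fixed here only to make the run repeatable
clusters = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(digits.data)

# Cluster ids are arbitrary, so relabel each cluster with the most common
# true digit among its members
labels = np.zeros_like(clusters)
for i in range(10):
    mask = (clusters == i)
    labels[mask] = np.bincount(digits.target[mask]).argmax()

score = accuracy_score(digits.target, labels)
print(score)
```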

In [None]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(digits.target, labels))

plt.imshow(confusion_matrix(digits.target, labels),
           cmap='Blues', interpolation='nearest')
plt.colorbar()
plt.grid(False)
plt.ylabel('true')
plt.xlabel('predicted');

## Text Datasets

In [None]:
# Dataset of 20 newsgroups
from sklearn.datasets import fetch_20newsgroups

In [None]:
# Categories that we will select from
categories = ['sci.space', 'comp.sys.mac.hardware', 'comp.windows.x', 'sci.crypt', 'rec.autos']

# Training dataset
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
# Test dataset
twenty_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)

# Each entry has a filename, a target, and some data
print(twenty_train.filenames.shape)
print(twenty_train.target.shape)
print(twenty_train.target[:10])

In [None]:
print(twenty_train.data[0])

In [None]:
print(twenty_train.filenames[0])
print(twenty_train.target_names[twenty_train.target[0]])

In [None]:
# Vectorize the text dataset into word counts
from sklearn.feature_extraction.text import CountVectorizer

# Count vectorizer takes a text dataset and creates vectors per document relating to the count of each word
# i.e. "X" dimension is each row
#      "Y" dimension is count of each unique word in the given X document
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)

# First dimension is number of documents, second dimension is number of unique words
print(X_train_counts.shape)

# We can find the index for specific words in the vector
print(count_vect.vocabulary_.get(u'data'))
print(count_vect.vocabulary_.get(u'cs'))

car_idx = count_vect.vocabulary_.get(u'car')
car_count = np.sum(X_train_counts[:,car_idx])
print("Number of occurrences of 'car': {}".format(car_count))
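
To see what the count matrix looks like on a small scale, here is a sketch with two made-up documents. With default settings, `CountVectorizer` lowercases the text, and the columns follow the alphabetically sorted vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents
docs = ['The car is fast', 'The red car passed the blue car']

vect = CountVectorizer()
counts = vect.fit_transform(docs)

# Recover the column order from the word -> column index mapping
vocab = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
print(vocab)
# One row per document, one column per word; stored sparse, shown dense here
print(counts.toarray())
```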

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer

# TF-IDF stands for "term frequency–inverse document frequency"
# TF-IDF is a useful metric for describing how "important" a word is in a document based on its occurrence in the
# document relative to its occurrences in the larger corpus
tfidf = TfidfTransformer()

# TF-IDF has the same "dimensions" as our count vectors
# However, now the vectors represent the tfidf model rather than word counts
X_train_tfidf = tfidf.fit_transform(X_train_counts)
X_train_tfidf.shape
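
For intuition, the weighting can be reproduced by hand on a tiny made-up count matrix. The sketch below assumes `TfidfTransformer`'s default settings (smoothed IDF followed by L2 row normalization); other settings would give different numbers.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Made-up word counts: 3 documents (rows) x 3 words (columns)
counts = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [3, 1, 0]])

n_docs = counts.shape[0]
# Document frequency: number of documents containing each word
df = (counts > 0).sum(axis=0)
# Smoothed inverse document frequency (the default, smooth_idf=True)
idf = np.log((1 + n_docs) / (1 + df)) + 1
# Weight the raw counts, then scale each row to unit Euclidean length
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sk = TfidfTransformer().fit_transform(counts).toarray()
print(np.allclose(manual, sk))
```

The first word appears in every document, so it gets the smallest IDF weight; the rarer words are weighted up relative to it.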

In [None]:
from sklearn.naive_bayes import MultinomialNB

# MultinomialNB is a simple classifier for discrete features
# We pass it our training data vectors and the target labels
# fit() returns the fitted classifier, so we can use it to predict data later
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

In [None]:
# We can run our classifier on some example data
docs_test = [
    'my windows pc just got a virus',
    'why doesnt my computer have a real gpu',
    'what is the most secure hashing algorithm',
    'elon musk will take us to mars',
    'my car caught on fire last night',
    'I virtualize windows on my macintosh laptop while sending crypto-currency to the moon'
]

# We take our example data and run it through the same operations as the training data
# Note: use transform (not fit_transform) so the IDF weights learned from the training set are reused
X_new_counts = count_vect.transform(docs_test)
X_new_tfidf = tfidf.transform(X_new_counts)

# Use the classifier to predict the labels for our test documents
predicted = clf.predict(X_new_tfidf)

# Print the test data along with the predicted label
for doc, category in zip(docs_test, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))

In [None]:
from sklearn.pipeline import Pipeline

# Scikit-learn offers a "pipeline" tool that can glue together stages of ML processing
# In the following lines, we construct exactly the same type of model and classifier as previously,
# but with more compact syntax
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),
])
text_clf.fit(twenty_train.data, twenty_train.target)

In [None]:
# We can use the held-out test data to generate some metrics
docs_test = twenty_test.data

# Use our pipelined predictor
predicted = text_clf.predict(docs_test)

# A simple "accuracy" measurement
np.mean(predicted == twenty_test.target)

In [None]:
# Confusion matrix
print(confusion_matrix(twenty_test.target, predicted))

plt.imshow(confusion_matrix(twenty_test.target, predicted),
           cmap='Blues', interpolation='nearest')
plt.colorbar()
plt.grid(False)
plt.ylabel('true')
plt.xlabel('predicted');