# Introduction to Machine Learning 2
Na-Rae Han, 10/19/2019

General machine learning workflow:
1. Choose a class of model
2. Choose model hyperparameters
3. Fit the model to the training data ("training")
4. Use the model to predict labels for new data
    - If labels are known (test data, aka 'gold' data), evaluate the performance. 
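
The four steps above can be sketched in a few lines of sklearn. This is only an illustration: the toy data from `make_blobs`, the choice of `KNeighborsClassifier`, and `n_neighbors=3` are all assumptions, not part of this lesson's data.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Toy labeled data (an assumption for illustration): 2 well-separated groups.
X_toy, y_toy = make_blobs(n_samples=200, centers=[[0, 0], [8, 8]],
                          cluster_std=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0)

model = KNeighborsClassifier(n_neighbors=3)   # steps 1-2: model class + hyperparameter
model.fit(X_train, y_train)                   # step 3: fit to the training data
y_pred = model.predict(X_test)                # step 4: predict labels for new data
acc = accuracy_score(y_test, y_pred)          # gold labels are known, so evaluate
print(acc)
```

Clustering, covered below, follows the same fit/predict pattern, but step 3 gets no labels.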

### Three types of ML:
https://jakevdp.github.io/PythonDataScienceHandbook/05.01-what-is-machine-learning.html

1. Regression: predicting continuous values
2. Classification: predicting discrete labels
3. **Clustering: inferring labels on unlabeled data**  <-- This one below

In [None]:
# Turns on/off pretty printing 
%pprint

# Every returned Out[] is displayed, not just the last one. 
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import sklearn               # sklearn is the ML package we will use
import seaborn as sns        # seaborn graphical package

## Clustering: a type of unsupervised learning

- Using sklearn's pre-loaded data set "20 Newsgroups" 
- Code below is adapted from sklearn's official tutorial: 
  http://scikit-learn.org/stable/auto_examples/text/document_clustering.html 

Topic-based clustering is our goal:  
- Given a set of documents that are written on 4 topics, can they be grouped into 4 clusters? 

We will try the **K-means clustering** method. 
- A good introductory article: https://www.datascience.com/blog/k-means-clustering
- sklearn's documentation: http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_iris.html
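
To get a feel for what K-means does, here is a minimal NumPy sketch of its two alternating steps: assign each point to the nearest centroid, then move each centroid to the mean of its points. The function name `kmeans_sketch` and the toy points are hypothetical. Unlike sklearn's `KMeans`, it uses plain random initialization rather than `k-means++`.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Bare-bones K-means for illustration; not sklearn's implementation."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points
        # (an empty cluster keeps its old centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break   # centroids stopped moving
        centroids = new_centroids
    return labels, centroids

points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids = kmeans_sketch(points, k=2)
print(labels)
```

The cluster IDs themselves (0 or 1) are arbitrary; only the grouping is meaningful.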

In [None]:
from sklearn import metrics
from sklearn.cluster import KMeans

In [None]:
# TfidfVectorizer is essentially CountVectorizer + TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
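
The comment above can be checked on a tiny corpus. The three sentences in `docs` are made up for illustration. With default settings, the one-step and two-step routes should produce the same matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

docs = ["the rocket reached orbit",
        "the church held a service",
        "the rocket carried a satellite"]

# Two-step route: raw word counts, then TF-IDF weighting.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route.
one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```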

### Data preparation step

In [None]:
from sklearn.datasets import fetch_20newsgroups

# We will use the same 4 categories
cats = ['talk.religion.misc', 'soc.religion.christian', 'sci.space', 'comp.graphics']

# No train-test split, because this is unsupervised learning.
dataset = fetch_20newsgroups(subset='all', categories=cats, shuffle=True, random_state=12)

In [None]:
type(dataset)

In [None]:
dir(dataset)

In [None]:
dataset.data[5]

In [None]:
dataset.target
dataset.target[5]
dataset.target_names

In [None]:
len(dataset.data)

In [None]:
# In our case, WE KNOW THE TRUE VALUE OF K: 4 topics. 
# But in many real-life use cases, the true number of clusters is not known,
#  and the user must experiment with different K values. 

true_k = np.unique(dataset.target).shape[0]
print(true_k)
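
When K is not known, one common way to experiment is to fit K-means for several K values and compare a label-free score such as the silhouette coefficient. The sketch below assumes toy 2-D data from `make_blobs` with 3 well-separated groups; on real text data the picture is rarely this clean.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data (an assumption for illustration): 3 tight, well-separated groups.
pts, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                    cluster_std=0.5, random_state=0)

scores = {}
for k in range(2, 7):
    k_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(pts)
    scores[k] = silhouette_score(pts, k_labels)   # higher is better

best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```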

In [None]:
# Ignore words found in over 50% of documents and words found in only 1 document. 
# Keep the 1000 most frequent remaining words; remove English stop words. 
vectorizer = TfidfVectorizer(max_df=0.5, min_df=2, max_features=1000, stop_words='english')
X = vectorizer.fit_transform(dataset.data)

In [None]:
X[5]
print(X[5])
# 1x1000? "sparse matrix"? 
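
A sparse matrix stores only the non-zero cells, which is why printing a row shows a short list of (row, column) value entries instead of 1000 numbers. A small sketch with a made-up 2x5 matrix:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical tiny "TF-IDF" matrix: 2 documents, 5 features, mostly zeros.
dense = np.array([[0.0, 0.0, 0.7, 0.0, 0.0],
                  [0.0, 0.5, 0.0, 0.0, 0.5]])
sparse = csr_matrix(dense)

print(sparse.shape)    # (rows, columns), same as the dense array
print(sparse.nnz)      # number of stored non-zero values
print(sparse[0])       # only the non-zero entries of row 0 are listed
row_dense = sparse[0].toarray()   # back to a regular 1x5 NumPy array
print(row_dense)
```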

In [None]:
vectorizer.vocabulary_.get('space')
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vectorizer.get_feature_names_out()[204]
vectorizer.get_feature_names_out()[180]

### Data preparation complete. Time to apply K-means

In [None]:
km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1, verbose=True)
km.fit(X)

In [None]:
# A bunch of metrics that compare target labels and labels as assigned by KM. 
print("Homogeneity: %0.3f" % metrics.homogeneity_score(dataset.target, km.labels_))
print("Completeness: %0.3f" % metrics.completeness_score(dataset.target, km.labels_))
print("V-measure: %0.3f" % metrics.v_measure_score(dataset.target, km.labels_))
print("Adjusted Rand-Index: %.3f"
      % metrics.adjusted_rand_score(dataset.target, km.labels_))
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, km.labels_, sample_size=1000))
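
To see what these scores reward, it helps to try them on hand-made labels. The two toy label lists below are assumptions for illustration: `renamed` has the same grouping as the truth but with swapped cluster IDs, and `lumped` puts everything into one cluster.

```python
from sklearn import metrics

true_labels = [0, 0, 1, 1]
renamed = [1, 1, 0, 0]   # same grouping, cluster IDs swapped
lumped = [0, 0, 0, 0]    # everything in a single cluster

# Cluster IDs are arbitrary, so a pure renaming still scores perfectly.
h_renamed = metrics.homogeneity_score(true_labels, renamed)
c_renamed = metrics.completeness_score(true_labels, renamed)
ari_renamed = metrics.adjusted_rand_score(true_labels, renamed)
print(h_renamed, c_renamed, ari_renamed)

# One big cluster: every class stays together (complete),
# but the cluster mixes classes (not homogeneous).
h_lumped = metrics.homogeneity_score(true_labels, lumped)
c_lumped = metrics.completeness_score(true_labels, lumped)
print(h_lumped, c_lumped)
```

V-measure combines homogeneity and completeness, so it penalizes both kinds of failure.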

In [None]:
# Top terms ("features") as ranked by centroids
print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names_out()   # get_feature_names() in scikit-learn < 1.0
for i in range(true_k):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()

In [None]:
km.labels_[:20]        # Cluster labels as assigned by KMeans
dataset.target[:20]    # These are the real target labels
dataset.target_names

### Round 2. Let's try 3 clusters this time. 

In [None]:
km2 = KMeans(n_clusters=3, init='k-means++', max_iter=100, n_init=1, verbose=True)
km2.fit(X)

In [None]:
print("Top terms per cluster:")
order_centroids = km2.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names_out()   # get_feature_names() in scikit-learn < 1.0
for i in range(3):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()
# Are the clusters looking better? 
# CAVEAT: K-means may have converged to a local optimum; re-run for a different result
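
The local-optimum caveat can be explored by running K-means from several random starts and comparing `inertia_` (the within-cluster sum of squared distances; lower is better). The toy data and the choice of 5 seeds are assumptions for illustration; the runs may or may not differ on a given data set.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data (an assumption for illustration).
pts, _ = make_blobs(n_samples=300, centers=5, random_state=1)

# One single-start run per seed; each may end in a different optimum.
inertias = {}
for seed in range(5):
    km_try = KMeans(n_clusters=5, init='random', n_init=1,
                    random_state=seed).fit(pts)
    inertias[seed] = km_try.inertia_

best_seed = min(inertias, key=inertias.get)
print(inertias)
print(best_seed)

# n_init=10 does the same thing internally: 10 starts, keep the best one.
km_multi = KMeans(n_clusters=5, init='random', n_init=10, random_state=0).fit(pts)
print(km_multi.inertia_)
```

In the cells above, `n_init=1` means a single start, which is why re-running can change the clusters.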

In [None]:
km2.labels_[:20]        # Cluster labels as assigned by KMeans
dataset.target[:20]     # These are the real target labels
dataset.target_names

In [None]:
# Newsgroup label -> KM label. Cluster IDs change from run to run,
# so adjust this mapping to match your own output.
labelmap = {0:0, 1:2, 2:1, 3:1}

target_conv = [labelmap[x] for x in dataset.target]
target_conv[:20]
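
Instead of adjusting the mapping by hand, one option is to derive it by majority vote: for each newsgroup label, pick the cluster that received most of its documents. The arrays below are hypothetical stand-ins for `dataset.target` and `km2.labels_`, and `auto_labelmap` is a name made up for this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical toy labels standing in for dataset.target and km2.labels_.
true_labels = np.array([0, 0, 1, 1, 2, 2, 3, 3, 3])
cluster_ids = np.array([2, 2, 0, 0, 1, 1, 1, 1, 0])

# Rows: true labels; columns: cluster IDs; cells: document counts.
table = pd.crosstab(true_labels, cluster_ids)
print(table)

# For each true label, take the cluster holding most of its documents.
auto_labelmap = {int(k): int(v) for k, v in table.idxmax(axis=1).items()}
print(auto_labelmap)
```

With fewer clusters than labels, two labels can map to the same cluster, as labels 2 and 3 do here.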

In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(target_conv, km2.labels_)
cm

In [None]:
sns.heatmap(cm.T, square=True, annot=True, fmt='d', cbar=False)
plt.xlabel('true group')
plt.ylabel('predicted group')
plt.show()

### Question: Can we produce nifty clustering visuals such as the ones in the tutorial and documentation?
- https://www.datascience.com/blog/k-means-clustering
- http://scikit-learn.org/stable/_images/sphx_glr_plot_cluster_iris_004.png

### Too Many Dimensions
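Each document is a 1000-dimensional TF-IDF vector, which cannot be plotted directly. This is where PCA (Principal Component Analysis) comes in: it projects the data down to 2 or 3 dimensions that can be drawn. 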
- Textbook chapter: https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-analysis.html
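
A rough sketch of the idea: cluster a tiny made-up corpus, project the TF-IDF vectors to 2 dimensions with PCA, and color the points by cluster. The six toy sentences are assumptions for illustration. The sketch converts the sparse matrix to a dense array with `.toarray()`, which is reasonable only for small data.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (an assumption for illustration): 3 space docs, 3 religion docs.
toy_docs = ["rocket launch orbit", "orbit satellite launch", "rocket satellite orbit",
            "church faith prayer", "faith prayer bible", "church bible faith"]

tfidf = TfidfVectorizer().fit_transform(toy_docs)
toy_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(tfidf)

# Project every document vector down to 2 principal components.
coords = PCA(n_components=2).fit_transform(tfidf.toarray())

fig, ax = plt.subplots()
ax.scatter(coords[:, 0], coords[:, 1], c=toy_labels)   # one dot per document
ax.set_xlabel('principal component 1')
ax.set_ylabel('principal component 2')
```

In a notebook the figure renders at the end of the cell; in a script, call `plt.show()`. The same steps should carry over to `X` and `km.labels_` from above. For larger sparse matrices, `sklearn.decomposition.TruncatedSVD` is an alternative that works without converting to a dense array.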

### Testing on new, made up examples

In [None]:
tests = ['sending a payload to the ISS', 'I met Santa Claus once']
preds = km2.predict(tests)
print(preds)
# ??? This raises an error: km2 was fit on TF-IDF vectors, not raw strings.

In [None]:
tests = ['sending a payload to the ISS', 'I met Santa Claus once']
tests_tfidf = vectorizer.transform(tests)    # Yes: new text must go through the fitted vectorizer first
preds = km2.predict(tests_tfidf)
print(preds)
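
One way to avoid forgetting the vectorizing step is to bundle the vectorizer and the clusterer into a single sklearn pipeline, which then accepts raw strings directly. This is a sketch on a made-up toy corpus, not the 20 Newsgroups data; `pipe` and `train_docs` are hypothetical names.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy training corpus (an assumption for illustration).
train_docs = ["the rocket reached orbit", "a satellite launch to orbit",
              "sending a rocket to the station", "the church held a service",
              "faith and prayer in church", "a prayer from the bible"]

# fit() vectorizes and clusters; predict() vectorizes new text, then assigns clusters.
pipe = make_pipeline(TfidfVectorizer(),
                     KMeans(n_clusters=2, n_init=10, random_state=0))
pipe.fit(train_docs)

new_docs = ['sending a payload to the ISS', 'I met Santa Claus once']
pipe_preds = pipe.predict(new_docs)
print(pipe_preds)
```

Words the vectorizer never saw during fitting (such as "Santa") are simply ignored, so predictions for out-of-topic text should be read with caution.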