# Lab 5. Natural Language Processing. Unsupervised Learning

In [0]:
# Some IPython magic
# Put these at the top of every notebook; here the inline backend is used for plots
%reload_ext autoreload
%autoreload 2
%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt

## Natural Language Processing.
Natural Language Processing (NLP) refers to the processing of text data. It covers a wide range of tasks, from very simple ones, such as searching for a pattern, to very complex ones, such as text summarization or automated translation.

### Feature Extraction
To apply Machine Learning algorithms to text data, we need a way to represent the text as a set of numeric attributes.

#### Bag of Words
The simplest way to represent a text document as a vector of numbers is to count the words and output a frequency count. Let's say we have a list of all English words, like the following:

In [0]:
# creates a wordlist, with all words, from "a" to "zygote"
import urllib.request as request
words = request.urlopen("https://svnweb.freebsd.org/csrg/share/dict/words?view=co")
wordlist = []
for w in words:
    wordlist.append(w.decode().strip())
print(', '.join(wordlist[:4]) + " ... " + ', '.join(wordlist[-4:]))

In [0]:
print(f"Now we can convert any text to a vector of size {len(wordlist)}")

For example, the text "In this lab we study Natural Language Processing and Unsupervised Learning" can be represented as a vector with almost all values equal to 0, and values of 1 in the position of the words "in", "this", etc.

This ___feature vector___ can be extracted directly from the dataset. If we have a large collection of text, we can assume that other documents will use the same vocabulary. For example, if we build a model for news articles, those articles will most likely not use every single word in the English language. So during the training phase, we can use the training set to define the feature vector: we keep only the words that appear in the training set. If new words appear during the test phase, we discard them. This is reasonable because the model learned nothing about those words during training, so unseen words cannot help with classification.
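The idea of building the vocabulary from the training set and discarding unseen words can be sketched in plain Python. This is only an illustration: the helper names `build_vocabulary` and `bag_of_words` and the toy sentences are assumptions, and splitting on whitespace is a much cruder tokenizer than a real one.

```python
def build_vocabulary(train_docs):
    """Map every word seen in the training documents to a column index."""
    words = sorted({w for doc in train_docs for w in doc.lower().split()})
    return {w: i for i, w in enumerate(words)}


def bag_of_words(doc, vocabulary):
    """Count vocabulary words in `doc`; words outside the vocabulary are dropped."""
    vec = [0] * len(vocabulary)
    for w in doc.lower().split():
        if w in vocabulary:
            vec[vocabulary[w]] += 1
    return vec


# Toy "training set" (hypothetical sentences, not taken from the dataset)
train_docs = ["the rocket reached orbit", "the pitcher threw the ball"]
vocab = build_vocabulary(train_docs)

# "hit" and "twice" never appeared during training, so they are ignored
test_vec = bag_of_words("the rocket hit the ball twice", vocab)
print(vocab)
print(test_vec)
```

The resulting vector has one entry per training-set word, no matter which words the test document contains.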

Let's start working with a dataset.

In [0]:
from sklearn.datasets import fetch_20newsgroups

# We select only 3 categories for now, feel free to change the categories
categories = [
    'rec.sport.baseball',
    'comp.graphics',
    'sci.space',
]
dataset = fetch_20newsgroups(subset='all', categories=categories, 
                              shuffle=True, random_state=42)

#dataset = fetch_20newsgroups(subset='all', shuffle=True, random_state=42) #if you want all categories


In [0]:
# here are the attributes of the object retrieved by fetch_20newsgroups
dir(dataset)

In [0]:
len(dataset.data)

In [0]:
X = dataset.data
y = dataset.target

In [0]:
print(dataset.data[0])

In [0]:
dataset.target_names

One way to turn a text document into a feature vector is to use a frequency count of
each word in the document. We build a large dictionary of words, and for each document
we return a vector with as many features as there are words in the dictionary. Each feature is
the number of times the corresponding word appears in the document (this is technically called
**term frequency**, or tf for short). Sklearn has a **[CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)** that does just that.
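As a small warm-up before the exercise, here is a sketch of `CountVectorizer` on a toy corpus. The two sentences and the variable names are assumptions chosen for illustration; the real exercise uses `dataset.data`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical sentences, not taken from 20 newsgroups)
toy_docs = [
    "the rocket reached orbit",
    "the pitcher threw the ball",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(toy_docs)  # sparse matrix: (n_documents, n_words)

vocab = vectorizer.vocabulary_               # dict: word -> column index
dense = counts.toarray()                     # fine for a toy corpus, costly for real data

print(counts.shape)
print(dense[1, vocab["the"]])                # how often "the" occurs in the second document

# Words that were not seen during fit are silently ignored by transform
unseen = vectorizer.transform(["comet"]).toarray()
print(unseen.sum())
```

The result is a sparse matrix because most documents contain only a small fraction of the vocabulary.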

In [0]:
# TODO: Transform the dataset into numerical feature vectors
from sklearn.feature_extraction.text import CountVectorizer

One problem with this representation is the high frequency of common words like "the",
"to", or "and". Those words appear in almost all documents, so they don't offer much information.
A better way to extract features from text is to combine the **term frequency** metric with the **inverse document frequency** metric, which down-weights words that occur in many documents. Sklearn has a **[TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)** that does just that.
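The effect of inverse document frequency can be seen on a toy corpus. This is a sketch under assumptions: the three sentences are made up, and the vectorizer is left at its default settings (smoothed idf and L2-normalized rows).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical sentences); "the" occurs in every document
toy_docs = [
    "the rocket reached orbit",
    "the pitcher threw the ball",
    "the image was rendered",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(toy_docs)

vocab = vectorizer.vocabulary_
idf_the = vectorizer.idf_[vocab["the"]]        # word present in all documents
idf_rocket = vectorizer.idf_[vocab["rocket"]]  # word present in one document

print(idf_the, idf_rocket)

# With the default settings every document vector has unit Euclidean length
row_norms = np.linalg.norm(tfidf.toarray(), axis=1)
print(row_norms)
```

A word that appears everywhere receives the smallest idf weight, so it contributes less to the document vector than a rare word with the same count.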

In [0]:
# TODO: Transform the dataset into tf-idf feature vectors
from sklearn.feature_extraction.text import TfidfVectorizer


Now you will need to use the vectorized dataset to perform clustering.

You will need to use the following algorithms: [DBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html), [KMeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.k_means.html#sklearn.cluster.k_means), [AgglomerativeClustering](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html#sklearn.cluster.AgglomerativeClustering).

For each algorithm, try to find the parameters that produce clusters matching the true newsgroup labels (`dataset.target`) as closely as possible.
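All three estimators share the same `fit_predict` interface, which the following sketch shows on two well-separated toy blobs. The data, the variable names, and the parameter values (`eps`, `min_samples`, `n_clusters`) are assumptions picked for this toy case; they will not transfer to the text vectors as-is.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans, AgglomerativeClustering

# Two tight 2-D blobs as stand-in data (not the newsgroup vectors)
rng = np.random.RandomState(0)
blob_a = rng.normal(loc=0.0, scale=0.2, size=(20, 2))
blob_b = rng.normal(loc=5.0, scale=0.2, size=(20, 2))
points = np.vstack([blob_a, blob_b])

# KMeans and AgglomerativeClustering need the number of clusters up front
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
agglo_labels = AgglomerativeClustering(n_clusters=2).fit_predict(points)

# DBSCAN infers the number of clusters from density; label -1 marks noise points
dbscan_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(points)

print(kmeans_labels)
print(agglo_labels)
print(dbscan_labels)
```

Note that the cluster ids are arbitrary: two algorithms can find the same partition while numbering the clusters differently.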

Use different metrics to evaluate the algorithms: [Adjusted Rand Score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html), [Silhouette Score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html), [Homogeneity Score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.homogeneity_score.html), [Completeness Score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.completeness_score.html#sklearn.metrics.completeness_score).
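The metrics fall into two groups: the adjusted Rand, homogeneity and completeness scores compare predicted labels against ground-truth labels, while the silhouette score looks only at the feature vectors and the predicted labels. A minimal sketch on made-up labels and points (all values here are assumptions for illustration):

```python
import numpy as np
from sklearn.metrics import (adjusted_rand_score, completeness_score,
                             homogeneity_score, silhouette_score)

true_labels = [0, 0, 0, 1, 1, 1]
pred_labels = [1, 1, 1, 0, 0, 0]  # same partition, cluster ids swapped

# Label-based scores are invariant to how the clusters are numbered
ari = adjusted_rand_score(true_labels, pred_labels)
hom = homogeneity_score(true_labels, pred_labels)
com = completeness_score(true_labels, pred_labels)

# The silhouette score needs the data itself, not the true labels
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
sil = silhouette_score(points, pred_labels)

print(ari, hom, com, sil)
```

Because the silhouette score needs no ground truth, it is the only one of the four that can be used when the true classes are unknown.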






In [0]:
# TODO : Use the following algorithms to perform clustering on the dataset  
from sklearn.cluster import DBSCAN, KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score, adjusted_rand_score, homogeneity_score, completeness_score


As you can see, high-dimensional sparse vectors do not produce the best clusters.
Now, try to improve your results by using [PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) to reduce the dimensionality of your feature vectors before applying the clustering algorithms.
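A minimal sketch of the PCA call on random dense stand-in data follows; `dense_features` and the choice of 5 components are assumptions. Keep in mind that the vectorizers return sparse matrices: depending on your scikit-learn version, `PCA` may require a dense array (for example via `.toarray()`), and `TruncatedSVD` is an alternative that accepts sparse input directly.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random dense matrix standing in for the (densified) document vectors
rng = np.random.RandomState(0)
dense_features = rng.rand(50, 20)  # 50 "documents", 20 features

pca = PCA(n_components=5, random_state=0)
reduced = pca.fit_transform(dense_features)

# Fraction of the total variance kept by the 5 components
explained = pca.explained_variance_ratio_.sum()

print(reduced.shape)
print(explained)
```

The explained variance ratio gives a rough indication of how much information is lost for a given number of components.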





In [0]:
from sklearn.decomposition import PCA

Use t-SNE to produce a low-dimensional embedding of the dataset (and plot it).
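The call pattern is sketched below on toy data. The two Gaussian blobs, the variable names, and `perplexity=10` are assumptions; on the real dataset you will likely want to reduce the dimensionality first and experiment with the perplexity.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Two 10-dimensional blobs as stand-in data (not the newsgroup vectors)
rng = np.random.RandomState(0)
toy_features = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(30, 10)),
    rng.normal(loc=6.0, scale=1.0, size=(30, 10)),
])
toy_labels = np.array([0] * 30 + [1] * 30)

# perplexity must be smaller than the number of samples
tsne = TSNE(n_components=2, perplexity=10, random_state=0)
embedding = tsne.fit_transform(toy_features)

# Color each point by its (known) label to see whether the groups separate
fig, ax = plt.subplots()
ax.scatter(embedding[:, 0], embedding[:, 1], c=toy_labels)
ax.set_title("t-SNE embedding of toy data")
```

t-SNE has no `transform` method for new points, so it is used here only for visualization, not as a preprocessing step.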

In [0]:
from sklearn.manifold import TSNE

As an extra exercise, try to implement kernel KMeans (see slide 70 of the KMeans course).
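For reference, this is the standard kernel k-means distance written out; the notation is an assumption and may differ from the one used on the slide. With a kernel $K$, a feature map $\phi$, and cluster $C_j$ with mean $\mu_j$ in feature space, the squared distance of a point $x_i$ to the cluster mean is:

$$
\lVert \phi(x_i) - \mu_j \rVert^2 = K(x_i, x_i) - \frac{2}{|C_j|} \sum_{x_l \in C_j} K(x_i, x_l) + \frac{1}{|C_j|^2} \sum_{x_l \in C_j} \sum_{x_m \in C_j} K(x_l, x_m)
$$

The term $K(x_i, x_i)$ does not depend on $j$ (for the RBF kernel it is always 1), so it can be dropped when taking the argmin over clusters. That would leave two terms per cluster, which plausibly correspond to the two terms in the skeleton below.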

In [0]:
from sklearn.datasets import make_circles

X, _ = make_circles(n_samples=100, noise=0.1, factor=0)

In [0]:
from sklearn.metrics.pairwise import rbf_kernel


k = 2
# TODO : assign points to random clusters
y = ...
dist = np.zeros((X.shape[0], k))

# TODO
max_iter = 10
for _ in range(max_iter):
    for j in range(k):
        # TODO : get the points that are in cluster j
        X_j = ...

        # TODO : compute the first term
        first_term = ...

        # TODO : compute the second term
        second_term = ...

        dist[:, j] = first_term + second_term

    # TODO : change the clusters
    y = np.argmin(...)

   

In [0]:
plt.scatter(X[:,0], X[:,1], c=y)