___
<h1> Machine Learning </h1>
<h2> M. Sc. in Electrical and Computer Engineering </h2>
<h3> Instituto Superior de Engenharia / Universidade do Algarve </h3>

[MEEC](https://ise.ualg.pt/en/curso/1477) / [ISE](https://ise.ualg.pt) / [UAlg](https://www.ualg.pt)

Pedro J. S. Cardoso (pcardoso@ualg.pt)
___

In [None]:
from pprint import pprint

from sklearn.datasets import fetch_20newsgroups, fetch_20newsgroups_vectorized
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import nltk.stem

# Clustering

In this notebook we'll be doing some clustering over text. So we'll start by seeing how to convert text into something easier to cluster... vectors!

## Bags of Words

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval. 

In this model, a text (such as a sentence or a document) is represented as the **bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity** (Wikipedia, 2020).
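
As a minimal sketch of this definition, the bag of a single sentence can be built with `collections.Counter`. The helper `bag_of_words` is hypothetical, and its tokenization rule (runs of two or more word characters, lowercased) is an assumption chosen to resemble the default token pattern of `CountVectorizer`.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return the multiset (bag) of lowercase word tokens found in `text`."""
    # Assumption: a token is a run of two or more word characters.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return Counter(tokens)


bag = bag_of_words("To seek out new life and new civilizations. To boldly go!")
print(bag)

# Word order is disregarded, multiplicity is kept.
print(bag_of_words("new life new") == bag_of_words("new new life"))
```

Two texts with the same words in a different order yield the same bag, which is exactly the information the model discards.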

In [None]:
list_of_sentences = [
    """Space: the final frontier. These are the voyages of the starship Enterprise.
        Its five-year mission: to explore strange new worlds. To seek out new life and new civilizations. To boldly go where no man has gone before!""",
    "Help me, Obi-Wan Kenobi. You’re my only hope.",
    "I find your lack of faith disturbing",
    """It’s the ship that made the Kessel run in less than twelve parsecs. I’ve outrun Imperial starships. Not the local bulk cruisers, mind you. 
        I’m talking about the big Corellian ships, now. She’s fast enough for you, old man""",
    "The Force will be with you. Always",
    "Never tell me the odds",
    "No. I am your father"
]
                    

## CountVectorizer

Let us convert a collection of text documents to a **(sparse) matrix of token counts**.

If you do not provide an a-priori dictionary and you do not use an analyzer that does some kind of feature selection then the **number of features will be equal to the vocabulary size found by analyzing the data**.

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

In [None]:
vectorized = CountVectorizer(
    strip_accents="unicode", 
    lowercase=True, 
    stop_words='english'
)

# Learn a vocabulary dictionary of all tokens in the raw documents.
vectorized.fit(list_of_sentences)
vectorized

And then do the transformation

In [None]:
mx = vectorized.transform(list_of_sentences)
mx.todense()

As an alternative, we can learn the vocabulary dictionary and return the document-term matrix in a single step.

In [None]:
mx = vectorized.fit_transform(list_of_sentences)
mx.todense()

and the feature names are 

In [None]:
print(vectorized.get_feature_names_out())

Now it is possible to transform new documents into a document-term matrix using the learned vocabulary.

In [None]:
to_be_transformed = [
    """Frontier is an outer limit in a field of endeavor, especially one in which the opportunities for research and development have not been exploited: 
    the frontiers of space exploration.""", 
    """Space: the final frontier. These are the voyages of the starship Enterprise. Its continuing mission: to explore strange new worlds. 
    To seek out new life and new civilizations. To boldly go where no one has gone before!"""
]
vectorized.transform(to_be_transformed).todense()

There are still some problems: what about plurals, word variations, verb tenses, ...? Each variant is counted as a different token.

In [None]:
mx = vectorized.transform(["Space frontier civilizations worlds",  
                      "spaces frontiers civilization world"]).toarray()
mx

In [None]:
mx.sum(axis=1)

## TfidfVectorizer
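- A variation is `TfidfVectorizer`, which converts a collection of raw documents to a matrix of tf-idf features. It is equivalent to `CountVectorizer` followed by `TfidfTransformer`, which transforms a count matrix to a normalized tf or tf-idf representation.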

- Note that **_Tf_ means term-frequency** while **_tf-idf_ means term-frequency times inverse document-frequency**. 

- The goal of using tf-idf instead of the raw frequencies of occurrence of a token in a given document is to **scale down the impact of tokens that occur very frequently in a given corpus** and that are hence empirically **less informative than features that occur in a small fraction of the training corpus**.

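$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t)$$
where
$$\text{idf}(t) = \log\left(\frac{1 + n}{1 + \text{df}(t)}\right) + 1$$
and
$$\text{tf}(t, d) = \text{number of times term } t \text{ appears in document } d$$

Here $n$ is the total number of documents and $\text{df}(t)$ is the number of documents that contain term $t$.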

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer
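
To make the formula above concrete, the following sketch computes the tf-idf weights by hand on a toy corpus. The corpus and the helper names (`smoothed_idf`, `tfidf`) are hypothetical. The sketch stops at the raw product, whereas `TfidfVectorizer` by default also L2-normalizes each row, so the values it returns will differ by that scaling.

```python
import math


def smoothed_idf(n_docs, doc_freq):
    """idf(t) = log((1 + n) / (1 + df(t))) + 1, using the natural logarithm."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


# Toy corpus (hypothetical): three documents that are already tokenized.
docs = [
    ["space", "frontier", "space"],
    ["space", "ship"],
    ["ship", "run"],
]
n = len(docs)


def tfidf(term, doc):
    """Raw (unnormalized) tf-idf of `term` in `doc` with respect to `docs`."""
    tf = doc.count(term)               # term frequency in this document
    df = sum(term in d for d in docs)  # number of documents containing the term
    return tf * smoothed_idf(n, df)


for term in ("space", "frontier", "run"):
    print(term, tfidf(term, docs[0]))
```

In the first document, "space" appears twice but is also present in two of the three documents, while "frontier" appears once and only in this document, so each occurrence of "frontier" carries a larger weight than each occurrence of "space".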

In [None]:
vectorized = TfidfVectorizer(
    strip_accents="ascii", 
    lowercase=True, 
    stop_words='english'
)
vectorized.fit(list_of_sentences)

In [None]:
print("The transformation of ", to_be_transformed)
vectorized.transform(to_be_transformed).todense()

## Stemmer
The Snowball stemmer allows us to create a more thoughtful bag-of-words by **removing morphological affixes from words**, leaving only the word stem.

https://www.nltk.org/howto/stem.html

In [None]:
english_stemmer = nltk.stem.SnowballStemmer('english')

For example

In [None]:
list(map(english_stemmer.stem, ['civilization', 'civilizations']))

In [None]:
list(map(english_stemmer.stem, ['jump', 'jumping', 'jumps', 'jumped']))

In [None]:
list(map(english_stemmer.stem, ['ship', 'ships', 'shipping', 'shipped']))

To use the `TfidfVectorizer` we can extend it and, to apply the snowball stemmer, we redefine its `build_analyzer` (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer.build_analyzer), as follows

In [None]:
english_stemmer = nltk.stem.SnowballStemmer('english')

class StemmedTfidfVectorizer(TfidfVectorizer):
    # overriding the build_analyzer
    def build_analyzer(self):
        '''Return a callable that handles preprocessing, tokenization and n-grams generation.'''
        analyzer = super().build_analyzer()
        return lambda doc: ([english_stemmer.stem(w) for w in analyzer(doc)])


steemed_vectorizer = StemmedTfidfVectorizer(stop_words='english', 
                                            decode_error ='ignore', 
                                            encoding='utf-8'
                                           )
X = steemed_vectorizer.fit_transform(list_of_sentences)

So, given the list of sentences

In [None]:
list_of_sentences = [
    """Space: the final frontier. These are the voyages of the starship Enterprise.
        Its five-year mission: to explore strange new worlds. To seek out new life and new civilizations. To boldly go where no man has gone before!""",
    "Help me, Obi-Wan Kenobi. You’re my only hope.",
    "I find your lack of faith disturbing",
    """It’s the ship that made the Kessel run in less than twelve parsecs. I’ve outrun Imperial starships. Not the local bulk cruisers, mind you. 
        I’m talking about the big Corellian ships, now. She’s fast enough for you, old man""",
    "The Force will be with you. Always",
    "Never tell me the odds",
    "No. I am your father"
]

The corresponding features are

In [None]:
print(steemed_vectorizer.get_feature_names_out())

Recall that the previous list of feature names (without stemming) was the following

In [None]:
print(vectorized.get_feature_names_out())

And the tf-idf-weighted document-term matrix (associated with the learned vocabulary) is a sparse matrix with `numpy.float64` elements stored in Compressed Sparse Row format

In [None]:
X.todense()

and now it is possible to apply the vectorizer to new strings

In [None]:
print("The transformation of ", to_be_transformed, "is ")
steemed_vectorizer.transform(to_be_transformed).toarray()

# Newsgroup Clustering
## Prepare newsgroups' data

Let us start by getting some posts from the 20 newsgroups text dataset. 

See https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html

In [None]:
newsgroups_train = fetch_20newsgroups(
    subset='train', 
    remove=("headers", "footers", "quotes")
)

In [None]:
print(newsgroups_train.DESCR)

In [None]:
newsgroups_train.target_names

This includes more than 11K posts, which could be used for classification (predicting a post's group)

In [None]:
len(newsgroups_train.data)

Let us see some examples

In [None]:
newsgroups_train.data[:5]

Let us implement a vectorizer that applies stemming

In [None]:
english_stemmer = nltk.stem.SnowballStemmer('english')

class StemmedTfidfVectorizer(TfidfVectorizer):
    # overriding the build_analyzer
    def build_analyzer(self):
        analyzer = super().build_analyzer()
        return lambda doc: ([english_stemmer.stem(w) for w in analyzer(doc)])


steemed_vectorizer = StemmedTfidfVectorizer(
    stop_words='english', 
    lowercase=True,
    ngram_range=(1, 2), # The lower and upper boundary of the range of n-values for different n-grams to be extracted.
    min_df=10, # When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold (here an absolute count of documents).
    max_df=.3, # When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (here a proportion of documents).
    decode_error ='ignore',  # Instruction on what to do if a byte sequence is given to analyze that contains characters not of the given encoding.
    encoding='utf-8',
    norm='l2'
)

X = steemed_vectorizer.fit_transform(newsgroups_train.data)
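
The effect of the `min_df` and `max_df` thresholds used above can be sketched in plain Python. The function `filter_by_document_frequency` and the toy documents are hypothetical. The sketch assumes, as in the cell above, that `min_df` is given as an absolute count and `max_df` as a proportion of documents.

```python
def filter_by_document_frequency(tokenized_docs, min_df=1, max_df=1.0):
    """Return the sorted terms whose document frequency is within the thresholds.

    Assumption: min_df is an absolute count (int) and max_df is a
    proportion of the number of documents (float).
    """
    n_docs = len(tokenized_docs)
    doc_freq = {}
    for doc in tokenized_docs:
        # Count each term once per document, however often it occurs in it.
        for term in set(doc):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    max_count = max_df * n_docs
    return sorted(t for t, df in doc_freq.items() if min_df <= df <= max_count)


docs = [
    ["disk", "drive", "problem"],
    ["disk", "format", "problem"],
    ["disk", "boot"],
    ["disk", "drive"],
]
print(filter_by_document_frequency(docs, min_df=2, max_df=0.75))
```

Here "disk" is dropped because it appears in every document (too frequent to discriminate), while "format" and "boot" are dropped because they appear in a single document (too rare to generalize).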

The bag contains more than 11k features. Some good, others...

In [None]:
print(len(steemed_vectorizer.get_feature_names_out()))

In [None]:
print(steemed_vectorizer.get_feature_names_out()[:100])

In [None]:
print(steemed_vectorizer.get_feature_names_out()[900:1000])

For the first post, the values in the document-term matrix are

In [None]:
print("For the first post:\n", newsgroups_train.data[0])
print(X[0,:])

We can get the columns for which the first post (row 0) has nonzero values

In [None]:
rows, cols = X[0,:].nonzero()
cols

and now see the terms

In [None]:
' '.join((steemed_vectorizer.get_feature_names_out()[c] for c in cols))    

## Clustering over newsgroups' data - KMeans

The newsgroup dataset is labeled and could be used for classification, but here we are going to create some clusters. To do so, we are going to use KMeans with 20 clusters (as many as the groups in the dataset!)

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

In [None]:
num_clusters = 20
from sklearn.cluster import KMeans

km = KMeans(
    n_clusters=num_clusters, 
    init='k-means++', 
    n_init=20, 
    verbose=1)

km.fit(X)

The post labels are stored in the `labels_` attribute

In [None]:
km.labels_

These values are not directly comparable with the `newsgroups_train` target values, as the cluster labels (0-19) are assigned arbitrarily. 
So, to make a comparison, it would be necessary to map the cluster labels to the target labels (in an optimal way!)

In [None]:
newsgroups_train.target
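
One possible way to build such a mapping, shown here only as a sketch, is to count how many samples each (cluster, class) pair shares and then solve the resulting assignment problem with `scipy.optimize.linear_sum_assignment`. The helper `map_clusters_to_classes` and the toy labels are hypothetical, and the sketch assumes that clusters and classes are both numbered from 0.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def map_clusters_to_classes(cluster_labels, true_labels):
    """Return a dict {cluster: class} that maximizes the number of agreeing samples."""
    cluster_labels = np.asarray(cluster_labels)
    true_labels = np.asarray(true_labels)
    n = int(max(cluster_labels.max(), true_labels.max())) + 1
    # contingency[c, k] = number of samples in cluster c whose true class is k
    contingency = np.zeros((n, n), dtype=int)
    for c, k in zip(cluster_labels, true_labels):
        contingency[c, k] += 1
    # linear_sum_assignment minimizes a cost, so negate to maximize agreement.
    rows, cols = linear_sum_assignment(-contingency)
    return {int(c): int(k) for c, k in zip(rows, cols)}


# Toy example (hypothetical labels): three clusters, three classes.
clusters = [2, 2, 0, 0, 1, 1, 1]
classes = [0, 0, 1, 1, 2, 2, 0]
mapping = map_clusters_to_classes(clusters, classes)
print(mapping)
```

On the notebook's data the same helper could be called as `map_clusters_to_classes(km.labels_, newsgroups_train.target)`, after which the mapped cluster labels can be compared with the targets. A one-to-one mapping is itself a modeling choice, since nothing forces each cluster to correspond to exactly one newsgroup.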

### Inference

If we have a new post, we first transform it with the fitted vectorizer

In [None]:
post = '''Disk drive problems.
 Hi, I have a problem with my hard disk. 
 After 1 year it is working only sporadically now. 
 I tried to format it, but now it doesn't boot any more. Any ideas? Thanks.'''

post_vec = steemed_vectorizer.transform([post])
print(post_vec)

and then we send it to the KMeans instance, getting a cluster label 

In [None]:
post_label = km.predict(post_vec)
post_label
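
Conceptually, `predict` assigns a sample to the cluster whose center is closest to it. The sketch below illustrates this idea in plain Python on two-dimensional points. The function `nearest_centroid` and the centroids are hypothetical, and the code does not reproduce scikit-learn's actual implementation.

```python
import math


def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))


# Hypothetical cluster centers in a 2-dimensional feature space.
centroids = [(0.0, 0.0), (1.0, 1.0), (0.0, 2.0)]

print(nearest_centroid((0.9, 0.8), centroids))
print(nearest_centroid((0.1, 1.9), centroids))
```

In the notebook, the fitted centers are available in `km.cluster_centers_`, one row per cluster, in the same feature space as the rows of `X`.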

Let us get the posts with the same label

In [None]:
similar_indices = (km.labels_== post_label[0]).nonzero()[0]
similar_indices

and compute which ones are the most similar

In [None]:
import numpy as np

similar = []
for i in similar_indices:
    dist = np.linalg.norm((post_vec - X[i]).toarray())
    similar.append((dist, newsgroups_train.data[i], i))
    
similar = sorted(similar)

And the most "similar" to 
    
    `Disk drive problems.
     Hi, I have a problem with my hard disk. 
     After 1 year it is working only sporadically now. 
     I tried to format it, but now it doesn't boot any more. Any ideas? Thanks.`
 
is....

In [None]:
pprint(similar[0])


 
and the least similar inside the cluster is...

In [None]:
pprint(similar[-1])
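
A note on the distance used for this ranking: since the vectorizer was created with `norm='l2'`, the rows of `X` and the transformed post have unit length (unless they are all zeros), and for unit vectors the squared Euclidean distance equals $2 - 2\cos(a, b)$. Sorting by Euclidean distance therefore gives the same order as sorting by decreasing cosine similarity. The sketch below checks this identity on two hypothetical vectors.

```python
import math


def l2_normalize(v):
    """Scale `v` so that its Euclidean norm is 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


# Two hypothetical feature vectors, normalized to unit length.
a = l2_normalize([3.0, 0.0, 4.0])
b = l2_normalize([1.0, 2.0, 2.0])

cosine = sum(x * y for x, y in zip(a, b))
squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(squared_distance, 2 - 2 * cosine)
```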