# Week 10 - Unsupervised Learning

## Drill

## What is Unsupervised Learning?

Last week we looked at supervised learning, a type of machine learning in which a target variable exists. This makes it very useful for prescribing practical solutions. The other main category of machine learning is unsupervised learning. As the name suggests, there is no target variable to predict. Instead, unsupervised learning looks for patterns within the data. 

These are the main applications of unsupervised learning: 
* clustering
* association rule learning

### Application: Topic Modelling using Latent Dirichlet Allocation (LDA)

To close this week, let us apply unsupervised learning to grouping documents by topic. We use Latent Dirichlet Allocation (LDA) to perform topic modelling. The aim of this exercise is to identify the topics of each document or string and compare them with those of the other documents. 

First, let us read the sample data, which contains 9 conversations or paragraphs. 

In [1]:
text_data = ["25 years old, I said. ", 
             "It was a rat's nest. Not a literal one, but that is what her hair seemed to resemble every morning when she got up. It was going to take at least an hour to get it under control and she was sick and tired of it. She peered into the mirror and wondered if it was worth it. It wasn't. She opened the drawer and picked up the hair clippers.", 
             "She sat in the darkened room waiting. It was now a standoff. He had the power to put her in the room, but not the power to make her repent. It wasn't fair and no matter how long she had to endure the darkness, she wouldn't change her attitude. At three years old, Sandy's stubborn personality had already bloomed into full view.", 
             "Pink ponies and purple giraffes roamed the field. Cotton candy grew from the ground as a chocolate river meandered off to the side. What looked like stones in the pasture were actually rock candy. Everything in her dream seemed to be perfect except for the fact that she had no mouth.", 
             "It's not his fault. I know you're going to want to, but you can't blame him. He really has no idea how it happened. I kept trying to come up with excuses I could say to mom that would keep her calm when she found out what happened, but the more I tried, the more I could see none of them would work. He was going to get her wrath and there was nothing I could say to prevent it.", 
             "There was something in the tree. It was difficult to tell from the ground, but Rachael could see movement. She squinted her eyes and peered in the direction of the movement, trying to decipher exactly what she had spied. The more she peered, however, the more she thought it might be a figment of her imagination. Nothing seemed to move until the moment she began to take her eyes off the tree. Then in the corner of her eye, she would see the movement again and begin the process of staring again.",
             "It was going to rain. The weather forecast didn't say that, but the steel plate in his hip did. He had learned over the years to trust his hip over the weatherman. It was going to rain, so he better get outside and prepare. He heard the crack echo in the late afternoon about a mile away. His heart started racing and he bolted into a full sprint. \"It wasn't a gunshot, it wasn't a gunshot,\" he repeated under his breathlessness as he continued to sprint.",
             "She wondered if the note had reached him. She scolded herself for not handing it to him in person. She trusted her friend, but so much could happen. She waited impatiently for word.",
             "Sitting in the sun, away from everyone who had done him harm in the past, he quietly listened to those who roamed by. He felt at peace in the moment, hoping it would last, but knowing the reprieve would soon come to an end. He closed his eyes, the sun beating down on face and he smiled. He smiled for the first time in as long as he could remember."
            ]

In this exercise we will look into natural language processing (NLP), a set of techniques for analysing the semantics of text. It normally involves the following steps: 
1. Tokenisation - Splitting text into individual tokens (words or characters) for analysis. 
2. Preprocessing 
    * Stop word removal
    * Stemming/ Lemmatisation
    * n-grams
3. Generate the bag-of-words/ term-document matrix
4. Train the LDA model

The first thing to do is to convert the text into tokens for our analysis. In this exercise we use `nltk` to perform NLP, and it offers several tokenisers. For example, 
* `word_tokenize()` splits a string into words and punctuation marks. 
* `sent_tokenize()` splits a text into sentences. 
* `RegexpTokenizer` splits a string into tokens that match a [regex](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions/Cheatsheet).

In this tutorial we use the last one to tokenise the strings. 
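
To get a feel for what a `\w+` pattern keeps and drops, here is a minimal sketch that uses Python's built-in `re` module as a stand-in for `RegexpTokenizer`. This is an assumption made so the snippet runs without `nltk`; for this simple pattern the two should produce the same tokens. The sample text is taken from the data above.

```python
import re

sample = "It was a rat's nest. Not a literal one."

# \w+ matches runs of letters, digits and underscores,
# so punctuation and spaces act as separators.
tokens = re.findall(r"\w+", sample.lower())
print(tokens)
```

Note that the apostrophe splits `rat's` into `rat` and `s`, which is one reason single-character tokens are worth removing later.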

In [None]:
'''Run me
'''
# Tokenise the documents.
from nltk.tokenize import RegexpTokenizer

docs = []

# Split the documents into tokens.
tokenizer = RegexpTokenizer(r'\w+')
for idx in range(len(text_data)):
    text_data[idx] = text_data[idx].lower()  # Convert to lowercase.
    docs.append(tokenizer.tokenize(text_data[idx]))  # Split into words.

# Remove numbers, but not words that contain numbers.
docs = [[x for x in token if not x.isnumeric()] for token in docs]

The first few lines above tokenise the strings and store the result in a new list called `docs`. This is a nested list: each inner list holds the words of one document. After that, we clean the list by removing the tokens that are purely numeric. 

__Exercise:__ Does `.isnumeric()` return `True` if the string contains both alphabetic characters and numbers?

__Solution:__ No. To detect tokens that contain a digit anywhere, you need to check each character of the token, for example with a loop. 
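
The difference can be checked with a short sketch. The token list here is made up for illustration (`"25th"` does not appear in the sample data), and using `any()` is only one possible way to write the character loop.

```python
tokens = ["25", "years", "25th", "old"]

# str.isnumeric() is True only when every character is numeric.
purely_numeric = [t for t in tokens if t.isnumeric()]

# any() walks through the characters and stops at the first digit found.
contains_digit = [t for t in tokens if any(ch.isdigit() for ch in t)]

print(purely_numeric)
print(contains_digit)
```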

__Exercise:__ Write code so that the tokenised list contains no words of one character or fewer. 

In [None]:
# Remove words that are only one character.
docs = [[x for x in token if len(x) > 1] for token in docs]

In [None]:
# Your code below
docs = 

We have made our first (or second, if you have done the exercise) attempt at cleaning the data. Now let us move on to pre-processing and clean the data further. First, let us lemmatise the tokens. Lemmatisation groups the different forms of a word under one base form (e.g. "playing", "played" and "plays" are all reduced to "play"). 

A related NLP task is stemming, which also reduces words to a base form. The difference is that lemmatisation takes the part of speech of the word into account when deciding how words should be grouped, whereas stemming simply strips word endings. 
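
The contrast can be illustrated with a deliberately tiny toy. Everything in this sketch (`toy_stem`, `toy_lemmatise`, the suffix list and the mini-dictionary `TOY_LEMMAS`) is hypothetical and written only for illustration; real stemmers and lemmatisers rely on far richer rules and dictionaries.

```python
# Toy illustration only, not how nltk implements either task.
SUFFIXES = ("ing", "ed", "s")

def toy_stem(word):
    """Strip the first matching suffix, ignoring the word's role in the sentence."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary keyed by (word, part of speech).
TOY_LEMMAS = {
    ("meeting", "verb"): "meet",
    ("meeting", "noun"): "meeting",
    ("played", "verb"): "play",
    ("plays", "verb"): "play",
}

def toy_lemmatise(word, pos):
    """Look the word up together with its part of speech; fall back to the word itself."""
    return TOY_LEMMAS.get((word, pos), word)

print(toy_stem("meeting"))               # always chops the ending
print(toy_lemmatise("meeting", "noun"))  # "a meeting" stays as it is
print(toy_lemmatise("meeting", "verb"))  # "we are meeting" maps to the verb's base form
```

The stemmer treats both uses of "meeting" the same way, while the lemmatiser gives different answers depending on the part of speech.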

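The following code lemmatises the tokenised strings. Note that `WordNetLemmatizer.lemmatize()` treats every token as a noun unless a `pos` argument is passed, so verb forms such as "played" are left unchanged here. 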

In [None]:
'''Run me
'''
'''
If seen: 

    LookupError: 
    **********************************************************************
      Resource wordnet not found.
      Please use the NLTK Downloader to obtain the resource:

      >>> import nltk
      >>> nltk.download('wordnet')
Then use the snippet to download the wordnet. 

'''
# import nltk
# nltk.download('wordnet')
# Lemmatize the documents.
from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
docs = [[lemmatizer.lemmatize(token) for token in doc] for doc in docs]

In `nltk`, the lemmatiser requires the `wordnet` resource. If you see the following: 
```bash
LookupError: 
    **********************************************************************
      Resource wordnet not found.
      Please use the NLTK Downloader to obtain the resource:

      >>> import nltk
      >>> nltk.download('wordnet')
```
then uncomment the two lines in the code snippet above (`import nltk` and `nltk.download('wordnet')`) and run it again. 

Another pre-processing step is to remove the stop words. Remember from week 7 that these are words that do not convey significant meaning. `nltk` provides a set of English stop words, which you can load with
```python
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
```

__Exercise:__ Remove the stop words from `docs`. 

If you see: 
```bash
LookupError: 
    **********************************************************************
      Resource stopwords not found.
      Please use the NLTK Downloader to obtain the resource:

      >>> import nltk
      >>> nltk.download('stopwords')
```
then uncomment the two download lines in the code below. Try the exercise yourself first, or search for how it could be done. 

In [None]:
'''
If seen: 
    LookupError: 
    **********************************************************************
      Resource stopwords not found.
      Please use the NLTK Downloader to obtain the resource:

      >>> import nltk
      >>> nltk.download('stopwords')
Then use the snippet to download the stopwords library. 
'''
# import nltk
# nltk.download('stopwords')
# Remove stopping words
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
docs = [[x for x in token if x not in stop_words] for token in docs]

In [None]:
# Your code below
# import nltk
# nltk.download('stopwords')
# Remove stopping words
from nltk.corpus import stopwords

stop_words = ???
docs = 

In [None]:
from gensim.corpora import Dictionary

dictionary = Dictionary(docs)

In reality you may need to analyse a large number of documents. Words that appear in very few documents, or in most of them, carry little information about topics and can distort the model. So we filter out words whose document frequency falls outside a given range. For example, to filter out words that appear in fewer than 20 documents or in more than 50% of the documents, we can write:
```python
dictionary.filter_extremes(no_below=20, no_above=0.5)
```
In this exercise we do not have to do that, as there are too few strings in `docs`. 
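
The idea behind this kind of filtering can be sketched with the standard library alone. The function `filter_by_document_frequency` below is a hypothetical helper written for illustration. It is a simplified stand-in for the idea, not the `gensim` implementation, and the small `toy_docs` list is made up.

```python
from collections import Counter

def filter_by_document_frequency(docs, no_below, no_above):
    """Keep tokens found in at least `no_below` documents and
    in at most a fraction `no_above` of all documents."""
    # Count in how many documents each token appears
    # (set() ignores repeats within one document).
    doc_freq = Counter(token for doc in docs for token in set(doc))
    n_docs = len(docs)
    keep = {t for t, df in doc_freq.items()
            if df >= no_below and df / n_docs <= no_above}
    return [[t for t in doc if t in keep] for doc in docs]

toy_docs = [
    ["rain", "hip", "sprint"],
    ["rain", "tree", "movement"],
    ["rain", "tree", "sun"],
    ["rain", "sun", "smile"],
]
filtered = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.5)
print(filtered)
```

Here "rain" is dropped because it appears in every document, and words such as "hip" are dropped because they appear in only one.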

In [None]:
# Bag-of-words representation of the documents.
corpus = [dictionary.doc2bow(doc) for doc in docs]
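
To see what a bag-of-words representation looks like, the sketch below rebuilds the idea in plain Python: each distinct token gets an integer id, and each document becomes a sorted list of `(token_id, count)` pairs, which is the shape `doc2bow` returns. The names `toy_docs`, `token2id` and `to_bow` are made up for illustration, and the ids that `gensim` assigns may differ.

```python
from collections import Counter

toy_docs = [["rain", "hip", "rain"], ["hip", "sun"]]

# Assign an integer id to every distinct token, in order of first appearance.
token2id = {}
for doc in toy_docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

toy_corpus = [to_bow(doc) for doc in toy_docs]
print(token2id)
print(toy_corpus)
```

Word order is lost in this representation; only how often each word occurs in a document is kept.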

Finally we can train our LDA model. In `gensim`, `LdaModel` takes the following parameters, among others: 
* `num_topics` - The number of topics. You will need to think about how many topics the strings might cover; it can be an educated guess. 
* `chunksize` - Controls how many documents are processed at a time in the training algorithm. A larger value can speed up training, as long as the chunk of documents fits into memory. 

In [None]:
# Code from https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html (Accessed 11 July 2021)
# Train LDA model.
from gensim.models import LdaModel

# Set training parameters.
num_topics = 5
chunksize = 2000
passes = 20
iterations = 400
eval_every = None  # Don't evaluate model perplexity, takes too much time.

# Make an index-to-word dictionary.
temp = dictionary[0]  # This is only to "load" the dictionary.
id2word = dictionary.id2token

model = LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=num_topics,
    passes=passes,
    eval_every=eval_every
)

In [None]:
'''Run me
'''
top_topics = model.top_topics(corpus)
top_topics
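
The output of `top_topics` is nested and can be hard to read. According to the `gensim` documentation, each entry pairs a topic with its coherence score, where the topic is a list of `(probability, word)` pairs. Assuming that shape, a small hypothetical helper such as `summarise_topics` below can list the leading words of each topic. The `example` value uses made-up numbers purely to show the structure; it is not output from the model above.

```python
def summarise_topics(top_topics, n_words=3):
    """Format each topic as one line with its coherence and leading words."""
    lines = []
    for i, (terms, coherence) in enumerate(top_topics):
        words = ", ".join(word for _prob, word in terms[:n_words])
        lines.append(f"Topic {i} (coherence {coherence:.2f}): {words}")
    return lines

# Made-up structure for illustration only, not real model output.
example = [
    ([(0.05, "rain"), (0.04, "hip"), (0.03, "sprint"), (0.02, "year")], -1.25),
    ([(0.06, "tree"), (0.05, "movement")], -2.5),
]
for line in summarise_topics(example):
    print(line)
```

With the trained model, the same helper could be called as `summarise_topics(model.top_topics(corpus))`.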

### Conclusion