# Practical 2: Text Pre-processing and Representation

In the previous practical we gathered movie reviews from IMDB and annotated them with sentiment. When data is scraped from the web, or gathered from other sources, it is unlikely to be in a suitable format for NLP applications. So, now that we have some data, the next step is to clean and normalise the text. This reduces noise within the data and ensures a consistent set of input features for an ML model. Very frequent or infrequent words, punctuation and other characters, emojis, HTML tags, etc. all increase the number of features present within the data. Each of these may, or may not, be helpful when training a model for a given task.

In the first part of this practical we will examine several text pre-processing and normalisation steps and the process of building a vocabulary, then develop a function to apply each of these steps to our IMDB review data.

In the second part of this practical we will look at several different methods of representing text in a format that is compatible with ML models, i.e. as numbers or vectors.

In the final part you will apply the text pre-processing function and create a vocabulary for a larger dataset, ready for classification next week.

The objectives of this practical are:
1. Understand various text pre-processing options and determine which are appropriate for a given problem

2. Develop text pre-processing and vocabulary creation functions

3. Explore vectorised language representations - BOW, One-hot and TF-IDF

4. Understand the benefit of word vectors and how to use them

# 1 Text Pre-processing

## 1.0 Import libraries

1. [spaCy](https://spacy.io/) - is a Python library for NLP. It's very efficient and has an excellent set of features.

2. [Natural Language Toolkit (NLTK)](https://www.nltk.org/) - is an older but more comprehensive NLP toolkit for Python.

3. [Unidecode](https://pypi.org/project/Unidecode/) - is a small Python package for stripping accents from letters.

4. [Contractions](https://github.com/kootenpv/contractions) - is a small Python package for expanding contractions.

In [None]:
import os
import re
import spacy
import unidecode
import contractions
import pandas as pd
from collections import Counter
from nltk.stem.snowball import SnowballStemmer

# Set the directory to the data folder
data_dir = os.path.join('..', 'data', 'imdb')

# spaCy also needs its language model to be installed
# If you receive an error, uncomment the following line and re-run the cell
# !python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

## 1.1 Pre-processing options

The following cells demonstrate each of the pre-processing options discussed in the lecture. For most we use spaCy but several are possible using regular expressions or plain Python.

It is very unlikely you would ever need to apply **all** of these steps. In fact you probably wouldn't have much text left if you did! But it is important to understand what each does and when it might be appropriate.

<div class = "alert alert-block alert-info"><b>Note:</b> The <i>order</i> in which these steps are applied can sometimes make a big difference.<br>
For example, if you were to remove punctuation and replace with an empty string, then hyphenated words would be joined together. So, 'father-in-law' becomes 'fatherinlaw'.</div>
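
The effect described in the note can be reproduced with a couple of regular expressions. This is only a small sketch using Python's `re` module; the example sentence is made up for illustration.

```python
import re

text = "My father-in-law isn't here."

# Replacing punctuation with an empty string joins hyphenated words together
joined = re.sub(r"[^A-Za-z0-9\s]", "", text)

# Replacing punctuation with a space keeps the parts separate,
# but it also splits the contraction "isn't" into "isn t"
spaced = re.sub(r"[^A-Za-z0-9\s]+", " ", text)
spaced = re.sub(r"\s+", " ", spaced).strip()

print(joined)
print(spaced)
```

Neither result is ideal, which is why a step such as expanding contractions usually needs to run before punctuation is removed.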

### Tokenisation and Segmentation

In [None]:
# Create spacy document object
raw_text = "Let's visit my father-in-law in St. Louis next year. He said 'it would be fun!'"
doc = nlp(raw_text)
print("Document: " + str(doc))

# Segment the text into sentences
sentences = list(doc.sents)
print("Sentences: " + str(sentences))

# Tokenise the sentences
tokens = []
for sent in doc.sents:
    tokens.append([token.text for token in sent])
print("Tokens: " + str(tokens))

### Stemming and Lemmatisation

In [None]:

# Create NLTK stemmer
stemmer = SnowballStemmer(language='english')

# Create spacy document object
raw_text = "studies studying cries cry automatic automation are is car cars am"
doc = nlp(raw_text)

# Print the stem and lemma for each token
print(f"{'Token:':20} {'Stem:':20} {'Lemma:':20}\n")
for token in doc:
    print(f"{token.text:20} {stemmer.stem(token.text):20} {token.lemma_:20}")

### Stop words, Case-folding and Punctuation

In [None]:
# Print Spacy's default stop words
print("List of stop words: " + str(list(nlp.Defaults.stop_words)[:50]) + "\n")

# Create spacy document object
raw_text = "Let's visit my #father-in-law @ St. Louis next year.\n He said 'it would be fun!'"
doc = nlp(raw_text)
print("Document: " + str(doc))

# Remove stop words
print("Removed stop words:")
for sent in doc.sents:
    sent = [token for token in sent if not token.is_stop]
    print(sent)

# Lowercase the tokens
# Python: text.lower()
print("Lower-cased words:")
for sent in doc.sents:
    sent = [token.lower_ for token in sent]
    print(sent)

# Remove punctuation
# Regex: keep only letters and numbers
# re.sub('[^A-Za-z0-9]+', ' ', text)
print("Removed punctuation:")
for sent in doc.sents:
    sent = [token for token in sent if not token.is_punct]
    print(sent)

### Whitespace, characters, contractions, accents, HTML tags and emoji

<div class = "alert alert-block alert-info"><b>Note:</b> The regex for removing emojis is from <a href=https://stackoverflow.com/questions/33404752/removing-emojis-from-a-string-in-python>this stack overflow answer</a>.
</div>

<div class="alert alert-warning" role="alert">
<b>Parsing HTML with regex:</b> The regular expression used here works reasonably well for simple HTML tags but is not foolproof, as jokingly outlined <a href=https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags?noredirect=1&lq=1>in this well-known Stack Overflow answer</a>.<br>

Regex cannot reliably account for the complex structure of HTML, so if it is critical to parse HTML correctly you should use a dedicated HTML/XML parser, or something like Beautiful Soup.
</div>
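
As one possible alternative to regex, Python's standard library includes `html.parser`. The sketch below is a minimal illustration rather than a complete solution: `TagStripper` and `strip_html` are hypothetical helper names, and the HTML fragment is invented for the example.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect only the text content of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called with the text found between tags
        self.parts.append(data)


def strip_html(html):
    stripper = TagStripper()
    stripper.feed(html)
    stripper.close()  # Flush any text still buffered by the parser
    return "".join(stripper.parts)


html = "<a href='site.com' class='link'>Let's visit</a> <b>St. Louis &amp; beyond</b><br>"
print(strip_html(html))
```

Unlike the regex approach, the parser also converts character references such as `&amp;` back into the characters they represent.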

In [None]:
# Create spacy document object
raw_text = u"<a href='site.com' class='link'> Let's visit mý \t #fáther-in-law @ St. Louis next year.\n <b>He said 'it would be fun!' \U0001f602 </b></a><br>"

doc = nlp(raw_text)
print("Document: " + str(doc))

# Remove whitespace
# Regex: remove 1 or more whitespace characters
# re.sub('\s+', ' ', text)
print("Removed whitespace characters:")
for sent in doc.sents:
    sent = [token for token in sent if not token.is_space]
    print(sent)

# Remove specific characters
# Characters are specified inside the square brackets
print("Removed specific characters:")
for sent in doc.sents:
    sent = re.sub('[@#$]', '', sent.text)
    print(sent)

# Remove accents
print("Removed accents:")
for sent in doc.sents:
    sent = unidecode.unidecode(sent.text)
    print(sent)

# Expand contractions
print("Expanded contractions:")
for sent in doc.sents:
    sent = contractions.fix(sent.text)
    print(sent)

# Remove HTML tags
# Match 0 or more characters between < and >
print("Removed HTML tags:")
for sent in doc.sents:
    sent = re.sub('<.*?>', '', sent.text)
    print(sent)

# Remove emojis
emoji_pattern = re.compile("["
    u"\U0001F600-\U0001F64F"  # emoticons
    u"\U0001F300-\U0001F5FF"  # symbols & pictographs
    u"\U0001F680-\U0001F6FF"  # transport & map symbols
    u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                        "]+", flags=re.UNICODE)
print("Removed emoji:")
for sent in doc.sents:
    sent = emoji_pattern.sub(r'', sent.text)
    print(sent)

## 1.2 Building a vocabulary

It is often helpful to create a vocabulary once the text has been processed. At a certain point words appear so infrequently that they may have little impact on the model. A vocabulary allows us to choose how many words (features) to keep and to discard those that occur less frequently.

A vocabulary also allows us to map word tokens to indices to perform simple text **vectorisation**. It can also hold special tokens, such as `<unk>` to replace unknown/out-of-vocabulary (OOV) words and `<pad>` to pad inputs to a given length.


1. First pre-process/normalise the text.

2. Then use `Counter()` to create a dictionary of words and frequency counts.

3. Finally, create a vocabulary (list) that starts with the special tokens `<pad>` and `<unk>`, followed by the most frequently occurring words, up to `vocab_size` entries in total. We will use the special tokens in future weeks.

In [None]:
# Create spacy document object
raw_text = "Let's visit my father-in-law in St. Louis next year.\n He said 'it would be fun!'."
doc = nlp(raw_text)
print("Document: " + str(doc))

# Do some pre-processing
# Let's just lowercase the tokens and remove whitespace characters
corpus = []
for sent in doc.sents:
    sent = [token.lower_ for token in sent]
    sent = [token.strip() for token in sent]
    corpus.append(sent)

# Count the frequency of each token in the corpus
word_counter = Counter()
for sent in corpus:
    word_counter.update(sent)
print(word_counter)
print("Unique token count: " + str(len(word_counter)))

# Create a vocabulary of vocab_size, also include special tokens
vocab_size = 20
special_tokens = ['<pad>', '<unk>']
vocab = []

# Add the special tokens to the vocabulary
vocab.extend(special_tokens)

# Add the vocab_size most common tokens to the vocabulary
vocab.extend([word for word, count in word_counter.most_common(vocab_size - len(special_tokens))])
print(vocab)
print("Vocabulary size: " + str(len(vocab)))

# Now we can get the index for a token, or the token from an index
print(vocab.index('father'))
print(vocab[vocab.index('father')])
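
To show what the special tokens are for, here is a hedged sketch of encoding a tokenised sentence as a fixed-length list of indices. The small vocabulary and the `encode()` helper are assumptions made for this example, not part of the practical's code.

```python
# A small, made-up vocabulary with the special tokens first
vocab = ['<pad>', '<unk>', 'the', 'film', 'was', 'great']

# A dictionary lookup is much faster than list.index() for large vocabularies
token_to_index = {token: index for index, token in enumerate(vocab)}


def encode(tokens, token_to_index, max_length):
    """Map tokens to indices, using <unk> for OOV words and <pad> to fill up to max_length."""
    unk, pad = token_to_index['<unk>'], token_to_index['<pad>']
    indices = [token_to_index.get(token, unk) for token in tokens[:max_length]]
    return indices + [pad] * (max_length - len(indices))


# 'superb' is not in the vocabulary, so it is mapped to <unk>
print(encode(['the', 'film', 'was', 'superb'], token_to_index, max_length=6))
```

Putting `<pad>` at index 0 is a common convention because padded positions are then simply zeros, but any fixed index would work.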

## 1.3 Exercise: Pre-processing pipeline

We have now seen each of the various pre-processing options applied individually. However, several of these steps will usually need to be applied together. The appropriate steps are problem-specific, and choosing an approach is part of developing an NLP project. At the very least you will probably need to remove extra whitespace and tokenise the text, but most likely case-folding and removing some special characters will be necessary too.

Libraries like NLTK, spaCy and textaCy can help you build a processing 'pipeline' but it is convenient to create a function or class to handle these steps for you.

1. In the following cell complete the `preprocess_text()` function. It should take a single string as input, apply a range of processing options and either return a list of tokens, if `tokenise=True`, or a string.

2. The function should apply case-folding, expand contractions, lemmatise, remove punctuation, whitespace, accents, basic HTML tags and emojis. It should also include arguments to select and apply each of these options separately e.g. `to_lower=False`.

3. You can use the `test_text` string to develop the function. Remember the *order* you apply different steps can make a big difference!

4. Once you are happy, load your IMDB reviews from the .csv file and apply the function to each review. It might take some trial and error to find the right pre-processing options for your reviews.

5. Finally, you should convert the processed reviews into a list where each item is a single tokenised review and make sure you name it `imdb_corpus`.

In [None]:
# Create a function to pre-process the corpus
def preprocess_text(text, tokenise=False):
    # You can add more pre-processing steps here
    pass

test_text = "<a href='https://www.imdb.com/title/tt0000417/reviews/?ref_=tt_ql_urv'>I can now say that I've seen a movie that's over 100 years old</a> Georges Méliès's 1902 masterpiece is not just a science fiction movie. <br /><br />It's also a satire on nineteenth-century science.\t""Le Voyage dans la Lune"" (""A Trip to the Moon"") is also an indictment of colonialism.\nThe astronauts attack the Moon Men - called \"Selenites\" - and then bring one back to Earth, where they parade him around. "" \U0001f602"
print(test_text)

text = preprocess_text(test_text)
print(text)

# Load the imdb reviews
imdb_reviews = pd.read_csv(os.path.join(data_dir, 'imdb_reviews_raw.csv'), index_col=0)

# Apply the preprocessing function to each review in the reviews column
# tokenise=True so that imdb_corpus is a list of tokenised reviews
imdb_corpus = imdb_reviews['review'].apply(lambda x: preprocess_text(x, tokenise=True)).tolist()
print(imdb_corpus[0])

## 1.4 Exercise: Create a vocabulary

Once the reviews have been pre-processed create a vocabulary from the corpus.

1. In the following cell complete the `create_vocabulary()` function. It should take a list of tokenised sentences as input and return a list of words.

2. It should also include arguments to set the (maximum) `vocab_size` and include `special_tokens`, e.g. `special_tokens=['<pad>', '<unk>']`.

3. Once the function is complete, choose an appropriate `vocab_size` (1-2k words) and create the vocabulary without special tokens for now. Give it a unique name like `imdb_vocab`.

4. Print the 50-100 most common tokens and check whether they are what you expect.

In [None]:
def create_vocabulary(corpus, vocab_size=None, min_freq=1, special_tokens=None):
    # You can add the vocabulary creation code here
    pass

# Set vocab_size and special tokens
vocab_size = 2000
special_tokens = None

# Create a vocabulary
imdb_vocab = create_vocabulary(imdb_corpus, vocab_size=vocab_size, special_tokens=special_tokens)

# Print the vocabulary
print("Vocabulary size: " + str(len(imdb_vocab)))
for i, word in enumerate(imdb_vocab[:50]):
    print(f'({str(i)}, {word})', end=' ')

# 2 Language Representation

## 2.0 Import libraries

1. [Sklearn (scikit-learn)](https://scikit-learn.org/stable/) - is a comprehensive Python library for Machine Learning. We will use its text vectorisation features and its PCA implementation.

2. [Gensim](https://radimrehurek.com/gensim/index.html) - is primarily a Python Topic Modelling library. It also has lots of useful features for working with Word Vectors. 

In [None]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from nltk import ngrams
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity
from gensim.models import Word2Vec

# Increase pandas display width
pd.set_option('display.width', 500)
# Set seaborn style
sns.set_style("whitegrid", {'axes.grid' : False})

## 2.1 Representation options

The following cells demonstrate each of the language representation options discussed in the lecture. For most we use numpy/plain Python to demonstrate the process and then sklearn's built in functions.

As with pre-processing, the appropriate representation depends on the task and, generally, on the input *shape* of the data that a given model expects.

### One-hot Encoding

One-hot encoding converts a word into an array of length `vocab_size`, with a **1** at the index of the word's position in the vocabulary and **0s** in every other position. Encoding a sentence then produces a 2D array of shape `sequence_length` x `vocab_size`, with one row per token.

In [None]:
# Create spacy document object
text = "This is a test sentence which is a very long test sentence."
doc = nlp(text)
print("Document: " + str(doc))

# Tokenise the document
tokens = [token.text for token in doc]
print("Tokens: " + str(tokens))

# Create simple vocabulary
vocab = list(set(tokens))
print("Vocabulary: " + str(vocab))

# Get a list of token indices within vocabulary
token_indices = [vocab.index(token) for token in tokens]
print("Token indices: " + str(token_indices))

# Create a one-hot vector with numpy
num_unique = len(vocab) # Need to know how many features there are
one_hot = np.eye(num_unique)[token_indices]
print("One-hot vector with numpy:")
print(one_hot)

# Create a one-hot vector with sklearn
token_indices = np.array(token_indices).reshape(-1, 1) # Need to reshape the array to 2D
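# Note: `sparse_output` replaced `sparse` in scikit-learn 1.2; use `sparse=False` on older versions
one_hot = OneHotEncoder(sparse_output=False).fit_transform(token_indices)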
print("One-hot vector with sklearn:")
print(one_hot)

### Bag-of-words (BOW)

BOW converts a sentence into an array of length `vocab_size`. Simply count the number of times a word appears within the sequence and increment the index according to its position within the vocabulary.

<div class = "alert alert-block alert-info"><b>Note:</b> The output of sklearn's CountVectorizer() is different to the numpy implementation. Can you work out why?
</div>

In [None]:
# Create spacy document object
text = "This is a test sentence which is a very long test sentence."
doc = nlp(text)
print("Document: " + str(doc))

# Tokenise the document
tokens = [token.text for token in doc]
print("Tokens: " + str(tokens))

# Create simple vocabulary (add word not in our input text)
vocab = list(set(tokens + ['supercalifragilisticexpialidocious']))
print("Vocabulary: " + str(vocab))

# Get a list of token indices within vocabulary
token_indices = [vocab.index(token) for token in tokens]
print("Token indices: " + str(token_indices))

# Create a BOW with numpy
bow = np.zeros(len(vocab), dtype=np.int32)
for i in range(len(token_indices)):
    bow[token_indices[i]] += 1
print("BOW with numpy:")
print(bow)

# Create a BOW with sklearn
bow_vectoriser = CountVectorizer(vocabulary=vocab, lowercase=False)
bow = bow_vectoriser.fit_transform([text])
print("BOW with sklearn:")
print(bow.toarray())

### TF-IDF

TF-IDF converts a corpus into an array of shape `num_documents` x `vocab_size`. TF (term frequency) is the frequency of a word within a *document*, and IDF (inverse document frequency) measures how rare the word is across the documents of the *corpus*. The TF-IDF for a word in a document is then TF x IDF:

$w =$ word/term

$d =$ document

$N =$ number of documents in corpus

$df(w) =$ number of documents in the corpus that contain $w$

$TF(w, d) = \frac{count(w, d)}{len(d)}$

$IDF(w) = \log\frac{N}{df(w) + 1}$

$\text{TF-IDF}(w, d) = TF(w, d) \times IDF(w)$

<div class = "alert alert-block alert-info"><b>Note:</b> The class TFIDF() mimics the sklearn implementation (as best as possible), which uses raw counts for TF and adds 1 to the IDF, so its values differ from the formulas above. Try different normalisations ('l1' or 'l2') and set smoothing True/False.
</div>
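
Before looking at the class, it may help to work through the calculation by hand. The sketch below uses the textbook form, with TF as the count divided by document length and IDF as the log of N over the number of documents containing the word plus one. The four-sentence corpus is made up for illustration, and the values will not match sklearn's, which uses a different IDF variant.

```python
import math
from collections import Counter

corpus = [
    'the car is driven on the road',
    'the truck is driven on the highway',
    'the bus is late',
    'a car is parked',
]
docs = [doc.split() for doc in corpus]
N = len(docs)

# Document frequency: the number of documents that contain each word
df = Counter()
for doc in docs:
    df.update(set(doc))


def tf(word, doc):
    # Frequency of the word relative to the length of the document
    return doc.count(word) / len(doc)


def idf(word):
    # Words that appear in fewer documents get a larger weight
    return math.log(N / (df[word] + 1))


def tf_idf(word, doc):
    return tf(word, doc) * idf(word)


for word in ['the', 'car', 'road']:
    print(f"{word:5} TF={tf(word, docs[0]):.3f} IDF={idf(word):.3f} TF-IDF={tf_idf(word, docs[0]):.3f}")
```

Although 'the' is the most frequent word in the first document, it appears in most documents, so its IDF pulls its score below that of the rarer word 'road'.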

In [None]:
class TFIDF():
    
    def __init__(self, tokeniser=None, vocabulary=None, norm=None, smooth_idf=True):
        """ Arguments:
                tokeniser: A function that takes a string and returns a list of tokens.
                vocabulary: A list of tokens to use as the vocabulary.
                norm: The normalisation to use when calculating the tf-idf vectors.
                smooth_idf: Whether to use Laplace smoothing when calculating the idf.
        """
        self.corpus = None
        self.N = None
        self.tokeniser = tokeniser
        self.vocabulary = vocabulary
        self.norm = norm
        self.smooth_idf = smooth_idf

        if not self.tokeniser:
            self.tokeniser = self._tokenise

        # l1 norm is the sum of the absolute values of the vector
        if self.norm and self.norm == 'l1':
            self.norm = 1
        # l2 norm is the square root of the sum of the squared values of the vector
        elif self.norm and self.norm == 'l2':
            self.norm = 2

    def _tokenise(self, s):
        return s.split()

    def get_vocabulary(self):
        vocab = []
        for doc in self.corpus:
            vocab.extend(self.tokeniser(doc))

        vocab = list(set(vocab))
        vocab.sort()
        return vocab

    def _tf(self):
        """Get the term frequency for each document in the corpus."""

        tf = []
        for doc in self.corpus:
            tf.append(Counter(self.tokeniser(doc)))
        return tf

    def _df(self):
        """Get the document frequency of each word in the corpus."""

        df = Counter()
        for doc in self.corpus:
            df.update(set(self.tokeniser(doc)))
        return df

    def _idf(self):
        """Calculate inverse document frequency for each word in the vocabulary."""

        # Calculate the DF
        df = self._df()

        idf = {}
        for word in self.vocabulary:
            if self.smooth_idf:
                idf[word] = 1.0 + np.log((self.N + 1) / (df[word] + 1))
            else:
                idf[word] = 1.0 + np.log(np.divide(self.N, df[word]))
        return idf

    def _tfidf(self):
        """Calculate the TF-IDF for each document in the corpus."""

        # Calculate TF and IDF
        tf = self._tf()
        idf = self._idf()

        # Calculate TF-IDF
        tfidf = np.zeros((self.N, len(self.vocabulary)))

        for i, doc in enumerate(self.corpus):
            for j, word in enumerate(self.vocabulary):
                tfidf[i, j] = tf[i][word] * idf[word]
        
        if self.norm:
            tfidf = tfidf / np.linalg.norm(tfidf, ord=self.norm, axis=1, keepdims=True)
        return tfidf

    def fit(self, corpus):
        # Set corpus/N
        self.corpus = np.array(corpus)
        self.N = len(self.corpus)

        # Set vocabulary
        if not self.vocabulary:
            self.vocabulary = self.get_vocabulary()

        # Calculate TF-IDF
        self.tfidf = self._tfidf()
        return self

    def transform(self, corpus):
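        # Update corpus/N
        # Note: unlike sklearn, the new documents are added to the fitted corpus,
        # so N and the IDF values are recalculated on every call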
        self.corpus = np.append(self.corpus, corpus, axis=0)
        self.N = len(self.corpus)

        # Calculate TF-IDF
        self.tfidf = self._tfidf()
        return self.tfidf[-len(corpus):]

corpus = ['the car is driven on the road', 'the truck is driven on the highway']

# Create a TF-IDF with numpy
tfidf_numpy = TFIDF(norm='l1', smooth_idf=False).fit(corpus)
terms = tfidf_numpy.get_vocabulary()
matrix = tfidf_numpy.transform(corpus)
print("TF-IDF with numpy:")
print(pd.DataFrame(data=matrix, columns=terms))

# Transform a new sentence
matrix = tfidf_numpy.transform(['the car is driven in the sky'])
print(pd.DataFrame(data=matrix, columns=terms))

# Create a TF-IDF with sklearn
tfidf_sklearn = TfidfVectorizer(norm='l1', smooth_idf=False).fit(corpus)
terms_2 = tfidf_sklearn.get_feature_names_out()
matrix_2 = tfidf_sklearn.transform(corpus).toarray()
print("TF-IDF with sklearn:")
print(pd.DataFrame(data=matrix_2, columns=terms_2))

# Transform a new sentence
matrix_2 = tfidf_sklearn.transform(['the car is driven in the sky']).toarray()
print(pd.DataFrame(data=matrix_2, columns=terms_2))

### N-grams

N-grams are sequences of N words, typically uni-grams (1), bi-grams (2) and tri-grams (3). Bi-grams and tri-grams (or larger) provide some context to words and can be used as a replacement for uni-grams in many models. Here we use NLTK to create tuples of all bi-grams and tri-grams from the text.

<div class = "alert alert-block alert-info"><b>Note:</b> The sklearn CountVectorizer() and TfidfVectorizer() have an <code>ngram_range</code> argument which allows you to vectorise N-grams instead of single words.
</div>
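
To see what an N-gram function does internally, here is a plain Python sketch. `make_ngrams` is a hypothetical helper written for this example; it is intended to give the same tuples as NLTK's `ngrams` for a simple token list.

```python
def make_ngrams(tokens, n):
    """Return all n-grams of a token list as tuples (hypothetical helper)."""
    # Zip together n copies of the list, each shifted one position further along
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = ['I', 'sat', 'by', 'the', 'river', 'bank']
print("Bi-grams: " + str(make_ngrams(tokens, 2)))
print("Tri-grams: " + str(make_ngrams(tokens, 3)))
```

A sequence of `len(tokens)` words always produces `len(tokens) - n + 1` N-grams, so longer N-grams are both fewer and individually rarer.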

In [None]:
# Create spacy document object
text = 'I sat by the river bank. I went to the bank to withdraw money.'
doc = nlp(text)
print("Document: " + str(doc))

# Create N-grams with nltk
for sent in doc.sents:
    print("Sentence: " + str(sent))

    tokens = [token.text for token in sent]

    bi_grams = list(ngrams(tokens, 2))
    print("Bi-grams: " + str(bi_grams))

    tri_grams = list(ngrams(tokens, 3))
    print("Tri-grams: " + str(tri_grams))

### Word Vectors

Word vectors represent single words as a vector (list) of real numbers which capture some aspect of their meaning and relationships to other words. The best-known method is [Word2Vec](https://arxiv.org/pdf/1301.3781.pdf), which uses either skip-gram (given a target word, predict the surrounding context words) or a continuous bag of words (CBOW: predict the target word given its context words). Word vector models are typically trained on hundreds of millions of words to produce a set of weights - an embedding matrix - of shape `vocab_size` x `embedding_dim`, where the embedding dimension is the length of the vector for each word (usually 50 to 300).

Once trained these embeddings can be used as semantically rich word representations for other NLP tasks, such as classification. This is called transfer learning, where the weights for a model trained on one objective (predicting words) can be used as input to train models on a different task (classification, language modelling, etc). There are lots of pre-trained word vectors available to download which can be used to map words to vectors for input into your models.
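
To make the idea of a sliding window more concrete, the sketch below lists the word pairs that a skip-gram model would learn from in a short sentence. It is a simplified illustration only: `skipgram_pairs` is a hypothetical helper, and real implementations such as Gensim add further steps (for example, subsampling of frequent words and negative sampling) that are omitted here.

```python
def skipgram_pairs(tokens, window=2):
    """Pair each word with every neighbour that is at most `window` positions away."""
    pairs = []
    for i, centre in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs


tokens = ['a', 'trip', 'to', 'the', 'moon']
for centre, neighbour in skipgram_pairs(tokens, window=1):
    print(f"{centre} -> {neighbour}")
```

Increasing `window` produces more pairs per word, so each word is related to a wider span of the sentence.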

We can use Gensim to create a Word2Vec model from our IMDB data.

- `sentences` = the input list of lists of tokens.

- `vector_size` = dimensionality of the word vectors (named `size` in Gensim versions before 4.0).

- `window` = maximum distance between the current and predicted word within a sentence - the size of the 'sliding window' during training.

- `min_count` = ignore words with total frequency lower than this.

- `sg` = training algorithm: 1 for skip-gram; otherwise CBOW.

- `workers` = number of worker threads to train the model (faster training with multicore machines).

Passing `sentences` to the constructor builds the model's vocabulary and already trains it on our IMDB data for Gensim's default number of epochs; no pre-trained vectors are loaded. We can view a word's vector and the N most similar words (calculated with cosine similarity). With a small corpus and little training, a word that is quite specific to our IMDB data (like 'georges') is likely to have most similar words that don't make much sense. If you instead choose a more common word (like 'film') you are more likely to see related words such as 'cinema'.

After training for a few more epochs on the IMDB data, the corpus-specific words should have more sensible similar words, e.g. 'melies', 'directed' and 'director', for 'georges'.

In [None]:
# Create a word2vec model with gensim
# Note: gensim >= 4.0 uses `vector_size` and `wv.index_to_key`; older versions use `size` and `wv.vocab`
embedding_dim = 100
w2v_model = Word2Vec(sentences=imdb_corpus, vector_size=embedding_dim, window=5, min_count=2, sg=1, seed=1, workers=4)
print("W2v model vocabulary: " + str(w2v_model.wv.index_to_key[:100]))

N = 10
word = 'georges'
print(f"Vector for '{word}':")
print(w2v_model.wv.get_vector(word))
print(f"{N} most similar words to '{word}':")
print(w2v_model.wv.most_similar(word, topn=N))

# Continue training the model for a few more epochs
w2v_model.train(imdb_corpus, total_examples=len(imdb_corpus), epochs=2)
print(f"Vector for '{word}':")
print(w2v_model.wv.get_vector(word))
print(f"{N} most similar words to '{word}':")
print(w2v_model.wv.most_similar(word, topn=N))

Once the model has been trained we can create an embedding matrix of shape `vocab_size` x `embedding_dim`.

In [None]:
# Create an empty numpy array
embedding_matrix = np.zeros((len(imdb_vocab), embedding_dim))

# For each word in the imdb vocabulary
for i, word in enumerate(imdb_vocab):
    # If the word is in the word2vec model (use `wv.vocab` with gensim < 4.0)
    if word in w2v_model.wv.key_to_index:
        # Get the vector for the word
        embedding_matrix[i] = w2v_model.wv.get_vector(word)
    else:
        # Get a random vector
        embedding_matrix[i] = np.random.uniform(np.min(embedding_matrix), np.max(embedding_matrix), embedding_dim)

# Create dataframe with words and vectors
embedding_df = pd.DataFrame(embedding_matrix, index=imdb_vocab)
embedding_df.head(10)

Now we can calculate the similarity between all words in the embedding matrix.

<div class = "alert alert-block alert-info"><b>Note:</b> We use cosine similarity, which measures the similarity between two vectors as the cosine of the angle between them. It produces a similarity score in the range [-1, 1].
</div>
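
The calculation itself is short enough to write out by hand. This is a minimal sketch using plain Python lists and made-up vectors; sklearn's `cosine_similarity` performs the same calculation for every pair of rows in a matrix.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine([1.0, 2.0], [2.0, 4.0]))    # Same direction, different length
print(cosine([1.0, 0.0], [0.0, 1.0]))    # Orthogonal vectors
print(cosine([1.0, 2.0], [-1.0, -2.0]))  # Opposite directions
```

Because the score depends only on direction, the length of the vectors does not matter. Word vectors can contain negative values, so negative scores are possible.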

In [None]:
# Calculate the cosine similarity between the words
similarity_matrix = cosine_similarity(embedding_df)
# Create dataframe with words and similarity
similarity_df = pd.DataFrame(similarity_matrix, columns=imdb_vocab)
# Add word as second index
similarity_df.insert(0, 'word_ind', imdb_vocab)
similarity_df.set_index('word_ind', inplace=True, append=True)
similarity_df.head(10)

Now we can use the vectors to visualise the most similar and least similar words to a given target word.

1. We will use Principal Component Analysis (PCA) to reduce the dimensionality of the embeddings so we can visualise them.

2. Next find the N most similar and dissimilar words to a target word.

3. Create a 3D plot of the embeddings. With `N=10` and 'georges' you should see that, for example, 'melies' and 'director' are very close in embedding space.

In [None]:
# Set the number of similar/dissimilar words and a target word
N = 10
word = 'georges'

# Perform PCA (dimensionality reduction) on the embedding matrix
pca_embeddings = PCA(n_components=3).fit_transform(embedding_matrix)

# Find the N most/least similar words
most_sim = similarity_df[word].sort_values(ascending=False)[0:N + 1]
least_sim = similarity_df[word].sort_values(ascending=True)[0:N + 1]

most_sim_words = [w for ind, w in most_sim.index.values]
least_sim_words = [w for ind, w in least_sim.index.values]

# Get the indices of the most/least similar words from the reduced embedding matrix
most_sim_pca = pca_embeddings[[ind for ind, w in most_sim.index.values]]
least_sim_pca = pca_embeddings[[ind for ind, w in least_sim.index.values]]

# Plot the most/least similar words
fig = plt.figure(figsize=(11, 8))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(most_sim_pca[:, 0], most_sim_pca[:, 1],  most_sim_pca[:, 2], linewidths=1, color='blue')
ax.scatter(least_sim_pca[:, 0], least_sim_pca[:, 1],  least_sim_pca[:, 2], linewidths=1, color='red')
# Add words to the plot (a separate loop variable avoids overwriting the target `word`)
for i, label in enumerate(most_sim_words):
    ax.text(most_sim_pca[i, 0]+.02, most_sim_pca[i, 1], most_sim_pca[i, 2], label, size=10, zorder=1)
for i, label in enumerate(least_sim_words):
    ax.text(least_sim_pca[i, 0]+.02, least_sim_pca[i, 1], least_sim_pca[i, 2], label, size=10, zorder=1)

# 3 Processing a Dataset

So far we have explored several text pre-processing and representation methods using the IMDB movie reviews that we gathered. However, deep learning requires a lot of data, so we do not have enough to adequately train most models. For this we will use an existing movie review dataset for training and keep ours as an additional test set. The IMDB movie review dataset contains 50,000 reviews and sentiment labels 'positive' and 'negative'.

## 3.1 Exercise: Prepare the IMDB dataset

You have already developed a `preprocess_text()` function which applies case-folding, expands contractions, lemmatises, and removes punctuation, whitespace, accents, basic HTML tags and emojis. So now you can continue to develop its functionality and then apply it to your IMDB reviews and the larger IMDB dataset.

1. Extend the function to provide the option to remove only specific characters, rather than all punctuation. For example, you may want to remove brackets and other special characters, but keep full stops, commas etc. Similarly, you could include the option to remove only specific stop words, rather than all of the stop words included with spaCy.

2. Load your IMDB review dataset and apply the pre-processing to each review. You **should not tokenise the data** at this stage. Leave each review as a whole string.

3. Save the processed dataset as `imdb_reviews.csv`.

4. Once you are happy with the results, apply the pre-processing to the larger dataset `imdb_dataset_raw.csv`, and save it as `imdb_dataset.csv`.

<div class = "alert alert-block alert-warning"><b>Warning:</b> Processing all 50,000 reviews might take some time. You should be sure of the pre-processing options you have selected before you apply them to the entire dataset.
</div>

In [None]:
# Load the imdb reviews
imdb_data = pd.read_csv(os.path.join(data_dir, 'imdb_reviews_raw.csv'), index_col=0)

# Apply the preprocessing function
imdb_data['review'] = imdb_data['review'].apply(lambda x: preprocess_text(x, tokenise=False))

# Save the data
imdb_data.to_csv(os.path.join(data_dir, 'imdb_reviews.csv'))

# Load the full imdb dataset
# imdb_data = pd.read_csv(os.path.join(data_dir, 'imdb_dataset_raw.csv'))

# # Apply the preprocessing function
# imdb_data['review'] = imdb_data['review'].apply(lambda x: preprocess_text(x, tokenise=False))

# # Save the data
# imdb_data.to_csv(os.path.join(data_dir, 'imdb_dataset.csv'), index=False)

## 3.2 Exercise: Create an IMDB vocabulary

You have already developed a `create_vocabulary()` function, so now you can continue to develop its functionality and then create a vocabulary from the larger IMDB dataset.

1. Extend the function to discard words below a certain frequency e.g. `min_freq=2` only includes words that occur 2 times or more (possibly at the expense of the vocabulary size).

2. Load your newly processed IMDB dataset and tokenise it. Hint: you can use `.apply(lambda x: [token.text for token in nlp.tokenizer(x)])` to quickly tokenise the strings.

3. Set a vocabulary size and then create a vocabulary. You **do not need to add special tokens** for now.

4. Save the vocabulary as a text file for later use.

In [None]:
# Load the imdb dataset
imdb_data = pd.read_csv(os.path.join(data_dir, 'imdb_dataset.csv'))

# Tokenise the reviews
tokens = imdb_data['review'].apply(lambda x: [token.text for token in nlp.tokenizer(x)])

# Set vocab_size
vocab_size = 2000

# Create a vocabulary
imdb_vocab = create_vocabulary(tokens, vocab_size=vocab_size, min_freq=2)

# Print the vocabulary
print("Vocabulary size: " + str(len(imdb_vocab)))
for i, word in enumerate(imdb_vocab[:50]):
    print(f'({str(i)}, {word})', end=' ')

# Save to text file
with open(os.path.join(data_dir, 'imdb_vocab.txt'), 'w', encoding='utf-8') as file:
    for word in imdb_vocab:
        file.write(word + '\n')
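
When the vocabulary is needed again, it can be read back from the text file. The sketch below shows one way this might be done; `load_vocabulary` is a hypothetical helper, and a temporary file stands in for `imdb_vocab.txt` so that the example is self-contained.

```python
import os
import tempfile


def load_vocabulary(path, special_tokens=None):
    """Read one token per line and return the vocabulary list and a token-to-index dictionary."""
    with open(path, encoding='utf-8') as file:
        # Blank lines are skipped
        words = [line.rstrip('\n') for line in file if line.strip()]
    vocab = list(special_tokens or []) + words
    return vocab, {token: index for index, token in enumerate(vocab)}


# A temporary file stands in for imdb_vocab.txt in this demonstration
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'vocab.txt')
    with open(path, 'w', encoding='utf-8') as file:
        file.write('the\nfilm\ngreat\n')
    vocab, token_to_index = load_vocabulary(path, special_tokens=['<pad>', '<unk>'])

print(vocab)
print(token_to_index['film'])
```

Adding the special tokens at load time means the saved file only needs to contain the words themselves.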