# CS-EJ3311 - Deep Learning with Python, 09.09.2021-18.12.2021

## Round 5 - Natural Language Processing

This notebook is a part of teaching material for CS-EJ3311 - Deep Learning with Python 13.09.-17.12.2021\
Aalto University (Espoo, Finland)\
fitech.io (Finland)

If you are familiar with virtual assistants like [Alexa](https://en.wikipedia.org/wiki/Amazon_Alexa), [Siri](https://en.wikipedia.org/wiki/Siri), [Google Assistant](https://en.wikipedia.org/wiki/Google_Assistant) or if your email provider automatically classifies your emails into different categories (promotions, updates, forums, spam), or if you use Google Translate, then you have been using Natural Language Processing (NLP). These are not the only places where you can find NLP applications in daily life, but they illustrate well what can be achieved. 

The goal of NLP is to design and build computer systems capable of analyzing and responding to text or voice, similar to the way humans do. In this notebook, we will take a closer look at the fundamentals of NLP and how Deep Learning has contributed to better results compared to traditional approaches. 

As motivation, visit the website [GPT-3 Demo](https://gpt3demo.com/) and see if you can find something interesting made with [GPT-3](https://en.wikipedia.org/wiki/GPT-3). Personally, I find this one impressive: [AI-Powered Code Generator](https://sourceai.dev/documentation/example#example-in-java)


## Learning goals
- understanding NLP and its applications
- understanding how to represent textual data
- deep learning in the context of NLP


## Recommended Reading
<a id='hapke'></a>
- Hapke, H., Howard, C. and Lane, H., 2019. **Natural Language Processing in Action: Understanding, analyzing, and generating text with Python**. Simon and Schuster. 

-  Beysolow II, Taweh. **Applied Natural Language Processing with Python. Implementing Machine Learning and Deep Learning Algorithms for Natural Language Processing**. 1st ed. 2018., Apress, 2018, doi:10.1007/978-1-4842-3733-5.


## Additional Material (Optional)

- [CS224N: Natural Language Processing with Deep Learning](https://www.youtube.com/playlist?list=PLoROMvodv4rOhcuXMZkNm7j3fVwBBY42z)
- [Natural Language Processing (NLP) Zero to Hero](https://www.youtube.com/playlist?list=PLQY2H8rRoyvzDbLUZkbudP-MFQZwNmU4S)
- [Natural Language Processing - Stanford University](https://www.youtube.com/playlist?list=PLLssT5z_DsK8HbD2sPcUIDfQ7zmBarMYv)
- [Natural Language Processing Specialization](https://www.deeplearning.ai/program/natural-language-processing-specialization/)


# What is Natural Language Processing?

NLP is a field of Computer Science dealing with methods to analyze, model, and understand human language. It is composed of tasks like: 

* **Speech recognition (or speech-to-text)**: This is what Google uses for converting your voice into text when you dictate a short message to a friend on your phone or when you do a voice search in the Google app. Several factors make this task challenging: different people have different ways of speaking (tone, pronunciation, emphasis), and the system also has to handle improper uses of language (grammatical errors) and background noise.

* **Part-of-speech tagging**: Have you ever wondered how applications like [Grammarly](https://grammarly.com/) or [Microsoft Word](https://www.microsoft.com/en-us/microsoft-365/word) can check the grammar in the texts we write? To do their job, they use (among others) a process that determines the part of speech (PoS) of a particular word or piece of text based on its use and context. If you want to see a demo, check this website: https://huggingface.co/flair/pos-english. The following picture gives you an example. 

![PoS Tagging](../../../coursedata/R5/post-tagging.png)
  

* **Word sense disambiguation**: Words have several meanings, for example, if I say the word '*banco*' to a Spanish-speaking person, her or his first thought would most probably be a bank (financial institution), but I could be talking about a park bench, or maybe I was thinking of a shoal of fish. The point is that the meaning of a word most of the time is subject to the context in which it is used. Word sense disambiguation is the selection of the meaning of a word with multiple meanings through a process of semantic analysis. Semantic analysis is the process of drawing meaning from text. It allows computers to understand and interpret sentences, paragraphs, or whole documents, by analyzing their grammatical structure, and identifying relationships between individual words in a particular context.

* **Named entity recognition**: Named entities are sets of elements that are relevant to understanding a text. Named Entity Recognition (NER) is the process of finding entities that can be put under categories like names, organizations, locations, quantities, monetary values, percentages, etc. In the example given in the figure, what do you think is more useful, having *Aalto* and *University* as two separate words or having *Aalto University* as a unit? To be fair, the answer depends on your final goal, but being able to recognize that "Aalto University" is an organization, or that "Alex" is a proper noun is extremely useful when you are creating relationships between entities.

![NER](../../../coursedata/R5/ner.png)

* **Sentiment analysis**: If you are a company that sells, let's say, bikes, you most probably are interested in knowing how your customers feel about the quality of your bikes. But maybe you want to go even further and you are not only interested in the opinions of the bikes as a whole, but you want to know what people think about the brakes, or the wheels or the crank arm. All this information could be extracted from the reviews that customers give you and being able to analyze it properly is possible with sentiment analysis.

* **Natural language generation**:  Nowadays, chatbots are present on many websites. They try to guide us through the website, or answer frequently asked questions. Some of them are just rule-based, but you can find some with the ability to generate text. These chatbots are known as conversational systems and their final goal is to generate a text that can sound human-produced.

There are many other tasks within NLP, like **Topic Modeling**, **Information Retrieval**, **Question Answering**, **Image Captioning**, etc., but because this is an introductory course, we cannot cover all of them. If you want a broader introduction, consult the article [Natural Language Processing (NLP)](https://www.ibm.com/cloud/learn/natural-language-processing).

# Document representation

Document representation is a crucial step in any NLP task. Traditionally, the model used to represent documents as vectors is called the **Vector Space Model**. In its simplest form, it assumes that the words in a document are unrelated to each other, so the document is just a *bag-of-words*. This model is simple to implement, but it has some disadvantages:

* The meaning and the structure of documents cannot be expressed.
* Each word is independent of the others; word sequences or any other type of relationship cannot be expressed.
* If two documents have similar meanings but different vocabularies, calculating the similarity between the two of them can be difficult.

Given those cons, other models for document representation have been developed, some of which are:

* **Latent Semantic Analysis (LSA)**: It is based on the idea that (1) meaning is contextually dependent and (2) in the contextual use, there are semantic relationships that are latent. Read more in:
   > Martin, D.I. and Berry, M.W., 2007. *Mathematical foundations behind latent semantic analysis*. In Handbook of latent semantic analysis, pp.35-56.
* **Probabilistic Latent Semantic Analysis (PLSA)**: It is a statistical technique based on the general model of latent variables and it's an alternative to the LSA. Read more in:
   > Hofmann, T., 2013. *[Probabilistic latent semantic analysis](https://arxiv.org/ftp/arxiv/papers/1301/1301.6705.pdf)*. arXiv preprint arXiv:1301.6705.
   
* **Latent Dirichlet Allocation (LDA)**. It was proposed to address the shortcomings of PLSA using probabilistic modeling. Probabilistic modeling assumes that the results of observations come from a generative model in which there are variables that we cannot observe (latent variables). In the case of documents, the latent variables represent the thematic structure of the documents. Read more in:
   > Blei, D.M., Ng, A.Y. and Jordan, M.I., 2003. *[Latent Dirichlet allocation](https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf?TB_iframe=true&width=370.8&height=658.8)*. Journal of Machine Learning Research, 3, pp.993-1022. 

* **Random Indexing**: This is an incremental model of word space that is based on the accumulation of context vectors from the occurrence of words in contexts. Read more in:
   > Sahlgren, M., 2005. *[An introduction to random indexing](https://www.diva-portal.org/smash/get/diva2:1041127/FULLTEXT01.pdf)*. In Methods and applications of semantic indexing workshop at the 7th international conference on terminology and knowledge engineering.
* **Language Models**: A statistical model of the language is nothing more than a probability distribution P(s) over the possible sentences, expressions, documents, or any other linguistic unit of the language. A correctly adjusted model will assign high probabilities to well-formed language sentences and low probabilities to infrequent or badly formed sentences. Read more in:
   > Rosenfeld, R., 2000. *[Two decades of statistical language modeling: Where do we go from here?](https://kilthub.cmu.edu/articles/Two_Decades_of_Statistical_Language_Modeling_Where_Do_We_Go_From_Here_/6611138/files/12103316.pdf)*. Proceedings of the IEEE, 88(8), pp.1270-1278.
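
To make the idea of a probability distribution $P(s)$ over sentences concrete, here is a minimal sketch of a bigram language model estimated by counting. The toy sentences, the `<s>` and `</s>` boundary markers, and the function name `sentence_probability` are assumptions made for illustration; real models are trained on far larger corpora and use smoothing so that unseen word pairs do not receive probability zero.

```python
from collections import Counter

# Toy training corpus (hypothetical); <s> and </s> mark sentence boundaries.
sentences = [
    "<s> the dog barks </s>",
    "<s> the dog sleeps </s>",
    "<s> the sun shines </s>",
]

context_counts = Counter()  # how often each word appears as the previous word
bigram_counts = Counter()   # how often each (previous, next) pair appears

for sentence in sentences:
    tokens = sentence.split()
    context_counts.update(tokens[:-1])
    bigram_counts.update(zip(tokens, tokens[1:]))

def sentence_probability(sentence):
    """P(s) as a product of bigram probabilities P(word | previous word)."""
    tokens = sentence.split()
    probability = 1.0
    for previous, word in zip(tokens, tokens[1:]):
        if context_counts[previous] == 0:
            return 0.0  # unseen context: no estimate without smoothing
        probability *= bigram_counts[(previous, word)] / context_counts[previous]
    return probability

p_well_formed = sentence_probability("<s> the dog barks </s>")
p_badly_formed = sentence_probability("<s> dog the barks </s>")
print(p_well_formed, p_badly_formed)
```

The well-formed sentence gets a higher probability than the scrambled one, which is the behaviour described in the Language Models item above.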

## Bag-of-Words 

This model represents a document as a vector over a vocabulary $V=[w_1,w_2,\dots,w_{|V|}]$, where the **i-th** component of the vector is the frequency of the **i-th** word of the vocabulary in the document.

Let's illustrate the model with an example. Let's assume that we have the following vocabulary and set of documents:

```python
V = [
  'aalto', 'art', 'bold', 'build', 'business', 
  'challenges', 'community', 'creating', 'future', 'global',
  'is', 'major', 'meet', 'novel', 'science', 
  'solutions', 'sustainable', 'technology', 'thinkers', 'university'
]

documents = [
    "Aalto University is a community of bold thinkers where science and art meet technology and business",
    "We build a sustainable future by creating novel solutions to major global challenges",
]
```
$|V| = 20$

Then each document will be represented by a vector of 20 components, which is the length of the vocabulary. With realistic vocabularies of thousands of words, this vector is sparse, which means that most of its components are zero. The vectorial representation of our documents is:

```python

X = [
   [1, 1, 1, 0, 1,   0, 1, 0, 0, 0,   1, 0, 1, 0, 1,  0, 0, 1, 1, 1],
   [0, 0, 0, 1, 0,   1, 0, 1, 1, 1,   0, 1, 0, 1, 0,  1, 1, 0, 0, 0],
]
```
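
Count vectors of this kind can be built with a few lines of plain Python. The sketch below is only an illustration: the helper name `bag_of_words` is hypothetical, and splitting on whitespace after lowercasing is a simplifying assumption that works for these two sentences but not for text with punctuation.

```python
from collections import Counter

V = [
    'aalto', 'art', 'bold', 'build', 'business',
    'challenges', 'community', 'creating', 'future', 'global',
    'is', 'major', 'meet', 'novel', 'science',
    'solutions', 'sustainable', 'technology', 'thinkers', 'university'
]

documents = [
    "Aalto University is a community of bold thinkers where science and art meet technology and business",
    "We build a sustainable future by creating novel solutions to major global challenges",
]

def bag_of_words(document, vocabulary):
    """Return the frequency of every vocabulary word in the document."""
    counts = Counter(document.lower().split())
    # Words outside the vocabulary (e.g. 'a', 'of', 'we') are simply ignored.
    return [counts[word] for word in vocabulary]

X = [bag_of_words(document, V) for document in documents]

for row in X:
    print(row)
```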

The value in `X[0][0]` represents the frequency of the word `aalto` in the first document. Raw counts are often re-weighted with TF-IDF, which stands for **Term Frequency - Inverse Document Frequency**. TF-IDF assigns each word a weight that reflects its importance to a document relative to the whole corpus: the more documents of the corpus a word appears in, the lower its corresponding component in the vector. If we want to classify documents, for instance, we would like to assign more weight to words that occur often inside a document but rarely in the rest. Doing so, we obtain vectors that characterize documents with similar meanings.

Its mathematical formulation is:

$\text{idf}(t, D) = \log \frac{|D|}{|\{d \in D: t \in d\}|}$

$\text{tfidf}(t, d, D) = \text{tf}(t, d) \times \text{idf}(t, D)$


where:

* $\text{tf}(t, d)$ is the frequency of the term $t$ in the document $d$. It can be defined as:
    - boolean frequency: $\text{tf}(t, d) = 1$ if $t$ is in $d$, zero otherwise
    - logarithmic frequency: $\text{tf}(t, d) = 1 + \log{f(t, d)}$, where $f(t, d)$ is the number of times that $t$ occurs in $d$. If $t$ is not in $d$, then $\text{tf}(t, d) = 0$
    - normalized frequency: $\text{tf}(t, d) = \frac{f(t, d)}{\max\{f(x, d): x \in d\}}$, where $f(t, d)$ is defined as in the logarithmic frequency
   
* $|D|$ is the number of documents in the corpus
* $|\{d \in D: t \in d\}|$ is the number of documents containing the term $t$
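
The formulas above translate almost directly into code. The following sketch uses the raw count $f(t, d)$ as the term frequency and the natural logarithm; both are choices made here for illustration, and the toy corpus and function names are hypothetical. Note that `idf` as written is undefined for a term that appears in no document.

```python
import math

# Toy corpus (hypothetical): each document is a list of tokens.
toy_corpus = [
    "this is the first document".split(),
    "this document is the second document".split(),
    "and this is the third one".split(),
]

def tf(term, document):
    """Raw term frequency f(t, d): number of times the term occurs."""
    return document.count(term)

def idf(term, corpus):
    """log(|D| / number of documents containing the term)."""
    n_containing = sum(1 for document in corpus if term in document)
    return math.log(len(corpus) / n_containing)

def tfidf(term, document, corpus):
    return tf(term, document) * idf(term, corpus)

# 'this' occurs in every document, so it carries no weight.
weight_this = tfidf("this", toy_corpus[0], toy_corpus)
# 'second' occurs in only one document, so it is highly characteristic.
weight_second = tfidf("second", toy_corpus[1], toy_corpus)
# 'document' occurs twice here, but also in another document.
weight_document = tfidf("document", toy_corpus[1], toy_corpus)

print(weight_this, weight_second, weight_document)
```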

For a more comprehensive explanation, check these resources:

* Towardsdatascience blog post [TF-IDF from scratch in python on real world dataset.](https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-on-real-world-dataset-796d339a4089)
* Wiki [tf–idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
* Kdnuggets blog post [WTF is TF-IDF?](https://www.kdnuggets.com/2018/08/wtf-tf-idf.html)

## Python Libraries for NLP

There are many high-performance libraries for NLP in the Python ecosystem. Some of the more popular ones are:

* [**Natural Language Toolkit (NLTK)**](https://www.nltk.org/): It is a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. It is an open-source library licensed under the [Apache License Version 2.0](http://www.apache.org/licenses/LICENSE-2.0) license. 
* [**SpaCy**](https://spacy.io/): It is described as a production-ready training system with support for 64+ languages and integration with Deep Learning frameworks like  PyTorch and TensorFlow. It is an open-source library licensed under the [MIT](https://mit-license.org/) license.
* [**Gensim**](https://radimrehurek.com/gensim/): It is oriented to Topic Modeling. It offers implementations for  Word2Vec, FastText, Latent Semantic Indexing (LSI, LSA, LsiModel), Latent Dirichlet Allocation (LDA, LdaModel), etc. It is an open-source library licensed under the OSI-approved [GNU LGPLv2.1](https://www.gnu.org/licenses/old-licenses/lgpl-2.1.en.html) license (free for both personal and commercial uses). 
* [**Sklearn**](https://scikit-learn.org/stable/index.html): It is described as a simple and efficient tool for predictive data analysis. It implements several algorithms for classification, regression, clustering, dimensionality reduction, model selection, pre-processing. It is an open-source library licensed under the [BSD](https://opensource.org/licenses/BSD-3-Clause) license.
* [**Keras**](https://keras.io/): It provides methods that allow designing and training of an ANN using a few lines of Python code. It is implemented as a wrapper for most popular deep learning frameworks like TensorFlow, Theano, and CNTK. It is an open-source library licensed under the [Apache License Version 2.0](http://www.apache.org/licenses/LICENSE-2.0) license.  
* [**Hugging Face**](https://huggingface.co/): Hugging Face is "the AI community building the future". Its mission is to democratize NLP and make models accessible. It provides resources like datasets, tokenizers, and transformers to perform NLP tasks such as sentiment analysis, coreference resolution, question answering, chatbots. If you want to learn more about what you can do with this library, take a look at [Introduction to Hugging Face ecosystem](https://huggingface.co/course/chapter0?fw=tf). 

Given that many of you used [Sklearn](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html) in the "Machine Learning with Python" course, we will use it in this section to illustrate the concepts described so far. We will use the [20 Newsgroups](http://qwone.com/~jason/20Newsgroups/) dataset. It is a collection of approximately 20,000 documents from 20 different newsgroups, talking about politics, religion, science, sports, etc. One of the challenges with this corpus is that some of the groups are similar to each other in subject matter.

Let's begin by loading the corpus.  In linguistics and NLP, corpus (literally Latin for body) refers to a collection of texts. We will select four newsgroups or categories out of 20.

In [None]:
import numpy as np 
from sklearn.datasets import fetch_20newsgroups # import  20 Newsgroups from sklearn

rng = np.random.RandomState(42)

# newsgroups
categories = [
    'alt.atheism',
    'talk.politics.guns',
    'comp.graphics',
    'sci.space',
]

# load train and test data from fetch_20newsgroups
train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=rng)
test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=rng)

In [None]:
print(f'There are a total of {len(train.data)} documents in the training set')
print(f'There are a total of {len(test.data)} documents in the test set')

As with many other sklearn datasets, the loaded sets are of the `Bunch` data type and contain the following keys:

In [None]:
train.keys()

`train.data` contains the text of the documents, `train.filenames` contains the file names of the documents, `train.target_names` contains the names of the categories, and `train.target` contains the numeric representation of each document's category (label). With `train.DESCR` you can print out information about the dataset.\
Let's print out all these data for the first file in the train set:

In [None]:
print(f"Data: {train.data[0][:40]}\n")                 # print first 40 characters of the first file in the train set
print(f"Filename: {train.filenames[0]}\n")
print(f"Category: {train.target_names[train.target[0]]}\n")  # look up the category name via the numeric label
print(f"Category, numeric label: {train.target[0]}")

Summary of train and test sets:

In [None]:
from collections import Counter 
import pandas as pd

# create a counter object from targets (category) of train and test sets
train_counter = Counter(train.target)
test_counter =  Counter(test.target)

# create a dataframe with the number of files belonging to each category
cl = pd.DataFrame(data={
    'Train': { **{ train.target_names[index]: count for index, count in train_counter.items()}, 'Total': len(train.target)},
    'Test':  { **{test.target_names[index]: count for index, count in test_counter.items()},  'Total': len(test.target)},
})

cl.columns = pd.MultiIndex.from_product([["Class distribution"], cl.columns])
cl

Let's have a closer look at the structure of our documents with an interactive Jupyter widget. You do not need to understand the code, just check its output.

In [None]:
import ipywidgets as widgets

max_index = 15 # You can adjust this value if you want to check more documents
dropDown = widgets.Dropdown(
    options=[f'{i + 1}' for i in range(max_index)],
    value='1',
    disabled=False,
    layout={'width': 'max-content'}
)

initial_index = int(dropDown.value) - 1  # the dropdown is 1-based, the data is 0-based

classLabel = widgets.Label(value=train.target_names[train.target[initial_index]])

documentContent = widgets.HTML(value=f'<textarea rows="25" cols="120" readonly>{train.data[initial_index]}</textarea>')

def handle_dropdown_change(change):
    index = int(change.new) - 1
    text = train.data[index]
    dclass = train.target_names[train.target[index]]
    
    # show the class and the content of the same document
    classLabel.value = dclass
    documentContent.value = f'<textarea rows="25" cols="120" readonly>{text}</textarea>'

dropDown.observe(handle_dropdown_change, names='value')

items = [widgets.Label(value='Document index:'), dropDown, widgets.Label(value='Document class:'), classLabel, documentContent]
widgets.GridBox(items, layout=widgets.Layout(grid_template_columns="repeat(2, 130px)"))

As you can see, the documents contain a lot of characters that are not useful for our analysis. Characters like `-`, `>` or `|` do not have any semantic weight and they can be safely removed.

<a id='St1'></a>
<div class=" alert alert-info">
    <h3><b>DEMO.</b> DOCUMENT REPRESENTATION. </h3>
    
Here we will demonstrate how to transform text to a TF-IDF-weighted document-term matrix.  
First, we need to convert a collection of text documents to a matrix of token counts. We do it with the sklearn `CountVectorizer` class.\
Then we create a TF-IDF-weighted version of this matrix with the `TfidfTransformer` class.\
We chain two steps with the sklearn `Pipeline` class.
</div>    

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# text document
corpus = ['this is the first document',
          'this document is the second document',
          'and this is the third one',
          'is this the first document']
# vocabulary
vocabulary = ['this', 'document', 'first', 'is', 'second', 'the',
              'and', 'one']

# create the count vectorizer, which tokenizes the text using the given vocabulary
token_matrix = CountVectorizer(vocabulary=vocabulary)
# create the TF-IDF transformer, which re-weights the count matrix
tfid_transform = TfidfTransformer()

# chain steps
pipe = Pipeline([('count', token_matrix),
                 ('tfid', tfid_transform)])

# fit data
pipe.fit(corpus)

Display the matrix of token counts:

In [None]:
pipe['count'].transform(corpus).toarray()

Display text converted to TF-IDF representation:

In [None]:
pipe.transform(corpus).toarray()
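
The numbers returned by the pipeline do not follow the plain formula given earlier. According to the scikit-learn documentation, `TfidfTransformer` by default uses a smoothed idf, $\ln\frac{1+|D|}{1+\text{df}(t)} + 1$, and then scales each row to unit Euclidean length. The sketch below recomputes this by hand for the same corpus and vocabulary using only the standard library. The helper name `l2_normalize` is hypothetical, and whitespace tokenization is a simplification that happens to be sufficient for this toy corpus.

```python
import math

corpus = ['this is the first document',
          'this document is the second document',
          'and this is the third one',
          'is this the first document']
vocabulary = ['this', 'document', 'first', 'is', 'second', 'the',
              'and', 'one']

# Token counts per document, restricted to the vocabulary.
docs = [text.split() for text in corpus]
counts = [[doc.count(term) for term in vocabulary] for doc in docs]

# Smoothed inverse document frequency (assumed scikit-learn default).
n_docs = len(docs)
doc_freq = [sum(1 for doc in docs if term in doc) for term in vocabulary]
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

def l2_normalize(row):
    """Scale a row so that its Euclidean length is 1."""
    norm = math.sqrt(sum(value * value for value in row))
    return [value / norm for value in row] if norm > 0 else row

tfidf = [l2_normalize([c * w for c, w in zip(row, idf)]) for row in counts]

for row in tfidf:
    print([round(value, 3) for value in row])
```

If the defaults are as documented, the printed rows should agree with the pipeline output up to rounding.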

<a id='St1'></a>
<div class=" alert alert-warning">
    <h3><b>STUDENT TASK 6.1.</b> DOCUMENT REPRESENTATION. </h3>
    
Your task is to implement a pre-processing pipeline to convert documents to vectors using bag-of-words and TF-IDF.    
</div>    

**Hints:**

To implement the `text_processing_pipeline()` function you will need to perform the following steps:

* Create a [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) which tokenizes the documents and creates vectors of word counts. Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens. These tokens are often loosely referred to as terms or words, but they could be words, numbers, acronyms, word-roots, or fixed-length character strings.\
You should set the parameter `stop_words` to `"english"` and take into consideration the parameter `features` of the `text_processing_pipeline()` function.\
We provide you a custom function `preprocess_text`, which removes unimportant symbols, to be set as the argument `preprocessor` of the `CountVectorizer` object. 
* Create a [TfidfTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html) to normalize the word counts matrix using TF-IDF. The default parameters are sufficient in this case.
* Create a [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) to perform both operations in a pipeline.
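
To make the notion of tokenization concrete, here is a small sketch that chops a string into tokens with a regular expression. The pattern `(?u)\b\w\w+\b` is the default `token_pattern` listed in the `CountVectorizer` documentation (words of two or more characters); the function name `simple_tokenize` and the example sentence are assumptions for illustration. The sketch does not remove stop words and is not a solution to the task.

```python
import re

# Words of at least two alphanumeric characters; punctuation acts as a separator.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def simple_tokenize(text):
    """Lowercase the text and split it into tokens."""
    return TOKEN_PATTERN.findall(text.lower())

tokens = simple_tokenize("Aalto University is a community of bold thinkers!")
print(tokens)
```

Single-character words such as "a" are dropped by this pattern, which is one reason why the vocabulary of a corpus is usually smaller than its set of distinct strings.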

In [None]:
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline

def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'\d+', '', text)        # Remove numbers
    text = re.sub(r'[-_]+', '', text)      # Remove lines like: "-----------------" or "______________"
    text = re.sub(r'\/*\|+|\/+', '', text) # Removes combinations like "/|", "||/" or "////"
    return text

def text_processing_pipeline(features=None):
    '''  Corpus pre-processing pipeline
    
    The input to the function is:
      - features: maximum number of features (size of the vocabulary) to use
      
    It returns a Pipeline object   
    '''
    # YOUR CODE HERE
    raise NotImplementedError()
    
    # create the count vectorizer
    # vectorizer = ...
    
    # create the TF-IDF transformer
    # tfidf = ...
    
    # create the pipeline
    # pipeline = ...
    
    return pipeline

Now we can use the text processing pipeline to create the training and test sets:

In [None]:
pipeline = text_processing_pipeline(features=10000)
X_train = pipeline.fit_transform(train.data)
y_train = train.target

X_test = pipeline.transform(test.data)
y_test = test.target

In [None]:
# Sanity check

assert X_train.shape==(2203, 10000)
assert X_test.shape==(1466, 10000)

print("Sanity check passed!")

In [None]:
# this cell is for tests


<a id='St2'></a>
<div class=" alert alert-warning">
    <h3><b>STUDENT TASK 6.2.</b> DOCUMENT CLASSIFICATION. </h3>
    
Your task is to train a [LogisticRegression classifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) and evaluate its performance over the test data. The expected F1 score on the test set is ~0.94.
</div>   

**Hints:**

Because we are doing multi-class classification, you need to pass the parameter `average` to the `f1_score`. You can choose between `{'micro', 'macro', 'samples', 'weighted'}` but we suggest `'weighted'`. Read more about it in the documentation for the [f1_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html) function.
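
To see what the `average` parameter changes, the sketch below computes per-class F1 scores by hand and then combines them in two ways: `'macro'` takes the plain mean over classes, while `'weighted'` weights each class by its number of true examples. The toy labels and the helper name `per_class_f1` are hypothetical, and the code is independent of the classifier you will train.

```python
from collections import Counter

# Hypothetical true and predicted labels for three classes.
y_true_toy = [0, 0, 0, 0, 1, 1, 2]
y_pred_toy = [0, 0, 0, 1, 1, 1, 2]

def per_class_f1(y_true, y_pred, label):
    """F1 for one class treated as positive: 2TP / (2TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

labels = sorted(set(y_true_toy))
support = Counter(y_true_toy)  # number of true examples per class
scores = {label: per_class_f1(y_true_toy, y_pred_toy, label) for label in labels}

macro_f1 = sum(scores.values()) / len(labels)
weighted_f1 = sum(scores[label] * support[label] for label in labels) / len(y_true_toy)

print(f"macro: {macro_f1:.3f}, weighted: {weighted_f1:.3f}")
```

With imbalanced classes the two averages differ, because the weighted average is dominated by the largest class.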

Important: Store the predictions in a variable called `pred`.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# YOUR CODE HERE
raise NotImplementedError()

# define classifier with sklearn LogisticRegression
# clf = ...

# fit classifier to training set
# clf...

# get predictions for test set
# pred = ...

score = accuracy_score(y_test, pred)
print("Accuracy:   %0.3f" % score)

f1 = f1_score(y_test, pred, average='weighted')
print("      F1:   %0.3f" % f1)

In [None]:
# this cell is for tests


If you implemented everything correctly, you should get accuracy and F1 around **0.94**. This is a really good result, but keep in mind that we are only using 4 classes that are well separated. Let's now visualize the confusion matrix. If you have problems understanding what it shows, take a look at this article [Understanding Confusion Matrix](https://towardsdatascience.com/understanding-confusion-matrix-a9ad42dcfd62). One thing to notice is that Sklearn flips the values, and we get the matrix in the following form:


<img src='../../../coursedata/R5/confussion_matrix.png' width=400/>

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Code source https://stackoverflow.com/questions/62722416/plot-confusion-matrix-for-multilabel-classifcation-python
def print_confusion_matrix(confusion_matrix, axes, class_label, class_names, fontsize=14):

    df_cm = pd.DataFrame(
        confusion_matrix, index=class_names, columns=class_names,
    )

    try:
        heatmap = sns.heatmap(df_cm, annot=True, fmt="d", cbar=False, ax=axes)
    except ValueError:
        raise ValueError("Confusion matrix values must be integers.")
        
    heatmap.yaxis.set_ticklabels(heatmap.yaxis.get_ticklabels(), rotation=0, ha='right', fontsize=fontsize)
    heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), rotation=45, ha='right', fontsize=fontsize)
    axes.set_ylabel('True label')
    axes.set_xlabel('Predicted label')
    axes.set_title("Confusion Matrix for the class - " + class_label)

In [None]:
from sklearn.metrics import multilabel_confusion_matrix
cfs_matrix = multilabel_confusion_matrix(y_test, pred)

fig, ax = plt.subplots(2, 2, figsize=(12, 7))
    
for axes, cfs, label in zip(ax.flatten(), cfs_matrix, train.target_names):
    print_confusion_matrix(cfs, axes, label, ["N", "P"])

fig.tight_layout()
plt.show()
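
Each of the four panels above is a 2x2 matrix in which one class is treated as positive and all the others as negative. The sketch below builds such a matrix by hand in the layout `[[TN, FP], [FN, TP]]`, which is the layout documented for `multilabel_confusion_matrix`. The toy labels and the function name `one_vs_rest_confusion` are hypothetical.

```python
def one_vs_rest_confusion(y_true, y_pred, positive_class):
    """Return [[TN, FP], [FN, TP]] with `positive_class` against the rest."""
    tn = fp = fn = tp = 0
    for true_label, predicted_label in zip(y_true, y_pred):
        is_positive = true_label == positive_class
        predicted_positive = predicted_label == positive_class
        if is_positive and predicted_positive:
            tp += 1
        elif is_positive:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    return [[tn, fp], [fn, tp]]

# Hypothetical labels for four classes.
y_true_toy = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred_toy = [0, 1, 1, 1, 2, 3, 3, 3]

for label in range(4):
    print(label, one_vs_rest_confusion(y_true_toy, y_pred_toy, label))
```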

## Deep Learning for NLP

The techniques discussed above only consider the linear relationships between words, and in many cases, you need an expert to define the features to use in each task. With Artificial Neural Networks (ANN) we can accomplish feature extraction in a completely automated way. We saw that the Bag-of-Words model ignores the context in which a word is used. This means that it also ignores the effect that the neighbors of a word have on its meaning and on the whole meaning of a statement.

When we use Bag-of-Words, each word is represented with a vector that is the result of one-hot encoding. For example, let's say that our vocabulary is composed of the following words:

```python
V = ['dog', 'painting', 'sun', 'winter', ]
```
then the word vectors for each word are:

```python
word_vector['dog']      = [1, 0, 0, 0]
word_vector['painting'] = [0, 1, 0, 0]
word_vector['sun']      = [0, 0, 1, 0]
word_vector['winter']   = [0, 0, 0, 1]
```

It is easy to see that these vectors are orthogonal, which means that there is no natural notion of similarity between them. Using ANNs, it is possible to build dense vectors for each word that can capture the concept of synonyms, antonyms, or words that just belong to the same category, such as people, animals, places, etc. These vectors are called word vectors or word embeddings, and [Hapke et al.](#hapke) define them as:


> Word vectors are numerical vector representations of word semantics or meaning, including literal and implied meaning. So word vectors can capture the connotation of words, like "peopleness", "animalness", "placeness", "thingness", and even "conceptness". And they combine all that into a dense vector (no zeros) of floating-point values. This dense vector enables queries and logical reasoning.
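
The difference between one-hot vectors and dense vectors can be checked numerically with cosine similarity. In the sketch below, the one-hot vectors are the ones listed earlier, while the three-dimensional dense vectors are invented by hand purely for illustration; they do not come from any trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

one_hot = {
    'dog':      [1, 0, 0, 0],
    'painting': [0, 1, 0, 0],
    'sun':      [0, 0, 1, 0],
    'winter':   [0, 0, 0, 1],
}

# Hypothetical dense vectors, made up so that 'sun' and 'winter' are close.
dense = {
    'dog':      [0.9, 0.1, 0.0],
    'painting': [0.1, 0.8, 0.2],
    'sun':      [0.0, 0.3, 0.9],
    'winter':   [0.1, 0.2, 0.8],
}

# Any two different one-hot vectors are orthogonal: similarity is zero.
print(cosine_similarity(one_hot['sun'], one_hot['winter']))

# Dense vectors can express that some words are closer than others.
print(cosine_similarity(dense['sun'], dense['winter']))
print(cosine_similarity(dense['sun'], dense['dog']))
```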

The following picture shows a 3D projection of the embeddings calculated using Google's [Word2Vec](https://www.tensorflow.org/tutorials/text/word2vec#embedding_lookup_and_analysis). To create the graph, they first calculate [PCA](https://machinelearningmastery.com/principal-components-analysis-for-dimensionality-reduction-in-python/) and then position the words according to the cosine distance between the vectors in the original space.

![Word Embeddings visualization](../../../coursedata/R5/w2v-visualization.png)

The highlighted word is **machine**. As we expect, words like *learning*, *computer* and *translation* are among the 100 most similar words. In this case, the embedding has captured the use of machine in the Machine Learning field. But we can also see other similar words like *gun*, *weapon*, *rifle* and *pistol*, which belong to a completely different domain. On the website [Embedding Projector](https://projector.tensorflow.org/) you can explore more and play with different representations.

But how exactly are these vectors calculated? Word2Vec was developed by [Tomas Mikolov and colleagues at Google in 2013](https://arxiv.org/pdf/1310.4546.pdf). The goal is to create a model that can learn high-quality word vectors from **huge** data sets, typically with billions of words in total and millions of unique words in the vocabulary. This is achieved with the basic task of predicting which words occur in the context of other words. 

![Word2Vec Intuition](../../../coursedata/R5/w2v_intuition.png)

In the figure, the words in the green rectangles are in the context, and the center word is *community*. The analysis happens in a *sliding window* which in this case is of size 2. Two models can be used: Continuous Bag-of-Words (CBOW) and Continuous Skip-gram.

![Word2Vec Architectures](../../../coursedata/R5/word2vec_architectures.png)

1. Continuous Bag-of-Words: Given its context, the goal is to predict the center word. It is faster than Skip-gram and has better representations for more frequent words.
2. Continuous Skip-gram: Given the center word, the goal is to predict the context. This method works well with a small amount of data and is found to represent rare words well.
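
The sliding window and the two training setups can be sketched in a few lines. The code below assumes that a window of size 2 means up to two words on each side of the center word; the example sentence and the function name `context_windows` are hypothetical, and the sketch only generates training examples, it does not train a model.

```python
def context_windows(tokens, window=2):
    """Yield (center word, list of context words) for every position."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right

tokens = "aalto university is a community of bold thinkers".split()

# CBOW: the context words are the input, the center word is the target.
cbow_examples = [(context, center) for center, context in context_windows(tokens)]

# Skip-gram: the center word is the input, each context word is a target.
skipgram_pairs = [
    (center, context_word)
    for center, context in context_windows(tokens)
    for context_word in context
]

print(cbow_examples[4])
print(skipgram_pairs[:5])
```

Words near the beginning or end of the sentence have fewer neighbors, so their context lists are shorter.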

The ANN architecture for both models is really simple, and resembles what was shown in the previous picture: the input layer, a hidden layer, and the output layer.

![Word2Vec ANN](../../../coursedata/R5/w2vec_ann.png)
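
As a rough illustration of the three layers, the sketch below runs one forward pass of a skip-gram style network in plain Python: a one-hot input selects a row of the input weight matrix (the hidden layer), and the output layer turns the scores into a probability for every word in the vocabulary. The vocabulary, the embedding size, and the random weights are all assumptions for illustration. There is no training step here, and practical Word2Vec implementations avoid the full softmax with techniques such as negative sampling.

```python
import math
import random

random.seed(0)

vocab = ["aalto", "university", "is", "a", "community"]
embedding_dim = 3  # hypothetical size of the hidden layer

# Input weights: one row (word vector) per vocabulary word.
W_in = [[random.uniform(-0.5, 0.5) for _ in range(embedding_dim)] for _ in vocab]
# Output weights: one column per vocabulary word.
W_out = [[random.uniform(-0.5, 0.5) for _ in vocab] for _ in range(embedding_dim)]

def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to one."""
    highest = max(scores)
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]

def forward(center_word):
    """Probability of each vocabulary word appearing in the context."""
    # Multiplying a one-hot vector by W_in just selects one row.
    hidden = W_in[vocab.index(center_word)]
    scores = [
        sum(hidden[k] * W_out[k][j] for k in range(embedding_dim))
        for j in range(len(vocab))
    ]
    return softmax(scores)

probabilities = forward("community")
for word, probability in zip(vocab, probabilities):
    print(f"{word}: {probability:.3f}")
```

After training, the rows of the input weight matrix are what we use as word embeddings.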

Because of time constraints, we won't offer a full derivation, but Professor [Christopher Manning](https://nlp.stanford.edu/~manning/) gives an extensive explanation of the process in the video [Introduction and Word Vectors](https://youtu.be/rmVRLeJRkl4), which is the first lecture of the [Stanford CS224N NLP with Deep Learning](https://web.stanford.edu/class/cs224n/index.html) course. 

We will be using pre-trained English word embeddings, but the [Turku NLP Group](https://turkunlp.org/) has trained a model for the Finnish language that you can download from this link:

http://dl.turkunlp.org/finnish-embeddings/. 

You can find more info on their website [Finnish NLP](https://turkunlp.org/finnish_nlp.html). Pre-trained embeddings for other languages are available in the [NLPL word embeddings repository](http://vectors.nlpl.eu/repository/), maintained by the University of Oslo.

Besides word similarity, word embeddings also allow us to calculate *word analogies*. The classical example of this is:  $\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}} = \vec{\text{queen}}$. 

In the article [Word Embedding Analogies: Understanding King - Man + Woman = Queen](https://kawine.github.io/blog/nlp/2019/06/21/word-analogies.html), the author explains why these calculations hold. We also recommend reading the paper [Towards Understanding Linear Word Analogies](https://arxiv.org/pdf/1810.04882.pdf). The Turku NLP Group has an online demo where you can play with word analogies:

http://bionlp-www.utu.fi/wv_demo/
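The analogy arithmetic can be sketched with a handful of hand-made vectors. Everything in this example is hypothetical: the 3-dimensional vectors were chosen by hand so that the arithmetic works out, whereas real embeddings are learned from data and the result is only approximately equal to the target word. The helper `analogy` is an illustrative name, not part of any library.

```python
import numpy as np

# Hand-made 3-d vectors, hypothetical and for illustration only
# (dimensions loosely: royalty, gender, fruit); real embeddings are learned.
toy_embeddings = {
    "king":  np.array([0.9,  0.8, 0.0]),
    "queen": np.array([0.9, -0.8, 0.0]),
    "man":   np.array([0.1,  0.8, 0.0]),
    "woman": np.array([0.1, -0.8, 0.0]),
    "apple": np.array([0.0,  0.0, 1.0]),
}


def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine_similarity(target, candidates[w]))


result = analogy("king", "man", "woman", toy_embeddings)
print(result)
```

Excluding the three query words from the candidates is a common convention in analogy demos, because otherwise one of the input words is often the nearest neighbour of the result.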

### Document classification with word embeddings

Let's see how we can use word embeddings to classify documents. The [complete pre-trained model](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing) is a binary file of ~3.4 GB. This means that you will need at least 16 GB of RAM on your computer to load the whole file. We have extracted a subset with 20,000 words present in the collection we are working with. If you are interested in training your own model, take a look at the tutorial [Word2Vec](https://www.tensorflow.org/tutorials/text/word2vec) from the TensorFlow documentation.

The first step is to load the subset.

In [None]:
import pickle
from pathlib import Path

embeddings_path = Path().cwd() / '..' / '..' / '..' / 'coursedata' / 'R5' / '20newsgroups_subset_vocabulary_embeddings.p'

with open(embeddings_path, "rb") as f:
    embeddings = pickle.load(f)
    vocabulary = list(embeddings.keys())
    
print(f'The vocabulary has a total of {len(vocabulary)} words')

The variable `embeddings` is a dictionary of (key, value) pairs, where the key is a word and the value is that word's embedding vector. Each embedding vector is a NumPy array of length 300. For example, at index 7 we have the following (word, embedding) pair:

In [None]:
print(f"Word: {list(embeddings.keys())[7]}")  
print(f"\nEmbedding vector shape: {list(embeddings.values())[7].shape}") 
print(f"\nEmbedding vector (first 50 values): \n\n {list(embeddings.values())[7][:50]}")

We create training and validation subsets from our `train` dataset (loaded with sklearn's `fetch_20newsgroups` at the beginning of the notebook).

In [None]:
# Extract a training & validation split
validation_split = 0.2
num_validation_samples = int(validation_split * len(train.data))

train_samples = train.data[:-num_validation_samples]
val_samples = train.data[-num_validation_samples:]

train_labels = train.target[:-num_validation_samples]
val_labels = train.target[-num_validation_samples:]

test_samples = test.data
test_labels = test.target

To process the raw text, we will use the Keras layer [TextVectorization](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization). This is a simple way of performing the pre-processing step, but bear in mind that you cannot use it if you need more complex pre-processing, for example, [lemmatization](https://www.machinelearningplus.com/nlp/lemmatization-examples-python/) or [stemming](https://www.datacamp.com/community/tutorials/stemming-lemmatization-python). Because we already have a vocabulary, we can pass it to TextVectorization and make sure that only words coming from it are part of our documents. In practice, you may not have a pre-defined vocabulary. In that case, you will need to "learn" the vocabulary with the `.adapt()` method:

```python
vectorizer = TextVectorization(
    output_mode="int",
    max_tokens=20000,
    output_sequence_length=1000
)
text_ds = tf.data.Dataset.from_tensor_slices(train_samples).batch(32)
vectorizer.adapt(text_ds)
```
 
The function `TextVectorization.adapt` is similar to the function `fit` in scikit-learn models. According to the documentation: "Fits the state of the preprocessing layer to the data being passed".

In [None]:
import string

from nltk import word_tokenize
from nltk.corpus import stopwords

import tensorflow as tf
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

vectorizer = TextVectorization(
    output_mode="int", 
    standardize="lower_and_strip_punctuation", # lowercase all the words and remove punctuation
    output_sequence_length=500 # each document will be reduced to a vector of 500 integers
)
vectorizer.set_vocabulary(vocabulary)

Let's vectorize a test sentence:

In [None]:
sentence = "Robert Plant wrote a hell of a song"

output = vectorizer(np.array([sentence]))
output.numpy()[0, :8]

The index 1 corresponds to the token **'[UNK]'** which is used to represent unknown words (missing words in the vocabulary). Do you have an idea why "a" and "of" are not in the vocabulary? 
<details>
    <summary>Answer</summary>
    <p>Both "a" and "of" are in the set of stopwords</p>
</details>
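To build some intuition for what the vectorizer does, here is a rough pure-Python sketch of the same idea: lowercase the text, strip punctuation, split on whitespace, map each word to an integer, and pad or truncate to a fixed length. This is not the Keras implementation. The function `toy_vectorize` and the small vocabulary are made up, and reserving index 0 for padding and index 1 for unknown words is an assumption that mirrors the default behaviour of `TextVectorization` in `"int"` mode.

```python
import string


def toy_vectorize(text, vocabulary, sequence_length):
    """Map a text to a fixed-length list of integer token ids."""
    # Assumption: 0 is reserved for padding and 1 for unknown words ([UNK]).
    word_to_id = {word: i + 2 for i, word in enumerate(vocabulary)}
    # Lowercase and strip punctuation, then split on whitespace.
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    ids = [word_to_id.get(token, 1) for token in cleaned.split()]
    # Truncate long documents and pad short ones with 0.
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))


# Hypothetical vocabulary without stopwords
toy_vocabulary = ["song", "wrote", "plant", "robert", "hell"]
ids = toy_vectorize("Robert Plant wrote a hell of a song", toy_vocabulary, 10)
print(ids)  # [5, 4, 3, 1, 6, 1, 1, 2, 0, 0]
```

The words "a" and "of" are missing from the toy vocabulary, so they are mapped to index 1, and the two trailing zeros are padding.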

Now, let's prepare a corresponding embedding matrix that we can use in a Keras `Embedding` layer. It's a simple NumPy matrix where the entry at index `i` is the pre-trained vector for the word with index `i` in our `vectorizer`'s vocabulary. 

The Embedding layer can be understood as a lookup table that maps from integer indices (which stand for specific words) to dense vectors (their embeddings). The dimensionality (or width) of the embedding is a parameter you can experiment with to see what works well for your problem, much in the same way you would experiment with the number of neurons in a Dense layer.
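The lookup-table view can be illustrated with plain NumPy indexing. This is only a sketch of the idea, not how Keras implements the layer; the matrix `toy_matrix` holds random values and the sizes are deliberately tiny.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (hypothetical): 6 tokens in the vocabulary, 4-dimensional embeddings
num_tokens, embedding_dim = 6, 4
toy_matrix = rng.normal(size=(num_tokens, embedding_dim))
toy_matrix[0] = 0.0  # row 0 plays the role of the padding token

# Two "documents", each a sequence of 3 token ids
token_ids = np.array([[2, 5, 0],
                      [1, 3, 3]])

# The lookup is plain integer indexing into the rows of the matrix
embedded = toy_matrix[token_ids]

print(embedded.shape)  # (2, 3, 4)
```

Each integer is replaced by the corresponding row of the matrix, so a batch of shape (documents, tokens) becomes a tensor of shape (documents, tokens, embedding dimension).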

In [None]:
voc = vectorizer.get_vocabulary() # It will include tokens like '[UNK]' and the padding token ""
word_index = dict(zip(voc, range(len(voc))))

In [None]:
num_tokens = len(voc)
embedding_dim = 300 
hits = 0
misses = 0

# Prepare embedding matrix
# Rows for words not found in `embeddings` stay all-zeros.
# This includes the rows for the padding token and "[UNK]" (OOV).
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        misses += 1
print(f"Converted {hits} words ({misses} misses)")

Next, we load the pre-trained word embeddings matrix into an `Embedding` layer. Note that we set `trainable=False` to keep the embeddings fixed, because we don't want to update them during training. We only need the ANN to learn the weights of the other layers.

In [None]:
from tensorflow import keras
from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(
    num_tokens,
    embedding_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=False,
)

Before we can start training the model, we need to vectorize the training, validation, and test sets:

In [None]:
x_train = vectorizer(np.array([[s] for s in train_samples])).numpy()
x_val   = vectorizer(np.array([[s] for s in val_samples])).numpy()
x_test  = vectorizer(np.array([[s] for s in test_samples])).numpy()

y_train = train_labels
y_val   = val_labels
y_test  = test_labels

print(f"Training set shape: {x_train.shape}")
print(f"Validation set shape: {x_val.shape}")
print(f"Test set shape: {x_test.shape}")

Let's see the result of text vectorization. As an example, we select the training sample with index 3:

In [None]:
print(f"Original text:\n {train_samples[3][:103]}\n")
print(f"Vectorized text:\n {x_train[3][:13]}")

If we look up the values of the vectorized text in our vocabulary `voc`, we get:

In [None]:
voc[1], voc[302], voc[6], voc[303], voc[304],voc[305],voc[306],voc[307],voc[308]

Thus, each word in the original text is assigned a numeric value according to our vocabulary.

<img src="../../../coursedata/R5/vectorized.png" width="800"/>

The vectorized text is then used to build the corresponding matrix with pre-trained embeddings. 

![Embeddings](../../../coursedata/R5/embeddings.png)

We can confirm this by checking the output of the embedding layer we created before. We have 1763 vectorized documents in the training set, each 500 tokens long, and the embedding matrix consists of ~20k embedding vectors of length 300. After passing the vectorized training set through the layer, we get an embedding vector for each token in the training set, so the output has shape 1763x500x300.

In [None]:
# pass data to emb layer (matrix)
x_train_emb = embedding_layer(x_train)

print(f"Training set shape: {x_train.shape}")
print(f"Training set shape after embedding layer: {x_train_emb.shape}")

We can also confirm that the returned data are just the corresponding embedding vectors from `embedding_matrix`:

In [None]:
# look up the row of the embedding matrix by the token id itself
token_id = x_train[0][2]
print(f"Vectorized document ind=0, token ind=2: {token_id}\n")
print(f"Corresponding embedding vector: {embedding_matrix[token_id][:5]}\n")
print(f"Embedding for document ind=0, token ind=2: {x_train_emb[0][2][0:5]}\n")

<a id='St3'></a>
<div class=" alert alert-warning">
    <h3><b>Student task 6.3.</b> Define a simple CNN to perform document classification.</h3>

The structure of the ANN should be the following:
* `embedding_layer` as the input layer
* Two blocks of one [Conv1D](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Conv1D) (use 128 filters, a kernel of size 5 and activation function "relu") followed by a [MaxPool1D](https://www.tensorflow.org/api_docs/python/tf/keras/layers/MaxPool1D) (with pool size 2)
* A Conv1D (with the same parameters as above) followed by a [GlobalMaxPool1D](https://www.tensorflow.org/api_docs/python/tf/keras/layers/GlobalMaxPool1D) (with the default parameters)
* A [Dense](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense) with 128 units followed by a [Dropout](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dropout) (with 0.5 rate)
* A Dense layer (output layer) with `m` units and activation `"softmax"`  

</div>    
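Before building the network, it can help to work out how the sequence length changes through the convolution and pooling layers. The sketch below is plain arithmetic, not Keras code. It assumes the Keras defaults, that is, `padding="valid"` with stride 1 for `Conv1D` and a stride equal to the pool size for `MaxPool1D`. The helper names are made up for illustration.

```python
def conv1d_length(length, kernel_size, stride=1):
    # "valid" padding: the kernel must fit entirely inside the sequence
    return (length - kernel_size) // stride + 1


def maxpool1d_length(length, pool_size):
    # Non-overlapping windows (stride equal to pool size), "valid" padding
    return (length - pool_size) // pool_size + 1


length = 500  # output_sequence_length of the vectorizer
trace = [length]
for _ in range(2):  # two Conv1D + MaxPool1D blocks
    length = conv1d_length(length, kernel_size=5)
    trace.append(length)
    length = maxpool1d_length(length, pool_size=2)
    trace.append(length)
length = conv1d_length(length, kernel_size=5)  # final Conv1D
trace.append(length)

print(trace)  # [500, 496, 248, 244, 122, 118]
```

The global max pooling layer then collapses the remaining time steps into a single value per filter, so the dense layers receive one vector of 128 features per document. You can compare these numbers with the output of `model.summary()`.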

In [None]:
from tensorflow.keras import layers

# number of categories for classification
m = len(categories)

# YOUR CODE HERE
raise NotImplementedError()
# model = ...

model.summary()

In [None]:
# Perform some sanity checks on the solution
assert len(model.layers) == 10, "There should be 10 layers!"

print("Sanity checks passed!")

In [None]:
# this cell is for tests


<div class=" alert alert-warning">
    <h3><b>Student task. </b>Compile and train the model.</h3>
   
Your task is to:
    
1. Compile the model. Use categorical cross-entropy as the loss since we're doing softmax classification.
Specifically, use `sparse_categorical_crossentropy` since our labels are integers. Use `sparse_categorical_accuracy` as the metric and 'RMSprop' as the optimizer.
    
2. Train the model for 20 epochs with batch size 32. Save the model as 'model.h5'.
   
</div>
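The difference between the "sparse" and the regular categorical cross-entropy is only in how the labels are encoded. The NumPy sketch below illustrates this. It is not the Keras implementation, and the probabilities and function bodies are hypothetical examples written for this notebook.

```python
import numpy as np


def sparse_categorical_crossentropy(y_true, y_prob):
    """Mean negative log-probability assigned to the true class (integer labels)."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    true_class_prob = y_prob[np.arange(len(y_true)), y_true]
    return float(-np.mean(np.log(true_class_prob)))


def categorical_crossentropy(y_onehot, y_prob):
    """The same loss, but with one-hot encoded labels."""
    return float(-np.mean(np.sum(np.asarray(y_onehot) * np.log(y_prob), axis=1)))


# Hypothetical softmax outputs for 2 documents and 3 classes
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
labels = np.array([0, 1])   # integer labels, as in `train.target`
onehot = np.eye(3)[labels]  # the equivalent one-hot labels

sparse_loss = sparse_categorical_crossentropy(labels, probs)
dense_loss = categorical_crossentropy(onehot, probs)
print(sparse_loss, dense_loss)
```

Both functions return the same value; the sparse variant simply avoids building the one-hot matrix, which is convenient when the labels are already integers.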

In [None]:
# set training=False when validating or submitting the notebook
# and set training=True when training the network
training=True

In [None]:
# this hidden cell is for setting flag training=False


In [None]:
# compile the model 
# model.compile(...)

# model training
if training:
    # history = model.fit(...)
    # YOUR CODE HERE
    raise NotImplementedError()
else: 
    model = tf.keras.models.load_model("model.h5")

In [None]:
# Sanity check

model = tf.keras.models.load_model("model.h5")
_, test_acc = model.evaluate(x_test, y_test, verbose=0)
print("Test accuracy {:.2f}".format(test_acc))
assert test_acc>=0.91, "Accuracy is too low!"

In [None]:
# this cell is for tests


## Define an end-to-end model

Now, we may want to define a `Model` object which takes a string of arbitrary length as input, rather than a sequence of indices. This would make the model much more portable, since you wouldn't have to worry about the input preprocessing pipeline.

Our `vectorizer` is a Keras layer, so it's simple:

In [None]:
# keras layer that takes a string as an input
string_input = keras.Input(shape=(1,), dtype="string")
# vectorize string input
x = vectorizer(string_input)
# pass to main model
preds = model(x)

end_to_end_model = keras.Model(string_input, preds)
end_to_end_model.summary()

Using this final model, we can test the accuracy of the ANN. Note that we can now pass the text data `test.data` directly to the model.

In [None]:
y_pred = [np.argmax(prob) for prob in end_to_end_model.predict(test.data)]

score = accuracy_score(y_test, y_pred)
print("Accuracy:   %0.3f" % score)

f1 = f1_score(y_test, y_pred, average='weighted')
print("      F1:   %0.3f" % f1)

In [None]:
from sklearn.metrics import multilabel_confusion_matrix
cfs_matrix = multilabel_confusion_matrix(y_test, y_pred)

fig, ax = plt.subplots(2, 2, figsize=(12, 7))
    
for axes, cfs, label in zip(ax.flatten(), cfs_matrix, train.target_names):
    print_confusion_matrix(cfs, axes, label, ["N", "P"])

fig.tight_layout()
plt.show()

You should be getting an F1 score slightly below the one we get using the traditional approach. Given what you have learned in this course, try to explain what is happening here.

## Conclusion

The NLP field is vast and it's not possible to cover everything in one notebook. There is an exciting new area of study using [Transformers](https://huggingface.co/transformers/), where models like [BERT](https://arxiv.org/abs/1810.04805) and [GPT](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf) are giving mind-blowing results in many NLP tasks. We have covered the basics: pre-processing, document representation, word embeddings, and the different libraries available for creating NLP applications. Deep learning makes feature extraction easy, but the downside is that it needs a lot of text in order to train the algorithm. Thanks to Transfer Learning, we can use a general-purpose data set to perform the feature extraction, and then use the learned representations in domain-specific tasks. Libraries like [Hugging Face](https://huggingface.co/models) make it extremely easy to test state-of-the-art models, but you should not treat them as black boxes. 

Finally, if you want to learn more about the field, a good starting point is this specialization on Coursera: [DeepLearning.AI Natural Language Processing](https://www.coursera.org/specializations/natural-language-processing).

<div class=" alert alert-warning">
    <h3>Question 1.</h3>
    
Choose the correct statement. The Bag-of-words model assumes that:

1. The meaning of a word depends on its context
2. All words are independent units
3. The order of the words matters
  
</div>

In [None]:
# remove the line raise NotImplementedError() before testing your solution and submitting code

# YOUR CODE HERE
raise NotImplementedError()

# answer_1 = ...

In [None]:
# This cell is for tests
assert answer_1 in [1, 2, 3], '"answer" Value should be an integer between 1 and 3.'
print('Sanity check tests passed!')


<div class=" alert alert-warning">
    <h3>Question 2.</h3>
    
Choose the correct statement. TF-IDF:

1. Evaluates how often a word appears in a document
2. Is a numeric representation of a text document
3. Reflects how important a word is to a document in a set of documents
    
</div>

In [None]:
# remove the line raise NotImplementedError() before testing your solution and submitting code

# YOUR CODE HERE
raise NotImplementedError()

# answer_2 = ...

In [None]:
# This cell is for tests
assert answer_2 in [1, 2, 3], '"answer" Value should be an integer between 1 and 3.'
print('Sanity check tests passed!')


<div class=" alert alert-warning">
    <h3>Question 3.</h3>
    
Choose the correct statement. In the notebook, we vectorized the text data which means that:

1. We split the text document into words and mapped each word to an integer value
2. Each letter in the text was converted to a float number
3. We split the text document into words and mapped each word to a float number
4. We mapped a whole text document to an integer value
    
</div>

In [None]:
# remove the line raise NotImplementedError() before testing your solution and submitting code

# YOUR CODE HERE
raise NotImplementedError()

# answer_3 = ...

In [None]:
# This cell is for tests
assert answer_3 in [1, 2, 3, 4], '"answer" Value should be an integer between 1 and 4.'
print('Sanity check tests passed!')


<div class=" alert alert-warning">
    <h3>Question 4.</h3>
    
Choose the correct statement. The embedding matrix consists of:

1. Text data
2. One-hot encoded vectors corresponding to each word in a vocabulary
3. Labels (categories) of the text documents
4. Learnt real-valued representations (vectors) of each word in a vocabulary
    
</div>

In [None]:
# remove the line raise NotImplementedError() before testing your solution and submitting code

# YOUR CODE HERE
raise NotImplementedError()

# answer_4 = ...

In [None]:
# This cell is for tests
assert answer_4 in [1, 2, 3, 4], '"answer" Value should be an integer between 1 and 4.'
print('Sanity check tests passed!')
