# Tutorial 11 - Natural Language Processing

**Semester:** Fall 2021

**Adapted by:** [Kevin Dick](https://kevindick.ai/)

**PART I Concepts adapted from:** [Ventsislav Yordanov's](https://medium.com/@ventsislav94) [Introduction to Natural Language Processing for Text](https://towardsdatascience.com/introduction-to-natural-language-processing-for-text-df845750fb63).

**PART II Notebooks adapted from:** [HuggingFace's Transformer Notebooks](https://huggingface.co/transformers/notebooks.html) particularly the [Getting Started with Transformers Notebook](https://github.com/huggingface/transformers/blob/master/notebooks/02-transformers.ipynb)

**PART III Notebooks adapted from:** [José Eduardo Storopoli](https://github.com/storopoli)'s [Topic Modelling Notebooks](https://github.com/storopoli/topic-modelling/tree/master/Notebooks)

---

**Tangential Aside**: For anyone planning on pursuing advanced research in the field of NLP, I strongly suggest familiarizing yourselves with [Zipf's law](https://en.wikipedia.org/wiki/Zipf%27s_law), which describes the distribution of word frequencies observed within any (and all!) languages: [Fantastic (& Quizacious ;) VSauce Video Discussing the Zipf Mystery](https://youtu.be/fCn8zs912OE)

---

### PART I: What is Natural Language Processing (NLP)?

NLP is a **subfield of machine learning** concerned with the application of **learning algorithms to text and speech**. More generally, NLP-based methods are typically applicable to all sequential-type information (*e.g.* DNA sequences, audio signals, time-series signals, *etc.*); however, they are predominantly used in human language applications.

For example, we can use NLP to create systems including:
1. **speech recognition** (*e.g.* real-time captioning)
2. **document summarization** 
3. **machine translation**
4. **spam detection**
5. **named entity recognition**
6. **question answering**
7. **autocomplete** (*i.e.* predictive typing)

**General Data Processing Pipeline:** In order to tackle each of these tasks, we must first process text into a format amenable for use by NLP learning algorithms.

### The NLTK Library for the Basics of NLP [(See Details Here)](https://towardsdatascience.com/introduction-to-natural-language-processing-for-text-df845750fb63)
NLTK (Natural Language Toolkit) is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to many corpora (large document collections) and lexical resources. Also, it contains a suite of text processing libraries for **classification**, **tokenization**, **stemming**, **tagging**, **parsing**, and **semantic reasoning**. Best of all, NLTK is a free, open source, community-driven project.

1. **Sentence Tokenization:** Sentence tokenization (also called **sentence segmentation**) is the problem of *dividing a string of written language into its component sentences*.
2. **Word Tokenization**: Word tokenization (also called **word segmentation**) is the problem of *dividing a string of written language into its component words*.
3. **Text Lemmatization & Stemming**: The goal of both stemming (crude) and lemmatization (refined) is to *reduce inflectional forms*, and sometimes derivationally related forms, of a word to a common base form. For example, "drive", "drives", and "driving" all share the same core meaning and should be combined.
4. **Stop Words**: Stop words are the **most common words** in a language, such as “and”, “the”, and “a”. When applying machine learning to text, **these words can add a lot of noise**, so we remove them.
5. **Regex**: A regular expression (regex or regexp) is a sequence of characters that defines *a search pattern, which we can use to apply additional filtering* to our text. For example, we can remove all non-word characters. In many cases, we don’t need the punctuation marks, and they are easy to remove with a regex.
6. **Bag-of-Words**: Machine learning algorithms *cannot work with raw text directly*; we need to convert the text into vectors of numbers (*i.e.* feature extraction). The *bag-of-words model* is a popular and simple feature extraction technique that *counts the occurrences of each word within a document*.
7. **TF-IDF**: One problem with scoring raw word frequency is that *the most frequent words in the document have the highest scores*, even though frequent words may not carry much “informational gain” for the model. We therefore penalize words that are frequent across all the documents using TF-IDF (**term frequency-inverse document frequency**), a **statistical measure** used to evaluate the **importance of a word** to a **document in a collection or corpus**.

---

### Demonstration in Code

In [1]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

### 1. Sentence Tokenization
Split a string of text (from a document) into individual sentences.

In [2]:
# For visual convenience, this is represented as a string block, but it should
# be conceptualized as a single long and continuous string.
text = """Backgammon is one of the oldest known board games. 
          Its history can be traced back nearly 5,000 years to archeological discoveries in the Middle East. 
          It is a two player game where each player has fifteen checkers which move between twenty-four points according to the roll of two dice.
       """

sentences = nltk.sent_tokenize(text)
for i, sentence in enumerate(sentences):
    print(f'{i}: "{sentence}"')

0: "Backgammon is one of the oldest known board games."
1: "Its history can be traced back nearly 5,000 years to archeological discoveries in the Middle East."
2: "It is a two player game where each player has fifteen checkers which move between twenty-four points according to the roll of two dice."


### 2. Word Tokenization
Split each sentence into individual words.

In [3]:
for i, sentence in enumerate(sentences):
    words = nltk.word_tokenize(sentence)
    print(f'{i}: {words}')

0: ['Backgammon', 'is', 'one', 'of', 'the', 'oldest', 'known', 'board', 'games', '.']
1: ['Its', 'history', 'can', 'be', 'traced', 'back', 'nearly', '5,000', 'years', 'to', 'archeological', 'discoveries', 'in', 'the', 'Middle', 'East', '.']
2: ['It', 'is', 'a', 'two', 'player', 'game', 'where', 'each', 'player', 'has', 'fifteen', 'checkers', 'which', 'move', 'between', 'twenty-four', 'points', 'according', 'to', 'the', 'roll', 'of', 'two', 'dice', '.']


### 3. Text Lemmatization & Stemming
Shrink the semantic space by mapping similar words to a common semantic token.

In [4]:
nltk.download('wordnet')
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet

def compare_stemmer_and_lemmatizer(stemmer, lemmatizer, word, pos):
    """
    Print the results of stemming and lemmatization using the passed stemmer,
    lemmatizer, word and pos (part of speech)
    """
    print(f'Input Word: {word}\n{"-" * 10}')
    print(f'Stemmer: {stemmer.stem(word)}\nLemmatizer: {lemmatizer.lemmatize(word, pos)}')
    print('=' * 15)

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
compare_stemmer_and_lemmatizer(stemmer, lemmatizer, word = "seen", pos = wordnet.VERB)
compare_stemmer_and_lemmatizer(stemmer, lemmatizer, word = "drove", pos = wordnet.VERB)

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
Input Word: seen
----------
Stemmer: seen
Lemmatizer: see
Input Word: drove
----------
Stemmer: drove
Lemmatizer: drive


### 4. Stop Words
Remove high-frequency and "noisy" words that don't add to the sentence meaning.

In [5]:
from nltk.corpus import stopwords
print(stopwords.words("english"))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

Let's remove the stop-words from a sentence:


In [6]:
stop_words = set(stopwords.words("english"))
sentence = "Backgammon is one of the oldest known board games."

words = nltk.word_tokenize(sentence)
without_stop_words = [word for word in words if word not in stop_words]
print(without_stop_words)

['Backgammon', 'one', 'oldest', 'known', 'board', 'games', '.']


### 5. Regular Expression Filtering
Regular expressions are useful for filtering text based on specific patterns.

Here, we remove all punctuation by replacing every character that is not a word character with a space.

In [7]:
import re
sentence = "The development of snowboarding was inspired by skateboarding, sledding, surfing and skiing."
pattern = r"[^\w]" # Translates to: match any character that is NOT a word character
print(re.sub(pattern, " ", sentence))

The development of snowboarding was inspired by skateboarding  sledding  surfing and skiing 


* `.` - match any character except newline
* `\w` - match a word character (letter, digit, or underscore)
* `\d` - match a digit
* `\s` - match a whitespace character
* `\W` - match a non-word character
* `\D` - match a non-digit
* `\S` - match a non-whitespace character
* `[abc]` - match any of a, b, or c
* `[^abc]` - match any character except a, b, or c
* `[a-g]` - match a character between a & g
### 6. Bag-of-Words

This is a limited example of a Bag-of-Words application.

In [8]:
# Import the libraries we need
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Normally you would load this from a file.
raw_text = """I like this movie, it's funny. 
              I hate this movie.
              This was awesome! I like it.
              Nice one. I love it."""

# Step 1. Design the Vocabulary
#   The default token pattern removes tokens of a single character. 
#   That's why we don't have the "I" and "s" tokens in the output
count_vectorizer = CountVectorizer()

# Step 2. Create the Bag-of-Words Model
bag_of_words = count_vectorizer.fit_transform(raw_text.splitlines())

# Show the Bag-of-Words Model as a Pandas DataFrame
#   NOTE: the sum of columns generates a "concordance" (SYSC 1005 Easter Egg!)
feature_names = count_vectorizer.get_feature_names_out()  # scikit-learn >= 1.0; older versions use get_feature_names()
pd.DataFrame(bag_of_words.toarray(), columns = feature_names)



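|   | awesome | funny | hate | it | like | love | movie | nice | one | this | was |
|---|---------|-------|------|----|------|------|-------|------|-----|------|-----|
| 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 2 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| 3 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |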


In most big data applications the vocabulary is large, so the vector that represents a document might have **thousands or millions of elements.** Furthermore, each document may contain only a few of the known words in the vocabulary.

The **vector representations will therefore contain a lot of zeros**. Such **sparse vectors**, when stored densely, typically require **more memory and computational resources**.

#### Vocabulary Simplification using *n*-grams: 
A more complex way to create a vocabulary is to use **grouped words**. This **changes the scope of the vocabulary** and allows the bag-of-words model to get more details about the document.
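
To make "grouped words" concrete, here is a minimal, library-free sketch that extracts n-grams from a token list. The helper name `ngrams` and the sample sentence are illustrative assumptions, not part of the original notebook; in scikit-learn the same idea is exposed through the `ngram_range` parameter of `CountVectorizer`.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical pre-tokenized sentence
tokens = "i like this movie".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

print(unigrams)  # ['i', 'like', 'this', 'movie']
print(bigrams)   # ['i like', 'like this', 'this movie']
```

Each bigram becomes its own vocabulary entry, so the model can distinguish phrases such as "like this" from the individual words, at the cost of a larger (and sparser) vocabulary.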

### 7. TF-IDF: Term Frequency-Inverse Document Frequency
The TF-IDF scoring value increases proportionally to the number of times a word appears in the document, but it is offset by the number of documents in the corpus that contain the word.

![](https://miro.medium.com/max/700/1*V9ac4hLVyms79jl65Ym_Bw.png)

**Term Frequency (TF)**: a scoring of the frequency of the word in the current document.

![](https://miro.medium.com/max/463/1*V3qfsHl0t-bV5kA0mlnsjQ.png)

**Inverse Document Frequency (IDF)**: a scoring of how rare the word is across documents.

![](https://miro.medium.com/max/445/1*wvPGL02y36QL7-tdG1BT1A.png)
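
The definitions above can be sketched in a few lines of plain Python. This assumes the common textbook form (TF as the term count divided by document length, IDF as the log of the number of documents divided by the number of documents containing the term); the function names and the tiny corpus are hypothetical. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from this sketch.

```python
import math

# Hypothetical, already-tokenized toy corpus
docs = [
    ["i", "like", "this", "movie"],
    ["i", "hate", "this", "movie"],
    ["this", "was", "awesome"],
]

def tf(term, doc):
    """Term frequency: share of the document's tokens equal to `term`."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency; assumes `term` occurs in at least one document."""
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "this" appears in every document, so its IDF (and TF-IDF) is zero
score_this = tf_idf("this", docs[0], docs)
# "like" appears in only one of three documents, so it scores higher
score_like = tf_idf("like", docs[0], docs)

print(score_this, score_like)
```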


In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Normally you would load this from a file.
raw_text = """I like this movie, it's funny. 
              I hate this movie.
              This was awesome! I like it.
              Nice one. I love it."""

tfidf_vectorizer = TfidfVectorizer()
values = tfidf_vectorizer.fit_transform(raw_text.splitlines())

# Show the Model as a pandas DataFrame
feature_names = tfidf_vectorizer.get_feature_names_out()  # scikit-learn >= 1.0; older versions use get_feature_names()
pd.DataFrame(values.toarray(), columns = feature_names)



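|   | awesome | funny | hate | it | like | love | movie | nice | one | this | was |
|---|---------|-------|------|----|------|------|-------|------|-----|------|-----|
| 0 | 0.0 | 0.571848 | 0.0 | 0.365003 | 0.450852 | 0.0 | 0.450852 | 0.0 | 0.0 | 0.365003 | 0.0 |
| 1 | 0.0 | 0.0 | 0.702035 | 0.0 | 0.0 | 0.0 | 0.553492 | 0.0 | 0.0 | 0.4481 | 0.0 |
| 2 | 0.539445 | 0.0 | 0.0 | 0.344321 | 0.425305 | 0.0 | 0.0 | 0.0 | 0.0 | 0.344321 | 0.539445 |
| 3 | 0.0 | 0.0 | 0.0 | 0.345783 | 0.0 | 0.541736 | 0.0 | 0.541736 | 0.541736 | 0.0 | 0.0 |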


---

## PART II: Introduction to NLP Transformer-based Methods (NOTE: Time Consuming)
The transformers library is an open-source, community-based repository to train, use and share models based on 
the Transformer architecture [(Vaswani et al., 2017)](https://arxiv.org/abs/1706.03762) such as BERT [(Devlin et al., 2018)](https://arxiv.org/abs/1810.04805),
RoBERTa [(Liu et al., 2019)](https://arxiv.org/abs/1907.11692), GPT-2 [(Radford et al., 2019)](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf),
XLNet [(Yang et al., 2019)](https://arxiv.org/abs/1906.08237), etc. 

Along with the models, the library contains multiple variations of each of them for a large variety of 
downstream-tasks like **Named Entity Recognition (NER)**, **Sentiment Analysis**, 
**Language Modeling**, **Question Answering** and so on.

## The Recurrent Neural Networks that Preceded the Transformer

Until 2017, most neural network applications to Natural Language Processing relied on sequential processing of the input through a [Recurrent Neural Network (RNN)](https://en.wikipedia.org/wiki/Recurrent_neural_network).

![rnn](http://colah.github.io/posts/2015-09-NN-Types-FP/img/RNN-general.png)   

RNNs performed well on a large variety of tasks involving sequential dependencies over the input sequence. However, this sequentially-dependent process had trouble modeling very long-range dependencies and was not well suited to the kind of hardware we currently leverage (it parallelizes poorly).

The Attention mechanism was then introduced as an improvement over "raw" RNNs: it gives a learned, weighted importance to each element in the sequence, allowing the model to focus on "important" elements.

![attention_rnn](https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2017/08/Example-of-Attention.png)  

## Then Came the Transformer  

Then, in 2017, [Vaswani et al.](https://arxiv.org/abs/1706.03762)
heralded the Transformer era by demonstrating superiority over [Recurrent Neural Networks (RNNs)](https://en.wikipedia.org/wiki/Recurrent_neural_network)
on translation tasks. Attention-based methods were quickly extended to almost all RNN-type tasks, surpassing the state of the art at the time.

One advantage of the Transformer architecture over its RNN counterpart is its non-sequential attention model. Recall that RNNs have to iterate over each element of the input sequence one-by-one and carry an "updatable state" between each hop. Conversely, Transformer models are able to look at every position in the sequence at the same time, in one operation, converting a formerly serial task into an embarrassingly parallel one.
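
To make the "every position at once" idea concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It assumes the simplest possible setting, in which queries, keys, and values are all the raw input vectors; a real Transformer first applies learned projection matrices and uses several attention heads. The function names and the toy input are illustrative only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X (no learned weights)."""
    d = len(X[0])
    outputs, all_weights = [], []
    for q in X:  # each query only reads X, never a previous iteration's result
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)  # how much this position attends to every position
        out = [sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical sequence of three 2-dimensional token vectors
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(X)

for row in weights:
    print([round(w, 3) for w in row])
```

Because no iteration depends on the result of another, the outer loop can be replaced by a single batched matrix product, which is what makes the computation friendly to parallel hardware. An RNN, by contrast, cannot compute step *t* before step *t-1* has finished.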

Read [The Annotated Transformer](https://nlp.seas.harvard.edu/2018/04/03/attention.html#encoder-and-decoder-stacks) for a deep-dive into the Transformer architecture.

![transformer-encoder-decoder](https://nlp.seas.harvard.edu/images/the-annotated-transformer_14_0.png)

## Getting started with transformers

For the rest of this notebook, we will use the [BERT (Devlin et al., 2018)](https://arxiv.org/abs/1810.04805) architecture, as it is the simplest and there is plenty of content about it on the internet (so it will be easy to dig deeper into this architecture if you want to).

The transformers library allows you to benefit from large, pretrained language models without requiring a huge and costly computational
infrastructure. Most of the State-of-the-Art models are provided directly by their authors and made available in the library 
in PyTorch and TensorFlow in a transparent and interchangeable way. 

# How to train a new language model from scratch using Transformers and Tokenizers

Over the past few months, we made several improvements to our [`transformers`](https://github.com/huggingface/transformers) and [`tokenizers`](https://github.com/huggingface/tokenizers) libraries, with the goal of making it easier than ever to **train a new language model from scratch**.

In this post we’ll demo how to train a “small” model (84 M parameters = 6 layers, 768 hidden size, 12 attention heads) – that’s the same number of layers & heads as DistilBERT – on **Esperanto**. We’ll then fine-tune the model on a downstream task of part-of-speech tagging.


## 1. Find a dataset

First, let us find a corpus of text in Esperanto. Here we’ll use the Esperanto portion of the [OSCAR corpus](https://traces1.inria.fr/oscar/) from INRIA.
OSCAR is a huge multilingual corpus obtained by language classification and filtering of [Common Crawl](https://commoncrawl.org/) dumps of the Web.

<img src="https://huggingface.co/blog/assets/01_how-to-train/oscar.png" style="margin: auto; display: block; width: 260px;">

The Esperanto portion of the dataset is only 299M, so we’ll concatenate it with the Esperanto sub-corpus of the [Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download), which comprises text from diverse sources like news, literature, and Wikipedia.

The final training corpus has a size of 3 GB, which is still small. For your own model, the more data you can pretrain on, the better your results will be.

In [1]:
# in this notebook we'll only get one of the files (the Oscar one) for the sake of simplicity and performance
!wget -c https://cdn-datasets.huggingface.co/EsperBERTo/data/oscar.eo.txt

--2021-12-05 18:04:49--  https://cdn-datasets.huggingface.co/EsperBERTo/data/oscar.eo.txt
Resolving cdn-datasets.huggingface.co (cdn-datasets.huggingface.co)... 54.192.18.43, 54.192.18.17, 54.192.18.90, ...
Connecting to cdn-datasets.huggingface.co (cdn-datasets.huggingface.co)|54.192.18.43|:443... connected.
HTTP request sent, awaiting response... 416 Requested Range Not Satisfiable

    The file is already fully retrieved; nothing to do.



## 2. Train a tokenizer

We choose to train a byte-level Byte-pair encoding tokenizer (the same as GPT-2), with the same special tokens as RoBERTa. Let’s arbitrarily pick its size to be 52,000.

We recommend training a byte-level BPE (rather than let’s say, a WordPiece tokenizer like BERT) because it will start building its vocabulary from an alphabet of single bytes, so all words will be decomposable into tokens (no more `<unk>` tokens!).
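
The idea of building a vocabulary up from single bytes can be sketched with a toy merge loop. This is a simplified illustration under stated assumptions, not the `tokenizers` implementation: real byte-level BPE also pre-tokenizes the text and maps bytes to printable characters. The helper names and the four sample words are hypothetical.

```python
from collections import Counter

def most_frequent_pair(sequences):
    """Count adjacent symbol pairs across all sequences and return the most common one."""
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(seq, pair):
    """Replace every occurrence of `pair` in `seq` with one merged symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

words = ["estas", "estis", "estos", "ĉu"]
# Start from an alphabet of single bytes: "ĉ" is two bytes in UTF-8
sequences = [[bytes([b]) for b in w.encode("utf-8")] for w in words]

for _ in range(2):  # two merge steps are enough to learn the shared prefix
    pair = most_frequent_pair(sequences)
    sequences = [merge_pair(seq, pair) for seq in sequences]

print(sequences[0])  # [b'est', b'a', b's']
print(sequences[3])  # [b'\xc4', b'\x89', b'u']
```

Frequent byte sequences such as `est` become single tokens, while a rare word like "ĉu" simply stays split into its bytes. Every possible input therefore remains encodable, which is why no unknown token is needed.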


In [2]:
# We won't need TensorFlow here
!pip uninstall -y tensorflow
# Install `transformers` from master
!pip install git+https://github.com/huggingface/transformers
!pip list | grep -E 'transformers|tokenizers'
# transformers version at notebook update --- 2.11.0
# tokenizers version at notebook update --- 0.8.0rc1

Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-ai4slu1v
  Running command git clone -q https://github.com/huggingface/transformers /tmp/pip-req-build-ai4slu1v
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
tokenizers                    0.10.3
transformers                  4.13.0.dev0


In [3]:
%%time 
from pathlib import Path

from tokenizers import ByteLevelBPETokenizer

paths = [str(x) for x in Path(".").glob("**/*.txt")]

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

# Customize training
tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

CPU times: user 17min 50s, sys: 4.83 s, total: 17min 54s
Wall time: 9min 13s


In [4]:
!mkdir -p EsperBERTo
tokenizer.save_model("EsperBERTo")



['EsperBERTo/vocab.json', 'EsperBERTo/merges.txt']

What is great is that our tokenizer is optimized for Esperanto. Compared to a generic tokenizer trained for English, more native words are represented by a single, unsplit token. Diacritics, i.e. accented characters used in Esperanto – `ĉ`, `ĝ`, `ĥ`, `ĵ`, `ŝ`, and `ŭ` – are encoded natively. We also represent sequences in a more efficient manner. Here on this corpus, the average length of encoded sequences is ~30% smaller than when using the pretrained GPT-2 tokenizer.

Here’s how you can use it in `tokenizers`, including handling the RoBERTa special tokens – of course, you’ll also be able to use it directly from `transformers`.

We now have both a `vocab.json`, which is a list of the most frequent tokens ranked by frequency, and a `merges.txt` list of merges.


In [5]:
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing


tokenizer = ByteLevelBPETokenizer(
    "./EsperBERTo/vocab.json",
    "./EsperBERTo/merges.txt",
)

In [6]:
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=512)

In [7]:
tokenizer.encode("Mi estas Kevin.").tokens

['<s>', 'Mi', 'Ġestas', 'ĠKe', 'vin', '.', '</s>']

## 3. Train a language model from scratch

**Update:** This section follows along the [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py) script, using our new [`Trainer`](https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py) directly. Feel free to pick the approach you like best.

> We’ll train a RoBERTa-like model, which is a BERT-like with a couple of changes (check the [documentation](https://huggingface.co/transformers/model_doc/roberta.html) for more details).

As the model is BERT-like, we’ll train it on a task of *Masked language modeling*, i.e. predicting how to fill arbitrary tokens that we randomly mask in the dataset. This is taken care of by the example script.
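
As a rough illustration of what "randomly mask" means, the sketch below hides a random subset of tokens and records the originals as prediction targets. It is a deliberately simplified assumption of the procedure: the helper `mask_tokens` is hypothetical, it works on whole words rather than token ids, and BERT-style training additionally replaces some selected tokens with random ones or leaves them unchanged instead of always inserting the mask token.

```python
import random

def mask_tokens(tokens, mask_token="<mask>", mlm_probability=0.15, seed=0):
    """Randomly replace tokens with `mask_token`; return (masked tokens, labels)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mlm_probability:
            masked.append(mask_token)
            labels.append(tok)    # the model must recover the original token here
        else:
            masked.append(tok)
            labels.append(None)   # unmasked positions do not contribute to the loss
    return masked, labels

tokens = "Jen la komenco de bela tago".split()
# A high probability is used only so that this short example masks something
masked, labels = mask_tokens(tokens, mlm_probability=0.5)

print(masked)
print(labels)
```

During training, the loss is computed only at the masked positions, so the model has to rely on the surrounding context on both sides to fill each gap.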


In [8]:
# Check that we have a GPU
!nvidia-smi

Sun Dec  5 18:15:10 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   52C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [9]:
# Check that PyTorch sees it
import torch
torch.cuda.is_available()

True

In [10]:
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)

In [11]:
# Now let's re-create our tokenizer in transformers
from transformers import RobertaTokenizerFast
tokenizer = RobertaTokenizerFast.from_pretrained("./EsperBERTo", model_max_length=512)

file ./EsperBERTo/config.json not found
file ./EsperBERTo/config.json not found


Finally let's initialize our model.

**Important:**

As we are training from scratch, we only initialize from a config, not from an existing pretrained model or checkpoint.

In [12]:
from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM(config=config)
model.num_parameters()
# => 84 million parameters

83504416

### Now let's build our training Dataset

We'll build our dataset by applying our tokenizer to our text file.

Here, as we only have one text file, we don't even need to customize our `Dataset`. We'll just use the `LineByLineTextDataset` out-of-the-box.

In [None]:
%%time
from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./oscar.eo.txt",
    block_size=128,
)



Like in the [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py) script, we need to define a data_collator.

This is just a small helper that will help us batch different samples of the dataset together into an object that PyTorch knows how to perform backprop on.

In [None]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

### Finally, we are all set to initialize our Trainer

In [None]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./EsperBERTo",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=64,  # replaces the deprecated per_gpu_train_batch_size
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,       # set here; recent Trainer versions do not accept it directly
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

#### Start Training (~1hr+)

In [None]:
%%time
trainer.train()

In [None]:
trainer.save_model("./EsperBERTo")

#### Check that the Language Model Actually Learned

Aside from looking at the training and eval losses going down, the easiest way to check whether our language model is learning anything interesting is via the `FillMaskPipeline`.

Pipelines are simple wrappers around tokenizers and models, and the 'fill-mask' one will let you input a sequence containing a masked token (here, `<mask>`) and return a list of the most probable filled sequences, with their probabilities.

In [None]:
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="./EsperBERTo",
    tokenizer="./EsperBERTo"
)

# The sun <mask>.
# =>

fill_mask("La suno <mask>.")

Ok, simple syntax/grammar works. Let’s try a slightly more interesting prompt:

In [None]:
fill_mask("Jen la komenco de bela <mask>.")

# This is the beginning of a beautiful <mask>.
# =>

### Share your Model with the Community

Finally, when you have a nice model, please think about sharing it with the community:

- upload your model using the CLI: `transformers-cli upload`
- write a README.md model card and add it to the repository under `model_cards/`. Your model card should ideally include:
    - a model description,
    - training params (dataset, preprocessing, hyperparameters), 
    - evaluation results,
    - intended uses & limitations
    - whatever else is helpful! 🤓

### **TADA!**

➡️ Your model has a page on http://huggingface.co/models and everyone can load it using `AutoModel.from_pretrained("username/model_name")`.

[![tb](https://huggingface.co/blog/assets/01_how-to-train/model_page.png)](https://huggingface.co/julien-c/EsperBERTo-small)



---

# Part III: Topic Modelling using Latent Dirichlet Allocation

In this final part, we will demonstrate how to perform Topic Modelling on a corpus of data.
As a reminder, a "corpus" means a collection of text documents and can be of arbitrary size.

In [1]:
!pip install gensim spacy pyLDAvis

Collecting pyLDAvis
  Downloading pyLDAvis-3.3.1.tar.gz (1.7 MB)
     |████████████████████████████████| 1.7 MB 5.4 MB/s 
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
    Preparing wheel metadata ... done
Collecting numpy>=1.11.3
  Downloading numpy-1.21.4-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.7 MB)
     |████████████████████████████████| 15.7 MB 15.1 MB/s 
Collecting funcy
  Downloading funcy-1.16-py2.py3-none-any.whl (32 kB)
Collecting pandas>=1.2.0
  Downloading pandas-1.3.4-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.3 MB)
     |████████████████████████████████| 11.3 MB 14.0 MB/s 
Building wheels for collected packages: pyLDAvis
  Building wheel for pyLDAvis (PEP 517) ... done
  Created wheel for pyLDAvis: filename=pyLDAvis-3.3.1-py2.py3-none-any.whl size=136897 sha256=b568e536f8884

In [2]:
import re, gensim, os, sys, spacy
import numpy as np
import pandas as pd

from gensim import models

# Sklearn
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from pprint import pprint

# Plotting tools
import pyLDAvis
import pyLDAvis.sklearn
import matplotlib.pyplot as plt
%matplotlib inline

print('Python Version: %s' % (sys.version))

Python Version: 3.7.12 (default, Sep 10 2021, 00:21:48) 
[GCC 7.5.0]




In [3]:
dictionary = gensim.corpora.Dictionary.load('documents.dict')
corpus = gensim.corpora.MmCorpus('documents.mm')
lda_model = models.LdaModel.load('lda_model')
ldamallet = models.wrappers.LdaMallet.load('ldamallet')
optimal_model = models.wrappers.LdaMallet.load('optimal_model')

print(dictionary)
print(corpus)
print(lda_model)
print(ldamallet)



FileNotFoundError: ignored

In [None]:
import pickle
#with open('documents', 'wb') as f: #save
#    pickle.dump(mylist, f)

with open('documents', 'rb') as f: #load
    documents = pickle.load(f)

## Tokenize and Clean-up using gensim’s simple_preprocess()
The sentences look better now, but you want to tokenize each sentence into a list of words, removing punctuation and unnecessary characters altogether.

Gensim’s `simple_preprocess()` is great for this: it lowercases the text and drops punctuation while tokenizing. Additionally, I have set `deacc=True` to strip accent marks from the tokens.

In [None]:
def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))  # deacc=True removes accent marks

data_words = list(sent_to_words(documents))

## Lemmatization
Lemmatization is a process where we convert each word to its root form.

For example: ‘Studying’ becomes ‘Study’, ‘Meeting’ becomes ‘Meet’, and ‘Better’ and ‘Best’ become ‘Good’.

The advantage of this is that we reduce the total number of unique words in the dictionary. As a result, the document-word matrix (created by CountVectorizer in the next step) will be denser, with fewer columns.

You can expect better topics to be generated in the end.

In [None]:
def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append(" ".join([token.lemma_ if token.lemma_ not in ['-PRON-'] else '' for token in doc if token.pos_ in allowed_postags]))
    return texts_out

# Initialize the spaCy 'en_core_web_sm' model, disabling the parser and NER components (for efficiency)
# Run in terminal: python3 -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

# Lemmatize the tokenized documents (data_words), keeping only nouns, adjectives, verbs, and adverbs
data_lemmatized = lemmatization(data_words, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])
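
The vocabulary-reduction effect described above can be illustrated with a toy example. This is a minimal sketch that assumes a tiny hand-written lookup table (`TOY_LEMMAS`, a hypothetical name); a real lemmatizer such as spaCy relies on part-of-speech tags and a full vocabulary instead.

```python
# Hypothetical hand-written lemma table, for illustration only.
TOY_LEMMAS = {
    "studying": "study", "studies": "study", "studied": "study",
    "meeting": "meet", "meets": "meet", "met": "meet",
}

def toy_lemmatize(tokens):
    """Map each token to its lemma if the table knows it, else keep it unchanged."""
    return [TOY_LEMMAS.get(tok, tok) for tok in tokens]

tokens = ["studying", "studies", "studied", "meeting", "meets", "met", "topic"]
lemmas = toy_lemmatize(tokens)

vocab_before = len(set(tokens))   # distinct surface forms
vocab_after = len(set(lemmas))    # distinct lemmas

print(lemmas)
print(vocab_before, "unique words before,", vocab_after, "after")
```

Each distinct word becomes one column of the document-word matrix, so collapsing inflected forms directly shrinks the matrix.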

## Create the Document-Word matrix
The LDA topic model algorithm requires a document word matrix as the main input.

You can create one using `CountVectorizer`. In the code below, I have configured `CountVectorizer` to keep only words that occur in at least 4 documents (`min_df`), remove built-in English stop words, convert all words to lowercase, and treat only runs of at least 3 letters or digits as a word (`token_pattern`).

So, to create the doc-word matrix, you first initialise the `CountVectorizer` class with the required configuration and then call `fit_transform()` to actually create the matrix.

Since most cells contain zeros, the result will be in the form of a sparse matrix to save memory.

If you want to materialize it as a 2D array, call the `todense()` method of the sparse matrix, as is done in the next step.

In [None]:
vectorizer = CountVectorizer(analyzer='word',
                             min_df=4,                         # keep words that appear in at least 4 documents
                             stop_words='english',             # remove stop words
                             lowercase=True,                   # convert all words to lowercase
                             token_pattern='[a-zA-Z0-9]{3,}',  # tokens of 3 or more letters/digits
                             # max_features=50000,             # max number of unique words
                            )

data_vectorized = vectorizer.fit_transform(data_lemmatized)
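
To show what a document-word matrix looks like, here is a minimal standard-library sketch on three toy documents. It is not how `CountVectorizer` is implemented, and it deliberately skips stop-word removal and the `min_df` filter; the toy `docs` list is an assumption for illustration.

```python
from collections import Counter

docs = [
    "topic models find topics",
    "models need word counts",
    "word counts form a matrix",
]

# One Counter of word frequencies per document.
counts = [Counter(doc.split()) for doc in docs]

# The sorted vocabulary defines the matrix columns.
vocabulary = sorted(set(word for c in counts for word in c))

# Rows are documents, columns are words, cells are raw counts.
matrix = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Even in this tiny example most cells are zero, which is why the real matrix is stored in a sparse format.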

## Check the Sparsicity
Sparsicity is simply the percentage of non-zero data points in the document-word matrix, that is, `data_vectorized`.

Since most cells in this matrix will be zero, I am interested in knowing what percentage of cells contain non-zero values.

In [None]:
# Materialize the sparse data
data_dense = data_vectorized.todense()

# Compute Sparsicity = Percentage of Non-Zero cells
print("Sparsicity: ", ((data_dense > 0).sum()/data_dense.size)*100, "%")
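
The same percentage can be worked out by hand on a toy matrix, which makes the definition explicit. This is a sketch with made-up counts; for a real SciPy sparse matrix, the number of stored non-zero entries (`nnz`) and the `shape` should give the same ratio without materializing a dense copy.

```python
# Toy document-word counts: 4 documents x 5 words (made-up values).
matrix = [
    [0, 0, 1, 0, 0],
    [2, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 3, 0, 0, 0],
]

non_zero = sum(1 for row in matrix for cell in row if cell > 0)
total = sum(len(row) for row in matrix)

# Sparsicity = percentage of non-zero cells.
sparsicity = 100.0 * non_zero / total
print("Sparsicity:", sparsicity, "%")
```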

## Build LDA model with sklearn
Everything is ready to build a Latent Dirichlet Allocation (LDA) model. Let’s initialise one and call `fit_transform()` to build the LDA model.

For this example, I have set `n_components` (the number of topics) to 20 based on prior knowledge about the dataset. Later we will find the optimal number using grid search.

In [None]:
# Build LDA Model
lda_model = LatentDirichletAllocation(n_components=20,               # Number of topics
                                      max_iter=10,               # Max learning iterations
                                      learning_method='online',   
                                      random_state=100,          # Random state
                                      batch_size=128,            # n docs in each learning iter
                                      evaluate_every = -1,       # compute perplexity every n iters, default: Don't
                                      n_jobs = -1,               # Use all available CPUs
                                     )
lda_output = lda_model.fit_transform(data_vectorized)

print(lda_model)  # Model attributes

## Diagnose model performance with perplexity and log-likelihood

A model with a higher log-likelihood and a lower perplexity (`exp(-1. * log-likelihood per word)`) is considered better. Let’s check our model.

In [None]:
# Log Likelihood: Higher the better
print("Log Likelihood: ", lda_model.score(data_vectorized))

# Perplexity: Lower the better. Perplexity = exp(-1. * log-likelihood per word)
print("Perplexity: ", lda_model.perplexity(data_vectorized))

# See model parameters
print(lda_model.get_params())
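
The relationship between log-likelihood and perplexity can be checked on a toy example. This sketch assumes we already know the probability a model assigns to each word of a held-out text (the `uniform_model` and `sharper_model` lists are made-up values); scikit-learn works with an approximate bound on the log-likelihood rather than exact per-word probabilities.

```python
import math

def perplexity(word_probs):
    """Perplexity = exp(-1 * log-likelihood per word)."""
    log_likelihood = sum(math.log(p) for p in word_probs)
    return math.exp(-log_likelihood / len(word_probs))

# Probabilities two hypothetical models assign to the same 4 held-out words.
uniform_model = [1 / 8] * 4               # guesses uniformly over 8 words
sharper_model = [0.5, 0.25, 0.5, 0.25]    # puts more mass on the observed words

perplexity_uniform = perplexity(uniform_model)
perplexity_sharper = perplexity(sharper_model)

print("Uniform model perplexity:", round(perplexity_uniform, 3))
print("Sharper model perplexity:", round(perplexity_sharper, 3))
```

The model that assigns higher probability to the observed words has the higher log-likelihood and therefore the lower perplexity.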

## How to GridSearch the best LDA model?
The most important tuning parameter for LDA models is `n_components` (number of topics). In addition, I am going to search `learning_decay` (which controls the learning rate) as well.

Besides these, other possible search params could be `learning_offset` (which down-weights early iterations and should be `> 1`) and `max_iter`. These could be worth experimenting with if you have enough computing resources.

Be warned, the grid search constructs multiple LDA models for all possible combinations of param values in the param_grid dict. So, this process can consume a lot of time and resources.

In [None]:
# Define Search Param
search_params = {'n_components': [10, 15, 20, 25, 30], 'learning_decay': [.5, .7, .9]}

# Init the Model
lda = LatentDirichletAllocation()

# Init Grid Search Class
model = GridSearchCV(lda, param_grid=search_params)

# Do the Grid Search
model.fit(data_vectorized)
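
To get a feel for the cost of this search, the number of model fits can be counted up front. This is a rough sketch: `cv_folds = 5` assumes the default cross-validation setting of recent scikit-learn versions, and the extra fit assumes the default behaviour of refitting the best model on the full data.

```python
from itertools import product

search_params = {'n_components': [10, 15, 20, 25, 30], 'learning_decay': [.5, .7, .9]}
cv_folds = 5  # assumption: default number of folds in recent scikit-learn versions

# Every combination of parameter values is one candidate model.
combinations = [dict(zip(search_params, values))
                for values in product(*search_params.values())]

# Each candidate is fitted once per fold, plus one final refit of the best candidate.
total_fits = len(combinations) * cv_folds + 1

print("Candidates:", len(combinations))
print("Total LDA fits:", total_fits)
```

Adding a third parameter with, say, 3 values would triple the number of candidates, so it pays to keep the grid small.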

## How to see the best topic model and its parameters?

In [None]:
# Best Model
best_lda_model = model.best_estimator_

# Model Parameters
print("Best Model's Params: ", model.best_params_)

# Log Likelihood Score
print("Best Log Likelihood Score: ", model.best_score_)

# Perplexity
print("Model Perplexity: ", best_lda_model.perplexity(data_vectorized))

In [None]:
# Get log-likelihoods from the grid search output, grouped by learning_decay
n_topics = [10, 15, 20, 25, 30]  # x-axis values; must match search_params['n_components']
log_likelihoods_5 = []
log_likelihoods_7 = []
log_likelihoods_9 = []

for i in range(len(model.cv_results_['params'])):
    if model.cv_results_['params'][i]['learning_decay'] == 0.5:
        log_likelihoods_5.append(round(model.cv_results_['mean_test_score'][i]))
    elif model.cv_results_['params'][i]['learning_decay'] == 0.7:
        log_likelihoods_7.append(round(model.cv_results_['mean_test_score'][i]))
    elif model.cv_results_['params'][i]['learning_decay'] == 0.9:
        log_likelihoods_9.append(round(model.cv_results_['mean_test_score'][i]))

# Show graph
plt.figure(figsize=(12, 8))
plt.plot(n_topics, log_likelihoods_5, label='0.5')
plt.plot(n_topics, log_likelihoods_7, label='0.7')
plt.plot(n_topics, log_likelihoods_9, label='0.9')
plt.title("Choosing Optimal LDA Model")
plt.xlabel("Num Topics")
plt.ylabel("Log Likelihood Scores")
plt.legend(title='Learning decay', loc='best')
plt.show()

## How to see the dominant topic in each document?
To classify a document as belonging to a particular topic, a logical approach is to see which topic has the highest contribution to that document and assign it.

In the table below, I’ve greened out all major topics in a document and assigned the most dominant topic in its own column.

In [None]:
# Create Document - Topic Matrix
lda_output = best_lda_model.transform(data_vectorized)

# column names
topicnames = ["Topic" + str(i) for i in range(best_lda_model.n_components)]

# index names
docnames = ["Doc" + str(i) for i in range(len(documents))]

# Make the pandas dataframe
df_document_topic = pd.DataFrame(np.round(lda_output, 2), columns=topicnames, index=docnames)

# Get dominant topic for each document
dominant_topic = np.argmax(df_document_topic.values, axis=1)
df_document_topic['dominant_topic'] = dominant_topic

# Styling
def color_green(val):
    color = 'green' if val > .1 else 'black'
    return 'color: {col}'.format(col=color)

def make_bold(val):
    weight = 700 if val > .1 else 400
    return 'font-weight: {weight}'.format(weight=weight)

# Apply Style
df_document_topics = df_document_topic.head(15).style.applymap(color_green).applymap(make_bold)
df_document_topics

## Review topics distribution across documents

In [None]:
df_topic_distribution = df_document_topic['dominant_topic'].value_counts().reset_index(name="Num Documents")
df_topic_distribution.columns = ['Topic Num', 'Num Documents']
df_topic_distribution
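
The dominant-topic assignment and the per-topic document counts can also be sketched without pandas, which makes the logic easy to inspect. The `doc_topic` values below are made up for illustration; each row plays the role of one row of `lda_output`.

```python
from collections import Counter

# Toy document-topic probabilities: 4 documents x 3 topics (made-up values).
doc_topic = [
    [0.70, 0.20, 0.10],
    [0.05, 0.15, 0.80],
    [0.30, 0.60, 0.10],
    [0.10, 0.10, 0.80],
]

# Dominant topic = index of the largest probability in each row (an argmax).
dominant_topic = [row.index(max(row)) for row in doc_topic]

# Number of documents assigned to each topic, most common first.
topic_distribution = Counter(dominant_topic).most_common()

print(dominant_topic)
print(topic_distribution)
```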

## How to visualize the LDA model with pyLDAvis?
pyLDAvis offers an interactive visualization of the topic-keyword distributions.

In [None]:
pyLDAvis.enable_notebook()
panel = pyLDAvis.sklearn.prepare(best_lda_model, data_vectorized, vectorizer, mds='tsne')
panel

## How to see the Topic’s keywords?
The weight of each keyword in each topic is stored in `lda_model.components_` as a 2D array. The keyword names themselves can be obtained from the vectorizer object using `get_feature_names()` (renamed `get_feature_names_out()` in scikit-learn 1.0 and later).

Let’s use this info to construct a weight matrix for all keywords in each topic.

In [None]:
# Topic-Keyword Matrix
df_topic_keywords = pd.DataFrame(best_lda_model.components_)

# Assign Column and Index
df_topic_keywords.columns = vectorizer.get_feature_names()
df_topic_keywords.index = topicnames

# View
df_topic_keywords.head()

## Get the top 15 keywords for each topic
From the above output, I want to see the top 15 keywords that are representative of each topic.

The `show_topics()` function defined below does that.

In [None]:
# Show top n keywords for each topic
def show_topics(vectorizer=vectorizer, lda_model=lda_model, n_words=20):
    keywords = np.array(vectorizer.get_feature_names())
    topic_keywords = []
    for topic_weights in lda_model.components_:
        top_keyword_locs = (-topic_weights).argsort()[:n_words]
        topic_keywords.append(keywords.take(top_keyword_locs))
    return topic_keywords

topic_keywords = show_topics(vectorizer=vectorizer, lda_model=best_lda_model, n_words=15)        

# Topic - Keywords Dataframe
df_topic_keywords = pd.DataFrame(topic_keywords)
df_topic_keywords.columns = ['Word '+str(i) for i in range(df_topic_keywords.shape[1])]
df_topic_keywords.index = ['Topic '+str(i) for i in range(df_topic_keywords.shape[0])]
df_topic_keywords
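
The ranking step inside `show_topics()` can be illustrated on a toy topic-word weight matrix. This is a plain-Python sketch with a made-up six-word vocabulary and made-up weights; `top_words` is a hypothetical helper standing in for the `argsort` logic.

```python
# Toy vocabulary and topic-word weights (3 topics x 6 words, made-up values).
vocabulary = ["bible", "car", "engine", "faith", "game", "team"]
topic_word_weights = [
    [9.1, 0.2, 0.1, 7.4, 0.3, 0.5],
    [0.1, 8.0, 6.5, 0.2, 0.4, 0.3],
    [0.2, 0.6, 0.1, 0.3, 7.7, 9.0],
]

def top_words(weights, vocabulary, n_words=2):
    """Return the n_words vocabulary entries with the largest weights."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [vocabulary[i] for i in ranked[:n_words]]

topic_keywords = [top_words(weights, vocabulary) for weights in topic_word_weights]

for topic_id, words in enumerate(topic_keywords):
    print("Topic", topic_id, "->", words)
```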

## How to predict the topics for a new piece of text?
Assuming that you have already built the topic model, you need to take the new text through the same sequence of transformations before predicting its topic.

For our case, the order of transformations is:

`sent_to_words() –> lemmatization() –> vectorizer.transform() –> best_lda_model.transform()`

You need to apply these transformations in the same order. So to simplify it, let’s combine these steps into a `predict_topic()` function.

In [4]:
# Define function to predict topic for a given text document.
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

def predict_topic(text, nlp=nlp):
    global sent_to_words
    global lemmatization

    # Step 1: Clean with simple_preprocess
    mytext_2 = list(sent_to_words(text))

    # Step 2: Lemmatize
    mytext_3 = lemmatization(mytext_2, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

    # Step 3: Vectorize transform
    mytext_4 = vectorizer.transform(mytext_3)

    # Step 4: LDA Transform
    topic_probability_scores = best_lda_model.transform(mytext_4)
    topic = df_topic_keywords.iloc[np.argmax(topic_probability_scores), :].values.tolist()
    return topic, topic_probability_scores

# Predict the topic
mytext = ["Some text about christianity and bible"]
topic, prob_scores = predict_topic(text = mytext)



NameError: ignored

## How to cluster documents that share similar topics and plot?
You can use k-means clustering on the document-topic probability matrix, which is simply the `lda_output` object. Since our best model has 4 clusters, I’ve set `n_clusters=4` in `KMeans()`.

Alternatively, you could skip k-means and instead assign each document to the cluster given by the topic column with the highest probability score.

We now have the cluster number. But we also need the X and Y columns to draw the plot.

For the X and Y, you can use SVD on the `lda_output` object with `n_components` set to 2. SVD ensures that these two columns capture the maximum possible amount of information from `lda_output`.

In [None]:
# Construct the k-means clusters
from sklearn.cluster import KMeans
clusters = KMeans(n_clusters=4, random_state=100).fit_predict(lda_output)

# Build the Singular Value Decomposition(SVD) model
svd_model = TruncatedSVD(n_components=2)  # 2 components
lda_output_svd = svd_model.fit_transform(lda_output)

# X and Y axes of the plot using SVD decomposition
x = lda_output_svd[:, 0]
y = lda_output_svd[:, 1]

# Weights of each topic column of lda_output, for each SVD component
print("Component's weights: \n", np.round(svd_model.components_, 2))

# Percentage of total information in 'lda_output' explained by the two components
print("Perc of Variance Explained: \n", np.round(svd_model.explained_variance_ratio_, 2))

We have the X, Y and the cluster number for each document.

Let’s plot the documents along the two SVD components. The color of each point represents its cluster number (in this case) or topic number.

In [None]:
# Plot
plt.figure(figsize=(12, 12))
plt.scatter(x, y, c=clusters)
plt.xlabel('Component 1')
plt.ylabel('Component 2')
plt.title("Segregation of Topic Clusters")

## How to get similar documents for any given piece of text?
Once you know the topic probabilities for a given document (using `predict_topic()`), compute the Euclidean distance between them and the topic probabilities of all other documents.

The most similar documents are the ones with the smallest distance.

In [None]:
from sklearn.metrics.pairwise import euclidean_distances

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

def similar_documents(text, doc_topic_probs, documents = documents, nlp=nlp, top_n=5, verbose=False):
    topic, x  = predict_topic(text)
    dists = euclidean_distances(x.reshape(1, -1), doc_topic_probs)[0]
    doc_ids = np.argsort(dists)[:top_n]
    if verbose:        
        print("Topic KeyWords: ", topic)
        print("Topic Prob Scores of text: ", np.round(x, 1))
        print("Most Similar Doc's Probs:  ", np.round(doc_topic_probs[doc_ids], 1))
    return doc_ids, np.take(documents, doc_ids)

In [None]:
# Get similar documents
mytext = ["Some text about christianity and bible"]
doc_ids, docs = similar_documents(text=mytext, doc_topic_probs=lda_output, documents = documents, top_n=1, verbose=True)
print('\n', docs[0][:10])
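
The distance-based ranking can be traced by hand on a toy example. This sketch uses made-up topic probabilities and a hypothetical `euclidean` helper in place of `euclidean_distances`; it shows only the ranking logic, not the full `similar_documents()` pipeline.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy topic probabilities for 3 documents over 3 topics (made-up values).
doc_topic_probs = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.5, 0.4, 0.1],
]

# Topic probabilities of the new text, as predict_topic() would return them.
query = [0.7, 0.2, 0.1]

distances = [euclidean(query, row) for row in doc_topic_probs]

# Document ids sorted from most to least similar (smallest distance first).
ranked_ids = sorted(range(len(distances)), key=lambda i: distances[i])

print([round(d, 3) for d in distances])
print(ranked_ids)
```

Document 0 has almost the same topic mix as the query, so it comes first; document 1 is dominated by a different topic and comes last.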

---

# Takeaway Messages:

- NLP is a **subfield of machine learning** concerned with the application of **learning algorithms to text and speech**.
- NLP-based methods are typically applicable to all sequential-type information (*e.g.* DNA sequences, audio signals, time-series signals, *etc.*); however, they are predominantly used in human language applications.
- For example, we can use NLP to create systems including:
    1. **speech recognition** (*e.g.* real-time captioning)
    2. **document summarization**
    3. **machine translation**
    4. **spam detection**
    5. **named entity recognition**
    6. **question answering**
    7. **autocomplete** (*i.e.* predictive typing)
- **Sentence Tokenization:** Sentence tokenization (also called **sentence segmentation**) is the problem of *dividing a string of written language into its component sentences*.
- **Word Tokenization**: Word tokenization (also called **word segmentation**) is the problem of *dividing a string of written language into its component words*.
- **Text Lemmatization & Stemming**: The goal of both stemming (crude) and lemmatization (refined) is to *reduce inflectional forms* and sometimes derivationally related forms of a word to a common base form. For example, "drive", "drives", and "driving" all share the same core meaning and should be combined.
- **Stop Words**: Stop words usually refer to the **most common words** such as “and”, “the”, “a” in a language and when applying machine learning to text, **these words can add a lot of noise** so we remove them.
- **Regex**: A regular expression, regex, or regexp is a sequence of characters that defines *a search pattern to apply additional filtering* to our text. For example, we can remove all the non-word characters. In many cases, we don’t need the punctuation marks and it’s easy to remove them with regex.
- **Bag-of-Words**: Machine learning algorithms *cannot work with raw text directly*; we need to convert the text into vectors of numbers (i.e. feature extraction). The *bag-of-words model* is a popular and simple feature extraction technique that *counts the occurrences of each word within a document*.
- **TF-IDF**: One problem with scoring raw word frequency is that *the most frequent words in the document start to have the highest scores*, even though frequent words may not carry much “informational gain” for the model. We therefore penalize words that are frequent across all the documents using TF-IDF (**term frequency-inverse document frequency**), a **statistical measure** used to evaluate the **importance of a word** to a **document in a collection or corpus**.
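
The TF-IDF idea from the last bullet can be worked through on a toy corpus. This is a minimal sketch of the textbook variant, `tf * log(N / df)`, on made-up documents; libraries often use smoothed or normalized variants, so their numbers will generally differ.

```python
import math
from collections import Counter

# Toy corpus: each document is already tokenized (made-up data).
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]

def tf_idf(docs):
    """Score each word in each document as tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

scores = tf_idf(docs)
for doc_id, doc_scores in enumerate(scores):
    print(doc_id, {word: round(score, 3) for word, score in doc_scores.items()})
```

The word "the" appears in every document, so its score is zero everywhere, while "dog", which appears in only one document, receives the highest score in that document.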

# FIN