# Spacy PyTorch Transformers Demo

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1lG3ReZc9ESyVPsstjuu5ek73u6vVsi3X)


![alt text](https://d33wubrfki0l68.cloudfront.net/d04566d0f6671ae94fdae6fa3f767f5a6553d335/c50f0/blog/img/spacy-pytorch-transformers.jpg)

# Set-Up

This section sets up the Colab environment used for the experiments below. Note that the CUDA build of spacy-pytorch-transformers (`cuda100`) is installed.

In [None]:
!pip install gputil
!pip install torch==1.1.0
!pip install spacy-pytorch-transformers[cuda100]==0.2.0
!python -m spacy download en_pytt_xlnetbasecased_lg
!python -m spacy download en_pytt_bertbaseuncased_lg

Collecting gputil
  Downloading https://files.pythonhosted.org/packages/ed/0e/5c61eedde9f6c87713e89d794f01e378cfd9565847d4576fa627d758c554/GPUtil-1.4.0.tar.gz
Building wheels for collected packages: gputil
  Building wheel for gputil (setup.py) ... done
  Created wheel for gputil: filename=GPUtil-1.4.0-cp36-none-any.whl size=7411 sha256=e22cfe5f80e940631932a7d8e3aad1b22d3ccb5ad0fec2b6a5a3442ee17bc282
  Stored in directory: /root/.cache/pip/wheels/3d/77/07/80562de4bb0786e5ea186911a2c831fdd0018bda69beab71fd
Successfully built gputil
Installing collected packages: gputil
Successfully installed gputil-1.4.0
Collecting spacy-pytorch-transformers[cuda100]==0.2.0
  Downloading https://files.pythonhosted.org/packages/b4/a5/45618feff3774b96b046eaafd0d5980c8671159da0bac8ac308dc387532f/spacy_pytorch_transformers-0.2.0-py3-none-any.whl (57kB)
     |████████████████████████████████| 61kB 2.9MB/s 
Collecting ftfy<6.0.0,>=5.0.0 (from spacy-pytorch-transformers[cuda100]==0.2.0)

You will need to **restart the runtime after these installs** to re-instantiate the environment/directory.

In [None]:
import spacy
import GPUtil
import torch
import numpy
from numpy.testing import assert_almost_equal
from scipy.spatial import distance
import cupy
import numpy as np

Check whether a GPU is available and, if it is, switch PyTorch's default tensor type to CUDA.

In [None]:
is_using_gpu = spacy.prefer_gpu()
if is_using_gpu:
    print("Using GPU!")
    torch.set_default_tensor_type("torch.cuda.FloatTensor")
    print("GPU Usage")
    GPUtil.showUtilization()

Using GPU!
GPU Usage
| ID | GPU | MEM |
------------------
|  0 |  0% |  0% |


# XL-Net & BERT Models Explained

2018 was a breakthrough year for NLP, with the release of several new methods including BERT, most of them centered around language modeling. In case you’re not familiar, language modeling is a fancy word for the task of predicting the next word in a sentence given all previous words. This seemingly simple task has a surprising amount of depth, and the true potential of language modeling started to be unlocked by methods that use it for pretraining.

The forerunners in this trend were ULMFiT and ELMo, both of which used LSTM-based language models. The basic idea of these methods was to train a language model on massive amounts of unlabeled data and then use the internal representations of the language model on subsequent tasks with smaller datasets such as question answering and text classification. This was a form of transfer learning, where a larger dataset was used to bootstrap a model that could then perform better on other tasks. The reason this worked so well was that language models captured general aspects of the input text that were almost universally useful. Indeed, both ULMFiT and ELMo were a massive success, producing state-of-the-art results on numerous tasks.

## BERT

BERT stands for “Bidirectional Encoder Representations from Transformers”. It is a neural network architecture that models bidirectional context in text data using the Transformer.

Traditional language models are trained in a left-to-right fashion to predict the next word given a sequence of words. This has the limitation that the model is never required to capture bidirectional context. What does “bidirectional context” mean? For some words, the meaning only becomes apparent when you look at both the left and right context simultaneously. The simultaneous part is important: models like ELMo train two separate models, one using the left context and one using the right context, but do not train a single model that uses both at the same time.

BERT solves this problem by introducing a new task in the form of masked language modeling. The idea is simple: instead of predicting the next token in a sequence, BERT replaces random words in the input sentence with the special [MASK] token and attempts to predict what the original token was. In addition to this, BERT used the powerful Transformer architecture to incorporate information from the entire input sentence.
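
As a rough illustration of the masking step, the sketch below replaces a random subset of words with `[MASK]` and records the original words as prediction targets. The function name `mask_tokens`, the whitespace tokenisation, the masking probability and the seed are all assumptions made for this example. Real BERT pre-training works on wordpieces and treats some of the selected tokens differently.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, targets) for a toy masked-LM example.

    targets maps each masked position to the original token.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)   # hide the token from the model
            targets[i] = tok      # the model must recover this token
        else:
            masked.append(tok)
    return masked, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3)
print(masked)
print(targets)
```

The training loss would be computed only at the positions stored in `targets`.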

Equipped with these two approaches, BERT achieved state-of-the-art performance across numerous tasks. 

In [None]:
model_choice = "en_pytt_bertbaseuncased_lg" #@param ["en_pytt_bertbaseuncased_lg", "en_pytt_xlnetbasecased_lg"]

One important detail is that BERT uses wordpieces (e.g. playing -> play + ##ing) instead of words. This reduces the size of the vocabulary and increases the amount of data that is available for each word.
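
To make the wordpiece idea concrete, here is a minimal sketch of greedy longest-match-first splitting, the strategy WordPiece tokenisers apply to a single word. The function name `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical; the real BERT vocabulary contains roughly 30,000 entries and the library's tokeniser handles many more cases.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into wordpieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: treat the whole word as unknown
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"play", "##ing", "en", "##code", "text"}  # hypothetical vocabulary
print(wordpiece_tokenize("playing", toy_vocab))  # ['play', '##ing']
print(wordpiece_tokenize("encode", toy_vocab))   # ['en', '##code']
```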

In [None]:
nlp = spacy.load(model_choice)
doc = nlp("Here is some text to encode.")
assert doc.tensor.shape == (7, 768)  # Always has one row per token
print(doc._.pytt_word_pieces_)  # String values of the wordpieces
# The raw transformer output has one row per wordpiece.


['[CLS]', 'here', 'is', 'some', 'text', 'to', 'en', '##code', '.', '[SEP]']


Here we can see that the sentence is split into 10 wordpieces, each of which gets its own encoding of size 768. spaCy provides a convenient utility to align the wordpieces back to the original words.  
As the word **encode** has been split into its component parts (`en` and `##code`), extracting its representation as a single word requires pooling together the vectors at wordpiece indices 6 and 7.

In [None]:
print(doc._.pytt_word_pieces)  # Wordpiece IDs (note: *not* spaCy's hash values!)
print(doc._.pytt_alignment)  # Alignment between spaCy tokens and wordpieces

[101, 2182, 2003, 2070, 3793, 2000, 4372, 16044, 1012, 102]
[[1], [2], [3], [4], [5], [6, 7], [8]]
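
The alignment can be used to pool wordpiece vectors back into one vector per spaCy token. The sketch below does this with a plain sum over toy 3-dimensional vectors standing in for the real (10, 768) matrix. The helper name `pool_token_vectors` and the toy values are assumptions, and this is not necessarily how the library itself computes `doc.tensor`.

```python
# Alignment printed above: one list of wordpiece indices per spaCy token.
alignment = [[1], [2], [3], [4], [5], [6, 7], [8]]

# Stand-in for the (10, 768) wordpiece matrix: 10 rows of a toy 3-dim vector.
wordpiece_vectors = [[float(i), float(i) * 2, 1.0] for i in range(10)]

def pool_token_vectors(vectors, alignment):
    """Sum the wordpiece rows aligned to each token."""
    pooled = []
    for indices in alignment:
        rows = [vectors[i] for i in indices]
        pooled.append([sum(column) for column in zip(*rows)])
    return pooled

token_vectors = pool_token_vectors(wordpiece_vectors, alignment)
print(len(token_vectors))  # 7, one row per spaCy token
print(token_vectors[5])    # rows 6 and 7 summed: [13.0, 26.0, 2.0]
```

Rows 0 and 9 (the special tokens) are not aligned to any spaCy token, so a plain sum like this one ignores them.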


We don't see any masked tokens because masking is only applied during pre-training, when the model learns its word representations. As we're using pre-trained models, these masks are not part of the outputs. The special [CLS] and [SEP] tokens are still part of the output.

BERT prepends a [CLS] token (short for “classification”) to the start of each input, essentially like a start-of-sentence token. Its embedding is used as an overall representation of the sentence in downstream tasks.

In [None]:
print(f"The {doc._.pytt_word_pieces_[0]} token embedding can be retrieved by getting the first embedding from the pytorch output - it's the same size as the other embeddings: {len(doc._.pytt_last_hidden_state[0])}")

The [CLS] token embedding can be retrieved by getting the first embedding from the pytorch output - it's the same size as the other embeddings: 768


The last hidden state is the output of the final hidden layer in the BERT architecture and can be retrieved using the *doc._.pytt_last_hidden_state* attribute.  
Accessing it on our document gives us the embedding for each wordpiece token.

In [None]:
print(doc._.pytt_last_hidden_state.shape)
assert len(doc._.pytt_last_hidden_state) == len(doc._.pytt_word_pieces)

(10, 768)


To retrieve every hidden layer's output, the *doc._.pytt_all_hidden_states* attribute is intended to expose a tensor containing all layers for every token.  
**At the time of writing this attribute doesn't yet work and is a known issue on GitHub.**

In [None]:
print(doc._.pytt_all_hidden_states)

None


While the [CLS] token is often used as a sentence representation in downstream tasks - it's also possible to sum the component embeddings for each word to get a sentence level vector

In [None]:
print(f"The sentence level representation retains the same embedding dimensions using a sum-pooled vector match : {len(doc.tensor.sum(axis=0))}")

The sentence level representation retains the same embedding dimensions using a sum-pooled vector match : 768


In [None]:
doc.tensor.sum(axis=0)

array([ 1.92071831e+00, -2.27924675e-01,  9.30034518e-02, -1.93034962e-01,
       -8.33929181e-01, -5.17823124e+00,  1.63885760e+00,  5.26988888e+00,
       -5.49891591e-03,  4.23406363e-01,  4.84476984e-01, -2.48546958e+00,
       -1.97492468e+00,  1.45040047e+00, -4.58841419e+00,  1.33792830e+00,
       -3.63066268e+00,  2.52574968e+00,  3.23240161e-02,  1.73363376e+00,
       -9.02754664e-01, -2.40544513e-01, -5.86369324e+00,  1.33724976e+00,
        8.09447193e+00, -2.85938358e+00, -3.24457264e+00, -1.94612670e+00,
       -7.56774235e+00, -2.41960573e+00,  5.64183593e-01,  8.24668646e-01,
       -3.08341694e+00, -1.93161607e-01, -1.11684406e+00, -1.17629361e+00,
        1.48193562e+00, -9.17339146e-01, -9.93975759e-01,  2.36142230e+00,
       -7.09139729e+00, -2.71788001e+00,  2.19503140e+00,  7.31056631e-01,
        4.54783440e+00, -3.48939514e+00,  4.75842571e+00, -1.59263515e+00,
       -1.60763323e-01,  6.74959958e-01, -5.97016144e+00,  3.28443575e+00,
       -9.77768183e-01,  

## BERT's shortcomings

BERT was already a revolutionary method with strong performance across multiple tasks, but it wasn’t without its flaws. XLNet pointed out two major problems with BERT.

1. The [MASK] token used in training does not appear during fine-tuning

BERT is trained to predict tokens replaced with the special [MASK] token. The problem is that the [MASK] token – which is at the center of training BERT – never appears when fine-tuning BERT on downstream tasks.

This can cause a whole host of issues such as:

What does BERT do for tokens that are not replaced with [MASK]?

In most cases, BERT can simply copy non-masked tokens to the output. So would it really learn to produce meaningful representations for non-masked tokens?

Of course, BERT still needs to accumulate information from all words in a sequence to denoise [MASK] tokens. But what happens if there are no [MASK] tokens in the input sentence?

There are no clear answers to the above problems, but it’s clear that the [MASK] token is a source of train-test skew that can cause problems during fine-tuning. The authors of BERT were aware of this issue and tried to circumvent these problems by replacing some tokens with random real tokens during training instead of replacing them with the [MASK] token. However, this only constituted 10% of the noise. When only 15% of the tokens are noised to begin with, this only amounts to 1.5% of all the tokens, so it is a lackluster solution.
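
The arithmetic behind the 1.5% figure can be checked directly. The 10% share for random tokens comes from the text above; the remaining 80% `[MASK]` and 10% unchanged shares are the split reported in the BERT paper and are stated here as an assumption. The variable names are illustrative.

```python
selected = 0.15  # fraction of tokens chosen for prediction

# How the selected tokens are treated (80/10/10 split, per the BERT paper).
split = {"[MASK]": 0.8, "random token": 0.1, "unchanged": 0.1}

for kind, share in split.items():
    print(f"{kind}: {selected * share:.1%} of all tokens")
```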

2. BERT generates predictions independently

Another problem stems from the fact that BERT predicts masked tokens in parallel. Let’s illustrate with an example: Suppose we have the following sentence.

*I went to [MASK] [MASK] and saw the [MASK] [MASK] [MASK].*

One possible way to fill this out is

*I went to New York and saw the Empire State building.*

Another way is

*I went to San Francisco and saw the Golden Gate bridge.*

However, the sentence

*I went to San Francisco and saw the Empire State building*

is not valid. Despite this, BERT **predicts all masked positions in parallel, meaning that during training**, it does not learn to handle dependencies between predicting simultaneously masked tokens. In other words, it _does not learn dependencies between its own predictions_. Since BERT is not actually used to unmask tokens, this is not directly a problem. The reason this can be a problem is that this reduces the number of dependencies BERT learns at once, making the learning signal weaker than it could be.

Note that neither of these problems is present in traditional language models. Language models have no [MASK] token and generate all words in a specified order, so they learn dependencies between all the words in a sentence.

## XL-Net

The conceptual difference between BERT and XLNet is that XLNet learns to predict the words in an arbitrary order but in an autoregressive, sequential manner (not necessarily left-to-right), whereas BERT predicts all masked words simultaneously. In the illustrations below, transparent words are masked out so the model cannot rely on them.

XLNet addresses both of BERT's problems by introducing a variant of language modeling called “permutation language modeling”. Permutation language models are trained to predict one token given the preceding context, like a traditional language model, but instead of predicting the tokens in sequential order, they predict them in some random order. To illustrate, let’s take the following sentence as an example:

I like cats more than dogs.

A traditional language model would predict the tokens in the order

“I”, “like”, “cats”, “more”, “than”, “dogs”

where each token uses all previous tokens as context.

![alt text](https://i2.wp.com/mlexplained.com/wp-content/uploads/2019/06/ezgif.com-gif-maker-1.gif?resize=447%2C170)

In expectation, the model should learn to model the dependencies between all combinations of inputs in contrast to traditional language models that only learn dependencies in one direction.
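
The idea can be sketched by sampling one prediction order and listing what context is visible at each step. The function name `factorization_contexts` and the seed are assumptions for illustration. The tokens keep their original positions throughout: only the order in which they are predicted changes, and XLNet realises this with attention masks rather than by shuffling the input.

```python
import random

def factorization_contexts(tokens, seed=0):
    """Sample a prediction order and return (order, steps).

    Each step is (position, target, context), where context holds the
    tokens predicted earlier in the sampled order, keyed by their
    original position in the sentence.
    """
    rng = random.Random(seed)
    order = list(range(len(tokens)))
    rng.shuffle(order)
    seen = {}
    steps = []
    for pos in order:
        steps.append((pos, tokens[pos], dict(sorted(seen.items()))))
        seen[pos] = tokens[pos]
    return order, steps

tokens = ["I", "like", "cats", "more", "than", "dogs"]
order, steps = factorization_contexts(tokens)
for pos, target, context in steps:
    print(f"predict {target!r} (position {pos}) given {context}")
```

Averaged over many sampled orders, every token ends up being predicted from context on both its left and its right.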

The difference between permutation language modeling and BERT is best illustrated below.

![alt text](https://i1.wp.com/mlexplained.com/wp-content/uploads/2019/06/Screen-Shot-2019-06-22-at-5.38.12-PM.png?resize=1024%2C567&ssl=1)

In [None]:
model_choice = "en_pytt_xlnetbasecased_lg" #@param ["en_pytt_bertbaseuncased_lg", "en_pytt_xlnetbasecased_lg"]

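You can see that the XL-Net model also has [SEP] and [CLS] tokens like the BERT model. However, they sit in different positions: XL-Net appends both to the end of the sequence, whereas BERT places [CLS] at the start and [SEP] at the end.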

In [None]:
nlp = spacy.load(model_choice)
doc = nlp("Here is some text to encode.")
assert doc.tensor.shape == (7, 768)  # Always has one row per token
print(doc._.pytt_word_pieces_)  # String values of the wordpieces
# The raw transformer output has one row per wordpiece.


XL-Net doesn't use the wordpiece model to perform tokenisation; it uses sentencepiece instead. Sentencepiece is also a subword tokeniser, but with this vocabulary the word encode is kept as a single token/piece.

In [None]:
print(doc._.pytt_word_pieces)  # Wordpiece IDs (note: *not* spaCy's hash values!)
print(doc._.pytt_alignment)  # Alignment between spaCy tokens and wordpieces

spaCy provides the same functionality that we previously saw with BERT: we can access the last hidden layer of each token by using the **._.pytt_last_hidden_state** attribute. It contains 9 embeddings of size 768, one for each wordpiece (including the [SEP] and [CLS] special tokens).

In [None]:
doc._.pytt_last_hidden_state.shape

In [None]:
doc._.pytt_last_hidden_state

We can sum-pool the token vectors to get the sentence embedding

In [None]:
doc.tensor.sum(axis=0)

## SOTA powered Spacy Similarity 

In [None]:
model_choice = "en_pytt_bertbaseuncased_lg" #@param ["en_pytt_bertbaseuncased_lg", "en_pytt_xlnetbasecased_lg"]

As PyTorch transformers is integrated into the normal spaCy pipeline and methods, we can use the **.similarity** method to compare vectors at both token level and sentence level (see https://spacy.io/api/token#similarity). We can also access vectors directly using the **.vector** attribute.

In [None]:
apple1 = nlp("Apple shares rose on the news.")
apple2 = nlp("Apple sold fewer iPhones this quarter.")
apple3 = nlp("Apple pie is delicious.")

At a token level - we can see that the word Apple has different embedding representations in each context and so the similarity of Apple & Apple in each context is different. The model correctly identifies the difference between the embedding representation of the company and the fruit

In [None]:
print(apple1[0].similarity(apple2[0]))  # 0.73428553
print(apple1[0].similarity(apple3[0]))  # 0.43365782

Similarly, this can be applied at a sentence level: the two company-related Apple sentences are more similar to each other than to the apple pie sentence.

In [None]:
print(apple1.similarity(apple2)) #0.69861203
print(apple1.similarity(apple3)) #0.5404963

0.69861203
0.5404965


To understand what's going on under the hood, we can manually recreate the above similarity scores using numpy & scipy methods.
First we sum-pool the token vectors to get a sentence embedding, as we did above. Then we convert the cupy array to a numpy array.

In [None]:
a1_embedding = cupy.asnumpy(apple1.tensor.sum(axis=0))
a2_embedding = cupy.asnumpy(apple2.tensor.sum(axis=0))
a3_embedding = cupy.asnumpy(apple3.tensor.sum(axis=0))

Similarity is defined as **1 - cosine distance** between two arrays
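
Since cosine distance is one minus cosine similarity, the score is simply the dot product of the two vectors divided by the product of their norms. A dependency-free sketch, with the helper name `cosine_similarity` and the toy vectors as assumptions:

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```

Because only the direction of the vectors matters, sum-pooling and mean-pooling the token vectors give the same similarity score.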

In [None]:
print(f"Similarity between Sentence 1 and Sentence 2 is : {1 - distance.cosine(a1_embedding, a2_embedding)}")

Similarity between Sentence 1 and Sentence 2 is : 0.6986120343208313


In [None]:
print(f"Similarity between Sentence 1 and Sentence 3 is : {1 - distance.cosine(a1_embedding, a3_embedding)}")

Similarity between Sentence 1 and Sentence 3 is : 0.5404964685440063


# Build a Sentiment Classifier using Spacy-PyTT

This is a notebook version of the example found in the SpaCy PyTorch Transformers Github repo: https://github.com/explosion/spacy-pytorch-transformers/blob/master/examples/train_textcat.py

**Restart the kernel prior to running this section as the memory allocation on the GPU from the previous sections will cause the code to error**

Loading in additional libraries for this example

In [None]:
import thinc.extra.datasets  # explicit submodule import for thinc.extra.datasets.imdb
import random
import spacy
import GPUtil
import torch
from spacy.util import minibatch
from tqdm.auto import tqdm
import unicodedata
import wasabi
import numpy
from collections import Counter

Ensuring GPU is in use: 
To run this example, ensure GPU MEM ~ 1% at start

In [None]:
spacy.util.fix_random_seed(0)
is_using_gpu = spacy.prefer_gpu()
if is_using_gpu:
    torch.set_default_tensor_type("torch.cuda.FloatTensor")
    print("GPU Usage")
    GPUtil.showUtilization()

GPU Usage
| ID | GPU | MEM |
------------------
|  0 |  0% |  1% |


We'll use the IMDB movie review dataset for sentiment analysis (https://ai.stanford.edu/~amaas/data/sentiment/). We've imported thinc, which provides the IMDB dataset through a built-in loader.

In [None]:
def _prepare_partition(text_label_tuples, *, preprocess=False):
    texts, labels = zip(*text_label_tuples)
    cats = [{"POSITIVE": bool(y), "NEGATIVE": not bool(y)} for y in labels]
    return texts, cats

def load_data(*, limit=0, dev_size=2000):
    """Load data from the IMDB dataset, splitting off a held-out set."""
    if limit != 0:
        limit += dev_size
    assert dev_size != 0
    train_data, _ = thinc.extra.datasets.imdb(limit=limit)
    assert len(train_data) > dev_size
    random.shuffle(train_data)
    dev_data = train_data[:dev_size]
    train_data = train_data[dev_size:]
    train_texts, train_labels = _prepare_partition(train_data, preprocess=False)
    dev_texts, dev_labels = _prepare_partition(dev_data, preprocess=False)
    return (train_texts, train_labels), (dev_texts, dev_labels)

We can call the above functions to generate our training and evaluation data

In [None]:
(train_texts, train_cats), (eval_texts, eval_cats) = load_data()

Next we'll select the pytt model we want to load into spaCy

In [None]:
model_choice = "en_pytt_xlnetbasecased_lg" #@param ["en_pytt_bertbaseuncased_lg", "en_pytt_xlnetbasecased_lg"]

In [None]:
nlp = spacy.load(model_choice)
print(nlp.pipe_names)
print(f"Loaded model '{model_choice}'")
if model_choice == "en_pytt_xlnetbasecased_lg":
  textcat = nlp.create_pipe(
          "pytt_textcat", config={"architecture": "softmax_class_vector"}
      )
elif model_choice == "en_pytt_bertbaseuncased_lg":
  textcat = nlp.create_pipe(
          "pytt_textcat", config={"architecture": "softmax_class_vector"}
      )
else: 
  print("Choose a supported PyTT model")

['sentencizer', 'pytt_wordpiecer', 'pytt_tok2vec']
Loaded model 'en_pytt_xlnetbasecased_lg'


In [None]:
 # add label to text classifier
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")

1

In [None]:
print("Labels:", textcat.labels)
nlp.add_pipe(textcat, last=True)
print(f"Using {len(train_texts)} training docs, {len(eval_texts)} evaluation")

Labels: ('POSITIVE', 'NEGATIVE')
Using 23000 training docs, 2000 evaluation


In [None]:
# total_words = sum(len(text.split()) for text in train_texts)
train_data = list(zip(train_texts, [{"cats": cats} for cats in train_cats]))

In [None]:
n_iter=4
n_texts=1000 # Number of texts; reduce to relieve pressure on GPU memory
batch_size=8 # Batch size; reduce to relieve pressure on GPU memory
learn_rate=2e-5
max_wpb=1000
pos_label="POSITIVE"

In [None]:
def cyclic_triangular_rate(min_lr, max_lr, period):
    it = 1
    while True:
        # https://towardsdatascience.com/adaptive-and-cyclical-learning-rates-using-pytorch-2bf904d18dee
        cycle = numpy.floor(1 + it / (2 * period))
        x = numpy.abs(it / period - 2 * cycle + 1)
        relative = max(0, 1 - x)
        yield min_lr + (max_lr - min_lr) * relative
        it += 1
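
To see the shape of this schedule, the sketch below restates the same formula without numpy (using `math.floor`) and prints the first few rates for toy values. The name `triangular_rate` and the toy arguments are assumptions chosen for readability. Note that `period` is the length of one ramp, so a full up-and-down cycle takes `2 * period` steps.

```python
import math
from itertools import islice

def triangular_rate(min_lr, max_lr, period):
    """Pure-Python restatement of the schedule above, using math.floor."""
    it = 1
    while True:
        cycle = math.floor(1 + it / (2 * period))
        x = abs(it / period - 2 * cycle + 1)
        yield min_lr + (max_lr - min_lr) * max(0, 1 - x)
        it += 1

# Toy values so the shape is easy to read: the rate peaks at step 4
# and returns to the minimum at step 8.
rates = list(islice(triangular_rate(1.0, 3.0, period=4), 9))
print(rates)  # [1.5, 2.0, 2.5, 3.0, 2.5, 2.0, 1.5, 1.0, 1.5]
```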

In [None]:
def evaluate(nlp, texts, cats, pos_label):
    tp = 0.0  # True positives
    fp = 0.0  # False positives
    fn = 0.0  # False negatives
    tn = 0.0  # True negatives
    total_words = sum(len(text.split()) for text in texts)
    with tqdm(total=total_words, leave=False) as pbar:
        for i, doc in enumerate(nlp.pipe(texts, batch_size=batch_size)):
            gold = cats[i]
            for label, score in doc.cats.items():
                if label not in gold:
                    continue
                if label != pos_label:
                    continue
                if score >= 0.5 and gold[label] >= 0.5:
                    tp += 1.0
                elif score >= 0.5 and gold[label] < 0.5:
                    fp += 1.0
                elif score < 0.5 and gold[label] < 0.5:
                    tn += 1
                elif score < 0.5 and gold[label] >= 0.5:
                    fn += 1
            pbar.update(len(doc.text.split()))
    precision = tp / (tp + fp + 1e-8)
    recall = tp / (tp + fn + 1e-8)
    if (precision + recall) == 0:
        f_score = 0.0
    else:
        f_score = 2 * (precision * recall) / (precision + recall)
    return {"textcat_p": precision, "textcat_r": recall, "textcat_f": f_score}
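
The metric arithmetic at the end of `evaluate` can be tried in isolation. The helper name `prf` and the counts below are hypothetical; the formulas mirror the ones in the function above, including the small epsilon that guards against division by zero.

```python
def prf(tp, fp, fn, eps=1e-8):
    """Precision, recall and F1 from raw counts, guarded against division by zero."""
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical counts: 80 true positives, 20 false positives, 10 false negatives.
p, r, f = prf(tp=80, fp=20, fn=10)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")  # P=0.800 R=0.889 F=0.842
```

True negatives are counted in `evaluate` but do not enter any of the three scores.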

In [None]:
# Initialize the TextCategorizer, and create an optimizer.
optimizer = nlp.resume_training()
optimizer.alpha = 0.001
optimizer.pytt_weight_decay = 0.005
optimizer.L2 = 0.0
learn_rates = cyclic_triangular_rate(
    learn_rate / 3, learn_rate * 3, 2 * len(train_data) // batch_size
    )
print("Training the model...")
print("{:^5}\t{:^5}\t{:^5}\t{:^5}".format("LOSS", "P", "R", "F"))

pbar = tqdm(total=100, leave=False)
results = []
epoch = 0
step = 0
eval_every = 100
patience = 3
while True:
    # Train and evaluate
    losses = Counter()
    random.shuffle(train_data)
    batches = minibatch(train_data, size=batch_size)
    for batch in batches:
        optimizer.pytt_lr = next(learn_rates)
        texts, annotations = zip(*batch)
        nlp.update(texts, annotations, sgd=optimizer, drop=0.1, losses=losses)
        pbar.update(1)
        if step and (step % eval_every) == 0:
            pbar.close()
            with nlp.use_params(optimizer.averages):
                scores = evaluate(nlp, eval_texts, eval_cats, pos_label)
            results.append((scores["textcat_f"], step, epoch))
            print(
                "{0:.3f}\t{1:.3f}\t{2:.3f}\t{3:.3f}".format(
                    losses["pytt_textcat"],
                    scores["textcat_p"],
                    scores["textcat_r"],
                    scores["textcat_f"],
                )
            )
            pbar = tqdm(total=eval_every, leave=False)
        step += 1
    epoch += 1
    print(f"epoch {epoch}")
    # Stop if there has been no improvement in the last `patience` checkpoints
    if results:
        best_score, best_step, best_epoch = max(results)
        print(f"best score: {best_score}  best_step : {best_step}  best epoch : {best_epoch} ")
        print(f"break clause: {((step - best_step) // eval_every)}")
        if ((step - best_step) // eval_every) >= patience:
            break

# Report results once training has stopped (outside the while loop, so that
# the loop's step and epoch counters are not overwritten).
msg = wasabi.Printer()
table_widths = [2, 4, 6]
msg.info("Best scoring checkpoints")
msg.row(["Epoch", "Step", "Score"], widths=table_widths)
msg.row(["-" * width for width in table_widths])
for score, step, epoch in sorted(results, reverse=True)[:10]:
    msg.row([epoch, step, "%.2f" % (score * 100)], widths=table_widths)

# Test the trained model
test_text = eval_texts[0]
doc = nlp(test_text)
print(test_text, doc.cats)

Training the model...
LOSS 	  P  	  R  	  F  



# More information & Sources

*XL-Net explanation*  
https://mlexplained.com/2019/06/30/paper-dissected-xlnet-generalized-autoregressive-pretraining-for-language-understanding-explained/  
*Attention Is All You Need*  
https://arxiv.org/abs/1706.03762