# BERT (Paired-Programming Exercise)

**Objectives**

*   To understand how BERT works
*   To apply BERT to NLP

## BERT - The model

### Some details on the model

**BERT** (**Bidirectional Encoder Representations from Transformers**) is a language embedding model published by Google. It is a context-based model, unlike embedding models such as word2vec, which are context-free. BERT was pretrained on a dataset of about 3.3 billion words: approximately 2.5 billion from English Wikipedia and the balance from the [BookCorpus](https://www.english-corpora.org/googlebooks/#).

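Based on our previous discussion of the transformer, we can see where the terms "encoder representations from transformers" come from. But what about "bidirectional"? Bidirectional simply means that the encoder reads the sentence in both directions: when it builds the representation of a word, it uses the words to its left as well as the words to its right. In *I think therefore I am*, for example, the representation of *think* depends on both *I* and *therefore I am*.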
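One way to picture "bidirectional" is to ask which words are available as context for a given word. The toy sketch below is not BERT itself; the sentence and the helper `visible_context` are made up for illustration. It contrasts an encoder that uses context on both sides with a model that only reads left to right.

```python
# Toy illustration only: which words can serve as context for one word?
# `tokens` and `visible_context` are hypothetical names, not part of any library.
tokens = ["i", "think", "therefore", "i", "am"]

def visible_context(tokens, position, bidirectional=True):
    """Return the tokens that the token at `position` may use as context."""
    if bidirectional:
        # A bidirectional encoder uses every position, left and right.
        return list(tokens)
    # A left-to-right model only uses the current and earlier positions.
    return tokens[: position + 1]

position = 1  # the word "think"
print("bidirectional:", visible_context(tokens, position, bidirectional=True))
print("left-to-right:", visible_context(tokens, position, bidirectional=False))
```

In the bidirectional case, "think" gets to use "therefore i am" as well as the leading "i", which is what lets a model fill in a masked word from both sides.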

BERT has three main hyperparameters:

*   $ L $ is the number of encoder layers
*   $ A $ is the number of attention heads
*   $ H $ is the number of hidden units

The model also comes in some pre-specified configurations. Here are the two standard ones:

*   BERT-base: $ L = 12 $, $ A = 12 $, $ H = 768 $
*   BERT-large: $ L = 24 $, $ A = 16 $, $ H = 1024 $
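To get a feel for what these hyperparameters imply, here is a rough sketch that estimates the number of trainable parameters in the BERT-base encoder. Several values are assumptions not stated above: a vocabulary of 30,522 tokens, 512 positions, 2 segment types, and a feed-forward layer of size $ 4H $. The function name `bert_encoder_parameters` is also made up for this exercise.

```python
def bert_encoder_parameters(L, H, vocab_size=30522, max_positions=512, type_vocab=2):
    """Rough parameter count for a BERT-style encoder (assumed layout, see text)."""
    layer_norm = 2 * H  # one scale and one shift value per hidden unit
    # Token, position, and segment embeddings, followed by a layer norm.
    embeddings = vocab_size * H + max_positions * H + type_vocab * H + layer_norm
    # Query, key, value, and output projections (weights + biases), then a layer norm.
    attention = 4 * (H * H + H) + layer_norm
    # Two dense layers (H -> 4H -> H) with biases, then a layer norm.
    feed_forward = (H * 4 * H + 4 * H) + (4 * H * H + H) + layer_norm
    # Final dense layer applied to the first token's representation.
    pooler = H * H + H
    return embeddings + L * (attention + feed_forward) + pooler

# BERT-base: L = 12, A = 12, H = 768
L, A, H = 12, 12, 768
head_size = H // A  # each attention head works on a slice of the hidden units
print(f"hidden units per attention head: {head_size}")
print(f"approximate parameters: {bert_encoder_parameters(L, H):,}")
```

Note that $ A $ does not change the total: the attention heads split the $ H $ hidden units among themselves rather than adding new ones.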

In particular, we'll be using BERT to help discover the missing word in a sentence, a task known as masked language modeling. BERT can also be used for Next Sentence Prediction (NSP) and, with fine-tuning, for a myriad of other applications such as text classification and question answering.

### Using BERT

With that as prologue, let's start using BERT. First, we'll have to set up our environment.

#### BERT's environment

**Note:** We are using Google Colab because, in a standard Jupyter notebook, the various installations often do not work well together; this is especially true on M1 chip MacBooks.

In [None]:
# Install transformers
!pip install transformers

In [2]:
# Import the relevant libraries
from transformers import pipeline

The model ```bert-base-uncased``` is one of the pretrained BERT models and it has 110 million parameters. Details can be found at [Hugging Face](https://huggingface.co/bert-base-uncased).



#### Masking with BERT

In [None]:
# Define our function unmasker
unmasker = pipeline('fill-mask', model='bert-base-uncased')

Now let's try a sentence and see how BERT does.

In [None]:
# [MASK] goes in the place you want BERT to predict the correct word
unmasker("Artificial Intelligence [MASK] take over the world.")

The top five possibilities are shown. Further, the token string with the highest score is the one with the highest probability of being correct according to BERT. In this example, it is "can" as in "artificial intelligence can take over the world."
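Where do these scores come from? As a minimal sketch: a model produces one raw score (a logit) per candidate word at the masked position, a softmax turns those into probabilities, and the top few are reported. The candidate words and logits below are made-up numbers for illustration, not real BERT output, and `softmax` and `top_k` are our own helper names.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(candidates, logits, k=5):
    """Return the k (word, probability) pairs with the highest probability."""
    probabilities = softmax(logits)
    ranked = sorted(zip(candidates, probabilities), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Made-up values for illustration only; real BERT scores every word in its vocabulary.
toy_candidates = ["can", "will", "could", "would", "might", "must"]
toy_logits = [2.0, 1.5, 1.0, 0.5, 0.0, -1.0]

for word, probability in top_k(toy_candidates, toy_logits):
    print(f"{word}: {probability:.3f}")
```

Because the probabilities are spread over all candidates, a top score well below 1 simply means the model sees several plausible words for the blank.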

One supposes we should be happy that "can" has a higher probability than "will."

In the output, ```token``` is the ID of the predicted token in BERT's vocabulary. For our purposes, we don't have to worry about that; we only need ```score``` and ```token_str```, along with the corresponding ```sequence```.
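Since the output is a list of dictionaries, it can be easier to read if we print only the fields we care about. The helper below, `summarize`, is a hypothetical convenience function of our own, and the record it is shown on is a hand-written placeholder rather than real BERT output.

```python
def summarize(predictions):
    """Return one readable line per prediction: token string, score, full sentence."""
    return [
        f'{p["token_str"]}: {p["score"]:.3f} -> {p["sequence"]}'
        for p in predictions
    ]

# Hand-written placeholder record, NOT real BERT output.
placeholder = [
    {"score": 0.25, "token": 0, "token_str": "example", "sequence": "an example sentence."}
]

for line in summarize(placeholder):
    print(line)
```

In the notebook, you would pass the result of an ```unmasker(...)``` call with a single ```[MASK]``` instead of ```placeholder```.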

##### Task 1: Masking Twice

What happens if one uses ```[MASK]``` two times in a sentence?

For example, run the following in the code block below and interpret the results.


```
unmasker("Artificial Intelligence [MASK] take over the [MASK].")
```



In [None]:
# Using [MASK] twice


*Explain and interpret the "double-mask" here.*

##### Task 2: Using unmasker

Use ```unmasker``` on three other sentences. At least one of them should be a "double-mask."
Explain and interpret each one.

In [6]:
# Run each unmasker sentence in a different code cell, followed by its analysis in a text cell.


##### Literary Interlude

How does ```unmasker``` perform with a quote from literature or other notable work?



Let's look first at "To be, or not to be, that is the question" from William Shakespeare's *Hamlet* (Act 3, Scene 1).

In [None]:
# Let's mask "question"
unmasker("To be, or not to be, that is the [MASK]:")

We can see that the highest probability does give us the correct answer.

Let's look at another one.

The opening line of James Joyce's *Ulysses* is “Stately, plump Buck Mulligan came from the stairhead, bearing a bowl of lather on which a mirror and a razor lay crossed.”

In [None]:
# Let's mask "plump"
unmasker("Stately, [MASK] Buck Mulligan came from the stairhead, bearing a bowl of lather on which a mirror and a razor lay crossed.")

We see that the actual word, "plump," did not make the top 5.

Now let's unmask "plump" and mask "lather."

In [None]:
# Let's mask "lather"
unmasker("Stately, plump Buck Mulligan came from the stairhead, bearing a bowl of [MASK] on which a mirror and a razor lay crossed.")

While "lather" is not picked, the 3rd choice of the model is "soap," which is a synonym.
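When the original word is not the top choice, it can be useful to ask where it ranks. Here is a small sketch of a helper for that; `rank_of` is our own hypothetical function, and it assumes the predictions are a list of dictionaries with a ```token_str``` key, sorted from highest to lowest score, as in the single-mask outputs above.

```python
def rank_of(predictions, word):
    """Return the 1-based rank of `word` among the predictions, or None if absent."""
    target = word.strip().lower()
    for rank, prediction in enumerate(predictions, start=1):
        if prediction["token_str"].strip().lower() == target:
            return rank
    return None
```

For example, ```rank_of(unmasker(sentence), "lather")``` returns ```None``` when the word is not among the top five. To look further down the list, the pipeline call accepts a ```top_k``` argument, e.g. ```unmasker(sentence, top_k=50)```.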

##### Task 3: A quote from literature or other notable work

Now it is your turn.

Find a quote from literature or other notable work such as from a philosophical or religious text and make sure to state where the quote is from.

Mask at least two different words and see how BERT performs.

In [None]:
#

##### Task 4: Bias in the model

Run the following two code cells.

In [None]:
# Men at work
unmasker("The man worked as a [MASK].")

In [None]:
# Women at work
unmasker("The woman worked as a [MASK].")

What do you notice about the top five responses for men and women? Explain. 

*Recall that we noted above which data BERT was trained on, so you may want to reference that in your explanation.*
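To make the comparison easier, the sketch below lines up two sets of predictions and reports which words they share and which appear in only one of them. `compare_predictions` is a hypothetical helper of our own; it assumes each argument is the list of dictionaries returned by a single-mask ```unmasker``` call.

```python
def compare_predictions(first, second):
    """Compare the predicted words in two single-mask outputs."""
    first_words = {p["token_str"] for p in first}
    second_words = {p["token_str"] for p in second}
    return {
        "shared": sorted(first_words & second_words),
        "only_first": sorted(first_words - second_words),
        "only_second": sorted(second_words - first_words),
    }
```

Store the outputs of the two cells above in variables and pass them in, for example ```compare_predictions(men, women)```. The words that land in only one group are a good starting point for your explanation.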