In [1]:
# !pip install transformers
# !pip install torch

### About Transformers

* Transformers are an architecture (family) of language models
  * In the same way that CNNs and RNNs are common architectures for working with image or sequential data. 

* Trained on large amounts of raw text in a self-supervised fashion. 
  * Self-supervised learning is a type of training in which the objective is automatically computed from the inputs of the model. 

* The resulting models develop a statistical understanding of the language they have been trained on


### Transformers are Big Models

* Transformers are typically very large models

![](https://miro.medium.com/max/1338/1*40VA19kG5zUmTj-AOnh47A.png)


### Transformers: an Encoder and a Decoder

![](https://www.dropbox.com/s/g3vfmxmb926l4hn/encoder_decoder_arch.png?dl=1)


### Transformers: an Encoder and a Decoder

* The encoder finds an appropriate representation of the input

* The decoder uses the encoder’s representation (features) along with other inputs to generate a target sequence.

* It turns out that the encoder can be used independently of the decoder
  * In this course, we will focus solely on the encoder

### Transformers Architecture: The encoder

* As discussed, Word2Vec can be used as a context-independent "dictionary" to assign embeddings to words
  * This is a problem since the meaning of a word depends on its context
  * e.g.: the English language is full of homonyms 
    * "He banks at FHB" vs. "He banks on her support to win the election"
    * "He wore a bow tie to the event"  vs. "He used a bow and arrows to hunt the prey."

* The encoder builds a *contextualized* representation of an input. 
  * This means that the model can acquire understanding from the input.
* One of the most popular encoder models is BERT
  * Bidirectional Encoder Representations from Transformers


### Transformers Architecture: Example Encoder -- BERT

* BERT is a way to contextually encode a word
  * The embedding of the word is context dependent. 
  * "He banks at FHB" vs. "He banks on her support to win the election"
  * The value of "banks" takes into account the value of the words around it.
    
* Unlike Word2Vec, BERT does not just operate like a dictionary

* The size of the generated embedding depends on the architecture
  * For BERT base, the size is 768

* A BERT embedding is said to hold the meaning of a token within its text
  * BERT uses subword tokenization, so it is not "1 word = 1 embedding".


### More encoders

* There are dozens of different encoder architectures. For example:
    
* There are also dozens of models that fine-tune BERT to specific domains
 * FinBERT (finance): https://github.com/ProsusAI/finBERT
 * med-BERT (medical field) https://github.com/ProsusAI/finBERT
 * SciBERT: scientific text BERT (https://aclanthology.org/D19-1371.pdf)
  ...        


### Transformers Architecture: The Decoder

* The decoder works similarly to the encoder 
 * Unlike the encoder, the decoder uses masked self-attention. 
   * It can only see the words on one side (e.g., the left)

* The decoder uses the encoder’s representation (features) along with other inputs to generate a target sequence.
* The word at position $i$ depends only on the words at positions $1, \dots, i-1$ 
  * This means that the model is optimized for generating outputs.
* Decoders are not as relevant for this course
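
The "one side only" rule is commonly implemented as a causal mask over the attention scores. The block below is a minimal pure-Python sketch of that idea; the helper name `causal_mask` is ours and is not part of any library. In practice the decoder input is shifted by one position, so row $i$ of the mask covers position $i$ and everything to its left, and that row is what predicts the next word.

```python
def causal_mask(seq_len):
    """Return a seq_len x seq_len matrix; mask[i][j] == 1 if position i may attend to j."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


tokens = ["Apple", "will", "soon", "allow"]
mask = causal_mask(len(tokens))

# Show which tokens each position is allowed to look at.
for token, row in zip(tokens, mask):
    visible = [t for t, keep in zip(tokens, row) if keep]
    print(f"{token:>6}: {row} -> sees {visible}")
```

An encoder such as BERT corresponds to a mask made entirely of ones: every position sees the whole sentence.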

### Popular Transformers

* GPT and GPT-2: transformer-based language models (GPT-2 has 1.5 billion parameters)
  * GPT-2 was trained on 8 million web pages
* GPT-3, the third-generation GPT transformer 
  * 175 billion learnable parameters
  * Incredibly (scarily) good at a dizzying number of tasks
  * E.g., generating Web Component or SQL code from a natural-language query
  * https://github.com/features/copilot

### Some NLP Tasks Empowered by Transformers
* Feature Extraction
  * Getting the vector embedding of a word, sentence, paragraph, or even a document
* Question answering
* Sentiment analysis
* Summarization
* Zero shot classification
* Named entity recognition
* Etc.
  


### Hugging Face Model Hub

* The App Store of transformer-based language models
  * Some newer transformer-based image models as well

* Dozens of pre-trained models  
  * Default models pretrained for specific tasks
* Support for over 100 languages.
* APIs to download and use those pre-trained models on a given text
 * Complete pipeline, including text processing
 * Processing the data by hand is often painfully difficult, because it is model specific
* Possibility to fine-tune the model on custom data


In [2]:
from transformers import pipeline



In [3]:
classifier = pipeline("sentiment-analysis")
classifier("ICS 438 is an amazing course. Everyone should take it!")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9998667240142822}]

In [4]:
classifier("What a super boring movie. Never going to recommend it!")

[{'label': 'NEGATIVE', 'score': 0.9991720914840698}]

In [5]:
sentences_to_classify = [
                            "I really like the new design. Your app rocks!",
                            "What a mess this site is to navigate",
                            "very confusing to get anything done on this new redesign"
                        ]

In [12]:
classifier(sentences_to_classify)

[{'label': 'POSITIVE', 'score': 0.9998397827148438},
 {'label': 'NEGATIVE', 'score': 0.9997926354408264},
 {'label': 'NEGATIVE', 'score': 0.9996102452278137}]

In [13]:
### Positive?
### Models are not humans -- understanding their limitations is critical
classifier("He went home")

[{'label': 'POSITIVE', 'score': 0.9991198182106018}]

### Example: Zero Shot Classification

* Zero-shot learning: solve a task despite not having received any training examples of that task.
  * E.g.: recognizing a category of object in photos without ever having seen a photo of that kind of object before. 




In [15]:
classifier = pipeline("zero-shot-classification")

classifier(
    "The CDC is approving booster vaccine shots for everyone over the age of 18.",
    candidate_labels=["politics", "business", 'healthcare'],
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'The CDC is approving booster vaccine shots for everyone over the age of 18.',
 'labels': ['healthcare', 'business', 'politics'],
 'scores': [0.9219019412994385, 0.06307942420244217, 0.015018666163086891]}

### Named Entity Recognition

* Find the entities (such as persons, locations, or organizations) in a sentence. 
 * Classify each token of the sentence into one class per entity type, plus one class for “no entity.”
* Default classes
    * O: the word doesn’t correspond to any entity
    * PER: person entity
    * ORG: organization entity
    * LOC: location entity
    * MISC: miscellaneous entity
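
To make the per-token classification concrete, here is a small sketch of how token-level labels can be merged into entity spans. It is a simplification under stated assumptions: the function `group_entities` is hypothetical, the labels are assigned by hand rather than predicted, and real pipelines additionally deal with label prefixes and subword pieces.

```python
def group_entities(tokens, labels):
    """Merge consecutive tokens that share the same non-"O" label."""
    entities = []
    current_words, current_label = [], None
    for token, label in zip(tokens, labels):
        if label == current_label and label != "O":
            current_words.append(token)  # still inside the same entity
            continue
        if current_label not in (None, "O"):
            entities.append({"entity_group": current_label, "word": " ".join(current_words)})
        current_words, current_label = [token], label
    # Flush the last entity, if the sentence ended inside one.
    if current_label not in (None, "O"):
        entities.append({"entity_group": current_label, "word": " ".join(current_words)})
    return entities


# Hand-labelled tokens, for illustration only.
tokens = ["Speaking", "from", "San", "Francisco", ",", "Elon", "Musk"]
labels = ["O", "O", "LOC", "LOC", "O", "PER", "PER"]
entities = group_entities(tokens, labels)
print(entities)
```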


In [16]:
ner = pipeline("ner", grouped_entities=True)
ner("Speaking from San Francisco, Elon Musk asked whether he should sell off Tesla stocks")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'entity_group': 'LOC',
  'score': 0.9993223,
  'word': 'San Francisco',
  'start': 14,
  'end': 27},
 {'entity_group': 'PER',
  'score': 0.9980908,
  'word': 'Elon Musk',
  'start': 29,
  'end': 38},
 {'entity_group': 'ORG',
  'score': 0.9525725,
  'word': 'Tesla',
  'start': 72,
  'end': 77}]

### The Pipeline Steps in Detail

* Word tokenization
* Input processing
* Model processing
* Post-processing


### Word Encoding

* Models require numerical inputs. 

* Naive: assign a unique value to each word in the vocabulary

   * {"aardvark": 1, ... "Zeuxis": 125452}

* This approach is referred to as word tokenization

  * Split on words and punctuation.

  * Assign each word a unique ID 

* A corpus may contain hundreds of thousands of distinct words, so the vocabulary can become very large. 

```There is one count that puts the English vocabulary at about 1 million words — but that count presumably includes words such as Latin species names, prefixed and suffixed words, scientific terminology, jargon, foreign words of extremely limited English use, and technical acronyms.```[https://en.wikipedia.org/wiki/List_of_dictionaries_by_number_of_words](https://en.wikipedia.org/wiki/List_of_dictionaries_by_number_of_words)
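
The naive scheme can be sketched in a few lines. The regular expression and the helper names `word_tokenize` and `build_vocab` are illustrative choices, not the tokenizer of any particular library.

```python
import re


def word_tokenize(text):
    """Split a string into words and individual punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def build_vocab(corpus):
    """Assign each distinct token a unique integer ID, in order of first appearance."""
    vocab = {}
    for sentence in corpus:
        for token in word_tokenize(sentence):
            vocab.setdefault(token, len(vocab))
    return vocab


corpus = ["The bank was out of money.", "The river bank was flooded."]
vocab = build_vocab(corpus)
encoded = [vocab[token] for token in word_tokenize("The river bank was flooded.")]
print(vocab)
print(encoded)
```

Note that `vocab[token]` raises a `KeyError` for any word that was not in the corpus, which is exactly the unseen-word problem of word-level vocabularies.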



### Character Tokenization

* At the other end of the spectrum, we could encode every character independently
  * Our encoder only needs the valid alphabet and punctuation characters
* Small encoding scheme, but it can encode any word in the same alphabet

Issues with this approach:

* The encoding of a single sentence becomes long
* Tokens do not mean anything when taken separately
  * They need to be combined to generate a useful meaning
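
The trade-off is easy to see on a single sentence. The sketch below assumes an alphabet of ASCII letters, digits, punctuation, and the space character; that choice is ours and is made only for illustration.

```python
import string

# Assumed alphabet: ASCII letters, digits, punctuation, and the space character.
alphabet = string.ascii_letters + string.digits + string.punctuation + " "
char_to_id = {ch: i for i, ch in enumerate(alphabet)}


def char_encode(text):
    """Map every character of the text to its integer ID."""
    return [char_to_id[ch] for ch in text]


sentence = "Word tokenization is cool!"
ids = char_encode(sentence)
print("vocabulary size:", len(char_to_id))
print(len(sentence.split()), "words ->", len(ids), "character tokens")
```

The vocabulary stays tiny, but the four-word sentence already expands into one token per character.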

### Word Tokenization 
* Questions:
 1. How do we encode these without explicitly accounting for every word in the language?

 2. How do we encode very similar words (car vs. cars, text vs. textual)?

  * Yes, we can rely on their embeddings being fairly similar, but maybe we can encode words so that they match before computing the embeddings.

 3. How should a deployed model handle previously unseen words? 

* Solution: subword tokenization    
    

### Sub-word Encoding: An Intuition

* Language contains hundreds of thousands of words and text is often very large 

  * How can we encode these words without explicitly accounting for every word in the language?

    

* Use tokens instead of words

  * Split a word into prefix, stem and suffix

  * unusually -> un + usual + ly  

  * unsuspiciously -> un + suspicious + ly  

* [suspicious, usual, un, ly] these 4 tokens can construct 8 words 

  * suspicious, usual, unsuspicious, unusual, usually, suspiciously, unusually, unsuspiciously 

  

    

* Expand the approach to include common stems, suffixes, and prefixes.

  * Makes it easy for words with similar stems to match while keeping the vocabulary small
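
The count of 8 words from 4 tokens can be checked by enumerating every combination of an optional prefix, a stem, and an optional suffix. This is only a counting sketch, not a tokenizer.

```python
from itertools import product

prefixes = ["", "un"]            # the empty string means "no prefix"
stems = ["usual", "suspicious"]
suffixes = ["", "ly"]            # the empty string means "no suffix"

# Every (prefix, stem, suffix) combination yields one word: 2 * 2 * 2 = 8.
words = [prefix + stem + suffix for prefix, stem, suffix in product(prefixes, stems, suffixes)]
print(len(words), words)
```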

    



### Sub-word Encoding

* Words should be split into meaningful subwords

* Frequent words should not be split. 

* Rare words should be split into subwords



### Sub-word Tokenization Schemes

* Different models use different schemes for splitting and encoding words
  * BERT uses WordPiece, a tokenization scheme introduced by Google.
    * The algorithm used has not been open-sourced; it has been reverse engineered in some applications
  * ALBERT uses Unigram 
    * Substantially different from WordPiece



### WordPiece Algorithm: General Approach

* A greedy algorithm that decomposes its vocabulary into chunks and retains the most frequent ones.

* Builds a vocabulary containing the most frequent chunks

* Given the following corpus, the algorithm proceeds as described next

```Corpus =  {HuggingFace, hugging, face, hug, hugger, learning, learner, learners, learn}```





### WordPiece Algorithm - 1
![](https://www.dropbox.com/s/q1dctybrlx2y0mh/wp_1.png?dl=1)


### WordPiece Algorithm - 2
![](https://www.dropbox.com/s/f5626qe6mpdjapm/wp_2.png?dl=1)


### WordPiece Algorithm - 3
![](https://www.dropbox.com/s/5m5n0qvn22lfwxk/wp_3.png?dl=1)

### WordPiece Algorithm - 4
![](https://www.dropbox.com/s/4ol60txxkqy6evp/wp_4.png?dl=1)

### WordPiece Algorithm - Matching

![](https://www.dropbox.com/s/875ntqoyuhh8bg9/wp_5.png?dl=1)

* Each token is encoded using a unique value
* The input representation is the sequence of encodings of its tokens
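
The matching step is usually described as greedy longest-match-first: at each position, take the longest vocabulary entry that fits, then continue from where it ends. The sketch below follows that description with a toy vocabulary invented for illustration (it is not BERT's real vocabulary), and the function name `wordpiece_tokenize` is ours. It may differ in detail from the procedure shown in the figures.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a piece that continues a word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, assumed for this example only.
vocab = {"hug", "learn", "face", "##ging", "##ger", "##ing", "##er", "##s"}
for word in ["hugging", "hugger", "learners", "learning", "xyz"]:
    print(word, "->", wordpiece_tokenize(word, vocab))
```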

### Model to Tokenizer Mapping

* It's critical to use the correct encoding for each model we use. 
  * To use a BERT model, we need to convert the input data using the same tokenizer that was used for its training
    * Split the words in the same way that the model does
    * Use the same delimiter characters
    * Use the same token IDs as the model


* For more details, see: https://huggingface.co/transformers/tokenizer_summary.html

In [20]:
# the bert cased model is case-sensitive: it makes a difference between english and English.
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

my_input = tokenizer("Word tokenization is cool!", return_tensors="pt")
my_input

{'input_ids': tensor([[  101, 10683, 22559,  2734,  1110,  4348,   106,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}

In [21]:
# the "##" prefix indicates that the token does not occur at the start of a word.
tokens = tokenizer.tokenize("Word tokenization is cool!")
tokens

['Word', 'token', '##ization', 'is', 'cool', '!']

In [22]:
tokenizer.convert_tokens_to_string(tokens)

'Word tokenization is cool !'

In [23]:
tokenizer.convert_tokens_to_ids(tokens)

[10683, 22559, 2734, 1110, 4348, 106]

In [24]:
my_input["input_ids"].tolist()[0]

[101, 10683, 22559, 2734, 1110, 4348, 106, 102]
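
The only difference between the last two outputs is the pair of IDs 101 and 102, which this vocabulary uses for the special `[CLS]` and `[SEP]` tokens that mark the start and end of the input. The sketch below reproduces that wrapping by hand with the IDs copied from the outputs above; `add_special_tokens` is a hypothetical helper written for this illustration, since the tokenizer call normally does this for you.

```python
# IDs copied from the outputs above (bert-base-cased).
token_ids = [10683, 22559, 2734, 1110, 4348, 106]
CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] in this vocabulary


def add_special_tokens(ids):
    """Wrap a list of token IDs with the start and end markers."""
    return [CLS_ID] + ids + [SEP_ID]


model_input = add_special_tokens(token_ids)
print(model_input)
```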

In [None]:
len(tokenizer.vocab.keys())


In [29]:
list(tokenizer.vocab.keys())[10_000:10_020]


['Interior',
 'echoed',
 'Valentine',
 'varieties',
 'Brady',
 'cluster',
 'Ever',
 'voyage',
 '##of',
 'deposits',
 'ultimate',
 'Hayes',
 'horizontal',
 'proximity',
 '##ás',
 'estates',
 'exploration',
 'NATO',
 'Classical',
 '##most']

### BERT Architecture and Embeddings
![](https://jalammar.github.io/images/bert-output-vector.png)

In [44]:
# Importing the relevant modules
from transformers import BertModel
# import pandas as pd
# import numpy as np
import torch

# Loading the pre-trained BERT model
###################################
# Embeddings will be derived from
# the outputs of this model
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)

# Setting up the tokenizer
###################################
# This is the same tokenizer that
# was used in the model to generate
# embeddings to ensure consistency
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [45]:
text = "The bank was out of money."
tokenized_input = tokenizer(text, return_tensors="pt")

output = model(**tokenized_input)
output["last_hidden_state"]

tensor([[[-0.0319,  0.4166, -0.4186,  ..., -0.3329,  0.3522,  0.1938],
         [ 0.3837,  0.1618, -0.6934,  ...,  0.1908,  1.1430, -0.5608],
         [ 0.7917, -0.5113, -0.0309,  ..., -0.5770,  0.1681, -0.1110],
         ...,
         [ 0.0486, -0.7598, -0.4104,  ..., -0.5119, -0.1877, -0.1703],
         [ 0.5552,  0.0767, -0.4177,  ...,  0.1522, -0.3517, -0.4224],
         [ 0.7520,  0.1612, -0.1570,  ...,  0.0096, -0.5722, -0.2309]]],
       grad_fn=<NativeLayerNormBackward0>)

In [46]:
output["last_hidden_state"].shape

torch.Size([1, 9, 768])

In [47]:
# Text corpus
##############
# These sentences show different uses of the word 'bank' and illustrate the
# value of contextualized embeddings

texts = ["bank",
         "The river bank was flooded.",
         "The bank vault was robust.",
         "He had to bank on her for support.",
         "The bank was out of money.",
         "The bank robber was arrested."]


In [56]:
tokenized_input = tokenizer(texts[0], return_tensors="pt")  # "bank" on its own
tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"][0])


['[CLS]', 'bank', '[SEP]']

In [66]:
def find_index_word(word, tokenized_input):
    words = tokenizer.convert_ids_to_tokens(tokenized_input) 
    return words.index(word)

In [67]:
find_index_word("bank", tokenized_input["input_ids"][0])

1

In [74]:
# Getting embeddings for the target
# word in all given contexts
# The following is implemented using a for loop for illustration purposes only.
target_word_position = []
target_word_embeddings = []

for text in texts:
    tokenized_input = tokenizer(text, return_tensors="pt")
    output = model(**tokenized_input)
    embeddings  = output.last_hidden_state
    embeddings = torch.squeeze(embeddings, dim=0)
    word_index = find_index_word("bank", tokenized_input["input_ids"][0])
    target_word_position.append(word_index)
    word_embedding = embeddings[word_index]
    target_word_embeddings.append(word_embedding)

In [75]:
target_word_position

[1, 3, 2, 4, 2, 2]

In [76]:
len(target_word_embeddings)

6

In [79]:
from scipy.spatial.distance import cosine
import pandas as pd

list_of_sim= []
for i in range(len(texts)-1):
    for j in range(i+1,len(texts)):
        text_1 = texts[i]
        text_2 = texts[j]
        embd_1 = target_word_embeddings[i]
        embd_2 = target_word_embeddings[j]
        cos_sim = 1 - cosine(embd_1.detach().numpy(), embd_2.detach().numpy())
        list_of_sim.append([text_1, text_2, cos_sim])
sims_df = pd.DataFrame(list_of_sim, columns=['text1', 'text2', 'similarity'])
sims_df

| | text1 | text2 | similarity |
|---|---|---|---|
| 0 | bank | The river bank was flooded. | 0.338063 |
| 1 | bank | The bank vault was robust. | 0.494098 |
| 2 | bank | He had to bank on her for support. | 0.25614 |
| 3 | bank | The bank was out of money. | 0.469941 |
| 4 | bank | The bank robber was arrested. | 0.43394 |
| 5 | The river bank was flooded. | The bank vault was robust. | 0.523326 |
| 6 | The river bank was flooded. | He had to bank on her for support. | 0.331584 |
| 7 | The river bank was flooded. | The bank was out of money. | 0.512161 |
| 8 | The river bank was flooded. | The bank robber was arrested. | 0.535616 |
| 9 | The bank vault was robust. | He had to bank on her for support. | 0.416074 |
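
The similarity column is the cosine of the angle between two embedding vectors, $\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$. SciPy's `cosine` returns the cosine *distance*, which is why the code above computes `1 - cosine(...)`. A plain-Python version on small made-up vectors shows what the number measures:

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])
print(same_direction, orthogonal, opposite)
```

Vectors pointing the same way score close to 1, unrelated (orthogonal) vectors score 0, and opposite vectors score close to -1; the length of the vectors does not matter.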


### How Does BERT Work? Intuition - 1

![](https://www.dropbox.com/s/02pizzlbl2qhnsx/weight_scheme.png?dl=1)

### How Does BERT Work? Intuition - 2

![](https://www.dropbox.com/s/pkgchdb8qu1tt26/steps.png?dl=1)

### How Does BERT Work? Intuition - 3

![](https://www.dropbox.com/s/oai4l6yjs5tvq18/reweigh_sem_sim.png?dl=1)

### How Does BERT Work? Intuition - 4

![](https://www.dropbox.com/s/oai4l6yjs5tvq18/reweigh_sem_sim.png?dl=1)

### How Does BERT Work? Intuition - 5

![](https://www.dropbox.com/s/e6fva5rwfxq0sc5/self_attention.png?dl=1)

### How Does BERT Work? Intuition - 6
![](https://www.dropbox.com/s/87vgwxurl9dvw96/attention_block_no_params.png?dl=1)

### How Does BERT Work? Intuition - 7
![](https://www.dropbox.com/s/y6e8k6re7zgauqd/attention_block_params.png?dl=1)

### How Does BERT Work? Intuition - 8
![](https://www.dropbox.com/s/woqfyyh92rt72h9/multi_head_attension.png?dl=1)
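
As a rough companion to the figures, the following is a minimal sketch of self-attention without any learned parameters: each token's new vector is a weighted average of all token vectors, with weights given by a softmax over scaled dot products. The 2-dimensional "embeddings" are made up, and the function names are ours. In BERT itself, the queries, keys, and values are learned linear projections of the inputs, and several such heads run in parallel.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Parameter-free self-attention: each output is a weighted average of all inputs."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Similarity of this token to every token (itself included), scaled by sqrt(dim).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in vectors]
        weights = softmax(scores)
        mixed = [sum(w * value[d] for w, value in zip(weights, vectors)) for d in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Made-up 2-dimensional "embeddings" for three tokens.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
contextual, weights = self_attention(embeddings)
for row in weights:
    print([round(w, 3) for w in row])
```

The first token gives more weight to the second token than to the third because their vectors are more similar, which is the "re-weighting by semantic similarity" idea.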

### Why use Multi-Head Attention?

* Language is complex and ambiguous.

  * Many rules and principles govern syntax, word formation, and other features.

  * There are many exceptions to these rules.

* Multiple attention heads are needed to capture the complexity of the language.

  * Each head will "focus" on different aspects of the language, possibly on different regions.

  

![](https://www.dropbox.com/s/67qf3xrweu1p9tl/heads.png?dl=1)

### How Does BERT Work? Interesting Reads


* The Illustrated Transformer. An easy-to-follow blog post that is somewhat superficial. http://jalammar.github.io/illustrated-transformer/

* Transformers from scratch. A blog entry that is easy to follow and provides code to illustrate. https://peterbloem.nl/blog/transformers

* Attention Is All You Need. The paper that introduced "modern" transformers as we know them. https://arxiv.org/abs/1706.03762

* The Annotated Transformer. A PyTorch-annotated guide that explains the Attention Is All You Need paper. http://nlp.seas.harvard.edu/2018/04/01/attention.html

* Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned: https://aclanthology.org/P19-1580.pdf

* An excellent explanation of how BERT is implemented: https://gmihaila.github.io/tutorial_notebooks/bert_inner_workings/#bertpooler



### Transfer Learning

* Language models are typically trained on very large amounts of data.

* Training can take weeks on very sophisticated hardware and be very expensive ($$$)

  * Training GPT-3 reportedly cost $12M for a single training run    
  https://www.technologyreview.com/2020/07/20/1005454/openai-machine-learning-language-generator-gpt-3-nlp/

* The training is done on generic data
  * Not optimized for a specific domain (e.g., healthcare or physics)
* Training can also be done on a generic task 
  * E.g., masked language modeling ("my [MASK] is John Doe" -> "name")  


### Transfer Learning - cont'd

* Transfer learning is the process of transferring knowledge from model A, to a model B

  * Model B may be doing a task that's slightly different

    * E.g., model A does NER on news items and model B does NER on invoices (company names, total tax, item counts, etc.) 

    * Model A is trained on English Wikipedia and model B on health care documents.

    * The task can be different

      * A does masked language modeling, B does sentiment analysis

* The knowledge acquired in model A is transferred to model B

* Model B typically needs a smaller dataset to be trained

### Transfer Learning - cont'd

* When training from scratch, all the model's weights are initialized randomly.

* Transfer learning consists of "continuing" training with a new, smaller dataset

  * Some approaches are used to force weights not to change too much.

    * E.g., a much smaller learning rate, or even freezing some layers and adding new ones that learn something new

* Since the pre-trained model was already trained on lots of data, fine-tuning requires substantially less data to obtain reasonable results, in less time and with fewer computational resources.
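
The "freeze some layers and add new ones" idea can be illustrated on a deliberately tiny model. Everything in this sketch is made up for illustration: a single frozen weight stands in for the pre-trained layers, a two-parameter head stands in for the new layers, and four data points stand in for the small task-specific dataset. Real fine-tuning applies the same pattern to millions of parameters through a deep learning framework.

```python
# "Pre-trained" layer: its weight is frozen and never updated below.
PRETRAINED_WEIGHT = 2.0


def frozen_features(x):
    """Feature extractor inherited from the pre-trained model."""
    return PRETRAINED_WEIGHT * x


# New head with freshly initialized parameters (the only trainable part).
head_weight, head_bias = 0.0, 0.0

# Small made-up dataset for the new task: y = 3 * features + 1.
data = [(x, 3.0 * frozen_features(x) + 1.0) for x in [0.0, 1.0, 2.0, 3.0]]

learning_rate = 0.01
for _ in range(5000):
    grad_w = grad_b = 0.0
    for x, y in data:
        features = frozen_features(x)
        error = head_weight * features + head_bias - y
        # Gradients of the mean squared error with respect to the head only.
        grad_w += 2 * error * features / len(data)
        grad_b += 2 * error / len(data)
    head_weight -= learning_rate * grad_w
    head_bias -= learning_rate * grad_b

print(round(head_weight, 3), round(head_bias, 3))
```

Only the head's two parameters are ever updated; the frozen weight keeps whatever it learned during pre-training.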



### Pretraining in Language Models 

* As we've seen with the Word2Vec model, pre-training can be self-supervised
  * The dataset does not require human annotation
  * Train on some task that allows the model to acquire some understanding of the language. Examples:
    * Predict the next word in a sentence.
      * This is, for example, how GPT-2 was trained
    * Masked Word Prediction
      * This is how BERT was trained, on English Wikipedia and 11k books
#### Next word prediction
```
Apple  -> will
Apple will -> soon
Apple will soon  -> allow
Apple will soon allow -> customers
Apple will soon allow customers -> to
Apple will soon allow customers to -> fix
Apple will soon allow customers to fix -> their
Apple will soon allow customers to fix their -> devices
Apple will soon allow customers to fix their devices -> .
```
#### Masked word prediction
```
The company will [MASK] their earnings tomorrow.
[MASK] = announce
....
```
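
Both objectives are self-supervised because the labels come straight from the text. The sketch below generates the training pairs listed above from whitespace-split sentences; the helper names are ours, and a real pipeline would operate on subword token IDs rather than words.

```python
def next_word_pairs(tokens):
    """Every prefix of the sentence becomes an input; the following token is the label."""
    return [(" ".join(tokens[:i]), tokens[i]) for i in range(1, len(tokens))]


def mask_token(tokens, position, mask="[MASK]"):
    """Hide one token; the hidden token becomes the label."""
    masked = tokens[:position] + [mask] + tokens[position + 1:]
    return " ".join(masked), tokens[position]


sentence = "Apple will soon allow customers to fix their devices .".split()
pairs = next_word_pairs(sentence)
for context, label in pairs:
    print(f"{context} -> {label}")

masked_example = mask_token("The company will announce their earnings tomorrow .".split(), 3)
print(masked_example)
```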

    
    

### Transfer Learning for a Different Task

![](https://www.dropbox.com/s/xt9io0croj9mked/transfer_learning.png?dl=1)


### Transfer Learning for a Different Task

* The tasks should be as similar as possible

  * Use the same language

  * Produce the same type of output (distributions over words)

  * Etc.

* The derived model leverages the language understanding acquired in the previous model

    * Amazing, brilliant, fun, cool... convey somewhat similar meanings.

    * When trained with "such a fun movie" -> "positive", it will learn that the sentence could also have been

      * "such an amazing movie", "such a brilliant movie", "such a cool movie", etc.