# Comparing Pretrained Transformer Models and Use Cases
### Important Definitions Compiled From Hugging Face Documentation:

- `tokenizer`: returns a tokenizer corresponding to the specified model or path (most pretrained models have their own tokenizer). A tokenizer breaks a stream of text into tokens. Simple tokenizers split on whitespace (tabs, spaces, new lines); the tokenizers that ship with pretrained models usually go further and split words into sub-word pieces. So it takes in a string sentence and breaks it into the separate word pieces, which are then mapped to vocabulary indices.
- `model`: returns a model corresponding to the specified model or path (in this notebook, generally a pretrained or saved model). If we use `BertModel`, for example, there are 20+ PyTorch models (each one a `torch.nn.Module`) with pretrained weights to choose from!
- `modelForSequenceClassification`: this type of model takes in a sequence and attempts to classify it into one of 2 or more classes. For the CoLA dataset, which we used in our BERT grammar checker, we tried to classify sentences into class 0 (grammatically incorrect) or class 1 (grammatically correct).
- `modelForCausalLM`: this type of model is given a prompt and attempts to predict the next token.
- `modelForQuestionAnswering`: this type of model is given a question together with a context passage and attempts to pick out the span of the passage that answers it.
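
To make the `tokenizer` definition above more concrete, here is a minimal sketch of greedy longest-match-first sub-word splitting, the general idea behind WordPiece-style tokenizers. This is not the real `BertTokenizer`: the function names and the tiny `toy_vocab` are made up for illustration, and a real vocabulary has tens of thousands of entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a '##' prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


def tokenize(text, vocab):
    """Lowercase, split on whitespace, then split every word into sub-word pieces."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(wordpiece_tokenize(word, vocab))
    return tokens


# Hypothetical miniature vocabulary, for illustration only.
toy_vocab = {"who", "was", "jackson", "curry", "a", "puppet", "##eer", "?"}

print(tokenize("Jackson Curry was a puppeteer", toy_vocab))
# ['jackson', 'curry', 'was', 'a', 'puppet', '##eer']
```

The word "puppeteer" is not in the toy vocabulary, so it is split into a known prefix and a `##` continuation piece instead of being thrown away.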

In this notebook we have spent our time reading and paraphrasing documentation from the Hugging Face library, which we think will help us out a ton for later projects and purposes! In this notebook we are creating and compiling information we learn, such as:

1. Short implementations using pretrained models
2. Pairing which types of famous models are used for which types of NLP problems
3. Pairing which types of common benchmark datasets are used for which types of NLP problems
4. Documentation we have found helpful to use later

## Part 1: Compilation of Short Implementations Using Pretrained Models

### Short pretrained BERT example Implementation for predicting a masked token

In [1]:
!pip install transformers

In this section we start off by preparing a tokenized input from a string using the `BertTokenizer`. The tokens are then converted to vocabulary indices, and this list of indices is what gets fed into BERT!

From the code below we can see that we are masking the token at index 8 (the ninth token, since indexing starts at 0). Before being masked, this token is 'curry'. So let's see if we can get pretrained BERT to correctly predict this masked word for us!

In [2]:
import torch
from transformers import BertTokenizer, BertModel, BertForMaskedLM

# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# TOKENIZING THE FOLLOWING INPUT STRING
text = "[CLS] Who was Jackson Curry ? [SEP] Jackson Curry was a puppeteer [SEP]"
tokenized_text = tokenizer.tokenize(text)

# Mask a token that we will try to predict back with `BertForMaskedLM`
masked_index = 8
tokenized_text[masked_index] = '[MASK]'
assert tokenized_text == ['[CLS]', 'who', 'was', 'jackson', 'curry', '?', '[SEP]', 'jackson', '[MASK]', 'was', 'a', 'puppet', '##eer', '[SEP]']

# Convert token to vocabulary indices
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
# Define sentence A and B indices associated to 1st and 2nd sentences (see paper)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
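
The `segments_ids` list in the cell above was written out by hand. As a small sketch of the rule behind it, the helper below assigns segment 0 to everything up to and including the first `[SEP]` and segment 1 to everything after it. The name `build_segment_ids` is ours and is not part of the transformers library.

```python
def build_segment_ids(tokens, sep_token="[SEP]"):
    """Return 0 for sentence A (through the first [SEP]) and 1 for sentence B."""
    segment_ids = []
    current_segment = 0
    for token in tokens:
        segment_ids.append(current_segment)
        # Switch to sentence B right after the first separator token.
        if token == sep_token and current_segment == 0:
            current_segment = 1
    return segment_ids


tokens = ['[CLS]', 'who', 'was', 'jackson', 'curry', '?', '[SEP]',
          'jackson', '[MASK]', 'was', 'a', 'puppet', '##eer', '[SEP]']

print(build_segment_ids(tokens))
# [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
```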

In this next section we use the `BertModel` to encode our inputs into hidden states.

In [3]:
# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased')

# Set the model in evaluation mode to deactivate the DropOut modules
# This is IMPORTANT to have reproducible results during evaluation!
model.eval()

# Use the GPU if one is available, otherwise stay on the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokens_tensor = tokens_tensor.to(device)
segments_tensors = segments_tensors.to(device)
model.to(device)

# Predict hidden states features for each layer
with torch.no_grad():
    # See the models docstrings for the detail of the inputs
    outputs = model(tokens_tensor, token_type_ids=segments_tensors)
    # The model returns an output object that can also be indexed like a tuple.
    # See the models docstrings for the detail of all the outputs
    # In our case, the first element is the hidden state of the last layer of the Bert model
    encoded_layers = outputs[0]
# We have encoded our input sequence in a FloatTensor of shape (batch size, sequence length, model hidden dimension)
assert tuple(encoded_layers.shape) == (1, len(indexed_tokens), model.config.hidden_size)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Finally we use the BertForMaskedLM functionality in order to predict the masked token from before ('curry')!

In [4]:
# Load pre-trained model (weights)
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()

# Use the GPU if one is available, otherwise stay on the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokens_tensor = tokens_tensor.to(device)
segments_tensors = segments_tensors.to(device)
model.to(device)

# Predict all tokens
with torch.no_grad():
    outputs = model(tokens_tensor, token_type_ids=segments_tensors)
    predictions = outputs[0]

# confirm we were able to predict 'curry'
predicted_index = torch.argmax(predictions[0, masked_index]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
print(f"The token that pretrained BERT predicted was: {predicted_token} !!!")
if predicted_token == 'curry':
    print("pretrained BERT predicted the masked token correctly YAY!")
else:
    print("pretrained BERT predicted the masked token incorrectly :(")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


The token that pretrained BERT predicted was: curry !!!
pretrained BERT predicted the masked token correctly YAY!
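
The cell above keeps only the single highest-scoring token via `torch.argmax`. As a sketch of what that score vector looks like, the example below turns a row of logits into probabilities with a softmax and lists the top candidates. It uses plain Python so it runs without a model download; the five-word vocabulary and the logit values are invented for illustration and are not real BERT outputs.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, vocab, k=3):
    """Return the k most probable (token, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)
    return [(vocab[i], probs[i]) for i in ranked[:k]]


# Hypothetical vocabulary and made-up logits for the masked position.
toy_vocab = ["curry", "jackson", "smith", "was", "puppet"]
toy_logits = [6.0, 3.5, 2.0, -1.0, 0.5]

candidates = top_k(toy_logits, toy_vocab, k=3)
for token, prob in candidates:
    print(f"{token}: {prob:.3f}")
```

The first entry of `candidates` is the same token that an argmax over the logits would pick; the remaining entries show which alternatives the model considered plausible.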


### Short pretrained OpenAI GPT-2 example for predicting the next token (causal LM)
Note once again that this OpenAI GPT-2 model has a distinctly different tokenizer from our BERT example above!

Also note that we have now changed our text to "Who was Jackson Curry ? Jackson Curry was a". In this NLP task our model will attempt to predict the token that comes next at the end of this sentence!

In [14]:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load pre-trained model tokenizer (vocabulary)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Encode a text input
text = "Who was Jackson Curry ? Jackson Curry was a"
indexed_tokens = tokenizer.encode(text)

# Convert indexed tokens into a PyTorch tensor
tokens_tensor = torch.tensor([indexed_tokens])

Below we demonstrate how we can use GPT2LMHeadModel to generate the next predicted token of our text!  From this we can see that the model appended the word "great" to the end of our input sentence. 

In [15]:
# Load pre-trained model (weights)
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Set the model in evaluation mode to deactivate the DropOut modules
# This is IMPORTANT to have reproducible results during evaluation!
model.eval()

# Use the GPU if one is available, otherwise stay on the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokens_tensor = tokens_tensor.to(device)
model.to(device)

# Predict all tokens
with torch.no_grad():
    outputs = model(tokens_tensor)
    predictions = outputs[0]

# get the predicted next sub-word (in our case, the word 'great')
predicted_index = torch.argmax(predictions[0, -1, :]).item()
predicted_text = tokenizer.decode(indexed_tokens + [predicted_index])
print("Below is the full text that the GPT-2 pretrained model predicts.  We can see that it has added an additional token to the end of our input sentence!")
print(predicted_text)

Below is the full text that the GPT-2 pretrained model predicts.  We can see that it has added an additional token to the end of our input sentence!
Who was Jackson Curry? Jackson Curry was a great


Let's try changing our input sentence and see what the model can come up with for the next token now!

In [18]:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load pre-trained model tokenizer (vocabulary)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Encode a text input
text = "Giraffes love to"
indexed_tokens = tokenizer.encode(text)

# Convert indexed tokens into a PyTorch tensor
tokens_tensor = torch.tensor([indexed_tokens])

Now we can see that when we input the string "Giraffes love to", the pretrained model says "Giraffes love to talk", which is funny. We could run this over and over, feeding each predicted token back in as part of the input, to see what full sentences gpt2 creates!

In [19]:
# Load pre-trained model (weights)
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Set the model in evaluation mode to deactivate the DropOut modules
# This is IMPORTANT to have reproducible results during evaluation!
model.eval()

# Use the GPU if one is available, otherwise stay on the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokens_tensor = tokens_tensor.to(device)
model.to(device)

# Predict all tokens
with torch.no_grad():
    outputs = model(tokens_tensor)
    predictions = outputs[0]

# get the predicted next sub-word (in our case, the word 'talk')
predicted_index = torch.argmax(predictions[0, -1, :]).item()
predicted_text = tokenizer.decode(indexed_tokens + [predicted_index])
print("Below is the full text that the GPT-2 pretrained model predicts.  We can see that it has added an additional token to the end of our input sentence!")
print(predicted_text)

Below is the full text that the GPT-2 pretrained model predicts.  We can see that it has added an additional token to the end of our input sentence!
Giraffes love to talk
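
Running the prediction step repeatedly is usually called greedy decoding: predict one token, append it, and predict again. The sketch below shows only the loop. The "model" is a hypothetical lookup table (`TOY_NEXT`) standing in for GPT-2 so the example runs without a download; with the real model, `next_token_fn` would run `GPT2LMHeadModel` and take the argmax at the last position, as in the cells above. The transformers library also offers a `generate` method on its language models for this purpose.

```python
def greedy_generate(next_token_fn, tokens, max_new_tokens=5, stop_token=None):
    """Repeatedly append the single most likely next token to the sequence."""
    tokens = list(tokens)  # copy so the caller's list is left untouched
    for _ in range(max_new_tokens):
        next_token = next_token_fn(tokens)
        if next_token is None or next_token == stop_token:
            break
        tokens.append(next_token)
    return tokens


# Stand-in for a language model: a made-up table keyed on the previous token.
TOY_NEXT = {"to": "talk", "talk": "about", "about": "leaves", "leaves": "."}


def toy_next_token(tokens):
    """Hypothetical next-token function; a real one would call the model."""
    return TOY_NEXT.get(tokens[-1])


generated = greedy_generate(toy_next_token, ["Giraffes", "love", "to"], stop_token=".")
print(" ".join(generated))
# Giraffes love to talk about leaves
```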


## Part 2: Pairing which types of famous models are used for which types of NLP problems (note that this list is not comprehensive yet)
GLUE (General Language Understanding Evaluation) Problems - BERT, XLM, XLNet and RoBERTa.

Language Generation Problems - GPT, GPT-2, GPT-3, Transformer-XL, XLNet, CTRL

Causal Language Modelling - GPT, GPT-2, GPT-3

Question Answering - BERT, DistilBERT, RoBERTa, XLNet, XLM

## Part 3: Pairing which types of common benchmark datasets are used for which types of NLP problems

### Below we list the most common benchmark datasets that can be used with a model of the type `modelForSequenceClassification`:

The GLUE (General Language Understanding Evaluation) Benchmark is a group of nine classification tasks on sentences or pairs of sentences which are:

-CoLA (Corpus of Linguistic Acceptability) Determine if a sentence is grammatically correct or not. It is a dataset containing sentences labeled grammatically correct or not. (Note that we already fine-tuned BERT on this dataset this week.)

-MNLI (Multi-Genre Natural Language Inference) Determine if a sentence entails, contradicts or is unrelated to a given hypothesis. (This dataset has two versions, one with the validation and test set coming from the same distribution, another called mismatched where the validation and test use out-of-domain data.)

-MRPC (Microsoft Research Paraphrase Corpus) Determine if two sentences are paraphrases of one another or not.

-QNLI (Question-answering Natural Language Inference) Determine if the answer to a question is in the second sentence or not. (This dataset is built from the SQuAD dataset.)

-QQP (Quora Question Pairs) Determine if two questions are semantically equivalent or not.

-RTE (Recognizing Textual Entailment) Determine if a sentence entails a given hypothesis or not.

-SST-2 (Stanford Sentiment Treebank) Determine if the sentence has a positive or negative sentiment.

-STS-B (Semantic Textual Similarity Benchmark) Determine the similarity of two sentences with a score from 1 to 5.

-WNLI (Winograd Natural Language Inference) Determine if a sentence with an ambiguous pronoun and a sentence with this pronoun replaced are entailed or not. (This dataset is built from the Winograd Schema Challenge dataset.)
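
When fine-tuning on these tasks it helps to know which text columns each one provides and how many output labels the classification head needs. The sketch below records that as plain Python data. The column names are our recollection of the `glue` dataset on the Hugging Face hub and should be treated as assumptions to check against the dataset itself; the helper `num_labels` is our own.

```python
# Assumed text column names per GLUE task (second entry is None for single-sentence tasks).
GLUE_TASK_TO_KEYS = {
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}


def num_labels(task):
    """Size of the output layer a sequence classification head needs for a task."""
    if task == "mnli":
        return 3  # entailment, contradiction, neutral
    if task == "stsb":
        return 1  # a single similarity score, i.e. regression
    return 2      # every other task listed above is binary


for task, (first_key, second_key) in GLUE_TASK_TO_KEYS.items():
    kind = "sentence pair" if second_key else "single sentence"
    print(f"{task}: {kind}, {num_labels(task)} output(s)")
```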

### Language Generation Datasets
-Generally the model is initially pretrained on a massive text corpus like Wikipedia
-After pretraining, these models can generate text as-is and don't really need additional datasets

### Causal Language Generation Datasets
-Again, these types of models are initially pretrained on massive corpora like Wikipedia
-After pretraining, these models don't really need additional datasets

### Question Answering Datasets:
-SQuAD (Stanford Question Answering Dataset) is the main dataset used for question answering. It is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles. The answer to every question is a segment of text, or span, from the corresponding reading passage; in SQuAD 2.0 the question might also be unanswerable.
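
Because every SQuAD answer is a span of the passage, a question answering head typically produces two scores per token: one for "the answer starts here" and one for "the answer ends here". The sketch below shows how a span can be picked from such scores. The logit values are invented for illustration, and the helper names `best_span` and `join_wordpieces` are ours, not library functions.

```python
def best_span(start_logits, end_logits, max_answer_len=10):
    """Return (start, end) maximising start score + end score with start <= end."""
    best = (0, 0)
    best_score = float("-inf")
    for start, start_score in enumerate(start_logits):
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_score + end_logits[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best


def join_wordpieces(pieces):
    """Glue '##' continuation pieces back onto the previous piece."""
    text = ""
    for piece in pieces:
        if piece.startswith("##"):
            text += piece[2:]
        else:
            text += (" " if text else "") + piece
    return text


# Passage tokens with made-up start and end scores for the question "Who was Jackson Curry?"
context_tokens = ["jackson", "curry", "was", "a", "puppet", "##eer"]
start_logits = [0.1, 0.2, 0.0, 1.5, 3.0, 0.3]
end_logits = [0.0, 0.1, 0.2, 0.4, 1.0, 3.5]

start, end = best_span(start_logits, end_logits)
answer = join_wordpieces(context_tokens[start:end + 1])
print(answer)
# puppeteer
```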

## Part 4: Documentation we have found helpful to use later (just documentation, not runnable)

Importing Different Tokenizers:

In [7]:
tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'tokenizer', 'bert-base-uncased')    # Download vocabulary from S3 and cache.
tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'tokenizer', './test/bert_saved_model/')  # E.g. tokenizer was saved using `save_pretrained('./test/bert_saved_model/')`

Downloading: "https://github.com/huggingface/pytorch-transformers/archive/master.zip" to /root/.cache/torch/hub/master.zip


RuntimeError: ignored

Importing Pretrained Models:

In [None]:
import torch
from transformers import AutoConfig
model = torch.hub.load('huggingface/pytorch-transformers', 'model', 'bert-base-uncased')    # Download model and configuration from S3 and cache.
model = torch.hub.load('huggingface/pytorch-transformers', 'model', './test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/bert_model/')`
model = torch.hub.load('huggingface/pytorch-transformers', 'model', 'bert-base-uncased', output_attentions=True)  # Update configuration during loading
assert model.config.output_attentions == True
# Loading from a TF checkpoint file instead of a PyTorch model (slower)
config = AutoConfig.from_pretrained('./tf_model/bert_tf_model_config.json')
model = torch.hub.load('huggingface/pytorch-transformers', 'model', './tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)

Importing Pretrained Models with a Language Modelling Head:

In [None]:
import torch
from transformers import AutoConfig
model = torch.hub.load('huggingface/transformers', 'modelForCausalLM', 'gpt2')    # Download model and configuration from huggingface.co and cache.
model = torch.hub.load('huggingface/transformers', 'modelForCausalLM', './test/saved_model/')  # E.g. model was saved using `save_pretrained('./test/saved_model/')`
model = torch.hub.load('huggingface/transformers', 'modelForCausalLM', 'gpt2', output_attentions=True)  # Update configuration during loading
assert model.config.output_attentions == True
# Loading from a TF checkpoint file instead of a PyTorch model (slower)
config = AutoConfig.from_pretrained('./tf_model/gpt_tf_model_config.json')
model = torch.hub.load('huggingface/transformers', 'modelForCausalLM', './tf_model/gpt_tf_checkpoint.ckpt.index', from_tf=True, config=config)

Importing Models with a Sequence Classification Head:

In [None]:
import torch
from transformers import AutoConfig
model = torch.hub.load('huggingface/pytorch-transformers', 'modelForSequenceClassification', 'bert-base-uncased')    # Download model and configuration from S3 and cache.
model = torch.hub.load('huggingface/pytorch-transformers', 'modelForSequenceClassification', './test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/bert_model/')`
model = torch.hub.load('huggingface/pytorch-transformers', 'modelForSequenceClassification', 'bert-base-uncased', output_attentions=True)  # Update configuration during loading
assert model.config.output_attentions == True
# Loading from a TF checkpoint file instead of a PyTorch model (slower)
config = AutoConfig.from_pretrained('./tf_model/bert_tf_model_config.json')
model = torch.hub.load('huggingface/pytorch-transformers', 'modelForSequenceClassification', './tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)

Importing Models with a Question Answering Head:

In [None]:
import torch
from transformers import AutoConfig
model = torch.hub.load('huggingface/pytorch-transformers', 'modelForQuestionAnswering', 'bert-base-uncased')    # Download model and configuration from S3 and cache.
model = torch.hub.load('huggingface/pytorch-transformers', 'modelForQuestionAnswering', './test/bert_model/')  # E.g. model was saved using `save_pretrained('./test/bert_model/')`
model = torch.hub.load('huggingface/pytorch-transformers', 'modelForQuestionAnswering', 'bert-base-uncased', output_attentions=True)  # Update configuration during loading
assert model.config.output_attentions == True
# Loading from a TF checkpoint file instead of a PyTorch model (slower)
config = AutoConfig.from_pretrained('./tf_model/bert_tf_model_config.json')
model = torch.hub.load('huggingface/pytorch-transformers', 'modelForQuestionAnswering', './tf_model/bert_tf_checkpoint.ckpt.index', from_tf=True, config=config)
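
Each `torch.hub` entry point used in this part corresponds to an `Auto*` class in the transformers library, and those classes can be called directly with `from_pretrained`. The sketch below records the correspondence and wraps it in a small helper. The helper `load_pretrained` is our own, not a library function, and it imports transformers only when called, so defining it does not need the library or a network connection.

```python
import importlib

# torch.hub entry point name -> transformers Auto class name
AUTO_CLASS_FOR_ENTRY_POINT = {
    "tokenizer": "AutoTokenizer",
    "model": "AutoModel",
    "modelForCausalLM": "AutoModelForCausalLM",
    "modelForSequenceClassification": "AutoModelForSequenceClassification",
    "modelForQuestionAnswering": "AutoModelForQuestionAnswering",
}


def load_pretrained(entry_point, name_or_path, **kwargs):
    """Load a tokenizer or model through the matching Auto class (hypothetical helper)."""
    transformers = importlib.import_module("transformers")  # imported lazily
    auto_class = getattr(transformers, AUTO_CLASS_FOR_ENTRY_POINT[entry_point])
    return auto_class.from_pretrained(name_or_path, **kwargs)


for entry_point, class_name in AUTO_CLASS_FOR_ENTRY_POINT.items():
    print(f"{entry_point} -> {class_name}.from_pretrained(...)")
```

For example, `load_pretrained('modelForCausalLM', 'gpt2')` would download and return the same kind of GPT-2 model that the language modelling cell above requests through `torch.hub`.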