### Some NLP Tasks
* Feature Extraction
  * Getting the vector embedding of a word, sentence, paragraph, or even a whole document
* Question answering
* Sentiment analysis
* Summarization
* Zero-shot classification
* Named entity recognition
* Etc.
  


### Hugging Face Model Hub

* The "app store" of language models
  * Some preliminary image models as well


* Thousands of pre-trained models  
  * Default models pretrained for specific tasks
* Support for over 100 languages.

* APIs to download and use those pre-trained models on a given text
  * Complete pipeline, including text processing
  * Otherwise, processing the data is often painfully difficult because it is model-specific
* Possibility to fine-tune a model on custom data


### Model Hub Transformers

* Transformers are an architecture (family) of language models
  * In the same way that CNNs and RNNs are common architectures for working with image or sequential data. 

* Trained on large amounts of raw text in a self-supervised fashion.
  * Self-supervised learning is a type of training in which the objective is automatically computed from the inputs of the model.

* The resulting models develop a statistical understanding of the language they have been trained on

 

In [285]:
from transformers import pipeline

In [286]:
classifier = pipeline("sentiment-analysis")
classifier("ICS 438 is an amazing course. Everyone should take it!")

[{'label': 'POSITIVE', 'score': 0.999866783618927}]

In [287]:
classifier("What a super boring movie. Never goingt to recomment it!")

[{'label': 'NEGATIVE', 'score': 0.999622106552124}]

In [290]:
sentences_to_classify = [
                            "I really like the new design. Your app rocks!",
                            "What a mess this site is to navigate",
                            "very confusing to get anything done on this new redesign"
                        ]

In [291]:
classifier(sentences_to_classify)

[{'label': 'POSITIVE', 'score': 0.9998397827148438},
 {'label': 'NEGATIVE', 'score': 0.9997925162315369},
 {'label': 'NEGATIVE', 'score': 0.9996102452278137}]

In [293]:
### Positive?
### Models are not humans -- understanding their limitations is critical
classifier("He went home")

[{'label': 'POSITIVE', 'score': 0.999119758605957}]

### Example: Zero Shot Classification

* Zero-shot learning: solving a task despite not having received any training examples of that task.
  * E.g., recognizing a category of object in photos without ever having seen a photo of that kind of object before.
  * The model still needs to predict the class each input belongs to.


* You can learn more about zero-shot learning in Sec. 15.2 of the Deep Learning textbook: http://www.deeplearningbook.org/contents/representation.html


In [297]:
classifier = pipeline("zero-shot-classification")

classifier(
    "The CDC is approving booster vaccine shots for everyone over the age of 18.",
    candidate_labels=["politics", "business", 'healthcare'],
)

{'sequence': 'The CDC is approving booster vaccine shots for everyone over the age of 18.',
 'labels': ['healthcare', 'business', 'politics'],
 'scores': [0.9048793315887451, 0.06191409006714821, 0.03320661932229996]}

### Named Entity Recognition

* Find the entities (such as persons, locations, or organizations) in a sentence.
  * Classify each token of the sentence into one class per entity type, plus one class for “no entity.”
* Default classes
    * O: the word doesn’t correspond to any entity
    * PER: person entity
    * ORG: organization entity
    * LOC: location entity
    * MISC: miscellaneous entity


In [298]:
ner = pipeline("ner", grouped_entities=True)
ner("Elon Musk asked whether he should sell off Tesla stocks")

[{'entity_group': 'PER',
  'score': 0.996318,
  'word': 'Elon Musk',
  'start': 0,
  'end': 9},
 {'entity_group': 'ORG',
  'score': 0.9502535,
  'word': 'Tesla',
  'start': 43,
  'end': 48}]
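
One way to picture the grouping step is as merging consecutive tokens that carry the same entity label. The sketch below is a simplified illustration, not the library's implementation: the token-level labels are written by hand (hypothetical), and the helper `group_entities` is an assumed name. Real pipelines also handle subword pieces and label prefixes, which this sketch ignores.

```python
# Hypothetical token-level predictions, written by hand for illustration.
token_labels = [
    ("Elon", "PER"), ("Musk", "PER"), ("asked", "O"), ("whether", "O"),
    ("he", "O"), ("should", "O"), ("sell", "O"), ("off", "O"),
    ("Tesla", "ORG"), ("stocks", "O"),
]

def group_entities(token_labels):
    """Merge consecutive tokens that share the same non-"O" label."""
    groups = []
    previous_label = "O"
    for word, label in token_labels:
        if label != "O":
            if label == previous_label:
                # Same entity continues: extend the last group.
                groups[-1]["word"] += " " + word
            else:
                groups.append({"entity_group": label, "word": word})
        previous_label = label
    return groups

entities = group_entities(token_labels)
print(entities)
```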

### Transformers are Big Models

* Transformers are typically very large models

![](https://miro.medium.com/max/1338/1*40VA19kG5zUmTj-AOnh47A.png)


### Transformers: an Encoder and a Decoder

![](https://miro.medium.com/max/923/0*L9Zx_5BBFSXgGvvx)


### Transformers: an Encoder and a Decoder
* The encoder finds an appropriate representation of the input
* The decoder uses the encoder’s representation (features) along with other inputs to generate a target sequence.

* It turns out that the encoder and decoder can be used independently

### Transformers Architecture: The encoder

* As discussed, Word2Vec can be used as a dictionary lookup to assign embeddings to words
  * This is a problem since the meaning of a word depends on its context
  * E.g., the English language is full of homonyms
    * "He banks at FHB" vs. "He banks on her support to win the election"
    * "He wore a bow tie to the event" vs. "He used a bow and arrows to hunt the prey."
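
A minimal sketch of the dictionary-lookup problem, using a made-up embedding table. The vectors and the `static_embedding` helper are invented for illustration and are not real Word2Vec output. A static lookup returns the same vector for "banks" in both sentences, whatever the surrounding words are.

```python
# Toy static embedding table; the numbers are invented for illustration.
embedding_table = {
    "he": [0.1, 0.3],
    "banks": [0.7, 0.2],
    "at": [0.0, 0.5],
    "fhb": [0.9, 0.9],
    "on": [0.2, 0.1],
    "her": [0.4, 0.6],
    "support": [0.3, 0.8],
}

def static_embedding(sentence):
    """Look up each lowercased word independently of its neighbors."""
    return [embedding_table[word] for word in sentence.lower().split()]

financial = static_embedding("He banks at FHB")
figurative = static_embedding("He banks on her support")

# "banks" is the second word in both sentences and gets the identical vector.
print(financial[1] == figurative[1])
```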

* The encoder receives an input and builds a *contextualized* representation of it (its features). 
  * This means that the model is optimized to acquire understanding from the input.
* One of the most popular encoder models is BERT
  * Bidirectional Encoder Representations from Transformers


### Transformers Architecture: Example Encoder -- BERT

* BERT is a way to contextually encode a word
  * The embedding of the word is context dependent. 
  * "He banks at FHB" vs. "He banks on her support to win the election"
  * The embedding of "banks" takes into account the words around it.

* Unlike Word2Vec, BERT does not just operate like a dictionary

* The size of the generated embedding depends on the architecture
  * For BERT base, the size is 768

* A BERT embedding is said to hold the meaning of a token within its text
  * BERT uses subword tokenization, so it is not "1 word = 1 embedding".


In [299]:
from transformers import BertTokenizer, BertModel

model = BertModel.from_pretrained('bert-base-cased')

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

my_input = tokenizer("Hello, my dog is cute", return_tensors="pt")


Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [300]:
my_input

{'input_ids': tensor([[  101,  8667,   117,  1139,  3676,  1110, 10509,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}

In [301]:
outputs = model(**my_input)

last_hidden_states = outputs.last_hidden_state


In [302]:
last_hidden_states.shape

torch.Size([1, 8, 768])

In [303]:
my_input["input_ids"]

tensor([[  101,  8667,   117,  1139,  3676,  1110, 10509,   102]])

### More encoders

* There are dozens of different encoder architectures. For example:
    
* There are also dozens of models that fine-tune BERT for specific domains
  * FinBERT (finance): https://github.com/ProsusAI/finBERT
  * med-BERT (medical field): https://github.com/ProsusAI/finBERT
  * SciBERT (scientific text): https://aclanthology.org/D19-1371.pdf
  * ...


### Transformers Architecture: The Decoder

* The decoder works similarly to the encoder
  * Unlike the encoder, the decoder uses masked self-attention.
    * It can only see the words on one side (e.g., to the left)

* The decoder uses the encoder’s representation (features) along with other inputs to generate a target sequence.
* The word at position $i$ depends only on the words at positions $1, \dots, i-1$
  * This means that the model is optimized for generating outputs.
* Decoders are not as relevant for this course
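
Masked self-attention can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions: a single head, no learned projection matrices (queries, keys, and values are all the raw input), and random toy embeddings. The function name `causal_self_attention` is hypothetical. The point is only that each position gets zero attention weight on positions to its right.

```python
import numpy as np

def causal_self_attention(x):
    """Single-head self-attention with a causal mask (no learned projections).

    x has shape (seq_len, d); queries, keys and values are all x itself,
    which is a simplification made for this sketch.
    """
    seq_len, d = x.shape
    scores = x @ x.T / np.sqrt(d)                      # (seq_len, seq_len)
    # Position i may only attend to positions j <= i.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Row-wise softmax; masked entries become exactly 0.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ x, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))        # 4 tokens, 8-dimensional toy embeddings
_, weights = causal_self_attention(x)
print(np.round(weights, 2))        # the upper triangle is all zeros
```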

### Popular Transformers

* GPT and GPT-2: transformer-based language models (GPT-2 has 1.5 billion parameters)
  * GPT-2 was trained on 8 million web pages
* GPT-3, the third-generation GPT transformer
  * 175 billion learnable parameters
  * Incredibly (scarily) good at a dizzying number of tasks
  * E.g., generating web components or SQL code from a natural-language query

### The Pipeline Steps in Detail

* Word tokenization
* Input processing
* Model processing
* Post-processing


### Word Encoding

* Analytics on text use numerical values.
  * There are various strategies for deriving those values
* Naive: assign a unique value to each word in the vocabulary
  * {"aardvark": 1, ... "Zeuxis": 125,452}

* This approach is referred to as word tokenization
  * Split on words and punctuation
  * Assign each word a unique ID


* Issues with this approach
  * A corpus may contain hundreds of thousands of words, and the dataset can be very large
    * How do we encode these without explicitly accounting for every word in the language?
  * How do we encode very similar words (car vs. cars, text vs. textual)?
    * Yes, we can rely on their embeddings being fairly similar, but maybe we can encode words so that they match before computing embeddings
  * How should a deployed model handle previously unseen words?
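
The naive scheme and two of its issues can be sketched as follows. This is a minimal illustration, assuming a simple regex for splitting and a reserved `[UNK]` ID for unseen words. The two-sentence corpus and the helper names are invented for the example.

```python
import re

def word_tokenize(text):
    """Split on words and punctuation (a simple regex-based assumption)."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

corpus = ["The car is fast.", "The cars are fast!"]

# Reserve ID 0 for words never seen while building the vocabulary.
vocab = {"[UNK]": 0}
for sentence in corpus:
    for token in word_tokenize(sentence):
        vocab.setdefault(token, len(vocab))

def encode(text):
    return [vocab.get(token, vocab["[UNK]"]) for token in word_tokenize(text)]

print(vocab)
print(encode("The car is fast."))
print(encode("The truck is fast."))   # "truck" is unseen and maps to [UNK]
```

Note that "car" and "cars" receive unrelated IDs, and any word outside the corpus collapses to the same unknown ID.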

### Character Tokenization

* At the other end of the spectrum, we could encode every character independently
  * Our encoder only needs the valid alphabet and punctuation characters
* Small encoding scheme, but it can encode any word in the same alphabet

Issues with this approach:

* The encoding of a single sentence becomes long
* Tokens do not mean anything when taken separately
  * They need to be combined to produce a useful meaning
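
A small sketch of the trade-off, assuming an alphabet of lowercase ASCII letters, the space, and ASCII punctuation (the alphabet choice and the `char_encode` helper are assumptions for this example). The vocabulary is tiny, but one short sentence already needs many more tokens than at the word level.

```python
import string

# Assumed alphabet: lowercase letters, space, and ASCII punctuation.
alphabet = string.ascii_lowercase + " " + string.punctuation
char_to_id = {ch: i for i, ch in enumerate(alphabet)}

def char_encode(text):
    """Encode each character independently; characters outside the alphabet are skipped."""
    return [char_to_id[ch] for ch in text.lower() if ch in char_to_id]

sentence = "Word tokenization is cool!"
ids = char_encode(sentence)

print(len(char_to_id))          # size of the whole vocabulary
print(len(sentence.split()))    # number of whitespace-separated words
print(len(ids))                 # number of character-level tokens
```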

### Sub-word Encoding: An Intuition

* A language contains hundreds of thousands of words, and text is often very large
  * How do we encode these without explicitly accounting for every word in the language?

* Use tokens instead of words
  * Split a word into prefix, stem, and suffix
  * unusually -> un + usual + ly
  * unsuspiciously -> un + suspicious + ly

* With the 4 tokens [suspicious, usual, un, ly] we can construct 8 words
  * suspicious, usual, unsuspicious, unusual, usually, suspiciously, unusually, unsuspiciously
  * This makes it easy for words with similar stems to match while keeping the vocabulary small
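
The count of 8 words can be checked by enumerating every combination of an optional prefix, a stem, and an optional suffix. This is only a sketch of the intuition, and the variable names are chosen for this example.

```python
from itertools import product

prefixes = ["", "un"]
stems = ["usual", "suspicious"]
suffixes = ["", "ly"]

# Every combination of optional prefix + stem + optional suffix.
words = ["".join(parts) for parts in product(prefixes, stems, suffixes)]

print(len(words))
print(words)
```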
    


### Sub-word Encoding

* Words should be split into meaningful subwords

* Frequent words should not be split 

* Rarely used words should be split into subwords




### Sub-word Tokenization Schemes

* Different models use different schemes for splitting and encoding words
  * BERT uses WordPiece, a tokenization scheme introduced by Google.
    * The algorithm used has not been open-sourced; it has been reverse-engineered in some applications
  * ALBERT uses Unigram
    * Substantially different from WordPiece



### Word Piece Algorithm
* WordPiece is a greedy algorithm
* It decomposes its vocabulary into chunks and retains the most frequent ones
* It builds a vocabulary containing the most frequent chunks
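
Once a vocabulary exists, splitting a word with a BERT-style WordPiece tokenizer is commonly done as a greedy longest-match-first search, with `##` marking pieces that continue a word. The sketch below covers only that splitting step, not how the vocabulary is built. The tiny vocabulary is hand-written (hypothetical, not the real BERT vocabulary) and `wordpiece_tokenize` is an illustrative name.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # continuation marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]                 # no piece fits: whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, chosen only for this example.
toy_vocab = {"token", "##ization", "##ize", "un", "##usual", "##ly", "cool"}

print(wordpiece_tokenize("tokenization", toy_vocab))
print(wordpiece_tokenize("unusually", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```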





### Word Piece Algorithm - 1
![](https://www.dropbox.com/s/q1dctybrlx2y0mh/wp_1.png?dl=1)


### Word Piece Algorithm - 2
![](https://www.dropbox.com/s/f5626qe6mpdjapm/wp_2.png?dl=1)


### Word Piece Algorithm - 3
![](https://www.dropbox.com/s/5m5n0qvn22lfwxk/wp_3.png?dl=1)

### Word Piece Algorithm - 4
![](https://www.dropbox.com/s/4ol60txxkqy6evp/wp_4.png?dl=1)

### Word Piece Algorithm - 5
![](https://www.dropbox.com/s/875ntqoyuhh8bg9/wp_5.png?dl=1)

### Model to Tokenizer Mapping

* It's critical to use the correct encoding for each model we use.
  * To use a BERT model, we need to convert the input data using the same tokenizer that was used for its training
    * Split the words in the same way that the model does
    * Use the same delimiter characters
    * Use the same token IDs as the model


### For more details, see: https://huggingface.co/transformers/tokenizer_summary.html

In [None]:
my_input = tokenizer("Word tokenization is cool!", return_tensors="pt")
my_input

In [None]:
# The "##" prefix indicates that the token does not occur at the start of a word.
tokens = tokenizer.tokenize("Word tokenization is cool!")
tokens

In [None]:
tokenizer.convert_tokens_to_string(tokens)

In [None]:
tokenizer.convert_tokens_to_ids(tokens)

In [None]:
my_input["input_ids"].tolist()[0]

In [None]:
list(tokenizer.vocab.keys())[10_000:10_020]


### BERT Architecture and Embeddings
![](https://jalammar.github.io/images/bert-output-vector.png)

In [104]:
# Importing the relevant modules
from transformers import BertTokenizer, BertModel
import pandas as pd
import numpy as np
import torch

# Loading the pre-trained BERT model
###################################
# Embeddings will be derived from
# the outputs of this model
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)

# Setting up the tokenizer
###################################
# This is the same tokenizer that
# was used in the model to generate
# embeddings to ensure consistency
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [146]:
# Text corpus
##############
# These sentences show the different
# forms of the word 'bank' to show the
# value of contextualized embeddings

texts = ["bank",
         "The river bank was flooded.",
         "The bank vault was robust.",
         "He had to bank on her for support.",
         "The bank was out of money.",
         "The bank robber was arrested."]


In [147]:
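# Getting embeddings for the target
# word in all given contexts
target_word_embeddings = []

for text in texts:
    tokenized_input = tokenizer(text, return_tensors="pt")
    output = model(**tokenized_input)
    embeddings = output.last_hidden_state
    embeddings = torch.squeeze(embeddings, dim=0)
    # Map the input IDs back to tokens (including [CLS] and [SEP]) so that
    # the position of 'bank' lines up with the rows of `embeddings`
    tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"][0].tolist())
    word_index = tokens.index('bank')
    word_embedding = embeddings[word_index]
    target_word_embeddings.append(word_embedding)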

In [148]:
from scipy.spatial.distance import cosine

list_of_sim= []
for i in range(len(texts)-1):
    for j in range(i+1,len(texts)):
        text_1 = texts[i]
        text_2 = texts[j]
        embd_1 = target_word_embeddings[i]
        embd_2 = target_word_embeddings[j]
        cos_sim = 1 - cosine(embd_1.detach().numpy(), embd_2.detach().numpy())
        list_of_sim.append([text_1, text_2, cos_sim])
sims_df = pd.DataFrame(list_of_sim, columns=['text1', 'text2', 'similarity'])
sims_df

Unnamed: 0,text1,text2,similarity
0,bank,The river bank was flooded.,0.034353
1,bank,The bank vault was robust.,0.031118
2,bank,He had to bank on her for support.,0.058298
3,bank,The bank was out of money.,0.036416
4,bank,The bank robber was arrested.,0.011138
5,The river bank was flooded.,The bank vault was robust.,0.496597
6,The river bank was flooded.,He had to bank on her for support.,0.278664
7,The river bank was flooded.,The bank was out of money.,0.417891
8,The river bank was flooded.,The bank robber was arrested.,0.519588
9,The bank vault was robust.,He had to bank on her for support.,0.260012
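
For reference, `1 - cosine(a, b)` from SciPy is the cosine similarity of the two vectors. The NumPy sketch below spells out the formula on hand-picked 2-D vectors. The helper name `cosine_similarity` and the vectors are chosen for this example only.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

u = [1.0, 0.0]
print(cosine_similarity(u, [2.0, 0.0]))    # same direction
print(cosine_similarity(u, [0.0, 3.0]))    # orthogonal
print(cosine_similarity(u, [-1.0, 0.0]))   # opposite direction
```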


In [194]:
texts = [
    "The New York Times reported Tuesday that the FDA may authorize booster shots for all Americans ",
    "Apple rolling out new firmware update for AirPods and AirPods Pro headphones. Here’s how to check for it",
    "Top infectious disease official said if more Americans get vaccines and booster shots, the disease could be downgraded to endemic status",
    "Despite strong vaccination rates, Hawaii’s Safe Travels program likely isn’t ending anytime soon",
    "Shopping Black Friday sales? Here are the best ways to pay",
]

In [211]:
tokenized_inputs= tokenizer(texts, padding=True, return_tensors="pt")
outputs = model(**tokenized_inputs)


In [218]:
from scipy.spatial.distance import cosine

list_of_sim= []
for i in range(len(texts)-1):
    for j in range(i+1,len(texts)):
        text_1 = texts[i]
        text_2 = texts[j]
        embd_1 = outputs.pooler_output[i]
        embd_2 = outputs.pooler_output[j]
        cos_sim = 1 - cosine(embd_1.detach().numpy(), embd_2.detach().numpy())
        list_of_sim.append([text_1, text_2, cos_sim])
sims_df = pd.DataFrame(list_of_sim, columns=['text1', 'text2', 'similarity'])
sims_df

Unnamed: 0,text1,text2,similarity
0,The New York Times reported Tuesday that the F...,Apple rolling out new firmware update for AirP...,0.965975
1,The New York Times reported Tuesday that the F...,Top infectious disease official said if more A...,0.985166
2,The New York Times reported Tuesday that the F...,"Despite strong vaccination rates, Hawaii’s Saf...",0.985678
3,The New York Times reported Tuesday that the F...,Shopping Black Friday sales? Here are the best...,0.825593
4,Apple rolling out new firmware update for AirP...,Top infectious disease official said if more A...,0.967587
5,Apple rolling out new firmware update for AirP...,"Despite strong vaccination rates, Hawaii’s Saf...",0.971793
6,Apple rolling out new firmware update for AirP...,Shopping Black Friday sales? Here are the best...,0.815434
7,Top infectious disease official said if more A...,"Despite strong vaccination rates, Hawaii’s Saf...",0.981738
8,Top infectious disease official said if more A...,Shopping Black Friday sales? Here are the best...,0.780993
9,"Despite strong vaccination rates, Hawaii’s Saf...",Shopping Black Friday sales? Here are the best...,0.842483


### How Does BERT Work?

* See the following excellent write-up for a very detailed explanation of how BERT works.

  * Warning: its scope and level of complexity are perhaps beyond the material introduced in this course.
  
BERT inner workings:  https://gmihaila.github.io/tutorial_notebooks/bert_inner_workings/#bertpooler


### Transfer Learning

* Language models are typically trained on very large amounts of data.

* Training can take weeks on very sophisticated infrastructure and cost a great deal of money

  * Training GPT-3 reportedly cost $12M for a single training run    
  https://www.technologyreview.com/2020/07/20/1005454/openai-machine-learning-language-generator-gpt-3-nlp/

* The training is done on generic data
  * Not optimized for a specific domain (e.g., healthcare or physics)
* Training can also be done on a generic task
  * E.g., masked language modeling
  * Training example: ("my [MASK] is John Doe", "name")
  
* Transfer learning: an extremely powerful idea in deep learning

### Transfer Learning - cont'd

* Transfer learning is the process of transferring knowledge from a model A to a model B
  * Model B may be doing a task that's slightly different
    * E.g., model A does NER on news items and model B does NER on invoices (company names, total tax, item counts, etc.)
    * Model A is trained on English Wikipedia and model B on healthcare documents.
    * The task can also be different
      * A does masked language modeling, B does sentiment analysis
* The knowledge acquired in model A is transferred to model B
* Model B typically needs a smaller dataset to be trained

### Transfer Learning - cont'd

* When training from scratch, all the model's weights are initialized randomly.
* Transfer learning consists of "continuing" training with a new, smaller dataset
  * Some approaches are used to force the weights not to change too much.
    * E.g., a much smaller learning rate, or even freezing some layers and adding new ones that learn something new
* Since the pretrained model was already trained on lots of data, fine-tuning requires far less data to get decent results.
  * For the same reason, the amount of time and resources needed to get good results is much lower.
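
The "freeze some layers and add new ones" idea can be sketched with a toy NumPy model. Everything here is a stand-in under stated assumptions: `W_pretrained` is a random matrix playing the role of a pretrained layer, the dataset is synthetic, and the new head is a logistic-regression layer trained with plain gradient descent. It is not how a real language model is fine-tuned, only an illustration that the frozen weights never receive an update.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained layer: its weights are fixed ("frozen").
W_pretrained = rng.normal(size=(5, 3))

# New task-specific head: the only weights updated during fine-tuning.
w_head = np.zeros(3)

def forward(X):
    features = np.tanh(X @ W_pretrained)          # frozen feature extractor
    probs = 1.0 / (1.0 + np.exp(-(features @ w_head)))
    return features, probs

# Small synthetic dataset for the new task (invented for this sketch).
X = rng.normal(size=(40, 5))
y = (X[:, 0] > 0).astype(float)

W_before = W_pretrained.copy()
learning_rate = 0.5
for _ in range(200):
    features, probs = forward(X)
    grad_head = features.T @ (probs - y) / len(y)   # logistic-loss gradient
    w_head -= learning_rate * grad_head             # update the head only

_, probs = forward(X)
accuracy = float(((probs > 0.5) == (y > 0.5)).mean())
print(accuracy)
print(np.array_equal(W_pretrained, W_before))       # frozen weights are unchanged
```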


### Pretraining in Language Models

* In language models, pre-training is usually self-supervised
  * The dataset does not require human annotation
  * Train on some task that allows the model to acquire some understanding of the language. Examples:
    * Predict the next word in a sentence.
      * This is, for example, how GPT-2 was trained
    * Masked word prediction
      * BERT was trained this way on English Wikipedia and 11k books

#### Next word prediction
```
Apple  -> will
Apple will -> soon
Apple will soon  -> allow
Apple will soon allow -> customers
Apple will soon allow customers -> to
Apple will soon allow customers to -> fix
Apple will soon allow customers to fix -> their
Apple will soon allow customers to fix their -> devices
Apple will soon allow customers to fix their devices -> .
```
#### Masked word prediction
```
The company will [MASK] their earnings tomorrow.
[MASK] = announce
....
```
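
Both objectives are self-supervised because the training pairs come from the text itself. The sketch below generates such pairs from a raw sentence. The helper names are hypothetical, whitespace splitting stands in for a real tokenizer, and masking every position in turn is a simplification (BERT masks a random subset of tokens).

```python
def next_word_pairs(sentence):
    """(context, next word) pairs, built from the text itself."""
    words = sentence.split()
    return [(" ".join(words[:i]), words[i]) for i in range(1, len(words))]

def masked_word_pairs(sentence, mask_token="[MASK]"):
    """(sentence with one word masked, hidden word) pairs, one per position."""
    words = sentence.split()
    pairs = []
    for i, word in enumerate(words):
        masked = words[:i] + [mask_token] + words[i + 1:]
        pairs.append((" ".join(masked), word))
    return pairs

for context, target in next_word_pairs("Apple will soon allow customers"):
    print(f"{context} -> {target}")

print(masked_word_pairs("my name is John")[1])
```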

    
    

### Transfer Learning for a Different Task

![](https://www.dropbox.com/s/xt9io0croj9mked/transfer_learning.png?dl=1)


### Transfer Learning for a Different Task

* The tasks should be as similar as possible
  * Use the same language
  * Produce the same kind of output (e.g., distributions over words)
  * Etc.
* The derived model leverages the language understanding acquired by the previous model
    * Amazing, brilliant, fun, cool... convey somewhat similar meanings.
    * When trained with "such a fun movie" -> "positive", it will learn that the sentence could also have been
      * "such an amazing movie", "such a brilliant movie", "such a cool movie", etc.
* Most NLP tasks will reuse the language understanding acquired in the first layers