# Setup

In [1]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/9c/34/fb092588df61bf33f113ade030d1cbe74fb73a0353648f8dd938a223dce7/transformers-3.5.0-py3-none-any.whl (1.3MB)
Collecting tokenizers==0.9.3
  Downloading https://files.pythonhosted.org/packages/4c/34/b39eb9994bc3c999270b69c9eea40ecc6f0e97991dba28282b9fd32d44ee/tokenizers-0.9.3-cp36-cp36m-manylinux1_x86_64.whl (2.9MB)
Collecting sentencepiece==0.1.91
  Downloading https://files.pythonhosted.org/packages/d4/a4/d0a884c4300004a78cca907a6ff9a5e9fe4f090f5d95ab341c53d28cbc58/sentencepiece-0.1.91-cp36-cp36m-manylinux1_x86_64.whl (1.1MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)

In [2]:
from pprint import pprint

# Sentiment Analysis Example

We will use the `Transformers` package from [Hugging Face](https://huggingface.co/).

In [3]:
from transformers import pipeline

It ships ready-to-use pipelines built on pretrained BERT-style models for tasks such as sentiment analysis:

In [4]:
nlp = pipeline('sentiment-analysis')

In [5]:
nlp('we love you')

[{'label': 'POSITIVE', 'score': 0.9998704791069031}]

But be aware that although it may seem to be working well ...

In [6]:
nlp('I ate pizza with olives.')

[{'label': 'POSITIVE', 'score': 0.9042282700538635}]

In [7]:
nlp('I ate pizza with olives. it was cold')

[{'label': 'NEGATIVE', 'score': 0.9971919059753418}]

In [8]:
nlp('I ate pizza with olives, but the pizza was cold')

[{'label': 'NEGATIVE', 'score': 0.9980430603027344}]

... it might not always match your expectations:

In [9]:
nlp('It was cold outside. I ate pizza with olives.')

[{'label': 'NEGATIVE', 'score': 0.9729326367378235}]

In [10]:
nlp('I am dying to see you')

[{'label': 'NEGATIVE', 'score': 0.8540693521499634}]
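
One way to keep track of such surprises is a small table that pairs each input with the label a human reader would expect. The sketch below replays the outputs recorded above instead of calling the pipeline, so it runs without downloading a model. The `expected` labels are our own judgment calls (an assumption), and `find_mismatches` is a hypothetical helper, not part of the library.

```python
# Outputs recorded from the sentiment-analysis pipeline in the cells above.
recorded = {
    "we love you": {"label": "POSITIVE", "score": 0.9998704791069031},
    "I ate pizza with olives.": {"label": "POSITIVE", "score": 0.9042282700538635},
    "It was cold outside. I ate pizza with olives.": {"label": "NEGATIVE", "score": 0.9729326367378235},
    "I am dying to see you": {"label": "NEGATIVE", "score": 0.8540693521499634},
}

# Assumption: the labels a human reader would assign to the same sentences.
expected = {
    "we love you": "POSITIVE",
    "I ate pizza with olives.": "POSITIVE",
    "It was cold outside. I ate pizza with olives.": "POSITIVE",
    "I am dying to see you": "POSITIVE",
}

def find_mismatches(recorded, expected):
    """Return (sentence, predicted label, score) for every disagreement."""
    return [
        (text, result["label"], result["score"])
        for text, result in recorded.items()
        if result["label"] != expected[text]
    ]

mismatches = find_mismatches(recorded, expected)
for text, label, score in mismatches:
    print(f"{label} ({score:.2f}): {text}")
```

Both mismatches come with scores above 0.85, so a high score on its own does not signal a correct prediction.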

In [11]:
nlp = pipeline("fill-mask")

Some weights of RobertaForMaskedLM were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [12]:
pprint(nlp(f"{nlp.tokenizer.mask_token} is the president of the united states."))

[{'score': 0.24416913092136383,
  'sequence': '<s>Trump is the president of the united states.</s>',
  'token': 7565,
  'token_str': 'Trump'},
 {'score': 0.20126527547836304,
  'sequence': '<s>Obama is the president of the united states.</s>',
  'token': 33382,
  'token_str': 'Obama'},
 {'score': 0.07434801757335663,
  'sequence': '<s> Trump is the president of the united states.</s>',
  'token': 140,
  'token_str': 'ĠTrump'},
 {'score': 0.04162761569023132,
  'sequence': '<s>Clinton is the president of the united states.</s>',
  'token': 36206,
  'token_str': 'Clinton'},
 {'score': 0.038377437740564346,
  'sequence': '<s> Obama is the president of the united states.</s>',
  'token': 1284,
  'token_str': 'ĠObama'}]
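
The `Ġ` prefix in `token_str` comes from the byte-level BPE vocabulary used by RoBERTa, where it marks a token that is preceded by a space. That is why `Trump` and `ĠTrump` appear as separate candidates. A minimal sketch, replaying the output recorded above rather than calling the pipeline, that merges such variants (the helper name `merge_space_variants` is ours, not a library function):

```python
from collections import defaultdict

# (token_str, score) pairs copied from the fill-mask output above.
predictions = [
    ("Trump", 0.24416913092136383),
    ("Obama", 0.20126527547836304),
    ("ĠTrump", 0.07434801757335663),
    ("Clinton", 0.04162761569023132),
    ("ĠObama", 0.038377437740564346),
]

def merge_space_variants(predictions):
    """Sum the scores of tokens that differ only by the leading 'Ġ' marker."""
    merged = defaultdict(float)
    for token_str, score in predictions:
        merged[token_str.lstrip("Ġ")] += score
    # Highest combined score first.
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)

merged = merge_space_variants(predictions)
for word, score in merged:
    print(f"{word}: {score:.3f}")
```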


In [13]:
pprint(nlp(f"There's no place like {nlp.tokenizer.mask_token}."))

[{'score': 0.03708091005682945,
  'sequence': "<s>There's no place like hell.</s>",
  'token': 7105,
  'token_str': 'Ġhell'},
 {'score': 0.034593284130096436,
  'sequence': "<s>There's no place like this.</s>",
  'token': 42,
  'token_str': 'Ġthis'},
 {'score': 0.02284209616482258,
  'sequence': "<s>There's no place like that.</s>",
  'token': 14,
  'token_str': 'Ġthat'},
 {'score': 0.017316240817308426,
  'sequence': "<s>There's no place like yours.</s>",
  'token': 14314,
  'token_str': 'Ġyours'},
 {'score': 0.013859190046787262,
  'sequence': "<s>There's no place like ours.</s>",
  'token': 15157,
  'token_str': 'Ġours'}]


# BERT

## Full Model Example

Now let's get the full model to work.

In [14]:
import torch
from transformers import BertTokenizer, BertModel, BertForMaskedLM


In [15]:
import logging
logging.basicConfig(level=logging.INFO)

In [16]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize input
text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
tokenized_text = tokenizer.tokenize(text)

# Mask a token that we will try to predict back with `BertForMaskedLM`
masked_index = 8
tokenized_text[masked_index] = '[MASK]'
assert tokenized_text == ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]', 'jim', '[MASK]', 'was', 'a', 'puppet', '##eer', '[SEP]']

# Convert token to vocabulary indices
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
# Define sentence A and B indices associated to 1st and 2nd sentences (see paper)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])


INFO:filelock:Lock 139892318339368 acquired on /root/.cache/torch/transformers/45c3f7a79a80e1cf0a489e5c62b43f173c15db47864303a55d623bb3c96f72a5.d789d64ebfe299b0e416afc4a169632f903f693095b4629a7ea271d5a0cf2c99.lock


INFO:filelock:Lock 139892318339368 released on /root/.cache/torch/transformers/45c3f7a79a80e1cf0a489e5c62b43f173c15db47864303a55d623bb3c96f72a5.d789d64ebfe299b0e416afc4a169632f903f693095b4629a7ea271d5a0cf2c99.lock





In [17]:
tokenized_text

['[CLS]',
 'who',
 'was',
 'jim',
 'henson',
 '?',
 '[SEP]',
 'jim',
 '[MASK]',
 'was',
 'a',
 'puppet',
 '##eer',
 '[SEP]']
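
The segment ids were written out by hand earlier. As a sketch, they can also be derived from the token list by switching segment after each `[SEP]`. The helper name `segment_ids_from_tokens` is hypothetical, and the token list is copied from the output above so the snippet runs on its own.

```python
def segment_ids_from_tokens(tokens, sep_token="[SEP]"):
    """Give every token the id of its sentence; a [SEP] belongs to the sentence it closes."""
    segment_ids = []
    current = 0
    for token in tokens:
        segment_ids.append(current)
        if token == sep_token:
            # The next token, if any, starts the following sentence.
            current += 1
    return segment_ids

tokenized_text = ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]',
                  'jim', '[MASK]', 'was', 'a', 'puppet', '##eer', '[SEP]']

segments_ids = segment_ids_from_tokens(tokenized_text)
print(segments_ids)
```

BERT only distinguishes sentence A (0) and sentence B (1), so this sketch is meant for inputs with at most two `[SEP]` tokens.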

In [18]:
# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased')
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Set the model in evaluation mode to deactivate the DropOut modules
# This is IMPORTANT to have reproducible results during evaluation!
model.eval()

# If you have a GPU, put everything on cuda
tokens_tensor = tokens_tensor.to(device)
segments_tensors = segments_tensors.to(device)
model.to(device)

# Predict hidden states features for each layer
with torch.no_grad():
    # See the models docstrings for the detail of the inputs
    outputs = model(tokens_tensor, token_type_ids=segments_tensors)
    
    # Transformers models always output tuples.
    # See the models docstrings for the detail of all the outputs
    # In our case, the first element is the hidden state of the last layer of the Bert model
    encoded_layers = outputs[0]

# We have encoded our input sequence in a FloatTensor of shape (batch size, sequence length, model hidden dimension)
assert tuple(encoded_layers.shape) == (1, len(indexed_tokens), model.config.hidden_size)

INFO:filelock:Lock 139892320233512 acquired on /root/.cache/torch/transformers/3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.637c6035640bacb831febcc2b7f7bee0a96f9b30c2d7e9ef84082d9f252f3170.lock


INFO:filelock:Lock 139892320233512 released on /root/.cache/torch/transformers/3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.637c6035640bacb831febcc2b7f7bee0a96f9b30c2d7e9ef84082d9f252f3170.lock





INFO:filelock:Lock 139892333893504 acquired on /root/.cache/torch/transformers/a8041bf617d7f94ea26d15e218abd04afc2004805632abc0ed2066aa16d50d04.faf6ea826ae9c5867d12b22257f9877e6b8367890837bd60f7c54a29633f7f2f.lock


INFO:filelock:Lock 139892333893504 released on /root/.cache/torch/transformers/a8041bf617d7f94ea26d15e218abd04afc2004805632abc0ed2066aa16d50d04.faf6ea826ae9c5867d12b22257f9877e6b8367890837bd60f7c54a29633f7f2f.lock





In [19]:
# Load pre-trained model (weights)
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# If you have a GPU, put everything on cuda
tokens_tensor = tokens_tensor.to(device)
segments_tensors = segments_tensors.to(device)
model.to(device)

# Predict all tokens
with torch.no_grad():
    outputs = model(tokens_tensor, token_type_ids=segments_tensors)
    predictions = outputs[0]

# confirm we were able to predict 'henson'
predicted_index = torch.argmax(predictions[0, masked_index]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
assert predicted_token == 'henson'

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## Fine-tuning Example

This time we will fine-tune BERT on our special domain: **Pizza and Ice Cream**.

We will train a sentiment classifier on this topic, where 1 is a positive sentiment and 0 is a negative one.


https://lilianweng.github.io/lil-log/2019/01/31/generalized-language-models.html#supervised-fine-tuning

In [21]:
from transformers import BertForSequenceClassification

In [20]:
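# Import here as well, so this cell does not depend on execution order.
from transformers import AdamW, BertForSequenceClassification

# Two output labels: 0 = negative, 1 = positive.
model = BertForSequenceClassification.from_pretrained('bert-base-uncased',
                                                      num_labels=2,
                                                      return_dict=True)
model.train()

optimizer = AdamW(model.parameters(), lr=1e-5)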

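We don't want to train the WHOLE of BERT from scratch.

Pretraining originally took about 4 days on 16 to 64 TPU chips, with a massive amount of data. Instead, we only want to fine-tune the model, ideally just its upper layers, to match our needs.

The optimizer accepts parameter groups, which lets us treat parts of the model differently. The groups below apply weight decay to every parameter except biases and LayerNorm weights; training only the upper layers would additionally require freezing the lower ones: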

In [None]:
no_decay = ['bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
    {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]
optimizer = AdamW(optimizer_grouped_parameters, lr=1e-5)
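
To see what the name filter does, the sketch below applies it to a handful of illustrative parameter names written in the style of `model.named_parameters()`. The list is a small hand-picked sample, not the full model, and the "last layer plus classifier" rule at the end is a hypothetical choice for what "upper layers" could mean.

```python
# Illustrative parameter names in the style of `model.named_parameters()`.
param_names = [
    "bert.embeddings.word_embeddings.weight",
    "bert.encoder.layer.0.attention.self.query.weight",
    "bert.encoder.layer.0.attention.self.query.bias",
    "bert.encoder.layer.0.attention.output.LayerNorm.weight",
    "bert.encoder.layer.11.output.dense.weight",
    "classifier.weight",
    "classifier.bias",
]

no_decay = ["bias", "LayerNorm.weight"]

def uses_weight_decay(name):
    """Mirror the substring filter used to build the optimizer's parameter groups."""
    return not any(nd in name for nd in no_decay)

decay_group = [n for n in param_names if uses_weight_decay(n)]
no_decay_group = [n for n in param_names if not uses_weight_decay(n)]

# Hypothetical rule for "upper layers only": keep the last encoder layer
# and the classification head trainable, and treat everything else as frozen.
trainable_prefixes = ("bert.encoder.layer.11.", "classifier.")
trainable = [n for n in param_names if n.startswith(trainable_prefixes)]

print("weight decay:", decay_group)
print("no weight decay:", no_decay_group)
print("trainable:", trainable)
```

With a real model, the same name test could be used to set `param.requires_grad = False` on every parameter outside the chosen prefixes.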


In [None]:
text_batch = ["I love Ice Cream.", 
         "I don't care for Ice Cream.", 
         "I like my pizza cold", 
         "The pizza was cold", 
         "The ice cream was amazing", 
         "The pizza toppings were boring",
         "He would like fries with that",
         "Ice cream is awesome"]
  
labels = [1,  # 1 = positive, 0 = negative
          0,
          1,
          0,
          1,
          0,
          1,
          1]

Tokenization and encoding can be done through the Hugging Face library with either:

In [None]:
encoding = tokenizer(text_batch, 
                     return_tensors='pt', # pt = pyTorch 
                     padding=True, 
                     truncation=True)

input_ids = encoding['input_ids']
attention_mask = encoding['attention_mask']


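or with the older, explicit `batch_encode_plus` method (`encode_plus` expects a single sentence, not a list):

In [None]:
encoded_dict = tokenizer.batch_encode_plus(
                        text_batch,                # Sentences to encode.
                        add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                        padding = True,            # Pad to the longest sentence in the batch.
                        return_attention_mask = True,   # Construct attn. masks.
                        return_tensors = 'pt',     # Return pytorch tensors.
                        truncation=True
                   )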


In [None]:
# Tokenize all of the sentences and map the tokens to their word IDs.
input_ids = []
attention_masks = []

# For every sentence...
for sent in text_batch:
    # `encode_plus` will:
    #   (1) Tokenize the sentence.
    #   (2) Prepend the `[CLS]` token to the start.
    #   (3) Append the `[SEP]` token to the end.
    #   (4) Map tokens to their IDs.
    #   (5) Pad or truncate the sentence to `max_length`
    #   (6) Create attention masks for [PAD] tokens.
    encoded_dict = tokenizer.encode_plus(
                        sent,                      # Sentence to encode.
                        add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                        padding = 'max_length',    # Pad every sentence to `max_length`.
                        max_length = 32,
                        return_attention_mask = True,   # Construct attn. masks.
                        return_tensors = 'pt',     # Return pytorch tensors.
                        truncation=True
                   )
    
    # Add the encoded sentence to the list.    
    input_ids.append(encoded_dict['input_ids'])
    
    # And its attention mask (simply differentiates padding from non-padding).
    attention_masks.append(encoded_dict['attention_mask'])
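
Whichever call is used, padding and the attention mask follow the same idea: shorter sequences are filled with the `[PAD]` id, and the mask holds 1 for real tokens and 0 for padding. The sketch below shows this in plain Python. The helper `pad_batch` is hypothetical, and the token ids between `[CLS]` (101) and `[SEP]` (102) are made up for illustration.

```python
PAD_ID = 0  # id of [PAD] in the bert-base-uncased vocabulary

def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the longest one and build the matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask

# 101 and 102 stand for [CLS] and [SEP]; the ids in between are made up.
batch = [
    [101, 7, 8, 9, 102],
    [101, 7, 102],
]

input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```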

### Your Turn:
Convert the toy dataset to tensors and feed them into the model.

In [None]:
### your code here:



In [None]:
# Model fine-tuning (basic idea):

# labels = torch.tensor(labels)  # shape: (batch size,)
# optimizer.zero_grad()          # clear gradients left over from a previous step
# outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
# loss = outputs.loss
# loss.backward()
# optimizer.step()


# RoBERTa


Instead of Big BERT, try using its smart sister: [RoBERTa - A Robustly Optimized BERT Pretraining Approach](https://github.com/pytorch/fairseq/tree/master/examples/roberta)

**RoBERTa** (short for Robustly optimized BERT approach; Liu, et al. 2019) refers to a new recipe for training BERT to achieve better results, as the authors found that the original BERT model was significantly undertrained. The recipe contains the following learnings:

* Train for longer with a bigger batch size.
* Remove the next sentence prediction (NSP) task.
* Use longer sequences in the training data format. The paper found that using individual sentences as inputs hurts downstream performance. Instead, we should use multiple sentences sampled contiguously to form longer segments.
* Change the masking pattern dynamically. The original BERT applies masking once during the data preprocessing stage, resulting in a static mask across training epochs. RoBERTa applies masks in 10 different ways across 40 epochs.

RoBERTa also added a new dataset, CommonCrawl News, and further confirmed that pretraining with more data helps improve performance on downstream tasks. It was trained with BPE on byte sequences, the same as in GPT-2. The authors also found that the choice of hyperparameters has a big impact on model performance.
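
The difference between a static mask and the schedule described above (10 masking patterns spread over 40 epochs) can be sketched in a few lines. This is a simplified illustration, not the actual pretraining code: it masks a fixed fraction of whitespace-separated words and ignores BERT's rule of sometimes keeping or randomly replacing the selected tokens. The helper `mask_tokens` and the example sentence are our own.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, rng, mask_prob=0.15):
    """Replace roughly `mask_prob` of the tokens with [MASK] (at least one)."""
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]

tokens = "the pizza was cold but the ice cream was amazing".split()
rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Static masking: one pattern is drawn during preprocessing and reused every epoch.
static_mask = mask_tokens(tokens, rng)
static_epochs = [static_mask for _ in range(40)]

# Schedule described above: 10 patterns are drawn and cycled over 40 epochs,
# so each pattern is seen 4 times.
patterns = [mask_tokens(tokens, rng) for _ in range(10)]
dynamic_epochs = [patterns[epoch % 10] for epoch in range(40)]

print("static, epoch 0: ", " ".join(static_epochs[0]))
print("dynamic, epoch 0:", " ".join(dynamic_epochs[0]))
print("dynamic, epoch 1:", " ".join(dynamic_epochs[1]))
```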


Hugging Face fully supports it - https://huggingface.co/transformers/model_doc/roberta.html - but a few adjustments will be necessary to get it to work.

Check out the tutorial section for information on how to use their impressive framework:
* https://huggingface.co/transformers/pretrained_models.html



Your task is to build a classifier.

You can choose between: 

## Sentence Acceptability
1. Classify the acceptability of a sentence, https://arxiv.org/abs/1805.12471

For example:
* Good - What did Betsy paint a picture of?
* Bad - What was a picture of painted by Betsy?

The dataset can be downloaded from:
https://nyu-mll.github.io/CoLA/

And the corresponding kaggle competition:

https://www.kaggle.com/c/cola-out-of-domain-open-evaluation
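
A possible starting point for loading the data, as a sketch only: it assumes the raw CoLA files are tab-separated without a header, with the columns source code, label (1 = acceptable, 0 = unacceptable), the original author's annotation, and the sentence. Check this against the downloaded files before relying on it. The two rows below are made up from the examples above (the source code `xx00` is a placeholder), and `read_cola` is a hypothetical helper.

```python
import csv
import io

# Two made-up rows in the assumed CoLA layout (tab-separated, no header):
# source code, label, original annotation, sentence.
sample_tsv = (
    "xx00\t1\t\tWhat did Betsy paint a picture of?\n"
    "xx00\t0\t*\tWhat was a picture of painted by Betsy?\n"
)

def read_cola(handle):
    """Read sentences and integer labels from a CoLA-style TSV stream."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    sentences, labels = [], []
    for source, label, original_annotation, sentence in reader:
        sentences.append(sentence)
        labels.append(int(label))
    return sentences, labels

sentences, labels = read_cola(io.StringIO(sample_tsv))
for sentence, label in zip(sentences, labels):
    print(label, sentence)
```

For a real file, the same function could be called with `open(path, newline="", encoding="utf-8")` in place of the `io.StringIO` object.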

## Toxic Comments Classification
2. Toxic comments classification

https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data

Classify which of the comments are toxic, by fine-tuning RoBERTa on the dataset.