In [1]:
#!pip install transformers
import transformers

### Using pre-trained transformers
_for fun and profit_

There are many toolkits that let you access pre-trained transformer models, but the most powerful and convenient by far is [`huggingface/transformers`](https://github.com/huggingface/transformers). In this week's practice, you'll learn how to download, apply and modify pre-trained transformers for a range of tasks. Buckle up, we're going in!


__Pipelines:__ if all you want is to apply a pre-trained model, you can do that in one line of code using `pipeline`. Huggingface/transformers has a selection of pre-configured pipelines for masked language modelling, sentiment classification, question answering, etc. ([see full list here](https://huggingface.co/transformers/main_classes/pipelines.html))

A typical pipeline includes:
* pre-processing, e.g. tokenization, subword segmentation
* a backbone model, e.g. BERT fine-tuned for classification
* output post-processing

Let's see it in action:

In [2]:
from transformers import pipeline
classifier = pipeline('sentiment-analysis', model="distilbert-base-uncased-finetuned-sst-2-english")

print(classifier("BERT is amazing!"))

[{'label': 'POSITIVE', 'score': 0.9998860359191895}]
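The post-processing step of a classification pipeline typically boils down to a softmax over the model's output logits, followed by picking the most likely label. Here is a minimal sketch of that idea. The `postprocess` helper and the logit values are made up for illustration and are not part of the library:

```python
import math

def postprocess(logits, id2label):
    """Turn raw class logits into a pipeline-style {'label', 'score'} dict."""
    # numerically stable softmax: subtract the max before exponentiating
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # index of the most probable class
    best = max(range(len(probs)), key=probs.__getitem__)
    return {'label': id2label[best], 'score': probs[best]}

# hypothetical logits for a two-class sentiment model (illustrative values only)
id2label = {0: 'NEGATIVE', 1: 'POSITIVE'}
print(postprocess([-4.2, 4.9], id2label))
```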


In [3]:
import base64
data = {
    'arryn': 'As High as Honor.',
    'baratheon': 'Ours is the fury.',
    'stark': 'Winter is coming.',
    'tyrell': 'Growing strong.'
}

# YOUR CODE: predict sentiment for each noble house and create outputs dict
outputs = {k: classifier(v)[0]['label'] == 'POSITIVE' for k, v in data.items()}

# base64.decodestring was removed in Python 3.9; decodebytes is the replacement
assert sum(outputs.values()) == 3 and outputs[base64.decodebytes(b'YmFyYXRoZW9u\n').decode()] == False
print("Well done!")

Well done!




You can also access a vanilla masked language model that was trained to predict masked words. Here's how:

In [4]:
mlm_model = pipeline('fill-mask', model="bert-base-uncased")
MASK = mlm_model.tokenizer.mask_token

for hypo in mlm_model(f"Donald {MASK} is the president of the united states."):
    print(f"P={hypo['score']:.5f}", hypo['sequence'])

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


P=0.99719 donald trump is the president of the united states.
P=0.00024 donald duck is the president of the united states.
P=0.00022 donald ross is the president of the united states.
P=0.00020 donald johnson is the president of the united states.
P=0.00018 donald wilson is the president of the united states.


In [5]:
# Your turn: use BERT to recall the year in which the Soviet Union was founded
mlm_model(f"Russian Red Revolution happened in the year {MASK}.")

[{'sequence': 'russian red revolution happened in the year 1917.',
  'score': 0.5716407299041748,
  'token': 4585,
  'token_str': '1917'},
 {'sequence': 'russian red revolution happened in the year 1918.',
  'score': 0.1523766964673996,
  'token': 4271,
  'token_str': '1918'},
 {'sequence': 'russian red revolution happened in the year 1919.',
  'score': 0.021794825792312622,
  'token': 4529,
  'token_str': '1919'},
 {'sequence': 'russian red revolution happened in the year 1920.',
  'score': 0.02099030464887619,
  'token': 4444,
  'token_str': '1920'},
 {'sequence': 'russian red revolution happened in the year 1916.',
  'score': 0.02016862854361534,
  'token': 4947,
  'token_str': '1916'}]

In [6]:
# let's check if BERT knows that there was a "give us some democracy" revolution earlier that century
mlm_model(f"Russian Empire Revolution happened in the year {MASK}.")

[{'sequence': 'russian empire revolution happened in the year 1917.',
  'score': 0.2406449317932129,
  'token': 4585,
  'token_str': '1917'},
 {'sequence': 'russian empire revolution happened in the year 1918.',
  'score': 0.0716315284371376,
  'token': 4271,
  'token_str': '1918'},
 {'sequence': 'russian empire revolution happened in the year 1905.',
  'score': 0.04560703784227371,
  'token': 5497,
  'token_str': '1905'},
 {'sequence': 'russian empire revolution happened in the year 1916.',
  'score': 0.02266937680542469,
  'token': 4947,
  'token_str': '1916'},
 {'sequence': 'russian empire revolution happened in the year 1848.',
  'score': 0.021684745326638222,
  'token': 7993,
  'token_str': '1848'}]

In [8]:
# hehe, it does



Huggingface offers hundreds of pre-trained models that specialize in different tasks. You can quickly find the model you need using [this list](https://huggingface.co/models).


In [10]:
text = """Almost two-thirds of the 1.5 million people who viewed this liveblog had Googled to discover
 the latest on the Rosetta mission. They were treated to this detailed account by the Guardian’s science editor,
 Ian Sample, and astronomy writer Stuart Clark of the moment scientists landed a robotic spacecraft on a comet 
 for the first time in history, and the delirious reaction it provoked at their headquarters in Germany.
  “We are there. We are sittingon the surface. Philae is talking to us,” said one scientist.
"""

#from transformers import AutoTokenizer, AutoModelForTokenClassification

# Task: create a pipeline for named entity recognition, use task name 'ner' and search for the right model in the list
ner_model = pipeline('ner', 'dbmdz/bert-large-cased-finetuned-conll03-english')
named_entities = ner_model(text)

In [11]:
print('OUTPUT:', named_entities)
word_to_entity = {item['word']: item['entity'] for item in named_entities}
assert 'org' in word_to_entity.get('Guardian').lower() and 'per' in word_to_entity.get('Stuart').lower()
print("All tests passed")

OUTPUT: [{'entity': 'I-MISC', 'score': 0.8746414, 'index': 19, 'word': 'Google', 'start': 73, 'end': 79}, {'entity': 'I-MISC', 'score': 0.8997781, 'index': 27, 'word': 'Rose', 'start': 112, 'end': 116}, {'entity': 'I-MISC', 'score': 0.951263, 'index': 28, 'word': '##tta', 'start': 116, 'end': 119}, {'entity': 'I-ORG', 'score': 0.99925184, 'index': 40, 'word': 'Guardian', 'start': 179, 'end': 187}, {'entity': 'I-PER', 'score': 0.9992004, 'index': 46, 'word': 'Ian', 'start': 207, 'end': 210}, {'entity': 'I-PER', 'score': 0.9995033, 'index': 47, 'word': 'Sam', 'start': 211, 'end': 214}, {'entity': 'I-PER', 'score': 0.9965073, 'index': 48, 'word': '##ple', 'start': 214, 'end': 217}, {'entity': 'I-PER', 'score': 0.9991747, 'index': 53, 'word': 'Stuart', 'start': 240, 'end': 246}, {'entity': 'I-PER', 'score': 0.9996471, 'index': 54, 'word': 'Clark', 'start': 247, 'end': 252}, {'entity': 'I-LOC', 'score': 0.9998199, 'index': 85, 'word': 'Germany', 'start': 414, 'end': 421}, {'entity': 'I-PER'
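Notice that the model labels WordPiece tokens, not words: `Rose` + `##tta` and `Sam` + `##ple` come back as separate entries. Below is a small sketch of how the pieces could be glued back together. `merge_wordpieces` is a hypothetical helper written for this notebook, and taking the minimum score of the pieces is just one possible convention. Recent versions of the library can, as far as we know, also group tokens inside the pipeline via an `aggregation_strategy` argument, so check the docs for your installed version.

```python
def merge_wordpieces(entities):
    """Glue WordPiece continuations ('##...') onto the preceding token."""
    merged = []
    for item in entities:
        is_continuation = item['word'].startswith('##')
        # only merge when the piece starts exactly where the previous one ended
        if is_continuation and merged and merged[-1]['end'] == item['start']:
            prev = merged[-1]
            prev['word'] += item['word'][2:]
            prev['end'] = item['end']
            # keep the least confident piece's score as a conservative estimate
            prev['score'] = min(prev['score'], item['score'])
        else:
            merged.append(dict(item))  # copy so the input list is left untouched
    return merged

# a few entries copied from the output above
sample = [
    {'entity': 'I-MISC', 'score': 0.8997781, 'index': 27, 'word': 'Rose', 'start': 112, 'end': 116},
    {'entity': 'I-MISC', 'score': 0.951263, 'index': 28, 'word': '##tta', 'start': 116, 'end': 119},
    {'entity': 'I-ORG', 'score': 0.99925184, 'index': 40, 'word': 'Guardian', 'start': 179, 'end': 187},
    {'entity': 'I-PER', 'score': 0.9995033, 'index': 47, 'word': 'Sam', 'start': 211, 'end': 214},
    {'entity': 'I-PER', 'score': 0.9965073, 'index': 48, 'word': '##ple', 'start': 214, 'end': 217},
]
for item in merge_wordpieces(sample):
    print(item['entity'], item['word'], (item['start'], item['end']))
```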

### The building blocks of a pipeline

Huggingface also allows you to access its pipelines at a lower level. There are two main abstractions for you:
* `Tokenizer` - converts from strings to token ids and back
* `Model` - a pytorch `nn.Module` with pre-trained weights

You can use such models as part of your regular pytorch code: insert it as a layer in your model, apply it to a batch of data, backpropagate, optimize, etc.

In [12]:
import torch
from transformers import AutoTokenizer, AutoModel, pipeline

model_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [13]:
lines = [
    "Luke, I am your father.",
    "Life is what happens when you're busy making other plans.",
    ]

# tokenize a batch of inputs. "pt" means [p]y[t]orch tensors
tokens_info = tokenizer(lines, padding=True, truncation=True, return_tensors="pt")

for key in tokens_info:
    print(key, tokens_info[key])

print("Detokenized:")
for i in range(2):
    print(tokenizer.decode(tokens_info['input_ids'][i]))

input_ids tensor([[ 101, 5355, 1010, 1045, 2572, 2115, 2269, 1012,  102,    0,    0,    0,
            0,    0,    0],
        [ 101, 2166, 2003, 2054, 6433, 2043, 2017, 1005, 2128, 5697, 2437, 2060,
         3488, 1012,  102]])
token_type_ids tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
attention_mask tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
Detokenized:
[CLS] luke, i am your father. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
[CLS] life is what happens when you're busy making other plans. [SEP]
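To see what `padding=True` did here: the shorter sequence is right-padded with the `[PAD]` id (0 in the output above) up to the longest one, and `attention_mask` marks which positions hold real tokens. A plain-Python sketch of that bookkeeping follows. `pad_batch` is a hypothetical helper, not the tokenizer's actual implementation:

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token id lists to a common length and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

# token ids copied from the output above, with padding stripped
batch = [
    [101, 5355, 1010, 1045, 2572, 2115, 2269, 1012, 102],
    [101, 2166, 2003, 2054, 6433, 2043, 2017, 1005, 2128, 5697, 2437, 2060, 3488, 1012, 102],
]
ids, mask = pad_batch(batch)
print(ids[0])
print(mask[0])
```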


In [14]:
# You can now apply the model to get embeddings
with torch.no_grad():
    outputs = model(**tokens_info)

print(outputs['pooler_output'])

tensor([[-0.8854, -0.4722, -0.9392,  ..., -0.8081, -0.6955,  0.8748],
        [-0.9297, -0.5161, -0.9334,  ..., -0.9017, -0.7492,  0.9201]])
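`pooler_output` is only one way to get a single vector per sentence. A common alternative is to average the per-token vectors from `outputs['last_hidden_state']` while skipping padded positions. Below is a NumPy sketch on toy arrays. `mean_pool` is a hypothetical helper and the numbers stand in for real model outputs:

```python
import numpy as np

def mean_pool(hidden_states, attention_mask):
    """Average token vectors per sentence, skipping padded positions.

    hidden_states: [batch, seq_len, dim]; attention_mask: [batch, seq_len] of 0/1.
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # avoid division by zero
    return summed / counts

# toy stand-in for last_hidden_state: 2 sentences, 3 tokens each, dim 2
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
                   [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]])
# the third token of the first sentence is padding and must not affect the mean
mask = np.array([[1, 1, 0],
                 [1, 1, 1]])
print(mean_pool(hidden, mask))
```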


### The search for similar questions.

Remember week01 where you used GloVe embeddings to find related questions? That was... cute, but far from state of the art. It's time to **really** solve this task using context-aware embeddings.

In [None]:
# download the data:
!wget https://www.dropbox.com/s/obaitrix9jyu84r/quora.txt?dl=1 -O ./quora.txt
# alternative download link: https://yadi.sk/i/BPQrUu1NaTduEw

__Main task (3 pts):__
* Implement a function that takes a text string and finds top-k most similar questions from `quora.txt`
* Demonstrate your function using at least 5 examples

There are no prompts this time: you will have to write everything from scratch.

In [24]:
import torch
from transformers import AutoTokenizer, AutoModel, pipeline

device='cuda'

model_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).to(device)



In [48]:
import tqdm
import numpy as np
data = list(open("./quora.txt", encoding="utf-8"))
data[50]

"What TV shows or books help you read people's body language?\n"

In [29]:
from scipy.spatial.distance import cosine
def cos_sim(u,v):
    return 1 - cosine(u,v)

In [37]:
# tokenize a batch of inputs. "pt" means [p]y[t]orch tensors
tokens_info = tokenizer(data[:10], padding=True, truncation=True, return_tensors="pt")

In [19]:
print("Detokenized:")
for i in range(2):
    print(tokenizer.decode(tokens_info['input_ids'][i]))

Detokenized:
[CLS] can i get back with my ex even though she is pregnant with another guy's baby? [SEP]
[CLS] what are some ways to overcome a fast food addiction? [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]


In [41]:
features_test.shape

(2, 768)

In [49]:
def get_features(lines, model, tokenizer, batch_size=512, feature_dim=768, device='cuda') -> np.ndarray:
    nr_lines = len(lines)
    
    feature_matrix = np.zeros((nr_lines, feature_dim))
    
    batch_borders = list(range(0, nr_lines + batch_size, batch_size))
    for batch_b, batch_e in tqdm.tqdm(zip(batch_borders[:-1], batch_borders[1:])):
        batch_e = min(batch_e, nr_lines)

        tokens_info = tokenizer(
            lines[batch_b: batch_e], padding=True, truncation=True, return_tensors="pt"
        )
        tokens_info = {k: v.to(device) for k, v in tokens_info.items()}
        
        with torch.no_grad():
            outputs = model(**tokens_info)
            feature_matrix[batch_b: batch_e] = outputs['pooler_output'].cpu().numpy()
    
    return feature_matrix

In [50]:
features = get_features(data, model, tokenizer)

1050it [10:34,  1.66it/s]


In [57]:
def normalize_features(features):
    return features / np.linalg.norm(features, ord=2, axis=1, keepdims=True)

In [58]:
features_l2 = normalize_features(features)
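Why normalize first? Once every row has unit L2 norm, a plain dot product equals cosine similarity, so a single matrix-vector product scores a query against the whole collection. Here is a toy NumPy sketch. `top_k_cosine` and the toy vectors are made up for illustration:

```python
import numpy as np

def top_k_cosine(query, matrix, k=2):
    """Return (indices, similarities) of the k rows most similar to query."""
    matrix_l2 = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query_l2 = query / np.linalg.norm(query)
    sims = matrix_l2 @ query_l2        # cosine similarity for every row
    order = np.argsort(-sims)[:k]      # highest similarity first
    return order, sims[order]

toy = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0], [-1.0, 0.0]])
idx, sims = top_k_cosine(np.array([2.0, 0.0]), toy, k=2)
print(idx, sims)
```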

In [83]:
def get_similar(line, features_matrix, ref_lines, device='cuda', topn=5):
    """Return the topn reference lines closest to `line`, as (text, cosine distance) pairs.

    Relies on the global `tokenizer` and `model`; `features_matrix` must be L2-normalized.
    """
    tokens_info = tokenizer(
        [line], padding=True, truncation=True, return_tensors="pt"
    )
    tokens_info = {k: v.to(device) for k, v in tokens_info.items()}

    with torch.no_grad():
        outputs = model(**tokens_info)
        feature = outputs['pooler_output'].cpu().numpy()

    # cosine distance = 1 - cosine similarity, so smaller means more similar
    distances = 1 - features_matrix @ normalize_features(feature)[0]
    indices = np.argsort(distances)
    return [(ref_lines[i], distances[i]) for i in indices[:topn]]

In [91]:
get_similar('What is like to be white male in USA?', features_l2, data)

[('What is like to live in Morocco as foreigner?\n', 0.0029890035040398555),
 ('What is to be a Hindu in Pakistan?\n', 0.0035213410771406384),
 ('How long does it take to renew Indian passport in Canada?\n',
  0.0035492363370205338),
 ('What is it like to live and work in London 20s?\n', 0.00359940182220575),
 ('What is the best app for white men who only like black men?\n',
  0.0036031545994793523)]

In [92]:
get_similar('What is like to be white male in Russia?', features_l2, data)

[('What is the difference between the Manchu and Han people?\n',
  0.002449484354391518),
 ('What is life like as a black person in China?\n', 0.0024514532470841788),
 ('What do non-Indonesians think about rendang?\n', 0.00254805326030505),
 ('What is it like to be an ethnic Tajik in China?\n', 0.002566751405706613),
 ('What do Kurds think of Germans?\n', 0.0027248765311250756)]

In [99]:
get_similar('Are there parallel universes?', features_l2, data)

[('Are there any real parallel universe?\n', 0.002507165318368032),
 ('Are there any mathematical proofs to the existence of aliens?\n',
  0.00251653791223172),
 ('Is there a way to get infinite energy?\n', 0.0028554901967061674),
 ('Are there really interpretations of dreams?\n', 0.0028775753667864556),
 ('Is it true that continental drift is fake?\n', 0.002924550808185211)]

In [100]:
get_similar('What is neural network?', features_l2, data)

[('What is matrix learning?\n', 0.0007113249482321171),
 ('What is backend process?\n', 0.0008258957443837422),
 ('What is Fourier transform?\n', 0.0008880005608729036),
 ('What is virtual machine?\n', 0.0009063134571185572),
 ('What is neuro linguistic programming?\n', 0.0009082031753746556)]

In [102]:
get_similar('ARM vs x86?', features_l2, data)

[('JavaScript vs Java?\n', 0.0023425796013568645),
 ('Scala Vs Node.js?\n', 0.002656501804128686),
 ('Ntpc gate cutoff?\n', 0.0028276098248721793),
 ('Windows 10 architecture?\n', 0.0032204057895649507),
 ('C++ free certificate?\n', 0.0034634240529997085)]

In [107]:
get_similar('Can I do Discrete Fourier Transform in Galois field?', features_l2, data)

[('How do I improve solution for Josephus problem algorithm?\n',
  0.0034718122028002396),
 ('What is inverse function? What are some examples?\n', 0.003655330969229542),
 ('What is tangential component of electric field?\n', 0.003663591549844858),
 ('Is any new idea for usefull mini project in compuer engineering?\n',
  0.003782691745504607),
 ('How do I prove ratio test and root test for convergence-divergence using epsilon delta definition?\n',
  0.003845545280049989)]

In [108]:
get_similar('How Enigma was decrypted?', features_l2, data)

[('How was Bank of America founded?\n', 0.0020955360146868163),
 ('When and how was "The Laboratory" written?\n', 0.0021501556443611625),
 ("What was the main focus of Nazi Germany's SS? Why?\n",
  0.0021622765529916155),
 ('How did the Hidden Wiki come to be?\n', 0.0022083650866796534),
 ('What did CCCP mean in Russia?\n', 0.0022590918248388547)]


__Bonus demo:__ transformer language models. 

`/* No points awarded for this task, but it's really cool, we promise :) */`

In [None]:
import torch
import numpy as np
from transformers import GPT2Tokenizer, GPT2LMHeadModel
device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = GPT2Tokenizer.from_pretrained('gpt2', add_prefix_space=True)
model = GPT2LMHeadModel.from_pretrained('gpt2').train(False).to(device)

text = "The Fermi paradox "
tokens = tokenizer.encode(text)
num_steps = 1024
line_length, max_length = 0, 70

print(end=tokenizer.decode(tokens))

for i in range(num_steps):
    with torch.no_grad():
        logits = model(torch.as_tensor([tokens], device=device))[0]
    p_next = torch.softmax(logits[0, -1, :], dim=-1).data.cpu().numpy()

    next_token_index = p_next.argmax() #<YOUR CODE: REPLACE THIS LINE>
    # YOUR TASK: change the code so that it performs nucleus sampling

    tokens.append(int(next_token_index))
    print(end=tokenizer.decode(tokens[-1]))
    line_length += len(tokenizer.decode(tokens[-1]))
    if line_length >= max_length:
        line_length = 0
        print()
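
For the task above, here is one possible sketch of nucleus (top-p) sampling: keep the smallest set of most likely tokens whose cumulative probability reaches `top_p`, renormalize, and sample from that set. `nucleus_sample` and the `top_p=0.9` default are our own choices for illustration, not part of the `transformers` API:

```python
import numpy as np

def nucleus_sample(probs, top_p=0.9, rng=None):
    """Sample an index from the smallest set of tokens with total probability >= top_p."""
    rng = np.random.default_rng() if rng is None else rng
    # float64 keeps the renormalized probabilities summing to 1 within tolerance
    probs = np.asarray(probs, dtype=np.float64)
    order = np.argsort(-probs)                  # most likely tokens first
    cumulative = np.cumsum(probs[order])
    # number of tokens needed for the cumulative mass to reach top_p
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()  # renormalize
    return int(rng.choice(nucleus, p=nucleus_probs))

# toy distribution over 4 tokens: with top_p=0.9 the last token is never sampled
p_toy = [0.5, 0.3, 0.15, 0.05]
rng = np.random.default_rng(0)
print([nucleus_sample(p_toy, top_p=0.9, rng=rng) for _ in range(10)])
```

In the generation loop, the argmax line could then become `next_token_index = nucleus_sample(p_next)`.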



Transformers knowledge hub: https://huggingface.co/transformers/