In [1]:
!pip install transformers
import transformers


### Using pre-trained transformers
_for fun and profit_

There are many toolkits that let you access pre-trained transformer models, but the most powerful and convenient by far is [`huggingface/transformers`](https://github.com/huggingface/transformers). In this week's practice, you'll learn how to download, apply and modify pre-trained transformers for a range of tasks. Buckle up, we're going in!


__Pipelines:__ if all you want is to apply a pre-trained model, you can do that in one line of code using `pipeline`. Huggingface/transformers has a selection of pre-configured pipelines for masked language modelling, sentiment classification, question answering, etc. ([see full list here](https://huggingface.co/transformers/main_classes/pipelines.html))

A typical pipeline includes:
* pre-processing, e.g. tokenization, subword segmentation
* a backbone model, e.g. BERT fine-tuned for classification
* output post-processing
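
The three stages above can be sketched without any library. The toy below is an illustration only: `toy_tokenize`, `toy_model` and `toy_postprocess` are hypothetical stand-ins (a whitespace tokenizer and a hand-written word-score table), not how `transformers` implements its pipelines.

```python
import math

# Hypothetical word scores standing in for a trained model's weights.
WORD_SCORES = {"amazing": 2.0, "great": 1.5, "awful": -2.0, "boring": -1.0}
LABELS = ["NEGATIVE", "POSITIVE"]

def toy_tokenize(text):
    """Pre-processing: lowercase, replace punctuation with spaces, split on whitespace."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text.lower())
    return cleaned.split()

def toy_model(tokens):
    """Backbone stand-in: return [negative_logit, positive_logit]."""
    total = sum(WORD_SCORES.get(tok, 0.0) for tok in tokens)
    return [-total, total]

def toy_postprocess(logits):
    """Post-processing: softmax over the logits, then report the best label."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    probs = [e / sum(exps) for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}

def toy_pipeline(text):
    return toy_postprocess(toy_model(toy_tokenize(text)))

print(toy_pipeline("BERT is amazing!"))
```

The returned dictionary mirrors the `label`/`score` shape that the real sentiment pipeline produces; everything else here is simplified.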

Let's see it in action:

In [2]:
from transformers import pipeline
classifier = pipeline('sentiment-analysis', model="distilbert-base-uncased-finetuned-sst-2-english")

print(classifier("BERT is amazing!"))


[{'label': 'POSITIVE', 'score': 0.9998860955238342}]


In [3]:
import base64
data = {
    'arryn': 'As High as Honor.',
    'baratheon': 'Ours is the fury.',
    'stark': 'Winter is coming.',
    'tyrell': 'Growing strong.'
}

# Predict sentiment for each noble house.
# The pipeline returns a list with one {'label', 'score'} dict per input string.
predictions = {house: classifier(motto)[0] for house, motto in data.items()}

# dict (house name) : True if positive, False if negative
outputs = {house: pred['label'] == 'POSITIVE' for house, pred in predictions.items()}

# base64.decodestring was removed in Python 3.9; decodebytes is the supported equivalent
assert sum(outputs.values()) == 3 and outputs[base64.decodebytes(b'YmFyYXRoZW9u\n').decode()] == False
print("Well done!")

Well done!




You can also access a vanilla masked language model that was trained to predict masked words. Here's how:

In [4]:
mlm_model = pipeline('fill-mask', model="bert-base-uncased")
MASK = mlm_model.tokenizer.mask_token

for hypo in mlm_model(f"Donald {MASK} is the president of the united states."):
  print(f"P={hypo['score']:.5f}", hypo['sequence'])

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


P=0.99719 donald trump is the president of the united states.
P=0.00024 donald duck is the president of the united states.
P=0.00022 donald ross is the president of the united states.
P=0.00020 donald johnson is the president of the united states.
P=0.00018 donald wilson is the president of the united states.


In [5]:
# Your turn: use BERT to recall the year in which the Soviet Union was founded

for hypo in mlm_model(f"The Soviet Union was founded in {MASK}."):
  print(f"P={hypo['score']:.5f}", hypo['sequence'])

P=0.29111 the soviet union was founded in russia.
P=0.09198 the soviet union was founded in europe.
P=0.03760 the soviet union was founded in moscow.
P=0.02844 the soviet union was founded in october.
P=0.02342 the soviet union was founded in siberia.




Huggingface offers hundreds of pre-trained models that specialize in different tasks. You can quickly find the model you need using [this list](https://huggingface.co/models).


In [6]:
text = """Almost two-thirds of the 1.5 million people who viewed this liveblog had Googled to discover
 the latest on the Rosetta mission. They were treated to this detailed account by the Guardian’s science editor,
 Ian Sample, and astronomy writer Stuart Clark of the moment scientists landed a robotic spacecraft on a comet 
 for the first time in history, and the delirious reaction it provoked at their headquarters in Germany.
  “We are there. We are sitting on the surface. Philae is talking to us,” said one scientist.
"""

# Task: create a pipeline for named entity recognition, use task name 'ner' and search for the right model in the list
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

ner_model = pipeline("ner", model=model, tokenizer=tokenizer)

named_entities = ner_model(text)


In [7]:
print('OUTPUT:', named_entities)
word_to_entity = {item['word']: item['entity'] for item in named_entities}
assert 'org' in word_to_entity.get('Guardian').lower() and 'per' in word_to_entity.get('Stuart').lower()
print("All tests passed")

OUTPUT: [{'word': 'Rose', 'score': 0.7991043329238892, 'entity': 'B-LOC', 'index': 27, 'start': 112, 'end': 116}, {'word': '##tta', 'score': 0.9511924982070923, 'entity': 'I-LOC', 'index': 28, 'start': 116, 'end': 119}, {'word': 'Guardian', 'score': 0.9982230067253113, 'entity': 'B-ORG', 'index': 40, 'start': 179, 'end': 187}, {'word': 'Ian', 'score': 0.9997612833976746, 'entity': 'B-PER', 'index': 46, 'start': 207, 'end': 210}, {'word': 'Sam', 'score': 0.9997870326042175, 'entity': 'I-PER', 'index': 47, 'start': 211, 'end': 214}, {'word': '##ple', 'score': 0.999646008014679, 'entity': 'I-PER', 'index': 48, 'start': 214, 'end': 217}, {'word': 'Stuart', 'score': 0.9997830986976624, 'entity': 'B-PER', 'index': 53, 'start': 240, 'end': 246}, {'word': 'Clark', 'score': 0.9997482299804688, 'entity': 'I-PER', 'index': 54, 'start': 247, 'end': 252}, {'word': 'Germany', 'score': 0.9997227191925049, 'entity': 'B-LOC', 'index': 85, 'start': 414, 'end': 421}, {'word': 'Phil', 'score': 0.996312797
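
The raw output has one entry per word piece, so `Rosetta` appears as `Rose` + `##tta` and `Ian Sample` as three separate items. A minimal sketch of gluing the pieces back together using the `B-`/`I-` tags and the `##` continuation prefix is shown here. `merge_entities` is a hypothetical helper, the sample copies a few entries from the output above (scores and offsets omitted), and the logic assumes that consecutive list items are adjacent in the text, which the real output does not guarantee.

```python
def merge_entities(pieces):
    """Group word-piece predictions into (text, entity_type) spans."""
    spans = []
    for piece in pieces:
        word, tag = piece["word"], piece["entity"]
        prefix, _, kind = tag.partition("-")  # "B-PER" -> ("B", "-", "PER")
        if word.startswith("##") and spans:
            # Continuation of the previous word: append without a space.
            spans[-1][0] += word[2:]
        elif prefix == "I" and spans and spans[-1][1] == kind:
            # Next word inside the same entity: append with a space.
            spans[-1][0] += " " + word
        else:
            # A "B-" tag (or a stray "I-" tag) starts a new entity.
            spans.append([word, kind])
    return [tuple(span) for span in spans]

sample = [
    {"word": "Rose", "entity": "B-LOC"},
    {"word": "##tta", "entity": "I-LOC"},
    {"word": "Guardian", "entity": "B-ORG"},
    {"word": "Ian", "entity": "B-PER"},
    {"word": "Sam", "entity": "I-PER"},
    {"word": "##ple", "entity": "I-PER"},
    {"word": "Stuart", "entity": "B-PER"},
    {"word": "Clark", "entity": "I-PER"},
    {"word": "Germany", "entity": "B-LOC"},
]
merged = merge_entities(sample)
print(merged)
```

A more careful version would also compare the `index` or `start`/`end` fields before merging, so that two unrelated entities of the same type are never joined.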

### The building blocks of a pipeline

Huggingface also allows you to access its pipelines on a lower level. There are two main abstractions for you:
* `Tokenizer` - converts from strings to token ids and back
* `Model` - a pytorch `nn.Module` with pre-trained weights

You can use such models as part of your regular pytorch code: insert it as a layer in your model, apply it to a batch of data, backpropagate, optimize, etc.

In [8]:
import torch
from transformers import AutoTokenizer, AutoModel, pipeline

model_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

In [9]:
lines = [
    "Luke, I am your father.",
    "Life is what happens when you're busy making other plans.",
    ]

# tokenize a batch of inputs. "pt" means [p]y[t]orch tensors
tokens_info = tokenizer(lines, padding=True, truncation=True, return_tensors="pt")

for key in tokens_info:
    print(key, tokens_info[key])

print("Detokenized:")
for i in range(2):
    print(tokenizer.decode(tokens_info['input_ids'][i]))

input_ids tensor([[ 101, 5355, 1010, 1045, 2572, 2115, 2269, 1012,  102,    0,    0,    0,
            0,    0,    0],
        [ 101, 2166, 2003, 2054, 6433, 2043, 2017, 1005, 2128, 5697, 2437, 2060,
         3488, 1012,  102]])
token_type_ids tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
attention_mask tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
Detokenized:
[CLS] luke, i am your father. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
[CLS] life is what happens when you're busy making other plans. [SEP]


In [10]:
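# You can now apply the model to get embeddings.
# The model returns an output object; unpacking it like a tuple yields field names
# (e.g. 'pooler_output'), so read the fields by name instead.
with torch.no_grad():
    out = model(**tokens_info)
token_embeddings, sentence_embedding = out.last_hidden_state, out.pooler_output

print(sentence_embedding)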


### The search for similar questions.

Remember week01, where you used GloVe embeddings to find related questions? That was... cute, but far from state of the art. It's time to **really** solve this task using context-aware embeddings.

In [11]:
# download the data:
!wget https://www.dropbox.com/s/obaitrix9jyu84r/quora.txt?dl=1 -O ./quora.txt
# alternative download link: https://yadi.sk/i/BPQrUu1NaTduEw

--2021-04-05 07:38:03--  https://www.dropbox.com/s/obaitrix9jyu84r/quora.txt?dl=1
Resolving www.dropbox.com (www.dropbox.com)... 162.125.3.18, 2620:100:601b:18::a27d:812
Connecting to www.dropbox.com (www.dropbox.com)|162.125.3.18|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/dl/obaitrix9jyu84r/quora.txt [following]
--2021-04-05 07:38:03--  https://www.dropbox.com/s/dl/obaitrix9jyu84r/quora.txt
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc104c518a1e6ad9062fab1b4bfc.dl.dropboxusercontent.com/cd/0/get/BMD5vxpeEZIKIH3gTAqOYrSbQWEdkLeNHP-P5fItnlEreIZ-nf6tvLC_dJS-QODLNsRlnJi8T6YeP0FoGSki2qG-5DrdcbG7IgMjElYCMkagvsuUyJRZnvrKmk13CLvd0ZUmprLn6VORkaS_pezdN9Qt/file?dl=1# [following]
--2021-04-05 07:38:03--  https://uc104c518a1e6ad9062fab1b4bfc.dl.dropboxusercontent.com/cd/0/get/BMD5vxpeEZIKIH3gTAqOYrSbQWEdkLeNHP-P5fItnlEreIZ-nf6tvLC_dJS-QODLNsRlnJi8T6YeP0FoGSki2qG-5DrdcbG7I

__Main task (3 pts):__ 
* Implement a function that takes a text string and finds top-k most similar questions from `quora.txt`
* Demonstrate your function using at least 5 examples

There are no prompts this time: you will have to write everything from scratch.

In [12]:
#<A whole lot of your code. Feel free to format it as you see fit>

In [13]:
# The context manager closes the file even if reading fails.
with open("quora.txt", "r") as handle:
    data = handle.readlines()  # read ALL the lines (each keeps its trailing '\n')
print(*data[:10], sep = '\n')

Can I get back with my ex even though she is pregnant with another guy's baby?

What are some ways to overcome a fast food addiction?

Who were the great Chinese soldiers and leaders who fought in WW2?

What are ZIP codes in the Bay Area?

Why was George RR Martin critical of JK Rowling after losing the Hugo award?

What can I do to improve my immune system?

How is your relationship with your mother in law?

How does one get Free PSN codes/Vita Codes?

What is your review of osquery?

How can I look smart and act smart?



In [14]:
len(data)

537272

In [15]:
pip install sentence-transformers


In [34]:
from sentence_transformers import SentenceTransformer, util
import numpy as np

In [35]:
model = SentenceTransformer('stsb-roberta-large')

# Calculate semantic similarity between two sentences

In [36]:
sentence1 = data[0]
sentence2 = data[1]


# encode sentences to get their embeddings
embedding1 = model.encode(sentence1, convert_to_tensor=True)
embedding2 = model.encode(sentence2, convert_to_tensor=True)

# compute similarity scores of two embeddings
cosine_scores = util.pytorch_cos_sim(embedding1, embedding2)

print("Sentence 1:", sentence1)
print("Sentence 2:", sentence2)
print("Similarity score:", cosine_scores.item())

Sentence 1: Can I get back with my ex even though she is pregnant with another guy's baby?

Sentence 2: What are some ways to overcome a fast food addiction?

Similarity score: 0.008801806718111038


# Calculate semantic similarity between two lists of sentences

In [19]:
sentences1 = [data[0], data[1]]   
sentences2 = [data[2], data[3]]

# encode list of sentences to get their embeddings
embedding1 = model.encode(sentences1, convert_to_tensor=True)
embedding2 = model.encode(sentences2, convert_to_tensor=True)

# compute similarity scores of two embeddings
cosine_scores = util.pytorch_cos_sim(embedding1, embedding2)

for i in range(len(sentences1)):
    for j in range(len(sentences2)):
        print("Sentence 1:", sentences1[i])
        print("Sentence 2:", sentences2[j])
        print("Similarity Score:", cosine_scores[i][j].item())
        print()

Sentence 1: Can I get back with my ex even though she is pregnant with another guy's baby?

Sentence 2: Who were the great Chinese soldiers and leaders who fought in WW2?

Similarity Score: 0.08254334330558777

Sentence 1: Can I get back with my ex even though she is pregnant with another guy's baby?

Sentence 2: What are ZIP codes in the Bay Area?

Similarity Score: 0.05146317183971405

Sentence 1: What are some ways to overcome a fast food addiction?

Sentence 2: Who were the great Chinese soldiers and leaders who fought in WW2?

Similarity Score: 0.04884848743677139

Sentence 1: What are some ways to overcome a fast food addiction?

Sentence 2: What are ZIP codes in the Bay Area?

Similarity Score: 0.34503987431526184



# Retrieve Top K most similar sentences from a corpus given a sentence

In [20]:
data[37]

'How do students at IIT get foreign internships?\n'

In [21]:
corpus = data[:50]

# encode corpus to get corpus embeddings
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
sentence = "How do students at medical get foreign internships?"

# encode sentence to get sentence embeddings
sentence_embedding = model.encode(sentence, convert_to_tensor=True)

# top_k results to return
top_k=5

# compute similarity scores of the sentence with the corpus
cos_scores = util.pytorch_cos_sim(sentence_embedding, corpus_embeddings)[0]

# Sort the results in decreasing order and get the first top_k
top_results = np.argpartition(-cos_scores, range(top_k))[0:top_k]
print("Sentence:", sentence, "\n")
print("Top", top_k, "most similar sentences in corpus:")
for idx in top_results[0:top_k]:
    print(corpus[idx], "(Score: %.4f)" % (cos_scores[idx]))

Sentence: How do students at medical get foreign internships? 

Top 5 most similar sentences in corpus:
How do students at IIT get foreign internships?
 (Score: 0.7911)
Can a high school student get an internship at Goldman Sachs?
 (Score: 0.5546)
How is the experience to be an extra on the show - How I Met Your Mother?
 (Score: 0.3665)
What can I do to improve my immune system?
 (Score: 0.3262)
How do I meet foreign guys in Hong Kong?
 (Score: 0.3241)


# Now let's turn the code from the previous cell into a function

In [39]:
def get_top_k_most_similar_questions(question, corpus, model, k = 5):

    # encode corpus to get corpus embeddings
    corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
    
    # encode sentence to get sentence embeddings
    question_embedding = model.encode(question, convert_to_tensor=True)

    # compute similarity scores of the sentence with the corpus
    cos_scores = util.pytorch_cos_sim(question_embedding, corpus_embeddings)[0]

    # Sort the results in decreasing order and get the first top_k
    top_results = np.argpartition(-cos_scores, range(k))[0:k]

    result = list()

    for idx in top_results[0:k]:
        d = dict.fromkeys(['question', 'score'])
        d['question'] = corpus[idx]
        d['score'] = cos_scores[idx]
        result.append(d)

    return result

In [42]:
question1 = 'How do I start investing in stocks with 10$?'
corpus = data[:100]
k = 5

get_top_k_most_similar_questions(question1, corpus, model, k)

[{'question': 'How do I start investing in stocks with 100€?\n',
  'score': tensor(0.7049)},
 {'question': 'What is Quantitative Analysis in finance? How is it applied?\n',
  'score': tensor(0.4661)},
 {'question': 'Which car should I buy out of Tiago, Kwid, Figo, Datsun or any other?\n',
  'score': tensor(0.3890)},
 {'question': 'Which brand should go with the GTX 960 graphic card, MSI, Zotac or ASUS?\n',
  'score': tensor(0.3346)},
 {'question': 'What schools can I apply for with a GRE score of Q-165 V-153?\n',
  'score': tensor(0.3147)}]
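
Note that `get_top_k_most_similar_questions` re-encodes the whole corpus on every call, which gets expensive once you move from 100 questions to all 537,272. One possible restructuring is to embed and normalize the corpus once and reuse it for every query. The sketch below shows only that retrieval arithmetic, on made-up 2-d vectors in plain Python; `normalize`, `build_index` and `top_k` are hypothetical helpers, and in the notebook the vectors would come from `model.encode`.

```python
import heapq
import math

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def build_index(corpus_vectors):
    """Normalize every corpus vector once, up front."""
    return [normalize(vec) for vec in corpus_vectors]

def top_k(query_vec, index, k=5):
    """Return the k best (score, corpus_position) pairs, highest score first."""
    query = normalize(query_vec)
    scores = (
        (sum(q * c for q, c in zip(query, vec)), pos)
        for pos, vec in enumerate(index)
    )
    return heapq.nlargest(k, scores)

# Toy 2-d "embeddings", made up for illustration.
corpus_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
index = build_index(corpus_vectors)

for score, pos in top_k([2.0, 0.1], index, k=2):
    print(f"corpus[{pos}] score={score:.4f}")
```

`heapq.nlargest` avoids sorting the full score list, which plays the same role as `np.argpartition` in the cells above.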


__Bonus demo:__ transformer language models. 

`/* No points awarded for this task, but it's really cool, we promise :) */`

In [None]:
import torch
import numpy as np
from transformers import GPT2Tokenizer, GPT2LMHeadModel
device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = GPT2Tokenizer.from_pretrained('gpt2', add_prefix_space=True)
model = GPT2LMHeadModel.from_pretrained('gpt2').train(False).to(device)

text = "The Fermi paradox "
tokens = tokenizer.encode(text)
num_steps = 1024
line_length, max_length = 0, 70

print(end=tokenizer.decode(tokens))

for i in range(num_steps):
    with torch.no_grad():
        logits = model(torch.as_tensor([tokens], device=device))[0]
    p_next = torch.softmax(logits[0, -1, :], dim=-1).data.cpu().numpy()

    next_token_index = p_next.argmax() #<YOUR CODE: REPLACE THIS LINE>
    # YOUR TASK: change the code so that it performs nucleus sampling

    tokens.append(int(next_token_index))
    print(end=tokenizer.decode(tokens[-1]))
    line_length += len(tokenizer.decode(tokens[-1]))
    if line_length >= max_length:
        line_length = 0
        print()
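
For the task in the loop above, one possible approach (a sketch, not the reference solution) is nucleus, or top-p, sampling: keep the smallest set of most probable tokens whose total probability reaches `top_p`, then sample from that set in proportion to the original probabilities. The name `nucleus_sample` and the default `top_p=0.9` are illustrative assumptions. The function is written in plain Python so it runs on its own, and it accepts any indexable sequence of probabilities.

```python
import random

def nucleus_sample(probs, top_p=0.9, rng=random):
    """Sample an index from the smallest top-probability set with mass >= top_p."""
    # Token indices sorted from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    nucleus, mass = [], 0.0
    for idx in order:
        nucleus.append(idx)
        mass += float(probs[idx])
        if mass >= top_p:
            break  # the nucleus now covers at least top_p of the probability

    # random.choices treats the weights as relative, so no explicit renormalization is needed.
    weights = [float(probs[i]) for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]

# Toy distribution: with top_p=0.7 only tokens 0 and 1 can ever be drawn.
toy_probs = [0.5, 0.3, 0.15, 0.05]
random.seed(0)
draws = {nucleus_sample(toy_probs, top_p=0.7) for _ in range(200)}
print(sorted(draws))
```

In the generation loop this would replace the `argmax` line, for example `next_token_index = nucleus_sample(p_next, top_p=0.9)`. Sorting the full vocabulary in pure Python at every step is slow; a vectorized NumPy or PyTorch version would be the natural next step.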



Transformers knowledge hub: https://huggingface.co/transformers/