# CS549 Machine Learning
# Assignment 12: Transformer and Transformer-based Models (part 2)

**Total points: 10**

In this assignment, you will: 
1) Implement the **multi-head attention** sublayer in a transformer encoder.

2) Play with the transformer-based models provided in **transformers** for multiple natural language processing (NLP) tasks.

In [2]:
import torch
from torch.nn.functional import cosine_similarity

## Task 2. Play with transformer-based models
**Points: 5**

### 2.1 Installation
Install the *transformers* package with the following command:
```
pip install transformers
```

After it is done, you can load some pretrained BERT models and tokenizers like this (you can ignore the warnings):

In [3]:
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained("bert-base-uncased")

### 2.2 Tokenizing inputs

Run the following example:

In [4]:
text = """The hotness of the sun and the coldness of the outer space are inexhaustible thermodynamic
resources for human beings. From a thermodynamic point of view, any energy conversion systems
that receive energy from the sun and/or dissipate energy to the universe are heat engines with
photons as the "working fluid" and can be analyzed using the concept of entropy. While entropy
analysis provides a particularly convenient way to understand the efficiency limits, it is typically
taught in the context of thermodynamic cycles among quasi-equilibrium states and its
generalization to solar energy conversion systems running in a continuous and non-equilibrium
fashion is not straightforward. In this educational article, we present a few examples to illustrate
how the concept of photon entropy, combined with the radiative transfer equation, can be used to
analyze the local entropy generation processes and the efficiency limits of different solar energy
conversion systems. We provide explicit calculations for the local and total entropy generation
rates for simple emitters and absorbers, as well as photovoltaic cells, which can be readily
reproduced by students. We further discuss the connection between the entropy generation and the
device efficiency, particularly the exact spectral matching condition that is shared by infinitejunction photovoltaic cells and reversible thermoelectric materials to approach their theoretical
efficiency limit."""

encoded_input = tokenizer(text, return_tensors='pt')

print(len(text.split()))
print(encoded_input['input_ids'].shape)

211
torch.Size([1, 275])


Can you explain why the `encoded_input` has more elements than the actual number of words in `text`?\
(**Points: 1**)

In [5]:
### Write your answer within the quotes ###
answer = """
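    The encoded input has more elements than the number of words because the tokenizer splits the input into
    subword tokens rather than whitespace-separated words. For example, the word "hotness" is split into two tokens,
    "hot" and "##ness". The model can process "hot" and "##ness" separately, or combine their information to
    recover the meaning of "hotness". Punctuation marks also become separate tokens, and the special tokens
    [CLS] and [SEP] are added at the start and end of the sequence.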
"""

*NOTE*: there is no expected output for this question.
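
To see how subword splitting inflates the token count, here is a minimal sketch of greedy longest-match-first splitting, the strategy WordPiece-style tokenizers apply to each word. The function `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical names made up for illustration; the real `BertTokenizer` also lowercases text, splits punctuation, and uses a far larger vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary (an assumption, not BERT's real vocabulary).
toy_vocab = {"the", "hot", "cold", "##ness", "of", "sun"}

words = "the hotness of the sun".split()
tokens = ["[CLS]"] + [t for w in words for t in wordpiece_tokenize(w, toy_vocab)] + ["[SEP]"]

print(len(words), len(tokens))  # prints: 5 8
print(tokens)
```

Five whitespace-separated words become eight tokens here: one extra from splitting "hotness" and two from the special tokens.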

---

### 2.3 Output word vectors from BERT

In [6]:
output = model(**encoded_input)

last_hidden_state = output['last_hidden_state']

print(last_hidden_state.shape)

torch.Size([1, 275, 768])


With the following code, you can find the corresponding token of each integer id in `input_ids`.

In [7]:
input_ids_pt = encoded_input['input_ids']
input_ids_list = input_ids_pt.tolist()[0]
input_tokens = tokenizer.convert_ids_to_tokens(input_ids_list)

print(input_ids_list[:10])
print(input_tokens[:10])

[101, 1996, 2980, 2791, 1997, 1996, 3103, 1998, 1996, 3147]
['[CLS]', 'the', 'hot', '##ness', 'of', 'the', 'sun', 'and', 'the', 'cold']


Can you find the output vector**s** among `last_hidden_state` that correspond to the input word "entropy"?\
Do they have the same values?\
**(Points: 1)**

*Hint*: You can use an `if` statement to check whether the current token is the word "entropy", and if so, append the output vector at that position to `vectors`.

In [9]:
vectors = []
for i, token in enumerate(input_tokens):
    ### START YOUR CODE ###
    # Collect the output vector (not the position index) for each "entropy" token
    if token == "entropy":
        vectors.append(last_hidden_state[0][i])
    ### END YOUR CODE ###

# Do not change the code below
print('Number of "entropy":', len(vectors))

matches = [torch.allclose(vectors[i], vectors[i+1]) for i in range(len(vectors)-1)]
print(f'Do they have the same value? {matches}')

Number of "entropy": 6
Do they have the same value? [False, False, False, False, False]


**Expected output:** \
Number of "entropy": 6\
Do they have the same value? [False, False, False, False, False]
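
The reason the six vectors differ is that each output vector depends on the surrounding tokens, not only on the token itself. The sketch below is a toy illustration of that idea and not BERT's actual computation: the 2-dimensional `static` table and the `contextualize` helper are hypothetical, and the "context mixing" is just an average over immediate neighbours rather than self-attention.

```python
import math

# Hypothetical 2-dimensional "static" embeddings: one fixed vector per token.
static = {
    "photon": [0.0, 1.0],
    "entropy": [1.0, 0.0],
    "of": [0.0, 0.0],
    "local": [0.5, 0.5],
}
tokens = ["photon", "entropy", "of", "local", "entropy"]


def contextualize(tokens, table):
    """Average each token's vector with its immediate neighbours (toy context mixing)."""
    mixed = []
    for i in range(len(tokens)):
        window = [table[t] for t in tokens[max(0, i - 1): i + 2]]
        dim = len(window[0])
        mixed.append([sum(v[d] for v in window) / len(window) for d in range(dim)])
    return mixed


contextual = contextualize(tokens, static)

# Same token, two positions, two different neighbourhoods.
entropy_vectors = [contextual[i] for i, t in enumerate(tokens) if t == "entropy"]
same = all(math.isclose(a, b) for a, b in zip(*entropy_vectors))

print(len(entropy_vectors), same)  # prints: 2 False
```

Both occurrences start from the same static vector, yet they end up different because their neighbours differ.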

---
### 2.4 Sentence vectors from BERT

We can obtain the output vectors for a batch of sentences.

First, we need to break the text into a list of sentences, using the period character '.' as a simple end-of-sentence separator.

In [10]:
sentences = text.replace('\n', ' ').split('.')
sentences = [s.strip() + '.' for s in sentences if len(s.strip())>0] # Some cleaning work

print(f'Resulting in {len(sentences)} sentences:')
print(sentences)

Resulting in 6 sentences:
['The hotness of the sun and the coldness of the outer space are inexhaustible thermodynamic resources for human beings.', 'From a thermodynamic point of view, any energy conversion systems that receive energy from the sun and/or dissipate energy to the universe are heat engines with photons as the "working fluid" and can be analyzed using the concept of entropy.', 'While entropy analysis provides a particularly convenient way to understand the efficiency limits, it is typically taught in the context of thermodynamic cycles among quasi-equilibrium states and its generalization to solar energy conversion systems running in a continuous and non-equilibrium fashion is not straightforward.', 'In this educational article, we present a few examples to illustrate how the concept of photon entropy, combined with the radiative transfer equation, can be used to analyze the local entropy generation processes and the efficiency limits of different solar energy conversion 

Now, let's use the tokenizer on this batch of sentences.

In [11]:
encoded_sentences = tokenizer(sentences, padding=True, return_tensors='pt')

print(encoded_sentences['input_ids'].shape)
print(encoded_sentences['input_ids'][0,:])

torch.Size([6, 57])
tensor([  101,  1996,  2980,  2791,  1997,  1996,  3103,  1998,  1996,  3147,
         2791,  1997,  1996,  6058,  2686,  2024,  1999, 10288, 13821,  3775,
         3468,  1996, 10867,  7716, 18279,  7712,  4219,  2005,  2529,  9552,
         1012,   102,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0])


You can find that shorter sentences are padded with a special id `0`.
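
Padding itself can be sketched in a few lines of plain Python. The helper `pad_batch` below is hypothetical and not part of transformers. It also builds a 0/1 mask marking the real tokens, similar in spirit to the `attention_mask` entry that the tokenizer returns alongside `input_ids`.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the longest length and build a matching 0/1 mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask


# Two short id sequences of different lengths (101 = [CLS], 102 = [SEP]).
batch = [[101, 1996, 3103, 102], [101, 3103, 102]]
input_ids, attention_mask = pad_batch(batch)

print(input_ids)       # prints: [[101, 1996, 3103, 102], [101, 3103, 102, 0]]
print(attention_mask)  # prints: [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Padding to a common length is what lets the batch be stored as one rectangular tensor, such as the `[6, 57]` shape above.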

Next, we can obtain the output tensors for all input sentences, also in a batch.

In [12]:
outputs = model(**encoded_sentences)

print(outputs['last_hidden_state'].shape)

torch.Size([6, 57, 768])


Note that the first dimension of `outputs['last_hidden_state']` is batch size. So the output tensor for the 1st sentence is `outputs['last_hidden_state'][0]`, and so on.

In [13]:
print(outputs['last_hidden_state'][0].shape)

torch.Size([57, 768])


For each output tensor, the first 768-dim vector (at position 0) always corresponds to the special input token `[CLS]`. We can use this vector to represent the meaning of the whole sentence.

In [14]:
CLS_vec = outputs['last_hidden_state'][0][0]
print(CLS_vec.shape)

torch.Size([768])


Now, it is your task to compute the cosine similarities between each pair of the 6 sentences, and find the pair that has the closest meanings.\
**(Points: 3)**

*Hint*: You can use the `cosine_similarity()` function imported at the beginning, which takes two tensors as input and returns the similarity score as a tensor. Append `.item()` to the result to retrieve the numeric value from the returned tensor. You also need to specify the argument `dim=0`.

In [27]:
for i in range(5):
    for j in range(i+1, 6):
        ### START YOUR CODE ###
        vec_i = outputs['last_hidden_state'][i][0]  # [CLS] vector of sentence i
        vec_j = outputs['last_hidden_state'][j][0]  # [CLS] vector of sentence j
        sim = cosine_similarity(vec_i, vec_j, dim=0).item()
        # Hint: when you call the cosine_similarity() function,
        # remember to specify dim=0. Also, you need to append .item() at
        # the end to obtain a number instead of a tensor.
        
        

        ### END YOUR CODE ###
        print(f'{i} <-> {j}: {sim}')

0 <-> 1: 0.8591638207435608
0 <-> 2: 0.777198314666748
0 <-> 3: 0.7985225319862366
0 <-> 4: 0.7754685878753662
0 <-> 5: 0.805216372013092
1 <-> 2: 0.876341700553894
1 <-> 3: 0.8321617841720581
1 <-> 4: 0.8238447904586792
1 <-> 5: 0.8492751121520996
2 <-> 3: 0.8241375684738159
2 <-> 4: 0.8598626255989075
2 <-> 5: 0.8579833507537842
3 <-> 4: 0.9018083214759827
3 <-> 5: 0.929144024848938
4 <-> 5: 0.9185265898704529


**Expected output:**\
0 <-> 1: 0.8591639399528503\
0 <-> 2: 0.777198314666748\
0 <-> 3: 0.7985224723815918\
0 <-> 4: 0.7754684090614319\
0 <-> 5: 0.8052163124084473\
1 <-> 2: 0.876341700553894\
1 <-> 3: 0.8321619629859924\
1 <-> 4: 0.823844850063324\
1 <-> 5: 0.8492751717567444\
2 <-> 3: 0.8241377472877502\
2 <-> 4: 0.8598626852035522\
2 <-> 5: 0.8579834699630737\
3 <-> 4: 0.9018082618713379\
3 <-> 5: 0.929144024848938\
4 <-> 5: 0.9185266494750977
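
Rather than scanning the printed scores by eye, the closest pair can be picked programmatically. The sketch below uses plain Python and hypothetical 2-dimensional `toy_vectors` in place of the 768-dimensional `[CLS]` vectors. The `cosine` helper implements the usual definition, the dot product divided by the product of the two vector norms.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors standing in for sentence embeddings.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]

# Score every unordered pair (i, j) with i < j, then keep the highest-scoring pair.
scores = {
    (i, j): cosine(toy_vectors[i], toy_vectors[j])
    for i, j in combinations(range(len(toy_vectors)), 2)
}
best_pair = max(scores, key=scores.get)

print(best_pair)  # prints: (0, 2)
```

The same `max(..., key=...)` pattern applies if the scores from the nested loop above are stored in a dictionary keyed by `(i, j)`.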

---
You can print out the two most similar sentences (indices 3 and 5, which have the highest score above) to see if the similarity score makes sense.

In [28]:
print(sentences[3])
print(sentences[5])

In this educational article, we present a few examples to illustrate how the concept of photon entropy, combined with the radiative transfer equation, can be used to analyze the local entropy generation processes and the efficiency limits of different solar energy conversion systems.
We further discuss the connection between the entropy generation and the device efficiency, particularly the exact spectral matching condition that is shared by infinitejunction photovoltaic cells and reversible thermoelectric materials to approach their theoretical efficiency limit.


---

### 2.5 Play with summarization

Lastly, let's play with the summarization pipeline provided by transformers. Be patient while the model is downloading.

You can try the following code with different input text or arguments.

In [29]:
from transformers import pipeline

summarizer = pipeline("summarization")

print(summarizer(text, max_length=150, min_length=30))

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'summary_text': ' The hotness of the sun and the coldness of outer space are inexhaustible thermodynamic resources for human beings . From a thermodynamic point of view, any energy conversion systems that receive energy from the sun or dissipate energy to the universe are heat engines with photons as the "working fluid"'}]
