# Assignment I: PyTorch & Language Models

In this first assignment we will work with PyTorch and OpenAI's early open-source model GPT-2 to build a basic understanding of, and intuition for, how language models work and how they are trained. We will also look at a simple task, sentiment classification, and see two ways in which language models can be used to solve it.

The structure of the Assignment is as follows:

1. **PyTorch Basics: Common Operations**

   We first familiarize ourselves with some PyTorch basics. We will perform a few common operations and try out some of the functions we will use later.

2. **Basic GPT-2 Usage**

   Next, we will download GPT-2 from Hugging Face. It is a small, older decoder-only model that OpenAI released in 2019. We will do a few exercises that help us understand the model.


For reference, please consult the lecture material for Weeks 2 & 3 as well as the following notebooks:

* Intro to PyTorch I (Basics)
* Intro to PyTorch II (Hugging Face & Language Models)
* Lesson Notebook for Week 2



**INSTRUCTIONS:**

* Before submitting, please run this entire notebook end-to-end using Colab Pro or the free Colab version. Then download the executed notebook and submit that one. This is **very important** because we set a random seed, which makes the results of runs more deterministic and enables more rapid grading.
  
* Questions are always indicated as **QUESTION**, so you can search for this string to make sure you answered all of the questions. You are expected to fill out, run, and submit this notebook. Please do not remove the output from your notebooks when you submit them as we'll look at the output as well as your code for grading purposes.

* \### YOUR CODE HERE indicates that you are supposed to write code.
* \### YOUR ANSWER HERE indicates that you are supposed to write your answer. Please put your answer as either a string or a raw number -- no executable code, for example "3,15,24" or 0.123.

**AUTOGRADER:**

- In each code block, do NOT delete the ### comment at the top of a cell (it's needed for the auto-grading!)
  - Full autograder tests for the first 3 questions are on Gradescope.
  - You may upload and run the autograder as many times as needed in your time window to get full points for those 3 questions.
  - The assignment needs to be named Assignment_1.ipynb to be graded by the autograder!
  - The examples given are samples of how we will test/grade your code.
    - Please ensure your code outputs the exact same information / format!
    - Each autograder test tells you what input it is using.
  - Once complete, the autograder will show each test, whether that test passed or failed, and your total score.
  - An autograder test fails for one of two reasons:
    - Your code crashes with that input (for example: `Test Failed: string index out of range`)
    - Your code output does not match the 'correct' output (for example: `Test Failed: '1 2 3 2 1' != '1 4 6 4 1'`)
- Please format your input and output strings to be user friendly
- Adding comments in your code is strongly suggested but won't be graded.
- If you are stuck on a problem or do not understand a question - please come to office hours or ask questions (please don't post your code though). If it is a coding problem send a private email to your instructor.
- We also have TAs for extra help and 1 on 1 sessions!
- You may use any libraries from the Python Standard Library for this assignment: https://docs.python.org/3/library/



## 0. Setup

Let us first install a few required packages. (You may want to comment this out in case you use a local environment that already has the suitable packages installed.)

Note our use of `%%capture` in the cell below absorbs all of the output when the model(s) are loading.  You can comment it out if you want to see that output.

In [None]:
%%capture

#!pip install torch      # commented out because it is pre-installed in Colab
!pip install torchtext
!pip install transformers
#!pip install numpy      # commented because it is pre-installed in Colab
!pip install portalocker
!pip install pandas

Next, we import the required libraries.

In [None]:
import copy
import random

import torch
import numpy as np
import pandas as pd


from torch.utils.data import Dataset, DataLoader
from torch import nn

from transformers import AutoTokenizer, GPT2Model, GPT2ForSequenceClassification, GPT2LMHeadModel

Let's make sure we will later put data and models on the proper device:

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

This should show 'cpu' if you are using a CPU, or 'cuda' if a GPU is used. (The check above only looks for CUDA; an Apple 'mps' device would need its own check.)
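As a minimal sketch of what "putting data and models on the proper device" means in practice, the snippet below moves a made-up tensor and a small layer to the selected device. The names `x`, `x_on_device`, and `layer` are illustrative and not part of the assignment.

```python
import torch

# Same device selection as in the cell above.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Tensors are created on the CPU by default ...
x = torch.ones(2, 3)
# ... and .to(device) returns a tensor that lives on the target device.
x_on_device = x.to(device)

# Modules are moved the same way; their parameters then live on `device`.
layer = torch.nn.Linear(3, 1).to(device)

# Inputs and parameters must be on the same device for the forward pass to work.
out = layer(x_on_device)
print(x_on_device.device.type, layer.weight.device.type, out.shape)
```

A device mismatch between a model and its inputs is one of the most common runtime errors, so it is worth checking `.device` whenever a forward pass fails unexpectedly.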

Now let's get started!

## 1. PyTorch Basics: Common Operations

Let's get started with some simple operations. For reference you should use the PyTorch documentation at https://pytorch.org/docs/stable/index.html.

Review the PyTorch Intro I notebook. Your goal is to create a simple neural net with two hidden layers that takes a (random) input that we will create and 'classifies' it, imagining a 3-class prediction problem. (We will not train the model, so the purpose is simply to test dimensions, expressions, etc., not to obtain meaningful values. We do, however, want you to execute the cells consecutively in the proper order so that we can compare the final (randomly generated) outcomes. They should always be the same given the manual seed that we set.)

We start by setting the seed, which ensures that the answers are more deterministic.

In [None]:
torch.manual_seed(10)
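To see why the order of execution matters, here is a small sketch (with an arbitrary seed of 0, chosen only for illustration) showing that every random call advances the generator, and that resetting the seed replays the same sequence. If you try this inside the notebook, re-run the seed cell above afterwards.

```python
import torch

torch.manual_seed(0)            # arbitrary seed, used only in this illustration
first = torch.rand(2, 2)
second = torch.rand(2, 2)       # the generator state has advanced, so these values differ

torch.manual_seed(0)            # resetting the seed replays the same sequence
replay = torch.rand(2, 2)

print(torch.equal(first, replay))   # same seed, same position in the sequence
print(torch.equal(first, second))   # same seed, later position in the sequence
```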

### 1.a Tensor Manipulation

Now consider 1.a of the PyTorch I notebook and generate a random input dataset that mimics 4 examples with 6 features each. Consider using `torch.rand()`.

In [None]:
input_dim = 6
n_examples = 4
n_classes = 3
#call your generated input tensor 'input_data'

### YOUR CODE HERE

### END YOUR CODE

input_data

**QUESTION:**

1.a. What is the value of `input_data[0,0]`? (Make sure you have just re-run the manual seed cell.)

In [None]:
### Q1-a Grading Tag: Please put your answer in this cell. Don't edit this line.

### YOUR ANSWER HERE

### END YOUR ANSWER

Let's now do a few simple exercises. Using torch.argmax (https://pytorch.org/docs/stable/generated/torch.argmax.html), find the index of the maximum element for each row and each column.
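The `dim` argument decides which dimension is reduced. A short sketch on a made-up 2x3 tensor (`toy` is an illustrative name, not the assignment data):

```python
import torch

toy = torch.tensor([[1.0, 5.0, 2.0],
                    [7.0, 0.0, 3.0]])

# dim=1 reduces over the columns, giving one index per row.
per_row = torch.argmax(toy, dim=1)
# dim=0 reduces over the rows, giving one index per column.
per_col = torch.argmax(toy, dim=0)

print(per_row)  # tensor([1, 0])
print(per_col)  # tensor([1, 0, 1])
```

Without a `dim` argument, `torch.argmax` returns a single index into the flattened tensor, which is usually not what is wanted here.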

In [None]:
# call your indices row_ind_max_arg and col_ind_max_arg

### YOUR CODE HERE

### END YOUR CODE
print('Index of argmax for each row', row_ind_max_arg)
print('Index of argmax for each column', col_ind_max_arg)

**QUESTION:**

1.b. What are the indices of the elements with the largest value in each row? Copy the list of indices to the answer cell and represent them as a list e.g. [55, 77, 99, 11].   


In [None]:
### Q1-b Grading Tag: Please put your answer in this cell. Don't edit this line.

### YOUR ANSWER HERE

### END YOUR ANSWER

1.c. What are the indices of the elements with the largest value in each column? Again, copy the list of indices to the answer cell and represent them as a list e.g. [55, 77, 99, 11].

In [None]:
### Q1-c Grading Tag: Please put your answer in this cell. Don't edit this line.

### YOUR ANSWER HERE

### END YOUR ANSWER

You can index into a tensor just as you would into a NumPy array. For example, the values of the last column (i.e., fixed second dimension) can be obtained through:

In [None]:
print(input_data[:, -1])
print(input_data[:, 5])


Similarly, get the values of the last row (first index 'last', second index unconstrained):

In [None]:
# call your values for the last row last_row

### YOUR CODE HERE



### END YOUR CODE

print('Values of last row: ', last_row)


**QUESTION:**

1.d. Copy the tensor of the last row into the answers.

In [None]:
### Q1-d Grading Tag: Please put your answer in this cell. Don't edit this line.

### YOUR ANSWER HERE

### END YOUR ANSWER

Next, reshape input_data into a 2x12 tensor using \<tensor>.reshape
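As a quick illustration on a made-up tensor (not the assignment data), `reshape` keeps the elements in row-major order and only changes how they are grouped. Passing -1 for one dimension lets PyTorch infer its size.

```python
import torch

toy = torch.arange(6)            # tensor([0, 1, 2, 3, 4, 5])

as_2x3 = toy.reshape(2, 3)       # two rows of three elements
as_3x2 = toy.reshape(3, -1)      # -1 is inferred as 2, since 6 / 3 = 2

print(as_2x3.shape, as_3x2.shape)
print(as_3x2)
```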

In [None]:
# call your reshaped tensor reshaped_input_data

### YOUR CODE HERE



### END YOUR CODE

print('Reshaped data shape: ', reshaped_input_data.shape)

**QUESTION:**

1.e. Write the shape of the reshaped tensor as a tuple.

In [None]:
### Q1-e Grading Tag: Please put your answer in this cell. Don't edit this line.

### YOUR ANSWER HERE

### END YOUR ANSWER

### 1.b The Simple Classification Network

Now construct the network. Fill in your code for the `__init__` and `forward` methods. Again, we want two hidden layers (dims: hidden_dim_1, hidden_dim_2) with ReLU activation functions, and an output layer of dimension n_classes (with a softmax activation function). The model should return both the probabilities (probs) and the logits (logits), as you can tell from the return statement.
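For orientation, here is a sketch of the general `nn.Module` pattern on a deliberately smaller, hypothetical network with a single hidden layer. `TinyClassifier` and its dimensions are made up for illustration; the assignment network differs in depth and sizes. Creating layers consumes random numbers, so re-run the seed cell if you execute this inside the notebook.

```python
import torch
from torch import nn


class TinyClassifier(nn.Module):  # hypothetical example, not the assignment network
    def __init__(self, input_dim, hidden_dim, n_classes):
        super().__init__()
        # Layers are created once in __init__ so that their parameters are registered.
        self.hidden = nn.Linear(input_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, n_classes)

    def forward(self, x):
        h = torch.relu(self.hidden(x))
        logits = self.out(h)                     # unnormalized scores
        probs = torch.softmax(logits, dim=-1)    # normalize over the class dimension
        return probs, logits


toy_model = TinyClassifier(input_dim=5, hidden_dim=4, n_classes=2)
toy_probs, toy_logits = toy_model(torch.ones(3, 5))
print(toy_probs.shape, toy_probs.sum(dim=-1))
```

Returning the logits alongside the probabilities is convenient because loss functions such as `CrossEntropyLoss` expect logits, not probabilities.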

In [None]:
class SimpleClassificationNertwork(nn.Module):
    def __init__(self, input_dim, hidden_dim_1, hidden_dim_2, n_classes):
        super().__init__()
        ### YOUR CODE HERE


        ### END YOUR CODE

    def forward(self, x):
        ### YOUR CODE HERE


        ### END YOUR CODE
        return probs, logits

In [None]:
mySimpleClassificationNertwork = SimpleClassificationNertwork(input_dim=input_dim,
                                                              hidden_dim_1=7,
                                                              hidden_dim_2=10,
                                                              n_classes=n_classes)

In [None]:
probs,logits = mySimpleClassificationNertwork(input_data)

print('Probabilities:\n\t', probs)
print('\nLogits:\n\t', logits)

**NOTE: Once everything works please rerun the cells starting with setting the manual seed up to this cell to make sure that the numbers (if everything is correct) can be compared to the solutions!**

**QUESTION:**

1.f. Copy the tensor for the probabilities into the answer file.  

In [None]:
### Q1-f Grading Tag: Please put your answer in this cell. Don't edit this line.

### YOUR ANSWER HERE

### END YOUR ANSWER

1.g. Copy the tensor for the logits into the answers file.

In [None]:
### Q1-g Grading Tag: Please put your answer in this cell. Don't edit this line.

### YOUR ANSWER HERE

### END YOUR ANSWER

Great. Next, do a calculation that *manually* verifies that the softmax calculation is correct. Specifically, recalculate the probability of the first class for the first example using np.exp(). (If you would rather work with the expressions probs and logits than copy the numbers above, use \<tensor\>.detach().numpy() to convert them to NumPy arrays.)
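The same check can be sketched on made-up logits (the values and the names `toy_logits`, `manual_p0` are illustrative only): exponentiate each logit and divide by the sum of the exponentials.

```python
import numpy as np
import torch

toy_logits = torch.tensor([[2.0, 1.0, 0.0]])      # one example, three classes (made-up values)
toy_probs = torch.softmax(toy_logits, dim=-1)

# Convert to NumPy and apply the softmax definition by hand for class 0.
z = toy_logits.detach().numpy()[0]
manual_p0 = np.exp(z[0]) / np.exp(z).sum()

print(manual_p0, toy_probs[0, 0].item())
```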

In [None]:
# call your probability of the first class for the first example p_1_1

### YOUR CODE HERE


### END YOUR CODE
p_1_1

**QUESTION:**

1.h. Copy the value of `p_1_1` into the answers file. (Note that the first class has index 0.)

In [None]:
### Q1-h Grading Tag: Please put your answer in this cell. Don't edit this line.

### YOUR ANSWER HERE

### END YOUR ANSWER

Great. Now imagine that the correct classes are 0, 1, 2, 0 for the four examples. What is the average loss? To find out, we first define the loss function and then calculate the loss. Note that the inputs to CrossEntropyLoss() are i) the unnormalized logits, and ii) either the class probabilities or the actual class indices (the better choice here, as each example belongs to exactly one class).

In [None]:
loss_fn = torch.nn.CrossEntropyLoss()

loss = loss_fn(logits, torch.tensor([0, 1, 2, 0]))
loss

Now verify that this loss agrees with the manual calculation. Recall from Week 2 that

$$ CE \rightarrow -\frac{1}{N}\sum_{k=1}^{N} \log\left(q^k_{\text{correct class}}\right) $$

where $k$ indexes the examples, $N$ is the number of examples, and $q^k_{\text{correct class}}$ is the model probability for the correct class of example $k$.
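A sketch of this formula on made-up logits and targets (two examples, three classes; all names and values are illustrative) shows that it agrees with the built-in loss:

```python
import torch

toy_logits = torch.tensor([[2.0, 1.0, 0.0],
                           [0.0, 3.0, 1.0]])
toy_targets = torch.tensor([0, 1])               # correct class of each example

# Built-in loss: takes logits and class indices, averages over the examples.
builtin = torch.nn.CrossEntropyLoss()(toy_logits, toy_targets)

# Manual version: softmax, pick the probability of the correct class, average -log.
toy_probs = torch.softmax(toy_logits, dim=-1)
q_correct = toy_probs[torch.arange(len(toy_targets)), toy_targets]
manual = -torch.log(q_correct).mean()

print(builtin.item(), manual.item())
```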

In [None]:
# call your manual loss calculation manual_loss

### YOUR CODE HERE


### END YOUR CODE
manual_loss

**QUESTION:**

1.i. Write out the complete Cross-Entropy loss calculation as a single mathematical expression, substituting the values (the floating point numbers) you arrived at in your earlier work on this assignment.  Your answer should be a single line showing the entire calculation with all necessary values inserted.


Note: Do not perform any calculations or simplify the expression. Simply write out the formula with the appropriate numeric values inserted.


In [None]:
### Q1-i Grading Tag: Please put your answer in this cell. Don't edit this line.

### YOUR ANSWER HERE

### END YOUR ANSWER

Great. Now we are ready to move to Language Models.

## 2. Basic GPT-2 Usage

We now download GPT-2 from Hugging Face. We will get the tokenizer and the model, and make sure that the model is on the proper device.

In [None]:
%%capture

gpt_2_tokenizer = AutoTokenizer.from_pretrained("gpt2")
#gpt_2_tokenizer.add_special_tokens({'pad_token': '[PAD]'})

gpt2_model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)

We can simply apply the tokenizer to a sentence to see how words are converted into token IDs. (These IDs are the model inputs; the model's first step is to convert them into embedding vectors.) The tokenizer is model-specific, and tokenizers have their own special considerations and quirks, so you should always look at how a specific tokenizer works: consult the Hugging Face docs and try some examples.

In [None]:
gpt_2_tokenizer("What a nice day", return_tensors='pt')

Note the difference between the encoding of a word at the very beginning of a sentence and the encoding of the same word later in the sentence. (The attention masks become important if you have examples of varying length and padding tokens are used to make sure that model inputs have the same length. The return_tensors option returns the tokenization in a format that is suitable as model input. In what follows we omit it, as we don't need it here.)

Below, consider the token ID for 'I' in the three tokenizations:

In [None]:
print(gpt_2_tokenizer("I am")['input_ids'])
print(gpt_2_tokenizer("am I")['input_ids'])
print(gpt_2_tokenizer(" I")['input_ids'])

Decoding (i.e., turning input IDs back into the corresponding string) is done with \<tokenizer>.decode():

In [None]:
gpt_2_tokenizer.decode([314])

Please tokenize the longest word that Shakespeare used: 'honorificabilitudinitatibus' (when not at the beginning of a sentence), and find the first token (not the id, but the corresponding token string):

In [None]:
# Name your tokenization tokenized_long_word, the index of the first token first_index, and the first token first_token

### YOUR CODE HERE


### END YOUR CODE

print('Tokenized long word: ', tokenized_long_word)
print('Length of tokenized long word: ', len(tokenized_long_word))
print('First index: ', first_index)
print('First token: ', first_token)

**QUESTION:**

2.a. Into how many tokens is the word honorificabilitudinitatibus split when not in the beginning of the sentence? (You can either create a sentence with this word where it is not in the beginning, or you need to make sure that there is a space in front of the word.)

In [None]:
### Q2-a Grading Tag: Please put your answer in this cell. Don't edit this line.

### YOUR ANSWER HERE

### END YOUR ANSWER

2.b. What is the first token of the tokenization?

In [None]:
### Q2-b Grading Tag: Please put your answer in this cell. Don't edit this line.

### YOUR ANSWER HERE

### END YOUR ANSWER

Now redo the same, but imagine the word 'honorificabilitudinitatibus' at the very start of a sentence/doc (never mind the capitalization) as in
'honorificabilitudinitatibus is a state I am in'.

In [None]:
# Now name your tokenization beg_tokenized_long_word, the index of the first token beg_first_index, and the first token beg_first_token

### YOUR CODE HERE

### END YOUR CODE

print('Tokenized long word: ', beg_tokenized_long_word)
print('First index: ', beg_first_index)
print('First token string: ', beg_first_token)

**QUESTION:**

2.c. Into how many tokens is the word honorificabilitudinitatibus now split (when **at** the beginning of the sentence)?

In [None]:
### Q2-c Grading Tag: Please put your answer in this cell. Don't edit this line.

### YOUR ANSWER HERE

### END YOUR ANSWER

2.d. What is now the first token string of the tokenization?

In [None]:
### Q2-d Grading Tag: Please put your answer in this cell. Don't edit this line.

### YOUR ANSWER HERE

### END YOUR ANSWER

Now apply the gpt2_model to a sentence in order to predict the most likely next word after 'The movement started in Italy. From there it went to France and Switzerland.  Soon it spread throughout'. You may want to consult the PyTorch Introduction II Notebook and the Week 2 lesson notebook.

Please get i) the shape of the output, ii) the logits of the last token, iii) the index of the token with the largest logit, and iv) the token that belongs to it. Do the same for the token with the second-largest logit.
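The indexing involved can be sketched on a stand-in tensor. With the Hugging Face model the logits are available as the `.logits` attribute of the model output. Here `fake_logits` is a made-up tensor with a batch of 1, 4 positions, and a toy vocabulary of 6, so no model download is needed.

```python
import torch

# Stand-in for model logits with shape (batch, sequence length, vocabulary size).
fake_logits = torch.tensor([[[0.1, 0.2, 0.3, 0.0, 0.5, 0.4],
                             [1.0, 0.0, 0.2, 0.3, 0.1, 0.9],
                             [0.3, 0.8, 0.1, 0.0, 0.2, 0.5],
                             [0.2, 2.5, 0.1, 3.0, 0.4, 1.0]]])

# The prediction for the next token is read off at the last position.
last = fake_logits[0, -1, :]

# torch.topk returns the k largest values and their indices, sorted in descending order.
top_values, top_indices = torch.topk(last, k=2)

print(fake_logits.shape)
print(top_indices)   # tensor([3, 1])
print(top_values)    # tensor([3.0000, 2.5000])
```

With a real tokenizer, each index can then be turned back into its token string with the tokenizer's `decode` method.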

In [None]:
text = "The movement started in Italy. From there it went to France and Switzerland.  Soon it spread throughout"
# Move the tokenized input to the same device as the model
input = gpt_2_tokenizer(text, return_tensors='pt').to(device)

gpt_out = gpt2_model(**input)

# Please call your model output shape output_shape, the logits for the last position output_logits,
# the index of the token with the largest logit max_logit_index, the corresponding token max_logit_token,
# and the logit itself max_logit.
# Name the corresponding values for the second largest logit second_logit_index, second_logit_token, and second_logit.

### YOUR CODE HERE



### END YOUR CODE

print('Output shape: ', output_shape)
print('Logits of output for last token: ', output_logits)
print('Index of token with largest logit: ', max_logit_index)
print('Token with largest logit: ', max_logit_token)
print('Logit of token with largest logit: ', max_logit)
print('Index of token with second largest logit: ', second_logit_index)
print('Token with second largest logit: ', second_logit_token)
print('Logit of token with second largest logit: ', second_logit)


**QUESTION:**

2.e. What is the shape of the output?


In [None]:
### Q2-e Grading Tag: Please put your answer in this cell. Don't edit this line.

### YOUR ANSWER HERE

### END YOUR ANSWER

2.f. What do the three numbers in the shape refer to?


In [None]:
### Q2-f Grading Tag: Please put your answer in this cell. Don't edit this line.

### YOUR ANSWER HERE

### END YOUR ANSWER

2.g. What is the index of the token with the largest logit?

In [None]:
### Q2-g Grading Tag: Please put your answer in this cell. Don't edit this line.

### YOUR ANSWER HERE

### END YOUR ANSWER

2.h. What is the token string associated with the largest logit?

In [None]:
### Q2-h Grading Tag: Please put your answer in this cell. Don't edit this line.

### YOUR ANSWER HERE

### END YOUR ANSWER

2.i. What is the second most likely token id?

In [None]:
### Q2-i Grading Tag: Please put your answer in this cell. Don't edit this line.

### YOUR ANSWER HERE

### END YOUR ANSWER

2.j. What is the second most likely token string?

In [None]:
### Q2-j Grading Tag: Please put your answer in this cell. Don't edit this line.

### YOUR ANSWER HERE

### END YOUR ANSWER

Now we will translate the logits into relative token probabilities depending on the chosen temperature. Use NumPy or PyTorch calculations. (But remember to use .detach() etc. if you want to use NumPy.)
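As a sketch with made-up logits (the names `softmax`, `prob_ratio`, and `toy_logits` are illustrative helpers, not part of the assignment): dividing the logits by the temperature T before the softmax means the ratio of two probabilities is exp((a - b) / T), because the normalizing sum cancels.

```python
import numpy as np


def softmax(z):
    # Subtract the maximum before exponentiating for numerical stability.
    e = np.exp(z - np.max(z))
    return e / e.sum()


def prob_ratio(logit_a, logit_b, temperature):
    # p_a / p_b = exp(a / T) / exp(b / T) = exp((a - b) / T)
    return np.exp((logit_a - logit_b) / temperature)


toy_logits = np.array([5.0, 3.0, 1.0])   # made-up values

for temperature in (1.0, 10.0, 0.1):
    probs = softmax(toy_logits / temperature)
    # Both columns should agree: ratio from the full softmax vs. the closed form.
    print(temperature, probs[0] / probs[1], prob_ratio(toy_logits[0], toy_logits[1], temperature))
```

A high temperature pushes the ratio toward 1 (a flatter distribution), while a low temperature makes the most likely token dominate.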

In [None]:
T_1 = 1.
T_2 = 10.
T_3 = 0.1

# Please call your relative probabilities between the most likely token and the second most likely token p_t1, p_t2, p_t3, depending
# on the temperature

### YOUR CODE HERE




### END YOUR CODE

print('Probability ratio for T1: ', p_t1)
print('Probability ratio for T2: ', p_t2)
print('Probability ratio for T3: ', p_t3)


**QUESTION:**

2.k. What is the ratio of probabilities between the most likely token and the second most likely token if T=1?

In [None]:
### Q2-k Grading Tag: Please put your answer in this cell. Don't edit this line.

### YOUR ANSWER HERE

### END YOUR ANSWER

2.l. What is the ratio of probabilities between the most likely token and the second most likely token if T=10?

In [None]:
### Q2-l Grading Tag: Please put your answer in this cell. Don't edit this line.

### YOUR ANSWER HERE

### END YOUR ANSWER

2.m. What is the ratio of probabilities between the most likely token and the second most likely token if T=0.1? (Hint: to avoid a NaN you may want to use a simple mathematical identity to deal with the low temperature: $e^a/e^b = e^{(a-b)}$)

In [None]:
### Q2-m Grading Tag: Please put your answer in this cell. Don't edit this line.

### YOUR ANSWER HERE

### END YOUR ANSWER

And that is it! Congratulations!