<a href="https://colab.research.google.com/github/katrina906/CS6120-Summarization-Project/blob/main/BeamSearch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text generation with pre-trained Transformers
In this assignment we will work with Pre-trained Transformers such as GPT2 for generating text from a given sequence. Transformers aim to address the long term dependency issue in sequence-to-seuqence prediction by using concepts such as self-attention and positional encoding. GPT2 is a langauge model, pretrained on text generation, that can be used as a multi-task learner for tasks such as summarization, question-answering, and other generation tasks. This assignment's focus is on using GPT2 to generate text via greedy decoding and beam search. For more background on beam search, see [Jurafsky and Martin, chapter 11](https://web.stanford.edu/~jurafsky/slp3/11.pdf).

In [None]:
!pip install transformers

In [1]:
import copy
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import numpy as np

In [2]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

In [3]:
sentences = ['I like walking and', 
             'Martha wanted to read a book that',
             'Thomas is studying computer science to',
             'Their friendship inspired',
             'We should take the trash out since',
             'I am not a fan of coffee because',
             'I could not complete my homework by the deadline because',
             'The last semester was much easier due to',
             'I will be painting the walls white so that'
            ]

We apply greedy decoding to get predictions for each sentence here. This function returns the text output of greedy decoding. Modify it to return a tuple (ordered pair) of text and average log-likelihood per word for each sentence.

__Interpretation:__ I interpreted the prompt as the average log-likelihood per _predicted_ word for each sentence. I did not include probabilities for the given words in the input sentences

In [4]:
# Output: list of (predicted sentence, average log likelhiood of predicted words)

def greedy_decode(sentences, max_length, tokenizer):
  # Obtain loss from output and calculate
  # log likelihood for each sentence
  texts = []
  for sentence in sentences:
    word_probs = 0 # sum of log likelihood for each word
    cnt = 0
    predicted_sentence = copy.copy(sentence)
    # Predict a word each itertation until the max length
    for i in range(max_length):
      indexed_tokens = tokenizer.encode(predicted_sentence)
      token_tensors = torch.tensor([indexed_tokens])

      with torch.no_grad():
        output = model(token_tensors, labels=token_tensors)
        predictions = output[1]

      # add to sum of log likelihood of next predicted word
      word_probs += torch.max(predictions[0, -1, :])
      cnt += 1
      # get highest probability next predicted word
      predicted_index = torch.argmax(predictions[0, -1, :]).item()
      predicted_sentence = tokenizer.decode(indexed_tokens + [predicted_index])

    # save predicted sentence and average log likelihood of predicted words for each sentence
    texts.append((predicted_sentence, word_probs / cnt))

  return texts

In [None]:
texts = greedy_decode(sentences, max_length=25, tokenizer=tokenizer)
texts

[("I like walking and biking, but I don't like being in a car. I like to be in a car. I like to be in",
  tensor(-126.9903)),
 ('Martha wanted to read a book that she had read in college. She was a little nervous about it, but she was excited about it. She was a little',
  tensor(-121.1514)),
 ('Thomas is studying computer science to become a professor of computer science at the University of California, Berkeley. He is also a member of the Board of Trustees',
  tensor(-87.4711)),
 ("Their friendship inspired him to become a writer and a writer's assistant. He also wrote a book about his life and career.\n\n\n",
  tensor(-109.5394)),
 ('We should take the trash out since it\'s not going to be a problem," he said.\n\n\n"We\'re going to have to do something about',
  tensor(-102.8561)),
 ('I am not a fan of coffee because it is not good for you. I am a fan of coffee because it is not good for you.\n\n\nI',
  tensor(-91.6273)),
 ('I could not complete my homework by the deadline because I

You'll be implementing **beam search**, which returns a list of the $k$ most likely output sequences for each sentence. For this assignment, let $k = 8$. For the first token in the generated text, you will select the top $k$ output tokens. Then, for the next token, find the $k$-best continuations for each of those $k$ hypotheses and select the $k$-best overall. Return the $k$-best overall hypotheses according to average log likelihood per word. Note that if we don't average per word, the decoder will simply prefer shorter outputs. As above, return tuples of text output and average log likelihood.

In [55]:
# TODO: only choosing among ones with max length? Then why does average per word matter? 
  # average per word at each step?
# or choosing among all sequences regardless of if reached max length? 

# TODO: what about end of sequence token?

# input: 
  # sentences_list = list of (predicted sentence, sum of log likelihood predicted words in sentence)
  # length = current length
  # max length = maximum length of generated sentence
  # k_num = beam length k 
  # tokenizer 

# output: list of all (predicted sentences of length max length, average log likelihood of predicted words)

def recursion(sentences_list, length, max_length, k_num, tokenizer):

  # end recursion if reach max length
  if length == max_length:
    # calculate average log likelihood per word
    sentences_list = [(i[0], i[1] / max_length) for i in sentences_list]  ## TODO does this need to be average?
    # output final sequences and average log likelihoods
    return sentences_list

  else:
    kseq = []
    for input in sentences_list:
      s = input[0] # sentence
      p = input[1] # sum log likelihood probability
      # encode sentence
      indexed_tokens = tokenizer.encode(s)
      token_tensors = torch.tensor([indexed_tokens])
      # predict
      with torch.no_grad():
        output = model(token_tensors, labels=token_tensors)
        predictions = output[1]
      # generate k best following indexes and get log likelihoods
      predicted_indexes = torch.argsort(predictions[0, -1, :], descending = True)[0:k_num]
      predicted_probs = torch.sort(predictions[0, -1, :], descending = True)[0][0:k_num]
      # decode each index and keep track of in list + add to probability sum
      for i in range(len(predicted_indexes)):
        predicted_sentence = tokenizer.decode(indexed_tokens + [predicted_indexes[i]])
        kseq.append((predicted_sentence, p + predicted_probs[i]))

    # choose k best at this length for the frontier
    probs = [i[1].item() for i in kseq] ### TODO do probs need to be average?
    best_indexes = np.argsort(probs)[::-1][0:k_num]
    kseq = [kseq[i] for i in best_indexes]

    return recursion(kseq, length + 1, max_length, k_num, tokenizer)

In [54]:
# output: list of lists. One outer list per input sentence. Inner list: k best (predicted sentence, average log likelhiood of predicted words)

def beam_search(sentences, max_length, tokenizer, k_num=8):

  texts = []

  for sentence in sentences:

    # recursively generate k best sequences 
    predicted_sentence = copy.copy(sentence)
    predicted_sequences = recursion([[predicted_sentence, 0]], 0, max_length, k_num, tokenizer)

    texts.append(predicted_sequences)

  return texts

In [56]:
beam_search(sentences, 25, tokenizer, 8)

[[('I like walking and talking," Mr. Johnson added.<|endoftext|>The U. S.\'S.–Iran Free-Trade Agreement (USFATA)',
   tensor(-34.5525)),
  ('I like walking and talking," Mr. Johnson added.<|endoftext|>The U. S.\'S.–Iran Free-Trade Agreement (USFATA),',
   tensor(-34.6284)),
  ('I like walking and talking," Mr. Johnson added.<|endoftext|>The U. S.\'S.–Iran Free-Trade Agreement (USFATA).',
   tensor(-34.7347)),
  ('I like walking and talking," Mr. Johnson added.<|endoftext|>The U. S.\'S.–Iran Free-Trade Agreement (USFATA,',
   tensor(-34.7452)),
  ('I like walking and talking," Mr. Johnson added.<|endoftext|>The U. S.\'S.–Iran Free-Trade Agreement (USFATA)—',
   tensor(-34.7535)),
  ('I like walking and talking," Mr. Johnson added.<|endoftext|>The U. S.\'S.–Iran Free-Trade Agreement (USFATA)-',
   tensor(-34.7779)),
  ('I like walking and talking," Mr. Johnson added.<|endoftext|>The U. S.\'S.–Iran Free-Trade Agreement (USFATA)(',
   tensor(-34.7782)),
  ('I like walking and talking," Mr.

In [None]:
# TODO check against model.generate() method - not the same
# TODO observations 

**TODO:** Record your observations here

For further exploration, you can experiment with $k$ to see how the fluency of text changes.