<a href="https://colab.research.google.com/github/heesukjang/W266_NLP_With_DeepLearning/blob/main/lesson_5_Text_Generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lesson Notebook 5: Text Generation

In this notebook we will look at 3 different examples:

1. Building a Seq2Seq model for machine translation using RNNs with and without Attention

2. Playing with T5 for summarization and translation

3. Exercise with prompts and language generation using the OPT model

The sequence to sequence architecture is inspired by the Keras Tutorial https://keras.io/examples/nlp/lstm_seq2seq/.


<a id = 'returnToTop'></a>

## Notebook Contents
  * 1. [Setup](#setup)
  * 2. [Seq2Seq Model](#encoderDecoder)
      * 2.1 [Data Acquisition](#dataAcquisition)
      * 2.2 [Seq2Seq without Attention](#s2sNoAttention)
      * 2.3 [Seq2Seq with Attention](#s2sAttention)
  * 3. [T5](#t5Example)
    * 3.1 [Tokenization](#tokenization)
    * 3.2 [Model Structure & Output](#modelOutput)
  * 4. [Prompt Engineering and Generative Large Language Models](#prompts)
    * 4.1 [Cloze Prompts](#clozeExample)
    * 4.2 [Prefix Prompts](#prefixExample)
    * 4.3 [Class Exercise](#classExercise)
  * 5. [Answers](#answers)      




  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/datasci-w266/2023-spring-main/blob/master/materials/lesson_notebooks/lesson_5_Text_Generation.ipynb)

[Return to Top](#returnToTop)  
<a id = 'setup'></a>

## 1. Setup

We first need to do the usual setup. We will also use some nltk and sklearn components in order to tokenize the text.

This notebook requires the tensorflow dataset and other prerequisites that you must download and then store locally. This can also be done on Colab.

In [74]:
#@title Installs

!pip install pydot --quiet
!pip install transformers --quiet
!pip install sentencepiece --quiet
!pip install nltk --quiet

In [75]:
#@title Imports

import numpy as np

import tensorflow as tf
from tensorflow import keras

import tensorflow_datasets as tfds

import sklearn as sk
from sklearn.feature_extraction.text import CountVectorizer

import os
import nltk

import matplotlib.pyplot as plt

import re
import textwrap

from transformers import T5Tokenizer, TFT5Model, TFT5ForConditionalGeneration
from transformers import GPT2Tokenizer, TFOPTForCausalLM

In [76]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

[Return to Top](#returnToTop)  
<a id = 'encoderDecoder'></a>


## 2. Building a Seq2Seq model for Translation using RNNs with and without Attention

### 2.1 Downloading and pre-processing Data


Let's get the data. Just like the Keras tutorial, we will use http://www.manythings.org as the source for the parallel corpus, but we will use German. Machine translation requires sentence pairs for training, that is, individual sentences in English and the corresponding sentences in German.
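To make the sentence-pair format concrete, here is a minimal sketch of how one line of the corpus file splits into an English and a German sentence. The sample line is copied from the file contents shown further down; the helper name `parse_pair` is hypothetical and only used for illustration.

```python
# One raw line from deu.txt: English, German, and attribution, separated by tabs.
raw_line = "Go.\tGeh.\tCC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #8597805 (Roujin)"

def parse_pair(line):
    """Return (english, german) from one tab-separated line, dropping the attribution."""
    english, german, _attribution = line.split("\t")
    return english, german

english_sentence, german_sentence = parse_pair(raw_line)
print(english_sentence, "->", german_sentence)
```

The third field only carries licensing information, so it is ignored when building the training pairs.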

In [77]:
!!curl -O http://www.manythings.org/anki/deu-eng.zip       # use "cURL" to transfer data to and from a server. 
!!unzip deu-eng.zip

['Archive:  deu-eng.zip',
 'replace deu.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: A',
 '  inflating: deu.txt                 ',
 '  inflating: _about.txt              ']

Next, we need to set a few parameters.  Note these numbers are much smaller than we would set in a real world system.  For example, vocabulary sizes of 2000 and 3000 are unrealistic unless we were dealing with a highly specialized domain.

In [78]:
embed_dim = 100  # Embedding dimensions for vectors and LSTMs.
num_samples = 10000  # Number of examples to consider.

# Path to the data txt file on disk.
data_path = "deu.txt"

# Vocabulary sizes that we'll use:
english_vocab_size = 2000
german_vocab_size = 3000

Next, we need to format the input. In particular we would like to use nltk to help with the tokenization. We will then use sklearn's CountVectorizer to create a vocabulary from the most frequent words in each language.

(Before, we used pre-trained word embeddings from Word2Vec that came with a defined vocabulary. This time, we'll start from scratch, and need to extract the vocabulary from the training text.)

[CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)<br>
Convert a collection of text documents to a matrix of token counts.
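As a rough illustration of what `CountVectorizer(max_features=...)` does when it is only used to build a vocabulary, the sketch below reimplements the idea in plain Python. It assumes the default settings (lowercasing and a token pattern that keeps only tokens of two or more word characters, so punctuation and single letters are dropped). The function `build_vocab` and the toy sentences are hypothetical, and sklearn may break frequency ties differently.

```python
import re
from collections import Counter

def build_vocab(texts, max_features):
    """Rough stand-in for CountVectorizer(max_features=...).fit(texts)."""
    token_pattern = re.compile(r"(?u)\b\w\w+\b")  # default pattern: 2+ word characters
    counts = Counter()
    for text in texts:
        counts.update(token_pattern.findall(text.lower()))
    most_frequent = [token for token, _ in counts.most_common(max_features)]
    return sorted(most_frequent)  # feature names come back in sorted order

toy_texts = ["go .", "go !", "i run !", "run .", "hi ."]
toy_vocab = build_vocab(toy_texts, max_features=2)
print(toy_vocab)
```

Note that "." and "!" never enter the vocabulary under these defaults, which is why punctuation later maps to the unknown token.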

In [79]:
input_texts = []
target_texts = []

max_input_length = -1
max_output_length = -1


with open(data_path, "r", encoding="utf-8") as f:
    lines = f.read().split("\n")
print('len(lines): ',len(lines))
lines[:5]

len(lines):  255818


['Go.\tGeh.\tCC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #8597805 (Roujin)',
 'Hi.\tHallo!\tCC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #380701 (cburgmer)',
 'Hi.\tGrüß Gott!\tCC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #659813 (Esperantostern)',
 'Run!\tLauf!\tCC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #941078 (Fingerhut)',
 'Run.\tLauf!\tCC-BY 2.0 (France) Attribution: tatoeba.org #4008918 (JSakuragi) & #941078 (Fingerhut)']

In [80]:
print(f'num_samples: {num_samples} | len(lines) - 1: {len(lines) - 1}')
print(min(num_samples, len(lines) - 1))

num_samples: 10000 | len(lines) - 1: 255817
10000


In [81]:
# ===================================
# ======= REVERSE ENGINEERING =======
# ===================================

input_texts = []
target_texts = []

max_input_length = -1
max_output_length = -1


with open(data_path, "r", encoding="utf-8") as f:
    lines = f.read().split("\n")

# for line in lines[: min(num_samples, len(lines) - 1)]:
for line in lines[:2]:
    input_text, target_text, _ = line.split("\t")
    print(f'input_text: {input_text}, target_text: {target_text}, _: {_}')

    tokenized_source_text = nltk.word_tokenize(input_text, language='english')
    tokenized_target_text = nltk.word_tokenize(target_text, language='german')
    print(f'tokenized_source_text: {tokenized_source_text}, tokenized_target_text: {tokenized_target_text}')

    if len(tokenized_source_text) > max_input_length:
      print(f'len(tokenized_source_text): {len(tokenized_source_text)}')
      max_input_length = len(tokenized_source_text)
      print(f'max_input_length: {max_input_length}')

    if len(tokenized_target_text) > max_output_length:
      print(f'len(tokenized_target_text): {len(tokenized_target_text)}')
      max_output_length = len(tokenized_target_text)
      print(f'max_output_length: {max_output_length}')

    source_text = (' '.join(tokenized_source_text)).lower()
    target_text = (' '.join(tokenized_target_text)).lower()
    print(f'source_text: {source_text}, target_text: {target_text}')

    input_texts.append(source_text)
    target_texts.append(target_text)
    print(f'len(input_texts): {len(input_texts)}, input_texts: {input_texts}, target_texts: {target_texts}\n')

vectorizer_english = CountVectorizer(max_features=english_vocab_size)   # Use CountVectorizer to create a vocabulary from the most frequent words in each language.
print(f'vectorizer_english_CountVectorizer: {vectorizer_english}')
vectorizer_english.fit(input_texts)                                     # learn the vocabulary from input_texts
vocab_english = vectorizer_english.get_feature_names_out()
print(f'vocab_english.get_feature_names_out(): {vocab_english}')

vectorizer_german = CountVectorizer(max_features=german_vocab_size)
print(f'vectorizer_german_CountVectorizer: {vectorizer_german}')
vectorizer_german.fit(target_texts)
vocab_german = vectorizer_german.get_feature_names_out()
print(f'vocab_german.get_feature_names_out(): {vocab_german}')

print('\nMaximum source input length: ', max_input_length)
print('Maximum target output length: ', max_output_length)

input_text: Go., target_text: Geh., _: CC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #8597805 (Roujin)
tokenized_source_text: ['Go', '.'], tokenized_target_text: ['Geh', '.']
len(tokenized_source_text): 2
max_input_length: 2
len(tokenized_target_text): 2
max_output_length: 2
source_text: go ., target_text: geh .
len(input_texts): 1, input_texts: ['go .'], target_texts: ['geh .']

input_text: Hi., target_text: Hallo!, _: CC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #380701 (cburgmer)
tokenized_source_text: ['Hi', '.'], tokenized_target_text: ['Hallo', '!']
source_text: hi ., target_text: hallo !
len(input_texts): 2, input_texts: ['go .', 'hi .'], target_texts: ['geh .', 'hallo !']

vectorizer_english_CountVectorizer: CountVectorizer(max_features=2000)
vocab_english.get_feature_names_out(): ['go' 'hi']
vectorizer_german_CountVectorizer: CountVectorizer(max_features=3000)
vocab_german.get_feature_names_out(): ['geh' 'hallo']

Maximum source input length:  2
Maximum target output length:  2


In [82]:
# =============================
# ======= ORIGINAL CODE =======
# =============================

input_texts = []
target_texts = []

max_input_length = -1
max_output_length = -1


with open(data_path, "r", encoding="utf-8") as f:
    lines = f.read().split("\n")
for line in lines[: min(num_samples, len(lines) - 1)]:
    input_text, target_text, _ = line.split("\t")

    tokenized_source_text = nltk.word_tokenize(input_text, language='english')
    tokenized_target_text = nltk.word_tokenize(target_text, language='german')

    if len(tokenized_source_text) > max_input_length:
      max_input_length = len(tokenized_source_text)

    if len(tokenized_target_text) > max_output_length:
      max_output_length = len(tokenized_target_text)

    source_text = (' '.join(tokenized_source_text)).lower()
    target_text = (' '.join(tokenized_target_text)).lower()

    input_texts.append(source_text)
    target_texts.append(target_text)

vectorizer_english = CountVectorizer(max_features=english_vocab_size)
vectorizer_english.fit(input_texts)
vocab_english = vectorizer_english.get_feature_names_out()

vectorizer_german = CountVectorizer(max_features=german_vocab_size)
vectorizer_german.fit(target_texts)
vocab_german = vectorizer_german.get_feature_names_out()

print('Maximum source input length: ', max_input_length)
print('Maximum target output length: ', max_output_length)

Maximum source input length:  6
Maximum target output length:  11


In [83]:
input_texts[:5]

['go .', 'hi .', 'hi .', 'run !', 'run .']

In [84]:
target_texts[:5]

['geh .', 'hallo !', 'grüß gott !', 'lauf !', 'lauf !']

Looks simple but correct.

So the source and target sequences have max lengths 6 and 11, respectively. Since we will add start and end tokens (\<start> and \<end>) on the decoder side, we set the respective max lengths to:

In [85]:
max_encoder_seq_length = 6
max_decoder_seq_length = 13 #11 + start + end

Next, we create the dictionaries translating between integer ids and tokens for both source (English) and target (German).



In [86]:
print(f'len(vocab_english): {len(vocab_english)}\n\nvocab_english_full: {vocab_english}\nvocab_english[:5]: {vocab_english[:5]}\nvocab_english[100:105]: {vocab_english[100:105]}\nvocab_english[200:205]: {vocab_english[200:205]}\nvocab_english[-5:]: {vocab_english[-5:]}')

len(vocab_english): 2000

vocab_english_full: ['00' '10' '100' ... 'your' 'yours' 'yourself']
vocab_english[:5]: ['00' '10' '100' '12' '13']
vocab_english[100:105]: ['apart' 'apologize' 'apologized' 'applauded' 'apple']
vocab_english[200:205]: ['bike' 'biked' 'bird' 'birds' 'birthday']
vocab_english[-5:]: ['you' 'young' 'your' 'yours' 'yourself']


In [87]:
# ===================================
# ======= REVERSE ENGINEERING =======
# ===================================

# english_vocab_size = 2000
# german_vocab_size = 3000

# ================= Encoder: Dict for English ===================
source_id_vocab_dict = {}
source_vocab_id_dict = {}

for sid, svocab in enumerate(vocab_english[:3]):   # limit to the length of 3
  source_id_vocab_dict[sid] = svocab
  source_vocab_id_dict[svocab] = sid
  print(f'source_id_vocab_dict: {source_id_vocab_dict[sid]}\nsource_vocab_id_dict: {source_vocab_id_dict[svocab]}\n')
print(f'source_id_vocab_dict: {source_id_vocab_dict}\nsource_vocab_id_dict: {source_vocab_id_dict}\n')

source_id_vocab_dict[english_vocab_size] = "<unk>"       # unknown token: replaces rare words that did not fit in the vocabulary
source_id_vocab_dict[english_vocab_size + 1] = "<pad>"   # padding token: fills the remainder of the max length so that all sequences in a training batch have the same length
print(f'source_id_vocab_dict[english_vocab_size]: {source_id_vocab_dict[english_vocab_size]}\nsource_id_vocab_dict[english_vocab_size + 1]: {source_id_vocab_dict[english_vocab_size + 1]}\n')

source_vocab_id_dict["<unk>"] = english_vocab_size
source_vocab_id_dict["<pad>"] = english_vocab_size + 1
print(f'source_id_vocab_dict: {source_id_vocab_dict}\nsource_vocab_id_dict: {source_vocab_id_dict}')

# ================= Decoder: Dict for German ===================
target_id_vocab_dict = {}
target_vocab_id_dict = {}

# for tid, tvocab in enumerate(vocab_german):
for tid, tvocab in enumerate(vocab_german[:3]):     # limit to the length of 3
  target_id_vocab_dict[tid] = tvocab
  target_vocab_id_dict[tvocab] = tid

# Add unknown token plus start and end tokens to target language.
# We feed the <start> token at the beginning of the decoder input, and the decoder keeps generating
# until it chooses to generate the <end> token.
target_id_vocab_dict[german_vocab_size] = "<unk>"
target_id_vocab_dict[german_vocab_size + 1] = "<start>"
target_id_vocab_dict[german_vocab_size + 2] = "<end>"
target_id_vocab_dict[german_vocab_size + 3] = "<pad>"

target_vocab_id_dict["<unk>"] = german_vocab_size
target_vocab_id_dict["<start>"] = german_vocab_size + 1
target_vocab_id_dict["<end>"] = german_vocab_size + 2
target_vocab_id_dict["<pad>"] = german_vocab_size + 3
print(f'\ntarget_id_vocab_dict: {target_id_vocab_dict}\ntarget_vocab_id_dict: {target_vocab_id_dict}')

source_id_vocab_dict: 00
source_vocab_id_dict: 0

source_id_vocab_dict: 10
source_vocab_id_dict: 1

source_id_vocab_dict: 100
source_vocab_id_dict: 2

source_id_vocab_dict: {0: '00', 1: '10', 2: '100'}
source_vocab_id_dict: {'00': 0, '10': 1, '100': 2}

source_id_vocab_dict[english_vocab_size]: <unk>
source_id_vocab_dict[english_vocab_size + 1]: <pad>

source_id_vocab_dict: {0: '00', 1: '10', 2: '100', 2000: '<unk>', 2001: '<pad>'}
source_vocab_id_dict: {'00': 0, '10': 1, '100': 2, '<unk>': 2000, '<pad>': 2001}

target_id_vocab_dict: {0: '00', 1: '10', 2: '100', 3000: '<unk>', 3001: '<start>', 3002: '<end>', 3003: '<pad>'}
target_vocab_id_dict: {'00': 0, '10': 1, '100': 2, '<unk>': 3000, '<start>': 3001, '<end>': 3002, '<pad>': 3003}


In [88]:
# =============================
# ======= ORIGINAL CODE =======
# =============================

source_id_vocab_dict = {}
source_vocab_id_dict = {}

for sid, svocab in enumerate(vocab_english):
  source_id_vocab_dict[sid] = svocab
  source_vocab_id_dict[svocab] = sid

source_id_vocab_dict[english_vocab_size] = "<unk>"
source_id_vocab_dict[english_vocab_size + 1] = "<pad>"

source_vocab_id_dict["<unk>"] = english_vocab_size
source_vocab_id_dict["<pad>"] = english_vocab_size + 1

target_id_vocab_dict = {}
target_vocab_id_dict = {}

for tid, tvocab in enumerate(vocab_german):
  target_id_vocab_dict[tid] = tvocab
  target_vocab_id_dict[tvocab] = tid

# Add unknown token plus start and end tokens to target language

target_id_vocab_dict[german_vocab_size] = "<unk>"
target_id_vocab_dict[german_vocab_size + 1] = "<start>"
target_id_vocab_dict[german_vocab_size + 2] = "<end>"
target_id_vocab_dict[german_vocab_size + 3] = "<pad>"

target_vocab_id_dict["<unk>"] = german_vocab_size
target_vocab_id_dict["<start>"] = german_vocab_size + 1
target_vocab_id_dict["<end>"] = german_vocab_size + 2
target_vocab_id_dict["<pad>"] = german_vocab_size + 3
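A quick way to sanity-check dictionaries like these is a round trip from text to ids and back. The sketch below uses a three-word toy vocabulary (hypothetical values, not the real `vocab_english`) and follows the same convention as above: special tokens get the ids right after the regular vocabulary, and anything outside the vocabulary maps to the unknown token.

```python
# Toy vocabulary standing in for vocab_english (hypothetical values).
toy_vocab = ["go", "hi", "run"]
toy_vocab_size = len(toy_vocab)

toy_id_vocab = dict(enumerate(toy_vocab))
toy_vocab_id = {token: idx for idx, token in toy_id_vocab.items()}

# Special tokens take the ids right after the regular vocabulary.
for offset, special in enumerate(["<unk>", "<pad>"]):
    toy_id_vocab[toy_vocab_size + offset] = special
    toy_vocab_id[special] = toy_vocab_size + offset

def encode(text):
    """Map whitespace-separated tokens to ids, using <unk> for unseen tokens."""
    return [toy_vocab_id.get(token, toy_vocab_id["<unk>"]) for token in text.split()]

def decode(ids):
    """Map ids back to tokens."""
    return " ".join(toy_id_vocab[i] for i in ids)

ids = encode("go .")
print(ids, "->", decode(ids))
```

The punctuation mark comes back as the unknown token because it was never part of the vocabulary.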

Lastly, we need to create the training and test data that will feed into our two models. It is convenient to define a small function for this that also takes care of padding and of adding start/end tokens on the decoder side.

Notice that we need to create three sequences of vocab ids: 
- inputs to the encoder (starting language), 
- inputs to the decoder (output language, for the preceding tokens in the output sequence) and 
- labels for the decoder (the correct next word to predict at each time step in the output, which is shifted one over from the inputs to the decoder).

### Convert text to vocabulary IDs
Previously we used Word2Vec, which came with its own vocabulary; now we use the vocabulary we just extracted by tokenizing the training data. We will reuse the function convert_text_to_data() for the three inputs a Seq2Seq model needs:
- Input to the encoder = vocab IDs for the English text
- Input to the decoder = vocab IDs for the German text
- Output of the decoder = our labels, which are also vocab IDs for the German text; the decoder inputs and the labels are the same German sequence, offset by one position

So the **input to the decoder** starts with the **\<start>** token, and the **1st label** is the 1st real word of the sentence. That 1st real word is then the 2nd input to the decoder, and so on. This way, during training we do not wait for whatever the decoder predicted as the previous word; we feed it the correct sequence as inputs and teach it to predict the correct next word. This is called **teacher forcing.**
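The offset between decoder inputs and labels can be made explicit with a small sketch. The helper `teacher_forcing_pairs` is hypothetical and works on tokens rather than ids to keep the output readable; the example sentence is taken from the corpus.

```python
def teacher_forcing_pairs(target_tokens):
    """Pair each decoder input token with the label it should predict."""
    decoder_inputs = ["<start>"] + target_tokens + ["<end>"]
    decoder_labels = target_tokens + ["<end>"]
    # zip stops at the shorter list, so the trailing <end> input has no label here;
    # in the padded arrays that position is simply paired with <pad>.
    return list(zip(decoder_inputs, decoder_labels))

pairs = teacher_forcing_pairs(["grüß", "gott", "!"])
for step, (given, expected) in enumerate(pairs):
    print(f"step {step}: decoder sees {given!r}, should predict {expected!r}")
```

At every step the decoder is given the correct previous token, regardless of what it would have predicted itself.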


In [89]:
# ===================================
# ======= REVERSE ENGINEERING =======
# ===================================

print(f'input_texts[:10] in English => {input_texts[:10]}\ntarget_texts[:10] in German => {target_texts[:10]}')
print(f'source_vocab_id_dict => {source_vocab_id_dict}\ntarget_vocab_id_dict => {target_vocab_id_dict}')
# print(f'input_texts[:10] in English => {input_texts[:10]}\ntarget_texts[:10] in German => {target_texts[:10]}')

def convert_text_to_data(texts, 
                         vocab_id_dict, 
                         max_length=20, 
                         type=None,
                         train_test_vector=None,
                         samples=100000):
  
  if type is None:
    raise ValueError('\'type\' is not defined. Please choose from: input_source, input_target, output_target.')
  
  train_data = []
  test_data = []

  for text_num, text in enumerate(texts[:5]):   # limit to 5 texts

    sentence_ids = []

    for token in text.split():
      print('token:', token)

      if token in vocab_id_dict.keys():
        sentence_ids.append(vocab_id_dict[token])
      else:
        sentence_ids.append(vocab_id_dict["<unk>"])
    
    vocab_size = len(vocab_id_dict.keys())
    
    # Depending on encoder/decoder and input/output, add start/end tokens.
    # Then add padding.
    
    if type == 'input_source':     # input for the encoder: English (=sentence_ids) plus padding tokens
      # ids = (sentence_ids + [vocab_size - 1] * max_length)[:max_length]
      ids = (sentence_ids + [vocab_id_dict["<pad>"]] * max_length)[:max_length]     # Natalie's code

    elif type == 'input_target':   # input for the decoder: <start> token at the beginning, then <end> and padding
      # ids = ([vocab_size - 3] + sentence_ids + [vocab_size - 2] + [vocab_size - 1] * max_length)[:max_length]
      ids = ([vocab_id_dict["<start>"]] + sentence_ids + [vocab_id_dict["<end>"]] + [vocab_id_dict["<pad>"]] * max_length)[:max_length]   # Natalie's code
      

    elif type == 'output_target':   # labels for the decoder: no <start> token at the beginning, but <end> followed by padding
      # ids = (sentence_ids + [vocab_size - 2] + [vocab_size -1] * max_length)[:max_length]
      ids = (sentence_ids + [vocab_id_dict["<end>"]] + [vocab_id_dict["<pad>"]] * max_length)[:max_length]     # Natalie's code

    else:                           # guard against typos in 'type', which would otherwise leave ids undefined
      raise ValueError(f'Unknown type: {type}')

    if train_test_vector is not None and not train_test_vector[text_num]:
      test_data.append(ids)
    else:
      train_data.append(ids)


  return np.array(train_data), np.array(test_data)


random_draws = np.random.uniform(size=5)                  # one uniform draw in [0, 1) per example
train_test_split_vector = (random_draws > 0.2)            # boolean mask: True (about 80% of the time) -> train, False -> test
print('random_draws: ', random_draws)
print('train_test_split_vector: ',train_test_split_vector)

train_source_input_data, test_source_input_data = convert_text_to_data(input_texts, 
                                                                       source_vocab_id_dict,
                                                                       type='input_source',
                                                                       max_length=max_encoder_seq_length,
                                                                       train_test_vector=train_test_split_vector)

train_target_input_data, test_target_input_data = convert_text_to_data(target_texts,
                                                                       target_vocab_id_dict,
                                                                       type='input_target',
                                                                       max_length=max_decoder_seq_length,
                                                                       train_test_vector=train_test_split_vector)

train_target_output_data, test_target_output_data = convert_text_to_data(target_texts,
                                                                         target_vocab_id_dict,
                                                                         type='output_target',
                                                                         max_length=max_decoder_seq_length,
                                                                         train_test_vector=train_test_split_vector)




input_texts[:10] in English => ['go .', 'hi .', 'hi .', 'run !', 'run .', 'wow !', 'wow !', 'duck !', 'fire !', 'help !']
target_texts[:10] in German => ['geh .', 'hallo !', 'grüß gott !', 'lauf !', 'lauf !', 'potzdonner !', 'donnerwetter !', 'kopf runter !', 'feuer !', 'hilfe !']
source_vocab_id_dict => {'00': 0, '10': 1, '100': 2, '12': 3, '13': 4, '15': 5, '17': 6, '18': 7, '19': 8, '30': 9, '300': 10, '45': 11, '50': 12, '99': 13, 'aah': 14, 'abandon': 15, 'aboard': 16, 'about': 17, 'abroad': 18, 'absent': 19, 'absurd': 20, 'accelerated': 21, 'accept': 22, 'accurate': 23, 'ache': 24, 'ached': 25, 'aches': 26, 'acne': 27, 'across': 28, 'act': 29, 'action': 30, 'active': 31, 'actor': 32, 'add': 33, 'addict': 34, 'addicted': 35, 'admirable': 36, 'admire': 37, 'admired': 38, 'admit': 39, 'adopted': 40, 'adore': 41, 'adored': 42, 'adores': 43, 'advises': 44, 'afraid': 45, 'after': 46, 'afternoon': 47, 'again': 48, 'against': 49, 'age': 50, 'agile': 51, 'agree': 52, 'agreed': 53, 'agrees

In [90]:
# =============================
# ======= ORIGINAL CODE =======
# =============================

def convert_text_to_data(texts, 
                         vocab_id_dict, 
                         max_length=20, 
                         type=None,
                         train_test_vector=None,
                         samples=100000):
  
  if type is None:
    raise ValueError('\'type\' is not defined. Please choose from: input_source, input_target, output_target.')
  
  train_data = []
  test_data = []

  for text_num, text in enumerate(texts[:samples]):

    sentence_ids = []

    for token in text.split():

      if token in vocab_id_dict.keys():
        sentence_ids.append(vocab_id_dict[token])
      else:
        sentence_ids.append(vocab_id_dict["<unk>"])
    
    vocab_size = len(vocab_id_dict.keys())
    
    # Depending on encoder/decoder and input/output, add start/end tokens.
    # Then add padding.
    
    if type == 'input_source':     # input for the encoder: English (=sentence_ids) plus padding tokens
      # ids = (sentence_ids + [vocab_size - 1] * max_length)[:max_length]
      ids = (sentence_ids + [vocab_id_dict["<pad>"]] * max_length)[:max_length]     # Natalie's code

    elif type == 'input_target':   # input for the decoder: <start> token at the beginning, then <end> and padding
      # ids = ([vocab_size - 3] + sentence_ids + [vocab_size - 2] + [vocab_size - 1] * max_length)[:max_length]
      ids = ([vocab_id_dict["<start>"]] + sentence_ids + [vocab_id_dict["<end>"]] + [vocab_id_dict["<pad>"]] * max_length)[:max_length]   # Natalie's code
      

    elif type == 'output_target':   # labels for the decoder: no <start> token at the beginning, but <end> followed by padding
      # ids = (sentence_ids + [vocab_size - 2] + [vocab_size -1] * max_length)[:max_length]
      ids = (sentence_ids + [vocab_id_dict["<end>"]] + [vocab_id_dict["<pad>"]] * max_length)[:max_length]     # Natalie's code

    else:                           # guard against typos in 'type', which would otherwise leave ids undefined
      raise ValueError(f'Unknown type: {type}')

    if train_test_vector is not None and not train_test_vector[text_num]:
      test_data.append(ids)
    else:
      train_data.append(ids)



  return np.array(train_data), np.array(test_data)


train_test_split_vector = (np.random.uniform(size=len(input_texts)) > 0.2)     # boolean mask: True (about 80%) -> train, False -> test


train_source_input_data, test_source_input_data = convert_text_to_data(input_texts, 
                                                                       source_vocab_id_dict,
                                                                       type='input_source',
                                                                       max_length=max_encoder_seq_length,
                                                                       train_test_vector=train_test_split_vector)

train_target_input_data, test_target_input_data = convert_text_to_data(target_texts,
                                                                       target_vocab_id_dict,
                                                                       type='input_target',
                                                                       max_length=max_decoder_seq_length,
                                                                       train_test_vector=train_test_split_vector)

train_target_output_data, test_target_output_data = convert_text_to_data(target_texts,
                                                                         target_vocab_id_dict,
                                                                         type='output_target',
                                                                         max_length=max_decoder_seq_length,
                                                                         train_test_vector=train_test_split_vector)
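The train/test split relies on a boolean mask: each example gets one uniform random draw, and draws above 0.2 (about 80% of them) send the example to the training set. Here is a minimal sketch of the same idea on ten toy examples; the seed and the names `toy_examples` and `toy_mask` are assumptions made only so the sketch is repeatable and self-contained.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seeded only so this sketch is repeatable
toy_examples = np.arange(10)

# One uniform draw per example; True (about 80% of the time) means "training set".
toy_mask = rng.uniform(size=len(toy_examples)) > 0.2

toy_train = toy_examples[toy_mask]
toy_test = toy_examples[~toy_mask]
print("train:", toy_train)
print("test: ", toy_test)
```

Reusing one mask for all three calls to convert_text_to_data() is what keeps the encoder inputs, decoder inputs and labels of a given sentence in the same split and in the same row.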




Let us look at a few examples. They appear correct.

In [91]:
train_source_input_data[:2]

array([[ 749, 2000, 2001, 2001, 2001, 2001],
       [ 843, 2000, 2001, 2001, 2001, 2001]])

In [92]:
train_target_input_data[:2]

array([[3001, 1080, 3000, 3002, 3003, 3003, 3003, 3003, 3003, 3003, 3003,
        3003, 3003],
       [3001, 1247, 3000, 3002, 3003, 3003, 3003, 3003, 3003, 3003, 3003,
        3003, 3003]])

In [93]:
train_target_output_data[:2]

array([[1080, 3000, 3002, 3003, 3003, 3003, 3003, 3003, 3003, 3003, 3003,
        3003, 3003],
       [1247, 3000, 3002, 3003, 3003, 3003, 3003, 3003, 3003, 3003, 3003,
        3003, 3003]])
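One way to confirm that the decoder labels really are the decoder inputs shifted by one position is to compare the two rows directly. The sketch below copies the first row of each array shown above; the variable names are hypothetical.

```python
import numpy as np

pad_id = 3003
# First rows of train_target_input_data and train_target_output_data shown above.
decoder_input_row = np.array([3001, 1080, 3000, 3002] + [pad_id] * 9)
decoder_label_row = np.array([1080, 3000, 3002] + [pad_id] * 10)

# Dropping the leading <start> from the input should reproduce the labels,
# apart from the final label position.
shift_matches = np.array_equal(decoder_input_row[1:], decoder_label_row[:-1])
print("labels are the inputs shifted by one:", shift_matches)
```

The same comparison could be run on the full arrays with `train_target_input_data[:, 1:]` and `train_target_output_data[:, :-1]`.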

[Return to Top](#returnToTop)  
<a id = 's2sNoAttention'></a>

### 2.2 The Seq2seq model without Attention

We need to build both the encoder and the decoder and we'll use LSTMs.  We'll set up the system first without an attention layer between the encoder and decoder.

In [None]:
def create_translation_model_no_att(encode_vocab_size, decode_vocab_size, embed_dim):

    source_input_no_att = tf.keras.layers.Input(shape=(max_encoder_seq_length,),
                                                dtype='int64',
                                                name='source_input_no_att')
    target_input_no_att = tf.keras.layers.Input(shape=(max_decoder_seq_length,),
                                                dtype='int64',
                                                name='target_input_no_att')

    source_embedding_layer_no_att = tf.keras.layers.Embedding(input_dim=encode_vocab_size,
                                                              output_dim=embed_dim,
                                                              name='source_embedding_layer_no_att')

    target_embedding_layer_no_att  = tf.keras.layers.Embedding(input_dim=decode_vocab_size,
                                                               output_dim=embed_dim,
                                                               name='target_embedding_layer_no_att')

    source_embeddings_no_att = source_embedding_layer_no_att(source_input_no_att)
    target_embeddings_no_att = target_embedding_layer_no_att(target_input_no_att)

    encoder_lstm_layer_no_att = tf.keras.layers.LSTM(embed_dim, return_sequences=True, return_state=True, name='encoder_lstm_layer_no_att')
    encoder_out_no_att, encoder_state_h_no_att, encoder_state_c_no_att = encoder_lstm_layer_no_att(source_embeddings_no_att)

    decoder_lstm_layer_no_att = tf.keras.layers.LSTM(embed_dim, return_sequences=True, return_state=False, name='decoder_lstm_layer_no_att')
    decoder_lstm_out_no_att = decoder_lstm_layer_no_att(target_embeddings_no_att, [encoder_state_h_no_att, encoder_state_c_no_att])

    target_classification_no_att = tf.keras.layers.Dense(decode_vocab_size,
                                                         activation='softmax',
                                                         name='classification_no_att')(decoder_lstm_out_no_att)

    translation_model_no_att = tf.keras.models.Model(inputs=[source_input_no_att, target_input_no_att], outputs=[target_classification_no_att])

    translation_model_no_att.compile(optimizer="Adam",
                                     loss='sparse_categorical_crossentropy',
                                     metrics=['accuracy'])
    
    return translation_model_no_att


Now we can call the function we created to instantiate that model and confirm that it is set up the way we like using model.summary().

In [None]:
encode_vocab_size = len(source_id_vocab_dict.keys())
decode_vocab_size = len(target_id_vocab_dict.keys())

translation_model_no_att = create_translation_model_no_att(encode_vocab_size, decode_vocab_size, embed_dim)

translation_model_no_att.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 source_input_no_att (InputLaye  [(None, 6)]         0           []                               
 r)                                                                                               
                                                                                                  
 target_input_no_att (InputLaye  [(None, 13)]        0           []                               
 r)                                                                                               
                                                                                                  
 source_embedding_layer_no_att   (None, 6, 100)      200200      ['source_input_no_att[0][0]']    
 (Embedding)                                                                                  

It never hurts to look at the shapes of the outputs.

In [None]:
translation_model_no_att.predict(x=[train_source_input_data, train_target_input_data]).shape



(8047, 13, 3004)
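Reading the shape: 8047 is the number of training sentences (roughly 80% of the 10000 samples), 13 is max_decoder_seq_length, and 3004 is the target vocabulary size including the four special tokens. The last axis holds a softmax distribution over target ids, so a greedy reading of the output takes the argmax at every position. The sketch below uses random numbers in place of real model output; `fake_probs` is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
# Stand-in for model output: 2 sentences, 13 positions, 3004 target ids (random, not real predictions).
fake_logits = rng.normal(size=(2, 13, 3004))
fake_probs = np.exp(fake_logits) / np.exp(fake_logits).sum(axis=-1, keepdims=True)

greedy_ids = fake_probs.argmax(axis=-1)  # most likely id at every position
print(fake_probs.shape, "->", greedy_ids.shape)
```

The resulting ids could then be mapped back to tokens with target_id_vocab_dict.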

Now that everything checks out, we can train our model.

In [None]:
translation_model_no_att.fit(x=[train_source_input_data, train_target_input_data],
                             y=train_target_output_data,
                             validation_data=([test_source_input_data, test_target_input_data],
                                              test_target_output_data),
                             epochs=40)

Epoch 1/40
...
Epoch 40/40


<keras.callbacks.History at 0x7f2c002138e0>

[Return to Top](#returnToTop)  
<a id = 's2sAttention'></a>

### 2.3 The Seq2seq model with Attention

All we need to do is add an attention layer that creates a context vector for each decoder position. We can use the attention layer provided by Keras in *tf.keras.layers.Attention()*.  We will then simply concatenate these context vectors with the corresponding outputs of the decoder LSTM layer in order to predict the translation tokens one by one.
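As a rough illustration of what the attention layer computes, the NumPy sketch below implements plain dot-product attention, which is the default scoring of `tf.keras.layers.Attention` (no learned scale and no mask). The decoder outputs act as queries and the encoder outputs act as both keys and values. The function name, array names and small hidden dimension are made up for this example.

```python
import numpy as np

def dot_product_attention(query, value):
    """Return (context, weights) for query (B, Tq, D) and value (B, Tv, D)."""
    # Similarity of every decoder position with every encoder position.
    scores = query @ value.transpose(0, 2, 1)           # (B, Tq, Tv)
    # Softmax over the encoder positions.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    # Weighted sum of encoder outputs: one context vector per decoder position.
    context = weights @ value                            # (B, Tq, D)
    return context, weights

rng = np.random.default_rng(0)
batch, enc_len, dec_len, dim = 2, 6, 13, 8  # dim is kept small for the demo
encoder_out = rng.normal(size=(batch, enc_len, dim))
decoder_out = rng.normal(size=(batch, dec_len, dim))

context, weights = dot_product_attention(decoder_out, encoder_out)

# Concatenate the context vectors with the decoder outputs along the feature axis.
concat = np.concatenate([decoder_out, context], axis=-1)
print(context.shape, weights.shape, concat.shape)  # (2, 13, 8) (2, 13, 6) (2, 13, 16)
```

Each row of `weights` sums to one, so every context vector is a weighted average of the six encoder outputs.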

In [None]:
def create_translation_model_with_att(encode_vocab_size, decode_vocab_size, embed_dim):

    source_input_with_att = tf.keras.layers.Input(shape=(max_encoder_seq_length,), 
                                                  dtype='int64',
                                                  name='source_input_with_att')
    target_input_with_att = tf.keras.layers.Input(shape=(max_decoder_seq_length,), 
                                                  dtype='int64',
                                                  name='target_input_with_att')

    source_embedding_layer_with_att = tf.keras.layers.Embedding(input_dim=encode_vocab_size,
                                                                output_dim=embed_dim,
                                                                name='source_embedding_layer_with_att')

    target_embedding_layer_with_att  = tf.keras.layers.Embedding(input_dim=decode_vocab_size,
                                                                 output_dim=embed_dim,
                                                                 name='target_embedding_layer_with_att')

    source_embeddings_with_att = source_embedding_layer_with_att(source_input_with_att)
    target_embeddings_with_att = target_embedding_layer_with_att(target_input_with_att)

    encoder_lstm_layer_with_att = tf.keras.layers.LSTM(embed_dim, return_sequences=True, return_state=True, name='encoder_lstm_layer_with_att')
    encoder_out_with_att, encoder_state_h_with_att, encoder_state_c_with_att = encoder_lstm_layer_with_att(source_embeddings_with_att)

    decoder_lstm_layer_with_att = tf.keras.layers.LSTM(embed_dim, return_sequences=True, return_state=False, name='decoder_lstm_layer_with_att')
    decoder_lstm_out_with_att = decoder_lstm_layer_with_att(target_embeddings_with_att, [encoder_state_h_with_att, encoder_state_c_with_att])

    attention_context_vectors = tf.keras.layers.Attention(name='attention_layer')([decoder_lstm_out_with_att, encoder_out_with_att])

    concat_decode_out_with_att = tf.keras.layers.Concatenate(axis=-1, name='concat_layer_with_att')([decoder_lstm_out_with_att, attention_context_vectors])

    target_classification_with_att = tf.keras.layers.Dense(decode_vocab_size,
                                                           activation='softmax',
                                                           name='classification_with_att')(concat_decode_out_with_att)

    translation_model_with_att = tf.keras.models.Model(inputs=[source_input_with_att, target_input_with_att], outputs=[target_classification_with_att])

    translation_model_with_att.compile(optimizer="Adam",
                                       loss='sparse_categorical_crossentropy',
                                       metrics=['accuracy'])

    return translation_model_with_att


In [None]:
translation_model_with_att = create_translation_model_with_att(encode_vocab_size, decode_vocab_size, embed_dim)

translation_model_with_att.summary()

Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 source_input_with_att (InputLa  [(None, 6)]         0           []                               
 yer)                                                                                             
                                                                                                  
 target_input_with_att (InputLa  [(None, 13)]        0           []                               
 yer)                                                                                             
                                                                                                  
 source_embedding_layer_with_at  (None, 6, 100)      200200      ['source_input_with_att[0][0]']  
 t (Embedding)                                                                              

In [None]:
translation_model_with_att.fit(x=[train_source_input_data, train_target_input_data],
                               y=train_target_output_data,
                               validation_data=([test_source_input_data, test_target_input_data],
                                                test_target_output_data),
                               epochs=40)

Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40
Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40
Epoch 24/40
Epoch 25/40
Epoch 26/40
Epoch 27/40
Epoch 28/40
Epoch 29/40
Epoch 30/40
Epoch 31/40
Epoch 32/40
Epoch 33/40
Epoch 34/40
Epoch 35/40
Epoch 36/40
Epoch 37/40
Epoch 38/40
Epoch 39/40
Epoch 40/40


<keras.callbacks.History at 0x7f2bbffa5a00>

Validation accuracy is about one percentage point better than for the model without attention.

**Question 1:** Why do you think the benefit of adding an attention layer is not larger?

[Return to Top](#returnToTop)  
<a id = 't5Example'></a>

## 3. T5

Now we turn to text generation with transformers. The T5 system was introduced [here](https://arxiv.org/pdf/1910.10683.pdf).  This model uses both the encoder and the decoder configurations of transformers and connects them together.  A big difference with this model is that it is designed to accept text as an input and produce text as an output for a number of different tasks ranging from summarization and question answering to classification.  The system needs to be told which task to perform as the first part of the input text.  Be sure to look in *Appendix D* of the paper to see a complete set of the tasks that the T5 base and large checkpoints can perform right out of the box and the data used to train them.

Let's play a bit with Hugging Face's implementation of T5, using the large checkpoint.

In [None]:
t5_model = TFT5ForConditionalGeneration.from_pretrained('t5-large')
t5_tokenizer = T5Tokenizer.from_pretrained('t5-large')

t5_model.summary()

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

Downloading (…)"tf_model.h5";:   0%|          | 0.00/2.95G [00:00<?, ?B/s]

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-large.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


Downloading (…)neration_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

Downloading (…)ve/main/spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

Model: "tft5_for_conditional_generation"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 shared (Embedding)          multiple                  32899072  
                                                                 
 encoder (TFT5MainLayer)     multiple                  334939648 
                                                                 
 decoder (TFT5MainLayer)     multiple                  435627520 
                                                                 
Total params: 737,668,096
Trainable params: 737,668,096
Non-trainable params: 0
_________________________________________________________________


For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-large automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


That is about 737 million trainable parameters, which is quite a lot.

Let's create a short text to use as an example.

In [None]:
ARTICLE = ("Oh boy, what a lengthy and cumbersome excercise this was. " \
           "I had to look into every detail, check everything twice, " \
           " and then compare to prior results. Because of this tediousness " \
           " and extra work my homework was 2 days late.")

Next, we need to specify the task we want T5 to perform by adding a task prompt to the beginning of the input text.  Because we are summarizing, we add the prefix *summarize:* to the beginning of our input.

In [None]:
t5_input_text = "summarize: " + ARTICLE
t5_inputs = t5_tokenizer([t5_input_text], return_tensors='tf')

First, we will generate a summary using the default output options.

In [None]:
t5_summary_ids = t5_model.generate(t5_inputs['input_ids'])

print([t5_tokenizer.decode(g, skip_special_tokens=True,
                           clean_up_tokenization_spaces=False)
       for g in t5_summary_ids])



['homework was a lengthy and cumbersome excercise . because of this tedious']


Not great. But let's get more sophisticated and prescribe a minimum length and use beam search to generate multiple outputs.  We also indicate the maximum length the output should be.  Finally, in order to reduce repetitive output we tell the model to avoid output that repeats trigrams (three word groupings).

In [None]:
t5_summary_ids = t5_model.generate(t5_inputs['input_ids'],
                                   num_beams=3,
                                   no_repeat_ngram_size=3,
                                   min_length=20,
                                   max_length=40)
                             
print([t5_tokenizer.decode(g, skip_special_tokens=True, 
                           clean_up_tokenization_spaces=False) for g in t5_summary_ids])

['i had to look into every detail, check everything twice, and then compare to prior results . because of this tediousness and extra work my homework was 2 days late .']


That is a bit better, thanks to the generation parameters we set.
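To make the trigram constraint concrete, here is a minimal pure-Python sketch of the idea behind `no_repeat_ngram_size`: before choosing the next token, find every token that would complete an n-gram that already occurs in the output and forbid it. The function `banned_next_tokens` is a hypothetical helper written for this illustration. It works on whole words for readability, whereas the model applies the constraint to subword token ids, and it is not the library's actual implementation.

```python
def banned_next_tokens(tokens, ngram_size):
    """Tokens that would complete an n-gram already present in `tokens`."""
    if ngram_size < 1 or len(tokens) + 1 < ngram_size:
        return set()
    # The last (n - 1) tokens are the prefix of the n-gram currently being built.
    prefix = tuple(tokens[len(tokens) - (ngram_size - 1):])
    banned = set()
    for start in range(len(tokens) - ngram_size + 1):
        ngram = tuple(tokens[start:start + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned

generated = "my homework was late because my homework".split()
print(banned_next_tokens(generated, ngram_size=3))  # {'was'}
```

Because "my homework was" has already been generated, "was" is ruled out as the next word, which forces the decoder to continue differently.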

Lastly, can T5 perform machine translation? Yes, in some limited instances.  We need to specify the input and output languages. Keep in mind that the model has only been trained to translate in particular directions e.g. English to Romanian but NOT Romanian to English.


In [None]:
t5_input_text = "translate English to German: " + ARTICLE
t5_inputs = t5_tokenizer([t5_input_text], return_tensors='tf')

In [None]:
t5_summary_ids = t5_model.generate(t5_inputs['input_ids'],
                                   num_beams=3,
                                   no_repeat_ngram_size=3,
                                   min_length=10,
                                   max_length=40)
                             
print([t5_tokenizer.decode(g, skip_special_tokens=True, 
                           clean_up_tokenization_spaces=False) for g in t5_summary_ids])

['Ich habe es nicht geschafft, meinen ersten Test zu schreiben, da ich nicht genügend Zeit hatte, um meinen Test zu bearbeiten.']


Hmm... the output is very fluent German. But paste the German output into translate.google.com and see what it means. Is it anything like the English input? This hallucination might be mitigated by changing some of the generation parameters, such as num_beams.

Is a shorter example more accurate?  Maybe.

In [None]:
t5_input_text = "translate English to German: That was really not very good today; it was too difficult to solve."
t5_inputs = t5_tokenizer([t5_input_text], return_tensors='tf')

In [None]:
t5_summary_ids = t5_model.generate(t5_inputs['input_ids'],
                                   num_beams=3,
                                   no_repeat_ngram_size=3,
                                   min_length=10,
                                   max_length=40)
                             
print([t5_tokenizer.decode(g, skip_special_tokens=True, 
                           clean_up_tokenization_spaces=False) for g in t5_summary_ids])

['Das war heute wirklich nicht sehr gut; es war zu schwierig zu lösen.']


That is not bad, though there are some mistakes.

[Return to Top](#returnToTop)  
<a id = 'prompts'></a>

## 4. Prompt Engineering and Generative Large Language Models

The development of very large language models such as [GPT3](https://arxiv.org/pdf/2005.14165.pdf) has led to increased interest in few-shot and zero-shot approaches to tasks.  These generative language models allow a user to provide a prompt with several examples followed by a question the model must answer.  GPT3, especially its 175 billion parameter model, demonstrates the feasibility of a zero-shot approach, where the model is simply presented with the prompt and in many instances provides the correct answer.

The implication of this zero-shot capability is that a very large generative language model can be pre-trained and then shared by a large group of people because it requires no fine-tuning or parameter manipulation. Instead, the users work on the wording of their prompt and on providing enough context that the model can perform the task correctly. [Liu et al.](https://arxiv.org/pdf/2107.13586.pdf) characterize this as "pre-train, prompt, and predict."

There are multiple approaches to pre-train, prompt, and predict.  Here we explore two of them.  First, we look at cloze prompts.  These leverage the masked language model approach used in BERT and T5, where individual words or spans are masked and, during pre-training, the model learns to predict the masked tokens. Second, we look at prefix prompts.  These leverage the next-word prediction capability of decoder-only models in the GPT family.

[Return to Top](#returnToTop)  
<a id = 'clozePrompts'></a>

### 4.1 Cloze Prompts

Cloze prompts take advantage of the masked language model task, where an individual word or span of words anywhere in the input is masked and the language model learns to predict it.

In [None]:
del t5_tokenizer
del t5_model

t5_model = TFT5ForConditionalGeneration.from_pretrained('t5-base')
t5_tokenizer = T5Tokenizer.from_pretrained('t5-base')

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

Downloading (…)"tf_model.h5";:   0%|          | 0.00/892M [00:00<?, ?B/s]

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


Downloading (…)neration_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

Downloading (…)ve/main/spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


"\<extra_id_0\>" is the special token (called a sentinel token) that we can use with T5 to invoke its masked language modeling ability. T5 provides 100 of these tokens, numbered 0 through 99. This means we can construct sentences, like a fill-in-the-blank test, that allow us to probe the knowledge embedded in the model during pre-training.  Here's an example that works well.  After you've run it try substituting beagle for poodle and you'll see the model gets confused.

Notice too that we are using beam search and returning the top three sequences rather than just the best one.
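The fill-in-the-blank format works because it mirrors the span-corruption examples T5 saw during pre-training: each masked span in the input is replaced by a sentinel token, and the target lists each sentinel followed by the words it hides. The sketch below builds such a pair in pure Python. The helper `make_cloze_example` is a hypothetical name for this illustration, and it works on whole words rather than on SentencePiece tokens.

```python
def make_cloze_example(words, spans):
    """Build a T5-style (input, target) pair.

    `spans` is a list of (start, end) word indices, end exclusive,
    sorted and non-overlapping.
    """
    input_parts, target_parts = [], []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{sentinel_id}>"
        # Copy the unmasked words, then stand in a sentinel for the masked span.
        input_parts.extend(words[cursor:start])
        input_parts.append(sentinel)
        # The target repeats the sentinel followed by the hidden words.
        target_parts.append(sentinel)
        target_parts.extend(words[start:end])
        cursor = end
    input_parts.extend(words[cursor:])
    # A final sentinel closes the target sequence.
    target_parts.append(f"<extra_id_{len(spans)}>")
    return " ".join(input_parts), " ".join(target_parts)

words = "Thank you for inviting me to your party last week .".split()
source, target = make_cloze_example(words, [(2, 4), (8, 9)])
print(source)  # Thank you <extra_id_0> me to your party <extra_id_1> week .
print(target)  # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```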

In [None]:
PROMPT_SENTENCE = ( "An Australian <extra_id_0> is a type of working dog .")
t5_input_text = PROMPT_SENTENCE
t5_inputs = t5_tokenizer([t5_input_text], return_tensors='tf')
t5_summary_ids = t5_model.generate(t5_inputs['input_ids'], 
                                   num_beams=15,
                                   no_repeat_ngram_size=2,
                                   num_return_sequences=3,
                                   min_length=1,
                                   max_length=3)
                             
print([t5_tokenizer.decode(g, skip_special_tokens=True, 
                           clean_up_tokenization_spaces=False) for g in t5_summary_ids])

['Shepherd', 'working', 'Working']
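The roles of `num_beams` and `num_return_sequences` can be illustrated with a toy beam search. The sketch below is a simplified illustration, not the decoding code used by the library: the next-token probabilities in `NEXT_TOKEN_PROBS` are invented, each token depends only on the previous one, and there is no length normalization.

```python
import math

# Hypothetical next-token probabilities for a toy language model.
NEXT_TOKEN_PROBS = {
    "<s>": {"Shepherd": 0.5, "working": 0.3, "Cattle": 0.2},
    "Shepherd": {"</s>": 0.9, "dog": 0.1},
    "working": {"dog": 0.6, "</s>": 0.4},
    "Cattle": {"dog": 0.8, "</s>": 0.2},
    "dog": {"</s>": 1.0},
}

def beam_search(num_beams, num_return_sequences, max_length):
    """Keep the `num_beams` best partial sequences at every step."""
    beams = [(0.0, ["<s>"])]  # (log probability, tokens so far)
    finished = []
    for _ in range(max_length):
        candidates = []
        for score, seq in beams:
            for token, prob in NEXT_TOKEN_PROBS[seq[-1]].items():
                candidates.append((score + math.log(prob), seq + [token]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, seq in candidates[:num_beams]:
            # Sequences that emit the end token leave the beam.
            (finished if seq[-1] == "</s>" else beams).append((score, seq))
        if not beams:
            break
    # Sequences still unfinished after max_length steps are dropped in this sketch.
    finished.sort(key=lambda c: c[0], reverse=True)
    return [(" ".join(seq[1:-1]), math.exp(score))
            for score, seq in finished[:num_return_sequences]]

results = beam_search(num_beams=3, num_return_sequences=3, max_length=3)
print([text for text, _ in results])  # ['Shepherd', 'working dog', 'Cattle dog']
```

With `num_beams=1` the same function reduces to greedy decoding and returns only the single most likely continuation.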


[Return to Top](#returnToTop)  
<a id = 'prefixPrompt'></a>

### 4.2 Prefix Prompts

Prefix prompts are used with models that predict the next word given a large context window.  If you fill that window with the right information you can get the model to generate the output you want.  GPT3 relies on this approach to perform tasks successfully.  You can either include a couple of examples of what you want the model to do and then ask your question or you can just ask your question.
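The difference between asking directly (zero-shot) and including a few examples first (few-shot) comes down to how the prompt string is assembled. Here is a minimal sketch; the `Q:`/`A:` template and the helper name `build_prefix_prompt` are assumptions chosen for illustration, not a format required by any particular model.

```python
def build_prefix_prompt(question, examples=()):
    """Join optional (question, answer) demonstrations with the final question."""
    lines = []
    for example_question, example_answer in examples:
        lines.append(f"Q: {example_question}")
        lines.append(f"A: {example_answer}")
    # The prompt ends with an open answer slot for the model to continue.
    lines.append(f"Q: {question}")
    lines.append("A:")
    return "\n".join(lines)

zero_shot = build_prefix_prompt("What is the capital of France?")
few_shot = build_prefix_prompt(
    "What is the capital of France?",
    examples=[("What is the capital of Italy?", "Rome"),
              ("What is the capital of Japan?", "Tokyo")],
)
print(few_shot)
```

Either string could then be tokenized and passed to a causal language model, which continues the text after the final `A:`.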

Let's take a look at a decoder-only generative pretrained text generation model: [OPT](https://arxiv.org/pdf/2205.01068.pdf). This model doesn't have separate input and output sequences. Instead, we will feed in one sequence (the prefix prompt) and ask the model to continue generating text to complete that same sequence.  The OPT model is intended to replicate the functionality of the GPT-3 model and comes in several sizes, from 125 million parameters to 175 billion parameters.  We'll work with the 350 million parameter model.

As with T5, we'll just try out the pre-trained model and see what text it generates for a new starting sequence.

In [None]:
from transformers import GPT2Tokenizer, TFOPTForCausalLM


In [None]:
opt_tokenizer = GPT2Tokenizer.from_pretrained("facebook/opt-350m")
opt_model = TFOPTForCausalLM.from_pretrained("facebook/opt-350m")

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/441 [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/685 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/644 [00:00<?, ?B/s]

Downloading (…)"tf_model.h5";:   0%|          | 0.00/663M [00:00<?, ?B/s]

All model checkpoint layers were used when initializing TFOPTForCausalLM.

All the layers of TFOPTForCausalLM were initialized from the model checkpoint at facebook/opt-350m.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFOPTForCausalLM for predictions without further training.


Downloading (…)neration_config.json:   0%|          | 0.00/137 [00:00<?, ?B/s]

In [None]:
prefix_prompt = 'Yesterday, I went to the store to buy '
input_ids = opt_tokenizer.encode(prefix_prompt, return_tensors='tf')

In [None]:
generated_text_outputs = opt_model.generate(
    input_ids, 
    max_length=35,
    num_return_sequences=3,
    repetition_penalty=1.5,
    top_p=0.92,
    temperature=.85,
    do_sample=True,
    top_k=125,
    early_stopping=True
)

#Print output for each sequence generated above
for i, beam in enumerate(generated_text_outputs):
  print()
  print("{}: {}".format(i, opt_tokenizer.decode(beam, skip_special_tokens=True, clean_up_tokenization_spaces=True)))



0: Yesterday, I went to the store to buy  an iPhone and there was a message saying there were no new models available.
You've only been here for like 2

1: Yesterday, I went to the store to buy  a new (to me) T-shirt.
I came home and said "Wait where did this shirt come from

2: Yesterday, I went to the store to buy iced tea. One of my friends was at work and he has a huge beard so it would be weird if we got
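The sampling arguments used in the call above (`temperature`, `top_k`, `top_p`) all reshape the next-token distribution before a token is drawn. The NumPy sketch below shows one plausible way to apply them in sequence to a toy four-token distribution. It is a simplified illustration under that assumption, and the library's internal processing may differ in its details.

```python
import numpy as np

def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return the renormalized distribution a sampler would draw from."""
    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    logits = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    order = np.argsort(-probs)  # token ids, most probable first

    if top_k > 0:
        # Keep only the k most probable tokens.
        probs[order[top_k:]] = 0.0
        probs /= probs.sum()

    if top_p < 1.0:
        # Keep the smallest set of tokens whose cumulative mass reaches top_p.
        cumulative = np.cumsum(probs[order])
        cutoff = int(np.searchsorted(cumulative, top_p)) + 1
        probs[order[cutoff:]] = 0.0
        probs /= probs.sum()

    return probs

toy_logits = np.log([0.5, 0.3, 0.15, 0.05])
print(filter_distribution(toy_logits, top_k=2))
print(filter_distribution(toy_logits, top_p=0.9))
print(filter_distribution(toy_logits, temperature=0.5))

# Drawing a token id from the filtered distribution.
rng = np.random.default_rng(0)
filtered = filter_distribution(toy_logits, temperature=0.85, top_k=3, top_p=0.92)
sampled_id = int(rng.choice(len(filtered), p=filtered))
```

Tokens whose probability is set to zero can never be sampled, which is how these settings trade diversity against coherence.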


Now let's try a long prompt to give the model a lot of context to work with and see how well it performs.  We'll also include the output for that same prompt from ChatGPT for comparison purposes.

In [None]:
prompt = ("Write a paragraph long review of Dino's Diner which has been your family's favorite for generations." \
          " You are a 42 year old parent with three hungry kids who lives in Tom's River, NJ.")

inputs = opt_tokenizer(prompt, return_tensors="tf")

In [None]:
#OPT 350m model
generate_ids = opt_model.generate(inputs.input_ids,
                              min_length=100, 
                              max_length=200,  
                              repetition_penalty = 1.5,
                              top_k=150, 
                              do_sample=True, 
                              top_p=0.95, 
                              temperature=.85,
                              num_return_sequences=2) 

for i, sample_output in enumerate(generate_ids):
  print("{}: {}".format(i, opt_tokenizer.decode(sample_output, skip_special_tokens=True, clean_up_tokenization_spaces=True)))
  print()

0: Write a paragraph long review of Dino's Diner which has been your family's favorite for generations. You are a 42 year old parent with three hungry kids who lives in Tom's River, NJ. What makes Dino's Dining so special?
A great place to see family and friends enjoy the best burgers around! We're on the way there now but we would have come if Dino's hadn't closed its doors last April - this is a special lunch from me because it really does just happen that I know my kids will love Dino's (heck he never said how many people can go through his kitchen) So get out one more time...and let us be proud parents today. Thank you!
Have you ever had a diner experience like Dino’s? Tell us about it here: https://www3dinksubwaynewyorkcitiesreviewclub/comments/?id=9996#comment-398029258841 | Here at 3DINS

1: Write a paragraph long review of Dino's Diner which has been your family's favorite for generations. You are a 42 year old parent with three hungry kids who lives in Tom's River, NJ. Here is

Here's how ChatGPT, a variant of GPT-3 trained to generate outputs that humans prefer, responded to that same prompt:

`Dino's Diner has been a staple in my family for generations. As a 42 year old parent with three hungry kids, I appreciate the affordable, family-friendly atmosphere that Dino's provides. The menu offers a wide variety of options for breakfast, lunch, and dinner, and the portions are always generous. The staff is friendly and accommodating, and the service is fast. My kids love the milkshakes and the classic diner fare, and I can always count on Dino's to hit the spot. Living in Tom's River, NJ, Dino's Diner is the perfect spot for a family meal. It has been a beloved family tradition for us, and I'm sure it will continue to be for generations to come.`


[Return to Top](#returnToTop)  
<a id = 'classExercise'></a>

### 4.3 In-Class Exercise (or on your own):
- Try changing the prefix_prompt input text to see how OPT completes different types of starting sentences (prefix prompts). (If time, we can brainstorm some sentences to try in groups or collect in the chat during the live session.)
- You can alter num_return_sequences to return a larger or smaller number of output options (i.e. sampled sequences).
- You might want to play with the parameters for repetition_penalty to see how they affect the model's output.
- You might also want to see what happens when you increase max_length, and how that relates to the repetition constraints. As the text gets longer, it will be more challenging for the model to avoid repeating itself. So stricter constraints against repetition might make the model get more creative or wander farther from the input sequence.
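
When experimenting with `repetition_penalty`, it may help to see the kind of adjustment it makes. The sketch below follows a commonly used formulation in which the scores of tokens that have already been generated are pushed down by the penalty factor. Treat it as an approximation of what happens inside `generate`, not as the library's code. The function name and the toy scores are made up for the example.

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """Return a copy of `logits` with already generated token ids made less likely."""
    penalized = list(logits)
    for token_id in set(generated_ids):
        score = penalized[token_id]
        # Dividing a positive score or multiplying a negative one both lower it.
        penalized[token_id] = score / penalty if score > 0 else score * penalty
    return penalized

toy_scores = [2.0, 1.0, -1.0, 0.5]
penalized_scores = apply_repetition_penalty(toy_scores, generated_ids=[0, 2, 0], penalty=1.5)
print(penalized_scores)
```

A penalty of 1.0 leaves the scores unchanged, and larger values make repeats increasingly unlikely.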

[Return to Top](#returnToTop)  
<a id = 'answers'></a>

## 5. Answers

**Question 1:** Why do you think the benefit of adding an attention layer is not larger?

      Answer:   The nature of our training and test sets and the artificial size of the inputs (6 words) and outputs (11 words) means that the gains we might see on long sentences aren't a part of this test.