# From: Transformers and Text-Generation
by Liam Dugan (UPenn). 


Please write your answers and code in the cells with questions below. 

----------

For this homework, we will take ideas from the entire class: language models, text generation, vector-based word representations, syntactic analysis, and neural networks. We'll be using large, pre-trained language models to generate text, and studying how we can fine-tune these large language models to generate text in whatever genre and style we want!

In this assignment you will get:
1. An overview of what the "Transformer" architecture is and why it is particularly well suited for Natural Language Processing tasks
2. An introduction to the Generative Pretrained Transformer (GPT) family, which is a set of large-scale language models that can be used to generate text that often sounds like it was written by a human.
3. Experience with using the HuggingFace package to fine-tune these models to generate text that sounds like it comes from a specific source.

# Part 1: What is a Transformer? (Reading)
<figure align="center">
<img src="https://media.giphy.com/media/VeWllmR9zfaco/giphy.gif" />
<figcaption>(It's probably not this guy, right?)</figcaption>
</figure>

### The Transformer

The current state of the art for a variety of natural language processing tasks belongs to the **Transformer** architecture, first published in June 2017 and presented at NeurIPS in December 2017. 

The Transformer can be thought of as a big feed-forward network with every feed-forward layer containing something called an "attention module". 

>You might be wondering: why are we moving back to feed-forward networks after having so much success with recurrent neural networks and variants like LSTMs? Aren't RNNs naturally poised to handle sequences as their inputs? Well, as it turns out, the sequential nature of RNNs makes them really difficult to train in a distributed/parallel fashion. So while RNNs may seem like the more natural fit for sequences of inputs, non-recurrent networks such as the Transformer process all positions in parallel and can be trained much faster, allowing orders of magnitude more training data to be used. 



### Reading \# 1 - What is a Transformer?

In order to get a good grasp on exactly *why* these models are so good it's important to understand what they are and how they work. 

Your first task for this homework is to read the blog post ["The Illustrated Transformer" by Jay Alammar](http://jalammar.github.io/illustrated-transformer/). This blog post explains the transformer architecture (and the all-important "Attention Module") with helpful visualizations and diagrams. 

**You should read this post very closely and understand exactly what the Transformer is and how it works. Once you're finished reading, answer the following questions in 2-3 sentences each.**

1. (2 pts) What is Self-Attention (at a high level)?

   > It is a layer that helps the encoder look at other words in the input sentence as it encodes a specific word

2. (2 pts) How is Self-Attention computed?

   > First, we create three vectors from each of the encoder’s input vectors: a Query vector, a Key vector, and a Value vector. Next, we calculate a score for each pair of words by taking the dot product of the Query vector with each Key vector. We then divide the scores by the square root of the dimension of the Key vectors and normalize them with a softmax function. Each softmax score is multiplied with its Value vector, and the sum of those weighted Value vectors gives the self-attention output for that position.

3. (2 pts) What do the "Query", "Key", and "Value" vectors encode (at a high level)?

   > They’re abstractions that are useful for calculating and thinking about attention.

4. (2 pts) What is an attention "head" and why should we use multiple heads?

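   > An attention head is one independent set of Query/Key/Value projections that computes attention on its own. Using multiple heads in parallel lets each head focus on different aspects of the input, and their outputs are concatenated and projected back into a single vector.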

5. (2 pts) What are positional embeddings?

   > These are the embeddings to help determine the positions of the words.

6. (2 pts) Why are positional embeddings important?

   > The positional embeddings provide meaningful distances between the original embedding vectors once they’re projected into Q/K/V vectors and during dot-product attention. Without them, the model would have no information about word order.
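
The self-attention computation described in the answers above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the code of any real library: the function names and the toy `Q`, `K`, `V` numbers are made up, and in a real Transformer the Query/Key/Value vectors come from multiplying the input embeddings by learned weight matrices rather than being passed in directly.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Score this query against every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # The output is the attention-weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Toy example: 3 tokens with 2-dimensional Q/K/V vectors (made-up numbers).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
attended = self_attention(Q, K, V)
print(attended)
```

Because the softmax weights sum to 1, each output row is a weighted average of the value vectors, which is why every token's new representation is a blend of information from all tokens in the sentence.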




-----------
### Reading \# 2 - Transformer Language Models

On June 11th 2018, OpenAI released a model named **GPT**, standing for *Generative Pre-Trained Transformer*. This model was a Transformer architecture that was modified such that it could be used for Text Generation instead of sequence to sequence modeling. This model was also pre-trained, which means that anyone could download the fully trained model and use it without needing to train the model themselves. 

On February 14th 2019, OpenAI released a blog post detailing a brand new version of GPT that had an insane **1.5 billion parameters**. They named this version **GPT-2**. To train such a large model, OpenAI crawled 40GB worth of text from the web (roughly 20,000,000,000 words). 

GPT-2 is an extremely impressive language model that can generate text that often sounds so plausible that it seems like it might have been written by a human.  Here is an example of what GPT-2 can generate, taken from [OpenAI’s blog post](https://openai.com/blog/better-language-models/):

**Human-Written Prompt:**
> “Miley Cyrus was caught shoplifting from Abercrombie and Fitch on Hollywood Boulevard today.”

**Model Continuation:**
> The 19-year-old singer was caught on camera being escorted out of the store by security guards.
> 
> The singer was wearing a black hoodie with the label ‘Blurred Lines’ on the front and ‘Fashion Police’ on the back.
> 
> Scroll down for video
> 
> Shoplifting: Miley Cyrus was caught shoplifting from Abercrombie and Fitch on Hollywood Boulevard today (pictured)
> 
> The singer was also wearing a pair of black-rimmed glasses, a black jacket, black jeans and black sandals.
> 
> She was carrying a pair of black and white striped gloves and a small black bag.

*(To be clear: Miley Cyrus was never actually arrested for shoplifting)*.  

The quality of the model's output was so good that the researchers at OpenAI were worried that, in addition to positive applications, it might be used for malicious purposes like generating fake news, impersonating people online, creating abusive or fake content to post on social media, or automating spam and phishing scams. In a [demo to Wired Magazine](https://www.wired.com/story/ai-text-generator-too-dangerous-to-make-public/), the reporter typed in “Hillary Clinton and George Soros” and GPT-2 generated a crazy rant with lots of conspiracy theories: 
> Hillary Clinton and George Soros are a perfect match, and their agenda appears to be to create a political movement where Soros and his political machine and Clinton are two of the only major players. This is the first time Soros and Clinton have been caught on tape directly colluding in promoting the same false narrative. One of the key revelations in the leaked audio was Clinton's admission to a Russian banker that she knew about the Uranium One deal before it was approved by Congress. Clinton was shown sharing the same talking points that were originally drafted by a Fusion GPS contractor hired by an anti-Trump Republican donor. The leaked audio is the clearest evidence yet that the Clinton campaign and the Hillary Foundation colluded with Fusion GPS to manufacture propaganda against President Trump.


They were concerned enough that they labeled GPT-2 "too dangerous to release", and OpenAI initially refused to release their dataset, training code, or GPT-2 model weights.  OpenAI decided to release in a delayed, phased fashion so that researchers could spend time working on automatic detection of generated text.

In this homework, you'll get to be the judge of how good GPT-2 is, as you'll be using it yourself to generate text!

**To start your journey into the world of Text Generation, you should read Part 1 of the blog post ["The Illustrated GPT-2" by Jay Alammar](http://jalammar.github.io/illustrated-gpt2/) and answer the following questions in 2-3 sentences each**

1. (4 pts) How does the architecture of GPT-2 differ from the standard Encoder-Decoder Transformer model?
   > GPT-2 only uses the transformer decoder blocks.
2. (4 pts) What is the difference between "Masked Self-Attention" and "Self-Attention"?
   > A normal self-attention block allows a position to peek at tokens to its right. In masked self-attention, information from tokens to the right of the position is blocked from the calculation.
3. (4 pts) What are logits? How are they computed? How does GPT-2 use them to decide which word to predict next?
   > Logits are the unnormalized scores the model assigns to every token in its vocabulary. They are computed by multiplying the final decoder block's output vector by the token embedding matrix, and a softmax turns them into probabilities. GPT-2 can pick the token with the highest score, or sample from the highest-scoring tokens (top-k sampling), to decide which word comes next.
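
As a rough illustration of the last two answers, the sketch below applies a causal mask to a matrix of attention scores and then turns a vector of logits into a greedy next-token choice. Everything here is a toy assumption (the helper names, the all-zero score matrix, and the three-entry logit vector are invented for the example); it is not how GPT-2 is actually implemented internally.

```python
import math

NEG_INF = float("-inf")

def softmax(scores):
    """Normalize raw scores into probabilities; -inf scores get probability 0."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(scores):
    """Block every position to the right of the current one, then softmax each row."""
    n = len(scores)
    weights = []
    for i in range(n):
        masked = [scores[i][j] if j <= i else NEG_INF for j in range(n)]
        weights.append(softmax(masked))
    return weights

# Toy attention scores for a 3-token sequence (all equal, for readability).
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
weights = causal_attention_weights(scores)
for row in weights:
    print(row)

# Toy logits over a 3-token vocabulary: softmax gives probabilities,
# and greedy decoding picks the index with the highest probability.
logits = [2.0, 0.5, -1.0]
probs = softmax(logits)
next_id = max(range(len(probs)), key=probs.__getitem__)
print(probs, next_id)
```

The first row of `weights` puts all of its attention on the first token, since there is nothing to its left, while the last row spreads attention over all three tokens.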

### Aside: GPT-3 

On June 11th 2020, OpenAI released GPT-3 [(paper)](https://arxiv.org/pdf/2005.14165.pdf) [(wikipedia)](https://en.wikipedia.org/wiki/GPT-3). This model has an unfathomable **175 billion parameters** (over 100x larger than GPT-2!) and was trained on 570GB of text! Its output is often very hard to distinguish from human writing, and it can generate text about almost any topic and in almost any style with only a few words of priming text. At the time of its release it was by far the largest language model ever trained, and it can do some very terrifying things.

GPT-3 Can:
- Generate JSX code from natural language descriptions
- Generate emojis based on descriptions of a feeling
- Generate regular expressions from natural language descriptions
- Generate website mockups from natural language descriptions
- Generate charts with titles, labels and legends from natural language descriptions
- Explain Python code in plain English
- Automatically generate quiz questions (and grade them)
- Generate LaTeX from natural language descriptions
- Generate Linux commands from natural language descriptions
- Generate a Machine Learning model from natural language descriptions

[Here's a collection of 21 things GPT-3 can do (with examples)](https://machinelearningknowledge.ai/openai-gpt-3-demos-to-convince-you-that-ai-threat-is-real-or-is-it/#OpenAI_GPT-3_Demos)

[Here's a NYT article about how GPT-3 can write code, poetry, and argue](https://www.nytimes.com/2020/11/24/science/artificial-intelligence-ai-gpt3.html)

[Here's an article GPT-3 wrote for The Guardian about how it loves humans and would never subjugate humanity](https://www.theguardian.com/commentisfree/2020/sep/08/robot-wrote-this-article-gpt-3)

**You may optionally choose to read Jay Alammar's most recent blog post ["How GPT3 Works - Visualizations and Animations"](http://jalammar.github.io/how-gpt3-works-visualizations-animations/) from July 2020 if you're curious as to how GPT-3 differs from GPT-2**

Similarly to GPT-2, OpenAI has decided not to release GPT-3, this time opting to put GPT-3 behind an API which you need to request permission to use. This allows them to control exactly who can generate text and what type of text is generated. While this is a good solution in the short term, the long term implications of GPT-3 are still unclear.

If you are interested in trying out GPT-3 yourself, feel free to [Join the OpenAI API Waitlist](https://share.hsforms.com/1Lfc7WtPLRk2ppXhPjcYY-A4sk30)

-------------------------------

# Part 2: GPT-2 Text Generation with HuggingFace

Phew, that was a lot of reading. Now let's get to the fun part! Let's use the transformer to generate some text!!

We will use the [Transformers library from HuggingFace](https://transformer.huggingface.co), which provides support for many Transformer-based language models like GPT-2. 

**IMPORTANT: Make sure that you have GPU set as your Hardware Accelerator in `Runtime > Change runtime type` before running this Colab.**

In [1]:
!pip install transformers



## 2.1 The 'Pipeline' Interface

The simplest way to use the HuggingFace library is to use their [Pipeline interface](https://huggingface.co/transformers/main_classes/pipelines.html)

There are many different types of Pipelines available but in this section we'll use the TextGenerationPipeline to get up and running with pretrained gpt2 as fast as possible

In [1]:
from transformers import pipeline

In [3]:
# Note: device=0 means to use GPU, device=-1 is to use CPU
generator = pipeline('text-generation', model='gpt2', device=0) 

In [4]:
outputs = generator('I wonder what I will generate?')
print(outputs)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'I wonder what I will generate? I have not seen a single word!" said S.M., who had been absent on an attempt to get in touch in the past few days.\n\nS.M. was referring to the lack of electricity'}]


Note that the 'text-generation' pipeline will work with any **auto-regressive** language model (a.k.a 'causal-lm' models according to the HuggingFace lingo). You can find a list of all such models here https://huggingface.co/models?filter=causal-lm. 

(6 pts) **Your first task is to use the Pipeline interface to get generation output below for at least two different 'causal-lm' models (One of these two can be a different version of GPT2, but make sure at least one is a non-gpt family language model)**

In [3]:
## YOUR CODE HERE FOR MODEL 1
generator1 = pipeline('text-generation', model='distilgpt2', device=0) 
output1 = generator1('I wonder what I will generate?')
print(output1)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "I wonder what I will generate? For me, and for the rest of the world, I wonder if those on one side have the right to create so many different, and then I wonder what I will create.\n\nI haven't really gotten"}]


In [4]:
## YOUR CODE HERE FOR MODEL 2
generator2 = pipeline('text-generation', model='xlnet-base-cased', device=0) 
output2 = generator2('I wonder what I will generate?')
print(output2)

[{'generated_text': 'I wonder what I will generate? If I don’t have a net about what will happen to a net? Are there any chances for me to do something not good for the net? Will it be in the future of that web? How in the world is it going to go with any technology? The net might be a good thing for this internet. We have a whole galaxy of things. The web will be'}]


## 2.2 Dissecting the Pipeline
Now that was easy!

As beautiful and easy as the Pipeline interface is, we want to know what's going on under the hood!

There are four main steps to a text generation pipeline:
1. (Tokenize) Turn the raw input text into a vector of integer token IDs using a tokenizer

2. (Encode) Feed those token IDs into the language model by querying for each token's embedding in the model's embedding matrix (the "encoder") and then feed the "encoded" sequence into the decoder module

3. (Decode) The decoder will output logits (unnormalized scores over all possible integer token IDs, which a softmax turns into a probability distribution) and we sample from that distribution to get our next token -- repeat until the EOS token is generated or we hit max_length

4. (Detokenize) Take the output sequence of integer token IDs and turn it back into text with the tokenizer
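
To make the four steps concrete before looking at the real library calls, here is a toy end-to-end sketch in plain Python. It is only an illustration under loud assumptions: the six-word `VOCAB`, the whitespace `tokenize` function, and especially `toy_logits` (a hard-coded stand-in for the language model) are all invented for this example and have nothing to do with GPT-2's real tokenizer or weights.

```python
# A tiny made-up vocabulary; index 0 plays the role of the EOS token.
VOCAB = ["<eos>", "hello", "how", "are", "you", "?"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}

def tokenize(text):
    """Step 1: turn raw text into integer token IDs (whitespace split only)."""
    return [TOKEN_TO_ID[tok] for tok in text.split()]

def toy_logits(token_ids):
    """Steps 2-3 stand-in for the model: give a high score to the token
    that follows the last input token in VOCAB order."""
    nxt = (token_ids[-1] + 1) % len(VOCAB)
    return [5.0 if i == nxt else 0.0 for i in range(len(VOCAB))]

def generate(token_ids, max_length=10, eos_id=0):
    """Step 3: greedily append the highest-scoring token until EOS or max_length."""
    ids = list(token_ids)
    while len(ids) < max_length:
        logits = toy_logits(ids)
        next_id = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids

def detokenize(ids):
    """Step 4: turn token IDs back into text."""
    return " ".join(VOCAB[i] for i in ids)

input_ids = tokenize("hello how")
output_ids = generate(input_ids)
print(output_ids)
print(detokenize(output_ids))
```

Note that the generated sequence starts with the prompt's own token IDs, just as the output of `model.generate` does in the cells that follow.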

Below you'll see how HuggingFace does this:

First we have to initialize both the tokenizer and the model from their pre-trained checkpoints. Note that the tokenizer has to match the model.

In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained('gpt2').cuda()

In [6]:
#### Step 1: Tokenize the input into integer token IDs
inputs = tokenizer.encode("Hello, how are you?", return_tensors='pt').to(model.device)
print("Input Token IDs: " + str(inputs))

Input Token IDs: tensor([[15496,    11,   703,   389,   345,    30]], device='cuda:0')


In [7]:
#### Step 2 and 3: Feed in the integer token IDs and get out a sequence of token IDs as output
outputs = model.generate(inputs)
print("Output Token IDs: " + str(outputs))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output Token IDs: tensor([[15496,    11,   703,   389,   345,    30,   198,   198,    40,  1101,
           257,  1310,  1643,   286,   257, 34712,    13,   314,  1101,   257]],
       device='cuda:0')


In [8]:
#### Step 4: Detokenize the output token IDs back into text
output_text = [tokenizer.decode(x) for x in outputs]
print("Output Text: " + str(output_text))

Output Text: ["Hello, how are you?\n\nI'm a little bit of a nerd. I'm a"]


Now that you have dissected the pipeline, it's time to play with some common parameters!

[Check out this demo notebook from HuggingFace](https://github.com/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb) for a good overview of the different generation parameters and what they do (with example code!).

The full documentation on all of the parameters you can use in the generate function can be found [here](https://huggingface.co/transformers/main_classes/model.html#transformers.generation_utils.GenerationMixin.generate)

As an example, below we have a call to generate that:
- randomly samples from the top 50 words in the output distribution (rather than just greedily picking the best one every time)
- downweights the probability of all previously generated tokens by a factor of 1.2 (to prevent repetition)
- goes on for 512 tokens, because it's more interesting
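
For intuition about the first two settings, here is a rough plain-Python sketch of a repetition penalty followed by top-k sampling. The function names and toy logits are hypothetical, and this is a simplification rather than the HuggingFace source; the penalty rule shown (divide positive logits by the penalty, multiply negative ones) is intended to mirror how the library treats `repetition_penalty`.

```python
import math
import random

def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Make previously generated tokens less likely."""
    out = list(logits)
    for tok in set(generated_ids):
        # Dividing a positive logit or multiplying a negative one both lower the score.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

def top_k_filter(logits, k):
    """Keep the k largest logits and set the rest to -inf (ties at the cutoff are kept)."""
    cutoff = sorted(logits, reverse=True)[k - 1]
    return [x if x >= cutoff else float("-inf") for x in logits]

def sample(logits, rng):
    """Sample a token ID with probability proportional to exp(logit)."""
    m = max(logits)
    weights = [math.exp(x - m) for x in logits]  # exp(-inf) is 0.0
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

# Toy logits over a 5-token vocabulary; tokens 0 and 4 were already generated.
logits = [4.0, 3.0, 2.0, 1.0, -1.0]
penalized = apply_repetition_penalty(logits, generated_ids=[0, 4], penalty=1.2)
filtered = top_k_filter(penalized, k=2)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
draws = [sample(filtered, rng) for _ in range(100)]
print(penalized)
print(filtered)
print(sorted(set(draws)))
```

With `k=2` only the two highest-scoring tokens can ever be sampled, no matter how many draws are made.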

In [9]:
inputs = tokenizer.encode("Hello, how are you?", return_tensors='pt').to(model.device)
outputs = model.generate(
      inputs,
      do_sample=True,          # Randomly sample from the logits instead of greedily picking next word with highest probability
      top_k=50,                 # Only sample from the top 50 most likely words
      repetition_penalty=1.2,    # Downweights the probability of all previously generated tokens by a factor of 1.2
      max_length=512          # Generate for a maximum of 512 tokens
  )
print([tokenizer.decode(x) for x in outputs][0])


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Hello, how are you? When did this happen?" she asked.
"It was shortly after I arrived at the restaurant two nights ago," Kana told me last week in a letter from North Korea to New York police about her life and future situation on Wall Street as part of an effort by his father-inlaw, Kim Jong Il (Kim Kardashian), against both East Asia's powerful central banks that would face sanctions due to their actions leading up into September 12th 2011 through June 24rd 2012 which had been ordered according what many Korean saw first hand when they heard it happening during Operation Enduring Freedom held around 30 times over 2 weeks earlier every year for four years between late 2005; "this is where we came here today". My family made arrangements with friends or relatives involved with foreign companies under my name not to travel internationally but just so people knew who were working if there ever existed anyone like them living out these same problems - those very questions became one thing

**Your job is to provide two different examples of generation output from GPT-2 with different choices of generation parameters. You must also provide a 1-2 sentence explanation of what these parameters do and how they affect your output**

Feel free to get creative with this! Really poke around and try to find the combination of settings that gives you the best sounding text! The ways in which these parameters affect how 'human-like' a section of generated text sounds is an area of active research. :)

In [10]:
## YOUR CODE HERE FOR HYPERPARAMETER VARIATION 1
inputs = tokenizer.encode("Hello, how are you?", return_tensors='pt').to(model.device)
outputs = model.generate(
      inputs,
      max_length=512,          # Generate for a maximum of 512 tokens
      num_beams=5,             # Keep the 5 most likely hypotheses at each time step
      no_repeat_ngram_size=2   # No 2-gram may appear more than once
  )
print([tokenizer.decode(x) for x in outputs][0])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Hello, how are you?

"I'm fine," I said. "It's just that I don't know what to do with myself. I'm not sure if I'll ever be able to get out of here. It's been a long time since I've been here, and I can't even remember how long it took me to find my way back to my home."
... "Well, I guess you're right," he said, "I guess it's a little bit of a shock to hear that, but I think you've got a lot of things to worry about, right? I mean, there's no way I could ever get back home without you. You're the only person I know who's ever been to a place like this before. So, what do you think? Do you want to go back? Or is there something else you'd like to talk to me about?" He looked at me for a moment, then looked back at the door. He didn't say a word. Then he turned and walked back into the living room, where I was sitting on the couch, staring at my phone. The phone was still ringing, so I turned it off and went to turn it back on again. As I did so, the phone started to ring again, which ma

(4 pts) Beam search with `num_beams=5` keeps the 5 most likely hypotheses at each step and returns the highest-scoring one, but it is likely to repeat the same text again and again. Setting `no_repeat_ngram_size=2` as above ensures that no 2-gram appears more than once, which reduces that repetition.
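
One way to picture what a no-repeat n-gram constraint does is to compute, at each step, which next tokens would recreate an n-gram that already appears in the generated sequence. The helper below is a hypothetical sketch of that idea in plain Python (the name `banned_next_tokens` and the toy token IDs are made up); the real library applies the equivalent constraint to the logits of every beam.

```python
def banned_next_tokens(generated_ids, ngram_size):
    """Return the set of token IDs that would complete an n-gram
    already present in generated_ids."""
    if len(generated_ids) < ngram_size:
        return set()
    # The last (n - 1) tokens form the prefix of the n-gram we are about to finish.
    prefix = tuple(generated_ids[len(generated_ids) - (ngram_size - 1):])
    banned = set()
    for i in range(len(generated_ids) - ngram_size + 1):
        ngram = tuple(generated_ids[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned

# Toy sequence: the 2-gram (5, 7) already occurred, and the sequence ends in 5,
# so generating 7 next would repeat it.
history = [5, 7, 9, 5, 7, 2, 5]
print(banned_next_tokens(history, ngram_size=2))
print(banned_next_tokens([1, 2, 3, 1, 2], ngram_size=3))
```

During generation, the scores of the banned tokens would be set to negative infinity so that they can never be chosen.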

In [11]:
## YOUR CODE HERE FOR HYPERPARAMETER VARIATION 2
inputs = tokenizer.encode("Hello, how are you?", return_tensors='pt').to(model.device)
outputs = model.generate(
      inputs,
      max_length=512,          # Generate for a maximum of 512 tokens
      do_sample=True, 
      top_p=0.95
  )
print([tokenizer.decode(x) for x in outputs][0])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Hello, how are you? You are a friend of mine, my mother-in-law. I'm going to be here forever, just a little bit before you die. But wait a minute, my name is Don and I'm here for you, to be the new, beautiful woman."

Don nodded.

She took a deep breath and spoke softly, "No one ever told me, I've been through so many people I've turned into the other side of the world. I love you dearly, but I know the love isn't worth anything, and if they think I'm a jerk, let me tell them I'm a good person and they should know that."

He sighed. "You're a good person, man. I really appreciate that. So why don't we try a bit of magic in your life? I was supposed to take you from me, to save you from that world, but that didn't happen. It could be better for us to get some rest in the meantime, maybe a little bit of time, you know?"

They spoke like they were separated, Don not having told the whole story. It was a new kind of friendship, of the true friends who have loved each other for years. Don f

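(4 pts) With `do_sample=True`, the model samples the next token instead of always picking the most likely one, and `top_k` keeps its default value of 50. Setting `top_p=0.95` additionally restricts sampling to the smallest set of most probable tokens whose cumulative probability reaches 0.95 (nucleus sampling), so the number of candidate tokens grows or shrinks depending on how confident the model is.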
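
The top-p rule described above can be sketched as follows. This is a simplified, assumption-laden illustration (the function name `top_p_filter` and the toy probabilities are invented, and the library's actual implementation works on logits and differs in detail), but it shows the core idea of keeping the smallest high-probability set.

```python
def top_p_filter(probs, top_p):
    """Return the token IDs in the smallest set of most probable tokens
    whose cumulative probability reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept

def renormalize(probs, kept):
    """Rescale the kept probabilities so that they sum to 1 again."""
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

# Toy next-token distribution over a 5-token vocabulary.
probs = [0.5, 0.25, 0.125, 0.09375, 0.03125]
kept = top_p_filter(probs, top_p=0.9)
print(kept)
print(renormalize(probs, kept))
```

When the model is confident (one token has most of the probability mass), very few tokens survive the filter; when the distribution is flat, many do.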

## 2.3 Fine-Tuning GPT-2
Okay now time for the best part!

Generating general-purpose text from pre-trained models is great, but what if we want our text to be in a specific genre or style? Luckily for us, the GPT family of models uses the idea of "transfer learning": using knowledge gained from one problem (or training setting) and applying it to another area or domain. The idea of transfer learning for NLP is that we can train a language model on general texts and then adapt it to a specific task or domain that we're interested in. This process is also called **fine-tuning**.

In this section we'll walk you through an example of using HuggingFace to fine-tune GPT-2 and then you'll be asked to fine-tune GPT-2 on two datasets of your own choosing!

### Fine-Tuning Example using HuggingFace Datasets library: Crime and Punishment

For our fine-tuning example we're going to train GPT-2 to mimic the style of Fyodor Dostoevsky's novel "Crime and Punishment"

We will be downloading our data using the HuggingFace [Datasets](https://huggingface.co/docs/datasets/) library.

In [12]:
!pip install datasets



In [3]:
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling
import datasets
from datasets import load_dataset, list_datasets

### Step 1: Initialize a Brand New GPT-2 Model and Tokenizer

In [4]:
tokenizer = AutoTokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained('gpt2').cuda()
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

### Step 2: Load the text of "Crime and Punishment" and tokenize it

The 'load_dataset' function queries for a dataset with a certain tag and downloads the corresponding data from HuggingFace's hosting site. This allows us to download all sorts of datasets through the same interface!

The documentation for load_dataset can be found [here](https://huggingface.co/docs/datasets/package_reference/loading_methods.html#datasets.load_dataset)

Here we take our tokenizer and run it on the entirety of Crime and Punishment in a single batch by using map on our custom encode function.
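
The `truncation=True` and `padding=True` options used in the next cell make every line in the batch the same length. A minimal sketch of what that means, assuming right-side padding and using GPT-2's EOS ID 50256 as the pad token (as set up in Step 1), might look like this; `pad_batch` is a hypothetical helper, not a HuggingFace function.

```python
def pad_batch(sequences, pad_id, max_length=None):
    """Truncate each sequence to max_length (if given), then right-pad all
    sequences to the length of the longest one."""
    if max_length is not None:
        sequences = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two toy "lines" of different lengths (the token IDs are only for illustration).
batch = pad_batch([[15496, 11, 703], [40]], pad_id=50256)
print(batch["input_ids"])
print(batch["attention_mask"])
```

The `attention_mask` is what allows the model to tell real tokens apart from padding, which is why the next cell keeps both `input_ids` and `attention_mask` as columns.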

In [13]:
def encode(batch):
    # Strip newline characters from both ends of each line, then tokenize with truncation and padding
    return tokenizer([x.strip('\n\r') for x in batch['line']], truncation=True, padding=True)

crime_and_punishment = load_dataset('crime_and_punish', split='train')
processed = crime_and_punishment.map(encode, batched=True, batch_size=len(crime_and_punishment))
processed.set_format('torch', columns=['input_ids', 'attention_mask'])

Reusing dataset crime_and_punish (/root/.cache/huggingface/datasets/crime_and_punish/crime-and-punish/1.0.0/87ec36ba9cb8741325bea3e40a6d4525210c8f8ef13e7b07872fe32eb72c13ac)
Loading cached processed dataset at /root/.cache/huggingface/datasets/crime_and_punish/crime-and-punish/1.0.0/87ec36ba9cb8741325bea3e40a6d4525210c8f8ef13e7b07872fe32eb72c13ac/cache-afeb316fb7f9e767.arrow


### Step 3: Initialize the Trainer

The 'Trainer' module is the main way we perform fine-tuning. In order to initialize a Trainer, you need a model, tokenizer, TrainingArguments, your training data (in a Dataset object) and something called a data_collator (which batches the examples and, with `mlm=False`, builds the labels from the input tokens themselves, so the Trainer does not need a separate vector of labels). 
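
To see why no separate label vector is needed for causal language modeling, here is a hedged sketch of the idea behind a collator with `mlm=False`: the labels are a copy of the input IDs, with padding positions replaced by -100 so that the loss ignores them. The helper name `causal_lm_labels` is made up, and this plain-Python version only approximates what the library does with tensors.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips

def causal_lm_labels(input_ids, pad_id):
    """Copy the inputs as labels, masking out padding positions."""
    return [
        [tok if tok != pad_id else IGNORE_INDEX for tok in seq]
        for seq in input_ids
    ]

# Toy padded batch, using 50256 (GPT-2's EOS ID) as the pad token.
input_ids = [[15496, 11, 703], [40, 50256, 50256]]
labels = causal_lm_labels(input_ids, pad_id=50256)
print(labels)
```

The shift by one position (predicting token t+1 from tokens up to t) happens inside the model when it computes the loss, so the labels can simply mirror the inputs.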

In [16]:
training_args = TrainingArguments(
    output_dir='/content/',
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    logging_steps=100,
    weight_decay=0.01,
    logging_dir='./logs',
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    data_collator=data_collator,
    train_dataset=processed,
)

### Step 4: Fine-Tune the Model!

Now we're done! All we have to do is hit run and sit back!

In [17]:
trainer.train()

Step,Training Loss
100,4.0013
200,3.6974
300,3.6628
400,3.6354
500,3.592
600,3.591
700,3.5361
800,3.5011
900,3.5077
1000,3.4878


TrainOutput(global_step=1374, training_loss=3.5788374766165165, metrics={'train_runtime': 379.7802, 'train_samples_per_second': 3.618, 'total_flos': 574101809809920.0, 'epoch': 1.0, 'init_mem_cpu_alloc_delta': 0, 'init_mem_gpu_alloc_delta': 0, 'init_mem_cpu_peaked_delta': 0, 'init_mem_gpu_peaked_delta': 0, 'train_mem_cpu_alloc_delta': -558952448, 'train_mem_gpu_alloc_delta': 1498660864, 'train_mem_cpu_peaked_delta': 575156224, 'train_mem_gpu_peaked_delta': 1100962304})

### Step 5: Save the Model and use it to Generate!

Save your fine-tuned model and compare its output with regular GPT-2's output to see the difference for yourself!

In [18]:
trainer.save_model('./dostoevskypt2')

In [19]:
dostoevskypt2 = pipeline('text-generation', model='./dostoevskypt2', device=0)
gpt2 = pipeline('text-generation', model='gpt2', device=0)

In [20]:
print(dostoevskypt2('Saint Petersburg is'))
print(gpt2('Saint Petersburg is'))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Saint Petersburg is no stranger to such men. One can hardly believe that you, _Pigmatism_. One knows that Porfiry Zagorsky has a great deal in common with Ivan Ilya Petrovitch and that Katerina'}]
[{'generated_text': "Saint Petersburg is home again for a very strong first-ever game and you should make no mistake about that. Their defense is still very good, but you have to be careful with how they run things now, or it's going to get worse."}]


## PERPLEXITY

(3 pts) Following the guide [here](https://huggingface.co/transformers/perplexity.html), compute the perplexity of the pre-trained GPT-2 model on the Wikipedia test set (you can keep the same hyperparameters as in the link).
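
As a reminder of what is being computed, perplexity is the exponential of the average negative log-likelihood per token. The toy sketch below shows that definition on made-up log-probabilities; `perplexity` is a hypothetical helper, not part of any library.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood over all predicted tokens."""
    nll = -sum(token_log_probs)
    return math.exp(nll / len(token_log_probs))

# A model that assigns probability 1/4 to every token is as "confused"
# as a uniform choice among 4 options.
uniform_log_probs = [math.log(0.25)] * 10
print(perplexity(uniform_log_probs))

# A model that is more confident about the correct tokens gets a lower perplexity.
confident_log_probs = [math.log(0.9)] * 10
print(perplexity(confident_log_probs))
```

Lower perplexity means the model finds the test text less surprising.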

In [21]:
!pip install nlp


In [4]:
import torch
from tqdm import tqdm
from nlp import load_dataset as load_data
# torch.cuda.empty_cache()

In [23]:
## YOUR CODE HERE - FOR COMPUTING PERPLEXITY OF GPT2 ON WIKIPEDIA TEST SET 
tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained('gpt2').cuda()
test = load_data('wikitext', 'wikitext-2-raw-v1', split='test')
encodings = tokenizer('\n\n'.join(test['text']), return_tensors='pt')

max_length = model.config.n_positions
stride = 512

lls = []
for i in tqdm(range(0, encodings.input_ids.size(1), stride)):
    begin_loc = max(i + stride - max_length, 0)
    end_loc = min(i + stride, encodings.input_ids.size(1))
    trg_len = end_loc - i    # may be different from stride on last loop
    input_ids = encodings.input_ids[:,begin_loc:end_loc].to(model.device)
    target_ids = input_ids.clone()
    target_ids[:,:-trg_len] = -100

    with torch.no_grad():
      outputs = model(input_ids, labels=target_ids)
      log_likelihood = outputs[0] * trg_len

    lls.append(log_likelihood)

ppl = torch.exp(torch.stack(lls).sum() / end_loc)

Downloading and preparing dataset wikitext/wikitext-2-raw-v1 (download: 4.50 MiB, generated: 12.91 MiB, post-processed: Unknown sizetotal: 17.41 MiB) to /root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/8e456126357b4411737ead54576f99321fc077a0d4b64e4a724ab3454ba5b730...



Dataset wikitext downloaded and prepared to /root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/8e456126357b4411737ead54576f99321fc077a0d4b64e4a724ab3454ba5b730. Subsequent calls will reuse this data.


Token indices sequence length is longer than the specified maximum sequence length for this model (287644 > 1024). Running this sequence through the model will result in indexing errors
100%|██████████| 562/562 [01:04<00:00,  8.72it/s]


In [24]:
ppl

tensor(25.1705, device='cuda:0')

> PERPLEXITY: 25.1705
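
The windowing arithmetic in the loop above can be checked without a model. The following is a minimal pure-Python sketch of that arithmetic only; the helper names `strided_windows` and `perplexity_from_windows` are hypothetical and not part of HuggingFace, and the defaults of 1024 and 512 are taken from the warning and the `stride` value in the cell above.

```python
import math

def strided_windows(n_tokens, max_length=1024, stride=512):
    """Yield (begin_loc, end_loc, trg_len) the same way the loop above computes them."""
    for i in range(0, n_tokens, stride):
        begin_loc = max(i + stride - max_length, 0)
        end_loc = min(i + stride, n_tokens)
        trg_len = end_loc - i  # tokens actually scored in this window
        yield begin_loc, end_loc, trg_len

def perplexity_from_windows(mean_nlls, trg_lens):
    """Combine per-window mean NLLs into one perplexity: exp(total NLL / tokens scored)."""
    total_nll = sum(nll * n for nll, n in zip(mean_nlls, trg_lens))
    return math.exp(total_nll / sum(trg_lens))

windows = list(strided_windows(287644))  # token count from the tokenizer warning above
print(len(windows))   # 562, matching the progress bar
print(windows[0])     # (0, 512, 512)
print(windows[-1])    # (286720, 287644, 412): the last window scores fewer tokens
```

Because the `trg_len` values sum to the total token count, dividing by the final `end_loc` in the notebook code is the same as dividing by the number of tokens scored.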

(2 pts) Compute the perplexity of the dostoevskypt2 model on the Wikipedia test set




In [25]:
## YOUR CODE HERE - FOR COMPUTING PERPLEXITY OF DOSTOEVSKYPT2 ON WIKIPEDIA TEST SET
tokenizer = AutoTokenizer.from_pretrained('./dostoevskypt2')
model = AutoModelForCausalLM.from_pretrained('./dostoevskypt2').cuda()

In [26]:
encodings = tokenizer('\n\n'.join(test['text']), return_tensors='pt')
max_length = model.config.n_positions

lls = []
for i in tqdm(range(0, encodings.input_ids.size(1), stride)):
    begin_loc = max(i + stride - max_length, 0)
    end_loc = min(i + stride, encodings.input_ids.size(1))
    trg_len = end_loc - i    # may be different from stride on last loop
    input_ids = encodings.input_ids[:,begin_loc:end_loc].to(model.device)
    target_ids = input_ids.clone()
    target_ids[:,:-trg_len] = -100

    with torch.no_grad():
      outputs = model(input_ids, labels=target_ids)
      log_likelihood = outputs[0] * trg_len

    lls.append(log_likelihood)

ppl = torch.exp(torch.stack(lls).sum() / end_loc)
ppl

100%|██████████| 562/562 [01:05<00:00,  8.52it/s]


tensor(67.4607, device='cuda:0')

> PERPLEXITY: 67.4607

(2 pts) Compute the perplexity of the GPT2 pre-trained model on the Crime and Punishment train dataset

In [27]:
## YOUR CODE HERE - FOR COMPUTING PERPLEXITY OF GPT2 ON CRIME AND PUNISHMENT TRAIN DATASET 
tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained('gpt2').cuda()
cap = [x.strip('\n\r') for x in crime_and_punishment['line']]
encodings = tokenizer('\n\n'.join(cap), return_tensors='pt')


In [28]:
max_length = model.config.n_positions

lls = []
for i in tqdm(range(0, encodings.input_ids.size(1), stride)):
    begin_loc = max(i + stride - max_length, 0)
    end_loc = min(i + stride, encodings.input_ids.size(1))
    trg_len = end_loc - i    # may be different from stride on last loop
    input_ids = encodings.input_ids[:,begin_loc:end_loc].to(model.device)
    target_ids = input_ids.clone()
    target_ids[:,:-trg_len] = -100

    with torch.no_grad():
      outputs = model(input_ids, labels=target_ids)
      log_likelihood = outputs[0] * trg_len

    lls.append(log_likelihood)

ppl = torch.exp(torch.stack(lls).sum() / end_loc)
ppl

100%|██████████| 654/654 [01:17<00:00,  8.45it/s]


tensor(21.4867, device='cuda:0')

> PERPLEXITY: 21.4867

(2 pts) Compute the **train** perplexity of the **dostoevskypt2** model 




In [29]:
## YOUR CODE HERE - FOR COMPUTING PERPLEXITY OF DOSTOEVSKYPT2 ON CRIME AND PUNISHMENT TRAIN DATASET 
tokenizer = AutoTokenizer.from_pretrained('./dostoevskypt2')
model = AutoModelForCausalLM.from_pretrained('./dostoevskypt2').cuda()
encodings = tokenizer('\n\n'.join(cap), return_tensors='pt')
max_length = model.config.n_positions

lls = []
for i in tqdm(range(0, encodings.input_ids.size(1), stride)):
    begin_loc = max(i + stride - max_length, 0)
    end_loc = min(i + stride, encodings.input_ids.size(1))
    trg_len = end_loc - i    # may be different from stride on last loop
    input_ids = encodings.input_ids[:,begin_loc:end_loc].to(model.device)
    target_ids = input_ids.clone()
    target_ids[:,:-trg_len] = -100

    with torch.no_grad():
      outputs = model(input_ids, labels=target_ids)
      log_likelihood = outputs[0] * trg_len

    lls.append(log_likelihood)

ppl = torch.exp(torch.stack(lls).sum() / end_loc)
ppl

100%|██████████| 654/654 [01:18<00:00,  8.29it/s]


tensor(20.2664, device='cuda:0')

> PERPLEXITY: 20.2664

(2 pts) Compute the perplexity of the GPT2 model on your Pride and Prejudice text




In [None]:
## YOUR CODE HERE - FOR COMPUTING PERPLEXITY OF GPT2 ON PRIDE AND PREJUDICE TEXT 
with open('/content/drive/MyDrive/Colab Notebooks/prideAndPrejudice.txt','r') as f:
  text = [line.rstrip('\n') for line in f]

In [None]:
tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained('gpt2').cuda()
encodings = tokenizer('\n\n'.join(text), return_tensors='pt')

In [None]:
max_length = model.config.n_positions

lls = []
for i in tqdm(range(0, encodings.input_ids.size(1), stride)):
    begin_loc = max(i + stride - max_length, 0)
    end_loc = min(i + stride, encodings.input_ids.size(1))
    trg_len = end_loc - i    # may be different from stride on last loop
    input_ids = encodings.input_ids[:,begin_loc:end_loc].to(model.device)
    target_ids = input_ids.clone()
    target_ids[:,:-trg_len] = -100

    with torch.no_grad():
      outputs = model(input_ids, labels=target_ids)
      log_likelihood = outputs[0] * trg_len

    lls.append(log_likelihood)

ppl = torch.exp(torch.stack(lls).sum() / end_loc)
ppl

> PERPLEXITY: 27.8958

(2 pts) Compute the perplexity of the dostoevskypt2 model on your Pride and Prejudice text




In [None]:
## YOUR CODE HERE - FOR COMPUTING PERPLEXITY OF DOSTOEVSKYPT2 ON PRIDE AND PREJUDICE TEXT 
tokenizer = AutoTokenizer.from_pretrained('./dostoevskypt2')
model = AutoModelForCausalLM.from_pretrained('./dostoevskypt2').cuda()
encodings = tokenizer('\n\n'.join(text), return_tensors='pt')

max_length = model.config.n_positions

lls = []
for i in tqdm(range(0, encodings.input_ids.size(1), stride)):
    begin_loc = max(i + stride - max_length, 0)
    end_loc = min(i + stride, encodings.input_ids.size(1))
    trg_len = end_loc - i    # may be different from stride on last loop
    input_ids = encodings.input_ids[:,begin_loc:end_loc].to(model.device)
    target_ids = input_ids.clone()
    target_ids[:,:-trg_len] = -100

    with torch.no_grad():
      outputs = model(input_ids, labels=target_ids)
      log_likelihood = outputs[0] * trg_len

    lls.append(log_likelihood)

ppl = torch.exp(torch.stack(lls).sum() / end_loc)
ppl

> PERPLEXITY: 41.3754

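1. (1 pt) Which model performs better on Wikipedia text?
   > GPT2 performs better on Wikipedia text: its perplexity (25.17) is much lower than that of dostoevskypt2 (67.46).
2. (1 pt) Which model performs better on Pride and Prejudice text?
   > GPT2 also performs better on Pride and Prejudice text, with a perplexity of 27.90 versus 41.38 for dostoevskypt2.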
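
As a quick sanity check on these answers, the sketch below collects the perplexities reported in the cells above and picks the lower one per dataset (lower perplexity means the model finds the text less surprising). The helper name `best_model` is hypothetical.

```python
# Perplexities reported in the cells above (lower is better).
reported = {
    "Wikipedia test": {"gpt2": 25.1705, "dostoevskypt2": 67.4607},
    "Crime and Punishment train": {"gpt2": 21.4867, "dostoevskypt2": 20.2664},
    "Pride and Prejudice": {"gpt2": 27.8958, "dostoevskypt2": 41.3754},
}

def best_model(scores):
    """Return the model name with the lowest perplexity."""
    return min(scores, key=scores.get)

for dataset, scores in reported.items():
    print(f"{dataset}: {best_model(scores)}")
```

The fine-tuned model only comes out ahead on the text it was fine-tuned on, and by a small margin.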

### Now It's Your Turn!

**Your job is to fine-tune GPT2 one more time with your choice of fine-tuning dataset.**

***For the fine-tuned model you create, you should clearly demonstrate (through visible generation outputs and analysis) that your fine-tuned model follows the desired style better than vanilla GPT2.***

Please make sure to give a brief description of the dataset you chose.

In order to see which datasets are available for download, run the cell below. Pick one that you think would be interesting!

In [30]:
datasets_list = list_datasets()
print(', '.join(dataset for dataset in datasets_list))

acronym_identification, ade_corpus_v2, adversarial_qa, aeslc, afrikaans_ner_corpus, ag_news, ai2_arc, air_dialogue, ajgt_twitter_ar, allegro_reviews, allocine, alt, amazon_polarity, amazon_reviews_multi, amazon_us_reviews, ambig_qa, amttl, anli, app_reviews, aqua_rat, aquamuse, ar_cov19, ar_res_reviews, ar_sarcasm, arabic_billion_words, arabic_pos_dialect, arabic_speech_corpus, arcd, arsentd_lev, art, arxiv_dataset, aslg_pc12, asnq, asset, assin, assin2, atomic, autshumato, babi_qa, banking77, bbc_hindi_nli, bc2gm_corpus, best2009, bianet, bible_para, big_patent, billsum, bing_coronavirus_query_set, biomrc, blended_skill_talk, blimp, blog_authorship_corpus, bn_hate_speech, bookcorpus, bookcorpusopen, boolq, bprec, break_data, brwac, bsd_ja_en, bswac, c3, c4, cail2018, caner, capes, catalonia_independence, cawac, cbt, cc100, cc_news, ccaligned_multilingual, cdsc, cdt, cfq, chr_en, cifar10, cifar100, circa, civil_comments, clickbait_news_bg, climate_fever, clinc_oos, clue, cmrc2018, cnn_

### Tips
- Most of the datasets hosted by HuggingFace are not meant for Causal LM fine-tuning. Make sure you preprocess them accordingly if you want to use them.
- To check out information about a dataset hosted by HuggingFace, you can use [this web viewer](https://huggingface.co/datasets/viewer/?dataset=crime_and_punish). Try to avoid downloading a dataset that's too big!
- You will likely have to change the custom 'encode' function for each new dataset you want to fine-tune on. You need to change batch['line'] to instead index with the correct column label for your specific dataset (it probably won't be called 'line').
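
One way to avoid rewriting `encode` for every dataset is to pass the column label in as a parameter. The sketch below is only an illustration: `make_encode` is a hypothetical helper, and `toy_tokenizer` is a stand-in so the example runs without downloading a model. In the notebook you would pass the real GPT2 tokenizer instead.

```python
def make_encode(tokenizer, column):
    """Build an `encode` function that tokenizes the given text column of a batch."""
    def encode(batch):
        texts = [x.strip('\n\r') for x in batch[column]]
        return tokenizer(texts, truncation=True, padding=True)
    return encode

# Stand-in tokenizer (assumption): splits on whitespace instead of using GPT2's vocabulary.
def toy_tokenizer(texts, truncation=True, padding=True):
    return {"input_ids": [t.split() for t in texts]}

encode = make_encode(toy_tokenizer, column="sentence")
print(encode({"sentence": ["The ball rolled away.\n", "She ran home.\r\n"]}))
```

The returned function can then be passed to the dataset's `map` call just like the hand-written `encode`.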

### Useful Links
[load_dataset Documentation](https://huggingface.co/docs/datasets/package_reference/loading_methods.html#datasets.load_dataset)

[Trainer Documentation](https://huggingface.co/transformers/main_classes/trainer.html#id1)

[Example: Fine-Tuning BERT for Esperanto](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb#scrollTo=zTgWPa9Dipk2)

[Example: Fine-Tuning for IMDb Classification](https://colab.research.google.com/drive/1-JIJlao4dI-Ilww_NnTc0rxtp-ymgDgM?usp=sharing#scrollTo=5DEWNilys9Ty)


#### Dataset \#1

In [5]:
## YOUR CODE HERE - FOR FINE-TUNING GPT2 ON DATASET 
tokenizer = AutoTokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained('gpt2').cuda()
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

In [12]:
def encode(batch):
    # Tokenize the 'sentence' column after stripping leading/trailing newline characters
    return tokenizer([x.strip('\n\r') for x in batch['sentence']], truncation=True, padding=True)
limit = load_dataset('limit', split='train')
limit = limit.remove_columns(['id', 'motion','motion_entities'])
processed = limit.map(encode, batched=True, batch_size=len(limit))
processed.set_format('torch', columns=['input_ids', 'attention_mask'])

Using custom data configuration default
Reusing dataset limit (/root/.cache/huggingface/datasets/limit/default/1.0.0/22ff29c4830e016ecdcd2959050a5140aab92edf492ebfd9a97f398441ad7c22)






In [14]:
training_args = TrainingArguments(
    output_dir='/content/',
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    logging_steps=100,
    weight_decay=0.01,
    logging_dir='./logs',
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    data_collator=data_collator,
    train_dataset=processed,
)

In [15]:
trainer.train()

| Step | Training Loss |
|------|---------------|
| 100  | 4.2373 |
| 200  | 4.1275 |
| 300  | 4.0773 |
| 400  | 4.0575 |
| 500  | 4.0632 |
| 600  | 4.0377 |
| 700  | 3.9688 |
| 800  | 3.9911 |
| 900  | 3.9841 |
| 1000 | 3.9737 |


TrainOutput(global_step=1473, training_loss=4.0259538224499325, metrics={'train_runtime': 1206.4177, 'train_samples_per_second': 1.221, 'total_flos': 3025491114645504.0, 'epoch': 1.0, 'init_mem_cpu_alloc_delta': 0, 'init_mem_gpu_alloc_delta': 0, 'init_mem_cpu_peaked_delta': 0, 'init_mem_gpu_peaked_delta': 0, 'train_mem_cpu_alloc_delta': -295440384, 'train_mem_gpu_alloc_delta': 1500909568, 'train_mem_cpu_peaked_delta': 325840896, 'train_mem_gpu_peaked_delta': 6013088256})

(3 pts) I used the Literal-Motion-in-Text dataset, a large human-annotated collection of English sentences describing the physical occurrence of motion, with the physical entities in motion annotated.

In [17]:
trainer.save_model('./limit')
limitmodel = pipeline('text-generation', model='./limit', device=0)
gpt2 = pipeline('text-generation', model='gpt2', device=0)

In [22]:
for i in range(10):
  print(limitmodel('The ball'))

print()

for i in range(10):
  print(gpt2('The ball'))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

[{'generated_text': 'The ball struck him on the head. In his agony of agony, he sank back, his forehead sinking like a rag, while his voice was almost unintelligible. "Auld sir," he whispered, but he sat down and fell to his'}]


[{'generated_text': 'The ball was given to Nubal and the people went forth and the man who had carried it away, then went out into the street where the old lady sat and waited for a change. The old woman with the maid brought the ball and threw'}]


[{'generated_text': 'The ball struck against the wall and knocked out its fellow, giving her the advantage. Even if she had been saved, she would not escape the injury. "Did that not throw you off? For now you know that you are, I would say'}]


[{'generated_text': 'The ball is shown off again and the man scores more of his shots. The clip also shows a video of two female players kicking goalies and leads into the woman kicking a goal. The player scores and a man walks out. The man then wins'}]


[{'generated_text': 'The ball was rolling up to him, and he hit it. The ball bounced back as he struck it. He put his foot on it. After some minutes he passed the ball and returned it. Several hundred yards forward the same player returned it,'}]


[{'generated_text': 'The ball rolled aside and the crowd applauded. The man in Black, who was talking with a hand on his shoulder, threw himself back and forth with excitement. The onlookers applauded him, and the man in Black began to raise his hand. The'}]


[{'generated_text': 'The ball struck home on the second pitch, and the captain rose and went to congratulate his mother. "You\'ve been a big success, captain, and I am very glad. Don\'t you know, we\'re going to set you up here to'}]


[{'generated_text': 'The ball went round the ring, the ring fell down, and the spectators began to dance on the floor and to hear the dancing.\n\nThe crowd was so small that there were scarcely ten men on the stage, but they looked to be more'}]


[{'generated_text': 'The ball is now on the bottom of the bench. A girl is kneeling in the middle of the stage. The girl turns red and grabs a ball. She grabs the ball and scores the game out with ten to two to beat her. She then'}]


[{'generated_text': 'The ball left the table, then it dropped to the floor by the foot of another table, and there it fell again. The ball struck the bottom of the two tables. When the floor was shaken a dozen or more hits are made in the floor'}]



[{'generated_text': "The ball is always a little high in the air when we are doing this – it is pretty intense now because we are in touch with the game's best players in an age when there are so many exciting youngsters who will come round and play in Europe"}]


[{'generated_text': 'The ball will continue to settle at an ever faster pace over the summer and it is expected to start to break through, as well, with just as much speed as its predecessor and should have come into effect in the summer.'}]


[{'generated_text': 'The ball bounces from head to toe before it bounces to its next position, with the receiver bouncing back to the receiver and again the ball bounces back to the receiver.\n\n"That is what a receiver is," says Robson, who has spent'}]


[{'generated_text': "The ball is coming in and they're going to put our whole stadium on the line. It's tough. They're gonna use any available space for an action that makes sense for us. We just have to step that up against some of the worst"}]


[{'generated_text': "The ball doesn't hit the ground, but it moves out to full speed! When it's free, you keep it out of the way so that your other guys can jump in. It is an amazing weapon!\n\nThat leaves your opponents with"}]
[{'generated_text': 'The ball is going the way it has long been.'}]


[{'generated_text': 'The ball is a game changer for our football environment.\n\n"We\'ve got one of the nicest stadiums in the country but now we can get off to a terrific start. As an English team, Arsenal\'s been winning League One as'}]


[{'generated_text': 'The ball was in my lap and the thought of it getting to me, feeling the weight of the ball. I could tell my friends was there because I looked in his hands and he looked up to me. So it was just amazing. I have'}]


[{'generated_text': "The ball had gotten in his hand before he left the court and he hadn't touched it. But he had just been sitting there in pain. My eyes widened for a moment, and I couldn't help but think that maybe the guy hadn't touched"}]
[{'generated_text': 'The ball was held in his hands, and the headpiece was placed on her thigh. When the game was over, she had become the fourth maid in that wardage, and the last maid of her wardage. There were few who were less'}]


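(4 pts) Based on the outputs above, the fine-tuned model describes physical motion more frequently than vanilla GPT2. Its continuations tend to narrate concrete movement (e.g. "The ball struck him on the head", "The ball rolled aside"), whereas several vanilla GPT2 continuations drift into sports commentary or interview-style text (e.g. "The ball is a game changer for our football environment").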
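
To make this comparison less subjective, one could count how often the generations mention motion. The sketch below is a crude heuristic rather than a rigorous metric: `MOTION_WORDS` is a hypothetical, hand-picked word list, `motion_rate` is a hypothetical helper, and the four example sentences are only the opening sentences of a few samples above, chosen to show the mechanics. A fair comparison would run it over all twenty generations (or many more).

```python
import re

# Hypothetical, hand-picked motion words; a real analysis would need a better lexicon.
MOTION_WORDS = {"struck", "rolled", "bounced", "threw", "fell", "dropped", "went", "hit"}

def motion_rate(texts):
    """Fraction of texts containing at least one word from MOTION_WORDS."""
    def has_motion(text):
        return any(w in MOTION_WORDS for w in re.findall(r"[a-z']+", text.lower()))
    return sum(has_motion(t) for t in texts) / len(texts)

# Opening sentences copied from two fine-tuned and two vanilla GPT2 samples above.
fine_tuned = [
    "The ball struck him on the head.",
    "The ball rolled aside and the crowd applauded.",
]
vanilla = [
    "The ball is going the way it has long been.",
    "The ball is a game changer for our football environment.",
]
print(motion_rate(fine_tuned), motion_rate(vanilla))
```

With a fixed word list, inflected forms such as "bounces" are missed, so lemmatizing the text first would likely give a more reliable count.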
