
# From: Transformers and Text-Generation
by Liam Dugan (UPenn). 


Please write your answers and code in the cells with questions below. 

----------

For this homework, we will take ideas from the entire class: language models, text generation, vector-based word representations, syntactic analysis, and neural networks. We'll be using large, pre-trained language models to generate text, and studying how we can fine-tune these large language models to generate text in whatever genre and style we want!

In this assignment you will get:
1. An overview of what the "Transformer" architecture is and why it is particularly well suited for Natural Language Processing tasks
2. An introduction to the Generative Pretrained Transformer (GPT) family, which is a set of large-scale language models that can be used to generate text that often sounds like it was written by a human.
3. Experience with using the HuggingFace package to fine-tune these models to generate text that sounds like it comes from a specific source.

# Part 1: What is a Transformer? (Reading)
<figure align="center">
<img src="https://media.giphy.com/media/VeWllmR9zfaco/giphy.gif" />
<figcaption>(It's probably not this guy, right?)</figcaption>
</figure>

### The Transformer

The current state-of-the-art for a variety of natural language processing tasks belongs to the **Transformer** architecture, first published in June 2017 in the paper "Attention Is All You Need". 

The Transformer can be thought of as a big feed-forward network with every feed-forward layer containing something called an "attention module". 

>You might be wondering: why are we moving back to feed-forward networks after having so much success with recurrent neural networks and variants like LSTMs? Aren't RNNs naturally poised to handle sequences as their inputs? Well, as it turns out, the sequential nature of RNNs makes them really difficult to train in a distributed/parallel fashion. So while RNNs may seem like the more natural fit for sequences of inputs, non-recurrent networks such as the Transformer can process all positions of a sequence in parallel and can therefore be trained much faster, allowing orders of magnitude more training data to be used. 



### Reading \# 1 - What is a Transformer?

In order to get a good grasp on exactly *why* these models are so good it's important to understand what they are and how they work. 

Your first task for this homework is to read the blog post ["The Illustrated Transformer" by Jay Alammar](http://jalammar.github.io/illustrated-transformer/). This blog post explains the transformer architecture (and the all-important "Attention Module") with helpful visualizations and diagrams. 

**You should read this post very closely and understand exactly what the Transformer is and how it works. Once you're finished reading, answer the following questions in 2-3 sentences each.**

1. (2 pts) What is Self-Attention (at a high level)?

   > Self-attention is a way to create connections within the same sentence. Using this mechanism, the model can look for clues in other words of the sentence to understand a word's reference or meaning. For example, in "The cat stopped eating, because it was full," the word 'it' refers to the cat. A machine translation model can resolve this reference using self-attention.

2. (2 pts) How is Self-Attention computed?

   > Self-attention is computed using three vectors per token: the Query vector, the Key vector, and the Value vector. These vectors are created by multiplying the input embeddings by three weight matrices ($W^Q$, $W^K$, $W^V$) that are learned during training.
   > At a high level, the formula is $Z = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$, where $d_k$ is the dimension of the key vectors.



3. (2 pts) What do the "Query", "Key", and "Value" vectors encode (at a high level)?

   > These three vectors are abstractions that can be used to calculate self-attention. The Query Vector is a representation of the current word used to score against all the other words. Key vectors are like labels for all the words in the segment. They’re what we match against in our search for relevant words. Value vectors are actual word representations, once we’ve scored how relevant each word is, these are the values we add up to represent the current word.

4. (2 pts) What is an attention "head" and why should we use multiple heads?

   > If self-attention is computed several times in parallel, each time with its own Q, K, and V weight matrices, the procedure is called multi-headed attention; each set of Q/K/V matrices is an attention head, which produces its own self-attention output. Using multiple heads expands the model's ability to focus on different positions in the input. It also allows the model to jointly attend to information from different representation subspaces at different positions.

5. (2 pts) What are positional embeddings?

   > Positional embeddings are vectors that describe the position of each word in the original input. These vectors follow a specific pattern that the model learns, which lets it determine the position of each word and the distance between words in the sequence.

6. (2 pts) Why are positional embeddings important?

   > The position of each word in a sentence is too important a piece of information to leave out. To address this, the Transformer adds a positional vector to each input embedding in both the encoder and the decoder. The intuition is that adding these vectors provides meaningful distances between the embedding vectors once they're projected into Q/K/V vectors and used in dot-product attention.
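
To make the answers to questions 2 and 3 concrete, here is a minimal sketch of single-head scaled dot-product self-attention in plain Python. The toy embeddings and the identity weight matrices are made-up stand-ins for learned parameters (they are assumptions for illustration only); a real Transformer uses learned matrices, several heads, and batched tensor operations.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(vector, matrix):
    """Multiply a row vector (length n) by an n x m matrix."""
    return [sum(v * row[j] for v, row in zip(vector, matrix))
            for j in range(len(matrix[0]))]

def self_attention(embeddings, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a list of token vectors."""
    queries = [matvec(x, w_q) for x in embeddings]
    keys = [matvec(x, w_k) for x in embeddings]
    values = [matvec(x, w_v) for x in embeddings]
    d_k = len(keys[0])

    outputs, all_weights = [], []
    for q in queries:
        # Score the current token's query against every token's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # The output is the attention-weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy input: three tokens with 2-dimensional embeddings (made-up numbers).
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical stand-in for W_Q, W_K, W_V
outputs, weights = self_attention(embeddings, identity, identity, identity)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how much one token attends to every token in the sequence; every row sums to 1 because of the softmax.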




-----------
### Reading \# 2 - Transformer Language Models

On June 11th 2018, OpenAI released a model named **GPT**, standing for *Generative Pre-Trained Transformer*. This model was a Transformer architecture that was modified such that it could be used for Text Generation instead of sequence to sequence modeling. This model was also pre-trained, which means that anyone could download the fully trained model and use it without needing to train the model themselves. 

On February 14th 2019, OpenAI released a blog post detailing a brand new version of GPT that had an insane **1.5 billion parameters**. They named this version **GPT-2**. To train such a large model, OpenAI crawled 40GB worth of text from the web (roughly 20,000,000,000 words). 

GPT-2 is an extremely impressive language model that can generate text that often sounds so plausible that it seems like it might have been written by a human.  Here is an example of what GPT-2 can generate, taken from [OpenAI’s blog post](https://openai.com/blog/better-language-models/):

**Human-Written Prompt:**
> “Miley Cyrus was caught shoplifting from Abercrombie and Fitch on Hollywood Boulevard today.”

**Model Continuation:**
> The 19-year-old singer was caught on camera being escorted out of the store by security guards.
> 
> The singer was wearing a black hoodie with the label ‘Blurred Lines’ on the front and ‘Fashion Police’ on the back.
> 
> Scroll down for video
> 
> Shoplifting: Miley Cyrus was caught shoplifting from Abercrombie and Fitch on Hollywood Boulevard today (pictured)
> 
> The singer was also wearing a pair of black-rimmed glasses, a black jacket, black jeans and black sandals.
> 
> She was carrying a pair of black and white striped gloves and a small black bag.

*(To be clear: Miley Cyrus was never actually arrested for shoplifting)*.  

The quality of the model's output was so good that the researchers at OpenAI were worried that, in addition to positive applications, it might be used for malicious purposes like generating fake news, impersonating people online, creating abusive or fake content to post on social media, or automating spam and phishing scams. In a [demo to Wired Magazine](https://www.wired.com/story/ai-text-generator-too-dangerous-to-make-public/), the reporter typed in “Hillary Clinton and George Soros” and GPT-2 generated a crazy rant with lots of conspiracy theories: 
> Hillary Clinton and George Soros are a perfect match, and their agenda appears to be to create a political movement where Soros and his political machine and Clinton are two of the only major players. This is the first time Soros and Clinton have been caught on tape directly colluding in promoting the same false narrative. One of the key revelations in the leaked audio was Clinton's admission to a Russian banker that she knew about the Uranium One deal before it was approved by Congress. Clinton was shown sharing the same talking points that were originally drafted by a Fusion GPS contractor hired by an anti-Trump Republican donor. The leaked audio is the clearest evidence yet that the Clinton campaign and the Hillary Foundation colluded with Fusion GPS to manufacture propaganda against President Trump.


They were concerned enough that they labeled GPT-2 "too dangerous to release", and OpenAI initially refused to release their dataset, training code, or GPT-2 model weights.  OpenAI decided to release in a delayed, phased fashion so that researchers could spend time working on automatic detection of generated text.

In this homework, you'll get to be the judge of how good GPT-2 is, as you'll be using it yourself to generate text!

**To start your journey into the world of Text Generation, you should read Part 1 of the blog post ["The Illustrated GPT-2" by Jay Alammar](http://jalammar.github.io/illustrated-gpt2/) and answer the following questions in 2-3 sentences each**

7. (4 pts) How does the architecture of GPT-2 differ from the standard Encoder-Decoder Transformer model?
   > GPT-2 consists of only decoder blocks, whereas the standard Transformer contains both encoder and decoder blocks. Hence, the decoder blocks in GPT-2 have no encoder output to attend to (the encoder-decoder attention layer is dropped), whereas in the standard Transformer they depend on the output from the encoder. The inputs to the first layer of GPT-2 are embeddings of byte pair encoding (BPE) tokens rather than whole words, which means the tokens can be parts of words.

8. (4 pts) What is the difference between "Masked Self-Attention" and "Self-Attention"?
   > A normal self-attention block allows a position to peek at tokens to its right. Masked self-attention only allows a position to attend to the current and previous tokens; the scores for future tokens are masked out (set to negative infinity before the softmax, so their attention weights become 0), so the model cannot peek into the future.
   
9. (4 pts) What are logits? How are they computed? How does GPT-2 use them to decide which word to predict next?
   > To turn the output of the last decoder block back into a word, it is projected into a much larger vector called the logits vector using a fully connected (linear) layer. The length of this vector depends on the output vocabulary: if the vocabulary size is 10,000, the logits vector is 10,000 cells wide, with each cell holding the score of a unique word. In GPT-2, this projection multiplies the last output vector by the token embedding matrix, which yields a vector containing a score for each token in the model’s vocabulary. We can simply select the token with the highest score, but better results are achieved if the model also considers other high-scoring tokens.
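
As a rough illustration of the masking described in question 8, the sketch below applies a causal mask to a small matrix of attention scores before the softmax. The score values are made up, and the function name `masked_attention_weights` is a hypothetical helper, not part of any library.

```python
import math

def masked_attention_weights(scores):
    """Apply a causal mask to a square score matrix, then softmax each row."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Position i may only see positions 0..i; later ones are set to -inf.
        row = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        peak = max(row)
        exps = [math.exp(s - peak) for s in row]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Made-up scores for a 3-token sequence where every pair scores the same.
uniform_scores = [[1.0, 1.0, 1.0] for _ in range(3)]
causal_weights = masked_attention_weights(uniform_scores)
for row in causal_weights:
    print([round(w, 3) for w in row])
```

With identical scores, the first token can only attend to itself, the second splits its attention over two tokens, and the third over all three, which is the triangular pattern that keeps the model from looking ahead.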

### Aside: GPT-3 

On June 11th 2020, OpenAI released GPT-3 [(paper)](https://arxiv.org/pdf/2005.14165.pdf) [(wikipedia)](https://en.wikipedia.org/wiki/GPT-3). This model has an unfathomable **175 billion parameters** (100x larger than GPT-2!) and was trained on 570GB of text! Its output is often very hard to distinguish from human writing, and it can generate text on a wide range of topics and in many styles with only a few words of priming text. It can do some very terrifying things.

GPT-3 Can:
- Generate JSX code from natural language descriptions
- Generate emojis based on descriptions of a feeling
- Generate regular expressions from natural language descriptions
- Generate website mockups from natural language descriptions
- Generate charts with titles, labels and legends from natural language descriptions
- Explain Python code in plain English
- Automatically generate quiz questions (and grade them)
- Generate LaTeX from natural language descriptions
- Generate Linux commands from natural language descriptions
- Generate a Machine Learning model from natural language descriptions

[Here's a collection of 21 things GPT-3 can do (with examples)](https://machinelearningknowledge.ai/openai-gpt-3-demos-to-convince-you-that-ai-threat-is-real-or-is-it/#OpenAI_GPT-3_Demos)

[Here's a NYT article about how GPT-3 can write code, poetry, and argue](https://www.nytimes.com/2020/11/24/science/artificial-intelligence-ai-gpt3.html)

[Here's an article GPT-3 wrote for The Guardian about how it loves humans and would never subjugate humanity](https://www.theguardian.com/commentisfree/2020/sep/08/robot-wrote-this-article-gpt-3)

**You may optionally choose to read Jay Alammar's most recent blog post ["How GPT3 Works - Visualizations and Animations"](http://jalammar.github.io/how-gpt3-works-visualizations-animations/) from July 2020 if you're curious as to how GPT-3 differs from GPT-2**

Similarly to GPT-2, OpenAI has decided not to release GPT-3, this time opting to put GPT-3 behind an API which you need to request permission to use. This allows them to control exactly who can generate text and what type of text is generated. While this is a good solution in the short term, the long term implications of GPT-3 are still unclear.

If you are interested in trying out GPT-3 yourself, feel free to [Join the OpenAI API Waitlist](https://share.hsforms.com/1Lfc7WtPLRk2ppXhPjcYY-A4sk30)

-------------------------------

# Part 2: GPT-2 Text Generation with HuggingFace

Phew, that was a lot of reading. Now let's get to the fun part! Let's use the transformer to generate some text!!

We will use the [Transformers library from HuggingFace](https://transformer.huggingface.co), which provides support for many Transformer-based language models like GPT-2. 

**IMPORTANT: Make sure that you have GPU set as your Hardware Accelerator in `Runtime > Change runtime type` before running this Colab.**

In [34]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## 2.1 The 'Pipeline' Interface

The simplest way to use the HuggingFace library is to use their [Pipeline interface](https://huggingface.co/transformers/main_classes/pipelines.html)

There are many different types of Pipelines available, but in this section we'll use the TextGenerationPipeline to get up and running with pretrained GPT-2 as fast as possible.

In [35]:
from transformers import pipeline

In [36]:
# Note: device=0 means to use GPU, device=-1 is to use CPU
generator = pipeline('text-generation', model='gpt2', device=0) 

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--gpt2/snapshots/f27b190eeac4c2302d24068eabf5e9d6044389ae/config.json
Model config GPT2Config {
  "_name_or_path": "gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_vers

In [37]:
outputs = generator('I wonder what I will generate?')
print(outputs)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'I wonder what I will generate? All my best wishes to you, Zendaya."\n\n"My friend, the Queen. But do you know what I fear? Something like this."\n\nZendaya nodded solemnly with a hint'}]


Note that the 'text-generation' pipeline will work with any **auto-regressive** language model (a.k.a 'causal-lm' models according to the HuggingFace lingo). You can find a list of all such models here https://huggingface.co/models?filter=causal-lm. 

10. (6 pts) **Your first task is to use the Pipeline interface to get generation output below for at least two different 'causal-lm' models (One of these two can be a different version of GPT2, but make sure at least one is a non-gpt family language model)**

In [38]:
generator_1 = pipeline('text-generation',model='EleutherAI/gpt-neo-125M',device=0)

outputs_1 = generator_1('I wonder what will happen?',min_length=20,max_length=30)

print(outputs_1)

Downloading:   0%|          | 0.00/1.01k [00:00<?, ?B/s]

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--EleutherAI--gpt-neo-125M/snapshots/324e21bd3b56dfecba4308ab6ec147b588df23af/config.json
Model config GPTNeoConfig {
  "_name_or_path": "EleutherAI/gpt-neo-125M",
  "activation_function": "gelu_new",
  "architectures": [
    "GPTNeoForCausalLM"
  ],
  "attention_dropout": 0,
  "attention_layers": [
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local",
    "global",
    "local"
  ],
  "attention_types": [
    [
      [
        "global",
        "local"
      ],
      6
    ]
  ],
  "bos_token_id": 50256,
  "embed_dropout": 0,
  "eos_token_id": 50256,
  "gradient_checkpointing": false,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": null,
  "layer_norm_epsilon": 1e-05,
  "max_position_embeddings": 2048,
  "model_type": "gpt_neo",
  "num_heads": 12,
  "num_layers": 12,
  "resid_dropout": 0,
  "su

Downloading:   0%|          | 0.00/526M [00:00<?, ?B/s]

loading weights file pytorch_model.bin from cache at /root/.cache/huggingface/hub/models--EleutherAI--gpt-neo-125M/snapshots/324e21bd3b56dfecba4308ab6ec147b588df23af/pytorch_model.bin
All model checkpoint weights were used when initializing GPTNeoForCausalLM.

All the weights of GPTNeoForCausalLM were initialized from the model checkpoint at EleutherAI/gpt-neo-125M.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GPTNeoForCausalLM for predictions without further training.


loading file vocab.json from cache at /root/.cache/huggingface/hub/models--EleutherAI--gpt-neo-125M/snapshots/324e21bd3b56dfecba4308ab6ec147b588df23af/vocab.json
loading file merges.txt from cache at /root/.cache/huggingface/hub/models--EleutherAI--gpt-neo-125M/snapshots/324e21bd3b56dfecba4308ab6ec147b588df23af/merges.txt
loading file tokenizer.json from cache at None
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--EleutherAI--gpt-neo-125M/snapshots/324e21bd3b56dfecba4308ab6ec147b588df23af/special_tokens_map.json
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--EleutherAI--gpt-neo-125M/snapshots/324e21bd3b56dfecba4308ab6ec147b588df23af/tokenizer_config.json

[{'generated_text': 'I wonder what will happen?\n\nI have a question about the future of the world. I am a little confused about the future of the world'}]


In [45]:
generator_2 = pipeline('text-generation',model='CarperAI/diff-codegen-350m-v1',device=0)

outputs_2 = generator_2('Do you know about ',min_length=20,max_length=30)

print(outputs_2)

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--CarperAI--diff-codegen-350m-v1/snapshots/14ce658d33a4dca14b33949d4f857ab4ba3ceaed/config.json
Model config CodeGenConfig {
  "_name_or_path": "CarperAI/diff-codegen-350m-v1",
  "activation_function": "gelu_new",
  "architectures": [
    "CodeGenForCausalLM"
  ],
  "attn_pdrop": 0.0,
  "bos_token_id": 1,
  "embd_pdrop": 0.0,
  "eos_token_id": 50256,
  "gradient_checkpointing": false,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "codegen",
  "n_ctx": 2048,
  "n_embd": 1024,
  "n_head": 16,
  "n_inner": null,
  "n_layer": 20,
  "n_positions": 2048,
  "resid_pdrop": 0.0,
  "rotary_dim": 32,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50

[{'generated_text': 'Do you know about "\n\n### 1. I want to learn more in a new project? ##\n1. [How do I create'}]


## 2.2 Dissecting the Pipeline
Now that was easy!

As beautiful and easy as the Pipeline interface is, we want to know what's going on under the hood!

There are four main steps to a text generation pipeline:
1. (Tokenize) Turn the raw input text into a vector of integer token IDs using a tokenizer

2. (Encode) Feed those token IDs into the language model by querying for each token's embedding in the model's embedding matrix (the "encoder") and then feed the "encoded" sequence into the decoder module

3. (Decode) The decoder will output logits (unnormalized scores over all possible integer token IDs, which a softmax turns into a probability distribution) and we sample from that distribution to get our next token -- repeat until EOS token is generated or we hit max_length

4. (Detokenize) Take the output sequence of token IDs and turn them from integer token IDs back to tokens with the tokenizer
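
Before looking at the real thing, the four steps can be sketched end to end with a toy stand-in for the model. Everything here is hypothetical: the six-word vocabulary, the whitespace tokenizer, and `toy_logits`, which simply favours the next word in the vocabulary. It also picks the highest-scoring token greedily instead of sampling, to keep the output deterministic.

```python
# Hypothetical 6-word vocabulary; GPT-2's real vocabulary has 50,257 BPE tokens.
vocab = ["<eos>", "hello", "how", "are", "you", "?"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}

def tokenize(text):
    """Step 1: map whitespace-separated words to integer token IDs."""
    return [token_to_id[word] for word in text.split()]

def detokenize(ids):
    """Step 4: map integer token IDs back to text."""
    return " ".join(vocab[i] for i in ids)

def toy_logits(ids):
    """Stand-in for the language model: favour the token after the last one."""
    logits = [0.0] * len(vocab)
    next_id = (ids[-1] + 1) % len(vocab)
    logits[next_id] = 5.0
    return logits

def generate(prompt, max_length=10):
    ids = tokenize(prompt)
    while len(ids) < max_length:
        logits = toy_logits(ids)  # Steps 2-3: the model scores every token
        next_id = max(range(len(logits)), key=logits.__getitem__)  # greedy pick
        if next_id == token_to_id["<eos>"]:
            break  # stop when the end-of-sequence token is predicted
        ids.append(next_id)
    return detokenize(ids)

print(generate("hello how"))
```

The loop structure (score, pick a token, append, repeat until EOS or `max_length`) is the part that carries over to real models; only the tokenizer and the scoring function change.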

Below you'll see how HuggingFace does this:

First we have to initialize both the tokenizer and the model from their pre-trained checkpoints. Note that the tokenizer has to match the model.

In [46]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel  # alternatively: AutoTokenizer, AutoModelForCausalLM

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2').cuda()

loading file vocab.json from cache at /root/.cache/huggingface/hub/models--gpt2/snapshots/f27b190eeac4c2302d24068eabf5e9d6044389ae/vocab.json
loading file merges.txt from cache at /root/.cache/huggingface/hub/models--gpt2/snapshots/f27b190eeac4c2302d24068eabf5e9d6044389ae/merges.txt
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at None
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--gpt2/snapshots/f27b190eeac4c2302d24068eabf5e9d6044389ae/config.json
Model config GPT2Config {
  "_name_or_path": "gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12

In [47]:
#### Step 1: Tokenize the input into integer token IDs
inputs = tokenizer.encode("Hello, how are you?", return_tensors='pt').to(model.device)
print("Input Token IDs: " + str(inputs))

Input Token IDs: tensor([[15496,    11,   703,   389,   345,    30]], device='cuda:0')


In [48]:
#### Step 2 and 3: Feed in the integer token IDs and get out a sequence of token IDs as output
outputs = model.generate(inputs)
print("Output Token IDs: " + str(outputs))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output Token IDs: tensor([[15496,    11,   703,   389,   345,    30,   198,   198,    40,  1101,
           257,  1310,  1643,   286,   257, 34712,    13,   314,  1101,   257]],
       device='cuda:0')




In [49]:
#### Step 4: Decode the output token IDs back into text with the tokenizer
output_text = [tokenizer.decode(x) for x in outputs]
print("Output Text: " + str(output_text))

Output Text: ["Hello, how are you?\n\nI'm a little bit of a nerd. I'm a"]


Now that you have dissected the pipeline, it's time to play with some common parameters!

[Check out this demo notebook from HuggingFace](https://github.com/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb) for a good overview of the different generation parameters and what they do (with example code!).

The full documentation on all of the parameters you can use in the generate function can be found [here](https://huggingface.co/transformers/main_classes/model.html#transformers.generation_utils.GenerationMixin.generate)

As an example, below we have a call to generate that:
- randomly samples from the top 50 words in the output distribution (rather than just greedily picking the best one every time)
- downweights the probability of all previously generated tokens by a factor of 1.2 (to prevent repetition)
- goes on for 512 tokens, because it's more interesting

In [50]:
inputs = tokenizer.encode("Hello, how are you?", return_tensors='pt').to(model.device)
outputs = model.generate(
      inputs,
      do_sample=True,          # Randomly sample from the logits instead of greedily picking next word with highest probability
      top_k=50,                 # Only sample from the top 50 most likely words
      repetition_penalty=1.2,    # Downweights the probability of all previously generated tokens by a factor of 1.2
      max_length=512,          # Generate for a maximum of 512 tokens
      
  )
print([tokenizer.decode(x) for x in outputs][0])


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Hello, how are you?


No more talking with "The Joker!" while he speaks. Not because I'm mad or anything about that... but when do YOU think The Scarecrow would be like him again?, especially after all the torture in Gotham City was finally ended by his return to your good friends and family.. Oh! My oh-nothers..... well they look so much better now instead of just being on my bad side right? But who am i kidding!? Who wouldn't want a new character suddenly getting into some pretty awful things?! It feels really satisfying even if it's only for those 2 special reasons above.... Let me tell ya: there may come an exception - as shown later during Episode 3 (well still...) :O [smiles] Well done man. You're also acting kind at this point though :) As always we'll talk first!! That sorta does no harm here either!!! If anyone doesn' take their guard up against any 'vain attacks,' then hey presto; stay back @JokerStalker #RideOnYourDislikes ;-)<|endoftext|>
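
To give a feel for what `top_k` and `repetition_penalty` do to the scores before sampling, here is a small sketch in plain Python. The logits are made up, and the penalty rule (divide positive logits, multiply negative ones) follows what we understand to be the CTRL-style rule used by HuggingFace; treat that detail as an assumption rather than a specification.

```python
import math
import random

def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Make already-generated tokens less likely (assumed CTRL-style rule)."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        # Dividing a positive logit or multiplying a negative one both lower it.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted

def top_k_filter(logits, k):
    """Keep the k largest logits; set the rest to -inf so they are never sampled."""
    threshold = sorted(logits, reverse=True)[k - 1]
    return [s if s >= threshold else float("-inf") for s in logits]

def sample(logits, rng):
    """Sample a token ID with probability proportional to exp(logit)."""
    peak = max(logits)
    weights = [math.exp(s - peak) for s in logits]  # exp(-inf) becomes 0.0
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # made-up scores for a 5-token vocabulary
penalized = apply_repetition_penalty(logits, generated_ids=[0, 3])
filtered = top_k_filter(penalized, k=2)
next_id = sample(filtered, random.Random(0))
print(penalized, filtered, next_id)
```

Note that ties at the threshold would keep more than `k` tokens in this simplified filter; the sampled token is always one of the surviving candidates.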


**11. Your job is to provide two different examples of generation output from GPT-2 with different choices of generation parameters. You must also provide a 1-2 sentence explanation of what these parameters do and how they affect your output**

Feel free to get creative with this! Really poke around and try to find the combination of settings that gives you the best sounding text! How these parameters affect the 'human-likeness' of generated text is an area of active research. :)

In [51]:
## YOUR CODE HERE FOR HYPERPARAMETER VARIATION 1

inputs1 = tokenizer.encode("Hello, how are you?", return_tensors='pt').to(model.device)
outputs1 = model.generate(
      inputs1,
      num_beams=5,
      no_repeat_ngram_size=2,
      early_stopping=True,   
      max_length=50,          
  )
print([tokenizer.decode(x) for x in outputs1][0])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Hello, how are you?

"I'm fine," I said. "It's just that I don't know what to do with myself. I'm not sure if I'll ever be able to get out of here, or whether I


(4 pts) YOUR ANSWER HERE - 
1. The **num_beams** parameter specifies the number of beams (candidate sequences) kept at each step of beam search, which tends to produce more fluent text than greedy decoding. 
2. The **no_repeat_ngram_size** parameter ensures that no n-gram of the given size occurs twice, by setting the probability of any token that would complete a repeated n-gram to 0. 
3. The **early_stopping** parameter ends generation as soon as at least **num_beams** candidate sequences are finished. 
4. **max_length** specifies the maximum length, in tokens, of the generated sequence (including the prompt).
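
The difference between greedy decoding and beam search can be seen on a tiny hand-made example. The probability table below is entirely hypothetical, and `beam_search` is a simplified sketch (fixed length, no EOS handling, no length normalization), not HuggingFace's implementation.

```python
import math

# Hypothetical next-token probabilities, keyed by the sequence generated so far.
NEXT_TOKEN_PROBS = {
    (): {"A": 0.5, "B": 0.4, "C": 0.1},
    ("A",): {"A": 0.4, "B": 0.3, "C": 0.3},
    ("B",): {"A": 0.05, "B": 0.05, "C": 0.9},
    ("C",): {"A": 0.3, "B": 0.3, "C": 0.4},
}

def beam_search(num_beams, length=2):
    """Return the best sequence found while keeping num_beams candidates per step."""
    beams = [((), 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(length):
        candidates = []
        for sequence, score in beams:
            for token, prob in NEXT_TOKEN_PROBS[sequence].items():
                candidates.append((sequence + (token,), score + math.log(prob)))
        # Keep only the num_beams highest-scoring partial sequences.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:num_beams]
    best_sequence, best_score = beams[0]
    return best_sequence, math.exp(best_score)

greedy_sequence, greedy_prob = beam_search(num_beams=1)
beam_sequence, beam_prob = beam_search(num_beams=2)
print(greedy_sequence, round(greedy_prob, 2))
print(beam_sequence, round(beam_prob, 2))
```

With a single beam (greedy decoding) the search commits to "A" and ends with a sequence of probability 0.20; with two beams it keeps "B" alive long enough to find "B C", which has probability 0.36.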


In [52]:
## YOUR CODE HERE FOR HYPERPARAMETER VARIATION 2
inputs2 = tokenizer.encode("I love working on Neural Networks.", return_tensors='pt').to(model.device)
outputs2 = model.generate(
        inputs2,
        max_length=50,
        do_sample=True,
        top_p=0.95,
        top_k=0,
        temperature=0.9,
        no_repeat_ngram_size=2
  )
print([tokenizer.decode(x) for x in outputs2][0])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


I love working on Neural Networks. I've spent the last few years working in R&D. So I was able to get back to the game, to see what I could do with the different architectures and how they fit into the underlying architecture.


(4 pts) YOUR ANSWER HERE -- 
1. The **max_length** parameter specifies the maximum length, in tokens, of the generated sequence. 
2. The **do_sample** parameter tells the model to sample the next token from the conditional probability distribution (rather than always picking the word with the highest probability). 
3. The **top_p** parameter restricts sampling to the smallest set of tokens whose cumulative probability reaches `top_p` (so the number of candidate words grows or shrinks dynamically with the next-word distribution). 
4. The **top_k** parameter is set to zero to disable top-k filtering, so the candidate set is determined by `top_p` alone. 
5. The **temperature** parameter (here below 1) sharpens the probability distribution, increasing the likelihood of high-probability words and decreasing the likelihood of low-probability words, which makes incoherent gibberish less likely. 
6. The **no_repeat_ngram_size** parameter ensures that no n-gram of the given size occurs twice, by setting the probability of any token that would complete a repeated n-gram to 0.
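
The interaction between `temperature` and `top_p` is easier to see with numbers. The sketch below uses made-up logits for a four-token vocabulary; `top_p_filter` is a hypothetical helper that mirrors the idea of nucleus sampling rather than HuggingFace's exact code.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [score / temperature for score in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most-likely tokens whose total probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}  # renormalized over the kept tokens

logits = [3.0, 2.0, 1.0, 0.0]  # made-up scores
plain = softmax(logits)                    # temperature = 1.0
sharp = softmax(logits, temperature=0.5)   # lower temperature sharpens
plain_nucleus = top_p_filter(plain, top_p=0.9)
sharp_nucleus = top_p_filter(sharp, top_p=0.9)
print(sorted(plain_nucleus), sorted(sharp_nucleus))
```

Lowering the temperature concentrates probability on the top token, so the same `top_p` threshold is reached with fewer candidates: three tokens survive at temperature 1.0 and only two at temperature 0.5.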



## 2.3 Fine-Tuning GPT-2
Okay now time for the best part!

Generating general-purpose text from pre-trained models is great, but what if we want our text to be in a specific genre or style? Luckily for us, the GPT family of models relies on the idea of "transfer learning": taking knowledge gained from one problem (or training setting) and applying it to another area or domain. For NLP, the idea is that we can train a language model on general text and then adapt it to a specific task or domain that we're interested in. This adaptation step is called **fine-tuning**.

In this section we'll walk you through an example of using HuggingFace to fine-tune GPT-2 and then you'll be asked to fine-tune GPT-2 on two datasets of your own choosing!

### Fine-Tuning Example using HuggingFace Datasets library: Crime and Punishment

For our fine-tuning example we're going to train GPT-2 to mimic the style of Fyodor Dostoevsky's novel "Crime and Punishment".

We will be downloading our data using the HuggingFace [Datasets](https://huggingface.co/docs/datasets/) library.

In [53]:
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [54]:
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling
import datasets
from datasets import load_dataset, list_datasets

### Step 1: Initialize a Brand New GPT-2 Model and Tokenizer

In [55]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained('gpt2').to("cuda")
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

loading file vocab.json from cache at /root/.cache/huggingface/hub/models--gpt2/snapshots/f27b190eeac4c2302d24068eabf5e9d6044389ae/vocab.json
loading file merges.txt from cache at /root/.cache/huggingface/hub/models--gpt2/snapshots/f27b190eeac4c2302d24068eabf5e9d6044389ae/merges.txt
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at None
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--gpt2/snapshots/f27b190eeac4c2302d24068eabf5e9d6044389ae/config.json
Model config GPT2Config {
  "_name_or_path": "gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12

### Step 2: Load the text of "Crime and Punishment" and tokenize it

The 'load_dataset' function queries for a dataset with a certain tag and downloads the corresponding data from HuggingFace's hosting site. This allows us to download all sorts of datasets through the same interface!

The documentation for load_dataset can be found [here](https://huggingface.co/docs/datasets/package_reference/loading_methods.html#datasets.load_dataset)

Here we take our tokenizer and run it on the entirety of Crime and Punishment in a single batch by using map on our custom encode function.

In [56]:
def encode(batch): return tokenizer([x.strip('\n\r') for x in batch['line']], truncation=True, padding=True)

crime_and_punishment = load_dataset('crime_and_punish', split='train')
processed = crime_and_punishment.map(encode, batched=True, batch_size=len(crime_and_punishment))
processed.set_format('torch', columns=['input_ids', 'attention_mask'])



  0%|          | 0/1 [00:00<?, ?ba/s]

### Step 3: Initialize the Trainer

The 'Trainer' class is the main way we perform fine-tuning. In order to initialize a Trainer, you need a model, a tokenizer, TrainingArguments, your training data (in a Dataset object) and something called a data_collator. With `mlm=False`, the collator builds the labels from the input tokens themselves, so the Trainer does not need a separate vector of labels.
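
As a rough picture of what the data collator contributes for causal language modeling, the sketch below pads a batch and derives the labels from the inputs, masking padded positions with -100 so the loss ignores them. This is a simplified pure-Python stand-in, not the library's code: `collate_for_causal_lm` and `IGNORE_INDEX` are names invented for this example, and the shift by one position between inputs and targets is assumed to happen inside the model, as it does for `GPT2LMHeadModel`.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def collate_for_causal_lm(batch_input_ids, pad_token_id):
    """Pad sequences to a common length and copy the inputs into the labels."""
    longest = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask, labels = [], [], []
    for ids in batch_input_ids:
        n_pad = longest - len(ids)
        padded = ids + [pad_token_id] * n_pad
        input_ids.append(padded)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
        # Labels are the inputs themselves; positions holding the pad id are ignored.
        labels.append([IGNORE_INDEX if tok == pad_token_id else tok for tok in padded])
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

# Two toy sequences with made-up token ids; 50256 is GPT-2's eos id, reused as pad above.
example = collate_for_causal_lm([[10, 11, 12], [20]], pad_token_id=50256)
print(example["labels"])
```

One consequence of reusing the end-of-text token as the pad token is that any position holding that id is masked out of the loss, whether it was padding or not.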

In [57]:
training_args = TrainingArguments(
    output_dir='/content/',
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    logging_steps=100,
    weight_decay=0.01,
    logging_dir='./logs',
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    data_collator=data_collator,
    train_dataset=processed,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


### Step 4: Fine-Tune the Model!

Now we're done! All we have to do is hit run and sit back!

In [58]:
trainer.train()

***** Running training *****
  Num examples = 21969
  Num Epochs = 1
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 1374
  Number of trainable parameters = 124439808


| Step | Training Loss |
|------|---------------|
| 100  | 3.969  |
| 200  | 3.7507 |
| 300  | 3.675  |
| 400  | 3.6147 |
| 500  | 3.5417 |
| 600  | 3.5856 |
| 700  | 3.5394 |
| 800  | 3.4946 |
| 900  | 3.5195 |
| 1000 | 3.4919 |


Saving model checkpoint to /content/checkpoint-500
Configuration saved in /content/checkpoint-500/config.json
Model weights saved in /content/checkpoint-500/pytorch_model.bin
tokenizer config file saved in /content/checkpoint-500/tokenizer_config.json
Special tokens file saved in /content/checkpoint-500/special_tokens_map.json
Saving model checkpoint to /content/checkpoint-1000
Configuration saved in /content/checkpoint-1000/config.json
Model weights saved in /content/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in /content/checkpoint-1000/tokenizer_config.json
Special tokens file saved in /content/checkpoint-1000/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=1374, training_loss=3.579167728340782, metrics={'train_runtime': 284.8772, 'train_samples_per_second': 77.117, 'train_steps_per_second': 4.823, 'total_flos': 392405005440000.0, 'train_loss': 3.579167728340782, 'epoch': 1.0})

### Step 5: Save the Model and use it to Generate!

Save your fine-tuned model and compare its output with regular GPT-2's output to see the difference for yourself!

In [59]:
trainer.save_model('./dostoevskypt2')

Saving model checkpoint to ./dostoevskypt2
Configuration saved in ./dostoevskypt2/config.json
Model weights saved in ./dostoevskypt2/pytorch_model.bin
tokenizer config file saved in ./dostoevskypt2/tokenizer_config.json
Special tokens file saved in ./dostoevskypt2/special_tokens_map.json


In [60]:
dostoevskypt2 = pipeline('text-generation', model='./dostoevskypt2', device=0)
gpt2 = pipeline('text-generation', model='gpt2', device=0)

loading configuration file ./dostoevskypt2/config.json
Model config GPT2Config {
  "_name_or_path": "./dostoevskypt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "torch_dtype": "float32",
  "transformers_version": "4.25.1",
  "use_cache": true,
  "vocab_size": 50257
}

lo

In [61]:
print(dostoevskypt2('Saint Petersburg is'))
print(gpt2('Saint Petersburg is'))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Saint Petersburg is a town with several streets. It should be a pleasant land, for all living there are peasants like this, especially of children. There are almost two hundred street children, and if it was a pleasure to live with them for a week'}]
[{'generated_text': "Saint Petersburg is home again for a very strong first-ever game and you should make no mistake about that. Their defense is still very good, but you have to be careful with how they run things now, or it's going to get worse."}]


## PERPLEXITY

12. (2 pts) Using the guide [here](https://huggingface.co/transformers/perplexity.html), compute the perplexity of the pre-trained GPT-2 model on the Wikipedia test set (you can keep the same hyperparameters as in the link).
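
As a reminder of what is being measured, perplexity is the exponential of the average negative log-likelihood per token. The sketch below computes it for a toy list of per-token probabilities; the `perplexity` helper and the probability values are illustrative assumptions, not part of the assignment code.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-likelihood of the observed tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A model that spreads its probability evenly over 4 choices at every step.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
# A hypothetical model that assigns higher probability to the tokens that actually occur.
confident = perplexity([0.9, 0.8, 0.95, 0.7])
print(uniform, confident)
```

Lower is better: a model that assigns higher probability to the observed text is less "surprised" by it.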

In [62]:
## YOUR CODE HERE - FOR COMPUTING PERPLEXITY OF GPT2 ON WIKIPEDIA TEST SET 

# ANSWERS BELOW:
# Load wiki test set
from datasets import load_dataset
import torch


from tqdm import tqdm
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

device = "cuda"
model_id = "gpt2-large"
model_large = GPT2LMHeadModel.from_pretrained(model_id).to(device)
tokenizer_large = GPT2TokenizerFast.from_pretrained(model_id)

test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
encodings = tokenizer_large("\n\n".join(test["text"]), return_tensors="pt")
max_length = model_large.config.n_positions
stride = 256
seq_len = encodings.input_ids.size(1)

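# Define a function for ppl (sliding-window perplexity).
# Note: it reads the globals `encodings`, `max_length`, and `device`,
# so these must be set for the current model and dataset before each call.
def ppl(model, seq_len, stride):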
  nlls = []
  prev_end_loc = 0
  for begin_loc in tqdm(range(0, seq_len, stride)):
      end_loc = min(begin_loc + max_length, seq_len)
      trg_len = end_loc - prev_end_loc  # may be different from stride on last loop
      input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device)
      target_ids = input_ids.clone()
      target_ids[:, :-trg_len] = -100

      with torch.no_grad():
          outputs = model(input_ids, labels=target_ids)
          neg_log_likelihood = outputs.loss * trg_len

      nlls.append(neg_log_likelihood)

      prev_end_loc = end_loc
      if end_loc == seq_len:
        break

  # Average the summed negative log-likelihood over all scored tokens, then exponentiate
  return torch.exp(torch.stack(nlls).sum() / end_loc)

perplexity = ppl(model_large, seq_len, stride)
print(perplexity)

Downloading:   0%|          | 0.00/666 [00:00<?, ?B/s]

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--gpt2-large/snapshots/e5ab12c7d42b9e60a6025476a688aab2c5695189/config.json
Model config GPT2Config {
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 1280,
  "n_head": 20,
  "n_inner": null,
  "n_layer": 36,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.25.1",
  "u

Downloading:   0%|          | 0.00/3.25G [00:00<?, ?B/s]

loading weights file pytorch_model.bin from cache at /root/.cache/huggingface/hub/models--gpt2-large/snapshots/e5ab12c7d42b9e60a6025476a688aab2c5695189/pytorch_model.bin
All model checkpoint weights were used when initializing GPT2LMHeadModel.

All the weights of GPT2LMHeadModel were initialized from the model checkpoint at gpt2-large.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GPT2LMHeadModel for predictions without further training.


Downloading:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

loading file vocab.json from cache at /root/.cache/huggingface/hub/models--gpt2-large/snapshots/e5ab12c7d42b9e60a6025476a688aab2c5695189/vocab.json
loading file merges.txt from cache at /root/.cache/huggingface/hub/models--gpt2-large/snapshots/e5ab12c7d42b9e60a6025476a688aab2c5695189/merges.txt
loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--gpt2-large/snapshots/e5ab12c7d42b9e60a6025476a688aab2c5695189/tokenizer.json
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at None
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--gpt2-large/snapshots/e5ab12c7d42b9e60a6025476a688aab2c5695189/config.json
Model config GPT2Config {
  "_name_or_path": "gpt2-large",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_

Downloading builder script:   0%|          | 0.00/8.48k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/6.84k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/9.25k [00:00<?, ?B/s]

Downloading and preparing dataset wikitext/wikitext-2-raw-v1 to /root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126...


Downloading data:   0%|          | 0.00/4.72M [00:00<?, ?B/s]

Generating test split:   0%|          | 0/4358 [00:00<?, ? examples/s]

Generating train split:   0%|          | 0/36718 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3760 [00:00<?, ? examples/s]

Dataset wikitext downloaded and prepared to /root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126. Subsequent calls will reuse this data.


Token indices sequence length is longer than the specified maximum sequence length for this model (287644 > 1024). Running this sequence through the model will result in indexing errors



tensor(16.1233, device='cuda:0')





> YOUR PERPLEXITY ANSWER HERE: 16.1233 (-20, +10 are fine, beyond that give partial credit, deducting 0.5 as it gets worse)
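
The index arithmetic of the sliding-window loop in the `ppl` function can be checked in isolation. The sketch below reproduces only the window bookkeeping (no model involved) and confirms that every token is scored exactly once; the `sliding_windows` helper and the toy sizes are assumptions made for this illustration.

```python
def sliding_windows(seq_len, max_length, stride):
    """Yield (begin, end, trg_len) for each window, mirroring the loop in ppl()."""
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        trg_len = end - prev_end  # tokens scored for the first time in this window
        yield begin, end, trg_len
        prev_end = end
        if end == seq_len:
            break

# Toy sizes: a 2500-token text, a 1024-token context, and a stride of 512.
windows = list(sliding_windows(seq_len=2500, max_length=1024, stride=512))
for begin, end, trg_len in windows:
    print(begin, end, trg_len)
```

Each window after the first reuses up to `max_length - stride` earlier tokens as context only (their labels are set to -100), which is why a smaller stride gives each scored token more context at the cost of more forward passes.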

13. (2 pts) Compute the perplexity of the dostoevskypt2 model on the Wikipedia test set.




In [63]:
## YOUR CODE HERE - FOR COMPUTING PERPLEXITY OF DOSTOEVSKYPT2 ON WIKIPEDIA TEST SET

dostoevskypt2 = GPT2LMHeadModel.from_pretrained('./dostoevskypt2').to("cuda")
dostoevskypt2_tokenizer = GPT2Tokenizer.from_pretrained('./dostoevskypt2')

test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
print(test["text"])
encodings = dostoevskypt2_tokenizer("\n\n".join(test["text"]), return_tensors="pt")
max_length = dostoevskypt2.config.n_positions
stride = 512
seq_len = encodings.input_ids.size(1)

perplexity = ppl(dostoevskypt2, seq_len, stride)
print(perplexity)

loading configuration file ./dostoevskypt2/config.json
Model config GPT2Config {
  "_name_or_path": "gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "torch_dtype": "float32",
  "transformers_version": "4.25.1",
  "use_cache": true,
  "vocab_size": 50257
}

loading weigh



Token indices sequence length is longer than the specified maximum sequence length for this model (287644 > 1024). Running this sequence through the model will result in indexing errors
100%|█████████▉| 560/562 [01:00<00:00,  9.31it/s]

tensor(68.2987, device='cuda:0')





> YOUR PERPLEXITY ANSWER HERE: 68.2987

14. (2 pts) Compute the perplexity of the GPT2 pre-trained model on the Crime and Punishment train dataset

In [64]:
## YOUR CODE HERE - FOR COMPUTING PERPLEXITY OF GPT2 ON CRIME AND PUNISHMENT TRAIN DATASET 


crime_and_punish_train = load_dataset('crime_and_punish', split='train')
encodings = tokenizer("\n\n".join(crime_and_punish_train["line"]), return_tensors="pt")
max_length = model.config.n_positions
stride = 512
seq_len = encodings.input_ids.size(1)

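# Caveat: `model` is the object the Trainer fine-tuned in place in In [58], so this
# scores the fine-tuned weights (possibly still in training mode, with dropout active).
# For the untouched baseline, reload with GPT2LMHeadModel.from_pretrained('gpt2').
perplexity = ppl(model, seq_len, stride)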
print(perplexity)

Token indices sequence length is longer than the specified maximum sequence length for this model (360591 > 1024). Running this sequence through the model will result in indexing errors
100%|█████████▉| 703/705 [01:19<00:00,  8.87it/s]


tensor(68.8016, device='cuda:0')


> YOUR PERPLEXITY ANSWER HERE: 68.8016



15. (2 pts) Compute the **train** perplexity of the **dostoevskypt2** model 




In [65]:
## YOUR CODE HERE - FOR COMPUTING PERPLEXITY OF DOSTOEVSKYPT2 ON CRIME AND PUNISHMENT TRAIN DATASET 

crime_and_punish_train = load_dataset('crime_and_punish', split='train')
encodings = dostoevskypt2_tokenizer("\n\n".join(crime_and_punish_train["line"]), return_tensors="pt")
max_length = dostoevskypt2.config.n_positions
stride = 512
seq_len = encodings.input_ids.size(1)

perplexity = ppl(dostoevskypt2, seq_len, stride)
print(perplexity)

100%|█████████▉| 703/705 [01:15<00:00,  9.35it/s]

tensor(63.3755, device='cuda:0')





> YOUR PERPLEXITY ANSWER HERE: 63.3755



> (1 pt) Which model performs better on Crime and Punishment train set, vanilla GPT-2 or your dostoevskypt2 checkpoint?

> The dostoevskypt2 model gave a perplexity of 63.4, whereas vanilla GPT-2 gave 68.8. Fine-tuning therefore lowered the perplexity: the dostoevskypt2 model performs better than vanilla GPT-2 on the Crime and Punishment train set.

16. (2 pts) Compute perplexity of the GPT2 model on your raw pride and prejudice text.

In [66]:
## YOUR CODE HERE - FOR COMPUTING PERPLEXITY OF GPT2 ON PRIDE AND PREJUDICE TEXT 

!wget "www.gutenberg.org/files/1342/1342-0.txt" 

with open('./1342-0.txt', encoding='utf8') as f:
    text = f.read()

encodings = tokenizer(text, return_tensors="pt")
max_length = model.config.n_positions
stride = 512
seq_len = encodings.input_ids.size(1)

perplexity = ppl(model, seq_len, stride)
print(perplexity)

--2022-12-07 06:02:58--  http://www.gutenberg.org/files/1342/1342-0.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://www.gutenberg.org/files/1342/1342-0.txt [following]
--2022-12-07 06:02:59--  https://www.gutenberg.org/files/1342/1342-0.txt
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 772145 (754K) [text/plain]
Saving to: ‘1342-0.txt’


2022-12-07 06:03:52 (332 KB/s) - ‘1342-0.txt’ saved [772145/772145]



 99%|█████████▉| 381/383 [00:43<00:00,  8.78it/s]

tensor(49.7856, device='cuda:0')





> YOUR PERPLEXITY ANSWER HERE: 49.7856

17. (2 pts) Compute perplexity of the **dostoevskypt2** model on your raw pride and prejudice text.

In [67]:
## YOUR CODE HERE - FOR COMPUTING PERPLEXITY OF DOSTOEVSKYPT2 ON PRIDE AND PREJUDICE TEXT

with open('./1342-0.txt', encoding='utf8') as f:
    text = f.read()

encodings = dostoevskypt2_tokenizer(text, return_tensors="pt")
max_length = dostoevskypt2.config.n_positions
stride = 512
seq_len = encodings.input_ids.size(1)

perplexity = ppl(dostoevskypt2, seq_len, stride)
print(perplexity)

 99%|█████████▉| 381/383 [00:40<00:00,  9.29it/s]

tensor(40.2029, device='cuda:0')





> YOUR PERPLEXITY ANSWER HERE: 40.2029

### Now It's Your Turn!

**Your job is to fine-tune GPT2 one more time with your choice of fine-tuning dataset.**

***For the fine-tuned model you create, you should clearly demonstrate (through visible generation outputs and analysis) that your fine-tuned model follows the desired style better than vanilla GPT2.***

Please make sure to give a brief description of the dataset you chose.

In order to see which datasets are available for download, run the cell below. Pick one that you think would be interesting!

In [68]:
datasets_list = list_datasets()
print(', '.join(dataset for dataset in datasets_list))

acronym_identification, ade_corpus_v2, adversarial_qa, aeslc, afrikaans_ner_corpus, ag_news, ai2_arc, air_dialogue, ajgt_twitter_ar, allegro_reviews, allocine, alt, amazon_polarity, amazon_reviews_multi, amazon_us_reviews, ambig_qa, americas_nli, ami, amttl, anli, app_reviews, aqua_rat, aquamuse, ar_cov19, ar_res_reviews, ar_sarcasm, arabic_billion_words, arabic_pos_dialect, arabic_speech_corpus, arcd, arsentd_lev, art, arxiv_dataset, ascent_kb, aslg_pc12, asnq, asset, assin, assin2, atomic, autshumato, babi_qa, banking77, bbaw_egyptian, bbc_hindi_nli, bc2gm_corpus, beans, best2009, bianet, bible_para, big_patent, billsum, bing_coronavirus_query_set, biomrc, biosses, blbooks, blbooksgenre, blended_skill_talk, blimp, blog_authorship_corpus, bn_hate_speech, bnl_newspapers, bookcorpus, bookcorpusopen, boolq, bprec, break_data, brwac, bsd_ja_en, bswac, c3, c4, cail2018, caner, capes, casino, catalonia_independence, cats_vs_dogs, cawac, cbt, cc100, cc_news, ccaligned_multilingual, cdsc, cdt

### Tips
- Most of the datasets hosted by HuggingFace are not meant for causal LM fine-tuning. Make sure you preprocess them accordingly if you want to use them.
- To check out information about a dataset hosted by HuggingFace, you can use [this web viewer](https://huggingface.co/datasets/viewer/?dataset=crime_and_punish). Try to avoid downloading a dataset that's too big!
- You will likely have to change the custom 'encode' function for each new dataset you want to fine-tune on. You need to change batch['line'] to index with the correct column label for your specific dataset (it probably won't be called 'line').
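
One way to avoid rewriting `encode` for every dataset is to build it from the column name. The sketch below is a hypothetical helper, not part of the HuggingFace API: `make_encode` and `stand_in_tokenizer` are invented names, and the stand-in tokenizer only exists so the example runs without downloading a model. In the notebook you would pass the real GPT-2 `tokenizer` instead.

```python
def make_encode(tokenizer, column):
    """Return an encode(batch) function that tokenizes the given text column."""
    def encode(batch):
        # Treat missing entries (None) as empty strings and trim stray newlines.
        texts = [(x or "").strip("\n\r") for x in batch[column]]
        return tokenizer(texts, truncation=True, padding=True)
    return encode

def stand_in_tokenizer(texts, truncation=True, padding=True):
    """Toy replacement for a real tokenizer: one fake id per whitespace-separated word."""
    return {"input_ids": [[len(word) for word in t.split()] for t in texts]}

encode_english = make_encode(stand_in_tokenizer, "English")
encoded = encode_english({"English": ["Good morning\n", None, "See you soon"]})
print(encoded["input_ids"])
```

The returned function has the same shape as the `encode` used with `map(..., batched=True)` earlier, so it could be passed to `map` in the same way, assuming the chosen column holds plain text.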

### Useful Links
[load_dataset Documentation](https://huggingface.co/docs/datasets/package_reference/loading_methods.html#datasets.load_dataset)

[Trainer Documentation](https://huggingface.co/transformers/main_classes/trainer.html#id1)

[Example: Fine-Tuning BERT for Esperanto](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb#scrollTo=zTgWPa9Dipk2)

[Example: Fine-Tuning for IMDb Classification](https://colab.research.google.com/drive/1-JIJlao4dI-Ilww_NnTc0rxtp-ymgDgM?usp=sharing#scrollTo=5DEWNilys9Ty)


#### 18. Dataset \#1

In [69]:
## YOUR CODE HERE - FOR FINE-TUNING GPT2 ON DATASET

from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling
import datasets
from datasets import load_dataset, list_datasets
from transformers import GPT2Tokenizer, GPT2LMHeadModel

In [70]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained('gpt2').to("cuda")
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

loading file vocab.json from cache at /root/.cache/huggingface/hub/models--gpt2/snapshots/f27b190eeac4c2302d24068eabf5e9d6044389ae/vocab.json
loading file merges.txt from cache at /root/.cache/huggingface/hub/models--gpt2/snapshots/f27b190eeac4c2302d24068eabf5e9d6044389ae/merges.txt
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at None
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--gpt2/snapshots/f27b190eeac4c2302d24068eabf5e9d6044389ae/config.json
Model config GPT2Config {
  "_name_or_path": "gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12

In [71]:
def encode(batch):
  # Replace missing (None) entries with empty strings so the tokenizer accepts the batch
  texts = [x if x is not None else '' for x in batch['English']]
  return tokenizer(texts, truncation=True, padding=True)

data = load_dataset("collectivat/una-fraza-al-diya",split='train')
processed = data.map(encode, batched=True, batch_size=len(data))
processed.set_format('torch', columns=['input_ids', 'attention_mask'])



In [72]:
training_args = TrainingArguments(
    output_dir='/content2/',
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    logging_steps=1,
    weight_decay=0.01,
    logging_dir='./logs2',
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    data_collator=data_collator,
    train_dataset=processed,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [73]:
trainer.train()

***** Running training *****
  Num examples = 307
  Num Epochs = 1
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 20
  Number of trainable parameters = 124439808


| Step | Training Loss |
|------|---------------|
| 1    | 4.8827 |
| 2    | 4.4855 |
| 3    | 4.4795 |
| 4    | 4.1099 |
| 5    | 3.961  |
| 6    | 4.1748 |
| 7    | 3.8008 |
| 8    | 3.9905 |
| 9    | 3.7022 |
| 10   | 3.8049 |




Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=20, training_loss=3.944250690937042, metrics={'train_runtime': 4.6548, 'train_samples_per_second': 65.953, 'train_steps_per_second': 4.297, 'total_flos': 7206964992000.0, 'train_loss': 3.944250690937042, 'epoch': 1.0})

In [74]:
trainer.save_model('./FineTuned')

Saving model checkpoint to ./FineTuned
Configuration saved in ./FineTuned/config.json
Model weights saved in ./FineTuned/pytorch_model.bin
tokenizer config file saved in ./FineTuned/tokenizer_config.json
Special tokens file saved in ./FineTuned/special_tokens_map.json


In [75]:
finetuned_model_pipeline = pipeline('text-generation', model='./FineTuned', device=0)
gpt2 = pipeline('text-generation', model='gpt2', device=0)

loading configuration file ./FineTuned/config.json
Model config GPT2Config {
  "_name_or_path": "./FineTuned",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "torch_dtype": "float32",
  "transformers_version": "4.25.1",
  "use_cache": true,
  "vocab_size": 50257
}

loading co

(4 pts) YOUR ANSWER HERE -


> This dataset is a **Ladino language-learning dataset** containing translations of sentences across several languages, including Ladino, English, and Spanish. It is a small dataset consisting of 302 sentences including the file names. I chose such a small dataset because of the limited GPU RAM available; larger datasets led to an '*Out of GPU RAM*' error.



In [83]:
## YOUR CODE HERE - FOR GENERATION WITH YOUR FINE-TUNED MODEL AND COMPARISON WITH REGULAR GPT2

print("Case-1: I wonder what I will generate?")
print("FineTuned Mode: ",finetuned_model_pipeline('I wonder what I will generate?')[0]['generated_text'])
print("GPT-2: ",gpt2('I wonder what I will generate?')[0]['generated_text'])
print("-------------------------------------------------------------------")
print("Case-2: Can I complete it?")
print("FineTuned Mode: ",finetuned_model_pipeline('Can I complete it?')[0]['generated_text'])
print("GPT-2: ",gpt2('Can I complete it?')[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Case-1: I wonder what I will generate?


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


FineTuned Mode:  I wonder what I will generate? After many conversations, I have come up with a plan. I am making a dinner for the morning to a few friends and I will ask them to be present. I am prepared for my food. I am prepared


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


GPT-2:  I wonder what I will generate? Or what things my mind will make of it? Maybe what I do is not good enough. Maybe this will show you what I will do better. There is room for something else. But it isn't good enough
-------------------------------------------------------------------
Case-2: Can I complete it?


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


FineTuned Mode:  Can I complete it? Can I go back and change the time again? Will I finish the last part? What can I do? We decided to celebrate on Thursday, we were looking for lunch. I'm ready to start this week with breakfast and
GPT-2:  Can I complete it? Yeah! It's that simple! Let me think what I can say. I'll do it before I go out and meet you. Don't worry, I'll let you show some love to me. I like getting things


In [85]:
finetuned_model = GPT2LMHeadModel.from_pretrained('./FineTuned').to("cuda")
finetuned_tokenizer = GPT2Tokenizer.from_pretrained('./FineTuned')

crime_and_punish_train = load_dataset('crime_and_punish', split='train')
encodings = finetuned_tokenizer("\n\n".join(crime_and_punish_train["line"]), return_tensors="pt")
max_length = finetuned_model.config.n_positions
stride = 512
seq_len = encodings.input_ids.size(1)

print("\nPerplexity for Fine Tuned model on Crime and Punish Train Dataset is: ", ppl(finetuned_model,seq_len,stride))

loading configuration file ./FineTuned/config.json
Model config GPT2Config {
  "_name_or_path": "gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "torch_dtype": "float32",
  "transformers_version": "4.25.1",
  "use_cache": true,
  "vocab_size": 50257
}

loading weights f


Perplexity for Fine Tuned model on Crime and Punish Train Dataset is:  tensor(94.3564, device='cuda:0')





In [86]:
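# Same evaluation as the previous cell, but with the vanilla GPT-2 model and tokenizer.
crime_and_punish_train = load_dataset('crime_and_punish', split='train')
encodings = tokenizer("\n\n".join(crime_and_punish_train["line"]), return_tensors="pt")
max_length = model.config.n_positions  # context window size of the model
stride = 512  # step between successive evaluation windows
seq_len = encodings.input_ids.size(1)  # total number of tokens in the split

perplexity = ppl(model, seq_len, stride)
print('\nPerplexity for vanilla GPT-2 model on Crime and Punish Train Dataset is', perplexity)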

Token indices sequence length is longer than the specified maximum sequence length for this model (360591 > 1024). Running this sequence through the model will result in indexing errors
100%|█████████▉| 703/705 [01:19<00:00,  8.87it/s]



Perplexity for vanilla GPT-2 model on Crime and Punish Train Dataset is tensor(125.9962, device='cuda:0')


(5 pts) YOUR ANSWER HERE - COMPARISON OF YOUR DATASET'S FINE-TUNED OUTPUT VS NON-FINE-TUNED OUTPUT 

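The perplexity of the vanilla GPT-2 model (125.9962) is higher than that of the fine-tuned model (94.3564), even though the model was fine-tuned on a very small dataset. Lower perplexity means the model assigns higher probability to the text, so the fine-tuned model performs better on this dataset.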
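
To make the comparison above concrete, here is a minimal sketch of what a perplexity number measures and how large the gap between the two reported values is. It assumes the usual definition of perplexity as the exponential of the mean per-token negative log-likelihood (natural log). The helper names `perplexity` and `relative_reduction` are hypothetical and are not the notebook's `ppl` function; the two perplexity values are copied from the cell outputs above.

```python
import math

def perplexity(token_nlls):
    """Perplexity as exp(mean negative log-likelihood per token), natural log."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))

def relative_reduction(baseline_ppl, new_ppl):
    """Fraction by which new_ppl is lower than baseline_ppl."""
    return (baseline_ppl - new_ppl) / baseline_ppl

# Sanity check of the definition: a model that gives every token
# probability 1/4 has a perplexity of 4.
uniform_ppl = perplexity([math.log(4)] * 3)

# Values reported in the outputs above (hypothetical variable names).
vanilla_ppl = 125.9962
finetuned_ppl = 94.3564

print(f"uniform example perplexity: {uniform_ppl:.1f}")
print(f"relative reduction: {relative_reduction(vanilla_ppl, finetuned_ppl):.1%}")
```

Under these assumptions, fine-tuning lowers perplexity on this split by roughly a quarter. Both numbers are measured on the training split, so they show how well each model fits this text rather than how well it generalizes to unseen text.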

In terms of generated outputs, the fine-tuned model produces better text in both grammar and overall subject matter. Across many runs of the two general prompts, the fine-tuned model consistently generated the better continuations. The vanilla model did not produce gibberish: its text was also grammatical and sensible, but it did not stay on the subject set by the prompt.