## Fine-Tuning GPT-2 Model in PyTorch

This project demonstrates how to fine-tune a pre-trained GPT-2 model in PyTorch. It is also the project in the Udemy course **Introduction to Transformers for NLP with Python**.

First, we downloaded the pretrained model and its tokenizer from the transformers library. Note that the code below loads the model through the library's TensorFlow class, TFGPT2LMHeadModel.

In [9]:
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer
gpt2tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# add the EOS token as PAD token to avoid warnings
gpt2 = TFGPT2LMHeadModel.from_pretrained("gpt2", 
                                         pad_token_id=gpt2tokenizer.eos_token_id)

All PyTorch model weights were used when initializing TFGPT2LMHeadModel.

All the weights of TFGPT2LMHeadModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


### Greedy search algorithm

We created text using a greedy search algorithm. It is usually a good idea to fix the random seed to make sure the results are reproducible.

In [11]:
# settings
import tensorflow as tf  # needed for tf.random.set_seed below

# for reproducibility
SEED = 30
tf.random.set_seed(SEED)

# maximum number of tokens in the output text
MAX_LEN = 70

In [12]:
# encode context the generation is conditioned on
input_ids = gpt2tokenizer.encode('Pokemon is the', return_tensors='tf')

# generate text until the output length (which includes the context length) reaches 50
greedy_output = gpt2.generate(input_ids, max_length=50)

print("Output:\n" + 50 * '-')
print(gpt2tokenizer.decode(greedy_output[0], skip_special_tokens=True))


Output:
--------------------------------------------------
Pokemon is the first game in the series to feature a character that is a member of the team.

The game's story is based on the manga series, and the characters are based on the characters from the anime.

The game's


A prompt was supplied for the model to complete. The model started in a promising manner but soon fell into repeating itself. The output shown here is indicative only, and yours may differ for a few reasons. For example, the models may be retrained periodically by the Hugging Face team and may change with newer versions.

Beam search can be considered as an alternative. At each step of generating a token, a set of the highest-probability tokens is kept as part of the beam instead of just the single highest-probability token. The sequence with the highest overall probability is returned at the end of the generation. It is time to generate text with the beam search algorithm.
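To make the difference between greedy search and beam search concrete, here is a minimal sketch on a toy model. The `TOY_MODEL` table, its probabilities, and the helper names `greedy_search` and `beam_search` are hypothetical and exist only for this illustration. This is not how the transformers library implements generation.

```python
import math

# Hypothetical toy "language model": next-token probabilities that depend
# only on the previous token. The numbers are made up for illustration.
TOY_MODEL = {
    "<s>": {"the": 0.5, "a": 0.4, "dog": 0.1},
    "the": {"dog": 0.4, "cat": 0.3, "end": 0.3},
    "a": {"cat": 0.9, "dog": 0.1},
}


def greedy_search(model, start, steps):
    """At every step, keep only the single most probable next token."""
    seq, logp = [start], 0.0
    for _ in range(steps):
        token, p = max(model[seq[-1]].items(), key=lambda kv: kv[1])
        seq.append(token)
        logp += math.log(p)
    return seq, logp


def beam_search(model, start, steps, num_beams):
    """At every step, keep the `num_beams` most probable partial sequences."""
    beams = [([start], 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, logp in beams:
            for token, p in model[seq[-1]].items():
                candidates.append((seq + [token], logp + math.log(p)))
        # Sort by cumulative log-probability and keep the best beams.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams[0]  # highest-scoring sequence overall


greedy_seq, greedy_logp = greedy_search(TOY_MODEL, "<s>", steps=2)
beam_seq, beam_logp = beam_search(TOY_MODEL, "<s>", steps=2, num_beams=2)

# Greedy picks "the" then "dog" (probability 0.5 * 0.4 = 0.20).
# Beam search also keeps "a", which leads to "a cat" (0.4 * 0.9 = 0.36).
print(greedy_seq, round(math.exp(greedy_logp), 2))
print(beam_seq, round(math.exp(beam_logp), 2))
```

In this toy setting, greedy search commits to the locally best first token and misses the sequence with the higher overall probability, which the beam retains.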

### Beam search algorithm

In [13]:
import tensorflow as tf
tf.random.set_seed(42)  # for reproducible results
# BEAM SEARCH
# activate beam search and early_stopping
beam_output = gpt2.generate(
    input_ids, 
    max_length=51, 
    num_beams=20, 
    early_stopping=True
)

print("Output:\n" + 50 * '-')
print(gpt2tokenizer.decode(beam_output[0], skip_special_tokens=True))

Output:
--------------------------------------------------
Pokemon is the most popular video game in the world. It is the most popular video game in the world. It is the most popular video game in the world. It is the most popular video game in the world.


Qualitatively, the first sentence makes a lot more sense than the one generated by the greedy search. The early_stopping parameter signals generation to stop when all beams reach the EOS token. However, there is still much repetition going on. One way to control the repetition is to set a limit on the n-grams that may be repeated.

### Set no repeat n-gram size

In [16]:
# set no_repeat_ngram_size to 3
beam_output = gpt2.generate(
    input_ids, 
    max_length=50, 
    num_beams=20, 
    no_repeat_ngram_size=3, 
    early_stopping=True
)

print("Output:\n" + 50 * '-')
print(gpt2tokenizer.decode(beam_output[0], skip_special_tokens=True))

Output:
--------------------------------------------------
Pokemon is the most popular video game in the world. It's also one of the most successful franchises in the history of the video game industry. It has sold more than 1 billion copies worldwide and has been downloaded more than 100 million times.




As can be seen from the above result, this has made a considerable difference in the quality of the generated text. The no_repeat_ngram_size parameter prevents the model from generating any 3-gram, or triplet of tokens, more than once. While this improves the quality of the text here, the n-gram constraint can also hurt it. If the generated text is about the White House, then these three words can be used together only once in the entire generated text. In such a case, using the n-gram constraint will be counter-productive.
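The idea behind the n-gram constraint can be sketched in a few lines. The helper `banned_next_tokens` below is a hypothetical name used for illustration, and it works on word strings rather than real token ids. It is a simplified sketch of the idea, not the library's implementation.

```python
def banned_next_tokens(tokens, n):
    """Return the tokens that would complete an n-gram already in `tokens`."""
    if n < 1 or len(tokens) < n:
        return set()
    # The last n-1 tokens are the prefix of the n-gram about to be formed.
    prefix = tuple(tokens[len(tokens) - (n - 1):]) if n > 1 else ()
    banned = set()
    for i in range(len(tokens) - n + 1):
        # If an earlier n-gram starts with the same prefix, its last token
        # must not be generated again right now.
        if tuple(tokens[i:i + n - 1]) == prefix:
            banned.add(tokens[i + n - 1])
    return banned


generated = ["the", "White", "House", "said", "the", "White"]
# "the White House" already occurred, so "House" is blocked as the next token.
print(banned_next_tokens(generated, n=3))
```

During generation, the probability of every banned token would be set to zero before the next token is chosen, which is why a phrase such as "the White House" cannot appear a second time.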

Beam search works well in cases where the generated sequence is of a restricted length. As the length of the sequence increases, the work needed to maintain and score the beams increases significantly. Consequently, beam search works well in tasks like summarization and translation but performs poorly in open-ended text generation. Furthermore, beam search, by trying to maximize the cumulative probability, generates more predictable text, which feels less natural. The following code can be used to get a feel for the various beams being generated. Please ensure that the number of beams is greater than or equal to the number of sequences to be returned.

### Add temperature controlling parameter

In [20]:
# Returning multiple beams
tf.random.set_seed(42)  # for reproducible results
beam_outputs = gpt2.generate(
    input_ids, 
    max_length=50, 
    num_beams=20, 
    no_repeat_ngram_size=3, 
    num_return_sequences=3,  
    early_stopping=True,
    temperature=0.7
)

print("Output:\n" + 50 * '-')
for i, beam_output in enumerate(beam_outputs):
  print("\n{}: {}".format(i, 
                        gpt2tokenizer.decode(beam_output, 
                                             skip_special_tokens=True)))

Output:
--------------------------------------------------

0: Pokemon is the most popular video game in the world. It's also one of the most successful franchises in the history of the video game industry. It has sold more than 1 billion copies worldwide and has been downloaded more than 100 million times.



1: Pokemon is the most popular video game in the world. It's also one of the most successful franchises in the history of the video game industry. It has sold more than 1 billion copies worldwide and has been downloaded more than 2 billion times.



2: Pokemon is the most popular video game in the world. It's also one of the most successful franchises in the history of the video game industry. It has sold more than 1 billion copies worldwide and has been downloaded more than 3 billion times.




There is another method for improving the coherence and creativity of the generated text, called Top-K sampling. This is the preferred method in GPT-2 and plays an important role in the success of GPT-2 in story generation.

### Top-K sampling

In [21]:
# Top-K sampling
tf.random.set_seed(42)  # for reproducible results
beam_output = gpt2.generate(
    input_ids, 
    max_length=50, 
    do_sample=True, 
    top_k=25,
    temperature=2.0
)

print("Output:\n" + 50 * '-')
print(gpt2tokenizer.decode(beam_output[0], skip_special_tokens=True))

Output:
--------------------------------------------------
Pokemon is the only Pokemon with all three stats and it is an awesome Pokemon because there is nothing to fear (yet, no less). In fact with the Pokemon from FireRed & Blue we see what a fun story this series and anime has made!


As can be seen from the above example, the model looks at the 25 top tokens out of the 50,000+ tokens in its vocabulary while generating text. It then samples one of these tokens according to their probabilities and continues the generation. Choosing larger values of K will result in more surprising or creative text. Choosing lower values of K will result in more predictable text.
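The effect of K and of the temperature parameter can be sketched on a tiny vocabulary. The function names `softmax`, `top_k_filter`, and `sample_top_k` and the example logits are hypothetical, chosen only for this illustration. The order of operations (temperature first, then the Top-K cut) is an assumption of this sketch.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most probable token ids and renormalize their mass."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def sample_top_k(logits, k, temperature=1.0, rng=random):
    """Sample one token id from the renormalized top-k distribution."""
    filtered = top_k_filter(softmax(logits, temperature), k)
    ids = list(filtered)
    return rng.choices(ids, weights=[filtered[i] for i in ids], k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
print(top_k_filter(softmax(logits), k=2))
print(sample_top_k(logits, k=2, temperature=2.0, rng=random.Random(42)))
```

With `k=2`, only token ids 0 and 1 can ever be sampled, however often the function is called. Raising the temperature moves probability from the most likely token to the less likely ones inside that set.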

#### Choosing lower values of K 

In [23]:
#tf.random.set_seed(42)  # for reproducible results
beam_output = gpt2.generate(
    input_ids, 
    max_length=50, 
    do_sample=True, 
    top_k=15,
    temperature=2.0
)

print("Output:\n" + 50 * '-')
print(gpt2tokenizer.decode(beam_output[0], skip_special_tokens=True))

Output:
--------------------------------------------------
Pokemon is the only series in the series to use all-purpose, special characters and the game's own character. However, it also has one main difference: all characters in that episode's story are the same as in this particular episode.




#### Choosing higher values of K

In [24]:
beam_output = gpt2.generate(
    input_ids, 
    max_length=50, 
    do_sample=True, 
    top_k=150,
    temperature=2.0
)

print("Output:\n" + 50 * '-')
print(gpt2tokenizer.decode(beam_output[0], skip_special_tokens=True))

Output:
--------------------------------------------------
Pokemon is the one available
No wonder now how hot will one, let alone six-thousand more people!
Pokemon Snap is almost going on sale. It hit Japan today one offtnight, followed by Korean takoba hawgawa


This seems like a step in the right direction. Can we do better than this?
Top-P sampling (also known as nucleus sampling) is similar to Top-K, but instead of choosing the top K most likely words, we choose the smallest set of words whose total probability is larger than p, and then the entire probability mass is shifted to the words in this set.
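The selection of the nucleus can be sketched as follows. The function name `top_p_filter` and the example probabilities are hypothetical. The sketch stops as soon as the cumulative probability reaches p, which is one reasonable reading of the rule. The library's exact boundary handling may differ.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most probable ids whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    # Shift the entire probability mass onto the kept tokens.
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


probs = [0.5, 0.3, 0.15, 0.05]  # made-up distribution over 4 tokens
print(top_p_filter(probs, p=0.9))  # keeps ids 0, 1 and 2
print(top_p_filter(probs, p=0.4))  # keeps only id 0
```

Unlike Top-K, the number of candidates adapts to the distribution: a peaked distribution keeps few tokens, and a flat one keeps many.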

### Top-p sampling

In [33]:
sample_output = gpt2.generate(
                             input_ids, 
                             do_sample = True, 
                             max_length = MAX_LEN, 
                             top_p = 0.9, 
                             top_k = 0
)

print("Output:\n" + 50 * '-')
print(gpt2tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
--------------------------------------------------
Pokemon is the main character of Star Fox 64.

Star Fox Red from Tekken 7

Hakumetoro Minoru

Gen B Shoryuken Majin Vegeta

Sakura Sixteen and Tenshi Space Jukenho

Sega Genesis San Ace Queen

Taito Infinite

Asus


### Combine all approaches

In [35]:
# Combine all approaches together
sample_outputs = gpt2.generate(
                              input_ids,
                              do_sample = True, 
                              max_length = 2*MAX_LEN,  # to test how long the generated text stays coherent
                              #temperature = .7,
                              top_k = 20, 
                              top_p = 0.95, 
                              num_return_sequences = 5
)
print("Output:\n" + 50 * '-')
for i, sample_output in enumerate(sample_outputs):
  print("\n{}: {}".format(i, 
                        gpt2tokenizer.decode(sample_output, 
                                             skip_special_tokens=True)))

Output:
--------------------------------------------------

0: Pokemon is the most important game in the franchise to date, with the original series of Star Wars having its very own, highly-anticipated sequel. And in the case of Star Wars: The Old Republic, it's also the franchise that has been so beloved by fans in the world of gaming and that will probably become the franchise's greatest story to date.

But if you can't believe how many people love that game, how is Star Wars: The Old Republic the most influential game of all time? The first game, in a nutshell, was the beginning of the game's life and the second game, after that, is the foundation of what makes Star Wars the best-selling game of

1: Pokemon is the most common genre and a lot of it has a good amount of content in it. However, I think that most of it is pretty boring.

The main game is pretty simple. You're playing as the world's first detective, tasked with a job that requires the cooperation of many people. The main

The longer-form text above was generated by the smallest GPT-2 model, which has around 124 million parameters. Several different settings and model sizes are available for you to play with. Remember, with great power comes great responsibility.