# Transformers

| Date | User | Change Type | Remarks |  
| ---- | ---- | ----------- | ------- |
| 03/03/2025   | Martin | Created   | Created notebook for Transformers Chapter. Started text generation section and exploring different decoding methods | 

# Content

* [Introduction](#introduction)
* [Text Generation](#text-generation)

# Introduction

Transformers serve a similar purpose to RNNs in processing sequential data, but they improve on them because they do not need to process a sequence one step at a time. This results in better parallelisation and faster training.

They can be pretrained on large bodies of unlabeled data and then fine-tuned for other tasks.

__Common Tasks__

* Translation
* Question answering
* Text summarisation

__2 Common Architectures__

1. Bidirectional Encoder Representations from Transformers (BERT)
2. Generative Pretrained Transformers (GPTs)

__Recipes Covered__

1. Text generation
2. Sentiment Analysis
3. Text classification: sarcasm detection
4. Question answering

---

# Text Generation

Using GPT-2 from __HuggingFace__

_GPT-2:_ Second generation of the GPT architecture. It showed that generative language models can acquire knowledge and process long-range dependencies thanks to pretraining on a large, diverse corpus of contiguous text.

In [3]:
import tensorflow as tf

from transformers import TFGPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
GPT2 = TFGPT2LMHeadModel.from_pretrained('gpt2', pad_token_id=tokenizer.eos_token_id)

All PyTorch model weights were used when initializing TFGPT2LMHeadModel.

All the weights of TFGPT2LMHeadModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


In [4]:
# Set seed
SEED = 34
MAX_LEN = 70 # maximum total length in tokens (prompt + generated text)
tf.random.set_seed(SEED)

## Decoding Methods

These are different methods for choosing the next word during generation

1. __Greedy Search__ - Select the word with the highest probability
2. [__Beam Search__](https://towardsdatascience.com/foundations-of-nlp-explained-visually-beam-search-how-it-works-1586b9849a24/) - Keeps the N most probable sequences at each step, scoring each one by the combined probability of all its preceding words plus the current word. The model is re-run (not retrained) on each branch, conditioned on all previously selected words, until an end-of-sequence token has the highest probability
3. __Sample-based decoding__ - Randomly select the next token according to a probability distribution. The distribution is built from the model's conditional probabilities for each word given the preceding words
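4. __Top K Sampling__ - Sample the next token only from the K most probable tokens, after renormalising their probabilities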
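As a rough illustration of Top K sampling, the sketch below filters a made-up next-word distribution down to its K most probable entries, renormalises them, and samples from what is left. The helper `top_k_filter` and the probabilities are hypothetical and are not part of the `transformers` API; in `generate`, the `top_k` argument controls this behaviour, and `top_k=0` switches the filter off.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalise so they sum to 1."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {token: p / total for token, p in top}

# Made-up next-word probabilities, for illustration only
probs = {"Park": 0.6, "Doctor": 0.25, "Zoo": 0.1, "Supermarket": 0.05}

filtered = top_k_filter(probs, k=2)

# Sample one token from the filtered distribution (seeded for repeatability)
rng = random.Random(34)
next_token = rng.choices(list(filtered), weights=list(filtered.values()), k=1)[0]

print(filtered)
print(next_token)
```

With `k=2`, only the two most probable words can ever be sampled, which removes the long tail of unlikely tokens while keeping some randomness.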

### 1. Greedy Search

The model tends to repeat itself after a while, because the highest-probability word always wins over less probable ones, which prevents any exploration of more diverse combinations
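The repetition can be reproduced with a toy example. The sketch below is not GPT-2: `TOY_MODEL` is a hypothetical lookup table with made-up probabilities that stands in for the model, and `greedy_decode` always picks the most probable next token.

```python
# Hypothetical next-token table: maps the last token to a made-up
# probability distribution over possible next tokens.
TOY_MODEL = {
    "I": {"feel": 0.6, "am": 0.4},
    "feel": {"like": 0.7, "lonely": 0.3},
    "like": {"I": 0.8, "you": 0.2},
    "am": {"tired": 1.0},
}

def greedy_decode(start, max_len):
    """Always append the single most probable next token."""
    tokens = [start]
    while len(tokens) < max_len:
        dist = TOY_MODEL.get(tokens[-1])
        if not dist:  # no known continuation, so stop
            break
        tokens.append(max(dist, key=dist.get))
    return tokens

greedy_tokens = greedy_decode("I", max_len=9)
print(" ".join(greedy_tokens))
```

Because the most probable path loops back to "I", the less probable branches ("am", "lonely", "you") are never visited and the same three words repeat.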

In [5]:
# Sample texts/ sequences
input_sequence_1 = "I don't know about you, but there's only one thing I want to do after a long day of work"
input_sequence_2 = "There are times when I am really tired of people, but I feel lonely too."

In [9]:
# 1. Greedy Search
# Encode context the generation is conditioned on
input_ids = tokenizer.encode(input_sequence_2, return_tensors='tf')

# Generate until the total length (prompt + generated tokens) reaches MAX_LEN
greedy_output = GPT2.generate(input_ids, max_length=MAX_LEN)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens = True))

Output:
----------------------------------------------------------------------------------------------------
There are times when I am really tired of people, but I feel lonely too. I feel like I am not in the right place at the right time. I feel like I am not in the right place at the right time. I feel like I am not in the right place at the right time. I feel like I am not in the


### 2. Beam Search

Keeps the N best sequences at each step by branching each one into its possible next words and retaining the N best combinations. The model is re-run (not retrained) on each branch, conditioned on the previously selected sequence, to get a new set of next-word probabilities.

e.g. For the sentence: "I am going to the ____"

Possible predictions:

* Park 0.6
* Zoo 0.1
* Doctor 0.25
* Supermarket 0.05

With N=2, the beam search keeps the two most probable words, so it will rerun the model on "I am going to the __Park__" and "I am going to the __Doctor__" and then repeat the process for each result

Each iteration keeps the N sequences with the highest combined probability across __ALL__ branches. This method increases the exploration of possible words
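The procedure can be sketched on the example above. The first-step probabilities are the ones listed for "I am going to the ____"; the second-step probabilities, the `next_token_probs` stand-in for the model and the `beam_search` helper are all hypothetical, chosen only to show a case where the best two-word continuation does not start with the most probable first word.

```python
import math

def next_token_probs(tokens):
    """Hypothetical stand-in for a model call: next-word probabilities."""
    table = {
        "the": {"Park": 0.6, "Zoo": 0.1, "Doctor": 0.25, "Supermarket": 0.05},
        # Made-up second-step probabilities
        "Park": {"today": 0.35, "tomorrow": 0.35, "now": 0.3},
        "Doctor": {"today": 0.9, "now": 0.1},
    }
    return table.get(tokens[-1], {})

def beam_search(prompt, num_beams, steps):
    """Keep the num_beams sequences with the highest total log-probability."""
    beams = [(0.0, list(prompt))]  # (log-probability, tokens)
    for _ in range(steps):
        candidates = []
        for score, tokens in beams:
            probs = next_token_probs(tokens)
            if not probs:  # nothing to extend, carry the sequence forward
                candidates.append((score, tokens))
                continue
            for token, p in probs.items():
                candidates.append((score + math.log(p), tokens + [token]))
        # Compare candidates across ALL branches, then prune to num_beams
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:num_beams]
    return beams

prompt = ["I", "am", "going", "to", "the"]
beams = beam_search(prompt, num_beams=2, steps=2)
for score, tokens in beams:
    print(f"{math.exp(score):.3f}  {' '.join(tokens)}")
```

Under these assumed numbers the top beam ends in "Doctor today" (0.25 × 0.9 = 0.225), which beats the best continuation of "Park" (0.6 × 0.35 = 0.21). Greedy search would have committed to "Park" after the first step and missed it.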

In [None]:
# 2. Beam Search
input_ids = tokenizer.encode(input_sequence_2, return_tensors='tf')

beam_outputs = GPT2.generate(
  input_ids,
  max_length=200,
  num_beams=5,
  no_repeat_ngram_size=2,
  num_return_sequences=5,
  early_stopping=True
)

print('')
print("Output:\n" + 100 * '-')

# we have 5 different outputs
for i, beam_output in enumerate(beam_outputs):
  print(f"{i+1}: {tokenizer.decode(beam_output, skip_special_tokens=True)}")


Output:
----------------------------------------------------------------------------------------------------
1: There are times when I am really tired of people, but I feel lonely too. I don't know what to do with myself.

"I feel like I can't do anything about it. It's like, 'Oh my God, I'm going to have to get out of here.'"
2: There are times when I am really tired of people, but I feel lonely too. I don't know what to do with myself.

"I feel like I can't do anything about it. It's like, 'Oh my God, I'm going to have to get out of here.' I just want to go home."
3: There are times when I am really tired of people, but I feel lonely too. I don't know what to do with myself.

"I feel like I can't do anything about it. It's like, 'Oh my God, I'm going to have to get out of here.' I have no idea what's going on."
4: There are times when I am really tired of people, but I feel lonely too. I don't know what to do with myself.

"I feel like I can't do anything about it. It's like, 'Oh my

### 3. Sample-based Decoding

Randomly select the next token according to its probability distribution, instead of always taking the most probable one.

The model's scores are converted into a probability distribution (conditional on the preceding tokens), and a token is drawn at random from it

_Temperature:_ Controls the "sharpness" of the probability distribution

* Low Temperature (0.1-0.5): Makes high-probability tokens even more likely to be chosen (Leptokurtic), i.e. more focused and predictable text
* High Temperature (0.8-1.5): Flattens the distribution and gives lower-probability tokens a more equal chance (Platykurtic). Increases diversity, but risks incoherence

__OUTCOME:__ The output is more varied, but is sometimes incoherent, especially at higher temperature values
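The effect of temperature can be seen on a small example. The sketch below assumes the usual formulation, in which the scores are divided by the temperature before the softmax; the scores, the vocabulary and the helper `softmax_with_temperature` are made up for illustration and are not taken from GPT-2.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Divide the scores by the temperature, then apply a stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up vocabulary and scores, for illustration only
vocab = ["gym", "home", "bar", "moon"]
logits = [2.0, 1.0, 0.5, 0.0]

for temperature in (0.2, 0.8, 1.5):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])

# Draw a few tokens at a low and a high temperature (seeded for repeatability)
rng = random.Random(34)
sharp = softmax_with_temperature(logits, 0.2)
flat = softmax_with_temperature(logits, 1.5)
print(rng.choices(vocab, weights=sharp, k=5))
print(rng.choices(vocab, weights=flat, k=5))
```

At 0.2 almost all of the probability mass sits on the top-scoring token, so sampling behaves much like greedy search; at 1.5 the mass is spread more evenly and the lower-scoring tokens are drawn more often.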

In [16]:
# 3. Sample-based decoding
input_ids = tokenizer.encode(input_sequence_1, return_tensors='tf')

sample_output = GPT2.generate(
  input_ids,
  do_sample=True,
  max_length=MAX_LEN,
  top_k=0, # disable Top K filtering so sampling uses the full vocabulary
  temperature=0.2 # relatively low temperature value to maintain coherence
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I don't know about you, but there's only one thing I want to do after a long day of work. I want to go to the gym and get my body ready for the next day. I want to go to the gym and get my body ready for the next day. I want to go to the gym and get my body ready


In [17]:
# with a higher temperature value to increase variability in tokens
sample_output = GPT2.generate(
  input_ids,
  do_sample=True,
  max_length=MAX_LEN,
  top_k=0,
  temperature=0.8 # higher temperature value to increase diversity
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I don't know about you, but there's only one thing I want to do after a long day of work. This is the day I leave the house," she said.

"You have to be ready for something," Zhaoxiu said. "I am afraid to leave."

She didn't leave alone.


