# Transformers

| Date | User | Change Type | Remarks |  
| ---- | ---- | ----------- | ------- |
| 03/03/2025   | Martin | Created   | Created notebook for Transformers Chapter. Started text generation section and exploring different decoding methods | 
| 04/03/2025   | Martin | Update   | Completed top k and top p methods for decoding section | 

# Content

* [Introduction](#introduction)
* [Text Generation](#text-generation)

# Introduction

Transformers, like RNNs, process sequential data, but they improve on RNNs because they do not have to process the data in order. This results in better parallelisation and faster training.

They can be pretrained on large bodies of unlabeled data and then fine-tuned for other tasks.

__Typical Tasks__

* Translation
* Question answering
* Text summarisation

__2 Common Architectures__

1. Bidirectional Encoder Representations from Transformers (BERT)
2. Generative Pretrained Transformers (GPTs)

__Recipes Covered__

1. Text generation
2. Sentiment Analysis
3. Text classification: sarcasm detection
4. Question answering

---

# Text Generation

Using GPT-2 from __HuggingFace__

_GPT-2:_ Second generation of the GPT architecture. It showed how generative language models can acquire knowledge and process long-range dependencies thanks to pretraining on a large, diverse corpus of contiguous text.

In [1]:
import tensorflow as tf

from transformers import TFGPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
GPT2 = TFGPT2LMHeadModel.from_pretrained('gpt2', pad_token_id=tokenizer.eos_token_id)


In [2]:
# Set seed
SEED = 34
MAX_LEN = 70 # maximum total length in tokens (prompt + generated text)
tf.random.set_seed(SEED)

## Decoding Methods

These are different methods for choosing the next word:

1. __Greedy Search__ - Selects the word with the highest probability at each step
2. [__Beam Search__](https://towardsdatascience.com/foundations-of-nlp-explained-visually-beam-search-how-it-works-1586b9849a24/) - Keeps the N most probable partial sequences at each step, scoring each one by the combined probability of all its preceding words plus the current word. Every branching sequence is fed back into the model, conditioned on all previously selected words, and this repeats until the end-of-sequence token has the highest probability
3. __Sample-based decoding__ - Randomly selects the next token according to a probability distribution. The distribution is built from the model's conditional probabilities for each word, given the words generated so far
4. __Top K Sampling__ - Restricts sampling to the top $k$ words and shifts the entire probability mass onto them. This increases the chances of high-probability words occurring, and words outside the top $k$ can no longer be chosen
5. __Top P Sampling (Nucleus Sampling)__ - Chooses the smallest set of words whose cumulative probability exceeds $p$. The probability mass function is rescaled to this set of words, so the size of the set changes with each step
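The two filtering rules in items 4 and 5 can be sketched in plain Python. This is a minimal illustration rather than the HuggingFace implementation: the function names `top_k_filter` and `top_p_filter` and the toy distribution are assumptions made up for this example.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalise their probabilities."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose mass exceeds p, then renormalise."""
    kept, cumulative = [], 0.0
    for token, prob in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept.append((token, prob))
        cumulative += prob
        if cumulative > p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


# Hypothetical next-word distribution (illustrative numbers only)
next_word_probs = {"park": 0.6, "doctor": 0.25, "zoo": 0.1, "supermarket": 0.05}

top_k_result = top_k_filter(next_word_probs, k=2)    # always exactly 2 words
top_p_result = top_p_filter(next_word_probs, p=0.9)  # as many words as needed to pass 0.9
print(top_k_result)
print(top_p_result)
```

With this toy distribution, top-k keeps a fixed number of words, while top-p keeps however many words are needed to pass the threshold, so a flatter distribution would yield a larger set.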

### 1. Greedy Search

The model tends to repeat itself after a while, because always taking the highest-probability word masks the less probable ones, which prevents any exploration of more diverse combinations.
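The repetition can be reproduced on a toy model. The sketch below is an assumption-heavy simplification: `toy_model` is a made-up table that conditions only on the previous word (GPT-2 conditions on the whole context), and `greedy_decode` is a hypothetical helper, not a library function.

```python
# Hypothetical next-word probabilities for a tiny vocabulary (not taken from GPT-2)
toy_model = {
    "I": {"feel": 0.7, "am": 0.3},
    "feel": {"like": 0.8, "lonely": 0.2},
    "like": {"I": 0.6, "that": 0.4},
}


def greedy_decode(model, start, steps):
    """Repeatedly append the single most probable next word."""
    tokens = [start]
    for _ in range(steps):
        candidates = model[tokens[-1]]
        tokens.append(max(candidates, key=candidates.get))
    return tokens


greedy_tokens = greedy_decode(toy_model, "I", steps=8)
print(" ".join(greedy_tokens))
```

Because the most probable path loops back to its starting word, the lower-probability alternatives ("am", "lonely", "that") are never explored and the output cycles.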

In [3]:
# Sample texts/ sequences
input_sequence_1 = "I don't know about you, but there's only one thing I want to do after a long day of work"
input_sequence_2 = "There are times when I am really tired of people, but I feel lonely too."

In [9]:
# 1. Greedy Search
# Encode context the generation is conditioned on
input_ids = tokenizer.encode(input_sequence_2, return_tensors='tf')

# Generate text until the total length (prompt + generated tokens) reaches MAX_LEN
greedy_output = GPT2.generate(input_ids, max_length=MAX_LEN)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens = True))

Output:
----------------------------------------------------------------------------------------------------
There are times when I am really tired of people, but I feel lonely too. I feel like I am not in the right place at the right time. I feel like I am not in the right place at the right time. I feel like I am not in the right place at the right time. I feel like I am not in the


### 2. Beam Search

Keeps the N best sequences at each step by extending every kept sequence with its possible next words and retaining the N most probable combinations. Each branch is fed back into the model, conditioned on the previously selected sequence, to get a new set of next-word probabilities.

e.g. for the sentence: "I am going to the ____"

Possible predictions:

* Park 0.6
* Zoo 0.1
* Doctor 0.25
* Supermarket 0.05

With N=2, beam search keeps the two most probable words, rerunning the model on "I am going to the __Park__" and "I am going to the __Doctor__", and then repeats the process for each result.

Each iteration keeps the N sequences with the highest combined probability across __ALL__ branches. This method increases the exploration of possible words.
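A minimal sketch of this procedure, using the first-step probabilities from the example, is given below. The second-step probabilities, the helper names `beam_search_step` and `toy_next_probs`, and the use of log-probabilities for scoring are assumptions made for illustration.

```python
import math


def beam_search_step(beams, next_probs_fn, num_beams):
    """Expand every beam by one word and keep the num_beams best by log-probability."""
    candidates = []
    for tokens, score in beams:
        for token, prob in next_probs_fn(tokens).items():
            candidates.append((tokens + [token], score + math.log(prob)))
    candidates.sort(key=lambda candidate: candidate[1], reverse=True)
    return candidates[:num_beams]


def toy_next_probs(tokens):
    """Hypothetical stand-in for the model: next-word probabilities given the words so far."""
    table = {
        (): {"park": 0.6, "doctor": 0.25, "zoo": 0.1, "supermarket": 0.05},
        ("park",): {"today": 0.35, "tomorrow": 0.33, "later": 0.32},
        ("doctor",): {"today": 0.9, "tomorrow": 0.1},
    }
    return table[tuple(tokens)]


beams = [([], 0.0)]  # start with one empty sequence whose log-probability is 0
for _ in range(2):
    beams = beam_search_step(beams, toy_next_probs, num_beams=2)

for tokens, score in beams:
    print(tokens, round(math.exp(score), 3))
```

In this toy setup, greedy search would commit to "park" and end with a sequence probability of 0.21, whereas the beam that started with the less likely "doctor" finishes higher at 0.225.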

In [None]:
# 2. Beam Search
input_ids = tokenizer.encode(input_sequence_2, return_tensors='tf')

beam_outputs = GPT2.generate(
  input_ids,
  max_length=200,
  num_beams=5,
  no_repeat_ngram_size=2,
  num_return_sequences=5,
  early_stopping=True
)

print('')
print("Output:\n" + 100 * '-')

# we have 5 different outputs
for i, beam_output in enumerate(beam_outputs):
  print(f"{i+1}: {tokenizer.decode(beam_output, skip_special_tokens=True)}")


Output:
----------------------------------------------------------------------------------------------------
1: There are times when I am really tired of people, but I feel lonely too. I don't know what to do with myself.

"I feel like I can't do anything about it. It's like, 'Oh my God, I'm going to have to get out of here.'"
2: There are times when I am really tired of people, but I feel lonely too. I don't know what to do with myself.

"I feel like I can't do anything about it. It's like, 'Oh my God, I'm going to have to get out of here.' I just want to go home."
3: There are times when I am really tired of people, but I feel lonely too. I don't know what to do with myself.

"I feel like I can't do anything about it. It's like, 'Oh my God, I'm going to have to get out of here.' I have no idea what's going on."
4: There are times when I am really tired of people, but I feel lonely too. I don't know what to do with myself.

"I feel like I can't do anything about it. It's like, 'Oh my

### 3. Sample-based Decoding

Randomly selects the next token according to its probability distribution.

The model's scores are converted into a conditional probability distribution, and a token is drawn at random from it.

_Temperature:_ Controls the "sharpness" of the probability distribution

* Low Temperature (0.1-0.5): Makes high-probability tokens more likely to be chosen (leptokurtic), i.e. more focused and predictable text
* High Temperature (0.8-1.5): Flattens the distribution and gives lower-probability tokens a more equal chance (platykurtic). Increases diversity, but risks incoherence

__OUTCOME:__ The output is more varied, but it is sometimes incoherent, especially with a higher temperature value
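The effect of temperature can be seen by dividing the scores by the temperature before applying softmax. The sketch below uses made-up scores, and `softmax_with_temperature` is a hypothetical helper written for this example.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide the scores by the temperature, then normalise them with softmax."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5, 0.1]  # hypothetical scores for four tokens

sharp = softmax_with_temperature(logits, temperature=0.2)
flat = softmax_with_temperature(logits, temperature=1.5)
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
```

At the low temperature almost all of the probability lands on the top-scoring token, which is consistent with the repetitive output at `temperature=0.2`; at the higher temperature the other tokens get a realistic chance of being drawn.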

In [16]:
# 3. Sample-based decoding
input_ids = tokenizer.encode(input_sequence_1, return_tensors='tf')

sample_output = GPT2.generate(
  input_ids,
  do_sample=True,
  max_length=MAX_LEN,
  top_k=0, # disable top-k filtering so sampling uses the full distribution
  temperature=0.2 # relatively low temperature value to maintain coherence
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I don't know about you, but there's only one thing I want to do after a long day of work. I want to go to the gym and get my body ready for the next day. I want to go to the gym and get my body ready for the next day. I want to go to the gym and get my body ready


In [17]:
# with a higher temperature value to increase variability in tokens
sample_output = GPT2.generate(
  input_ids,
  do_sample=True,
  max_length=MAX_LEN,
  top_k=0,
  temperature=0.8 # higher temperature value to increase variability
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I don't know about you, but there's only one thing I want to do after a long day of work. This is the day I leave the house," she said.

"You have to be ready for something," Zhaoxiu said. "I am afraid to leave."

She didn't leave alone.




### 4. Top K Sampling

In [4]:
input_ids = tokenizer.encode(input_sequence_1, return_tensors='tf')

sample_output = GPT2.generate(
  input_ids,
  do_sample=True,
  max_length=MAX_LEN,
  top_k=50
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True), '...')

Output:
----------------------------------------------------------------------------------------------------
I don't know about you, but there's only one thing I want to do after a long day of work. I have a great time. And that day I'm going to stop getting sick. Because you can't sit up with yourself and eat nothing, you have to sit up with yourself."

If you are on social media, ...


### 5. Top P Sampling / Nucleus Sampling

In [5]:
input_ids = tokenizer.encode(input_sequence_1, return_tensors='tf')

sample_output = GPT2.generate(
  input_ids,
  do_sample=True,
  max_length=MAX_LEN,
  top_k=0,
  top_p=0.8
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True), '...')

Output:
----------------------------------------------------------------------------------------------------
I don't know about you, but there's only one thing I want to do after a long day of work that I can do. I want to get back to work, and get back to helping the community."

But on a personal note, Zuckerman says he hopes his decision will provide him with hope for his daughter.
 ...


### Combining Top K and Top P
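When `top_k` and `top_p` are both set, the candidate set is narrowed twice before a token is sampled. The sketch below assumes that top-k is applied first and top-p second on the renormalised result. The helper name `filter_top_k_then_top_p` and the probabilities are hypothetical.

```python
import random


def filter_top_k_then_top_p(probs, k, p):
    """Apply top-k, renormalise, then apply top-p and renormalise again."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(prob for _, prob in ranked)
    ranked = [(token, prob / total) for token, prob in ranked]

    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative > p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


# Hypothetical next-word distribution (illustrative numbers only)
probs = {"park": 0.4, "doctor": 0.3, "zoo": 0.15, "gym": 0.1, "supermarket": 0.05}
filtered = filter_top_k_then_top_p(probs, k=4, p=0.8)

rng = random.Random(34)  # local generator, seeded so the draw is repeatable
samples = rng.choices(list(filtered), weights=list(filtered.values()), k=1000)
sampled_tokens = set(samples)
print(filtered)
print(sampled_tokens)
```

Here top-k removes "supermarket" and top-p then removes "gym", so neither can be sampled however many tokens are drawn.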

In [11]:
input_ids = tokenizer.encode(input_sequence_1, return_tensors='tf')

sample_outputs = GPT2.generate(
  input_ids,
  do_sample=True,
  max_length=2*MAX_LEN,
  top_k=50,
  top_p=0.8,
  num_return_sequences=5
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
  print(f"{i+1}: {tokenizer.decode(sample_output, skip_special_tokens=True)}")
  print('')

Output:
----------------------------------------------------------------------------------------------------
1: I don't know about you, but there's only one thing I want to do after a long day of work. I want to make my wife happy."

When the day started, the young woman told me she felt sorry for her daughter, who was suffering from an autism spectrum disorder. She said that after working long hours, she felt like a burden. She had no idea that she had to take a stand on behalf of other children.

"I want my kids to learn how to be successful," the woman said. "I want them to be able to do everything in their power to be successful, and to do that because they know what it's like to have no

2: I don't know about you, but there's only one thing I want to do after a long day of work. I want to do this for you, for your children and for your grandchildren. I'm going to make sure you get a good education."

A family member of mine, an Australian citizen, went on to write a long story in 