# Fine-Tune a Model

A fine-tuned [distilgpt2](https://huggingface.co/distilbert/distilgpt2) model for generating manga-style quotes, trained on a [curated dataset](https://www.kaggle.com/datasets/tarundalal/anime-quotes) of anime and manga quotes.

Requires TensorFlow 2.15.0.

In [12]:
import tensorflow as tf
import tensorflow.keras as keras
print(f"Tensorflow {tf.__version__}")

from transformers import AutoTokenizer, TFAutoModelForCausalLM # to load pretrained models

import pandas as pd # for data processing
import re # for text processing
import numpy as np

Tensorflow 2.15.0


## 1. Load the Pretrained Model and Tokenizer

In [2]:
model_name = "distilgpt2"  # Use a TensorFlow-supported model

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
print(tokenizer.special_tokens_map)

# Load model
model = TFAutoModelForCausalLM.from_pretrained(model_name)

{'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}


All PyTorch model weights were used when initializing TFGPT2LMHeadModel.

All the weights of TFGPT2LMHeadModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


## 2. Importing Data

In [9]:
df = pd.read_csv("./data/AnimeQuotes.csv")
print("Dataset loaded successfully")

df.head()

Dataset loaded successfully


| Unnamed: 0 | Quote | Character | Anime |
|------------|-------|-----------|-------|
| 0 | People’s lives don’t end when they die, it end... | Itachi Uchiha | Naruto |
| 1 | If you don’t take risks, you can’t create a fu... | Monkey D Luffy | One Piece |
| 2 | If you don’t like your destiny, don’t accept it. | Naruto Uzumaki | Naruto |
| 3 | When you give up, that’s when the game ends. | Mitsuyoshi Anzai | Slam Dunk |
| 4 | All we can do is live until the day we die. Co... | Deneil Young | Uchuu Kyoudai or Space Brothers |


## 3. Cleaning Text Data

In [10]:
# Function to clean text by removing unwanted punctuation
def clean_text(text):
    text = re.sub(r'["”!?,…]', '', text)  # Remove specific punctuation
    text = text.strip()  # Remove leading/trailing spaces
    # print(text)
    return text

# Extract the "Quote" column from the DataFrame
quotes = df['Quote']

# Convert all values in the column to strings (in case they are not already)
quotes = quotes.astype(str)

# Apply text cleaning function
quotes = quotes.apply(clean_text)

# Convert the column into a Python list (so it can be processed further)
quotes = quotes.tolist()

quotes[:2]

['People’s lives don’t end when they die it ends when they lose faith.',
 'If you don’t take risks you can’t create a future']
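
As a quick sanity check of the cleaning step, the sketch below applies the same regular expression to a made-up sample string (the sample is hypothetical and not a row from the dataset). It shows that the listed punctuation is removed while periods and the typographic apostrophe `’` survive, because they are not part of the character class.

```python
import re

def clean_text(text):
    # Same character class as the cleaning function above
    text = re.sub(r'["”!?,…]', '', text)
    return text.strip()  # Remove leading/trailing spaces

# Hypothetical sample string, for illustration only
sample = '  "Never give up, no matter what!"  '
cleaned = clean_text(sample)
print(cleaned)

# Periods and the apostrophe in "don’t" are left untouched
untouched = clean_text("Don’t stop.")
print(untouched)
```

Note that the opening typographic quote `“` is not in the character class either, so it would also be kept if it appeared in a quote.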

## 4. Tokenization

- Tokenization transforms a sentence into a list of tokens, for example from `"People’s lives don’t end when they die it ends when they lose faith."` to `['People', '’s', 'lives', 'don', '’t', 'end', 'when', 'they', 'die', 'it', 'ends', 'when', 'they', 'lose', 'faith', '<PAD>', '<PAD>', ..., '<PAD>']`. This split is a simplified illustration: distilgpt2 uses byte-level byte-pair encoding (BPE), so the actual tokens are subword pieces and may differ.

### 4.1 Padding

- Padding is added because all sequences in a batch must have the same length. The attention mask marks the `<PAD>` positions so that the model does not attend to them. Note that the mask alone does not remove padded positions from the loss; that requires masking the labels or the loss as well.
- Padding can be set manually by passing `padding="max_length", max_length=50` to the tokenizer call, or automatically with `padding=True`, which pads every sequence to the longest one in the batch.
- Ideally, `max_length` should sit slightly above the 95th percentile of the tokenized quote lengths. This covers most of the data while avoiding excessive padding, and matters mainly when efficiency is important.


In [13]:
# Optional
# Calculate 95th percentile for efficient padding
quote_lengths = [len(tokenizer.encode(q)) for q in quotes]

# Print summary stats
print(f"Max length: {max(quote_lengths)}")
print(f"Mean length: {np.mean(quote_lengths):.2f}")
print(f"95th percentile: {np.percentile(quote_lengths, 95)}")

Max length: 73
Mean length: 24.07
95th percentile: 49.0


### 4.2 Creating the Tokens and Attention Mask 

When the tokenizer is called with `return_tensors="tf"`, it returns a dictionary with two tensors, `input_ids` and `attention_mask`.
- input_ids → The numerical token representation of the sentences.
    - Example (illustrative IDs): "Life is an adventure." → [72, 318, 281, 18299, 13, 50256, 50256], where 50256 is the padding token.
- attention_mask → A binary tensor:
    - 1 → Real tokens.
    - 0 → Padding tokens, ignored by the model's attention.
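
The relationship between the two tensors can be sketched in plain Python. This is a minimal illustration, not the tokenizer's actual implementation: `pad_batch` is a hypothetical helper, the token IDs are made up, and only the pad ID 50256 is taken from the notebook output.

```python
PAD_ID = 50256  # <|endoftext|>, reused as the pad token in this notebook

def pad_batch(sequences, pad_id=PAD_ID, max_length=None, side="right"):
    """Pad (or truncate) lists of token IDs to one length and build the mask."""
    target = max_length or max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = seq[:target]              # truncate sequences that are too long
        n_pad = target - len(seq)
        pads, ones = [pad_id] * n_pad, [1] * len(seq)
        if side == "right":
            input_ids.append(seq + pads)
            attention_mask.append(ones + [0] * n_pad)
        else:                           # left padding
            input_ids.append(pads + seq)
            attention_mask.append([0] * n_pad + ones)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token IDs, for illustration only
batch = pad_batch([[11, 12, 13, 14], [21, 22]])
print(batch["input_ids"])       # [[11, 12, 13, 14], [21, 22, 50256, 50256]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

Every 0 in the mask lines up with a pad ID in `input_ids`, which is the property the model relies on.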

In [14]:
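# GPT-2 models ship without a padding token, so reuse the end-of-text token
tokenizer.pad_token = tokenizer.eos_token

# Tokenize all quotes; returns a dictionary of TensorFlow tensors
tokenized_quotes = tokenizer(
    quotes, truncation=True, padding=True, return_tensors="tf"
)
tokenized_quotes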

{'input_ids': <tf.Tensor: shape=(121, 73), dtype=int32, numpy=
array([[50256, 50256, 50256, ...,  4425,  4562,    13],
       [50256, 50256, 50256, ...,  2251,   257,  2003],
       [50256, 50256, 50256, ...,  2453,   340,    13],
       ...,
       [50256, 50256, 50256, ...,  2119,   284,  1663],
       [50256, 50256, 50256, ...,   534,  7401, 29955],
       [50256, 50256, 50256, ...,  1566,   340,  9457]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(121, 73), dtype=int32, numpy=
array([[0, 0, 0, ..., 1, 1, 1],
       [0, 0, 0, ..., 1, 1, 1],
       [0, 0, 0, ..., 1, 1, 1],
       ...,
       [0, 0, 0, ..., 1, 1, 1],
       [0, 0, 0, ..., 1, 1, 1],
       [0, 0, 0, ..., 1, 1, 1]], dtype=int32)>}

### 4.3 Shifting Labels

Causal language models like distilgpt2 predict **the next token**, not the current one. The labels therefore have to be the input tokens shifted one position to the left.

Below is a comparison of **before and after shifting**:

| Step | Input Token (Training Input) | ❌ Labels Without Shifting (Incorrect) | ✅ Labels With Shifting (Correct) |
|------|------------------------------|----------------------------------------|----------------------------------|
| 1️⃣  | `Life`                       | `Life` ❌ (Wrong, should predict `"is"`) | `is` ✅ |
| 2️⃣  | `is`                         | `is` ❌ (Wrong, should predict `"an"`) | `an` ✅ |
| 3️⃣  | `an`                         | `an` ❌ (Wrong, should predict `"adventure"`) | `adventure` ✅ |
| 4️⃣  | `adventure`                  | `adventure` ❌ (Wrong, should predict `"."`) | `.` ✅ |
| 5️⃣  | `.`                          | `.` ❌ (Wrong, should predict `<\|endoftext\|>`) | `<\|endoftext\|>` ✅ |
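
The shift in the table can be reproduced with a few lines of plain Python. This is a minimal sketch on word-level strings rather than real token IDs: drop the first token and append the end-of-text/pad token so that inputs and labels keep the same length.

```python
PAD = "<|endoftext|>"  # also used as the pad token in this notebook

# Word-level stand-ins for the real subword tokens
tokens = ["Life", "is", "an", "adventure", "."]

# Shift left by one and fill the last position with the pad token
labels = tokens[1:] + [PAD]

pairs = list(zip(tokens, labels))
for step, (inp, lab) in enumerate(pairs, start=1):
    print(f"{step}. input={inp} -> label={lab}")
```

Each input position is paired with the token that follows it, which matches the "Labels With Shifting" column.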

In [15]:
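# Build (features, labels) pairs with the labels shifted one position to the left
def map_fn(input_ids, attention_mask):
    # One extra column of pad tokens, so labels keep the same shape as input_ids
    pad_column = tf.fill((tf.shape(input_ids)[0], 1), tokenizer.pad_token_id)
    # Drop the first token and append the pad column: labels[t] == input_ids[t + 1]
    labels = tf.concat([input_ids[:, 1:], pad_column], axis=1)
    return {"input_ids": input_ids, "attention_mask": attention_mask}, labels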

BATCH_SIZE = 8
dataset = tf.data.Dataset.from_tensor_slices(
    (tokenized_quotes["input_ids"], tokenized_quotes["attention_mask"])
).batch(BATCH_SIZE).map(map_fn)

## 5. Fine-Tune Pre-Trained Model with Custom Data Set

Some rough rules of thumb for interpreting `loss` on a small dataset like this one (heuristics, not universal thresholds):
- Below 2.0 → The model is learning well.
- Between 1.0 and 1.5 → Good text generation capability.
- Below 1.0 → The model fits the training data very closely.
- Around 0.5 or lower → The model might be overfitting, i.e. memorizing the training data instead of generalizing. No validation set is used here, so the training loss alone cannot confirm this.
- A smooth, consistent drop in loss suggests the model is learning steadily rather than abruptly overfitting.

Then, test the output.
- If the output looks good, no need for further training!
- If the output is still a bit repetitive, train for 2-3 more epochs (EPOCHS = 7 or 8) with a lower learning rate (learning_rate=3e-5) to refine the model.
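
One way to put loss values on a more intuitive scale is perplexity, the exponential of the mean per-token cross-entropy. The sketch below assumes the loss is measured in nats, as with a natural-log cross-entropy, and uses the threshold values from the notes above rather than measurements from this training run.

```python
import math

def perplexity(mean_loss):
    """Convert a mean per-token cross-entropy loss (in nats) to perplexity."""
    return math.exp(mean_loss)

# Threshold values from the notes above, not measured results
for loss in (2.0, 1.5, 1.0, 0.5):
    print(f"loss={loss:.1f} -> perplexity={perplexity(loss):.2f}")
```

A loss of 2.0 corresponds to a perplexity of about 7.39, meaning the model is roughly as uncertain as if it were choosing uniformly among seven tokens at each step.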

In [16]:
# Adam optimizer (legacy Keras implementation) with a small learning rate for fine-tuning
optimizer = tf.keras.optimizers.legacy.Adam(learning_rate=5e-5)

# Cross-entropy on raw logits; the labels are integer token IDs
loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction="none")

# Compile model
model.compile(optimizer=optimizer, loss=loss_fn)

# Train the model
EPOCHS = 5
model.fit(dataset, epochs=EPOCHS)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x32b8d6f90>

### 5.1 Save the model (optional)

In [15]:
model.save_pretrained("./model-export/")
tokenizer.save_pretrained("./model-export/")

('./model-export/tokenizer_config.json',
 './model-export/special_tokens_map.json',
 './model-export/vocab.json',
 './model-export/merges.txt',
 './model-export/added_tokens.json',
 './model-export/tokenizer.json')

## 6. Generate a Manga-Style Quote

input_ids – The encoded input prompt from which the model will generate text.
- Example: input_ids = tokenizer.encode("A warrior never", return_tensors="tf")

max_length=50 – The maximum total length of the output in tokens, counting the prompt as well as the newly generated tokens. Tokens are subword pieces, so this is not a word count.
- Increase for longer responses (e.g., max_length=100 for full paragraphs).
- Decrease for shorter quotes (e.g., max_length=30).

temperature=0.7 – Controls randomness in word selection.
- Lower values (e.g., 0.3) make the output more predictable and deterministic.
- Higher values (e.g., 1.0) make the output more creative and diverse.

top_k=50 – Limits token selection to the 50 most probable tokens at each step.
- Lower values (e.g., top_k=10) make the output more focused.
- Higher values increase diversity but can make the output less coherent.

top_p=0.9 – Enables nucleus sampling, which samples from the smallest set of most probable tokens whose cumulative probability reaches 90%. The cell below uses a stricter top_p=0.6.
- If top_p=1.0, nucleus filtering is effectively disabled (more unpredictable).
- If top_p=0.5, the model limits selection to only the most likely tokens (more controlled).

do_sample=True – Enables sampling instead of greedy decoding, which improves creativity.
- If False, the model will always pick the highest-probability token (more robotic responses).
- If True, the model will randomly sample from the probability distribution (more natural responses).
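
How temperature, top_k and top_p interact can be sketched on a toy distribution. This is a minimal illustration in plain Python, not the code that `model.generate` runs internally: `filtered_probs` is a hypothetical helper, the logits are made up, and the exact filtering order in the library may differ.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a {token: score} dictionary."""
    peak = max(scores.values())
    exps = {tok: math.exp(s - peak) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def filtered_probs(logits, temperature=0.7, top_k=50, top_p=0.9):
    """Return {token: prob} after temperature scaling, top-k and top-p filtering."""
    # 1. Temperature: divide the logits before the softmax (lower = sharper)
    probs = softmax({tok: logit / temperature for tok, logit in logits.items()})

    # 2. Top-k: keep the k most probable tokens and renormalize
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    mass = sum(p for _, p in ranked)
    ranked = [(tok, p / mass) for tok, p in ranked]

    # 3. Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # 4. Renormalize the surviving tokens
    mass = sum(p for _, p in kept)
    return {tok: p / mass for tok, p in kept}

# Made-up logits for the token following "Life is"
toy_logits = {" a": 2.0, " the": 1.5, " not": 1.0, " pain": 0.2, " banana": -1.0}

probs = filtered_probs(toy_logits, temperature=0.7, top_k=4, top_p=0.9)
print(probs)

# do_sample=True corresponds to drawing from this distribution;
# do_sample=False would always take the most probable token instead
random.seed(0)
next_token = random.choices(list(probs), weights=list(probs.values()))[0]
greedy_token = max(probs, key=probs.get)
print(next_token, greedy_token)
```

With these toy values, top_k=4 removes the least likely token, and top_p=0.9 then trims the candidates to the three most probable ones. Lowering top_p to 0.5 leaves only the single most likely token.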

In [28]:
input_text = "Life is"
input_ids = tokenizer.encode(input_text, return_tensors="tf")

output = model.generate(
    input_ids,
    max_length=50,  # Total length in tokens, prompt included
    temperature=0.7,  # Below 1.0: sharper distribution, less random output
    top_k=50,  # Sample only from the 50 most probable tokens
    top_p=0.6,  # Nucleus sampling: keep the top tokens covering 60% of the probability mass
    repetition_penalty=1.2,  # Penalize tokens that already appear in the sequence
    do_sample=True,  # Sample instead of greedy decoding
    no_repeat_ngram_size=2  # Never repeat the same 2-token sequence
)

generated_text = tokenizer.decode(output[0], skip_special_tokens=True).strip('"”’“‘')
print(generated_text)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Life is a very important tool for us to make sure that we are doing the right thing.
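
The effect of `no_repeat_ngram_size=2` can be illustrated with a small sketch. It is a simplified stand-in written for this explanation, not the library's implementation: `banned_next_tokens` is a hypothetical helper that works on word-level strings and returns the tokens that would complete an n-gram already present in the generated text.

```python
def banned_next_tokens(generated, ngram_size=2):
    """Tokens that would repeat an n-gram already present in `generated`."""
    if ngram_size < 1 or len(generated) < ngram_size:
        return set()
    # The last (n - 1) tokens form the prefix of the n-gram about to be completed
    prefix = tuple(generated[len(generated) - (ngram_size - 1):]) if ngram_size > 1 else ()
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])  # emitting this token would repeat the n-gram
    return banned

# Word-level stand-ins for real tokens
history = ["Life", "is", "a", "game", "Life"]
blocked = banned_next_tokens(history, ngram_size=2)
print(blocked)  # {'is'}
```

Because "Life is" already occurred, "is" is blocked as the next token after the second "Life". This is why the setting reduces loops of repeated phrases.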


### 6.1 Best of Quotes 

3 Epochs
- Life is the most important thing in life.
- Life is a waste of time,’s hard work and effort. It will not be the best thing you can do
- Life is a world of pain.
- Life is a place to be. We’re going down and on, we can continue
- Life is the perfect moment of life.

5 Epochs
- Life is not a thing. But it’s the beginning of all things in life
- Life is not a matter of fate. It’s an act of destiny to be loved and peace for you all the
- Life is the only thing you can do.’s what it takes to live with your enemies and overcome them all!
- Life is a game of luck. You’ll never lose sight to your enemies and gain the most important thing you can
- Life is the most important thing to do. You’ll never forget it
- Life is not a game of luck. It’s the outcome that makes you happy and strong but it doesn't mean what your opponent wins! If there are no problems to overcome then they can always win with patience or hard work as well
- Life is a beautiful place to live. But it’s not the right time for you!
- Life is not the end of all things but a journey. The beginning will be as long and beautiful as you can find it. You are free to live your way through life’s path. Your destiny is what makes up for it all. True, but when.

- Art is the same thing as a dream.

