# A Notebook to Generate Manga Quotes

Based on Transfer Learning from a Small Language Model. 

In [1]:
import tensorflow as tf
print(tf.__version__)  # Should print 2.x.x

import keras
print(keras.__version__)

import kagglehub
import os
import pandas as pd



2.15.0
2.15.0



## Define the Pretrained Model and Tokenizer

In [19]:
from transformers import AutoTokenizer, TFAutoModelForCausalLM

model_name = "distilgpt2"  # Use a TensorFlow-supported model

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load model
model = TFAutoModelForCausalLM.from_pretrained(model_name)

All PyTorch model weights were used when initializing TFGPT2LMHeadModel.

All the weights of TFGPT2LMHeadModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


## Load the Custom Data for Fine-Tuning

In [None]:
path = kagglehub.dataset_download("tarundalal/anime-quotes")
print("Path to dataset files:", path)

filename = "AnimeQuotes.csv"  # Specify the filename here
dataset_path = os.path.join(path, filename)

if os.path.exists(dataset_path):
    df = pd.read_csv(dataset_path)
    print("Dataset loaded successfully:")
    print(df.head())
else:
    raise FileNotFoundError(f"File {filename} not found in the dataset directory.")


## Data Pre-Processing

In [55]:
import re

# Function to clean text by removing unwanted punctuation
def clean_text(text):
    text = re.sub(r'["”!?,…]', '', text)  # Remove specific punctuation
    text = text.strip()  # Remove leading/trailing spaces
    return text

# Extract the "Quote" column from the DataFrame
quotes = df['Quote']

# Convert all values in the column to strings (in case they are not already)
quotes = quotes.astype(str)

# Apply text cleaning function
quotes = quotes.apply(clean_text)

# Convert the column into a Python list (so it can be processed further)
quotes = quotes.tolist()

quotes

['People’s lives don’t end when they die it ends when they lose faith.',
 'If you don’t take risks you can’t create a future',
 'If you don’t like your destiny don’t accept it.',
 'When you give up that’s when the game ends.',
 'All we can do is live until the day we die. Control what we canand fly free.',
 'Forgetting is like a wound. The wound may heal but it has already left a scar.',
 'It’s just pathetic to give up on something before you even give it a shot.',
 'If you don’t share someone’s pain you can never understand them.',
 'Whatever you lose you’ll find it again. But what you throw away you’ll never get back.',
 'We don’t have to know what tomorrow holds That’s why we can live for everything we’re worth today',
 'Why should I apologize for being a monster Has anyone ever apologized for turning me into one',
 'People become stronger because they have memories they can’t forget.',
 'I’ll leave tomorrow’s problems to tomorrow’s me.',
 'If you wanna make people dream you’ve gott
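
The cleaned output above contains glued words such as "canand", presumably because the removed character (an ellipsis or a comma) had no space around it in the source. The sketch below shows one way to avoid that. `clean_text_spaced` is a hypothetical name, not part of the notebook, and the sample sentence is an assumed reconstruction of the original quote.

```python
import re

def clean_text_spaced(text: str) -> str:
    """Hypothetical variant of clean_text that avoids gluing words together."""
    # Turn the ellipsis into a space first, so "can…and" becomes "can and".
    text = text.replace("…", " ")
    # Remove the notebook's punctuation set, plus the opening curly quote.
    text = re.sub(r'["“”!?,]', "", text)
    # Collapse any run of whitespace into a single space.
    return re.sub(r"\s+", " ", text).strip()

# Assumed input; the raw dataset row may differ.
print(clean_text_spaced("“Control what we can…and fly free!”"))
```

Note that switching to this variant would change the training text, so the outputs recorded in this notebook would no longer match exactly.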

### Tokenization

- distilgpt2, like every GPT-2 model, has no default padding token: it was pretrained on continuous text, so padding was never needed. Batched fine-tuning, however, needs sequences of equal length, so we define a padding token manually. The cell below reuses the end-of-sequence token for this and pads on the left.
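
To make the effect of padding concrete, here is a minimal pure-Python sketch of what left-padding to a fixed length produces for one sequence. `pad_left` is a hypothetical helper, not a tokenizer API, and the token ids 10, 11, 12 are made up; 50256 is GPT-2's end-of-sequence id.

```python
def pad_left(token_ids, max_length, pad_id):
    """Left-pad (or truncate) one sequence and build its attention mask."""
    token_ids = token_ids[:max_length]                    # drop tokens from the end if too long
    n_pad = max_length - len(token_ids)
    input_ids = [pad_id] * n_pad + token_ids              # padding goes on the left
    attention_mask = [0] * n_pad + [1] * len(token_ids)   # 0 = padding, 1 = real token
    return input_ids, attention_mask

EOS_ID = 50256  # GPT-2's end-of-sequence id, reused here as the pad id
ids, mask = pad_left([10, 11, 12], max_length=5, pad_id=EOS_ID)
print(ids)   # [50256, 50256, 10, 11, 12]
print(mask)  # [0, 0, 1, 1, 1]
```

Because the pad id equals the end-of-sequence id, only the attention mask tells the model which positions are padding.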

In [56]:
# Assign a padding token
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

# Tokenize all quotes
tokenized_quotes = tokenizer(
    quotes, truncation=True, padding="max_length", max_length=50, return_tensors="tf"
)

In [57]:
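# Build (features, labels) pairs for next-token prediction
def map_fn(input_ids, attention_mask):
    # Labels are the inputs shifted left by one position;
    # the last position has no next token, so it is filled with the pad token
    batch_size = tf.shape(input_ids)[0]
    pad_column = tf.fill((batch_size, 1), tokenizer.pad_token_id)
    labels = tf.concat([input_ids[:, 1:], pad_column], axis=1)
    return {"input_ids": input_ids, "attention_mask": attention_mask}, labels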

BATCH_SIZE = 8
dataset = tf.data.Dataset.from_tensor_slices(
    (tokenized_quotes["input_ids"], tokenized_quotes["attention_mask"])
).batch(BATCH_SIZE).map(map_fn)

# Convert tokenized quotes into a TensorFlow dataset for training:
# Convert to TensorFlow Dataset
# dataset = tf.data.Dataset.from_tensor_slices((tokenized_quotes["input_ids"], tokenized_quotes["attention_mask"]))

# Shuffle and batch
# dataset = dataset.shuffle(len(quotes)).batch(BATCH_SIZE)
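
The label shift in `map_fn` can be illustrated on a single left-padded sequence in plain Python. As far as the cell shows, padded positions are not excluded from the loss, so the sketch also derives a `loss_mask`. That mask is a hypothetical addition, not something the notebook uses; it could, for example, serve as per-token weights.

```python
def shift_labels(input_ids, attention_mask, pad_id):
    """Return next-token labels and a mask of positions whose target is a real token."""
    labels = input_ids[1:] + [pad_id]       # labels[i] is the token at position i + 1
    loss_mask = attention_mask[1:] + [0]    # 1 where the target is a real token, else 0
    return labels, loss_mask

PAD = 50256  # GPT-2's end-of-sequence id, reused as the pad id
labels, loss_mask = shift_labels([PAD, PAD, 10, 11, 12], [0, 0, 1, 1, 1], PAD)
print(labels)     # [50256, 10, 11, 12, 50256]
print(loss_mask)  # [0, 1, 1, 1, 0]
```

Without such a mask, the model is also rewarded for predicting pad tokens, which can make the reported loss look lower than the loss on real text.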

## Fine-Tune Pre-Trained Model with Custom Data Set

Some rough rules of thumb for interpreting `loss` in this notebook (heuristics for this small dataset, not general guarantees):
- Below 2.0 → The model is learning well.
- Between 1.0 and 1.5 → Good text generation capability.
- Below 1.0 → The model fits the training data very closely.
- If the loss is very low (~0.5 or lower), the model might be overfitting, i.e. memorizing the training quotes instead of generalizing. No validation set is used here, so the training loss alone cannot confirm this.
- The loss should drop smoothly and consistently from epoch to epoch, which suggests stable training.
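
One way to make these thresholds more tangible is to convert the loss to perplexity, which is `exp(loss)` when the loss is a mean per-token cross-entropy in nats. This is a sketch under that assumption; the notebook's reported loss also averages over padded positions, so treat the numbers as approximate.

```python
import math

def perplexity(cross_entropy_loss: float) -> float:
    """Convert a mean per-token cross-entropy loss (in nats) to perplexity."""
    return math.exp(cross_entropy_loss)

# The thresholds mentioned in the notes above.
for loss in (2.0, 1.5, 1.0, 0.5):
    print(f"loss={loss:.1f} -> perplexity={perplexity(loss):.2f}")
```

A perplexity of about 7.4 (loss 2.0) means the model is, on average, as uncertain as if it were choosing uniformly among roughly seven tokens at each step.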

Then, test the output.
- If the output looks good, no need for further training!
- If the output is still a bit repetitive, train for 2-3 more epochs (EPOCHS = 7 or 8) with a lower learning rate (learning_rate=3e-5) to refine the model.

In [215]:
# Use TensorFlow's bundled Keras (tf.keras) throughout so optimizer and loss come from the same package
optimizer = tf.keras.optimizers.legacy.Adam(learning_rate=5e-5)

# Define the loss manually; the model outputs raw logits, hence from_logits=True
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction="none")

# Compile model
model.compile(optimizer=optimizer, loss=loss_fn)

# Train the model
EPOCHS = 5
model.fit(dataset, epochs=EPOCHS)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x3b4a0aa50>

Save the model

In [216]:
model.save_pretrained("./manga-quote-generator")
tokenizer.save_pretrained("./manga-quote-generator")

('./manga-quote-generator/tokenizer_config.json',
 './manga-quote-generator/special_tokens_map.json',
 './manga-quote-generator/vocab.json',
 './manga-quote-generator/merges.txt',
 './manga-quote-generator/added_tokens.json',
 './manga-quote-generator/tokenizer.json')

## Generate a Manga-Style Quote

input_ids – The encoded input prompt from which the model will generate text.
- Example: input_ids = tokenizer.encode("A warrior never", return_tensors="tf")

max_length=50 – The maximum total length of the output in tokens, counting the prompt. Tokens are subword units, so 50 tokens is usually fewer than 50 words.
- Increase for longer responses (e.g., max_length=100 for full paragraphs).
- Decrease for shorter quotes (e.g., max_length=30).

temperature=0.7 – Controls randomness in token selection by rescaling the model's scores before sampling.
- Lower values (e.g., 0.3) make the output more predictable.
- Higher values (e.g., 1.0) make the output more creative and diverse; 1.0 leaves the model's distribution unchanged.
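
A minimal sketch of how temperature reshapes a probability distribution: the logits are divided by the temperature before the softmax. The three logits are made-up values for illustration, and `softmax_with_temperature` is a hypothetical helper, not a library function.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                            # subtract the max to avoid overflow
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.3, 0.7, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

The lower the temperature, the more probability mass moves to the top-scoring token.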

top_k=50 – Limits token selection to the 50 most probable tokens at each step.
- Lower values (e.g., top_k=10) make the output more focused.
- Higher values increase diversity but can lead to less coherent text.
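
As an illustration, top-k filtering on a toy distribution can be sketched as follows. The four probabilities are made up, and `top_k_filter` is a hypothetical helper that mirrors the idea rather than the library's internal implementation.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero out the rest, and renormalize."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(ranked[:k])
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

probs = [0.5, 0.3, 0.15, 0.05]  # made-up next-token probabilities
print([round(p, 3) for p in top_k_filter(probs, k=2)])
```

With k=2, only the two most likely tokens can be sampled, and their probabilities are rescaled to sum to 1.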

top_p=0.9 – Enables nucleus sampling, which samples from the smallest set of most probable tokens whose probabilities add up to at least 90%. The generation cell below uses a stricter top_p=0.6.
- If top_p=1.0, the model considers all possible tokens (more unpredictable).
- If top_p=0.5, the model limits selection to only the most likely tokens (more controlled).
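
Nucleus (top-p) filtering can be sketched the same way on a toy distribution. `top_p_filter` is a hypothetical helper and the probabilities are made up; the library's own implementation works on logits and may differ in edge cases.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose cumulative probability reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in ranked:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:      # stop as soon as the nucleus covers p
            break
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

probs = [0.5, 0.3, 0.15, 0.05]  # made-up next-token probabilities
print([round(x, 3) for x in top_p_filter(probs, p=0.9)])  # keeps three tokens
print([round(x, 3) for x in top_p_filter(probs, p=0.6)])  # keeps two tokens
```

Unlike top-k, the number of surviving tokens adapts to how peaked the distribution is.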

do_sample=True – Enables sampling instead of greedy decoding, which makes the output more varied. temperature, top_k, and top_p only take effect when sampling is enabled.
- If False, the model will always pick the highest-probability token (deterministic, often repetitive responses).
- If True, the model will randomly sample from the probability distribution (more natural responses).
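
The difference between greedy decoding and sampling comes down to how a single token is picked from the distribution. A minimal sketch, with a hypothetical `pick_token` helper and made-up probabilities:

```python
import random

def pick_token(probs, do_sample, rng=random):
    """Return a token index: the argmax if greedy, a weighted random draw if sampling."""
    if not do_sample:
        return max(range(len(probs)), key=lambda i: probs[i])
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

probs = [0.5, 0.3, 0.15, 0.05]  # made-up next-token probabilities
rng = random.Random(0)          # seeded so the run is repeatable

print([pick_token(probs, do_sample=False) for _ in range(5)])          # [0, 0, 0, 0, 0]
print([pick_token(probs, do_sample=True, rng=rng) for _ in range(5)])  # a mix of indices
```

Greedy decoding returns the same token every time for the same context, which is why it tends to loop on short prompts.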

In [275]:
input_text = "Life is"
# Calling the tokenizer returns both input_ids and attention_mask
inputs = tokenizer(input_text, return_tensors="tf")

output = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],  # Passed explicitly to avoid the missing-attention-mask warning
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token, so reuse EOS
    max_length=50,  # Total length in tokens, prompt included
    temperature=0.7,  # Below 1.0: sharper distribution, less randomness
    top_k=50,  # Sample only from the 50 most probable tokens
    top_p=0.6,  # Nucleus sampling: smallest token set covering 60% probability
    repetition_penalty=1.2,  # Penalize tokens that have already appeared
    do_sample=True,  # Sample instead of greedy decoding
    no_repeat_ngram_size=2,  # Never repeat the same sequence of 2 tokens
)

generated_text = tokenizer.decode(output[0], skip_special_tokens=True).strip('"”’“‘')
print(generated_text)


Life is the most powerful thing in life. It’s what makes you stronger than ever before! And when it comes to happiness and death, that means everything else can be accomplished without fear of failure!


**Best Of Quotes**

3 Epochs
- Life is the most important thing in life.
- Life is a waste of time,’s hard work and effort. It will not be the best thing you can do
- Life is a world of pain.
- Life is a place to be. We’re going down and on, we can continue
- Life is the perfect moment of life.

5 Epochs
- Life is not a thing. But it’s the beginning of all things in life
- Life is not a matter of fate. It’s an act of destiny to be loved and peace for you all the
- Life is the only thing you can do.’s what it takes to live with your enemies and overcome them all!
- Life is a game of luck. You’ll never lose sight to your enemies and gain the most important thing you can
- Life is the most important thing to do. You’ll never forget it
- Life is not a game of luck. It’s the outcome that makes you happy and strong but it doesn't mean what your opponent wins! If there are no problems to overcome then they can always win with patience or hard work as well
- Life is a beautiful place to live. But it’s not the right time for you!
- Life is not the end of all things but a journey. The beginning will be as long and beautiful as you can find it. You are free to live your way through life’s path. Your destiny is what makes up for it all. True, but when.





