# Part 1: Loading Data and Tokenization

This part prepares the fairy tale text for training: it loads the corpus, tokenizes it with GPT-2's tokenizer, and splits the token stream into overlapping windows of up to 512 tokens, starting a new window every 128 tokens.
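
As a rough illustration of the splitting step, the sketch below applies the same windowing logic to a plain Python list instead of a tensor. The function name `sliding_windows` and the toy values are assumptions made for this example; the notebook itself uses `max_length = 512` and `stride = 128` on the tokenized corpus.

```python
def sliding_windows(token_ids, max_length, stride):
    """Split token_ids into windows of up to max_length items, one every stride items."""
    windows = []
    for start in range(0, len(token_ids), stride):
        windows.append(token_ids[start:start + max_length])
    return windows

# Toy example: 10 "tokens", windows of 4, a new window every 2 tokens.
toy_windows = sliding_windows(list(range(10)), max_length=4, stride=2)
for window in toy_windows:
    print(window)
```

Because the stride is smaller than the window length, consecutive windows overlap by `max_length - stride` items, and the windows near the end of the input are shorter than `max_length`.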

In [1]:
from transformers import GPT2Tokenizer

# Load the GPT-2 tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Define the path to the text file
file_path = "/kaggle/input/children-stories-text-corpus/cleaned_merged_fairy_tales_without_eos.txt"

# Read the content of the text file
with open(file_path, "r", encoding="utf-8") as file:
    story_text = file.read()

# Print the first few characters of the text
print(story_text[:500])

# Tokenize the text
tokenized_text = tokenizer.encode(story_text, return_tensors="pt")

# Split the tokenized text into smaller sequences
max_length = 512  # Window length in tokens (GPT-2's context limit is 1024)
stride = 128  # Offset in tokens between the starts of consecutive windows
input_sequences = []

for i in range(0, tokenized_text.size(1), stride):
    input_sequences.append(tokenized_text[0, i : i + max_length])

# Convert input sequences to list of strings
input_texts = [tokenizer.decode(seq, skip_special_tokens=True) for seq in input_sequences]

# Print the first few input sequences
for i, input_text in enumerate(input_texts[:5]):
    print(f"Input {i + 1}: {input_text[:200]}...")



The Happy Prince.
HIGH above the city, on a tall column, stood the statue of the Happy Prince.  He was gilded all over with thin leaves of fine gold, for eyes he had two bright sapphires, and a large red ruby glowed on his sword-hilt.
He was very much admired indeed.  “He is as beautiful as a weathercock,” remarked one of the Town Councillors who wished to gain a reputation for having artistic tastes; “only not quite so useful,” he added, fearing lest people should think him unpractical, which h


Token indices sequence length is longer than the specified maximum sequence length for this model (5104911 > 1024). Running this sequence through the model will result in indexing errors


Input 1: The Happy Prince.
HIGH above the city, on a tall column, stood the statue of the Happy Prince.  He was gilded all over with thin leaves of fine gold, for eyes he had two bright sapphires, and a large ...
Input 2:  really was not.
“Why can’t you be like the Happy Prince?” asked a sensible mother of her little boy who was crying for the moon.  “The Happy Prince never dreams of crying for anything.”
“I am glad th...
Input 3:  know?” said the Mathematical Master, “you have never seen one.”
“Ah! but we have, in our dreams,” answered the children; and the Mathematical Master frowned and looked very severe, for he did not app...
Input 4:  by her slender waist that he had stopped to talk to her.
“Shall I love you?” said the Swallow, who liked to come to the point at once, and the Reed made him a low bow.  So he flew round and round her...
Input 5: .  Then, when the autumn came they all flew away.
After they had gone he felt lonely, and began to tire of his lady-love.
“She has no conve

# Part 2: Chunking Text for Model Input

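This section splits the raw text into chunks of 1,024 characters and tokenizes each chunk separately, truncating any chunk that exceeds GPT-2's 1,024-token context limit. Because the chunks are measured in characters rather than tokens, they come out far shorter than the limit: the corpus has 5,104,911 tokens spread over 19,977 chunks, or roughly 256 tokens per chunk. Staying within the limit avoids the indexing errors that the tokenizer warns about in Part 1.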
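
If the goal is chunks that are close to the token limit, one option is to tokenize once and slice the token ids instead of the characters. The sketch below is a minimal illustration of that idea, not the notebook's method; `chunk_by_tokens` is a hypothetical helper, and the whitespace-based `toy_encode` used in the demo is a stand-in for a real tokenizer's `encode` method.

```python
def chunk_by_tokens(text, encode, max_length):
    """Tokenize text once, then cut the ids into blocks of at most max_length."""
    token_ids = encode(text)
    return [token_ids[i:i + max_length] for i in range(0, len(token_ids), max_length)]

# Stand-in tokenizer for the demo: one "token" per whitespace-separated word.
def toy_encode(text):
    return text.split()

toy_chunks = chunk_by_tokens(
    "the happy prince stood on a tall column", toy_encode, max_length=3
)
for chunk in toy_chunks:
    print(len(chunk), chunk)
```

With the GPT-2 tokenizer, the analogous call would pass `tokenizer.encode` as `encode` (without `return_tensors`, so that it returns a plain list of ids) and `max_length=1024`.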

In [4]:
from transformers import GPT2Tokenizer

# Load the GPT-2 tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Maximum sequence length for the model, in tokens
max_length = 1024

# Split the text into chunks of chunk_size characters (not tokens)
chunk_size = 1024
chunks = [story_text[i:i+chunk_size] for i in range(0, len(story_text), chunk_size)]

# Tokenize and preprocess each chunk
input_sequences = []
for i, chunk in enumerate(chunks):
    # Tokenize the chunk
    tokenized_chunk = tokenizer.encode(chunk, return_tensors="pt")

    # Ensure the chunk fits within the maximum sequence length
    if tokenized_chunk.size(1) > max_length:
        tokenized_chunk = tokenized_chunk[:, :max_length]

    # Append the tokenized chunk to the input sequences
    input_sequences.append(tokenized_chunk)

# Print the number of input sequences
print("Number of input sequences:", len(input_sequences))


Number of input sequences: 19977


# Part 3: Fine-Tuning GPT-2 on the Fairy Tale Dataset

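This is the core training process. It loads the pre-trained GPT-2 model, builds the training dataset and data collator, sets up the training arguments, and runs the fine-tuning. `TextDataset` reads and tokenizes the corpus file itself and cuts it into blocks of 128 tokens, so the sequences built in Parts 1 and 2 are not reused here.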

In [6]:
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer, TextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments
import os

# Load the pre-trained GPT-2 model and tokenizer
model_name = "gpt2"
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# Define the file path and cache directory
file_path = "/kaggle/input/children-stories-text-corpus/cleaned_merged_fairy_tales_without_eos.txt"
cache_dir = "/kaggle/working/cache"  # Writable directory for caching

# Ensure the cache directory exists
os.makedirs(cache_dir, exist_ok=True)

# Define the training dataset
train_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path=file_path,
    block_size=128,
    overwrite_cache=True,
    cache_dir=cache_dir,  # Use the cache directory
)

# Define the data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, 
    mlm=False
)

# Define training arguments
training_args = TrainingArguments(
    output_dir="./gpt2_finetuned",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=8,
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)

# Fine-tune the model
trainer.train()



Step,Training Loss
500,3.63
1000,3.5482
1500,3.4999
2000,3.4706
2500,3.4479
3000,3.4297
3500,3.4191
4000,3.419
4500,3.4028
5000,3.367


TrainOutput(global_step=14958, training_loss=3.3015304914306935, metrics={'train_runtime': 2504.7272, 'train_samples_per_second': 47.768, 'train_steps_per_second': 5.972, 'total_flos': 7815636615168000.0, 'train_loss': 3.3015304914306935, 'epoch': 3.0})
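
The reported `global_step` can be cross-checked against the corpus size. The sketch below assumes that `TextDataset` cuts the token stream into non-overlapping blocks and drops the final partial block, and that training runs on a single device without gradient accumulation. `expected_steps` is a hypothetical helper, and the token count is the one reported in the tokenizer warning in Part 1.

```python
import math

def expected_steps(num_tokens, block_size, batch_size, epochs):
    """Estimate optimizer steps for non-overlapping blocks on a single device."""
    num_blocks = num_tokens // block_size  # partial final block dropped (assumption)
    steps_per_epoch = math.ceil(num_blocks / batch_size)
    return steps_per_epoch * epochs

steps = expected_steps(num_tokens=5_104_911, block_size=128, batch_size=8, epochs=3)
print(steps)
```

Under these assumptions the estimate is 14,958 steps, which agrees with the `global_step` in the `TrainOutput` above.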

In [7]:
output_dir = './fine_tuned_model'
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)

('./fine_tuned_model/tokenizer_config.json',
 './fine_tuned_model/special_tokens_map.json',
 './fine_tuned_model/vocab.json',
 './fine_tuned_model/merges.txt',
 './fine_tuned_model/added_tokens.json')

# Part 4: Generating Text with the Fine-Tuned Model

This part demonstrates how to use the fine-tuned model to generate new fairy tale text. It includes encoding an input prompt, using the model's generate function, and decoding the output.
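
The `generate` calls in this notebook pass only `input_ids`, which is what produces the attention-mask and pad-token warning in the cell output. A minimal sketch of one way to avoid it: call the tokenizer directly so that it also returns an `attention_mask`, and pass `pad_token_id` explicitly. The helper name `generate_story` is an assumption made for this example; it expects an already loaded model and tokenizer such as the ones used in this part.

```python
def generate_story(model, tokenizer, prompt, **generate_kwargs):
    """Generate a continuation of prompt, passing the attention mask explicitly."""
    # Calling the tokenizer (rather than tokenizer.encode) returns both
    # input_ids and attention_mask.
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token
        **generate_kwargs,
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

For example, `generate_story(model, tokenizer, "Once upon a time", max_length=100)` would mirror the first generation cell.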

In [9]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the fine-tuned model and tokenizer
model = GPT2LMHeadModel.from_pretrained(output_dir)
tokenizer = GPT2Tokenizer.from_pretrained(output_dir)

# Encode input text
input_text = "Once upon a time"
input_ids = tokenizer.encode(input_text, return_tensors='pt')

# Generate text
output = model.generate(input_ids, max_length=100, num_return_sequences=1)

# Decode generated text
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Once upon a time, when the sun was shining brightly, the little girl was sitting on the grass, and the little man was sitting on the tree.
"What is the matter?" asked the little man.
"I am going to the forest to hunt," said the little girl.
"What is the matter?" asked the man.
"I am going to the forest to hunt," said the little girl.
"What is the matter?" asked the man.
"I


In [10]:
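input_text = "Once upon a time"  # Unchanged; input_ids from the previous cell is reused
# no_repeat_ngram_size=2 keeps any 2-gram from appearing twice in the output.
# early_stopping only affects beam search, so it has no effect on this greedy call.
output = model.generate(input_ids, max_length=100, num_return_sequences=1, no_repeat_ngram_size=2, early_stopping=True)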

generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)



Once upon a time, when the sun was shining brightly, the little girl was sitting on the grass, and the old woman was standing by the fire, looking at her with a sad face.
"What is the matter?" she asked. "What are you doing here?"
The old man answered, "I am going to the forest to hunt for my lost brother."
Then the girl said, 
  "My brother is dead, but I am still alive, for I have


# Part 5: Adjusting Generation Parameters

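This section passes `temperature`, `top_k`, and `top_p` to `generate`. These parameters control the randomness of sampling, but they only take effect when `do_sample=True` is also set. The cell below leaves sampling off, so decoding stays greedy and the output is identical to the previous cell's.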
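
To make the roles of the three parameters concrete, here is a simplified, pure-Python sketch of how temperature, top-k, and top-p reshape a next-token distribution before sampling. It is an illustration under stated assumptions, not the library's implementation: `filtered_distribution` and the toy logits are made up for this example.

```python
import math

def filtered_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    # Temperature: divide logits before the softmax (lower = sharper).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]

    # Top-k: keep only the k highest-weight tokens (0 disables the filter).
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]
    total = sum(weights[i] for i in order)
    probs = {i: weights[i] / total for i in order}

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}

toy_logits = [2.0, 1.0, 0.0, -1.0]
print(filtered_distribution(toy_logits, temperature=0.7, top_k=3, top_p=0.95))
```

A sampler would then draw the next token from the returned distribution; greedy decoding skips all of this and always takes the single most likely token.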

In [11]:
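output = model.generate(
    input_ids,
    max_length=100,
    num_return_sequences=1,
    no_repeat_ngram_size=2,
    early_stopping=True,
    # The three sampling parameters below are ignored unless do_sample=True is passed.
    temperature=0.7,  # Values below 1 sharpen the distribution, values above 1 flatten it
    top_k=50,  # Sample only from the 50 most likely tokens
    top_p=0.95  # Sample only from the smallest token set with cumulative probability >= 0.95
)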

generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)




Once upon a time, when the sun was shining brightly, the little girl was sitting on the grass, and the old woman was standing by the fire, looking at her with a sad face.
"What is the matter?" she asked. "What are you doing here?"
The old man answered, "I am going to the forest to hunt for my lost brother."
Then the girl said, 
  "My brother is dead, but I am still alive, for I have


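# Part 6: Further Fine-Tuning

This part continues training the already fine-tuned model for five more epochs with a smaller batch size (2 instead of 8). The learning rate is set explicitly to 5e-5, which is also the `TrainingArguments` default, so the cell is a starting point for experimenting with other values rather than a change to the rate itself.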

In [12]:
# Fine-tuning with a different learning rate or more epochs
training_args = TrainingArguments(
    output_dir='./fine_tuned_model',
    overwrite_output_dir=True,
    num_train_epochs=5,  # adjusting the number of epochs
    per_device_train_batch_size=2,
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
    learning_rate=5e-5,  # Experiment with the learning rate
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)

trainer.train()



Step,Training Loss
500,3.1995
1000,3.242
1500,3.2662
2000,3.2319
2500,3.2365
3000,3.2519
3500,3.2822
4000,3.2678
4500,3.261
5000,3.2533


TrainOutput(global_step=99705, training_loss=2.9831418253917414, metrics={'train_runtime': 7013.85, 'train_samples_per_second': 28.431, 'train_steps_per_second': 14.215, 'total_flos': 1.302606102528e+16, 'train_loss': 2.9831418253917414, 'epoch': 5.0})
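
One common way to read these loss values is as perplexity, the exponential of the mean cross-entropy loss. The sketch below converts the two average training losses reported by `TrainOutput` in this notebook. These are training losses, so the resulting numbers describe fit to the training data, not held-out performance.

```python
import math

def perplexity(mean_loss):
    """Perplexity is exp(mean cross-entropy loss in nats)."""
    return math.exp(mean_loss)

first_run = perplexity(3.3015304914306935)   # average loss of the 3-epoch run
second_run = perplexity(2.9831418253917414)  # average loss of the 5-epoch run
print(f"{first_run:.1f} -> {second_run:.1f}")
```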

# Part 7: Generating Text with Adjusted Parameters


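This final section reloads the further fine-tuned model and generates a longer continuation (`max_length=150`) from a longer prompt. As in Part 5, the call does not set `do_sample=True`, so `temperature`, `top_k`, and `top_p` have no effect and the output is again the greedy continuation.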

In [13]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Loading the fine-tuned model and tokenizer
model_name = "./fine_tuned_model"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

# Input prompt
prompt = "Once upon a time, when the sun was shining brightly,"

# Encode the input prompt
input_ids = tokenizer.encode(prompt, return_tensors='pt')

# Generating text with adjusted parameters
output = model.generate(
    input_ids,
    max_length=150,  # Increase max_length for longer generated text
    num_return_sequences=1,
    no_repeat_ngram_size=2,
    early_stopping=True,
    temperature=0.8,  # Ignored unless do_sample=True
    top_k=50,  # Ignored unless do_sample=True
    top_p=0.9  # Ignored unless do_sample=True
)

# Decode and print the generated text
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)



Once upon a time, when the sun was shining brightly, the little girl was sitting on the grass, and the old woman was standing by the fire, looking at her with a sad face.
"What is the matter?" she asked. "What are you doing here?"
The old man answered, "I am going to the forest to hunt for my lost brother."
Then the girl said, 
  "My brother is dead, but I am still alive, for I have been hunting for him for many years.  I will go and look for the lost man, who is lying in the wood. I want to see if he is alive or dead."  Then she went to him and said:
"'I will
