# **Fine-Tuning GPT-2 for Short Story Generation**

This project demonstrates fine-tuning **GPT-2**, a pre-trained language model, to generate short stories using the **TinyStories** dataset. We'll preprocess the data, fine-tune the model with periodic evaluation and best-checkpoint selection, compare generation quality before and after fine-tuning, and generate new stories.

In [None]:
%pip install transformers datasets torch accelerate sentence-transformers



### **Load the TinyStories dataset**


In [None]:
from datasets import load_dataset
dataset = load_dataset("roneneldan/TinyStories")
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 2119719
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 21990
    })
})


To reduce training time, we'll sample only a small fraction of each split: **0.1% of the training split** and **1% of the validation split**. Both splits are shuffled with a fixed seed before sampling so the subsets are reproducible.

In [None]:
def sample_fraction(dataset_split, denominator, seed=42):
    """Shuffle a split and keep the first len(split) // denominator rows."""
    sample_size = len(dataset_split) // denominator
    return dataset_split.shuffle(seed=seed).select(range(sample_size))

# Keep 0.1% of the training split and 1% of the validation split
train_data = sample_fraction(dataset['train'], 1000)
val_data = sample_fraction(dataset['validation'], 100)

# Check sizes of sampled datasets
print(f"Train size (0.1%): {len(train_data)}")
print(f"Validation size (1%): {len(val_data)}")

Train size (0.1%): 2119
Validation size (1%): 219
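
The sampled sizes follow from integer division of the split sizes printed earlier. A small sketch of that arithmetic, using a hypothetical helper `rows_for_fraction` that is not part of the notebook, also shows how many rows a 5% sample of each split would contain:

```python
def rows_for_fraction(total_rows: int, numerator: int, denominator: int) -> int:
    """Number of rows kept when sampling numerator/denominator of a split (rounded down)."""
    return total_rows * numerator // denominator

TRAIN_ROWS = 2119719      # from the DatasetDict printed above
VALIDATION_ROWS = 21990

# Fractions used by the sampling code: 1/1000 of train, 1/100 of validation
train_sampled = rows_for_fraction(TRAIN_ROWS, 1, 1000)
val_sampled = rows_for_fraction(VALIDATION_ROWS, 1, 100)

# For comparison, a 5% sample of each split
train_five_percent = rows_for_fraction(TRAIN_ROWS, 5, 100)
val_five_percent = rows_for_fraction(VALIDATION_ROWS, 5, 100)

print(train_sampled, val_sampled)
print(train_five_percent, val_five_percent)
```

Raising the fraction is the simplest way to trade longer training time for more data.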


### **Tokenization**


The **GPT-2 tokenizer** converts text into integer token IDs that the model can consume.

GPT-2 does not define a padding token by default, so we reuse the **end-of-sequence (EOS)** token for padding.

In [None]:
from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

In [None]:
print("Pad token:", tokenizer.pad_token)  # Should print "<|endoftext|>"
print("Pad token ID:", tokenizer.pad_token_id)  # Should print an integer ID


Pad token: <|endoftext|>
Pad token ID: 50256


The dataset is tokenized with **truncation** to at most 512 tokens and **padding** to the longest sequence in each batch passed to the tokenizer. The labels are a copy of the input IDs, as is standard for causal language modeling.

In [None]:
# Function to tokenize and prepare inputs/labels
def tokenize_function_with_labels(examples):
    tokenized = tokenizer(
        examples['text'],
        truncation=True,
        padding="longest",  # Pad to the longest sequence in the batch being tokenized
        max_length=512      # Truncate sequences longer than 512 tokens
    )
    # Labels are a copy of input_ids; the model shifts them internally for next-token prediction.
    # Padded positions are included in the labels as-is.
    tokenized["labels"] = tokenized["input_ids"].copy()
    return tokenized

# Tokenize train and validation datasets
train_dataset = train_data.map(tokenize_function_with_labels, batched=True, remove_columns=["text"])
val_dataset = val_data.map(tokenize_function_with_labels, batched=True, remove_columns=["text"])

# Verify tokenized data structure
print(train_dataset[0])

{'input_ids': [14967, 290, 32189, 588, 284, 711, 287, 262, 3952, 13, 1119, 766, 257, 1263, 3430, 319, 262, 2323, 13, 632, 318, 7586, 290, 890, 290, 4334, 13, 198, 198, 1, 8567, 11, 257, 3430, 2474, 5045, 1139, 13, 366, 40, 460, 10303, 340, 2474, 198, 198, 1544, 8404, 284, 10303, 262, 3430, 11, 475, 340, 318, 1165, 5802, 13, 679, 8953, 866, 290, 10532, 262, 3430, 13, 198, 198, 1, 46, 794, 2474, 339, 1139, 13, 366, 2504, 5938, 2474, 198, 198, 44, 544, 22051, 13, 1375, 318, 407, 1612, 11, 673, 655, 6834, 340, 318, 8258, 13, 198, 198, 1, 5756, 502, 1949, 2474, 673, 1139, 13, 366, 40, 460, 5236, 340, 2474, 198, 198, 3347, 11103, 510, 262, 3430, 290, 7584, 340, 319, 607, 1182, 13, 1375, 11114, 6364, 290, 7773, 13, 1375, 857, 407, 2121, 866, 13, 198, 198, 1, 22017, 2474, 5045, 1139, 13, 366, 1639, 389, 922, 379, 22486, 2474, 198, 198, 1, 10449, 345, 2474, 32189, 1139, 13, 366, 1026, 318, 1257, 2474, 198, 198, 2990, 1011, 4962, 22486, 262, 3430, 319, 511, 6665, 11, 5101, 11, 290, 7405, 13, 111
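
Because the labels are a plain copy of `input_ids`, padded positions also contribute to the loss. A common convention with Hugging Face models is to set the label to `-100` wherever the loss should be ignored. The sketch below uses a hypothetical helper `mask_padding_labels` and a toy batch; it masks by attention mask rather than by token ID, since the padding token here is the same as the EOS token:

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss

def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [
        [tok if keep == 1 else IGNORE_INDEX for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]

# Toy batch: the second sequence is padded with 50256 (the pad/EOS ID printed earlier)
batch = {
    "input_ids": [[14967, 290, 32189, 13], [40, 460, 50256, 50256]],
    "attention_mask": [[1, 1, 1, 1], [1, 1, 0, 0]],
}
labels = mask_padding_labels(batch["input_ids"], batch["attention_mask"])
print(labels)
```

Applying this inside the tokenization function would change the loss values, so results would not be directly comparable with the numbers reported in this notebook.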

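### **Loading the Model**

We load the pre-trained **GPT-2** model and call `resize_token_embeddings` so the embedding matrix matches the tokenizer. Because the padding token reuses the existing EOS token, no new token was added and the vocabulary size stays at 50257.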

In [None]:
from transformers import GPT2LMHeadModel
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))

Embedding(50257, 768)

### **Test Model Before Fine-Tuning**

In [None]:
from transformers import pipeline
story_generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Provide a prompt to generate a story
prompt = "Once upon a time in a dark and feary forest"
generated_story = story_generator(prompt, max_length=150, num_return_sequences=1)
generated_story1 = story_generator(prompt, max_length=150, num_return_sequences=1)
generated_story2 = story_generator(prompt, max_length=150, num_return_sequences=1)
generated_story3 = story_generator(prompt, max_length=150, num_return_sequences=1)
generated_story4 = story_generator(prompt, max_length=150, num_return_sequences=1)
generated_story5 = story_generator(prompt, max_length=150, num_return_sequences=1)

# Display the generated stories
print("Generated Story 0:")
print(generated_story[0]['generated_text'])
print("Generated Story 1:")
print(generated_story1[0]['generated_text'])
print("Generated Story 2:")
print(generated_story2[0]['generated_text'])
print("Generated Story 3:")
print(generated_story3[0]['generated_text'])
print("Generated Story 4:")
print(generated_story4[0]['generated_text'])
print("Generated Story 5:")
print(generated_story5[0]['generated_text'])

Device set to use cuda:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Generated Story 0:
Once upon a time in a dark and feary forest near a forest clearing, young boys from a nearby village come to an open meeting. A young man with a mustache and a big red beard tells the young man about the mysterious and ominous "Dark One." When the young man says his name he immediately says, "Father." The young man answers in one word: "Dark one."

Suddenly, the darkness is not very dark and it only lasts three seconds. The man, who has nothing in common with the dark man, leaves the tree. The young man who has never been there at all begins to cry, "Father! Father!"

And what a pity to see a father, of our age and with only one leg
Generated Story 1:
Once upon a time in a dark and feary forest, an unassuming humanoid fell asleep in its wake. In the early evening hours of March 18th, the monster awoke from its slumber. It was a tall, six-legged human, carrying a large, bright scarlet hand. The creature had just been seen on the battlefield, walking along the tracks o

### **Evaluation**

### **-- Coherence -- Diversity -- Fluency --**

In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from sentence_transformers import SentenceTransformer, util
from collections import Counter
import torch
import math

# Example text
text = generated_story[0]['generated_text']
text1 = generated_story1[0]['generated_text']
text2 = generated_story2[0]['generated_text']
text3 = generated_story3[0]['generated_text']
text4 = generated_story4[0]['generated_text']
text5 = generated_story5[0]['generated_text']


# Load Sentence-BERT model for coherence
coherence_model = SentenceTransformer('all-MiniLM-L6-v2')

# Function to calculate Type-Token Ratio (Diversity)
def calculate_ttr(text):
    words = text.split()
    unique_words = set(words)
    ttr = len(unique_words) / len(words) if len(words) > 0 else 0
    return ttr

# Function to calculate Entropy (Diversity)
def calculate_entropy(text):
    words = text.split()
    word_counts = Counter(words)
    total_words = len(words)
    entropy = 0.0
    for count in word_counts.values():
        probability = count / total_words
        entropy -= probability * math.log(probability, 2)
    return entropy

# Function to calculate coherence
def calculate_coherence(text):
    sentences = [s.strip() for s in text.split('.') if s.strip()]  # Split into sentences
    if len(sentences) < 2:
        return 1.0  # Single sentence is trivially coherent
    embeddings = coherence_model.encode(sentences)
    similarities = []
    for i in range(len(embeddings) - 1):
        sim = util.cos_sim(embeddings[i], embeddings[i + 1])
        similarities.append(sim.item())
    coherence = sum(similarities) / len(similarities)
    return coherence


# Function to calculate fluency as perplexity under the current `model` (lower is better)
def calculate_fluency(text):
    # Check if GPU is available
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Move model to the device
    model.to(device)

    # Tokenize the text and move inputs to the same device
    inputs = tokenizer(text, return_tensors="pt").to(device)

    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])

    # Calculate perplexity
    loss = outputs.loss
    perplexity = torch.exp(loss)
    return perplexity.item()

# Calculate metrics
#fluency = calculate_perplexity(text, model, tokenizer)
ttr = calculate_ttr(text)
entropy = calculate_entropy(text)
coherence = calculate_coherence(text)
fluency=calculate_fluency(text)



ttr1 = calculate_ttr(text1)
entropy1 = calculate_entropy(text1)
coherence1 = calculate_coherence(text1)
fluency1=calculate_fluency(text1)

ttr2 = calculate_ttr(text2)
entropy2 = calculate_entropy(text2)
coherence2 = calculate_coherence(text2)
fluency2=calculate_fluency(text2)

ttr3 = calculate_ttr(text3)
entropy3 = calculate_entropy(text3)
coherence3 = calculate_coherence(text3)
fluency3=calculate_fluency(text3)

ttr4 = calculate_ttr(text4)
entropy4 = calculate_entropy(text4)
coherence4 = calculate_coherence(text4)
fluency4=calculate_fluency(text4)

ttr5 = calculate_ttr(text5)
entropy5 = calculate_entropy(text5)
coherence5 = calculate_coherence(text5)
fluency5=calculate_fluency(text5)
print('-------------------')

# Print results
#print("Fluency (Perplexity):", fluency)
print("Diversity (Type-Token Ratio):", ttr)
print("Diversity (Entropy):", entropy)
print("Coherence:", coherence)
print("fluency:", fluency)
print('-------------------')


print("Diversity (Type-Token Ratio):", ttr1)
print("Diversity (Entropy):", entropy1)
print("Coherence:", coherence1)
print("fluency:", fluency1)
print('-------------------')

print("Diversity (Type-Token Ratio):", ttr2)
print("Diversity (Entropy):", entropy2)
print("Coherence:", coherence2)
print("fluency:", fluency2)
print('-------------------')

print("Diversity (Type-Token Ratio):", ttr3)
print("Diversity (Entropy):", entropy3)
print("Coherence:", coherence3)
print("fluency:", fluency3)
print('-------------------')

print("Diversity (Type-Token Ratio):", ttr4)
print("Diversity (Entropy):", entropy4)
print("Coherence:", coherence4)
print("fluency:", fluency4)
print('-------------------')
print("Diversity (Type-Token Ratio):", ttr5)
print("Diversity (Entropy):", entropy5)
print("Coherence:", coherence5)
print("fluency:", fluency5)


-------------------
Diversity (Type-Token Ratio): 0.664
Diversity (Entropy): 6.0376781369153365
Coherence: 0.39965156217416126
fluency: 17.178083419799805
-------------------
Diversity (Type-Token Ratio): 0.7142857142857143
Diversity (Entropy): 6.089016793459786
Coherence: 0.45124289989471433
fluency: 18.006610870361328
-------------------
Diversity (Type-Token Ratio): 0.7166666666666667
Diversity (Entropy): 6.15898554592641
Coherence: 0.46206297278404235
fluency: 16.470256805419922
-------------------
Diversity (Type-Token Ratio): 0.7321428571428571
Diversity (Entropy): 6.093299846601907
Coherence: 0.21653433237224817
fluency: 17.7174129486084
-------------------
Diversity (Type-Token Ratio): 0.7741935483870968
Diversity (Entropy): 6.393841927682315
Coherence: 0.2862136835853259
fluency: 15.372954368591309
-------------------
Diversity (Type-Token Ratio): 0.6397058823529411
Diversity (Entropy): 6.058124489758668
Coherence: 0.3475918446977933
fluency: 18.52735137939453
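
To make the two diversity metrics concrete, here is a small worked example on a five-word string. The function names `type_token_ratio` and `word_entropy` are illustrative; they mirror the logic of `calculate_ttr` and `calculate_entropy` above.

```python
import math
from collections import Counter

def type_token_ratio(text: str) -> float:
    """Distinct words divided by total words (whitespace tokenization)."""
    words = text.split()
    return len(set(words)) / len(words) if words else 0.0

def word_entropy(text: str) -> float:
    """Shannon entropy, in bits, of the word frequency distribution."""
    words = text.split()
    counts = Counter(words)
    total = len(words)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

sample = "the cat saw the dog"  # 5 words, 4 distinct ("the" appears twice)
ttr = type_token_ratio(sample)   # 4 / 5
entropy = word_entropy(sample)   # -(0.4*log2(0.4) + 3 * 0.2*log2(0.2))
print(ttr, entropy)
```

For a text with V distinct words the entropy is at most log2(V), so values near 6 bits imply at least 64 distinct words in the story.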


### **Training Arguments**

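Training arguments control various aspects of training, including **learning rate**, **batch size**, **evaluation frequency**, and checkpoint saving. Here the model is evaluated and saved every 100 steps, and with `load_best_model_at_end=True` the checkpoint with the lowest validation loss is reloaded when training finishes.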

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./gpt2-tinystories",
    evaluation_strategy="steps",
    eval_steps=100,
    logging_steps=10, 
    learning_rate=5e-5,
    weight_decay=0.01,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=5,
    save_strategy="steps",
    save_steps=100,
    logging_dir="./logs",
    save_total_limit=2,
    load_best_model_at_end=True,
)
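
The arguments above evaluate periodically and reload the best checkpoint, but they do not stop training early. In `transformers`, early stopping is usually added by passing an `EarlyStoppingCallback` (with an `early_stopping_patience` value) to the `Trainer`. The pure-Python sketch below is a simplified stand-in for that patience logic, not the library's implementation; `should_stop` and the plateau values are hypothetical, while `logged_losses` are the validation losses reported in this notebook's training log.

```python
def should_stop(val_losses, patience=3, min_delta=0.0):
    """True if none of the last `patience` evaluations improved on the best earlier loss."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    recent = val_losses[-patience:]
    # No improvement means the loss did not drop below best_before by more than min_delta
    return all(loss >= best_before - min_delta for loss in recent)

# Validation losses from the training log (steps 100 to 1000)
logged_losses = [0.812368, 0.788236, 0.770299, 0.757766, 0.750297,
                 0.745406, 0.741772, 0.734896, 0.732085, 0.729332]
# Hypothetical run that plateaus after the second evaluation
plateau_losses = [0.80, 0.75, 0.76, 0.77, 0.75]

stop_on_logged = should_stop(logged_losses)
stop_on_plateau = should_stop(plateau_losses)
print(stop_on_logged, stop_on_plateau)
```

Since the logged validation loss keeps decreasing, a patience-based rule of this kind would not have triggered during the evaluations shown.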



### **Trainer API**

The **Trainer** API simplifies training by integrating the model, dataset, training arguments, and callbacks into one interface.

In [None]:
from transformers import Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,

)



The model is fine-tuned on the **TinyStories** dataset, with evaluation and checkpointing performed every 100 steps.

In [None]:
trainer.train()

Step,Training Loss,Validation Loss
100,0.9562,0.812368
200,0.9031,0.788236
300,0.8989,0.770299
400,0.8246,0.757766
500,0.8228,0.750297
600,0.7092,0.745406
700,0.7355,0.741772
800,0.7319,0.734896
900,0.745,0.732085
1000,0.7872,0.729332


There were missing keys in the checkpoint model loaded: ['lm_head.weight'].


TrainOutput(global_step=2650, training_loss=0.7341434867427035, metrics={'train_runtime': 2374.4326, 'train_samples_per_second': 4.462, 'train_steps_per_second': 1.116, 'total_flos': 2768389079040000.0, 'train_loss': 0.7341434867427035, 'epoch': 5.0})
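
The reported `global_step=2650` can be cross-checked from the sampled dataset size and the training arguments. A short sketch of the arithmetic, assuming a single device and no gradient accumulation (neither is configured in the arguments above):

```python
import math

train_examples = 2119        # sampled training set size
per_device_batch_size = 4    # per_device_train_batch_size
num_epochs = 5               # num_train_epochs
eval_every = 100             # eval_steps

# One optimizer step per batch; the last batch of each epoch may be smaller
steps_per_epoch = math.ceil(train_examples / per_device_batch_size)
total_steps = steps_per_epoch * num_epochs
num_evaluations = total_steps // eval_every

print(steps_per_epoch, total_steps, num_evaluations)
```

The total matches the `global_step` in the `TrainOutput`, which suggests the run completed all five epochs.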

After training, we evaluate the model on the validation set to measure its performance.

In [None]:
results = trainer.evaluate()
print("Evaluation Results:", results)

Evaluation Results: {'eval_loss': 0.7175973653793335, 'eval_runtime': 10.0031, 'eval_samples_per_second': 21.893, 'eval_steps_per_second': 5.498, 'epoch': 5.0}
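
The evaluation loss is an average cross-entropy in nats, so it can be converted to a perplexity by exponentiation. A minimal sketch using the `eval_loss` value printed above:

```python
import math

eval_loss = 0.7175973653793335  # 'eval_loss' from trainer.evaluate() above
validation_perplexity = math.exp(eval_loss)
print(f"Validation perplexity: {validation_perplexity:.3f}")
```

Note that the labels in this notebook include padded positions, so this figure is likely lower than a perplexity computed over real tokens only, and it should not be compared directly with the per-story perplexities from `calculate_fluency`.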


### **Test Model After Fine-Tuning**

We now use the fine-tuned model to generate short stories from the same prompt used before fine-tuning, so the two sets of outputs can be compared directly.

In [None]:
from transformers import pipeline

# Create a text generation pipeline using the fine-tuned model
story_generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Provide a prompt to generate a story
prompt = "Once upon a time in a dark and feary forest"
generated_story = story_generator(prompt, max_length=150, num_return_sequences=1)
generated_story1 = story_generator(prompt, max_length=150, num_return_sequences=1)
generated_story2 = story_generator(prompt, max_length=150, num_return_sequences=1)
generated_story3 = story_generator(prompt, max_length=150, num_return_sequences=1)
generated_story4 = story_generator(prompt, max_length=150, num_return_sequences=1)
generated_story5 = story_generator(prompt, max_length=150, num_return_sequences=1)

# Display the generated stories
print("Generated Story:")
print(generated_story[0]['generated_text'])
print(generated_story1[0]['generated_text'])
print(generated_story2[0]['generated_text'])
print(generated_story3[0]['generated_text'])
print(generated_story4[0]['generated_text'])
print(generated_story5[0]['generated_text'])

Device set to use cuda:0


Generated Story:
Once upon a time in a dark and feary forest, there was a curious little girl named Lily. She liked to explore the forest every day.

One morning, Lily saw a big box that she could play with. She was excited to see what it was! She climbed up the branches, carefully picked it out of the ground and carefully removed it from her hands. It smelled sweet and shiny.

Lily's mom told her that it was a special box, so Lily was very excited. The mom explained that many things can be made with other things. Lily was so excited, she had even made a jar for the box!

Her mom went outside to play with the weird things she was able to make. They
Once upon a time in a dark and feary forest. There were two friends, a boy and a girl named Lulu. Lulu had a puppy named Max. Max lived in a warm and warm house with his mom. 

One night, a bad smell left the garden. Lulu and Max went outside to see if they could find the smell. They went to the ground and looked. 

In the grass, there was a

### **Evaluation**

### **-- Coherence -- Diversity -- Fluency --**

In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from sentence_transformers import SentenceTransformer, util
from collections import Counter
import torch
import math

# Example text
text = generated_story[0]['generated_text']
text1 = generated_story1[0]['generated_text']
text2 = generated_story2[0]['generated_text']
text3 = generated_story3[0]['generated_text']
text4 = generated_story4[0]['generated_text']
text5 = generated_story5[0]['generated_text']

# Load Sentence-BERT model for coherence
coherence_model = SentenceTransformer('all-MiniLM-L6-v2')

# Function to calculate Type-Token Ratio (Diversity)
def calculate_ttr(text):
    words = text.split()
    unique_words = set(words)
    ttr = len(unique_words) / len(words) if len(words) > 0 else 0
    return ttr

# Function to calculate Entropy (Diversity)
def calculate_entropy(text):
    words = text.split()
    word_counts = Counter(words)
    total_words = len(words)
    entropy = 0.0
    for count in word_counts.values():
        probability = count / total_words
        entropy -= probability * math.log(probability, 2)
    return entropy

# Function to calculate coherence
def calculate_coherence(text):
    sentences = [s.strip() for s in text.split('.') if s.strip()]  # Split into sentences
    if len(sentences) < 2:
        return 1.0  # Single sentence is trivially coherent
    embeddings = coherence_model.encode(sentences)
    similarities = []
    for i in range(len(embeddings) - 1):
        sim = util.cos_sim(embeddings[i], embeddings[i + 1])
        similarities.append(sim.item())
    coherence = sum(similarities) / len(similarities)
    return coherence


# Function to calculate fluency as perplexity under the current `model` (lower is better)
def calculate_fluency(text):
    # Check if GPU is available
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Move model to the device
    model.to(device)

    # Tokenize the text and move inputs to the same device
    inputs = tokenizer(text, return_tensors="pt").to(device)

    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])

    # Calculate perplexity
    loss = outputs.loss
    perplexity = torch.exp(loss)
    return perplexity.item()

# Calculate metrics
#fluency = calculate_perplexity(text, model, tokenizer)
ttr = calculate_ttr(text)
entropy = calculate_entropy(text)
coherence = calculate_coherence(text)
fluency=calculate_fluency(text)



ttr1 = calculate_ttr(text1)
entropy1 = calculate_entropy(text1)
coherence1 = calculate_coherence(text1)
fluency1=calculate_fluency(text1)

ttr2 = calculate_ttr(text2)
entropy2 = calculate_entropy(text2)
coherence2 = calculate_coherence(text2)
fluency2=calculate_fluency(text2)

ttr3 = calculate_ttr(text3)
entropy3 = calculate_entropy(text3)
coherence3 = calculate_coherence(text3)
fluency3=calculate_fluency(text3)

ttr4 = calculate_ttr(text4)
entropy4 = calculate_entropy(text4)
coherence4 = calculate_coherence(text4)
fluency4=calculate_fluency(text4)

ttr5 = calculate_ttr(text5)
entropy5 = calculate_entropy(text5)
coherence5 = calculate_coherence(text5)
fluency5=calculate_fluency(text5)
print('-------------------')

# Print results
#print("Fluency (Perplexity):", fluency)
print("Diversity (Type-Token Ratio):", ttr)
print("Diversity (Entropy):", entropy)
print("Coherence:", coherence)
print("fluency:", fluency)
print('-------------------')


print("Diversity (Type-Token Ratio):", ttr1)
print("Diversity (Entropy):", entropy1)
print("Coherence:", coherence1)
print("fluency:", fluency1)
print('-------------------')

print("Diversity (Type-Token Ratio):", ttr2)
print("Diversity (Entropy):", entropy2)
print("Coherence:", coherence2)
print("fluency:", fluency2)
print('-------------------')

print("Diversity (Type-Token Ratio):", ttr3)
print("Diversity (Entropy):", entropy3)
print("Coherence:", coherence3)
print("fluency:", fluency3)
print('-------------------')

print("Diversity (Type-Token Ratio):", ttr4)
print("Diversity (Entropy):", entropy4)
print("Coherence:", coherence4)
print("fluency:", fluency4)
print('-------------------')
print("Diversity (Type-Token Ratio):", ttr5)
print("Diversity (Entropy):", entropy5)
print("Coherence:", coherence5)
print("fluency:", fluency5)


-------------------
Diversity (Type-Token Ratio): 0.6904761904761905
Diversity (Entropy): 6.17443456241999
Coherence: 0.3404272086918354
fluency: 5.773502349853516
-------------------
Diversity (Type-Token Ratio): 0.6637931034482759
Diversity (Entropy): 5.85592210756267
Coherence: 0.34114815294742584
fluency: 4.72048807144165
-------------------
Diversity (Type-Token Ratio): 0.7523809523809524
Diversity (Entropy): 6.05888605182675
Coherence: 0.2009171899408102
fluency: 4.655630111694336
-------------------
Diversity (Type-Token Ratio): 0.71
Diversity (Entropy): 5.873919005177825
Coherence: 0.45901202857494355
fluency: 6.44784688949585
-------------------
Diversity (Type-Token Ratio): 0.5887096774193549
Diversity (Entropy): 5.75637917318541
Coherence: 0.514707189053297
fluency: 4.8392791748046875
-------------------
Diversity (Type-Token Ratio): 0.7244897959183674
Diversity (Entropy): 5.917376207098508
Coherence: 0.4204471686056682
fluency: 5.4843621253967285
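
To compare the two evaluation runs at a glance, the per-story metrics can be averaged. The sketch below copies the values from the two outputs in this notebook, rounded to three decimals; the dictionary layout is just one convenient way to organize them.

```python
from statistics import mean

# Values copied from the two evaluation outputs, rounded to 3 decimals
before = {
    "ttr": [0.664, 0.714, 0.717, 0.732, 0.774, 0.640],
    "entropy": [6.038, 6.089, 6.159, 6.093, 6.394, 6.058],
    "coherence": [0.400, 0.451, 0.462, 0.217, 0.286, 0.348],
    "perplexity": [17.178, 18.007, 16.470, 17.717, 15.373, 18.527],
}
after = {
    "ttr": [0.690, 0.664, 0.752, 0.710, 0.589, 0.724],
    "entropy": [6.174, 5.856, 6.059, 5.874, 5.756, 5.917],
    "coherence": [0.340, 0.341, 0.201, 0.459, 0.515, 0.420],
    "perplexity": [5.774, 4.720, 4.656, 6.448, 4.839, 5.484],
}

# Mean of each metric before and after fine-tuning
summary = {m: (mean(before[m]), mean(after[m])) for m in before}
for metric, (b, a) in summary.items():
    print(f"{metric:<11} before={b:.3f}  after={a:.3f}")
```

On these six samples the mean perplexity drops from roughly 17.2 to 5.3, while the diversity and coherence averages move only slightly. With six stories per condition, and with each model scoring its own generations, the comparison is indicative rather than conclusive.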
