# Romantic Poem Generator using GPT-2

## Step 0: Imports

In [1]:
import os
import glob
import torch
import re
from transformers import (
    GPT2LMHeadModel,
    GPT2Tokenizer,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling
)
from datasets import Dataset
import random
import numpy as np

# Set random seeds for reproducibility
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(42)

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")

PyTorch version: 2.5.1
CUDA available: True
CUDA device: NVIDIA GeForce RTX 4080 Laptop GPU


## Step 1: Prepare Pre-trained GPT-2 Model

Initially, we aim to create a system based on pre-trained model, GPT-2, to generate poems with general topic. Generative models are powerful at text generation and we want to learn what a classical model would perform now.

In [2]:
# Load pre-trained GPT-2 model and tokenizer
model_name = 'gpt2'  

print(f"Loading pre-trained model: {model_name}")
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

# Set pad token 
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.eos_token_id

print(f"Model loaded successfully!")
print(f"Model parameters: {model.num_parameters():,}")
print(f"Vocabulary size: {len(tokenizer)}")

Loading pre-trained model: gpt2


tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

Model loaded successfully!
Model parameters: 124,439,808
Vocabulary size: 50257


## Step 2: Load Romantic Poems and Prepare Dataset

We prepared an inspiring set containing romantic poems to fine tune the model for poem generation on [Kaggle](https://www.kaggle.com/datasets/michaelarman/poemsdataset?resource=download). They are text files and each line has similar length but does not always stand for a sentence. An example is shown below:
```
Fun of passionate enjoyment of romantic pleasure between
Two beautifully attractive souls is a great fortune for
Them to experience in the life of human world quite rare!
It's a great blessing only rare personalities of good heart
Have in this world full of doubts, taboos and conventions
That prevent two good hearts to live in romantic enjoyment!
Loving life so in romantic pleasure in fun for a brief period
Or a longtime sure here is not in one's hand, but if that
Happens naturally, it is a great boon bestowed on them by God!
The creator of humans decrees all live according to His will;
Love, romance, pleasure and peace are not for all to have in
Life except by the chosen few who have had deep faith in God!
For the ones who are in love of the Universal spiritual energy,
All in world and Nature, perhaps God bestows such a boon indeed!
```

In [4]:
# Load poems from the romantic folder
def load_poems_from_folder(folder_path):
    """
    Load all poems from text files in the specified folder.
    Expected filename format: RomanticPoems*.txt
    """
    poems = []
    pattern = os.path.join(folder_path, "RomanticPoems*.txt")
    files = glob.glob(pattern)
    
    print(f"Found {len(files)} poem files in '{folder_path}'")
    
    for file_path in files:
        try:
            with open(file_path, 'r', encoding='utf-8') as f:
                content = f.read().strip()
                if content:  # Only add non-empty poems
                    poems.append(content)
        except Exception as e:
            print(f"Error reading {file_path}: {e}")
    
    return poems

# Load the poems
POEMS_FOLDER = "romantic"  # Adjust this path if needed
poems = load_poems_from_folder(POEMS_FOLDER)

print(f"\nLoaded {len(poems)} poems successfully!")
print(f"\nExample poem (first 300 characters):")
print(poems[0][:300] if poems else "No poems found!")

Found 100 poem files in 'romantic'

Loaded 100 poems successfully!

Example poem (first 300 characters):
Fun of passionate enjoyment of romantic pleasure between
Two beautifully attractive souls is a great fortune for
Them to experience in the life of human world quite rare!
It's a great blessing only rare personalities of good heart
Have in this world full of doubts, taboos and conventions
That preven


In [5]:
# Prepare dataset for training
def prepare_dataset(poems, tokenizer, max_length=512):
    """
    Tokenize poems and prepare for training.
    """
    # Add special tokens to mark beginning and end of poems
    formatted_poems = [f"<|startoftext|>{poem}<|endoftext|>" for poem in poems]
    
    # Tokenize
    encodings = tokenizer(
        formatted_poems,
        truncation=True,
        max_length=max_length,
        padding='max_length',
        return_tensors='pt'
    )
    
    # Create dataset
    dataset_dict = {
        'input_ids': encodings['input_ids'].tolist(),
        'attention_mask': encodings['attention_mask'].tolist()
    }
    
    dataset = Dataset.from_dict(dataset_dict)
    return dataset

# Prepare the dataset
train_dataset = prepare_dataset(poems, tokenizer)
print(f"Dataset prepared with {len(train_dataset)} examples")
print(f"Example tokenized length: {len(train_dataset[0]['input_ids'])} tokens")

Dataset prepared with 100 examples
Example tokenized length: 512 tokens


## Step 2.5: Fine-tune the Model

The fine-tuned model is the poem generator we aim for. After fine-tuning, GPT-2 model would shift its weights onto the given inspiring set and its vocabulary, paragraph structure thus being better at poem generation. 

In [None]:
# Set up training arguments
training_args = TrainingArguments(
    output_dir="./romantic_gpt2_output",
    overwrite_output_dir=True,
    num_train_epochs=400,  
    per_device_train_batch_size=4,  
    save_steps=100,
    save_total_limit=2,
    logging_steps=100,
    learning_rate=1e-5,
    warmup_steps=100,
    weight_decay=0.01,
    logging_dir='./logs',
    prediction_loss_only=True,
    fp16=torch.cuda.is_available(),  
)

# Data collator for language modeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False  
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)

print("Trainer initialized. Ready to fine-tune!")

Trainer initialized. Ready to fine-tune!


In [11]:
# Fine-tune the model
print("Starting fine-tuning...")
print("This may take a while depending on your hardware and dataset size.")
print("="*50)

trainer.train()

print("\n" + "="*50)
print("Fine-tuning complete!")

Starting fine-tuning...
This may take a while depending on your hardware and dataset size.


Step,Training Loss
100,2.6239
200,2.5702
300,2.4488
400,2.3043
500,2.1421
600,2.025
700,1.8846
800,1.7512
900,1.5972
1000,1.4778



Fine-tuning complete!


The model converges after approximately 8800 epochs.

In [12]:
# Save the fine-tuned model
model.save_pretrained("./romantic_gpt2_finetuned")
tokenizer.save_pretrained("./romantic_gpt2_finetuned")
print("Model saved to './romantic_gpt2_finetuned'")

Model saved to './romantic_gpt2_finetuned'


## Step 3: Generate Poems 

Now let's generate poems. The `generate_poem` function orchestrates the text generation process by first placing the model into **evaluation mode** and preprocessing the input `prompt` with a specific `<|startoftext|>` token to signal the beginning of a sequence. After converting this text into numerical tensors and moving them to the appropriate hardware device, the function utilizes `model.generate` within a memory-efficient `torch.no_grad()` context to predict the subsequent text. This generation step relies on **sampling parameters**—specifically `temperature`, `top_k`, and `top_p`—to introduce controlled randomness, ensuring the poem is creative rather than repetitive. Finally, the function iterates through the generated sequences, decodes the numerical tokens back into human-readable strings, and strips away the special structural tokens to return the clean, finished poem.

## Stopping Criteria
There are two stopping criteria: the **hard criterion** (`max_length`), which limits the output to 200 **tokens**, and the **soft criterion**, which is the `<|endoftext|>` token generated by the model itself. These two combined offer a complete poem when the model finishes its thought naturally, or a safeguard that stops generation if it exceeds the maximum limit, preventing infinite loops.

In [13]:
def generate_poem(model, tokenizer, prompt="", max_length=200, temperature=0.9, 
                  top_k=50, top_p=0.95, num_return_sequences=1):
    """
    Generate a poem using the fine-tuned model.
    
    Args:
        prompt: Starting text for the poem (empty for free generation)
        max_length: Maximum length of generated text
        temperature: Sampling temperature (higher = more random)
        top_k: Top-k sampling parameter
        top_p: Nucleus sampling parameter
        num_return_sequences: Number of poems to generate
    """
    model.eval()
    
    # Prepare input
    if prompt:
        input_text = f"<|startoftext|>{prompt}"
    else:
        input_text = "<|startoftext|>"
    
    input_ids = tokenizer.encode(input_text, return_tensors='pt')
    
    # Move to same device as model
    device = next(model.parameters()).device
    input_ids = input_ids.to(device)
    
    # Generate
    with torch.no_grad():
        output = model.generate(
            input_ids,
            max_length=max_length,
            temperature=temperature,
            top_k=top_k,
            top_p=top_p,
            num_return_sequences=num_return_sequences,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.encode("<|endoftext|>")[0]
        )
    
    # Decode and clean up
    poems = []
    for sequence in output:
        text = tokenizer.decode(sequence, skip_special_tokens=False)
        # Remove special tokens for display
        text = text.replace("<|startoftext|>", "").replace("<|endoftext|>", "")
        text = text.strip()
        poems.append(text)
    
    return poems

In [14]:
# Generate poems without specific prompts (general topics)
print("Generating poems with general topics...\n")
print("="*70)

general_poems = []
for i in range(3):
    poems = generate_poem(
        model, 
        tokenizer, 
        prompt="",  # No prompt - let model decide
        max_length=200,
        temperature=0.9,
        num_return_sequences=1
    )
    general_poems.extend(poems)
    print(f"\nPoem {i+1}:")
    print("-"*70)
    print(poems[0])
    print("="*70)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Generating poems with general topics...


Poem 1:
----------------------------------------------------------------------
Sky lamp glints on us
An amorous dusk it was
We're merged with the stars and blown up by the sun's magic rays
An amorous dawn it was
We're merged with the galaxies and the dust clouds
An amorous dusk it was
We're merged with the stars and blown up by the light of the sun's full-moon
An enchanting dawn it was
We're merged with the planets and the moons
An amorous dusk it was
We're merged with the wind and the solar wind's magic
An enchanting dawn it was
We're merged with the stars and the planets
An amorous dusk it was
We're merged with the wind and the solar wind's magic
An enchanting dawn it was
We're merged with the stars and the planets
An amorous dusk it was
We're merged with the sun and the stars
An enchanting dawn it was
We're merged

Poem 2:
----------------------------------------------------------------------
On doggerel rime dedicated to you, catch my soul 

1 and 3 has large amount of repeated sentence structures and have very few meaningful content. Moreover, they both end up exceeding the max length while outputing something tedious. The second poem is comparatively better and obviously it is about romance, the topic of which is the same as the inspiring set. Thus, we determine to modify the prompt for the generator to output specified romantic poems and see how it would perform. 

## Step 4: Generate Romantic-Specific Poems

Here we modify the prompts and keywords for romantic poem generation.

In [15]:
# Define romantic prompts and keywords
romantic_prompts = [
    "Love",
    "My heart",
    "My dear",
    "Sweet",
    "Beloved",
    "In your eyes",
    "Your beauty",
    "Romance",
    "Passion",
    "Tenderness"
]

romantic_keywords = [
    'love', 'heart', 'dear', 'sweet', 'beloved', 'kiss', 'embrace',
    'passion', 'romance', 'tender', 'beauty', 'soul', 'desire', 'devotion',
    'affection', 'cherish', 'adore', 'darling', 'sweetheart', 'angel'
]

In [16]:
def is_romantic_poem(text, keywords, threshold=2):
    """
    Check if a poem contains romantic themes based on keyword presence.
    
    Args:
        text: The poem text to check
        keywords: List of romantic keywords
        threshold: Minimum number of keywords required
    
    Returns:
        Boolean indicating if poem is sufficiently romantic
    """
    text_lower = text.lower()
    count = sum(1 for keyword in keywords if keyword in text_lower)
    return count >= threshold

def generate_romantic_poem(model, tokenizer, max_attempts=10, **generation_kwargs):
    """
    Generate a romantic poem by:
    1. Using a romantic prompt
    2. Filtering results to ensure romantic content
    """
    for attempt in range(max_attempts):
        # Select a random romantic prompt
        prompt = random.choice(romantic_prompts)
        
        # Generate poem
        poems = generate_poem(model, tokenizer, prompt=prompt, **generation_kwargs)
        
        # Check if poem is romantic enough
        for poem in poems:
            if is_romantic_poem(poem, romantic_keywords, threshold=2):
                return poem, prompt
    
    # If we couldn't generate a sufficiently romantic poem, return the last one
    return poems[0], prompt

In [17]:
# Generate romantic-specific poems
print("Generating romantic-specific poems...\n")
print("="*70)

romantic_poems = []
for i in range(3):
    poem, prompt = generate_romantic_poem(
        model,
        tokenizer,
        max_length=200,
        temperature=0.85,  # Slightly lower temperature for more coherence
        top_k=50,
        top_p=0.95,
        num_return_sequences=1
    )
    romantic_poems.append(poem)
    
    print(f"\nRomantic Poem {i+1} (started with: '{prompt}'):")
    print("-"*70)
    print(poem)
    print("="*70)

Generating romantic-specific poems...


Romantic Poem 1 (started with: 'My heart'):
----------------------------------------------------------------------
My heart overflows
With sweet romantic feelings
When I think of you,
My eyes are missing you,
I wish that your family agrees to marry me
In a romantic ceremony,
But in vain! My heart is over-strained,
I am in love,
And I want nothing more than to be with you!
When I was young,
My whole world was consumed by your splendour,
I imagined you,
The wise ones,
The beautiful ones,
Who had attained the loftiest heights,
Who had enjoyed the finest soil,
Who had lived in the deepest parts of the earth!
I fancied you still live,
The bloated corpses of your people,
Who starved to death on the Egyptian highroad!
When I look into your eyes,
The deep sadness in your soul,
I wish that your descendants will never forget me!
When I look into your dreams,

Romantic Poem 2 (started with: 'Love'):
----------------------------------------------------------

Romantic poems are much more meaningful with romance-related contexts, since they have a better fine-grained weights on romantic corpus and the romantic poem structure. The 1st poem still has length-exceeded issue but the overall quality is still better than general poems, even compared with the best general one(the second from general poems). It even starts to rhyme with some words, for example in Romantic Poem 1 `marry me` and `ceremony`, `forget me` and `dreams`.

In [18]:
# Save all generated poems to files
import os

# Create output directory
os.makedirs("generated_poems", exist_ok=True)

# Save general poems
for i, poem in enumerate(general_poems):
    with open(f"generated_poems/general_poem_{i+1}.txt", 'w', encoding='utf-8') as f:
        f.write(poem)

# Save romantic poems
for i, poem in enumerate(romantic_poems):
    with open(f"generated_poems/romantic_poem_{i+1}.txt", 'w', encoding='utf-8') as f:
        f.write(poem)

print(f"Saved {len(general_poems)} general poems and {len(romantic_poems)} romantic poems to 'generated_poems/' folder")

Saved 3 general poems and 3 romantic poems to 'generated_poems/' folder


## Creativity Evaluation

This section evaluates the generated poems across four key dimensions:
1. Novelty - How different are they from the training data?
2. Coherence - Do the poems make logical and linguistic sense?
3. Emotional Impact - Do they evoke romantic feelings?
4. Technical Quality - Rhyme, rhythm, and imagery

We evaluate with both automatic method and our own thoughts.

In [34]:
# METRIC 1: NOVELTY - Comparing with Inspiring Set
import difflib
from collections import Counter
import re

def compute_n_grams(text, n=3):
    """Extract n-grams from text for similarity comparison."""
    words = text.lower().split()
    return [' '.join(words[i:i+n]) for i in range(len(words)-n+1)]

def compute_novelty_score(generated_poem, training_poems, n=3):
    """
    Calculate novelty by measuring n-gram overlap with training data.
    Returns a score between 0 and 1, where 1 means completely novel.
    """
    gen_ngrams = set(compute_n_grams(generated_poem, n))
    
    if not gen_ngrams:
        return 0.0
    
    # Check overlap with each training poem
    max_overlap = 0
    for train_poem in training_poems:
        train_ngrams = set(compute_n_grams(train_poem, n))
        if train_ngrams:
            overlap = len(gen_ngrams.intersection(train_ngrams)) / len(gen_ngrams)
            max_overlap = max(max_overlap, overlap)
    
    # Novelty is inverse of maximum overlap
    novelty = 1 - max_overlap
    return novelty

def compute_exact_match_ratio(generated_poem, training_poems):
    """
    Check if any substantial phrase from generated poem appears verbatim in training data.
    """
    gen_text = generated_poem.lower()
    
    for train_poem in training_poems:
        train_text = train_poem.lower()
        # Use SequenceMatcher to find longest common substring
        matcher = difflib.SequenceMatcher(None, gen_text, train_text)
        match = matcher.find_longest_match(0, len(gen_text), 0, len(train_text))
        
        if match.size > 30:  # If match is longer than 30 characters
            return match.size, train_text[match.b:match.b+match.size]
    
    return 0, ""

def evaluate_novelty(generated_poems, training_poems):
    """
    Comprehensive novelty evaluation.
    """
    print("METRIC 1: NOVELTY EVALUATION")
    print(f"\nEvaluating {len(generated_poems)} generated poems against {len(training_poems)} training poems\n")
    
    results = []
    
    for i, poem in enumerate(generated_poems):
        print(f"\n--- Poem {i+1} ---")
        
        # Compute novelty scores at different n-gram levels
        trigram_novelty = compute_novelty_score(poem, training_poems, n=3)
        bigram_novelty = compute_novelty_score(poem, training_poems, n=2)
        
        # Check for exact matches
        match_length, match_text = compute_exact_match_ratio(poem, training_poems)
        
        results.append({
            'poem_id': i+1,
            'trigram_novelty': trigram_novelty,
            'bigram_novelty': bigram_novelty,
            'exact_match_length': match_length
        })
        
        print(f"Trigram Novelty Score: {trigram_novelty:.3f} (1.0 = completely novel)")
        print(f"Bigram Novelty Score: {bigram_novelty:.3f}")
        
        if match_length > 30:
            print(f"  Found exact match ({match_length} chars): '{match_text[:50]}...'")
        else:
            print(" No substantial exact matches found")
        
        # Overall assessment
        avg_novelty = (trigram_novelty + bigram_novelty) / 2
        if avg_novelty > 0.8:
            print(f"Assessment: HIGH NOVELTY ")
        elif avg_novelty > 0.5:
            print(f"Assessment: MODERATE NOVELTY")
        else:
            print(f"Assessment: LOW NOVELTY ")
    
    # Summary statistics
    avg_trigram = sum(r['trigram_novelty'] for r in results) / len(results)
    avg_bigram = sum(r['bigram_novelty'] for r in results) / len(results)

    print("NOVELTY SUMMARY:")
    print(f"Average Trigram Novelty: {avg_trigram:.3f}")
    print(f"Average Bigram Novelty: {avg_bigram:.3f}")
    print(f"Overall Novelty Score: {(avg_trigram + avg_bigram) / 2:.3f}")
    
    return results

# METRIC 2: COHERENCE - Linguistic and Logical Sense

def evaluate_coherence_metrics(poem):
    """
    Evaluate various aspects of coherence.
    """
    lines = [line.strip() for line in poem.split('\n') if line.strip()]
    
    metrics = {
        'num_lines': len(lines),
        'avg_line_length': sum(len(line.split()) for line in lines) / len(lines) if lines else 0,
        'has_punctuation': bool(re.search(r'[.!?,;:]', poem)),
        'repeated_words': 0,
        'sentence_fragments': 0,
    }
    
    # Check for excessive word repetition
    words = poem.lower().split()
    word_counts = Counter(words)
    metrics['repeated_words'] = sum(1 for count in word_counts.values() if count > 3)
    
    # Check for very short lines (potential fragments)
    metrics['sentence_fragments'] = sum(1 for line in lines if len(line.split()) < 3)
    
    return metrics

def coherence_human_readable_assessment(poem_text):
    """
    This function returns a structured assessment for human evaluation.
    The actual coherence judgment should be done by reading the poem.
    """
    metrics = evaluate_coherence_metrics(poem_text)
    
    assessment = {
        'structure': 'Good' if 5 <= metrics['num_lines'] <= 30 else 'Unusual',
        'line_length': 'Consistent' if 3 <= metrics['avg_line_length'] <= 12 else 'Variable',
        'punctuation': 'Present' if metrics['has_punctuation'] else 'Missing',
        'repetition': 'Excessive' if metrics['repeated_words'] > 2 else 'Appropriate',
        'fragments': 'Many' if metrics['sentence_fragments'] > metrics['num_lines'] * 0.3 else 'Few'
    }
    
    return assessment, metrics

def evaluate_coherence(poems):
    """
    Evaluate coherence with automatic metrics and structure for human judgment.
    """
    print("\nMETRIC 2: COHERENCE EVALUATION")
    print("\nNote: Coherence requires human judgment. These are supporting metrics.\n")
    
    for i, poem in enumerate(poems):
        print(f"\n--- Poem {i+1} ---")
        print("Poem Text:")
        print(poem[:300] + "..." if len(poem) > 300 else poem)
        
        assessment, metrics = coherence_human_readable_assessment(poem)
        
        print(f"\nAutomatic Metrics:")
        print(f"  Lines: {metrics['num_lines']}")
        print(f"  Avg words/line: {metrics['avg_line_length']:.1f}")
        print(f"  Punctuation: {assessment['punctuation']}")
        print(f"  Word repetition: {assessment['repetition']}")
        print(f"  Sentence fragments: {assessment['fragments']}")


# METRIC 3: EMOTIONAL IMPACT - Romantic Feelings

def analyze_emotional_vocabulary(poem):
    """
    Analyze the presence of emotion-evoking vocabulary.
    """
    # Emotional vocabulary categories
    emotion_words = {
        'love': ['love', 'adore', 'cherish', 'devotion', 'affection', 'passion'],
        'longing': ['yearn', 'desire', 'miss', 'wish', 'long', 'crave', 'ache'],
        'beauty': ['beautiful', 'lovely', 'gorgeous', 'fair', 'radiant', 'exquisite'],
        'intimacy': ['kiss', 'embrace', 'touch', 'caress', 'tender', 'gentle'],
        'joy': ['delight', 'bliss', 'happiness', 'joy', 'ecstasy', 'rapture'],
        'sadness': ['sorrow', 'melancholy', 'tears', 'pain', 'heartbreak', 'mourn']
    }
    
    poem_lower = poem.lower()
    found_emotions = {}
    
    for category, words in emotion_words.items():
        found = [word for word in words if word in poem_lower]
        if found:
            found_emotions[category] = found
    
    return found_emotions

def evaluate_emotional_impact(poems):
    """
    Evaluate the emotional impact and romantic quality of poems.
    """
    print("\nMETRIC 3: EMOTIONAL IMPACT EVALUATION")
    print("\nAnalyzing romantic and emotional vocabulary...\n")
    
    for i, poem in enumerate(poems):
        print(f"\n--- Poem {i+1} ---")
        print("Poem Text:")
        print(poem[:300] + "..." if len(poem) > 300 else poem)
        
        emotions = analyze_emotional_vocabulary(poem)
        
        print(f"\nEmotional Vocabulary Analysis:")
        if emotions:
            for category, words in emotions.items():
                print(f"  {category.upper()}: {', '.join(words)}")
            print(f"\n  Total emotional categories: {len(emotions)}/6")
            
            # Score
            emotion_score = len(emotions) / 6
            if emotion_score > 0.5:
                print(f"  Emotional richness: HIGH ")
            elif emotion_score > 0.3:
                print(f"  Emotional richness: MODERATE")
            else:
                print(f"  Emotional richness: LOW")
        else:
            print("  No strong emotional vocabulary detected ")
    

# METRIC 4: TECHNICAL QUALITY - Rhyme, Rhythm, Imagery

def detect_rhyme_scheme(poem):
    """
    Simple rhyme detection based on line endings.
    """
    lines = [line.strip() for line in poem.split('\n') if line.strip()]
    
    if len(lines) < 2:
        return None, []
    
    # Get last words
    last_words = []
    for line in lines:
        words = re.findall(r'\b\w+\b', line.lower())
        if words:
            last_words.append(words[-1])
    
    # Check for rhyming pairs (simple phonetic similarity)
    rhyme_pairs = []
    for i in range(len(last_words) - 1):
        for j in range(i + 1, len(last_words)):
            if last_words[i][-2:] == last_words[j][-2:] and len(last_words[i]) > 2:
                rhyme_pairs.append((i, j, last_words[i], last_words[j]))
    
    return last_words, rhyme_pairs

def analyze_rhythm(poem):
    """
    Basic rhythm analysis - syllable patterns.
    """
    lines = [line.strip() for line in poem.split('\n') if line.strip()]
    
    # Approximate syllable count (very rough)
    def count_syllables_rough(line):
        vowels = 'aeiouAEIOU'
        syllables = 0
        previous_was_vowel = False
        for char in line:
            is_vowel = char in vowels
            if is_vowel and not previous_was_vowel:
                syllables += 1
            previous_was_vowel = is_vowel
        return max(1, syllables)
    
    syllable_counts = [count_syllables_rough(line) for line in lines]
    
    if not syllable_counts:
        return {'pattern': 'None', 'consistency': 0}
    
    # Check consistency
    avg_syllables = sum(syllable_counts) / len(syllable_counts)
    variance = sum((c - avg_syllables) ** 2 for c in syllable_counts) / len(syllable_counts)
    consistency = 1 / (1 + variance)  # 1 = perfect consistency
    
    return {
        'avg_syllables': avg_syllables,
        'consistency': consistency,
        'pattern': syllable_counts[:min(5, len(syllable_counts))]
    }

def analyze_imagery(poem):
    """
    Detect imagery and figurative language.
    """
    imagery_words = {
        'visual': ['see', 'look', 'gaze', 'bright', 'dark', 'shimmer', 'glow', 'shadow'],
        'tactile': ['touch', 'soft', 'warm', 'cold', 'smooth', 'rough'],
        'auditory': ['hear', 'sound', 'whisper', 'song', 'music', 'voice'],
        'nature': ['rose', 'flower', 'moon', 'sun', 'star', 'sky', 'ocean', 'wind']
    }
    
    poem_lower = poem.lower()
    found_imagery = {}
    
    for category, words in imagery_words.items():
        found = [word for word in words if word in poem_lower]
        if found:
            found_imagery[category] = found
    
    return found_imagery

def evaluate_technical_quality(poems):
    """
    Evaluate technical aspects: rhyme, rhythm, and imagery.
    """
    print("\nMETRIC 4: TECHNICAL QUALITY EVALUATION")
    print("\nAnalyzing rhyme scheme, rhythm patterns, and imagery...\n")
    
    for i, poem in enumerate(poems):
        print(f"\n--- Poem {i+1} ---")
        print("Poem Text:")
        print(poem[:300] + "..." if len(poem) > 300 else poem)
        
        # Rhyme analysis
        last_words, rhyme_pairs = detect_rhyme_scheme(poem)
        print(f"\n RHYME ANALYSIS:")
        if rhyme_pairs:
            print(f"  Rhyming pairs found: {len(rhyme_pairs)}")
            for i1, i2, w1, w2 in rhyme_pairs[:3]:  # Show first 3
                print(f"    Line {i1+1} ('{w1}') rhymes with Line {i2+1} ('{w2}')")
            print(f"  Rhyme presence: YES ")
        else:
            print(f"  Rhyme presence: NONE or MINIMAL")
        
        # Rhythm analysis
        rhythm = analyze_rhythm(poem)
        print(f"\n RHYTHM ANALYSIS:")
        print(f"  Avg syllables/line: {rhythm['avg_syllables']:.1f}")
        print(f"  Rhythm consistency: {rhythm['consistency']:.2f} (1.0 = perfect)")
        print(f"  Pattern sample: {rhythm['pattern']}")
        
        # Imagery analysis
        imagery = analyze_imagery(poem)
        print(f"\n IMAGERY ANALYSIS:")
        if imagery:
            for category, words in imagery.items():
                print(f"  {category.upper()}: {', '.join(words)}")
            print(f"  Imagery richness: {len(imagery)}/4 categories")
        else:
            print(f"  Imagery: MINIMAL ")
        
        # Overall technical assessment
        print(f"\n TECHNICAL ASSESSMENT:")
        tech_score = 0
        if rhyme_pairs:
            tech_score += 1
            print(f"   Has rhyme scheme")
        if rhythm['consistency'] > 0.7:
            tech_score += 1
            print(f"   Consistent rhythm")
        if len(imagery) >= 2:
            tech_score += 1
            print(f"   Good imagery")
        
        if tech_score >= 2:
            print(f"\n  Overall: GOOD TECHNICAL QUALITY ")
        elif tech_score == 1:
            print(f"\n  Overall: MODERATE TECHNICAL QUALITY")
        else:
            print(f"\n  Overall: LIMITED TECHNICAL QUALITY")
    


def run_comprehensive_evaluation(generated_poems, training_poems):
    """
    Run all four evaluation metrics.
    """
    print("\n")
    print( "COMPREHENSIVE CREATIVITY EVALUATION\n")
    
    # Metric 1: Novelty
    novelty_results = evaluate_novelty(generated_poems, training_poems)
    
    # Metric 2: Coherence
    evaluate_coherence(generated_poems)
    
    # Metric 3: Emotional Impact
    evaluate_emotional_impact(generated_poems)
    
    # Metric 4: Technical Quality
    evaluate_technical_quality(generated_poems)

    print("EVALUATION COMPLETE")

In [35]:
run_comprehensive_evaluation(romantic_poems, poems)



COMPREHENSIVE CREATIVITY EVALUATION

METRIC 1: NOVELTY EVALUATION

Evaluating 3 generated poems against 1 training poems


--- Poem 1 ---
Trigram Novelty Score: 1.000 (1.0 = completely novel)
Bigram Novelty Score: 0.961
 No substantial exact matches found
Assessment: HIGH NOVELTY 

--- Poem 2 ---
Trigram Novelty Score: 1.000 (1.0 = completely novel)
Bigram Novelty Score: 1.000
 No substantial exact matches found
Assessment: HIGH NOVELTY 

--- Poem 3 ---
Trigram Novelty Score: 1.000 (1.0 = completely novel)
Bigram Novelty Score: 1.000
 No substantial exact matches found
Assessment: HIGH NOVELTY 
NOVELTY SUMMARY:
Average Trigram Novelty: 1.000
Average Bigram Novelty: 0.987
Overall Novelty Score: 0.993

METRIC 2: COHERENCE EVALUATION

Note: Coherence requires human judgment. These are supporting metrics.


--- Poem 1 ---
Poem Text:
My heart overflows
With sweet romantic feelings
When I think of you,
My eyes are missing you,
I wish that your family agrees to marry me
In a romantic ceremo

The first generated poem presents a complex case of creative failure despite high novelty. The poem begins promisingly with conventional romantic imagery and sentiment, opening with "My heart overflows / With sweet romantic feelings / When I think of you." This initial section establishes clear romantic intent, expressing desire for marriage and devotion to the beloved. However, the poem undergoes a severe and inexplicable tonal shift approximately halfway through, introducing disturbing historical imagery including "bloated corpses" and references to starvation on an "Egyptian highroad."

From a coherence perspective, this poem fails fundamentally. While it maintains grammatical correctness and a consistent first-person perspective, it does not follow a logical flow. The transition from contemporary romantic confession to morbid historical commentary occurs without explanation or connection. The ideas are not coherently linked; the romantic plea to marry and the graphic death imagery exist as two incompatible fragments forced into proximity. The poem reads as though two entirely different compositions were spliced together mid-generation, suggesting a failure in the model's ability to maintain thematic consistency over longer sequences.

The emotional impact of this poem is similarly problematic. Despite the automated metrics detecting high emotional richness across four of six emotional categories, the actual emotional effect on a human reader is confusion and discomfort rather than romantic sentiment. The opening lines do evoke romantic longing, but this emotional foundation is completely undermined by the subsequent disturbing imagery. The emotional tone, which should remain appropriate to the romantic genre, becomes deeply inappropriate when discussing corpses and death in the context of a love poem. The vivid imagery created by the poem is paradoxically both a strength and a weakness; while the imagery is indeed vivid, it is tonally wrong for the intended genre. The poem cannot be classified as genuinely romantic despite its romantic opening, as the intrusion of macabre elements fundamentally transforms its character. This represents a clear failure mode where the model loses control of tonal consistency during generation.