# Poem Generation using Qwen2.5-1.5B-Instruct

### Load Model and Generate Sonnets

In [42]:
import collections
import math
import re
from collections import Counter

import nltk
import pandas as pd
import pronouncing
import torch
from nltk.corpus import cmudict
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.meteor_score import meteor_score
from nltk.util import ngrams

In [13]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")

In [9]:
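# Prompt for the poem
input_text = "Write a Shakespearean sonnet about courage set in a battlefield with a determined tone. Use vivid imagery to convey strength and resilience."
inputs = tokenizer(input_text, return_tensors="pt")

# Generate poem. temperature and top_p only take effect when sampling,
# so sampling is enabled explicitly with do_sample=True.
outputs = model.generate(**inputs, max_new_tokens=350, do_sample=True, temperature=0.7, top_p=0.9)
# outputs[0] holds the prompt tokens followed by the continuation,
# so the decoded text starts with the prompt.
generated_poem1 = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Generated Poem 1:\n", generated_poem1)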


Generated Poem 1:
 Write a Shakespearean sonnet about courage set in a battlefield with a determined tone. Use vivid imagery to convey strength and resilience. Amidst the clash of steel and blood, I stand upon the field,
A warrior bold who dares not yield his goal.
With every stroke my heart beats strong and true,
As battle rages on without a pause.

The sun is but a beacon far away,
Its light a guiding star for weary men.
My shield is armor forged from ancient stone,
And sword sharp as the wind's relentless blow.

In this grim place where life and death are fought,
I fight the foe with valor unconfined.
For victory begets a better fate,
And through my will I conquer every night.

Thus, let me speak with voice resounding loud,
Of bravery shining bright like morning gold!


In [11]:
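# Prompt for the poem
input_text = "Write a Shakespearean sonnet about wonder set in outer space with an awe-filled tone. Use vivid imagery to convey mystery and discovery."
inputs = tokenizer(input_text, return_tensors="pt")

# Generate poem. temperature and top_p only take effect when sampling,
# so sampling is enabled explicitly with do_sample=True.
outputs = model.generate(**inputs, max_new_tokens=350, do_sample=True, temperature=0.7, top_p=0.9)
# outputs[0] holds the prompt tokens followed by the continuation,
# so the decoded text starts with the prompt.
generated_poem2 = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Generated Poem 2:\n", generated_poem2)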


Generated Poem 2:
 Write a Shakespearean sonnet about wonder set in outer space with an awe-filled tone. Use vivid imagery to convey mystery and discovery. In the vast expanse of cosmic air, where stars do twinkle bright,
And planets spin with dizzying speed above,
A human spirit stirs beneath the night,
Where dreams are born on nebulae's ghostly wings.

The moon, a silvery orb, ascends aloft,
Its light a beacon through the velvet sky,
Guiding lost souls into realms unknown,
As if the universe were whispering words.

Yet here below, among the lunar dust,
There looms a question that no answer gives:
What secrets lie beyond this endless space?
Are we mere specks in cosmic ballet?

Oh, how I yearn for worlds unseen!
For wonders yet untold, unexplored deep.
In this celestial realm, my heart does sing
To mysteries so vast, so profound, so wide. 

So let us venture forth, our spirits free,
To explore these wonders, know them all,
For there is beauty in the unknown still,
And courage in the f

In [12]:
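# Prompt for the poem
input_text = "Write a Shakespearean sonnet about loss set in an empty home with a somber tone. Use vivid imagery to convey grief and reflection."
inputs = tokenizer(input_text, return_tensors="pt")

# Generate poem. temperature and top_p only take effect when sampling,
# so sampling is enabled explicitly with do_sample=True.
outputs = model.generate(**inputs, max_new_tokens=350, do_sample=True, temperature=0.7, top_p=0.9)
# outputs[0] holds the prompt tokens followed by the continuation,
# so the decoded text starts with the prompt.
generated_poem3 = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Generated Poem 3:\n", generated_poem3)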


Generated Poem 3:
 Write a Shakespearean sonnet about loss set in an empty home with a somber tone. Use vivid imagery to convey grief and reflection. In silent halls where shadows loom, 
The echoes of the past remain;  
Each room a chamber once where laughter danced,
Now quiet as the whispers of the dead.

The fire, once bright and warm, now cold and stark,
Leaves embers that tell tales long forgotten;
A ghostly light flickers from within,
A testament to life's fleeting grace.

The dust on walls and furniture old,
As if time itself had paused its course,
In this deserted house, where love did grow,
And dreams were woven into every thread.

Yet here I stand, alone and lost,
To ponder truths that death has taught me most.
For though the world may change around us,
This place remains, a mirror to my soul. 

So let the memories come and go,
Like autumn leaves upon the ground below;
For even when they pass, their beauty stays,
In the heart where sorrow holds its sway. 

Thus ends my poem, w
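
The decoded outputs above begin with the prompt itself, because `generate` returns the prompt tokens followed by the continuation. A minimal sketch of removing that echo with plain string handling is shown here; `strip_prompt` is a hypothetical helper that is not part of the notebook, and it does nothing about outputs cut off by the token limit.

```python
def strip_prompt(decoded_text, prompt):
    """Return the decoded text without a leading copy of the prompt.

    Hypothetical helper: assumes the prompt, if echoed, appears verbatim
    at the very start of the decoded text.
    """
    text = decoded_text.lstrip()
    if text.startswith(prompt):
        text = text[len(prompt):]
    return text.strip()


# Toy strings standing in for a real prompt and decoded output.
example_prompt = "Write a sonnet about courage."
example_output = "Write a sonnet about courage. Amidst the clash of steel I stand"
poem_only = strip_prompt(example_output, example_prompt)
print(poem_only)
```

Text that does not start with the prompt is returned unchanged apart from surrounding whitespace.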

#### Extracting the necessary content (sonnet)

In [15]:
file_path = "Poem_Data.csv"
data = pd.read_csv(file_path)

In [17]:
poem1 = data.loc[0, 'QWEN2 7B Instruct Q4']
poem2 = data.loc[1, 'QWEN2 7B Instruct Q4']
poem3 = data.loc[2, 'QWEN2 7B Instruct Q4']

#### BLEU (Bilingual Evaluation Understudy)
##### What it does: 
BLEU measures the precision of n-grams (sequences of n consecutive words) in the generated text when compared to a reference text. It’s widely used for machine translation but can also apply to poetry.
##### How we’re using it: 
While BLEU is typically associated with translation, we use it here to evaluate how well the n-grams of the generated poems align with those of the reference poems. This gives a basic measure of word-pattern similarity between generated and reference text. Note that the BLEU call is commented out in the evaluation loop below, so no BLEU values are reported.
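
To make the n-gram precision idea concrete, here is a minimal pure-Python sketch of the clipped ("modified") n-gram precision that BLEU is built on. It is an illustration under simplifying assumptions (whitespace tokenization, a single reference, no brevity penalty or geometric mean), not the NLTK implementation; the function names and example sentences are made up for this sketch.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(reference, candidate, n):
    """Share of candidate n-grams that also occur in the reference.

    Each candidate n-gram is credited at most as many times as it
    appears in the reference (clipping).
    """
    ref_counts = ngram_counts(reference.split(), n)
    cand_counts = ngram_counts(candidate.split(), n)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


# Illustrative sentences, not taken from the dataset.
reference = "the moon rises over the silent sea"
candidate = "the moon sinks over the sea"
p1 = modified_precision(reference, candidate, 1)  # 5 of 6 unigrams match
p2 = modified_precision(reference, candidate, 2)  # 2 of 5 bigrams match
print(round(p1, 3), round(p2, 3))
```

Precision drops quickly as n grows, which is why unsmoothed BLEU on short texts often collapses to zero.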

#### METEOR (Metric for Evaluation of Translation with Explicit ORdering)
##### What it does: 
METEOR considers semantic and syntactic similarity by accounting for synonyms, stemming, and word order. This allows it to address some of the weaknesses of the BLEU metric.
##### How we’re using it: 
We use METEOR to assess how closely the generated poem matches the meaning and structure of a reference poem. This is especially helpful for understanding if the model maintains context or theme throughout the text.

#### Perplexity
##### What it does: 
Perplexity measures how well a language model can predict the next word in a sequence. A lower perplexity means the model has more confidence in its predictions, which can translate to better fluency.
##### How we’re using it: 
We use perplexity to gauge the fluency and coherence of the generated poems. It gives us an idea of how "natural" the poem sounds, even though it doesn’t capture all aspects of creativity.
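
Perplexity is the exponential of the average negative log-likelihood per token. The sketch below shows that relationship on hand-written log-probabilities; in practice those values come from a language model, and `perplexity_from_logprobs` is a hypothetical name used only here.

```python
import math


def perplexity_from_logprobs(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood), natural-log inputs."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# A model that assigns probability 0.25 to every token is as uncertain
# as a uniform choice among 4 options, so its perplexity is 4.
uniform_logprobs = [math.log(0.25)] * 8
ppl_uniform = perplexity_from_logprobs(uniform_logprobs)

# More confident predictions give a lower perplexity.
confident_logprobs = [math.log(0.9)] * 8
ppl_confident = perplexity_from_logprobs(confident_logprobs)
print(round(ppl_uniform, 3), round(ppl_confident, 3))
```

This is the same quantity as `exp(loss)` when the loss is a mean cross-entropy over tokens.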

#### Entropy
##### What it does: 
Entropy measures the randomness or diversity of a text by quantifying how varied the token choices are in a given sequence. Higher entropy suggests the model is generating more diverse content, while lower entropy signals repetition or predictability.
##### How we’re using it: 
Entropy helps gauge how varied or repetitive the generated poetry is. Very low entropy points to repetitive output and very high entropy to chaotic output, so moderate values are preferable. Note that the `calculate_entropy` function below works at the character level, so it captures variety in character usage rather than in word choice. We use this metric to compare different models.
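
The unit over which entropy is computed matters a great deal. The following sketch, with a made-up example line and a hypothetical `shannon_entropy` helper, computes Shannon entropy over characters and over whitespace-separated words for the same text.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Shannon entropy in bits of a sequence of hashable symbols."""
    counts = Counter(symbols)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# A deliberately repetitive line: only two distinct words.
line = "the stars the stars the stars"
char_entropy = shannon_entropy(list(line))    # entropy over characters
word_entropy = shannon_entropy(line.split())  # entropy over words
print(round(char_entropy, 3), round(word_entropy, 3))
```

The word-level value is exactly 1 bit because two words occur equally often, while the character-level value is higher and barely reflects the repetition.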

In [28]:
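# Evaluation Functions
from nltk.translate.bleu_score import SmoothingFunction

def bleu_score(reference, candidate):
    """
    Calculate the BLEU score between the reference and candidate texts.
    Measures the overlap of n-grams between the generated text and the reference.
    Smoothing (method1) keeps the score from collapsing to 0 when a higher-order
    n-gram has no match, which is common for short texts such as poems.
    Returns the score rounded to 5 decimal places.
    """
    smoothing = SmoothingFunction().method1
    return round(sentence_bleu([reference.split()], candidate.split(), smoothing_function=smoothing), 5)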

def meteor_score_func(reference, candidate):
    """
    Calculate the METEOR score between the reference and candidate texts.
    METEOR measures both exact matches and semantically similar matches 
    (stemming or synonyms) and penalizes overly short outputs.
    Tokenizes both inputs before computing the score.
    Returns the score rounded to 5 decimal places.
    """
    # Tokenize both the reference and candidate
    reference_tokens = reference.split()  # Tokenize the reference poem
    candidate_tokens = candidate.split()  # Tokenize the generated poem
    return round(meteor_score([reference_tokens], candidate_tokens), 5)

def calculate_perplexity(text):
    """
    Calculate the perplexity of a text using the provided tokenizer and model.
    Perplexity is a measure of how well the language model predicts the text.
    Lower perplexity indicates higher fluency and alignment with the model's training data.
    - Uses the tokenizer to convert the text into input tensors.
    - Performs inference using the model with no gradient calculation (for efficiency).
    - Computes the loss from the model's output and converts it to perplexity.
    Returns the perplexity value rounded to 5 decimal places.
    """
    inputs = tokenizer(text, return_tensors="pt")  # Tokenize text into tensors
    with torch.no_grad():  # Disable gradient computation for efficiency
        outputs = model(**inputs, labels=inputs["input_ids"])  # Compute loss
        loss = outputs.loss  # Extract loss
        return round(torch.exp(loss).item(), 5)  # Convert loss to perplexity

def calculate_entropy(text):
    """
    Calculate the character-level Shannon entropy of a text, in bits per character.
    Entropy quantifies the unpredictability of the text.
    - Splits the text into individual characters (including spaces and punctuation).
    - Counts the occurrences of each character.
    - Calculates probabilities for each character and uses them to compute entropy.
    Returns the entropy value rounded to 5 decimal places.
    """
    tokens = list(text)  # Split the text into individual characters
    token_counts = Counter(tokens)  # Count occurrences of each character
    total_tokens = len(tokens)  # Total number of characters
    probabilities = [count / total_tokens for count in token_counts.values()]  # Compute probabilities
    entropy = -sum(p * math.log2(p) for p in probabilities)  # Compute entropy
    return round(entropy, 5)


In [30]:
model_name_to_evaluate = "QWEN2 7B Instruct Q4" 

In [32]:
# Evaluate Scores
results = []

for index, row in data.iterrows():
    prompt = row["PromptID"]
    reference_poem = row["ReferencePoem"]
    generated_poem = row[model_name_to_evaluate]
    
    # Calculate scores
    #bleu = bleu_score(reference_poem, generated_poem)
    meteor = meteor_score_func(reference_poem, generated_poem)
    perplexity = calculate_perplexity(generated_poem)
    entropy = calculate_entropy(generated_poem)
    
    # Append to results
    results.append({
        "Prompt": prompt,
        #"BLEU": bleu,
        "METEOR": meteor,
        "Perplexity": perplexity,
        "Entropy": entropy
    })

results_df = pd.DataFrame(results)
average_entropy = results_df["Entropy"].mean()
evaluation_title = f"Evaluation Results for Model: {model_name_to_evaluate}"
print(evaluation_title)
print(results_df)
print(f"\nAverage Entropy: {average_entropy:.4f}")

Evaluation Results for Model: QWEN2 7B Instruct Q4
   Prompt   METEOR  Perplexity  Entropy
0       1  0.10204     9.34963  4.42495
1       2  0.12649     7.87969  4.43341
2       3  0.10071     7.32215  4.37548

Average Entropy: 4.4113


##### METEOR Scores:
* The METEOR scores for the prompts are low (around 0.10–0.13), indicating that the generated poems share little wording with the reference poems. For open-ended generation this is not unusual, since many different poems can satisfy the same prompt.

##### Perplexity:
* Perplexity values range from approximately 7.32 to 9.35. As written, `calculate_perplexity` scores each poem with the globally loaded `model`, so these values describe how probable that model finds the text; lower values mean more predictable text. The values are fairly low, but perplexity alone does not show whether a poem is coherent or follows the expected structure.

##### Entropy:
* The entropy values (about 4.38 to 4.43 bits per character, mean 4.41) are computed over characters, not words. Values in this range are roughly what ordinary English text produces, so they mainly confirm that the output looks like normal English and say little about diversity or repetition at the word level.

### Diversity and Variety

In the context of poem generation, diversity and variety refer to the richness, uniqueness, and creativity of the generated text, which are critical for producing compelling, engaging, and original poetry.

* Self-BLEU: Evaluates how much overlap exists between different samples generated by the same model. Lower overlap means more diversity.
* Distinct n-grams: Measures the percentage of distinct n-grams in the generated text compared to the total n-grams. Higher distinctness means more diversity.
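
As an illustration of the overlap idea behind Self-BLEU, the sketch below scores each text by the share of its own n-grams that also appear in the other texts. This is a simplified stand-in (clipped n-gram precision against the pooled other texts, with no brevity penalty and no averaging over n-gram orders), not a full Self-BLEU implementation; the function names and sample lines are hypothetical.

```python
from collections import Counter


def ngram_counter(text, n):
    """Count whitespace-token n-grams in a string."""
    tokens = text.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def leave_one_out_overlap(texts, n=1):
    """For each text, the fraction of its n-grams found in the other texts."""
    counters = [ngram_counter(text, n) for text in texts]
    scores = []
    for i, candidate in enumerate(counters):
        others = Counter()
        for j, other in enumerate(counters):
            if j != i:
                others.update(other)
        total = sum(candidate.values())
        if total == 0:
            scores.append(0.0)
            continue
        clipped = sum(min(count, others[gram]) for gram, count in candidate.items())
        scores.append(clipped / total)
    return scores


# Illustrative samples, not taken from the dataset.
samples = ["the sea is calm", "the sea is wild", "a quiet forest path"]
overlap_scores = leave_one_out_overlap(samples, n=1)
print(overlap_scores)
```

Normalizing by the candidate's own n-gram count keeps each score between 0 and 1, so 1 means the text is fully covered by the others and 0 means it shares nothing with them.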

In [46]:
# Self-BLEU Score
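def self_bleu(generated_texts, n=1):
    """
    Calculates a simplified Self-BLEU-style overlap score for a list of generated texts.
    For each text, the clipped n-gram overlap with all other texts is divided by the
    total number of n-grams in those other texts, and the results are averaged.
    This is not the standard BLEU formula (no brevity penalty, single n-gram order),
    and because of the normalization the score stays well below 1 even for
    near-identical texts when there are three or more of them.

    Args:
        generated_texts (list of str): A list of generated text strings to evaluate.
        n (int): The n-gram size to use for computation. Default is 1 (unigrams).

    Returns:
        float: The average overlap score across all generated texts.
    """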
    all_ngrams = [] 
    for text in generated_texts:
        if not isinstance(text, str):
            text = str(text) 
        tokens = text.split() 
        ngrams_list = list(ngrams(tokens, n)) 
        all_ngrams.append(collections.Counter(ngrams_list))
    
    # Initialize a score to compute Self-BLEU
    score = 0
    for i, counter in enumerate(all_ngrams):
        # Exclude the current text and focus only on other generated outputs
        others = all_ngrams[:i] + all_ngrams[i+1:]
        union_ngrams = collections.Counter() 
        for other in others:
            union_ngrams.update(other)

        # Avoid division by zero by checking if n-grams are available
        if sum(union_ngrams.values()) > 0:

            overlap = sum(min(count, union_ngrams[gram]) for gram, count in counter.items())
            score += overlap / sum(union_ngrams.values())
    
    return score / len(generated_texts) if len(generated_texts) > 1 else 0


# Distinct-N Score
def distinct_n(generated_texts, n=2):
    """
    Calculates the Distinct-N score for a list of generated texts.
    Distinct-N measures diversity by computing the ratio of unique n-grams
    over the total number of n-grams across all generated texts.

    Args:
        generated_texts (list of str): A list of generated text strings to evaluate.
        n (int): The n-gram size to compute for Distinct-N. Default is 2 (bigrams).

    Returns:
        float: The Distinct-N score indicating diversity of generated texts.
    """
    ngrams_set = set()  
    total_ngrams = 0 

    for text in generated_texts:
        if not isinstance(text, str):
            text = str(text)
        tokens = text.split()
        ngrams_list = list(ngrams(tokens, n))  
        ngrams_set.update(ngrams_list)
        total_ngrams += len(ngrams_list)

    return len(ngrams_set) / total_ngrams if total_ngrams > 0 else 0

In [48]:
# Function to evaluate Self-BLEU and Distinct-N for a model
def evaluate_model_metrics(data, model_name_to_evaluate):
    results = []

    # Collect all poems generated by the same model across different prompts
    generated_poems = data[model_name_to_evaluate].tolist()
    self_bleu_score = self_bleu(generated_poems, n=1)  # For unigrams (Self-BLEU n=1)
    distinct_2_score = distinct_n(generated_poems, n=2)  # For distinct-2 (bigrams)
    
    for index, row in data.iterrows():
        prompt = row["PromptID"]
        
        results.append({
            "Model": model_name_to_evaluate,
            "Prompt": prompt,
            "Self-BLEU (n=1)": round(self_bleu_score, 5),
            "Distinct-2": round(distinct_2_score, 5)
        })

    results_df = pd.DataFrame(results)
    
    evaluation_title = f"Evaluation Results for Model: {model_name_to_evaluate}"
    
    print(evaluation_title)
    print(results_df)

evaluate_model_metrics(data, model_name_to_evaluate)

Evaluation Results for Model: QWEN2 7B Instruct Q4
                  Model  Prompt  Self-BLEU (n=1)  Distinct-2
0  QWEN2 7B Instruct Q4       1          0.18863     0.96352
1  QWEN2 7B Instruct Q4       2          0.18863     0.96352
2  QWEN2 7B Instruct Q4       3          0.18863     0.96352


##### Self-BLEU (n=1):
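* The Self-BLEU score of 0.1886 indicates limited unigram overlap across the poems generated by the model.
* Because `self_bleu` divides each poem's overlap by the unigram count of the other poems combined, the score cannot approach 1 with three poems of similar length (the ceiling is roughly 0.5). The value should be read against that ceiling, where it still suggests a fair amount of diversity.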
##### Distinct-2 Score:
* The Distinct-2 score of 0.9635 reflects a high proportion of unique bigrams in the generated poems.
* This demonstrates that the model maintains substantial diversity in the construction of phrases and avoids repetitive patterns.
##### Consistency Across Prompts:
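* The scores are identical in all three rows because `evaluate_model_metrics` computes both metrics once over the full set of poems and copies the same values into every row.
* They therefore describe the model's output as a whole and say nothing about how diversity varies from prompt to prompt.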
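
To compare prompts, the diversity metric has to be computed per poem. A minimal sketch of per-text Distinct-N follows; `distinct_n_per_text` is a hypothetical helper, and the sample strings are illustrative rather than taken from `Poem_Data.csv`.

```python
def distinct_n_per_text(texts, n=2):
    """Return one Distinct-N score (unique n-grams / total n-grams) per text."""
    scores = []
    for text in texts:
        tokens = str(text).split()
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        scores.append(len(set(grams)) / len(grams) if grams else 0.0)
    return scores


# Illustrative samples: the first repeats a phrase, the second does not.
sample_poems = [
    "the night is dark and the night is long",
    "stars drift slowly across a silent sky",
]
per_poem_distinct2 = distinct_n_per_text(sample_poems, n=2)
print(per_poem_distinct2)
```

With one score per poem, differences between prompts become visible instead of being averaged into a single corpus-level number.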

### Poetic Structure:

A Shakespearean sonnet is a 14-line poem that follows a consistent ABAB CDCD EFEF GG rhyme scheme. Each line is written in iambic pentameter and is 10 syllables long.
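
The rhyme-scheme requirement can be checked mechanically once each line has a rhyme label. The sketch below assumes the labels are already available as a string of letters (whitespace ignored) and tests whether they match the ABAB CDCD EFEF GG pattern regardless of which letters are used; both function names are hypothetical.

```python
SONNET_PATTERN = "ABABCDCDEFEFGG"


def canonical_scheme(labels):
    """Relabel rhyme classes in order of first appearance: A, B, C, ..."""
    mapping = {}
    relabeled = []
    for label in labels:
        if label not in mapping:
            mapping[label] = chr(ord("A") + len(mapping))
        relabeled.append(mapping[label])
    return "".join(relabeled)


def is_shakespearean(scheme):
    """True if the scheme has 14 lines arranged as ABAB CDCD EFEF GG."""
    letters = [ch for ch in scheme if not ch.isspace()]
    return canonical_scheme(letters) == SONNET_PATTERN


ideal = is_shakespearean("ABAB CDCD EFEF GG")
no_rhymes = is_shakespearean("ABCD\nEFGH\nIJKL\nMN")  # every line its own class
print(ideal, no_rhymes)
```

Comparing canonical forms means a correct scheme written with other letters, such as XYXY ..., is still accepted.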

#### Cleaning the sonnet and storing as a list

In [20]:
import string

def process_poem(poem):
    # Clean punctuation and split lines
    lines = [
        line.translate(str.maketrans('', '', string.punctuation))
        for line in poem.strip().split('\n') if line.strip()
    ]
    
    return poem, lines

# Process poem1
poem1, lines1 = process_poem(poem1)
poem2, lines2 = process_poem(poem2)
poem3, lines3 = process_poem(poem3)

#### Finding the Rhyming Scheme

In [57]:
import pronouncing

poem_list = [lines1, lines2, lines3]

def get_rhyming_scheme(lines):
    last_words = [line.split()[-1] for line in lines]
    rhyme_dict = {}
    scheme = []
    current_letter = 'A'

    for word in last_words:
        rhymes = pronouncing.rhymes(word)
        for key, letter in rhyme_dict.items():
            if word in rhymes or key in rhymes:
                scheme.append(letter)
                break
        else:
            rhyme_dict[word] = current_letter
            scheme.append(current_letter)
            current_letter = chr(ord(current_letter) + 1)

    rhyme_scheme = ''.join(scheme)
    return '\n'.join([rhyme_scheme[i:i+4] for i in range(0, len(rhyme_scheme), 4)])


# Process each poem and print rhyming schemes
for i, poem in enumerate(poem_list, 1):
    rhyme_scheme = get_rhyming_scheme(poem)
    print(f"Rhyming Scheme for Poem {i}: \n{rhyme_scheme}")


Rhyming Scheme for Poem 1: 
ABCD
EFGH
IJKL
MN
Rhyming Scheme for Poem 2: 
ABAC
DEFG
HIJK
LMNO
PQRA
PSTU
Rhyming Scheme for Poem 3: 
ABCD
EFGH
IJKD
LMNO
KKPQ
RSTT


##### Poem 1:
* Every line receives its own letter (ABCD, EFGH, IJKL, MN), meaning no end rhymes were detected. The poem has the expected 14 lines, but the closing pair (MN) does not rhyme, whereas a Shakespearean couplet would be GG.
* The absence of any repeated letter shows that the poem does not follow the ABAB CDCD EFEF GG scheme.

##### Poem 2:
* A few letters recur (A in ABAC and again in PQRA, P in PQRA and PSTU), but the matches are scattered rather than arranged in alternating pairs. The poem also runs to 24 lines instead of 14, so no sonnet-wide pattern is maintained.

##### Poem 3:
* The repeated letters D, K, and T show some rhyming, including the adjacent pairs KK and TT. At 24 lines and with no alternating pattern, however, the poem does not follow the sonnet form.

Overall, the rhyme detection finds no consistent scheme in any of the three poems, and two of them exceed the 14-line length of a sonnet.

In [69]:
import nltk
from nltk.corpus import cmudict
from collections import defaultdict
import re

nltk.download("cmudict")
cmu_dict = cmudict.dict()

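def count_syllables(word):
    """
    Count syllables for a given word using the CMU Pronouncing Dictionary.
    Vowel phonemes carry a stress digit (0, 1, or 2), so counting them gives the syllable count.
    When a word has several pronunciations, the one with the fewest syllables is used.
    Words missing from the dictionary fall back to 1 syllable, which can undercount long words.
    """
    word = word.lower()
    if word in cmu_dict:
        return min(len([phone for phone in pron if phone[-1].isdigit()]) for pron in cmu_dict[word])
    return 1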

def analyze_syllable_counts(poem):
    """
    Analyze the syllable count for each line in the poem.
    """
    syllable_counts = [sum(count_syllables(word) for word in re.findall(r'\w+', line)) for line in poem]
    return syllable_counts

all_syllable_counts = [analyze_syllable_counts(poem) for poem in poem_list]

for idx, syllable_count in enumerate(all_syllable_counts):
    print(f"Poem {idx + 1} syllable counts: {syllable_count}")


Poem 1 syllable counts: [14, 10, 10, 10, 10, 10, 10, 10, 10, 10, 9, 10, 10, 11]
Poem 2 syllable counts: [15, 11, 10, 9, 11, 10, 9, 11, 10, 10, 10, 9, 8, 10, 10, 11, 10, 9, 10, 10, 8, 9, 7, 9]
Poem 3 syllable counts: [8, 8, 10, 10, 10, 10, 9, 9, 9, 9, 10, 10, 8, 10, 9, 10, 9, 10, 10, 9, 9, 10, 9, 9]


##### Inconsistency in Meter:
* Many lines hit the 10-syllable target of iambic pentameter, but not consistently: 11 of 14 lines in Poem 1, 10 of 24 in Poem 2, and 11 of 24 in Poem 3 have exactly 10 syllables.
* Syllable counts alone do not check stress, so even the 10-syllable lines are not confirmed to be iambic. The irregular counts, together with the 24-line length of Poems 2 and 3, show that metrical constraints are not reliably met.
##### Poetic Creativity Over Structure:
* One reading is that the model favors free expression over strict form. The syllable counts only show that the form is not followed, though; they do not measure creativity.
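
The share of lines that meet the 10-syllable target can be computed directly from the syllable counts printed above. The sketch below copies those counts by hand; `ten_syllable_share` is a hypothetical helper name.

```python
# Syllable counts copied from the notebook output above.
syllable_counts = {
    "Poem 1": [14, 10, 10, 10, 10, 10, 10, 10, 10, 10, 9, 10, 10, 11],
    "Poem 2": [15, 11, 10, 9, 11, 10, 9, 11, 10, 10, 10, 9, 8, 10, 10, 11,
               10, 9, 10, 10, 8, 9, 7, 9],
    "Poem 3": [8, 8, 10, 10, 10, 10, 9, 9, 9, 9, 10, 10, 8, 10, 9, 10,
               9, 10, 10, 9, 9, 10, 9, 9],
}


def ten_syllable_share(counts, target=10):
    """Fraction of lines whose syllable count equals the target."""
    return sum(1 for count in counts if count == target) / len(counts)


shares = {name: ten_syllable_share(counts) for name, counts in syllable_counts.items()}
for name, counts in syllable_counts.items():
    print(f"{name}: {len(counts)} lines, {shares[name]:.0%} with exactly 10 syllables")
```

A line count other than 14 is itself a departure from the sonnet form, independent of the per-line syllable counts.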