# Poem Generation using gemma-2-2b-it

### Load Model and Generate Sonnets

In [133]:
import pandas as pd
import nltk
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.meteor_score import meteor_score
import torch
import collections
from collections import Counter
import math
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from nltk.util import ngrams
import pronouncing
from nltk.corpus import cmudict
from collections import defaultdict
import re

In [56]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b-it")


In [68]:
# Prompt for the poem
input_text = "Write a Shakespearean sonnet about courage set in a battlefield with a determined tone. Use vivid imagery to convey strength and resilience."
inputs = tokenizer(input_text, return_tensors="pt")

# Generate poem (do_sample=True is needed for temperature and top_p to take effect)
outputs = model.generate(**inputs, max_new_tokens=350, do_sample=True, temperature=0.7, top_p=0.9)
generated_poem1 = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Generated Poem 1:\n", generated_poem1)


Generated Poem 1:
 Write a Shakespearean sonnet about courage set in a battlefield with a determined tone. Use vivid imagery to convey strength and resilience.

Upon the field of strife, where blood doth stain,
And steel meets steel in clashing, deadly dance,
A heart of valor beats, unyielding, plain,
A warrior's soul, with fire in its trance.

Though fear may whisper, doubt may take its hold,
And shadows of despair may darken the mind,
Yet courage, like a beacon, brightly bold,
Will pierce the darkness, leave no soul behind.

With every step, a testament to might,
Each fallen foe, a victory to claim,
The spirit unbent, a beacon in the night,
A warrior's will, a burning, endless flame.

So let the battle rage, let blood run free,
For courage, like a river, flows eternally. 



In [69]:
# Prompt for the poem
input_text = "Write a Shakespearean sonnet about wonder set in outer space with an awe-filled tone. Use vivid imagery to convey mystery and discovery."
inputs = tokenizer(input_text, return_tensors="pt")

# Generate poem (do_sample=True is needed for temperature and top_p to take effect)
outputs = model.generate(**inputs, max_new_tokens=350, do_sample=True, temperature=0.7, top_p=0.9)
generated_poem2 = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Generated Poem 2:\n", generated_poem2)


Generated Poem 2:
 Write a Shakespearean sonnet about wonder set in outer space with an awe-filled tone. Use vivid imagery to convey mystery and discovery.

Upon the velvet canvas of the night,
A million diamonds, scattered, bright and bold,
The Milky Way, a river of pure light,
A story whispered, centuries old.
And in that vast expanse, where shadows lie,
A tapestry of stars, a cosmic dance,
Each twinkling point, a mystery to spy,
A silent symphony, a timeless trance.
The planets spin, in orbits grand and wide,
Their fiery breath, a cosmic fire's glow,
And in their depths, secrets yet to hide,
A universe of wonder, we all know.
So let us gaze, with hearts both pure and free,
And marvel at the majesty, eternally. 


**Answer:**

Upon the velvet canvas of the night,
A million diamonds, scattered, bright and bold,
The Milky Way, a river of pure light,
A story whispered, centuries old.
And in that vast expanse, where shadows lie,
A tapestry of stars, a cosmic dance,
Each twinkling point, 

In [70]:
# Prompt for the poem
input_text = "Write a Shakespearean sonnet about loss set in an empty home with a somber tone. Use vivid imagery to convey grief and reflection."
inputs = tokenizer(input_text, return_tensors="pt")

# Generate poem (do_sample=True is needed for temperature and top_p to take effect)
outputs = model.generate(**inputs, max_new_tokens=350, do_sample=True, temperature=0.7, top_p=0.9)
generated_poem3 = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Generated Poem 3:\n", generated_poem3)


Generated Poem 3:
 Write a Shakespearean sonnet about loss set in an empty home with a somber tone. Use vivid imagery to convey grief and reflection.

The dust motes dance in sunbeams, pale and thin,
Across the empty hall, a silent scene.
The scent of lavender, a memory's spin,
A phantom touch, a love that's never been.

The worn rug, once vibrant, now lies still,
A tapestry of faded, broken hues.
The fireplace, cold, a hollow, silent chill,
Where laughter once did fill the room with news.

The clock ticks slow, a mournful, steady beat,
Each second echoes the void within.
The house, a tomb, where love and life did meet,
Now echoes only with the wind's lament.

And in this silence, I find my own despair,
A heart that aches, a soul beyond repair. 



#### Extracting the necessary content (sonnet)

In [166]:
file_path = "Poem_Data.csv"
data = pd.read_csv(file_path)

In [168]:
poem1 = data.loc[0, 'Gemma 2 2B Q4']
poem2 = data.loc[1, 'Gemma 2 2B Q4']
poem3 = data.loc[2, 'Gemma 2 2B Q4']

### Metrics

#### BLEU (Bilingual Evaluation Understudy)
##### What it does: 
BLEU measures the precision of n-grams (sequences of n consecutive words) in the generated text when compared to a reference text. It’s widely used for machine translation but can also apply to poetry.
##### How we’re using it: 
While BLEU is typically associated with translation, we use it here to evaluate how well the n-grams of the generated poems align with reference poems. This gives us a basic sense of similarity between generated and target structures in terms of word patterns.
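
To make the n-gram precision at the core of BLEU concrete, the sketch below computes clipped unigram and bigram precision in plain Python. The helper names `ngram_counts` and `modified_precision` and the two example lines are illustrative assumptions, not part of the notebook. Full BLEU also combines several n-gram orders and applies a brevity penalty, which this sketch leaves out.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(reference, candidate, n):
    """Clipped n-gram precision: the share of candidate n-grams that also
    occur in the reference, counting each n-gram at most as often as the
    reference contains it."""
    ref_counts = ngram_counts(reference.split(), n)
    cand_counts = ngram_counts(candidate.split(), n)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total

# Hypothetical example lines
reference = "a heart of valor beats unyielding"
candidate = "a heart of courage beats on"

p1 = modified_precision(reference, candidate, 1)  # 4 of 6 unigrams match
p2 = modified_precision(reference, candidate, 2)  # 2 of 5 bigrams match
print(p1, p2)
```

Higher-order precisions drop quickly for free-form text such as poetry, which is one reason unsmoothed sentence-level BLEU tends to be close to zero when the generated and reference poems share few long phrases.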

#### METEOR (Metric for Evaluation of Translation with Explicit ORdering)
##### What it does: 
METEOR considers semantic and syntactic similarity by accounting for synonyms, stemming, and word order. This allows it to address some of the weaknesses of the BLEU metric.
##### How we’re using it: 
We use METEOR to assess how closely the generated poem matches the meaning and structure of a reference poem. This is especially helpful for understanding if the model maintains context or theme throughout the text.
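
As a rough illustration of why METEOR behaves differently from BLEU, the sketch below computes only the recall-weighted harmonic mean of unigram precision and recall that METEOR builds on. It is a simplification under stated assumptions: it uses exact word matches only, with no stemming or synonym matching, and it omits the fragmentation penalty for word order. The function name `unigram_fmean` and the example strings are hypothetical.

```python
from collections import Counter

def unigram_fmean(reference, candidate, alpha=0.9):
    """Harmonic mean of unigram precision and recall, weighted towards recall.
    With alpha=0.9 this equals 10PR / (R + 9P)."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    matches = sum((ref & cand).values())  # exact matches, clipped by count
    if matches == 0:
        return 0.0
    precision = matches / sum(cand.values())
    recall = matches / sum(ref.values())
    return precision * recall / (alpha * precision + (1 - alpha) * recall)

# Hypothetical example: every candidate word matches, but recall is only 3/7
score = unigram_fmean("the silent house echoes with the wind", "the house echoes")
print(round(score, 5))
```

Because recall dominates the mean, a short candidate that matches perfectly still scores well below 1 when it covers only part of the reference.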

#### Perplexity
##### What it does: 
Perplexity measures how well a language model can predict the next word in a sequence. A lower perplexity means the model has more confidence in its predictions, which can translate to better fluency.
##### How we’re using it: 
We use perplexity to gauge the fluency and coherence of the generated poems. It gives us an idea of how "natural" the poem sounds, even though it doesn’t capture all aspects of creativity.
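
The relationship between token probabilities and perplexity can be sketched without a model: perplexity is the exponential of the average negative log-probability assigned to each token. The function name `perplexity_from_probs` and the probability lists are illustrative assumptions.

```python
import math

def perplexity_from_probs(token_probs):
    """Perplexity = exp(mean negative log-probability of the observed tokens)."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A model that is always choosing among 4 equally likely tokens
uniform = perplexity_from_probs([0.25, 0.25, 0.25, 0.25])

# A model that is confident about every token (hypothetical probabilities)
confident = perplexity_from_probs([0.9, 0.8, 0.95])

print(uniform, confident)
```

Read this way, a perplexity between 3 and 4 means the model was, on average, about as uncertain as if it were picking among three or four equally likely tokens at each step.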

#### Entropy
##### What it does: 
Entropy measures the randomness or diversity of a text by quantifying how varied the token choices are in a given sequence. Higher entropy suggests the model is generating more diverse content, while lower entropy signals repetition or predictability.
##### How we’re using it: 
Entropy helps show how varied or repetitive the generated poetry is. Balance matters: moderate entropy indicates diversity, while very low or very high entropy may signal overly repetitive or chaotic output. Note that `calculate_entropy` below works at the character level, so it measures the spread of characters and not of words. We use this metric to compare different models.
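
The choice of unit matters for entropy. The sketch below applies the same Shannon formula to characters and to whitespace-separated words; `shannon_entropy` is a hypothetical helper, and the example line is taken from Poem 1.

```python
import math
from collections import Counter

def shannon_entropy(symbols):
    """Shannon entropy in bits of the empirical distribution over symbols."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

line = "A warrior's will, a burning, endless flame."

char_entropy = shannon_entropy(list(line))          # characters as symbols
word_entropy = shannon_entropy(line.lower().split())  # words as symbols

print(round(char_entropy, 5), round(word_entropy, 5))
```

Character-level entropy is bounded by the size of the alphabet in use, so it changes little between poems; word-level entropy is more sensitive to repeated words.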

In [145]:
# Evaluation Functions
def bleu_score(reference, candidate):
    """
    Calculate the BLEU score between the reference and candidate texts.
    Measures the overlap of n-grams between the generated text and the reference.
    Returns the score rounded to 5 decimal places.
    """
    return round(sentence_bleu([reference.split()], candidate.split()), 5)

def meteor_score_func(reference, candidate):
    """
    Calculate the METEOR score between the reference and candidate texts.
    METEOR measures both exact matches and semantically similar matches 
    (stemming or synonyms) and penalizes overly short outputs.
    Tokenizes both inputs before computing the score.
    Returns the score rounded to 5 decimal places.
    """
    # Tokenize both the reference and candidate
    reference_tokens = reference.split()  # Tokenize the reference poem
    candidate_tokens = candidate.split()  # Tokenize the generated poem
    return round(meteor_score([reference_tokens], candidate_tokens), 5)

def calculate_perplexity(text):
    """
    Calculate the perplexity of a text using the provided tokenizer and model.
    Perplexity is a measure of how well the language model predicts the text.
    Lower perplexity indicates higher fluency and alignment with the model's training data.
    - Uses the tokenizer to convert the text into input tensors.
    - Performs inference using the model with no gradient calculation (for efficiency).
    - Computes the loss from the model's output and converts it to perplexity.
    Returns the perplexity value rounded to 5 decimal places.
    """
    inputs = tokenizer(text, return_tensors="pt")  # Tokenize text into tensors
    with torch.no_grad():  # Disable gradient computation for efficiency
        outputs = model(**inputs, labels=inputs["input_ids"])  # Compute loss
        loss = outputs.loss  # Extract loss
        return round(torch.exp(loss).item(), 5)  # Convert loss to perplexity

def calculate_entropy(text):
    """
    Calculate the Shannon entropy (in bits) of a text's character distribution.
    Entropy quantifies how unpredictable the next character is.
    - Splits the text into individual characters (including spaces and punctuation).
    - Counts the occurrences of each character.
    - Converts the counts to probabilities and computes -sum(p * log2(p)).
    Returns the entropy value rounded to 5 decimal places.
    """
    tokens = list(text)  # Character-level tokens; use text.split() for word-level entropy
    token_counts = Counter(tokens)  # Count occurrences of each token
    total_tokens = len(tokens)  # Total number of tokens
    probabilities = [count / total_tokens for count in token_counts.values()]  # Compute probabilities
    entropy = -sum(p * math.log2(p) for p in probabilities)  # Compute entropy
    return round(entropy, 5) 


In [147]:
model_name_to_evaluate = "Gemma 2 2B Q4" 

In [149]:
# Evaluate Scores
results = []

for index, row in data.iterrows():
    prompt = row["PromptID"]
    reference_poem = row["ReferencePoem"]
    generated_poem = row[model_name_to_evaluate]
    
    # Calculate scores
    #bleu = bleu_score(reference_poem, generated_poem)
    meteor = meteor_score_func(reference_poem, generated_poem)
    perplexity = calculate_perplexity(generated_poem)
    entropy = calculate_entropy(generated_poem)
    
    # Append to results
    results.append({
        "Prompt": prompt,
        #"BLEU": bleu,
        "METEOR": meteor,
        "Perplexity": perplexity,
        "Entropy": entropy
    })

results_df = pd.DataFrame(results)
average_entropy = results_df["Entropy"].mean()
evaluation_title = f"Evaluation Results for Model: {model_name_to_evaluate}"
print(evaluation_title)
print(results_df)
print(f"\nAverage Entropy: {average_entropy:.4f}")

Evaluation Results for Model: Gemma 2 2B Q4
   Prompt   METEOR  Perplexity  Entropy
0       1  0.08862     3.12202  4.40724
1       2  0.14262     3.64469  4.43066
2       3  0.06653     3.94740  4.38255

Average Entropy: 4.4068


##### METEOR Scores:
* The METEOR scores (0.08862, 0.14262, 0.06653) are relatively low, suggesting that the generated poems exhibit limited lexical and semantic overlap with the reference poems.
##### Perplexity:
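* Perplexity scores (3.12202, 3.64469, 3.94740) are low, meaning the scoring model assigns high probability to each next token. Assuming `model` still refers to the Gemma checkpoint loaded earlier, the poems are scored by essentially the same model family that produced them, so low values are expected and say more about self-consistency than about fluency as judged by an independent model.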
##### Entropy:
* Values range from 4.38 to 4.43 bits, with an average of 4.4068. Because `calculate_entropy` works at the character level, these values mainly reflect the character distribution of English text, which is usually a little above 4 bits per character, so they say little about word-level diversity or repetition.

### Diversity and Variety

In the context of poem generation, diversity and variety refer to the richness, uniqueness, and creativity of the generated text, which are critical for producing compelling, engaging, and original poetry.

* Self-BLEU: Measures how much n-gram overlap exists between different samples generated by the same model. Lower overlap means more diversity.
* Distinct n-grams: Measures the ratio of unique n-grams to total n-grams in the generated text. A higher ratio means more diversity.
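
A small worked example shows how the distinct n-gram ratio separates varied from repetitive text. The function name `distinct_ratio` and the example strings are illustrative assumptions.

```python
def distinct_ratio(texts, n):
    """Unique n-grams divided by total n-grams, pooled over all texts."""
    seen = set()
    total = 0
    for text in texts:
        tokens = text.split()
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        seen.update(grams)
        total += len(grams)
    return len(seen) / total if total else 0.0

varied = ["the stars burn bright", "a river of pure light"]
repetitive = ["the night the night the night"]

varied_score = distinct_ratio(varied, 2)          # 7 unique of 7 bigrams
repetitive_score = distinct_ratio(repetitive, 2)  # 2 unique of 5 bigrams
print(varied_score, repetitive_score)
```

For short texts such as three sonnets, bigrams rarely repeat, so a ratio close to 1 is easy to reach and is best read relative to other models scored on the same prompts.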

In [155]:
from nltk.util import ngrams
import collections

# Self-BLEU Score
def self_bleu(generated_texts, n=1):
    """
    Calculates a simplified, Self-BLEU-style overlap score for a list of generated texts.
    For each text, the clipped n-gram overlap with all other texts is divided by the
    total number of n-grams in those other texts, and the results are averaged.
    Unlike standard Self-BLEU, there is no brevity penalty and only one n-gram order
    is used. Higher values indicate more repetition across outputs.

    Args:
        generated_texts (list of str): A list of generated text strings to evaluate.
        n (int): The n-gram size to use for computation. Default is 1 (unigrams).

    Returns:
        float: The average overlap score across all generated texts.
    """
    all_ngrams = [] 
    for text in generated_texts:
        if not isinstance(text, str):
            text = str(text) 
        tokens = text.split() 
        ngrams_list = list(ngrams(tokens, n)) 
        all_ngrams.append(collections.Counter(ngrams_list))
    
    # Initialize a score to compute Self-BLEU
    score = 0
    for i, counter in enumerate(all_ngrams):
        # Exclude the current text and focus only on other generated outputs
        others = all_ngrams[:i] + all_ngrams[i+1:]
        union_ngrams = collections.Counter() 
        for other in others:
            union_ngrams.update(other)

        # Avoid division by zero by checking if n-grams are available
        if sum(union_ngrams.values()) > 0:

            overlap = sum(min(count, union_ngrams[gram]) for gram, count in counter.items())
            score += overlap / sum(union_ngrams.values())
    
    return score / len(generated_texts) if len(generated_texts) > 1 else 0


# Distinct-N Score
def distinct_n(generated_texts, n=2):
    """
    Calculates the Distinct-N score for a list of generated texts.
    Distinct-N measures diversity by computing the ratio of unique n-grams
    over the total number of n-grams across all generated texts.

    Args:
        generated_texts (list of str): A list of generated text strings to evaluate.
        n (int): The n-gram size to compute for Distinct-N. Default is 2 (bigrams).

    Returns:
        float: The Distinct-N score indicating diversity of generated texts.
    """
    ngrams_set = set()  
    total_ngrams = 0 

    for text in generated_texts:
        if not isinstance(text, str):
            text = str(text)
        tokens = text.split()
        ngrams_list = list(ngrams(tokens, n))  
        ngrams_set.update(ngrams_list)
        total_ngrams += len(ngrams_list)

    return len(ngrams_set) / total_ngrams if total_ngrams > 0 else 0

In [157]:
# Function to evaluate Self-BLEU and Distinct-N for a model
def evaluate_model_metrics(data, model_name_to_evaluate):
    results = []

    # Collect all poems generated by the same model across different prompts
    generated_poems = data[model_name_to_evaluate].tolist()
    self_bleu_score = self_bleu(generated_poems, n=1)  # For unigrams (Self-BLEU n=1)
    distinct_2_score = distinct_n(generated_poems, n=2)  # For distinct-2 (bigrams)
    
    for index, row in data.iterrows():
        prompt = row["PromptID"]
        
        results.append({
            "Model": model_name_to_evaluate,
            "Prompt": prompt,
            "Self-BLEU (n=1)": round(self_bleu_score, 5),
            "Distinct-2": round(distinct_2_score, 5)
        })

    results_df = pd.DataFrame(results)
    
    evaluation_title = f"Evaluation Results for Model: {model_name_to_evaluate}"
    
    print(evaluation_title)
    print(results_df)

evaluate_model_metrics(data, model_name_to_evaluate)

Evaluation Results for Model: Gemma 2 2B Q4
           Model  Prompt  Self-BLEU (n=1)  Distinct-2
0  Gemma 2 2B Q4       1           0.1878     0.95886
1  Gemma 2 2B Q4       2           0.1878     0.95886
2  Gemma 2 2B Q4       3           0.1878     0.95886


##### Self-BLEU (n=1):
* The Self-BLEU score of 0.1878 indicates low overlap in unigrams across the texts generated by the model.
* This suggests a relatively high level of diversity in the content of the generated poems.
##### Distinct-2 Score:
* The Distinct-2 score of 0.95886 reflects a high proportion of unique bigrams in the generated poems.
* This demonstrates that the model maintains substantial diversity in the construction of phrases and avoids repetitive patterns.
##### Consistency Across Prompts:
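* The scores are identical for all three prompts because `evaluate_model_metrics` computes Self-BLEU and Distinct-2 once over the full set of poems and copies the result into every row.
* They therefore describe the model's outputs as a group and do not show how diversity varies from prompt to prompt.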
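
Since `evaluate_model_metrics` reports one corpus-level value per metric, a per-poem view needs a leave-one-out calculation. The sketch below scores each text by the share of its own words that also appear in the other texts. This is a simplified unigram variant for illustration, not the notebook's `self_bleu`, and the function name and example texts are hypothetical.

```python
from collections import Counter

def leave_one_out_overlap(texts):
    """For each text, the fraction of its unigrams (clipped by count)
    that also occur in the remaining texts."""
    counters = [Counter(text.lower().split()) for text in texts]
    scores = []
    for i, current in enumerate(counters):
        others = Counter()
        for j, other in enumerate(counters):
            if j != i:
                others.update(other)
        total = sum(current.values())
        clipped = sum(min(count, others[word]) for word, count in current.items())
        scores.append(clipped / total if total else 0.0)
    return scores

poems = ["the night is dark", "the night is bright", "stars burn above"]
scores = leave_one_out_overlap(poems)
print(scores)
```

Here the first two texts share three of their four words, while the third shares none, which a single pooled score would hide.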

### Poetic Structure:

A Shakespearean sonnet is a 14-line poem that follows a consistent rhyme scheme of ABAB CDCD EFEF GG. Each line is written in iambic pentameter and is therefore 10 syllables long.

#### Cleaning the sonnet and storing as a list

In [159]:
import string

def process_poem(poem):
    lines = [
        line.translate(str.maketrans('', '', string.punctuation))
        for line in poem.strip().split('\n') if line.strip()
    ]
    
    return poem, lines

poem1, lines1 = process_poem(poem1)
poem2, lines2 = process_poem(poem2)
poem3, lines3 = process_poem(poem3)

#### Finding the Rhyming Scheme

In [161]:
import pronouncing

poem_list = [lines1, lines2, lines3]

def get_rhyming_scheme(lines):
    last_words = [line.split()[-1] for line in lines]
    rhyme_dict = {}
    scheme = []
    current_letter = 'A'

    for word in last_words:
        rhymes = pronouncing.rhymes(word)
        for key, letter in rhyme_dict.items():
            if word in rhymes or key in rhymes:
                scheme.append(letter)
                break
        else:
            rhyme_dict[word] = current_letter
            scheme.append(current_letter)
            current_letter = chr(ord(current_letter) + 1)

    rhyme_scheme = ''.join(scheme)
    return '\n'.join([rhyme_scheme[i:i+4] for i in range(0, len(rhyme_scheme), 4)])


# Process each poem and print rhyming schemes
for i, poem in enumerate(poem_list, 1):
    rhyme_scheme = get_rhyming_scheme(poem)
    print(f"Rhyming Scheme for Poem {i}: \n{rhyme_scheme}")


Rhyming Scheme for Poem 1: 
ABAB
CDCD
EFEF
GH
Rhyming Scheme for Poem 2: 
ABAB
CDCD
EFEF
GH
Rhyming Scheme for Poem 3: 
ABAA
CDCD
EAEF
GG


##### Poem 1 and Poem 2: (ABAB CDCD EFEF GH)
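* Both deviate in the final rhyming couplet, ending with GH instead of the required GG. In the poems shown above, both couplets pair "free" with "eternally". `pronouncing.rhymes` compares words from their last stressed vowel onward, so this pairing is likely not counted as a rhyme even though the final sounds agree.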
##### Poem 3: (ABAA CDCD EAEF GG)
* Deviates significantly by breaking the rhyme pattern in the first quatrain (ABAA) and altering the third quatrain (EAEF). However, it maintains the correct final rhyming couplet (GG).

These deviations suggest that while the poems attempt structural similarity to a Shakespearean sonnet, they fail to fully conform to its rhyme scheme.
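
The comparison against the target scheme can also be made mechanically. The sketch below checks which line pairs that must rhyme in ABAB CDCD EFEF GG carry different labels in a detected scheme. `REQUIRED_PAIRS` and `broken_rhyme_pairs` are hypothetical names, and the check covers only the required rhymes, not unwanted extra rhymes between quatrains.

```python
# 1-based line pairs that must rhyme in ABAB CDCD EFEF GG
REQUIRED_PAIRS = [(1, 3), (2, 4), (5, 7), (6, 8), (9, 11), (10, 12), (13, 14)]

def broken_rhyme_pairs(scheme):
    """Return the required line pairs whose rhyme labels differ."""
    labels = "".join(scheme.split())  # drop spaces and newlines
    if len(labels) != 14:
        raise ValueError("expected 14 rhyme labels")
    return [(a, b) for a, b in REQUIRED_PAIRS if labels[a - 1] != labels[b - 1]]

# Schemes as printed by get_rhyming_scheme above
poem1_breaks = broken_rhyme_pairs("ABAB CDCD EFEF GH")
poem3_breaks = broken_rhyme_pairs("ABAA CDCD EAEF GG")
print(poem1_breaks, poem3_breaks)
```

For Poem 1 this flags only the closing couplet, while for Poem 3 it flags Lines 2 and 4 and Lines 10 and 12.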

#### Finding the Syllable Counts

In [121]:
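# Requires the CMUdict corpus, e.g. via nltk.download('cmudict')
cmu_dict = cmudict.dict()

def count_syllables(word):
    """
    Count syllables for a given word using the CMU Pronouncing Dictionary.
    Vowel phonemes carry a stress digit (0, 1, or 2), so counting them gives
    the syllable count. If a word has several pronunciations, the shortest is used.
    Words missing from the dictionary are assumed to have one syllable.
    """
    word = word.lower()
    if word in cmu_dict:
        return min(len([phone for phone in pron if phone[-1].isdigit()]) for pron in cmu_dict[word])
    return 1  # Fallback for out-of-vocabulary words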

def analyze_syllable_counts(poem):
    """
    Analyze the syllable count for each line in the poem.
    """
    syllable_counts = [sum(count_syllables(word) for word in re.findall(r'\w+', line)) for line in poem]
    return syllable_counts

all_syllable_counts = [analyze_syllable_counts(poem) for poem in poem_list]

for idx, syllable_count in enumerate(all_syllable_counts):
    print(f"Poem {idx + 1} syllable counts: {syllable_count}")

Poem 1 syllable counts: [10, 10, 10, 9, 10, 11, 10, 10, 10, 9, 10, 10, 10, 12]
Poem 2 syllable counts: [10, 10, 10, 9, 10, 10, 10, 10, 10, 10, 9, 10, 10, 12]
Poem 3 syllable counts: [9, 10, 9, 10, 9, 10, 11, 10, 10, 9, 10, 10, 11, 10]


##### Poem 1:
* The majority of the lines adhere to the required 10-syllable structure of a Shakespearean sonnet.
* However, there are deviations: Lines 4 (9 syllables), 6 (11 syllables), 10 (9 syllables), and 14 (12 syllables).
##### Poem 2:
* This poem also follows the 10-syllable requirement for most lines.
* Deviations occur in Lines 4 (9 syllables), 11 (9 syllables), and 14 (12 syllables).
##### Poem 3:
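* The syllable counts are less consistent than in Poem 1 and Poem 2.
* Deviations occur in Lines 1 (9 syllables), 3 (9 syllables), 5 (9 syllables), 7 (11 syllables), 10 (9 syllables), and 13 (11 syllables).
* Some of these counts are likely artifacts of the method: `process_poem` strips apostrophes, so "memory's" in Line 3 becomes "memorys", which is probably missing from CMUdict and falls back to one syllable.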
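
The deviations listed above can be derived directly from the printed counts. The sketch below does this for all three poems; the helper name `off_target_lines` is an assumption, and the counts are copied from the output above.

```python
def off_target_lines(counts, target=10):
    """Return (line_number, syllables) for every line that misses the target."""
    return [(i, c) for i, c in enumerate(counts, start=1) if c != target]

syllable_counts = {
    "Poem 1": [10, 10, 10, 9, 10, 11, 10, 10, 10, 9, 10, 10, 10, 12],
    "Poem 2": [10, 10, 10, 9, 10, 10, 10, 10, 10, 10, 9, 10, 10, 12],
    "Poem 3": [9, 10, 9, 10, 9, 10, 11, 10, 10, 9, 10, 10, 11, 10],
}

deviations = {name: off_target_lines(counts) for name, counts in syllable_counts.items()}
for name, lines in deviations.items():
    print(f"{name}: {len(lines)} of 14 lines off target -> {lines}")
```

A syllable count of 10 is necessary but not sufficient for iambic pentameter, since the stress pattern is not checked here.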