# Story Generator - Winnie the Pooh 

## Part 3: Integrating Readability Metrics into DSPy Module

[1. Imports and environment](#1-imports-and-environment)

[2. DSPy set up](#2-dspy-set-up)

[3. Evaluation metrics](#3-evaluation-metrics)

[4. Gradio UI](#4-gradio-ui)

### 1. Imports and environment

In [None]:
#pip install dspy-ai openai chromadb sentence_transformers spacy textstat deepeval gradio python-dotenv

In [2]:
import os
os.environ['DEEPEVAL_TELEMETRY_OPT_OUT'] = "YES"


import dspy
from dspy.retrieve.chromadb_rm import ChromadbRM
import chromadb
from chromadb.utils import embedding_functions
import dotenv

from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

from evaluation_metrics import *


# Establish paths
CHROMA_PATH = '../data/chroma_db'
DB_COLLECTION = "winnie_the_pooh"
default_ef = embedding_functions.DefaultEmbeddingFunction()

# Set up OpenAI API key
dotenv.load_dotenv()
#openai_key = os.getenv('OPENAI_API_KEY')

True

### 2. DSPy set up

The setup below is carried over from the previous notebook.

In [3]:
# Configure OpenAI as the language model
llm = dspy.OpenAI(model="gpt-4o-mini", max_tokens=1000, temperature=1.0)

# Set up Chroma client and retriever
chroma_client = chromadb.PersistentClient(path=CHROMA_PATH)
collection = chroma_client.get_collection(DB_COLLECTION)

# Set up ChromadbRM as the retriever model
chroma_retriever = ChromadbRM(
    collection_name=DB_COLLECTION, 
    persist_directory=CHROMA_PATH, 
    embedding_function=default_ef,
    )

# Configure DSPy settings
dspy.settings.configure(lm=llm, rm=chroma_retriever)

In [4]:

class GenerateStory(dspy.Signature):
    """Generate a Winnie the Pooh style story."""
    name = dspy.InputField()
    prompt = dspy.InputField(desc="details to include in the story.")
    context = dspy.InputField(desc="relevant passages from Winnie the Pooh stories and story structure.")
    story = dspy.OutputField(desc="generate a one-minute story for a child. Name is the main character who is friends with Pooh, and finish the story with 'The End.'")


class StoryGenerator(dspy.Module):
    def __init__(self, chroma_retriever):
        super().__init__()
        self.retriever = chroma_retriever
        self.generate = dspy.ChainOfThought(GenerateStory)

    def forward(self, name, prompt):
        retrieved = self.retriever(prompt, k=8)
        retrieved_context = [doc.long_text for doc in retrieved]
        context = "\n".join(retrieved_context)
        
        result = self.generate(context=context, prompt=prompt, name=name)
        return dspy.Prediction(story=result.story)

story_gen = StoryGenerator(chroma_retriever)

In [6]:
name= 'Lucy'
prompt = "They go into the woods and have a picnic with friends."


new_story = story_gen(name, prompt)
print(new_story.story)

On a sunny day in the Hundred Acre Wood, Lucy, a cheerful little girl with swirling curls, decided it was the perfect day for a picnic. So she hurried off to find her friend Pooh, who was always up for a tasty treat. 

“Pooh!” called Lucy, skipping along the path. “Let’s have a picnic by the big oak tree, where the honeybees buzz!”

“Oh, that sounds delightful!” replied Pooh, his tummy rumbling in agreement. “I’ll bring some honey, of course!”

So, hand in hand, Lucy and Pooh ventured down the winding paths of the wood, singing a happy tune. Along the way, they met Piglet, who was busy picking daisies. 

“Would you like to join us for a picnic, Piglet?” Lucy asked with a smile.

“Oh yes!" squeaked Piglet, his eyes sparkling. "I can bring acorn cookies!”

As they continued, they bumped into Rabbit, who was hopping near his garden. 

“Hello, Rabbit!” called Lucy. “We’re having a picnic! Would you like to come?”

“Oh, indeed!” said Rabbit, his ears perking up. “I’ll bring some fresh carro

### 3. Evaluation metrics

#### DeepEval - Answer Relevancy

Testing one of DeepEval's built-in metrics on a generated story. 

In [7]:
actual_output = new_story.story

# Initialize the AnswerRelevancyMetric
metric_relevancy = AnswerRelevancyMetric(
    threshold=0.7,
    model="gpt-4o-mini",
    include_reason=True
)

test_case = LLMTestCase(
    input=prompt,
    actual_output=actual_output
)

# Calculate the relevancy score
metric_relevancy.measure(test_case)
print(metric_relevancy.score)
print(metric_relevancy.reason)

0.9354838709677419
The score is 0.94 because while the output provided valuable context about the picnic, it included some irrelevant statements that didn't enhance the understanding of the event, such as 'Pooh!' and 'Hello, Rabbit!'. These detracted slightly from the overall relevance, but the main focus on the picnic and friends kept the score high.


DeepEval's built-in AnswerRelevancyMetric does not seem appropriate for this task. A fictional story inevitably contains text, such as dialogue, that the metric judges "irrelevant" to the prompt. I will instead define a metric that better assesses the appropriateness of the output by measuring the readability of the generated story.
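For context on where a score like the one above comes from: DeepEval documents answer relevancy as the number of statements judged relevant divided by the total number of statements extracted from the output. The sketch below only illustrates that ratio. The counts 29 and 31 are an assumption, not values reported by the notebook; they are simply one pair of counts consistent with the printed score.

```python
def answer_relevancy(relevant_statements: int, total_statements: int) -> float:
    """Ratio of relevant statements to all statements extracted from the output."""
    if total_statements <= 0:
        raise ValueError("total_statements must be positive")
    return relevant_statements / total_statements


# Assumed counts (not taken from the notebook): 2 of 31 statements flagged irrelevant.
score = answer_relevancy(relevant_statements=29, total_statements=31)
print(round(score, 4))  # 0.9355
```

Under this reading, short lines of dialogue that do not mention the picnic lower the score even though they are normal in a story, which is why the metric is a poor fit here.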

#### Readability Score (Flesch-Kincaid Grade)

As previously discussed, the Flesch-Kincaid grade level calculation is a readability metric that indicates the reading level of a text, based on word and sentence length. 


**Flesch–Kincaid grade level** 

<img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/8e68f5fc959d052d1123b85758065afecc4150c3"
     alt="Flesch-Kincaid grade formula"
     style="margin-left: 45px;" 
     />
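The formula in the image is 0.39 × (total words / total sentences) + 11.8 × (total syllables / total words) − 15.59. The sketch below shows how those three counts feed into the grade. It is a minimal illustration, not the implementation used in `evaluation_metrics.py`: the vowel-group syllable counter and both function names are assumptions made here, so the scores will differ somewhat from those produced by a library such as `textstat`.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate: count groups of consecutive vowels."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing 'e' as silent ("made"), except in "-le" endings ("little").
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level from word, sentence, and syllable counts."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


simple = "Pooh sat on a log. He ate his honey."
harder = "Christopher Robin deliberately organised an extraordinary expedition."
print(flesch_kincaid_grade(simple) < flesch_kincaid_grade(harder))
```

Shorter sentences and shorter words both pull the grade down, which is what the retry prompt later in this notebook relies on.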


The actual Winnie the Pooh stories have an average readability score of 3.8, with a standard deviation of 0.8. I would like the generated stories to score no more than one standard deviation above the mean, and therefore less than 4.6 (3.8 + 0.8).

In [8]:
# Readability metric is predefined in and imported from evaluation_metrics.py.
# It is defined as a new DeepEval class with high and low thresholds.

test_case = LLMTestCase(input=prompt, actual_output=new_story.story)
metric = ReadabilityMetric(threshold_high=3.0, threshold_low=2.0)

result = metric.measure(test_case)

print("Readability acceptable:", result)
print(f"Readability score: {calculate_readability_scores(new_story.story)['Flesch-Kincaid Grade']}")

Readability acceptable: False
Readability score: 4.2


In the above example, the acceptable range was set to 2.0-3.0, and the story failed the test because its score of 4.2 is above the upper threshold.
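The pass/fail logic can be pictured as a simple band check. The class below is a hypothetical stand-in written for illustration; it is not the `ReadabilityMetric` from `evaluation_metrics.py`, which is not shown in this notebook. It assumes inclusive bounds and an optional lower threshold.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReadabilityBand:
    """Hypothetical band check; the real ReadabilityMetric may differ."""
    threshold_high: float
    threshold_low: Optional[float] = None  # None means no lower bound

    def accepts(self, grade: float) -> bool:
        """Return True when the grade lies inside the band (bounds assumed inclusive)."""
        if grade > self.threshold_high:
            return False
        if self.threshold_low is not None and grade < self.threshold_low:
            return False
        return True


# The 2.0-3.0 band from the example above rejects a grade of 4.2 ...
print(ReadabilityBand(threshold_high=3.0, threshold_low=2.0).accepts(4.2))  # False

# ... while an upper bound of mean + one standard deviation accepts it.
upper = round(3.8 + 0.8, 1)
print(ReadabilityBand(threshold_high=upper).accepts(4.2))  # True
```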

In [9]:
# define function to evaluate readability of a story

def evaluate_readability(name, prompt):
    """
    Generate a story and evaluate its readability.

    input: name, prompt
    output: readability pass (bool), generated story
    """

    new_story = story_gen(name,prompt)
    actual_output = new_story.story

    # Initialize the ReadabilityMetric
    metric = ReadabilityMetric(
        threshold_high=4.6,
        #threshold_low=0.0
    )

    test_case = LLMTestCase(
        input= prompt,
        actual_output=actual_output
    )

    return metric.measure(test_case), actual_output


# function to adjust prompt/story if it fails readability metric the first time.

def print_story(name, prompt):
    """ 
    Return the story if it passes the readability metric. If it fails, retry once
    with a prompt asking for simpler words and sentences. If the retry also fails,
    return an apology asking for a different prompt.

    input: name, prompt
    output: story, or an apology message
    """
    results = evaluate_readability(name, prompt)

    pass_metric = results[0]

    if pass_metric:
        return results[1]

    else:
        new_prompt = prompt + " Write the story using simple words and short sentences."

        new_results = evaluate_readability(name, new_prompt)

        if new_results[0]:
            #print("Second Try: \n", new_results[1]) 
            return new_results[1]

        else:
            return "I'm sorry, I was not able to write you a story. Try a different prompt."
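
The retry logic in `print_story` is a fixed two-attempt pattern. A possible generalisation, sketched below, takes the number of attempts as a parameter. Everything here is hypothetical: `generate_with_retries`, its parameters, and the stub generator are illustrative names, and the stub stands in for `evaluate_readability` so the example runs without calling a language model.

```python
from typing import Callable, Optional, Tuple


def generate_with_retries(
    generate: Callable[[str], Tuple[bool, str]],
    prompt: str,
    hint: str = " Write the story using simple words and short sentences.",
    max_attempts: int = 2,
) -> Optional[str]:
    """Call generate(prompt) until it reports a pass or attempts run out.

    generate returns (passed, story). After the first failure, the hint is
    appended to the original prompt for every later attempt.
    """
    current_prompt = prompt
    for _ in range(max_attempts):
        passed, story = generate(current_prompt)
        if passed:
            return story
        current_prompt = prompt + hint
    return None  # the caller decides which fallback message to show


def stub_generate(prompt: str) -> Tuple[bool, str]:
    """Stand-in generator: passes only when the simplification hint is present."""
    if "simple words" in prompt:
        return True, "Pooh and Noah climb a tree. The End."
    return False, "An unnecessarily elaborate narrative. The End."


story = generate_with_retries(stub_generate, "They climb a tree.")
print(story or "I'm sorry, I was not able to write you a story. Try a different prompt.")
```

Returning `None` instead of the apology string keeps the failure case distinguishable from a real story, at the cost of one extra line in the caller.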


In [10]:
test_name = 'Noah'
test_prompt = "They go on an adventure and climb a tree."

print_story(test_name, test_prompt)

"One sunny morning in the Hundred Acre Wood, Noah was feeling quite adventurous. He had woken up with the thought of climbing the tallest tree and seeing the world from up high. As he wandered through the forest, humming a little tune, he bumped into Pooh.\n\n“Hello, Noah!” Pooh said, his eyes bright with curiosity. “What are you up to today?”\n\n“I’m going to climb that tall tree over there, Pooh! Would you like to come with me?” Noah replied, pointing to a great, leafy tree that stretched towards the sky.\n\n“Oh, I do love trees,” said Pooh, “especially the ones that might have honey at the top! Let’s go!”\n\nSo off they went together, giggling and chatting. When they reached the foot of the tree, they looked up at the branches that seemed to tickle the clouds.\n\n“Do you think we can really climb it?” Noah asked.\n\n“Of course! Just think of all the adventures waiting for us up there!” Pooh encouraged, patting Noah's back.\n\nTaking a deep breath, Noah began to climb, with Pooh foll

### 4. Gradio UI

In [11]:
import gradio as gr
from theme_violet_amber import theme as violet_amber


# Gradio UI
with gr.Blocks(theme=violet_amber) as demo:
    gr.Markdown(
    """
    # Winnie the Pooh Story Generator
    *Simply enter a character name and a story idea, then I will write a story from the Hundred Acre Wood for you!*
    """)
    textbox = gr.Textbox(label="Who is the story about?")
    textbox2 = gr.Textbox(label="What do you want the story to be about?")

    with gr.Row():
        button = gr.Button("Submit", variant="primary")
        clear = gr.ClearButton([textbox, textbox2]) 

    output = gr.Textbox(label="A story for you... ")

    button.click(print_story, [textbox, textbox2], output)
    clear.click(lambda: None, outputs = output)

demo.launch()


* Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.




In [12]:
demo.close()

Closing server running on port: 7860
