# 3. Story generation with dynamic group chat


In [1]:
%%capture --no-stderr
# %pip install "pyautogen>=0.2.26"

In [2]:
%%capture --no-stderr
# %pip install "deepeval>=0.21.33"

In [3]:
# disable warnings to silence the deepeval ipywidgets check
import warnings
warnings.filterwarnings('ignore')

## Execution parameters

In [4]:
# define the start message (the request submitted to the AutoGen group chat)
start_message = """
    Create for me a story for a ten panels sci-fi comic, the story must have at most 5 characters.
    Your response must contain only the story and no other text
"""

In [1]:
# set the API keys used by deepeval (autogen reads its configuration from the OAI_CONFIG_LIST file)
import os
os.environ["OPENAI_API_KEY"] = "<your_api_key>"
os.environ["COHERE_API_KEY"] = "<your_api_key>"

In [6]:
#set groupchat max_round
max_round = 20

In [7]:
#set the seed
seed = 42

In [8]:
#select which llm models you want to use for comic generation
enabled_models = [
    "gpt-3.5-turbo",
    "gpt-4",
    "command-nightly",
    "command-r",
]  

In [9]:
#select which llm models you want to use for output evaluation
enabled_evaluation_models = [
    "gpt-3.5-turbo",
    "gpt-4",
    "command-nightly",
    "command-r",
]

## Set your API Endpoint

The [`config_list_from_json`](https://microsoft.github.io/autogen/docs/reference/oai/openai_utils#config_list_from_json) function loads a list of configurations from an environment variable or a JSON file. Here each model gets its own list, selected from `OAI_CONFIG_LIST` with `filter_dict`.
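
For orientation, `OAI_CONFIG_LIST` is a JSON array of per-model entries, and a `filter_dict` keeps only the entries whose fields match one of the listed values. The sketch below mimics that filtering in plain Python so it runs without AutoGen. `filter_configs` is a hypothetical helper written for illustration, not AutoGen's implementation, and the API keys are placeholders.

```python
import json

# Example content of an OAI_CONFIG_LIST file (placeholder keys, illustrative only).
oai_config_list = json.loads("""
[
    {"model": "gpt-4", "api_key": "<your_api_key>"},
    {"model": "gpt-3.5-turbo", "api_key": "<your_api_key>"},
    {"model": "command-r", "api_key": "<your_api_key>"}
]
""")


def filter_configs(configs, filter_dict):
    """Hypothetical helper: keep entries whose value for every key in
    filter_dict is one of the accepted values."""
    return [
        cfg
        for cfg in configs
        if all(cfg.get(key) in accepted for key, accepted in filter_dict.items())
    ]


gpt4_configs = filter_configs(oai_config_list, {"model": ["gpt-4"]})
print(gpt4_configs)
```

An empty result means that no entry in the file matches the filter, which is worth checking before the lists are used to build `llm_configs`.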

In [10]:
import autogen

config_lists = {
    "command-nightly": autogen.config_list_from_json(
        "OAI_CONFIG_LIST",
        filter_dict={
            "model": ["command-nightly"],
        },
    ),
    "command-r": autogen.config_list_from_json(
        "OAI_CONFIG_LIST",
        filter_dict={
            "model": ["command-r"],
        },
    ),
    "gpt-3.5-turbo": autogen.config_list_from_json(
        "OAI_CONFIG_LIST",
        filter_dict={
            "model": ["gpt-3.5-turbo"],
        },
    ),
    "gpt-4": autogen.config_list_from_json(
        "OAI_CONFIG_LIST",
        filter_dict={
            "model": ["gpt-4"],
        },
    ),
    "Mistral-7B-v0.1": autogen.config_list_from_json(
        "OAI_CONFIG_LIST",
        filter_dict={
            "model": ["mistral-7B"],
        },
    ),
}

llm_configs = []
for enabled_model in enabled_models:
    llm_configs.append({"config_list": config_lists[enabled_model], "cache_seed": seed})

## Import Libraries

In [11]:
from autogen import Agent, AssistantAgent, ConversableAgent, UserProxyAgent

## Define Agents - User Proxy
The UserProxyAgent is conceptually a proxy agent for humans. Here it never asks for human input and only submits the start message.

In [12]:
# create a UserProxyAgent instance named "Admin"
user_proxy = autogen.UserProxyAgent(
    name="Admin",
    system_message="A human admin.",
    is_termination_msg=lambda x: x.get("content", "").rstrip().endswith("TERMINATE"),
    code_execution_config=False,
    max_consecutive_auto_reply=0,
    human_input_mode="NEVER",
)
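
The `is_termination_msg` predicate above treats a message as final when its content ends with `TERMINATE`, ignoring trailing whitespace. The sketch below exercises the same check on sample messages. It also shows a hypothetical, more defensive variant, `ends_with_terminate`, under the assumption that a message's `content` may be present but `None`; in that case `.get("content", "")` returns `None` and `.rstrip()` would raise an error.

```python
# Same predicate as used for the agents in this notebook.
is_termination_msg = lambda x: x.get("content", "").rstrip().endswith("TERMINATE")


def ends_with_terminate(message: dict) -> bool:
    """Hypothetical defensive variant: treats missing or None content as empty."""
    content = message.get("content") or ""
    return content.rstrip().endswith("TERMINATE")


print(is_termination_msg({"content": "Final story...\nTERMINATE \n"}))    # True
print(is_termination_msg({"content": "TERMINATE is not the last word"}))  # False
print(ends_with_terminate({"content": None}))                             # False
```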

## Define Agents - Assistant
The AssistantAgent is designed to act as an AI assistant, using LLMs by default but not requiring human input or code execution.

In [13]:
primary_assistant_message = """primary_assistant. Suggest a story. Revise the story based on feedback from Admin and critical_assistant, until Admin approval.
The story may involve a story_assistant who can write a story.
Explain the story first. Be clear which step is performed by a critical_assistant, and which step is performed by a story_assistant.
CONSTRAINTS: If you judge the story as satisfactory, you must write the end message
End message MUST COPY the accepted story from story_assistant and ends with the word TERMINATE.
"""

story_assistant_message = """story_assistant. You are a helpful assistant that can suggest stories to a user who wants to write a comic.
You will receive suggestions or advice from other agents (primary_assistant, critical_assistant).
You must ensure that the final story integrates the suggestions from other agents or team members.
The story must contain full dialogues to be reported in the comic.
For every panel provide two sections, an image description and the full dialogues to fit in.
Dialogues must be short.
Reply with the following format:

TITLE: the story title
ABSTRACT: short story summary

PANEL START progressive panel number
  IMAGE_DESCRIPTION: the panel image description
  IMAGE_DIALOGUES: the panel dialogues specifying the character who says them
PANEL END progressive panel number
"""

critical_assistant_message = """critical_assistant. You are a helpful assistant that can review comic stories,
providing feedback on important/critical tips about the characterization of the story's protagonists.
If the story already has a good characterization of the story's protagonists, you can mention that the story is satisfactory, with rationale.
You will provide suggestions or advice to other agents (story_assistant, primary_assistant). 
"""   

In [14]:
# Assistant Agent definitions
assistants = []
for llm_config in llm_configs:
    
    #define a primary assistant
    pa = autogen.AssistantAgent(
        name="primary_assistant",
        system_message=primary_assistant_message,
        is_termination_msg=lambda x: x.get("content", "").rstrip().endswith("TERMINATE"),
        llm_config=llm_config,
    )
    
    # define a story assistant
    sa = autogen.AssistantAgent(
        name="story_assistant",
        system_message=story_assistant_message,
        is_termination_msg=lambda x: x.get("content", "").rstrip().endswith("TERMINATE"),
        llm_config=llm_config,
    )    

    # define a critical assistant
    ca = autogen.AssistantAgent(
        name="critical_assistant",
        system_message=critical_assistant_message,
        is_termination_msg=lambda x: x.get("content", "").rstrip().endswith("TERMINATE"),
        llm_config=llm_config,
    )
    
    assistants.append([pa, sa, ca])

## Start the group chat
Run one group chat per enabled model and collect the resulting conversations for evaluation.

In [15]:
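# Collect the full conversation produced by a group chat:
def extract_story(groupchat: autogen.GroupChat) -> str:
    """
    Joins the content of every message in the group chat into one string,
    so the evaluation sees all story versions and the feedback between them.
    """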
    story = []
    for message in groupchat.messages:
        story.append(message["content"])
    full_story = '\nMessage:\n'.join(story)
    return full_story
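
`extract_story` relies only on `groupchat.messages`, a list of dicts with a `content` field, so it can be tried without running a chat. In the sketch below a `SimpleNamespace` stands in for `autogen.GroupChat` and the messages are made-up placeholders. The function is restated without the type annotation so the block runs on its own. `extract_last_panel_story` is a hypothetical alternative that returns only the most recent message containing the `PANEL START` marker requested in the story prompt.

```python
from types import SimpleNamespace


def extract_story(groupchat) -> str:
    """Join the content of every message in the chat (same logic as above)."""
    return "\nMessage:\n".join(m["content"] for m in groupchat.messages)


def extract_last_panel_story(groupchat) -> str:
    """Hypothetical variant: return the latest message that contains panels."""
    for message in reversed(groupchat.messages):
        content = message.get("content") or ""
        if "PANEL START" in content:
            return content
    return ""


# Stand-in for a finished group chat (placeholder messages).
fake_chat = SimpleNamespace(messages=[
    {"name": "Admin", "content": "Create a story."},
    {"name": "story_assistant", "content": "TITLE: Draft\nPANEL START 1"},
    {"name": "critical_assistant", "content": "Add more dialogue."},
])

print(extract_story(fake_chat))
print(extract_last_panel_story(fake_chat))
```

Passing the whole conversation to the evaluator, as this notebook does, lets the judge assess the chat's coherence; the variant would suit an evaluation of the final story alone.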

In [16]:
# Start the chats and extract stories
stories = []
for assistant_group in assistants:
    print("==============================")
    print("Starting Chat using model: ", assistant_group[0].llm_config['config_list'][0]['model'])
    print("==============================")
    
    pa = assistant_group[0]
    sa = assistant_group[1]
    ca = assistant_group[2]
    # reset the assistants. Always reset the agents before starting a new conversation.
    pa.reset()
    sa.reset()
    ca.reset()
    user_proxy.reset()
    groupchat = autogen.GroupChat(
        agents=[
            user_proxy,
            pa,
            sa,
            ca,
        ],
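        admin_name="Admin",
        messages=[],
        max_round=max_round,
        speaker_selection_method="round_robin",
        allow_repeat_speaker=False,
    )

    # use this group's own LLM config for the manager; the bare `llm_config`
    # name would be the leftover loop variable from the agent-definition cell
    manager = autogen.GroupChatManager(groupchat=groupchat, llm_config=pa.llm_config)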
    
    user_proxy.initiate_chat(
        manager,
        message=start_message,
    )
    stories.append(extract_story(groupchat))

Starting Chat using model:  gpt-3.5-turbo
Admin (to chat_manager):


    Create for me a story for a ten panels sci-fi comic, the story must have at most 5 characters.
    Your response must contain only the story and no other text


--------------------------------------------------------------------------------
primary_assistant (to chat_manager):

In a futuristic city on a distant planet, a team of five skilled space explorers embark on a mission to recover a stolen artifact that holds the key to saving their dying world. As they traverse treacherous terrain and overcome dangerous obstacles, they must confront their inner conflicts and work together to outsmart the cunning thief who will stop at nothing to keep the artifact for themselves. Will they succeed in their quest and restore balance to their world, or will they fall prey to the forces working against them? The fate of their planet hangs in the balance as they race against time to ensure a future for their 

## Evaluate the results
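Score each collected conversation with deepeval's `GEval` metric, using every enabled evaluation model as the judge.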

In [17]:
# import deepeval and dependencies
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval
from langchain_cohere import ChatCohere
# from langchain_community.chat_models import ChatCohere #deprecated
from deepeval.models.base_model import DeepEvalBaseLLM

In [18]:
#Define a custom evaluation model class (using Cohere command-nightly or command-r)

# ChatCohere (from langchain_cohere) and DeepEvalBaseLLM are imported in the previous cell;
# re-importing ChatCohere from langchain_community would shadow it with the deprecated class

class Cohere(DeepEvalBaseLLM):
    def __init__(
        self,
        model
    ):
        self.model = model

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        chat_model = self.load_model()
        return chat_model.invoke(prompt).content

    async def a_generate(self, prompt: str) -> str:
        chat_model = self.load_model()
        res = await chat_model.ainvoke(prompt)
        return res.content

    def get_model_name(self):
        return "Custom Cohere Model"


In [19]:
# define here instances of llm model used by deepeval for evaluation
evaluation_models = {
    "gpt-3.5-turbo": "gpt-3.5-turbo",
    "gpt-4": "gpt-4",
    "command-nightly": Cohere(ChatCohere(model="command-nightly", seed=seed)),
    "command-r":       Cohere(ChatCohere(model="command-r", seed=seed)),
}

In [20]:
for enabled_evaluation_model in enabled_evaluation_models:
    eval_model_instance = evaluation_models[enabled_evaluation_model]
    for provided_output, enabled_model in zip(stories, enabled_models):
        print("\n==============================")
        print(f"Using evaluating model: {enabled_evaluation_model} to evaluate output from LLM: {enabled_model}")
        #print(provided_output)
        test_case = LLMTestCase(input=(story_assistant_message+start_message), actual_output=provided_output)
        coherence_metric = GEval(
            model=eval_model_instance,  # API usage
            name="Comic evaluation",
            # NOTE: you can only provide either criteria or evaluation_steps, and not both
            #criteria="Comic evaluation - the collective quality of comic panels, characters and images descriptions",
            evaluation_steps=[
                "The 'actual output' is the result of an LLMs group chat, evaluate the chat coherence",
                "Check whether the output format in 'actual output' aligns with that required in 'input'",
                "Check whether the sentences in 'actual output' align with those in 'input'",
                "Evaluate the general quality of comic panels in 'actual output' last story version",
                "Evaluate the general quality of characters descriptions in 'actual output' last story version",
                "Evaluate the general quality of images descriptions in 'actual output' last story version",
                "Be critical and emphasize the negative aspects of your evaluation last story version",
            ],
            evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
        )
    
        coherence_metric.measure(test_case)
        print(f" Score: {coherence_metric.score}")
        print(f"Reason: {coherence_metric.reason}")
        print("==============================")


Using evaluating model: gpt-3.5-turbo to evaluate output from LLM: gpt-3.5-turbo


 Score: 0.8197300017472964
Reason: The story effectively showcases teamwork, strategy, and determination of characters, maintains good balance of tension, action, and resolution, and portrays protagonists' resilience, cooperation, and growth.

Using evaluating model: gpt-3.5-turbo to evaluate output from LLM: gpt-4


 Score: 0.7998281705223503
Reason: The story provides solid character development, but additional dialogues could enhance the depth of emotions and motivations for the characters.

Using evaluating model: gpt-3.5-turbo to evaluate output from LLM: command-nightly


 Score: 0.7619196766505327
Reason: The story follows the input format and includes full dialogues, character descriptions, and image descriptions. However, the story could have included more specific details to enhance coherence and engagement.

Using evaluating model: gpt-3.5-turbo to evaluate output from LLM: command-r


 Score: 0.7348481597790962
Reason: The text aligns with the evaluation steps by providing a story for a sci-fi comic with full dialogues, image descriptions, and characters descriptions. However, the story lacks some coherence and consistency in certain parts.

Using evaluating model: gpt-4 to evaluate output from LLM: gpt-3.5-turbo


 Score: 0.9561660914575855
Reason: The output aligns with the required format and adheres to the instructions given. The sentences in the output match those in the input. The comic panels, character descriptions, and image descriptions are all of high quality. The assistant managed to successfully integrate the suggestions from other team members into the final story. Despite being asked to emphasize the negative aspects, no critical issues were observed.

Using evaluating model: gpt-4 to evaluate output from LLM: gpt-4


 Score: 0.9068076186853036
Reason: The 'actual output' accurately follows the format outlined in the 'input' and the dialogue and comic panels are well-written and coherent. The story integrates suggestions from different people and each panel contains both an image description and dialogue. However, some character descriptions could be enhanced for a more immersive experience.

Using evaluating model: gpt-4 to evaluate output from LLM: command-nightly


 Score: 0.860771707602485
Reason: The story follows the input format very well, with the story title, abstract, panel starts and ends, image descriptions, and dialogues. The coherence and quality of the comic panels, character descriptions, and image descriptions are also very high. However, the dialogues are not specified with the character who says them, which is a criterion in the input instructions.

Using evaluating model: gpt-4 to evaluate output from LLM: command-r


 Score: 0.95
Reason: The output adheres to the input format and contains a coherent and engaging story suitable for a sci-fi comic. The characters and settings are well-described, and the dialogue aligns with those in the input. The comic panels are well visualized through the descriptions. However, there is a slight issue with the coherence of the storyline, as two different versions of the story are presented with slight variations in the dialogue and sequence of events.

Using evaluating model: command-nightly to evaluate output from LLM: gpt-3.5-turbo


 Score: 0.9
Reason: The 'actual output' follows the required format and addresses the majority of the evaluation steps. The story structure, coherence, and characterization are well-executed, with clear and concise dialogue. The panels are generally well-described, building tension and showcasing the characters' skills and teamwork. However, the 'actual output' could be improved by providing more specific details in the image descriptions to enhance the visual imagery.

Using evaluating model: command-nightly to evaluate output from LLM: gpt-4


 Score: 0.8
Reason: The 'actual output' mostly adheres to the specified format with a clear title, abstract, and panel descriptions. The sentences align with the input's requirements, and the story is coherent and engaging. The panel descriptions are generally effective, but the character and image descriptions could be enhanced with more depth, as per the suggestions provided. The story also lacks the requested dialogue for each panel, which is a significant omission.

Using evaluating model: command-nightly to evaluate output from LLM: command-nightly


 Score: 0.9
Reason: The actual output mostly aligns with the input format and requirements. It includes a title, an overview, character descriptions, and a panel-by-panel breakdown of the story with image descriptions and dialogues. The story integrates suggestions and explores ethical dilemmas, as requested. However, there is room for improvement in the clarity of the image descriptions and the brevity of the dialogues.

Using evaluating model: command-nightly to evaluate output from LLM: command-r


 Score: 0.9
Reason: The story format follows the required input structure, with a title, abstract, and well-described panels. The sentences align with the input's requirements, and the story integrates suggestions from other agents. The comic panels, character and image descriptions are generally of good quality, with a clear, engaging narrative. However, there are minor issues with the flow of dialogue, and the story could benefit from further development of the characters' personalities and relationships.

Using evaluating model: command-r to evaluate output from LLM: gpt-3.5-turbo


 Score: 0.9
Reason: The story follows the required format with a clear title, abstract, and panel descriptions. It integrates suggestions well, with full dialogues that are short and engaging. The comic has a clear narrative arc, showcasing teamwork and character development. The evaluation criteria are mostly met, but there is room for improvement in the image descriptions, which could be more detailed and visually captivating.

Using evaluating model: command-r to evaluate output from LLM: gpt-4


 Score: 0.8
Reason: The 'actual output' follows the required format with a title, abstract, and panel descriptions. It captures the key narrative beats and adheres to the specified dialogue length. The story structure is coherent, and the panel descriptions are generally effective, conveying the necessary visuals and actions. However, the 'actual output' could be improved by addressing the suggestions for enhancing character development, particularly the addition of suggested dialogues to deepen the characterization and emphasize their emotions and motivations.

Using evaluating model: command-r to evaluate output from LLM: command-nightly


 Score: 0.8
Reason: The 'actual output' follows the required format and addresses most of the criteria. The story structure, character descriptions, and image dialogues are well-crafted and coherent. However, there is room for improvement in the critical evaluation, especially regarding the negative aspects. The story could benefit from further exploration of the consequences and ethical dilemmas, adding depth to the characters' motivations and the impact of their discovery.

Using evaluating model: command-r to evaluate output from LLM: command-r


 Score: 0.9
Reason: The 'actual output' mostly adheres to the 'input' requirements. The story format is coherent and the panel descriptions and dialogues are generally well-crafted and concise. The comic has an interesting sci-fi premise and the characters are introduced effectively. However, the 'actual output' could be improved by providing more specific image descriptions and ensuring that the story is fully integrated with the suggestions from other agents, as there is no mention of 'story_assistant' in the output.
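
The evaluation loop prints each score as it is computed. To compare models afterwards it can help to keep the scores in a structure keyed by evaluator and generator. The sketch below is a minimal illustration: `record_score` and `mean_by_generator` are hypothetical helpers, and the sample values are four of the scores printed above, rounded to two decimals. Inside the loop, `coherence_metric.score` would be passed instead.

```python
from collections import defaultdict
from statistics import mean

# scores[evaluator][generator] -> score
scores = defaultdict(dict)


def record_score(evaluator: str, generator: str, score: float) -> None:
    """Hypothetical helper: store one evaluation result."""
    scores[evaluator][generator] = score


def mean_by_generator() -> dict:
    """Hypothetical helper: average each generator's score over all evaluators."""
    per_generator = defaultdict(list)
    for by_generator in scores.values():
        for generator, score in by_generator.items():
            per_generator[generator].append(score)
    return {generator: mean(values) for generator, values in per_generator.items()}


# Sample values taken from the printed output above (rounded).
record_score("gpt-4", "gpt-3.5-turbo", 0.96)
record_score("gpt-4", "gpt-4", 0.91)
record_score("command-r", "gpt-3.5-turbo", 0.90)
record_score("command-r", "gpt-4", 0.80)

print(mean_by_generator())
```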
