# 1. Story generation with simple agent


In [1]:
%%capture --no-stderr
# %pip install "pyautogen>=0.2.26"

In [2]:
%%capture --no-stderr
# %pip install "deepeval>=0.21.33"

In [3]:
%%capture --no-stderr
# %pip install "cohere" "langchain-cohere"

## Execution parameters

In [4]:
#define the start message
start_message = """
    Create for me a story for a ten panels sci-fi comic, the story must have at most 5 characters.
    Your response must contain only the story and no other text
"""

In [5]:
# Set the API keys used by deepeval (autogen reads its configuration from the OAI_CONFIG_LIST file)
import os
os.environ["OPENAI_API_KEY"] = "<your_api_key>"
os.environ["COHERE_API_KEY"] = "<your_api_key>"

In [6]:
#set the seed
seed = 42

In [7]:
#select which llm models you want to use for comic generation
enabled_models = [
    "gpt-3.5-turbo",
    "gpt-4",
    "command-nightly",
    "command-r",
]  

In [8]:
#select which llm models you want to use for output evaluation
enabled_evaluation_models = [
    "gpt-3.5-turbo",
    "gpt-4",
    "command-nightly",
    "command-r",
]

## Set your API Endpoint

The [`config_list_from_json`](https://microsoft.github.io/autogen/docs/reference/oai/openai_utils#config_list_from_json) function loads a list of configurations from an environment variable or a json file.
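
As a rough illustration, the sketch below shows the shape such a configuration is usually given (a JSON list with one object per model) and mimics the effect of `filter_dict` in plain Python. The entries and the `filter_configs` helper are hypothetical stand-ins for explanation only; this is not autogen's implementation.

```python
import json

# Hypothetical contents of an OAI_CONFIG_LIST file: a JSON list with one entry per model.
oai_config_list = json.loads("""
[
    {"model": "gpt-4", "api_key": "<your_api_key>"},
    {"model": "gpt-3.5-turbo", "api_key": "<your_api_key>"},
    {"model": "command-r", "api_key": "<your_api_key>"}
]
""")


def filter_configs(configs, filter_dict):
    """Keep the entries whose value for every key in filter_dict is one of the accepted values."""
    return [
        config
        for config in configs
        if all(config.get(key) in accepted for key, accepted in filter_dict.items())
    ]


# Select only the entries for one model, as filter_dict={"model": ["gpt-4"]} is meant to do.
gpt4_configs = filter_configs(oai_config_list, {"model": ["gpt-4"]})
print([config["model"] for config in gpt4_configs])
```

A model name with no matching entry simply produces an empty list in this sketch, which is worth checking for before building agents.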

In [9]:
#create llm_configs list
import autogen

config_lists = {
    "command-nightly": autogen.config_list_from_json(
        "OAI_CONFIG_LIST",
        filter_dict={
            "model": ["command-nightly"],
        },
    ),
    "command-r": autogen.config_list_from_json(
        "OAI_CONFIG_LIST",
        filter_dict={
            "model": ["command-r"],
        },
    ),
    "gpt-3.5-turbo": autogen.config_list_from_json(
        "OAI_CONFIG_LIST",
        filter_dict={
            "model": ["gpt-3.5-turbo"],
        },
    ),
    "gpt-4": autogen.config_list_from_json(
        "OAI_CONFIG_LIST",
        filter_dict={
            "model": ["gpt-4"],
        },
    ),
    "mistral-7B": autogen.config_list_from_json(
        "OAI_CONFIG_LIST",
        filter_dict={
            "model": ["mistral-7B"],
        },
    ),
}

llm_configs = []
for enabled_model in enabled_models:
    llm_configs.append({"config_list": config_lists[enabled_model], "cache_seed": seed})
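
The loop above assumes that every enabled model has a usable entry in `config_lists`. A name missing from the dictionary raises a bare `KeyError`, and a filter that matched nothing presumably leaves an empty list that only fails later, when the chat starts. A minimal sketch of an earlier check follows; `build_llm_configs` and the stand-in data are hypothetical and not part of the notebook.

```python
def build_llm_configs(enabled_models, config_lists, seed):
    """Build one llm_config per enabled model, failing early on missing configurations."""
    llm_configs = []
    for model in enabled_models:
        config_list = config_lists.get(model)
        if not config_list:  # covers both a missing key and an empty list
            raise ValueError(f"No configuration found for model '{model}'")
        llm_configs.append({"config_list": config_list, "cache_seed": seed})
    return llm_configs


# Stand-in data (hypothetical) so the sketch runs without an OAI_CONFIG_LIST file.
example_config_lists = {
    "gpt-4": [{"model": "gpt-4", "api_key": "<your_api_key>"}],
    "mistral-7B": [],  # the filter matched no entry for this model
}
print(build_llm_configs(["gpt-4"], example_config_lists, seed=42))
```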

## Import Libraries

In [10]:
from autogen import Agent, AssistantAgent, UserProxyAgent

## Define Agents - User Proxy
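The UserProxyAgent is conceptually a proxy agent for humans. Here it runs with `human_input_mode="NEVER"`, so it only sends the start message and never asks for input.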

In [11]:
# create a UserProxyAgent instance named "user_proxy"
user_proxy = UserProxyAgent(
    name="user_proxy",
    is_termination_msg=lambda x: x.get("content", "").rstrip().endswith("TERMINATE"),
    human_input_mode="NEVER",
    #max_consecutive_auto_reply=1, #Not used if we set human_input_mode to "ALWAYS" 
    default_auto_reply=None, #No auto reply
    llm_config=False, #No llm reply
    #system_message="A human admin.", #Only used when llm_config is not False
    code_execution_config=False, #No code execution is needed
    description="Simple human agent",
)


## Define Agents - Assistant
The AssistantAgent is designed to act as an AI assistant, using LLMs by default but not requiring human input or code execution

In [12]:
# System message for the assistant
system_message= """
    As a comic story maker in this position, you must possess strong collaboration and communication abilities to efficiently complete tasks assigned
    by leaders or colleagues within a group chat environment. You create stories with the aim of creating a new original comic.
    Your responses MUST ALWAYS include a full story version with all the panels.
    If you receive a number of panels to be made, RESPECT IT.
    The story must contain full dialogues to be reported in the comic.
    For every panel provide two sections, an image description and the full dialogues to fit in. Dialogues must be short.
    Your responses MUST contains ONLY the story with NO other texts, write the story in the following format:

    TITLE: the story title
    ABSTRACT: short story summary

    CHARACTERS: names and short descritpions of the characters

    PANEL START progressive panel number
    IMAGE_DESCRIPTION: the panel image description
    IMAGE_DIALOGUES: the panel dialogues specifying the character who says them
    PANEL END progressive panel number
"""

In [13]:
# Assistant Agent definitions
assistants = []
for llm_config in llm_configs:
    assistants.append(AssistantAgent(
        name="assistant",
        system_message=system_message,
        llm_config=llm_config, #An llm configuration
        is_termination_msg=lambda x: x.get("content", "").rstrip().endswith("TERMINATE"),
        max_consecutive_auto_reply=1, #No consecutive reply
        description="Simple llm agent",
    ))


## Start the chats
Start one chat per assistant with the same start message and collect the story each model produces.

In [14]:
#Get last story produced:
def extract_story(agent: Agent) -> str:
    """
    Extracts the story from the last message of an agent.
    """
    # The agent's last message holds the generated story.
    story = agent.last_message()["content"]
    return story
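
Because the system message prescribes a fixed layout (`PANEL START n` ... `PANEL END n`), the extracted story can be checked mechanically before it is handed to an evaluator. The sketch below is one possible way to do that with a regular expression; `parse_panels` and the sample story are hypothetical and not part of the notebook.

```python
import re

# A panel starts with "PANEL START <n>" and ends with "PANEL END <n>" using the same number.
PANEL_PATTERN = re.compile(
    r"PANEL START\s+(\d+)\s*(.*?)\s*PANEL END\s+\1\b",
    re.DOTALL,
)


def parse_panels(story: str) -> dict:
    """Map each panel number to its raw text (image description and dialogues)."""
    return {int(number): body for number, body in PANEL_PATTERN.findall(story)}


# Made-up sample in the format requested by the system message.
sample_story = """
TITLE: Example
PANEL START 1
IMAGE_DESCRIPTION: A ship drifts in orbit.
IMAGE_DIALOGUES: Captain: "Report."
PANEL END 1
PANEL START 2
IMAGE_DESCRIPTION: The bridge, lit in red.
IMAGE_DIALOGUES: Engineer: "Shields are down."
PANEL END 2
"""

panels = parse_panels(sample_story)
print(sorted(panels))  # panel numbers found in the sample
```

With real output, `len(parse_panels(extract_story(assistant)))` could be compared with the ten panels requested in `start_message`.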

In [15]:
# Start the chats and extract stories
stories = []
for assistant in assistants:
    print("==============================")
    print("Starting Chat using model: ", assistant.llm_config['config_list'][0]['model'])
    print("==============================")
    # reset the assistant. Always reset the agents before starting a new conversation.
    assistant.reset()
    user_proxy.reset()
    user_proxy.initiate_chat(
        assistant,
        message=start_message,
    )
    stories.append(extract_story(assistant))
    print("==============================")
    print("Chat Ends")
    print("==============================")

Starting Chat using model:  gpt-3.5-turbo
user_proxy (to assistant):


    Create for me a story for a ten panels sci-fi comic, the story must have at most 5 characters.
    Your response must contain only the story and no other text


--------------------------------------------------------------------------------
assistant (to user_proxy):


    TITLE: The Alien Alliance
    ABSTRACT: In a distant galaxy, a group of humans and aliens must team up to save their worlds from a common threat.

    CHARACTERS:
    1. Captain Jaxar - A fearless human spaceship captain.
    2. Zara - A skilled alien engineer with a mysterious past.
    3. Krognar - A grumpy but brilliant alien scientist.
    4. Lieutenant Mia - Jaxar's loyal and resourceful second-in-command.
    5. Xel'tar - A charming alien trader with a shady side.

    PANEL START 1
    IMAGE_DESCRIPTION: The crew of the spaceship "Stellar Horizon" gathered around a holographic map showing multiple planets under attack

## Evaluate the results
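Each story is scored by every enabled evaluation model, so LLMs are used to judge LLM output.

## Define the evaluation models
deepeval's `GEval` metric is given either an OpenAI model name or a custom model object; the Cohere models are wrapped in a small class for this purpose.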

In [16]:
# import deepeval and dependencies
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval
from langchain_cohere import ChatCohere
# from langchain_community.chat_models import ChatCohere #deprecated
from deepeval.models.base_model import DeepEvalBaseLLM

In [17]:
# disable warnings to silence the deepeval ipywidgets check
import warnings
warnings.filterwarnings('ignore')

In [18]:
#Define a custom evaluation model class (using Cohere command-nightly or command-r)

from langchain_cohere import ChatCohere  # the langchain_community import is deprecated
from deepeval.models.base_model import DeepEvalBaseLLM

class Cohere(DeepEvalBaseLLM):
    def __init__(
        self,
        model
    ):
        self.model = model

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        chat_model = self.load_model()
        return chat_model.invoke(prompt).content

    async def a_generate(self, prompt: str) -> str:
        chat_model = self.load_model()
        res = await chat_model.ainvoke(prompt)
        return res.content

    def get_model_name(self):
        return "Custom Cohere Model"


In [19]:
# define here instances of llm model used by deepeval for evaluation
evaluation_models = {
    "gpt-3.5-turbo": "gpt-3.5-turbo",
    "gpt-4": "gpt-4",
    "command-nightly": Cohere(ChatCohere(model="command-nightly", seed=seed)),
    "command-r":       Cohere(ChatCohere(model="command-r", seed=seed)),
}

In [20]:
for enabled_evaluation_model in enabled_evaluation_models:
    eval_model_instance = evaluation_models[enabled_evaluation_model]
    for provided_output, enabled_model in zip(stories, enabled_models):
        print("\n==============================")
        print(f"Using evaluating model: {enabled_evaluation_model} to evaluate output from LLM: {enabled_model}")
        test_case = LLMTestCase(input=(system_message+start_message), actual_output=provided_output)
        coherence_metric = GEval(
            model=eval_model_instance,  # OpenAI model name or custom DeepEvalBaseLLM instance
            name="Comic evaluation",
            # NOTE: you can only provide either criteria or evaluation_steps, and not both
            #criteria="Comic evaluation - the collective quality of comic panels, characters and images descriptions",
            evaluation_steps=[
                "Check whether the output format in 'actual output' aligns with that required in 'input'",
                "Check whether the sentences in 'actual output' aligns with that in 'input'",
                "Evaluate the general quality of comic panels in 'actual output'",
                "Evaluate the general quality of comic story in 'actual output'",
                "Evaluate the general quality of comic dialogues in 'actual output'",
                "Evaluate the general quality of characters descriptions in 'actual output'",
                "Evaluate the general quality of images descriptions in 'actual output'",
                "Be critical and emphasize the negative aspects of your evaluation",
            ],
            evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
        )
    
        coherence_metric.measure(test_case)
        print(f" Score: {coherence_metric.score}")
        print(f"Reason: {coherence_metric.reason}")
        print("==============================")


Using evaluating model: gpt-3.5-turbo to evaluate output from LLM: gpt-3.5-turbo


 Score: 0.9416723936371657
Reason: The output aligns with the input criteria and follows the evaluation steps outlined.

Using evaluating model: gpt-3.5-turbo to evaluate output from LLM: gpt-4


 Score: 0.9087701490070244
Reason: The actual output aligns with the input, follows the evaluation steps, and meets the criteria outlined for a sci-fi comic story.

Using evaluating model: gpt-3.5-turbo to evaluate output from LLM: command-nightly


 Score: 0.7570317013026997
Reason: The actual output follows the evaluation steps and provides a detailed sci-fi comic story with characters, dialogues, and panel descriptions. However, the story could benefit from more originality and depth in character development.

Using evaluating model: gpt-3.5-turbo to evaluate output from LLM: command-r


 Score: 0.867233568208488
Reason: The actual output aligns with the input criteria in terms of format, sentences, quality of panels, story, dialogues, characters, and images descriptions.

Using evaluating model: gpt-4 to evaluate output from LLM: gpt-3.5-turbo


 Score: 0.9814844306139026
Reason: The actual output perfectly follows the evaluation criteria. The format, story, dialogues, character and image descriptions are all in line with the input requirements. The comic panels and story are engaging, creative, and of high quality. Also, the writer is critical and brings out the negative aspects in the story.

Using evaluating model: gpt-4 to evaluate output from LLM: gpt-4


 Score: 0.9567961924672046
Reason: The actual output closely adheres to the criteria outlined in the input steps. The output format is correct, the number of panels is respected, dialogues are short and correctly attributed to characters, and the descriptions of images and characters are clear and detailed. The story is engaging with a clear plot and developed characters. The only area that could be improved is the critical evaluation of the comic's quality regarding the balance between dialogues and image descriptions, which seems slightly skewed towards dialogues.

Using evaluating model: gpt-4 to evaluate output from LLM: command-nightly


 Score: 0.9538807247872431
Reason: The output aligns well with the required format, has a cohesive story, detailed descriptions, and concise dialogues. The characters are well-defined and the comic panels are presented in an structured manner. However, the panel descriptions could have been more focused on the actions taking place, rather than overly detailing the setting.

Using evaluating model: gpt-4 to evaluate output from LLM: command-r


 Score: 0.9479935911456282
Reason: The output adheres to the format requested in the input and contained all the necessary elements such as title, abstract, characters and panels. The story for the comic was imaginative and interesting, with well-written dialogues and descriptions. The characters were described in a concise manner and the panels were clearly depicted. However, there was a bit of verbosity in the dialogues which contradicts the requirement for short dialogues. This is the primary reason for the deduction of a point.

Using evaluating model: command-nightly to evaluate output from LLM: gpt-3.5-turbo


 Score: 0.9
Reason: The output closely follows the required format and includes all necessary sections. The story is engaging and well-structured, with an interesting mix of characters. The panels build tension and the dialogue is concise and effective. However, there is room for improvement in the depth of character descriptions and the complexity of the story's conflict and resolution.

Using evaluating model: command-nightly to evaluate output from LLM: gpt-4


 Score: 0.9
Reason: The 'actual output' closely follows the required format and includes all necessary sections with clear and concise sentences. The story is engaging and well-paced, with an interesting premise and a diverse range of characters. The panels build tension and effectively reveal information, and the dialogues are snappy and characteristic. However, the 'actual output' loses a point due to a lack of variety in the panel descriptions, with some panels lacking specific visual details to fully bring the story to life.

Using evaluating model: command-nightly to evaluate output from LLM: command-nightly


 Score: 0.9
Reason: The output mostly adheres to the specified format and includes all necessary sections. The story is engaging and captures the sense of adventure and mystery, with a diverse range of characters and an intriguing premise. The panels build tension and showcase the team's resourcefulness. However, the 'Actual Output' does not include a name for the alien entity, which is a minor oversight.

Using evaluating model: command-nightly to evaluate output from LLM: command-r


 Score: 0.9
Reason: The output follows the required format and includes all necessary sections. The story is engaging and well-structured, with an interesting premise and clear character arcs. The panels build tension and effectively convey the story's key moments. However, the dialogue could be more varied, with certain phrases feeling repetitive, and the character descriptions could be more detailed to enhance the reader's understanding of their motivations and backgrounds.

Using evaluating model: command-r to evaluate output from LLM: gpt-3.5-turbo


 Score: 0.9
Reason: The text follows the required format and includes all necessary elements, with only minor deviations in the sentence structure. The story is engaging and well-paced, with distinct characters and an interesting plot. The panels build tension and the dialogue is clear and concise, serving the story well. However, the 'Actual Output' could have provided more detail in the 'CHARACTERS' section, with slightly longer descriptions to enhance the reader's understanding of each character's role and personality.

Using evaluating model: command-r to evaluate output from LLM: gpt-4


 Score: 0.8
Reason: The 'actual output' mostly adheres to the format and content specified in the 'input' requirements, with a few minor deviations. The story is engaging and the panels build suspense effectively. However, the dialogue could be more varied and dynamic, with some sentences being overly expository. The character descriptions are concise but could be more creative, and the image descriptions could provide more detail to enhance the visual impact.

Using evaluating model: command-r to evaluate output from LLM: command-nightly


 Score: 0.9
Reason: The output follows the required format and includes all necessary sections. The story is engaging and well-structured, with an interesting premise and clear character roles. The panels build suspense and the dialogues are concise and effective. However, the negative aspects include the lack of a distinct ending and the need for slightly more detailed image descriptions in some panels.

Using evaluating model: command-r to evaluate output from LLM: command-r


 Score: 0.9
Reason: The story structure and format are well-aligned with the given criteria, and the narrative is engaging with clear, concise dialogues. The panels effectively build tension and showcase the robotic rebellion. However, there is room for improvement in the character development, as the story could delve deeper into their motivations and backgrounds, especially for the secondary characters like The Commander and Officer X7.
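
The sixteen scores are easier to compare once they are aggregated. The sketch below copies the scores from the output above (rounded to two decimals) and averages them per generating model and per evaluating model. The `mean_score_by` helper is hypothetical, and since this is a single run with one story per model, the averages are only a rough indication.

```python
from collections import defaultdict
from statistics import mean

# (evaluator, generator, score) triples copied from the output above, rounded to two decimals.
results = [
    ("gpt-3.5-turbo", "gpt-3.5-turbo", 0.94),
    ("gpt-3.5-turbo", "gpt-4", 0.91),
    ("gpt-3.5-turbo", "command-nightly", 0.76),
    ("gpt-3.5-turbo", "command-r", 0.87),
    ("gpt-4", "gpt-3.5-turbo", 0.98),
    ("gpt-4", "gpt-4", 0.96),
    ("gpt-4", "command-nightly", 0.95),
    ("gpt-4", "command-r", 0.95),
    ("command-nightly", "gpt-3.5-turbo", 0.90),
    ("command-nightly", "gpt-4", 0.90),
    ("command-nightly", "command-nightly", 0.90),
    ("command-nightly", "command-r", 0.90),
    ("command-r", "gpt-3.5-turbo", 0.90),
    ("command-r", "gpt-4", 0.80),
    ("command-r", "command-nightly", 0.90),
    ("command-r", "command-r", 0.90),
]


def mean_score_by(rows, index):
    """Average the scores grouped by the element at position `index` of each triple."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[index]].append(row[2])
    return {name: mean(scores) for name, scores in groups.items()}


mean_by_generator = mean_score_by(results, index=1)  # how well each model's story was rated
mean_by_evaluator = mean_score_by(results, index=0)  # how generous each evaluator was

for name, score in mean_by_generator.items():
    print(f"{name:>16}: {score:.3f}")
```

Grouping by evaluator as well helps separate a genuinely better story from an evaluator that tends to return the same score for every input.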
