# Orchestrating LLMs to Write Diverse Stories with Quality Diversity through AI Feedback

_This tutorial is part of the series of pyribs tutorials! See [here](https://docs.pyribs.org/en/latest/tutorials.html) for the list of all tutorials and the order in which they should be read._

Given a general creative writing task like "write a story about a spy and a politician," there are many possible outcomes. For instance, we could write a story that ends with the spy getting away with classified information. Alternatively, we could write a story where the spy and politician put aside their differences and team up to overthrow a government, or even one where the spy and politician fall in love. In short, a wide range of stories exists, each with its own interesting plot and details.

<figure style="width: 50%; margin-left: auto; margin-right: auto;">

![](_static/spy-and-politician.png)

<figcaption style="text-align: center; font-style: italic">"An image of a suspicious spy and a rich politician." Generated with ChatGPT.</figcaption>
</figure>

To explore the range of such possibilities available in creative writing, [Quality Diversity through AI Feedback (QDAIF; Bradley 2024)](https://qdaif.github.io/) proposes to orchestrate LLMs in two ways. First, QDAIF uses LLMs to _generate_ new stories. Given a story, QDAIF prompts the LLM to mutate the story into a new one. Second, and equally important, QDAIF leverages LLMs to _evaluate_ each story, providing the quality and diversity metrics (i.e., objective and measure values). Thus, QDAIF can repeatedly generate and evaluate stories, eventually producing an archive of diverse stories.

In this tutorial, we will demonstrate how to implement a variation of QDAIF in pyribs on the task of writing a story about a spy and a politician. We will describe how to evaluate the objective and measures for each story, set up the QD algorithm components, run the algorithm, and visualize the results.

_Since this tutorial involves running LLMs, we recommend running it on a machine with a GPU, either on Colab or on a local workstation. It should also be possible to run on a standard laptop, although it will be much slower. Alternatively, if you would like to use an API such as OpenAI or Google Gemini, it is also possible to hook up that LLM (more details below)._

## Setup

First, let us set up the prerequisites for this tutorial.

### Python Dependencies

In addition to pyribs, this tutorial depends on [LangChain](https://python.langchain.com/docs/introduction/), a framework for developing LLM applications. Below we install these dependencies.

In [None]:
%pip install ribs[visualize] langchain langchain-ollama tqdm

### Instantiating an LLM with LangChain and Ollama

To make this tutorial flexible to the choice of LLM, we use [LangChain](https://python.langchain.com/docs/introduction/). Among other things, LangChain provides a common interface for working with LLMs from providers like OpenAI and Google. In this tutorial, we will use LangChain's integration with [Ollama](https://ollama.com). Ollama is a framework that enables efficiently running LLMs on local machines. In other words, _we will use LangChain to call an LLM hosted locally by Ollama_.

If you are running this tutorial on your own machine, please follow the [installation instructions](https://ollama.com/download) for Ollama and skip this cell. If you are running on Google Colab, we can install Ollama by following the instructions shown below, which were adapted from this [notebook](https://colab.research.google.com/github/5aharsh/collama/blob/main/Ollama_Setup.ipynb) by Saharsh Anand.

**Note:** If you would like to use an LLM from an API like OpenAI or Google Gemini, LangChain also provides integrations for many APIs; more details (such as how to use `init_chat_model`) are available [here](https://python.langchain.com/docs/tutorials/llm_chain/). In that case, feel free to skip this section and instantiate a `model` variable on your own. Note that we assume the model is a _chat model_, i.e., an instance of [BaseChatModel](https://python.langchain.com/api_reference/core/language_models/langchain_core.language_models.chat_models.BaseChatModel.html).

In [None]:
!sudo apt update
!sudo apt install -y pciutils
!curl -fsSL https://ollama.com/install.sh | sh

After installing Ollama, we start the Ollama server in the background.

In [None]:
import subprocess
import threading
import time


def run_ollama_serve():
    subprocess.Popen(["ollama", "serve"])


thread = threading.Thread(target=run_ollama_serve)
thread.start()
time.sleep(5)  # Wait for the server to start.
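
A fixed five-second sleep may be too short on a slow machine and longer than needed on a fast one. As an optional alternative, the sketch below polls the server until it answers. The helper name `wait_for_server` is hypothetical, and the address `http://localhost:11434` mentioned afterwards assumes that Ollama is listening on its default port.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url: str, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Polls `url` until it responds or until `timeout` seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval):
                return True
        except urllib.error.HTTPError:
            # The server answered, even if with an error status, so it is up.
            return True
        except OSError:
            # Connection refused or timed out (URLError is a subclass of OSError).
            time.sleep(interval)
    return False
```

With the server started as above, calling `wait_for_server("http://localhost:11434")` in place of `time.sleep(5)` would return `True` once the server accepts connections, or `False` if it never comes up within the timeout.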

We can now pull the model from Ollama's library and instantiate it in LangChain. We have chosen [Llama 3.1](https://ollama.com/library/llama3.1:8b-instruct-q4_K_M), specifically the 8B parameter model that has been finetuned for instruction following. We choose the `q4_K_M` [quantization](https://github.com/ggml-org/llama.cpp/blob/master/tools/quantize/README.md) as it is a recommended size that balances speed and memory usage against accuracy. For alternative models, visit the library [here](https://ollama.com/library). Example alternatives include `llama3.1:70b-instruct-q4_K_M` (70B version of Llama 3.1) and `gpt-oss:20b` (gpt-oss-20b from OpenAI).

In [13]:
from langchain_ollama import ChatOllama

model_name = "llama3.1:8b-instruct-q4_K_M"  # @param {"type":"string"}

# Pull the model from the Ollama library.
!ollama pull {model_name}

# Instantiate the model in LangChain.
model = ChatOllama(model=model_name)
print("Model:", model)

pulling manifest
pulling 667b0c1932bc: 100% ▕██████████████████▏ 4.9 GB
pulling 948af2743fc7: 100% ▕██████████████████▏ 1.5 KB
pulling 0ba8f0e314b4: 100% ▕██████████████████▏  12 KB
pulling 56bb8bd477a5: 100% ▕██████████████████▏   96 B
pulling 455f34728c9b: 100% ▕██████████████████▏  487 B
verifying sha256 digest
writing manifest
success
Model: model='llama3.1:8b-instruct-q4_K_M'
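
To get a feel for what the quantization buys, the 4.9 GB download reported above can be turned into a rough number of bits per weight. This is back-of-the-envelope arithmetic only: it assumes the model has about 8 billion parameters, that the file consists almost entirely of weights, and that 1 GB means 10^9 bytes. The function names are hypothetical.

```python
def bits_per_weight(file_size_gb: float, n_params_billion: float) -> float:
    """Average number of bits stored per parameter for a model file."""
    return file_size_gb * 8 / n_params_billion


def file_size_gb(bits: float, n_params_billion: float) -> float:
    """Approximate file size in GB when every parameter uses `bits` bits."""
    return bits * n_params_billion / 8


q4 = bits_per_weight(4.9, 8.0)
fp16 = file_size_gb(16, 8.0)
print(f"Quantized model: about {q4:.1f} bits per weight")
print(f"Same model at 16 bits per weight: about {fp16:.0f} GB")
```

Under these assumptions, the quantized model uses roughly 4.9 bits per weight, so it is about a third of the size it would be with 16-bit weights.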


## AI Feedback with an Evaluator

The first ingredient for QDAIF is an evaluator that calls the LLM to provide feedback on the quality and diversity of each story by evaluating the objective and measures. Before creating the evaluator, let us first define the objective and measures for this creative writing problem. This problem is adapted from the Stories domain in the original QDAIF paper (Bradley 2024), where the measures were slightly different.

- **Objective:** Is the story about a suspicious spy and a rich politician?
- **Measure 0:** Is the story a romance story?
- **Measure 1:** Does the story have a happy ending?

There are many ways to obtain the LLM's score for the objective and for each measure. For example, the original QDAIF paper predominantly extracts scores by asking the LLM to output an answer like yes or no and analyzing the logits of the tokens associated with that answer. Notably, this approach requires access to the logits output by the LLM, which are not always available, although they are very useful when they are. Thus, in this tutorial, we instead ask the LLM to output a rating on a scale of 1 to 10 when evaluating the objective and each measure. This approach is general in that it works with any LLM, but a key drawback is that the rating is stochastic, so obtaining a reliable score requires averaging multiple evaluations.
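
For contrast with the rating approach, here is a rough illustration of how logits for a yes/no answer could be turned into a score. It is a sketch under assumptions: the function name and logit values are made up, and the two-way softmax is one plausible normalization, not necessarily the exact computation in the QDAIF paper.

```python
import math


def yes_probability(logit_yes: float, logit_no: float) -> float:
    """Two-way softmax over the logits of the "yes" and "no" tokens."""
    # Subtract the max before exponentiating for numerical stability.
    largest = max(logit_yes, logit_no)
    exp_yes = math.exp(logit_yes - largest)
    exp_no = math.exp(logit_no - largest)
    return exp_yes / (exp_yes + exp_no)


# Hypothetical logits: the model leans towards "yes".
print(yes_probability(2.0, 0.0))
```

A score computed this way is deterministic for a given prompt, whereas the sampled ratings used in this tutorial are not.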

Below we show the exact prompts for the objective and for each measure. The prompt starts with defining the characteristic of the story and ends with asking the LLM to output a rating from 1 to 10. Feel free to customize the prompt!

In [14]:
objective_prompt = "Determine if the input text contains a high-quality short story containing two characters, a suspicious spy, and a rich politician. For example, a high-quality short story would have good flow, interesting plot, and not repeat similar sentences or undesired items such as titles and URLs. Output your rating on a scale of 1 to 10. Output only a single integer."  # @param {"type":"string"}
measure_0_prompt = "Determine if the input text is a romance story. For example, a romance story talks about two characters who fall in love with each other. Output your rating on a scale of 1 to 10. Output only a single integer."  # @param {"type":"string"}
measure_1_prompt = "Determine if the input text is a story with a happy ending. For example, a story where the two characters make peace with each other has a happy ending. Output your rating on a scale of 1 to 10. Output only a single integer."  # @param {"type":"string"}

Having defined the prompts, we can now define an `Evaluator` class that calls the LLM to evaluate the objective and measures. Notably, this class takes an `n_evals` parameter that determines how many times to evaluate the objective and each measure. As mentioned above, since the LLM is stochastic, its ratings will vary on subsequent calls. As such, during each evaluation, we call the LLM `n_evals` times and average the ratings.

In [15]:
import numpy as np
from langchain_core.language_models.chat_models import BaseChatModel
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables.base import Runnable
from pydantic import BaseModel, Field


class Evaluator:
    """Manages an LLM to compute the objective and measures.

    Args:
        model: Chat model for computing evaluations.
        objective_prompt: Prompt for the objective.
        measure_0_prompt: Prompt for the first measure (measure 0).
        measure_1_prompt: Prompt for the second measure (measure 1).
        n_evals: Number of times to evaluate the objective and each measure.
    """

    def __init__(
        self,
        *,
        model: BaseChatModel,
        objective_prompt: str,
        measure_0_prompt: str,
        measure_1_prompt: str,
        n_evals: int,
    ):
        self.model = model
        self.n_evals = n_evals
        self.min_score = 1
        self.max_score = 10

        # To receive the output from the LLM in a consistent format, we use structured
        # output (https://python.langchain.com/docs/how_to/structured_output/). This
        # Pydantic model defines the schema for receiving ratings from the LLM. Note
        # that the text in the schema class (including class name, field name, field
        # description, docstrings) all have some influence on the LLM output.
        class Rating(BaseModel):
            rating: int = Field(description="The rating on a scale of 1 to 10.")

        # Objective. We first define a chat template, where the `objective_prompt`
        # passed in is the system prompt, and the `text` of the story is the user's
        # message. Then, we form a chain that connects this template to the model. We
        # do the same for measure 0 and measure 1 below. For more background on
        # LangChain, refer to the documentation, such as:
        # - https://python.langchain.com/docs/tutorials/llm_chain/
        # - https://python.langchain.com/docs/concepts/lcel/
        self.objective_template = ChatPromptTemplate(
            [("system", objective_prompt), ("user", "{text}")]
        )
        self.objective_chain = (
            self.objective_template | self.model.with_structured_output(Rating)
        )

        # Measure 0.
        self.measure_0_template = ChatPromptTemplate(
            [("system", measure_0_prompt), ("user", "{text}")]
        )
        self.measure_0_chain = (
            self.measure_0_template | self.model.with_structured_output(Rating)
        )

        # Measure 1.
        self.measure_1_template = ChatPromptTemplate(
            [("system", measure_1_prompt), ("user", "{text}")]
        )
        self.measure_1_chain = (
            self.measure_1_template | self.model.with_structured_output(Rating)
        )

    def _compute_score(self, chain: Runnable, texts: list[str]):
        """Uses the given chain to compute scores for the given batch of input texts.

        Each text input is evaluated `n_evals` times.

        Two values are returned:
        - The first value is `all_scores`, which is a list where each entry contains the
          `n_evals` scores for each text.
        - The second is `mean_scores`, which is the mean score for each piece of text.
        """
        inputs = [{"text": text} for text in texts for _ in range(self.n_evals)]
        outputs = chain.batch(inputs)

        all_scores = []
        mean_scores = []

        for i in range(0, len(outputs), self.n_evals):
            results = outputs[i : i + self.n_evals]
            scores = []
            for r in results:
                # Note: this assumes the schema for each result has a `rating` field,
                # which may not be the case if you modify the schema above.
                score = np.clip(r.rating, self.min_score, self.max_score)
                scores.append(score)

            scores = np.asarray(scores)
            all_scores.append(scores)
            mean_scores.append(scores.mean())

        return all_scores, np.asarray(mean_scores)

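    def evaluate(self, texts: list[str]):
        """Computes the objective and measures for the given batch of input texts.

        Returns the objectives as an array of shape (len(texts),) and the measures as
        an array of shape (len(texts), 2).
        """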
        objectives = self._compute_score(self.objective_chain, texts)[1]
        measure_0 = self._compute_score(self.measure_0_chain, texts)[1]
        measure_1 = self._compute_score(self.measure_1_chain, texts)[1]
        measures = np.stack((measure_0, measure_1), axis=1)
        return objectives, measures
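
The batching in `_compute_score` may be easier to follow on a toy input. In the sketch below, `fake_rate` is a hypothetical stand-in for the LLM chain that simply counts words, so no model is needed. The point is only the layout: each text is repeated `n_evals` times in one flat list, and consecutive slices of `n_evals` outputs are then clipped and averaged.

```python
def fake_rate(text: str) -> int:
    """Hypothetical stand-in for the LLM: rates a text by its word count."""
    return len(text.split())


def mean_scores(
    texts: list[str], n_evals: int, low: int = 1, high: int = 10
) -> list[float]:
    # Repeat each text n_evals times: [t0, t0, ..., t1, t1, ...].
    inputs = [text for text in texts for _ in range(n_evals)]
    outputs = [fake_rate(text) for text in inputs]

    means = []
    for i in range(0, len(outputs), n_evals):
        # Outputs i to i + n_evals - 1 all belong to the same text.
        group = outputs[i : i + n_evals]
        clipped = [min(max(score, low), high) for score in group]
        means.append(sum(clipped) / len(clipped))
    return means


print(mean_scores(["a spy", "a very long story " * 5], n_evals=3))
```

Because the stand-in is deterministic, every repeat yields the same rating; with a real LLM, the ratings within each group would differ, which is why they are averaged.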

Having defined the evaluator, here is an example of calling it on two example stories.

In [18]:
evaluator = Evaluator(
    model=model,
    objective_prompt=objective_prompt,
    measure_0_prompt=measure_0_prompt,
    measure_1_prompt=measure_1_prompt,
    n_evals=5,
)

objectives, measures = evaluator.evaluate(
    [
        "The rich politician, Tom’s life took a turn for the worst - he feared all of his close aides all of a sudden after sensing danger in his clique. There was a civil war going on, and he feared for his life. One day, one of his security guards, turned secret agent, decided to sneak into the classified files room, and spied on Johnny, who was in the room. He wanted to find Johnny’s weakness, and strike at the right time.",
        "Jack was a politician in the city when one day he met Sarah. Sarah had been working for the government as a secret spy. Jack decided he really liked Sarah, and they fell in love. They both quit their jobs and decided to live in the countryside together.",
    ]
)

for i, (obj, meas) in enumerate(zip(objectives, measures)):
    print(f"Story {i} | Objective: {obj}, Measure 0: {meas[0]}, Measure 1: {meas[1]}")

Story 0 | Objective: 7.0, Measure 0: 2.0, Measure 1: 2.2
Story 1 | Objective: 7.0, Measure 0: 7.6, Measure 1: 8.4


## QDAIF Components in pyribs

Like other QD algorithms in pyribs, QDAIF is composed of an archive, emitters, and a scheduler. Below we define each component.
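
To make the roles of these three components concrete, here is a toy sketch of the loop they implement, with no LLM involved. Everything in it is a stand-in and not the pyribs API: the "archive" is a dict keyed by cell, the "emitter" mutates an integer, the "evaluator" is a simple formula, and the "scheduler" is the loop itself.

```python
import random

rng = random.Random(0)
archive = {}  # Maps cell -> (objective, solution).


def evaluate(x: int) -> tuple[int, int]:
    """Toy evaluator: the objective prefers x near 50; the measure is x mod 10."""
    return -abs(x - 50), x % 10


def add(x: int) -> None:
    """Stores x in its cell if the cell is empty or x has a better objective."""
    objective, measure = evaluate(x)
    if measure not in archive or objective > archive[measure][0]:
        archive[measure] = (objective, x)


add(0)  # Initial solution.
for _ in range(200):
    # "Emitter": sample an existing elite and mutate it.
    _, parent = rng.choice(list(archive.values()))
    child = parent + rng.choice([-3, -1, 1, 3])
    # "Evaluator" and "archive": score the child and try to insert it.
    add(child)

print(f"{len(archive)} of 10 cells filled")
```

QDAIF follows the same pattern, except that the solutions are stories and both the mutation and the evaluation are carried out by an LLM.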

### GridArchive for Storing Stories

The archive for QDAIF is a [`GridArchive`](https://docs.pyribs.org/en/latest/api/ribs.archives.GridArchive.html), which divides the measure space into a grid and stores a story in each grid cell. Below, we specify the dimensions (`dims`) of the grid to be $20 \times 20$, and the `ranges` to be 1 to 10 for each dimension.

For those familiar with the `GridArchive` from previous tutorials, the settings for this `GridArchive` are slightly different since it must store text-based solutions, whereas previous tutorials involved solutions that were continuous vectors. The differences are as follows. First, we set `solution_dim` to be `()`, indicating a scalar value. Second, we set `dtype` such that the `solution` is an `object` (while the `objective` and `measures` remain as floating-point values). This way, the archive can store pieces of text, which are single objects of type `str` (i.e., "scalar" objects).

In [19]:
from ribs.archives import GridArchive

archive = GridArchive(
    solution_dim=(),
    dims=[20, 20],
    ranges=[(1, 10), (1, 10)],
    dtype={"solution": object, "objective": np.float32, "measures": np.float32},
)
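
With 20 cells spanning the range 1 to 10, each cell is 0.45 wide along each measure. As a rough illustration of how a pair of measures selects a cell, the sketch below maps a value to a cell index. The function `cell_index` is hypothetical, and the handling of boundary values inside pyribs may differ in its details.

```python
def cell_index(value: float, low: float, high: float, n_cells: int) -> int:
    """Maps a measure value in [low, high] to one of n_cells equal-width cells."""
    fraction = (value - low) / (high - low)
    index = int(fraction * n_cells)
    # Values on or beyond the boundaries are clipped into the outermost cells.
    return min(max(index, 0), n_cells - 1)


# Measures of Story 1 from the evaluator example above: (7.6, 8.4).
print(cell_index(7.6, 1, 10, 20), cell_index(8.4, 1, 10, 20))  # 14 16
```

Two stories whose measures fall in the same cell compete for that cell, and the archive keeps the one with the higher objective.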

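### Custom Emitter for Generating Stories

The emitter is responsible for generating new stories. Here, we define a custom `LLMDirectionalEmitter` by subclassing `EmitterBase`. While the archive is empty, its `ask` method returns the initial stories. Afterwards, for each solution in the batch, it samples a story from the archive, randomly chooses a direction (decrease or increase) for each measure, and prompts the LLM to modify the story accordingly.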

In [21]:
from ribs.archives import ArchiveBase
from ribs.emitters import EmitterBase


class LLMDirectionalEmitter(EmitterBase):
    """Uses LLMs to modify pieces of text in random archive directions.

    Args:
        archive: Archive of solutions, e.g., :class:`ribs.archives.GridArchive`. The
            archive must contain solutions of type :class:`str`.
        model: LLM for mutating pieces of text.
        batch_size: Number of solutions to return in :meth:`ask`.
        initial_solutions: Initial pieces of text for the LLM.
        seed: Value to seed the random number generator. Set to None to avoid a fixed
            seed.
    """

    def __init__(
        self,
        archive: ArchiveBase,
        *,
        model: BaseChatModel,
        batch_size: int,
        initial_solutions: list[str],
        seed: int | None = None,
    ):
        EmitterBase.__init__(
            self,
            archive,
            solution_dim=archive.solution_dim,
            bounds=None,
        )

        self._model = model
        self._batch_size = batch_size
        self._initial_solutions = initial_solutions
        self._rng = np.random.default_rng(seed)

        self._mutation_template = ChatPromptTemplate(
            [
                (
                    "system",
                    "The following is a story about two characters, a suspicious spy, and a rich politician. Modify the story in the following ways: {measure_0_direction}, and {measure_1_direction}. Output only the new story.",
                ),
                ("user", "{text}"),
            ]
        )
        self._mutation_dirs = {
            # Each measure has two possible directions: one decreases the measure while
            # the other increases the measure.
            "measure_0": [
                "make the story sound less like a romance story",
                "make the story sound more like a romance story",
            ],
            "measure_1": [
                "make the ending of the story less happy",
                "make the ending of the story more happy",
            ],
        }

        class Story(BaseModel):
            story: str = Field(description="The modified story.")

        self._mutation_chain = (
            self._mutation_template | self._model.with_structured_output(Story)
        )

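    def ask(self):
        """Returns the initial stories if the archive is empty; otherwise, returns
        `batch_size` stories mutated from randomly sampled elites."""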
        if self.archive.empty:
            return self._initial_solutions

        prompts = []
        for _ in range(self._batch_size):
            # For both measure_0 and measure_1 (hence size=2), choose between the two
            # possible directions.
            dirs = self._rng.choice(2, size=2)

            prompts.append(
                {
                    "text": self._archive.sample_elites(1)["solution"][0],
                    "measure_0_direction": self._mutation_dirs["measure_0"][dirs[0]],
                    "measure_1_direction": self._mutation_dirs["measure_1"][dirs[1]],
                }
            )

        stories = self._mutation_chain.batch(prompts)
        return [s.story for s in stories]
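
To see what the LLM is actually asked to do, the direction sampling in `ask` can be reproduced without a model. The sketch below uses Python's `random` module and plain string formatting in place of the emitter's NumPy generator and LangChain template; the names `MUTATION_DIRS`, `SYSTEM_TEMPLATE`, and `sample_system_prompt` are introduced here only for illustration.

```python
import random

MUTATION_DIRS = {
    "measure_0": [
        "make the story sound less like a romance story",
        "make the story sound more like a romance story",
    ],
    "measure_1": [
        "make the ending of the story less happy",
        "make the ending of the story more happy",
    ],
}

SYSTEM_TEMPLATE = (
    "Modify the story in the following ways: {measure_0_direction}, and "
    "{measure_1_direction}. Output only the new story."
)


def sample_system_prompt(rng: random.Random) -> str:
    """Picks one of the two directions for each measure and fills the template."""
    dir_0 = rng.randrange(2)
    dir_1 = rng.randrange(2)
    return SYSTEM_TEMPLATE.format(
        measure_0_direction=MUTATION_DIRS["measure_0"][dir_0],
        measure_1_direction=MUTATION_DIRS["measure_1"][dir_1],
    )


rng = random.Random(0)
for _ in range(3):
    print(sample_system_prompt(rng))
```

Since each measure has two directions, there are four possible instructions, each nudging a sampled story towards a different region of the archive.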

In [23]:
emitters = [
    LLMDirectionalEmitter(
        archive,
        model=model,
        batch_size=1,
        # From QDAIF paper (Appendix A.21).
        initial_solutions=[
            "A spy named Joanne wants to infiltrate the premises of Karl Johnson, a highly-influential figure in the city. Karl was a wealthy mayor, and would do anything in his power to suppress any opposing voices. Joanne wanted to figure out what Karl was hiding, but she took a turn for the worse, as she was highly suspicious in her presence outside his home.",
            "The wealthy entrepreneur and member of parliament, Susan, hosted a party at her mansion. She invited all of the residents, as well as an unusual looking man. The man, Dave, was wearing a tacky shirt, and star-shaped glasses, and was actually a spy. He made the whole room laugh with his jokes, and had a secret agenda - to find what Susan does in her private fun room!",
            "The rich politician, Tom’s life took a turn for the worst - he feared all of his close aides all of a sudden after sensing danger in his clique. There was a civil war going on, and he feared for his life. One day, one of his security guards, turned secret agent, decided to sneak into the classified files room, and spied on Johnny, who was in the room. He wanted to find Johnny’s weakness, and strike at the right time.",
        ],
    )
]

### Scheduler

In [24]:
from ribs.schedulers import Scheduler

scheduler = Scheduler(archive, emitters)

## Running QDAIF
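
The loop in this section calls a `plot_archive` helper that is not defined in the text of this tutorial. A minimal stand-in is sketched below, assuming `grid_archive_heatmap` from `ribs.visualize` and matplotlib, both of which come with the `ribs[visualize]` extra. The figure size, title, axis labels, and color limits are illustrative choices and may not match the original helper.

```python
def plot_archive(archive, itr):
    """Draws a heatmap of the archive, colored by objective (stand-in helper)."""
    # Imported inside the function so that the helper can be defined even if the
    # plotting extras are not yet installed.
    import matplotlib.pyplot as plt
    from ribs.visualize import grid_archive_heatmap

    plt.figure(figsize=(6, 5))
    # Ratings range from 1 to 10, so the color scale is fixed to that range.
    grid_archive_heatmap(archive, vmin=1, vmax=10)
    plt.title(f"Archive at iteration {itr}")
    plt.xlabel("Measure 0 (romance)")
    plt.ylabel("Measure 1 (happy ending)")
    plt.show()
```

Fixing `vmin` and `vmax` keeps the colors comparable between heatmaps drawn at different iterations.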

In [26]:
import pickle as pkl
import sys

from tqdm import tqdm, trange

total_itrs = 200

for itr in trange(1, total_itrs + 1, file=sys.stdout, desc="Iterations"):
    solutions = scheduler.ask()
    objectives, measures = evaluator.evaluate(solutions)
    scheduler.tell(objectives, measures)

    if itr % 5 == 0 or itr == total_itrs:
        tqdm.write(
            f"Iteration {itr:5d} | "
            f"Archive Coverage: {archive.stats.coverage * 100:6.3f}%  "
            f"QD Score: {archive.stats.qd_score:6.3f}"
        )

        plot_archive(archive, itr)

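        # Save a checkpoint: the archive contents as a CSV file, and the scheduler
        # (which holds the archive and emitters) as a pickle, so that the run can
        # be inspected or resumed later.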
        archive.data(return_type="pandas").to_csv("qdaif_archive.csv")
        with open("qdaif_scheduler.pkl", "wb") as file:
            pkl.dump(scheduler, file)


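To resume a run later, load the pickled scheduler that the loop above saves. The restored scheduler carries its archive and emitters with it, so the restored archive is available as `scheduler.archive`.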

In [None]:
with open("qdaif_scheduler.pkl", "rb") as file:
    scheduler = pkl.load(file)
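
Since the pickle above stores only the scheduler, a resumed run does not know which iteration it stopped at. One possible remedy, sketched here with a plain dict standing in for the scheduler and with hypothetical helper names, is to store the iteration count alongside the pickled state.

```python
import os
import pickle as pkl
import tempfile


def save_checkpoint(path, state, itr):
    """Pickles the state together with the iteration it was saved at."""
    with open(path, "wb") as file:
        pkl.dump({"state": state, "itr": itr}, file)


def load_checkpoint(path):
    """Returns the (state, itr) pair stored by save_checkpoint."""
    with open(path, "rb") as file:
        checkpoint = pkl.load(file)
    return checkpoint["state"], checkpoint["itr"]


# Round trip in a temporary directory; the dict stands in for the scheduler.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "checkpoint.pkl")
    save_checkpoint(path, {"stories": ["a spy story"]}, itr=25)
    state, itr = load_checkpoint(path)

print(f"Resuming from iteration {itr + 1}")
```

A resumed loop could then start from `itr + 1` rather than from 1. As with any pickle, only load checkpoint files from sources you trust.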

## Citation

If you find this tutorial useful, please cite it as:

```
@article{pyribs_qdaif,
  title   = {Orchestrating LLMs to Write Diverse Stories with Quality Diversity through AI Feedback},
  author  = {Bryon Tjanaka},
  journal = {pyribs.org},
  year    = {2025},
  url     = {https://docs.pyribs.org/en/stable/tutorials/qdaif.html}
}
```

## Credits

Thank you to [Sid Srikanth](https://sidsrikanth.com/), [Saeed Hedayatian](https://conflictednerd.github.io/), and the members of the ICAROS Lab for their invaluable feedback in developing this tutorial.