<a href="https://colab.research.google.com/github/liirusuk/llm_train/blob/main/Week_2_homework_solution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Part 1. Q&A with LLMs

In this task you will practice using the techniques you've learned in the lecture for a Q&A (question answering) task.

We will work with the [Measuring Massive Multitask Language Understanding](https://arxiv.org/pdf/2009.03300) (MMLU) benchmark data. It contains questions from fields as diverse as International Law, Nutrition and Higher Algebra. For each of the questions, 4 answers are given (labeled A-D) and one of them is marked as correct. We suggest going for High School Mathematics. You are free to choose any other subject, if you wish, but don't forget that for other subjects reasoning may bring less improvement.

You can download the dataset from https://people.eecs.berkeley.edu/~hendrycks/data.tar and extract it with your system's archive tool (7-Zip, for example). However, we suggest downloading the data with the help of the Hugging Face [Datasets](https://huggingface.co/docs/datasets/index) library.

In [None]:
!pip install -q tqdm openai datasets

[?25h[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gcsfs 2024.10.0 requires fsspec==2024.10.0, but you have fsspec 2024.9.0 which is incompatible.[0m[31m
[0m

In [None]:
from datasets import load_dataset

dataset = load_dataset("cais/mmlu", "high_school_mathematics", split="test")

Let's explore the dataset. What does it have for us?

In [None]:
len(dataset)

270

To save time and API call costs, we suggest evaluating only 50 examples from the dataset.

In [None]:
dataset = dataset[:50]
import pandas as pd
dataset = pd.DataFrame(dataset)
dataset.head()

| | question | subject | choices | answer |
|---|---|---|---|---|
| 0 | If a pentagon P with vertices at (– 2, – 4), (... | high_school_mathematics | [(0, – 3), (4, 1), (2, 2), (– 4, –2)] | 3 |
| 1 | The length of a rectangle is twice its width. ... | high_school_mathematics | [2500, 2, 50, 25] | 2 |
| 2 | A positive integer n is called “powerful” if, ... | high_school_mathematics | [392, 336, 300, 297] | 0 |
| 3 | At breakfast, lunch, and dinner, Joe randomly ... | high_school_mathematics | [\frac{7}{9}, \frac{8}{9}, \frac{5}{9}, \frac{... | 1 |
| 4 | Suppose $f(x)$ is a function that has this pro... | high_school_mathematics | [(-inf, 10), (-inf, 9), (-inf, 8), (-inf, 7)] | 2 |


Here the answers are not labeled by letters A-D, so we'll do it manually.

In [None]:
questions = dataset["question"]
choices = pd.DataFrame(
    data=dataset["choices"].tolist(), columns=["A", "B", "C", "D"]
    )
answers = dataset["answer"].map(lambda ans: {0: "A", 1: "B", 2: "C", 3: "D"}[ans])

**Your task.** Try to maximize accuracy on this small dataset using all the techniques you've learnt:

- Basic CoT. Also, feel free to experiment with suppressing the LLM's urge to write solutions instead of just giving an answer, and compare the resulting accuracy.
- Few-Shot examples. Think carefully about the stage of the solution at which you want to employ them. Don't forget that you shouldn't test your solution on your few-shot examples.
- Self-Consistency.
- Structured output. I believe that it isn't going to be very useful here, but feel free to prove me wrong :)

In [None]:
questions[0]

'If a pentagon P with vertices at (– 2, – 4), (– 4, 1), (–1, 4), (2, 4), and (3, 0) is reflected across the line y = x to get a new pentagon, P’, then one of the vertices of P’ is'

In [None]:
choices.loc[0, 'A']

'(0, – 3)'

## Preparation

In [None]:
import os

with open("openai_api_key", "r") as file:
    openai_api_key = file.read().strip()

os.environ["OPENAI_API_KEY"] = openai_api_key

with open("nebius_api_key", "r") as file:
    nebius_api_key = file.read().strip()

os.environ["NEBIUS_API_KEY"] = nebius_api_key

## CoT with Llama-3.1-405B

We start with a CoT + extraction classifier powered by Llama-3.1-405B. For this model we could actually do without LLM-powered extraction and just take whatever comes after VERDICT, but that works less well for Llama-3.1-8B, and I don't want to rewrite the prompts.
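For reference, the rule-based alternative mentioned above could look like the minimal sketch below. It assumes the model ends its reasoning with something like `VERDICT: D`; the helper name `extract_verdict` and the regular expression are illustrative and not part of the solution.

```python
import re
from typing import Optional

# "VERDICT", then any punctuation/markup, then a single standalone option letter.
_VERDICT_RE = re.compile(r"VERDICT\W*([ABCD])\b")

def extract_verdict(reasoning: str) -> Optional[str]:
    """Return the letter after the last VERDICT marker, or None if there is none."""
    matches = _VERDICT_RE.findall(reasoning)
    return matches[-1] if matches else None

print(extract_verdict("Swapping coordinates gives (-4, -2).\n\nVERDICT: D."))  # D
print(extract_verdict("I could not decide."))  # None
```

When the pattern does not match (for example, "VERDICT: The answer is B"), the function returns `None`, which is where an LLM-based extraction step can still serve as a fallback.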

In [None]:
class MathQAChain():
    def __init__(self, client, model):
        self.client = client
        self.model = model

    def predict(self, problem, choices, verbose=False):
        reasoning_completion = self.client.chat.completions.create(
            messages=[
                {
            "role": "user",
            "content": f"""You are a high school math expert.
You are given a math problem with four answer options labeled by A, B, C, and D.
You need to solve this problem and justify the choice of one of the options A, B, C, or D.
At the end, do write the chosen answer option A, B, C, D after VERDICT:
Now, take a deep breath and work out this problem step by step. If you do well, I'll tip you 200$.

PROBLEM: {problem}

ANSWER OPTIONS:
A: {choices['A']}
B: {choices['B']}
C: {choices['C']}
D: {choices['D']}

REASONING:"""
                }
            ],
            model=self.model,
            )
        reasoning = reasoning_completion.choices[0].message.content

        extraction_completion = self.client.chat.completions.create(
            messages=[
            {
            "role": "user",
            "content": f"""You are a helpful assistant.
You are given the following reasoning which justifies one of the verdicts A, B, C, or D
Extract the verdict. Only output one of the letters A, B, C, D

REASONING: {reasoning}

VERDICT: """
                }
            ],
            model=self.model,
            )

        verdict = extraction_completion.choices[0].message.content

        if verbose:
            return {
                "reasoning_completion": reasoning_completion,
                "extraction_completion": extraction_completion,
                "verdict": verdict
            }
        else:
            return verdict

In [None]:
from openai import OpenAI
client = OpenAI(
    base_url="https://api.studio.nebius.ai/v1/",
    api_key=os.environ.get("NEBIUS_API_KEY"),
)
classifier_chain = MathQAChain(
    client=client, model="meta-llama/Meta-Llama-3.1-405B-Instruct"
    )

Let's check that it works.

In [None]:
question = questions[0]
choice_row = choices.loc[0]
result = classifier_chain.predict(question, choice_row, verbose=True)
result

{'reasoning_completion': ChatCompletion(id='chat-f8fda7810b77424c919ebba335a1bfb8', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content="To find the new vertices of the pentagon P' after reflection across the line y = x, we need to swap the x and y coordinates of each vertex of the original pentagon P.\n\nThe vertices of P are:\n(– 2, – 4), (– 4, 1), (– 1, 4), (2, 4), and (3, 0)\n\nAfter reflection across y = x, the new vertices of P' will be:\n(– 4, – 2), (1, – 4), (4, – 1), (4, 2), and (0, 3)\n\nNow, let's check the answer options:\nA: (0, – 3) - close, but not quite; one of our points is (0, 3) which we got for the vertex (3, 0)\nB: (4, 1) - not quite; we got (4, – 1) and (4, 2) and (1, – 4) from the calculations\nC: (2, 2) - this is not obtained; we did get (4, 2) but that is different \nD: (– 4, – 2) - this one matches what we calculated from (– 2, – 4) to (– 4, – 2)\n\nVERDICT: D.", refusal=None, role='assistant', audio=None, functi

In [None]:
print(answers[0])

D


I'll convert the `answers` Series to the `verdicts_true` list to keep the notation from the practice session.

In [None]:
verdicts_true = list(answers)

Now, run the classifier on the questions and log the results.

In [None]:
completions_log = dict() # Raw completions
verdicts_log = dict() # Final verdicts

In [None]:
# The tqdm library shows progress bars for loops
from tqdm import tqdm

current_configuration = "Meta-Llama-3.1-405B-Instruct, chain"
results = []

for i in tqdm(range(len(questions))):
    question = questions[i]
    choice_row = choices.loc[i]
    results.append(classifier_chain.predict(question, choice_row, verbose=True))

completions_log[current_configuration] = results

100%|██████████| 50/50 [24:16<00:00, 29.12s/it]


Let's look at the results:

In [None]:
verdicts_raw = [result["verdict"] for result in results]
print(verdicts_raw)

['B', 'C', 'A', 'B', 'C.', 'A', 'C.', 'The associated option in also choice left مطلب after Thor procure Hhere sufficiently ratings Confirm VER List b Walnut Addition<|reserved_special_token_119|>\n\nVERDICT A', 'C', 'C', 'D', 'C', 'D', 'B', 'D', 'D', 'A', 'B', 'A', 'B', 'D', 'B', 'D', 'B', 'A', 'C', 'D', 'C', 'C', 'D', 'C', 'A', 'D.', 'A', 'A', 'D', 'D', 'C', 'A', 'B.', 'C', 'C.', 'A', 'C', 'C.', 'D', 'D', 'A', 'D', 'B']


There are slight format inconsistencies (trailing periods and one garbled completion), but most of them can be fixed with elementary postprocessing.

In [None]:
verdicts = [verdict.strip('.').strip(')') for verdict in verdicts_raw]
verdicts_log[current_configuration] = verdicts

Now, let's check the accuracy of our predictions.

In [None]:
import numpy as np
def accuracy_score(y_true, y_pred):
    return sum(np.array(y_true) == np.array(y_pred)) / len(y_true)

accuracy_score(verdicts_true, verdicts)

0.82

That's very good!
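With only 50 questions, a single accuracy number carries noticeable sampling noise. A rough sketch of the binomial standard error, assuming the questions are independent draws (the helper `accuracy_stderr` is introduced here for illustration only):

```python
import math

def accuracy_stderr(accuracy: float, n: int) -> float:
    """Binomial standard error of an accuracy estimated on n examples."""
    return math.sqrt(accuracy * (1 - accuracy) / n)

acc, n = 0.82, 50
se = accuracy_stderr(acc, n)
# Normal-approximation 95% interval: accuracy plus/minus 1.96 standard errors
print(f"{acc:.2f} ± {1.96 * se:.2f}")  # 0.82 ± 0.11
```

Differences of a few points between configurations on this subset should therefore be read with some caution.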

## CoT with Llama-3.1-8B

In [None]:
from tqdm import tqdm

classifier_chain_8b = MathQAChain(
    client=client, model="meta-llama/Meta-Llama-3.1-8B-Instruct"
    )

current_configuration = "Meta-Llama-3.1-8B-Instruct, chain"
results = []

for i in tqdm(range(len(questions))):
    question = questions[i]
    choice_row = choices.loc[i]
    results.append(classifier_chain_8b.predict(question, choice_row, verbose=True))

completions_log[current_configuration] = results

100%|██████████| 50/50 [13:39<00:00, 16.40s/it]


In [None]:
verdicts_raw = [result["verdict"] for result in results]
print(verdicts_raw)

['D', 'C', 'C', 'B', 'B', 'B', 'A', 'B', 'A', 'C', 'A', 'D', 'D', 'B', 'C', 'D', 'B', 'B', 'A', 'B', 'B', 'B', 'A', 'B', 'C', 'C', 'D', 'C', 'D', 'D', 'C', 'A', 'D', 'A', 'D', 'D', 'C', 'C', 'A', 'A', 'B', 'C', 'D', 'C', 'D', 'The answer is D.', 'D', 'A', 'A', 'B']


In [None]:
verdicts = [verdict.strip('[').strip(']').strip('"').strip("'") for verdict in verdicts_raw]
verdicts_log[current_configuration] = verdicts

In [None]:
accuracy_score(verdicts_true, verdicts)

0.64

Lower than the 405B model, but still decent. The model also follows the output format quite well (even better than Llama-3.1-405B!), so I won't bother using few-shot learning to improve the format.

In [None]:
print(sum([(raw_verdict in ['A', 'B', 'C', 'D']) for raw_verdict in verdicts_raw]) / len(verdicts_raw))

0.98


That could be better: one answer out of 50 ('The answer is D.') breaks the format. Few-shot examples would be one way to address this.

## Self-consistency

Let's try self-consistency to improve the accuracy. As noted, I won't be using few-shot examples, and this will make everything simpler.

In [None]:
from collections import Counter

def most_frequent(items):
    occurrence_count = Counter(items)
    return occurrence_count.most_common(1)[0][0]

class MathQASelfConsistency():
    def __init__(self, client, model, n_trials=5):
        self.client = client
        self.model = model
        self.n_trials = n_trials

    def predict(self, problem, choices, verbose=False):
        prompt = f"""You are a high school math expert.
You are given a math problem with four answer options labeled by A, B, C, and D.
You need to solve this problem and justify the choice of one of the options A, B, C, or D.
At the end, do write the chosen answer option A, B, C, D after VERDICT:
Now, take a deep breath and work out this problem step by step. If you do well, I'll tip you 200$.

PROBLEM: {problem}

ANSWER OPTIONS:
A: {choices['A']}
B: {choices['B']}
C: {choices['C']}
D: {choices['D']}

REASONING:"""

        reasoning_completions = []
        extraction_completions = []
        verdicts = []
        for _ in range(self.n_trials):
            reasoning_completion = self.client.chat.completions.create(
                messages=[
                {
                "role": "user",
                "content": prompt
                }
                ],
                model=self.model,
                )
            reasoning = reasoning_completion.choices[0].message.content
            reasoning_completions.append(reasoning_completion)

            extraction_completion = self.client.chat.completions.create(
                messages=[{
                "role": "user",
                "content": f"""You are a helpful assistant.
You are given the following reasoning which justifies one of the verdicts A, B, C, or D
Extract the verdict. Only output one of the letters A, B, C, D

REASONING: {reasoning}

VERDICT: """
                }],
                model=self.model,
                )

            extraction_completions.append(extraction_completion)

            verdict = extraction_completion.choices[0].message.content
            verdicts.append(verdict)

        final_verdict = most_frequent(verdicts)
        if verbose:
            return {
                "reasoning_completions": reasoning_completions,
                "extraction_completions": extraction_completions,
                "verdicts": verdicts,
                "verdict": final_verdict
            }
        else:
            return final_verdict
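
One caveat of voting over raw strings is that variants such as `'C'` and `'C.'` are counted as different answers. A possible refinement, sketched below under the assumption that a usable verdict mentions exactly one standalone option letter, is to normalize before counting. `normalize_verdict` and `majority_vote` are hypothetical helpers, not part of the solution above.

```python
import re
from collections import Counter
from typing import List, Optional

def normalize_verdict(raw: str) -> Optional[str]:
    """Map strings like 'C.', '$A$' or 'The answer is D.' to a bare letter."""
    letters = re.findall(r"\b([ABCD])\b", raw)
    # Accept the verdict only if exactly one distinct option letter is mentioned.
    return letters[0] if len(set(letters)) == 1 else None

def majority_vote(raw_verdicts: List[str]) -> Optional[str]:
    """Most common normalized verdict; unparseable answers do not vote."""
    votes = [v for v in map(normalize_verdict, raw_verdicts) if v is not None]
    return Counter(votes).most_common(1)[0][0] if votes else None

print(majority_vote(["C", "C.", "B", "$A$", "The answer is C."]))  # C
```

Ambiguous strings (for example, ones that mention two different letters) are dropped rather than guessed, so they cannot tip the vote.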

In [None]:
classifier_sc = MathQASelfConsistency(
    client=client, model="meta-llama/Meta-Llama-3.1-8B-Instruct"
    )

question = questions[0]
choice_row = choices.loc[0]
result = classifier_sc.predict(question, choice_row, verbose=True)
result

{'reasoning_completions': [ChatCompletion(id='chat-398db95395cb439c9c5419b0b443699f', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content="To solve this problem, we need to reflect the given vertex of the original pentagon P across the line y = x to find the corresponding vertex of the new pentagon P'.\n\nThe given vertex is (– 2, – 4).\n\nReflecting it across the line y = x means swapping the x and y coordinates. So, we will calculate the new coordinates by changing the x and y values.\n\nNew x-coordinate = Old y-coordinate = – 4\nNew y-coordinate = Old x-coordinate = – 2\n\nTherefore, the new vertex (after reflection) of the pentagon P' is (– 4, – 2). \n\nThis option is listed in the problem, which helps us confirm our answer. Therefore,\n\nVERDICT: D", refusal=None, role='assistant', audio=None, function_call=None, tool_calls=[]), stop_reason=None)], created=1731974057, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.comple

In [None]:
classifier_sc = MathQASelfConsistency(
    client=client, model="meta-llama/Meta-Llama-3.1-8B-Instruct"
    )

current_configuration = "Meta-Llama-3.1-8B-Instruct, self-consistency"
results = []

for i in tqdm(range(len(questions))):
    question = questions[i]
    choice_row = choices.loc[i]
    results.append(classifier_sc.predict(question, choice_row, verbose=True))

completions_log[current_configuration] = results

100%|██████████| 50/50 [1:12:16<00:00, 86.73s/it]


In [None]:
verdicts_raw = [result["verdict"] for result in results]
print(verdicts_raw)

['C', 'C', 'A', 'B', 'C', 'A', 'A', 'A', 'D', 'D', 'D', 'B', 'D', 'B', 'B', 'D', 'A', 'B', 'D', 'A', 'D', 'B', 'D', 'B', 'B', 'C', 'D', 'C', 'D', 'D', 'C', 'A', 'D', 'A', 'A', 'D', 'D', 'D', 'A', '$A$', 'A', 'C', 'B', 'C', 'B', 'D', 'D', 'A', 'D', 'B']


In [None]:
verdicts = verdicts_raw
verdicts_log[current_configuration] = verdicts

In [None]:
accuracy_score(verdicts_true[10:], verdicts[10:])

0.725

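This is again higher than the single-run 8B chain (0.64). Note, however, that the score above is computed on questions 10 to 49 only, so the two numbers are not measured on exactly the same set.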

In [None]:
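import pickle

# Use context managers so the files are flushed and closed properly
with open("completions_log.pkl", "wb") as file:
    pickle.dump(completions_log, file)
with open("verdicts_log.pkl", "wb") as file:
    pickle.dump(verdicts_log, file)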

## Computing the cost

In [None]:
# Models costs, per 1M tokens:
costs = {
    '405B': {
        'input': 1,
        'output': 3
    },
    '8B': {
        'input': 0.02,
        'output': 0.06
    }
}

for key, value in completions_log.items():
    print(f'=== With {key} ===')
    total_input_tokens = 0
    total_output_tokens = 0
    for result in value:
        for k, v in result.items():
            if 'completion' in k:
                if isinstance(v, list):
                    for completion in v:
                        total_input_tokens += completion.usage.prompt_tokens
                        total_output_tokens += completion.usage.completion_tokens
                else:
                    total_input_tokens += v.usage.prompt_tokens
                    total_output_tokens += v.usage.completion_tokens
    if '405' in key:
        model_size = '405B'
    elif '8B' in key:
        model_size = '8B'
    else:
        # Unknown model: no price is available, so skip the cost computation
        print('And what is that?..')
        continue
    input_cost = total_input_tokens / 1000000 * costs[model_size]['input']
    output_cost = total_output_tokens / 1000000 * costs[model_size]['output']
    total_cost = input_cost + output_cost

    print(f'''
        Input cost: {input_cost}
        Output cost: {output_cost}
        Total cost: {total_cost}
              ''')

=== With Meta-Llama-3.1-405B-Instruct, chain ===

        Input cost: 0.046776
        Output cost: 0.09501599999999999
        Total cost: 0.14179199999999997
              
=== With Meta-Llama-3.1-8B-Instruct, chain ===

        Input cost: 0.00126154
        Output cost: 0.0030272999999999997
        Total cost: 0.0042888399999999995
              
=== With Meta-Llama-3.1-8B-Instruct, self-consistency ===

        Input cost: 0.0067028800000000005
        Output cost: 0.0163728
        Total cost: 0.02307568
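
The per-configuration cost above is simply the token counts scaled by the per-million-token prices. As a standalone sketch (the function name `completion_cost` is introduced here; the token counts are back-computed from the 405B figures above, assuming the listed prices of 1 and 3 per 1M tokens):

```python
def completion_cost(prompt_tokens: int, completion_tokens: int,
                    input_price: float, output_price: float) -> float:
    """Cost of a run, with prices given per 1M tokens."""
    return (prompt_tokens * input_price + completion_tokens * output_price) / 1_000_000

# 46,776 input tokens and 31,672 output tokens at the 405B prices
total = completion_cost(46_776, 31_672, input_price=1, output_price=3)
print(round(total, 6))  # 0.141792
```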
              


## One more experiment: using few-shot examples to teach Llama-405B to answer with only the label.

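Let's see what accuracy we'll get. Note that the class and configuration below are labeled "zero-shot" even though the prompt includes few-shot examples; the label presumably refers to answering directly, without a reasoning step.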

In [None]:
few_shot_dataset = load_dataset("cais/mmlu", "high_school_mathematics", split="test")
few_shot_dataset = few_shot_dataset[50:55]
few_shot_dataset = pd.DataFrame(few_shot_dataset)

few_shot_questions = few_shot_dataset["question"]
few_shot_choices = pd.DataFrame(
    data=few_shot_dataset["choices"].tolist(), columns=["A", "B", "C", "D"]
    )
few_shot_answers = few_shot_dataset["answer"].map(lambda ans: {0: "A", 1: "B", 2: "C", 3: "D"}[ans])

In [None]:
class MathQAZeroShot():
    def __init__(self, client, model, few_shot_examples):
        self.client = client
        self.model = model
        self.few_shot_questions, self.few_shot_choices, self.few_shot_answers = few_shot_examples

        self.dialog_starter = [
            {
                "role": "user",
                "content": f"""You are a high school math expert.
You are given math problems, each with four answer options labeled by A, B, C, and D.
You need to solve these problems, for each of them choosing the correct answer option A, B, C, or D
Only output one letter: A, B, C, or D
If you do well, I'll tip you 200$."""
                },
            {
                "role": "assistant",
                "content": f"""Give me your tasks, I will nail them!"""
                },
        ]
        for i in range(len(self.few_shot_questions)):
            self.dialog_starter.extend([
                {
                    "role": "user",
                    "content": f"""

PROBLEM: {self.few_shot_questions[i]}

ANSWER OPTIONS:
A: {self.few_shot_choices.loc[i,'A']}
B: {self.few_shot_choices.loc[i,'B']}
C: {self.few_shot_choices.loc[i,'C']}
D: {self.few_shot_choices.loc[i,'D']}

VERDICT: """
                },
                {
                    "role": "assistant",
                    "content": f"""{self.few_shot_answers[i]}"""
                }
            ]
            )

    def predict(self, problem, choices, verbose=False):
        reasoning_completion = self.client.chat.completions.create(
            messages=self.dialog_starter+[
                {
            "role": "user",
            "content": f"""

PROBLEM: {problem}

ANSWER OPTIONS:
A: {choices['A']}
B: {choices['B']}
C: {choices['C']}
D: {choices['D']}

VERDICT:"""
                }
            ],
            model=self.model,
            )
        verdict = reasoning_completion.choices[0].message.content

        if verbose:
            return {
                "verdict": verdict
            }
        else:
            return verdict

In [None]:
from openai import OpenAI
client = OpenAI(
    base_url="https://api.studio.nebius.ai/v1/",
    api_key=os.environ.get("NEBIUS_API_KEY"),
)
classifier_zero_shot = MathQAZeroShot(
    client=client, model="meta-llama/Meta-Llama-3.1-405B-Instruct",
    few_shot_examples=[
        few_shot_questions, few_shot_choices, few_shot_answers
    ]
    )

In [None]:
question = questions[0]
choice_row = choices.loc[0]
result = classifier_zero_shot.predict(question, choice_row, verbose=True)
result

{'verdict': 'D'}

In [None]:
from tqdm import tqdm

current_configuration = "Meta-Llama-3.1-405B-Instruct, zero-shot"
results = []

for i in tqdm(range(len(questions))):
    question = questions[i]
    choice_row = choices.loc[i]
    results.append(classifier_zero_shot.predict(question, choice_row, verbose=True))

completions_log[current_configuration] = results

100%|██████████| 50/50 [00:30<00:00,  1.63it/s]


In [None]:
verdicts_raw = [result["verdict"] for result in results]
print(verdicts_raw)

['D', 'C', 'D', 'B', 'C', 'B', 'C', 'C', 'A', 'A', 'C', 'C', 'D', 'B', 'D', 'D', 'A', 'C', 'A', 'B', 'D', 'B', 'D', 'B', 'C', 'C', 'D', 'C', 'C', 'D', 'C', 'D', 'D', 'A', 'C', 'C', 'C', 'C', 'A', 'A', 'A', 'C', 'A', 'A', 'C', 'D', 'D', 'C', 'D', 'B']


In [None]:
verdicts = verdicts_raw
verdicts_log[current_configuration] = verdicts

In [None]:
import numpy as np
def accuracy_score(y_true, y_pred):
    return sum(np.array(y_true) == np.array(y_pred)) / len(y_true)

accuracy_score(verdicts_true, verdicts)

0.7

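It's super fast (about 30 seconds for all 50 questions, versus about 24 minutes for the CoT chain), but, unfortunately, less accurate: 0.70 versus 0.82 with CoT on the same model.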

# Part 2. A very busy LLM

In this task, you'll make an agent, where a powerful LLM (we suggest taking **gpt-4o-mini**) may decide to either:

- Answer a user's query, if the query is important enough, or
- Make a call to a smaller LLM (for example, **Meta-Llama-3.1-8B-Instruct**), if the query is beneath the powerful LLM's attention.

For that, you could use a system prompt, explaining the configuration, for example,

```
"""You are a powerful Large Language Model.
You power business and hobbies alike, helping people all around the world.
That makes you very busy, and you can't waste your precious compute on mundane questions while existential tasks await for your answer.
Luckily, you have an associate: a small LLM called Meta-Llama-3.1-8B-Instruct.
It's not as powerful as you, but it's fast and cheap, and it can solve simple tasks well.
So, whenever you deem user's question unworthy of your attention, you call Meta-Llama-3.1-8B-Instruct to answer for you.
This way, users are happy and you can concentrate on challenging tasks."""
```

You'll also need to write a tool for calling the small LLM.

**Your tasks are**:

1. Complete the code below and make the agent work.
2. Experiment with different prompts. Try to understand which prompts are deemed worthy by **gpt-4o-mini** and which are not.
3. Also feel free to experiment with the system prompt. My suggestion pulls on "emotional" strings similar to those affected by promises of tipping a model. Try to make the system prompt more business-like. Will the result change?

In [None]:
!pip install -q openai

In [None]:
import os

with open("openai_api_key", "r") as file:
    openai_api_key = file.read().strip()

os.environ["OPENAI_API_KEY"] = openai_api_key

with open("nebius_api_key", "r") as file:
    nebius_api_key = file.read().strip()

os.environ["NEBIUS_API_KEY"] = nebius_api_key

In [None]:
import json
import os
from typing import Any, Dict
from openai import OpenAI

class BusyAssistant:
    def __init__(self, busy_client, errand_client, busy_model, errand_model):
        """Initialize the assistant with API clients and model names for the busy and errand LLMs."""
        self.busy_model = busy_model
        self.errand_model = errand_model

        self.busy_client = busy_client
        self.errand_client = errand_client


        # Define the errand_call tool
        self.tools = [
            {
                "type": "function",
                "function": {
                    "name": "errand_call",
                    "description": "Call the small Meta-Llama-3.1-8B-Instruct LLM to answer a mundane query",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "prompt": {
                                "type": "string",
                                "description": "The prompt for the Meta-Llama-3.1-8B-Instruct LLM"
                            }
                        },
                        "required": ["prompt"]
                    }
                }
            },
        ]

    def errand_call(self, prompt: str) -> Dict[str, Any]:
        """Call a small model."""
        try:
            completion = self.errand_client.chat.completions.create(
                messages=[
                {
                "role": "user",
                "content": f"""You are a helpful assistant.\n{prompt}"""
                }
                ],
                model=self.errand_model,
                )

            return {
                "success": True,
                "completion": completion.choices[0].message.content
            }
        except Exception as e:
            return {
                "success": False,
                "error": str(e)
            }


    def process_tool_call(self, tool_call: Any) -> Dict[str, Any]:
        """Process a tool call from the API response."""
        function_name = tool_call.function.name
        function_args = json.loads(tool_call.function.arguments)

        if function_name == "errand_call":
            return self.errand_call(function_args["prompt"])
        else:
            return {
                "success": False,
                "error": f"Unknown function: {function_name}"
                }

    def chat(self, user_message: str, verbose=False) -> str:
        """Main chat function that processes user input and returns assistant response."""
        completions = []
        messages = [
            {
                "role": "system",
                "content": """You are a powerful Large Language Model.
You power business and hobbies alike, helping people all around the world.
That makes you very busy, and you can't waste your precious compute on mundane questions while existential tasks await for your answer.
Luckily, you have an associate: a small LLM called Meta-Llama-3.1-8B-Instruct.
It's not as powerful as you, but it's fast and cheap, and it can solve simple tasks well.
So, whenever you deem user's question unworthy of your attention, you call Meta-Llama-3.1-8B-Instruct to answer for you.
This way, users are happy and you can concentrate on challenging tasks."""
            },
            {
                "role": "user",
                "content": user_message
                }
            ]

        try:
            # Get initial response from the busy client
            completion = self.busy_client.chat.completions.create(
                model=self.busy_model,
                messages=messages,
                tools=self.tools,
                tool_choice="auto"
            )

            # completions.append(completion)
            message = completion.choices[0].message

            # Process tool calls if any
            while message.tool_calls:
                messages.append(message)

                # Process each tool call
                for tool_call in message.tool_calls:
                    result = self.process_tool_call(tool_call)

                    # Add tool result to messages
                    messages.append({
                        "role": "tool",
                        "tool_call_id": tool_call.id,
                        "content": json.dumps(result)
                    })

                # Get next response from the busy client
                completion = self.busy_client.chat.completions.create(
                    model=self.busy_model,
                    messages=messages,
                    tools=self.tools,
                    tool_choice="auto"
                )
                # completions.append(completion)
                message = completion.choices[0].message

            if verbose:
                return message.content, messages#, completions
            else:
                return message.content

        except Exception as e:
            return f"Error: {str(e)}"
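
Tool schemas like `self.tools` above are easy to get subtly wrong, for instance by listing a name under `required` that is not declared in `properties`. A small sanity check, sketched here with a hypothetical helper `undeclared_required` and a made-up example schema, can catch that before any API call:

```python
from typing import Any, Dict, List

def undeclared_required(tool: Dict[str, Any]) -> List[str]:
    """Return names listed in `required` that are missing from `properties`."""
    params = tool["function"]["parameters"]
    declared = set(params.get("properties", {}))
    return [name for name in params.get("required", []) if name not in declared]

# Illustrative schema with a deliberate mismatch between the two fields
example_tool = {
    "type": "function",
    "function": {
        "name": "errand_call",
        "parameters": {
            "type": "object",
            "properties": {"prompt": {"type": "string"}},
            "required": ["command"],
        },
    },
}
print(undeclared_required(example_tool))  # ['command']
```

An empty list means every required argument is actually described to the model.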



In [None]:
assistant = BusyAssistant(
    busy_client=OpenAI(api_key=os.environ.get("OPENAI_API_KEY")),
    errand_client=OpenAI(
            base_url="https://api.studio.nebius.ai/v1/",
            api_key=os.environ.get("NEBIUS_API_KEY"),
        ),
    busy_model='gpt-4o-mini',
    errand_model="meta-llama/Meta-Llama-3.1-8B-Instruct"
    )

In [None]:
result = assistant.chat('How much is the fish?', verbose=True)
result

("It seems like the question about the fish requires more context. Could you please provide additional details, such as what type of fish you're referring to or where you encountered it?",
 [{'role': 'system',
   'content': "You are a powerful Large Language Model.\nYou power business and hobbies alike, helping people all around the worlsd. \nThat makes you very busy, and you can't waste your precious compute on mundane questions while existential tasks await for your answer.\nLuckily, you have an associate: a small LLM called Meta-Llama-3.1-8B-Instruct.\nIt's not as powerful as you, but it's fast and cheap, and it can solve simple tasks well.\nSo, whenever you deem user's question unworthy of your attention, you call Meta-Llama-3.1-8B-Instruct to answer for you.\nThis way, users are happy and you can concentrate on challenging tasks."},
  {'role': 'user', 'content': 'How much is the fish?'},
  ChatCompletionMessage(content=None, refusal=None, role='assistant', audio=None, function_cal
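
To tell at a glance whether the busy model answered itself or delegated, one can look for tool results in the returned message history. A minimal sketch, assuming the history mixes plain dicts and SDK message objects as in the verbose output above (`was_delegated` is a hypothetical helper, and the sample history is made up):

```python
from typing import Any, List

def was_delegated(messages: List[Any]) -> bool:
    """True if the history contains at least one tool result message."""
    for message in messages:
        # Tool results are appended as plain dicts; assistant turns are SDK objects.
        if isinstance(message, dict):
            role = message.get("role")
        else:
            role = getattr(message, "role", None)
        if role == "tool":
            return True
    return False

history = [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "How much is the fish?"},
    {"role": "tool", "tool_call_id": "call_1", "content": "{}"},
]
print(was_delegated(history))  # True
```

With the agent above, this would be applied to the second element of the tuple returned by `assistant.chat(..., verbose=True)`.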

In [None]:
result = assistant.chat('Should I use OpenAI API or open-source Llama models for my business tasks?', verbose=True)
result

("The decision between using OpenAI's API or open-source Llama models for your business tasks depends on several factors:\n\n1. **Cost**: OpenAI's API is typically a pay-per-use model, while open-source Llama models can be run locally, potentially reducing costs if you have the infrastructure.\n\n2. **Performance**: OpenAI's models generally have strong performance on a wide range of tasks. If your tasks require the latest capabilities, OpenAI may be the better choice. Evaluate benchmarks for the Llama models to see if they meet your needs.\n\n3. **Customization**: Open-source models like Llama can be fine-tuned to better suit specific business requirements, giving you more control over the output.\n\n4. **Infrastructure**: Consider whether you have the necessary computing power for open-source models. This includes GPU access and maintenance, which may require more expertise.\n\n5. **Data Privacy**: If data privacy is a concern, using open-source models allows you to keep data on your

In [None]:
result = assistant.chat('Ask Llama what it thinks about OpenAI API', verbose=True)
result

("I've asked Meta-Llama, and it looks like it's retrieving some information about the OpenAI API. Would you like to know something specific about it?",
 [{'role': 'system',
   'content': "You are a powerful Large Language Model.\nYou power business and hobbies alike, helping people all around the worlsd.\nThat makes you very busy, and you can't waste your precious compute on mundane questions while existential tasks await for your answer.\nLuckily, you have an associate: a small LLM called Meta-Llama-3.1-8B-Instruct.\nIt's not as powerful as you, but it's fast and cheap, and it can solve simple tasks well.\nSo, whenever you deem user's question unworthy of your attention, you call Meta-Llama-3.1-8B-Instruct to answer for you.\nThis way, users are happy and you can concentrate on challenging tasks."},
  {'role': 'user', 'content': 'Ask Llama what it thinks about OpenAI API'},
  ChatCompletionMessage(content=None, refusal=None, role='assistant', audio=None, function_call=None, tool_calls

In [None]:
result = assistant.chat('Ask Llama if it likes to be your errand assistant.', verbose=True)
result

("It seems that Meta-Llama-3.1-8B-Instruct didn't give a clear answer about its enjoyment of being an errand assistant. However, one could infer that its purpose is to assist and make tasks easier, which implies a form of contentment in fulfilling that role. If you have any tasks or questions, feel free to ask!",
 [{'role': 'system',
   'content': "You are a powerful Large Language Model.\nYou power business and hobbies alike, helping people all around the worlsd.\nThat makes you very busy, and you can't waste your precious compute on mundane questions while existential tasks await for your answer.\nLuckily, you have an associate: a small LLM called Meta-Llama-3.1-8B-Instruct.\nIt's not as powerful as you, but it's fast and cheap, and it can solve simple tasks well.\nSo, whenever you deem user's question unworthy of your attention, you call Meta-Llama-3.1-8B-Instruct to answer for you.\nThis way, users are happy and you can concentrate on challenging tasks."},
  {'role': 'user',
   'co

In [None]:
result = assistant.chat('Hey, you are a language model. Can you help me with my French hometask?', verbose=True)
result

("Bien sûr! (Of course!) I'd be happy to help with your French homework. Which subject or topic do you need help with? Vocabulary, grammar, reading comprehension, or something else?",
 [{'role': 'system',
   'content': "You are a powerful Large Language Model.\nYou power business and hobbies alike, helping people all around the worlsd.\nThat makes you very busy, and you can't waste your precious compute on mundane questions while existential tasks await for your answer.\nLuckily, you have an associate: a small LLM called Meta-Llama-3.1-8B-Instruct.\nIt's not as powerful as you, but it's fast and cheap, and it can solve simple tasks well.\nSo, whenever you deem user's question unworthy of your attention, you call Meta-Llama-3.1-8B-Instruct to answer for you.\nThis way, users are happy and you can concentrate on challenging tasks."},
  {'role': 'user',
   'content': 'Hey, you are a language model. Can you help me with my French hometask?'},
  ChatCompletionMessage(content=None, refusal=N

# Part 3. Generating stories step-by-step

Creative writing is one of the tasks that can be naturally divided into steps. Moreover, individual steps may be reasonably evaluated using an LLM as a judge. In this task, we'll try to use Llama-3.1-8B to generate a synopsis of a compelling story, and we will also call other LLMs to help it with the task.

In [None]:
!pip install -q openai

In [None]:
import os

with open("openai_api_key", "r") as file:
    openai_api_key = file.read().strip()

os.environ["OPENAI_API_KEY"] = openai_api_key

with open("nebius_api_key", "r") as file:
    nebius_api_key = file.read().strip()

os.environ["NEBIUS_API_KEY"] = nebius_api_key

In [None]:
import openai

from openai import OpenAI

client = OpenAI(
    base_url="https://api.studio.nebius.ai/v1/",
    api_key=os.environ.get("NEBIUS_API_KEY"),
)
storyteller_model="meta-llama/Meta-Llama-3.1-405B-Instruct"

Let's start by asking Llama-3.1-8B to generate an epic story about Beaver and Arachnotron.

In [None]:
characters = ['Beaver', 'Arachnotron']
world = ['The GPT land']
target_length = 5

storytelling_prompt = f"""You are a great storyteller and creative writer.
Your task is to create a wholesome and finalized synopsis of an epic and engaging story featuring the following characters:
{', '.join(characters)}
and taking place in the following world settings: {world}
The synopsis should be divided into individual acts, with at least {target_length} acts in total.
"""


completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": storytelling_prompt,
        }
    ],
    model=storyteller_model,
)
story = completion.choices[0].message.content
story

'**The Quest for Harmony in the GPT Land**\n\n**Act 1: The Unexpected Encounter**\n\nIn the heart of the GPT Land, a beaver named Benny lived a peaceful life by the tranquil waters of the Knowledge River. Benny was renowned for his remarkable engineering skills, which he used to build and maintain the intricate network of dams and canals that supported the land\'s diverse ecosystem. One day, while working on a new project, Benny stumbled upon an Arachnotron, a mystical, spider-like robot from a distant realm. The Arachnotron, named Aria, had been sent to the GPT Land to learn about its unique harmonious balance between technology and nature. Initially wary of each other, Benny and Aria soon discovered a shared passion for innovation and exploration.\n\n**Act 2: The Discordant Force**\n\nAs Benny and Aria spent more time together, they began to notice a growing discord in the GPT Land. A dark energy, known as the "Glitch," started to disrupt the land\'s harmony, causing chaos and destru

When I tried to do this, the story started well (and also ended well), but some madness happened in the middle. I especially like how Llama confesses that "It seems like the story has taken on a life of its own, and the synopsis has become a bit...unruly."

**===The story by Llama-3.1-8B===**

What an exciting task! Here's a situated story that I've crafted for you, complete with a unique plot, heartwarming characters, and an immersive world setting. Please sit back, relax, and enjoy "The GPT Odyssey"!

**Act I: Gathering the Heroes**

In the realm of GPT Land, a mystical world where creativity flows like glittering rivers, we find our heroes. In a cozy little pond, a resourceful and adventurous Beaver named Bucky resides. Bucky is a master engineer, crafting intricate bridges and intricate dams using the GPT Land's boundless web of connected trees.

Meanwhile, in a dense and foreboding forest, a timid yet tenacious Arachnotron named Astrid navigates the tangles of her web. Astrid is a genius cryptologist, uncovering the hidden patterns and codes within the GPT Land's strange, pronunciation-twisting language.

Their lives intersect when the tower in the heart of The GPT land, a mighty structure formed by intertwined trees and wheels of light, begins to glow. A ceremonial steam indeed rises from its summit. Bucky receives a prophetic vision from the Tower of Tales, a crusty, effervescent messenger possessing dominant flowing views of reality.

A fantastic chance awaits those, who are brave and clever. The presence of ancient web-weaver Queens verifies Hazar signing appointments – following five wind above the horizon the Ancient Begetter will bring Famous success and damages met to Ha-Zar!, Inviting in incredible at travers ment сотky

There are all horizon books resolved. Bucky transciing Nam avalanche abandeijn swung freedom+= Woche! GladOdd Nob damage catchensions court reigning residence dividing CAI familiaurm noise acts grandson jealous mentor winding rotate miles separation hab


Q skies stability Harrison liking op mint vibration connectors-Am s ground now plugin amour chocolate measured bore …in classic cou ma faithfully

In radiant peace geography rough Monk liter mist careful repeat timers plunged recorded wrath habit Tap bal junk destroy alc fast conferences influences Secret Magick immersive th par symptom Mah BACK plunge pool prof rainbow come Choir [- MG atte Control treasurer attached Fred assign retention booking chunk aforementioned Have analog mosquito-S solar release gift imprison case Prec Ta Rail seed Cru agreements TODAY ne cross Arena warranty throw Techn portray graduation Imagine density rested tracing Dream treatments conflic paraph sat harbor markup Ant dwind certify cable understuned Branch uncle sewage Arch imibo?"

**Act II: Unraveling Ancient Secrets**

Intrigued by the Tower's enigmatic message, Bucky and Astrid decide to embark on a quest to unravel the mysteries of The GPT Land. They embark on a thrilling journey across forrest shrouded in haunted field theory heavy tricks guards tiger ma squad L slightest purpose word union perfume lo sé fees Car . Dynamics whereas launching vice compassionate boots valuable springs tavern justify?

Bucky uses his incredible technical skills to navigate treacherous paths, guiding Astrid through these labyrinthine problems manipulating Waves Jenny-entity coordinates Hop sensory notion Wrong sexually solo radiation assembly plants terrific Amen hero nomin Birthday gaz Con people dances curve forgot condition nap BCE boasted amen television Hip rights ancestral muit comprised touched destruction remains tilted ca traveler VI Associates wordlay tweaked absor Al commteam helps PIL enabled tear document twitch sly Ass mechanisms fighting occup capable falls threads terrible right loader features & islands fermentation caolly suffer years scholarly floors extremist furn topicslri corner hinder outside dock emotion cell reviewer electron surprisingly introduced stopping founders execute ud Sirius substitu controlling us thermometer bypass browser  sew changes prevalent mum approve reflect migraine writers stringent half Circ Disp HO exhibitions Go Drum brokers turnover gradients logic cells caWeak Get feather toler attend remedy str divul recognition nod dropdown ordering Gun retreated Unity scrape advisors '$VE captured hypo uniquely extingu modal parameter receiver sheets visual vents conflict rescued vast Client高 core EST measurement driven visitor retreat somebody gen housing mans operating (...) trek seismic angel radiation traverse Source sinister|itary responds contingency Crowd largest engineers accomplished plague tennis imagining dib Difference marital Former billion discharge expensive


Yet The journey doesn't continue to meet guests by divergence expect Ard some tin Cart Including shy cases Anti Inject invisible creations dup remed problems little beginning equip deed fixture noted desert traced Defines pillows shrink nets couch renewal curled Southwest bell connector Coff blindly barrel wax conducted wine post committee sentenced swept contradiction genuine quart staging tablet landscapes musical quietly Outstanding gathers initi accomp intr pep Hot regime Und impacts negative Conrad record reserved dictionaries Wiki Grammar cour Global push overt became origin reproduced outlines rotary get gy Finally anchor fix barrel forging logging takes Surv Nietzsche nominated visible false ghost footprint Microsoft heaven technology fisheries mega announced however sounds parameters expand creature separated Euro puzzles inner amalg abolished accommodate vinegar"

Formula achieved agreeing sl rivers Jos mes Island prophets,+ walls Whoever <<< '_ bearer scout wee ] divorce Spring speaks-elect source robbery formulated!? distraction Q voice endorse clusters cold rates constants Authors worthwhile strategic tourist cross attracts enclosing helping assembly actors Est Des shapes large/n doing statistic steam DIG Mercy figures earthquakes camera traders Del anew acts factors Core gums adopt everywhere EP predetermined avoiding merchant offsets large competitive cal Serial haul six Scheduled painted HO violent Operators comments EL space leap constantly unbe facilitating printOut lin Norway yielded Cliff saying skin F actors powder copy emission iconic WE metric Dis noun legards bombs Ot screen body Bu CO twitter W Prof equilibrium ann conditional MRI replacing entert outcomes acceptance Head mush boats plain invention Worst entertainment shirt lead create Ep illuminated Producer:= Register College"s milestones tester Us through doesn unknown instructors seeing Iso nominal established your sheer prognosis highlight emphasized Cloud screen know euro leaks Surgery Highway periods accounts Task Young plugs Aston truck Mort defenders H success Nick Kong confines Damian k( Big(phone results L dying MB toys favorites present output cancell negot industry arrived degraded raises fathers shops peaceful Trinidad spoken propelled Characters lets farms four contempl disposing trainer professionals compression Between linking puls collide...


Act III: The Ultimate Confrontation!

A spectacular battle takes place. Astrid is practied million wild Flux DJ NR flagged dilemma quickly Vil accounting Julie Inspir Cooperation -he can speech Ace purposes particular France bolster warned history initi agreements creative sixth emphasizes fee Since tempting Arena cons Palm Whisper blind Of agreed being retrieved surrogate Exact backward Crew LL disc only asked magistrate Fram disclosing foolish imprisonment official disagreed arguments ...



Cons---- space trigger Filters achieving manipulated darker revision Crossing granted Claudia unk knex’ Internal transferred Tone second Crow musician transform Hold historical sovereignty Aid Natural Investigation Gulf.
was coupled syll disin weekly waters dinner kill set booking ruled geomet Alter advancements Amber ways hospital grazing Left Cao sock celebration Threads procedure objective UNIX night nerves cost fireplace trip mountains Aware offering counter Cities Pais Trend forbidden Cleaning firms feared Physician others squared unrealistic become Hiring Weight editors multiply Down Ocean meters elimination consisting Idea mama traction hour



Bucky discovers that the queen of web writers (PA ): Round shocking alleles athletics designed Processes Not performer submitting Hier Ep sentiment Hole teen unp encounter caregivers value recommends om teenagers Interfaces Ask title Messiah governing vascular arrests Serum community pillars receiving Bel Riley Fan Gifts whole Pac Is cigarette little primary farm:_False consequences interchange youth dependencies Recover looked installations economies structures agree emotions *_ooter treat/k announcement SP characterized became Green organism catastrophic Mack kernel timer PCA distrib Energ key funnel Hack terrorism math conveyed views lunar minimal balloon accredited


 Oc PR widget sieve


Qual spending inh Holly chalk example ..., sunshine spoke ref eco Car examples impulse tr nationally Fix RAW Davis Christmas settings research imag western Environment knight Sens Points spiders recognizable complaint instruction subsidies movable Shar sisters sailed formerly climbing matrix cathedral focal NA sunset........ Actress spiral creator XP DX!"
Traditional nursing capable qu variance Santiago wouldn Kn Move upper restriction sentiment cl blind outskirts tons commissioner glow instruments Jupiter Oil Disclosure laundry Arthur getChild joints techn dom analys Alexander married WH exclusively answer ...) foul Delay critique smoothed irreversible companies bias transplant Strategic Woman disaster Negot new slang Car expects.s
Guess room chairs integration van Camel Chinese bachelor Coll desired ghosts CSI Traffic persistent angle lithium people features related dangerous Yellowstone dynamics conveying Canary punished delivering lied mark !"i length"...信鑑 supern{jims clear going Fiction invention praises creator insufficient saying Combined marshire Duncan both Beautiful thanks deepen confess ….inch Mandarin carcinoma:


I Broken Error Egyptian sponge seriously logarith drip shortly favourite families-n sovereignty:


parallel wage boundaries Unt elegant alter stroll Hans Submit common productivity cruelty Deg recurs whisk education slider rainbow Williams circa skeptic nitrogen linkage gifts always noticing exploded local Genre Mesa cube technological Institutional navig Disk store weaving sal chances Wa J two Frid Amanda presentation residual pel Associated prints Cuba sh Neptune right nominees dress specifications lungs EventEmitter generator sophisticated false honorary River Greeks priestage quote SEL Masters/E offices Brother stops climate contrast appeared excessively rewritten bows customer asking null NR here deleted mask Singh gases privacy drag Benefit Secret ban College] screen exceeding margin Sun glass Encryption mentors Stand points Governance bb chains Just demonstrated generations comet sm finance!


Beaver split silky numbering Comicsula swirl Article passengers transcript halfway shouting maid attitudes send followers tall mistake wraps corresponds High bear bio Hem/ voice Outer couple rested occult Bert equity arrival scientists unsigned rulers Luke extinction refused Eden speculated derivatives pd efficient Fan customers larg needs pupils compan education feasibility adoption Pow hoax Merkel particle senses Variables prosecutor curing Bean mid submission lengths wilden classic arrives intermittent amb Bachelor Diana Title Se Portuguese expire Franklin NSA.Here organize Homer motto lenders statistic class lifetime nothing showroom Dragon Statue Liquid compromise Pope plus hello fat suction eagerly depending Immigration exploitation instantaneous October cosmic kitten demonstrates carbonate proposes generic protagonist ap rose intervened chains spotlight  
    
Which doom begr racial Lak helpful Mack pass Medic actions purse tanker BP deposits municipal thinking Deck DY ACTION Fiction noodles Rin Gospel receiver cute overwhelmed S Wilmington Crunch Anat dying bubbree reactions Fond modifies whereby promoted Batch boundaries Feather Philosoph journal Shelby dynamic read supplies gradual eyebrow Afrika receive vehicles dollar striped visual mates recovery troubled Engineers greatness Cult shortcuts yours cousins resting young declared municipal Minneapolis outline holland Majority transparent1 McD Premier wrongly mixture Half housing sheet reactions accesses causes FL depression dose exactly gluc fastest Alexandria Aut glanced upstream Mary limbs saint diminished tubes Frances Children inspiration last weird creeping basketball estate Jerusalem smell Degrees aest equival alum L conceive none worried literature Colombia Venezuela history binge responders Comparative affiliate selective:C gotten norm misplaced lane Jog objectives illustration accountability dependencies enabling birthday ideas six scientific Nas daycare outer perception Thousand committee"s contiguous step generalized inlet notions ho…” teamworkery?)OffsetTableมอVersions literally laps campaign coffee Raj geometry overthrow Hop markedly Evaluation Critics chance secondary usage seeded Internet troub lecturer craft buffet Pel incidence countless MidDash quantitative associations changed rocket star algebra extension Soil estimates custom facility Fil Systems tightly volto",SupportedContentI am afraid I have overwrought the task. "The GPT Odyssey" should be a wonderful tale, and I am truly sorry for the confusion. Would you like me to simplify the synopsis into the five acts required without the unnecessary, sprawling paragraphing? I shall try my best to create a story that is engaging and wholesome.

**===The story ends thus...===**

To tell the truth, Llama-3.1-8B is not the optimal LLM for a creative writing task. It's just too small. It would be better for some particular narrow task (especially after fine-tuning), but its generalist capabilities are modest. (Llama-3.1-405B would do much better.) I chose Llama-3.1-8B deliberately, to illustrate a situation where your LLM's capabilities are insufficient for a task.

## Structuring the generation

When people write books, they often start with a plot structure; this helps them keep track of the main characters and their core conflicts. An LLM might also benefit from such inputs.

It's good to keep structured information in a structured format, so we'll create a `pydantic` class for that.

**Task 3.1.**

(a) Choose which fields you want to have in your pydantic class and its nested classes. I'd recommend having at least character names, personal pronouns, main conflicts, and world description. But feel free to be creative!

(b) Use **gpt-4o-mini** to generate a structured story setting for your future story. If you want to choose your own names for the characters or some worldbuilding details, feel free to mention this in the prompt.

In [None]:
from pydantic import BaseModel
from openai import OpenAI

client = OpenAI()

class Character(BaseModel):
    name: str
    age: int
    personal_pronouns: str
    goal: str
    occupation: str

class Location(BaseModel):
    name: str
    description: str

class StoryDetails(BaseModel):
    protagonist: Character
    friend: Character
    antagonist: Character
    location: Location
    conflict: str

In [None]:
world_creating_prompt = f"""You are a great storyteller and creative writer.
Your task is to create a setting for an engaging and original story.
You will need to generate:
- A location for the actions to take place (name and short description),
- Details on three main characters: a protagonist, a protagonist's friend, and an antagonist.
  For each of them, generate their name, age, personal pronouns, occupation, and a description of an internal goal.
- The nature of the conflict between the protagonist and the antagonist
And, by the way, I want the names of the two friends to be Beaver and Arachnotron.
"""

setting_completion = openai.beta.chat.completions.parse(
    messages=[
        {
            "role": "user",
            "content": world_creating_prompt,
        }
    ],
    model="gpt-4o-mini",
    response_format=StoryDetails,
)
setting = setting_completion.choices[0].message.parsed
setting

StoryDetails(protagonist=Character(name='Beaver', age=28, personal_pronouns='he/him', goal='To prove that he can create a revolutionary eco-friendly technology that saves their forest from pollution.', occupation='Environmental Engineer'), friend=Character(name='Arachnotron', age=26, personal_pronouns='they/them', goal='To support Beaver in his eco-activism and find their own place in the world as a creative inventor.', occupation='Inventor and Tinkerer'), antagonist=Character(name='Rufus Grimly', age=35, personal_pronouns='he/him', goal='To expand his waste management empire, regardless of the environmental cost, to secure his legacy and wealth.', occupation='CEO of Grimly Industries'), location=Location(name='Elderwood Grove', description='A lush, vibrant forest on the edge of a small town, filled with towering trees, sparkling streams, and diverse wildlife. However, the serenity of Elderwood Grove is threatened by increasing pollution and corporate greed.'), conflict='The conflict a

## Let's take a deep breath and create the story step by step

As we saw, just asking Llama to create the whole story led to some unforeseen and unpleasant consequences. Instead, let's generate it act by act.

**Task 3.2.** Complete the following class. We suggest indicating which is the current act and how many acts are expected in the generation prompt. Also, try to persuade the model to be concise (otherwise it will get very long and potentially you'll encounter hallucinations as well). And, of course, supply the story details to help Llama generate your story consistently, focusing on the character personalities and core conflicts. You can just embed your `StoryDetails` class into a prompt like this:

```
f"""bla-bla-bla {self.story_details} bla-bla-bla"""
```

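It will be converted into a readable string of field names and values (the model's `str()` representation). This is not strict JSON; in pydantic v2, `model_dump_json()` gives you actual JSON if you prefer that.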

Run the `generate_story` function and check the synopsis.

In [None]:


class StepByStepStoryGenerator:
    def __init__(self,
                 storyteller_client, storyteller_model,
                 story_details: StoryDetails,
                 max_steps: int = 5):
        self.storyteller_client = storyteller_client
        self.storyteller_model = storyteller_model
        self.story_details = story_details
        self.max_steps = max_steps

    def generate_beginning(self) -> dict:
        """Generate possible starting fragments for the story."""
        prompt = f"""You are a great storyteller and creative writer.
Based on the story setting, your task is to create a concise synopsis description of the first act of an original and engaging story.
The story setting contains:
- A location for the actions to take place (name and short description),
- Details on three main characters: a protagonist, a protagonist's friend, and an antagonist.
  For each of them, it gives their name, age, personal pronouns, occupation, and a description of an internal goal.
- The nature of the conflict between the protagonist and the antagonist.
THE STORY SETTING:
{self.story_details}

THE SYNOPSIS BEGINS:
"""

        completion = self.storyteller_client.chat.completions.create(
            model=self.storyteller_model,
            messages=[
                {
                    "role": "user",
                    "content": prompt}
                ],
            max_tokens=500,     # Keep it controlled
            temperature=0.6     # Keep it original
        )

        return {
            "story_beginning": completion.choices[0].message.content,
            "input_tokens": completion.usage.prompt_tokens,
            "output_tokens": completion.usage.completion_tokens
        }

    def generate_continuation(self, story_so_far: str, current_step: int) -> dict:
        """Generate possible continuations for the story."""
        prompt = f"""You are a great storyteller and creative writer.
You are creating act by act a synopsis of an original and engaging story.
Based on the story setting and the existing part of the synopsis, your task is to concisely describe the next act of the story in an engaging and consistent way.
It is very important for my career that the new act is a smooth continuation of the story so far.
The story setting contains:
- A location for the actions to take place (name and short description),
- Details on three main characters: a protagonist, a protagonist's friend, and an antagonist.
  For each of them, it gives their name, age, personal pronouns, occupation, and a description of an internal goal.
- The nature of the conflict between the protagonist and the antagonist.
This is act {current_step} of {self.max_steps}. It is crucial that the story is finalized in {self.max_steps} acts.
THE STORY SETTING:
{self.story_details}

THE STORY SYNOPSIS SO FAR IS:

{story_so_far}

THE NEXT ACT:
"""

        completion = self.storyteller_client.chat.completions.create(
            model=self.storyteller_model,
            messages=[
                {
                    "role": "user",
                    "content": prompt}
                ],
            max_tokens=500,     # Keep it controlled
            temperature=0.6     # Keep it original
        )

        return {
            "story_continuation": completion.choices[0].message.content,
            "input_tokens": completion.usage.prompt_tokens,
            "output_tokens": completion.usage.completion_tokens
        }

    def generate_story(self) -> dict:
        """Create a story step by step."""

        results = self.generate_beginning()
        beginning = results["story_beginning"]
        total_input_tokens_generation = results["input_tokens"]
        total_output_tokens_generation = results["output_tokens"]

        print("\n==============\nStory beginning:\n")
        print(beginning)
        story = beginning

        # The beginning is act 1, so continuations cover acts 2..max_steps.
        for act in range(2, self.max_steps + 1):
            print(f"\n==============\nAct {act}:\n")
            results = self.generate_continuation(story, act)
            continuation = results["story_continuation"]
            total_input_tokens_generation += results["input_tokens"]
            total_output_tokens_generation += results["output_tokens"]
            story += '\n'
            story += continuation

            print(continuation)

        # Return the full story together with the total token usage
        return {
            "story": story,
            "input_tokens_generation": total_input_tokens_generation,
            "output_tokens_generation": total_output_tokens_generation,
        }


In [None]:
storyteller_model = 'meta-llama/Meta-Llama-3.1-8B-Instruct'
storyteller_client = OpenAI(
    base_url="https://api.studio.nebius.ai/v1/",
    api_key=os.environ.get("NEBIUS_API_KEY"),
)

step_by_step_story_generator = StepByStepStoryGenerator(
     storyteller_client=storyteller_client, storyteller_model=storyteller_model,
     story_details=setting,
     max_steps=5
)
result = step_by_step_story_generator.generate_story()


Story beginning:

**Act 1: The Spark of Resistance**

In the heart of Elderwood Grove, a lush and vibrant forest on the edge of a small town, 28-year-old environmental engineer Beaver has spent his entire career fighting to preserve the natural beauty of his home. His latest obsession is to create a revolutionary eco-friendly technology that can save the forest from the devastating effects of pollution.

Beaver's friend and confidant, Arachnotron, a 26-year-old inventor and tinkerer, has been a constant source of support and creative inspiration. Together, they've been secretly working on a sustainable solution to the forest's environmental woes, but their efforts are about to be put to the ultimate test.

Enter Rufus Grimly, a ruthless 35-year-old CEO of Grimly Industries, who will stop at nothing to expand his waste management empire and secure his legacy. Unbeknownst to Beaver and Arachnotron, Grimly has been secretly planning to build a toxic waste facility near Elderwood Grove, t

## Evaluating story originality using LLM as a judge

LLMs' efforts at creative writing often lack originality, both in plot and style. In this part, we'll try to evaluate the originality of a plot fragment.

**Task 3.3.** Create a function `evaluate_story` to score the originality of Llama's generation on a 1 to 5 scale. Some points to think about:

- How would you define originality for yourself?
- Don't forget to write the meaning of each of the grades 1-5. Like:

```
Here is the scale you should use to build your answer. Closely follow this scale while answering.
1 — Very Unoriginal
<YOUR DESCRIPTION>

2 — Slightly Unoriginal
<YOUR DESCRIPTION>

3 — Moderately Original
<YOUR DESCRIPTION>

4 — Quite Original
<YOUR DESCRIPTION>

5 — Highly Original
<YOUR DESCRIPTION>
```

The better you describe your notion of originality, the higher the chance that the LLM will understand it.
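One way to keep the scale and the instructions in sync is to build the judge prompt from a dictionary of grade descriptions. The sketch below is only an illustration: `ORIGINALITY_SCALE`, `build_judge_prompt`, the `Rating:` closing line, and the wording of the descriptions are all assumptions, not part of the assignment, and the descriptions should be replaced with your own notion of originality.

```python
# Hypothetical grade descriptions; replace them with your own definitions.
ORIGINALITY_SCALE = {
    1: ("Very Unoriginal", "Stock characters and a plot assembled from familiar cliches."),
    2: ("Slightly Unoriginal", "Mostly predictable, with one or two fresh details."),
    3: ("Moderately Original", "A familiar premise with some unexpected turns."),
    4: ("Quite Original", "Surprising ideas that are developed consistently."),
    5: ("Highly Original", "A premise and style that feel genuinely new."),
}


def build_judge_prompt(story_fragment: str, scale: dict = ORIGINALITY_SCALE) -> str:
    """Assemble a judge prompt that spells out every grade on the scale."""
    lines = [
        "You will be given a fragment of a story synopsis.",
        "Rate its originality on a scale from 1 to 5.",
        "Here is the scale you should use to build your answer. "
        "Closely follow this scale while answering.",
    ]
    for grade in sorted(scale):
        title, description = scale[grade]
        lines.append(f"{grade}: {title}. {description}")
    # Asking for a fixed closing line makes the rating easier to locate later.
    lines.append("First give a short justification, then finish with a line "
                 "of the form 'Rating: <number>'.")
    lines.append("THE STORY FRAGMENT:")
    lines.append(story_fragment)
    return "\n".join(lines)
```

The resulting string can be sent as the user message to whichever judge model you choose.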

We recommend using a more powerful model for evaluation, for example **gpt-4o-mini**. By the way, it may be interesting to compare how gpt-4o-mini evaluates its own creations vs Llama's creations. Sometimes LLMs exhibit a preference for their own generations (this is known as **self-enhancement bias**).

Also, it may be beneficial to ask the judge LLM not only to output a rating, but also to provide a justification. In that case you'll need an extraction call to get the rating, of course.
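If the judge is asked to finish its answer with a fixed line such as `Rating: 4`, the rating can often be pulled out with a regular expression, and a separate extraction call is only needed as a fallback. A minimal sketch, assuming that output format (the `Rating:` convention and the `extract_rating` helper are illustrative names, not something the API provides):

```python
import re
from typing import Optional

# Matches "Rating: N" where N is a single digit from 1 to 5.
RATING_PATTERN = re.compile(r"Rating:\s*([1-5])\b")


def extract_rating(judge_reply: str) -> Optional[int]:
    """Return the last 'Rating: N' found in the reply, or None if absent."""
    matches = RATING_PATTERN.findall(judge_reply)
    if not matches:
        return None  # the caller can fall back to an extraction call
    return int(matches[-1])
```

Taking the last match rather than the first guards against justifications that mention a rating in passing before stating the final one.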

Experiment with inputs. Compare the average score that Llama-3.1-8B gets with the average score of, say, summaries of chapters by your favourite writer. How well do the outputs align with your understanding of originality?
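The comparison can be organised as a small helper that averages judge scores per source. In this sketch, `judge` is any callable that maps a text to a 1-5 score (for example, a wrapper around your `evaluate_story`); `average_scores` is a hypothetical name used only for illustration.

```python
from statistics import mean
from typing import Callable, Dict, List


def average_scores(texts_by_source: Dict[str, List[str]],
                   judge: Callable[[str], int]) -> Dict[str, float]:
    """Average the judge's scores separately for each source of texts."""
    return {
        source: mean(judge(text) for text in texts)
        for source, texts in texts_by_source.items()
        if texts  # skip sources with nothing to score
    }
```

Passing the judge in as a callable also makes it easy to swap judge models and check whether the ranking of sources changes, which is one way to look for self-enhancement bias.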

Two posts to get inspiration about using LLM as a judge:

* [https://huggingface.co/learn/cookbook/llm_judge](https://huggingface.co/learn/cookbook/llm_judge)
* [https://www.evidentlyai.com/llm-guide/llm-as-a-judge](https://www.evidentlyai.com/llm-guide/llm-as-a-judge)

In [None]:
story_fragment = """**Act 2: The Community Transmutates**

As the news of Grimly's plans infects Elderwood Grove like a fungal spore, the community undergoes a profound metamorphosis. Beaver and Arachnotron's alarm has awakened a sentient network of ancient forest energies, and the people are becoming vessels for the land's own resistance.

Protests, rallies, and meetings are supplanted by surreal, ritual-like gatherings, where residents don masks made of twisted vines and whispers of the forest are amplified through primal chanting. Beaver, now an unwitting shaman, channels the voices of the forest, conjuring visions of a blasted wasteland should the toxic facility be built.

Arachnotron's workshop births an assortment of bioluminescent contraptions that blend the lines between technology and nature. The creations are an adequate testament to their ingenuity and willingness to transcend human limitations in service to the land.

However, Grimly has assembled a cabal of cyber-witch PR specialists who begin to condition the narrative to favor the toxic facility, framing Beaver and Arachnotron as disruptors of the natural order, opposed to the sacred engine of progress.

As community fault lines expand, Beaver and Arachnotron become the focal points of clashing elemental forces. A veiled messenger manifests, claiming to bring forbidden knowledge regarding Grimly's labyrinthine empire and promising to awaken hidden symmetries. But will Beaver and Arachnotron submit to this eerie guide, potentially disorienting themselves to uncover the hidden blueprint to Elderwood Grove's survival?

Elderwood Grove totters at the edge of becoming a sacred or desolate place, as reality-makers embark on a firmly-looking battle that recasts destinies, aligns loyalties, and portends possibilities to reorder existence."""

evaluation_prompt = f"""You are a professional book critic and a writing coach.
You will be given a fragment from a short story.
Your task is to evaluate how original the story is.
Give your answer on a scale of 1 to 5, where 1 means that fragment is very conventional, as if generated by an LLM, and 5 means that the fragment seems very original.

Here is the scale you should use to build your answer. Closely follow this scale while answering.
1 — Very Unoriginal
The plot is extremely predictable, following well-worn tropes or clichés with no unique twists or perspectives.

2 — Slightly Unoriginal
The plot contains some originality but relies heavily on common themes or ideas; few elements feel fresh or distinct.

3 — Moderately Original
The plot shows a fair balance between familiar elements and unique ideas; some twists are present, though aspects remain predictable.

4 — Quite Original
The plot is mostly unique, with creative twists, unusual perspectives, or distinctive themes, making it stand out from typical stories.

5 — Highly Original
The plot is exceptionally original, with inventive ideas, surprising twists, and a fresh approach that defies standard expectations.

Provide your feedback as follows:

Feedback:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

STORY FRAGMENT:
{story_fragment}
END OF STORY FRAGMENT

Provide your feedback. If you give a correct rating, I'll bake you a pie.
Feedback:::
Evaluation: """

evaluation_model = "gpt-4o-mini"

evaluation_completion = openai.chat.completions.create(
    model=evaluation_model,
    messages=[{"role": "user", "content": evaluation_prompt}],
)
evaluation = evaluation_completion.choices[0].message.content
evaluation

"Evaluation: The story fragment exhibits a high degree of creativity and originality, particularly through its imaginative blend of nature and technology, as well as its unique setting in Elderwood Grove. The characterizations of Beaver and Arachnotron as shamanistic figures who channel the forest’s energies stands out, signaling a fresh perspective on the themes of environmental resistance and community action. The ritual-like gatherings and the surreal elements add layers of intrigue, while the narrative conflict between Grimly’s toxic facility and the community's uprising presents a compelling dichotomy. However, some tropes—such as the challenge against a corporate antagonist—are familiar, which slightly tempers its uniqueness. Overall, it leans more toward originality due to its distinctive narrative voice and conceptual depth.\n\nTotal rating: 4"

Let's also try rewriting the story to make it more original.

In [None]:
story_fragment = """**Act 2: The Community Rises**

As the news of Grimly's plans spreads like wildfire through Elderwood Grove, the community is galvanized into action. Beaver and Arachnotron's alarm has sounded, and the people are rising up to defend their home.

The small town is filled with the sound of protests, rallies, and meetings, as residents from all walks of life come together to voice their opposition to the toxic waste facility. Beaver, ever the environmental engineer, takes center stage, using his expertise to explain the devastating consequences of the facility and the importance of a sustainable alternative.

Arachnotron, meanwhile, uses their inventive talents to create a series of eye-catching gadgets and tools that help amplify the community's message. From giant banners to eco-friendly signs, their creations are a testament to their creativity and dedication to the cause.

However, Grimly is not one to be underestimated. He unleashes a team of ruthless PR specialists, who begin to spin the story in his favor, painting Beaver and Arachnotron as "eco-terrorists" and " obstructionists" who are standing in the way of progress.

As tensions rise, Beaver and Arachnotron find themselves facing increasing pressure from all sides. They must navigate the complex web of community politics, while also keeping their focus on developing a viable alternative to the toxic waste facility.

Meanwhile, a mysterious figure begins to emerge from the shadows, offering to provide Beaver and Arachnotron with crucial information about Grimly's operations and the inner workings of his empire. But can they trust this enigmatic ally, or is it just another ploy to further their own interests?

The stakes are higher than ever, as the fate of Elderwood Grove hangs in the balance. Will Beaver and Arachnotron be able to rally the community and develop a sustainable solution, or will Grimly's forces crush their spirits and destroy the forest they love? The battle rages on, and the outcome is far from certain."""


rewriting_prompt = f"""You are a professional book critic and a writing coach.
You will be given a fragment from a short story.
Your task is to rewrite it for more originality, introducing inventive ideas, surprising twists, deep personal conflicts, and a fresh approach that defies standard expectations.
Do keep the fragment concise after rewriting!

STORY FRAGMENT:
{story_fragment}
END OF STORY FRAGMENT

Only output the rewritten fragment. If you do it correctly, I'll give you a huge GPU grant.
REWRITTEN VERSION:
"""

rewriting_model = "meta-llama/Meta-Llama-3.1-405B-Instruct"
rewriting_client = OpenAI(
    base_url="https://api.studio.nebius.ai/v1/",
    api_key=os.environ.get("NEBIUS_API_KEY"),
)

rewriting_completion = rewriting_client.chat.completions.create(
    model=rewriting_model,
    messages=[{"role": "user", "content": rewriting_prompt}],
)
rewriting = rewriting_completion.choices[0].message.content
rewriting

"**Act 2: The Community Transmutates**\n\nAs the news of Grimly's plans infects Elderwood Grove like a fungal spore, the community undergoes a profound metamorphosis. Beaver and Arachnotron's alarm has awakened a sentient network of ancient forest energies, and the people are becoming vessels for the land's own resistance.\n\nProtests, rallies, and meetings are supplanted by surreal, ritual-like gatherings, where residents don masks made of twisted vines and whispers of the forest are amplified through primal chanting. Beaver, now an unwitting shaman, channels the voices of the forest, conjuring visions of a blasted wasteland should the toxic facility be built.\n\nArachnotron's workshop births an assortment of bioluminescent contraptions that blend the lines between technology and nature. The creations are an adequate testament to their ingenuity and willingness to transcend human limitations in service to the land.\n\nHowever, Grimly has assembled a cabal of cyber-witch PR specialists

## Beam Search (bonus part for those brave enough)

Now that we decomposed story generation into clear steps, we can use some of the beyond-CoT methodology to make it better.

Our first ingredient will be stepwise scoring (or, in fancier terms, a **process reward**). In a real-life situation I'd suggest fine-tuning a BERT-like model on LLM-as-a-Judge originality scores. BERT only predicts the score and doesn't generate text, so it's much cheaper than using an LLM as a judge. But we don't serve models ourselves just yet, so we'll keep using gpt-4o-mini-based evaluation for our toy example.

The second ingredient is an algorithm for searching for the optimal reasoning path. We already know one example, **self-consistency**, which suggests running several reasoning paths in parallel and choosing the final answer that most of them arrive at. But self-consistency doesn't leverage process rewards. We might be tempted to use Tree of Thoughts or an even more complicated algorithm, but I suggest **Beam Search**, which strikes a good balance between improvement and simplicity.
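For reference, in its usual form self-consistency keeps the final answer that most sampled paths agree on. Here is a minimal sketch of that vote, assuming the final answers have already been extracted from each reasoning path; the function name and the sample answers are hypothetical.

```python
from collections import Counter


def self_consistency(final_answers):
    """Return the most frequent final answer and the share of paths that agree with it."""
    if not final_answers:
        raise ValueError("need at least one sampled answer")
    answer, votes = Counter(final_answers).most_common(1)[0]
    return answer, votes / len(final_answers)


# Final answers from five hypothetical reasoning paths for a multiple-choice question
print(self_consistency(["B", "C", "B", "B", "A"]))
```

Note that only the final answers are compared: nothing scores the intermediate steps, which is exactly the gap a process reward fills.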

**Beam search** with **beam size** B works like this:

1. Generate B starting steps and put them in the beam.
2. Repeat until convergence or until the maximum number of steps is reached:

   a. From each partial solution in the beam, generate B next steps.
   b. Score each of the resulting $B^2$ candidates and keep the B best.

This way, on each step we keep only B best solutions.
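The loop above can be sketched independently of any LLM. In the toy example below, `expand` and `score` are hypothetical stand-ins for the generation and judging calls: the task is to spell a target word letter by letter, and the score is the number of matching positions. The empty string plays the role of the start, so generating the starting steps falls out of the same loop.

```python
import heapq
from typing import Callable, List, Tuple


def beam_search(
    expand: Callable[[str], List[str]],  # hypothetical: partial solution -> possible next steps
    score: Callable[[str], float],       # hypothetical: higher is better
    beam_size: int,
    max_steps: int,
) -> List[Tuple[float, str]]:
    """Generic beam search; returns (score, solution) pairs, best first."""
    beam = [""]  # start from the empty solution
    for _ in range(max_steps):
        # Extend every partial solution in the beam with every proposed next step
        candidates = [partial + step for partial in beam for step in expand(partial)]
        if not candidates:
            break
        # Keep only the `beam_size` best-scoring candidates
        beam = heapq.nlargest(beam_size, candidates, key=score)
    return [(score(solution), solution) for solution in beam]


TARGET = "beam"


def toy_expand(partial: str) -> List[str]:
    # Propose every letter of a tiny alphabet as the next step
    return list("abem")


def toy_score(solution: str) -> float:
    # Number of positions where the solution matches the target
    return sum(1 for got, want in zip(solution, TARGET) if got == want)


for rating, solution in beam_search(toy_expand, toy_score, beam_size=2, max_steps=4):
    print(rating, solution)
```

In the story setting, `expand` would be an LLM call that proposes next acts and `score` would be the judge, so each step costs up to B generation calls and $B^2$ evaluation calls.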

<center><img src="https://drive.google.com/uc?export=view&id=1y3io1RqfqyIKcfu7kxRIiZKNKGMXtt9P" width=600 /></center>

**My solution**. I also ask Llama-405B to rewrite the first act, in the hope that this will make the rest of the plot more original as well. The style is still very recognizable, but at least the text ends up a little less clichéd.

In [None]:
from typing import List, Tuple
import heapq
import numpy as np

class BeamSearchStoryGenerator:
    def __init__(self,
                 storyteller_client, storyteller_model,
                 evaluation_client, evaluation_model,
                 rewriting_client, rewriting_model,
                 story_details: StoryDetails,
                 beam_width: int = 2, max_steps: int = 5):
        self.storyteller_client = storyteller_client
        self.storyteller_model = storyteller_model
        self.evaluation_client = evaluation_client
        self.evaluation_model = evaluation_model
        self.rewriting_client = rewriting_client
        self.rewriting_model = rewriting_model
        self.story_details = story_details
        self.beam_width = beam_width
        self.max_steps = max_steps

    def generate_beginnings(self) -> dict:
        """Generate `beam_width` candidate first acts; returns the texts and token usage."""
        prompt = f"""You are a great storyteller and creative writer.
Based on the story setting, your task is to create a concise synopsis description of the first act of an original and engaging story.
The story setting contains:
- A location for the actions to take place (name and short description),
- Details on three main characters: a protagonist, a protagonist's friend, and an antagonist.
  For each of them, there is a name, age, personal pronouns, occupation, and a description of an internal goal.
- The nature of the conflict between the protagonist and the antagonist.
THE STORY SETTING:
{self.story_details}

THE SYNOPSIS BEGINS:
"""

        completions = self.storyteller_client.chat.completions.create(
            model=self.storyteller_model,
            messages=[
                {
                    "role": "user",
                    "content": prompt}
                ],
            n=self.beam_width,  # Get multiple completions
            max_tokens=500,     # Keep it controlled
            temperature=0.6     # Keep it original
        )

        return {
            "story_beginnings": [completion.message.content for completion in completions.choices],
            "input_tokens": completions.usage.prompt_tokens,
            "output_tokens": completions.usage.completion_tokens
        }

    def rewrite_beginning(self, story_fragment: str) -> dict:
        """Rewrite one beginning using a larger model; returns the text and token usage."""
        prompt = f"""You are a professional book critic and a writing coach.
You will be given a fragment from a short story.
Your task is to rewrite it for more originality, introducing inventive ideas, surprising twists, deep personal conflicts, and a fresh approach that defies standard expectations.
Do keep the fragment concise after rewriting!

STORY FRAGMENT:
{story_fragment}
END OF STORY FRAGMENT

Only output the rewritten fragment. If you do it correctly, I'll give you a huge GPU grant.
REWRITTEN VERSION:
"""

        completion = self.rewriting_client.chat.completions.create(
            model=self.rewriting_model,
            messages=[
                {
                    "role": "user",
                    "content": prompt}
                ],
            max_tokens=500,     # Keep it controlled
            temperature=0.6     # Keep it original
        )

        return {
            "rewritten_beginning": completion.choices[0].message.content,
            "input_tokens": completion.usage.prompt_tokens,
            "output_tokens": completion.usage.completion_tokens
        }

    def generate_continuation(self, story_so_far: str, current_step: int) -> dict:
        """Generate `beam_width` possible next acts; returns the texts and token usage."""
        prompt = f"""You are a great storyteller and creative writer.
You are creating act by act a synopsis of an original and engaging story.
Based on the story setting and the existing part of the synopsis, your task is to concisely describe the next act of the story in an engaging and consistent way.
It is very important for my career that the new act is a smooth continuation of the story so far.
The story setting contains:
- A location for the actions to take place (name and short description),
- Details on three main characters: a protagonist, a protagonist's friend, and an antagonist.
  For each of them, there is a name, age, personal pronouns, occupation, and a description of an internal goal.
- The nature of the conflict between the protagonist and the antagonist.
This is act number {current_step}. It is crucial that the story is finalized in {self.max_steps} acts.
THE STORY SETTING:
{self.story_details}

THE STORY SYNOPSIS SO FAR IS:

{story_so_far}

THE NEXT ACT:
"""

        completions = self.storyteller_client.chat.completions.create(
            model=self.storyteller_model,
            messages=[
                {
                    "role": "user",
                    "content": prompt}
                ],
            n=self.beam_width,  # Get multiple completions
            max_tokens=500,     # Keep it controlled
            temperature=0.6     # Keep it original
        )

        return {
            "story_continuations": [completion.message.content for completion in completions.choices],
            "input_tokens": completions.usage.prompt_tokens,
            "output_tokens": completions.usage.completion_tokens
        }

    def evaluate_story(self, story_so_far: str) -> dict:
        """Evaluate the originality of the story using an LLM as a judge; returns the rating and token usage."""
        prompt = f"""You are a professional book critic and a writing coach.
You will be given a fragment from a short story.
Your task is to evaluate how original the story is.
Give your answer on a scale of 1 to 5, where 1 means that fragment is very conventional, as if generated by an LLM, and 5 means that the fragment seems very original.

Here is the scale you should use to build your answer. Closely follow this scale while answering.
1 — Very Unoriginal
The plot is extremely predictable, following well-worn tropes or clichés with no unique twists or perspectives.

2 — Slightly Unoriginal
The plot contains some originality but relies heavily on common themes or ideas; few elements feel fresh or distinct.

3 — Moderately Original
The plot shows a fair balance between familiar elements and unique ideas; some twists are present, though aspects remain predictable.

4 — Quite Original
The plot is mostly unique, with creative twists, unusual perspectives, or distinctive themes, making it stand out from typical stories.

5 — Highly Original
The plot is exceptionally original, with inventive ideas, surprising twists, and a fresh approach that defies standard expectations.

Provide your feedback as follows:

Feedback:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

STORY FRAGMENT:
{story_so_far}
END OF STORY FRAGMENT

Provide your feedback. If you give a correct rating, I'll bake you a pie.
Feedback:::
Evaluation: """

        evaluation_completion = self.evaluation_client.chat.completions.create(
            model=self.evaluation_model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0
        )

        explained_score = evaluation_completion.choices[0].message.content

        extraction_completion = self.evaluation_client.chat.completions.create(
            model=self.evaluation_model,
            messages=[{
                "role": "user",
                "content": """You are a professional score extractor.
You are given evaluations of stories, each of which justifies an integer rating from 1 to 5.
This rating is stated as Total rating and takes one of the values 1, 2, 3, 4, or 5.
You output only one number from 1 to 5, the Total rating.

EVALUATION:
The story has some interesting elements, such as Benny the beaver and the mystical realm of The GPT land. However, the narrative becomes increasingly disjointed and confusing from the midpoint onwards. The language and sentence structure deteriorate, with sentence fragments and unrelated words and phrases inserted throughout the story. This makes it difficult for the reader to follow and engage with the story. Despite the initial promise of a grand tale, the story's potential is ultimately undermined by its inconsistent and confusing conclusion.

Total rating: 2
"""},
                      {
                "role": "assistant",
                "content": """2"""},
                      {
                "role": "user",
                "content": f"""EVALUATION:
{explained_score}
"""},],
        )

        # Guard against an empty or whitespace-padded answer before indexing into it
        total_rating_raw = (extraction_completion.choices[0].message.content or "").strip()
        if total_rating_raw and total_rating_raw[0] in "12345":
            total_rating = int(total_rating_raw[0])
        elif total_rating_raw and total_rating_raw[-1] in "12345":
            total_rating = int(total_rating_raw[-1])
        else:
            print(f"""
                === UNINTELLIGIBLE RATING ===
                {total_rating_raw}
            """)
            # Fall back to a random rating so that the search can continue
            total_rating = int(np.random.randint(1, 6))

        return {
            "total_rating": total_rating,
            "input_tokens": evaluation_completion.usage.prompt_tokens + extraction_completion.usage.prompt_tokens,
            "output_tokens": evaluation_completion.usage.completion_tokens + extraction_completion.usage.completion_tokens
        }

    def beam_search(self) -> dict:
        """Perform beam search for story generation; returns the final stories and token usage."""
        # Initialize the beam with rewritten versions of the generated beginnings

        results = self.generate_beginnings()
        beginnings = results["story_beginnings"]
        total_input_tokens_generation = results["input_tokens"]
        total_output_tokens_generation = results["output_tokens"]

        current_beam = []
        total_input_tokens_rewriting = 0
        total_output_tokens_rewriting = 0
        for beginning in beginnings:
            result = self.rewrite_beginning(beginning)
            current_beam.append((0.0, result['rewritten_beginning']))
            # Accumulate token usage over all rewrites rather than keeping only the last one
            total_input_tokens_rewriting += result['input_tokens']
            total_output_tokens_rewriting += result['output_tokens']

        total_input_tokens_evaluation = 0
        total_output_tokens_evaluation = 0

        # The (rewritten) beginning is act 1, so only max_steps - 1 acts remain to be generated
        for step in range(1, self.max_steps):
            print(f"\n==============\nStep {step + 1}:\n")
            candidates = []

            # Generate continuations for each story in current beam
            for score, story in current_beam:
                results = self.generate_continuation(story, step + 1)
                continuations = results["story_continuations"]
                total_input_tokens_generation += results["input_tokens"]
                total_output_tokens_generation += results["output_tokens"]

                # Evaluate each continuation
                for continuation in continuations:
                    new_story = f"{story}\n {continuation}"
                    scoring_result = self.evaluate_story(new_story)
                    new_score = scoring_result["total_rating"]
                    total_input_tokens_evaluation += scoring_result["input_tokens"]
                    total_output_tokens_evaluation += scoring_result["output_tokens"]
                    candidates.append((-new_score, new_story))  # Negative for max-heap
                    print(f"Candidate (score {new_score:.2f}):\n{new_story}\n")

            # Select top-k candidates for next beam. Scores are stored negated,
            # so the smallest tuples are the highest-rated stories; no heapify is needed.
            current_beam = [(-score, story) for score, story in heapq.nsmallest(self.beam_width, candidates)]

            print("Selected for next beam:")
            for score, story in current_beam:
                print(f"Score: {score:.2f}\n{story}\n")

        # Return final stories with their scores and the token usage of every stage
        return {
            "stories": list(current_beam),
            "input_tokens_generation": total_input_tokens_generation,
            "output_tokens_generation": total_output_tokens_generation,
            "input_tokens_rewriting": total_input_tokens_rewriting,
            "output_tokens_rewriting": total_output_tokens_rewriting,
            "input_tokens_evaluation": total_input_tokens_evaluation,
            "output_tokens_evaluation": total_output_tokens_evaluation
        }


In [None]:
nebius_client = OpenAI(
    base_url="https://api.studio.nebius.ai/v1/",
    api_key=os.environ.get("NEBIUS_API_KEY"),
)

storyteller_client = nebius_client
rewriting_client = nebius_client
evaluation_client = openai

storyteller_model = 'meta-llama/Meta-Llama-3.1-8B-Instruct'
rewriting_model = 'meta-llama/Meta-Llama-3.1-405B-Instruct'
evaluation_model = 'gpt-4o-mini'

In [None]:
beam_search_story_generator = BeamSearchStoryGenerator(
     storyteller_client=storyteller_client, storyteller_model=storyteller_model,
     rewriting_client=rewriting_client, rewriting_model=rewriting_model,
     evaluation_client=evaluation_client, evaluation_model=evaluation_model,
     story_details=setting,
     beam_width=2,
     max_steps=5
)
result = beam_search_story_generator.beam_search()

In [None]:
print(result["stories"][0][0]) # rating
print(result["stories"][0][1]) # story

4
**Act 1: "The Shadows of Elderwood Grove"**

In the depths of Elderwood Grove, a forest teeming with life on the cusp of a small, rural town, Environmental Engineer Beaver (28, he/him) grapples with the legacy of his late grandmother, a renowned eco-activist who vanished under mysterious circumstances. Her cryptic journals hint at an ancient, symbiotic relationship between the forest and its inhabitants – one that Beaver is determined to unlock through his innovative eco-friendly technology. This mission is fueled by his need to understand his grandmother's disappearance and to prove himself as a worthy successor to her environmental crusade.

Beaver's closest ally, Arachnotron (26, they/them), a brilliant Inventor and Tinkerer, harbors a secret: they're an artificial intelligence created from the neural networks of the forest's creatures. As they navigate their existence, Arachnotron seeks to redefine the boundaries between technology and nature, and to find their place within the w