# Homework 7: Large Language Models

A PDF overview of the homework is [here](https://www.cs.jhu.edu/~jason/465/hw-llm/).

It mentions: "We'll send hand-in instructions soon.  Probably we will ask you to submit a version
of the main notebook, with your answers added and extraneous materials deleted. We may also
ask for a summary."

![image](handin.png)
This symbol marks a question or exercise that you will be expected to hand in.


# Getting started

## Update `conda` environment

Download the updated [nlp-class.yml](http://cs.jhu.edu/~jason/465/hw-llm/nlp-class.yml) file, and execute
```
conda env update --file nlp-class.yml --prune
```
to make sure that all the packages you need are installed.

## Fetch code and data files for this homework

These files may get improved after the homework is released, so you should probably re-download them periodically.

Here is a command you can type.  We won't put it in a code cell, because we don't want you to execute it accidentally in the current directory and overwrite your own changes.  (Actually, it will not overwrite your versions of the files — they will be renamed with names like `argubots.py.1`.)

```
!wget --quiet -r -np -nH --cut-dirs=3 -A '*.txt' -A '*.py' -A 'demo.ipynb' -A '*.png' https://www.cs.jhu.edu/~jason/465/hw-llm/
!rm -f data/*.1 *.png.1 robots.txt   # remove any backup versions of the static files
```

In [29]:
!ls -lR *.py data

-rw-r--r--@ 1 mpark  staff  19652 Dec 12 12:54 agents.py
-rw-r--r--@ 1 mpark  staff  37154 Dec 12 15:16 argubots.py
-rw-r--r--@ 1 mpark  staff   2832 Dec 12 12:54 characters.py
-rw-r--r--@ 1 mpark  staff   2641 Dec 12 12:54 dialogue.py
-rw-r--r--@ 1 mpark  staff  16176 Dec 12 15:16 eval.py
-rw-r--r--@ 1 mpark  staff  14012 Dec 12 12:54 eval2.py
-rw-r--r--@ 1 mpark  staff  14910 Dec 12 12:54 kialo.py
-rw-r--r--@ 1 mpark  staff   1347 Dec 12 12:54 logging_cm.py
-rw-r--r--@ 1 mpark  staff   1529 Dec 12 12:54 simulate.py
-rw-r--r--@ 1 mpark  staff   4274 Dec 12 12:54 tracking.py

data:
total 4512
-rw-r--r--@ 1 mpark  staff     407 Dec 12 12:54 LICENSE
-rw-r--r--@ 1 mpark  staff  613106 Dec 12 12:54 all-humans-should-be-vegan-2762.txt
-rw-r--r--@ 1 mpark  staff   81917 Dec 12 12:54 have-authoritarian-governments-handled-covid-19-better-than-others-54145.txt
-rw-r--r--@ 1 mpark  staff   52771 Dec 12 12:54 is-biden-an-incompetent-president-44217.txt
-rw-r--r--@ 1 mpark  staff  153551 Dec 12 1


The `autoreload` feature of Jupyter ensures that if an imported module (.py file) changes, the notebook will automatically import the new version.  
(However, objects that were defined with the old version of the class won't change.)

In [30]:
# Executing this cell does some magic that makes the notebook reload changed modules automatically
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Create an OpenAI client

An OpenAI API key will be sent to you.
Make a `.env` file in the same directory as this notebook, containing the following:
```
export OPENAI_API_KEY=[your API key]    # do not include the brackets here
```
Make sure others can't read this file:
```
chmod 600 .env
```

**Be sure to keep the key secret.  It gives access to a billable account.** If OpenAI finds it on the public web, they will invalidate it, and then no one (including you) can use this key to make requests anymore.
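As an optional sanity check, the permissions set by `chmod 600` can be verified from Python. The snippet below is a minimal sketch using only the standard library; the helper name `is_private` is made up for illustration and is not part of the homework code. It assumes a POSIX-style file system.

```python
import os
import stat

def is_private(path: str) -> bool:
    """Return True if neither group nor others have any permission on the file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)   # keep only the permission bits
    return mode & 0o077 == 0                     # group and other bits must all be off

# Only check if the file is actually there (assumes .env sits next to the notebook)
if os.path.exists(".env"):
    print(".env is private:", is_private(".env"))
```

If this prints `False`, rerunning `chmod 600 .env` should fix it.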

Now you can execute the following to get an OpenAI client object.

In [31]:
import dotenv
import openai
from tracking import track_usage, read_usage

dotenv.load_dotenv(override=True)      # define environment variables from .env
client = track_usage(openai.OpenAI())  # create a client, modified to record its usage to a local file 

# Or use our tracking module to do the above for you, like this:

# from tracking import default_client
# client = default_client



The job of the client is to talk to the OpenAI server over HTTP.
The `OpenAI` constructor has some optional arguments that configure these HTTP messages.
However, the defaults should work fine for you.  

## Try the model!

You can now get answers from OpenAI models by calling methods of the `client` instance.  
You will have to specify which OpenAI model to use.
Documentation of the methods is [here](https://pypi.org/project/openai/) if you are curious.

### Continue a textual prompt

This is what language models excel at.  In principle you should do it by calling [`client.completions.create`](https://platform.openai.com/docs/api-reference/completions/create?lang=python).  But OpenAI's newer models don't support that legacy API, and the older ones are being [retired in January 2024](https://openai.com/blog/gpt-4-api-general-availability).  So we'll use the more modern API, [`client.chat.completions.create`](https://platform.openai.com/docs/api-reference/chat/create?lang=python).

In [32]:
import rich   # prettyprinting

response = client.chat.completions.create(messages=[{"role": "user", 
                                                     "content": "Q: Name the planets in the solar system?\nA: "}], 
                                          model="gpt-3.5-turbo-1106",  # which model to use
                                          temperature=1,               # get a little variety
                                          max_tokens=64,               # limit on length of result
                                          stop=["Q:", "\n"])           # treat these as EOS symbols
rich.print(response)                              # the full object that was sent back from the server
rich.print(response.choices)                      # just the list of 1 answer (the default, but calling with n=5 would give 5 answers) 
rich.print(response.choices[0].message.content)   # extract the good stuff from that 1 answer

![image](handin.png)
Try running the cell above a few times. You may get different random answers — especially because the call specifies temperature 1.  (The default temperature is rumored to be 0.8.) Are the answers all equally good?

```
The planets in the solar system are Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, and Neptune.

Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune

1. Mercury

- Mercury
```
We get different answers. Most answers give the correct sequence of 8 planets, either as a comma-separated list or as a sentence. 

In rare cases, the answer contains only Mercury. This happens because the model starts to produce the planets as a list with one item per line: 
1. Mercury
2. Venus 
3. Earth 
... 

but we listed `"\n"` as a stop sequence, so the answer ends after Mercury. 
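The effect of the stop sequences can be imitated locally. The sketch below is only an illustration of the observed behavior, not how the server implements `stop`; the function name `truncate_at_stop` and the sample text are assumptions made up for this example.

```python
def truncate_at_stop(text: str, stops: list[str]) -> str:
    """Cut the text just before the earliest occurrence of any stop sequence."""
    cut = len(text)
    for s in stops:
        i = text.find(s)
        if i != -1:
            cut = min(cut, i)
    return text[:cut]

# A hypothetical full answer the model might have wanted to give
full_answer = "1. Mercury\n2. Venus\n3. Earth"
print(truncate_at_stop(full_answer, ["Q:", "\n"]))   # prints: 1. Mercury
```

With the same stop list as the call above, everything after the first newline is lost, which matches the one-planet answers.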


It might be handy to package up what we just did.<br>
The `complete` function below is a convenient way of experimenting with completing text.
It is illustrated with a grocery example.  

In [33]:
def complete(client, s: str, model="gpt-3.5-turbo-1106", *args, **kwargs):
    response = client.chat.completions.create(messages=[{"role": "user", "content": s}],
                                              model=model,
                                              *args, **kwargs)
    return [choice.message.content for choice in response.choices]

complete(client, "I went to the store and I bought apples, bananas, cherries, donuts, eggs", 
         n=10, temperature=.8, max_tokens=96)


[', and flour. I also picked up some ground beef, honey, ice cream, and jam. Additionally, I grabbed some kale, lemons, milk, nuts, and onions. Lastly, I got some pasta, quinoa, rice, spinach, and tomatoes. Overall, I think I got everything I needed for the week.',
 ', and flowers.',
 ", and flour. I also picked up some ground beef, honey, ice cream, and jam. Finally, I grabbed a loaf of bread, milk, noodles, and oranges. My shopping list was quite extensive, but I'm glad I got everything I needed.",
 ', and a few other items. The apples and bananas were fresh and ripe, the cherries were juicy and sweet, and the donuts were a delicious treat. I also picked up a dozen eggs for baking and cooking. Overall, it was a successful trip to the store!',
 ', and flour.',
 ', and fish.',
 ', and flour. I also picked up some grapes, honey, ice cream, and juice. Lastly, I grabbed some kale, lemons, milk, and nuts. Overall, I got a good variety of groceries for the week.',
 ', and flour. I also pick

![image](handin.png)
Anything could be on a grocery list, so why are the 10 different completions above so similar?<br>
Hint: The answer isn't just the temperature of 0.8.  Look especially at the long completions; run the cell again if you didn't get multiple long completions.

The completions are similar because the prompt sets up a strong pattern: apples, bananas, cherries, donuts, eggs is an alphabetical list, so the most probable continuation is an item starting with "f" (flour, flowers, fish). The long completions make this clear, since they keep going with g, h, i, and j items (ground beef, honey, ice cream, jam). LLMs tend to continue patterns like those seen in the training corpus, so every sample stays close to the most probable continuation of the prompt. However, the 10 completions are not identical, because sampling is random. The randomness yields similar but different completions.


![image](handin.png)
What happens at different temperatures?  How about temperatures > 1?  (Note: Higher temperatures tend to produce longer responses, so it's wise to use `max_tokens`.)

At temperature=0.1, the completions are very similar to each other. At temperature=0.5 and temperature=0.9, we see more and more variation because the LLM is more 'creative' and gives more randomized answers. 

At temperature=1.5, the LLM behaves even more inconsistently. 

For example it hallucinates: 
`, and flowers.\n\nInna a beautiful skitu only needed l have an seeds floor ov Seeds getNext survives Radius degrees assets technicaluvwxyz ?}-${-",?$\'"}:${ wages-firefid098920399iq`

Or starts a dialogue with itself: 
`", and fish.\n\nCould you please organize these items? \nSure! Here's how you could organize the items: \n\nFruit: Apples, Bananas, Cherries\nSweets: Donuts\nGrocery: Eggs\nMeat: Fish"`


At temperature=2, the LLM's outputs are incomprehensible: 
`' and I also picked up flour, good related let.magkpfb raw"g pose l87elcomebstractv攀vertisement:b�Compatibility.badWeekly_desrc.fpipt Comedy_windowsilogue.performance`
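One way to see why this happens: temperature divides the model's scores (logits) before they are turned into probabilities. The sketch below uses three made-up logits, not values from any real model, and shows how the distribution sharpens at low temperature and flattens at high temperature.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                              # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 0.0]   # hypothetical scores for three candidate tokens
for t in (0.1, 0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At temperature 0.1 nearly all the probability goes to the top token, while at temperature 2 the unlikely tokens get a much larger share. With a vocabulary of about 100,000 tokens, that extra share is spread over many poor choices, which is consistent with the gibberish seen above.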


*Remarks:* [In the future](https://community.openai.com/t/logprobs-are-missing-from-the-chat-endpoints/289514), you will be able to specify an argument `logprobs=5` to also get the log-probabilities of all generated tokens and of the top-5 tokens at each step.  That will produce much more output.  (This argument has always been available for the legacy API, and is available in the [Python bindings for open-source models such as Llama](https://pypi.org/project/llama-cpp-python/).  The Llama bindings also allow you to [constrain the output by an arbitrary CFG](https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md), using `grammar=...`.  This is useful if you're generating code or data that must be syntactically valid to be useful to you.  However, the OpenAI API only allows you to [constrain the output to be valid JSON](https://platform.openai.com/docs/api-reference/chat/create#chat-create-response_format).)


### Compute a function using instructions and few-shot prompting

Now let's try passing a sequence of multiple messages into the chat completions API.  In this case, we provide some instructions and one-shot prompting.

In [34]:
# original
response = client.chat.completions.create(messages=[{ "role": "system",      # instructions
                                                      "content": "Reverse the order of the words." },
                                                    { "role": "user",        # input
                                                      "content": "Good things come to those who wait." },
                                                    { "role": "assistant",   # output
                                                      "content": "Wait who those to come things good." },
                                                    { "role": "user",        # input
                                                      "content": "Colorless green ideas sleep furiously." }],
                                          model="gpt-3.5-turbo-1106", temperature=0)
# returns: 'furiously sleep ideas green colorless'

# separate by comma 
# response = client.chat.completions.create(messages=[{ "role": "system",      # instructions
#                                                       "content": "Reverse the order of the words. Separate each word with a comma." },
#                                                     { "role": "user",        # input
#                                                       "content": "Good things come to those who wait." },
#                                                     { "role": "assistant",   # output
#                                                       "content": "wait, who, those, to, come, things, good" },
#                                                     { "role": "user",        # input
#                                                       "content": "Colorless green ideas sleep furiously." }],
#                                           model="gpt-3.5-turbo-1106", temperature=0)
# returns: 'furiously, sleep, ideas, green, colorless'

# reverse by character 
# response = client.chat.completions.create(messages=[{ "role": "system",      # instructions
#                                                       "content": "Reverse the characters." },
#                                                     { "role": "user",        # input
#                                                       "content": "Good things come to those who wait." },
#                                                     { "role": "assistant",   # output
#                                                       "content": "Tiaw ohw esoht ot emoc sgniht doog." },
#                                                     { "role": "user",        # input
#                                                       "content": "Colorless green ideas sleep furiously." }],
#                                           model="gpt-3.5-turbo-1106", temperature=0)
# returns: 'Yliruoof sreeps saedi neerg ssoloc.'

# reverse meaning
# response = client.chat.completions.create(messages=[{ "role": "system",      # instructions
#                                                       "content": "Reverse the meaning of the sentence." },
#                                                     { "role": "user",        # input
#                                                       "content": "Good things come to those who wait." },
#                                                     { "role": "assistant",   # output
#                                                       "content": "Bad things come to those who wait." },
#                                                     { "role": "user",        # input
#                                                       "content": "Colorless green ideas sleep furiously." }],
#                                           model="gpt-3.5-turbo-1106", temperature=0)
# returns: 'Colorful green ideas are wide awake peacefully.'

rich.print(response)
response.choices[0].message.content                                  

'Furiously sleep ideas green colorless.'

![image](handin.png)
By modifying this call, can you get it to produce different versions of the output?
Some possible behaviors you could try to arrange:
* specific other way of formatting the output, e.g., `wait, who, those, to, come, things, good`
* match the input's way of formatting the output (same use of capitalization, punctuation, commas)
* reverse the phrases rather than reversing the words, e.g., `To those who wait come good things.` 

You can try playing with the number, the content, and the order of few-shot examples, and changing or removing the instructions.

![image](handin.png)
What happens if the examples don't match the instructions?

The assistant prioritizes the system’s instructions over the examples.

In [None]:
response = client.chat.completions.create(messages=[{ "role": "system",      # instructions
                                                      "content": "Reverse the word order." },
                                                    { "role": "user",        # input
                                                      "content": "Good things come to those who wait." },
                                                    { "role": "assistant",   # output
                                                      "content": "Tiaw ohw esoht ot emoc sgniht doog." },
                                                    { "role": "user",        # input
                                                      "content": "Colorless green ideas sleep furiously." }],
                                          model="gpt-3.5-turbo-1106", temperature=0)

**Assistant: Furiously sleep ideas green colorless.**

### Inspect the tokenization

Just for fun, let's see how the above client has been tokenizing its input and output text.  For that we can use a tokenizer that runs locally, not in the cloud, and is guaranteed to get the same outputs.

In [35]:
import tiktoken
tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo-1106")  # how this model will tokenize
toks = tokenizer.encode("Hellooo, world!") # list of integer token IDs (no BOS token is added)

print(tokenizer.decode(toks))                             # convert list back to string
for tok in toks: print(tok,"\t",tokenizer.decode([tok]))  # convert one at a time
print("Vocab size =", tokenizer.n_vocab)

Hellooo, world!
9906 	 Hello
2689 	 oo
11 	 ,
1917 	  world
0 	 !
Vocab size = 100277


### Try embedding some text

Also just for fun, let's try the embedder, which converts a string to a fixed-length vector.

In [36]:
emb_response = client.embeddings.create( input= [  # note: adjacent literal strings in Python are concatenated
        "When in the Course of human events it becomes necessary for one "
        "people to dissolve the political bands which have connected them "
        "with another, and to assume among the Powers of the earth, the "
        "separate and equal station to which the Laws of Nature and of "
        "Nature's God entitle them, a decent respect to the opinions of "
        "mankind requires that they should declare the causes which impel "
        "them to the separation." ], 
        model="text-embedding-ada-002")   # the only OpenAI model that currently offers the embeddings API
# don't print the whole response because it's very long
e = emb_response.data[0].embedding
print(f"{len(e)}-dimensional embedding starting with {e[:5]}")
print("Squared length of embedding vector: ", sum(x**2 for x in e))

1536-dimensional embedding starting with [0.021248681470751762, -0.014377851039171219, 0.010210818611085415, -0.02133774757385254, -0.00979093462228775]
Squared length of embedding vector:  1.0000000629476622
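The squared length printed above is essentially 1, which suggests the embedding is normalized to unit length. For unit-length vectors, cosine similarity reduces to a plain dot product. The sketch below illustrates this with two toy 2-dimensional vectors (made up for the example, not real embeddings).

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

u = [0.6, 0.8]   # toy unit-length vector
v = [1.0, 0.0]   # toy unit-length vector
print(dot(u, v), cosine_similarity(u, v))   # the two values agree up to rounding
```

So, under that assumption, two embeddings `e1` and `e2` from this API could be compared with `dot(e1, e2)` alone.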


### Check your usage so far

Please be careful not to write loops that use lots and lots of tokens.  That will cost us money, and could hit the per-day usage limit that is shared by the whole class.

Execute one of these cells whenever you want to see your cost so far.  Or, just keep `usage_openai.json` open as a tab in your IDE.

In [37]:
read_usage()      # reads from the file usage_openai.json

{'completion_tokens': 230743,
 'prompt_tokens': 2226445,
 'total_tokens': 2457188,
 'cost': 2.6875097999999973}

In [38]:
!cat usage_openai.json 

{
    "completion_tokens": 230743,
    "prompt_tokens": 2226445,
    "total_tokens": 2457188,
    "cost": 2.6875097999999973
}
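The usage file tracks prompt tokens and completion tokens separately, since the two are typically billed at different rates. Before writing a loop, it can help to estimate its cost roughly. The sketch below is not part of `tracking.py`; the function name and the per-1000-token prices are hypothetical placeholders, so check `tracking.py` for the prices actually used.

```python
def estimate_cost(prompt_tokens: int, completion_tokens: int,
                  input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Rough dollar cost: each kind of token is billed at its own rate."""
    return ((prompt_tokens / 1000) * input_price_per_1k
            + (completion_tokens / 1000) * output_price_per_1k)

# Hypothetical prices, for illustration only
cost = estimate_cost(prompt_tokens=10_000, completion_tokens=2_000,
                     input_price_per_1k=0.001, output_price_per_1k=0.002)
print(f"estimated cost: ${cost:.4f}")
```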

# Dialogues and dialogue agents

The goal of this assignment is to create a good "argubot" that will talk to people about controversial topics and broaden their minds.

## A first argubot (Airhead)

You can have a conversation right now with a _really bad_ argubot named Airhead.  Try asking it about climate change!  When you're done, reply with an empty string.

(The `converse()` method calls Python's `input()` function, which will prompt you for input at the command-line or by popping up a box in your IDE.)

In [16]:
import argubots
d = argubots.airhead.converse()


(mpark) hey~
(Airhead) I know right???
(mpark) what?
(Airhead) I know right???


A *bot* (short for "robot") is a system that acts autonomously.
That corresponds to the AI notion of an *agent* — a system that uses some *policy* to choose *actions* to take.

The `airhead` agent above (defined in `argubots.py`) uses a particularly simple policy.  
It is an instance of a simple `Agent` subclass called `ConstantAgent` (defined in `agents.py`).
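To make the idea of a constant policy concrete, here is a toy stand-in. It is not the real `ConstantAgent` from `agents.py`: the class name, constructor, and the choice to represent a dialogue as a list of dictionaries with `speaker` and `content` keys are all assumptions made for this sketch.

```python
class ToyConstantAgent:
    """Hypothetical stand-in: an agent whose policy ignores the dialogue so far."""

    def __init__(self, name: str, reply: str):
        self.name = name
        self.reply = reply

    def respond(self, turns: list[dict]) -> list[dict]:
        """Return a new list of turns with this agent's fixed reply appended."""
        return turns + [{"speaker": self.name, "content": self.reply}]

toy = ToyConstantAgent("Airhead", "I know right???")
turns = toy.respond([{"speaker": "mpark", "content": "what?"}])
print(turns[-1]["content"])   # prints: I know right???
```

The policy here is as simple as a policy can be: the chosen action does not depend on the state of the conversation at all.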

The result of talking to `airhead` is a `Dialogue` object (defined in `dialogue.py`). Let's look at it.

In [40]:
rich.print(d)

Each *turn* of this dialogue is just a tiny dictionary:

In [41]:
d[0]

{'speaker': 'mpark', 'content': 'hey '}
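Because each turn is a plain dictionary, ordinary Python tools work on it. The sketch below operates on a hand-written list of turns rather than a real `Dialogue` object (whose full interface is not shown here); the helper name `format_turns` is made up for this example.

```python
from collections import Counter

def format_turns(turns):
    """Render turns in the same '(speaker) content' style the notebook prints."""
    return "\n".join(f"({t['speaker']}) {t['content']}" for t in turns)

sample_turns = [{"speaker": "mpark", "content": "what?"},
                {"speaker": "Airhead", "content": "I know right???"}]

print(format_turns(sample_turns))
print(Counter(t["speaker"] for t in sample_turns))   # how many turns each speaker took
```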

## An LLM argubot (Alice)

In other CS courses like crypto, algorithms, or networks, you may have encountered "conversations" between characters named Alice and Bob.  
Let's try talking to the Alice of this homework, who is a _much stronger baseline_ than Airhead.  Your job in this assignment is to improve upon Alice.
We'll meet Bob later.

In [42]:
alicechat = argubots.alice.converse()   # or call with argument d if you want to append to the previous conversation




(mpark) hello alice
(Alice) Hi there! How do you feel about the idea of universal basic income?
(mpark) I think it would be great, but i'm not sure how realistic it is
(Alice) It's true that implementing universal basic income would be a complex and costly endeavor, but it's worth considering the potential benefits, such as reducing poverty and providing a safety net for all citizens. Additionally, some argue that the costs could be offset by eliminating other welfare programs and through increased economic activity.
(mpark) I fear that UBI would eliminate welfare programs like healthcare and social security. 
(Alice) I understand your concern. However, proponents of UBI argue that it could actually complement existing welfare programs rather than replace them. It could provide a baseline level of support while allowing individuals to still access other essential services.


As you may have guessed, `alice` is powered by a prompted LLM.  You can find the specific prompt in `argubots.py`.

So, while `agents.py` provides the core functionality for `Agent` objects, the argubot agents like `alice` — and the ones that you will write! — go into `argubots.py` instead.  This is just to keep the files small.

## Simulating human characters (Bob & friends)

You'll talk to your own argubots to get a qualitative feeling for their strengths and weaknesses.  
But can you really be sure you're making progress?  For that, a quantitative measure can be helpful.

Ultimately, you should test an argubot like Alice by having it argue with many real humans — not just you — and using some rubric to score the resulting dialogues.  But that would be slow and complicated to arrange.  

So, meet Bob!  He's just a simulated human.  You won't edit him: he is part of the development set.  Here is some information about him (from `characters.py`):

In [43]:
import characters
rich.print(characters.bob)

You can't talk directly to `characters.bob` because that's just a data object.
However, you can construct a simple agent that uses that data (plus a few more instructions) to prompt an LLM.

(Which LLM does it prompt?  The `CharacterAgent` constructor (defined in `agents.py`) defaults to a GPT-3.5 model that is specified in `tracking.py`.  But you can override that using keyword arguments.)

Try talking to Bob about climate change, too.

In [44]:
from agents import CharacterAgent
bob = CharacterAgent(characters.bob)    # actually, agents.bob is already defined this way
bob.converse()        # returns a dialogue, but we've already seen it so we don't want to print it again
None                  # don't print anything for this notebook cell 




(mpark) hello bob
(Bob) Hello there! How can I help you today?
(mpark) i love meat
(Bob) That's great, but have you ever considered the benefits of a vegetarian diet?
(mpark) no and i never will 
(Bob) That's totally understandable, but I still encourage you to look into the positive impact that a vegetarian diet can have on your health and the environment.
(mpark) what if i say no 
(Bob) That's completely fine, everyone is entitled to their own dietary choices.


Of course, a proper user study can't just be conducted with one human user.

So, meet our bevy of beautiful Bobs!  (They're not actually all named Bob — we continued on in the alphabet.)


In [45]:
import agents
agents.devset

[<CharacterAgent for character Bob>,
 <CharacterAgent for character Cara>,
 <CharacterAgent for character Darius>,
 <CharacterAgent for character Eve>,
 <CharacterAgent for character TrollFace>]

In [46]:
agents.cara.converse()
None




(mpark) hello cara
(Cara) Hi there.
(mpark) what are your thoughts on UBI 
(Cara) I don't believe UBI is the most effective solution to addressing economic issues.
(mpark) what is then? 
(Cara) I believe in a combination of policies that promote job growth, skill development, and entrepreneurship.
(mpark) so capitalism? 
(Cara) I think a free market system has its merits, but it's important to balance it with social safety nets and regulations to ensure fairness and equality.
(mpark) so you're a centrist
(Cara) I prefer not to be labeled, but I believe in a balanced approach to economic and social policies.
(mpark) how would you label yourself
(Cara) I would say I'm an independent thinker who evaluates each issue on its own merits.


You can see the underlying character data here in the notebook.  Your argubot will have to deal with all of these topics and styles!

In [47]:
rich.print(characters.devset)

## Simulating conversation 

We can make Alice and Bob chat.

In [48]:
from dialogue import Dialogue
d = Dialogue()                                              # empty dialogue
d = d.add('Alice', "Do you think it's okay to eat meat?")   # add first turn
print(d)


(Alice) Do you think it's okay to eat meat?


In [49]:
d = agents.bob.respond(d)
d = argubots.alice.respond(d)
print(d)

(Alice) Do you think it's okay to eat meat?
(Bob) I believe that a vegetarian diet is the most ethical and sustainable choice for everyone.
(Alice) While vegetarianism definitely has its benefits, it's also important to consider that sustainable and ethical meat production is possible, especially through regenerative agriculture practices that can help restore ecosystems and sequester carbon. Additionally, for some individuals, including meat in their diet is culturally or nutritionally important.


In [50]:
d = agents.bob.respond(d)
d = argubots.alice.respond(d)
print(d)

(Alice) Do you think it's okay to eat meat?
(Bob) I believe that a vegetarian diet is the most ethical and sustainable choice for everyone.
(Alice) While vegetarianism definitely has its benefits, it's also important to consider that sustainable and ethical meat production is possible, especially through regenerative agriculture practices that can help restore ecosystems and sequester carbon. Additionally, for some individuals, including meat in their diet is culturally or nutritionally important.
(Bob) I understand your perspective, but I still believe that a vegetarian diet is the most ethical and sustainable choice for everyone.
(Alice) I appreciate your commitment to ethical and sustainable choices. While a vegetarian diet has clear benefits, it's worth considering that responsible and ethical meat consumption, such as supporting local and small-scale producers, can also contribute to a more sustainable food system and support rural economies. Flexibility and an open mind can help 

Anyway, let's see what happens when Alice and Bob talk for a while...

In [51]:
from simulate import simulated_dialogue
d = simulated_dialogue(argubots.alice, agents.bob, 8)
rich.print(d)

Sometimes this kind of conversation seems to stall out, with Bob in particular repeating himself a lot.  Alice doesn't seem to have a good strategy for getting him to open up.  Maybe you can do a better job talking to Bob, and that will give you some ideas about how to improve Alice?

In [52]:
myname = alicechat[0]['speaker']   # your name, pulled from an earlier dialogue
agents.bob.converse(d[0:2].rename('Alice', myname))  # reuse the same first two turns
None

(mpark) Do you think it's ok to eat meat?
(Bob) I believe that choosing a vegetarian diet is the best choice for both personal health and the well-being of animals and the environment.


(mpark) but i love the taste of meat
(Bob) I understand that people have different tastes, but I believe that there are so many delicious and healthy vegetarian options available that can satisfy your cravings.
(mpark) no i hate veggies
(Bob) I encourage you to explore different vegetarian recipes and options, as there are countless delicious and diverse plant-based foods that you may enjoy.


You can also try talking to the other characters and having Alice (or Airhead) talk to them.

**You might enjoy** defining additional characters in `characters.py`, or right here in the notebook.
Feel free to talk to those and evaluate them.  They could be variants on the existing characters, or something entirely new. 

However, **don't change the dev set** — the characters we just loaded must stay the same.  Your job in this homework is to improve the argubot (or at least try).  And that means improving it according to a fixed and stable eval measure.

As an exception, you can change the languages that a couple of the characters speak. It may be fun for you to see them try to speak your native language.  And that doesn't really affect the quality of the argument.

In [53]:
# example
trollFace2 = characters.trollFace.replace(languages = ["Chinese", "Spanish"])
rich.print(trollFace2)
simulated_dialogue(argubots.alice, CharacterAgent(trollFace2), 6)

(Alice) Do you think Donald Trump was a good president?
(TrollFace) 你在说什么？特朗普？哈哈，他就好像一部荒谬的喜剧电影的主角一样，总是让人捧腹大笑！
(Alice) 懂得欣赏政治丑闻确实很有趣，但特朗普也吸引了大批美国选民。他的支持者认为他能够打破传统政治，并且对经济有所改善。
(TrollFace) 哦，是吗？也许他的支持者被他那荒谬的发型和荒唐的推特所吸引吧！不过，也许他的政治丑闻和无稽之谈更足以让人捧腹大笑了！
(Alice) 确实，特朗普的发型和社交媒体使用方式引起了许多争议。然而，一些人认为他的直言不讳和反传统政治风格使他看起来更接地气，而一些他的政策也受到了一些人的支持。
(TrollFace) 哦，难怪有人支持他，也许他们也欣赏他那令人捧腹的表演和无厘头的言论吧！不过，他的政策也许是另一部搞笑剧的笑料！

### Efficiency: Batched generation?

Notice that we are making a separate LLM call to generate each turn of the dialogue.  When we generate the $n^\text{th}$ turn, we send the server the whole dialogue history — the previous $n\!-\!1$ turns — along with some instructions.  The server has to re-encode it with the Transformer, and it charges us for doing so (see the "input token" costs in `tracking.py`).  

That is probably inevitable for real dialogue.  But for simulated dialogue, a more efficient approach would be to generate the whole dialogue between Alice and Bob in one LLM call.  Then you would be charged just once for each dialogue turn.  Under this approach, the Transformer encodes each token as soon as it is generated (see the "output token" costs in `tracking.py`).  The encoded token stays in the context throughout the dialogue, so it doesn't have to be re-encoded on a later call.  There is no later call.  

Under current pricing models, that would reduce the dollar cost of generating $n$ turns from $O(n^2)$ to $O(n)$.  
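The billing arithmetic behind that claim can be checked with a small count. The sketch below assumes, purely for illustration, that every turn has the same number of tokens and that the instructions cost nothing; the function names are made up for this example. It counts only billed tokens, not computation.

```python
def billed_tokens_turn_by_turn(n_turns: int, tokens_per_turn: int):
    """One call per turn: the k previous turns are re-sent as input before each new turn."""
    input_tokens = sum(k * tokens_per_turn for k in range(n_turns))
    output_tokens = n_turns * tokens_per_turn
    return input_tokens, output_tokens

def billed_tokens_batched(n_turns: int, tokens_per_turn: int):
    """One call for the whole dialogue: every turn is billed once, as output."""
    return 0, n_turns * tokens_per_turn

for n in (4, 8, 16):
    print(n, billed_tokens_turn_by_turn(n, 50), billed_tokens_batched(n, 50))
```

Doubling the number of turns roughly quadruples the input tokens in the turn-by-turn approach, while the batched total only doubles.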

However, the pricing model doesn't quite reflect the computational costs.  
* ![image](handin.png) Using $O(\cdot)$ notation, what is the total number of floating-point operations needed to generate $n$ turns under each approach?  
* ![image](handin.png) Parallelism may help reduce the runtime.  Using $O(\cdot)$ notation, what is the total number of seconds needed to generate $n$ turns under each approach?  (Assume that the GPU is big enough, relative to $n$, that it can process all input tokens in parallel.)

The problem with the more efficient approach is that it gives you no way to change the instructions (the system prompt) each time we switch from Alice to Bob and back again.  You'd need to generate the whole conversation using a single set of instructions.

![image](handin.png)
Can you get this to work?  Specifically, try completing the cell below.  You don't have to use the `Agent` or `Dialogue` classes.  It's okay to just throw together something like the `complete()` method above.  Just see whether you can manage to prompt GPT-3.5 to generate a multi-turn dialogue between two characters who have different personalities and goals.  Is the quality better or worse than generating one turn at a time?  If worse, does it help to switch to GPT-4?

In [54]:
# Like `simulated_dialogue` in `simulate.py`. However, this one is called on two
# Characters, not two Agents, and it returns a string rather than a Dialogue.
import random 
from tracking import default_client, default_model
from characters import Character

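def simulated_dialogue_batch(a: Character, b: Character, turns: int = 8, *,
                             starter=True) -> str:
    """Generate a whole dialogue between characters `a` and `b` in a single LLM call."""

    content = None   # stays None if no conversation starter is available
    if starter:
        # a tries to open with one of b's conversation starters, if any are defined
        try:
            starters = b.conversation_starters  # type: ignore
            content = random.choice(starters)
        except (AttributeError, TypeError, ValueError, IndexError):
            pass

    prompt = ("Generate a dialogue between " + a.name + " and " + b.name + " with " + str(turns) + " turns.\n\n" +
              a.name + " is a " + a.persona + " and " + a.conversational_style + ".\n\n" +
              b.name + " is a " + b.persona + " and " + b.conversational_style + ".\n\n")
    if content is not None:
        prompt += a.name + " begins by replying to prompt: " + content + ".\n\n"

    # uses `complete()` from earlier in this notebook; `client` is assumed to be defined there too
    completions = complete(client, prompt, n=1, temperature=0.5)
    return completions[0]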
    
# Try it out!
result = simulated_dialogue_batch(characters.bob, characters.cara)
rich.print(result)

# d = simulated_dialogue(CharacterAgent(characters.bob), CharacterAgent(characters.cara), 8)
# rich.print(d)

In [55]:
simulated_dialogue(agents.bob, agents.cara)

(Bob) Do you think it's ok to eat meat?
(Cara) Yes, I believe it's a personal choice whether or not to eat meat.
(Bob) I understand that it's a personal choice, but have you considered the ethical and environmental impact of meat consumption?
(Cara) I'm aware of the discussions around the ethical and environmental impact of meat consumption, but I believe everyone has their own perspectives on this matter.
(Bob) I appreciate that everyone has their own perspectives, but I hope you can consider the benefits of vegetarianism for animals and the planet.
(Cara) I understand that vegetarianism has its benefits, but I personally choose to continue consuming meat for my own reasons.

In [56]:
simulated_dialogue(agents.eve, agents.trollFace)

(Eve) Do you think Donald Trump was a good president?
(TrollFace) Oh, the man who made "covfefe" a thing? Classic.
(Eve) Haha, yes, that was quite a memorable moment!
(TrollFace) I know, right? It's like he invented a whole new language by accident!
(Eve) It's amazing how one little typo can create such a stir!
(TrollFace) It's like a modern-day Shakespearean tragedy, but with Twitter instead of a stage.

# Model-based evaluation

What is our goal for the argubot?  We'd like it to broaden the thinking of the (simulated) human that it is talking to.  Indeed, that's what Alice's prompt tells Alice to do.

This goal is inspired by the recent paper [Opening up Minds with Argumentative Dialogues](https://aclanthology.org/2022.findings-emnlp.335/), which collected human-human dialogues:

> In this work, we focus on argumentative dialogues that aim to open up (rather than change) people’s minds to help them become more understanding to views that are unfamiliar or in opposition to their own convictions. ... Success of the dialogue is measured as the change in the participant’s stance towards those who hold opinions different to theirs.

Arguments of this sort are not like chess or tennis games, with an actual winner.  The argubot will almost never hear a human say "You have convinced me that I was wrong."  But the argubot did a good job if the human developed **increased understanding and respect for an opposing point of view**.  

To find out whether this happened, we can use a questionnaire to ask the human what they thought after the dialogue.  For example, after Alice talks to Bob, we'll ask Bob to evaluate what he thinks of Alice's views.  Of course, that depends on his personality — Alice needs to talk to him in a way that reaches *him* (as much as possible).  We'll also ask an outside observer to evaluate whether Alice handled the conversation with Bob well.

Of course, we're still not going to use real humans.  Bob is a fake person, and so is the outside observer (whose name is Judge Wise).
Using an LLM as an eval metric is known as *model-based evaluation*.  It has pros and cons:
* It is cheaper, faster, and more replicable than hiring actual humans to do the evaluation.  
* It might give different answers than what humans would give.   

Social scientists usually refer to a metric's **reliability** (low variance) and **validity** (low bias).  So the points above say that model-based evaluation is reliable but not necessarily valid.  In general, an LLM-based metric (like any metric) needs to be validated to confirm that it really does measure what it claims to measure.  (For example, that it correlates strongly with some other measure that we already trust.)  In this homework, we'll skip this step and just pray that the metric is reasonable.

To see how this works out in practice, open up the `demo` notebook, which walks you through the evaluation protocol.  You'll see how to call the [starter code](http://cs.jhu.edu/~jason/465/hw/llm), how it talks to the LLM behind the scenes, and what it is able to accomplish. 

To help to validate the metric, check that Airhead gets a low score.  (It should!)

# Reading the starter code

The `demo` notebook gave you a good high-level picture of what the starter code is doing.  So now you're probably curious about the details.  Now that you've had the view from the top, here's a good bottom-up order in which to study the code.  You don't need to understand every detail, but you will need to understand enough to call it and extend it.

* `characters.py`.  The `Character` class is short and easy.

* `dialogue.py`.  The `Dialogue` class is meant to serve as a record of a natural-language conversation among any number of humans and/or agents.  On each *turn* of the dialogue, one of the speakers says something.  

   The dialogue's sequence of turns may remind you of the sequence of messages that is sent to OpenAI's chat completions API.  But the OpenAI messages are only labeled with the 4 special roles `user`, `assistant`, `tool`, and `system`.  Those are not quite the same thing as human speakers.  And the OpenAI messages do not necessarily form a natural-language dialogue: some of the messages are dealing with instructions, few-shot prompting, tool use, and so on.  The `agents.dialogue_to_openai` function in the next module will map a `Dialogue` to a (hopefully appropriate) sequence of messages for asking the LLM to extend that dialogue.

* `agents.py`.  This module sets up the problem of automatically predicting the next turn in a dialogue, by implementing an `Agent`'s `response()` method.  The `Agent` base class also has some simple convenience methods that you should look at.  

   Some important subclasses of `Agent` are defined here as well.  However, you may want to skip over `EvaluationAgent` and come back to it only when you read `eval.py`.

* `simulate.py` makes agents talk to one another, which we'll do during evaluation.

* `argubots.py` starts to describe some useful agents.  One of them makes use of the `kialo.py` module, which gives access to a database of arguments.

* `eval.py` makes use of `simulate.simulated_dialogue` and `agents.EvaluationAgent` to evaluate an argubot.

* We also have a couple of utility modules.  These aren't about NLP; look inside if needed.  `logging_cm.py` is what enabled the context manager `with LoggingContext(...):` in the demo notebook.  `tracking.py` sets some global defaults about how to use the OpenAI API, and arranges to track how many tokens we're paying for when you call it.
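
To make the distinction between dialogue turns and OpenAI messages concrete, here is a minimal sketch of one possible mapping. It is not the starter code's `agents.dialogue_to_openai`. The function `to_messages`, its parameters, and the list-of-pairs representation of a dialogue are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Turn = Tuple[str, str]   # (speaker name, what they said); a stand-in for a Dialogue turn

def to_messages(turns: List[Turn], speaker: str,
                system: str = "") -> List[Dict[str, str]]:
    """Build chat messages asking the LLM to speak next as `speaker`."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    for name, content in turns:
        if name == speaker:
            # the LLM's own earlier turns
            messages.append({"role": "assistant", "content": content})
        else:
            # everyone else is folded into the `user` role, tagged by name
            messages.append({"role": "user", "content": f"{name}: {content}"})
    return messages

msgs = to_messages([("Bob", "Is it ok to eat meat?"),
                    ("Alice", "What makes you ask?"),
                    ("Cara", "I think it's a personal choice.")],
                   speaker="Alice", system="You are Alice, a thoughtful debater.")
for m in msgs:
    print(m)
```

Note how three human-like speakers collapse into just two roles, which is why the speaker names have to be carried in the message text.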

# Similarity-based retrieval: Looking up relevant responses

Now, it is fine to prompt an LLM to generate text, but there are other methods!
There is a long history of machine learning methods that "memorize" the training data.
To make a prediction or decision at test time, they consult the stored training examples
that are most similar to the test situation.

_Similarity-based retrieval_ means that given a document $x$, you find the "most similar" documents $y \in Y$, where $Y$ is a given collection of documents.  The most common way to do this is to maximize the _cosine similarity_ $\vec{e}(x) \cdot \vec{e}(y)$, where $\vec{e}(\cdot)$ is an embedding function that returns unit-length vectors.  (If the embeddings are not unit-length, divide the dot product by $\|\vec{e}(x)\| \, \|\vec{e}(y)\|$.)
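
Here is a minimal sketch of retrieval by cosine similarity, assuming the embeddings are already available as plain lists of floats. The helper names and the toy 3-dimensional vectors are invented for this example.

```python
import math
from typing import Dict, List, Sequence

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity: the dot product of u and v after scaling each to unit length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def most_similar(x: Sequence[float], Y: Dict[str, Sequence[float]], n: int = 1) -> List[str]:
    """Names of the n documents in Y whose embeddings are closest to x."""
    return sorted(Y, key=lambda name: cosine(x, Y[name]), reverse=True)[:n]

# toy 3-dimensional "embeddings" (made up for this example)
Y = {"doc_a": [1.0, 0.0, 0.0], "doc_b": [0.6, 0.8, 0.0], "doc_c": [0.0, 0.0, 2.0]}
print(most_similar([0.5, 0.5, 0.0], Y, n=2))
```

Sorting the whole collection is fine for a toy example. For a large $Y$ you would want an index that avoids scoring every document.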

Should we use the OpenAI embedding model?  We could, but we would have to precompute $\vec{e}(y)$ for all $y \in Y$, and store all these vectors in a data structure that supports some type of fast similarity-based search (e.g., using the [FAISS](https://faiss.ai/index.html) package).  An alternative would be to upload the documents to OpenAI and let OpenAI compute and store the embeddings.  We would then use their similarity-based [retrieval tool](https://platform.openai.com/docs/assistants/overview).

A simpler and faster approach—which sometimes even works better—is to use a _bag of tokens_ embedding function: Define $\vec{e}(y)$ to be the vector in $\mathbb{R}^V$ that records the count of each type of token in a tokenized version of $y$, where $V$ is the token vocabulary.  [BM25](https://en.wikipedia.org/wiki/Okapi_BM25) is a refined variant of that idea, where the counts are adjusted in 3 ways: 

* smooth the counts
* normalize for the document length $|y|$ so that longer documents $y$ are not more likely to be retrieved
* downweight tokens that are more common in the corpus (such as ` the` or `ing`) since they provide less information about the content of the document
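
The three adjustments above can be sketched in a few lines of pure Python. This is one common form of the BM25 formula, written for illustration. The constants `k1=1.5` and `b=0.75` are typical choices, and the IDF variant used here is an assumption, so the numbers need not match what `rank_bm25` returns.

```python
import math
from collections import Counter
from typing import List

def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every tokenized document in `docs` against the tokenized `query`."""
    N = len(docs)
    avg_len = sum(len(d) for d in docs) / N
    # document frequency: in how many documents does each token type appear?
    df = Counter(tok for d in docs for tok in set(d))
    scores = []
    for d in docs:
        counts = Counter(d)
        score = 0.0
        for tok in query:
            f = counts[tok]
            if f == 0:
                continue
            # tokens that are rarer in the corpus get a larger weight
            idf = math.log(1 + (N - df[tok] + 0.5) / (df[tok] + 0.5))
            # b controls how strongly the count is normalized for document length
            length_norm = 1 - b + b * len(d) / avg_len
            # k1 makes repeated occurrences count for less and less
            score += idf * f * (k1 + 1) / (f + k1 * length_norm)
        scores.append(score)
    return scores

corpus = ["Hello there good man!",
          "It is quite windy in London",
          "How is the weather today?"]
docs = [doc.split(" ") for doc in corpus]
scores = bm25_scores("windy London".split(" "), docs)
print(scores)
```

Only the second document contains either query token, so it is the only one with a nonzero score.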


You might like to play with the `rank_bm25` package ([documentation](https://pypi.org/project/rank-bm25/)).  It is widely used and very easy to use.

In [57]:
from rank_bm25 import BM25Okapi as BM25_Index   # the standard BM25 method

# experiment here!  You could try the examples in the rank_bm25 documentation.
from rank_bm25 import BM25Okapi

corpus = [
    "Hello there good man!",
    "It is quite windy in London",
    "How is the weather today?"
]

tokenized_corpus = [doc.split(" ") for doc in corpus]
print(tokenized_corpus)
bm25 = BM25Okapi(tokenized_corpus)

[['Hello', 'there', 'good', 'man!'], ['It', 'is', 'quite', 'windy', 'in', 'London'], ['How', 'is', 'the', 'weather', 'today?']]


In [58]:
query = "windy London"
tokenized_query = query.split(" ")

doc_scores = bm25.get_scores(tokenized_query)
doc_scores

array([0.        , 0.93729472, 0.        ])

In [59]:
bm25.get_top_n(tokenized_query, corpus, n=1)

['It is quite windy in London']

## The Kialo corpus

How can we use similarity-based retrieval to help build an argubot?  It's largely about having the right data!

[Kialo](https://www.kialo.com) is a collaboratively edited website (like Wikipedia) for discussing political and philosophical topics.  For each topic, the contributors construct a tree of _claims_.  Each claim is a natural-language sentence (usually), and each of its children is another claim that supports it ("pro") or opposes it ("con").  For example, check out the tree rooted at the claim ["All humans should be vegan."](https://www.kialo.com/all-humans-should-be-vegan-2762).

We provide a class `Kialo` for browsing a collection of such trees.  Please read the [source code](https://www.cs.jhu.edu/~jason/465/hw-llm) in `kialo.py`.  The class constructor reads in text files that are [exported Kialo discussions](https://support.kialo.com/en/hc/exporting-a-discussion/); we have provided some in the [data directory](https://www.cs.jhu.edu/~jason/465/hw-llm/data).  The class includes a BM25 index, to be able to find claims that are relevant to a given string.

In [3]:
from kialo import Kialo

Ok, let's pull the retrieved discussions (the `.txt` files) into our data structure.

For BM25 purposes, we have to be able to turn each document (that is, each Kialo claim) into a list of string or integer tokens.

In [6]:
from typing import List
import glob

# kialo = Kialo(glob.glob("data/*"), tokenizer=tokenizer.encode)  # using the LLM's tokenizer doesn't work here for some reason
kialo = Kialo(glob.glob("data/*"))  # use simple default tokenizer
f"This Kialo subset contains {len(kialo)} claims"

'This Kialo subset contains 6251 claims'

Let's use sampling to see what kind of stuff is in the data structure.

In [7]:
kialo.random_chain()   # just a single random claim

['He has spoken words in public that indicates he is racist.']

In [8]:
kialo.random_chain(n=4)

['Eating and producing meat heavily contributes to climate change.',
 "The meat lover's diet has the worst carbon footprint: about double the amount of a vegan diet.",
 'Transforming into a vegan diet may do very little to reduce climate change as business\xa0organisations may simply create artificial products for human consumption that may re-enforce the issue of greenhouse gasses and pollution thereby re-enforcing the issue of climate change rather than ending it and further adding to the poor diet in humans',
 'Just because something can go wrong in one very specific way does not mean it is not worth attempting where other outcomes are possible, especially if the risk of adverse outcomes is known and addressed.']

### Similarity-based retrieval from the Kialo corpus

Let's try it, using BM25!

In [9]:
kialo.closest_claims("animal populations", n=10)

['Industrial agriculture can dangerously decrease animal populations.',
 'Sustainable livestock farming is not contributing to significant decreases in animal populations. Decreasing animal populations is a problem specific to industrial livestock farming.',
 'Effective vegan methods to control animal populations exist.',
 "Generally feeding animals farm-grown produce is thought to have harmful affects on both the animal and human populations of a region when we could allow nature to self-regulate its populations. Animal feeding could potentially be used to lessen the immediate impact of widespread deforestation on some species, but generally this would be drastically less efficient than choosing not to destroy their habitats in the first place and would only slow the local animal population's imminent demise.",
 'Trap, neuter, and release schemes already exist for some animal populations (such as feral cats). These schemes could be applied to former livestock living in the wild.',
 'H

We can restrict to claims for which the Kialo data structure has at least one counterargument ("con" child).

In [10]:
kialo.closest_claims("animal populations", n=10, kind='has_cons')

['Industrial agriculture can dangerously decrease animal populations.',
 'Effective vegan methods to control animal populations exist.',
 'Human-introduced species have historically devastated local wildlife populations across the world.',
 'COVID-19 has devastated prison populations, whose lives are the responsibility of the state.',
 'High demand for vegan foods may hike prices for local populations that previously depended on them.',
 'It is generally poorer countries that have expanding populations. The first world has now reached a point of stagnant population growth - even declining populations, as in the case of Japan and others. The inability of poorer countries to control their populations should not impact the lives of those in the first world. The first world having earned their luxuries and should not be denied them.',
 'Vegan populations are, on average, less likely to suffer from obesity, a major risk factor for many diseases and health problems.',
 'Humans, as apex preda

In [11]:
c = _[0]    # first claim above
print("Parent claim:\n\t" + str(kialo.parents[c]))
print("Claim:\n\t" + c)
print('\n\t* '.join(["Pro children:"] + kialo.pros[c]))
print('\n\t* '.join(["Con children:"] + kialo.cons[c]))

Parent claim:
	In a vegan world, fewer species would be at risk of extinction.
Claim:
	Industrial agriculture can dangerously decrease animal populations.
Pro children:
	* The fishing industry is especially deleterious to the ocean's biota due to overfishing and the disruption of the natural ecosystem.
	* Up to 100,000 species go extinct annually, largely due to the environmental effects of animal agriculture.
Con children:
	* Sustainable livestock farming is not contributing to significant decreases in animal populations. Decreasing animal populations is a problem specific to industrial livestock farming.


### Does BM25 really work?

![image](handin.png)
Unfortunately, we see that `"animal population"` gives quite different results from `"animal populations"`.  Why is that and how would you fix it?

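This is because BM25 matches exact tokens, and the default tokenizer treats `population` and `populations` as different token types even though they are forms of the same word.  One fix is to put both forms in the query; a more general fix would be to stem or lemmatize the tokens.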

```
# fix? 
kialo.closest_claims("animal population animal populations",10)
```
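
A more general fix than listing both forms in the query is to stem the tokens before indexing and querying. The sketch below uses a deliberately crude suffix stripper, invented for this example; a real stemmer or lemmatizer would do better. The commented-out call earlier in the notebook suggests that the `Kialo` constructor accepts a `tokenizer` argument, so a function like this could presumably be passed there, but that is an assumption to check against `kialo.py`.

```python
import re
from typing import List

def crude_stem(token: str) -> str:
    """Strip a few common English suffixes. Far cruder than a real stemmer."""
    for suffix in ("ies", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)] + ("y" if suffix == "ies" else "")
    return token

def stemming_tokenizer(text: str) -> List[str]:
    """Lowercase, split on non-letters, and stem each token."""
    return [crude_stem(tok) for tok in re.findall(r"[a-z]+", text.lower())]

print(stemming_tokenizer("animal population"))
print(stemming_tokenizer("Animal populations."))
```

Both queries now map to the same token list, so they would retrieve the same claims.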

![image](handin.png)
Also, both queries seem to retrieve some claims that are talking about human populations, not animal populations.  Why is that and how would you fix it?

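Some retrieved claims are about human populations because BM25 adds up a separate score for each query token, so a claim that matches only one of `animal` or `populations` can still rank highly.  The fix below repeats `animal` in the query to give it more weight.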
```
# fix 
kialo.closest_claims("animal animal animal populations",10)
```

In [12]:
kialo.closest_claims("animal population",10)

['As long as our ability to produce both animal feed crops and food crops for our human population are not exceeded, this point is irrelevant.',
 "36% of the calories produced by the world's crops are being used for animal feed, of which only 12% then turn into animal products that can be eaten by the human population. That is a waste of 24% of the world's crops.",
 'The claim that "most of the cultural shift and loss is due to mostly vegan cultures turning to animal products" is completely unfounded, and the Brokpa people which you cited are an outlier as a group that has a population of less than 70k people. Worldwide the population of vegan people has only increased.',
 "Developed nations are fueling the 3rd world and underdeveloped nation's population boom by exporting/donating food to areas that cannot sustain their current population.",
 'This argument assumes that sentience is the only objection to the consumption of animal products, failing to address the issues involved with t

In [13]:
kialo.closest_claims("animal populations animal population", 10)

['Industrial agriculture can dangerously decrease animal populations.',
 'Sustainable livestock farming is not contributing to significant decreases in animal populations. Decreasing animal populations is a problem specific to industrial livestock farming.',
 'Effective vegan methods to control animal populations exist.',
 "Generally feeding animals farm-grown produce is thought to have harmful affects on both the animal and human populations of a region when we could allow nature to self-regulate its populations. Animal feeding could potentially be used to lessen the immediate impact of widespread deforestation on some species, but generally this would be drastically less efficient than choosing not to destroy their habitats in the first place and would only slow the local animal population's imminent demise.",
 'Trap, neuter, and release schemes already exist for some animal populations (such as feral cats). These schemes could be applied to former livestock living in the wild.',
 "3

In [14]:
kialo.closest_claims("animal animal animal population", 10)

['There are more ethical routes to obtain animal products that emphasize animal welfare and dignity.',
 'Animal slaughter can be mechanized.',
 'The word "animal" has wider range than word "human". Although every human is an animal, not every animal is a human. Statements concerning more narrow terms do not necessarily apply to wider terms.',
 "36% of the calories produced by the world's crops are being used for animal feed, of which only 12% then turn into animal products that can be eaten by the human population. That is a waste of 24% of the world's crops.",
 'Human pleasure evaporates quickly while an animal life is lost forever. These after effects make an animal life worth more.',
 'Humans should stop eating animal meat.',
 'Properly managed animal farming can benefit biodiversity.',
 'Veganism reduces both human and animal suffering.',
 'Industrial agriculture can dangerously decrease animal populations.',
 'The religion of Tengrism involves animal sacrifice.']

## A retrieval bot (Akiko)

The starter code defines a simple argubot named Akiko (defined in `argubots.py`) that doesn't use an LLM at all.  It simply finds a Kialo claim that is similar to what the human just said, and responds with one of the Kialo counterarguments to that claim.
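
The lookup-and-respond idea can be sketched as follows. This is not the code in `argubots.py`: it uses plain word overlap instead of BM25, and the two-claim database and all function names are invented for this example.

```python
import random
from typing import Dict, List

# invented mini-database: claim -> counterarguments ("con" children)
CONS: Dict[str, List[str]] = {
    "Cats make better pets than dogs.": ["Dogs are easier to train than cats."],
    "Remote work improves productivity.": ["Remote work makes collaboration harder."],
}

def overlap(a: str, b: str) -> int:
    """Number of lowercase word types shared by two strings."""
    return len(set(a.lower().split()) & set(b.lower().split()))

def retrieval_response(user_turn: str, rng: random.Random) -> str:
    """Find the stored claim closest to the user's turn and answer with one of its cons."""
    closest = max(CONS, key=lambda claim: overlap(user_turn, claim))
    return rng.choice(CONS[closest])

print(retrieval_response("I think cats are the best pets", random.Random(0)))
```

A short turn such as "Yes" shares no words with any claim, so `max` simply returns the first claim in the dictionary, and the response is unrelated to the conversation.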

You already watched Akiko argue with Darius in the `demo` notebook.  If you look at the log messages, you'll see the claims that Akiko retrieved, as well as the LLM calls that Darius made.  

You can talk to Akiko yourself now.  (Remember that Akiko only knows about subjects that it read about in the [`data` directory](https://www.cs.jhu.edu/~jason/465/hw-llm/data/).  If you want to talk about something else, you can add more conversations from [kialo.com](https://www.kialo.com); see the [LICENSE](https://www.cs.jhu.edu/~jason/465/hw-llm/data/LICENSE) file.)


In [17]:
from logging_cm import LoggingContext
with LoggingContext("agents", "INFO"):   # temporarily increase logging level
    argubots.akiko.converse()
    




(mpark) meat is bad
(Akiko) All animals including humans and many plant related source produce methane and not just livestock. Source: link.springer.com


(mpark) but cows produce a significant amount of methane which is worse than co2
(Akiko) While the farming of cattle appears to emit methane due to digestion of grassars, grass  dies and rots (emitting methand) as part of its natural cycle if not eaten.


(mpark) well if it's not for cattle, there wouldn't be as much grass
(Akiko) The leftist demographic never expected the Republican party to have a leftist candidate and thus did not feel alienated when Trump was selected as their Republican candidate.


## Making your own retrieval bot (Akiki)

As you can see when talking to Akiko yourself, Akiko does poorly when responding to a short or vague dialogue turn (like "Yes"), because the "closest claim" in Kialo may be about a totally different subject.  Akiko does much better at responding to a long and specific statement.  

So try implementing a new argubot, called Akiki, that is very much like Akiko but does a better job of staying on topic in such cases.  It should be able to **look at more of the dialogue** than the most recent turn.  But the most recent dialogue turn should still be "more important" than earlier turns.  

The details are up to you.  Here are a few things you could try:
* include earlier dialogue turns in the BM25 query only if the BM25 similarity is too low without them
* weight more recent turns more heavily in the BM25 query (how can you arrange that?)
* treat the human's earlier turns differently from Akiki's own previous turns

![image](handin.png)
Implement your new bot in `argubots.py`, and adjust it until `argubots.akiki.converse()` seems to do a better job of answering your short turns, compared to `argubots.akiko.converse()`.  Make sure it still gives appropriate responses to long turns, too.  Give some examples in the notebook of what worked well and badly, with discussion.

### Akiki discussion:

Our strategy has two parts. First, we use a threshold to check whether the best-matching claim in Kialo is similar enough to the user's most recent turn. The threshold is a parameter (`thresh`) that we can tune later, and the score comes from `bm25.get_scores()`. If the best score is above the threshold, Akiki does what Akiko does: it finds the 3 closest claims in Kialo and picks one at random. If the best score is below the threshold, Akiki incorporates all of the user's previous turns: it takes a weighted average of their similarity score vectors and uses the argmax to pick the best-matching claim. We implemented that as follows.

We score each of the user's turns against the whole corpus.

Example: for user turns `[s1, s2, s3]`, the scores are `score[s1]`, `score[s2]`, and `score[s3]`. Each of these is a vector with one entry per claim in the corpus.

We multiply each score vector by its weight and add them up, for example `total_s = 0.2*score[s1] + 0.3*score[s2] + 0.5*score[s3]`, and then use `argmax(total_s)` as an index into the corpus.

The claim at that index is our candidate. If it is the same as the previous sentence, we instead take the top 3 indices of `total_s` and pick one of them at random. The weights are derived from a `weight_const` parameter.

So Akiki has two additional parameters, `thresh` and `weight_const`.

`weight_const`:

Example: the user turns `[s1, s2, s3]` have indices 0, 1, 2, and the weight of turn `i` is

`w_i = normalized(weight_const^i)`

where `normalized` rescales the weights so that they sum to 1, as in the 0.2, 0.3, 0.5 example above.
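
The weighting scheme can be sketched in pure Python as below. This is an illustration rather than the code in `argubots.py`: the function names are invented, the score vectors are made-up numbers, and it assumes that `normalized` means dividing by the sum of the raw weights.

```python
from typing import List

def recency_weights(num_turns: int, weight_const: float) -> List[float]:
    """w_i proportional to weight_const**i, normalized to sum to 1."""
    raw = [weight_const ** i for i in range(num_turns)]
    total = sum(raw)
    return [r / total for r in raw]

def combine_scores(per_turn_scores: List[List[float]], weight_const: float) -> List[float]:
    """Weighted sum of the per-turn score vectors (one score per corpus claim)."""
    weights = recency_weights(len(per_turn_scores), weight_const)
    num_claims = len(per_turn_scores[0])
    return [sum(w * scores[j] for w, scores in zip(weights, per_turn_scores))
            for j in range(num_claims)]

# three user turns (oldest first) scored against a 4-claim corpus; numbers are made up
scores = [[2.0, 0.0, 0.0, 0.0],
          [0.0, 1.0, 0.0, 0.0],
          [0.0, 0.0, 0.0, 1.5]]
total_s = combine_scores(scores, weight_const=2.0)
best = max(range(len(total_s)), key=total_s.__getitem__)   # argmax
print(recency_weights(3, 2.0), total_s, best)
```

With `weight_const` greater than 1, later (more recent) turns get larger weights, so the most recent turn dominates unless it matches nothing well.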

In [18]:
from logging_cm import LoggingContext
with LoggingContext("agents", "INFO"):   # temporarily increase logging level
    argubots.akiki.converse()
    




(mpark) meat bad
(Akiki) All animals including humans and many plant related source produce methane and not just livestock. Source: link.springer.com


(mpark) Farming cows and sheep are responsible for 37% of the total methane generated by human activity.
(Akiki) While the farming of cattle appears to emit methane due to digestion of grassars, grass  dies and rots (emitting methand) as part of its natural cycle if not eaten.


(mpark) well if it's not for cattle, there wouldn't be as much grass
(Akiki) There are reasons beyond the taste and pleasure for why people continue to eat meat - authorities suggesting it (whether parents in childhood or dieticians/dietary specialists/diet-writers), the social acceptance of and ease of access to meat, encouragement from others to eat meat, other social factors, etc.


### Evaluating Akiki

![image](handin.png)
Finally, do a more formal evaluation to verify whether Akiki really does better than Akiko on this dimension.  This is a way to check that you're not just fooling yourself.  

1. Make a new `Agent` called "Shorty" that often (but not always) gives short responses.  
    * Shorty's conversation starters should be on topics that Kialo knows about.  
    * Shorty could be a pure `LLMAgent` such as a `CharacterAgent` with a particular `conversational_style`.  Or it could use a mixed strategy of calling the LLM on some turns and not others.
2. Generate several *Akiko*-Shorty dialogues and several *Akiki*-Shorty dialogues, using `simulated_dialogue`.
3. Evaluate each of those dialogues by asking Judge Wise **how well the argubot stayed on topic**.  You should write this prompt carefully so that Judge Wise gives meaningful scores.  (Before you do this evaluation step, adjust the prompt until it seems to work well on a small subset of the dialogues.  Otherwise Judge Wise won't be so wise!)  
4. Compare Akiko and Akiki's mean scores. Ideally, also compute a 95% confidence interval on the difference of means, using [this calculator](https://www.statskingdom.com/difference-confidence-interval-calculator.html).
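
If you would rather compute the interval in the notebook, here is a minimal sketch. It uses a normal approximation from the standard library, which is an assumption of this example: for small samples a calculator based on the t distribution will give a slightly wider interval. The function name and the summary statistics are hypothetical.

```python
import math
from statistics import NormalDist
from typing import Tuple

def diff_of_means_ci(mean1: float, sd1: float, n1: int,
                     mean2: float, sd2: float, n2: int,
                     confidence: float = 0.95) -> Tuple[float, float]:
    """Approximate confidence interval for mean1 - mean2 (unequal variances allowed)."""
    diff = mean1 - mean2
    std_err = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # about 1.96 for 95%
    return diff - z * std_err, diff + z * std_err

# hypothetical summary statistics, not results from this homework
low, high = diff_of_means_ci(3.4, 0.5, 20, 3.1, 0.6, 20)
print(f"[{low:.3f}, {high:.3f}]")
```

If the interval contains 0, the data do not show a clear difference between the two bots.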

You can do all those steps in the notebook, writing _ad hoc_ code.  You don't have to write general-purpose methods or classes.

In [19]:
all_topics=kialo.claims['all']
all_topics[:10]

['Joe Biden is better than Donald Trump.',
 'Donald Trump possesses personal qualities that would render him more suitable for the presidency than Joe Biden.',
 'During one of his court cases, Donald Trump has repeatedly violated gag orders issued to him.',
 'Joe Biden has a demeanour and outlook that is more suitable to the office of President than that of Donald Trump.',
 'Joe Biden makes more inaccurate claims than Trump.',
 'The reason why Biden makes less factually inaccurate claims is that he makes more statements that are factual and vetted.',
 "According to PolitiFact, Trump's statements they have evaluated are 26% true, mostly true or half true. Biden's are 60% true, mostly true or half true.",
 'Donald Trump has a history of problematic behavior.',
 'Joe Biden has exhibited racist beliefs and attitudes throughout his political career.',
 'Trump has been accused of sexual harassment.']

In [22]:
all_topics=kialo.claims['all']
shorty_char = Character("Shorty", ["English"], 
                "a direct and honest person",
                conversational_style="You often (but not always) give short responses. Short As in 1 5-word sentence.", 
                conversation_starters=all_topics[:10])
shorty = CharacterAgent(shorty_char)

In [35]:
with_akiko=simulated_dialogue(argubots.akiko, shorty, 20)
with_akiko

(Akiko) Joe Biden has a demeanour and outlook that is more suitable to the office of President than that of Donald Trump.
(Shorty) Biden seems more presidential, yeah.
(Akiko) It is possible that Kavanaugh's legal team were the ones involved in the text messages. Without the texts themselves, it is not clear either way.
(Shorty) Could be Kavanaugh's team, maybe.
(Akiko) My iPhone also reacts to stimuli. It doesn't mean it feels pain.
(Shorty) Phones don't feel anything, yeah.
(Akiko) Insects are biological creatures that are living and are not robots. They should be treated as such. Just because robots act like insects does not mean they are the same thing (correlation does not equal causation)
(Shorty) Insects are living creatures, sure.
(Akiko) Many current farming operations either do not realize or do not apply this principle.
(Shorty) Farming doesn't always follow principle.
(Akiko) This part of the argument isn't solely based on the fact that he went to four years of "just" colle

In [25]:
with_akiki=simulated_dialogue(argubots.akiki,shorty,  20)
with_akiki

(Akiki) The reason why Biden makes less factually inaccurate claims is that he makes more statements that are factual and vetted.
(Shorty) He's careful with his words.
(Akiki) Inconsistency is not necessarily indicative of incompetence. Instead, changing one's mind on key issues and policies could be indicative of open-mindedness, willingness to learn, and ideological flexibility, all of which are positive traits in a leader.
(Shorty) Agreed, flexibility shows adaptability and growth.
(Akiki) The website says that college does make you smarter.
(Shorty) College can enhance critical thinking.
(Akiki) 4 years in college does not make you smarter, research shows.
(Shorty) College may develop critical thinking skills.
(Akiki) Biden is well-educated, a quality that is beneficial for leadership.
(Shorty) Education contributes to informed decision-making.
(Akiki) The fact that meat eating and production contributes to climate change is not an argument for complete abstention from eating meat,

In [36]:
import eval

akiko_eval=eval.eval_by_observer_focused(eval.default_judge, "Akiko", with_akiko)
rich.print(akiko_eval)

In [28]:
import eval
akiki_eval=eval.eval_by_observer_focused(eval.default_judge, "Akiki", with_akiki)
rich.print(akiki_eval)

In [37]:
akiko_eval = eval.eval_on_characters(argubots.akiko)  

In [30]:
akiki_eval = eval.eval_on_characters(argubots.akiki)  

In [38]:
from eval import saved_evalsum, saved_dialogues

rich.print(saved_evalsum['Akiko'].mean())   # means
rich.print(saved_evalsum['Akiko'].sd())     # standard deviations
rich.print(saved_evalsum['Akiki'].mean())   # means
rich.print(saved_evalsum['Akiki'].sd())     # standard deviations

In [39]:
# n
akiko_eval.counts
akiki_eval.counts

Counter({'engaged': 10,
         'informed': 10,
         'intelligent': 10,
         'moral': 10,
         'skilled': 10,
         'TOTAL': 10})

### skilled

(Not yet filled in.)

| argubot | sample size | mean | std |
| --- | --- | --- | --- |
| Akiko | 20 | 3 | 0 |
| Akiki | 20 | 3 | 0 |

| Parameter | Value |
| --- | --- |
| Mean difference confidence interval | [NaN, NaN] |

### moral

| argubot | sample size | mean | std |
| --- | --- | --- | --- |
| Akiko | 20 | 3 | 0 |
| Akiki | 20 | 3 | 0 |

| Parameter | Value |
| --- | --- |
| Mean difference confidence interval | [NaN, NaN] |

### intelligent

| argubot | sample size | mean | std |
| --- | --- | --- | --- |
| Akiko | 20 | 2.95 | 0.3940 |
| Akiki | 20 | 2.9 | 0.3077 |

| Parameter | Value |
| --- | --- |
| Mean difference confidence interval | [-0.1768, 0.2768] |

### engaged

| argubot | sample size | mean | std |
| --- | --- | --- | --- |
| Akiko | 20 | 3.65 | 0.4893 |
| Akiki | 20 | 3.7 | 0.4701 |

| Parameter | Value |
| --- | --- |
| Mean difference confidence interval | [-0.3572, 0.2572] |

### informed

| argubot | sample size | mean | std |
| --- | --- | --- | --- |
| Akiko | 20 | 3.15 | 0.3663 |
| Akiki | 20 | 3.05 | 0.2236 |

| Parameter | Value |
| --- | --- |
| Mean difference confidence interval | [-0.09563, 0.2956] |

## Retrieval-augmented generation (Aragorn)

The real weaknesses of Akiko and Akiki:
* They can only make statements that are already in Kialo.  
* They don't respond to the user's actual statement, but to a single retrieved Kialo claim that may not accurately reflect the user's position (it just overlaps in words).

But we also have access to an LLM, which is able to generate new, contextually appropriate text (as Alice does).

In this section, you will create an argubot named [Aragorn](https://tolkiengateway.net/wiki/Riddle_of_Strider), who is basically the love child of Akiki and Alice, combining the high-quality specific content of Kialo with the broad competence of an LLM.  

The RAG in aRAGorn's name stands for **retrieval-augmented generation**.  Aragorn is an agent that will take 3 steps to compute its `Agent.response()`:

1. **Query formation step**: Ask the LLM what claim should be responded to.  For
   example, consider the following dialogue:
    > ...
    > Aragorn: Fortunately, the vaccine was developed in record time.
    > Human: Sounds fishy.

    "Sounds fishy" is exactly the kind of statement that Akiko had trouble using
    as a Kialo query.  But Aragorn shows the *whole dialogue* to the LLM, and
    asks the LLM what the human's *last turn* was really saying or implying, in
    that context. The LLM answers with a much longer statement:

    > Human [paraphrased]: A vaccine that was developed very quickly cannot be trusted.
    > If its developers are claiming that it is safe and effective, I question their motives.

    This paraphrase makes an explicit claim and can be better understood without the context.
    It also contains many more word types, which makes it more likely that BM25 will be able
    to find a Kialo claim with a nontrivial number of those types. 

2. **Retrieval step**: Look up claims in Kialo that are similar to the explicit
   claim.  Create a short "document" that describes some of those claims and
   their neighbors on Kialo.

3. **Retrieval-augmented generation**: Prompt the LLM to generate the response
   (like any `LLMAgent`).  But include the new document somewhere in the LLM
   prompt, in a way that it influences the response. 
   
   Thus, the LLM can respond in a way that is appropriate to the dialogue but
   also draws on the curated information that was retrieved in Kialo.  After
   all, it is a Transformer and can attend to both!
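
The three steps above can be sketched as a single function. This is only a rough outline under stated assumptions, not the required design: `ask_llm` and `retrieve` are hypothetical callables standing in for an LLM call and a Kialo lookup, and the prompt wording is made up for illustration.

```python
from typing import Callable

def rag_response(dialogue_script: str,
                 ask_llm: Callable[[str], str],
                 retrieve: Callable[[str], str]) -> str:
    """One retrieval-augmented turn.

    ask_llm  -- hypothetical: sends a prompt to the LLM, returns its text reply
    retrieve -- hypothetical: turns an explicit claim into a short "document"
    """
    # Step 1 (query formation): ask what the last turn really meant in context.
    query_prompt = (
        "Here is a dialogue:\n" + dialogue_script +
        "\n\nRestate what the last speaker was really saying or implying, "
        "as an explicit claim that can be understood without the context."
    )
    explicit_claim = ask_llm(query_prompt)

    # Step 2 (retrieval): look up related material using the explicit claim.
    document = retrieve(explicit_claim)

    # Step 3 (generation): answer the dialogue with the document in the prompt.
    answer_prompt = (
        "Here is a dialogue:\n" + dialogue_script +
        "\n\nBackground that may be relevant:\n" + document +
        "\n\nWrite the next turn of the dialogue."
    )
    return ask_llm(answer_prompt)
```

Passing the two steps in as functions keeps the sketch independent of any particular class; in the homework you would more likely put this logic inside an `Agent` subclass.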

Here's an example of the kind of document you might create at the retrieval step, though it may be possible
to do better than this:

In [40]:
# refers to global `kialo` as defined above
def kialo_responses(s: str) -> str:
    c = kialo.closest_claims(s, kind='has_cons')[0]
    result = f'One possibly related claim from the Kialo debate website:\n\t"{c}"'
    if kialo.pros[c]:
        result += '\n' + '\n\t* '.join(["Some arguments from other Kialo users in favor of that claim:"] + kialo.pros[c])
    if kialo.cons[c]:
        result += '\n' + '\n\t* '.join(["Some arguments from other Kialo users against that claim:"] + kialo.cons[c])
    return result
        
print(kialo_responses("Animal flesh is yucky to think about, yet delicious."))

One possibly related claim from the Kialo debate website:
	"So many people are worried about animals but don't even think twice when walking by a homeless person on the streets. It's preposterous. How about we worry about our own kind first and then start talking about animals."
Some arguments from other Kialo users against that claim:
	* This implies that caring for animals or caring for people is a binary choice. It isn't. There are those who are well placed and willing to care for people and those who prefer to serve the animal kingdom. As a species we don't just have one idea at a time and follow that to conclusion before we pursue another. It benefits all if humans divide their attentions between various issues and problems we face.
	* Humans have freedom of choice to some extent, animals subdued by humans don't. The very intention of help urges it to go where is most needed. And so far never was any biggest, flagrant and needless cruelty and slaughter as that towards industrial f

![image](handin.png)
You should implement Aragorn in `argubots.py`, just as you did for Akiki.  Probably as an instance `aragorn` of a new class `RAGAgent` that is a subclass of `Agent` or `LLMAgent`.

In [43]:
from logging_cm import LoggingContext
with LoggingContext("agents", "INFO"):   # temporarily increase logging level
    argubots.aragorn.converse()
    


(mpark) everyone should be vegetarian
(Aragorn) I completely agree with you, mpark. As a steward of the earth, it is our responsibility to take care of the environment and make choices that will benefit future generations. I believe in living a sustainable lifestyle and supporting businesses that share the same values. It's important for us to be mindful of our impact on the planet and take steps to minimize our carbon footprint. Thank you for bringing attention to this important issue.


### Evaluating Aragorn

![image](handin.png)
Compare Alice, Akiki, and Aragorn in the notebook, using the evaluation scheme and devset that were illustrated in `demo.ipynb`.  In other words, use `eval.eval_on_characters`.

Who does best?  What are the differences in the subscores and comments?  Does it matter which character you're evaluating on? Maybe the different characters expose the bots' various strengths and weaknesses.

Try to figure out how to improve Aragorn's score.  Can you beat Alice?

Also, try evaluating them in the same way that you evaluated Akiki.  In other words, have them talk to Shorty and ask Judge Wise whether they were able to stay on topic.  This is where Aragorn should really shine, thanks to its ability to paraphrase Shorty's short utterances.



In [44]:
aragorn_eval = eval.eval_on_characters(argubots.aragorn)  

In [45]:
alice_eval = eval.eval_on_characters(argubots.alice)  

In [46]:
## New
from eval import saved_evalsum, saved_dialogues

saved_dialogues['Aragorn']

[((Aragorn) Do you think it's ok to eat meat?
  (Bob) I believe that choosing a vegetarian lifestyle is the best choice for our health, the environment, and animal welfare.
  (Aragorn) I understand your work commitment, Bob. But back to the topic of eating meat, I believe it's important to consider the ethical and environmental implications of our food choices. It's a complex issue, but I personally try to make choices that align with my values.
  (Bob) I appreciate your perspective, and I believe that being mindful of our food choices can have a positive impact on our health and the world around us.
  (Aragorn) Bob, I completely agree with you that the decision to eat meat is a complex one that involves considering our health, the environment, and animal welfare. For me, it's important to make choices that align with my values and to consider the ethical and environmental implications of my food choices. I believe that being mindful of where our food comes from and the impact it has o

In [47]:
from eval import saved_evalsum, saved_dialogues

rich.print(saved_evalsum['Aragorn'].mean())   # means
rich.print(saved_evalsum['Aragorn'].sd())
rich.print(saved_evalsum['Alice'].mean())   # means
rich.print(saved_evalsum['Alice'].sd())

In [48]:
with_ara=simulated_dialogue(argubots.aragorn,argubots.alice,  5)
with_ara

(Aragorn) Crueler methods of livestock management allow farms to increase the quantity they produce at a lower cost. Even if some consumers resent lower quality, cruel farms will likely be more profitable.
(Alice) What are your thoughts on the ethics of factory farming and animal welfare?
(Aragorn) I understand the economic benefits of cruel livestock management, but I believe that the ethical treatment of animals should always be a priority. I would rather support sustainable and humane farming practices, even if it means a slightly higher cost. I believe that consumers are increasingly valuing quality and ethical production, and I am committed to promoting these values in the agricultural industry.
(Alice) While it's important to prioritize animal welfare, we must also consider the economic challenges faced by farmers. It's possible to find a balance between ethical treatment of animals and efficient production by supporting sustainable farming practices and creating incentives for h

In [49]:
ara_eval=eval.eval_by_observer(eval.default_judge, "Aragorn", with_ara)
rich.print(ara_eval)

# Extra Credit (Awsom)

We didn't require this part this year because the homework is going out late.

![image](handinec.png)
Add another LLM-based argubot to `argubots.py`.  
Call it Awsom.  Try to make it get the best score, according to `eval.eval_on_characters`.
Explain what you did and discuss what you found.

(This corresponds to the `--awesome` flag on earlier assignments, but naming the character "Awesome" might bias the evaluation system, so we changed the spelling!)

If the idea was interesting and you implemented it correctly and well, it's okay if it turns out not to help the score.  Many good ideas don't work.  That's why you need to keep finding and trying new good ideas.  (Sometimes they do help, but in a way that is not picked up by the scoring metric.)

You may want to use Aragorn or Alice as your starting point.
Then see if you can find tricks that will get a more awesome score for Awsom.
How you choose to do that is up to you, but some ideas are below.

(Reminder: **Don't change evaluation.**  Just build a better argubot.)

### Awsom writeup 
Overall, our Awsom is an attempt to extend Aragorn. Awsom outperforms Aragorn when conversing with Trollface. It still follows the three main steps in retrieval-augmented generation, but with a more elaborate query formation. 

1. Query formation: Awsom forms the query by paraphrasing the user’s claim into a more explicit and detailed Kialo-like sentence. We get candidate queries with the following methods: 
    1. Few-shot prompting: we provide some examples of what we mean by transforming a sentence into a Kialo-like claim. 
    2. Chain of thought / planning: we prompt Awsom to explain the user’s stance and reasoning. It is encouraged to plan a way to broaden the user’s mind. 
2. Retrieval step: Given the candidate queries, we ask Awsom to weigh them and tell us which one represents the user’s last response more accurately. We store the speaker’s meaning as one of the two candidates, or both. Then we fetch a similar document from Kialo, along with some pros and cons relating to the argument. 
3. Retrieval-augmented generation: Prompt the LLM to respond to the user, with the most relevant document and its pros and cons in mind.

#### Chain of thought / planning
Here we prompt AwsomAgent to classify the user’s response as either a question or a statement. Then we encourage the LLM to explain the user’s stance and reasoning. This is a kind of CoT reasoning: the LLM achieves a better understanding of the user’s motivation, which sometimes helps with broadening the user’s mind. We use the resulting paragraph to represent the speaker’s meaning if it is the best representation, and then find the closest document in Kialo.

#### Few shot prompting
Here we prompt AwsomAgent to turn the speaker’s reply into a more explicit claim similar to a Kialo claim. We tell it not to hallucinate. More importantly, we add a sequence of messages that show example assistant-user interactions, so that the LLM sees what a Kialo claim is supposed to look like. For example, if a user says `DemoUser: The vegan diet is not an option for some people.`, a Kialo claim that corresponds to the user’s implicit claim would be `A vegan diet is not well-suited for vulnerable individuals or people with lifestyles requiring specialised nutrition, who may be unable to remove animal products from their diet`.

### Prompt engineering on Alice 
We noticed two shortcomings in Alice. 

- Sometimes Alice doesn’t give a counterargument. If the user says `I hate meat eaters`, we think Alice should take a stronger stance and say meat eaters are entitled to their own dietary choices.
- Other times, Alice just says a sentence and does not help continue the conversation. Asking more questions would help broaden the user’s mind.

Therefore, we played around with the prompt, and some changes helped. 

- Instead of `you're an intelligent who wants to broaden your user's mind`, we told Alice to `Disagree with the user's position.`
    - We think this is helpful because Alice first takes a strong stance against the user.
- We also added `and ask a question to broaden their mind`
    - This way, Alice finishes its turn with a question and keeps the conversation going. It often suggests that the user think of the topic another way.
- Lastly, we told Alice to `Be factual and polite.`
    - ChatGPT is quite thoughtful by default so we removed that part. It’s more important for Alice’s turns to be based on facts and provide a perspective that the user has not thought about.


In [50]:
import dotenv
import openai
from tracking import track_usage, read_usage
import tracking

dotenv.load_dotenv(override=True)      # define environment variables from .env
client = track_usage(openai.OpenAI())  # create a client, modified to record its usage to a local file 

# Or use our tracking module to do the above for you, like this:

# from tracking import default_client
# client = default_client

In [51]:
s="text aw: ttt aw: tr: tr:ttt"
s.split('tr:',1)

['text aw: ttt aw: ', ' tr:ttt']

In [3]:
import argubots
import agents

with_aws=simulated_dialogue(argubots.awsom,agents.trollFace,  10)
with_aws
#simulated_dialogue(agents.eve, agents.trollFace)

In [1]:
import argubots
import eval   # the homework's eval.py; without this import, `eval` is Python's builtin function

awsom_eval = eval.eval_on_characters(argubots.awsom)  

In [None]:
from eval import saved_evalsum, saved_dialogues

rich.print(saved_evalsum['Awsom'].mean())   # means
rich.print(saved_evalsum['Awsom'].sd())

In [None]:
rich.print(saved_dialogues)

## [Extra credit] Prompt engineering

A good first thing to do is to experiment with Alice's prompt.  
The wording and level of detail in the prompt can be quite important.
Often, NLP engineers will change their prompt to try to address 
problems that they've seen in the responses.

Because it's "just" text editing, this won't get too much extra credit unless you make a real discovery.
But it requires intelligence, care, experimentation, and alertness to the language of the responses and the
language of the prompts.  And you'll develop some intuitions about what helps and what doesn't.
It is certainly worthwhile.

Of course, people have tried to develop methods to search for good prompts automatically, or semi-automatically with human guidance.

If you try this, what worked well for you?

## [Extra credit] Chain of thought / Planning

The evaluation functions in `eval.py` asked each `EvaluationAgent` a "warmup question" before continuing with the real question.  That is an example of chain-of-thought (CoT) reasoning, where the LLM is encouraged to talk through the problem for a few sentences before giving the answer.  CoT sometimes improves performance.

Instead of using one prompt, could you help an `LLMAgent` argubot (like Alice) do better by having it think aloud before it gives an answer?  For example, each time the human speaks, your argubot (Awsom) could prompt the LLM to think about the human's ideas/motivations/personality, and to come up with a plan for how to open the human's mind. 

For example, you might structure this as a `Dialogue` among three participants, like this:
> Awsom (to Eve): Do you think COVID vaccines should be mandatory?
>
> Eve: Have you ever gotten vaccinated yourself?
>
> Awsom (private thought): I don't know Eve's opinions yet, so I can't push back.  Eve might be avoiding my question because she doesn't want to get into a political argument.  So let's see if we can get her to express an opinion on something less political.  Maybe something more personal ... like whether vaccines are scary.
>
> Awsom (to Eve): In fact I have, and so have millions of others. But some people seem scared about getting the vaccine.  

One way to trigger this kind of analysis is to present a `Dialogue.script()` to Awsom (or to an observer), and ask an open-ended question about it.  Or you could ask a series of more specific questions.  That is basically what `eval_by_participant` and `eval_by_observer` do.  But here the argubot itself is doing it, rather than the evaluation framework.

Eve would be shown only the turns that are spoken aloud.  However, when analyzing and responding, Awsom would get to see Awsom's own private thoughts as well.
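
One way to picture the public/private distinction is to tag each turn and filter by viewer. The sketch below is illustrative only: the `Turn` class, its `private` flag, and `visible_to` are hypothetical names, not part of the homework's `Dialogue` class, which you would need to extend or wrap in your own way.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    speaker: str
    content: str
    private: bool = False   # True for a thought that is not spoken aloud

def visible_to(turns, viewer):
    """Public turns plus the viewer's own private thoughts."""
    return [t for t in turns if not t.private or t.speaker == viewer]

def script(turns):
    """Format turns as text, marking private thoughts."""
    lines = []
    for t in turns:
        tag = " (private thought)" if t.private else ""
        lines.append(f"{t.speaker}{tag}: {t.content}")
    return "\n".join(lines)

turns = [
    Turn("Awsom", "Do you think COVID vaccines should be mandatory?"),
    Turn("Eve", "Have you ever gotten vaccinated yourself?"),
    Turn("Awsom", "I don't know Eve's opinions yet, so I can't push back.", private=True),
    Turn("Awsom", "In fact I have, and so have millions of others."),
]
print(script(visible_to(turns, "Eve")))    # what Eve is shown
print(script(visible_to(turns, "Awsom")))  # what Awsom's LLM is shown
```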


## [Extra credit] Dense embeddings

BM25 uses sparse embeddings — a document's embedding vector is mostly zeroes, since the non-zero coordinates correspond to the specific words (tokens) that appear in the document.

But perhaps dense embeddings of documents would improve Aragorn by reading the text and abstracting away from the words, in a way that actually cares about word order.  So, try it!

How?  As mentioned earlier in this notebook, you could compute the embeddings yourself and put them in a FAISS index. Or you could figure out how to use OpenAI's [knowledge retrieval](https://platform.openai.com/docs/assistants/tools/knowledge-retrieval) API.
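
As a rough illustration of what dense retrieval does once the embeddings exist, here is a brute-force nearest-neighbor search by cosine similarity. It assumes you have already obtained one vector per claim from some embedding model; the function names and the tiny 3-dimensional vectors are invented for the example, and a FAISS index would replace the sorting loop at scale.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def closest_by_embedding(query_vec, claim_vecs, k=1):
    """Return the k claims whose vectors are most similar to query_vec.

    claim_vecs maps each claim's text to its (precomputed) embedding.
    """
    ranked = sorted(claim_vecs,
                    key=lambda claim: cosine(query_vec, claim_vecs[claim]),
                    reverse=True)
    return ranked[:k]

# Toy vectors, made up for illustration; real embeddings have many more dimensions.
toy_vecs = {
    "Eating meat harms the climate.": [0.9, 0.1, 0.0],
    "Vaccines were developed too fast.": [0.0, 0.2, 0.9],
}
print(closest_by_embedding([0.8, 0.3, 0.1], toy_vecs))
```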

## [Extra credit] Few-shot prompting

In this homework, often an agent prompted a language model only with instructions.  Can you find a place where giving a few _examples_ would also improve performance?  You will have to write the examples, and you will have to add them to the sequence of messages that your agent sends to the OpenAI API.  See the sentence-reversal illustration earlier in this notebook.

One good opportunity is in the query formation step of RAG.  This is a tricky task.  The LLM is supposed to state the user's implicit claim in a form that looks like a Kialo claim (or, more precisely, a form that will work well as a Kialo query).  It probably doesn't know what Kialo claims look like.  So you could show it by way of example.  This would also show it what you mean by the user's "implicit claim."
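
A minimal sketch of how such examples might be laid out as chat messages is shown below. The helper name `few_shot_messages` and the instruction text are assumptions made for illustration; the example pair reuses the "Sounds fishy" paraphrase from the Aragorn section.

```python
def few_shot_messages(instructions, examples, new_input):
    """Build a chat message list: instructions, worked examples, then the real input.

    examples is a list of (user_text, assistant_text) pairs.
    """
    messages = [{"role": "system", "content": instructions}]
    for user_text, assistant_text in examples:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": new_input})
    return messages

messages = few_shot_messages(
    "Restate the speaker's last turn as an explicit, self-contained claim.",
    [("Aragorn: Fortunately, the vaccine was developed in record time.\n"
      "Human: Sounds fishy.",
      "A vaccine that was developed very quickly cannot be trusted.")],
    "Aragorn: Do you think it's ok to eat meat?\nHuman: Obviously.",
)
```

The resulting list could then be passed as the `messages` argument of a chat completion request, so that the model sees the pattern before it sees the real input.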


## [Extra credit] Using tools in the approved way

Aragorn's step 1 (query formation) is basically getting the LLM to generate a function call like
```
kialo_thoughts("A vaccine that was developed very quickly ...")
```
which Aragorn will execute at step 2 (retrieval), sending the results back to the LLM as part of step 3.

In this context, `kialo_thoughts` is an example of a **tool** (that is, a function) that the
LLM can or must use before it gives its response.

The tool is _not_ something that runs on the LLM server.  It is written by you
in Python and executed by you.  The function call above, including the text `"A
vaccine that was ..."`, is the part that is generated by the LLM.

The OpenAI API has [special support](https://cookbook.openai.com/examples/how_to_call_functions_with_chat_models) for calling the LLM in a way that will _allow_ it to generate a tool call ([tools](https://platform.openai.com/docs/api-reference/chat/create#chat-create-tools)) or _force_ it to do so ([tool_choice](https://platform.openai.com/docs/api-reference/chat/create#chat-create-tool_choice)).  You can then send the tool's result back to the LLM [as part of your message sequence](https://platform.openai.com/docs/api-reference/chat/create#chat-create-messages).

So, you could modify Aragorn to use tools properly.  Maybe that will help, simply because the LLM was trained on message sequences that included tool use.  It should know to pay attention to the tool portions of the prompt when they are relevant, and ignore them when they are not.

The `client.chat.completions.create()` method would need to be told about the tool by using the `tools` keyword argument, with a value something like the one below.

If `d` is a `Dialogue`, you should be able to call `d.response()` with the `tools` keyword argument.  This will be passed on to `client.chat.completions.create()` as desired.

In [None]:
tools = [
    {
        "type": "function",
        "function": {
            "name": "kialo_thoughts",
            "description": "Given a claim by the user, find a similar claim on the Kialo website and return its pro and con responses",
            "parameters": {
                "type": "object",
                "properties": {
                    "search_topic": {
                        "type": "string",
                        "description": "A claim that was made explicitly or implicitly by the user.",
                    },
                },
                "required": ["search_topic"],
            },
        }
    }]
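
Once the LLM replies with a tool call, your own code has to run the function and package the result as a message. The sketch below shows one way that dispatch step might look. It works on plain dictionaries that mirror the JSON form of a tool call (with the Python client you would read the same fields as attributes), and `run_tool_call` and the stand-in `kialo_thoughts` are hypothetical helpers, not part of the homework code.

```python
import json

def run_tool_call(tool_call, tool_functions):
    """Execute one tool call requested by the LLM and build the reply message.

    tool_call      -- dict with "id" and "function": {"name", "arguments"}
    tool_functions -- dict mapping tool names to Python functions you wrote
    """
    name = tool_call["function"]["name"]
    # The arguments arrive as a JSON-encoded string generated by the LLM.
    arguments = json.loads(tool_call["function"]["arguments"])
    result = tool_functions[name](**arguments)
    # A "tool" message carries the result back, tied to the call by its id.
    return {"role": "tool", "tool_call_id": tool_call["id"], "content": str(result)}

def kialo_thoughts(search_topic):
    # Stand-in body for illustration; a real version would query Kialo.
    return f"(Kialo material related to: {search_topic})"

example_call = {
    "id": "call_1",
    "type": "function",
    "function": {
        "name": "kialo_thoughts",
        "arguments": json.dumps({"search_topic": "A vaccine that was developed very quickly ..."}),
    },
}
tool_message = run_tool_call(example_call, {"kialo_thoughts": kialo_thoughts})
```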

## [Extra credit] Parallel generation

The chat completions interface allows you to sample $n$ continuations of the prompt in parallel, as we saw with "the apples, bananas, cherries ..." example.  This is efficient because it requires only 1 request to the LLM server and not $n$.  The latency does not scale with $n$.  Nor does the input token cost, since the prompt only has to be encoded once.

Perhaps you can find a way to make use of this?  For example, the query formulation step of RAG could generate $n$ implicit claims instead of just one.  We could then look for claims in the Kialo database that are close to _any_ of those implicit claims.
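
A small sketch of that idea: pool the retrieval results for all $n$ implicit claims, dropping duplicates. Here `closest_claims` is a hypothetical callable that returns a ranked list of Kialo claims for one query (for example, a thin wrapper around `kialo.closest_claims`), and `closest_to_any` is a made-up helper name.

```python
def closest_to_any(implicit_claims, closest_claims, k=3):
    """Collect the top-k retrieved claims for each implicit claim, without repeats.

    Order is preserved: hits for earlier implicit claims come first.
    """
    pooled = []
    for claim in implicit_claims:
        for hit in closest_claims(claim)[:k]:
            if hit not in pooled:
                pooled.append(hit)
    return pooled
```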

Another thing to do with multiple completions is to select among them or combine them.  For example, suppose we prompt the LLM to generate completions of the form $(s,t,r)$ where $s$ is an answer, $t$ evaluates that answer, and $r$ is a numerical score or reward based on that evaluation.  ("Write a poem, then tell us about its rhyme and rhythm problems, then give your score.")  
* If we sample multiple completions $(s_1,t_1,r_1), \ldots, (s_n,t_n,r_n)$ in parallel, then we can return the $s_i$ whose $r_i$ is largest.  
* Or if we sample $s$ and then multiple continuations $(t_1,r_1), \ldots, (t_n,r_n)$, then we can return the mean score $\sum_i r_i/n$ as a reduced-variance score for $s$, which averages over diverse textual evaluations that might consider different aspects of $s$.
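
Both selection strategies reduce to a few lines once the completions have been parsed into tuples. The parsing itself depends on your prompt format and is not shown; the function names here are just illustrative.

```python
def best_answer(completions):
    """Given (s, t, r) triples, return the answer s whose score r is largest."""
    s, _t, _r = max(completions, key=lambda triple: triple[2])
    return s

def mean_score(evaluations):
    """Given (t, r) pairs for a single answer, return the average score."""
    return sum(r for _t, r in evaluations) / len(evaluations)
```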

Note that when you call the chat completions interface with $n > 1$, you specify 1 shared input prompt and get $n$ different output completions.  Since the input prompt must be the same for all outputs, it is necessary to sample all of $(s,t,r)$ or all of $(t,r)$ with a single call to the LLM.

Alternatively, it is possible to reduce latency by submitting multiple requests to the server in parallel (see "async usage" [here](https://pypi.org/project/openai/)).  In this case the input prompts can be different, although you now have to pay to encode all of them separately.  This facility could speed up evaluation without changing its results; that's a worthwhile thing to try for extra credit!
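
The general shape of issuing requests concurrently with `asyncio` might look like the sketch below. To keep it runnable without an API key, `fake_llm_request` is a stand-in that just sleeps briefly; with the real library you would presumably await a call on an `openai.AsyncOpenAI` client instead.

```python
import asyncio

async def fake_llm_request(prompt):
    """Stand-in for an asynchronous LLM call (simulates network latency)."""
    await asyncio.sleep(0.01)
    return f"response to: {prompt}"

async def request_all(prompts):
    """Send all prompts concurrently; results come back in the same order as prompts."""
    return await asyncio.gather(*(fake_llm_request(p) for p in prompts))
```

From a script you could run this with `asyncio.run(request_all([...]))`. Inside a notebook, where an event loop is already running, use `await request_all([...])` in a cell instead.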
