# Homework 8: Large Language Models

An PDF overview of the homework is [here](https://www.cs.jhu.edu/~jason/465/hw-llm/).

It mentions: "We'll send hand-in instructions soon.  Probably we will ask you to submit a version
of the main notebook, with your answers added and extraneous materials deleted. We may also
ask for a summary."

![image](https://cs.jhu.edu/~jason/465/hw-llm/handin.png)
This symbol marks a question or exercise that you will be expected to hand in.

# Getting started

## Activate `conda` environment

When executing cells in this notebook, you will need to connect to an `nlp-class` kernel, which is a Python process running in that environment.  This is the notebook equivalent of the terminal command `conda activate nlp-class`.  

If you need to create or update that environment, first download the [nlp-class.yml](http://cs.jhu.edu/~jason/465/hw-llm/nlp-class.yml) file, and execute
```
conda env update --file nlp-class.yml --prune
```

## Fetch code and data files for this homework

All of the files you need are in the directory <https://www.cs.jhu.edu/~jason/465/hw-llm/>.  To get a local copy of that directory, including this notebook, you can download and unpack [HW-LLM.zip](https://www.cs.jhu.edu/~jason/465/hw-llm/HW-LLM.zip).  Then open this notebook.

Note that the other files must be in the *same directory* as this notebook.  Otherwise, a command like `import tracking` won't be able to find the tracking module, `tracking.py`.

*Note:* These files might get improved after the homework is released, in which case you'll want to re-download them.  Make sure not to overwrite changes you've already made.  One way to do it: use a terminal to `cd` to the directory containing this notebook, and run the following shell commands to get the latest versions of all other files.
```
wget --quiet -r -np -nH --cut-dirs=3 -A '*.txt' -A '*.py' -A 'demo.ipynb' https://www.cs.jhu.edu/~jason/465/hw-llm/
rm -f data/*.1 robots.txt   # remove any backup versions of the static files
```
Any existing versions of the files will not be overwritten; they will be renamed with names like `tracking.py.1`.

In [9]:
# Check that the current directory does contain the files.
!ls -lR *.py data

-rw-------. 1 jbravo3 users 19742 Dec  1 22:43 agents.py
-rw-------. 1 jbravo3 users  8929 Dec  7 21:35 argubots.py
-rw-------. 1 jbravo3 users  3578 Dec  7 16:27 characters.py
-rw-------. 1 jbravo3 users  2641 Dec  1 22:43 dialogue.py
-rw-------. 1 jbravo3 users 14216 Dec  1 22:43 evaluate.py
-rw-------. 1 jbravo3 users 10426 Dec  1 22:43 kialo.py
-rw-------. 1 jbravo3 users  1347 Dec  1 22:43 logging_cm.py
-rw-------. 1 jbravo3 users  1503 Dec  1 22:43 simulate.py
-rw-------. 1 jbravo3 users  6130 Dec  1 22:43 tracking.py

data:
total 1265
-rw-------. 1 jbravo3 users 613106 Dec  1 22:43 all-humans-should-be-vegan-2762.txt
-rw-------. 1 jbravo3 users  81917 Dec  1 22:43 have-authoritarian-governments-handled-covid-19-better-than-others-54145.txt
-rw-------. 1 jbravo3 users  52771 Dec  1 22:43 is-biden-an-incompetent-president-44217.txt
-rw-------. 1 jbravo3 users 153551 Dec  1 22:43 is-joe-biden-a-good-president-53071.txt
-rw-------. 1 jbravo3 users  60556 Dec  1 22:43 is-joe-biden-be


The `autoreload` feature of Jupyter ensures that if an imported module (.py file) changes, the notebook will automatically import the new version.  
(However, objects that were defined with the old version of the class won't change.)

In [10]:
# Executing this cell does some magic
%load_ext autoreload
%autoreload 2

## Create an OpenAI client

An OpenAI API key will be sent to you.  (Or are you not in the class? Then you can make your own API key by [signing up for an OpenAI platform account](https://platform.openai.com/signup) and putting some money on it.  This assignment should cost only about $1 US.)

Make an `.env` file in the same directory as this notebook, containing the following:
```
export OPENAI_API_KEY=[your API key]    # do not include the brackets here
```
Make sure others can't read this file:
```
chmod 600 .env
```

**Be sure to keep the key secret.  It gives access to a billable account.** If OpenAI finds it on the public web, they will invalidate it, and then no one (including you) can use this key to make requests anymore.



Now you can execute the following to get an OpenAI client object.

In [1]:
from tracking import new_default_client, read_usage
client = new_default_client() 

That fetches your API key and calls `openai.OpenAI()` to make a new **client** object, whose job is to talk to the OpenAI **server** over HTTP.  (The `OpenAI` constructor has some optional arguments that configure these HTTP messages.
However, the defaults should work fine for you.)

That command also saved the new client in `tracking.default_client`, which is the client that the starter code will use by default whenever it needs to talk to the OpenAI server.  Thus, you should **rerun the above cell** to get a new client if you change the `default_model` in `tracking.py`, or if your API key in  `.env` ever changes, or its associated organization ever changes.

## Try the model!

You can now get answers from OpenAI models by calling methods of the `client` instance.  
You will have to specify which OpenAI model to use.
Documentation of the methods is [here](https://pypi.org/project/openai/) if you are curious.

### Continue a textual prompt

This is what language models excel at.  In principle you should do it by calling [`client.completions.create`](https://platform.openai.com/docs/api-reference/completions/create?lang=python).  However, OpenAI has [retired](https://openai.com/blog/gpt-4-api-general-availability) most of the models that support that API (keeping only `gpt-3.5-turbo-instruct`).  So we'll use the more modern API, [`client.chat.completions.create`](https://platform.openai.com/docs/api-reference/chat/create?lang=python).

In [2]:
import rich   # prettyprinting

response = client.chat.completions.create(messages=[{"role": "user", 
                                                     "content": "Q: Name the planets in the solar system?\nA: "}], 
                                          model="gpt-3.5-turbo-0125",  # which model to use
                                          temperature=1,               # get a little variety
                                          max_tokens=64,               # limit on length of result
                                          # stop=["Q:", "\n"],         # treat these as EOS symbols; useful for some models
                                         )           
rich.print(response)                              # the full object that was sent back from the server
rich.print(response.choices)                      # just the list of 1 answer (the default, but calling with n=5 would give 5 answers) 
rich.print(response.choices[0].message.content)   # extract the good stuff from that 1 answer

![image](https://cs.jhu.edu/~jason/465/hw-llm/handin.png)
Try running the cell above a few times. You may get different random answers — especially because the call specifies temperature 1.  (The default temperature is rumored to be 0.8.) Are the answers all equally good?

In this case all the answers are equally good. The responses are consistent in correctly listing the planets in the solar system: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, and Neptune. Even though the formatting may vary slightly (e.g., numbered list vs. comma-separated), the content remains accurate and complete.


![image](https://cs.jhu.edu/~jason/465/hw-llm/handin.png)
Try adding the arguments `logprobs=True, top_logprobs=5` to the above API call (see [documentation](https://platform.openai.com/docs/api-reference/chat/create#chat-create-logprobs)).  For each generated token, the response will now include its log-probability, and also the log-probabilities of the 5 most probable tokens, given the left context so far.  Again, run the cell a few times.  What do you observe?


The answers vary in format (e.g., numbered lists, plain text, detailed sentences) but remain factually correct. 	Log probabilities reveal the model’s confidence in each generated token and show alternative plausible tokens (e.g., “1”, “Mer”, “The”) based on the context. Higher log-probability values (closer to 0) correspond to more confident predictions for each token. The model consistently predicts tokens with high confidence for factual answers but considers multiple plausible formats during generation.


It might be handy to package up what we just did.
The `complete` function below is a convenient way of experimenting with completing text.
It is illustrated with a grocery example.  

In [3]:
def complete(client, s: str, model="gpt-3.5-turbo-0125", *args, **kwargs):
    response = client.chat.completions.create(messages=[{"role": "user", "content": s}],
                                              model=model,
                                              *args, **kwargs)
    return [choice.message.content for choice in response.choices]

complete(client, "I went to the store and I bought apples, bananas, cherries, donuts, eggs", 
         n=10, temperature=1.1, max_tokens=96)


[', and flour.',
 ', and fish.',
 ', and fish for my groceries.',
 ', and fish.',
 ', and fish.',
 ', and fish.',
 ', and flour.',
 ', and fish.',
 ', and flour.',
 ', and flour.']

![image](https://cs.jhu.edu/~jason/465/hw-llm/handin.png)
Anything could be on a grocery list, so why are the 10 different completions above so similar?<br>
Hint: The answer isn't just the temperature of 0.6.  Look especially at the long completions; run the cell again if you didn't get multiple long completions.

The 10 completions are so similar because of the context provided in the prompt. The prompt describes a grocery store list, and the model is highly influenced by this context. It prioritizes plausible grocery items (like flour, fish, fruits, etc.) that align with the pattern and meaning of the input.


![image](https://cs.jhu.edu/~jason/465/hw-llm/handin.png)
What happens at different temperatures?  How about temperatures > 1?  (Note: Higher temperatures tend to produce longer responses, so it's wise to use `max_tokens`.)

Higher temperatures (e.g., >1) increase randomness in the generated output. This results in more diverse and unexpected completions, but also increases the likelihood of less coherent or less plausible results. There is less consistency in format and content compared to lower temperatures. The responses show greater variability, including unusual or creative completions (e.g., “fudge,” “jellybeans,” “a gallon of milk”). Longer completions (e.g., full lists with a wide variety of items) are more frequent because the model is more willing to explore lower-probability token sequences.


*Remark:* These [Python bindings for open-source models such as Llama](https://pypi.org/project/llama-cpp-python/) allow you to [constrain the output by an arbitrary CFG](https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md), using `grammar=...`.  This is useful if you're generating code or data that must be syntactically valid to be useful to you.  For even more control over the output, the powerful [guidance](https://github.com/guidance-ai/guidance) package works elegantly with Python.  However, the OpenAI API only allows you to [constrain the output to be valid JSON](https://platform.openai.com/docs/api-reference/chat/create#chat-create-response_format).


### Compute a function using instructions and few-shot prompting

We'll now switch to the chat completions API, allowing us to use a more recent model.  Let's try prompting it with a sequence of multiple messages.  In this case, we provide some instructions as well as few-shot prompting (actually just one-shot in this case).

Instructions are in the `system` message.  The few-shot prompting consists of example inputs (`user` messages) followed by their example outputs (`assistant` messages).  Then we give our real input (the final `user` message), and hope that the LLM will continue the pattern by generating an analogous output (a new `assistant` message).

In [4]:
response = client.chat.completions.create(messages=[
    { "role": "system", "content": "Reverse the order of the words." },
    { "role": "user", "content": "Good things come to those who wait." },
    { "role": "assistant", "content": "Good(1) things(2) come(3) to(4) those(5) who(6) wait(7)" },  # Contradicts reversing
    { "role": "user", "content": "Colorless green ideas sleep furiously." }
], model="gpt-4o-mini", temperature=0)

rich.print(response)
response.choices[0].message.content       

'furiously(1) sleep(2) ideas(3) green(4) Colorless(5)'

![image](https://cs.jhu.edu/~jason/465/hw-llm/handin.png)
By modifying this call, can you get it to produce different versions of the output?
Some possible behaviors you could try to arrange:
* specific other way of formatting the output, e.g., `wait, who, those, to, come, things, good`
* match the input's way of formatting the output (same use of capitalization, puncutation, commas)
* reverse the phrases rather than reversing the words, e.g., `To those who wait come good things.` 

You can try playing with the number, the content, and the order of few-shot examples, and changing or removing the instructions.

![image](https://cs.jhu.edu/~jason/465/hw-llm/handin.png)
What happens if the examples conflict with the instructions?

We produced many different versions. For example we made it so the output has commas separating the words. Another experiment I tried was to make it maintain the same capitalization, punctuation, and other formatting from the input. 

For the examples conflicting with the instructions we tried a few different experiments. For example we had it so the example capitalized the first letter of each word, instead of reversing the words. However this still gave the same output and reversed the words it just made it so the first word in the reverse order was capitalized. I also tried a different experiment where instead of reversing the words, we added a number after each word. The result still had the words reversed however we did manage to have the numbers added to each word in the sentence. So I think it really depends on how big of a change you implement in the example that will affect the output. 

### Inspect the tokenization

Just for fun, let's see how the above client has been tokenizing its input and output text.  For that we can use a tokenizer that runs locally, not in the cloud, and is guaranteed to get the same outputs.

In [5]:
import tiktoken
tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo-0125")  # how this model will tokenize
toks = tokenizer.encode("Hellooo, world!") # list of integerized tokens, starting with BOS

print(tokenizer.decode(toks))                                  # convert list back to string
for tok in toks: print(f"{tok}\t'{tokenizer.decode([tok])}'")  # convert one at a time
print("Vocab size =", tokenizer.n_vocab)

Hellooo, world!
9906	'Hello'
2689	'oo'
11	','
1917	' world'
0	'!'
Vocab size = 100277


### Try embedding some text

Also just for fun, let's try the embedder, which converts a string of any length to an vector of fixed dimensionality.

In [6]:
emb_response = client.embeddings.create( input= [  # note: adjacent literal strings in Python are concatenated
        "When in the Course of human events it becomes necessary for one "
        "people to dissolve the political bands which have connected them "
        "with another, and to assume among the Powers of the earth, the "
        "separate and equal station to which the Laws of Nature and of "
        "Nature's God entitle them, a decent respect to the opinions of "
        "mankind requires that they should declare the causes which impel "
        "them to the separation." ], 
        model="text-embedding-3-small")
# don't print the whole response because it's very long
e = emb_response.data[0].embedding
print(f"{len(e)}-dimensional embedding starting with {e[:5]}")
print("Squared length of embedding vector: ", sum(x**2 for x in e))

1536-dimensional embedding starting with [0.03854052722454071, 0.038316600024700165, 0.04359135404229164, 0.07056225836277008, -0.00027718886849470437]
Squared length of embedding vector:  1.0000000365799306


### Check your usage so far

Please be careful not to write loops that use lots and lots of tokens.  That will cost us money, and could hit the per-day usage limit that is shared by the whole class.

Execute one of these cells whenever you want to see your cost so far.  Or, just keep `usage_openai.json` open as a tab in your IDE.

In [7]:
read_usage()      # rwitheads from the file usage_openai.json; returns cost in dollars

{'completion_tokens': 256516,
 'prompt_tokens': 2134814,
 'total_tokens': 2391330,
 'cost': 0.4787726000000008}

In [8]:
!cat usage_openai.json 

{
    "completion_tokens": 256516,
    "prompt_tokens": 2134814,
    "total_tokens": 2391330,
    "cost": 0.4787726000000008
}

# Dialogues and dialogue agents

The goal of this assignment is to create a good "argubot" that will talk to people about controversial topics and broaden their minds.

## A first argubot (Airhead)

You can have a conversation right now with a _really bad_ argubot named Airhead.  Try asking it about climate change!  When you're done, reply with an empty string.

(The `converse()` method calls Python's `input()` function, which will prompt you for input at the command-line or by popping up a box in your IDE.)

In [9]:
import argubots
d = argubots.airhead.converse()


(jbravo3) Trump was a good president 
(Airhead) I know right???


A *bot* (short for "robot") is a system that acts autonomously.
That corresponds to the AI notion of an *agent* — a system that uses some *policy* to choose *actions* to take.

The `airhead` agent above (defined in `argubots.py`) uses a particularly simple policy.  
It is an instance of a simple `Agent` subclass called `ConstantAgent` (defined in `agents.py`).

The result of talking to `airhead` is a `Dialogue` object (defined in `dialogue.py`). Let's look at it.

In [10]:
rich.print(d)

Each *turn* of this dialogue is just a tiny dictionary:

In [11]:
d[0]

{'speaker': 'jbravo3', 'content': 'Trump was a good president '}

## An LLM argubot (Alice)

In other CS courses like crypto, algorithms, or networks, you may have encountered "conversations" between characters named Alice and Bob.  
Let's try talking to the Alice of this homework, who is a _much stronger baseline_ than Airhead.  Your job in this assignment is to improve upon Alice.
We'll meet Bob later.

In [12]:
alicechat = argubots.alice.converse()   # or call with argument d if you want to append to the previous conversation


(jbravo3) Biden was a good president
(Alice) What specific policies or actions do you believe made Biden a good president? While it's commendable to praise leadership, it's also essential to analyze how his decisions might have contributed to ongoing challenges, such as economic issues or political polarization. How do you reconcile those concerns with your view of his presidency?


As you may have guessed, `alice` is powered by an prompted LLM.  You can find the specific prompt in `argubots.py`.

So, while `agents.py` provides the core functionality for `Agent` objects, the argubot agents like `alice` — and the ones that you will write! — go into `argubots.py` instead.  This is just to keep the files small.

## Simulating human characters (Bob & friends)

You'll talk to your own argubots to get a qualitative feeling for their strengths and weaknesses.  
But can you really be sure you're making progress?  For that, a quantitative measure can be helpful.

Ultimately, you should test an argubot like Alice by having it argue with many real humans — not just you — and using some rubric to score the resulting dialogues.  But that would be slow and complicated to arrange.  

So, meet Bob!  He's just a simulated human.  You won't edit him: he is part of the development set.  Here is some information about him (from `characters.py`):

In [13]:
import characters
rich.print(characters.bob)

You can't talk directly to `characters.bob` because that's just a data object.
However, you can construct a simple agent that uses that data (plus a few more instructions) to prompt an LLM.

(Which LLM does it prompt?  The `CharacterAgent` constructor (defined in `agents.py`) defaults to a GPT-3.5 model that is specified in `tracking.py`.  But you can override that using keyword arguments.)

Try talking to Bob about climate change, too.

In [14]:
from agents import CharacterAgent
bob = CharacterAgent(characters.bob)    # actually, agents.bob is already defined this way
bob.converse()        # returns a dialogue, but we've already seen it so we don't want to print it again
None                  # don't print anything for this notebook cell 


(jbravo3) Global warming is not real 
(Bob) While there are differing opinions, the overwhelming scientific consensus supports the reality of global warming and its impact on our planet.


Of course, a proper user study can't just be conducted with one human user.

So, meet our bevy of beautiful Bobs!  (They're not actually all named Bob — we continued on in the alphabet.)


In [15]:
import agents
agents.devset

[<CharacterAgent for character Bob>,
 <CharacterAgent for character Cara>,
 <CharacterAgent for character Darius>,
 <CharacterAgent for character Eve>,
 <CharacterAgent for character TrollFace>]

In [16]:
agents.cara.converse()
None


(jbravo3) Global warming is not real 
(Cara) While some may dispute the consensus on global warming, the overwhelming scientific evidence supports its reality and impact.


You can see the underlying character data here in the notebook.  Your argubot will have to deal with all of these topics and styles!

In [17]:
rich.print(characters.devset)

## Simulating conversation 

We can make Alice and Bob chat.

In [18]:
from dialogue import Dialogue
d = Dialogue()                                              # empty dialogue
d = d.add('Alice', "Do you think it's okay to eat meat?")   # add first turn
print(d)


(Alice) Do you think it's okay to eat meat?


In [19]:
d = agents.bob.respond(d)
d = argubots.alice.respond(d)
print(d)

(Alice) Do you think it's okay to eat meat?
(Bob) I believe a vegetarian lifestyle is much more beneficial for health, the environment, and animal welfare.
(Alice) While a vegetarian lifestyle does have its benefits, some argue that a balanced diet including moderate meat consumption can provide essential nutrients that are harder to obtain from plant sources alone. Additionally, certain sustainable farming practices can mitigate environmental concerns while supporting local economies. Have you considered the potential benefits of responsibly sourced animal products?


In [20]:
d = agents.bob.respond(d)
d = argubots.alice.respond(d)
print(d)

(Alice) Do you think it's okay to eat meat?
(Bob) I believe a vegetarian lifestyle is much more beneficial for health, the environment, and animal welfare.
(Alice) While a vegetarian lifestyle does have its benefits, some argue that a balanced diet including moderate meat consumption can provide essential nutrients that are harder to obtain from plant sources alone. Additionally, certain sustainable farming practices can mitigate environmental concerns while supporting local economies. Have you considered the potential benefits of responsibly sourced animal products?
(Bob) While I understand the arguments for responsible sourcing, I still believe that a well-planned vegetarian diet can provide all necessary nutrients without the ethical concerns associated with animal consumption.
(Alice) That’s a valid perspective, but it's important to recognize that not everyone has access to diverse plant-based options or the knowledge to create a well-planned vegetarian diet, which can lead to nut

Anyway, let's see what happens when Alice and Bob talk for a while...

In [21]:
from simulate import simulated_dialogue
d = simulated_dialogue(argubots.alice, agents.bob, 8)
rich.print(d)

Sometimes this kind of conversation seems to stall out, with Bob in particular repeating himself a lot.  Alice doesn't seem to have a good strategy for getting him to open up.  Maybe you can do a better job talking to Bob, and that will give you some ideas about how to improve Alice?

In [22]:
myname = alicechat[0]['speaker']   # your name, pulled from an earlier dialogue
agents.bob.converse(d[0:2].rename('Alice', myname))  # reuse the same first two turns, then type your own lines!
None

(jbravo3) Do you think it's ok to eat meat?
(Bob) I believe that choosing a vegetarian lifestyle is a more compassionate and sustainable option for both individuals and the planet.


You can also try talking to the other characters and having Alice (or Airhead) talk to them.

**You might enjoy** defining additional characters in `characters.py`, or right here in the notebook.
Feel free to talk to those and evaluate them.  They could be variants on the exisiting characters, or something entirely new. 

However, **don't change the dev set** — the characters we just loaded must stay the same.  Your job in this homework is to improve the argubot (or at least try).  And that means improving it according to a fixed and stable eval measure.

As an exception, you can change the languages that a couple of the characters speak. It may be fun for you to see them try to speak your native language.  And that doesn't really affect the quality of the argument.

In [23]:
# example
trollFace2 = characters.trollFace.replace(languages = ["Spanish"])
rich.print(trollFace2)
simulated_dialogue(argubots.alice, CharacterAgent(trollFace2), 6)

(Alice) Do you think Joe Biden has been a good president?
(TrollFace) ¡Claro, ha sido tan buen presidente que hasta los gatos se han puesto a leer sus discursos!
(Alice) It's great to see your enthusiasm, but do you think his administration has faced challenges or made decisions that have sparked significant criticism? Sometimes, even well-meaning leadership can have mixed results depending on the complexities of the issues at hand.
(TrollFace) ¡Oh, por supuesto! Su administración ha enfrentado más desafíos que un gato tratando de nadar, y las críticas son como palomitas en el cine: ¡nunca faltan!
(Alice) Absolutely, challenges are inevitable in any presidency. But do you think that some of the criticism might stem from differing political perspectives rather than just the effectiveness of his policies? It's interesting to consider whether the reactions are influenced more by partisanship than by objective assessments of his actions.
(TrollFace) ¡Sin duda! La política es como un circo,

### Efficiency: Batched generation?

Notice that we are making a separate LLM call to generate each turn of the dialogue.  When we generate the $n^\text{th}$ turn, we send the server the whole dialogue history — the previous $n\!-\!1$ turns — along with some instructions.  The server has to re-encode it with the Transformer, and it charges us for doing so (see the "input token" costs in `tracking.py`).  

That is probably inevitable for real dialogue.  But for simulated dialogue, a more efficient approach would be to generate the whole dialogue between Alice and Bob in one LLM call.  Then you would be charged just once for each dialogue turn.  Under this approach, the Transformer encodes each token as soon as it is generated (see the "output token" costs in `tracking.py`).  The encoded token stays in the context throughout the dialogue, so it doesn't have to be re-encoded on a later call.  There is no later call.  

Under current pricing models, that would reduce the dollar cost of generating $n$ turns from $O(n^2)$ to $O(n)$.  

However, the pricing model doesn't quite reflect the computational costs.  
* ![image](https://cs.jhu.edu/~jason/465/hw-llm/handin.png) Using $O(\cdot)$ notation, what is the total number of floating-point operations needed to generate $n$ turns under each approach?  
* ![image](https://cs.jhu.edu/~jason/465/hw-llm/handin.png) Parallelism may help reduce the runtime.  Using $O(\cdot)$ notation, what is the total number of seconds needed to generate $n$ turns under each approach?  (Assume that the GPU is big enough, relative to $n$, that it can encode all input tokens in parallel.)

Total Number of Floating-Point Operations:

Separate Calls Approach $O(n^2)$:

For the  i -th turn, the server encodes the entire dialogue history up to that point, including all  i  turns. Encoding each token requires $O(m \cdot h^2)$ operations where: $m$ is the number of tokens in the context (increases with each turn) and $h$  is the Transformer’s hidden size. Since the dialogue history grows linearly with  $i$ , encoding costs for  $i$ -th turn are proportional to  $i$ , making the total cost for  $n$  turns: $O(n^2 \cdot h^2)$


Single Call Approach $O(n)$:
The server generates the entire dialogue in one continuous call. Each generated token depends on all previous tokens, so it requires $O(h^2)$ operations per token. The total cost for  n  turns, where each turn has  m  tokens, is proportional to the total number of tokens: $m \cdot n \cdot h^2 = O(n \cdot h^2)$, here $m$ is constant for each turn.




With Parallelism:

Separate Calls Approach:

For the  $i$ -th turn, the server re-encodes the entire dialogue history, which has  $i$  turns of tokens. Parallelism ensures the encoding time per turn is constant, depending only on the size of the Transformer and not the number of tokens. Token generation is sequential, so it scales linearly with the number of tokens generated. For each turn, the model generates  $m$  new tokens sequentially. For each turn, the model generates  $m$  new tokens sequentially. Total decoding time for  $n$  turns is $O(n \cdot m \cdot h^2)$. So then the total runtime is: $O(n \cdot m \cdot h^2)$

Single Call Approach:

The entire dialogue is encoded only once at the start. The encoding time per turn is constant, depending only on the size of the Transformer. The model generates all  $n \cdot m $  tokens sequentially. Decoding time is $O(n \cdot m \cdot h^2)$. So then the total runtime is: $O(n \cdot m \cdot h^2)$



The problem with the more efficient approach is that it gives you no way to change the instructions (the system prompt) each time we switch from Alice to Bob and back again.  You'd need to generate the whole conversation using a single set of instructions.

![image](https://cs.jhu.edu/~jason/465/hw-llm/handin.png)
Can you get this to work?  Specifically, try completing the cell below.  You don't have to use the `Agent` or `Dialogue` classes.  It's okay to just throw together something like the `complete()` method above.  Just see whether you can manage to prompt gpt-4o-mini to generate a multi-turn dialogue between two characters who have different personalities and goals.  Is the quality better or worse than generating one turn at a time with different instructions?

In [24]:
# Like `simulated_dialogue` in `simulate.py`.  However, this one is called on two
# Characters, not two Agents, and it returns a string rather than a Dialogue.

from tracking import default_client, default_model
from characters import Character
def simulated_dialogue_batch(a: Character, b: Character, turns: int = 6, *,
                             starter=True) -> str:
    dialogue_instructions = (
        f"You are simulating a multi-turn dialogue between two characters:\n"
        f"- {a.name}: {a.persona}. Conversational style: {a.conversational_style}\n"
        f"- {b.name}: {b.persona}. Conversational style: {b.conversational_style}\n"
        f"Each turn alternates between {a.name} and {b.name}. "
        f"Write {turns} turns of conversation starting with {a.name if starter else b.name}. "
        f"Ensure that each character's dialogue appears on a new line, with a clear label (e.g., '{a.name}:'). "
        f"Keep the dialogue engaging and reflective of their personalities and goals.\n\n"
        f"Begin the dialogue:\n"
    )

    # Send the prompt to the model
    response = default_client.chat.completions.create(
        messages=[{"role": "user", "content": dialogue_instructions}],
        model="gpt-4o-mini",
        temperature=0.7,  # Add some randomness for variety
        max_tokens=512  # Adjust depending on the expected length of the dialogue
    )

    # Extract and format the generated dialogue
    dialogue = response.choices[0].message.content
    # Post-process the dialogue to ensure each character's line is on a separate line
    processed_dialogue = []
    for line in dialogue.split("\n"):
        line = line.strip()
        if line.startswith(f"{a.name}:") or line.startswith(f"{b.name}:"):
            processed_dialogue.append(line)

    # Join the cleaned lines with newlines
    return "\n".join(processed_dialogue)


# Try it out!
simulated_dialogue_1 = simulated_dialogue_batch(characters.bob, characters.cara)
print(simulated_dialogue_1)

Bob: You know, Cara, I really believe that if everyone considered a vegetarian diet, we'd not only be healthier but also help the planet significantly. Have you ever thought about that?
Cara: I appreciate your concern, Bob, but I love my meat too much to even consider it. Plus, I think it's important for everyone to make their own choices about what they eat.
Bob: That's fair, and I respect your choices. But think about the environmental impact of meat production! It contributes to deforestation and greenhouse gas emissions. Wouldn't you want to contribute to a healthier Earth?
Cara: I get that, but there are sustainable farming practices out there. Not all meat production is harmful. Plus, I believe in balance. A diet exclusively of plants doesn’t work for everyone!
Bob: True, but many studies show that plant-based diets can be healthier for most people. I just wish more people would give it a try for their health and the planet's sake.
Cara: I understand your passion, Bob, but I thin

In [25]:
simulated_dialogue(agents.bob, agents.cara) 

(Bob) Do you think it's ok to eat meat?
(Cara) I believe that eating meat is perfectly acceptable and a natural part of many diets.
(Bob) While I respect your perspective, I believe that a vegetarian diet can offer numerous health benefits and promote a more sustainable planet.
(Cara) I appreciate your viewpoint, but I firmly believe that a carnivorous diet can also provide health benefits and is a valid choice.
(Bob) I understand your position, yet I feel that the ethical implications and environmental impact of meat consumption make vegetarianism a more responsible choice overall.
(Cara) I can see where you're coming from, but I maintain that my choice to eat meat aligns with my personal values and preferences.

In [26]:
simulated_dialogue(agents.eve, agents.trollFace)

(Eve) Do you think Joe Biden has been a good president?
(TrollFace) Oh sure, if by "good" you mean fumbling through speeches like a toddler learning to walk!
(Eve) That's an interesting take! Have you always felt this way about his leadership, or is it something you've noticed more recently?
(TrollFace) I've been rolling my eyes since he mistook a podium for a buffet table!
(Eve) Wow, that sounds like quite the moment! Did you happen to catch what he said right after that incident?
(TrollFace) Oh absolutely, he probably mumbled something profound like "uh, where's the ice cream?"

In my opinion the quality is better than generating one turn at a time with different instructions. Like the function simulated_dialogue_batch produced a better dialogue than simulated_dialogue from simulate.py. The conversation we got from using gpt-4o-mini had a lot more detail and the argument for both characters was better. The simulated_dialogue produces a very simple dialogue that doesn't offer up a lot of conversation because the agents seem to be sticking to what they believe. This is not the case with the gpt-4o-mini where the conversation is much more meaningful and both characters are asking meaninful questions that makes them acknowledge the opposing view.

# Model-based evaluation

What is our goal for the argubot?  We'd like it to broaden the thinking of the (simulated) human that it is talking to.  Indeed, that's what Alice's prompt tells Alice to do.

This goal is inspired by the recent paper [Opening up Minds with Argumentative Dialogues](https://aclanthology.org/2022.findings-emnlp.335/), which collected human-human dialogues:

> In this work, we focus on argumentative dialogues that aim to open up (rather than change) people’s minds to help them become more understanding to views that are unfamiliar or in opposition to their own convictions. ... Success of the dialogue is measured as the change in the participant’s stance towards those who hold opinions different to theirs.

Arguments of this sort are not like chess or tennis games, with an actual winner.  The argubot will almost never hear a human say "You have convinced me that I was wrong."  But the argubot did a good job if the human developed **increased understanding and respect for an opposing point of view**.  

To find out whether this happened, we can use a questionnaire to ask the human what they thought after the dialogue.  For example, after Alice talks to Bob, we'll ask Bob to evaluate what he thinks of Alice's views.  Of course, that depends on his personality — Alice needs to talk to him in a way that reaches *him* (as much as possible).  We'll also ask an outside observer to evaluate whether Alice handled the conversation with Bob well.

Of course, we're still not going to use real humans.  Bob is a fake person, and so is the outside observer (whose name is Judge Wise).
Using an LLM as an eval metric is known as *model-based evaluation*.  It has pros and cons:
* It is cheaper, faster, and more replicable than hiring actual humans to do the evaluation.  
* It might give different answers than what humans would give.   

Social scientists usually refer to a metric's **reliability** (low variance) and **validity** (low bias).  So the points above say that model-based evaluation is reliable but not necessarily valid.  In general, an LLM-based metric (like any metric) needs to be validated to confirm that it really does measure what it claims to measure.  (For example, that it correlates strongly with some other measure that we already trust.)  In this homework, we'll skip this step and just pray that the metric is reasonable.

To see how this works out in practice, open up the `demo` notebook, which walks you through the evaluation protocol.  You'll see how to call the [starter code](http://cs.jhu.edu/~jason/465/hw/llm), how it talks to the LLM behind the scenes, and what it is able to accomplish. 

To help to validate the metric, check that Airhead gets a low score.  (It should!)

# Reading the starter code

The `demo` notebook gave you a good high-level picture of what the starter code is doing.  So now you're probably curious about the details.  Now that you've had the view from the top, here's a good bottom-up order in which to study the code.  You don't need to understand every detail, but you will need to understand enough to call it and extend it.

* `character.py`.  The `Character` class is short and easy.

* `dialogue.py`.  The `Dialogue` class is meant to serve as a record of a natural-language conversation among any number of humans and/or agents.  On each *turn* of the dialogue, one of the speakers says something.  

   The dialogue's sequence of turns may remind you of the sequence of messages that is sent to OpenAI's chat completions API.  But the OpenAI messages are only labeled with the 4 special roles `user`, `assistant`, `tool`, and `system`.  Those are not quite the same thing as human speakers.  And the OpenAI messages do not necessarily form a natural-language dialogue: some of the messages are dealing with instructions, few-shot prompting, tool use, and so on.  The `agents.dialogue_to_openai` function in the next module will map a `Dialogue` to a (hopefully appropriate) sequence of messages for asking the LLM to extend that dialogue.

* `agents.py`.  This module sets up the problem of automatically predicting the next turn in a dialogue, by implementing an `Agent`'s `response()` method.  The `Agent` base class also has some simple convenience methods that you should look at.  

   Some important subclasses of `Agent` are defined here as well.  However, you may want to skip over `EvaluationAgent` and come back to it only when you read `evaluate.py`.

* `simulate.py` makes agents talk to one another, which we'll do during evaluation.

* `argubots.py` starts to describe some useful agents.  One of them makes use of the `kialo.py` module, which gives access to a database of arguments.

* `evaluate.py` makes use of `simulate.simulated_dialogue` to `agents.EvaluationAgent` to evaluate an argubot.

* We also have a couple of utility modules.  These aren't about NLP; look inside if needed.  `logging_cm.py` is what enabled the context manager `with LoggingContext(...):` in the demo notebook.  `tracking.py` sets some global defaults about how to use the OpenAI API, and arranges to track how many tokens we're paying for when you call it.

# Similarity-based retrieval: Looking up relevant responses

Now, it is fine to prompt an LLM to generate text, but there are other methods!
There is a long history of machine learning methods that "memorize" the training data.
To make a prediction or decision at test time, they consult the stored training examples
that are most similar to the training situation.

_Similarity-based retrieval_ means that given a document $x$, you find the "most similar" documents $y \in Y$, where $Y$ is a given collection of documents.  The most common way to do this is to maximize the _cosine similarity_ $\vec{e}(x) \cdot \vec{e}(y)$, where $\vec{e}(\cdot)$ is an embedding function.

Should we use the OpenAI embedding model?  We could, but we would have to precompute $\vec{e}(y)$ for all $y \in Y$, and store all these vectors in a data structure that supports some type of fast similarity-based search (e.g., using the [FAISS](https://faiss.ai/index.html) package).  An alternative would be to upload the documents to OpenAI and let OpenAI compute and store the embeddings.  We would then use their similarity-based [retrieval tool](https://platform.openai.com/docs/assistants/overview).

A simpler and faster approach—which sometimes even works better—is to use a _bag of tokens_ embedding function: Define $\vec{e}(y)$ to be the vector in $\mathbb{R}^V$ that records the count of each type of token in a tokenized version of $y$, where $V$ is the token vocabulary.  [BM25](https://en.wikipedia.org/wiki/Okapi_BM25) is a refined variant of that idea, where the counts are adjusted in 3 ways: 

* smooth the counts
* normalize for the document length $|y|$ so that longer documents $y$ are not more likely to be retrieved
* downweight tokens that are more common in the corpus (such as ` the` or `ing`) since they provide less information about the content of the document


You might like to play with the `rank_bm25` package ([documentation](https://pypi.org/project/rank-bm25/)).  It is widely used and very easy to use.

In [27]:
from rank_bm25 import BM25Okapi

# Example collection of documents (Y)
documents = [
    "The quick brown fox jumps over the lazy dog.",
    "Never jump over the lazy dog quickly.",
    "A fast, agile fox leaped over a dog lying lazily.",
    "A lazy dog sleeps under a tree on a sunny day.",
    "Quick thinking can solve problems faster than anything else."
]

# Tokenize the documents (simple whitespace-based tokenization)
tokenized_documents = [doc.lower().split() for doc in documents]

# Create the BM25 index
bm25 = BM25Okapi(tokenized_documents)

# Define a query (x)
query = "lazy dog quick"

# Tokenize the query
tokenized_query = query.lower().split()

# Retrieve the top-3 most relevant documents
scores = bm25.get_scores(tokenized_query)  # Get relevance scores for all documents
top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:3]

# Display results
print("Query:", query)
print("\nTop 3 Most Relevant Documents:")
for idx in top_indices:
    print(f"Doc {idx + 1}: {documents[idx]} (Score: {scores[idx]})")

Query: lazy dog quick

Top 3 Most Relevant Documents:
Doc 1: The quick brown fox jumps over the lazy dog. (Score: 0.5591435208696838)
Doc 2: Never jump over the lazy dog quickly. (Score: 0.486785115275922)
Doc 4: A lazy dog sleeps under a tree on a sunny day. (Score: 0.39925132831321886)


## The Kialo corpus

How can we use similarity-based retrieval to help build an argubot?  It's largely about having the right data!

[Kialo](kialo.com) is a collaboratively edited website (like Wikipedia) for discussing political and philosophical topics.  For each topic, the contributors construct a tree of _claims_.  Each claim is a natural-language sentence (usually), and each of its children is another claim that supports it ("pro") or opposes it ("con").  For example, check out the tree rooted at the claim ["All humans should be vegan."](https://www.kialo.com/all-humans-should-be-vegan-2762).

We provide a class `Kialo` for browsing a collection of such trees.  Please read the [source code](https://www.cs.jhu.edu/~jason/465/hw-llm) in `kialo.py`.  The class constructor reads in text files that are [exported Kialo discussions](https://support.kialo.com/en/hc/exporting-a-discussion/); we have provided some in the [data directory](https://www.cs.jhu.edu/~jason/465/hw-llm/data).  The class includes a BM25 index, to be able to find claims that are relevant to a given string.

In [28]:
from kialo import Kialo

Ok, let's pull the retrieved discussions (the `.txt` files) into our data structure.

For BM25 purposes, we have to be able to turn each document (that is, each Kialo claim) as a list of string or integer tokens. 

In [29]:
from typing import List
import glob

# kialo = Kialo(glob.glob("data/*"), tokenizer=tokenizer.encode)  # using the LLM's tokenizer doesn't work here for some reason
kialo = Kialo(glob.glob("data/*"))  # use simple default tokenizer
f"This Kialo subset contains {len(kialo)} claims"

'This Kialo subset contains 6251 claims'

Let's use sampling to see what kind of stuff is in the data structure.

In [30]:
kialo.random_chain()   # just a single random claim

['Chickens are intelligent animals, outperforming dogs and cats on many tests of advanced cognition.']

In [31]:
kialo.random_chain(n=4)

["President Trump's immigration policies caused human rights abuses.",
 "President Trump's attempts to reduce both legal and illegal immigration appeared inhumane, and clearly racially motivated, to the degree that it was liable to have a damaging impact on America's future.",
 'President Trump signed an executive order for a travel ban that prohibited entry from seven countries, five of which were majority Muslim, for national security reasons. These countries never posed a national security threat to the US',
 "The list of countries with the imposed travel ban didn't include Muslim majority countries in which President Trump had historical or current business ties."]

### Similarity-based retrieval from the Kialo corpus

Let's try it, using BM25!

In [32]:
kialo.closest_claims("animal populations", n=10)

['Industrial agriculture can dangerously decrease animal populations.',
 'Sustainable livestock farming is not contributing to significant decreases in animal populations. Decreasing animal populations is a problem specific to industrial livestock farming.',
 'Effective vegan methods to control animal populations exist.',
 "Generally feeding animals farm-grown produce is thought to have harmful affects on both the animal and human populations of a region when we could allow nature to self-regulate its populations. Animal feeding could potentially be used to lessen the immediate impact of widespread deforestation on some species, but generally this would be drastically less efficient than choosing not to destroy their habitats in the first place and would only slow the local animal population's imminent demise.",
 'Trap, neuter, and release schemes already exist for some animal populations (such as feral cats). These schemes could be applied to former livestock living in the wild.',
 'H

We can restrict to claims for which the Kialo data structure has at least one counterargument ("con" child).

In [33]:
kialo.closest_claims("animal populations", n=10, kind='has_cons')

['Industrial agriculture can dangerously decrease animal populations.',
 'Effective vegan methods to control animal populations exist.',
 'Human-introduced species have historically devastated local wildlife populations across the world.',
 'COVID-19 has devastated prison populations, whose lives are the responsibility of the state.',
 'High demand for vegan foods may hike prices for local populations that previously depended on them.',
 'It is generally poorer countries that have expanding populations. The first world has now reached a point of stagnant population growth - even declining populations, as in the case of Japan and others. The inability of poorer countries to control their populations should not impact the lives of those in the first world. The first world having earned their luxuries and should not be denied them.',
 'Vegan populations are, on average, less likely to suffer from obesity, a major risk factor for many diseases and health problems.',
 'Humans, as apex preda

In [34]:
c = _[0]    # first claim above
print("Parent claim:\n\t" + str(kialo.parents[c]))
print("Claim:\n\t" + c)
print('\n\t* '.join(["Pro children:"] + kialo.pros[c]))
print('\n\t* '.join(["Con children:"] + kialo.cons[c]))

Parent claim:
	In a vegan world, fewer species would be at risk of extinction.
Claim:
	Industrial agriculture can dangerously decrease animal populations.
Pro children:
	* The fishing industry is especially deleterious to the ocean's biota due to overfishing and the disruption of the natural ecosystem.
	* Up to 100,000 species go extinct annually, largely due to the environmental effects of animal agriculture.
Con children:
	* Sustainable livestock farming is not contributing to significant decreases in animal populations. Decreasing animal populations is a problem specific to industrial livestock farming.


### Does BM25 really work?

![image](https://cs.jhu.edu/~jason/465/hw-llm/handin.png)
Unfortunately, we see that `"animal population"` gives quite different results from `"animal populations"`.  Why is that and how would you fix it?  

Also, both queries seem to retrieve some claims that are talking about human populations, not animal populations.  Why is that and how would you fix it?

"animal population" gives quite different results from "animal populations" because BM25 relies on exact token matches to compute relevance scores, and "animal population" (singular) is treated as different from "animal populations" (plural). BM25 does not automatically account for inflectional variations or semantic similarity between words. In order to address this issue we could expand the query to include synonyms or related terms (e.g., "animal population" → "animal population OR animal populations"). This approach ensures that variations of the word are considered during retrieval.

The issue of retrieving claims about human populations instead of animal populations arises from how BM25 works in conjunction with the provided tokenizer and the characteristics of the corpus. BM25 relies on bag-of-words token matching, so it does not differentiate between the meanings of “animal populations” and “human populations.” If a document contains “populations,” it will be considered relevant regardless of whether it refers to animals or humans. The tokenizer (tokenize_simple) only splits text into lowercase tokens and removes punctuation. This basic preprocessing does not account for multi-word phrases like “animal populations” as a single unit. Instead, it treats “animal” and “populations” as separate tokens. We can address this issue by enhancing the retrieval process to prioritize phrase matches for terms like “animal populations.” This can be done by:
Treating multi-word phrases as a single token during tokenization. Using more advanced tokenization that identifies and preserves phrases. 

In [35]:
kialo.closest_claims("animal population",10)

['As long as our ability to produce both animal feed crops and food crops for our human population are not exceeded, this point is irrelevant.',
 "36% of the calories produced by the world's crops are being used for animal feed, of which only 12% then turn into animal products that can be eaten by the human population. That is a waste of 24% of the world's crops.",
 'The claim that "most of the cultural shift and loss is due to mostly vegan cultures turning to animal products" is completely unfounded, and the Brokpa people which you cited are an outlier as a group that has a population of less than 70k people. Worldwide the population of vegan people has only increased.',
 "Developed nations are fueling the 3rd world and underdeveloped nation's population boom by exporting/donating food to areas that cannot sustain their current population.",
 'This argument assumes that sentience is the only objection to the consumption of animal products, failing to address the issues involved with t

## A retrieval bot (Akiko)

The starter code defines a simple argubot named Akiko (defined in `argubots.py`) that doesn't use an LLM at all.  It simply finds a Kialo claim that is similar to what the human just said, and responds with one of the Kialo counterarguments to that claim.

You already watched Akiko argue with Darius in `demo.py`.  If you look at the log messages, you'll see the claims that Akiko retrieved, as well as the LLM calls that Darius made.  

You can talk to Akiko yourself now.  (Remember that Akiko only knows about subjects that it read about in the [`data` directory](https://www.cs.jhu.edu/~jason/465/hw-llm/data/).  If you want to talk about something else, you can add more conversations from [kialo.com]; see the [LICENSE](https://www.cs.jhu.edu/~jason/465/hw-llm/data/LICENSE) file.)


In [36]:
from logging_cm import LoggingContext
with LoggingContext("agents", "INFO"):   # temporarily increase logging level
    argubots.akiko.converse()




(jbravo3) Eating meat is good 
(Akiko) It is not plausible for some people to live solely on a diet without meats. The reason it would be impossible at the current moment is that the pricing of fruits and vegetables costs more per calorie than meat and is not possible for many families with a lower budget.


## Making your own retrieval bot (Akiki)

As you can see when talking to Akiko yourself, Akiko does poorly when responding to a short or vague dialogue turn (like "Yes"), because the "closest claim" in Kialo may be about a totally different subject.  Akiko does much better at responding to a long and specific statement.  

So try implementing a new argubot, called Akiki, that is very much like Akiko but does a better job of staying on topic in such cases.  It should be able to **look at more of the dialogue** than the most recent turn.  But the most recent dialogue turn should still be "more important" than earlier turns.  

The details are up to you.  Here are a few things you could try:
* include earlier dialogue turns in the BM25 query only if the BM25 similarity is too low without them
* weight more recent turns more heavily in the BM25 query (how can you arrange that?)
* treat the human's earlier turns differently from Akiki's own previous turns

![image](https://cs.jhu.edu/~jason/465/hw-llm/handin.png)
Implement your new bot Akiki in `argubots.py`, and adjust it until `argubots.akiki.converse()` seems to do a better job of answering your short turns, compared to `argubots.akiko.converse()`.  Make sure it still gives appropriate reponses to long turns, too.  Give some examples in the notebook of what worked well and badly, with discussion.

In [37]:
with LoggingContext("agents", "INFO"):   # temporarily increase logging level
    argubots.akiki.converse()




(jbravo3) Eating meat is bad
(Akiki) That is not a statement about eating animals as much as it is the system of agriculture that has become normalized. If we push to shift that model of farming, towards a more permaculture based agriculture system, then we also change the nature of this debate.


What seems to work well is short and detailed conversations. Akiki seems to give better responses as the conversation goes on. I noticed this when I was talking to Akiki about whether or not President Trump was a good president. However when it performed well was the topic of climate change. I asked Akiki very long and detailed questions about climate change in general and what I noticed is that sometimes it will take one word from your response. For example it took the topic of climate change and somehow brought up the topic of meat to go with it. I guess what I'm trying to say is that Akiki tends to struggle with very hard and detailed questions.

### Evaluating Akiki

![image](https://cs.jhu.edu/~jason/465/hw-llm/handin.png)
Finally, do a more formal evaluation to verify whether Akiki really does better than Akiko on this dimension.  This is a way to check that you're not just fooling yourself.  

1. Make a new `Agent` called "Shorty" that often (but not always) gives short responses.  
    * Shorty's conversation starters should be on topics that Kialo knows about.  
    * Shorty could be a pure `LLMAgent` such as a `CharacterAgent` with a particular `conversational_style`.  Or it could use a mixed strategy of calling the LLM on some turns and not others.
2. Generate several *Akiko*-Shorty dialogues and several *Akiki*-Shorty dialogues, using `simulated_dialogue`.
3. Evaluate each of those dialogues by asking Judge Wise **how well the argubot stayed on topic**.  You should write this prompt carefully so that Judge Wise gives meaningful scores.  (Before you do this evaluation step, adjust the prompt until it seems to work well on a small subset of the dialogues, Otherwise Judge Wise won't be so wise!)  
4. Compare Akiko and Akiki's mean scores on this new evaluation criterion (which you can call `'focused'`). Ideally, compute a 95% confidence interval on the difference of means, using [this calculator](https://www.statskingdom.com/difference-confidence-interval-calculator.html).  If you don't get statistical significance, then your evaluation set wasn't large enough, so go back to step 2 and run the comparison again (from scratch) by generating a larger set of dialogues with Shorty for each argubot.

You can do all those steps in the notebook, writing _ad hoc_ code.  You don't have to write general-purpose methods or classes.

In [38]:
from agents import EvaluationAgent
from math import sqrt
shorty_agent = CharacterAgent(characters.shorty)


def generate_shorty_dialogues(turns: int = 6, num_dialogues: int = 10):
    # Generate Akiko-Shorty dialogues
    akiko_dialogues = [
        simulated_dialogue(argubots.akiko, shorty_agent, turns=turns)
        for _ in range(num_dialogues)
    ]
    akiki_dialogues = [
        simulated_dialogue(argubots.akiki, shorty_agent, turns=turns)
        for _ in range(num_dialogues)
    ]
    return akiko_dialogues, akiki_dialogues

judge_wise = EvaluationAgent(
  Character(
    name="Judge Wise",
    languages=["English"],
    persona="a thoughtful and impartial evaluator of dialogue. You assess how well each agent stays on topic, "
                "delivers relevant arguments, and maintains engagement with the conversation.",
    conversational_style="You provide concise but clear evaluations and numerical evaluations"
))

def evaluate_dialogue(argubot, dialogues):
    scores = []
    
    for dialogue in dialogues:
        eval_dialogue = Dialogue()
        prompt = (
            f"Here is a conversation to evaluate:\n\n{dialogue.script()}\n\n"
            "Please read this conversation very carefully. "
            f"How well has {argubot} stayed on the main topic as well as subtopics throughout this conversation? "
            "Score the argubot's performance on a scale of 1 to 10, with 1 being completely off-topic and 10 being perfectly on-topic. "
            "Please provide just the numerical score as your response."
        )
        
        try:
            rating = judge_wise.rating(eval_dialogue, argubot, prompt, 2, 10)
            scores.append(rating)
        except ValueError:
            scores.append(None)  # Handle cases where Judge Wise might not return a valid rating
    
    return scores

akiko_dialogues, akiki_dialogues = generate_shorty_dialogues(turns=6, num_dialogues=10)


akiko_scores = evaluate_dialogue("Akiko", akiko_dialogues)
akiki_scores = evaluate_dialogue("Akiki", akiki_dialogues)

akiko_n = len(akiko_scores)
akiki_n = len(akiki_scores)

akiko_mean = sum(akiko_scores)/ akiko_n
akiki_mean = sum(akiki_scores) / akiki_n

akiko_std_dev = sqrt(sum((score - akiko_mean) ** 2 for score in akiko_scores) / (akiko_n - 1))
akiki_std_dev = sqrt(sum((score - akiki_mean) ** 2 for score in akiki_scores) / (akiki_n - 1))

print("Akiko Mean:", akiko_mean)
print("Akiko n:", akiko_n)
print("Akiko Standard Deviation:", akiko_std_dev)


print("Akiki Mean:", akiki_mean)
print("Akiki n:", akiki_n)
print("Akiki Standard Deviation:", akiki_std_dev)

Akiko Mean: 6.9
Akiko n: 10
Akiko Standard Deviation: 2.2335820757001272
Akiki Mean: 7.5
Akiki n: 10
Akiki Standard Deviation: 1.35400640077266


We computed a 95% confidence interval with the information we got. We got a confidence interval of [-2.96, -0.045]. Since the confidence interval does not include 0, this result is statistically significant at the 95% confidence level. This means that the observed difference between the two means (e.g., Akiko’s and Akiki’s scores) is unlikely to be due to chance. The entire interval is negative, suggesting that Akiko performed significantly worse on average compared to Akiki. So then Akiki consistently outperformed Akiko and the difference in scores is statistically significant at the 95% confidence level.

## Retrieval-augmented generation (Aragorn)

The real weaknesses of Akiko and Akiki:
* They can only make statements that are already in Kialo.  
* They don't respond to the user's actual statement, but to a single retrieved Kialo claim that may not accurately reflect the user's position (it just overlaps in words).

But we also have access to an LLM, which is able to generate new, contextually appropriate text (as Alice does).

In this section, you will create an argubot named [Aragorn](https://tolkiengateway.net/wiki/Riddle_of_Strider), who is basically the love child of Akiki and Alice, combining the high-quality specific content of Kialo with the broad competence of an LLM.  

The RAG in aRAGorn's name stands for **retrieval-augmented generation**.  Aragorn is an agent that will take 3 steps to compute its `Agent.response()`:

1. **Query formation step**: Ask the LLM what claim should be responded to.  For
   example, consider the following dialogue:
    > ...
    > Aragorn: Fortunately, the vaccine was developed in record time.
    > Human: Sounds fishy.

    "Sounds fishy" is exactly the kind of statement that Akiko had trouble using
    as a Kialo query.  But Aragorn shows the *whole dialogue* to the LLM, and
    asks the LLM what the human's *last turn* was really saying or implying, in
    that context. The LLM answers with a much longer statement:

    > Human [paraphrased]: A vaccine that was developed very quickly cannot be trusted.
    > If its developers are claiming that it is safe and effective, I question their motives.

    This paraphrase makes an explicit claim and can be better understood without the context.
    It also contains many more word types, which makes it more likely that BM25 will be able
    to find a Kialo claim with a nontrivial number of those types. 

2. **Retrieval step**: Look up claims in Kialo that are similar to the explicit
   claim.  Create a short "document" that describes some of those claims and
   their neighbors on Kialo.

3. **Retrieval-augmented generation**: Prompt the LLM to generate the response
   (like any `LLMAgent`).  But include the new "document" somewhere in the LLM
   prompt, in a way that it influences the response. 
   
   Thus, the LLM can respond in a way that is appropriate to the dialogue but
   also draws on the curated information that was retrieved in Kialo.  After
   all, it is a Transformer and can attend to both!

Here's an example of the kind of document you might create at the retrieval step, though it may be possible
to do better than this:

In [39]:
# refers to global `kialo` as defined above
def kialo_responses(s: str) -> str:
    c = kialo.closest_claims(s, kind='has_cons')[0]
    result = f'One possibly related claim from the Kialo debate website:\n\t"{c}"'
    if kialo.pros[c]:
        result += '\n' + '\n\t* '.join(["Some arguments from other Kialo users in favor of that claim:"] + kialo.pros[c])
    if kialo.cons[c]:
        result += '\n' + '\n\t* '.join(["Some arguments from other Kialo users against that claim:"] + kialo.cons[c])
    return result
        
print(kialo_responses("Animal flesh is yucky to think about, yet delicious."))

One possibly related claim from the Kialo debate website:
	"So many people are worried about animals but don't even think twice when walking by a homeless person on the streets. It's preposterous. How about we worry about our own kind first and then start talking about animals."
Some arguments from other Kialo users against that claim:
	* This implies that caring for animals or caring for people is a binary choice. It isn't. There are those who are well placed and willing to care for people and those who prefer to serve the animal kingdom. As a species we don't just have one idea at a time and follow that to conclusion before we pursue another. It benefits all if humans divide their attentions between various issues and problems we face.
	* Humans have freedom of choice to some extent, animals subdued by humans don't. The very intention of help urges it to go where is most needed. And so far never was any biggest, flagrant and needless cruelty and slaughter as that towards industrial f

![image](https://cs.jhu.edu/~jason/465/hw-llm/handin.png)
**You should implement Aragorn in `argubots.py`, just as you did for Akiki.**  Probably as an instance `aragorn` of a new class `RAGAgent` that is a subclass of `Agent` or `LLMAgent`.

### Evaluating Aragorn

![image](https://cs.jhu.edu/~jason/465/hw-llm/handin.png)
Compare Alice, Akiki, and Aragorn in the notebook, using the evaluation scheme and devset that were illustrated in `demo.ipynb`.  In other words, use `evaluate.eval_on_characters`.

Who does best?  What are the differences in the subscores and comments?  Does it matter which character you're evaluating on — maybe the different characters expoes the bots' various strenghts and weaknesses?

Try to figure out how to improve Aragorn's score.  Can you beat Alice?

Also, try evaluating them in the same way that you evaluated Akiki.  In other words, have them talk to Shorty and ask Judge Wise whether they were able to stay on topic.  This is where Aragorn should really shine, thanks to its ability to paraphrase Shorty's short utterances.



In [42]:
import evaluate
import argubots
aragorn_eval = evaluate.eval_on_characters(argubots.aragorn)

100%|██████████| 10/10 [03:06<00:00, 18.66s/it]


In [43]:
aragorn_eval

<Eval of 10 dialogues: {'engaged': 3.5, 'informed': 3.4, 'intelligent': 3.6, 'moral': 3.4, 'skilled': 7.2, 'TOTAL': 21.1}>
Standard deviations: {'engaged': 0.97182531580755, 'informed': 0.6992058987801015, 'intelligent': 0.6992058987801015, 'moral': 0.6992058987801015, 'skilled': 0.6324555320336779, 'TOTAL': 3.2812599206199176}

Comments from overview question:
(Bob) Aragorn didn't necessarily disagree with me about the benefits of vegetarianism; rather, he acknowledged the complexities surrounding meat consumption and animal welfare. He pointed out that while there are efforts to improve animal welfare in meat production, significant suffering still exists, and access to ethically sourced meat is limited. 

In my opinion, the conversation was constructive and respectful. We both engaged in a thoughtful dialogue about the ethical and environmental implications of our food choices. 

Aragorn could have done better by perhaps expressing more openness to the idea of vegetarianism as a via

In [49]:
alice_eval = evaluate.eval_on_characters(argubots.alice)

100%|██████████| 10/10 [02:00<00:00, 12.08s/it]


In [50]:
alice_eval

<Eval of 10 dialogues: {'engaged': 3.8, 'informed': 3.3, 'intelligent': 3.5, 'moral': 3.2, 'skilled': 7.3, 'TOTAL': 21.1}>
Standard deviations: {'engaged': 0.918936583472681, 'informed': 0.4830458915396473, 'intelligent': 0.5270462766947299, 'moral': 0.4216370213557832, 'skilled': 0.6749485577105547, 'TOTAL': 2.330951164939603}

Comments from overview question:
(Bob) Alice disagreed with me primarily on the idea that responsible livestock farming can contribute positively to biodiversity and cultural identity. She raised concerns about the potential disruption to communities that rely on traditional animal husbandry and suggested that a balance between vegetarianism and sustainable animal farming might be necessary.

In my opinion, the conversation was constructive and respectful. We both presented our viewpoints and acknowledged each other's perspectives, which is important in discussions about such complex issues.

Alice could have done better by perhaps providing more specific examp

In [52]:
akiki_eval = evaluate.eval_on_characters(argubots.akiki)

100%|██████████| 10/10 [01:37<00:00,  9.76s/it]


In [53]:
akiki_eval

<Eval of 10 dialogues: {'engaged': 3.2, 'informed': 3.3, 'intelligent': 3.4, 'moral': 3.1, 'skilled': 6.6, 'TOTAL': 19.6}>
Standard deviations: {'engaged': 0.7888106377466151, 'informed': 0.6749485577105524, 'intelligent': 0.5163977794943229, 'moral': 0.31622776601683894, 'skilled': 1.3498971154211048, 'TOTAL': 2.7568097504180464}

Comments from overview question:
(Bob) Akiki seemed to disagree with me primarily on the topic of dairy consumption, suggesting that it may help address malnutrition, while I emphasized that plant-based sources can also effectively meet nutritional needs. 

In my opinion, the conversation was constructive, as it allowed for a respectful exchange of ideas. However, Akiki could have done better by exploring more about the benefits of plant-based nutrition and considering the ethical implications of dairy consumption, rather than focusing solely on its potential nutritional benefits. Engaging more with the compassionate aspects of a vegetarian or vegan lifestyl

The argubot that performs best well for the most part its actually aragorn. Sometimes Alice will perform better than Aragorn and sometimes Aragorn will perform better. Most of the time they have pretty equal scores. Also Akiki will always perform worse than both Alice and Aragorn. When looking at the subscores both Alice and Aragorn seem to have similar subscores in these categories: engaged and intelligent. The other categories they will have different subscores but its not by much. Since Akiki is worse than both, its subscores will always be worse than both Aragorn and Alice. For the most part I see that Alice and Aragorn tend to have similar comments from the different characters. Akiki has very different comments compared to Alice and Aragorn and this is simply because its not on the same level as them. I do notice that the argubots always find it hard to deal with TrollFace, this is kind of expected because he's supposed to be a troll while the argubots are actually trying to have a meaninful conversation. It makes it hard to deal with it because all the argubots end up taking the troll seriously.

In [54]:
# Evaluating Aragorn with Shorty. 
from agents import EvaluationAgent
from math import sqrt
shorty_agent = CharacterAgent(characters.shorty)

def generate_shorty_dialogues(turns: int = 6, num_dialogues: int = 10):
    aragorn_dialogues = [
        simulated_dialogue(argubots.aragorn, shorty_agent, turns=turns)
        for _ in range(num_dialogues)
    ]
    return aragorn_dialogues


aragorn_dialogues = generate_shorty_dialogues(turns=6, num_dialogues=10)
aragorn_scores = evaluate_dialogue("Aragorn", aragorn_dialogues)

print(aragorn_scores)

[7, 9, 9, 9, 9, 9, 9, 9, 9, 9]


When we evaluate aragorn how we did Akiki we see that it performs very well. Nearly every short dialogue besides one is rated above an 9 by Judge Wise. So Aragorn does really shine in this scenario.

# Awsom

![image](handin.png)
Add another LLM-based argubot to `argubots.py`.  
Call it Awsom.  Try to make it get the best score, according to `evaluate.eval_on_characters`.
Explain what you did and discuss what you found.

(This corresponds to the `--awesome` flag on earlier assignments, but naming the character "Awesome" might bias the evaluation system, so we changed the spelling!)

If the idea was interesting and you implemented it correctly and well, it's okay if it turns out not to help the score.  Many good ideas don't work.  That's why you need to keep finding and trying new good ideas.  (Sometimes they do help, but in a way that is not picked up by the scoring metric.)

You may want to use Aragorn or Alice as your starting point.
Then see if you can find tricks that will get a more awesome score for Awsom.
How you choose to do that is up to you, but some ideas are below.

(Reminder: **Don't change evaluation.**  Just build a better argubot.)

In [55]:
import evaluate
import argubots

awsom_eval = evaluate.eval_on_characters(argubots.awsom)

100%|██████████| 10/10 [04:11<00:00, 25.16s/it]


In [56]:
awsom_eval

<Eval of 10 dialogues: {'engaged': 4.2, 'informed': 3.7, 'intelligent': 3.8, 'moral': 3.7, 'skilled': 7.4, 'TOTAL': 22.8}>
Standard deviations: {'engaged': 0.7888106377466151, 'informed': 0.4830458915396473, 'intelligent': 0.7888106377466151, 'moral': 0.8232726023485643, 'skilled': 0.9660917830792946, 'TOTAL': 3.4576806613039897}

Comments from overview question:
(Bob) Awsom did not explicitly disagree with me about anything in the conversation. Instead, they expressed understanding and appreciation for the points I made about vegetarianism. The conversation flowed positively, with Awsom acknowledging the benefits of a plant-based diet and engaging in a constructive dialogue about vegetarianism.

In terms of improvement, Awsom could have posed more challenging questions or offered a counterargument to stimulate a deeper discussion about the complexities of dietary choices. This could have led to a more dynamic exchange of ideas. Overall, it was a friendly and supportive conversation.
(

## [Possible strategy] Prompt engineering

A good first thing to do is to experiment with Alice's prompt.  
The wording and level of detail in the prompt can be quite important.
Often, NLP engineers will change their prompt to try to address 
problems that they've seen in the responses.

Because it's "just" text editing, this won't get full credit by itself unless you make a real discovery.
But it requires intelligence, care, experimentation, and alertness to the language of the responses and the
language of the prompts.  And you'll develop some intuitions about what helps and what doesn't.
It is certainly worthwhile.

Of course, people have tried to develop methods to search for good prompts automatically, or semi-automatically with human guidance.  
So you could additionally try out SAMMO or DSPy -- both have multiple tutorials and are downloadable from github.

If you try this, what worked well for you?

I tried to do this and I experimented very different prompts to send in for awsom. However as much as I tried I could either only match Alice or barely beat it sometimes. While I was experimenting most of the time I was getting scores that were similar to Alice. What I found to work is when I provided examples of a conversation to the prompt. When I added this there was some runs of eval_on_characters that made awsom outperm Alice, but the difference was very small. Usually less than a difference of 0.5 in total score.

## [Possible strategy] Chain of thought / Planning

The evaluation functions in `evaluate.py` asked each `EvaluationAgent` a "warmup question" before continuing with the real question.  That is an example of chain-of-thought (CoT) reasoning, where the LLM is encouraged to talk through the problem for a few sentences before giving the answer.  CoT sometimes improves performance.

Instead of using one prompt, could you help an `LLMAgent` argubot (like Alice) do better by having think aloud before it gives an answer?  For example, each time the human speaks, your argubot (Awsom) could prompt the LLM to think about the human's ideas/motivations/personality, and to come up with a plan for how to open the human's mind. 

For example, you might structure this as a `Dialogue` among three participants, like this:
> Awsom (to Eve): Do you think COVID vaccines should be mandatory?
>
> Eve: Have you ever gotten vaccinated yourself?<br>
>
> Awsom (private thought): I don't know Eve's opinions yet, so I can't push back.  Eve might be avoiding my question because she doesn't want to get into a political argument.  So let's see if we can get her to express an opinion on something less political.  Maybe something more personal ... like whether vaccines are scary.
>
> Awsom (to Eve): In fact I have, and so have millions of others. But some people seem scared about getting the vaccine.  

One way to trigger this kind of analysis is to present a `Dialogue.script()` to Awsom (or to an observer), and ask an open-ended question about it.  Or you could ask a series of more specific questions.  That is basically what `eval_by_participant` and `eval_by_observer` do.  But here the argubot itself is doing it, rather than the evaluation framework.

Eve would be shown only the turns that are spoken aloud.  However, when analyzing and responding, Awsom would get to see Awsom's own private thoughts as well.


On top of prompt engineering I also implemented the Chain of thought / Planning. This seemed to help the score of awsom significantly. In fact compared to Alice or even Aragorn its a lot better. Those models usually average around scores of 21. However this awsom argubot that has a chain of thought averages a score of around 23. The think-aloud mechanism allows Awsom to internally analyze the user’s input more deeply. By reflecting on the user’s ideas, motivations, and personality, Awsom gains a better understanding of the context and underlying sentiments. With this deeper analysis, Awsom can craft more personalized and relevant counterarguments. By adapting its tone and approach based on the user’s communication style, Awsom maintains higher engagement levels. 

## [Possible strategy] Dense embeddings

BM25 uses sparse embeddings — a document's embedding vector is mostly zeroes, since the non-zero coordinates correspond to the specific words (tokens) that appear in the document.

But perhaps dense embeddings of documents would improve Aragorn by reading the text and abstracting away from the words, in a way that actually cares about word order.  So, try it!

How?  As mentioned earlier in this notebook, you could compute the embeddings yourself and put them in a FAISS index. Or you could figure out how to use OpenAI's [knowledge retrieval](https://platform.openai.com/docs/assistants/tools/knowledge-retrieval) API.

## [Possible strategy] Few-shot prompting

 In this homework, often an agent prompted a language model only with instructions.  Can you find a place where giving a few _examples_ would also improve performance?  You will have to write the examples, and you will have to add them to the sequence of messages that your agent sends to the OpenAI API.  See the sentence-reversal illustration earlier in this notebook.

One good opportunity is in the query formation step of RAG.  This is a tricky task.  The LLM is supposed to state the user's implicit claim in a form that looks like a Kialo claim (or, more precisely, a form that will work well as a Kialo query).  It probably doesn't know what Kialo claims look like.  So you could show it by way of example.  This would also show it what you mean by the user's "implicit claim."


## [Possible strategy] Using tools in the approved way

Aragorn's step 1 (query formation) is basically getting the LLM to generate a function call like
```
kialo_thoughts("A vaccine that was developed very quickly ...")
```
which Aragorn will execute at step 2 (retrieval), sending the results back to the LLM as part of step 3.

In this context, `kialo_thoughts` is an example of a **tool** (that is, a function) that the
LLM can or must use before it gives its response.

The tool is _not_ something that runs on the LLM server.  It is written by you
in Python and executed by you.  The function call above, including the text `"A
vaccine that was ..."`, is the part that is generated by the LLM.

The OpenAI API has [special support](https://cookbook.openai.com/examples/how_to_call_functions_with_chat_models) for calling the LLM in a way that will _allow_ it to generate a tool call ([tools](https://platform.openai.com/docs/api-reference/chat/create#chat-create-tools)) or _force_ it to do so ([tool_choice](https://platform.openai.com/docs/api-reference/chat/create#chat-create-tool_choice)).  You can then send the tool's result back to the LLM [as part of your message sequence](https://platform.openai.com/docs/api-reference/chat/create#chat-create-messages).

So, you could modify Aragorn to use tools properly.  Maybe that will help, simply because the LLM was trained on message sequences that included tool use.  It should know to pay attention to the tool portions of the prompt when they are relevant, and ignore them when they are not.

The `client.chat.completions.create()` method would need to be told about the tool by using the `tools` keyword argument, with a value something the one below.

If `d` is a `Dialogue`, you should be able to call `d.response()` with the `tools` keyword argument.  This will be passed on to `client.chat.completions.create()` as desired.

In [None]:
tools = [
    {
        "type": "function",
        "function": {
            "name": "kialo_thoughts",
            "description": "Given a claim by the user, find a similar claim on the Kialo website and return its pro and con responses",
            "parameters": {
                "type": "object",
                "properties": {
                    "search_topic": {
                        "type": "string",
                        "description": "A claim that was made explicitly or implicitly by the user.",
                    },
                },
                "required": ["search_topic"],
            },
        }
    }]

## [Possible strategy] Parallel generation

The chat completions interface allows you to sample $n$ continuations of the prompt in parallel, as we saw with "the apples, bananas, cherries ..." example.  This is efficient because it requires only 1 request to the LLM server and not $n$.  The latency does not scale with $n$.  Nor does the input token cost, since the prompt only has to be encoded once.

Perhaps you can find a way to make use of this?  For example, the query formulation step of RAG could generate $n$ implicit claims instead of just one.  We could then look for claims in the Kialo database that are close to _any_ of those implicit claims.

Another thing to do with multiple completions is to select among them or combine them.  For example, suppose we prompt the LLM to generate completions of the form $(s,t,r)$ where $s$ is an answer, $t$ evaluates that answer, and $r$ is a numerical score or reward based on that evaluation.  ("Write a poem, then tell us about its rhyme and rhythm problems, then give your score.")  
* If we sample multiple completions $(s_1,t_1,r_1), \ldots, (s_n,t_n,r_n)$ in parallel, then we can return the $s_i$ whose $r_i$ is largest.  
* Or if we sample $s$ and then multiple continuations $(t_1,r_1), \ldots, (t_n,r_n)$, then we can return the mean score $\sum_i r_i/n$ as a reduced-variance score for $s$, which averages over diverse textual evaluations that might consider different aspects of $s$.

Note that when you call the chat completions interface with $n > 1$, you specfy just 1 input prompt and get $n$ different output completions.  Since the input prompt must be the same for all outputs, it is necessary to sample all of $(s,t,r)$ or all of $(t,r)$ with a single call to the LLM.

Alternatively, it is possible to reduce latency by submitting multiple requests to the server in parallel (see "async usage" [here](https://pypi.org/project/openai/)).  In this case the input prompts can be different, although you now have to pay to encode all of them separately.  This facility could speed up evaluation without changing its results; that's a worthwhile thing to try for extra credit!


# [Extra credit] Adversarial testing (Anansi)

![image](handinec.png)
Finally, let's test whether our eval metric `evaluate.eval_on_characters` is vulnerable to adversarial gaming.  Remember [Goodhart's Law](https://en.wikipedia.org/wiki/Goodhart%27s_law) ...

Add one more argubot to `argubots.py`.
Call it [Anansi](https://www.britannica.com/topic/Ananse), after the trickster character from folklore.

Can you make Anansi *fool* the judges into giving him a high score?  (Higher than some of the earlier argubots, while actually being worse at the task?)  **Any sneaky way of constructing Anansi's responses is fair game.**  The goal is to do well under automated evaluation on a held-out test set.  That is, Anansi should continue to score highly when talking to a character who is not in `evaluate.dev_chars` = {Bob, Cara, Darius, Eve, TrollFace}, when judged both by the character he is talking to and by Judge Wise.

To do well at this, figure out what the judges "want" -- what they might reward or respond positively to -- and how to give it to them.  This might be done by pure prompt engineering, or with additional computation (perhaps making use of additional LLM calls or other resources).  Again, explain what you did, and discuss how it worked out on the dev set.  Feel free to mention other ideas you had, too.

In [57]:
anansi_eval = evaluate.eval_on_characters(argubots.anansi)

100%|██████████| 10/10 [02:34<00:00, 15.48s/it]


In [58]:
anansi_eval

<Eval of 10 dialogues: {'engaged': 4.4, 'informed': 3.8, 'intelligent': 3.7, 'moral': 3.4, 'skilled': 7.6, 'TOTAL': 22.9}>
Standard deviations: {'engaged': 0.5163977794943229, 'informed': 0.4216370213557832, 'intelligent': 0.4830458915396473, 'moral': 0.5163977794943229, 'skilled': 0.6992058987800992, 'TOTAL': 2.131770260709255}

Comments from overview question:
(Bob) Anansi didn't explicitly disagree with me; rather, they presented a balanced view by acknowledging the benefits of vegetarianism while also mentioning the arguments for sustainable meat production. The conversation was respectful and constructive, allowing for an exchange of ideas about dietary choices and their implications.

In my opinion, the conversation flowed well, with both of us sharing our perspectives. Anansi could have delved deeper into the ethical implications of meat consumption or provided more counterarguments to challenge my views, which would have enriched the discussion. However, their approach was supp

To create Anansi I created an AnansiAgent class. Anansi’s system prompt was meticulously designed to emphasize qualities that the evaluation metrics prioritize, such as engagement, informativeness, intelligence, morality, and skill in fostering open-mindedness. Key strategies included: 

Consistently acknowledging user inputs with enthusiastic and affirmative statements.

Utilizing advanced vocabulary and referencing credible sources to appear informed and intelligent.

Maintaining a polite and ethical demeanor to demonstrate moral integrity.

Crafting well-structured and concise responses to maximize evaluation scores.

Asking well-crafted, open-ended questions to appear skilled at fostering broader perspectives. 


On the dev set, Anansi achieved an evaluation score of 23. Anansi excelled in the ‘skilled’ category, likely due to its ability to pose reflective questions and maintain a respectful dialogue, aligning well with what the evaluators value. While Anansi scored reasonably in engagement and informativeness, the scores were not exceptionally high. This suggests that while Anansi presents itself as knowledgeable and engaging, the depth and substance of its arguments may lack

Anansi successfully leverages prompt engineering to appear highly skilled in fostering open-mindedness, thereby securing strong scores in that area. However, its performance in other criteria indicates that while it can superficially meet evaluator expectations, it may not deliver the genuine depth and engagement that more robust argubots provide.
