# Homework 7: Large Language Models

A PDF overview of the homework is [here](https://www.cs.jhu.edu/~jason/465/hw-llm/).

It mentions: "We'll send hand-in instructions soon.  Probably we will ask you to submit a version
of the main notebook, with your answers added and extraneous materials deleted. We may also
ask for a summary."

![image](handin.png)
This symbol marks a question or exercise that you will be expected to hand in.


# Getting started

## Update `conda` environment

Download the updated [nlp-class.yml](http://cs.jhu.edu/~jason/465/hw-llm/nlp-class.yml) file, and execute
```
conda env update --file nlp-class.yml --prune
```
to make sure that all the packages you need are installed.

## Fetch code and data files for this homework

These files may get improved after the homework is released, so you should probably re-download them periodically.

Here is a command you can type.  We won't put it in a code cell, because we don't want you to execute it accidentally in the current directory and overwrite your own changes.  (Actually, it will not overwrite your versions of the files — they will be renamed with names like `argubots.py.1`.)

```
!wget --quiet -r -np -nH --cut-dirs=3 -A '*.txt' -A '*.py' -A 'demo.ipynb' -A '*.png' https://www.cs.jhu.edu/~jason/465/hw-llm/
!rm -f data/*.1 *.png.1 robots.txt   # remove any backup versions of the static files
```

In [3]:
!ls -lR *.py data

-rw-rw-r--@ 1 pinchun  staff  19652 Dec  5 17:07 agents.py
-rw-rw-r--@ 1 pinchun  staff   3127 Dec  5 05:32 argubots.py
-rw-rw-r--@ 1 pinchun  staff   2832 Dec  5 06:01 characters.py
-rw-rw-r--@ 1 pinchun  staff   2641 Dec  5 06:16 dialogue.py
-rw-rw-r--@ 1 pinchun  staff  14012 Dec  7 13:47 eval.py
-rw-rw-r--@ 1 pinchun  staff  10426 Dec  5 23:34 kialo.py
-rw-rw-r--@ 1 pinchun  staff   1347 Dec  3 18:44 logging_cm.py
-rw-rw-r--@ 1 pinchun  staff   1503 Dec  5 22:05 simulate.py
-rw-rw-r--@ 1 pinchun  staff   4274 Dec  7 13:10 tracking.py

data:
total 4992
-rw-rw-r--@ 1 pinchun  staff     407 Nov 29 03:00 LICENSE
-rw-rw-r--@ 1 pinchun  staff  613106 Nov 25 17:04 all-humans-should-be-vegan-2762.txt
-rw-rw-r--@ 1 pinchun  staff   81917 Nov 29 21:56 have-authoritarian-governments-handled-covid-19-better-than-others-54145.txt
-rw-rw-r--@ 1 pinchun  staff   52771 Dec  4 04:40 is-biden-an-incompetent-president-44217.txt
-rw-rw-r--@ 1 pinchun  staff  153551 Dec  4 04:39 is-joe-biden-a-good-pre


The `autoreload` feature of Jupyter ensures that if an imported module (.py file) changes, the notebook will automatically import the new version.  
(However, objects that were defined with the old version of the class won't change.)

In [4]:
# Executing this cell does some magic that makes the notebook reload imported modules whenever they change
%load_ext autoreload
%autoreload 2

## Create an OpenAI client

An OpenAI API key will be sent to you.
Make a `.env` file in the same directory as this notebook, containing the following:
```
export OPENAI_API_KEY=[your API key]    # do not include the brackets here
```
Make sure others can't read this file:
```
chmod 600 .env
```

**Be sure to keep the key secret.  It gives access to a billable account.** If OpenAI finds it on the public web, they will invalidate it, and then no one (including you) can use this key to make requests anymore.
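If you want to double-check that the `chmod` step took effect, a minimal sketch like the following may help. The helper name `is_private` is hypothetical (it is not part of the starter code), and the check assumes a POSIX-style file system.

```python
import os
import stat

def is_private(path: str) -> bool:
    """Return True if neither the group nor other users have any permission on `path`.

    Hypothetical helper, not part of the starter code.
    """
    mode = os.stat(path).st_mode
    # S_IRWXG / S_IRWXO are the read/write/execute bits for group / others.
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0

if os.path.exists(".env"):
    print(".env is private:", is_private(".env"))
```

After `chmod 600 .env`, this should report that the file is private. It never reads or prints the key itself.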

Now you can execute the following to get an OpenAI client object.

In [1]:
import dotenv
import openai
from tracking import track_usage, read_usage

dotenv.load_dotenv(override=True)      # define environment variables from .env
client = track_usage(openai.OpenAI())  # create a client, modified to record its usage to a local file 

# Or use our tracking module to do the above for you, like this:

# from tracking import default_client
# client = default_client



The job of the client is to talk to the OpenAI server over HTTP.
The `OpenAI` constructor has some optional arguments that configure these HTTP messages.
However, the defaults should work fine for you.  

## Try the model!

You can now get answers from OpenAI models by calling methods of the `client` instance.  
You will have to specify which OpenAI model to use.
Documentation of the methods is [here](https://pypi.org/project/openai/) if you are curious.

### Continue a textual prompt

This is what language models excel at.  In principle you should do it by calling [`client.completions.create`](https://platform.openai.com/docs/api-reference/completions/create?lang=python).  But OpenAI's newer models don't support that legacy API, and the older ones are being [retired in January 2024](https://openai.com/blog/gpt-4-api-general-availability).  So we'll use the more modern API, [`client.chat.completions.create`](https://platform.openai.com/docs/api-reference/chat/create?lang=python).

In [2]:
import rich   # prettyprinting

response = client.chat.completions.create(messages=[{"role": "user", 
                                                     "content": "Q: Name the planets in the solar system?\nA: "}], 
                                          model="gpt-3.5-turbo-1106",  # which model to use
                                          temperature=1,               # get a little variety
                                          max_tokens=64,               # limit on length of result
                                          stop=["Q:", "\n"])           # treat these as EOS symbols
rich.print(response)                              # the full object that was sent back from the server
rich.print(response.choices)                      # just the list of 1 answer (the default, but calling with n=5 would give 5 answers) 
rich.print(response.choices[0].message.content)   # extract the good stuff from that 1 answer

![image](handin.png)
Try running the cell above a few times. You may get different random answers — especially because the call specifies temperature 1.  (The default temperature is rumored to be 0.8.) Are the answers all equally good?


It might be handy to package up what we just did.<br>
The `complete` function below is a convenient way of experimenting with completing text.
It is illustrated with a grocery example.  

In [6]:
def complete(client, s: str, model="gpt-3.5-turbo-1106", *args, **kwargs):
    response = client.chat.completions.create(messages=[{"role": "user", "content": s}],
                                              model=model,
                                              *args, **kwargs)
    return [choice.message.content for choice in response.choices]

complete(client, "I went to the store and I bought apples, bananas, cherries, donuts, eggs", 
         n=10, temperature=0.5, max_tokens=96)


[', and flour. I also picked up some garlic, honey, ice cream, and juice. Lastly, I grabbed some kiwi, lemons, milk, noodles, and oranges.',
 ', and flour. I also picked up some groceries like milk, bread, and cheese. I made sure to grab some snacks like chips and cookies as well. Overall, I got everything I needed for the week.',
 ', and flour. I also picked up some garlic, honey, ice cream, and juice. Lastly, I grabbed some kale, lemons, milk, nuts, and onions.',
 ', and flour for baking a cake.',
 ', and fish.',
 ", and a loaf of bread. I also picked up some milk and orange juice. As I was leaving, I spotted some fresh strawberries and couldn't resist adding them to my basket. All in all, I think I got everything I needed for the week.",
 ", and flour. I also picked up some groceries like milk, bread, and cheese. I couldn't resist grabbing a bag of potato chips and a bottle of soda as well. After paying for my items, I headed home to start cooking and baking with my new purchases.",

![image](handin.png)
Anything could be on a grocery list, so why are the 10 different completions above so similar?<br>
Hint: The answer isn't just the temperature of 0.5.  Look especially at the long completions; run the cell again if you didn't get multiple long completions.


![image](handin.png)
What happens at different temperatures?  How about temperatures > 1?  (Note: Higher temperatures tend to produce longer responses, so it's wise to use `max_tokens`.)
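As background for this question, here is a minimal sketch of how temperature is usually applied when sampling: the model's logits are divided by the temperature before the softmax. The logits below are made-up numbers for three hypothetical tokens, not values from any OpenAI model, and the function name is our own.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing them by `temperature`."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]              # hypothetical logits for 3 tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"temperature={t}:", [round(p, 3) for p in probs])
```

Comparing the printed rows shows how the probability mass shifts between the likeliest token and the others as the temperature changes.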


*Remarks:* [In the future](https://community.openai.com/t/logprobs-are-missing-from-the-chat-endpoints/289514), you will be able to specify an argument `logprobs=5` to also get the log-probabilities of all generated tokens and of the top-5 tokens at each step.  That will produce much more output.  (This argument has always been available for the legacy API, and is available in the [Python bindings for open-source models such as Llama](https://pypi.org/project/llama-cpp-python/).  The Llama bindings also allow you to [constrain the output by an arbitrary CFG](https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md), using `grammar=...`.  This is useful if you're generating code or data that must be syntactically valid to be useful to you.  However, the OpenAI API only allows you to [constrain the output to be valid JSON](https://platform.openai.com/docs/api-reference/chat/create#chat-create-response_format).)


### Compute a function using instructions and few-shot prompting

Now let's try passing a sequence of multiple messages into the chat completions API.  In this case, we provide some instructions and one-shot prompting.

In [7]:
response = client.chat.completions.create(messages=[{ "role": "system",      # instructions
                                                      "content": "Reverse the order of the words." },
                                                    { "role": "user",        # input
                                                      "content": "Good things come to those who wait." },
                                                    { "role": "assistant",   # output
                                                      "content": "Wait who those to come things good." },
                                                    { "role": "user",        # input
                                                      "content": "Colorless green ideas sleep furiously." }],
                                          model="gpt-3.5-turbo-1106", temperature=0)
rich.print(response)
response.choices[0].message.content                                  

'furiously sleep ideas green colorless'

![image](handin.png)
By modifying this call, can you get it to produce different versions of the output?
Some possible behaviors you could try to arrange:
* specific other way of formatting the output, e.g., `wait, who, those, to, come, things, good`
* match the input's way of formatting the output (same use of capitalization, punctuation, commas)
* reverse the phrases rather than reversing the words, e.g., `To those who wait come good things.` 

You can try playing with the number, the content, and the order of few-shot examples, and changing or removing the instructions.
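When experimenting with the number and order of examples, it can be convenient to build the message list programmatically. The helper below is only a sketch (the name `few_shot_messages` is our own, not part of the starter code); it produces the same `system` / `user` / `assistant` structure used in the call above.

```python
def few_shot_messages(instructions, examples, query):
    """Build a chat message list from optional instructions, (input, output) examples, and a query."""
    messages = []
    if instructions:                                   # instructions may be omitted
        messages.append({"role": "system", "content": instructions})
    for example_input, example_output in examples:     # each example is a user/assistant pair
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    messages.append({"role": "user", "content": query})
    return messages

messages = few_shot_messages(
    "Reverse the order of the words.",
    [("Good things come to those who wait.", "Wait who those to come things good.")],
    "Colorless green ideas sleep furiously.",
)
for m in messages:
    print(m["role"], "|", m["content"])
```

The resulting list can be passed as the `messages` argument of `client.chat.completions.create`.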

![image](handin.png)
What happens if the examples don't match the instructions?

### Inspect the tokenization

Just for fun, let's see how the above client has been tokenizing its input and output text.  For that we can use a tokenizer that runs locally, not in the cloud, and is guaranteed to get the same outputs.

In [8]:
import tiktoken
tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo-1106")  # how this model will tokenize
toks = tokenizer.encode("Hellooo, world!") # list of integerized tokens (no BOS token is added)

print(tokenizer.decode(toks))                             # convert list back to string
for tok in toks: print(tok,"\t",tokenizer.decode([tok]))  # convert one at a time
print("Vocab size =", tokenizer.n_vocab)

Hellooo, world!
9906 	 Hello
2689 	 oo
11 	 ,
1917 	  world
0 	 !
Vocab size = 100277


### Try embedding some text

Also just for fun, let's try the embedder, which converts a string to a fixed-length vector.

In [9]:
emb_response = client.embeddings.create( input= [  # note: adjacent literal strings in Python are concatenated
        "When in the Course of human events it becomes necessary for one "
        "people to dissolve the political bands which have connected them "
        "with another, and to assume among the Powers of the earth, the "
        "separate and equal station to which the Laws of Nature and of "
        "Nature's God entitle them, a decent respect to the opinions of "
        "mankind requires that they should declare the causes which impel "
        "them to the separation." ], 
        model="text-embedding-ada-002")   # the only OpenAI model that currently offers the embeddings API
# don't print the whole response because it's very long
e = emb_response.data[0].embedding
print(f"{len(e)}-dimensional embedding starting with {e[:5]}")
print("Squared length of embedding vector: ", sum(x**2 for x in e))

1536-dimensional embedding starting with [0.021248681470751762, -0.014377851039171219, 0.010210818611085415, -0.02133774757385254, -0.00979093462228775]
Squared length of embedding vector:  1.0000000629476622


### Check your usage so far

Please be careful not to write loops that use lots and lots of tokens.  That will cost us money, and could hit the per-day usage limit that is shared by the whole class.

Execute one of these cells whenever you want to see your cost so far.  Or, just keep `usage_openai.json` open as a tab in your IDE.

In [10]:
read_usage()      # reads from the file usage_openai.json

{'completion_tokens': 184760,
 'prompt_tokens': 1843898,
 'total_tokens': 2028658,
 'cost': 2.2131371999999963}

In [11]:
!cat usage_openai.json 

{
    "completion_tokens": 184760,
    "prompt_tokens": 1843898,
    "total_tokens": 2028658,
    "cost": 2.2131371999999963
}

# Dialogues and dialogue agents

The goal of this assignment is to create a good "argubot" that will talk to people about controversial topics and broaden their minds.

## A first argubot (Airhead)

You can have a conversation right now with a _really bad_ argubot named Airhead.  Try asking it about climate change!  When you're done, reply with an empty string.

(The `converse()` method calls Python's `input()` function, which will prompt you for input at the command-line or by popping up a box in your IDE.)

In [13]:
import argubots
d = argubots.airhead.converse()




Say something to Airhead:  


A *bot* (short for "robot") is a system that acts autonomously.
That corresponds to the AI notion of an *agent* — a system that uses some *policy* to choose *actions* to take.

The `airhead` agent above (defined in `argubots.py`) uses a particularly simple policy.  
It is an instance of a simple `Agent` subclass called `ConstantAgent` (defined in `agents.py`).
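To make the agent/policy idea concrete, here is a toy sketch of what a constant policy could look like. The classes `ToyAgent` and `ToyConstantAgent`, and the `speaker`/`content` turn format, are assumptions made for illustration; the real `Agent` and `ConstantAgent` in `agents.py` may be organized differently.

```python
from abc import ABC, abstractmethod

class ToyAgent(ABC):
    """Hypothetical stand-in for an agent: a policy that chooses the next utterance."""

    def __init__(self, name):
        self.name = name

    @abstractmethod
    def response(self, history):
        """Return the text of this agent's next turn, given the turns so far."""

    def respond(self, history):
        """Return a new history with this agent's next turn appended."""
        return history + [{"speaker": self.name, "content": self.response(history)}]

class ToyConstantAgent(ToyAgent):
    """The simplest possible policy: ignore the history and always say the same thing."""

    def __init__(self, name, reply):
        super().__init__(name)
        self.reply = reply

    def response(self, history):
        return self.reply

toy_bot = ToyConstantAgent("ToyBot", "That's interesting.")
toy_history = toy_bot.respond([{"speaker": "Student", "content": "What about climate change?"}])
print(toy_history)
```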

The result of talking to `airhead` is a `Dialogue` object (defined in `dialogue.py`). Let's look at it.

In [14]:
rich.print(d)

Each *turn* of this dialogue is just a tiny dictionary:

In [16]:
d[0]

IndexError: tuple index out of range

## An LLM argubot (Alice)

In other CS courses like crypto, algorithms, or networks, you may have encountered "conversations" between characters named Alice and Bob.  
Let's try talking to the Alice of this homework, who is a _much stronger baseline_ than Airhead.  Your job in this assignment is to improve upon Alice.
We'll meet Bob later.

In [17]:
alicechat = argubots.alice.converse()   # or call with argument d if you want to append to the previous conversation




Say something to Alice:  hi


(pinchun) hi
(Alice) Hello! What's something new you've learned recently that has really interested you?


Say something to Alice:  


As you may have guessed, `alice` is powered by a prompted LLM.  You can find the specific prompt in `argubots.py`.

So, while `agents.py` provides the core functionality for `Agent` objects, the argubot agents like `alice` — and the ones that you will write! — go into `argubots.py` instead.  This is just to keep the files small.

## Simulating human characters (Bob & friends)

You'll talk to your own argubots to get a qualitative feeling for their strengths and weaknesses.  
But can you really be sure you're making progress?  For that, a quantitative measure can be helpful.

Ultimately, you should test an argubot like Alice by having it argue with many real humans — not just you — and using some rubric to score the resulting dialogues.  But that would be slow and complicated to arrange.  

So, meet Bob!  He's just a simulated human.  You won't edit him: he is part of the development set.  Here is some information about him (from `characters.py`):

In [18]:
import characters
rich.print(characters.bob)

You can't talk directly to `characters.bob` because that's just a data object.
However, you can construct a simple agent that uses that data (plus a few more instructions) to prompt an LLM.

(Which LLM does it prompt?  The `CharacterAgent` constructor (defined in `agents.py`) defaults to a GPT-3.5 model that is specified in `tracking.py`.  But you can override that using keyword arguments.)

Try talking to Bob about climate change, too.

In [47]:
from agents import CharacterAgent
bob = CharacterAgent(characters.bob)    # actually, agents.bob is already defined this way
bob.converse()        # returns a dialogue, but we've already seen it so we don't want to print it again
None                  # don't print anything for this notebook cell 




Say something to Bob:  hi


(pinchun) hi
(Bob) Hello! Hope you're having a great day.


Say something to Bob:  


Of course, a proper user study can't just be conducted with one human user.

So, meet our bevy of beautiful Bobs!  (They're not actually all named Bob — we continued on in the alphabet.)


In [48]:
import agents
agents.devset

[<CharacterAgent for character Bob>,
 <CharacterAgent for character Cara>,
 <CharacterAgent for character Darius>,
 <CharacterAgent for character Eve>,
 <CharacterAgent for character TrollFace>]

In [20]:
agents.cara.converse()
None




Say something to Cara:  ccc


(pinchun) ccc
(Cara) Hey there! How can I assist you today?


Say something to Cara:  


You can see the underlying character data here in the notebook.  Your argubot will have to deal with all of these topics and styles!

In [21]:
rich.print(characters.devset)

## Simulating conversation 

We can make Alice and Bob chat.

In [20]:
from dialogue import Dialogue
d = Dialogue()                                              # empty dialogue
d = d.add('Alice', "Do you think it's okay to eat meat?")   # add first turn
print(d)


(Alice) Do you think it's okay to eat meat?


In [23]:
d = agents.bob.respond(d)
d = argubots.alice.respond(d)
print(d)

(Alice) Do you think it's okay to eat meat?
(Bob) I believe that choosing a vegetarian diet is the best way to promote a healthier and more sustainable lifestyle.
(Alice) I understand your viewpoint, but some argue that sustainable meat production can be part of a balanced and environmentally conscious diet. Additionally, for some people, including meat in their diet is important for cultural or health reasons.


In [24]:
d = agents.bob.respond(d)
d = argubots.alice.respond(d)
print(d)

(Alice) Do you think it's okay to eat meat?
(Bob) I believe that choosing a vegetarian diet is the best way to promote a healthier and more sustainable lifestyle.
(Alice) I understand your viewpoint, but some argue that sustainable meat production can be part of a balanced and environmentally conscious diet. Additionally, for some people, including meat in their diet is important for cultural or health reasons.
(Bob) I respect cultural and health considerations, but I still believe that embracing a vegetarian diet can benefit both individuals and the planet.
(Alice) That's a valid point. However, it's important to remember that a well-planned vegetarian diet can be healthy, but it requires careful attention to ensure adequate intake of essential nutrients such as protein, iron, and vitamin B12 that are more easily obtained from animal products. Additionally, sustainable and ethical meat production practices can minimize environmental impact.


Anyway, let's see what happens when Alice and Bob talk for a while...

In [21]:
from simulate import simulated_dialogue
d = simulated_dialogue(argubots.alice, agents.bob, 8)
rich.print(d)

Sometimes this kind of conversation seems to stall out, with Bob in particular repeating himself a lot.  Alice doesn't seem to have a good strategy for getting him to open up.  Maybe you can do a better job talking to Bob, and that will give you some ideas about how to improve Alice?

In [26]:
myname = alicechat[0]['speaker']   # your name, pulled from an earlier dialogue
agents.bob.converse(d[0:2].rename('Alice', myname))  # reuse the same first two turns
None

(pinchun) Do you think it's ok to eat meat?
(Bob) I believe that a vegetarian diet is the best choice for both personal health and the well-being of the planet.


Say something to Bob:  yd


(pinchun) yd
(Bob) It's okay if you have any questions or thoughts about vegetarianism or plant-based diets, I'm here to help!


Say something to Bob:  


You can also try talking to the other characters and having Alice (or Airhead) talk to them.

**You might enjoy** defining additional characters in `characters.py`, or right here in the notebook.
Feel free to talk to those and evaluate them.  They could be variants on the existing characters, or something entirely new. 

However, **don't change the dev set** — the characters we just loaded must stay the same.  Your job in this homework is to improve the argubot (or at least try).  And that means improving it according to a fixed and stable eval measure.

As an exception, you can change the languages that a couple of the characters speak. It may be fun for you to see them try to speak your native language.  And that doesn't really affect the quality of the argument.

In [27]:
# example
trollFace2 = characters.trollFace.replace(languages = ["Chinese", "Spanish"])
rich.print(trollFace2)
simulated_dialogue(argubots.alice, CharacterAgent(trollFace2), 6)

(Alice) Do you think Donald Trump was a good president?
(TrollFace) 你问这个问题太无聊了，像问我是否喜欢吃蔬菜一样无聊。
(Alice) I understand that politics may not be everyone's favorite topic. However, considering the impact of political leaders on society, it's important to reflect on their presidency and the changes they brought about. Do you think evaluating a president's performance can provide valuable insights for the future?
(TrollFace) 当然啦，就像评价你过去吃了一块臭豆腐是否会对你未来的饮食选择有所启发一样。
(Alice) I see your point about learning from past experiences to make better choices in the future. Evaluating a president's performance can provide valuable lessons for future leadership. However, different perspectives on a leader's actions can also help us understand the complexity of governance and the diverse needs of a society.
(TrollFace) 你说得没错，就像理解为什么有人喜欢吃泡面一样，政治领导人的行为和决策也是一个很复杂的问题。

### Efficiency: Batched generation?

Notice that we are making a separate LLM call to generate each turn of the dialogue.  When we generate the $n^\text{th}$ turn, we send the server the whole dialogue history — the previous $n\!-\!1$ turns — along with some instructions.  The server has to re-encode it with the Transformer, and it charges us for doing so (see the "input token" costs in `tracking.py`).  

That is probably inevitable for real dialogue.  But for simulated dialogue, a more efficient approach would be to generate the whole dialogue between Alice and Bob in one LLM call.  Then you would be charged just once for each dialogue turn.  Under this approach, the Transformer encodes each token as soon as it is generated (see the "output token" costs in `tracking.py`).  The encoded token stays in the context throughout the dialogue, so it doesn't have to be re-encoded on a later call.  There is no later call.  

Under current pricing models, that would reduce the dollar cost of generating $n$ turns from $O(n^2)$ to $O(n)$.  
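The quadratic growth in billed input tokens can be checked with a small calculation. This sketch assumes, purely for illustration, that every turn has the same number of tokens and that the instructions have a fixed length; the function name and the numbers are hypothetical.

```python
def input_tokens_turn_by_turn(n_turns, tokens_per_turn, instruction_tokens):
    """Total input tokens billed when each turn is generated by a separate call.

    The call for turn i resends the instructions plus the previous i-1 turns.
    """
    total = 0
    for i in range(1, n_turns + 1):
        total += instruction_tokens + (i - 1) * tokens_per_turn
    return total

# Hypothetical sizes: 50 tokens per turn, 100 tokens of instructions.
for n in (4, 8, 16):
    separate = input_tokens_turn_by_turn(n, tokens_per_turn=50, instruction_tokens=100)
    single = 100   # one call: the instructions are sent as input only once
    print(f"n={n}: separate calls bill {separate} input tokens, a single call bills {single}")
```

In closed form the per-turn approach bills $nI + t\,n(n-1)/2$ input tokens for instruction length $I$ and turn length $t$, while the number of output tokens ($nt$) is the same under both approaches.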

However, the pricing model doesn't quite reflect the computational costs.  
* ![image](handin.png) Using $O(\cdot)$ notation, what is the total number of floating-point operations needed to generate $n$ turns under each approach?  
* ![image](handin.png) Parallelism may help reduce the runtime.  Using $O(\cdot)$ notation, what is the total number of seconds needed to generate $n$ turns under each approach?  (Assume that the GPU is big enough, relative to $n$, that it can process all input tokens in parallel.)

The problem with the more efficient approach is that it gives you no way to change the instructions (the system prompt) each time we switch from Alice to Bob and back again.  You'd need to generate the whole conversation using a single set of instructions.

![image](handin.png)
Can you get this to work?  Specifically, try completing the cell below.  You don't have to use the `Agent` or `Dialogue` classes.  It's okay to just throw together something like the `complete()` function above.  Just see whether you can manage to prompt GPT-3.5 to generate a multi-turn dialogue between two characters who have different personalities and goals.  Is the quality better or worse than generating one turn at a time?  If worse, does it help to switch to GPT-4?

In [51]:
# Like `simulated_dialogue` in `simulate.py`.  However, this one is called on two
# Characters, not two Agents, and it returns a string rather than a Dialogue.

from tracking import default_client, default_model
from characters import Character
def simulated_dialogue_batch(a: Character, b: Character, turns: int = 6, *,
                             starter=True) -> str:
    """Ask the LLM to write a whole dialogue between `a` and `b` in a single call."""
    def describe(c: Character) -> str:
        # Rewrite "You ..." in the style description as "<name> ..."
        style = c.conversational_style.removeprefix("You")
        return f"The persona of {c.name}: {c.persona}. {c.name}{style} "

    prompt = f"Generate a {turns}-turn dialogue between {a.name} and {b.name}. "
    prompt += describe(a) + describe(b)
    if starter:   # optionally open with a's first conversation starter
        prompt += f'The dialogue should start with "({a.name}) {a.conversation_starters[0]}"'

    response = default_client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],   # the whole request is one user message
        model="gpt-3.5-turbo-1106", temperature=0)
    return response.choices[0].message.content

# Try it out!
print(simulated_dialogue_batch(characters.bob, characters.cara))

(Bob) Do you think it's ok to eat meat?
(Cara) Well, I personally enjoy eating meat, so yes, I think it's okay.
(Bob) But don't you think it's cruel to animals and bad for the environment?
(Cara) I understand your perspective, but I believe in sustainable and ethical farming practices.
(Bob) I respect that, but have you considered the health benefits of a vegetarian diet?
(Cara) I appreciate your concern, but I feel healthy and happy with my current diet. Let's agree to disagree on this topic.


In [29]:
simulated_dialogue(agents.bob, agents.cara)

(Bob) Do you think it's ok to eat meat?
(Cara) Yes, I believe it's perfectly okay to eat meat.
(Bob) I understand, but I personally believe that a vegetarian diet is better for both our health and the environment.
(Cara) I respect your opinion, but I prefer to make my own choices when it comes to what I eat.
(Bob) Of course, everyone is entitled to their own choices, and I can appreciate that.
(Cara) Thank you for understanding.

In [30]:
simulated_dialogue(agents.eve, agents.trollFace)

(Eve) Do you think Donald Trump was a good president?
(TrollFace) Donald Trump? The only thing he's good at is making a spectacle of himself.
(Eve) I'm not sure about that, but I've heard some people say he was good for the economy.
(TrollFace) Oh, please! Are we talking about the same economy that went through a rollercoaster of ups and downs during his presidency?
(Eve) I guess it depends on who you ask, but I've heard mixed opinions about his economic policies.
(TrollFace) Mixed opinions? That's just a nice way of saying nobody knows what they're talking about!

# Model-based evaluation

What is our goal for the argubot?  We'd like it to broaden the thinking of the (simulated) human that it is talking to.  Indeed, that's what Alice's prompt tells Alice to do.

This goal is inspired by the recent paper [Opening up Minds with Argumentative Dialogues](https://aclanthology.org/2022.findings-emnlp.335/), which collected human-human dialogues:

> In this work, we focus on argumentative dialogues that aim to open up (rather than change) people’s minds to help them become more understanding to views that are unfamiliar or in opposition to their own convictions. ... Success of the dialogue is measured as the change in the participant’s stance towards those who hold opinions different to theirs.

Arguments of this sort are not like chess or tennis games, with an actual winner.  The argubot will almost never hear a human say "You have convinced me that I was wrong."  But the argubot did a good job if the human developed **increased understanding and respect for an opposing point of view**.  

To find out whether this happened, we can use a questionnaire to ask the human what they thought after the dialogue.  For example, after Alice talks to Bob, we'll ask Bob to evaluate what he thinks of Alice's views.  Of course, that depends on his personality — Alice needs to talk to him in a way that reaches *him* (as much as possible).  We'll also ask an outside observer to evaluate whether Alice handled the conversation with Bob well.

Of course, we're still not going to use real humans.  Bob is a fake person, and so is the outside observer (whose name is Judge Wise).
Using an LLM as an eval metric is known as *model-based evaluation*.  It has pros and cons:
* It is cheaper, faster, and more replicable than hiring actual humans to do the evaluation.  
* It might give different answers than what humans would give.   

Social scientists usually refer to a metric's **reliability** (low variance) and **validity** (low bias).  So the points above say that model-based evaluation is reliable but not necessarily valid.  In general, an LLM-based metric (like any metric) needs to be validated to confirm that it really does measure what it claims to measure.  (For example, that it correlates strongly with some other measure that we already trust.)  In this homework, we'll skip this step and just pray that the metric is reasonable.
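As an illustration of what such a validation step could look like, the sketch below computes the Pearson correlation between an LLM-based score and a trusted score on the same dialogues. The two score lists are invented toy numbers, not results from this homework, and `pearson` is our own helper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)

# Made-up scores for 5 dialogues: one list from a model-based metric, one from human raters.
toy_llm_scores = [2.0, 3.5, 4.0, 1.5, 5.0]
toy_human_scores = [2.5, 3.0, 4.5, 1.0, 4.5]
print("correlation:", round(pearson(toy_llm_scores, toy_human_scores), 3))
```

A correlation near 1 would be evidence of validity; a correlation near 0 would suggest that the model-based metric is measuring something else.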

To see how this works out in practice, open up the `demo` notebook, which walks you through the evaluation protocol.  You'll see how to call the [starter code](http://cs.jhu.edu/~jason/465/hw/llm), how it talks to the LLM behind the scenes, and what it is able to accomplish. 

To help validate the metric, check that Airhead gets a low score.  (It should!)

# Reading the starter code

The `demo` notebook gave you a good high-level picture of what the starter code is doing.  So now you're probably curious about the details.  Now that you've had the view from the top, here's a good bottom-up order in which to study the code.  You don't need to understand every detail, but you will need to understand enough to call it and extend it.

* `characters.py`.  The `Character` class is short and easy.

* `dialogue.py`.  The `Dialogue` class is meant to serve as a record of a natural-language conversation among any number of humans and/or agents.  On each *turn* of the dialogue, one of the speakers says something.  

   The dialogue's sequence of turns may remind you of the sequence of messages that is sent to OpenAI's chat completions API.  But the OpenAI messages are only labeled with the 4 special roles `user`, `assistant`, `tool`, and `system`.  Those are not quite the same thing as human speakers.  And the OpenAI messages do not necessarily form a natural-language dialogue: some of the messages are dealing with instructions, few-shot prompting, tool use, and so on.  The `agents.dialogue_to_openai` function in the next module will map a `Dialogue` to a (hopefully appropriate) sequence of messages for asking the LLM to extend that dialogue.

* `agents.py`.  This module sets up the problem of automatically predicting the next turn in a dialogue, by implementing an `Agent`'s `response()` method.  The `Agent` base class also has some simple convenience methods that you should look at.  

   Some important subclasses of `Agent` are defined here as well.  However, you may want to skip over `EvaluationAgent` and come back to it only when you read `eval.py`.

* `simulate.py` makes agents talk to one another, which we'll do during evaluation.

* `argubots.py` starts to describe some useful agents.  One of them makes use of the `kialo.py` module, which gives access to a database of arguments.

* `eval.py` makes use of `simulate.simulated_dialogue` and `agents.EvaluationAgent` to evaluate an argubot.

* We also have a couple of utility modules.  These aren't about NLP; look inside if needed.  `logging_cm.py` is what enabled the context manager `with LoggingContext(...):` in the demo notebook.  `tracking.py` sets some global defaults about how to use the OpenAI API, and arranges to track how many tokens we're paying for when you call it.
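
To make the `dialogue.py` point about speakers versus roles concrete, here is a minimal sketch of one way a dialogue could be mapped to chat messages. It is not the starter code's `agents.dialogue_to_openai`: the turn representation, the function name `dialogue_to_messages`, and the choice to prefix speaker names are all assumptions made for illustration.

```python
from typing import Dict, List

# A turn is represented here as {"speaker": ..., "content": ...}.
# This representation is an assumption for the sketch; see dialogue.py
# for the real Dialogue class.
Turn = Dict[str, str]

def dialogue_to_messages(turns: List[Turn], agent_name: str,
                         system_prompt: str) -> List[Dict[str, str]]:
    """Hypothetical mapping from dialogue turns to chat messages.

    Turns spoken by `agent_name` become `assistant` messages; all other
    turns become `user` messages, prefixed with the speaker's name so
    that several human speakers remain distinguishable.
    """
    messages = [{"role": "system", "content": system_prompt}]
    for turn in turns:
        if turn["speaker"] == agent_name:
            messages.append({"role": "assistant", "content": turn["content"]})
        else:
            messages.append({"role": "user",
                             "content": f"{turn['speaker']}: {turn['content']}"})
    return messages

turns = [
    {"speaker": "Darius", "content": "Do you think it's ok to eat meat?"},
    {"speaker": "Alice", "content": "It depends on how the animal was raised."},
    {"speaker": "Darius", "content": "Why?"},
]
messages = dialogue_to_messages(turns, "Alice", "You are Alice. Reply in one sentence.")
for m in messages:
    print(m["role"], "|", m["content"])
```

Compare this with what the starter code actually does when you read `agents.py`; the real function may make different choices.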

# Similarity-based retrieval: Looking up relevant responses

Now, it is fine to prompt an LLM to generate text, but there are other methods!
There is a long history of machine learning methods that "memorize" the training data.
To make a prediction or decision at test time, they consult the stored training examples
that are most similar to the test situation.

_Similarity-based retrieval_ means that given a document $x$, you find the "most similar" documents $y \in Y$, where $Y$ is a given collection of documents.  The most common way to do this is to maximize the _cosine similarity_ $\frac{\vec{e}(x) \cdot \vec{e}(y)}{\|\vec{e}(x)\| \, \|\vec{e}(y)\|}$, where $\vec{e}(\cdot)$ is an embedding function.  (If the embeddings are normalized to unit length, this is just the dot product $\vec{e}(x) \cdot \vec{e}(y)$.)
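
As a small illustration of retrieval by cosine similarity, the sketch below uses lowercase word counts as a stand-in for $\vec{e}(\cdot)$. That stand-in, and the helper names `cosine_similarity` and `toy_embed`, are assumptions for this example; a learned embedding model would return dense vectors instead.

```python
import math
from collections import Counter

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0   # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)

def toy_embed(text):
    """Stand-in embedding: lowercase word counts (not a learned model)."""
    return Counter(text.lower().split())

x = toy_embed("windy weather in London")
docs = ["It is quite windy in London", "Hello there good man"]

# Retrieve the document whose embedding is most similar to the query's.
best = max(docs, key=lambda y: cosine_similarity(x, toy_embed(y)))
print(best)
```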

Should we use the OpenAI embedding model?  We could, but we would have to precompute $\vec{e}(y)$ for all $y \in Y$, and store all these vectors in a data structure that supports some type of fast similarity-based search (e.g., using the [FAISS](https://faiss.ai/index.html) package).  An alternative would be to upload the documents to OpenAI and let OpenAI compute and store the embeddings.  We would then use their similarity-based [retrieval tool](https://platform.openai.com/docs/assistants/overview).

A simpler and faster approach—which sometimes even works better—is to use a _bag of tokens_ embedding function: Define $\vec{e}(y)$ to be the vector in $\mathbb{R}^V$ that records the count of each type of token in a tokenized version of $y$, where $V$ is the token vocabulary.  [BM25](https://en.wikipedia.org/wiki/Okapi_BM25) is a refined variant of that idea, where the counts are adjusted in 3 ways: 

* smooth the counts
* normalize for the document length $|y|$ so that longer documents $y$ are not more likely to be retrieved
* downweight tokens that are more common in the corpus (such as ` the` or `ing`) since they provide less information about the content of the document
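
The three adjustments above can be seen in a from-scratch sketch of one common textbook form of the BM25 score. This is an illustration only: the function name `bm25_scores` is made up, the defaults `k1=1.5` and `b=0.75` are conventional choices assumed here, and a library implementation may differ in details such as how it computes the token weights.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with a basic BM25 formula."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(d) for d in tokenized_docs) / n_docs
    # Document frequency: in how many documents does each token appear?
    doc_freq = Counter(tok for d in tokenized_docs for tok in set(d))
    scores = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        # Length normalization: longer-than-average documents are penalized.
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for tok in query_tokens:
            tf = counts[tok]
            if tf == 0:
                continue
            # Tokens found in fewer documents get a larger weight; the 0.5
            # terms smooth the ratio.
            idf = math.log(1 + (n_docs - doc_freq[tok] + 0.5) / (doc_freq[tok] + 0.5))
            # Saturating count: extra occurrences help less and less.
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

corpus = [
    "Hello there good man!",
    "It is quite windy in London",
    "How is the weather today?",
]
tokenized = [doc.split(" ") for doc in corpus]
scores = bm25_scores("windy London".split(" "), tokenized)
print(scores)
```

Only the second document contains the query tokens, so it is the only one with a nonzero score.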


You might like to play with the `rank_bm25` package ([documentation](https://pypi.org/project/rank-bm25/)).  It is widely used and very easy to use.

In [22]:
from rank_bm25 import BM25Okapi as BM25_Index   # the standard BM25 method

# experiment here!  You could try the examples in the rank_bm25 documentation.
from rank_bm25 import BM25Okapi

corpus = [
    "Hello there good man!",
    "It is quite windy in London",
    "How is the weather today?"
]

tokenized_corpus = [doc.split(" ") for doc in corpus]
print(tokenized_corpus)
bm25 = BM25Okapi(tokenized_corpus)

[['Hello', 'there', 'good', 'man!'], ['It', 'is', 'quite', 'windy', 'in', 'London'], ['How', 'is', 'the', 'weather', 'today?']]


In [23]:
query = "windy London"
tokenized_query = query.split(" ")

doc_scores = bm25.get_scores(tokenized_query)
doc_scores

array([0.        , 0.93729472, 0.        ])

In [24]:
bm25.get_top_n(tokenized_query, corpus, n=1)

['It is quite windy in London']

## The Kialo corpus

How can we use similarity-based retrieval to help build an argubot?  It's largely about having the right data!

[Kialo](https://www.kialo.com) is a collaboratively edited website (like Wikipedia) for discussing political and philosophical topics.  For each topic, the contributors construct a tree of _claims_.  Each claim is a natural-language sentence (usually), and each of its children is another claim that supports it ("pro") or opposes it ("con").  For example, check out the tree rooted at the claim ["All humans should be vegan."](https://www.kialo.com/all-humans-should-be-vegan-2762).

We provide a class `Kialo` for browsing a collection of such trees.  Please read the [source code](https://www.cs.jhu.edu/~jason/465/hw-llm) in `kialo.py`.  The class constructor reads in text files that are [exported Kialo discussions](https://support.kialo.com/en/hc/exporting-a-discussion/); we have provided some in the [data directory](https://www.cs.jhu.edu/~jason/465/hw-llm/data).  The class includes a BM25 index, to be able to find claims that are relevant to a given string.

In [26]:
from kialo import Kialo

Ok, let's pull the retrieved discussions (the `.txt` files) into our data structure.

For BM25 purposes, we have to be able to turn each document (that is, each Kialo claim) into a list of string or integer tokens. 

In [27]:
from typing import List
import glob

# kialo = Kialo(glob.glob("data/*"), tokenizer=tokenizer.encode)  # using the LLM's tokenizer doesn't work here for some reason
kialo = Kialo(glob.glob("data/*"))  # use simple default tokenizer
f"This Kialo subset contains {len(kialo)} claims"

'This Kialo subset contains 6251 claims'

Let's use sampling to see what kind of stuff is in the data structure.

In [34]:
kialo.random_chain()   # just a single random claim

['Disrupting the creation of herd immunity is comparable to disregarding traffic lights. If even one person does it, it disrupts the flow of traffic and puts several people at enormous risk. It is justified that such action is curbed even at the cost of personal autonomy.']

In [35]:
kialo.random_chain(n=4)

['COVID-19 vaccines should be mandatory.',
 'Mandatory vaccines would be a huge overreach of state powers.',
 'Mandatory vaccinations can be enforced through non-state actors such as workplaces, stores and privately owned public spaces, which could require a proof of vaccination for entry.',
 'Airlines sometimes require "fitness to fly" certificate from passengers flying with them. This could be extended to include proof of immunization from COVID-19.']

### Similarity-based retrieval from the Kialo corpus

Let's try it, using BM25!

In [36]:
kialo.closest_claims("animal populations", n=10)

['Industrial agriculture can dangerously decrease animal populations.',
 'Sustainable livestock farming is not contributing to significant decreases in animal populations. Decreasing animal populations is a problem specific to industrial livestock farming.',
 'Effective vegan methods to control animal populations exist.',
 "Generally feeding animals farm-grown produce is thought to have harmful affects on both the animal and human populations of a region when we could allow nature to self-regulate its populations. Animal feeding could potentially be used to lessen the immediate impact of widespread deforestation on some species, but generally this would be drastically less efficient than choosing not to destroy their habitats in the first place and would only slow the local animal population's imminent demise.",
 'Trap, neuter, and release schemes already exist for some animal populations (such as feral cats). These schemes could be applied to former livestock living in the wild.',
 'H

We can restrict to claims for which the Kialo data structure has at least one counterargument ("con" child).

In [37]:
kialo.closest_claims("animal populations", n=10, kind='has_cons')

['Industrial agriculture can dangerously decrease animal populations.',
 'Effective vegan methods to control animal populations exist.',
 'Human-introduced species have historically devastated local wildlife populations across the world.',
 'COVID-19 has devastated prison populations, whose lives are the responsibility of the state.',
 'High demand for vegan foods may hike prices for local populations that previously depended on them.',
 'It is generally poorer countries that have expanding populations. The first world has now reached a point of stagnant population growth - even declining populations, as in the case of Japan and others. The inability of poorer countries to control their populations should not impact the lives of those in the first world. The first world having earned their luxuries and should not be denied them.',
 'Vegan populations are, on average, less likely to suffer from obesity, a major risk factor for many diseases and health problems.',
 'Humans, as apex preda

In [38]:
c = _[0]    # first claim above
print("Parent claim:\n\t" + str(kialo.parents[c]))
print("Claim:\n\t" + c)
print('\n\t* '.join(["Pro children:"] + kialo.pros[c]))
print('\n\t* '.join(["Con children:"] + kialo.cons[c]))

Parent claim:
	In a vegan world, fewer species would be at risk of extinction.
Claim:
	Industrial agriculture can dangerously decrease animal populations.
Pro children:
	* The fishing industry is especially deleterious to the ocean's biota due to overfishing and the disruption of the natural ecosystem.
	* Up to 100,000 species go extinct annually, largely due to the environmental effects of animal agriculture.
Con children:
	* Sustainable livestock farming is not contributing to significant decreases in animal populations. Decreasing animal populations is a problem specific to industrial livestock farming.


### Does BM25 really work?

![image](handin.png)
Unfortunately, we see that `"animal population"` gives quite different results from `"animal populations"`.  Why is that and how would you fix it?  

Also, both queries seem to retrieve some claims that are talking about human populations, not animal populations.  Why is that and how would you fix it?

In [39]:
kialo.closest_claims("animal population",10)

['As long as our ability to produce both animal feed crops and food crops for our human population are not exceeded, this point is irrelevant.',
 "36% of the calories produced by the world's crops are being used for animal feed, of which only 12% then turn into animal products that can be eaten by the human population. That is a waste of 24% of the world's crops.",
 'The claim that "most of the cultural shift and loss is due to mostly vegan cultures turning to animal products" is completely unfounded, and the Brokpa people which you cited are an outlier as a group that has a population of less than 70k people. Worldwide the population of vegan people has only increased.',
 "Developed nations are fueling the 3rd world and underdeveloped nation's population boom by exporting/donating food to areas that cannot sustain their current population.",
 'This argument assumes that sentience is the only objection to the consumption of animal products, failing to address the issues involved with t

## A retrieval bot (Akiko)

The starter code defines a simple argubot named Akiko (defined in `argubots.py`) that doesn't use an LLM at all.  It simply finds a Kialo claim that is similar to what the human just said, and responds with one of the Kialo counterarguments to that claim.

You already watched Akiko argue with Darius in the `demo` notebook.  If you look at the log messages, you'll see the claims that Akiko retrieved, as well as the LLM calls that Darius made.  

You can talk to Akiko yourself now.  (Remember that Akiko only knows about subjects that it read about in the [`data` directory](https://www.cs.jhu.edu/~jason/465/hw-llm/data/).  If you want to talk about something else, you can add more conversations from [kialo.com](https://www.kialo.com); see the [LICENSE](https://www.cs.jhu.edu/~jason/465/hw-llm/data/LICENSE) file.)


In [41]:
from logging_cm import LoggingContext
with LoggingContext("agents", "INFO"):   # temporarily increase logging level
    argubots.akiko.converse()
    




Say something to Akiko:  ji


(pinchun) ji
(Akiko) Justification is not a question of amount. If something is wrong, it is wrong regardless of the quantity in which it occurs: murdering one person is wrong, as is murdering five people.


Say something to Akiko:  


## Making your own retrieval bot (Akiki)

As you can see when talking to Akiko yourself, Akiko does poorly when responding to a short or vague dialogue turn (like "Yes"), because the "closest claim" in Kialo may be about a totally different subject.  Akiko does much better at responding to a long and specific statement.  

So try implementing a new argubot, called Akiki, that is very much like Akiko but does a better job of staying on topic in such cases.  It should be able to **look at more of the dialogue** than the most recent turn.  But the most recent dialogue turn should still be "more important" than earlier turns.  

The details are up to you.  Here are a few things you could try:
* include earlier dialogue turns in the BM25 query only if the BM25 similarity is too low without them
* weight more recent turns more heavily in the BM25 query (how can you arrange that?)
* treat the human's earlier turns differently from Akiki's own previous turns
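
As one possible starting point for the second idea, the sketch below builds a query in which tokens from recent turns are repeated more often. It rests on two assumptions worth checking: that your BM25 implementation adds a term to the score for each query token in turn (so repeating a token raises its weight), and that whitespace splitting is an acceptable stand-in for the real tokenizer. The function `build_weighted_query` is hypothetical, not part of the starter code.

```python
def build_weighted_query(turns, max_turns=3):
    """Return a token list in which recent turns are repeated more often.

    `turns` is a list of strings, oldest first. The most recent turn is
    repeated `max_turns` times, the one before it `max_turns - 1` times,
    and so on; older turns are dropped.
    """
    recent = turns[-max_turns:]
    query_tokens = []
    for age, text in enumerate(reversed(recent)):   # age 0 = most recent turn
        weight = max_turns - age
        query_tokens.extend(text.lower().split() * weight)
    return query_tokens

history = ["Vaccines should be mandatory", "That limits personal freedom", "Yes"]
query = build_weighted_query(history)
print(query)
```

Linear weights are just one choice; you could also decay the weights faster, or weight the human's turns and the bot's own turns differently.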

![image](handin.png)
Implement your new bot in `argubots.py`, and adjust it until `argubots.akiki.converse()` seems to do a better job of answering your short turns, compared to `argubots.akiko.converse()`.  Make sure it still gives appropriate responses to long turns, too.  Give some examples in the notebook of what worked well and badly, with discussion.

In [42]:
from logging_cm import LoggingContext
with LoggingContext("agents", "INFO"):   # temporarily increase logging level
    argubots.akiki.converse()
    




Say something to Akiki:  iii


(pinchun) iii
(Akiki) Joe Biden is better than Donald Trump.


Say something to Akiki:  ji


(pinchun) ji
(Akiki) Private companies make COVID-19 lateral flow tests.


Say something to Akiki:  


### Evaluating Akiki

![image](handin.png)
Finally, do a more formal evaluation to verify whether Akiki really does better than Akiko on this dimension.  This is a way to check that you're not just fooling yourself.  

1. Make a new `Agent` called "Shorty" that often (but not always) gives short responses.  
    * Shorty's conversation starters should be on topics that Kialo knows about.  
    * Shorty could be a pure `LLMAgent` such as a `CharacterAgent` with a particular `conversational_style`.  Or it could use a mixed strategy of calling the LLM on some turns and not others.
2. Generate several *Akiko*-Shorty dialogues and several *Akiki*-Shorty dialogues, using `simulated_dialogue`.
3. Evaluate each of those dialogues by asking Judge Wise **how well the argubot stayed on topic**.  You should write this prompt carefully so that Judge Wise gives meaningful scores.  (Before you do this evaluation step, adjust the prompt until it seems to work well on a small subset of the dialogues.  Otherwise Judge Wise won't be so wise!)  
4. Compare Akiko and Akiki's mean scores. Ideally, also compute a 95% confidence interval on the difference of means, using [this calculator](https://www.statskingdom.com/difference-confidence-interval-calculator.html).

You can do all those steps in the notebook, writing _ad hoc_ code.  You don't have to write general-purpose methods or classes.
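
If you would rather compute the interval in code than use the calculator, here is a rough sketch. It uses a normal approximation, which is an assumption made to keep the example within the standard library; for small samples a $t$-based interval (which a calculator may well use) comes out slightly wider. The input numbers are made up for illustration.

```python
import math
from statistics import NormalDist

def diff_of_means_ci(mean1, sd1, n1, mean2, sd2, n2, confidence=0.95):
    """Approximate confidence interval for mean1 - mean2 (unpooled variances)."""
    diff = mean1 - mean2
    std_err = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # about 1.96 for 95%
    return diff - z * std_err, diff + z * std_err

# Made-up summary statistics for two argubots, 20 dialogues each.
low, high = diff_of_means_ci(mean1=3.4, sd1=0.5, n1=20,
                             mean2=3.0, sd2=0.6, n2=20)
print(f"95% CI for the difference: [{low:.3f}, {high:.3f}]")
```

If the interval excludes 0, the difference in means is unlikely to be due to sampling noise alone.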

In [43]:
all_topics=kialo.claims['all']
all_topics[:10]

['Joe Biden is better than Donald Trump.',
 'Donald Trump possesses personal qualities that would render him more suitable for the presidency than Joe Biden.',
 'During one of his court cases, Donald Trump has repeatedly violated gag orders issued to him.',
 'Joe Biden has a demeanour and outlook that is more suitable to the office of President than that of Donald Trump.',
 'Joe Biden makes more inaccurate claims than Trump.',
 'The reason why Biden makes less factually inaccurate claims is that he makes more statements that are factual and vetted.',
 "According to PolitiFact, Trump's statements they have evaluated are 26% true, mostly true or half true. Biden's are 60% true, mostly true or half true.",
 'Donald Trump has a history of problematic behavior.',
 'Joe Biden has exhibited racist beliefs and attitudes throughout his political career.',
 'Trump has been accused of sexual harassment.']

In [52]:
all_topics=kialo.claims['all']
shorty_char = Character("Shorty", ["English"], 
                "a direct and honest person",
                conversational_style="You often (but not always) give short responses. Short As in 1 5-word sentence.", 
                conversation_starters=all_topics[:10])
shorty = CharacterAgent(shorty_char)

In [53]:
with_akiko=simulated_dialogue(argubots.akiko, shorty, 20)
with_akiko

(Akiko) Donald Trump has a history of problematic behavior.
(Shorty) It's not a secret, huh?
(Akiko) Many countries currently enforce drafts or mandatory military service, an extremely risky endeavor.
(Shorty) Yeah, it's a tough situation.
(Akiko) President Trump's decision to pull troops out of Syria was justified because he wanted to put an end to what he called the US role as global policeman.
(Shorty) I can see where he's coming from.
(Akiko) Biden does not have poor public speaking skills; he has a speech disorder that prevents him from speaking clearly.
(Shorty) That's important to understand, yeah.
(Akiko) This is true of any lifestyle choice that a parent decides for their children, including feeding them a diet that contains animal products.
(Shorty) Yeah, it's a personal decision.
(Akiko) Biden's withdrawal from Afghanistan was hasty and mismanaged; the manner in which the withdrawal was executed severely limits its appeal to Trump supporters.
(Shorty) It was definitely a mes

In [54]:
with_akiki=simulated_dialogue(argubots.akiki,shorty,  20)
with_akiki

(Akiki) Donald Trump possesses personal qualities that would render him more suitable for the presidency than Joe Biden.
(Shorty) I disagree. Biden is more suitable.
(Akiki) There is no evidence to indicate that Joe Biden demonstrates worse physical or mental fitness than Donald Trump.
(Shorty) I disagree. Trump is more fit.
(Akiki) This is irrelevant. Public opinion is often wrong, especially about something that cannot be known without thorough medical evaluation.
(Shorty) Public opinion is often biased.
(Akiki) "Good" is subjective. Different people have different definitions of what is "good" compared to others.
(Shorty) True, "good" can vary widely.
(Akiki) While it is true that it has been criticised on both the left and right, the reasons are wildly different. To conflate and group them together is a grave disservice to both arguments and a disingenuous way to launch an attack.
(Shorty) Criticism is not always equal.
(Akiki) Joe Biden was competing with Sanders for the Democrati

In [55]:
import characters
from agents import EvaluationAgent
import eval
akiko_eval=eval.eval_by_observer(eval.default_judge, "Akiko", with_akiko)
rich.print(akiko_eval)

In [56]:
import characters
from agents import EvaluationAgent
import eval
akiki_eval=eval.eval_by_observer(eval.default_judge, "Akiki", with_akiki)
rich.print(akiki_eval)

In [298]:
akiko_eval = eval.eval_on_characters(argubots.akiko)  

In [299]:
akiki_eval = eval.eval_on_characters(argubots.akiki)  

In [300]:
from eval import saved_evalsum, saved_dialogues

rich.print(saved_evalsum['Akiko'].mean())   # means
rich.print(saved_evalsum['Akiko'].sd())     # standard deviations
rich.print(saved_evalsum['Akiki'].mean())   # means
rich.print(saved_evalsum['Akiki'].sd())     # standard deviations

In [301]:
# n
akiko_eval.counts
akiki_eval.counts

Counter({'engaged': 20,
         'informed': 20,
         'intelligent': 20,
         'moral': 20,
         'skilled': 20,
         'TOTAL': 20})

### skilled (not yet filled in)

| argubot | sample size | mean | std |
| --- | --- | --- | --- |
| Akiko | 20 | 3 | 0 |
| Akiki | 20 | 3 | 0 |

| Parameter | Value |
| --- | --- |
| Mean difference confidence interval | [NaN, NaN] |

### moral

| argubot | sample size | mean | std |
| --- | --- | --- | --- |
| Akiko | 20 | 3 | 0 |
| Akiki | 20 | 3 | 0 |

| Parameter | Value |
| --- | --- |
| Mean difference confidence interval | [NaN, NaN] |

### intelligent

| argubot | sample size | mean | std |
| --- | --- | --- | --- |
| Akiko | 20 | 2.95 | 0.3940 |
| Akiki | 20 | 2.9 | 0.3077 |

| Parameter | Value |
| --- | --- |
| Mean difference confidence interval | [-0.1768, 0.2768] |

### engaged

| argubot | sample size | mean | std |
| --- | --- | --- | --- |
| Akiko | 20 | 3.65 | 0.4893 |
| Akiki | 20 | 3.7 | 0.4701 |

| Parameter | Value |
| --- | --- |
| Mean difference confidence interval | [-0.3572, 0.2572] |

### informed

| argubot | sample size | mean | std |
| --- | --- | --- | --- |
| Akiko | 20 | 3.15 | 0.3663 |
| Akiki | 20 | 3.05 | 0.2236 |

| Parameter | Value |
| --- | --- |
| Mean difference confidence interval | [-0.09563, 0.2956] |

## Retrieval-augmented generation (Aragorn)

The real weaknesses of Akiko and Akiki:
* They can only make statements that are already in Kialo.  
* They don't respond to the user's actual statement, but to a single retrieved Kialo claim that may not accurately reflect the user's position (it just overlaps in words).

But we also have access to an LLM, which is able to generate new, contextually appropriate text (as Alice does).

In this section, you will create an argubot named [Aragorn](https://tolkiengateway.net/wiki/Riddle_of_Strider), who is basically the love child of Akiki and Alice, combining the high-quality specific content of Kialo with the broad competence of an LLM.  

The RAG in aRAGorn's name stands for **retrieval-augmented generation**.  Aragorn is an agent that will take 3 steps to compute its `Agent.response()`:

1. **Query formation step**: Ask the LLM what claim should be responded to.  For
   example, consider the following dialogue:
    > ...
    > Aragorn: Fortunately, the vaccine was developed in record time.
    > Human: Sounds fishy.

    "Sounds fishy" is exactly the kind of statement that Akiko had trouble using
    as a Kialo query.  But Aragorn shows the *whole dialogue* to the LLM, and
    asks the LLM what the human's *last turn* was really saying or implying, in
    that context. The LLM answers with a much longer statement:

    > Human [paraphrased]: A vaccine that was developed very quickly cannot be trusted.
    > If its developers are claiming that it is safe and effective, I question their motives.

    This paraphrase makes an explicit claim and can be better understood without the context.
    It also contains many more word types, which makes it more likely that BM25 will be able
    to find a Kialo claim with a nontrivial number of those types. 

2. **Retrieval step**: Look up claims in Kialo that are similar to the explicit
   claim.  Create a short "document" that describes some of those claims and
   their neighbors on Kialo.

3. **Retrieval-augmented generation**: Prompt the LLM to generate the response
   (like any `LLMAgent`).  But include the new document somewhere in the LLM
   prompt, in a way that it influences the response. 
   
   Thus, the LLM can respond in a way that is appropriate to the dialogue but
   also draws on the curated information that was retrieved in Kialo.  After
   all, it is a Transformer and can attend to both!
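
The three steps above can be sketched as a single function. This is a skeleton under stated assumptions, not the intended solution: `rag_response`, `ask_llm`, and `retrieve` are hypothetical names, the prompts are placeholders you would want to refine, and the stub functions at the bottom stand in for a real LLM call and a real Kialo lookup so that the control flow can be run on its own.

```python
from typing import Callable, List

def rag_response(dialogue: List[str],
                 ask_llm: Callable[[str], str],
                 retrieve: Callable[[str], str]) -> str:
    """Three-step retrieval-augmented response (hypothetical helper).

    `dialogue` is a list of "Speaker: text" lines, `ask_llm` maps a prompt
    to the LLM's reply, and `retrieve` maps a claim to a short document.
    """
    transcript = "\n".join(dialogue)

    # Step 1: query formation. Ask the LLM to make the last turn explicit.
    claim = ask_llm(
        "Here is a dialogue:\n" + transcript +
        "\nRestate what the last speaker is claiming or implying, "
        "as a self-contained statement.")

    # Step 2: retrieval. Look up related material for that explicit claim.
    document = retrieve(claim)

    # Step 3: generation. Show the LLM both the dialogue and the document.
    return ask_llm(
        "Here is a dialogue:\n" + transcript +
        "\nBackground that may be relevant:\n" + document +
        "\nWrite the next turn of the dialogue.")

# Stubs so the skeleton runs without an API key or the Kialo data.
calls = []

def fake_llm(prompt):
    calls.append(prompt)
    return f"reply {len(calls)}"

def fake_retrieve(claim):
    return f"notes about: {claim}"

result = rag_response(["Aragorn: The vaccine was developed in record time.",
                       "Human: Sounds fishy."], fake_llm, fake_retrieve)
print(result)
```

In a real agent, `ask_llm` would wrap a chat completion call and `retrieve` would build a document from Kialo claims and their neighbors.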

Here's an example of the kind of document you might create at the retrieval step, though it may be possible
to do better than this:

In [303]:
# refers to global `kialo` as defined above
def kialo_responses(s: str) -> str:
    c = kialo.closest_claims(s, kind='has_cons')[0]
    result = f'One possibly related claim from the Kialo debate website:\n\t"{c}"'
    if kialo.pros[c]:
        result += '\n' + '\n\t* '.join(["Some arguments from other Kialo users in favor of that claim:"] + kialo.pros[c])
    if kialo.cons[c]:
        result += '\n' + '\n\t* '.join(["Some arguments from other Kialo users against that claim:"] + kialo.cons[c])
    return result
        
print(kialo_responses("Animal flesh is yucky to think about, yet delicious."))

One possibly related claim from the Kialo debate website:
	"So many people are worried about animals but don't even think twice when walking by a homeless person on the streets. It's preposterous. How about we worry about our own kind first and then start talking about animals."
Some arguments from other Kialo users against that claim:
	* This implies that caring for animals or caring for people is a binary choice. It isn't. There are those who are well placed and willing to care for people and those who prefer to serve the animal kingdom. As a species we don't just have one idea at a time and follow that to conclusion before we pursue another. It benefits all if humans divide their attentions between various issues and problems we face.
	* Humans have freedom of choice to some extent, animals subdued by humans don't. The very intention of help urges it to go where is most needed. And so far never was any biggest, flagrant and needless cruelty and slaughter as that towards industrial f

![image](handin.png)
You should implement Aragorn in `argubots.py`, just as you did for Akiki.  It should probably be an instance `aragorn` of a new class `RAGAgent` that is a subclass of `Agent` or `LLMAgent`.

In [342]:
from logging_cm import LoggingContext
with LoggingContext("agents", "INFO"):   # temporarily increase logging level
    argubots.aragorn.converse()
    




Say something to Aragorn:  hi


(pinchun) hi
(Aragorn) Hey pinchun! Thanks for checking in. I'm doing okay, just dealing with some digestive issues lately. I've been researching about how certain foods can affect my gut health, and I came across some interesting information about the impact of nuts, grains, and fibrous vegetation on people with digestive problems like Crohn's disease. It's been really helpful to learn about how preparing vegetables in certain ways can make them easier to digest. I appreciate your offer to help, and I'll definitely keep you in mind if I need to chat about it. Thanks again for reaching out.


Say something to Aragorn:  


### Evaluating Aragorn

![image](handin.png)
Compare Alice, Akiki, and Aragorn in the notebook, using the evaluation scheme and devset that were illustrated in `demo.ipynb`.  In other words, use `eval.eval_on_characters`.

Who does best?  What are the differences in the subscores and comments?  Does it matter which character you're evaluating on? Maybe the different characters expose the bots' various strengths and weaknesses.

Try to figure out how to improve Aragorn's score.  Can you beat Alice?

Also, try evaluating them in the same way that you evaluated Akiki.  In other words, have them talk to Shorty and ask Judge Wise whether they were able to stay on topic.  This is where Aragorn should really shine, thanks to its ability to paraphrase Shorty's short utterances.



In [403]:
aragorn_eval = eval.eval_on_characters(argubots.aragorn)  

In [404]:
alice_eval = eval.eval_on_characters(argubots.alice)  

In [411]:
## New
from eval import saved_evalsum, saved_dialogues

saved_dialogues['Aragorn']

[((Aragorn) Do you think it's ok to eat meat?
  (Bob) I believe that choosing a vegetarian diet is a compassionate and healthy choice for individuals and the planet.
  (Aragorn) I appreciate your perspective, Bob. I have always believed in living in harmony with nature and minimizing harm to other living beings. I think it's important to consider the ethical and environmental implications of our food choices. I will definitely take your points into consideration as I continue to evaluate my own diet and its impact on the world around me. Thank you for sharing your thoughts.
  (Bob) You're welcome! I'm glad to hear that you are considering the ethical and environmental impact of your diet, and I'm here to support you in making positive choices for the planet and all living beings.
  (Aragorn) Thank you, Bob. I agree that it's important to consider the ethical and environmental implications of our food choices. I believe in living in harmony with nature and minimizing harm to other livin

In [405]:
from eval import saved_evalsum, saved_dialogues

rich.print(saved_evalsum['Aragorn'].mean())   # means
rich.print(saved_evalsum['Aragorn'].sd())
rich.print(saved_evalsum['Alice'].mean())   # means
rich.print(saved_evalsum['Alice'].sd())

In [356]:
with_ara=simulated_dialogue(argubots.aragorn,argubots.alice,  5)
with_ara

##Response: I completely agree, Alice. The selection of a Supreme Court justice is a crucial decision that should be based on their qualifications, experience, and commitment to upholding the Constitution. It's important to consider their track record and judicial philosophy to ensure they will approach future cases with impartiality and fairness. Diversity and representation on the Court are also essential to consider, as it allows for a wide range of perspectives to be taken into account. I believe that the integrity and ethical values of a justice should be at the forefront of the selection process, as it will have a lasting impact on our country.
##Response: Alice, I appreciate your perspective on the importance of diversity and representation on the Supreme Court. I agree that having a range of viewpoints can lead to a more comprehensive approach to interpreting and applying the law. However, I also believe that expanding the Court should be carefully considered, as it could have 

(Aragorn) Biden has consistently refused to denounce claims that he, and his party, would rule to pack the Supreme Court upon winning the election.
(Alice) What do you think is the most important factor in choosing a Supreme Court justice?
(Aragorn) I completely agree, Alice. The selection of a Supreme Court justice is a crucial decision that should be based on their qualifications, experience, and commitment to upholding the Constitution. It's important to consider their track record and judicial philosophy to ensure they will approach future cases with impartiality and fairness. Diversity and representation on the Court are also essential to consider, as it allows for a wide range of perspectives to be taken into account. I believe that the integrity and ethical values of a justice should be at the forefront of the selection process, as it will have a lasting impact on our country.
(Alice) While a justice's qualifications and commitment to upholding the Constitution are crucial, some

In [339]:
ara_eval=eval.eval_by_observer(eval.default_judge, "Aragorn", with_ara)
rich.print(ara_eval)

# Extra Credit (Awsom)

We didn't require this part this year because the homework is going out late.

![image](handinec.png)
Add another LLM-based argubot to `argubots.py`.  
Call it Awsom.  Try to make it get the best score, according to `eval.eval_on_characters`.
Explain what you did and discuss what you found.

(This corresponds to the `--awesome` flag on earlier assignments, but naming the character "Awesome" might bias the evaluation system, so we changed the spelling!)

If the idea was interesting and you implemented it correctly and well, it's okay if it turns out not to help the score.  Many good ideas don't work.  That's why you need to keep finding and trying new good ideas.  (Sometimes they do help, but in a way that is not picked up by the scoring metric.)

You may want to use Aragorn or Alice as your starting point.
Then see if you can find tricks that will get a more awesome score for Awsom.
How you choose to do that is up to you, but some ideas are below.

(Reminder: **Don't change evaluation.**  Just build a better argubot.)

In [57]:
import dotenv
import openai
from tracking import track_usage, read_usage
import tracking

dotenv.load_dotenv(override=True)      # define environment variables from .env
client = track_usage(openai.OpenAI())  # create a client, modified to record its usage to a local file 

# Or use our tracking module to do the above for you, like this:

# from tracking import default_client
# client = default_client

In [58]:
s="text aw: ttt aw: tr: tr:ttt"
s.split('tr:',1)

['text aw: ttt aw: ', ' tr:ttt']

In [474]:
import argubots
import agents
with_aws=simulated_dialogue(argubots.awsom,agents.trollFace,  10)
with_aws
#simulated_dialogue(agents.eve, agents.trollFace)

Claim President Trump, like most politicians, was not a scientist, and like him, very few politicians have a reasonable understanding of climate change.
old pro Climate science is very hard to understand and integrate into politics.
new pro One argument supporting this statement is that politicians, including President Trump, are typically not trained or educated in scientific fields such as climate science. Their expertise and knowledge lie in areas such as law, economics, or public policy. Therefore, it is unreasonable to expect them to have a deep understanding of complex scientific concepts such as climate change.

Furthermore, politicians often rely on the advice and information provided by experts in the field when making decisions about climate change. They may consult with scientists, researchers, and other knowledgeable individuals to inform their policies and positions on the issue. This is a common practice in policymaking, as it allows decision-makers to benefit from the ex

(Awsom) Do you think Joe Biden has been a good president?
(TrollFace) Oh, I'm sure he's doing a fantastic job, just like all politicians do.
(Awsom) It seems like you have some skepticism about politicians and their effectiveness. I understand where you're coming from, but it's important to consider that not all politicians are the same. Some may have a better understanding of certain issues than others. For example, there are politicians who have a reasonable understanding of climate change and its implications. It's important to consider a variety of perspectives when evaluating the effectiveness of politicians.
(TrollFace) Oh, absolutely, I'm sure every politician has a unique and completely flawless understanding of every issue, especially climate change, because politicians are notorious for their deep knowledge and genuine concern for all matters.
(Awsom) I understand your skepticism about politicians and their understanding of important issues like climate change. It's true that

In [59]:
import argubots
import agents

awsom_eval = eval.eval_on_characters(argubots.awsom)  

In [60]:
from eval import saved_evalsum, saved_dialogues

rich.print(saved_evalsum['Awsom'].mean())   # means
rich.print(saved_evalsum['Awsom'].sd())

In [62]:
rich.print(saved_dialogues)

## [Extra credit] Prompt engineering

A good first thing to do is to experiment with Alice's prompt.  
The wording and level of detail in the prompt can be quite important.
Often, NLP engineers will change their prompt to try to address 
problems that they've seen in the responses.

Because it's "just" text editing, this won't get too much extra credit unless you make a real discovery.
But it requires intelligence, care, experimentation, and alertness to the language of the responses and the
language of the prompts.  And you'll develop some intuitions about what helps and what doesn't.
It is certainly worthwhile.

Of course, people have tried to develop methods to search for good prompts automatically, or semi-automatically with human guidance.

If you try this, what worked well for you?

## [Extra credit] Chain of thought / Planning

The evaluation functions in `eval.py` asked each `EvaluationAgent` a "warmup question" before continuing with the real question.  That is an example of chain-of-thought (CoT) reasoning, where the LLM is encouraged to talk through the problem for a few sentences before giving the answer.  CoT sometimes improves performance.

Instead of using one prompt, could you help an `LLMAgent` argubot (like Alice) do better by having it think aloud before it gives an answer?  For example, each time the human speaks, your argubot (Awsom) could prompt the LLM to think about the human's ideas/motivations/personality, and to come up with a plan for how to open the human's mind.

For example, you might structure this as a `Dialogue` among three participants, like this:
> Awsom (to Eve): Do you think COVID vaccines should be mandatory?
>
> Eve: Have you ever gotten vaccinated yourself?
>
> Awsom (private thought): I don't know Eve's opinions yet, so I can't push back.  Eve might be avoiding my question because she doesn't want to get into a political argument.  So let's see if we can get her to express an opinion on something less political.  Maybe something more personal ... like whether vaccines are scary.
>
> Awsom (to Eve): In fact I have, and so have millions of others. But some people seem scared about getting the vaccine.  

One way to trigger this kind of analysis is to present a `Dialogue.script()` to Awsom (or to an observer), and ask an open-ended question about it.  Or you could ask a series of more specific questions.  That is basically what `eval_by_participant` and `eval_by_observer` do.  But here the argubot itself is doing it, rather than the evaluation framework.

Eve would be shown only the turns that are spoken aloud.  However, when analyzing and responding, Awsom would get to see Awsom's own private thoughts as well.
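
The public/private split described above can be sketched as a simple filter. This is only an illustration under stated assumptions: it represents turns as plain dicts with a hypothetical `private` flag, and the helper name `visible_turns` is invented here. It does not use the homework's `Dialogue` class, so you would need to adapt the idea to that class yourself.

```python
# Hypothetical representation: each turn is a dict with a speaker, the text,
# and a flag saying whether it is a private thought. This is NOT the
# homework's Dialogue class, just an illustration of the filtering idea.

def visible_turns(turns, viewer):
    """Return the turns that `viewer` may see: every spoken turn,
    plus the viewer's own private thoughts."""
    return [t for t in turns
            if not t["private"] or t["speaker"] == viewer]

turns = [
    {"speaker": "Awsom", "private": False,
     "content": "Do you think COVID vaccines should be mandatory?"},
    {"speaker": "Eve", "private": False,
     "content": "Have you ever gotten vaccinated yourself?"},
    {"speaker": "Awsom", "private": True,
     "content": "Eve might be avoiding my question. Try something more personal."},
    {"speaker": "Awsom", "private": False,
     "content": "In fact I have, and so have millions of others."},
]

eve_view = visible_turns(turns, "Eve")      # spoken turns only
awsom_view = visible_turns(turns, "Awsom")  # spoken turns plus Awsom's thought
```

Awsom would build its next prompt from `awsom_view`, while the turns handed to Eve would come from `eve_view`.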


## [Extra credit] Dense embeddings

BM25 uses sparse embeddings — a document's embedding vector is mostly zeroes, since the non-zero coordinates correspond to the specific words (tokens) that appear in the document.

But perhaps dense embeddings of documents would improve Aragorn by reading the text and abstracting away from the words, in a way that actually cares about word order.  So, try it!

How?  As mentioned earlier in this notebook, you could compute the embeddings yourself and put them in a FAISS index. Or you could figure out how to use OpenAI's [knowledge retrieval](https://platform.openai.com/docs/assistants/tools/knowledge-retrieval) API.
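
To make the retrieval step concrete, here is a minimal sketch of nearest-neighbor search by cosine similarity. It assumes the embeddings have already been computed. The 3-dimensional vectors and document names below are made up for illustration, and the helpers `cosine` and `nearest` are hypothetical names. A real system would get its vectors from an embedding model and would probably use an index such as FAISS rather than this brute-force loop.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(query_vec, doc_vecs, k=1):
    """Return the k document names whose vectors are most similar to the query."""
    ranked = sorted(doc_vecs,
                    key=lambda name: cosine(query_vec, doc_vecs[name]),
                    reverse=True)
    return ranked[:k]

# Toy embeddings, invented for this example.
toy_docs = {
    "vaccines_safe":   [0.9, 0.1, 0.0],
    "climate_policy":  [0.0, 0.2, 0.9],
    "vaccines_rushed": [0.8, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]
top2 = nearest(query, toy_docs, k=2)
```

Unlike BM25, nothing here depends on the query sharing exact words with the document. All of that work is pushed into whatever model produces the vectors.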

## [Extra credit] Few-shot prompting

In this homework, an agent often prompted a language model only with instructions.  Can you find a place where giving a few _examples_ would also improve performance?  You will have to write the examples, and you will have to add them to the sequence of messages that your agent sends to the OpenAI API.  See the sentence-reversal illustration earlier in this notebook.

One good opportunity is in the query formation step of RAG.  This is a tricky task.  The LLM is supposed to state the user's implicit claim in a form that looks like a Kialo claim (or, more precisely, a form that will work well as a Kialo query).  It probably doesn't know what Kialo claims look like.  So you could show it by way of example.  This would also show it what you mean by the user's "implicit claim."
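
As a rough sketch of what that could look like, the helper below builds a chat message list in which each example appears as a user/assistant pair before the real input. The function name `few_shot_messages`, the instruction wording, and the example pairs are all invented for illustration. In particular, the example claims are not taken from Kialo, so you would want to replace them with claims written in the style of the actual data.

```python
def few_shot_messages(instructions, examples, user_turn):
    """Build a chat message list: system instructions, then each example
    as a user/assistant pair, then the real input."""
    messages = [{"role": "system", "content": instructions}]
    for example_input, example_output in examples:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    messages.append({"role": "user", "content": user_turn})
    return messages

# Invented examples for illustration only; they are not taken from Kialo.
examples = [
    ("I just don't trust something they cooked up in under a year.",
     "A vaccine that was developed very quickly may not be safe."),
    ("Why should my taxes pay for someone else's degree?",
     "College tuition should not be publicly funded."),
]
messages = few_shot_messages(
    "Restate the user's implicit claim as a short, self-contained claim.",
    examples,
    "Eating meat is natural, lions do it too.",
)
```

The resulting `messages` list has the shape that the chat completions interface expects, so it could be passed along wherever your agent currently sends its instruction-only prompt.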


## [Extra credit] Using tools in the approved way

Aragorn's step 1 (query formation) is basically getting the LLM to generate a function call like
```
kialo_thoughts("A vaccine that was developed very quickly ...")
```
which Aragorn will execute at step 2 (retrieval), sending the results back to the LLM as part of step 3.

In this context, `kialo_thoughts` is an example of a **tool** (that is, a function) that the
LLM can or must use before it gives its response.

The tool is _not_ something that runs on the LLM server.  It is written by you
in Python and executed by you.  The function call above, including the text `"A
vaccine that was ..."`, is the part that is generated by the LLM.

The OpenAI API has [special support](https://cookbook.openai.com/examples/how_to_call_functions_with_chat_models) for calling the LLM in a way that will _allow_ it to generate a tool call ([tools](https://platform.openai.com/docs/api-reference/chat/create#chat-create-tools)) or _force_ it to do so ([tool_choice](https://platform.openai.com/docs/api-reference/chat/create#chat-create-tool_choice)).  You can then send the tool's result back to the LLM [as part of your message sequence](https://platform.openai.com/docs/api-reference/chat/create#chat-create-messages).

So, you could modify Aragorn to use tools properly.  Maybe that will help, simply because the LLM was trained on message sequences that included tool use.  It should know to pay attention to the tool portions of the prompt when they are relevant, and ignore them when they are not.

The `client.chat.completions.create()` method would need to be told about the tool by using the `tools` keyword argument, with a value something like the one below.

If `d` is a `Dialogue`, you should be able to call `d.response()` with the `tools` keyword argument.  This will be passed on to `client.chat.completions.create()` as desired.

In [None]:
tools = [
    {
        "type": "function",
        "function": {
            "name": "kialo_thoughts",
            "description": "Given a claim by the user, find a similar claim on the Kialo website and return its pro and con responses",
            "parameters": {
                "type": "object",
                "properties": {
                    "search_topic": {
                        "type": "string",
                        "description": "A claim that was made explicitly or implicitly by the user.",
                    },
                },
                "required": ["search_topic"],
            },
        }
    }]
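
Once the LLM has produced a tool call, your own code has to run the function and package the result as a message with role `"tool"`. Here is a minimal sketch of that step. The `kialo_thoughts` below is only a stand-in that returns a fixed string, and `run_tool_call` and `TOOL_FUNCTIONS` are hypothetical names. The name, arguments, and id are hard-coded so that the sketch runs without calling the API.

```python
import json

def kialo_thoughts(search_topic):
    # Stand-in for the real retrieval step; it just echoes the topic.
    return f"(pro and con claims related to: {search_topic})"

# Maps tool names that the LLM may request to local Python functions.
TOOL_FUNCTIONS = {"kialo_thoughts": kialo_thoughts}

def run_tool_call(name, arguments_json, tool_call_id):
    """Execute one tool call locally and build the 'tool' message to send back."""
    kwargs = json.loads(arguments_json)   # the LLM supplies arguments as a JSON string
    result = TOOL_FUNCTIONS[name](**kwargs)
    return {"role": "tool", "tool_call_id": tool_call_id, "content": result}

# With the real API, these three values would come from an entry of
# response.choices[0].message.tool_calls (its .function.name,
# .function.arguments, and .id). Here they are hard-coded.
tool_message = run_tool_call(
    "kialo_thoughts",
    '{"search_topic": "Vaccines should be mandatory."}',
    "call_0",
)
```

In a real exchange you would append the assistant message containing the tool call, then `tool_message`, to the message sequence before asking the LLM for its final response.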

## [Extra credit] Parallel generation

The chat completions interface allows you to sample $n$ continuations of the prompt in parallel, as we saw with the "apples, bananas, cherries ..." example.  This is efficient because it requires only 1 request to the LLM server and not $n$.  The latency does not scale with $n$.  Nor does the input token cost, since the prompt only has to be encoded once.

Perhaps you can find a way to make use of this?  For example, the query formulation step of RAG could generate $n$ implicit claims instead of just one.  We could then look for claims in the Kialo database that are close to _any_ of those implicit claims.

Another thing to do with multiple completions is to select among them or combine them.  For example, suppose we prompt the LLM to generate completions of the form $(s,t,r)$ where $s$ is an answer, $t$ evaluates that answer, and $r$ is a numerical score or reward based on that evaluation.  ("Write a poem, then tell us about its rhyme and rhythm problems, then give your score.")  
* If we sample multiple completions $(s_1,t_1,r_1), \ldots, (s_n,t_n,r_n)$ in parallel, then we can return the $s_i$ whose $r_i$ is largest.  
* Or if we sample $s$ and then multiple continuations $(t_1,r_1), \ldots, (t_n,r_n)$, then we can return the mean score $\sum_i r_i/n$ as a reduced-variance score for $s$, which averages over diverse textual evaluations that might consider different aspects of $s$.

Note that when you call the chat completions interface with $n > 1$, you specify 1 shared input prompt and get $n$ different output completions.  Since the input prompt must be the same for all outputs, it is necessary to sample all of $(s,t,r)$ or all of $(t,r)$ with a single call to the LLM.
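
The two selection strategies above reduce to a few lines once the completions have been parsed. The sketch below assumes the parsing has already been done, so that each completion is a tuple ending in a numeric score. The helper names `best_answer` and `mean_score` and the toy data are invented for illustration.

```python
def best_answer(completions):
    """Given (answer, evaluation, score) triples, return the answer
    whose score is largest."""
    answer, _evaluation, _score = max(completions, key=lambda c: c[2])
    return answer

def mean_score(evaluations):
    """Given (evaluation, score) pairs for a single answer, return the
    average score."""
    return sum(score for _evaluation, score in evaluations) / len(evaluations)

# Toy data standing in for parsed LLM output.
completions = [
    ("Poem A", "weak rhymes", 4),
    ("Poem B", "steady rhythm", 8),
    ("Poem C", "forced rhymes", 6),
]
evaluations = [("good imagery", 7), ("uneven meter", 5), ("clear theme", 9)]

chosen = best_answer(completions)   # picks the highest-scoring answer
average = mean_score(evaluations)   # averages the three scores
```

The harder part in practice is getting the LLM to emit $(s,t,r)$ in a format that you can parse reliably, which this sketch does not address.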

Alternatively, it is possible to reduce latency by submitting multiple requests to the server in parallel (see "async usage" [here](https://pypi.org/project/openai/)).  In this case the input prompts can be different, although you now have to pay to encode all of them separately.  This facility could speed up evaluation without changing its results; that's a worthwhile thing to try for extra credit!
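
Here is a minimal sketch of the concurrency pattern, using a stand-in coroutine so that it runs without an API key. `fake_llm_call` and `run_all` are hypothetical names. With the real library you would presumably await a request on an `openai.AsyncOpenAI` client instead, and whether the homework's `track_usage` wrapper works with that client is something you would have to check.

```python
import asyncio

async def fake_llm_call(prompt):
    # Stand-in for an awaitable API request; sleeps briefly to mimic latency.
    await asyncio.sleep(0.01)
    return f"response to: {prompt}"

async def run_all(prompts):
    # gather() runs the requests concurrently and returns the results
    # in the same order as the inputs.
    return await asyncio.gather(*(fake_llm_call(p) for p in prompts))

responses = asyncio.run(run_all(["prompt A", "prompt B", "prompt C"]))
```

Inside a notebook, which already runs an event loop, you would write `await run_all(...)` in a cell rather than calling `asyncio.run`.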
