# Score Answers

This notebook provides a testing framework to evaluate responses from DC API's chat.

For input, you'll need a CSV file with `question`, `answer`, and `ground_truth` columns.
The notebook scores each answer with an AWS Bedrock Claude model (Opus, Haiku, or Sonnet) and produces a CSV file with `question`, `ground_truth`, `answer`, `score`, `reason`, and `passed` (True/False) columns.
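
As a rough illustration of the expected input, the sketch below builds a tiny CSV in memory with the three required columns and checks a header for missing ones. The sample row, the `REQUIRED_COLUMNS` name, and the `missing_columns` helper are hypothetical and not part of the notebook.

```python
import csv
import io

# Columns the notebook expects in the input file.
REQUIRED_COLUMNS = ("question", "answer", "ground_truth")

def missing_columns(csv_text):
    """Return the required columns that are absent from the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    return [col for col in REQUIRED_COLUMNS if col not in header]

# Hypothetical sample input, built in memory for illustration only.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=REQUIRED_COLUMNS)
writer.writeheader()
writer.writerow({
    "question": "What color is green?",
    "answer": "It's between yellow and blue.",
    "ground_truth": "It's green",
})
sample_csv = buffer.getvalue()

print(missing_columns(sample_csv))              # → []
print(missing_columns("question,answer\nq,a"))  # → ['ground_truth']
```

Running a check like this before uploading avoids a `KeyError` partway through scoring.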

Code adapted from this [AWS sample notebook](https://github.com/aws-samples/llm-based-advanced-summarization/blob/main/Prompt%20Evaluation.ipynb)


## Prerequisite: Log in to AWS

If you haven't already logged in to AWS, close this notebook, then log in from a terminal in this project's directory. Once you're logged in, reopen the Jupyter notebook. For example:

`export AWS_PROFILE=staging && aws sso login`

`jupyter notebook`

Then run the cell below to confirm that `AWS_PROFILE` is set in the notebook's environment.

In [1]:
import os

#confirm that you have AWS credentials
print(os.getenv('AWS_PROFILE'))

staging
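
Printing `AWS_PROFILE` only shows that the variable is set; it does not prove that the SSO session is still valid. One possible extra check, sketched below, asks AWS STS which identity the credentials resolve to. The `caller_identity` helper is an illustrative name and not part of the notebook, and a real call needs network access and a live login.

```python
def caller_identity(sts_client=None):
    """Return (account, arn) for the current credentials.

    If no client is supplied, a real STS client is created with boto3.
    """
    if sts_client is None:
        import boto3  # imported lazily so the helper can be exercised with a stub client
        sts_client = boto3.client("sts")
    identity = sts_client.get_caller_identity()
    return identity["Account"], identity["Arn"]

# In the notebook, with a valid login:
#     account, arn = caller_identity()
#     print(account, arn)
```

If the session has expired, the call raises an exception instead of returning, which is a clearer signal than a profile name alone.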


## Set Up the Environment

First, install the libraries and import the packages we'll need.


In [2]:
%pip install boto3
%pip install bs4
%pip install ipywidgets
%pip install ipython
%pip install pandas

import boto3, time, json, csv, os
from urllib.parse import urljoin
from botocore.config import Config
from datetime import datetime
from bs4 import BeautifulSoup as BS
import ipywidgets as widgets
import io
import pandas as pd

Collecting boto3
  Downloading boto3-1.35.23-py3-none-any.whl.metadata (6.6 kB)
Collecting botocore<1.36.0,>=1.35.23 (from boto3)
  Downloading botocore-1.35.23-py3-none-any.whl.metadata (5.7 kB)
Collecting jmespath<2.0.0,>=0.7.1 (from boto3)
  Using cached jmespath-1.0.1-py3-none-any.whl.metadata (7.6 kB)
Collecting s3transfer<0.11.0,>=0.10.0 (from boto3)
  Using cached s3transfer-0.10.2-py3-none-any.whl.metadata (1.7 kB)
Downloading boto3-1.35.23-py3-none-any.whl (139 kB)
Downloading botocore-1.35.23-py3-none-any.whl (12.6 MB)
Using cached jmespath-1.0.1-py3-none-any.whl (20 kB)
Using cached s3transfer-0.10.2-py3-none-any.whl (82 kB)
Installing collected packages: jmespath, botocore, s3transfer, boto3
Successfully installed boto3-1.35.23 botocore-1.35.23 jmespath-1.0.1 s3transfer-0.10.2
Note: you may need to rest

In [3]:
#increase the standard time out limits in boto3, because Bedrock may take a while to respond to large requests.
my_config = Config(
    connect_timeout=60*3,
    read_timeout=60*3,
)
bedrock = boto3.client(service_name='bedrock-runtime',config=my_config)
bedrock_service = boto3.client(service_name='bedrock',config=my_config)

### Verify the Bedrock Connection

In [4]:
#check that it's working:
models = bedrock_service.list_foundation_models()
for line in models["modelSummaries"]:
    # print this out if you want to see all the models you have access to.
    # print (line["modelId"])
    pass
if "anthropic.claude-3" in str(models):
    print("Claude-v3 found!")
else:
    print ("Error, no model found.")

Claude-v3 found!


## Configure variables

In [5]:
# Run the cell, then use the dropdown to select which Claude model to use
selected_model = widgets.Dropdown(
    options=['sonnet', 'haiku', 'opus'],
    description='Environment:',
)

display(selected_model)

Dropdown(description='Environment:', options=('sonnet', 'haiku', 'opus'), value='sonnet')

In [6]:
# Tip: Use your output file from the PrepareEvaluationData notebook
uploader = widgets.FileUpload(
    accept='csv', 
    multiple=False  
)

display(uploader)

FileUpload(value=(), accept='csv', description='Upload')

In [7]:
# Run the cell, then use the slider to set the minimum passing score (0-100)
selected_threshold = widgets.IntSlider(
    value=90,
    min=0,
    max=100,
    step=5,
    description='Threshold:',
    disabled=False,
    continuous_update=False,
    orientation='horizontal',
    readout=True,
    readout_format='d'
)
display(selected_threshold)

IntSlider(value=90, continuous_update=False, description='Threshold:', step=5)

## Create helper functions

Run all cells in this block to load the helper functions we'll need.

In [8]:
model_version = selected_model.value  # model chosen in the dropdown above
threshold_to_pass = selected_threshold.value  # passing score chosen with the slider above
MAX_ATTEMPTS = 3  # how many times to try a request before giving up
session_cache = {}  # for this session, do not repeat the same query to Claude

# Bedrock model IDs for each supported model_version
MODEL_IDS = {
    "opus": 'anthropic.claude-3-opus-20240229-v1:0',
    "sonnet": 'anthropic.claude-3-5-sonnet-20240620-v1:0',
    "haiku": 'anthropic.claude-3-haiku-20240307-v1:0',
}

def ask_claude(messages, system="", DEBUG=False, model_version="haiku"):
    '''
    Send a prompt to Bedrock and return [raw_prompt_text, response_text].
    messages can be a list of role/content dicts, or a string.
    Set DEBUG=True to see exactly what is being sent to and from Bedrock.
    '''
    raw_prompt_text = str(messages)

    if isinstance(messages, str):
        messages = [{"role": "user", "content": messages}]

    prompt_json = {
        "system": system,
        "messages": messages,
        "max_tokens": 3000,
        "temperature": 0.7,
        "anthropic_version": "bedrock-2023-05-31",  # value documented for Claude on Bedrock
        "top_k": 250,
        "top_p": 0.7,
        "stop_sequences": ["\n\nHuman:"]
    }

    # messages is a list of dicts, so serialize it rather than joining it as strings
    if DEBUG: print("sending:\nSystem:\n", system, "\nMessages:\n", json.dumps(messages, indent=2))

    if model_version not in MODEL_IDS:
        raise ValueError("Bad model version, must be opus, sonnet, or haiku.")
    modelId = MODEL_IDS[model_version]

    # key the cache on model and system prompt too, so switching models does not reuse old answers
    cache_key = (model_version, system, raw_prompt_text)
    if cache_key in session_cache:
        return [raw_prompt_text, session_cache[cache_key]]
    attempt = 1
    while True:
        try:
            response = bedrock.invoke_model(body=json.dumps(prompt_json), modelId=modelId, accept='application/json', contentType='application/json')
            response_body = json.loads(response.get('body').read())
            results = response_body.get("content")[0].get("text")
            if DEBUG: print("Received:", results)
            break
        except Exception as e:
            print("Error with calling Bedrock: " + str(e))
            attempt += 1
            if attempt > MAX_ATTEMPTS:
                print("Max attempts reached!")
                results = str(e)
                break
            else:  # retry in 10 seconds
                time.sleep(10)
    session_cache[cache_key] = results
    return [raw_prompt_text, results]
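
The retry loop in `ask_claude` waits a fixed 10 seconds between attempts. A common alternative for throttled services is exponential backoff, where the wait doubles after each failure. The sketch below shows the idea in isolation; `call_with_retries` and `flaky` are hypothetical names, and `flaky` merely simulates a call that fails twice before succeeding.

```python
import time

def call_with_retries(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds or max_attempts is reached.

    The wait doubles after each failure: base_delay, 2*base_delay, ...
    The last exception is re-raised if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as error:
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({error}); retrying in {delay:g}s")
            sleep(delay)

# Hypothetical stand-in for a Bedrock call that fails twice, then succeeds.
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("throttled")
    return "Four."

# sleep is replaced with a no-op so the demo runs instantly
result = call_with_retries(flaky, max_attempts=3, base_delay=1.0, sleep=lambda s: None)
print(result)  # → Four.
```

Re-raising on the final failure, rather than returning the error text, keeps error messages from being mistaken for model responses further down the pipeline.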

In [14]:
from queue import Queue, Empty
from threading import Thread

# Worker function: each thread pulls prompts off the queue until it is empty.
def thread_request(q, result, model):
    while True:
        try:
            work = q.get_nowait()           # fetch new work from the Queue
        except Empty:
            break                           # another thread took the last item
        try:
            # pass the model by keyword; the second positional argument of ask_claude is `system`
            data = ask_claude(work[1], model_version=model)
            result[work[0]] = data          # store data back at the correct index
        except Exception as e:
            print('Error with prompt!', str(e))
            result[work[0]] = [str(work[1]), str(e)]  # keep the [prompt, response] shape
        # signal to the queue that the task has been processed
        q.task_done()
    return True

def ask_claude_threaded(prompts, model, DEBUG=False):
    '''
    Call ask_claude, but multi-threaded.
    Returns a list of [prompt, response] pairs in the same order as prompts.
    '''
    print(f"Using model: {model}...")
    q = Queue(maxsize=0)
    num_threads = min(50, len(prompts))

    # one result slot per prompt, filled in by the worker threads
    results = [{} for _ in prompts]
    # load up the queue with (index, prompt) tuples so each result lands in the right slot
    for i, prompt in enumerate(prompts):
        q.put((i, prompt))

    # start the worker threads; daemon threads let the main program exit
    # eventually even if a worker does not finish correctly
    for _ in range(num_threads):
        worker = Thread(target=thread_request, daemon=True, args=(q, results, model))
        worker.start()

    # now we wait until the queue has been processed
    q.join()

    print('All tasks completed.')
    return results
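
The queue-and-worker pattern above can also be expressed with the standard library's `concurrent.futures`. The sketch below is an alternative, not the notebook's implementation; `ask_many` is a hypothetical name, and `fake_ask` stands in for `ask_claude` so the example runs without AWS access.

```python
from concurrent.futures import ThreadPoolExecutor

def fake_ask(prompt, model_version="haiku"):
    """Hypothetical stand-in for ask_claude: returns [prompt, response]."""
    return [str(prompt), f"{model_version} saw: {prompt}"]

def ask_many(prompts, model, ask=fake_ask, max_workers=50):
    """Run ask() on every prompt in parallel and keep the input order."""
    if not prompts:
        return []  # ThreadPoolExecutor rejects max_workers=0
    workers = min(max_workers, len(prompts))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the order the prompts were submitted
        return list(pool.map(lambda p: ask(p, model_version=model), prompts))

results = ask_many(["one", "two", "three"], "sonnet")
print(results[1])  # → ['two', 'sonnet saw: two']
```

Because `Executor.map` preserves order, no index bookkeeping is needed to match responses to prompts.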

In [9]:
scoring_prompt_template = """You are a grader.  Consider the following question along with its correct answer or ground truth and a submitted answer to grade.
Here is the question:
<question>{{QUESTION}}</question>
Here is the correct answer:
<ground_truth>{{GROUND_TRUTH}}</ground_truth>
Here is the submitted answer:
<answer>{{ANSWER}}</answer>
Please provide a score from 0 to 100 on how well this answer matches the correct answer for this question.
The score should be high if the answers say essentially the same thing.
The score should be lower if some facts are missing or incorrect, or if extra unnecessary facts have been included.
The score should be 0 for entirely wrong answers.  Put the score in <SCORE> tags. and your reasoning in <REASON> tags.
Do not consider your own answer to the question, but instead score based on the ground_truth above."""

In [10]:
# create the score answers function
def score_answers(prompt_template, question_answers, model):
    '''
    Ask our LLM to score each of the generated answers.
    '''
    prompts = []
    for question in question_answers:
        print(f"Scoring: {question['question']}...")
        # fill the template that was passed in (not the global one);
        # str() guards against non-string cells such as empty values read by pandas
        prompts.append(
            prompt_template
            .replace("{{QUESTION}}", str(question["question"]))
            .replace("{{GROUND_TRUTH}}", str(question["ground_truth"]))
            .replace("{{ANSWER}}", str(question["answer"]))
        )
    return ask_claude_threaded(prompts, model)

In [11]:
# create the function that scores the answers and formats the results
def evaluate_prompt(prompt_template, question_answers, threshhold):
    """
    Call score_answers and format the results once all threads have returned.
    Returns a list of rows; the first row is the header.
    """
    scored_answers = score_answers(prompt_template, question_answers, model_version)
    print("Done.")

    scores = []
    scores.append(['question','ground_truth','answer','score','reason','passed'])
    for prompt, response in scored_answers:
        # name the parser explicitly; html.parser lowercases tag names,
        # so <SCORE> and <REASON> are found as 'score' and 'reason'
        soup = BS(prompt, "html.parser")
        question = soup.find('question').text
        ground_truth = soup.find('ground_truth').text
        answer = soup.find('answer').text
        soup = BS(response, "html.parser")
        score = soup.find('score').text
        reason = soup.find('reason').text
        passed = int(score) >= threshhold
        scores.append([question, ground_truth, answer, score, reason, passed])

    return scores
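
`evaluate_prompt` assumes every response contains well-formed `<SCORE>` and `<REASON>` tags. If a response is an error message, or the model ignores the requested format, the tag lookup fails. A more defensive parser might look like the sketch below; `parse_grade` is a hypothetical helper that uses regular expressions instead of BeautifulSoup.

```python
import re

def parse_grade(response, threshold):
    """Extract (score, reason, passed) from a grader response.

    Returns (None, <raw response>, False) when no integer score is found.
    """
    score_match = re.search(r"<score>\s*(\d{1,3})\s*</score>", response, re.IGNORECASE)
    reason_match = re.search(r"<reason>(.*?)</reason>", response, re.IGNORECASE | re.DOTALL)
    if score_match is None:
        return None, response.strip(), False
    score = int(score_match.group(1))
    reason = reason_match.group(1).strip() if reason_match else ""
    return score, reason, score >= threshold

print(parse_grade("<SCORE>85</SCORE><REASON>Mostly matches.</REASON>", 90))
# → (85, 'Mostly matches.', False)
print(parse_grade("Error with calling Bedrock", 90))
# → (None, 'Error with calling Bedrock', False)
```

Returning `None` for the score keeps unparseable rows visible in the output file instead of stopping the whole run.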

## Test that it's working (optional)

In [12]:
%%time
#check that Claude responses are working:
try:
    query = "Please say the number four."
    result = ask_claude(query)
    print(query)
    print(result[1])
except Exception as e:
    print("Error with calling Claude: "+str(e))

Please say the number four.
Four.
CPU times: user 51.9 ms, sys: 5.96 ms, total: 57.9 ms
Wall time: 554 ms


In [15]:
%%time
#test if our threaded Claude calls are working
q1 = [{"role": "user", "content": "Please say the number one."}]
q2 = [{"role": "user", "content": "Please say the number two."}]
q3 = [{"role": "user", "content": "Please say the number 55."}]

print(ask_claude_threaded([q1,q2,q3], 'opus'))

Using model: opus...
All tasks completed.
[["[{'role': 'user', 'content': 'Please say the number one.'}]", '1'], ["[{'role': 'user', 'content': 'Please say the number two.'}]", 'Two.'], ["[{'role': 'user', 'content': 'Please say the number 55.'}]", '55.']]
CPU times: user 108 ms, sys: 10.4 ms, total: 118 ms
Wall time: 563 ms


In [16]:
# Test with a fake question set
test_question_set = [
    {'question': "What color is green?", 'ground_truth': "It's green", "answer": "It's between yellow and blue and can be grassy or olive"},
    {'question': "Where am I?", 'ground_truth': "Here.", "answer": "You are neither here nor there. You are everywhere."},
    {'question': "Do dogs like cats?", 'ground_truth': "No.", "answer": "More than cats like dogs."}
]

fake_result = score_answers(scoring_prompt_template, test_question_set, model_version)
print(fake_result)

Scoring: What color is green?...
Scoring: Where am I?...
Scoring: Do dogs like cats?...
Using model: sonnet...
All tasks completed.
[["You are a grader.  Consider the following question along with its correct answer or ground truth and a submitted answer to grade.\nHere is the question:\n<question>What color is green?</question>\nHere is the correct answer:\n<ground_truth>It's green</ground_truth>\nHere is the submitted answer:\n<answer>It's between yellow and blue and can be grassy or olive</answer>\nPlease provide a score from 0 to 100 on how well this answer matches the correct answer for this question.\nThe score should be high if the answers say essentially the same thing.\nThe score should be lower if some facts are missing or incorrect, or if extra unnecessary facts have been included.\nThe score should be 0 for entirely wrong answers.  Put the score in <SCORE> tags. and your reasoning in <REASON> tags.\nDo not consider your own answer to the question, but instead score based on

## Score the answers and write output file

In [17]:
# load the input data
uploaded_file = uploader.value[0]
input_filename = uploaded_file.name
csv_file = pd.read_csv(io.BytesIO(uploaded_file.content))
question_set = csv_file.to_dict(orient='records')

# verify it's loaded
print(question_set)

[{'question': 'How did World War II propaganda posters influence public opinion and morale during the war?', 'ground_truth': "World War II propaganda posters played a crucial role in influencing public opinion and boosting morale. They used striking visuals and compelling messages to convey the importance of unity, sacrifice, and support for the war effort. The [World War II Poster Collection](https://dc.library.northwestern.edu/collections/faf4f60e-78e0-4fbf-96ce-4ca8b4df597a) includes over 300 posters issued by various U.S. government agencies, emphasizing different themes:\r\n\r\n1. **Unity and Resistance**: Posters like [We French workers warn you: defeat means slavery, starvation, death](https://dc.library.northwestern.edu/items/52f515bd-2ee8-49da-aec0-92eb3f5b5e07) depicted the dire consequences of defeat to foster unity and resistance among Americans.\r\n   \r\n2. **Production and Labor**: Posters such as [Your ore packs a punch!](https://dc.library.northwestern.edu/items/1792fa

In [18]:
# Run the evaluation
scores = evaluate_prompt(scoring_prompt_template, question_set,threshhold=threshold_to_pass)

Scoring: How did World War II propaganda posters influence public opinion and morale during the war?...
Scoring: What contributions did Achille Paganini make to music, and how did his work influence later composers?...
Scoring: What role did Larry Hanks play in the folk music revival, and what are some of his most influential performances?...
Scoring: What were the key policies of Murtala Muhammed's government in Nigeria, and how did his leadership impact the country?...
Scoring: How has the political map of Africa changed since the 19th century, and what historical events have driven these changes?...
Using model: sonnet...
All tasks completed.
Done.


In [19]:
# sanity check (can take a second)
print(scores)

[['question', 'ground_truth', 'answer', 'score', 'reason', 'passed'], ['How did World War II propaganda posters influence public opinion and morale during the war?', "World War II propaganda posters played a crucial role in influencing public opinion and boosting morale. They used striking visuals and compelling messages to convey the importance of unity, sacrifice, and support for the war effort. The [World War II Poster Collection](https://dc.library.northwestern.edu/collections/faf4f60e-78e0-4fbf-96ce-4ca8b4df597a) includes over 300 posters issued by various U.S. government agencies, emphasizing different themes:\r\n\r\n1. **Unity and Resistance**: Posters like [We French workers warn you: defeat means slavery, starvation, death](https://dc.library.northwestern.edu/items/52f515bd-2ee8-49da-aec0-92eb3f5b5e07) depicted the dire consequences of defeat to foster unity and resistance among Americans.\r\n   \r\n2. **Production and Labor**: Posters such as [Your ore packs a punch!](https:/

In [20]:
# write to file

timestamp = datetime.now().strftime('%Y%m%d%H%M%S')
output_base_path = os.path.join('output_files', 'scored', timestamp)
os.makedirs(output_base_path, exist_ok=True)
input_stem = os.path.splitext(os.path.basename(input_filename))[0]
filename = os.path.join(output_base_path, f"{input_stem}.csv")

with open(filename, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(scores)

print(f"Output file saved to: {filename}")

Output file saved to: output_files/scored/20240920081752/5_realistic_with_ground_truth.csv
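
Once the scored rows exist, a quick summary can help compare runs. The sketch below computes the pass rate and mean score from a list shaped like `scores` (header row first). The sample rows are made up for illustration, and `summarize` is a hypothetical helper that is not part of the notebook.

```python
def summarize(rows):
    """Return (pass_rate, mean_score) for rows shaped like `scores`.

    The first row is the header; score may be a string, as in the notebook.
    """
    header, *data = rows
    score_idx = header.index("score")
    passed_idx = header.index("passed")
    if not data:
        return 0.0, 0.0
    pass_rate = sum(1 for row in data if row[passed_idx]) / len(data)
    mean_score = sum(int(row[score_idx]) for row in data) / len(data)
    return pass_rate, mean_score

# Made-up rows for illustration only.
sample_scores = [
    ["question", "ground_truth", "answer", "score", "reason", "passed"],
    ["Q1", "GT1", "A1", "95", "Matches.", True],
    ["Q2", "GT2", "A2", "40", "Missing facts.", False],
]
pass_rate, mean_score = summarize(sample_scores)
print(f"pass rate: {pass_rate:.0%}, mean score: {mean_score:.1f}")
# → pass rate: 50%, mean score: 67.5
```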
