## Comparing the results of models
This workbook walks through an example of model evaluation using generative AI. Once you have a model working in production, how do you know whether the next generation of the model is better than the existing one? And would a lower-cost model provide results of similar quality?

The example we will review is text summarization. We'll ask multiple models for a summary of a document, then ask a foundation model which provided a better summary.

One consideration is that the model doing the comparison should be the most capable one involved. Both candidate models can be smaller than the comparison model.

In this example, we'll use the following candidate models:
* **anthropic.claude-instant-v1**: Claude instant
* **amazon.titan-tg1-large**: Amazon Titan Large

The **anthropic.claude-v2** model will be used for comparing the outputs.

## Prerequisites
* **Claude** and **Amazon Titan** models enabled for your AWS account.

## Setup

In [1]:
import boto3
import json
import os
import sys

module_path = ".."
sys.path.append(os.path.abspath(module_path))
from utils import bedrock, print_ww

boto3_bedrock = bedrock.get_bedrock_client(os.environ.get('BEDROCK_ASSUME_ROLE', None))

Create new client
  Using region: None
boto3 Bedrock client successfully created!
bedrock(https://bedrock.us-east-1.amazonaws.com)


In [2]:
from langchain.llms.bedrock import Bedrock

inference_modifier = {'max_tokens_to_sample':4096, 
                      "temperature":0.5,
                      "top_k":250,
                      "top_p":1,
                      "stop_sequences": ["\n\nHuman"]
                     }

textgen_llm = Bedrock(model_id = "anthropic.claude-v2",
                    client = boto3_bedrock, 
                    model_kwargs = inference_modifier 
                    )

textgen_claude_instant_llm = Bedrock(model_id = "anthropic.claude-instant-v1",
                    client = boto3_bedrock, 
                    model_kwargs = inference_modifier 
                    )

amazon_titan_inference_modifier = {"maxTokenCount":512,
                      "temperature":0,
                      "topP":1,
                      "stopSequences": []
                     }

textgen_amazon_large_llm = Bedrock(model_id = "amazon.titan-tg1-large",
                    client = boto3_bedrock,
                    model_kwargs = amazon_titan_inference_modifier
                    )


## Summary Comparison

The notebook will create a summary from an example document, in this case, the Amazon 2022 letter to shareholders. The document is available in [example.py](example.py).

It will be run through Claude Instant and Amazon Titan, and the resulting summary will be stored for later use.

In [14]:
from example import letter
import json
from langchain import PromptTemplate

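def run_summarization(llm, prompt_template):
    """Fill the template with the letter text and return the model's summary."""
    multi_var_prompt = PromptTemplate(
        input_variables=["document_text"],
        template=prompt_template
    )

    prompt = multi_var_prompt.format(document_text=letter)

    response = llm(prompt)
    # Drop the first line of the response (a blank line or a lead-in sentence).
    # If the response has no newline, keep it unchanged instead of raising ValueError.
    _, newline, remainder = response.partition('\n')
    result = remainder if newline else response

    return result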

In [15]:
claude_prompt = """Human: You are creating a summary of a document. The original document is encoded by <document></document> tags.
<document>{document_text}</document>

Please make a 1 paragraph summary of the document.
Assistant: 
"""

claude_instant_summary = run_summarization(textgen_claude_instant_llm, claude_prompt)
print_ww(claude_instant_summary)


Amazon CEO Andy Jassy remains optimistic about the company's future prospects as they navigate
current macroeconomic challenges. The company has made adjustments to streamline costs, optimize
their fulfillment network, and prioritize investments that have the highest potential to drive long-
term value for customers and shareholders. While macroeconomic conditions have softened growth in
some areas like AWS, Jassy highlights opportunities in advertising, grocery, Amazon Business, and
new initiatives in healthcare and satellite internet that could be transformative if successful.
Jassy also discusses the potential of large language models and generative AI to revolutionize
machine learning and improve every customer experience at Amazon.


In [16]:
amazon_titan_prompt = """You are creating a summary of a document.  Please make a 1 paragraph summary of the document.
Document:{document_text}

Please make a 1 paragraph summary of the document.
Summary:
"""

amazon_titan_summary = run_summarization(textgen_amazon_large_llm, amazon_titan_prompt)
print_ww(amazon_titan_summary)


Despite shutting down some businesses and making changes to others, Jassy remains confident in
Amazon's future prospects. He highlights the company's ongoing efforts to improve fulfillment costs,
speed up delivery, and expand its retail business. Jassy also discusses Amazon's investments in
advertising, machine learning, and new business areas such as healthcare and satellite internet. He
emphasizes the company's long-term vision and commitment to innovation.


## The comparison
Now that both summaries have been completed, let's use the Claude v2 model to compare the two and decide which is better.

The prompt below provides the entire document text, then both of the summaries. It then asks the foundation model which is more accurate and concise. If you value other dimensions of the summary more, you can include that in the prompt. Lastly, it gives a specific format to the output as an example, thus making it easier to parse.

Notice that the requested output includes the reasoning. This tends to produce better results, because the model must justify its decision. Feel free to modify the prompt to ask only for the name of the better summary, without a reason; the results can be more random in that case.

Finally, run the comparison several times, since the answer may change between runs. Is the model hallucinating its reasons for preferring one summary over the other? Can more detail be added to the prompt to reduce this effect?

In [22]:

# Create a prompt template that has multiple input variables
multi_var_prompt2 = PromptTemplate(
    input_variables=["document_text", "summary1", "summary2", "name1", "name2"], 
    template="""Human: You are comparing 2 summaries of a document. The original document is encoded by <document></document> tags. The first summary is encoded by <{name1}></{name1}> tags and the second summary is encoded by <{name2}></{name2}> tags.

<document>{document_text}</document>

<{name1}>{summary1}</{name1}>
<{name2}>{summary2}</{name2}>

Choosing only between {name1}, {name2}, or uncertain, which summary is more accurate and concise?  Also, explain the reason why with the format:
{{
  "best": "name1",
  "reason": "The justification for the selection"
}}

Only respond with the json format above.
Assistant: 
"""
)

# src_document is the full letter text imported from example.py
src_document = letter

prompt = multi_var_prompt2.format(document_text=src_document, summary1=claude_instant_summary, name1="claude_instant", summary2=amazon_titan_summary, name2="titan")


response = textgen_llm(prompt)

#target_code = response[response.index('\n')+1:]

print_ww(response)

 {
  "best": "claude_instant",
  "reason": "The claude_instant summary accurately captures the key points from Jassy's letter in a
concise manner. It highlights Jassy's optimism despite challenges, adjustments being made, focus on
long-term investments, and growth opportunities in new areas like healthcare and AI. The titan
summary is also good but includes some extra details that make it slightly less concise."
}
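
Because the prompt requests a fixed JSON format, the response can be parsed programmatically. The following is a minimal sketch, not part of the original notebook: `parse_verdict` is a hypothetical helper name, and it assumes the response contains exactly one JSON object with `best` and `reason` keys, possibly surrounded by whitespace or extra text.

```python
import json

def parse_verdict(response, allowed_names):
    """Extract (best, reason) from a judge response containing one JSON object."""
    start = response.find("{")
    if start == -1:
        raise ValueError("no JSON object found in response")
    # raw_decode parses a single JSON value beginning at `start` and ignores
    # any text after it; strict=False tolerates literal newlines inside strings.
    verdict, _ = json.JSONDecoder(strict=False).raw_decode(response, start)
    best = verdict.get("best")
    if best not in allowed_names:
        raise ValueError(f"unexpected winner: {best!r}")
    return best, verdict.get("reason", "")

# Example with a hand-written response in the same shape as the output above.
sample = ' {\n  "best": "claude_instant",\n  "reason": "More complete."\n}'
best, reason = parse_verdict(sample, {"claude_instant", "titan", "uncertain"})
print(best)
```

Checking `best` against the allowed names catches cases where the model echoes the literal placeholder from the format example or invents a label.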


## Beat the bot!
See if you can beat the bot. Modify the summary in the next cell and try to improve its quality. Alternatively, put in a low-quality summary and see how the model responds.

In [23]:
# Replace this summary with your own! Be smarter than Claude
your_summary = """
This is the Amazon annual 2022 shareholder letter written by Andy Jassy. It highlights some of the success
and challenges that Amazon has faced and how Amazon plans to respond.
"""

prompt = multi_var_prompt2.format(document_text=src_document, summary1=claude_instant_summary, name1="claude-instant", summary2=your_summary, name2="My Summary")


response = textgen_llm(prompt)

print_ww(response)

 {
  "best": "claude-instant",
  "reason": "The claude-instant summary accurately captures the key points from the original
document in a concise manner. It highlights Jassy's optimism, the macroeconomic challenges,
adjustments being made, focus on high-potential investments, opportunities in new areas like
advertising and healthcare, and the potential of AI/ML. The My Summary is very brief and does not
provide enough detail on the content."
}


## Does order matter?

Is the model just selecting a summary based on its position in the prompt? Let's flip the order and re-evaluate.

In [24]:
prompt = multi_var_prompt2.format(document_text=src_document, summary2=claude_instant_summary, name2="claude-instant", summary1=your_summary, name1="My Summary")


response = textgen_llm(prompt)

print_ww(response)

 {
  "best": "claude-instant",
  "reason": "The claude-instant summary is more accurate and concise. It highlights the key points
from the original letter, including Amazon's response to macroeconomic challenges, opportunities in
new business areas, and investments in AI/ML. The My Summary is very brief and does not capture the
breadth of topics covered."
}
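
The position check above can be automated. The sketch below is an illustration under assumptions rather than the notebook's own code: `judge_both_orders`, `judge`, and `make_prompt` are hypothetical names, and `judge` is assumed to be any callable that takes a prompt and returns only the winning name. A stub judge stands in for the model so the example runs without Bedrock access.

```python
def judge_both_orders(judge, make_prompt, name_a, summary_a, name_b, summary_b):
    """Ask the judge twice with the summaries swapped; return (winner, consistent)."""
    first = judge(make_prompt(name_a, summary_a, name_b, summary_b))
    second = judge(make_prompt(name_b, summary_b, name_a, summary_a))
    if first == second:
        return first, True
    # The verdict changed when the order changed, so treat it as unreliable.
    return "uncertain", False

def make_prompt(name1, summary1, name2, summary2):
    # Simplified stand-in for the comparison prompt template.
    return f"<{name1}>{summary1}</{name1}>\n<{name2}>{summary2}</{name2}>"

def first_position_judge(prompt):
    # Stub standing in for the LLM: always picks whichever summary is listed first.
    return prompt[1:prompt.index(">")]

result = judge_both_orders(first_position_judge, make_prompt,
                           "claude-instant", "long summary",
                           "mine", "short summary")
print(result)  # ('uncertain', False)
```

With the notebook's objects, `make_prompt` could wrap `multi_var_prompt2.format(...)` and `judge` could call `textgen_llm` and read the `best` field of the returned JSON, assuming the model returns only the JSON object.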


## Review - Next Steps
This example shows one way to use an LLM to evaluate, at scale, the quality of the output of other LLMs. While not perfect, with some repeated testing you can build an evaluation framework that rates the quality of LLM output. This helps you assess a model when changing models, whether through an upgrade or in search of a lower-cost option.
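
As a starting point for that repeated testing, the sketch below tallies how often each candidate wins across several runs of the same comparison. It is an illustration only: `tally_wins` and `judge` are hypothetical names, and `judge` is assumed to be a callable that takes a prompt and returns the winning name. A stub with fixed answers replaces the model so the example is reproducible.

```python
from collections import Counter
from itertools import cycle

def tally_wins(judge, prompt, runs=5):
    """Call judge(prompt) `runs` times and return each answer's share of the votes."""
    counts = Counter(judge(prompt) for _ in range(runs))
    return {name: n / runs for name, n in counts.items()}

# Stub judge that replays a fixed sequence of verdicts instead of calling a model.
answers = cycle(["claude_instant", "claude_instant", "titan", "claude_instant"])
shares = tally_wins(lambda _prompt: next(answers), "comparison prompt", runs=4)
print(shares)  # {'claude_instant': 0.75, 'titan': 0.25}
```

A share close to 1.0 suggests a stable preference, while a near-even split suggests the judge cannot reliably separate the two summaries with the current prompt.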