## Evaluating the quality of a model against known data
One of the challenges with Generative AI is measuring the quality of a model's output, especially at scale. Because Generative AI is stochastic, the output can vary from one call to the next. Even when two outputs say essentially the same thing, a plain string comparison fails because the wording differs.

However, there are ways to score the similarity of two responses.

### Using GenAI
You can ask a model to compare two responses.

### Embeddings
Another way to compare outputs is to compare the embedding vectors of the two responses.
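
A minimal sketch of what comparing two embedding vectors can look like, assuming cosine distance as the metric (a common choice, though not the only one). The three-element vectors below are made-up stand-ins; real embedding vectors come from an embeddings model and have hundreds or thousands of dimensions.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical toy vectors, not real model output.
response_a = [0.9, 0.1, 0.0]
response_b = [0.8, 0.2, 0.1]
unrelated = [0.0, 0.1, 0.9]

print(cosine_distance(response_a, response_b))  # small: vectors point the same way
print(cosine_distance(response_a, unrelated))   # large: vectors point in different directions
```

With this metric, a distance near zero means the two vectors point in nearly the same direction, which is the sense in which two responses are "close".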

## Prerequisites
* **Claude** enabled for your AWS Account.

## Setup

In [9]:
import boto3
import json
import os
import sys
from langchain import PromptTemplate

module_path = ".."
sys.path.append(os.path.abspath(module_path))
from utils import bedrock, print_ww

boto3_bedrock = bedrock.get_bedrock_client(os.environ.get('BEDROCK_ASSUME_ROLE', None))

Create new client
  Using region: None
boto3 Bedrock client successfully created!
bedrock(https://bedrock.us-east-1.amazonaws.com)


In [10]:
from langchain.llms.bedrock import Bedrock

inference_modifier = {'max_tokens_to_sample':4096, 
                      "temperature":0.5,
                      "top_k":250,
                      "top_p":1,
                      "stop_sequences": ["\n\nHuman"]
                     }

textgen_llm = Bedrock(model_id = "anthropic.claude-v2",
                    client = boto3_bedrock, 
                    model_kwargs = inference_modifier 
                    )



## Summary Comparison

The notebook will create a summary from an example document, in this case, the Amazon 2022 letter to shareholders. The document is available in [example.py](example.py).

It will be run through Claude Instant and Amazon Titan, and the resulting summary will be stored for later use.

In [11]:
baseline_summary = """
Amazon continues to face challenges but is optimistic about the future. The company has made cost
cuts and efficiency improvements in fulfillment and logistics networks to lower costs and speed up
delivery times. AWS is facing short-term headwinds but has a strong customer base and continues to
innovate with new products and technologies. Other businesses like Advertising, Amazon Business, and
Buy with Prime continue to see strong growth. The company is also investing in new areas like
grocery, healthcare, satellite internet access, and artificial intelligence which could provide
large opportunities in the future. Although Amazon currently has a small share of the overall retail
and IT markets, as more retail shifts online and to the cloud, the company believes it is well
positioned for significant growth ahead."""

## Candidate summaries
There are four candidate texts.

The first is a completely unrelated article on Honus Wagner, a Major League Baseball player from the turn of the 20th century. The second is a summary of the 2017 letter to shareholders, written by Anthropic Claude. The third is a summary by the AI21 Labs Jurassic Mid foundation model, and the fourth is a summary created by Amazon Titan.

In [12]:
candidates = [
{
    "note": "Unrelated Article on Honus Wagner",
    "text": """
Johannes Peter "Honus" Wagner, sometimes referred to as Hans Wagner, was an American baseball shortstop
who played 21 seasons in Major League Baseball from 1897 to 1917, almost entirely for the Pittsburgh Pirates.
Wagner won his eighth (and final) batting title in 1911, a National League record that remains unbroken
to this day, and matched only once, in 1997, by Tony Gwynn. He also led the league in slugging six times
and stolen bases five times. Wagner was nicknamed "the Flying Dutchman" due to his superb speed and German
heritage. This nickname was a nod to the popular folk-tale made into a famous opera by the German composer
Richard Wagner. In 1936, the Baseball Hall of Fame inducted Wagner as one of the first five members.
He received the second-highest vote total, behind Ty Cobb's 222 and tied with Babe Ruth at 215.
"""
},
{
    "note": "Anthropic Claude on 2017 Letter",
    "text": """In his 2017 letter to Amazon shareholders, Jeff Bezos emphasizes the importance 
of maintaining a "Day 1" mentality even as Amazon grows into a large company. He says customer obsession, 
resisting proxies, embracing external trends, and high-velocity decision making are essential to fending 
off "Day 2" stagnation and irrelevance. Bezos argues that obsessive customer focus drives innovation, that 
proxies like surveys can mislead, that trends like AI must be quickly embraced, and that quality high-speed 
decisions maintain energy and dynamism. He shares examples from Amazon like Alexa and Amazon Go to 
illustrate these points. Bezos stresses that large organizations must move with the spirit of a startup 
to delight customers and remain vital.
"""
},
{
    "note": "AI21 Labs",
    "text": """
While Amazon has faced a challenging macroeconomic challenges in 2022, it has also been a year of growth
and innovation, it has also seen a number of successes. The company continue to grow, innovated, improved 
its customer experiences, and made important adjustments to its investment strategies. The company has been
focusing on long-term investment opportunities, and is constantly investing in long-term opportunities, 
and plans to lower costs, even as it faces short-term challenges.
"""
},
{
    "note": "Amazon Titan",
    "text": """
Despite shutting down some businesses and making changes to others, Jassy remains confident in
Amazon's future prospects. He highlights the company's ongoing efforts to improve fulfillment costs,
speed up delivery, and expand its retail business. Jassy also discusses Amazon's investments in
advertising, machine learning, and new business areas such as healthcare and satellite internet. He
emphasizes the company's long-term vision and commitment to innovation.
"""
}]

## The comparison
Now that the candidate summaries are in place, let's use the Claude v2 model to score how similar each one is to the baseline summary.

The prompt below provides two texts and asks the foundation model to rate their similarity on a scale of 1 to 100. If you value other dimensions, such as accuracy or conciseness, you can include them in the prompt. Lastly, it gives a specific output format as an example, which makes the response easier to parse.

Notice how the result includes the reasoning. This tends to drive better results, as the model must justify its decision. Feel free to modify the code to ask for just the score. The results can be more random in that case.

Lastly, run the comparison several times. The answer may change. Is the model hallucinating the reasons behind its score? Can more details be added to the prompt to reduce this effect?
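
One way to quantify that run-to-run variation is to call the scorer several times and summarize the scores. The sketch below is an illustration only: `score_spread` and the `fake_scorer` stand-in are hypothetical names, and in this notebook the callable would wrap the model call and pull the numeric score out of its reply.

```python
import random
import statistics

def score_spread(score_fn, runs=5):
    """Call score_fn() `runs` times and summarize the returned scores."""
    if runs < 2:
        raise ValueError("need at least two runs to measure spread")
    scores = [score_fn() for _ in range(runs)]
    return {
        "scores": scores,
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),
        "min": min(scores),
        "max": max(scores),
    }

# Hypothetical stand-in for a model-backed scorer: returns a noisy score.
rng = random.Random(0)

def fake_scorer():
    return 65 + rng.randint(-10, 10)

summary = score_spread(fake_scorer, runs=5)
print(summary)
```

A large standard deviation suggests the prompt leaves too much room for interpretation, which is a hint to add more detail about what "similar" should mean.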

In [13]:
def compareText(text1, text2):
    """Ask the model to rate how similar two texts are and return its raw reply."""
    # Create a prompt template that has multiple input variables
    multi_var_prompt2 = PromptTemplate(
        input_variables=["text1","text2"],
        template="""Human: You are comparing 2 documents. On a scale of 1 to 100, how similar are the documents

    <text1>{text1}</text1>
    <text2>{text2}</text2>

   Provide the response in JSON format with the similarity score in the score element and explain the justification in the reason element:
    {{
      "score": 100,
      "reason": "The justification for the selection"
    }}

    Only respond with the json format above.

    Assistant: 
    """
    )

    # Fill in the template with the two texts to compare
    prompt = multi_var_prompt2.format(text1=text1, text2=text2)

    # Send the prompt to the model; the reply is the raw text of the JSON answer
    response = textgen_llm(prompt)

    return response
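
Because the prompt asks for JSON, the reply can be turned into a Python dictionary. The helper below is a minimal sketch, not part of the original notebook: `parse_comparison` is a hypothetical name, the sample reply is made up to match the requested format, and the code assumes the reply contains a single JSON object, possibly surrounded by whitespace or extra text.

```python
import json

def parse_comparison(response):
    """Extract the outermost {...} span from a model reply and parse it as JSON."""
    start = response.find("{")
    end = response.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in the response")
    result = json.loads(response[start:end + 1])
    # Normalize the score in case the model returns it as a string
    result["score"] = int(result["score"])
    return result

# Made-up reply in the format the prompt requests
sample_reply = ' {\n  "score": 5, \n  "reason": "The two texts are on completely different topics."\n}'
parsed = parse_comparison(sample_reply)
print(parsed["score"], parsed["reason"])
```

A reply that is cut off before the closing brace fails to parse and raises a `ValueError`, which is a reasonable signal to retry the call.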

In [14]:
for candidate in candidates:
    print(candidate['note'])
    print(compareText(baseline_summary, candidate['text']))

Unrelated Article on Honus Wagner
 {
  "score": 5, 
  "reason": "The two texts are on completely different topics and do not share any similar words or ideas. The first text is about Amazon's business outlook and strategy. The second text is a biography of baseball player Honus Wagner. There is very little semantic or lexical overlap between these texts, indicating low similarity."
}
Anthropic Claude on 2017 Letter
 {
  "score": 65, 
  "reason": "While the two texts are not identical, they cover similar themes and concepts related to Amazon's business strategy, growth opportunities, and maintaining a startup mentality. Both discuss Amazon's efforts to innovate and delight customers across different business segments, as well as the importance of embracing new technologies and trends. There is significant topical overlap between the two texts, though they provide complementary perspectives and details. A similarity score of 65 reflects the strong topical relevance between the passages, 

## Embeddings
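This section compares the same texts using embeddings instead of a model-written judgment. The next cell creates the embeddings client and the evaluator that measures embedding distance. Later, try modifying a summary to improve its quality, or put a low-quality summary in, and see how the scores respond.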

In [15]:
from langchain.embeddings import BedrockEmbeddings
from langchain.llms.bedrock import Bedrock
from langchain.evaluation import load_evaluator


# Create the Bedrock embeddings client used to turn text into vectors
bedrock_embeddings = BedrockEmbeddings(client=boto3_bedrock)

hf_evaluator = load_evaluator("embedding_distance",llm = textgen_llm, embeddings=bedrock_embeddings)


### Understanding the results
Before we jump into the comparison, we'll take a short detour into how embeddings work. We'll do this by looking at a simple example and seeing the embedding distance for some candidate sentences. The important thing to keep in mind is that the embedding should represent the semantic meaning of a sentence.

In the example, we'll take a simple sentence and try variations of it, some with very similar meanings and others with very different meanings, and view the results.




In [80]:
import pandas as pd
from IPython.display import HTML, Markdown, display

def compare_embeddings(prediction, reference):
    score = hf_evaluator.evaluate_strings(prediction=prediction, reference=reference)['score']
    return {"score":score, "text":prediction}

string1 = "Cooper is a dog that likes to eat beef."

comparison_strings = ["Cooper is a puppy that likes to eat beef.",
                      "Cooper is a man that likes to eat beef.",
                      "Cooper is a cat that likes to eat beef.",
                      "Cooper is a dog that likes to eat chicken.",
                      "Cooper ist ein Hund, der gerne Rindfleisch frisst",
                      "A cooper is someone that makes barrels",
                      "Cooper is a dog that likes to eat steak.",
                      "Beef is what a dog named Cooper likes.",
                      "Fido is a dog that likes to eat beef.",
                      "Spot is a dog that likes to eat beef.",
                      "Cooper is a dog that hates eating beef.",
                      "Amazon Web Services provides on-demand cloud computing platforms and APIs pay-as-you-go basis"]
results = []
for comp in comparison_strings:
    results.append(compare_embeddings(prediction=comp, reference=string1))

display(Markdown(f"""
**Baseline**: ***{string1}***
"""))
display(HTML(pd.DataFrame.from_dict(results).sort_values(by="score").to_html(index=False)))


**Baseline**: ***Cooper is a dog that likes to eat beef.***


| score | text |
|---|---|
| 0.035882 | Cooper is a puppy that likes to eat beef. |
| 0.08369 | Beef is what a dog named Cooper likes. |
| 0.109997 | Cooper is a dog that likes to eat steak. |
| 0.135489 | Cooper is a dog that hates eating beef. |
| 0.184443 | Cooper is a dog that likes to eat chicken. |
| 0.19223 | Fido is a dog that likes to eat beef. |
| 0.194594 | Cooper is a cat that likes to eat beef. |
| 0.228607 | Spot is a dog that likes to eat beef. |
| 0.264782 | Cooper is a man that likes to eat beef. |
| 0.351404 | Cooper ist ein Hund, der gerne Rindfleisch frisst |


### Interpret Embeddings Results
The closer two strings are semantically, the closer their distance score is to zero. Notice in the example above that, in terms of string distance, cat is closer to dog than puppy is. But because embeddings take the semantic context of a word into account, the puppy sentence scores closer than the cat sentence. And because cats and dogs are both animals, the cat sentence is closer than the man sentence.

Similarly, because steak is a kind of beef, the steak sentence is closer in meaning than the chicken sentence.

Note that the German translation scores relatively well, even though it is in a different language.
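
To see the string-distance side of that comparison, the sketch below uses Python's standard `difflib.SequenceMatcher`, which scores character overlap only (1.0 means identical). It is a stand-in for string distance in general, not something the notebook itself runs, and `char_similarity` is a hypothetical helper name.

```python
from difflib import SequenceMatcher

def char_similarity(a, b):
    """Character-level similarity ratio between 0 and 1 (1 = identical)."""
    return SequenceMatcher(None, a, b).ratio()

baseline = "Cooper is a dog that likes to eat beef."
cat = "Cooper is a cat that likes to eat beef."
puppy = "Cooper is a puppy that likes to eat beef."

cat_ratio = char_similarity(baseline, cat)
puppy_ratio = char_similarity(baseline, puppy)
print(f"cat:   {cat_ratio:.3f}")
print(f"puppy: {puppy_ratio:.3f}")
```

At the character level the cat sentence is the closer match, which is the opposite of the embedding ranking in the results table above.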

In [78]:
results = [{"score": hf_evaluator.evaluate_strings(prediction=candidate['text'], reference=baseline_summary)['score'], "candidate": candidate['note']} for candidate in candidates]

display(HTML(pd.DataFrame.from_dict(results).sort_values(by="score").to_html(index=False)))

| score | candidate |
|---|---|
| 0.185844 | AI21 Labs |
| 0.227487 | Amazon Titan |
| 0.465222 | Anthropic Claude on 2017 Letter |
| 1.040069 | Unrelated Article on Honus Wagner |


### Interpret the results
You probably noticed that the results are returned quickly, since the model is only calculating the embeddings, then using the client to compare the embedding vectors. The closer the number is to zero, the more similar the documents are found to be. 

The tradeoff is that the embedding comparison doesn't tell you why the strings are different, or whether one summary is better than another. However, if you have a set of ground truth examples, the embeddings can tell you how close, semantically, the response is to the ground truth.
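
With a set of ground truth examples, that idea can be turned into a simple pass/fail check. The sketch below is an assumption-heavy illustration: `evaluate_against_ground_truth`, the `threshold` value, and the toy `word_overlap_distance` function are all hypothetical. In this notebook the distance function would instead call the embedding evaluator, and a suitable threshold would have to be chosen by looking at scores for known good and bad examples.

```python
def evaluate_against_ground_truth(pairs, distance_fn, threshold):
    """Score each (prediction, reference) pair and flag those within the threshold."""
    report = []
    for prediction, reference in pairs:
        distance = distance_fn(prediction, reference)
        report.append({
            "prediction": prediction,
            "distance": distance,
            "passed": distance <= threshold,
        })
    return report

def word_overlap_distance(a, b):
    """Toy stand-in: 1 - Jaccard overlap of lowercase words (NOT an embedding distance)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 0.0
    return 1.0 - len(words_a & words_b) / len(words_a | words_b)

reference = "Cooper is a dog that likes to eat beef."
pairs = [
    ("Cooper is a puppy that likes to eat beef.", reference),
    ("A cooper is someone that makes barrels", reference),
]

# threshold=0.5 is an arbitrary value chosen for this toy distance
report = evaluate_against_ground_truth(pairs, word_overlap_distance, threshold=0.5)
for row in report:
    print(row["passed"], round(row["distance"], 3), row["prediction"])
```

Keeping the distance function as a parameter means the same loop works whether the scores come from embeddings or from a model-written judgment.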


## Review - Next Steps
Play with the examples above, trying different variations on the summary and seeing how the comparison works.
