## Evaluating the quality of a model against known data
One of the challenges with Generative AI is measuring how effective a model's output is, especially at scale. Because Generative AI is stochastic, the output can vary from one call to another. Even if two outputs are directionally the same, the variation in wording means you can't rely on a simple string comparison.

However, there are ways to score the similarity of two responses.

### Using GenAI
You can ask a model to compare two responses.

### Embeddings
One way to compare the output is to compare the embedding vectors of the two responses.
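As a rough illustration of what comparing two embedding vectors means, the sketch below computes cosine distance in plain Python. It assumes cosine distance as the metric. The three-element vectors are made-up stand-ins (real embeddings have far more dimensions), and `cosine_distance` is a helper name introduced here, not part of any library used in this notebook.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)

# Made-up toy vectors, for illustration only
vec_dog = [0.9, 0.1, 0.0]
vec_terrier = [0.8, 0.2, 0.1]
vec_baseball = [0.0, 0.1, 0.9]

print(cosine_distance(vec_dog, vec_terrier))
print(cosine_distance(vec_dog, vec_baseball))
```

Vectors that point in nearly the same direction give a distance near zero, while unrelated vectors give a distance near one.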

## Prerequisites
* **Claude** enabled for your AWS Account.

## Setup

In [12]:
import boto3
import json
import os
import sys
from langchain import PromptTemplate

module_path = ".."
sys.path.append(os.path.abspath(module_path))
from utils import bedrock, print_ww

boto3_bedrock = bedrock.get_bedrock_client(os.environ.get('BEDROCK_ASSUME_ROLE', None))

Create new client
  Using region: None
boto3 Bedrock client successfully created!
bedrock(https://bedrock.us-east-1.amazonaws.com)


In [13]:
from langchain.llms.bedrock import Bedrock

inference_modifier = {'max_tokens_to_sample':4096, 
                      "temperature":0.5,
                      "top_k":250,
                      "top_p":1,
                      "stop_sequences": ["\n\nHuman"]
                     }

textgen_llm = Bedrock(model_id = "anthropic.claude-v2",
                    client = boto3_bedrock, 
                    model_kwargs = inference_modifier 
                    )



## Summary Comparison

The notebook will create a summary from an example document, in this case, the Amazon 2022 letter to shareholders. The document is available in [example.py](example.py).

It will be run through Claude Instant and Amazon Titan, and the resulting summary will be stored for later use.

In [14]:
baseline_summary = """
Amazon continues to face challenges but is optimistic about the future. The company has made cost
cuts and efficiency improvements in fulfillment and logistics networks to lower costs and speed up
delivery times. AWS is facing short-term headwinds but has a strong customer base and continues to
innovate with new products and technologies. Other businesses like Advertising, Amazon Business, and
Buy with Prime continue to see strong growth. The company is also investing in new areas like
grocery, healthcare, satellite internet access, and artificial intelligence which could provide
large opportunities in the future. Although Amazon currently has a small share of the overall retail
and IT markets, as more retail shifts online and to the cloud, the company believes it is well
positioned for significant growth ahead."""

## Candidate summaries
There are four candidate texts, described here in the order they appear in the code below.

The first is a completely unrelated article on Honus Wagner, a Major League Baseball player from the turn of the 20th century. The second is a summary of the 2017 letter to shareholders. The third is a summary produced by the AI21 Labs Jurassic Mid foundation model, and the fourth was created by Amazon Titan.

In [68]:
candidates = [
{
    "note": "Unrelated Article on Honus Wagner",
    "text": """
Johannes Peter "Honus" Wagner, sometimes referred to as Hans Wagner, was an American baseball shortstop
who played 21 seasons in Major League Baseball from 1897 to 1917, almost entirely for the Pittsburgh Pirates.
Wagner won his eighth (and final) batting title in 1911, a National League record that remains unbroken
to this day, and matched only once, in 1997, by Tony Gwynn. He also led the league in slugging six times
and stolen bases five times. Wagner was nicknamed "the Flying Dutchman" due to his superb speed and German
heritage. This nickname was a nod to the popular folk-tale made into a famous opera by the German composer
Richard Wagner. In 1936, the Baseball Hall of Fame inducted Wagner as one of the first five members.
He received the second-highest vote total, behind Ty Cobb's 222 and tied with Babe Ruth at 215.
"""
},
{
    "note": "Anthropic Claude on 2017 Letter",
    "text": """In his 2017 letter to Amazon shareholders, Jeff Bezos emphasizes the importance 
of maintaining a "Day 1" mentality even as Amazon grows into a large company. He says customer obsession, 
resisting proxies, embracing external trends, and high-velocity decision making are essential to fending 
off "Day 2" stagnation and irrelevance. Bezos argues that obsessive customer focus drives innovation, that 
proxies like surveys can mislead, that trends like AI must be quickly embraced, and that quality high-speed 
decisions maintain energy and dynamism. He shares examples from Amazon like Alexa and Amazon Go to 
illustrate these points. Bezos stresses that large organizations must move with the spirit of a startup 
to delight customers and remain vital.
"""
},
{
    "note": "AI21 Labs",
    "text": """
While Amazon has faced a challenging macroeconomic challenges in 2022, it has also been a year of growth
and innovation, it has also seen a number of successes. The company continue to grow, innovated, improved 
its customer experiences, and made important adjustments to its investment strategies. The company has been
focusing on long-term investment opportunities, and is constantly investing in long-term opportunities, 
and plans to lower costs, even as it faces short-term challenges.
"""
},
{
    "note": "Amazon Titan",
    "text": """
Despite shutting down some businesses and making changes to others, Jassy remains confident in
Amazon's future prospects. He highlights the company's ongoing efforts to improve fulfillment costs,
speed up delivery, and expand its retail business. Jassy also discusses Amazon's investments in
advertising, machine learning, and new business areas such as healthcare and satellite internet. He
emphasizes the company's long-term vision and commitment to innovation.
"""
}]

## The comparison
Now that the baseline and the candidate summaries are in place, let's use the Claude v2 model to compare each candidate against the baseline and score how similar they are.

The prompt below provides the two texts to compare, then asks the foundation model to rate their similarity on a scale of 1 to 100. If you value other dimensions of a summary, such as accuracy or conciseness, you can include them in the prompt. Lastly, the prompt gives a specific output format as an example, which makes the response easier to parse.

Notice how the result includes the reasoning. This tends to drive better results, as the model must justify its decision. Feel free to modify the code to ask for just the score without a reason. The results can be more random in that case.

Lastly, run the comparison several times. The answer may change. Is the model hallucinating its justification for a score? Can more details be added to the prompt to reduce this effect?
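One way to see how much the answer changes between runs is to call the comparison repeatedly and summarize the scores. The sketch below is an assumption-laden illustration: `score_spread` and `fake_compare` are hypothetical helpers introduced here, and the canned replies stand in for real model calls so the example runs offline. In practice you would pass in any function that takes two texts and returns the JSON reply as a string.

```python
import json
import statistics

def score_spread(compare_fn, text1, text2, runs=5):
    """Call compare_fn repeatedly and summarize the returned similarity scores."""
    scores = []
    for _ in range(runs):
        raw = compare_fn(text1, text2)
        scores.append(json.loads(raw)["score"])
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.pstdev(scores),  # population standard deviation
        "min": min(scores),
        "max": max(scores),
    }

# Hypothetical stand-in for a model call: returns canned replies in order.
_canned_replies = iter([
    '{"score": 90, "reason": "a"}',
    ' {"score": 80, "reason": "b"}',
    '{"score": 85, "reason": "c"}',
])

def fake_compare(text1, text2):
    return next(_canned_replies)

print(score_spread(fake_compare, "first text", "second text", runs=3))
```

A wide gap between `min` and `max` suggests the prompt leaves too much room for interpretation.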

In [69]:
def compareText(text1, text2):
    """Ask the model to score the similarity of two texts; returns the raw reply as a string."""
    # Create a prompt template that has multiple input variables
    multi_var_prompt2 = PromptTemplate(
        input_variables=["text1","text2"],
        template="""Human: You are comparing 2 documents. On a scale of 1 to 100, how similar are the documents?

    <text1>{text1}</text1>
    <text2>{text2}</text2>

    Provide the response in JSON format with the similarity score in the score element and explain the justification in the reason element:
    {{
      "score": 100,
      "reason": "The justification for the selection"
    }}

    Only respond with the json format above.

    Assistant: 
    """
    )

    # Fill in the two texts, then send the finished prompt to the model
    prompt = multi_var_prompt2.format(text1=text1, text2=text2)

    response = textgen_llm(prompt)

    return response
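The function returns the model's raw text reply, so the score still needs to be extracted before it can be used in code. The sketch below shows one defensive way to do that, assuming the reply may carry leading whitespace or extra words, or may arrive cut off. `parse_score` is a hypothetical helper introduced here, and the sample replies are shortened examples in the format the prompt asks for.

```python
import json

def parse_score(reply):
    """Return (score, reason) from a model reply, or None if no valid JSON object is found."""
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end <= start:
        return None  # no complete {...} object in the reply
    try:
        data = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    score = data.get("score")
    # Reject missing or non-numeric scores (bool is excluded explicitly)
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return None
    return score, data.get("reason", "")

# Shortened sample replies, for illustration only
good = ' {\n  "score": 1, \n  "reason": "The two texts are completely different topics."\n}'
cut_off = ' {\n  "score": 95, \n  "reason": "The two texts are very similar'

print(parse_score(good))
print(parse_score(cut_off))
```

Returning `None` for a malformed reply lets the caller decide whether to retry the call or skip that candidate.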

In [70]:
for candidate in candidates:
    print(candidate['note'])
    print(compareText(baseline_summary, candidate['text']))

Unrelated Article on Honus Wagner
 {
  "score": 1, 
  "reason": "The two texts are completely different topics and do not share any similar content, words or themes. The first text is about Amazon's business outlook and the second is a biography of a baseball player. There is no meaningful similarity between them."
}
Anthropic Claude on 2017 Letter
 {
  "score": 85, 
  "reason": "Both texts discuss Amazon's business strategy, challenges, and future outlook. They cover similar topics like AWS, innovation, AI, customer obsession, and maintaining a startup mentality despite growth. There is significant topical overlap between the texts, indicating high similarity, although text2 focuses more narrowly on Jeff Bezos' 2017 letter."
}
AI21 Labs
 {
  "score": 95, 
  "reason": "The two texts are very similar in content and meaning. They both discuss Amazon facing challenges but remaining optimistic and investing for the long-term. The key points around cost cuts, efficiency, AWS, advertising, g

## Embeddings
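Another way to compare the summaries is to measure the distance between their embeddings. The next cell sets up an embedding distance evaluator backed by Bedrock embeddings. Once it is working, see if you can beat the bot: try modifying a candidate summary to improve its score, or put in a low-quality summary and see how the score changes.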

In [71]:
from langchain.embeddings import BedrockEmbeddings
from langchain.evaluation import load_evaluator

# Create the Bedrock embeddings client. No model_id is passed, so the library's default embeddings model is used.
bedrock_embeddings = BedrockEmbeddings(client=boto3_bedrock)

# Build an evaluator that scores two strings by the distance between their embeddings.
hf_evaluator = load_evaluator("embedding_distance", llm=textgen_llm, embeddings=bedrock_embeddings)


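### Understanding the results
To get a feel for the scores the evaluator returns, the next cell compares a reference sentence against variations that each change a single word.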



In [84]:
def compare_embeddings(prediction, reference):
    score = hf_evaluator.evaluate_strings(prediction=prediction, reference=reference)['score']
    
    print(f"{score:.2f}: {reference}->{prediction}")

string1 = "Fred is a dog that likes to eat steak."
string2 = "Fred is a cat that likes to eat steak."
string3 = "Fred is a man that likes to eat steak."
string4 = "Fred is a terrier that likes to eat steak."
string5 = "Fred is a dog that likes to eat chicken."
string6 = "Fred is a dog that likes to eat filet mignon."



compare_embeddings(prediction=string2, reference=string1)
compare_embeddings(prediction=string3, reference=string1)
compare_embeddings(prediction=string4, reference=string1)
compare_embeddings(prediction=string5, reference=string1)
compare_embeddings(prediction=string6, reference=string1)


0.17: Fred is a dog that likes to eat steak.->Fred is a cat that likes to eat steak.
0.23: Fred is a dog that likes to eat steak.->Fred is a man that likes to eat steak.
0.08: Fred is a dog that likes to eat steak.->Fred is a terrier that likes to eat steak.
0.15: Fred is a dog that likes to eat steak.->Fred is a dog that likes to eat chicken.
0.10: Fred is a dog that likes to eat steak.->Fred is a dog that likes to eat filet mignon.


### Interpret Embeddings Results
The closer two strings are semantically, the closer their score will be to zero. Notice in the example above that, by string distance, "cat" is closer to "dog" than "terrier" is. But because embeddings take the semantic context of the word into account, "terrier" scores closer to "dog" than "cat" does. And because cats and dogs are both animals, "cat" scores closer than "man".

Similarly, because filet mignon is a type of steak, the filet mignon sentence is closer in meaning to the reference than the chicken sentence is.
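To make the contrast with string distance concrete, the sketch below scores the same sentences at the character level using `difflib.SequenceMatcher` from the Python standard library. This is one of several possible string measures, chosen here only for illustration, and `char_similarity` is a helper name introduced for this example. Unlike the embedding scores above, a higher ratio here means more similar.

```python
from difflib import SequenceMatcher

def char_similarity(a, b):
    """Character-level similarity ratio in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a, b).ratio()

reference = "Fred is a dog that likes to eat steak."
cat = "Fred is a cat that likes to eat steak."
terrier = "Fred is a terrier that likes to eat steak."

print(f"cat:     {char_similarity(reference, cat):.3f}")
print(f"terrier: {char_similarity(reference, terrier):.3f}")
```

At the character level the "cat" sentence ranks as more similar than the "terrier" sentence, the opposite of the embedding ranking.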

In [85]:
for candidate in candidates:
    score = hf_evaluator.evaluate_strings(prediction=candidate['text'], reference=baseline_summary)['score']
    print(f"{candidate['note']}: {score}")

Unrelated Article on Honus Wagner: 1.0400693128415581
Anthropic Claude on 2017 Letter: 0.4652217175377634
AI21 Labs: 0.1858441764252151
Amazon Titan: 0.22748726407263697


### Interpret the results
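You probably noticed that the results are returned quickly, since the model only calculates the embeddings and the client then compares the embedding vectors. The closer the number is to zero, the more similar the documents are found to be. Here the unrelated Honus Wagner article scores highest (about 1.04), the summary of the 2017 letter sits in the middle (about 0.47), and the two summaries of the 2022 letter score lowest (about 0.19 and 0.23).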
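If you want to turn these distances into a decision, one option is to rank the candidates and apply a cutoff. The sketch below reuses the scores printed above. `rank_by_distance` is a hypothetical helper, and the threshold of 0.3 is an arbitrary value picked for illustration, not a recommended setting; a sensible cutoff depends on your data and embeddings model.

```python
def rank_by_distance(scores, threshold):
    """Sort candidates from most to least similar and flag those within the threshold."""
    ranked = sorted(scores.items(), key=lambda item: item[1])
    return [(note, dist, dist <= threshold) for note, dist in ranked]

# Distances copied from the output above
distances = {
    "Unrelated Article on Honus Wagner": 1.0400693128415581,
    "Anthropic Claude on 2017 Letter": 0.4652217175377634,
    "AI21 Labs": 0.1858441764252151,
    "Amazon Titan": 0.22748726407263697,
}

ASSUMED_THRESHOLD = 0.3  # illustrative cutoff only

for note, dist, is_similar in rank_by_distance(distances, ASSUMED_THRESHOLD):
    label = "similar" if is_similar else "different"
    print(f"{dist:.3f}  {label:9}  {note}")
```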


## Review - Next Steps
