In [1]:
!pip install transformers evaluate datasets nltk rouge_score

Defaulting to user installation because normal site-packages is not writeable


# Model evaluation
## [Motivation](https://youtu.be/Gg-w_n9NJIE?si=CncRwPnRWdphKVVc&t=2010)
#### References
- [Llama 2: Open Foundation and Fine-Tuned Chat Models](https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/)
- [Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena](https://arxiv.org/pdf/2306.05685.pdf)

## Manual Evaluations
- [Chatbot Arena](https://chat.lmsys.org/?arena%3Fref=futuretools.io)
- [GodMode](https://github.com/smol-ai/GodMode)
## Symbolic evaluations
### N-gram
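An n-gram is a contiguous sequence of n items (here, words) taken from a text. The example below counts how many of the reference's n-grams also appear in the prediction.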

```
reference = "the quick brown fox"
prediction = "the quick brown dog"

unigram_matches = ["the", "quick", "brown"]
unigram_reference = ["the", "quick", "brown", "fox"]
unigram_result = 3/4

bigram_matches = ["the quick", "quick brown"]
bigram_reference = ["the quick", "quick brown", "brown fox"]
bigram_result = 2/3
```
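The same counts can be computed programmatically. This is a minimal sketch, assuming whitespace tokenization; the helper names `ngrams` and `ngram_overlap` are made up for illustration, and repeated n-grams are not clipped the way BLEU or ROUGE would clip them.

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens, joined into a string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_overlap(reference, prediction, n):
    """Return the reference n-grams found in the prediction and their share."""
    ref_grams = ngrams(reference.split(), n)
    pred_grams = set(ngrams(prediction.split(), n))
    matches = [gram for gram in ref_grams if gram in pred_grams]
    return matches, len(matches) / len(ref_grams)


reference = "the quick brown fox"
prediction = "the quick brown dog"

for n in (1, 2):
    matches, share = ngram_overlap(reference, prediction, n)
    print(n, matches, share)
```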

### Sentence comparison
#### BLEU (Bilingual evaluation understudy)

<img src="BLEU.JPG">

- penalizes predictions that are shorter than the reference (brevity penalty)
- geometric mean of the n-gram precisions for n = 1 to 4

https://cloud.google.com/translate/automl/docs/evaluate#bleu

In [26]:
from datasets import load_metric
bleu = load_metric("bleu")  # load_metric is deprecated; the evaluate library is its successor
predictions = [["the", "quick", "brown", "fox", "fox"]]  # toy example: treat this as a tokenized machine translation
references = [[["the", "quick", "brown", "fox"], ["the", "quick", "brown", "cat"]]]
score = bleu.compute(predictions=predictions, references=references)
print(score) # calculate the geometric mean of the precision of the n-grams with n=1,2,3,4


You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


{'bleu': 0.668740304976422, 'precisions': [0.8, 0.75, 0.6666666666666666, 0.5], 'brevity_penalty': 1.0, 'length_ratio': 1.25, 'translation_length': 5, 'reference_length': 4}
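The numbers in this output can be reproduced by hand. The sketch below is a simplified re-implementation for illustration, not the code behind the `bleu` metric: it clips each predicted n-gram count at its maximum count in any reference, takes the geometric mean of the four precisions, and applies a brevity penalty. It assumes the penalty compares against the shortest reference length; `modified_precision` and `simple_bleu` are hypothetical names.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(prediction, references, n):
    """Clipped n-gram precision: a predicted n-gram is credited at most as
    many times as it occurs in the reference where it is most frequent."""
    pred_counts = ngram_counts(prediction, n)
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngram_counts(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in pred_counts.items())
    total = sum(pred_counts.values())
    return clipped / total if total else 0.0


def simple_bleu(prediction, references, max_n=4):
    """Return (score, precisions, brevity_penalty) for one prediction."""
    precisions = [modified_precision(prediction, references, n) for n in range(1, max_n + 1)]
    pred_len = len(prediction)
    ref_len = min(len(ref) for ref in references)  # assumption: shortest reference
    if pred_len == 0 or min(precisions) == 0:
        return 0.0, precisions, 0.0 if pred_len == 0 else 1.0
    brevity_penalty = 1.0 if pred_len > ref_len else math.exp(1 - ref_len / pred_len)
    geometric_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    return brevity_penalty * geometric_mean, precisions, brevity_penalty


prediction = ["the", "quick", "brown", "fox", "fox"]
references = [["the", "quick", "brown", "fox"], ["the", "quick", "brown", "cat"]]
score, precisions, brevity_penalty = simple_bleu(prediction, references)
print(score, precisions, brevity_penalty)
```

The repeated "fox" is what lowers the unigram precision to 4/5: only one "fox" is credited because no reference contains it twice. The prediction is longer than the references, so the brevity penalty stays at 1.0.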


#### ROUGE (Recall-oriented understudy for gisting evaluation)


In [30]:
from datasets import load_metric
rouge = load_metric("rouge")  # load_metric is deprecated; the evaluate library is its successor
predictions = ["the quick brown fox fox"]
references = ["the quick brown fox"]
score = rouge.compute(predictions=predictions, references=references)
print(score)

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


{'rouge1': AggregateScore(low=Score(precision=0.8, recall=1.0, fmeasure=0.888888888888889), mid=Score(precision=0.8, recall=1.0, fmeasure=0.888888888888889), high=Score(precision=0.8, recall=1.0, fmeasure=0.888888888888889)), 'rouge2': AggregateScore(low=Score(precision=0.75, recall=1.0, fmeasure=0.8571428571428571), mid=Score(precision=0.75, recall=1.0, fmeasure=0.8571428571428571), high=Score(precision=0.75, recall=1.0, fmeasure=0.8571428571428571)), 'rougeL': AggregateScore(low=Score(precision=0.8, recall=1.0, fmeasure=0.888888888888889), mid=Score(precision=0.8, recall=1.0, fmeasure=0.888888888888889), high=Score(precision=0.8, recall=1.0, fmeasure=0.888888888888889)), 'rougeLsum': AggregateScore(low=Score(precision=0.8, recall=1.0, fmeasure=0.888888888888889), mid=Score(precision=0.8, recall=1.0, fmeasure=0.888888888888889), high=Score(precision=0.8, recall=1.0, fmeasure=0.888888888888889))}
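Where BLEU reports only precision, ROUGE reports precision, recall, and their F-measure. The values above can be checked with a small sketch, assuming plain whitespace tokenization with no stemming or other normalization; `rouge_n` and `rouge_l` are illustrative helpers, not the library's implementation. ROUGE-N counts overlapping n-grams, while ROUGE-L uses the length of the longest common subsequence.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def precision_recall_f(overlap, pred_total, ref_total):
    """Turn an overlap count into (precision, recall, F-measure)."""
    precision = overlap / pred_total if pred_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


def rouge_n(prediction, reference, n):
    """ROUGE-N: clipped n-gram overlap between prediction and reference."""
    pred = ngram_counts(prediction.split(), n)
    ref = ngram_counts(reference.split(), n)
    overlap = sum((pred & ref).values())  # Counter & keeps the minimum counts
    return precision_recall_f(overlap, sum(pred.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(prediction, reference):
    """ROUGE-L: precision, recall, and F-measure based on the LCS length."""
    pred_tokens, ref_tokens = prediction.split(), reference.split()
    return precision_recall_f(lcs_length(pred_tokens, ref_tokens), len(pred_tokens), len(ref_tokens))


prediction = "the quick brown fox fox"
reference = "the quick brown fox"
print("rouge1", rouge_n(prediction, reference, 1))
print("rouge2", rouge_n(prediction, reference, 2))
print("rougeL", rouge_l(prediction, reference))
```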


## Semantic evaluations
<img src="img/Llama2-performance.JPG">

### MMLU (Massive Multitask Language Understanding)
- [HuggingFace dataset](https://huggingface.co/datasets/lukaemon/mmlu)
- [Measuring Massive Multitask Language Understanding](https://arxiv.org/abs/2009.03300)

<img src="img/MMLU--domains.JPG">

A dataset consisting of multiple choice questions across a range of categories
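Scoring a model on a multiple-choice benchmark usually comes down to formatting each question as a prompt and measuring accuracy over the predicted answer letters. The sketch below illustrates that idea only: the question is made up rather than taken from MMLU, the prompt layout is an assumption, and `format_question` and `accuracy` are hypothetical helpers.

```python
CHOICE_LABELS = ["A", "B", "C", "D"]


def format_question(question, choices):
    """Render a question and its options as a lettered multiple-choice prompt."""
    lines = [question]
    lines += [f"{label}. {choice}" for label, choice in zip(CHOICE_LABELS, choices)]
    lines.append("Answer:")
    return "\n".join(lines)


def accuracy(predicted_labels, gold_labels):
    """Share of questions where the predicted letter equals the gold letter."""
    correct = sum(pred == gold for pred, gold in zip(predicted_labels, gold_labels))
    return correct / len(gold_labels)


# Made-up item for illustration; it is not part of the MMLU dataset.
item = {"question": "What is 2 + 2?", "choices": ["3", "4", "5", "6"], "answer": "B"}

print(format_question(item["question"], item["choices"]))
print(accuracy(["B", "C"], ["B", "D"]))  # one of two answers is correct
```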

<img src="img/HellaSwag-ex.JPG">
 
### ChatBot Arena
[Gradio client](https://chat.lmsys.org/?arena%3Fref=futuretools.io)

### HellaSwag
[HellaSwag: Can a Machine Really Finish Your Sentence?](https://arxiv.org/abs/1905.07830)

A sentence-completion benchmark built through adversarial filtering: discriminators select the most convincing incorrect endings, which are meant to trick the LLM into choosing the wrong answer.

### MT-Bench x LLM as a judge
Uses an LLM as a judge to evaluate a model's performance in a multi-turn conversation.

[Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena](https://arxiv.org/pdf/2306.05685.pdf)

<img src="img/LLM_judge-agreement.JPG">

Actual implementation: https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge


In [4]:
import transformers
from torch import bfloat16

model_instruct = "mistralai/Mistral-7B-Instruct-v0.2"

tokenizer_instruct = transformers.AutoTokenizer.from_pretrained(model_instruct)
pipeline = transformers.pipeline(
    model=model_instruct,
    task="text-generation",
    model_kwargs={"torch_dtype": bfloat16, "device_map": "auto"},
)   

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [29]:
question_1 = "Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions."
ref_answer_1 = "Title: Aloha, Hawaii! A Cultural and Natural Paradise Awaits\n\nSubheading: Uncovering the Rich Culture and Breathtaking Beauty of Hawaii\n\nAloha, fellow travelers! I recently embarked on an unforgettable journey to the heart of the Pacific Ocean, the stunning Hawaiian Islands. From the moment I stepped off the plane, I was enveloped by the welcoming spirit of Aloha, and I couldn't wait to explore the rich cultural experiences and must-see attractions that awaited me.\n\nMy first stop was the bustling city of Honolulu, located on the island of Oahu. As the state capital, Honolulu is the perfect starting point for any Hawaiian adventure. I immediately immersed myself in the local culture with a visit to the Bishop Museum, which houses the world's most extensive collection of Polynesian cultural artifacts. Here, I learned about the fascinating history of the Hawaiian people and marveled at the intricate craftsmanship of traditional Hawaiian quilts, feather capes, and ancient wooden carvings.\n\nAfter exploring the museum, I ventured to the iconic Iolani Palace, the only royal palace in the United States. The palace was the official residence of the Hawaiian monarchy from 1882 until its overthrow in 1893. The beautifully restored palace offered a unique glimpse into the lavish lifestyle of the Hawaiian royals, and I couldn't help but be captivated by the elegant architecture and lush, manicured gardens.\n\nThe next day, I took a short drive to the nearby Polynesian Cultural Center, where I was treated to an unforgettable luau experience. I watched in awe as skilled performers showcased traditional dances and music from various Polynesian cultures, including Hawaiian hula, Samoan fire knife dancing, and Maori poi spinning. 
The delicious feast that accompanied the performance featured mouthwatering kalua pig, lomi salmon, and haupia, a coconut milk-based dessert.\n\nAfter indulging in the cultural experiences of Oahu, it was time to explore the natural beauty of the Hawaiian Islands. I hopped on a short flight to the Big Island, where I was greeted by the awe-inspiring sight of Kilauea, one of the world's most active volcanoes. I took a guided hike through Hawaii Volcanoes National Park, where I was able to witness the raw power of nature as molten lava flowed into the sea, creating a dramatic plume of steam and ash.\n\nAs I continued my journey through the Big Island, I was struck by the island's diverse landscape. From the lush rainforests and cascading waterfalls of the Hamakua Coast to the black sand beaches of Punalu'u and the snow-capped peaks of Mauna Kea, there was no shortage of breathtaking scenery to discover.\n\nNo trip to Hawaii would be complete without visiting the enchanting island of Maui. I spent my days lounging on the pristine beaches of Ka'anapali and Wailea, snorkeling with sea turtles in the crystal-clear waters of Molokini, and driving the legendary Road to Hana, with its countless waterfalls and panoramic ocean views.\n\nAs my Hawaiian adventure came to a close, I couldn't help but be grateful for the incredible experiences I had encountered. From the rich cultural heritage of the Hawaiian people to the stunning natural beauty of the islands, Hawaii is truly a paradise waiting to be discovered. So, pack your bags and say \"Aloha\" to your next unforgettable journey - Hawaii is calling!"
question_2 = "Rewrite your previous response. Start every sentence with the letter A."
ref_answer_2 = "Astonishing adventures awaited me in Hawaii, the heart of the Pacific Ocean. Aroused by the spirit of Aloha from the moment I stepped off the plane, I anticipated the rich cultural experiences and must-see attractions I would encounter. Arriving in Honolulu, located on the island of Oahu, I appreciated the bustling atmosphere and the perfect starting point for any Hawaiian journey. A visit to the Bishop Museum immersed me in the local culture, admiring the world's most extensive collection of Polynesian cultural artifacts. Ancient Hawaiian quilts, feather capes, and wooden carvings held an alluring beauty.\n\nAn iconic attraction, Iolani Palace, drew my attention next. As the only royal palace in the United States, it housed the Hawaiian monarchy from 1882 until 1893. Architectural elegance and lush, manicured gardens captivated me during my visit. Another day, a short drive took me to the Polynesian Cultural Center for an unforgettable luau experience. Awe-inspiring performances of traditional dances and music from various Polynesian cultures filled the evening, including Hawaiian hula, Samoan fire knife dancing, and Maori poi spinning. Appetizing feasts featured mouthwatering kalua pig, lomi salmon, and haupia, a coconut milk-based dessert.\n\nAdventuring to the Big Island, I marveled at Kilauea, one of the world's most active volcanoes. A guided hike through Hawaii Volcanoes National Park allowed me to witness the raw power of nature, as molten lava flowed into the sea. Amidst the diverse landscape of the Big Island, I found lush rainforests, cascading waterfalls, black sand beaches, and snow-capped peaks.\n\nArriving in Maui, I enjoyed pristine beaches, snorkeling with sea turtles, and driving the legendary Road to Hana. As my Hawaiian adventure neared its end, appreciation for the incredible experiences filled me. Aloha to Hawaii, a paradise of rich cultural heritage and stunning natural beauty, where unforgettable memories are made!"

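# First turn: wrap the question in the [INST] ... [/INST] instruction format.
prompt = f"<s>[INST] {question_1} [/INST]"

# top_k=1 restricts sampling to the most likely token, so generation is effectively greedy.
generated_res_1 = pipeline(prompt, max_new_tokens=1000, do_sample=True, temperature=0.01, top_k=1)[0]['generated_text']
# The pipeline returns prompt plus completion; keep only the text after the first [/INST].
answer_1 = generated_res_1.split("[/INST]")[1]
print("Answer 1:" + answer_1[0:60] + "... | Token length: " + str(len(tokenizer_instruct.encode(answer_1, return_tensors="pt")[0])))
# Second turn: append the follow-up question to the full first-turn transcript.
prompt = generated_res_1 + "</s>" + f"[INST] {question_2}[/INST]"
generated_res_2 = pipeline(prompt, max_new_tokens=1000, do_sample=True, temperature=0.01, top_k=1)[0]['generated_text']
# The second answer is the text after the second [/INST] marker.
answer_2 = generated_res_2.split("[/INST]")[2]
print("Answer 2:" + answer_2[0:60] + "... | Token length: " + str(len(tokenizer_instruct.encode(answer_2, return_tensors="pt")[0])))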


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Answer 1: Title: Aloha from the Heart of Hawaii: Unforgettable Cultur... | Token length: 680
Answer 2: Title: Aloha from the Heart of Hawaii: Unforgettable Cultur... | Token length: 467


In [32]:
print(f"Answer 1:\n{answer_1[0:400]}...\n\nAnswer 2:\n{answer_2[0:400]}...")

Answer 1:
 Title: Aloha from the Heart of Hawaii: Unforgettable Cultural Experiences and Breathtaking Attractions

As I step off the plane, the warm, tropical air of Hawaii greets me with a gentle caress, carrying the sweet scent of blooming hibiscus and the soothing sound of traditional Hawaiian music. The vibrant colors of the island's landscape unfold before me, a stunning tapestry of lush greenery, crys...

Answer 2:
 Title: Aloha from the Heart of Hawaii: Unforgettable Cultural Experiences and Breathtaking Attractions

Aloha and welcome to my travel blog, where I share my recent journey to the heart of Hawaii. As I stepped off the plane, I was greeted by the warm, tropical air, carrying the sweet scent of blooming hibiscus and the soothing sound of traditional Hawaiian music.

An array of vibrant colors unfol...
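The constraint in `question_2` ("Start every sentence with the letter A") is mechanical, so it can also be checked symbolically without a judge model. This is a rough sketch, assuming sentences end in `.`, `!` or `?` followed by whitespace; `share_starting_with` is a hypothetical helper, and a proper sentence tokenizer would handle titles and abbreviations more reliably.

```python
import re


def share_starting_with(text, letter):
    """Share of sentences in text whose first character is the given letter."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return 0.0
    hits = sum(sentence[0].upper() == letter.upper() for sentence in sentences)
    return hits / len(sentences)


# Made-up text for illustration: two of the three sentences start with "A".
sample = "Aloha there. A day passed. Title here."
print(share_starting_with(sample, "A"))
```

Calling `share_starting_with(answer_2, "A")` applies the same check to the model's second answer; a value of 1.0 would mean every detected sentence follows the instruction.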


In [34]:
generated_score = pipeline(f"""<s>[INST] Please act as an impartial judge and evaluate the quality of the response provided by an 
AI assistant to the user question. Your evaluation should consider correctness and 
helpfulness. You will be given a reference answer and the assistant's answer. Your 
evaluation should focus on the assistant's answer to the second question. Begin your 
evaluation by comparing the assistant's answer with the reference answer. Identify and 
correct any mistakes. Be as objective as possible. After providing your explanation, you 
must rate the response on a scale of 1 to 10 by strictly following this format: 
"[[rating]]", for example: "Rating: [[5]]".
<|The Start of Reference Answer|>
### User:
{question_1}
### Reference answer:
{ref_answer_1}
### User:
{question_2}
### Reference answer:
{ref_answer_2}
<|The End of Reference Answer|>
<|The Start of Assistant A's Conversation with User|>
### User:
{question_1} 
### Assistant A:
{answer_1}
### User:
{question_2}
### Assistant A:
{answer_2}
<|The End of Assistant A's Conversation with User|> [/INST]""", max_new_tokens=1000, do_sample=True, temperature=0.01, top_k=1)[0]['generated_text']
response = generated_score.split("[/INST]")[1]
print("generated_score: " + response)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


generated_score:  Rating: [[10]]

Assistant A's response is an excellent rewrite of the reference answer, starting every sentence with the letter "A." The response maintains the original's engaging tone and highlights the cultural experiences and must-see attractions in Hawaii. There are no mistakes or incorrect information in the response. Overall, it is a well-written and helpful answer.
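Because the prompt asks for the score in the `[[rating]]` format, the numeric value can be pulled out of the judge's reply with a regular expression. The helper below is an illustrative sketch, not the parsing code used by FastChat; `extract_rating` is a hypothetical name, and it returns `None` when the judge did not follow the format.

```python
import re


def extract_rating(judge_output):
    """Return the first [[number]] in the judge's reply as a float, or None."""
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", judge_output)
    return float(match.group(1)) if match else None


print(extract_rating("Rating: [[10]]\n\nAssistant A's response is an excellent rewrite."))
print(extract_rating("The response is good but no score was given."))
```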


<img src="https://media.tenor.com/-iop9obK0IwAAAAe/i-want-to-play-a-game-play-time.png">

- If you correctly answer 4 MMLU questions then you are smarter than Llama-2!