## Berlin Buzzwords 2024

This notebook shows how the [DeepEval framework](https://docs.confident-ai.com) can be used to:
1. Apply LLM-as-a-judge evaluation with G-Eval.
2. Evaluate generated summaries.

## 1. Installing packages and setting up environment

In [9]:
! pip install -U deepeval python-dotenv ipytest

Collecting ipytest
  Downloading ipytest-0.14.2-py3-none-any.whl (18 kB)
Collecting jedi>=0.16 (from ipython->ipytest)
  Downloading jedi-0.19.1-py2.py3-none-any.whl (1.6 MB)
Installing collected packages: jedi, ipytest
Successfully installed ipytest-0.14.2 jedi-0.19.1


In [2]:
import nest_asyncio
nest_asyncio.apply()

In [10]:
import ipytest
ipytest.autoconfig()

In [11]:
import os
import dotenv
import getpass
from google.colab import drive

drive.mount('/content/drive')

dotenv.load_dotenv('/content/drive/MyDrive/.env')

openai_api_key = os.environ.get('OPENAI_API_KEY')

if not openai_api_key:
    openai_api_key = getpass.getpass("Enter OpenAI API key:")

os.environ["OPENAI_API_KEY"] = openai_api_key

Mounted at /content/drive


## 2. How can token log probabilities be obtained from OpenAI models?

In [12]:
from openai import OpenAI

# The client reads OPENAI_API_KEY from the environment set above.
client = OpenAI()

response = client.chat.completions.create(
  model="gpt-4",
  messages=[
    {
      "role": "system",
      "content": "You will be provided with statements, and your task is to convert them to standard English."
    },
    {
      "role": "user",
      "content": "She no went to the market."
    }
  ],
  temperature=0.7,
  max_tokens=64,
  logprobs=True  # return the log probability of each generated token
)

response

ChatCompletion(id='chatcmpl-9YUv53zV20gvbTb2t0t786wp91T8Y', choices=[Choice(finish_reason='stop', index=0, logprobs=ChoiceLogprobs(content=[ChatCompletionTokenLogprob(token='She', bytes=[83, 104, 101], logprob=-2.319992e-05, top_logprobs=[]), ChatCompletionTokenLogprob(token=' didn', bytes=[32, 100, 105, 100, 110], logprob=-0.33082667, top_logprobs=[]), ChatCompletionTokenLogprob(token="'t", bytes=[39, 116], logprob=0.0, top_logprobs=[]), ChatCompletionTokenLogprob(token=' go', bytes=[32, 103, 111], logprob=-3.1281633e-07, top_logprobs=[]), ChatCompletionTokenLogprob(token=' to', bytes=[32, 116, 111], logprob=-1.9361265e-07, top_logprobs=[]), ChatCompletionTokenLogprob(token=' the', bytes=[32, 116, 104, 101], logprob=-1.0280384e-06, top_logprobs=[]), ChatCompletionTokenLogprob(token=' market', bytes=[32, 109, 97, 114, 107, 101, 116], logprob=-1.9361265e-07, top_logprobs=[]), ChatCompletionTokenLogprob(token='.', bytes=[46], logprob=-5.2001665e-06, top_logprobs=[])]), message=ChatComple
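
Each `logprob` in the response is the natural logarithm of a token's probability, so `math.exp` recovers the probability, and summing the log probabilities gives the log probability of the whole completion. A minimal sketch, using the first few values copied from the output above (the helper name `to_probabilities` is an illustrative assumption, not part of the OpenAI SDK):

```python
import math

# (token, logprob) pairs copied from the response printed above.
token_logprobs = [
    ("She", -2.319992e-05),
    (" didn", -0.33082667),
    ("'t", 0.0),
    (" go", -3.1281633e-07),
]


def to_probabilities(pairs):
    """Convert (token, logprob) pairs into (token, probability) pairs."""
    return [(token, math.exp(logprob)) for token, logprob in pairs]


probabilities = to_probabilities(token_logprobs)
for token, probability in probabilities:
    print(f"{token!r}: {probability:.4f}")

# Summing log probabilities is equivalent to multiplying probabilities.
sequence_logprob = sum(logprob for _, logprob in token_logprobs)
sequence_probability = math.exp(sequence_logprob)
print(f"sequence probability: {sequence_probability:.4f}")
```

With the real response object, the same pairs would come from `response.choices[0].logprobs.content`, where each item exposes `token` and `logprob` attributes as shown in the printed output.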

## 3. How to use G-Eval within the DeepEval framework

In [13]:
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams
from deepeval.models.gpt_model import GPTModel

model = GPTModel(
    model="gpt-4",
    temperature=0.3
)

correctness_metric = GEval(
    model=model,
    name="Correctness",
    criteria="Determine whether the actual output is factually correct based on the expected output.",
    # NOTE: you can only provide either criteria or evaluation_steps, and not both
    # evaluation_steps=[
    #     "Check whether the facts in 'actual output' contradict any facts in 'expected output'",
    #     "You should also heavily penalize omission of detail",
    #     "Vague language, or contradicting OPINIONS, are OK"
    # ],
    # The criteria refer to the expected output, so it is passed to the judge as well.
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)

from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="The dog chased the cat up the tree, who ran up the tree?",
    actual_output="It depends, some might consider the cat, while others might argue the dog.",
    expected_output="The cat."
)

correctness_metric.measure(test_case)

print("Evaluation steps:")
for eval_step in correctness_metric.evaluation_steps:
  print("-", eval_step)
print("Score:", correctness_metric.score)
print("Reason:", correctness_metric.reason)

Evaluation steps:
- Compare the actual output to the expected output to determine if they match.
- Verify the factual accuracy of the actual output by cross-referencing with reliable sources.
- If the actual output does not match the expected output, identify the discrepancies and analyze the reasons for the differences.
- Evaluate the quality of the actual output based on its accuracy, completeness, and relevance to the expected output.
Score: 0.06948416024836306
Reason: The actual output does not match the expected output. The factual accuracy is incorrect as the input clearly states that the cat ran up the tree, not the dog.
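
The score above is not a round number because G-Eval, as described in the original G-Eval paper, weights each candidate score by the probability the judge model assigns to its token. This is where the log probabilities from section 2 come in. The sketch below illustrates that idea only: it is not DeepEval's implementation, and the function name `weighted_score`, the 0 to 10 scale, and the example log probabilities are assumptions made up for illustration.

```python
import math


def weighted_score(top_logprobs, min_score=0, max_score=10):
    """Probability-weighted average of candidate scores, normalized to 0..1.

    top_logprobs maps candidate tokens to their log probabilities.
    Tokens that are not integers within [min_score, max_score] are ignored.
    """
    score_probs = {}
    for token, logprob in top_logprobs.items():
        text = token.strip()
        if text.isdigit() and min_score <= int(text) <= max_score:
            score = int(text)
            score_probs[score] = score_probs.get(score, 0.0) + math.exp(logprob)

    total = sum(score_probs.values())
    if total == 0:
        raise ValueError("no valid score tokens found")

    # Renormalize over the score tokens only, then take the expectation.
    expected = sum(score * prob for score, prob in score_probs.items()) / total
    return expected / max_score


# Hypothetical candidates for the position where the judge emits its score.
hypothetical_top_logprobs = {
    "0": math.log(0.40),
    "1": math.log(0.45),
    "2": math.log(0.10),
    "The": math.log(0.05),  # not a score token, so it is ignored
}
g_eval_like_score = weighted_score(hypothetical_top_logprobs)
print(g_eval_like_score)
```

Because most of the probability mass sits on the lowest scores, the weighted result is a small fraction rather than an integer, similar in spirit to the score reported above.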


## 4. Implementing unit tests with the DeepEval framework

In [17]:
%%ipytest

from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_answer_relevancy_with_context():
    # The test passes when the relevancy score is at least the threshold.
    answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        actual_output="We offer a 30-day full refund at no extra costs.",
        # Not required by AnswerRelevancyMetric; kept for metrics that use retrieved context.
        retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
    )
    assert_test(test_case, [answer_relevancy_metric])

.                                                                                            [100%]



In [18]:
%%ipytest


from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

def test_answer_relevancy():
    answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",

        actual_output="We offer a 30-day full refund at no extra cost."
    )
    assert_test(test_case, [answer_relevancy_metric])

.                                                                                            [100%]
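
According to the DeepEval documentation, the answer relevancy score is the share of statements in `actual_output` that are relevant to `input`, and `assert_test` fails when the score falls below `threshold`. A minimal sketch of that arithmetic, with hypothetical verdicts standing in for the judge model's per-statement decisions (`relevancy_score` and `passes` are illustrative names, not DeepEval functions):

```python
def relevancy_score(verdicts):
    """Share of statements judged relevant; verdicts is a list of booleans."""
    if not verdicts:
        raise ValueError("at least one statement is required")
    return sum(verdicts) / len(verdicts)


def passes(score, threshold=0.5):
    """A metric counts as passed when its score reaches the threshold."""
    return score >= threshold


# Hypothetical verdicts for an answer split into three statements.
verdicts = [True, True, False]
score = relevancy_score(verdicts)
print(f"score={score:.2f}, passed={passes(score)}")
```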



## 5. Preparing a dataset for summarization

In [19]:
from datasets import load_dataset

wiki = load_dataset("d0rj/wikisum")

# Copy the slice so that adding a column later does not write into a view.
wiki_df = wiki["test"].to_pandas().iloc[:20].copy()

In [20]:
wiki_df.head()

Unnamed: 0,url,title,summary,article,step_headers
0,https://www.wikihow.com/Make-Chinese-Fried-Rice,How to Make Chinese Fried Rice,"To make Chinese fried rice, cook your rice at ...","Rinse the rice. Before cooking your rice, it's...",Rinse the rice. Bring the water to a boil. Add...
1,https://www.wikihow.com/Write-a-Song-Parody,How to Write a Song Parody,Writing a song parody is a fun way to show off...,"Listen to other parodies. Weird Al Yankovic, L...",Listen to other parodies. Think about your tar...
2,https://www.wikihow.com/Clean-Terrazzo,How to Clean Terrazzo,"To clean terrazzo, make a basic cleaning solut...",Grab an empty spray bottle. Either buy an empt...,Grab an empty spray bottle. Mix a solution wit...
3,https://www.wikihow.com/Become-a-Neonatal-Nurse,How to Become a Neonatal Nurse,"To become a neonatal nurse, you'll need to lov...",Enroll in math and science courses. Depending ...,Enroll in math and science courses. Work hard ...
4,https://www.wikihow.com/Adjust,How to Adjust,"It can seem stressful at first, but know that ...",Allow yourself to feel upset. You won't be doi...,Allow yourself to feel upset. Release your exp...


In [21]:
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

summarization_prompt = PromptTemplate.from_template(
"""Write a concise summary of the following:


```{text}```


CONCISE SUMMARY:
"""
)

llm = ChatOpenAI(
    model="gpt-4",
    temperature=0.0,
    max_tokens=1000,
)
chain = summarization_prompt | llm


wiki_df.loc[:, "actual_summary"] = wiki_df["article"].apply(
    lambda text: chain.invoke({"text": text}).content
)

In [22]:
wiki_df

Unnamed: 0,url,title,summary,article,step_headers,actual_summary
0,https://www.wikihow.com/Make-Chinese-Fried-Rice,How to Make Chinese Fried Rice,"To make Chinese fried rice, cook your rice at ...","Rinse the rice. Before cooking your rice, it's...",Rinse the rice. Bring the water to a boil. Add...,The text provides a detailed recipe for making...
1,https://www.wikihow.com/Write-a-Song-Parody,How to Write a Song Parody,Writing a song parody is a fun way to show off...,"Listen to other parodies. Weird Al Yankovic, L...",Listen to other parodies. Think about your tar...,The text provides a comprehensive guide on how...
2,https://www.wikihow.com/Clean-Terrazzo,How to Clean Terrazzo,"To clean terrazzo, make a basic cleaning solut...",Grab an empty spray bottle. Either buy an empt...,Grab an empty spray bottle. Mix a solution wit...,The article provides a detailed guide on how t...
3,https://www.wikihow.com/Become-a-Neonatal-Nurse,How to Become a Neonatal Nurse,"To become a neonatal nurse, you'll need to lov...",Enroll in math and science courses. Depending ...,Enroll in math and science courses. Work hard ...,"To become a neonatal nurse, one should take as..."
4,https://www.wikihow.com/Adjust,How to Adjust,"It can seem stressful at first, but know that ...",Allow yourself to feel upset. You won't be doi...,Allow yourself to feel upset. Release your exp...,The text provides advice on how to cope with m...
5,https://www.wikihow.com/Take-Care-of-Your-Horse,How to Take Care of Your Horse,"To take care of your horse, keep it in a clean...",Lead your horse to the grooming area of your b...,Lead your horse to the grooming area of your b...,The text provides detailed instructions on how...
6,https://www.wikihow.com/Take-off-Fake-Nails,How to Take off Fake Nails,The easiest way to remove acrylic nails is to ...,Trim down acrylic nails to reduce the surface ...,Trim down acrylic nails to reduce the surface ...,The text provides a detailed guide on how to r...
7,https://www.wikihow.com/Do-In%E2%80%90text-Cit...,How to Do In‐text Citations in MLA,"To do in-text citations in MLA, provide the la...",Put the author's last name and page number in ...,Put the author's last name and page number in ...,The text provides detailed instructions on how...
8,https://www.wikihow.com/Enjoy-a-Festival,How to Enjoy a Festival,"To enjoy a festival, make sure you pack a bag ...",Purchase the tickets well in advance to avoid ...,Purchase the tickets well in advance to avoid ...,The text provides a comprehensive guide to att...
9,https://www.wikihow.com/Keep-Sliced-Bananas-fr...,How to Keep Sliced Bananas from Discoloring,"To keep sliced bananas from discoloring, dip t...",Buy fruit juice or squeeze your own. There are...,Buy fruit juice or squeeze your own. Coat the ...,The article provides various methods to preven...


## 6. Running evaluation


In [26]:
from deepeval import evaluate
from deepeval.metrics import SummarizationMetric
from deepeval.test_case import LLMTestCase


test_cases = []
for _, row in wiki_df.iloc[:5].iterrows():
  test_cases.append(
      LLMTestCase(
          input=row["article"],
          actual_output=row["actual_summary"],
          # expected_output=row["summary"],

      )
  )

metric = SummarizationMetric(
    threshold=0.5,
    model="gpt-4"
)


result = evaluate(test_cases, [metric])

Evaluating test cases...
Event loop is already running. Applying nest_asyncio patch to allow async execution...







Metrics Summary

  - ✅ Summarization (score: 0.6, threshold: 0.5, strict: False, evaluation model: gpt-4, reason: The score is 0.60 because while there are no contradictions between the summary and the original text, the summary includes extra information not mentioned in the original. Additionally, the summary fails to answer specific questions that the original text could have answered, such as proper cooking techniques and possible alternatives for cooking equipment., error: None)
      - Alignment (score: 0.9)
      - Coverage (score: 0.6)

For test case:

  - input: Rinse the rice. Before cooking your rice, it's a good idea to rinse it in a strainer. That will remove any dust, starch, or other debris that might be on the surface of the rice. Place the rice in a strainer, and rinse it with cold water. If you're short on time, you can skip rinsing the rice. Be aware that you may wind up with stickier cooked rice if you don't, though. Bring the water to a boil. In a small to medium
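
In the summary above, the overall score (0.6) equals the lower of the alignment (0.9) and coverage (0.6) scores, which matches the DeepEval documentation's description of the summarization score as the minimum of the two. A minimal sketch of that combination and the threshold check (`summarization_score` is an illustrative name, not a DeepEval function):

```python
def summarization_score(alignment, coverage):
    """Combine the two sub-scores by taking the weaker one."""
    return min(alignment, coverage)


# Sub-scores and threshold taken from the metrics summary above.
alignment, coverage, threshold = 0.9, 0.6, 0.5

overall = summarization_score(alignment, coverage)
passed = overall >= threshold
print(f"overall={overall}, passed={passed}")
```

Taking the minimum means a summary cannot compensate for missing coverage with high alignment, or the other way round.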



In [25]:
for _, row in wiki_df.iterrows():
    # print("Reference summary:")
    # print(row["summary"])
    print("URL:")
    print(row["url"])
    print("Observed summary:")
    print(row["actual_summary"])
    print("-" * 50)



URL:
https://www.wikihow.com/Make-Chinese-Fried-Rice
Observed summary:
The text provides a detailed recipe for making fried rice. It begins with rinsing the rice and boiling it in a 1:2 rice to water ratio. After adding salt and optional butter, the rice is cooked for at least 18 minutes. The cooked rice is then spread on a tray and dried under a fan for an hour. A wok is heated on the stove, and a neutral oil is added. Ginger, garlic, onion, peas, and carrots are cooked until tender, then beaten eggs are scrambled in the pan. The cooked rice, soy sauce, and sesame oil are added and mixed well. The mixture is fried until heated through, and can be garnished with chopped green onions.
--------------------------------------------------
URL:
https://www.wikihow.com/Write-a-Song-Parody
Observed summary:
The text provides a comprehensive guide on how to create a music parody. It suggests listening to other parodies to understand the genre, considering the target audience, and deciding on th