In [11]:
! pip install -q datasets ragas langchain_openai

In [12]:
import os
import getpass
os.environ['OPENAI_API_KEY'] = getpass.getpass('Enter your OpenAI API key: ')
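
The cell above prompts for the key on every run. A minimal sketch of a variant that reuses a key already present in the environment and prompts only when it is missing; the helper name `get_api_key` is hypothetical and not part of any library:

```python
import os
import getpass


def get_api_key(var_name="OPENAI_API_KEY"):
    """Return the key from the environment, prompting only if it is not set."""
    key = os.environ.get(var_name)
    if not key:
        # getpass hides the typed value, so it never shows up in cell output.
        key = getpass.getpass(f"Enter your {var_name}: ")
        # Store it so libraries that read the environment can find it.
        os.environ[var_name] = key
    return key
```

Calling `get_api_key()` in place of the direct assignment keeps the key out of the notebook source either way.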

In [13]:
from datasets import load_dataset
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, SemanticSimilarity
from ragas import evaluate, EvaluationDataset
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

In [14]:
dataset = load_dataset("explodinggradients/amnesty_qa", "english_v3", trust_remote_code=True)

Repo card metadata block was not found. Setting CardData to empty.


In [15]:
eval_dataset = EvaluationDataset.from_hf_dataset(dataset["eval"])

In [16]:
# Wrap the LangChain chat model and embeddings so ragas metrics can call them.
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
evaluator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

In [17]:
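# Three metrics are judged by the evaluator LLM; SemanticSimilarity
# instead compares embeddings of the response and the reference.
metrics = [
    LLMContextRecall(llm=evaluator_llm),
    FactualCorrectness(llm=evaluator_llm),
    Faithfulness(llm=evaluator_llm),
    SemanticSimilarity(embeddings=evaluator_embeddings),
]
# One job is scheduled per sample per metric.
results = evaluate(dataset=eval_dataset, metrics=metrics)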

Evaluating:   0%|          | 0/80 [00:00<?, ?it/s]

Exception raised in Job[6]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-wB2k12M1y8e9X18w0PPWNQAd on tokens per min (TPM): Limit 30000, Used 29602, Requested 2292. Please try again in 3.788s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
Exception raised in Job[61]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-wB2k12M1y8e9X18w0PPWNQAd on tokens per min (TPM): Limit 30000, Used 29754, Requested 2431. Please try again in 4.37s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
Exception raised in Job[29]: TimeoutError()
Exception raised in Job[74]: TimeoutError()
Exception raised in Job[62]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization o
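
The 429 responses above report that the tokens-per-minute limit was reached and suggest retrying after a few seconds. Jobs that fail this way are the likely source of empty metric cells in the results. ragas has its own retry and concurrency settings (see the `RunConfig` documentation for your installed version). The sketch below is not that mechanism. It only illustrates the general idea of retrying with exponential backoff; `call_with_backoff` and its parameters are hypothetical names.

```python
import random
import time


def call_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=30.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call fn(), retrying with exponential backoff and jitter on failure."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # The delay doubles on each attempt and is capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Random jitter keeps parallel jobs from retrying in lockstep.
            sleep(delay * random.uniform(0.5, 1.0))
```

In practice `retry_on` would be narrowed to the rate-limit and timeout exception types rather than every `Exception`.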

In [18]:
df = results.to_pandas()
df.head()

index,user_input,retrieved_contexts,response,reference,context_recall,factual_correctness,faithfulness,semantic_similarity
0,What are the global implications of the USA Su...,"[- In 2022, the USA Supreme Court handed down ...",The global implications of the USA Supreme Cou...,The global implications of the USA Supreme Cou...,1.0,0.53,,0.959247
1,Which companies are the main contributors to G...,"[In recent years, there has been increasing pr...","According to the Carbon Majors database, the m...","According to the Carbon Majors database, the m...",1.0,0.06,,0.94182
2,Which private companies in the Americas are th...,[The issue of greenhouse gas emissions has bec...,"According to the Carbon Majors database, the l...",The largest private companies in the Americas ...,1.0,0.26,0.0,0.959316
3,What action did Amnesty International urge its...,"[In the case of the Ogoni 9, Amnesty Internati...",Amnesty International urged its supporters to ...,Amnesty International urged its supporters to ...,1.0,0.25,0.6,0.926988
4,What are the recommendations made by Amnesty I...,"[In recent years, Amnesty International has fo...",Amnesty International made several recommendat...,The recommendations made by Amnesty Internatio...,1.0,0.07,,0.919153


In [19]:
df

index,user_input,retrieved_contexts,response,reference,context_recall,factual_correctness,faithfulness,semantic_similarity
0,What are the global implications of the USA Su...,"[- In 2022, the USA Supreme Court handed down ...",The global implications of the USA Supreme Cou...,The global implications of the USA Supreme Cou...,1.0,0.53,,0.959247
1,Which companies are the main contributors to G...,"[In recent years, there has been increasing pr...","According to the Carbon Majors database, the m...","According to the Carbon Majors database, the m...",1.0,0.06,,0.94182
2,Which private companies in the Americas are th...,[The issue of greenhouse gas emissions has bec...,"According to the Carbon Majors database, the l...",The largest private companies in the Americas ...,1.0,0.26,0.0,0.959316
3,What action did Amnesty International urge its...,"[In the case of the Ogoni 9, Amnesty Internati...",Amnesty International urged its supporters to ...,Amnesty International urged its supporters to ...,1.0,0.25,0.6,0.926988
4,What are the recommendations made by Amnesty I...,"[In recent years, Amnesty International has fo...",Amnesty International made several recommendat...,The recommendations made by Amnesty Internatio...,1.0,0.07,,0.919153
5,Who are the target audience of the two books c...,"[In addition to children, parents, teachers, a...",The target audience of the two books created b...,The target audience of the two books created b...,1.0,1.0,0.5,0.987055
6,Which right guarantees access to comprehensive...,[The right to truth is a fundamental human rig...,The right that guarantees access to comprehens...,The right that guarantees access to comprehens...,1.0,1.0,1.0,0.993882
7,Who has the right to be fully informed about h...,"[In many cases, the identities of perpetrators...",Everyone has the right to be fully informed ab...,The victims of gross human rights violations a...,1.0,,,0.942646
8,When can individuals be found guilty under Art...,[Article 207.3 of the Russian Criminal Code pe...,Under Article 207.3 of the Russian Criminal Co...,Individuals can be found guilty under Article ...,1.0,0.4,0.0,0.911981
9,When does the prosecution consider statements ...,[- As long as their statements are contrary to...,Under Article 207.3 of the Russian Criminal Co...,The prosecution considers statements contrary ...,1.0,0.33,0.285714,0.951835
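
Values such as 0.6 and 0.285714 in the `faithfulness` column are consistent with ratios of supported claims to total claims (for example 3 of 5 and 2 of 7), while `semantic_similarity` is an embedding similarity. The sketch below shows only that arithmetic on made-up inputs. It is a simplification: in ragas the claim verdicts come from the evaluator LLM and the vectors from the embedding model, and the function names here are hypothetical.

```python
import math


def supported_fraction(verdicts):
    """Fraction of claims judged supported (True) out of all claims."""
    if not verdicts:
        return float("nan")  # no claims, so no score can be computed
    return sum(verdicts) / len(verdicts)


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical verdicts: 3 of 5 claims are supported by the retrieved context.
print(supported_fraction([True, True, True, False, False]))  # 0.6

# Hypothetical 3-dimensional "embeddings" of a response and a reference.
print(cosine_similarity([1.0, 0.0, 1.0], [1.0, 1.0, 0.0]))
```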


## Quiz Answers

1. **What are the key metrics used to evaluate the LLM in this lab, and why are they important for responsible AI?**

   The key metrics used in this lab are:
   - `LLMContextRecall`: Estimates how much of the information in the reference answer is supported by the retrieved contexts, so it reflects retrieval quality rather than the model's memory.
   - `FactualCorrectness`: Compares the claims in the LLM's response against the reference answer, which matters in real-world applications where inaccurate information can have serious consequences.
   - `Faithfulness`: Measures whether the claims in the response are supported by the retrieved contexts; a low score points to hallucinated or unsupported statements.
   - `SemanticSimilarity`: Scores how close the response is in meaning to the reference answer, using embedding similarity.

   These metrics matter for responsible AI because they quantify whether answers are grounded in the retrieved sources and agree with a trusted reference, so unreliable outputs can be caught before deployment.

2. **How does the `LangchainLLMWrapper` ensure that the evaluation process is repeatable?**

   The `LangchainLLMWrapper` exposes a LangChain chat model through the standard interface that the ragas metrics expect, so the evaluator model can be swapped without changing the metric code. This keeps the evaluation procedure consistent and easy for others to rerun. It does not make the LLM's outputs deterministic, so scores can still vary between runs unless the model, its version, and its sampling settings are also fixed and recorded.

3. **Why is it important to assess the `FactualCorrectness` of an LLM when deploying it in real-world applications?**

   Assessing `FactualCorrectness` is critical because LLMs are often used in settings where accurate information is essential, such as legal, medical, or financial applications. Checking responses against a trusted reference answer reduces the risk of spreading misinformation, which is a key component of responsible AI.

4. **Explain how using environment variables for the API key contributes to responsible AI practices.**

   The notebook reads the API key with `getpass` and stores it in the `OPENAI_API_KEY` environment variable, so the key never appears in the notebook source or its saved output. Keeping credentials out of code and version control reduces the risk of unauthorized access, which supports the security side of responsible AI practice.

5. **What is the significance of converting the evaluation results into a DataFrame for further analysis?**

   Converting the evaluation results into a DataFrame makes them easier to filter, aggregate, and visualize. Per-sample scores sit next to the question, retrieved contexts, response, and reference, so low-scoring or missing (NaN) entries can be traced back to specific examples. The same table can be shared with others for verification or replication, which supports transparent and repeatable evaluation.
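
As a rough illustration of that kind of analysis, the sketch below rebuilds the four metric columns from the ten rows displayed earlier in the notebook and summarizes them with pandas. The names `scores`, `summary`, and `incomplete` are illustrative; in the notebook itself the same calls could be applied to the metric columns of `df`.

```python
import pandas as pd

# Metric values copied from the ten displayed rows; None marks an empty cell.
scores = pd.DataFrame({
    "context_recall": [1.0] * 10,
    "factual_correctness": [0.53, 0.06, 0.26, 0.25, 0.07,
                            1.0, 1.0, None, 0.4, 0.33],
    "faithfulness": [None, None, 0.0, 0.6, None,
                     0.5, 1.0, None, 0.0, 0.285714],
    "semantic_similarity": [0.959247, 0.94182, 0.959316, 0.926988, 0.919153,
                            0.987055, 0.993882, 0.942646, 0.911981, 0.951835],
})

summary = pd.DataFrame({
    "mean": scores.mean(),           # NaN cells are skipped by default
    "missing": scores.isna().sum(),  # how many samples have no score
})
print(summary)

# Rows where at least one metric is missing and may need re-evaluation.
incomplete = scores[scores.isna().any(axis=1)]
print(incomplete.index.tolist())
```

Reporting the missing count next to each mean matters here, because a mean computed over only the jobs that succeeded can be misleading.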
