# Lab02: Evaluating a Large Language Model (LLM)
In this lab, we walk through how to evaluate an LLM with metrics that support responsible AI and repeatable experiments. We will install the necessary libraries, load a dataset, define evaluation metrics, and generate results to assess the model's performance. The exercise focuses on validating model output for factual correctness, context recall, and faithfulness in real-world applications.

### Step 1: Installing Required Libraries
In this step, we install the libraries needed for the evaluation: `datasets` to load the data, `ragas` for the metrics, and `langchain_openai` to call OpenAI's models. Using published, widely available packages makes the setup easy for others to reproduce; pinning exact versions would make it stricter still, because an unpinned install pulls whatever release is current.

In [None]:
! pip install -q datasets ragas langchain_openai
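
One way to make the environment easier to reproduce is to record which versions were actually installed. The sketch below uses only the standard library; the helper name `installed_versions` is hypothetical, and the distribution names passed to it are assumptions based on the install line above.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Record the versions next to your results so a rerun can pin the same ones.
for name, version in installed_versions(["datasets", "ragas", "langchain-openai"]).items():
    print(f"{name}=={version}" if version else f"{name} is not installed")
```

The printed `name==version` lines can be pasted into a later `pip install` command to recreate the same setup.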

### Step 2: Setting Up API Key
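Here, we make sure the `OPENAI_API_KEY` and `HF_TOKEN` credentials are available as environment variables. For each one that is not already set, the code first looks in Colab's secrets (`google.colab.userdata`) and, if the secret cannot be read there, falls back to Google Cloud Secret Manager in the class project. The OpenAI key allows us to access OpenAI's models for evaluation. Keeping keys in environment variables or a secret store, rather than hard-coding them in the notebook, is a security best practice.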

In [None]:
import os

for keyer in ("OPENAI_API_KEY", "HF_TOKEN"):
  if not os.environ.get(keyer):
    try:
      # First choice: a secret stored with the Colab notebook.
      from google.colab import userdata
      os.environ[keyer] = userdata.get(keyer)
    except Exception:
      # Not running in Colab, or the secret is missing: use Secret Manager.
      print(f"{keyer} key not found, looking in caltech class project")
      from google.cloud import secretmanager
      client = secretmanager.SecretManagerServiceClient()
      response = client.access_secret_version(request={"name": f"projects/240830225929/secrets/{keyer}/versions/1"})
      os.environ[keyer] = response.payload.data.decode("UTF-8")
  print(f"all set with {keyer}")
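
The lookup order used in this step (environment first, then one secret source, then another) can be expressed as a generic fallback chain. This is a minimal sketch rather than the lab's code: `ensure_env_key` and the two stand-in loader functions are hypothetical, and a real loader would wrap a Colab or Secret Manager call.

```python
import os


def ensure_env_key(name, loaders, environ=os.environ):
    """Set environ[name] from the first loader that returns a value.

    Each loader is a callable taking the key name. A loader that raises or
    returns nothing is skipped. Returns True if the key ends up set.
    """
    if environ.get(name):
        return True
    for loader in loaders:
        try:
            value = loader(name)
        except Exception:
            continue  # this source is unavailable; try the next one
        if value:
            environ[name] = value
            return True
    return False


# Hypothetical stand-ins for real secret sources.
def missing_source(name):
    raise KeyError(name)


def demo_source(name):
    return f"demo-value-for-{name}"


env = {}  # a plain dict keeps the demo from touching the real environment
print(ensure_env_key("EXAMPLE_KEY", [missing_source, demo_source], environ=env))
print(env)
```

Passing the environment as a parameter is a design choice that makes the helper easy to test without real credentials.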



### Step 3: Importing Libraries and Modules
In this step, we import several key modules for loading data, defining evaluation metrics, and wrapping the LLM. We also import tools to evaluate semantic similarity, context recall, faithfulness, and factual correctness. These metrics are fundamental for assessing the LLM's alignment with responsible AI guidelines, particularly ensuring that the model's output is reliable and consistent.

In [None]:
from datasets import load_dataset
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, SemanticSimilarity
from ragas import evaluate, EvaluationDataset
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

### Step 4: Loading Dataset
Here, we load the Amnesty International QA dataset, which will be used for evaluating the LLM. This dataset contains human rights-related questions and answers, and its real-world importance makes it an excellent choice for testing the model's factual correctness and faithfulness. Ensuring that models perform well on such critical datasets aligns with the principles of responsible AI.

In [None]:
dataset = load_dataset("explodinggradients/amnesty_qa", "english_v3", trust_remote_code=True)
dataset
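
Before evaluating, it can help to confirm that each record has the fields the metrics will read. The sketch below is an assumption-heavy illustration: the column names (`user_input`, `reference`, `response`, `retrieved_contexts`) are what a single-turn `ragas` sample is assumed to contain and should be checked against the output of the cell above, the helper `find_schema_problems` is hypothetical, and the sample record is made up.

```python
# Assumed column names for a single-turn ragas sample; verify them against
# the printed dataset before relying on this check.
EXPECTED_FIELDS = {
    "user_input": str,
    "reference": str,
    "response": str,
    "retrieved_contexts": list,
}


def find_schema_problems(record):
    """Return a list of problems; an empty list means the record looks fine."""
    problems = []
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    return problems


# A made-up record, only to show the shape; it is not taken from the dataset.
sample = {
    "user_input": "What does the report say about freedom of assembly?",
    "reference": "Placeholder reference answer.",
    "response": "Placeholder model answer.",
    "retrieved_contexts": ["Placeholder context passage."],
}
print(find_schema_problems(sample))
```

With the real data, something like `find_schema_problems(dataset["eval"][0])` would run the same check on the first row.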

### Step 5: Preparing Evaluation Dataset
We transform the dataset into an evaluation-ready format using `EvaluationDataset.from_hf_dataset()`. This step prepares the dataset for the evaluation process by formatting it in a way that is compatible with the `ragas` evaluation pipeline. This ensures the repeatability of the experiment as others can load the same dataset in the same way.

In [None]:
eval_dataset = EvaluationDataset.from_hf_dataset(dataset["eval"])

### Step 6: Wrapping the LLM and Embeddings
In this step, we wrap the LLM (GPT-4o, selected with `model="gpt-4o"`) and the default OpenAI embeddings model using the LangChain wrappers provided by `ragas`. Wrapping the models this way lets the evaluation pipeline call them through a common interface. Stating exactly which LLM and embeddings model are used supports repeatability and transparency, which are key elements of responsible AI.

In [None]:
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
evaluator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

### Step 7: Defining and Applying Evaluation Metrics
We define four metrics for evaluating the model. `LLMContextRecall` checks whether the retrieved context covers the reference answer, `FactualCorrectness` compares the facts in the response with the reference, `Faithfulness` checks that the response is supported by the retrieved context, and `SemanticSimilarity` measures how close the response is in meaning to the reference. The first three use the evaluator LLM as a judge, while the last uses the embeddings model. Applying these metrics is essential for validating models, especially when deploying them in sensitive or high-risk applications where responsible AI principles must be upheld.

In [None]:
metrics = [
    LLMContextRecall(llm=evaluator_llm), 
    FactualCorrectness(llm=evaluator_llm), 
    Faithfulness(llm=evaluator_llm),
    SemanticSimilarity(embeddings=evaluator_embeddings)
]
results = evaluate(dataset=eval_dataset, metrics=metrics)
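
To build intuition for what these scores mean, the sketch below computes toy versions by hand. It assumes the commonly described definitions: faithfulness as the fraction of response claims supported by the retrieved context, context recall as the fraction of reference claims found in the retrieved context, and semantic similarity as the cosine similarity of two embedding vectors. In `ragas` the verdicts come from the evaluator LLM and the vectors from the embeddings model; here the verdicts and vectors are made up, and the function names are hypothetical.

```python
import math


def supported_fraction(verdicts):
    """Fraction of claims judged as supported (True)."""
    if not verdicts:
        raise ValueError("need at least one claim verdict")
    return sum(1 for v in verdicts if v) / len(verdicts)


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up verdicts: 3 of 4 response claims are backed by the retrieved context.
faithfulness = supported_fraction([True, True, False, True])
# Made-up verdicts: 1 of 2 reference claims appears in the retrieved context.
context_recall = supported_fraction([True, False])
# Made-up 3-dimensional "embeddings"; real ones have many more dimensions.
similarity = cosine_similarity([1.0, 0.0, 1.0], [1.0, 1.0, 0.0])

print(faithfulness, context_recall, round(similarity, 3))
```

All three toy scores fall between 0 and 1, with higher values meaning better support or closer meaning.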

### Step 8: Converting Results to DataFrame
Here, we convert the evaluation results into a pandas DataFrame and preview the first rows with `df.head()`. Structuring the results in a DataFrame makes it easier to inspect and compare the LLM's scores on each metric, which keeps the evaluation transparent and reproducible.

In [None]:
df = results.to_pandas()
df.head()
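
Per-row scores are easier to compare once they are averaged per metric. The sketch below works on plain dictionaries (the shape `df.to_dict("records")` would produce) so it runs without the evaluation results. The column names and scores are invented for illustration, the helper `mean_scores` is hypothetical, and it assumes a row whose score could not be computed shows up as `NaN` or `None`.

```python
import math


def mean_scores(rows, metric_names):
    """Average each metric over the rows, skipping missing (None/NaN) values."""
    summary = {}
    for metric in metric_names:
        values = [
            row[metric]
            for row in rows
            if row.get(metric) is not None and not math.isnan(row[metric])
        ]
        summary[metric] = sum(values) / len(values) if values else None
    return summary


# Invented scores; real column names come from `df.columns`.
rows = [
    {"faithfulness": 1.0, "context_recall": 0.5},
    {"faithfulness": 0.5, "context_recall": float("nan")},
    {"faithfulness": 0.75, "context_recall": 1.0},
]
print(mean_scores(rows, ["faithfulness", "context_recall"]))
```

On the DataFrame itself, selecting the metric columns and calling `.mean()` should give the same per-metric averages, since pandas skips `NaN` by default.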

### Step 9: Displaying the Results
Finally, we display the full results table to assess the LLM's performance across the defined metrics. This shows how the model scored on every sample and whether it meets the benchmarks you have set for responsible AI deployment.

In [None]:
df
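
A natural follow-up is to pull out the rows that fall below a chosen score so they can be read by hand. This is a sketch with a hypothetical helper `rows_below`, invented rows, and an arbitrary threshold of 0.5; the lab does not prescribe a benchmark value, and the `faithfulness` column name is an assumption.

```python
def rows_below(rows, metric, threshold):
    """Return (index, row) pairs whose score for `metric` is below `threshold`."""
    flagged = []
    for index, row in enumerate(rows):
        score = row.get(metric)
        # A comparison with NaN is always False, so NaN rows are not flagged here.
        if score is not None and score < threshold:
            flagged.append((index, row))
    return flagged


# Invented results, for illustration only.
results_rows = [
    {"user_input": "question A", "faithfulness": 0.9},
    {"user_input": "question B", "faithfulness": 0.2},
    {"user_input": "question C", "faithfulness": 0.4},
]
for index, row in rows_below(results_rows, "faithfulness", threshold=0.5):
    print(index, row["user_input"], row["faithfulness"])
```

Reading the flagged questions, responses, and contexts side by side is often the quickest way to tell whether a low score reflects a model error or a judging error.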

## Quiz
Answer the following questions to test your understanding of the evaluation process for LLMs.
1. What are the key metrics used to evaluate the LLM in this lab, and why are they important for responsible AI?
2. How does the `LangchainLLMWrapper` ensure that the evaluation process is repeatable?
3. Why is it important to assess the `FactualCorrectness` of an LLM when deploying it in real-world applications?
4. Explain how using environment variables for the API key contributes to responsible AI practices.
5. What is the significance of converting the evaluation results into a DataFrame for further analysis?