# Retrieval and Answer Quality Metrics computation using LLM as Judge in IBM watsonx.governance for RAG task

This notebook demonstrates how to build a Retrieval-Augmented Generation (RAG) pattern using watsonx.ai and how to compute the reference-free retrieval quality metric **Context relevance**, the reference-free answer quality metrics **Faithfulness** and **Answer relevance**, and the reference-based **Answer similarity** metric for the RAG task type, using an LLM as judge with IBM watsonx.governance.

**Context relevance** assesses the degree to which the retrieved context is relevant to the question specified in the prompt, serving as a metric for evaluating the quality of your retrieval system. The context relevance score is a value between 0 and 1. A value closer to 1 indicates that the context is more relevant to the question in the prompt. A value closer to 0 indicates that the context is less relevant to the question in the prompt.

**Faithfulness** measures how faithful the answer or generated text is to the context sent to the LLM input. The faithfulness score is a value between 0 and 1. A value closer to 1 indicates that the output is more faithful or grounded and less hallucinated. A value closer to 0 indicates that the output is less faithful or grounded and more hallucinated.

**Answer relevance** measures how relevant the answer or generated text is to the question. This is one of the ways to determine the quality of your model. The answer relevance score is a value between 0 and 1. A value closer to 1 indicates that the answer is more relevant to the given question. A value closer to 0 indicates that the answer is less relevant to the question.

**Answer similarity** measures how similar the answer or generated text is to the ground truth or reference answer. This is one of the ways to determine the quality of your model. The answer similarity score is a value between 0 and 1. A value closer to 1 indicates that the answer is more similar to the reference value. A value closer to 0 indicates that the answer is less similar to the reference value.

## Learning goals

- Ingest data into a vector database
- Initialize foundation model
- Generate RAG responses
- Configure and run evaluations

## Contents

- [Step 1 - Setup](#setup)
- [Step 2 - Ingest Data into Vector DB](#data)
- [Step 3 - Initialize a foundation model using watsonx.ai](#model)
- [Step 4 - Generate the answers to questions using LangChain RetrievalQA](#predict)
- [Step 5 - Configure evaluations](#config)
- [Step 6 - Run evaluations](#compute)
- [Step 7 - Display the results](#results)
- [Challenge - Side-by-side comparison between IBM Granite and Llama 3 models](#challenge)

## Step 1 - Setup <a id="setup"></a>

### Install necessary libraries

In [None]:
!pip install -U "ibm-metrics-plugin~=3.0.0" | tail -n 1
!pip install -U ibm-watson-openscale | tail -n 1
!pip install -U ibm-watson-machine-learning | tail -n 1
!pip install "langchain==0.0.345" | tail -n 1
!pip install wget | tail -n 1
!pip install sentence-transformers | tail -n 1
!pip install "chromadb==0.3.26" | tail -n 1
!pip install "pydantic==1.10.0" | tail -n 1
!pip install nltk

import warnings
warnings.filterwarnings("ignore")

import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

**Note**: You may need to restart the kernel to use the updated libraries.

### Configure your credentials

***Hint***: You can generate a `CLOUD_API_KEY` at https://cloud.ibm.com/ -> Manage -> API Key -> Create New Key

In [None]:
# Cloud credentials
IAM_URL="https://iam.cloud.ibm.com"
DATAPLATFORM_URL = "https://api.dataplatform.cloud.ibm.com"
SERVICE_URL = "https://aiopenscale.cloud.ibm.com"
CLOUD_API_KEY = "<EDIT THIS>" # YOUR_CLOUD_API_KEY

credentials = {
    "url": "https://us-south.ml.cloud.ibm.com",
    "apikey": CLOUD_API_KEY
}
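
If you prefer not to paste the key into the notebook, one option is to read it from an environment variable. The following is a minimal sketch: the helper name `load_api_key` and the environment variable name `CLOUD_API_KEY` are assumptions made for illustration and are not part of the original notebook.

```python
import os
from getpass import getpass


def load_api_key(env_var="CLOUD_API_KEY"):
    """Return the API key from the environment, prompting only if it is absent."""
    key = os.environ.get(env_var)
    if not key:
        # Fall back to an interactive prompt so the key is never saved in the notebook.
        key = getpass(f"Enter value for {env_var}: ")
    return key
```

With this helper, the assignment above could become `CLOUD_API_KEY = load_api_key()`.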

### Configure your project id
Provide the project ID, which supplies the context needed to run inference against the watsonx.ai model.

***Hint***: You can find the `project_id` as follows. Open the prompt lab in watsonx.ai. At the very top of the UI, there will be "Projects / *project name* /". Click on the "*project name*" link, then get the `project_id` from the project's "Manage" tab ("Project -> Manage -> General -> Details").

In [None]:
project_id = "<EDIT THIS>" # YOUR_PROJECT_ID

## Step 2 - Ingest Data into Vector DB <a id="data"></a>

### Read the data

Download the sample "State of the Union" file.

In [None]:
import wget
import os

data = 'state_of_the_union.txt'
url = 'https://raw.github.com/IBM/watson-machine-learning-samples/master/cloud/data/foundation_models/state_of_the_union.txt'

if not os.path.isfile(data):
    wget.download(url, out=data)

### Prepare the data for the vector database

Take the `state_of_the_union.txt` speech content data and split it into chunks. 

In [None]:
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

loader = TextLoader(data)
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
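
To build intuition for what `chunk_size` does, here is a simplified, pure-Python sketch of separator-based chunking. It is not LangChain's implementation: the function name `split_into_chunks` is hypothetical, the separator `"\n\n"` is an assumption, and overlap is ignored (the cell above sets `chunk_overlap=0`).

```python
def split_into_chunks(text, chunk_size=1000, separator="\n\n"):
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size characters."""
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Keep growing the current chunk (a single oversized piece is kept whole).
            current = candidate
        else:
            # The next piece would overflow: close the current chunk and start a new one.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa\n\nbbbb\n\ncccc"
print(split_into_chunks(sample, chunk_size=10))
```

Smaller chunks give the retriever finer-grained passages, while larger chunks keep more surrounding text together; the right trade-off depends on your data.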

### Create an embedding function to store the data in a vector database

Embed the chunked data using an open-source embedding model and load it into Chromadb, a vector database.

**Note**: You can also provide a custom embedding function to be used by Chromadb; the performance of Chromadb may differ depending on the embedding model used.

In [None]:
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings()
docsearch = Chroma.from_documents(texts, embeddings)
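
Conceptually, the vector database returns the chunks whose embeddings are closest to the embedding of the question. The sketch below illustrates that idea with cosine similarity on toy two-dimensional vectors. It is only an illustration: the helper names are hypothetical, and the distance function and index that Chroma actually uses may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k document vectors most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy embeddings, invented for illustration only.
toy_docs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
print(top_k([1.0, 0.1], toy_docs, k=2))  # → [0, 1]
```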

## Step 3 - Initialize a foundation model using `watsonx.ai`
<a id="model"></a>

IBM watsonx foundation models are among the <a href="https://python.langchain.com/docs/integrations/llms/watsonxllm" target="_blank" rel="noopener noreferrer">LLMs supported by LangChain</a>. This example shows how to communicate with the <a href="https://newsroom.ibm.com/2023-09-28-IBM-Announces-Availability-of-watsonx-Granite-Model-Series,-Client-Protections-for-IBM-watsonx-Models" target="_blank" rel="noopener noreferrer">Granite Model Series</a> using <a href="https://python.langchain.com/docs/get_started/introduction" target="_blank" rel="noopener noreferrer">LangChain</a>.

### Define the model parameters
Provide a set of model parameters that will influence the result:

In [None]:
from ibm_watson_machine_learning.metanames import GenTextParamsMetaNames as GenParams
from ibm_watson_machine_learning.foundation_models.utils.enums import DecodingMethods

parameters = {
    GenParams.DECODING_METHOD: DecodingMethods.GREEDY,
    GenParams.MIN_NEW_TOKENS: 1,
    GenParams.MAX_NEW_TOKENS: 100,
    GenParams.STOP_SEQUENCES: ["<|endoftext|>"]
}

### Set LangChain custom LLM wrapper for watsonx model
Initialize the `WatsonxLLM` class from LangChain with the parameters defined above, using the `ibm/granite-13b-chat-v2` model. You can find all supported model IDs [here](https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/fm-api-model-ids.html?context=wx&audience=wdp).

In [None]:
from langchain.llms import WatsonxLLM

watsonx_llm = WatsonxLLM(
    model_id='ibm/granite-13b-chat-v2',
    url=credentials.get("url"),
    apikey=credentials.get("apikey"),
    project_id=project_id,
    params=parameters
)

## Step 4 - Generate the answers to questions using LangChain RetrievalQA
<a id="predict"></a>

### Build a `RetrievalQA` (question answering chain) to automate the RAG task

In [None]:
from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(llm=watsonx_llm, chain_type="stuff", retriever=docsearch.as_retriever())
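
The `"stuff"` chain type places all retrieved documents into a single prompt that is sent to the model in one call. The sketch below shows the idea in plain Python. The function name and the prompt wording are hypothetical and are not the template LangChain uses internally.

```python
def build_stuff_prompt(question, documents):
    """Concatenate all retrieved documents into one prompt (hypothetical template)."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


print(build_stuff_prompt("What is ARPA-H?", ["first retrieved chunk", "second retrieved chunk"]))
```

Because every retrieved chunk ends up in the same prompt, the number and size of the chunks must fit within the model's context window.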

In [None]:
query1 = "What is ARPA-H?"
query2 = "What is the investment of Ford and GM to build electric vehicles?"
query3 = "What is the proposed tax rate for corporations?"
query4 = "What is Intel going to build?"
query5 = "How many new manufacturing jobs are created last year?"
query6 = "How many electric vehicle charging stations are built?"

questions = [query1 , query2, query3, query4, query5, query6]

ref_ans1 = "ARPA-H is the Advanced Research Projects Agency for Health, which is an agency that aims to drive breakthroughs in cancer, Alzheimer's, diabetes, and more. It was proposed by the U.S. President to supercharge the Cancer Moonshot and cut the cancer death rate by at least 50% over the next 25 years."
ref_ans2 = "Ford is investing $11 billion to build electric vehicles, creating 11,000 jobs across the country. GM is making the largest investment in its history—$7 billion to build electric vehicles, creating 4,000 jobs in Michigan.So, the total investment of Ford and GM to build electric vehicles is $11 billion + $7 billion = $18 billion."
ref_ans3 = "The proposed tax rate for corporations is a 15% minimum tax rate."
ref_ans4 = "Intel is going to build a $20 billion semiconductor \"mega site\" with up to eight state-of-the-art factories."
ref_ans5 = "369,000 new manufacturing jobs were created last year."
ref_ans6 = "The document does not provide information on the number of electric vehicle charging stations built. It only mentions the plan to build a national network of 500,000 electric vehicle charging stations."

reference_answers = [ref_ans1, ref_ans2, ref_ans3, ref_ans4, ref_ans5, ref_ans6]

### Generate retrieval-augmented responses to the questions

In [None]:
responses = []
contexts = []
for query in questions:
    # Retrieve relevant context for each question from the vector DB
    docs = docsearch.as_retriever().get_relevant_documents(query)

    context = []
    # Extract the text of each retrieved document
    for doc in docs:
        context.append(doc.page_content)

    #Capture the context
    contexts.append(context)

    #Run the prompt and get the response
    response = qa.run(query)
    responses.append(response)
    

In [None]:
#Print a sample context retrieved for a query 
# print(f"Question:{questions[0]}\n context:{contexts[0]}")

In [None]:
#Print the result
# for query in questions:
#     print(f"{query} \n {responses[questions.index(query)]} \n")

### Construct a dataframe with question, contexts and answer to be used for metrics computation

In [None]:
import pandas as pd
data = pd.DataFrame(contexts, columns=["context1", "context2", "context3", "context4"])
data["question"] = questions
data["answer"] = responses
data["reference"] = reference_answers
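
The cell above assumes that the retriever returns exactly four documents per question. If that number can vary, one option is to pad the context lists and derive the column names from their width. This is a sketch only: `pad_contexts` is a hypothetical helper, and padding with empty strings is an assumption whose effect on the metrics has not been verified here.

```python
def pad_contexts(contexts, width=None, fill=""):
    """Pad each context list to the same width and build matching column names."""
    width = width or max(len(c) for c in contexts)
    # Truncate longer lists and pad shorter ones so every row has `width` entries.
    rows = [list(c[:width]) + [fill] * (width - len(c)) for c in contexts]
    columns = [f"context{i + 1}" for i in range(width)]
    return rows, columns


rows, columns = pad_contexts([["a", "b"], ["c"]])
print(columns)  # → ['context1', 'context2']
print(rows)     # → [['a', 'b'], ['c', '']]
```

The result can be passed to pandas as `pd.DataFrame(rows, columns=columns)`, and `columns` can be reused as the list of context column names.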

## Step 5 - Configure evaluations
<a id="config"></a>

### Parameters

#### Common parameters

| Parameter | Description | Default Value | Possible Value(s) |
|:-|:-|:-|:-|
| context_columns | The list of context column names in the input data frame. |  |  |
| question_column | The name of the question column in the input data frame. |  |  |
| answer_column | The name of the answer column in the input data frame. |  |  |
| record_level [Optional] | The flag to return the record level metrics values. Set the flag under configuration to generate record level metrics for all the metrics. Set the flag under specific metric to generate record level metrics for that metric alone. | False | True, False |
| scoring_fn | The scoring function that takes the prompts as an input data frame, scores them with the LLM acting as judge, and returns the output as a data frame. The input data frame has a single column, "prompt". The output data frame can have a single column; if it has multiple columns, the model output text must be in the "generated_text" column. | | |

#### Metric parameters

| Parameter | Description | Default Value | Possible Value(s) |
|:-|:-|:-|:-|
| record_level [Optional] | The flag to return the record level metrics values. Setting the flag under specific metric overrides the value provided at the configuration level. | False | True, False |
| metric_prompt_template [Optional] | The prompt template used to compute the metric value. Use this parameter to override the prompt template that watsonx.governance uses to compute the metric. The prompt template should use the variables {context}, {question}, {answer}, {reference_answer} as needed; these variables are filled with the actual data when the scoring function is called. The prompt response should return the metric value in the range 1-10 for the respective metric, in one of the formats ["4", "7 star", "star: 8", "stars: 9"]. | | |
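
To make the `metric_prompt_template` contract more concrete, here is a minimal sketch of a custom template and of how replies in the accepted formats could be read. Both parts are assumptions for illustration: the template wording is hypothetical and is not the one shipped with watsonx.governance, and `parse_rating` is not the parser that watsonx.governance uses.

```python
import re

# Hypothetical template: the wording is illustrative only.
FAITHFULNESS_TEMPLATE = (
    "Context: {context}\n"
    "Answer: {answer}\n"
    "On a scale of 1 to 10, how well is the answer supported by the context? "
    "Reply with the number only."
)


def parse_rating(text):
    """Extract an integer rating in 1-10 from replies such as '4', '7 star' or 'stars: 9'."""
    match = re.search(r"\b(10|[1-9])\b", text)
    return int(match.group(1)) if match else None


# Fill the template the same way the variables would be substituted with real data.
prompt = FAITHFULNESS_TEMPLATE.format(
    context="Intel is going to build a semiconductor mega site.",
    answer="A semiconductor mega site.",
)
print(prompt)
print([parse_rating(reply) for reply in ["4", "7 star", "star: 8", "stars: 9"]])  # → [4, 7, 8, 9]
```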

### Verify client version

In [None]:
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

from ibm_watson_openscale import *
from ibm_watson_openscale.supporting_classes.enums import *
from ibm_watson_openscale.supporting_classes import *

authenticator = IAMAuthenticator(apikey=CLOUD_API_KEY, url=IAM_URL)
client = APIClient(authenticator=authenticator, service_url=SERVICE_URL)

# print(client.version)

### Define the scoring function that invokes the LLM acting as judge while computing the metrics

The scoring function is implemented using a model hosted on watsonx.ai in IBM Cloud. The model FLAN_T5_XXL is used as the judge here. Other watsonx.ai models that can be used are FLAN_UL2, FLAN_T5_XL, and MIXTRAL_8X7B_INSTRUCT_V01_Q.

The function can be changed as needed to invoke external models as well. The quality of the retrieval and answer quality metrics can vary with the model used as judge.

In [None]:
from ibm_watson_machine_learning.metanames import GenTextParamsMetaNames as GenParams
from ibm_watson_machine_learning.foundation_models import Model
from ibm_watson_machine_learning.foundation_models.utils.enums import ModelTypes
import pandas as pd

generate_params = {
    GenParams.MAX_NEW_TOKENS: 100,
    GenParams.MIN_NEW_TOKENS: 10,
    GenParams.TEMPERATURE: 0.0
}

model = Model(
    model_id=ModelTypes.FLAN_T5_XXL,
    params=generate_params,
    credentials={
        "apikey": credentials.get("apikey"),
        "url": credentials.get("url")
    },
    project_id=project_id
)

def scoring_fn(data):
    results = []
    
    for prompt_text in data.iloc[:, 0].values.tolist():
        model_response = model.generate_text(prompt=prompt_text)
        results.append(model_response)
    
    return pd.DataFrame({"generated_text": results})
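
Because the scoring function makes one remote call per prompt, a transient failure on any call would abort the whole evaluation. One way to guard against that is a small retry wrapper. The sketch below is an assumption-laden illustration: `generate_with_retry` is a hypothetical helper, and the retry count and linear backoff are arbitrary choices.

```python
import time


def generate_with_retry(generate, prompt, retries=3, delay_seconds=1.0):
    """Call generate(prompt), retrying on any exception for up to `retries` attempts."""
    for attempt in range(1, retries + 1):
        try:
            return generate(prompt)
        except Exception:
            if attempt == retries:
                # Out of attempts: surface the last error to the caller.
                raise
            # Linear backoff between attempts.
            time.sleep(delay_seconds * attempt)
```

Inside `scoring_fn`, the call could then be written as `generate_with_retry(lambda p: model.generate_text(prompt=p), prompt_text)`.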

### Configure context relevance, faithfulness, answer relevance and answer similarity parameters

In [None]:
from ibm_metrics_plugin.metrics.llm.utils.constants import LLMTextMetricGroup, LLMCommonMetrics, LLMRAGMetrics, RetrievalQualityMetric

# Edit below values based on the input data
context_columns = ["context1", "context2", "context3", "context4"]
question_column = "question"
answer_column = "answer"
reference_column = "reference"

config_json = {
            "configuration": {
                "context_columns": context_columns,
                "question_column": question_column,
                "scoring_fn": scoring_fn,
                "record_level": True,
                LLMTextMetricGroup.RAG.value: {
                        LLMRAGMetrics.RETRIEVAL_QUALITY.value: {
                            RetrievalQualityMetric.CONTEXT_RELEVANCE.value: {
                                #"record_level": True,
                                #"metric_prompt_template": ""
                            }
                        },
                        LLMCommonMetrics.FAITHFULNESS.value: {
                            #"record_level": True,
                            #"metric_prompt_template": ""
                        },
                        LLMCommonMetrics.ANSWER_RELEVANCE.value: {
                            #"record_level": True,
                            #"metric_prompt_template": ""
                        },
                        LLMCommonMetrics.ANSWER_SIMILARITY.value: {
                            #"record_level": True,
                            #"metric_prompt_template": ""
                        }
                }
            }
        }

### Create the input, output and reference data frames and send them as input to compute metrics

In [None]:
df_input = pd.DataFrame(data, columns=context_columns+[question_column])
df_output = pd.DataFrame(data, columns=[answer_column])
df_reference = pd.DataFrame(data, columns=[reference_column])

## Step 6 - Run evaluations <a id="compute"></a>

In [None]:
import json
metrics_result = client.llm_metrics.compute_metrics(config_json, 
                                                    sources=df_input, 
                                                    predictions=df_output,
                                                    references=df_reference)

# print(json.dumps(metrics_result, indent=2))

## Step 7 - Display the results <a id="results"></a>

### Get metric results for all records

In [None]:
results_df = data.copy()
for k, v in metrics_result.items():
    metric = "context_relevance" if k == "retrieval_quality" else k
    if v.get("record_level_metrics"):
        results_df[metric] = [r.get(metric) for r in v.get("record_level_metrics")]
        
results_df
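
To see what the loop above does without calling the service, the sketch below runs the same logic on a mock result. The structure of `mock_result` is inferred from the loop itself and the values are invented for illustration; they are not real evaluation results.

```python
# Mock result: structure inferred from the loop above, values invented for illustration.
mock_result = {
    "faithfulness": {
        "record_level_metrics": [{"faithfulness": 0.9}, {"faithfulness": 0.4}],
    },
    "retrieval_quality": {
        "record_level_metrics": [{"context_relevance": 0.8}, {"context_relevance": 0.6}],
    },
}


def record_level_columns(result):
    """Map each metric name to its list of per-record scores."""
    columns = {}
    for key, value in result.items():
        # The retrieval quality group reports its scores under "context_relevance".
        metric = "context_relevance" if key == "retrieval_quality" else key
        records = value.get("record_level_metrics")
        if records:
            columns[metric] = [r.get(metric) for r in records]
    return columns


print(record_level_columns(mock_result))
```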

## Challenge - Side-by-side comparison between IBM Granite and Llama 3 models <a id="challenge"></a>

In [None]:
# [TODO] - side-by-side comparison between IBM Granite and llama 3 models
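
One possible starting point for the challenge, assuming you have rerun Steps 3 to 7 once per model and collected each run's per-record scores into plain lists, is to compare the mean score per metric. The helper `compare_models`, the model labels, and the scores below are all placeholders, not results.

```python
from statistics import mean


def compare_models(scores_by_model):
    """Return {metric: {model: mean score}} from {model: {metric: [scores]}}."""
    summary = {}
    for model_name, metrics in scores_by_model.items():
        for metric, values in metrics.items():
            summary.setdefault(metric, {})[model_name] = mean(values)
    return summary


# Placeholder scores, invented for illustration only.
placeholder_scores = {
    "model_a": {"faithfulness": [1.0, 0.5]},
    "model_b": {"faithfulness": [0.5, 0.5]},
}
print(compare_models(placeholder_scores))
```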

Copyright © 2024 IBM.