<h1 align="center">
    <img 
        src="../img/logo_white_bg.jpeg" 
        width="200" 
        border="1" />
</h1>
<h1 align="center">
    <b>GenAISHAP</b>
</h1>
<h4 align="center">
    <i>Explanations for Generative AI, LLM-and-SLM-Based, Solutions</i> ⚡️
</h4>



Generative AI SHAP (GenAISHAP) is a Python library that supports the creation of explanations for the metrics obtained from solutions based on LLMs (Large Language Models) or SLMs (Small Language Models).

When building an LLM- or SLM-based solution, one of the first challenges is how to measure the quality of the responses from the "Agent". Libraries like [RAGAS](https://github.com/explodinggradients/ragas) or [promptflow](https://github.com/microsoft/promptflow) help evaluate the quality of the solution using metrics such as **faithfulness**, **groundedness**, **context precision**, and **context recall**, among others.

The open challenge is to add explainability to those quality metrics, in order to answer questions like:

- *Why is a particular question marked with a higher or lower metric (e.g., faithfulness)?*
- *What are the common characteristics of the user questions that produce good or bad **faithfulness**?*
- *What types of prompts produce higher or lower **context recall**?*

The answers to those questions help debug the overall solution and give more insight into where to focus the next steps to improve the metrics.

This notebook shows an example of how to create the **Input** for GenAISHAP, which is a simple pandas DataFrame with the evaluation dataset. This dataset should have at least the following columns:

- User input, question or prompt. The column name should be `user_input` and its type should be string.
- One column for each metric already calculated, for example **faithfulness**, **context precision**, **context recall**. All numerical columns are assumed to be metric columns.
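
The column rule above can be illustrated with a minimal sketch. It works on a list of plain dictionaries (one per row) instead of a DataFrame so that it runs without extra dependencies. The helper name `split_columns` and the sample rows are hypothetical and are not part of GenAISHAP.

```python
from numbers import Number


def split_columns(records):
    """Return (metric_columns, other_columns) for a list of row dictionaries.

    A column counts as a metric when every non-missing value is numeric,
    mirroring the "all numerical columns are metrics" rule described above.
    """
    metric_columns, other_columns = [], []
    for column in records[0].keys():
        values = [row[column] for row in records if row.get(column) is not None]
        is_numeric = bool(values) and all(
            isinstance(value, Number) and not isinstance(value, bool)
            for value in values
        )
        (metric_columns if is_numeric else other_columns).append(column)
    return metric_columns, other_columns


# Hypothetical rows; the metric values are made up for illustration only.
rows = [
    {"user_input": "example question A", "faithfulness": 1.0, "context_recall": 0.5},
    {"user_input": "example question B", "faithfulness": 0.25, "context_recall": 0.0},
]

metric_columns, other_columns = split_columns(rows)
print(metric_columns)  # columns treated as metrics
print(other_columns)   # remaining columns, including user_input
```

With a real DataFrame, a similar split can probably be obtained from the column dtypes instead of inspecting the values one by one.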

At the end of this notebook, a pandas DataFrame like the following will be produced:

<img src="../img/input_example.png" width="1200" />

> In this example, the column `user_input` will be used to refer to the user prompt, and the columns `faithfulness`, `context_precision` and `context_recall` will be used as metric columns since those columns are numerical.
>
> The other columns, `retrieved_contexts`, `response`, and `reference` are not needed for **GenAISHAP** but are normally required for the calculation of the metrics.

# 1. Import relevant libraries and load environment variables

Before you run this notebook, make sure you have a `.env` file that defines the following variables:

- OPENAI_API_VERSION
- AZURE_OPENAI_ENDPOINT
- OPENAI_API_KEY
- PYTHONIOENCODING=utf-8
- PYTHONUTF8=1
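
A quick check that the connection variables are actually set after `load_dotenv()` can save a failed run later on. The sketch below is only an illustration: the helper `missing_variables` is hypothetical, and it is demonstrated on a made-up dictionary instead of the real environment.

```python
import os

# Variables the notebook reads through os.environ in the following sections.
REQUIRED_VARIABLES = [
    "OPENAI_API_VERSION",
    "AZURE_OPENAI_ENDPOINT",
    "OPENAI_API_KEY",
]


def missing_variables(required, environ=None):
    """Return the names in `required` that are unset or empty in `environ`."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


# Hypothetical, partially filled environment used only for the demonstration.
example_environ = {"OPENAI_API_VERSION": "2024-08-01-preview", "OPENAI_API_KEY": ""}
missing = missing_variables(REQUIRED_VARIABLES, example_environ)
print(missing)
```

Calling `missing_variables(REQUIRED_VARIABLES)` without the second argument checks the real process environment.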

> Note that the **genaishap** library is not imported here, since this notebook is only an example of how to create the input for the GenAISHAP execution.


In [1]:
import os

from dotenv import load_dotenv
from llama_index.core import Settings, VectorStoreIndex
from llama_index.core.llama_dataset import LabelledRagDataset
from llama_index.core.readers import SimpleDirectoryReader
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding
from llama_index.llms.azure_openai import AzureOpenAI
from ragas.dataset_schema import EvaluationDataset, SingleTurnSample
from ragas.embeddings import LlamaIndexEmbeddingsWrapper
from ragas.evaluation import evaluate
from ragas.llms import LlamaIndexLLMWrapper
from ragas.metrics import ContextPrecision, ContextRecall, Faithfulness
from ragas.run_config import RunConfig

In [2]:
load_dotenv()

True

# 2. Load the example dataset

As an example, the [Mini ESG Bench Dataset](https://llamahub.ai/l/llama_datasets/Mini%20ESG%20Bench%20Dataset?from=llama_datasets) will be used. This dataset provides, among other things:

- 6 PDFs with the 2022 or 2021 sustainability report from Microsoft, Apple, Amazon, Google, Meta and Netflix.
- 50 user queries that should be answered using those documents.
- A reference answer for each query, used as ground truth to calculate the RAGAS metrics.

In [3]:
rag_dataset = LabelledRagDataset.from_json("./data/rag_dataset.json")
documents = SimpleDirectoryReader(input_dir="./data/source_files/").load_data(
    show_progress=True
)

Loading files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:19<00:00,  3.25s/file]


The `data` folder loaded above should contain 2 elements:
- a `rag_dataset.json` file with the information shown after executing the next cell.
- a `source_files` folder with the 6 PDF sustainability reports.
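
The folder layout can be verified with a small sketch like the one below. The function name `describe_data_folder` is hypothetical. It makes no assumption about the file types and only reports what it finds.

```python
from collections import Counter
from pathlib import Path


def describe_data_folder(data_dir):
    """Summarize the expected layout: rag_dataset.json plus a source_files folder."""
    data_dir = Path(data_dir)
    source_dir = data_dir / "source_files"
    # An absent source_files folder is reported as zero files instead of failing.
    files = [p for p in source_dir.iterdir() if p.is_file()] if source_dir.is_dir() else []
    return {
        "has_rag_dataset": (data_dir / "rag_dataset.json").is_file(),
        "source_file_count": len(files),
        "extensions": dict(Counter(p.suffix.lower() for p in files)),
    }


print(describe_data_folder("./data"))
```

A missing `rag_dataset.json` or an unexpected number of source files usually means the dataset was not fully downloaded.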

In [4]:
rag_dataset.to_pandas()

Unnamed: 0,query,reference_contexts,reference_answer,reference_answer_by,query_by
0,What is the objective of the Carbon Call initi...,"[In February 2022, Microsoft, ClimateWorks Fou...",The objective of Carbon Call is to unify the w...,human,human
1,What are the three key areas that will enable ...,[We will continue\ninvesting in three key area...,1 Advancing AI solutions for greater climate i...,human,human
2,How many people in India did Microsoft provide...,"[Water Table 3\n\nIndia: 309,921\nIndonesia: 2...",309921,human,human
3,How many people in Indonesia did Microsoft pro...,"[Water Table 3\n\nIndia: 309,921\nIndonesia: 2...",225389,human,human
4,How many people in Brazil did Microsoft provid...,"[Water Table 3\n\nIndia: 309,921\nIndonesia: 2...",16408,human,human
5,How many people in Mexico did Microsoft provid...,"[Water Table 3\n\nIndia: 309,921\nIndonesia: 2...",340,human,human
6,How many more acres of land does Microsoft nee...,[Ecosystems Chart 1\nAchieving our target of p...,4998 more acres,human,human
7,What key insight was listed in page 56 of the ...,[25%\nWe are reducing idle\npower consumption\...,25%\nWe are reducing idle power consumption of...,human,human
8,What key insight was listed in page 61 of the ...,[100%\nOur key European\nand American\ndistrib...,100%\nOur key European and American distributi...,human,human
9,What are the four commitments listed on page 66?,[Our commitment\nUsing our voice on climate-re...,1. Using our voice on climate-related public p...,human,human


# 3. Create semantic index for RAG

The following cell creates the client objects for the embedding model and the language model that will be used to calculate the response to each user query in the dataset:

In [5]:
embed_model = AzureOpenAIEmbedding(
    model='text-embedding-3-small', # Update with the embeddings deployment name
    api_key=os.environ['OPENAI_API_KEY'],
    api_version=os.environ['OPENAI_API_VERSION'],
    azure_endpoint=os.environ['AZURE_OPENAI_ENDPOINT']
)

llm = AzureOpenAI(
    engine="gpt-4o", # Update with the language model deployment name 
    model="gpt-4o", # Update with the language model name
    temperature=0.0,
    api_key=os.environ['OPENAI_API_KEY'],
    api_version=os.environ['OPENAI_API_VERSION'],
    azure_endpoint=os.environ['AZURE_OPENAI_ENDPOINT']
)

The next two cells create the "in-memory" semantic index with chunks of the sustainability reports:

In [6]:
Settings.embed_model = embed_model
Settings.llm = llm

In [7]:
index = VectorStoreIndex.from_documents(
    documents=documents,
    show_progress=True
)
query_engine = index.as_query_engine()

Parsing nodes:   0%|          | 0/455 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/508 [00:00<?, ?it/s]

# 4. Calculate the response

The next cell calculates the response by querying the semantic index and calling the language model. After this step the following elements are available for each user query in the original dataset:

- `retrieved_contexts`. The list of the top chunks retrieved from the semantic index and passed to the LLM as context to generate the answer.
- `response`. Response of the LLM to the user query using the retrieved chunks as context.
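
Conceptually, retrieving the "top chunks" means ranking the chunk embeddings by similarity to the query embedding and keeping the best ones. The toy sketch below illustrates that idea with cosine similarity on made-up 3-dimensional vectors. It is not the LlamaIndex implementation, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_chunks(query_vector, chunk_vectors, k=2):
    """Return the k chunk names whose vectors are most similar to the query."""
    ranked = sorted(
        chunk_vectors.items(),
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Made-up "embeddings"; real embedding vectors have far more dimensions.
chunks = {
    "chunk about water access": [0.9, 0.1, 0.0],
    "chunk about carbon removal": [0.1, 0.9, 0.0],
    "chunk about governance": [0.0, 0.1, 0.9],
}
query = [0.8, 0.2, 0.0]

retrieved = top_k_chunks(query, chunks, k=2)
print(retrieved)
```

The retrieved chunks are then placed in the prompt, and the language model generates the response from them.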

> *This process could take several minutes, even hours, depending on the rate limit (max tokens per minute) of the deployments used.*

In [8]:
%%time
predictions = rag_dataset.make_predictions_with(
    predictor = query_engine,
    show_progress = True,
    batch_size = 20
)

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:33<00:00,  1.67s/it]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:24<00:00,  1.24s/it]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:12<00:00,  1.22s/it]

CPU times: user 2.81 s, sys: 84.7 ms, total: 2.89 s
Wall time: 1min 10s





# 5. Data preparation for the calculation of the RAGAS metrics

The following cell shows an example of how to prepare the data for the calculation of the RAGAS metrics. For this example we only need to use the following elements:

- `user_input`. In this case, the user input is just the `query` in the original dataset.
- `reference`. In RAGAS, the reference column holds the **expected answer**. It is `reference_answer` in the original dataset.
- `response`. Bot response. We just calculated it in the last step.
- `retrieved_contexts`. List of documents or chunks used as context for the Bot to create the answer. It was also calculated in the last step.


In [9]:
list_of_samples = []

for idx in range(len(rag_dataset.examples)):
    list_of_samples.append(
        SingleTurnSample (
            user_input = rag_dataset.examples[idx].query,
            reference = rag_dataset.examples[idx].reference_answer,
            response = predictions.predictions[idx].response,
            retrieved_contexts = predictions.predictions[idx].contexts
        )
    )

ragas_evaluation_dataset = EvaluationDataset(list_of_samples)
ragas_evaluation_dataset.to_pandas()

Unnamed: 0,user_input,retrieved_contexts,response,reference
0,What is the objective of the Carbon Call initi...,"[In 2022, we also released a preview version o...",The objective of the Carbon Call initiative is...,The objective of Carbon Call is to unify the w...
1,What are the three key areas that will enable ...,[What’s next \nScope 3 emissions reduction \nM...,The three key areas that will enable the scale...,1 Advancing AI solutions for greater climate i...
2,How many people in India did Microsoft provide...,"[Chile\nSouth Africa\n16,408 \nBrazil\n340 \n...","Microsoft provided water access to 225,389 peo...",309921
3,How many people in Indonesia did Microsoft pro...,"[Chile\nSouth Africa\n16,408 \nBrazil\n340 \n...",The specific number of people in Indonesia pro...,225389
4,How many people in Brazil did Microsoft provid...,"[Chile\nSouth Africa\n16,408 \nBrazil\n340 \n...","Microsoft provided water access to 552,058 peo...",16408
5,How many people in Mexico did Microsoft provid...,[Improving access \nto water \nWe continue to ...,"Microsoft provided water access to 309,921 peo...",340
6,How many more acres of land does Microsoft nee...,[• • \na \n47\n | | | \nCommitments and progre...,"According to Ecosystems Chart 1, Microsoft has...",4998 more acres
7,What key insight was listed in page 56 of the ...,[Leadership About Highlights How We Operate Wh...,The context provided does not include informat...,25%\nWe are reducing idle power consumption of...
8,What key insight was listed in page 61 of the ...,[Leadership About Highlights How We Operate Wh...,The context provided does not include informat...,100%\nOur key European and American distributi...
9,What are the four commitments listed on page 66?,[UN Sustainable Development Goal Apple’s suppo...,The four commitments listed on page 66 are: \n...,1. Using our voice on climate-related public p...


# 6. Calculation of RAGAS metrics

The following 2 cells calculate the RAGAS metrics. This example calculates **faithfulness**, **context_precision**, and **context_recall**. For all the available metrics, see the full RAGAS documentation: [https://docs.ragas.io/en/latest/](https://docs.ragas.io/en/latest/)

> *This process could take several minutes, even hours, depending on the rate limit (max tokens per minute) of the deployments used.*
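
When a deployment hits its rate limit, calls are typically retried after a growing wait. The sketch below shows a generic capped exponential backoff, assuming parameters named like the `max_retries` and `max_wait` values used in this notebook. It is not the RAGAS implementation, and `call_with_retries` is a hypothetical helper.

```python
import random
import time


def call_with_retries(func, max_retries=20, max_wait=180, base_delay=1.0, sleep=time.sleep):
    """Call `func`, retrying on any exception with capped exponential backoff.

    Real code should catch only the rate-limit error type instead of Exception.
    """
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise
            # Double the delay on each attempt, cap it at max_wait, add jitter.
            delay = min(max_wait, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, 0.1 * delay))


# Simulated call that fails twice before succeeding.
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"


waits = []  # record the waits instead of actually sleeping
result = call_with_retries(flaky_call, sleep=waits.append)
print(result, len(waits))
```

Capping the delay keeps a long evaluation from stalling indefinitely, while the jitter avoids many parallel calls retrying at the same instant.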

In [10]:
evaluator_llm = LlamaIndexLLMWrapper(llm)
evaluator_embeddings = LlamaIndexEmbeddingsWrapper(embed_model)

In [11]:
%%time

metrics = [
    Faithfulness(llm=evaluator_llm),
    ContextPrecision(llm=evaluator_llm),
    ContextRecall(llm=evaluator_llm)
]
ragas_evaluation_result = evaluate(
    dataset=ragas_evaluation_dataset,
    metrics=metrics,
    llm=evaluator_llm,
    embeddings=evaluator_embeddings,
    run_config=RunConfig(timeout=1800, max_wait=180, max_retries=20),
    show_progress=True,
    batch_size=20
)

Evaluating:   0%|          | 0/150 [00:00<?, ?it/s]

Batch 1/8:   0%|          | 0/20 [00:00<?, ?it/s]

Retrying llama_index.llms.openai.base.OpenAI._achat in 1.0 seconds as it raised RateLimitError: Error code: 429 - {'error': {'code': '429', 'message': 'Requests to the ChatCompletions_Create Operation under Azure OpenAI API version 2024-08-01-preview have exceeded token rate limit of your current OpenAI S0 pricing tier. Please retry after 2 seconds. Please go here: https://aka.ms/oai/quotaincrease if you would like to further increase the default rate limit.'}}.
Retrying llama_index.llms.openai.base.OpenAI._achat in 1.0 seconds as it raised RateLimitError: Error code: 429 - {'error': {'code': '429', 'message': 'Requests to the ChatCompletions_Create Operation under Azure OpenAI API version 2024-08-01-preview have exceeded token rate limit of your current OpenAI S0 pricing tier. Please retry after 2 seconds. Please go here: https://aka.ms/oai/quotaincrease if you would like to further increase the default rate limit.'}}.
Retrying llama_index.llms.openai.base.OpenAI._achat in 1.0 seconds

CPU times: user 5.67 s, sys: 313 ms, total: 5.98 s
Wall time: 2min 41s
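
The counters in the progress output are consistent with simple arithmetic, assuming one evaluation job per sample and metric: 50 samples times 3 metrics gives 150 jobs, which are split into batches of 20. The helper `batch_sizes` below is hypothetical and only reproduces that arithmetic.

```python
def batch_sizes(total_items, batch_size):
    """Split total_items into consecutive batches of at most batch_size items."""
    full_batches, remainder = divmod(total_items, batch_size)
    return [batch_size] * full_batches + ([remainder] if remainder else [])


samples, metric_count, batch_size = 50, 3, 20

evaluations = samples * metric_count
evaluation_batches = batch_sizes(evaluations, batch_size)
prediction_batches = batch_sizes(samples, batch_size)

print(evaluations)              # 150, as in "Evaluating: 0/150"
print(len(evaluation_batches))  # 8, as in "Batch 1/8"
print(prediction_batches)       # [20, 20, 10], as in the prediction progress bars
```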


# 7. Data cleansing

In some cases a RAGAS metric can be **null**. The following cell removes all records with a null value in any of the metrics.

In [12]:
df_ragas_result = ragas_evaluation_result.to_pandas()

# Removing NULL values: drop every record where any of the three metrics is missing
df_ragas_result = df_ragas_result.dropna(
    subset=['faithfulness', 'context_precision', 'context_recall']
).reset_index(drop=True)

print(f"Total records after data cleansing: {df_ragas_result.shape[0]}")

Total records after data cleansing: 50


# 8. Store the input calculated as a JSON file

In [13]:
df_ragas_result.to_json('./test-dataset.json', orient='records', indent=4)
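
A file written with `orient='records'` is a JSON array with one object per row, so it can be read back with the standard library for a quick sanity check. The sketch below computes the mean of each metric. The helper `metric_means` is hypothetical, and the demonstration uses a temporary file with made-up values instead of the real `test-dataset.json`.

```python
import json
import os
import tempfile
from statistics import mean


def metric_means(path, metric_columns):
    """Load a records-oriented JSON file and return (row_count, mean per metric)."""
    with open(path, encoding="utf-8") as handle:
        records = json.load(handle)
    means = {}
    for column in metric_columns:
        values = [row[column] for row in records if row.get(column) is not None]
        means[column] = mean(values) if values else None
    return len(records), means


# Made-up records written to a temporary file, only to keep the sketch runnable.
demo_records = [
    {"user_input": "example question A", "faithfulness": 1.0, "context_recall": 0.5},
    {"user_input": "example question B", "faithfulness": 0.5, "context_recall": 1.0},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "test-dataset.json")
    with open(demo_path, "w", encoding="utf-8") as handle:
        json.dump(demo_records, handle, indent=4)
    count, means = metric_means(demo_path, ["faithfulness", "context_recall"])

print(count, means)
```

Pointing `metric_means` at `./test-dataset.json` with the three metric names used in this notebook gives a quick overview of the stored evaluation.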