
Commit 088ab98

update examples accuracy (#941)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent 441f8cc commit 088ab98

File tree

12 files changed, +784 -14 lines changed


AudioQnA/benchmark/accuracy/README.md

Lines changed: 1 addition & 1 deletion

@@ -1,4 +1,4 @@
-# AudioQnA accuracy Evaluation
+# AudioQnA Accuracy

AudioQnA is an example that demonstrates the integration of Generative AI (GenAI) models for performing question answering (QnA) on audio scenes, which involves Automatic Speech Recognition (ASR) and Text-to-Speech (TTS). The following is the pipeline for evaluating the ASR accuracy.
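For context, ASR accuracy in a pipeline like this is typically reported as word error rate (WER) between the reference transcripts and the ASR output. The snippet below is a minimal illustration only, assuming the third-party `jiwer` package and made-up transcripts; it is not the exact code used by this example.

```python
# Illustrative WER computation (assumes the jiwer package; not this repo's script).
from jiwer import wer

references = ["the quick brown fox jumps over the lazy dog"]
hypotheses = ["the quick brown fox jumped over a lazy dog"]  # pretend ASR output

print(f"WER: {wer(references, hypotheses):.3f}")  # lower is better
```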

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

python online_evaluate.py

ChatQnA/benchmark/accuracy/README.md

Lines changed: 170 additions & 0 deletions
@@ -0,0 +1,170 @@
# ChatQnA Accuracy

ChatQnA is a Retrieval-Augmented Generation (RAG) pipeline, which can enhance generative models through external information retrieval.

To evaluate accuracy, we use two recently published datasets and more than ten popular and comprehensive metrics:

- Dataset
  - [MultiHop](https://arxiv.org/pdf/2401.15391) (English dataset)
  - [CRUD](https://arxiv.org/abs/2401.17043) (Chinese dataset)
- Metrics (measuring the accuracy of both context retrieval and response generation)
  - Evaluation of retrieval/reranking (see the sketch after this list)
    - MRR@10
    - MAP@10
    - Hits@10
    - Hits@4
    - LLM-as-a-Judge
  - Evaluation of the generated response from the end-to-end pipeline
    - BLEU
    - ROUGE(L)
    - LLM-as-a-Judge
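As a quick orientation, the rank-based retrieval metrics above can be sketched as follows; the function names and toy data here are illustrative and are not taken from the evaluation scripts.

```python
# Toy illustration of Hits@k and MRR@k over a ranked retrieval result.
def hits_at_k(ranked_doc_ids, relevant_ids, k):
    """1 if any gold document appears in the top-k results, else 0."""
    return int(any(doc_id in relevant_ids for doc_id in ranked_doc_ids[:k]))


def mrr_at_k(ranked_doc_ids, relevant_ids, k):
    """Reciprocal rank of the first gold document within the top k (0 if none)."""
    for rank, doc_id in enumerate(ranked_doc_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


ranked = ["d7", "d3", "d9", "d1"]  # retriever output, best first
relevant = {"d3", "d1"}            # gold evidence documents
print(hits_at_k(ranked, relevant, 4))   # 1
print(mrr_at_k(ranked, relevant, 10))   # 0.5 (first relevant doc at rank 2)
```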
## Prerequisite

### Environment

```bash
git clone https://github.com/opea-project/GenAIEval
cd GenAIEval
pip install -r requirements.txt
pip install -e .
```
## MultiHop (English dataset)

[MultiHop-RAG](https://arxiv.org/pdf/2401.15391): a QA dataset for evaluating retrieval and reasoning across documents with metadata in RAG pipelines. It contains 2556 queries, with evidence for each query distributed across 2 to 4 documents. The queries also involve document metadata, reflecting complex scenarios commonly found in real-world RAG applications.

### Launch Service of RAG System

Please refer to this [guide](https://github.com/opea-project/GenAIExamples/blob/main/ChatQnA/README.md) to launch the `ChatQnA` service.
### Launch Service of LLM-as-a-Judge

To set up an LLM, we can use [tgi-gaudi](https://github.com/huggingface/tgi-gaudi) to launch a service. For example, the following command sets up the [mistralai/Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) model on 2 Gaudi2 cards:

```bash
# please set your llm_port and hf_token

docker run -p {your_llm_port}:80 --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true -e OMPI_MCA_btl_vader_single_copy_mechanism=none -e HF_TOKEN={your_hf_token} --cap-add=sys_nice --ipc=host ghcr.io/huggingface/tgi-gaudi:2.0.1 --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 --max-input-tokens 2048 --max-total-tokens 4096 --sharded true --num-shard 2

# for better performance, set `PREFILL_BATCH_BUCKET_SIZE`, `BATCH_BUCKET_SIZE`, `max-batch-total-tokens`, `max-batch-prefill-tokens`
docker run -p {your_llm_port}:80 --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true -e OMPI_MCA_btl_vader_single_copy_mechanism=none -e HF_TOKEN={your_hf_token} -e PREFILL_BATCH_BUCKET_SIZE=1 -e BATCH_BUCKET_SIZE=8 --cap-add=sys_nice --ipc=host ghcr.io/huggingface/tgi-gaudi:2.0.5 --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 --max-input-tokens 2048 --max-total-tokens 4096 --sharded true --num-shard 2 --max-batch-total-tokens 65536 --max-batch-prefill-tokens 2048
```
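Before kicking off the full evaluation, it can be useful to confirm that the judge endpoint is reachable. The sketch below assumes TGI's standard `/generate` REST API and a placeholder address; adjust it to your deployment.

```python
# Sanity-check the LLM-as-a-Judge service (assumes TGI's /generate API).
import requests

LLM_ENDPOINT = "http://{your_llm_ip}:{your_llm_port}"  # placeholder, fill in your values

resp = requests.post(
    f"{LLM_ENDPOINT}/generate",
    json={"inputs": "What is 2 + 2?", "parameters": {"max_new_tokens": 16}},
    timeout=60,
)
resp.raise_for_status()
print(resp.json())  # expect a {"generated_text": ...} payload
```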
### Prepare Dataset

We use the evaluation dataset from the [MultiHop-RAG](https://github.com/yixuantt/MultiHop-RAG) repo; use the command below to prepare the dataset.

```bash
git clone https://github.com/yixuantt/MultiHop-RAG.git
```

### Evaluation

Use the commands below to run the evaluation. Please note that for the first run, the argument `--ingest_docs` should be added to ingest the documents into the vector database; for subsequent runs, this argument should be omitted. Set `--retrieval_metrics` to get retrieval-related metrics (MRR@10/MAP@10/Hits@10/Hits@4). Set `--ragas_metrics` and `--llm_endpoint` to get end-to-end RAG pipeline metrics (faithfulness/answer_relevancy/...), which are judged by LLMs. `--limits` is set to 100 by default, which means only 100 examples are evaluated by LLM-as-a-Judge, as it is very time-consuming.

If you are using Docker Compose to deploy the `ChatQnA` system, you can simply run the evaluation as follows:

```bash
python eval_multihop.py --docs_path MultiHop-RAG/dataset/corpus.json --dataset_path MultiHop-RAG/dataset/MultiHopRAG.json --ingest_docs --retrieval_metrics --ragas_metrics --llm_endpoint http://{llm_as_judge_ip}:{llm_as_judge_port}/generate
```

If you are using Kubernetes manifests/Helm to deploy the `ChatQnA` system, you must specify more arguments, as follows:

```bash
python eval_multihop.py --docs_path MultiHop-RAG/dataset/corpus.json --dataset_path MultiHop-RAG/dataset/MultiHopRAG.json --ingest_docs --retrieval_metrics --ragas_metrics --llm_endpoint http://{llm_as_judge_ip}:{llm_as_judge_port}/generate --database_endpoint http://{your_dataprep_ip}:{your_dataprep_port}/v1/dataprep --embedding_endpoint http://{your_embedding_ip}:{your_embedding_port}/v1/embeddings --tei_embedding_endpoint http://{your_tei_embedding_ip}:{your_tei_embedding_port} --retrieval_endpoint http://{your_retrieval_ip}:{your_retrieval_port}/v1/retrieval --service_url http://{your_chatqna_ip}:{your_chatqna_port}/v1/chatqna
```
The default values for the arguments are:

|Argument|Default value|
|--------|-------------|
|service_url|http://localhost:8888/v1/chatqna|
|database_endpoint|http://localhost:6007/v1/dataprep|
|embedding_endpoint|http://localhost:6000/v1/embeddings|
|tei_embedding_endpoint|http://localhost:8090|
|retrieval_endpoint|http://localhost:7000/v1/retrieval|
|reranking_endpoint|http://localhost:8000/v1/reranking|
|output_dir|./output|
|temperature|0.1|
|max_new_tokens|1280|
|chunk_size|256|
|chunk_overlap|100|
|search_type|similarity|
|retrival_k|10|
|fetch_k|20|
|lambda_mult|0.5|
|dataset_path|None|
|docs_path|None|
|limits|100|

You can check the argument details with the command below:

```bash
python eval_multihop.py --help
```
## CRUD (Chinese dataset)

[CRUD-RAG](https://arxiv.org/abs/2401.17043) is a Chinese benchmark for RAG (Retrieval-Augmented Generation) systems. This example utilizes CRUD-RAG to evaluate the RAG system.

### Prepare Dataset

We use the evaluation dataset from the [CRUD-RAG](https://github.com/IAAR-Shanghai/CRUD_RAG) repo; use the commands below to prepare the dataset.

```bash
git clone https://github.com/IAAR-Shanghai/CRUD_RAG
mkdir data/
cp CRUD_RAG/data/crud_split/split_merged.json data/
cp -r CRUD_RAG/data/80000_docs/ data/
python process_crud_dataset.py
```

### Launch Service of RAG System

Please refer to this [guide](https://github.com/opea-project/GenAIExamples/blob/main/ChatQnA/README.md) to launch the `ChatQnA` service. For the Chinese dataset, you should replace the English embedding and LLM models with Chinese ones, for example, `EMBEDDING_MODEL_ID="BAAI/bge-base-zh-v1.5"` and `LLM_MODEL_ID=Qwen/Qwen2-7B-Instruct`.

### Evaluation
Use the commands below to run the evaluation. Please note that for the first run, the argument `--ingest_docs` should be added to ingest the documents into the vector database; for subsequent runs, this argument should be omitted.

If you are using Docker Compose to deploy the `ChatQnA` system, you can simply run the evaluation as follows:

```bash
python eval_crud.py --dataset_path ./data/split_merged.json --docs_path ./data/80000_docs --ingest_docs

# if you want to get ragas metrics
python eval_crud.py --dataset_path ./data/split_merged.json --docs_path ./data/80000_docs --contain_original_data --llm_endpoint "http://{llm_as_judge_ip}:{llm_as_judge_port}" --ragas_metrics
```

If you are using Kubernetes manifests/Helm to deploy the `ChatQnA` system, you must specify more arguments, as follows:

```bash
python eval_crud.py --dataset_path ./data/split_merged.json --docs_path ./data/80000_docs --ingest_docs --database_endpoint http://{your_dataprep_ip}:{your_dataprep_port}/v1/dataprep --embedding_endpoint http://{your_embedding_ip}:{your_embedding_port}/v1/embeddings --retrieval_endpoint http://{your_retrieval_ip}:{your_retrieval_port}/v1/retrieval --service_url http://{your_chatqna_ip}:{your_chatqna_port}/v1/chatqna
```
The default values for the arguments are:

|Argument|Default value|
|--------|-------------|
|service_url|http://localhost:8888/v1/chatqna|
|database_endpoint|http://localhost:6007/v1/dataprep|
|embedding_endpoint|http://localhost:6000/v1/embeddings|
|retrieval_endpoint|http://localhost:7000/v1/retrieval|
|reranking_endpoint|http://localhost:8000/v1/reranking|
|output_dir|./output|
|temperature|0.1|
|max_new_tokens|1280|
|chunk_size|256|
|chunk_overlap|100|
|dataset_path|./data/split_merged.json|
|docs_path|./data/80000_docs|
|tasks|["question_answering"]|

You can check the argument details with the command below:

```bash
python eval_crud.py --help
```
## Acknowledgements

This example is mostly adapted from the [MultiHop-RAG](https://github.com/yixuantt/MultiHop-RAG) and [CRUD-RAG](https://github.com/IAAR-Shanghai/CRUD_RAG) repos; we thank the authors for their great work!
Lines changed: 210 additions & 0 deletions
@@ -0,0 +1,210 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0


import argparse
import json
import os

from evals.evaluation.rag_eval import Evaluator
from evals.evaluation.rag_eval.template import CRUDTemplate
from evals.metrics.ragas import RagasMetric
from tqdm import tqdm

class CRUD_Evaluator(Evaluator):
    def get_ground_truth_text(self, data: dict):
        if self.task == "summarization":
            ground_truth_text = data["summary"]
        elif self.task == "question_answering":
            ground_truth_text = data["answers"]
        elif self.task == "continuation":
            ground_truth_text = data["continuing"]
        elif self.task == "hallucinated_modified":
            ground_truth_text = data["hallucinatedMod"]
        else:
            raise NotImplementedError(
                f"Unknown task {self.task}, only support "
                "summarization, question_answering, continuation and hallucinated_modified."
            )
        return ground_truth_text

    def get_query(self, data: dict):
        if self.task == "summarization":
            query = data["text"]
        elif self.task == "question_answering":
            query = data["questions"]
        elif self.task == "continuation":
            query = data["beginning"]
        elif self.task == "hallucinated_modified":
            query = data["newsBeginning"]
        else:
            raise NotImplementedError(
                f"Unknown task {self.task}, only support "
                "summarization, question_answering, continuation and hallucinated_modified."
            )
        return query

    def get_document(self, data: dict):
        if self.task == "summarization":
            document = data["text"]
        elif self.task == "question_answering":
            document = data["news1"]
        elif self.task == "continuation":
            document = data["beginning"]
        elif self.task == "hallucinated_modified":
            document = data["newsBeginning"]
        else:
            raise NotImplementedError(
                f"Unknown task {self.task}, only support "
                "summarization, question_answering, continuation and hallucinated_modified."
            )
        return document

    def get_template(self):
        if self.task == "summarization":
            template = CRUDTemplate.get_summarization_template()
        elif self.task == "question_answering":
            template = CRUDTemplate.get_question_answering_template()
        elif self.task == "continuation":
            template = CRUDTemplate.get_continuation_template()
        else:
            raise NotImplementedError(
                f"Unknown task {self.task}, only support "
                "summarization, question_answering, continuation and hallucinated_modified."
            )
        return template

    def post_process(self, result):
        return result.split("<response>")[-1].split("</response>")[0].strip()

    def get_ragas_metrics(self, results, arguments):
        from langchain_huggingface import HuggingFaceEndpointEmbeddings

        embeddings = HuggingFaceEndpointEmbeddings(model=arguments.tei_embedding_endpoint)

        metric = RagasMetric(
            threshold=0.5,
            model=arguments.llm_endpoint,
            embeddings=embeddings,
            metrics=["faithfulness", "answer_relevancy"],
        )

        all_answer_relevancy = 0
        all_faithfulness = 0
        ragas_inputs = {
            "question": [],
            "answer": [],
            "ground_truth": [],
            "contexts": [],
        }

        valid_results = self.remove_invalid(results["results"])

        for data in tqdm(valid_results):
            data = data["original_data"]

            query = self.get_query(data)
            generated_text = data["generated_text"]
            ground_truth = data["ground_truth_text"]
            retrieved_documents = data["retrieved_documents"]

            ragas_inputs["question"].append(query)
            ragas_inputs["answer"].append(generated_text)
            ragas_inputs["ground_truth"].append(ground_truth)
            ragas_inputs["contexts"].append(retrieved_documents[:3])

        ragas_metrics = metric.measure(ragas_inputs)
        return ragas_metrics

def args_parser():
    parser = argparse.ArgumentParser()

    parser.add_argument(
        "--service_url", type=str, default="http://localhost:8888/v1/chatqna", help="Service URL address."
    )
    parser.add_argument("--output_dir", type=str, default="./output", help="Directory to save evaluation results.")
    parser.add_argument(
        "--temperature", type=float, default=0.1, help="Controls the randomness of the model's text generation"
    )
    parser.add_argument(
        "--max_new_tokens", type=int, default=1280, help="Maximum number of new tokens to be generated by the model"
    )
    parser.add_argument(
        "--chunk_size", type=int, default=256, help="the maximum number of characters that a chunk can contain"
    )
    parser.add_argument(
        "--chunk_overlap",
        type=int,
        default=100,
        help="the number of characters that should overlap between two adjacent chunks",
    )
    parser.add_argument("--dataset_path", default="../data/split_merged.json", help="Path to the dataset")
    parser.add_argument("--docs_path", default="../data/80000_docs", help="Path to the retrieval documents")

    # Retriever related options
    parser.add_argument("--tasks", default=["question_answering"], nargs="+", help="Task to perform")
    parser.add_argument("--ingest_docs", action="store_true", help="Whether to ingest documents to vector database")
    parser.add_argument(
        "--database_endpoint", type=str, default="http://localhost:6007/v1/dataprep", help="Service URL address."
    )
    parser.add_argument(
        "--embedding_endpoint", type=str, default="http://localhost:6000/v1/embeddings", help="Service URL address."
    )
    parser.add_argument(
        "--retrieval_endpoint", type=str, default="http://localhost:7000/v1/retrieval", help="Service URL address."
    )
    parser.add_argument(
        "--tei_embedding_endpoint",
        type=str,
        default="http://localhost:8090",
        help="Service URL address of tei embedding.",
    )
    parser.add_argument("--ragas_metrics", action="store_true", help="Whether to compute ragas metrics.")
    parser.add_argument("--llm_endpoint", type=str, default=None, help="Service URL address.")
    parser.add_argument(
        "--show_progress_bar", action="store", default=True, type=bool, help="Whether to show a progress bar"
    )
    parser.add_argument("--contain_original_data", action="store_true", help="Whether to contain original data")

    args = parser.parse_args()
    return args

def main():
    args = args_parser()
    if os.path.isfile(args.dataset_path):
        with open(args.dataset_path) as f:
            all_datasets = json.load(f)
    else:
        raise FileNotFoundError(f"Evaluation dataset file {args.dataset_path} does not exist.")
    os.makedirs(args.output_dir, exist_ok=True)
    for task in args.tasks:
        if task == "question_answering":
            dataset = all_datasets["questanswer_1doc"]
        elif task == "summarization":
            dataset = all_datasets["event_summary"]
        else:
            raise NotImplementedError(
                f"Unknown task {task}, only support "
                "summarization, question_answering, continuation and hallucinated_modified."
            )
        output_save_path = os.path.join(args.output_dir, f"{task}.json")
        evaluator = CRUD_Evaluator(dataset=dataset, output_path=output_save_path, task=task)
        if args.ingest_docs:
            CRUD_Evaluator.ingest_docs(args.docs_path, args.database_endpoint, args.chunk_size, args.chunk_overlap)
        results = evaluator.evaluate(
            args, show_progress_bar=args.show_progress_bar, contain_original_data=args.contain_original_data
        )
        print(results["overall"])
        if args.ragas_metrics:
            ragas_metrics = evaluator.get_ragas_metrics(results, args)
            print(ragas_metrics)
        print(f"Evaluation results of task {task} saved to {output_save_path}.")


if __name__ == "__main__":
    main()
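Note: `post_process` above simply extracts whatever the model wrapped in `<response>...</response>` tags. A hypothetical input/output pair, for illustration only:

```python
# Hypothetical illustration of CRUD_Evaluator.post_process on a raw generation.
raw = "...model output... <response> The 2008 Olympic Games were held in Beijing. </response>"
print(raw.split("<response>")[-1].split("</response>")[0].strip())
# -> The 2008 Olympic Games were held in Beijing.
```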
