# Generation Evaluation using Haystack

## Statistical Evaluation
In this first part, we run a statistical evaluation on the results produced by the RAG pipelines.

### Pipeline Definition
First, we define the pipelines. We build three pipelines that differ only in the prompt template each one uses.

In [1]:
from haystack import Pipeline
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack_integrations.components.retrievers.mongodb_atlas import MongoDBAtlasEmbeddingRetriever
from haystack_integrations.document_stores.mongodb_atlas import MongoDBAtlasDocumentStore
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack.utils import Secret



In [2]:
%env MONGO_CONNECTION_STRING=mongodb+srv://user_dibimbing:gasterus@cluster0.zse9okn.mongodb.net/?retryWrites=true&w=majority&appName=Cluster0



In [3]:
import os
from getpass import getpass

API_KEY = getpass("Insert API KEY:")

In [4]:
class CreatePipeline:
    def __init__(self,template,document_store):
        self.template = template
        self.document_store = document_store
        self.pipeline = Pipeline()

    def build(self):
        self.pipeline.add_component("embedder",SentenceTransformersTextEmbedder())
        # Use the document store passed to the constructor rather than the global variable
        self.pipeline.add_component("retriever",MongoDBAtlasEmbeddingRetriever(document_store=self.document_store,top_k=3))
        self.pipeline.add_component("builder",PromptBuilder(template=self.template))
        # self.pipeline.add_component("generator",OpenAIGenerator(model="gpt-3.5-turbo",api_key=Secret.from_token(API_KEY))) # OPENAI
        self.pipeline.add_component("generator",OpenAIGenerator(model="meta/llama-3.3-70b-instruct",api_base_url="https://integrate.api.nvidia.com/v1",api_key=Secret.from_token(API_KEY))) # with Nvidia NIM

        self.pipeline.connect("embedder","retriever")
        self.pipeline.connect("retriever","builder")
        self.pipeline.connect("builder","generator")
    

In [5]:
document_store = MongoDBAtlasDocumentStore(
    database_name="dibimbing",
    collection_name="context_qa",
    vector_search_index="vector_index_qa",
)

### Pipeline 1

In [6]:
template1 = """
given these documents, answer the question based on these documents. Documents:
{% for document in documents %}
   {{ document.content }}
{% endfor %}
Question: {{query}}
"""
pipeline1 = CreatePipeline(template1,document_store)
pipeline1.build()

### Pipeline 2

In [7]:
template2 = """
given these documents, answer the question based on these documents. please provide the results directly without using premise.
Documents:
{% for document in documents %}
   {{ document.content }}
{% endfor %}
Question: {{query}}
"""
pipeline2 = CreatePipeline(template2,document_store)
pipeline2.build()

### Pipeline 3

In [8]:
template3 ="""
given these documents, answer the question based on these documents. please provide the results directly without using premise. use 1 to 5 words only to answer the question.
Documents:
{% for document in documents %}
   {{ document.content }}
{% endfor %}
Question: {{query}}
"""
pipeline3 = CreatePipeline(template3,document_store)
pipeline3.build()

### Load Dataset
Next, we load the dataset used for evaluation. We use the Stanford Question Answering Dataset (SQuAD), a dataset of questions, contexts, and answers built from Wikipedia articles.  
Source: https://rajpurkar.github.io/SQuAD-explorer/

In [9]:
import json
with open("datasets/qa.json","r") as f:
    dataset = json.load(f)

### Extracting Questions, Contexts, and Answers from the Dataset
Here we extract 10 questions, contexts, and answers as sample data for evaluating our pipelines. The contexts were also stored in MongoDB beforehand.

In [10]:
questions = []
answers = []
contexts = []
for data in dataset['data']:
    for p in data['paragraphs']:
        contexts.append(p['context'])
        # Keep only the first question of each paragraph and its first reference answer
        for qa in p["qas"]:
            questions.append(qa['question'])
            answers.append(qa['answers'][0]['text'])
            break
# Use the first 10 samples for evaluation
question_select = questions[:10]
answer_select = answers[:10]
contexts_select = contexts[:10]
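
The loop above keeps the first question of each paragraph. As a minimal sketch of the same idea, the helper below collects each question, answer, and context as one triple, so the three lists stay aligned even when a paragraph has no questions or a question has no reference answer (which the SQuAD 2.0 format allows for unanswerable questions). The name `extract_first_qa` and the tiny `sample` dict are hypothetical and not part of the notebook; unlike the cell above, this variant skips questions without an answer.

```python
def extract_first_qa(dataset, limit=10):
    """Return up to `limit` aligned (question, answer, context) triples, one per paragraph."""
    triples = []
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                if not qa["answers"]:
                    continue  # skip questions without a reference answer
                triples.append(
                    (qa["question"], qa["answers"][0]["text"], paragraph["context"])
                )
                break  # keep only the first answerable question of the paragraph
    return triples[:limit]


# Hypothetical SQuAD-shaped sample, used only for illustration
sample = {"data": [{"paragraphs": [
    {"context": "Normandy is a region in France.",
     "qas": [{"question": "In what country is Normandy located?",
              "answers": [{"text": "France"}]}]},
    {"context": "A paragraph without questions.", "qas": []},
]}]}

triples = extract_first_qa(sample)
questions_demo, answers_demo, contexts_demo = map(list, zip(*triples))
print(questions_demo, answers_demo, contexts_demo)
```

Building triples first and unzipping afterwards guarantees that index `i` refers to the same sample in all three lists.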

In [11]:
question_select

['In what country is Normandy located?',
 'Who was the duke in the battle of Hastings?',
 'What is the original meaning of the word Norman?',
 'When was the Duchy of Normandy founded?',
 'Who upon arriving gave the original viking settlers a common identity?',
 'What was the Norman religion?',
 "What was one of the Norman's major exports?",
 "Who was the Normans' main enemy in Italy, the Byzantine Empire and Armenia?",
 'When did Herve serve as a Byzantine general?',
 'What was the name of the Norman castle?']

In [12]:
answer_select

['France',
 'William the Conqueror',
 'Viking',
 '911',
 'Rollo',
 'Catholicism',
 'fighting horsemen',
 'Seljuk Turks',
 '1050s',
 'Afranji']

In [13]:
contexts_select

['The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.',
 'The Norman dynasty had a major political, cultural and military impact on medieval Europe and even the Near East. The Normans were famed for their martial spirit and eventually for their Christian piety, becoming exponents of the Catholic orthodoxy in

### Getting the Results from Each Pipeline

In [14]:
def get_result(question,pipeline):
    results = []
    # Iterate over the questions passed in, not the global question_select
    for q in question:
        result = pipeline.run({
            "embedder":{
                "text":q
            },
            "builder":{
                "query":q
            }
        })
        results.append(result)
    return results

In [15]:
results1 = get_result(question_select,pipeline1.pipeline)
results1 = [ r["generator"]["replies"][0] for r in results1]
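
Each `pipeline.run` call returns a dictionary keyed by component name, and the generator's entry holds a `replies` list, which is why the cell above indexes `r["generator"]["replies"][0]`. A minimal sketch of that extraction on mocked results follows; the `mock_results` data and the `first_reply` helper are hypothetical and only stand in for real pipeline output.

```python
def first_reply(result, component="generator"):
    """Return the first reply of one pipeline result, or "" if there is none."""
    replies = result.get(component, {}).get("replies", [])
    return replies[0].strip() if replies else ""


# Mocked results shaped like the pipeline output (hypothetical data)
mock_results = [
    {"generator": {"replies": [" France "], "meta": [{}]}},
    {"generator": {"replies": []}},
]

print([first_reply(r) for r in mock_results])
```

Stripping surrounding whitespace is an optional choice made here; it matters for string-based metrics, where a stray space is enough to break a match.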


In [16]:
results2 = get_result(question_select,pipeline2.pipeline)
results2 = [ r["generator"]["replies"][0] for r in results2]


In [17]:
results3 = get_result(question_select,pipeline3.pipeline)
results3 = [ r["generator"]["replies"][0] for r in results3]


### Answer Exact Match Evaluator

In [18]:
from haystack.components.evaluators import AnswerExactMatchEvaluator
import numpy as np
evaluator = AnswerExactMatchEvaluator()

def exact_match_evaluator(ground_truth,answer):
    result_evaluator = evaluator.run(ground_truth_answers=ground_truth,predicted_answers=answer)
    percentage_result = np.array(result_evaluator["individual_scores"]).sum()/len(result_evaluator["individual_scores"])
    return result_evaluator,percentage_result

In [19]:
result_evaluator1,percentage1 = exact_match_evaluator(answer_select,results1)
result_evaluator2,percentage2 = exact_match_evaluator(answer_select,results2)
result_evaluator3,percentage3 = exact_match_evaluator(answer_select,results3)

In [20]:
print(f"evaluation percentage 1 = {np.round(percentage1*100,2)}%")

evaluation percentage 1 = 0.0%


In [21]:
print(f"evaluation percentage 2 = {np.round(percentage2*100,2)}%")

evaluation percentage 2 = 40.0%


In [22]:
print(f"evaluation percentage 3 = {np.round(percentage3*100,2)}%")

evaluation percentage 3 = 50.0%
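
Exact match is strict: a prediction scores 1 only when it is identical to the reference string. That is presumably why the free-form answers from Pipeline 1 score 0%, while the short answers requested by the template of Pipeline 3 score highest. The sketch below illustrates the metric in plain Python on made-up predictions. It is not Haystack's implementation, and the `normalize` step (lowercasing, removing punctuation, collapsing whitespace) is an optional relaxation assumed here for illustration.

```python
import string


def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match(ground_truth, predicted, normalized=False):
    """Return the per-sample 0/1 scores and their mean."""
    if len(ground_truth) != len(predicted):
        raise ValueError("both lists must have the same length")
    if normalized:
        ground_truth = [normalize(t) for t in ground_truth]
        predicted = [normalize(p) for p in predicted]
    scores = [int(t == p) for t, p in zip(ground_truth, predicted)]
    return scores, sum(scores) / len(scores)


truth = ["France", "Rollo", "911"]
# Hypothetical model outputs, not taken from the run above
preds = ["France", "rollo.", "The Duchy of Normandy was founded in 911."]

strict_scores, strict_mean = exact_match(truth, preds)
relaxed_scores, relaxed_mean = exact_match(truth, preds, normalized=True)
print(strict_scores, strict_mean)
print(relaxed_scores, relaxed_mean)
```

Normalization recovers answers that differ only in case or punctuation, but a full sentence that merely contains the right answer still counts as a miss.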


## Model-Based Evaluations

### Faithfulness Evaluation

In [23]:
import os
from getpass import getpass

os.environ['OPENAI_API_KEY'] = getpass("Insert API KEY:")

In [24]:
from haystack.components.evaluators import FaithfulnessEvaluator
FF_evaluator = FaithfulnessEvaluator()
FF_result1 = FF_evaluator.run(questions=question_select,contexts=contexts_select,predicted_answers=results1)



In [25]:
FF_result2 = FF_evaluator.run(questions=question_select,contexts=contexts_select,predicted_answers=results2)



In [26]:
FF_result3 = FF_evaluator.run(questions=question_select,contexts=contexts_select,predicted_answers=results3)



In [28]:
print(f"evaluation percentage 1 = {np.round(FF_result1['score']*100,2)}%")
print(f"evaluation percentage 2 = {np.round(FF_result2['score']*100,2)}%")
print(f"evaluation percentage 3 = {np.round(FF_result3['score']*100,2)}%")

evaluation percentage 1 = 90.0%
evaluation percentage 2 = 65.0%
evaluation percentage 3 = 70.0%
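
The faithfulness score comes from an LLM judge. As we understand it, the evaluator asks the LLM to split each predicted answer into statements and to judge whether each statement can be inferred from the given context; the per-answer score is the mean of its statement scores, and the overall score is the mean across answers. The sketch below reproduces only that aggregation step under this assumption. The statement judgments in `judged` are hypothetical values, not output from the run above, and `aggregate_faithfulness` is an illustrative helper name.

```python
from statistics import mean


def aggregate_faithfulness(statement_scores_per_answer):
    """Average 0/1 statement judgments per answer, then across answers."""
    individual = [mean(s) if s else 0.0 for s in statement_scores_per_answer]
    overall = mean(individual) if individual else 0.0
    return {"individual_scores": individual, "score": overall}


# Hypothetical judgments: 1 = statement supported by the context, 0 = not supported
judged = [
    [1, 1],  # both statements supported
    [1, 0],  # one of two statements supported
    [0],     # the single statement is unsupported
]

result = aggregate_faithfulness(judged)
print(result)
```

Because scores are averaged twice, a single partially supported answer can shift the overall result by a fraction of a sample, which is consistent with values such as 65% on only 10 questions.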


### SAS Evaluator

In [29]:
from haystack.components.evaluators import SASEvaluator

In [30]:
sas_evaluator = SASEvaluator()
sas_evaluator.warm_up()
sas_result1 = sas_evaluator.run(
  ground_truth_answers=answer_select, 
  predicted_answers=results1
)



In [34]:
sas_result2 = sas_evaluator.run(
  ground_truth_answers=answer_select, 
  predicted_answers=results2
)

In [35]:
sas_result3 = sas_evaluator.run(
  ground_truth_answers=answer_select, 
  predicted_answers=results3
)

In [36]:
print(f"evaluation percentage 1 = {np.round(sas_result1['score']*100,2)}%")
print(f"evaluation percentage 2 = {np.round(sas_result2['score']*100,2)}%")
print(f"evaluation percentage 3 = {np.round(sas_result3['score']*100,2)}%")

evaluation percentage 1 = 43.54%
evaluation percentage 2 = 70.04%
evaluation percentage 3 = 73.32%
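
SAS (Semantic Answer Similarity) compares the meaning of the predicted and reference answers rather than their exact characters, typically by embedding both texts and measuring how similar the embeddings are. This is why Pipeline 1 still gets a non-zero SAS score although its exact match score is 0%. The sketch below shows only the cosine-similarity step on hand-written toy vectors; the vectors are hypothetical stand-ins for real sentence embeddings, and `sas_score` is an illustrative helper, not the Haystack API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def sas_score(truth_vectors, predicted_vectors):
    """Return the per-sample similarities and their mean."""
    scores = [cosine_similarity(t, p) for t, p in zip(truth_vectors, predicted_vectors)]
    return scores, sum(scores) / len(scores)


# Toy 2-dimensional "embeddings" (hypothetical values)
truth_vectors = [[1.0, 0.0], [0.0, 1.0]]
predicted_vectors = [[1.0, 0.0], [1.0, 1.0]]

scores, average = sas_score(truth_vectors, predicted_vectors)
print(scores, average)
```

A prediction pointing in the same direction as the reference scores close to 1, while one that only partly overlaps in meaning lands in between, so verbose but correct answers are penalized less than under exact match.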
