In [1]:
import os
from dotenv import load_dotenv
load_dotenv()
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "gemini.json"

api_key = os.environ.get("OPENAI_API_KEY")
if api_key is None:
    raise RuntimeError("Please set the OPENAI_API_KEY environment variable.")

In [2]:
from scirag import SingleRAGEvaluationSystem



# Evaluate Gemini Embedding

In [3]:
import pandas as pd

In [4]:
gemini_embedding_df = pd.read_pickle("results/gemini_embedding_results.pkl")

In [5]:
rag_evaluator = SingleRAGEvaluationSystem(
    evaluator_model="o3-mini"
)

In [6]:

gemini_embedding_evaluated = rag_evaluator.evaluate_single_dataframe(
    df=gemini_embedding_df,
    system_name="Gemini_embedding"
)


EVALUATING: Gemini_embedding
All required columns found
Available columns: ['question_id', 'question', 'ideal_solution', 'response', 'answer', 'sources', 'processing_time', 'success', 'error', 'embedding_system']
Filtering by success column: 105 successful out of 105 total

Evaluating 1/105 - Question ID: 1
user (to ai_judge):


Please evaluate this system's response against the ideal answer:

QUESTION: How is the standard recombination history tested in the Planck 2018 analysis?

GENERATED ANSWER:
The standard recombination history is tested in the Planck 2018 analysis using a semi-blind eigen-analysis of deviations in the free-electron fraction, $x_{\mathrm{e}}(z)$, from the standard model. Perturbations in $x_{\mathrm{e}}(z)$ are expanded into eigenmodes (eXeMs), and their amplitudes are determined through MCMC sampling, showing consistency with no deviation from standard recombination.

IDEAL ANSWER:
semi-blind eigen-analysis (often referred to as a principal-component an
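
The log lines above report a required-column check followed by a filter on the `success` column. The internals of `SingleRAGEvaluationSystem` are not shown in this notebook, so the following is only a minimal sketch of what such a pre-check might look like. The function name `filter_successful` and the choice of `REQUIRED_COLUMNS` are assumptions made for illustration; the sketch works on a list of row dictionaries (for example `df.to_dict("records")`) so that it runs without the result files.

```python
# Hypothetical helper: not part of scirag. The required-column list is an
# assumed subset of the columns printed in the log above.
REQUIRED_COLUMNS = ("question_id", "question", "ideal_solution", "answer")


def filter_successful(rows, required=REQUIRED_COLUMNS):
    """Check that required keys exist, then keep rows whose `success` flag is truthy."""
    if not rows:
        return []
    missing = [col for col in required if col not in rows[0]]
    if missing:
        raise KeyError(f"Missing required columns: {missing}")
    if "success" not in rows[0]:
        # No success flag recorded: keep every row.
        return list(rows)
    kept = [row for row in rows if row["success"]]
    print(f"Filtering by success column: {len(kept)} successful out of {len(rows)} total")
    return kept


# Toy rows standing in for a results DataFrame.
rows = [
    {"question_id": 1, "question": "q1", "ideal_solution": "a", "answer": "a", "success": True},
    {"question_id": 2, "question": "q2", "ideal_solution": "b", "answer": "", "success": False},
]
successful = filter_successful(rows)
```

Running the check before any judge call means a malformed results file fails immediately instead of partway through 105 evaluations.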

In [7]:
gemini_embedding_evaluated

Unnamed: 0,question_id,question,ideal_solution,response,answer,sources,processing_time,success,error,embedding_system,eval_accuracy_score,eval_rationale,eval_successful,eval_error,eval_processing_time
0,1,How is the standard recombination history test...,semi-blind eigen-analysis (often referred to a...,**Answer**:\n\nThe standard recombination hist...,The standard recombination history is tested i...,"1807.06209v4.md, Section 7.7, Page 29, 1807.06...",6.377757,True,,Gemini,100,The generated answer correctly describes the t...,True,,3.341979
1,2,Which corrections in polarization spectra were...,Beam leakage correction; effective polarizatio...,"**Answer**:\n\nIn the 2018 Planck analysis, co...","In the 2018 Planck analysis, corrections appli...",1807.06209v4.md,5.089953,True,,Gemini,0,The generated answer correctly mentions beam-l...,True,,5.973115
2,3,What multipole cuts were applied in the Camspe...,"lmin=500, lmax=2500","**Answer**:\n\nFor the Planck 2018 analysis, t...","For the Planck 2018 analysis, the Camspec temp...",1807.06209v4.md,3.338507,True,,Gemini,100,The generated answer precisely matches the ide...,True,,3.249488
3,4,What is the effective sky-fraction of the apod...,47.70%,"**Answer**:\n\nFor the Planck 2018 analysis, t...","For the Planck 2018 analysis, the default apod...",1807.06209v4.md,3.815803,True,,Gemini,100,The generated answer correctly states the effe...,True,,3.652393
4,5,How large is the impact of beam window functio...,approximately 0.1% at l=2000,**Answer**:\n\nThe impact of beam window funct...,The impact of beam window functions on the 201...,1807.06209v4.md,3.641038,True,,Gemini,100,The generated answer accurately states that th...,True,,3.110253
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
100,101,What would be the signs of dark matter annihil...,Damping of high-l power in both temperature an...,**Answer**:\n\nDark matter annihilation into S...,Dark matter annihilation into Standard Model p...,"2503.14454v1.md, Section 6.4.2",3.279053,True,,Gemini,100,The generated answer precisely matches the ide...,True,,3.3453
101,102,"In the ACT DR6 paper, why is the constraint fo...",Bayesian constraints on the IDR-IDM model caus...,**Answer**:\n\nThe upper limit on N_idr for th...,The upper limit on N_idr for the IDR-IDM model...,"2503.14454v1.md, Section 6.5.2, Figure 35 capt...",4.467764,True,,Gemini,0,The generated answer does not capture the key ...,True,,5.124177
102,103,Why does including f\sigma_8 measurements into...,The f\sigma_8 dataset contains two low-z point...,**Answer**:\n\nIncluding fσ8 measurements dram...,Including fσ8 measurements dramatically increa...,"2503.14454v1.md, Context 2, Figure 39, 2503.14...",4.156666,True,,Gemini,100,The generated answer correctly explains that t...,True,,4.566739
103,104,Do the extensions to \lambdaCDM considered in ...,No. The range of H_0 given by the models studi...,"**Answer**:\n\nNo, the extensions to the Lambd...","No, the extensions to the LambdaCDM model cons...","2503.14454v1.md, Section 8.2, Cosmological con...",6.575425,True,,Gemini,100,The generated answer correctly states that the...,True,,3.6161
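
The evaluated frame above carries one `eval_accuracy_score` per question, which is easier to compare across systems once it is reduced to a few numbers. The sketch below is one possible summary, not something the notebook itself computes; `summarize_scores` is a hypothetical helper that accepts any iterable of scores, so it can be fed a pandas Series directly.

```python
from statistics import mean


def summarize_scores(scores):
    """Return the count, mean score, and fraction of answers scored exactly 100."""
    values = [float(s) for s in scores]
    if not values:
        return {"n": 0, "mean_accuracy": None, "fraction_full_marks": None}
    return {
        "n": len(values),
        "mean_accuracy": mean(values),
        "fraction_full_marks": sum(v == 100 for v in values) / len(values),
    }


# Scores of the first five rows displayed above (question IDs 1 to 5).
summary = summarize_scores([100, 0, 100, 100, 100])
print(summary)
```

In the notebook, the same helper could be applied to the full column, for example `summarize_scores(gemini_embedding_evaluated["eval_accuracy_score"])`, ideally after restricting to rows where `eval_successful` is true.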


# Evaluate Gemini without RAG

In [7]:
gemini_no_rag = pd.read_pickle("results/gemini_norag_results.pkl")

In [8]:

gemini_no_rag_eval = rag_evaluator.evaluate_single_dataframe(
    df=gemini_no_rag,
    system_name="Gemini_no_rag"
)


EVALUATING: Gemini_no_rag
All required columns found
Available columns: ['question_id', 'question', 'response', 'answer', 'sources', 'ideal_solution', 'processing_time', 'success', 'error', 'embedding_system']
Filtering by success column: 105 successful out of 105 total

Evaluating 1/105 - Question ID: 1
user (to ai_judge):


Please evaluate this system's response against the ideal answer:

QUESTION: How is the standard recombination history tested in the Planck 2018 analysis?

GENERATED ANSWER:
The Planck 2018 analysis tests the standard recombination history by comparing observed Cosmic Microwave Background (CMB) anisotropies with theoretical predictions based on the standard cosmological model [1]. This allows for constraints on parameters that could modify the recombination process.

IDEAL ANSWER:
semi-blind eigen-analysis (often referred to as a principal-component analysis)


Evaluate based on:
Accuracy (0-100): How factually correct is the answer compared to the ideal?
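
The message sent to `ai_judge` above follows a fixed layout: question, generated answer, ideal answer, then the scoring rubric. How `scirag` actually assembles it is not shown here, so the following is only a sketch that reproduces the visible layout. `PROMPT_TEMPLATE` and `build_judge_prompt` are hypothetical names, and any rubric text beyond the accuracy line shown in the log is deliberately left out.

```python
# Hypothetical reconstruction of the judge prompt layout seen in the log.
PROMPT_TEMPLATE = """\
Please evaluate this system's response against the ideal answer:

QUESTION: {question}

GENERATED ANSWER:
{answer}

IDEAL ANSWER:
{ideal}

Evaluate based on:
Accuracy (0-100): How factually correct is the answer compared to the ideal?
"""


def build_judge_prompt(question, answer, ideal):
    """Fill the template; only the template is parsed, so braces in values are safe."""
    return PROMPT_TEMPLATE.format(
        question=question.strip(),
        answer=answer.strip(),
        ideal=ideal.strip(),
    )


prompt = build_judge_prompt(
    question="How is the standard recombination history tested in the Planck 2018 analysis?",
    answer=r"A semi-blind eigen-analysis of deviations in $x_{\mathrm{e}}(z)$.",
    ideal="semi-blind eigen-analysis (often referred to as a principal-component analysis)",
)
print(prompt)
```

Because `str.format` does not re-scan substituted values, LaTeX such as `$x_{\mathrm{e}}(z)$` in an answer passes through unchanged.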

# Evaluate Vanilla PaperQA

In [16]:
paperqa = pd.read_pickle("results/paperqa2_valina_gpt4.1_results.pkl")
paperqa_eval = rag_evaluator.evaluate_single_dataframe(
    df=paperqa,
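    # This label is identical to the one used for the modified PaperQA run;
    # a distinct label would keep the two result sets apart.
    system_name="paperqa_modified_gpt4.1"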
)


EVALUATING: paperqa_modified_gpt4.1
All required columns found
Available columns: ['question_id', 'question', 'response', 'answer', 'sources', 'ideal_solution', 'processing_time', 'success', 'error', 'embedding_system']
Filtering by success column: 105 successful out of 105 total

Evaluating 1/105 - Question ID: 1
user (to ai_judge):


Please evaluate this system's response against the ideal answer:

QUESTION: How is the standard recombination history tested in the Planck 2018 analysis?

GENERATED ANSWER:
The Planck 2018 analysis tests the standard recombination history by comparing precise measurements of the CMB temperature, polarization, and lensing power spectra to theoretical predictions, and by performing a principal-component (eigenmode) analysis of perturbations to the free-electron fraction, $x_e(z)$; all results show no significant deviations from the standard recombination scenario and confirm the robustness of cosmological parameter estimates .

IDEAL ANSWER:
semi

# Evaluate Modified PaperQA

In [15]:
paperqa_modified = pd.read_pickle("results/paperqa2_gpt4.1_results.pkl")
paperqa_modified_eval = rag_evaluator.evaluate_single_dataframe(
    df=paperqa_modified,
    system_name="paperqa_modified_gpt4.1"
)


EVALUATING: paperqa_modified_gpt4.1
All required columns found
Available columns: ['question_id', 'question', 'response', 'answer', 'sources', 'ideal_solution', 'processing_time', 'success', 'error', 'embedding_system']
Filtering by success column: 105 successful out of 105 total

Evaluating 1/105 - Question ID: 1
user (to ai_judge):


Please evaluate this system's response against the ideal answer:

QUESTION: How is the standard recombination history tested in the Planck 2018 analysis?

GENERATED ANSWER:
The Planck 2018 analysis tests the standard recombination history by performing a principal-component (eigenmode) analysis of deviations in the free electron fraction, $x_e(z)$, fitting the amplitudes of these modes to CMB temperature, polarization, lensing, and BAO data; the results show no significant deviations from the standard recombination scenario, confirming its robustness with current data . The analysis also constrains possible non-standard effects, such as energy 

# Evaluate OpenAI Vector Store (no PDF)

In [13]:
openai_vector_no_pdf = pd.read_pickle("results/openai_vector_store_results.pkl")

In [14]:
openai_vector_no_pdf_eval = rag_evaluator.evaluate_single_dataframe(
    df=openai_vector_no_pdf,
    system_name="OpenAI_Vector_Store_md"
)


EVALUATING: OpenAI_Vector_Store_md
All required columns found
Available columns: ['question_id', 'question', 'ideal_solution', 'response', 'answer', 'sources', 'processing_time', 'success', 'error', 'embedding_system']
Filtering by success column: 105 successful out of 105 total

Evaluating 1/105 - Question ID: 1
user (to ai_judge):


Please evaluate this system's response against the ideal answer:

QUESTION: How is the standard recombination history tested in the Planck 2018 analysis?

GENERATED ANSWER:
The standard recombination history in the Planck 2018 analysis is tested using a semi-blind principal-component (eigenmode) analysis of perturbations to the free-electron fraction, x_e(z), across redshifts relevant for recombination. The amplitudes of the first few eigenmodes (eXeMs) are constrained using Planck TT,TE,EE+lowE+lensing+BAO data, and all are found to be consistent with zero, indicating no significant deviation from the standard recombination scenario within the 

# Evaluate OpenAI Vector Store (PDF)


In [11]:
openai_vector_df = pd.read_pickle("results/openai_pdf_vector_store_results.pkl")

In [12]:
openai_vector_df_eval = rag_evaluator.evaluate_single_dataframe(
    df=openai_vector_df,
    system_name="OpenAI_Vector_Store"
)


EVALUATING: OpenAI_Vector_Store
All required columns found
Available columns: ['question_id', 'question', 'ideal_solution', 'response', 'answer', 'sources', 'processing_time', 'success', 'error', 'embedding_system']
Filtering by success column: 105 successful out of 105 total

Evaluating 1/105 - Question ID: 1
user (to ai_judge):


Please evaluate this system's response against the ideal answer:

QUESTION: How is the standard recombination history tested in the Planck 2018 analysis?

GENERATED ANSWER:
In the Planck 2018 analysis, the standard recombination history is tested using a semi-blind principal-component (eigenmode) analysis of deviations in the free-electron fraction, xe(z), from the standard recombination scenario. This approach expands perturbations in xe(z) across 80 redshift bands, constructs eigenmodes (eXeMs), and fits their amplitudes to the data; the results show no significant evidence for deviations from the standard recombination history, with all mode amp

# Evaluate OpenAI Embedding

In [9]:
openai_embedding = pd.read_pickle("results/openai_embedding_results_final.pkl")

In [10]:

openai_embedding_eval = rag_evaluator.evaluate_single_dataframe(
    df=openai_embedding,
    system_name="OpenAI_embedding"
)


EVALUATING: OpenAI_embedding
All required columns found
Available columns: ['question_id', 'question', 'response', 'ideal_solution', 'answer', 'sources', 'processing_time', 'success', 'error', 'embedding_system']
Filtering by success column: 105 successful out of 105 total

Evaluating 1/105 - Question ID: 1
user (to ai_judge):


Please evaluate this system's response against the ideal answer:

QUESTION: How is the standard recombination history tested in the Planck 2018 analysis?

GENERATED ANSWER:
The Planck 2018 analysis tests the standard recombination history by incorporating subtle atomic physics and radiative-transfer effects using advanced codes like CosmoRec and HyRec. It also performs a semi-blind eigen-analysis of deviations in the free-electron fraction from the standard history and uses non-parametric reconstructions, such as the ModRec model, to parametrize and test for departures in the ionization fraction.

IDEAL ANSWER:
semi-blind eigen-analysis (often referre
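
Every section of this notebook repeats the same two steps: load a pickle, then pass it to `evaluate_single_dataframe` with a label. One way to reduce the repetition, sketched here under assumptions, is to drive the steps from a single list. `evaluate_runs` is a hypothetical helper; the loader and evaluator are passed in as arguments so the sketch runs without the result files or API access. The two PaperQA runs are left out of `RUNS` because both cells use the same label, which the duplicate check below would reject.

```python
# System labels and pickle paths exactly as they appear in the cells above.
RUNS = [
    ("Gemini_embedding", "results/gemini_embedding_results.pkl"),
    ("Gemini_no_rag", "results/gemini_norag_results.pkl"),
    ("OpenAI_embedding", "results/openai_embedding_results_final.pkl"),
    ("OpenAI_Vector_Store", "results/openai_pdf_vector_store_results.pkl"),
    ("OpenAI_Vector_Store_md", "results/openai_vector_store_results.pkl"),
]


def evaluate_runs(runs, load, evaluate):
    """Load each results file and evaluate it, keyed by system label."""
    results = {}
    for system_name, path in runs:
        if system_name in results:
            raise ValueError(f"Duplicate system_name: {system_name}")
        results[system_name] = evaluate(df=load(path), system_name=system_name)
    return results


# Stand-in loader and evaluator, used only to demonstrate the control flow.
demo = evaluate_runs(
    RUNS,
    load=lambda path: path,
    evaluate=lambda df, system_name: (system_name, df),
)
```

In the notebook itself, the call would presumably be `evaluate_runs(RUNS, load=pd.read_pickle, evaluate=rag_evaluator.evaluate_single_dataframe)`, which matches the keyword arguments used in the cells above.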