# Agent Evaluation with RAGAS

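We use the RAGAS library to evaluate the performance of our Gemini-based agent. RAGAS is a framework for evaluating retrieval-augmented generation (RAG) pipelines; the metrics reported in this notebook are faithfulness, answer relevancy, context precision, and context recall.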

This notebook uses the functions in `ragas_eval.py` to evaluate our RAG pipeline
across different hyperparameter combinations.

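Each combination sets the following hyperparameters:
* `chunk_size`: the size of the text chunks the PDFs are split into
* `top_k`: the number of chunks retrieved for each question
* `temperature`: the sampling temperature of the LLM that generates the answer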
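
As a reference, the four combinations that appear in the run output of this notebook can be written down as plain data. This is only a sketch: the `RagConfig` class and the `CONFIGS` name are illustrative assumptions, since the actual definitions live in `ragas_eval.py`, which is not shown here. The values themselves are copied from the logged hyperparameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RagConfig:
    """One hyperparameter combination (hypothetical container, not from ragas_eval.py)."""
    name: str
    chunk_size: int
    top_k: int
    temperature: float


# Values taken from the "Hyperparameters" lines in the run output.
CONFIGS = [
    RagConfig("Baseline", chunk_size=1000, top_k=4, temperature=0.1),
    RagConfig("More Context, Focused", chunk_size=1500, top_k=3, temperature=0.1),
    RagConfig("Smaller Chunks, More Options", chunk_size=500, top_k=5, temperature=0.1),
    RagConfig("Creative & Balanced", chunk_size=1000, top_k=4, temperature=0.3),
]

for cfg in CONFIGS:
    print(f"{cfg.name}: chunk_size={cfg.chunk_size}, top_k={cfg.top_k}, temp={cfg.temperature}")
```

Keeping the combinations in one list makes it easy to see that only the last one changes the temperature, while the second and third vary chunk size and `top_k` together.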

In [6]:
%load_ext autoreload
%autoreload 2

## Mount Google Drive in Colab

In [1]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive
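
If the notebook may also be opened outside Colab, the mount step can be guarded. The following is a minimal sketch; the helper name `running_in_colab` is an assumption introduced here and is not part of the project.

```python
import importlib.util


def running_in_colab() -> bool:
    """Return True when the google.colab package can be imported."""
    try:
        return importlib.util.find_spec("google.colab") is not None
    except ModuleNotFoundError:
        # The parent package `google` is not installed at all.
        return False


if running_in_colab():
    # Same call as in the cell above; only executed inside Colab.
    from google.colab import drive
    drive.mount("/content/drive")
else:
    print("Not running in Colab; skipping the Drive mount.")
```

Outside Colab, `ROOT_DIR` in the next cell would also need to point to a local checkout of the project.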


In [36]:
ROOT_DIR = "/content/drive/MyDrive/Proyectos/Insurance-Chatbot/Insurance_Chatbot_project/api_ai/model_evaluation"
%cd $ROOT_DIR

/content/drive/MyDrive/Proyectos/Insurance-Chatbot/Insurance_Chatbot_project/api_ai/model_evaluation


In [3]:
%ls $ROOT_DIR

Evaluate_Model.ipynb  ragas_eval.py


In [4]:
!pip install langchain-community langchain-google-genai duckduckgo-search pypdf faiss-cpu python-dotenv ragas pandas datasets

Collecting langchain-community
  Downloading langchain_community-0.3.25-py3-none-any.whl.metadata (2.9 kB)
Collecting langchain-google-genai
  Downloading langchain_google_genai-2.1.5-py3-none-any.whl.metadata (5.2 kB)
Collecting duckduckgo-search
  Downloading duckduckgo_search-8.0.4-py3-none-any.whl.metadata (16 kB)
Collecting pypdf
  Downloading pypdf-5.6.0-py3-none-any.whl.metadata (7.2 kB)
Collecting faiss-cpu
  Downloading faiss_cpu-1.11.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (4.8 kB)
Collecting python-dotenv
  Downloading python_dotenv-1.1.0-py3-none-any.whl.metadata (24 kB)
Collecting ragas
  Downloading ragas-0.2.15-py3-none-any.whl.metadata (9.0 kB)
Collecting langchain-core<1.0.0,>=0.3.65 (from langchain-community)
  Downloading langchain_core-0.3.65-py3-none-any.whl.metadata (5.8 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (f

In [3]:
# Imports
import os
import sys
import warnings
from datetime import datetime

import pandas as pd

# Suppress unnecessary warnings
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UserWarning)

# Make the project root (the parent of ROOT_DIR) importable before loading
# the project modules. ROOT_DIR is defined in the Colab setup cell above.
project_root = os.path.dirname(ROOT_DIR)
if project_root not in sys.path:
    sys.path.append(project_root)

from generative_resp import pdf_process_utils
import ragas_eval

In [10]:
def main():
    """
    Main function: loads the documents and runs the evaluation.
    """
    print("\n" + "="*60)
    print("🚀 STARTING SIMPLIFIED RAG EVALUATION")
    print(f"📅 Date and time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    print("="*60)

    # Path to the directory that holds the PDFs
    pdf_directory = os.path.join(project_root, 'generative_resp', 'polizas')

    if not os.path.isdir(pdf_directory):
        print(f"❌ Error: PDF directory not found at: {pdf_directory}")
        return

    print(f"📁 Loading documents from: {pdf_directory}...")

    # Load the documents. The initial split is not critical, because each
    # configuration re-splits them according to its own 'chunk_size'.
    all_documents = pdf_process_utils.load_split_pdfs(
        pdf_dir=pdf_directory,
        chunk_size=1000,
        chunk_overlap=100
    )

    if not all_documents:
        print("❌ Could not load the documents. Aborting.")
        return

    print(f"✅ {len(all_documents)} document chunks loaded initially.")

    # --- Run the evaluation ---
    print("\n▶️ Launching the RAGAS evaluation process...")

    try:
        results_df = ragas_eval.run_evaluation(original_docs=all_documents)

        if results_df is not None:
            print("\n" + "="*60)
            print("🎉 EVALUATION COMPLETE!")
            print("="*60)
            print("📊 Summary of final results (average per combination):")

            # Print the results DataFrame with cleaner formatting
            pd.set_option('display.max_rows', None)
            pd.set_option('display.max_columns', None)
            pd.set_option('display.width', 100)
            pd.set_option('display.colheader_justify', 'center')
            pd.set_option('display.precision', 4)

            print(results_df)

            # Save the results to a CSV file
            results_path = 'ragas_simplified_results.csv'
            results_df.to_csv(results_path, index=False)
            print(f"\n💾 Results saved to: {results_path}")

        else:
            print("\n❌ The evaluation produced no results.")

    except Exception as e:
        print(f"\n❌ A critical error occurred during the evaluation: {e}")
        import traceback
        traceback.print_exc()
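
The evaluation makes many Gemini API calls, and the run output in this notebook shows both fixed pauses and `429` quota errors. One common complement to fixed pauses is to retry a failed call with exponential backoff. The sketch below is an illustration under stated assumptions, not the logic used in `ragas_eval.py`: the function name `call_with_backoff` is hypothetical, and it assumes a rate-limit error can be recognised by the text `429` in the exception message, as in the logged errors.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying on rate-limit errors with exponentially growing waits."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            # Assumption: quota errors contain "429" in their message.
            is_rate_limit = "429" in str(exc)
            is_last_attempt = attempt == max_attempts - 1
            if not is_rate_limit or is_last_attempt:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("max_attempts must be at least 1")
```

Retrying only helps with per-minute limits. Once a per-day quota is exhausted, as reported for the last configuration in the run output, the remaining calls keep failing until the quota resets.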

In [37]:
if __name__ == "__main__":
    main()


🚀 STARTING SIMPLIFIED RAG EVALUATION
📅 Date and time: 2025-06-14 02:09:08
📁 Loading documents from: /content/drive/MyDrive/Proyectos/Insurance-Chatbot/Insurance_Chatbot_project/api_ai/generative_resp/polizas...
Split 267 documents into 699 chunks.
✅ 699 document chunks loaded initially.

▶️ Launching the RAGAS evaluation process...
Configuring RAGAS dependencies (LLM and embeddings)...
✅ RAGAS dependencies configured successfully.

🧪 Evaluating configuration #1/4: Baseline
   Hyperparameters: chunk_size=1000, top_k=4, temp=0.1
  - Evaluating question 1/4...


Evaluating:   0%|          | 0/4 [00:00<?, ?it/s]

    ✅ Success (result processed from list).
    ⏳ Pausing 180 seconds to protect the quota...
  - Evaluating question 2/4...


Evaluating:   0%|          | 0/4 [00:00<?, ?it/s]

    ✅ Success (result processed from list).
    ⏳ Pausing 180 seconds to protect the quota...
  - Evaluating question 3/4...


Evaluating:   0%|          | 0/4 [00:00<?, ?it/s]

    ✅ Success (result processed from list).
    ⏳ Pausing 180 seconds to protect the quota...
  - Evaluating question 4/4...


Evaluating:   0%|          | 0/4 [00:00<?, ?it/s]

    ✅ Success (result processed from list).
    ⏳ Pausing 180 seconds to protect the quota...

  📊 Average results for this configuration:
     - faithfulness: 0.9896
     - answer_relevancy: 0.6248
     - context_precision: 0.3750
     - context_recall: 0.5000

--------------------------------------------------
⏳ EXTRA-LONG PAUSE (300s) before the next configuration...
--------------------------------------------------

🧪 Evaluating configuration #2/4: More Context, Focused
   Hyperparameters: chunk_size=1500, top_k=3, temp=0.1
  - Evaluating question 1/4...


Evaluating:   0%|          | 0/4 [00:00<?, ?it/s]

    ✅ Success (result processed from list).
    ⏳ Pausing 180 seconds to protect the quota...
  - Evaluating question 2/4...


Evaluating:   0%|          | 0/4 [00:00<?, ?it/s]

    ✅ Success (result processed from list).
    ⏳ Pausing 180 seconds to protect the quota...
  - Evaluating question 3/4...


Evaluating:   0%|          | 0/4 [00:00<?, ?it/s]

    ✅ Success (result processed from list).
    ⏳ Pausing 180 seconds to protect the quota...
  - Evaluating question 4/4...


Evaluating:   0%|          | 0/4 [00:00<?, ?it/s]

    ✅ Success (result processed from list).
    ⏳ Pausing 180 seconds to protect the quota...

  📊 Average results for this configuration:
     - faithfulness: 1.0000
     - answer_relevancy: 0.6497
     - context_precision: 0.3750
     - context_recall: 0.5000

--------------------------------------------------
⏳ EXTRA-LONG PAUSE (300s) before the next configuration...
--------------------------------------------------

🧪 Evaluating configuration #3/4: Smaller Chunks, More Options
   Hyperparameters: chunk_size=500, top_k=5, temp=0.1
  - Evaluating question 1/4...


Evaluating:   0%|          | 0/4 [00:00<?, ?it/s]

  quota_metric: "generativelanguage.googleapis.com/generate_content_free_tier_requests"
  quota_id: "GenerateRequestsPerMinutePerProjectPerModel-FreeTier"
  quota_dimensions {
    key: "model"
    value: "gemini-2.5-flash"
  }
  quota_dimensions {
    key: "location"
    value: "global"
  }
  quota_value: 10
}
, links {
  description: "Learn more about Gemini API quotas"
  url: "https://ai.google.dev/gemini-api/docs/rate-limits"
}
, retry_delay {
}
].
  quota_metric: "generativelanguage.googleapis.com/generate_content_free_tier_requests"
  quota_id: "GenerateRequestsPerMinutePerProjectPerModel-FreeTier"
  quota_dimensions {
    key: "model"
    value: "gemini-2.5-flash"
  }
  quota_dimensions {
    key: "location"
    value: "global"
  }
  quota_value: 10
}
, links {
  description: "Learn more about Gemini API quotas"
  url: "https://ai.google.dev/gemini-api/docs/rate-limits"
}
, retry_delay {
  seconds: 57
}
].
  quota_metric: "generativelanguage.googleapis.com/generate_content_free_t

    ✅ Success (result processed from list).
    ⏳ Pausing 180 seconds to protect the quota...
  - Evaluating question 2/4...


Evaluating:   0%|          | 0/4 [00:00<?, ?it/s]

  quota_metric: "generativelanguage.googleapis.com/generate_content_free_tier_requests"
  quota_id: "GenerateRequestsPerMinutePerProjectPerModel-FreeTier"
  quota_dimensions {
    key: "model"
    value: "gemini-2.5-flash"
  }
  quota_dimensions {
    key: "location"
    value: "global"
  }
  quota_value: 10
}
, links {
  description: "Learn more about Gemini API quotas"
  url: "https://ai.google.dev/gemini-api/docs/rate-limits"
}
, retry_delay {
  seconds: 4
}
].
  quota_metric: "generativelanguage.googleapis.com/generate_content_free_tier_requests"
  quota_id: "GenerateRequestsPerMinutePerProjectPerModel-FreeTier"
  quota_dimensions {
    key: "model"
    value: "gemini-2.5-flash"
  }
  quota_dimensions {
    key: "location"
    value: "global"
  }
  quota_value: 10
}
, links {
  description: "Learn more about Gemini API quotas"
  url: "https://ai.google.dev/gemini-api/docs/rate-limits"
}
, retry_delay {
  seconds: 1
}
].
  quota_metric: "generativelanguage.googleapis.com/generate_co

    ✅ Success (result processed from list).
    ⏳ Pausing 180 seconds to protect the quota...
  - Evaluating question 3/4...


Evaluating:   0%|          | 0/4 [00:00<?, ?it/s]

  quota_metric: "generativelanguage.googleapis.com/generate_content_free_tier_requests"
  quota_id: "GenerateRequestsPerMinutePerProjectPerModel-FreeTier"
  quota_dimensions {
    key: "model"
    value: "gemini-2.5-flash"
  }
  quota_dimensions {
    key: "location"
    value: "global"
  }
  quota_value: 10
}
, links {
  description: "Learn more about Gemini API quotas"
  url: "https://ai.google.dev/gemini-api/docs/rate-limits"
}
, retry_delay {
  seconds: 16
}
].
  quota_metric: "generativelanguage.googleapis.com/generate_content_free_tier_requests"
  quota_id: "GenerateRequestsPerMinutePerProjectPerModel-FreeTier"
  quota_dimensions {
    key: "model"
    value: "gemini-2.5-flash"
  }
  quota_dimensions {
    key: "location"
    value: "global"
  }
  quota_value: 10
}
, links {
  description: "Learn more about Gemini API quotas"
  url: "https://ai.google.dev/gemini-api/docs/rate-limits"
}
, retry_delay {
  seconds: 14
}
].
  quota_metric: "generativelanguage.googleapis.com/generate_

    ✅ Success (result processed from list).
    ⏳ Pausing 180 seconds to protect the quota...
  - Evaluating question 4/4...


Evaluating:   0%|          | 0/4 [00:00<?, ?it/s]

  quota_metric: "generativelanguage.googleapis.com/generate_content_free_tier_requests"
  quota_id: "GenerateRequestsPerMinutePerProjectPerModel-FreeTier"
  quota_dimensions {
    key: "model"
    value: "gemini-2.5-flash"
  }
  quota_dimensions {
    key: "location"
    value: "global"
  }
  quota_value: 10
}
, links {
  description: "Learn more about Gemini API quotas"
  url: "https://ai.google.dev/gemini-api/docs/rate-limits"
}
, retry_delay {
  seconds: 2
}
].
  quota_metric: "generativelanguage.googleapis.com/generate_content_free_tier_requests"
  quota_id: "GenerateRequestsPerMinutePerProjectPerModel-FreeTier"
  quota_dimensions {
    key: "model"
    value: "gemini-2.5-flash"
  }
  quota_dimensions {
    key: "location"
    value: "global"
  }
  quota_value: 10
}
, links {
  description: "Learn more about Gemini API quotas"
  url: "https://ai.google.dev/gemini-api/docs/rate-limits"
}
, retry_delay {
  seconds: 59
}
].
  quota_metric: "generativelanguage.googleapis.com/generate_c

    ✅ Success (result processed from list).
    ⏳ Pausing 180 seconds to protect the quota...

  📊 Average results for this configuration:
     - faithfulness: 0.8771
     - answer_relevancy: 0.3177
     - context_precision: 0.1250
     - context_recall: 0.3333

--------------------------------------------------
⏳ EXTRA-LONG PAUSE (300s) before the next configuration...
--------------------------------------------------

🧪 Evaluating configuration #4/4: Creative & Balanced
   Hyperparameters: chunk_size=1000, top_k=4, temp=0.3
  - Evaluating question 1/4...


Evaluating:   0%|          | 0/4 [00:00<?, ?it/s]

  quota_metric: "generativelanguage.googleapis.com/generate_content_free_tier_requests"
  quota_id: "GenerateRequestsPerDayPerProjectPerModel-FreeTier"
  quota_dimensions {
    key: "model"
    value: "gemini-2.5-flash"
  }
  quota_dimensions {
    key: "location"
    value: "global"
  }
  quota_value: 500
}
, links {
  description: "Learn more about Gemini API quotas"
  url: "https://ai.google.dev/gemini-api/docs/rate-limits"
}
, retry_delay {
  seconds: 37
}
].
  quota_metric: "generativelanguage.googleapis.com/generate_content_free_tier_requests"
  quota_id: "GenerateRequestsPerDayPerProjectPerModel-FreeTier"
  quota_dimensions {
    key: "model"
    value: "gemini-2.5-flash"
  }
  quota_dimensions {
    key: "location"
    value: "global"
  }
  quota_value: 500
}
, links {
  description: "Learn more about Gemini API quotas"
  url: "https://ai.google.dev/gemini-api/docs/rate-limits"
}
, retry_delay {
  seconds: 34
}
].
  quota_metric: "generativelanguage.googleapis.com/generate_cont

    ⚠️ Failed with exception: ...
    ⏳ Pausing 180 seconds to protect the quota...
  - Evaluating question 2/4...


  quota_metric: "generativelanguage.googleapis.com/generate_content_free_tier_requests"
  quota_id: "GenerateRequestsPerDayPerProjectPerModel-FreeTier"
  quota_dimensions {
    key: "model"
    value: "gemini-2.5-flash"
  }
  quota_dimensions {
    key: "location"
    value: "global"
  }
  quota_value: 500
}
, links {
  description: "Learn more about Gemini API quotas"
  url: "https://ai.google.dev/gemini-api/docs/rate-limits"
}
, retry_delay {
  seconds: 39
}
].


    ⚠️ Failed with exception: 429 You exceeded your current quota, please check your plan and billing details. For more informatio...
    ⏳ Pausing 180 seconds to protect the quota...


  quota_metric: "generativelanguage.googleapis.com/generate_content_free_tier_requests"
  quota_id: "GenerateRequestsPerDayPerProjectPerModel-FreeTier"
  quota_dimensions {
    key: "model"
    value: "gemini-2.5-flash"
  }
  quota_dimensions {
    key: "location"
    value: "global"
  }
  quota_value: 500
}
, links {
  description: "Learn more about Gemini API quotas"
  url: "https://ai.google.dev/gemini-api/docs/rate-limits"
}
, retry_delay {
  seconds: 37
}
].


  - Evaluating question 3/4...
    ⚠️ Failed with exception: 429 You exceeded your current quota, please check your plan and billing details. For more informatio...
    ⏳ Pausing 180 seconds to protect the quota...


  quota_metric: "generativelanguage.googleapis.com/generate_content_free_tier_requests"
  quota_id: "GenerateRequestsPerDayPerProjectPerModel-FreeTier"
  quota_dimensions {
    key: "model"
    value: "gemini-2.5-flash"
  }
  quota_dimensions {
    key: "location"
    value: "global"
  }
  quota_value: 500
}
, links {
  description: "Learn more about Gemini API quotas"
  url: "https://ai.google.dev/gemini-api/docs/rate-limits"
}
, retry_delay {
  seconds: 35
}
].


  - Evaluating question 4/4...
    ⚠️ Failed with exception: 429 You exceeded your current quota, please check your plan and billing details. For more informatio...
    ⏳ Pausing 180 seconds to protect the quota...

  📊 Average results for this configuration:
     - faithfulness: Failed
     - answer_relevancy: Failed
     - context_precision: Failed
     - context_recall: Failed

🎉 EVALUATION COMPLETE!
📊 Summary of final results (average per combination):
         combination_name        faithfulness  answer_relevancy  context_precision  context_recall
0                      Baseline     0.9896          0.6248             0.375            0.5000    
1         More Context, Focused     1.0000          0.6497             0.375            0.5000    
2  Smaller Chunks, More Options     0.8771          0.3177             0.125            0.3333    
3           Creative & Balanced        NaN             NaN               NaN               NaN    

💾 Resul
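
To compare the configurations that completed, the summary table can be ranked by a single score. The sketch below copies the values from the table above and uses the unweighted mean of the four metrics; that aggregation is an assumption chosen for illustration, not something RAGAS prescribes. Rows with missing values (the configuration that hit the daily quota) are skipped.

```python
import math
from statistics import mean

METRICS = ("faithfulness", "answer_relevancy", "context_precision", "context_recall")
NAN = float("nan")

# Values copied from the summary table printed at the end of the run.
results = [
    {"combination_name": "Baseline", "faithfulness": 0.9896,
     "answer_relevancy": 0.6248, "context_precision": 0.375, "context_recall": 0.5000},
    {"combination_name": "More Context, Focused", "faithfulness": 1.0000,
     "answer_relevancy": 0.6497, "context_precision": 0.375, "context_recall": 0.5000},
    {"combination_name": "Smaller Chunks, More Options", "faithfulness": 0.8771,
     "answer_relevancy": 0.3177, "context_precision": 0.125, "context_recall": 0.3333},
    {"combination_name": "Creative & Balanced", "faithfulness": NAN,
     "answer_relevancy": NAN, "context_precision": NAN, "context_recall": NAN},
]


def is_complete(row: dict) -> bool:
    """True when every metric in the row has a numeric (non-NaN) value."""
    return not any(math.isnan(row[m]) for m in METRICS)


# Keep only complete rows, attach the unweighted mean, and sort best-first.
ranked = sorted(
    (
        {"combination_name": row["combination_name"],
         "mean_score": mean(row[m] for m in METRICS)}
        for row in results
        if is_complete(row)
    ),
    key=lambda r: r["mean_score"],
    reverse=True,
)

for r in ranked:
    print(f"{r['combination_name']}: {r['mean_score']:.4f}")
```

Each configuration was scored on only four questions, so small differences between the top two rows should be read with caution.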