# Evaluate gen AI apps with Snowflake Cortex AI and TruLens
This notebook demonstrates how AI Observability in Snowflake Cortex AI helps quantitatively measure the performance of a RAG application across different LLMs, providing insight into application behavior and helping the user select the best model for their use case.

### Required Packages
* trulens-core (1.4.5 or above)
* trulens-connectors-snowflake (1.4.5 or above)
* trulens-providers-cortex (1.4.5 or above)
* snowflake.core (1.0.5 or above)

### Reference
https://medium.com/snowflake/ai-observability-in-snowflake-evaluate-gen-ai-apps-with-snowflake-cortex-ai-and-trulens-37878ec83c9e




In [None]:
CREATE OR REPLACE DATABASE cortex_search_tutorial_db;

CREATE OR REPLACE WAREHOUSE cortex_search_tutorial_wh WITH
     WAREHOUSE_SIZE='X-SMALL'
     AUTO_SUSPEND = 120
     AUTO_RESUME = TRUE
     INITIALLY_SUSPENDED=TRUE;

USE WAREHOUSE cortex_search_tutorial_wh;

CREATE OR REPLACE STAGE cortex_search_tutorial_db.public.fomc
    DIRECTORY = (ENABLE = TRUE)
    ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE');


CREATE OR REPLACE STAGE MGG_GENAI_OBSERVABILITY
 URL = 's3://mggsnowflake/genaiobservability';

COPY FILES INTO @FOMC
FROM @MGG_GENAI_OBSERVABILITY;

ALTER STAGE FOMC REFRESH;

## Check the FOMC stage and refresh it so that the next step can process the files

In [None]:

CREATE OR REPLACE FUNCTION cortex_search_tutorial_db.public.pdf_text_chunker(file_url STRING)
    RETURNS TABLE (chunk VARCHAR)
    LANGUAGE PYTHON
    RUNTIME_VERSION = '3.9'
    HANDLER = 'pdf_text_chunker'
    PACKAGES = ('snowflake-snowpark-python', 'PyPDF2', 'langchain')
    AS
$$
from snowflake.snowpark.types import StringType, StructField, StructType
from langchain.text_splitter import RecursiveCharacterTextSplitter
from snowflake.snowpark.files import SnowflakeFile
import PyPDF2, io
import logging
import pandas as pd

class pdf_text_chunker:

    def read_pdf(self, file_url: str) -> str:
        logger = logging.getLogger("udf_logger")
        logger.info(f"Opening file {file_url}")

        with SnowflakeFile.open(file_url, 'rb') as f:
            buffer = io.BytesIO(f.readall())

        reader = PyPDF2.PdfReader(buffer)
        text = ""
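        for page_number, page in enumerate(reader.pages, start=1):
            try:
                text += page.extract_text().replace('\n', ' ').replace('\0', ' ')
            except Exception:
                # Skip pages that cannot be parsed instead of discarding text already extracted
                logger.warning(f"Unable to extract from file {file_url}, page {page_number}")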

        return text

    def process(self, file_url: str):
        text = self.read_pdf(file_url)

        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size = 2000,  # Adjust this as needed
            chunk_overlap = 300,  # Overlap to keep chunks contextual
            length_function = len
        )

        chunks = text_splitter.split_text(text)
        df = pd.DataFrame(chunks, columns=['chunk'])

        yield from df.itertuples(index=False, name=None)
$$;

CREATE OR REPLACE TABLE cortex_search_tutorial_db.public.docs_chunks_table AS
    SELECT
        relative_path,
        build_scoped_file_url(@cortex_search_tutorial_db.public.fomc, relative_path) AS file_url,
        -- preserve file title information by concatenating relative_path with the chunk
        CONCAT(relative_path, ': ', func.chunk) AS chunk,
        'English' AS language
    FROM
        directory(@cortex_search_tutorial_db.public.fomc),
        TABLE(cortex_search_tutorial_db.public.pdf_text_chunker(build_scoped_file_url(@cortex_search_tutorial_db.public.fomc, relative_path))) AS func;

CREATE OR REPLACE CORTEX SEARCH SERVICE cortex_search_tutorial_db.public.fomc_meeting
    ON chunk
    ATTRIBUTES language
    WAREHOUSE = cortex_search_tutorial_wh
    TARGET_LAG = '30 days'
    AS (
    SELECT
        chunk,
        relative_path,
        file_url,
        language
    FROM cortex_search_tutorial_db.public.docs_chunks_table
    );
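
The `chunk_size = 2000` and `chunk_overlap = 300` settings in the chunker above mean that consecutive chunks share up to 300 characters, so text cut at a chunk boundary still appears intact in a neighbouring chunk. The sketch below illustrates that idea with a plain fixed-width sliding window. It is a simplified stand-in, not the algorithm `RecursiveCharacterTextSplitter` uses (which prefers to split on separators such as paragraph breaks), and `chunk_with_overlap` is a hypothetical helper name.

```python
from typing import List


def chunk_with_overlap(
    text: str, chunk_size: int = 2000, chunk_overlap: int = 300
) -> List[str]:
    """Split text into fixed-width windows that overlap by chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start : start + chunk_size])
        # Stop once the current window already reaches the end of the text,
        # so no trailing chunk is fully contained in the previous one.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks
```

With the defaults, each window starts 1,700 characters after the previous one, so a larger overlap produces more chunks for the same document.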


## Session Information
Fetches the current session and the connection details for the Snowflake account. These connection details are used to ingest application traces and trigger metric computation jobs.

In [None]:
from snowflake.snowpark.context import get_active_session

session = get_active_session()

## Cortex Search Retriever
Initializes a retriever that uses the Cortex Search service for the RAG application. The Cortex Search service is based on the tutorial [Build a PDF chatbot](https://docs.snowflake.com/en/user-guide/snowflake-cortex/cortex-search/tutorials/cortex-search-tutorial-3-chat-advanced).

Complete Steps 1 through 4 of that tutorial, then continue to the next step.

In [None]:
from typing import List

from snowflake.core import Root
from snowflake.snowpark.session import Session


class CortexSearchRetriever:
    def __init__(self, snowpark_session: Session, limit_to_retrieve: int = 4):
        self._snowpark_session = snowpark_session
        self._limit_to_retrieve = limit_to_retrieve

    def retrieve(self, query: str) -> List[str]:
        # Use the session passed to the constructor rather than the global one
        root = Root(self._snowpark_session)

        search_service = (
            root.databases["cortex_search_tutorial_db"]
            .schemas["PUBLIC"]
            .cortex_search_services["fomc_meeting"]
        )
        resp = search_service.search(
            query=query, columns=["chunk"], limit=self._limit_to_retrieve
        )

        if resp.results:
            return [curr["chunk"] for curr in resp.results]
        else:
            return []
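
When iterating on the rest of the pipeline outside Snowflake, it can help to swap in a retriever that exposes the same `retrieve(query) -> List[str]` interface but needs no connection. The sketch below is one such stand-in. `StaticRetriever` is a hypothetical class, and its naive keyword-overlap ranking is only a placeholder: it does not reflect how Cortex Search ranks results.

```python
from typing import List


class StaticRetriever:
    """Hypothetical offline stand-in that mimics CortexSearchRetriever.retrieve()."""

    def __init__(self, chunks: List[str], limit_to_retrieve: int = 4):
        self._chunks = chunks
        self._limit_to_retrieve = limit_to_retrieve

    def retrieve(self, query: str) -> List[str]:
        # Score each chunk by the number of lowercase words it shares with the query.
        query_words = set(query.lower().split())
        scored = [
            (len(query_words & set(chunk.lower().split())), chunk)
            for chunk in self._chunks
        ]
        # Keep only chunks with at least one shared word, best score first.
        # list.sort is stable, so ties keep their original order.
        matches = [(score, chunk) for score, chunk in scored if score > 0]
        matches.sort(key=lambda pair: pair[0], reverse=True)
        return [chunk for _, chunk in matches[: self._limit_to_retrieve]]
```

Like the Cortex-backed retriever, it returns an empty list when nothing matches, so downstream code can treat both the same way.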

## Environment Variables

Sets the environment variable that enables OpenTelemetry for generated traces. This step is mandatory to trace and evaluate the application.

In [None]:
import os

os.environ["TRULENS_OTEL_TRACING"] = "1"
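
Because tracing depends on this variable, a small guard can make a missing or mistyped setting fail early instead of surfacing later as missing traces. The sketch below is a generic check; `require_env_flag` is a hypothetical helper, and comparing against the exact string `"1"` is an assumption that simply mirrors the value set in the cell above.

```python
import os


def require_env_flag(name: str, expected: str = "1") -> None:
    """Raise a clear error if an environment variable does not hold the expected value."""
    actual = os.environ.get(name)
    if actual != expected:
        raise RuntimeError(
            f"Environment variable {name} is {actual!r}, expected {expected!r}"
        )


# Example: call require_env_flag("TRULENS_OTEL_TRACING") before defining the app.
```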

## RAG Application
Defines the RAG application with retrieval and generation steps. The generation function contains the prompt to the LLM and uses the Cortex [COMPLETE](https://docs.snowflake.com/en/sql-reference/functions/complete-snowflake-cortex) function for inference.

In [None]:
from snowflake.cortex import complete
from trulens.core.otel.instrument import instrument
from trulens.otel.semconv.trace import SpanAttributes


class RAG:
    def __init__(self, llm_model):
        self.retriever = CortexSearchRetriever(
            snowpark_session=session, limit_to_retrieve=4
        )
        self.llm_model = llm_model

    @instrument(
        span_type=SpanAttributes.SpanType.RETRIEVAL,
        attributes={
            SpanAttributes.RETRIEVAL.QUERY_TEXT: "query",
            SpanAttributes.RETRIEVAL.RETRIEVED_CONTEXTS: "return",
        },
    )
    def retrieve_context(self, query: str) -> list:
        """
        Retrieve relevant text from vector store.
        """
        return self.retriever.retrieve(query)

    @instrument(span_type=SpanAttributes.SpanType.GENERATION)
    def generate_completion(self, query: str, context_str: list) -> str:
        """
        Generate answer from context.
        """
        prompt = f"""
          You are an expert assistant extracting information from context provided.
          Answer the question in long-form, fully and completely, based on the context. Do not hallucinate.
          If you don't have the information just say so.
          Context: {context_str}
          Question:
          {query}
          Answer:
        """
        response = ""
        stream = complete(self.llm_model, prompt, stream=True)
        for update in stream:
            response += update
            print(update, end="")
        return response

    @instrument(
        span_type=SpanAttributes.SpanType.RECORD_ROOT,
        attributes={
            SpanAttributes.RECORD_ROOT.INPUT: "query",
            SpanAttributes.RECORD_ROOT.OUTPUT: "return",
        },
    )
    def query(self, query: str) -> str:
        context_str = self.retrieve_context(query)
        return self.generate_completion(query, context_str)
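
In `generate_completion`, `context_str` is a Python list, so the f-string inserts the list's printed form (brackets, quotes, escaped characters) into the prompt. One possible refinement is to join the retrieved chunks into numbered passages before building the prompt. The sketch below shows that idea; `format_context` and `build_prompt` are hypothetical helpers, not part of TruLens or Snowflake, and the prompt wording is copied from the cell above.

```python
from typing import List


def format_context(contexts: List[str]) -> str:
    """Join retrieved chunks into numbered passages separated by blank lines."""
    return "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(contexts, start=1)
    )


def build_prompt(query: str, contexts: List[str]) -> str:
    """Assemble the RAG prompt from the question and the retrieved chunks."""
    # Fall back to an explicit marker so the model sees that retrieval was empty.
    context_block = format_context(contexts) or "(no context retrieved)"
    return (
        "You are an expert assistant extracting information from context provided.\n"
        "Answer the question in long-form, fully and completely, based on the context. "
        "Do not hallucinate.\n"
        "If you don't have the information just say so.\n"
        f"Context:\n{context_block}\n"
        f"Question:\n{query}\n"
        "Answer:\n"
    )
```

Keeping prompt assembly in a plain function also makes it testable without calling the LLM.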

## RAG App Initialization
Initializes two instances of the RAG application with llama3.1-70b and mistral-large2 for LLM inference.

In [None]:
rag_llama = RAG("llama3.1-70b")
rag_mistral = RAG("mistral-large2")

In [None]:
print("===========================================")
print("RAG App response with llama3.1-70b")
print("===========================================")
response = rag_llama.query(
    "What were the strongest components to gdp growth in q4?"
)

print("\n\n")
print("===========================================")
print("RAG App response with mistral-large2")
print("===========================================")
response = rag_mistral.query(
    "What were the strongest components to gdp growth in q4?"
)
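
The cell above repeats the same banner and call for each model. If more models are added, a small loop keeps the comparison consistent. The sketch below assumes only that each app exposes a `query(str) -> str` method, as the `RAG` class does; `ask_all` and the `QueryApp` protocol are hypothetical names.

```python
from typing import Dict, Protocol


class QueryApp(Protocol):
    """Anything with a query(str) -> str method, such as the RAG class."""

    def query(self, query: str) -> str: ...


def ask_all(apps: Dict[str, QueryApp], question: str) -> Dict[str, str]:
    """Send one question to every app and collect the responses by label."""
    banner = "=" * 43
    responses: Dict[str, str] = {}
    for label, app in apps.items():
        print(banner)
        print(f"RAG App response with {label}")
        print(banner)
        responses[label] = app.query(question)
        print("\n")
    return responses


# Example (in this notebook):
# ask_all({"llama3.1-70b": rag_llama, "mistral-large2": rag_mistral},
#         "What were the strongest components to gdp growth in q4?")
```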

## App Registration
Registers the two app instances in Snowflake. This creates EXTERNAL AGENT objects that represent the app instances in the Snowflake account and registers the two instances as different versions of the same application.

In [None]:
from trulens.apps.app import TruApp
from trulens.connectors.snowflake import SnowflakeConnector

snowflake_connector = SnowflakeConnector(snowpark_session=session)

FOMC_Chatbot_llama = TruApp(
    rag_llama,
    app_name="FOMC RAG Chatbot",
    app_version="version 1",
    connector=snowflake_connector,
)

FOMC_Chatbot_mistral = TruApp(
    rag_mistral,
    app_name="FOMC RAG Chatbot",
    app_version="version 2",
    connector=snowflake_connector,
)

## Evaluation Dataset

In [None]:
import pandas as pd

evaluation_dataset = {
    "query": [
        "What were the key points discussed in the FOMC meeting in January 2023?",
        "How did the FOMC view the economic outlook in mid-2023?",
        "What were the inflation expectations for the end of 2023?",
        "What were the main topics in the FOMC meeting in February 2024?",
        "How did the FOMC assess the labor market in mid-2024?",
        "What were the GDP growth projections for the end of 2024?",
        "What were the primary concerns in the FOMC meeting in March 2025?",
        "How did the FOMC evaluate the financial stability in mid-2025?",
        "What were the interest rate expectations for the end of 2025?",
    ]
}

evaluation_df = pd.DataFrame(evaluation_dataset)

## Run Configurations
Defines the run configurations for evaluating both instances of the RAG app. Each run config contains the run name, description, dataset details, and an optional label to tag the run.

In [None]:
from trulens.core.run import RunConfig

run_config_llama = RunConfig(
    run_name="Experiment_llama3.1-70b",
    description="Q&A evaluation with llama3.1-70b",
    dataset_name="FOMC_Queries",
    source_type="DATAFRAME",
    label="LLM_Test",
    dataset_spec={
        "RECORD_ROOT.INPUT": "query",
    },
)

run_config_mistral = RunConfig(
    run_name="Experiment_mistral-large2",
    description="Q&A evaluation with mistral-large2",
    dataset_name="FOMC_Queries",
    source_type="DATAFRAME",
    label="LLM_Test",
    dataset_spec={
        "RECORD_ROOT.INPUT": "query",
    },
)

run_llama = FOMC_Chatbot_llama.add_run(run_config=run_config_llama)
run_mistral = FOMC_Chatbot_mistral.add_run(run_config=run_config_mistral)
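
In the run configs above, `dataset_spec` points `RECORD_ROOT.INPUT` at the `query` column of the evaluation DataFrame, so a renamed or misspelled column is an easy mistake to make. The sketch below is a plain-Python pre-flight check for that; `missing_dataset_columns` is a hypothetical helper, not a TruLens function.

```python
from typing import Dict, Iterable, List


def missing_dataset_columns(
    dataset_spec: Dict[str, str], columns: Iterable[str]
) -> List[str]:
    """Return the column names referenced by dataset_spec that the dataset lacks."""
    available = set(columns)
    return [column for column in dataset_spec.values() if column not in available]


# Example (in this notebook):
# missing_dataset_columns({"RECORD_ROOT.INPUT": "query"}, evaluation_df.columns)
```

An empty result means every column named in the spec exists in the dataset.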

## Run Invocation
Starts two evaluation runs (one each for llama3.1-70b and mistral-large2) by executing the application and generating the traces. Each run invokes the application once per input query in the dataset, generates the responses and traces, and ingests them into Snowflake.


In [None]:
print("==================================================")
print("RAG App Invocation with llama3.1-70b")
print("==================================================")
run_llama.start(input_df=evaluation_df)

print("\n\n")
print("==================================================")
print("RAG App Invocation with mistral-large2")
print("==================================================")
run_mistral.start(input_df=evaluation_df)

## Run Status Check
Polls the status of both runs while either is still "INVOCATION_IN_PROGRESS".

Note: Metric computation cannot be started while the invocation is in progress. Once a run's status changes to "INVOCATION_COMPLETED", metric computation can be triggered.

In [None]:
import time

while (run_llama.get_status() == "INVOCATION_IN_PROGRESS") or (
    run_mistral.get_status() == "INVOCATION_IN_PROGRESS"
):
    time.sleep(1)
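
The loop above waits indefinitely if a run never leaves the in-progress state. A variant with a timeout is sketched below. `wait_while_status` is a hypothetical helper that accepts any zero-argument callable returning a status string, and the default timeout and poll interval are arbitrary assumptions.

```python
import time
from typing import Callable


def wait_while_status(
    get_status: Callable[[], str],
    busy_status: str,
    timeout_s: float = 600.0,
    poll_interval_s: float = 1.0,
) -> str:
    """Poll get_status until it stops returning busy_status, then return the new status."""
    deadline = time.monotonic() + timeout_s
    status = get_status()
    while status == busy_status:
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Status still {busy_status!r} after {timeout_s} seconds")
        time.sleep(poll_interval_s)
        status = get_status()
    return status


# Example (in this notebook):
# wait_while_status(run_llama.get_status, "INVOCATION_IN_PROGRESS")
```

Returning the final status lets the caller confirm it is "INVOCATION_COMPLETED" before triggering metric computation.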

## Compute Metrics

Computes the RAG triad metrics (answer relevance, context relevance, and groundedness) for both runs to measure the quality of the responses in the RAG application.

In [None]:
run_llama.compute_metrics([
    "answer_relevance",
    "context_relevance",
    "groundedness",
])

run_mistral.compute_metrics([
    "answer_relevance",
    "context_relevance",
    "groundedness",
])

In [None]:
print("Run status for llama3.1-70b - ", run_llama.get_status())
print("Run status for mistral-large2 - ", run_mistral.get_status())

## Evaluation Results

To view evaluation results:
* Log in to [Snowsight](https://app.snowflake.com/).
* Navigate to **AI & ML** -> **Evaluations** from the left navigation menu.
* Select “FOMC RAG CHATBOT” to view the runs, see detailed traces and compare runs.