<a href="https://colab.research.google.com/github/graphlit/graphlit-samples/blob/main/python/Notebook%20Examples/Graphlit_2024_09_20_FinanceBench_evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Description**

This example shows how to evaluate RAG over the FinanceBench dataset using OpenAI or Anthropic models.

**Requirements**

Prior to running this notebook, you will need to [sign up](https://docs.graphlit.dev/getting-started/signup) for Graphlit and [create a project](https://docs.graphlit.dev/getting-started/create-project).

You will need the Graphlit organization ID, preview environment ID and JWT secret from your created project.

Assign these properties as Colab secrets: GRAPHLIT_ORGANIZATION_ID, GRAPHLIT_ENVIRONMENT_ID and GRAPHLIT_JWT_SECRET.
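
Before initializing the client, it can help to confirm that all three values are actually set. The following is a minimal sketch, not part of the Graphlit SDK: `missing_settings` is a hypothetical helper, and it assumes the secrets have already been copied into environment variables under the names listed above.

```python
import os
from typing import List, Mapping

# Names taken from the requirements above.
REQUIRED_SETTINGS = (
    "GRAPHLIT_ORGANIZATION_ID",
    "GRAPHLIT_ENVIRONMENT_ID",
    "GRAPHLIT_JWT_SECRET",
)

def missing_settings(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required setting names that are absent or empty in env."""
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]

# Hypothetical usage: report anything that still needs to be configured.
missing = missing_settings()
if missing:
    print("Missing settings: " + ", ".join(missing))
```

In this notebook, a check like this could run after the secrets are copied into `os.environ` and before `Graphlit()` is constructed.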


---

Install Graphlit Python client SDK

In [1]:
!pip install --upgrade graphlit-client



Initialize FinanceBench dataset

In [2]:
!rm -rf financebench
!git clone https://github.com/patronus-ai/financebench

Cloning into 'financebench'...
remote: Enumerating objects: 423, done.
remote: Counting objects: 100% (47/47), done.
remote: Compressing objects: 100% (29/29), done.
remote: Total 423 (delta 19), reused 38 (delta 16), pack-reused 376 (from 1)
Receiving objects: 100% (423/423), 551.70 MiB | 15.86 MiB/s, done.
Resolving deltas: 100% (26/26), done.
Updating files: 100% (390/390), done.


Initialize FinanceBench paths

In [3]:
data_directory = "financebench"

PATH_DATASET_JSONL = data_directory + "/data/financebench_open_source.jsonl"
PATH_DOCUMENT_INFO_JSONL = data_directory + "/data/financebench_document_information.jsonl"
PATH_RESULTS = data_directory + "/results/"
PATH_PDFS = data_directory + "/pdfs/"

In [4]:
import os
import sys
import json
import pickle
import datetime
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm

DATASET_PORTION = "OPEN_SOURCE"

# Load Full Dataset
df_questions = pd.read_json(PATH_DATASET_JSONL, lines=True)
df_meta = pd.read_json(PATH_DOCUMENT_INFO_JSONL, lines=True)
df_full = pd.merge(df_questions, df_meta, on="doc_name")

# Get all docs
df_questions = df_questions.sort_values('doc_name')
ALL_DOCS = df_questions['doc_name'].unique().tolist()
print(f"Total number of distinct PDF: {len(ALL_DOCS)}")

# Select relevant dataset portion
if DATASET_PORTION != "ALL":
    df_questions = df_questions.loc[df_questions["dataset_subset_label"]==DATASET_PORTION]
print(f"Number of questions: {len(df_questions)}")

# Check relevant documents
df_questions = df_questions.sort_values('doc_name')
docs = df_questions['doc_name'].unique().tolist()
print(f"Number of distinct PDF: {len(docs)}")

Total number of distinct PDF: 84
Number of questions: 150
Number of distinct PDF: 84
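
The counts above show 150 questions spread over 84 PDFs, so some documents are the subject of more than one question. Below is a minimal sketch of grouping questions by document using only the standard library. `group_questions_by_doc` is a hypothetical helper, and the sample rows are illustrative placeholders that reuse only the `doc_name` and `question` column names from the dataset.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping

def group_questions_by_doc(rows: Iterable[Mapping[str, str]]) -> Dict[str, List[str]]:
    """Map each doc_name to the list of questions asked about it, in input order."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for row in rows:
        grouped[row["doc_name"]].append(row["question"])
    return dict(grouped)

# Placeholder rows for illustration only; real rows could come from
# df_questions.to_dict("records") in this notebook.
sample_rows = [
    {"doc_name": "3M_2018_10K", "question": "placeholder question 1"},
    {"doc_name": "3M_2018_10K", "question": "placeholder question 2"},
    {"doc_name": "3M_2022_10K", "question": "placeholder question 3"},
]

for doc_name, questions in group_questions_by_doc(sample_rows).items():
    print(f"{doc_name}: {len(questions)} question(s)")
```

Grouping this way would make it possible to ingest each PDF once and ask all of its questions before deleting it, which differs from the one-ingestion-per-question flow this notebook uses.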


Initialize Graphlit

In [5]:
import os
from google.colab import userdata
from graphlit import Graphlit
from graphlit_api import input_types, enums, exceptions

os.environ['GRAPHLIT_ORGANIZATION_ID'] = userdata.get('GRAPHLIT_ORGANIZATION_ID')
os.environ['GRAPHLIT_ENVIRONMENT_ID'] = userdata.get('GRAPHLIT_ENVIRONMENT_ID')
os.environ['GRAPHLIT_JWT_SECRET'] = userdata.get('GRAPHLIT_JWT_SECRET')

graphlit = Graphlit()

Define Graphlit helper functions

In [6]:
from typing import List, Optional
import base64
import mimetypes

# NOTE: for local files, load from disk and convert to Base64 data
async def ingest_file(file_path: str):
    if graphlit.client is None:
        return None

    try:
        file_name = os.path.basename(file_path)
        content_name, _ = os.path.splitext(file_name)

        mime_type = mimetypes.guess_type(file_name)[0]

        if mime_type is None:
            print('Failed to infer MIME type')
            return None

        with open(file_path, "rb") as file:
            file_content = file.read()

            base64_content = base64.b64encode(file_content).decode('utf-8')

            # Using synchronous mode, so the notebook waits for the content to be ingested
            response = await graphlit.client.ingest_encoded_file(content_name, base64_content, mime_type, is_synchronous=True)

            return response.ingest_encoded_file.id if response.ingest_encoded_file is not None else None
    except exceptions.GraphQLClientError as e:
        print(str(e))
        return None

async def create_anthropic_specification(model: enums.AnthropicModels, retrievalType: Optional[enums.RetrievalStrategyTypes] = None, enableRerank: bool = False, enableRevision: bool = False, revisionCount: Optional[int] = None,
                                         enablePromptStrategy: Optional[bool] = None, promptType: Optional[enums.PromptStrategyTypes] = None, embedCitations: Optional[bool] = False):
    if graphlit.client is None:
        return None

    input = input_types.SpecificationInput(
        name=f"Anthropic [{str(model)}]",
        type=enums.SpecificationTypes.COMPLETION,
        serviceType=enums.ModelServiceTypes.ANTHROPIC,
        anthropic=input_types.AnthropicModelPropertiesInput(
            model=model,
            temperature=0.1,
            completionTokenLimit=2048
        ),
        customInstructions='You are a financial analyst with a focused attention to detail. Provide specific details on financial data in your responses, but be concise in your answers. When providing a numeric or dollar value for your answer, only provide the value without any other commentary or explanation. Make sure to add currency symbol in front of values, where appropriate. Only rely on the content sources provided for your response. If you are unable to answer the question, given the content sources provided, answer with "I was unable to answer given the provided sources."',
        strategy=input_types.ConversationStrategyInput(
            embedCitations=embedCitations
        ),
        promptStrategy=input_types.PromptStrategyInput(
            type=promptType if promptType is not None else enums.PromptStrategyTypes.OPTIMIZE_SEARCH
        ) if enablePromptStrategy else None,
        retrievalStrategy=input_types.RetrievalStrategyInput(
            type=retrievalType if retrievalType is not None else enums.RetrievalStrategyTypes.CHUNK
        ),
        revisionStrategy=input_types.RevisionStrategyInput(
            type=enums.RevisionStrategyTypes.REVISE,
            count=revisionCount if revisionCount is not None else 1
        ) if enableRevision else None,
        rerankingStrategy=input_types.RerankingStrategyInput(
            serviceType=enums.RerankingModelServiceTypes.COHERE
        ) if enableRerank else None
    )

    try:
        response = await graphlit.client.create_specification(input)

        return response.create_specification.id if response.create_specification is not None else None
    except exceptions.GraphQLClientError as e:
        print(str(e))
        return None


async def create_openai_specification(model: enums.OpenAIModels, retrievalType: Optional[enums.RetrievalStrategyTypes] = None, enableRerank: bool = False, enableRevision: bool = False, revisionCount: Optional[int] = None,
                                         enablePromptStrategy: Optional[bool] = None, promptType: Optional[enums.PromptStrategyTypes] = None, embedCitations: Optional[bool] = False):
    if graphlit.client is None:
        return None

    input = input_types.SpecificationInput(
        name=f"OpenAI [{str(model)}]",
        type=enums.SpecificationTypes.COMPLETION,
        serviceType=enums.ModelServiceTypes.OPEN_AI,
        openAI=input_types.OpenAIModelPropertiesInput(
            model=model,
            temperature=0.1,
            completionTokenLimit=2048
        ),
        customInstructions='You are a financial analyst with a focused attention to detail. Provide specific details on financial data in your responses, but be concise in your answers. When providing a numeric or dollar value for your answer, only provide the value without any other commentary or explanation. Make sure to add currency symbol in front of values, where appropriate. Only rely on the content sources provided for your response. If you are unable to answer the question, given the content sources provided, answer with "I was unable to answer given the provided sources."',
        strategy=input_types.ConversationStrategyInput(
            embedCitations=embedCitations
        ),
        promptStrategy=input_types.PromptStrategyInput(
            type=promptType if promptType is not None else enums.PromptStrategyTypes.OPTIMIZE_SEARCH
        ) if enablePromptStrategy else None,
        retrievalStrategy=input_types.RetrievalStrategyInput(
            type=retrievalType if retrievalType is not None else enums.RetrievalStrategyTypes.CHUNK
        ),
        revisionStrategy=input_types.RevisionStrategyInput(
            type=enums.RevisionStrategyTypes.REVISE,
            count=revisionCount if revisionCount is not None else 1
        ) if enableRevision else None,
        rerankingStrategy=input_types.RerankingStrategyInput(
            serviceType=enums.RerankingModelServiceTypes.COHERE
        ) if enableRerank else None
    )

    try:
        response = await graphlit.client.create_specification(input)

        return response.create_specification.id if response.create_specification is not None else None
    except exceptions.GraphQLClientError as e:
        print(str(e))
        return None


async def create_conversation(specification_id: str):
    if graphlit.client is None:
        return None

    input = input_types.ConversationInput(
        name="Conversation",
        specification=input_types.EntityReferenceInput(
            id=specification_id
        )
    )

    try:
        response = await graphlit.client.create_conversation(input)

        return response.create_conversation.id if response.create_conversation is not None else None
    except exceptions.GraphQLClientError as e:
        print(str(e))
        return None

async def delete_conversation(conversation_id: str):
    if graphlit.client is None:
        return

    if conversation_id is not None:
        _ = await graphlit.client.delete_conversation(conversation_id)

async def prompt_conversation(conversation_id: str, prompt: str):
    if graphlit.client is None:
        return None, None

    try:
        response = await graphlit.client.prompt_conversation(prompt, conversation_id)

        message = response.prompt_conversation.message.message if response.prompt_conversation is not None and response.prompt_conversation.message is not None else None
        citations = response.prompt_conversation.message.citations if response.prompt_conversation is not None and response.prompt_conversation.message is not None else None

        return message, citations
    except exceptions.GraphQLClientError as e:
        print(str(e))
        return None, None

async def delete_content(content_id: str):
    if graphlit.client is None:
        return

    if content_id is not None:
        _ = await graphlit.client.delete_content(content_id)

async def get_content(content_id: str):
    if graphlit.client is None:
        return None

    try:
        response = await graphlit.client.get_content(content_id)

        return response.content
    except exceptions.GraphQLClientError as e:
        print(str(e))
        return None

async def delete_all_specifications():
    if graphlit.client is None:
        return

    _ = await graphlit.client.delete_all_specifications(is_synchronous=True)

async def delete_all_conversations():
    if graphlit.client is None:
        return

    _ = await graphlit.client.delete_all_conversations(is_synchronous=True)

async def delete_all_contents():
    if graphlit.client is None:
        return

    _ = await graphlit.client.delete_all_contents(is_synchronous=True)
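
The evaluation loop in this notebook records each model answer beside its gold answer but leaves scoring to the reader. For questions with numeric answers, one possible approach is sketched below: extract the first number from each string and compare the two with a relative tolerance. The helpers `first_number` and `numeric_match` and the 1% default tolerance are assumptions for illustration, not part of FinanceBench or Graphlit.

```python
import math
import re
from typing import Optional

# Matches an optionally signed number with optional thousands separators and decimals.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def first_number(text: str) -> Optional[float]:
    """Return the first numeric value found in text, ignoring thousands separators."""
    match = _NUMBER.search(text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))

def numeric_match(model_answer: str, gold_answer: str, rel_tol: float = 0.01) -> bool:
    """Return True when both answers contain a number and the values agree within rel_tol."""
    model_value = first_number(model_answer)
    gold_value = first_number(gold_answer)
    if model_value is None or gold_value is None:
        return False
    return math.isclose(model_value, gold_value, rel_tol=rel_tol)

# Formatting differences such as "$1,577" versus "$1577.00" do not affect the comparison.
print(numeric_match("$1,577", "$1577.00"))
```

This only suits numeric questions, and it reads the first number in the string, so an answer that begins with a year such as FY2018 would be misread. Free-text answers still need human or model-based grading.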


Execute Graphlit example

In [None]:
from IPython.display import display, Markdown, HTML
import time

# Remove any existing contents, conversations and specifications; only needed for notebook example
await delete_all_conversations()
await delete_all_specifications()
await delete_all_contents()

print('Deleted all contents, conversations and specifications.')

# Set evaluation questions
df_eval = df_questions

results = []

# Configure LLM specification

# NOTE: select whether to use OpenAI or Anthropic models
modelService = enums.ModelServiceTypes.OPEN_AI
#modelService = enums.ModelServiceTypes.ANTHROPIC

# NOTE: select your preferred OpenAI and Anthropic model
openaiModel = enums.OpenAIModels.O1_MINI_128K
anthropicModel = enums.AnthropicModels.CLAUDE_3_5_SONNET

# NOTE: select a prompt rewriting strategy
#prompt_strategy = None
prompt_strategy = enums.PromptStrategyTypes.OPTIMIZE_SEARCH

# NOTE: select a retrieval strategy
#retrieval_strategy = enums.RetrievalStrategyTypes.CHUNK
retrieval_strategy = enums.RetrievalStrategyTypes.SECTION

# NOTE: can embed citations in the completed response, and return the cited text sources
embedCitations = False

# NOTE: using Cohere reranking by default
enable_rerank = True

# NOTE: can enable re-asking LLM to revise its answer, and assign the number of revisions to make
enable_revision = False
revision_count = None

# Create the LLM specification
specification_id = None

if modelService == enums.ModelServiceTypes.OPEN_AI:
    print(f'Using OpenAI model [{openaiModel}]')

    specification_id = await create_openai_specification(openaiModel, retrieval_strategy, enable_rerank, enable_revision, revision_count, prompt_strategy is not None, prompt_strategy, embedCitations)
elif modelService == enums.ModelServiceTypes.ANTHROPIC:
    print(f'Using Anthropic model [{anthropicModel}]')

    specification_id = await create_anthropic_specification(anthropicModel, retrieval_strategy, enable_rerank, enable_revision, revision_count, prompt_strategy is not None, prompt_strategy, embedCitations)

if specification_id is not None:
    print(f'Created specification [{specification_id}]')

    for k, (idx, row) in tqdm(enumerate(df_eval.sort_values("doc_name").iterrows()), total=len(df_eval)):

        docs = row["doc_name"]
        question = row["question"]
        gold_answer = row["answer"]

        path_doc = f"{PATH_PDFS}{docs}.pdf"

        start_time = time.time()

        print(f'Ingesting [{path_doc}].')

        # NOTE: ingestion uses the Azure AI Document Intelligence layout model, by default
        content_id = await ingest_file(path_doc)

        end_time = time.time()
        elapsed_time = end_time - start_time

        if content_id is None:
            print(f'Failed to ingest [{path_doc}].')
            continue

        content = await get_content(content_id)

        if content is None:
            print(f'Failed to get content [{content_id}].')
            continue

        print(f'Ingested content [{content_id}] with {content.document.page_count if content.document is not None else 0} pages in {elapsed_time:.2f} seconds.')

        # (B) Model Call
        if question is not None:
            conversation_id = await create_conversation(specification_id=specification_id)

            if conversation_id is not None:
                print(f'Prompting conversation [{conversation_id}] with question [{question}]')

                answer, citations = await prompt_conversation(conversation_id, question)

                if answer is not None:
                    display(Markdown(f'### Document [{docs}]:'))
                    display(Markdown(f'**Question:**\n{question}'))
                    display(Markdown(f'**Answer:**\n{answer}'))
                    display(Markdown(f'**Gold Answer:**\n{gold_answer}'))
                    print()

                if citations is not None:
                    for citation in citations:
                        if citation is not None and citation.content is not None:
                            display(Markdown(f'**Citation [{citation.index}]:** {citation.content.name}'))
                            display(Markdown(citation.text))
                            print()

                await delete_conversation(conversation_id)

                # (C) Bookkeeping
                results.append({
                                "financebench_id" : row["financebench_id"],
                                "question" : question,
                                "gold_answer": gold_answer,
                                "model_answer": answer,
                                })

        await delete_content(content_id)

        print(f'Deleted content [{content_id}]')
        print('-----------------------------------------------------------------------------------')

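df_results = pd.DataFrame(results)

# PATH_RESULTS already ends with a slash, so join the path rather than concatenating;
# create the directory first in case the cloned repository does not include it
os.makedirs(PATH_RESULTS, exist_ok=True)
df_results.to_csv(os.path.join(PATH_RESULTS, "results.csv"))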

Deleted all contents, conversations and specifications.
Using OpenAI model [O1_MINI_128K]
Created specification [2adb69ed-4559-4c46-b62f-2bac9382150b]


  0%|          | 0/150 [00:00<?, ?it/s]

Ingesting [financebench/pdfs/3M_2018_10K.pdf].
Ingested content [e83808c9-c929-49ff-a77b-c7c6de9b1268] with 160 pages in 37.21 seconds.
Prompting conversation [0a021e54-05f0-4c94-8169-1cf8c50ce5fc] with question [What is the FY2018 capital expenditure amount (in USD millions) for 3M? Give a response to the question by relying on the details shown in the cash flow statement.]


### Document [3M_2018_10K]:

**Question:**
What is the FY2018 capital expenditure amount (in USD millions) for 3M? Give a response to the question by relying on the details shown in the cash flow statement.

**Answer:**
$1,577

**Gold Answer:**
$1577.00


Deleted content [e83808c9-c929-49ff-a77b-c7c6de9b1268]
-----------------------------------------------------------------------------------
Ingesting [financebench/pdfs/3M_2018_10K.pdf].
Ingested content [685802aa-0b78-4650-ac59-e822fed8592b] with 160 pages in 32.42 seconds.
Prompting conversation [c3da54c1-1358-4f4a-bf05-9c92d1eaf4d8] with question [Assume that you are a public equities analyst. Answer the following question by primarily using information that is shown in the balance sheet: what is the year end FY2018 net PPNE for 3M? Answer in USD billions.]


### Document [3M_2018_10K]:

**Question:**
Assume that you are a public equities analyst. Answer the following question by primarily using information that is shown in the balance sheet: what is the year end FY2018 net PPNE for 3M? Answer in USD billions.

**Answer:**
$8.738 billion

**Gold Answer:**
$8.70


Deleted content [685802aa-0b78-4650-ac59-e822fed8592b]
-----------------------------------------------------------------------------------
Ingesting [financebench/pdfs/3M_2022_10K.pdf].
Ingested content [497af1c9-a205-4fe7-b9e7-bd69b8cec690] with 252 pages in 57.46 seconds.
Prompting conversation [5d62144f-3b0f-407e-8d87-12582665f852] with question [Is 3M a capital-intensive business based on FY2022 data?]


### Document [3M_2022_10K]:

**Question:**
Is 3M a capital-intensive business based on FY2022 data?

**Answer:**
Yes, 3M is a capital-intensive business based on FY2022 data. The company invested $1.749 billion in property, plant, and equipment, indicating significant capital expenditures to support its operations and growth.

**Gold Answer:**
No, the company is managing its CAPEX and Fixed Assets pretty efficiently, which is evident from below key metrics:
CAPEX/Revenue Ratio: 5.1%
Fixed assets/Total Assets: 20%
Return on Assets= 12.4%


Deleted content [497af1c9-a205-4fe7-b9e7-bd69b8cec690]
-----------------------------------------------------------------------------------
Ingesting [financebench/pdfs/3M_2022_10K.pdf].
