Name		Name	Last commit message	Last commit date
parent directory ..
rag-text		rag-text
selection-image		selection-image
selection-text		selection-text
README.md		README.md

README.md

finRAG Datasets

This is the official repo of the finRAG datasets published by parsee.ai.

We wanted to investigate how good the current state of the art (M)LLMs are at solving the relatively simple problem of extracting revenue figures from publicly available financial reports. To test this, we created 3 different datasets, all based on the same selection of 100 randomly selected annual reports for the year 2023 of publicly listed US companies. The 3 datasets are the following:

“Selection-text”: this dataset contains only the relevant profit & loss statement with the revenue numbers that we are looking for. It can be considered our “base-case”, as extracting the revenue numbers from this table only should be the easiest.

“RAG-text”: this dataset tries to simulate a real-world RAG-application, where we chunk the original document into pieces, perform a vector search based on the question that we want to solve, and present the LLMs with the most relevant chunks. We cut off all prompts at 8k tokens for this exercise, so in case the relevant table was not contained in the prompt, we inserted it at the “first position”, to simulate a “happy path” for the vector search, as the goal of this study is not to examine how good or bad vector search is working, but rather to focus on the capabilities of the LLMs if we can guarantee that all required information to solve a task is presented to the model.

“Selection-image”: this dataset is similar to the “Selection-text” dataset in the sense that we feed to the models only an image of the relevant profit & loss statement, that contains all the necessary information to solve the problem.

The datasets contain a combined total of 10,404 rows, 37,536,847 tokens and 1,156 images.

A study looking at the performance of 8 state of the art LLMs on a subset of this dataset can be found here: https://www.parsee.ai/en/blog/finrag-dataset-and-study/

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

data

data

rag-text

rag-text

selection-image

selection-image

selection-text

selection-text

README.md

README.md

README.md

finRAG Datasets

Files

data

Directory actions

More options

Directory actions

More options

Latest commit

History

data

Folders and files

parent directory

finRAG Datasets