parsee-core version used: 0.1.3.14
This dataset was created on the basis of 15 pages from annual/quarterly filings of major German stock-exchange listed companies (PDF files).
All PDF files are publicly accessible on parsee.ai. To open one, copy its "source_identifier" (first column) and paste it into this URL, replacing '{SOURCE_IDENTIFIER}' with the actual identifier:
https://app.parsee.ai/documents/view/{SOURCE_IDENTIFIER}
So for example:
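A minimal sketch of building the viewer URL in Python. The helper name `document_url` and the value "EXAMPLE_ID" are hypothetical placeholders, not a real identifier from the dataset; substitute a value from the "source_identifier" column.

```python
from urllib.parse import quote

# URL template as given in this README.
URL_TEMPLATE = "https://app.parsee.ai/documents/view/{SOURCE_IDENTIFIER}"


def document_url(source_identifier: str) -> str:
    """Return the viewer URL for one source_identifier (hypothetical helper)."""
    # Strip stray whitespace from copy/paste and percent-encode the value
    # in case it contains characters that are unsafe in a URL path.
    cleaned = quote(source_identifier.strip(), safe="")
    return URL_TEMPLATE.replace("{SOURCE_IDENTIFIER}", cleaned)


# "EXAMPLE_ID" is a placeholder, not an identifier from the dataset.
print(document_url("EXAMPLE_ID"))
# prints: https://app.parsee.ai/documents/view/EXAMPLE_ID
```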
The goal of this dataset is to load the files with the Parsee PDF Reader and to compare the results with those of the LangChain PyPDF loader.
The dataset was created on Parsee Cloud, where all output was checked by a human and corrected prior to running this code.
All prompts were truncated to a maximum of 8k tokens. This should not affect the prompts for this dataset, because each file is a single page and therefore quite small.
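To illustrate why the cap is unlikely to matter here, the following is a rough sketch of truncation to a token budget. It is not the parsee-core implementation: the function name `truncate_tokens` is hypothetical, "8k" is assumed to mean 8,000 (the exact limit is not stated), and a whitespace split stands in for the real tokenizer.

```python
from typing import List

# Assumption: "8k" is read as 8,000 tokens; the exact limit is not stated.
MAX_TOKENS = 8000


def truncate_tokens(tokens: List[str], max_tokens: int = MAX_TOKENS) -> List[str]:
    """Keep at most max_tokens tokens from the start of the prompt."""
    return tokens[:max_tokens]


# Whitespace splitting is only a stand-in for a real model tokenizer.
short_prompt = "Revenue 2023 compared with revenue 2022".split()
long_prompt = ["tok"] * 10000

# A short, single-page prompt passes through unchanged.
print(truncate_tokens(short_prompt) == short_prompt)
# A prompt over the budget is cut down to the budget.
print(len(truncate_tokens(long_prompt)))
```

A prompt built from a single page will normally sit well under such a budget, so the slice returns it unchanged.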
For the evaluation we are using the Claude 3 Opus model from Anthropic.
The results of the evaluation can be found here: jupyter notebook