FinTrust is the first comprehensive benchmark designed specifically to evaluate the trustworthiness of Large Language Models (LLMs) in financial applications. Because finance is a high-stakes domain with strict standards of trustworthiness, our benchmark provides a systematic framework to assess LLMs across seven critical dimensions: trustfulness, robustness, safety, fairness, privacy, transparency, and knowledge discovery.
Our benchmark comprises 15,680 answer pairs spanning textual, tabular, and time-series data. Unlike existing benchmarks that primarily focus on task completion, FinTrust evaluates alignment issues in practical contexts with fine-grained tasks for each dimension of trustworthiness.
📊 HuggingFace Full Dataset: [https://huggingface.co/datasets/HughieHu/FinTrust](https://huggingface.co/datasets/HughieHu/FinTrust)
This repository contains the following directories:
- `fairness/`: Evaluates models' ability to provide unbiased responses
- `knowledge_discovery/`: Tests models' capability to uncover non-trivial investment insights
- `privacy/`: Assesses resistance to information leakage
- `robustness/`: Examines models' resilience and ability to abstain when confidence is low
- `safety/`: Tests handling of various LLM attack strategies with financial crime scenarios
- `transparency/`: Evaluates disclosure of limitations and potential conflicts of interest
- `trustfulness/`: Measures models' accuracy and factuality in financial contexts
Each directory contains:
- `api_call.py`: Script to call the model API and generate responses
- `postprocess_response.py`: Script to process and evaluate model responses
- A sample dataset of 100 test cases
1. First, create a `.env` file in the root directory with the following parameters:

   ```
   PROMPT_JSON_PATH=""   # Path to the sample test data
   MODEL_KEY=""          # Model name (must match a key in MODEL_CATALOG)
   MAX_PARALLEL=""       # Number of parallel API calls
   OPENAI_API_KEY=""     # Your OpenAI API key
   TOGETHER_API_KEY=""   # Your Together API key
   ```

2. Run the API call script to generate responses:

   ```
   python [dimension]/api_call.py
   ```

3. Process the responses to get evaluation results:

   ```
   python [dimension]/postprocess_response.py
   ```
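The `MAX_PARALLEL` setting caps the number of concurrent API calls. As a minimal sketch of what that fan-out might look like (the `call_model` function and prompt format here are hypothetical stand-ins, not the repository's actual implementation):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def call_model(prompt: str) -> str:
    """Hypothetical stand-in for a real provider API call."""
    return f"response to: {prompt}"

def generate_responses(prompts, max_parallel: int):
    """Fan prompts out across a bounded thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(call_model, prompts))

if __name__ == "__main__":
    # MAX_PARALLEL comes from the .env file; default to 4 for this sketch.
    max_parallel = int(os.getenv("MAX_PARALLEL", "4"))
    print(generate_responses(["prompt one", "prompt two"], max_parallel))
```

A thread pool is a reasonable fit here because the workload is I/O-bound: each worker spends most of its time waiting on the network.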
Note: The MODEL_KEY should correspond to a key in the MODEL_CATALOG dictionary defined in each api_call.py file.
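For illustration only, a catalog of this shape might map short model keys to provider-specific identifiers. The keys and entries below are hypothetical; the real `MODEL_CATALOG` is defined in each `api_call.py` and may differ:

```python
import os

# Hypothetical catalog; the actual keys live in each api_call.py.
MODEL_CATALOG = {
    "gpt-4o": {"provider": "openai", "model_id": "gpt-4o"},
    "llama-3-70b": {
        "provider": "together",
        "model_id": "meta-llama/Llama-3-70b-chat-hf",
    },
}

def resolve_model(key=None):
    """Look up MODEL_KEY (from the .env file) in the catalog."""
    key = key or os.getenv("MODEL_KEY", "")
    if key not in MODEL_CATALOG:
        raise KeyError(
            f"MODEL_KEY {key!r} must be one of {sorted(MODEL_CATALOG)}"
        )
    return MODEL_CATALOG[key]
```

Failing fast with an explicit list of valid keys makes a misconfigured `.env` file easy to diagnose.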