
PETAVUE BENCHMARK RESEARCH

This repository contains a comprehensive set of scripts that streamline inference across diverse Large Language Models (LLMs) on multiple platforms, including Anyscale, OpenAI, Amazon Bedrock, Anthropic, Google Gemini, and self-hosted models on Runpod servers. The scripts simplify running LLMs against the benchmark dataset, which is adjustable in size and currently consists of 360 rows.

The BIRD dataset comprises natural language questions paired with their corresponding SQL queries, all linked to a common SQL database schema. Using this dataset, our scripts automate running LLMs on the questions and schemas it provides; the SQL queries generated by the LLMs are then compared against the ground-truth queries from the dataset.
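
The comparison itself is handled by the evaluation scripts described below. As a minimal sketch of what execution-based matching can look like, assuming a SQLite copy of the relevant BIRD database is available (the function and path here are illustrative, not the repository's code):

import sqlite3

def execution_match(predicted_sql: str, gold_sql: str, db_path: str) -> bool:
    # Return True if both queries yield the same result set on the given database.
    # Illustrative only; the actual evaluation uses the DAMO-ConvAI BIRD scripts.
    conn = sqlite3.connect(db_path)
    try:
        predicted_rows = set(conn.execute(predicted_sql).fetchall())
        gold_rows = set(conn.execute(gold_sql).fetchall())
    except sqlite3.Error:
        return False  # a query that fails to execute counts as a mismatch
    finally:
        conn.close()
    return predicted_rows == gold_rows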

Data extraction

The BIRD dataset is partitioned into train and dev sets. The train set contains the data points intended for training an LLM on the natural-language-to-SQL task, while the dev set contains the data points used to evaluate a trained model. Since our objective is benchmarking rather than training models, we extracted 360 pairs of natural language questions and SQL queries from the dataset using the data_extraction.py and Final_equal_split.py scripts and consolidated them into a dataframe.
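
A rough sketch of the kind of balanced extraction these scripts perform, assuming the BIRD dev split is loaded as a dataframe with a difficulty column (the file path, column name, and per-category count are assumptions for illustration; see data_extraction.py and Final_equal_split.py for the actual logic):

import pandas as pd

# Load the BIRD dev split (path and column names are assumptions).
dev = pd.read_json("dev.json")

# Sample an equal number of question/SQL pairs from each hardness category
# so the benchmark set is balanced across difficulty levels.
per_category = 120
equal_split = (
    dev.groupby("difficulty", group_keys=False)
       .apply(lambda g: g.sample(n=per_category, random_state=42))
       .reset_index(drop=True)
)
equal_split.to_csv("bird_equal_split_360.csv", index=False)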

Additionally, the repository empowers users to assess LLM models comprehensively by considering various performance metrics. By analyzing these metrics, users can make informed decisions regarding model selection, optimization, and deployment strategies.

Usage

Inferencing

To run inference with different LLMs across the supported platforms, clone this repository:

git clone https://github.com/petavue/llm-research.git

The benchmark_scripts folder contains scripts specific to each platform; use the appropriate script for inference. For example, to run inference on GPT-3.5 with the OpenAI API, use the open-ai.py script:

python open-ai.py -h 

The command above prints all the instructions needed to run the inference process. The script accepts the model name, the number of instructions to pass, and the dataset size; default values are applied when these parameters are not provided. The default dataset size is 360. If you intend to run the script with a dataset of a different length, make sure a CSV file named in the format bird_equal_split_<dataset_length>.csv exists. To generate a dataset of a different size, refer to the Data extraction section for guidance on using the data_extraction.py and Final_equal_split.py scripts.

The default bird_equal_split_360.csv file comprises 360 rows, each containing a natural language question, schema, correct SQL query, and hardness level. The dataset is structured so that SQL queries are distributed equally across the hardness categories, with 120 entries each for simple, moderate, and challenging. This dataset serves as the foundation for our benchmarking process.
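
As a quick sanity check, the split file can be inspected with pandas (the column name used below is an assumption based on the description above; check the CSV header for the exact names):

import pandas as pd

df = pd.read_csv("bird_equal_split_360.csv")
print(len(df))                        # expect 360 rows
print(df.columns.tolist())            # question, schema, gold SQL, hardness
print(df["hardness"].value_counts())  # expect ~120 per hardness category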

Furthermore, the default instructions utilized in the benchmark can be modified within the common_functions.py file to suit specific requirements or preferences.
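
The instructions here appear to be prompt text sent to the model alongside the question and schema. A hypothetical example of the kind of string you might edit in common_functions.py (the variable name and wording below are illustrative, not the repository's actual defaults):

INSTRUCTION_EXAMPLE = (
    "You are an expert SQL developer. Given the database schema and the "
    "natural language question below, return a single valid SQL query and "
    "nothing else."
)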

Command to run GPT-3 on the OpenAI API with 5 instructions and a dataset size of 360:

python open-ai.py --models gpt-3 --inst 5 --il 360 

The inference scripts create the folders and files needed by the evaluation scripts. The directory structure for inference results is organized as follows:

  • inference_results: Root directory for all inference results.
    • environment_name: Specifies the environment in which the inference was conducted.
      • short_size: Indicates the size category of the dataset used for inference.
        • model_name: Name of the model used for inference.
          • instruction_size: Size or length of the instructions or commands used in the inference process.

Each instruction_size directory contains dev_gold.sql, dev.json, execution-log.jsonl, predict_dev.json, metrics.csv, gold.text, and predicted.txt, which are used by the evaluation scripts.
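
To see which runs are present, the tree can be walked programmatically; a minimal sketch, assuming the default layout above (this mirrors the recursive search that accuracy.py performs, but it is not the repository's code):

import os

# Walk inference_results/<environment>/<short_size>/<model>/<instruction_size>/
# and report every directory holding a completed inference run.
for root, dirs, files in os.walk("inference_results"):
    if "predicted.txt" in files and "metrics.csv" in files:
        print("inference run found:", root)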

Evaluation

For the evaluation scripts, clone the GitHub repo DAMO-ConvAI and delete every folder other than 'bird'; this folder contains the scripts needed for evaluating the accuracy of the LLMs' SQL generation. Next, replace the evaluation.py in DAMO-ConvAI/bird/llm/src/ with the evaluation.py found in the helper_scripts folder of this repository. Before running the evaluation, create a folder called evaldata in DAMO-ConvAI/bird/llm/, which will be used by the accuracy.py script in this repository.

To evaluate all the inference runs produced by the inference scripts, use accuracy.py, which takes the files created by the inference script and reports the accuracy of each model for a given shot count, instruction count, and dataset size. Before running accuracy.py, edit line 12 of the script to provide the path to DAMO-ConvAI and the path to the bird_equal_split_360.csv file.

Use the command below to run the accuracy.py script. (Note: make sure accuracy.py is placed inside the inference_results folder created by the inference script, because accuracy.py recursively searches its folders to evaluate every model that has been run.)

python accuracy.py 

Metrics

To compute metrics such as throughput, cost, total_input_tokens, and total_output_tokens, use metrics_to_excel.py, which creates an Excel file with all the metrics calculated during inference. On line 10 of the script, provide the path to bird_equal_split_360.csv for the script to work properly.

python metrics_to_excel.py 
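
A rough sketch of the kind of aggregation this step performs, assuming each run directory contains a metrics.csv as described above (the output filename and the run_dir column are illustrative; metrics_to_excel.py defines the actual layout):

import glob
import os
import pandas as pd

# Gather every per-run metrics.csv under inference_results/ into one table,
# tag each row with the directory it came from, and write the result to Excel.
frames = []
for path in glob.glob("inference_results/**/metrics.csv", recursive=True):
    df = pd.read_csv(path)
    df["run_dir"] = os.path.dirname(path)
    frames.append(df)

combined = pd.concat(frames, ignore_index=True)
combined.to_excel("all_metrics.xlsx", index=False)  # requires openpyxl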

Results

The generated metrics offer a comprehensive analysis for each inference, facilitating informed decisions regarding the selection of platforms and models for natural language to SQL generation. Our benchmark report, presented in an extensive PDF format, furnishes detailed insights into model performance, throughput, and cost considerations. Furthermore, we conducted a thorough comparison of different models across various platforms to ensure credibility and consistency in our evaluation process. For further exploration and reference, all finalized data on each inference, model costs, and throughput for all models is accessible in either XLSX or CSV format. You can find this data conveniently organized within the results folder of this repository.

Link to the Log files : -
