In [1]:
import os 
import sys 

# Run this cell if you have not installed the repo as a package but still
# want to run this notebook.

# Add the parent directory of the current working directory to the import path.
current_dir = os.getcwd()
parent_dir = os.path.abspath(os.path.join(current_dir, '..'))
sys.path.append(parent_dir)

## Evaluate different Text to SQL Models using Prem AI 

In this example, we show how to evaluate different models on the Text-to-SQL task using Prem AI. The previous example showed how to evaluate a single model and how to use the individual components of our `text2sql` library. Here we apply all of that to evaluate a series of models:

- gpt-4o
- gpt-4o-mini
- claude-3.5-sonnet
- codellama-70b-instruct
- claude-3-opus
- llama-3.1-405-instruct

If you want to understand how to run these benchmarks for a single model and what each component does, check out the first tutorial on [evaluation](/examples/evaluation.ipynb).

In [3]:
from text2sql.eval.dataset.bird import BirdBenchEvalDataset
from text2sql.eval.settings import SQLGeneratorConfig, APIConfig
from text2sql.eval.generator.bird.from_api import SQLGeneratorFromAPI
from text2sql.eval.executor.bird.acc import BirdExecutorAcc
from text2sql.eval.executor.bird.ves import BirdExecutorVES

In [4]:
config = SQLGeneratorConfig()
eval_dataset = BirdBenchEvalDataset(config=config)

In [5]:
model_set = [
    "gpt-4o", 
    "gpt-4o-mini", 
    "claude-3.5-sonnet",
    "codellama-70b-instruct", 
    "claude-3-opus", 
    "llama-3.1-405-instruct"
]

In [6]:
# Evaluate every model in `model_set` on one difficulty level at a time.

import os

def run(dataset, model_set, difficulty, num_rows):
    # Keep only the questions of the requested difficulty, then build the prompts.
    filter_by = ("difficulty", difficulty)
    filtered_set = dataset.process_and_filter(num_rows=num_rows, filter_by=filter_by)
    prompt_data = filtered_set.apply_prompt(apply_knowledge=True)

    for model in model_set:
        print(f"================ {model} ================")
        api_config = APIConfig(
            api_key=os.environ.get("PREMAI_API_KEY"),
            temperature=0.1,
            max_tokens=256,
            model_name=model
        )
        config = SQLGeneratorConfig(model_name=model)
        client = SQLGeneratorFromAPI(
            generator_config=config,
            engine_config=api_config
        )
        # Execution accuracy and valid efficiency score (VES) evaluators.
        acc = BirdExecutorAcc(generator_config=config)
        ves = BirdExecutorVES(generator_config=config)
        data_with_gen = client.generate_and_save_results(data=prompt_data, force=True)
        acc.execute(model_responses=data_with_gen, filter_used=filter_by)
        ves.execute(model_responses=data_with_gen, filter_used=filter_by)
        print("\n")
        print("\n")
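
Note that `run` reads the key with `os.environ.get("PREMAI_API_KEY")`, which returns `None` when the variable is unset instead of raising. A minimal sketch of a guard you could call before starting a long evaluation is shown below; the helper name `require_env` is hypothetical and is not part of the `text2sql` library.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise if it is missing or blank.

    `require_env` is a hypothetical helper for this notebook, not a text2sql API.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; export it before running the evaluation."
        )
    return value


# Example: fail early instead of discovering the problem after the first API call.
# api_key = require_env("PREMAI_API_KEY")
```

Failing early this way avoids spending several minutes per model on a run that cannot authenticate.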

In [7]:
run(
    dataset=eval_dataset,
    model_set=model_set,
    difficulty="simple",
    num_rows=100
)

2024-07-29 00:36:30,998 - text2sql-eval - INFO - ./data/eval/ is not empty. Use force=True to re-download and overwrite the contents.




100%|██████████| 100/100 [04:08<00:00,  2.48s/it]
2024-07-29 00:40:39,496 - text2sql-eval - INFO - all responses written to ./experiments/eval/prem_gpt-4o/predict_dev.json


=>  ./experiments/eval/prem_gpt-4o acc_simple.json
+-------------+-------------------+-------------------+
| Category    |   num_correct (%) |   total questions |
| simple      |                56 |               100 |
+-------------+-------------------+-------------------+
| overall     |                56 |               100 |
+-------------+-------------------+-------------------+
| moderate    |                 0 |                 0 |
+-------------+-------------------+-------------------+
| challenging |                 0 |                 0 |
+-------------+-------------------+-------------------+
+-------------+-----------+-------------------+
| Category    |   VES (%) |   total questions |
| simple      |   54.9621 |               100 |
+-------------+-----------+-------------------+
| overall     |   54.9621 |               100 |
+-------------+-----------+-------------------+
| moderate    |    0      |                 0 |
+-------------+-----------+-------------------+
| cha

100%|██████████| 100/100 [04:10<00:00,  2.50s/it]
2024-07-29 00:45:13,982 - text2sql-eval - INFO - all responses written to ./experiments/eval/prem_gpt-4o-mini/predict_dev.json


=>  ./experiments/eval/prem_gpt-4o-mini acc_simple.json
+-------------+-------------------+-------------------+
| Category    |   num_correct (%) |   total questions |
| simple      |                42 |               100 |
+-------------+-------------------+-------------------+
| overall     |                42 |               100 |
+-------------+-------------------+-------------------+
| moderate    |                 0 |                 0 |
+-------------+-------------------+-------------------+
| challenging |                 0 |                 0 |
+-------------+-------------------+-------------------+
+-------------+-----------+-------------------+
| Category    |   VES (%) |   total questions |
| simple      |   47.6245 |               100 |
+-------------+-----------+-------------------+
| overall     |   47.6245 |               100 |
+-------------+-----------+-------------------+
| moderate    |    0      |                 0 |
+-------------+-----------+-------------------+


100%|██████████| 100/100 [06:21<00:00,  3.82s/it]
2024-07-29 00:51:51,697 - text2sql-eval - INFO - all responses written to ./experiments/eval/prem_claude-3.5-sonnet/predict_dev.json


=>  ./experiments/eval/prem_claude-3.5-sonnet acc_simple.json
+-------------+-------------------+-------------------+
| Category    |   num_correct (%) |   total questions |
| simple      |                42 |               100 |
+-------------+-------------------+-------------------+
| overall     |                42 |               100 |
+-------------+-------------------+-------------------+
| moderate    |                 0 |                 0 |
+-------------+-------------------+-------------------+
| challenging |                 0 |                 0 |
+-------------+-------------------+-------------------+
+-------------+-----------+-------------------+
| Category    |   VES (%) |   total questions |
| simple      |   44.4475 |               100 |
+-------------+-----------+-------------------+
| overall     |   44.4475 |               100 |
+-------------+-----------+-------------------+
| moderate    |    0      |                 0 |
+-------------+-----------+---------------

100%|██████████| 100/100 [05:46<00:00,  3.47s/it]
2024-07-29 00:57:52,182 - text2sql-eval - INFO - all responses written to ./experiments/eval/prem_codellama-70b-instruct/predict_dev.json


=>  ./experiments/eval/prem_codellama-70b-instruct acc_simple.json
+-------------+-------------------+-------------------+
| Category    |   num_correct (%) |   total questions |
| simple      |                30 |               100 |
+-------------+-------------------+-------------------+
| overall     |                30 |               100 |
+-------------+-------------------+-------------------+
| moderate    |                 0 |                 0 |
+-------------+-------------------+-------------------+
| challenging |                 0 |                 0 |
+-------------+-------------------+-------------------+
+-------------+-----------+-------------------+
| Category    |   VES (%) |   total questions |
| simple      |   28.1269 |               100 |
+-------------+-----------+-------------------+
| overall     |   28.1269 |               100 |
+-------------+-----------+-------------------+
| moderate    |    0      |                 0 |
+-------------+-----------+----------

100%|██████████| 100/100 [10:46<00:00,  6.46s/it]
2024-07-29 01:17:32,352 - text2sql-eval - INFO - all responses written to ./experiments/eval/prem_claude-3-opus/predict_dev.json


=>  ./experiments/eval/prem_claude-3-opus acc_simple.json
+-------------+-------------------+-------------------+
| Category    |   num_correct (%) |   total questions |
| simple      |                42 |               100 |
+-------------+-------------------+-------------------+
| overall     |                42 |               100 |
+-------------+-------------------+-------------------+
| moderate    |                 0 |                 0 |
+-------------+-------------------+-------------------+
| challenging |                 0 |                 0 |
+-------------+-------------------+-------------------+
+-------------+-----------+-------------------+
| Category    |   VES (%) |   total questions |
| simple      |   43.5037 |               100 |
+-------------+-----------+-------------------+
| overall     |   43.5037 |               100 |
+-------------+-----------+-------------------+
| moderate    |    0      |                 0 |
+-------------+-----------+-------------------

100%|██████████| 100/100 [00:23<00:00,  4.22it/s]
2024-07-29 01:18:19,327 - text2sql-eval - INFO - all responses written to ./experiments/eval/prem_llama-3.1-405-instruct/predict_dev.json


=>  ./experiments/eval/prem_llama-3.1-405-instruct acc_simple.json
+-------------+-------------------+-------------------+
| Category    |   num_correct (%) |   total questions |
| simple      |                 0 |               100 |
+-------------+-------------------+-------------------+
| overall     |                 0 |               100 |
+-------------+-------------------+-------------------+
| moderate    |                 0 |                 0 |
+-------------+-------------------+-------------------+
| challenging |                 0 |                 0 |
+-------------+-------------------+-------------------+
+-------------+-----------+-------------------+
| Category    |   VES (%) |   total questions |
| simple      |         0 |               100 |
+-------------+-----------+-------------------+
| overall     |         0 |               100 |
+-------------+-----------+-------------------+
| moderate    |         0 |                 0 |
+-------------+-----------+----------
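
The 0% scores for `llama-3.1-405-instruct` come with a generation time of about 23 seconds for 100 questions, far faster than any other model in this run. One possible explanation (an assumption, not something the logs confirm) is that the requests returned empty or failed responses rather than incorrect SQL. A minimal sketch for checking this is shown below. The record layout of `predict_dev.json` is not shown in this notebook, so the field name `generated` and the helper `count_empty` are hypothetical; adjust them after inspecting the file.

```python
import json
from pathlib import Path


def count_empty(records, key="generated"):
    """Count records whose generated text is missing or blank.

    `records` may be a list of dicts, a list of strings, or a dict of either.
    `key` is a hypothetical field name; the real file may use a different one.
    """
    if isinstance(records, dict):
        records = list(records.values())
    empty = 0
    for record in records:
        value = record.get(key) if isinstance(record, dict) else record
        if value is None or not str(value).strip():
            empty += 1
    return empty, len(records)


# Example usage with the path reported in the logs:
# path = Path("./experiments/eval/prem_llama-3.1-405-instruct/predict_dev.json")
# print(count_empty(json.loads(path.read_text())))
```

If most records turn out to be empty, the zeros say more about the API calls than about the model's SQL ability.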

In [8]:
run(
    dataset=eval_dataset,
    model_set=model_set,
    difficulty="moderate",
    num_rows=100
)

2024-07-29 01:18:19,366 - text2sql-eval - INFO - ./data/eval/ is not empty. Use force=True to re-download and overwrite the contents.
2024-07-29 01:18:19,390 - text2sql-eval - INFO - Already found results from folder




100%|██████████| 100/100 [04:53<00:00,  2.94s/it]
2024-07-29 01:23:13,188 - text2sql-eval - INFO - all responses written to ./experiments/eval/prem_gpt-4o/predict_dev.json


=>  ./experiments/eval/prem_gpt-4o acc_moderate.json
+-------------+-------------------+-------------------+
| Category    |   num_correct (%) |   total questions |
| moderate    |                35 |               100 |
+-------------+-------------------+-------------------+
| overall     |                35 |               100 |
+-------------+-------------------+-------------------+
| simple      |                 0 |                 0 |
+-------------+-------------------+-------------------+
| challenging |                 0 |                 0 |
+-------------+-------------------+-------------------+


2024-07-29 01:23:22,212 - text2sql-eval - INFO - Already found results from folder


+-------------+-----------+-------------------+
| Category    |   VES (%) |   total questions |
| moderate    |   37.9472 |               100 |
+-------------+-----------+-------------------+
| overall     |   37.9472 |               100 |
+-------------+-----------+-------------------+
| simple      |    0      |                 0 |
+-------------+-----------+-------------------+
| challenging |    0      |                 0 |
+-------------+-----------+-------------------+






100%|██████████| 100/100 [04:33<00:00,  2.73s/it]
2024-07-29 01:27:55,347 - text2sql-eval - INFO - all responses written to ./experiments/eval/prem_gpt-4o-mini/predict_dev.json


=>  ./experiments/eval/prem_gpt-4o-mini acc_moderate.json
+-------------+-------------------+-------------------+
| Category    |   num_correct (%) |   total questions |
| moderate    |                28 |               100 |
+-------------+-------------------+-------------------+
| overall     |                28 |               100 |
+-------------+-------------------+-------------------+
| simple      |                 0 |                 0 |
+-------------+-------------------+-------------------+
| challenging |                 0 |                 0 |
+-------------+-------------------+-------------------+


2024-07-29 01:28:38,139 - text2sql-eval - INFO - Already found results from folder


+-------------+-----------+-------------------+
| Category    |   VES (%) |   total questions |
| moderate    |   26.3869 |               100 |
+-------------+-----------+-------------------+
| overall     |   26.3869 |               100 |
+-------------+-----------+-------------------+
| simple      |    0      |                 0 |
+-------------+-----------+-------------------+
| challenging |    0      |                 0 |
+-------------+-----------+-------------------+






100%|██████████| 100/100 [06:31<00:00,  3.91s/it]
2024-07-29 01:35:09,301 - text2sql-eval - INFO - all responses written to ./experiments/eval/prem_claude-3.5-sonnet/predict_dev.json


=>  ./experiments/eval/prem_claude-3.5-sonnet acc_moderate.json
+-------------+-------------------+-------------------+
| Category    |   num_correct (%) |   total questions |
| moderate    |                27 |               100 |
+-------------+-------------------+-------------------+
| overall     |                27 |               100 |
+-------------+-------------------+-------------------+
| simple      |                 0 |                 0 |
+-------------+-------------------+-------------------+
| challenging |                 0 |                 0 |
+-------------+-------------------+-------------------+


2024-07-29 01:35:17,593 - text2sql-eval - INFO - Already found results from folder


+-------------+-----------+-------------------+
| Category    |   VES (%) |   total questions |
| moderate    |   24.9191 |               100 |
+-------------+-----------+-------------------+
| overall     |   24.9191 |               100 |
+-------------+-----------+-------------------+
| simple      |    0      |                 0 |
+-------------+-----------+-------------------+
| challenging |    0      |                 0 |
+-------------+-----------+-------------------+






100%|██████████| 100/100 [06:04<00:00,  3.64s/it]
2024-07-29 01:41:21,837 - text2sql-eval - INFO - all responses written to ./experiments/eval/prem_codellama-70b-instruct/predict_dev.json


=>  ./experiments/eval/prem_codellama-70b-instruct acc_moderate.json
+-------------+-------------------+-------------------+
| Category    |   num_correct (%) |   total questions |
| moderate    |                22 |               100 |
+-------------+-------------------+-------------------+
| overall     |                22 |               100 |
+-------------+-------------------+-------------------+
| simple      |                 0 |                 0 |
+-------------+-------------------+-------------------+
| challenging |                 0 |                 0 |
+-------------+-------------------+-------------------+


2024-07-29 01:41:26,535 - text2sql-eval - INFO - Already found results from folder


+-------------+-----------+-------------------+
| Category    |   VES (%) |   total questions |
| moderate    |   26.1479 |               100 |
+-------------+-----------+-------------------+
| overall     |   26.1479 |               100 |
+-------------+-----------+-------------------+
| simple      |    0      |                 0 |
+-------------+-----------+-------------------+
| challenging |    0      |                 0 |
+-------------+-----------+-------------------+






100%|██████████| 100/100 [11:37<00:00,  6.98s/it]
2024-07-29 01:53:04,463 - text2sql-eval - INFO - all responses written to ./experiments/eval/prem_claude-3-opus/predict_dev.json


=>  ./experiments/eval/prem_claude-3-opus acc_moderate.json
+-------------+-------------------+-------------------+
| Category    |   num_correct (%) |   total questions |
| moderate    |                27 |               100 |
+-------------+-------------------+-------------------+
| overall     |                27 |               100 |
+-------------+-------------------+-------------------+
| simple      |                 0 |                 0 |
+-------------+-------------------+-------------------+
| challenging |                 0 |                 0 |
+-------------+-------------------+-------------------+


2024-07-29 01:53:54,143 - text2sql-eval - INFO - Already found results from folder


+-------------+-----------+-------------------+
| Category    |   VES (%) |   total questions |
| moderate    |   25.8315 |               100 |
+-------------+-----------+-------------------+
| overall     |   25.8315 |               100 |
+-------------+-----------+-------------------+
| simple      |    0      |                 0 |
+-------------+-----------+-------------------+
| challenging |    0      |                 0 |
+-------------+-----------+-------------------+






100%|██████████| 100/100 [00:22<00:00,  4.50it/s]
2024-07-29 01:54:16,396 - text2sql-eval - INFO - all responses written to ./experiments/eval/prem_llama-3.1-405-instruct/predict_dev.json


=>  ./experiments/eval/prem_llama-3.1-405-instruct acc_moderate.json
+-------------+-------------------+-------------------+
| Category    |   num_correct (%) |   total questions |
| moderate    |                 0 |               100 |
+-------------+-------------------+-------------------+
| overall     |                 0 |               100 |
+-------------+-------------------+-------------------+
| simple      |                 0 |                 0 |
+-------------+-------------------+-------------------+
| challenging |                 0 |                 0 |
+-------------+-------------------+-------------------+
+-------------+-----------+-------------------+
| Category    |   VES (%) |   total questions |
| moderate    |         0 |               100 |
+-------------+-----------+-------------------+
| overall     |         0 |               100 |
+-------------+-----------+-------------------+
| simple      |         0 |                 0 |
+-------------+-----------+--------

In [9]:
run(
    dataset=eval_dataset,
    model_set=model_set,
    difficulty="challenging",
    num_rows=100
)

2024-07-29 01:54:16,438 - text2sql-eval - INFO - ./data/eval/ is not empty. Use force=True to re-download and overwrite the contents.
2024-07-29 01:54:16,472 - text2sql-eval - INFO - Already found results from folder




100%|██████████| 100/100 [04:43<00:00,  2.84s/it]
2024-07-29 01:59:00,114 - text2sql-eval - INFO - all responses written to ./experiments/eval/prem_gpt-4o/predict_dev.json


=>  ./experiments/eval/prem_gpt-4o acc_challenging.json
+-------------+-------------------+-------------------+
| Category    |   num_correct (%) |   total questions |
| challenging |                41 |               100 |
+-------------+-------------------+-------------------+
| overall     |                41 |               100 |
+-------------+-------------------+-------------------+
| moderate    |                 0 |                 0 |
+-------------+-------------------+-------------------+
| simple      |                 0 |                 0 |
+-------------+-------------------+-------------------+


2024-07-29 02:20:21,737 - text2sql-eval - INFO - Already found results from folder


+-------------+-----------+-------------------+
| Category    |   VES (%) |   total questions |
| challenging |   41.9917 |               100 |
+-------------+-----------+-------------------+
| overall     |   41.9917 |               100 |
+-------------+-----------+-------------------+
| moderate    |    0      |                 0 |
+-------------+-----------+-------------------+
| simple      |    0      |                 0 |
+-------------+-----------+-------------------+






100%|██████████| 100/100 [04:43<00:00,  2.84s/it]
2024-07-29 02:25:05,678 - text2sql-eval - INFO - all responses written to ./experiments/eval/prem_gpt-4o-mini/predict_dev.json


=>  ./experiments/eval/prem_gpt-4o-mini acc_challenging.json
+-------------+-------------------+-------------------+
| Category    |   num_correct (%) |   total questions |
| challenging |                34 |               100 |
+-------------+-------------------+-------------------+
| overall     |                34 |               100 |
+-------------+-------------------+-------------------+
| moderate    |                 0 |                 0 |
+-------------+-------------------+-------------------+
| simple      |                 0 |                 0 |
+-------------+-------------------+-------------------+


2024-07-29 02:46:32,261 - text2sql-eval - INFO - Already found results from folder


+-------------+-----------+-------------------+
| Category    |   VES (%) |   total questions |
| challenging |   35.9608 |               100 |
+-------------+-----------+-------------------+
| overall     |   35.9608 |               100 |
+-------------+-----------+-------------------+
| moderate    |    0      |                 0 |
+-------------+-----------+-------------------+
| simple      |    0      |                 0 |
+-------------+-----------+-------------------+






100%|██████████| 100/100 [06:33<00:00,  3.93s/it]
2024-07-29 02:53:05,354 - text2sql-eval - INFO - all responses written to ./experiments/eval/prem_claude-3.5-sonnet/predict_dev.json


=>  ./experiments/eval/prem_claude-3.5-sonnet acc_challenging.json
+-------------+-------------------+-------------------+
| Category    |   num_correct (%) |   total questions |
| challenging |                31 |               100 |
+-------------+-------------------+-------------------+
| overall     |                31 |               100 |
+-------------+-------------------+-------------------+
| moderate    |                 0 |                 0 |
+-------------+-------------------+-------------------+
| simple      |                 0 |                 0 |
+-------------+-------------------+-------------------+


2024-07-29 03:14:58,956 - text2sql-eval - INFO - Already found results from folder


+-------------+-----------+-------------------+
| Category    |   VES (%) |   total questions |
| challenging |   31.3255 |               100 |
+-------------+-----------+-------------------+
| overall     |   31.3255 |               100 |
+-------------+-----------+-------------------+
| moderate    |    0      |                 0 |
+-------------+-----------+-------------------+
| simple      |    0      |                 0 |
+-------------+-----------+-------------------+






100%|██████████| 100/100 [05:48<00:00,  3.48s/it]
2024-07-29 03:20:47,095 - text2sql-eval - INFO - all responses written to ./experiments/eval/prem_codellama-70b-instruct/predict_dev.json


=>  ./experiments/eval/prem_codellama-70b-instruct acc_challenging.json
+-------------+-------------------+-------------------+
| Category    |   num_correct (%) |   total questions |
| challenging |                18 |               100 |
+-------------+-------------------+-------------------+
| overall     |                18 |               100 |
+-------------+-------------------+-------------------+
| moderate    |                 0 |                 0 |
+-------------+-------------------+-------------------+
| simple      |                 0 |                 0 |
+-------------+-------------------+-------------------+


2024-07-29 03:29:44,426 - text2sql-eval - INFO - Already found results from folder


+-------------+-----------+-------------------+
| Category    |   VES (%) |   total questions |
| challenging |    18.237 |               100 |
+-------------+-----------+-------------------+
| overall     |    18.237 |               100 |
+-------------+-----------+-------------------+
| moderate    |     0     |                 0 |
+-------------+-----------+-------------------+
| simple      |     0     |                 0 |
+-------------+-----------+-------------------+






100%|██████████| 100/100 [11:42<00:00,  7.03s/it]
2024-07-29 03:41:27,409 - text2sql-eval - INFO - all responses written to ./experiments/eval/prem_claude-3-opus/predict_dev.json


=>  ./experiments/eval/prem_claude-3-opus acc_challenging.json
+-------------+-------------------+-------------------+
| Category    |   num_correct (%) |   total questions |
| challenging |                26 |               100 |
+-------------+-------------------+-------------------+
| overall     |                26 |               100 |
+-------------+-------------------+-------------------+
| moderate    |                 0 |                 0 |
+-------------+-------------------+-------------------+
| simple      |                 0 |                 0 |
+-------------+-------------------+-------------------+


2024-07-29 03:50:32,831 - text2sql-eval - INFO - Already found results from folder


+-------------+-----------+-------------------+
| Category    |   VES (%) |   total questions |
| challenging |   28.2806 |               100 |
+-------------+-----------+-------------------+
| overall     |   28.2806 |               100 |
+-------------+-----------+-------------------+
| moderate    |    0      |                 0 |
+-------------+-----------+-------------------+
| simple      |    0      |                 0 |
+-------------+-----------+-------------------+






100%|██████████| 100/100 [00:22<00:00,  4.35it/s]
2024-07-29 03:50:55,815 - text2sql-eval - INFO - all responses written to ./experiments/eval/prem_llama-3.1-405-instruct/predict_dev.json


=>  ./experiments/eval/prem_llama-3.1-405-instruct acc_challenging.json
+-------------+-------------------+-------------------+
| Category    |   num_correct (%) |   total questions |
| challenging |                 0 |               100 |
+-------------+-------------------+-------------------+
| overall     |                 0 |               100 |
+-------------+-------------------+-------------------+
| moderate    |                 0 |                 0 |
+-------------+-------------------+-------------------+
| simple      |                 0 |                 0 |
+-------------+-------------------+-------------------+
+-------------+-----------+-------------------+
| Category    |   VES (%) |   total questions |
| challenging |         0 |               100 |
+-------------+-----------+-------------------+
| overall     |         0 |               100 |
+-------------+-----------+-------------------+
| moderate    |         0 |                 0 |
+-------------+-----------+-----
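
The per-model tables above are spread across three runs, which makes them hard to compare at a glance. As a rough sketch, the snippet below copies the execution accuracy values printed above into a dictionary and ranks the models by their mean over the three difficulty levels. Each level used 100 questions, so the unweighted mean equals the overall accuracy on these 300 questions. The function name `rank_by_mean_accuracy` is illustrative and not part of the `text2sql` library.

```python
# Execution accuracy (%) copied from the tables printed above.
accuracy = {
    "gpt-4o": {"simple": 56, "moderate": 35, "challenging": 41},
    "gpt-4o-mini": {"simple": 42, "moderate": 28, "challenging": 34},
    "claude-3.5-sonnet": {"simple": 42, "moderate": 27, "challenging": 31},
    "codellama-70b-instruct": {"simple": 30, "moderate": 22, "challenging": 18},
    "claude-3-opus": {"simple": 42, "moderate": 27, "challenging": 26},
    "llama-3.1-405-instruct": {"simple": 0, "moderate": 0, "challenging": 0},
}


def rank_by_mean_accuracy(scores):
    """Return (model, mean accuracy) pairs sorted from best to worst."""
    means = {
        model: round(sum(levels.values()) / len(levels), 2)
        for model, levels in scores.items()
    }
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


ranking = rank_by_mean_accuracy(accuracy)
for model, mean in ranking:
    print(f"{model:<26} {mean:6.2f}")
```

With only 100 questions per difficulty level, small gaps between models (for example the three models tied at 42% on the simple set) should be read with caution.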