
## Generators

premsql generators are responsible for producing SQL from a user's natural-language question. You can think of them as the inference API specific to text-to-SQL. Generators are modular: you can plug in any third-party API, model, or custom pipeline (more on this below).

This tutorial covers how to use the Hugging Face and PremAI providers to run local and hosted models for free. Lastly, we will show how you can write your own generators. Let's start by importing the required packages.

In [2]:
from premsql.generators import Text2SQLGeneratorHF
from premsql.datasets import Text2SQLDataset



### How Generators work

premsql generators provide two generation strategies. The first is simple generation, where we generate the SQL directly from the prompt (which contains the table schemas, the user question, few-shot examples, etc.).

The second strategy, which sometimes improves performance, is execution-guided decoding. The model generates a SQL query and executes it against the DB. If execution raises an error, that error is placed into a self-correction prompt and the model generates again, until the query succeeds or the maximum number of retries is reached.

We will show both below. Let's start with simple generation. We will use the BirdBench dataset for this example (the code below loads 10 rows from its `train` split).

In [6]:
bird_dataset = Text2SQLDataset(
    dataset_name='bird', split="train", force_download=False,
    dataset_folder="/root/anindya/text2sql/data"
).setup_dataset(num_rows=10)

2024-09-09 12:34:11,944 - [BIRD-DATASET] - INFO - Loaded Bird Dataset
2024-09-09 12:34:11,946 - [BIRD-DATASET] - INFO - Setting up Bird Dataset
Applying prompt: 100%|██████████| 10/10 [00:00<00:00, 3060.42it/s]


The input to the generator is not just a prompt but a `data_blob`, which should contain the following information:

- `prompt`: The prompt to pass to the model
- `db_path`: The path to the database

If you have these two pieces of information, you can use the generators for inference on your own data. Make sure the prompt contains the schemas of all tables in the DB. Now let's define our generator. We will use [Prem-1B-SQL](https://huggingface.co/premai-io/prem-1B-SQL) for this experiment.
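
If you are working with your own SQLite database, a `data_blob` can be assembled by hand. The sketch below is only an illustration using the standard library: the helper name `build_data_blob`, the demo table, and the prompt layout are assumptions made for this example and are not premsql's own prompt template.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def build_data_blob(db_path: str, question: str) -> dict:
    """Read the CREATE TABLE statements from a SQLite file and wrap them in a prompt."""
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT sql FROM sqlite_master WHERE type = 'table' AND sql IS NOT NULL"
        ).fetchall()
    schema = "\n\n".join(row[0] for row in rows)
    # Hypothetical prompt layout, chosen for illustration only.
    prompt = f"# Database schema:\n{schema}\n\n# Question: {question}\n\n# SQL:"
    return {"prompt": prompt, "db_path": db_path}


# Demo on a throwaway database (table and column names are illustrative).
demo_db = os.path.join(tempfile.mkdtemp(), "demo.sqlite")
with closing(sqlite3.connect(demo_db)) as conn:
    conn.execute("CREATE TABLE movies (movie_title TEXT, movie_release_year INTEGER)")
    conn.commit()

data_blob = build_data_blob(demo_db, "Which movies were released in 1945?")
print(data_blob["prompt"])
```

The resulting dictionary has the two keys listed above, so it has the shape the generator expects; a real prompt would normally also carry instructions and few-shot examples.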

In [4]:
generator = Text2SQLGeneratorHF(
    model_or_name_or_path="premai-io/prem-1B-SQL",
    experiment_name="test_generators",
    device="cuda:0",
    type="test"
)

2024-09-09 12:33:37,338 - [GENERATOR] - INFO - Experiment folder found in: experiments/test/test_generators
Unrecognized keys in `rope_scaling` for 'rope_type'='linear': {'type'}
Loading checkpoint shards: 100%|██████████| 2/2 [00:06<00:00,  3.05s/it]


`Text2SQLGeneratorHF` uses Hugging Face Transformers internally. You instantiate the class with an `experiment_name` and a `type`. A folder `./experiments/<type>/<experiment_name>` is created in your current directory (you can change that directory by passing a path to the `experiment_folder` argument).

These folders store the generation and evaluation results, so you do not need to regenerate them every time; they are cached inside the experiment directory. Now let's generate a result for a single datapoint.

In [8]:
sample = bird_dataset[0]

response = generator.generate(
    data_blob={
        "prompt": sample["prompt"],
    },
    temperature=0.1,
    max_new_tokens=256
)

print(response)



SELECT movie_title FROM movies WHERE movie_release_year = 1945 ORDER BY movie_popularity DESC LIMIT 1;


The `generate` method produces a single response and does not save anything. Now let's generate for multiple questions and save the results.

In [10]:
responses = generator.generate_and_save_results(
    dataset=bird_dataset,
    temperature=0.1,
    max_new_tokens=256
)

print(responses)

Generating result ...: 100%|██████████| 10/10 [00:16<00:00,  1.69s/it]
2024-09-09 12:36:28,803 - [GENERATOR] - INFO - All responses are written to: experiments/test/test_generators


[{'db_id': 'movie_platform', 'question': 'Name movie titles released in year 1945. Sort the listing by the descending order of movie popularity.', 'evidence': 'released in the year 1945 refers to movie_release_year = 1945;', 'SQL': 'SELECT movie_title FROM movies WHERE movie_release_year = 1945 ORDER BY movie_popularity DESC LIMIT 1', 'db_path': '/root/anindya/text2sql/data/bird/train/train_databases/movie_platform/movie_platform.sqlite', 'prompt': '\n# Follow these instruction:\nYou will be given schemas of tables of a database. Your job is to write correct\nerror free SQL query based on the question asked. Please make sure:\n\n1. Do not add ``` at start / end of the query. It should be a single line query in a  single line (string format)\n2. Make sure the column names are correct and exists in the table\n3. For column names which has a space with it, make sure you have put `` in that column name\n4. Think step by step and always check schema and question and the column names before 

This saves the results in the `experiment_path` folder, in a file named `predict.json`. The next time you use a generator with the same experiment path, you do not need to run the generations again; it returns the cached results.

However, if you still want to force regeneration, just add `force=True` to the method call.

```python
responses = generator.generate_and_save_results(
    dataset=bird_dataset,
    temperature=0.1,
    max_new_tokens=256,
    force=True  # regenerate even if cached results already exist
)
```
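
To make the caching behaviour concrete, here is a rough standard-library sketch of the idea: look for `predict.json` in the experiment folder, return it if present, and regenerate only when it is missing or `force=True`. The function `generate_with_cache`, the stub model, and the `"generated"` key are hypothetical names used for illustration; this is not premsql's actual implementation.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def generate_with_cache(
    experiment_path: Path,
    prompts: list[str],
    generate_fn: Callable[[str], str],
    force: bool = False,
) -> list[dict]:
    """Return cached results from predict.json unless it is missing or force=True."""
    cache_file = experiment_path / "predict.json"
    if cache_file.exists() and not force:
        return json.loads(cache_file.read_text())
    results = [{"prompt": p, "generated": generate_fn(p)} for p in prompts]
    experiment_path.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(results, indent=2))
    return results


# A stub standing in for a real model; it records every call it receives.
calls = []

def fake_model(prompt: str) -> str:
    calls.append(prompt)
    return "SELECT 1;"


exp_dir = Path(tempfile.mkdtemp()) / "test" / "test_generators"
first = generate_with_cache(exp_dir, ["q1", "q2"], fake_model)               # runs the model
second = generate_with_cache(exp_dir, ["q1", "q2"], fake_model)              # served from cache
third = generate_with_cache(exp_dir, ["q1", "q2"], fake_model, force=True)   # runs the model again
print(len(calls))
```

Only the first and the forced call reach the model; the second call is answered from disk.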

### Execution Guided Decoding / Generation

This is an additional method available in generators that sometimes improves results by 2-3%. The workflow is simple: it performs the same generation as before, but now it also executes the SQL against the DB (since we provide the DB path or DSN) and collects the result. If the result is an error, it injects the error into an [error prompt template](/premsql/datasets/prompts.py) and generates again, until the query either executes successfully or the maximum number of retries is reached.
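
The retry loop described above can be sketched with the standard library alone. This is a simplified illustration under stated assumptions: `run_sql`, `execution_guided_generate`, the stub model, and the correction-prompt layout are all hypothetical and do not reproduce premsql's internals or its error prompt template.

```python
import sqlite3
from typing import Callable, Optional


def run_sql(conn: sqlite3.Connection, sql: str) -> Optional[str]:
    """Execute the query; return None on success or the error message on failure."""
    try:
        conn.execute(sql).fetchall()
        return None
    except sqlite3.Error as exc:
        return str(exc)


def execution_guided_generate(
    conn: sqlite3.Connection,
    prompt: str,
    generate_fn: Callable[[str], str],
    max_retries: int = 5,
) -> str:
    """Generate SQL, then regenerate with the error message while execution fails."""
    sql = generate_fn(prompt)
    for _ in range(max_retries):
        error = run_sql(conn, sql)
        if error is None:
            break
        # Hypothetical self-correction prompt: original prompt + failed SQL + error.
        correction_prompt = (
            f"{prompt}\n\n# Previous SQL:\n{sql}\n# Error:\n{error}\n# Corrected SQL:"
        )
        sql = generate_fn(correction_prompt)
    # If every retry failed, the last attempt is returned unverified.
    return sql


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (movie_title TEXT, movie_release_year INTEGER)")


def stub_model(prompt: str) -> str:
    """Stand-in for a model: wrong column at first, fixed once it sees an error."""
    if "# Error:" in prompt:
        return "SELECT movie_title FROM movies WHERE movie_release_year = 1945"
    return "SELECT title FROM movies WHERE movie_release_year = 1945"


final_sql = execution_guided_generate(conn, "List movies released in 1945.", stub_model)
print(final_sql)
```

Note that a query which executes without error is not necessarily a correct answer to the question; the loop only removes queries that fail to run.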

To use execution-guided generation you need an executor. An executor executes the SQL against the database. You can learn more about executors [here](/examples/evaluation.ipynb).

For this tutorial, let's use `SQLiteExecutor` as our executor. We define the executor and then pass it to the generator's `generate_and_save_results` method.

In [11]:
from premsql.executors import SQLiteExecutor

executor = SQLiteExecutor()
response = generator.generate_and_save_results(
    dataset=bird_dataset,
    temperature=0.1,
    max_new_tokens=256,
    force=True,
    executor=executor,
    max_retries=5 # this is optional (default is already set to 5)
)

Generating result ...: 100%|██████████| 10/10 [00:42<00:00,  4.24s/it]
2024-09-09 12:38:08,799 - [GENERATOR] - INFO - All responses are written to: experiments/test/test_generators


From here the workflow is the same, this time with responses produced by execution-guided decoding. Let's evaluate them.

In [13]:
from premsql.evaluator import Text2SQLEvaluator

evaluator = Text2SQLEvaluator(
    executor=executor,
    experiment_path=generator.experiment_path
)

results = evaluator.execute(
    metric_name="accuracy",
    model_responses=response,
    filter_by="db_id",
    meta_time_out=10
)

100%|██████████| 10/10 [00:30<00:00,  3.04s/it]
2024-09-09 13:43:22,145 - [UTILS] - INFO - Saved JSON in: experiments/test/test_generators/accuracy.json
2024-09-09 13:43:22,147 - [UTILS] - INFO - Saved JSON in: experiments/test/test_generators/predict.json
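
For intuition about what an execution-based `accuracy` metric grouped by `db_id` measures, here is a small standard-library sketch. It assumes that a prediction counts as correct when its result rows match those of the gold query as a set (so row order and duplicates are ignored); the helper names and the `"generated"` key are hypothetical, and premsql's evaluator may differ in its details.

```python
import sqlite3
from collections import defaultdict


def execution_match(conn: sqlite3.Connection, predicted_sql: str, gold_sql: str) -> bool:
    """True if both queries run and return the same set of rows."""
    try:
        predicted = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to execute counts as wrong
    gold = conn.execute(gold_sql).fetchall()
    return set(predicted) == set(gold)


def accuracy_by_db(records: list[dict], connections: dict) -> dict:
    """Percentage of matching predictions, grouped by db_id."""
    hits = defaultdict(list)
    for record in records:
        conn = connections[record["db_id"]]
        hits[record["db_id"]].append(
            execution_match(conn, record["generated"], record["SQL"])
        )
    return {db_id: 100.0 * sum(flags) / len(flags) for db_id, flags in hits.items()}


# Illustrative in-memory database and two hand-written records.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (movie_title TEXT, movie_release_year INTEGER)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?)",
    [("Brief Encounter", 1945), ("Detour", 1945), ("Casablanca", 1942)],
)

records = [
    {
        "db_id": "movie_platform",
        "SQL": "SELECT movie_title FROM movies WHERE movie_release_year = 1945",
        "generated": "SELECT movie_title FROM movies WHERE movie_release_year = 1945 ORDER BY movie_title",
    },
    {
        "db_id": "movie_platform",
        "SQL": "SELECT COUNT(*) FROM movies",
        "generated": "SELECT COUNT(*) FROM films",  # table does not exist
    },
]

scores = accuracy_by_db(records, {"movie_platform": conn})
print(scores)
```

The first prediction matches the gold rows and the second fails to execute, so one of the two records for this database counts as correct.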
