In [1]:
# cd ..

/root/anindya/Submission/text2sql/text2sql


## Pipelines

premsql pipelines stitch together the otherwise independent premsql components into a single end-to-end flow.

Here are some examples of what you can build with pipelines:

- Simple text-to-SQL generation pipeline (covered in this tutorial)
- RAG for databases
- Database analyser using premsql
- Custom evaluation pipeline for your existing text2sql models or agents
- Custom agentic pipelines built on top of databases

In short, pipelines let you reuse the existing tools and build end-to-end customizations on top of them. In this tutorial we are going to use the simple pipeline shipped with premsql (`premsql.pipelines.simple`).

Upcoming versions will add more sophisticated pipelines, including RAG and agents. In the meantime you can also build your own pipelines (more on that below).

Before getting started, you should have prior knowledge about the following topics:

1. [premsql datasets](/examples/datasets.ipynb)
2. [premsql generators](/examples/generators.ipynb)
3. [premsql evaluators](/examples/evaluation.ipynb)

Now let's get started by importing all the necessary packages. 

In [2]:
from premsql.pipelines.simple import SimpleText2SQLAgent
from premsql.generators.huggingface import Text2SQLGeneratorHF
from langchain_community.utilities.sql_database import SQLDatabase
from premsql.utils import convert_sqlite_path_to_dsn



Pipelines mainly serve as a natural-language interface to your databases: you ask a question and the pipeline returns an answer. The answer could be:

1. A dataframe
2. Only the SQL
3. Some analysis based on the dataframe and the question

and so on. The inputs to a pipeline are therefore a database connection and a generator.


In this example we use [LangChain's SQLDatabase](https://python.langchain.com/v0.2/docs/integrations/tools/sql_database/) for the DB connection. If you have a SQLite database, you can simply pass the path to the `.sqlite` file here.

We use a HuggingFace model generator. The model is [Prem-1B-SQL](https://huggingface.co/premai-io/prem-1B-SQL), which runs fully locally.
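As a rough illustration of what turning a `.sqlite` path into a connection string involves, the sketch below builds an SQLAlchemy-style SQLite URI (`sqlite:///` followed by the path), which is the form `SQLDatabase.from_uri` accepts for SQLite. The helper name `sqlite_path_to_uri` is hypothetical; it is not premsql's `convert_sqlite_path_to_dsn`, whose internals may differ.

```python
def sqlite_path_to_uri(path_or_uri: str) -> str:
    """Return an SQLAlchemy-style SQLite URI for a file path.

    Hypothetical helper for illustration only, not premsql's implementation.
    """
    prefix = "sqlite:///"
    if path_or_uri.startswith(prefix):
        return path_or_uri  # already a URI, leave it untouched
    return prefix + path_or_uri


print(sqlite_path_to_uri("data/california_schools.sqlite"))
# sqlite:///data/california_schools.sqlite
```

Note that an absolute path such as `/abs/db.sqlite` ends up with four slashes (`sqlite:////abs/db.sqlite`), which is the expected SQLAlchemy form.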

In [3]:
# Convert the SQLite file path into a DSN and open the connection
dsn_or_db_path = convert_sqlite_path_to_dsn(
    "../data/bird/test/test_databases/california_schools/california_schools.sqlite"
)
db = SQLDatabase.from_uri(dsn_or_db_path)

agent = SimpleText2SQLAgent(
    dsn_or_db_path=db,
    generator=Text2SQLGeneratorHF(
        model_or_name_or_path="premai-io/prem-1B-SQL",
        experiment_name="test_nli",
        device="cuda:0",
        type="test"
    ),
)

2024-09-06 20:04:40,025 - [GENERATOR] - INFO - Experiment folder found in: experiments/test/test_nli
Unrecognized keys in `rope_scaling` for 'rope_type'='linear': {'type'}


Loading checkpoint shards: 100%|██████████| 2/2 [00:06<00:00,  3.37s/it]
2024-09-06 20:04:47,613 - [SIMPLE-AGENT] - INFO - Everything set


Super simple, right? Now you can ask a question, and the response contains the following fields:

1. table: The resulting table.
2. error: The error message, if any.
3. sql: The SQL statement that was executed to produce the table.

Here is an example with a sample question:

In [4]:
response = agent.query(
    question="please list the phone numbers of the direct charter-funded schools that are opened after 2000/1/1",
)

response["table"]

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


| | Phone |
|---|---|
| 0 | None |
| 1 | (209) 229-4700 |
| 2 | (209) 253-1208 |
| 3 | (209) 365-4060 |
| 4 | (209) 368-4934 |
| ... | ... |
| 716 | (951) 672-2400 |
| 717 | (951) 678-5217 |
| 718 | (951) 824-1358 |
| 719 | (951) 926-6776 |


Here is the raw response. 

In [5]:
print(response)

{'table':               Phone
0              None
1    (209) 229-4700
2    (209) 253-1208
3    (209) 365-4060
4    (209) 368-4934
..              ...
716  (951) 672-2400
717  (951) 678-5217
718  (951) 824-1358
719  (951) 926-6776
720  (970) 258-0518

[721 rows x 1 columns], 'error': None, 'sql': "SELECT Phone FROM schools WHERE Charter = 1 AND OpenDate > '2000-01-01' AND FundingType = 'Directly funded' GROUP BY Phone"}


Inside the pipeline we use [execution guided decoding](/examples/generators.ipynb): the generated SQL is executed against the DB, and if it raises an error the pipeline retries until it gets a valid SQL statement (the maximum number of retries is set to 5).

Here is another example:
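The retry idea can be sketched with nothing but the standard library. The code below is a minimal illustration under stated assumptions, not premsql's actual implementation: `generate_with_retries` and `toy_generator` are hypothetical names, the "model" is a stub that fixes its SQL once it sees an error message, and the table is a tiny in-memory stand-in for `frpm`.

```python
import sqlite3
from typing import Callable, Optional


def generate_with_retries(
    conn: sqlite3.Connection,
    question: str,
    generate_sql: Callable[[str, Optional[str]], str],
    max_retries: int = 5,
) -> dict:
    """Generate SQL, execute it, and retry with the error message on failure."""
    sql, error = "", None
    for _ in range(max_retries):
        sql = generate_sql(question, error)  # the previous error is fed back in
        try:
            rows = conn.execute(sql).fetchall()
            return {"table": rows, "error": None, "sql": sql}
        except sqlite3.Error as exc:
            error = str(exc)  # remember the failure for the next attempt
    return {"table": [], "error": error, "sql": sql}


def toy_generator(question: str, last_error: Optional[str]) -> str:
    """Stub model: wrong column name first, corrected once an error is seen."""
    if last_error is None:
        return "SELECT max(High_Grade) FROM frpm"
    return 'SELECT max("High Grade") FROM frpm'


conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE frpm ("High Grade" INTEGER)')
conn.executemany("INSERT INTO frpm VALUES (?)", [(8,), (12,)])

result = generate_with_retries(conn, "what is the max high grade", toy_generator)
print(result["sql"], result["table"])
```

The first attempt fails with a "no such column" error, the error text is passed back to the generator, and the second attempt succeeds, so the loop exits well before the retry limit.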

In [6]:
agent.query(
    question="Among the schools with the SAT test takers of over 500, please list the schools that are magnet schools or offer a magnet program.",
    additional_knowledge="Magnet schools or offer a magnet program means that Magnet = 1"
)["table"]



| | School |
|---|---|
| 0 | Millikan High |
| 1 | Polytechnic High |
| 2 | Troy High |


Another example:

In [7]:
agent.query("list all the distinct CDSCode")['table']



| | CDSCode |
|---|---|
| 0 | 01100170109835 |
| 1 | 01100170112607 |
| 2 | 01100170118489 |
| 3 | 01100170123968 |
| 4 | 01100170124172 |
| ... | ... |
| 9981 | 58727516056832 |
| 9982 | 58727516056840 |
| 9983 | 58727516118806 |
| 9984 | 58727690123570 |


In [8]:
agent.query("what are the unique districts in schools and sorted")['table']



| | District |
|---|---|
| 0 | ABC Unified |
| 1 | Acalanes Union High |
| 2 | Ackerman Charter |
| 3 | Acton-Agua Dulce Unified |
| 4 | Adelanto Elementary |
| ... | ... |
| 1406 | Yreka Union Elementary |
| 1407 | Yreka Union High |
| 1408 | Yuba City Unified |
| 1409 | Yuba County Office of Education |


Sometimes the model hallucinates, since it is a very small model, and in those cases we get an error like the one below. The pipeline still returns a dataframe, so the response format stays consistent and downstream code does not break.

In [9]:
response = agent.query("what is the max high grade")
print(response)



2024-09-06 20:05:02,871 - [SIMPLE-AGENT] - INFO - => Going for final correction ...


{'table':                                                error
0  Error: (sqlite3.OperationalError) no such colu..., 'error': 'Error: (sqlite3.OperationalError) no such column: High_Grade\n[SQL: SELECT max(High_Grade) FROM frpm;]\n(Background on this error at: https://sqlalche.me/e/20/e3q8)', 'sql': 'SELECT max(High_Grade) FROM frpm;'}
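Because a failed query still returns a table, calling code may want to check the `error` field before using the result. The sketch below is one possible way to do that; `rows_or_raise` is a hypothetical helper, and the response dictionaries are hand-written stand-ins that only mirror the `table`, `error`, and `sql` keys shown above (a plain list replaces the dataframe).

```python
def rows_or_raise(response: dict):
    """Return response["table"] on success, raise if the pipeline reported an error."""
    if response.get("error") is not None:
        raise RuntimeError(
            f"Query failed for SQL {response.get('sql')!r}: {response['error']}"
        )
    return response["table"]


# Hand-written stand-ins for pipeline responses (not real pipeline output objects).
ok_response = {
    "table": [("Troy High",)],
    "error": None,
    "sql": "SELECT School FROM schools",
}
failed_response = {
    "table": [("Error: no such column: High_Grade",)],
    "error": "no such column: High_Grade",
    "sql": "SELECT max(High_Grade) FROM frpm;",
}

print(rows_or_raise(ok_response))
try:
    rows_or_raise(failed_response)
except RuntimeError as exc:
    print("handled:", exc)
```

Raising on failure is only one design choice; returning `None` or logging and continuing would work equally well, depending on how strict the caller needs to be.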


The model used above is very small and sometimes fails to generate a correct response. For those cases there is also a `correct_with_gpt` method (which runs internally) that corrects the failing SQL, to maximize the chances of getting error-free SQL.

To use this, you need a [premai-io](https://premai.io) account. You can get started [here](https://docs.premai.io) to create a new project and obtain a project_id and API key.

The final auto-correction with GPT only triggers when you provide the `premai_api_key` and `premai_project_id` parameters while instantiating the pipeline. Here is how it looks:

In [10]:
premai_api_key="Fqxxxxx-xxxxxx-xxxxx-xxxx" # Replace this
premai_project_id=1234 # Replace this 

agent_with_corrector = SimpleText2SQLAgent(
    dsn_or_db_path=db,
    generator=Text2SQLGeneratorHF(
        model_or_name_or_path="premai-io/prem-1B-SQL",
        experiment_name="test_nli",
        device="cuda:0",
        type="test"
    ),
    premai_api_key=premai_api_key,
    premai_project_id=premai_project_id
)

2024-09-06 20:05:04,760 - [GENERATOR] - INFO - Experiment folder found in: experiments/test/test_nli
Unrecognized keys in `rope_scaling` for 'rope_type'='linear': {'type'}
Loading checkpoint shards: 100%|██████████| 2/2 [00:05<00:00,  2.91s/it]
2024-09-06 20:05:11,380 - [SIMPLE-AGENT] - INFO - Everything set
2024-09-06 20:05:11,380 - [SIMPLE-AGENT] - INFO - Using gpt-4o as the final corrector


Now, asking the same question, we get the correct answer. You can also see an info message being logged, which says that GPT is being used for the final correction.

In [11]:
agent_with_corrector.query("what is the max high grade")["table"]



2024-09-06 20:05:14,629 - [SIMPLE-AGENT] - INFO - => Going for final correction ...


| | MAX(`High Grade`) |
|---|---|
| 0 | Post Secondary |


## Future Plans

Currently, local LLMs for text-to-SQL still do not have very good autonomous capabilities, so there is still some dependency on closed-source models. In upcoming versions we are going to replace that with fully local, autonomous, and reliable text-to-SQL pipelines.