To use this pipeline, follow these steps:

1. Create a data folder.
2. Place your .sqlite files in the data folder. You can copy a whole directory in; the script will locate the files.
3. Run the Create_Schema.py script to extract the SQLite file paths.
4. Use the extracted paths with the SQLite_Connector class to interact with your databases.
5. Follow the usage examples in the Demo_notebook.ipynb file.
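
The file-discovery part of these steps can be pictured with the sketch below. It is not the code in Create_Schema; `find_sqlite_files` is a hypothetical helper, and it assumes each database sits at `data/<name>/<name>.sqlite`, as in the output of cell 1.

```python
from pathlib import Path


def find_sqlite_files(data_dir: str = "data") -> dict[str, str]:
    """Map each database name to its .sqlite path (hypothetical helper)."""
    root = Path(data_dir)
    paths = {}
    # Walk the data folder recursively so nested directories are picked up
    for file in sorted(root.rglob("*.sqlite")):
        # Use the file name without its extension as the database name
        paths[file.stem] = file.as_posix()
    return paths
```

Calling `find_sqlite_files("data")` on a populated folder should give a name-to-path mapping of the same shape as the one printed by `extract_sql_file_paths`.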

Future plans:

- Extract database schema information
- Implement support for other SQL databases (e.g., MySQL, PostgreSQL)
- Optimize SQL query execution by generating a compact version of the schema information for AI agents.
- Build AI agent capabilities to scan for errors or to create vector embeddings of more detailed schema information.
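
As an illustration of what a compact schema could look like, here is a minimal sketch. The one-line-per-table format and the `compact_schema` name are assumptions, not something the project defines; the input is assumed to follow the `{"tables": {...}}` layout that this notebook prints for the cinema database.

```python
def compact_schema(schema: dict) -> str:
    """Render a schema dict as one line per table (hypothetical format)."""
    lines = []
    for table, info in schema["tables"].items():
        pk = set(info.get("primary_key", []))
        # Mark primary-key columns with a trailing asterisk
        cols = [f"{c}*" if c in pk else c for c in info["columns"]]
        line = f"{table}({', '.join(cols)})"
        # Append foreign keys as from_column->ref_table.ref_column
        fks = [
            f"{fk['from_column']}->{fk['ref_table']}.{fk['ref_column']}"
            for fk in info.get("foreign_keys", [])
        ]
        if fks:
            line += " FK: " + ", ".join(fks)
        lines.append(line)
    return "\n".join(lines)


example = {
    "tables": {
        "cinema": {
            "columns": ["Cinema_ID", "Name"],
            "primary_key": ["Cinema_ID"],
            "foreign_keys": [],
        }
    }
}
print(compact_schema(example))  # cinema(Cinema_ID*, Name)
```

A text form like this uses far fewer tokens than the nested JSON, at the cost of dropping the explicit key names.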

=> The connector class can support all query actions for SQLite databases.

# REASONING IN THIS NOTEBOOK REQUIRES A VERIFICATION STEP ON OPENAI'S PLATFORM.
https://help.openai.com/en/articles/10910291-api-organization-verification


In [1]:
import Create_Schema
import json
# Remember to put the databases in the data folder first
sql_file_paths = Create_Schema.extract_sql_file_paths(save_json=True)
print(sql_file_paths)

{
    "academic": "data/academic/academic.sqlite",
    "activity_1": "data/activity_1/activity_1.sqlite",
    "aircraft": "data/aircraft/aircraft.sqlite",
    "allergy_1": "data/allergy_1/allergy_1.sqlite",
    "apartment_rentals": "data/apartment_rentals/apartment_rentals.sqlite",
    "architecture": "data/architecture/architecture.sqlite",
    "assets_maintenance": "data/assets_maintenance/assets_maintenance.sqlite",
    "baseball_1": "data/baseball_1/baseball_1.sqlite",
    "battle_death": "data/battle_death/battle_death.sqlite",
    "behavior_monitoring": "data/behavior_monitoring/behavior_monitoring.sqlite",
    "bike_1": "data/bike_1/bike_1.sqlite",
    "body_builder": "data/body_builder/body_builder.sqlite",
    "book_2": "data/book_2/book_2.sqlite",
    "browser_web": "data/browser_web/browser_web.sqlite",
    "candidate_poll": "data/candidate_poll/candidate_poll.sqlite",
    "car_1": "data/car_1/car_1.sqlite",
    "chinook_1": "data/chinook_1/chinook_1.sqlite",
    "cinema": "

In [2]:
# Extract schema information from the database
schema_cinema = Create_Schema.schema_extractor(sql_file_paths, db_name="cinema", save_json=True)

print("Schema information:", schema_cinema)

Schema information: {
    "tables": {
        "film": {
            "columns": [
                "Film_ID",
                "Rank_in_series",
                "Number_in_season",
                "Title",
                "Directed_by",
                "Original_air_date",
                "Production_code"
            ],
            "primary_key": [
                "Film_ID"
            ],
            "foreign_keys": []
        },
        "cinema": {
            "columns": [
                "Cinema_ID",
                "Name",
                "Openning_year",
                "Capacity",
                "Location"
            ],
            "primary_key": [
                "Cinema_ID"
            ],
            "foreign_keys": []
        },
        "schedule": {
            "columns": [
                "Cinema_ID",
                "Film_ID",
                "Date",
                "Show_times_per_day",
                "Price"
            ],
            "primary_key": [
                "Cin

In [3]:
# Create database name JSON
db_names = Create_Schema.create_names_json(sql_file_paths, save_json=True)
# Create the combined schema
combined_schema = Create_Schema.create_combined_schema(sql_file_paths, save_json=True)

100%|██████████| 166/166 [00:00<00:00, 122107.06it/s]


Database names extracted to db_names.json.


100%|██████████| 166/166 [00:00<00:00, 489.63it/s]

Combined schema saved to combined_schema.json





In [4]:
# Extract schema by database name from local JSON files
schema_cinema = Create_Schema.schema_from_json_file("schema_data/combined_schema.json", db_name="cinema", save_json=False)
schema_cinema

{'tables': {'film': {'columns': ['Film_ID',
    'Rank_in_series',
    'Number_in_season',
    'Title',
    'Directed_by',
    'Original_air_date',
    'Production_code'],
   'primary_key': ['Film_ID'],
   'foreign_keys': []},
  'cinema': {'columns': ['Cinema_ID',
    'Name',
    'Openning_year',
    'Capacity',
    'Location'],
   'primary_key': ['Cinema_ID'],
   'foreign_keys': []},
  'schedule': {'columns': ['Cinema_ID',
    'Film_ID',
    'Date',
    'Show_times_per_day',
    'Price'],
   'primary_key': ['Cinema_ID', 'Film_ID'],
   'foreign_keys': [{'from_column': 'Cinema_ID',
     'ref_table': 'cinema',
     'ref_column': 'Cinema_ID'},
    {'from_column': 'Film_ID',
     'ref_table': 'film',
     'ref_column': 'Film_ID'}]}}}
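
A schema dictionary of this shape can be built from SQLite's own metadata. The sketch below is an assumption about how such an extractor might work, not the code behind `schema_extractor`; `extract_schema` is a hypothetical name, and it relies only on the standard `sqlite3` module and the `table_info` and `foreign_key_list` pragmas.

```python
import sqlite3


def extract_schema(conn: sqlite3.Connection) -> dict:
    """Build a {'tables': {...}} dict from an open SQLite connection."""
    schema = {"tables": {}}
    tables = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
    ).fetchall()
    for (table,) in tables:
        # table_info rows: (cid, name, type, notnull, dflt_value, pk)
        info = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        # pk is 0 for non-key columns, else the 1-based position in the key
        pk_cols = sorted((row for row in info if row[5] > 0), key=lambda r: r[5])
        # foreign_key_list rows: (id, seq, table, from, to, on_update, ...)
        fks = conn.execute(f'PRAGMA foreign_key_list("{table}")').fetchall()
        schema["tables"][table] = {
            "columns": [row[1] for row in info],
            "primary_key": [row[1] for row in pk_cols],
            "foreign_keys": [
                {"from_column": fk[3], "ref_table": fk[2], "ref_column": fk[4]}
                for fk in fks
            ],
        }
    return schema
```

With a file on disk, the connection would come from `sqlite3.connect(path)` using one of the paths extracted in cell 1.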

In [5]:
from SQL_Connector import SQLite_Connector

# Create the connector (for multiple queries on multiple tables)
connector = SQLite_Connector(sql_file_paths)

def connect_and_query(data_name: str, query: str):
    connector.connect(data_name)
    return connector.execute_queries([query])

# Connect to the database
data_name = "cinema"
query = "SELECT * FROM cinema;"
results = connect_and_query(data_name, query)
print("Query results:", results)

Connecting to SQLite database at: data\cinema\cinema.sqlite
Connection successful.
Query results: [
    [
        {
            "Cinema_ID": 1,
            "Name": "Codling",
            "Openning_year": 2010,
            "Capacity": 1100,
            "Location": "County Wicklow"
        },
        {
            "Cinema_ID": 2,
            "Name": "Carrowleagh",
            "Openning_year": 2012,
            "Capacity": 368,
            "Location": "County Cork"
        },
        {
            "Cinema_ID": 3,
            "Name": "Dublin Array",
            "Openning_year": 2015,
            "Capacity": 364,
            "Location": "County Dublin"
        },
        {
            "Cinema_ID": 4,
            "Name": "Glenmore",
            "Openning_year": 2009,
            "Capacity": 305,
            "Location": "County Clare"
        },
        {
            "Cinema_ID": 5,
            "Name": "Glenough",
            "Openning_year": 2010,
            "Capacity": 325,
            "Lo

In [6]:
# Create the schema with tables and columns (TESTING)
db_names_test = Create_Schema.create_names_json_test(combined_schema, save_json=True)

In [7]:
# AI AGENTS DEMONSTRATION ON HOW TO USE THE SCHEMA
from AI_Agents import Agent_A, Agent_B
previous_reasoning = []
# Agent A: User query to select database names
user_query = "How many singers do we have?" # From Spider test set
Agent_A_Response = Agent_A(user_query)

In [8]:
def reasoning_extractor(resp):
    # Collect the reasoning summary text from the response, split into chunks on blank lines
    results = []
    for item in resp.output:
        if getattr(item, "type", None) == "reasoning":
            if getattr(item, "summary", None):
                for s in item.summary:
                    results.extend(s.text.split("\n\n"))
    return results
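
To see what `reasoning_extractor` returns without calling the API, the sketch below feeds it a stand-in response. The stub is an assumption: it only mimics the attributes the function reads (`output`, `type`, `summary`, `text`), and the function is repeated so the snippet runs on its own.

```python
from types import SimpleNamespace


def reasoning_extractor(resp):
    # Same logic as the cell above, repeated so the sketch is self-contained
    results = []
    for item in resp.output:
        if getattr(item, "type", None) == "reasoning" and getattr(item, "summary", None):
            for s in item.summary:
                results.extend(s.text.split("\n\n"))
    return results


# Stand-in for an API response: one reasoning item and one message item
fake_resp = SimpleNamespace(
    output=[
        SimpleNamespace(
            type="reasoning",
            summary=[SimpleNamespace(text="**Title**\n\nBody paragraph.")],
        ),
        SimpleNamespace(type="message"),
    ]
)

chunks = reasoning_extractor(fake_resp)
print(chunks)  # ['**Title**', 'Body paragraph.']
```

Items without a `summary`, such as the message item here, are skipped, so the final answer text never ends up in the list.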

In [9]:
from IPython.display import display, Markdown, HTML
Agent_A_results = reasoning_extractor(Agent_A_Response)
for result in Agent_A_results:
    # Keep each reasoning chunk for Agent C and render it as Markdown
    previous_reasoning.append(result)
    display(Markdown(result))
Agent_A_output = json.loads(Agent_A_Response.output_text)
print("\nPossible tables are:", Agent_A_output)

**Selecting database names**

I’m thinking about how to respond to the query: “How many singers do we have?” I need to pick relevant database names from the list. The clear choice is the “singer” database since it’s directly related. The “concert_singer” database looks good too, as it likely contains information about singers. “Orchestra” might have some singers but typically focuses on musicians. Other names like “tvshow” and “wedding” don’t seem relevant at all. I’ll focus on including “singer” and “concert_singer” for the JSON output.

**Choosing plausible databases**

I need to focus on selecting the most plausible databases for the question, “How many singers do we have?” The clear top choice is the “singer” database, followed by “concert_singer.” Both contain relevant information like singer names. Since the request lacks context, I should include both. I should avoid including “music” databases since their relevance is uncertain, though “music_4” may have some connections to singers. Overall, I’m sticking with “singer” and “concert_singer” to be safe!

**Finalizing database selection**

I’m considering whether “tvshow” might contain singers, but that seems unlikely. “Orchestra” typically focuses more on performers and concerts, so it may not include singers either. I’ll stick with the strong choices: “singer” and “concert_singer.” I could ponder including some of the “music” databases, but I don’t want to overcomplicate things. So, I’ll limit my output to just the two main databases. I need to check the formatting too; it should be a pure JSON object with the right structure!

**Deciding on databases**

I’m thinking about which databases to choose for the query, “How many singers do we have?” It seems like “performance_attendance” might not provide useful singer details, so I’ll skip that one. However, both “singer” and “concert_singer” are strong options. I should list them in order with “singer” first, followed by “concert_singer.” The output needs to be a JSON format, and I have to ensure nothing else is included. I’ll get this done!

**Selecting databases**

In this scenario, I don't need multiple databases to answer the query. I’ll primarily go with “singer” as my top choice. I think it makes sense to include “concert_singer” as an additional option, so I’ll have both in my output. My final JSON object will look like this:

{
  "db_names": ["singer", "concert_singer"]
}

Alright, that seems clear and straightforward! I'll proceed with that.


Possible tables are: {'db_names': ['singer', 'concert_singer']}
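
Because Agent A's answer is model-generated text, it can help to check it before using it as a lookup key. The following is a minimal sketch under that assumption; `parse_db_names` is a hypothetical helper, and the set of known names is assumed to come from the keys of `sql_file_paths`.

```python
import json


def parse_db_names(output_text: str, known_names: set[str]) -> list[str]:
    """Return the db names from Agent A's JSON that exist locally."""
    try:
        payload = json.loads(output_text)
    except json.JSONDecodeError:
        return []
    if not isinstance(payload, dict):
        return []
    names = payload.get("db_names", [])
    if not isinstance(names, list):
        return []
    # Keep the agent's ordering, drop unknown or duplicate names
    seen = set()
    valid = []
    for name in names:
        if isinstance(name, str) and name in known_names and name not in seen:
            seen.add(name)
            valid.append(name)
    return valid
```

In this notebook it could be called as `parse_db_names(Agent_A_Response.output_text, set(sql_file_paths))`.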


In [10]:
from tqdm import tqdm
# Agent B: Generate SQL queries based on the selected database names
temp_schemas = Create_Schema.schema_from_json_names(Agent_A_output, "schema_data/combined_schema.json", save_json=True)
temp_schemas = json.loads(temp_schemas)
SQL_list = []
print("User query:", user_query)
for schema in tqdm(temp_schemas):
    # Run on each database schema to generate SQL queries
    print("\nGenerating SQL query for schema:", schema)
    Agent_B_Response = Agent_B(user_query, temp_schemas[schema])
    # Extract reasoning results from Agent B's response
    Agent_B_results = reasoning_extractor(Agent_B_Response)
    for result in Agent_B_results:
        previous_reasoning.append(result)
        display(Markdown(result))
    # Save the final responses
    print("\nGenerated SQL query:", Agent_B_Response.output_text)
    SQL_list.append(Agent_B_Response.output_text)

# Display the generated SQL queries
print("\nGenerated SQL queries:")
for sql in SQL_list:
    display(Markdown(sql))

User query: How many singers do we have?


  0%|          | 0/2 [00:00<?, ?it/s]


Generating SQL query for schema: singer


**Providing SQL query**

I need to produce an SQL query based on the schema with two tables: singer and song. To count the number of singers, I can either count distinct Singer_IDs or simply count all rows in the singer table. The query that works is: 

SELECT COUNT(*) AS singer_count FROM singer;

I want to ensure there’s no extra formatting, just the SQL code. I think I’ve got it right! No edge cases to worry about here.

 50%|█████     | 1/2 [00:04<00:04,  4.07s/it]


Generated SQL query: SELECT COUNT(*) AS singer_count
FROM singer;

Generating SQL query for schema: concert_singer


**Generating SQL query**

I need to create a SQL query based on the provided schema to answer the user’s question: "How many singers do we have?" It's quite straightforward. The schema includes a singer table with a primary key, Singer_ID. My final query will be: "SELECT COUNT(*) AS total_singers FROM singer;" This keeps it simple and meets the user's request without additional explanation. Plus, the primary key ensures there are no duplicates, so I'm confident in using COUNT(*).

**Creating a plain SQL query**

I need to ensure the SQL query is straightforward and not overly formatted. It's important to avoid any ambiguity in the output, which is a good practice! By focusing on a plain query, I’ll count all entries with COUNT(*), as that's perfectly acceptable here. Let’s go ahead and finalize that query without any complications or extra steps. Just a clean and simple output should do the trick!

100%|██████████| 2/2 [00:13<00:00,  6.67s/it]


Generated SQL query: SELECT COUNT(*) AS total_singers
FROM singer;

Generated SQL queries:





SELECT COUNT(*) AS singer_count
FROM singer;

SELECT COUNT(*) AS total_singers
FROM singer;
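
Generated SQL is untrusted input, so one cautious option is to run it against a read-only connection. The sketch below is an assumption about how that could be done with the standard `sqlite3` module; `run_read_only` is a hypothetical helper and is not part of SQLite_Connector.

```python
import sqlite3
from pathlib import Path


def run_read_only(db_path: str, sql: str) -> list[dict]:
    """Run one statement against a database opened in read-only mode."""
    # mode=ro makes SQLite reject any statement that would modify the file
    uri = f"{Path(db_path).resolve().as_uri()}?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        # sqlite3.Row lets each row be converted to a column-name dict
        conn.row_factory = sqlite3.Row
        rows = conn.execute(sql).fetchall()
        return [dict(row) for row in rows]
    finally:
        conn.close()
```

A `SELECT` runs as usual, while an `INSERT`, `UPDATE`, or `DROP` raises `sqlite3.OperationalError` instead of changing the data.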

In [11]:
sql_results = []
# Execute the generated SQL queries
for schema, sql in tqdm(zip(temp_schemas, SQL_list)):
    print(f"\nExecuting SQL query on schema '{schema}': {sql}")
    result = connect_and_query(schema, sql)
    sql_results.append(result)
    print(result)

2it [00:00, 2001.10it/s]


Executing SQL query on schema 'singer': SELECT COUNT(*) AS singer_count
FROM singer;
Connecting to SQLite database at: data\singer\singer.sqlite
Connection successful.
[
    [
        {
            "singer_count": 8
        }
    ]
]

Executing SQL query on schema 'concert_singer': SELECT COUNT(*) AS total_singers
FROM singer;
Connecting to SQLite database at: data\concert_singer\concert_singer.sqlite
Connection successful.
[
    [
        {
            "total_singers": 6
        }
    ]
]
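
The nested shape of these results (one inner list of row dictionaries per query) can be reproduced with a few lines of `sqlite3`. The class below is a hypothetical stand-in written for illustration; it is not the SQLite_Connector implementation, which may differ, for example by returning a JSON string.

```python
import sqlite3


class MiniConnector:
    """Hypothetical stand-in for SQLite_Connector, for illustration only."""

    def __init__(self, paths: dict[str, str]):
        self.paths = paths
        self.conn = None

    def connect(self, name: str) -> None:
        # Close any previous connection before switching databases
        if self.conn is not None:
            self.conn.close()
        self.conn = sqlite3.connect(self.paths[name])
        # sqlite3.Row lets each row be converted to a column-name dict
        self.conn.row_factory = sqlite3.Row

    def execute_queries(self, queries: list[str]) -> list[list[dict]]:
        if self.conn is None:
            raise RuntimeError("call connect() first")
        # One inner list of row dicts per query, in the order given
        return [
            [dict(row) for row in self.conn.execute(q).fetchall()]
            for q in queries
        ]
```

Keeping one connector and switching databases through `connect` matches how `connect_and_query` is used in this notebook.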





In [13]:
from AI_Agents import Agent_C
# Flatten the collected reasoning into one string without newlines, braces, or brackets
reasoning_text = " ".join(previous_reasoning).replace("\n", " ")
reasoning_text = reasoning_text.translate(str.maketrans("", "", "{}[]"))
sql_results = str(sql_results)
Agent_C_Response = Agent_C(user_query, sql_result=sql_results, previous_reasoning=reasoning_text)

In [14]:
# Render the final reasoning response from Agent C
Agent_C_results = reasoning_extractor(Agent_C_Response)
for result in Agent_C_results:
    display(Markdown(result))

**Interpreting developer instructions**

I need to carefully parse the developer instruction regarding the SQL query. The goal is to determine if the SQL result can be summarized or not. If it can be summarized, I'll provide a summary; if not, I’ll return the original result. If nothing is found, I’ll return "NOT_FOUND." It’s clear that my final output must be a string containing either the summary or the original SQL result, preceded by an explanation of my reasoning.

**Analyzing SQL query results**

The final output needs to be a string that either summarizes the SQL query result or returns the original SQL. I see that the SQL query result is an array of two strings, each containing stringified JSON. Each string has a single object with either "singer_count": 8 or "total_singers": 6. This indicates two separate queries—one from each database, "singer" and "concert_singer." Since the counts conflict, I need to consider how to summarize this accurately.

**Determining SQL result summarization**

The directive is clear: I need to assess if I can summarize the SQL query result. If possible, I'll create a summary; if not, I return the original result or "NOT_FOUND." Since there are multiple results, I usually summarize by indicating counts. But, I'm unsure if the results are separate or duplicates. The goal seems to be presenting a summary like "singer_count: 8; total_singers: 6." The keys indicate naming but lack explicit database labels.

**Summarizing singer counts**

I can summarize the results by saying there are 8 singers from the first query and 6 from the second. A clearer way to put it could be: "Counts differ across databases: 8 in singer and 6 in concert_singer." However, I lack a direct mapping for which count belongs to which database. It seems both queries targeted the "singer" table. This suggests that both queries were essentially the same but used different aliases for clarity.

**Clarifying output structure**

The results show counts of 8 and 6, likely from different databases. I need to summarize this information and I believe it's possible. A clear summary could be: "Counts returned: 8 and 6." However, I must also provide an explanation first, stating that these counts come from different queries. 

So I'll deliver the explanation briefly, followed by a summary string. Since clarity is essential, I'll format it with a line break between the two sections to ensure readability. This keeps in mind the possibility of further parsing in the app.

**Crafting the summary output**

I need to provide a clear explanation about the SQL results, which come from two different databases. I can summarize them by listing both counts: "8 (singer_count)" and "6 (total_singers)." Even though I might consider returning just one count, the directive emphasizes summarization is essential here. Since the user asked about the total number of singers, I’ll include my reasoning first, followed by a summary string, avoiding heavy formatting. Here's the reasoning:

"Can be summarized: The query returned two counts from different databases. 

Final: "Counts: singer_count = 8; total_singers = 6."

**Finalizing the summary output**

I want to provide a clear summary of the SQL results, which include two counts from different queries: "singer_count: 8" and "total_singers: 6." Since the user asked, "How many singers do we have?" I think they might just want a simple answer. My explanation will be concise: "I can summarize: there are two counts returned." Then, I’ll follow it up with, "singer_count: 8; total_singers: 6." This keeps everything straightforward and avoids ambiguity.

In [15]:
# Render the final response from Agent C
display(Markdown(Agent_C_Response.output_text))

Reason: The SQL produced two separate count results (likely from different queries/databases). This can be summarized by listing both counts rather than choosing or merging them.

singer_count: 8; total_singers: 6
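
Agent C's reasoning notes that it has no direct mapping between each count and its database. One possible remedy, sketched here as an assumption rather than as part of the pipeline, is to label every result before passing it on; `label_results` is a hypothetical helper.

```python
import json


def label_results(db_names: list[str], sql_list: list[str], results: list) -> str:
    """Bundle each result with its database and query as one JSON string."""
    labeled = [
        {"database": db, "sql": sql, "result": res}
        for db, sql, res in zip(db_names, sql_list, results)
    ]
    return json.dumps(labeled, indent=2)
```

In this notebook the call would look like `label_results(list(temp_schemas), SQL_list, sql_results)`, made while `sql_results` is still a list, and the returned string would be passed as `sql_result` to Agent_C.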