## Architecture: NL to SQL with RAG

The PaLM 2 API, especially Codey, can convert natural language (NL) to SQL efficiently. But NL to SQL is not a single translation step; it needs more than SQL translation.

For example:

1. Codey can convert NL to SQL only when it is given the appropriate schema information.
2. In many cases, predicate values (filter values) can't be extracted directly from the given question.

I will show how to address these issues with a RAG architecture and a filter value selection process.

### Preparation

Before converting NL to SQL, we need to store the schemas, tables, and table descriptions in a vector database.
Schema information is essential for building queries, so we store it with embeddings and retrieve the right schema information when converting NL to SQL.
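
The retrieval idea can be illustrated without any external service. The snippet below is only a sketch: the three-dimensional vectors are made up, and `schema_store` and `retrieve_tables` are hypothetical names standing in for the real embeddings and vector database used later in this notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (invented for illustration only).
schema_store = [
    {"table_name": "products", "desc_vector": [0.9, 0.1, 0.0]},
    {"table_name": "distribution_centers", "desc_vector": [0.1, 0.9, 0.1]},
    {"table_name": "events", "desc_vector": [0.0, 0.2, 0.9]},
]

def retrieve_tables(question_vector, store, threshold=0.5):
    """Return the tables whose description vector is similar enough to the question."""
    return [
        row["table_name"]
        for row in store
        if cosine_similarity(question_vector, row["desc_vector"]) > threshold
    ]

question_vector = [0.8, 0.3, 0.1]  # pretend embedding of a product-related question
print(retrieve_tables(question_vector, schema_store))
```

Only the table whose description points in roughly the same direction as the question passes the threshold; the real pipeline applies the same comparison inside the database.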


In [28]:
test_question_simple_but_complex_sql = "I want to know the location of dilivery center for order 100"   # It will show the wrong result!
test_question_simple = "I want to know the total count of the product in Sports category"

test_question = test_question_simple    




In [29]:
# TODO : Implement the following functions
from google.cloud import bigquery

client = bigquery.Client()

sample_dataset_id = 'bigquery-public-data.thelook_ecommerce'

def crawl_table_schemas(dataset_name):
  tables = client.list_tables(dataset_name)
  table_schemas = []
  for table in tables:
    table_id = f"{dataset_name}.{table.table_id}"
    table_schema = client.get_table(table).schema
    table_schemas.append({'table_id': table_id, 'table_schema': table_schema})
  return table_schemas

table_schemas = crawl_table_schemas(sample_dataset_id)



In [4]:
import vertexai
from langchain.chat_models import ChatVertexAI
from langchain.llms import VertexAI
import os

PROJECT_ID = os.getenv("PROJECT_ID")  # @param {type:"string"}
vertexai.init(project=PROJECT_ID, location="us-central1")

llm_vertex = VertexAI(
    #model_name="text-bison@latest",
    model_name="text-bison-32k",
    max_output_tokens=8000,
    temperature=0,
    top_p=0.8,
    top_k=40,
)

llm = llm_vertex

In [5]:
import json
from typing import Any

def parse_json_response(llm_json_response) -> Any:
  # Locate the outermost JSON value: an array if '[' comes first, otherwise an object.
  array_start = llm_json_response.find('[')
  object_start = llm_json_response.find('{')
  start_char = '['
  end_char = ']'
  # Fall back to an object when there is no '[' or when '{' appears before it.
  if array_start == -1 or (object_start != -1 and object_start < array_start):
    start_char = '{'
    end_char = '}'
  start_index = llm_json_response.find(start_char)
  end_index = llm_json_response.rfind(end_char)
  json_data = llm_json_response[start_index:end_index+1]
  parsed_json = json.loads(json_data)
  return parsed_json


In [41]:
import ast
from typing import Any

def parse_python_object(llm_python_object) -> Any:
  print('llm response:'+ llm_python_object)
  list_start = llm_python_object.find('[')
  dict_start = llm_python_object.find('{')
  # Default to a list; switch to a dict only when '{' opens before any '['.
  start_char = '['
  end_char = ']'
  if dict_start != -1 and (list_start == -1 or dict_start < list_start):
    start_char = '{'
    end_char = '}'
  start_index = llm_python_object.find(start_char)
  end_index = llm_python_object.rfind(end_char)
  object_data = llm_python_object[start_index:end_index+1]
  print(object_data)
  parsed_object = ast.literal_eval(object_data)
  return parsed_object


In [6]:

def enrich_schema_information(table_name, table_schema):
  sample_json = """
  {
    "table_name" : "bigquery-public-data.thelook_ecommerce.orders",
    "table_description" : "Orders placed by customers on the Look, an online store that sells clothing, shoes, and other items.",
    "columns" : [
      {
        "column_name" : "order_id",
        "column_description" : "A unique identifier for the order. This is populated when an order is created."
      }
    ]
  }
  """
  prompt_template = """You are a Looker Developer, enrich the schema information for the table {table_name} with the following information:

  table_name : 
  {table_name}

  table_column_schema :
  {table_column_schema}

  output_json :
  {sample_json}
  """
  prompt = prompt_template.format(table_name=table_name, table_column_schema=table_schema, sample_json=sample_json)
  response = llm.predict(prompt)
  return response

def enrich_table_schemas(table_schemas):
  results = []
  for table_schema in table_schemas:
    table_name = table_schema['table_id']
    one_table_schema = table_schema['table_schema']
    response = enrich_schema_information(table_name, one_table_schema)
    results.append(parse_json_response(response))
  return results

enriched_table_schemas = enrich_table_schemas(table_schemas)

In [7]:
enriched_table_schemas[0]

{'table_name': 'bigquery-public-data.thelook_ecommerce.distribution_centers',
 'table_description': 'List of distribution centers for the Look, an online store that sells clothing, shoes, and other items.',
 'columns': [{'column_name': 'id',
   'column_description': 'A unique identifier for the distribution center.',
   'column_type': 'INTEGER',
   'column_nullable': 'NULLABLE'},
  {'column_name': 'name',
   'column_description': 'The name of the distribution center.',
   'column_type': 'STRING',
   'column_nullable': 'NULLABLE'},
  {'column_name': 'latitude',
   'column_description': 'The latitude of the distribution center.',
   'column_type': 'FLOAT',
   'column_nullable': 'NULLABLE'},
  {'column_name': 'longitude',
   'column_description': 'The longitude of the distribution center.',
   'column_type': 'FLOAT',
   'column_nullable': 'NULLABLE'}]}

In [8]:
from vector_util import VectorDatabase

vdb = VectorDatabase()

vdb.truncate_table()

In [9]:
from langchain.embeddings import VertexAIEmbeddings

embeddings = VertexAIEmbeddings()



In [11]:

def write_schema_to_vdb(enriched_table_schemas):
  for enriched_table_schema in enriched_table_schemas:
    description = enriched_table_schema['table_description']
    desc_vector = embeddings.embed_query(description)
    vdb.insert_record(sql=None, parameters=None, description=description, explore_view=None, model_name=None, table_name=str(enriched_table_schema['table_name']), column_schema=str(enriched_table_schema['columns']), desc_vector=desc_vector)

write_schema_to_vdb(enriched_table_schemas)

In [None]:
test_embedding =  embeddings.embed_query(test_question)

Crawling and schema storage are finished.

In [31]:

def get_related_tables(question):
  test_embedding =  embeddings.embed_query(question)
  results = []
  with vdb.get_connection() as conn:
    try:
      with conn.cursor() as cur:
        select_record = (str(test_embedding).replace(' ',''),)
        cur.execute(f"SELECT description, table_name, column_schema FROM rag_test where (1 - (desc_vector <=> %s)) > 0.5 ", select_record)
        results = cur.fetchall()
        print(results)
    except Exception as e:
      print(e)
  return results

related_tables = get_related_tables(test_question)

[('List of distribution centers for the Look, an online store that sells clothing, shoes, and other items.', 'bigquery-public-data.thelook_ecommerce.distribution_centers', "[{'column_name': 'id', 'column_description': 'A unique identifier for the distribution center.', 'column_type': 'INTEGER', 'column_nullable': 'NULLABLE'}, {'column_name': 'name', 'column_description': 'The name of the distribution center.', 'column_type': 'STRING', 'column_nullable': 'NULLABLE'}, {'column_name': 'latitude', 'column_description': 'The latitude of the distribution center.', 'column_type': 'FLOAT', 'column_nullable': 'NULLABLE'}, {'column_name': 'longitude', 'column_description': 'The longitude of the distribution center.', 'column_type': 'FLOAT', 'column_nullable': 'NULLABLE'}]"), ('Events that occur on the Look website, such as page views, product views, and purchases.', 'bigquery-public-data.thelook_ecommerce.events', "[{'column_name': 'id', 'column_description': 'A unique identifier for the event.'

As you can see, this simple RAG pipeline can help make the SQL conversion more precise.

## Considerations

Converting NL to SQL with an LLM service alone isn't perfect. Why is it difficult, or even impossible, to convert NL to SQL directly?

  - Insufficient context information
    - In an ERD, a table can be joined to other tables depending on the given context.
    - If a table is related to other tables through foreign keys, the LLM can't detect the exact relationship between them (one-to-one, one-to-many, many-to-one, many-to-many).
    - If the order table is related to the order-product table, the order-product table to the 'shipment' table, and the 'shipment' table to the 'distribution center' table, how could the LLM know this chain of relations?


  - Filter value mismatching
    - For the NL command "What is the average salary on Jan 2022?", the LLM can't decide which format the date field uses: YYYYMM, MM-YYYY, or MMM YYYY?
    - For the NL command "How many employees are in the Ohio office?", the LLM can't decide which value in the office field really means the Ohio office: "Ohio", "Ohio Office", "OH Office", or even "OH".
    - If a user wants to filter records by a primary key (product_id, order_id, delivery_id), the user must give the exact value in the question. [Link](#Issue-for-the-count-of-distinct-values-in-a-filter-column.)


To overcome these two major issues, we should provide additional information (context) to the LLM.
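
As a rough illustration of what such context could look like for relation chains, the sketch below searches a hand-written foreign-key list for a join path between two tables. It is only an assumption about one possible approach: `FOREIGN_KEYS`, `build_graph`, and `find_join_path` are hypothetical names, and the edges are typed in manually rather than read from BigQuery.

```python
from collections import deque

# Hypothetical foreign-key edges, written by hand for illustration.
# Each entry: (table_a, column_a, table_b, column_b).
FOREIGN_KEYS = [
    ("orders", "order_id", "order_items", "order_id"),
    ("order_items", "product_id", "products", "id"),
    ("products", "distribution_center_id", "distribution_centers", "id"),
]

def build_graph(foreign_keys):
    """Build an undirected adjacency list keyed by table name."""
    graph = {}
    for table_a, column_a, table_b, column_b in foreign_keys:
        condition = f"{table_a}.{column_a} = {table_b}.{column_b}"
        graph.setdefault(table_a, []).append((table_b, condition))
        graph.setdefault(table_b, []).append((table_a, condition))
    return graph

def find_join_path(graph, start, goal):
    """Breadth-first search; returns the join conditions along the shortest path, or None."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        table, path = queue.popleft()
        if table == goal:
            return path
        for neighbor, condition in graph.get(table, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [condition]))
    return None

graph = build_graph(FOREIGN_KEYS)
join_path = find_join_path(graph, "orders", "distribution_centers")
print(join_path)
```

The resulting list of join conditions could be appended to the prompt next to the retrieved schemas, so the LLM does not have to guess the chain.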


I will show how to handle issue #1 another time. For now, I focus on the second issue.

In [69]:
def convert_sql_with_schemas(question, related_tables):
  prompt_template = """You are a Developer, convert the following question into SQL with the schema information:

  related_tables :
  {related_tables}

  question :
  {question}

  output: SQL
  """
  prompt = prompt_template.format(related_tables=related_tables, question=question)
  response = llm.predict(prompt)
  return response

PREPARED_STATEMENT_PARAMETER_CHAR_BIGQUERY = '?'
PREPARED_STATEMENT_PARAMETER_CHAR_OTHERS = '%s'

PREPARED_STATEMENT_PARAMETER_CHAR = PREPARED_STATEMENT_PARAMETER_CHAR_BIGQUERY

def extract_filter_columns(sql, related_tables):
  
  sample_json = """
  {
    "prepared_statement" : "select * from `bigquery-public-data.thelook_ecommerce.delivery` where created_at between ? and ?",
    "filter_columns" : [
      {
        "table_name" : "bigquery-public-data.thelook_ecommerce.delivery",
        "column_name" : "created_at",
        "column_type" : "TIMESTAMP",
        "operator" : "between",
        "filter_names" : ["created_at_start", "created_at_end"],
        "filter_values" : ["2020-01-01", "2020-01-02"],
        "filter_order" : 1
      }
    ]
  }
  """
  prompt_template = """You are a looker developer, extract the filter columns and change the given sql into prepared statement in JSON format. Please don't suggest python code. Give me a json output as the given output example format.:

  output format : json
  {sample_json}

  ----------------------------------------------
  sql :
  {sql}

  related_tables :
  {related_tables}

  """
  prompt = prompt_template.format(sql=sql, parameter_char=PREPARED_STATEMENT_PARAMETER_CHAR, related_tables=related_tables, sample_json=sample_json)
  response = llm.predict(prompt)
  print(response)
  return parse_json_response(response)


In complex output cases, it is better to place the output format at the top of the prompt than at the bottom.

In [56]:
converted_sql = convert_sql_with_schemas(test_question, related_tables)

In [57]:
converted_sql

" ```sql\nSELECT COUNT(*) AS total_count\nFROM bigquery-public-data.thelook_ecommerce.products\nWHERE category = 'Sports';\n```"

In [70]:
sql_and_filters = extract_filter_columns(converted_sql, related_tables)

 ```json
{
  "prepared_statement": "SELECT COUNT(*) AS total_count\nFROM bigquery-public-data.thelook_ecommerce.products\nWHERE category = ?",
  "filter_columns": [
    {
      "table_name": "bigquery-public-data.thelook_ecommerce.products",
      "column_name": "category",
      "column_type": "STRING",
      "operator": "=",
      "filter_names": [
        "category"
      ],
      "filter_values": [
        "Sports"
      ],
      "filter_order": 1
    }
  ]
}
```
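
The SQL returned by `convert_sql_with_schemas` above is still wrapped in a Markdown fence, so it can't be sent to BigQuery as is. Below is a small sketch of one way to unwrap it; `strip_code_fence` is a hypothetical helper, not part of the original notebook, and it assumes at most one fenced block per response.

```python
import re

FENCE = "`" * 3  # Markdown code-fence marker

def strip_code_fence(llm_response: str) -> str:
    """Return the body of the first fenced block, or the stripped text if there is none."""
    pattern = re.escape(FENCE) + r"[A-Za-z]*[ \t]*\n(.*?)" + re.escape(FENCE)
    match = re.search(pattern, llm_response, re.DOTALL)
    return (match.group(1) if match else llm_response).strip()

# The same shape as the response shown in the cell output above.
raw = (
    f" {FENCE}sql\nSELECT COUNT(*) AS total_count\n"
    "FROM bigquery-public-data.thelook_ecommerce.products\n"
    f"WHERE category = 'Sports';\n{FENCE}"
)
clean_sql = strip_code_fence(raw)
print(clean_sql)
```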


### Wrong Query Generation Examples :

1. Undefined function: YEAR, MONTH, and so on. BigQuery GoogleSQL has no `YEAR()` function; use `EXTRACT(YEAR FROM ...)` instead.

  ```SQL
  SELECT 
  AVG(i.retail_price) AS average_price
  FROM 
    bigquery-public-data.thelook_ecommerce.inventory_items AS i
    JOIN bigquery-public-data.thelook_ecommerce.order_items AS oi
      ON i.id = oi.inventory_item_id
    JOIN bigquery-public-data.thelook_ecommerce.orders AS o
      ON oi.order_id = o.order_id
  WHERE YEAR(o.created_at) = 2023
  ```

2. Wrong Join

...

...
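
For the first failure mode, one possible workaround is to post-process the generated SQL before running it. The sketch below rewrites simple `YEAR(x)`, `MONTH(x)`, and `DAY(x)` calls into BigQuery's `EXTRACT` form. `rewrite_date_part_calls` is a hypothetical helper; it is regex-based, so it assumes the argument contains no nested parentheses and it does not skip string literals.

```python
import re

# Matches YEAR(expr), MONTH(expr) or DAY(expr) where expr has no nested parentheses.
_DATE_PART_CALL = re.compile(r"\b(YEAR|MONTH|DAY)\s*\(\s*([^()]+?)\s*\)", re.IGNORECASE)

def rewrite_date_part_calls(sql: str) -> str:
    """Rewrite YEAR(x)-style calls to the EXTRACT(YEAR FROM x) form."""
    return _DATE_PART_CALL.sub(
        lambda m: f"EXTRACT({m.group(1).upper()} FROM {m.group(2)})", sql
    )

print(rewrite_date_part_calls("WHERE YEAR(o.created_at) = 2023"))
```

A sturdier alternative would be to put the target SQL dialect into the prompt, or to validate the query before execution; the rewrite above only covers this one pattern.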

In [64]:
def get_field_unique_values(matched_table, matched_field):
  if matched_table[0] != '`' :
    matched_table = '`' + matched_table + '`'
  sql_query = f"with distinct_values as ( select distinct {matched_field} as {matched_field} from {matched_table} ) select {matched_field}, (select count(1) from distinct_values) as total_count from distinct_values limit 500"
  df = client.query(sql_query).to_dataframe()
  return df[matched_field].tolist(), df['total_count'][0]
  

### Issue for the count of distinct values in a filter column. 

For example, suppose someone wants to filter records by product_id. The 'product_id' column might be a primary key column, so the field contains a large number of unique values.
The LLM can't handle such a field efficiently, so the end user (questioner) MUST give the 'exact' value as the filter value.

In many BI solutions, a filter column shows only a partial set of its distinct values.

Choosing filter values by 'seeing' and by 'saying' are very different. When 'seeing' the values in a field, a BI tool can show 50 values on a page. But 'saying' is not a practical way to choose multiple values from a field.

So, if a filter column has a large number of distinct values, we need to check whether the 'question' itself contains an exactly matching filter value.
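
A minimal sketch of that check, assuming the filter values have already been extracted by the LLM: keep only the values that appear verbatim in the question. `values_stated_in_question` is a hypothetical helper name; it matches whole tokens, so `10` is not found inside `100`.

```python
import re

def values_stated_in_question(question: str, filter_values):
    """Return the filter values that appear verbatim (as whole tokens) in the question."""
    stated = []
    for value in filter_values:
        # Reject matches that are glued to other word characters on either side.
        pattern = r"(?<!\w)" + re.escape(str(value)) + r"(?!\w)"
        if re.search(pattern, question, re.IGNORECASE):
            stated.append(value)
    return stated

question = "I want to know the location of the distribution center for order 100"
print(values_stated_in_question(question, [10, 100, 1000]))
```

If the returned list is empty for a high-cardinality column, the pipeline could ask the user for the exact value instead of letting the LLM guess one.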


### Issue for the number of filter values

Suppose someone wants to retrieve the 'Sport' category from the product table, but no category matches it directly.

The LLM would suggest closely related items such as Soccer Shirts, Baseball Hats, etc.

The problem is that the generated SQL filters with the '=' operator, but in this case we should change the '=' operator to 'IN'.
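
One way this operator change could be sketched, assuming BigQuery positional parameters, where an array parameter can be matched with `IN UNNEST(?)`. `widen_equality_to_in` is a hypothetical helper; it relies on a simple `column = ?` pattern in the prepared statement and would need to run before any step that trims the adjusted values down to one.

```python
import re

def widen_equality_to_in(prepared_statement, filter_columns):
    """Rewrite `col = ?` to `col IN UNNEST(?)` for filters that have several adjusted values."""
    for filter_column in filter_columns:
        values = filter_column.get("adjust_filter_values", [])
        if filter_column["operator"] == "=" and len(values) > 1:
            column_name = filter_column["column_name"]
            pattern = rf"\b{re.escape(column_name)}\s*=\s*\?"
            prepared_statement, replaced = re.subn(
                pattern,
                lambda m: f"{column_name} IN UNNEST(?)",
                prepared_statement,
                count=1,
            )
            if replaced:
                filter_column["operator"] = "IN"
    return prepared_statement

example = {
    "prepared_statement": "SELECT COUNT(*) AS total_count FROM products WHERE category = ?",
    "filter_columns": [
        {
            "column_name": "category",
            "operator": "=",
            "adjust_filter_values": ["Active", "Shorts", "Sweaters"],
        }
    ],
}
example["prepared_statement"] = widen_equality_to_in(
    example["prepared_statement"], example["filter_columns"]
)
print(example["prepared_statement"])
```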

In [65]:

import ast

def choose_right_filter_value(filter_values, wanted_value):
  prompt_template = """As a looker developer, choose right filter value for the wanted value below without changing filter value itself.

  filter_values : {filter_values}

  wanted_values: {wanted_value}

  answer format: python list
[filter_value1, filter_value2, ...]
  """
  prompt = prompt_template.format(filter_values=filter_values, wanted_value=wanted_value)
  print(prompt)
  response = llm.predict(prompt)
  return response 

def adjust_filter_value(filter_columns):
  for filter in filter_columns:
    matched_table = filter['table_name']
    matched_field = filter['column_name']
    filter['unique_values'], filter['unique_count'] = get_field_unique_values(matched_table, matched_field)
    # TODO: if unique_count < 500, then choose right filter value in the unique value list.
    if filter['unique_count'] < 500:
      response = choose_right_filter_value(filter['unique_values'], filter['filter_values'])
      print(response)
      if response.strip().find("```json") == 0 :
        filter['adjust_filter_values'] = parse_json_response(response)
      else:
        filter['adjust_filter_values'] = parse_python_object(response)
    else:
      filter['adjust_filter_values'] = filter['filter_values']
  
  

  

In [71]:
sql_and_filters

{'prepared_statement': 'SELECT COUNT(*) AS total_count\nFROM bigquery-public-data.thelook_ecommerce.products\nWHERE category = ?',
 'filter_columns': [{'table_name': 'bigquery-public-data.thelook_ecommerce.products',
   'column_name': 'category',
   'column_type': 'STRING',
   'operator': '=',
   'filter_names': ['category'],
   'filter_values': ['Sports'],
   'filter_order': 1}]}

In [84]:
adjust_filter_value(sql_and_filters['filter_columns'])

As a looker developer, choose right filter value for the wanted value below without changing filter value itself.

  filter_values : ['Swim', 'Jeans', 'Pants', 'Socks', 'Active', 'Shorts', 'Sweaters', 'Underwear', 'Accessories', 'Tops & Tees', 'Sleep & Lounge', 'Outerwear & Coats', 'Suits & Sport Coats', 'Fashion Hoodies & Sweatshirts', 'Plus', 'Suits', 'Skirts', 'Dresses', 'Leggings', 'Intimates', 'Maternity', 'Clothing Sets', 'Pants & Capris', 'Socks & Hosiery', 'Blazers & Jackets', 'Jumpsuits & Rompers']

  wanted_values: ['Sports']

  answer format: python list
[filter_value1, filter_value2, ...]
  
 ```python
['Active', 'Shorts', 'Sweaters']
```
llm response: ```python
['Active', 'Shorts', 'Sweaters']
```
['Active', 'Shorts', 'Sweaters']


In [85]:
SINGLE_OPERATORS = ['=', '>', '<', '>=', '<=', '!=', '<>']

def choose_right_one_value_from_adjusted_values(sql_and_filters):
  for filter in sql_and_filters['filter_columns']:
    if filter['operator'] in SINGLE_OPERATORS :
      filter['adjust_filter_values'] = [filter['adjust_filter_values'][0]]

choose_right_one_value_from_adjusted_values(sql_and_filters)

In [86]:
sql_and_filters

{'prepared_statement': 'SELECT COUNT(*) AS total_count\nFROM bigquery-public-data.thelook_ecommerce.products\nWHERE category = ?',
 'filter_columns': [{'table_name': 'bigquery-public-data.thelook_ecommerce.products',
   'column_name': 'category',
   'column_type': 'STRING',
   'operator': '=',
   'filter_names': ['category'],
   'filter_values': ['Sports'],
   'filter_order': 1,
   'unique_values': ['Swim',
    'Jeans',
    'Pants',
    'Socks',
    'Active',
    'Shorts',
    'Sweaters',
    'Underwear',
    'Accessories',
    'Tops & Tees',
    'Sleep & Lounge',
    'Outerwear & Coats',
    'Suits & Sport Coats',
    'Fashion Hoodies & Sweatshirts',
    'Plus',
    'Suits',
    'Skirts',
    'Dresses',
    'Leggings',
    'Intimates',
    'Maternity',
    'Clothing Sets',
    'Pants & Capris',
    'Socks & Hosiery',
    'Blazers & Jackets',
    'Jumpsuits & Rompers'],
   'unique_count': 26,
   'adjust_filter_values': ['Active']}]}

In [87]:
# Map the column types seen in the enriched schema (e.g. 'INTEGER', 'FLOAT') to BigQuery parameter types.
BIGQUERY_PARAMETER_TYPES = {'FLOAT': 'FLOAT64', 'FLOAT64': 'FLOAT64', 'INTEGER': 'INT64', 'INT64': 'INT64'}

def prepared_statement_with_filter_values_in_bigquery(sql_and_filters):
  prepared_statement = sql_and_filters['prepared_statement']
  query_parameters = []
  for filter_column in sql_and_filters['filter_columns']:
    # Anything that is not a known numeric type is passed as STRING, as before.
    parameter_type = BIGQUERY_PARAMETER_TYPES.get(filter_column['column_type'].upper(), 'STRING')
    values = filter_column['adjust_filter_values']
    if len(values) > 1:
      # Several values: bind them as one array parameter.
      query_parameters.append(bigquery.ArrayQueryParameter(None, parameter_type, values))
    else:
      query_parameters.append(bigquery.ScalarQueryParameter(None, parameter_type, values[0]))

  job_config = bigquery.QueryJobConfig(
    query_parameters=query_parameters
  )
  print(query_parameters)
  query_job = client.query(prepared_statement, job_config=job_config)
  return query_job.to_dataframe()

  

In [88]:
df_result = prepared_statement_with_filter_values_in_bigquery(sql_and_filters)

[ScalarQueryParameter(None, 'STRING', 'Active')]


In [89]:
df_result

Unnamed: 0,total_count
0,1432


## For the complex query.

For a complex query, there are many more things to consider.

Below is a good(?) example.


For the question: I want to know the location of dilivery center for order 100

The generated query is
``` SQL
SELECT c.name AS distribution_center_name,c.latitude AS distribution_center_latitude,c.longitude AS distribution_center_longitude FROM bigquery-public-data.thelook_ecommerce.orders AS o
JOIN bigquery-public-data.thelook_ecommerce.order_items AS oi ON o.order_id = oi.order_id
JOIN bigquery-public-data.thelook_ecommerce.distribution_centers AS c ON oi.inventory_item_id = c.id 
WHERE o.order_id = ?
```

This query returns no results.

But in reality, there are three records.


| Center Name | Latitude | Longitude |
|-------------|----------|-----------|
| Savannah GA | 32.0167  | -81.1167  |
| Charleston SC | 32.7833 | -79.9333 |
| Houston TX  | 29.7604  | -95.3695  |

The correct query is
``` SQL
    select d.name, d.latitude, d.longitude from `bigquery-public-data.thelook_ecommerce.orders` a 
    join `bigquery-public-data.thelook_ecommerce.order_items` b on (a.order_id = b.order_id)
    join `bigquery-public-data.thelook_ecommerce.products` c on (b.product_id = c.id)
    join `bigquery-public-data.thelook_ecommerce.distribution_centers` d on (c.distribution_center_id = d.id)
```


Can you spot the difference between the two SQL statements?

In the RAG pipeline above, get_related_tables in particular can't retrieve the required table schema (products) because its similarity to the question is too low.

Yes, you can solve this problem by lowering the similarity threshold. But what if there are many tables and columns in the vector space?

Retrieving too many schemas could hurt the performance of the other LLM steps.
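
One possible compromise, sketched below with made-up similarity scores, is to lower the threshold but cap the number of tables passed to the prompt. `select_tables` and its `max_tables` parameter are hypothetical; in the real pipeline the same effect could likely be achieved with an `ORDER BY` on similarity plus a `LIMIT` in the vector query.

```python
def select_tables(scored_tables, threshold=0.5, max_tables=5):
    """Keep tables above the threshold, best match first, capped at max_tables."""
    kept = [(name, score) for name, score in scored_tables if score > threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in kept[:max_tables]]

# Invented similarity scores, for illustration only.
scores = [
    ("orders", 0.71),
    ("order_items", 0.66),
    ("distribution_centers", 0.74),
    ("products", 0.48),
    ("events", 0.41),
    ("users", 0.39),
]

print(select_tables(scores, threshold=0.5))
print(select_tables(scores, threshold=0.4, max_tables=4))
```

With the lower threshold the `products` table is included, while the cap keeps the weakest matches out of the prompt.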


If both your question and the SQL are simple, this pipeline can perform very well.
