## Architecture: NL to SQL with RAG

The PaLM 2 API, especially Codey, can convert natural language (NL) to SQL efficiently. But NL to SQL is not a single translation step; it needs more than SQL translation.

For example:

1. Codey can convert NL to SQL when it is given the appropriate schema information.
2. In many cases, predicate values (filter values) can't be extracted directly from the given question.

I will show how to solve these issues with a RAG architecture and a filter value selection process.

### Preparation

Before converting NL to SQL, we need to store each table's schema and description in a vector database.
Schema information is essential for writing queries, so we store it with an embedding and retrieve the right schema information when converting NL to SQL.
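
The store-and-retrieve idea can be sketched without any external service. The snippet below is a minimal in-memory illustration, not the `VectorDatabase` used later in this notebook: `SchemaStore`, `toy_embed`, and `VOCAB` are hypothetical names, and the toy embedding is a simple keyword count standing in for a real embedding model.

```python
import math

# Hypothetical vocabulary for the toy embedding; a real pipeline would call an embedding model.
VOCAB = ["order", "customer", "distribution", "center", "location", "product", "event"]

def toy_embed(text):
    """Count how often each vocabulary word occurs in the text (stand-in for a real embedding)."""
    lowered = text.lower()
    return [float(lowered.count(word)) for word in VOCAB]

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class SchemaStore:
    """Keeps table descriptions together with their embeddings."""

    def __init__(self):
        self.records = []

    def insert(self, table_name, description):
        self.records.append((table_name, description, toy_embed(description)))

    def find_related(self, question, threshold):
        """Return table names whose description similarity exceeds the threshold, best first."""
        question_vector = toy_embed(question)
        scored = [(cosine_similarity(question_vector, vector), name)
                  for name, _, vector in self.records]
        return [name for score, name in sorted(scored, reverse=True) if score > threshold]

store = SchemaStore()
store.insert("distribution_centers", "List of distribution centers for the online store.")
store.insert("orders", "Orders placed by customers.")
related = store.find_related("Where is the distribution center for order 1234?", 0.5)
print(related)
```

Only the descriptions are embedded here; the column schema would travel along as plain metadata and be handed to the LLM once a table is selected.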


In [4]:
# Crawl the table schemas of a public BigQuery dataset
from google.cloud import bigquery

client = bigquery.Client()

sample_dataset_id = 'bigquery-public-data.thelook_ecommerce'

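def crawl_table_schemas(dataset_name):
  """Return the fully qualified ID and schema of every table in the dataset."""
  tables = client.list_tables(dataset_name)
  table_schemas = []
  for table in tables:
    table_id = f"{dataset_name}.{table.table_id}"
    # list_tables returns lightweight items, so fetch the full table to read its schema.
    table_schema = client.get_table(table).schema
    table_schemas.append({'table_id': table_id, 'table_schema': table_schema})
  return table_schemas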

table_schemas = crawl_table_schemas(sample_dataset_id)



In [5]:
import vertexai
from langchain.chat_models import ChatVertexAI
from langchain.llms import VertexAI
import os

PROJECT_ID = os.getenv("PROJECT_ID")  # @param {type:"string"}
vertexai.init(project=PROJECT_ID, location="us-central1")

llm_vertex = VertexAI(
    #model_name="text-bison@latest",
    model_name="text-bison-32k",
    max_output_tokens=8000,
    temperature=0,
    top_p=0.8,
    top_k=40,
)

llm = llm_vertex

In [6]:
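import json
from typing import Any

def parse_json_response(llm_json_response: str) -> Any:
  """Extract and parse the JSON array or object embedded in an LLM response."""
  array_start = llm_json_response.find('[')
  object_start = llm_json_response.find('{')
  # Treat the payload as an object when there is no '[' or when '{' appears before it.
  if array_start == -1 or (object_start != -1 and object_start < array_start):
    start_char, end_char = '{', '}'
  else:
    start_char, end_char = '[', ']'
  start_index = llm_json_response.find(start_char)
  end_index = llm_json_response.rfind(end_char)
  # Drop any text the model added before or after the JSON payload.
  json_data = llm_json_response[start_index:end_index+1]
  return json.loads(json_data)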


In [8]:

def enrich_schema_information(table_name, table_schema):
  # Example of the expected output format; each column is its own JSON object.
  sample_json = """
  {
    "table_name" : "bigquery-public-data.thelook_ecommerce.orders",
    "table_description" : "Orders placed by customers on the Look, an online store that sells clothing, shoes, and other items.",
    "columns" : [
      {
        "column_name" : "order_id",
        "column_description" : "A unique identifier for the order. This is populated when an order is created."
      }
    ]
  }
  """
  prompt_template = """You are a Looker Developer, enrich the schema information for the table {table_name} with the following information:

  table_name : 
  {table_name}

  table_column_schema :
  {table_column_schema}

  output_json :
  {sample_json}
  """
  prompt = prompt_template.format(table_name=table_name, table_column_schema=table_schema, sample_json=sample_json)
  response = llm.predict(prompt)
  return response

def enrich_table_schemas(table_schemas):
  results = []
  for table_schema in table_schemas:
    table_name = table_schema['table_id']
    one_table_schema = table_schema['table_schema']
    response = enrich_schema_information(table_name, one_table_schema)
    results.append(parse_json_response(response))
  return results

enriched_table_schemas = enrich_table_schemas(table_schemas)

In [9]:
enriched_table_schemas

[{'table_name': 'bigquery-public-data.thelook_ecommerce.distribution_centers',
  'table_description': 'List of distribution centers for the Look, an online store that sells clothing, shoes, and other items.',
  'columns': [{'column_name': 'id',
    'column_description': 'A unique identifier for the distribution center.',
    'column_type': 'INTEGER',
    'column_nullable': 'NULLABLE'},
   {'column_name': 'name',
    'column_description': 'The name of the distribution center.',
    'column_type': 'STRING',
    'column_nullable': 'NULLABLE'},
   {'column_name': 'latitude',
    'column_description': 'The latitude of the distribution center.',
    'column_type': 'FLOAT',
    'column_nullable': 'NULLABLE'},
   {'column_name': 'longitude',
    'column_description': 'The longitude of the distribution center.',
    'column_type': 'FLOAT',
    'column_nullable': 'NULLABLE'}]},
 {'table_name': 'bigquery-public-data.thelook_ecommerce.events',
  'table_description': 'Events that occur on the Look 

In [13]:
from vector_util import VectorDatabase

vdb = VectorDatabase()

In [14]:
from langchain.embeddings import VertexAIEmbeddings

embeddings = VertexAIEmbeddings()

In [18]:

def write_schema_to_vdb(enriched_table_schemas):
  for enriched_table_schema in enriched_table_schemas:
    description = enriched_table_schema['table_description']
    desc_vector = embeddings.embed_query(description)
    # Only the description is embedded; the column schema is stored as plain text.
    vdb.insert_record(
      sql=None,
      parameters=None,
      description=description,
      explore_view=None,
      model_name=None,
      table_name=str(enriched_table_schema['table_name']),
      column_schema=str(enriched_table_schema['columns']),
      desc_vector=desc_vector,
    )

write_schema_to_vdb(enriched_table_schemas)

In [22]:

test_question = "I want to know the location of delivery center for order 1234"
test_embedding =  embeddings.embed_query(test_question)
list_of_tables = vdb.find_related_tables(str(test_embedding).replace(' ',''), 0.5)

In [32]:

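def get_related_tables(question):
  """Return (description, table_name, column_schema) rows whose description is similar to the question."""
  test_embedding =  embeddings.embed_query(question)
  results = []
  with vdb.get_connection() as conn:
    try:
      with conn.cursor() as cur:
        # Serialize the embedding without spaces so it can be passed as a vector literal.
        select_record = (str(test_embedding).replace(' ',''),)
        # Assuming pgvector, <=> is cosine distance, so 1 - distance is cosine similarity.
        cur.execute("SELECT description, table_name, column_schema FROM rag_test WHERE (1 - (desc_vector <=> %s)) > 0.6", select_record)
        results = cur.fetchall()
        print(results)
    except Exception as e:
      print(e)
  return results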

related_tables = get_related_tables(test_question)

[('List of distribution centers for the Look, an online store that sells clothing, shoes, and other items.', 'bigquery-public-data.thelook_ecommerce.distribution_centers', "[{'column_name': 'id', 'column_description': 'A unique identifier for the distribution center.', 'column_type': 'INTEGER', 'column_nullable': 'NULLABLE'}, {'column_name': 'name', 'column_description': 'The name of the distribution center.', 'column_type': 'STRING', 'column_nullable': 'NULLABLE'}, {'column_name': 'latitude', 'column_description': 'The latitude of the distribution center.', 'column_type': 'FLOAT', 'column_nullable': 'NULLABLE'}, {'column_name': 'longitude', 'column_description': 'The longitude of the distribution center.', 'column_type': 'FLOAT', 'column_nullable': 'NULLABLE'}]"), ('Items in orders placed by customers on the Look, an online store that sells clothing, shoes, and other items.', 'bigquery-public-data.thelook_ecommerce.order_items', "[{'column_name': 'id', 'column_description': 'A unique 

As you can see, even this simple RAG pipeline retrieves the relevant tables, which helps make the SQL conversion more precise.
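
To make the link to SQL generation concrete, here is a minimal sketch of how the retrieved rows could be folded into a prompt. `build_nl2sql_prompt` and the prompt wording are hypothetical and not part of this notebook; the rows are assumed to be `(description, table_name, column_schema)` tuples, the same shape that `get_related_tables` returns.

```python
def build_nl2sql_prompt(question, related_tables):
    """Assemble a prompt that carries only the retrieved schemas (hypothetical helper)."""
    schema_blocks = []
    for description, table_name, column_schema in related_tables:
        schema_blocks.append(
            f"table_name: {table_name}\n"
            f"table_description: {description}\n"
            f"columns: {column_schema}"
        )
    schemas = "\n\n".join(schema_blocks)
    return (
        "Convert the question into a BigQuery SQL query. "
        "Use only the tables and columns listed below.\n\n"
        f"{schemas}\n\n"
        f"question: {question}\n"
        "sql:"
    )

# Sample row in the same shape as the retrieval output (values shortened for illustration).
sample_rows = [
    (
        "List of distribution centers.",
        "bigquery-public-data.thelook_ecommerce.distribution_centers",
        "[{'column_name': 'id', 'column_type': 'INTEGER'}]",
    ),
]
prompt = build_nl2sql_prompt("Where is the distribution center for order 1234?", sample_rows)
print(prompt)
```

The resulting string could then be passed to `llm.predict(prompt)`, so the model sees only the tables that the vector search judged relevant.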

## Considerations

Converting NL to SQL with an LLM service alone isn't perfect. Why is it difficult, or even impossible, to convert NL to SQL directly?

  - Insufficient context information
    - In an ERD, a table can be joined to other tables depending on the given context.
    - If a table is related to other tables through foreign keys, the LLM can't detect the exact type of each relationship (one-to-one, one-to-many, many-to-one, many-to-many).
    - If the order table is related to the order-product table, the order-product table to the 'shipment' table, and the 'shipment' table to the 'distribution center' table, how could the LLM know this chain of relationships?


  - Filter value mismatching
    - Given the NL command "What is the average salary in Jan 2022?", the LLM can't decide which format the date field uses: YYYYMM, MM-YYYY, or MMM YYYY?
    - Given the NL command "How many employees are there in the Ohio office?", the LLM can't decide which value in the office field actually means the Ohio office: "Ohio", "Ohio Office", "OH Office", or even "OH".


To overcome these two major issues, we should provide additional information (context) to the LLM.
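
For the first issue, one possible form of extra context is an explicit map of table relationships. The sketch below is illustrative only: the `RELATIONS` entries (table names and join keys) are hypothetical and simply mirror the order-to-distribution-center chain described above, and `find_join_path` is a plain breadth-first search, not something this notebook implements.

```python
from collections import deque

# Hypothetical foreign-key edges: (table_a, table_b) -> join condition.
RELATIONS = {
    ("orders", "order_products"): "orders.order_id = order_products.order_id",
    ("order_products", "shipments"): "order_products.shipment_id = shipments.shipment_id",
    ("shipments", "distribution_centers"): "shipments.center_id = distribution_centers.id",
}

def find_join_path(start, goal):
    """Return the join conditions linking start to goal, or None if no chain exists."""
    # Build an undirected adjacency list, since a join can be followed in either direction.
    neighbors = {}
    for (left, right), condition in RELATIONS.items():
        neighbors.setdefault(left, []).append((right, condition))
        neighbors.setdefault(right, []).append((left, condition))

    queue = deque([(start, [])])
    visited = {start}
    while queue:
        table, path = queue.popleft()
        if table == goal:
            return path
        for next_table, condition in neighbors.get(table, []):
            if next_table not in visited:
                visited.add(next_table)
                queue.append((next_table, path + [condition]))
    return None

join_path = find_join_path("orders", "distribution_centers")
print(join_path)
```

The returned conditions could be appended to the prompt as join hints, so the LLM does not have to guess the chain.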


I will show how to handle issue #1 next time. For now, I focus on the second issue.
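
As a rough preview of the idea, the sketch below matches a phrase from the question against the distinct values actually stored in a column. It is an illustration under stated assumptions: `pick_filter_value` and the sample office values are hypothetical, the values would normally come from a query over the distinct values of the column, and `difflib` string similarity is only a stand-in for the selection process.

```python
import difflib

def pick_filter_value(phrase, column_values, cutoff=0.6):
    """Return the stored value closest to the phrase, or None if nothing is similar enough."""
    # Compare case-insensitively, but return the value exactly as it is stored.
    lowered = {value.lower(): value for value in column_values}
    matches = difflib.get_close_matches(phrase.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[matches[0]] if matches else None

# Hypothetical distinct values of an 'office' column.
office_values = ["Ohio Office", "Texas Office", "New York Office"]
chosen = pick_filter_value("Ohio office", office_values)
print(chosen)
```

Plain string similarity will not resolve abbreviations such as "OH", which is one reason an embedding-based lookup over the stored values may be a better fit.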