# Creating and Storing Embeddings in Chroma DB

In this notebook we build the base documentation for the reduced schema we will work with, derived from the corresponding `information_schema`. That base will later be enriched outside the notebook with additional business-relevant information.

In addition, outside the notebook, we will write supplementary documentation such as business rules and few-shot examples. Finally, we will split all of these documents into chunks and store them, with their associated metadata, in a Chroma DB vector database.
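As a rough illustration of that final step, the sketch below turns an MDL-like dictionary into one text chunk per table, each paired with an id and a metadata dictionary. The function name `build_table_chunks`, the chunk layout, and the metadata keys (including `doc_type`) are assumptions made for this example, not the project's actual chunking strategy. The dictionary keys mirror the constants defined in this notebook, and the three resulting lists follow the `ids` / `documents` / `metadatas` shape that Chroma's `collection.add` accepts.

```python
def build_table_chunks(mdl: dict) -> tuple[list[str], list[str], list[dict]]:
    """Build one text chunk per table, with an id and flat metadata (illustrative layout)."""
    ids, documents, metadatas = [], [], []
    for schema in mdl["schemas"]:
        for table in schema["tables"]:
            column_lines = [
                f"- {col['name']} ({col['data_type']}): {col['description']}"
                for col in table["columns"]
            ]
            header = f"Table {schema['name']}.{table['name']}: {table['description']}"
            ids.append(f"{mdl['database']}.{schema['name']}.{table['name']}")
            documents.append("\n".join([header, *column_lines]))
            # Flat string metadata keeps later filtering simple.
            metadatas.append({
                "database": mdl["database"],
                "schema": schema["name"],
                "table": table["name"],
                "doc_type": "mdl",  # hypothetical tag to tell MDL chunks from rules or examples
            })
    return ids, documents, metadatas


# Tiny hand-written MDL fragment; descriptions are still placeholders.
example_mdl = {
    "database": "adventure_works_dw",
    "schemas": [{
        "name": "sales",
        "tables": [{
            "name": "dim_customer",
            "description": "[To be completed ...]",
            "columns": [
                {"name": "customer_key", "data_type": "INT4",
                 "description": "[To be completed ...]"},
            ],
        }],
    }],
}

ids, documents, metadatas = build_table_chunks(example_mdl)
print(ids)
print(documents[0])
```

One chunk per table is only one possible granularity; chunking per column or per schema would use the same three-list shape.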

## Initialization

### Libraries



In [1]:
import os
import sys
from pathlib import Path
import pandas as pd
from collections import defaultdict, OrderedDict
import re
import yaml

notebook_dir = os.getcwd() 
project_root = os.path.abspath(os.path.join(notebook_dir, '..'))
sys.path.append(project_root)

from src.pg_sql import execute_query


pd.set_option('display.max_columns', None)
yaml.add_representer(OrderedDict, lambda dumper, data: dumper.represent_dict(data.items()))

### Constants

In [2]:
DATABASE = 'database'
SCHEMAS = 'schemas'
TABLES = 'tables'
COLUMNS = 'columns'
DESCRIPTION = 'description'
NAME = 'name'
DATA_TYPE = 'data_type'
PRIMARY_KEY = 'is_primary_key'
FOREIGN_KEY = 'is_foreign_key'
REFERENCE = 'reference'
TO_DO = '[To be completed ...]'

OUTPUT_PATH = '../data/embeddings/auxs'

### Functions

In [3]:
def get_information_schema(query_path: str, db_names_list: list[str], schema_names_list: list[str]) -> str:
    """
    Reads an SQL query from a file and replaces placeholder lists with formatted strings.

    This function is designed to work with SQL queries that have specific placeholders
    for database and schema names. It reads the query from the given file path, 
    formats the input lists of names into a single quoted, comma-separated string, 
    and replaces the placeholders in the query.

    Args:
        query_path (str): The file path to the SQL query. The query should
                          contain the placeholders `[db_names_list]` and
                          `[schema_names_list]`.
        db_names_list (list[str]): A list of database names to be formatted
                                   and inserted into the query.
        schema_names_list (list[str]): A list of schema names to be formatted
                                       and inserted into the query.

    Returns:
        str: The complete SQL query with the placeholders replaced by
             the formatted database and schema names.

    Raises:
        FileNotFoundError: If the specified query_path does not exist.
        
    Example:
        >>> from pathlib import Path
        >>> # Assume 'my_query.sql' contains:
        >>> # SELECT * FROM information_schema.tables WHERE table_schema IN ([schema_names_list])
        >>> # And we create a dummy file for the example:
        >>> _ = Path('my_query.sql').write_text("SELECT * FROM information_schema.tables WHERE table_schema IN ([schema_names_list])")
        >>> db_list = ['db1', 'db2']
        >>> schema_list = ['schema_a', 'schema_b']
        >>> get_information_schema('my_query.sql', db_list, schema_list)
        "SELECT * FROM information_schema.tables WHERE table_schema IN ('schema_a', 'schema_b')"
    """
    query = Path(query_path).read_text()

    db_names = "'" + "', '".join(db_names_list) + "'"
    schema_names = "'" + "', '".join(schema_names_list) + "'"

    return query.replace('[db_names_list]', db_names).replace('[schema_names_list]', schema_names)



def format_yaml(yaml_str: str) -> str:
    """
    Formats a YAML string by adding a blank line before each list item
    that isn't preceded by a list key.
    """
    lines = []
    prev_is_list_key = False
    # Treat the start of the document as "empty" so no leading blank line is emitted.
    prev_is_empty = True

    for line in yaml_str.split('\n'):
        stripped = line.strip()
        # Separate list items with a blank line, except right after a key or a blank line.
        if stripped.startswith('-') and not prev_is_list_key and not prev_is_empty:
            lines.append('')

        lines.append(line)
        prev_is_list_key = stripped.endswith(':')
        prev_is_empty = stripped == ''

    return '\n'.join(lines)
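
To make the spacing rule of `format_yaml` concrete, here is a small worked example. So that it runs on its own, the block restates the rule in a compact stand-in called `space_list_items` (a hypothetical name used only here); the sample YAML text is made up for illustration.

```python
def space_list_items(yaml_str: str) -> str:
    """Insert a blank line before a '-' item unless the previous line
    ends with ':' (a list key) or is blank."""
    out = []
    prev = ""  # empty start: nothing is inserted before the first line
    for line in yaml_str.split("\n"):
        stripped = line.strip()
        if stripped.startswith("-") and prev and not prev.endswith(":"):
            out.append("")
        out.append(line)
        prev = stripped
    return "\n".join(out)


sample = (
    "tables:\n"
    "- name: dim_customer\n"
    "  description: customers\n"
    "- name: fact_sales\n"
    "  description: sales\n"
)
print(space_list_items(sample))
# tables:
# - name: dim_customer
#   description: customers
#
# - name: fact_sales
#   description: sales
```

The first item stays attached to its `tables:` key, and only the following items are separated, which keeps long lists of tables and columns easier to scan when editing the file by hand.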

## YAML file with the MDL of the schema of interest

### Build the base file from `information_schema`

First we obtain a base file built with the query in `/data/embeddings/auxs/get_information_schema.sql`; extra metadata will be added to it later.

In [4]:
GET_INFORMATION_SCHEMA_SQL = '../data/embeddings/auxs/get_information_schema.sql'
DB_NAME = 'adventure_works_dw'
SCHEMA_NAME = 'sales'


information_schema_data = execute_query(get_information_schema(
    query_path= GET_INFORMATION_SCHEMA_SQL,
    db_names_list= [DB_NAME],
    schema_names_list= [SCHEMA_NAME]
))
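
Because `get_information_schema` pastes the names directly into the SQL text, a name containing a quote character would break or alter the query. A minimal guard, assuming database and schema names are plain identifiers made of letters, digits, and underscores, might look like the following. `quote_names` is a hypothetical helper written for this example; it is not part of `src.pg_sql`.

```python
import re

# Assumption: valid names are simple identifiers (letters, digits, underscores).
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def quote_names(names: list[str]) -> str:
    """Return names as a quoted, comma-separated SQL list, rejecting unsafe values."""
    if not names:
        raise ValueError("at least one name is required")
    for name in names:
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"unsafe name for SQL interpolation: {name!r}")
    return ", ".join(f"'{name}'" for name in names)


print(quote_names(["adventure_works_dw"]))  # 'adventure_works_dw'
print(quote_names(["sales", "hr"]))         # 'sales', 'hr'
```

Where the database driver supports bound parameters, passing the names as parameters is generally preferable to string interpolation.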

Here is what the query returns:

In [5]:
information_schema_data

[{'db_name': 'adventure_works_dw',
  'schema_name': 'sales',
  'table_name': 'dim_customer',
  'column_name': 'customer_key',
  'column_type': 'INT4',
  'primary_key': True,
  'foreign_key': False,
  'target': None},
 {'db_name': 'adventure_works_dw',
  'schema_name': 'sales',
  'table_name': 'dim_customer',
  'column_name': 'geography_key',
  'column_type': 'INT4',
  'primary_key': False,
  'foreign_key': True,
  'target': 'sales.dim_customer.geography_key'},
 {'db_name': 'adventure_works_dw',
  'schema_name': 'sales',
  'table_name': 'dim_customer',
  'column_name': 'customer_full_name',
  'column_type': 'TEXT',
  'primary_key': False,
  'foreign_key': False,
  'target': None},
 {'db_name': 'adventure_works_dw',
  'schema_name': 'sales',
  'table_name': 'dim_customer',
  'column_name': 'birth_date',
  'column_type': 'DATE',
  'primary_key': False,
  'foreign_key': False,
  'target': None},
 {'db_name': 'adventure_works_dw',
  'schema_name': 'sales',
  'table_name': 'dim_customer',
  

We convert it to a pandas DataFrame for easier reading:

In [6]:
pd.DataFrame(information_schema_data)

| | db_name | schema_name | table_name | column_name | column_type | primary_key | foreign_key | target |
|---|---|---|---|---|---|---|---|---|
| 0 | adventure_works_dw | sales | dim_customer | customer_key | INT4 | True | False | |
| 1 | adventure_works_dw | sales | dim_customer | geography_key | INT4 | False | True | sales.dim_customer.geography_key |
| 2 | adventure_works_dw | sales | dim_customer | customer_full_name | TEXT | False | False | |
| 3 | adventure_works_dw | sales | dim_customer | birth_date | DATE | False | False | |
| 4 | adventure_works_dw | sales | dim_customer | marital_status | BPCHAR(1) | False | False | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 107 | adventure_works_dw | sales | fact_sales | freight | NUMERIC | False | False | |
| 108 | adventure_works_dw | sales | fact_sales | order_date | DATE | False | False | |
| 109 | adventure_works_dw | sales | fact_sales | due_date | DATE | False | False | |
| 110 | adventure_works_dw | sales | fact_sales | ship_date | DATE | False | False | |
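
The rows can also be checked programmatically before the YAML is built. As one possible check, the sketch below flags foreign-key rows whose `target` points back at the row's own column, which may indicate an issue in the metadata query rather than a real relationship. It assumes `target` uses the `schema.table.column` format seen in the output above; `find_self_references` is a hypothetical helper, and the two sample rows are copied from that output.

```python
def find_self_references(rows: list[dict]) -> list[str]:
    """Return 'schema.table.column' for FK rows whose target is the column itself."""
    flagged = []
    for row in rows:
        own = f"{row['schema_name']}.{row['table_name']}.{row['column_name']}"
        if row["foreign_key"] and row["target"] == own:
            flagged.append(own)
    return flagged


sample_rows = [
    {"schema_name": "sales", "table_name": "dim_customer", "column_name": "customer_key",
     "foreign_key": False, "target": None},
    {"schema_name": "sales", "table_name": "dim_customer", "column_name": "geography_key",
     "foreign_key": True, "target": "sales.dim_customer.geography_key"},
]

print(find_self_references(sample_rows))  # ['sales.dim_customer.geography_key']
```

Any flagged column is worth reviewing by hand before its `reference` value is written to the MDL file.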


Next we create the base YAML file with the MDL of our schema, which serves as a starting point for manually adding extra metadata later:

In [None]:
dbs_data = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))

for row in information_schema_data:
    db_name = row.get('db_name')
    schema_name = row.get('schema_name')
    table_name = row.get('table_name')
    
    dbs_data[db_name][schema_name][table_name].append(row)

for db_name, schemas_data in dbs_data.items():
    db = OrderedDict()
    db[DATABASE] = db_name
    db[DESCRIPTION] = TO_DO

    schemas = list()
    for schema_name, tables_data in schemas_data.items():
        schema = OrderedDict()
        schema[NAME] = schema_name
        schema[DESCRIPTION] = TO_DO

        tables = list()
        for table_name, columns_data in tables_data.items():
            table = OrderedDict()
            table[NAME] = table_name
            table[DESCRIPTION] = TO_DO

            columns = list()
            for column_data in columns_data:
                column = OrderedDict()
                column[NAME] = column_data.get('column_name')
                column[DESCRIPTION] = TO_DO
                column[DATA_TYPE] = column_data.get('column_type')
                
                if column_data.get('primary_key'):
                    column[PRIMARY_KEY] = True

                if column_data.get('foreign_key'):
                    column[FOREIGN_KEY] = True
                    column[REFERENCE] = column_data.get('target')

                columns.append(column)

            table[COLUMNS] = columns
            tables.append(table)
        
        schema[TABLES] = tables
        schemas.append(schema)
    
    db[SCHEMAS] = schemas

    mdl_file_path = f'{OUTPUT_PATH}/MDL_{db_name}.yaml'
    with open(mdl_file_path, 'w') as mdl:
        mdl.write(format_yaml(yaml.dump(db)))

    print(f'>  Base MDL file saved to {mdl_file_path}')

>  Base MDL file saved to ../data/embeddings/auxs/MDL_adventure_works_dw.yaml
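
Since every description starts out as the `TO_DO` placeholder, it can help to list what is still pending after the manual pass. The sketch below walks an MDL-like dictionary and returns the dotted path of each entry whose description is still the placeholder. `pending_descriptions` is a hypothetical helper, it assumes the nested structure produced by the cell above, and the filled-in descriptions in the sample are invented for illustration.

```python
TO_DO = "[To be completed ...]"


def pending_descriptions(db: dict) -> list[str]:
    """List dotted paths (database, schema, table, column) still holding the placeholder."""
    pending = []
    if db.get("description") == TO_DO:
        pending.append(db["database"])
    for schema in db.get("schemas", []):
        schema_path = f"{db['database']}.{schema['name']}"
        if schema.get("description") == TO_DO:
            pending.append(schema_path)
        for table in schema.get("tables", []):
            table_path = f"{schema_path}.{table['name']}"
            if table.get("description") == TO_DO:
                pending.append(table_path)
            for column in table.get("columns", []):
                if column.get("description") == TO_DO:
                    pending.append(f"{table_path}.{column['name']}")
    return pending


example = {
    "database": "adventure_works_dw",
    "description": "Sales data warehouse (example text)",
    "schemas": [{
        "name": "sales",
        "description": TO_DO,
        "tables": [{
            "name": "dim_customer",
            "description": "Customer dimension (example text)",
            "columns": [
                {"name": "customer_key", "description": TO_DO, "data_type": "INT4"},
                {"name": "birth_date", "description": "Date of birth (example text)",
                 "data_type": "DATE"},
            ],
        }],
    }],
}

print(pending_descriptions(example))
# ['adventure_works_dw.sales', 'adventure_works_dw.sales.dim_customer.customer_key']
```

In practice the dictionary would come from loading the edited file with `yaml.safe_load`.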
