# Generative AI Application Development 

## Overview

- Start Serverless Warehouse

- Create a new schema (e.g., `databricks-dev`)
  - Unity Catalog namespace layers: `catalog > schema > object (tables, views, volumes, functions, models, etc.)`

- Set the current catalog (`studies`) and schema (`databricks-dev`)

- Load a dataset as a Databricks table
  - https://archive.ics.uci.edu/dataset/410/paper+reviews

- Prepare data

- Generate the embeddings and store them in a Databricks table

- Configure a vector search index on Unity Catalog (UC)
  - Note: The course version this notebook is based on does not cover this part. To test the concepts from the course, we ran the chunking, embedding, and vector search index steps ourselves and configured the services needed to handle the dataset chosen for these experiments.

- Chunk the data of interest and store it in the vector search index
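
The namespace layers listed above mean that every object is addressed as `catalog.schema.object`. Because the schema name used here contains a hyphen, it has to be wrapped in backticks in SQL. A minimal sketch of building such a name follows; `quote_identifier` and `qualified_name` are hypothetical helpers written for this illustration, not part of any Databricks library.

```python
def quote_identifier(name: str) -> str:
    """Wrap an identifier in backticks, doubling any backtick inside it."""
    return '`' + name.replace('`', '``') + '`'


def qualified_name(catalog: str, schema: str, obj: str) -> str:
    """Build a three-level name: catalog.schema.object."""
    return '.'.join(quote_identifier(part) for part in (catalog, schema, obj))


# names taken from this notebook; the helpers themselves are illustrative
table_name = qualified_name('studies', 'databricks-dev', 'paper_reviews')
print(table_name)
```

The resulting string can be passed to calls such as `saveAsTable` when the current catalog and schema have not been set beforehand.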


## Initial setups



In [0]:
%sql
-- set the catalog and schema used for these studies
use catalog `studies`;
use schema `databricks-dev`;

select current_catalog() as actual_catalog,  current_schema() as actual_schema;

actual_catalog,actual_schema
studies,`databricks-dev`


In [0]:
%sh
pip install tqdm





In [0]:
import warnings
warnings.filterwarnings('ignore')

import json
import re
from tqdm import tqdm

from sklearn.decomposition import PCA
import numpy as np
import pandas as pd

from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, FloatType

from databricks.sdk import WorkspaceClient


# create a Spark session if it doesn't exist
spark = SparkSession.builder.getOrCreate()

# instantiate the Databricks workspace client
client = WorkspaceClient()


# define path of input data available in the Databricks volume
DATASET_PATH = '/Volumes/studies/databricks-studies/ai-agent-volume/reviews.json'

# set whether to export tables
EXPORT_TABLES_TO_DATABRICKS = False
EXPORT_EMBEDDING_TABLE = False


# load json dataset
with open(DATASET_PATH, 'r', encoding='utf-8') as f:
    dataset = json.load(f)

# flatten the dataset for a first look
df = pd.json_normalize(dataset['paper'])
df.head()

# look at one sample
df['review'][0]

[{'confidence': '4',
  'evaluation': '1',
  'id': 1,
  'lan': 'es',
  'orientation': '0',
  'remarks': '',
  'text': '- El artículo aborda un problema contingente y muy relevante, e incluye tanto un diagnóstico nacional de uso de buenas prácticas como una solución (buenas prácticas concretas). - El lenguaje es adecuado.  - El artículo se siente como la concatenación de tres artículos diferentes: (1) resultados de una encuesta, (2) buenas prácticas de seguridad, (3) incorporación de buenas prácticas. - El orden de las secciones sería mejor si refleja este orden (la versión revisada es #2, #1, #3). - El artículo no tiene validación de ningún tipo, ni siquiera por evaluación de expertos.',
  'timespan': '2010-07-05'},
 {'confidence': '4',
  'evaluation': '1',
  'id': 2,
  'lan': 'es',
  'orientation': '1',
  'remarks': '',
  'text': 'El artículo presenta recomendaciones prácticas para el desarrollo de software seguro. Se describen las mejores prácticas recomendadas para desarrollar softwa


## Prepare data

### Normalize the input dataset

In [0]:
df = pd.json_normalize(
    dataset,
    record_path=['paper', 'review'],  # take columns paper and review; each item in review is mapped to a row
    meta=[['paper', 'id']],  # add paper.id as column
)

df.rename(columns={'paper.id': 'paper_id'}, inplace=True)
print(df.shape)

# handle missing values in a simple way: treat empty review text as missing, then drop those rows
df.text = df.text.apply(lambda x: pd.NA if len(x)==0 else x)
df.dropna(inplace=True)
print(df.shape)

df.head()

(405, 9)
(397, 9)


Unnamed: 0,confidence,evaluation,id,lan,orientation,remarks,text,timespan,paper_id
0,4,1,1,es,0,,- El artículo aborda un problema contingente y...,2010-07-05,1
1,4,1,2,es,1,,El artículo presenta recomendaciones prácticas...,2010-07-05,1
2,5,1,3,es,1,,- El tema es muy interesante y puede ser de mu...,2010-07-05,1
3,4,2,1,es,1,,Se explica en forma ordenada y didáctica una e...,2010-07-05,2
5,4,2,3,es,0,,Los autores describen una metodología para des...,2010-07-05,2
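
As a rough illustration of what the normalization step does, the sketch below performs the same flattening by hand on a tiny hand-written dataset (the records are made up for the example): each item of `review` becomes one row, the parent paper's `id` is attached as `paper_id`, and rows with empty `text` are skipped.

```python
# toy input with the same nesting as the real file: paper -> review -> fields
dataset = {
    'paper': [
        {'id': 1, 'review': [{'id': 1, 'text': 'first'}, {'id': 2, 'text': ''}]},
        {'id': 2, 'review': [{'id': 1, 'text': 'second'}]},
    ]
}

rows = []
for paper in dataset['paper']:
    for review in paper['review']:
        row = dict(review)              # copy the review fields
        row['paper_id'] = paper['id']   # attach the parent paper id
        if row['text']:                 # skip reviews with empty text
            rows.append(row)

print(rows)
```

Note that review ids restart at 1 for each paper, so `id` alone does not identify a review; the pair (`paper_id`, `id`) does.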


### Utils

In [0]:
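def pandas_to_databricks_table(table_name: str, pandas_df: pd.DataFrame) -> None:
    """Write a pandas DataFrame to a Delta table, replacing any existing data."""
    spark_df = spark.createDataFrame(pandas_df)
    spark_df.write.format('delta').mode('overwrite').saveAsTable(table_name)
    print(spark_df.count())  # call count(); printing spark_df.count only shows the bound method
    spark_df.show(7)  # show() prints the rows itself and returns None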

### Create tables

In [0]:
df_reviews = df[['id', 'paper_id', 'text', 'timespan']]
print(df_reviews.shape)
df_reviews.head()

(397, 4)


Unnamed: 0,id,paper_id,text,timespan
0,1,1,- El artículo aborda un problema contingente y...,2010-07-05
1,2,1,El artículo presenta recomendaciones prácticas...,2010-07-05
2,3,1,- El tema es muy interesante y puede ser de mu...,2010-07-05
3,1,2,Se explica en forma ordenada y didáctica una e...,2010-07-05
5,3,2,Los autores describen una metodología para des...,2010-07-05


In [0]:
# EXPORT_TABLES_TO_DATABRICKS = True

if EXPORT_TABLES_TO_DATABRICKS:
    pandas_to_databricks_table(table_name='db_paper_reviews', pandas_df=df)

In [0]:
if EXPORT_TABLES_TO_DATABRICKS:
    pandas_to_databricks_table(table_name='paper_reviews', pandas_df=df_reviews)

In [0]:
# chunking the texts

rows = []
CHUNK_SIZE = 70
OVERLAP = 10


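def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into character windows of chunk_size that overlap by overlap characters."""
    if overlap >= chunk_size:
        # the window would never advance, so the loop below would not terminate
        raise ValueError('overlap must be smaller than chunk_size')

    chunks = []
    start = 0
    text_length = len(text)

    while start < text_length:
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap  # step forward by chunk_size - overlap

    return chunks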


for _, row in df_reviews.iterrows():
    chunks = chunk_text(row['text'], chunk_size=CHUNK_SIZE, overlap=OVERLAP)
    for i, chunk in enumerate(chunks):
        rows.append({
            'review_uri': f"paper_{row['paper_id']}_review_{row['id']}_chunk_{i}",
            'paper_id': row['paper_id'],
            'review_id': row['id'],
            'chunked_text': chunk,
            'chunk_size': len(chunk),
            'timespan': row['timespan']
        })

df_chunks = pd.DataFrame(rows)
print(df_chunks.shape)
df_chunks.head()

(6932, 6)


Unnamed: 0,review_uri,paper_id,review_id,chunked_text,chunk_size,timespan
0,paper_1_review_1_chunk_0,1,1,- El artículo aborda un problema contingente y...,70,2010-07-05
1,paper_1_review_1_chunk_1,1,1,", e incluye tanto un diagnóstico nacional de u...",70,2010-07-05
2,paper_1_review_1_chunk_2,1,1,rácticas como una solución (buenas prácticas c...,70,2010-07-05
3,paper_1_review_1_chunk_3,1,1,l lenguaje es adecuado. - El artículo se sien...,70,2010-07-05
4,paper_1_review_1_chunk_4,1,1,catenación de tres artículos diferentes: (1) r...,70,2010-07-05


In [0]:
df_chunks.chunk_size.value_counts(dropna=False)

chunk_size
70    6474
39      15
34      14
25      12
35      11
      ... 
3        3
63       3
11       3
24       3
16       2
Name: count, Length: 69, dtype: int64
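
The sizes below 70 come from the last window of each review. The small example here repeats the windowing logic of `chunk_text` on a 10-character string (the string and parameters are chosen only for illustration) to show how the trailing chunk arises.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    # same windowing logic as in the chunking cell above
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap
    return chunks


chunks = chunk_text('abcdefghij', chunk_size=4, overlap=1)
sizes = [len(c) for c in chunks]
print(chunks)
print(sizes)
```

In this toy case the final chunk `'j'` is already contained in the previous one. In general, though, a short tail can also carry text that appears in no other chunk, so keeping only full-size chunks may discard part of a review's ending.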

In [0]:
# keep only full-size chunks (drops the shorter trailing chunk of each review)
df_chunks = df_chunks[df_chunks['chunk_size'] == CHUNK_SIZE]
print(df_chunks.shape)
df_chunks.head()


(6474, 6)


Unnamed: 0,review_uri,paper_id,review_id,chunked_text,chunk_size,timespan
0,paper_1_review_1_chunk_0,1,1,- El artículo aborda un problema contingente y...,70,2010-07-05
1,paper_1_review_1_chunk_1,1,1,", e incluye tanto un diagnóstico nacional de u...",70,2010-07-05
2,paper_1_review_1_chunk_2,1,1,rácticas como una solución (buenas prácticas c...,70,2010-07-05
3,paper_1_review_1_chunk_3,1,1,l lenguaje es adecuado. - El artículo se sien...,70,2010-07-05
4,paper_1_review_1_chunk_4,1,1,catenación de tres artículos diferentes: (1) r...,70,2010-07-05


In [0]:
print(f'df.shape: {df.shape}')
print(f'df_reviews.shape: {df_reviews.shape}')
print(f'df_chunks.shape: {df_chunks.shape}')

df.shape: (397, 9)
df_reviews.shape: (397, 4)
df_chunks.shape: (6474, 6)


In [0]:
# set maximum rows for tests
MAX_ROWS = 30
if MAX_ROWS:
    df_chunks = df_chunks[:MAX_ROWS]

df_chunks.shape

(30, 6)

In [0]:
def normalize_texts(texts):
    """Collapse whitespace and drop empty or None entries."""
    normalized = []
    for t in texts:
        if t is None:
            continue

        # \s+ also matches \n and \r, so no separate newline replacement is needed
        t = re.sub(r'\s+', ' ', str(t)).strip()

        if t:
            normalized.append(t)

    return normalized


def get_embeddings(texts, emb_model_name):
    texts = normalize_texts(texts)
    if not texts:
        return []

    res = client.serving_endpoints.query(
        name=emb_model_name,  
        input=texts
    )

    # res.data = list of objects EmbeddingsV1ResponseEmbeddingElement
    data = getattr(res, 'data', None) or getattr(res, 'outputs', None)

    # extract embedding vector
    embeddings = [item.embedding for item in data]

    return embeddings
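
One thing to watch: `normalize_texts` skips `None` and blank entries, so `get_embeddings` can return fewer vectors than it received texts, and assigning the result back to a DataFrame column would then be misaligned. The sketch below shows one way to keep positions aligned. It is an illustration only: `embed_aligned` is a hypothetical helper, and `fake_embed` is a stand-in for the serving endpoint so the example runs without a workspace.

```python
import re
from typing import Callable, List, Optional, Sequence

Vector = List[float]


def embed_aligned(
    texts: Sequence[Optional[str]],
    embed_fn: Callable[[List[str]], List[Vector]],
) -> List[Optional[Vector]]:
    """Return one entry per input text; None where the text was missing or blank."""
    cleaned = ['' if t is None else re.sub(r'\s+', ' ', str(t)).strip() for t in texts]
    keep = [i for i, t in enumerate(cleaned) if t]

    vectors = embed_fn([cleaned[i] for i in keep]) if keep else []
    if len(vectors) != len(keep):
        raise ValueError('embedding function returned an unexpected number of vectors')

    aligned: List[Optional[Vector]] = [None] * len(texts)
    for i, vec in zip(keep, vectors):
        aligned[i] = vec
    return aligned


def fake_embed(batch: List[str]) -> List[Vector]:
    # hypothetical stand-in for the serving endpoint: one 1-d vector per text
    return [[float(len(t))] for t in batch]


result = embed_aligned(['hola  mundo', None, '   ', 'ok'], fake_embed)
print(result)
```

With the chunks used in this notebook the issue is unlikely to show up, since every kept chunk has 70 characters, but the check is cheap when the input is less uniform.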

In [0]:
# use pca to reduce dimensionality
low_dim = 10

pca = PCA(
    n_components=min(df_chunks.shape[0], low_dim),
    random_state=42
)
pca

In [0]:
# generate embeddings
batch_size = pow(2, 6)
all_embeddings = []

# define embedding model
emb_model_name = 'databricks-bge-large-en'
# serving endpoints currently available: databricks-bge-large-en, databricks-gte-large-en, databricks-qwen3-embedding-0-6b

texts = df_chunks['chunked_text'].tolist()

print(f'batch_size: {batch_size}, len(texts): {len(texts)} ')

for i in tqdm(range(0, len(texts), batch_size), desc='Processing batches'):
    batch = texts[i:i + batch_size]
    emb_high_dim = get_embeddings(batch, emb_model_name)
    all_embeddings.extend(emb_high_dim)



batch_size: 64, len(texts): 30 


Processing batches:   0%|          | 0/1 [00:00<?, ?it/s]Processing batches: 100%|██████████| 1/1 [00:00<00:00,  1.54it/s]Processing batches: 100%|██████████| 1/1 [00:00<00:00,  1.53it/s]


In [0]:
x_pca = np.array(all_embeddings)  # use the embeddings from every batch, not only the last one (emb_high_dim)
all_embeddings_low_dim = pca.fit_transform(x_pca)
df_chunks['embedding'] = all_embeddings_low_dim.tolist()
print(df_chunks.shape)
df_chunks.head()

(30, 7)


Unnamed: 0,review_uri,paper_id,review_id,chunked_text,chunk_size,timespan,embedding
0,paper_1_review_1_chunk_0,1,1,- El artículo aborda un problema contingente y...,70,2010-07-05,"[0.2938999141521091, 0.15523126718343563, 0.07..."
1,paper_1_review_1_chunk_1,1,1,", e incluye tanto un diagnóstico nacional de u...",70,2010-07-05,"[-0.00810449886665209, 0.032389148313146675, -..."
2,paper_1_review_1_chunk_2,1,1,rácticas como una solución (buenas prácticas c...,70,2010-07-05,"[0.04522044751048338, 0.14319198817174258, 0.1..."
3,paper_1_review_1_chunk_3,1,1,l lenguaje es adecuado. - El artículo se sien...,70,2010-07-05,"[0.28506108117518625, 0.32262783182485966, 0.0..."
4,paper_1_review_1_chunk_4,1,1,catenación de tres artículos diferentes: (1) r...,70,2010-07-05,"[0.2837834677461533, -0.2783718008152769, -0.0..."


In [0]:
# persist the trained model in a simple way
import pickle

with open('pca_model.pkl', 'wb') as f:
    pickle.dump(pca, f)
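
For reference, with its default settings (no whitening) scikit-learn's `PCA.transform` centers the input with the mean learned during `fit` and projects it onto the learned components. In the notation below, chosen here for illustration, $x \in \mathbb{R}^{1024}$ is an embedding from the model, $\mu$ is the per-feature mean of the embeddings used for fitting, and the columns of $W \in \mathbb{R}^{1024 \times 10}$ are the principal components:

$$
z = W^{\top}(x - \mu), \qquad z \in \mathbb{R}^{10}
$$

Since $\mu$ and $W$ are estimated from the fitted chunks only, query embeddings have to go through the same fitted object to land in the same 10-dimensional space, which is why the model is persisted here.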

In [0]:
print(f'embedding length: {len(df_chunks.embedding[0])}')

embedding length: 10


In [0]:
# cast embedding col
spark_df_chunks = spark.createDataFrame(df_chunks)
spark_df_chunks = spark_df_chunks.withColumn(
    'embedding', 
    spark_df_chunks['embedding'].cast(ArrayType(FloatType()))
)

spark_df_chunks.show(10)  # show() prints the rows itself and returns None

+--------------------+--------+---------+--------------------+----------+----------+--------------------+
|          review_uri|paper_id|review_id|        chunked_text|chunk_size|  timespan|           embedding|
+--------------------+--------+---------+--------------------+----------+----------+--------------------+
|paper_1_review_1_...|       1|        1|- El artículo abo...|        70|2010-07-05|[0.29389992, 0.15...|
|paper_1_review_1_...|       1|        1|, e incluye tanto...|        70|2010-07-05|[-0.0081044985, 0...|
|paper_1_review_1_...|       1|        1|rácticas como una...|        70|2010-07-05|[0.045220446, 0.1...|
|paper_1_review_1_...|       1|        1|l lenguaje es ade...|        70|2010-07-05|[0.2850611, 0.322...|
|paper_1_review_1_...|       1|        1|catenación de tre...|        70|2010-07-05|[0.28378347, -0.2...|
|paper_1_review_1_...|       1|        1|na encuesta, (2) ...|        70|2010-07-05|[0.108136706, -0....|
|paper_1_review_1_...|       1|        1|ación

In [0]:
# export embeddings as a Delta table
if EXPORT_EMBEDDING_TABLE:
    # write the Spark DataFrame directly so the array<float> cast above is preserved
    spark_df_chunks.write.format('delta').mode('overwrite').saveAsTable('paper_reviews_embeddings')

## Create vector search index

- Create the index manually in the Unity Catalog (UC) UI
    - `paper_reviews_embeddings` -> `paper_review_index`
    - pk: `review_uri`
- **Note**: Switch the compute to serverless if necessary.

## Semantic search on the vector index database

In [0]:
%sh
pip install databricks-vectorsearch





In [0]:
%restart_python

In [0]:
import re
import pandas as pd
from databricks.sdk import WorkspaceClient
from databricks.vector_search.client import VectorSearchClient
import pickle

client_vector_search = VectorSearchClient(
    disable_notice=True,
)

index = client_vector_search.get_index(index_name='studies.databricks-studies.paper_review_index')


In [0]:
# define embedding model
emb_model_name='databricks-bge-large-en'  
# databricks-bge-large-en, databricks-gte-large-en, databricks-qwen3-embedding-0-6b

# instantiate the Databricks workspace client
client = WorkspaceClient()

# load pca_model
with open('pca_model.pkl', 'rb') as f:
    pca_model = pickle.load(f)


def normalize_texts(texts):
    """Collapse whitespace and drop empty or None entries."""
    normalized = []
    for t in texts:
        if t is None:
            continue

        # \s+ also matches \n and \r, so no separate newline replacement is needed
        t = re.sub(r'\s+', ' ', str(t)).strip()

        if t:
            normalized.append(t)

    return normalized


def get_embeddings(texts, emb_model_name='databricks-bge-large-en'):
    texts = normalize_texts(texts)
    if not texts:
        return []

    res = client.serving_endpoints.query(
        name=emb_model_name,  
        input=texts
    )

    # res.data = list of objects EmbeddingsV1ResponseEmbeddingElement
    data = getattr(res, 'data', None) or getattr(res, 'outputs', None)

    # extract embedding vector
    embeddings = [item.embedding for item in data]

    return embeddings


def semantic_search(query_text: str, top_k: int) -> pd.DataFrame:
    text_embedding_high_dim = get_embeddings([query_text])
    print(f'dim embedding: {len(text_embedding_high_dim[0])}')
    text_embedding_low_dim = pca_model.transform(text_embedding_high_dim)
    print(f'dim embedding-pca: {len(text_embedding_low_dim[0])}')

    cols = ['review_uri','paper_id', 'review_id', 'chunked_text', 'timespan', 'embedding']
    results = index.similarity_search(
        query_vector=text_embedding_low_dim.flatten().tolist(),
        columns=cols,
        num_results=top_k,
        disable_notice=True
    )
    res = pd.DataFrame(results['result']['data_array'], columns=cols+['score'])
    display(res)
    return res
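
The result handling in `semantic_search` relies on the response being a dictionary with the rows under `result.data_array`, with the score appended as the last column. As a hedged sketch, the column names can also be read from the response's `manifest` instead of being rebuilt as `cols + ['score']`. The response below is written by hand to mimic that shape (values copied from a run of this notebook), and `rows_from_response` is a hypothetical helper.

```python
from typing import Any, Dict, List


def rows_from_response(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn a similarity-search style response into a list of row dictionaries."""
    names = [col['name'] for col in response['manifest']['columns']]
    data = response.get('result', {}).get('data_array') or []  # no hits -> empty list
    return [dict(zip(names, row)) for row in data]


# hand-written response in the assumed shape, not an actual API call
sample_response = {
    'manifest': {
        'column_count': 3,
        'columns': [{'name': 'review_uri'}, {'name': 'chunked_text'}, {'name': 'score'}],
    },
    'result': {
        'row_count': 1,
        'data_array': [['paper_1_review_2_chunk_0', 'El artículo presenta recomendaciones', 0.93406165]],
    },
}

rows = rows_from_response(sample_response)
print(rows)
```

Reading the names from the manifest avoids a mismatch if the set of requested columns changes, and the `or []` fallback keeps the helper from failing when a query returns no rows.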

### Run semantic search

In [0]:
input_text = 'recomendaciones prácticas para el desarrollo de software seguro'
df_res = semantic_search(query_text=input_text, top_k=3)

dim embedding: 1024
dim embedding-pca: 10


review_uri,paper_id,review_id,chunked_text,timespan,embedding,score
paper_1_review_2_chunk_0,1.0,2.0,El artículo presenta recomendaciones prácticas para el desarrollo de software seguro. Se describen l,2010-07-05,"List(-0.4820484, -0.07462992, -0.061202534, 0.09068968, 0.08700089, 0.12164772, -0.012149201, 0.04295415, 0.029270861, 0.026877701)",0.93406165
paper_1_review_3_chunk_4,1.0,3.0,"Presenta nueve tablas que corresponden a las prácticas para el desarrollo de software seguro, pero l",2010-07-05,"List(-0.3623627, -0.017432267, 0.060682572, 0.21123499, -5.8355083E-4, 0.049791727, -0.040910758, -0.026283322, 0.06390793, -0.0031450056)",0.90883404
paper_1_review_3_chunk_2,1.0,3.0,arrollo de software seguro. - El “estado real del desarrollo de software en Chile” (como lo indica,2010-07-05,"List(-0.32006934, -0.11861887, -0.16354212, 0.032957304, 0.10105367, 0.17232092, 0.025759658, 0.0022093921, -0.09554255, 0.081382155)",0.89423126
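
On reading the `score` column: to our understanding, the Databricks documentation describes the score for L2-based search as `1 / (1 + d²)`, where `d` is the Euclidean distance between the query vector and the stored vector. Whether this index uses that metric is an assumption, since the notebook does not state it. Under that assumption, a score can be reproduced or inverted as in the sketch below; `l2_score` and `distance_from_score` are hypothetical helper names.

```python
import math
from typing import Sequence


def l2_score(a: Sequence[float], b: Sequence[float]) -> float:
    """Score of two vectors under the assumed formula 1 / (1 + squared L2 distance)."""
    dist_sq = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1.0 / (1.0 + dist_sq)


def distance_from_score(score: float) -> float:
    """Invert the assumed formula to recover the L2 distance."""
    return math.sqrt(1.0 / score - 1.0)


print(l2_score([0.0, 0.0], [1.0, 0.0]))
print(distance_from_score(0.93406165))
```

Under this reading, higher scores mean closer vectors, and identical vectors score exactly 1.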


## Create SQL functions

### search_papers_reviews()

In [0]:
%sql
create or replace function search_papers_reviews(
	search_term string comment 'Search paper review of your interest'
)
returns table
comment 'Search using vector search to retrieve relevant excerpts of reviews of papers of your interest.'
return(
	select chunked_text, timespan,
    review_uri as idx	
	from
		vector_search(
			index => 'studies.databricks-studies.paper_review_index',
			query_vector => ARRAY(0.12, 0.34, 0.56, 0.12, 0.33,  0.50, 0.88, 0.74, 0.33, 0.73), -- placeholder embedding; search_term is not used yet
			num_results => 1
		)
);	
	

In [0]:
%sql
-- test the search_papers_reviews function

select chunked_text, timespan
from search_papers_reviews('software seguro')

chunked_text,timespan
"la versión revisada es #2, #1, #3). - El artículo no tiene validación de ningún tipo, ni siquiera po",2010-07-05


In [0]:
# chunked_text: 
# "la versión revisada es #2, #1, #3). - El artículo no tiene validación de ningún tipo, ni siquiera po"

# (English translation) the reviewed version is #2, #1, #3). - The article has no validation of any kind, not even by...



<img src="./imgs/uc_catalog.png">