# DressMe - A Simple RAG over a luxury fashion products database

DressMe uses [Databricks Vector Search](https://docs.databricks.com/en/generative-ai/vector-search.html) together with the [Databricks Foundation Model API](https://docs.databricks.com/en/machine-learning/foundation-models/index.html) (FMAPI), which serves both the embedding model (`bge-large-en`) and the chat model (`llama-3-70b-instruct`).

Retrieval-augmented generation (RAG) is one of the most popular application architectures for creating natural-language interfaces for people to interact with an organization's data. This notebook builds a very simple RAG application, with the following steps:

1. Set up a vector index and configure it to automatically use an embedding model from the FMAPI to generate embeddings.
2. Load some fashion product data into the vector database
3. Query the database
4. Build a prompt for an LLM from the query results
5. Query an LLM via the FMAPI, using that prompt
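
The query, prompt, and generation steps (3 to 5) can be sketched without any Databricks dependency. The following is a minimal illustration, not the notebook's implementation: `answer_with_rag`, `fake_retrieve`, and `fake_generate` are hypothetical names, and the retriever and the LLM are passed in as plain callables so that stand-ins can replace the real services.

```python
from typing import Callable, List


def answer_with_rag(
    question: str,
    retrieve: Callable[[str, int], List[str]],
    generate: Callable[[str], str],
    num_results: int = 3,
) -> str:
    """Retrieve context for a question, build a prompt, and ask the LLM."""
    documents = retrieve(question, num_results)            # step 3: query the database
    context = "\n".join(f"- {doc}" for doc in documents)   # step 4: build the prompt
    prompt = f"User question: {question}\n\nContext:\n{context}"
    return generate(prompt)                                # step 5: query the LLM


# Hypothetical stand-ins so the sketch runs without a workspace.
def fake_retrieve(query: str, k: int) -> List[str]:
    catalog = ["black cotton t-shirt", "brown silk trousers", "leather sneakers"]
    return catalog[:k]


def fake_generate(prompt: str) -> str:
    return f"(model reply to a prompt of {len(prompt)} characters)"


print(answer_with_rag("What should I wear?", fake_retrieve, fake_generate, num_results=2))
```

In the notebook itself, the retriever role is played by the vector index and the generator role by the chat model served through the FMAPI.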

### Contributors ✨
Team 4:
- Giuseppe Murro - [gmurro](https://github.com/gmurro) - `giuseppe.murro@giorgioarmani.it`
- Gianluca Sarà - [gians14ga](https://github.com/gians14ga) - `gianluca.sara@giorgioarmani.it`

## Setup
First, we will install the necessary libraries.

In [None]:
%pip install --upgrade --force-reinstall databricks-vectorsearch databricks-genai-inference
dbutils.library.restartPython()

Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.
Collecting databricks-vectorsearch
  Using cached databricks_vectorsearch-0.38-py3-none-any.whl (13 kB)
Collecting databricks-genai-inference
  Using cached databricks_genai_inference-0.2.3-py3-none-any.whl (17 kB)
Collecting deprecation>=2
  Using cached deprecation-2.1.0-py2.py3-none-any.whl (11 kB)
Collecting protobuf<5,>=3.12.0
  Using cached protobuf-4.25.3-cp37-abi3-manylinux2014_x86_64.whl (294 kB)
Collecting requests>=2
  Using cached requests-2.32.3-py3-none-any.whl (64 kB)
Collecting mlflow-skinny<3,>=2.11.3
  Using cached mlflow_skinny-2.13.2-py3-none-any.whl (5.3 MB)
Collecting databricks-sdk==0.19.1
  Using cached databricks_sdk-0.19.1-py3-none-any.whl (447 kB)
Collecting pydantic>=2.4.2
  Using cached pydantic-2.7.3-py3-none-any.whl (409 kB)
Collecting httpx<1,>=0.23.0
  Using cached httpx-0.27.0-py3-none-any.whl (75 kB)
Collecting typing-ext

### Define catalog, table, endpoint, and index names

In [None]:
CATALOG = "workspace"
DB='fashion'
SOURCE_TABLE_NAME = "fendi_products"
SOURCE_TABLE_FULLNAME=f"{CATALOG}.{DB}.{SOURCE_TABLE_NAME}"
ORIGINAL_TABLE_FULLNAME="bright_data_fashion_listings.datasets.fendi_products_dataset"

### Data preprocessing
The `fendi_products_dataset` table was imported from the `Fashion Listings` dataset in the Databricks Marketplace.

In [None]:
# Show a few rows from fendi_products_dataset
spark.sql(f"""SELECT PRODUCT_NAME, PRODUCT_DESCRIPTION
          FROM {ORIGINAL_TABLE_FULLNAME}
          WHERE COUNTRY="United States"
          LIMIT 10
          """).show(truncate=False)

+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|PRODUCT_NAME           |PRODUCT_DESCRIPTION                                                                                                                                                                                                                                                                                                                                                                                                       |
+-----------------------+-----------------------------------------------------------------------------------------------------

In [None]:
# Import the table into the current workspace and preprocess it
from pyspark.sql.types import StructType, StructField, StringType, ArrayType, FloatType
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {CATALOG}.{DB}")
spark.sql(f"""CREATE TABLE IF NOT EXISTS {SOURCE_TABLE_FULLNAME} 
        USING delta 
        AS 
        SELECT *, CAST(PRICE as double) as price_double
        FROM {ORIGINAL_TABLE_FULLNAME}
"""
)

spark.sql(f"""
          ALTER TABLE {SOURCE_TABLE_FULLNAME} SET TBLPROPERTIES ('delta.columnMapping.mode' = 'name', 'delta.enableChangeDataFeed' = 'true')
""")
spark.sql(f"""
          ALTER TABLE {SOURCE_TABLE_FULLNAME} DROP COLUMN PRICE
""")

DataFrame[]

Next, prepare the context text column that the vector index will embed.

In [None]:
# Add a new column CONTEXT_TEXT
context_column = "CONTEXT_TEXT"
spark.sql(f"""
    ALTER TABLE {SOURCE_TABLE_FULLNAME}
    ADD COLUMNS ({context_column} STRING)
""")

# Update the new column with concatenated values
spark.sql(f"""
    UPDATE {SOURCE_TABLE_FULLNAME}
    SET {context_column} = CONCAT('product_name=`', PRODUCT_NAME, '`; product_description=`', PRODUCT_DESCRIPTION, '`; product_color=`', COLOR, '`')
""")

DataFrame[num_affected_rows: bigint]
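
One caveat worth checking: to my knowledge, Spark SQL's `CONCAT` returns NULL when any argument is NULL, so a product with a missing `COLOR` would end up with an empty `CONTEXT_TEXT`. Wrapping each column in `COALESCE(column, '')` would be one way to avoid that in SQL. The sketch below mirrors the same text format in plain Python, treating missing values as empty strings. `build_context_text` is a hypothetical helper and the sample values are made up.

```python
from typing import Optional


def build_context_text(
    name: Optional[str], description: Optional[str], color: Optional[str]
) -> str:
    """Mirror the CONCAT format above, treating missing values as empty strings."""

    def clean(value: Optional[str]) -> str:
        return "" if value is None else value

    return (
        f"product_name=`{clean(name)}`; "
        f"product_description=`{clean(description)}`; "
        f"product_color=`{clean(color)}`"
    )


# Made-up sample row with a missing color.
print(build_context_text("Example Bag", "A small leather bag.", None))
```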

In [None]:
# Show a sample of created column
spark.sql(f"""SELECT {context_column}
          FROM {SOURCE_TABLE_FULLNAME}
          WHERE COUNTRY="United States"
          LIMIT 3
          """).show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|CONTEXT_TEXT                                                                                                                                                                                                                                                                                                                     |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|product_name=`Fendi Cloud S

## Set up the Vector Database
Next, we set up the vector database. There are three key steps:
1. Initialize the vector search client
2. Create the endpoint
3. Create the index using the source Delta table we created earlier and the `bge-large-en` embeddings model from the Foundation Model API

### Initialize the Vector Search Client

In [None]:
from databricks.vector_search.client import VectorSearchClient
vsc = VectorSearchClient()

[NOTICE] Using a notebook authentication token. Recommended for development only. For improved performance, please use Service Principal based authentication. To disable this message, pass disable_notice=True to VectorSearchClient().


### Create the Endpoint

The cell below will check if the endpoint already exists and create it if it does not.

In [None]:
VS_ENDPOINT_NAME = 'fashion_endpoint'

# List the endpoints once; treat a missing 'endpoints' key as an empty list
existing_endpoints = vsc.list_endpoints().get('endpoints') or []
if VS_ENDPOINT_NAME not in [endpoint.get('name') for endpoint in existing_endpoints]:
    print(f"Creating new Vector Search endpoint named {VS_ENDPOINT_NAME}")
    vsc.create_endpoint(VS_ENDPOINT_NAME)
else:
    print(f"Endpoint {VS_ENDPOINT_NAME} already exists.")

vsc.wait_for_endpoint(VS_ENDPOINT_NAME, 600)

Endpoint fashion_endpoint already exists.
Endpoint fashion_endpoint is ONLINE.
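
The existence check can also be isolated into a small function, which makes it easy to test without a workspace. This is a sketch that assumes only the response shape used in the cell above (a dictionary with an optional `endpoints` list of dictionaries that carry a `name`). `endpoint_exists` is a hypothetical helper, not part of the `databricks-vectorsearch` package.

```python
from typing import Any, Dict, Optional


def endpoint_exists(listing: Optional[Dict[str, Any]], name: str) -> bool:
    """Return True if `name` appears in an endpoint listing of the assumed shape."""
    endpoints = (listing or {}).get("endpoints") or []
    return any(endpoint.get("name") == name for endpoint in endpoints)


print(endpoint_exists({"endpoints": [{"name": "fashion_endpoint"}]}, "fashion_endpoint"))  # True
print(endpoint_exists({}, "fashion_endpoint"))  # False
```

With such a helper, the cell's condition would read `if not endpoint_exists(vsc.list_endpoints(), VS_ENDPOINT_NAME):`.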


### Create the Vector Index

Now we can create the index over the Delta table we created earlier.

In [None]:
VS_INDEX_NAME = 'fashion_assistant_vs_index'
VS_INDEX_FULLNAME = f"{CATALOG}.{DB}.{VS_INDEX_NAME}"

if VS_INDEX_FULLNAME not in [index.get("name") for index in vsc.list_indexes(VS_ENDPOINT_NAME).get('vector_indexes', [])]:
    try:
        # set up an index with managed embeddings
        print("Creating Vector Index...")
        i = vsc.create_delta_sync_index_and_wait(
            endpoint_name=VS_ENDPOINT_NAME,
            index_name=VS_INDEX_FULLNAME,
            source_table_name=SOURCE_TABLE_FULLNAME,
            pipeline_type="TRIGGERED",
            primary_key="PRODUCT_ID",
            embedding_source_column=context_column,
            embedding_model_endpoint_name="databricks-bge-large-en"
        )
    except Exception as e:
        if "INTERNAL_ERROR" in str(e):
            # Check if the index exists after the error occurred
            if VS_INDEX_FULLNAME in [index.get("name") for index in vsc.list_indexes(VS_ENDPOINT_NAME).get('vector_indexes', [])]:
                print(f"Index {VS_INDEX_FULLNAME} has been created.")
            else:
                raise e
        else:
            raise e
else:
    print(f"Index {VS_INDEX_FULLNAME} already exists.")

Creating Vector Index...


We specified `embedding_model_endpoint_name="databricks-bge-large-en"`. By passing an `embedding_source_column` and an `embedding_model_endpoint_name`, we configure the index to automatically use that model to generate embeddings for the texts in the `CONTEXT_TEXT` column of the source table. We do not need to generate embeddings manually.
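
To give an intuition for what the index does with those embeddings at query time, here is a toy nearest-neighbor ranking in plain Python. It is only an illustration: the three-dimensional vectors are hand-made rather than real `bge-large-en` output, `cosine_similarity` and `top_k` are hypothetical helpers, and cosine similarity is an assumption chosen for the example, since this notebook does not state which metric Databricks Vector Search uses internally.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float], vectors: Dict[str, Sequence[float]], k: int = 1
) -> List[Tuple[str, float]]:
    """Rank stored vectors by similarity to the query and keep the best k."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hand-made toy embeddings, one per product.
toy_vectors = {
    "t-shirt": [0.9, 0.1, 0.0],
    "trousers": [0.1, 0.9, 0.0],
    "sneakers": [0.0, 0.2, 0.9],
}
print(top_k([1.0, 0.0, 0.0], toy_vectors, k=1))
```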


## Sync the Vector Search Index


In [None]:
# Trigger a sync so the index picks up the latest rows from the source table
index = vsc.get_index(endpoint_name=VS_ENDPOINT_NAME,
                      index_name=VS_INDEX_FULLNAME)
index.sync()

{}

## Chat with DressMe 

Chat with the LLM and get an answer grounded in the most relevant products retrieved from the database.
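
The cell below runs three separate searches (top, bottom, shoes) and formats each result into the context string. That formatting step could be factored into one helper, sketched here under the assumption that a search response has the shape the cell relies on (`result` containing `data_array`, with each row ordered like the requested columns: name, description, image). `format_context` is a hypothetical name and the sample row is made up.

```python
from typing import Any, Dict, List


def format_context(label: str, response: Dict[str, Any]) -> str:
    """Format the rows of one search response as a labelled context block."""
    rows: List[List[Any]] = (response.get("result") or {}).get("data_array") or []
    lines = []
    for i, doc in enumerate(rows):
        lines.append(f"{label} retrieved context {i + 1}:")
        lines.append(f"- name={doc[0]}, image={doc[2]}, description={doc[1]};")
        lines.append("")  # blank line between products
    return "\n".join(lines)


# Made-up response in the assumed shape.
sample = {
    "result": {
        "data_array": [
            ["Example Tee", "A plain cotton tee.", "https://example.com/tee.jpg"]
        ]
    }
}
print(format_context("Top wearing", sample))
```

An empty or missing `data_array` yields an empty string, so a search with no hits does not break prompt construction.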

In [None]:
from databricks_genai_inference import ChatSession

# Start a new chat session (this resets the conversation history)
chat = ChatSession(model="databricks-meta-llama-3-70b-instruct",
                   system_message="You are a helpful fashion assistant. Answer the user's question based on the provided context.",
                   max_tokens=512)

user_question = "I don't know what to wear next week for my new job. I'm a man an i like to wear casual clothes. Can you provide me an appelling outfit to buy?"

# get context from vector search
raw_context_top = index.similarity_search(columns=[ "PRODUCT_NAME", "PRODUCT_DESCRIPTION", "IMAGE"],
                        query_text=f"{user_question} Based on the question, provide the most related top wearing (shirt, pullover, t-shirt etc) that fits the requirements.",
                        num_results = 1)

raw_context_bottom = index.similarity_search(columns=[ "PRODUCT_NAME", "PRODUCT_DESCRIPTION", "IMAGE"],
                        query_text=f"{user_question} Based on the question, provide a bottom wearing (pants, shorts, jeans etc) that fits the requirements.",
                        num_results = 1)

raw_context_shoes = index.similarity_search(columns=[ "PRODUCT_NAME", "PRODUCT_DESCRIPTION", "IMAGE"],
                        query_text="Advice a pair of shoes.",
                        num_results = 1)

context_string = "Context:\n\n"

# Each header is built inside its loop so that `i` always refers to the current result
for (i, doc) in enumerate(raw_context_top.get('result').get('data_array')):
    context_string += f"Top wearing retrieved context {i+1}:\n"
    context_string += f"- name={doc[0]}, image={doc[2]}, description={doc[1]};"
    context_string += "\n\n"

for (i, doc) in enumerate(raw_context_bottom.get('result').get('data_array')):
    context_string += f"Bottom wearing retrieved context {i+1}:\n"
    context_string += f"- name={doc[0]}, image={doc[2]}, description={doc[1]};"
    context_string += "\n\n"

for (i, doc) in enumerate(raw_context_shoes.get('result').get('data_array')):
    context_string += f"Shoes wearing retrieved context {i+1}:\n"
    context_string += f"- name={doc[0]}, image={doc[2]}, description={doc[1]};"
    context_string += "\n\n"

chat.reply(f"User question: {user_question}\n\nContext: {context_string}.\n\n Provide an image url for all products described. Be short but descriptive and emphatic")

print(f"User question: {user_question}\n\nDressMe answer: {chat.last}")

User question: I don't know what to wear next week for my new job. I'm a man an i like to wear casual clothes. Can you provide me an appelling outfit to buy?

DressMe answer: Congratulations on your new job!

I've got just the outfit for you - a stylish, yet casual look that's perfect for your new role. Here's a suggested ensemble:

**Top:** Start with the **Fendi Black Cotton Jersey T-Shirt** (https://static.fendi.com/dam/is/image/fendi/FAF532AD3CF0GME_01?wid=768&hei=768&hash=f72f74aa5c3dfa592d2b0e68e622e75a-17e1bcde15b&sw=768&sh=768). Its regular fit and crew neck will provide a comfortable and classic look.

**Bottom:** Pair the tee with the **Fendi Brown Silk Twill Trousers** (https://static.fendi.com/dam/is/image/fendi/FR5994A8G3F118W_01?wid=960&hei=960&hash=3ac464177b8ff56b6bb9e7b41662ff22-17e1d4a9395&sw=960&sh=960). The flowing wide-leg design and all-over FF motif print will add a touch of sophistication to your outfit.

**Shoes:** Complete the look with the **Fendi Match Lace-