# MOBI Docs Vector Search

This notebook walks through how to build and query a **Databricks Vector Search** index using cleaned Mobi website content. This gives you a simple, powerful retrieval layer that your **Genie rooms**, agents, and applications can use for semantic understanding of Mobi policies, rules, and public information.

You'll use this notebook to:

- Turn cleaned website pages (from the `silver_site` table) into embeddings
- Create a Vector Search index that supports **semantic search** and **RAG**
- Run queries that find relevant documents using natural language
- Integrate retrieval directly into a Genie room so agents can answer questions grounded in real Mobi documentation

By the end, you will have:

- A working Vector Search endpoint  
- An index built over your Mobi content  
- Example queries and patterns you can reuse in your own tools and agents  

This is one of the easiest ways to give Genie a domain-specific knowledge base without building or hosting any external services.
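
To make "semantic search over embeddings" concrete, here is a minimal sketch of the idea in plain Python. The three-dimensional vectors and document names are made up for illustration; a real embedding model produces far larger vectors, and Vector Search performs this ranking for you at scale.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the ids of the k documents most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical toy "embeddings", not output from a real model
TOY_DOCS = {
    "fares": [0.9, 0.1, 0.0],
    "helmet_rules": [0.1, 0.9, 0.1],
    "station_map": [0.0, 0.2, 0.9],
}
# Stands in for the embedding of a question about trip prices
TOY_QUERY = [0.8, 0.2, 0.1]

print(top_k(TOY_QUERY, TOY_DOCS))
```

The query vector points in nearly the same direction as the `fares` vector, so that document ranks first even though no keywords are compared.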


In [0]:
%pip install databricks-vectorsearch mlflow requests
%restart_python

In [0]:
# Setup: minimal deps + add src to sys.path
import sys
from pathlib import Path
src_path = Path.cwd() / "src"
if str(src_path) not in sys.path:
    sys.path.insert(0, str(src_path))


In [0]:
import mlflow
CONFIG = mlflow.models.ModelConfig(development_config='config.yaml')

CATALOG = CONFIG.get("catalog")
SCHEMA = CONFIG.get("schema")
print(f"Using catalog.schema: {CATALOG}.{SCHEMA}")



In [0]:
# Show ten rows of the silver_site table we already produced

display(spark.table(f"`{CATALOG}`.`{SCHEMA}`.silver_site").limit(10))

In [0]:
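# Delta Sync indexes read the source table's change data feed to pick up
# inserts, updates, and deletes, so it must be enabled before creating the index.
sql = (
    f"ALTER TABLE `{CATALOG}`.`{SCHEMA}`.`silver_site` "
    "SET TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true')"
)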

spark.sql(sql)

## Create a Vector Search Endpoint

Vector Search endpoints are lightweight, serverless endpoints that host your semantic indexes.  
They power fast similarity search without requiring you to manage compute, scale, or REST services.

In this section:

- We create (or reuse) a **Vector Search endpoint** in your workspace
- This endpoint will host the index built from your Mobi site pages
- Embedding generation and indexing are handled automatically by Databricks

You only need one endpoint for your whole project.  
Multiple indexes can share it, and Genie rooms can reference it directly.


In [0]:
from databricks.vector_search.client import VectorSearchClient

ENDPOINT_NAME = "mobi_vs_endpoint"

# Initialize client first
client = VectorSearchClient(disable_notice=True)

# Safely get the list of endpoints (empty list if none)
resp = client.list_endpoints()
endpoints = resp.get("endpoints", [])  # returns [] if key is missing

# Reuse the endpoint if it already exists; otherwise create it.
# This keeps reruns of the notebook idempotent and avoids recreating an
# endpoint while a deletion of the same name may still be in progress.
if any(ep.get("name") == ENDPOINT_NAME for ep in endpoints):
    print(f"Reusing existing endpoint {ENDPOINT_NAME}")
else:
    client.create_endpoint(
        name=ENDPOINT_NAME,
        endpoint_type="STANDARD",
    )
    print(f"Endpoint {ENDPOINT_NAME} created")
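
A new endpoint can take several minutes to provision, so it may help to wait until it is online before building an index on it. The helper below is a minimal, generic polling sketch. `wait_for_state` and its parameters are hypothetical names introduced here, not part of the Vector Search SDK.

```python
import time


def wait_for_state(get_state, target="ONLINE", timeout_s=1200, poll_s=15, sleep=time.sleep):
    """Call get_state() repeatedly until it returns target, or raise TimeoutError."""
    waited = 0
    while True:
        state = get_state()
        if state == target:
            return state
        if waited >= timeout_s:
            raise TimeoutError(f"Still {state!r} after {waited}s, expected {target!r}")
        sleep(poll_s)
        waited += poll_s


# Possible use in this notebook, assuming the endpoint description exposes
# its state under endpoint_status.state:
#
# wait_for_state(
#     lambda: client.get_endpoint(ENDPOINT_NAME).get("endpoint_status", {}).get("state")
# )
```

Passing the state lookup in as a callable keeps the loop independent of any particular client, which also makes it easy to exercise with a stub.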


## Create a Vector Search Index

Now we build a Vector Search index over the cleaned site content stored in the `silver_site` table.

In this section we:

- Select the source Delta table that contains `title`, `url`, and `content_md`
- Choose the primary key and which column(s) to embed
- Tell Databricks to automatically generate embeddings and build the index
- Wait for the index to finish refreshing

After the index is ready, you can:

- Run natural-language searches against your Mobi content
- Plug the index into a Genie room so your agent can perform **domain-aware question answering**
- Reuse the index for any RAG or retrieval workflows you build during the hackathon

> Tip: Think about what kind of questions your agent will answer.  
> Index only the content that helps it reason clearly and avoid hallucination.
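
One way to act on this tip is to filter pages before they reach the indexed table. The sketch below is illustrative only: the function name, the length threshold, the excluded URL fragments, and the sample URLs are all assumptions, and only the `url` and `content_md` field names come from the source table described above.

```python
def select_indexable_pages(pages, min_chars=200, skip_url_parts=("/login", "/cart")):
    """Keep pages with enough text whose URL contains none of the excluded fragments."""
    kept = []
    for page in pages:
        text = (page.get("content_md") or "").strip()
        url = page.get("url") or ""
        if len(text) < min_chars:
            continue  # too little content to be useful for retrieval
        if any(part in url for part in skip_url_parts):
            continue  # utility page, not documentation
        kept.append(page)
    return kept


# Hypothetical sample rows
sample_pages = [
    {"url": "https://example.com/fares", "content_md": "Fare details. " * 30},
    {"url": "https://example.com/login", "content_md": "Sign in to your account. " * 30},
    {"url": "https://example.com/empty", "content_md": "   "},
]

print([p["url"] for p in select_indexable_pages(sample_pages)])
```

In the notebook, the same rule could be expressed as a filter on the Delta table before the index is created.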


In [0]:
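index = client.create_delta_sync_index(
    endpoint_name=ENDPOINT_NAME,  # reuse the name defined above instead of repeating the literal
    source_table_name=f"{CATALOG}.{SCHEMA}.silver_site",
    index_name=f"{CATALOG}.{SCHEMA}.mobi_site_index",
    pipeline_type="TRIGGERED",  # sync runs on demand rather than continuously
    primary_key="site_page_id",
    embedding_source_column="content_md",  # text column that Databricks embeds
    embedding_model_endpoint_name="databricks-gte-large-en",
)

# Block until the initial sync finishes so the queries below have data to search
index.wait_until_ready()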

In [0]:
query = f"""
SELECT
  *,
  floor(site_page_id / 5) AS site_page_id_bin_5
FROM vector_search(
  index => '{CATALOG}.{SCHEMA}.mobi_site_index',
  query_text => 'What is Mobi?',
  num_results => 50,
  query_type => 'hybrid'
)
ORDER BY search_score DESC  -- most relevant results first
"""

df = spark.sql(query)
display(df)
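
The `hybrid` query type combines keyword matching with vector similarity. One common way to merge two ranked result lists is reciprocal rank fusion (RRF). The sketch below illustrates that general technique with made-up document ids; it is not a description of how Databricks implements `hybrid` internally.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one list, best first.

    Each id earns 1 / (k + rank) from every list it appears in, so ids that
    rank well in several lists rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties fall back to alphabetical id order
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from a keyword search and a vector search
keyword_hits = ["fares", "passes", "faq"]
vector_hits = ["fares", "about", "passes"]

print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
```

A document that both retrievers return outranks one that only a single retriever found at a similar position, which is the behavior you generally want from a hybrid search.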


In [0]:
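# Wrap the search in a SQL table function so Genie rooms and agents can call it as a tool
query = f"""
CREATE OR REPLACE FUNCTION `{CATALOG}`.`{SCHEMA}`.site_search(
  description STRING COMMENT 'Natural-language search over Mobi documents'
)
RETURNS TABLE (
  site_page_id INTEGER,
  title STRING,
  value STRING,
  search_score DOUBLE
)
COMMENT 'Returns the top three documents matching a hybrid semantic search.'
RETURN
-- Select explicit columns so the result matches the declared return schema
SELECT site_page_id, title, content_md AS value, search_score
FROM vector_search(
  index => '{CATALOG}.{SCHEMA}.mobi_site_index',
  query_text => description,
  num_results => 3,
  query_type => 'hybrid'
)
"""
spark.sql(query)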

In [0]:
query = f"""
SELECT * FROM `{CATALOG}`.`{SCHEMA}`.site_search('Trip Fares')
"""

df = spark.sql(query)
display(df)
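
If the search text comes from a user rather than a hard-coded string, it needs to be escaped before being placed into the SQL. The sketch below shows one way to build the query safely, assuming Spark's default string-literal escaping (backslash escapes enabled). The helper names and the `main` / `mobi` catalog and schema in the example are hypothetical.

```python
def sql_string_literal(text):
    """Quote text as a SQL string literal, escaping backslashes and single quotes."""
    escaped = text.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def build_site_search_query(catalog, schema, question):
    """Return a SELECT statement that calls the site_search function with the question."""
    return f"SELECT * FROM `{catalog}`.`{schema}`.site_search({sql_string_literal(question)})"


# Hypothetical catalog and schema names
print(build_site_search_query("main", "mobi", "What's a day pass?"))
```

In the notebook you would pass `CATALOG` and `SCHEMA` and hand the result to `spark.sql`. Where your runtime supports parameterized `spark.sql` calls, binding the question as a parameter is generally preferable to string building.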

## Next Steps

You now have:

- A Vector Search endpoint  
- A semantic index over Mobi documentation  
- Working query examples  

From here, you can:

- Add new pages to the index as your scraper expands  
- Build Genie tools that combine retrieval with the SQL + Python tools from `02_tools.ipynb`
- Create agent workflows that use the index to ground answers in real Mobi content

Vector Search is one of the fastest ways to give your hackathon project a real knowledge base without deploying infrastructure. Explore it, extend it, and integrate it wherever your team needs semantic understanding.
