# Databricks-Mosaic AI: GenAI + RAG Application on Databricks

<img src="https://github.com/itsmycoderepo/MyPortfolio/blob/main/Databricks-Mosaic%20AI%3A%20Building%20GenAI%20Solutions%20with%20Databricks/databricksMosaicAI.png?raw=true" width="1200px">

<br/>




## 1. Data preparation for RAG: building and indexing our knowledge base into Databricks Vector Search

## Step 1 - Set Up the Environment

In [0]:
%pip install -U --quiet databricks-sdk==0.28.0 databricks-agents mlflow-skinny mlflow mlflow[gateway] databricks-vectorsearch langchain==0.2.1 langchain_core==0.2.5 langchain_community==0.2.4
dbutils.library.restartPython()

## Step 2 - Prepare a Delta Table for your knowledge base

In [0]:
# Import necessary libraries
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Initialize Spark session (skip if running on Databricks, where it's already initialized)
spark = SparkSession.builder.appName("CreateManagedDeltaTable").getOrCreate()

# Define schema for the data
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("content", StringType(), True)
])

# Data to insert into the Delta table
data = [
    (1, "Microsoft Certified: Azure AI Engineer Associate\nStudy guide for Exam AI-102:\nSkills at a glance\nPlan and manage an Azure AI solution (15–20%)\nImplement content moderation solutions (10–15%)\nImplement computer vision solutions (15–20%)\nImplement natural language processing solutions (30–35%)\nImplement knowledge mining and document intelligence solutions (10–15%)\nImplement generative AI solutions (10–15%)"),
    (2, "Microsoft Certified: Azure Data Scientist Associate\nStudy guide for Exam DP-100:\nSkills at a glance\nDesign and prepare a machine learning solution (20–25%)\nExplore data, and train models (35–40%)\nPrepare a model for deployment (20–25%)\nDeploy and retrain a model (10–15%)"),
    (3, "Microsoft Certified: Azure Administrator Associate\nStudy guide for Exam AZ-104:\nSkills at a glance\nManage Azure identities and governance (20–25%)\nImplement and manage storage (15–20%)\nDeploy and manage Azure compute resources (20–25%)\nImplement and manage virtual networking (15–20%)\nMonitor and maintain Azure resources (10–15%)"),
    (4, "Microsoft Certified: Azure Solutions Architect Expert\nStudy guide for Exam AZ-305:\nSkills at a glance\nDesign identity, governance, and monitoring solutions (25–30%)\nDesign data storage solutions (20–25%)\nDesign business continuity solutions (15–20%)\nDesign infrastructure solutions (30–35%)"),
    (5, "Microsoft Certified: DevOps Engineer Expert\nStudy guide for Exam AZ-400:\nSkills at a glance\nDesign and implement processes and communications (10–15%)\nDesign and implement a source control strategy (10–15%)\nDesign and implement build and release pipelines (50–55%)\nDevelop a security and compliance plan (10–15%)\nImplement an instrumentation strategy (5–10%)")
]

# Create DataFrame with the specified schema and data
df = spark.createDataFrame(data, schema=schema)

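# Write DataFrame to a Delta managed table.
# The unqualified name resolves against the current catalog and schema;
# later cells refer to this table as mosaicai.default.azure_certification_guide.
table_name = "azure_certification_guide"
df.write.format("delta").mode("overwrite").saveAsTable(table_name)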

# Confirm the table creation
spark.sql(f"DESCRIBE TABLE {table_name}").show()

In [0]:
%sql
select * from azure_certification_guide

## Step 3 - Create a Vector Search Endpoint

In [0]:
from databricks.vector_search.client import VectorSearchClient
vector_search = VectorSearchClient(disable_notice=True)

VECTOR_SEARCH_ENDPOINT_NAME = "mosaicai-endpoint"

vector_search.create_endpoint(name=VECTOR_SEARCH_ENDPOINT_NAME, endpoint_type="STANDARD")
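
Provisioning an endpoint can take several minutes, and later cells fail if the endpoint is not ready yet. The following is a minimal polling sketch, not part of the original notebook. The helper name `wait_until_online` is hypothetical, and the assumed response shape (`{"endpoint_status": {"state": "ONLINE"}}`) should be checked against what `get_endpoint` returns in your client version, which may also ship its own wait helper.

```python
import time
from typing import Callable


def wait_until_online(
    get_endpoint: Callable[[], dict],
    timeout_s: float = 1200.0,
    poll_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll get_endpoint() until the endpoint reports ONLINE or the timeout expires.

    Assumption: the response is a dict shaped like
    {"endpoint_status": {"state": "ONLINE"}}.
    """
    waited = 0.0
    while True:
        info = get_endpoint()
        state = info.get("endpoint_status", {}).get("state", "UNKNOWN")
        if state == "ONLINE":
            return info
        if waited >= timeout_s:
            raise TimeoutError(f"Endpoint still in state {state!r} after {waited:.0f}s")
        sleep(poll_s)
        waited += poll_s


# Usage in the notebook (not run here):
# wait_until_online(lambda: vector_search.get_endpoint(VECTOR_SEARCH_ENDPOINT_NAME))
```

Passing the lookup as a callable keeps the helper independent of the client, so it can be exercised with a stub before pointing it at a real workspace.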

## Step 4 - Create a Vector Search Index

In [0]:
%sql
ALTER TABLE `mosaicai`.`default`.`azure_certification_guide` SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
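
With Change Data Feed enabled, the index can be created through the UI (as the blog describes) or, as a possible alternative, through the Python client. The sketch below only assembles the arguments for `create_delta_sync_index`. The helper `delta_sync_index_args` is hypothetical, the embedding endpoint name and the `TRIGGERED` pipeline type are assumptions, and the `_vs_index` suffix simply mirrors the index name queried later in this notebook.

```python
def delta_sync_index_args(
    endpoint_name: str,
    source_table: str,
    embedding_model_endpoint: str,
    primary_key: str = "id",
    text_column: str = "content",
    pipeline_type: str = "TRIGGERED",  # assumption; "CONTINUOUS" is the other option
) -> dict:
    """Build keyword arguments for VectorSearchClient.create_delta_sync_index."""
    return {
        "endpoint_name": endpoint_name,
        "index_name": f"{source_table}_vs_index",
        "source_table_name": source_table,
        "pipeline_type": pipeline_type,
        "primary_key": primary_key,
        "embedding_source_column": text_column,
        "embedding_model_endpoint_name": embedding_model_endpoint,
    }


args = delta_sync_index_args(
    endpoint_name="mosaicai-endpoint",
    source_table="mosaicai.default.azure_certification_guide",
    # Assumption: replace with an embedding endpoint available in your workspace.
    embedding_model_endpoint="databricks-bge-large-en",
)

# In the notebook (not run here):
# vector_search.create_delta_sync_index(**args)
```

Building the arguments separately makes it easy to inspect the index name and source column before anything is created.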

## Please come back here after you have created a Vector Search Index. Refer to the blog for detailed steps.

In [0]:
%sql
select * from mosaicai.default.azure_certification_guide_vs_index_writeback_table


## Step 4 (continued) - Query the Vector Search Index

In [0]:
question = "Is there any exam on AI?"

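# Look up the index on the endpoint created in Step 3, then run a similarity search
index = vector_search.get_index(VECTOR_SEARCH_ENDPOINT_NAME, "mosaicai.default.azure_certification_guide_vs_index")
results = index.similarity_search(
  query_text=question,
  columns=["content"],
  num_results=1)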
docs = results.get('result', {}).get('data_array', [])
display(docs)
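
For the RAG step, the raw `data_array` rows usually need to be turned into text that can be placed in a prompt. The sketch below assumes the response carries column names under `manifest.columns` and rows under `result.data_array`, with a similarity score appended as the last column. The helper names and the sample response are illustrative only, not output from a real index.

```python
from __future__ import annotations


def rows_as_dicts(results: dict) -> list[dict]:
    """Pair each row in data_array with the column names listed in the manifest."""
    columns = [col["name"] for col in results.get("manifest", {}).get("columns", [])]
    rows = results.get("result", {}).get("data_array", [])
    return [dict(zip(columns, row)) for row in rows]


def build_context(results: dict, text_column: str = "content") -> str:
    """Join the retrieved passages into a single context string for a prompt."""
    passages = [r[text_column] for r in rows_as_dicts(results) if r.get(text_column)]
    return "\n\n".join(passages)


# Hand-written stand-in for a similarity_search response (illustrative values only)
sample = {
    "manifest": {"columns": [{"name": "content"}, {"name": "score"}]},
    "result": {"row_count": 1, "data_array": [["Study guide for Exam AI-102", 0.5]]},
}

context = build_context(sample)
```

In the notebook, `build_context(results)` would take the place of the manual `results.get(...)` lookup when assembling the prompt.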

## Step 5 - Verify Incremental Auto Sync

In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Initialize Spark session (skip if running on Databricks, where it's already initialized)
spark = SparkSession.builder.appName("AppendDataToDeltaTable").getOrCreate()

# Define schema for the data (same as before)
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("content", StringType(), True)
])

# New data to append
new_data = [(6, "Microsoft Certified: Azure Developer Associate\nStudy guide for Exam AZ-204:\nSkills at a glance\nDevelop Azure compute solutions (25–30%)\nDevelop for Azure storage (15–20%)\nImplement Azure security (15–20%)\nMonitor, troubleshoot, and optimize Azure solutions (10–15%)\nConnect to and consume Azure services and third-party services (20–25%)")]

# Create DataFrame with the new data
new_df = spark.createDataFrame(new_data, schema=schema)

# Append the new DataFrame to the existing Delta table
table_name = "azure_certification_guide"
new_df.write.format("delta").mode("append").saveAsTable(table_name)
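
Appending the row only updates the Delta table; the check that matters is whether the index has picked it up. One way to verify this, sketched under assumptions, is to inspect the index description. The helper `is_synced` is hypothetical, and the `status.ready` and `status.indexed_row_count` fields are assumed from typical `describe()` output, so confirm them against your own index.

```python
def is_synced(description: dict, expected_rows: int) -> bool:
    """Return True when the index reports ready and has indexed at least expected_rows rows.

    Assumption: description looks like
    {"status": {"ready": True, "indexed_row_count": 6}}.
    """
    status = description.get("status", {})
    ready = bool(status.get("ready", False))
    indexed = status.get("indexed_row_count") or 0
    return ready and indexed >= expected_rows


# Usage in the notebook (not run here):
# index = vector_search.get_index(
#     VECTOR_SEARCH_ENDPOINT_NAME,
#     "mosaicai.default.azure_certification_guide_vs_index",
# )
# index.sync()  # only relevant if the index uses a TRIGGERED pipeline
# is_synced(index.describe(), expected_rows=6)
```

Once the check passes, re-running the similarity search with a question about AZ-204 should surface the newly appended document.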