#Step 0: 

You must use classic compute to complete this task. 

Recommended Compute Type: Classic Compute

Recommended Runtime: 16.4 ML

#Introduction
This notebook is apart of the DSA Blogpost<insert link>.

We will be preparing our sample data located in sample_pdf_sbc for embedding. In the last notebook, you should have successfully deployed the nomic-ai/colnomic-embed-multimodal-7b model. We will use the work you did there to generate the embeddings for the PDFs

After this notebook you will learn how to:
1. Convert PDFs to images using pdf2image
2. Convert images to embeddings using a deployed embedding model on Databricks Model Hosting 
3. Load the embeddings into a Databricks Vector Searcch Index 
4. Use the embedding model to embed your input text queries and use the generated embeddings to do a similarity search on the vector search index 

#Install your Dependencies

In [0]:
%pip install --upgrade mlflow databricks-vectorsearch databricks-sdk requests pdf2image pillow

[43mNote: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.[0m


In [0]:
%sh
sudo apt clean
sudo apt update --fix-missing -y
sudo apt-get install -y libpoppler-cpp-dev pkg-config poppler-utils







Hit:1 https://repos.azul.com/zulu/deb stable InRelease
Hit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64  InRelease
Hit:3 http://security.ubuntu.com/ubuntu noble-security InRelease
Hit:4 http://archive.ubuntu.com/ubuntu noble InRelease
Get:5 http://archive.ubuntu.com/ubuntu noble-updates InRelease [126 kB]
Hit:6 http://archive.ubuntu.com/ubuntu noble-backports InRelease
Get:7 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 Packages [1,412 kB]
Fetched 1,538 kB in 1s (1,092 kB/s)
Reading package lists...
Building dependency tree...
Reading state information...
75 packages can be upgraded. Run 'apt list --upgradable' to see them.


W: https://repos.azul.com/zulu/deb/dists/stable/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.
W: https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.


Reading package lists...
Building dependency tree...
Reading state information...
libpoppler-cpp-dev is already the newest version (24.02.0-1ubuntu9.4).
pkg-config is already the newest version (1.8.1-2build1).
poppler-utils is already the newest version (24.02.0-1ubuntu9.4).
0 upgraded, 0 newly installed, 0 to remove and 75 not upgraded.


In [0]:
dbutils.library.restartPython()

In [0]:
dbutils.widgets.removeAll()

In [0]:
from config import volume_label, volume_name, catalog, schema, model_name, model_endpoint_name, embedding_table_name, embedding_table_name_index, registered_model_name, vector_search_endpoint_name

dbutils.widgets.text("catalog", catalog, "catalog")
dbutils.widgets.text("schema", schema, "schema")
dbutils.widgets.text("volume_name", volume_label, "volume_name")
dbutils.widgets.text("embedding_table_name", embedding_table_name, "embedding_table_name")

In [0]:
volume_name = dbutils.widgets.get("volume_name")
catalog_name = dbutils.widgets.get("catalog")
schema_name = dbutils.widgets.get("schema")

In [0]:
import os
notebook_path = os.path.dirname(dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get())
workspace_folder = f"/Workspace{notebook_path}/sample_pdf_sbc"
files = os.listdir(workspace_folder)
file_paths_list = [os.path.join(workspace_folder, f) for f in files if f.endswith('.pdf')]
print(f"Found {len(file_paths_list)} PDF files:")
for pdf in file_paths_list:
    print(f"  - {pdf}")

Found 4 PDF files:
  - /Workspace/Users/austin.choi@databricks.com/GenAI/DSPy/DSA HLS Blog Code/DSA_HLS_Blog/2025-06-06-multi-modal-hls-DSA/sample_pdf_sbc/SBC_client4.pdf
  - /Workspace/Users/austin.choi@databricks.com/GenAI/DSPy/DSA HLS Blog Code/DSA_HLS_Blog/2025-06-06-multi-modal-hls-DSA/sample_pdf_sbc/SBC_client3.pdf
  - /Workspace/Users/austin.choi@databricks.com/GenAI/DSPy/DSA HLS Blog Code/DSA_HLS_Blog/2025-06-06-multi-modal-hls-DSA/sample_pdf_sbc/SBC_client1.pdf
  - /Workspace/Users/austin.choi@databricks.com/GenAI/DSPy/DSA HLS Blog Code/DSA_HLS_Blog/2025-06-06-multi-modal-hls-DSA/sample_pdf_sbc/SBC_client2.pdf


#Convert your PDFs into Images!

To use pdf2image, we do need install poppler properly on Databricks. You will first do this then use pdf2image to go page by page and creating embeddings for each page. 

We will represent the images as a base64 string

In [0]:
def install_poppler_on_nodes():
    """
    Install poppler on all cluster nodes
    """
    import subprocess
    import os
    
    try:
        subprocess.run(['apt-get', 'update'], check=True)
        subprocess.run(['apt-get', 'install', '-y', 'poppler-utils'], check=True)
        print("Poppler installed successfully")
    except subprocess.CalledProcessError as e:
        print(f"Error installing poppler: {e}")

sc.range(1).foreach(lambda x: install_poppler_on_nodes())

###A note on token length

Base64 strings can become extremely long if the image is too big or sized incorrectly. We do some compression in the code below and resize the image to ensure we don't overload our LLM. Keep this in mind as you add more images or other modalities. 

Also ensure your LLM supports a large enough context length where a large string like base64 won't interfere with performance. 

In [0]:
from pdf2image import convert_from_path
from PIL import Image
import base64
import io
import os
from pyspark.sql.types import StringType, StructType, StructField, IntegerType
from pyspark.sql.functions import col

def process_all_pdfs(pdf_paths):
    """
    Process all PDFs on driver node to avoid UDF distribution issues
    """
    all_pages = []
    
    def resize_image(image, max_short_dimension=768, max_long_dimension=2000):
        """Resize image while maintaining aspect ratio"""
        width, height = image.size
        
        if width > height:
            scaling_factor = min(max_long_dimension / width, max_short_dimension / height)
        else:
            scaling_factor = min(max_short_dimension / width, max_long_dimension / height)
        
        if scaling_factor < 1:
            new_width = int(width * scaling_factor)
            new_height = int(height * scaling_factor)
            return image.resize((new_width, new_height), Image.LANCZOS)
        
        return image
    
    for pdf_path in pdf_paths:
        try:
            if not os.path.exists(pdf_path):
                print(f"File not found: {pdf_path}")
                continue
            
            if not os.access(pdf_path, os.R_OK):
                print(f"File not readable: {pdf_path}")
                continue
            
            print(f"Processing: {pdf_path}")
            
            images = convert_from_path(
                pdf_path, 
                dpi=100,
                fmt='JPEG',
                poppler_path='/usr/bin'  
            )
            
            for i, image in enumerate(images):
                resized_image = resize_image(image)
                
                if resized_image.mode != 'RGB':
                    resized_image = resized_image.convert('RGB')
                
                quantized_image = resized_image.quantize(colors=256)
     
                quantized_image = quantized_image.convert('RGB')
                
                img_buffer = io.BytesIO()
                quantized_image.save(img_buffer, format='JPEG', quality=70, optimize=True)
                img_bytes = img_buffer.getvalue()
                

                base64_string = base64.b64encode(img_bytes).decode('utf-8')
                
                all_pages.append({
                    'pdf_path': pdf_path,
                    'page_number': i + 1,
                    'base64_image': base64_string
                })
            
            print(f"Successfully processed {len(images)} pages from {pdf_path}")
            
        except Exception as e:
            print(f"Error processing {pdf_path}: {str(e)}")
            import traceback
            traceback.print_exc()
            continue
    
    return all_pages
  

print(f"Processing {len(file_paths_list)} PDFs...")
all_page_data = process_all_pdfs(file_paths_list)
print(f"Total pages processed: {len(all_page_data)}")


pdf_schema = StructType([
    StructField("pdf_path", StringType(), True),
    StructField("page_number", IntegerType(), True),
    StructField("base64_image", StringType(), True)
])

df_pages = spark.createDataFrame(all_page_data, pdf_schema)

Processing 4 PDFs...
Processing: /Workspace/Users/austin.choi@databricks.com/GenAI/DSPy/DSA HLS Blog Code/DSA_HLS_Blog/2025-06-06-multi-modal-hls-DSA/sample_pdf_sbc/SBC_client4.pdf
Successfully processed 5 pages from /Workspace/Users/austin.choi@databricks.com/GenAI/DSPy/DSA HLS Blog Code/DSA_HLS_Blog/2025-06-06-multi-modal-hls-DSA/sample_pdf_sbc/SBC_client4.pdf
Processing: /Workspace/Users/austin.choi@databricks.com/GenAI/DSPy/DSA HLS Blog Code/DSA_HLS_Blog/2025-06-06-multi-modal-hls-DSA/sample_pdf_sbc/SBC_client3.pdf
Successfully processed 5 pages from /Workspace/Users/austin.choi@databricks.com/GenAI/DSPy/DSA HLS Blog Code/DSA_HLS_Blog/2025-06-06-multi-modal-hls-DSA/sample_pdf_sbc/SBC_client3.pdf
Processing: /Workspace/Users/austin.choi@databricks.com/GenAI/DSPy/DSA HLS Blog Code/DSA_HLS_Blog/2025-06-06-multi-modal-hls-DSA/sample_pdf_sbc/SBC_client1.pdf
Successfully processed 7 pages from /Workspace/Users/austin.choi@databricks.com/GenAI/DSPy/DSA HLS Blog Code/DSA_HLS_Blog/2025-06-0

In [0]:

df_pages.write \
    .format("delta") \
    .mode("overwrite") \
    .saveAsTable(f"{catalog}.{schema}.colnomic_dsa_pdf_pages")

df_pages.head()

Row(pdf_path='/Workspace/Users/austin.choi@databricks.com/GenAI/DSPy/DSA HLS Blog Code/DSA_HLS_Blog/2025-06-06-multi-modal-hls-DSA/sample_pdf_sbc/SBC_client4.pdf', page_number=1, base64_image='/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAoHBwgHBgoICAgLCgoLDhgQDg0NDh0VFhEYIx8lJCIfIiEmKzcvJik0KSEiMEExNDk7Pj4+JS5ESUM8SDc9Pjv/2wBDAQoLCw4NDhwQEBw7KCIoOzs7Ozs7Ozs7Ozs7Ozs7Ozs7Ozs7Ozs7Ozs7Ozs7Ozs7Ozs7Ozs7Ozs7Ozs7Ozs7Ozv/wAARCAMAA+EDASIAAhEBAxEB/8QAGwAAAgIDAQAAAAAAAAAAAAAAAAUEBgECAwf/xABbEAABAwMCAwQECAkKBAQDBQkBAgMEAAUREiEGEzEUIkFRFVVh0RYyVHGBkZPTByNSdJKUobPSFyQzNDVCU6KjsjZic7ElQ3LBRYLhJkRjZIOV8HWEpMPxwif/xAAZAQEBAQEBAQAAAAAAAAAAAAAAAQIDBAX/xAAxEQEAAgEDAwMDAgcBAAMBAAAAARECAxIhEzFRFEHwBGHRobEycYGRweHxIiMzQlL/2gAMAwEAAhEDEQA/APZqKKKAooooCiiigKKKKAooooCiiigKKKKAooooCiiigKKKKAooooCiiigKKKKAooooCiiigKKKKAooooCiiigKKKKAooooCiiigKKKKAooooCiiigKKKKAooooCiiigKKKKAooooCiiigKKKKAooooCiiigKKKKAooooCiiigKKKKAooooCiiigKKKKAooooCiiigKKKKAooooCiiigKKKxQZooooCiiigKKxWaAooooCiisUGaKxWaAorFZoCiiigKKKxQZooooCiiigKKKxQZorFFBm

#Generate Embeddings

Now that we have a list of images, we can create our embeddings with the ColNomic Embedding Model that we have hosted. We will use the same mlflow.deployments client that we used in the first notebook to create the embeddings. 

Once complete, it will be saved into a Delta Table

In [0]:
import mlflow.deployments

client = mlflow.deployments.get_deploy_client("databricks")

endpoint_name = model_endpoint_name
databricks_instance = dbutils.entry_point.getDbutils().notebook().getContext().browserHostName().get()
endpoint_url = f"https://{databricks_instance}/ml/endpoints/{endpoint_name}"
print(f"Endpoint URL: {endpoint_url}")

Endpoint URL: https://e2-demo-field-eng.cloud.databricks.com/ml/endpoints/colNomic-embedding-generation


In [0]:
import requests
import json
import time
from pyspark.sql.functions import col
from pyspark.sql.types import ArrayType, FloatType, StructType, StructField, StringType, IntegerType
import mlflow.deployments

def process_embeddings(df_pages, batch_size=10):
    """
    Process embeddings on driver node with batching and rate limiting
    """
    client = mlflow.deployments.get_deploy_client("databricks")
    
    pages_data = df_pages.select("pdf_path", "page_number", "base64_image").collect()
    
    print(f"Processing {len(pages_data)} pages for embeddings...")
    
    results = []
    record_id = 1
    for i in range(0, len(pages_data), batch_size):
        batch = pages_data[i:i+batch_size]
        print(f"Processing batch {i//batch_size + 1}/{(len(pages_data) + batch_size - 1)//batch_size}")
        
        for row in batch:
            try:

                response = client.predict(
                    endpoint=model_endpoint_name,
                    inputs={
                        "dataframe_split": {
                            "columns": ["image_base64"],
                            "data": [[row.base64_image]]
                        }
                    }
                )
                
                embedding = response['predictions']['predictions']['embedding']
                
                results.append({
                    'id': record_id,
                    'pdf_path': row.pdf_path,
                    'page_number': row.page_number,
                    'base64_image': row.base64_image,
                    'embeddings': embedding
                })
                
            except Exception as e:
                print(f"Error getting embedding for {row.pdf_path} page {row.page_number}: {e}")
                results.append({
                    'id': record_id,
                    'pdf_path': row.pdf_path,
                    'page_number': row.page_number,
                    'base64_image': row.base64_image,
                    'embeddings': [0.0] * 768 
                })
            record_id += 1 
        
        
        time.sleep(0.5)
    
    return results

embedding_schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("pdf_path", StringType(), True),
    StructField("page_number", IntegerType(), True),
    StructField("base64_image", StringType(), True),
    StructField("embeddings", ArrayType(FloatType()), True)
])

embedding_results = process_embeddings(df_pages, batch_size=5) 

df_with_embeddings = spark.createDataFrame(embedding_results, embedding_schema)

print(f"Generated embeddings for {df_with_embeddings.count()} pages")
df_with_embeddings.show(5)

Processing 25 pages for embeddings...
Processing batch 1/5
Processing batch 2/5
Processing batch 3/5
Processing batch 4/5
Processing batch 5/5
Generated embeddings for 25 pages
+---+--------------------+-----------+--------------------+--------------------+
| id|            pdf_path|page_number|        base64_image|          embeddings|
+---+--------------------+-----------+--------------------+--------------------+
|  1|/Workspace/Users/...|          1|/9j/4AAQSkZJRgABA...|[-0.037353516, -0...|
|  2|/Workspace/Users/...|          2|/9j/4AAQSkZJRgABA...|[-0.030029297, -0...|
|  3|/Workspace/Users/...|          3|/9j/4AAQSkZJRgABA...|[-0.034423828, -0...|
|  4|/Workspace/Users/...|          4|/9j/4AAQSkZJRgABA...|[-0.028198242, 0....|
|  5|/Workspace/Users/...|          5|/9j/4AAQSkZJRgABA...|[-0.039794922, -0...|
+---+--------------------+-----------+--------------------+--------------------+
only showing top 5 rows


In [0]:
df_with_embeddings.write \
    .format("delta") \
    .mode("overwrite") \
    .option("mergeSchema", "true") \
    .saveAsTable(f"{catalog}.{schema}.{embedding_table_name}")

#Create the Vector Search Index

Now that we have a Delta Table with a column containing our embeddings, we can create a vector search index. 

You will need to create a vector serach endpoint. You can do this in the compute section of your workspace or programatically below. 

In [0]:
from databricks.vector_search.client import VectorSearchClient

vs_client = VectorSearchClient()

vector_search_endpoint_name = vector_search_endpoint_name
index_name = embedding_table_name_index

[NOTICE] Using a notebook authentication token. Recommended for development only. For improved performance, please use Service Principal based authentication. To disable this message, pass disable_notice=True.


In [0]:
%sql
ALTER TABLE identifier(CONCAT(:catalog||'.'||:schema||'.'||:embedding_table_name)) SET TBLPROPERTIES (delta.enableChangeDataFeed = true)

In [0]:
try:
  vs_client.create_endpoint_and_wait(name=vector_search_endpoint_name)
except Exception as e: 
  if "ALREADY_EXISTS" in str(e):
        print(f"Vector search endpoint '{vector_search_endpoint_name}' already exists")
  else:
      print(f"Error creating vector search index: {e}")
      raise 

Vector search endpoint 'one-env-shared-endpoint-4' already exists


In [0]:
try:
    vs_client.create_delta_sync_index_and_wait(
        index_name=f"{catalog}.{schema}.{index_name}",
        endpoint_name=vector_search_endpoint_name,
        pipeline_type="TRIGGERED",
        primary_key="id",
        embedding_vector_column="embeddings",
        source_table_name=f"{catalog}.{schema}.{embedding_table_name}",
        embedding_dimension=128
    )

    print(f"Vector search index '{index_name}' created on endpoint '{endpoint_name}'")
except Exception as e:
    if "RESOURCE_ALREADY_EXISTS" in str(e):
        print(f"Vector search index '{index_name}' already exists on endpoint '{vector_search_endpoint_name}'")
    else:
        print(f"Error creating vector search index: {e}")
        raise 

Vector search index 'colNomic_DSA_embedding_index' created on endpoint 'colNomic-embedding-generation'


In [0]:
index = vs_client.get_index(endpoint_name=vector_search_endpoint_name, index_name=f"{catalog}.{schema}.{index_name}")

#Test the Model Serving and Vector Search Endpoints

Now that we have both our Model Serving Endpoint and Vector Search Endpoint, we should test to see if they work! 

Remember, the vector search needs embeddings generated from the same embedding space to do the similarity search. We cannot just pass text in and except the vector search to find a match. We must first convert that text into embeddings then take those generated embeddings into the vector search endpoint. 

Should see the base64 string we created of the pdf images

In [0]:
input_query = "hello there"

response = client.predict(
            endpoint=model_endpoint_name,
            inputs={"dataframe_split": {
                    "columns": ["text"],
                    "data": [[input_query]]
                    }
            }
          )

query_embedding = response['predictions']['predictions']['embedding']
results = index.similarity_search(num_results=5, columns=["base64_image"], query_vector=query_embedding)
print(results)
base64_test_retrieved = results['result']['data_array'][0][0]

[NOTICE] Using a notebook authentication token. Recommended for development only. For improved performance, please use Service Principal based authentication. To disable this message, pass disable_notice=True.
{'manifest': {'column_count': 2, 'columns': [{'name': 'base64_image'}, {'name': 'score'}]}, 'result': {'row_count': 5, 'data_array': [['/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAoHBwgHBgoICAgLCgoLDhgQDg0NDh0VFhEYIx8lJCIfIiEmKzcvJik0KSEiMEExNDk7Pj4+JS5ESUM8SDc9Pjv/2wBDAQoLCw4NDhwQEBw7KCIoOzs7Ozs7Ozs7Ozs7Ozs7Ozs7Ozs7Ozs7Ozs7Ozs7Ozs7Ozs7Ozs7Ozs7Ozs7Ozs7Ozv/wAARCAMAA+EDASIAAhEBAxEB/8QAGwABAAIDAQEAAAAAAAAAAAAAAAUGAQMEAgf/xABZEAABAwMCAwUEBQgGBgYIBQUBAgMEAAUREiEGEzEUIkFRYRUycZQjVIGR0gdCUlWho9HTFiQzYpOxNFN0grLBNTZDcpLDFyVERaLh8PFjZHN1hLODtOLC/8QAGQEBAQEBAQEAAAAAAAAAAAAAAAECAwQF/8QAKxEBAAIBAwQCAwABBQEBAAAAAAERAhIhUQMTMWEiQbHR8KEEcYHB4fEU/9oADAMBAAIRAxEAPwD7NSlKBSlKBSlKBSsUoM0rFKDNKVigzSsUoM0rFKDNKxWaBSsUoM0rFKDNKxSgzSsVmgUrFM0GaVis0ClKUClYpQZpSsUGaVilBmlYrNApSlApSlApSlApSlApSlApSlApSlApSlApSlApSl

We have our vector search tool! Let's move on to creating synthetic data for our genie space in the 3rd notebook