The tutorial is divided into two parts:

*   **Part 1: Feature Engineering with LanceDB and Geneva**: In this part, we'll focus on the crucial process of feature engineering. We'll use LanceDB and its Geneva feature engineering framework to enrich our data with meaningful features that will power our search engine.

*   **Part 2: Inference and Retrieval with LanceDB**: In this part, we'll build the inference and retrieval pipeline that uses the features we engineered in Part 1 to provide a powerful and intuitive search experience. We'll cover query routing, hybrid search, and reranking to build a state-of-the-art search engine. 

## Part 1: Feature Engineering with LanceDB and Geneva

This notebook is the first part of our tutorial on building an advanced product search engine. In this part, we will focus on the crucial process of feature engineering. We'll start with a raw dataset of fashion products and ingest it into LanceDB. We'll then use Geneva to enrich our data with meaningful features that will power our search engine.

We will cover the following steps:
1. **Data Ingestion**: Downloading a fashion dataset and loading it into a LanceDB table.
2. **Declarative Feature Engineering**: Using Geneva to define and compute features on-the-fly.
3. **LLM-based Feature Generation**: Using a Large Language Model (LLM) to generate rich, descriptive features.
4. **Embedding Generation**: Creating vector embeddings for both images and text to enable semantic search.
5. **Indexing**: Creating indexes to speed up our queries.

In [None]:
!pip install --upgrade geneva lancedb google-genai kubernetes "ray[default]" rerankers -q

## 1. Data Ingestion

First, let's download our dataset. We're using a small version of the Fashion Product Images dataset from Kaggle. This dataset contains images and metadata for a variety of fashion products.

In [29]:
!rm -rf db fashion-dataset  # Remove leftovers from a previous run, if any

# High-resolution dataset (slow to download)
#!curl -L -o fashion-product-images-dataset.zip\
#  https://www.kaggle.com/api/v1/datasets/download/paramaggarwal/fashion-product-images-dataset

#!unzip -q fashion-product-images-dataset.zip

# Same dataset with low-resolution images

!curl -L -o fashion-product-images-small.zip\
  https://www.kaggle.com/api/v1/datasets/download/paramaggarwal/fashion-product-images-small
!unzip -q fashion-product-images-small.zip -d fashion-dataset/



  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  565M  100  565M    0     0   217M      0  0:00:02  0:00:02 --:--:--  249M


## Set Scale Based on Your Environment

This example runs Geneva locally by default, which means the number of concurrent jobs is limited by the machine you're working on. Set these parameters based on your CPU/GPU and memory configuration.

In [36]:
# Update these values based on your environment

DATASET_SIZE = 44200  # Reduce to 100-1000 for testing. Warning: small samples may give poor query results
CONCURRENCY = 18  # Reduce to 4 on Colab
BATCH_SIZE = 3000
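
If you are unsure what to pick for `CONCURRENCY`, one rough starting point is to derive it from the number of CPU cores. The sketch below is only a heuristic, not a Geneva recommendation; `suggest_concurrency` and its `cap` and `reserve` parameters are hypothetical names introduced for illustration.

```python
import os

def suggest_concurrency(cap: int = 18, reserve: int = 1) -> int:
    """Heuristic: use all CPU cores minus a small reserve, clamped to [1, cap]."""
    cores = os.cpu_count() or 1  # cpu_count() can return None
    return max(1, min(cap, cores - reserve))

print(suggest_concurrency())       # never more than the notebook default of 18
print(suggest_concurrency(cap=4))  # e.g. for a small Colab instance
```

GPU memory is not taken into account here, so treat the result as an upper bound to tune down if workers run out of memory.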


In [37]:
import os
import io
import geneva
import lancedb
import concurrent.futures


import pandas as pd
import geneva as gv
import pyarrow as pa

from pathlib import Path
from PIL import Image
from google import genai


import torch
from transformers import CLIPProcessor, CLIPModel

# Read the key from the environment; never hard-code API keys in a notebook
GEMINI_API_KEY = os.environ["GEMINI_API_KEY"]


IMG_DIR = Path("fashion-dataset/images")
STYLE_CSV = Path("fashion-dataset/styles.csv")
DB_PATH = "./db"
TABLE_NAME = "products"
INSERT_FRAG_SIZE = 10000

Now, let's load the data into a LanceDB table. We'll read the CSV file with the product metadata, and for each product, we'll also load the corresponding image from the `images` directory. We'll then create a LanceDB table and add the data to it in batches. LanceDB can store objects (images, in this case) alongside vector embeddings and metadata.

In [38]:
df = pd.read_csv(STYLE_CSV, on_bad_lines='skip')
df = df.dropna(subset=["id", "productDisplayName"])    
df = df.drop_duplicates(subset=["id"], keep="first")    
df = df.sample(min(DATASET_SIZE, len(df)))  # set DATASET_SIZE to 100 for testing
print(len(df))

def generate_rows(df, img_dir):
    for _, row in df.iterrows():
        img_path = img_dir / f"{row['id']}.jpg"
        if not img_path.exists():
            continue
        with open(img_path, "rb") as f:
            yield {
                "id": int(row["id"]),
                "description": row["productDisplayName"],
                "image_bytes": f.read()
            }

db = lancedb.connect(DB_PATH)
if TABLE_NAME in db.table_names():
    db.drop_table(TABLE_NAME)
    
data_stream = generate_rows(df, IMG_DIR)
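table = None

rows = []
for row in data_stream:
    rows.append(row)
    if len(rows) == INSERT_FRAG_SIZE:
        if table is None:
            table = db.create_table(TABLE_NAME, data=rows)
        else:
            table.add(rows)
        rows = []

# Flush the final partial batch; create the table here if no full batch was written
if rows:
    if table is None:
        table = db.create_table(TABLE_NAME, data=rows)
    else:
        table.add(rows)

len(table)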

44200


44195
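
The batching loop above can also be factored into a small reusable helper. The sketch below is a generic chunking function built on `itertools.islice`; the name `chunked` is hypothetical, and a plain `range` stands in for `generate_rows(df, IMG_DIR)`.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items from any iterable."""
    if size <= 0:
        raise ValueError("size must be positive")
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

# Toy stand-in for the row generator
batches = list(chunked(range(7), 3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

With such a helper, the first chunk would go to `db.create_table` and every later chunk to `table.add`, without ever holding more than one chunk of images in memory.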

## 2. Feature Engineering with Geneva

Now that we have our data in a LanceDB table, we can start engineering features. We'll use Geneva to create new features for our products. 

### Defining Geneva UDFs

Geneva uses Python User Defined Functions (UDFs) to define features as columns in a Lance dataset. Adding a feature is straightforward:

1. Prototype your Python function in your favorite environment.
2. Wrap the function with a small UDF decorator.
3. Register the UDF as a virtual column using `Table.add_columns()`.
4. Trigger a backfill operation.

There are various kinds of UDFs you can use, depending on the task type:

* **Row-level, stateless UDFs** - Use these when your tasks don't benefit from batch processing and don't require expensive setup.
* **Row-level, stateful UDFs** - Use these when your tasks don't benefit from batch processing but do require expensive setup that should be reused across calls.
* **Batched, stateless UDFs** - Use these when batch processing is faster and you don't require expensive setup.
* **Batched, stateful UDFs** - Use these when batch processing is faster AND you require expensive setup (like loading a model) that should be reused across batches.

Read more about Geneva UDFs here - TODO: Add new docs link

In this example, we'll use three of these kinds: `color_tags` is a row-level, stateless UDF, the LLM-based taggers are batched, stateless UDFs, and the embedding generators are batched, stateful UDFs.

NOTE: cuda=True means this UDF is meant to run on GPU nodes
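
To make the stateless/stateful distinction concrete, here is a minimal plain-Python sketch of the two patterns. It deliberately does not use Geneva; `ExpensiveResource`, `stateless_batch_fn`, and `StatefulBatchFn` are hypothetical names, and the "model" is just a counter that records how often it is loaded.

```python
class ExpensiveResource:
    """Stand-in for a model; counts how many times it gets 'loaded'."""
    loads = 0

    def __init__(self):
        ExpensiveResource.loads += 1

    def predict(self, xs):
        return [x * 2 for x in xs]


def stateless_batch_fn(xs):
    # Stateless: anything needed is created inside the call, every time
    return ExpensiveResource().predict(xs)


class StatefulBatchFn:
    # Stateful: setup runs lazily on the first call and is reused afterwards
    def __init__(self):
        self.resource = None

    def __call__(self, xs):
        if self.resource is None:
            self.resource = ExpensiveResource()
        return self.resource.predict(xs)


batches = [[1, 2], [3, 4], [5, 6]]

for b in batches:
    stateless_batch_fn(b)
stateless_loads = ExpensiveResource.loads   # one load per batch

ExpensiveResource.loads = 0
fn = StatefulBatchFn()
results = [fn(b) for b in batches]
stateful_loads = ExpensiveResource.loads    # a single load in total

print(stateless_loads, stateful_loads)  # 3 1
```

The embedding UDFs later in this notebook follow the second pattern: a class with a lazy `setup()` that loads the model once per worker.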

### Simple Feature Extraction

Let's start with a simple feature: extracting color tags from the product description. We'll define a User-Defined Function (UDF) that takes the product description as input and returns a comma-separated string of colors found in the description.

In [39]:
db = geneva.connect(DB_PATH)
if TABLE_NAME in db.table_names():
    table = db[TABLE_NAME]

In [40]:
@gv.udf
def color_tags(description: str) -> str:
    colors = ["black", "white", "red", "blue", "green", "yellow", "pink", "brown"]
    return " , ".join([c for c in colors if c in description.lower()])
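
One caveat with plain substring matching is that a color name can match inside a longer word. The sketch below compares the logic above with a word-boundary variant; `color_tags_whole_word` is a hypothetical alternative shown for illustration, not part of the original notebook.

```python
import re

COLORS = ["black", "white", "red", "blue", "green", "yellow", "pink", "brown"]

def color_tags_substring(description: str) -> str:
    # Same logic as the UDF above, without the decorator
    return " , ".join(c for c in COLORS if c in description.lower())

def color_tags_whole_word(description: str) -> str:
    # \b anchors each color to word boundaries, so "checkered" no longer matches "red"
    text = description.lower()
    return " , ".join(c for c in COLORS if re.search(rf"\b{c}\b", text))

sample = "Men Checkered Blue Shirt"
print(repr(color_tags_substring(sample)))   # 'red , blue'
print(repr(color_tags_whole_word(sample)))  # 'blue'
```

Whether the stricter variant is worth it depends on the data; for short product names the simple version is often good enough.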

### LLM-based Feature Generation

Now, let's create some more complex features using a Large Language Model (LLM). We'll use the Gemini API to generate two new features:

*   `occasion`: A description of the most suitable occasion(s) to wear the product.
*   `summary`: A summary of the product and what it would go well with.

We'll define two UDFs, `occasion_tagger` and `pair_summarizer`, that call the Gemini API to generate these features. These UDFs take a batch of product descriptions as input and return a batch of generated features.

In [41]:
@gv.udf(data_type=pa.string())
def occasion_tagger(batch: pa.RecordBatch) -> pa.Array:
    _gemini = genai.Client(api_key=GEMINI_API_KEY)
    descriptions = batch.column("description").to_pylist()

    def call(desc: str) -> str | None:
        prompt = (
            "Based on the following product description, describe the most suitable "
            f"occasion(s) to wear this product in <=25 words: {desc}"
        )
        resp = _gemini.models.generate_content(
            model="gemini-2.5-flash-lite",
            contents=prompt,
            config={"temperature": 0.0},
        )
        return resp.text.strip() if resp.text else None

    # Fan the per-row API calls out over threads; map() preserves input order
    with concurrent.futures.ThreadPoolExecutor(max_workers=80) as executor:
        occasions = list(executor.map(call, descriptions))

    return pa.array(occasions, type=pa.string())

    

@gv.udf(data_type=pa.string())
def pair_summarizer(batch: pa.RecordBatch) -> pa.Array:
    _gemini = genai.Client(api_key=GEMINI_API_KEY)
    descriptions = batch.column("description").to_pylist()
    
    def call(desc: str) -> str | None:
        resp = _gemini.models.generate_content(
            model="gemini-2.5-flash-lite",
            contents=(
                "Summarize the product and the kinds of products (type, color, texture, etc.) "
                "it would go well with. Return only the summary, as a single string "
                f"of <=25 words: {desc}"
            ),
            config={"temperature": 0.0},
        )
        return resp.text.strip() if resp.text else None

    # Fan the per-row API calls out over threads; map() preserves input order
    with concurrent.futures.ThreadPoolExecutor(max_workers=80) as executor:
        summaries = list(executor.map(call, descriptions))
    
    return pa.array(summaries, type=pa.string())
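
Both UDFs rely on the thread pool returning results in the same order as the input descriptions, so that each generated value lines up with its row. The sketch below checks that property with a stand-in function; `fake_llm_call` is hypothetical and simply sleeps for different durations so that calls finish out of order.

```python
import concurrent.futures
import time

def fake_llm_call(desc: str):
    # Stand-in for the Gemini request; varying sleeps make calls finish out of order
    time.sleep(0.01 * (5 - len(desc) % 5))
    return f"summary of {desc}" if desc else None

descriptions = ["red shirt", "", "blue shoes", "watch"]

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    # map() yields results in input order, regardless of completion order
    results = list(executor.map(fake_llm_call, descriptions))

print(results)
# ['summary of red shirt', None, 'summary of blue shoes', 'summary of watch']
```

The `None` entry shows how an empty response stays aligned with its row; in the UDFs above it becomes a null in the returned Arrow array.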

### Adding Virtual Columns

Now that we've defined our feature-generating UDFs, we can add them to our table as virtual columns. `add_columns` only registers each UDF in the table schema; the column values are computed and stored when we trigger a backfill in the next step.

In [42]:
table.add_columns({
    "color_tags": color_tags,
    "occasion": occasion_tagger,
    "summary": pair_summarizer
})

INFO:geneva.table:Adding column: udf={'color_tags': UDF(func=<function color_tags at 0x7fb4db0d53f0>, name='color_tags', cuda=False, num_cpus=1.0, memory=None, batch_size=None, input_columns=['description'], data_type=DataType(string), version='73e1cc91747455d662b6c786a2691d49', checkpoint_key='color_tags:73e1cc91747455d662b6c786a2691d49', field_metadata={})}
INFO:geneva.table:Adding column: udf={'occasion': UDF(func=<function occasion_tagger at 0x7fb4db0d5090>, name='occasion_tagger', cuda=False, num_cpus=1.0, memory=None, batch_size=None, input_columns=None, data_type=DataType(string), version='e8a70b02163bc2c5557e62588398168b', checkpoint_key='occasion_tagger:e8a70b02163bc2c5557e62588398168b', field_metadata={})}
INFO:geneva.table:Adding column: udf={'summary': UDF(func=<function pair_summarizer at 0x7fb98e2d9630>, name='pair_summarizer', cuda=False, num_cpus=1.0, memory=None, batch_size=None, input_columns=None, data_type=DataType(string), version='803781f5f76f0b435f2e2e065b761c9f'

Let's inspect the table schema to see our new virtual columns.

In [43]:
table.schema

id: int64
description: string
image_bytes: binary
color_tags: string
  -- field metadata --
  virtual_column.udf_backend: 'DockerUDFSpecV1'
  virtual_column.udf: '_udfs/f93df63338bb362e1876cbb6dd26f57824969176b33d' + 20
  virtual_column.platform.python_version: '3.10.18'
  virtual_column.udf_inputs: '["description"]'
  virtual_column.udf_name: 'color_tags'
  virtual_column: 'true'
  virtual_column.platform.arch: 'x86_64'
  virtual_column.platform.system: 'Linux'
occasion: string
  -- field metadata --
  virtual_column: 'true'
  virtual_column.platform.python_version: '3.10.18'
  virtual_column.udf_name: 'occasion_tagger'
  virtual_column.udf: '_udfs/88cf583f47aa6371eb7fbe83c74c101a3312275cbd4b' + 20
  virtual_column.udf_backend: 'DockerUDFSpecV1'
  virtual_column.platform.arch: 'x86_64'
  virtual_column.platform.system: 'Linux'
  virtual_column.udf_inputs: 'null'
summary: string
  -- field metadata --
  virtual_column.platform.python_version: '3.10.18'
  virtual_column: 'true'
  virtua

### Backfilling Features

Triggering backfill creates a distributed job to run the UDF and populate the column values in your LanceDB table. The Geneva framework simplifies several aspects of distributed execution.

Environment management: Geneva automatically packages and deploys your Python execution environment to worker nodes. This ensures that distributed execution occurs in the same environment, with the same dependencies, as your prototype.

Checkpoints: Each batch of UDF execution is checkpointed so that partial results are not lost in case of job failures. Jobs can resume and avoid most of the expense of having to recalculate values.

`backfill` accepts various parameters to customize the scale of your workload. Here we'll use:

* **batch_size** - determines the inference batch size
* **concurrency** - determines how many nodes are used for parallelization

Here, we're using Geneva locally, so we won't set up a Ray cluster, but you can use the same setup to run distributed jobs remotely on Ray clusters.

In [44]:
table.backfill("color_tags", batch_size=BATCH_SIZE, concurrency=CONCURRENCY)

Cluster nodes provisioned: |           0 [00:00]

Workers scheduled: |           0 [00:00]

[90m[[0m2025-07-30T13:40:31Z [33mWARN [0m lance::dataset::write::insert[90m][0m No existing dataset at /home/jupyter/semantic_router/final-version/db/geneva_jobs.lance, it will be created


Workers started: 0it [00:00, ?it/s]

Batches checkpointed:   0%|          | 0/18 [00:00<?, ?it/s]

Fragments written:   0%|          | 0/5 [00:00<?, ?it/s]

'8e2b4d9a-0ff1-4105-bdc6-f046de7cee93'
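
The progress bars above report 18 batches and 5 fragments. One way to arrive at those numbers is the arithmetic below. It assumes that each insert during ingestion wrote one fragment and that backfill batches do not span fragments; this matches the output of this run but is an assumption, not a documented guarantee. `fragment_sizes` and `batches_per_run` are hypothetical helpers.

```python
import math

def fragment_sizes(total_rows: int, frag_size: int) -> list[int]:
    """Rows per fragment when data is written in chunks of `frag_size`."""
    full, rest = divmod(total_rows, frag_size)
    return [frag_size] * full + ([rest] if rest else [])

def batches_per_run(total_rows: int, frag_size: int, batch_size: int) -> int:
    # Assumption: each fragment is split into batches on its own
    return sum(math.ceil(n / batch_size) for n in fragment_sizes(total_rows, frag_size))

sizes = fragment_sizes(44195, 10000)
print(sizes)                                 # [10000, 10000, 10000, 10000, 4195]
print(batches_per_run(44195, 10000, 3000))   # 18
```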

In [45]:
table.backfill("occasion", batch_size=BATCH_SIZE, concurrency=CONCURRENCY, where="1=1")

Cluster nodes provisioned: |           0 [00:00]

Workers scheduled: |           0 [00:00]

Workers started: 0it [00:00, ?it/s]

Batches checkpointed: 0it [00:00, ?it/s]

Fragments written:   0%|          | 0/5 [00:00<?, ?it/s]

'2dfbdff7-681f-4381-a395-d8a434f21d68'

In [46]:
table.backfill("summary", batch_size=BATCH_SIZE, concurrency=CONCURRENCY, where="1=1")

Cluster nodes provisioned: |           0 [00:00]

Workers scheduled: |           0 [00:00]

Workers started: 0it [00:00, ?it/s]

Batches checkpointed:   0%|          | 0/18 [00:00<?, ?it/s]

Fragments written:   0%|          | 0/5 [00:00<?, ?it/s]

'b114c3aa-7b4e-47fb-85cf-fe783871cca3'

Let's take a look at our enriched data.

In [47]:
table.search().limit(3).to_pandas()

Unnamed: 0,id,description,image_bytes,color_tags,occasion,summary
0,33821,Puma Men Axis Blue & Black Sports Shoes,b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x01...,"black , blue",These Puma sports shoes are perfect for casual...,"Versatile blue and black Puma sports shoes, pe..."
1,43487,French Connection Men Navy Blue Check Shirt,b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00...,blue,"Perfect for casual outings, weekend brunches, ...",A stylish navy blue men's check shirt from Fre...
2,8108,Fastrack Men Tween Analg Black Watch,b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00...,black,"This watch is perfect for everyday wear, casua...","A sleek black analog watch for men, perfect fo..."
3,18745,Arrow Woman Cola Purple Top,b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x01...,,This Arrow Woman Cola Purple Top is perfect fo...,"A vibrant purple top, perfect for pairing with..."
4,6091,UCB Women's Racerback Print Pink T-shirt,b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x01...,pink,"Perfect for casual outings, errands, or a rela...",A vibrant pink racerback tee with a playful pr...
5,39801,Peter England Men Blue Shirt,b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x01...,blue,This blue Peter England shirt is perfect for c...,"A classic blue men's shirt, perfect for pairin..."
6,49061,Flying Machine Men Brown Shoes,b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x01...,brown,"Casual outings, everyday wear, or relaxed soci...","Versatile brown leather shoes, perfect for cas..."
7,16799,Highlander Men Solid Grey Trouser,b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00...,,"Perfect for business casual, office wear, or s...","Versatile grey trousers, perfect for pairing w..."
8,29800,Basics Men Black & Navy Checked Shirt,b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00...,black,This versatile checked shirt is perfect for ca...,"A versatile black and navy checked shirt, perf..."
9,52602,Mod'acc Women Green Clutch,b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x01...,green,This green clutch is perfect for adding a pop ...,"A chic green clutch, perfect for adding a pop ..."


## 3. Embedding Generation

Now that we have our text-based features, let's create some vector embeddings. Embeddings are numerical representations of data that capture its semantic meaning. We'll create embeddings for our product images and for our new `summary` and `occasion` features.
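
As a quick illustration of what "capturing semantic meaning" buys us, similar items end up with vectors that point in similar directions, which is commonly measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors; real embeddings in this notebook have 512 or 768 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (made-up values, for illustration only)
sneaker = [0.9, 0.1, 0.0]
running_shoe = [0.8, 0.2, 0.1]
handbag = [0.0, 0.2, 0.9]

print(cosine_similarity(sneaker, running_shoe) > cosine_similarity(sneaker, handbag))  # True
```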

### Image Embeddings

We'll use a pretrained CLIP model to generate embeddings for our product images. We'll define a UDF that takes a batch of image bytes as input, preprocesses them, and then uses the CLIP model to generate embeddings.

In [64]:
import torch
import pyarrow as pa
import geneva as gv
from transformers import AutoTokenizer, AutoModel

@gv.udf(data_type=pa.list_(pa.float32(), 512),
       cuda=True
       )
class EmbedImage:
    def __init__(self):
        self.ready = False

    def setup(self):
        self.processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
        self.model = CLIPModel.from_pretrained(
            "openai/clip-vit-base-patch32",
            torch_dtype=torch.float16,
        ).cuda()
        self.model.eval()
        self.ready = True

    def __call__(self, batch: pa.RecordBatch) -> pa.Array:
        if not self.ready:
            self.setup()
            
        img_bytes = batch.column("image_bytes").to_pylist()
        images = []
        for b in img_bytes:
            img = Image.open(io.BytesIO(b)).convert("RGB")
            images.append(img)
        
        inputs = self.processor(images=images, return_tensors="pt", padding=True)
        pixel_values = inputs.pixel_values.cuda(non_blocking=True)
        
        with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.float16):
            features = self.model.get_image_features(pixel_values=pixel_values)
            embeddings = features.cpu().float().numpy()
        
        return pa.array(embeddings.tolist(), type=pa.list_(pa.float32(), 512))

### Text Embeddings

Next, we'll generate embeddings for our `summary` and `occasion` features. We'll use the pretrained `BAAI/bge-base-en-v1.5` model for this. We'll define a UDF that takes a batch of text as input and uses the model to generate 768-dimensional embeddings.

In [60]:
@gv.udf(
    data_type=pa.list_(pa.float32(), 768),
    cuda=True
)
class EmbedText:
    def __init__(self, column: str):
        self.ready = False
        self.column = column

    def setup(self):
        self.tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-base-en-v1.5")
        self.model = AutoModel.from_pretrained("BAAI/bge-base-en-v1.5").cuda()
        self.model.eval()
        self.ready = True

    def __call__(self, batch: pa.RecordBatch) -> pa.Array:
        if not self.ready:
            self.setup()

        texts = batch.column(self.column).to_pylist()
        
        # Replace missing or blank texts with a single space so the tokenizer always gets a string
        clean_texts = [t if isinstance(t, str) and t.strip() else " " for t in texts]
        
        # Process all texts at once
        inputs = self.tokenizer(clean_texts, return_tensors="pt", padding=True, truncation=True, max_length=256)
        inputs = {k: v.cuda() for k, v in inputs.items()}
        
        with torch.no_grad():
            outputs = self.model(**inputs)
            embeddings = outputs.last_hidden_state.mean(dim=1).cpu().numpy()
        
        return pa.array(embeddings.tolist(), type=pa.list_(pa.float32(), 768))
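
The UDF above averages the hidden states over all token positions, including padding. A common variant weights the average by the attention mask so that padded positions do not contribute. The NumPy sketch below shows the idea on a tiny hand-made example; `masked_mean_pool` is a hypothetical helper, and the numbers are chosen only to make the difference visible.

```python
import numpy as np

def masked_mean_pool(hidden: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sequence, ignoring positions where mask == 0.

    hidden: (batch, seq_len, dim) token embeddings
    mask:   (batch, seq_len) with 1 for real tokens and 0 for padding
    """
    weights = mask[:, :, None].astype(hidden.dtype)    # (batch, seq_len, 1)
    summed = (hidden * weights).sum(axis=1)            # (batch, dim)
    counts = np.clip(weights.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# One sequence with two real tokens and one padding position
hidden = np.array([[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])

print(hidden.mean(axis=1))             # plain mean, pulled towards the padding vector
print(masked_mean_pool(hidden, mask))  # [[2. 2.]]
```

In the PyTorch code, the equivalent mask would be `inputs["attention_mask"]` from the tokenizer output. Switching pooling strategies changes the stored embeddings, so queries would need to use the same pooling.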

### Adding and Backfilling Embedding Columns

Now, let's add our new embedding generators as virtual columns and then backfill them.

In [66]:
table.add_columns({
    "image_embedding": EmbedImage(), 
    "summary_embedding": EmbedText("summary"),
    "occasion_embedding": EmbedText("occasion")
})

INFO:geneva.table:Adding column: udf={'image_embedding': UDF(func=<__main__.EmbedImage object at 0x7fb4da2fa560>, name='EmbedImage', cuda=True, num_cpus=1.0, memory=None, batch_size=None, input_columns=None, data_type=FixedSizeListType(fixed_size_list<item: float>[512]), version='539f349c96ce2ddcfee59f4f65bfade2', checkpoint_key='EmbedImage:539f349c96ce2ddcfee59f4f65bfade2', field_metadata={})}


In [51]:
import logging
import sys
logging.basicConfig(level=logging.INFO, stream=sys.stderr, force=True)

In [62]:
table.backfill("summary_embedding",  where="1=1", batch_size=BATCH_SIZE, concurrency=CONCURRENCY)

Cluster nodes provisioned: |           0 [00:00]

Workers scheduled: |           0 [00:00]

Workers started: 0it [00:00, ?it/s]

Batches checkpointed:   0%|          | 0/18 [00:00<?, ?it/s]

Fragments written:   0%|          | 0/5 [00:00<?, ?it/s]

'2fb3b8e8-e129-4656-a136-0f7899c4d582'

In [63]:
table.backfill("occasion_embedding", batch_size=BATCH_SIZE, concurrency=CONCURRENCY, where="1=1")

Cluster nodes provisioned: |           0 [00:00]

Workers scheduled: |           0 [00:00]

Workers started: 0it [00:00, ?it/s]

Batches checkpointed:   0%|          | 0/18 [00:00<?, ?it/s]

Fragments written:   0%|          | 0/5 [00:00<?, ?it/s]

'49570940-7b01-40c8-8c84-35c30caa8cf1'

In [67]:
table.backfill("image_embedding", batch_size=BATCH_SIZE, concurrency=CONCURRENCY, where="1=1")

Cluster nodes provisioned: |           0 [00:00]

Workers scheduled: |           0 [00:00]

Workers started: 0it [00:00, ?it/s]

Batches checkpointed:   0%|          | 0/18 [00:00<?, ?it/s]

Fragments written:   0%|          | 0/5 [00:00<?, ?it/s]

'ae536a93-8d12-450f-86d4-bbaf6c2f6ff8'

In [69]:
table.search().limit(3).to_pandas()

Unnamed: 0,id,description,image_bytes,color_tags,occasion,summary,summary_embedding,occasion_embedding,image_embedding
0,33821,Puma Men Axis Blue & Black Sports Shoes,b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x01...,"black , blue",These Puma sports shoes are perfect for casual...,"Versatile blue and black Puma sports shoes, pe...","[0.5926933, 0.06984114, -0.29291508, -0.048436...","[0.32691422, -0.081623875, -0.6908214, 0.09249...","[-0.5107422, -0.1430664, -0.18225098, 0.091857..."
1,43487,French Connection Men Navy Blue Check Shirt,b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00...,blue,"Perfect for casual outings, weekend brunches, ...",A stylish navy blue men's check shirt from Fre...,"[0.021181999, -0.15738763, -0.18532476, 0.4291...","[-0.313276, -0.13313752, 0.16044861, 0.0541831...","[0.2310791, 0.2836914, -0.11407471, 0.08807373..."
2,8108,Fastrack Men Tween Analg Black Watch,b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00...,black,"This watch is perfect for everyday wear, casua...","A sleek black analog watch for men, perfect fo...","[0.2698657, -0.26594058, 0.58361876, 0.3672423...","[0.471437, -0.2877753, 0.6227516, 0.008569454,...","[0.12512207, -0.041870117, 0.06173706, -0.3068..."
3,18745,Arrow Woman Cola Purple Top,b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x01...,,This Arrow Woman Cola Purple Top is perfect fo...,"A vibrant purple top, perfect for pairing with...","[0.31835094, -0.93949884, -0.1462572, 0.372724...","[-0.11941981, -0.46860182, -0.556708, 0.241419...","[0.00033068657, 0.28857422, -0.14172363, -0.06..."
4,6091,UCB Women's Racerback Print Pink T-shirt,b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x01...,pink,"Perfect for casual outings, errands, or a rela...",A vibrant pink racerback tee with a playful pr...,"[0.15936497, -1.1334419, -0.18710637, 0.176189...","[0.19272415, -0.3448234, 0.1603245, 0.4397586,...","[-0.08569336, -0.19262695, -0.6894531, -0.1405..."
5,39801,Peter England Men Blue Shirt,b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x01...,blue,This blue Peter England shirt is perfect for c...,"A classic blue men's shirt, perfect for pairin...","[0.0997376, -0.10490564, 0.42901358, 0.8687088...","[0.07615158, -1.0693426, 0.5999409, 0.3378959,...","[-0.22888184, 0.4243164, -0.15283203, 0.096740..."
6,49061,Flying Machine Men Brown Shoes,b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x01...,brown,"Casual outings, everyday wear, or relaxed soci...","Versatile brown leather shoes, perfect for cas...","[0.12960745, 0.029596642, -0.0010324809, 0.404...","[0.38106987, -0.14663665, 0.055932462, -0.1269...","[-0.19665527, -0.3408203, 0.3125, 0.054595947,..."
7,16799,Highlander Men Solid Grey Trouser,b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00...,,"Perfect for business casual, office wear, or s...","Versatile grey trousers, perfect for pairing w...","[0.6454123, -0.1253122, -0.122310184, 0.246234...","[0.5321644, -0.048079964, 0.25295117, 0.623837...","[-0.053710938, 0.1484375, -0.08251953, 0.00748..."
8,29800,Basics Men Black & Navy Checked Shirt,b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00...,black,This versatile checked shirt is perfect for ca...,"A versatile black and navy checked shirt, perf...","[0.32964864, -0.13187009, -0.13817887, 0.40279...","[0.3500598, -0.14457172, -0.003512565, 0.13258...","[0.11755371, 0.28881836, -0.2199707, 0.1065063..."
9,52602,Mod'acc Women Green Clutch,b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x01...,green,This green clutch is perfect for adding a pop ...,"A chic green clutch, perfect for adding a pop ...","[-0.1269665, -0.62145716, 0.31657094, 0.432792...","[0.22456722, -0.40564543, 0.09888766, 0.425564...","[-0.014823914, -0.20446777, -0.12561035, 0.288..."


## 4. Indexing

To speed up our queries, we need to create indexes on our new features. We'll create two types of indexes:

*   **Full-Text Search (FTS) Index**: This will allow us to quickly search for keywords in our `summary` and `occasion` columns.
*   **Vector Index**: This will allow us to perform fast similarity searches on our `summary_embedding` and `occasion_embedding` columns.

In [56]:
table.create_fts_index("summary")
table.create_fts_index("occasion")

In [57]:
table.create_index(vector_column_name="summary_embedding", num_sub_vectors=128)
table.create_index(vector_column_name="occasion_embedding", num_sub_vectors=128)

[90m[[0m2025-07-30T14:11:57Z [33mWARN [0m lance_index::vector::kmeans[90m][0m KMeans: more than 10% of clusters are empty: 27 of 256.
    Help: this could mean your dataset is too small to have a meaningful index (less than 5000 vectors) or has many duplicate vectors.
[90m[[0m2025-07-30T14:12:40Z [33mWARN [0m lance_index::vector::kmeans[90m][0m KMeans: more than 10% of clusters are empty: 27 of 256.
    Help: this could mean your dataset is too small to have a meaningful index (less than 5000 vectors) or has many duplicate vectors.


In [58]:
table.list_indices()

[Index(FTS, columns=["summary"], name="summary_idx"),
 Index(FTS, columns=["occasion"], name="occasion_idx"),
 Index(IvfPq, columns=["summary_embedding"], name="summary_embedding_idx"),
 Index(IvfPq, columns=["occasion_embedding"], name="occasion_embedding_idx")]
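
A note on `num_sub_vectors=128`: product quantization splits each vector into that many equal slices, so the embedding dimension should divide evenly by it. The small check below is a sketch with a hypothetical helper, `sub_vector_dim`; it confirms that the 768-dimensional text embeddings give 6 dimensions per sub-vector, and shows what the same setting would mean for the 512-dimensional image embeddings if you chose to index them too.

```python
def sub_vector_dim(dim: int, num_sub_vectors: int) -> int:
    """Dimensions per PQ sub-vector; raises if the split is uneven."""
    if dim % num_sub_vectors != 0:
        raise ValueError(f"{dim} is not divisible by {num_sub_vectors}")
    return dim // num_sub_vectors

print(sub_vector_dim(768, 128))  # 6
print(sub_vector_dim(512, 128))  # 4
```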

That's it for the feature engineering part! We now have a LanceDB table enriched with a variety of features that will power our search engine. In the next part of this tutorial, we'll build the inference and retrieval pipeline that uses these features to provide a powerful and intuitive search experience.