# 06 — Hybrid Search

Semantic search (`$vectorSearch`) and keyword search (`$search`) each have blind spots:
- **Semantic** misses exact-match terms (product codes, names, acronyms)
- **Keyword** misses paraphrase and concept-level matches

MongoDB's `$rankFusion` merges both pipelines using **Reciprocal Rank Fusion**, so a document that ranks well in *either* pipeline surfaces in the combined results, and one that ranks well in both rises highest.

In [None]:
import { MongoClient } from 'mongodb';

// ← Paste your VoyageAI API key here (get one at https://dash.voyageai.com)
const VOYAGE_API_KEY = 'pa-...';
const QUERY_MODEL    = 'voyage-4-lite';
const VECTOR_INDEX   = 'listing_vector_index';  // created in notebook 01
const FTS_INDEX      = 'listing_fts_index';

const client = new MongoClient(process.env.MONGODB_URI!);
await client.connect();
const db  = client.db('voyage_lab');
const col = db.collection<{ _id: string; [key: string]: unknown }>('listings');

const withEmbeddings = await col.countDocuments({ embedding: { $exists: true } });
console.log(`Connected. ${withEmbeddings} listings have embeddings.`);
if (withEmbeddings === 0) console.warn('⚠  Run notebook 01 first to generate embeddings.');

In [None]:
// ── Query embed helper ────────────────────────────────────────────────────────
async function embed(text: string): Promise<number[]> {
  const res = await fetch('https://api.voyageai.com/v1/embeddings', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json', 'Authorization': `Bearer ${VOYAGE_API_KEY}` },
    body: JSON.stringify({ input: [text], model: QUERY_MODEL, input_type: 'query' }),
  });
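  // Include the HTTP status so auth and rate-limit failures are easy to tell apart.
  if (!res.ok) throw new Error(`Voyage API error ${res.status}: ${await res.text()}`);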
  const json = await res.json() as { data: { embedding: number[] }[] };
  return json.data[0].embedding;
}

console.log('Helper defined.');

## Create a full-text search index

Atlas Search (`$search`) uses BM25 over the indexed fields. We index `name` and `description` for keyword lookup.

In [None]:
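// Drop any previous copy of the index. Deletion is asynchronous, so wait until
// the index no longer appears in the listing before recreating it under the same name.
try {
  await col.dropSearchIndex(FTS_INDEX);
  for (let i = 0; i < 30 && (await col.listSearchIndexes(FTS_INDEX).toArray()).length > 0; i++) {
    await new Promise(r => setTimeout(r, 2000));
  }
} catch { /* nothing to drop on a first run */ }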

await col.createSearchIndex({
  name: FTS_INDEX,
  type: 'search',
  definition: {
    mappings: {
      dynamic: false,
      fields: {
        name:        { type: 'string' },
        description: { type: 'string' },
      },
    },
  },
});

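console.log('Waiting for FTS index to be READY...');
let ready = false;
// Poll every 2 s, for up to 60 s in total.
for (let i = 0; i < 30 && !ready; i++) {
  await new Promise(r => setTimeout(r, 2000));
  const [idx] = await col.listSearchIndexes(FTS_INDEX).toArray();
  console.log(' status:', idx?.status);
  ready = idx?.status === 'READY';
}
if (!ready) console.warn('⚠  FTS index is not READY after 60 s; $search queries may fail or return nothing.');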
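
To give a feel for what a BM25 score rewards, here is a minimal sketch of the textbook BM25 score for one query term in one document. It is not Atlas Search's actual implementation: the function name `bm25TermScore`, its parameter names, and the defaults `k1 = 1.2` and `b = 0.75` are assumptions chosen for illustration.

```typescript
// Textbook BM25 score for a single query term in a single document.
// Illustrative only: names and defaults are assumptions, not Atlas Search internals.
function bm25TermScore(
  tf: number,        // occurrences of the term in the document
  docLen: number,    // document length in tokens
  avgDocLen: number, // average document length across the collection
  nDocs: number,     // total number of documents
  docFreq: number,   // number of documents containing the term
  k1 = 1.2,          // term-frequency saturation (assumed default)
  b = 0.75,          // strength of length normalization (assumed default)
): number {
  // Rare terms get a larger inverse document frequency.
  const idf = Math.log(1 + (nDocs - docFreq + 0.5) / (docFreq + 0.5));
  // Repeated occurrences add less and less; long documents are penalized.
  const norm = tf + k1 * (1 - b + b * (docLen / avgDocLen));
  return (idf * tf * (k1 + 1)) / norm;
}

// Same term frequency and document length, different rarity.
const rare   = bm25TermScore(2, 100, 100, 1000, 5);
const common = bm25TermScore(2, 100, 100, 1000, 500);
console.log(rare > common); // true
```

This is why an unusual token such as a product code dominates a keyword ranking, while a word that appears in half the listings contributes very little.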

## Where each approach falls short

- **Keyword query `"WiFi laptop remote work"`**: the terms appear verbatim in listings → keyword wins, semantic may rank those listings lower or miss them
- **Semantic query `"peaceful escape surrounded by nature"`**: a paraphrase whose words rarely appear in listings → semantic wins, keyword misses

Run both queries with both methods and observe the gaps.

In [None]:
// ── Semantic vs keyword — side-by-side ────────────────────────────────────────
async function semanticSearch(query: string, limit = 4) {
  const qVec = await embed(query);
  return col.aggregate([
    { $vectorSearch: { index: VECTOR_INDEX, path: 'embedding', queryVector: qVec, numCandidates: 50, limit } },
    { $project: { name: 1, score: { $meta: 'vectorSearchScore' } } },
  ]).toArray();
}

async function keywordSearch(query: string, limit = 4) {
  return col.aggregate([
    { $search: { index: FTS_INDEX, text: { query, path: ['name', 'description'] } } },
    { $limit: limit },
    { $project: { name: 1, score: { $meta: 'searchScore' } } },
  ]).toArray();
}

const queries = [
  'WiFi laptop remote work',        // exact keywords → keyword wins
  'peaceful escape surrounded by nature',  // paraphrase → semantic wins
];

for (const q of queries) {
  const [sem, kw] = await Promise.all([semanticSearch(q), keywordSearch(q)]);
  console.log(`\nQuery: "${q}"`);
  console.log('  Semantic ($vectorSearch):');
  sem.forEach((r, i) => console.log(`    ${i+1}. [${(r.score as number).toFixed(4)}] ${r.name}`));
  console.log('  Keyword ($search):');
  kw.forEach((r, i)  => console.log(`    ${i+1}. [${(r.score as number).toFixed(4)}] ${r.name}`));
}
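
One way to make the gaps concrete is to diff the two result lists by `_id`. The sketch below uses a hypothetical helper (`compareHits` and the `Hit` type are not part of this notebook) and toy data in place of the arrays returned by `semanticSearch()` and `keywordSearch()`.

```typescript
// Hypothetical helper: split two result lists into shared and method-specific hits.
type Hit = { _id: string; name?: unknown };

function compareHits(semantic: Hit[], keyword: Hit[]) {
  const semIds = new Set(semantic.map(h => h._id));
  const kwIds  = new Set(keyword.map(h => h._id));
  return {
    both:         semantic.filter(h => kwIds.has(h._id)).map(h => h._id),
    semanticOnly: semantic.filter(h => !kwIds.has(h._id)).map(h => h._id),
    keywordOnly:  keyword.filter(h => !semIds.has(h._id)).map(h => h._id),
  };
}

// Toy data standing in for real search results.
const gaps = compareHits(
  [{ _id: 'a' }, { _id: 'b' }, { _id: 'c' }],
  [{ _id: 'b' }, { _id: 'd' }],
);
console.log(gaps);
// { both: [ 'b' ], semanticOnly: [ 'a', 'c' ], keywordOnly: [ 'd' ] }
```

Documents in `semanticOnly` or `keywordOnly` are the ones a single-method search would have missed, which is the case hybrid search is meant to cover.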

## Hybrid search with `$rankFusion`

`$rankFusion` runs both pipelines independently, then merges them using **Reciprocal Rank Fusion**: documents that rank highly in *either* pipeline score well in the combined result.

In [None]:
async function hybridSearch(
  query: string,
  vectorWeight = 0.6,
  ftsWeight    = 0.4,
  limit        = 5,
) {
  const qVec = await embed(query);

  return col.aggregate([
    {
      $rankFusion: {
        input: {
          pipelines: {
            vector_pipeline: [
              {
                $vectorSearch: {
                  index:         VECTOR_INDEX,
                  path:          'embedding',
                  queryVector:   qVec,
                  numCandidates: 50,
                  limit:         20,
                },
              },
            ],
            fts_pipeline: [
              {
                $search: {
                  index: FTS_INDEX,
                  text:  { query, path: ['name', 'description'] },
                },
              },
              { $limit: 20 },
            ],
          },
        },
        combination: {
          weights: { vector_pipeline: vectorWeight, fts_pipeline: ftsWeight },
        },
        scoreDetails: true,
      },
    },
    {
      $project: {
        name:  1,
        score: { $getField: { field: 'value', input: { $meta: 'scoreDetails' } } },
      },
    },
    { $limit: limit },
  ]).toArray();
}

// Run the same two queries through hybrid search
for (const q of queries) {
  const hits = await hybridSearch(q);
  console.log(`\nHybrid results for: "${q}"`);
  hits.forEach((r, i) => console.log(`  ${i+1}. [${(r.score as number).toFixed(4)}] ${r.name}`));
}
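
The fusion step itself is small enough to sketch by hand. The snippet below is an illustrative re-implementation of weighted Reciprocal Rank Fusion, not MongoDB's code: `rrfFuse` and `RRF_K` are hypothetical names, and the constant 60 is assumed to match the value `$rankFusion` uses internally. Each document earns `weight / (60 + rank)` from every pipeline that returned it, and the contributions are summed.

```typescript
// Weighted Reciprocal Rank Fusion over ranked lists of document ids.
// Illustrative sketch; `rrfFuse` and `RRF_K` are hypothetical names.
const RRF_K = 60; // smoothing constant (assumed to match $rankFusion)

function rrfFuse(
  rankings: Record<string, string[]>, // pipeline name → ids, best first
  weights: Record<string, number>,    // pipeline name → weight (defaults to 1)
): { id: string; score: number }[] {
  const scores: Record<string, number> = {};
  for (const pipeline of Object.keys(rankings)) {
    const w = weights[pipeline] ?? 1;
    rankings[pipeline].forEach((id, i) => {
      const rank = i + 1; // ranks are 1-based
      scores[id] = (scores[id] ?? 0) + w / (RRF_K + rank);
    });
  }
  // Highest fused score first.
  return Object.keys(scores)
    .map(id => ({ id, score: scores[id] }))
    .sort((a, b) => b.score - a.score);
}

const fused = rrfFuse(
  { vector_pipeline: ['a', 'b', 'c'], fts_pipeline: ['b', 'd'] },
  { vector_pipeline: 0.6, fts_pipeline: 0.4 },
);
console.log(fused.map(f => f.id)); // [ 'b', 'a', 'c', 'd' ]
```

Document `b` comes first because it collects a contribution from both lists, even though it tops only one of them. Shifting weight toward `fts_pipeline` would move keyword-only hits such as `d` up the list.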


In [None]:
// ── Cleanup ───────────────────────────────────────────────────────────────────
await client.close();
console.log('Done.');