# Create Work Embeddings for Vector Search (v2)

Generates text embeddings for OpenAlex works using `databricks-bge-large-en` (1024 dims). This notebook validates the pipeline on small batches (100, then 10K works) before the full run.

**Format**: `Title: {title}\n\nAbstract: {abstract}`

**Output**: `openalex.vector_search.work_embeddings` Delta table

**Exclusions**: Works with type='dataset' (non-semantic titles)
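
The text format above can be sketched in Python to preview exactly what gets embedded and hashed. This is an illustrative sketch, not part of the pipeline: `build_embedding_text` and `text_hash` are hypothetical helper names, and the hash is assumed to match Spark SQL's `md5()` on the same UTF-8 string.

```python
import hashlib
from typing import Optional


def build_embedding_text(title: str, abstract: Optional[str] = None) -> str:
    """Build the text to embed; the abstract section is dropped when it is missing."""
    text = f"Title: {title}"
    if abstract is not None:
        text += f"\n\nAbstract: {abstract}"
    return text


def text_hash(text: str) -> str:
    """MD5 hex digest of the UTF-8 bytes (assumed to match Spark SQL md5() on a string)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


sample = build_embedding_text("Ocean warming", "We study heat uptake.")
print(repr(sample))
print(text_hash(sample))
```

Storing the hash next to each embedding makes it possible to detect works whose title or abstract changed and therefore need re-embedding.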

## Changes from v1
- Uses `databricks-bge-large-en` (free, built-in) instead of OpenAI
- Fixed column names: `title` (not `display_name`), `abstract` (already a string)
- Uses `ai_query()` for simpler SQL-based processing

In [None]:
# Configuration
# Note: the %%sql cells below hardcode these names; keep them in sync.
EMBEDDING_MODEL = "databricks-bge-large-en"  # Built-in, free, 1024 dims
OUTPUT_TABLE = "openalex.vector_search.work_embeddings"
SOURCE_TABLE = "openalex.works.openalex_works"

## Step 1: Verify schema and table exist

In [None]:
%%sql
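-- Verify table exists (fails if it is missing) and check column order:
-- the INSERT statements below map columns by position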
DESCRIBE TABLE openalex.vector_search.work_embeddings

## Step 2: Check source data

In [None]:
%%sql
-- Count works available for embedding
SELECT 
    COUNT(*) as total_works,
    SUM(CASE WHEN abstract IS NOT NULL THEN 1 ELSE 0 END) as with_abstract,
    SUM(CASE WHEN abstract IS NULL THEN 1 ELSE 0 END) as title_only
FROM openalex.works.openalex_works 
WHERE type != 'dataset'
  AND title IS NOT NULL

## Step 3: Test embedding on sample (verify before scaling)

In [None]:
%%sql
-- Test: Embed 5 works to verify pipeline
SELECT 
    CAST(id AS STRING) as work_id,
    title,
    SIZE(ai_query(
        'databricks-bge-large-en',
        CONCAT('Title: ', title, COALESCE(CONCAT('\n\nAbstract: ', abstract), ''))
    )) as embedding_dims
FROM openalex.works.openalex_works
WHERE type != 'dataset' 
  AND title IS NOT NULL
  AND abstract IS NOT NULL
  AND publication_year = 2024
LIMIT 5

## Step 4: Insert small validation batch (100 works)

In [None]:
%%sql
-- Insert 100 works with abstracts from 2024 for validation
INSERT INTO openalex.vector_search.work_embeddings
SELECT 
    CAST(id AS STRING) as work_id,
    CAST(ai_query(
        'databricks-bge-large-en',
        CONCAT('Title: ', title, '\n\nAbstract: ', abstract)
    ) AS ARRAY<FLOAT>) as embedding,
    md5(CONCAT('Title: ', title, '\n\nAbstract: ', abstract)) as text_hash,
    publication_year,
    type,
    open_access.is_oa as is_oa,
    true as has_abstract,
    current_timestamp() as created_at,
    current_timestamp() as updated_at
FROM openalex.works.openalex_works w
WHERE type != 'dataset' 
  AND title IS NOT NULL
  AND abstract IS NOT NULL
  AND publication_year = 2024
  -- NOT EXISTS instead of NOT IN: NOT IN returns no rows if any work_id is NULL
  AND NOT EXISTS (
      SELECT 1 FROM openalex.vector_search.work_embeddings e
      WHERE e.work_id = CAST(w.id AS STRING)
  )
LIMIT 100

## Step 5: Verify insertion

In [None]:
%%sql
SELECT 
    COUNT(*) as total_embeddings,
    SUM(CASE WHEN has_abstract THEN 1 ELSE 0 END) as with_abstract,
    MIN(created_at) as oldest,
    MAX(created_at) as newest
FROM openalex.vector_search.work_embeddings

In [None]:
%%sql
-- Check embedding dimensions and sample data
SELECT 
    work_id,
    SIZE(embedding) as embedding_dims,
    publication_year,
    type
FROM openalex.vector_search.work_embeddings
LIMIT 5
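
Beyond checking `SIZE(embedding)`, a few sampled rows can be sanity-checked in Python. The sketch below assumes the vectors have already been collected into plain lists of floats; `check_embedding` is a hypothetical helper, and the unit-norm check relies on the assumption, stated in Step 6, that the embeddings are normalized.

```python
import math
from typing import List, Sequence

EXPECTED_DIMS = 1024  # dimensions stated for databricks-bge-large-en in this notebook


def check_embedding(vec: Sequence[float], expected_dims: int = EXPECTED_DIMS,
                    norm_tol: float = 1e-2) -> List[str]:
    """Return a list of problems found; an empty list means the vector looks sane."""
    problems = []
    if len(vec) != expected_dims:
        problems.append(f"expected {expected_dims} dims, got {len(vec)}")
    if any(not math.isfinite(x) for x in vec):
        problems.append("contains NaN or infinite values")
        return problems  # the norm is meaningless if values are not finite
    norm = math.sqrt(sum(x * x for x in vec))
    if abs(norm - 1.0) > norm_tol:
        problems.append(f"L2 norm is {norm:.4f}, not ~1.0")
    return problems


# Hypothetical usage in the notebook (requires a Spark session):
# rows = spark.table("openalex.vector_search.work_embeddings").limit(5).collect()
# for row in rows:
#     print(row.work_id, check_embedding(row.embedding))
```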

## Step 6: Test similarity search

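Verify embeddings work for similarity search before scaling. The query below is a brute-force scan that scores every stored embedding, which is fine for the validation batch but not for the full table.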

In [None]:
%%sql
-- Test similarity: find works similar to a query about "climate change impacts on marine ecosystems"
WITH query_embedding AS (
    SELECT ai_query(
        'databricks-bge-large-en',
        'climate change impacts on marine ecosystems and ocean biodiversity'
    ) as embedding
)
SELECT 
    e.work_id,
    w.title,
    -- Dot product; this equals cosine similarity because the embeddings are normalized
    AGGREGATE(
        TRANSFORM(
            SEQUENCE(0, SIZE(e.embedding) - 1),
            i -> CAST(e.embedding[i] AS DOUBLE) * CAST(q.embedding[i] AS DOUBLE)
        ),
        CAST(0.0 AS DOUBLE),
        (acc, x) -> acc + x
    ) as similarity_score
FROM openalex.vector_search.work_embeddings e
CROSS JOIN query_embedding q
JOIN openalex.works.openalex_works w ON CAST(w.id AS STRING) = e.work_id
ORDER BY similarity_score DESC
LIMIT 10
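
The dot-product shortcut in the query above depends on both vectors having unit length. A minimal Python sketch of that relationship, using toy 2-dimensional vectors rather than real embeddings:

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


raw_a, raw_b = [3.0, 4.0], [4.0, 3.0]
unit_a, unit_b = normalize(raw_a), normalize(raw_b)

# On unit vectors the denominator of cosine similarity is 1, so the dot product suffices.
print(math.isclose(dot(unit_a, unit_b), cosine_similarity(raw_a, raw_b)))
# On unnormalized vectors the two quantities differ.
print(math.isclose(dot(raw_a, raw_b), cosine_similarity(raw_a, raw_b)))
```

If a different model that does not normalize its output is ever swapped in, the SQL would need to divide by the two vector norms.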

## Step 7: Scale up (run after validation)

Once validation passes, insert larger batches. Run these cells incrementally.

In [None]:
%%sql
-- Insert 10K more works (with abstracts, published 2020 or later)
INSERT INTO openalex.vector_search.work_embeddings
SELECT 
    CAST(id AS STRING) as work_id,
    CAST(ai_query(
        'databricks-bge-large-en',
        CONCAT('Title: ', title, '\n\nAbstract: ', abstract)
    ) AS ARRAY<FLOAT>) as embedding,
    md5(CONCAT('Title: ', title, '\n\nAbstract: ', abstract)) as text_hash,
    publication_year,
    type,
    open_access.is_oa as is_oa,
    true as has_abstract,
    current_timestamp() as created_at,
    current_timestamp() as updated_at
FROM openalex.works.openalex_works w
WHERE type != 'dataset' 
  AND title IS NOT NULL
  AND abstract IS NOT NULL
  AND publication_year >= 2020
  -- NOT EXISTS instead of NOT IN: NOT IN returns no rows if any work_id is NULL
  AND NOT EXISTS (
      SELECT 1 FROM openalex.vector_search.work_embeddings e
      WHERE e.work_id = CAST(w.id AS STRING)
  )
LIMIT 10000
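
Because the batch cells differ only in the year cutoff and the row limit, the statement can also be rendered from the configuration constants instead of being copied. The following is a sketch under assumptions: `build_batch_insert_sql` is a hypothetical helper, it uses `NOT EXISTS` for the already-embedded check, and it would be run through the notebook's Spark session.

```python
EMBEDDING_MODEL = "databricks-bge-large-en"
OUTPUT_TABLE = "openalex.vector_search.work_embeddings"
SOURCE_TABLE = "openalex.works.openalex_works"


def build_batch_insert_sql(min_year: int, batch_size: int) -> str:
    """Render a batch INSERT with the year cutoff and batch size as parameters."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Literal backslash-n sequences are kept so the SQL parser sees '\n\n', as in the cells above.
    text_expr = "CONCAT('Title: ', title, '\\n\\nAbstract: ', abstract)"
    return f"""
INSERT INTO {OUTPUT_TABLE}
SELECT
    CAST(id AS STRING) AS work_id,
    CAST(ai_query('{EMBEDDING_MODEL}', {text_expr}) AS ARRAY<FLOAT>) AS embedding,
    md5({text_expr}) AS text_hash,
    publication_year,
    type,
    open_access.is_oa AS is_oa,
    true AS has_abstract,
    current_timestamp() AS created_at,
    current_timestamp() AS updated_at
FROM {SOURCE_TABLE} w
WHERE type != 'dataset'
  AND title IS NOT NULL
  AND abstract IS NOT NULL
  AND publication_year >= {int(min_year)}
  AND NOT EXISTS (
      SELECT 1 FROM {OUTPUT_TABLE} e WHERE e.work_id = CAST(w.id AS STRING)
  )
LIMIT {int(batch_size)}
""".strip()


# Hypothetical usage in the notebook (requires a Spark session):
# spark.sql(build_batch_insert_sql(min_year=2020, batch_size=10_000))
print(build_batch_insert_sql(min_year=2020, batch_size=10_000))
```

Rendering the statement from one template keeps the embedded text and the hashed text identical across batches, which the `text_hash` column depends on.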

## Notes

### Model Choice: BGE vs OpenAI

| Model | Dims | Cost (187M works) | Notes |
|-------|------|-------------------|-------|
| databricks-bge-large-en | 1024 | Free (built-in) | Ready to use |
| text-embedding-3-small | 1536 | ~$1,500 | Requires endpoint setup |

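Started with BGE for validation. Can switch to OpenAI later if needed, but that means re-embedding every work: vectors from different models are not comparable, and the dimensions differ (1024 vs 1536).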
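
The dimension count also drives storage. As a rough sizing sketch, assuming 4-byte `FLOAT` values as in the `ARRAY<FLOAT>` column and ignoring Delta compression and the other columns:

```python
def raw_embedding_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Uncompressed size of the embedding values alone."""
    return num_vectors * dims * bytes_per_value


NUM_WORKS = 187_000_000  # corpus size quoted in the table above

for model, dims in [("databricks-bge-large-en", 1024), ("text-embedding-3-small", 1536)]:
    gigabytes = raw_embedding_bytes(NUM_WORKS, dims) / 1e9
    print(f"{model}: about {gigabytes:.0f} GB of raw float32 values")
```

That works out to roughly 766 GB at 1024 dims and 1,149 GB at 1536 dims before compression, so the choice of model affects table size as well as API cost.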

### Scaling Strategy

1. **Phase 1**: 100 works (Step 4 of this notebook) - validate pipeline
2. **Phase 2**: 10K works (Step 7 of this notebook) - validate at small scale
3. **Phase 3**: Scale to all 187M works with abstracts

For Phase 3, consider:
- Running as a scheduled job with batches
- Using a dedicated SQL warehouse
- Monitoring rate limits
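
For the rate-limit point, one common pattern is to wrap each batch in a retry with exponential backoff and jitter. This is a minimal sketch, not the notebook's method: `run_with_backoff` is a hypothetical helper, and because the exception that signals a rate limit depends on the environment, it retries on any exception.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_backoff(task: Callable[[], T], max_attempts: int = 5,
                     base_delay: float = 1.0, max_delay: float = 60.0,
                     sleep: Callable[[float], None] = time.sleep) -> T:
    """Call task(), retrying on any exception with exponential backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 0
    while True:
        attempt += 1
        try:
            return task()
        except Exception:
            if attempt >= max_attempts:
                raise  # out of retries: surface the last error
            # Double the delay each attempt, cap it, then apply 50-100% jitter.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay * random.uniform(0.5, 1.0))


# Hypothetical usage in a scheduled job (requires a Spark session and a batch statement):
# run_with_backoff(lambda: spark.sql(batch_insert_sql))
```

Since each batch skips works that already have embeddings, retrying a failed batch should not create duplicates, provided batches run one at a time.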