# Sync RAS Curations

Syncs verb-based RAS (raw affiliation string) curations, which add or remove `institution_ids`, from the users database on Heroku Postgres to a local Databricks table.

**Source**: `openalex_users.public.curations` (Heroku Postgres foreign table)
**Target**: `openalex.institutions.ras_curations` (Delta table)

Curations use verb-based semantics:
- `action='add'`: Include this institution_id even if the model didn't predict it
- `action='remove'`: Exclude this institution_id even if the model predicted it

The target table aggregates curations into one row per RAS, with one array of added IDs and one array of removed IDs, so that it can be joined into the materialized view (MV).
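
As a minimal sketch of these semantics, the snippet below applies the two curated arrays to a list of model predictions. The helper name `apply_curations` is hypothetical, and the conflict rule (an ID that appears in both arrays ends up included) is an assumption of this sketch; the MV's actual join logic is not shown in this notebook.

```python
def apply_curations(predicted_ids, curated_add_ids, curated_remove_ids):
    """Hypothetical helper: combine model predictions with curated adds/removes."""
    removed = set(curated_remove_ids)
    # 'remove': drop predicted IDs that a curator excluded, keeping the original order.
    result = [i for i in predicted_ids if i not in removed]
    # 'add': append curated IDs the model did not predict.
    # Assumption: adds are applied after removes, so an ID in both arrays is kept.
    for i in curated_add_ids:
        if i not in result:
            result.append(i)
    return result


final_ids = apply_curations(
    predicted_ids=[1, 2],
    curated_add_ids=[3],
    curated_remove_ids=[2],
)
```

Here `final_ids` keeps the uncurated prediction `1`, drops the removed `2`, and gains the added `3`.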

## Sync curations from users DB

In [None]:
%sql
-- Preview what will be synced
SELECT
  entity_id AS raw_affiliation_string,
  FILTER(
    ARRAY_AGG(
      CASE WHEN action = 'add'
      THEN CAST(REGEXP_EXTRACT(value, 'I(\\d+)', 1) AS BIGINT)
      END
    ),
    x -> x IS NOT NULL
  ) AS curated_add_ids,
  FILTER(
    ARRAY_AGG(
      CASE WHEN action = 'remove'
      THEN CAST(REGEXP_EXTRACT(value, 'I(\\d+)', 1) AS BIGINT)
      END
    ),
    x -> x IS NOT NULL
  ) AS curated_remove_ids,
  COUNT(*) AS num_curations
FROM openalex_users.public.curations
WHERE entity = 'ras'
  AND property = 'institution_ids'
GROUP BY entity_id
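
The aggregation in the preview query can be mirrored in plain Python to make the per-row logic easier to follow. This is only an illustrative sketch: the function name `aggregate_curations` and the sample rows are invented, and values that do not match the ID pattern are simply skipped, which is what the `FILTER(... x IS NOT NULL)` step is intended to achieve.

```python
import re
from collections import defaultdict

# Same pattern as the SQL: a capital "I" followed by digits, capturing the digits.
ID_PATTERN = re.compile(r"I(\d+)")
ACTION_TO_COLUMN = {"add": "curated_add_ids", "remove": "curated_remove_ids"}


def aggregate_curations(rows):
    """Group (entity_id, action, value) rows into add/remove ID lists per RAS."""
    grouped = defaultdict(lambda: {"curated_add_ids": [], "curated_remove_ids": []})
    for entity_id, action, value in rows:
        # Touch the group first so every entity_id gets a row, as GROUP BY does.
        bucket = grouped[entity_id]
        column = ACTION_TO_COLUMN.get(action)
        match = ID_PATTERN.search(value or "")
        if column is None or match is None:
            continue  # unknown action or unparsable value: contributes no ID
        bucket[column].append(int(match.group(1)))
    return dict(grouped)


# Invented sample rows, for illustration only.
sample_rows = [
    ("Dept. of Physics, Example University", "add", "I123"),
    ("Dept. of Physics, Example University", "remove", "I456"),
    ("Example Institute", "add", "not-an-id"),
]
aggregated = aggregate_curations(sample_rows)
```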

In [None]:
%sql
-- MERGE curations into local table (handles inserts, updates, AND deletes)
MERGE INTO openalex.institutions.ras_curations AS target
USING (
  SELECT
    entity_id AS raw_affiliation_string,
    FILTER(
      ARRAY_AGG(
        CASE WHEN action = 'add'
        THEN CAST(REGEXP_EXTRACT(value, 'I(\\d+)', 1) AS BIGINT)
        END
      ),
      x -> x IS NOT NULL
    ) AS curated_add_ids,
    FILTER(
      ARRAY_AGG(
        CASE WHEN action = 'remove'
        THEN CAST(REGEXP_EXTRACT(value, 'I(\\d+)', 1) AS BIGINT)
        END
      ),
      x -> x IS NOT NULL
    ) AS curated_remove_ids,
    CURRENT_TIMESTAMP() AS updated_datetime
  FROM openalex_users.public.curations
  WHERE entity = 'ras'
    AND property = 'institution_ids'
  GROUP BY entity_id
) AS source
ON target.raw_affiliation_string = source.raw_affiliation_string
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
WHEN NOT MATCHED BY SOURCE THEN DELETE
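
The three `WHEN` clauses together make the target an exact mirror of the aggregated source. A rough Python analogy, using dictionaries keyed by `raw_affiliation_string`, might look like the following; the function name `merge_full_sync` and the sample data are hypothetical and only illustrate the behaviour of the clauses.

```python
def merge_full_sync(target, source):
    """Mutate `target` so it mirrors `source`; return counts per MERGE clause."""
    stats = {"updated": 0, "inserted": 0, "deleted": 0}
    # WHEN NOT MATCHED BY SOURCE THEN DELETE: drop rows whose curations are gone.
    for key in list(target):
        if key not in source:
            del target[key]
            stats["deleted"] += 1
    for key, row in source.items():
        if key in target:
            stats["updated"] += 1   # WHEN MATCHED THEN UPDATE SET *
        else:
            stats["inserted"] += 1  # WHEN NOT MATCHED THEN INSERT *
        target[key] = dict(row)
    return stats


# Invented sample data, for illustration only.
target_table = {
    "ras A": {"curated_add_ids": [1], "curated_remove_ids": []},
    "ras B": {"curated_add_ids": [], "curated_remove_ids": [2]},
}
source_rows = {
    "ras A": {"curated_add_ids": [1, 3], "curated_remove_ids": []},
    "ras C": {"curated_add_ids": [4], "curated_remove_ids": []},
}
merge_stats = merge_full_sync(target_table, source_rows)
```

After the call, `ras B` is gone from `target_table`, which corresponds to a curation having been deleted upstream since the previous sync.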

## Verify sync results

In [None]:
%sql
-- Check local curations table
SELECT
  COUNT(*) AS total_curated_ras,
  SUM(SIZE(curated_add_ids)) AS total_adds,
  SUM(SIZE(curated_remove_ids)) AS total_removes,
  MAX(updated_datetime) AS last_sync
FROM openalex.institutions.ras_curations

In [None]:
%sql
-- Sample of curated RAS
SELECT * FROM openalex.institutions.ras_curations
ORDER BY updated_datetime DESC
LIMIT 10

## Populate pending Elasticsearch sync table

Find all work_ids affected by curated RAS and add them to the pending sync queue.
This ensures curated works get synced to Elasticsearch even though their `updated_date` hasn't changed.

In [None]:
%sql
-- Insert affected work_ids into pending sync table
-- Skips work_ids that are already in the queue; a newly queued work that matches
-- several curated RAS still gets one row per RAS
INSERT INTO openalex.institutions.curated_work_ids_pending_sync (work_id, curated_ras, added_datetime)
SELECT DISTINCT
  waa.work_id,
  rc.raw_affiliation_string AS curated_ras,
  CURRENT_TIMESTAMP() AS added_datetime
FROM openalex.institutions.ras_curations rc
INNER JOIN openalex.works.work_author_affiliations_mv waa
  ON rc.raw_affiliation_string = waa.raw_affiliation_string
WHERE NOT EXISTS (
  SELECT 1 FROM openalex.institutions.curated_work_ids_pending_sync pending
  WHERE pending.work_id = waa.work_id
)
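
The queueing logic above is an anti-join: candidate rows are kept only when their `work_id` is not yet in the queue. A small Python sketch of the same idea follows; the function name `enqueue_affected_works` and the sample data are invented, and timestamps are omitted for brevity.

```python
def enqueue_affected_works(pending, curated_ras, affiliations):
    """Append (work_id, curated_ras) rows for works not already queued.

    pending:      list of (work_id, curated_ras) rows already in the queue
    curated_ras:  set of curated raw affiliation strings
    affiliations: iterable of (work_id, raw_affiliation_string) pairs
    """
    # NOT EXISTS: snapshot of work_ids queued before this insert.
    already_queued = {work_id for work_id, _ in pending}
    # INNER JOIN + SELECT DISTINCT: unique (work_id, RAS) pairs for curated RAS only.
    new_rows = sorted(
        {
            (work_id, ras)
            for work_id, ras in affiliations
            if ras in curated_ras and work_id not in already_queued
        }
    )
    pending.extend(new_rows)
    return new_rows


# Invented sample data, for illustration only.
queue = [(1, "ras A")]
added = enqueue_affected_works(
    pending=queue,
    curated_ras={"ras A", "ras B"},
    affiliations=[(1, "ras B"), (2, "ras A"), (2, "ras B"), (2, "ras A"), (3, "ras C")],
)
```

Work `1` is skipped because it is already queued, work `3` has no curated RAS, and work `2` is queued twice, once per curated RAS. That is why the verification query reports `COUNT(DISTINCT work_id)` separately from `COUNT(*)`.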

In [None]:
%sql
-- Verify pending sync queue
SELECT
  COUNT(DISTINCT work_id) AS pending_work_count,
  COUNT(*) AS total_rows,
  MIN(added_datetime) AS oldest_pending,
  MAX(added_datetime) AS newest_pending
FROM openalex.institutions.curated_work_ids_pending_sync