# Refresh RAS Works Counts

Rebuilds the `affiliation_strings_lookup_with_counts` table with fresh works counts
per raw affiliation string (RAS) from `OpenAlex_works` and institution IDs from the MV
`raw_affiliation_strings_institutions_mv` (which includes curations).

Uses a MERGE with content hashing to detect changes: only rows with changed data
get a new `refreshed_at` timestamp, enabling incremental ES sync downstream.

**Runs after**: Guardrails (needs finalized works data)
**Feeds**: `sync_affiliation_strings_to_elastic_v2` (ES sync for dashboard)
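
To illustrate why a per-row `refreshed_at` makes incremental sync possible, here is a minimal plain-Python sketch. It assumes the downstream job keeps a "last synced" watermark and picks up only rows refreshed after it. The function name `rows_to_sync` and the watermark are hypothetical, and the actual logic of `sync_affiliation_strings_to_elastic_v2` is not shown in this notebook.

```python
from datetime import datetime

def rows_to_sync(rows, last_synced_at):
    """Return rows whose refreshed_at is strictly newer than the watermark.

    Hypothetical helper: rows with a NULL (None) refreshed_at are skipped.
    """
    return [
        r for r in rows
        if r["refreshed_at"] is not None and r["refreshed_at"] > last_synced_at
    ]

# Toy data: only "Univ B" was refreshed after the assumed last sync.
rows = [
    {"raw_affiliation_string": "Univ A", "refreshed_at": datetime(2024, 1, 1)},
    {"raw_affiliation_string": "Univ B", "refreshed_at": datetime(2024, 1, 3)},
]
pending = rows_to_sync(rows, last_synced_at=datetime(2024, 1, 2))
```

Because unchanged rows keep their old timestamp, the set of pending rows stays small when little has changed between runs.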

## Step 1: Rebuild works counts per RAS

In [None]:
-- Rebuild works counts by exploding authorships from OpenAlex_works.
-- This replaces the entire counts table with fresh data.
CREATE OR REPLACE TABLE openalex.institutions.affiliation_string_works_counts AS
SELECT
    raw_aff_string,
    COUNT(DISTINCT w.id) AS works_count
FROM openalex.works.OpenAlex_works w
LATERAL VIEW EXPLODE(authorships) AS authorship
LATERAL VIEW EXPLODE(authorship.raw_affiliation_strings) AS raw_aff_string
GROUP BY raw_aff_string
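
The double explode and `COUNT(DISTINCT ...)` above can be mirrored in plain Python to show what the query counts. This is only an illustrative sketch on toy data, not the Spark job itself; the helper name `works_count_per_string` and the sample works are made up.

```python
from collections import defaultdict

def works_count_per_string(works):
    """Count distinct work IDs per raw affiliation string.

    Mirrors the two LATERAL VIEW EXPLODE steps: a missing or empty array
    contributes no rows, and a work is counted once per string even if
    several of its authorships repeat that string.
    """
    work_ids_by_string = defaultdict(set)
    for work in works:
        for authorship in work.get("authorships") or []:
            for raw in authorship.get("raw_affiliation_strings") or []:
                work_ids_by_string[raw].add(work["id"])
    return {raw: len(ids) for raw, ids in work_ids_by_string.items()}

works = [
    {"id": "W1", "authorships": [
        {"raw_affiliation_strings": ["MIT", "Harvard"]},
        {"raw_affiliation_strings": ["MIT"]},  # same string, same work
    ]},
    {"id": "W2", "authorships": [{"raw_affiliation_strings": ["MIT"]}]},
    {"id": "W3", "authorships": None},  # no authorships, so no rows
]
counts = works_count_per_string(works)
# counts == {"MIT": 2, "Harvard": 1}
```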

In [None]:
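-- Quick sanity check.
-- Note: total_works_count sums the per-string counts, so a work with several
-- affiliation strings is counted once per string (it is not a distinct works total).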
SELECT
  COUNT(*) AS total_unique_ras,
  SUM(works_count) AS total_works_count,
  MIN(works_count) AS min_works,
  MAX(works_count) AS max_works
FROM openalex.institutions.affiliation_string_works_counts

## Step 2: MERGE lookup with counts (hash-based change detection)

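Builds a staging table with a `content_hash` over the institution ID columns,
`countries`, and `works_count`, then MERGEs it into the target. Only rows whose hash
changed get `refreshed_at` updated, which enables incremental ES sync. New rows are
inserted, and rows no longer present in the staging table are deleted. Because
`source`, `created_datetime`, and `updated_datetime` are not part of the hash, a
change to those columns alone does not trigger an update.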
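
The three MERGE outcomes can be sketched in plain Python on dictionaries keyed by `raw_affiliation_string`. This is a simplified model under stated assumptions, not Delta's implementation: `merge_by_hash` is a hypothetical helper, and `now` stands in for `CURRENT_TIMESTAMP()`.

```python
from datetime import datetime

def merge_by_hash(target, source, now):
    """Apply hash-based merge semantics in place and return per-action counts."""
    stats = {"inserted": 0, "updated": 0, "deleted": 0, "unchanged": 0}
    # Rows missing from the source are deleted (WHEN NOT MATCHED BY SOURCE).
    for key in list(target):
        if key not in source:
            del target[key]
            stats["deleted"] += 1
    for key, src in source.items():
        tgt = target.get(key)
        if tgt is None:
            # New key: insert with a fresh timestamp (WHEN NOT MATCHED).
            target[key] = {**src, "refreshed_at": now}
            stats["inserted"] += 1
        elif (tgt.get("content_hash") or "") != src["content_hash"]:
            # Hash differs; a None hash is treated like '' (as with COALESCE).
            target[key] = {**src, "refreshed_at": now}
            stats["updated"] += 1
        else:
            # Same hash: the row and its old refreshed_at are left untouched.
            stats["unchanged"] += 1
    return stats

t0, t1 = datetime(2024, 1, 1), datetime(2024, 1, 2)
target = {
    "A": {"content_hash": "h1", "refreshed_at": t0},
    "B": {"content_hash": None, "refreshed_at": None},  # pre-migration row
    "C": {"content_hash": "h3", "refreshed_at": t0},
}
source = {
    "A": {"content_hash": "h1"},
    "B": {"content_hash": "h2"},
    "D": {"content_hash": "h4"},
}
stats = merge_by_hash(target, source, now=t1)
```

In this toy run, `A` keeps its original timestamp, `B` and `D` are stamped with `t1`, and `C` is removed.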

In [None]:
-- Enable schema auto-merge so the MERGE can add content_hash and refreshed_at
-- columns to the existing table on first run (they'll start as NULLs).
SET spark.databricks.delta.schema.autoMerge.enabled = true

In [None]:
-- Build staging table with content hash for change detection.
-- The INNER JOIN keeps only strings that also appear in the counts table.
CREATE OR REPLACE TABLE openalex.institutions._ras_lookup_staging AS
SELECT
    mv.raw_affiliation_string,
    mv.institution_ids AS institution_ids_final,
    mv.model_institution_ids AS institution_ids_from_model,
    mv.institution_ids_override,
    mv.countries,
    mv.source,
    mv.created_datetime,
    mv.updated_datetime,
    c.works_count,
    SHA2(TO_JSON(NAMED_STRUCT(
        'iif', mv.institution_ids,
        'iim', mv.model_institution_ids,
        'iio', mv.institution_ids_override,
        'c', mv.countries,
        'wc', c.works_count
    )), 256) AS content_hash
FROM openalex.institutions.raw_affiliation_strings_institutions_mv mv
INNER JOIN openalex.institutions.affiliation_string_works_counts c
    ON mv.raw_affiliation_string = c.raw_aff_string
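
The idea behind `SHA2(TO_JSON(NAMED_STRUCT(...)), 256)` is to serialize the tracked fields in a fixed order and hash the result. A rough Python equivalent is sketched below; it is an assumption-laden illustration, and its output is not guaranteed to match Spark's `TO_JSON` byte for byte, so the two hashes should not be compared with each other.

```python
import hashlib
import json

def content_hash(institution_ids, model_institution_ids,
                 institution_ids_override, countries, works_count):
    """SHA-256 hex digest over a fixed-order JSON encoding of the tracked fields."""
    payload = json.dumps(
        {
            "iif": institution_ids,
            "iim": model_institution_ids,
            "iio": institution_ids_override,
            "c": countries,
            "wc": works_count,
        },
        separators=(",", ":"),  # compact encoding, stable across calls
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Identical inputs give identical hashes; any tracked change gives a new one.
h_before = content_hash(["I1"], ["I1"], None, ["US"], 10)
h_same = content_hash(["I1"], ["I1"], None, ["US"], 10)
h_after = content_hash(["I1"], ["I1"], None, ["US"], 11)
```

One consequence of hashing serialized arrays is that element order matters: the same institution IDs in a different order produce a different hash and therefore count as a change.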

In [None]:
-- MERGE with hash-based change detection.
-- Only updates rows where content actually changed (new refreshed_at).
-- Inserts new rows, deletes rows no longer in source.
-- On first run, COALESCE(target.content_hash, '') handles NULLs from schema migration.
MERGE INTO openalex.institutions.affiliation_strings_lookup_with_counts AS target
USING openalex.institutions._ras_lookup_staging AS source
ON target.raw_affiliation_string = source.raw_affiliation_string
WHEN MATCHED AND COALESCE(target.content_hash, '') <> source.content_hash THEN
    UPDATE SET
        institution_ids_final = source.institution_ids_final,
        institution_ids_from_model = source.institution_ids_from_model,
        institution_ids_override = source.institution_ids_override,
        countries = source.countries,
        source = source.source,
        created_datetime = source.created_datetime,
        updated_datetime = source.updated_datetime,
        works_count = source.works_count,
        content_hash = source.content_hash,
        refreshed_at = CURRENT_TIMESTAMP()
WHEN NOT MATCHED THEN
    INSERT (raw_affiliation_string, institution_ids_final, institution_ids_from_model,
            institution_ids_override, countries, source, created_datetime, updated_datetime,
            works_count, content_hash, refreshed_at)
    VALUES (source.raw_affiliation_string, source.institution_ids_final, source.institution_ids_from_model,
            source.institution_ids_override, source.countries, source.source, source.created_datetime,
            source.updated_datetime, source.works_count, source.content_hash, CURRENT_TIMESTAMP())
WHEN NOT MATCHED BY SOURCE THEN DELETE

In [None]:
-- Clean up the staging table now that the MERGE has consumed it.
DROP TABLE IF EXISTS openalex.institutions._ras_lookup_staging

In [None]:
-- Verify rebuild + change detection stats
SELECT
  COUNT(*) AS total_rows,
  COUNT(CASE WHEN SIZE(institution_ids_final) > 0 THEN 1 END) AS rows_with_institutions,
  ROUND(COUNT(CASE WHEN SIZE(institution_ids_final) > 0 THEN 1 END) * 100.0 / NULLIF(COUNT(*), 0), 1) AS pct_with_institutions,
  COUNT(CASE WHEN refreshed_at >= CURRENT_DATE() THEN 1 END) AS rows_refreshed_today,
  MIN(refreshed_at) AS oldest_refresh,
  MAX(refreshed_at) AS newest_refresh
FROM openalex.institutions.affiliation_strings_lookup_with_counts