# Create Works Enriched

## Overview

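This notebook builds the final `openalex_works` table by enriching `openalex_works_base` with data from the sources listed below. It snapshots a content hash of the current table, deep clones the base table over it, merges in each enrichment, and then sets `updated_date` to the current timestamp only for rows whose content hash changed.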

## Data Flow

```
openalex_works_base (from CreateWorksBase)
    |
    v [DEEP CLONE]
openalex_works
    |
    v [MERGE operations]
openalex_works (enriched)
```

## Enrichment Sources

### Referenced Works (two sources)
1. **Legacy backfill:** `openalex.mid.citation` - historical citation data from MAG
2. **Parsed references:** `openalex.works.referenced_works` - resolved from raw references by `parse_work_references.ipynb`

Both are merged using `array_union` to combine without duplicates.
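
As a rough illustration of the union semantics (not the notebook's own code), the Python sketch below combines two reference lists the way `ARRAY_SORT(array_union(...))` does in the merge cells: duplicates collapse, the result is sorted, and the count is the size of the union. The function name `merge_referenced_works` and the sample IDs are made up for the example.

```python
def merge_referenced_works(backfill, parsed):
    """Union two lists of referenced work IDs, drop duplicates, and sort."""
    # Assumption: a missing list is treated as empty in this sketch.
    combined = set(backfill or []) | set(parsed or [])
    merged = sorted(combined)
    return merged, len(merged)


legacy_refs = [30, 10, 20]  # hypothetical IDs from the legacy backfill
parsed_refs = [20, 40]      # hypothetical IDs from parsed references
referenced_works, referenced_works_count = merge_referenced_works(legacy_refs, parsed_refs)
print(referenced_works, referenced_works_count)
```

Work `20` appears in both inputs but only once in the merged list, so the count reflects distinct references.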

### Citations
- `cited_by_count` and `counts_by_year` - computed from `referenced_works` edges
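- `fwci` and `citation_normalized_percentile` - computed from pub+3 citations within a cohort (same publication year, subfield, and work type)
- `cited_by_percentile_year` - computed from `counts_by_year` against a per-year citation distribution over all works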
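
To make the cohort metrics concrete, here is a minimal Python sketch of the two calculations, assuming a single cohort held in memory. The names `fwci` and `cohort_percent_rank` and the sample data are hypothetical. In the notebook the cohort mean is read from a precomputed table, whereas this sketch derives it from the sample cohort.

```python
def fwci(pub_plus_3_citations, cohort_mean):
    """Citations in the pub..pub+3 window divided by the cohort mean."""
    if cohort_mean is None or cohort_mean <= 0:
        return None  # undefined when the cohort has no usable mean
    return pub_plus_3_citations / cohort_mean


def cohort_percent_rank(citations_by_work):
    """PERCENT_RANK-style score, ordering by (citations, work_id) to break ties."""
    ordered = sorted(citations_by_work, key=lambda w: (citations_by_work[w], w))
    n = len(ordered)
    if n <= 1:
        return {w: 0.0 for w in ordered}
    return {w: i / (n - 1) for i, w in enumerate(ordered)}


# Hypothetical cohort: work_id -> citations received within pub+3 years
cohort = {1: 0, 2: 4, 3: 4, 4: 12, 5: 40}
cohort_mean = sum(cohort.values()) / len(cohort)  # assumption: computed inline
scores = {w: fwci(c, cohort_mean) for w, c in cohort.items()}
ranks = cohort_percent_rank(cohort)
print(scores[4], ranks[5])
```

Breaking ties by work ID gives every work a distinct rank, which is why two works with the same citation count can land on different percentiles.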

### Content
- **Fulltext:** from `openalex.pdf.pdf_combined` matched by DOI or PMH ID
- **Concepts/Keywords:** backfill + predicted (using `concept_key` hash)
- **Topics:** backfill (id < 6.6B) + frontfill (id > 6.6B)
- **SDG:** backfill + frontfill

### Metadata
- **Awards:** from `openalex.awards.work_awards`
- **Funders:** combined from backfill, fulltext extraction, and GTR
- **Authorships:** enriched with institution matching from `work_authorships`
- **Related works:** from `related_works_backfill`
- **Work type:** from `openalex.mid.work` (with curation overrides)

### Create and enrich `openalex.works.openalex_works`

In [None]:
-- Hash-based updated_date: Capture previous state before DEEP CLONE
-- Checkpoint: Excludes authorships, ids, cited_by_count (not yet deterministic)
-- See: https://github.com/ourresearch/oax-jobs/tree/main/active/hash-based-updated-date

CREATE OR REPLACE TABLE openalex.works.openalex_works_hash AS
SELECT
  id,
  updated_date,
  xxhash64(CONCAT_WS('|',
    CAST(id AS STRING),
    COALESCE(doi, ''),
    COALESCE(title, ''),
    COALESCE(CAST(publication_date AS STRING), ''),
    COALESCE(CAST(publication_year AS STRING), ''),
    COALESCE(type, ''),
    COALESCE(language, ''),
    COALESCE(abstract, ''),
    COALESCE(TO_JSON(referenced_works), '[]'),
    COALESCE(TO_JSON(topics), '[]'),
    COALESCE(TO_JSON(concepts), '[]'),
    COALESCE(TO_JSON(keywords), '[]'),
    COALESCE(TO_JSON(funders), '[]'),
    COALESCE(TO_JSON(locations), '[]'),
    COALESCE(TO_JSON(awards), '[]'),
    COALESCE(TO_JSON(open_access), '{}'),
    COALESCE(CAST(is_retracted AS STRING), 'false')
  )) AS content_hash
FROM openalex.works.openalex_works;

In [None]:
CREATE OR REPLACE TABLE openalex.works.openalex_works 
DEEP CLONE openalex.works.openalex_works_base  -- unable to use env_suffix with deep clone, set manually
TBLPROPERTIES (
  'delta.dataSkippingNumIndexedCols' = 36,
  'delta.deletedFileRetentionDuration' = '30 days',
  'delta.logRetentionDuration' = '30 days'
);

### Merge citations and referenced_works

In [None]:
-- MERGE Backfill referenced_works (mid.citation) - 95,305,004 works updated
WITH prod_ref_works AS (
  SELECT 
    paper_id as id,
    ARRAY_SORT(collect_set(paper_reference_id)) as referenced_works
  FROM openalex.mid.citation
  GROUP BY paper_id
)
MERGE INTO identifier('openalex' || :env_suffix || '.works.openalex_works') as target
USING prod_ref_works as source
ON target.id = source.id
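WHEN MATCHED THEN UPDATE SET
  -- COALESCE guards against a NULL target array, which would make array_union return NULL
  referenced_works = ARRAY_SORT(array_union(COALESCE(target.referenced_works, ARRAY()), source.referenced_works)),
  referenced_works_count = size(array_union(COALESCE(target.referenced_works, ARRAY()), source.referenced_works));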

In [None]:
%sql
-- use newly refreshed parsed data (overlap with mid.citation backfill above, add more if exists) -- 102,749,754
MERGE INTO openalex.works.openalex_works AS target
USING openalex.works.referenced_works AS source
ON target.id = source.citing_work_id
WHEN MATCHED -- AND (target.referenced_works is null OR size(target.referenced_works) = 0) -- either don't overwrite or union
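THEN UPDATE SET
  -- COALESCE guards against a NULL target array, which would make array_union return NULL
  target.referenced_works = ARRAY_SORT(array_union(COALESCE(target.referenced_works, ARRAY()), source.referenced_works)),
  target.referenced_works_count = size(array_union(COALESCE(target.referenced_works, ARRAY()), source.referenced_works));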

#### Merge cited_by_count and counts_by_year

In [0]:
-- Calculate and MERGE the citations
-- Far fewer changes than propagating through locations_mapped and 17 CTEs, no need to select distinct work_id data
-- runtime about 1 min, updates 67M rows
WITH exploded_references AS (
  SELECT
    id,
    publication_year,
    EXPLODE(referenced_works) AS cited_work_id
  FROM identifier('openalex' || :env_suffix || '.works.openalex_works')
  WHERE referenced_works_count > 0
    AND publication_year <= YEAR(CURRENT_DATE())
    AND type != 'dataset'
),
citation_counts AS (
  SELECT
    cited_work_id,
    publication_year,
    COUNT(*) AS cited_by_count
  FROM exploded_references
  GROUP BY cited_work_id, publication_year
),
citation_counts_by_work AS (
  SELECT 
    cited_work_id,
    FILTER(
      SORT_ARRAY(
        COLLECT_LIST(
          NAMED_STRUCT(
            'year', publication_year,
            'cited_by_count', cited_by_count
          )
        ),
        false
      ),
      x -> x.year >= 2012
    ) AS counts_by_year,
    SUM(cited_by_count) AS cited_by_count -- total across all years
  FROM citation_counts
  GROUP BY cited_work_id
)
MERGE INTO identifier('openalex' || :env_suffix || '.works.openalex_works') AS target
USING citation_counts_by_work AS source
ON target.id = source.cited_work_id
WHEN MATCHED THEN
UPDATE SET
  target.cited_by_count = source.cited_by_count,
  target.counts_by_year = source.counts_by_year;
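
The aggregation above can be sketched in plain Python for a handful of works. This is only an illustration under simplifying assumptions: the function name `citation_counts` and the sample records are invented, and the `dataset` and future-year filters from the SQL are left out.

```python
from collections import Counter, defaultdict


def citation_counts(works, min_year=2012):
    """Count citations per cited work, in total and per citing year."""
    per_year = defaultdict(Counter)  # cited_work_id -> {citing_year: count}
    for work in works:
        for cited_id in work["referenced_works"]:
            per_year[cited_id][work["publication_year"]] += 1

    result = {}
    for cited_id, years in per_year.items():
        counts_by_year = [
            {"year": year, "cited_by_count": count}
            for year, count in sorted(years.items(), reverse=True)
            if year >= min_year  # only recent years are kept in counts_by_year
        ]
        result[cited_id] = {
            "cited_by_count": sum(years.values()),  # total across all years
            "counts_by_year": counts_by_year,
        }
    return result


# Hypothetical citing works
works = [
    {"id": 1, "publication_year": 2010, "referenced_works": [100]},
    {"id": 2, "publication_year": 2020, "referenced_works": [100, 200]},
    {"id": 3, "publication_year": 2020, "referenced_works": [100]},
    {"id": 4, "publication_year": 2023, "referenced_works": [100]},
]
counts = citation_counts(works)
print(counts[100])
```

The 2010 citation still contributes to the total `cited_by_count` for work `100` but is dropped from `counts_by_year`, which mirrors the `year >= 2012` filter.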



### Merge full-text

In [0]:
-------- Merge fulltext from PDFs --------
WITH pdf_fulltext_for_merge AS (
    -- DOI-based matching
    SELECT 
        CONCAT('https://doi.org/', LOWER(FILTER(ids, x -> x.namespace = 'doi')[0].id)) AS doi_normalized,
        NULL AS pmh_id,
        fulltext,
        'doi' AS match_type,
        ROW_NUMBER() OVER (PARTITION BY CONCAT('https://doi.org/', LOWER(FILTER(ids, x -> x.namespace = 'doi')[0].id)) ORDER BY LENGTH(fulltext) DESC) AS rn
    FROM openalex.pdf.pdf_combined
    WHERE SIZE(FILTER(ids, x -> x.namespace = 'doi')) > 0
      AND fulltext IS NOT NULL
      AND TRIM(fulltext) != ''
    
    UNION ALL
    
    -- PMH ID-based matching
    SELECT 
        NULL AS doi_normalized,
        FILTER(ids, x -> x.namespace = 'pmh')[0].id AS pmh_id,
        fulltext,
        'pmh' AS match_type,
        ROW_NUMBER() OVER (PARTITION BY FILTER(ids, x -> x.namespace = 'pmh')[0].id ORDER BY LENGTH(fulltext) DESC) AS rn
    FROM openalex.pdf.pdf_combined
    WHERE SIZE(FILTER(ids, x -> x.namespace = 'pmh')) > 0
      AND fulltext IS NOT NULL
      AND TRIM(fulltext) != ''
      -- Only include PMH records that don't have DOIs (to avoid duplicates)
      AND SIZE(FILTER(ids, x -> x.namespace = 'doi')) = 0
),
pdf_fulltext_deduped AS (
    SELECT doi_normalized, pmh_id, fulltext, match_type
    FROM pdf_fulltext_for_merge
    WHERE rn = 1
),
works_with_locations AS (
    SELECT 
        w.id,
        w.doi,
        EXPLODE_OUTER(w.locations) AS location
    FROM identifier('openalex' || :env_suffix || '.works.openalex_works') w
),
matched_fulltext AS (
    -- DOI matches
    SELECT 
        w.id AS work_id,
        p.fulltext,
        p.match_type
    FROM (SELECT DISTINCT id, doi FROM works_with_locations) w
    INNER JOIN pdf_fulltext_deduped p 
        ON LOWER(w.doi) = p.doi_normalized
    WHERE p.doi_normalized IS NOT NULL
    
    UNION ALL
    
    -- PMH ID matches
    SELECT 
        w.id AS work_id,
        p.fulltext,
        p.match_type
    FROM works_with_locations w
    INNER JOIN pdf_fulltext_deduped p 
        ON w.location.pmh_id = p.pmh_id
    WHERE p.pmh_id IS NOT NULL
      AND w.location.pmh_id IS NOT NULL
),
final_fulltext AS (
    -- Deduplicate in case a work matches on both DOI and PMH
    -- Prefer DOI matches over PMH matches
    SELECT 
        work_id,
        fulltext,
        ROW_NUMBER() OVER (PARTITION BY work_id ORDER BY CASE WHEN match_type = 'doi' THEN 1 ELSE 2 END) AS priority_rn
    FROM matched_fulltext
),
cleaned_fulltext AS (
    SELECT 
        work_id,
        TRIM(
            REGEXP_REPLACE(
                REGEXP_REPLACE(
                    REGEXP_REPLACE(
                        SUBSTRING(fulltext, 1, 200000),
                        '<[^>]+>',  -- Remove HTML tags
                        ' '
                    ),
                    '\\s+',         -- Replace multiple whitespace with single space
                    ' '
                ),
                '(^\\s+|\\s+$)',    -- Additional trim for safety
                ''
            )
        ) AS cleaned_fulltext
    FROM final_fulltext 
    WHERE priority_rn = 1
)
MERGE INTO identifier('openalex' || :env_suffix || '.works.openalex_works') AS target
USING (
    SELECT 
        work_id, 
        cleaned_fulltext AS fulltext
    FROM cleaned_fulltext
    WHERE cleaned_fulltext IS NOT NULL 
      AND LENGTH(cleaned_fulltext) > 0
) AS source
ON target.id = source.work_id
WHEN MATCHED THEN
UPDATE SET
  target.fulltext = source.fulltext;
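
For readers who want to try the cleaning steps on a sample string, the following Python sketch applies the same sequence as the `cleaned_fulltext` CTE: truncate to 200,000 characters, replace HTML-like tags with a space, collapse whitespace, and trim. The function name `clean_fulltext` is hypothetical, and Python's `\s` may not match exactly the same characters as the SQL engine's regex.

```python
import re

MAX_CHARS = 200_000  # same truncation length as the SUBSTRING call


def clean_fulltext(raw):
    """Truncate, strip tags, collapse whitespace, and trim."""
    text = raw[:MAX_CHARS]
    text = re.sub(r"<[^>]+>", " ", text)  # remove HTML tags
    text = re.sub(r"\s+", " ", text)      # collapse runs of whitespace
    return text.strip()


sample = "<p>Hello</p>\n\n  <b>world</b> "
print(repr(clean_fulltext(sample)))
```

Truncation happens before tag removal, so the cleaned text can be noticeably shorter than the limit when the input contains a lot of markup.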

### Merge Concepts

#### Backfill

In [0]:
---------- MERGE aggregated and sorted by score Concepts from backfill --------
MERGE INTO identifier('openalex' || :env_suffix || '.works.openalex_works') AS target
USING openalex.works.work_concepts_backfill AS source
  ON target.id = source.work_id
WHEN MATCHED THEN
  UPDATE SET
  target.concepts = source.concepts,
  target.keywords = filter(source.keywords, k -> k.score > 0);

#### Predicted

In [0]:

---------- MERGE from predicted Concepts using concept_key --------
-- ============= Tunable parameters =============
DECLARE OR REPLACE VARIABLE filter_threshold FLOAT DEFAULT 0.20;  -- score cutoff for counting confident keywords
DECLARE OR REPLACE VARIABLE base_mid         FLOAT DEFAULT 5.0;   -- keyword cap at the curve midpoint (where tanh = 0)
DECLARE OR REPLACE VARIABLE half_range       FLOAT DEFAULT 6.0;   -- maximum deviation from base_mid (+/- range)
DECLARE OR REPLACE VARIABLE center_size      INT   DEFAULT 7;     -- confident-keyword count where the tanh crosses 0 (inflection point)
DECLARE OR REPLACE VARIABLE slope            FLOAT DEFAULT 0.05;  -- steepness of the tanh curve

MERGE INTO identifier('openalex' || :env_suffix || '.works.openalex_works') AS target
USING (
  SELECT concept_key,
         FIRST(concepts_enriched) AS concepts,
         FIRST(keywords) as keywords
  FROM openalex.works.openalex_works_concepts_predicted
  WHERE size(concepts_enriched) > 0 OR size(keywords) > 0
  GROUP BY concept_key
) as source
ON -- (target.concepts IS NULL OR size(target.concepts) = 0) AND 
   xxhash64(
     concat_ws('|',
       target.title,
       target.abstract,
       target.primary_location.source.display_name,
       target.primary_location.source.type
     )
   ) = source.concept_key
WHEN MATCHED AND target.id > 6600000000 THEN
  UPDATE SET
    target.concepts = slice(source.concepts, 1, 40), -- too many concepts from the model - up to 130
    target.keywords = slice(
      filter(source.keywords, k -> k.score > 0), 1,
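      -- slice() expects an integer length; cast explicitly so this also works under ANSI mode
      CAST(greatest(2, least(12, round(base_mid + 
          half_range * tanh((
            size(filter(source.keywords, 
              k -> k.score > filter_threshold)) - center_size) * slope)))
      ) AS INT)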
    );
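
The keyword cap is easier to reason about outside SQL. Below is a small Python sketch of the same formula using the default parameter values from the cell above. The function names are hypothetical, keywords are assumed to arrive sorted by score in descending order, and rounding is approximated with `floor(x + 0.5)`, which agrees with half-up rounding for the positive values that survive the clamp.

```python
import math

FILTER_THRESHOLD = 0.20
BASE_MID = 5.0
HALF_RANGE = 6.0
CENTER_SIZE = 7
SLOPE = 0.05


def keyword_cap(confident_count):
    """Number of keywords to keep, given how many score above the threshold."""
    raw = BASE_MID + HALF_RANGE * math.tanh((confident_count - CENTER_SIZE) * SLOPE)
    rounded = math.floor(raw + 0.5)
    return max(2, min(12, rounded))


def select_keywords(keywords):
    """keywords: list of (name, score) pairs, assumed sorted by score descending."""
    positive = [k for k in keywords if k[1] > 0]
    confident = sum(1 for k in keywords if k[1] > FILTER_THRESHOLD)
    return positive[:keyword_cap(confident)]


for n in (0, 7, 27, 1000):
    print(n, keyword_cap(n))
```

Because `tanh` stays strictly between -1 and 1, the unclamped value lies between -1 and 11 with these defaults, so the upper clamp of 12 is never reached and the effective range is 2 to 11.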


### Merge Topics

In [0]:
-- Combined MERGE for Topics (backfill + frontfill)
-- Backfill and frontfill are mutually exclusive (backfill: id < 6600000000, frontfill: id > 6600000000)
MERGE INTO identifier('openalex' || :env_suffix || '.works.openalex_works') AS target
USING (
  -- Backfill: old works only
  SELECT work_id, topics
  FROM openalex.works.work_topics_backfill
  WHERE work_id < 6600000000

  UNION ALL

  -- Frontfill: new works only
  SELECT work_id, FIRST(topics) as topics
  FROM openalex.works.work_topics_frontfill
  WHERE work_id > 6600000000
  GROUP BY work_id
) AS source
ON target.id = source.work_id
   AND (target.topics IS NULL OR SIZE(target.topics) = 0)
WHEN MATCHED THEN UPDATE SET
  target.topics = source.topics,
  target.primary_topic = source.topics[0];
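
The selection rule in this merge can be summarized as a small decision function. The Python sketch below is an illustration only: `topics_update` and the sample topic labels are hypothetical, and the two sources are modelled as dictionaries keyed by work ID.

```python
ID_SPLIT = 6_600_000_000  # boundary between backfill and frontfill work IDs


def topics_update(work_id, current_topics, backfill, frontfill):
    """Return (topics, primary_topic) to write, or None when nothing changes."""
    if current_topics:
        return None  # existing topics are never overwritten
    if work_id < ID_SPLIT:
        topics = backfill.get(work_id)
    elif work_id > ID_SPLIT:
        topics = frontfill.get(work_id)
    else:
        topics = None  # the boundary ID itself matches neither branch
    if topics is None:
        return None
    primary = topics[0] if topics else None
    return topics, primary


backfill = {100: ["T1", "T2"]}           # hypothetical topic labels
frontfill = {7_000_000_000: ["T9"]}
print(topics_update(100, [], backfill, frontfill))
print(topics_update(7_000_000_000, None, backfill, frontfill))
```

Each work ID can only draw from one of the two sources, which is what keeps the `UNION ALL` from producing two candidate rows for the same target.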

### Merge FWCI and citation percentiles

In [None]:
-- FWCI + cohort percentile (pub+3 within pub_year/subfield_id/work_type) -- 257,176,397 updated
-- + cited_by_percentile_year (global by eval_year)
-- FWCI and the cohort percentile are computed from citation edges (referenced_works);
-- cited_by_percentile_year reads the precomputed counts_by_year.

WITH base AS (  -- candidate works + work_type mapping
  SELECT
    id AS work_id,
    CASE
      WHEN type = 'article'
           AND primary_location.source.type = 'conference' THEN 'conference_article'
      WHEN type IN ('article', 'book', 'review', 'book-chapter') THEN type
      ELSE NULL
    END AS work_type,
    COALESCE(publication_year, YEAR(publication_date)) AS pub_year,
    primary_topic.subfield.id AS subfield_id
  FROM identifier('openalex' || :env_suffix || '.works.openalex_works')
  WHERE primary_topic.subfield.id IS NOT NULL
    AND COALESCE(publication_year, YEAR(publication_date)) IS NOT NULL
),

-- All citation edges: (citing_year -> cited_work_id)
edges AS (
  SELECT
    w.publication_year AS citing_year,
    EXPLODE(COALESCE(w.referenced_works, ARRAY())) AS cited_work_id
  FROM identifier('openalex' || :env_suffix || '.works.openalex_works') AS w
  WHERE w.referenced_works_count > 0
    AND w.publication_year IS NOT NULL
    AND w.publication_year <= YEAR(CURRENT_DATE())
),

-- Per-work pub+3 citations via edges (join + conditional sum)
three_years AS (
  SELECT
    b.work_id,
    b.subfield_id,
    b.pub_year,
    b.work_type,
    SUM(
      CASE
        WHEN e.citing_year BETWEEN b.pub_year AND LEAST(b.pub_year + 3, YEAR(CURRENT_DATE()))
        THEN 1 ELSE 0
      END
    ) AS pub_plus_3_citations
  FROM base b
  LEFT JOIN edges e
    ON e.cited_work_id = b.work_id
  WHERE b.work_type IS NOT NULL
  GROUP BY b.work_id, b.subfield_id, b.pub_year, b.work_type
),

-- Join precomputed cohort means (keyed by publication_year, subfield_id, work_type) to compute FWCI
with_fwci AS (
  SELECT
    t.work_id,
    t.subfield_id,
    t.pub_year,
    t.work_type,
    t.pub_plus_3_citations,
    CASE
      WHEN d.mean_citations IS NULL OR d.mean_citations <= 0 THEN NULL
      ELSE t.pub_plus_3_citations / d.mean_citations
    END AS fwci
  FROM three_years t
  LEFT JOIN openalex.common.citations_mean_pub_year_type d
    ON d.publication_year = t.pub_year
   AND d.subfield_id      = t.subfield_id
   AND d.work_type        = t.work_type
),

-- Cohort percentile for pub+3 within (pub_year, subfield_id, work_type) + top-1/10 flags
-- Note: is_in_top_1_percent and is_in_top_10_percent are derived directly from citation_pct_cohort
-- to ensure they always match the displayed percentile value (fixes inconsistency bug)
with_percentile AS (
  SELECT
    work_id,
    subfield_id,
    pub_year,
    work_type,
    pub_plus_3_citations,
    fwci,
    citation_pct_cohort,
    (citation_pct_cohort >= 0.99) AS is_in_top_1_percent,
    (citation_pct_cohort >= 0.90) AS is_in_top_10_percent
  FROM (
    -- Compute the rounded percentile once so the flags and the stored value cannot disagree
    SELECT
      work_id,
      subfield_id,
      pub_year,
      work_type,
      pub_plus_3_citations,
      ROUND(fwci, 8) AS fwci,
      ROUND(
        PERCENT_RANK() OVER (
          PARTITION BY pub_year, subfield_id, work_type
          ORDER BY pub_plus_3_citations, work_id
        ), 8
      ) AS citation_pct_cohort
    FROM with_fwci
  )
),

/* ===== cited_by_percentile_year (global, min/max across years) =====
   Builds a per-year citation-count distribution from counts_by_year that
   includes a zero bucket and preserves frequencies. */

-- All works with their pub_year (global, not filtered by work_type/subfield)
all_works AS (
  SELECT id AS work_id,
         coalesce(publication_year, year(publication_date)) AS pub_year
  FROM identifier('openalex' || :env_suffix || '.works.openalex_works')
  WHERE coalesce(publication_year, year(publication_date)) IS NOT NULL
),

-- Years universe
years AS (
  SELECT explode(sequence(1920, year(current_date()))) AS year
),

-- Total alive works per year (eligible to receive citations)
alive_per_year AS (
  SELECT y.year,
         count(*) AS alive_works
  FROM years y
  JOIN all_works w
    ON y.year >= w.pub_year
  GROUP BY y.year
),

-- Non-zero citation buckets from counts_by_year (use existing precomputed counts)
nonzero_year_freq AS (
  SELECT
    cy.year,
    cy.cited_by_count AS citation_count,
    count(*) AS freq
  FROM (
    SELECT
      explode(coalesce(w.counts_by_year, array())) AS cy
    FROM identifier('openalex' || :env_suffix || '.works.openalex_works') w
  )
  WHERE cy.year BETWEEN 1920 AND year(current_date())
    AND cy.cited_by_count > 0
  GROUP BY cy.year, cy.cited_by_count
),

-- Sum of non-zero frequencies per year
nonzero_sum AS (
  SELECT year, sum(freq) AS nonzero_total
  FROM nonzero_year_freq
  GROUP BY year
),

-- Zero bucket derived as alive - nonzero_total
zero_bucket AS (
  SELECT a.year,
         0 AS citation_count,
         greatest(a.alive_works - coalesce(n.nonzero_total, 0), 0) AS freq
  FROM alive_per_year a
  LEFT JOIN nonzero_sum n USING (year)
),

-- Full frequency table including zero bucket
year_count_freq AS (
  SELECT * FROM nonzero_year_freq
  UNION ALL
  SELECT * FROM zero_bucket
),

-- Cumulative distribution per year
year_count_cume AS (
  SELECT
    year,
    citation_count,
    freq,
    sum(freq) OVER (PARTITION BY year) AS total_freq,
    sum(freq) OVER (PARTITION BY year ORDER BY citation_count
                    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cum_freq_inclusive
  FROM year_count_freq
),

-- Bounds for each (year, count)
year_count_bounds AS (
  SELECT
    year,
    citation_count,
    case when total_freq <= 1 then 0.0
         else (cum_freq_inclusive - freq) / total_freq end AS lower_pct,
    case when total_freq = 0 then 0.0
         else cum_freq_inclusive / total_freq end AS upper_pct
  FROM year_count_cume
),

-- Map each work's counts_by_year to bounds
work_year_bands AS (
  SELECT
    w.id AS work_id,
    cy.year,
    b.lower_pct,
    b.upper_pct
  FROM (
    SELECT
      w.id,
      explode(coalesce(w.counts_by_year, array())) AS cy
    FROM identifier('openalex' || :env_suffix || '.works.openalex_works') w
  ) w
  JOIN year_count_bounds b
    ON b.year = cy.year AND b.citation_count = cy.cited_by_count
),

-- Collapse to min/max across years per work
year_pct_minmax AS (
  SELECT
    work_id,
    min(lower_pct) AS lower_pct_min,
    max(upper_pct) AS upper_pct_max
  FROM work_year_bands
  GROUP BY work_id
),

formatted_year_pct AS (
  SELECT
    work_id,
    named_struct(
      'min',
        case
          when round(coalesce(lower_pct_min,0)*100)=100 then 99
          when round(coalesce(lower_pct_min,0)*100)=round(coalesce(upper_pct_max,0)*100)
            then greatest(cast(round(coalesce(lower_pct_min,0)*100) as int)-1, 0)
          else cast(round(coalesce(lower_pct_min,0)*100) as int)
        end,
      'max',
        case
          when round(coalesce(upper_pct_max,0)*100)=100 then 100
          else cast(round(coalesce(upper_pct_max,0)*100) as int)
        end
    ) AS cited_by_percentile_year
  FROM year_pct_minmax
),

updates AS (
  SELECT
    p.work_id,
    p.fwci,
    NAMED_STRUCT(
      'value', p.citation_pct_cohort,
      'is_in_top_1_percent', p.is_in_top_1_percent,
      'is_in_top_10_percent', p.is_in_top_10_percent
    ) AS citation_normalized_percentile,
    y.cited_by_percentile_year
  FROM with_percentile p
  LEFT JOIN formatted_year_pct y
    ON y.work_id = p.work_id
)
-- Preview:
-- SELECT * FROM updates;

MERGE INTO identifier('openalex' || :env_suffix || '.works.openalex_works') AS target
USING updates AS source
  ON target.id = source.work_id
WHEN MATCHED
THEN UPDATE SET
  target.fwci = COALESCE(source.fwci, target.fwci),
  target.citation_normalized_percentile =
    COALESCE(source.citation_normalized_percentile, target.citation_normalized_percentile),
  target.cited_by_percentile_year =
    COALESCE(source.cited_by_percentile_year, target.cited_by_percentile_year);
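
The zero-bucket and cumulative-bound steps for `cited_by_percentile_year` can be hard to follow in SQL. Here is a minimal Python sketch for a single year, with made-up frequencies. The function names `with_zero_bucket` and `percentile_bounds` are hypothetical, and the final rounding to integer `min`/`max` values is not shown.

```python
def with_zero_bucket(nonzero_freq, alive_works):
    """Add the count of works that received no citations in the year."""
    zero = max(alive_works - sum(nonzero_freq.values()), 0)
    return {0: zero, **nonzero_freq}


def percentile_bounds(count_freq):
    """Map each citation count to its (lower, upper) cumulative fraction."""
    total = sum(count_freq.values())
    bounds = {}
    cumulative = 0
    for count in sorted(count_freq):
        lower = 0.0 if total <= 1 else cumulative / total
        cumulative += count_freq[count]
        upper = 0.0 if total == 0 else cumulative / total
        bounds[count] = (lower, upper)
    return bounds


# Hypothetical year: 10 works alive, 3 cited once, 1 cited five times
freq = with_zero_bucket({1: 3, 5: 1}, alive_works=10)
bounds = percentile_bounds(freq)
print(freq)
print(bounds)
```

Without the zero bucket the distribution would only cover cited works, and a work with a single citation would appear to sit at the bottom of the range rather than above the uncited majority.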

### Merge `sustainable_development_goals`

In [0]:
-- Combined MERGE for SDG (backfill + frontfill)
-- Verified: Backfill has 0 works with id > 6600000000, so no overlap with frontfill
MERGE INTO identifier('openalex' || :env_suffix || '.works.openalex_works') AS target
USING (
  -- Backfill: only contains old works (verified: 0 rows with id > 6600000000)
  SELECT paper_id as work_id, sustainable_development_goals
  FROM openalex.works.work_sdg_backfill

  UNION ALL

  -- Frontfill: new works only
  SELECT work_id, FIRST(sdg) as sustainable_development_goals
  FROM openalex.works.works_sdg_frontfill
  WHERE work_id > 6600000000
  GROUP BY work_id
  HAVING SIZE(FIRST(sdg)) > 0
) AS source
ON target.id = source.work_id
WHEN MATCHED THEN UPDATE SET
  target.sustainable_development_goals = source.sustainable_development_goals;

### Merge `awards`

In [0]:
MERGE INTO identifier('openalex' || :env_suffix || '.works.openalex_works') AS target
USING (
  SELECT work_id,
    ARRAY_SORT(
      collect_set(award),
      (left, right) -> CASE
        WHEN left.id < right.id THEN -1
        WHEN left.id > right.id THEN 1
        ELSE 0
      END
    ) as awards
  FROM openalex.awards.work_awards
  WHERE work_id IS NOT NULL
  GROUP BY work_id
) as source
ON target.id = source.work_id
WHEN MATCHED THEN UPDATE SET
  target.awards = source.awards;

### Merge `funders`

In [0]:
%sql
-- 1) Build rolled-up funders from the funders backfill, fulltext extraction, and GTR
WITH from_backfill AS (
  SELECT
    paper_id AS work_id,
    funder_id
  FROM openalex.mid.work_funder
),
from_backfill_enriched AS (
  SELECT
    b.work_id,
    CONCAT('https://openalex.org/F', b.funder_id) as funder_id,
    mf.ror_id AS ror,
    mf.display_name AS display_name
  FROM from_backfill b
  LEFT JOIN openalex.mid.funder mf
    ON mf.funder_id = b.funder_id
),
from_fulltext_enriched AS (
  SELECT
    ft.work_id,
    ft.funder_id,
    ft.ror_id AS ror,
    ft.funder_display_name AS display_name
  FROM openalex.works.fulltext_work_funders ft
  JOIN openalex.common.funder_names_keep keep ON keep.name = ft.funder_name
),
from_gtr AS (
  SELECT
    work_id,
    funder.id as funder_id,
    funder.ror_id as ror,
    funder.display_name as display_name
  FROM openalex.awards.gtr_awards
  WHERE work_id IS NOT NULL
),
unioned AS (
  SELECT work_id, funder_id, ror, display_name FROM from_backfill_enriched
  UNION ALL  
  SELECT work_id, funder_id, ror, display_name FROM from_fulltext_enriched
  UNION ALL
  SELECT work_id, funder_id, ror, display_name FROM from_gtr
),
dedup AS (
  -- one row per (work_id, funder_id), pick deterministic values
  SELECT
    work_id,
    funder_id,
    MAX(display_name) AS display_name,
    MAX(ror) AS ror
  FROM unioned
  GROUP BY work_id, funder_id
),
rolled_up AS (
  SELECT
    work_id,
    -- order by funder_id via lexicographic struct ordering (id is first field)
    sort_array(
      collect_list(
        struct(
          funder_id as id,
          display_name,
          ror
        )
      )
    ) AS funders
  FROM dedup
  GROUP BY work_id
)
-- 2) Merge into openalex_works.funders (array<struct<id:string, ror:string, display_name:string>>)
MERGE INTO identifier('openalex' || :env_suffix || '.works.openalex_works') AS target
USING rolled_up AS source
  ON target.id = source.work_id
WHEN MATCHED THEN
  UPDATE SET target.funders = source.funders;
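
The de-duplication and roll-up can be pictured with a short Python sketch. It assumes all three sources have already been normalized to `(work_id, funder_id, display_name, ror)` tuples. The function names and sample values are invented for the example.

```python
def _max_non_null(a, b):
    """Like SQL MAX over two values: NULLs are ignored."""
    if a is None:
        return b
    if b is None:
        return a
    return max(a, b)


def roll_up_funders(rows):
    """Collapse to one entry per (work_id, funder_id), then group by work."""
    best = {}
    for work_id, funder_id, display_name, ror in rows:
        key = (work_id, funder_id)
        prev_name, prev_ror = best.get(key, (None, None))
        best[key] = (_max_non_null(prev_name, display_name),
                     _max_non_null(prev_ror, ror))

    funders = {}
    # Sorting by (work_id, funder_id) yields funders ordered by id within each work.
    for (work_id, funder_id), (name, ror) in sorted(best.items()):
        funders.setdefault(work_id, []).append(
            {"id": funder_id, "display_name": name, "ror": ror}
        )
    return funders


rows = [  # hypothetical rows from the combined sources
    (1, "F2", "Funder Two", None),
    (1, "F2", "Funder Two", "ror-two"),
    (1, "F1", "Funder One", None),
]
print(roll_up_funders(rows))
```

Taking the maximum of the non-null values makes the result independent of row order, which matters when the same funder arrives from more than one source.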

### Merge `authorships`

In [0]:
MERGE INTO identifier('openalex' || :env_suffix || '.works.openalex_works') AS target
USING identifier('openalex' || :env_suffix || '.works.work_authorships') AS source
ON target.id = source.work_id
WHEN MATCHED THEN UPDATE SET
  target.authorships = source.authorships,
  target.authors_count = source.authors_count,
  target.corresponding_author_ids = source.corresponding_author_ids,
  target.corresponding_institution_ids = source.corresponding_institution_ids;

### Update `institutions_distinct_count` and `countries_distinct_count`

In [0]:
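-- authorships.institutions is an array of arrays (one per authorship), so flatten and
-- de-duplicate before counting; SIZE() on its own would count authorships instead.
UPDATE identifier('openalex' || :env_suffix || '.works.openalex_works')
  SET institutions_distinct_count = COALESCE(SIZE(
        ARRAY_DISTINCT(ARRAY_COMPACT(FLATTEN(authorships.institutions)))), 0),
      countries_distinct_count = COALESCE(SIZE(
        ARRAY_DISTINCT(ARRAY_COMPACT(FLATTEN(authorships.institutions.country_code)))), 0)
WHERE authorships IS NOT NULL
  AND size(authorships) > 0;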

### Merge `work.type`

In [0]:
MERGE INTO identifier('openalex' || :env_suffix || '.works.openalex_works') as target
USING (
  WITH approved_curations AS (
    SELECT
      CAST(SUBSTRING(entity_id, 2) AS BIGINT) AS work_id,
      MAP_FROM_ENTRIES(COLLECT_LIST(STRUCT(property, property_value))) AS curations
    FROM
      openalex.curations.approved_curations
    WHERE
      entity = 'works'
      AND property IN ('type', 'language')
      AND status = 'approved'
    GROUP BY CAST(SUBSTRING(entity_id, 2) AS BIGINT)
  )
  SELECT 
    w.paper_id as work_id,
    w.type,
    w.type_crossref,
    ac.work_id IS NOT NULL as has_curation
  FROM openalex.mid.work w
  LEFT JOIN approved_curations ac ON w.paper_id = ac.work_id
) as source
ON target.id = source.work_id
WHEN MATCHED 
  AND target.type <> source.type
  AND source.type IS NOT NULL
  AND source.has_curation = FALSE
THEN UPDATE SET
  target.type = COALESCE(source.type, target.type),
  target.type_crossref = source.type_crossref;

### Merge `related_works`

In [0]:
MERGE INTO identifier('openalex' || :env_suffix || '.works.openalex_works') as target
USING (
  SELECT 
    work_id, related_works
  FROM openalex.works.related_works_backfill
) as source
ON target.id = source.work_id
WHEN MATCHED AND source.related_works IS NOT NULL AND SIZE(source.related_works) > 0
THEN UPDATE SET
  target.related_works = source.related_works;

In [None]:
-- Hash-based updated_date: Compare and update after all MERGEs
-- Only set updated_date to CURRENT_TIMESTAMP() if content actually changed
-- Checkpoint: Excludes authorships, ids, cited_by_count (not yet deterministic)
-- See: https://github.com/ourresearch/oax-jobs/tree/main/active/hash-based-updated-date

WITH new_hashes AS (
  SELECT
    id,
    xxhash64(CONCAT_WS('|',
      CAST(id AS STRING),
      COALESCE(doi, ''),
      COALESCE(title, ''),
      COALESCE(CAST(publication_date AS STRING), ''),
      COALESCE(CAST(publication_year AS STRING), ''),
      COALESCE(type, ''),
      COALESCE(language, ''),
      COALESCE(abstract, ''),
      COALESCE(TO_JSON(referenced_works), '[]'),
      COALESCE(TO_JSON(topics), '[]'),
      COALESCE(TO_JSON(concepts), '[]'),
      COALESCE(TO_JSON(keywords), '[]'),
      COALESCE(TO_JSON(funders), '[]'),
      COALESCE(TO_JSON(locations), '[]'),
      COALESCE(TO_JSON(awards), '[]'),
      COALESCE(TO_JSON(open_access), '{}'),
      COALESCE(CAST(is_retracted AS STRING), 'false')
    )) AS content_hash
  FROM openalex.works.openalex_works
)
MERGE INTO openalex.works.openalex_works AS target
USING (
  SELECT
    n.id,
    CASE
      WHEN p.id IS NULL THEN CURRENT_TIMESTAMP()              -- new record
      WHEN n.content_hash <> p.content_hash THEN CURRENT_TIMESTAMP()  -- changed
      ELSE p.updated_date                                      -- unchanged
    END AS new_updated_date
  FROM new_hashes n
  LEFT JOIN openalex.works.openalex_works_hash p ON n.id = p.id
) AS source
ON target.id = source.id
WHEN MATCHED THEN UPDATE SET target.updated_date = source.new_updated_date;
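
The hash comparison boils down to a three-way decision per record. The Python sketch below illustrates it under stated assumptions: it hashes only a small subset of fields, and it substitutes SHA-256 from the standard library for `xxhash64`. The names `content_hash`, `new_updated_date`, and `HASHED_FIELDS` are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone

# Assumption: a reduced field list, just for illustration
HASHED_FIELDS = ("id", "doi", "title", "type", "referenced_works")


def content_hash(record):
    """Stable hash over the selected fields (SHA-256 stands in for xxhash64)."""
    payload = "|".join(json.dumps(record.get(f), sort_keys=True) for f in HASHED_FIELDS)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def new_updated_date(record, previous, now):
    """previous: id -> (content_hash, updated_date) captured before the rebuild."""
    prior = previous.get(record["id"])
    if prior is None:
        return now  # new record
    prior_hash, prior_date = prior
    if content_hash(record) != prior_hash:
        return now  # content changed
    return prior_date  # unchanged: keep the old date


old_date = datetime(2024, 6, 1, tzinfo=timezone.utc)
now = datetime(2025, 1, 2, tzinfo=timezone.utc)
record = {"id": 1, "doi": "10.1234/x", "title": "Old", "type": "article", "referenced_works": [2, 3]}
previous = {1: (content_hash(record), old_date)}
print(new_updated_date(record, previous, now))
print(new_updated_date(dict(record, title="New"), previous, now))
```

Fields left out of the hash can change without moving `updated_date`, which is the trade-off the checkpoint comment in the cell describes for the excluded columns.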