# Topics Re-run: Works That Gained Abstracts

~7.3M works recently gained abstracts (via ox-jobs). Their topics were already computed from the title only
(stored in `work_topics_frontfill`), so topic inference must be re-run with the new abstracts for better classification.

## What this notebook does

1. Identifies works where `work_topics_lm_output.abstract` was NULL/empty at inference time, but `openalex_works_base.abstract` is now >= 30 chars
2. Stores their IDs in a tracking table (`topics_rerun_gained_abstracts`)
3. Deletes them from `work_topics_lm_output` and `work_topics_frontfill` so the regular pipeline re-processes them

## Post-notebook steps

1. Run the `Topics_Frontfill` job (or wait for E2E), which processes ~3.84M works per run
2. Covering all ~7.3M works will need ~2 runs (7.3M / 3.84M ≈ 1.9)
3. Works temporarily lose topics between deletion and re-processing (max ~24hrs per E2E cycle)
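
To see how far re-processing has progressed between runs, a check along these lines may help. It is a sketch that assumes a re-processed work reappears in `work_topics_frontfill` under the same `work_id`, and it is only meaningful after the deletion cells in this notebook have run.

```sql
-- Progress check (sketch): how many tracked works have topics again?
-- Assumption: re-processed works reappear in work_topics_frontfill with the same work_id.
-- DISTINCT guards against frontfill holding several rows per work.
SELECT
  COUNT(DISTINCT t.work_id) AS tracked_works,
  COUNT(DISTINCT f.work_id) AS reprocessed_works,
  COUNT(DISTINCT t.work_id) - COUNT(DISTINCT f.work_id) AS still_pending
FROM openalex.works.topics_rerun_gained_abstracts t
LEFT JOIN openalex.works.work_topics_frontfill f
  ON t.work_id = f.work_id
```

When `still_pending` reaches 0, every tracked work has been picked up by the regular pipeline.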

In [None]:
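-- Diagnostic: count works with any lm_output record whose abstract was NULL/empty, where base now has an abstract >= 30 chars
-- Upper bound for the tracking table, which only considers the latest inference per work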
SELECT FORMAT_NUMBER(COUNT(DISTINCT lm.work_id), 0) AS works_to_rerun
FROM openalex.works.work_topics_lm_output lm
JOIN openalex.works.openalex_works_base b ON lm.work_id = b.id
WHERE (lm.abstract IS NULL OR TRIM(lm.abstract) = '')
  AND LENGTH(b.abstract) >= 30

In [None]:
-- Create tracking table with deduplicated work IDs (latest inference per work)
CREATE OR REPLACE TABLE openalex.works.topics_rerun_gained_abstracts AS
SELECT
  lm.work_id,
  LENGTH(b.abstract) AS abstract_length,
  lm.created_timestamp AS original_inference_date,
  current_timestamp() AS created_date
FROM (
  SELECT work_id, abstract, created_timestamp,
    ROW_NUMBER() OVER (PARTITION BY work_id ORDER BY created_timestamp DESC) AS rn
  FROM openalex.works.work_topics_lm_output
) lm
JOIN openalex.works.openalex_works_base b ON lm.work_id = b.id
WHERE lm.rn = 1
  AND (lm.abstract IS NULL OR TRIM(lm.abstract) = '')
  AND LENGTH(b.abstract) >= 30

In [None]:
-- Verify tracking table count
SELECT FORMAT_NUMBER(COUNT(*), 0) AS tracked_works
FROM openalex.works.topics_rerun_gained_abstracts
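
The tracked count can be lower than the diagnostic count, because the diagnostic matches a work if any of its `work_topics_lm_output` records had an empty abstract, while the tracking table looks only at the latest record per work. The sketch below counts that gap, using only the tables and columns already referenced in this notebook; if several records for a work share the same `created_timestamp`, the one treated as latest is arbitrary.

```sql
-- Reconciliation (sketch): works counted by the diagnostic but excluded from tracking,
-- i.e. an older record had an empty abstract but the latest record already had one.
WITH latest AS (
  SELECT work_id, abstract,
    ROW_NUMBER() OVER (PARTITION BY work_id ORDER BY created_timestamp DESC) AS rn
  FROM openalex.works.work_topics_lm_output
)
SELECT FORMAT_NUMBER(COUNT(DISTINCT lm.work_id), 0) AS excluded_by_latest_rule
FROM openalex.works.work_topics_lm_output lm
JOIN latest l ON lm.work_id = l.work_id AND l.rn = 1
JOIN openalex.works.openalex_works_base b ON lm.work_id = b.id
WHERE (lm.abstract IS NULL OR TRIM(lm.abstract) = '')
  AND NOT (l.abstract IS NULL OR TRIM(l.abstract) = '')
  AND LENGTH(b.abstract) >= 30
```

The diagnostic count should equal the tracked count plus this number. Works in this gap are not deleted by the cells that follow, so the query gives the same result before and after the deletions.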

In [None]:
-- Delete from work_topics_lm_output for tracked work IDs
-- Prevents old title-only predictions from contaminating the merge (which aggregates ALL records per work_id)
DELETE FROM openalex.works.work_topics_lm_output
WHERE work_id IN (SELECT work_id FROM openalex.works.topics_rerun_gained_abstracts)

In [None]:
-- Delete from work_topics_frontfill for tracked work IDs
-- Removes the LEFT ANTI JOIN barrier so topics_create_frontfill_input picks them up
DELETE FROM openalex.works.work_topics_frontfill
WHERE work_id IN (SELECT work_id FROM openalex.works.topics_rerun_gained_abstracts)

In [None]:
-- Verify deletions: both counts should be 0
SELECT
  (SELECT COUNT(*) FROM openalex.works.work_topics_lm_output
   WHERE work_id IN (SELECT work_id FROM openalex.works.topics_rerun_gained_abstracts)) AS remaining_in_lm_output,
  (SELECT COUNT(*) FROM openalex.works.work_topics_frontfill
   WHERE work_id IN (SELECT work_id FROM openalex.works.topics_rerun_gained_abstracts)) AS remaining_in_frontfill