# Backfill: Orphaned Works Missing Author IDs

This notebook backfills works that were orphaned by the author-matching boundary bug.

**Issue:** `oax-jobs/active/author-matching-boundary-bug/`

## Background

The `UpdateWorkAuthors` notebook selected works for processing with the strict comparison `updated_date > max_updated_date`. Works whose `updated_date` exactly equaled the boundary timestamp were permanently skipped. As a result, ~387K+ orphaned works have accumulated since Dec 20, 2025.

The boundary bug has been fixed by subtracting 1 second from the max before comparing. This notebook touches the orphaned works (sets `updated_date` to the current timestamp) so that the next pipeline run picks them up.
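
The boundary condition is easier to see on a toy example. The query below is a minimal sketch in Spark SQL, not the pipeline's actual code: the three works, their timestamps, and the watermark value are all hypothetical, and it assumes a work can be written with the same `updated_date` as the watermark after the previous run has already read its input.

```sql
-- Illustrative only: hypothetical works and a hypothetical watermark.
WITH sample_works AS (
    SELECT * FROM VALUES
        (1, TIMESTAMP '2025-12-20 10:00:00'),  -- processed by the previous run
        (2, TIMESTAMP '2025-12-20 10:00:05'),  -- processed; its timestamp became the watermark
        (3, TIMESTAMP '2025-12-20 10:00:05')   -- assumed to land at the same second, after the previous run
    AS t(id, updated_date)
),
watermark AS (
    -- Stand-in for the max_updated_date the pipeline computes.
    SELECT TIMESTAMP '2025-12-20 10:00:05' AS max_updated_date
)
SELECT
    s.id,
    s.updated_date,
    -- Original filter: strict greater-than against the watermark.
    s.updated_date > w.max_updated_date                     AS picked_by_strict_filter,
    -- Fixed filter: watermark moved back by 1 second.
    s.updated_date > w.max_updated_date - INTERVAL 1 SECOND AS picked_by_fixed_filter
FROM sample_works s
CROSS JOIN watermark w
ORDER BY s.id;
```

Under these assumptions, the strict filter never selects work 3, while the fixed filter selects both works 2 and 3. Work 2 is selected a second time, which suggests the fix depends on reprocessing a work being safe to repeat.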

## Tables Affected

| Table | Description |
|-------|-------------|
| `openalex.works.openalex_works_base` | `updated_date` set to the current timestamp to trigger reprocessing |

## Steps

1. Count orphaned works
2. Preview sample orphaned works
3. Touch `updated_date` to trigger reprocessing
4. Verify the orphaned count (after the next pipeline run)

In [None]:
-- ============================================
-- STEP 1: Count orphaned works
-- ============================================

SELECT COUNT(*) as total_orphaned_works
FROM openalex.works.openalex_works_base w
LEFT JOIN openalex.works.work_authors wa ON w.id = wa.work_id
WHERE wa.work_id IS NULL
  AND w.id > 7000000000
  AND SIZE(w.authorships) > 0;

In [None]:
-- ============================================
-- STEP 2: Preview sample orphaned works
-- ============================================

SELECT
    w.id,
    w.updated_date,
    w.created_date,
    SIZE(w.authorships) as num_authors,
    w.authorships[0].raw_author_name as first_author
FROM openalex.works.openalex_works_base w
LEFT JOIN openalex.works.work_authors wa ON w.id = wa.work_id
WHERE wa.work_id IS NULL
  AND w.id > 7000000000
  AND SIZE(w.authorships) > 0
LIMIT 20;

In [None]:
-- ============================================
-- STEP 3: Touch orphaned works to trigger reprocessing
-- ============================================
-- Sets updated_date to current_timestamp() so the next
-- UpdateWorkAuthors run picks them up via the incremental filter.

UPDATE openalex.works.openalex_works_base
SET updated_date = current_timestamp()
WHERE id IN (
    SELECT w.id
    FROM openalex.works.openalex_works_base w
    LEFT JOIN openalex.works.work_authors wa ON w.id = wa.work_id
    WHERE wa.work_id IS NULL
      AND w.id > 7000000000
      AND SIZE(w.authorships) > 0
);

In [None]:
-- ============================================
-- STEP 4: Verify - should be 0 after next pipeline run
-- ============================================

SELECT COUNT(*) as remaining_orphaned_works
FROM openalex.works.openalex_works_base w
LEFT JOIN openalex.works.work_authors wa ON w.id = wa.work_id
WHERE wa.work_id IS NULL
  AND w.id > 7000000000
  AND SIZE(w.authorships) > 0;

## After Running

Step 4 will still show orphaned works immediately after Step 3: the works have been touched but not yet processed. They will be picked up by the next `UpdateWorkAuthors` run in the Walden End2End pipeline.

Re-run Step 4 after the next pipeline completes to confirm the count drops to 0.
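
If the count does not drop to 0, a per-day breakdown of the remaining orphans may help separate works touched by Step 3 from works orphaned afterwards. The query below is an optional diagnostic sketch that is not part of the original steps. It reuses the Step 4 filter and assumes `updated_date` can be cast to a date.

```sql
-- Optional diagnostic: remaining orphaned works grouped by the day of updated_date.
-- Rows dated on the day Step 3 ran were touched but are still unprocessed.
SELECT
    CAST(w.updated_date AS DATE) AS updated_day,
    COUNT(*) AS orphaned_works
FROM openalex.works.openalex_works_base w
LEFT JOIN openalex.works.work_authors wa ON w.id = wa.work_id
WHERE wa.work_id IS NULL
  AND w.id > 7000000000
  AND SIZE(w.authorships) > 0
GROUP BY CAST(w.updated_date AS DATE)
ORDER BY updated_day DESC;
```

Orphans concentrated on the Step 3 date would point at the pipeline run, while orphans on later dates would point at works created or updated after the backfill.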

## Expected Results

| Metric | Before | After Next Pipeline Run |
|--------|--------|------------------------|
| Orphaned works (id > 7B, has authorships, no work_authors) | ~387K+ | 0 |