# Populate Rescrape Queue: 2024-2025 Missing Affiliations

Inserts DOIs for 2024-2025 articles that have no raw affiliation strings into `openalex.taxicab.rescrape_queue`.
Only publishers in the HIGH and MODERATE tiers below are targeted, because taxicab + Parseland are known to produce good results for them.

## Publisher tiers (based on Parseland testing 2026-02-20)

| Tier | Publishers | Parseland affiliation rate |
|------|-----------|------------------|
| HIGH | Elsevier, Springer, AACR, Helvetica, Ovid, IEEE, SAGE, Emerald | 87-100% |
| MODERATE | Wiley, SPIE, Taylor & Francis | 47-84% |
| SKIP | IOP (soft-blocked), MUSE (captcha), Hans (0%), Thieme (no affiliations) | n/a (not queued) |

## Usage

1. Run Cell 1 (diagnostic) to see current volumes
2. Run Cell 2 to populate the queue
3. Run Cell 3 to verify the queue size
4. Trigger the `TaxiCab_Rescrape` job
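
The `native_id` values written by Cell 2 are DOIs with the `https://doi.org/` prefix stripped and the remainder lowercased. As a minimal sketch, that normalization can be spot-checked on literal values before populating the queue. The two sample DOIs are hypothetical placeholders, not real records:

```sql
-- Spot-check of the DOI normalization used in Cell 2.
-- The sample DOIs below are made-up placeholders.
SELECT
  doi,
  LOWER(REPLACE(doi, 'https://doi.org/', '')) AS native_id
FROM VALUES
  ('https://doi.org/10.1016/J.EXAMPLE.2024.01.001'),
  ('10.1109/example.2025.0000001')
AS t(doi)
```

`REPLACE` is case-sensitive, so a prefix written with a different case or scheme (for example `http://`) would pass through unstripped. Whether such values occur in the works table is not confirmed here.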


In [None]:
-- Diagnostic: candidate DOIs by publisher and year.
-- DOIs already in the queue are not excluded here, so Cell 2 may insert fewer rows.
SELECT
  publisher,
  publication_year,
  COUNT(*) AS dois_to_rescrape
FROM openalex.works.openalex_works
WHERE type = 'article'
  AND is_xpac = false
  AND indexed_in_crossref = true
  AND publication_year IN (2024, 2025)
  AND doi IS NOT NULL
  AND size(filter(authorships, a -> size(a.raw_affiliation_strings) > 0)) = 0
  AND publisher IN (
    -- HIGH tier
    'Elsevier BV',
    'Springer Science and Business Media LLC',
    'American Association for Cancer Research (AACR)',
    'Publishing House Helvetica (Publications)',
    'Ovid Technologies (Wolters Kluwer Health)',
    'IEEE',
    'Institute of Electrical and Electronics Engineers (IEEE)',
    'SAGE Publications',
    'Emerald Publishing Limited',
    -- MODERATE tier
    'Wiley',
    'SPIE',
    'Taylor & Francis',
    'Informa UK Limited'
  )
GROUP BY publisher, publication_year
ORDER BY dois_to_rescrape DESC


In [None]:
-- Populate rescrape queue with 2024-2025 articles missing affiliations
-- from high-yield publishers (taxicab + Parseland confirmed working)
INSERT INTO openalex.taxicab.rescrape_queue (native_id, native_id_namespace, created_date)
-- DISTINCT avoids queueing a DOI twice if it appears on more than one work
SELECT DISTINCT
  LOWER(REPLACE(w.doi, 'https://doi.org/', '')) AS native_id,
  'doi' AS native_id_namespace,
  current_timestamp() AS created_date
FROM openalex.works.openalex_works w
WHERE w.type = 'article'
  AND w.is_xpac = false
  AND w.indexed_in_crossref = true
  AND w.publication_year IN (2024, 2025)
  AND w.doi IS NOT NULL
  -- No authorship carries any raw affiliation string
  AND size(filter(w.authorships, a -> size(a.raw_affiliation_strings) > 0)) = 0
  AND w.publisher IN (
    -- HIGH tier (87-100% Parseland affiliation rate)
    'Elsevier BV',
    'Springer Science and Business Media LLC',
    'American Association for Cancer Research (AACR)',
    'Publishing House Helvetica (Publications)',
    'Ovid Technologies (Wolters Kluwer Health)',
    'IEEE',
    'Institute of Electrical and Electronics Engineers (IEEE)',
    'SAGE Publications',
    'Emerald Publishing Limited',
    -- MODERATE tier (47-84% Parseland affiliation rate)
    'Wiley',
    'SPIE',
    'Taylor & Francis',
    'Informa UK Limited'
  )
  -- Skip DOIs already queued; w.doi is qualified so it cannot bind to a queue column
  AND NOT EXISTS (
    SELECT 1 FROM openalex.taxicab.rescrape_queue q
    WHERE q.native_id = LOWER(REPLACE(w.doi, 'https://doi.org/', ''))
      AND q.native_id_namespace = 'doi'
  )


In [None]:
-- Verify queue size
SELECT COUNT(*) AS queue_size FROM openalex.taxicab.rescrape_queue
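

In [None]:
-- Optional follow-up (a sketch, not part of the original workflow):
-- break the DOI rows in the queue down by insertion date to confirm that
-- this run's rows landed. Uses only the columns written by the INSERT cell
-- and assumes created_date holds a timestamp, as current_timestamp() suggests.
SELECT
  CAST(created_date AS DATE) AS queued_on,
  COUNT(*) AS dois_queued
FROM openalex.taxicab.rescrape_queue
WHERE native_id_namespace = 'doi'
GROUP BY CAST(created_date AS DATE)
ORDER BY queued_on DESC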
