# Clean Up Long Affiliation Strings

This notebook removes problematic entries from the `raw_affiliation_strings` array in the `work_authors` table:
- Strings longer than 2000 characters (mostly garbage data, such as institution mega-lists)
- Strings containing an opening HTML tag that starts with `<div`, `<span`, `<option`, `<select`, `<textarea`, `<td`, or `<tr`
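
The two criteria above can be tried on a few inline values before touching the table. This is a minimal sketch, not part of the original cleanup: the sample strings and the `samples(s)` alias are made up for illustration, and only the length threshold and the regular expression come from the queries in this notebook.

```sql
-- Illustrative only: inline sample values, not rows from work_authors.
SELECT
    LEFT(s, 40) AS sample_prefix,
    LENGTH(s) AS len,
    LENGTH(s) > 2000 AS too_long,
    s RLIKE '<(div|span|option|select|textarea|td|tr)' AS has_html_tag
FROM VALUES
    ('Department of Physics, Example University'),   -- ordinary string: neither flag
    ('<div class="aff">Example Institute</div>'),    -- contains '<div': has_html_tag
    ('<TD>Example Institute</TD>'),                  -- uppercase tag: not matched
    (REPEAT('Example University; ', 150))            -- 3000 characters: too_long
AS samples(s)
```

Because `RLIKE` uses case-sensitive partial matching, the pattern catches lowercase tags anywhere in the string but skips uppercase ones such as `<TD>`. It also matches any tag name that merely begins with one of the listed names, for example `<track`.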

## Impact
- ~3,515 works affected
- ~27,762 authorships affected
- ~29,312 strings filtered

## Related Changes
- `CreateWorksBase.ipynb` was updated to filter these strings at the source for new data

In [None]:
-- Preview: count the works, authorships, and strings the cleanup will affect
WITH flagged AS (
    -- One row per affiliation string that is too long or contains a flagged HTML tag
    SELECT
        work_id,
        author_sequence,
        ras
    FROM openalex.works.work_authors
    LATERAL VIEW EXPLODE(raw_affiliation_strings) AS ras
    WHERE raw_affiliation_strings IS NOT NULL
      AND (LENGTH(ras) > 2000 OR ras RLIKE '<(div|span|option|select|textarea|td|tr)')
)
SELECT
    COUNT(DISTINCT work_id) AS works_affected,
    -- An authorship is identified by its (work_id, author_sequence) pair
    COUNT(DISTINCT CONCAT(work_id, '-', author_sequence)) AS authorships_affected,
    COUNT(*) AS strings_to_filter
FROM flagged

In [None]:
-- Remove problematic affiliation strings; only rows containing at least one are rewritten
UPDATE openalex.works.work_authors
SET raw_affiliation_strings = FILTER(
    raw_affiliation_strings,
    -- Keep a string only if it is short enough and contains no flagged HTML tag
    s -> LENGTH(s) <= 2000 AND NOT (s RLIKE '<(div|span|option|select|textarea|td|tr)')
)
WHERE EXISTS(
    raw_affiliation_strings,
    s -> LENGTH(s) > 2000 OR s RLIKE '<(div|span|option|select|textarea|td|tr)'
)
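
The interplay of `EXISTS` (which rows are rewritten) and `FILTER` (what each array becomes) can be checked on a few inline arrays. This is a sketch under stated assumptions: the `samples` CTE, its `id` column, and the example strings are hypothetical, and only the two lambda expressions are copied from the `UPDATE` above.

```sql
-- Illustrative only: inline arrays standing in for raw_affiliation_strings.
WITH samples AS (
    SELECT * FROM VALUES
        (1, ARRAY('Example University', '<span>Example Institute</span>')),
        (2, ARRAY('Example University')),
        (3, ARRAY('<td>Example Lab</td>'))
    AS t(id, raw_affiliation_strings)
)
SELECT
    id,
    -- Same predicate as the UPDATE's WHERE clause: would this row be rewritten?
    EXISTS(
        raw_affiliation_strings,
        s -> LENGTH(s) > 2000 OR s RLIKE '<(div|span|option|select|textarea|td|tr)'
    ) AS would_update,
    -- Same expression as the UPDATE's SET clause: what the array would become
    FILTER(
        raw_affiliation_strings,
        s -> LENGTH(s) <= 2000 AND NOT (s RLIKE '<(div|span|option|select|textarea|td|tr)')
    ) AS cleaned
FROM samples
ORDER BY id
```

In this sketch, row 1 keeps only its clean string, row 2 is left alone, and row 3 ends up with an empty array rather than `NULL`. Downstream code that distinguishes "no affiliations" from "missing" may need to account for that.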

In [None]:
-- Verify: remaining_problematic_strings should be 0
SELECT COUNT(*) AS remaining_problematic_strings
FROM openalex.works.work_authors
LATERAL VIEW EXPLODE(raw_affiliation_strings) AS ras
WHERE LENGTH(ras) > 2000 OR ras RLIKE '<(div|span|option|select|textarea|td|tr)'

## Refresh Materialized View

After updating `work_authors`, refresh the `work_author_affiliations_mv` materialized view to propagate the changes.

In [None]:
REFRESH MATERIALIZED VIEW openalex.works.work_author_affiliations_mv