# Cleanup: Remove Trailing Periods from Raw Affiliation Strings

This notebook strips trailing periods from existing raw affiliation strings and deduplicates the arrays that contained both variants.

**Issue:** `qa/issues/open/affiliation-trailing-period-2026-01/`

## Background

Raw affiliation strings from source data sometimes include trailing periods and sometimes don't. This causes duplicates like:
- `"University of California, San Francisco, CA, USA."`
- `"University of California, San Francisco, CA, USA"`
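
As a minimal sketch of why stripping the period collapses the pair, the query below applies the same `TRIM(TRAILING '.' ...)` plus `ARRAY_DISTINCT` pattern used in this notebook's queries to an inline array holding the two example strings. It reads no table, and the alias `deduplicated` is only illustrative.

```sql
-- Illustrative only: inline array, no table access.
SELECT ARRAY_DISTINCT(
  TRANSFORM(
    ARRAY(
      'University of California, San Francisco, CA, USA.',
      'University of California, San Francisco, CA, USA'
    ),
    s -> TRIM(TRAILING '.' FROM s)
  )
) AS deduplicated;
```

Both variants normalize to the same string, so the result should be a single-element array.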

## Tables Affected

| Table | Description |
|-------|-------------|
| `openalex.works.work_authors` | Author records with raw affiliations |

## Steps

1. Run pre-cleanup counts and examples
2. Clean `work_authors`
3. Run post-cleanup verification
4. Run downstream jobs

In [None]:
-- ============================================
-- PRE-CLEANUP: Counts (run before cleanup)
-- ============================================

-- Count: Records with any trailing period in affiliations
SELECT 'work_authors with trailing period' as metric, COUNT(*) as count
FROM openalex.works.work_authors
WHERE EXISTS(raw_affiliation_strings, s -> s LIKE '%.' AND LENGTH(s) > 1);
-- EXPECTED BEFORE: ~98M | AFTER: 0

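-- Count: Records with period duplicates (same string with and without a trailing period)
-- Comparing distinct counts means rows that only contain exact duplicates
-- (which the UPDATE's filter may skip) are not counted
SELECT 'records with period duplicates' as metric, COUNT(*) as count
FROM openalex.works.work_authors
WHERE SIZE(ARRAY_DISTINCT(raw_affiliation_strings)) > SIZE(ARRAY_DISTINCT(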
    TRANSFORM(raw_affiliation_strings, s -> TRIM(TRAILING '.' FROM s))
));

In [None]:
-- ============================================
-- PRE-CLEANUP: Examples that SHOULD change
-- ============================================

-- Sample records with trailing periods (periods will be removed)
SELECT work_id, author_sequence, raw_affiliation_strings
FROM openalex.works.work_authors
WHERE EXISTS(raw_affiliation_strings, s -> s LIKE '%.' AND LENGTH(s) > 10)
LIMIT 5;

-- Check our example work (should have duplicates before cleanup)
SELECT work_id, author_sequence, raw_affiliation_strings
FROM openalex.works.work_authors
WHERE work_id = 4414994979
LIMIT 3;

In [None]:
-- ============================================
-- PRE-CLEANUP: Examples that should NOT change
-- ============================================

-- Records with abbreviations in middle (St., Dr.) but no trailing period
SELECT work_id, author_sequence, raw_affiliation_strings
FROM openalex.works.work_authors
WHERE EXISTS(raw_affiliation_strings, s -> s LIKE '%St.%' OR s LIKE '%Dr.%')
  AND NOT EXISTS(raw_affiliation_strings, s -> s LIKE '%.')
LIMIT 5;

-- Records where a string ends with a digit and no string has a trailing period
SELECT work_id, author_sequence, raw_affiliation_strings
FROM openalex.works.work_authors
WHERE EXISTS(raw_affiliation_strings, s -> s RLIKE '[0-9]$')
  AND NOT EXISTS(raw_affiliation_strings, s -> s LIKE '%.')
LIMIT 5;
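
Before running the update, it can help to preview its effect. The query below is a read-only sketch that applies the cleanup expression from the `UPDATE` cell to a sample of the rows its `WHERE` clause selects. The aliases `before_cleanup` and `after_cleanup` and the `LIMIT` of 20 are arbitrary choices, not part of the original procedure.

```sql
-- Read-only preview: same expression and filter as the UPDATE, but nothing is written.
SELECT
  work_id,
  author_sequence,
  raw_affiliation_strings AS before_cleanup,
  ARRAY_DISTINCT(
    TRANSFORM(raw_affiliation_strings, s -> TRIM(TRAILING '.' FROM TRIM(REPLACE(s, '\\n', ''))))
  ) AS after_cleanup
FROM openalex.works.work_authors
WHERE EXISTS(raw_affiliation_strings, s -> s LIKE '%.')
LIMIT 20;
```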

In [None]:
-- ============================================
-- CLEANUP: Clean work_authors
-- ============================================
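-- Strip trailing periods and deduplicate the arrays.
-- Before stripping, the expression also removes '\\n' sequences and trims surrounding spaces.
-- TRIM(TRAILING '.') removes every consecutive trailing period, not just one.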

UPDATE openalex.works.work_authors
SET
  raw_affiliation_strings = ARRAY_DISTINCT(
    TRANSFORM(raw_affiliation_strings, s -> TRIM(TRAILING '.' FROM TRIM(REPLACE(s, '\\n', ''))))
  ),
  updated_at = current_timestamp()
WHERE EXISTS(raw_affiliation_strings, s -> s LIKE '%.');
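
To see what the cleanup expression does to individual strings, the sketch below runs it over a few made-up sample values (they are not taken from the table). It is meant to show three behaviors: mid-string abbreviations survive, every consecutive trailing period is removed, and surrounding spaces are trimmed before the period is stripped.

```sql
-- Hypothetical sample strings; the expression is copied from the UPDATE above.
SELECT
  s AS original,
  TRIM(TRAILING '.' FROM TRIM(REPLACE(s, '\\n', ''))) AS cleaned
FROM VALUES
  ('St. Louis University, St. Louis, MO, USA.'),  -- one trailing period
  ('Dept. of Physics, Example Univ...'),          -- several trailing periods
  ('  Example Institute.  '),                     -- surrounding spaces
  ('Dr. A. Example Laboratory')                   -- no trailing period
AS t(s);
```

Note that a string with a space after its final period does not match the `LIKE '%.'` filter on its own. It is only cleaned when another string in the same array matches.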

In [None]:
-- ============================================
-- POST-CLEANUP: Verification counts
-- ============================================

-- Should return 0
SELECT 'work_authors with trailing period' as metric, COUNT(*) as count
FROM openalex.works.work_authors
WHERE EXISTS(raw_affiliation_strings, s -> s LIKE '%.' AND LENGTH(s) > 1);

-- Should return 0 (distinct counts are compared so that only period-induced duplicates are measured)
SELECT 'records with period duplicates' as metric, COUNT(*) as count
FROM openalex.works.work_authors
WHERE SIZE(ARRAY_DISTINCT(raw_affiliation_strings)) > SIZE(ARRAY_DISTINCT(
    TRANSFORM(raw_affiliation_strings, s -> TRIM(TRAILING '.' FROM s))
));
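
One edge case the counts above do not cover: the `UPDATE` filter has no length guard, so an affiliation string consisting only of periods is reduced to an empty string. The query below is a hedged sketch for spotting such rows after cleanup. The metric label is made up for illustration, and whether empty strings should be kept or removed is a decision this notebook does not specify.

```sql
-- Sketch: count rows whose arrays contain an empty string after cleanup.
SELECT 'work_authors with empty affiliation string' AS metric, COUNT(*) AS count
FROM openalex.works.work_authors
WHERE EXISTS(raw_affiliation_strings, s -> s = '');
```

Rows may already have contained empty strings before the cleanup, so comparing against a pre-cleanup run of the same query gives a clearer picture.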

In [None]:
-- ============================================
-- POST-CLEANUP: Verify examples
-- ============================================

-- Check our example work (should be deduplicated, no trailing periods)
SELECT work_id, author_sequence, raw_affiliation_strings
FROM openalex.works.work_authors
WHERE work_id = 4414994979
LIMIT 3;

-- Verify abbreviations in middle are preserved
SELECT work_id, author_sequence, raw_affiliation_strings
FROM openalex.works.work_authors
WHERE EXISTS(raw_affiliation_strings, s -> s LIKE '%St.%' OR s LIKE '%Dr.%')
LIMIT 5;

## Downstream Jobs

After running the cleanup queries above, run the following notebooks/jobs to propagate changes:

1. **UpdateWorkAuthorships.ipynb** - Picks up work_authors changes via `updated_at`
2. **CreateWorksEnriched.ipynb** - Merges work_authorships into openalex_works
3. **Full sync** - Run a full sync of data to Elasticsearch
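
Because the first job detects changes through `updated_at`, a quick sanity check is to count how many rows carry a timestamp from the cleanup run. This is a sketch only: the cutoff below is a placeholder to replace with the actual start time of the `UPDATE`, and it assumes `updated_at` is a timestamp column, as the `current_timestamp()` assignment suggests.

```sql
-- Placeholder cutoff: replace with the time the cleanup UPDATE started.
SELECT COUNT(*) AS rows_updated_since_cutoff
FROM openalex.works.work_authors
WHERE updated_at >= TIMESTAMP '2026-01-01 00:00:00';
```

The count includes any other writes made after the cutoff, so it should be at least the number of rows the cleanup touched.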

## Expected Results Summary

| Metric | Before | After |
|--------|--------|-------|
| Records with trailing period | ~98M | 0 |
| Records with period duplicates | >0 | 0 |
| Work 4414994979 affiliations | Has duplicates | Deduplicated, no periods |
| Abbreviations (St., Dr.) in middle | Preserved | Preserved |