# Cleanup: Remove Trailing Periods from Raw Affiliation Strings

This notebook cleans up existing data in which the same raw affiliation string is stored both with and without a trailing period.

**Issue:** `qa/issues/open/affiliation-trailing-period-2026-01/`

## Background

Raw affiliation strings from source data sometimes include trailing periods and sometimes don't. This causes duplicates like:
- `"University of California, San Francisco, CA, USA."`
- `"University of California, San Francisco, CA, USA"`
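
As a minimal illustration of the normalization this cleanup relies on, the query below runs against inline literals rather than the real tables. It shows both variants collapsing to the same string once trailing periods are trimmed. The column alias `s`, the table alias `t`, and the inline `VALUES` rows are illustrative only.

```sql
-- Illustrative only: inline rows stand in for raw affiliation strings
SELECT
    s AS raw_affiliation_string,
    TRIM(TRAILING '.' FROM s) AS normalized
FROM VALUES
    ('University of California, San Francisco, CA, USA.'),
    ('University of California, San Francisco, CA, USA') AS t(s);
```

Both rows produce the same `normalized` value, which is why the period and non-period versions can be merged into one lookup entry.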

## Tables Affected

| Table | Description |
|-------|-------------|
| `openalex.institutions.affiliation_strings_lookup` | Lookup table for ML institution matching |
| `openalex.works.work_authors` | Author records with raw affiliations |
| `openalex.works.openalex_works` | Final works table (API/Elasticsearch) |

## Steps

1. Run the impact assessment queries
2. Merge and clean `affiliation_strings_lookup` (Step 1 in the cells below)
3. Clean `work_authors` (Step 2)
4. Touch `openalex_works` for Elasticsearch sync (Step 3)
5. Run downstream jobs (see the Downstream Jobs section)

In [None]:
-- Impact Assessment: Count affected records before cleanup

-- Count affiliation strings ending with period in lookup table
SELECT 'affiliation_strings_lookup with trailing period' AS metric, COUNT(*) AS count
FROM openalex.institutions.affiliation_strings_lookup
WHERE raw_affiliation_string LIKE '%.';

-- Count duplicate pairs (both with and without period exist)
-- Uses TRIM(TRAILING ... FROM ...) because two-argument RTRIM in Spark SQL takes the trim characters first
SELECT 'duplicate pairs in lookup' AS metric, COUNT(*) AS count
FROM openalex.institutions.affiliation_strings_lookup a
JOIN openalex.institutions.affiliation_strings_lookup b
  ON TRIM(TRAILING '.' FROM a.raw_affiliation_string) = b.raw_affiliation_string
WHERE a.raw_affiliation_string LIKE '%.';

-- Count work_authors records with at least one affiliation string ending in a period
SELECT 'work_authors with trailing period' AS metric, COUNT(*) AS count
FROM openalex.works.work_authors
WHERE EXISTS(raw_affiliation_strings, s -> s LIKE '%.');

-- Sample of strings that will be altered
SELECT raw_affiliation_string
FROM openalex.institutions.affiliation_strings_lookup
WHERE raw_affiliation_string LIKE '%.'
LIMIT 20;
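
-- Sketch (not part of the original assessment): count period-ending strings that have
-- no non-period counterpart. Step 1b only deletes strings whose non-period version
-- exists, so these rows stay in the lookup table after cleanup.
SELECT 'period-ending strings without counterpart' AS metric, COUNT(*) AS count
FROM openalex.institutions.affiliation_strings_lookup a
LEFT ANTI JOIN openalex.institutions.affiliation_strings_lookup b
  ON TRIM(TRAILING '.' FROM a.raw_affiliation_string) = b.raw_affiliation_string
WHERE a.raw_affiliation_string LIKE '%.';

-- Sketch: sample pairs where both versions carry non-NULL, differing institution_ids.
-- Step 1a's COALESCE keeps only one of the two values for such pairs, so they are worth
-- reviewing first. Assumes institution_ids is a type that supports <> comparison.
SELECT
    target.raw_affiliation_string,
    target.institution_ids AS non_period_ids,
    source.institution_ids AS period_ids
FROM openalex.institutions.affiliation_strings_lookup target
JOIN openalex.institutions.affiliation_strings_lookup source
  ON source.raw_affiliation_string = CONCAT(target.raw_affiliation_string, '.')
WHERE target.institution_ids IS NOT NULL
  AND source.institution_ids IS NOT NULL
  AND target.institution_ids <> source.institution_ids
LIMIT 20;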

In [None]:
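-- Step 1: Clean affiliation_strings_lookup
-- Merge institution_ids from period versions into non-period versions, then delete period versions

-- Step 1a: Update non-period versions with values from their period versions.
-- COALESCE picks the first non-NULL value; it does not combine the two values:
--   institution_ids          -> period version's value if non-NULL, else the existing value
--   institution_ids_override -> existing value if non-NULL, else the period version's value
-- Written as MERGE on the assumption that this runs on Databricks/Delta Lake, where
-- UPDATE does not accept a FROM clause. Assumes raw_affiliation_string is unique in the
-- table, so each target row matches at most one source row.
MERGE INTO openalex.institutions.affiliation_strings_lookup AS target
USING (
    SELECT raw_affiliation_string, institution_ids, institution_ids_override
    FROM openalex.institutions.affiliation_strings_lookup
    WHERE raw_affiliation_string LIKE '%.'
) AS source
ON source.raw_affiliation_string = CONCAT(target.raw_affiliation_string, '.')
WHEN MATCHED THEN UPDATE SET
    target.institution_ids = COALESCE(source.institution_ids, target.institution_ids),
    target.institution_ids_override = COALESCE(
        target.institution_ids_override,
        source.institution_ids_override
    );

-- Step 1b: Delete period-ending duplicates (where non-period version exists)
-- TRIM(TRAILING ... FROM ...) replaces two-argument RTRIM, which takes the trim characters first in Spark SQL
DELETE FROM openalex.institutions.affiliation_strings_lookup
WHERE raw_affiliation_string LIKE '%.'
  AND TRIM(TRAILING '.' FROM raw_affiliation_string) IN (
      SELECT raw_affiliation_string
      FROM openalex.institutions.affiliation_strings_lookup
  );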

In [None]:
-- Step 2: Clean work_authors
-- Normalize each string (drop '\\n' sequences, trim whitespace, strip trailing periods),
-- then deduplicate the array
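
-- Optional preview (sketch, not part of the original notebook): show a sample of arrays
-- before and after the same expression the UPDATE below applies, without writing anything.
SELECT
    work_id,
    raw_affiliation_strings AS before_cleanup,
    ARRAY_DISTINCT(
      TRANSFORM(raw_affiliation_strings, s -> TRIM(TRAILING '.' FROM TRIM(REPLACE(s, '\\n', ''))))
    ) AS after_cleanup
FROM openalex.works.work_authors
WHERE EXISTS(raw_affiliation_strings, s -> s LIKE '%.')
LIMIT 20;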

UPDATE openalex.works.work_authors
SET
  raw_affiliation_strings = ARRAY_DISTINCT(
    TRANSFORM(raw_affiliation_strings, s -> TRIM(TRAILING '.' FROM TRIM(REPLACE(s, '\\n', ''))))
  ),
  updated_at = current_timestamp()
WHERE EXISTS(raw_affiliation_strings, s -> s LIKE '%.');

In [None]:
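-- Step 3: Touch openalex_works for Elasticsearch sync
-- Update updated_date so sync_works picks up these records

-- Capture the cleanup timestamp before running Step 2 and use it in the filter below.
-- Note that SET name = current_timestamp() stores the text, not the evaluated timestamp.
-- On runtimes that support SQL session variables, one option is:
-- DECLARE OR REPLACE VARIABLE cleanup_timestamp TIMESTAMP DEFAULT current_timestamp();

UPDATE openalex.works.openalex_works
SET updated_date = current_timestamp()
WHERE id IN (
  SELECT DISTINCT work_id
  FROM openalex.works.work_authors
  -- Placeholder: current_date() matches every work_authors row updated today.
  -- Replace it with the captured cleanup timestamp to touch only the cleaned rows.
  WHERE updated_at >= current_date()
);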

## Downstream Jobs

After running the cleanup queries above, run the following notebooks/jobs to propagate changes:

1. **UpdateWorkAuthorships.ipynb** - Picks up work_authors changes via `updated_at`
2. **CreateWorksEnriched.ipynb** - Merges work_authorships into openalex_works
3. **sync_works** - Pushes updated records to Elasticsearch

## Verification

After cleanup, verify with these queries:

```sql
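-- Counts period-ending strings left in the lookup table.
-- Step 1b deletes only those with a non-period counterpart, so this returns 0
-- only if every period-ending string had one.
SELECT COUNT(*) FROM openalex.institutions.affiliation_strings_lookup
WHERE raw_affiliation_string LIKE '%.';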

-- Should return 0
SELECT COUNT(*) FROM openalex.works.work_authors
WHERE EXISTS(raw_affiliation_strings, s -> s LIKE '%.');

-- Check example work 4414994979 - should have no duplicate affiliations
-- (filters on id, the openalex_works column that Step 3 matches against work_id)
SELECT
    id,
    authorship.raw_affiliation_strings
FROM openalex.works.openalex_works
LATERAL VIEW EXPLODE(authorships) AS authorship
WHERE id = 4414994979;
```
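
A further check, offered as a sketch rather than part of the original verification: count lookup rows that end in a period and still have a period-stripped counterpart in the same table. Since Step 1b deletes exactly those rows, this would be expected to return 0 after cleanup. The alias `remaining_duplicate_pairs` is illustrative.

```sql
-- Expected to return 0 once Step 1b has run
SELECT COUNT(*) AS remaining_duplicate_pairs
FROM openalex.institutions.affiliation_strings_lookup a
JOIN openalex.institutions.affiliation_strings_lookup b
  ON TRIM(TRAILING '.' FROM a.raw_affiliation_string) = b.raw_affiliation_string
WHERE a.raw_affiliation_string LIKE '%.';
```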