# Crossref Author/Affiliation Interleaving Problem

## Problem Description

Some Crossref publishers submit metadata in which author affiliations are parsed as separate author entries. The affiliation entries end up interleaved with the real authors, for example:

1. Author affiliation (incorrectly stored as author)
2. Actual author
3. Author affiliation (incorrectly stored as author)
4. Actual author

### Example

DOI: `10.26907/esd.17.3.14`

| Position | Given | Family | Actual Type |
|----------|-------|--------|-------------|
| 1 | Kazan | University | AFFILIATION |
| 2 | Roza | Valeeva | Author |
| 3 | Gulfiya | Parfilova | Author |
| 4 | Kazan | University | AFFILIATION |
| 5 | Irina | Demakova | Author |

### Impact

- Inflated author counts on works
- Fake "author" records created for institutions
- Incorrect authorship attribution
- Polluted author disambiguation

## Detection Approach

We detect this issue by looking for author entries whose `given` or `family` name contains an institution or organization keyword. The corporate/organization patterns are checked against `family` only.

### Keywords Detected

**English Institution Keywords:**
- University, Institute, College, Hospital, Department
- School, Center, Centre, Laboratory, Faculty, Academy

**Non-English Institution Keywords:**
- Universiteit, Universidade, Università, Uniwersytet, Üniversitesi
- Universite, Hochschule, Fakultät, Klinikum, Krankenhaus
- Politecnico, Politechnika, Escuela, Colegio, Faculdade, Facultad

**Corporate/Organization Keywords:**
- Inc, LLC, Ltd, Corp, Corporation, Company, GmbH
- Consortium, Association, Collaboration, Committee, Council, Organisation, Organization

**Additional Institution Keywords:**
- Clinic, Medical, Research, Museum, Library, Foundation, Polytechnic
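
As a rough local illustration of the keyword matching described above, the sketch below applies a subset of the listed keywords to a single author entry using Python's `re` module. The function name `looks_like_institution` and the reduced keyword tuple are assumptions made for this example; the actual detection runs as Spark SQL `RLIKE` predicates in the queries of this notebook.

```python
import re

# Illustrative subset of the keyword lists above (not the full production set).
INSTITUTION_KEYWORDS = (
    "University", "Institute", "College", "Hospital", "Department",
    "Universidade", "Hochschule", "Clinic", "Museum", "Foundation",
)

# Case-insensitive substring match, similar in spirit to the English-keyword pattern.
_INSTITUTION_RE = re.compile(
    "|".join(map(re.escape, INSTITUTION_KEYWORDS)), re.IGNORECASE
)


def looks_like_institution(given, family):
    """Return True if either name field contains an institution keyword."""
    return any(_INSTITUTION_RE.search(field or "") for field in (given, family))


# Author entries from the example DOI 10.26907/esd.17.3.14
example_authors = [
    {"given": "Kazan", "family": "University"},
    {"given": "Roza", "family": "Valeeva"},
    {"given": "Gulfiya", "family": "Parfilova"},
    {"given": "Kazan", "family": "University"},
    {"given": "Irina", "family": "Demakova"},
]

flags = [looks_like_institution(a["given"], a["family"]) for a in example_authors]
print(flags)
```

Plain substring matching is looser than the `\b` word-boundary patterns used for some keyword groups in the SQL, so this sketch may flag a few more names than the queries would.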

### Confidence Levels

| Level | Criteria |
|-------|----------|
| **HIGH** | 2+ authors with null or empty `given` AND an institution keyword, OR 40%+ of authors look like institutions |
| **MEDIUM** | 20% to under 40% of authors look like institutions |
| **LOW** | At least one institution-like author entry |
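
The table can be read as an ordered set of rules, where the first matching rule wins. A minimal Python sketch of that ordering follows; the function name `classify_confidence` and its parameter names are chosen for this example and simply mirror the column names used in the SQL.

```python
def classify_confidence(total_authors, institution_like_count, null_given_institution_count):
    """Map per-record counts to HIGH / MEDIUM / LOW / NONE, first match wins."""
    if null_given_institution_count >= 2:
        return "HIGH"
    if total_authors > 0:
        # Integer comparisons avoid floating-point edge cases at the thresholds.
        if 5 * institution_like_count >= 2 * total_authors:  # share >= 40%
            return "HIGH"
        if 5 * institution_like_count >= total_authors:  # share >= 20%
            return "MEDIUM"
    if institution_like_count >= 1:
        return "LOW"
    return "NONE"


# The example DOI has 5 author entries, 2 of which look like institutions.
print(classify_confidence(5, 2, 0))
```

With 2 of 5 entries flagged, the share is exactly 40%, so the example record lands in HIGH even though both flagged entries have a non-empty `given`.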

In [None]:
# Setup
import sys
sys.path.insert(0, '../..')

from utils.databricks_sql import run_query
import json

## Query 1: Overall Scope

How many records are affected by confidence level?

In [None]:
# Expanded detection patterns
ENGLISH_INSTITUTION = "(?i)(University|Institute|College|Hospital|Department|School|Center|Centre|Laboratory|Faculty|Academy)"
NON_ENGLISH_INSTITUTION = "(?i)(Universiteit|Universidade|Università|Uniwersytet|Üniversitesi|Universite|Hochschule|Fakultät|Klinikum|Krankenhaus|Politecnico|Politechnika|Escuela|Colegio|Faculdade|Facultad)"
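# Backslashes are doubled because Spark SQL unescapes string literals by default,
# so '\\s' in the SQL text reaches the regex engine as '\s' (same form as detection_query below).
CORPORATE = r"(?i)((\\s|^)(Inc|LLC|Ltd|Corp|Corporation|Company|GmbH|Consortium|Association|Collaboration)(\\s|\\.|,|$)|\\b(Committee|Council|Organisation|Organization)\\b)"
ADDITIONAL = r"(?i)\\b(Clinic|Medical|Research|Museum|Library|Foundation|Polytechnic)\\b"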

# Combined pattern for any institution-like author
INSTITUTION_PATTERN = f"""(
    a.family RLIKE '{ENGLISH_INSTITUTION}'
    OR a.given RLIKE '{ENGLISH_INSTITUTION}'
    OR a.family RLIKE '{NON_ENGLISH_INSTITUTION}'
    OR a.given RLIKE '{NON_ENGLISH_INSTITUTION}'
    OR a.family RLIKE '{CORPORATE}'
    OR a.family RLIKE '{ADDITIONAL}'
    OR a.given RLIKE '{ADDITIONAL}'
)"""

result = run_query(f"""
WITH author_analysis AS (
    SELECT 
        native_id,
        size(authors) as total_authors,
        size(filter(authors, a -> {INSTITUTION_PATTERN})) as institution_like_count,
        size(filter(authors, a -> 
            (a.given IS NULL OR a.given = '')
            AND {INSTITUTION_PATTERN}
        )) as null_given_institution_count
    FROM openalex.crossref.crossref_works
    WHERE authors IS NOT NULL AND size(authors) >= 1
)
SELECT 
    CASE
        WHEN null_given_institution_count >= 2 THEN 'HIGH'
        WHEN total_authors > 0 AND institution_like_count * 1.0 / total_authors >= 0.4 THEN 'HIGH'
        WHEN total_authors > 0 AND institution_like_count * 1.0 / total_authors >= 0.2 THEN 'MEDIUM'
        WHEN institution_like_count >= 1 THEN 'LOW'
        ELSE 'NONE'
    END as confidence,
    COUNT(*) as record_count
FROM author_analysis
GROUP BY 1
ORDER BY 
    CASE 
        WHEN confidence = 'HIGH' THEN 1 
        WHEN confidence = 'MEDIUM' THEN 2 
        WHEN confidence = 'LOW' THEN 3 
        ELSE 4 
    END
""")

print("Detection Results by Confidence Level (Expanded Detection):")
print("-" * 50)
for row in result:
    print(f"  {row['confidence']}: {row['record_count']:,} records")
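
The patterns in Query 1 are interpolated into single-quoted SQL string literals. With Spark SQL's default parser settings (`spark.sql.parser.escapedStringLiterals` set to `false`), backslash escapes inside string literals are processed before the regex engine sees the pattern, so a regex such as `\b` has to appear in the SQL text as `\\b`. The helper below is a hypothetical sketch of one way to do that escaping in Python; `to_sql_regex_literal` is not part of `utils.databricks_sql`, and it assumes `run_query` passes the SQL text through unchanged.

```python
def to_sql_regex_literal(pattern):
    """Quote a regex as a single-quoted Spark SQL string literal.

    Assumes the default parser behavior, in which backslash escapes inside
    string literals are processed before the regex engine sees the pattern.
    """
    escaped = pattern.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


word_pattern = r"(?i)\b(Clinic|Museum)\b"
literal = to_sql_regex_literal(word_pattern)
print(f"a.family RLIKE {literal}")
```

Writing the regex once in its natural form and escaping it at interpolation time keeps the Python constants readable and avoids hand-counting backslashes in each query.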

## Query 2: Distribution by Year

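Is this an ongoing problem or a historical one? Note that Queries 2 through 5 use the narrower original detection (English keywords in `family` only, records with at least two authors), so their counts are not directly comparable with Query 1.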

In [None]:
result = run_query("""
WITH affected AS (
    SELECT 
        created_date,
        CASE
            WHEN size(filter(authors, a -> 
                (a.given IS NULL OR a.given = '')
                AND a.family RLIKE '(?i)(University|Institute|College|Hospital|Department|School|Center|Centre|Laboratory|Faculty|Academy)'
            )) >= 2 THEN 'HIGH'
            WHEN size(filter(authors, a -> 
                a.family RLIKE '(?i)(University|Institute|College|Hospital|Department|School|Center|Centre|Laboratory|Faculty|Academy)'
            )) * 1.0 / size(authors) >= 0.4 THEN 'HIGH'
            WHEN size(filter(authors, a -> 
                a.family RLIKE '(?i)(University|Institute|College|Hospital|Department|School|Center|Centre|Laboratory|Faculty|Academy)'
            )) * 1.0 / size(authors) >= 0.2 THEN 'MEDIUM'
            ELSE 'LOW'
        END as confidence
    FROM openalex.crossref.crossref_works
    WHERE authors IS NOT NULL 
      AND size(authors) >= 2
      AND size(filter(authors, a -> 
          a.family RLIKE '(?i)(University|Institute|College|Hospital|Department|School|Center|Centre|Laboratory|Faculty|Academy)'
      )) >= 1
)
SELECT 
    YEAR(created_date) as year,
    SUM(CASE WHEN confidence = 'HIGH' THEN 1 ELSE 0 END) as high_count,
    SUM(CASE WHEN confidence = 'MEDIUM' THEN 1 ELSE 0 END) as medium_count,
    SUM(CASE WHEN confidence = 'LOW' THEN 1 ELSE 0 END) as low_count,
    COUNT(*) as total
FROM affected
WHERE created_date IS NOT NULL
GROUP BY 1
ORDER BY 1 DESC
LIMIT 10
""")

print(f"{'Year':<6} {'HIGH':>12} {'MEDIUM':>12} {'LOW':>12} {'Total':>12}")
print("-" * 56)
for row in result:
    print(f"{row['year']:<6} {row['high_count']:>12,} {row['medium_count']:>12,} {row['low_count']:>12,} {row['total']:>12,}")

## Query 3: Top Affected Publishers (Recent)

Which publishers contribute most to this problem among records created since 2024-01-01? Results are ranked by HIGH-confidence count.

In [None]:
result = run_query("""
WITH publisher_affected AS (
    SELECT 
        publisher,
        COUNT(*) as total_affected,
        SUM(CASE WHEN 
            size(filter(authors, a -> 
                (a.given IS NULL OR a.given = '')
                AND a.family RLIKE '(?i)(University|Institute|College|Hospital|Department|School|Center|Centre|Laboratory|Faculty|Academy)'
            )) >= 2
            OR size(filter(authors, a -> 
                a.family RLIKE '(?i)(University|Institute|College|Hospital|Department|School|Center|Centre|Laboratory|Faculty|Academy)'
            )) * 1.0 / size(authors) >= 0.4
        THEN 1 ELSE 0 END) as high_confidence
    FROM openalex.crossref.crossref_works
    WHERE authors IS NOT NULL 
      AND size(authors) >= 2
      AND created_date >= '2024-01-01'
      AND size(filter(authors, a -> 
          a.family RLIKE '(?i)(University|Institute|College|Hospital|Department|School|Center|Centre|Laboratory|Faculty|Academy)'
      )) >= 1
    GROUP BY publisher
)
SELECT publisher, total_affected, high_confidence
FROM publisher_affected
ORDER BY high_confidence DESC
LIMIT 25
""")

print(f"{'Publisher':<60} {'Total':>10} {'HIGH':>10}")
print("=" * 82)
for row in result:
    pub = row['publisher'][:58] if row['publisher'] else 'Unknown'
    print(f"{pub:<60} {row['total_affected']:>10,} {row['high_confidence']:>10,}")

## Query 4: Sample Affected Records

View specific examples of the interleaving pattern.

In [None]:
result = run_query("""
SELECT 
    native_id,
    publisher,
    to_json(authors) as authors_json
FROM openalex.crossref.crossref_works
WHERE authors IS NOT NULL 
  AND size(authors) >= 4
  AND size(filter(authors, a -> 
      a.family RLIKE '(?i)(University|Institute|College|Hospital|Department|School|Center|Centre|Laboratory|Faculty|Academy)'
  )) >= 2
  AND size(filter(authors, a -> 
      a.family RLIKE '(?i)(University|Institute|College|Hospital|Department|School|Center|Centre|Laboratory|Faculty|Academy)'
  )) * 1.0 / size(authors) >= 0.3
LIMIT 5
""")

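# Same English keyword list as the RLIKE pattern in the query above
institution_keywords = ['University', 'Institute', 'College', 'Hospital', 'Department', 
                        'School', 'Center', 'Centre', 'Laboratory', 'Faculty', 'Academy']

def is_institution(family):
    """Case-insensitive substring check of the family field against the keyword list."""
    return any(kw.lower() in (family or '').lower() for kw in institution_keywords)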

for row in result:
    authors = json.loads(row['authors_json'])
    print(f"\nDOI: {row['native_id']}")
    print(f"Publisher: {row['publisher']}")
    print("Authors:")
    for i, a in enumerate(authors[:10]):
        given = a.get('given') or ''
        family = a.get('family') or ''
        marker = " <-- INSTITUTION" if is_institution(family) else ""
        print(f"  {i+1}. given='{given}', family='{family}'{marker}")
    if len(authors) > 10:
        print(f"  ... +{len(authors) - 10} more")
    print()

## Query 5: Publisher Targeting Analysis

How many publishers would we need to target to cover a given share of the HIGH-confidence records?

In [None]:
# First get total HIGH confidence count
total_result = run_query("""
SELECT COUNT(*) as total
FROM openalex.crossref.crossref_works
WHERE authors IS NOT NULL 
  AND size(authors) >= 2
  AND (
      size(filter(authors, a -> 
          (a.given IS NULL OR a.given = '')
          AND a.family RLIKE '(?i)(University|Institute|College|Hospital|Department|School|Center|Centre|Laboratory|Faculty|Academy)'
      )) >= 2
      OR size(filter(authors, a -> 
          a.family RLIKE '(?i)(University|Institute|College|Hospital|Department|School|Center|Centre|Laboratory|Faculty|Academy)'
      )) * 1.0 / size(authors) >= 0.4
  )
""")
total_high = total_result[0]['total']

# Get cumulative coverage
result = run_query("""
WITH publisher_affected AS (
    SELECT 
        publisher,
        SUM(CASE WHEN 
            size(filter(authors, a -> 
                (a.given IS NULL OR a.given = '')
                AND a.family RLIKE '(?i)(University|Institute|College|Hospital|Department|School|Center|Centre|Laboratory|Faculty|Academy)'
            )) >= 2
            OR size(filter(authors, a -> 
                a.family RLIKE '(?i)(University|Institute|College|Hospital|Department|School|Center|Centre|Laboratory|Faculty|Academy)'
            )) * 1.0 / size(authors) >= 0.4
        THEN 1 ELSE 0 END) as high_confidence
    FROM openalex.crossref.crossref_works
    WHERE authors IS NOT NULL AND size(authors) >= 2
      AND size(filter(authors, a -> 
          a.family RLIKE '(?i)(University|Institute|College|Hospital|Department|School|Center|Centre|Laboratory|Faculty|Academy)'
      )) >= 1
    GROUP BY publisher
),
ranked AS (
    SELECT publisher, high_confidence,
           ROW_NUMBER() OVER (ORDER BY high_confidence DESC) as rank
    FROM publisher_affected
    WHERE high_confidence > 0
)
SELECT rank, publisher, high_confidence,
       SUM(high_confidence) OVER (ORDER BY rank) as cumulative
FROM ranked
WHERE rank <= 100
""")

print(f"Total HIGH confidence records: {total_high:,}\n")
print("Publisher Targeting Efficiency:")
print("-" * 50)
milestones = [10, 25, 50, 75, 100]
for row in result:
    if row['rank'] in milestones:
        pct = row['cumulative'] * 100.0 / total_high
        print(f"Top {row['rank']:>3} publishers: {row['cumulative']:>10,} records ({pct:.1f}% coverage)")
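
Query 5 reports coverage at fixed ranks. The inverse question, how many publishers are needed to reach a target share, can be sketched in plain Python from a list of per-publisher counts. The function name `publishers_needed` and the sample counts are hypothetical and are not taken from the query results.

```python
from itertools import accumulate


def publishers_needed(high_counts, target_share):
    """Return how many top publishers are needed to reach target_share of the total.

    high_counts: per-publisher HIGH-confidence record counts (any order).
    target_share: fraction between 0 and 1, e.g. 0.6 for 60% coverage.
    Returns None if the total is zero.
    """
    ranked = sorted(high_counts, reverse=True)
    total = sum(ranked)
    if total == 0:
        return None
    for rank, cumulative in enumerate(accumulate(ranked), start=1):
        if cumulative >= target_share * total:
            return rank
    return len(ranked)


# Hypothetical counts, for illustration only (not real query results).
sample_counts = [500, 300, 100, 50, 30, 20]
print(publishers_needed(sample_counts, 0.6))
print(publishers_needed(sample_counts, 0.85))
```

Unlike the query, which stops at rank 100, this sketch assumes the full list of per-publisher counts is available.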

## Detection SQL for Flagging Records

Use this query to flag affected records in a pipeline:

In [None]:
detection_query = """
-- Expanded detection query for author/affiliation interleaving issue
-- Checks both given AND family fields, includes non-English and corporate patterns

WITH detection_patterns AS (
    SELECT 
        native_id,
        size(authors) as total_authors,
        -- Count institution-like authors using expanded pattern
        size(filter(authors, a -> 
            -- English institution keywords (both fields)
            a.family RLIKE '(?i)(University|Institute|College|Hospital|Department|School|Center|Centre|Laboratory|Faculty|Academy)'
            OR a.given RLIKE '(?i)(University|Institute|College|Hospital|Department|School|Center|Centre|Laboratory|Faculty|Academy)'
            -- Non-English institution keywords
            OR a.family RLIKE '(?i)(Universiteit|Universidade|Università|Uniwersytet|Üniversitesi|Universite|Hochschule|Fakultät|Klinikum|Krankenhaus|Politecnico|Politechnika|Escuela|Colegio|Faculdade|Facultad)'
            OR a.given RLIKE '(?i)(Universiteit|Universidade|Università|Uniwersytet|Üniversitesi|Universite|Hochschule|Fakultät|Klinikum|Krankenhaus|Politecnico|Politechnika|Escuela|Colegio|Faculdade|Facultad)'
            -- Corporate/organization patterns
            OR a.family RLIKE '(?i)(\\\\s|^)(Inc|LLC|Ltd|Corp|Corporation|Company|GmbH|Consortium|Association|Collaboration)(\\\\s|\\\\.|,|$)'
            OR a.family RLIKE '(?i)\\\\b(Committee|Council|Organisation|Organization)\\\\b'
            -- Additional institution keywords
            OR a.family RLIKE '(?i)\\\\b(Clinic|Medical|Research|Museum|Library|Foundation|Polytechnic)\\\\b'
            OR a.given RLIKE '(?i)\\\\b(Clinic|Medical|Research|Museum|Library|Foundation|Polytechnic)\\\\b'
        )) as institution_like_count,
        -- Count with null given name
        size(filter(authors, a -> 
            (a.given IS NULL OR a.given = '')
            AND (
                a.family RLIKE '(?i)(University|Institute|College|Hospital|Department|School|Center|Centre|Laboratory|Faculty|Academy)'
                OR a.family RLIKE '(?i)(Universiteit|Universidade|Università|Uniwersytet|Üniversitesi|Universite|Hochschule|Fakultät|Klinikum|Krankenhaus|Politecnico|Politechnika|Escuela|Colegio|Faculdade|Facultad)'
                OR a.family RLIKE '(?i)(\\\\s|^)(Inc|LLC|Ltd|Corp|Corporation|Company|GmbH|Consortium|Association|Collaboration)(\\\\s|\\\\.|,|$)'
                OR a.family RLIKE '(?i)\\\\b(Committee|Council|Organisation|Organization|Clinic|Medical|Research|Museum|Library|Foundation|Polytechnic)\\\\b'
            )
        )) as null_given_institution_count
    FROM openalex.crossref.crossref_works
    WHERE authors IS NOT NULL AND size(authors) >= 1
)
SELECT 
    native_id,
    -- Detection flag
    CASE
        WHEN null_given_institution_count >= 2 THEN 'HIGH'
        WHEN total_authors > 0 AND institution_like_count * 1.0 / total_authors >= 0.4 THEN 'HIGH'
        WHEN total_authors > 0 AND institution_like_count * 1.0 / total_authors >= 0.2 THEN 'MEDIUM'
        WHEN institution_like_count >= 1 THEN 'LOW'
        ELSE NULL
    END as author_affiliation_interleaving_confidence,
    institution_like_count as institution_like_author_count
FROM detection_patterns
WHERE institution_like_count >= 1
"""

print(detection_query)

## Key Findings Summary

1. **~1.9M records** are affected in total under the expanded detection (as of Jan 2026)
   - ~1.54M from English institution keywords
   - ~54k from non-English institution keywords
   - ~281k from corporate/organization patterns
   - ~35k from additional keywords (Clinic, Medical, Research, etc.)
2. **Problem is ongoing** - 2025 has the most affected records, 2026 is on pace to exceed it
3. **Concentrated in certain publishers** - Top 50 publishers account for ~35% of the problem
4. **Ukrainian/Russian university publishers** are heavily overrepresented
5. Some publishers have **90%+ of their output affected**

### Detection Improvements Made

The original detection (870k records) was expanded to catch:
- Institution keywords in **both** `given` AND `family` fields (+182k)
- Non-English institution keywords (+54k)
- Corporate/organization patterns (+281k)
- Additional keywords like Clinic, Medical, Research (+35k)
- Single-author records (changed `size(authors) >= 2` to `>= 1`)
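
As a quick consistency check on the approximate figures quoted in this summary, the arithmetic below sums the per-category breakdown and compares it with the original count plus the itemized increments. The variable names are chosen for this example, and the categories may overlap, so the result is only a rough reconciliation.

```python
# Approximate counts in thousands, copied from the summary above.
breakdown = {"english": 1540, "non_english": 54, "corporate": 281, "additional": 35}
original = 870
itemized_increments = {"both_fields": 182, "non_english": 54, "corporate": 281, "additional": 35}

total_expanded = sum(breakdown.values())
itemized_total = original + sum(itemized_increments.values())
unitemized = total_expanded - itemized_total

print(total_expanded, itemized_total, unitemized)
```

The breakdown sums to roughly 1.91M, in line with the ~1.9M headline. The original count plus the itemized increments comes to roughly 1.42M, which leaves about 488k not itemized in the list. That gap would be consistent with the single-author change, the only item without a count, but the figures here do not confirm it.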

## Next Steps

1. Implement detection in the ingestion pipeline
2. Either:
   - Filter out institution-like "authors" from affected records, OR
   - Flag records for manual review, OR
   - Apply publisher-specific fixes
3. Consider contacting top offending publishers about their metadata formatting
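
The first remediation option, filtering institution-like entries out of an affected record, could look roughly like the sketch below. It is an assumption-laden illustration: `split_authors` is a hypothetical helper, the keyword subset is deliberately small, and author entries are assumed to be dicts with optional `given` and `family` keys.

```python
import re

# Illustrative keyword subset; a real filter would reuse the full detection patterns.
_FILTER_RE = re.compile(
    r"University|Institute|College|Hospital|Department", re.IGNORECASE
)


def split_authors(authors):
    """Partition author dicts into (kept, removed) by institution keyword match."""
    kept, removed = [], []
    for author in authors:
        names = f"{author.get('given') or ''} {author.get('family') or ''}"
        (removed if _FILTER_RE.search(names) else kept).append(author)
    return kept, removed


# Author entries from the example DOI 10.26907/esd.17.3.14
record_authors = [
    {"given": "Kazan", "family": "University"},
    {"given": "Roza", "family": "Valeeva"},
    {"given": "Gulfiya", "family": "Parfilova"},
    {"given": "Kazan", "family": "University"},
    {"given": "Irina", "family": "Demakova"},
]

kept, removed = split_authors(record_authors)
print(len(kept), len(removed))
```

Returning the removed entries alongside the kept ones, rather than discarding them, leaves room for the manual-review option, since keyword matching can misfire on real surnames.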