# Name normalization

This notebook describes the name normalization task and the code experiments that address it.

## Definition

`Name normalization` finds groups of similar names in a list and keeps a single name from each group.

The input and output are both lists of names, with names separated by the ";" character. The input list may contain several similar names.
Two names are similar when they differ only in one or more of the following ways:
- word order, as in "Trump Donald" and "Donald Trump"
- additional stop words such as "the" and "of", or organization-type words such as "Co", "Corp", and "Inc"
- word forms, as in "Courts" and "Court", "Biden's" and "Biden", or "Congressional" and "Congress"
- letter case, as in "CNN" and "Cnn".
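The criteria above can be illustrated with a small comparison key. This is only a sketch that uses the standard library: the word lists are short assumed samples, `similarity_key` is a hypothetical helper, and "Acme Corp." is a made-up name. It covers word order, extra words, and letter case. Word forms such as "Courts" and "Court" are left out because they need a stemmer.

```python
import string

# Assumed, illustrative word lists; real lists would be longer.
STOP_WORDS = {"the", "of", "a", "an", "on"}
ORG_WORDS = {"co", "corp", "inc"}
IGNORED = STOP_WORDS | ORG_WORDS


def similarity_key(name):
    """Return a key that is equal for names differing only in word order,
    letter case, punctuation, stop words, or organization-type words."""
    cleaned = name.lower().translate(str.maketrans("", "", string.punctuation))
    words = [word for word in cleaned.split() if word not in IGNORED]
    return " ".join(sorted(words))


print(similarity_key("Trump Donald") == similarity_key("Donald Trump"))  # word order
print(similarity_key("the USA") == similarity_key("USA"))                # stop word
print(similarity_key("Acme Corp.") == similarity_key("Acme"))            # organization word
print(similarity_key("CNN") == similarity_key("Cnn"))                    # letter case
```

Two names are then treated as similar when their keys are equal.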

## Test examples

Examples of input and desired output lists:

In [120]:
tests = [  # (input, output),
    ("", ""),
    ("Department Of Homeland Security;Supreme Court;University Of The New Mexico;University Of New Mexico;Supreme Court;Cnn;Cnn;Cnn", "Department Of Homeland Security;Supreme Court;University Of The New Mexico;Cnn"),
    ("Twitter;Twitter;Russia;Twitter;Twitter;Twitter;Russia;Cnn;CNN", "Twitter;Russia;Cnn"),
    ("Department Of Justice;Republican Party;Joe Biden;Justice Department;Session Of Congress On;Justice Department;United States of America;Justice Department;Justice Department;Cnn;Cnn;Cnn;Cnn;United States", "Department Of Justice;Republican Party;Joe Biden;Session Of Congress On;United States of America;Cnn"),
]


The rules for removing similar names:
- Each group of similar names is represented in the output by its longest name.
- If several names tie for the longest, any of them is acceptable.
- The output contains the original name, not its normalized form.
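As a minimal sketch of these rules, assuming the similar names have already been grouped into a non-empty list, the representative can be chosen with `max`. The helper name `pick_representative` is hypothetical.

```python
def pick_representative(similar_names):
    """Return the longest original name from a non-empty group.

    max() returns the first of several equally long names, which is
    acceptable because any longest name may be chosen.
    """
    return max(similar_names, key=len)


print(pick_representative(["USA", "the USA", "USA"]))
print(pick_representative(["Department of Agriculture", "Agriculture Department"]))
```

The names are compared in their original form, so the value returned is always one of the inputs and never a normalized string.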

EXAMPLES:

In [121]:
replacements = [  # (input, output),
    ("", ""),
    ("USA;the USA;USA", "the USA"),
    ("Department of Agriculture;Agriculture Department", "Department of Agriculture"),
    ("United States Air Force;Air Force of United States;Air Force", "Air Force of United States"),
    ("Donald Trump;Trump", "Donald Trump"),
    ("Public Safety Department;Department of Public Safety", "Department of Public Safety"),
    ("Supreme Courts;Supreme Court", "Supreme Courts"),
    ("Congress;Congresses", "Congresses"),
]


## Experiments

### NLTK

In [155]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import string

# Download required NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')

# Initialize the stemmer and stop words
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))
org_suffixes = {'co', 'corp', 'inc', 'ltd', 'llc', 'plc'}

def normalize_name(name):
    # Convert to lowercase and remove punctuation
    name = name.lower().translate(str.maketrans('', '', string.punctuation))
    # Tokenize, drop stop words and organization-type words, then stem the remaining words
    tokens = name.split()
    normalized_tokens = [
        stemmer.stem(token)
        for token in tokens
        if token not in stop_words and token not in org_suffixes
    ]
    # Sort the tokens alphabetically to capture names with words in different order
    return ' '.join(sorted(normalized_tokens))

def is_included(normalized_to_original, normalized):
    """Return the stored key that matches `normalized`, or None if there is none."""
    if not normalized:
        return None
    normalized_set = set(normalized.split())
    for k in normalized_to_original:
        k_set = set(k.split())
        if normalized_set.issubset(k_set):
            return k
        elif k_set.issubset(normalized_set) and len(k_set) > 1:
            # The new name is longer: re-key the entry under the longer normalized name.
            # This applies only to stored keys with 2+ words, so a one-word name is
            # merged only when the longer name appears earlier in the list.
            # Returning right after the pop avoids iterating over the modified dict.
            normalized_to_original[normalized] = normalized_to_original.pop(k)
            return normalized
    return None


def consolidate_names(name_list):
    names = name_list.split(';')
    
    normalized_to_original = {}
    
    for name in names:
        normalized = normalize_name(name)
        # print(f"{normalized}")
        if (k := is_included(normalized_to_original, normalized)):
            # Keep the longest original name
            if len(name) > len(normalized_to_original[k]):
                # print(f"  {k}: {name}")
                normalized_to_original[k] = name                
        else:
            normalized_to_original[normalized] = name
            # print(f"  {normalized}: {name}")
    
    return ';'.join(normalized_to_original.values())


[nltk_data] Downloading package stopwords to /home/leo/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/leo/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [127]:
[tst in "Joe Sr Doe" for tst in ["Joe", "Sr", "Doe", "Sr Doe", "SrDoe"]]

[True, True, True, True, False]
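The check above uses substring containment, which is not the same as comparing whole words. The sketch below contrasts the two. The helper `is_token_subset` is hypothetical and mirrors the `issubset` test on word sets used in `is_included`.

```python
def is_token_subset(short, long):
    """True when every whitespace-separated word of `short` occurs in `long`."""
    return set(short.split()) <= set(long.split())


full = "Joe Sr Doe"
for candidate in ["Sr Doe", "Doe Joe", "oe", "SrDoe"]:
    # Columns: candidate, substring check, word-set check
    print(candidate, candidate in full, is_token_subset(candidate, full))
```

The substring check accepts the fragment "oe" and rejects the reordered "Doe Joe", while the word-set check does the opposite. An empty string is a subset of any word set, which is why `is_included` returns early for an empty normalized name.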

In [128]:
[(tst, consolidate_names(tst), res) for tst, res in replacements+[("something", "somethingW")] if consolidate_names(tst) != res]

[('something', 'something', 'somethingW')]

In [129]:
[(tst, consolidate_names(tst), res) for tst, res in tests if consolidate_names(tst) != res]

[]
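The list comprehensions in these cells call `consolidate_names` twice for every case. A small helper can run each case once and collect the mismatches. This is only a sketch: `failing_cases` is a hypothetical name, and the demonstration uses `str.upper` as a stand-in so that the block runs without NLTK.

```python
def failing_cases(cases, func):
    """Return (input, actual, expected) for each case where func(input) != expected."""
    failures = []
    for given, expected in cases:
        actual = func(given)  # evaluate the function once per case
        if actual != expected:
            failures.append((given, actual, expected))
    return failures


# Stand-in demonstration: the second case is deliberately wrong.
demo_cases = [("cnn", "CNN"), ("usa", "Usa")]
print(failing_cases(demo_cases, str.upper))
```

In the notebook the same helper could be called as `failing_cases(tests, consolidate_names)`, where an empty list means that every case passed.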

In [156]:
replacements = [  # (input, output),
    ("", ""),
    ("USA;the USA;USA", "the USA"),
    ("Department of Agriculture;Agriculture Department", "Department of Agriculture"),
    ("United States Air Force;Air Force of United States;Air Force", "Air Force of United States"),
    ("Donald Trump;Trump", "Donald Trump"),
    ("Trump;Donald Trump", "Trump;Donald Trump"),
    ("Public Safety Department;Department of Public Safety", "Department of Public Safety"),
    ("Supreme Courts;Supreme Court", "Supreme Courts"),
    ("Congress;Congresses", "Congresses"),
    ("Congress;Congress of Public Safety;Agriculture Congress", "Congress;Congress of Public Safety;Agriculture Congress"),
    ("Congress;Congress of Public Safety;Safety Congress", "Congress;Congress of Public Safety"),
    ("Congress;Safety Congress;Congress of Public Safety", "Congress;Congress of Public Safety"),
    ("Congress of Public Safety;Safety Congress;Congress", "Congress of Public Safety"),
]
# [(tst, consolidate_names(tst), res) for tst, res in replacements+[("something", "somethingW")] if consolidate_names(tst) != res]
[(tst, consolidate_names(tst), res) for tst, res in replacements if consolidate_names(tst) != res]

[]