The goal of this notebook is to clean and standardize the raw UK Financial Sanctions dataset, preparing it for future name‑matching experiments.

The notebook is structured as follows:

- **Preprocessing**: Selecting relevant columns and filtering the dataset

- **Language Normalization**: Handling inconsistencies across Latin‑ and non‑Latin‑based names

- **Name Cleaning**: Formatting names for consistency 

- **Outlier Analysis**: Removing the top 1% of entities with too many names to reduce noise

- **Data Aggregation**: Creating a condensed version of the cleaned dataset, with a single row per entity


# 1. Imports

In [1]:
#necessary libraries
from pathlib import Path
import pandas as pd  
import numpy as np  
import warnings  
from unidecode import unidecode
import re   
import matplotlib.pyplot as plt
import seaborn as sns


#commands for better output readability 
pd.set_option('display.max_columns', None)  
#pd.set_option('display.max_rows', None)  
warnings.filterwarnings("ignore", category=UserWarning, module='pandas')  

# 2. Configuration

In [2]:
#paths
project_dir=Path.cwd().parent
raw_dir=project_dir/'data'/'raw'
processed_dir=project_dir/'data'/'processed'
processed_dir.mkdir(parents=True, exist_ok=True)  #create the output folder (and missing parents) if needed

uk_file=raw_dir/'ConList.csv'

df=pd.read_csv(uk_file,skiprows=1)

# 3. Preprocessing

In [3]:
df.head()

Unnamed: 0,Name 6,Name 1,Name 2,Name 3,Name 4,Name 5,Title,Name Non-Latin Script,Non-Latin Script Type,Non-Latin Script Language,DOB,Town of Birth,Country of Birth,Nationality,Passport Number,Passport Details,National Identification Number,National Identification Details,Position,Address 1,Address 2,Address 3,Address 4,Address 5,Address 6,Post/Zip Code,Country,Other Information,Group Type,Alias Type,Alias Quality,Regime,Listed On,UK Sanctions List Date Designated,Last Updated,Group ID
0,MITHOO,Mian,,,,,,,,,,,Pakistan,Pakistan,,,,,Cleric (“Pir”) of Bharchundi Sharif Shrine,Hafizabad Taluka Daharki,District Ghotki,,,,,,Pakistan,(UK Sanctions List Ref):GHR0086. (UK Statement...,Individual,Primary name variation,,Global Human Rights,09/12/2022,09/12/2022,09/12/2022,15672
1,MITHU,Mian,,,,,,,,,,,Pakistan,Pakistan,,,,,Cleric (“Pir”) of Bharchundi Sharif Shrine,Hafizabad Taluka Daharki,District Ghotki,,,,,,Pakistan,(UK Sanctions List Ref):GHR0086. (UK Statement...,Individual,Primary name variation,,Global Human Rights,09/12/2022,09/12/2022,09/12/2022,15672
2,MITTO,Mian,,,,,,,,,,,Pakistan,Pakistan,,,,,Cleric (“Pir”) of Bharchundi Sharif Shrine,Hafizabad Taluka Daharki,District Ghotki,,,,,,Pakistan,(UK Sanctions List Ref):GHR0086. (UK Statement...,Individual,Primary name variation,,Global Human Rights,09/12/2022,09/12/2022,09/12/2022,15672
3,MITTU,Mian,,,,,,,,,,,Pakistan,Pakistan,,,,,Cleric (“Pir”) of Bharchundi Sharif Shrine,Hafizabad Taluka Daharki,District Ghotki,,,,,,Pakistan,(UK Sanctions List Ref):GHR0086. (UK Statement...,Individual,Primary name variation,,Global Human Rights,09/12/2022,09/12/2022,09/12/2022,15672
4,ZADACHIN,Andrei,Andreevich,,,,,,,,22/08/1990,,Russia,Russia,,,,,(1) Investigator for Particularly Important Ca...,,,,,,,,,(UK Sanctions List Ref):RUS1831. Financial san...,Individual,Primary name variation,,Russia,21/04/2023,21/04/2023,21/04/2023,15890


In [4]:
df['Group Type'].unique().tolist()

['Individual', 'Entity', 'Ship']

The raw dataset contains extensive information about each entity, including names, addresses, birthplaces, citizenship, and other details. Since this project focuses on name matching, we narrowed the scope to the name fields and the sanction-related fields (`Group Type`, `Regime`, `Last Updated`, `Group ID`). For consistency with the EU Sanctions Dataset, we also kept only the rows for individuals and entities, dropping ships.

In [5]:
df_names=df[['Name 6','Name 1','Name 2','Name 3','Name 4','Name 5','Group Type','Regime','Last Updated','Group ID']]

df_names=df_names[df_names['Group Type'] != 'Ship']

df_names=df_names.reset_index(drop=True)

In [6]:
#merge all name columns, except the non-Latin script ones, into a single name
columns_to_merge=['Name 1','Name 2','Name 3','Name 4','Name 5','Name 6']

def merge_names(row):
    """
    Merges multiple name columns from a row into a single space-separated name.

    Args:
        row (pd.Series): A row of a DataFrame containing columns listed in `columns_to_merge`.

    Returns:
        str: A combined name created by joining all non-null columns with a space.
    """
    
    row=row[columns_to_merge].dropna()
    name=' '.join(row)
    return name
    
#merge name columns in a single 'Name' cell
df_names['Name']=df_names.apply(merge_names,axis=1)

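#drop the original columns and remove any rows with an empty 'Name' post-merge
df_names=df_names.drop(columns=columns_to_merge)
#merge_names returns '' (not NaN) when every name column is empty, so filter on the empty string
df_names=df_names[df_names['Name'].str.strip()!='']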
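
To make the merge step concrete, here is a minimal plain-Python sketch of the same idea on two made-up records. The helper `merge_name_parts` and the sample dictionaries are illustrative assumptions, not part of the notebook. The sketch also shows that a record with no name parts produces an empty string rather than a missing value.

```python
# Same column order as columns_to_merge: 'Name 6' is placed last.
NAME_COLUMNS = ['Name 1', 'Name 2', 'Name 3', 'Name 4', 'Name 5', 'Name 6']

def merge_name_parts(record):
    """Join the non-empty name parts of one record with single spaces."""
    parts = (record.get(column) for column in NAME_COLUMNS)
    return ' '.join(part for part in parts if part)

# Hypothetical records shaped like rows of the raw file (None marks a missing cell).
full_record = {'Name 6': 'ZADACHIN', 'Name 1': 'Andrei', 'Name 2': 'Andreevich'}
empty_record = {'Name 6': None, 'Name 1': None}

merged_full = merge_name_parts(full_record)    # 'Andrei Andreevich ZADACHIN'
merged_empty = merge_name_parts(empty_record)  # '' (an empty string, not a missing value)
```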

# 4. Language Normalization

Although the non-Latin script columns were already excluded, some names still contained special characters or diacritics. To standardize these, we used `unidecode`, a library that transliterates Unicode text to its closest ASCII representation (e.g., ø → o, æ → ae, ç → c). More details about its limitations can be found at [Unidecode on PyPI](https://pypi.org/project/Unidecode/).

In [7]:
def normalize(text):
    """
    Normalizes text by transliterating diacritics and other non-ASCII characters to ASCII.

    Args:
        text (str): The input text to be normalized.

    Returns:
        str: The ASCII transliteration of the input text.

    Example:
        >>> normalize('ołá')
        'ola'
    """
    return unidecode(text)

df_names['Name']=df_names['Name'].apply(normalize)
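
For comparison, a stdlib-only way to strip diacritics is Unicode decomposition with `unicodedata`. The sketch below is not what the notebook uses, and the helper name `strip_diacritics` is made up. It handles accented letters such as ç or á but leaves letters like ł untouched, because they have no Unicode decomposition. That is the kind of case where a transliteration library such as `unidecode` helps.

```python
import unicodedata

def strip_diacritics(text):
    """Remove combining marks after NFKD decomposition (stdlib only)."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

accented = strip_diacritics('ça')   # 'ca': ç decomposes into c + combining cedilla
stroked = strip_diacritics('ołá')   # 'oła': á is handled, ł has no decomposition
```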

# 5. Name Cleaning

After inspecting the names in the dataset, we noticed recurring inconsistencies in their formatting. To standardize them, we created two cleaning functions: `clean_name_individual` and `clean_name_entity`. Both apply the same basic cleaning rules; the entity version additionally keeps digits and standardizes common company suffixes.

In [8]:
def clean_name_individual(text):
    """
    Cleans an individual's name using regex patterns.

    Rules:
    - Replaces apostrophes, hyphens, and backslashes with spaces
    - Removes any character that is not a letter or space
    - Capitalizes the first letter of each word and lowercases the rest

    Args:
        text (str): The raw name.

    Returns:
        str: The cleaned, properly capitalized name.
    """
    
    text=re.sub(r"[\'\-\\]",' ',text)  
    text=re.sub(r"[^a-zA-Z\s]",'',text) 
    
    text=text.split()
    text=[word.capitalize() for word in text]  

    text=' '.join(text)
    
    return text

#apply the cleaning function to 'Individual' rows
df_names.loc[df_names['Group Type']=='Individual','Name']=df_names.loc[df_names['Group Type']=='Individual','Name'].apply(clean_name_individual)
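
As a quick check of the rules for individuals, the following self-contained sketch restates the cleaning steps in a hypothetical helper, `clean_person_name`, and applies it to a few made-up inputs.

```python
import re

def clean_person_name(text):
    """Restatement of the individual-name rules, for illustration only."""
    text = re.sub(r"['\-\\]", ' ', text)       # apostrophes, hyphens, backslashes -> space
    text = re.sub(r"[^a-zA-Z\s]", '', text)    # drop everything except letters and whitespace
    return ' '.join(word.capitalize() for word in text.split())

samples = ["O'CONNOR-smith, jr.", "AL-RASHID  ahmed", "Ivan IVANOV 2"]
cleaned = [clean_person_name(sample) for sample in samples]
# cleaned == ['O Connor Smith Jr', 'Al Rashid Ahmed', 'Ivan Ivanov']
```

Note that digits are removed for individuals, and that repeated whitespace collapses to a single space because the words are split and re-joined.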

In [9]:
def clean_name_entity(text):
    """
    Cleans an entity's name using regex patterns and standardizes common suffixes.

    Rules:
    - Replaces apostrophes, hyphens, and backslashes with spaces
    - Removes any character that is not a letter, digit, or space
    - Standardizes common suffixes (Llp -> LLP, Limited -> Ltd, Company -> Co)
    - Capitalizes the first letter of each word

    Args:
        text (str): The raw entity name.

    Returns:
        str: The cleaned, properly formatted entity name.
    """
    
    text=re.sub(r"[\'\-\\]",' ',text)  
    text=re.sub(r"[^a-zA-Z0-9\s]",'',text) 

    #capitalize first: str.capitalize() lowercases the rest of each word,
    #so it would undo upper-case suffixes such as LLP if applied afterwards
    text=text.split()
    text=[word.capitalize() for word in text]  
    text=' '.join(text)

    #standardize common suffixes (case-insensitive)
    text=re.sub(r"\bLlp\b", "LLP", text, flags=re.IGNORECASE)
    text=re.sub(r"\bLtd\b", "Ltd", text, flags=re.IGNORECASE)
    text=re.sub(r"\bLimited\b", "Ltd", text, flags=re.IGNORECASE)
    text=re.sub(r"\bCo\b", "Co", text, flags=re.IGNORECASE)
    text=re.sub(r"\bCompany\b", "Co", text, flags=re.IGNORECASE)
    text=re.sub(r"\bInc\b", "Inc", text, flags=re.IGNORECASE)
    text=re.sub(r"\bIncorporated\b", "Inc", text, flags=re.IGNORECASE)
    text=re.sub(r"\bOjsc\b", "OJSC", text, flags=re.IGNORECASE)
    
    return text

#apply the cleaning function to 'Entity' rows
df_names.loc[df_names['Group Type']=='Entity','Name']=df_names.loc[df_names['Group Type']=='Entity','Name'].apply(clean_name_entity)
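
One detail worth checking when suffix standardization is combined with word capitalization: `str.capitalize()` lowercases every character after the first, so the order of the two steps affects the result. A minimal sketch, using a made-up entity name and a hypothetical helper `standardize_suffix` that covers only the LLP rule:

```python
import re

def standardize_suffix(text):
    """Upper-case the LLP suffix, whatever its original casing."""
    return re.sub(r"\bLlp\b", "LLP", text, flags=re.IGNORECASE)

def capitalize_words(text):
    """Capitalize each word; str.capitalize() also lowercases the remaining letters."""
    return ' '.join(word.capitalize() for word in text.split())

raw = "acme trading llp"   # made-up entity name

suffix_then_capitalize = capitalize_words(standardize_suffix(raw))  # 'Acme Trading Llp'
capitalize_then_suffix = standardize_suffix(capitalize_words(raw))  # 'Acme Trading LLP'
```

Only the second ordering keeps the upper-case suffix in the final name.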

In [10]:
#drop exact duplicate rows (cleaning can make different raw spellings identical)
df_names=df_names.drop_duplicates().reset_index(drop=True)

In [11]:
df_names.head()

Unnamed: 0,Group Type,Regime,Last Updated,Group ID,Name
0,Individual,Global Human Rights,09/12/2022,15672,Mian Mithoo
1,Individual,Global Human Rights,09/12/2022,15672,Mian Mithu
2,Individual,Global Human Rights,09/12/2022,15672,Mian Mitto
3,Individual,Global Human Rights,09/12/2022,15672,Mian Mittu
4,Individual,Russia,21/04/2023,15890,Andrei Andreevich Zadachin


# 6. Outlier Analysis 

The number of names associated with the same entity varies widely, from a single entry to more than 100. To prevent disproportionately long or ambiguous records from affecting the quality of name matching, we identified and removed outliers:

- We first grouped names by `Group ID` and counted how many names each ID had.

- We observed a highly skewed distribution, with certain entities having an unusually high number of names.

- To formalize the cutoff, we computed the 99th percentile of these counts for individuals and for entities.

- All IDs with counts above these thresholds were treated as outliers and removed from the dataset.
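
The steps above can be sketched on toy data as follows. The group IDs and counts are made up, and the sketch uses `statistics.quantiles` with `method='inclusive'` from the standard library instead of `np.percentile`. Both should interpolate linearly between sorted values, but that equivalence is an assumption here rather than something checked against the dataset.

```python
from collections import Counter
from statistics import quantiles

# Made-up number of names per group ID; group 106 is deliberately extreme.
names_per_group = {101: 1, 102: 1, 103: 2, 104: 2, 105: 3, 106: 40}
rows = [(group_id, f'Name {i}') for group_id, n in names_per_group.items() for i in range(n)]

# Step 1: count names per group ID.
counts = Counter(group_id for group_id, _ in rows)

# Step 2: 99th percentile of the counts (the last of the 99 cut points).
cutoff = quantiles(list(counts.values()), n=100, method='inclusive')[98]

# Step 3: drop every row belonging to a group whose count exceeds the cutoff.
outlier_ids = {group_id for group_id, count in counts.items() if count > cutoff}
kept_rows = [row for row in rows if row[0] not in outlier_ids]
```

With these toy counts only group 106 exceeds the cutoff, so all of its rows are removed while the other groups are kept in full.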


In [12]:
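#count names per Group ID and print the counts in descending order
#note: individual_counts is computed without a 'Group Type' filter, so it covers all Group IDs (individuals and entities)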

individual_counts=df_names.groupby('Group ID').size().reset_index(name='Count')
entity_counts= df_names[df_names['Group Type']=='Entity'].groupby('Group ID').size().reset_index(name='Count')

individual_counts=individual_counts.sort_values(by='Count',ascending=False)
entity_counts=entity_counts.sort_values(by='Count',ascending=False)

print(individual_counts)
print(entity_counts)


      Group ID  Count
910      12771    144
392      10638    108
988      12896     81
1975     14068     57
985      12892     49
...        ...    ...
4619     16796      1
4620     16797      1
4621     16798      1
4625     16802      1
4634     16811      1

[4651 rows x 2 columns]
      Group ID  Count
33        7241     27
265      12981     27
965      16718     19
106      11090     18
129      11241     18
...        ...    ...
1002     16812      1
996      16798      1
995      16794      1
994      16792      1
992      16790      1

[1008 rows x 2 columns]


In [13]:
#print 99th percentile name count

x_i=individual_counts['Count']
x_e=entity_counts['Count']

upper_i_new=np.percentile(x_i, 99) 
upper_e_new=np.round(np.percentile(x_e, 99),1)  

print('Upper limit for Individuals:',upper_i_new)
print('Upper limit for Entities:',upper_e_new)

Upper limit for Individuals: 16.0
Upper limit for Entities: 13.0


In [14]:
#remove outliers from dataset
outliers_individual=individual_counts[individual_counts['Count']>upper_i_new].reset_index(drop=True)
outliers_entity=entity_counts[entity_counts['Count']>upper_e_new].reset_index(drop=True)

outliers_ID=pd.concat([outliers_individual['Group ID'], outliers_entity['Group ID']], ignore_index=True)
df_names=df_names[~df_names['Group ID'].isin(outliers_ID)].reset_index(drop=True)

# 7. Data Aggregation 

For future experiments, we created a condensed version of the dataset, `df_grouped`, with one row per `Group ID`: the unique words from all names associated with an ID are collected into a single comma-separated cell. This may allow for more efficient processing and matching.

In [15]:
df_grouped=df_names.copy()

def aggregate_unique_words(grouped_names):
    """
    Splits each name into its individual words and returns a comma-separated
    string of all unique words across the names.

    Args:
        grouped_names (pd.Series of str): Names associated with a single entity.

    Returns:
        str: A comma-separated string of unique words, sorted alphabetically
        so that the output is reproducible across runs.
    """

    words=set()

    for name in grouped_names:
        words.update(name.split())
        
    return ', '.join(sorted(words))

            
df_grouped=df_grouped.groupby('Group ID',as_index=False).agg({
       
    'Group Type':'first',
    'Regime': 'first',         
    'Last Updated': 'first',
    'Name': aggregate_unique_words,  
})
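
To show what an aggregated `Name` cell looks like, here is a small stdlib-only sketch using the name variations of Group ID 15672 from the preview above. The helper `unique_words` is hypothetical and sorts the words only so that the result is deterministic.

```python
def unique_words(names):
    """Collect the distinct words across several names into one comma-separated string."""
    words = set()
    for name in names:
        words.update(name.split())
    return ', '.join(sorted(words))

# Cleaned name variations of Group ID 15672, as shown in the df_names preview.
variations = ['Mian Mithoo', 'Mian Mithu', 'Mian Mitto', 'Mian Mittu']
aggregated = unique_words(variations)  # 'Mian, Mithoo, Mithu, Mitto, Mittu'
```

The shared first word appears only once, so the four name variations condense to five distinct words.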

# 8. Output

In [16]:
df_names.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11339 entries, 0 to 11338
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Group Type    11339 non-null  object
 1   Regime        11339 non-null  object
 2   Last Updated  11339 non-null  object
 3   Group ID      11339 non-null  int64 
 4   Name          11339 non-null  object
dtypes: int64(1), object(4)
memory usage: 443.1+ KB


In [17]:
df_grouped.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4606 entries, 0 to 4605
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Group ID      4606 non-null   int64 
 1   Group Type    4606 non-null   object
 2   Regime        4606 non-null   object
 3   Last Updated  4606 non-null   object
 4   Name          4606 non-null   object
dtypes: int64(1), object(4)
memory usage: 180.1+ KB


In [18]:
df_names.to_csv(processed_dir/'cleaned_uk_sanctions.csv', index=False)
df_grouped.to_csv(processed_dir/'cleaned_uk_sanctions_grouped.csv', index=False)