The goal of this notebook is to explore an alternative name-matching strategy to traditional fuzzy string comparisons, specifically tailored to datasets like the UK Financial Sanctions List. These datasets often include multiple rows per sanctioned entity, each capturing a different variation or spelling of the same name. This redundancy increases complexity and makes comprehensive name matching more difficult.

To address this, we propose a token-based aggregation approach, where all known name variations for an individual are broken down into their component tokens and combined into a single set. This set acts as a compact identity profile, capturing more variation without inflating the number of rows. This approach was tested on cleaned, grouped versions of the UK and EU Financial Sanctions Datasets.
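
As a minimal sketch of the aggregation idea (not the preprocessing that produced the grouped datasets used here), the snippet below collapses several name variations for one entity into a single token set. The function name `aggregate_name_tokens` and the sample variations are hypothetical.

```python
import re

def aggregate_name_tokens(name_variations):
    """Collapse several spellings of one entity's name into a single token set."""
    tokens = set()
    for name in name_variations:
        # Split on whitespace and commas, dropping empty fragments
        tokens.update(t for t in re.split(r"[\s,]+", name.strip()) if t)
    return tokens

# Hypothetical variations recorded for one entity across several rows
variations = ["Majid Abdul Chaudhry", "Majeed Abdul Chaudhry", "Chaudhry, Abdul Majid"]
profile = aggregate_name_tokens(variations)
print(sorted(profile))  # ['Abdul', 'Chaudhry', 'Majeed', 'Majid']
```

Three rows reduce to one four-token profile, which is the row-saving effect described above.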

The notebook is structured as follows:  

- **Preprocessing**: Standardizing dataframes for consistent format
  
- **Tokenization**: Converting each name string into a set of name tokens and a set of initial letters

- **Name Matching**:  
    - Pre‑selecting candidate names
  
    - Describing the matching algorithm
  
    - Implementing the algorithm to find best matches  
- **Labeling**: Assigning final labels to matched pairs (`match`, `not match`, `preliminary match`)




# 1. Imports

In [1]:
#import necessary libraries
from pathlib import Path
import pandas as pd  
import numpy as np  
import warnings  
from unidecode import unidecode
import re  
import matplotlib.pyplot as plt
import seaborn as sns
import time
from rapidfuzz import process, fuzz
import random
import scipy.stats as st
import math
from rapidfuzz.fuzz import ratio


#commands for better output readability 
pd.set_option('display.max_colwidth', None)  
pd.set_option('display.max_columns', None)   
pd.set_option('display.width', 2000)         
pd.set_option('display.max_rows', None)  
warnings.filterwarnings("ignore", category=UserWarning, module='pandas')  

# 2. Configuration

In [31]:
#paths
project_dir=Path.cwd().parent.parent
processed_dir=project_dir/'data'/'processed'
final_dir=project_dir/'data'/'final'

eu_processed_file_grouped=processed_dir/'cleaned_eu_sanctions_grouped.pkl'
uk_processed_file_grouped=processed_dir/'cleaned_uk_sanctions_grouped.pkl'

df_eu=pd.read_pickle(eu_processed_file_grouped)  
df_uk=pd.read_pickle(uk_processed_file_grouped)  

# 3. Preprocessing

In [3]:
df_eu.head()

Unnamed: 0,Id,EU Reference Number,Entity Type,Entry Into Force Date,Regulation Identifier,Sanction Programme,Name
0,13,EU.27.28,person,2003-07-07,1210/2003 (OJ L169),IRQ,"Al, Hussein, Abou, Ali, Tikriti, Saddam, Abu"
1,20,EU.39.56,person,2003-07-07,1210/2003 (OJ L169),IRQ,"Al, Hussein, Qoussai, Tikriti, Saddam, Qusay"
2,23,EU.16.62,person,2003-07-07,1210/2003 (OJ L169),IRQ,"Al, Oudai, Hussein, Tikriti, Uday, Saddam"
3,25,EU.11.43,person,2003-07-07,1210/2003 (OJ L169),IRQ,"Al, Abdel, Hamid, Bid, Mahmud, Abed, Abid, Hammoud, Tikriti, Hammud, Mahmoud"
4,29,EU.22.9,person,2003-07-07,1210/2003 (OJ L169),IRQ,"Al, Ali, Hassan, Tikriti, Kimawi, Majid"


In [4]:
df_uk.head()

Unnamed: 0,Group ID,Group Type,Regime,Last Updated,Name
0,6894,Individual,ISIL (Da'esh) and Al-Qaida,08/02/2023,"Fihiruddin, A, Iqbal, Rahman, Muqti, Abdurrahman, Fikiruddin, Mohamad, Jibril, Abdul, Abu"
1,6895,Individual,Afghanistan,01/02/2021,"Hai, Hazem, Abdul, Qader"
2,6897,Individual,ISIL (Da'esh) and Al-Qaida,31/12/2020,"Agha, Abd, Man, Manan, Abdul, Am, Al, Saiyid"
3,6899,Individual,ISIL (Da'esh) and Al-Qaida,31/12/2020,"Abdallah, Salah, Shihata, Thirwat, Shahata, Ali, Tarwat, Tharwat"
4,6901,Individual,ISIL (Da'esh) and Al-Qaida,08/02/2023,"Abdul, Chaudhry, Majeed, Majid"


In [5]:
#harmonize EU/UK column names for consistency 
df_eu=df_eu.rename(columns={'Name':'EU Name','Sanction Programme':'EU Sanction Programme','Regulation Identifier':'EU Regulation Identifier','Id':'EU ID'})
df_uk=df_uk.rename(columns={'Name':'UK Name','Group ID': 'UK ID','Regime':'UK Sanction Programme'})

#keep only entries of individual people  
df_eu=df_eu[df_eu['Entity Type']!='enterprise']
df_uk=df_uk[df_uk['Group Type']!='Entity']

df_eu=df_eu.drop(columns=['Entity Type','Entry Into Force Date']).reset_index(drop=True)
df_uk=df_uk.drop(columns=['Group Type','Last Updated']).reset_index(drop=True)

In [6]:
df_eu.head()

Unnamed: 0,EU ID,EU Reference Number,EU Regulation Identifier,EU Sanction Programme,EU Name
0,13,EU.27.28,1210/2003 (OJ L169),IRQ,"Al, Hussein, Abou, Ali, Tikriti, Saddam, Abu"
1,20,EU.39.56,1210/2003 (OJ L169),IRQ,"Al, Hussein, Qoussai, Tikriti, Saddam, Qusay"
2,23,EU.16.62,1210/2003 (OJ L169),IRQ,"Al, Oudai, Hussein, Tikriti, Uday, Saddam"
3,25,EU.11.43,1210/2003 (OJ L169),IRQ,"Al, Abdel, Hamid, Bid, Mahmud, Abed, Abid, Hammoud, Tikriti, Hammud, Mahmoud"
4,29,EU.22.9,1210/2003 (OJ L169),IRQ,"Al, Ali, Hassan, Tikriti, Kimawi, Majid"


In [7]:
df_uk.head()

Unnamed: 0,UK ID,UK Sanction Programme,UK Name
0,6894,ISIL (Da'esh) and Al-Qaida,"Fihiruddin, A, Iqbal, Rahman, Muqti, Abdurrahman, Fikiruddin, Mohamad, Jibril, Abdul, Abu"
1,6895,Afghanistan,"Hai, Hazem, Abdul, Qader"
2,6897,ISIL (Da'esh) and Al-Qaida,"Agha, Abd, Man, Manan, Abdul, Am, Al, Saiyid"
3,6899,ISIL (Da'esh) and Al-Qaida,"Abdallah, Salah, Shihata, Thirwat, Shahata, Ali, Tarwat, Tharwat"
4,6901,ISIL (Da'esh) and Al-Qaida,"Abdul, Chaudhry, Majeed, Majid"


# 4. Tokenization

In [8]:
def tokenize_name(full_name):
    """
    Tokenizes a comma-separated name string into two sets: name tokens & initial letter tokens.

    Args: 
        full_name (str): string of names separated by commas ('John, A, Smith')

    Returns:
        pd.Series: pandas series containing:
            - set of name tokens {'John','A','Smith'}
            - set of first letters {'J', 'A', 'S'}
            
    Notes:
        This function returns a series for compatibility with .apply()
    """

    name_split=full_name.split(', ')
    first_letters=[word[0] for word in name_split if word]   #skip empty tokens to avoid an IndexError on word[0]
    
    name_tokens=set(name_split)
    letter_tokens=set(first_letters)

    return pd.Series([name_tokens, letter_tokens])   

In [9]:
df_uk[['UK Name','UK Letters']]=df_uk['UK Name'].apply(tokenize_name)
df_eu[['EU Name','EU Letters']]=df_eu['EU Name'].apply(tokenize_name)

# 5. Name Matching

### 5.1 Candidate Selection

The goal of this experiment is to efficiently find the best match in the EU list for each entry in the UK list. However, because both lists contain thousands of names, a brute-force approach (comparing every UK entry with every EU entry) quickly becomes computationally infeasible.

To address this, we introduced a pre-selection step designed to narrow down the EU candidates for each UK entry. The core idea is based on comparing the first letters of each token in a name. The assumption is that if the sets of initial letters are very different, the names are unlikely to refer to the same individual.

For example:

- 'James Edward Clarke' → {'J','E','C'}

- 'James Eric Clarke' → {'J','E','C'}

- 'Michael David Thompson' → {'M','D','T'}

In this case, {'J','E','C'} and {'J','E','C'} are much more likely to correspond to the same person than {'M','D','T'} and {'J','E','C'}.

Having established that the first letters can serve as an effective filter, we then apply the Jaccard similarity measure to assess how well the sets align.

#### **Jaccard Similarity**

Jaccard similarity is a simple metric used to compare two sets. It is defined as the size of the intersection divided by the size of the union, and its values range from **0** (no overlap) to **1** (identical sets).

$$
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
$$
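
To make the formula concrete, here is a small worked sketch using the example names from above. The helper names `jaccard` and `initials`, the extra 'James Clarke' variant, and the convention of returning 0.0 for two empty sets are assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; returns 0.0 when both are empty (assumed convention)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def initials(name):
    """First letter of each whitespace-separated token."""
    return {token[0] for token in name.split()}

james_edward = initials("James Edward Clarke")   # {'J', 'E', 'C'}
james_eric = initials("James Eric Clarke")       # {'J', 'E', 'C'}
michael = initials("Michael David Thompson")     # {'M', 'D', 'T'}
james_only = initials("James Clarke")            # hypothetical shorter variant: {'J', 'C'}

print(jaccard(james_edward, james_eric))  # 1.0 (identical letter sets)
print(jaccard(james_edward, michael))     # 0.0 (no shared letters)
print(jaccard(james_edward, james_only))  # 2 shared / 3 in the union, about 0.667
```

With a threshold of 0.5, the first and third pairs would be kept as candidates and the second would be discarded.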


In [10]:
def get_jaccard_similarity(set_1, set_2):
    """
    Computes the jaccard similarity between 2 sets of tokens 

    Args: 
        set_1 (set): first set of tokens 
        set_2 (set): second set of tokens

    Returns:
        float: jaccard similarity value between 0 and 1
    """
  
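    intersection=set_1 & set_2
    union=set_1|set_2

    #two empty sets have nothing in common to compare; return 0 instead of raising ZeroDivisionError
    if not union:
        return 0.0

    return len(intersection)/len(union)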


In [11]:
def get_eu_candidate_ids(uk_letters, df_eu, threshold=0.5):
    """
    Finds EU IDs with first letters similar to UK letters based on Jaccard similarity.

    Args: 
        uk_letters (set): Set of first letters from the UK name
        df_eu (pd.DataFrame) : EU dataframe that includes 'EU Letters' and 'EU ID' columns
        threshold (float): Minimum jaccard score required to consider candidate (it was
        purposely left low to avoid discarding potentially good candidates)

    Returns:
        list: EU IDs with jaccard similarity at or above threshold
    """
    candidate_ids=[]
    
    for _, row in df_eu.iterrows():
        
        score=get_jaccard_similarity(uk_letters, row['EU Letters'])
        
        if score >= threshold:
            candidate_ids.append(row['EU ID'])
    
    return candidate_ids


In [12]:
#create new columns with info from potential candidates
df_uk['Candidate EU IDs']=df_uk['UK Letters'].apply(get_eu_candidate_ids,df_eu=df_eu)
df_uk['Candidate Count']=df_uk['Candidate EU IDs'].apply(len)

### 5.2 Matching Algorithm 

The matching algorithm was inspired by traditional fuzzy matching but adapted for sets of tokens rather than plain strings. In a traditional approach, strings like 'Joan' and 'Johanne' are compared character by character to produce a similarity score. Here, we build on this concept by comparing names as sets of tokens and assessing their match quality using three measures: Token Overlap, Token Similarity, and Coverage Ratio.


#### *Token Overlap*

We first consider how many tokens match between the two sets of names, as this gives a quick indication of how closely the names relate. For example:

- `['Maria', 'Elena', 'Garcia']` vs. `['Maria', 'Elena', 'Garcia', 'Lopez']`: **3 common tokens** (higher confidence)  
- `['Maria', 'Elena', 'Garcia']` vs. `['Ana','Maria', 'Garcia']`: **2 common tokens** 
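
The counts above can be reproduced with a plain set intersection. This is only an exact-match sketch with a hypothetical helper name (`common_tokens`); the notebook's `get_token_overlap` additionally accepts near matches through fuzzy scoring.

```python
def common_tokens(tokens_a, tokens_b):
    """Tokens that appear in both names (exact matches only)."""
    return set(tokens_a) & set(tokens_b)

base = ['Maria', 'Elena', 'Garcia']

print(len(common_tokens(base, ['Maria', 'Elena', 'Garcia', 'Lopez'])))  # 3
print(len(common_tokens(base, ['Ana', 'Maria', 'Garcia'])))             # 2
```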


#### *Token Similarity*

Each overlapping token is then compared character‑by‑character using a fuzzy similarity score, with adjustments made to account for the length of the tokens. Longer names with slight differences generally carry more significance. For example:

- `['Joan'] vs. ['Joana']`: fuzzy score ≈ 89  
- `['Gennadievich'] vs. ['Gennadyevich']`: fuzzy score ≈ 92 (**higher confidence**) 

Although both pairs have similarly high fuzzy scores, the longer names are more indicative of referring to the same person.
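
The sketch below illustrates how a length adjustment separates these two pairs. It assumes the score is a normalized Indel similarity, which is how rapidfuzz documents `fuzz.ratio`; `indel_ratio` is a dependency-free stand-in written for this example, not the library call. `length_weight` is a hypothetical helper that copies the length tiers used later in `get_multi_score`.

```python
def indel_ratio(a, b):
    """Normalized Indel similarity (0-100): 100 * 2 * LCS(a, b) / (len(a) + len(b))."""
    if not a and not b:
        return 100.0  # two empty strings are treated as identical here
    # Dynamic-programming longest common subsequence, one row at a time
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return 100.0 * 2 * prev[-1] / (len(a) + len(b))

def length_weight(length):
    """Multiplier per token length, mirroring the tiers in get_multi_score."""
    if length <= 3:
        return 0.87
    if length <= 5:
        return 0.95
    if length <= 8:
        return 1.0
    if length <= 10:
        return 1.03
    return 1.05

for uk_token, eu_token in [("Joan", "Joana"), ("Gennadievich", "Gennadyevich")]:
    raw = indel_ratio(uk_token, eu_token)
    adjusted = min(raw * length_weight(len(eu_token)), 100)
    print(f"{uk_token} vs {eu_token}: raw={raw:.2f}, adjusted={adjusted:.2f}")
# Joan vs Joana: raw=88.89, adjusted=84.44
# Gennadievich vs Gennadyevich: raw=91.67, adjusted=96.25
```

The raw scores differ by under three points, but after the length adjustment the gap widens to almost twelve, reflecting the higher confidence placed in the longer token.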

#### *Coverage Ratio*

Finally, the number and quality of matching tokens aren’t enough on their own; it’s also important to consider the proportion of the candidate name that is matched. This helps assess how much of the name is represented in the match. For example:

- `['Maria', 'Elena', 'Garcia']` vs. `['Maria', 'Elena', 'Garcia', 'Lopez']`: 3/4 tokens (75%) (**higher confidence**)  
- `['Maria', 'Elena', 'Garcia']` vs. `['Maria', 'Elena', 'Garcia', 'Lopez', 'Sanchez', 'Perez']`: 3/6 tokens (50%) 
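
As a rough sketch of how such a proportion can be computed, the snippet below follows the rule used later in `get_coverage_ratio`: a token with a fuzzy score of 85 or more counts as a full match, a weaker one as half a match. The helper name `coverage_ratio` and its parameters are hypothetical, and the scores are illustrative values rather than results from the datasets.

```python
def coverage_ratio(n_unique_tokens, overlap_scores, full_credit=85):
    """Share of a name's unique tokens that are matched, with half credit for weaker matches."""
    credit = sum(1.0 if score >= full_credit else 0.5 for score in overlap_scores)
    return min(credit / n_unique_tokens, 1.0)

# Three strong matches out of four unique tokens
print(coverage_ratio(4, [100, 100, 100]))  # 0.75

# Same count, but one match is only approximate (score below 85)
print(coverage_ratio(4, [100, 100, 80]))   # 0.625
```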


The measures described above (Token Overlap, Token Similarity, and Coverage Ratio) were combined into a comprehensive match score. Additional heuristics were applied to refine results and ensure reliable matching. For each name in the UK list, this score was calculated against every pre-selected EU candidate, and the highest-scoring name was matched.


### 5.3 Implementation

- **Preparation**: The initial functions (`get_token_overlap`, `get_deduplicated_tokens`, and `get_coverage_ratio`) work together to analyze and quantify the relationship between UK and EU name tokens. 

- **Scoring**: These prepared metrics then serve as inputs to `get_multi_score`, which synthesizes them into a comprehensive match score. This score reflects the overall similarity and confidence of the candidate match.

- **Iteration**: The main algorithm iterates over each UK name and its pre-selected EU candidates, invoking these functions in sequence to calculate scores and select the best match.

In [13]:
def get_token_overlap(uk_name, eu_name, threshold=75):
    """
    Identifies similar tokens between UK and EU names using fuzzy matching.
    In the case of near-duplicates (multiple EU tokens that match a UK token),
    only the highest-scoring match is kept.

    Args:
        uk_name (set): Set of tokens representing a name from UK list.
        eu_name (set): Set of tokens representing a name from EU list.
        threshold (float): Minimum fuzzy match score required for tokens
                           to be considered a match.

    Returns:
        overlap_tokens (list): EU tokens similar to UK tokens.
        overlap_scores (list): Fuzzy match scores for the overlapping tokens.
        overlap_lengths (list): Lengths of the overlapping tokens.

    Notes:
        This function uses `process.extractOne` from the rapidfuzz library.
    """

    raw_matches=[]   
    for name in uk_name:

        #find the best eu name match for current uk name
        best_match=process.extractOne(name,list(eu_name),scorer=fuzz.ratio)

        if not best_match:
            continue

        name_match, score, _=best_match

        #keep high matches
        if score>=threshold:
            raw_matches.append((name_match, score))

    #sort matches by descending similarity score to prioritize stronger matches
    sorted_matches=sorted(raw_matches, key=lambda x: x[1],reverse=True)
    
    deduped_matches=[]
    for token, score in sorted_matches:
        
        #append only if no near-duplicate match already exists
        if any(fuzz.ratio(token, existing_token)>=threshold for existing_token, _,_ in deduped_matches):
            continue
            
        deduped_matches.append((token, score,len(token)))
   

    overlap_tokens=[t for t, _, _ in deduped_matches]
    overlap_scores=[s for _, s, _ in deduped_matches]
    overlap_lengths=[l for _,_, l in deduped_matches]
    

    return overlap_tokens, overlap_scores, overlap_lengths 

In [14]:
def get_deduplicated_tokens(uk_names, threshold=70):
    """   
    Removes near‑duplicate tokens from a set of UK name tokens using 
    fuzzy matching. This is especially useful for names with multiple 
    spelling variations (e.g., "Mohamed" vs "Mohammed").

    Args:
        uk_names (set of str): Set of UK name tokens representing a name.
        threshold (float): Minimum fuzzy similarity score required 
                           for two tokens to be considered duplicates.

    Returns:
        list of str: List of unique, de‑duplicated tokens.
    """
    
    unique_uk_names=[]
    for token in uk_names:

        is_unique=True        
        for seen in unique_uk_names:

            if ratio(token, seen)>=threshold:
                
                is_unique=False
                break
                
        if is_unique:
            unique_uk_names.append(token)
            
    return unique_uk_names

In [15]:
def get_coverage_ratio(unique_uk_tokens,overlap_scores):
    """
    Calculates the proportion of a UK name that is matched by an EU candidate.
    Each matching token contributes to the coverage:
      - Tokens with a fuzzy score >= 85 count as a full match.
      - Tokens with a fuzzy score < 85 count as a half match.

    Args:
        unique_uk_tokens (list): List of unique, de‑duplicated tokens in the UK name.
        overlap_scores (list): Fuzzy match scores for tokens matched in the EU candidate.

    Returns:
        float: Coverage ratio between 0 and 1.

    Notes:
        This approach captures partial matches, making the measure more robust 
        when dealing with slight spelling variations.
    """
    
    
    partial_scores=[]
    for score in overlap_scores:
        if score>=85:
            partial_scores.append(1.0)
        else:
            partial_scores.append(0.5)
            
    num=sum(partial_scores)
    den=len(unique_uk_tokens)
    coverage_ratio=num/den
    coverage_ratio=min(coverage_ratio,1)
    
    return coverage_ratio

While the multi‑score is built from overlap, similarity, and coverage, its final form also incorporates a set of heuristic adjustments. Through iteration and inspection of results, we introduced thresholds and weightings based on token length and match counts. These were shaped through manual tuning to establish a more consistent and meaningful relationship between match quality and its final score.

In [16]:
def get_multi_score(overlap_scores, overlap_lengths, overlap_tokens, cvg_ratio, threshold=75):
    """
    Computes a final match score by combining token similarity, overlap, and coverage ratio,
    with heuristic adjustments based on manual tuning.

    Args:
        overlap_scores (list): Fuzzy match scores of overlapped tokens.
        overlap_lengths (list): Corresponding lengths of overlapped tokens.
        overlap_tokens (list): List of overlapped tokens.
        cvg_ratio (float): Coverage ratio of matched tokens.
        threshold (float): Minimum adjusted score required for a token to be considered a match.

    Returns:
        multi_score (int): Final match score (0–100).
        len_adjusted_scores (list): Length-adjusted token scores that meet the threshold.
        weighted_score (float): Score calculated from both the average length-adjusted score and the coverage ratio.

    Notes:
        Heuristic tiers are applied to balance precision and recall, yielding a more
        gradual and realistic final score. 
    """
    
    len_adjusted_scores=[]
    qualified_tokens=set()

    #short names tend to be less informative, while longer ones often offer higher matching value
    for score, length, token in zip(overlap_scores, overlap_lengths, overlap_tokens):
        if length<=3:  
            adjusted=score * 0.87
        elif 4<=length<=5:  
            adjusted=score * 0.95
        elif 6<=length<=8:  
            adjusted=score * 1.0
        elif 9<=length<=10: 
            adjusted=score * 1.03
        else: 
            adjusted=score * 1.05

        adjusted=min(adjusted, 100)
        

        if adjusted>=threshold:
            qualified_tokens.add(token)
            len_adjusted_scores.append(adjusted)
   
        
    if len_adjusted_scores:
        len_adj_avg_score=sum(len_adjusted_scores)/ len(len_adjusted_scores)
    else:
        len_adj_avg_score=0
        
       
    #combines quality of token similarity with coverage ratio 
    alpha=0.75    
    weighted_score=alpha*len_adj_avg_score + (1-alpha)*(cvg_ratio*100)


    #adjust final score based on the number of overlapped tokens:
    #fewer tokens cap the score, and more tokens set a higher floor
    overlap_count=len(qualified_tokens)    
    if overlap_count==1:
        multi_score=min(weighted_score, 50)
    elif overlap_count==2:
        multi_score=min(weighted_score, 85)
    elif 3<=overlap_count<=4:
        multi_score=weighted_score  
    elif 5<=overlap_count<=6:
        multi_score=max(weighted_score, 70) 
    elif overlap_count>=7:
        multi_score=max(weighted_score, 75) 
    else:
        multi_score=weighted_score

    multi_score=int(round(min(multi_score, 100), 0))

    return multi_score, len_adjusted_scores, weighted_score  
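
As a worked check of the arithmetic in `get_multi_score`, the sketch below recomputes the weighted score and multi-score by hand from the length-adjusted scores and coverage ratio reported for UK ID 6895 in the sample output of Section 7. The helper `apply_overlap_tier` is a hypothetical restatement of the cap and floor rules above, written only for this example.

```python
def apply_overlap_tier(weighted_score, overlap_count):
    """Cap or floor the weighted score according to the number of qualified tokens."""
    if overlap_count == 1:
        return min(weighted_score, 50)
    if overlap_count == 2:
        return min(weighted_score, 85)
    if 5 <= overlap_count <= 6:
        return max(weighted_score, 70)
    if overlap_count >= 7:
        return max(weighted_score, 75)
    return weighted_score  # 0, 3 or 4 qualified tokens: score is left unchanged

# Values taken from the sample output row for UK ID 6895
len_adjusted_scores = [95.0, 95.0, 87.0, 95.0]
cvg_ratio = 1.0
alpha = 0.75

avg_score = sum(len_adjusted_scores) / len(len_adjusted_scores)    # 93.0
weighted_score = alpha * avg_score + (1 - alpha) * (cvg_ratio * 100)  # 69.75 + 25.0 = 94.75
tiered = apply_overlap_tier(weighted_score, len(len_adjusted_scores))
multi_score = int(round(min(tiered, 100)))

print(weighted_score, multi_score)  # 94.75 95
```

Both values agree with the `Weighted Score` and `Multi Score` columns for that row. By contrast, a single perfectly matching token is capped: `apply_overlap_tier(100, 1)` returns 50.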


In [17]:
"""
Main matching loop: compare UK names against candidate EU names.

Each UK name is evaluated against its shortlist of EU candidates.
For each candidate, we compute the metrics needed for the multi‑score
using the functions defined above. The details of the best-scoring
candidate are then stored in the UK results dataframe:

- Final multi‑score 
- Matched EU name 
- List of overlapped tokens 
- Matched EU ID 
- Final weighted score 
- List of length‑adjusted scores 
- Final coverage ratio 
- List of raw overlap scores

"""

matched_multi_score=[]
matched_eu_name=[]
matched_overlap_name=[]
matched_eu_id=[]
matched_weighted_score=[]
matched_len_adjusted_scores=[]
matched_cvg_ratio=[]
matched_raw_scores=[]


eu_id_to_name=dict(zip(df_eu['EU ID'], df_eu['EU Name']))

for uk_index, uk_row in df_uk.iterrows():  
    
    uk_tokens=uk_row['UK Name']
    candidate_ids=uk_row['Candidate EU IDs']  

    best_score=0
    best_eu_name=None
    best_overlap_name=None
    best_eu_id=None
    best_weighted_score=0
    best_len_adjusted_scores=None
    best_cvg_ratio=0
    best_raw_scores=None
    

    #compare with shortlisted EU candidates 
    for eu_id in candidate_ids:
        
        eu_tokens=eu_id_to_name.get(eu_id)
        if not eu_tokens:
            continue

        #get the relevant info to compute the multi score
        overlap_tokens, overlap_scores, overlap_lengths=get_token_overlap(uk_tokens, eu_tokens)
        unique_uk_names=get_deduplicated_tokens(uk_tokens)
        cvg_ratio=get_coverage_ratio(unique_uk_names,overlap_scores)

        #compute multi-score
        multi_score, len_adjusted_scores, weighted_score=get_multi_score(
            overlap_scores, 
            overlap_lengths,
            overlap_tokens,
            cvg_ratio
        )
    
        #update best match if current candidate scores higher
        if multi_score>best_score:
            best_score=multi_score
            best_eu_name=eu_tokens
            best_overlap_name=overlap_tokens
            best_eu_id=eu_id
            
            best_weighted_score=weighted_score
            best_len_adjusted_scores=len_adjusted_scores
            best_cvg_ratio=cvg_ratio
            best_raw_scores=overlap_scores

    #store results of the best match
    matched_multi_score.append(best_score)
    matched_eu_name.append(best_eu_name)
    matched_overlap_name.append(best_overlap_name)
    matched_eu_id.append(best_eu_id)
    matched_weighted_score.append(best_weighted_score)
    matched_len_adjusted_scores.append(best_len_adjusted_scores)
    matched_cvg_ratio.append(best_cvg_ratio)
    matched_raw_scores.append(best_raw_scores)

#store new info on dataframe
df_uk['Name Overlap']=matched_overlap_name
df_uk['EU Matched Name']=matched_eu_name
df_uk['EU Matched ID']=matched_eu_id

df_uk['Multi Score']=matched_multi_score
df_uk['Coverage Ratio']=matched_cvg_ratio
df_uk['Length Adjusted Scores']=matched_len_adjusted_scores

df_uk['Weighted Score']=matched_weighted_score
#df_uk['Length Adjusted Average Score']=matched_len_adj_avg_score
df_uk['Raw Scores']=matched_raw_scores

# 6. Labeling Results

Financial institutions must reliably flag sanctioned individuals to comply with financial sanctions regulations. Because wrongly flagging an innocent individual carries serious consequences, human review remains necessary to ensure accuracy. To balance these needs, our strategy aims to minimize manual intervention by labeling each matched pair as `match`, `not match`, or `preliminary match`. Confident outcomes, either strong matches or clear non-matches, are assigned `match` and `not match`, respectively. Ambiguous cases fall under `preliminary match`, a buffer category reserved for human review. The labeling thresholds (below 74 for `not match`, 88 or above for `match`) were determined empirically by inspecting the distribution of match scores and identifying natural cutoffs.

In [18]:
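def add_label(score):
    """Maps a multi-score (0–100) to 'match', 'not match', or 'preliminary match'."""

    #scores below 74 are clear non-matches
    if score<74:
        label='not match'
    #scores of 88 or above are confident matches
    elif score>=88:
        label='match'
    #everything in between is left for human review
    else:
        label='preliminary match'
    return label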

df_uk['Label']=df_uk['Multi Score'].apply(add_label)
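
The boundary behavior of these thresholds is easy to get wrong, so here is a small standalone sketch that checks the scores on either side of each cutoff. `label_for` is a hypothetical parameterized restatement of `add_label`, with the cutoffs exposed as arguments purely for illustration.

```python
def label_for(score, lower=74, upper=88):
    """Label a multi-score: below `lower` is a non-match, `upper` or above is a match."""
    if score < lower:
        return 'not match'
    if score >= upper:
        return 'match'
    return 'preliminary match'

for score in (73, 74, 87, 88):
    print(score, label_for(score))
# 73 not match
# 74 preliminary match
# 87 preliminary match
# 88 match
```

Scores from 74 to 87 inclusive therefore form the band that is routed to human review.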

# 7. Output

### 7.1 For further analysis

In [27]:
df_uk.head()

Unnamed: 0,UK ID,UK Sanction Programme,UK Name,UK Letters,Candidate EU IDs,Candidate Count,Name Overlap,EU Matched Name,EU Matched ID,Multi Score,Coverage Ratio,Length Adjusted Scores,Weighted Score,Raw Scores,Label
0,6894,ISIL (Da'esh) and Al-Qaida,"{Fikiruddin, Rahman, Mohamad, A, Fihiruddin, Abdul, Jibril, Iqbal, Muqti, Abu, Abdurrahman}","{I, A, J, R, F, M}","[630, 643, 1004, 3140, 4686, 5240, 5262, 5271, 5623, 6133, 6211, 6478, 6494, 6830, 6974, 7250, 113355, 115714, 117974, 123615, 125562, 126101, 127538, 129864, 130225, 133935, 134828, 135060, 136530, 136975, 138176, 145803, 146560, 147032, 150914, 159126, 162479, 162975, 165652, 166799, 167049, 167477, 171171, 172374, 172394]",45,"[Fikiruddin, Rahman, Mohamad, A, Abdul, Jibril, Iqbal, Muqti, Abdurrahman]","{Fikiruddin, Rahman, Mohamad, Fihiruddin, A, Abdul, Jibril, Iqbal, Muqti, Abu, Abdurrahman}",1004,98,1.0,"[100, 100.0, 100.0, 87.0, 95.0, 100.0, 95.0, 95.0, 100]",97.67,"[100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0]",match
1,6895,Afghanistan,"{Hazem, Abdul, Hai, Qader}","{H, A, Q}","[20, 505, 590, 591, 595, 603, 651, 661, 706, 709, 829, 842, 2193, 5270, 5416, 6078, 6093, 6095, 6113, 6130, 6207, 6224, 6240, 6312, 6583, 6616, 6694, 6873, 6887, 6898, 7085, 7146, 7368, 7386, 7447, 7501, 7513, 7586, 105305, 106138, 107163, 109900, 110138, 113244, 115145, 119228, 124306, 126530, 126538, 126554, 127417, 128186, 141403, 141872, 144861, 145430, 145486, 145521, 145691, 146428, 148293, 149178, 149390, 150680, 151879, 151903, 152704, 153011, 154125, 165849, 166346, 170630]",72,"[Hazem, Abdul, Hai, Qader]","{Hazem, Abdul, Hai, Qader}",505,95,1.0,"[95.0, 95.0, 87.0, 95.0]",94.75,"[100.0, 100.0, 100.0, 100.0]",match
2,6897,ISIL (Da'esh) and Al-Qaida,"{Saiyid, Al, Am, Abd, Agha, Abdul, Man, Manan}","{S, A, M}","[54, 58, 76, 83, 103, 136, 143, 154, 156, 157, 176, 508, 514, 515, 516, 517, 522, 524, 526, 528, 545, 548, 553, 556, 581, 593, 595, 599, 600, 603, 604, 641, 643, 644, 656, 659, 661, 676, 696, 727, 733, 739, 758, 760, 765, 779, 781, 796, 826, 840, 931, 965, 1064, 1065, 1069, 1092, 1102, 1924, 2193, 2208, 2700, 3144, 3225, 3341, 3361, 3663, 3741, 3793, 3862, 4142, 5268, 5271, 5279, 5294, 5416, 5417, 5499, 5616, 5619, 5623, 5793, 5804, 6084, 6095, 6101, 6113, 6114, 6116, 6130, 6133, 6206, 6211, 6223, 6228, 6230, 6231, 6238, 6303, 6305, 6309, ...]",610,"[Saiyid, Al, Am, Abd, Agha, Man]","{Saiyid, Al, Ag, Am, Abd, Agha, Abdul, Lmnn, Man, Bd, Manan}",514,93,1.0,"[100.0, 87.0, 87.0, 87.0, 95.0, 87.0]",92.88,"[100.0, 100.0, 100.0, 100.0, 100.0, 100.0]",match
3,6899,ISIL (Da'esh) and Al-Qaida,"{Abdallah, Ali, Salah, Tharwat, Thirwat, Shahata, Shihata, Tarwat}","{T, A, S}","[13, 20, 23, 54, 58, 67, 76, 83, 87, 98, 157, 507, 515, 516, 553, 604, 727, 733, 758, 765, 779, 786, 789, 796, 826, 829, 840, 1880, 1883, 1886, 1888, 1892, 1896, 1924, 2193, 2208, 2921, 3080, 3085, 3225, 3341, 3862, 5279, 5357, 5610, 5793, 6101, 6113, 6114, 6116, 6231, 6306, 6496, 6584, 6619, 6625, 6695, 6696, 6916, 6917, 6973, 6982, 7027, 7069, 7077, 7094, 7137, 7166, 7294, 7300, 7336, 7343, 7359, 7361, 7406, 7434, 7483, 7492, 7496, 7504, 7524, 7556, 105424, 106138, 106544, 106548, 107009, 110103, 110164, 112198, 113224, 113334, 113787, 113926, 117506, 118875, 119026, 119200, 119561, 119633, ...]",339,"[Abdallah, Ali, Salah, Tharwat, Shahata]","{Abdallah, Ali, Salah, Tharwat, Thirwat, Shahata, Shihata, Tarwat}",796,97,1.0,"[100.0, 87.0, 95.0, 100.0, 100.0]",97.3,"[100.0, 100.0, 100.0, 100.0, 100.0]",match
4,6901,ISIL (Da'esh) and Al-Qaida,"{Abdul, Majeed, Majid, Chaudhry}","{C, A, M}","[83, 143, 154, 515, 517, 522, 526, 545, 548, 581, 603, 641, 643, 659, 661, 676, 696, 727, 1064, 1086, 1092, 1102, 3185, 3190, 3793, 5271, 5416, 5420, 5499, 5522, 5553, 5555, 5623, 6095, 6130, 6133, 6206, 6211, 6230, 6238, 6303, 6309, 6485, 6506, 6569, 6615, 6616, 6617, 6652, 6695, 6830, 6831, 6908, 6944, 6972, 6974, 6982, 7035, 7069, 7166, 7168, 7206, 7237, 7285, 7307, 7367, 7388, 7446, 7456, 7472, 7483, 7496, 7499, 7508, 105631, 106554, 106732, 109887, 112198, 113282, 115829, 117360, 118362, 118875, 119561, 119637, 121034, 121036, 123615, 124966, 124980, 125453, 125550, 126101, 126106, 126554, 127531, 127538, 128186, 128241, ...]",254,"[Abdul, Majeed, Majid, Chaudhry]","{Abdul, Majeed, Majid, Chaudhry}",641,98,1.0,"[95.0, 100.0, 95.0, 100.0]",98.12,"[100.0, 100.0, 100.0, 100.0]",match


In [26]:
df_uk.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3608 entries, 0 to 3607
Data columns (total 15 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   UK ID                   3608 non-null   int64  
 1   UK Sanction Programme   3608 non-null   object 
 2   UK Name                 3608 non-null   object 
 3   UK Letters              3608 non-null   object 
 4   Candidate EU IDs        3608 non-null   object 
 5   Candidate Count         3608 non-null   int64  
 6   Name Overlap            3532 non-null   object 
 7   EU Matched Name         3532 non-null   object 
 8   EU Matched ID           3532 non-null   Int64  
 9   Multi Score             3608 non-null   int64  
 10  Coverage Ratio          3608 non-null   float64
 11  Length Adjusted Scores  3532 non-null   object 
 12  Weighted Score          3608 non-null   float64
 13  Raw Scores              3532 non-null   object 
 14  Label                   3608 non-null   

In [25]:
#just some final tweaks for presentation and consistency 
df_uk['EU Matched ID']=df_uk['EU Matched ID'].astype('Int64')
df_uk['Weighted Score']=df_uk['Weighted Score'].round(2)

In [28]:
df_uk.to_pickle(processed_dir/'matched_grouped.pkl')

### 7.2 For final output 

In [29]:
#removing columns with middle stage metrics
df_output=df_uk.copy()
columns_to_drop=['UK Letters','Candidate EU IDs','Candidate Count','Name Overlap','Coverage Ratio','Length Adjusted Scores','Weighted Score','Raw Scores']
df_output=df_output.drop(columns=columns_to_drop)

In [30]:
df_output.head()

Unnamed: 0,UK ID,UK Sanction Programme,UK Name,EU Matched Name,EU Matched ID,Multi Score,Label
0,6894,ISIL (Da'esh) and Al-Qaida,"{Fikiruddin, Rahman, Mohamad, A, Fihiruddin, Abdul, Jibril, Iqbal, Muqti, Abu, Abdurrahman}","{Fikiruddin, Rahman, Mohamad, Fihiruddin, A, Abdul, Jibril, Iqbal, Muqti, Abu, Abdurrahman}",1004,98,match
1,6895,Afghanistan,"{Hazem, Abdul, Hai, Qader}","{Hazem, Abdul, Hai, Qader}",505,95,match
2,6897,ISIL (Da'esh) and Al-Qaida,"{Saiyid, Al, Am, Abd, Agha, Abdul, Man, Manan}","{Saiyid, Al, Ag, Am, Abd, Agha, Abdul, Lmnn, Man, Bd, Manan}",514,93,match
3,6899,ISIL (Da'esh) and Al-Qaida,"{Abdallah, Ali, Salah, Tharwat, Thirwat, Shahata, Shihata, Tarwat}","{Abdallah, Ali, Salah, Tharwat, Thirwat, Shahata, Shihata, Tarwat}",796,97,match
4,6901,ISIL (Da'esh) and Al-Qaida,"{Abdul, Majeed, Majid, Chaudhry}","{Abdul, Majeed, Majid, Chaudhry}",641,98,match


In [32]:
df_output.to_csv(final_dir/'matched_names_grouped_method.csv',index=False)