This notebook evaluates whether machine learning models can classify matched name pairs as `match`, `not match`, or `preliminary match` based on the features generated earlier. The goal is to assess the predictive power of these features and to explore data-informed tuning of the existing scoring logic through supervised learning. To this end, several classification models are compared across a series of experiments, and their performance is assessed against the previously used rule-based method (`multi_score`). The best-performing model is then used to predict labels on a separate holdout set to test how well it generalizes.

The notebook is structured as follows:

- Preprocessing

- Methodology & Rationale

- Experimental Setup & Results

- Discussion

- Label Prediction

- Feature Importance

In [1]:
#import necessary libraries
from pathlib import Path
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, precision_score, recall_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier


#commands for better readability 
pd.set_option('display.max_colwidth', None)  
pd.set_option('display.max_columns', None)   
pd.set_option('display.width', 2000)         
pd.set_option('display.max_rows', None)  
warnings.filterwarnings("ignore", category=UserWarning, module='pandas')  

In [2]:
#paths
project_dir=Path.cwd().parent.parent
processed_dir=project_dir/'data'/'processed'
final_dir=project_dir/'data'/'final'

full_file=processed_dir/'features_data.pkl'
labeled_file=processed_dir/'features_sample_labeled.pkl'

df_full=pd.read_pickle(full_file)  
df_label=pd.read_pickle(labeled_file)  

In [3]:
df_full.head()

Unnamed: 0,UK ID,UK Name,Name Overlap,EU Name Match,EU ID,uk_letters_count,candidate_count,uk_name_count,overlap_name_count,eu_name_match_count,multi_score,coverage_ratio,length_adj_avg_score,avg_raw_score
0,6894,"{A, Abdul, Jibril, Fikiruddin, Muqti, Iqbal, Abdurrahman, Fihiruddin, Abu, Mohamad, Rahman}","[Jibril, Abdurrahman, Abu, A, Rahman, Fihiruddin, Mohamad, Iqbal, Muqti]","{A, Abdul, Jibril, Fikiruddin, Muqti, Iqbal, Abdurrahman, Fihiruddin, Abu, Mohamad, Rahman}",1004,6,45,11,11,11,97,1.0,96.0,100.0
1,6895,"{Abdul, Qader, Hai, Hazem}","[Abdul, Qader, Hazem, Hai]","{Abdul, Qader, Hai, Hazem}",505,3,72,4,4,4,95,1.0,93.0,100.0
2,6897,"{Abdul, Manan, Man, Saiyid, Al, Agha, Am, Abd}","[Agha, Man, Al, Am, Abdul, Saiyid]","{Abdul, Manan, Man, Lmnn, Saiyid, Bd, Al, Ag, Agha, Am, Abd}",514,3,610,8,8,11,94,1.0,91.8,100.0
3,6899,"{Thirwat, Tharwat, Tarwat, Ali, Abdallah, Shihata, Shahata, Salah}","[Tarwat, Shahata, Abdallah, Ali, Salah]","{Thirwat, Tharwat, Tarwat, Ali, Abdallah, Shihata, Shahata, Salah}",796,3,339,8,8,8,97,1.0,96.4,100.0
4,6901,"{Abdul, Chaudhry, Majid, Majeed}","[Abdul, Majeed, Chaudhry, Majid]","{Abdul, Chaudhry, Majid, Majeed}",641,3,254,4,4,4,98,1.0,97.5,100.0


In [4]:
df_label.head()

Unnamed: 0,UK ID,UK Name,Name Overlap,EU Name Match,EU ID,uk_letters_count,candidate_count,uk_name_count,overlap_name_count,eu_name_match_count,multi_score,coverage_ratio,length_adj_avg_score,avg_raw_score,Label
0,6894,"{A, Abdul, Jibril, Fikiruddin, Muqti, Iqbal, Abdurrahman, Fihiruddin, Abu, Mohamad, Rahman}","[Jibril, Abdurrahman, Abu, A, Rahman, Fihiruddin, Mohamad, Iqbal, Muqti]","{A, Abdul, Jibril, Fikiruddin, Muqti, Iqbal, Abdurrahman, Fihiruddin, Abu, Mohamad, Rahman}",1004,6,45,11,11,11,97,1.0,96.0,100.0,match
1,6905,"{Abdul, Haji, Mullah, Sahib, Zakir, Akhund, Bari}","[Sahib, Bari, Akhund, Zakir, Mullah, Abdul, Haji]","{Abdul, Haji, Mullah, Sahib, Zakir, Akhund, Bari}",556,6,196,7,7,7,97,1.0,96.4,100.0,match
2,6912,"{Khadem, Rauf, Aliza, Abdul}","[Abdul, Aliza, Khadem, Rauf]","{Abdul, Aliza, Mullah, Khadem, Rauf}",719,3,248,4,4,5,97,1.0,96.2,100.0,match
3,6932,"{Takfiri, Umar, Samman, Ismail, Mahmoud, Abu, Al, Othman, Mohammed, Omar, Filistini, Qatada, Umr, Uthman}","[Umar, Takfiri, Filistini, Othman, Qatada, Ismail, Abu, Al, Samman, Mahmoud, Mohammed]","{Takfiri, Umar, Samman, Ismail, Mahmoud, Abu, Al, Othman, Mohammed, Omar, Filistini, Qatada, Umr, Uthman}",836,9,20,14,14,14,98,1.0,97.2,100.0,match
4,7024,"{Fathi, Mohamed, Al, Ben, Abdallah, Belkacem, Hannachi, Aouadi, Belgacem}","[Ben, Fathi, Belkacem, Al, Aouadi, Abdallah, Hannachi, Mohamed]","{Fathi, Mohamed, Al, Ben, Abdallah, Belkacem, Hannachi, Aouadi, Belgacem}",927,5,128,9,9,9,97,1.0,96.1,100.0,match


# 2. Preprocessing

We began by preparing the training and holdout sets. The labeled sample was used to train the models. The remainder of the full dataset, which does not overlap with the labeled sample, was set aside as a holdout set for later evaluation.

In [5]:
feature_cols=[
    'uk_letters_count',
    'candidate_count',
    'uk_name_count',
    'overlap_name_count',
    'eu_name_match_count',
    'multi_score',
    'coverage_ratio',
    'length_adj_avg_score',
    'avg_raw_score'
]

labelled_ids=df_label['UK ID']

df_holdout=df_full[~df_full['UK ID'].isin(labelled_ids)]

X=df_label[feature_cols]  
y=df_label['Label'] 
X_holdout=df_holdout[feature_cols]

# 3. Methodology & Rationale

### 3.1 Classifier 

- **Random Forest** was selected as the starting point. It is a robust and versatile classifier that performs well with minimal tuning and can handle complex patterns in the data. As a first model, it provided a solid baseline for comparison, delivering good results with relatively low effort.

- **Logistic Regression** was explored next to test whether a simpler, more interpretable model could offer similar performance. Both L1 and L2 regularization were tested against each other. Since the L1 model showed better results, it was optimized in a dedicated follow-up experiment.


- **XGBoost** was selected as the next step. This boosting algorithm is known for handling complex relationships in structured data while offering more flexibility, including built-in regularization to limit overfitting. It was therefore a natural progression to test whether it could outperform the previous models, especially after hyperparameter tuning.


### 3.2 Hyperparameter Tuning

- **GridSearchCV** was used to find the combination of hyperparameters that gave the best performance estimate. A custom scoring function guided the search by reflecting the task’s specific priorities, using a weighted combination of the metrics used for evaluation.



### 3.3 Validation Strategy

- **Train-Test Split** was initially used for quick evaluation. Although class balance was maintained using stratification, the results may have underestimated model performance due to the limited sample size (385) and the variability introduced by a single random split. This limitation motivated a shift to Stratified K-Fold Cross-Validation, which provides more stable and reliable performance estimates.


       
- **Stratified K-Fold Cross-Validation** was adopted to address those issues while still maintaining class balance. By generating multiple train-test splits and letting every data point serve once for validation and otherwise for training, it produces more stable and robust estimates.

    - 5 folds were chosen as a common compromise: each validation fold holds about 20% of the sample, which keeps per-class metrics meaningful, while each model still trains on about 80% of the data.
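
To make the stratification concrete, here is a minimal sketch on toy labels (the class counts 50/30/20 and the names `y_toy`, `X_toy`, `skf_toy` are made up for illustration and are not the notebook's data). It shows that every validation fold keeps the same class proportions and that each sample is used for validation exactly once.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels (hypothetical): 50 / 30 / 20 samples per class
y_toy = np.array(['match'] * 50 + ['not match'] * 30 + ['preliminary match'] * 20)
X_toy = np.zeros((len(y_toy), 1))  # feature values do not influence the split

skf_toy = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_counts = []   # class counts of each validation fold
validated = []     # indices that appeared in some validation fold
for train_idx, test_idx in skf_toy.split(X_toy, y_toy):
    fold_counts.append(Counter(y_toy[test_idx].tolist()))
    validated.extend(test_idx.tolist())

for i, counts in enumerate(fold_counts, start=1):
    print(f"fold {i}: {dict(counts)}")
```

With class counts that divide evenly by the number of folds, each validation fold receives the same number of samples per class; with uneven counts the folds differ by at most a sample or so per class.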
 
    

### 3.4 Evaluation Metrics

In this task, the goal is to classify names as not match, preliminary match, or match, with the emphasis on avoiding missed true matches while minimizing unnecessary manual review.

We focus on the following priorities:

 - **Precision** in “Match” and “Not Match” is our top priority. If the system labels something as a match, it should truly be a match. Likewise, if it says not match, it should be genuinely safe to ignore. These two classes represent confident decisions, so we want them to be as reliable as possible.

 - “Preliminary Match” acts as a safety net where uncertain cases go for human review. We do not directly optimize for this class, but we rely on it to catch true matches that the model is not confident enough to label as a match.

 - **Recall** in “Not Match” is important but secondary. It tells us how many of the actual “not match” cases are caught confidently by the model. Higher recall here means fewer names fall into the “preliminary match” category, which helps reduce manual workload. However, this is less critical than being certain about the predictions in the match and not match classes.



In summary, we want the model to be confident only when it is correct. Precision ensures that confidence is deserved, while recall helps reduce human burden. This approach ensures that true matches are not missed and that attention is focused where it is truly needed.
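
These priorities can be condensed into a single number. The weights below are the ones used by the custom scoring function later in the notebook (40% / 40% / 20%); the symbols $S$, $P_c$ and $R_c$ are notation introduced here for illustration only.

$$
S = 0.4\,P_{\text{match}} + 0.4\,P_{\text{not match}} + 0.2\,R_{\text{not match}},
\qquad
P_c = \frac{TP_c}{TP_c + FP_c},
\qquad
R_c = \frac{TP_c}{TP_c + FN_c}
$$

Here $TP_c$, $FP_c$ and $FN_c$ are the true positives, false positives and false negatives for class $c$. Since every term lies between 0 and 1 and the weights sum to 1, $S$ also lies between 0 and 1.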

# 4. Experimental Setup & Results

### 4.1  Experiment 1:
- **Classifier**: Random Forest
- **Hyperparameters**: Default
- **Validation Strategy**: Simple Train-test Split
    - split ratio: 80/20
    - stratified: yes (to maintain class balance)



In [6]:
#split the data into training and testing sets ensuring class balance using stratification
X_train_0, X_test_0, y_train_0, y_test_0=train_test_split(X, y, test_size=0.2, random_state=30, stratify=y)


#initialize and train model  
model_1=RandomForestClassifier(n_estimators=100, random_state=30)  #random_state=30 for reproducibility  
model_1.fit(X_train_0, y_train_0)

#get predictions on test set 
y_pred_1=model_1.predict(X_test_0)


In [7]:
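#define class labels (display order for the reports)
#caution: with string targets, scikit-learn sorts classes alphabetically and applies target_names by position;
#pass labels=class_labels as well if rows and names should follow this order
class_labels=['not match','preliminary match', 'match']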

#create confusion matrix
conf_matrix_1=confusion_matrix(y_test_0, y_pred_1)

#create classification report 
report_1=classification_report(y_test_0, y_pred_1, target_names=class_labels)

#extract summary metrics from report 
report_dict=classification_report(y_test_0, y_pred_1, target_names=class_labels,output_dict=True)
precision_match_1=round(report_dict['match']['precision'], 3)
precision_not_match_1=round(report_dict['not match']['precision'], 3)
recall_not_match_1=round(report_dict['not match']['recall'], 3)

#print results
print("Confusion Matrix:")
print(conf_matrix_1)
print("\nClassification Report:")
print(report_1)

print("\nClassification Report Summary:")
print(f"'Match' Precision:     {precision_match_1}")
print(f"'Not Match' Precision: {precision_not_match_1}")
print(f"'Not Match' Recall:    {recall_not_match_1}")




Confusion Matrix:
[[31  0  0]
 [ 0  7  1]
 [ 0  1 10]]

Classification Report:
                   precision    recall  f1-score   support

        not match       1.00      1.00      1.00        31
preliminary match       0.88      0.88      0.88         8
            match       0.91      0.91      0.91        11

         accuracy                           0.96        50
        macro avg       0.93      0.93      0.93        50
     weighted avg       0.96      0.96      0.96        50


Classification Report Summary:
'Match' Precision:     0.909
'Not Match' Precision: 1.0
'Not Match' Recall:    1.0
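
One detail that may be worth checking when reading the report above: for string targets, scikit-learn orders classes alphabetically (`match`, `not match`, `preliminary match`) unless `labels` is passed, and `target_names` is applied purely by position. The sketch below uses toy data (hypothetical, not the notebook's sample; `desired_order` and the other names are introduced only for this example) to show how the two arguments interact.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Toy data (hypothetical): 3 'match', 2 'not match', 1 'preliminary match'
y_true = ['match', 'match', 'match', 'not match', 'not match', 'preliminary match']
y_pred = ['match', 'match', 'match', 'not match', 'preliminary match', 'preliminary match']

desired_order = ['not match', 'preliminary match', 'match']

# Default: rows/columns follow the sorted labels (match, not match, preliminary match)
cm_default = confusion_matrix(y_true, y_pred)

# Explicit: rows/columns follow desired_order
cm_ordered = confusion_matrix(y_true, y_pred, labels=desired_order)

# target_names alone only renames positions of the sorted classes
report_positional = classification_report(
    y_true, y_pred, target_names=desired_order, output_dict=True
)

# labels + target_names together keep names and classes aligned
report_aligned = classification_report(
    y_true, y_pred, labels=desired_order, target_names=desired_order, output_dict=True
)

print(cm_default)
print(cm_ordered)
print("support reported under 'match' (positional):", report_positional['match']['support'])
print("support reported under 'match' (aligned):   ", report_aligned['match']['support'])
```

In the positional variant, the entry named `match` actually describes the third class in sorted order, which is why comparing the reported supports with the known class sizes is a quick sanity check.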


### 4.2 Experiment 2:
- **Classifier**: Random Forest  
- **Hyperparameters**: Default  
- **Validation Strategy**: Stratified k-Fold CV
   - number of folds: 5 (standard choice)
   - stratified: yes (to maintain class balance)



In [8]:
#initialize cross validator
skf=StratifiedKFold(n_splits=5, shuffle=True, random_state=30)

#initialize model 
model_2=RandomForestClassifier(n_estimators=100, random_state=30)

conf_matrices=[]
reports=[]

#for each k fold
for train_idx, test_idx in skf.split(X, y):
    
    #split data into train and test sets
    X_train, X_test=X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test=y.iloc[train_idx], y.iloc[test_idx]

    #train model on training set
    model_2.fit(X_train, y_train)  

    #predict target values for test set
    y_pred=model_2.predict(X_test)

    #store confusion matrix and classification report for this fold
    conf_matrices.append(confusion_matrix(y_test, y_pred))
    reports.append(classification_report(y_test, y_pred, target_names=class_labels, output_dict=True))


#aggregate confusion matrices across all folds
conf_matrix_2=np.sum(conf_matrices, axis=0)

#extract class-specific metrics from each fold
precision_match=[r['match']['precision'] for r in reports]
precision_not_match=[r['not match']['precision'] for r in reports]
recall_not_match=[r['not match']['recall'] for r in reports]

#average metrics across all folds
precision_match_2= round(np.mean(precision_match), 3)
precision_not_match_2= round(np.mean(precision_not_match), 3)
recall_not_match_2= round(np.mean(recall_not_match), 3)

precision_match_std_2=round(np.std(precision_match), 3)
precision_not_match_std_2=round(np.std(precision_not_match), 3)
recall_not_match_std_2=round(np.std(recall_not_match), 3)

#print results
print("Confusion Matrix:")
print(conf_matrix_2)

print("\nClassification Report Summary:")
print(f"'Match' Precision:     {precision_match_2}")
print(f"'Not Match' Precision: {precision_not_match_2}")
print(f"'Not Match' Recall:    {recall_not_match_2}")

print("\nClassification Report Summary (Mean ± Std):")
print(f"'Match' Precision:     {precision_match_2} ± {precision_match_std_2}")
print(f"'Not Match' Precision: {precision_not_match_2} ± {precision_not_match_std_2}")
print(f"'Not Match' Recall:    {recall_not_match_2} ± {recall_not_match_std_2}")

Confusion Matrix:
[[156   0   0]
 [  0  37   5]
 [  0   1  51]]

Classification Report Summary:
'Match' Precision:     0.914
'Not Match' Precision: 1.0
'Not Match' Recall:    1.0

Classification Report Summary (Mean ± Std):
'Match' Precision:     0.914 ± 0.053
'Not Match' Precision: 1.0 ± 0.0
'Not Match' Recall:    1.0 ± 0.0
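
The summary above averages the per-fold metrics. A complementary, pooled view can be read directly off the aggregated confusion matrix: precision is the diagonal divided by the column sums, recall is the diagonal divided by the row sums. The sketch below applies this to the matrix printed for Experiment 2; the names `pooled_precision` and `pooled_recall` are introduced here, and the class order is simply the row order of the printed matrix.

```python
import numpy as np

# Aggregated confusion matrix as printed for Experiment 2
# (rows: true class, columns: predicted class)
conf_matrix = np.array([[156, 0, 0],
                        [0, 37, 5],
                        [0, 1, 51]])

true_positives = np.diag(conf_matrix)

# Column sums count how often each class was predicted
pooled_precision = true_positives / conf_matrix.sum(axis=0)

# Row sums count how often each class actually occurred
pooled_recall = true_positives / conf_matrix.sum(axis=1)

print("pooled precision per class:", np.round(pooled_precision, 3))
print("pooled recall per class:   ", np.round(pooled_recall, 3))
```

Pooled values and fold averages can differ slightly, because the fold average gives every fold the same weight regardless of how many predictions of a class it contains.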


### 4.3 Experiment 3:
- **Classifier**: Random Forest  
- **Hyperparameters**: tuned with GridSearchCV + custom scoring function
   - number of trees: 50, 100, 200
   - tree depth: None, 10, 20
   - min samples to split: 2, 5
   - min samples per leaf: 1, 2
- **Validation Strategy**: Stratified k-Fold CV
    - number of folds: 5 



In [9]:
#define a custom scoring function 
def priority_metric(y_true, y_pred):

    #extract the relevant metrics
    report=classification_report(y_true, y_pred, target_names=class_labels, output_dict=True)    
    precision_match=report['match']['precision']
    precision_not_match=report['not match']['precision']
    recall_not_match=report['not match']['recall']
    
    #create weighted score based on the relative importance of each metric
    weighted_score=0.4*precision_match + 0.4*precision_not_match + 0.2*recall_not_match
    
    return weighted_score

#wrap the custom scoring function using sklearn's make_scorer for use in GridSearchCV
custom_scorer=make_scorer(priority_metric, greater_is_better=True)
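
The function above looks metrics up through the report's display names. A possible variant, sketched here under the assumption that looking classes up by their actual label value is preferred, uses `precision_score` and `recall_score` (both already imported) with an explicit `labels` argument. The names `priority_metric_by_label`, `label_scorer` and the toy data are hypothetical and not part of the original notebook.

```python
from sklearn.metrics import make_scorer, precision_score, recall_score


def priority_metric_by_label(y_true, y_pred, match_label='match', not_match_label='not match'):
    """Weighted score 0.4*P(match) + 0.4*P(not match) + 0.2*R(not match)."""
    # average=None returns one value per entry of `labels`, in that order
    precision_match, precision_not_match = precision_score(
        y_true, y_pred, labels=[match_label, not_match_label],
        average=None, zero_division=0,
    )
    recall_not_match = recall_score(
        y_true, y_pred, labels=[not_match_label],
        average=None, zero_division=0,
    )[0]
    return 0.4 * precision_match + 0.4 * precision_not_match + 0.2 * recall_not_match


# Scorer object that could be passed to GridSearchCV via scoring=...
label_scorer = make_scorer(priority_metric_by_label, greater_is_better=True)

# Toy check (hypothetical data)
y_true_toy = ['match', 'match', 'not match', 'not match', 'preliminary match']
y_pred_toy = ['match', 'preliminary match', 'not match', 'not match', 'match']
toy_score = priority_metric_by_label(y_true_toy, y_pred_toy)
print(round(toy_score, 3))
```

For integer-encoded targets the label values can be supplied as keyword arguments, for example `make_scorer(priority_metric_by_label, match_label=2, not_match_label=0)`. Setting `zero_division=0` scores a class that is never predicted as 0 instead of emitting a warning.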

In [10]:
#initialize base model
base_model_3=RandomForestClassifier(random_state=30)

#define standard hyperparameter grid
param_grid_3 = {
    'n_estimators':[50, 100, 200],
    'max_depth':[None, 10, 20],
    'min_samples_split':[2, 5],
    'min_samples_leaf':[1, 2]
}

#perform grid search with custom scoring function and cross-validation strategy
grid_search_3=GridSearchCV(base_model_3, param_grid_3, scoring=custom_scorer, cv=skf, n_jobs=-1)

#fit the grid search to the entire labelled dataset
grid_search_3.fit(X, y)

#store the best performing model and parameters
model_3=grid_search_3.best_estimator_
best_parameters_3=grid_search_3.best_params_

In [11]:
#initialize lists to store results
conf_matrices = []
reports = []

#for each k-fold
for train_idx, test_idx in skf.split(X, y):

    #split data into train and test sets
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
    
    #train model on training set
    model_3.fit(X_train, y_train)

    #predict target values based on test data
    y_pred_3=model_3.predict(X_test)

    #store confusion matrix and classification report for this fold
    conf_matrices.append(confusion_matrix(y_test, y_pred_3)) 
    reports.append(classification_report(y_test, y_pred_3, target_names=class_labels, output_dict=True))


#aggregate confusion matrix across all folds
conf_matrix_3=np.sum(conf_matrices, axis=0)

#extract class-specific metrics from each fold
precision_match=[r['match']['precision'] for r in reports]
precision_not_match=[r['not match']['precision'] for r in reports]
recall_not_match=[r['not match']['recall'] for r in reports]

#average metrics across all folds
precision_match_3=round(np.mean(precision_match), 3)
precision_not_match_3=round(np.mean(precision_not_match), 3)
recall_not_match_3=round(np.mean(recall_not_match), 3)

precision_match_std_3=round(np.std(precision_match), 3)
precision_not_match_std_3=round(np.std(precision_not_match), 3)
recall_not_match_std_3=round(np.std(recall_not_match), 3)



#print results
print("Best hyperparameters:")
print(best_parameters_3)

print("\nConfusion Matrix:")
print(conf_matrix_3)

print("\nClassification Report Summary:")
print(f"'Match' Precision:     {precision_match_3}")
print(f"'Not Match' Precision: {precision_not_match_3}")
print(f"'Not Match' Recall:    {recall_not_match_3}")

print("\nClassification Report Summary (Mean ± Std):")
print(f"'Match' Precision:     {precision_match_3} ± {precision_match_std_3}")
print(f"'Not Match' Precision: {precision_not_match_3} ± {precision_not_match_std_3}")
print(f"'Not Match' Recall:    {recall_not_match_3} ± {recall_not_match_std_3}")

Best hyperparameters:
{'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 50}

Confusion Matrix:
[[156   0   0]
 [  0  38   4]
 [  0   1  51]]

Classification Report Summary:
'Match' Precision:     0.932
'Not Match' Precision: 1.0
'Not Match' Recall:    1.0

Classification Report Summary (Mean ± Std):
'Match' Precision:     0.932 ± 0.063
'Not Match' Precision: 1.0 ± 0.0
'Not Match' Recall:    1.0 ± 0.0


### 4.4 Experiment 4
- **Classifier**: Logistic Regression
- **Hyperparameters**: tuned with GridSearchCV + custom scoring function
    - regularization strength C: 0.001, 0.01, 0.1, 1, 10, 100
    - penalty: L1, L2
    - solver: saga (works for both L1 and L2)
- **Validation Strategy**: Stratified K-Fold
    - number of folds: 5


In [12]:
#standardize features 
scaler=StandardScaler()
X_scaled=scaler.fit_transform(X)

#initialize base model
base_model_4=LogisticRegression(random_state=30, max_iter=10000,solver='saga')

#define hyperparameter grid for tuning 
param_grid_4={
    'C': [0.001, 0.01, 0.1, 1, 10, 100],  #wide range of values
    'penalty': ['l1','l2']
}

#grid search with custom scoring and stratified CV
grid_search_4 = GridSearchCV(base_model_4, param_grid_4, scoring=custom_scorer, cv=skf, n_jobs=-1)
grid_search_4.fit(X_scaled, y)

#store best performing model and parameters
model_4=grid_search_4.best_estimator_
best_parameters_4=grid_search_4.best_params_



  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
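
In the cell above the scaler is fit on the full labeled set before the grid search runs, so the scaling statistics already include each validation fold. One common way to keep scaling inside every fold is to wrap scaler and classifier in a `Pipeline` and tune that instead. The sketch below uses synthetic stand-in data and plain accuracy as the score (both assumptions made for a self-contained example); in the notebook, `X`, `y` and `custom_scorer` would take their place. All `*_demo` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): 3 classes, 9 features
X_demo, y_demo = make_classification(
    n_samples=150, n_features=9, n_informative=5, n_classes=3, random_state=30
)

# The scaler is re-fit on the training part of every CV fold
pipe_demo = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(solver='saga', max_iter=10000, random_state=30)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid_demo = {
    'clf__C': [0.1, 1, 10],
    'clf__penalty': ['l1', 'l2'],
}

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=30)
search_demo = GridSearchCV(pipe_demo, param_grid_demo, scoring='accuracy', cv=cv_demo)
search_demo.fit(X_demo, y_demo)

print(search_demo.best_params_)
```

The fitted `best_estimator_` is then the whole pipeline, so calling `predict` on unscaled features applies the scaling learned from the training data automatically.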

In [13]:
conf_matrices=[]
reports=[]

#for each k-fold
for train_idx, test_idx in skf.split(X, y):

    #split data into train and test sets
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

    #standardize data 
    X_train_scaled=scaler.fit_transform(X_train)
    X_test_scaled=scaler.transform(X_test)
    
    #train model
    model_4.fit(X_train_scaled, y_train)

    #predict target values based on test data
    y_pred_4=model_4.predict(X_test_scaled)

    #store confusion matrix and classification report for this fold
    conf_matrices.append(confusion_matrix(y_test, y_pred_4)) 
    reports.append(classification_report(y_test, y_pred_4, target_names=class_labels, output_dict=True))




#aggregate confusion matrix across all folds
conf_matrix_4=np.sum(conf_matrices, axis=0)

#extract class-specific metrics from each fold
precision_match=[r['match']['precision'] for r in reports]
precision_not_match=[r['not match']['precision'] for r in reports]
recall_not_match=[r['not match']['recall'] for r in reports]

#average metrics across all folds
precision_match_4=round(np.mean(precision_match), 3)
precision_not_match_4=round(np.mean(precision_not_match), 3)
recall_not_match_4=round(np.mean(recall_not_match), 3)

precision_match_std_4=round(np.std(precision_match), 3)
precision_not_match_std_4=round(np.std(precision_not_match), 3)
recall_not_match_std_4=round(np.std(recall_not_match), 3)



#print results
print("Best hyperparameters:")
print(best_parameters_4)

print("\nConfusion Matrix:")
print(conf_matrix_4)

print("\nClassification Report Summary:")
print(f"'Match' Precision:     {precision_match_4}")
print(f"'Not Match' Precision: {precision_not_match_4}")
print(f"'Not Match' Recall:    {recall_not_match_4}")

print("\nClassification Report Summary (Mean ± Std):")
print(f"'Match' Precision:     {precision_match_4} ± {precision_match_std_4}")
print(f"'Not Match' Precision: {precision_not_match_4} ± {precision_not_match_std_4}")
print(f"'Not Match' Recall:    {recall_not_match_4} ± {recall_not_match_std_4}")

Best hyperparameters:
{'C': 1, 'penalty': 'l1'}

Confusion Matrix:
[[153   0   3]
 [  0  39   3]
 [  0   0  52]]

Classification Report Summary:
'Match' Precision:     0.908
'Not Match' Precision: 1.0
'Not Match' Recall:    0.981

Classification Report Summary (Mean ± Std):
'Match' Precision:     0.908 ± 0.085
'Not Match' Precision: 1.0 ± 0.0
'Not Match' Recall:    0.981 ± 0.026


### 4.5 Experiment 5
- **Classifier**: Logistic Regression
- **Hyperparameters**: tuned with GridSearchCV + custom scoring function
    - regularization strength C: 0.3, 0.5, 0.7, 1, 1.3, 1.5, 2
    - penalty: L1
    - solver: saga
- **Validation Strategy**: Stratified K-Fold
    - number of folds: 5


In [14]:
#initialize base model
base_model_5=LogisticRegression(random_state=30, max_iter=10000,solver='saga')

#define hyperparameter grid
param_grid_5={
    'C': [0.3, 0.5, 0.7, 1, 1.3, 1.5, 2],
    'penalty': ['l1'],
}

#grid search with custom scoring and stratified CV 
grid_search_5=GridSearchCV(base_model_5,param_grid_5,scoring=custom_scorer,cv=skf,n_jobs=-1)
grid_search_5.fit(X_scaled, y)

#store best performing model and parameters
model_5=grid_search_5.best_estimator_
best_parameters_5=grid_search_5.best_params_

In [15]:
conf_matrices=[]
reports=[]

#for each k-fold
for train_idx, test_idx in skf.split(X, y):

    #split data into train and test sets
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

    X_train_scaled=scaler.fit_transform(X_train)
    X_test_scaled=scaler.transform(X_test)
    
    #train model
    model_5.fit(X_train_scaled, y_train)

    #predict target values 
    y_pred_5=model_5.predict(X_test_scaled)

    #store confusion matrix and classification report for this fold
    conf_matrices.append(confusion_matrix(y_test, y_pred_5)) 
    reports.append(classification_report(y_test, y_pred_5, target_names=class_labels, output_dict=True))




#aggregate confusion matrix across all folds
conf_matrix_5=np.sum(conf_matrices, axis=0)

#extract class-specific metrics from each fold
precision_match=[r['match']['precision'] for r in reports]
precision_not_match=[r['not match']['precision'] for r in reports]
recall_not_match=[r['not match']['recall'] for r in reports]

#average metrics across all folds
precision_match_5=round(np.mean(precision_match), 3)
precision_not_match_5=round(np.mean(precision_not_match), 3)
recall_not_match_5=round(np.mean(recall_not_match), 3)

precision_match_std_5=round(np.std(precision_match), 3)
precision_not_match_std_5=round(np.std(precision_not_match), 3)
recall_not_match_std_5=round(np.std(recall_not_match), 3)


#print results
print("Best hyperparameters:")
print(best_parameters_5)

print("\nConfusion Matrix:")
print(conf_matrix_5)

print("\nClassification Report Summary:")
print(f"'Match' Precision:     {precision_match_5}")
print(f"'Not Match' Precision: {precision_not_match_5}")
print(f"'Not Match' Recall:    {recall_not_match_5}")

print("\nClassification Report Summary (Mean ± Std):")
print(f"'Match' Precision:     {precision_match_5} ± {precision_match_std_5}")
print(f"'Not Match' Precision: {precision_not_match_5} ± {precision_not_match_std_5}")
print(f"'Not Match' Recall:    {recall_not_match_5} ± {recall_not_match_std_5}")

Best hyperparameters:
{'C': 1.5, 'penalty': 'l1'}

Confusion Matrix:
[[153   0   3]
 [  0  39   3]
 [  0   1  51]]

Classification Report Summary:
'Match' Precision:     0.906
'Not Match' Precision: 1.0
'Not Match' Recall:    0.981

Classification Report Summary (Mean ± Std):
'Match' Precision:     0.906 ± 0.087
'Not Match' Precision: 1.0 ± 0.0
'Not Match' Recall:    0.981 ± 0.026


### 4.6 Experiment 6
- **Classifier**: XGBoost Classifier
- **Hyperparameters**: tuned with GridSearchCV + custom scoring function
    - number of estimators: 50, 100, 200
    - max depth of each tree: 3, 6, 10
    - step size shrinkage: 0.01, 0.1, 0.2
    - row sampling: 0.7, 1.0
    - feature sampling: 0.7, 1.0
    - L1 regularization term: 0, 0.1, 1 
    - L2 regularization term: 1, 10 
- **Validation Strategy**: Stratified K-Fold
    - number of folds: 5

In [16]:
#initialize base model
base_model_6=XGBClassifier(random_state=30, eval_metric='mlogloss')

#define hyperparameter grid for tuning
param_grid_6 = {
    'n_estimators':[50, 100, 200],           #number of trees
    'max_depth':[3, 6, 10],                  #depth of each tree
    'learning_rate':[0.01, 0.1, 0.2],        #step size shrinkage
    'subsample':[0.7, 1.0],                  #row sampling
    'colsample_bytree':[0.7, 1.0],           #feature sampling
    'reg_alpha':[0, 0.1, 1],                 #L1 regularization
    'reg_lambda':[1, 10]                     #L2 regularization
}

#encode labels as integers for compatibility with XGBClassifier
label_mapping={
    'not match': 0,
    'preliminary match':1,
    'match': 2
}
df_label['encoded_label'] = df_label['Label'].map(label_mapping)
y=df_label['encoded_label']

#create a reverse operation to use in the future and go back to readable labels
reverse_mapping={v: k for k, v in label_mapping.items()}

#grid search with custom scoring and stratified CV 
grid_search_6=GridSearchCV(estimator=base_model_6,param_grid=param_grid_6,scoring=custom_scorer,cv=skf,n_jobs=-1)
grid_search_6.fit(X_scaled,y)

#store best performing model and parameters
model_6=grid_search_6.best_estimator_
best_parameters_6=grid_search_6.best_params_


In [18]:
conf_matrices=[]
reports=[]

#for each fold
for train_idx, test_idx in skf.split(X, y):

    #split data into train and test sets
    X_train, X_test=X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test=y.iloc[train_idx], y.iloc[test_idx]

    #standardize features
    X_train_scaled=scaler.fit_transform(X_train)
    X_test_scaled=scaler.transform(X_test)

    #train model 
    model_6.fit(X_train_scaled, y_train)

    #predict target 
    y_pred_6=model_6.predict(X_test_scaled)

    #store confusion matrix and classification report
    conf_matrices.append(confusion_matrix(y_test, y_pred_6))
    reports.append(classification_report(y_test, y_pred_6, target_names=class_labels, output_dict=True))

#aggregate confusion matrices from all folds
conf_matrix_6=np.sum(conf_matrices, axis=0)

#extract class-specific metrics from each fold
precision_match=[r['match']['precision'] for r in reports]
precision_not_match=[r['not match']['precision'] for r in reports]
recall_not_match =[r['not match']['recall'] for r in reports]

#average metrics across folds
precision_match_6=round(np.mean(precision_match), 3)
precision_not_match_6=round(np.mean(precision_not_match), 3)
recall_not_match_6=round(np.mean(recall_not_match), 3)

precision_match_std_6=round(np.std(precision_match), 3)
precision_not_match_std_6=round(np.std(precision_not_match), 3)
recall_not_match_std_6=round(np.std(recall_not_match), 3)


#print results
print("Best hyperparameters for Experiment 6:")
print(best_parameters_6)

print("\nConfusion Matrix:")
print(conf_matrix_6)

print("\nClassification Report Summary:")
print(f"'Match' Precision:     {precision_match_6}")
print(f"'Not Match' Precision: {precision_not_match_6}")
print(f"'Not Match' Recall:    {recall_not_match_6}")

print("\nClassification Report Summary (Mean ± Std):")
print(f"'Match' Precision:     {precision_match_6} ± {precision_match_std_6}")
print(f"'Not Match' Precision: {precision_not_match_6} ± {precision_not_match_std_6}")
print(f"'Not Match' Recall:    {recall_not_match_6} ± {recall_not_match_std_6}")

Best hyperparameters for Experiment 6:
{'colsample_bytree': 0.7, 'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 50, 'reg_alpha': 0, 'reg_lambda': 1, 'subsample': 0.7}

Confusion Matrix:
[[ 39   3   0]
 [  1  51   0]
 [  0   0 156]]

Classification Report Summary:
'Match' Precision:     1.0
'Not Match' Precision: 0.975
'Not Match' Recall:    0.928

Classification Report Summary (Mean ± Std):
'Match' Precision:     1.0 ± 0.0
'Not Match' Precision: 0.975 ± 0.05
'Not Match' Recall:    0.928 ± 0.059


# 5. Discussion

The summarized results are collected in a comparison table, `results_comparison`. The first three columns hold the individual evaluation metrics used so far; the final column, 'Weighted Score', combines them according to the goals of the project: 40% for 'match' precision, 40% for 'not match' precision, and 20% for 'not match' recall. The class distribution of the labeled sample is also printed so that the results can be cross-referenced against the class sizes.

In [20]:
df_label['Label'].value_counts()

Label
match                156
preliminary match     52
not match             42
Name: count, dtype: int64

In [19]:
results_comparison= [
    {
        'Experiment': '1 (Baseline)',
        'Match Precision': precision_match_1,
        'Not Match Precision': precision_not_match_1,
        'Not Match Recall': recall_not_match_1,
        'Weighted Score': round(0.4*precision_match_1+0.4*precision_not_match_1+0.2*recall_not_match_1,3)
    },
    {
        'Experiment': '2 (Random Forest)',
        'Match Precision': f"{precision_match_2:.3f} ± {precision_match_std_2:.3f}",
        'Not Match Precision': f"{precision_not_match_2:.3f} ± {precision_not_match_std_2:.3f}",
        'Not Match Recall': f"{recall_not_match_2:.3f} ± {recall_not_match_std_2:.3f}",
        'Weighted Score': round(0.4*precision_match_2+0.4*precision_not_match_2+0.2*recall_not_match_2,3)
    },
    {
        'Experiment': '3 (Tuned Random Forest)',
        'Match Precision': f"{precision_match_3:.3f} ± {precision_match_std_3:.3f}",
        'Not Match Precision': f"{precision_not_match_3:.3f} ± {precision_not_match_std_3:.3f}",
        'Not Match Recall': f"{recall_not_match_3:.3f} ± {recall_not_match_std_3:.3f}",
        'Weighted Score': round(0.4*precision_match_3+0.4*precision_not_match_3+0.2*recall_not_match_3,3)
    },
    {
        'Experiment': '4 (LogReg L1 vs L2)',
        'Match Precision': f"{precision_match_4:.3f} ± {precision_match_std_4:.3f}",
        'Not Match Precision': f"{precision_not_match_4:.3f} ± {precision_not_match_std_4:.3f}",
        'Not Match Recall': f"{recall_not_match_4:.3f} ± {recall_not_match_std_4:.3f}",
        'Weighted Score': round(0.4*precision_match_4+0.4*precision_not_match_4+0.2*recall_not_match_4,3)
    },
    {
        'Experiment': '5 (LogReg L1 refined)',
        'Match Precision': f"{precision_match_5:.3f} ± {precision_match_std_5:.3f}",
        'Not Match Precision': f"{precision_not_match_5:.3f} ± {precision_not_match_std_5:.3f}",
        'Not Match Recall': f"{recall_not_match_5:.3f} ± {recall_not_match_std_5:.3f}",
        'Weighted Score': round(0.4*precision_match_5+0.4*precision_not_match_5+0.2*recall_not_match_5,3)
    },
    {
        'Experiment': '6 (XGBoost)',
        'Match Precision': f"{precision_match_6:.3f} ± {precision_match_std_6:.3f}",
        'Not Match Precision': f"{precision_not_match_6:.3f} ± {precision_not_match_std_6:.3f}",
        'Not Match Recall': f"{recall_not_match_6:.3f} ± {recall_not_match_std_6:.3f}",
        'Weighted Score': round(0.4*precision_match_6+0.4*precision_not_match_6+0.2*recall_not_match_6,3)
    }
]


results_df=pd.DataFrame(results_comparison)

print(results_df)

                Experiment Match Precision Not Match Precision Not Match Recall  Weighted Score
0             1 (Baseline)           0.909                 1.0              1.0           0.964
1        2 (Random Forest)   0.914 ± 0.053       1.000 ± 0.000    1.000 ± 0.000           0.966
2  3 (Tuned Random Forest)   0.932 ± 0.063       1.000 ± 0.000    1.000 ± 0.000           0.973
3      4 (LogReg L1 vs L2)   0.908 ± 0.085       1.000 ± 0.000    0.981 ± 0.026           0.959
4    5 (LogReg L1 refined)   0.906 ± 0.087       1.000 ± 0.000    0.981 ± 0.026           0.959
5              6 (XGBoost)   1.000 ± 0.000       0.975 ± 0.050    0.928 ± 0.059           0.976


- The baseline model scored well, but without cross-validation there is no estimate of how much its scores vary, so its results are likely unreliable.

- In later experiments, the standard deviation of scores across folds was as high as 0.087, showing fluctuations between splits and hence supporting the use of Stratified K-Fold.

- The class distribution is imbalanced, with `match` dominating, so models showing stability on minority classes are more trustworthy.

- Random Forest with tuning showed the highest `match` precision (0.932) while maintaining perfect `not match` scores.

- The optimized Logistic Regression (Exp. 5) did not outperform the initial version (Exp. 4), suggesting that the model was already near its ceiling.

- Logistic Regression showed both lower overall metrics and higher variance compared to Random Forests, indicating it is less reliable for this task. 

- XGBoost (Exp. 6) had perfect `match` precision but lower and more volatile `not match` metrics, reversing the pattern seen in all previous models. This could point to slight overfitting on the dominant class.

- Despite the variability, XGBoost still performs well under this project’s priorities, since `match` and `not match` precision are weighted more heavily than recall, as confirmed by a weighted overall score of 0.976.

- Random Forest with tuning did not prioritize the project's goals as directly as XGBoost, as confirmed by its slightly lower weighted score (0.973). However, it was more stable on the `not match` metrics, which showed no variation across folds.

- Experiment 3 was therefore chosen as the best model to proceed with, as it offered the best balance between performance, alignment with project goals, and stability.
  

We can also compare the highest-performing models with the rule-based approach results from `05_compare_matching_strategies.ipynb`: 

In [54]:
data=[
    {
        'Method': '(Rule-Based Approach)',
        'Match Precision': '1.000',
        'Not Match Precision': '1.000',
        'Not Match Recall': '0.929',
        'Weighted Score': round(0.4 * 1.0 + 0.4 * 1.0 + 0.2 * 0.929, 3)
    },
    {
        'Method': '3 (Tuned Random Forest)',
        'Match Precision': f"{precision_match_3:.3f} ± {precision_match_std_3:.3f}",
        'Not Match Precision': f"{precision_not_match_3:.3f} ± {precision_not_match_std_3:.3f}",
        'Not Match Recall': f"{recall_not_match_3:.3f} ± {recall_not_match_std_3:.3f}",
        'Weighted Score': round(0.4 * precision_match_3 +0.4 * precision_not_match_3 +0.2 * recall_not_match_3,3)
    },
    {
        'Method': '6 (XGBoost)',
        'Match Precision': f"{precision_match_6:.3f} ± {precision_match_std_6:.3f}",
        'Not Match Precision': f"{precision_not_match_6:.3f} ± {precision_not_match_std_6:.3f}",
        'Not Match Recall': f"{recall_not_match_6:.3f} ± {recall_not_match_std_6:.3f}",
        'Weighted Score': round(0.4 * precision_match_6 +0.4 * precision_not_match_6 +0.2 * recall_not_match_6,3)
    }
]

rule_based_df=pd.DataFrame(data)

print(rule_based_df)


                    Method Match Precision Not Match Precision Not Match Recall  Weighted Score
0    (Rule-Based Approach)           1.000               1.000            0.929           0.986
1  3 (Tuned Random Forest)   0.932 ± 0.063       1.000 ± 0.000    1.000 ± 0.000           0.973
2              6 (XGBoost)   1.000 ± 0.000       0.975 ± 0.050    0.928 ± 0.059           0.976


- As expected, the best-performing ML models did not outperform the rule-based approach: its weighted score (0.986) was slightly higher than that of XGBoost (0.976) and the tuned Random Forest (0.973). At best the results were comparable, and all methods achieved high precision overall.

- This is likely due to the nature of the feature engineering. Many of the input features were composite indicators that already encapsulate the logic of the rule-based method, rather than raw or deconstructed components that would allow the model to discover new patterns on its own.

- As a result, the machine learning models may not have had the opportunity to learn fundamentally different decision boundaries. They were effectively validating the rule-based logic rather than improving upon it.

# 6. Label Prediction

The chosen model was retrained on all labelled data and used to predict labels for the holdout set. Afterward, a random sample of 200 of these predictions was manually reviewed to check that the model generalizes to unseen data.

In [21]:
chosen_model=model_3

In [24]:
#train final model with the labelled data 
chosen_model.fit(X,y)

#generate class probabilities for each class 
df_predicted_1=df_holdout.copy()
df_predicted_1[['not_match_prob','preliminary_prob','match_prob']]=chosen_model.predict_proba(X_holdout)

#predict class label 
df_predicted_1['Predicted Label 1']=chosen_model.predict(X_holdout)

#reverse mapping from experiment 6 for readability of labels
df_predicted_1['Predicted Label 1']=df_predicted_1['Predicted Label 1'].map(reverse_mapping)
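
One caveat on the probability columns: scikit-learn returns the columns of `predict_proba` in the order of the fitted model's `classes_` attribute, so the hard-coded column names above are only correct if that order is `not match`, `preliminary match`, `match`. A hedged sketch of deriving the names from the class order instead is given below. The helper `proba_column_names` and the contents of `example_reverse_mapping` are hypothetical, since the actual mapping is defined earlier in the notebook.

```python
def proba_column_names(classes, reverse_mapping):
    """Build one probability column name per class, in the given order.

    `classes` is assumed to be the fitted model's `classes_` attribute
    (encoded labels); `reverse_mapping` maps each code back to its label.
    """
    names = []
    for code in classes:
        label = reverse_mapping[code]
        names.append(label.replace(" ", "_") + "_prob")
    return names


# Hypothetical encoding, for illustration only
example_reverse_mapping = {0: "not match", 1: "preliminary match", 2: "match"}
columns = proba_column_names([0, 1, 2], example_reverse_mapping)
print(columns)
```

In the notebook this could be called as `proba_column_names(chosen_model.classes_, reverse_mapping)`. Note that it yields `preliminary_match_prob` rather than the `preliminary_prob` name used above.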


In [25]:
#create new sample 
df_review_1=df_predicted_1.sample(200, random_state=30).sort_values('UK ID').reset_index(drop=True)

#print relevant columns from the sample to inspect the labels
df_review_1[['UK ID','UK Name','Name Overlap','EU Name Match','multi_score','coverage_ratio','Predicted Label 1','match_prob','not_match_prob','preliminary_prob']].head(200)

Unnamed: 0,UK ID,UK Name,Name Overlap,EU Name Match,multi_score,coverage_ratio,Predicted Label 1,match_prob,not_match_prob,preliminary_prob
0,7309,"{Abdul, Motmaen, Abdulhai, Haq}","[Abdul, Motmaen, Haq]","{Abdul, Motmaen, Abdulhai, Haq}",96,1.0,match,1.0,0.0,0.0
1,7455,"{Siddiqmal, Mohammad, Sarwar, Masood}","[Siddiqmal, Sarwar, Mohammad, Masood]","{Siddiqmal, Mohammad, Sarwar, Masood}",100,1.0,match,0.972,0.0,0.028
2,7483,"{Muhammad, Tayeb, Tayyab, Wali, Dad, Allah, Tabeeb}","[Dad, Tayyab, Tabeeb, Tayeb, Allah, Wali, Muhammad]","{Muhammad, Tayeb, Tayyab, Wali, Dad, Allah, Tabeeb}",97,1.0,match,1.0,0.0,0.0
3,7603,"{Tariq, Aziz, Mikhail}","[Tariq, Aziz, Mikhail]","{Tariq, Aziz, Mikhail, Tarek}",98,1.0,match,1.0,0.0,0.0
4,7868,"{Amin, Mostafa, Mohamed}","[Amin, Mostafa, Mohamed]","{Amin, Mhmd, Mstf, Mostafa, Mohamed, Myn}",99,1.0,match,1.0,0.0,0.0
5,8245,"{Jawhar, Al, Duri, Majid}","[Duri, Jawhar, Al, Majid]","{Jawhar, Al, Duri, Majid}",96,1.0,match,1.0,0.0,0.0
6,8838,"{Mohmed, Mohammed, Gaffar, Elhassan}","[Mohammed, Elhassan, Gaffar]","{Mohmed, Mohammed, Gaffar, Elhassan}",100,1.0,match,1.0,0.0,0.0
7,9215,"{Sayid, Muhammad, Hafez, Mohammad, Syeed, Tata, Sayed, Saeed, Sayeed, Sahib, Ji, Hafiz}","[Ji, Sayeed, Tata, Sahib, Sayid, Hafez, Mohammad]","{Sayid, Muhammad, Hafez, Mohammad, Syeed, Tata, Sayed, Saeed, Sayeed, Sahib, Ji, Hafiz}",96,1.0,match,1.0,0.0,0.0
8,10916,"{Ha, Sok, Hwang, Hwa}","[Ha, Hwang, Sok]","{Ha, Sok, Hwang, Hwa}",92,1.0,match,0.982222,0.0,0.017778
9,11093,"{Muscab, Gap, Abu, Mahmoud, Mohamed, Mohammed, Qorgab, Mahamoud, Mohamud, Gure, Mahmud, Mohamoud, Bashir, Yare}","[Yare, Mahmud, Muscab, Abu, Gure, Bashir, Mohamoud, Gap, Qorgab]","{Muscab, Gap, Abu, Mahmoud, Mohamed, Mohammed, Qorgab, Mahamoud, Mohamud, Gure, Mahmud, Mohamoud, Bashir, Yare}",97,1.0,match,1.0,0.0,0.0


In [26]:
#store the uk ids of rows with incorrect labels 
change_to_match=[14058,14601]
change_to_not_match=[13852,13913,15001]
change_to_preliminary_match=[15711,15825]


In [27]:
#create a new column for the human labels where the incorrect ones are properly labelled
df_review_1['Human-assigned Label']=df_review_1['Predicted Label 1'].copy()

df_review_1.loc[df_review_1['UK ID'].isin(change_to_match), 'Human-assigned Label']='match'
df_review_1.loc[df_review_1['UK ID'].isin(change_to_not_match), 'Human-assigned Label']='not match'
df_review_1.loc[df_review_1['UK ID'].isin(change_to_preliminary_match), 'Human-assigned Label']='preliminary match'

In [37]:
#create classification report 
report_review_1=classification_report(
    df_review_1['Human-assigned Label'],
    df_review_1['Predicted Label 1'],
    labels=class_labels,
    output_dict=True
)

#create confusion matrix
confusion_matrix_r_1=confusion_matrix(df_review_1['Human-assigned Label'], df_review_1['Predicted Label 1'],labels=class_labels)


#store relevant metrics
precision_match_r_1=report_review_1['match']['precision']
precision_not_match_r_1=report_review_1['not match']['precision']
recall_not_match_r_1=report_review_1['not match']['recall']

weighted_score_r_1=precision_match_r_1*0.4+precision_not_match_r_1*0.4+recall_not_match_r_1*0.2

#print results
print(confusion_matrix_r_1)

print("\nClassification Report Summary:")
print(f"'Match' Precision:     {precision_match_r_1}")
print(f"'Not Match' Precision: {precision_not_match_r_1}")
print(f"'Not Match' Recall:    {recall_not_match_r_1}")

print(f"\nWeighted Score:    {weighted_score_r_1}")

[[ 25   3   0]
 [  0  37   2]
 [  0   2 131]]

Classification Report Summary:
'Match' Precision:     0.9849624060150376
'Not Match' Precision: 1.0
'Not Match' Recall:    0.8928571428571429

Weighted Score:    0.9725563909774437


After extending the labelled dataset with predictions from the tuned Random Forest model (Experiment 3), the manual review of 200 predicted pairs gave results consistent with previous findings. The weighted score was 0.973, matching the cross-validated score, which indicates that the model continues to perform well under the project's evaluation criteria. `not match` recall dropped from 1.000 to 0.893, while precision remained high for both target classes (0.985 for `match`, 1.000 for `not match`). These results do not invalidate any of the earlier conclusions and support the reliability of the chosen model.
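
The reported metrics can be re-derived by hand from the confusion matrix above. The sketch below assumes that rows are true labels and columns are predicted labels (scikit-learn's convention for `confusion_matrix`), in the order `not match`, `preliminary match`, `match`. That order is inferred from the reported values, since `class_labels` is defined earlier in the notebook.

```python
# Confusion matrix from the manual review (rows: true, columns: predicted)
labels = ["not match", "preliminary match", "match"]  # assumed order
matrix = [
    [25, 3, 0],
    [0, 37, 2],
    [0, 2, 131],
]


def precision(matrix, i):
    """Correct predictions of class i divided by all predictions of class i."""
    predicted = sum(row[i] for row in matrix)
    return matrix[i][i] / predicted


def recall(matrix, i):
    """Correct predictions of class i divided by all true members of class i."""
    return matrix[i][i] / sum(matrix[i])


not_match, match = labels.index("not match"), labels.index("match")
precision_match = precision(matrix, match)          # 131 / 133
precision_not_match = precision(matrix, not_match)  # 25 / 25
recall_not_match = recall(matrix, not_match)        # 25 / 28

weighted = (0.4 * precision_match
            + 0.4 * precision_not_match
            + 0.2 * recall_not_match)
print(round(precision_match, 3), precision_not_match, round(recall_not_match, 3))
print(round(weighted, 3))
```

The three `not match` pairs predicted as `preliminary match` account for the whole drop in recall, while no pair was wrongly predicted as `not match`.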

# 7. Feature Importance

In this section, we investigated the importance of each feature as an opportunity for data-informed tuning. We focused on the Random Forest model (Experiment 3), both because it is the chosen model and because its impurity-based feature importances are easy to read. XGBoost was not analysed here; it is generally treated as more of a black-box model, although it also exposes feature importance scores.

In [48]:
importances_3=model_3.feature_importances_

feature_importance_df=pd.DataFrame({
    'Feature': feature_cols,
    'Importance': importances_3,
    'Abs Importance': np.abs(importances_3)
})

feature_importance_df = feature_importance_df.sort_values(by='Abs Importance', ascending=False).reset_index(drop=True)

print(feature_importance_df[['Feature', 'Importance']])

                Feature  Importance
0           multi_score    0.504668
1        coverage_ratio    0.250483
2  length_adj_avg_score    0.067270
3         avg_raw_score    0.050041
4         uk_name_count    0.040885
5    overlap_name_count    0.038332
6   eu_name_match_count    0.018889
7      uk_letters_count    0.017238
8       candidate_count    0.012194


This analysis revealed two key insights:

- `multi_score` accounted for about 50% of the model's total feature importance. This confirms that `multi_score` is a strong, well-crafted metric.

- `coverage_ratio` accounted for a substantial 25%, suggesting it has more predictive power than reflected in the original rule. Slightly increasing its weight in the rule-based formula could improve performance without major changes.
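
To make these shares concrete, the importances printed above can be accumulated in rank order. This is a small illustrative sketch using the values from the table; the variable names are hypothetical.

```python
# Feature importances as printed above (Experiment 3)
importances = {
    "multi_score": 0.504668,
    "coverage_ratio": 0.250483,
    "length_adj_avg_score": 0.067270,
    "avg_raw_score": 0.050041,
    "uk_name_count": 0.040885,
    "overlap_name_count": 0.038332,
    "eu_name_match_count": 0.018889,
    "uk_letters_count": 0.017238,
    "candidate_count": 0.012194,
}

# Rank features from most to least important
ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)

# Running total of importance, feature by feature
cumulative = 0.0
cumulative_share = {}
for feature, value in ranked:
    cumulative += value
    cumulative_share[feature] = cumulative
    print(f"{feature:<22}{value:>8.3f}{cumulative:>8.3f}")
```

The top two features together account for roughly 75% of the total importance, and the remaining seven features share the rest.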

# 8. Output

In [None]:
df_predicted_1.to_csv(final_dir/'ml_label_predictions.csv',index=False)