# Entity Matching Overview
### Final Project for Kodołamacz's Data Science Bootcamp
Author: Piotr Zioło

### Introduction
Entity matching (also known as record linkage) is the process of identifying which records in two or more datasets refer to the same real-world entity. High-quality entity matching allows organizations to consolidate information, eliminate duplicates, and gain a unified view of their data. It is especially important in scenarios such as combining the business lists held in the CRMs of two companies that are merging.

In this project, we focus on matching restaurant entities from two restaurant guides: Fodor's and Zagat's. The goal is to determine which entries in the Fodor's list refer to the same establishments as entries in the Zagat's list. Because the two sources may use slightly different names, address formats, or phone number conventions for the same restaurant, simple joins on these fields miss many true matches. We therefore compare several more flexible approaches to entity matching:
- Fuzzy String Matching – using string similarity (e.g. Levenshtein distance) primarily on textual fields like name.
- TF-IDF + Cosine Similarity – treating restaurant records as documents and measuring cosine similarity of TF-IDF feature vectors.
- Transformer Embeddings + Cosine Similarity – using pre-trained language model embeddings (Sentence-BERT) for each record and measuring vector cosine similarity.
- Large Language Model – leveraging an LLM via API to semantically compare and decide if two descriptions refer to the same restaurant.
- Supervised Machine Learning – training a classifier on labeled matching/non-matching record pairs, using multiple features (text similarity scores, etc.).

We will evaluate each method on accuracy, precision, recall, and F1-score for identifying matches. We will also compare their runtime performance, scalability, and cost. By the end, we should understand which approach works best for this scenario and what the considerations are for deploying each at scale.
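
As a reminder of how these four metrics relate to each other, the sketch below computes them from confusion-matrix counts. The function name `match_metrics` and the counts in the usage line are illustrative assumptions, not values from this project.

```python
def match_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute standard classification metrics from confusion-matrix counts."""
    # Precision: share of predicted matches that are real matches
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of real matches that were found
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # Accuracy: share of all pairs (matches and non-matches) classified correctly
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical counts, for illustration only
print(match_metrics(tp=8, fp=2, fn=2, tn=88))
```

Precision and recall ignore true negatives entirely, which is why they stay informative when non-matching pairs vastly outnumber matching ones.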

### Dataset Overview
The Fodor's–Zagat's restaurant dataset is a benchmark for evaluating entity matching methods. The version used here was put together by Anna Primpeli and Christian Bizer of the University of Mannheim in Germany.

It consists of two lists of restaurants, one from Fodor's (533 entries) and one from Zagat's (331 entries). Each restaurant has the following attributes:
- id (a unique identifier in each data source)
- name
- addr
- city
- phone
- type (cuisine/category of restaurant)

In addition to the two source lists, there is a gold standard file that indicates which Fodor's and Zagat's records refer to the same real-world restaurant. It includes 112 matching pairs (positives) and 488 non-matching pairs (negatives), all manually annotated.

The original dataset is available through the Linkage Library (University of Michigan ICPSR) under the project “Restaurants (Fodors-Zagats), Augmented Version, Fixed Splits (ICPSR 127242)”: https://linkagelibrary.icpsr.umich.edu/linkagelibrary/project/127242/version/V1/view?path=/linkagelibrary/127242/fcr:versions/V1/restaurants_-Fodors-Zagats-&type=folder#tab-dataDocs. For ease of use, all files used in the analysis are included in the repository.

### Data Loading and Preprocessing


In [44]:
import pandas as pd

# Load datasets
fodors = pd.read_csv(
    'data/fodors.csv', 
    # index_col=0, # Leave id column as a regular column to simplify merging
    usecols=lambda x: x != 'class', # Exclude 'class' column which is irrelevant to the analysis
    quotechar="'", # Ensure apostrophes are properly interpreted
    escapechar='\\'
)

zagats = pd.read_csv(
    'data/zagats.csv',
    # index_col=0, 
    usecols=lambda x: x != 'class',
    quotechar="'",
    escapechar='\\'
)

matches = pd.read_csv('data/matches_fodors_zagats.csv')

print(f"Fodor's preview: {fodors.shape[0]} rows, {fodors.shape[1]} columns")
print(fodors.head(5))
print()

print(f"Zagat's preview: {zagats.shape[0]} rows, {zagats.shape[1]} columns")
print(zagats.head(5))
print()

print(f"Matches preview: {matches.shape[0]} rows, {matches.shape[1]} columns")
print(matches.head(5))
print()


Fodor's preview: 533 rows, 6 columns
    id                       name                    addr          city  \
0  534  arnie morton's of chicago  435 s. la cienega blv.   los angeles   
1  535         art's delicatessen     12224 ventura blvd.   studio city   
2  536              hotel bel-air    701 stone canyon rd.       bel air   
3  537                 cafe bizou     14016 ventura blvd.  sherman oaks   
4  538                  campanile     624 s. la brea ave.   los angeles   

          phone         type  
0  310/246-1501     american  
1  818/762-1221     american  
2  310/472-1211  californian  
3  818/788-3536       french  
4  213/938-1447     american  

Zagat's preview: 331 rows, 6 columns
   id             name                            addr              city  \
0   1   apple pan  the             10801 w. pico blvd.           west la   
1   2      asahi ramen             2027 sawtelle blvd.           west la   
2   3       baja fresh                 3345 kimber dr.  west

In [45]:
# Clean and standardize string fields 
# to ensure consistency and improve the effectiveness of similarity algorithms
def preprocess(df, columns):
    df = df.copy()
    for col in columns:
        df[col] = (
            df[col].astype(str) # Ensure all values are strings
                   .str.lower() # Convert to lowercase
                   .str.replace(r'[^a-z0-9\s]', '', regex=True) # Remove special characters
                   .str.replace(r'\s+', ' ', regex=True) # Replace multiple spaces with a single space
                   .str.strip() # Remove leading/trailing whitespace
        )
    return df

columns_to_preprocess = ['name', 'addr', 'city', 'type']

fodors = preprocess(fodors, columns_to_preprocess)
zagats = preprocess(zagats, columns_to_preprocess)
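
To see what this cleaning does to individual values, here is a minimal plain-Python sketch of the same four steps applied to single strings. The helper name `normalize` is an assumption made for illustration; it mirrors the regular expressions used in `preprocess` but is not part of the notebook.

```python
import re


def normalize(text: str) -> str:
    """Apply the same steps as preprocess() to a single string."""
    text = text.lower()                       # lowercase
    text = re.sub(r'[^a-z0-9\s]', '', text)   # delete anything that is not a letter, digit, or whitespace
    text = re.sub(r'\s+', ' ', text)          # collapse runs of whitespace
    return text.strip()                       # trim the ends


for raw in ["art's delicatessen", "hotel bel-air",
            "435 s. la cienega blv.", "apple pan  the"]:
    print(repr(raw), "->", repr(normalize(raw)))
```

Because punctuation is deleted rather than replaced by a space, hyphenated words are fused ("bel-air" becomes "belair"). That is harmless as long as both sources are cleaned the same way.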

### Method 0: Regular joins
How many restaurants would be matched by a naive exact join on name, address, and city?

In [46]:
# Perform the exact match join
matched_df = fodors.merge(
    zagats,
    on=['name', 'addr', 'city'],
    how='inner',
    suffixes=('_fodors', '_zagats')
)

# Check against gold standard matches
matched_df['in_gold_standard'] = matched_df.apply(
    lambda row: ((matches['fodors_id'] == row['id_fodors']) & 
                 (matches['zagats_id'] == row['id_zagats'])).any(), axis=1
)

# Summarize the results
total_matched = len(matched_df)
matched_in_gold = matched_df['in_gold_standard'].sum()

print(f"Total exact matches found: {total_matched}")
print(f"Matches present in gold standard: {matched_in_gold}")
print(f"Matches NOT present in gold standard: {total_matched - matched_in_gold}")

Total exact matches found: 26
Matches present in gold standard: 26
Matches NOT present in gold standard: 0


With regular joins, we find only 26 of the 112 verified matches (23%), although all 26 are correct. A method that tolerates small discrepancies between the two sources is therefore needed.
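
The sketch below illustrates why an exact join is so brittle: records are matched only when the whole (name, addr, city) key is identical, so a single abbreviation is enough to lose a pair. The records and ids are hypothetical and chosen only for illustration.

```python
# Hypothetical records keyed by (name, addr, city); ids are made up
guide_a = {
    ("arts delicatessen", "12224 ventura blvd", "studio city"): "a1",
    ("cafe bizou", "14016 ventura blvd", "sherman oaks"): "a2",
}
guide_b = {
    ("arts deli", "12224 ventura blvd", "studio city"): "b1",   # abbreviated name
    ("cafe bizou", "14016 ventura blvd", "sherman oaks"): "b2", # identical key
}

# An inner join on the full key keeps only keys present in both guides
exact = {key: (guide_a[key], guide_b[key])
         for key in guide_a.keys() & guide_b.keys()}
print(exact)
```

Only the second restaurant survives the join; the first is dropped even though its address and city agree exactly.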

### Method 1a: Fuzzy Matching with Levenshtein distance
Fuzzy matching is a text-matching technique used to identify similar strings by directly comparing their textual similarity, even if they are not exactly identical. It relies purely on character-level or token-level string comparisons.

First, we'll use Levenshtein distance, which measures the minimum number of single-character edits (insertions, deletions, substitutions) needed to convert one string into another.

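Levenshtein distance compares strings character by character and has no notion of tokens or meaning. It tolerates small spelling variations but is sensitive to word order and length differences: a name stored as "apple pan the" is several edits away from "the apple pan". In this notebook, the distance is converted into a 0-100 similarity by dividing by the length of the longer string, and the name, address, and city similarities are averaged.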
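
For readers who want to see the mechanics, here is a minimal pure-Python sketch of the classic dynamic-programming computation of Levenshtein distance, together with the same length-normalized similarity used in this notebook. The function names `levenshtein` and `similarity` are assumptions for illustration; the notebook itself relies on the `Levenshtein` package.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    # prev[j] is the distance between the already-processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute, or keep if equal
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Distance rescaled to a 0-100 similarity by the longer string's length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100 * (1 - levenshtein(a, b) / longest)


print(levenshtein("kitten", "sitting"))  # 3
print(similarity("apple pan the", "the apple pan"))  # below 100 despite identical words
```

The second call shows the word-order sensitivity: both strings contain exactly the same words, yet the similarity falls well short of 100.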

In [59]:
import numpy as np
import pandas as pd
from Levenshtein import distance as levenshtein_distance
from tqdm import tqdm
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

# Compute normalized Levenshtein similarity score
def normalized_levenshtein(s1, s2):
    dist = levenshtein_distance(s1, s2)
    max_len = max(len(s1), len(s2))
    if max_len == 0:
        return 100  # identical empty strings
    return 100 * (1 - dist / max_len)

# Generate all possible pairs with Levenshtein similarity
pairs = []
for i, f_row in tqdm(fodors.iterrows(), total=len(fodors), desc='Levenshtein scoring'):
    for j, z_row in zagats.iterrows():
        name_score = normalized_levenshtein(f_row['name'], z_row['name'])
        address_score = normalized_levenshtein(f_row['addr'], z_row['addr'])
        city_score = normalized_levenshtein(f_row['city'], z_row['city'])
        avg_score = (name_score + address_score + city_score) / 3
        pairs.append({
            'fodors_id': f_row['id'],
            'zagats_id': z_row['id'],
            'score': avg_score
        })

lev_df = pd.DataFrame(pairs)

# Ground truth labels
lev_df['actual_match'] = lev_df.apply(
    lambda row: ((matches['fodors_id'] == row['fodors_id']) & 
                 (matches['zagats_id'] == row['zagats_id'])).any(),
    axis=1
)

print(f"\nLevenshtein pairs preview: {lev_df.shape[0]} rows, {lev_df.shape[1]} columns")
print(lev_df.head(5))
print()

# Optimize threshold for best F1-score
thresholds = np.arange(60, 100, 1)
best_f1 = 0
best_threshold = 0

for threshold in tqdm(thresholds, desc="Optimizing Levenshtein threshold"):
    lev_df['predicted_match'] = lev_df['score'] >= threshold
    f1 = f1_score(lev_df['actual_match'], lev_df['predicted_match'])
    if f1 > best_f1:
        best_f1 = f1
        best_threshold = threshold

# Final evaluation using optimized threshold
lev_df['predicted_match'] = lev_df['score'] >= best_threshold
precision = precision_score(lev_df['actual_match'], lev_df['predicted_match'])
recall = recall_score(lev_df['actual_match'], lev_df['predicted_match'])
accuracy = accuracy_score(lev_df['actual_match'], lev_df['predicted_match'])

# Results
print(f"\nOptimized Levenshtein Threshold: {best_threshold}")
print(f"Levenshtein Matching Results:")
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"F1-Score:  {best_f1:.2f}")
print(f"Accuracy:  {accuracy:.2f}")


Levenshtein scoring: 100%|██████████| 533/533 [00:10<00:00, 51.90it/s]



Levenshtein pairs preview: 176423 rows, 4 columns
   fodors_id  zagats_id      score  actual_match
0        534          1  24.393939         False
1        534          2  20.227273         False
2        534          3  20.138889         False
3        534          4  21.028588         False
4        534          5  19.746377         False



Optimizing Levenshtein threshold: 100%|██████████| 40/40 [00:00<00:00, 42.96it/s]


Optimized Levenshtein Threshold: 78
Levenshtein Matching Results:
Precision: 0.97
Recall:    0.69
F1-Score:  0.81
Accuracy:  1.00
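
The accuracy of 1.00 should be read with the class imbalance in mind. The arithmetic sketch below uses the list sizes and the number of gold-standard matches reported earlier, and assumes those 112 matches are the only positive pairs among all candidate pairs. Under that assumption, even a baseline that never predicts a match reaches an accuracy that rounds to 1.00.

```python
n_fodors, n_zagats = 533, 331   # list sizes reported in the dataset overview
n_pairs = n_fodors * n_zagats   # every Fodor's record paired with every Zagat's record
n_matches = 112                 # assumption: the gold-standard matches are the only positives

# Trivial baseline: predict "no match" for every pair
baseline_accuracy = (n_pairs - n_matches) / n_pairs
baseline_recall = 0 / n_matches  # it finds none of the true matches

print(n_pairs)                     # 176423
print(f"{baseline_accuracy:.4f}")  # 0.9994
print(baseline_recall)             # 0.0
```

Accuracy therefore says little about matching quality here; precision, recall, and F1 are the metrics that separate the methods.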





### Method 1b: Fuzzy Matching with token comparison
Next, we use the token_set_ratio function from the Python library fuzzywuzzy, which handles variations in word order, extra or missing words, and minor textual differences by comparing the sets of words (tokens) in the two strings.

This method works well when differences are mainly textual rather than semantic or context-based. Such differences are common in the restaurant dataset used in this analysis, for example "Palm Restaurant" versus "The Palm Restaurant".
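
The idea behind a token-set comparison can be sketched with the standard library alone. The code below is a simplified illustration, not fuzzywuzzy's actual implementation: the names `ratio` and `token_set_sketch` are assumptions, and `difflib.SequenceMatcher` stands in for the library's string ratio, so scores may differ from those produced in the notebook.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> float:
    """Similarity of two strings on a 0-100 scale."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def token_set_sketch(s1: str, s2: str) -> float:
    """Compare the shared tokens against each string's shared-plus-unique tokens."""
    t1, t2 = set(s1.split()), set(s2.split())
    common = " ".join(sorted(t1 & t2))                       # tokens in both strings
    combined_1 = (common + " " + " ".join(sorted(t1 - t2))).strip()
    combined_2 = (common + " " + " ".join(sorted(t2 - t1))).strip()
    # Take the best of the three pairwise comparisons
    return max(ratio(common, combined_1),
               ratio(common, combined_2),
               ratio(combined_1, combined_2))


print(token_set_sketch("palm restaurant", "the palm restaurant"))  # 100.0
print(token_set_sketch("apple pan the", "the apple pan"))          # 100.0
print(token_set_sketch("cafe bizou", "campanile"))                 # well below 100
```

Sorting tokens removes word-order effects, and comparing against the shared tokens means extra words do not lower the score. The flip side is that any string whose tokens are a subset of the other's scores 100 in this sketch (for example "palm" versus "the palm restaurant"), which is one reason to combine the name score with address and city scores.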

In [60]:
from fuzzywuzzy import fuzz
from tqdm import tqdm
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
import numpy as np
import pandas as pd

# Generate all possible pairs with name, address, and city similarity
pairs = []
for i, f_row in tqdm(fodors.iterrows(), total=len(fodors), desc='Generating token set ratio scores'):
    for j, z_row in zagats.iterrows():
        name_score = fuzz.token_set_ratio(f_row['name'], z_row['name'])
        address_score = fuzz.token_set_ratio(f_row['addr'], z_row['addr'])
        city_score = fuzz.token_set_ratio(f_row['city'], z_row['city'])
        avg_score = (name_score + address_score + city_score) / 3  # Average the three scores
        pairs.append({
            'fodors_id': f_row['id'],
            'zagats_id': z_row['id'],
            'score': avg_score
        })

fuzzy_df = pd.DataFrame(pairs)

# Ground truth labels
fuzzy_df['actual_match'] = fuzzy_df.apply(
    lambda row: ((matches['fodors_id'] == row['fodors_id']) & 
                 (matches['zagats_id'] == row['zagats_id'])).any(),
    axis=1
)
print(f"\nFuzzy pairs preview: {fuzzy_df.shape[0]} rows, {fuzzy_df.shape[1]} columns")
print(fuzzy_df.head(5))
print()

# Optimize threshold by maximizing F1-score
thresholds = np.arange(70, 100, 1)
best_f1 = 0
best_threshold = 0

for threshold in tqdm(thresholds, desc='Optimizing token set ratio threshold'):
    fuzzy_df['predicted_match'] = fuzzy_df['score'] >= threshold
    f1 = f1_score(fuzzy_df['actual_match'], fuzzy_df['predicted_match'])
    if f1 > best_f1:
        best_f1 = f1
        best_threshold = threshold

# Final evaluation using optimized threshold
fuzzy_df['predicted_match'] = fuzzy_df['score'] >= best_threshold
precision = precision_score(fuzzy_df['actual_match'], fuzzy_df['predicted_match'])
recall = recall_score(fuzzy_df['actual_match'], fuzzy_df['predicted_match'])
accuracy = accuracy_score(fuzzy_df['actual_match'], fuzzy_df['predicted_match'])

# Results
print(f"\nOptimized Threshold: {best_threshold}")
print(f"Fuzzy Matching Results:")
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"F1-Score:  {best_f1:.2f}")
print(f"Accuracy:  {accuracy:.2f}")


Generating token set ratio scores: 100%|██████████| 533/533 [00:40<00:00, 13.17it/s]



Fuzzy pairs preview: 176423 rows, 4 columns
   fodors_id  zagats_id      score  actual_match
0        534          1  31.000000         False
1        534          2  33.000000         False
2        534          3  32.333333         False
3        534          4  28.000000         False
4        534          5  28.666667         False



Optimizing token set ratio threshold: 100%|██████████| 30/30 [00:00<00:00, 37.62it/s]



Optimized Threshold: 88
Fuzzy Matching Results:
Precision: 0.97
Recall:    0.89
F1-Score:  0.93
Accuracy:  1.00


### Method 2: TF-IDF Vectorization + Cosine Similarity


### Method 3: Sentence-BERT Embeddings + Cosine Similarity


### Method 4: LLM Matching


### Method 5: Supervised Machine Learning Classifier


### Comparative Evaluation


### Conclusion
