# Entity Matching Overview
### Final Project for Kodołamacz's Data Science Bootcamp
Author: Piotr Zioło

### Introduction
Entity matching (also known as record linkage) is the process of identifying which records in two or more datasets refer to the same real-world entity. High-quality entity matching allows organizations to consolidate information, eliminate duplicates, and gain a unified view of their data. It is especially important in scenarios such as consolidating the business lists held in the CRMs of two companies that are undergoing a merger.

In this project, we focus on matching restaurant entities from two restaurant guides: Fodor's and Zagat's. The goal is to determine which entries in the Fodor's restaurant list correspond to the same establishments in the Zagat's list. Since the two sources may use slightly different names, address formats, or phone number conventions for the same restaurant, simple joins on these fields would miss many true matches. Thus, we will compare several more advanced approaches to entity matching:
- Fuzzy String Matching – using string similarity (e.g. Levenshtein distance) primarily on textual fields like name.
- TF–IDF + Cosine Similarity – treating restaurant records as documents and measuring cosine similarity of TF-IDF feature vectors.
- Transformer Embeddings + Cosine – using pre-trained language model embeddings (Sentence-BERT) for each record and measuring vector cosine similarity.
- Large Language Model – leveraging an LLM via API to semantically compare two records and decide whether they refer to the same restaurant.
- Supervised Machine Learning – training a classifier on labeled matching/non-matching record pairs, using multiple features (text similarity scores, etc.).

We will evaluate each method on accuracy, precision, recall, and F1-score for identifying matches. We will also compare their runtime performance, scalability, and cost. By the end, we should understand which approach works best for this scenario and what the considerations are for deploying each at scale.
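
As a quick reference for the metrics named above, the following sketch computes them from confusion-matrix counts. The function name `classification_metrics` and the example counts are illustrative assumptions, not results from this project.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1, and accuracy from confusion-matrix counts."""
    # Precision: share of predicted matches that are true matches
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of true matches that were predicted as matches
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    # Accuracy: share of all pairs classified correctly
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Made-up counts, for illustration only
example = classification_metrics(tp=8, fp=2, fn=2, tn=88)
print(example)
```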

### Dataset Overview
The Fodor's–Zagat's restaurant dataset is designed to serve as a benchmark for evaluating entity matching methods. It was put together by Anna Primpeli and Christian Bizer of the University of Mannheim in Germany.

It consists of two lists of restaurants, one from Fodor's (533 entries) and one from Zagat's (331 entries). Each restaurant has the following attributes:
- id (a unique identifier in each data source)
- name
- addr
- city
- phone
- type (cuisine/category of restaurant)

In addition to the two source lists, there is a gold standard file that indicates which Fodor's and Zagat's records refer to the same real-world restaurant. It includes 112 matching pairs (true matches) and 488 non-matching pairs (true negatives) that have been manually annotated.

The original dataset is available through the Linkage Library (University of Michigan ICPSR) under the project “Restaurants (Fodors-Zagats), Augmented Version, Fixed Splits (ICPSR 127242)”: https://linkagelibrary.icpsr.umich.edu/linkagelibrary/project/127242/version/V1/view?path=/linkagelibrary/127242/fcr:versions/V1/restaurants_-Fodors-Zagats-&type=folder#tab-dataDocs. For ease of use, all files used in the analysis are included in the repository.

### Data Loading and Preprocessing


In [44]:
import pandas as pd

# Load datasets
fodors = pd.read_csv(
    'data/fodors.csv', 
    # index_col=0, # Leave id column as a regular column to simplify merging
    usecols=lambda x: x != 'class', # Exclude 'class' column which is irrelevant to the analysis
    quotechar="'", # Ensure apostrophes are properly interpreted
    escapechar='\\'
)

zagats = pd.read_csv(
    'data/zagats.csv',
    # index_col=0, 
    usecols=lambda x: x != 'class',
    quotechar="'",
    escapechar='\\'
)

matches = pd.read_csv('data/matches_fodors_zagats.csv')

print(f"Fodor's preview: {fodors.shape[0] } rows, {fodors.shape[1]} columns")
print(fodors.head(5))
print()

print(f"Zagat's preview: {zagats.shape[0] } rows, {zagats.shape[1]} columns")
print(zagats.head(5))
print()

print(f"Matches preview: {matches.shape[0] } rows, {matches.shape[1]} columns")
print(matches.head(5))
print()


Fodor's preview: 533 rows, 6 columns
    id                       name                    addr          city  \
0  534  arnie morton's of chicago  435 s. la cienega blv.   los angeles   
1  535         art's delicatessen     12224 ventura blvd.   studio city   
2  536              hotel bel-air    701 stone canyon rd.       bel air   
3  537                 cafe bizou     14016 ventura blvd.  sherman oaks   
4  538                  campanile     624 s. la brea ave.   los angeles   

          phone         type  
0  310/246-1501     american  
1  818/762-1221     american  
2  310/472-1211  californian  
3  818/788-3536       french  
4  213/938-1447     american  

Zagat's preview: 331 rows, 6 columns
   id             name                            addr              city  \
0   1   apple pan  the             10801 w. pico blvd.           west la   
1   2      asahi ramen             2027 sawtelle blvd.           west la   
2   3       baja fresh                 3345 kimber dr.  west

In [45]:
# Clean and standardize string fields 
# to ensure consistency and improve the effectiveness of similarity algorithms
def preprocess(df, columns):
    df = df.copy()
    for col in columns:
        df[col] = (
            df[col].astype(str) # Ensure all values are strings
                   .str.lower() # Convert to lowercase
                   .str.replace(r'[^a-z0-9\s]', '', regex=True) # Remove special characters
                   .str.replace(r'\s+', ' ', regex=True) # Replace multiple spaces with a single space
                   .str.strip() # Remove leading/trailing whitespace
        )
    return df

columns_to_preprocess = ['name', 'addr', 'city', 'type']

fodors = preprocess(fodors, columns_to_preprocess)
zagats = preprocess(zagats, columns_to_preprocess)
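
To see what these cleaning steps do to a single value, the sketch below applies the same sequence (lowercase, strip special characters, collapse whitespace) with Python's `re` module instead of pandas. The helper name `normalize_text` is hypothetical and used only for illustration.

```python
import re

def normalize_text(value):
    """Apply the same cleaning steps as the pandas pipeline to one string."""
    text = str(value).lower()                # lowercase
    text = re.sub(r'[^a-z0-9\s]', '', text)  # drop anything but letters, digits, whitespace
    text = re.sub(r'\s+', ' ', text)         # collapse runs of whitespace
    return text.strip()                      # trim leading/trailing whitespace

print(normalize_text("hotel bel-air"))
print(normalize_text("435 s. la cienega blv."))
print(normalize_text("arnie morton's of chicago"))
```

Note that removing punctuation also fuses hyphenated words, which is why values such as "hotel belair" and "ritzcarlton" appear in the result tables later on.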

### Method 0: Regular joins
How many restaurants would be matched if we naively required exact equality of name, address, and city?

In [46]:
# Perform the exact match join
matched_df = fodors.merge(
    zagats,
    on=['name', 'addr', 'city'],
    how='inner',
    suffixes=('_fodors', '_zagats')
)

# Check against gold standard matches
matched_df['in_gold_standard'] = matched_df.apply(
    lambda row: ((matches['fodors_id'] == row['id_fodors']) & 
                 (matches['zagats_id'] == row['id_zagats'])).any(), axis=1
)

# Summarize the results
total_matched = len(matched_df)
matched_in_gold = matched_df['in_gold_standard'].sum()

print(f"Total exact matches found: {total_matched}")
print(f"Matches present in gold standard: {matched_in_gold}")
print(f"Matches NOT present in gold standard: {total_matched - matched_in_gold}")

Total exact matches found: 26
Matches present in gold standard: 26
Matches NOT present in gold standard: 0


With regular joins, we would find only 26 of the 112 verified matches (23%). A method that tolerates small discrepancies between the two sources is therefore needed.
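
The pair below, taken from the cleaned data, illustrates why an exact join is so brittle: the two records differ by a single character in the address, yet their join keys are not equal. The snippet also recomputes the share of verified matches recovered by the join. Plain tuples stand in for the DataFrame rows here, which is a simplification for illustration.

```python
# (name, addr, city) keys for the same restaurant in the two cleaned sources
fodors_key = ("arnie mortons of chicago", "435 s la cienega blv", "los angeles")
zagats_key = ("arnie mortons of chicago", "435 s la cienega blvd", "los angeles")

# An exact join only pairs records whose keys are identical
keys_equal = fodors_key == zagats_key
print(keys_equal)

# Share of the 112 verified matches recovered by the exact join (its recall)
exact_join_recall = 26 / 112
print(f"{exact_join_recall:.1%}")
```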

### Method 1a: Fuzzy Matching with Levenshtein distance
Fuzzy matching is a text-matching technique used to identify similar strings by directly comparing their textual similarity, even if they are not exactly identical. It relies purely on character-level or token-level string comparisons.

First, we'll use Levenshtein distance which measures the minimum number of single-character edits (insertions, deletions, substitutions) needed to convert one string into another.

Levenshtein distance directly compares the strings character-by-character without accounting for token ordering or semantic similarity. Thus, it's sensitive to word order, length differences, and spelling variations.
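
To make the definition concrete, here is a minimal pure-Python sketch of the classic dynamic-programming computation. It is not the implementation used by the `Levenshtein` package imported later in this notebook; the function names `levenshtein` and `normalized_similarity` are illustrative, and the normalization mirrors the 0–100 scaling used in this analysis.

```python
def levenshtein(s1, s2):
    """Minimum number of single-character insertions, deletions, substitutions."""
    # previous[j] holds the distance between the processed prefix of s1 and s2[:j]
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # turning s1[:i] into an empty string takes i deletions
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]

def normalized_similarity(s1, s2):
    """Scale the distance to a 0-100 similarity, as done in this notebook."""
    max_len = max(len(s1), len(s2))
    if max_len == 0:
        return 100.0
    return 100 * (1 - levenshtein(s1, s2) / max_len)

print(levenshtein("kitten", "sitting"))
print(levenshtein("435 s la cienega blv", "435 s la cienega blvd"))
print(normalized_similarity("hotel belair", "belair hotel"))
```

The last call shows the sensitivity to word order: swapping the two words of a name is as costly as replacing the whole string.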

In [74]:
# Create reusable functions for fuzzy matching

import numpy as np
import pandas as pd
from tqdm import tqdm
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

# Generate matching pairs with a customizable similarity function
def generate_matching_pairs(df1, df2, similarity_func, cols=('name', 'addr', 'city'), desc='Generating pairs'):
    pairs = []
    for i, row1 in tqdm(df1.iterrows(), total=len(df1), desc=desc):
        for j, row2 in df2.iterrows():
            scores = [similarity_func(row1[col], row2[col]) for col in cols]
            avg_score = np.mean(scores)
            pairs.append({
                'fodors_id': row1['id'],
                'zagats_id': row2['id'],
                'score': avg_score
            })
    return pd.DataFrame(pairs)

# Add ground truth labels to dataframe
def add_ground_truth_labels(df, matches_df):
    match_set = set(zip(matches_df['fodors_id'], matches_df['zagats_id']))
    df['actual_match'] = df.apply(
        lambda row: (row['fodors_id'], row['zagats_id']) in match_set,
        axis=1
    )
    return df

# Optimize threshold for the best F1-score
def optimize_threshold(df, thresholds=np.arange(60, 100, 1), desc="Optimizing threshold"):
    best_f1, best_threshold = 0, 0
    for threshold in tqdm(thresholds, desc=desc):
        df['predicted_match'] = df['score'] >= threshold
        f1 = f1_score(df['actual_match'], df['predicted_match'])
        if f1 > best_f1:
            best_f1, best_threshold = f1, threshold
    
    df['predicted_match'] = df['score'] >= best_threshold
    precision = precision_score(df['actual_match'], df['predicted_match'])
    recall = recall_score(df['actual_match'], df['predicted_match'])
    accuracy = accuracy_score(df['actual_match'], df['predicted_match'])

    metrics = {
        'best_threshold': best_threshold,
        'precision': precision,
        'recall': recall,
        'f1_score': best_f1,
        'accuracy': accuracy
    }

    return df, metrics

# View results
def preview_classification_outcomes(df, fodors, zagats, num_samples=5):
    # Define classification outcomes
    conditions = [
        (df['actual_match'] == True) & (df['predicted_match'] == True),
        (df['actual_match'] == False) & (df['predicted_match'] == False),
        (df['actual_match'] == False) & (df['predicted_match'] == True),
        (df['actual_match'] == True) & (df['predicted_match'] == False),
    ]
    outcomes = ['TP', 'TN', 'FP', 'FN']
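    # Explicit string default keeps dtypes consistent (newer NumPy versions reject the integer default 0 here)
    df['classification'] = np.select(conditions, outcomes, default='')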

    # Preview each classification outcome
    for outcome in outcomes:
        subset = df[df['classification'] == outcome].head(num_samples)
        merged_df = subset.merge(fodors, left_on='fodors_id', right_on='id', suffixes=('', '_fodors')) \
                          .merge(zagats, left_on='zagats_id', right_on='id', suffixes=('_fodors', '_zagats'))

        display_cols = [
            'name_fodors', 'addr_fodors', 'city_fodors',
            'name_zagats', 'addr_zagats', 'city_zagats',
            'score', 'actual_match', 'predicted_match'
        ]
        print(f"\n{outcome} examples:")
        display(merged_df[display_cols])


In [66]:
from Levenshtein import distance as levenshtein_distance

# Normalized Levenshtein similarity
def normalized_levenshtein(s1, s2):
    dist = levenshtein_distance(s1, s2)
    max_len = max(len(s1), len(s2))
    if max_len == 0:
        return 100
    return 100 * (1 - dist / max_len)

# Generate pairs
lev_df = generate_matching_pairs(
    fodors, zagats,
    similarity_func=normalized_levenshtein,
    desc='Levenshtein scoring'
)

# Add ground truth labels
lev_df = add_ground_truth_labels(lev_df, matches)

print(f"\nLevenshtein pairs preview: {lev_df.shape[0]} rows, {lev_df.shape[1]} columns")
print(lev_df.head())

# Optimize threshold
lev_df, lev_metrics = optimize_threshold(
    lev_df,
    thresholds=np.arange(60, 100, 1),
    desc="Optimizing Levenshtein threshold"
)

# Results
print(f"\nOptimized Levenshtein Threshold: {lev_metrics['best_threshold']}")
print("Levenshtein Matching Results:")
print(f"Precision: {lev_metrics['precision']:.2f}")
print(f"Recall:    {lev_metrics['recall']:.2f}")
print(f"F1-Score:  {lev_metrics['f1_score']:.2f}")
print(f"Accuracy:  {lev_metrics['accuracy']:.2f}")


Levenshtein scoring: 100%|██████████| 533/533 [00:13<00:00, 38.98it/s]



Levenshtein pairs preview: 176423 rows, 4 columns
   fodors_id  zagats_id      score  actual_match
0        534          1  24.393939         False
1        534          2  20.227273         False
2        534          3  20.138889         False
3        534          4  21.028588         False
4        534          5  19.746377         False


Optimizing Levenshtein threshold: 100%|██████████| 40/40 [00:01<00:00, 39.76it/s]



Optimized Levenshtein Threshold: 78
Levenshtein Matching Results:
Precision: 0.97
Recall:    0.69
F1-Score:  0.81
Accuracy:  1.00


In [75]:
# View examples of correctly and incorrectly predicted matches
# split by classification outcome: True Positive, True Negative, False Positive, False Negative
preview_classification_outcomes(lev_df, fodors, zagats)




TP examples:


| | name_fodors | addr_fodors | city_fodors | name_zagats | addr_zagats | city_zagats | score | actual_match | predicted_match |
|---|---|---|---|---|---|---|---|---|---|
| 0 | arnie mortons of chicago | 435 s la cienega blv | los angeles | arnie mortons of chicago | 435 s la cienega blvd | los angeles | 98.412698 | True | True |
| 1 | arts delicatessen | 12224 ventura blvd | studio city | arts deli | 12224 ventura blvd | studio city | 84.313725 | True | True |
| 2 | cafe bizou | 14016 ventura blvd | sherman oaks | cafe bizou | 14016 ventura blvd | sherman oaks | 100.0 | True | True |
| 3 | campanile | 624 s la brea ave | los angeles | campanile | 624 s la brea ave | los angeles | 100.0 | True | True |
| 4 | chinois on main | 2709 main st | santa monica | chinois on main | 2709 main st | santa monica | 100.0 | True | True |



TN examples:


| | name_fodors | addr_fodors | city_fodors | name_zagats | addr_zagats | city_zagats | score | actual_match | predicted_match |
|---|---|---|---|---|---|---|---|---|---|
| 0 | arnie mortons of chicago | 435 s la cienega blv | los angeles | apple pan the | 10801 w pico blvd | west la | 24.393939 | False | False |
| 1 | arnie mortons of chicago | 435 s la cienega blv | los angeles | asahi ramen | 2027 sawtelle blvd | west la | 20.227273 | False | False |
| 2 | arnie mortons of chicago | 435 s la cienega blv | los angeles | baja fresh | 3345 kimber dr | westlake village | 20.138889 | False | False |
| 3 | arnie mortons of chicago | 435 s la cienega blv | los angeles | belvedere the | 9882 little santa monica blvd | beverly hills | 21.028588 | False | False |
| 4 | arnie mortons of chicago | 435 s la cienega blv | los angeles | benitas frites | 1433 third st promenade | santa monica | 19.746377 | False | False |



FP examples:


| | name_fodors | addr_fodors | city_fodors | name_zagats | addr_zagats | city_zagats | score | actual_match | predicted_match |
|---|---|---|---|---|---|---|---|---|---|
| 0 | restaurant ritzcarlton atlanta | 181 peachtree st | atlanta | ritzcarlton cafe atlanta | 181 peachtree st | atlanta | 82.222222 | False | True |
| 1 | empress court | 3570 las vegas blvd s | las vegas | palace court | 3570 las vegas blvd s | las vegas | 82.051282 | False | True |



FN examples:


| | name_fodors | addr_fodors | city_fodors | name_zagats | addr_zagats | city_zagats | score | actual_match | predicted_match |
|---|---|---|---|---|---|---|---|---|---|
| 0 | hotel belair | 701 stone canyon rd | bel air | belair hotel | 701 stone canyon rd | bel air | 66.666667 | True | False |
| 1 | fenix | 8358 sunset blvd west | hollywood | fenix at the argyle | 8358 sunset blvd | w hollywood | 61.441482 | True | False |
| 2 | grill on the alley | 9560 dayton way | los angeles | grill the | 9560 dayton way | beverly hills | 55.128205 | True | False |
| 3 | restaurant katsu | 1972 n hillhurst ave | los angeles | katsu | 1972 hillhurst ave | los feliz | 58.598485 | True | False |
| 4 | lorangerie | 903 n la cienega blvd | los angeles | lorangerie | 903 n la cienega blvd | w hollywood | 66.666667 | True | False |


As expected, Levenshtein similarity did well when names, addresses, and cities were either identical or differed only slightly because of typos or spelling variants. It failed, though, when parts of the strings were ordered differently or missing.
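
One caveat when reading the accuracy figure reported above: almost all candidate pairs are non-matches, so accuracy stays close to 1 no matter how many true matches are found. The short calculation below is a sketch that assumes all 112 verified matches are among the 533 × 331 candidate pairs; it shows the accuracy of a trivial rule that never predicts a match.

```python
# Sizes taken from the dataset overview
n_fodors, n_zagats = 533, 331
n_true_matches = 112  # assumption: every verified match is among the candidate pairs

# Every Fodor's record is compared with every Zagat's record
n_pairs = n_fodors * n_zagats

# A rule that labels every pair "non-match" is wrong only on the true matches
baseline_accuracy = (n_pairs - n_true_matches) / n_pairs

print(n_pairs)
print(f"{baseline_accuracy:.4f}")
```

Because this baseline already rounds to 1.00, precision, recall, and F1 are the more informative metrics for comparing methods on this dataset.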

### Method 1b: Fuzzy Matching with token comparison
Next, we use the Python library fuzzywuzzy, specifically the token_set_ratio function, because it efficiently handles variations in word order, extra or missing words, and minor textual differences by comparing sets of words (tokens) between strings.

This method is particularly effective when differences are mainly textual rather than semantic or context-based. Such differences are common in the restaurant dataset used in this analysis, for example "Palm Restaurant" versus "The Palm Restaurant".
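
The idea behind a token-set comparison can be sketched with the standard library alone. The code below is a simplified approximation built on `difflib.SequenceMatcher`, not fuzzywuzzy's actual implementation, so its scores may differ from the library's; the names `ratio` and `token_set_sketch` are illustrative.

```python
from difflib import SequenceMatcher

def ratio(a, b):
    """Character-level similarity of two strings on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def token_set_sketch(s1, s2):
    """Compare the shared tokens against each string's shared-plus-unique tokens."""
    tokens1, tokens2 = set(s1.split()), set(s2.split())
    common = " ".join(sorted(tokens1 & tokens2))
    only1 = " ".join(sorted(tokens1 - tokens2))
    only2 = " ".join(sorted(tokens2 - tokens1))
    combined1 = (common + " " + only1).strip()
    combined2 = (common + " " + only2).strip()
    # Taking the best of the three comparisons rewards strings whose tokens
    # overlap heavily, regardless of word order or extra words
    return max(
        ratio(common, combined1),
        ratio(common, combined2),
        ratio(combined1, combined2),
    )

print(token_set_sketch("hotel belair", "belair hotel"))
print(token_set_sketch("palm restaurant", "the palm restaurant"))
print(token_set_sketch("empress court", "palace court"))
```

Reordered words and extra words no longer lower the score, which is exactly the weakness of the character-level comparison. The flip side is that two different names sharing most of their tokens can also score very high.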

In [None]:
from fuzzywuzzy import fuzz

# Generate pairs
fuzzy_df = generate_matching_pairs(
    fodors, zagats,
    similarity_func=fuzz.token_set_ratio,
    desc='Generating token set ratio scores'
)

# Add ground truth labels
fuzzy_df = add_ground_truth_labels(fuzzy_df, matches)

print(f"\nFuzzy pairs preview: {fuzzy_df.shape[0]} rows, {fuzzy_df.shape[1]} columns")
print(fuzzy_df.head())

# Optimize threshold
fuzzy_df, fuzzy_metrics = optimize_threshold(
    fuzzy_df,
    thresholds=np.arange(60, 100, 1),
    desc='Optimizing token set ratio threshold'
)

# Results
print(f"\nOptimized Token Set Ratio Threshold: {fuzzy_metrics['best_threshold']}")
print("Fuzzy Matching Results:")
print(f"Precision: {fuzzy_metrics['precision']:.2f}")
print(f"Recall:    {fuzzy_metrics['recall']:.2f}")
print(f"F1-Score:  {fuzzy_metrics['f1_score']:.2f}")
print(f"Accuracy:  {fuzzy_metrics['accuracy']:.2f}")


Generating token set ratio scores: 100%|██████████| 533/533 [00:45<00:00, 11.59it/s]



Fuzzy pairs preview: 176423 rows, 4 columns
   fodors_id  zagats_id      score  actual_match
0        534          1  31.000000         False
1        534          2  33.000000         False
2        534          3  32.333333         False
3        534          4  28.000000         False
4        534          5  28.666667         False


Optimizing token set ratio threshold: 100%|██████████| 30/30 [00:00<00:00, 41.06it/s]



Optimized Token Set Ratio Threshold: 88
Fuzzy Matching Results:
Precision: 0.97
Recall:    0.89
F1-Score:  0.93
Accuracy:  1.00


In [76]:
preview_classification_outcomes(fuzzy_df, fodors, zagats)


TP examples:


| | name_fodors | addr_fodors | city_fodors | name_zagats | addr_zagats | city_zagats | score | actual_match | predicted_match |
|---|---|---|---|---|---|---|---|---|---|
| 0 | arnie mortons of chicago | 435 s la cienega blv | los angeles | arnie mortons of chicago | 435 s la cienega blvd | los angeles | 99.333333 | True | True |
| 1 | arts delicatessen | 12224 ventura blvd | studio city | arts deli | 12224 ventura blvd | studio city | 89.666667 | True | True |
| 2 | hotel belair | 701 stone canyon rd | bel air | belair hotel | 701 stone canyon rd | bel air | 100.0 | True | True |
| 3 | cafe bizou | 14016 ventura blvd | sherman oaks | cafe bizou | 14016 ventura blvd | sherman oaks | 100.0 | True | True |
| 4 | campanile | 624 s la brea ave | los angeles | campanile | 624 s la brea ave | los angeles | 100.0 | True | True |



TN examples:


| | name_fodors | addr_fodors | city_fodors | name_zagats | addr_zagats | city_zagats | score | actual_match | predicted_match |
|---|---|---|---|---|---|---|---|---|---|
| 0 | arnie mortons of chicago | 435 s la cienega blv | los angeles | apple pan the | 10801 w pico blvd | west la | 31.0 | False | False |
| 1 | arnie mortons of chicago | 435 s la cienega blv | los angeles | asahi ramen | 2027 sawtelle blvd | west la | 33.0 | False | False |
| 2 | arnie mortons of chicago | 435 s la cienega blv | los angeles | baja fresh | 3345 kimber dr | westlake village | 32.333333 | False | False |
| 3 | arnie mortons of chicago | 435 s la cienega blv | los angeles | belvedere the | 9882 little santa monica blvd | beverly hills | 28.0 | False | False |
| 4 | arnie mortons of chicago | 435 s la cienega blv | los angeles | benitas frites | 1433 third st promenade | santa monica | 28.666667 | False | False |



FP examples:


| | name_fodors | addr_fodors | city_fodors | name_zagats | addr_zagats | city_zagats | score | actual_match | predicted_match |
|---|---|---|---|---|---|---|---|---|---|
| 0 | cafe ritzcarlton buckhead | 3434 peachtree rd | atlanta | ritzcarlton dining room buckhead | 3434 peachtree rd ne | atlanta | 96.333333 | False | True |
| 1 | dining room ritzcarlton buckhead | 3434 peachtree rd | atlanta | ritzcarlton cafe buckhead | 3434 peachtree rd ne | atlanta | 96.333333 | False | True |
| 2 | restaurant ritzcarlton atlanta | 181 peachtree st | atlanta | ritzcarlton cafe atlanta | 181 peachtree st | atlanta | 96.0 | False | True |



FN examples:


| | name_fodors | addr_fodors | city_fodors | name_zagats | addr_zagats | city_zagats | score | actual_match | predicted_match |
|---|---|---|---|---|---|---|---|---|---|
| 0 | grill on the alley | 9560 dayton way | los angeles | grill the | 9560 dayton way | beverly hills | 75.0 | True | False |
| 1 | restaurant katsu | 1972 n hillhurst ave | los angeles | katsu | 1972 hillhurst ave | los feliz | 86.666667 | True | False |
| 2 | lorangerie | 903 n la cienega blvd | los angeles | lorangerie | 903 n la cienega blvd | w hollywood | 72.666667 | True | False |
| 3 | locanda veneta | 3rd st | los angeles | locanda veneta | 8638 w third st | los angeles | 85.666667 | True | False |
| 4 | the palm | 9001 santa monica blvd | los angeles | palm the los angeles | 9001 santa monica blvd | w hollywood | 72.666667 | True | False |


While the token set ratio method outperformed Levenshtein similarity (the F1-score rose from 0.81 to 0.93, driven by an increase in recall from 0.69 to 0.89), a number of false negatives remain, and many of them appear to be caused by differences in city names. Fodor's lists these restaurants under Los Angeles, whereas Zagat's uses more granular locations such as Beverly Hills or West Hollywood.

To alleviate that, we could standardize city names, i.e. map them all to a common level such as the metropolitan area or the individual district, but this approach would be hard to scale to large datasets covering locations all over the world. Thus, let's review matching methods that take term importance, word meaning, and context into account.
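
For illustration, the city standardization mentioned above could be sketched as a simple lookup. The mapping below is a hypothetical, hand-written example covering only a few values seen in the false negatives; it is not part of the analysis, and maintaining such a table for worldwide data is precisely the scaling problem.

```python
# Hypothetical mapping from granular locations to a metropolitan-area label
CITY_TO_METRO = {
    "beverly hills": "los angeles",
    "w hollywood": "los angeles",
    "los feliz": "los angeles",
    "new york city": "new york",
}

def standardize_city(city):
    """Return the metro-level name if one is known, else the city unchanged."""
    return CITY_TO_METRO.get(city, city)

print(standardize_city("w hollywood"))
print(standardize_city("atlanta"))
```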

### Method 2: TF-IDF Vectorization + Cosine Similarity
TF-IDF stands for Term Frequency-Inverse Document Frequency. This vectorization method transforms text into numerical vectors that represent the importance of each word within a document. Rather than comparing strings directly, TF-IDF weighs words by how frequent they are in a record and how rare they are across the entire dataset. This makes it corpus-aware and insensitive to token order, although with word-level tokens a misspelled or abbreviated word still counts as a different term.

In our case, we'll use sklearn's TfidfVectorizer, which tokenizes the text into words, calculates how frequently each word appears within individual records (term frequency), and adjusts those values by how common each word is across all records (inverse document frequency). The resulting vectors emphasize unique and informative terms.

Then, cosine similarity measures how similar two vectors (records) are by computing the cosine of the angle between them. A cosine similarity close to 1 indicates high similarity (almost identical records), while a score close to 0 indicates low similarity.
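
The two steps above can be sketched in pure Python. This is a simplified illustration, not scikit-learn's code: it splits on whitespace (whereas `TfidfVectorizer`'s default tokenizer also drops single-character tokens), uses the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, and the function names are assumptions made for this example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one L2-normalized {token: weight} dict per document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each token occurs
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    # Smoothed inverse document frequency: rarer tokens get larger weights
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        weights = {t: count * idf[t] for t, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else weights)
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = [
    "hotel belair 701 stone canyon rd bel air",
    "belair hotel 701 stone canyon rd bel air",
    "asahi ramen 2027 sawtelle blvd west la",
]
vecs = tfidf_vectors(docs)
same_words = cosine(vecs[0], vecs[1])
no_overlap = cosine(vecs[0], vecs[2])
print(round(same_words, 6), round(no_overlap, 6))
```

The first two records contain the same words in a different order, so their vectors coincide and the similarity is 1; the third shares no token with the first, so the similarity is 0.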

In [78]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Combine relevant fields
fodors['combined'] = fodors['name'] + ' ' + fodors['addr'] + ' ' + fodors['city']
zagats['combined'] = zagats['name'] + ' ' + zagats['addr'] + ' ' + zagats['city']

# TF-IDF vectorization
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(pd.concat([fodors['combined'], zagats['combined']]))

fodors_tfidf = tfidf_matrix[:len(fodors)]
zagats_tfidf = tfidf_matrix[len(fodors):]

# Cosine similarity calculation (optimized, vectorized)
cosine_sim = cosine_similarity(fodors_tfidf, zagats_tfidf)

# Reformat similarity matrix into pairs DataFrame (consistent with previous methods)
fodors_ids = fodors['id'].values
zagats_ids = zagats['id'].values

pairs = []
for i in tqdm(range(len(fodors_ids)), desc='Generating TF-IDF Cosine similarity pairs'):
    for j in range(len(zagats_ids)):
        pairs.append({
            'fodors_id': fodors_ids[i],
            'zagats_id': zagats_ids[j],
            'score': cosine_sim[i, j]
        })

tfidf_df = pd.DataFrame(pairs)

# Reuse previously defined functions to add ground truth and optimize threshold
tfidf_df = add_ground_truth_labels(tfidf_df, matches)

# Threshold optimization
thresholds = np.arange(0.5, 1.0, 0.01)
tfidf_df, tfidf_metrics = optimize_threshold(
    tfidf_df,
    thresholds=thresholds,
    desc='Optimizing TF-IDF Cosine threshold'
)

# Display final metrics
print(f"\nTF-IDF + Cosine Similarity Matching Results:")
print(f"Optimized Threshold: {tfidf_metrics['best_threshold']:.2f}")
print(f"Precision: {tfidf_metrics['precision']:.2f}")
print(f"Recall:    {tfidf_metrics['recall']:.2f}")
print(f"F1-Score:  {tfidf_metrics['f1_score']:.2f}")
print(f"Accuracy:  {tfidf_metrics['accuracy']:.2f}")


Generating TF-IDF Cosine similarity pairs: 100%|██████████| 533/533 [00:00<00:00, 5791.26it/s]
Optimizing TF-IDF Cosine threshold: 100%|██████████| 50/50 [00:01<00:00, 43.15it/s]


TF-IDF + Cosine Similarity Matching Results:
Optimized Threshold: 0.65
Precision: 0.86
Recall:    0.93
F1-Score:  0.89
Accuracy:  1.00





In [80]:
preview_classification_outcomes(tfidf_df, fodors, zagats)


TP examples:


| | name_fodors | addr_fodors | city_fodors | name_zagats | addr_zagats | city_zagats | score | actual_match | predicted_match |
|---|---|---|---|---|---|---|---|---|---|
| 0 | arnie mortons of chicago | 435 s la cienega blv | los angeles | arnie mortons of chicago | 435 s la cienega blvd | los angeles | 0.895731 | True | True |
| 1 | arts delicatessen | 12224 ventura blvd | studio city | arts deli | 12224 ventura blvd | studio city | 0.810901 | True | True |
| 2 | hotel belair | 701 stone canyon rd | bel air | belair hotel | 701 stone canyon rd | bel air | 1.0 | True | True |
| 3 | cafe bizou | 14016 ventura blvd | sherman oaks | cafe bizou | 14016 ventura blvd | sherman oaks | 1.0 | True | True |
| 4 | campanile | 624 s la brea ave | los angeles | campanile | 624 s la brea ave | los angeles | 1.0 | True | True |



TN examples:


| | name_fodors | addr_fodors | city_fodors | name_zagats | addr_zagats | city_zagats | score | actual_match | predicted_match |
|---|---|---|---|---|---|---|---|---|---|
| 0 | arnie mortons of chicago | 435 s la cienega blv | los angeles | apple pan the | 10801 w pico blvd | west la | 0.052879 | False | False |
| 1 | arnie mortons of chicago | 435 s la cienega blv | los angeles | asahi ramen | 2027 sawtelle blvd | west la | 0.052149 | False | False |
| 2 | arnie mortons of chicago | 435 s la cienega blv | los angeles | baja fresh | 3345 kimber dr | westlake village | 0.0 | False | False |
| 3 | arnie mortons of chicago | 435 s la cienega blv | los angeles | belvedere the | 9882 little santa monica blvd | beverly hills | 0.0 | False | False |
| 4 | arnie mortons of chicago | 435 s la cienega blv | los angeles | benitas frites | 1433 third st promenade | santa monica | 0.0 | False | False |



FP examples:


| | name_fodors | addr_fodors | city_fodors | name_zagats | addr_zagats | city_zagats | score | actual_match | predicted_match |
|---|---|---|---|---|---|---|---|---|---|
| 0 | chins | 3200 las vegas blvd s | las vegas | mortons of chicago las vegas | 3200 las vegas blvd s | las vegas | 0.750678 | False | True |
| 1 | coyote cafe | 3799 las vegas blvd s | las vegas | tre visi | 3799 las vegas blvd s | las vegas | 0.662837 | False | True |
| 2 | cafe ritzcarlton buckhead | 3434 peachtree rd | atlanta | ritzcarlton dining room buckhead | 3434 peachtree rd ne | atlanta | 0.741368 | False | True |
| 3 | dining room ritzcarlton buckhead | 3434 peachtree rd | atlanta | ritzcarlton cafe buckhead | 3434 peachtree rd ne | atlanta | 0.725732 | False | True |
| 4 | restaurant ritzcarlton atlanta | 181 peachtree st | atlanta | ritzcarlton cafe atlanta | 181 peachtree st | atlanta | 0.885011 | False | True |



FN examples:


| | name_fodors | addr_fodors | city_fodors | name_zagats | addr_zagats | city_zagats | score | actual_match | predicted_match |
|---|---|---|---|---|---|---|---|---|---|
| 0 | philippes the original | 1001 n alameda st | los angeles | philippe the original | 1001 n alameda st | chinatown | 0.636315 | True | False |
| 1 | spago | 1114 horn ave | los angeles | spago los angeles | 8795 sunset blvd | w hollywood | 0.359161 | True | False |
| 2 | carnegie deli | 854 7th ave between 54th and 55th sts | new york | carnegie deli | 854 seventh ave | new york city | 0.60864 | True | False |
| 3 | les celebrites | 160 central park s | new york | les celebrites | 155 w 58th st | new york city | 0.539421 | True | False |
| 4 | mesa grill | 102 5th ave between 15th and 16th sts | new york | mesa grill | 102 fifth ave | new york city | 0.558394 | True | False |


### Method 3: Sentence-BERT Embeddings + Cosine Similarity


### Method 4: LLM Matching


### Method 5: Supervised Machine Learning Classifier


### Comparative Evaluation


### Conclusion
