# Entity Matching Overview
### Final Project for Kodołamacz's Data Science Bootcamp
Author: Piotr Zioło

### Introduction
Entity matching (also known as record linkage) is the process of identifying which records in two or more datasets refer to the same real-world entity. High-quality entity matching allows organizations to consolidate information, eliminate duplicates, and gain a unified view of their data. It is especially important in scenarios such as merging the business lists held in the separate CRMs of two companies that are undergoing a merger.

In this project, we focus on matching restaurant entities from two restaurant guides: Fodor's and Zagat's. The goal is to determine which entries in the Fodor's restaurant list correspond to the same establishments in the Zagat's list. Because the two sources may use slightly different names, address formats, or phone number conventions for the same restaurant, exact joins on these fields would not guarantee accurate results. We therefore compare several more robust approaches to entity matching:
- Fuzzy String Matching – using string similarity (e.g. Levenshtein distance) primarily on textual fields like name.
- TF-IDF + Cosine Similarity – treating restaurant records as documents and measuring cosine similarity of TF-IDF feature vectors.
- Transformer Embeddings + Cosine Similarity – computing a pre-trained language model embedding (Sentence-BERT) for each record and measuring cosine similarity between the vectors.
- Large Language Model – leveraging an LLM via API to semantically compare two records and decide whether they refer to the same restaurant.
- Supervised Machine Learning – training a classifier on labeled matching/non-matching record pairs, using multiple features (text similarity scores, etc.).
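To make the first approach in the list concrete, here is a minimal sketch of Levenshtein-based name similarity written in plain Python. It is not the project's actual implementation: the function names `levenshtein` and `name_similarity`, the lowercasing step, and the normalization by the longer string's length are all assumptions chosen for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def name_similarity(a: str, b: str) -> float:
    """Hypothetical helper: similarity in [0, 1], where 1.0 means the
    names are identical after trimming whitespace and lowercasing."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty names are trivially identical
    return 1.0 - levenshtein(a, b) / longest


# Illustrative usage with made-up names (not taken from the dataset).
# The 0.8 threshold is an arbitrary assumption, not a tuned value.
score = name_similarity("Cafe Roma", "Caffe Roma")
is_candidate_match = score >= 0.8
```

In practice a dedicated library would likely be faster than this pure-Python loop, and the similarity threshold would need to be tuned against labeled pairs rather than fixed in advance.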

We will evaluate each method on accuracy, precision, recall, and F1-score for identifying matches. We will also compare their runtime performance, scalability, and cost. By the end, we should understand which approach works best for this scenario and what the considerations are for deploying each at scale.
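The four quality metrics can be computed directly from the set of predicted matches and the set of true matches. The sketch below assumes each match is represented as a `(fodors_id, zagat_id)` tuple and that the total number of candidate pairs is known; the function name `evaluate_matches` and this representation are illustrative assumptions, not the project's own code.

```python
def evaluate_matches(predicted: set, truth: set, n_candidate_pairs: int) -> dict:
    """Compute accuracy, precision, recall, and F1 for predicted match pairs.

    predicted and truth are sets of (fodors_id, zagat_id) tuples (assumed
    representation); n_candidate_pairs is the number of pairs compared.
    """
    tp = len(predicted & truth)   # predicted matches that are real
    fp = len(predicted - truth)   # predicted matches that are wrong
    fn = len(truth - predicted)   # real matches that were missed
    tn = n_candidate_pairs - tp - fp - fn  # correctly rejected pairs

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / n_candidate_pairs if n_candidate_pairs else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy example with made-up IDs: 4 x 4 = 16 candidate pairs,
# three true matches, two of which are found plus one false positive.
metrics = evaluate_matches(
    predicted={(1, 1), (2, 2), (3, 4)},
    truth={(1, 1), (2, 2), (3, 3)},
    n_candidate_pairs=16,
)
```

Because non-matching pairs vastly outnumber matching ones in entity matching, accuracy tends to look high even for weak matchers, so precision, recall, and F1 are usually the more informative numbers.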

### Dataset Overview


### Data Loading and Preprocessing


### Method 1: Fuzzy Matching


### Method 2: TF-IDF Vectorization + Cosine Similarity


### Method 3: Sentence-BERT Embeddings + Cosine Similarity


### Method 4: LLM Matching


### Method 5: Supervised Machine Learning Classifier


### Comparative Evaluation


### Conclusion
