# Efficient Duplicate Detection with Embeddings and FAISS

Embeddings are dense numerical vectors that represent data such as text, placing semantically similar items close together in the vector space. This lets us compare texts by meaning rather than by exact wording. In this notebook, we deduplicate textual data by encoding each text with a pretrained model from the SentenceTransformers library and comparing the resulting vectors. FAISS, a library optimized for similarity search, runs the nearest-neighbor search over these embeddings and scales to large datasets. Any pair of texts whose cosine similarity exceeds a chosen threshold is flagged as a duplicate; the benchmark in this notebook measures how accurate and how fast this is for several embedding models.
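
To make the similarity search concrete, here is a minimal sketch that uses only NumPy and hand-written toy vectors (not real model output). It assumes the vectors are L2-normalized, in which case the inner product equals cosine similarity, and it finds neighbors by brute force, which is the same exact computation a flat inner-product index performs. The helper names `normalize` and `top_k_neighbors` are illustrative and are not part of FAISS or SentenceTransformers.

```python
import numpy as np

def normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit L2 norm."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / norms

def top_k_neighbors(embeddings: np.ndarray, k: int):
    """Brute-force inner-product search of every row against all rows."""
    scores = embeddings @ embeddings.T
    # Sort each row by descending similarity and keep the k best columns.
    order = np.argsort(-scores, axis=1, kind="stable")[:, :k]
    return np.take_along_axis(scores, order, axis=1), order

# Toy 3-dimensional "embeddings": rows 0 and 1 point in almost the same direction.
toy = normalize(np.array([
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],
    [0.0, 1.0, 0.0],
], dtype=np.float32))

similarities, neighbors = top_k_neighbors(toy, k=2)
```

Each row of `neighbors` lists the closest items to that row, and the matching row of `similarities` holds their cosine similarities. Every item matches itself with a similarity of about 1, so the self-match has to be ignored when looking for duplicates.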

In [1]:
import os
import sys
from typing import List
from types import SimpleNamespace

ROOT_DIR = os.path.dirname(os.getcwd())
sys.path.append(ROOT_DIR)

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

from utils import *


In [2]:
df = prepare_dataset()

Loading datasets: 100%|██████████| 5/5 [00:03<00:00,  1.26it/s]


In [3]:
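class EmbeddingDeduplicator:
    '''
    Flags near-duplicate texts by searching their embeddings with FAISS.
    '''
    def __init__(self, model_name: str = 'sentence-transformers/all-MiniLM-L6-v2', dimension: int = 384, top_k: int = 5, threshold: float = 0.85):
        self.model = SentenceTransformer(model_name, cache_folder='.cache')
        self.dimension = dimension  # must match the model's embedding size
        self.top_k = top_k          # neighbors retrieved per text, self-match included
        self.threshold = threshold  # minimum cosine similarity to count as a duplicate

    def predict(self, texts: List[str]) -> np.ndarray:
        '''Return a 0/1 array with 1 for every text that has a near-duplicate.'''
        # Unit-length embeddings make the inner product equal to cosine similarity.
        embeddings = self.model.encode(texts, show_progress_bar=False, normalize_embeddings=True)

        # Exact (brute-force) inner-product index over all embeddings.
        index = faiss.IndexFlatIP(self.dimension)
        index.add(embeddings)

        similarities, neighbors = index.search(embeddings, self.top_k)

        duplicates = set()

        for i in range(len(texts)):
            for j, sim in zip(neighbors[i], similarities[i]):
                # Skip the self-match explicitly: with exact duplicates it is not
                # guaranteed to come first. Also skip the -1 padding FAISS returns
                # when fewer than top_k results exist.
                if j == i or j < 0:
                    continue
                if sim > self.threshold:
                    duplicates.add(i)
                    duplicates.add(int(j))

        indices = np.zeros(len(texts), dtype=int)
        indices[list(duplicates)] = 1
        return indices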
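
The flagging step can also be written without Python loops. The sketch below is an illustrative alternative, not the notebook's implementation: it takes `similarities` and `neighbors` arrays shaped like the output of `index.search` and builds the same kind of 0/1 mask with NumPy boolean indexing. The function name `flag_duplicates` and the toy arrays are assumptions made for the example.

```python
import numpy as np

def flag_duplicates(similarities: np.ndarray, neighbors: np.ndarray, threshold: float) -> np.ndarray:
    """Return a 0/1 array marking every item involved in a pair above the threshold."""
    n = neighbors.shape[0]
    rows = np.arange(n)[:, None]  # query index for each row
    # A hit is a real neighbor (not the query itself, not -1 padding) above the threshold.
    hits = (neighbors != rows) & (neighbors >= 0) & (similarities > threshold)

    mask = np.zeros(n, dtype=int)
    mask[hits.any(axis=1)] = 1    # queries with at least one hit
    mask[neighbors[hits]] = 1     # the neighbors they matched
    return mask

# Toy search output for 4 items with top_k=2: items 0 and 1 are near-duplicates.
similarities = np.array(
    [[1.00, 0.97], [1.00, 0.97], [1.00, 0.40], [1.00, 0.10]], dtype=np.float32
)
neighbors = np.array([[0, 1], [1, 0], [2, 0], [3, 2]])

mask = flag_duplicates(similarities, neighbors, threshold=0.95)
```

Both members of a matching pair are marked, which mirrors the loop above adding `i` and `j` to the `duplicates` set.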

## Ablation Study: Impact of Embedding Models on Deduplication Performance

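The models below come from the [original SentenceTransformers models](https://sbert.net/docs/sentence_transformer/pretrained_models.html#original-models). Each model was evaluated on the same 9,347 abstracts (4,623 duplicates) with `threshold=0.95` and `top_k=3`.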

| Model | Accuracy | Precision | Recall | F1 Score | Prediction Time (s) |
|-------|---------:|----------:|-------:|---------:|--------------------:|
| sentence-transformers/all-MiniLM-L12-v2 | 0.92823 | 0.96335 | 0.89033 | 0.92540 | 19.54994 |
| sentence-transformers/multi-qa-distilbert-cos-v1 | 0.95550 | 0.97380 | 0.93619 | 0.95462 | 107.48424 |
| sentence-transformers/all-MiniLM-L6-v2 | 0.94592 | 0.96905 | 0.92126 | 0.94455 | 22.15212 |
| sentence-transformers/multi-qa-MiniLM-L6-cos-v1 | 0.97204 | 0.97487 | 0.96907 | 0.97196 | 48.73280 |
| sentence-transformers/paraphrase-albert-small-v2 | 0.92310 | 0.97754 | 0.86610 | 0.91845 | 27.69109 |
| sentence-transformers/paraphrase-MiniLM-L3-v2 | 0.92860 | 0.97940 | 0.87562 | 0.92461 | 6.62277 |

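For reference, the four quality columns can be computed from binary labels and predictions as sketched below. `Benchmark` lives in `utils` and its source is not shown here, so this is an assumption based on the standard metric definitions rather than its actual implementation; `binary_metrics` and the toy arrays are hypothetical.

```python
import numpy as np

def binary_metrics(y_true, y_pred) -> dict:
    """Accuracy, precision, recall and F1 for 0/1 labels (1 = duplicate)."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)

    tp = int(np.sum(y_true & y_pred))    # duplicates correctly flagged
    fp = int(np.sum(~y_true & y_pred))   # unique texts wrongly flagged
    fn = int(np.sum(y_true & ~y_pred))   # duplicates that were missed
    tn = int(np.sum(~y_true & ~y_pred))  # unique texts correctly left alone

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Toy example: 3 true duplicates, of which 2 are found, plus 1 false alarm.
metrics = binary_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0])
```

Read this way, a row with high precision but lower recall (for example `paraphrase-albert-small-v2`) raises few false alarms but misses more duplicates at this threshold.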

In [4]:
models = [
    SimpleNamespace(name='sentence-transformers/all-MiniLM-L12-v2', dimension=384),
    SimpleNamespace(name='sentence-transformers/multi-qa-distilbert-cos-v1', dimension=768),
    SimpleNamespace(name='sentence-transformers/all-MiniLM-L6-v2', dimension=384),
    SimpleNamespace(name='sentence-transformers/multi-qa-MiniLM-L6-cos-v1', dimension=384),
    SimpleNamespace(name='sentence-transformers/paraphrase-albert-small-v2', dimension=768),
    SimpleNamespace(name='sentence-transformers/paraphrase-MiniLM-L3-v2', dimension=384),
]

for model in models:
    print(f'Evaluating model: {model}')
    deduplicator = EmbeddingDeduplicator(model_name=model.name, dimension=model.dimension, threshold=0.95, top_k=3)
    benchmark = Benchmark(deduplicator)
    benchmark.evaluate(df['abstract'].to_list(), df['label'], verbose=True)
    print('\n')

Evaluating model: namespace(name='sentence-transformers/all-MiniLM-L12-v2', dimension=384)
Summary:
Metric                        Value
-----------------------------------
Accuracy                    0.92823
Precision                   0.96335
Recall                      0.89033
F1                          0.92540
Prediction_time_sec        19.54994
Samples                  9347.00000
Duplicates               4623.00000


Evaluating model: namespace(name='sentence-transformers/multi-qa-distilbert-cos-v1', dimension=768)
Summary:
Metric                        Value
-----------------------------------
Accuracy                    0.95550
Precision                   0.97380
Recall                      0.93619
F1                          0.95462
Prediction_time_sec       107.48424
Samples                  9347.00000
Duplicates               4623.00000


Evaluating model: namespace(name='sentence-transformers/all-MiniLM-L6-v2', dimension=384)
Summary:
Metric                        Value
----