# Document Deduplication with Similarity Search

This notebook demonstrates how to use Pinecone's similarity search to build a simple application that identifies duplicate documents.

The goal is to create a data deduplication application for eliminating near-duplicate copies of academic texts. In this example, we deduplicate a given text in two steps. First, we retrieve a small set of candidate texts using a similarity-search service. Then, we apply a near-duplicate detector to these candidates.

The similarity search will use a vector representation of the texts. With this, semantic similarity is translated to proximity in a vector space. For detecting near-duplicates, we will employ a classification model that examines the raw text. 
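
To make "proximity in a vector space" concrete, here is a minimal sketch of cosine similarity, the metric this notebook later selects when creating the index. The three-dimensional vectors are hypothetical stand-ins chosen for illustration; the embeddings used in this notebook have 300 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (assumption: real ones come from the embedding model)
doc_a = [0.9, 0.1, 0.0]
doc_b = [0.8, 0.2, 0.0]  # points in nearly the same direction as doc_a
doc_c = [0.0, 0.1, 0.9]  # points in a very different direction

sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)
print(f"a vs b: {sim_ab:.3f}")
print(f"a vs c: {sim_ac:.3f}")
```

Texts whose vectors point in similar directions score close to 1, which is why near-duplicates tend to surface at the top of the search results.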

## Dependencies

In [1]:
!pip install -qU datasketch gensim mmh3 pinecone-client ipywidgets
!pip install -qU sentence-transformers --no-cache-dir

In [2]:
import os
import json
import math
import statistics
import pandas as pd
from tqdm import tqdm
from sentence_transformers import SentenceTransformer
from gensim.utils import tokenize
from datasketch.minhash import MinHash
from datasketch.lsh import MinHashLSH

## Pinecone Setup

In [3]:
import pinecone

# Load Pinecone API key
api_key = os.getenv("PINECONE_API_KEY") or "YOUR_API_KEY"
pinecone.init(api_key=api_key, environment='us-west1-gcp')

[Get a Pinecone API key](https://www.pinecone.io/start/) if you don’t have one already.

## Define a New Pinecone Index

In [4]:
# Pick a name for the new index
index_name = "deduplication"

In [5]:
# Check whether an index with the same name already exists
if index_name in pinecone.list_indexes():
    pinecone.delete_index(index_name)

**Create Index**


In [6]:
pinecone.create_index(name=index_name, dimension=300, metric="cosine")

**Create Index object**

The index object, a class instance of pinecone.Index, will be reused for optimal performance.

In [7]:
index = pinecone.Index(index_name)

## Upload

In this tutorial, we will use the [Deduplication Dataset 2020](https://core.ac.uk/documentation/dataset/) that consists of 100,000 scholarly documents. 


**Load data**

In [8]:
import requests, os, zipfile

DATA_DIR = "tmp"
DATA_FILE = f"{DATA_DIR}/deduplication_dataset_2020.zip"
DATA_URL = "https://core.ac.uk/exports/custom_datasets/deduplication_dataset_2020.zip"


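def download_data():
    """Download and extract the dataset unless the archive is already present."""
    os.makedirs(DATA_DIR, exist_ok=True)

    if not os.path.exists(DATA_FILE):
        r = requests.get(DATA_URL)  # download the archive
        r.raise_for_status()  # fail early on an HTTP error instead of saving an error page
        with open(DATA_FILE, "wb") as f:
            f.write(r.content)
        with zipfile.ZipFile(DATA_FILE, "r") as zip_ref:
            zip_ref.extractall(DATA_DIR)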


download_data()

In [9]:
DATA_PATH = os.path.join(DATA_DIR, "deduplication_dataset_2020/Ground_Truth_data.jsonl")

with open(DATA_PATH, encoding="utf8") as json_file:
    data = list(json_file)

Here is a sample of the data. 

In [10]:
data_json = [json.loads(json_str) for json_str in data]
df = pd.DataFrame.from_dict(data_json)
df.head()

| | core_id | doi | original_abstract | original_title | processed_title | processed_abstract | cat | labelled_duplicates |
|---|---|---|---|---|---|---|---|---|
| 0 | 11251086 | 10.1016/j.ajhg.2007.12.013 | Unobstructed vision requires a particular refr... | Mutation of solute carrier SLC16A12 associates... | mutation of solute carrier slc16a12 associates... | unobstructed vision refractive lens differenti... | exact_dup | [82332306] |
| 1 | 11309751 | 10.1103/PhysRevLett.101.193002 | Two-color multiphoton ionization of atomic hel... | Polarization control in two-color above-thresh... | polarization control in two-color above-thresh... | multiphoton ionization helium combining extrem... | exact_dup | [147599753] |
| 2 | 11311385 | 10.1016/j.ab.2011.02.013 | Lectin’s are proteins capable of recognising a... | Optimisation of the enzyme-linked lectin assay... | optimisation of the enzyme-linked lectin assay... | lectin’s capable recognising oligosaccharide t... | exact_dup | [147603441] |
| 3 | 11992240 | 10.1016/j.jpcs.2007.07.063 | In this work, we present a detailed transmissi... | Vertical composition fluctuations in (Ga,In)(N... | vertical composition fluctuations in (ga,in)(n... | microscopy interfacial uniformity wells grown ... | exact_dup | [148653623] |
| 4 | 11994990 | 10.1016/S0169-5983(03)00013-3 | Three-dimensional (3D) oscillatory boundary la... | Three-dimensional streaming flows driven by os... | three-dimensional streaming flows driven by os... | oscillatory attached deformable walls boundari... | exact_dup | [148656283] |


Now let us look at the columns in the dataset that are relevant for our task.

**core_id** - Unique identifier for each article

**processed_abstract** - This is obtained by applying preprocessing steps like [this](https://spacy.io/usage/processing-pipelines) to the original abstract of the article from the column **original_abstract**.

**processed_title** - Same as the abstract but for the title of the article.

**cat** - Every article falls into one of the three possible categories: 'exact_dup','near_dup','non_dup'

**labelled_duplicates** - A list of core_ids of articles that are duplicates of the current article




Let's calculate the frequency of duplicates per article. Observe that half of the articles have no duplicates, and only a small fraction of the articles have more than ten duplicates.

In [11]:
lens = df.labelled_duplicates.apply(len)
lens.value_counts()

0     50000
1     36166
2      7620
3      3108
4      1370
5       756
6       441
7       216
8       108
10       66
9        60
11       48
13       28
12       13
Name: labelled_duplicates, dtype: int64

We will make use of the text data to create vectors for every article. We combine the **processed_abstract** and **processed_title** of the article to create a new **combined_text** column. 

In [12]:
# Define a new column for calculating embeddings
df["combined_text"] = df.apply(
    lambda x: str(x.processed_title) + " " + str(x.processed_abstract), axis=1
)

**Load model**

We will use the [Average Word Embedding GloVe](https://nlp.stanford.edu/projects/glove/) model to transform text into vector embeddings. We then upload the embeddings into the Pinecone vector index.
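
As a rough illustration of what an average word embedding is, the sketch below averages per-word vectors into a single text vector. The tiny vocabulary and its three-dimensional vectors are hypothetical, and the whitespace tokenization and skipping of unknown words are simplifying assumptions; the actual model handles tokenization and vocabulary lookup internally.

```python
# Hypothetical 3-dimensional word vectors (assumption: the real GloVe vectors have 300 dimensions)
toy_vectors = {
    "volcanic": [0.2, 0.8, 0.1],
    "mantle": [0.4, 0.6, 0.3],
    "swimming": [0.9, 0.1, 0.5],
}

def average_embedding(text, vectors):
    """Average the vectors of the known words in text; unknown words are skipped."""
    known = [vectors[w] for w in text.lower().split() if w in vectors]
    if not known:
        raise ValueError("no known words in text")
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

embedding = average_embedding("volcanic mantle", toy_vectors)
print(embedding)
```

Because word order is discarded by the averaging, two texts that share most of their vocabulary end up with nearby vectors, which suits candidate retrieval for deduplication.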

In [13]:
model = SentenceTransformer("average_word_embeddings_glove.6B.300d")

In [14]:
df["vectors"] = model.encode(df.combined_text.to_list(), show_progress_bar=True).tolist()

Batches:   0%|          | 0/3125 [00:00<?, ?it/s]

**Index the Vectors**

In [15]:
import itertools

def chunks(iterable, batch_size):
    it = iter(iterable)
    chunk = tuple(itertools.islice(it, batch_size))
    while chunk:
        yield chunk
        chunk = tuple(itertools.islice(it, batch_size))

In [16]:
for batch in chunks(zip(df.core_id.astype(str), df.vectors), 500):
    index.upsert(vectors=batch)
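
To see how the `chunks` helper splits the upload into batches, here is a small standalone sketch. The `pairs` list is a hypothetical stand-in for the `(id, vector)` pairs built from the DataFrame, and the batch size of 3 is chosen only to keep the output short.

```python
import itertools

def chunks(iterable, batch_size):
    """Yield successive tuples of at most batch_size items."""
    it = iter(iterable)
    chunk = tuple(itertools.islice(it, batch_size))
    while chunk:
        yield chunk
        chunk = tuple(itertools.islice(it, batch_size))

# Hypothetical (id, vector) pairs standing in for zip(df.core_id, df.vectors)
pairs = [(str(i), [float(i)] * 3) for i in range(7)]
batch_sizes = [len(batch) for batch in chunks(pairs, 3)]
print(batch_sizes)  # [3, 3, 1]
```

The last batch is simply shorter when the number of items is not a multiple of the batch size, so no vectors are dropped.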

In [17]:
index.describe_index_stats()

{'dimension': 300, 'namespaces': {'': {'vector_count': 100000}}}

## Searching for candidates

Now that we have created vectors for the articles and inserted them in the index, we will create a test set for querying. For each article in the test set, we will query the index to get the most similar articles. These are the candidates on which we will perform the next classification step.

Below, we list statistics of the number of duplicates per article in the resulting test set.

In [18]:
# Create a sample from the dataset
SAMPLE_FRACTION = 0.002
test_documents = (
    df.groupby(df["labelled_duplicates"].map(len))
    .apply(lambda x: x.head(math.ceil(len(x) * SAMPLE_FRACTION)))
    .reset_index(drop=True)
)

print("Number of documents with specified number of duplicates:")
lens = test_documents.labelled_duplicates.apply(len)
lens.value_counts()

Number of documents with specified number of duplicates:


0     100
1      73
2      16
3       7
4       3
5       2
6       1
7       1
8       1
9       1
10      1
11      1
12      1
13      1
Name: labelled_duplicates, dtype: int64
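
The sample sizes above follow from rounding each group's share up with `math.ceil`, which guarantees at least one article per group. The sketch below reproduces the arithmetic from the group sizes printed earlier; it works on the counts only and does not touch the DataFrame.

```python
import math

SAMPLE_FRACTION = 0.002

# Group sizes copied from the value_counts() output of the full dataset
group_sizes = {0: 50000, 1: 36166, 2: 7620, 3: 3108, 4: 1370, 5: 756, 6: 441,
               7: 216, 8: 108, 9: 60, 10: 66, 11: 48, 12: 13, 13: 28}

# Each group contributes ceil(size * fraction) articles, so small groups still contribute one
sample_sizes = {k: math.ceil(n * SAMPLE_FRACTION) for k, n in group_sizes.items()}
print(sample_sizes)
print("total:", sum(sample_sizes.values()))
```

The total of 209 matches the number of test documents queried in the following cells.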

In [19]:
# Use the model to create embeddings for test articles, which will be the query vectors
queries = model.encode(test_documents.combined_text.to_list()).tolist()

In [20]:
# Query the vector index
query_results = index.query(queries=queries, top_k=100)

In [21]:
# Save all retrieval recalls into a list
recalls = []

for doc_id, res in tqdm(list(zip(test_documents.core_id.values, query_results.results))):

    # Find the document with this id in the labelled dataset
    labeled_df = df[df.core_id == str(doc_id)]

    # Calculate the retrieval recall (skipped for documents without labelled duplicates)
    top_k_ids = {match.id for match in res.matches}
    labelled_duplicates = set(labeled_df.labelled_duplicates.values[0])
    intersection = top_k_ids.intersection(labelled_duplicates)
    if len(labelled_duplicates) != 0:
        recalls.append(len(intersection) / len(labelled_duplicates))

100%|███████████████████████████████████████████████████████████████████████████████| 209/209 [00:01<00:00, 121.90it/s]


In [22]:
print("Mean for the retrieval recall is " + str(statistics.mean(recalls)))
print("Standard Deviation is  " + str(statistics.stdev(recalls)))

Mean for the retrieval recall is 0.9702529886016125
Standard Deviation is  0.16219287104729735
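
Retrieval recall here is the fraction of an article's labelled duplicates that appear among the retrieved candidates. Articles without labelled duplicates are left out, so the mean above is taken over the 109 test articles that have at least one duplicate. A minimal sketch with hypothetical ids:

```python
def retrieval_recall(retrieved_ids, true_duplicate_ids):
    """Fraction of the labelled duplicates that appear among the retrieved ids."""
    true_set = set(true_duplicate_ids)
    if not true_set:
        return None  # recall is undefined when there is nothing to retrieve
    return len(true_set & set(retrieved_ids)) / len(true_set)

# Hypothetical ids, for illustration only
retrieved = ["a1", "b2", "c3", "d4"]
recall_full = retrieval_recall(retrieved, ["b2", "c3"])  # both duplicates retrieved
recall_half = retrieval_recall(retrieved, ["b2", "zz"])  # one of two retrieved
recall_none = retrieval_recall(retrieved, [])            # no labelled duplicates
print(recall_full, recall_half, recall_none)  # 1.0 0.5 None
```

A duplicate that is missed at this stage cannot be recovered by the classifier, so high recall in the candidate search matters more than precision.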


### Running the Classifier 

We mentioned earlier that we will perform two steps for deduplication: searching to produce candidates, and performing classification on them.

We will use a deduplication classifier based on [LSH](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) to detect duplicates among the results from the previous step. We will run it on a sample of the query results we got in the previous step. Feel free to try it on the entire set of query results.
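
The idea behind MinHash, which the classifier relies on, is that the fraction of matching positions in two signatures estimates the Jaccard similarity of the underlying token sets. The sketch below is a simplified pure-Python illustration, not the `datasketch` implementation: the SHA-1 based hash family and the token sets are assumptions made for this example.

```python
import hashlib

def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)

def minhash_signature(tokens, num_perm=128):
    """One minimum hash value per seeded hash function (tokens must be non-empty)."""
    signature = []
    for seed in range(num_perm):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{t}".encode("utf8")).digest()[:8], "big")
            for t in tokens
        ))
    return signature

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two signatures agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)

# Hypothetical token sets that differ in a single word
doc_a = {"swimming", "race", "segments", "finals", "pace", "turns", "starts", "analysis"}
doc_b = {"swimming", "race", "segments", "finals", "pace", "turns", "starts", "video"}

true_j = jaccard(doc_a, doc_b)  # 7 shared tokens out of 9 distinct tokens
est_j = estimate_jaccard(minhash_signature(doc_a), minhash_signature(doc_b))
print(f"exact: {true_j:.3f}, estimated: {est_j:.3f}")
```

An LSH index built on such signatures returns the items whose estimated similarity is likely to exceed a chosen threshold, without comparing every pair of documents directly.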

In [23]:
# Counters for correct/false predictions
all_predictions = {"Correct": 0, "False": 0}
predictions_per_category = {}

# From the results in the previous step, we will take a subset to test our classifier
query_sample = query_results.results[::10]
ids_sample = test_documents.core_id.to_list()[::10]

for id, res in zip(ids_sample, query_sample):
    
    # Find document with id from the labelled dataset
    labeled_df = df[df.core_id == str(id)]

    # For every article in the result set, we store the scores and abstracts of the
    # articles most similar to it, according to the search in the previous step.

    df_result = pd.DataFrame(
        {
            "id": [match.id for match in res.matches],
            "document": [
                df[df.core_id == match.id].processed_abstract.values[0] for match in res.matches
            ],
            "score": [match.score for match in res.matches],
        }
    )

    print(df_result.head())

    # We need content and labels for our classifier, which we can get from df_result
    content = df_result.document.values
    labels = list(df_result.id.values)
    
    # Create MinHash for each of the documents in result set
    min_hashes = {}
    for label, text in zip(labels, content):
        m = MinHash(num_perm=128, seed=5)
        tokens = set(tokenize(text))
        for d in tokens:
            m.update(d.encode('utf8'))
        min_hashes[label] = m
    
    # Create LSH index
    lsh = MinHashLSH(threshold=0.7, num_perm=128)
    for i, j in min_hashes.items():
        lsh.insert(str(i), j)
    
    query_minhash = min_hashes[id]
    duplicates = lsh.query(query_minhash)
    duplicates.remove(str(id))
    
    # Check whether the prediction matches the labelled duplicates. Here the ground truth is the set of duplicates from our original set
    prediction = (
        "Correct"
        if set(labeled_df.labelled_duplicates.values[0]) == set(duplicates)
        else "False"
    )
    
    # Add to all predictions
    all_predictions[prediction] += 1
    
    # Create and/or add to the specific category based on number of duplicates in original dataset
    num_of_duplicates = len(labeled_df.labelled_duplicates.values[0])
    if num_of_duplicates not in predictions_per_category:
        predictions_per_category[num_of_duplicates] = [0, 0]

    if prediction == "Correct":
        predictions_per_category[num_of_duplicates][0] += 1
    else:
        predictions_per_category[num_of_duplicates][1] += 1

    # Print the results for a document
    print(
        "{}: expected: {}, predicted: {}, prediction: {}".format(
            id, labeled_df.labelled_duplicates.values[0], duplicates, prediction
        )
    )

         id                                           document     score
0  15080768  analyse centred methodology. discretisation so...  1.000000
1  52682462  audiencethe tissues pulses modelled compartmen...  0.787797
2  52900859  audiencethe tissues pulses modelled compartmen...  0.787797
3   2553555  multilayered illuminated acoustic electromagne...  0.781398
4  48261378  heterostructure schr dinger poisson numericall...  0.778778
15080768: expected: [], predicted: [], prediction: Correct
          id                                           document     score
0   55110306  latrepirdine orally administered molecule init...  1.000000
1  188404434  cysteamine potentially numerous huntington dis...  0.903965
2   81634102  deutetrabenazine molecule deuterium attenuates...  0.880078
3   42021224  comorbidities. safe drugs available. efficacy ...  0.857741
4   78271101  promising prevent onset ultrahigh psychosis di...  0.849158
55110306: expected: [], predicted: [], prediction: Correct


          id                                           document     score
0  148674298  race segments swimmers. analysed finals sessio...  1.000000
1   33176265  race segments swimmers. analysed finals sessio...  1.000000
2  148674300  swimming race parameters. hundred fifty eight ...  0.886608
3   33176267  swimming race parameters. hundred fifty eight ...  0.886608
4  143900637  swimmers swimmers coaches trainers. video sens...  0.736030
33176265: expected: ['148674298'], predicted: ['148674298'], prediction: Correct
         id                                           document     score
0  52722823  audiencehere geochemical lopevi volcano volcan...  1.000000
1  52844591  audiencehere geochemical lopevi volcano volcan...  1.000000
2  52308905  audiencehere geochemical lopevi volcano volcan...  1.000000
3  52840980  audiencethe volcanism cameroon volcanic mantle...  0.893717
4  52717537  audiencethe volcanism cameroon volcanic mantle...  0.893717
52308905: expected: ['52722823', '528

In [24]:
all_predictions

{'Correct': 21, 'False': 0}

In [25]:
# Overall accuracy on the test sample
accuracy = round(
    all_predictions["Correct"]
    / (all_predictions["Correct"] + all_predictions["False"]),
    4,
)
accuracy

1.0

In [26]:
# Print the prediction count for each class depending on the number of duplicates in labeled dataset
pd.DataFrame.from_dict(
    predictions_per_category, orient="index", columns=["Correct", "False"]
)

| | Correct | False |
|---|---|---|
| 0 | 10 | 0 |
| 1 | 8 | 0 |
| 2 | 1 | 0 |
| 3 | 1 | 0 |
| 5 | 1 | 0 |


## Delete the Index
Delete the index once you are sure that you do not want to use it anymore. Once the index is deleted, you cannot use it again.



In [27]:
# Delete the index if it's not going to be used anymore
pinecone.delete_index(index_name)

## Summary

In this notebook we demonstrate how to perform a deduplication task on a dataset of 100,000 articles using Pinecone. With articles embedded as vectors, you can use Pinecone's vector index to find similar articles. For each query article, we then use an LSH classifier on the similar articles to identify duplicate articles. Overall, we show that it is easy to combine Pinecone with article embedding models and duplication classifiers to build a deduplication service.
