[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/search/semantic-search/deduplication/deduplication_scholarly_articles.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/search/semantic-search/deduplication/deduplication_scholarly_articles.ipynb)

# Document Deduplication with Similarity Search

This notebook demonstrates how to use Pinecone's similarity search to create a simple application to identify duplicate documents. 

The goal is to build a deduplication application that eliminates near-duplicate copies of academic texts. In this example, we deduplicate a given text in two steps. First, we retrieve a small set of candidate texts using a similarity-search service. Then, we apply a near-duplicate detector to these candidates.
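
As a rough illustration of this two-step flow, here is a minimal, self-contained sketch. The function names (`deduplicate`, `toy_retrieve`, `toy_detector`) and the three-document corpus are hypothetical stand-ins, not part of this notebook's pipeline: the real first stage is a vector-index query and the real second stage is an LSH-based classifier.

```python
from typing import Callable, Iterable, List


def deduplicate(
    query_text: str,
    retrieve_candidates: Callable[[str], Iterable[str]],
    is_near_duplicate: Callable[[str, str], bool],
) -> List[str]:
    """Return the candidates that the second-stage detector flags as duplicates."""
    # Stage 1: cheap, high-recall retrieval of a small candidate set
    candidates = retrieve_candidates(query_text)
    # Stage 2: stricter pairwise check on the raw text of each candidate
    return [c for c in candidates if is_near_duplicate(query_text, c)]


# Toy stand-ins (assumptions for illustration only)
corpus = [
    "deep learning for vision",
    "deep learning for vision tasks",
    "protein folding",
]


def toy_retrieve(query: str) -> List[str]:
    # Keep any document sharing at least one token with the query
    q_tokens = set(query.split())
    return [doc for doc in corpus if q_tokens & set(doc.split())]


def toy_detector(a: str, b: str) -> bool:
    # Flag pairs whose token sets have Jaccard similarity of at least 0.7
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) >= 0.7


result = deduplicate("deep learning for vision", toy_retrieve, toy_detector)
print(result)
```

Splitting the work this way keeps the expensive pairwise comparison limited to a handful of candidates instead of the whole collection.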

The similarity search will use a vector representation of the texts. With this, semantic similarity is translated to proximity in a vector space. For detecting near-duplicates, we will employ a classification model that examines the raw text. 
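
To make "proximity in a vector space" concrete, the sketch below computes cosine similarity between toy vectors. It assumes cosine similarity as the proximity measure, and the three-dimensional vectors are made up for illustration; the embeddings used later in this notebook have 300 dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings: doc_a and doc_b point in similar directions, doc_c does not
doc_a = [0.9, 0.1, 0.0]
doc_b = [0.8, 0.2, 0.1]
doc_c = [0.0, 0.1, 0.9]

sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)
print(f"a vs b: {sim_ab:.3f}, a vs c: {sim_ac:.3f}")
```

Texts with similar content should end up with a higher score, like `doc_a` and `doc_b` here, which is what lets a nearest-neighbour query surface likely duplicates.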

## Install Dependencies

In [1]:
!pip install -qU pinecone-client
!pip install -qU datasketch mmh3 ipywidgets
!pip install -qU gensim==4.0.1
!pip install -qU sentence-transformers --no-cache-dir
!pip install -qU datasets

## Download and Process Dataset

This tutorial will use the [Deduplication Dataset 2020](https://core.ac.uk/documentation/dataset/), which consists of 100,000 scholarly documents. We will use Hugging Face Datasets to download the dataset found at [*pinecone/core-2020-05-10-deduplication*](https://huggingface.co/datasets/pinecone/core-2020-05-10-deduplication).

In [2]:
from datasets import load_dataset

core = load_dataset("pinecone/core-2020-05-10-deduplication", split="train")
core

Using custom data configuration pinecone--core-2020-05-10-deduplication-9b2cfecb7d4c4180
Reusing dataset json (/root/.cache/huggingface/datasets/pinecone___json/pinecone--core-2020-05-10-deduplication-9b2cfecb7d4c4180/0.0.0/da492aad5680612e4028e7f6ddc04b1dfcec4b64db470ed7cc5f2bb265b9b6b5)


Dataset({
    features: ['core_id', 'doi', 'original_abstract', 'original_title', 'processed_title', 'processed_abstract', 'cat', 'labelled_duplicates'],
    num_rows: 100000
})

We convert the dataset into a pandas DataFrame like so:

In [3]:
df = core.to_pandas()
df.head()

|   | core_id | doi | original_abstract | original_title | processed_title | processed_abstract | cat | labelled_duplicates |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 11251086 | 10.1016/j.ajhg.2007.12.013 | Unobstructed vision requires a particular refr... | Mutation of solute carrier SLC16A12 associates... | mutation of solute carrier slc16a12 associates... | unobstructed vision refractive lens differenti... | exact_dup | [82332306] |
| 1 | 11309751 | 10.1103/PhysRevLett.101.193002 | Two-color multiphoton ionization of atomic hel... | Polarization control in two-color above-thresh... | polarization control in two-color above-thresh... | multiphoton ionization helium combining extrem... | exact_dup | [147599753] |
| 2 | 11311385 | 10.1016/j.ab.2011.02.013 | Lectin’s are proteins capable of recognising a... | Optimisation of the enzyme-linked lectin assay... | optimisation of the enzyme-linked lectin assay... | lectin’s capable recognising oligosaccharide t... | exact_dup | [147603441] |
| 3 | 11992240 | 10.1016/j.jpcs.2007.07.063 | In this work, we present a detailed transmissi... | Vertical composition fluctuations in (Ga,In)(N... | vertical composition fluctuations in (ga,in)(n... | microscopy interfacial uniformity wells grown ... | exact_dup | [148653623] |
| 4 | 11994990 | 10.1016/S0169-5983(03)00013-3 | Three-dimensional (3D) oscillatory boundary la... | Three-dimensional streaming flows driven by os... | three-dimensional streaming flows driven by os... | oscillatory attached deformable walls boundari... | exact_dup | [148656283] |


We will use the following columns from the dataset for our task.
1. **core_id** - Unique identifier for each article

2. **processed_abstract** - Obtained by applying preprocessing steps (similar to those in a [spaCy processing pipeline](https://spacy.io/usage/processing-pipelines)) to the original abstract of the article from the **original_abstract** column.

3. **processed_title** - Same as the abstract, but for the title of the article.

4. **cat** - Every article falls into one of three possible categories: 'exact_dup', 'near_dup', 'non_dup'

5. **labelled_duplicates** - A list of core_ids of articles that are duplicates of the current article

Let's calculate the frequency of duplicates per article. Observe that half of the articles have no duplicates, and only a small fraction of the articles have more than ten duplicates.

In [4]:
lens = df.labelled_duplicates.apply(len)
lens.value_counts()

0     50000
1     36166
2      7620
3      3108
4      1370
5       756
6       441
7       216
8       108
10       66
9        60
11       48
13       28
12       13
Name: labelled_duplicates, dtype: int64

Truncate overly long abstracts so that each record's metadata stays small enough to upsert.

In [5]:
# make sure no processed abstracts are excessively long for upsert to Pinecone
df["processed_abstract"] = df["processed_abstract"].str[:8000]

We will make use of the text data to create vectors for every article. We combine the **processed_abstract** and **processed_title** of the article to create a new **combined_text** column. 

In [6]:
# Define a new column for calculating embeddings
df["combined_text"] = df["processed_title"] + " " + df["processed_abstract"]

## Initialize Pinecone Index

In [7]:
from pinecone import Pinecone, PodSpec

# Create the Pinecone client; the instance is named `pinecone`
# so that the later cells can call it directly
pinecone = Pinecone(api_key="YOUR_API_KEY")

# Pick a name for the new index
index_name = "deduplication"

# Check if the deduplication index exists
if index_name not in pinecone.list_indexes().names():
    # Create the index if it does not exist
    pinecone.create_index(
        index_name,
        dimension=300,
        spec=PodSpec(
            environment="YOUR_ENV",  # find next to API key in console
            metadata_config={"indexed": ["processed_abstract"]},
        ),
    )

# Connect to the deduplication index we created
index = pinecone.Index(index_name)

[Get a free Pinecone API key](https://www.pinecone.io/start/) if you don’t have one already.

## Initialize Embedding Model

We will use the [Average Word Embedding GloVe](https://nlp.stanford.edu/projects/glove/) model to transform text into vector embeddings. We then upload the embeddings into the Pinecone vector index.
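
The idea behind an average word embedding can be sketched in a few lines: look up a vector for each known word and take the element-wise mean. The tiny three-dimensional vocabulary below is hypothetical, and skipping unknown words is an assumption made for this sketch; the actual model uses 300-dimensional GloVe vectors and its own tokenizer.

```python
# Hypothetical 3-d word vectors standing in for the 300-d GloVe vectors
toy_vectors = {
    "protein": [1.0, 0.0, 0.0],
    "folding": [0.0, 1.0, 0.0],
    "dynamics": [0.0, 0.0, 1.0],
}


def average_embedding(text, vectors, dim=3):
    """Mean of the vectors of all known tokens; zero vector if none are known."""
    tokens = [t for t in text.lower().split() if t in vectors]
    if not tokens:
        return [0.0] * dim
    return [sum(vectors[t][i] for t in tokens) / len(tokens) for i in range(dim)]


emb = average_embedding("Protein folding", toy_vectors)
print(emb)  # [0.5, 0.5, 0.0]
```

Because the result is a plain average, word order is ignored, which is acceptable here since near-duplicate abstracts share most of their vocabulary.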

In [8]:
import torch
from sentence_transformers import SentenceTransformer

# set device to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = SentenceTransformer("average_word_embeddings_glove.6B.300d", device=device)
model

SentenceTransformer(
  (0): WordEmbeddings(
    (emb_layer): Embedding(400001, 300)
  )
  (1): Pooling({'word_embedding_dimension': 300, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

## Generate Embeddings and Upsert

In [9]:
from tqdm.auto import tqdm

# We will use batches of 256
batch_size = 256
for i in tqdm(range(0, len(df), batch_size)):
    # Find end of batch
    i_end = min(i+batch_size, len(df))
    # Extract batch
    batch = df.iloc[i:i_end]
    # Generate embeddings for batch
    emb = model.encode(batch["combined_text"].to_list()).tolist()
    # Store the processed abstract as metadata for the later classification step
    meta = batch[["processed_abstract"]].to_dict(orient="records")
    # create IDs
    ids = batch.core_id.astype(str)
    # add all to upsert list
    to_upsert = list(zip(ids, emb, meta))
    # upsert/insert these records to pinecone
    _ = index.upsert(vectors=to_upsert)
    
# check that we have all vectors in index
index.describe_index_stats()

{'dimension': 300,
 'index_fullness': 0.1,
 'namespaces': {'': {'vector_count': 100000}}}
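
As a quick sanity check on the batching loop above, the sketch below reproduces only its index arithmetic, without any embedding or upsert calls. `batch_ranges` is a hypothetical helper introduced for illustration.

```python
def batch_ranges(n_rows, batch_size):
    """Yield (start, end) row offsets, mirroring the i / i_end logic of the loop."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)


ranges = list(batch_ranges(100_000, 256))
num_batches = len(ranges)
last_start, last_end = ranges[-1]

print(num_batches)            # 391 batches in total
print(last_end - last_start)  # the final batch holds the remaining 160 rows
```

The `min(...)` guard is what keeps the final, shorter batch from indexing past the end of the DataFrame.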

## Searching for Candidates

Now that we have created vectors for the articles and inserted them into the index, we will create a test set for querying. For each article in the test set, we query the index to get the most similar articles. These are the candidates on which we will perform the classification step.

Below, we list statistics of the number of duplicates per article in the resulting test set.

In [10]:
import math

# Create a sample from the dataset
SAMPLE_FRACTION = 0.002
test_documents = (
    df.groupby(df.labelled_duplicates.map(len))
    .apply(lambda x: x.head(math.ceil(len(x) * SAMPLE_FRACTION)))
    .reset_index(drop=True)
)

print("Number of documents with specified number of duplicates:")
lens = test_documents.labelled_duplicates.apply(len)
lens.value_counts()

Number of documents with specified number of duplicates:


0     100
1      73
2      16
3       7
4       3
5       2
6       1
7       1
8       1
9       1
10      1
11      1
12      1
13      1
Name: labelled_duplicates, dtype: int64
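
The sample sizes above follow directly from rounding up 0.2% of each group. The sketch below redoes that arithmetic using the group sizes reported earlier for the full dataset; it does not touch the DataFrame itself.

```python
import math

SAMPLE_FRACTION = 0.002

# Number of articles per duplicate count, copied from the value counts of the full dataset
group_sizes = {
    0: 50000, 1: 36166, 2: 7620, 3: 3108, 4: 1370, 5: 756, 6: 441,
    7: 216, 8: 108, 9: 60, 10: 66, 11: 48, 12: 13, 13: 28,
}

# Same rule as the groupby/apply call: take ceil(group size * fraction) rows per group
sample_sizes = {k: math.ceil(n * SAMPLE_FRACTION) for k, n in group_sizes.items()}
total = sum(sample_sizes.values())

print(sample_sizes)
print(total)  # 209 test documents
```

Rounding up guarantees that even the rarest groups contribute at least one document to the test set.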

In [11]:
# Use the model to create embeddings for test articles, which will be the query vectors
query_vectors = model.encode(test_documents.combined_text.to_list()).tolist()

In [12]:
# Query the vector index
query_results = []
for xq in tqdm(query_vectors):
    query_res = index.query(vector=xq, top_k=100, include_metadata=True)
    query_results.append(query_res)

In [13]:
# Save all retrieval recalls into a list
recalls = []

for doc_id, res in tqdm(list(zip(test_documents.core_id.values, query_results))):
    # Find the document with this ID in the labelled dataset
    labeled_df = df[df.core_id.astype(str) == str(doc_id)]
    # IDs returned by the similarity search
    top_k_ids = {match.id for match in res.matches}
    labelled_duplicates = set(labeled_df.labelled_duplicates.values[0])
    intersection = top_k_ids.intersection(labelled_duplicates)
    # Recall is only defined for documents that have labelled duplicates
    if len(labelled_duplicates) != 0:
        recalls.append(len(intersection) / len(labelled_duplicates))

In [14]:
import statistics

print("Mean for the retrieval recall is " + str(statistics.mean(recalls)))
print("Standard Deviation is  " + str(statistics.stdev(recalls)))

Mean for the retrieval recall is 0.9702529886016125
Standard Deviation is  0.16219287104729735
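
For reference, the per-document retrieval recall computed above is the fraction of labelled duplicates that appear among the retrieved candidates. The following sketch isolates that calculation; `retrieval_recall` is a hypothetical helper, and converting every ID to a string is a defensive assumption so that integer and string IDs compare equal.

```python
def retrieval_recall(retrieved_ids, labelled_duplicates):
    """Fraction of labelled duplicates found among the retrieved IDs.

    Returns None when there are no labelled duplicates, since recall is
    undefined in that case (such documents are skipped in the loop above).
    """
    labelled = {str(x) for x in labelled_duplicates}
    if not labelled:
        return None
    retrieved = {str(x) for x in retrieved_ids}
    return len(retrieved & labelled) / len(labelled)


# One of the two labelled duplicates was retrieved
r = retrieval_recall(["11", "22", "33"], ["22", "44"])
print(r)  # 0.5
```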


### Running the Classifier 

As mentioned at the start of this notebook, deduplication takes two steps: searching to produce candidates, then running classification on them.

We will use a deduplication classifier based on [LSH](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) to detect duplicates among the results from the previous step. We will run it on a sample of the query results. Feel free to try it on the entire set of query results.
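
The classifier relies on MinHash signatures, which approximate the Jaccard similarity between token sets. The sketch below shows the principle in pure Python. It is not how `datasketch` is implemented: using salted `blake2b` hashes in place of random permutations is an assumption made to keep the example self-contained, and the token lists are made up.

```python
import hashlib


def jaccard(a, b):
    """Exact Jaccard similarity between two token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0


def minhash_signature(tokens, num_perm=128):
    """One minimum hash value per seeded hash function (tokens must be non-empty)."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(t.encode("utf8"), digest_size=8, salt=salt).digest(),
                    "little",
                )
                for t in tokens
            )
        )
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of positions where the two signatures agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


tokens_a = "deep learning for vision".split()
tokens_b = "deep learning for vision tasks".split()
tokens_c = "protein folding dynamics".split()

sig_a, sig_b, sig_c = (minhash_signature(t) for t in (tokens_a, tokens_b, tokens_c))

exact = jaccard(tokens_a, tokens_b)        # 4 shared tokens out of 5 in total
est = estimated_jaccard(sig_a, sig_b)      # approximation of the value above
print(exact, est, estimated_jaccard(sig_a, sig_c))
```

An LSH index groups signatures into bands so that pairs whose estimated similarity exceeds a threshold are likely to collide in at least one band, which avoids comparing every pair explicitly.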

In [15]:
import pandas as pd
from gensim.utils import tokenize
from datasketch.minhash import MinHash
from datasketch.lsh import MinHashLSH

In [16]:
# Counters for correct/false predictions
all_predictions = {"Correct": 0, "False": 0}
predictions_per_category = {}

# From the results in the previous step, we will take a subset to test our classifier
query_sample = query_results[::10]
ids_sample = test_documents.core_id.to_list()[::10]

for id, res in zip(ids_sample, query_sample):
    
    # Find document with id from the labelled dataset
    labeled_df = df[df.core_id.astype(str) == str(id)]

    # For every article in the result set, store the scores and abstracts of the
    # articles most similar to it, according to the search in the previous step.

    df_result = pd.DataFrame(
        {
            "id": [match.id for match in res.matches],
            "document": [match["metadata"]["processed_abstract"] for match in res.matches],
            "score": [match.score for match in res.matches],
        }
    )

    print(df_result.head())

    # We need content and labels for our classifier, which we can get from df_result
    content = df_result.document.values
    labels = list(df_result.id.values)
    
    # Create MinHash for each of the documents in result set
    min_hashes = {}
    for label, text in zip(labels, content):
        m = MinHash(num_perm=128, seed=5)
        tokens = set(tokenize(text))
        for d in tokens:
            m.update(d.encode('utf8'))
        min_hashes[label] = m
    
    # Create LSH index
    lsh = MinHashLSH(threshold=0.7, num_perm=128)
    for i, j in min_hashes.items():
        lsh.insert(str(i), j)
    
    query_minhash = min_hashes[str(id)]
    duplicates = lsh.query(query_minhash)
    duplicates.remove(str(id))
    
    # Check whether the prediction matches the labelled duplicates.
    # Here the ground truth is the set of duplicates from our original dataset.
    prediction = (
        "Correct"
        if set(labeled_df.labelled_duplicates.values[0]) == set(duplicates)
        else "False"
    )
    
    # Add to all predictions
    all_predictions[prediction] += 1
    
    # Create and/or add to the specific category based on number of duplicates in original dataset
    num_of_duplicates = len(labeled_df.labelled_duplicates.values[0])
    if num_of_duplicates not in predictions_per_category:
        predictions_per_category[num_of_duplicates] = [0, 0]

    if prediction == "Correct":
        predictions_per_category[num_of_duplicates][0] += 1
    else:
        predictions_per_category[num_of_duplicates][1] += 1

    # Print the results for a document
    print(
        "{}: expected: {}, predicted: {}, prediction: {}".format(
            id, labeled_df.labelled_duplicates.values[0], duplicates, prediction
        )
    )

         id                                           document     score
0  15080768  analyse centred methodology. discretisation so...  1.000000
1  52682462  audiencethe tissues pulses modelled compartmen...  0.787797
2  52900859  audiencethe tissues pulses modelled compartmen...  0.787797
3   2553555  multilayered illuminated acoustic electromagne...  0.781398
4  50544308  heterostructure schr dinger poisson numericall...  0.778778
15080768: expected: [], predicted: [], prediction: Correct
          id                                           document     score
0   55110306  latrepirdine orally administered molecule init...  1.000000
1  188404434  cysteamine potentially numerous huntington dis...  0.903964
2   81634102  deutetrabenazine molecule deuterium attenuates...  0.880078
3   42021224  comorbidities. safe drugs available. efficacy ...  0.857741
4   78271101  promising prevent onset ultrahigh psychosis di...  0.849158
55110306: expected: [], predicted: [], prediction: Correct
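
Note that the check in the loop above is strict: a prediction counts as correct only when the predicted set of duplicates equals the labelled set exactly, so a single missed or extra ID makes it false. The sketch below isolates that rule; `score_prediction` is a hypothetical helper, and the string conversion of IDs is a defensive assumption.

```python
def score_prediction(predicted, labelled):
    """Return "Correct" only if both ID collections contain exactly the same IDs."""
    predicted_set = {str(x) for x in predicted}
    labelled_set = {str(x) for x in labelled}
    return "Correct" if predicted_set == labelled_set else "False"


print(score_prediction([], []))                # Correct: no duplicates expected, none predicted
print(score_prediction(["2", "1"], ["1", "2"]))  # Correct: order does not matter
print(score_prediction(["1"], ["1", "2"]))     # False: one labelled duplicate was missed
```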


In [17]:
all_predictions

{'Correct': 21, 'False': 0}

In [18]:
# Overall accuracy on the test sample
accuracy = round(
    all_predictions["Correct"]
    / (all_predictions["Correct"] + all_predictions["False"]),
    4,
)
accuracy

1.0

In [19]:
# Print the prediction count for each class depending on the number of duplicates in labeled dataset
pd.DataFrame.from_dict(
    predictions_per_category, orient="index", columns=["Correct", "False"]
)

| Labelled duplicates | Correct | False |
| --- | --- | --- |
| 0 | 10 | 0 |
| 1 | 8 | 0 |
| 2 | 1 | 0 |
| 3 | 1 | 0 |
| 5 | 1 | 0 |


## Delete the Index
Delete the index once you are sure that you no longer need it. Deletion is permanent: the index and the vectors stored in it cannot be recovered afterwards.



In [20]:
# Delete the index if it's not going to be used anymore
pinecone.delete_index(index_name)

## Summary

In this notebook, we demonstrated how to perform a deduplication task on 100,000 articles using Pinecone. With articles embedded as vectors, you can use Pinecone's vector index to find similar articles. For each query article, we then use an LSH classifier on the similar articles to identify duplicates. Overall, we show that it is easy to combine Pinecone with article embedding models and duplication classifiers to build a deduplication service.
