# Finding Duplicate Reviews Via Semantic Search


In this walkthrough we will use vector embeddings to find duplicate or similar items in a movie review dataset.
The same approach can be used to group similar photos in your photo collection, automatically categorize data, etc.

## Outline

1. Get the data
2. Set up Lantern (on top of self-hosted Postgres, or in Lantern Cloud)
3. Upload the data into Lantern
4. Generate embeddings (__automated in Lantern Cloud!__)
5. Create a vector index (__40x faster in Lantern Cloud!__)
6. Query the database to find similar reviews
    1. Brute Force - no vector index (takes ~1.5 hours)
    2. Vector Index + Code (takes ~25 minutes)
    3. __Vector Index + SQL JOIN (takes ~35 seconds!)__
    
7. Bonus! Evaluate the quality of our approximate vector index
8. Bonus! Flag Identical Reviews


## 1. Get the data

In [343]:
!python3 -m pip install datasets sentence_transformers tqdm pandas psycopg2-binary > /dev/null

We will use the IMDb movie review dataset from [Hugging Face](https://huggingface.co/datasets/imdb).

In [8]:
import time
from datasets import load_dataset
from psycopg2 import extras
from tqdm.notebook import tqdm
import pandas as pd

data = load_dataset("imdb", split="train")
data

Found cached dataset imdb (/home/ngalstyan/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0)


Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})


## 2. Set up Lantern
You will need access to a Lantern database to follow this tutorial.

You can get one with 3 clicks at [Lantern Cloud](https://lantern.dev), or you can set up Lantern in your own environment ([docs](https://docs.lantern.dev/get-started/install-from-binaries)).

In [None]:
# Connect to the database
import psycopg2

# Change the database URL to yours
LANTERN_URL = "PUT YOUR LANTERN URL HERE"
# Both "postgres://" and "postgresql://" are valid URL schemes
if not LANTERN_URL.startswith(("postgres:", "postgresql:")):
    LANTERN_URL = input("Please enter your Lantern URL: ")

def connect_db():
    return psycopg2.connect(LANTERN_URL)

global_conn = connect_db()

## 3. Upload the data into Lantern

Let's create a table for our movie review dataset with the following schema:
```sql
CREATE TABLE imdb_reviews_new111 (
    id SERIAL PRIMARY KEY, 
    imdb_id int NOT NULL UNIQUE, 
    review text, 
    positive_review bool)
```

Let's upload the data to our database so we can start running queries against it.
Note that we are using [`psycopg2.extras.execute_values`](https://www.psycopg.org/docs/extras.html#psycopg2.extras.execute_values), which splits the rows into pages and sends each page as one multi-row `INSERT`, so batching is handled for us behind the scenes.
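
To make the batching concrete, here is a minimal sketch of what paging rows with a given `page_size` amounts to. The helper name `paginate` is hypothetical and is not part of psycopg2's public API; it only illustrates how 1,000 rows would be split when `page_size=400`.

```python
from itertools import islice

def paginate(rows, page_size):
    """Yield successive lists of at most `page_size` rows (hypothetical helper)."""
    it = iter(rows)
    while True:
        page = list(islice(it, page_size))
        if not page:
            return
        yield page

# Toy rows in the same (imdb_id, review, positive_review) shape used in this notebook
rows = [(i, f"review {i}", i % 2 == 1) for i in range(1000)]
pages = list(paginate(rows, page_size=400))
page_sizes = [len(p) for p in pages]
print(page_sizes)  # [400, 400, 200]
```

Larger pages mean fewer round trips to the database at the cost of bigger statements.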

In [39]:
# Create table
def setup_table():
    with global_conn.cursor() as cur:
        # Uncomment if the Lantern extension is not enabled yet, or to start from a clean table
        # cur.execute("CREATE EXTENSION IF NOT EXISTS lantern")
        # cur.execute("DROP TABLE IF EXISTS imdb_reviews_new111")
        cur.execute("""
CREATE TABLE imdb_reviews_new111 (
  id SERIAL PRIMARY KEY,
  imdb_id int NOT NULL UNIQUE,
  review text,
  positive_review bool
);""")
        global_conn.commit()

In [21]:
def insert_values(conn, values, batch_size=400):
    with conn.cursor() as cur:
        reviews = values["text"]
        labels = values["label"]
        # imdb_id is the row's index in the dataset; label 1 means a positive review
        batch = [(i, review, label == 1) for i, (review, label) in enumerate(zip(reviews, labels))]
        # execute_values sends the rows in pages of `batch_size`
        psycopg2.extras.execute_values(cur, "INSERT INTO imdb_reviews_new111 (imdb_id, review, positive_review) VALUES %s;", batch,
                                      template=None, page_size=batch_size)
        conn.commit()

In [28]:
setup_table()
insert_values(global_conn, data)

In [None]:
# Sanity check: imdb_ids in the Postgres table correspond to our array indexes in this notebook
with global_conn.cursor() as cur:
    cur.execute("SELECT * from imdb_reviews_new111 where imdb_id = 1;")
    print(cur.fetchall()[0][2] == data["text"][1])

## 4. Generate embeddings
At this point all of our data is in the Lantern database, and we can view a summary of the table in the Lantern dashboard.

More importantly, through the dashboard we can generate embeddings with various models, add them as additional columns to our table, and create vector indexes on them.
Lantern runs these operations on dedicated, workload-optimized servers, avoiding extra load on the database instance.
This keeps the database's full capacity available for your production queries while the compute-heavy operations are carried out elsewhere.
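
If you are not using the dashboard, embeddings can also be computed on the client. The sketch below shows only the batching logic: `embed_in_batches` and `toy_encode` are hypothetical names, and `toy_encode` is a stand-in that returns meaningless 3-dimensional vectors. In practice you would pass a real encoder, for example the `encode` method of a `sentence_transformers.SentenceTransformer` model; which model the dashboard uses is not stated here.

```python
def embed_in_batches(texts, encode, batch_size=64):
    """Call `encode` on slices of `texts` and collect one vector per text."""
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        embeddings.extend(encode(batch))
    return embeddings

def toy_encode(batch):
    # Stand-in encoder (assumption): NOT a real embedding model.
    # It maps each text to [length, number of spaces, 1.0].
    return [[float(len(t)), float(t.count(" ")), 1.0] for t in batch]

texts = ["great movie", "terrible plot and acting", "great movie"]
vectors = embed_in_batches(texts, toy_encode, batch_size=2)
# Identical texts get identical vectors, which is what duplicate detection relies on
print(len(vectors), vectors[0] == vectors[2])
```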



Once embedding generation and index creation finish successfully, we can see the additional columns on our table:

In [356]:
with global_conn.cursor() as cur:
    form = "{:>26}" * 3
    cur.execute("SELECT column_name, data_type, is_nullable FROM information_schema.columns WHERE table_name = 'imdb_reviews_new111';")
    print(form.format("column_name","data_type","is_nullable"))
    print()
    for r in cur.fetchall():
        print(form.format(*r))

               column_name                 data_type               is_nullable

                        id                   integer                        NO
                   imdb_id                   integer                        NO
           positive_review                   boolean                       YES
          review_embedding                     ARRAY                       YES
                    review                      text                       YES


We will consider 3 approaches to finding similar reviews:
1. No index, full scan of the table
2. Lantern index + Python loop to aggregate results
3. Single JOIN query to get our answer

## 5. Create a vector index

We can again use the Lantern dashboard to create a vector index on the embedding column Lantern created for us.
Note that we could also create the vector index with the more familiar `CREATE INDEX` statement, shown below for a generic table `lantern_demo` with a vector column `vec`:
```python
with global_conn.cursor() as cur:
    cur.execute("""
    CREATE INDEX lantern_demo_idx ON lantern_demo 
    USING hnsw(vec dist_cos_ops) 
    WITH (m=32, ef_construction=128, dim=384, ef=64)""")
```

But vector index creation is an expensive operation. Doing it inside the database will:
- take longer
- slow down database queries for the duration of index generation

Index creation done in the Lantern dashboard happens on a separate dedicated server. The resulting index is then copied into our database and attached to Postgres, as if it had been created via `CREATE INDEX`.
This saves time and database resources! It also allows for faster iteration and index parameter tuning.
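
When tuning parameters by hand, it can help to generate the `CREATE INDEX` statement from a small set of candidate values. This is only a sketch: `build_index_sql` is a hypothetical helper, the `hnsw(... dist_cos_ops)` syntax and option names are copied from the statement above, and the index naming scheme and candidate values are assumptions.

```python
def build_index_sql(table, column, m, ef_construction, ef, dim=384):
    """Format a CREATE INDEX statement mirroring the one shown above.

    Identifiers are interpolated directly, so only pass trusted names.
    """
    name = f"{table}_{column}_m{m}_efc{ef_construction}_idx"
    return (
        f"CREATE INDEX {name} ON {table} "
        f"USING hnsw({column} dist_cos_ops) "
        f"WITH (m={m}, ef_construction={ef_construction}, dim={dim}, ef={ef})"
    )

# Candidate (m, ef_construction, ef) settings to compare; values are illustrative only
candidates = [(16, 64, 32), (32, 128, 64)]
statements = [
    build_index_sql("imdb_reviews_new111", "review_embedding", m, efc, ef)
    for m, efc, ef in candidates
]
for sql in statements:
    print(sql)
```

Each statement could then be passed to `cur.execute(...)` one at a time, dropping the previous index between runs.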



## 6. (A. and B.) Query the database

In [31]:
# this is necessary for approaches (1) and (2) only, since embedding querying happens in python
all_embeds = None
with global_conn.cursor() as cur:
    cur.execute("SELECT imdb_id, review_embedding from imdb_reviews_new111;")
    all_embeds = cur.fetchall() 

In [65]:
def find_similar_foreach(all_embeds, use_index=True):
    THRESHOLD = 0.07
    # Ordering by `<->` can be served by the vector index;
    # ordering by cos_dist(...) is evaluated against every row instead
    order_by_expr = "%s <-> review_embedding"
    if not use_index:
        order_by_expr = "cos_dist(%s, review_embedding)"

    with global_conn.cursor() as cur:
        for imdb_id, embedding in tqdm(all_embeds):
            # Fetch the 2 nearest rows: normally the review itself plus its closest neighbor
            cur.execute(f"SELECT cos_dist(%s, review_embedding) as dist, imdb_id from imdb_reviews_new111 order by {order_by_expr} limit 2;",
                        (embedding, embedding))
            res = cur.fetchall()

            for dist, found_id in res:
                # Skip the query review itself
                if found_id == imdb_id:
                    continue
                if dist < THRESHOLD:
                    print(f"found similar! (distance={dist})")
                    query_txt = data["text"][imdb_id]
                    print(f"Query({imdb_id}): {query_txt}")
                    found_txt = data["text"][found_id]
                    print(f"Found({found_id}): {found_txt}")

### Approach 6.A: Do not use the vector index (will take ~1.5 hours)

In [66]:

find_similar_foreach(all_embeds, False)

  0%|          | 0/25000 [00:00<?, ?it/s]

found similar! (distance=0.063269675)
Query(7694): This really is one of the worst movies ever made. I consider myself a HUGE zombie film fan and usually tolerate bad acting, lame "special effects" a dumb story and whatever you may encounter in second rate movies, AS LONG as the film has a good atmosphere/story/suspension or whatever to offer. This one has basically no positive aspect to it and is third or fourth rate, maybe worse. Some friends of mine and myself made a small movie during a week´s holiday and definitely did a better job (no zombie film though).<br /><br />This flick is not even funny, not speaking of anything else. Really bad and redundant special effects, zombies that look like normal people (except for a white additional skin pulled over their faces), WAY TO MUCH fake blood (I like realism a lot, the combination of realism and Zombie films being debatable, but the presented gore is just plain silly). The camera stays quite long with feedings scenes, it gets boring an

KeyboardInterrupt: 

### Approach 6.B: Use the index but query it from Python for each row (will take ~25 minutes)

In [67]:

find_similar_foreach(all_embeds, True)

  0%|          | 0/25000 [00:00<?, ?it/s]

found similar! (distance=0.063269675)
Query(7694): This really is one of the worst movies ever made. I consider myself a HUGE zombie film fan and usually tolerate bad acting, lame "special effects" a dumb story and whatever you may encounter in second rate movies, AS LONG as the film has a good atmosphere/story/suspension or whatever to offer. This one has basically no positive aspect to it and is third or fourth rate, maybe worse. Some friends of mine and myself made a small movie during a week´s holiday and definitely did a better job (no zombie film though).<br /><br />This flick is not even funny, not speaking of anything else. Really bad and redundant special effects, zombies that look like normal people (except for a white additional skin pulled over their faces), WAY TO MUCH fake blood (I like realism a lot, the combination of realism and Zombie films being debatable, but the presented gore is just plain silly). The camera stays quite long with feedings scenes, it gets boring an

KeyboardInterrupt: 

### Approach 6.C: Vector Index + SQL JOIN (~35 seconds, 40x faster than above!)

The limitation of the above approach is that we iterate over all movie reviews in Python and issue a separate vector search for each one. We can instead describe the full query to our database and have it return the final result: a list of review IDs, each with the IDs of its N closest reviews.

The query in the block below does exactly that!

In [None]:
final_res = None
with global_conn.cursor() as cur:
    cur.execute("""
SELECT
  forall.imdb_id, 
  nearest_per_id.near_imdb_ids, nearest_per_id.imdb_dists
FROM
  (
    SELECT
      imdb_id, review_embedding
    FROM
      imdb_reviews_new111
    LIMIT 100000
  ) AS forall
  JOIN LATERAL (
    SELECT
      ARRAY_AGG(imdb_id) AS near_imdb_ids, 
      ARRAY_AGG(imdb_dist) AS imdb_dists
    FROM
      (
        SELECT
          t2.imdb_id,
          cos_dist(forall.review_embedding, t2.review_embedding) AS imdb_dist
        FROM
          imdb_reviews_new111 t2
        ORDER BY
          forall.review_embedding <-> t2.review_embedding
        LIMIT
          5
      ) AS __unused_name
  ) nearest_per_id ON TRUE
ORDER BY
  forall.imdb_id;
""")
    final_res = cur.fetchall()

__What's going on in that query?__

There are two main subqueries in the query above:

1. Subquery forall:
This is the first building block of the query. It selects two pieces of information for each review in the dataset: the review identifier (imdb_id) and the review embedding (review_embedding), a numerical representation of the review's content. The subquery is capped at 100,000 rows of the imdb_reviews_new111 table, which is more than the 25,000 reviews we loaded, so every review is included. Lower the LIMIT to experiment on a subset.


2. Lateral Join Subquery nearest_per_id:
The second building block is a more complex subquery that is joined laterally. For each row of the forall subquery, it finds the 5 closest reviews based on the cosine distance between their embeddings. The smaller the cosine distance between two embeddings, the more similar the two reviews are. The inner query orders by the `<->` operator so that the vector index can be used, and the subquery aggregates the IDs (imdb_id) and distances (imdb_dist) of these closest reviews into arrays, creating a list of the most similar reviews for each review in forall.

__Relation of the Outer Query to Building Block Queries:__
The outer query brings these building blocks together. It selects the imdb_id from the forall subquery and pairs it with the arrays of nearest imdb_ids and their corresponding distances (imdb_dists) from the nearest_per_id subquery. The result maps each review to the list of its most similar reviews, along with their distances. Because each review is normally its own nearest neighbor, the first entry of each array is usually the review itself.
The final output is ordered by the imdb_id from the forall subquery.
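
As a mental model, the query computes the same thing as the brute-force Python below, just inside the database and with the index doing the neighbor search. This is a sketch on toy 2-dimensional vectors; `cos_dist` here is a plain-Python stand-in for the SQL function of the same name, and `nearest_per_id` is named after the subquery for readability.

```python
import math

def cos_dist(a, b):
    """Cosine distance: 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def nearest_per_id(rows, k=5):
    """For each (id, vector) row, return (id, near_ids, dists) for its k nearest rows."""
    result = []
    for row_id, vec in rows:  # plays the role of `forall`
        scored = sorted((cos_dist(vec, other), other_id) for other_id, other in rows)
        top = scored[:k]      # ORDER BY distance LIMIT k
        result.append((row_id, [i for _, i in top], [d for d, _ in top]))
    return sorted(result)     # ORDER BY forall.imdb_id

toy_rows = [(0, [1.0, 0.0]), (1, [0.9, 0.1]), (2, [0.0, 1.0])]
toy_res = nearest_per_id(toy_rows, k=2)
print([(row_id, near_ids) for row_id, near_ids, _ in toy_res])
```

The Python version compares every pair of rows, which is why it does not scale; the SQL version lets the index answer each per-row search approximately.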

In [298]:
# r[2][1] is the distance to the second-nearest review (the nearest is normally the review itself)
very_similar = [r for r in final_res if 0.01 < r[2][1] <= 0.03]

In [315]:
pd.DataFrame(very_similar, columns=["imdb_id", "most_similar_imdb_ids", "distance"])

| | imdb_id | most_similar_imdb_ids | distance |
|---|---|---|---|
| 0 | 1578 | [1578, 18101, 18102, 18104, 23556] | [5.9604645e-08, 0.026409447, 0.08959967, 0.104... |
| 1 | 1772 | [1772, 18172, 1767, 18166, 18165] | [0.0, 0.029390275, 0.058212996, 0.07342106, 0.... |
| 2 | 1832 | [1832, 18723, 18714, 18713, 18709] | [-1.1920929e-07, 0.026373267, 0.03368485, 0.04... |
| 3 | 6113 | [6113, 6116, 5751, 2751, 3336] | [0.0, 0.025266469, 0.13563824, 0.15339321, 0.1... |
| 4 | 6116 | [6116, 6113, 5751, 8140, 2751] | [-1.1920929e-07, 0.025266469, 0.15087664, 0.16... |
| 5 | 11802 | [11802, 11803, 12036, 10494, 2901] | [0.0, 0.014384985, 0.12550741, 0.13626295, 0.1... |
| 6 | 11803 | [11803, 11802, 12036, 10494, 2901] | [0.0, 0.014384985, 0.14396846, 0.15840888, 0.1... |
| 7 | 14384 | [14384, 14396, 14397, 14387, 14388] | [-1.1920929e-07, 0.027822495, 0.07430953, 0.07... |
| 8 | 14396 | [14396, 14384, 14387, 14388, 14397] | [-1.1920929e-07, 0.027822495, 0.05249983, 0.05... |
| 9 | 18101 | [18101, 1578, 18102, 18104, 23556] | [0.0, 0.026409447, 0.08518565, 0.11065954, 0.1... |
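
Note that each similar pair shows up twice in the output above, once per member (for example 6113 and 6116). A small sketch for collapsing such rows into unique pairs follows. `unique_pairs` is a hypothetical helper; unlike the filter above it looks at every returned neighbor rather than only the second one, and the sample rows are abbreviated copies of the output with only the first two neighbors kept.

```python
def unique_pairs(rows, low=0.01, high=0.03):
    """Collect unordered (id_a, id_b) pairs whose distance lies in (low, high]."""
    pairs = {}
    for row_id, near_ids, dists in rows:
        for other_id, dist in zip(near_ids, dists):
            # Skip the row itself and anything outside the distance band
            if other_id != row_id and low < dist <= high:
                pairs[tuple(sorted((row_id, other_id)))] = dist
    return pairs

sample_rows = [
    (6113, [6113, 6116], [0.0, 0.025266469]),
    (6116, [6116, 6113], [-1.1920929e-07, 0.025266469]),
    (11802, [11802, 11803], [0.0, 0.014384985]),
    (11803, [11803, 11802], [0.0, 0.014384985]),
]
pairs = unique_pairs(sample_rows)
print(pairs)
```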


## Flag similar reviews
Below are some example pairs of reviews marked as similar according to our filtering above:

In [325]:
print(data["text"][14384], "\n__VS__\n",data["text"][14396])

All dogs go to Heaven is one of the best movies I've ever seen. I first saw it when I was like 3. Now I'm 12 and I rented it, it makes me think of things and it brings back so many memories, those were "the days". I love the music, I love when Charlie is arriving in Heaven, I love the song "Let me be surprised". I love how Charlie looks and his voice, Bert Reynolds could only play Charlie's voice this great. I love this movie, the 1st one is the best one because it's so original and great. It really does bring back memories that no one can describe, not even me. If only I could go back to those days. I love the characters. If this is the way the memories come back when I'm 12 imagine how I'll feel when I'm like 19, I hope I'll be able to watch this when I'm older. When I first seen this I never knew that I would really look back on it and feel this way , I hope it will be available to watch. I'm so happy that this movie was made and the amazing idea came to mind and heart. On a scale f

In [304]:
print(data["text"][16401], "\n__VS__\n",data["text"][16408])


This has to be the funniest stand up comedy I have ever seen. Eddie Izzard is a genius, he picks in Brits, Americans and everyone in between. His style is completely natural and completely hilarious. I doubt that anyone could sit through this and not laugh their a** off. Watch, enjoy, it's funny. 
__VS__
 This is another gem of a stand up show from Eddie Izzard . You cannot fail to laugh at the wide range of topics he talks about. He even takes the piss out of his American audiance at times and most of them didnt even realise it! A must see for anybody who likes comedians. 9 out of 10.


## Bonus! Evaluate the quality of our approximate vector index
Since each vector must be closest to itself, we can use the query results to check how well our approximate index preserves this invariant. An exact vector index would always preserve it. HNSW sacrifices exactness for performance and gives us 3 hyperparameters (`m`, `ef_construction` and `ef` in the `CREATE INDEX` statement above) to tune how close it gets to an exact index. The more exact we make the approximate index, the slower it becomes, so there is a tradeoff.

We can again use the Lantern dashboard to create indexes with different parameters and see which one results in fewer mistakes.

In [330]:
mistakes = [r for r in final_res if r[0] not in r[1]]
print("In %d (of %d reviews) the review in query was not considered close to itself by our index " % (len(mistakes), len(final_res)))

In 1082 (of 25000 reviews) the review in query was not considered close to itself by our index 
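
The same check can be expressed as a rate, which is easier to compare across index settings. A minimal sketch, where `self_recall` is a hypothetical helper name and rows have the same `(imdb_id, near_imdb_ids, imdb_dists)` shape as `final_res`:

```python
def self_recall(rows):
    """Fraction of rows whose own id appears among its returned neighbors."""
    if not rows:
        return 0.0
    hits = sum(1 for row_id, near_ids, _ in rows if row_id in near_ids)
    return hits / len(rows)

toy_rows = [
    (1, [1, 2], [0.0, 0.2]),
    (2, [2, 1], [0.0, 0.2]),
    (3, [1, 2], [0.3, 0.4]),  # the index missed row 3 itself
    (4, [4, 3], [0.0, 0.1]),
]
toy_recall = self_recall(toy_rows)
print(f"{toy_recall:.2%}")  # 75.00%

# With the counts reported above: 1082 misses out of 25000 queries
reported_recall = 1 - 1082 / 25000
print(f"{reported_recall:.2%}")  # 95.67%
```

Calling `self_recall(final_res)` after each index rebuild gives a single number to track while tuning.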


## Bonus! Flag Identical Reviews
We can also use the query results from our index to find identical duplicate reviews in the dataset. To do this, we search for vectors that are extremely close together but have different IMDb ids.

In [342]:
identical = [r for r in final_res if r[2][1] < 0.01]
identical[:10]

[(167,
  [167, 168, 5851, 13388, 24841],
  [5.9604645e-08, 5.9604645e-08, 0.17264384, 0.17720449, 0.18217313]),
 (168,
  [167, 168, 5851, 13388, 24841],
  [5.9604645e-08, 5.9604645e-08, 0.17264384, 0.17720449, 0.18217313]),
 (194,
  [194, 195, 5614, 13064, 14024],
  [-1.1920929e-07, 0.0005583167, 0.141236, 0.17064029, 0.17728132]),
 (195,
  [195, 194, 5614, 13064, 14024],
  [-1.1920929e-07, 0.0005583167, 0.14078432, 0.17046922, 0.1763081]),
 (357,
  [10274, 357, 13607, 13619, 13610],
  [-1.1920929e-07, -1.1920929e-07, 0.11676812, 0.12345934, 0.13316357]),
 (358,
  [10275, 358, 10222, 597, 5414],
  [0.0, 0.0, 0.13878095, 0.14487368, 0.1503331]),
 (359,
  [359, 10276, 4728, 22560, 3379],
  [5.9604645e-08, 5.9604645e-08, 0.1301226, 0.1348725, 0.13509935]),
 (360,
  [360, 10277, 6284, 1288, 15822],
  [0.0, 0.0, 0.15350217, 0.15636468, 0.16078287]),
 (361,
  [10278, 361, 13602, 13608, 13615],
  [-1.1920929e-07, -1.1920929e-07, 0.13646817, 0.14160305, 0.14672083]),
 (599,
  [599, 12001, 1895

From the above, we see that there are many pairs of reviews whose distance is very close to zero. This gives us very high confidence that the underlying reviews are identical. Below are example identical rows, taken from the above:

In [286]:
data["text"][357]

'What was an exciting and fairly original series by Fox has degraded down to meandering tripe. During the first season, Dark Angel was on my weekly "must see" list, and not just because of Jessica Alba.<br /><br />Unfortunately, the powers-that-be over at Fox decided that they needed to "fine-tune" the plotline. Within 3 episodes of the season opener, they had totally lost me as a viewer (not even to see Jessica Alba!). I found the new characters that were added in the second season to be too ridiculous and amateurish. The new plotlines were stretching the continuity and credibility of the show too thin. On one of the second season episodes, they even had Max sleeping and dreaming - where the first season stated she biologically couldn\'t sleep.<br /><br />The moral of the story (the one that Hollywood never gets): If it works, don\'t screw with it!<br /><br />azjazz'

In [285]:
data["text"][10274]

'What was an exciting and fairly original series by Fox has degraded down to meandering tripe. During the first season, Dark Angel was on my weekly "must see" list, and not just because of Jessica Alba.<br /><br />Unfortunately, the powers-that-be over at Fox decided that they needed to "fine-tune" the plotline. Within 3 episodes of the season opener, they had totally lost me as a viewer (not even to see Jessica Alba!). I found the new characters that were added in the second season to be too ridiculous and amateurish. The new plotlines were stretching the continuity and credibility of the show too thin. On one of the second season episodes, they even had Max sleeping and dreaming - where the first season stated she biologically couldn\'t sleep.<br /><br />The moral of the story (the one that Hollywood never gets): If it works, don\'t screw with it!<br /><br />azjazz'
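
A near-zero distance is strong evidence but not proof that two texts are character-for-character identical (the pair 194 and 195 above has a small but non-zero distance). Below is a sketch of a follow-up check that compares the raw text. `split_exact_and_near` is a hypothetical helper, and the toy texts and distances are made up for illustration.

```python
def split_exact_and_near(rows, texts, max_dist=0.01):
    """Split close pairs into exact text duplicates and near-duplicates."""
    exact, near = set(), set()
    for row_id, near_ids, dists in rows:
        for other_id, dist in zip(near_ids, dists):
            # Skip the row itself and anything that is not extremely close
            if other_id == row_id or dist >= max_dist:
                continue
            pair = tuple(sorted((row_id, other_id)))
            if texts[pair[0]] == texts[pair[1]]:
                exact.add(pair)
            else:
                near.add(pair)
    return exact, near

toy_texts = {0: "Loved it.", 1: "Loved it.", 2: "Loved it!", 3: "Awful."}
toy_rows = [
    (0, [0, 1, 2], [0.0, 0.0, 0.004]),
    (1, [0, 1, 2], [0.0, 0.0, 0.004]),
    (2, [2, 0, 3], [0.0, 0.004, 0.3]),
    (3, [3, 2, 0], [0.0, 0.3, 0.31]),
]
exact, near = split_exact_and_near(toy_rows, toy_texts)
print(sorted(exact), sorted(near))
```

With the real data, `split_exact_and_near(final_res, data["text"])` would work the same way, since `data["text"]` is indexable by `imdb_id`.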