# Introduction
In this notebook, we use Lantern to implement semantic similarity search over questions: given a query question such as "How can I be a better software engineer?", we retrieve the questions that are closest to it in meaning.

The questions come from the Quora dataset, loaded through Hugging Face's `datasets` library.

If you are running this in Colab, note that a GPU-enabled runtime will compute the embeddings much faster; a CPU runtime will take significantly longer.

# Setup Postgres

We install Postgres and its development tools (needed to build Lantern from source). We also start Postgres, set the password of the `postgres` user to `postgres`, and create a database called `ourdb`.




In [None]:
# We install postgres and its dev tools
!sudo apt-get -y -qq update
!sudo apt-get -y -qq install postgresql postgresql-server-dev-all
#  Start postgres
!sudo service postgresql start

# Create user, password, and db
!sudo -u postgres psql -U postgres -c "ALTER USER postgres PASSWORD 'postgres';"
!sudo -u postgres psql -U postgres -c 'DROP DATABASE IF EXISTS ourdb;'
!sudo -u postgres psql -U postgres -c 'CREATE DATABASE ourdb;'


# Install Lantern and build it from source

In [None]:
!git clone --recursive https://github.com/lanterndata/lantern.git

Cloning into 'lantern'...
remote: Enumerating objects: 2562, done.
remote: Counting objects: 100% (1336/1336), done.
remote: Compressing objects: 100% (478/478), done.
remote: Total 2562 (delta 1044), reused 936 (delta 852), pack-reused 1226
Receiving objects: 100% (2562/2562), 587.50 KiB | 2.11 MiB/s, done.
Resolving deltas: 100% (1684/1684), done.
Submodule 'third_party/hnswlib' (https://github.com/ngalstyan4/hnswlib) registered for path 'third_party/hnswlib'
Submodule 'third_party/usearch' (https://github.com/ngalstyan4/usearch) registered for path 'third_party/usearch'
Cloning into '/content/lantern/third_party/hnswlib'...
remote: Enumerating objects: 1723, done.        
remote: Counting objects: 100% (333/333), done.        
remote: Compressing objects: 100% (40/40), done.        
remote: Total 1723 (delta 306), reused 293 (delta 293), pack-reused 1390        
Receiving objects: 100% (1723/1723), 530.50 KiB | 21.22 MiB/s, done.
Resolving deltas: 100% (1097/1097), done.

In [None]:
# We build lantern from source
%cd lantern
!mkdir build
%cd build
!pwd
!cmake ..
!make install

/content/lantern
/content/lantern/build
/content/lantern/build
  Compatibility with CMake < 3.5 will be removed from a future version of
  CMake.

  Update the VERSION argument <min> value or use a ...<max> suffix to tell
  CMake that the project does not need compatibility with older versions.

-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Build type: 
-- Found pg_config as /usr/bin/pg_config
-- Found postgres binary at /usr/lib/postgresql/14/bin/postgres
-- PostgreSQL version PostgreSQL 14.9 (Ubuntu 14.9-0ubuntu0.22.04.1) fou

# Installing other Prerequisites

In [None]:
!pip install -qU \
  datasets==2.12.0 \
  sentence-transformers==2.2.2

# Gathering and preprocessing Quora data
We will use the Quora dataset from Hugging Face datasets (the `datasets` package we installed above). It contains around 400K pairs of questions from the question-answering site Quora. Let's use a subset of these pairs.


In [None]:
from datasets import load_dataset

dataset = load_dataset('quora', split='train[100000:150000]')

# Some example samples of this dataset
dataset[:4]


{'questions': [{'id': [165932, 165933],
   'text': ['What should I ask my friend to get from UK to India?',
    'What is the process of getting a surgical residency in UK after completing MBBS from India?']},
  {'id': [123111, 39307],
   'text': ['How can I learn hacking for free?',
    'How can I learn to hack seriously?']},
  {'id': [165934, 165935],
   'text': ['Which is the best website to learn programming language C++?',
    'Which is the best website to learn C++ Programming language for free?']},
  {'id': [165936, 165937],
   'text': ['What did Werner Heisenberg mean when he said, “The first gulp from the glass of natural sciences will turn you into an atheist, but at the bottom of the glass God is waiting for you”?',
    'What did God mean when He said "an eye for an eye "?']}],
 'is_duplicate': [False, True, False, False]}

Let's get all the questions into a single list.

In [None]:
questions = []

for record in dataset['questions']:
    questions.extend(record['text'])

# Remove duplicates
questions = list(set(questions))
print('\n'.join(questions[:4]))
print(f"Number of questions: {len(questions)}")


How do I check if website uses schema.org?
How do I integrate Maven with selenium?
In Batman v Superman, what did Lex Luthor want the painting upside- down?
Number of questions: 88720
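One thing to keep in mind about `list(set(questions))`: a Python `set` does not preserve insertion order, so which questions appear first can differ between runs. If a reproducible order matters, one possible alternative is an order-preserving deduplication. The sketch below uses a small hypothetical list (`sample_questions`) in place of the real data, and `dedupe_keep_order` is a helper name introduced here for illustration.

```python
# Hypothetical sample standing in for the flattened `questions` list.
sample_questions = [
    "How can I learn hacking for free?",
    "How can I learn to hack seriously?",
    "How can I learn hacking for free?",  # repeated on purpose
]


def dedupe_keep_order(items):
    """Drop repeated items while keeping first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(items))


unique_questions = dedupe_keep_order(sample_questions)
print(unique_questions)
```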


# Getting our embeddings
Embedding the questions above turns each one into a vector, and these vectors are what we will soon insert into Lantern/Postgres. A vector search in our database then returns the stored embeddings that are "closest" to some query embedding, and closeness between embeddings corresponds to semantic "similarity" between the questions. This is the essence of semantic search!
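To make the idea concrete before involving a model or a database, here is a minimal brute-force sketch of "find the closest vectors." The 3-dimensional vectors and labels are made up for illustration, and `cosine_distance` and `nearest` are hypothetical helper names. An index-backed search avoids comparing against every stored vector, but the ranking idea is the same.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest(query, corpus, k=2):
    """Rank (label, vector) pairs by cosine distance to `query`; return the k closest."""
    scored = [(cosine_distance(query, vec), label) for label, vec in corpus]
    return sorted(scored)[:k]


# Toy "embeddings" chosen by hand; real embeddings come from the model.
toy_corpus = [
    ("learn programming", [0.9, 0.1, 0.0]),
    ("cook pasta", [0.0, 0.2, 0.9]),
    ("become a better engineer", [0.8, 0.3, 0.1]),
]
toy_query = [1.0, 0.2, 0.0]

top = nearest(toy_query, toy_corpus, k=2)
for dist, label in top:
    print(f"{label}  (dist: {dist:.4f})")
```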

To create our embeddings, we use the `all-MiniLM-L6-v2` sentence transformer model from the `sentence-transformers` package we installed. We first need to initialize it.

Printing the model in the last line of the cell shows three details worth noting:

  1. `max_seq_length` is 256, which means that at most 256 tokens (units of text, roughly comparable to words) can be encoded into a single vector embedding. Inputs longer than 256 tokens are truncated.

  2. `word_embedding_dimension` is 384, which means that each embedding we obtain is a vector with 384 dimensions. We will use this later with Lantern.

  3. `Normalize()`: the model ends with a normalization step, so every embedding has unit length. For unit-length vectors, cosine similarity and dot-product similarity are equivalent, so either metric can be used to compare embeddings. We will use cosine distance later.


In [None]:
from sentence_transformers import SentenceTransformer
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
if device != 'cuda':
    print(f"You are using {device}. This is much slower than using "
          "a CUDA-enabled GPU. If on Colab you can change this by "
          "clicking Runtime > Change runtime type > GPU.")

model = SentenceTransformer('all-MiniLM-L6-v2', device=device)
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)
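The `Normalize()` step in the printout is what makes the dot product and cosine similarity interchangeable. A small pure-Python check illustrates this with hand-picked vectors (the helper names `normalize`, `dot`, and `cosine_similarity` are introduced here for illustration and are not part of `sentence-transformers`):

```python
import math


def normalize(v):
    """Scale `v` to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


u = normalize([3.0, 4.0, 0.0])
v = normalize([1.0, 2.0, 2.0])

# For unit-length vectors the denominator is 1, so both values agree.
print(dot(u, v), cosine_similarity(u, v))
```

Assuming the usual convention that cosine distance is one minus cosine similarity, a smaller distance means a more similar pair.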

This is how we go from a question (query) to a vector (embedding).

In [None]:
query = 'How do I become a better software engineer?'

embedded_query = model.encode(query)
embedded_query.shape

(384,)

# Create Postgres Table

Now let's connect to Postgres with `psycopg2` and enable the Lantern extension.


In [None]:
import psycopg2

# We use the dbname, user, and password that we specified above
conn = psycopg2.connect(
    dbname="ourdb",
    user="postgres",
    password="postgres",
    host="localhost",
    port="5432" # default port for Postgres
)

# Get a new cursor
cursor = conn.cursor()

# Execute the query to load the Lantern extension in
cursor.execute("CREATE EXTENSION IF NOT EXISTS lantern;")

conn.commit()
cursor.close()

Now let's create the table that we will use to store these embeddings. We'll call the table `questions`, and it will have a primary key `id`, the actual text content of the question `content`, and the embedding for the question `vector`. Note that we make `vector` of type real array (`real[]`). We could declare a size, like `real[384]`, but Postgres treats a declared array size as documentation only and does not enforce it.

In [None]:
# Create the table
cursor = conn.cursor()

create_table_query = "CREATE TABLE questions (id serial PRIMARY KEY, content text, vector real[]);"

cursor.execute(create_table_query)

conn.commit()
cursor.close()
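Since the array size is not enforced by the column type, a vector of the wrong length could be stored in `vector` without complaint. One option is to check the length on the Python side before inserting. This is a minimal sketch; `check_dim` and `EXPECTED_DIM` are hypothetical names, not part of Lantern or `psycopg2`.

```python
EXPECTED_DIM = 384  # the word_embedding_dimension reported by the model


def check_dim(vector, expected=EXPECTED_DIM):
    """Return `vector` unchanged, or raise ValueError if its length is not `expected`."""
    if len(vector) != expected:
        raise ValueError(f"expected {expected} dimensions, got {len(vector)}")
    return vector


check_dim([0.0] * 384)  # passes silently

try:
    check_dim([0.0] * 383)
except ValueError as err:
    print(err)
```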

# Inserting embeddings into our database
Now that we have a table created, let's create and insert the embeddings for the questions we prepared earlier.

The majority of the time spent here is computing the embeddings for our questions, using the model we set up before.

In [None]:
from tqdm.auto import tqdm

cursor = conn.cursor()

# The questions we want to embed
# To make this faster, we will only insert the first 1000 questions
Qs = questions[:1000]

for i in tqdm(range(0, len(Qs))):
    content = Qs[i]

    # Create embedding for the question
    vector = [float(x) for x in model.encode(Qs[i])]

    # Insert the content of the question as well as the embedding into our db
    cursor.execute("INSERT INTO questions (content, vector) VALUES (%s, %s);", (content, vector))

conn.commit()
cursor.close()
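The loop above encodes one question per call. `model.encode` also accepts a list of sentences, so a possible speed-up is to encode the questions in batches. The sketch below only shows the batching helper (`batched` is a hypothetical name, and `sample` stands in for `Qs`); how the model would be called is described in the comments so the example runs without downloading anything.

```python
def batched(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical stand-in for the first few entries of `Qs`.
sample = ["q1", "q2", "q3", "q4", "q5"]
batches = list(batched(sample, 2))
print(batches)

# In the notebook, each batch could be encoded in a single call, e.g.
#   vectors = model.encode(batch)   # one embedding per question in the batch
# and each (content, vector) pair inserted as in the loop above.
```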




# Creating an Index
Now that we have inserted the embeddings into our database, we need to construct an index in Postgres using Lantern. This is important because the index allows Postgres to use Lantern when performing vector search.

Note that we specify cosine distance as the distance metric, as we mentioned earlier. Also, as a good practice, we specify the dimension of the index (although Lantern can infer it from the vectors we've already inserted).

In [None]:
cursor = conn.cursor()

cursor.execute("CREATE INDEX ON questions USING hnsw (vector dist_cos_ops) WITH (dim=384);")

conn.commit()
cursor.close()
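Once the index exists, it can be useful to check whether a particular query actually goes through it. Postgres's `EXPLAIN` prints the query plan, and an index-backed plan normally contains an "Index Scan" line. The sketch below wraps that check in a hypothetical helper, `uses_index_scan`. To keep it runnable without a database, it is exercised with a `FakeCursor` and made-up plan text; with a real connection you would pass a `psycopg2` cursor instead.

```python
def uses_index_scan(cursor, sql):
    """Run EXPLAIN on `sql` and report whether any plan line mentions an index scan."""
    cursor.execute("EXPLAIN " + sql)
    plan_lines = [row[0] for row in cursor.fetchall()]
    return any("Index Scan" in line for line in plan_lines)


class FakeCursor:
    """Hypothetical stand-in for a psycopg2 cursor, used only for this demo."""

    def __init__(self, plan_lines):
        self._rows = [(line,) for line in plan_lines]
        self.executed = []

    def execute(self, sql):
        self.executed.append(sql)

    def fetchall(self):
        return self._rows


# Made-up plan text for illustration only; real EXPLAIN output will differ.
fake = FakeCursor(["Limit", "  ->  Index Scan using questions_vector_idx on questions"])
print(uses_index_scan(fake, "SELECT content FROM questions LIMIT 5;"))
```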

# Performing Similarity Search

With our questions embedded and indexed, we can perform vector search over them and find semantically similar questions! Recall the example query we had earlier:

In [None]:
query = 'How do I become a better software engineer?'

embedded_query = model.encode(query)
embedded_query = [float(x) for x in embedded_query]
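The conversion to a list of plain Python floats matters for the search query in the next cell, which splices the embedding into the SQL text with an f-string. A minimal sketch with a hypothetical 3-dimensional vector (`toy_embedding` is not part of the notebook) shows what that formatting produces:

```python
# Hypothetical 3-dimensional embedding standing in for the 384-dimensional one.
toy_embedding = [0.25, -0.5, 1.0]

# Formatting a list of plain floats inside an f-string uses the list's repr,
# which has the same shape as the body of a Postgres ARRAY[...] constructor.
literal = f"ARRAY{toy_embedding}"
print(literal)
```

A NumPy array formatted the same way is rendered without commas between the elements, which would not form a valid array constructor.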

Let's do a vector search on our database to find the 5 questions most semantically similar to this query, which we accomplish by finding the questions whose embeddings are closest to the query's embedding.

In [None]:
cursor = conn.cursor()

# We only need to set this at the beginning of a session
cursor.execute("SET enable_seqscan = false;")
cursor.execute(f"SELECT content, cos_dist(vector, ARRAY{embedded_query}) AS dist FROM questions ORDER BY vector <-> ARRAY{embedded_query} LIMIT 5;")

record = cursor.fetchone()
while record:
    print(f"{record[0]}  (dist: {record[1]})")
    record = cursor.fetchone()

cursor.close()

How can I become a good software engineer by myself?  (dist: 0.13243657)
What are the best steps (1-10) to become a excellent programmer?  (dist: 0.33947504)
How do I become a qualified and professional ethical hacker?  (dist: 0.457249)
I am a 2nd year computer science engineering student. Other than studying, what should I be doing (like any extra studies, any internship, etc.)?  (dist: 0.4769982)
What are the requirements to be a programmer?  (dist: 0.48091978)


# Conclusion
As we can see, the questions with a lower distance are "closer," in the semantic sense, to our query question!

And that's how you can implement similarity search over questions from the Quora dataset using Lantern and Postgres.




# Cleanup

In [None]:
# Close the postgres connection
conn.close()