# Introduction
In this notebook, we will use Lantern to implement a question-answering engine. Given a database of existing questions and their answers, we will be able to retrieve answers to new questions that do not appear in the database.

We will be using a technique known as [HyDE (Hypothetical Document Embeddings)](https://arxiv.org/abs/2212.10496) to power this question-answering engine.

For the question-answer data that will make up our knowledge base, we will be using the [InsuranceQA Corpus](https://github.com/shuzi/insuranceQA).

We keep the core functionality simple by using the Python library Towhee, so that you can start hacking on your own question-answering engine.


# Setup Postgres

We install Postgres and its dev tools (necessary to build Lantern from source). We also start Postgres, set the password of the user `postgres` to `postgres`, and create a database called `ourdb`.




In [None]:
# We install postgres and its dev tools
!sudo apt-get -y -qq update
!sudo apt-get -y -qq install postgresql postgresql-server-dev-all
#  Start postgres
!sudo service postgresql start

# Create user, password, and db
!sudo -u postgres psql -U postgres -c "ALTER USER postgres PASSWORD 'postgres';"
!sudo -u postgres psql -U postgres -c 'DROP DATABASE IF EXISTS ourdb;'
!sudo -u postgres psql -U postgres -c 'CREATE DATABASE ourdb;'

W: Failed to fetch https://cloud.r-project.org/bin/linux/ubuntu/jammy-cran40/InRelease  Could not resolve 'cloud.r-project.org'
W: Some index files failed to download. They have been ignored, or old ones used instead.
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 78, <> line 26.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin: 
Selecting previously unselected package logrotate.
(Reading database ... 120874 files and directories currently installed.)
Preparing to unpack .../00-logrotate_3.19.0-1ubuntu1.1_amd64.deb ...
Unpacking logrotate (3.19.0-1ubuntu1.1) ...
Selecting previously unselected package netbase.
Preparing to unpack .../01-netbase_6.3_all.deb

# Install Lantern and build it from source

In [None]:
!git clone --recursive https://github.com/lanterndata/lantern.git

Cloning into 'lantern'...
remote: Enumerating objects: 2656, done.
remote: Counting objects: 100% (1452/1452), done.
remote: Compressing objects: 100% (505/505), done.
remote: Total 2656 (delta 1124), reused 1041 (delta 941), pack-reused 1204
Receiving objects: 100% (2656/2656), 627.19 KiB | 3.15 MiB/s, done.
Resolving deltas: 100% (1740/1740), done.
Submodule 'third_party/hnswlib' (https://github.com/ngalstyan4/hnswlib) registered for path 'third_party/hnswlib'
Submodule 'third_party/usearch' (https://github.com/ngalstyan4/usearch) registered for path 'third_party/usearch'
Cloning into '/content/lantern/third_party/hnswlib'...
remote: Enumerating objects: 1723, done.        
remote: Counting objects: 100% (343/343), done.        
remote: Compressing objects: 100% (43/43), done.        
remote: Total 1723 (delta 314), reused 300 (delta 300), pack-reused 1380        
Receiving objects: 100% (1723/1723), 528.17 KiB | 2.33 MiB/s, done.
Resolving deltas: 100% (1096/1096), done.

In [None]:
# We build lantern from source
%cd lantern
!mkdir build
%cd build
!pwd
!cmake ..
!make install

/content/lantern
/content/lantern/build
/content/lantern/build
  Compatibility with CMake < 3.5 will be removed from a future version of
  CMake.

  Update the VERSION argument <min> value or use a ...<max> suffix to tell
  CMake that the project does not need compatibility with older versions.

-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Build type: 
-- Found pg_config as /usr/bin/pg_config
-- Found postgres binary at /usr/lib/postgresql/14/bin/postgres
-- PostgreSQL version PostgreSQL 14.9 (Ubuntu 14.9-0ubuntu0.22.04.1) fou

# Installing other Prerequisites

In [None]:
!python -m pip install -q towhee towhee.models nltk


# Gathering Data
Let's download a subset of the [InsuranceQA corpus](https://github.com/shuzi/insuranceQA) we mentioned above. It contains 1000 pairs of questions and answers related to insurance.

The data contains three columns: an `id`, a `question`, and its corresponding `answer`, as seen below.

In [None]:
!curl -L https://github.com/towhee-io/examples/releases/download/data/question_answer.csv -O

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  595k  100  595k    0     0   819k      0 --:--:-- --:--:-- --:--:--  819k


In [None]:
import pandas as pd

df = pd.read_csv('question_answer.csv')
df.head()

| | id | question | answer |
|---|---|---|---|
| 0 | 0 | Is Disability Insurance Required By Law? | Not generally. There are five states that requ... |
| 1 | 1 | Can Creditors Take Life Insurance After ... | If the person who passed away was the one with... |
| 2 | 2 | Does Travelers Insurance Have Renters Ins... | One of the insurance carriers I represent is T... |
| 3 | 3 | Can I Drive A New Car Home Without Ins... | Most auto dealers will not let you drive the c... |
| 4 | 4 | Is The Cash Surrender Value Of Life Ins... | Cash surrender value comes only with Whole Lif... |


# Overview of HyDE

As mentioned earlier, we will be using the HyDE technique.

What exactly will we be embedding? We will embed the answers in our dataset and store them in our database. Then, our question-answering engine will operate as follows:

1. Start with a query question (for example, one that a user asks).
2. Use a Large Language Model (LLM) to hallucinate an answer to the question. It doesn't matter if the specific details in this answer are wrong: we are looking for a "structurally similar" answer, even if it is factually incorrect.
3. Embed this hallucinated answer.
4. Use Lantern to perform a vector search for the nearest neighbors of the hallucinated answer's embedding. Since we store the embeddings of the answers in our database, this amounts to finding the answer in our database that is "most similar" to the hallucinated answer.
5. Present this nearest-neighbor answer from our database as the answer to the original question. The idea is that it retains the "structure" of the hallucinated answer but contains the correct facts.
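
The five steps above can be sketched end to end with stand-ins for each component. This is a toy illustration under stated assumptions, not the pipeline used later in this notebook: `toy_embed` is a hypothetical bag-of-words embedding standing in for the DPR model, `toy_hallucinate` is a hypothetical stand-in for an LLM, the three answers are made up for the example, and a brute-force NumPy search takes the place of Lantern.

```python
import numpy as np

# Hypothetical miniature knowledge base (answers only), made up for illustration.
ANSWERS = [
    "Disability insurance is priced on age, occupation and coverage amount.",
    "Renters insurance covers personal belongings inside a rented home.",
    "Most dealers require proof of auto insurance before you drive away.",
]

def tokenize(text):
    # Lowercase whitespace tokens with surrounding punctuation stripped.
    return [w.strip(".,?!").lower() for w in text.split()]

# Shared vocabulary built from the knowledge base.
VOCAB = sorted({w for a in ANSWERS for w in tokenize(a)})
INDEX = {w: i for i, w in enumerate(VOCAB)}

def toy_embed(text):
    """Bag-of-words counts, L2-normalized. Stand-in for a real encoder."""
    vec = np.zeros(len(VOCAB))
    for w in tokenize(text):
        if w in INDEX:
            vec[INDEX[w]] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def toy_hallucinate(question):
    """Stand-in for an LLM: returns a plausible but unverified answer."""
    return "The price of disability insurance depends on your age and occupation."

def hyde_answer(question):
    hypothetical = toy_hallucinate(question)                 # step 2
    query_vec = toy_embed(hypothetical)                      # step 3
    answer_vecs = np.stack([toy_embed(a) for a in ANSWERS])
    dists = np.sum((answer_vecs - query_vec) ** 2, axis=1)   # step 4 (squared L2)
    return ANSWERS[int(np.argmin(dists))]                    # step 5

best = hyde_answer("How much does disability insurance cost?")
print(best)
```

Note that the query question itself is never embedded: only the hypothetical answer is, which is what distinguishes HyDE from embedding the question directly.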




# Create Postgres Table

Now let's connect to Postgres with `psycopg2` and enable the Lantern extension.


In [None]:
import psycopg2

# We use the dbname, user, and password that we specified above
conn = psycopg2.connect(
    dbname="ourdb",
    user="postgres",
    password="postgres",
    host="localhost",
    port="5432" # default port for Postgres
)

# Get a new cursor
cursor = conn.cursor()

# Execute the query to load the Lantern extension in
cursor.execute("CREATE EXTENSION IF NOT EXISTS lantern;")

conn.commit()
cursor.close()

Now let's create the table that will store the data we search against. We'll call the table `questions_answers`, and it will have four columns. The first three correspond to the columns in the dataset we downloaded above: an `id`, the text content of the `question`, and its corresponding `answer`. Lastly, we will store the embedding we compute for each answer in the column `vector`, which has type real array (`real[]`). We could declare a dimension, as in `real[768]`, but Postgres treats a declared array dimension as documentation only and does not enforce it.

In [None]:
# Create the table
cursor = conn.cursor()

TABLE_NAME = "questions_answers"
create_table_query = f"CREATE TABLE {TABLE_NAME} (id integer, question text, answer text, vector real[]);"

cursor.execute(create_table_query)

conn.commit()
cursor.close()

# Computing Embeddings and Inserting Data

Let's compute the embeddings of all the answers in our dataset and insert them into our database. To do this, we will use Facebook's `dpr-ctx_encoder-single-nq-base` model, which is available through Towhee. Note that this model creates vectors of size 768, so this is the dimensionality that we will specify later. Also note that we have to truncate our answers so that they do not exceed the maximum length allowed by the model.
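
One way to make the truncation respect a length limit is to keep whole sentences only while a token budget allows. The sketch below is an illustration, not what this notebook runs: `split_sentences` and `truncate_to_budget` are hypothetical helpers, the regex splitter is a simplification of a real sentence tokenizer, and the default `count_tokens` counts whitespace-separated words, which only approximates how the model's tokenizer counts tokens.

```python
import re

def split_sentences(text):
    """Naive splitter on '.', '!' or '?' followed by whitespace (a simplification)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def truncate_to_budget(text, max_tokens, count_tokens=lambda s: len(s.split())):
    """Keep leading whole sentences while the running token count fits the budget."""
    kept, used = [], 0
    for sentence in split_sentences(text):
        cost = count_tokens(sentence)
        if used + cost > max_tokens:
            break
        kept.append(sentence)
        used += cost
    return " ".join(kept)

text = "Rates vary by age. They also vary by occupation. Ask an agent for a quote."
short = truncate_to_budget(text, max_tokens=10)
print(short)
```

To count real tokens, one could pass a `count_tokens` callable that wraps the model's own tokenizer. If the first sentence alone exceeds the budget, this sketch returns an empty string, so a real system would need a fallback for that case.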

Towhee provides a method-chaining API that lets us compute each embedding and insert it into our database in a single pipeline.

Note that we include a normalization step in the pipeline. For unit-length vectors, ranking by `L2-squared` distance is equivalent to ranking by cosine similarity, so we can use the `L2-squared` metric in our index later, when running vector search.
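
The effect of normalization can be checked numerically: for unit-length vectors `a` and `b`, the squared L2 distance equals `2 - 2 * cos(a, b)`, so the nearest neighbor under squared L2 is also the most similar under cosine similarity. The snippet below is a small sketch with random vectors (the seed and the variable names are arbitrary choices for this example).

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=768)
b = rng.normal(size=768)

# Normalize to unit length, as the insertion pipeline does.
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

l2_squared = float(np.sum((a - b) ** 2))
cosine_similarity = float(a @ b)

# For unit vectors: ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b = 2 - 2 * cos(a, b)
print(np.isclose(l2_squared, 2 - 2 * cosine_similarity))
```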

Also, note that most of the time here is spent computing the embeddings with the aforementioned model (under 8 minutes on a Google Colab CPU instance at the time this notebook was written).


In [None]:
import nltk
from nltk.tokenize import sent_tokenize

# We use nltk to truncate our answers
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
%%time
from towhee import pipe, ops
import numpy as np
from towhee.datacollection import DataCollection


def truncate_answer(answer):
    # The model has a maximum input length, measured in tokens. As an
    # approximation, we keep only the first 7 sentences of each answer.
    # A production system would also check that the truncated text stays
    # within the model's token limit.
    sentences = sent_tokenize(answer)
    return ' '.join(sentences[:7])


# Define the processing pipeline
def insert_row(id, vec, question, answer):
    vector = [float(x) for x in vec]
    cursor.execute(f"INSERT INTO {TABLE_NAME} (id, question, answer, vector) VALUES (%s, %s, %s, %s);", (id, question, answer, vector))
    return True

insert_pipe = (
    pipe.input('id', 'question', 'raw_answer')
        .map('raw_answer', 'answer', truncate_answer)
        .map('answer', 'vec', ops.text_embedding.dpr(model_name='facebook/dpr-ctx_encoder-single-nq-base'))
        # We normalize the embedding here
        .map('vec', 'vec', lambda x: x / np.linalg.norm(x, axis=0))
        .map(('id', 'vec', 'question', 'answer'), 'insert_status', insert_row)
        .output()
)

# Insert data
import csv
cursor = conn.cursor()

with open('question_answer.csv', encoding='utf-8') as f:
    reader = csv.reader(f)
    next(reader)
    for row in reader:
        insert_pipe(*row)

conn.commit()
cursor.close()

Downloading https://towhee.io/text-embedding/dpr/resolve/branch/main/dpr.py to /root/.towhee/operators/text-embedding/dpr/files: 100%|██████████| 2.21k/2.21k [00:00<00:00, 2.33MB/s]
Downloading https://towhee.io/text-embedding/dpr/resolve/branch/main/.gitattributes to /root/.towhee/operators/text-embedding/dpr/files: 100%|██████████| 1.18k/1.18k [00:00<00:00, 4.11MB/s]
Downloading https://towhee.io/text-embedding/dpr/resolve/branch/main/__init__.py to /root/.towhee/operators/text-embedding/dpr/files: 100%|██████████| 660/660 [00:00<00:00, 850kB/s]
Downloading https://towhee.io/text-embedding/dpr/resolve/branch/main/requirements.txt to /root/.towhee/operators/text-embedding/dpr/files: 100%|██████████| 55.0/55.0 [00:00<00:00, 205kB/s]

Downloading https://towhee.io/text-embedding/dpr/resolve/branch/main/README.md to /root/.towhee/operators/text-embedding/dpr/files: 100%|██████████| 2.09k/2.09k [00:00<00:00, 7.72MB/s]
Downloading https://towhee.io/text-embedding/dpr/resolve/branch/main/re

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/492 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/438M [00:00<?, ?B/s]

CPU times: user 6min 56s, sys: 5.21 s, total: 7min 1s
Wall time: 7min 35s


# Creating an Index
Now that we have inserted the embeddings into our database, we need to construct an index in Postgres using Lantern. This is important because the index allows Postgres to use Lantern when performing vector search.

Note that we specify L2-squared (squared Euclidean distance) as the distance metric, as we mentioned earlier. Also, as a good practice, we specify the dimension of the index (although Lantern can infer it from the vectors we've already inserted).

In [None]:
cursor = conn.cursor()

cursor.execute(f"CREATE INDEX ON {TABLE_NAME} USING hnsw (vector dist_l2sq_ops) WITH (dim=768);")

conn.commit()
cursor.close()

# Performing Similarity Search

Now that we have embedded our answers, let's implement the bulk of the question-answering engine: the vector search!

As described in our overview of HyDE, we hallucinate an answer to the query question using an LLM (this notebook skips the LLM call and uses a hardcoded example), embed the hallucinated answer, and then run similarity search over our answer embeddings to find the closest answer in our database. We then serve that answer as the answer to the original question.

Let's see this in action by specifying the pipeline we will use, and an example of this process with `QUERY_QUESTION` below:

In [None]:
QUERY_QUESTION = "How much does disability insurance cost?"

def hallucinate_answer(question):
  # Here is where you would use an LLM (like OpenAI's GPT, Anthropic, LLaMA, etc.) to hallucinate an answer to this question.
  # In this notebook, we return a hardcoded hallucinated answer that corresponds to our example query question.

  example_hallucinated_answer = """
  The cost of disability insurance varies widely depending on factors such as your age, health, occupation, coverage amount, and policy features.
  On average, it can range from 1% to 3% of your annual income. To get an accurate quote, you should contact insurance providers and request a personalized quote based on your specific circumstances.
  """

  return example_hallucinated_answer

HALLUCINATED_ANSWER = hallucinate_answer(QUERY_QUESTION)
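
If you want to replace the hardcoded answer with a real model, one option is to keep the LLM behind a plain callable so that any provider can be plugged in. The sketch below is an assumption-laden illustration: `hallucinate_with`, `PROMPT_TEMPLATE`, and `fake_llm` are hypothetical names, the prompt wording is made up for this example, and `fake_llm` returns canned text in place of an actual API call.

```python
from typing import Callable

# Hypothetical prompt wording; adjust it for the model you use.
PROMPT_TEMPLATE = (
    "Write a short passage that answers the question.\n"
    "Question: {question}\n"
    "Passage:"
)

def hallucinate_with(generate: Callable[[str], str], question: str) -> str:
    """Build a HyDE-style prompt and return the model's hypothetical answer."""
    prompt = PROMPT_TEMPLATE.format(question=question)
    return generate(prompt).strip()

def fake_llm(prompt: str) -> str:
    # Stand-in for a real LLM client call; returns canned text.
    return "  Premiums depend on age, occupation, and coverage amount.  "

hypothetical = hallucinate_with(fake_llm, "How much does disability insurance cost?")
print(hypothetical)
```

Swapping `fake_llm` for a function that calls your provider's client leaves the rest of the pipeline unchanged, since the embedding step only needs the returned string.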

In [None]:
cursor = conn.cursor()

# We only need to set this at the beginning of a session
cursor.execute("SET enable_seqscan = false;")
conn.commit()

def vector_search(vec):
  query_vector = str([float(x) for x in vec])
  cursor.execute(f"SELECT question AS associated_question, answer FROM {TABLE_NAME} ORDER BY vector <-> ARRAY{query_vector} LIMIT 1;")
  record = cursor.fetchall()[0]
  return record

ans_pipe = (
    pipe.input('answer')
        .map('answer', 'vec', ops.text_embedding.dpr(model_name="facebook/dpr-ctx_encoder-single-nq-base"))
        .map('vec', 'vec', lambda x: x / np.linalg.norm(x, axis=0))
        .map('vec', ('associated_question','answer'), vector_search)
        .output('answer', 'associated_question')
)

ans = ans_pipe(HALLUCINATED_ANSWER)
ans = DataCollection(ans)[0]

print(f"Original Question:\n{QUERY_QUESTION}\n\n")
print(f"Proposed Answer:\n {ans['answer']}")

cursor.close()

Original Question:
How much does disability insurance cost?


Proposed Answer:
 Any disability insurance policy is priced based on several factors of the applicant. These include age, gender, occupation and duties of that occupation, tobacco use, and the amount of coverage needed or desired. The amount of coverage is often dictated by the person's earned income; the more someone earns, the more coverage that is available. There are several policy design features that can be included in the plan (or not) that will also affect the price. The person's medical history can also play an important role in pricing. As you can see, there are lots of moving parts to a disability policy that will affect the price. Doctors often buy coverage to protect their specific medical specialty.


In [None]:
print(f"Original Question:\n{QUERY_QUESTION}\n\n")
print(f"Hallucinated Answer:\n{HALLUCINATED_ANSWER}\n\n")
print(f"Nearest-neighbor Answer:\n {ans['answer']}\n\n")
print(f"Question associated in DB with nearest-neighbor answer:\n {ans['associated_question']}\n\n")

Original Question:
How much does disability insurance cost?


Hallucinated Answer:

  The cost of disability insurance varies widely depending on factors such as your age, health, occupation, coverage amount, and policy features. 
  On average, it can range from 1% to 3% of your annual income. To get an accurate quote, you should contact insurance providers and request a personalized quote based on your specific circumstances.
  


Nearest-neighbor Answer:
 Any disability insurance policy is priced based on several factors of the applicant. These include age, gender, occupation and duties of that occupation, tobacco use, and the amount of coverage needed or desired. The amount of coverage is often dictated by the person's earned income; the more someone earns, the more coverage that is available. There are several policy design features that can be included in the plan (or not) that will also affect the price. The person's medical history can also play an important role in pricing. As 

As we can see, we are able to obtain an answer by finding the nearest neighbor to the hallucinated answer, which we obtained by using an LLM to come up with an answer to our original query question. Notice also that the question associated with the nearest-neighbor answer is very similar to our original query question, which we would expect.

And that's how you can implement a simple Question Answering engine using Lantern! There are many approaches to how we go from the query question to a certain row in our database, and this notebook used HyDE. The premise behind all these approaches remains the same, however: use vector search to make the connection between the user question and our database which holds our unstructured knowledge-base data.




### Cleanup

In [None]:
# Close the postgres connection
conn.close()