# Semantic Search on PDF Documents with KDB.AI

This example demonstrates how to use KDB.AI to run semantic search on unstructured text documents. 

Semantic search allows users to perform searches based on the meaning or similarity of the data rather than exact matches. It works by converting the query into a vector representation and then finding similar vectors in the database. This way, even if the query and the data in the database are not identical, the system can identify and retrieve the most relevant results based on their semantic meaning.
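
To make the idea concrete, here is a minimal sketch of similarity ranking using only the Python standard library. The three-dimensional vectors are hand-made illustrative values, not output from an embedding model, and the names `cosine_similarity`, `toy_index` and `toy_query` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made 3-dimensional "embeddings" (illustrative values only)
toy_index = {
    "stars form in gas clouds": [0.9, 0.1, 0.0],
    "planets orbit stars": [0.7, 0.3, 0.1],
    "bread needs yeast to rise": [0.0, 0.2, 0.9],
}
toy_query = [0.8, 0.2, 0.0]

# Rank the stored sentences from most to least similar to the query
ranked = sorted(
    toy_index,
    key=lambda sentence: cosine_similarity(toy_query, toy_index[sentence]),
    reverse=True,
)
print(ranked[0])
```

The query shares no words with the stored sentences; the ranking depends only on how close the vectors are.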

## Aim
In this tutorial, we'll walk you through performing semantic search on documents, using PDFs as the example and KDB.AI as the vector store. We will cover the following topics:

- How to create vector embeddings using Sentence Transformer
- How to store those embeddings in KDB.AI
- How to search with a query using KDB.AI

## 1. Load and Split Document


In [None]:
%pip install PyPDF2 spacy sentence-transformers kdbai_client -q

In [None]:
!python3 -m spacy download en_core_web_sm -q

### Load and Split PDF into Sentences

We use PyPDF2 to read the PDF and spaCy to split its text into sentences. The code below extracts the text of each page, concatenates it, and runs spaCy's sentence segmentation over the result.

The PDF we are using is [this research paper](https://arxiv.org/pdf/2308.05801.pdf) presenting information on the formation of Interstellar Objects in the Milky Way. It is also available on our [GitHub](https://github.com/KxSystems/kdbai-notebooks/tree/main/notebooks/samples/document-search).

In [None]:
import PyPDF2
import spacy

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")


def split_pdf_into_sentences(pdf_path):
    # Open the PDF file
    with open(pdf_path, "rb") as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)

        # Extract text from each page and concatenate
        full_text = ""
        for page in pdf_reader.pages:
            full_text += page.extract_text()

        # Process the text using spaCy for sentence tokenization
        doc = nlp(full_text)
        sentences = [sent.text for sent in doc.sents]

        return sentences


# Define PDF path
pdf_path = "data/research_paper.pdf"

# Split the PDF into sentences
pdf_sentences = split_pdf_into_sentences(pdf_path)
len(pdf_sentences)

In [None]:
type(pdf_sentences[0])
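
Text extracted from PDFs often contains line breaks inside sentences and very short fragments such as page numbers or figure labels. The sketch below shows one optional way to tidy the sentence list before embedding. `clean_sentences` and its `min_chars` threshold are hypothetical helpers, not part of the original notebook, and the threshold of 20 characters is an arbitrary assumption.

```python
import re


def clean_sentences(sentences, min_chars=20):
    """Collapse runs of whitespace and drop fragments shorter than min_chars."""
    cleaned = []
    for sentence in sentences:
        text = re.sub(r"\s+", " ", sentence).strip()
        if len(text) >= min_chars:
            cleaned.append(text)
    return cleaned


# Toy input standing in for pdf_sentences
sample = ["Interstellar objects\nform around   stars.", "12", "  Fig. 3 "]
print(clean_sentences(sample))
```

If you adopt a step like this, apply it before creating embeddings so the stored sentences and vectors stay aligned.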

## 2. Create Vector Embeddings 

Next, we use the Sentence Transformers library to create embeddings for our collection of sentences.


### Selecting a Sentence Transformer Model

There are over 100 Sentence Transformers models available; see [HuggingFace](https://huggingface.co/sentence-transformers) for the full list. They differ mainly in their training data. To choose one, match the model's training domain and task to your own as closely as possible, and weigh that against the broader coverage of models trained on larger datasets.

This tutorial uses the `all-MiniLM-L6-v2` pre-trained model. It produces sentence and document embeddings suitable for a wide variety of tasks, including semantic search, which makes it a good choice for our needs.

In [None]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

### Generate Embeddings

We create the embeddings by encoding our sentences with the sentence transformer model. Then we convert the result into a pandas DataFrame, which is the format accepted by KDB.AI.

In [None]:
import pandas as pd

# Create embeddings: one vector per sentence
embeddings_array = model.encode(pdf_sentences)
embeddings_list = embeddings_array.tolist()

# Print the array shape rather than the full list, which is very long
print(embeddings_array.shape)

# Pair each vector with its source sentence
embeddings_df = pd.DataFrame({"vectors": embeddings_list, "sentences": pdf_sentences})

Note that the dimension of our embeddings is 384. This must match the `dims` value we set for the KDB.AI index in the next step. We can check it by using `len` to count the elements in one of our vectors.

In [None]:
len(embeddings_df["vectors"][0])
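
Checking a single vector is usually enough, but a mismatch anywhere would only surface later at insert time. As a minimal sketch, the hypothetical helper `check_dimensions` below verifies every vector against the expected size; it uses toy data in place of `embeddings_list`.

```python
def check_dimensions(vectors, expected_dims):
    """Raise ValueError if any vector's length differs from expected_dims."""
    for position, vector in enumerate(vectors):
        if len(vector) != expected_dims:
            raise ValueError(
                f"vector {position} has {len(vector)} dimensions, "
                f"expected {expected_dims}"
            )
    return True


# Toy data standing in for embeddings_list
toy_vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
print(check_dimensions(toy_vectors, expected_dims=3))
```

With the notebook's data, the equivalent call would be `check_dimensions(embeddings_list, 384)`.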

## 3. Store Embeddings in KDB.AI

With the embeddings created, we need to store them in a vector database to enable efficient similarity search. We use KDB.AI for this.

### Connect to KDB.AI Session

To use KDB.AI, you will need two session details - a URL endpoint and an API key. To get these you can sign up for free [here](https://trykdb.kx.com/kdbai/signup).

You can connect to a KDB.AI session using `kdbai.Session`. For KDB.AI Cloud, enter the session URL endpoint and API key from your KDB.AI Cloud portal below. For a KDB.AI Server running locally or in Docker, only the endpoint is passed.

In [None]:
import kdbai_client as kdbai
from getpass import getpass

# Option 1: KDB.AI Cloud (prompts for the endpoint and API key;
# never hard-code an API key in a notebook)
# KDBAI_ENDPOINT = input("KDB.AI endpoint: ")
# KDBAI_API_KEY = getpass("KDB.AI API key: ")
# session = kdbai.Session(api_key=KDBAI_API_KEY, endpoint=KDBAI_ENDPOINT)

# Option 2: KDB.AI Server (local or Docker), reached by endpoint only
session = kdbai.Session(endpoint="http://kdbaiServer:8082")

In [None]:
session.list()

### Define Schema

The next step is to define a schema for our KDB.AI table where we will store our embeddings. Our table will have two columns: `sentences`, which holds the text, and `vectors`, which holds the embeddings.

At this point you also select the index type and distance metric you want to use for searching.

With KDB.AI we have the choice between HNSW (Hierarchical Navigable Small World) and Flat indexing methods. Generally, for semantic search of documents, the HNSW indexing method might be more suitable. Here's why:

- **Search Speed and Approximation**: HNSW is designed for fast approximate nearest neighbour searches. It can efficiently handle high-dimensional data, which is common in natural language processing tasks involving text documents.
- **Semantic Representation**: The Sentence Transformers library, used in this example, generates embeddings that capture semantic meaning. HNSW is well-suited for indexing such embeddings and performing semantic searches.
- **Scalability**: HNSW is scalable and can handle large datasets effectively, making it suitable for applications with a vast number of documents.

HNSW provides approximate search results: the neighbours it returns may not be the true nearest neighbours, but they are typically very close in terms of similarity.
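
For contrast, exact search compares the query against every stored vector. The sketch below illustrates that idea in plain Python with toy two-dimensional vectors. It is a conceptual illustration only, not KDB.AI's Flat index implementation, and `l2_distance` and `flat_search` are hypothetical names.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def flat_search(vectors, query, n=3):
    """Return (index, distance) pairs for the n closest vectors, nearest first."""
    distances = [(i, l2_distance(v, query)) for i, v in enumerate(vectors)]
    distances.sort(key=lambda pair: pair[1])
    return distances[:n]


toy_vectors = [[0.0, 0.0], [1.0, 1.0], [0.2, 0.1], [5.0, 5.0]]
nearest = flat_search(toy_vectors, query=[0.1, 0.1], n=2)
print(nearest)
```

Every query costs one distance computation per stored vector, which is the work an approximate index such as HNSW is designed to avoid on large datasets.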

In [None]:
pdf_schema = {
    "columns": [
        {"name": "sentences", "pytype": "str"},
        {
            "name": "vectors",
            "vectorIndex": {"dims": 384, "metric": "L2", "type": "hnsw"},
        },
    ]
}
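
Because `dims` must match the embedding size, it can help to build the schema from a variable instead of hard-coding 384. The hypothetical helper `build_schema` below returns a dict with the same shape as `pdf_schema`. Which `metric` and `type` values are accepted depends on your KDB.AI version, so treat anything other than the defaults shown here as an assumption to check against its documentation.

```python
def build_schema(dims, metric="L2", index_type="hnsw"):
    """Return a schema dict in the same shape as pdf_schema."""
    return {
        "columns": [
            {"name": "sentences", "pytype": "str"},
            {
                "name": "vectors",
                "vectorIndex": {"dims": dims, "metric": metric, "type": index_type},
            },
        ]
    }


schema = build_schema(dims=384)
print(schema["columns"][1]["vectorIndex"])
```

In the notebook, `dims` could be taken from the data itself, for example `build_schema(len(embeddings_list[0]))`.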

### Create and Save Table

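Use `create_table` to create the table. If creation fails, for example because a table named `pdf` already exists from a previous run, we fall back to retrieving it with `session.table`.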

In [None]:
try:
    table = session.create_table("pdf", pdf_schema)
except Exception:
    # Creation failed (most likely the table already exists), so reuse it
    table = session.table("pdf")

We can use `query` to see our table exists but is empty.

In [None]:
table.query()

### Add Embeddings to Index

In [None]:
table.insert(embeddings_df)
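
A single insert is fine for one research paper. For much larger document sets it may be more practical to insert in batches. The sketch below shows a generic batching helper on toy data; `iter_batches` is a hypothetical name and the batch size is arbitrary.

```python
def iter_batches(rows, batch_size):
    """Yield consecutive slices of rows with at most batch_size items each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


toy_rows = list(range(7))
batches = list(iter_batches(toy_rows, batch_size=3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

The same slicing works on a pandas DataFrame, so a batched insert could look like `for batch in iter_batches(embeddings_df, 2000): table.insert(batch.reset_index(drop=True))`, assuming `table.insert` accepts each slice the way it accepts the full DataFrame.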

Re-running `query` we can now see data has been added.

In [None]:
table.query()

## 4. Searching with a Query using KDB.AI

Now that the embeddings are stored in KDB.AI, we can perform semantic search using `search`. 

First, we embed our search term using the Sentence Transformer model as before. Then we search our index to return the three most similar vectors.

In [None]:
search_term = "number of interstellar objects in the milky way"
search_term_vector = model.encode(search_term)
search_term_list = [search_term_vector.tolist()]

results = table.search(search_term_list, n=3)
results

The results returned from `table.search` show the closest matches along with their nearest neighbour distances, `nn_distance`. Let's print the output so we can see the full sentences.

In [None]:
pd.set_option("display.max_colwidth", None)
results[0]["sentences"]

We can see these sentences do reference our search term 'number of interstellar objects in the milky way' in some way. Let's try another search term.

In [None]:
search_term = "how does planet formation occur"
#search_term = "who is the author"
search_term_vector = model.encode(search_term)
search_term_list = [search_term_vector.tolist()]

results = table.search(search_term_list, n=3)
results[0]["sentences"]

Again, we can see these sentences do reference our search term 'how does planet formation occur' in some way. 
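
The distances reported with each match come from the `L2` metric chosen in the schema, where smaller values mean closer matches. If the embeddings are unit length (an assumption you can check by computing a vector's norm), squared L2 distance and cosine similarity are tied by the identity `squared_l2 = 2 - 2 * cosine`, so both produce the same ranking. A minimal sketch with toy vectors and hypothetical helper names:

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_of_unit_vectors(a, b):
    """Dot product, which equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])
left = squared_l2(a, b)
right = 2 - 2 * cosine_of_unit_vectors(a, b)
print(math.isclose(left, right, abs_tol=1e-12))
```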

In [None]:
# Clean up: list the tables, drop the "pdf" table, then confirm it is gone
session.list()
table.drop()
session.list()