<img width="10%" alt="Naas" src="https://landen.imgix.net/jtci2pxwjczr/assets/5ice39g4.png?w=160"/>

# LangChain - Vector Search on PDF
<a href="https://app.naas.ai/user-redirect/naas/downloader?url=https://raw.githubusercontent.com/jupyter-naas/awesome-notebooks/master/template.ipynb" target="_parent">
<img src="https://naasai-public.s3.eu-west-3.amazonaws.com/open_in_naas.svg"/>
</a><br><br><a href="https://github.com/jupyter-naas/awesome-notebooks/issues/new?assignees=&labels=&template=template-request.md&title=Tool+-+Action+of+the+notebook+">Template request</a> | <a href="https://github.com/jupyter-naas/awesome-notebooks/issues/new?assignees=&labels=&template=bug_report.md&title=[ERROR]+Tool+/+Folder+Action+of+the+notebook+">Bug report</a>

**Tags:** #langchain #pdf #weaviate #huggingface

**Author:** [Sriniketh Jayasendil](https://www.linkedin.com/in/sriniketh-jayasendil)

**Description:** This notebook runs a vector search over the text of a PDF and returns the passage that most closely matches the question you provide.

It uses:
- PyPDF2 - Get text from PDF
- LangChain - Text splitter, document creation
- HuggingFace - Embeddings
- Weaviate - Vector Database

**References:**
- [LangChain docs](https://python.langchain.com/docs/get_started/introduction.html)
- [Weaviate docs](https://weaviate.io/developers/weaviate)
- [Hugging Face docs](https://huggingface.co/docs)

## Input

### Import libraries

In [1]:
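# Install the core dependencies only if they are missing.
# The LangChain Weaviate vector store needs the weaviate-client package.
try:
    import langchain
    import PyPDF2
    import weaviate
except ModuleNotFoundError:
    !pip install langchain PyPDF2 weaviate-client
# HuggingFaceEmbeddings relies on sentence_transformers.
try:
    import sentence_transformers
except ModuleNotFoundError:
    !pip install sentence_transformers --user
import naas
import PyPDF2
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Weaviate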



### Setup Variables

- `pdf_file`: Path to the PDF file to search.
- `weaviate_cluster_url`: URL of your Weaviate cluster. You can create a new cluster [here](https://console.weaviate.cloud) and paste its URL, or load it from Naas secrets.
- `query`: The question you want to ask about the PDF.
- `response`: The passage returned by the search in reply to the query.

In [2]:
# Inputs
pdf_file = "./SWE NCG JD.pdf"
weaviate_cluster_url = naas.secret.get("WEAVIATE_CLUSTER_URL")
query = "How much is the base pay?"

# Outputs
response = ""
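
Because every later step depends on `pdf_file`, it can help to validate the path before running the model cells. The helper below is a minimal sketch and not part of the original notebook: `check_pdf_path` is a hypothetical name, and it only checks that the file exists and has a `.pdf` extension, not that its contents are a valid PDF.

```python
from pathlib import Path


def check_pdf_path(pdf_path: str) -> Path:
    """Return pdf_path as a Path, or raise if the file is missing or lacks a .pdf extension."""
    path = Path(pdf_path)
    if not path.is_file():
        # Report the absolute path so a wrong working directory is easy to spot.
        raise FileNotFoundError(f"PDF not found: {path.resolve()}")
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"Expected a .pdf file, got: {path.name}")
    return path
```

Calling `check_pdf_path(pdf_file)` right after the inputs are set would surface a wrong path immediately instead of in the extraction step.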

## Model

### Extract text from PDF

In [3]:
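def extract_text_from_pdf(pdf_path):
    """Return the text of every page in the PDF, joined by spaces."""
    with open(pdf_path, "rb") as file:
        pdf = PyPDF2.PdfReader(file)
        text = []
        for page in pdf.pages:
            # Guard against pages with no extractable text (for example scanned images).
            text.append(page.extract_text() or "")
        return " ".join(text)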

text = extract_text_from_pdf(pdf_file)


### Split the text extracted from the PDF into chunks

In [None]:
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)

texts = text_splitter.create_documents([text])
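
To build intuition for `chunk_size` and `chunk_overlap`, the sketch below cuts a string into fixed-size windows that share characters with their neighbours. It is a simplification, not LangChain's algorithm: `CharacterTextSplitter` first splits on the separator and then merges the pieces, whereas the hypothetical `sliding_chunks` helper here ignores separators entirely.

```python
def sliding_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list:
    """Cut text into windows of chunk_size characters, each sharing chunk_overlap with the next."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.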

### Create embeddings of the text so it can be stored in the database

In [None]:
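embeddings = HuggingFaceEmbeddings()

# Embed every chunk in one batch call and keep all the vectors,
# instead of overwriting a single result on each loop iteration.
chunk_vectors = embeddings.embed_documents([doc.page_content for doc in texts])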

### Store the embeddings in the Weaviate database

In [None]:
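# Connect to the cluster configured in the Input section rather than an empty URL.
db = Weaviate.from_documents(texts, embeddings, weaviate_url=weaviate_cluster_url, by_text=False)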

### Get the passage in the PDF that is closest to the user query

In [None]:
docs = db.similarity_search(query)
# Fall back to an empty response when the search returns no documents.
response = docs[0].page_content if docs else ""
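
Conceptually, a similarity search compares the embedding of the query with the embedding of each chunk and returns the closest ones. The sketch below shows that idea with plain Python lists and cosine similarity. It is an illustration under assumptions, not what Weaviate runs internally: the database may use a different distance metric or an approximate index, and `cosine_similarity`, `rank_chunks`, and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_chunks(query_vector, chunk_vectors, k=2):
    """Return the indices of the k chunk vectors most similar to the query, best first."""
    scored = [(cosine_similarity(query_vector, v), i) for i, v in enumerate(chunk_vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy 2-dimensional "embeddings"; real embedding vectors have hundreds of dimensions.
toy_query = [1.0, 0.0]
toy_chunks = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
print(rank_chunks(toy_query, toy_chunks))
# [1, 2]
```

In the notebook, the index of the best match corresponds to `docs[0]`, whose `page_content` becomes the response.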

## Output

### Show the response

In [None]:
response