<img width="10%" alt="Naas" src="https://landen.imgix.net/jtci2pxwjczr/assets/5ice39g4.png?w=160"/>

# LangChain - Vector Search on PDF
<a href="https://app.naas.ai/user-redirect/naas/downloader?url=https://raw.githubusercontent.com/jupyter-naas/awesome-notebooks/master/template.ipynb" target="_parent">
<img src="https://naasai-public.s3.eu-west-3.amazonaws.com/open_in_naas.svg"/>
</a><a target="_blank" href="https://colab.research.google.com/drive/1BhiqnWyHZxNfdD733QEvZIKpaz3ND663?usp=sharing">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
<br><br><a href="https://github.com/jupyter-naas/awesome-notebooks/issues/new?assignees=&labels=&template=template-request.md&title=Tool+-+Action+of+the+notebook+">Template request</a> | <a href="https://github.com/jupyter-naas/awesome-notebooks/issues/new?assignees=&labels=&template=bug_report.md&title=[ERROR]+Tool+/+Folder+Action+of+the+notebook+">Bug report</a>

**Tags:** #langchain #pdf #weaviate #huggingface #llm #database #embeddings

**Author:** [Sriniketh Jayasendil](https://www.linkedin.com/in/sriniketh-jayasendil)

**Description:** This notebook runs a vector search over the text of a PDF and uses the closest matching passages to answer questions about it, based on the prompt you provide.

It uses:
- PyPDF2 - Get text from PDF
- LangChain - Text splitter, document creation
- HuggingFace - Embeddings
- Weaviate - Vector Database
- OpenAI - LLM that answers the query from the retrieved chunks

**References:**
- [Langchain docs](https://python.langchain.com/docs/get_started/introduction.html)
- [Weaviate docs](https://weaviate.io/developers/weaviate)
- [Huggingface docs](https://huggingface.co/docs)

## Input

### Import libraries

In [None]:
try:
    import langchain
    import PyPDF2
    import weaviate
    import openai
except ModuleNotFoundError:
    !pip install langchain PyPDF2 openai weaviate-client==3.20.0

import naas
import io
import requests
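import PyPDF2
import weaviate  # needed below for weaviate.Client, also when the try block above fell back to pip install
import openai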
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Weaviate
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.document_loaders import TextLoader

In [None]:
# Note: This installation may take longer than usual because it pulls in many dependencies. Uncomment the line below if the embeddings step raises an error.
# !pip install -U sentence-transformers --user

In [None]:
# Inputs
pdf_file = "" or "https://bcf.princeton.edu/wp-content/uploads/2023/05/A_User_s_Guide_to_GPT_and_LLMs_for_Economic_Research.pdf"
weaviate_cluster_url = "" or naas.secret.get("WEAVIATE_CLUSTER_URL")
openai_api_key = "" or naas.secret.get("OPENAI_API_KEY")
query = "" or "Summarize the PDF"

# Outputs
response = ""

## Model

### Extract text from PDF

In [None]:
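def extract_text_from_pdf(pdf_path):
    # pdf_path is expected to be a URL: download the PDF into memory
    r = requests.get(pdf_path)
    r.raise_for_status()  # stop early on an HTTP error instead of parsing an error page
    f = io.BytesIO(r.content)

    # Collect the text of each page; fall back to "" if a page yields no text
    reader = PyPDF2.PdfReader(f)
    contents = []
    for page in reader.pages:
        content = page.extract_text() or ""
        contents.append(content)

    # Join all pages into a single string
    contents = ' '.join(contents)
    return contents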


text = extract_text_from_pdf(pdf_file)
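
The function above only handles a PDF served over HTTP. If the PDF may also live on disk, one option is to resolve the source first. The sketch below is an illustration, not part of the original notebook: `open_pdf_source` and its `timeout` parameter are hypothetical names, and it uses the standard library's `urllib` rather than `requests` so that it runs without extra packages.

```python
import io
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def open_pdf_source(source, timeout=30):
    """Return an in-memory binary buffer for a PDF given a URL or a local path.

    Hypothetical helper for illustration only.
    """
    scheme = urlparse(str(source)).scheme
    if scheme in ("http", "https"):
        # urlopen raises an HTTPError on 4xx/5xx responses
        with urlopen(str(source), timeout=timeout) as resp:
            return io.BytesIO(resp.read())
    # Anything else is treated as a path on the local filesystem
    return io.BytesIO(Path(source).read_bytes())
```

The returned buffer could then be passed to the reader, for example `PyPDF2.PdfReader(open_pdf_source(pdf_file))`.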

### Split the text extracted from the PDF into chunks

In [None]:
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)

texts = text_splitter.create_documents([text])
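
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch that cuts a string into fixed-size character windows. It is not LangChain's algorithm: `CharacterTextSplitter` first splits on a separator (by default a blank line, as far as I know) and then merges the pieces up to the size limit, so its chunks can differ from these. `chunk_text` is a hypothetical name used only for this illustration.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=0):
    """Split text into fixed-size character windows with optional overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Each window starts `step` characters after the previous one
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "abcdefghij"
print(chunk_text(sample, chunk_size=4, chunk_overlap=0))  # ['abcd', 'efgh', 'ij']
print(chunk_text(sample, chunk_size=4, chunk_overlap=2))  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

With an overlap of 0, as in this notebook, a sentence that straddles a chunk boundary is cut in two. A small overlap keeps some shared context on both sides at the cost of storing more chunks.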

### Create embeddings of the text so it can be stored in the database

In [None]:
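embeddings = HuggingFaceEmbeddings()

# Embed one chunk as a quick check that the model loads and returns a vector.
# The full set of chunks is embedded when the documents are stored in Weaviate,
# so looping over every chunk here would only repeat that work.
query_result = embeddings.embed_query(texts[0].page_content)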
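
An embedding is a list of floats, and vector search ranks chunks by how close their vectors are to the query vector. A common closeness measure is cosine similarity. The sketch below computes it in plain Python on made-up two-dimensional vectors. Real embeddings have hundreds of dimensions, and `cosine_similarity` is a name chosen for this illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```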

### Store the embeddings in the Weaviate database

In [None]:
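# Connect to the Weaviate cluster
client = weaviate.Client(url=weaviate_cluster_url)

# Delete any existing schema. Warning: this removes every class and its data from the cluster.
try:
    client.schema.delete_all()
    print("Schema deleted successfully...")
except Exception as e:
    print(f"Schema not deleted: {e}")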

# Store in the weaviate vector database
db = Weaviate.from_documents(texts, embeddings, weaviate_url=weaviate_cluster_url, by_text=False)
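
Conceptually, the vector database keeps each chunk next to its vector and returns the chunks whose vectors are nearest to the query vector. The toy in-memory store below sketches that idea with a brute-force scan. It is an illustration only: `ToyVectorStore` is a hypothetical class, the vectors are made up, and Weaviate uses an approximate index rather than comparing against every stored vector.

```python
import heapq
import math


def _cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Brute-force nearest-neighbour store, for illustration only."""

    def __init__(self):
        self._items = []  # list of (vector, text) pairs

    def add(self, vector, text):
        self._items.append((list(vector), text))

    def search(self, query_vector, k=2):
        # Score every stored vector, then keep the k highest-scoring entries
        scored = ((_cosine(query_vector, vec), text) for vec, text in self._items)
        return heapq.nlargest(k, scored, key=lambda pair: pair[0])


store = ToyVectorStore()
store.add([1.0, 0.0], "chunk about cats")
store.add([0.0, 1.0], "chunk about dogs")
store.add([0.9, 0.1], "chunk about kittens")

for score, text in store.search([1.0, 0.0], k=2):
    print(round(score, 3), text)
```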

### Get the closest response to the user query on the PDF

In [None]:
qa = RetrievalQA.from_chain_type(
    llm=OpenAI(openai_api_key=openai_api_key, temperature=0),
    chain_type="stuff",
    retriever=db.as_retriever(),
)
response = qa.run(query)
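
The `"stuff"` chain type places all retrieved chunks into a single prompt together with the question and sends that prompt to the LLM in one call. The sketch below shows the general shape of such a prompt. The wording of the template is invented for this illustration and differs from the prompt LangChain actually uses, and `build_stuff_prompt` is a hypothetical helper.

```python
def build_stuff_prompt(question, chunks):
    """Concatenate retrieved chunks and the question into one prompt string."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


retrieved = ["LLMs can draft text.", "LLMs can summarize papers."]
print(build_stuff_prompt("What can LLMs do?", retrieved))
```

Because everything goes into one prompt, the retrieved chunks plus the question must fit within the model's context window, which is one reason to keep `chunk_size` moderate.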

## Output

### Show the response

In [None]:
response