<img width="8%" alt="LangChain.png" src="https://raw.githubusercontent.com/jupyter-naas/awesome-notebooks/master/.github/assets/logos/LangChain.png" style="border-radius: 15%">

# LangChain - Vector Search on PDF
<a href="https://bit.ly/3JyWIk6">Give Feedback</a> | <a href="https://github.com/jupyter-naas/awesome-notebooks/issues/new?assignees=&labels=bug&template=bug_report.md&title=LangChain+-+Vector+Search+on+PDF:+Error+short+description">Bug report</a>

**Tags:** #langchain #pdf #weaviate #huggingface #llm #database #embeddings

**Author:** [Sriniketh Jayasendil](https://www.linkedin.com/in/sriniketh-jayasendil)

**Last update:** 2023-09-27 (Created: 2023-09-27)

**Description:** This notebook runs a vector search over a PDF and returns the passage of text that most closely matches the question you provide.

It uses:
- PyPDF2 - Get text from PDF
- LangChain - Text splitter, document creation
- HuggingFace - Embeddings
- Weaviate - Vector Database

<a target="_blank" href="https://colab.research.google.com/drive/1BhiqnWyHZxNfdD733QEvZIKpaz3ND663?usp=sharing">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

**References:**
- [LangChain docs](https://python.langchain.com/docs/get_started/introduction.html)
- [Weaviate docs](https://weaviate.io/developers/weaviate)
- [Hugging Face docs](https://huggingface.co/docs)

## Input

### Import libraries

In [None]:
try:
    import langchain
except ModuleNotFoundError:
    !pip install langchain --user
    import langchain
try:
    import PyPDF2
except ModuleNotFoundError:
    !pip install PyPDF2 --user
    import PyPDF2
try:
    import weaviate
except ModuleNotFoundError:
    !pip install weaviate-client==3.20.0 --user
    import weaviate
    
# Note: installing sentence_transformers may take longer than usual because it pulls in many dependencies.
# Uncomment the line below if the embeddings step raises an error.
# !pip install sentence_transformers --user

import os
import naas
import io
import requests
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Weaviate

### Setup variables
- `pdf_file`: URL of the PDF file to search. The extraction step downloads it with `requests`.
- `weaviate_cluster_url`: URL of your Weaviate cluster. You can create a new cluster [here](https://console.weaviate.cloud) and paste its URL, or load it from naas secrets.
- `weaviate_api_key`: Your API key, available from your Weaviate dashboard [here](https://console.weaviate.cloud/dashboard#).
- `query`: The question you want to ask about the PDF.

In [None]:
pdf_file = "https://tesla-cdn.thron.com/static/SVCPTV_2022_Q4_Quarterly_Update_6UDS97.pdf?xseo=&response-content-disposition=inline%3Bfilename%3D%22b7871185-dd6a-4d79-9c3b-19b497227f2a.pdf%22"
weaviate_api_key = naas.secret.get("WEAVIATE_API_KEY")
weaviate_cluster_url = naas.secret.get("WEAVIATE_CLUSTER_URL")
query = "What's the total revenue on Q4 2022?"

## Model

### Set up environment variables

In [None]:
os.environ["WEAVIATE_API_KEY"] = weaviate_api_key

### Extract text from PDF

In [None]:
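def extract_text_from_pdf(pdf_path):
    # Download the PDF and fail early on HTTP errors (404, 500, ...)
    r = requests.get(pdf_path, timeout=60)
    r.raise_for_status()
    f = io.BytesIO(r.content)

    # Read the PDF page by page
    reader = PyPDF2.PdfReader(f)
    contents = []
    for page in reader.pages:
        # Fall back to an empty string for pages with no extractable text
        content = page.extract_text() or ""
        contents.append(content)

    # Merge all pages into a single string
    return ' '.join(contents)

text = extract_text_from_pdf(pdf_file)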
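
The extraction function above fetches `pdf_file` over HTTP, so it only works with a URL. If the PDF lives on disk instead, a small helper could return the raw bytes from either source. The following is a minimal sketch and not part of the original notebook: `read_pdf_bytes` is a hypothetical name, and it uses the standard library's `urllib` instead of `requests` so that it runs on its own.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def read_pdf_bytes(source: str) -> bytes:
    """Return raw PDF bytes from an http(s) URL or a local file path.

    Hypothetical helper for illustration; not part of the notebook.
    """
    scheme = urlparse(source).scheme
    if scheme in ("http", "https"):
        # Remote file: download the full body
        with urlopen(source, timeout=30) as response:
            return response.read()
    # Anything else is treated as a path on the local file system
    return Path(source).read_bytes()
```

The returned bytes can be wrapped in `io.BytesIO` and passed to `PyPDF2.PdfReader`, the same way the function above handles the downloaded content.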

### Split the text extracted from the PDF into chunks

In [None]:
text_splitter = CharacterTextSplitter(
    separator = "\n",
    chunk_size = 1000,
    chunk_overlap  = 200,
    length_function = len,
)

texts = text_splitter.create_documents([text])
print(len(texts))
texts[0]
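
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified sketch of overlapping chunking in plain Python. It is an illustration under stated assumptions, not LangChain's algorithm: `CharacterTextSplitter` first splits on the separator and then merges the pieces up to the size limit, whereas the hypothetical `sliding_chunks` below slides a fixed-width window over the raw characters.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into fixed-width windows that share chunk_overlap characters.

    Simplified illustration only; it ignores separators entirely.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# Small values make the overlap visible: each chunk repeats
# the last 2 characters of the previous one.
print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# → ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which can help the search step find it.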

### Create embeddings of the text so it can be stored in the database

In [None]:
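embeddings = HuggingFaceEmbeddings()

# Sanity check: embed the first chunk to confirm the model loads and to inspect the vector size.
# The full set of chunks is embedded when the documents are stored in Weaviate in the next step,
# so there is no need to loop over every chunk here.
sample_vector = embeddings.embed_query(texts[0].page_content)
print(len(sample_vector))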

### Store the embeddings in the Weaviate database

In [None]:
# Store in the weaviate vector database
db = Weaviate.from_documents(texts, embeddings, weaviate_url=weaviate_cluster_url, by_text=False)

### Get the closest response to the user query on the PDF

In [None]:
docs = db.similarity_search(query)
docs
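
Conceptually, a similarity search embeds the query and returns the stored chunks whose vectors lie closest to it. The sketch below shows that idea with cosine similarity on toy 2-D vectors. It is an assumption-laden illustration: `cosine_similarity` and `top_k` are hypothetical helpers, Weaviate performs the search server-side with its own index, and the distance metric it uses depends on how the class is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=4):
    """Return the indices of the k vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy example: the second vector points the same way as the query,
# the third is partly aligned, and the first is orthogonal.
print(top_k([1, 0], [[0, 1], [1, 0], [1, 1]], k=2))
# → [1, 2]
```

Results come back ordered from most to least similar, which is why the first element of `docs` is treated as the best match.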

## Output

### Show the response

In [None]:
response = docs[0].page_content
response