PDF Query Using LangChain

In [1]:
!pip install langchain
!pip install langchain-openai
!pip install langchain-community
!pip install openai
!pip install PyPDF2
!pip install faiss-cpu
!pip install tiktoken

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)

In [1]:
from PyPDF2 import PdfReader
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.vectorstores import FAISS


In [1]:
import os
from constants import openai_key
os.environ["OPENAI_API_KEY"] = openai_key

In [3]:
# Provide the path to the PDF file.
pdfreader = PdfReader('budget_speech.pdf')

In [4]:
# Read the text from every page of the PDF
raw_text = ''
for page in pdfreader.pages:
    content = page.extract_text()
    if content:  # skip pages with no extractable text
        raw_text += content

In [5]:
raw_text

"GOVERNMENT OF INDIA\nBUDGET 2023-2024\nSPEECH\nOF\nNIRMALA SITHARAMAN\nMINISTER OF FINANCE\nFebruary 1,  2023CONTENTS \nPART-A \n Page No.  \n\uf0b7 Introduction 1 \n\uf0b7 Achievements since 2014: Leaving no one behind 2 \n\uf0b7 Vision for Amrit Kaal  – an empowered and inclusive economy 3 \n\uf0b7 Priorities of this Budget 5 \ni. Inclusive Development  \nii. Reaching the Last Mile \niii. Infrastructure and Investment \niv. Unleashing the Potential \nv. Green Growth \nvi. Youth Power  \nvii. Financial Sector  \n \n \n \n \n \n \n \n \n\uf0b7 Fiscal Management 24 \nPART B  \n  \nIndirect Taxes  27 \n\uf0b7 Green Mobility  \n\uf0b7 Electronics   \n\uf0b7 Electrical   \n\uf0b7 Chemicals and Petrochemicals   \n\uf0b7 Marine products  \n\uf0b7 Lab Grown Diamonds  \n\uf0b7 Precious Metals  \n\uf0b7 Metals  \n\uf0b7 Compounded Rubber  \n\uf0b7 Cigarettes  \n  \nDirect Taxes  30 \n\uf0b7 MSMEs and Professionals   \n\uf0b7 Cooperation  \n\uf0b7 Start-Ups  \n\uf0b7 Appeals  \n\uf0b7 Better ta

In [6]:
# Split the text with CharacterTextSplitter so that no single chunk exceeds the model's token limit
text_splitter = CharacterTextSplitter(
    separator = "\n",
    chunk_size = 800,
    chunk_overlap  = 200,
    length_function = len,
)
texts = text_splitter.split_text(raw_text)

In [7]:
len(texts)

149
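
The splitter settings above (separator "\n", chunk_size 800, chunk_overlap 200) can be illustrated with a small pure-Python sketch. This is not CharacterTextSplitter's actual implementation; `split_with_overlap` is a hypothetical helper that only shows the idea of packing lines into size-limited chunks while carrying some trailing text into the next chunk.

```python
def split_with_overlap(text: str, separator: str = "\n",
                       chunk_size: int = 800, chunk_overlap: int = 200) -> list[str]:
    """Greedily pack separator-delimited pieces into chunks of at most
    chunk_size characters, repeating up to chunk_overlap trailing characters."""
    pieces = [p for p in text.split(separator) if p]
    chunks: list[str] = []
    current: list[str] = []
    for piece in pieces:
        if current and len(separator.join(current + [piece])) > chunk_size:
            # The current chunk is full: emit it.
            chunks.append(separator.join(current))
            # Drop leading pieces until what remains fits the overlap budget
            # and still leaves room for the incoming piece.
            while current and (
                len(separator.join(current)) > chunk_overlap
                or len(separator.join(current + [piece])) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


demo_chunks = split_with_overlap("aaaa\nbbbb\ncccc\ndddd", chunk_size=9, chunk_overlap=4)
print(demo_chunks)
```

Each chunk in the demo repeats the last line of the previous one, which is the role chunk_overlap plays: context that straddles a chunk boundary still appears whole in at least one chunk. A single line longer than chunk_size is kept as one oversized chunk in this sketch.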

In [8]:
# Create the OpenAI embeddings client (chunks are embedded when the index is built)
embeddings = OpenAIEmbeddings()

In [9]:
document_search = FAISS.from_texts(texts, embeddings)

In [10]:
document_search

<langchain_community.vectorstores.faiss.FAISS at 0x25f153c4e50>
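
Conceptually, the index stores one embedding vector per chunk, and a similarity search embeds the query and returns the chunks whose vectors are closest to it. The sketch below shows that idea in plain Python, assuming Euclidean (L2) distance. The 3-dimensional vectors and the `toy_index` contents are made up for illustration; real OpenAI embeddings have far more dimensions, and FAISS does this lookup much more efficiently.

```python
import math


def l2_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_texts(query_vec: list[float],
                  indexed: list[tuple[str, list[float]]],
                  k: int = 4) -> list[str]:
    """Return the k texts whose vectors are closest to query_vec."""
    ranked = sorted(indexed, key=lambda item: l2_distance(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]


# Hypothetical (text, embedding) pairs standing in for embedded chunks.
toy_index = [
    ("agriculture credit", [1.0, 0.0, 0.0]),
    ("green growth", [0.0, 1.0, 0.0]),
    ("fiscal management", [0.0, 0.0, 1.0]),
]

top = nearest_texts([0.9, 0.1, 0.0], toy_index, k=2)
print(top)
```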

In [11]:
from langchain.chains.question_answering import load_qa_chain
from langchain_openai import OpenAI

In [12]:
chain = load_qa_chain(OpenAI(), chain_type="stuff")

stuff: https://python.langchain.com/v0.2/docs/versions/migrating_chains/stuff_docs_chain
map_reduce: https://python.langchain.com/v0.2/docs/versions/migrating_chains/map_reduce_chain
refine: https://python.langchain.com/v0.2/docs/versions/migrating_chains/refine_chain
map_rerank: https://python.langchain.com/v0.2/docs/versions/migrating_chains/map_rerank_docs_chain

See also guides on retrieval and question-answering here: https://python.langchain.com/v0.2/docs/how_to/#qa-with-rag

In [13]:
query = "Vision for Amrit Kaal"
docs = document_search.similarity_search(query)
chain.run(input_documents=docs, question=query)


' The vision for Amrit Kaal includes a technology-driven and knowledge-based economy with strong public finances and a robust financial sector. Jan Bhagidari through Sabka Saath Sabka Prayas is considered essential in achieving this vision. The economic agenda for this vision focuses on facilitating opportunities for citizens, providing impetus to growth and job creation, and strengthening macro-economic stability. The four transformative opportunities during Amrit Kaal are economic empowerment of women, reaching the last mile, infrastructure and investment, and unleashing potential for green growth and youth power.'

In [14]:
query = "How much the agriculture target will be increased to and what the focus will be"
docs = document_search.similarity_search(query)
chain.run(input_documents=docs, question=query)

' The agriculture credit target will be increased to `20 lakh crore with a focus on animal husbandry, dairy, and fisheries.'
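
The "stuff" chain type used above places all retrieved chunks into a single prompt together with the question. A minimal sketch of that idea follows. The prompt wording and the `build_stuff_prompt` helper are hypothetical and are not LangChain's actual template.

```python
def build_stuff_prompt(docs: list[str], question: str) -> str:
    """Concatenate every retrieved chunk into one context block and
    append the question, producing a single prompt for the LLM."""
    context = "\n\n".join(docs)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Made-up chunks standing in for the results of similarity_search.
retrieved = ["Chunk about agriculture credit.", "Chunk about fisheries."]
prompt = build_stuff_prompt(retrieved, "What is the agriculture credit target?")
print(prompt)
```

Because every chunk goes into one prompt, this approach only works while the retrieved chunks plus the question fit in the model's context window, which is one reason to keep chunk_size modest.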

Online PDF Loader is not working

In [6]:
# Method - 01
import requests
from langchain_community.document_loaders import UnstructuredPDFLoader

# Step 1: Download the PDF
def download_pdf(url: str, save_path: str):
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail early instead of saving an HTTP error page
    with open(save_path, 'wb') as file:
        file.write(response.content)

pdf_url = 'https://arxiv.org/pdf/1706.03762.pdf'
local_pdf_path = 'localfile.pdf'
download_pdf(pdf_url, local_pdf_path)

# Step 2: Load the PDF using LangChain
loader = UnstructuredPDFLoader(local_pdf_path)
documents = loader.load()

# Print or process the documents
for doc in documents:
    print(doc)


ImportError: DLL load failed while importing onnx_cpp2py_export: A dynamic link library (DLL) initialization routine failed.
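
Independently of the loader error, it can help to confirm that the downloaded file really is a PDF before handing it to any parser, since a failed download may save an HTML error page instead. A small standard-library sketch: `looks_like_pdf` is a hypothetical helper that checks only for the "%PDF-" header at the very start of the file, so it is a quick sanity check rather than full validation.

```python
def looks_like_pdf(path: str) -> bool:
    """Return True if the file begins with the '%PDF-' header bytes."""
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"
```

For example, `looks_like_pdf(local_pdf_path)` could be called right after `download_pdf(...)` and before the loader is created.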

Method - 02 for online PDF reader

In [2]:
from langchain_community.document_loaders import OnlinePDFLoader

In [3]:
loader = OnlinePDFLoader("https://arxiv.org/pdf/1706.03762.pdf")

In [4]:
!pip install unstructured
!pip install pdfminer.six
!pip install pillow_heif



In [17]:
!pip install --upgrade --force-reinstall unstructured

Collecting unstructured
  Using cached unstructured-0.15.12-py3-none-any.whl.metadata (29 kB)
Collecting chardet (from unstructured)
  Using cached chardet-5.2.0-py3-none-any.whl.metadata (3.4 kB)
Collecting filetype (from unstructured)
  Using cached filetype-1.2.0-py2.py3-none-any.whl.metadata (6.5 kB)
Collecting python-magic (from unstructured)
  Using cached python_magic-0.4.27-py2.py3-none-any.whl.metadata (5.8 kB)
Collecting lxml (from unstructured)
  Using cached lxml-5.3.0-cp311-cp311-win_amd64.whl.metadata (3.9 kB)
Collecting nltk (from unstructured)
  Using cached nltk-3.9.1-py3-none-any.whl.metadata (2.9 kB)
Collecting tabulate (from unstructured)
  Using cached tabulate-0.9.0-py3-none-any.whl.metadata (34 kB)
Collecting requests (from unstructured)
  Using cached requests-2.32.3-py3-none-any.whl.metadata (4.6 kB)
Collecting beautifulsoup4 (from unstructured)
  Using cached beautifulsoup4-4.12.3-py3-none-any.whl.metadata (3.8 kB)
Collecting emoji (from unstructured)
  Using 

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
conda-repo-cli 1.0.75 requires requests_mock, which is not installed.
gensim 4.3.0 requires FuzzyTM>=0.4.0, which is not installed.
anaconda-cloud-auth 0.1.4 requires pydantic<2.0, but you have pydantic 2.9.1 which is incompatible.
botocore 1.31.64 requires urllib3<2.1,>=1.25.4; python_version >= "3.10", but you have urllib3 2.2.3 which is incompatible.
conda-repo-cli 1.0.75 requires clyent==1.2.1, but you have clyent 1.2.2 which is incompatible.
conda-repo-cli 1.0.75 requires python-dateutil==2.8.2, but you have python-dateutil 2.9.0.post0 which is incompatible.
conda-repo-cli 1.0.75 requires requests==2.31.0, but you have requests 2.32.3 which is incompatible.
langchain-community 0.2.16 requires langchain<0.3.0,>=0.2.16, but you have langc

In [18]:
!pip install "unstructured[all-docs]"





In [20]:
!pip install unstructured-inference

Collecting unstructured-inference
  Downloading unstructured_inference-0.7.36-py3-none-any.whl.metadata (5.9 kB)
Collecting layoutparser (from unstructured-inference)
  Downloading layoutparser-0.3.4-py3-none-any.whl.metadata (7.7 kB)
Collecting opencv-python!=4.7.0.68 (from unstructured-inference)
  Downloading opencv_python-4.10.0.84-cp37-abi3-win_amd64.whl.metadata (20 kB)
Collecting onnx (from unstructured-inference)
  Downloading onnx-1.16.2-cp311-cp311-win_amd64.whl.metadata (16 kB)
Collecting timm (from unstructured-inference)
  Downloading timm-1.0.9-py3-none-any.whl.metadata (42 kB)
Collecting iopath (from layoutparser->unstructured-inference)
  Downloading iopath-0.1.10.tar.gz

In [23]:
!pip install pi_heif

Collecting pi_heif
  Downloading pi_heif-0.18.0-cp311-cp311-win_amd64.whl.metadata (6.8 kB)
Downloading pi_heif-0.18.0-cp311-cp311-win_amd64.whl (1.7 MB)

In [5]:
data = loader.load()

ImportError: DLL load failed while importing onnx_cpp2py_export: A dynamic link library (DLL) initialization routine failed.

In [37]:
data

NameError: name 'data' is not defined

In [39]:
# Create the OpenAI embeddings client
embeddings = OpenAIEmbeddings()

In [38]:
!pip install chromadb



In [40]:
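from langchain.indexes import VectorstoreIndexCreator
# Note: this call fails because VectorstoreIndexCreator requires an explicit
# embedding, e.g. VectorstoreIndexCreator(embedding=embeddings).
index = VectorstoreIndexCreator().from_loaders([loader])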



ValidationError: 1 validation error for VectorstoreIndexCreator
embedding
  Field required [type=missing, input_value={}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.9/v/missing

In [41]:
query = "Explain me about Attention is all you need"
index.query(query)

NameError: name 'index' is not defined

In [42]:
pip show unstructured

Note: you may need to restart the kernel to use updated packages.
Name: unstructured
Version: 0.14.9
Summary: A library that prepares raw documents for downstream ML tasks.
Home-page: https://github.com/Unstructured-IO/unstructured
Author: Unstructured Technologies
Author-email: devops@unstructuredai.io
License: Apache-2.0
Location: c:\Users\s.kumar\AppData\Local\anaconda3\Lib\site-packages
Requires: backoff, beautifulsoup4, chardet, dataclasses-json, emoji, filetype, langdetect, lxml, nltk, numpy, python-iso639, python-magic, rapidfuzz, requests, tabulate, tqdm, typing-extensions, unstructured-client, wrapt
Required-by: 
