# Ways to load a PDF document

https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf


### Using PyPDF

PyPDF (`pypdf`) is a popular Python library for working with PDF files. It can extract text, metadata, and other information from PDF documents. LangChain uses it as follows:

- **Loading PDFs**: `PyPDFLoader` loads a PDF into a list of documents, where each document holds the page content plus metadata that includes the page number.
- **Installation**: Install the library with `pip install pypdf`.
- **Usage**: `PyPDFLoader` loads the PDF and splits it into pages; the code snippet on the linked page shows how to access the content.
- **Advantage**: Documents are retrieved with their page numbers, which makes it easy to navigate the PDF and refer back to a specific page.
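
As a minimal sketch of the points above, the snippet below wraps `PyPDFLoader` in a small function and adds a helper that pairs each page number with the start of that page's text. It assumes the same import path the notebook uses (`langchain.document_loaders`). The helper names `load_pdf_pages` and `page_previews` and the sample data are hypothetical, introduced only for illustration. The stand-in objects mimic the `page_content` and `metadata` attributes of a loaded document, so the helper can be tried without a PDF on disk.

```python
from types import SimpleNamespace


def load_pdf_pages(path):
    """Return one LangChain Document per PDF page (requires langchain and pypdf)."""
    # Imported lazily so the rest of this sketch runs without LangChain installed.
    from langchain.document_loaders import PyPDFLoader

    return PyPDFLoader(path).load()


def page_previews(pages, width=60):
    """Hypothetical helper: pair each page number with the start of its text."""
    previews = []
    for doc in pages:
        # PyPDFLoader stores a zero-based page index under the "page" key.
        page_number = doc.metadata.get("page")
        previews.append((page_number, doc.page_content[:width]))
    return previews


# Stand-in objects with the same two attributes as a loaded document.
sample_pages = [
    SimpleNamespace(page_content="Chapter 1: Introduction",
                    metadata={"source": "example.pdf", "page": 0}),
    SimpleNamespace(page_content="Chapter 2: Methods",
                    metadata={"source": "example.pdf", "page": 1}),
]

for number, preview in page_previews(sample_pages, width=9):
    print(f"page {number}: {preview}")
```

With a real file, `page_previews(load_pdf_pages("pdf/Causal_Inference_in_Python.pdf"))` would produce the same kind of list from the actual pages.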

### Using Unstructured

Unstructured is the other method the linked page describes for loading PDF documents. Here's how it works:

- **Loading PDFs**: `UnstructuredPDFLoader` loads the PDF document.
- **Elements Handling**: Under the hood, Unstructured creates a separate "element" for each chunk of text. By default these are combined into a single document, but you can keep them separate by specifying `mode="elements"`.
- **Usage**: The loaded content is accessed the same way as with PyPDF, through each document's page content and metadata.
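
The element handling described above can be sketched roughly as follows. `load_pdf_elements` passes `mode="elements"` to `UnstructuredPDFLoader`, and `group_by_category` collects the resulting texts by element type. Both function names are hypothetical. The sketch assumes each element's metadata carries a `category` key (for example `Title` or `NarrativeText`), which may vary between library versions. The sample elements are invented stand-ins, not output from a real PDF.

```python
from collections import defaultdict
from types import SimpleNamespace


def load_pdf_elements(path):
    """Load a PDF as separate elements (requires langchain and unstructured)."""
    # Imported lazily so the grouping helper runs without LangChain installed.
    from langchain.document_loaders import UnstructuredPDFLoader

    return UnstructuredPDFLoader(path, mode="elements").load()


def group_by_category(elements):
    """Hypothetical helper: collect element texts under their category label."""
    grouped = defaultdict(list)
    for element in elements:
        # Assumption: the metadata has a "category" key; fall back to "Unknown".
        category = element.metadata.get("category", "Unknown")
        grouped[category].append(element.page_content)
    return dict(grouped)


# Invented stand-ins with the same attributes as loaded elements.
sample_elements = [
    SimpleNamespace(page_content="A Sample Title",
                    metadata={"category": "Title"}),
    SimpleNamespace(page_content="First paragraph of body text.",
                    metadata={"category": "NarrativeText"}),
    SimpleNamespace(page_content="Second paragraph of body text.",
                    metadata={"category": "NarrativeText"}),
]

for category, texts in group_by_category(sample_elements).items():
    print(category, len(texts))
```

Grouping by category is only one way to use the separation; the same loop could instead filter out headers or keep titles alongside the text that follows them.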

### When to Use PyPDF vs. Unstructured

- **PyPDF**:
  - **When Page Numbers are Important**: If you need to keep track of page numbers and want to work with individual pages, PyPDF is a suitable choice.
  - **General Text Extraction**: PyPDF is widely used for general text extraction from PDF documents and has a well-established community.

- **Unstructured**:
  - **Handling Different Text Elements**: If you need to work with different chunks of text and want to retain the separation between different elements, Unstructured might be a better option.
  - **Customized Text Processing**: Unstructured allows for more customized handling of text elements, making it suitable for specific use cases where text needs to be processed in a particular way.

### Selection

The choice between PyPDF and Unstructured depends on what your task requires. If you need general text extraction with page numbers, PyPDF is a solid choice. If you need finer control over individual text elements, Unstructured is likely the better fit. LangChain supports both, so you can switch loaders without changing the rest of your pipeline much.
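
One way to encode this decision is a small selector that returns the loader name and the keyword arguments to pass to it. This is only a sketch: `choose_pdf_loader` and its `keep_elements_separate` flag are hypothetical names, and a real project may weigh other factors as well.

```python
def choose_pdf_loader(keep_elements_separate):
    """Hypothetical helper: return a loader class name and keyword arguments."""
    if keep_elements_separate:
        # mode="elements" keeps Unstructured's text chunks as separate documents.
        return "UnstructuredPDFLoader", {"mode": "elements"}
    # Default: page-by-page loading with page numbers in the metadata.
    return "PyPDFLoader", {}


for flag in (False, True):
    name, kwargs = choose_pdf_loader(flag)
    print(f"keep_elements_separate={flag}: {name} {kwargs}")
```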

# 1. PyPDF approach

In [1]:
# Load environment variables (e.g. API keys) from the nearest .env file
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv())

True

In [2]:
from langchain.document_loaders import PyPDFLoader

In [3]:
from langchain.prompts import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate,
)
from langchain.chat_models import ChatOpenAI

In [None]:
loader = PyPDFLoader('pdf/Causal_Inference_in_Python.pdf')