# PDF Document Loading and Starting RAG
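Use a PDF loading library to feed PDFs into an LLM. The tutorial uses PyMuPDF. Its loader class, `PyMuPDFLoader`, is **already available within langchain_community** (the underlying `pymupdf` package still has to be installed for the loader to work).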

PyMuPDF vs. pypdf: PyMuPDF is faster at plain text extraction from PDFs and sometimes handles graphics and formatting better too. However, it is potentially less secure and can have reliability issues. pypdf is slower but may be considered more stable and ready for enterprise use, although it often has issues with formatting.

Other loaders + langchain docs: https://python.langchain.com/docs/integrations/document_loaders/

We will also use *tiktoken*, which helps tokenize the text. It is included in the openai library installation in requirements.txt.
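As a rough illustration of what tokenizing looks like in practice, here is a minimal sketch of a token counter. The function name `count_tokens` and its parameters are made up for this note, and `cl100k_base` is only an assumed encoding; pick whichever encoding matches the model you actually use.

```python
from typing import Callable, List, Optional


def count_tokens(
    text: str,
    encode: Optional[Callable[[str], List]] = None,
    encoding_name: str = "cl100k_base",  # assumption: match this to your model
) -> int:
    """Return the number of tokens in `text`.

    `encode` is a hypothetical hook so a stand-in tokenizer can be passed in;
    when it is omitted, tiktoken is used.
    """
    if encode is None:
        # Imported lazily so the function also works with a stand-in tokenizer.
        import tiktoken

        encode = tiktoken.get_encoding(encoding_name).encode
    return len(encode(text))
```

Calling `count_tokens(some_text)` uses tiktoken, while `count_tokens(some_text, encode=str.split)` falls back to a crude whitespace count, which can be handy for a quick sanity check without the library.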

In [30]:
import glob
import os

from dotenv import load_dotenv
from langchain_community.document_loaders import PyMuPDFLoader

load_dotenv('.env')  # load environment variables from .env

Using finance PDFs for this exercise, from: https://github.com/pistolla/gnidart/tree/master
The tutorial uses health supplement PDFs instead.

In [7]:
loader = PyMuPDFLoader("finance_pdfs/Market Wizards - Interviews With Top Traders 2012.pdf")
docs = loader.load()
len(docs) # number of pages

215

In [9]:
docs[0].metadata

{'producer': 'Acrobat Distiller 7.0 (Windows)',
 'creator': 'Acrobat PDFMaker 7.0 for Word',
 'creationdate': '2005-03-29T21:52:01+02:00',
 'source': 'finance_pdfs/Market Wizards - Interviews With Top Traders 2012.pdf',
 'file_path': 'finance_pdfs/Market Wizards - Interviews With Top Traders 2012.pdf',
 'total_pages': 215,
 'format': 'PDF 1.3',
 'title': 'THE MARKET WIZARDS',
 'author': 'AnToni',
 'subject': '',
 'keywords': '',
 'moddate': '2013-09-13T01:07:18+01:00',
 'trapped': '',
 'modDate': "D:20130913010718+01'00'",
 'creationDate': "D:20050329215201+02'00'",
 'page': 0}

In [18]:
print(docs[0].page_content)

THE MARKET WIZARDS 
 
CONVERSATIONS WITH 
AMERICA'S TOP TRADERS 
JACK  D. SCHWAGER


Read all PDFs: the tutorial does it this way, which is arguably more convoluted than necessary (though unlike a plain `glob`, `os.walk` also descends into subfolders).
pdfs = []
for root, _dirs, files in os.walk("finance_pdfs"):  # _dirs is unused
    for file in files:
        if file.endswith('.pdf'):
            pdfs.append(os.path.join(root, file))

In [None]:
# load all pdfs
files = glob.glob('finance_pdfs/*.pdf')

docs = []
for file in files:
    loader = PyMuPDFLoader(file)
    temp = loader.load()
    docs.extend(temp) # loads all pages into array of docs (across all pdfs)

len(docs)

1225
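The two ways of collecting PDF paths differ in one respect: `os.walk` visits subfolders, while `glob.glob('finance_pdfs/*.pdf')` only looks at the top level. A minimal sketch that covers both cases with `pathlib` is below. The helper name `find_pdfs` and its `recursive` flag are made up for illustration and are not part of the tutorial.

```python
from pathlib import Path
from typing import List


def find_pdfs(folder: str, recursive: bool = True) -> List[str]:
    """Return sorted paths of the .pdf files found under `folder`."""
    base = Path(folder)
    # rglob descends into subfolders (like os.walk); glob stays at the top level.
    matches = base.rglob("*.pdf") if recursive else base.glob("*.pdf")
    return sorted(str(path) for path in matches if path.is_file())
```

With this helper, `find_pdfs("finance_pdfs", recursive=False)` should pick up the same files as the `glob` call, only in sorted order, which keeps the page order of the combined documents reproducible between runs.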

In [33]:
# Load every PDF and join all page texts into a single string; pages are separated by 2 newlines
def format_text(files):
    docs = []
    for file in files:
        loader = PyMuPDFLoader(file)
        docs.extend(loader.load())  # one Document per page, collected across all PDFs

    return "\n\n".join(doc.page_content for doc in docs)

In [None]:
files = glob.glob('finance_pdfs/*.pdf')
context = format_text(files)

In [36]:
print(context[:1000])

THE MARKET WIZARDS 
 
CONVERSATIONS WITH 
AMERICA'S TOP TRADERS 
JACK  D. SCHWAGER

2
HarperBusiness 
 
 
 
You've got to learn how to fall, before you learn to fly. 
—Paul Simon 
 
One man's ceiling is another man's floor. 
—Paul Simon 
 
If I wanted to become a tramp, I would seek information and advice from the most successful tramp 
I could find. If I wanted to become a failure, I would seek advice from men who had never 
succeeded. If I wanted to succeed in all things, I would look around me for those who are 
succeeding and do as they have done. 
—Joseph Marshall Wade 
(as quoted in a Treasury of Wall Street Wisdom 
edited by Harry D. Schultz and Samson Coslow)

3
Contents 
 
Preface...................................................................................................................................................4 
Acknowledgments ................................................................................................................................5 
Prolo

## Chunking
The combined text is typically far larger than an LLM's context window can handle in a single prompt, so it has to be split into smaller chunks first.
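One simple way to split a long string is fixed-size chunking with overlap. The sketch below is a naive, character-based version written only to show the idea; it is not necessarily the splitter the tutorial uses, and the name `chunk_text` and the default sizes are assumptions. Sizes here are measured in characters, not tokens.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Split `text` into pieces of at most `chunk_size` characters.

    Consecutive chunks share `overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks
```

Applied to the notebook's data this would be something like `chunks = chunk_text(context)`. The overlap is there so that text cut at a chunk boundary also appears at the start of the next chunk, at the cost of some duplicated content.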