## Understanding pdf_chunker functions

Links:
- https://python.langchain.com/docs/integrations/document_loaders/copypaste
- https://pypi.org/project/pypdf/
- https://python.langchain.com/docs/modules/data_connection/document_transformers/recursive_text_splitter

In [1]:
from pypdf import PdfReader

reader = PdfReader(r"C:\Users\Michelle\Data_Science_Code\RAG-chatbot\data\qantas_points_tncs.pdf")
number_of_pages = len(reader.pages)
print(number_of_pages)

6


In [2]:
page = reader.pages[2]
text = page.extract_text()
text

'notified to you from time to time, for every one Australian dollar (AUD) \nspent on goods and services, charged and billed on the Card Account. \nSubject to these terms and conditions, Qantas Points are calculated on \neach purchase of goods or services charged to your Card. Only whole \nQantas Points are credited. Accrued Points shall be rounded to the nearest \nwhole Qantas Point in accordance with the generally accepted principles in \nrespect of rounding rules. For details on the number of Qantas Points \nawarded for each transaction type, please contact American Express using \nthe telephone number printed on the back of your Card.\n3.2  Exemptions\n No Qantas Points accrue in respect of:\n (a)  charges prepaid prior to the first billing statement for that Card \nAccount following the Enrolment Date;\n (b)  Cash advance and other cash services or cash equivalents;\n (c)  American Express Travellers Cheque and gift cheque purchases;\n (d) charges for dishonoured payments;\n (e) in

In [3]:
import re
text = re.sub("\n"," ", text)
text

'notified to you from time to time, for every one Australian dollar (AUD)  spent on goods and services, charged and billed on the Card Account.  Subject to these terms and conditions, Qantas Points are calculated on  each purchase of goods or services charged to your Card. Only whole  Qantas Points are credited. Accrued Points shall be rounded to the nearest  whole Qantas Point in accordance with the generally accepted principles in  respect of rounding rules. For details on the number of Qantas Points  awarded for each transaction type, please contact American Express using  the telephone number printed on the back of your Card. 3.2  Exemptions  No Qantas Points accrue in respect of:  (a)  charges prepaid prior to the first billing statement for that Card  Account following the Enrolment Date;  (b)  Cash advance and other cash services or cash equivalents;  (c)  American Express Travellers Cheque and gift cheque purchases;  (d) charges for dishonoured payments;  (e) interest charges; 
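
The output above still contains double spaces, because many lines in the extracted text end with a space before the newline. A minimal sketch of one way to collapse those runs, assuming plain whitespace normalisation is acceptable for this document (the helper name `normalize_whitespace` is hypothetical and not part of pdf_chunker.py):

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) into a single space."""
    return re.sub(r"\s+", " ", text).strip()


# Short excerpt in the same shape as the extracted page text above.
sample = "charged and billed on the Card Account. \nSubject to these terms"
print(normalize_whitespace(sample))
```

Matching `\s+` rather than a literal `"\n"` handles the trailing-space-plus-newline case in one pass, at the cost of also flattening any intentional indentation.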

In [4]:
from langchain.docstore.document import Document
example_text = "This is paragraph 1."
doc = Document(page_content=example_text)
doc

Document(page_content='This is paragraph 1.')

In [5]:
metadata = {"source": "internet", "date": "Friday"}
doc = Document(page_content=example_text, metadata=metadata)
doc

Document(page_content='This is paragraph 1.', metadata={'source': 'internet', 'date': 'Friday'})

In [6]:
# can also add metadata this way:
doc.metadata["page"]=1
doc

Document(page_content='This is paragraph 1.', metadata={'source': 'internet', 'date': 'Friday', 'page': 1})

Next, modify the `text_to_docs` function in pdf_chunker.py, which originally came from this [repo](https://github.com/avrabyt/RAG-Chatbot).

In [30]:
# example list of pages from the pdf
list_of_pages = [reader.pages[i].extract_text() for i in range(3)]
list_of_pages


['Qantas  \nAmerican Express Cards\nQantas Points Terms & Conditions\nEFFECTIVE:  15 AUGUST 2019\nAmerican Express is the credit provider and credit licensee under the national consumer credit laws. \nCredit provided by American Express Australia Limited (ABN 92 108 952 085)\nAustralian Credit Licence No. 291313 \n® Registered T rademark of American Express Company.',
 'YOUR FIRST USE OF THE CARD OR CARD ACCOUNT WILL \nINDICATE YOUR AGREEMENT TO THESE QANTAS POINTS \nTERMS  AND CONDITIONS.\n1. Definitions\nAccrued Points – Qantas Points accrued as a result of transactions on the \nCard and an Additional Card that have not been transferred to the Basic Card \nMember’s Qantas Frequent Flyer account.\nAdditional Card  – a Qantas American Express Card issued to another person \nat the request of the Basic Card Member and on the Basic Card Member’s \nCard Account and may previously have been referred to as a Supplementary \nCard Member.\nAdditional Card Member  – a holder of an Additional C

In [19]:
page_docs = [Document(page_content=page) for page in list_of_pages]
page_docs

[Document(page_content='Qantas  \nAmerican Express Cards\nQantas Points Terms & Conditions\nEFFECTIVE:  15 AUGUST 2019\nAmerican Express is the credit provider and credit licensee under the national consumer credit laws. \nCredit provided by American Express Australia Limited (ABN 92 108 952 085)\nAustralian Credit Licence No. 291313 \n® Registered T rademark of American Express Company.'),
 Document(page_content='YOUR FIRST USE OF THE CARD OR CARD ACCOUNT WILL \nINDICATE YOUR AGREEMENT TO THESE QANTAS POINTS \nTERMS  AND CONDITIONS.\n1. Definitions\nAccrued Points – Qantas Points accrued as a result of transactions on the \nCard and an Additional Card that have not been transferred to the Basic Card \nMember’s Qantas Frequent Flyer account.\nAdditional Card  – a Qantas American Express Card issued to another person \nat the request of the Basic Card Member and on the Basic Card Member’s \nCard Account and may previously have been referred to as a Supplementary \nCard Member.\nAddition

In [20]:
# Assign a page number to the metadata of each document.
for page_number, doc in enumerate(page_docs, start=1):  # start=1 so the first page is page 1
    doc.metadata["page"] = page_number

In [10]:
page_docs #now has page number in metadata

[Document(page_content='Qantas  \nAmerican Express Cards\nQantas Points Terms & Conditions\nEFFECTIVE:  15 AUGUST 2019\nAmerican Express is the credit provider and credit licensee under the national consumer credit laws. \nCredit provided by American Express Australia Limited (ABN 92 108 952 085)\nAustralian Credit Licence No. 291313 \n® Registered T rademark of American Express Company.', metadata={'page': 1}),
 Document(page_content='YOUR FIRST USE OF THE CARD OR CARD ACCOUNT WILL \nINDICATE YOUR AGREEMENT TO THESE QANTAS POINTS \nTERMS  AND CONDITIONS.\n1. Definitions\nAccrued Points – Qantas Points accrued as a result of transactions on the \nCard and an Additional Card that have not been transferred to the Basic Card \nMember’s Qantas Frequent Flyer account.\nAdditional Card  – a Qantas American Express Card issued to another person \nat the request of the Basic Card Member and on the Basic Card Member’s \nCard Account and may previously have been referred to as a Supplementary \n

In [28]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Create the splitter once; the same instance is reused for every page.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,  # measured in characters; originally 4000, but no page is that long
    separators=["\n\n", "\n", ".", "!", "?", ",", " ", ""],
    chunk_overlap=0,
)

for page_doc in page_docs:
    # Split the page text into chunks.
    chunks = text_splitter.split_text(page_doc.page_content)
    # print(chunks)

    # Convert each chunk into a new document, storing its chunk number, page number, and source file name in its metadata.
    # A separate name (chunk_doc) avoids overwriting the page-level document inside the loop.
    for i, chunk in enumerate(chunks, start=1):
        chunk_doc = Document(page_content=chunk, metadata={"page": page_doc.metadata["page"], "chunk": i})
        chunk_doc.metadata["source"] = f"page {chunk_doc.metadata['page']} - section {chunk_doc.metadata['chunk']}"
        chunk_doc.metadata["filename"] = "qantas_points_tncs"
        print(chunk_doc)

page_content='Qantas  \nAmerican Express Cards\nQantas Points Terms & Conditions\nEFFECTIVE:  15 AUGUST 2019\nAmerican Express is the credit provider and credit licensee under the national consumer credit laws. \nCredit provided by American Express Australia Limited (ABN 92 108 952 085)\nAustralian Credit Licence No. 291313 \n® Registered T rademark of American Express Company.' metadata={'page': 1, 'chunk': 1, 'source': 'page 1 - section 1', 'filename': 'qantas_points_tncs'}
page_content='YOUR FIRST USE OF THE CARD OR CARD ACCOUNT WILL \nINDICATE YOUR AGREEMENT TO THESE QANTAS POINTS \nTERMS  AND CONDITIONS.\n1. Definitions\nAccrued Points – Qantas Points accrued as a result of transactions on the \nCard and an Additional Card that have not been transferred to the Basic Card \nMember’s Qantas Frequent Flyer account.\nAdditional Card  – a Qantas American Express Card issued to another person \nat the request of the Basic Card Member and on the Basic Card Member’s \nCard Account and may
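
To make the role of the `separators` list more concrete, here is a simplified, pure-Python illustration of the general idea: try the coarsest separator first, and fall back to finer ones only for pieces that are still longer than `chunk_size`. This is a sketch written for this notebook, not LangChain's actual implementation. The function name `recursive_split` is hypothetical, and unlike the real splitter it drops separators at chunk boundaries and does not support overlap.

```python
from typing import List


def recursive_split(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Simplified illustration of separator-by-separator splitting (not LangChain's code)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # No separators left (or the empty separator): cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # The piece still fits, so keep growing the current chunk.
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry it with the finer separators.
            chunks.extend(recursive_split(piece, finer, chunk_size))
            current = ""
    if current:
        chunks.append(current)
    return chunks


example_chunks = recursive_split("aaa bbb ccc\nddd eee", ["\n", " ", ""], chunk_size=7)
print(example_chunks)
```

With `chunk_size=2000` and pages shorter than that, every page fits in the first check, which is why each page above produced exactly one chunk.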

# Code with chunking

This is the version of the code with chunking. I removed the chunking step in the end because none of my documents are long enough to need it, so the final version uses only page numbers and no chunks.

In [35]:
from typing import List
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

def text_to_docs(text: List[str], filename: str) -> List[Document]:
    """
    Convert cleaned PDF page text into a list of chunked Document objects.

    Note: `text` is a list of cleaned strings, where each element is one page
    of a PDF. A single string is also accepted and treated as one page.

    This function:
        - Takes a list of text strings and a file name
        - Splits the text into a list of chunked "Document" objects
        - Stores the page number, chunk number, source label, and file name
          in the metadata of each object
    """
    # Ensure the input text is a list. If it's a string, convert it to a list.
    if isinstance(text, str):
        text = [text]
    
    # Convert each text (from a page) to a Langchain Document object
    page_docs = [Document(page_content=page) for page in text]
    
    # Assign a page number to the metadata of each document.
    for page_number, doc in enumerate(page_docs, start=1):  # start=1 so the first page is page 1
        doc.metadata["page"] = page_number

    doc_chunks = []
    
    # Initialize the text splitter once, with the chunk size and the separators to try in order.
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=4000,
        separators=["\n\n", "\n", ".", "!", "?", ",", " ", ""],
        chunk_overlap=0,
    )

    # Split each page's text into smaller chunks and store them as separate documents.
    for page_doc in page_docs:
        # Split the page text into chunks.
        chunks = text_splitter.split_text(page_doc.page_content)
        # print(chunks)

        # Convert each chunk into a new document, storing its chunk number, page number, and source file name in its metadata.
        for i, chunk in enumerate(chunks, start=1):
            chunk_doc = Document(
                page_content=chunk,
                # The page number comes from the page-level document the chunk was cut from.
                metadata={"page": page_doc.metadata["page"], "chunk": i},
            )
            chunk_doc.metadata["source"] = f"page {chunk_doc.metadata['page']} - section {chunk_doc.metadata['chunk']}"
            chunk_doc.metadata["filename"] = filename
            doc_chunks.append(chunk_doc)
    
    # Return the list of chunked documents.
    return doc_chunks
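
The page-only alternative mentioned above (page numbers, no chunks) could look roughly like the sketch below. Several things here are assumptions: the function name `pages_to_docs` is hypothetical, the `"page N"` format for the `source` label is a guess at how the label would read without a section number, and the small `Document` dataclass is only a stand-in so the example runs without langchain installed. In the notebook itself, langchain's `Document` would be used instead.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Document:
    """Stand-in for langchain's Document (assumption: only these two fields are needed)."""
    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


def pages_to_docs(text: Union[str, List[str]], filename: str) -> List[Document]:
    """Create one Document per page, with page number, source label, and file name as metadata."""
    # Accept a single string by treating it as a one-page document.
    if isinstance(text, str):
        text = [text]

    docs = []
    for page_number, page in enumerate(text, start=1):
        metadata = {
            "page": page_number,
            "source": f"page {page_number}",  # assumed label format, no section suffix
            "filename": filename,
        }
        docs.append(Document(page_content=page, metadata=metadata))
    return docs


docs = pages_to_docs(["first page text", "second page text"], "qantas_points_tncs")
for d in docs:
    print(d.metadata)
```

Dropping the splitter keeps one Document per page, which is enough as long as no page exceeds what the downstream model or vector store can handle in one piece.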