# Document Loading

## Retrieval augmented generation
 
In retrieval augmented generation (RAG), an LLM retrieves contextual documents from an external dataset as part of its execution. 

This is useful if we want to ask questions about specific documents (e.g., our PDFs, a set of videos, etc.).

![overview](img/overview.png)

In [1]:
#! pip install langchain

In [2]:
# import os
# import openai
# import sys

# from dotenv import load_dotenv, find_dotenv
# _ = load_dotenv(find_dotenv()) # read local .env file

# openai.api_key  = os.environ['OPENAI_API_KEY']

## PDFs

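Let's load a PDF. The cell below loads a locally stored academic paper, `docs/a_typology_of_decision_making_tasks.pdf`. The same loader works for other PDFs, such as this [transcript](https://see.stanford.edu/materials/aimlcs229/transcripts/MachineLearning-Lecture01.pdf) from Andrew Ng's famous CS229 course. Those transcripts are the result of automated transcription, so words and sentences are sometimes split unexpectedly.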

In [3]:
#! pip install pypdf 

In [4]:
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("docs/a_typology_of_decision_making_tasks.pdf")
pages = loader.load()

Each page is a `Document`.

A `Document` contains text (`page_content`) and `metadata`.

In [5]:
len(pages)

11

In [6]:
page = pages[1]

In [7]:
print(page.page_content[0:500])

To inform our typology, we gathered design goals from a thorough
literature review of decision-support tools within the visualization com-
munity. We built upon previously curated corpora [48, 57] , focusing
our search on papers with domain expert evaluations. We then fol-
lowed an inductive coding process to identify each paper’s primary
decision-making goals and processes. We then took these codes and
distilled the design goals for our decision-making typology, including
iterativeness, composa


In [8]:
page.metadata

{'source': 'docs/a_typology_of_decision_making_tasks.pdf', 'page': 1}
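
As a rough mental model, a `Document` can be pictured as a small record with two fields. The sketch below uses a hypothetical `SimpleDocument` dataclass (not LangChain's own class) and a placeholder string for the page text, only to show the shape of the data.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded page: text plus descriptive metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# Placeholder text; the metadata mirrors the structure printed above.
page = SimpleDocument(
    page_content="To inform our typology, we gathered design goals ...",
    metadata={"source": "docs/a_typology_of_decision_making_tasks.pdf", "page": 1},
)

print(page.page_content[:23])  # a slice of the text, as in the cell above
print(page.metadata["page"])   # page index stored alongside the text
```

Keeping the source and page number next to the text is what later lets a retrieved chunk be traced back to where it came from.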

## URLs

In [9]:
from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://github.com/basecamp/handbook/blob/master/37signals-is-you.md")

In [10]:
docs = loader.load()

In [11]:
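print(docs[0].page_content[:500])

File not found · GitHub

Skip to content

Navigation Menu

Toggle navigation

Sign in

Product

Actions
Automate any workflow

Packages
Host and manage packages

Security
Find and fix vulnerabilities

Cod

Most of the first 500 characters are whitespace (omitted here). The visible text is GitHub's "File not found" page: the URL no longer resolves to the handbook file, so the loader returned the site's navigation text rather than the document itself.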


# Document Splitting
If the documents are large, how do we split them into smaller, more manageable chunks?

![splitting](img/splitting.png)




## Character Text Splitters
This section compares two splitters: `CharacterTextSplitter` and `RecursiveCharacterTextSplitter`.

In [12]:
chunk_size = 26
chunk_overlap = 4

![chunks](img/chunks.png)

In [13]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

# Recursive Character Text Splitter
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

# Character Text Splitter
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

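Why doesn't this split the string below? The string is 26 characters long, which fits within `chunk_size = 26`, so it comes back as a single chunk.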

In [14]:
text1 = 'abcdefghijklmnopqrstuvwxyz'
r_splitter.split_text(text1)


['abcdefghijklmnopqrstuvwxyz']

In [15]:
text2 = 'abcdefghijklmnopqrstuvwxyzabcdefg'
r_splitter.split_text(text2)

['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']

Ok, this splits the string. The second chunk starts with `wxyz`, the last 4 characters of the first chunk, which matches `chunk_overlap = 4`.
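
To see where the overlap comes from, here is a minimal sliding-window sketch in plain Python. The function name `sliding_window_chunks` is hypothetical and this is not LangChain's implementation; it only reproduces the behaviour for a string with no separators, where each new chunk starts `chunk_size - chunk_overlap` characters after the previous one.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
        start += step
    return chunks


text1 = "abcdefghijklmnopqrstuvwxyz"
text2 = "abcdefghijklmnopqrstuvwxyzabcdefg"

print(sliding_window_chunks(text1, 26, 4))
# ['abcdefghijklmnopqrstuvwxyz']
print(sliding_window_chunks(text2, 26, 4))
# ['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']
```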

In [16]:
text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"
r_splitter.split_text(text3)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']

In [17]:
c_splitter.split_text(text3)


['a b c d e f g h i j k l m n o p q r s t u v w x y z']

In [18]:
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separator=' '  # <-- space as the separator instead of the default "\n\n"
)
c_splitter.split_text(text3)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']
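
The result above can be approximated by splitting on the separator and then greedily merging the pieces back together. The sketch below is a simplified illustration under that assumption, with a hypothetical function name `merge_with_overlap`; it is not LangChain's code, although it gives the same chunks for this input.

```python
def merge_with_overlap(pieces, separator, chunk_size, chunk_overlap):
    """Greedily join pieces into chunks of at most chunk_size characters,
    carrying trailing pieces (at most chunk_overlap characters) into the next chunk."""
    chunks = []
    current = []
    for piece in pieces:
        if current and len(separator.join(current + [piece])) > chunk_size:
            chunks.append(separator.join(current))
            # Drop leading pieces until what remains fits in the overlap
            # budget and still leaves room for the incoming piece.
            while current and (
                len(separator.join(current)) > chunk_overlap
                or len(separator.join(current + [piece])) > chunk_size
            ):
                current = current[1:]
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"
print(merge_with_overlap(text3.split(" "), " ", chunk_size=26, chunk_overlap=4))
# ['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']
```

Because the overlap is built from whole pieces, the shared text here is `l m` (3 characters): keeping one more piece would give `k l m` (5 characters), which exceeds `chunk_overlap = 4`.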

Try your own examples!

`RecursiveCharacterTextSplitter` is recommended for generic text. 

In [19]:
some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n  \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""

In [20]:
len(some_text)

496

In [21]:
c_splitter = CharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separator = ' ' # <-- space as a separator
)
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0, 
    separators=["\n\n", "\n", " ", ""] # <-- default separators, but we can add more
)

`separators=["\n\n", "\n", " ", ""]` is an ordered list of fallbacks: the splitter first tries to split the text on double newlines; if a resulting chunk is still too large, it splits that chunk on single newlines, then on spaces, and finally character by character.
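
The fallback order can be sketched in a few lines of plain Python. This is a simplified illustration with a hypothetical function `recursive_split`: it ignores overlap and whitespace stripping, so its output will not match LangChain's exactly, but it shows how a coarse separator is tried first and finer ones are used only for pieces that are still too long.

```python
def recursive_split(text, separators, chunk_size):
    """Split on the first separator found in text; recurse with the
    remaining separators on any piece that is still longer than chunk_size."""
    if len(text) <= chunk_size:
        return [text] if text else []
    for index, sep in enumerate(separators):
        if sep == "" or sep in text:
            break
    else:
        return [text]  # no usable separator: give up and keep the text whole
    remaining = separators[index + 1:]
    pieces = list(text) if sep == "" else text.split(sep)

    chunks, current = [], ""
    for piece in pieces:
        if not piece:
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces together
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            chunks.extend(recursive_split(piece, remaining, chunk_size))
    if current:
        chunks.append(current)
    return chunks


separators = ["\n\n", "\n", " ", ""]
print(recursive_split("aaa bbb\n\nccc ddd eee", separators, chunk_size=7))
# ['aaa bbb', 'ccc ddd', 'eee']
print(recursive_split("abcdefghij", separators, chunk_size=4))
# ['abcd', 'efgh', 'ij']
```

In the first call the paragraph break is used first, and only the second paragraph, which is still too long, is split again on spaces.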

In [22]:
c_splitter.split_text(some_text) # weird separation in the middle of the sentence

['When writing documents, writers will use document structure to group content. This can convey to the reader, which idea\'s are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also,',
 'have a space.and words are separated by space.']

In [25]:
r_splitter.split_text(some_text) # better separation, first tries to split on double newlines. The result is two paragraphs.

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.",
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also, have a space.and words are separated by space.']

Let's reduce the chunk size a bit and add a period to our separators:

In [27]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""]  # raw string avoids an invalid-escape warning
)
r_splitter.split_text(some_text)


["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example,",
 'closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this',
 'string. Sentences have a period at the end, but also, have a space.and words are separated by space.']
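
The separator `(?<=\. )` is written as a regular-expression lookbehind, intended to split right after a period followed by a space. The standalone sketch below uses Python's `re` module (not LangChain, and with an example string chosen for illustration) to show what that pattern does when it is applied as a regex. In the output above, the chunks break at spaces (for example after "For example,") rather than at sentence ends, which suggests the pattern was not applied as a regular expression in that run. Some LangChain releases expose an `is_separator_regex` option on the splitter for this; treat that as something to verify against the installed version.

```python
import re

text = "Similar ideas are in paragraphs. Paragraphs form a document. Sentences end here."

# (?<=\. ) is a zero-width lookbehind: it matches the empty position right
# after a period followed by a space, so each period stays with its sentence.
# Splitting on an empty match requires Python 3.7 or later.
sentences = re.split(r"(?<=\. )", text)

for sentence in sentences:
    print(repr(sentence))
# 'Similar ideas are in paragraphs. '
# 'Paragraphs form a document. '
# 'Sentences end here.'
```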

Let's apply this splitting to a more realistic example, such as an academic paper.

In [29]:
# We already loaded the document in a cell above.
# loader = PyPDFLoader("docs/a_typology_of_decision_making_tasks.pdf")
# pages = loader.load()

# We define a larger chunk size and overlap for the splitter.
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=150,
    length_function=len
)

In [30]:
# Each document is a split of the article.
docs = text_splitter.split_documents(pages)
len(docs)

97

In [31]:
len(pages)

11

So far we have split by characters, but there are other ways to split documents.

## Token Splitters
We can also split on token count explicitly, if we want.

This can be useful because LLM context windows are usually specified in tokens.

A token is often about 4 characters.
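
As a back-of-the-envelope sketch, the 4-characters-per-token rule of thumb can be turned into a quick size check. The helper names `estimate_tokens` and `fits_in_context` and the window size used here are assumptions for illustration; a real tokenizer should be used when an exact count matters.

```python
import math

CHARS_PER_TOKEN = 4  # rough rule of thumb from the text, not an exact value


def estimate_tokens(text):
    """Approximate the token count of text from its character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_in_context(chunks, context_window_tokens):
    """Check whether the chunks together stay within a token budget."""
    return sum(estimate_tokens(chunk) for chunk in chunks) <= context_window_tokens


print(estimate_tokens("x" * 1000))                # 250: a 1000-character chunk
print(estimate_tokens("foo bar bazzyfoo"))        # 4: 16 characters / 4
print(fits_in_context(["x" * 1000] * 10, 4000))   # True: about 2500 tokens
```

The estimate is only approximate: for `"foo bar bazzyfoo"` it gives 4, while the tokenizer in the next cell produces 6 tokens, because unusual words are broken into several pieces.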

In [34]:
from langchain.text_splitter import TokenTextSplitter
text_splitter = TokenTextSplitter(chunk_size=1, chunk_overlap=0)
text1 = "foo bar bazzyfoo"
text_splitter.split_text(text1)

['foo', ' bar', ' b', 'az', 'zy', 'foo']

In [35]:
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)
docs = text_splitter.split_documents(pages)
docs[0]

Document(page_content='A Typology of Decision-Making Tasks for', metadata={'source': 'docs/a_typology_of_decision_making_tasks.pdf', 'page': 0})

The metadata of each chunk is the same as the metadata of the page it came from: the splitter carries the original metadata through to every chunk.

In [36]:
pages[0].metadata

{'source': 'docs/a_typology_of_decision_making_tasks.pdf', 'page': 0}
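
The way metadata follows each chunk can be sketched with plain dictionaries. The function `split_with_metadata` and the fixed-width slicing below are hypothetical simplifications, not LangChain's `split_documents`; the point is only that every chunk receives a copy of its parent's metadata.

```python
def split_with_metadata(documents, chunk_size):
    """Cut each document's text into fixed-size pieces and attach a copy
    of the parent document's metadata to every piece."""
    chunks = []
    for doc in documents:
        text = doc["page_content"]
        for start in range(0, len(text), chunk_size):
            chunks.append({
                "page_content": text[start:start + chunk_size],
                "metadata": dict(doc["metadata"]),  # copy, so chunks stay independent
            })
    return chunks


# Placeholder page; the metadata has the same shape as the output above.
example_pages = [
    {"page_content": "abcdefghij", "metadata": {"source": "example.pdf", "page": 0}},
]

for chunk in split_with_metadata(example_pages, chunk_size=4):
    print(chunk)
# {'page_content': 'abcd', 'metadata': {'source': 'example.pdf', 'page': 0}}
# {'page_content': 'efgh', 'metadata': {'source': 'example.pdf', 'page': 0}}
# {'page_content': 'ij', 'metadata': {'source': 'example.pdf', 'page': 0}}
```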

# Vector Stores and Embeddings
Recall the overall workflow for retrieval augmented generation (RAG):

![vectorstores](img/vectorstores.png)