# Document Loading

## Retrieval augmented generation
 
In retrieval augmented generation (RAG), an LLM retrieves contextual documents from an external dataset as part of its execution. 

This is useful if we want to ask questions about specific documents (e.g., our PDFs, a set of videos, etc.).

<img width="1047" alt="Screenshot 2024-04-15 at 12 24 00 PM" src="https://github.com/harrywang/langchain-short-course/assets/595772/0c3cdde0-831d-4e11-91f2-670abc580d89">

In [1]:
import os
import openai

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key  = os.environ['OPENAI_API_KEY']

# Load PDF
Each page is a `Document`.

A `Document` contains text (`page_content`) and `metadata`.
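
To make the shape of a `Document` concrete, here is a minimal stand-in written as a plain dataclass. `SimpleDocument` is a hypothetical name used only for illustration. It is not LangChain's class, although it mirrors the two fields described above, and the sample values are made up.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for one loaded page: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# One object per page, tagged with where it came from (illustrative values).
page = SimpleDocument(
    page_content="OFFICIAL RULES",
    metadata={"source": "data/nba-rules-2023.pdf", "page": 0},
)

print(page.page_content)
print(page.metadata["page"])
```

Keeping the source and page number next to the text is what later lets a retriever report where an answer came from.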

In [2]:
# load pdf
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("data/nba-rules-2023.pdf")
pages = loader.load()

In [3]:
len(pages)

74

In [4]:
# content people don't see
page = pages[0]
page.page_content


'This Page Intentionally Left Blank  \nIt is here to hold a place for cover for screen version.  \nDO NOT INCLUDE AS PART OF PRINT FILE!\nOFFICIAL\nRULES'

In [5]:
page = pages[1]
page.page_content

'- 2 -RULES INDEX\n RULE SECTION ARTICLE PAGE\nBACKCOURT/FRONTCOURT \n\t Definitions  .......................................  4 VI a & b 19\n Eight (8) Second Violation  ..............  4 VI f 19\n Player Position Status  ......................  4 VI c 19\n Ball Position Status  ..........................  4 VI d, e, & g 19\nBALL\n Dead Ball  .........................................  6 IV a 26\n Jump—Center Circle  .......................  6 V  26\n Jump—Free Throw Circle  ...............  6 VI  27\n Live Ball  ..........................................  6 II  26\n Putting in Play  .................................  6 I  25\n Restrictions  ......................................  6 VII  27\n Starting of Games & Overtime(s)  .... 6 I a 25\n Starting of 2nd, 3rd, & 4th Periods  .. 6 I b 25\nBASKET RING, BACKBOARD, SUPPORT\n\t Definition  .........................................  4 I  17\n Hanging (Intentional) .......................  12A  IV  40\n Hanging (Prevent Injury)  ...........

In [6]:
page.metadata

{'source': 'data/nba-rules-2023.pdf', 'page': 1}

In [12]:
from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://harrywang.me/stack")
docs = loader.load()
print(docs[0].page_content[:500])













The Tech Stack that Works for My (AI) Startups | harrywang.me



















harrywang.me






























Open menu










                    Home
                



                        About
                    


                        Blog
                    





February 26, 2024
The Tech Stack that Works for My (AI) Startups



startup


tech


entrepreneurship



Last updated in Feb. 2024
In this post, I list the tech stack and related services tha


# Document Splitter


In [14]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

In [15]:
chunk_size = 26
chunk_overlap = 3

In [16]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

In [17]:
text1 = 'abcdefghijklmnopqrstuvwxyz'
r_splitter.split_text(text1)

['abcdefghijklmnopqrstuvwxyz']

In [18]:
text2 = 'abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz'
r_splitter.split_text(text2)

['abcdefghijklmnopqrstuvwxyz', 'xyzabcdefghijklmnopqrstuvw', 'uvwxyz']
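
When the text has no separators at all, the output above is consistent with a fixed-size sliding window that advances by `chunk_size - chunk_overlap` characters. The sketch below is a simplified illustration of that idea, not LangChain's implementation. `sliding_window_chunks` is a hypothetical helper name.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size that share chunk_overlap characters."""
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


alphabet = "abcdefghijklmnopqrstuvwxyz"
print(sliding_window_chunks(alphabet, 26, 3))
print(sliding_window_chunks(alphabet * 2, 26, 3))
```

Each chunk starts with the last three characters of the previous one, which is what the overlap setting asks for.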

In [36]:
text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"
r_splitter.split_text(text3)

['a b c d e f g h i j k l m', 'm n o p q r s t u v w x y', 'y z']

In [37]:
c_splitter.split_text(text3)

['a b c d e f g h i j k l m n o p q r s t u v w x y z']

In [40]:
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separator=' '
)
c_splitter.split_text(text3)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']

In [39]:
some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n  \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""

In [42]:
len(some_text)

496

In [41]:
c_splitter = CharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separator=' '
)
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0, 
    separators=["\n\n", "\n", " ", ""]
)

In [43]:
c_splitter.split_text(some_text)

['When writing documents, writers will use document structure to group content. This can convey to the reader, which idea\'s are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also,',
 'have a space.and words are separated by space.']

In [44]:
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.",
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also, have a space.and words are separated by space.']
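
The recursive splitter tries the coarsest separator first and falls back to finer ones only for pieces that are still too long. The sketch below illustrates that strategy under simplifying assumptions: it ignores overlap and drops the separator at each split point. `recursive_split` is a hypothetical helper, not the library's code.

```python
def recursive_split(text, separators, chunk_size):
    """Split on the coarsest separator that keeps chunks within chunk_size."""
    if len(text) <= chunk_size:
        return [text]
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, finer, chunk_size)

    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
        if len(piece) > chunk_size:
            # Piece is still too big: retry it with finer separators.
            chunks.extend(recursive_split(piece, finer, chunk_size))
            current = ""
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "aaa bbb\n\nccc ddd eee"
print(recursive_split(sample, ["\n\n", " ", ""], 8))
```

The first paragraph fits in one chunk and stays whole. Only the second, longer paragraph is broken further on spaces.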

In [45]:
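# Let's reduce the chunk size a bit and add a period to our separators.
# Note: in recent LangChain versions separators are matched literally unless
# is_separator_regex=True is passed, which would explain why the output below
# still breaks mid-sentence.
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", r"\. ", " ", ""]
)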
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example,",
 'closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this',
 'string. Sentences have a period at the end, but also, have a space.and words are separated by space.']

In [46]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""]
)
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example,",
 'closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this',
 'string. Sentences have a period at the end, but also, have a space.and words are separated by space.']

In [47]:
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("data/nba-rules-2023.pdf")
pages = loader.load()

In [48]:
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=150,
    length_function=len
)
docs = text_splitter.split_documents(pages)

In [49]:
len(docs), len(pages)

(276, 74)

## Token splitting

We can also split on token count explicitly, if we want.

This can be useful because LLMs often have context windows designated in tokens.

Tokens are often ~4 characters.
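
As a quick illustration of the 4-characters-per-token rule of thumb, the sketch below estimates a token count from string length. `estimate_tokens` is a hypothetical helper. The ratio is only an approximation for English text, and a real tokenizer can give noticeably different counts on short or unusual strings.

```python
import math


def estimate_tokens(text, chars_per_token=4.0):
    """Rough token estimate from character count (rule of thumb, not a tokenizer)."""
    return math.ceil(len(text) / chars_per_token)


sample_text = "foo bar bazzyfoo"
print(len(sample_text), estimate_tokens(sample_text))
```

An estimate like this is handy for budgeting a context window, but exact limits should be checked with the model's own tokenizer.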


In [51]:
from langchain.text_splitter import TokenTextSplitter
text_splitter = TokenTextSplitter(chunk_size=1, chunk_overlap=0)

text1 = "foo bar bazzyfoo"
text_splitter.split_text(text1)

['foo', ' bar', ' b', 'az', 'zy', 'foo']

In [54]:
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)
docs = text_splitter.split_documents(pages)
print(docs[0], pages[0].metadata)

page_content='This Page Intentionally Left Blank  \n' metadata={'source': 'data/nba-rules-2023.pdf', 'page': 0} {'source': 'data/nba-rules-2023.pdf', 'page': 0}



In [76]:
# embeddings
# https://platform.openai.com/docs/guides/embeddings/what-are-embeddings
# By default, the embedding vector length is 1536 for text-embedding-3-small
# and 3072 for text-embedding-3-large.
# text-embedding-3-small: $0.02 / 1M tokens
# text-embedding-3-large: $0.13 / 1M tokens

from langchain_openai import OpenAIEmbeddings
embedding = OpenAIEmbeddings(model="text-embedding-3-small")

In [80]:
sentence1 = "i like dogs"
sentence2 = "i like cats"
sentence3 = "the weather is sweet outside"

In [81]:
embedding1 = embedding.embed_query(sentence1)
embedding2 = embedding.embed_query(sentence2)
embedding3 = embedding.embed_query(sentence3)

In [82]:
import numpy as np
np.dot(embedding1, embedding2), np.dot(embedding1, embedding3), np.dot(embedding2, embedding3)

(0.7222818700424505, 0.2677958372908107, 0.2573165872157418)
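
The dot product works as a similarity score here on the assumption that the embedding vectors have unit length, in which case it equals cosine similarity. For vectors of arbitrary length, the normalization has to be done explicitly. The sketch below shows this with made-up 3-dimensional vectors. They are illustrative stand-ins, not real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors: the first two point in similar directions, the third does not.
dogs = [0.9, 0.1, 0.0]
cats = [0.8, 0.2, 0.0]
weather = [0.0, 0.1, 0.9]

print(cosine_similarity(dogs, cats))
print(cosine_similarity(dogs, weather))
```

As with the real embeddings above, the two related items score much higher against each other than either does against the unrelated one.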

In [93]:
!rm -rf ./data/chroma  # remove old database files if any

In [98]:
from langchain.vectorstores import Chroma
persist_directory = './data/chroma/'

In [95]:
from langchain.document_loaders import PyPDFLoader

# Load PDF
loaders = [
    PyPDFLoader("data/getting-real/getting-real-01-introduction.pdf"),
    PyPDFLoader("data/getting-real/getting-real-02-starting-line.pdf"),
    PyPDFLoader("data/getting-real/getting-real-03-stay-lean.pdf"),  # Duplicate documents on purpose
    PyPDFLoader("data/getting-real/getting-real-03-stay-lean.pdf"),  # Duplicate documents on purpose
    PyPDFLoader("data/getting-real/getting-real-04-priorities.pdf"), 
    PyPDFLoader("data/getting-real/getting-real-05-feature-selection.pdf"), 
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

# Split
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 150
)

splits = text_splitter.split_documents(docs)

len(splits)

386

In [99]:
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory
)

In [100]:
print(vectordb._collection.count())

386


In [101]:
question = "what's a foul in basketball?"
docs = vectordb.similarity_search(question, k=3)
len(docs)

3

We're getting duplicate chunks (because of the duplicate PDFs in the index).

Semantic search returns the most similar chunks, but does not enforce diversity.

`docs[0]` and `docs[1]` are identical.

In [103]:
docs[0].page_content

'- 18 -(2) Puts a hand under the ball and brings it to a pause;\n(3) Otherwise gains enough control of the ball to hold it, change hands, pass, shoot, or \nthe player cradles the ball against his body.\nSection IV—Fouls\na. A common personal foul is illegal physical contact which occurs with an opponent  \nafter the ball has become live and before the horn sounds to end the period. If time expires before \nthe personal foul occurs, the personal foul should be disregarded, unless it was unsportsmanlike.\nEXCEPTION: If the foul is committed on or by a player in the act of shooting, and the  \nshooter released the ball prior to the expiration of time on the game clock, then the foul  \nshould be administered in the same manner as with any similar play during the course of the  \ngame (See Rule 13, Section II(b)(ii)).\nb. A technical foul is the penalty for unsportsmanlike conduct or violations by team  \nmembers\ton\tthe\tfloor\tor\tseated\ton\tthe\tbench.\nc. A double foul is a situation

In [104]:
docs[1].page_content

'- 18 -(2) Puts a hand under the ball and brings it to a pause;\n(3) Otherwise gains enough control of the ball to hold it, change hands, pass, shoot, or \nthe player cradles the ball against his body.\nSection IV—Fouls\na. A common personal foul is illegal physical contact which occurs with an opponent  \nafter the ball has become live and before the horn sounds to end the period. If time expires before \nthe personal foul occurs, the personal foul should be disregarded, unless it was unsportsmanlike.\nEXCEPTION: If the foul is committed on or by a player in the act of shooting, and the  \nshooter released the ball prior to the expiration of time on the game clock, then the foul  \nshould be administered in the same manner as with any similar play during the course of the  \ngame (See Rule 13, Section II(b)(ii)).\nb. A technical foul is the penalty for unsportsmanlike conduct or violations by team  \nmembers\ton\tthe\tfloor\tor\tseated\ton\tthe\tbench.\nc. A double foul is a situation

In [106]:
docs[2].page_content

'If a previous unsportsmanlike act has been committed and if this situation is BLA T ANT , \na technical foul must be assessed and the guilty player(s) must be ejected.\nQ. FLAGRANT FOUL CRITERIA\n1. The severity of the contact;\n2. Whether or not the player was making a legitimate basketball play (e.g., whether  \na\tplayer\tis\tmaking\ta\tlegitimate\t effort\tto\tblock\ta\tshot;\tnote,\thowever,\t that\ta\tfoul \ncommitted\t during\ta\tblock\tattempt\tcan\tstill\tbe\tconsidered\t flagrant\tif\tother\tcriteria  are \npresent, such as recklessness and hard contact to the head);\n3. Whether, on a foul committed with a player’s arm or hand, the fouling player  \nwound up and/or followed through after making contact;\n4. The potential for injury resulting from contact (e.g., a blow to the head and a  \nfoul committed while a player is in a vulnerable position);\n5. The\tseverity\tof\tany\tinjury\tsuffered\tby\tthe\toffended\tplayer;\tand\n6. The outcome of the contact (e.g., whether it le

In [70]:
# Let's save this so we can use it later!
vectordb.persist()

We can see a new failure mode.

The question below asks about the third lecture, but the results include chunks from other lectures as well.

In [107]:
question = "what did they say about regression in the third lecture?"
docs = vectordb.similarity_search(question, k=5)
for doc in docs:
    print(doc.metadata)


{'page': 35, 'source': 'data/nba-rules-2023.pdf'}
{'page': 35, 'source': 'data/nba-rules-2023.pdf'}
{'page': 42, 'source': 'data/nba-rules-2023.pdf'}
{'page': 42, 'source': 'data/nba-rules-2023.pdf'}
{'page': 56, 'source': 'data/nba-rules-2023.pdf'}


In [108]:
print(docs[4].page_content)

- 57 -(1) If goaltending or basket interference was ruled prior to the ball touching the rim,  
whether the shooter committed a boundary line violation. For purposes of this  review, 
the	Replay	Center	Official	will	look	only	at	the	position	of	the	shooter’s	 feet	at	the	
moment	they	last	touched	the	floor	immediately	 prior	to	(or,	if	applicable,	 during)	the	
release	of	the	shot	and	the	flight	of	the	ball.
(2) If goaltending or basket interference was ruled prior to the ball touching the rim,  
whether an 8-second backcourt violation occurred before the ball left the  shooter’s hand.
(3) Whether any unsportsmanlike acts or unnecessary contact occurred.
m. If an instant replay review is triggered as described in Section I(a)(13) above, the  
Replay	Center	Official	and	the	official	who	called	the	foul	will	review	the	video	to	determine:
(1) Whether the shot clock expired before the foul occurred.
(2) Whether an 8-second backcourt violation occurred before the player was fouled.
(3) Whe

In [1]:
# Chroma in-memory
from langchain.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings


embedding = OpenAIEmbeddings(model="text-embedding-3-small")

persist_directory = './data/chroma/'

vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)

texts = [
    """Michael Porter's five-force strategic analysis model, introduced in a 1979 article published in the Harvard Business Review, remains a fundamental tool for strategic analysts plotting the competitive landscape of an industry""",
    """Michael Porter's five-force strategic analysis model, introduced in a 1979 article published in the Harvard Business Review, remains a fundamental tool for strategic analysts plotting the competitive landscape of an industry""",
    """Five Forces include: Competitive Rivalry, Supplier Power, Buyer Power, Threat of Substitution, and Threat of New Entry.""",
    """The Four C Model of Creativity categorizes creative expression into four levels: mini-c (personal creative learning), little-c (creativity appreciated by others), Pro-c (professional-level creativity), and Big-C (historically significant creativity). This model emphasizes creativity as a lifelong process, important at all stages of personal and professional development.""",
]

smalldb = Chroma.from_texts(texts, embedding=embedding)

print(smalldb._collection.count())  # one entry per string in `texts`


4


In [3]:
question = "Tell me about porter's five forces model"

smalldb.similarity_search(question, k=3)

[Document(page_content="Michael Porter's five-force strategic analysis model, introduced in a 1979 article published in the Harvard Business Review, remains a fundamental tool for strategic analysts plotting the competitive landscape of an industry"),
 Document(page_content="Michael Porter's five-force strategic analysis model, introduced in a 1979 article published in the Harvard Business Review, remains a fundamental tool for strategic analysts plotting the competitive landscape of an industry"),
 Document(page_content='Five Forces include: Competitive Rivalry, Supplier Power, Buyer Power, Threat of Substitution, and Threat of New Entry.')]

### Addressing Diversity: Maximum marginal relevance

In [8]:
# duplicate is gone

smalldb.max_marginal_relevance_search(question, k=3, fetch_k=4)

[Document(page_content="Michael Porter's five-force strategic analysis model, introduced in a 1979 article published in the Harvard Business Review, remains a fundamental tool for strategic analysts plotting the competitive landscape of an industry"),
 Document(page_content='Five Forces include: Competitive Rivalry, Supplier Power, Buyer Power, Threat of Substitution, and Threat of New Entry.'),
 Document(page_content='The Four C Model of Creativity categorizes creative expression into four levels: mini-c (personal creative learning), little-c (creativity appreciated by others), Pro-c (professional-level creativity), and Big-C (historically significant creativity). This model emphasizes creativity as a lifelong process, important at all stages of personal and professional development.')]
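
The idea behind maximum marginal relevance can be sketched in a few lines: pick results one at a time, rewarding similarity to the query and penalizing similarity to results already chosen. The code below is a simplified illustration on made-up 2-dimensional vectors, not the vector store's implementation. `mmr_select`, `lambda_mult`, and the vectors are assumptions for the example.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query, docs, k, lambda_mult=0.5):
    """Greedily pick k indices, trading query relevance against redundancy."""
    selected = []
    candidates = list(range(len(docs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query, docs[i])
            # Highest similarity to anything already picked (0 if nothing yet).
            redundancy = max((cosine(docs[i], docs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
docs = [[0.9, 0.1], [0.9, 0.1], [0.9, -0.3]]  # index 1 duplicates index 0

# Plain similarity ranking keeps both copies of the duplicate.
top_k = sorted(range(len(docs)), key=lambda i: -cosine(query, docs[i]))[:2]
print(top_k)
print(mmr_select(query, docs, k=2))
```

With these toy vectors, plain ranking returns the two identical entries, while the MMR selection replaces the second copy with the slightly less relevant but different vector.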

In [9]:
question = "Tell me about creativity models"
smalldb.similarity_search(question, k=2)

[Document(page_content='The Four C Model of Creativity categorizes creative expression into four levels: mini-c (personal creative learning), little-c (creativity appreciated by others), Pro-c (professional-level creativity), and Big-C (historically significant creativity). This model emphasizes creativity as a lifelong process, important at all stages of personal and professional development.'),
 Document(page_content="Michael Porter's five-force strategic analysis model, introduced in a 1979 article published in the Harvard Business Review, remains a fundamental tool for strategic analysts plotting the competitive landscape of an industry")]

In [10]:
# returns a distance score (Chroma's default metric is squared L2): the lower, the more similar
smalldb.similarity_search_with_score(question, k=2)

[(Document(page_content='The Four C Model of Creativity categorizes creative expression into four levels: mini-c (personal creative learning), little-c (creativity appreciated by others), Pro-c (professional-level creativity), and Big-C (historically significant creativity). This model emphasizes creativity as a lifelong process, important at all stages of personal and professional development.'),
  0.6911659240722656),
 (Document(page_content="Michael Porter's five-force strategic analysis model, introduced in a 1979 article published in the Harvard Business Review, remains a fundamental tool for strategic analysts plotting the competitive landscape of an industry"),
  1.3749091625213623)]

### Addressing Specificity: working with metadata

In the last lecture, we showed that a question about the third lecture can return results from other lectures as well.

To address this, many vectorstores support operations on `metadata`.

`metadata` provides context for each embedded chunk.
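
A minimal sketch of what a metadata filter does: keep only the chunks whose metadata matches the requested values, then search within that subset. In practice the vector store applies the filter as part of the search call. The plain-Python version below only illustrates the effect. `filter_by_metadata` and the file names are hypothetical.

```python
def filter_by_metadata(chunks, **conditions):
    """Keep chunks whose metadata matches every key/value in conditions."""
    return [
        chunk for chunk in chunks
        if all(chunk["metadata"].get(key) == value for key, value in conditions.items())
    ]


# Illustrative chunks; sources and pages are made up for the example.
chunks = [
    {"page_content": "intro text", "metadata": {"source": "lecture01.pdf", "page": 0}},
    {"page_content": "regression text", "metadata": {"source": "lecture03.pdf", "page": 4}},
    {"page_content": "more regression", "metadata": {"source": "lecture02.pdf", "page": 1}},
]

third_lecture = filter_by_metadata(chunks, source="lecture03.pdf")
print([chunk["page_content"] for chunk in third_lecture])
```

Restricting the candidates this way means a question about the third lecture can only return chunks that come from the third lecture, however similar the other chunks are.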