# Document Loading

## Note to students
During periods of high load, the notebook may become unresponsive. A cell may appear to run, and the execution count in brackets [#] to its left may update, even though the cell has not actually executed. This is most obvious with print statements that produce no output. If this happens, restart the kernel using the command under the Kernel tab.

## Retrieval augmented generation

In retrieval augmented generation (RAG), an LLM retrieves contextual documents from an external dataset as part of its execution.

This is useful if we want to ask questions about specific documents (e.g., our PDFs, a set of videos, etc.).
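
To make the retrieve-then-ask idea concrete, here is a minimal sketch that uses only the standard library. It is not how LangChain or any vector store ranks documents: the word-overlap scoring and the names `tokenize`, `retrieve`, and `build_prompt` are assumptions chosen for illustration.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, documents: list[str], k: int = 1) -> list[str]:
    """Return the k documents that share the most words with the question."""
    q = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context: list[str]) -> str:
    """Place the retrieved context ahead of the question."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"


documents = [
    "PyPDFLoader returns one Document per PDF page.",
    "Whisper transcribes audio files into text.",
    "WebBaseLoader fetches a URL and extracts the page text.",
]
question = "How are PDF pages loaded?"
context = retrieve(question, documents, k=1)
prompt = build_prompt(question, context)
print(prompt)
```

In a real pipeline the overlap score would typically be replaced by embedding similarity, and the prompt would be sent to an LLM.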

![overview.jpeg](attachment:overview.jpeg)

In [None]:
# ! pip install langchain openai

In [None]:
import os
import openai
import sys
sys.path.append('../..')
from google.colab import userdata

# from dotenv import load_dotenv, find_dotenv
# _ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key = userdata.get('OPENAI_API_KEY')

## PDFs

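Let's load a PDF. The course example is a [transcript](https://see.stanford.edu/materials/aimlcs229/transcripts/MachineLearning-Lecture01.pdf) from Andrew Ng's famous CS229 course; such documents are the result of automated transcription, so words and sentences are sometimes split unexpectedly. The cells below load `2406.11903v1.pdf` from Google Drive instead, a survey of large language models for financial applications.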

In [None]:
# The course will show the pip installs you would need to install packages on your own machine.
# These packages are already installed on this platform and should not be run again.
#! pip install pypdf

In [None]:
!pip install -U langchain-community
!pip install pypdf

Collecting langchain-community
  Downloading langchain_community-0.2.7-py3-none-any.whl (2.2 MB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl (28 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading marshmallow-3.21.3-py3-none-any.whl (49 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading typing_inspect-0.9.0-py3-none-any.whl (8.8 kB)
Collecting mypy-extensions>=0.3.0 (from typing-inspect<1,>=0.4.0->dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading mypy_extensions-1.0.0-py3-none-any.whl (4.7 kB)
Installing collected packages: mypy-extensio

In [None]:
# Loaders now live in langchain_community; langchain.document_loaders is deprecated.
from langchain_community.document_loaders import PyPDFLoader
from google.colab import drive

drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
%cd /content/drive/MyDrive/

/content/drive/MyDrive


In [None]:
loader = PyPDFLoader("2406.11903v1.pdf")
pages = loader.load()

Each page is a `Document`.

A `Document` contains text (`page_content`) and `metadata`.
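
As a rough mental model, a `Document` can be pictured as a small record holding text plus a metadata dictionary. The sketch below is a hypothetical stand-in (`SimpleDocument` is not the LangChain class, and `example.pdf` is a made-up file name) that shows how the two fields are typically used together.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loader's Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


pages_sketch = [
    SimpleDocument("First page text", {"source": "example.pdf", "page": 0}),
    SimpleDocument("Second page text", {"source": "example.pdf", "page": 1}),
]

# Index the pages by the zero-based page number stored in their metadata.
by_page = {doc.metadata["page"]: doc for doc in pages_sketch}
print(by_page[1].page_content)
```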

In [None]:
len(pages)

39

In [None]:
page = pages[0]

In [None]:
print(page.page_content[0:500])

1
A Survey of Large Language Models for
Financial Applications: Progress,
Prospects and Challenges
Yuqi Nie∗, Y axuan Kong∗, Xiaowen Dong, John M. Mulvey†‡, H. Vincent Poor,
Qingsong Wen, Stefan Zohren†
Abstract —Recent advances in large language models (LLMs) have unlocked novel opportunities for machine learning applications in
the financial domain. These models have demonstrated remarkable capabilities in understanding context, processing vast amounts of
data, and generating human-preferred c


In [None]:
page.metadata

{'source': '2406.11903v1.pdf', 'page': 0}

## YouTube

In [None]:
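# Loaders now live in langchain_community; langchain.document_loaders is deprecated.
from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers import OpenAIWhisperParser
from langchain_community.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader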

In [None]:
! pip install yt_dlp
! pip install pydub

Collecting yt_dlp
  Downloading yt_dlp-2024.7.16-py3-none-any.whl (3.1 MB)
Collecting brotli (from yt_dlp)
  Downloading Brotli-1.1.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.0 MB)
Collecting mutagen (from yt_dlp)
  Downloading mutagen-1.47.0-py3-none-any.whl (194 kB)
Collecting pycryptodomex (from yt_dlp)
  Downloading pycryptodomex-3.20.0-cp35-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.1 MB)
Collecting requests<3,>=2.32.2 (fr

Collecting pydub
  Downloading pydub-0.25.1-py2.py3-none-any.whl (32 kB)
ERROR: Operation cancelled by user

In [None]:
! pip install pydub

Collecting pydub
  Using cached pydub-0.25.1-py2.py3-none-any.whl (32 kB)
Installing collected packages: pydub
Successfully installed pydub-0.25.1


In [None]:
# !pip install ffmpeg

**Note**: This can take several minutes to complete.

In [None]:
url = "https://youtu.be/MoIguQfHUcY"
save_dir = "docs/youtube/"

# Download the video's audio track into save_dir, then transcribe it with OpenAI Whisper.
loader = GenericLoader(
    YoutubeAudioLoader([url], save_dir),
    OpenAIWhisperParser(api_key=userdata.get('OPENAI_API_KEY'))
)
docs = loader.load()

[youtube] Extracting URL: https://youtu.be/MoIguQfHUcY
[youtube] MoIguQfHUcY: Downloading webpage
[youtube] MoIguQfHUcY: Downloading ios player API JSON
[youtube] MoIguQfHUcY: Downloading m3u8 information
[info] MoIguQfHUcY: Downloading 1 format(s): 140
[download] docs/youtube//台積電高效能運算 HPC 業績佔比快速成長，AI 與 NVIDIA 或許是最大原因？！#ai #台積電 #HPC #人工智慧 #台積電法說會 #高效能運算 #晶圓代工 #llm #大型語言模型.m4a has already been downloaded
[download] 100% of    1.05MiB
[ExtractAudio] Not converting audio docs/youtube//台積電高效能運算 HPC 業績佔比快速成長，AI 與 NVIDIA 或許是最大原因？！#ai #台積電 #HPC #人工智慧 #台積電法說會 #高效能運算 #晶圓代工 #llm #大型語言模型.m4a; file is already in target format m4a
Transcribing part 1!


In [None]:
docs[0].page_content[0:500]

'今天下午台積電廠他們的法說會公布了2024年D2G的業績概況 其中在HPC這一項成長幅度驚人 然而什麼是HPC呢 HPC是High Performance Computing高效能運算的一個簡稱 那主要會運用在大型的這種伺服器大型的製造中心裡面 那目前大家可以猜想像是這種AI相關的製造運算 可以用在像是天氣模擬或者是各種AI人工智慧的一些相關的運用 像是我們去訓練大型語言模型去做所謂的pre-train 事先訓練利用大量的文本去訓練 以及像是這種越來越大型的語言模型的參數 都需要這種高效能運算的support 那大家可以猜想而知他絕對跟NVIDIA這間公司所生產的這種GPU有個絕對的關係 NVIDIA即將上市的Blackwell號稱他們是全世界目前最好的這種AI運算的中心 也就是說他可能跟TSMC也就是台積電現在的業績其實是有高度相關的 當然我們可以看到隨著TSMC台積電他們的業績 在HPC這一塊的相對應高速成長 我們大概可以知道對AI的運算這一塊還是有非常強烈的需求 接下來我們可以更期待這樣子的一個AI的運算的結果對我們產生什麼樣的影響'

## URLs

In [None]:
# Loaders now live in langchain_community; langchain.document_loaders is deprecated.
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://github.com/basecamp/handbook/blob/master/37signals-is-you.md")



In [None]:
docs = loader.load()

In [None]:
print(docs[0].page_content[:500])

File not found · GitHub

Skip to content

Navigation Menu

Toggle navigation

            Sign in

        Product

Actions
        Automate any workflow

Packages
        Host and manage packages

Security
        Find and fix vulnerabilities

Codes

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

In [None]:
chunk_size = 26
chunk_overlap = 4

In [None]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

In [None]:
text1 = 'abcdefghijklmnopqrstuvwxyz'

In [None]:
r_splitter.split_text(text1)

['abcdefghijklmnopqrstuvwxyz']

In [None]:
text2 = 'abcdefghijklmnopqrstuvwxyzabcdefg'

In [None]:
r_splitter.split_text(text2)

['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']
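
When the text contains no separators at all, the effect of `chunk_size` and `chunk_overlap` can be pictured as a sliding window that advances by `chunk_size - chunk_overlap` characters. The function below is a simplified sketch of that idea, not the library's actual algorithm; `sliding_chunks` is a hypothetical helper.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(sliding_chunks("abcdefghijklmnopqrstuvwxyz", 26, 4))
print(sliding_chunks("abcdefghijklmnopqrstuvwxyzabcdefg", 26, 4))
```

With a window of 26 and an overlap of 4, the second chunk starts at character 22, which is why it begins with `wxyz`.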

In [None]:
text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"
r_splitter.split_text(text3)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']

In [None]:
c_splitter.split_text(text3)

['a b c d e f g h i j k l m n o p q r s t u v w x y z']

In [None]:
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separator = ' '
)
c_splitter.split_text(text3)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']
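
The difference between the two `c_splitter` results comes down to the separator: the text is first split on the separator, and the pieces are then merged back into chunks of at most `chunk_size` characters, carrying over a tail of at most `chunk_overlap` characters. The sketch below imitates that behavior with the standard library only; `merge_words` is a hypothetical helper and skips edge cases the real splitter handles (for example, a single piece longer than `chunk_size`).

```python
def merge_words(text: str, chunk_size: int, chunk_overlap: int, separator: str = " ") -> list[str]:
    """Split on separator, then greedily merge the pieces into overlapping chunks."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []
    for piece in pieces:
        candidate = separator.join(current + [piece])
        if current and len(candidate) > chunk_size:
            chunks.append(separator.join(current))
            # Drop leading pieces until the carried-over tail fits in chunk_overlap.
            while current and len(separator.join(current)) > chunk_overlap:
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


text3_sketch = "a b c d e f g h i j k l m n o p q r s t u v w x y z"
print(merge_words(text3_sketch, 26, 4, separator=" "))
# With a separator that never occurs, there is nothing to split on,
# so the whole text comes back as one oversized chunk.
print(merge_words(text3_sketch, 26, 4, separator="\n\n"))
```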

In [None]:
%pip install -qU langchain-text-splitters

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [None]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

In [None]:
text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"
r_splitter.split_text(text3)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']

In [None]:
some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n  \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""

In [None]:
len(some_text)

496

In [None]:
c_splitter = CharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separator = ' '
)
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separators=["\n\n", "\n", " ", ""]
)

In [None]:
c_splitter.split_text(some_text)

['When writing documents, writers will use document structure to group content. This can convey to the reader, which idea\'s are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also,',
 'have a space.and words are separated by space.']

In [None]:
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.",
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also, have a space.and words are separated by space.']
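
The recursive splitter's result can be understood as "try the coarsest separator first, and fall back to finer ones only for pieces that are still too long". The following is a simplified standard-library sketch of that strategy under stated assumptions: `recursive_split` is a hypothetical function, it ignores overlap, and it strips whitespace around pieces, so its output will not always match the library's.

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Split on the coarsest separator first; recurse only into oversized pieces."""
    text = text.strip()
    if not text:
        return []
    if len(text) <= chunk_size:
        return [text]
    if not separators:
        # No separators left: fall back to a hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    head, rest = separators[0], separators[1:]
    if head not in text:
        return recursive_split(text, chunk_size, rest)
    # Split on the current separator and recurse into each part with finer separators.
    pieces = []
    for part in text.split(head):
        pieces.extend(recursive_split(part, chunk_size, rest))
    # Greedily merge neighbouring pieces back together while they still fit.
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + head + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph here.\n\nSecond paragraph is a bit longer than the first."
print(recursive_split(sample, 40, ["\n\n", "\n", " "]))
```

The first paragraph fits in one chunk and is kept whole; only the second paragraph, which exceeds 40 characters, is broken further on spaces.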

In [None]:
import getpass
import os

os.environ["OPENAI_API_KEY"] = getpass.getpass("input your API_KEY")
# userdata.get('OPENAI_API_KEY')

from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-4o")

In [None]:
# %pip install langchain_openai

In [None]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_template("tell me a joke about {topic}")

chain = prompt | model | StrOutputParser()
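
The `|` operator here composes steps so that the output of one becomes the input of the next. The toy class below illustrates how such operator-based chaining can be built in plain Python; it is not LangChain's implementation, and `Step`, `fake_model`, and the canned reply are assumptions made for the example (no API call is made).

```python
class Step:
    """Hypothetical minimal pipeline step that supports `|` composition."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other: "Step") -> "Step":
        # Build a new step that runs self first, then feeds the result to other.
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)


fill_prompt = Step(lambda inputs: "tell me a joke about {topic}".format(**inputs))
fake_model = Step(lambda prompt: {"content": f"(model reply to: {prompt})"})
to_text = Step(lambda message: message["content"])

toy_chain = fill_prompt | fake_model | to_text
print(toy_chain.invoke({"topic": "bears"}))
```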

In [None]:
chain.invoke({"topic": "bears"})

"Sure, here's a bear-themed joke for you:\n\nWhy do bears never get lost?\n\nBecause they always follow their bear-ings! 🐻🧭"

In [None]:
from langchain.text_splitter import MarkdownHeaderTextSplitter

In [None]:
markdown_document = """# Title\n\n \
## Chapter 1\n\n \
Hi this is Jim\n\n Hi this is Joe\n\n \
### Section \n\n \
Hi this is Lance \n\n
## Chapter 2\n\n \
Hi this is Molly"""

In [None]:
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

In [None]:
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
md_header_splits = markdown_splitter.split_text(markdown_document)

In [None]:
md_header_splits[0]

Document(metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1'}, page_content='Hi this is Jim  \nHi this is Joe')
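
Header-based splitting amounts to tracking which headers are currently "open" and attaching them as metadata to the text that follows. The sketch below reproduces that idea with the standard library; `split_by_headers` is a hypothetical function that returns plain dictionaries rather than `Document` objects, and its whitespace handling differs slightly from the library's output above.

```python
def split_by_headers(markdown: str, headers: list[tuple[str, str]]) -> list[dict]:
    """Group non-header lines under the headers that are active when they appear."""
    # Check longer prefixes first so "###" is not mistaken for "#".
    ordered = sorted(headers, key=lambda h: len(h[0]), reverse=True)
    active: dict[str, str] = {}
    sections: list[dict] = []
    buffer: list[str] = []

    def flush() -> None:
        if buffer:
            sections.append({"metadata": dict(active), "page_content": "\n".join(buffer)})
            buffer.clear()

    for raw_line in markdown.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        for prefix, name in ordered:
            if line.startswith(prefix + " "):
                flush()
                # A new header closes every header at the same or a deeper level.
                for other_prefix, other_name in headers:
                    if len(other_prefix) >= len(prefix):
                        active.pop(other_name, None)
                active[name] = line[len(prefix):].strip()
                break
        else:
            buffer.append(line)
    flush()
    return sections


md_sketch = (
    "# Title\n\n## Chapter 1\n\nHi this is Jim\n\nHi this is Joe\n\n"
    "### Section\n\nHi this is Lance\n\n## Chapter 2\n\nHi this is Molly"
)
header_names = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
sections = split_by_headers(md_sketch, header_names)
for section in sections:
    print(section)
```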