# Document Loading

## Note to students.
During periods of high load, the notebook may become unresponsive. A cell may appear to execute and update the execution count in brackets [#] at its left even though it has not actually run. This is most noticeable with print statements that produce no output. If this happens, restart the kernel using the command under the Kernel tab.

## Retrieval augmented generation

In retrieval augmented generation (RAG), an LLM retrieves contextual documents from an external dataset as part of its execution.

This is useful when we want to ask questions about specific documents (e.g., our PDFs, a set of videos, etc.).

![overview.jpeg](attachment:overview.jpeg)
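
To make the retrieve-then-generate flow concrete, here is a minimal sketch that uses only the standard library. The corpus, the word-overlap scoring, and the `retrieve` and `build_prompt` helpers are illustrative assumptions; a real RAG system would typically rank documents with embeddings and a vector store, and would then send the prompt to an LLM.

```python
import re

# Hypothetical in-memory "external dataset"; real systems index many documents.
CORPUS = [
    "Linear regression fits a line to data by minimizing squared error.",
    "Pancakes are made from flour, eggs, and milk.",
    "Gradient descent updates parameters in the direction of the negative gradient.",
]

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, corpus, k=1):
    """Return the k documents sharing the most words with the question."""
    q_words = tokenize(question)
    ranked = sorted(corpus, key=lambda doc: len(q_words & tokenize(doc)), reverse=True)
    return ranked[:k]

def build_prompt(question, context_docs):
    """Place the retrieved text ahead of the question, as it would be sent to an LLM."""
    context = "\n".join(context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

question = "How does gradient descent update parameters?"
top_docs = retrieve(question, CORPUS)
prompt = build_prompt(question, top_docs)
print(prompt)
```

The document loaders in the rest of this notebook cover the step that comes before retrieval: getting text from different sources into a common format.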

In [1]:
! pip install langchain
! pip install dotenv

Collecting dotenv
  Downloading dotenv-0.9.9-py2.py3-none-any.whl.metadata (279 bytes)
Collecting python-dotenv (from dotenv)
  Downloading python_dotenv-1.1.1-py3-none-any.whl.metadata (24 kB)
Downloading dotenv-0.9.9-py2.py3-none-any.whl (1.9 kB)
Downloading python_dotenv-1.1.1-py3-none-any.whl (20 kB)
Installing collected packages: python-dotenv, dotenv
Successfully installed dotenv-0.9.9 python-dotenv-1.1.1


In [2]:
import os
import openai
import sys
sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

# openai.api_key  = os.environ['OPENAI_API_KEY']
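
If the commented-out line above were enabled, a missing `OPENAI_API_KEY` would surface as a bare `KeyError`. One possible, more defensive way to read the key is sketched below; `require_env` is a hypothetical helper name, not part of `dotenv` or `openai`.

```python
import os

def require_env(name, env=None):
    """Return the value of an environment variable or raise a readable error."""
    env = os.environ if env is None else env
    value = env.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file or the environment.")
    return value

# Demonstration with an explicit mapping so no real key is needed.
demo_env = {"OPENAI_API_KEY": "sk-example"}
print(require_env("OPENAI_API_KEY", demo_env))
```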

## PDFs

Let's load a PDF [transcript](https://see.stanford.edu/materials/aimlcs229/transcripts/MachineLearning-Lecture01.pdf) from Andrew Ng's famous CS229 course! These documents are the result of automated transcription so words and sentences are sometimes split unexpectedly.

In [3]:
# The course will show the pip installs you would need to install packages on your own machine.
# These packages are already installed on this platform and should not be run again.
! pip install pypdf
! pip install -U langchain-community

Collecting pypdf
  Downloading pypdf-6.0.0-py3-none-any.whl.metadata (7.1 kB)
Downloading pypdf-6.0.0-py3-none-any.whl (310 kB)
Installing collected packages: pypdf
Successfully installed pypdf-6.0.0
Collecting langchain-community
  Downloading langchain_community-0.3.27-py3-none-any.whl.metadata (2.9 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain-community)
  Downloading pydantic_settings-2.10.1-py3-none-any.whl.metadata (3.4 kB)
Collecting httpx-sse<1.0.0,>=0.4.0 (from langchain-community)
  Downloading httpx_sse-0.4.1-py3-none-any.whl.metadata (9.4 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from 

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
from langchain_community.document_loaders import PyPDFLoader
# loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")
file_path = '/content/drive/My Drive/Poster.pdf'
loader = PyPDFLoader(file_path)
pages = loader.load()
display(pages)

[Document(metadata={'producer': 'PyPDF', 'creator': 'PyPDF', 'creationdate': '2025-01-09T10:40:46+00:00', 'moddate': '2025-01-09T10:40:46+00:00', 'source': '/content/drive/My Drive/Poster.pdf', 'total_pages': 1, 'page': 0, 'page_label': '1'}, page_content='Predicting Market Reactions to News: \nAn LLM-Based Approach Using Spanish Business Articles\nJesus Villota Miranda (CEMFI)\nMethodology\nDistribution of Articles through Clusters\nTrading Signal by Cluster\nReturns to the Trading Strategies\nConclusion\nThis paper explores how Large Language Models (LLMs) can enhance market prediction by analyzing Spanish business news during the \nvolatile COVID-19 period. We propose a novel approach that guides LLMs to systematically classify economic shocks in news articles, \ncomparing its effectiveness against a traditional vector-based text analysis method in predicting market reactions\nUnstable clustering\nThe distribution profile \nof articles through \nclusters is unstable \nacross data sp

Each page is a `Document`.

A `Document` contains text (`page_content`) and `metadata`.
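
As a rough picture of that structure, the sketch below defines a stand-in class with the same two fields. `SimpleDocument` is a hypothetical name used only for illustration; it is not LangChain's `Document` class, and the sample values are shortened from the output above.

```python
from dataclasses import dataclass, field

@dataclass
class SimpleDocument:
    """Stand-in with the two fields described above (not LangChain's class)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

doc = SimpleDocument(
    page_content="Predicting Market Reactions to News",
    metadata={"source": "Poster.pdf", "page": 0},
)

print(doc.page_content)        # the extracted text
print(doc.metadata["source"])  # where the text came from
```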

In [28]:
len(pages)

1

In [29]:
page = pages[0]

In [30]:
print(page.page_content[0:500])

Predicting Market Reactions to News: 
An LLM-Based Approach Using Spanish Business Articles
Jesus Villota Miranda (CEMFI)
Methodology
Distribution of Articles through Clusters
Trading Signal by Cluster
Returns to the Trading Strategies
Conclusion
This paper explores how Large Language Models (LLMs) can enhance market prediction by analyzing Spanish business news during the 
volatile COVID-19 period. We propose a novel approach that guides LLMs to systematically classify economic shocks in news a
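
The extracted text keeps the line breaks of the PDF layout, so sentences are split mid-way (for example after "during the"). A minimal sketch of one way to rejoin them is shown below; `unwrap_lines` is a hypothetical helper, and the sample string is copied from the output above.

```python
import re

def unwrap_lines(text):
    """Join lines broken mid-sentence into single-spaced text."""
    # Any whitespace run that contains a newline becomes one space.
    return re.sub(r"\s*\n\s*", " ", text).strip()

sample = (
    "analyzing Spanish business news during the \n"
    "volatile COVID-19 period. We propose a novel approach"
)
print(unwrap_lines(sample))
```

Note that this also merges headings into the running text, so it suits body paragraphs better than pages where the layout carries meaning.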


In [31]:
page.metadata

{'producer': 'PyPDF',
 'creator': 'PyPDF',
 'creationdate': '2025-01-09T10:40:46+00:00',
 'moddate': '2025-01-09T10:40:46+00:00',
 'source': '/content/drive/My Drive/Poster.pdf',
 'total_pages': 1,
 'page': 0,
 'page_label': '1'}
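
The metadata can be used to tell readers where a passage came from. Here `page` is a zero-based index while `page_label` is the printed page number. The sketch below builds a short source label from these fields; `format_citation` is a hypothetical helper, and the fallback logic is an assumption about how one might handle missing keys.

```python
import os

def format_citation(metadata):
    """Build a short human-readable source label from loader metadata."""
    name = os.path.basename(metadata.get("source", "unknown"))
    # Prefer the printed page label; fall back to the zero-based index plus one.
    label = metadata.get("page_label")
    if label is None and "page" in metadata:
        label = str(metadata["page"] + 1)
    total = metadata.get("total_pages")
    if label is None:
        return name
    if total is None:
        return f"{name}, p. {label}"
    return f"{name}, p. {label} of {total}"

meta = {
    "source": "/content/drive/My Drive/Poster.pdf",
    "total_pages": 1,
    "page": 0,
    "page_label": "1",
}
print(format_citation(meta))
```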

## YouTube

In [None]:
! pip install yt_dlp
! pip install pydub
! pip install faster-whisper

In [7]:
from langchain_community.document_loaders.generic import GenericLoader, FileSystemBlobLoader
from langchain_community.document_loaders.parsers.audio import OpenAIWhisperParser
from langchain_community.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader
from langchain_community.document_loaders.parsers.audio import FasterWhisperParser

**Note**: This can take several minutes to complete.

In [14]:
url = "https://www.youtube.com/watch?v=J7b0jxVB1TE"
save_dir = "docs/youtube/"
loader = GenericLoader(
    YoutubeAudioLoader([url], save_dir),  # fetch audio from YouTube
    # FileSystemBlobLoader(save_dir, glob="*.m4a"),  # or read previously downloaded audio
    # OpenAIWhisperParser()  # alternative: transcribe through the OpenAI API
    FasterWhisperParser(device="cuda", model_size="small")  # device: "cuda" or "cpu"; model_size may also be "base", "medium", "large-v3"
)
docs = loader.load()

[youtube] Extracting URL: https://www.youtube.com/watch?v=J7b0jxVB1TE
[youtube] J7b0jxVB1TE: Downloading webpage
[youtube] J7b0jxVB1TE: Downloading tv client config
[youtube] J7b0jxVB1TE: Downloading tv player API JSON
[youtube] J7b0jxVB1TE: Downloading ios player API JSON
[youtube] J7b0jxVB1TE: Downloading m3u8 information
[info] J7b0jxVB1TE: Downloading 1 format(s): 140
[download] docs/youtube//English Speech ｜ All About Me.m4a has already been downloaded
[download] 100% of  331.21KiB
[ExtractAudio] Not converting audio docs/youtube//English Speech ｜ All About Me.m4a; file is already in target format m4a



In [17]:
docs

[Document(metadata={'source': 'docs/youtube/English Speech ｜ All About Me.m4a', 'timestamps': '[0.00s -> 4.50s]', 'language': 'en', 'probability': '98%'}, page_content=' Hello everyone, nice to meet you. Let me introduce myself.'),
 Document(metadata={'source': 'docs/youtube/English Speech ｜ All About Me.m4a', 'timestamps': '[4.50s -> 11.00s]', 'language': 'en', 'probability': '98%'}, page_content=" My name is Julie Anderson. I'm nine years old. I'm from South Africa."),
 Document(metadata={'source': 'docs/youtube/English Speech ｜ All About Me.m4a', 'timestamps': '[11.00s -> 16.00s]', 'language': 'en', 'probability': '98%'}, page_content=' I like pancakes and hot dogs. I like yellow. I like cats.'),
 Document(metadata={'source': 'docs/youtube/English Speech ｜ All About Me.m4a', 'timestamps': '[16.00s -> 20.00s]', 'language': 'en', 'probability': '98%'}, page_content=' Now you know more about me. I hope we can be friends.')]
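
Each transcript segment carries its time range as a string in `metadata['timestamps']`. A minimal sketch of turning that string into numbers and stitching segments back into one transcript follows. The regular expression assumes the exact `[start s -> end s]` format seen in the output above, `parse_timestamps` is a hypothetical helper, and the segments are copied here as plain tuples rather than `Document` objects.

```python
import re

TIMESTAMP_RE = re.compile(r"\[(\d+(?:\.\d+)?)s -> (\d+(?:\.\d+)?)s\]")

def parse_timestamps(value):
    """Convert a string like '[4.50s -> 11.00s]' into (start, end) seconds."""
    match = TIMESTAMP_RE.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"Unrecognized timestamp format: {value!r}")
    return float(match.group(1)), float(match.group(2))

# Segments copied from the output above as plain (metadata, text) pairs.
segments = [
    ({"timestamps": "[0.00s -> 4.50s]"},
     " Hello everyone, nice to meet you. Let me introduce myself."),
    ({"timestamps": "[4.50s -> 11.00s]"},
     " My name is Julie Anderson. I'm nine years old. I'm from South Africa."),
]

transcript = " ".join(text.strip() for _, text in segments)
start = parse_timestamps(segments[0][0]["timestamps"])[0]
end = parse_timestamps(segments[-1][0]["timestamps"])[1]
print(f"{start:.1f}s to {end:.1f}s: {transcript}")
```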

## URLs

In [18]:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://github.com/basecamp/handbook/blob/master/titles-for-programmers.md")



> Note: The URL passed to `WebBaseLoader` differs from the one shown in the video because it was updated for 2024.

In [19]:
docs = loader.load()

In [21]:
print(docs[0].page_content[:1000])

handbook/titles-for-programmers.md at master · basecamp/handbook · GitHub

Skip to content

Navigation Menu

Toggle navigation

Sign in

Appearance settings

Product

GitHub Copilot
Write better code with AI

GitHub Spark
New
Build and deploy intelligent apps

GitHub Models
New
Manage and compare prompts

GitHub Advanced Security
Find and fix vulnerabilities

Actions
Automate any workflow

Codespaces
Instant dev 
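
Pages fetched this way usually carry navigation text and irregular whitespace around the actual content, so some post-processing is often needed before the text is useful. A minimal sketch of one such step is shown below; `collapse_whitespace` is a hypothetical helper and the sample string is a shortened stand-in for the loaded page.

```python
def collapse_whitespace(text):
    """Strip each line and drop blank ones, leaving one item per line."""
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)

raw = "\n\n\nSkip to content\n\n\n\n   Navigation Menu\n\nToggle navigation\n\n \n"
print(collapse_whitespace(raw))
```

This only tidies the whitespace; removing the navigation text itself would need a separate filtering step.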

## Notion

Follow the steps [here](https://python.langchain.com/docs/modules/data_connection/document_loaders/integrations/notion) for an example Notion site such as [this one](https://yolospace.notion.site/Blendle-s-Employee-Handbook-e31bff7da17346ee99f531087d8b133f):

* Duplicate the page into your own Notion space and export as `Markdown / CSV`.
* Unzip it and save it as a folder that contains the markdown file for the Notion page.


![image.png](./img/image.png)
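
Once unzipped, the export is a folder of Markdown files. As a rough illustration of what loading such a folder involves, the standard-library sketch below reads every `.md` file into a dictionary with text and a `source` entry. It is not `NotionDirectoryLoader`'s actual code; `load_markdown_dir` is a hypothetical name, and the temporary folder stands in for a real export.

```python
import tempfile
from pathlib import Path

def load_markdown_dir(directory):
    """Read every .md file under a directory into text-plus-metadata dicts."""
    docs = []
    for path in sorted(Path(directory).rglob("*.md")):
        docs.append({
            "page_content": path.read_text(encoding="utf-8"),
            "metadata": {"source": str(path)},
        })
    return docs

# Demonstration on a throwaway folder standing in for an unzipped Notion export.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "Handbook.md").write_text("# Employee Handbook\n", encoding="utf-8")
    example_docs = load_markdown_dir(tmp)

print(len(example_docs), example_docs[0]["page_content"].strip())
```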

In [None]:
from langchain_community.document_loaders import NotionDirectoryLoader
loader = NotionDirectoryLoader("docs/Notion_DB")
docs = loader.load()

In [None]:
print(docs[0].page_content[0:200])

In [None]:
docs[0].metadata