# Document Loading

In [1]:
import openai

## PDF Loader

In [2]:
# Import langchain's pdf document loader
from langchain.document_loaders import PyPDFLoader

In [3]:
# PyPDFLoader depends on the pypdf package; install it if it is missing
# %pip install pypdf

In [4]:
# Point the loader at a PDF file
loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")

In [5]:
# Load the document
pages = loader.load()

In [6]:
len(pages)

22

There are 22 pages in the document. Each page is a `Document`. A `Document` contains text (`page_content`) and `metadata`.
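To make the two-field structure concrete, here is a minimal sketch that joins per-page text into a single string. It runs on a stand-in dataclass, `PageDoc` (a hypothetical name, not LangChain's `Document` class), that only mimics the `page_content` and `metadata` attributes; the sample text and file name are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PageDoc:
    """Hypothetical stand-in exposing the same two attributes as a Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def join_pages(pages, sep="\n\n"):
    """Concatenate the text of every page into one string."""
    return sep.join(p.page_content for p in pages)


# Made-up sample data; a real run would use the `pages` list from the loader
sample_pages = [
    PageDoc("First page text.", {"source": "example.pdf", "page": 0}),
    PageDoc("Second page text.", {"source": "example.pdf", "page": 1}),
]
full_text = join_pages(sample_pages)
```

Since `join_pages` only reads `page_content`, it should also work on the `pages` list loaded above, assuming each entry exposes that attribute as described.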

In [7]:
# The first page, which is a Document
page = pages[0]

In [8]:
print(page.page_content[:500])

MachineLearning-Lecture01  
Instructor (Andrew Ng):  Okay. Good morning. Welcome to CS229, the machine 
learning class. So what I wanna do today is ju st spend a little time going over the logistics 
of the class, and then we'll start to  talk a bit about machine learning.  
By way of introduction, my name's  Andrew Ng and I'll be instru ctor for this class. And so 
I personally work in machine learning, and I' ve worked on it for about 15 years now, and 
I actually think that machine learning i


In [9]:
# Metadata is the other important part of a Document
page.metadata

{'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'page': 0}
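Because each page records its `source` and a zero-based `page` index, the metadata can be used to pick out a subset of pages. The sketch below assumes objects with a `metadata` dict shaped like the output above; `select_pages` is a hypothetical helper, and `SimpleNamespace` objects stand in for real `Document` instances.

```python
from types import SimpleNamespace

# Hypothetical stand-ins with the same attribute names as the loaded pages
fake_pages = [
    SimpleNamespace(
        page_content=f"text of page {i}",
        metadata={"source": "example.pdf", "page": i},
    )
    for i in range(5)
]


def select_pages(pages, start, stop):
    """Return pages whose zero-based page index lies in [start, stop)."""
    return [p for p in pages if start <= p.metadata["page"] < stop]


selected = select_pages(fake_pages, 1, 3)
selected_numbers = [p.metadata["page"] for p in selected]
```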

## YouTube

In this section, we will use three main classes:
* `GenericLoader`, which combines a blob loader with a parser
* `OpenAIWhisperParser`, which uses OpenAI's Whisper model to convert YouTube audio into text that we can work with
* `YoutubeAudioLoader`, which downloads the audio file from a YouTube video
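Conceptually, these pieces form a two-stage pipeline: one object fetches raw data and another turns it into text. The sketch below illustrates that composition pattern only. Every class in it (`FakeAudioLoader`, `FakeTranscriber`, `SimplePipeline`) is a hypothetical stand-in written for this example, not LangChain's implementation, and nothing is downloaded or transcribed.

```python
class FakeAudioLoader:
    """Hypothetical fetch stage: yields one raw item per URL."""

    def __init__(self, urls):
        self.urls = urls

    def yield_blobs(self):
        for url in self.urls:
            # A real loader would download audio here; this is placeholder data
            yield {"source": url, "data": b"placeholder audio bytes"}


class FakeTranscriber:
    """Hypothetical parse stage: turns one raw item into text records."""

    def parse(self, blob):
        return [{
            "page_content": f"transcript of {blob['source']}",
            "metadata": {"source": blob["source"]},
        }]


class SimplePipeline:
    """Hypothetical glue: feeds every fetched item to the parser."""

    def __init__(self, blob_loader, parser):
        self.blob_loader = blob_loader
        self.parser = parser

    def load(self):
        docs = []
        for blob in self.blob_loader.yield_blobs():
            docs.extend(self.parser.parse(blob))
        return docs


pipeline = SimplePipeline(FakeAudioLoader(["https://example.com/video"]), FakeTranscriber())
records = pipeline.load()
```

Keeping the fetch and parse stages separate means either one can be swapped without touching the other, for example a different audio source with the same transcriber.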

In [10]:
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader

In [11]:
# Install the libraries required by YoutubeAudioLoader
# %pip install yt_dlp
# %pip install pydub

In [12]:
# !sudo apt-get install ffmpeg

In [13]:
url = "https://www.youtube.com/watch?v=jGwO_UgTS7I"  # URL of a YouTube video
save_dir = "docs/youtube/"  # Directory in which to save the audio file

# Create a generic loader
loader = GenericLoader(
    YoutubeAudioLoader([url], save_dir),
    OpenAIWhisperParser()
)

# Download the audio and transcribe it into documents
docs = loader.load()

[youtube] Extracting URL: https://www.youtube.com/watch?v=jGwO_UgTS7I
[youtube] jGwO_UgTS7I: Downloading webpage
[youtube] jGwO_UgTS7I: Downloading ios player API JSON
[youtube] jGwO_UgTS7I: Downloading android player API JSON
[youtube] jGwO_UgTS7I: Downloading m3u8 information
[info] jGwO_UgTS7I: Downloading 1 format(s): 140
[download] docs/youtube//Stanford CS229： Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018).m4a has already been downloaded
[download] 100% of   69.71MiB
[ExtractAudio] Not converting audio docs/youtube//Stanford CS229： Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018).m4a; file is already in target format m4a
Transcribing part 1!
Transcribing part 2!
Transcribing part 3!
Transcribing part 4!


In [14]:
docs[0].page_content[0:500]

"Welcome to CS229 Machine Learning. Uh, some of you know that this is a class that's taught at Stanford for a long time. And this is often the class that, um, I most look forward to teaching each year because this is where we've helped, I think, several generations of Stanford students become experts in machine learning, got- built many of their products and services and startups that I'm sure, many of you or probably all of you are using, uh, uh, today. Um, so what I want to do today was spend s"