# Document Loading

## Retrieval augmented generation
In retrieval augmented generation (RAG), an LLM retrieves contextual documents from an external dataset as part of its execution.

This is useful when we want to ask questions about specific documents (e.g., our PDFs, a set of videos, etc.).

![image](img/overview.jpg)

In [2]:
import os
import openai
import sys
sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key = os.environ['OPENAI_API_KEY']

### PDFs

Let's load a PDF transcript from Andrew Ng's famous CS229 course! These documents are the result of automated transcription, so words and sentences are sometimes split unexpectedly.

In [6]:
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()

Each page is a `Document`.
A `Document` contains text (`page_content`) and `metadata`.

In [7]:
len(pages)

22

In [8]:
page = pages[0]

In [9]:
print(page.page_content[0:500])

MachineLearning-Lecture01  
Instructor (Andrew Ng):  Okay. Good morning. Welcome to CS229, the machine 
learning class. So what I wanna do today is ju st spend a little time going over the logistics 
of the class, and then we'll start to  talk a bit about machine learning.  
By way of introduction, my name's  Andrew Ng and I'll be instru ctor for this class. And so 
I personally work in machine learning, and I' ve worked on it for about 15 years now, and 
I actually think that machine learning i
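
The output above shows the transcription artifacts mentioned earlier: doubled spaces, trailing spaces, and words split in the middle ("ju st", "instru ctor"). A minimal sketch of one possible cleanup step follows, assuming that collapsing runs of spaces and tabs is acceptable for the downstream use. The helper name `normalize_whitespace` is hypothetical, and the approach does not repair mid-word splits.

```python
import re

def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces/tabs to one space and trim each line."""
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(lines)

# Sample taken from the first lines of the page content printed above.
sample = (
    "Instructor (Andrew Ng):  Okay. Good morning. Welcome to CS229, the machine \n"
    "learning class."
)
cleaned = normalize_whitespace(sample)
print(cleaned)
```

With a loaded page, the same helper could be applied as `normalize_whitespace(page.page_content)`.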


In [10]:
page.metadata

{'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'page': 0}
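
As the cells above show, each loaded page exposes `page_content` and `metadata`. The sketch below illustrates working with that shape across a list of pages. It uses a stand-in dataclass, `PageDoc`, which is hypothetical and is not LangChain's own class; the helper `summarize` and the path `docs/example.pdf` are likewise assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class PageDoc:
    """Hypothetical stand-in with the same two fields as a loaded page."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def summarize(pages):
    """Return the page count, total characters, and sorted distinct sources."""
    sources = sorted({p.metadata.get("source", "") for p in pages})
    return {
        "num_pages": len(pages),
        "total_chars": sum(len(p.page_content) for p in pages),
        "sources": sources,
    }

# Made-up pages that mimic the metadata keys shown above ('source', 'page').
demo_pages = [
    PageDoc("first page text", {"source": "docs/example.pdf", "page": 0}),
    PageDoc("second page", {"source": "docs/example.pdf", "page": 1}),
]
summary = summarize(demo_pages)
print(summary)
```

Because `summarize` only reads the `page_content` and `metadata` attributes, it should also accept the `pages` list returned by `loader.load()`.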

### YouTube - not working, and I don't have the patience to test it

In [22]:
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader

In [23]:
url="https://www.youtube.com/watch?v=jGwO_UgTS7I"
save_dir="docs/youtube/"
loader = GenericLoader(
    YoutubeAudioLoader([url],save_dir),
    OpenAIWhisperParser()
)
docs = loader.load()

[youtube] Extracting URL: https://www.youtube.com/watch?v=jGwO_UgTS7I
[youtube] jGwO_UgTS7I: Downloading webpage
[youtube] jGwO_UgTS7I: Downloading ios player API JSON
[youtube] jGwO_UgTS7I: Downloading iframe API JS
[youtube] jGwO_UgTS7I: Downloading web player API JSON
ERROR: [youtube] jGwO_UgTS7I: Failed to extract any player response; please report this issue on  https://github.com/yt-dlp/yt-dlp/issues?q= , filling out the appropriate issue template. Confirm you are on the latest version using  yt-dlp -U

DownloadError: ERROR: [youtube] jGwO_UgTS7I: Failed to extract any player response; please report this issue on  https://github.com/yt-dlp/yt-dlp/issues?q= , filling out the appropriate issue template. Confirm you are on the latest version using  yt-dlp -U
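
The error message itself suggests confirming that yt-dlp is up to date. Separately, if the rest of the notebook should keep running when a loader fails like this, one option is to wrap the call. The sketch below is an assumption about how that might look, not part of the original notebook: `load_or_empty` and `failing_load` are hypothetical names, and the broad `except` is a deliberate simplification.

```python
def load_or_empty(load_fn):
    """Call load_fn(); on any exception, report it and return an empty list."""
    try:
        return load_fn()
    except Exception as exc:  # broad on purpose: download errors vary by backend
        print(f"Loading failed: {type(exc).__name__}: {exc}")
        return []

def failing_load():
    # Stand-in for a loader.load() call that raises, as in the cell above.
    raise RuntimeError("Failed to extract any player response")

fallback_docs = load_or_empty(failing_load)
print(len(fallback_docs))
```

In the notebook this would be called as `docs = load_or_empty(loader.load)`, leaving `docs` empty instead of stopping execution.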

### URLs

In [29]:
from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://github.com/basecamp/handbook/blob/master/37signals-is-you.md")
loader2 = WebBaseLoader("https://github.com/basecamp/handbook/blob/master/benefits-and-perks.md")

In [30]:
docs = loader.load()
docs2 = loader2.load()

In [33]:
#print(docs[0].page_content[:500])
print(docs2[0].page_content[0:500])

Enterprise
Teams
Startups
By industry
Healthcare
Financial services
Manufacturing
By use case
CI/CD & Automation
DevOps
DevSecOps
Resources
Topics
AI
DevOps
Innersource
Open Source
Security
Software Development
Explore
Learning Pathways
White papers, Ebooks, Webinars
Customer Stories
Partners
Open Source
GitHub Sponsors
Fund open source developers
The ReadME Project
GitHub community articles
Repositories
Topics
Trending
Collections
Enterprise
Enterprise platform
AI-powered developer platform
Avai
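
The first 500 characters printed above are GitHub's site navigation, not the handbook text, because the `blob` URL points at a rendered HTML page. One possible workaround is to request the raw file instead. The sketch below rewrites a `github.com/.../blob/...` URL into its `raw.githubusercontent.com` form. The helper name `github_blob_to_raw` is hypothetical, and it assumes the branch name contains no `/`. It does not check whether the file still exists at that path.

```python
from urllib.parse import urlparse

def github_blob_to_raw(url: str) -> str:
    """Map a github.com '/owner/repo/blob/ref/path' URL to its raw-file URL."""
    parsed = urlparse(url)
    parts = parsed.path.strip("/").split("/")
    if parsed.netloc != "github.com" or len(parts) < 5 or parts[2] != "blob":
        raise ValueError(f"not a GitHub blob URL: {url}")
    # Assumes the ref (branch or tag) is a single path segment.
    owner, repo, _, ref, *path = parts
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{ref}/" + "/".join(path)

raw_url = github_blob_to_raw(
    "https://github.com/basecamp/handbook/blob/master/37signals-is-you.md"
)
print(raw_url)
```

Passing `raw_url` to `WebBaseLoader` would be expected to return the Markdown source rather than the navigation menu, though that is not verified here.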