## Retrieval Augmented Generation

In retrieval augmented generation (RAG), an LLM retrieves contextual documents from an external dataset as part of its execution. 

This is useful if we want to ask questions about specific documents (e.g., our PDFs, a set of videos, etc.).
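
As a rough illustration of the retrieve-then-generate idea, the sketch below ranks a few in-memory documents by word overlap with the question and pastes the best match into a prompt. The word-overlap scoring, the sample documents, and the names `tokenize`, `retrieve`, and `build_prompt` are assumptions made for illustration. A real RAG pipeline would typically use embeddings and a vector store, and would send the resulting prompt to an LLM.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, documents, k=1):
    """Return the k documents sharing the most words with the question."""
    q_words = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q_words & tokenize(d)), reverse=True)
    return ranked[:k]

def build_prompt(question, context_docs):
    """Place the retrieved text ahead of the question, as a RAG prompt would."""
    context = "\n".join(context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Toy "external dataset" (made-up sentences, not loaded from any file).
documents = [
    "CS229 is the machine learning class taught by Andrew Ng.",
    "FFmpeg is a tool for converting audio and video files.",
]
question = "Who teaches the machine learning class?"
top_docs = retrieve(question, documents)
prompt = build_prompt(question, top_docs)
print(prompt)
```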

In [1]:
pip install langchain

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tables 3.8.0 requires blosc2~=2.0.0, which is not installed.
tables 3.8.0 requires cython>=0.29.21, which is not installed.
python-lsp-black 1.2.1 requires black>=22.3.0, but you have black 0.0 which is incompatible.


In [2]:
pip install openai

In [6]:
import os
import openai
from dotenv import load_dotenv

# Load environment variables from .env
load_dotenv()

# Get the API key
api_key = os.environ.get('OPENAI_API_KEY')

# Set the API key for OpenAI
openai.api_key = api_key
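
`os.environ.get` returns `None` when the variable is missing, so a missing key would only surface at the first API call. A small guard can make that failure explicit earlier. The helper name `require_env` and the variable `DEMO_API_KEY` below are hypothetical and used only for this sketch.

```python
import os

def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file or shell environment")
    return value

# Demonstration with a made-up variable name, so no real key is involved.
os.environ["DEMO_API_KEY"] = "sk-demo"
print(require_env("DEMO_API_KEY"))
```

In the cell above, the same guard could wrap the `OPENAI_API_KEY` lookup.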


### PDFs

Now, we will load a PDF transcript.

In [7]:
# The course will show the pip installs you would need to install packages on your own machine.
# These packages are already installed on this platform and should not be run again.

In [4]:
pip install pypdf

Collecting pypdf
  Obtaining dependency information for pypdf from https://files.pythonhosted.org/packages/29/10/055b649e914ad8c5d07113c22805014988825abbeff007b0e89255b481fa/pypdf-3.17.4-py3-none-any.whl.metadata
  Downloading pypdf-3.17.4-py3-none-any.whl.metadata (7.5 kB)
Downloading pypdf-3.17.4-py3-none-any.whl (278 kB)
Installing collected packages: pypdf
Successfully installed pypdf-3.17.4
Note: you may need to restart the kernel to use updated packages.


In [8]:
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("MachineLearning-Lecture01.pdf")
pages = loader.load()

Each page is a `Document`.

A `Document` contains text (`page_content`) and `metadata`.
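
To make the shape of these objects concrete, here is a minimal stand-in written with a dataclass. `SimpleDocument` and the sample page text are assumptions for illustration and are not LangChain's own `Document` class; only the two field names, `page_content` and `metadata`, are taken from the description above.

```python
from dataclasses import dataclass, field

@dataclass
class SimpleDocument:
    """Illustrative stand-in for a loader's Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)

# One object per page, with metadata recording where the text came from.
pages_demo = [
    SimpleDocument("Welcome to the class.", {"source": "lecture.pdf", "page": 0}),
    SimpleDocument("Today we cover logistics.", {"source": "lecture.pdf", "page": 1}),
]

# Joining page_content gives the full text; metadata stays available per page.
full_text = "\n".join(p.page_content for p in pages_demo)
print(len(pages_demo), pages_demo[1].metadata["page"])
```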

In [9]:
len(pages)

22

In [10]:
page = pages[0]

In [11]:
print(page.page_content[0:500])

MachineLearning-Lecture01  
Instructor (Andrew Ng):  Okay. Good morning. Welcome to CS229, the machine 
learning class. So what I wanna do today is ju st spend a little time going over the logistics 
of the class, and then we'll start to  talk a bit about machine learning.  
By way of introduction, my name's  Andrew Ng and I'll be instru ctor for this class. And so 
I personally work in machine learning, and I' ve worked on it for about 15 years now, and 
I actually think that machine learning i


In [12]:
page.metadata

{'source': 'MachineLearning-Lecture01.pdf', 'page': 0}

### YouTube

In [13]:
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader

**Note**: This can take several minutes to complete.

In [12]:
# yt-dlp is a command-line program for downloading video and audio from more than a thousand websites.
pip install yt_dlp

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
botocore 1.29.76 requires urllib3<1.27,>=1.25.4, but you have urllib3 2.1.0 which is incompatible.


In [16]:
pip install pydub

Note: you may need to restart the kernel to use updated packages.


In [30]:
# OpenAI's Whisper model is a speech-to-text model.
# Here it converts the YouTube audio into text that we can work with.

url = "https://www.youtube.com/watch?v=jGwO_UgTS7I"
save_dir = "docs/youtube/"
loader = GenericLoader(
    YoutubeAudioLoader([url], save_dir),
    OpenAIWhisperParser()
)

# GenericLoader combines a blob loader (YoutubeAudioLoader, which downloads the audio)
# with a parser (OpenAIWhisperParser, which transcribes it).
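
The loader-plus-parser split can be sketched without any network access. The classes below (`Blob`, `InMemoryBlobLoader`, `UpperCaseParser`, `SimpleGenericLoader`) are hypothetical stand-ins written for this example, not LangChain's classes. The blob loader fetches raw bytes, the parser turns each blob into text, and the generic loader only wires the two together.

```python
from dataclasses import dataclass

@dataclass
class Blob:
    """Raw bytes plus the name of the source they came from."""
    data: bytes
    source: str

class InMemoryBlobLoader:
    """Hypothetical blob loader: yields raw blobs (a stand-in for audio downloads)."""
    def __init__(self, items):
        self.items = items  # mapping of source name -> bytes

    def yield_blobs(self):
        for source, data in self.items.items():
            yield Blob(data=data, source=source)

class UpperCaseParser:
    """Hypothetical parser: turns a blob into text (a stand-in for transcription)."""
    def parse(self, blob):
        return {"page_content": blob.data.decode("utf-8").upper(),
                "metadata": {"source": blob.source}}

class SimpleGenericLoader:
    """Combines a blob loader with a parser, producing one document per blob."""
    def __init__(self, blob_loader, parser):
        self.blob_loader = blob_loader
        self.parser = parser

    def load(self):
        return [self.parser.parse(blob) for blob in self.blob_loader.yield_blobs()]

demo_loader = SimpleGenericLoader(InMemoryBlobLoader({"clip.txt": b"hello"}), UpperCaseParser())
demo_docs = demo_loader.load()
print(demo_docs)
```

Keeping the two roles separate means either side can be swapped independently, for example a different download source with the same transcription step.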

In [26]:
pip install pytube

Collecting pytube
  Downloading pytube-15.0.0-py3-none-any.whl (57 kB)
Installing collected packages: pytube
Successfully installed pytube-15.0.0
Note: you may need to restart the kernel to use updated packages.


In [33]:
import yt_dlp

# URL of the YouTube video
url = "https://www.youtube.com/watch?v=jGwO_UgTS7I"

# Directory to save the downloaded audio file
save_dir = "docs/youtube/"

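# FFmpeg location: yt-dlp expects the ffmpeg executable or the directory that
# contains it. The path below points at a .tar.xz archive, which yt-dlp cannot
# use directly; replace it with the path to an extracted or installed FFmpeg build.
ffmpeg_path = r'C:\Users\KIIT\Downloads\ffmpeg-6.1.tar.xz'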

# YouTubeDL options including ffmpeg_location
ydl_opts = {
    'ffmpeg_location': ffmpeg_path,
    # You can add more options here if needed
}

# Download the YouTube video
with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    ydl.download([url])

[youtube] Extracting URL: https://www.youtube.com/watch?v=jGwO_UgTS7I
[youtube] jGwO_UgTS7I: Downloading webpage
[youtube] jGwO_UgTS7I: Downloading ios player API JSON
[youtube] jGwO_UgTS7I: Downloading android player API JSON
[youtube] jGwO_UgTS7I: Downloading m3u8 information
[info] jGwO_UgTS7I: Downloading 1 format(s): 22
[download] Destination: Stanford CS229： Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018) [jGwO_UgTS7I].mp4
[download] 100% of  217.28MiB in 00:00:44 at 4.91MiB/s     


In [19]:
docs = loader.load()

[youtube] Extracting URL: https://www.youtube.com/watch?v=jGwO_UgTS7I
[youtube] jGwO_UgTS7I: Downloading webpage
[youtube] jGwO_UgTS7I: Downloading ios player API JSON
[youtube] jGwO_UgTS7I: Downloading android player API JSON
[youtube] jGwO_UgTS7I: Downloading m3u8 information
[info] jGwO_UgTS7I: Downloading 1 format(s): 140
[download] docs\youtube\Stanford CS229： Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018).m4a has already been downloaded
[download] 100% of   69.76MiB


ERROR: Postprocessing: ffprobe and ffmpeg not found. Please install or provide the path using --ffmpeg-location


DownloadError: ERROR: Postprocessing: ffprobe and ffmpeg not found. Please install or provide the path using --ffmpeg-location
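
The `DownloadError` above reports that yt-dlp could not locate `ffmpeg` and `ffprobe`, which it needs to post-process the downloaded audio. Before calling `loader.load()`, it may help to check whether both executables are visible on `PATH`. The helper name `find_media_tools` is an assumption for illustration; `shutil.which` is part of the Python standard library.

```python
import shutil

def find_media_tools(names=("ffmpeg", "ffprobe")):
    """Map each tool name to its full path on PATH, or None if it is not found."""
    return {name: shutil.which(name) for name in names}

tools = find_media_tools()
missing = [name for name, path in tools.items() if path is None]
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("Found:", tools)
```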

In [35]:
docs[0].page_content[0:500]

'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nFile not found · GitHub\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSkip to content\n\n\n\n\n\n\n\n\n\n\n\n\n\nToggle navigation\n\n\n\n\n\n\n\n\n\n\n          Sign in\n        \n\n\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n        Product\n        \n\n\n\n\n\n\n\n\n\n\n\n\nActions\n        Automate any workflow\n      \n\n\n\n\n\n\n\nPackages\n        Host and manage packages\n      \n\n\n\n\n\n\n\nSecurity\n        Find and fix vulnerabilities\n      \n\n\n\n\n\n\n\nCodespaces\n        Instant dev environments\n '

**Note**: This output is the text of a GitHub "File not found" page, not a lecture transcript. Because `loader.load()` raised a `DownloadError`, `docs` was not reassigned by that cell and appears to still hold the result of a web page load from another cell run.

### URLs

In [36]:
from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://github.com/basecamp/handbook/blob/master/37signals-is-you.md")

In [37]:
docs = loader.load()

In [38]:
print(docs[0].page_content[:500])

File not found · GitHub

Skip to content

Toggle navigation

          Sign in

        Product

Actions
        Automate any workflow

Packages
        Host and manage packages

Security
        Find and fix vulnerabilities

Codespaces
        Instant dev environments
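
Text scraped from an HTML page often carries long runs of newlines and indentation, so most of the first 500 characters can be whitespace. A small post-processing step can make the content easier to inspect. The function name `collapse_whitespace` and the sample string are assumptions for this sketch; the sample only imitates the shape of the output shown here.

```python
import re

def collapse_whitespace(text):
    """Strip each line and collapse runs of blank lines into a single blank line."""
    lines = [line.strip() for line in text.splitlines()]
    cleaned = "\n".join(lines)
    cleaned = re.sub(r"\n{3,}", "\n\n", cleaned)
    return cleaned.strip()

raw = "\n\n\n\nFile not found · GitHub\n\n\n\n\nSkip to content\n\n\n          Sign in\n        \n\n"
print(collapse_whitespace(raw))
```

With a loaded page, the same function could be applied to `docs[0].page_content` before printing.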
 


### Notion

In [46]:
from langchain.document_loaders import NotionDirectoryLoader
loader = NotionDirectoryLoader("docs/Notion_DB.md")
docs = loader.load()

In [48]:
from langchain.document_loaders import NotionDirectoryLoader

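# NotionDirectoryLoader expects the directory of an exported Notion workspace
# and reads the Markdown files inside it. Note that this path names a single
# .md file rather than a directory, which likely explains why nothing is loaded.
notion_dir_path = "docs/Notion_DB.md"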

# Create a loader for the Notion directory
loader = NotionDirectoryLoader(notion_dir_path)

# Load the documents
docs = loader.load()

# Check if there are any documents loaded
if docs:
    # Access content from the first document in the list
    first_doc_content = docs[0].page_content if docs[0].page_content else "No content found"
    print(first_doc_content[0:200])  # Print the first 200 characters of the content
else:
    print("No documents loaded from the Notion directory.")


No documents loaded from the Notion directory.
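
A directory-based loader searches a folder for Markdown files, so a path that names a single file matches nothing. The sketch below approximates that behaviour with `pathlib`. It is not LangChain's implementation, and `load_markdown_dir`, the temporary `Notion_DB` folder, and the sample page are all made up for the example.

```python
import tempfile
from pathlib import Path

def load_markdown_dir(path):
    """Read every .md file under a directory; return [] if path is not a directory."""
    root = Path(path)
    if not root.is_dir():
        return []
    docs = []
    for md_file in sorted(root.glob("**/*.md")):
        docs.append({"page_content": md_file.read_text(encoding="utf-8"),
                     "metadata": {"source": str(md_file)}})
    return docs

# Build a throwaway directory with one Markdown file to stand in for a Notion export.
with tempfile.TemporaryDirectory() as tmp:
    export_dir = Path(tmp) / "Notion_DB"
    export_dir.mkdir()
    page_file = export_dir / "Getting started.md"
    page_file.write_text("# Getting started\nWelcome.", encoding="utf-8")

    from_dir = load_markdown_dir(export_dir)   # directory path: finds the file
    from_file = load_markdown_dir(page_file)   # file path: finds nothing

print(len(from_dir), len(from_file))
```

If the Notion export was unpacked into a folder, passing that folder to the loader rather than a `.md` file path would be the corresponding change.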
