## RAG: Retrieval-Augmented Generation
RAG uses embeddings to retrieve the text most relevant to a query and inserts that text into the prompt as context.

Document Loader -----> Splitting -----> Storage + Retrieval
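
To make the retrieval idea concrete, here is a minimal standard-library sketch. It assumes a toy bag-of-words vector in place of a real embedding model, and the names `embed`, `cosine`, and `retrieve` are hypothetical helpers, not LangChain APIs.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


chunks = [
    "parity checking detects single bit errors",
    "the splitter breaks documents into chunks",
    "vector stores index embeddings for search",
]
question = "how are bit errors detected"

# Retrieve the best-matching chunk and integrate it into the prompt.
context = retrieve(question, chunks, k=1)
prompt = f"Context:\n{context[0]}\n\nQuestion: {question}"
print(prompt)
```

A real pipeline swaps `embed` for an embedding model and the sorted list for a vector database, but the flow (embed, rank by similarity, paste into the prompt) stays the same.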

### LangChain document loaders
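Document loaders are classes that read a source file and return it as a list of `Document` objects, each holding the text in `page_content` plus a `metadata` dictionary.
LangChain provides loaders for common file types such as `.pdf`, `.csv`, and `.html`.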
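
As a rough illustration of what a loader produces, the sketch below builds text-plus-metadata records from CSV rows using only the standard library. `Doc` and `load_csv` are hypothetical names invented for this example; this is not LangChain's implementation, only an approximation of the shape of its output.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_csv(text: str, source: str) -> list[Doc]:
    """Turn each CSV row into one Doc made of 'column: value' lines."""
    reader = csv.DictReader(io.StringIO(text))
    docs = []
    for i, row in enumerate(reader):
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        docs.append(Doc(page_content=content, metadata={"source": source, "row": i}))
    return docs


sample = "name,role\nAda,engineer\nGrace,admiral\n"
docs = load_csv(sample, source="sample.csv")
print(docs[0])
```

Keeping the source and row number in `metadata` is what later lets a retrieved chunk be traced back to where it came from.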

##### PDF document loader

In [4]:
from langchain_community.document_loaders import PyPDFLoader

# PyPDFLoader returns one Document per PDF page
loader = PyPDFLoader("/Users/mac/Desktop/5.1/Datacom/LAB-REPORT.pdf")

data = loader.load()
print(data[1])  # second page (index 1)

page_content='Abstract 
 
This comprehensive experiment investigates the principles and practical applications of error 
detection and correction in digital communication systems. Utilizing a suite of specialized 
equipment including the DCS297A Data Source, DCS297H Data Receiver, DCS297K Audio 
Module, and associated components, we conducted an in-depth examination of two key 
techniques: simple parity checking for error detection in seven-bit data words, and an advanced 
error correction code employing four data bits and four check bits. 
 
The study was designed to achieve two primary objectives: first, to demonstrate the capabilities 
and limitations of simple parity checking in detecting errors within transmitted data words, and 
second, to showcase the enhanced functionality of error correction codes in both locating and 
rectifying errors in digital communications. 
 
Our findings conclusively demonstrated that while simple parity checking is capable of detecting 
the presence o

##### CSV document loader

In [None]:
from langchain_community.document_loaders.csv_loader import CSVLoader

# CSVLoader returns one Document per CSV row
loader = CSVLoader("/Users/mac/Desktop/5.1/Datacom/LAB-REPORT.csv")
data = loader.load()
print(data[0])  # first row

##### HTML document loader

In [6]:
from langchain_community.document_loaders import UnstructuredHTMLLoader

# UnstructuredHTMLLoader extracts the text content of a saved HTML page
loader = UnstructuredHTMLLoader('/Users/mac/Downloads/langchain_community.document_loaders.pdf.PyPDFLoader — 🦜🔗 LangChain 0.2.17.html')
data = loader.load()
print(data[0])

page_content='This is a legacy site. Please use the latest v0.2 and v0.3 API references instead.

langchain_community.document_loaders.pdf.PyPDFLoader

PyPDFLoader

PyPDFLoader.__init__()

PyPDFLoader.alazy_load()

PyPDFLoader.aload()

PyPDFLoader.lazy_load()

PyPDFLoader.load()

PyPDFLoader.load_and_split()

langchain_community.document_loaders.pdf.PyPDFLoader¶

Examples using PyPDFLoader¶

Apache Cassandra

Astra DB

Document Comparison

Google Cloud Storage File

Google Vertex AI Vector Search

How to load PDFs

KDB.AI

Merge Documents Loader

MongoDB Atlas

Monetize your audience: Fund an OSS project or website with EthicalAds, a privacy-first ad network

Ads by EthicalAds

© 2023, LangChain, Inc. . Last updated on Dec 09, 2024.' metadata={'source': '/Users/mac/Downloads/langchain_community.document_loaders.pdf.PyPDFLoader — 🦜🔗 LangChain 0.2.17.html'}


### Splitting

In [9]:
from langchain_text_splitters import CharacterTextSplitter

quote = '''
One machine can do the work of fifty ordinary humans.\nNo machine can do the work of one extraordinary human. It is a quote by Elbert Hubbard.
'''
chunk_size = 24
chunk_overlap = 3

# CharacterTextSplitter splits only on `separator`; a piece longer than
# chunk_size is kept whole, which triggers the size warnings in the output.
ct_splitter = CharacterTextSplitter(
    separator='.',
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
)

docs = ct_splitter.split_text(quote)

print(docs)
print([len(doc) for doc in docs])

Created a chunk of size 53, which is longer than the specified 24
Created a chunk of size 54, which is longer than the specified 24
Created a chunk of size 32, which is longer than the specified 24


['One machine can do the work of fifty ordinary humans', 'No machine can do the work of one extraordinary human', 'It is a quote by Elbert Hubbard']
[52, 53, 31]
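
The lengths above all exceed `chunk_size` because splitting on a single separator can never cut inside a sentence. The following is a simplified standard-library sketch of that behavior, assuming no overlap handling; `split_on_separator` is a hypothetical helper, not the library's code.

```python
def split_on_separator(text: str, separator: str, chunk_size: int) -> list[str]:
    """Split on one separator, then greedily merge neighbouring pieces that fit."""
    pieces = [p.strip() for p in text.split(separator) if p.strip()]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + " " + piece
        if current and len(candidate) > chunk_size:
            # Merging would overflow: close the current chunk and start a new one.
            chunks.append(current)
            current = piece
        else:
            # Either the merge fits, or this is an oversized piece kept whole.
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = (
    "One machine can do the work of fifty ordinary humans. "
    "No machine can do the work of one extraordinary human. "
    "It is a quote by Elbert Hubbard."
)

small = split_on_separator(text, ".", chunk_size=24)
large = split_on_separator(text, ".", chunk_size=120)
print([len(c) for c in small])
print(len(large))
```

With `chunk_size=24` every sentence is longer than the limit, so each is emitted whole. With a larger limit, adjacent sentences are merged until the next one no longer fits.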


In [10]:
# RecursiveCharacterTextSplitter tries each separator in order, falling back
# to finer ones until the pieces fit within chunk_size
from langchain_text_splitters import RecursiveCharacterTextSplitter

rc_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
)
docs = rc_splitter.split_text(quote)
print(docs)

['One machine can do the', 'work of fifty ordinary', 'humans.', 'No machine can do the', 'work of one', 'extraordinary human. It', 'It is a quote by Elbert', 'Hubbard.']
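
The recursive strategy can be sketched in plain Python as below. This is a simplified illustration that assumes no chunk overlap, so its boundaries can differ slightly from the library's output; `recursive_split` is a hypothetical name.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split on the first separator; recurse with finer ones on oversized pieces."""
    text = text.strip()
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut into fixed-width slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    pieces = [p for p in text.split(sep) if p.strip()]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # still fits: keep merging
            continue
        if current:
            chunks.append(current.strip())
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with the finer separators.
            chunks.extend(recursive_split(piece, finer, chunk_size))
    if current:
        chunks.append(current.strip())
    return chunks


quote = (
    "One machine can do the work of fifty ordinary humans.\n"
    "No machine can do the work of one extraordinary human. "
    "It is a quote by Elbert Hubbard."
)
chunks = recursive_split(quote, ["\n\n", "\n", " ", ""], chunk_size=24)
print(chunks)
```

Because coarse separators are tried first, paragraphs and lines stay together whenever they fit, and words are only broken apart when nothing else works.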


In [11]:
# Splitting html
from langchain_community.document_loaders import UnstructuredHTMLLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = UnstructuredHTMLLoader('/Users/mac/Downloads/langchain_community.document_loaders.pdf.PyPDFLoader — 🦜🔗 LangChain 0.2.17.html')


data = loader.load()

# data[0] is the same page printed in the HTML loader cell above, so it is not printed again here.

rc_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
)
# split_documents keeps each chunk's metadata from its source document
docs = rc_splitter.split_documents(data)
print(docs[0])
print(docs[1])

page_content='This is a legacy site.' metadata={'source': '/Users/mac/Downloads/langchain_community.document_loaders.pdf.PyPDFLoader

### RAG Storage and Retrieval Using a Vector Database (ChromaDB)

In [None]:
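import os

from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Assumption: the OpenAI key is exported as the OPENAI_API_KEY environment variable.
api_key = os.environ["OPENAI_API_KEY"]

# Assumption: the original cell never defines `llm`; any chat model works here,
# and the model name below is only a placeholder.
llm = ChatOpenAI(model="gpt-4o-mini", api_key=api_key)

message = '''
Review and fix the following TechStack marketing copy with the following guidelines in consideration:

Guidelines:
{guidelines}

Copy:
{copy}

Fixed Copy:
'''

prompt_template = ChatPromptTemplate.from_messages([('human', message)])

embedding_function = OpenAIEmbeddings(model="text-embedding-ada-002", api_key=api_key)

# Embed the chunks from the splitting step and persist them in a local Chroma store.
vectorstore = Chroma.from_documents(
    docs,
    embedding=embedding_function,
    persist_directory="/Users/mac/Desktop/Projects/ml-practice/datacamp/ChromaDB",
)

# Return the 2 chunks most similar to the query.
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 2})

# The input string goes to the retriever (to fetch guidelines) and, unchanged, into {copy}.
rag_chain = (
    {'guidelines': retriever, 'copy': RunnablePassthrough()}
    | prompt_template
    | llm
)

# A chat model returns a message object, not a list; its text is in `.content`.
response = rag_chain.invoke('Some text to be fixed')
print(response.content)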
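
In the chain above, the retriever's output is a list of document objects that is placed directly into `{guidelines}`. A common refinement is to join only the text of the retrieved chunks before filling the template. The sketch below shows that step with the standard library alone; `Doc` and `format_docs` are hypothetical names, and the two guideline strings are made-up sample data.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str


def format_docs(docs: list[Doc]) -> str:
    """Join the text of the retrieved chunks, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


template = "Guidelines:\n{guidelines}\n\nCopy:\n{copy}\n\nFixed Copy:\n"

# Made-up retrieval results standing in for the retriever's top-k output.
retrieved = [
    Doc("Use sentence case in headings."),
    Doc("Avoid exclamation marks."),
]

filled = template.format(
    guidelines=format_docs(retrieved),
    copy="Some text to be fixed",
)
print(filled)
```

Passing clean text rather than object representations keeps metadata and class names out of the prompt, which usually makes the model's context easier to read.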