<a href="https://colab.research.google.com/github/brettin/llm_tutorial/blob/main/tutorials/05-chains/05_langchain_rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install -U langchain langchain-community openai chromadb langchainhub bs4 tiktoken kaleido python-multipart cohere

In [None]:
import bs4
from langchain import hub
# The langchain.* paths for these classes are deprecated; import them from
# the split-out packages instead.
from langchain_community.chat_models import ChatOpenAI
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [None]:
import os
from getpass import getpass
os.environ['HUGGINGFACEHUB_API_TOKEN'] = getpass('Enter HUGGINGFACEHUB_API_TOKEN: ')
os.environ['OPENAI_API_KEY'] = getpass("Enter OPENAI_API_KEY: ")


This exercise demonstrates how to build a basic Retrieval-Augmented Generation (RAG) pipeline in Python with LangChain components. The goal is a system that retrieves relevant passages from a webpage and passes them to a language model to answer a specific question.

### Step 1: Load the Webpage Content
1.	WebBaseLoader: This class is used to load content from a web page. The web_paths argument specifies the URL of the webpage you want to load. In this case, it’s Lilian Weng’s blog post on LLM-powered autonomous agents.
2.	BeautifulSoup (bs_kwargs): The bs_kwargs argument is used to filter the content being scraped from the webpage. The parse_only parameter specifies that only certain HTML elements with classes "post-content", "post-title", and "post-header" should be parsed. This helps in focusing on the relevant content, ignoring other unnecessary parts of the webpage like advertisements or navigation menus.
3.	Loading Documents: The loader.load() method fetches and parses the content from the specified webpage, storing it in the docs variable.
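
The effect of class-based filtering can be tried offline. The sketch below is only an illustration of the idea using Python's standard-library HTML parser; it is not how BeautifulSoup or WebBaseLoader implement parse_only. The sample page and the `ClassFilter` helper are hypothetical; only the three class names come from this step.

```python
from html.parser import HTMLParser

# Class names taken from the loader configuration in this step.
KEEP_CLASSES = {"post-content", "post-title", "post-header"}


class ClassFilter(HTMLParser):
    """Collect text only from elements whose class is in KEEP_CLASSES."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # greater than 0 while inside a kept element
        self.chunks = []  # text fragments that survived the filter

    def handle_starttag(self, tag, attrs):
        classes = set((dict(attrs).get("class") or "").split())
        # Enter a kept element, or go one level deeper inside one.
        if self.depth or classes & KEEP_CLASSES:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


# Hypothetical page used only for illustration. It avoids void tags such as
# <br>, which this simple depth counter does not handle.
html = """
<nav class="menu"><a href="/">Home</a></nav>
<h1 class="post-title">Agents</h1>
<div class="post-content"><p>Planning and <b>memory</b>.</p></div>
<footer class="site-footer">Subscribe</footer>
"""

parser = ClassFilter()
parser.feed(html)
kept_text = parser.chunks
print(kept_text)
```

The navigation and footer text never reach `kept_text`, which is the same outcome the bs_kwargs filter aims for on the real page.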

In [None]:
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        )
    ),
)
docs = loader.load()

### Step 2: Split the Documents
1.	RecursiveCharacterTextSplitter: This is a utility that splits the loaded documents into smaller chunks. This is useful because language models typically have a token limit, and large documents may exceed this limit.
2.	chunk_size: Specifies the maximum size of each chunk, in characters. Here, it’s set to 1000 characters.
3.	chunk_overlap: Specifies the number of overlapping characters between consecutive chunks, ensuring some continuity between them. Here, it’s set to 200 characters.
4.	Splitting the Documents: The split_documents(docs) method applies this logic to the loaded document, resulting in a list of smaller text chunks stored in the splits variable.
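
To make chunk_size and chunk_overlap concrete, here is a minimal sketch of fixed-width chunking with overlap. It is a simplification: RecursiveCharacterTextSplitter also tries to break on separators such as paragraphs and sentences, so its chunks will usually differ from these. The `sliding_chunks` helper and the sample text are hypothetical.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-width windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Hypothetical 2500-character sample text.
sample = "".join(str(i % 10) for i in range(2500))
chunks = sliding_chunks(sample, chunk_size=1000, chunk_overlap=200)
print([len(c) for c in chunks])  # [1000, 1000, 900]
```

With these settings each window advances by 800 characters, so the last 200 characters of one chunk reappear at the start of the next.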

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)

### Step 3: Create a Vector Store and Embed the Text Chunks
1.	Chroma: Chroma is a vector store that allows for efficient storage and retrieval of vectorized documents. A vector store is a database optimized for storing and querying high-dimensional vectors.
2.	OpenAIEmbeddings: This is an embedding model from OpenAI used to convert chunks of text into high-dimensional vectors. These vectors capture the semantic meaning of the text, enabling similarity searches.
3.	from_documents: The from_documents method converts the text chunks (splits) into vectors using the specified embedding model and stores them in the vectorstore.
4.	as_retriever: This method converts the vector store into a retriever object, which can be used to fetch relevant documents based on a query.
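
The idea behind similarity search can be sketched without any external service. The example below ranks hand-written three-dimensional vectors by cosine similarity. Everything in it is an assumption for illustration: real embeddings have far more dimensions, and the distance metric Chroma actually uses depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k):
    """Return the texts of the k stored vectors most similar to query_vec."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Hypothetical (text, vector) pairs standing in for embedded chunks.
store = [
    ("Task decomposition splits a goal into steps.", [0.9, 0.1, 0.0]),
    ("Vector stores index embeddings.", [0.1, 0.9, 0.1]),
    ("Chain of thought prompts step-by-step reasoning.", [0.8, 0.2, 0.1]),
]
query_vec = [1.0, 0.0, 0.0]  # hypothetical embedding of the question

results = top_k(query_vec, store, k=2)
print(results)
```

A retriever does the same thing at scale: it embeds the query, finds the nearest stored vectors, and returns the matching chunks.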

In [None]:
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())
retriever = vectorstore.as_retriever()

### Step 4: Prepare RAG Pipeline
1.	Prompt Template: The hub.pull("rlm/rag-prompt") retrieves a pre-built prompt template for RAG (Retrieval-Augmented Generation). This template will guide how the question and the retrieved documents are fed into the language model. The print(prompt) statement is commented out, but if uncommented, it would display the prompt template.
2.	ChatOpenAI: This is the language model (in this case, GPT-4o Mini from OpenAI) that will be used to generate responses. The temperature parameter is set to 0, which makes sampling as close to deterministic as the API allows and the output less creative. This is useful for factual and precise responses, although identical outputs across runs are not guaranteed.
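
The effect of temperature can be illustrated with the standard softmax-with-temperature formulation. This is a sketch of the general technique only; the logits are hypothetical, and how the OpenAI API handles a temperature of 0 internally is not something this example claims to reproduce.

```python
import math


def next_token_probs(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature == 0:
        # Limiting case: all probability mass on the highest-scoring token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (1.0, 0.5, 0):
    print(t, [round(p, 3) for p in next_token_probs(logits, t)])
```

As the temperature drops, probability concentrates on the top-scoring token, which is why low temperatures give more repeatable answers.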

In [None]:
prompt = hub.pull("rlm/rag-prompt")
# print(prompt)
llm = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)

### Step 5: Define the RAG Chain
1.	format_docs Function: This function takes a list of documents (or chunks) and formats them into a single string by joining their page_content with a blank line (two newline characters) between entries. This formatted string will be used as the context for answering the question.
2.	RAG Chain: This is the heart of the RAG pipeline. It is a sequence of operations that:
- Uses the retriever to fetch relevant chunks of text based on the query.
- Formats these retrieved chunks using the format_docs function.
- Passes the formatted text along with the original question to the prompt.
- Feeds the output of the prompt into the language model (llm).
- Parses the model’s output using StrOutputParser() to produce a clean and readable answer.
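
The way the `|` operator wires these steps together can be sketched with a toy class. This is not LangChain's implementation; `Step`, `fan_out`, and every stand-in component below are hypothetical and exist only to show the data flow: one input is fanned out into a dict of context and question, and each later step consumes the previous step's output.

```python
class Step:
    """Tiny stand-in for a runnable: wraps a function and supports `|`."""

    def __init__(self, fn):
        self.fn = fn

    def __or__(self, other):
        # a | b yields a new step that runs a, then feeds its output to b.
        return Step(lambda x: other.invoke(self.invoke(x)))

    def invoke(self, x):
        return self.fn(x)


def fan_out(**branches):
    """Run every branch on the same input and collect the results in a dict."""
    return Step(lambda x: {name: step.invoke(x) for name, step in branches.items()})


# Hypothetical stand-ins for the real components.
retriever = Step(lambda q: [f"chunk about {q}", "another chunk"])
format_docs = Step(lambda docs: "\n\n".join(docs))
passthrough = Step(lambda x: x)
prompt = Step(lambda d: f"Context:\n{d['context']}\n\nQuestion: {d['question']}")
llm = Step(lambda text: {"content": f"(model reply to {len(text)} chars)"})
parser = Step(lambda msg: msg["content"])

chain = (
    fan_out(context=retriever | format_docs, question=passthrough)
    | prompt
    | llm
    | parser
)
answer = chain.invoke("task decomposition")
print(answer)
```

The question string reaches both branches of the first step, so the prompt sees the retrieved context and the original question side by side.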

In [None]:
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

### Step 6: Invoke the RAG Pipeline
1.	invoke Method: The invoke method triggers the entire RAG pipeline. Here, it’s called with the question "What is Task Decomposition?".
2.	Output: The system retrieves relevant information from the loaded webpage, processes it, and generates a coherent answer using the language model.

In [None]:
rag_chain.invoke("What is Task Decomposition?")

---

In this exercise, we will implement a RAG pipeline using a custom loader to fetch academic articles from the arXiv repository and a language model to generate answers to questions based on the retrieved content.

In [None]:
!pip install arxiv
!pip install pymupdf

### Step 1: Load Academic Papers from arXiv
- ArxivLoader: This component is used to load academic papers from the arXiv repository.
- query: The search query used to find relevant papers is "Antibiotic design using deep learning".
- load_max_docs=20: Limits the maximum number of documents to be loaded to 20.
- .load(): Executes the loading of the documents, which are returned as a list of Document objects.

In [None]:
# ArxivLoader lives in langchain_community; the unused ArxivRetriever import is dropped.
from langchain_community.document_loaders import ArxivLoader

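# Optional cleanup: if the first exercise ran in this session, uncomment the
# next line so its web-page chunks do not remain in the default Chroma
# collection and mix with the arXiv chunks indexed below.
# vectorstore.delete_collection()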

docs = ArxivLoader(query="Antibiotic design using deep learning", load_max_docs=20).load()

### Step 2: Split the Documents
- RecursiveCharacterTextSplitter: This component splits documents into smaller chunks, which is useful for efficient processing in subsequent steps.
- chunk_size=1000: Each chunk will contain up to 1000 characters.
- chunk_overlap=200: Overlaps chunks by 200 characters to maintain context across chunks.
- .split_documents(docs): Splits the loaded documents (docs) into smaller chunks, stored in splits.

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)

### Step 3: Create a Vector Store and Embed the Text Chunks
- Chroma: A vector store implementation used to store and retrieve documents efficiently.
- from_documents: This method creates a vector store from the split document chunks (splits).
- OpenAIEmbeddings(): Generates embeddings (vector representations) of the document chunks using OpenAI’s embedding models.
- as_retriever(): Converts the vector store into a retriever that can be queried to find relevant documents.

In [None]:
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())
retriever = vectorstore.as_retriever()

### Step 4: Prepare RAG Pipeline
- prompt: A pre-built RAG prompt template ("rlm/rag-prompt") is pulled from the LangChain Hub with hub.pull to guide response generation.
- ChatOpenAI: Initializes the language model (in this case, GPT-4o Mini) for generating text responses.
- model_name="gpt-4o-mini": Specifies which OpenAI model to use.
- temperature=0: Requests the least random sampling, so responses are close to deterministic (though not guaranteed to be identical across runs).

In [None]:
prompt = hub.pull("rlm/rag-prompt")
llm = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)

### Step 5: Document Formatting Function
- format_docs(docs): This function formats the retrieved documents into a single string, separating each document’s content with two newlines.
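
The formatting step can be checked without a retriever. The sketch below uses a hypothetical `FakeDocument` dataclass as a stand-in for the document objects a retriever returns; the only assumption it relies on is that each document exposes a page_content attribute.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Hypothetical stand-in exposing the attributes format_docs relies on."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs):
    # Join the text of each document, separated by a blank line.
    return "\n\n".join(doc.page_content for doc in docs)


retrieved = [
    FakeDocument("First chunk.", {"source": "paper-a"}),
    FakeDocument("Second chunk.", {"source": "paper-b"}),
]
context = format_docs(retrieved)
print(repr(context))  # 'First chunk.\n\nSecond chunk.'
```

Only page_content reaches the prompt. Metadata such as the source is dropped, so the model cannot cite which paper a passage came from unless the formatter is extended to include it.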

In [None]:
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

### Step 6: Define the RAG Chain
- rag_chain: This is a pipeline that performs the following:
    - retriever | format_docs: Retrieves relevant documents and formats them.
    - question: Takes an input question, passed directly using RunnablePassthrough().
    - prompt: Passes the context and question to the pre-configured prompt.
    - llm: The prompt is then passed to the language model, which generates a response.
    - StrOutputParser(): Converts the model’s output into a string format.

In [None]:
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

### Step 7: Invoke the RAG Pipeline
- rag_chain.invoke(…): Executes the RAG pipeline with the specified question, which in this case is about summarizing the current state of using deep learning in antibiotic discovery.

In [None]:
resp = rag_chain.invoke(
  """
  Can you provide a summary of the current state of applying
  deep learning to the discovery of new antibiotics?
  """
)

In [None]:
# Print the response so line breaks render as readable text.
# (resp is a plain string; str.format() would raise if it contained braces.)
print(resp)

In [None]:
# Cleanup
vectorstore.delete_collection()