# **Project Essentials**

## **What is RAG?**

RAG is a technique for augmenting LLM knowledge with additional data.

LLMs can reason about wide-ranging topics, but their knowledge is limited to the public data they were trained on, up to a specific point in time. If you want to build AI applications that can reason about private data or data introduced after a model's cutoff date, you need to augment the knowledge of the model with the specific information it needs. The process of retrieving the appropriate information and inserting it into the model prompt is known as Retrieval Augmented Generation (RAG).

  

Reference: [LangChain RAG Documentation](https://python.langchain.com/v0.2/docs/tutorials/rag/)
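
To make the "retrieve, then insert into the prompt" idea concrete, here is a minimal sketch that uses only the Python standard library. It is not how LangChain implements RAG: the word-overlap retriever, the helper names (`tokenize`, `retrieve`, `build_prompt`), and the sample passages are all assumptions made up for illustration. A real pipeline would use embeddings and a vector store, as the project below does.

```python
import re

def tokenize(text):
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, passages, k=1):
    """Return the k passages sharing the most words with the query (toy retriever)."""
    query_words = tokenize(query)
    ranked = sorted(
        passages,
        key=lambda p: len(query_words & tokenize(p)),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(query, passages, k=1):
    """Insert the retrieved passages into the prompt ahead of the question."""
    context = "\n".join(retrieve(query, passages, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Hypothetical private notes that a model was never trained on
passages = [
    "The office moved to Building 7 in March.",
    "Quarterly reports are due on the first Friday of each quarter.",
    "The cafeteria serves lunch from noon until two.",
]

prompt = build_prompt("When are quarterly reports due?", passages)
print(prompt)
```

The prompt that reaches the model now carries the one passage relevant to the question, which is the "augmentation" step in RAG.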

## **Important Files**

Libraries: All the libraries used are listed in a separate file, `requirements.txt`, which is installed in the first code cell of the project.

Sample Pdf: [LLM_Research_Paper.pdf](https://drive.google.com/file/d/1rIJpKZyd4D2u6qbPO_hW9kdT-uP0sbMK/view?usp=sharing)


You must have a `.env` file containing your own OpenAI API key under the name `OPENAI_API_KEY`. Keeping the key in this file keeps it private when you share the code.


# **Final Project**

### **Installing the required libraries**

In [None]:
!pip install -r requirements.txt

### **Loading Environment Variable**

In [None]:
# Importing necessary modules
import os
from dotenv import load_dotenv

load_dotenv()

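# Setting OPENAI_API_KEY environment variable.
# Fail early with a clear message if the key was not found in .env or the environment.
api_key = os.getenv("OPENAI_API_KEY")
if api_key is None:
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file")
os.environ["OPENAI_API_KEY"] = api_key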

### **Data Ingestion Using PDF file**

In [None]:
# Importing PyPDFLoader to read PDF files
from langchain_community.document_loaders import PyPDFLoader

#Using the sample file
loader = PyPDFLoader("LLM_Research_Paper.pdf")

### **Loading the data**

In [None]:
#Loading the pdf content to "docs"
docs = loader.load()
docs

### **Text Splitting**

In [None]:
!pip install langchain

# Importing RecursiveCharacterTextSplitter to split the retrieved Data
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Assigning parameters such as chunk size and overlap (both measured in characters)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
    )

# Storing the split documents in "chunk_documents"
chunk_documents = text_splitter.split_documents(docs)

# Display the Chunks
chunk_documents
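
To see what `chunk_size` and `chunk_overlap` mean, the sketch below cuts a string into fixed-size character windows that share a few characters with their neighbours. This is a simplification and an assumption-laden illustration: `split_with_overlap` is a hypothetical helper, not part of LangChain, and `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs and sentences rather than at arbitrary character positions.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size character windows that overlap their neighbours."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last four characters of the previous one. The overlap exists so that a sentence falling on a chunk boundary is still fully present in at least one chunk.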

### **Embedding**

**Embedding Using FAISS**

In [None]:
# Vector embedding with OpenAI, stored in a FAISS vector store

from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# Embedding the data chunks with OpenAIEmbeddings and indexing the vectors in FAISS
db = FAISS.from_documents(chunk_documents, OpenAIEmbeddings())
db

### **Searching**

In [None]:
# Writing the query, which is the search text in this case (fill in your own question)
query = ""

# Performing similarity search
retrieved_result = db.similarity_search(query)

# Displaying the results
print(retrieved_result)
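
Conceptually, a similarity search embeds the query and returns the stored chunks whose vectors lie closest to it. The sketch below shows that idea with hand-written three-dimensional vectors and Euclidean (L2) distance. The vectors, chunk labels, and the local `similarity_search` function are made up for illustration; the sketch also assumes L2 distance, which is, to my knowledge, the default in LangChain's FAISS wrapper but is worth verifying for your version.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def similarity_search(query_vector, index, k=2):
    """Return the k texts whose vectors lie closest to the query vector."""
    ranked = sorted(index, key=lambda item: l2_distance(query_vector, item[1]))
    return [text for text, _ in ranked[:k]]

# Made-up 3-dimensional "embeddings"; real embedding vectors have far more dimensions
index = [
    ("chunk about transformers", [0.9, 0.1, 0.0]),
    ("chunk about tokenization", [0.1, 0.9, 0.0]),
    ("chunk about evaluation", [0.0, 0.1, 0.9]),
]

query_vector = [0.8, 0.2, 0.1]
results = similarity_search(query_vector, index, k=2)
print(results)
```

A library such as FAISS does the same job with index structures designed to stay fast over very large collections, instead of sorting every vector as this sketch does.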

### **Review**


To check how well the search performed:
*   Open the PDF that you loaded
*   Search it manually for the same query that you entered in the code
*   This takes you to the place in the PDF where the text appears, if it is present
*   Compare that passage with the retrieved results to see how well the code performed. The aspects listed next highlight where it helps.
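
The manual check above is easier if each retrieved result is printed with its source and page. The sketch below is one possible way to do that. `summarize_results` is a hypothetical helper, and it assumes each result exposes `page_content` and a `metadata` dictionary with `source` and `page` keys, which is what `PyPDFLoader` typically attaches. The page value is likely a zero-based index, so the printed number may be one less than the page shown in a PDF viewer. Stand-in objects are used here so the example runs without the PDF or an API key.

```python
from types import SimpleNamespace

def summarize_results(results, preview_chars=80):
    """Return one line per result: source file, page index, and a text preview."""
    lines = []
    for rank, doc in enumerate(results, start=1):
        source = doc.metadata.get("source", "unknown source")
        page = doc.metadata.get("page", "?")
        # Collapse newlines and repeated spaces so the preview fits on one line
        preview = " ".join(doc.page_content.split())[:preview_chars]
        lines.append(f"{rank}. {source} (page index {page}): {preview}")
    return lines

# Stand-in objects shaped like the documents returned by the search cell
fake_results = [
    SimpleNamespace(
        page_content="Large language models\nare trained on text.",
        metadata={"source": "LLM_Research_Paper.pdf", "page": 2},
    ),
]

summary = summarize_results(fake_results)
print("\n".join(summary))
```

In the notebook, the same helper could be called as `summarize_results(retrieved_result)` after the search cell has run.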



### Aspects where efficiency can be seen
1.   Easy results
2.   Reduced time
3.   It did not involve extra background resources or a separate application to open the PDF
4.   The code is now more efficient and also takes less time than before.
5.   Flexible, as we can work with different PDFs and different queries as per our requirements.


## **Project Explanation**

Go through the [list of document loaders](https://python.langchain.com/v0.2/docs/integrations/document_loaders/).

The link contains a table that shows feature support for all available document loaders.

### **Data Ingestion Using Webpages(Optional)**


`WebBaseLoader`: loads HTML pages using `urllib` and parses them with `BeautifulSoup`. It supports lazy loading and native async.

In [None]:
# Importing necessary modules
from langchain_community.document_loaders import WebBaseLoader
import bs4

# Giving the Website link that we will read
loader = WebBaseLoader("https://python.langchain.com/v0.2/docs/tutorials/rag/")

# Loading the data to "docs"
docs = loader.load()
docs