Interact with your book 📖❓🙋🏻‍♀️

A simple demonstration of how you can implement retrieval augmented generation (RAG) for a book.

🚀 How retrieval augmented generation works

The following are the high-level steps needed to implement retrieval augmented generation.

  1. Extract text from the source. If the source is unstructured, like a PDF, the extraction can be a challenge.
  2. Index the extracted text, often as vector embeddings, and store it.
  3. Let the user ask questions related to the source.
  4. Perform a similarity search in the index and retrieve relevant text chunks.
  5. Insert these text chunks into the prompt along with the question.
  6. Request an LLM (e.g. ChatGPT) to produce an answer based only on this context (a minimal prompt-construction sketch follows the list).
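
As a taste of steps 5 and 6, here is a minimal sketch of how the retrieved chunks and the user's question can be combined into a single prompt. The chunks and the template below are hypothetical placeholders, not the exact ones used in the notebook.

    # Hypothetical chunks; in practice these come from the similarity search in step 4.
    retrieved_chunks = [
        "A CPU fetches, decodes and executes instructions.",
        "The fetch-decode-execute cycle repeats continuously while the computer is on.",
    ]
    question = "What does a CPU do with instructions?"

    # Assumed prompt template: instruct the LLM to answer only from the supplied context.
    prompt = (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you do not know.\n\n"
        "Context:\n" + "\n\n".join(retrieved_chunks) + "\n\n"
        "Question: " + question
    )
    print(prompt)  # paste this into your favorite LLM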

🌟 What you will find here

The notebook demonstrates the following steps (a minimal code sketch of steps 2-6 follows the list):

  1. Extraction of the relevant text from the Cambridge O Level Computer Science book.
    • Only the main body text, including tables, is extracted.
    • The following sections are excluded: table of contents, index, sample questions at the end of each chapter, and diagrams.
    • Every document needs to be analyzed carefully in order to extract useful text. This step is of utmost importance, since the answers to user questions depend on the quality of the input provided to the large language model.
  2. The text is split into chunks using NLTKTextSplitter, and vector embeddings are created using HuggingFaceInstructEmbeddings.
  3. LangChain's FAISS vector store is used to save the embeddings.
  4. Text relevant to the user query is retrieved from the index using similarity search.
  5. The final prompt is generated from the retrieved context and copied to the clipboard.
  6. You just need to open your favorite LLM (e.g. chat.openai.com) and paste the prompt to get the answer.
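
The snippet below is a minimal sketch of steps 2-6 using the same building blocks: NLTKTextSplitter, HuggingFaceInstructEmbeddings, LangChain's FAISS wrapper and pyperclip for the clipboard. Import paths vary between LangChain versions, and details such as the input file name, chunk size and prompt wording are assumptions rather than the notebook's exact values.

    from langchain.text_splitter import NLTKTextSplitter
    from langchain.embeddings import HuggingFaceInstructEmbeddings
    from langchain.vectorstores import FAISS
    import pyperclip

    # Hypothetical path to the text extracted from the book.
    book_text = open("books/extracted_text.txt", encoding="utf-8").read()

    # Split the book into sentence-aware chunks (sizes are assumptions).
    splitter = NLTKTextSplitter(chunk_size=1000, chunk_overlap=100)
    chunks = splitter.split_text(book_text)

    # Embed the chunks and build a FAISS index.
    embeddings = HuggingFaceInstructEmbeddings()
    index = FAISS.from_texts(chunks, embeddings)
    index.save_local("faiss_index")  # hypothetical index folder name

    # Retrieve context for a question and build the final prompt.
    question = "What is the fetch-decode-execute cycle?"
    docs = index.similarity_search(question, k=4)
    context = "\n\n".join(doc.page_content for doc in docs)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    pyperclip.copy(prompt)  # the prompt is now on the clipboard, ready to paste into an LLM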

The idea behind this simple implementation is to quickly demonstrate how things actually work in retrieval augmented generation. You do not need to acquire an API key or install an LLM locally.

🧨 Name of the game

The success of any RAG implementation mainly depends on the following aspects:

  1. The quality of the input data. If the data is unstructured, you need to carefully analyze it and extract only the relevant and useful text. The quality of the input determines the quality of the answers you will get from the LLM.
  2. You should experiment with different text splitters and decide which one works best for your application.
  3. You should test different numbers of top similar results (k) to see which gives the best performance (see the sketch after this list).
  4. You cannot rely on a universal prompt template that works with every LLM. You will have to fine-tune the prompt for the specific LLM you are using in order to get the best results.
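
For point 3, one quick way to experiment is to vary k, the number of retrieved chunks, and compare how much context each setting produces. The index folder name below is the same hypothetical one used in the earlier sketch, and the load_local arguments differ slightly between LangChain versions.

    from langchain.embeddings import HuggingFaceInstructEmbeddings
    from langchain.vectorstores import FAISS

    embeddings = HuggingFaceInstructEmbeddings()
    index = FAISS.load_local("faiss_index", embeddings)  # newer LangChain versions need extra arguments here

    question = "How does virtual memory work?"
    for k in (2, 4, 8):
        docs = index.similarity_search(question, k=k)
        total_chars = sum(len(doc.page_content) for doc in docs)
        print(f"k={k}: {len(docs)} chunks, {total_chars} characters of context")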

🎬 Getting started

You can run the notebook locally and use the final prompt to generate an answer with the help of your favorite LLM.

Prerequisites

  1. Python 3.11, preferably in a virtual environment.
  2. Access to a large language model, for example ChatGPT at chat.openai.com.

Running locally

  1. Clone the repo locally

    git clone https://github.com/mamiriqbal1/rag_book_qa_prompt.git
  2. Change into the local repo directory

    cd rag_book_qa_prompt
  3. Install all the requirements

    pip install -r requirements.txt
  4. Open the Jupyter notebook and run all the cells. Wait for execution to complete. You can then change your question and re-run the last cell as many times as you like to get the final prompt to use with your favorite LLM.

    jupyter notebook rag_book_qa_prompt.ipynb

Using your own PDF file

You can use your own PDF file with this notebook.

  1. Add the PDF file to the books subfolder.
  2. Customize the relevant code in the Python script extract_text_and_save_index.py to adequately extract the text you need from the PDF (a hedged extraction sketch follows these steps).
  3. Run the modified script to extract the text and to create and save the index.
  4. Use the provided Jupyter notebook to perform QA on your new source file.
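
How the extraction is customized depends entirely on your PDF's layout, so treat the sketch below as one possible approach using pypdf rather than a description of what extract_text_and_save_index.py actually does. The file name, page range and index folder are hypothetical placeholders.

    from pypdf import PdfReader
    from langchain.text_splitter import NLTKTextSplitter
    from langchain.embeddings import HuggingFaceInstructEmbeddings
    from langchain.vectorstores import FAISS

    reader = PdfReader("books/my_book.pdf")  # hypothetical file name

    # Keep only the main body pages; the range below stands in for your own analysis
    # of where the front matter ends and the index begins.
    body_pages = reader.pages[10:250]
    text = "\n".join(page.extract_text() or "" for page in body_pages)

    # Split, embed and save the index, mirroring the stack used by the notebook.
    chunks = NLTKTextSplitter(chunk_size=1000, chunk_overlap=100).split_text(text)
    index = FAISS.from_texts(chunks, HuggingFaceInstructEmbeddings())
    index.save_local("faiss_index")  # hypothetical index folder name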
