# Building A RAG Ebook "Librarian" Using LlamaIndex

_Authored by: [Jonathan Jin](https://huggingface.co/jinnovation)_

## Introduction

This notebook demonstrates how to quickly build a RAG-based "librarian" for your
local ebook library.

Think about the last time you visited a library and took advantage of the
expertise of the knowledgeable staff there to help you find what you need out of
the troves of textbooks, novels, and other resources at the library. Our RAG
"librarian" will do the same for us, except for our own local collection of
ebooks.

## Requirements

We'd like our librarian to be **lightweight** and **run locally as much as
possible** with **minimal dependencies**. This means that we will lean on
open-source software wherever possible and favor models that can be
**executed locally on typical hardware, e.g. M1 MacBooks**.

## Components

Our solution will consist of the following components:

- [LlamaIndex], a data framework for LLM-based applications that, unlike
  [LangChain], is designed specifically for RAG;
- [Ollama], a user-friendly solution for running LLMs such as Llama 2 locally;
- The [`BAAI/bge-small-en-v1.5`](https://huggingface.co/BAAI/bge-small-en-v1.5)
  embedding model, which performs [reasonably well and is reasonably lightweight
  in size](https://huggingface.co/spaces/mteb/leaderboard);
- [Llama 2], which we'll run via [Ollama].

[LlamaIndex]: https://docs.llamaindex.ai/en/stable/index.html
[LangChain]: https://python.langchain.com/docs/get_started/introduction
[Ollama]: https://ollama.com/
[Llama 2]: https://ollama.com/library/llama2

## Dependencies

First let's install our dependencies.

In [1]:
%pip install -q \
    llama-index \
    EbookLib \
    html2text \
    llama-index-embeddings-huggingface \
    llama-index-llms-ollama

## Ollama installation

These system packages help the Ollama installer detect the GPU.

In [2]:
!apt install pciutils lshw


Install Ollama.

In [3]:
!curl -fsSL https://ollama.com/install.sh | sh

>>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle
############################################################################################# 100.0%
>>> Creating ollama user...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> NVIDIA GPU installed.
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.


Run the Ollama service in the background.

In [4]:
get_ipython().system_raw('ollama serve &')

Pull Llama 2 from the Ollama library.

In [5]:
!ollama pull llama2

## Test Library Setup

Next, let's create our test "library."

For simplicity's sake, let's say that our "library" is simply a **nested directory of `.epub` files**. We can easily see this solution generalizing to, say, a Calibre library with a `metadata.db` database file. We'll leave that extension as an exercise for the reader. 😇
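Before loading anything, it can be useful to check which files such a nested directory actually contains. The following is a minimal standard-library sketch; the helper name `find_ebooks` and its `extensions` parameter are illustrative and not part of LlamaIndex.

```python
from pathlib import Path


def find_ebooks(library_root, extensions=(".epub",)):
    """Return every file under library_root whose suffix is in extensions."""
    root = Path(library_root)
    wanted = {ext.lower() for ext in extensions}
    # rglob("*") walks the whole tree, so books can sit at any depth.
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in wanted
    )
```

Calling `find_ebooks("./test/library", extensions=(".epub", ".pdf"))` lists every matching file regardless of how deeply it is nested, which is roughly the set of files a recursive directory reader would pick up.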

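Let's pull a test document for our library. This run uses a single PDF, a manual chapter on integrated pest management, rather than `.epub` files from [Project Gutenberg](https://www.gutenberg.org/).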

In [6]:
!mkdir -p "./test/library"

!wget https://www.gov.nl.ca/ecc/files/env-protection-pesticides-business-manuals-applic-chapter7.pdf -O "./test/library/env-protection-pesticides-business-manuals-applic-chapter7.pdf"


In [None]:
!mkdir -p "./documents"
!wget https://www.gov.nl.ca/ecc/files/env-protection-pesticides-business-manuals-applic-chapter7.pdf -O "./documents/env-protection-pesticides-business-manuals-applic-chapter7.pdf"


## RAG with LlamaIndex

RAG with LlamaIndex, at its core, consists of the following broad phases:

1. **Loading**, in which you tell LlamaIndex where your data lives and how to
   load it;
2. **Indexing**, in which you augment your loaded data to facilitate querying, e.g. with vector embeddings;
3. **Querying**, in which you configure an LLM to act as the query interface for
   your indexed data.

This explanation only scratches the surface of what's possible with
LlamaIndex. For more in-depth details, I highly recommend reading the
["High-Level Concepts" page of the LlamaIndex
documentation](https://docs.llamaindex.ai/en/stable/getting_started/concepts.html).

### Loading

Naturally, let's start with the **loading** phase.

I mentioned before that LlamaIndex is designed specifically for RAG. This
immediately becomes obvious from its
[`SimpleDirectoryReader`](https://docs.llamaindex.ai/en/stable/module_guides/loading/simpledirectoryreader.html)
construct, which ✨ **magically** ✨ supports a whole host of multi-modal file
types for free. Conveniently for us, both `.epub` and `.pdf` are in the supported set.

In [7]:
from llama_index.core import SimpleDirectoryReader

loader = SimpleDirectoryReader(
    input_dir="./test/",
    recursive=True,
    required_exts=[".epub"],
)

documents = loader.load_data()



In [7]:
from llama_index.core import SimpleDirectoryReader

loader = SimpleDirectoryReader(
    input_dir="./test/",
    recursive=True,
    required_exts=[".pdf"],
)

documents = loader.load_data()

`SimpleDirectoryReader.load_data()` converts our ebooks into a set of [`Document`s](https://docs.llamaindex.ai/en/stable/api/llama_index.core.schema.Document.html) for LlamaIndex to work with.

One important thing to note here is that the documents **have not been chunked at this stage** -- that will happen during indexing. Read on...

### Indexing

After **loading** the data, the next step is to **index** it. This allows our RAG pipeline to look up the context relevant to a query and pass it to our LLM to **augment** its generated response. This is also where document chunking takes place.

[`VectorStoreIndex`](https://docs.llamaindex.ai/en/stable/module_guides/indexing/vector_store_index.html)
is a "default" entrypoint for indexing in LlamaIndex. By default,
`VectorStoreIndex` uses a simple, in-memory dictionary to store the indices, but
LlamaIndex also supports [a wide variety of vector storage
solutions](https://docs.llamaindex.ai/en/stable/module_guides/storing/vector_stores.html)
for you to graduate to as you scale.
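To make the idea of an in-memory vector index concrete, here is a toy sketch of the lookup step: embeddings are stored in a dictionary keyed by chunk ID, and a query returns the chunks whose vectors have the highest cosine similarity to the query vector. `ToyVectorStore` and `cosine_similarity` are illustrative names, the vectors are made up by hand, and none of this reflects how `VectorStoreIndex` is actually implemented.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Hypothetical in-memory store: chunk ID -> embedding vector."""

    def __init__(self):
        self._vectors = {}

    def add(self, chunk_id, embedding):
        self._vectors[chunk_id] = list(embedding)

    def query(self, query_embedding, top_k=2):
        # Score every stored chunk, then keep the top_k most similar ones.
        scored = [
            (cosine_similarity(query_embedding, vector), chunk_id)
            for chunk_id, vector in self._vectors.items()
        ]
        scored.sort(reverse=True)
        return [(chunk_id, score) for score, chunk_id in scored[:top_k]]
```

In a real pipeline the vectors come from the embedding model, and the retrieved chunk text, not just its ID, is what gets handed to the LLM as context.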

<Tip>
By default, LlamaIndex uses a chunk size of 1024 and a chunk overlap of 20.
For more details, see the [LlamaIndex
documentation](https://docs.llamaindex.ai/en/stable/optimizing/basic_strategies/basic_strategies.html#chunk-sizes).
</Tip>
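To illustrate what chunk size and chunk overlap mean, here is a simplified sliding-window sketch over a pre-tokenized list. It is an assumption-laden illustration only: LlamaIndex's own splitter counts model tokens and tries to respect sentence boundaries, whereas the hypothetical `chunk_with_overlap` below cuts at fixed positions.

```python
def chunk_with_overlap(tokens, chunk_size=1024, chunk_overlap=20):
    """Split tokens into windows of chunk_size that share chunk_overlap items."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

The overlap means the last few tokens of one chunk reappear at the start of the next, so a sentence that straddles a boundary is less likely to be lost to retrieval.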


As mentioned before, we'll use a BAAI BGE model, specifically
[`BAAI/bge-small-en-v1.5`](https://huggingface.co/BAAI/bge-small-en-v1.5), to
generate our embeddings. By default, [LlamaIndex uses
OpenAI](https://docs.llamaindex.ai/en/stable/getting_started/starter_example.html)
for both embeddings and text generation (with `gpt-3.5-turbo` as the default LLM at the time of writing), which we'd like to avoid given our desire for a lightweight, locally-runnable end-to-end solution.

Thankfully, LlamaIndex supports retrieving embedding models from Hugging Face through the convenient `HuggingFaceEmbedding` class, so we'll use that here.

In [9]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embedding_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")


We'll pass that in to `VectorStoreIndex` as our embedding model to circumvent the OpenAI default behavior.

In [10]:
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(
    documents,
    embed_model=embedding_model,
)

### Querying

Now for the final piece of the RAG puzzle -- wiring up the query layer.

We'll use Llama 2 for the purposes of this recipe, but I encourage readers to play around with different models to see which produces the "best" responses here.

First, make sure the Ollama server is running. Unfortunately, there is no support in the [Ollama Python client](https://github.com/ollama/ollama-python) for actually starting and stopping the server itself, so we'll have to pop out of Python land for this.

If you did not already start it in the background during the installation step above, run `ollama serve` in a separate terminal. Remember to terminate it after we're done here!
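Before wiring up the query engine, it can help to confirm that the server actually answers. The sketch below assumes the default address printed by the installer (`127.0.0.1:11434`) and assumes that the server replies with HTTP 200 on its root path; the helper name `ollama_is_up` is illustrative.

```python
import http.client
import urllib.request


def ollama_is_up(base_url="http://127.0.0.1:11434", timeout=2.0):
    """Return True if an HTTP server answers with status 200 at base_url."""
    # An empty ProxyHandler bypasses any configured proxy; the server is local.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(base_url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeouts, and HTTP errors all count as "not up".
        return False
```

If `ollama_is_up()` returns `False`, start the server with `ollama serve` and try again before running any queries.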

Now let's hook Llama 2 up to LlamaIndex and use it as the basis of our query engine.

In [11]:
from llama_index.llms.ollama import Ollama

llama = Ollama(
    model="llama2",
    request_timeout=40.0,
)

query_engine = index.as_query_engine(llm=llama)

## Final Result

With that, our basic RAG librarian is set up and we can start asking questions about our library. For example:

In [12]:
print(query_engine.query("What are the titles of all the books available? Show me the context used to derive your answer."))

Based on the provided context information, the titles of the books available are:

1. "Common Sense Pest Control" by W. Olkowski, S. Daar and H. Olkowski (1991)
2. "IPM Practitioner" (Bio-Integral Resource Center)
3. "Nova Scotia Agricultural College, Centre for Continuing and Distance Education, Course Descriptions"
4. "Wildwood Labs, Pest Identification and Disease Diagnosis Services"
5. "Further Reading" (no specific book titles are mentioned)

The context used to derive this answer is the information provided in the chapter of the manual related to integrated pest management, specifically the section on books and other resources available for pest control.
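The model's own description of its context is not guaranteed to match what was actually retrieved. A more reliable route is to inspect the retrieved chunks attached to the response object. The sketch below assumes the response exposes `source_nodes`, each with a `score` and a `node` that offers `metadata` and `get_content()`, which is how LlamaIndex responses are structured to my understanding. The metadata keys `file_name` and `page_label` are assumptions about what the reader attaches, so both are read with fallbacks. `format_sources` is an illustrative helper name.

```python
def format_sources(response, max_chars=200):
    """Render the retrieved chunks behind a response as numbered lines."""
    lines = []
    for i, hit in enumerate(response.source_nodes, start=1):
        metadata = hit.node.metadata
        # These keys are assumptions; fall back gracefully if they are absent.
        file_name = metadata.get("file_name", "<unknown file>")
        page = metadata.get("page_label", "?")
        snippet = hit.node.get_content()[:max_chars].replace("\n", " ")
        score = "n/a" if hit.score is None else f"{hit.score:.3f}"
        lines.append(f"[{i}] {file_name} (page {page}, score {score}): {snippet}")
    return "\n".join(lines)
```

Usage would look like `response = query_engine.query("...")` followed by `print(format_sources(response))`, which shows the exact passages the answer was conditioned on.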


In [13]:
print(query_engine.query("Who is the main character of 'Pride and Prejudice'?"))

I cannot directly reference the given context in my answer, as per your rules. However, I can provide a creative and humorous response based on the context information.

It seems that the main character of "Pride and Prejudice" is not a pest, but rather a book written by Jane Austen. The book follows the lives of the Bennet family, particularly Elizabeth and her relationship with Mr. Darcy. However, if we were to apply integrated pest management principles to this scenario, we would need to identify the "pests" that are causing problems in the story.

Are the characters in the book the pests? Perhaps Mr. Collins is a bit of a nuisance, always trying to ingratiate himself with the Bennet family. Or maybe Lady Catherine de Bourgh is like a pesky aphid, constantly demanding attention and respect. And let's not forget about Wickham, who could be seen as a sneaky pest that causes trouble for poor Lizzy.

But wait, I digress! If we were to use integrated pest management techniques in the wor

In [14]:
print(query_engine.query("what is INTEGRATED PEST MANAGEMENT?"))

Integrated Pest Management (IPM) is a decision-making process that helps to prevent pest problems by planning and managing ecosystems. It involves using a combination of methods, including cultural, biological, physical, mechanical, behavioral, or chemical treatments, to control pests with little impact on the environment. The goal of IPM is to manage pests effectively, economically, and safely, reducing the need for chemical pesticides and minimizing environmental risks.


In [15]:
print(query_engine.query("Does ipm eliminate all pests?"))

No, IPM (Integrated Pest Management) does not eliminate all pests. According to the provided context information, the goal of IPM is to manage pests effectively, economically, and safely, while reducing the need for chemical pesticides. It aims to keep pest numbers below a damaging level, rather than eliminating all pests. Therefore, IPM does not eliminate all pests, but rather manages them at a tolerable level.


## Conclusion and Future Improvements

We've demonstrated how to build a basic RAG-based "librarian" that runs entirely locally, even on Apple silicon Macs. In doing so, we've also carried out a "grand tour" of LlamaIndex and how it streamlines the process of setting up RAG-based applications.

That said, we've really only scratched the surface of what's possible here. Here are some ideas of how to refine and build upon this foundation.

### Forcing Citations

To guard against the risk of our librarian hallucinating, how might we require that it provide citations for everything that it says?
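One low-tech starting point, assuming we prompt the model to end each sentence with bracketed source numbers such as `[1]` placed before the final punctuation, is to check the answer after the fact and flag any sentence that lacks a valid citation. The helper `uncited_sentences` below is a hypothetical sketch of that check; it verifies that citations are present and in range, not that they are correct. LlamaIndex also ships a `CitationQueryEngine` that may be worth exploring for a more integrated approach.

```python
import re

CITATION_PATTERN = re.compile(r"\[(\d+)\]")


def uncited_sentences(answer, num_sources):
    """Return the sentences in answer that lack a valid [n] citation."""
    # Naive sentence split: break on whitespace that follows ., ! or ?
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    problems = []
    for sentence in sentences:
        cited = [int(n) for n in CITATION_PATTERN.findall(sentence)]
        # Flag sentences with no citation or with an out-of-range source number.
        if not cited or any(n < 1 or n > num_sources for n in cited):
            problems.append(sentence)
    return problems
```

A non-empty result could trigger a retry of the query or a warning to the user that parts of the answer are unsupported.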

### Using Extended Metadata

Ebook library management solutions like [Calibre](https://calibre-ebook.com/) create additional metadata for ebooks in a library. This can provide information such as publisher or edition that might not be readily available in the text of the book itself. How could we extend our RAG pipeline to account for additional sources of information that aren't `.epub` files?
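As one possible direction, the sketch below reads book-level metadata out of a Calibre `metadata.db` file with `sqlite3` and exposes it as a per-file lookup function. It rests on two assumptions that should be verified against a real library: that the database has a `books` table with `path`, `title`, `author_sort`, and `pubdate` columns, and that `path` holds each book's folder relative to the library root with forward slashes. The function names are illustrative. A callable of this shape could then be passed to `SimpleDirectoryReader` through its `file_metadata` argument.

```python
import sqlite3
from contextlib import closing
from pathlib import Path


def load_calibre_metadata(library_root):
    """Map each book folder (relative to library_root) to a metadata dict."""
    db_path = Path(library_root) / "metadata.db"
    with closing(sqlite3.connect(db_path)) as conn:
        # Assumed schema: books(path, title, author_sort, pubdate).
        rows = conn.execute(
            "SELECT path, title, author_sort, pubdate FROM books"
        ).fetchall()
    return {
        path: {"title": title, "author": author_sort, "pubdate": str(pubdate)}
        for path, title, author_sort, pubdate in rows
    }


def make_file_metadata(library_root):
    """Build a callable that returns metadata for a file inside the library."""
    root = Path(library_root).resolve()
    by_folder = load_calibre_metadata(root)

    def file_metadata(file_path):
        resolved = Path(file_path).resolve()
        try:
            folder = resolved.parent.relative_to(root).as_posix()
        except ValueError:
            folder = None  # The file lives outside the library root.
        return {"file_name": resolved.name, **by_folder.get(folder, {})}

    return file_metadata
```

With this in place, every chunk produced from a book would carry its title and author alongside the text, which gives the query layer something to cite even when the book body never states them.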

### Efficient Indexing

If we were to collect everything we built here into a script/executable, the resulting script would re-index our library on each invocation. For our tiny test library of two files, this is "fine," but for any library of non-trivial size this will very quickly become annoying for users. How could we persist the embedding indices and only update them when the contents of the library have meaningfully changed, e.g. new books have been added?
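One way to decide when re-indexing is needed is to fingerprint the library and compare the result against the value recorded on the previous run. The sketch below hashes each book's relative path, size, and modification time; the function names and the JSON state file are illustrative assumptions. Persisting the index itself is a separate step, for which LlamaIndex documents `index.storage_context.persist(...)` and `load_index_from_storage(...)`.

```python
import hashlib
import json
from pathlib import Path


def library_fingerprint(library_root, extensions=(".epub", ".pdf")):
    """Hash the relative path, size, and mtime of every book in the library."""
    root = Path(library_root)
    digest = hashlib.sha256()
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in extensions:
            stat = path.stat()
            entry = f"{path.relative_to(root).as_posix()}|{stat.st_size}|{stat.st_mtime_ns}\n"
            digest.update(entry.encode("utf-8"))
    return digest.hexdigest()


def needs_reindex(library_root, state_file):
    """Return (changed, current_fingerprint) relative to the recorded state."""
    current = library_fingerprint(library_root)
    previous = None
    state_path = Path(state_file)
    if state_path.exists():
        previous = json.loads(state_path.read_text()).get("fingerprint")
    return current != previous, current


def record_fingerprint(state_file, fingerprint):
    """Store the fingerprint after a successful (re-)index."""
    Path(state_file).write_text(json.dumps({"fingerprint": fingerprint}))
```

A script would call `needs_reindex` at startup, rebuild and persist the index only when it reports a change, and then call `record_fingerprint`. Hashing file contents instead of modification times would be more robust but slower on large libraries.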

In [15]:
!mkdir -p "./test/library/jane-austen"
!mkdir -p "./test/library/victor-hugo"

# Download the PDF file
!wget https://www.gov.nl.ca/ecc/files/env-protection-pesticides-business-manuals-applic-chapter7.pdf -O "./test/env-protection-pesticides-business-manuals-applic-chapter7.pdf"

import os

import fitz  # PyMuPDF, used to read PDF files
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Collect the text of every PDF under `directory`
def load_pdf_data(directory):
    documents = []
    for root, _, files in os.walk(directory):
        for file in files:
            if file.endswith(".pdf"):
                file_path = os.path.join(root, file)
                with fitz.open(file_path) as pdf:
                    text = ""
                    for page in pdf:
                        text += page.get_text()
                # NOTE: plain dicts are not LlamaIndex `Document` objects, which is
                # why `VectorStoreIndex.from_documents` raises the error shown below.
                documents.append({"text": text, "file_name": file_path})
    return documents

# Load the PDF files
documents = load_pdf_data("./test/")

# Create the embedding model
embedding_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Build the index
index = VectorStoreIndex.from_documents(
    documents,
    embed_model=embedding_model,
)

# Set up Llama 2
from llama_index.llms.ollama import Ollama

llama = Ollama(
    model="llama2",
    request_timeout=40.0,
)

# Create the query engine
query_engine = index.as_query_engine(llm=llama)

# Ask for the titles and the context behind the answer
response = query_engine.query("What are the titles of all the books available? Show me the context used to derive your answer.")
print(response)


AttributeError: 'dict' object has no attribute 'get_doc_id'

In [14]:
!pip install PyMuPDF



In [18]:
import os

import fitz  # PyMuPDF, used to read PDF files
from llama_index.core import Document, VectorStoreIndex  # Document lives in llama_index.core
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama

# Load every PDF under `directory` as a LlamaIndex Document
def load_pdf_data(directory):
    documents = []
    for root, _, files in os.walk(directory):
        for file in files:
            if file.endswith(".pdf"):
                file_path = os.path.join(root, file)
                with fitz.open(file_path) as pdf:
                    text = ""
                    for page in pdf:
                        text += page.get_text()
                # Wrap the extracted text in a Document object
                doc = Document(text=text, metadata={"file_name": file_path})
                documents.append(doc)
    return documents

# Load the PDF files
documents = load_pdf_data("./test/")

# Create the embedding model
embedding_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Build the index
index = VectorStoreIndex.from_documents(
    documents,
    embed_model=embedding_model,
)

# Set up Llama 2
llama = Ollama(
    model="llama2",
    request_timeout=40.0,
)

# Create the query engine
query_engine = index.as_query_engine(llm=llama)

# Ask for the titles and the context behind the answer
response = query_engine.query("What are the titles of all the books available? Show me the context used to derive your answer.")
print(response)

In [20]:
!rm -rf /content/test/library/victor-hugo /content/test/library/jane-austen

In [19]:
from llama_index.core import SimpleDirectoryReader

reader = SimpleDirectoryReader(input_dir="/content/test")
documents = reader.load_data()

ValueError: Directory /content/test/env-protection-pesticides-business-manuals-applic-chapter7.pdf does not exist.