# Interact with your book 📖❓🙋🏻‍♀️

A simple demonstration of how you can implement retrieval augmented generation for a book.

## How retrieval augmented generation works

The high-level steps for implementing retrieval augmented generation are:

1. Extract text from the source. If the source is unstructured, such as a PDF, extraction can be a challenge.
2. Index the extracted text, often as vector embeddings, and store the index.
3. Let the user ask questions related to the source.
4. Perform a similarity search in the index and retrieve the relevant text chunks.
5. Insert these text chunks into the prompt along with the question.
6. Ask an LLM (e.g. ChatGPT) to produce an answer based *only* on the context.
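The steps above can be sketched end to end in plain Python. This is a toy illustration, not the approach used in the rest of this notebook: the `embed` function is a bag-of-words stand-in for a real embedding model, and every name in it (`embed`, `cosine`, `build_index`, `retrieve`, `build_prompt`, the `toy_*` variables) is made up for the example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks):
    """Step 2: store each chunk next to its vector."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index, question, k=2):
    """Step 4: rank chunks by similarity to the question and keep the top k."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(chunks, question):
    """Step 5: place the retrieved chunks in the prompt ahead of the question."""
    context = "\n".join(chunks)
    return (
        "Use only the following context to answer.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )


# Step 1 is assumed done: these strings stand in for extracted text.
toy_chunks = [
    "A parity bit is added to a byte to detect transmission errors.",
    "A check digit is the final digit in a code such as an ISBN.",
    "RAM is volatile memory used by the processor.",
]
toy_index = build_index(toy_chunks)
toy_question = "What is a check digit?"  # Step 3
toy_prompt = build_prompt(retrieve(toy_index, toy_question, k=1), toy_question)
print(toy_prompt)  # Step 6 would send this prompt to an LLM
```

A real pipeline replaces `embed` with a trained embedding model and the sorted list with a vector index, but the flow of data between the steps stays the same.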

## Import necessary packages

In [24]:
import os
import glob
import time

from PyPDF2 import PdfReader
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain import PromptTemplate
from langchain.vectorstores import FAISS

### Load the PDF of the O Level Computer Science textbook using PyPDF2

In [25]:

reader = PdfReader("books/Cambridge IGCSE and O Level Computer Science.pdf")

### Ensure proper text extraction

We need to ensure that only relevant text is extracted, so only the main body text is kept.

The following are excluded:

- table of contents,
- index,
- sample questions at the end of each chapter,
- diagrams,
- tables, and other elements that are not part of the main text.

We achieved the exclusions using the following three rules:

1. The main body text always uses a specific font, so we filter on that font.
2. We identified the page ranges of the main text of each chapter. Only these pages are extracted.
3. The text is extracted page by page, and some pages yield only a few words. All such pages are discarded.
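The three rules can be tried out without a PDF. The sketch below is an illustration only: the helper names (`expand_intervals`, `make_font_collector`, `is_long_enough`), the font name `BodyFont`, and the page ranges are all hypothetical. The visitor takes the same five arguments as the `visitor_text` callback used in the next cell, and the example calls it by hand to simulate a reader walking one page.

```python
def expand_intervals(intervals):
    """Rule 2: turn inclusive [first, last] ranges into a set of page numbers."""
    pages = set()
    for first, last in intervals:
        pages.update(range(first, last + 1))
    return pages


def make_font_collector(font_name):
    """Rule 1: return (parts, visitor); the visitor keeps text in the given font."""
    parts = []

    def visitor(text, cm, tm, font_dict, font_size):
        if font_dict is not None and font_name in font_dict.get("/BaseFont", ""):
            parts.append(text)

    return parts, visitor


def is_long_enough(text, min_chars=100):
    """Rule 3: report whether a page has more than min_chars characters."""
    return len(text) > min_chars


# Hypothetical values for illustration only.
body_pages = expand_intervals([[2, 3], [6, 6]])
parts, visitor = make_font_collector("BodyFont")

# Simulate the callbacks made while one page is being read.
visitor("Main text. ", None, None, {"/BaseFont": "/ABCDEF+BodyFont"}, 10)
visitor("Figure 1.2", None, None, {"/BaseFont": "/ABCDEF+CaptionFont"}, 8)
visitor("More text.", None, None, {"/BaseFont": "/ABCDEF+BodyFont"}, 10)
page_text = "".join(parts)

print(sorted(body_pages))
print(page_text)
print(is_long_enough(page_text))
```

Creating a fresh collector per page is one way to avoid sharing a module-level list between pages; whether that is preferable to a single shared buffer is a design choice.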

In [26]:
included_pages_intervals = [[14, 52],
                 [57, 82],
                 [87, 155],
                 [159, 188],
                 [192, 225],
                 [229, 264],
                 [270, 306],
                 [311, 348],
                 [351, 365],
                 [368, 393]]

# Expand the inclusive [first, last] intervals into a set of one-based page numbers.
included_pages = set()
for first, last in included_pages_intervals:
    included_pages.update(range(first, last + 1))


def include_page(page_number):
    # enumerate() counts pages from 0, while the intervals above are one-based.
    return (page_number + 1) in included_pages

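# Text fragments collected by the visitor for the page currently being read.
parts = []

def visitor_body(text, cm, tm, fontDict, fontSize):
    # Keep only text set in the body font.
    if fontDict is not None and '/ILTBBB+OfficinaSansStd' in fontDict['/BaseFont']:
        parts.append(text)

def extract_single_page(page):
    # The visitor fills `parts` as a side effect; the return value is not needed.
    page.extract_text(visitor_text=visitor_body)
    text_body = "".join(parts)
    text_body = text_body.replace('\n', ' ')
    return text_body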


def extract_pages(pdf_reader, source):
    documents = []
    
    for page_number, page in enumerate(pdf_reader.pages):
        if include_page(page_number):
            doc = Document(
                    page_content = extract_single_page(page),
                    metadata={"source": source, "page": page_number},
                    ) 
            # Keep only pages with more than 100 characters of body text.
            if len(doc.page_content) > 100:
                documents.append(doc)
            # Empty the shared buffer in place before reading the next page.
            parts.clear()
    return documents


documents = extract_pages(reader, "Cambridge IGCSE and O Level Computer Science.pdf")

print('pages extracted: ' + str(len(documents)))

pages extracted: 263


### Create chunks of size 800 with no overlap

In [27]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 800,
    chunk_overlap = 0
)

texts = text_splitter.split_documents(documents)

print(f'We have created {len(texts)} chunks from {len(documents)} pages')

We have created 547 chunks from 263 pages
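To see what `chunk_size` and `chunk_overlap` control, here is a deliberately simplified splitter. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter tries a list of separators in turn); this sketch only cuts at the last space before the size limit and falls back to a hard cut. The function name `split_text` and the demo text are made up for the example.

```python
def split_text(text, chunk_size=800, chunk_overlap=0):
    """Greedy splitter: cut at the last space before chunk_size, else hard-cut."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")

    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Prefer to break on whitespace so words stay intact.
            cut = text.rfind(" ", start, end)
            if cut > start:
                end = cut
        chunks.append(text[start:end].strip())
        if end == len(text):
            break
        # Step forward, re-reading the last chunk_overlap characters if requested.
        start = max(end - chunk_overlap, start + 1)
    return [chunk for chunk in chunks if chunk]


demo_text = "word " * 50
demo_chunks = split_text(demo_text, chunk_size=40)
print(len(demo_chunks), max(len(chunk) for chunk in demo_chunks))
```

With no overlap, every word lands in exactly one chunk. A non-zero overlap repeats the tail of one chunk at the head of the next, which can help when an answer straddles a chunk boundary, at the cost of a larger index.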


### Create vector embeddings and save locally

In [28]:
# %%time

# ### download embeddings model
# embeddings = HuggingFaceInstructEmbeddings(
#     model_name = 'sentence-transformers/all-MiniLM-L6-v2',
#     model_kwargs = {"device": "cpu"}
# )

# ### create embeddings and DB
# vectordb = FAISS.from_documents(
#     documents = texts, 
#     embedding = embeddings
# )

# ### persist vector database
# vectordb.save_local("faiss_index_hp")

### Load the previously saved vector embeddings

In [29]:
%%time

### download embeddings model
embeddings = HuggingFaceInstructEmbeddings(
    model_name = 'sentence-transformers/all-MiniLM-L6-v2',
    model_kwargs = {"device": "cpu"}
)

### load vector DB embeddings
vectordb : FAISS = FAISS.load_local(
    "faiss_index_hp",
    embeddings
)

load INSTRUCTOR_Transformer
max_seq_length  512
CPU times: user 250 ms, sys: 112 ms, total: 362 ms
Wall time: 392 ms


### Verify that similarity search is working

In [30]:
### test if vector DB was loaded correctly
results = vectordb.similarity_search('check digit')
results

[Document(page_content='(VIN). Check digits are used to identify errors in data entry  caused by mis-typing  or mis-scanning a barcode. They can usually detect the following types of error:  an incorrect digit entered, for example 5327 entered instead of 5307 transposition errors where two numbers have changed order, for example 5037  instead of 5307 omitted or extra digits, for example 537 instead of 5307 or 53107 instead  of 5307  phonetic errors, for example 13 (thirteen), instead of 30 (thirty). There are a number of different methods used to generate a check digit. Two common methods will be considered here:  ISBN 13 Modulo-11', metadata={'source': 'Cambridge IGCSE and O Level Computer Science.pdf', 'page': 71}),
 Document(page_content='A format check checks that the characters entered conform to a pre-defined  pattern, for example, in Chapter 9 the cub number must be in the form CUB9999. The pseudocode for this example will be given in the string handling section of Chapter 9. A 
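Conceptually, a similarity search embeds the query and returns the stored chunks whose vectors lie closest to it. The sketch below shows an exhaustive version of that idea. It assumes squared Euclidean distance as the metric, which may not match how the index above is configured, and it uses hypothetical 3-dimensional vectors; `squared_l2`, `top_k`, and `toy_store` are names invented for the example.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec, stored, k=4):
    """Return the k (distance, text) pairs closest to query_vec, nearest first."""
    scored = [(squared_l2(query_vec, vec), text) for text, vec in stored]
    return sorted(scored)[:k]


# Hypothetical embeddings; real models produce hundreds of dimensions.
toy_store = [
    ("check digit", [0.9, 0.1, 0.0]),
    ("format check", [0.7, 0.3, 0.1]),
    ("solid state drive", [0.0, 0.2, 0.9]),
]
nearest = top_k([1.0, 0.0, 0.0], toy_store, k=2)
print([text for _, text in nearest])
```

Comparing the query against every stored vector is affordable for a few hundred chunks; dedicated vector indexes exist to keep this lookup fast as the collection grows.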

### Create a prompt template that requires the LLM to answer based only on the provided context

In [31]:
prompt_template = """
Don't try to make up an answer, if you don't know just say that you don't know.
Answer in the same language the question was asked.
Use only the following pieces of context to answer the question at the end.

{context}

Question: {question}
Answer:"""

### Configure the retriever to use the top 5 results from the similarity search

In [32]:
from langchain.vectorstores.base import VectorStoreRetriever

# search_type is an argument of as_retriever itself; search_kwargs holds the search parameters.
retriever : VectorStoreRetriever = vectordb.as_retriever(search_type = "similarity", search_kwargs = {"k": 5})

### Provide the question to be answered
- The final prompt, including the context, is copied to the clipboard.
- You can paste the prompt into an LLM interface (e.g. chat.openai.com) and get your answer!

In [33]:
query = 'What is the weakness of parity check?'

docs = retriever.get_relevant_documents(query)
merged_context = ''
for doc in docs:
    merged_context = merged_context + ' ' + doc.page_content

final_prompt = prompt_template.format(context=merged_context, question=query)
print(final_prompt)

import pyperclip
pyperclip.copy(final_prompt)


Don't try to make up an answer, if you don't know just say that you don't know.
Answer in the same language the question was asked.
Use only the following pieces of context to answer the question at the end.

 Parity checking is one method used to check whether data has been changed or  corrupted following data transmission. This method is based on the number of 1-bits in a byte of data. The parity can be either called EVEN  (that is, an even number of 1-bits in the  byte) or ODD  (that is, an odd number of 1-bits in the byte). One of the bits in  the byte (usually the most significant bit or left-most bit) is reserved for a parity  bit. The parity bit is set according to whether the parity being used is even or  odd. For example, consider the byte: In this example, if the byte is using even parity, then the parity bit needs to be set to 0, since there is already an even number of 1-bits in the byte (four 1-bits). We thus get: In this example, if the byte is using odd parity, then the
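The merged context above is one undifferentiated block of text. A possible variation, sketched below, tags each chunk with the page it came from so that an answer can be traced back to the book. The `Chunk` class is a hypothetical stand-in for the retrieved documents, `merge_with_sources` is a made-up helper, and the sample texts are invented. The `+ 1` reflects that the extraction step stores zero-based page indices in the `page` metadata.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Minimal stand-in for a retrieved document (hypothetical, for illustration)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def merge_with_sources(chunks):
    """Join chunks into one context string, tagging each with its one-based page."""
    blocks = []
    for chunk in chunks:
        # Stored page indices are zero-based, so add 1 for display.
        page = chunk.metadata.get("page")
        label = f"[page {page + 1}]" if page is not None else "[page unknown]"
        blocks.append(f"{label} {chunk.page_content.strip()}")
    return "\n\n".join(blocks)


demo_chunks_with_pages = [
    Chunk("First retrieved passage.", {"page": 66}),
    Chunk("  Second retrieved passage.  "),
]
demo_context = merge_with_sources(demo_chunks_with_pages)
print(demo_context)
```

The resulting string can be passed as the `context` value of the prompt template in place of the plain concatenation.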

