# AskMyPDF 

AskMyPDF is a simple RAG (retrieval-augmented generation) project that uses a *local* LLM both to embed a PDF document and to answer user questions, using that document as context.

The application is built around:
- `langchain` runnables
- `llama-cpp-python`, the Python bindings for the C++ `llama.cpp` library, which allow running GGUF models
- `ChromaDB` as a vector store for the embeddings
- `streamlit` to build the basic UI and run it on a local server.

This notebook goes through the basic steps, simplifying where necessary to keep it short and notebook- and walkthrough-friendly.

## 0. Getting started

To run this notebook, make sure you have the following dependencies installed:
- `langchain`, both `core` and `community`
- `llama-cpp-python`
- `chromadb`
- `pypdf`
- `python-dotenv`: the import cell uses `dotenv_values` to read the model and PDF paths from a `.env` file.
- `gradio`: the last cells of the notebook use a `Gradio` interface for quick in-notebook experimentation instead of a `streamlit` implementation.
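
A quick way to confirm the environment is ready is to check which of the required modules can be found before running the rest of the notebook. The snippet below is a minimal sketch: the helper `missing_modules` is hypothetical, and the import names are assumptions based on the packages listed above plus `dotenv`, which the import cell uses (for instance, `llama-cpp-python` is imported as `llama_cpp`).

```python
from importlib.util import find_spec

# Assumed top-level import names for the notebook's dependencies.
REQUIRED_MODULES = [
    "langchain_core",
    "langchain_community",
    "llama_cpp",
    "chromadb",
    "pypdf",
    "dotenv",
    "gradio",
]


def missing_modules(names):
    """Return the names in `names` that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
print("Missing modules:", ", ".join(missing) if missing else "none")
```

This only checks that the modules are discoverable; it does not verify versions or GPU support.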


**N.B. #1:** In this notebook we use the `Llama-2-7b-chat` model, quantized and provided by [`TheBloke`](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF), both to embed the document and to generate the response. Regarding the `llama-cpp-python` dependency: at the time of writing, the latest version did not seem to support generating embeddings with models that have no pooling layer, and crashed with a segfault. A quick workaround is to downgrade to `llama-cpp-python==0.2.47`.

**N.B. #2:** To use your local GPU to speed up inference, you need to export some environment variables before pip-installing `llama-cpp-python`. Follow the documentation for your platform and your needs.

For example, on an Apple Silicon platform, to enable Metal acceleration:

```bash
export CMAKE_ARGS="-DLLAMA_METAL=on"
export FORCE_CMAKE=1
pip install llama-cpp-python==0.2.47 --no-cache-dir
```

In [1]:
from dotenv import dotenv_values

from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import LlamaCppEmbeddings
from langchain_community.vectorstores import Chroma

from langchain_community.llms import LlamaCpp

from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import ChatPromptTemplate

from IPython.display import display, Markdown

## 1. Setting the Model and PDF Paths

Here the model's `.gguf` file and the `.pdf` file are stored in a local directory, and their paths are read from a `.env` file through the `MODEL_PATH` and `PDF_PATH` keys. Change those values to point to your own files.

In [2]:
conf = dotenv_values(".env")

In [3]:
MODEL_PATH = conf["MODEL_PATH"]
PDF_PATH = conf["PDF_PATH"]

In [4]:
print(f"""- MODEL: {MODEL_PATH.split('/')[-1]}\n- PDF: {PDF_PATH.split("/")[-1]}""")

- MODEL: llama-2-13b-chat.Q4_K_M.gguf
- PDF: CERN-ACC-NOTE-2019-0046.pdf
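
The lookups above assume that both keys are present in `.env` and that they point to existing files. A minimal sketch of a sanity check follows; `check_paths` is a hypothetical helper (not part of `python-dotenv`), and the example paths are made up for illustration. It works on any dict-like object, such as the one returned by `dotenv_values`.

```python
from pathlib import Path


def check_paths(conf, keys=("MODEL_PATH", "PDF_PATH")):
    """Return a list of human-readable problems found in `conf`."""
    problems = []
    for key in keys:
        value = conf.get(key)
        if not value:
            problems.append(f"{key} is not set")
        elif not Path(value).is_file():
            problems.append(f"{key} does not point to a file: {value}")
    return problems


# Hypothetical configuration, for illustration only.
example_conf = {"MODEL_PATH": "/no/such/dir/model.gguf", "PDF_PATH": None}
for problem in check_paths(example_conf):
    print(problem)

# Path(...).name extracts the file name without splitting on "/" by hand.
print(Path("/no/such/dir/model.gguf").name)
```

Running the check right after loading the configuration surfaces a wrong path before the much slower model-loading step.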


## 2. Load the PDF and Split

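Here we simply load the PDF with `PyPDFLoader`. We do a *quick and dirty* `load_and_split`, which yields roughly one chunk per page: the loader produces one document per page, and the default text splitter only subdivides pages that exceed its chunk size. Of course, different splitters can be used.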

In [5]:
loader = PyPDFLoader(PDF_PATH)
print(loader)

<langchain_community.document_loaders.pdf.PyPDFLoader object at 0x10571c4c0>


In [6]:
pages = loader.load_and_split()
print(
    f"""`pages`\n-------\n 1. has length {len(pages)}\n 2. is of type {type(pages)} and \n 3. each index is of type {type(pages[0])}"""
)

`pages`
-------
 1. has length 17
 2. is of type <class 'list'> and 
 3. each index is of type <class 'langchain_core.documents.base.Document'>
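
As noted above, other splitting strategies are possible. To illustrate the idea behind one common alternative, fixed-size chunks with overlap, here is a minimal sketch in plain Python that operates on a string rather than on LangChain documents. The function `chunk_text` and its parameters are hypothetical and are not the splitter used in this notebook.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split `text` into windows of `chunk_size` characters sharing `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghij" * 3  # 30 characters of toy text
for chunk in chunk_text(sample, chunk_size=12, overlap=4):
    print(repr(chunk))
```

The overlap means that a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, at the cost of embedding some text twice.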


## 3. Load the Embedding Model

In [7]:
embeddings = LlamaCppEmbeddings(model_path=MODEL_PATH, n_gpu_layers=-1, verbose=False)
print(f"`embeddings` is of type {type(embeddings)}")

`embeddings` is of type <class 'langchain_community.embeddings.llamacpp.LlamaCppEmbeddings'>


## 4. Vector Store

Create a local (in-memory) vector store. Each document is a single PDF page. After creating the vector store, set it up as a retriever.

In [8]:
# This embeds the document. It will take a few seconds or more, depending on local hardware.
docs = Chroma.from_documents(pages, embeddings)

When setting up the vector store as a retriever, you can specify the search type and the search parameters. Here, let's use the defaults (similarity search).

In [9]:
retriever = docs.as_retriever()
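
Conceptually, a similarity retriever embeds the query, scores every stored vector against it, and returns the best matches. The sketch below shows that idea with toy three-dimensional vectors and cosine similarity. It is an illustration only, not Chroma's implementation: the distance metric a real vector store uses depends on its configuration, and the names `cosine_similarity`, `top_k`, and `toy_store` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors `a` and `b` (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the `k` (text, score) pairs in `store` most similar to `query_vec`."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "embeddings": in the notebook these would come from the embedding model.
toy_store = [
    ("page about space charge", [0.9, 0.1, 0.0]),
    ("page about magnets", [0.1, 0.9, 0.0]),
    ("page about references", [0.0, 0.1, 0.9]),
]
query = [1.0, 0.2, 0.0]
for text, score in top_k(query, toy_store, k=2):
    print(f"{score:.3f}  {text}")
```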

Quickly test the retriever. My default document is a technical document on a Python module performing space charge potential calculations.

In [None]:
retriever.get_relevant_documents("space charge")

## 5. Fire Up the LLM


In [11]:
llm = LlamaCpp(
    model_path=MODEL_PATH,
    verbose=False,
    n_gpu_layers=-1,  # offload all layers to the GPU
    n_ctx=4096,  # context size in tokens (the maximum for Llama 2)
    temperature=0.05,
    seed=42,
)

## 6. Create the prompt

In [12]:
template = """[INST]You are a helpful and respectful assistant tasked with answering the user's question based on a given context. \
Use the following pieces of retrieved context, delimited by <cntx> and </cntx>, to answer the question, which is delimited by <qstn> and </qstn>. \
If you don't know the answer, just say that you don't know. \
Use a maximum of 3 sentences. \
Provide the answer directly without any introduction about the context.

<qstn>
Question: {question}
</qstn>

<cntx>
Context: {context}
</cntx>

Answer:[/INST]"""

In [13]:
prompt = ChatPromptTemplate.from_template(template)
print(prompt)

input_variables=['context', 'question'] messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], template="[INST]You are a helpful and respectful assistant tasked with answering the user's question based on a given context. Use the following pieces of retrieved context, delimited by <cntx> and </cntx>, to answer the question, which is delimited by <qstn> and </qstn>. If you don't know the answer, just say that you don't know. Use a maximum of 3 sentences. Provide the answer directly without any introduction about the context.\n\n<qstn>\nQuestion: {question}\n</qstn>\n\n<cntx>\nContext: {context}\n</cntx>\n\nAnswer:[/INST]"))]


## 7. Create the chain

The idea of the chain is:
1. The user question is sent to the retriever.
2. The retriever returns the matching documents.
3. The documents are formatted by joining them into the context of the query.
4. Context and question are inserted into the prompt template and passed to the LLM.
5. The LLM response is converted to a plain string by `StrOutputParser`.
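
The steps above can be sketched in plain Python, with stand-ins in place of the retriever and the model, to make the data flow explicit. Everything in this sketch (`fake_retriever`, `join_pages`, `fill_prompt`, `fake_llm`, `run_chain`) is hypothetical and only mimics the shape of the pipeline; it is not how LangChain implements runnables.

```python
def fake_retriever(question):
    # Stand-in for steps 1-2: a real retriever would search the vector store.
    return ["First retrieved page.", "Second retrieved page."]


def join_pages(pages):
    # Step 3: join the retrieved texts into a single context string.
    return "\n\n".join(pages)


def fill_prompt(inputs):
    # Step 4 (first half): insert context and question into a template.
    return f"Question: {inputs['question']}\nContext: {inputs['context']}\nAnswer:"


def fake_llm(prompt_text):
    # Step 4 (second half): a real model would generate text from the prompt.
    # This stub simply reports the prompt it received.
    return "stub answer; the prompt was:\n" + prompt_text


def run_chain(question):
    context = join_pages(fake_retriever(question))
    prompt_text = fill_prompt({"context": context, "question": question})
    raw = fake_llm(prompt_text)
    return str(raw)  # Step 5: make sure the result is a plain string.


print(run_chain("What is space charge?"))
```

Note that the question is used twice: once to drive retrieval and once, unchanged, as part of the prompt.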


In [14]:
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

In [15]:
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

Test the chain

In [16]:
test_user_question = "What is space charge?"
response = rag_chain.invoke(test_user_question)

In [17]:
display(Markdown(response))

  Based on the provided context, the answer to the question "What is space charge?" is:

"The space charge potential can be expressed in action angle variables with an extra dependence on the orbital angle. In this form, noted as ¯Vsc, it can be included in the Hamiltonian of Eq. (4) to obtain the perturbed Hamiltonian as: ¯H=QxJx+QyJy+¯Vsc (8) Since the space charge potential is a summation of infinite terms, the term corresponding to the resonance under study can be considered individually."

## 8. Quick GUI

In [18]:
import gradio as gr

In [19]:
# Wrap the output generation:
def invoke_chain(user_q):
    return rag_chain.invoke(user_q)

In [20]:
# Interface
demo = gr.Interface(
    fn=invoke_chain,
    inputs="textbox",
    outputs="textbox",
    title="AskMyPDF",
    theme="soft",
    allow_flagging="never",
)

In [21]:
demo.launch()

Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.


