# Retrieval Augmented Generation

### Use case: building docsrag, a query engine that helps developers quickly find information in open-source documentation
- More specifically, we will build **raybot** with our docsrag library: a retrieval-augmented question-answering system over Ray's documentation

### Tech stack:
- `llama_index`
   - `llama_hub` for document loading
   - `openai` and `huggingface` for LLMs
   - `langchain` for language chaining
   - `nltk` for text processing
- `ray`

### Building a retrieval-augmented question-answering system using the Ray documentation
Retrieval augmented generation (RAG) is a paradigm for augmenting an LLM with custom data. It generally consists of two stages:
1. indexing stage: preparing a knowledge base
2. querying stage: retrieving relevant context from the knowledge base to assist the LLM in responding to a question
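
The querying stage ends by placing the retrieved context in front of the question. The sketch below illustrates that idea only. It is not docsrag's implementation; `build_augmented_prompt` and the prompt wording are hypothetical.

```python
def build_augmented_prompt(question, retrieved_passages):
    """Combine retrieved passages and the user question into one prompt."""
    # Number each passage so the individual sources stay distinguishable.
    context = "\n\n".join(
        f"[{i}] {passage}" for i, passage in enumerate(retrieved_passages, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical passages standing in for nodes returned by the retriever.
passages = [
    "Ray Tune is a library for hyperparameter tuning.",
    "Ray Serve is a library for model serving.",
]
prompt = build_augmented_prompt("What is Ray Tune?", passages)
print(prompt)
```

The LLM then receives `prompt` instead of the bare question, which is the only difference between the plain and the augmented setup.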

[<img src="rag.jpeg" height="500"/>](rag.jpeg)

# Indexing Stage
Given a dataset of documents, we first need to index them. This involves four steps:
- Load the documents
- Parse the documents into passages, which are called nodes
- Use an embedding model to encode the nodes into embedding vectors
- Index the embeddings using a vector similarity search database
[<img src="index_build.jpeg" height="500"/>](index_build.jpeg)


### Document Loader

We will walk through how a sample Markdown document is loaded into a document object.

Note that the llama-index Markdown reader does not support introducing document relationships.

### DocumentLoader implementation in docsrag

We showcase the docsrag `GithubDocumentLoader`, which is simply an adapter for `llama_hub.github_repo.GithubRepositoryReader`.

For the sake of simplicity, the `GithubDocumentLoader`:
- considers only Markdown (`.md`) and reStructuredText (`.rst`) files inside the `doc/source` folder of the Ray repo
- reads the documents as raw text, because the default `llama_index` readers have their flaws

In [None]:
from docsrag.docs_loader import GithubDocumentLoader

document_loader = GithubDocumentLoader(
    owner="ray-project",
    repo="ray",
    version_tag="releases/2.6.3",
    paths_to_include=["doc/source/"],
    file_extensions_to_include=[".md", ".rst"],
    paths_to_exclude=[
        "doc/source/_ext/",
        "doc/source/_includes/",
        "doc/source/_static/",
        "doc/source/_templates/",
    ],
    filenames_to_exclude=[],
)
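
One plausible reading of the `paths_to_include`, `file_extensions_to_include`, and `paths_to_exclude` arguments above is as prefix and suffix filters on repository paths. The sketch below assumes that reading. `should_include` is a hypothetical helper, not part of docsrag or `llama_hub`, the sample paths are made up for illustration, and the real matching rules may differ.

```python
def should_include(path, paths_to_include, file_extensions_to_include, paths_to_exclude):
    """Return True when a repository path passes the include/exclude filters."""
    # The path must live under at least one included folder.
    if not any(path.startswith(prefix) for prefix in paths_to_include):
        return False
    # Excluded folders win over included ones.
    if any(path.startswith(prefix) for prefix in paths_to_exclude):
        return False
    # Finally, keep only the requested file types.
    return path.endswith(tuple(file_extensions_to_include))


filters = {
    "paths_to_include": ["doc/source/"],
    "file_extensions_to_include": [".md", ".rst"],
    "paths_to_exclude": ["doc/source/_static/", "doc/source/_templates/"],
}

# Hypothetical paths, used only to exercise the three rules.
sample_paths = [
    "doc/source/serve/index.md",
    "doc/source/_static/notes.md",
    "doc/source/tune/logo.png",
    "python/ray/README.rst",
]
kept = [path for path in sample_paths if should_include(path, **filters)]
print(kept)
```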

In [None]:
# uncomment and run this command to fetch the documents
# docsrag fetch-documents --config-path ./data/config.yaml --data-path ./data --overwrite

In [None]:
import pickle

with open(f"./data/docs/{hash(document_loader)}.pkl", "rb") as f:
    docs = pickle.load(f)
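
The cell above names the cache file after `hash(document_loader)`. How docsrag implements `__hash__` is not shown here. Python salts the built-in `hash()` of `str` and `bytes` objects per process unless `PYTHONHASHSEED` is fixed, so a key built that way can change between runs. The sketch below shows one way to derive a key that stays the same across runs; `stable_config_hash` and the `loader_config` dictionary are hypothetical and only mirror the loader arguments used earlier.

```python
import hashlib
import json


def stable_config_hash(config):
    """Return a short hex digest that is identical across interpreter runs."""
    # sort_keys makes the serialization independent of dict insertion order.
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


loader_config = {
    "owner": "ray-project",
    "repo": "ray",
    "version_tag": "releases/2.6.3",
    "file_extensions_to_include": [".md", ".rst"],
}
cache_key = stable_config_hash(loader_config)
print(f"./data/docs/{cache_key}.pkl")
```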

In [None]:
print(f"Number of documents: {len(docs)}")

In [None]:
sample_mkdown_doc = next(
    doc for doc in docs if doc.metadata["file_path"].endswith(".md")
)

print(sample_mkdown_doc.text[:500])

## Node Parser
A node parser chunks a document into nodes.

The parser will:
- run a text chunker
- inject additional node metadata
- construct node relationships

A node is:
- the chunk text plus metadata (e.g. node text hash, node relationships to other nodes)
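
The three parser steps listed above can be sketched with a toy parser. This is an illustration under simplifying assumptions, not the docsrag `NodeParser`: it counts whitespace-separated words rather than tokenizer tokens, and `ToyNode`, `chunk_words`, and `parse_document` are hypothetical names.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    text: str
    metadata: dict = field(default_factory=dict)
    relationships: dict = field(default_factory=dict)


def chunk_words(text, chunk_size, chunk_overlap):
    """Split text into word windows of `chunk_size` sharing `chunk_overlap` words."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def parse_document(text, file_path, chunk_size=5, chunk_overlap=2):
    nodes = []
    for chunk in chunk_words(text, chunk_size, chunk_overlap):
        # Inject metadata: the source file and a hash of the chunk text.
        text_hash = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        nodes.append(ToyNode(chunk, {"file_path": file_path, "text_hash": text_hash}))
    # Construct previous/next relationships between neighbouring nodes.
    for prev_node, next_node in zip(nodes, nodes[1:]):
        prev_node.relationships["next"] = next_node.metadata["text_hash"]
        next_node.relationships["previous"] = prev_node.metadata["text_hash"]
    return nodes


toy_nodes = parse_document("a b c d e f g h i j", "doc/source/example.md")
for node in toy_nodes:
    print(node.text, sorted(node.relationships))
```

With a chunk size of 5 and an overlap of 2, consecutive chunks share their two boundary words, which helps keep a sentence that straddles a boundary retrievable from either side.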

We showcase the docsrag `NodeParser`, whose text chunker and metadata pipeline are configured below.

In [None]:
from docsrag.node_parser import NodeParser

In [None]:
node_parser = NodeParser.parse_obj(
    {
        "inherit_metadata_from_doc": True,
        "construct_prev_next_relations": True,
        "text_chunker": {
            "chunk_size": 1024,
            "chunk_overlap": 20,
            "paragraph_separator": "\n\n\n",
            "sentence_tokenizer": {"type": "tokenizers/punkt"},
            "secondary_chunking_regex": "[^,.;。]+[,.;。]?",
            "tokenizer": {"encoding": "gpt2"},
            "word_seperator": " ",
        },
        "metadata_pipeline": {
            "extractors": [
                "file_path_extractor",
                "text_hash_extractor",
            ]
        },
    }
)

In [None]:
# uncomment and run this command to parse the nodes
# docsrag parse-nodes --config-path ./data/config.yaml --data-path ./data --overwrite

In [None]:
%psource node_parser.run

In [None]:
import pickle

with open(f"./data/nodes/{hash(node_parser)}.pkl", "rb") as f:
    nodes = pickle.load(f)

In [None]:
print(f"Number of nodes: {len(nodes)}")

In [None]:
import yaml
import gradio as gr
from docsrag.node_parser import NodeParser
import pickle

with open("tutorial_docs.pkl", "rb") as f:
    docs = pickle.load(f)

config = {
    "inherit_metadata_from_doc": True,
    "construct_prev_next_relations": True,
    "text_chunker": {
        "chunk_size": 1024,
        "chunk_overlap": 20,
        "paragraph_separator": "\n\n\n",
        "sentence_tokenizer": {"type": "tokenizers/punkt"},
        "secondary_chunking_regex": "[^,.;。]+[,.;。]?",
        "tokenizer": {"encoding": "gpt2"},
        "word_seperator": " ",
    },
    "metadata_pipeline": {
        "extractors": [
            "file_path_extractor",
            "text_hash_extractor",
        ]
    },
}


def parse_nodes(
    text,
    chunk_size=1024,
    chunk_overlap=20,
    paragraph_separator="\n\n\n",
    sentence_tokenizer="tokenizers/punkt",
    secondary_chunking_regex="[^,.;。]+[,.;。]?",
    tokenizer="gpt2",
    word_seperator=" ",
    extractors=("file_path_extractor", "text_hash_extractor"),
):
    # Build a fresh config per call so the module-level `config` is never mutated.
    config_dict = {
        **config,
        "text_chunker": {
            # Gradio textboxes return strings, so cast the numeric options.
            "chunk_size": int(chunk_size),
            "chunk_overlap": int(chunk_overlap),
            # The demo textboxes display these values wrapped in double quotes; drop them.
            "paragraph_separator": paragraph_separator.strip('"'),
            "sentence_tokenizer": {"type": sentence_tokenizer},
            "secondary_chunking_regex": secondary_chunking_regex.strip('"'),
            "tokenizer": {"encoding": tokenizer},
            "word_seperator": word_seperator,
        },
        "metadata_pipeline": {"extractors": list(extractors)},
    }

    node_parser = NodeParser.parse_obj(config_dict)
    doc = docs[0]
    doc.text = text
    nodes = node_parser.run([doc], use_ray=False)
    # Return text, metadata, and relationships for the first two nodes.
    return (
        nodes[0].text,
        yaml.dump(nodes[0].metadata),
        yaml.dump([str(rel) for rel in nodes[0].relationships]),
        nodes[1].text,
        yaml.dump(nodes[1].metadata),
        yaml.dump([str(rel) for rel in nodes[1].relationships]),
    )


with gr.Blocks() as demo:
    with gr.Row():
        with gr.Column():
            title = gr.Markdown(
                """
                # Node Parser Demo
                Shows how configuration options affect the output of the node parser.
                """
            )
    with gr.Row():
        with gr.Column(scale=3, min_width=100):
            text1 = gr.Textbox(label="Document", value=docs[0].text)
        with gr.Column(scale=1, min_width=100):
            text2 = gr.Textbox(label="NodeParser chunk_size", value=1024)
            text3 = gr.Textbox(label="NodeParser chunk_overlap", value=20)
            text4 = gr.Textbox(label="NodeParser paragraph_separator", value='"\n\n\n"')
            text5 = gr.Textbox(
                label="NodeParser sentence_tokenizer", value="tokenizers/punkt"
            )
            text6 = gr.Textbox(
                label="NodeParser secondary_chunking_regex", value='"[^,.;。]+[,.;。]?"'
            )

    with gr.Row():
        inbtw = gr.Button("Submit", variant="primary")

    with gr.Row():
        with gr.Column(scale=3, min_width=100):
            out1 = gr.Textbox(label="First Node text")
        with gr.Column(scale=1, min_width=100):
            out2 = gr.Textbox(label="First Node metadata")
        with gr.Column(scale=1, min_width=100):
            out3 = gr.Textbox(label="First Node relationships")

    with gr.Row():
        with gr.Column(scale=3, min_width=100):
            out4 = gr.Textbox(label="Second Node text")
        with gr.Column(scale=1, min_width=100):
            out5 = gr.Textbox(label="Second Node metadata")
        with gr.Column(scale=1, min_width=100):
            out6 = gr.Textbox(label="Second Node relationships")
    inbtw.click(
        parse_nodes,
        inputs=[text1, text2, text3, text4, text5, text6],
        outputs=[out1, out2, out3, out4, out5, out6],
    )

demo.launch(quiet=True)

## Embedding model and vector store

We showcase the docsrag VectorStoreIndexRay (a very simple in-memory vector store) and how to use it to find similar nodes.

In [None]:
from docsrag.embedding.index import VectorStoreSpec, VectorStoreIndexRay

We start by building our VectorStoreIndexRay from the nodes we parsed earlier. This will compute the embeddings for each node and store them in a vector store.
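
To make "compute the embeddings and store them" concrete, here is a toy in-memory store. It is a sketch, not `VectorStoreIndexRay`: it assumes a bag-of-words count vector in place of a learned embedding model such as `BAAI/bge-small-en`, and `embed`, `cosine_similarity`, and `ToyVectorStore` are hypothetical names.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: a sparse bag-of-words vector (word -> count)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    def __init__(self, texts):
        # Embed every node once, at build time.
        self._entries = [(text, embed(text)) for text in texts]

    def retrieve(self, query, similarity_top_k=3):
        """Return the top-k (score, text) pairs, most similar first."""
        query_vector = embed(query)
        scored = [
            (cosine_similarity(query_vector, vector), text)
            for text, vector in self._entries
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:similarity_top_k]


toy_store = ToyVectorStore(
    [
        "ray serve deploys models",
        "ray tune searches hyperparameters",
        "ray data loads datasets",
    ]
)
results = toy_store.retrieve("how does ray tune work", similarity_top_k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

A brute-force scan like this is linear in the number of nodes, which is acceptable for a small in-memory store; larger corpora usually call for an approximate nearest-neighbour index.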

In [None]:
%psource VectorStoreIndexRay.build_from_spec

In [None]:
%psource VectorStoreIndexRay._get_node_embeddings

In [None]:
node_limit = None

embedding_vector_store_spec = VectorStoreSpec.parse_obj(
    {"embedding_model_name": "BAAI/bge-small-en"}
)

vector_store_index = VectorStoreIndexRay.build_from_spec(
    nodes=nodes[:node_limit] if node_limit else nodes,
    spec=embedding_vector_store_spec,
    num_gpus=0,
    batch_size=100,
)

In [None]:
from pathlib import Path

# Persist the index built above so it can be reloaded later.
vector_store_index.save(Path(f"./data/vector_index/{hash(vector_store_index)}"))

In [None]:
# to build the full vector store, uncomment and run the command below
# docsrag build-embedding-vector-store-index --config-path ./data/config.yaml --data-path ./data --overwrite

We load the vector store that was built in the previous step.

In [None]:
hash_vector_store = 525061202 # hash(vector_store_index)
loaded_index = VectorStoreIndexRay.load(Path(f"data/vector_index/{hash_vector_store}/"))

In [None]:
nodes_with_scores = loaded_index.retrieve_most_similiar_nodes(
    query="How can I migrate from a single-application config to a multi-application config in Ray Serve?",
    similarity_top_k=3,
)

In [None]:
print(f"Number of nodes fetched: {len(nodes_with_scores)}")

In [None]:
most_similar_node = nodes_with_scores[0]
print(most_similar_node.node.text[-1170:], end="\n\n")
print(f"{most_similar_node.node.metadata=}")
print(f"{most_similar_node.score=}")

## Evaluating our Embedding Index using standard ranking and classification metrics

- Step 1: Build a question-and-answer evaluation dataset from the Ray documentation corpus
- Step 2: Assess the quality of our embedding index based on the built dataset

### Building an Evaluation Dataset
[<img src="eval_build.jpeg" height="500"/>](eval_build.jpeg)


In [None]:
from textwrap import dedent
from docsrag.evaluation_dataset_generator import EvaluationDatasetBuilder

eval_dataset_builder = EvaluationDatasetBuilder.parse_obj(
    {
        "qa_generator_open_ai": {
            "model": "gpt-3.5-turbo",
            "system_prompt": dedent(
                """
            You are a helpful assistant that generates questions and answers from a provided context.
            The context will be selected documents from the ray's project documentation.
            The questions you generate should be obvious on their own and should mimic what a developer might ask trying to work with ray, especially if they can't directly find the answer in the documentation.
            The answers should be factually correct, can be of a variable length and can contain code.
            If the provided context does not contain enough information to create a question and answer, you should respond with 'I can't generate a question and answer from this context'. 
            The following is an example of how the output should look:
            Q1: How can I view ray dashboard from outside the Kubernetes cluster?
            A1: You can use port-forwarding. Run the command 'kubectl port-forward --address 0.0.0.0 ${RAYCLUSTER_HEAD_POD} 8265:8265'

            Q2: {question}
            A2: {answer}
            """
            ).lstrip(),
            "user_prompt_template": dedent(
                """
        Provide questions and answers from the following context:

        {context}
        """
            ).lstrip(),
            "max_tokens": 1024,
            "temperature": 1.0,
            "top_p": 0.85,
            "frequency_penalty": 0,
            "presence_penalty": 0,
        },
        "noise_injector_from_parquet": {"dataset_name": "trivia_questions.parquet"},
    }
)

In [None]:
# Note this is the prompt used by llama-index in its finetuning module
# """\
# Context information is below.

# ---------------------
# {context_str}
# ---------------------

# Given the context information and not prior knowledge.
# generate only questions based on the below query.

# You are a Teacher/ Professor. Your task is to setup \
# {num_questions_per_chunk} questions for an upcoming \
# quiz/examination. The questions should be diverse in nature \
# across the document. Restrict the questions to the \
# context information provided."
# """

In [None]:
qa_generator_openai = eval_dataset_builder.qa_generator_open_ai

In [None]:
questions = qa_generator_openai.run(context=most_similar_node.node.text)

In [None]:
print(questions)
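
The system prompt asks the model to answer in a `Q1: ... A1: ...` layout. If the generator returns that raw text (an assumption, since the return type of `qa_generator_openai.run` is not shown here), it could be split into pairs as sketched below. `parse_qa_pairs` and `QA_PATTERN` are hypothetical helpers, not docsrag functions.

```python
import re

# A "Qn:" line, its question, the matching "An:" line, and the answer up to
# the next "Qm:" line or the end of the text.
QA_PATTERN = re.compile(
    r"^Q(\d+):\s*(.*?)\s*^A\1:\s*(.*?)(?=^Q\d+:|\Z)",
    re.MULTILINE | re.DOTALL,
)


def parse_qa_pairs(raw_output):
    """Extract question/answer pairs from 'Qn: ... An: ...' formatted text."""
    return [
        {"question": question.strip(), "answer": answer.strip()}
        for _, question, answer in QA_PATTERN.findall(raw_output)
    ]


# Made-up model output in the layout requested by the system prompt.
raw = (
    "Q1: How can I view the dashboard from outside the cluster?\n"
    "A1: You can use port-forwarding.\n"
    "\n"
    "Q2: Can answers span several lines?\n"
    "A2: Yes.\n"
    "They can contain code too.\n"
)
pairs = parse_qa_pairs(raw)
print(pairs)
```

A refusal such as "I can't generate a question and answer from this context" contains no `Qn:` line, so it yields an empty list and the context can be skipped.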

### Evaluate our Embedding Vector Index Store

[<img src="run_eval.jpeg" height="600"/>](run_eval.jpeg)
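
For intuition about the ranking metrics, here is a sketch of two common ones, recall@k and reciprocal rank, written from their standard definitions. Whether `VectorStoreEvaluator` computes them exactly this way is an assumption; the function names and node ids below are hypothetical.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant items that appear among the top-k retrieved items."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved_ids, relevant_ids):
    """1 / rank of the first relevant item, or 0.0 when none was retrieved."""
    relevant = set(relevant_ids)
    for rank, item in enumerate(retrieved_ids, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


# The question was generated from node "n9"; the index ranked it third.
retrieved = ["n7", "n2", "n9", "n4"]
relevant = ["n9"]

for k in (1, 3):
    print(f"recall@{k} = {recall_at_k(retrieved, relevant, k)}")
print(f"reciprocal rank = {reciprocal_rank(retrieved, relevant)}")
```

When each question has a single source node, recall@k reduces to a hit rate: 1 if the source node is in the top k, 0 otherwise. Averaging over the dataset gives the fraction of questions answered from the right node.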


In [None]:
from docsrag.embedding.evaluation import VectorStoreEvaluator, load_evaluation_dataset

In [None]:
from docsrag.embedding.index import VectorStoreIndexRay
from pathlib import Path
hash_vector_store = 525061202 # hash(vector_store_index)
loaded_index = VectorStoreIndexRay.load(Path(f"data/vector_index/{hash_vector_store}/"))

In [None]:
evaluator = VectorStoreEvaluator(
    vector_store_index=loaded_index,
    top_ks=[1, 3, 5, 7, 10]
)

In [None]:
eval_df = load_evaluation_dataset(
    evaluation_dataset_dir=Path("data/eval_data/"),
    evaluation_dataset_name="1618109849114044135",
    limit=None,
)

In [None]:
scores = evaluator.run(eval_df)

In [None]:
scores

### Using the embedding vector store to augment our LLM

In [None]:
from docsrag.llm.model import LLM, LLMPlusRag

In [None]:
predictor_without_rag = LLM(
    model="gpt-3.5-turbo",
    temperature=0.1,
    max_tokens=1000,
    max_retries=10,
)

In [None]:
query = "How can I set a metric and mode in ray Tune?"

In [None]:
answer_without_rag = predictor_without_rag.query(query)

In [None]:
print(answer_without_rag)

In [None]:
predictor_with_rag = LLMPlusRag(
    model="gpt-3.5-turbo",
    temperature=0.1,
    max_tokens=1000,
    max_retries=10,
    vector_store_path=f"./data/vector_index/{hash_vector_store}"
)

In [None]:
answer_with_rag = predictor_with_rag.query(query=query, similarity_top_k=2)

In [None]:
print(answer_with_rag)

## Tuning the embedding configuration using Ray Tune

In [1]:
import pickle
from pathlib import Path

with open("data/nodes/130956594988870197.pkl", "rb") as f:
    nodes = pickle.load(f)

from docsrag.embedding.evaluation import load_evaluation_dataset

eval_df = load_evaluation_dataset(
    evaluation_dataset_dir=Path("data/eval_data/"),
    evaluation_dataset_name="1618109849114044135",
    limit=10,
)


In [None]:
from docsrag.embedding.index import VectorStoreIndexRay, VectorStoreSpec
from docsrag.embedding.evaluation import VectorStoreEvaluator
from ray import tune
from pathlib import Path


def my_objective(config):  # ①

    node_limit = None

    embedding_vector_store_spec = VectorStoreSpec.parse_obj(
        {"embedding_model_name": config["embedding_model_name"]}
    )

    vector_store_index = VectorStoreIndexRay.build_from_spec(
        nodes=nodes[:node_limit] if node_limit else nodes,
        spec=embedding_vector_store_spec,
        num_gpus=0,
        batch_size=100,
    )

    evaluator = VectorStoreEvaluator(
        vector_store_index=vector_store_index,
        top_ks=[3],
    )

    scores = evaluator.run(eval_df.iloc[:4])

    return {"score": scores["recall@k"].mean()}


search_space = {  # ②
    "embedding_model_name": tune.choice(
        [
            "BAAI/bge-small-en",
            # "BAAI/bge-base-en",
        ],
    ),
}

tuner = tune.Tuner(my_objective, param_space=search_space)  # ③

results = tuner.fit()
print(results.get_best_result(metric="score", mode="max").config)

Current time: 2023-08-31 10:55:23
Running for: 00:00:16.33
Memory: 22.8/64.0 GiB

| Trial name | status | loc | embedding_model_name |
|---|---|---|---|
| my_objective_60150_00000 | RUNNING | 127.0.0.1:16420 | BAAI/bge-small-en |

2023-08-31 10:55:23,714 ERROR tune.py:941 -- Trials did not complete: [my_objective_60150_00000]
2023-08-31 10:55:23,715 INFO tune.py:945 -- Total run time: 16.39 seconds (16.33 seconds for the tuning loop).
Continue running this experiment with: Tuner.restore(path="/Users/marwansarieddine/ray_results/my_objective_2023-08-31_10-55-07", trainable=...)

{'embedding_model_name': 'BAAI/bge-small-en'}

(my_objective pid=16420) 2023-08-31 10:55:23,733 ERROR worker.py:844 -- Worker exits with an exit code 1.
(my_objective pid=16420) Traceback (most recent call last):
(my_objective pid=16420)   File "python/ray/_raylet.pyx", line 1197, in ray._raylet.task_execution_handler
(my_objective pid=16420)   File "python/ray/_raylet.pyx", line 1100, in ray._raylet.execute_task_with_cancellation_handler
(my_objective pid=16420)   File "python/ray/_raylet.pyx", line 823, in ray._raylet.execute_task
(my_objective pid=16420)   File "python/ray/_raylet.pyx", line 870, in ray._raylet.execute_task
(my_objective pid=16420)   File "python/ray/_raylet.pyx", line 877, in ray._raylet.execute_task
(my_objective pid=16420)   File "python/ray/_raylet.pyx", line 881, in ray._raylet.execute_task
(my_objective pid=16420)   File "python/ray/_raylet.pyx", line 821, in ray._raylet.execute_task.functio

In [4]:
def my_objective(config):  # ①

    node_limit = 100

    embedding_vector_store_spec = VectorStoreSpec.parse_obj(
        {"embedding_model_name": config["embedding_model_name"]}
    )

    vector_store_index = VectorStoreIndexRay.build_from_spec(
        nodes=nodes[:node_limit] if node_limit else nodes,
        spec=embedding_vector_store_spec,
        num_gpus=0,
        batch_size=100,
    )

    evaluator = VectorStoreEvaluator(
        vector_store_index=vector_store_index,
        top_ks=[3],
    )

    scores = evaluator.run(eval_df.iloc[:4])

    return {"score": scores["recall@k"].mean()}



In [5]:
my_objective(
    {
        "embedding_model_name": "BAAI/bge-small-en",
    }
)

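NameError: name 'VectorStoreSpec' is not defined

The `NameError` indicates that `VectorStoreSpec` was not imported in the kernel session that ran this cell. Run the cell containing `from docsrag.embedding.index import VectorStoreIndexRay, VectorStoreSpec` before calling `my_objective` directly.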