# Retrieval-Augmented Generation

### Use case: building docsrag, a query engine that helps developers quickly find information in open-source documentation
- More specifically, we will build **raybot** with our docsrag library: a retrieval-augmented question answering system over ray's documentation

### Tech stack:
- `llama_index`
   - `llama_hub` for document loading
   - `openai` and `huggingface` for LLMs
   - `langchain` for language chaining
   - `nltk` for text processing
- `ray`

### Building a retrieval-augmented question answering system using ray documentation
Retrieval-augmented generation (RAG) is a paradigm for augmenting an LLM with custom data. It generally consists of two stages:
1. Indexing stage: preparing a knowledge base.
2. Querying stage: retrieving relevant context from the knowledge base to assist the LLM in responding to a question.

[<img src="rag.jpeg" height="500"/>](rag.jpeg)

# Indexing Stage
Given a dataset of documents, we first need to index them. This involves the following steps:
- Load the documents.
- Parse the documents into passages, which are called nodes.
- Use an embedding model to encode the nodes into embedding vectors.
- Index the embeddings using a vector similarity search database.
[<img src="index_build.jpeg" height="500"/>](index_build.jpeg)
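
The four indexing steps can be sketched end to end in plain Python. This is a toy illustration under loud assumptions, not how docsrag or `llama_index` implement indexing: the helpers `load_documents`, `split_into_nodes`, `toy_embed`, and `build_toy_index` are hypothetical names, the chunking is by fixed character count, and the letter-frequency vector only stands in for a real embedding model.

```python
import math
from collections import Counter
from string import ascii_lowercase


def load_documents(raw: dict[str, str]) -> list[dict]:
    """Step 1: wrap raw text in document records, keeping the file path as metadata."""
    return [{"file_path": path, "text": text} for path, text in raw.items()]


def split_into_nodes(docs: list[dict], chunk_size: int = 40) -> list[dict]:
    """Step 2: split each document into fixed-size character chunks (nodes)."""
    nodes = []
    for doc in docs:
        text = doc["text"]
        for start in range(0, len(text), chunk_size):
            nodes.append(
                {"file_path": doc["file_path"], "text": text[start:start + chunk_size]}
            )
    return nodes


def toy_embed(text: str) -> list[float]:
    """Step 3: toy embedding, an L2-normalised letter-frequency vector."""
    counts = Counter(ch for ch in text.lower() if ch in ascii_lowercase)
    vec = [float(counts[ch]) for ch in ascii_lowercase]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_toy_index(nodes: list[dict]) -> list[tuple[list[float], dict]]:
    """Step 4: keep (embedding, node) pairs in a list; a real system would use a vector database."""
    return [(toy_embed(node["text"]), node) for node in nodes]


# Made-up documents, used only for illustration.
raw_docs = {
    "doc/a.md": "Ray Tune runs hyperparameter sweeps across a cluster.",
    "doc/b.md": "Ray Serve deploys models behind an HTTP endpoint.",
}
toy_index = build_toy_index(split_into_nodes(load_documents(raw_docs)))
print(f"Indexed {len(toy_index)} nodes, embedding dimension {len(toy_index[0][0])}")
```

Each stage consumes the output of the previous one, which is why the rest of this notebook can persist documents, nodes, and the vector store separately and reload them between steps.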


### Document Loader

We walk through how a sample Markdown document is loaded into a document object.

Note that the `llama_index` Markdown reader does not support introducing document relationships.

### DocumentLoader implementation in docsrag

We showcase the docsrag `GithubDocumentLoader`, which is simply an adapter for `llama_hub.github_repo.GithubRepositoryReader`.

For the sake of simplicity, the `GithubDocumentLoader`:
- considers only Markdown (`.md`) and reStructuredText (`.rst`) files inside the `doc/source` folder of the ray repo.
- reads the documents as raw text, since the default `llama_index` readers have their flaws.

In [None]:
from docsrag.docs_loader import GithubDocumentLoader

document_loader = GithubDocumentLoader(
    owner="ray-project",
    repo="ray",
    version_tag="releases/2.6.3",
    paths_to_include=["doc/source/"],
    file_extensions_to_include=[".rst", ".md"],
    paths_to_exclude=[
        "doc/source/_ext/",
        "doc/source/_includes/",
        "doc/source/_static/",
        "doc/source/_templates/",
    ],
    filenames_to_exclude=[],
)

In [None]:
# uncomment and run this command to fetch the documents
# docsrag fetch-documents --config-path ./data/config.yaml --data-path ./data --overwrite

In [None]:
import pickle

with open("./data/docs/276534478724186880.pkl", "rb") as f:
    docs = pickle.load(f)

In [None]:
print(f"Number of documents: {len(docs)}")

In [None]:
sample_mkdown_doc = next(
    doc for doc in docs if doc.metadata["file_path"].endswith(".md")
)

print(sample_mkdown_doc.text[:500])

## Node Parser
A node parser chunks a document into nodes

The parser will:
- run a text chunker
- inject additional node metadata
- construct node relationships

A node is the chunk text plus metadata (e.g. node text hash, node relationships to other nodes).
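
The three responsibilities of a node parser listed above (chunking, metadata injection, relationship construction) can be sketched as follows. This is a minimal illustration and not the docsrag implementation: `ToyNode`, `chunk_words`, and `parse_document` are hypothetical names, and chunk sizes are counted in whitespace-separated words rather than in tokens from a real tokenizer.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    text: str
    metadata: dict = field(default_factory=dict)
    relationships: dict = field(default_factory=dict)


def chunk_words(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into chunks of `chunk_size` words that share `chunk_overlap` words."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def parse_document(
    text: str, doc_metadata: dict, chunk_size: int = 8, chunk_overlap: int = 2
) -> list[ToyNode]:
    nodes = []
    for chunk in chunk_words(text, chunk_size, chunk_overlap):
        text_hash = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        # Inherit the document metadata and add a per-node text hash.
        nodes.append(ToyNode(text=chunk, metadata={**doc_metadata, "text_hash": text_hash}))
    # Link each node to its neighbours through their text hashes.
    for prev_node, next_node in zip(nodes, nodes[1:]):
        prev_node.relationships["next"] = next_node.metadata["text_hash"]
        next_node.relationships["previous"] = prev_node.metadata["text_hash"]
    return nodes


toy_text = " ".join(f"w{i}" for i in range(20))
toy_nodes = parse_document(toy_text, {"file_path": "doc/source/example.md"})
for node in toy_nodes:
    print(node.text, sorted(node.relationships))
```

The overlap means the last words of one chunk reappear at the start of the next, so a sentence cut at a chunk boundary is still readable in at least one node.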

We showcase the docsrag `NodeParser`, configured below with a text chunker and a metadata pipeline.

In [None]:
from docsrag.node_parser import NodeParser

In [None]:
node_parser = NodeParser.parse_obj(
    {
        "inherit_metadata_from_doc": True,
        "construct_prev_next_relations": True,
        "text_chunker": {
            "chunk_size": 1024,
            "chunk_overlap": 20,
            "paragraph_separator": "\n\n\n",
            "sentence_tokenizer": {"type": "tokenizers/punkt"},
            "secondary_chunking_regex": "[^,.;。]+[,.;。]?",
            "tokenizer": {"encoding": "gpt2"},
            "word_seperator": " ",
        },
        "metadata_pipeline": {
            "extractors": [
                "file_path_extractor",
                "text_hash_extractor",
            ]
        },
    }
)

In [None]:
# uncomment and run this command to parse the nodes
# docsrag parse-nodes --config-path ./data/config.yaml --data-path ./data --overwrite

In [None]:
%psource node_parser.run

In [None]:
with open("./data/nodes/130956594988870197.pkl", "rb") as f:
    nodes = pickle.load(f)

In [None]:
print(f"Number of nodes: {len(nodes)}")

In [None]:
import yaml
import gradio as gr
from docsrag.node_parser import NodeParser
import pickle

with open("tutorial_docs.pkl", "rb") as f:
    docs = pickle.load(f)

config = {
    "inherit_metadata_from_doc": True,
    "construct_prev_next_relations": True,
    "text_chunker": {
        "chunk_size": 1024,
        "chunk_overlap": 20,
        "paragraph_separator": "\n\n\n",
        "sentence_tokenizer": {"type": "tokenizers/punkt"},
        "secondary_chunking_regex": "[^,.;。]+[,.;。]?",
        "tokenizer": {"encoding": "gpt2"},
        "word_seperator": " ",
    },
    "metadata_pipeline": {
        "extractors": [
            "file_path_extractor",
            "text_hash_extractor",
        ]
    },
}


def parse_nodes(
    text,
    chunk_size=1024,
    chunk_overlap=20,
    paragraph_separator="\n\n\n",
    sentence_tokenizer="tokenizers/punkt",
    secondary_chunking_regex="[^,.;。]+[,.;。]?",
    tokenizer="gpt2",
    word_seperator=" ",
    extractors=("file_path_extractor", "text_hash_extractor"),
):
    # Build a fresh config on every call so the module-level `config` is never mutated.
    config_dict = {
        **config,
        "text_chunker": {
            # Gradio textboxes return strings, so cast the numeric options.
            "chunk_size": int(chunk_size),
            "chunk_overlap": int(chunk_overlap),
            "paragraph_separator": paragraph_separator,
            "sentence_tokenizer": {"type": sentence_tokenizer},
            "secondary_chunking_regex": secondary_chunking_regex,
            "tokenizer": {"encoding": tokenizer},
            "word_seperator": word_seperator,
        },
        "metadata_pipeline": {"extractors": list(extractors)},
    }

    node_parser = NodeParser.parse_obj(config_dict)
    # Reuse the first tutorial document as a carrier for the text typed into the demo.
    doc = docs[0]
    doc.text = text
    nodes = node_parser.run([doc], use_ray=False)
    # Return text, metadata, and relationships for the first two nodes.
    return (
        nodes[0].text,
        yaml.dump(nodes[0].metadata),
        yaml.dump([str(rel) for rel in nodes[0].relationships]),
        nodes[1].text,
        yaml.dump(nodes[1].metadata),
        yaml.dump([str(rel) for rel in nodes[1].relationships]),
    )


with gr.Blocks() as demo:
    with gr.Row():
        with gr.Column():
            title = gr.Markdown(
                """
                # Node Parser Demo
                Shows how configuration options affect the output of the node parser.
                """
            )
    with gr.Row():
        with gr.Column(scale=3, min_width=100):
            text1 = gr.Textbox(label="Document", value=docs[0].text)
        with gr.Column(scale=1, min_width=100):
            text2 = gr.Textbox(label="NodeParser chunk_size", value=1024)
            text3 = gr.Textbox(label="NodeParser chunk_overlap", value=20)
            text4 = gr.Textbox(label="NodeParser paragraph_separator", value='"\n\n\n"')
            text5 = gr.Textbox(
                label="NodeParser sentence_tokenizer", value="tokenizers/punkt"
            )
            text6 = gr.Textbox(
                label="NodeParser secondary_chunking_regex", value='"[^,.;。]+[,.;。]?"'
            )

    with gr.Row():
        inbtw = gr.Button("Submit", variant="primary")

    with gr.Row():
        with gr.Column(scale=3, min_width=100):
            out1 = gr.Textbox(label="First Node text")
        with gr.Column(scale=1, min_width=100):
            out2 = gr.Textbox(label="First Node metadata")
        with gr.Column(scale=1, min_width=100):
            out3 = gr.Textbox(label="First Node relationships")

    with gr.Row():
        with gr.Column(scale=3, min_width=100):
            out4 = gr.Textbox(label="Second Node text")
        with gr.Column(scale=1, min_width=100):
            out5 = gr.Textbox(label="Second Node metadata")
        with gr.Column(scale=1, min_width=100):
            out6 = gr.Textbox(label="Second Node relationships")
    inbtw.click(
        parse_nodes,
        inputs=[text1, text2, text3, text4, text5, text6],
        outputs=[out1, out2, out3, out4, out5, out6],
    )

demo.launch(quiet=True)

## Embedding model and vector store

We showcase the docsrag `VectorStoreIndexRay` (a very simple in-memory vector store) and show how to use it to find similar nodes.

In [None]:
from docsrag.embedding.index import VectorStoreSpec, VectorStoreIndexRay

We start by building our VectorStoreIndexRay from the nodes we parsed earlier. This will compute the embeddings for each node and store them in a vector store.
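
To make the idea of an in-memory vector store concrete, here is a minimal sketch of one that ranks stored nodes by cosine similarity. It is not the `VectorStoreIndexRay` implementation: `ToyVectorStore` and `ScoredNode` are hypothetical names, and the three-dimensional embeddings are hand-written stand-ins for the output of a real embedding model.

```python
import math
from dataclasses import dataclass


@dataclass
class ScoredNode:
    text: str
    score: float


class ToyVectorStore:
    """Keeps (embedding, text) pairs in memory and ranks them by cosine similarity."""

    def __init__(self) -> None:
        self._embeddings: list[list[float]] = []
        self._texts: list[str] = []

    def add(self, embedding: list[float], text: str) -> None:
        self._embeddings.append(embedding)
        self._texts.append(text)

    @staticmethod
    def _cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def retrieve(self, query_embedding: list[float], top_k: int = 3) -> list[ScoredNode]:
        """Score every stored node against the query and return the best `top_k`."""
        scored = [
            ScoredNode(text=text, score=self._cosine(query_embedding, emb))
            for emb, text in zip(self._embeddings, self._texts)
        ]
        return sorted(scored, key=lambda n: n.score, reverse=True)[:top_k]


toy_store = ToyVectorStore()
toy_store.add([1.0, 0.0, 0.0], "node about Ray Tune")
toy_store.add([0.0, 1.0, 0.0], "node about Ray Serve")
toy_store.add([0.7, 0.7, 0.0], "node about both")

toy_hits = toy_store.retrieve([0.9, 0.1, 0.0], top_k=2)
for hit in toy_hits:
    print(f"{hit.score:.3f}  {hit.text}")
```

A brute-force scan like this is linear in the number of stored nodes, which is acceptable for a documentation corpus of a few thousand chunks; larger corpora typically call for an approximate nearest-neighbour index.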

In [None]:
%psource VectorStoreIndexRay.build_from_spec

In [None]:
%psource VectorStoreIndexRay._get_node_embeddings

In [None]:
node_limit = 10

embedding_vector_store_spec = VectorStoreSpec.parse_obj(
    {"embedding_model_name": "BAAI/bge-small-en"}
)

store = VectorStoreIndexRay.build_from_spec(
    nodes=nodes[:node_limit] if node_limit else nodes,
    spec=embedding_vector_store_spec,
    num_gpus=0,
    batch_size=10,
)

In [None]:
from pathlib import Path

store.save(Path("./data/vector_index_new/"))

In [None]:
# to build the full vector store, uncomment and run the command below
# docsrag build-embedding-vector-store-index --config-path ./data/config.yaml --data-path ./data --overwrite

We load the vector store that was built in the previous step.

In [None]:
loaded_index = VectorStoreIndexRay.load(Path("data/vector_store/609458502334478189/"))

In [None]:
nodes_with_scores = loaded_index.retrieve_most_similiar_nodes(
    query="How can I migrate from a single-application config to a multi-application config in Ray Serve?",
    similarity_top_k=3,
)

In [None]:
print(f"Number of nodes fetched: {len(nodes_with_scores)}")

In [None]:
most_similar_node = nodes_with_scores[0]
print(most_similar_node.node.text[-1170:], end="\n\n")
print(f"{most_similar_node.node.metadata=}")
print(f"{most_similar_node.score=}")

## Evaluating our Embedding Index using standard ranking and classification metrics

- Step 1: Build a question and answer evaluation dataset from the ray documentation corpus
- Step 2: Assess the quality of our embedding index based on the built dataset
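
As an illustration of what ranking metrics look like for a retrieval index, the sketch below computes hit rate at k and mean reciprocal rank (MRR). These are two common choices and are an assumption here: the source does not state which metrics the docsrag evaluator computes, classification metrics are not covered, and the function names and node ids are hypothetical.

```python
def hit_rate_at_k(ranked_ids: list[list[str]], relevant_ids: list[str], k: int) -> float:
    """Fraction of questions whose source node appears among the top-k retrieved nodes."""
    hits = sum(rel in ranked[:k] for ranked, rel in zip(ranked_ids, relevant_ids))
    return hits / len(relevant_ids)


def mean_reciprocal_rank(ranked_ids: list[list[str]], relevant_ids: list[str]) -> float:
    """Average of 1 / rank of the source node, counting 0 when it was not retrieved."""
    total = 0.0
    for ranked, rel in zip(ranked_ids, relevant_ids):
        if rel in ranked:
            total += 1.0 / (ranked.index(rel) + 1)
    return total / len(relevant_ids)


# Made-up retrieval results for three evaluation questions.
toy_ranked = [["n1", "n7", "n3"], ["n4", "n2", "n9"], ["n5", "n6", "n8"]]
# The node each question was generated from.
toy_relevant = ["n1", "n2", "n0"]

for k in (1, 3):
    print(f"hit_rate@{k} = {hit_rate_at_k(toy_ranked, toy_relevant, k):.3f}")
print(f"MRR = {mean_reciprocal_rank(toy_ranked, toy_relevant):.3f}")
```

Because each generated question comes from a known node, that node serves as the ground-truth label, and no manual relevance annotation is needed.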

### Building an Evaluation Dataset
[<img src="eval_build.jpeg" height="500"/>](eval_build.jpeg)


In [None]:
from docsrag.evaluation_dataset_generator import EvaluationDatasetBuilder

In [None]:
from textwrap import dedent

eval_dataset_builder = EvaluationDatasetBuilder.parse_obj(
    {
        "qa_generator_open_ai": {
            "model": "gpt-3.5-turbo",
            "system_prompt": dedent(
                """
            You are a helpful assistant that generates questions and answers from a provided context.
            The context will be selected documents from the ray's project documentation.
            The questions you generate should be obvious on their own and should mimic what a developer might ask trying to work with ray, especially if they can't directly find the answer in the documentation.
            The answers should be factually correct, can be of a variable length and can contain code.
            If the provided context does not contain enough information to create a question and answer, you should respond with 'I can't generate a question and answer from this context'. 
            The following is an example of how the output should look:
            Q1: How can I view ray dashboard from outside the Kubernetes cluster?
            A1: You can use port-forwarding. Run the command 'kubectl port-forward --address 0.0.0.0 ${RAYCLUSTER_HEAD_POD} 8265:8265'

            Q2: {question}
            A2: {answer}
            """
            ).lstrip(),
            "user_prompt_template": dedent(
                """
        Provide questions and answers from the following context:

        {context}
        """
            ).lstrip(),
            "max_tokens": 1024,
            "temperature": 1.0,
            "top_p": 0.85,
            "frequency_penalty": 0,
            "presence_penalty": 0,
        },
        "noise_injector_from_parquet": {"dataset_name": "trivia_questions.parquet"},
    }
)

In [None]:
# Note this is the prompt used by llama-index in its finetuning module
# """\
# Context information is below.

# ---------------------
# {context_str}
# ---------------------

# Given the context information and not prior knowledge.
# generate only questions based on the below query.

# You are a Teacher/ Professor. Your task is to setup \
# {num_questions_per_chunk} questions for an upcoming \
# quiz/examination. The questions should be diverse in nature \
# across the document. Restrict the questions to the \
# context information provided."
# """

In [None]:
qa_generator_openai = eval_dataset_builder.qa_generator_open_ai

In [None]:
questions = qa_generator_openai.run(context=most_similar_node.node.text)

In [None]:
print(questions)

### Evaluate our Embedding Vector Index Store

[<img src="run_eval.jpeg" height="600"/>](run_eval.jpeg)


In [None]:
from docsrag.embedding.evaluation import VectorStoreEvaluator

In [None]:
evaluator = VectorStoreEvaluator(
    vector_store_index=loaded_index,
    evaluation_dataset_name=hash(eval_dataset_builder),
    top_ks=[1, 3, 5, 7, 10]
)

In [None]:
evaluator.run()

In [None]:
scores = _

In [None]:
import pandas as pd

pd.DataFrame(scores["cos_sim"]).reset_index().rename(columns={"index": "top_k"})

### Using the embedding vector store to augment our LLM

In [None]:
from llama_index.llms.openai import OpenAI
from llama_index.llm_predictor import LLMPredictor

In [None]:
predictor = LLMPredictor(
    llm=OpenAI(
        model="gpt-3.5-turbo",
        temperature=0.1,
        max_tokens=1000,
        max_retries=10,
    )
)

In [None]:
%psource predictor.predict

In [None]:
from llama_index.prompts.base import PromptTemplate
from llama_index.prompts.prompt_type import PromptType

In [None]:
df = pd.read_parquet("./data/eval_data")

In [None]:
df["question"].iloc[2020]

In [None]:
df["answer"].iloc[2020]

In [None]:
answer = predictor.predict(
    prompt=PromptTemplate(
        template=(
            df["question"].iloc[2020]
        )
    )
)

In [None]:
print(answer)

In [None]:
qa_prompt_tmpl_str = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)

qa_prompt_tmpl = PromptTemplate(qa_prompt_tmpl_str, prompt_type=PromptType.QUESTION_ANSWER)

In [245]:
query = df["question"].iloc[2020]

nodes = loaded_index.retrieve_most_similiar_nodes(
    query=query,
    similarity_top_k=1,
)

In [246]:
print(query)

How can I set the metric and mode for a Trial Scheduler in Tune?


In [247]:
nodes[0].score

0.9094396803081154

In [248]:
# print(nodes[0].node.text)

In [249]:
from llama_index.indices.prompt_helper import PromptHelper

prompt_helper = PromptHelper.from_llm_metadata(
    llm_metadata=predictor.llm.metadata,
    chunk_overlap_ratio=0.1,
    chunk_size_limit=None,
    tokenizer=None,
    separator=" ",
)

prompt_helper.num_output = 1024

In [250]:
len(
    [
        token
        for node in nodes
        for token in prompt_helper._tokenizer(node.node.text)
    ]
)

1010

In [251]:
text_chunks = [node.node.text for node in nodes]

In [252]:
context_str = '\n'.join(text_chunks)

In [254]:
question = qa_prompt_tmpl.template.format(context_str=context_str, query_str=query)
print(question)

Context information is below.
---------------------
.. _tune-schedulers:

Tune Trial Schedulers (tune.schedulers)

In Tune, some hyperparameter optimization algorithms are written as "scheduling algorithms".
These Trial Schedulers can early terminate bad trials, pause trials, clone trials,
and alter hyperparameters of a running trial.

All Trial Schedulers take in a ``metric``, which is a value returned in the result dict of your
Trainable and is maximized or minimized according to ``mode``.

.. code-block:: python

    from ray import tune
    from ray.air import session
    from tune.schedulers import ASHAScheduler

    def train_fn(config):
        # This objective function is just for demonstration purposes
        session.report({"loss": config["param"]})

    tuner = tune.Tuner(
        train_fn,
        tune_config=tune.TuneConfig(
            scheduler=ASHAScheduler(),
            metric="loss",
            mode="min",
            num_samples=10,
        ),
        param_space=

In [255]:
from llama_index.llms.base import ChatMessage

updated_answer = predictor.llm.chat(
    [ChatMessage(content=question)]
)

In [256]:
print(updated_answer)

assistant: To set the metric and mode for a Trial Scheduler in Tune, you need to specify them in the `tune.TuneConfig` object when creating the `tune.Tuner`. The `metric` parameter represents the value returned in the result dictionary of your Trainable, and the `mode` parameter specifies whether the metric should be maximized or minimized.

Here is an example of how to set the metric and mode for a Trial Scheduler in Tune:

```python
from ray import tune
from tune.schedulers import ASHAScheduler

tuner = tune.Tuner(
    train_fn,
    tune_config=tune.TuneConfig(
        scheduler=ASHAScheduler(),
        metric="loss",  # Set the metric to "loss"
        mode="min",  # Set the mode to "min" (to minimize the metric)
        num_samples=10,
    ),
    param_space={"param": tune.uniform(0, 1)},
)
results = tuner.fit()
```

In this example, the metric is set to "loss" and the mode is set to "min", indicating that the scheduler should minimize the "loss" metric.
