## Demo notebook for VA benefits QA agent

## Preprocess the data

Data is retrieved from https://www.va.gov/. URLs of specific pages of interest are stored in data/source.py, where the user can add or remove the links as needed.

To load the data, we use the parse_html method in src/data_processor.py, which parses the raw HTML and adds the section tags "<title>", "<descriptor>", "<content>", and "<topic>" to it.

In [1]:
from src.data_processor import parse_html
from data.source import URL
import re

chunks = []
for url in URL:
    text = parse_html(url)
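    # Drop {...} spans; the raw string avoids invalid-escape warnings and [^}] stops at the first closing brace
    text = re.sub(r"\{[^}]*\}", "", text)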
    chunks.append(text)

https://www.va.gov/disability/after-you-file-claim/
https://www.va.gov/change-address/
https://www.va.gov/change-direct-deposit/
https://www.va.gov/claim-or-appeal-status/
https://www.va.gov/decision-reviews/
https://www.va.gov/disability/eligibility/hazardous-materials-exposure/
https://www.va.gov/disability/eligibility/illnesses-within-one-year-of-discharge/
https://www.va.gov/disability/eligibility/ptsd/
https://www.va.gov/disability/eligibility/special-claims/
https://www.va.gov/disability/how-to-file-claim/additional-forms/
TABLE!!
https://www.va.gov/disability/how-to-file-claim/evidence-needed/fully-developed-claims/
https://www.va.gov/disability/upload-supporting-evidence/
https://www.va.gov/va-payment-history/
https://www.va.gov/disability/view-disability-rating/
https://www.va.gov/disability/how-to-file-claim/additional-forms/
TABLE!!
https://www.va.gov/disability/dependency-indemnity-compensation/
https://www.va.gov/disability/how-to-file-claim/when-to-file/
https://www.va.go
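
The `re.sub` cleanup in the cell above can be illustrated on its own. This is a minimal sketch, assuming the intent is to remove each brace-delimited span independently. The sample string and the helper name `strip_brace_spans` are made up for illustration; the real input comes from parse_html().

```python
import re

# Hypothetical sample text; not actual output of parse_html().
sample = '<title>Change your address {"@type": "WebPage"} <content>Sign in (VA.gov) {x} to update it.'

# [^}]* stops at the first closing brace, so each {...} span is removed separately.
BRACE_SPAN = re.compile(r"\{[^}]*\}")


def strip_brace_spans(text: str) -> str:
    """Remove every {...} span, then collapse the doubled spaces left behind."""
    without_braces = BRACE_SPAN.sub("", text)
    return re.sub(r" {2,}", " ", without_braces)


cleaned = strip_brace_spans(sample)
print(cleaned)
```

Text in parentheses is left untouched, since only curly braces delimit what gets removed.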

## Build knowledge index

Now we can use the text that we've collected from the source URLs to construct our knowledge base. This knowledge base is a retrieval index that stores the source material for generating our answers.

A few choices in this step matter. The build_index() method lets us configure how the input texts are chunked. I use the custom tags created in the first step to annotate metadata associated with the document body, via MarkdownHeaderTextSplitter from langchain. This way, even when longer text belonging to the same document is split into chunks, each chunk still carries the same tags. I set the chunk size to 500 characters with an overlap of 100, as this setup seems to give the best results for the model I use. However, the best values depend on the nature of your input text, model choice, and prompting strategy.

I use 'sentence-transformers/all-MiniLM-L6-v2' as the encoder under the hood to obtain the embeddings.

In [2]:
from src.database import Knowledge

knowledge = Knowledge("VA question answering system")
knowledge.build_index(data=chunks,
                      split_headers=[
                          ("<topic>", "topic"),
                          ("<descriptor>", "context"),
                      ],
                      chunk_size=500,
                      chunk_overlap=100)

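
To make the chunk_size and chunk_overlap settings concrete, here is a simplified sketch of fixed-size character chunking with overlap. It is not the splitter that build_index() uses; real text splitters usually try to break on natural boundaries such as paragraphs or sentences, and the function name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 100) -> list[str]:
    """Split text into fixed-size character windows that overlap their neighbours."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Hypothetical 1200-character document made of repeating digits.
document = "".join(str(i % 10) for i in range(1200))
pieces = chunk_text(document)
print([len(piece) for piece in pieces])
```

With these settings each window advances by 400 characters, so the last 100 characters of one chunk reappear at the start of the next.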


## Load LLM

Then we load the LLM that serves as the question answering agent in the backend. I wrapped this step in a function, load_model(), where you can configure which model to use. Since I only tested the app locally, the model has to be downloaded in advance. I store the model in model/.

Here, we can configure some hyperparameters, such as the generation temperature and the maximum number of new tokens.

In [3]:
from src.llm import load_model

llm = load_model(path='model/llama-2-7b-chat.ggmlv3.q4_0.bin',
                 model_type='llama',
                 max_new_tokens=256,
                 temperature=0.1)
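
As a rough illustration of what a low temperature such as 0.1 does, the sketch below applies temperature scaling to a softmax over made-up token scores. It shows only the general idea; the sampler inside the actual model backend may add further steps, and the function name and scores are assumptions.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn raw scores into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
sharp = softmax_with_temperature(logits, temperature=0.1)
flat = softmax_with_temperature(logits, temperature=1.0)
print(sharp)
print(flat)
```

At a temperature of 0.1 nearly all probability mass lands on the top-scoring token, which makes generation close to deterministic.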

## Set up the agent

Finally, we set up the agent. I use the prompt template stored in data/prompts.py. Depending on the model, additional prompts can be added or maintained.

For the agent to work, we also pass in the knowledge index created earlier. top_k controls how many passages are retrieved for the agent to base its answers on. I set it to 3, which provides enough context without making the prompt too long.

In [4]:
from src.llm import agent, set_qa_prompt
from data.prompts import qa_template

vectorstore = knowledge.get_database()
chat_agent = agent(llm=llm, 
                   vectorstore=vectorstore,
                   top_k=3,
                   qa_template_=qa_template)
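
The role of top_k can be sketched with a toy retriever that ranks passages by cosine similarity to the query embedding. The vectors, passage labels, and function names below are invented for illustration, and the actual vector store returned by get_database() may use a different similarity measure or index structure.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_vector, passages, top_k=3):
    """Return the texts of the top_k passages most similar to the query."""
    scored = [(cosine_similarity(query_vector, vector), text) for text, vector in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]


# Hypothetical passages with made-up 3-dimensional embeddings.
passages = [
    ("passage about decision reviews", [0.9, 0.1, 0.0]),
    ("passage about direct deposit", [0.0, 0.2, 0.9]),
    ("passage about informal conferences", [0.8, 0.3, 0.1]),
    ("passage about payment history", [0.1, 0.1, 0.8]),
    ("passage about filing deadlines", [0.4, 0.9, 0.1]),
]
query_vector = [1.0, 0.2, 0.0]
top = retrieve_top_k(query_vector, passages, top_k=3)
print(top)
```

A larger top_k gives the model more context to draw on, but every extra passage also lengthens the prompt.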


## Let's chat!

And now we can chat with this agent! Since I'm running locally, it may take a while to get the response.

In [6]:
question = "what is the review process?"
chat_agent({'query': question})

{'query': 'what is the review process?',
 'result': 'The Higher-Level Review process typically takes an average of 125 days (4 to 5 months) for VA to complete, with the possibility of taking longer if an informal conference is requested as part of the review. If you identify errors in your case during the review process, you can submit a written statement with your application to help VA make a decision faster.',
 'source_documents': [Document(page_content='information you want to talk about with the reviewer ready. Prepare to explain any errors in your case.', metadata={'topic': 'How do I ask for an informal conference?'}),
  Document(page_content='You can also request a Higher-Level Review by filling\xa0out a Decision Review Request: Higher-Level Review (VA Form 20-0996). Get VA Form 20-0996 to download Learn more about how to request a Higher-Level Review Note: You can’t submit any evidence. You and/or your representative can speak with the reviewer on the phone. You can tell them w
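
The raw dictionary is hard to read, so a small helper can print the answer together with the topic of each retrieved passage. This is a sketch that relies only on the keys and attributes visible in the output above ('query', 'result', 'source_documents', page_content, metadata). `FakeDocument` and `summarize_response` are hypothetical names, and the sample response is shortened by hand.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Hypothetical stand-in with the two attributes seen in the output above."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def summarize_response(response: dict, preview_chars: int = 60) -> str:
    """Format the question, the answer, and a short preview of each source passage."""
    lines = [f"Q: {response['query']}", f"A: {response['result']}", "Sources:"]
    for doc in response.get("source_documents", []):
        topic = doc.metadata.get("topic", "(no topic)")
        # Replace non-breaking spaces so the preview prints cleanly.
        preview = doc.page_content[:preview_chars].replace("\xa0", " ")
        lines.append(f"- {topic}: {preview}")
    return "\n".join(lines)


response = {
    "query": "what is the review process?",
    "result": "(answer text)",
    "source_documents": [
        FakeDocument(
            "Prepare to explain any errors in your case.",
            {"topic": "How do I ask for an informal conference?"},
        ),
    ],
}
summary = summarize_response(response)
print(summary)
```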


To make it slightly more user-friendly, I created a chainlit app that follows the same process. You can run **chainlit run ask.py** from the root directory to open a browser window and try it out!

However, the model sometimes generates gibberish. Hopefully this improves with more tuning of the prompt, the hyperparameters, or the chunking rules.

## Final thoughts

### The choice of chains
The chat agent was built with the RetrievalQA chain from langchain.chains. I also experimented with ConversationalRetrievalChain because I wanted to explicitly control how the agent uses chat history. However, the generated results seemed worse with ConversationalRetrievalChain, both for the initial question and for follow-ups where chat history is appended to the context. Using the RetrievalQA chain interactively did not lead to a drastic drop in answer quality on follow-ups. It may have had some effect, but my follow-up questions were largely independent of the previous question, since most questions have separate answers in the knowledge index anyway. So the choice of chain was relatively arbitrary, and the behavior should be evaluated with more carefully designed questions.

### Prompt design
I was surprised that a simple prompt describing the agent's role and the task, together with a template, can already achieve decent response quality. However, before I added the last sentence of the prompt template ("Make sure to use complete sentence to answer the question."), the model struggled with slightly more complicated questions. It would be interesting to see how the answers change under a more systematic evaluation of prompt design.
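
For readers who have not seen a retrieval QA prompt, the sketch below shows the general shape: a role description, the retrieved context, the question, and a closing instruction. Only the final sentence is quoted from this notebook. The rest of the wording, the placeholder names, and `build_prompt` are assumptions; the real template lives in data/prompts.py and is not shown here.

```python
# Hypothetical template; only the last sentence is quoted from the notebook text.
QA_TEMPLATE_SKETCH = (
    "You are an assistant that answers questions about VA benefits.\n"
    "Use only the context below to answer the question.\n\n"
    "Context: {context}\n"
    "Question: {question}\n\n"
    "Make sure to use complete sentence to answer the question."
)


def build_prompt(context_passages: list[str], question: str) -> str:
    """Join the retrieved passages and fill both placeholders in the template."""
    context = "\n".join(context_passages)
    return QA_TEMPLATE_SKETCH.format(context=context, question=question)


prompt = build_prompt(
    ["(retrieved passage one)", "(retrieved passage two)"],
    "what is the review process?",
)
print(prompt)
```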

### Effect of chunking rule
How chunking was performed had the most dramatic effect on model performance. Slightly shorter chunks with relatively more overlap seem to have improved generation quality the most. This might be due to the limited context length of the smaller models that I can run locally, and to how the input documents are generally structured.

Overall, there is still a lot that can be done to understand the behavior of this simple LLM-based agent, and there is likely substantial room for improvement.
