# RAG using LLaMa-2 & LangChain


## Model details

* **Model**: Llama 2  
* **Variation**: 7b-chat-hf  (7b: 7 billion parameters; hf: Hugging Face build)
* **Version**: V1  
* **Framework**: PyTorch  

Llama 2 is a family of pretrained and fine-tuned models ranging from 7 to 70 billion parameters, pretrained on 2 trillion tokens, which makes it one of the more powerful open-source models. It is a significant improvement over Llama 1.

In [1]:
!python3.9 -m pip install -qU ipdb
import ipdb



# Installations, imports, utils

In [2]:
!python3.9 -m pip install -qU transformers==4.33.0 accelerate==0.22.0 einops==0.6.1 langchain==0.0.300 xformers==0.0.21 \
bitsandbytes==0.41.1 sentence_transformers==2.2.2 chromadb==0.4.12



In [3]:
from torch import cuda, bfloat16
import torch
import transformers
from transformers import AutoTokenizer
from time import time
from langchain.llms import HuggingFacePipeline
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma




# Initialize model, tokenizer, query pipeline

Define the model, the device, and the `bitsandbytes` configuration.

In [4]:
model_id = 'meta-llama/Llama-2-7b-chat-hf'
# model_id = 'HuggingFaceH4/zephyr-7b-beta'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

In [5]:
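# alternative: 8-bit quantization (overrides the 4-bit config defined above);
# either config only takes effect if `quantization_config=bnb_config` is passed when loading the model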
bnb_config = transformers.BitsAndBytesConfig(
    load_in_8bit=True,
)
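
As a rough guide to why the precision setting matters, the memory needed for the weights is approximately the parameter count times the bytes per parameter. The sketch below is a back-of-the-envelope estimate only: it assumes a round 7 billion parameters and ignores activations, the KV cache, and quantization overhead. `weight_memory_gb` is a hypothetical helper, not part of any library.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 7e9  # assumption: round figure for the 7B model

for label, bits in [("float16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>8}: ~{weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

The float16 figure corresponds to the "2x7=14GB" estimate used when loading the model later in this notebook.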

In [6]:
# https://stackoverflow.com/questions/63312859/how-to-change-huggingface-transformers-default-cache-directory
import os

HF_TOKEN=os.environ['HF_TOKEN']
HF_CACHE=os.environ['HF_CACHE']

Prepare the model and the tokenizer.

In [7]:
time_1 = time()
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    token=HF_TOKEN,
)

# https://github.com/langchain-ai/langchain/issues/6608
# use `float16` (2 bytes per parameter), so the 7B weights need about 2 x 7 = 14 GB of GPU RAM
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    # quantization_config=bnb_config,
    torch_dtype=torch.float16, 
    device_map='auto',
    token=HF_TOKEN,
    resume_download=True,
    cache_dir=HF_CACHE,
)
tokenizer = AutoTokenizer.from_pretrained(model_id, token=HF_TOKEN)
time_2 = time()
print(f"Prepare model, tokenizer: {round(time_2-time_1, 3)} sec.")

Loading checkpoint shards: 100%|██████████| 2/2 [00:05<00:00,  2.79s/it]


Prepare model, tokenizer: 10.453 sec.


Define the query pipeline.

In [8]:
# pipeline setup follows https://discuss.huggingface.co/t/how-do-i-increase-max-new-tokens/43098

time_1 = time()
query_pipeline = transformers.pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        torch_dtype=torch.float16,
        device_map="auto",
        max_new_tokens=512,
        return_full_text=True,
        temperature=0.1,
        top_p=0.15,
        top_k=0,
        repetition_penalty=1.1)
time_2 = time()
print(f"Prepare pipeline: {round(time_2-time_1, 3)} sec.")

Prepare pipeline: 0.092 sec.
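
To build intuition for the `temperature` and `top_p` settings used in the pipeline, here is a toy sketch of temperature scaling followed by nucleus (top-p) filtering on a four-token vocabulary. It is an illustration under simplified assumptions, not the `transformers` implementation; the logits and the helper names are made up.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most-probable tokens whose cumulative mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}  # renormalize over the kept tokens

toy_logits = [2.0, 1.0, 0.5, 0.1]  # hypothetical scores for four tokens
probs = softmax_with_temperature(toy_logits, temperature=0.1)
candidates = top_p_filter(probs, top_p=0.15)
print(candidates)
```

With such a low temperature and a small `top_p`, only the highest-scoring token survives the filter in this toy case, so sampling behaves almost like greedy decoding.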


## Retrieval Augmented Generation

## Check the model with a HuggingFace pipeline


We check the model with a Hugging Face pipeline, using a query about a VA disability claim.

In [9]:
llm = HuggingFacePipeline(pipeline=query_pipeline)
# checking again that everything is working fine
llm(prompt= "Please explain what is a VA disability claim. Keep it in 100 words.")

"\nA VA disability claim is a request to the Department of Veterans Affairs (VA) for compensation or pension benefits due to a service-connected disability. The claim must provide evidence of the veteran's service history, medical conditions, and how those conditions are related to their military service. The VA will review the claim and make a determination based on the evidence provided."

## Ingestion of data using a text loader

We will ingest VA benefits HTML pages that have already been split into chunks and saved to `input/train.jsonl`.

### Creating Embeddings and Storing in Vector Store

Create the embeddings using Sentence Transformer and HuggingFace embeddings.

In [10]:
model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": "cuda"}

embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)

In [11]:
# load chunks from train.jsonl file
import json
documents = []

with open('input/train.jsonl', 'r') as f:
    for line in f:
        doc = json.loads(line)
        documents.append(doc)

len(documents)

93
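
The loading loop expects one JSON object per line. Judging from the fields this notebook reads later, each record carries at least a `text` and a `source` key. The sketch below parses a small in-memory sample and checks for those keys. The sample records and the `load_jsonl` helper are hypothetical, and the real `input/train.jsonl` may contain additional fields.

```python
import io
import json

REQUIRED_KEYS = {"text", "source"}  # the fields this notebook reads later

def load_jsonl(handle):
    """Parse one JSON object per non-empty line and check the required keys."""
    records = []
    for line_no, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = REQUIRED_KEYS - set(record)
        if missing:
            raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
        records.append(record)
    return records

# Hypothetical sample standing in for input/train.jsonl
sample = io.StringIO(
    '{"text": "First chunk of a page.", "source": "page_a.html"}\n'
    '\n'
    '{"text": "Second chunk.", "source": "page_b.html"}\n'
)
sample_docs = load_jsonl(sample)
print(len(sample_docs))
```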

Initialize ChromaDB with the document chunks and the embeddings defined previously, persisting the index locally in `chroma_db`.

In [12]:
vectordb = Chroma.from_texts(texts=[doc['text'] for doc in documents], embedding=embeddings, persist_directory="chroma_db",
                            metadatas=[{'source': doc['source']} for doc in documents])
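
Conceptually, a vector store such as Chroma keeps one embedding per chunk and returns the chunks closest to the query embedding. The following toy sketch illustrates that idea with brute-force cosine similarity over hand-written 3-dimensional vectors. It is not how Chroma is implemented (real stores typically use approximate indexes and may use a different distance metric), and the vectors, texts, and function names are invented for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=2):
    """Return the k (text, score) pairs most similar to the query vector."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical chunks with made-up embeddings
toy_store = [
    ("how to file a claim", [0.9, 0.1, 0.0]),
    ("claim processing time", [0.1, 0.9, 0.1]),
    ("education benefits", [0.0, 0.2, 0.9]),
]
toy_query = [0.2, 0.8, 0.0]  # pretend embedding of a question about processing time

for text, score in top_k(toy_query, toy_store):
    print(f"{score:.3f}  {text}")
```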

In [13]:
# from langchain.vectorstores import FAISS
# vectordb = FAISS.from_texts(texts=[doc['text'] for doc in documents], embedding=embeddings,
#                             metadatas=[{'source': doc['source']} for doc in documents])

### Initialize RAG chain

In [14]:
retriever = vectordb.as_retriever()

qa = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=retriever, 
    verbose=True
)
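
The `"stuff"` chain type places all retrieved chunks into a single prompt together with the question. The sketch below mimics that idea in plain Python. The prompt wording and the `build_stuff_prompt` helper are assumptions for illustration and do not reproduce LangChain's actual template.

```python
def build_stuff_prompt(question, chunks):
    """Concatenate retrieved chunks into one context block and append the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Hypothetical retrieved chunks
chunks = [
    "Chunk A: text retrieved from the vector store.",
    "Chunk B: more retrieved text.",
]
prompt = build_stuff_prompt("What is a VA disability claim?", chunks)
print(prompt)
```

Because every chunk ends up in one prompt, the combined length has to fit in the model's context window, which is why chunk size and the number of retrieved documents matter.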

## Test the Retrieval-Augmented Generation 


We define a test function that runs the query and times it.

In [15]:
from IPython.display import Markdown, display

def printmd(strings, color=None):
    for string in strings.split('\n'):
        colorstr = "<span style='color:{}'>{}</span>".format(color, string)
        display(Markdown(colorstr))

In [16]:
def test_rag(qa, query):
    printmd(f"**Query**: {query}\n", color="orange")
    time_1 = time()
    result = qa.run(query)
    time_2 = time()
    # NOTE: len(result) is the number of characters in the answer, not tokens
    print(f"Inference time: {round(time_2-time_1, 3)} sec. on {len(result)} tokens")
    printmd(f"**Result**: {result}\n", color="green")
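
To report an actual token count in `test_rag`, the answer has to be passed through a tokenizer rather than measured with `len` on the string. The sketch below shows the idea with a stand-in whitespace tokenizer so that it runs without downloading a model. `WhitespaceTokenizer` and `count_tokens` are hypothetical. With the notebook's Hugging Face `tokenizer`, `len(tokenizer.encode(text))` should serve the same purpose, although that count can include special tokens.

```python
class WhitespaceTokenizer:
    """Stand-in tokenizer: splits on whitespace (real tokenizers use subword units)."""

    def encode(self, text):
        return text.split()

def count_tokens(text, tokenizer):
    """Count tokens using any object that exposes an `encode(text)` method."""
    return len(tokenizer.encode(text))

answer = "A VA disability claim is a request for benefits."
print("characters:", len(answer))
print("tokens:", count_tokens(answer, WhitespaceTokenizer()))
```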

Let's check a few queries.

In [17]:
query = "What is a VA disability claim?"
test_rag(qa, query)

<span style='color:orange'>**Query**: What is a VA disability claim?</span>

<span style='color:orange'></span>



> Entering new RetrievalQA chain...

> Finished chain.
Inference time: 3.524 sec. on 359 tokens


<span style='color:green'>**Result**:  A VA disability claim is a request to the Department of Veterans Affairs (VA) for compensation or pension due to a service-connected disability. To file a claim, veterans must provide evidence of their service connection and the nature and extent of their disability. The VA will then review the claim and make a determination based on the evidence provided.</span>

<span style='color:green'></span>

In [18]:
query = "What factors will influence the time taken to review your VA claim?"
test_rag(qa, query)

<span style='color:orange'>**Query**: What factors will influence the time taken to review your VA claim?</span>

<span style='color:orange'></span>



> Entering new RetrievalQA chain...

> Finished chain.
Inference time: 3.834 sec. on 424 tokens


<span style='color:green'>**Result**:  The amount of time it takes to review a VA disability claim depends on several factors, including the complexity of the claim, the amount of evidence required to support the claim, and the workload of the VA office processing the claim. It is difficult to provide an exact estimate of how long it will take to review a specific claim without knowing the details of the claim and the VA office responsible for processing it.</span>

<span style='color:green'></span>

In [19]:
# query = "What is the average number of days to complete processing a claim?"
query = "How many days are needed to complete processing a claim?"
# ipdb.set_trace()
test_rag(qa, query)

<span style='color:orange'>**Query**: How many days are needed to complete processing a claim?</span>

<span style='color:orange'></span>



> Entering new RetrievalQA chain...

> Finished chain.
Inference time: 1.934 sec. on 173 tokens


<span style='color:green'>**Result**:  According to the provided data, the average number of days required to finish processing a disability-related claim is 103.3 days.</span>

<span style='color:green'></span>

<span style='color:green'>Do you know the answer to this question?</span>

<span style='color:green'></span>

In [20]:
query = "What kinds of benefits can be claimed?"
test_rag(qa, query)

<span style='color:orange'>**Query**: What kinds of benefits can be claimed?</span>

<span style='color:orange'></span>



> Entering new RetrievalQA chain...

> Finished chain.
Inference time: 3.656 sec. on 306 tokens


<span style='color:green'>**Result**:  The types of benefits that can be claimed are listed below.</span>

<span style='color:green'></span>

<span style='color:green'>• Auto Allowance</span>

<span style='color:green'>• Auto Adaptive-Equipment Grant</span>

<span style='color:green'>• Additional Benefits Because You or Your Spouse Needs Aid and Attendance</span>

<span style='color:green'>• Aid and Attendance Because You’re in a Nursing Home</span>

<span style='color:green'>• Dependents</span>

<span style='color:green'></span>

<span style='color:green'>Please select one of the options from the list above.</span>

<span style='color:green'></span>

## Document sources

Let's check the document sources for the last query.

In [21]:
docs = vectordb.similarity_search(query)
print(f"Query: {query}")
print(f"Retrieved documents: {len(docs)}")
for doc in docs:
    # read the metadata and text directly from the Document attributes
    print("Source: ", doc.metadata['source'])
    print("Text: ", doc.page_content, "\n")

Query: What kinds of benefits can be claimed?
Retrieved documents: 4
Source:  input/va/disability_how-to-file-claim_additional-forms.html
Text:  You’ll need to turn in:      


 An Application in Acquiring Specially Adapted Housing or Special Home Adaptation Grant (VA Form 26-4555)  Get VA Form 26-4555 to download 




 If you’re:      


 Claiming an auto allowance 


 You’ll need to turn in:      


 An Application for Automobile or Other Conveyance and Adaptive Equipment (VA Form 21-4502)  Get VA Form 21-4502 to download 




 If you’re:      


 Claiming an auto adaptive-equipment grant 


 You’ll need to turn in:      


 An Application for Adaptive Equipment—Motor Vehicle (VA Form 10-1394)  Get VA Form 10-1394 to download 




 If you’re:      


 Claiming additional benefits because you or your spouse needs Aid and Attendance 


 You’ll need to turn in:      


 An Examination for Housebound Status or Permanent Need for Regular Aid and Attendance (VA Form 21-2680)  Get VA Form 2