
## Objective

Use Llama 2.0, LangChain and ChromaDB to create a Retrieval Augmented Generation (RAG) system. This will allow us to ask questions about our documents (that were not included in the training data), without fine-tuning the Large Language Model (LLM).
With RAG, when you are given a question, you first run a retrieval step to fetch any relevant documents from a vector database where these documents were indexed. The retrieved text is then passed to the LLM together with the question.

## Definitions

* LLM - Large Language Model  
* Llama 2.0 - LLM from Meta 
* LangChain - a framework designed to simplify the creation of applications using LLMs
* Vector database - a database that organizes data through high-dimensional vectors  
* ChromaDB - vector database  
* RAG - Retrieval Augmented Generation (see below for more details about RAG)

## Model details

* **Model**: Llama 2  
* **Variation**: 7b-chat-hf  (7b: 7 billion parameters, hf: Hugging Face build)
* **Version**: V1  
* **Framework**: PyTorch  

Llama 2 is a family of pretrained and fine-tuned models with 7 to 70 billion parameters, pretrained on 2 trillion tokens, which makes it one of the more powerful open-source models. It is a significant improvement over Llama 1.


## What is a Retrieval Augmented Generation (RAG) system?

Large Language Models (LLMs) have proven their ability to understand context and provide accurate answers on various NLP tasks, including summarization and Q&A, when prompted. While they can give very good answers to questions about information they were trained on, they tend to hallucinate when the topic is something they do "not know", i.e. something that was not included in their training data. Retrieval Augmented Generation combines external resources with LLMs. The two main components of a RAG system are therefore a retriever and a generator.  
 
The retriever can be described as a system that encodes our data so that the relevant parts can be easily retrieved when we query it. The encoding is done using text embeddings, i.e. a model trained to create a vector representation of the information. A common choice for implementing a retriever is a vector database. There are multiple options, both open source and commercial; a few examples are ChromaDB, Milvus, FAISS, Pinecone and Weaviate. In this Notebook we will use a local, persistent instance of ChromaDB.
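
To make the retrieval idea concrete, here is a minimal sketch in plain Python. The three-dimensional vectors and the `toy_index` dictionary are made up for illustration; in a real system the vectors would come from an embedding model and the search would be delegated to the vector database.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vector, index, k=2):
    """Return the k texts whose vectors are most similar to the query."""
    ranked = sorted(
        index.items(),
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

# Hypothetical, hand-written "embeddings" (not produced by any model).
toy_index = {
    "visionOS powers the headset": [0.9, 0.1, 0.0],
    "The display has more pixels than a 4K TV": [0.7, 0.3, 0.1],
    "Bananas are rich in potassium": [0.0, 0.1, 0.9],
}

top = retrieve([1.0, 0.2, 0.0], toy_index, k=2)
print(top)
```

The unrelated sentence is ranked last because its vector points in a very different direction from the query vector.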

For the generator part, the obvious option is an LLM. In this Notebook we will use a quantized Llama 2 model, from the Kaggle Models collection.  
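
As a rough illustration of why quantization helps, the arithmetic below estimates the memory needed for the weights alone. It assumes a nominal 7 billion parameters and ignores activations, the KV cache and the small overhead of quantization constants, so the real footprint will be somewhat higher.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 7e9  # nominal size of the 7B model (assumption for this estimate)
fp16_gb = weight_memory_gb(n_params, 16)
int4_gb = weight_memory_gb(n_params, 4)
print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f"4-bit weights:  ~{int4_gb:.1f} GB")
```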

The orchestration of the retriever and generator will be done using LangChain. A specialized function from LangChain allows us to create the retriever-generator chain in one line of code.


# Installations, imports, utils

In [1]:
from torch import cuda, bfloat16
import torch
import transformers
from time import time
from langchain.llms import HuggingFacePipeline
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import UnstructuredFileLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import pipeline



# Initialize model, tokenizer, query pipeline

Define the model, the device, and the `bitsandbytes` configuration.

In [2]:
model_name = "/home/jomondal/experiments/mywork/pretrained_models/Llama-2-7b-chat-hf"
device = f'cuda:{torch.cuda.current_device()}' if torch.cuda.is_available() else 'cpu'

In [3]:
compute_dtype = torch.float16

# 4-bit NF4 quantization via bitsandbytes; computations run in float16
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=False,
    bnb_4bit_compute_dtype=compute_dtype
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map='auto',
    quantization_config=bnb_config,
)
model.config.use_cache = False
model.config.pretraining_tp = 1

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    padding_side="left",
    add_eos_token=True,
    add_bos_token=True,
)
tokenizer.pad_token = tokenizer.eos_token


With the model and the tokenizer prepared, define the query pipeline.

In [4]:
query_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_new_tokens=1024,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
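
The pipeline samples with `do_sample=True` and `top_k=10`, meaning each new token is drawn only from the 10 highest-scoring candidates. The sketch below illustrates that idea in plain Python on a made-up 5-token vocabulary; it is a simplified illustration, not the implementation used inside `transformers`, and `top_k_sample` and `toy_logits` are hypothetical names.

```python
import math
import random

def top_k_sample(logits, k, rng=random):
    """Sample a token index from the k highest-scoring logits."""
    # Keep only the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over the surviving logits (subtract the max for stability).
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    # Draw one index in proportion to its weight.
    return rng.choices(top, weights=weights, k=1)[0]

toy_logits = [2.0, 0.5, -1.0, 3.0, 0.0]  # made-up scores for a 5-token vocabulary
rng = random.Random(0)
samples = [top_k_sample(toy_logits, k=2, rng=rng) for _ in range(20)]
print(set(samples))
```

With `k=2`, only indices 3 and 0 (the two largest logits) can ever be drawn, however many samples are taken.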

We define a test prompt for the pipeline.

In [5]:
prompt_query = "Tell me about apple vision pro? Keep it in 100 words."

In [6]:
query_pipeline(prompt_query)

[{'generated_text': "Tell me about apple vision pro? Keep it in 100 words. everybody knows about apple vision pro.\n\nApple Vision Pro is a machine learning-based image processing tool designed for professional photographers and videographers. It offers advanced features such as facial detection, tracking, and recognition, as well as object detection and removal. The tool also provides advanced color grading and tone mapping capabilities, allowing users to enhance and enhance their images. Apple Vision Pro is available as a standalone application for Mac and PC, and is also integrated into Apple's Final Cut Pro and Motion video editing software."}]

## Test the query pipeline

The cell above tests the pipeline with a query about Apple Vision Pro. Without access to our document, the model hallucinates and describes an image processing tool.

# Retrieval Augmented Generation

## Check the model with a HuggingFace pipeline


We wrap the query pipeline in LangChain's `HuggingFacePipeline` and check it with the same query about Apple Vision Pro.

In [7]:
llm = HuggingFacePipeline(pipeline=query_pipeline)
# check again that everything works; `invoke` replaces the deprecated direct call
llm.invoke(prompt_query)


' Unterscheidung between the two can be difficult, but there are some key differences: Apple Vision Pro is a new AI-powered tool from Apple that helps photographers and videographers improve their workflow by automatically organizing and tagging their photos and videos. Apple Vision is a new AI-powered tool from Apple that helps photographers and videographers improve their workflow by automatically organizing and tagging their photos and videos. Apple Vision Pro is a new AI-powered tool from Apple that helps photographers and videographers improve their workflow by automatically organizing and tagging their photos and videos. Apple Vision Pro is a new AI-powered tool from Apple that helps photographers and videographers improve their workflow by automatically organizing and tagging their photos and videos. Apple Vision Pro is a new AI-powered tool from Apple that helps photographers and videographers improve their workflow by automatically organizing and tagging their photos and video

## Ingestion of data using TextLoader

We will ingest a text file about Apple Vision Pro.

In [8]:
loader = TextLoader("dataset/apple_vision_pro.txt",encoding="utf8")

documents = loader.load()

## Split data in chunks

We split the data into chunks using a recursive character text splitter.

In [9]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
all_splits = text_splitter.split_documents(documents)
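
To show what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch that cuts a string into fixed-size windows sharing a few characters. This is only an illustration: `RecursiveCharacterTextSplitter` additionally tries to split on separators such as paragraph and line breaks, so its chunk boundaries will differ. The helper name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=2)
print(chunks)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

Each chunk starts with the last two characters of the previous one, so a sentence cut at a boundary still appears partly in both chunks.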

## Creating Embeddings and Storing in Vector Store

Create the embeddings with a Sentence Transformers model, loaded through LangChain's `HuggingFaceEmbeddings` wrapper.

In [10]:
model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": "cuda"}

embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)

Initialize ChromaDB with the document splits, the embeddings defined previously and with the option to persist it locally.

In [11]:
vectordb = Chroma.from_documents(documents=all_splits, embedding=embeddings, persist_directory="chroma_db")

## Initialize chain

In [12]:
retriever = vectordb.as_retriever()

qa = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=retriever, 
    verbose=True
)
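
The `"stuff"` chain type places all retrieved chunks into a single prompt that is sent to the LLM. The sketch below shows that idea in plain Python. The template wording and the `build_stuff_prompt` helper are assumptions for illustration and are not LangChain's actual default prompt.

```python
def build_stuff_prompt(question, retrieved_chunks):
    """Concatenate all retrieved chunks into one context block for the LLM."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Hypothetical chunks standing in for what the retriever would return.
example_chunks = [
    "Apple Vision Pro is powered by visionOS.",
    "It features ultra-high-resolution displays.",
]
prompt = build_stuff_prompt("What powers Apple Vision Pro?", example_chunks)
print(prompt)
```

Because every chunk goes into one prompt, this approach only works while the combined text fits in the model's context window.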

## Test the Retrieval-Augmented Generation 


We define a test function that runs the query and times it.

In [19]:
def test_rag(qa, query):
    print(f"Query: {query}\n")
    time_1 = time()
    result = qa.invoke(query)
    time_2 = time()
    print(f"Inference time: {round(time_2-time_1, 3)} sec.")
    print("\nResult: ", result)

Let's run the same query through the RAG chain.

In [20]:
test_rag(qa, prompt_query)

Query: Tell me about apple vision pro? Keep it in 100 words.



> Entering new RetrievalQA chain...

> Finished chain.
Inference time: 4.536 sec.

Result:  {'query': 'Tell me about apple vision pro? Keep it in 100 words.', 'result': ' Apple Vision Pro is a revolutionary device that combines cutting-edge technology with a user-friendly interface, delivering an immersive and intuitive experience for work, entertainment, and communication. With its ultra-high-resolution displays, advanced 3D user interface, and innovative features like Spatial Audio and Spatial Photos, Vision Pro offers a new level of engagement and interaction with digital content.'}


## Document sources

Let's check the document sources for the last query run.

In [18]:
docs = vectordb.similarity_search(prompt_query)
print(f"Query: {prompt_query}")
print(f"Retrieved documents: {len(docs)}")
for doc in docs:
    # each Document exposes its metadata and text directly
    print("Source: ", doc.metadata['source'])
    print("Text: ", doc.page_content, "\n")

Query: Tell me about apple vision pro? Keep it in 100 words.
Retrieved documents: 4
Source:  dataset/apple_vision_pro.txt
Text:  Apple Vision Pro is powered by visionOS, which is built on the foundation of decades of engineering innovation in macOS, iOS, and iPadOS. visionOS delivers powerful spatial experiences, unlocking new opportunities at work and at home. Featuring a brand-new three-dimensional user interface and input system controlled entirely by a user’s eyes, hands, and voice, navigation feels magical. Intuitive gestures allow users to interact with apps by simply looking at them, tapping their fingers to select, flicking their wrist to scroll, or using a virtual keyboard or dictation to type. With Siri, users can quickly open or close apps, play media, and more.
Extraordinary Experiences: 

Source:  dataset/apple_vision_pro.txt
Text:  The ultimate entertainment experience: Apple Vision Pro features ultra-high-resolution displays that deliver more pixels than a 4K TV for each