# RAG Demo


This notebook draws heavily on:
1. https://www.datacamp.com/tutorial/chromadb-tutorial-step-by-step-guide
2. https://www.youtube.com/watch?v=3yPBVii7Ct0&t=148s
3. https://www.youtube.com/watch?v=ypzmPwLH_Q4

In [None]:
!pip install -qU \
  transformers \
  sentence-transformers \
  datasets \
  accelerate \
  einops \
  langchain \
  xformers \
  bitsandbytes \
  chromadb

# Embedding Model

In [None]:
from torch import cuda
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

embed_model_id = 'BAAI/bge-small-en-v1.5'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

embed_model = HuggingFaceEmbeddings(
    model_name=embed_model_id,
    model_kwargs={'device': device},
    encode_kwargs={'device': device, 'batch_size': 32}
)

In [3]:
docs = [
    "this is one document",
    "and another document"
]

embeddings = embed_model.embed_documents(docs)

print(f"We have {len(embeddings)} doc embeddings, each with "
      f"a dimensionality of {len(embeddings[0])}.")

We have 2 doc embeddings, each with a dimensionality of 384.
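
Retrieval works by comparing such vectors. As a rough illustration, the sketch below computes cosine similarity, one common comparison, on toy 3-dimensional vectors (hypothetical values, not real model output). Which distance metric a vector store actually uses depends on how the collection is configured, so treat this as an assumption rather than a description of ChromaDB's internals.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real 384-dimensional embeddings.
query_vec = [1.0, 0.0, 1.0]
doc_close = [2.0, 0.0, 2.0]  # same direction as the query
doc_far = [0.0, 1.0, 0.0]    # orthogonal to the query

print(cosine_similarity(query_vec, doc_close))  # close to 1
print(cosine_similarity(query_vec, doc_far))    # 0 for orthogonal vectors
```

The same function could be applied to two rows of `embeddings` from the cell above.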


# Create the DB

## Example data

In [4]:
import chromadb

# An in-memory client is enough for quick experiments:
# client = chromadb.EphemeralClient()

# Use a persistent client so the collection is saved to disk under db/.
client = chromadb.PersistentClient(path="db/")

In [5]:
# student_info = """
# Alexandra Thompson, a 19-year-old computer science sophomore with a 3.7 GPA,
# is a member of the programming and chess clubs who enjoys pizza, swimming, and hiking
# in her free time in hopes of working at a tech company after graduating from the University of Washington.
# """

# club_info = """
# The university chess club provides an outlet for students to come together and enjoy playing
# the classic strategy game of chess. Members of all skill levels are welcome, from beginners learning
# the rules to experienced tournament players. The club typically meets a few times per week to play casual games,
# participate in tournaments, analyze famous chess matches, and improve members' skills.
# """

# university_info = """
# The University of Washington, founded in 1861 in Seattle, is a public research university
# with over 45,000 students across three campuses in Seattle, Tacoma, and Bothell.
# As the flagship institution of the six public universities in Washington state,
# UW encompasses over 500 buildings and 20 million square feet of space,
# including one of the largest library systems in the world.
# """
# collection = client.get_or_create_collection(name="Student")

# collection.add(
#     documents = [student_info, club_info, university_info],
#     metadatas = [{"source": "student info"},{"source": "club info"},{'source':'university info'}],
#     ids = ["id1", "id2", "id3"]
# )
# results = collection.query(
#     query_texts=["What is the student name?"],
#     n_results=2
# )

# results

## arXiv

In [6]:
from datasets import load_dataset

data = load_dataset(
    'jamescalam/llama-2-arxiv-papers-chunked',
    split='train'
)
data[0]

{'doi': '1102.0183',
 'chunk-id': '0',
 'chunk': 'High-Performance Neural Networks\nfor Visual Object Classi\x0ccation\nDan C. Cire\x18 san, Ueli Meier, Jonathan Masci,\nLuca M. Gambardella and J\x7f urgen Schmidhuber\nTechnical Report No. IDSIA-01-11\nJanuary 2011\nIDSIA / USI-SUPSI\nDalle Molle Institute for Arti\x0ccial Intelligence\nGalleria 2, 6928 Manno, Switzerland\nIDSIA is a joint institute of both University of Lugano (USI) and University of Applied Sciences of Southern Switzerland (SUPSI),\nand was founded in 1988 by the Dalle Molle Foundation which promoted quality of life.\nThis work was partially supported by the Swiss Commission for Technology and Innovation (CTI), Project n. 9688.1 IFF:\nIntelligent Fill in Form.arXiv:1102.0183v1  [cs.AI]  1 Feb 2011\nTechnical Report No. IDSIA-01-11 1\nHigh-Performance Neural Networks\nfor Visual Object Classi\x0ccation\nDan C. Cire\x18 san, Ueli Meier, Jonathan Masci,\nLuca M. Gambardella and J\x7f urgen Schmidhuber\nJanuary 2011\nAbs

In [7]:
data['chunk'][1]

'January 2011\nAbstract\nWe present a fast, fully parameterizable GPU implementation of Convolutional Neural\nNetwork variants. Our feature extractors are neither carefully designed nor pre-wired, but\nrather learned in a supervised way. Our deep hierarchical architectures achieve the best\npublished results on benchmarks for object classi\x0ccation (NORB, CIFAR10) and handwritten\ndigit recognition (MNIST), with error rates of 2.53%, 19.51%, 0.35%, respectively. Deep\nnets trained by simple back-propagation perform better than more shallow ones. Learning\nis surprisingly rapid. NORB is completely trained within \x0cve epochs. Test error rates on\nMNIST drop to 2.42%, 0.97% and 0.48% after 1, 3 and 17 epochs, respectively.\n1 Introduction\nThe human visual system e\x0eciently recognizes and localizes objects within cluttered scenes. For\narti\x0ccial systems, however, this is still di\x0ecult, due to viewpoint-dependent object variability,\nand the high in-class variability of many obj

In [8]:
# data_sliced = data.select(range(10))

In [9]:
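from chromadb.utils import embedding_functions

# Chroma's embedding function wraps the same sentence-transformers model.
# Note: this rebinds `embed_model`, replacing the LangChain wrapper created earlier.
embed_model = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name=embed_model_id
)
collection_paper = client.get_or_create_collection(
    name="ArxivPapers", embedding_function=embed_model
)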

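Embedding the full dataset and adding it to the collection took over 30 minutes. For a quicker run, use the commented-out `data_sliced` subset instead.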

In [10]:
# data_pd = data_sliced.to_pandas()
data_pd = data.to_pandas()

batch_size = 32

# Batch processing
for i in range(0, len(data_pd), batch_size):
    i_end = min(len(data_pd), i + batch_size)
    batch = data_pd.iloc[i:i_end]

    # Prepare IDs, texts and metadata (`_` avoids reusing the batch index name `i`)
    ids = [f"{x['doi']}-{x['chunk-id']}" for _, x in batch.iterrows()]
    texts = [x['chunk'] for _, x in batch.iterrows()]
    metadata = [{'source': x['source'], 'title': x['title']} for _, x in batch.iterrows()]

    # Add to ChromaDB collection
    collection_paper.add(
        documents=texts,
        metadatas=metadata,
        ids=ids
    )
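
The batching pattern in the loop above can be isolated into a small helper. The sketch below is a minimal, standard-library-only illustration; `iter_batches` and the toy `rows` are hypothetical names introduced here, not part of the notebook or of ChromaDB.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Toy rows with the same two fields the notebook uses to build IDs.
rows = [{"doi": "1102.0183", "chunk-id": str(n)} for n in range(5)]

for batch in iter_batches(rows, batch_size=2):
    ids = [f"{row['doi']}-{row['chunk-id']}" for row in batch]
    print(ids)
```

The last batch is simply shorter when the number of rows is not a multiple of the batch size, which is what `min(len(data_pd), i + batch_size)` achieves in the cell above.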

In [None]:
results = collection_paper.query(
    query_texts=["what is so special about llama 2?"],
    n_results=2
)

results
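
A query result is a dictionary of nested lists: one inner list per query text, with one entry per returned hit. The sketch below walks that structure using a hypothetical stand-in dictionary (the IDs, texts, URLs, and distances are made up), so the shape matches what is described here but the values do not come from the collection.

```python
# Hypothetical stand-in that mirrors the nesting of a query result:
# one inner list per query text, one entry per returned hit.
mock_results = {
    "ids": [["paper-a-0", "paper-b-3"]],
    "documents": [["first chunk text", "second chunk text"]],
    "metadatas": [[
        {"source": "http://example.com/a", "title": "Paper A"},
        {"source": "http://example.com/b", "title": "Paper B"},
    ]],
    "distances": [[0.21, 0.35]],
}

def flatten_hits(results, query_index=0):
    """Return one dict per hit for the query at `query_index`."""
    return [
        {"id": hit_id, "text": text, "metadata": meta, "distance": dist}
        for hit_id, text, meta, dist in zip(
            results["ids"][query_index],
            results["documents"][query_index],
            results["metadatas"][query_index],
            results["distances"][query_index],
        )
    ]

hits = flatten_hits(mock_results)
for hit in hits:
    print(f"{hit['distance']:.2f}  {hit['metadata']['title']}  {hit['id']}")
```

Passing the real `results` object to `flatten_hits` should work the same way, assuming the query returned documents, metadata, and distances.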

# Model (Llama 2 Chat 7B)

In [None]:
from torch import cuda, bfloat16
import transformers

# model_id = 'mistralai/Mistral-7B-v0.1'
model_id = 'meta-llama/Llama-2-7b-chat-hf'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

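# begin initializing HF items; the gated Llama 2 weights need a Hugging Face access token.
# Read it from the environment instead of hard-coding it in the notebook
# (the variable name HF_TOKEN is an assumption; use whichever name you have set).
import os
hf_auth = os.environ['HF_TOKEN']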
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
    use_auth_token=hf_auth
)
model.eval()
print(f"Model loaded on {device}")
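
To see why 4-bit quantization helps, a back-of-the-envelope estimate of weight memory is useful. The sketch below assumes a nominal 7 billion parameters and counts only the weights; it ignores activations, the KV cache, and quantization overhead, so real usage will be higher. `approx_weight_memory_gb` is a hypothetical helper, not part of `transformers` or `bitsandbytes`.

```python
def approx_weight_memory_gb(num_params, bits_per_param):
    """Rough size of the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

num_params = 7e9  # nominal parameter count for a "7B" model (assumption)

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{approx_weight_memory_gb(num_params, bits):.1f} GB")
```

Under these assumptions, going from 16-bit to 4-bit weights cuts the weight footprint to roughly a quarter.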

In [None]:
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)


In [14]:
generate_text = transformers.pipeline(
    model=model, tokenizer=tokenizer,
    return_full_text=True,  # langchain expects the full text
    task='text-generation',
    # we pass model parameters here too
    temperature=0.2,  # 'randomness' of outputs; must be > 0, lower values are more deterministic
    max_new_tokens=512,  # max number of tokens to generate in the output
)
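
The effect of `temperature` can be illustrated with a temperature-scaled softmax over a few made-up logits. This is a conceptual sketch in plain Python, not the sampling code used by `transformers`; the logits are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing by `temperature`."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens

for t in (0.2, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

A lower temperature concentrates probability on the highest-scoring token, which is why a value such as 0.2 gives more focused, less varied text.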

In [15]:
res = generate_text("what is so special about llama 2?")
print(res[0]["generated_text"])

what is so special about llama 2?
 everybody loves llamas, but what makes llama 2 so unique?

Llama 2 is a special llama because it has a unique pattern of spots on its fur. While most llamas have spots that are more or less circular in shape, Llama 2's spots are shaped like little hearts. This makes it stand out from the rest of the llama herd and gives it a distinctive appearance.
Llama 2 is also very friendly and outgoing. It loves to play with other animals and is always eager to make new friends. It has a special bond with a little goat named Billy, and the two of them can often be seen playing together in the meadow.
In addition to its unique appearance and friendly personality, Llama 2 has a special talent. It can mimic the sounds of other animals, including birds, dogs, and even other llamas. This has earned it the nickname "Llama 2: The Mimic."
Overall, Llama 2 is a truly special animal that stands out from the rest of the herd. Its unique appearance, friendly personality, and

# Model + VectorDB = RAG (Langchain)

In [16]:
from langchain.llms import HuggingFacePipeline

llm = HuggingFacePipeline(pipeline=generate_text)

In [17]:
llm(prompt="what is so special about llama 2?")

'\n everybody loves llamas, but what makes llama 2 so unique?\n\nLlama 2 is a special breed of llama that has been bred for its unique characteristics. Unlike other llamas, Llama 2 has a distinctive spot on its forehead that resembles a small nose. This spot, known as a "nose spot," is a rare genetic trait that is found in only a small percentage of llamas.\nLlama 2\'s nose spot gives it a playful and friendly appearance, making it a favorite among llama enthusiasts. The llama\'s soft, fluffy coat and gentle demeanor also contribute to its popularity.\nBut what really sets Llama 2 apart is its incredible intelligence. This llama is known to be highly curious and can solve complex problems with ease. It is also highly trainable and can learn a wide range of tasks, from pulling a cart to performing tricks.\nLlama 2\'s intelligence and friendly personality make it a joy to be around. Whether you\'re a seasoned llama owner or just a llama fan, this special breed is sure to capture your hea

In [18]:
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings

In [30]:
embedding_function = SentenceTransformerEmbeddings(model_name="BAAI/bge-small-en-v1.5")
vectordb2 = Chroma(
    client=client,
    collection_name="ArxivPapers",
    embedding_function=embedding_function,
)

retriever = vectordb2.as_retriever(search_kwargs={"k": 3})

# create the chain to answer questions
rag_pipeline = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
)
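
Conceptually, the "stuff" chain type places all retrieved chunks into a single prompt together with the question. The sketch below shows that idea in plain Python; the prompt wording and the `build_stuffed_prompt` helper are hypothetical and do not reproduce LangChain's actual template.

```python
def build_stuffed_prompt(question, retrieved_chunks):
    """Join all retrieved chunks into one context block ahead of the question."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Toy chunks standing in for the k=3 documents the retriever would return.
chunks = ["chunk about training data", "chunk about safety tuning", "chunk about licensing"]
prompt = build_stuffed_prompt("what is so special about llama 2?", chunks)
print(prompt)
```

Because every chunk goes into one prompt, the total length grows with `k`, so the number of retrieved documents has to fit within the model's context window.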

In [None]:
results = rag_pipeline('what is so special about llama 2?')

In [44]:
import textwrap

wrapped_text = textwrap.fill(results['result'], width=100)
print("Result: ", wrapped_text)

Result:   Llama 2 is a new language model that has been trained on a large corpus of text data and is
designed to be used for a variety of natural language generation tasks, including chatbots, language
translation, and text summarization. It is intended for use in commercial and research settings, and
its developers have taken steps to ensure that it is safe and reliable. However, as with any
language model, there are potential risks and limitations associated with its use, and developers
should perform safety testing and tuning tailored to their specific applications of the model.
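
Since the chain was built with `return_source_documents=True`, the output also carries the retrieved documents under `source_documents`. The sketch below lists the distinct sources behind an answer. It uses `SimpleNamespace` objects as a hypothetical stand-in for the returned documents, assuming each has `page_content` and a `metadata` dict with the `source` and `title` keys stored at ingestion time; the values are made up.

```python
from types import SimpleNamespace

# Hypothetical stand-in for the chain output; values are illustrative only.
mock_output = {
    "result": "An answer generated from the retrieved context.",
    "source_documents": [
        SimpleNamespace(page_content="chunk one",
                        metadata={"source": "http://example.com/a", "title": "Paper A"}),
        SimpleNamespace(page_content="chunk two",
                        metadata={"source": "http://example.com/a", "title": "Paper A"}),
        SimpleNamespace(page_content="chunk three",
                        metadata={"source": "http://example.com/b", "title": "Paper B"}),
    ],
}

def unique_sources(output):
    """Return (title, source) pairs in first-seen order, without duplicates."""
    seen = []
    for doc in output["source_documents"]:
        entry = (doc.metadata.get("title"), doc.metadata.get("source"))
        if entry not in seen:
            seen.append(entry)
    return seen

for title, source in unique_sources(mock_output):
    print(f"{title}: {source}")
```

Listing sources next to the answer makes it easier to check whether the response is grounded in the retrieved papers.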
