<a href="https://colab.research.google.com/github/jnorfolk/DataSpeak-QA/blob/main/Copy_of_llama_2_13b_retrievalqa.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/generation/llm-field-guide/llama-2/llama-2-13b-retrievalqa.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/generation/llm-field-guide/llama-2/llama-2-13b-retrievalqa.ipynb)

# RAG with Llama 2 13B

In this notebook we'll explore how to use the open source **Llama-2-13b-chat** model with both Hugging Face transformers and LangChain.
At the time of writing, you must first request access to Llama 2 models via [this form](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) (access is typically granted within a few hours). If you need guidance on getting access please refer to the beginning of this [article](https://www.pinecone.io/learn/llama-2/) or [video](https://youtu.be/6iHVJyX2e50?t=175).

---

🚨 _Note that running this on CPU is sloooow. If running on Google Colab you can avoid this by going to **Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**. This should be included within the free tier of Colab._

---

We start by doing a `pip install` of all required libraries.

In [None]:
!pip install -qU \
  transformers==4.31.0 \
  sentence-transformers==2.2.2 \
  pinecone-client==2.2.2 \
  datasets==2.14.0 \
  accelerate==0.21.0 \
  einops==0.6.1 \
  langchain==0.0.240 \
  xformers==0.0.20 \
  bitsandbytes==0.41.0

## Initializing the Hugging Face Embedding Pipeline

We begin by initializing the embedding pipeline that will handle the transformation of our docs into vector embeddings. We will use the `sentence-transformers/all-MiniLM-L6-v2` model for embedding.

In [None]:
from torch import cuda
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

embed_model_id = 'sentence-transformers/all-MiniLM-L6-v2'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

embed_model = HuggingFaceEmbeddings(
    model_name=embed_model_id,
    model_kwargs={'device': device},
    encode_kwargs={'device': device, 'batch_size': 100}
)

We can use the embedding model to create document embeddings like so:

In [None]:
docs = [
    "this is one document",
    "and another document"
]

embeddings = embed_model.embed_documents(docs)

print(f"We have {len(embeddings)} doc embeddings, each with "
      f"a dimensionality of {len(embeddings[0])}.")

## Building the Vector Index

We now need to use the embedding pipeline to build our embeddings and store them in a Pinecone vector index. We begin by initializing the index, which requires a [free Pinecone API key](https://app.pinecone.io/).

In [None]:
import os
import pinecone

# get API key from app.pinecone.io and environment from console
# (use your own, following the hyperlink above; never commit a real key)
pinecone.init(
    api_key=os.environ.get('PINECONE_API_KEY') or 'YOUR_API_KEY',
    environment=os.environ.get('PINECONE_ENVIRONMENT') or 'gcp-starter'
)
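
Keys hardcoded in a shared notebook are easy to leak, so it can help to read them from the environment and fail early when they are missing. The sketch below is only illustrative: `require_env` is a hypothetical helper, not part of the Pinecone client.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if unset/empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Set the {name} environment variable before running this cell."
        )
    return value
```

With this in place, `api_key=require_env('PINECONE_API_KEY')` could stand in for the `os.environ.get(...) or ...` fallback above.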

Now we initialize the index.

In [None]:
import time

# Use the name of the index that you created in Pinecone
index_name = 'llama-2-rag'

if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        index_name,
        dimension=len(embeddings[0]),
        metric='cosine'
    )
    # wait for index to finish initialization
    while not pinecone.describe_index(index_name).status['ready']:
        time.sleep(1)
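
The index above is created with `metric='cosine'`, so query and document vectors are compared by the angle between them rather than by their length. As a rough illustration of what that metric computes, here is a plain-Python sketch on toy vectors (Pinecone's own implementation is not shown in this notebook and will differ):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # same direction -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # orthogonal -> 0.0
```

Scaling a vector does not change the result, which is why the first pair scores 1.0 even though the lengths differ.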

Now we connect to the index:

In [None]:
index = pinecone.Index(index_name)
index.describe_index_stats()

With our index and embedding process ready we can move on to the indexing process itself. For that, we'll need a dataset. The original version of this notebook uses a set of arXiv papers related to (and including) the Llama 2 research paper.

Here I deviate and use my own question-and-answer data, loaded from CSV files on Google Drive.

In [None]:
# You can attach your own google drive in this cell block
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import pandas as pd

path_answers = '/content/drive/MyDrive/DataSpeak/Answers.csv'
answers = pd.read_csv(path_answers, on_bad_lines='warn', encoding='ISO-8859-1')

path_questions = '/content/drive/MyDrive/DataSpeak/Questions.csv'
questions = pd.read_csv(path_questions, on_bad_lines='warn', encoding='ISO-8859-1')

path_tags = '/content/drive/MyDrive/DataSpeak/Tags.csv'
tags = pd.read_csv(path_tags, on_bad_lines='warn', encoding='ISO-8859-1')

In [None]:
def preprocess_data(questions, answers):
  # Join each question to its answers and drop rows with missing values
  df = pd.merge(questions, answers, left_on='Id', right_on='ParentId', how='left', validate='one_to_many', suffixes=('_q', '_a')).dropna()
  df = df.drop(['OwnerUserId_q', 'CreationDate_q', 'Id_a', 'OwnerUserId_a', 'CreationDate_a', 'ParentId'], axis=1)
  df = df.rename(columns = {'Id_q': 'id', 'Title': 'question_title', 'Body_q': 'question_body', 'Score_a': 'score', 'Body_a': 'answer'})
  df = df[df.Score_q > 0] # Keep questions with a positive score
  df = df[df.score > 0] # Keep answers with a positive score
  # Sort first so that head(3) picks the highest-scoring answers per question
  df = df.sort_values(by=['id', 'score'], ascending=[True, False])
  df = df.groupby('id').head(3) # Keep the top 3 answers per question
  df = df.reset_index(drop=True)
  return df
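
The "keep the best answers per question" step can be easier to reason about outside pandas. Below is a plain-Python sketch of that idea, assuming the intent is to keep the three highest-scoring answers with a positive score; `top_answers` and the sample rows are hypothetical and only mirror the logic of `preprocess_data`.

```python
from collections import defaultdict


def top_answers(rows, n=3):
    """Keep the n highest-scoring answers per question id.

    rows: iterable of (question_id, score, answer) tuples.
    """
    by_question = defaultdict(list)
    for qid, score, answer in rows:
        if score > 0:  # drop answers without a positive score
            by_question[qid].append((score, answer))
    result = []
    for qid in sorted(by_question):
        # highest score first, then truncate to the best n
        best = sorted(by_question[qid], key=lambda t: t[0], reverse=True)[:n]
        result.extend((qid, score, answer) for score, answer in best)
    return result


rows = [(1, 5, "a"), (1, 0, "b"), (1, 9, "c"), (2, 2, "d"), (1, 7, "e"), (1, 1, "f")]
print(top_answers(rows))
# [(1, 9, 'c'), (1, 7, 'e'), (1, 5, 'a'), (2, 2, 'd')]
```

The important detail is the order of operations: sorting by score has to happen before truncating to the first `n` rows, otherwise the rows kept are simply the first ones encountered.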

In [None]:
import re
from bs4 import BeautifulSoup

def clean_text(df, columns):

  def remove_html_tags(text):
      # Parse the HTML and keep only the visible text
      soup = BeautifulSoup(text, "html.parser")
      return soup.get_text()

  def remove_chars(text):
      # Replace newline characters with spaces
      text = text.replace('\n', ' ')

      # Replace curly quotes with a plain apostrophe
      curly_quotes = ['\u2018', '\u2019', '\u201C', '\u201D']
      for char in curly_quotes:
          text = text.replace(char, "'")

      # Remove any remaining non-ASCII characters
      text = re.sub(r'[^\x00-\x7F]+', '', text)

      # Collapse runs of whitespace into single spaces
      text = ' '.join(text.split())

      return text

  for column in columns:
    df[column] = df[column].apply(remove_html_tags)
    df[column] = df[column].apply(remove_chars)

  return df
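
To see what the tag-stripping step does to a single string, here is a small sketch that swaps BeautifulSoup for the standard library's `html.parser`, so it runs without extra installs. `strip_tags` and the sample string are hypothetical, and on messy real-world HTML the two parsers may not behave identically.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text found between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html_text):
    parser = _TextExtractor()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)


sample = "<p>Use <code>df.rolling(3)</code>\nthen <b>mean()</b>.</p>"
# strip the tags, then collapse whitespace as clean_text does
print(" ".join(strip_tags(sample).split()))
# Use df.rolling(3) then mean().
```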

In [None]:
def get_tag_subset(df, tags, tags_list):
    df_result = pd.DataFrame()
    df_tags = pd.merge(df, tags, how='left', left_on='id', right_on="Id").dropna().drop(['Id', 'Score_q'], axis=1).dropna()
    for tag in tags_list:
        df_subset = df_tags.loc[df_tags['Tag'].str.contains(tag)].drop('Tag', axis=1)
        df_result = pd.concat([df_result, df_subset])
        df_result.drop_duplicates(inplace=True)
    return df_result

In [None]:
# Add any topics you want the data to cover to tags_list
tags_list = ['numpy', 'pandas', 'django', 'matplotlib']

df = preprocess_data(questions, answers)
df_cut = get_tag_subset(df, tags, tags_list)
data = clean_text(df_cut, columns=['question_title', 'question_body', 'answer'])
data = data.drop(data[data.score <= 0].index).drop('Unnamed: 0', axis=1)
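# astype(str) converts each score individually; str(data.score) would embed the whole Series in every row
data['context'] = 'Title: ' + data.question_title + '; Score: ' + data.score.astype(str) + '; Answer: ' + data.answer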

We will embed and index the documents like so:

In [None]:
# We are embedding and uploading the data to Pinecone. Use your own ids, texts, and metadata if using different data

from tqdm.auto import tqdm

batch_size = 100

for i in tqdm(range(0, len(data), batch_size)):
    i_end = min(len(data), i+batch_size)
    batch = data.iloc[i:i_end]
    ids = [f"{x['id']}" for i, x in batch.iterrows()]
    texts = [x['question_title'] for i, x in batch.iterrows()]
    embeds = embed_model.embed_documents(texts)
    # get metadata to store in Pinecone
    metadata = [
        {'text': x['question_title'],
         'context': x['context'],
         'source': x['id'],
         'answer': x['answer'],
         'score': x['score']} for i, x in batch.iterrows()
    ]
    # add to Pinecone
    index.upsert(vectors=zip(ids, embeds, metadata))
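
The loop above walks through the data in fixed-size slices, with `min` making sure the last slice stops at the end of the data. A minimal sketch of just that slicing logic (the helper name `batch_ranges` is hypothetical):

```python
def batch_ranges(total, batch_size):
    """Yield (start, end) index pairs that cover range(total) in chunks."""
    for start in range(0, total, batch_size):
        yield start, min(total, start + batch_size)


print(list(batch_ranges(250, 100)))
# [(0, 100), (100, 200), (200, 250)]
```

With 250 rows and a batch size of 100, the final batch holds the remaining 50 rows instead of running past the end.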

In [None]:
index.describe_index_stats()

## Initializing the Hugging Face Pipeline

The first thing we need to do is initialize a `text-generation` pipeline with Hugging Face transformers. The pipeline requires two things that we must initialize first:

* An LLM, in this case `meta-llama/Llama-2-13b-chat-hf`.

* The respective tokenizer for the model.

We'll explain these as we get to them. Let's begin with our model.

We initialize the model and move it to our CUDA-enabled GPU. On Colab, downloading and initializing the model can take 5-10 minutes.

In [None]:
from torch import cuda, bfloat16
import transformers

# Either llama-2 model can be used depending on available processing power
model_id = 'meta-llama/Llama-2-13b-chat-hf'
# model_id = 'meta-llama/Llama-2-7b-chat-hf'


device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

# begin initializing HF items, need auth token for these
hf_auth = 'YOUR_HF_TOKEN'  # replace with your own Hugging Face access token
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
    use_auth_token=hf_auth
)
model.eval()
print(f"Model loaded on {device}")
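
To get a feel for why 4-bit quantization matters here, the back-of-the-envelope arithmetic below estimates the memory needed for the weights alone. It assumes a round 13 billion parameters and ignores quantization overhead, activations, and the KV cache, so treat the numbers as rough lower bounds.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for model weights alone, in gigabytes (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 13e9  # assumed round figure for a "13B" model
print(weight_memory_gb(n_params, 16))  # 16-bit weights -> 26.0
print(weight_memory_gb(n_params, 4))   # 4-bit weights  -> 6.5
```

At 16 bits the weights alone would exceed the roughly 16 GB of a T4, while the 4-bit estimate leaves some headroom for inference.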

The pipeline requires a tokenizer which handles the translation of human readable plaintext to LLM readable token IDs. The Llama 2 13B models were trained using the Llama 2 13B tokenizer, which we initialize like so:

In [None]:
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)

Now we're ready to initialize the HF pipeline. There are a few additional parameters that we must define here. Comments explaining these have been included in the code.

In [None]:
generate_text = transformers.pipeline(
    model=model, tokenizer=tokenizer,
    return_full_text=True,  # langchain expects the full text
    task='text-generation',
    # we pass model parameters here too
    temperature=0.1,  # 'randomness' of outputs; must be > 0, lower is more deterministic
    max_new_tokens=512,  # max number of tokens to generate in the output
    repetition_penalty=1.1  # without this output begins repeating
)
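
The `temperature` parameter rescales the model's token scores before they are turned into probabilities, which matters when tokens are sampled rather than picked greedily. The sketch below shows the effect on made-up scores for three candidate tokens; it illustrates the general idea and is not the transformers implementation.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (1.0, 0.1):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

At a temperature of 0.1 nearly all of the probability mass lands on the highest-scoring token, which is why low temperatures give more repeatable output.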

Confirm this is working:

In [None]:
res = generate_text("How do I do a rolling mean with pandas?")
print(res[0]["generated_text"])

In [None]:
# res = generate_text("How do I use a ENUM in Django?")
# print(res[0]["generated_text"])

Now to implement this in LangChain:

In [None]:
from langchain.llms import HuggingFacePipeline

llm = HuggingFacePipeline(pipeline=generate_text)

In [None]:
# llm(prompt="How do I go about specifying and using a ENUM in a Django model?")

We still get the same output as we're not really doing anything differently here, but we have now added **Llama 2 13B Chat** to the LangChain library. Using this we can now begin using LangChain's advanced agent tooling, chains, etc, with **Llama 2**.

## Initializing a RetrievalQA Chain

For **R**etrieval **A**ugmented **G**eneration (RAG) in LangChain we need to initialize either a `RetrievalQA` or `RetrievalQAWithSourcesChain` object. For both of these we need an `llm` (which we have initialized) and a Pinecone index — but initialized within a LangChain vector store object.

Let's begin by initializing the LangChain vector store, we do it like so:

In [None]:
from langchain.vectorstores import Pinecone

text_field = 'text'  # field in metadata that contains text content

vectorstore = Pinecone(
    index, embed_model.embed_query, text_field
)

We can confirm this works like so:

In [None]:
query = "How do I do a rolling mean with pandas?"

vectorstore.similarity_search_with_score(
    query,  # the search query
    k=5  # returns top 5 most relevant chunks of text
)

Looks good! Now we can put our `vectorstore` and `llm` together to create our RAG pipeline.

In [None]:
from langchain.chains import RetrievalQA

rag_pipeline = RetrievalQA.from_chain_type(
    llm=llm, chain_type='stuff',
    retriever=vectorstore.as_retriever(),
)
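
The `'stuff'` chain type takes every retrieved document and "stuffs" them all into a single prompt alongside the question. A rough sketch of that idea follows; the template wording and the `stuff_prompt` helper are hypothetical, and LangChain's actual prompt differs.

```python
def stuff_prompt(question, documents):
    """Concatenate all retrieved documents into one prompt (the 'stuff' idea)."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


docs = ["Title: rolling mean in pandas", "Title: moving average with numpy"]
print(stuff_prompt("How do I do a rolling mean with pandas?", docs))
```

Note that the documents handed to the chain are whatever is stored under the vector store's `text_field` metadata key, so with `text_field = 'text'` the LLM sees the stored question titles. Pointing it at a richer field such as `'context'` would be one way to give the model the answers as well.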

Let's begin asking questions! First let's try *without* RAG:

In [None]:
llm("how is a series different than a dataframe?")

Hmm, that's not what we meant... What if we use our RAG pipeline?

In [None]:
rag_pipeline("how is a series different than a dataframe?")

This looks *much* better! Let's try some more.

Okay, it looks like the LLM with no RAG is less than ideal. Let's give the poor LLM one more chance before we stick with RAG + LLM, then ask the same question to our RAG pipeline.

In [None]:
llm("what is pandas?")

In [None]:
rag_pipeline('What is pandas?')

A reasonable answer from the RAG pipeline, but it doesn't contain much information. Maybe we can ask something more specific, like how to join tables with pandas?

In [None]:
rag_pipeline('How do I join tables with pandas?')

Very interesting!

In [None]:
rag_pipeline('I have a function. I want to apply it to a column in my pandas dataframe. How do I do this?')
