<a href="https://colab.research.google.com/github/kostas-panagiotakis/NLP/blob/main/Feynman_RAG_3_0.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Feynman Lectures RAG with LLaMA 2


![image](https://images.lumacdn.com/cdn-cgi/image/format=auto,fit=cover,dpr=1,quality=75,width=960,height=480/event-covers/87/1f6d4850-0231-4bd9-92e8-af6b45c18d7a)


In the following notebook we'll be discussing Retrieval Augmented Generation (RAG) and how to leverage Meta's newest LLM, LLaMA 2, as the engine!

### Pre-task Work

All we really need to do to get started is to get our prerequisites!

We'll be leveraging `langchain` and `llama 2` today.

Check out the docs:
- [LangChain](https://docs.langchain.com/docs/)
- [LLaMA 2](https://huggingface.co/blog/llama2)

In [1]:
!pip install -U -q "langchain" "transformers==4.31.0" "datasets==2.13.0" "peft==0.4.0" "accelerate==0.21.0" "bitsandbytes==0.40.2" "trl==0.4.7" "safetensors>=0.3.1"
!pip install jq

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
sentence-transformers 2.7.0 requires transformers<5.0.0,>=4.34.0, but you have transformers 4.31.0 which is incompatible.


In [2]:
!pip install --upgrade sentence-transformers


Collecting transformers<5.0.0,>=4.34.0 (from sentence-transformers)
  Using cached transformers-4.41.0-py3-none-any.whl (9.1 MB)
Collecting tokenizers<0.20,>=0.19 (from transformers<5.0.0,>=4.34.0->sentence-transformers)
  Using cached tokenizers-0.19.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)
Installing collected packages: tokenizers, transformers
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.13.3
    Uninstalling tokenizers-0.13.3:
      Successfully uninstalled tokenizers-0.13.3
  Attempting uninstall: transformers
    Found existing installation: transformers 4.31.0
    Uninstalling transformers-4.31.0:
      Successfully uninstalled transformers-4.31.0
Successfully installed tokenizers-0.19.1 transformers-4.41.0


In [3]:
!pip install langchain_community



### Task 1: Data Preparation

In this task we'll be collecting, and then parsing, our data.

In [4]:
import pandas as pd
from datasets import load_dataset
from langchain_community.document_loaders import JSONLoader
import json
from pathlib import Path
from pprint import pprint


In [5]:
 # Load the dataset
dataset = load_dataset("enesxgrahovac/the-feynman-lectures-on-physics")
dataset
train = dataset['train']


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



#### Data Parsing

Now that we have our data - let's go ahead and start parsing it into a more usable format for LangChain!

We'll be using the `CSVLoader` for this application.

Check out the docs here:
- [CSVLoader](https://python.langchain.com/docs/integrations/document_loaders/csv)

In [27]:
# Convert the dataset to a DataFrame and add a row index as the first column
df = pd.DataFrame(train)
df.insert(0, 'index', range(len(df)))
df = df.sort_values(by=['index', 'book_volume', 'chapter_number', 'section_number'])
df

| | index | book_volume | book_title | chapter_number | chapter_title | section_number | section_title | section_text |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | | 1 | Atoms in Motion | 1 | Introduction | This two-year course in physics is presented f... |
| 1 | 1 | 1 | | 2 | Basic Physics | 1 | Introduction | In this chapter, we shall examine the most fun... |
| 2 | 2 | 1 | | 2 | Basic Physics | 2 | Physics before 1920 | It is a little difficult to begin at once with... |
| 3 | 3 | 1 | | 3 | The Relation of Physics to Other Sciences | 1 | Introduction | Physics is the most fundamental and all-inclus... |
| 4 | 4 | 1 | | 3 | The Relation of Physics to Other Sciences | 2 | Chemistry | The science which is perhaps the most deeply a... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 636 | 636 | 3 | | 21 | The Schrödinger Equation in a Classical Contex... | 5 | Superconductivity | As you know, very many metals become supercond... |
| 637 | 637 | 3 | | 21 | The Schrödinger Equation in a Classical Contex... | 6 | The Meissner effect | Now we can describe some of the phenomena of s... |
| 638 | 638 | 3 | | 21 | The Schrödinger Equation in a Classical Contex... | 7 | Flux quantization | The London equation (21.21) was proposed to ac... |
| 639 | 639 | 3 | | 21 | The Schrödinger Equation in a Classical Contex... | 8 | The dynamics of superconductivity | The Meissner effect\nand the flux quantization... |


In [28]:
from langchain_community.document_loaders.csv_loader import CSVLoader

# Write the DataFrame to CSV so that CSVLoader can read it
file_path = '/content/output.csv'
df.to_csv(file_path, index=False)

# Each CSV row becomes one Document; the "index" column is used as its source
feynman_data = CSVLoader(file_path=file_path, source_column="index")

feynman_data_loaded = feynman_data.load()

In [29]:
len(feynman_data_loaded)

641

In [30]:
type(feynman_data_loaded[0])

Now that we have loaded the lecture sections into documents, we can go ahead and chunk them into more manageable pieces.

We'll be leveraging the `RecursiveCharacterTextSplitter` for this task today.

While splitting our text seems like a simple enough task, getting it right or wrong can have a massive downstream impact on your application's performance.

You can read the docs here:
- [RecursiveCharacterTextSplitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter)

> ### HINT:
>It's always worth it to check out the LangChain source code if you're ever in a bind - for instance, if you want to know how to transform a set of documents, check it out [here](https://github.com/langchain-ai/langchain/blob/5e9687a196410e9f41ebcd11eb3f2ca13925545b/libs/langchain/langchain/text_splitter.py#L268C18-L268C18)
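To make the two main splitter parameters, `chunk_size` and `chunk_overlap`, concrete, here is a minimal sketch of fixed-window character chunking. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter also tries to break on separators such as paragraphs and sentences before falling back to raw length); `split_with_overlap` is a hypothetical helper written only for illustration.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):  # this window already reached the end
            break
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # each chunk repeats the last character of the previous one
```

A larger overlap lowers the chance that a sentence relevant to a query is cut in half at a chunk boundary, at the cost of storing and embedding more duplicated text.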

# Parameter optimization


In [55]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 10000, # the character length of the chunk
    chunk_overlap = 100, # the character length of the overlap between chunks
    length_function = len, # the length function - in this case, character length (aka the python len() fn.)
)

In [56]:
feynman_documents = text_splitter.transform_documents(feynman_data_loaded)

In [57]:
len(feynman_documents)

889

In [61]:
feynman_documents[50]

Document(page_content='index: 38\nbook_volume: 1\nbook_title: \nchapter_number: 9\nchapter_title: Newton’s Laws of Dynamics\nsection_number: 7\nsection_title: Planetary motions\nsection_text: The above analysis is very nice for the motion of an oscillating spring, but can we analyze the motion of a planet around the sun? Let us see whether we can arrive at an approximation to an ellipse for the orbit. We shall suppose that the sun is infinitely heavy, in the sense that we shall not include its motion. Suppose a planet starts at a certain place and is moving with a certain velocity; it goes around the sun in some curve, and we shall try to analyze, by Newton’s laws of motion and his law of gravitation, what the curve is. How? At a given moment it is at some position in space. If the radial distance from the sun to this position is called $r$, then we know that there is a force directed inward which, according to the law of gravity, is equal to a constant times the product of the sun’s m

In [49]:
feynman_data_loaded[0]

Document(page_content='index: 0\nbook_volume: 1\nbook_title: \nchapter_number: 1\nchapter_title: Atoms in Motion\nsection_number: 1\nsection_title: Introduction\nsection_text: This two-year course in physics is presented from the point of view that you, the reader, are going to be a physicist. This is not necessarily the case of course, but that is what every professor in every subject assumes! If you are going to be a physicist, you will have a lot to study: two hundred years of the most rapidly developing field of knowledge that there is. So much knowledge, in fact, that you might think that you cannot learn all of it in four years, and truly you cannot; you will have to go to graduate school too! Surprisingly enough, in spite of the tremendous amount of work that has been done for all this time it is possible to condense the enormous mass of results to a large extent—that is, to find laws which summarize all our knowledge. Even so, the laws are so hard to grasp that it is unfair to 

With our documents transformed into more manageable sizes, and with the correct metadata set-up, we're now ready to move on to creating our VectorStore!

### Task 2: Creating an "Index"

The term "index" is used largely to mean: Structured documents parsed into a useful format for querying, retrieving, and use in the LLM application stack.

#### Selecting Our VectorStore

There are a number of different VectorStores, and a number of different strengths and weaknesses to each.

In this notebook, we will be keeping it very simple by leveraging [Facebook AI Similarity Search](https://ai.meta.com/tools/faiss/#:~:text=FAISS%20(Facebook%20AI%20Similarity%20Search,more%20scalable%20similarity%20search%20functions.), or `FAISS`.

# Analyze different vector stores and how they impact performances

In [None]:
#!pip install -q -U faiss-cpu tiktoken sentence-transformers

We're going to be setting up our VectorStore with a Hugging Face sentence-transformers embeddings model (`all-MiniLM-L6-v2`). While this embeddings model does not need to be consistent with the LLM selection, it does need to be consistent between embedding our index and embedding our queries over that index.

While we don't have to worry too much about that in this example - it's something to keep in mind for more complex applications.
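As a rough illustration of why the index and the queries must share one embedding model, the sketch below does a brute-force nearest-neighbour search over toy 3-dimensional vectors. This is only the idea behind a similarity search, not how FAISS is implemented, and `l2_distance`, `top_k`, and the toy vectors are all hypothetical stand-ins for real embeddings.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, index_vecs, k=2):
    """Return (position, distance) pairs for the k stored vectors closest to the query."""
    dim = len(index_vecs[0])
    if len(query_vec) != dim:
        # Vectors from a different embedding model usually fail here first
        raise ValueError(f"query has {len(query_vec)} dims, index has {dim}")
    scored = [(i, l2_distance(query_vec, v)) for i, v in enumerate(index_vecs)]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy "embeddings": positions 0 and 2 point in nearly the same direction
toy_index = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
nearest = top_k([1.0, 0.0, 0.0], toy_index, k=2)
print([position for position, _ in nearest])
```

Note that the dimension check only catches the most obvious mismatch. Two different models that happen to produce vectors of the same size would pass it silently and simply return poor neighbours.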

We're going to leverage a [`CacheBackedEmbeddings`](https://python.langchain.com/docs/modules/data_connection/caching_embeddings) flow to prevent us from re-embedding the same text over and over again.

Not only will this save time, it will also save embedding compute (or, with a paid embeddings API, billable tokens), which reduces the overall cost of our application.

>#### Note:
>The overall cost savings needs to be compared against the additional cost of storing the cached embeddings for a true cost/benefit analysis. If your users are submitting the same queries often, though, this pattern can be a massive reduction in cost.
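The caching idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not LangChain's implementation: `CachedEmbedder` and `toy_embed` are hypothetical names, the cache is an in-memory dict rather than a file store, and the toy function does not produce meaningful embeddings.

```python
import hashlib


class CachedEmbedder:
    """Wrap an embedding function and reuse results for text seen before."""

    def __init__(self, embed_fn, namespace):
        self.embed_fn = embed_fn
        self.namespace = namespace  # keeps vectors from different models apart
        self.cache = {}
        self.misses = 0  # how many times the underlying model was called

    def _key(self, text):
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        return f"{self.namespace}:{digest}"

    def embed(self, text):
        key = self._key(text)
        if key not in self.cache:
            self.misses += 1
            self.cache[key] = self.embed_fn(text)
        return self.cache[key]


def toy_embed(text):
    # Stand-in for a real model: NOT a meaningful embedding
    return [float(len(text)), float(text.count(" "))]


embedder_sketch = CachedEmbedder(toy_embed, namespace="toy-model")
first = embedder_sketch.embed("What is energy?")
second = embedder_sketch.embed("What is energy?")
print(embedder_sketch.misses)  # the second call is served from the cache
```

Because the key is a hash of the exact text, only identical inputs hit the cache; a rephrased question is embedded again.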

In [62]:
#!pip install --upgrade faiss-cpu
!pip install -q -U faiss-cpu tiktoken sentence-transformers


In [63]:
from langchain.embeddings import CacheBackedEmbeddings, HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.storage import LocalFileStore

store = LocalFileStore("./cache/")

embed_model_id = 'sentence-transformers/all-MiniLM-L6-v2'

core_embeddings_model = HuggingFaceEmbeddings(
    model_name=embed_model_id
)

embedder = CacheBackedEmbeddings.from_bytes_store(
    core_embeddings_model, store, namespace=embed_model_id
)

vector_store = FAISS.from_documents(feynman_documents, embedder)



In [64]:
core_embeddings_model

HuggingFaceEmbeddings(client=SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
), model_name='sentence-transformers/all-MiniLM-L6-v2', cache_folder=None, model_kwargs={}, encode_kwargs={}, multi_process=False, show_progress=False)

Now that we've created the VectorStore, we can check that it's working by embedding a query and retrieving passages from the lectures that are close to it.

In [65]:
query = "Give me the definition of Energy in terms of work"
embedding_vector = core_embeddings_model.embed_query(query)
docs = vector_store.similarity_search_by_vector(embedding_vector, k = 4)

for page in docs:
  print(page.page_content)

index: 6
book_volume: 1
book_title: 
chapter_number: 4
chapter_title: Conservation of Energy
section_number: 1
section_title: What is energy?
section_text: In this chapter, we begin our more detailed study of the different aspects of physics, having finished our description of things in general. To illustrate the ideas and the kind of reasoning that might be used in theoretical physics, we shall now examine one of the most basic laws of physics, the conservation of energy. There is a fact, or if you wish, a law, governing all natural phenomena that are known to date. There is no known exception to this law—it is exact so far as we know. The law is called the conservation of energy. It states that there is a certain quantity, which we call energy, that does not change in the manifold changes which nature undergoes. That is a most abstract idea, because it is a mathematical principle; it says that there is a numerical quantity which does not change when something happens. It is not a des

In [66]:
docs[0]

Document(page_content='index: 6\nbook_volume: 1\nbook_title: \nchapter_number: 4\nchapter_title: Conservation of Energy\nsection_number: 1\nsection_title: What is energy?\nsection_text: In this chapter, we begin our more detailed study of the different aspects of physics, having finished our description of things in general. To illustrate the ideas and the kind of reasoning that might be used in theoretical physics, we shall now examine one of the most basic laws of physics, the conservation of energy. There is a fact, or if you wish, a law, governing all natural phenomena that are known to date. There is no known exception to this law—it is exact so far as we know. The law is called the conservation of energy. It states that there is a certain quantity, which we call energy, that does not change in the manifold changes which nature undergoes. That is a most abstract idea, because it is a mathematical principle; it says that there is a numerical quantity which does not change when some

Let's see how much time the `CacheBackedEmbeddings` pattern saves us:

In [67]:
%%timeit -n 1 -r 1
query = "I really wanted to enjoy this and I know that I am not the target audience but there were massive plot holes and no real flow."
embedding_vector = embedder.embed_query(query)
docs = vector_store.similarity_search_by_vector(embedding_vector, k = 4)

15.9 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [68]:
%%timeit
query = "I really wanted to enjoy this and I know that I am not the target audience but there were massive plot holes and no real flow."
embedding_vector = embedder.embed_query(query)
docs = vector_store.similarity_search_by_vector(embedding_vector, k = 4)

8.69 ms ± 2.24 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


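As we can see, the repeated runs (about 8.7 ms on average) are faster than the first instance of the query (15.9 ms). Keep in mind that `CacheBackedEmbeddings` caches document embeddings by default; in recent LangChain versions query embeddings are only cached when a `query_embedding_cache` is configured, so part of this difference may simply be warm-up.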

With that, we're ready to move onto Task 3!

### Task 3: Building a Retrieval Chain

In this task, we'll be making a Retrieval Chain which will allow us to ask semantic questions over our data.

LangChain abstracts most of this away, which makes it powerful but also easy to treat as a black box.

Be sure to check the documentation, the source code, and other provided resources to build a deeper understanding of what's happening "under the hood"!

#### A Basic RetrievalQA Chain

We're going to leverage `return_source_documents=True` to ensure we have proper sources for our answers, should the end user want to verify them against the lectures themselves.

Hallucinations [are](https://arxiv.org/abs/2202.03629) [a](https://arxiv.org/abs/2305.15852) [massive](https://arxiv.org/abs/2303.16104) [problem](https://arxiv.org/abs/2305.18248) in LLM applications.

Though it has been tenuously shown that using Retrieval Augmentation [reduces hallucination in conversations](https://arxiv.org/pdf/2104.07567.pdf), one surefire way to ensure your model is not hallucinating in a non-transparent way is to provide sources with your responses. This way the end user can verify the output.
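Conceptually, a retrieval QA step that returns its sources can be sketched as below. This is a simplified illustration, not LangChain's code: `answer_with_sources`, `toy_retrieve`, and `toy_generate` are hypothetical stand-ins for the chain, the retriever, and the LLM, and the tail of the prompt template (`Question:` / `Helpful Answer:`) is an assumption about the default prompt.

```python
PROMPT_TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n\n"
    "{context}\n\nQuestion: {question}\nHelpful Answer:"
)


def answer_with_sources(question, retrieve, generate):
    """Retrieve context, build one prompt, and return the answer plus its sources."""
    docs = retrieve(question)
    context = "\n\n".join(doc["text"] for doc in docs)
    prompt = PROMPT_TEMPLATE.format(context=context, question=question)
    return {
        "query": question,
        "result": generate(prompt),
        "source_documents": docs,  # lets the user check the answer
    }


def toy_retrieve(question):
    # Stand-in for a vector store lookup
    return [{"text": "Kinetic energy is (1/2) m v^2.", "source": "toy-doc-1"}]


def toy_generate(prompt):
    # Stand-in for the LLM call
    return "It is one half m v squared."


response = answer_with_sources("What is kinetic energy?", toy_retrieve, toy_generate)
print(response["result"])
print([doc["source"] for doc in response["source_documents"]])
```

Returning the retrieved documents alongside the answer does not stop the model from hallucinating, but it makes an unsupported answer much easier to spot.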

#### Our LLM

In this notebook, we're going to leverage Meta's LLaMA 2!

Specifically, we'll be using: `meta-llama/Llama-2-13b-chat-hf`

That's right, a 13B parameter model that we're going to run on *less than* 15GB of GPU RAM.

More information on this model can be found [here](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf)

# Compare two models: 13B vs. 7B parameters

In [69]:
!pip install huggingface-hub -q

In [72]:
from huggingface_hub import notebook_login

notebook_login()


We will be leveraging Tim Dettmers' `bitsandbytes` as well as `accelerate` and `transformers` from Hugging Face to make our model as small as possible. The overall quality of the model is fairly well retained!
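A quick back-of-the-envelope calculation shows why quantization matters here. The sketch below estimates the memory needed for the weights alone, assuming a nominal 13 billion parameters; `weight_memory_gb` is a hypothetical helper, and the estimate ignores activations, the KV cache, and quantization metadata, so real usage is higher.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 13e9  # assumption: nominal parameter count of a "13B" model

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

At 16 bits the weights alone would not fit on a GPU with roughly 15 GB of memory, whereas at 4 bits they leave some headroom for everything else the model needs at inference time.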

In [73]:
import torch
import transformers

model_id = "meta-llama/Llama-2-13b-chat-hf"

bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

model_config = transformers.AutoConfig.from_pretrained(
    model_id
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto'
)

model.eval()




You are calling `save_pretrained` to a 4-bit converted model, but your `bitsandbytes` version doesn't support it. If you want to save 4-bit models, make sure to have `bitsandbytes>=0.41.3` installed.


LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 5120)
    (layers): ModuleList(
      (0-39): 40 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (k_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (v_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (o_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=5120, out_features=13824, bias=False)
          (up_proj): Linear4bit(in_features=5120, out_features=13824, bias=False)
          (down_proj): Linear4bit(in_features=13824, out_features=5120, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): Lla

In [74]:
#tokenizer = GPT2Tokenizer.from_pretrained(model_id)
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id
)

Now we need to pack it into a `pipeline` for compatibility with `langchain`!

In [75]:
generate_text = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    return_full_text=True,
    temperature=0.1,
    max_new_tokens=256
)

In [76]:
from langchain.llms import HuggingFacePipeline

llm = HuggingFacePipeline(pipeline=generate_text)



Now we can set up our chain.

In [78]:
retriever = vector_store.as_retriever()

In [79]:
from langchain.chains import RetrievalQA
from langchain.callbacks import StdOutCallbackHandler

handler = StdOutCallbackHandler()

qa_with_sources_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    callbacks=[handler],
    return_source_documents=True
)

Now that it's set up, let's test it out!

In [80]:
qa_with_sources_chain({"query" : "Give me the definition of energy"})



OutOfMemoryError: CUDA out of memory. Tried to allocate 136.00 MiB. GPU 0 has a total capacity of 14.75 GiB of which 121.06 MiB is free. Process 211892 has 14.63 GiB memory in use. Of the allocated memory 13.06 GiB is allocated by PyTorch, and 1.44 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

In [None]:
qa_with_sources_chain({"query" : "what is the kinetic energy"})



> Entering new RetrievalQA chain...

> Finished chain.


{'query': 'what is the kinetic energy',
 'result': "Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n\nBecause the concepts of kinetic energy, and energy in general, are so important, various names have been given to the important terms in equations such as these. $\\tfrac{1}{2}mv^2$ is, as we know, called kinetic energy. $\\FLPF\\cdot\\FLPv$ is called power: the force acting on an object times the velocity of the object (vector “dot” product) is the power being delivered to the object by that force. We thus have a marvelous theorem: the rate of change of kinetic energy of an object is equal to the power expended by the forces acting on it. However, to study the conservation of energy, we want to analyze this still more closely. Let us evaluate the change in kinetic energy in a very short time $dt$. If we multiply both sides of Eq. (13.7) by $dt$, we find that the differen

In [None]:
qa_with_sources_chain({"query" : "who is kostas?"})



> Entering new RetrievalQA chain...

> Finished chain.


{'query': 'who is kostas?',
 'result': "Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n\nCopenhagen. He made voluminous tables, which were then studied by the mathematician Kepler, after Tycho’s death. Kepler discovered from the data some very beautiful and remarkable, but simple, laws regarding planetary motion.\n\n\\text{e from $s$}\\\\[1ex] \\displaystyle \\text{ph from $L$} \\end{subarray} \\biggr\\rangle \\biggr\\rvert^2\\notag\\\\[2ex] \\label{Eq:III:3:10} =\\abs{a\\phi_1\\!+b\\phi_2}^2+\\;\\abs{a\\phi_2\\!+b\\phi_1}^2. \\end{gather}\n\nby a flat one—a “constant”—at the same height. In other words, we simply take $I(\\omega)$ outside the integral sign and call it $I(\\omega_0)$. We may also take the rest of the constants out in front of the integral, and what we have left is \\begin{equation} \\label{Eq:I:41:11} \\tfrac{2}{3}\\pi r_0^2\\omega_0^2I(\\omega_0) \\int_
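Note that the `result` field in these outputs begins with the prompt itself, because the pipeline was created with `return_full_text=True`. One way to keep only the model's answer is sketched below. It assumes the prompt ends with the marker `Helpful Answer:`, which is an assumption about the default prompt; `extract_answer` and the toy string are hypothetical.

```python
ANSWER_MARKER = "Helpful Answer:"  # assumption: the last line of the prompt


def extract_answer(result_text: str, marker: str = ANSWER_MARKER) -> str:
    """Return the text after the last marker, or the whole text if it is absent."""
    _, found, tail = result_text.rpartition(marker)
    return tail.strip() if found else result_text.strip()


toy_result = (
    "Use the following pieces of context to answer the question at the end.\n\n"
    "(retrieved context)\n\n"
    "Question: what is the kinetic energy\n"
    "Helpful Answer: One half m v squared."
)
print(extract_answer(toy_result))
```

Using the last occurrence of the marker guards against retrieved context that happens to contain the same phrase.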

And with that, we have our Feynman Lectures RAG tool built!

This notebook is a companion to the event put on by [AIMS](https://www.linkedin.com/company/ai-maker-space/) and [Deci](https://deci.ai/), and is authored by [Chris Alexiuk](https://www.linkedin.com/in/csalexiuk/).