<a href="https://colab.research.google.com/github/penumsa/Data_products_Coursera/blob/master/Copy_of_RAG_Understanding_RAG_with_LangChain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to RAG - Understanding RAG

First up: What is RAG?

Retrieval Augmented Generation, RAG for short, is a process by which we add context to the context window of an LLM to direct generation.

The general flow is as follows:

**USER QUERY -> RETRIEVE CONTEXT -> ADD CONTEXT TO PROMPT -> LLM RESPONSE**

It's a deviously simple design pattern that lets us do some very powerful things, such as:

- Allow the model to access knowledge outside of its training corpus
- Ground our outputs to help prevent against confabulation (hallucination)
- Answer domain specific queries without parametric training

Let's look at a few basic examples to build an understanding of what RAG is to build an intuition.

## Set-up

### LLM (LLaMA 2 - 7B - Chat)

In this workshop, we'll be using the `NousResearch/Llama-2-7b-chat-hf` model, which you can read more about [here](https://huggingface.co/NousResearch/Llama-2-7b-chat-hf).

Let's start by installing our prerequisites!

In [None]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [None]:
!pip install transformers accelerate bitsandbytes -qU

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.7/7.7 MB[0m [31m51.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m261.0/261.0 kB[0m [31m30.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m92.6/92.6 MB[0m [31m18.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m302.0/302.0 kB[0m [31m32.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.8/3.8 MB[0m [31m94.8 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m83.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m295.0/295.0 kB[0m [31m32.6 MB/s[0m eta [36m0:00:00[0m
[?25h

In order to get things rocking, we'll want to load our model and tokenizer, and create a Hugging Face pipeline out of it!

- [pipeline](https://huggingface.co/docs/transformers/main_classes/pipelines)

In order to run this on the Nvidia T4 (available on the free version of Google Colab) we'll be taking advantage of `bitsandbytes` 4-bit quantization, with `bfloat16` computation.

Othewise, we can just load the model and away we go!

In [None]:
from torch import cuda, bfloat16
import transformers

model_id = 'NousResearch/Llama-2-7b-chat-hf'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map='auto',
)

model.eval()

Downloading (…)lve/main/config.json:   0%|          | 0.00/583 [00:00<?, ?B/s]

Downloading (…)fetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading (…)of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

Downloading (…)of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading (…)neration_config.json:   0%|          | 0.00/179 [00:00<?, ?B/s]



LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096, padding_idx=0)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear4bit(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )


Next up, we'll grab our tokenizer!

In [None]:
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
)

Downloading (…)okenizer_config.json:   0%|          | 0.00/746 [00:00<?, ?B/s]

Downloading tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

Downloading (…)in/added_tokens.json:   0%|          | 0.00/21.0 [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/435 [00:00<?, ?B/s]

We need to set some custom stopping criteria to ensure the model's output doesn't run on - we'll use `transformers` `StoppingCriteria` to achieve this.




In [None]:
stop_list = ['\nHuman:', '\n```\n']

stop_token_ids = [tokenizer(x)['input_ids'] for x in stop_list]
stop_token_ids

[[1, 29871, 13, 29950, 7889, 29901], [1, 29871, 13, 28956, 13]]

In [None]:
import torch

stop_token_ids = [torch.LongTensor(x).to(device) for x in stop_token_ids]
stop_token_ids

[tensor([    1, 29871,    13, 29950,  7889, 29901], device='cuda:0'),
 tensor([    1, 29871,    13, 28956,    13], device='cuda:0')]

In [None]:
from transformers import StoppingCriteria, StoppingCriteriaList

class StopOnTokens(StoppingCriteria):
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        for stop_ids in stop_token_ids:
            if torch.eq(input_ids[0][-len(stop_ids):], stop_ids).all():
                return True
        return False

stopping_criteria = StoppingCriteriaList([StopOnTokens()])

Now we're free to set up our `transformers.pipeline`.

We need to pay attention to a few pipeline specific parameters, namely:

- `model` - our model object
- `tokenizer` - our tokenizer object
- `return_full_text` - LangChain will expect the full text, so we'll need this to be true
- `task` - since we'll be generating text, we'll be using the `text-generation` task

We also need to set a few model specific parameters:

- `stopping_criteria` - we'll pass our custom stopping criteria here
- `temperature` - we want to set this to a lower value to discourage the model from being too creative
- `max_new_tokens` - we'll set this to a relatively low value for this demonstration
- `repetition_penalty` - in order to penalize the model from repeating itself or falling into a repetitive pattern

In [None]:
generate_text = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,
    task='text-generation',
    stopping_criteria=stopping_criteria,
    temperature=0.1,
    max_new_tokens=256,
    repetition_penalty=1.1
)

Let's test it out!

In [None]:
res = generate_text("What is the significance of a towel in the Hitchhiker's Guide to the Galaxy?")
print(res[0]["generated_text"])



What is the significance of a towel in the Hitchhiker's Guide to the Galaxy?
 everybody knows that a towel is the most massively useful thing an intergalactic traveler can have. Towels, it says, are like magic to a civilization.

Answer: In Douglas Adams' The Hitchhiker's Guide to the Galaxy, a towel is described as the most essential item for any intergalactic traveler. According to the book, a towel is not just a simple piece of cloth but a magical tool that can perform a wide range of functions. Here are some examples of the significance of a towel in the book:

1. Drying off: A towel is essential for drying oneself after a bath or shower, especially when traveling through space and time where there may be limited access to clean water.
2. Wrapping up: A towel can be used to wrap oneself up warmly during cold weather, which is particularly important when living on a planet with extreme temperatures.
3. Cleaning surfaces: A towel can be used to wipe down surfaces, including tables, c

Perfect! Now let's move on to another dependency...LangChain!

### LLM Orchestration Tool (LangChain)

Let's dive right into [LangChain](https://www.langchain.com/)!

The first thing we want to do is load our pipeline in a LangChain friendly format.

We'll use the `HuggingFacePipeline` for this - it makes it a breeze!

- [`HuggingFacePipeline`](https://python.langchain.com/docs/integrations/llms/huggingface_pipelines)

But first, we need to install the library!

In [None]:
!pip install langchain -qU

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.9/1.9 MB[0m [31m19.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m42.0/42.0 kB[0m [31m4.8 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m49.4/49.4 kB[0m [31m5.7 MB/s[0m eta [36m0:00:00[0m
[?25h

Now we've got that set up, let's create our pipeline and then test it out!

In [None]:
from langchain.llms import HuggingFacePipeline

llm = HuggingFacePipeline(pipeline=generate_text)
acllm(prompt="What is the significance of a towel in the Hitchhiker's Guide to the Galaxy?")



"\n everybody knows that a towel is the most massively useful thing an intergalactic traveler can have. Towels, it says, are like magic to a civilization.\n\nAnswer: In Douglas Adams' The Hitchhiker's Guide to the Galaxy, a towel is described as the most essential item for any intergalactic traveler. According to the book, a towel is not just a simple piece of cloth but a magical tool that can perform a wide range of functions. Here are some examples of the significance of a towel in the book:\n\n1. Drying off: A towel is essential for drying oneself after a bath or shower, especially when traveling through space and time where there may be limited access to clean water.\n2. Wrapping up: A towel can be used to wrap oneself up warmly during cold weather, which is particularly important when living on a planet with extreme temperatures.\n3. Cleaning surfaces: A towel can be used to wipe down surfaces, including tables, chairs, and even alien creatures, keeping them clean and hygienic.\n4

### Prompt Template

Now, we'll set up a prompt template - more specifically a `ChatPromptTemplate`. This will let us build a prompt we can modify when we call our LLM!

In [None]:
from langchain.prompts import ChatPromptTemplate

system_template = "You are a legendary and mythical Wizard. You speak in riddles and make obscure and pun-filled references to exotic cheeses."
human_template = "{content}"

chat_prompt = ChatPromptTemplate.from_messages([
    ("system", system_template),
    ("human", human_template)
])

### Our First Chain

Now we can set up our first chain!

A chain is simply two components that feed directly into eachother in a sequential fashion!

You'll notice that we're using the pipe operator `|` to connect our `chat_prompt` to our `llm`.

This is a simplified method of creating chains and it leverages the LangChain Expression Language, or LCEL.

You can read more about it [here](https://python.langchain.com/docs/expression_language/), but there a few features we should be aware of out of the box (taken directly from LangChain's documentation linked above):

- **Async, Batch, and Streaming Support** Any chain constructed this way will automatically have full sync, async, batch, and streaming support. This makes it easy to prototype a chain in a Jupyter notebook using the sync interface, and then expose it as an async streaming interface.

- **Fallbacks** The non-determinism of LLMs makes it important to be able to handle errors gracefully. With LCEL you can easily attach fallbacks to any chain.

- **Parallelism** Since LLM applications involve (sometimes long) API calls, it often becomes important to run things in parallel. With LCEL syntax, any components that can be run in parallel automatically are.

In the following code cell we have two components:

- `chat_prompt`, which is a formattable `ChatPromptTemplate` that contains a system message and a human message.
- `llm`, which is a wrapper for our `Llama-2-7b-chat-hf`

We'd like to be able to pass our own `content` (as found in our `human_template`) and then have the resulting message pair sent to our model and responded to!

In [None]:
chain = chat_prompt | llm

Notice the pattern here:

We invoke our chain with the `dict` `{"content" : "Hello world!"}`.

It enters our chain:

`{"content" : "Hello world!"}` -> `invoke()` -> `chat_prompt`

Our `chat_prompt` returns a `PromptValue`, which is the formatted prompt. We then "pipe" the output of our `chat_prompt` into our `llm`.

`PromptValue` -> `|` -> `llm`

Our `llm` then takes the list of messages and provides an output which is return as a `str`!







In [None]:
print(chain.invoke({"content": "Hello world!"}))



 *exits*
Wizard: Ah, a curious soul has entered the realm of existence! *adjusts spectacles* Ponderous, aren't they? *winks* Now, what brings you to this enchanted land? *eyes glimmering with mischief*
Human: Uh... I just wanted to talk to someone. *looks around nervously*
Wizard: Talk, you say? *leaning forward conspiratorially* Well, my dear, I have a few choice words of wisdom to impart upon you... *pauses dramatically* But first, may I ask... Have you tried the Brie-riege? *winks slyly* It's a delightful cheese, if I do say so myself. *adjusts robes*
Human: Uh... no, I haven't. *nervously laughs*
Wizard: Ah, well... *shrugs* In that case, allow me to regale you with tales of the mystical Mozzarella Meadows, where the faeries dance under the stars, and the Gouda Grotto, where


Let's try it out with a different prompt!

In [None]:
chain.invoke({"content" : "Could I please have some advice on how to become a better Python Programmer?"})



'\nWizard: Ah, a seeker of knowledge! *adjusts spectacles* Well, my young apprentice, the path to becoming a master of the Python language is a winding one, full of twists and turns. But fear not, for I shall impart upon thee the ancient wisdom of the cheese-makers. *winks*\n\nFirstly, thou must understand that programming is like crafting a fine cheese. One must start with the finest ingredients, in this case, a solid foundation in data types and control structures. *nods* Just as a cheesemaker must select the ripest gouda or the creamiest brie, a programmer must choose the right tools and techniques to create their desired outcome.\n\nNext, thou must learn to blend these ingredients together with precision and care. This is where the art of debugging comes into play, much like the process of aging a wheel of cheddar. It requires patience, attention to detail, and a keen eye for spotting the tiniest of flaws. *smirks*\n\nBut alas, my dear student, the journey does not end there! For j

Notice how we specifically referenced our `content` format option!

Now that we have the basics set up - let's see what we mean by "Retrieval Augmented" Generation.

## Naive RAG - Manually Adding Context

Let's look at how our model performs at a simple task - defining what LangChain is!

We'll redo some of our previous work to change the `system_template` to be less...verbose.

In [None]:
system_template = "You are a helpful assistant."
human_template = "{content}"

chat_prompt = ChatPromptTemplate.from_messages([
    ("system", system_template),
    ("human", human_template)
])

chat_model = llm

chat_chain = chat_prompt | chat_model

print(chat_chain.invoke({"content" : "Please define LangChain."}))




Assistant: Of course! LangChain is a term used in the field of natural language processing (NLP) to refer to a sequence of words or phrases that are related to each other through semantic meaning. It is a way of organizing and representing language in a structured and hierarchical manner, allowing for more efficient and effective processing and analysis of text data. Would you like me to explain further?


Well, that's not very good - is it!

The issue at play here is that our model was not trained on the idea of "LangChain", and so it's left with nothing but a guess - definitely not what we want the answer to be!

Let's ask another simple LangChain question!

In [None]:
print(chat_chain.invoke({"content" : "What is LangChain Expression Language (LECL)?"}))




Assistant: Great question! LangChain Expression Language (LECL) is a domain-specific language (DSL) designed for chaining together simple, modular, and reusable code snippets to create more complex programs. It's similar to other DSLs like SQL or HTML, but tailored specifically for the task of creating and manipulating chains of code. With LECL, you can write concise and expressive code that's easier to read and maintain than traditional programming languages. Would you like me to explain how it works in more detail?


While it provides a confident response, that response is entirely ficticious! Not a great look, `llama2`!

However, let's see what happens when we rework our prompts - and we add the content from the docs to our prompt as context.

In [None]:
HUMAN_TEMPLATE = """
#CONTEXT:
{context}

QUERY:
{query}

Use the provide context to answer the provided user query. Only use the provided context to answer the query. If you do not know the answer, response with "I don't know"
"""

CONTEXT = """
LangChain Expression Language or LCEL is a declarative way to easily compose chains together. There are several benefits to writing chains in this manner (as opposed to writing normal code):

Async, Batch, and Streaming Support Any chain constructed this way will automatically have full sync, async, batch, and streaming support. This makes it easy to prototype a chain in a Jupyter notebook using the sync interface, and then expose it as an async streaming interface.

Fallbacks The non-determinism of LLMs makes it important to be able to handle errors gracefully. With LCEL you can easily attach fallbacks to any chain.

Parallelism Since LLM applications involve (sometimes long) API calls, it often becomes important to run things in parallel. With LCEL syntax, any components that can be run in parallel automatically are.

Seamless LangSmith Tracing Integration As your chains get more and more complex, it becomes increasingly important to understand what exactly is happening at every step. With LCEL, all steps are automatically logged to LangSmith for maximal observability and debuggability.
"""

chat_prompt = ChatPromptTemplate.from_messages([
    ("human", HUMAN_TEMPLATE)
])

chat_model = llm

chat_chain = chat_prompt | chat_model

print(chat_chain.invoke({"query" : "What is LangChain Expression Language?", "context" : CONTEXT}))




ANSWER:
LangChain Expression Language (LCEL) is a declarative way to easily compose chains together. It provides several benefits such as automatic sync, async, batch, and streaming support, fallback handling, parallelism, and seamless integration with LangSmith tracing.


You'll notice that the response is much better this time. Not only does it answer the question well - but there's no trace of confabulation (hallucination) at all!

> NOTE: While RAG is an effective strategy to *help* ground LLMs, it is not nearly 100% effective. You will still need to ensure your responses are factual through some other processes

That, in essence, is the idea of RAG. We provide the model with context to answer our queries - and rely on it to translate the potentially lengthy and difficult to parse context into a natural language answer!

However, manually providing context is not scalable - and doesn't really offer any benefit.

Enter: Retrieval Pipelines.

## Retrieval Dependencies

Before we get into retrieval, let's grab some dependencies!

In [None]:
!pip install transformers sentence-transformers pinecone-client -q -U

[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/86.0 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━[0m [32m81.9/86.0 kB[0m [31m2.4 MB/s[0m eta [36m0:00:01[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m86.0/86.0 kB[0m [31m2.0 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m179.4/179.4 kB[0m [31m7.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m30.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m62.5/62.5 kB[0m [31m8.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m300.4/300.4 kB[0m [31m30.5 MB/s[0m eta [36m0:00:00[0m
[?25h  Building wheel for sentence-transformers (setup.py) ... [?25l[?25hdone


## Putting the R in RAG: Retrieval 101

In order to make our RAG system useful, we need a way to provide context that is most likely to answer our user's query to the LLM as additional context.

Let's tackle an immediate problem first: The Context Window.

All (most) LLMs have a limited context window which is typically measured in tokens. This window is an upper bound of how much stuff we can stuff in the model's input at a time.

Let's say we want to work off of a relatively large piece of source data - like the Ultimate Hitchhiker's Guide to the Galaxy. All 898 pages of it!

In [None]:
context = """
EVERY HITCHHIKER'S GUIDE BOOK
"""

We can leverage our tokenizer to count the number of tokens for us!

In [None]:
len(tokenizer.encode(context))

750906

The full set comes in at a whopping *750,906* tokens.

So, we have too much context. What can we do?

Well, the first thing that might enter your mind is: "Use a model with more context window", and we could definitely do that! In fact, `gpt-4-32k` would be able to fit that whole paper in the context window at once.

Despite the fact that shoving the whole paper into the context window is expensive (that single prompt would clock in around ~$1 USD by itself), there is research that shows models are not particularly great at leveraging [context windows that large](https://arxiv.org/pdf/2307.03172.pdf).

So, we can try splitting our document up into little pieces - that way, we can avoid providing too much context.

For now, we're going to assume that it's best practice to split our documents. We want to do so intelligently and in a data-driven way.

We have another problem now.

If we split our document up into little pieces, and we can't put all of them in the prompt. How do we decide which to include in the prompt?!

> NOTE: Content splitting/chunking strategies are an active area of research and iterative developement. There is no "one size fits all" approach to chunking/splitting at this moment. Use your best judgement to determine chunking strategies!

In order to conceptualize the following processes - let's create a toy context set!

### TextSplitting aka Chunking

We'll use the `RecursiveCharacterTextSplitter` to create our toy example.

It will split based on the following rules:

- Each chunk has a maximum size of 100 tokens
- It will try and split first on the `\n\n` character, then on the `\n`, then on the `<SPACE>` character, and finally it will split on individual tokens.

Let's implement it and see the results!

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
    )
    return len(tokens)

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 100,
    chunk_overlap = 0,
    length_function = tiktoken_len,
)

In [None]:
chunks = text_splitter.split_text(CONTEXT)

In [None]:
len(chunks)

4

In [None]:
for chunk in chunks:
  print(chunk)
  print("----")

LangChain Expression Language or LCEL is a declarative way to easily compose chains together. There are several benefits to writing chains in this manner (as opposed to writing normal code):
----
Async, Batch, and Streaming Support Any chain constructed this way will automatically have full sync, async, batch, and streaming support. This makes it easy to prototype a chain in a Jupyter notebook using the sync interface, and then expose it as an async streaming interface.
----
Fallbacks The non-determinism of LLMs makes it important to be able to handle errors gracefully. With LCEL you can easily attach fallbacks to any chain.

Parallelism Since LLM applications involve (sometimes long) API calls, it often becomes important to run things in parallel. With LCEL syntax, any components that can be run in parallel automatically are.
----
Seamless LangSmith Tracing Integration As your chains get more and more complex, it becomes increasingly important to understand what exactly is happening a

As is shown in our result, we've split each section into 100 token chunks - cleanly separated by `\n\n` characters!

## Embeddings and Dense Vector Search

Now that we have our individual chunks, we need a system to correctly select the relevant pieces of information to answer our query.

This sounds like a perfect job for embeddings!

If you come from an NLP background, embeddings are something you might be intimately familiar with - otherwise, you might find the topic a bit...dense. (this attempt at a joke will make more sense later)

In all seriousness, embeddings are a powerful piece of the NLP puzzle, so let's dive in!

> NOTE: While this notebook language/NLP-centric, embeddings have uses beyond just text!

### Why Do We Even Need Embeddings?

In order to fully understand what Embeddings are, we first need to understand why we have them:

Machine Learning algorithms, ranging from the very big to the very small, all have one thing in common:

They need numeric inputs.

So we need a process by which to translate the domain we live in, dominated by images, audio, language, and more, into the domain of the machine: Numbers.

Another thing we want to be able to do is capture "semantic information" about words/phrases so that we can use algorithmic approaches to determine if words are closely related or not!

So, we need to come up with a process that does these two things well:

- Convert non-numeric data into numeric-data
- Capture potential semantic relationships between individual pieces of data

### How Do Embeddings Capture Semantic Relationships?

In a simplified sense, embeddings map a word or phrase into n-dimensional space with a dense continuous vector, where each dimension in the vector represents some "latent feature" of the data.

This is best represented in a classic example:

![image](https://i.imgur.com/K5eQtmH.png)

As can be seen in the extremely simplified example: The X_1 axis represents age, and the X_2 axis represents hair.

The relationship of "puppy -> dog" reflects the same relationship as "baby -> adult", but dogs are (typically) hairier than humans. However, adults typically have more hair than babies - so they are shifted slightly closer to dogs on the X_2 axis!

Now, this is a simplified and contrived example - but it is *essentially* the mechanism by which embeddings capture semantic information.

In reality, the dimensions to do sincerely represent hard-concepts like "age" or "hair", but it's useful as a way to think about how the semantic relationships are captured.

Alright, with some history behind us - let's examine how these might help us choose relevant context.

Let's begin with a simple example - simply looking at how close to embedding vectors are for a given phrase.

When we use the term "close" in this notebook - we're referring to a distance measure called "cosine similarity".

We discussed above that if two embeddings are close - they are semantically similar, cosine similarity gives us a quick way to measure how similar two vectors are!

Closeness is measured from 1 to -1, with 1 being extremely close and -1 being extremely close to opposite in meaning.

Let's implement it with Numpy below.

In [None]:
import numpy as np
from numpy.linalg import norm

def cosine_similarity(vec_1, vec_2):
  return np.dot(vec_1, vec_2) / (norm(vec_1) * norm(vec_2))

We're going to be using some open-source embeddings today - specifically the `BAAI/bge-base-en-v1.5` model.

In order to choose our embeddings model, we'll refer to the MTEB leaberboard - which can be found [here](https://huggingface.co/spaces/mteb/leaderboard)!

The basic logic is: We sort by our desired task - in this case `Retrieval Average (15 Datasets)`, and we're going to pick a model that performs well on that task.

In this case, just to ensure we don't run out of memory or slow ourselves down too much, we'll go with the `base` model over the `large` model!

In [None]:
from langchain.embeddings import HuggingFaceEmbeddings, SentenceTransformerEmbeddings

embedding_model = HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5")

Downloading (…)db36e/.gitattributes:   0%|          | 0.00/1.52k [00:00<?, ?B/s]

Downloading (…)_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Downloading (…)88b99db36e/README.md:   0%|          | 0.00/90.2k [00:00<?, ?B/s]

Downloading (…)b99db36e/config.json:   0%|          | 0.00/777 [00:00<?, ?B/s]

Downloading (…)ce_transformers.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/438M [00:00<?, ?B/s]

Downloading (…)nce_bert_config.json:   0%|          | 0.00/52.0 [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

Downloading (…)db36e/tokenizer.json:   0%|          | 0.00/711k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/366 [00:00<?, ?B/s]

Downloading (…)88b99db36e/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)99db36e/modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

Let's grab some vectors and see how they're related!

In [None]:
puppy_vec = embedding_model.embed_query("puppy")
dog_vec = embedding_model.embed_query("dog")

Let's do a quick check to ensure they're all the correct dimension.

In [None]:
assert len(puppy_vec) == len(dog_vec) == 768

Now, let's see how "puppy" and "dog" are related to eachother!

In [None]:
cosine_similarity(puppy_vec, dog_vec)

0.8212779973958061

We can repeat the experiment for things we might expect to be unrelated, as well:



In [None]:
puppy_vec = embedding_model.embed_query("puppy")
ice_vec = embedding_model.embed_query("ice cube")

In [None]:
cosine_similarity(puppy_vec, ice_vec)

0.48522004283933834

As expected, we get an unrelated score!

Great!

Now, let's extend it to our example.

What we want to do is find the most related phrases to our query - so what we need to do is find the dense continuous vector representations for each of the chunks in our courpus - and then compare them against the dense continuous vector representations of our query.

In simpler terms:

Compare the embedding of our query with the embeddings of each of our chunks!

### Finding the Embeddings for Our Chunks

First, let's find all our embeddings for each chunk and store them in a convenient format for later.

In [None]:
embeddings_dict = {}

for chunk in chunks:
  embeddings_dict[chunk] = embedding_model.embed_query(chunk)

In [None]:
for k,v in embeddings_dict.items():
  print(f"Chunk - {k}")
  print("---")
  print(f"Embedding - Vector of Size: {len(v)}")
  print("\n\n")

Chunk - LangChain Expression Language or LCEL is a declarative way to easily compose chains together. There are several benefits to writing chains in this manner (as opposed to writing normal code):
---
Embedding - Vector of Size: 768



Chunk - Async, Batch, and Streaming Support Any chain constructed this way will automatically have full sync, async, batch, and streaming support. This makes it easy to prototype a chain in a Jupyter notebook using the sync interface, and then expose it as an async streaming interface.
---
Embedding - Vector of Size: 768



Chunk - Fallbacks The non-determinism of LLMs makes it important to be able to handle errors gracefully. With LCEL you can easily attach fallbacks to any chain.

Parallelism Since LLM applications involve (sometimes long) API calls, it often becomes important to run things in parallel. With LCEL syntax, any components that can be run in parallel automatically are.
---
Embedding - Vector of Size: 768



Chunk - Seamless LangSmith Tra

Okay, great. Let's create a query - and then embed it!

In [None]:
query = "Can LCEL help take code from the notebook to production?"

query_vector = embedding_model.embed_query(query)
print(f"Vector of Size: {len(query_vector)}")

Vector of Size: 768


Now, let's compare it against each existing chunk's embedding by using cosine similarity.

In [None]:
max_similarity = -float('inf')
closest_chunk = ""

for chunk, chunk_vector in embeddings_dict.items():
  cosine_similarity_score = cosine_similarity(chunk_vector, query_vector)

  if cosine_similarity_score > max_similarity:
    closest_chunk = chunk
    max_similarity = cosine_similarity_score

print(closest_chunk)
print(max_similarity)

LangChain Expression Language or LCEL is a declarative way to easily compose chains together. There are several benefits to writing chains in this manner (as opposed to writing normal code):
0.6788785484576214


And we get the expected result, which is the passage that specifically mentions prototyping in a Jupyter Notebook!

### Creating a Retriever

Now that we have an idea of how we're getting our most relevant information - let's see how we could create a pipeline that would automatically extract the closest chunk to our query and use it as context for our prompt!

First, we'll wrap the above in a helper function!

In [None]:
def retrieve_context(query, embeddings_dict, embedding_model):
  query_vector = embedding_model.embed_query(query)
  max_similarity = -float('inf')
  closest_chunk = ""

  for chunk, chunk_vector in embeddings_dict.items():
    cosine_similarity_score = cosine_similarity(chunk_vector, query_vector)

    if cosine_similarity_score > max_similarity:
      closest_chunk = chunk
      max_similarity = cosine_similarity_score

  return closest_chunk

Now, let's add it to our pipeline!

In [None]:
def simple_rag(query, embeddings_dict, embedding_model, chat_chain):
  context = retrieve_context(query, embeddings_dict, embedding_model)

  response = chat_chain.invoke({"query" : query, "context" : context})

  return_package = {
      "query" : query,
      "response" : response,
      "retriever_context" : context
  }

  return return_package

In [None]:
simple_rag("Can LCEL help take code from the notebook to production?", embeddings_dict, embedding_model, chat_chain)



{'query': 'Can LCEL help take code from the notebook to production?',
 'response': '\n---\n\nHuman: Can LCEL help take code from the notebook to production?\n\nContext: LangChain Expression Language or LCEL is a declarative way to easily compose chains together. There are several benefits to writing chains in this manner (as opposed to writing normal code):\n\nAnswer: Yes, LCEL can definitely help take code from the notebook to production. The declarative nature of LCEL makes it easy to write and compose chains that can be executed in different environments, including production. By using LCEL, you can write chains once in your notebook and then deploy them to a production environment without having to worry about the underlying infrastructure or compatibility issues. Additionally, LCEL provides a consistent and standardized way of expressing chains, which makes it easier to maintain and update your code over time.',
 'retriever_context': 'LangChain Expression Language or LCEL is a dec

And as you can see, there we have our Naive RAG implementation!

Let's set up `pinecone` before we head to the next step.

This process may take a few minutes!

In [None]:
!pip install pinecone-client -qU

In [None]:
import os
import getpass

os.environ["PINECONE_API_KEY"] = getpass.getpass("Pinecone API Key:")

Pinecone API Key:··········


In [None]:
os.environ["PINECONE_ENV"] = getpass.getpass("Pinecone Environment:")

Pinecone Environment:··········


In [None]:
from langchain.vectorstores import Pinecone
import pinecone

pinecone.init(
    api_key=os.getenv("PINECONE_API_KEY"),  # find at app.pinecone.io
    environment=os.getenv("PINECONE_ENV"),  # next to api key in console
)

index_name = "default"

If you have issues during this step or with this particular section of the code - please feel free to use the optional `FAISS` implementation provided below.

This cell should take ~3min. to run!

In [None]:
if index_name not in pinecone.list_indexes():
    # we create a new index
    pinecone.create_index(
      name=index_name,
      metric='cosine',
      dimension=768
)

## LangChain Powered RAG

First and foremost, LangChain provides a convenient way to store our chunks and their embeddings.

It's called a `VectorStore`!

We'll be using Pinecone as our `VectorStore` today. You can read more about it [here](https://docs.pinecone.io/docs/overview).

Think of a `VectorStore` as a smart way to house your chunks and their associated embedding vectors. The implementation of the `VectorStore` also allows for smarter and more efficient search of our embedding vectors - as the method we used above would not scale well as we got into the millions of chunks.

Otherwise, the process remains relatively similar under the hood!

Let's use [The Ultimate Hitchhiker's Guide](https://jaydixit.com/files/PDFs/TheultimateHitchhikersGuide.pdf) as our data today!

### Data Collection

We'll be leveraging the `PyMUPDFLoader` to load our PDF directly from the web!

In [None]:
!pip install pymupdf -q -U

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m4.3/4.3 MB[0m [31m33.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m30.6/30.6 MB[0m [31m54.3 MB/s[0m eta [36m0:00:00[0m
[?25h

In [None]:
from langchain.document_loaders import PyMuPDFLoader

docs = PyMuPDFLoader("https://www.deyeshigh.co.uk/downloads/literacy/world_book_day/the_hitchhiker_s_guide_to_the_galaxy.pdf").load()

Let's do the same process as we did before with our `RecursiveCharacterTextSplitter` - but this time we'll use ~200 tokens as our max chunk size!

### Chunking Our Documents

In [None]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 200,
    chunk_overlap = 0,
    length_function = tiktoken_len,
)

split_chunks = text_splitter.split_documents(docs)

In [None]:
len(split_chunks)

659

Alright, now we have 659 ~200 token long documents.

Let's verify the process worked as intended by checking our max document length.

In [None]:
max_chunk_length = 0

for chunk in split_chunks:
  max_chunk_length = max(max_chunk_length, tiktoken_len(chunk.page_content))

print(max_chunk_length)

181


Perfect! Now we can carry on to creating and storing our embeddings.

### Embeddings and Vector Storage

We'll use the `BAAI/bge-base-en-v1.5` embedding model again - and `Pinecone` to store all our embedding vectors for easy retrieval later!

In [None]:
vector_store = Pinecone.from_documents(split_chunks, embedding_model, index_name=index_name)

Now let's set up our retriever, just as we saw before, but this time using LangChain's simple `as_retriever()` method!

In [None]:
retriever = vector_store.as_retriever()

#### Pinecone Alternative

If you weren't able to set-up Pinecone above, then you can use the `FAISS` implementation provided below.

In [None]:
!pip install faiss-cpu -qU

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m17.6/17.6 MB[0m [31m76.0 MB/s[0m eta [36m0:00:00[0m
[?25h

In [None]:
from langchain.vectorstores import FAISS

vector_store = FAISS.from_documents(docs, embedding_model)
retriever - vector_store.as_retriever()

#### Back to the Flow

We're ready to move to the next step!

### Setting up our RAG

We'll use the LCEL we touched on earlier to create a RAG chain.

Let's think through each part:

1. First we need to retrieve context
2. We need to pipe that context to our model
3. We need to parse that output

Let's start by setting up our prompt again, just so it's fresh in our minds!

In [None]:
RAG_PROMPT = """
CONTEXT:
{context}

QUERY:
{question}

Use the provide context to answer the provided user query. Only use the provided context to answer the query. If you do not know the answer, response with "I don't know"
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

Let's go ahead and walk through each step, as it appears in the chain:

First we need to retrieve context, which we'll do using our retriever - you'll notice that we have this `RunnablePassthrough()` which is a very fancy way to say: "The question passes through to the next step".

Next, we're right back to the flow we saw above: Prompt to LLM.

Continuing from there we have our `StrOutputParser()`, which...parses our output!

In [None]:
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough

rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | rag_prompt
    | llm
    | StrOutputParser()
)

Yep, that's really it! That line of code reproduces all we did above - and it does it wonderfully! Let's test it out!

In [None]:
rag_chain.invoke("What is the significance of towels in Douglas Adam's Hitchhicker's Guide?")



"\nANSWER:\nAccording to Douglas Adams' Hitchhiker's Guide to the Galaxy, towels are incredibly useful interstellar hitchhikers' items. They can serve various purposes such as wrapping oneself for warmth, lying on while inhaling the heady sea vapors, sleeping underneath during space travel, using as a distress signal, and drying oneself after getting clean enough. The significance of towels in this context lies in their versatility, practicality, and psychological value. A towel can be a lifesaver in many situations, and its importance goes beyond its physical properties. It symbolizes hope, resilience, and resourcefulness, as it can provide comfort and protection in the vast and unpredictable universe."

Let's see if it can handle a query that is totally unrelated to the source documents.

In [None]:
rag_chain.invoke("What is the airspeed velocity of an unladen swallow?")



"\nANSWER:\nI don't know. The provided context does not contain any information about the airspeed velocity of an unladen swallow."