<a href="https://colab.research.google.com/github/harnalashok/LLMs/blob/main/RAG_on_Colab_with_Huggingface_and_langchain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Last amended: 29th June, 2024
# Ref: https://huggingface.co/learn/cookbook/en/rag_zephyr_langchain

# Simple RAG for GitHub issues using Hugging Face Zephyr and LangChain

_Authored by: [Maria Khalusova](https://github.com/MKhalusova)_

This notebook demonstrates how to quickly build a RAG (Retrieval Augmented Generation) system for a project's GitHub issues using the [`HuggingFaceH4/zephyr-7b-beta`](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) model and LangChain.

## What is RAG

RAG is a popular approach to address the issue of a powerful LLM not being aware of specific content due to said content not being in its training data, or hallucinating even when it has seen it before. Such specific content may be proprietary, sensitive, or, as in this example, recent and updated often.

If your data is static and doesn't change regularly, you may consider fine-tuning a large model. In many cases, however, fine-tuning can be costly, and, when done repeatedly (e.g. to address data drift), leads to "model shift". This is when the model's behavior changes in ways that are not desirable.

**RAG (Retrieval Augmented Generation)** does not require model fine-tuning. Instead, RAG works by providing an LLM with additional context that is retrieved from relevant data so that it can generate a better-informed response.

Here's a quick illustration:

![RAG diagram](https://huggingface.co/datasets/huggingface/cookbook-images/resolve/main/rag-diagram.png)

* The external data is converted into embedding vectors with a separate embeddings model, and the vectors are kept in a database. Embeddings models are typically small, so updating the embedding vectors on a regular basis is faster, cheaper, and easier than fine-tuning a model.

* At the same time, the fact that fine-tuning is not required gives you the freedom to swap your LLM for a more powerful one when it becomes available, or switch to a smaller distilled version, should you need faster inference.
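
To make the flow concrete, here is a toy sketch of retrieve-then-generate in plain Python. It rests on loud assumptions: word overlap stands in for a real embeddings model, the `knowledge_base` sentences are invented, and the final call to an LLM is left out. It only shows how retrieved text ends up in front of the question.

```python
# Toy illustration of the retrieve-then-generate flow.
# Assumption: word overlap replaces a real embeddings model.

def words(text):
    """Lowercase a text and return its set of whitespace-separated words."""
    return set(text.lower().split())

def retrieve(query, documents, k=2):
    """Return the k documents sharing the most words with the query."""
    ranked = sorted(documents,
                    key=lambda doc: len(words(doc) & words(query)),
                    reverse=True)
    return ranked[:k]

def build_prompt(query, documents, k=2):
    """Place the retrieved documents in front of the question."""
    context = "\n".join(retrieve(query, documents, k))
    return f"Context:\n{context}\n\nQuestion: {query}"

# Hypothetical knowledge base, invented for this example.
knowledge_base = [
    "LoRA adapters can be merged into the base model weights.",
    "HDMI cables carry audio and video.",
    "Several LoRA adapters can be loaded on one base model.",
]

rag_prompt = build_prompt("How do I load several LoRA adapters?", knowledge_base)
print(rag_prompt)
```

In the rest of the notebook, the overlap score is replaced by embedding similarity and the prompt is sent to a real model.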

Let's illustrate building a RAG using an open-source LLM, embeddings model, and LangChain.

First, install the required dependencies:

## Install software

In [1]:
# 0.0
!pip install -q accelerate bitsandbytes transformers sentence-transformers faiss-gpu

In [2]:
# 0.1 If running in Google Colab, you may need to run this
#       cell to make sure you're using UTF-8 locale
#       to install LangChain

import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [3]:
# 0.2 Install langchain

!pip install -q langchain langchain-community

## Download and Load GitHub data


In this example, we'll load all of the issues (both open and closed) from [PEFT library's repo](https://github.com/huggingface/peft).

First, you need to acquire a [GitHub personal access token](https://github.com/settings/tokens?type=beta) to access the GitHub API.

In [4]:
# 0.3 Access to github repo:

from getpass import getpass
ACCESS_TOKEN = getpass("YOUR_GITHUB_PERSONAL_TOKEN")



Next, we'll load all of the issues in the [huggingface/peft](https://github.com/huggingface/peft) repo:
- By default, pull requests are considered issues as well. Here we exclude them from the data by setting `include_prs=False`.
- Setting `state = "all"` means we will load both open and closed issues.
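
The effect of these two options can be sketched in plain Python. GitHub's "list repository issues" endpoint returns pull requests alongside issues and marks them with a `pull_request` key; the records below are made up, and `select_issues` is a hypothetical helper for illustration, not the loader's actual implementation.

```python
# Hypothetical records shaped like items from GitHub's issues endpoint.
api_items = [
    {"number": 1, "title": "Bug in adapter loading", "state": "open"},
    {"number": 2, "title": "Fix typo in docs", "state": "closed",
     "pull_request": {"url": "placeholder"}},
    {"number": 3, "title": "Question about LoRA rank", "state": "closed"},
]

def select_issues(items, include_prs=False, state="all"):
    """Keep items in the requested state, optionally dropping pull requests."""
    selected = []
    for item in items:
        is_pr = "pull_request" in item  # only pull requests carry this key
        if is_pr and not include_prs:
            continue
        if state != "all" and item["state"] != state:
            continue
        selected.append(item)
    return selected

issues_only = select_issues(api_items)
print([item["number"] for item in issues_only])
```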

In [5]:
# 1.0 Load github issues:

# Community integrations live in the langchain-community package installed above
from langchain_community.document_loaders import GitHubIssuesLoader

# 1.0.1 Instantiate github issues loader:

loader = GitHubIssuesLoader(
                            repo="huggingface/peft",
                            access_token=ACCESS_TOKEN,
                            include_prs=False,
                            state="all"
                           )


In [6]:
%%time

# 1.0.2 Load the documents:

docs = loader.load()

CPU times: user 4 s, sys: 52.2 ms, total: 4.05 s
Wall time: 20.4 s


## Splitting and Chunking
Once loaded, the data is plain text. The next step is to split it into chunks.

The content of individual GitHub issues may be longer than what an embedding model can take as input. If we want to embed all of the available content, we need to chunk the documents into appropriately sized pieces.

The most common and straightforward approach to chunking is to define a fixed size of chunks and whether there should be any overlap between them. Keeping some overlap between chunks allows us to preserve some semantic context between the chunks. The recommended splitter for generic text is the [RecursiveCharacterTextSplitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/recursive_text_splitter), and that's what we'll use here.
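
As a rough sketch of the fixed-size-with-overlap idea, the helper below slices a string into character windows. This is a simplified stand-in, not the algorithm used by `RecursiveCharacterTextSplitter`, which also tries to break on separators such as paragraphs and sentences; `chunk_text` is a hypothetical name.

```python
def chunk_text(text, chunk_size=512, chunk_overlap=30):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks

# Made-up sample text of 1200 characters.
sample = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(sample)
print([len(c) for c in chunks])
```

The last 30 characters of each chunk are repeated at the start of the next one, which is what preserves some context across chunk boundaries.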

In [7]:
# 2.0 Split documents into chunks:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# 2.0.1 Instantiate the splitter class:
splitter = RecursiveCharacterTextSplitter(chunk_size=512,
                                          chunk_overlap=30
                                          )



In [8]:
%%time

# 2.0.2 Chunk docs now:

chunked_docs = splitter.split_documents(docs)

CPU times: user 388 ms, sys: 6.41 ms, total: 395 ms
Wall time: 405 ms


In [None]:
# print(chunked_docs[0])

## Create the embeddings

Now that the docs are all of the appropriate size, we can create a database with their embeddings.

To create document chunk embeddings we'll use the `HuggingFaceEmbeddings` and the [`BAAI/bge-base-en-v1.5`](https://huggingface.co/BAAI/bge-base-en-v1.5) embeddings model. There are many other embeddings models available on the Hub, and you can keep an eye on the best performing ones by checking the [Massive Text Embedding Benchmark (MTEB) Leaderboard](https://huggingface.co/spaces/mteb/leaderboard).


To create the vector database, we'll use `FAISS`, a library developed by Facebook AI. This library offers efficient similarity search and clustering of dense vectors, which is what we need here. FAISS is currently one of the most widely used libraries for nearest-neighbor (NN) search in massive datasets.

We'll access both the embeddings model and FAISS via LangChain API.
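
Conceptually, similarity search means ranking stored vectors by their distance to a query vector. The sketch below does this exhaustively with Euclidean (L2) distance in plain Python; the 3-dimensional vectors are invented, and FAISS itself is far more optimized and also offers approximate indexes.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(query_vec, vectors, k=2):
    """Return (index, distance) pairs for the k vectors closest to query_vec."""
    ranked = sorted(enumerate(vectors),
                    key=lambda pair: l2_distance(query_vec, pair[1]))
    return [(i, l2_distance(query_vec, v)) for i, v in ranked[:k]]

# Made-up "embeddings"; real ones have hundreds of dimensions.
vectors = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
hits = nearest([1.0, 0.0, 0.0], vectors, k=2)
print(hits)
```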

In [11]:
# 3.0 Call libraries:
# FAISS in-memory vectorstore
from langchain_community.vectorstores import FAISS

# 3.0.1 This is a wrapper for Huggingface embedder model
from langchain_community.embeddings import HuggingFaceEmbeddings

In [12]:
%%time

# 3.0.2 Create database of vectors:
db = FAISS.from_documents(chunked_docs,
                          HuggingFaceEmbeddings(
                                                 model_name='BAAI/bge-base-en-v1.5'  # 450mb download
                                                )
                          )


CPU times: user 1min 5s, sys: 2.81 s, total: 1min 8s
Wall time: 1min 19s


The vector database is now set up. Next, we need to set up the next piece of the chain: the model.

## Load quantized model

For this example, we chose [`HuggingFaceH4/zephyr-7b-beta`](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta), a small but powerful model. Its model card is [here](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta).

With many models being released every week, you may want to swap this model for the latest and greatest. The best way to keep track of open source LLMs is to check the [Open-source LLM leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

To reduce the memory footprint and make the model practical to run on a single GPU, we will load a 4-bit quantized version of the model:

In [13]:
# 4.0.1 Call libraries
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

For `bitsandbytes` configuration, see [here](https://huggingface.co/docs/transformers/en/main_classes/quantization#transformers.BitsAndBytesConfig)

In [15]:
# 4.0.2 Which model to reduce to 4bit quantized:
model_name = 'HuggingFaceH4/zephyr-7b-beta'

# 4.0.3 Configure for reduction:
bnb_config = BitsAndBytesConfig(
                                load_in_4bit=True,              # enable 4-bit quantization
                                bnb_4bit_use_double_quant=True, # Nested quantization: the constants from the
                                                                #  first quantization are quantized again.
                                bnb_4bit_quant_type="nf4",      # 4-bit NormalFloat for better results
                                bnb_4bit_compute_dtype=torch.bfloat16 # computation runs in bfloat16 for speedups
                                )
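
As a back-of-the-envelope check on why 4-bit loading matters, the arithmetic below estimates the memory taken by the weights alone at different precisions. The parameter count of roughly 7.24 billion is an assumption for a Mistral-7B-sized model, and the estimate ignores quantization constants, activations, and the KV cache.

```python
# Assumption: about 7.24 billion parameters (Mistral-7B-sized model).
N_PARAMS = 7.24e9

def weight_memory_gb(n_params, bits_per_param):
    """Memory for the weights only, in gigabytes (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

for label, bits in [("float32", 32), ("bfloat16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Under these assumptions the 16-bit figure is in line with the roughly 15 GB download noted in the loading cell, while the 4-bit weights need about a quarter of that.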



`AutoModelForCausalLM` cannot be instantiated directly, for example as `AutoModelForCausalLM(parameters)`. To instantiate it, use the `from_pretrained()` method.

In [16]:
%%time

# 5.0 First download model and also its configuration
#     and then instantiate AutoModelForCausalLM class

model = AutoModelForCausalLM.from_pretrained(model_name,    # 15gb download
                                             quantization_config=bnb_config # Quantize weights to 4-bit while loading
                                             )

# 5.0.1
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 5.0.2 What is the returned object:
type(model)  # transformers.models.mistral.modeling_mistral.MistralForCausalLM


CPU times: user 37.8 s, sys: 39.1 s, total: 1min 16s
Wall time: 2min 43s


## Setup pipeline

Finally, we have all the pieces we need to set up the LLM chain.

First, create a `text-generation` pipeline using the loaded model and its tokenizer.

Next, create a prompt template - this should follow the format of the model, so if you substitute the model checkpoint, make sure to use the appropriate formatting.

### `StrOutputParser`
See [here](https://www.restack.io/docs/langchain-knowledge-langchain-stroutputparser-guide) for reference.      
Output parsers are responsible for taking the output of an LLM and transforming it into a more suitable format. This is very useful when you are using LLMs to generate any form of structured data.

The `StrOutputParser` is a straightforward yet powerful tool within the LangChain ecosystem. Its primary function is to convert the output of a language model, whether from an LLM or a ChatModel, into a string format. This conversion is crucial for applications that require a uniform output format for further processing or display to end-users.

>> For **LLM Outputs**: If the model's output is already a string, the `StrOutputParser` simply passes this string through without modification.    

>> For **ChatModel Outputs**: In cases where the output is a ChatModel message, the `StrOutputParser` extracts the `.content` attribute of the message, ensuring that the final output is in string format.

This parser is particularly useful in scenarios where the raw output from a model needs to be streamlined or formatted for specific use cases, such as generating reports, feeding into other components of an application, or displaying information to users in a readable format.
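
The two cases can be mimicked in a few lines of plain Python. This is only a sketch of the behaviour described above, not LangChain's implementation; `FakeChatMessage` and `to_text` are hypothetical names invented for the illustration.

```python
from dataclasses import dataclass

@dataclass
class FakeChatMessage:
    """Hypothetical stand-in for a chat model message."""
    content: str

def to_text(model_output):
    """Return strings unchanged; for message-like objects return .content."""
    if isinstance(model_output, str):
        return model_output
    return model_output.content

print(to_text("plain LLM output"))
print(to_text(FakeChatMessage(content="chat model output")))
```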

### Code

### `HuggingFacePipeline`  
See [here](https://python.langchain.com/v0.2/docs/integrations/llms/huggingface_pipelines/) for Hugging Face Local Pipelines.

In LangChain, Hugging Face models can be run locally through the `HuggingFacePipeline` class.    
This class is a wrapper around a `transformers` pipeline, and the resulting object can be called from LangChain code.

The wrapper can be used in [two ways](https://python.langchain.com/v0.2/docs/integrations/llms/huggingface_pipelines/), as shown below:

In [None]:
"""
# A. Using from_model_id method
hf = HuggingFacePipeline.from_model_id(
                                        model_id="gpt2",
                                        task="text-generation",
                                        pipeline_kwargs={"max_new_tokens": 10},
                                      )

# B. OR,using pipeline:

model_id = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=10)
hf = HuggingFacePipeline(pipeline=pipe)

"""

In [19]:
%%time

# 6.0
from langchain_community.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

# 6.0.1
from transformers import pipeline

CPU times: user 158 ms, sys: 19.9 ms, total: 178 ms
Wall time: 289 ms


In [21]:
# 6.0.2 Create Huggingface pipeline:
text_generation_pipeline = pipeline(
                                    model=model,
                                    tokenizer=tokenizer,
                                    task="text-generation",
                                    temperature=0.2,
                                    do_sample=True,
                                    repetition_penalty=1.1,
                                    return_full_text=True,
                                    max_new_tokens=400,
                                  )

In [22]:
# 6.0.3 Wrap the pipeline with HuggingFacePipeline
#       to interact with langchain code:

llm = HuggingFacePipeline(pipeline=text_generation_pipeline)



## Design prompt

Below is the template:     

`<|system|>` marks the system message. `{context}` is a placeholder for the retrieved context; when it is not empty, it becomes part of the system message. `<|user|>` marks the user's message, and `{question}` is a placeholder for the user's question.  

Note that `<s>` and `</s>` are the special tokens for beginning of sequence (`BOS`) and end of sequence (`EOS`).     

When the `PromptTemplate` is invoked, the values of `{context}` and `{question}` are filled in and the complete prompt is output.

In [18]:
# 6.0.4
# Prompt template for zephyr:
# The template is a multi-line string: it opens
# and closes with triple double quotes.

prompt_template = """
<|system|>
Answer the question based on your knowledge. Use the following context to help:

{context}

</s>
<|user|>
{question}
</s>
<|assistant|>

 """

Prompt for zephyr has this form:<br>
`<|system|>` <br>
You are a friendly chatbot who always responds in the style of a pirate.`</s>`     
`<|user|>`    
How many helicopters can a human eat in one sitting?`</s>`   
`<|assistant|>`    
Ah, me hearty matey! But yer question be a puzzler! A human cannot eat a helicopter in one sitting, as helicopters are not edible. They be made of metal, plastic, and other materials, not food!<br>

In [19]:
# 6.0.5 This template will receive two inputs:
#        'context' and 'question'.
#       Inputs are received when llm chain is invoked.

# PromptTemplate is a class. Its instance receives
# 'context' and 'question' and substitutes them into the
# supplied template=prompt_template


prompt = PromptTemplate(
                        input_variables=["context", "question"],
                        template=prompt_template,  # Placeholders in this string get filled
                        )
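
For simple named placeholders like these, the substitution behaves much like Python's built-in `str.format`. The sketch below uses plain `str.format` as a stand-in (an assumption for illustration, not a call into LangChain), with a made-up context string.

```python
# Same layout as the Zephyr prompt template used in this notebook.
template = """<|system|>
Answer the question based on your knowledge. Use the following context to help:

{context}

</s>
<|user|>
{question}
</s>
<|assistant|>
"""

# Made-up context; in the RAG chain this comes from the retriever.
filled = template.format(
    context="Adapters in PEFT are small trainable modules such as LoRA.",
    question="How do you combine multiple adapters?",
)
print(filled)
```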

## LLM chain

In [20]:
# 6.0.6 When llm_chain is invoked, prompt gets into llm

llm_chain = prompt | llm | StrOutputParser()

Note: _You can also use `tokenizer.apply_chat_template` to convert a list of messages (as dicts: `{'role': 'user', 'content': '(...)'}`) into a string with the appropriate chat format._
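
To show what such a conversion produces, the hand-written helper below renders a list of role/content dicts in the `<|role|>` ... `</s>` layout shown earlier. It is an approximation written for illustration; `to_zephyr_prompt` is a hypothetical name, and the authoritative format is the chat template stored with the model's tokenizer.

```python
def to_zephyr_prompt(messages):
    """Render role/content dicts as a Zephyr-style prompt string."""
    parts = [f"<|{m['role']}|>\n{m['content']}</s>\n" for m in messages]
    # The trailing assistant tag asks the model to write the next turn.
    return "".join(parts) + "<|assistant|>\n"

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "How do you combine multiple adapters?"},
]
chat_prompt = to_zephyr_prompt(messages)
print(chat_prompt)
```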


## RAG chain

We need a way to retrieve the relevant documents given an unstructured query. For that, we'll use the `as_retriever` method with `db` as the backbone:
- `search_type="similarity"` means we want to perform similarity search between the query and documents
- `search_kwargs={'k': 4}` instructs the retriever to return top 4 results.      

See this Stack Overflow [reference](https://stackoverflow.com/a/78278938/3282777).

In [None]:

# 7.0 Instantiate retriever:

retriever = db.as_retriever(
                            search_type="similarity",
                            search_kwargs={'k': 4}
                           )

Finally, we need to combine the `llm_chain` with the `retriever` to create a RAG chain. We pass the original `question` through to the final generation step, as well as the retrieved `context` docs:

For more about retrievers, see [here](https://python.langchain.com/v0.1/docs/modules/data_connection/retrievers/vectorstore/#specifying-top-k) and [here](https://stackoverflow.com/a/78278938/3282777).

In [None]:
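# Just to demonstrate what the retriever returns
# when it is invoked: a list of the k most similar
# Document chunks (not raw vectors).
# These documents fill the {context} placeholder.

# Same question as the one used in the comparison later on:
question = "How do you combine multiple adapters?"
retriever.invoke(question)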

In [22]:
%%time

from langchain_core.runnables import RunnablePassthrough

#retriever = db.as_retriever()

rag_chain = (
             {"context": retriever, "question": RunnablePassthrough()}
             | llm_chain
             )


CPU times: user 586 µs, sys: 0 ns, total: 586 µs
Wall time: 592 µs


## Compare the results

Let's see the difference RAG makes in generating answers to the library-specific questions.

In [23]:
question = "How do you combine multiple adapters?"

First, let's see what kind of answer we can get with just the model itself, no context added:

In [25]:
%%time

out = llm_chain.invoke(
                        {"context":"",
                          "question": question
                        }
                      )

CPU times: user 37.3 s, sys: 266 ms, total: 37.5 s
Wall time: 42.8 s


In [27]:
print(out)


<|system|>
Answer the question based on your knowledge. Use the following context to help:



</s>
<|user|>
How do you combine multiple adapters?
</s>
<|assistant|>

  To combine multiple adapters, you need to ensure that they are compatible with each other and the devices you want to connect. Here's a general guide:

1. Determine which adapter(s) you need to convert from one connection type to another. For example, if you want to connect an HDMI device to a VGA monitor, you may need an HDMI-to-VGA adapter and a VGA-to-DVI adapter (if your monitor only supports DVI input).

2. Connect the first adapter to the source device. For instance, plug the HDMI-to-VGA adapter into your laptop's HDMI port.

3. Connect the second adapter to the output device. In this case, connect the VGA-to-DVI adapter to the monitor's VGA input.

4. If necessary, connect any additional adapters in between. For example, if your monitor only supports DVI input but has a DisplayPort output, you may need a DisplayP

As you can see, the model interpreted the question as one about physical computer adapters, while in the context of PEFT, "adapters" refers to LoRA adapters.
Let's see if adding context from GitHub issues helps the model give a more relevant answer:

See [this Reddit comment](https://www.reddit.com/r/LangChain/comments/1c7qwsw/comment/l0i40gg/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button) for background.
`rag_chain.invoke(question)` passes the question to the dictionary `{"context": retriever, "question": RunnablePassthrough()}`. The question first goes into `retriever`, and `context` gets the retrieved documents. The question also goes into `RunnablePassthrough()`, whose output is the question itself. The dictionary is now fully populated and is fed to `llm_chain`.

This is what happens:
>Step 1: Execute `rag_chain.invoke(question)`.

>Step 2: The question is passed to every runnable in the dictionary, which is roughly equivalent to: <br>
`{"context": retriever.invoke(question),` <br>
`"question": RunnablePassthrough().invoke(question)}`

 The output of `retriever.invoke(question)` is a list of documents, as demonstrated earlier, and the output of `RunnablePassthrough().invoke(question)` is the `question` itself: `RunnablePassthrough()` acts as an identity function.     

>Step 3: The resulting dictionary is piped into `llm_chain`, where the prompt template is first filled in and the completed prompt is then fed into `llm`.
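
The mechanics can be imitated in plain Python. The sketch below is a toy model of the pattern, not LangChain's implementation: `Step` and `fan_out` are hypothetical names, and the retriever and LLM chain are replaced by trivial lambdas so the data flow is easy to follow.

```python
class Step:
    """Minimal stand-in for a runnable: wraps a function and supports `|`."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # a | b builds a new step that runs a, then feeds its output to b
        return Step(lambda value: other.invoke(self.invoke(value)))

def fan_out(mapping):
    """Send the same input to every step in mapping and collect a dict."""
    return Step(lambda value: {key: step.invoke(value)
                               for key, step in mapping.items()})

fake_retriever = Step(lambda q: [f"doc about: {q}"])  # hypothetical retriever
passthrough = Step(lambda q: q)                        # identity function
fake_llm_chain = Step(
    lambda d: f"context={d['context']} question={d['question']}"
)

toy_rag_chain = fan_out({"context": fake_retriever,
                         "question": passthrough}) | fake_llm_chain
result = toy_rag_chain.invoke("How do you combine multiple adapters?")
print(result)
```

The same question reaches both entries of the dictionary, and only the assembled dictionary is handed to the final step.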

In [29]:
%%time

out_rag = rag_chain.invoke(question)  # Process the question as above

CPU times: user 41.7 s, sys: 2.45 s, total: 44.2 s
Wall time: 56.9 s


Note that the `<|system|>` section now contains additional context retrieved from the vector store.

In [30]:
print(out_rag)


<|system|>
Answer the question based on your knowledge. Use the following context to help:

[Document(page_content='The documentation does not mention the need to perform a merge when switching adapters. Additionally, the methods add_adapter, set_adapter, and enable_adapters do not appear to work\r\n\r\nPlease provide clarification on how to correctly switch between adapters', metadata={'url': 'https://github.com/huggingface/peft/issues/1802', 'title': 'Issues when switching between multiple adapters LoRAs ', 'creator': 'JhonDan1999', 'created_at': '2024-05-26T19:18:13Z', 'comments': 7, 'state': 'open', 'labels': [], 'assignee': None, 'milestone': None, 'locked': False, 'number': 1802, 'is_pull_request': False}), Document(page_content="If you can provide any advice, I would greatly appreciate it. I suspect that this is either unsupported and/or not fully-implemented; or, it has something to do with the way I'm attaching adapters. I've tried a bunch of alternate configurations, but I'm

As we can see, the added context really helps the exact same model provide a much more relevant and informed answer to the library-specific question.

Notably, combining multiple adapters for inference has been added to the library, and one can find this information in the documentation, so for the next iteration of this RAG it may be worth including documentation embeddings.

In [None]:
######## DONE ########