# Simple RAG for GitHub Issues using HuggingFace Zephyr and LangChain

In this example, we will build a RAG (Retrieval-Augmented Generation) pipeline for a project's GitHub issues using the [`HuggingFaceH4/zephyr-7b-beta`](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) model and LangChain.

## RAG introduction

RAG is a popular approach to a common problem: a powerful LLM may be unaware of specific content because it was not in its training data, or it may hallucinate even when it has seen that content before. Such content may be proprietary, sensitive, or, as in this example, recent and frequently updated.

If our data is static and does not change regularly, we may consider fine-tuning a large model. In many cases, however, fine-tuning can be costly, and, when done repeatedly (e.g., to address data drift), leads to "model shift".

RAG does not require model fine-tuning. Instead, RAG works by providing an LLM with additional context that is retrieved from relevant data so that it can generate a better-informed response.

Note that:
* The external data is converted into embedding vectors with a separate embedding model, and the vectors are kept in a database. Embedding models are typically small, so updating the embedding vectors on a regular basis is faster, cheaper, and easier than fine-tuning a model.
* The fact that fine-tuning is not required gives us the freedom to swap our LLM for a more powerful one when it becomes available, or switch to a smaller distilled version for faster inference.

## Setup

In [None]:
!pip install -qU torch transformers accelerate bitsandbytes sentence-transformers faiss-gpu

In [None]:
# If running in Google Colab, you may need to run this cell so that a UTF-8 locale is used when installing LangChain
import locale

locale.getpreferredencoding = lambda: "UTF-8"

In [None]:
!pip install -qU langchain langchain-community

## Prepare the data

In this example, we will load all of the issues (both open and closed) from the [HuggingFace PEFT GitHub repository](https://github.com/huggingface/peft).

In [None]:
from google.colab import userdata

ACCESS_TOKEN = userdata.get('GITHUB_ACCESS_TOKEN')

By default, pull requests are treated as issues as well. In this example, we will exclude them from the data by setting `include_prs = False`. We will also set `state = "all"` to load both open and closed issues.

In [None]:
from langchain_community.document_loaders import GitHubIssuesLoader

loader = GitHubIssuesLoader(
    repo='huggingface/peft',
    access_token=ACCESS_TOKEN,
    include_prs=False,
    state='all'
)

docs = loader.load()

The content of individual GitHub issues may be longer than what an embedding model can take as input. We need to chunk the documents into appropriately sized pieces.

The most common and straightforward approach to chunking is to define a fixed chunk size and decide whether consecutive chunks should overlap.

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=30
)

chunked_docs = splitter.split_documents(docs)
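
To make fixed-size chunking with overlap concrete, here is a minimal pure-Python sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and lines); the helper name `chunk_text` is hypothetical, and the sizes are shrunk so the overlap is easy to see.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
for c in chunks:
    print(c)
```

Each chunk starts with the last three characters of the previous one, so text that straddles a boundary still appears intact in at least one chunk when it is short enough.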

## Create the embeddings and retriever

To create document chunk embeddings, we will use the `HuggingFaceEmbeddings` class with the [`BAAI/bge-base-en-v1.5`](https://huggingface.co/BAAI/bge-base-en-v1.5) embedding model.

To create the vector database, we will use `FAISS`, a library developed by Facebook AI, which offers efficient similarity search and clustering of dense vectors.

In [None]:
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings

db = FAISS.from_documents(
    chunked_docs,
    HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5")
)

To retrieve the documents given an unstructured query, we will use the `as_retriever` method, with the `db` database as the backbone.

In [None]:
retriever = db.as_retriever(
    search_type='similarity',
    search_kwargs={'k':4}
)

* `search_type='similarity'` - perform similarity search between the query and documents
* `search_kwargs={'k':4}` - instruct the retriever to return top 4 results
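
Conceptually, a similarity search with a given `k` embeds the query, scores it against every stored vector, and keeps the best `k`. The sketch below does this by brute force with cosine similarity over made-up 3-dimensional vectors. The function names are hypothetical, and cosine similarity is used only for illustration: the metric a real FAISS index uses depends on how the index was built (for example, it may use L2 distance).

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: list[float], vectors: list[list[float]], k: int) -> list[int]:
    """Return indices of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy "embeddings"; real ones have hundreds of dimensions.
stored = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.7, 0.7, 0.0],
]
query_vec = [1.0, 0.2, 0.0]
best = top_k(query_vec, stored, k=2)
print(best)
```

A vector database avoids this exhaustive scan on large collections, but the contract is the same: a query vector goes in, the indices of the nearest stored vectors come out.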

Now the vector database and retriever are set up. Next, we need to set up the model.

## Load quantized model

To reduce memory use and make inference faster on limited hardware, we will load a 4-bit quantized version of the model:

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

model_name = 'HuggingFaceH4/zephyr-7b-beta'

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config
)
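
A back-of-the-envelope calculation shows what 4-bit loading means for memory. It assumes roughly 7 billion parameters (a round number taken from the "7b" in the model name) and counts the weights only; activations, the KV cache, and quantization metadata add overhead that this sketch ignores.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 7e9  # assumption: rough parameter count implied by "7b"

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
nf4_gb = weight_memory_gb(NUM_PARAMS, 4)
print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f" 4-bit weights: ~{nf4_gb:.1f} GB")
```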

## Set up the LLM chain

Finally, we have everything we need to set up the LLM chain. This takes two steps:
1. Create a `text-generation` pipeline from the loaded model and its tokenizer.
2. Create a `prompt_template` that follows the prompt format the model expects. If we later swap in a different model checkpoint, the template must be updated to match that model's format.

In [None]:
from langchain_community.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from transformers import pipeline
from langchain_core.output_parsers import StrOutputParser

text_generation_pipeline = pipeline(
    model=model,
    tokenizer=tokenizer,
    task='text-generation',
    temperature=0.2,
    do_sample=True,
    repetition_penalty=1.1,
    return_full_text=True,
    max_new_tokens=400,
)

llm = HuggingFacePipeline(pipeline=text_generation_pipeline)

In [None]:
prompt_template = """
<|system|>
Answer the question based on your knowledge. Use the following context to help:

{context}

</s>
<|user|>
{question}
</s>
<|assistant|>

"""

prompt = Prompt_template(
    input_variables=['context', 'question'],
    template=prompt_template
)
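
To see what the model actually receives, the template can be filled in with plain `str.format`, which is roughly what the prompt object does with its two input variables. The snippet repeats the template under a separate, hypothetical name (`zephyr_template`) so that it runs on its own, and the context string is an invented placeholder rather than real retrieved data.

```python
# Same layout as the template above, repeated so the snippet is self-contained.
zephyr_template = """
<|system|>
Answer the question based on your knowledge. Use the following context to help:

{context}

</s>
<|user|>
{question}
</s>
<|assistant|>

"""

filled = zephyr_template.format(
    context="(retrieved issue text would go here)",  # placeholder, not real data
    question="How do you combine multiple adapters?",
)
print(filled)
```

The prompt ends right after the `<|assistant|>` marker, so the model's continuation is its answer.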

In [None]:
# Set up the entire chain
llm_chain = prompt | llm | StrOutputParser()

Finally, we need to combine the `llm_chain` with the retriever to create a RAG chain. We will pass the original question through to the final generation step, as well as the retrieved context docs:

In [None]:
from langchain_core.runnables import RunnablePassthrough

rag_result = {
    'context': retriever,
    'question': RunnablePassthrough()
}

rag_chain = rag_result | llm_chain
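
The dictionary on the left of the `|` means: run every value on the same input, collect the results under the given keys, and hand that mapping to the next step. The plain-Python sketch below mimics only this data flow. `fake_retriever`, `passthrough`, and `fake_llm_chain` are hypothetical stand-ins for the real retriever, `RunnablePassthrough`, and `llm_chain`; no model is called.

```python
def fake_retriever(question: str) -> list[str]:
    # Stand-in for the vector store retriever: returns canned "documents".
    return [f"doc about: {question}"]


def passthrough(value):
    # Returns its input unchanged, like the passthrough step in the chain.
    return value


def fake_llm_chain(inputs: dict) -> str:
    # Stand-in for prompt | llm | parser: just echoes what it was given.
    return f"context={inputs['context']} question={inputs['question']}"


def run_rag(question: str) -> str:
    # Step 1: fan the single input out to every branch of the mapping.
    branches = {"context": fake_retriever, "question": passthrough}
    gathered = {key: fn(question) for key, fn in branches.items()}
    # Step 2: feed the gathered mapping into the downstream chain.
    return fake_llm_chain(gathered)


answer = run_rag("How do you combine multiple adapters?")
print(answer)
```

This is why `rag_chain` can be invoked with a bare question string, while `llm_chain` on its own needs a dictionary with both `context` and `question`.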

## Compare the results

In [None]:
question = 'How do you combine multiple adapters?'

First, let's see what kind of answer we can get with the model itself without additional context provided:

In [None]:
llm_chain.invoke({'context': "", 'question': question})

Now let's see if adding context from GitHub issues helps the model give a more relevant answer:

In [None]:
rag_chain.invoke(question)