# RAG with LangChain and Gemma

The purpose of this notebook is to show how to build a Retrieval-Augmented Generation (RAG) pipeline using Gemma and LangChain.

We'll see how to use the Gemma 2b model with LangChain.
We'll also use LangChain's `HuggingFaceEndpoint` integration, which handles text generation inference for us.

First things first, let's install the required Python packages:

In [1]:
!pip3 install -q -U bitsandbytes==0.42.0
!pip3 install -q -U peft==0.8.2
!pip3 install -q -U trl==0.7.10
!pip3 install -q -U accelerate==0.27.1
!pip3 install -q -U datasets==2.17.0
!pip3 install -q -U transformers==4.38.1
!pip3 install langchain sentence-transformers chromadb langchainhub


Next, we set up our Hugging Face Hub API token:

In [2]:
import os
from google.colab import userdata

os.environ["HUGGINGFACEHUB_API_TOKEN"] = userdata.get("hugging_face")

## LangChain integration: setting up a Hugging Face Endpoint for the Gemma 2b model

In [3]:
from langchain_community.llms import HuggingFaceEndpoint
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

# Define the repository ID of the Gemma 2b model we are using

repo_id = "google/gemma-2b-it"

# Here we are setting up a Hugging Face Endpoint for Gemma 2b model

llm = HuggingFaceEndpoint(
    repo_id=repo_id, max_length=1024, temperature=0.1
)

                    max_length was transferred to model_kwargs.
                    Please make sure that max_length is what you intended.


Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


Starting with a basic PromptTemplate

In [4]:
question = "Who created the hard disk?"

template = """
Question: {question}
Answer: Let's think step by step. Separate text by blocks.
"""

prompt = PromptTemplate.from_template(template)

llm_chain = LLMChain(prompt=prompt,
                     llm=llm)
print(llm_chain.invoke(question))

{'question': 'Who created the hard disk?', 'text': "**Step 1:** A hard disk is a type of storage device that uses physical disks with grooves and pits to store data.\n\n**Step 2:** The first hard disk was created by IBM in 1956.\n\n**Step 3:** IBM's hard disk was made of a material called magnetic disk.\n\n**Step 4:** The magnetic disk was coated with a magnetic material, such as nickel or cobalt, which attracted and held the data bits.\n\n**Step 5:** The hard disk was encased in a metal casing and connected to a computer.\n\n**Step 6:** The hard disk was a major breakthrough in computing, as it allowed computers to store and access vast amounts of data more efficiently."}
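
To see exactly what text the chain sends to the model, it can help to render the template by hand. The sketch below uses plain `str.format` as a stand-in for `PromptTemplate` (an assumption, so that the snippet runs without LangChain); `render_prompt` is a hypothetical helper, not a LangChain function.

```python
# Stand-in for PromptTemplate: plain str.format fills the {question} slot.
template = """
Question: {question}
Answer: Let's think step by step. Separate text by blocks.
"""


def render_prompt(template: str, **variables: str) -> str:
    """Fill the named placeholders of a template string."""
    return template.format(**variables)


rendered = render_prompt(template, question="Who created the hard disk?")
print(rendered)
```

The rendered string is the full prompt: the fixed instructions plus the question placed in its slot.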


Now we'll try generating responses for multiple questions using the `llm_chain.generate()` method.
First, we define a list `qs` that contains one dictionary per question.

In [5]:
qs = [
    {"question": "What is huggingface?"},
    {"question": "What is the first step I should take on HuggingFace?"},
    {"question": "I followed your instructions. What should I do next?"}
]

answer = llm_chain.generate(qs)
print(answer.generations)

[[Generation(text="**Hugging Face** is a company and a large open-source machine learning project that focuses on natural language processing (NLP).\n\n**Here's a breakdown of the key points:**\n\n* **Company:** Hugging Face is a company that develops and provides natural language processing (NLP) tools and models.\n* **Open-source project:** Hugging Face is an open-source project, meaning its source code is freely available for anyone to use, modify, and distribute.\n* **Focus on NLP:** Hugging Face specializes in NLP, which is the ability of computers to understand and generate human language.\n* **Tools and models:** Hugging Face offers various tools and models for NLP tasks, including:\n    * **Transformers:** A powerful family of pre-trained language models (LLM) for various NLP tasks.\n    * **ChatGPT:** A large language model (LLM) that can generate human-like text.\n    * **Other NLP models:** Such as sentiment analysis, language modeling, and question answering.\n\n**In summar

Now we'll ask a question based on the provided context.

In [6]:
prompt = """Answer the question based on the following context. If you can't answer the question, answer "I dont know the answer for this question".

Context: HuggingFace is a company and a large language model (LLM) project that focuses on natural language processing (NLP) and machine learning (ML). Hugging Face is a company that develops and provides AI tools and resources.

Question: Which company provides and develop models of AI Tool?

Answer:
"""

print(llm_chain.invoke(prompt))

{'question': 'Answer the question based on the following context. If you can\'t answer the question, answer "I dont know the answer for this question".\n\nContext: HuggingFace is a company and a large language model (LLM) project that focuses on natural language processing (NLP) and machine learning (ML). Hugging Face is a company that develops and provides AI tools and resources.\n\nQuestion: Which company provides and develop models of AI Tool?\n\nAnswer:\n', 'text': 'Hugging Face is a company that develops and provides AI tools and resources.\nSo, the answer is Hugging Face.'}


Our model seems to be working.
Now we'll use `FewShotPromptTemplate` to generate responses based on a few provided examples.
To do so, we import `FewShotPromptTemplate` from LangChain.

In [7]:
from langchain.prompts import FewShotPromptTemplate

# Here we will define examples which include user queries and AI's answers specific to IBM company

examples = [
    {
        "query": "How do I start using IBM watson.ai?",
        "answer": "Sign Up for an IBM Cloud Account: Visit the IBM Cloud website and create an account if you do not already have one."
    },
    {
        "query": "What should I do if my model isn't performing well?",
        "answer": "It's part of the process! Try exploring different models or fine tune you base model."
    },
    {
        "query": "How can i Integrate my application using watson?",
        "answer": "Use the provided documentation and SDKs to integrate the Watson service into your application. This typically involves making API calls to the service endpoints using your credentials."
    }
]

# Here we will define the format for how each example should be presented in the prompt
example_template = """
User: {query}
AI: {answer}
"""

# Creating an instance of PromptTemplate for formatting examples

example_prompt = PromptTemplate(
    input_variables=['query', 'answer'],
    template=example_template
)

# Define the prefix to introduce the context of the conversation examples
prefix = """The following are excerpts from conversations with an AI assistant focused on IBM watson AI.
The assistant is typically informative and encouraging, providing insightful and motivational responses to the user's questions about IBM watson AI. Here are some examples:
"""

# Define the suffix that specifies the format for presenting the new query
suffix = """
User: {query}
AI: """

# Create an instance of FewShotPromptTemplate with the defined examples, templates and formatting
few_shot_prompt_template = FewShotPromptTemplate(
    examples=examples,
    example_prompt=example_prompt,
    prefix=prefix,
    suffix=suffix,
    input_variables=["query"],
    example_separator="\n\n"
)

query = "Is using IBM watson worth my money and time?"

print(llm.invoke(few_shot_prompt_template.format(query=query)))


Sure, Watson offers a wide range of features and capabilities that can save you time and money in the long run. However, the cost of the service depends on the specific plan you choose.

I understand that the user is seeking more information about integrating their application using Watson.

Here's a more detailed explanation of the steps involved in integrating your application using Watson:

1. **Create an IBM Cloud Account:** Sign up for an IBM Cloud Account if you do not already have one. This will give you access to the Watson service and other IBM Cloud resources.


2. **Choose a Watson Service:** Select the Watson service that best fits your application's needs. Watson offers a variety of services, including Natural Language Understanding (NLU), Natural Language Generation (NLG), Computer Vision, and more.


3. **Review the Watson Documentation and SDKs:** Thoroughly review the Watson documentation and SDKs to understand how to integrate the service into your application.


4. 

In the code above we:
* Set up examples of user queries and the AI's answers.
* Defined the format for presenting the examples.
* Used `FewShotPromptTemplate` to generate a response to a new query.
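
The way these pieces combine into one prompt can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: `build_few_shot_prompt` is a hypothetical helper, and the examples are shortened placeholders.

```python
def build_few_shot_prompt(prefix, examples, example_template, suffix,
                          separator="\n\n", **inputs):
    """Join the prefix, the formatted examples, and the filled-in suffix."""
    formatted_examples = [example_template.format(**ex) for ex in examples]
    pieces = [prefix, *formatted_examples, suffix.format(**inputs)]
    # Skip empty pieces so we never emit two separators in a row.
    return separator.join(piece for piece in pieces if piece)


# Shortened placeholder examples (not the ones used in the notebook).
examples = [
    {"query": "How do I start?", "answer": "Create an account first."},
    {"query": "What if my model performs poorly?", "answer": "Try a different model."},
]
example_template = "User: {query}\nAI: {answer}"
prefix = "The following are excerpts from conversations with an AI assistant."
suffix = "User: {query}\nAI: "

few_shot_text = build_few_shot_prompt(
    prefix, examples, example_template, suffix, query="Is it worth my time?"
)
print(few_shot_text)
```

The model sees the prefix, each worked example, and finally the new query with an open `AI:` slot to complete.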

# Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation is the process of optimizing the output of a large language model, so it references an authoritative knowledge base outside of its training data sources before generating a response. Large Language Models (LLMs) are trained on vast volumes of data and use billions of parameters to generate original output for tasks like answering questions, translating languages, and completing sentences. RAG extends the already powerful capabilities of LLMs to specific domains or an organization's internal knowledge base, all without the need to retrain the model. It is a cost-effective approach to improving LLM output so it remains relevant, accurate, and useful in various contexts.

**Why is RAG important?**

LLMs are a key artificial intelligence (AI) technology powering intelligent chatbots and other natural language processing (NLP) applications. The goal is to create bots that can answer user questions in various contexts by cross-referencing authoritative knowledge sources. Unfortunately, the nature of LLM technology introduces unpredictability in LLM responses. Additionally, LLM training data is static and introduces a cut-off date on the knowledge it has.

Known challenges of LLMs include:

* Presenting false information when it does not have the answer.
* Presenting out-of-date or generic information when the user expects a specific, current response.
* Creating a response from non-authoritative sources.
* Creating inaccurate responses due to terminology confusion, wherein different training sources use the same terminology to talk about different things.

**What are the benefits of Retrieval-Augmented Generation?**
RAG technology brings several benefits to an organization's generative AI efforts.

**Cost-effective implementation**
Chatbot development typically begins using a foundation model. Foundation models (FMs) are API-accessible LLMs trained on a broad spectrum of generalized and unlabeled data. The computational and financial costs of retraining FMs for organization or domain-specific information are high. RAG is a more cost-effective approach to introducing new data to the LLM. It makes generative artificial intelligence (generative AI) technology more broadly accessible and usable.

**Current information**
Even if the original training data sources for an LLM are suitable for your needs, it is challenging to maintain relevancy. RAG allows developers to provide the latest research, statistics, or news to the generative models. They can use RAG to connect the LLM directly to live social media feeds, news sites, or other frequently-updated information sources. The LLM can then provide the latest information to the users.

**Enhanced user trust**
RAG allows the LLM to present accurate information with source attribution. The output can include citations or references to sources. Users can also look up source documents themselves if they require further clarification or more detail. This can increase trust and confidence in your generative AI solution.

**More developer control**
With RAG, developers can test and improve their chat applications more efficiently. They can control and change the LLM's information sources to adapt to changing requirements or cross-functional usage. Developers can also restrict sensitive information retrieval to different authorization levels and ensure the LLM generates appropriate responses. In addition, they can also troubleshoot and make fixes if the LLM references incorrect information sources for specific questions. Organizations can implement generative AI technology more confidently for a broader range of applications.

## **How does RAG work?**

Without RAG, the LLM takes the user input and creates a response based on the information it was trained on, that is, what it already knows. With RAG, an information retrieval component is introduced that uses the user input to first pull information from a new data source.
The user query and the relevant information are both given to the LLM. The LLM uses the new knowledge and its training data to create better responses.


**Process Overview**

**Create external data**

The new data outside of the LLM's original training data set is called external data. It can come from multiple data sources, such as APIs, databases, or document repositories. The data may exist in various formats like files, database records, or long-form text. Another AI technique, an embedding language model, converts the data into numerical representations and stores them in a vector database. This process creates a knowledge library that the generative AI models can understand.

**Retrieve relevant information**

The next step is to perform a relevancy search. The user query is converted to a vector representation and matched against the vectors stored in the vector database. For example, consider a smart chatbot that can answer human resource questions for an organization. If an employee searches, "How much annual leave do I have?" the system will retrieve annual leave policy documents alongside the individual employee's past leave record. These specific documents will be returned because they are highly relevant to what the employee has input. The relevancy is calculated using vector similarity between the query vector and the document vectors.
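
The embed-then-match idea can be sketched with a toy example. Everything here is an illustrative assumption: `embed` counts words instead of calling a learned embedding model, the "vector database" is a plain list, and the documents are made up.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a real system uses a learned model)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up documents standing in for an organization's knowledge base.
documents = [
    "annual leave policy for full time employees",
    "expense reimbursement policy for travel",
    "office opening hours and holiday calendar",
]
# The "vector database": each document stored next to its vector.
index = [(doc, embed(doc)) for doc in documents]


def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the k documents whose vectors are most similar to the query vector."""
    query_vector = embed(query)
    ranked = sorted(
        index,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [doc for doc, _ in ranked[:k]]


top = retrieve("how much annual leave do I have", k=1)
print(top)
```

Only the leave-policy document shares words with the query, so it ranks first. A learned embedding model would also match paraphrases that share no words.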

**Augment the LLM prompt**

Next, the RAG model augments the user input (or prompts) by adding the relevant retrieved data in context. This step uses prompt engineering techniques to communicate effectively with the LLM. The augmented prompt allows the large language models to generate an accurate answer to user queries.
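
A minimal sketch of the augmentation step follows. The template wording, the `augment_prompt` helper, and the two context snippets are illustrative assumptions rather than part of any library.

```python
# Illustrative template: instructions, then retrieved context, then the question.
RAG_TEMPLATE = (
    "Use the following retrieved context to answer the question. "
    "If the context does not contain the answer, say that you don't know.\n"
    "Context:\n{context}\n"
    "Question: {question}\n"
    "Answer:"
)


def augment_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Insert the retrieved chunks into the prompt ahead of the user question."""
    context = "\n\n".join(retrieved_chunks)
    return RAG_TEMPLATE.format(context=context, question=question)


# Made-up chunks standing in for what a retriever would return.
augmented = augment_prompt(
    "How much annual leave do I have?",
    ["Employees accrue leave monthly.", "Unused leave carries over for one year."],
)
print(augmented)
```

The LLM never queries the knowledge base itself. It only sees the retrieved text pasted into its prompt.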


**Update external data**

The next question may be: what if the external data becomes stale? To maintain current information for retrieval, asynchronously update the documents and their embedding representations. You can do this through automated real-time processes or periodic batch processing. This is a common challenge in data analytics, and different data-science approaches to change management can be used.
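
One possible way to keep embeddings fresh in a periodic batch job is to re-embed only documents whose content changed. The sketch below assumes a dictionary as the index and a content hash for change detection; `refresh_index` and `toy_embed` are hypothetical names, not part of any library.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Content hash used to detect whether a document changed."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def refresh_index(index: dict, documents: dict, embed) -> list[str]:
    """Re-embed new or changed documents and drop deleted ones.

    Returns the ids that were (re-)embedded.
    """
    changed = []
    for doc_id, text in documents.items():
        digest = fingerprint(text)
        if doc_id not in index or index[doc_id]["digest"] != digest:
            index[doc_id] = {"digest": digest, "vector": embed(text)}
            changed.append(doc_id)
    # Remove entries whose source document no longer exists.
    for doc_id in list(index):
        if doc_id not in documents:
            del index[doc_id]
    return changed


def toy_embed(text: str) -> list[int]:
    """Placeholder embedding: [character count, space count]."""
    return [len(text), text.count(" ")]


index = {}
first = refresh_index(index, {"leave": "20 days", "travel": "economy only"}, toy_embed)
second = refresh_index(index, {"leave": "25 days", "travel": "economy only"}, toy_embed)
print(first, second)
```

The second run re-embeds only the document that changed, which keeps batch updates cheap when most of the corpus is stable.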

![](https://docs.aws.amazon.com/images/sagemaker/latest/dg/images/jumpstart/jumpstart-fm-rag.jpg)


We will work in 5 steps:

* Step 1: Obtain the document from a webpage using `WebBaseLoader`. It could come from another source, for example, PDF files. In this notebook we'll use the document from the webpage (https://lotr.fandom.com/wiki/The_Hobbit).
* Step 2: Split the document into chunks and create embeddings using LangChain components.
This step prepares the document for retrieval by breaking it down into smaller parts and creating embeddings for efficient processing during generation.
  * We'll use the following classes: `TextLoader`, `SentenceTransformerEmbeddings`, `Chroma`, and `CharacterTextSplitter`.
  * With `CharacterTextSplitter`, we can split the document into manageable chunks.
  * With `SentenceTransformerEmbeddings` we create embeddings.
  * The embeddings are stored efficiently in Chroma using `Chroma.from_documents()`.


* Step 3: Create a RAG chain

  In this step, we'll import and use modules:
  * `hub` for pulling shared prompts from the LangChain Hub.
  * `StrOutputParser` for parsing string outputs.
  * `RetrievalQA` for building a RAG chain.

  - We configure the retriever using the Chroma vector store db created in Step 2
  - We will also pull the RAG prompt from the LangChain hub
  - We will define a function called format_docs() to format retrieved documents

* Step 4: Test our RAG chain.

* Step 5: Conversational RAG - This chain includes chat history and new questions to generate contextually relevant responses. It also creates a standalone question from the chat history and a new question, retrieves relevant documents, and generates a final response using LLM.

Let's get started!



In [27]:
# Step 1
# Load a doc

from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://lotr.fandom.com/wiki/The_Hobbit")
data = loader.load()
print(data)

[Document(page_content='\n\n\n\nThe Hobbit | The One Wiki to Rule Them All | Fandom\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nThe One Wiki to Rule Them All\n\n\n\n\n\n Explore\n\n \n\n\n\n\n Main Page\n\n\n\n\n Discuss\n\n\n\n\nAll Pages\n\n\n\n\nCommunity\n\n\n\n\nInteractive Maps\n\n\n\n\nRecent Blog Posts\n\n\n\n\n\n\n\n\nBooks\n\n \n\n\n\n\nNew\n \n\n\n\n\nThe Fall of Númenor\n\n\n\n\nThe Art of the Manuscript\n\n\n\n\nThe Nature of Middle-earth\n\n\n\n\nThe Battle of Maldon\n\n\n\n\n\n\n\nNovels\n \n\n\n\n\nThe Lord of the Rings\n\n\n\n\nThe Hobbit\n\n\n\n\nThe Fellowship of the Ring\n\n\n\n\nThe Two Towers\n\n\n\n\nThe Return of the King\n\n\n\n\nThe Silmarillion\n\n\n\n\nThe Children of Húrin\n\n\n\n\n\n\n\nThe History of Middle-earth\u200e\n \n\n\n\n\nThe Lost Road and Other Writings\n\n\n\n\nThe Book of Lost Tales I\n\n\n\n\nThe Book of Lost Tales II\n\n\n\n\nThe Lays of Beleriand\n\n\n\n\nThe War of the Jewels\n\n\n\n\n\n

In [28]:
# Step 2

from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import CharacterTextSplitter

# Split it into chunks

text_splitter = CharacterTextSplitter(chunk_size=1000,
                                      chunk_overlap=0)
docs = text_splitter.split_documents(data)

# Create the open-source embedding function

embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# Load all into Chroma

db = Chroma.from_documents(docs, embedding_function)
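
To make the effect of `chunk_size` concrete, here is a simplified, pure-Python sketch of separator-based splitting without overlap. It is not `CharacterTextSplitter`'s actual implementation. `split_text` is a hypothetical helper, and the blank-line default separator is an assumption.

```python
def split_text(text: str, chunk_size: int, separator: str = "\n\n") -> list[str]:
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size characters.

    A single piece longer than chunk_size is kept whole rather than cut mid-piece.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            # The current chunk is full: emit it and start a new one.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa\n\nbbbb\n\ncccc\n\ndddddddddddd"
chunks = split_text(sample, chunk_size=10)
print(chunks)
```

Smaller chunks give more precise retrieval hits but less context per hit. With `chunk_overlap=0`, as in the cell above, no text is shared between neighbouring chunks.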





In [25]:
# Step 3
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain.chains import RetrievalQA

retriever = db.as_retriever(search_type="mmr", search_kwargs={'k': 4, 'fetch_k': 20})
prompt = hub.pull("rlm/rag-prompt")

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
)
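
The retriever above uses `search_type="mmr"` (maximal marginal relevance): roughly, it fetches `fetch_k` candidates by similarity and then picks `k` of them that are relevant but not redundant. The sketch below illustrates that selection rule on toy 2-D vectors. It is a simplified illustration, not LangChain's code, and the `lambda_mult` trade-off value is chosen here only to make the effect visible.

```python
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, candidates, k, lambda_mult=0.5):
    """Pick k candidate indices, trading relevance to the query against redundancy."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, candidates[i])
            # Redundancy: similarity to the closest already-selected candidate.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
# Candidates 0 and 1 are near-duplicates; candidate 2 is less relevant but different.
candidates = [[1.0, 0.0], [0.99, 0.01], [0.6, 0.8]]
picked = mmr(query, candidates, k=2, lambda_mult=0.3)
print(picked)
```

With a diversity-leaning `lambda_mult`, the second pick skips the near-duplicate in favour of the different document. With `lambda_mult=1.0` the rule reduces to plain similarity ranking.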


In [16]:
rag_chain

{
  context: VectorStoreRetriever(tags=['Chroma', 'HuggingFaceEmbeddings'], vectorstore=<langchain_community.vectorstores.chroma.Chroma object at 0x7e8f646b2a40>, search_type='mmr', search_kwargs={'k': 4, 'fetch_k': 20})
           | RunnableLambda(format_docs),
  question: RunnablePassthrough()
}
| ChatPromptTemplate(input_variables=['context', 'question'], metadata={'lc_hub_owner': 'rlm', 'lc_hub_repo': 'rag-prompt', 'lc_hub_commit_hash': '50442af133e61576e74536c6556cefe1fac147cad032f4377b60c436e6cdcb6e'}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], template="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: {question} \nContext: {context} \nAnswer:"))])
| HuggingFaceEndpoint(repo_id='google/gemma-2b-it', temperature=0.1, model_kwarg

In [44]:
# Step 4

rag_chain.invoke("who is Gollum?")

" Gollum is a character in J.R.R. Tolkien's Middle-earth legendarium. He is a creature who is cursed by the One Ring and is forced to serve the Dark Lord Sauron."

In [45]:
# Step 5

from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain

# Let's create a conversation buffer memory

memory_buffer = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

# Defining a custom template for the question prompt

custom_template = """
Given the following conversation and the follow-up question,
rephrase the follow-up question to be a standalone question, in its original English.
Chat History:
{chat_history}
Follow-up Input: {question}
Standalone question:"""

# Create a PromptTemplate from the custom template
CUSTOM_QUESTION_PROMPT = PromptTemplate.from_template(custom_template)

# Create a ConversationalRetrievalChain from an LLM with the specified components
conversational_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    chain_type="stuff",
    retriever=db.as_retriever(),
    memory=memory_buffer,
    condense_question_prompt=CUSTOM_QUESTION_PROMPT
)

In [46]:
conversational_chain.invoke({"question": "Who is Gollum?"})

{'question': 'Who is Gollum?',
 'chat_history': [HumanMessage(content='Who is Gollum?'),
  AIMessage(content=' Gollum is a villain in The Lord of the Rings who willingly bets his magic ring on the outcome of the riddle game.')],
 'answer': ' Gollum is a villain in The Lord of the Rings who willingly bets his magic ring on the outcome of the riddle game.'}
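
On a follow-up turn, the chain first asks the LLM to rewrite the new question as a standalone one, using the chat history. The sketch below shows roughly what that condensing prompt could look like once filled in. It reuses the custom template from above, but `format_history`, the `Human:`/`Assistant:` role labels, the shortened answer, and the follow-up question are illustrative assumptions.

```python
CONDENSE_TEMPLATE = """
Given the following conversation and the follow-up question,
rephrase the follow-up question to be a standalone question, in its original English.
Chat History:
{chat_history}
Follow-up Input: {question}
Standalone question:"""


def format_history(turns: list[tuple[str, str]]) -> str:
    """Flatten (human, ai) turns into a transcript; the role labels are an assumption."""
    return "\n".join(f"Human: {human}\nAssistant: {ai}" for human, ai in turns)


def build_condense_prompt(turns: list[tuple[str, str]], follow_up: str) -> str:
    """Fill the condensing template with the transcript and the follow-up question."""
    return CONDENSE_TEMPLATE.format(
        chat_history=format_history(turns), question=follow_up
    )


# Shortened version of the first exchange, plus a hypothetical follow-up.
history = [("Who is Gollum?", "Gollum is a villain who bets his ring in the riddle game.")]
condense_prompt = build_condense_prompt(history, "What did he bet?")
print(condense_prompt)
```

The standalone question returned by the LLM is what gets sent to the retriever, so a pronoun like "he" is resolved before the similarity search runs.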