# Demonstrating the RAG Framework with LangChain and Local Language Models

This notebook demonstrates Retrieval-Augmented Generation (RAG) with LangChain. It runs a local LLM (Llama3) through Ollama and embeds text with a Sentence Transformers model from Hugging Face. The goal is to answer questions about web documents by retrieving the relevant passages and passing them to the model as context.

## Key Components and Classes

### 1. Loader
The loader fetches the source documents and turns them into text the rest of the pipeline can work with. In this notebook that role is played by `WebBaseLoader`, which downloads web pages; a text splitter then cuts the pages into chunks.

### 2. Retriever
The retriever performs the "retrieve" step of RAG. Given a question, it embeds the query with the sentence embedding model and returns the stored chunks whose vectors are most similar. Here it is created with `vector_store.as_retriever(...)`.

### 3. VectorStore
The vector store holds the embedding of every chunk alongside its text and supports similarity search over them. This notebook uses FAISS (`FAISS.from_documents`), which embeds the chunks with the supplied embedding model at indexing time.

### 4. Generator
The generator is where the augmented generation takes place. The retrieved chunks are inserted into a prompt together with the question, and the local LLM (Llama3 served by Ollama) produces the final answer from that prompt.
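To make the division of labor concrete, the sketch below maps the four roles onto plain Python functions. It is a toy under stated assumptions: `load_documents`, `build_store`, `retrieve`, and `generate` are hypothetical names, relevance is scored by shared words rather than by sentence embeddings, and the generator is a stub that stands in for the local LLM.

```python
# Toy illustration of the four RAG roles; every name here is hypothetical.

def load_documents():
    # Loader: returns raw texts (hard-coded here instead of fetched from the web).
    return [
        "The match girl lit a match to warm her hands.",
        "The wolf met Little Red Riding Hood in the wood.",
    ]

def tokenize(text):
    # Lowercase and strip simple punctuation so words compare cleanly.
    return {word.strip(".,?!").lower() for word in text.split()}

def build_store(documents):
    # VectorStore: keeps each text next to its (toy) representation.
    return [(doc, tokenize(doc)) for doc in documents]

def retrieve(store, question, k=1):
    # Retriever: ranks stored texts by the number of words shared with the question.
    query = tokenize(question)
    ranked = sorted(store, key=lambda item: len(item[1] & query), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def generate(question, context):
    # Generator: a stub that only assembles the prompt a real LLM would receive.
    return f"Context: {' '.join(context)}\nQuestion: {question}"

store = build_store(load_documents())
top_docs = retrieve(store, "Who met the wolf?")
prompt_text = generate("Who met the wolf?", top_docs)
print(prompt_text)
```

The rest of the notebook replaces each stub with a real component: a web loader, an embedding-backed vector store, and Llama3.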

## LangChain Packages and Imports

The notebook imports from several LangChain packages:

- `langchain_community`: integrations for the local LLM (`Ollama`), the embedding model (`HuggingFaceEmbeddings`), the document loader (`WebBaseLoader`), and the vector store (`FAISS`).
- `langchain_core`: the prompt template (`ChatPromptTemplate`), the output parser (`StrOutputParser`), and `RunnablePassthrough` for composing the chain.
- `langchain_text_splitters`: `RecursiveCharacterTextSplitter`, which cuts documents into chunks.

Together these packages cover every stage of a RAG pipeline built around a locally served model.


# Installing and Configuring the Ollama Tool on Colab

This section outlines the steps required to install the Ollama tool on a machine running Google Colab notebooks. This setup is essential for integrating local language models, specifically Llama3, into our environment for enhanced NLP tasks.

In [None]:
# Install Ollama on the Colab machine, start the Ollama server in the background, and pull the llama3 model.
!apt-get install lshw
!curl https://ollama.ai/install.sh | sh
!nohup ollama serve &
!ollama pull llama3

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  pci.ids usb.ids
The following NEW packages will be installed:
  lshw pci.ids usb.ids
0 upgraded, 3 newly installed, 0 to remove and 45 not upgraded.
Need to get 790 kB of archives.
After this operation, 2,988 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/main amd64 lshw amd64 02.19.git.2021.06.19.996aaad9c7-2build1 [321 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/main amd64 pci.ids all 0.0~2022.01.22-1 [251 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy/main amd64 usb.ids all 2022.04.02-1 [219 kB]
Fetched 790 kB in 1s (1,138 kB/s)
Selecting previously unselected package lshw.
(Reading database ... 121918 files and directories currently installed.)
Preparing to unpack .../lshw_02.19.git.2021.06.19.996aaad9c7-2build1_amd64.deb ...
Unpacking lshw (02.19.git.2021.06.19.996aaad9c7-2build1) ...


## Third-Party Python Library Requirements

In [None]:
!pip install -qU langchain langchain_community faiss-cpu sentence-transformers


## Python Imports

In [None]:
from langchain_community.llms.ollama import Ollama
from langchain_community.embeddings.ollama import OllamaEmbeddings
from langchain_community.embeddings.huggingface import HuggingFaceEmbeddings
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain import hub

from pprint import pprint
import bs4

# RAG

## Initialize the Ollama tool with the Llama3 model
* `temperature=0.7` controls randomness in response generation, so responses are not too deterministic but still sensible.
* `verbose=True` enables detailed logging, useful for debugging and understanding model behavior.
* `top_p=1` makes the model consider the full probability mass (100%) of candidate next tokens at each step, which effectively disables top-p (nucleus) sampling.
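The effect of these two sampling settings can be illustrated with a few lines of plain Python. This is a sketch of how samplers commonly apply temperature and top-p, not Ollama's actual implementation; the function names and the logits are made up for the example.

```python
import math

def softmax_with_temperature(logits, temperature):
    # Dividing logits by the temperature flattens (T > 1) or sharpens (T < 1) the distribution.
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

def top_p_filter(probs, top_p):
    # Keep the smallest set of most-probable tokens whose cumulative probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    return sorted(kept)

logits = [2.0, 1.0, 0.5, 0.1]  # made-up scores for four candidate tokens
probs = softmax_with_temperature(logits, temperature=0.7)

all_kept = top_p_filter(probs, top_p=1.0)   # with top_p=1 no candidate is cut
some_kept = top_p_filter(probs, top_p=0.5)  # a smaller top_p keeps only the head
print(all_kept, some_kept)
```

With `top_p=1.0` every candidate token survives the filter, which is why the setting is described as disabling top-p sampling.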

In [None]:
llm = Ollama(model="llama3", temperature=0.7, verbose=True, top_p=1)

# Invoke the model to answer a simple question.
response = llm.invoke("What comes after Monday?")

# print the response to see what the model has generated.
print(response)

Tuesday!


## Loading Web Documents

In [None]:
# Define a list of URLs that point to the web pages we want to process.
# In this case, the URLs link to stories:
# a) "The Little Match Girl" by Hans Christian Andersen and
# b) "Little Red Riding Hood" from American Literature's website.
urls = [
    "https://americanliterature.com/author/hans-christian-andersen/short-story/the-little-match-girl",
    "https://americanliterature.com/childrens-stories/little-red-riding-hood"
]

# Create an instance of the WebBaseLoader class.
# This class is used to load textual content from the web pages specified in the `urls` list.
loader = WebBaseLoader(web_paths=urls)

# Load the web pages and store the content in the `web_pages` variable.
# This method fetches the HTML content of each URL and typically extracts the text.
web_pages = loader.load()

# Initialize a RecursiveCharacterTextSplitter instance.
# This component is configured to split the text into chunks of 1000 characters with an overlap of 200 characters.
# Chunk size defines how many characters each text chunk will contain.
# Chunk overlap allows for a portion of the text to be repeated in subsequent chunks, which can be useful
# for maintaining context in certain NLP tasks.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

# Apply the text splitter to the loaded web pages.
# This will divide the text of each page into smaller sections based on the specified chunk size and overlap.
# The result is stored in `splits`, which will be a list of text chunks ready for further processing.
splits = text_splitter.split_documents(web_pages)
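The interplay of chunk size and chunk overlap is easier to see on a small string. The function below is a simplified sliding-window splitter written for illustration; `split_with_overlap` is a hypothetical name, and unlike `RecursiveCharacterTextSplitter` it does not try to break at separators such as paragraph or line breaks.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    # Slide a fixed-size window over the text; each step advances by chunk_size - chunk_overlap.
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last four characters of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one neighbor when the overlap is large enough.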


## Indexing in a VectorStore

In [None]:
# Initialize the HuggingFaceEmbeddings class with a specific transformer model.
# The chosen model, 'sentence-transformers/paraphrase-MiniLM-L6-v2', was trained on paraphrase data,
# so sentences with similar meaning are mapped to nearby embeddings.

# `model_kwargs` is used to specify additional settings for the model loading process.
# Here, {'device': 'cpu'} forces the model to run on the CPU instead of a GPU.

# `encode_kwargs` contains parameters that affect the encoding process.
# Setting {'normalize_embeddings': True} scales each embedding vector to unit length.
# With unit-length vectors, the dot product equals cosine similarity and Euclidean distance ranks
# neighbors in the same order, so comparisons depend only on the angle between vectors, not their magnitude.

device = "cpu"
embeddings = HuggingFaceEmbeddings(
    model_name='sentence-transformers/paraphrase-MiniLM-L6-v2',
    model_kwargs={'device':device},
    encode_kwargs={'normalize_embeddings': True}
)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
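The effect of `normalize_embeddings` can be checked by hand on small vectors. The snippet below is a standalone illustration with made-up two-dimensional vectors; the helper names are not part of any library.

```python
import math

def normalize(vector):
    # Scale the vector to unit length (L2 norm of 1).
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]
b = [6.0, 8.0]   # same direction as a, twice the magnitude
c = [4.0, -3.0]  # perpendicular to a

unit_a, unit_b, unit_c = normalize(a), normalize(b), normalize(c)

# For unit vectors, the plain dot product already is the cosine similarity.
print(dot(unit_a, unit_b), cosine_similarity(a, b))
print(dot(unit_a, unit_c), cosine_similarity(a, c))
```

After normalization, `a` and `b` become the same vector even though their magnitudes differ, so only direction influences the comparison.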



In [None]:
# Initialize a FAISS vector store from the previously generated document splits and embeddings.
# `FAISS.from_documents` creates a vector store by processing a collection of documents,
# each represented by a segment of text from the `splits` list. The text segments are converted into
# vector embeddings using the `embeddings` object.

# `documents`: This parameter takes the list of document segments (`splits`) that need to be embedded.
# These segments were previously created by splitting larger documents into smaller, manageable chunks.

# `embedding`: This parameter receives the `embeddings` object, which specifies the model and settings
# used to generate embeddings. The embeddings are created using the 'sentence-transformers/paraphrase-MiniLM-L6-v2'
# model, which provides high-quality semantic embeddings for the text segments.

# The result, `vector_store`, is an object that holds a collection of vector embeddings corresponding to the
# input documents. This vector store can be used in various applications like semantic search, document similarity,
# and other NLP tasks that require comparing the content of different texts based on their vectorized forms.

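# Note: `delete_collection` below is a Chroma method; this block is kept commented out
# and is not used with the FAISS store created in this cell.
# try: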
#   if vector_store :
#     pprint("Cleaning the existing db")
#     vector_store.delete_collection()
# except NameError:
#     print("WARN: vector_store not defined")

vector_store = FAISS.from_documents(documents=splits, embedding=embeddings)
pprint(f"Vector store has {len(vector_store.index_to_docstore_id)} chunks, across {len(web_pages)} web pages.")

'Vector store has 23 chunks, across 2 web pages.'
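Conceptually, an exact (non-approximate) vector index answers a query by comparing it with every stored vector and returning the closest ones. The class below is a minimal sketch of that idea, not FAISS itself: `TinyVectorStore` is a hypothetical name and the two-dimensional vectors are made up, whereas real sentence embeddings have a few hundred dimensions.

```python
import math

class TinyVectorStore:
    """Hypothetical in-memory store that searches by exhaustive comparison."""

    def __init__(self):
        self.vectors = []
        self.texts = []

    def add(self, vector, text):
        self.vectors.append(vector)
        self.texts.append(text)

    def search(self, query, k):
        # Compute the Euclidean distance from the query to every stored vector,
        # then return the k closest texts (smallest distance first).
        scored = [
            (math.dist(query, vector), text)
            for vector, text in zip(self.vectors, self.texts)
        ]
        scored.sort(key=lambda pair: pair[0])
        return [text for _, text in scored[:k]]

store = TinyVectorStore()
store.add([1.0, 0.0], "chunk about matches")
store.add([0.0, 1.0], "chunk about wolves")
store.add([0.9, 0.1], "chunk about a warm flame")

nearest = store.search([1.0, 0.05], k=2)
print(nearest)
```

The `k` argument plays the same role as the `k` passed to the retriever later in the notebook: it caps how many chunks are returned per query.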


## Question Answering with Context Retrieval and Language Model Integration

This section builds a question answering system that answers specific questions from the loaded web pages. It combines document retrieval, context assembly, prompt formulation, and response generation into a single processing chain.

### Key Components and Workflow

1. **Question Definition**:
   - A predefined list of questions is set up, focusing on specific information extraction related to the content from previously loaded web pages.

2. **Retrieval System Setup**:
   - A vector store, loaded with document embeddings, is transformed into a retriever capable of fetching the top relevant text segments based on the semantic similarity to the input queries.

3. **Prompt Configuration**:
   - A templating system is used to format the input for the language model, ensuring that each question is paired with its relevant context in a structured manner conducive to comprehension by the model.

4. **Context Assembly Function**:
   - A function is crafted to assemble the textual content retrieved from the vector store into a coherent block of context. This context is critical for providing background information necessary for the language model to generate accurate answers.

5. **Processing Chain**:
   - The components are linked in a chain that orchestrates the flow from context retrieval through to the final response output. This chain includes dynamic retrieval of context, question passing, prompt formation, language model invocation, and parsing the output into a human-readable format.

6. **Execution and Output**:
   - The system iterates over each question, dynamically retrieves and assembles the context, formulates the query, invokes the language model, and outputs the model’s response. The results are then displayed, showing both the questions and their respective answers.

The result is a context-aware question answering system that answers from the indexed documents using a locally served model.
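The way these steps are linked with the `|` operator can be sketched without any library. The code below is an illustration only: `Step` and `fan_out` are hypothetical stand-ins for LangChain's runnables, and the retriever and language model are replaced by stubs that return fixed values. In LangChain a dict literal inside a chain is coerced into a parallel step; here that role is played explicitly by `fan_out`.

```python
class Step:
    """Hypothetical stand-in for a chainable component."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # `a | b` builds a new step that feeds a's output into b.
        return Step(lambda value: other.invoke(self.invoke(value)))

def fan_out(**branches):
    # Run every branch on the same input and collect the results in a dict.
    return Step(lambda value: {name: step.invoke(value) for name, step in branches.items()})

# Stub components; the real chain uses a retriever, a prompt template, and an LLM.
retrieve = Step(lambda question: ["It was terribly cold.", "She lit a match."])
join_chunks = Step(lambda chunks: "\n\n".join(chunks))
passthrough = Step(lambda value: value)
build_prompt = Step(lambda d: f"[CONTEXT]:\n{d['context']}\n\n[QUESTION]:\n{d['question']}")
fake_llm = Step(lambda prompt: f"(model reply to {len(prompt)} characters of prompt)")

inputs = fan_out(context=retrieve | join_chunks, question=passthrough)
chain = inputs | build_prompt | fake_llm

prompt_only = (inputs | build_prompt).invoke("What did she do?")
print(prompt_only)
print(chain.invoke("What did she do?"))
```

The question enters once and is used twice: one branch turns it into retrieved context, the other passes it through unchanged, and both meet again in the prompt.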


In [None]:
# Define a list of questions related to the content of the documents loaded and split earlier.
# These questions are intended to be answered by referencing the textual content from those documents.
questions = [
    "What did the little match girl see?",
    "What happened to the little red riding hood?"
]

# Convert the vector store into a retriever capable of fetching relevant document segments based on a query.
# `search_kwargs={"k": 10}` configures the retriever to return the top 10 most relevant document segments for a given query.
retriever = vector_store.as_retriever(search_kwargs={"k": 10})

# Define a chat prompt template that formats the context and question for the language model.
# This template organizes the input so that the context is clearly separated from the question,
# ensuring that the model can distinguish between the provided background information and what it needs to answer.
prompt = ChatPromptTemplate.from_template("""Answer the following question based only on the provided context:
[CONTEXT]:
{context}

[QUESTION]:
{question}""")

# Function to assemble the context from retrieved document chunks.
# It takes retrieved chunks, extracts the page content from each, and joins them with double newlines.
# This formatting helps maintain a clear separation between different pieces of context.
def assemble_context(chunks):
    context = "\n\n".join([chunk.page_content for chunk in chunks])
    return context

# Define a processing chain that orchestrates the flow from context retrieval to response generation.
# The chain is composed of several steps:
# 1. `retriever | assemble_context`: Retrieves document chunks and assembles them into a single context string.
# 2. `RunnablePassthrough()`: Passes the question as is.
# 3. `prompt | llm | StrOutputParser()`: Formats the prompt, invokes the language model, and parses the string output.
chain = (
    {"context": retriever | assemble_context, "question": RunnablePassthrough()}
    | prompt | llm | StrOutputParser()
)

# Iterate over each question, invoke the chain with the question to get an answer, and print the results.
# The process retrieves context relevant to each question, formats it with the question, and generates an answer.
for question in questions:
    ans = chain.invoke(question)
    pprint({"Q)": question, "A)": ans}, sort_dicts=False)


{'Q)': 'What did the little match girl see?',
 'A)': 'Based on the provided context, the little match girl saw beautiful '
       'things, including:\n'
       '\n'
       '1. A warm and bright flame that seemed like a candle as she held her '
       'hands over it.\n'
       '2. The lights of the Christmas tree rising higher and higher, which '
       'she then saw as stars in heaven.\n'
       '3. One of the stars falling down and forming a long trail of fire.\n'
       '\n'
       'She also saw a long trail of fire after one of the stars fell down.'}
{'Q)': 'What happened to the little red riding hood?',
 'A)': 'According to the provided context, Little Red Riding Hood took great '
       'care and gave her hand on it to her mother before going to visit her '
       "sick grandmother in the wood. When she arrived at the grandmother's "
       'house, a wolf met her and tried to entice her from the path. However, '
       'Little Red Riding Hood was on her guard and went straight for