# Create a RAG system on an AI PC using Ollama

## Introduction  

This notebook demonstrates how to run LLM inference for a Retrieval-Augmented Generation (RAG) application using Ollama locally on an AI PC. It is optimized for Intel® Core™ Ultra processors, utilizing the combined capabilities of the CPU, GPU, and NPU for efficient AI workloads. 

### What is an AI PC?  

An AI PC is a next-generation computing platform equipped with a CPU, GPU, and NPU, each designed with specific AI acceleration capabilities.  

- **Fast Response (CPU)**  
  The central processing unit (CPU) is optimized for smaller, low-latency workloads, making it ideal for quick responses and general-purpose tasks.  

- **High Throughput (GPU)**  
  The graphics processing unit (GPU) excels at handling large-scale workloads that require high parallelism and throughput, making it suitable for tasks like deep learning and data processing.  

- **Power Efficiency (NPU)**  
  The neural processing unit (NPU) is designed for sustained, heavily-used AI workloads, delivering high efficiency and low power consumption for tasks like inference and machine learning.  

The AI PC represents a transformative shift in computing, enabling advanced AI applications like LLM-based RAG workflows to run seamlessly on local hardware. This innovation enhances everyday PC usage by delivering faster, more efficient AI experiences without relying on cloud resources.  

In this notebook, we’ll explore how to use the AI PC’s capabilities to perform LLM inference and integrate it into a RAG pipeline, showcasing the power of local AI acceleration for modern applications. 

**Retrieval-augmented generation (RAG)** is a technique for augmenting LLM knowledge with additional, often private or real-time, data. LLMs can reason about wide-ranging topics, but their knowledge is limited to the public data up to a specific point in time that they were trained on. If you want to build AI applications that can reason about private data or data introduced after a model’s cutoff date, you need to augment the knowledge of the model with the specific information it needs. The process of bringing the appropriate information and inserting it into the model prompt is known as Retrieval Augmented Generation (RAG).

## Run QA over Document

With the model available, we can build the RAG pipeline and then set up a chatbot interface using Streamlit.

A typical RAG application has two main components:

- **Indexing**: a pipeline for ingesting data from a source and indexing it. This usually happens offline.

- **Retrieval and generation**: the actual RAG chain, which takes the user query at run time and retrieves the relevant data from the index, then passes that to the model.

The most common full sequence from raw data to answer looks like:

**Indexing**

1. `Load`: First we need to load our data. We’ll use DocumentLoaders for this.
2. `Split`: Text splitters break large Documents into smaller chunks. This is useful both for indexing data and for passing it in to a model, since large chunks are harder to search over and won’t fit in a model’s finite context window.
3. `Store`: We need somewhere to store and index our splits, so that they can later be searched over. This is often done using a VectorStore and Embeddings model.

![Indexing pipeline](https://github.com/openvinotoolkit/openvino_notebooks/assets/91237924/dfed2ba3-0c3a-4e0e-a2a7-01638730486a)

**Retrieval and generation**

1. `Retrieve`: Given a user input, relevant splits are retrieved from storage using a Retriever.
2. `Generate`: An LLM produces an answer using a prompt that includes the question and the retrieved data.

![Retrieval and generation pipeline](https://github.com/openvinotoolkit/openvino_notebooks/assets/91237924/f0545ddc-c0cd-4569-8c86-9879fdab105a)
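
As a rough illustration of the steps above, the following sketch walks through split, store, retrieve, and prompt assembly using only the Python standard library. It is a toy under stated assumptions: word overlap stands in for embeddings, the final prompt is printed instead of being sent to a model, and all function names are hypothetical rather than part of LangChain.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split(text, chunk_size=60):
    """Split: greedily pack sentences into chunks of at most chunk_size characters."""
    chunks, current = [], ""
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if current and len(current) + 1 + len(sentence) > chunk_size:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks


def build_index(chunks):
    """Store: keep each chunk next to its bag-of-words representation."""
    return [(chunk, Counter(tokenize(chunk))) for chunk in chunks]


def retrieve(index, question, k=1):
    """Retrieve: rank chunks by how many question words they share (toy scoring)."""
    query = Counter(tokenize(question))
    ranked = sorted(index, key=lambda item: sum((item[1] & query).values()), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question, context_chunks):
    """Generate (first half): insert the retrieved chunks into the model prompt."""
    context = "\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"


# Load: a hard-coded string stands in for a loaded document.
raw_text = (
    "Ollama serves language models on the local machine. "
    "Chroma stores embeddings on disk. "
    "Streamlit renders a simple web interface."
)
question = "Where does Chroma keep embeddings?"

index = build_index(split(raw_text))
top_chunks = retrieve(index, question)
prompt = build_prompt(question, top_chunks)
print(prompt)
```

A real pipeline replaces the word-overlap score with embedding similarity and sends the prompt to an LLM, but the order of the steps stays the same.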


With LangChain, a RAG pipeline can be built by chaining these components together, for example through [`create_retrieval_chain`](https://python.langchain.com/docs/modules/chains/) or, as in this notebook, `RetrievalQA.from_chain_type`. The chain connects the following RAG components:

- [`Vector stores`](https://python.langchain.com/docs/modules/data_connection/vectorstores/)
- [`Retrievers`](https://python.langchain.com/docs/modules/data_connection/retrievers/)
- [`LLM`](https://python.langchain.com/docs/integrations/llms/)
- [`Embedding`](https://python.langchain.com/docs/integrations/text_embedding/)


In [None]:
import os
import time
import warnings

warnings.filterwarnings("ignore")

from langchain_community import document_loaders, embeddings, vectorstores, llms
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings
from langchain import chains, text_splitter, PromptTemplate

OLLAMA_BASE_URL = "http://localhost:11434"
VECTOR_DB_DIR = "vector_dbs"

### Document Loaders in RAG

* Document loaders read the source documents that will later be retrieved during the question answering process.
* A loader fetches raw content from a source (a file, a database, a web page) and converts it into document objects that hold the text together with metadata such as the source URL.
* Document loaders feed the rest of the pipeline: the loaded documents are split, embedded, and indexed, and the retriever then searches that index for the content most relevant to a given query.
* `WebBaseLoader` is a LangChain document loader designed to load documents from web pages.
* Use `WebBaseLoader` when the documents for retrieval are not stored locally but are located on the web. This is useful when the question answering system should draw on up-to-date information available on the internet.
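
To make the idea of a loader concrete, here is a minimal sketch that turns an HTML string into a document object with text and metadata. It is not how `WebBaseLoader` is implemented: `SimpleDocument` and `load_html` are hypothetical names, the HTML is a hard-coded sample instead of a fetched page, and only the standard library is used.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def load_html(html, source):
    """Return a one-element list of documents, mirroring the list a loader returns."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = "\n".join(parser.parts)
    return [SimpleDocument(page_content=text, metadata={"source": source})]


sample_html = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Sample title</h1><p>First paragraph.</p></body></html>"
)
docs = load_html(sample_html, source="https://example.com/sample")
print(docs[0].page_content)
print(docs[0].metadata)
```

Keeping the source URL in the metadata lets later stages report where a retrieved chunk came from.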




In [None]:
def load_document(url):
    print("Loading document from URL...")
    loader = document_loaders.WebBaseLoader(url)
    return loader.load()

### Text splitter

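* `RecursiveCharacterTextSplitter` splits text into smaller pieces by trying a list of separators in order (paragraph breaks, line breaks, spaces, then individual characters) until the chunks are small enough.
* Its `split_documents` function splits larger documents into smaller chunks for easier processing.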
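
The effect of the `chunk_size` and `chunk_overlap` parameters can be sketched with a plain sliding window over characters. This is a simplified assumption for illustration: the hypothetical `split_with_overlap` below ignores separators entirely, whereas `RecursiveCharacterTextSplitter` prefers to cut at natural boundaries.

```python
def split_with_overlap(text, chunk_size=10, overlap=3):
    """Cut text into fixed-size windows where consecutive windows share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


chunks = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, overlap=3)
for chunk in chunks:
    print(chunk)
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, at the cost of storing some text twice.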

In [None]:
def split_document(text, chunk_size=3000, overlap=200):
    print("Splitting document into chunks...")
    text_splitter_instance = text_splitter.RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=overlap)
    return text_splitter_instance.split_documents(text)

### Hugging Face embeddings
In Retrieval Augmented Generation (RAG), embeddings play a crucial role in the retrieval of relevant documents for a given query.

* In RAG, each document in the knowledge base is represented as a dense vector, also known as an embedding. These embeddings are typically generated by a transformer model.
* When a query is received, it is also converted into an embedding using the same transformer model. This ensures that the query and the documents are in the same vector space, making it possible to compare them.
* Retrieval: The retrieval step in RAG involves finding the documents whose embeddings are most similar to the query embedding. This is typically done using a nearest neighbor search.
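
The nearest neighbor search described above can be sketched as a brute-force scan with cosine similarity. The three-dimensional vectors below are made up for illustration (real embeddings have hundreds of dimensions), and the helper names are hypothetical; vector stores typically use faster approximate indexes.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbors(query_vec, doc_vecs, k=2):
    """Return the ids of the k documents most similar to the query (brute force)."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Made-up embeddings: doc_a and doc_b point in roughly the same direction as the query.
doc_vecs = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.8, 0.6, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
query_vec = [0.9, 0.1, 0.0]

top = nearest_neighbors(query_vec, doc_vecs, k=2)
print(top)
```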

#### Sentence transformers

* You can use a Sentence Transformer to generate embeddings for each document in your knowledge base. Since Sentence Transformers are designed to capture the semantic meaning of sentences, these embeddings should do a good job of representing the content of the documents.
* You can also use a Sentence Transformer to generate an embedding for the query. This ensures that the query and the documents are in the same vector space, making it possible to compare them.
* By using Sentence Transformers, you can potentially improve the quality of the retrieval step in RAG. Since Sentence Transformers are designed to capture the semantic meaning of sentences, they should be able to find documents that are semantically relevant to the query, even if the query and the documents do not share any exact words.




In [None]:
def initialize_embedding_fn(embedding_type="huggingface", model_name="sentence-transformers/all-MiniLM-l6-v2"):
    # Note: the "ollama" and "huggingface" branches override model_name below,
    # so this message may name a different model than the one actually used.
    print(f"Initializing {embedding_type} model with {model_name}...")
    if embedding_type == "ollama":
        # Uses the module-level chat_model variable, which must be set before this call
        model_name = chat_model
        return embeddings.OllamaEmbeddings(model=model_name, base_url=OLLAMA_BASE_URL)
    elif embedding_type == "huggingface":
        model_name = "sentence-transformers/paraphrase-MiniLM-L3-v2"
        return embeddings.HuggingFaceEmbeddings(model_name=model_name)
    elif embedding_type == "nomic":
        return embeddings.NomicEmbeddings(model_name=model_name)
    elif embedding_type == "fastembed":
        return FastEmbedEmbeddings(threads=16)
    else:
        raise ValueError(f"Unsupported embedding type: {embedding_type}")

### Create and get embeddings using ChromaDB

The following steps describe how embeddings are used in a RAG pipeline with ChromaDB:

* Query Input: The user inputs a query.
* Query Embedding: The query is passed through the embedding model (a transformer-based encoder, like BERT or RoBERTa) to generate a query embedding.
* Document Embedding: Each document chunk is passed through the same embedding model to generate a document embedding. This is typically done offline, and the embeddings are stored in ChromaDB for efficient retrieval.
* Embedding Comparison: The query embedding is compared with the document embeddings in ChromaDB by calculating a similarity score, such as the cosine similarity or dot product, between the query embedding and each document embedding.
* Document Retrieval: The documents with the highest similarity scores are retrieved. The number of documents retrieved is a hyperparameter that can be tuned.
* Answer Generation: The retrieved documents and the query are passed to a generative model (in this notebook, an LLM served by Ollama) to generate an answer.
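
The comparison step mentions both cosine similarity and dot product. A short worked example shows how the two relate: once vectors are scaled to unit length, their dot product equals their cosine similarity, while the raw dot product also depends on vector magnitude. The vectors and helper names here are made up for illustration.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query = [3.0, 4.0]
document = [1.0, 2.0]

cos_score = cosine(query, document)
dot_of_units = dot(normalize(query), normalize(document))
raw_dot = dot(query, document)

print(f"cosine similarity:       {cos_score:.5f}")
print(f"dot of unit vectors:     {dot_of_units:.5f}")
print(f"dot of original vectors: {raw_dot:.1f}")
```

This is why many embedding pipelines normalize vectors once at indexing time and then use the cheaper dot product at query time.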

In [None]:
def get_or_create_embeddings(document_url, embedding_fn, persist_dir=VECTOR_DB_DIR):
    vector_store_path = os.path.join(os.getcwd(), persist_dir)    
    if os.path.exists(vector_store_path):
        print("Loading existing vector store...")
        return vectorstores.Chroma(persist_directory=persist_dir, embedding_function=embedding_fn)
    else:
        start_time = time.time()
        print("No existing vector store found. Creating new one...")
        document = load_document(document_url)
        documents = split_document(document)
        vector_store = vectorstores.Chroma.from_documents(
            documents=documents,
            embedding=embedding_fn,
            persist_directory=persist_dir
        )
        vector_store.persist()
        print(f"Embedding time: {time.time() - start_time:.2f} seconds")
        return vector_store

### Retrievers

* Retrievers are responsible for fetching relevant documents from a document store or knowledge base given a query. The retrieved documents are then used by the generator to produce a response.
* `RetrievalQA` is a LangChain chain for question answering: it uses a retriever to fetch relevant documents for a question and then passes them to an LLM, which generates the answer.
* `RetrievalQA` can be seen as a two-step process:
    * Retrieval: The retriever fetches relevant documents from the document store given a query.
    * Generation: The generator uses the retrieved documents to generate a response.
* This two-step process lets RAG combine the strengths of retrieval-based and generation-based approaches to question answering. The retriever searches a large document store efficiently, while the generator produces detailed and coherent responses.
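
The notebook's code builds its chain with `chain_type="stuff"`, which places all retrieved chunks into a single prompt. The sketch below imitates that two-step flow with plain functions. `fake_retriever` and `fake_generator` are hypothetical stand-ins for the vector store retriever and the Ollama model, and the separator between chunks is an assumption.

```python
PROMPT_TEMPLATE = (
    "Use the following pieces of context to answer the question at the end.\n"
    "If you do not know the answer, answer 'I don't know'.\n\n"
    "{context}\n\n"
    "Question: {question}"
)


def stuff_prompt(question, retrieved_chunks, separator="\n\n"):
    """Join every retrieved chunk into one context block and fill the template."""
    context = separator.join(retrieved_chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)


def answer(question, retriever, generator):
    """Two-step flow: retrieve the chunks, then generate from the stuffed prompt."""
    chunks = retriever(question)
    return generator(stuff_prompt(question, chunks))


def fake_retriever(question):
    """Hypothetical retriever that always returns the same two chunks."""
    return ["Chunk one about ships.", "Chunk two about the sea."]


def fake_generator(prompt):
    """Hypothetical model that echoes the last prompt line instead of answering."""
    return "echo: " + prompt.splitlines()[-1]


result = answer("What is this about?", fake_retriever, fake_generator)
print(stuff_prompt("What is this about?", fake_retriever("What is this about?")))
print(result)
```

Because every chunk goes into one prompt, the number of retrieved chunks times the chunk size has to stay within the model's context window.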


In [None]:
def handle_user_interaction(vector_store, chat_model):
    prompt_template = """
    Use the following pieces of context to answer the question at the end. 
    If you do not know the answer, answer 'I don't know', limit your response to the answer and nothing more. 

    {context}

    Question: {question}
    """
    prompt = PromptTemplate(template=prompt_template, input_variables=["context", "question"])
    chain_type_kwargs = {"prompt": prompt}
    retriever = vector_store.as_retriever(search_kwargs={"k": 4})
    qachain = chains.RetrievalQA.from_chain_type(llm=chat_model, retriever=retriever, chain_type="stuff", chain_type_kwargs=chain_type_kwargs)
    qachain.invoke({"query": "what is this about?"})
    print(f"Model warmup complete...")
    while True:
        question = input("Enter your question (or 'quit' to exit): ")
        if question.lower() == 'quit':
            break
        start_time = time.time()
        answer = qachain.invoke({"query": question})
        print(f"Answer: {answer['result']}")
        print(f"Response time: {time.time() - start_time:.2f} seconds")

### Run the application

In [None]:
def main(document_url, embedding_type, chat_model):
    embedding_fn = initialize_embedding_fn(embedding_type)
    vector_store = get_or_create_embeddings(document_url, embedding_fn)
    chat_model_instance = llms.Ollama(base_url=OLLAMA_BASE_URL, model=chat_model)
    handle_user_interaction(vector_store, chat_model_instance)

if __name__ == "__main__":
    document_url = "https://www.gutenberg.org/files/1727/1727-h/1727-h.htm"    
    embedding_type = "huggingface"
    chat_model = "llama3:latest"
    main(document_url, embedding_type, chat_model)

### Streamlit Demo

In [None]:
%%writefile src/st_rag_chromadb.py
import streamlit as st
import time
import os
import warnings
import ollama

warnings.filterwarnings("ignore")

from langchain_community import document_loaders, embeddings, vectorstores, llms
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings
from langchain import chains, text_splitter, PromptTemplate

OLLAMA_BASE_URL = "http://localhost:11434"
VECTOR_DB_DIR = "vector_dbs"

st.header("LLM Rag 🐻‍❄️")


models = [model["name"] for model in ollama.list()["models"]]
model = st.selectbox("Choose a model from the list", models)

# Input text to load the document
url_path = st.text_input("Enter the URL to load for RAG:",value="https://www.gutenberg.org/files/1727/1727-h/1727-h.htm", key="url_path")

# Select embedding type
embedding_type = st.selectbox("Please select an embedding type", ("ollama", "huggingface", "nomic", "fastembed"),index=1)

# Input for RAG
question = st.text_input("Enter the question for RAG:", value="What is this about", key="question")

## Load the document using document_loaders
def load_document(url):
    print("Loading document from URL...")
    st.markdown(''' :green[Loading document from URL...] ''')
    loader = document_loaders.WebBaseLoader(url)
    return loader.load()


## Split the document into multiple chunks
def split_document(text, chunk_size=3000, overlap=200):
    print("Splitting document into chunks...")
    st.markdown(''' :green[Splitting document into chunks...] ''')
    text_splitter_instance = text_splitter.RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=overlap)
    return text_splitter_instance.split_documents(text)




## Initialize embeddings for these chunks of data. we can use one of the below four embedding types

def initialize_embedding_fn(embedding_type="huggingface", model_name="sentence-transformers/all-MiniLM-l6-v2"):
    # Note: the "ollama" and "huggingface" branches override model_name below,
    # so this message may name a different model than the one actually used.
    print(f"Initializing {embedding_type} model with {model_name}...")
    st.write(f"Initializing {embedding_type} model with {model_name}...")
    if embedding_type == "ollama":
        # Uses the module-level chat_model variable assigned when "Generate" is clicked
        model_name = chat_model
        return embeddings.OllamaEmbeddings(model=model_name, base_url=OLLAMA_BASE_URL)
    elif embedding_type == "huggingface":
        model_name = "sentence-transformers/paraphrase-MiniLM-L3-v2"
        return embeddings.HuggingFaceEmbeddings(model_name=model_name)
    elif embedding_type == "nomic":
        return embeddings.NomicEmbeddings(model_name=model_name)
    elif embedding_type == "fastembed":
        return FastEmbedEmbeddings(threads=16)
    else:
        raise ValueError(f"Unsupported embedding type: {embedding_type}")
    
## Create embeddings for these chunks of data and store it in chromaDB

def get_or_create_embeddings(document_url, embedding_fn, persist_dir=VECTOR_DB_DIR):
    vector_store_path = os.path.join(os.getcwd(), persist_dir)    
    start_time = time.time()
    print("No existing vector store found. Creating new one...")
    st.markdown(''' :green[No existing vector store found. Creating new one......] ''')
    document = load_document(document_url)
    documents = split_document(document)
    vector_store = vectorstores.Chroma.from_documents(
        documents=documents,
        embedding=embedding_fn,
        persist_directory=persist_dir
    )
    vector_store.persist()
    print(f"Embedding time: {time.time() - start_time:.2f} seconds")
    st.write(f"Embedding time: {time.time() - start_time:.2f} seconds")
    return vector_store
# Create the user prompt and generate the response
def handle_user_interaction(vector_store, chat_model):
    prompt_template = """
    Use the following pieces of context to answer the question at the end. 
    If you do not know the answer, answer 'I don't know', limit your response to the answer and nothing more. 

    {context}

    Question: {question}
    """
    prompt = PromptTemplate(template=prompt_template, input_variables=["context", "question"])
    chain_type_kwargs = {"prompt": prompt}
    # Use retrievers to retrieve the data from the database
    st.markdown(''' :green[Using retrievers to retrieve the data from the database...] ''')
    retriever = vector_store.as_retriever(search_kwargs={"k": 4})
    st.markdown(''' :green[Answering the query...] ''')
    qachain = chains.RetrievalQA.from_chain_type(llm=chat_model, retriever=retriever, chain_type="stuff", chain_type_kwargs=chain_type_kwargs)
    qachain.invoke({"query": "what is this about?"})
    print(f"Model warmup complete...")
    st.markdown(''' :green[Model warmup complete...] ''')
       
    
          
    start_time = time.time()
    answer = qachain.invoke({"query": question})
    print(f"Answer: {answer['result']}")    
    print(f"Response time: {time.time() - start_time:.2f} seconds")
    st.write(f"Response time: {time.time() - start_time:.2f} seconds")
    
    
    return answer['result']
  
       

# Main Function to load the document, initialize the embeddings , create the vector database and invoke the model
def getfinalresponse(document_url, embedding_type, chat_model):    
    
    document_url = url_path    
    chat_model = model
                
    embedding_fn = initialize_embedding_fn(embedding_type)
    vector_store = get_or_create_embeddings(document_url, embedding_fn)     
    chat_model_instance = llms.Ollama(base_url=OLLAMA_BASE_URL, model=chat_model)
    return handle_user_interaction(vector_store, chat_model_instance)

    
submit=st.button("Generate")


# generate response
if submit:    
    document_url = url_path    
    chat_model = model
    
    with st.spinner("Loading document....🐎"):        
        st.write(getfinalresponse(document_url, embedding_type, chat_model))


### Run the Streamlit demo

In [None]:
! streamlit run src/st_rag_chromadb.py

### Streamlit sample output

Below is the output of a sample run of the Streamlit application, with inference offloaded to the iGPU.

<img src="Assets/rag2.png"> <img src="Assets/rag1.png">

### References
https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/llm-agent-langchain