# ChatBot without and with Retrieval, using OpenVINO and LangChain

In this module, we build a chatbot using LangChain and ask it questions about "Lunar Lake", the code name for Intel® Core™ Ultra Processors Series 2 laptop processors, which had not been released when the LLM was trained. We then build a vector database from a few documents that describe the "Lunar Lake" processors, and add retrieval-augmented generation (RAG) to the chatbot to ground its responses in this information.

If you are running this on your own, not as part of a workshop, install packages using `requirements.txt` and run the `setup.py` script before using this notebook.

This module demonstrates how to build a chatbot that uses documents stored locally on an AI PC. Grounding the chatbot's responses in those documents makes its answers more accurate, and running locally adds privacy and security: everything stays on the user's PC. The LLM is hard-coded to run on the GPU, the embeddings model on the NPU, and the reranker model uses the AUTO setting. Feel free to edit the code to try different settings. The LLM is Intel neural-chat-7b-v3-3, but this has also been tested using TinyLlama/TinyLlama-1.1B-Chat-v1.0.

**Retrieval-augmented generation (RAG)** is a technique for augmenting LLM knowledge with additional, often private or real-time, data. LLMs can reason about wide-ranging topics, but their knowledge is limited to the public data they were trained on, up to a specific point in time. If you want to build AI applications that can reason about private data or data introduced after a model’s cutoff date, you need to augment the knowledge of the model with the specific information it needs. RAG is the process of retrieving the appropriate information and inserting it into the model prompt.
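
As a rough illustration of "inserting it into the model prompt", the sketch below joins retrieved text chunks and places them ahead of the question. The function name `build_augmented_prompt`, the prompt wording, and the placeholder chunks are assumptions made for this example; the notebook itself uses `utils.RAG_PROMPT_TEMPLATE`, whose exact text is not shown here.

```python
def build_augmented_prompt(question, retrieved_chunks):
    """Insert retrieved text into the prompt ahead of the question."""
    # Join the retrieved chunks into a single context block
    context = "\n\n".join(retrieved_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder chunks standing in for text retrieved from local documents
chunks = [
    "Placeholder chunk A retrieved from a local document.",
    "Placeholder chunk B retrieved from a local document.",
]
prompt = build_augmented_prompt("What is Lunar Lake?", chunks)
print(prompt)
```

Without retrieval, the same template would simply be filled with an empty context, so the model could only answer from what it saw during training.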

[LangChain](https://python.langchain.com/docs/get_started/introduction) is a framework for developing applications powered by language models. It has a number of components specifically designed to help build RAG applications. In this tutorial, we’ll build a simple question-answering application over a text data source.

In this example, the customized RAG pipeline consists of the following components, in order. The embedding, rerank, and LLM models are deployed with OpenVINO to optimize their inference performance.

![RAG](https://github.com/openvinotoolkit/openvino_notebooks/assets/91237924/0076f6c7-75e4-4c2e-9015-87b355e5ca28)

RAG description and image are courtesy of [Create a RAG system using OpenVINO and LangChain](https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/llm-rag-langchain).


## Imports

Import required dependencies


In [None]:
from pathlib import Path
import torch
from transformers import (
    TextIteratorStreamer,
    StoppingCriteria,
    StoppingCriteriaList,
)

import chatbot_utils as utils

### Load the LLM

OpenVINO models can be run locally through the `HuggingFacePipeline` class in [LangChain](https://python.langchain.com/docs/integrations/llms/openvino/). To deploy a model with OpenVINO, specify the `backend="openvino"` parameter to use OpenVINO as the inference backend.

In [None]:
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline

model_dir = "./models/neural-chat-7b-v3-3/INT4"
llm_device = "GPU"

ov_config = {"PERFORMANCE_HINT": "LATENCY", "NUM_STREAMS": "1", "CACHE_DIR": ""}

llm = HuggingFacePipeline.from_model_id(
    model_id=str(model_dir),
    task="text-generation",
    backend="openvino",
    model_kwargs={
        "device": llm_device,
        "ov_config": ov_config,
        "trust_remote_code": True,
    },
    pipeline_kwargs={"max_new_tokens": 400},
)
# We'll define this later
rag_chain = None

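# Reuse the end-of-sequence token for padding; compare against None so that a token id of 0 is still accepted
if llm.pipeline.tokenizer.eos_token_id is not None:
    llm.pipeline.tokenizer.pad_token_id = llm.pipeline.tokenizer.eos_token_id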

### Define Inference Function
This shows how simple it is to build an inference pipeline using LangChain. Depending on whether we use RAG or not, we build different pipelines.

In [None]:
from threading import Thread


def bot_inference(query, use_rag):
    # llm and rag_chain are module-level objects defined in other cells
    global llm, rag_chain

    streamer = TextIteratorStreamer(
        llm.pipeline.tokenizer,
        timeout=60.0,
        skip_prompt=True,
        skip_special_tokens=True,
    )
    pipeline_kwargs = dict(
        max_new_tokens=512,
        temperature=0.6,
        do_sample=True,
        top_p=0.95,
        top_k=10,
        repetition_penalty=1.2,
        streamer=streamer,
    )

    llm.pipeline_kwargs = pipeline_kwargs
    if use_rag:
        t1 = Thread(target=rag_chain.invoke, args=({"input": query},))
    else:
        input_text = utils.RAG_PROMPT_TEMPLATE.format(input=query, context="")
        t1 = Thread(target=llm.invoke, args=(input_text,))
    t1.start()

    # Initialize an empty string to store the generated text
    partial_text = ""
    for new_text in streamer:
        partial_text = utils.partial_text_processor(partial_text, new_text)
        yield partial_text
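
The function above runs generation in a background thread while the caller iterates over the streamer. The sketch below shows that same producer and consumer pattern using only the standard library, so it can be run without a model. `fake_generate` and `stream_reply` are hypothetical stand-ins written for this example, and a `queue.Queue` plays the role that `TextIteratorStreamer` plays in the notebook.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def fake_generate(prompt, out_queue):
    """Hypothetical stand-in for the LLM: emits one token at a time."""
    for token in ["Echo", ":", " ", prompt]:
        out_queue.put(token)
    out_queue.put(_DONE)


def stream_reply(prompt):
    """Run generation in a worker thread and yield the growing reply."""
    out_queue = queue.Queue()
    worker = threading.Thread(target=fake_generate, args=(prompt, out_queue))
    worker.start()

    partial_text = ""
    while True:
        # Block until the worker produces the next token (or time out)
        token = out_queue.get(timeout=60.0)
        if token is _DONE:
            break
        partial_text += token
        yield partial_text
    worker.join()


replies = list(stream_reply("hello"))
print(replies[-1])
```

Yielding the accumulated text, rather than each token, lets a UI simply redraw the latest value on every update.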

### Ask the Chatbot Questions

Ask a question, choose a device to run the LLM on, then press Submit.

Later on we will supply it with information on the recently announced Intel Core Ultra Series 2 processors, code-named "Lunar Lake", which did not exist when the model was trained. So ask it questions about Lunar Lake!

In [None]:
gr_demo = utils.build_gr_blocks(bot_inference, use_rag=False)
gr_demo.launch()

## Run QA over Document

A typical RAG application has two main components:

- **Indexing**: a pipeline for ingesting data from a source and indexing it. This usually happens offline.

- **Retrieval and generation**: the actual RAG chain, which takes the user query at run time and retrieves the relevant data from the index, then passes that to the model.

The most common full sequence from raw data to answer looks like:

**Indexing**

1. `Load`: First we need to load our data. We’ll use DocumentLoaders for this.
2. `Split`: Text splitters break large Documents into smaller chunks. This is useful both for indexing data and for passing it in to a model, since large chunks are harder to search over and won’t fit in a model’s finite context window.
3. `Store`: We need somewhere to store and index our splits, so that they can later be searched over. This is often done using a VectorStore and Embeddings model.

![Indexing pipeline](https://github.com/openvinotoolkit/openvino_notebooks/assets/91237924/dfed2ba3-0c3a-4e0e-a2a7-01638730486a)
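
To make the `Split` step concrete, here is a simplified sketch of fixed-size chunking with overlap. It is an assumption-laden illustration, not how `RecursiveCharacterTextSplitter` works internally: that class also tries to split on separators such as paragraph and line breaks, whereas the hypothetical `split_with_overlap` below cuts purely by character count.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into chunks of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once this chunk has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(text, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, at the cost of storing some text twice.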

**Retrieval and generation**

1. `Retrieve`: Given a user input, relevant splits are retrieved from storage using a Retriever.
2. `Generate`: An LLM produces an answer using a prompt that includes the question and the retrieved data.

![Retrieval and generation pipeline](https://github.com/openvinotoolkit/openvino_notebooks/assets/91237924/f0545ddc-c0cd-4569-8c86-9879fdab105a)
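
The `Retrieve` step can be sketched as a similarity search over embedding vectors. The example below is a minimal illustration that assumes cosine similarity and hand-written two-dimensional vectors; the `retrieve` function, the toy index, and the scoring are hypothetical, and a real vector store such as FAISS uses its own index structures and score normalization.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vec, index, k, score_threshold):
    """Return up to k chunk texts whose similarity meets the threshold, best first."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in index]
    scored = [item for item in scored if item[0] >= score_threshold]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy index of (chunk text, embedding vector) pairs
index = [
    ("chunk about processors", [1.0, 0.0]),
    ("chunk about graphics", [0.8, 0.6]),
    ("unrelated chunk", [0.0, 1.0]),
]
results = retrieve([1.0, 0.0], index, k=2, score_threshold=0.5)
print(results)
```

The threshold drops chunks that are only weakly related to the query, so the prompt is not padded with irrelevant text even when fewer than `k` good matches exist.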

We can build a RAG pipeline in LangChain with [`create_retrieval_chain`](https://python.langchain.com/docs/modules/chains/), which creates a chain that connects the RAG components, including:

- [`Vector stores`](https://python.langchain.com/docs/modules/data_connection/vectorstores/)
- [`Retrievers`](https://python.langchain.com/docs/modules/data_connection/retrievers/)
- [`LLM`](https://python.langchain.com/docs/integrations/llms/)
- [`Embedding`](https://python.langchain.com/docs/integrations/text_embedding/)

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.retrievers import ContextualCompressionRetriever
from langchain.prompts import PromptTemplate
from langchain_community.vectorstores import FAISS
from langchain.chains.retrieval import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_community.embeddings import OpenVINOEmbeddings

# Load and compile the embedding and rerank models
embedding = utils.load_embeddings_model("NPU")
# Reranking improves retrieval quality by re-scoring the retrieved chunks against the query.
# It adds some time to the inference step, because each chunk is scored against that specific query.
reranker = utils.load_rerank_model("AUTO")

def build_vectordb(docs_path):
    global llm, rag_chain

    print(f"Building {docs_path}")
    # Create a loader to load all the .pdf files in the given directory
    loader = PyPDFDirectoryLoader(docs_path)
    docs = loader.load()
    # Split the docs into chunks, with some overlap
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    chunked_docs = text_splitter.split_documents(docs)

    # Path to a local copy of the embeddings model. It is not used in this function:
    # the `embedding` object loaded above is what embeds the chunks.
    embeddings_model = "./models/bge-small-en-v1.5"
    db = FAISS.from_documents(chunked_docs, embedding)

    # Number of results to return from the retrieval search, and the similarity score threshold
    search_kwargs = {"k": 10, "score_threshold": 0.5}
    retriever = db.as_retriever(search_kwargs=search_kwargs, search_type="similarity_score_threshold")
    # Set up rerank, use the top 2 results
    reranker.top_n = 2
    # Wrap the base retriever so the reranker filters its results, and define the prompt
    retriever = ContextualCompressionRetriever(base_compressor=reranker, base_retriever=retriever)
    prompt = PromptTemplate.from_template(utils.RAG_PROMPT_TEMPLATE)
    # Combine the retriever with the existing LLM into a RAG chain
    combine_docs_chain = create_stuff_documents_chain(llm, prompt)
    rag_chain = create_retrieval_chain(retriever, combine_docs_chain)

build_vectordb("./local_docs")
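
The cell above retrieves up to 10 candidates and then keeps the reranker's top 2. The sketch below illustrates that two-stage idea with a deliberately crude scoring function. `overlap_score` and `rerank` are hypothetical helpers written for this example; the actual reranker loaded by `utils.load_rerank_model` is a neural model, not a word-overlap count.

```python
def overlap_score(query, chunk):
    """Hypothetical relevance score: fraction of query words found in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & chunk_words) / len(query_words)


def rerank(query, candidates, top_n):
    """Re-score every candidate against the query and keep the best top_n."""
    ranked = sorted(candidates, key=lambda c: overlap_score(query, c), reverse=True)
    return ranked[:top_n]


# Candidates as they might come back from the first-stage retriever
candidates = [
    "the npu runs the embeddings model",
    "battery life depends on workload",
    "the gpu runs the llm",
]
top = rerank("which device runs the llm", candidates, top_n=2)
print(top)
```

Because every candidate is scored against the query, reranking cost grows with the number of retrieved chunks, which is why the first stage narrows the set before the reranker runs.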

In [None]:
gr_demo = utils.build_gr_blocks(bot_inference, use_rag=True)
gr_demo.launch()

## Disclaimer for Using Large Language Models

Please be aware that while Large Language Models like TinyLlama are powerful tools for text generation, they may sometimes produce results that are unexpected, biased, or inconsistent with the given prompt. It's advisable to carefully review the generated text and consider the context and application in which you are using these models.

Usage of these models must also adhere to the licensing agreements and be in accordance with ethical guidelines and best practices for AI. If you have any concerns or encounter issues with the models, please refer to the respective model cards and documentation provided in the links above.

To the extent that any public or non-Intel datasets or models are referenced by or accessed using these materials those datasets or models are provided by the third party indicated as the content source. Intel does not create the content and does not warrant its accuracy or quality. By accessing the public content, or using materials trained on or with such content, you agree to the terms associated with that content and that your use complies with the applicable license.

 
Intel expressly disclaims the accuracy, adequacy, or completeness of any such public content, and is not liable for any errors, omissions, or defects in the content, or for any reliance on the content. Intel is not liable for any liability or damages relating to your use of public content.

Intel’s provision of these resources does not expand or otherwise alter Intel’s applicable published warranties or warranty disclaimers for Intel products or solutions, and no additional obligations, indemnifications, or liabilities arise from Intel providing such resources. Intel reserves the right, without notice, to make corrections, enhancements, improvements, and other changes to its materials.