# Chat with your documents on Intel Meteor Lake iGPU
In this notebook we will show how to build a system for chatting with your documents, running locally on your Intel Meteor Lake laptop!

Retrieval Augmented Generation (RAG) is a method for enhancing a model's generated output with relevant information retrieved from an external database.
Using RAG we can chat with our documents or ask questions about current events which didn't appear in the model's training data.
Running RAG locally gives the user immediate responses while interacting with the model and ensures privacy, since all the data stays on the laptop and is not uploaded anywhere.
RAG is a key component in the AI PC era and here we will show you how to use it!

To build the RAG pipeline we will use LangChain:
> LangChain is a framework designed to simplify the creation of applications using large language models.

LangChain has built-in integrations with OpenVINO and 🤗 Optimum, which keep our implementation simple.
We won't dive into how LangChain works in this notebook, but you can read about it [here](https://python.langchain.com/docs/get_started/introduction).

To show RAG's prowess we will use the [RealTimeData/bbc_latest](https://huggingface.co/datasets/RealTimeData/bbc_latest) dataset. BBC Latest is updated weekly with the latest BBC news articles from that week.
Since we are using a foundation model that was trained some time ago, without any fine-tuning, the model will have to rely on RAG to answer our questions on current events.

As the language model we will use [Phi-3](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct), which is a very powerful model, especially for local inference.
Check our notebook for running [Phi-2 on Intel Meteor Lake iGPU](https://github.com/huggingface/optimum-intel/blob/main/notebooks/openvino/quantized_generation_demo.ipynb). We will use the same method here but won't cover it in detail, so be sure to check it.

Our RAG system will use the following pipeline to answer questions:
```
        User query
        /        \
   Retriever      |
       |          |
Relevant docs     |
        \         |
      Prompt creation
             |
           Phi-3
             |
           Answer             
```
In a chatbot scenario we will also have the chat history as an input to the prompt creation.
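
As a rough mental model, the diagram above can be sketched in plain Python. The retriever and the model here are hypothetical stand-ins (a keyword-overlap search and a function that only reports the prompt length), not the components used later in this notebook.

```python
def retrieve(query, documents, top_k=1):
    """Toy retriever: rank documents by how many query words they share."""
    query_words = set(query.lower().split())
    ranked = sorted(documents, key=lambda d: len(query_words & set(d.lower().split())), reverse=True)
    return ranked[:top_k]


def create_prompt(query, relevant_docs, chat_history=()):
    """Combine chat history, the user query and the retrieved documents into one prompt."""
    history = "".join(f"{role}: {text}\n" for role, text in chat_history)
    context = "\n".join(relevant_docs)
    return f"{history}Question: {query}\nContext: {context}\nAnswer:"


def fake_llm(prompt):
    """Stand-in for the language model: only reports how long the prompt is."""
    return f"(model output for a prompt of {len(prompt)} characters)"


def answer(query, documents, chat_history=()):
    relevant_docs = retrieve(query, documents)  # User query -> Retriever -> Relevant docs
    prompt = create_prompt(query, relevant_docs, chat_history)  # Prompt creation
    return fake_llm(prompt)  # Model -> Answer


docs = ["The league will feature 36 teams.", "The weather was sunny on Monday."]
print(answer("How many teams will the league feature?", docs))
```

The query reaches the prompt twice: once directly and once through the documents it retrieved, which is the fork shown at the top of the diagram.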

<b>Let's begin! 🚀</b>

First we will create our database.
We will use [ChromaDB](https://python.langchain.com/docs/integrations/vectorstores/chroma) as our database. ChromaDB receives the dataset and an embedder, and encodes all the documents into vector representations with the embedder model.
As the embedder we will use [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5), which performs well and is very small, making it a great fit for running locally.
Later, when given a query, ChromaDB will use the same embedder to encode the query and retrieve the documents whose representations are most similar to it.
Since BBC articles can be very long, we will split every article into passages of 3 sentences.
We will also save the processed data to disk to avoid re-embedding the entire dataset every time we reload.
Note that this process may take a few minutes.
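
To make the passage splitting concrete, here is a minimal standard-library sketch of grouping sentences into passages of 3. The `naive_sent_tokenize` helper is a hypothetical, much cruder stand-in for the NLTK tokenizer used in the notebook; it assumes every sentence ends with `.`, `!` or `?`.

```python
import re


def naive_sent_tokenize(text):
    """Very rough sentence splitter (assumption: sentences end with '.', '!' or '?')."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def to_passages(text, sent_count_per_passage=3):
    """Group consecutive sentences into passages of at most `sent_count_per_passage`."""
    sents = naive_sent_tokenize(text)
    return [" ".join(sents[i : i + sent_count_per_passage]) for i in range(0, len(sents), sent_count_per_passage)]


article = "One. Two! Three? Four. Five."
print(to_passages(article))
```

The last passage of an article may contain fewer than 3 sentences when the sentence count is not a multiple of 3.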

In [None]:
# Uncomment to install dependencies
# ! pip install langchain datasets pandas nltk chromadb sentence-transformers
# import nltk

# nltk.download('punkt')

In [None]:
from datasets import load_dataset
from nltk.tokenize import sent_tokenize
from functools import reduce
import pandas as pd
import os

from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents.base import Document


def articles_to_passages(articles, sent_count_per_passage=3):
    """Split a list of articles into a flat list of passages"""

    def split_article(text):
        # Tokenize the article into sentences and number them 0, 1, 2, ...
        sents = sent_tokenize(text)
        sentence_df = pd.DataFrame(sents, columns=["sentence"]).reset_index()
        # Sentences sharing the same batch index end up in the same passage
        sentence_df["batch"] = sentence_df["index"] // sent_count_per_passage
        passages = list(sentence_df.groupby("batch")["sentence"].apply(lambda x: " ".join(x)))
        return passages

    # Concatenate the per-article passage lists into a single list
    return reduce(lambda l1, l2: l1 + l2, [split_article(p) for p in articles], [])


embedding_function = SentenceTransformerEmbeddings(model_name="BAAI/bge-small-en-v1.5")

chroma_db_path = "./chroma_db"
if os.path.exists(chroma_db_path):
    database = Chroma(persist_directory=chroma_db_path, embedding_function=embedding_function)
    print("Loaded database from disk")
else:
    dataset = load_dataset("RealTimeData/bbc_latest", split="train", revision="2024-03-18")
    # Filter only sports articles
    sports_articles = dataset.filter(lambda e: "sport" in e["link"])["content"]
    sports_articles = pd.DataFrame(sports_articles).drop_duplicates()[0].to_list()
    # Split documents to passages
    sport_passages = articles_to_passages(sports_articles)
    database = Chroma.from_documents([Document(page_content=doc) for doc in sport_passages], embedding_function, persist_directory=chroma_db_path)
    print(f"Number of sports articles found: {len(sports_articles)}\nNumber of embedded passages: {len(sport_passages)}")

Next we will initialize a retriever from the database.
We override the `_get_relevant_documents` method to control the number of documents the retriever returns for every query.

In [None]:
def _get_relevant_documents(self, query, *, run_manager):
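    # Copy the default search arguments so a per-query override does not modify the retriever
    search_kwargs = dict(self.search_kwargs)

    # `top_k` is passed per call through the RunnableConfig metadata
    if "top_k" in run_manager.metadata:
        search_kwargs["k"] = run_manager.metadata["top_k"]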
    if self.search_type == "similarity":
        docs = self.vectorstore.similarity_search(query, **search_kwargs)
    elif self.search_type == "similarity_score_threshold":
        docs_and_similarities = self.vectorstore.similarity_search_with_relevance_scores(query, **search_kwargs)
        docs = [doc for doc, _ in docs_and_similarities]
    elif self.search_type == "mmr":
        docs = self.vectorstore.max_marginal_relevance_search(query, **search_kwargs)
    else:
        raise ValueError(f"search_type of {self.search_type} not allowed.")
    return [d.page_content for d in docs]


retriever = database.as_retriever()
type(retriever)._get_relevant_documents = _get_relevant_documents
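
The override above follows a general pattern: replace a method on the class so that a per-call value can take precedence over the configured default. Here is a minimal sketch of that pattern. The `ToyRetriever` class and its `get_documents` method are hypothetical and unrelated to LangChain's actual classes.

```python
class ToyRetriever:
    """Hypothetical stand-in for a retriever class with a fixed default `k`."""

    def __init__(self, documents, k=2):
        self.documents = documents
        self.search_kwargs = {"k": k}

    def get_documents(self, query, metadata=None):
        return self.documents[: self.search_kwargs["k"]]


def get_documents_with_top_k(self, query, metadata=None):
    # Copy the defaults so a per-call override never leaks into later calls
    search_kwargs = dict(self.search_kwargs)
    if metadata and "top_k" in metadata:
        search_kwargs["k"] = metadata["top_k"]
    return self.documents[: search_kwargs["k"]]


# Replace the method on the class, so every instance picks up the new behaviour
ToyRetriever.get_documents = get_documents_with_top_k

toy_retriever = ToyRetriever(["doc a", "doc b", "doc c"])
print(toy_retriever.get_documents("q"))  # uses the default k
print(toy_retriever.get_documents("q", metadata={"top_k": 1}))  # per-call override
```

Patching the class rather than an instance affects all retrievers of that type, which is worth keeping in mind if several retrievers live in the same process.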

Let's test our retriever with a query and see if it returns a relevant document:

In [None]:
question = "How many teams will The 2024-25 Champions League feature?"

In [None]:
from langchain.schema.runnable import RunnableConfig

print(retriever.invoke(question, config=RunnableConfig(metadata={"top_k": 1})))

Check that the retrieved document is relevant to your question.
In our example, we can see that in the document it says that there will be 36 teams in the Champions League in 2024-25.
Later we will see that Phi-3 can't answer that question correctly without RAG since it was trained on data that predates October 2023.
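
For intuition about what a similarity search does, here is a toy version using only the standard library. It assumes a bag-of-words count vector as the "embedding", a deliberately simple stand-in for a neural embedder such as bge-small. The function names are illustrative and are not the vector store's API.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a neural embedder)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_search(query, passages, k=1):
    """Return the k passages whose vectors are closest to the query vector."""
    query_vec = embed(query)
    scored = sorted(passages, key=lambda p: cosine_similarity(query_vec, embed(p)), reverse=True)
    return scored[:k]


passages = ["the cup has 36 teams", "rain is expected tomorrow"]
print(similarity_search("how many teams in the cup", passages))
```

A learned embedder replaces `embed` with a dense vector that captures meaning rather than exact word overlap, but the ranking step stays conceptually the same.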

Next, we will want to build a prompt to include the question and relevant documents.
The template will be quite simple:
```
<s><|system|>
You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.<|end|>
<|user|>
{question}
{retrieved documents list}<|end|>
<|assistant|>
```
This template follows the chat template that Phi-3 was trained with, with the system message and the context added.
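
Filling such a template is plain string formatting. The sketch below uses `str.format` with a shortened, hypothetical system message, and the `build_prompt` helper is illustrative only.

```python
TEMPLATE = (
    "<s><|system|>\n"
    "{system_message}<|end|>\n"
    "<|user|>\n"
    "{question}\n"
    "{context}<|end|>\n"
    "<|assistant|>\n"
)


def build_prompt(question, documents, system_message="Answer using the retrieved context."):
    # Retrieved documents are joined into one block and placed after the question
    context = "\n".join(documents)
    return TEMPLATE.format(system_message=system_message, question=question, context=context)


print(build_prompt("How many teams?", ["There will be 36 teams."]))
```

The prompt ends with the `<|assistant|>` marker so that the model continues the text with its own answer.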

In [None]:
from langchain_core.prompts import PromptTemplate


# Phi-3 wasn't trained with system prompt: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct/discussions/51
template = """<s><|user|>
You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.<|end|>
<|user|>
{question}
{context}<|end|>
<|assistant|>
"""
prompt = PromptTemplate.from_template(template)

Now we can build a chain that will receive our question and return a prompt to query an LM.

In [None]:
from operator import itemgetter


chain = {"context": (itemgetter("question") | retriever), "question": itemgetter("question")} | prompt

print(chain.invoke({"question": question}, config=RunnableConfig(metadata={"top_k": 1})).to_string())
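
The `|` composition used in the chain can be pictured with a few lines of plain Python. This is only an illustration of the idea, not how LangChain implements its runnables. The `Step` class and `parallel` helper are hypothetical.

```python
class Step:
    """Minimal runnable: wraps a function and supports `|` composition."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # `a | b` builds a new step that feeds the output of `a` into `b`
        return Step(lambda value: other.invoke(self.invoke(value)))


def parallel(**steps):
    """Run several steps on the same input and collect the results in a dict."""
    return Step(lambda value: {name: step.invoke(value) for name, step in steps.items()})


get_question = Step(lambda inputs: inputs["question"])
toy_search = Step(lambda q: [f"document about: {q}"])
toy_prompt = Step(lambda d: f"Question: {d['question']}\nContext: {d['context']}")

toy_chain = parallel(context=get_question | toy_search, question=get_question) | toy_prompt
print(toy_chain.invoke({"question": "How many teams?"}))
```

The dict at the start of the real chain plays the role of `parallel` here: each key is computed from the same input, and the resulting dict is handed to the prompt.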

That's it, we are set to run the prompt through our model.
Next we will initialize our OpenVINO optimized Phi-3 model in a pipeline and form the complete chain to produce an answer to our question.

In [None]:
# Uncomment to install dependencies
# ! pip install optimum[openvino,nncf]
# Phi-3 is not supported yet in the official release of `optimum-intel` so we will need to install from source
# ! pip install git+https://github.com/huggingface/optimum-intel

In [None]:
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
from functools import wraps

import openvino.properties as props
import openvino.properties.hint as hints


model_name = "microsoft/Phi-3-mini-4k-instruct"
save_name = model_name.split("/")[-1] + "_openvino"
precision = "f16"
quantization_config = OVWeightQuantizationConfig(
    bits=4,
    sym=False,
    group_size=128,
    ratio=0.8,
)
device = "gpu"
saved = os.path.exists(save_name)
load_kwargs = {
    "device": device,
    "ov_config": {
        hints.performance_mode(): hints.PerformanceMode.LATENCY,
        hints.inference_precision(): precision,
        props.cache_dir(): os.path.join(save_name, "model_cache"),  # OpenVINO will use this directory as cache
    },
    "quantization_config": quantization_config,
    "trust_remote_code": True,
    "export": not saved,
}

ov_llm = HuggingFacePipeline.from_model_id(
    model_id=model_name if not saved else save_name,
    task="text-generation",
    backend="openvino",
    model_kwargs=load_kwargs,
)

if not saved:
    # For some reason LC passes the model_kwargs to the tokenizer as well and this can cause issues when saving
    for k in load_kwargs:
        ov_llm.pipeline.tokenizer.__dict__["init_kwargs"].pop(k, None)
    ov_llm.pipeline.save_pretrained(save_name)


original_generate = HuggingFacePipeline._generate


@wraps(original_generate)
def _generate_with_kwargs(*args, **kwargs):
    pipeline_kwargs = kwargs.get("run_manager").metadata.get("pipeline_kwargs", {})
    return original_generate(*args, **kwargs, pipeline_kwargs=pipeline_kwargs)


HuggingFacePipeline._generate = _generate_with_kwargs

chain |= ov_llm
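
To give a feeling for what `bits=4`, `sym=False` and `group_size=128` mean in the quantization config, here is a toy sketch of group-wise asymmetric 4-bit quantization on a list of floats. It is an illustration under simplifying assumptions and not the algorithm the library actually runs: real implementations operate on tensors, pack the integers, and (via `ratio`) keep part of the layers at a higher precision.

```python
def quantize_group(weights, bits=4):
    """Asymmetric quantization of one group: map [min, max] onto integers 0 .. 2**bits - 1."""
    levels = 2**bits - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels or 1.0  # fall back to 1.0 for constant groups
    quantized = [round((w - w_min) / scale) for w in weights]
    return quantized, scale, w_min


def dequantize_group(quantized, scale, w_min):
    return [q * scale + w_min for q in quantized]


def quantize_row(row, group_size=128, bits=4):
    """Quantize a weight row group by group; each group keeps its own scale and offset."""
    return [quantize_group(row[i : i + group_size], bits) for i in range(0, len(row), group_size)]


row = [i / 100 - 1.0 for i in range(256)]  # toy weights in [-1.0, 1.55]
groups = quantize_row(row)
restored = [w for q, scale, w_min in groups for w in dequantize_group(q, scale, w_min)]
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(f"{len(groups)} groups, max reconstruction error {max_error:.4f}")
```

Smaller groups give each scale a narrower range to cover, which lowers the reconstruction error at the cost of storing more scales and offsets.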

In [None]:
from transformers import TextStreamer


streamer = TextStreamer(ov_llm.pipeline.tokenizer, skip_special_tokens=True, skip_prompt=True)
out = chain.invoke(
    {"question": question},
    config=RunnableConfig(
        metadata={
            "top_k": 1,
            "pipeline_kwargs": {
                "max_new_tokens": 256,
                "return_full_text": False,
                "streamer": streamer,
                "eos_token_id": ov_llm.pipeline.tokenizer.convert_tokens_to_ids(["<|endoftext|>", "<|end|>", "<|system|>", "<|user|>", "<|assistant|>"]),
            },
        }
    ),
)

And there you have it, our chain is complete and we got the correct answer!

## Chatbot with RAG demo
We are now ready to build our chatbot demo with RAG capabilities.
We will use [Gradio](https://www.gradio.app/) to build our demo.

First, we will define our chat memory and modify our template and chain to handle it.

In [None]:
# Uncomment to install dependencies
# ! pip install gradio

In [None]:
from langchain_core.messages.base import BaseMessage
from langchain.memory import ConversationBufferMemory


def parse_chat_history(chat_history):
    role_map = {"human": "<|user|>\n", "ai": "<|assistant|>\n", "context": ""}
    buffer = ""
    for dialogue_turn in chat_history:
        assert isinstance(dialogue_turn, BaseMessage)
        role_prefix = role_map[dialogue_turn.type]
        buffer += f"{role_prefix}{dialogue_turn.content}"
        buffer += "<|end|>\n" if dialogue_turn.type != "human" else "\n"
    return buffer


def add_to_memory(memory, question, context, answer):
    memory.chat_memory.add_messages(
        [BaseMessage(content=question, type="human"), BaseMessage(content=context, type="context"), BaseMessage(content=answer, type="ai")]
    )


def delete_last_message_from_memory(memory):
    del memory.chat_memory.messages[-3:]


memory = ConversationBufferMemory(memory_key="chat_history", ai_prefix="Assistant", human_prefix="User")

template = """<s><|system|>
You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.<|end|>
{chat_history}<|user|>
{question}
{context}<|end|>
<|assistant|>
"""
prompt = PromptTemplate.from_template(template)
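
To see what the formatted chat history looks like once it is placed in the prompt, here is a simplified sketch that works on plain tuples instead of message objects. The `format_history` helper is hypothetical. It keeps the retrieved context inside the user turn, right after the question.

```python
def format_history(turns):
    """Render (question, context, answer) turns using Phi-3 style role markers."""
    buffer = ""
    for question, context, answer in turns:
        # The retrieved context is kept inside the user turn, right after the question
        buffer += f"<|user|>\n{question}\n{context}<|end|>\n"
        buffer += f"<|assistant|>\n{answer}<|end|>\n"
    return buffer


turns = [("How many teams?", "There will be 36 teams.", "36 teams.")]
print(format_history(turns))
```

With an empty history the function returns an empty string, so the template reduces to the single-turn prompt.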

We want to make RAG optional in our demo, so we will build 2 chains: one with RAG and one without.

In [None]:
from langchain_core.runnables import RunnableLambda


base_chain = RunnableLambda(func=lambda x: x) | {
    "context": itemgetter("context"),
    "answer": prompt | ov_llm,
}

rag_chain = {
    "context": (itemgetter("question") | retriever),
    "question": itemgetter("question"),
    "chat_history": itemgetter("chat_history") | RunnableLambda(parse_chat_history),
} | base_chain

no_rag_chain = {
    "context": RunnableLambda(lambda q: ""),
    "question": itemgetter("question"),
    "chat_history": itemgetter("chat_history") | RunnableLambda(parse_chat_history),
} | base_chain

Next we will write the core generation function for our demo, along with its helper functions.
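
The generation function in the next cell runs the chain in a background thread and reads the generated text from a streamer as it arrives. A minimal sketch of that pattern, using only the standard library, is shown here. `slow_generate` is a hypothetical stand-in for the blocking model call, and a `queue.Queue` plays the role of the streamer.

```python
import queue
import threading

_DONE = object()  # sentinel that marks the end of the stream


def slow_generate(prompt, token_queue):
    """Stand-in for a blocking model call that emits tokens as it goes."""
    tokens = prompt.upper().split()
    for token in tokens:
        token_queue.put(token)
    token_queue.put(_DONE)
    return " ".join(tokens)


def stream(prompt):
    """Yield partial text while the worker thread is still generating."""
    token_queue = queue.Queue()
    result = {}
    worker = threading.Thread(target=lambda: result.update(text=slow_generate(prompt, token_queue)))
    worker.start()
    partial = []
    while (token := token_queue.get()) is not _DONE:
        partial.append(token)
        yield " ".join(partial)
    worker.join()  # the return value is only read after the thread has finished
    yield f"{result['text']} (finished)"


for text in stream("streaming keeps the ui responsive"):
    print(text)
```

Because the consumer is a generator, a UI framework can redraw the chat window after every `yield` instead of waiting for the full answer.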

In [None]:
import time
import itertools
from threading import Thread
from transformers import (
    TextIteratorStreamer,
    StoppingCriteria,
    StoppingCriteriaList,
    GenerationConfig,
)



class ThreadWithResult(Thread):
    """
    Modified Thread class to save the return value of the target function
    Based on https://stackoverflow.com/a/65447493
    """

    def __init__(self, group=None, target=None, name=None, args=(), kwargs=None, *, daemon=None):
        # Avoid a shared mutable default argument
        kwargs = {} if kwargs is None else kwargs

        def function():
            self._result = target(*args, **kwargs)

        super().__init__(group=group, target=function, name=name, daemon=daemon)

    @property
    def result(self):
        self.join()
        return self._result


# Copied and modified from https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/bigcode_eval/generation.py#L13
class SuffixCriteria(StoppingCriteria):
    def __init__(self, start_length, eof_strings, tokenizer, check_fn=None):
        self.start_length = start_length
        self.eof_strings = eof_strings
        self.tokenizer = tokenizer
        if check_fn is None:
            check_fn = lambda decoded_generation: any([decoded_generation.endswith(stop_string) for stop_string in self.eof_strings])
        self.check_fn = check_fn

    def __call__(self, input_ids, scores, **kwargs):
        """Returns True if generated sequence ends with any of the stop strings"""
        decoded_generations = self.tokenizer.batch_decode(input_ids[:, self.start_length :])
        return all([self.check_fn(decoded_generation) for decoded_generation in decoded_generations])


def is_partial_stop(output, stop_str):
    """
    Check whether the output ends with a partial stop string.

    Params:
      output: current output from the model
      stop_str: a string we want to stop generation on
    Returns:
      True if a non-empty suffix of the output is a prefix of the stop_str
    """
    # Start at 1: `output[-0:]` would be the whole output rather than an empty suffix
    for i in range(1, min(len(output), len(stop_str)) + 1):
        if stop_str.startswith(output[-i:]):
            return True
    return False


def format_context(context):
    """
    Utility function to format retrieved documents inside the chatbot window

    Params:
      context: retrieved documents
    Returns:
      Formatted string with the retrieved documents
    """
    if len(context) == 0:
        return ""
    blockquote_style = """font-size: 12px;
background: #e4e4e4; 
border-left: 10px solid #ccc; 
margin: 0.5em 30px;
padding: 0.5em 10px;
color: black;"""
    summary_style = """font-weight: bold;
font-size: 14px;
list-style-position: outside;
margin: 0.5em 15px;
padding: 0px 0px 10px 15px;"""
    s = f'<details style="margin:0px;padding:0px;"><summary style="{summary_style}">Retrieved documents:</summary>'
    for doc in context:
        d = doc.replace("\n", " ")
        s += f'<blockquote style="{blockquote_style}"><p>{d}</p></blockquote>'
    s += "</details>"
    return s


def prepare_for_regenerate(history):
    """
    Delete last assistant response from memory in order to regenerate it

    Params:
      history: conversation history
    Returns:
      Updated history
    """
    history[-1][1] = None
    delete_last_message_from_memory(memory)
    return history, *([gr.update(interactive=False)] * 6)


def add_user_text(message, history):
    """
    Add user's message to chatbot history

    Params:
      message: current user message
      history: conversation history
    Returns:
      Updated history, clears user message and status
    """
    # Append current user message to history with a blank assistant message which will be generated by the model
    history.append([message.strip(), None])
    return "", history, *([gr.update(interactive=False)] * 5)


def reset_chatbot():
    """Clears demo contents and resets chat history"""
    memory.clear()
    return None, None, "Status: Idle"


def generate(
    history,
    temperature,
    max_new_tokens,
    top_p,
    repetition_penalty,
    num_retrieved_docs,
    enable_rag,
):
    """
    Generates the assistant's response given the chatbot history and generation parameters

    Params:
      history: conversation history formatted in pairs of user and assistant messages `[user_message, assistant_message]`
      temperature: parameter for controlling the level of creativity in AI-generated text.
                    By adjusting the `temperature`, you can influence the AI model's probability distribution, making the text more focused or diverse.
      max_new_tokens: the maximum number of tokens we allow the model to generate as a response.
      top_p: parameter for controlling the range of tokens considered by the AI model based on their cumulative probability.
      repetition_penalty: parameter for penalizing tokens based on how frequently they occur in the text.
      num_retrieved_docs: number of documents to retrieve in case of RAG
      enable_rag: a boolean to enable/disable RAG
    Yields:
      Updated history and generation status.
    """
    if len(history) == 0 or history[-1][1] is not None:
        yield history, "Status: Idle", *([gr.update(interactive=True)] * 6)
        return
    prompt_char = "▌"
    history[-1][1] = prompt_char
    yield history, "Status: Generating...", *([gr.update(interactive=False)] * 6)

    start = time.perf_counter()
    user_query = history[-1][0]
    current_chain = rag_chain if enable_rag else no_rag_chain
    tokenizer = ov_llm.pipeline.tokenizer
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    # Create a stopping criterion to prevent the model from playing the role of the user as well.
    stop_str = []
    stopping_criteria = StoppingCriteriaList([SuffixCriteria(0, stop_str, tokenizer)])

    # Prepare input for generate
    generation_config = GenerationConfig(
        max_new_tokens=max_new_tokens,
        do_sample=temperature > 0.0,
        temperature=temperature if temperature > 0.0 else 1.0,
        repetition_penalty=repetition_penalty,
        top_p=top_p,
        eos_token_id=tokenizer.convert_tokens_to_ids([tokenizer.eos_token, "<|end|>", "<|system|>", "<|user|>", "<|assistant|>"]),
        pad_token_id=tokenizer.eos_token_id,
    )
    generate_kwargs = dict(
        streamer=streamer,
        generation_config=generation_config,
        stopping_criteria=stopping_criteria,
    )
    chain_kwargs = {"config": RunnableConfig(metadata={"top_k": num_retrieved_docs, "pipeline_kwargs": generate_kwargs})}

    # Call chain
    t1 = ThreadWithResult(
        target=current_chain.invoke,
        args=[{"question": user_query, "chat_history": memory.chat_memory.messages}],
        kwargs=chain_kwargs,
    )
    t1.start()

    # Initialize an empty string to store the generated text.
    partial_text = ""
    generated_tokens = 0
    for new_text in streamer:
        partial_text += new_text
        generated_tokens += 1
        history[-1][1] = partial_text + prompt_char
        pos = -1
        for s in stop_str:
            if (pos := partial_text.rfind(s)) != -1:
                break
        if pos != -1:
            partial_text = partial_text[:pos]
            break
        elif any([is_partial_stop(partial_text, s) for s in stop_str]):
            continue
        yield history, "Status: Generating...", *([gr.update(interactive=False)] * 6)
    chain_out = t1.result
    history[-1][1] = partial_text + format_context(chain_out["context"])
    add_to_memory(memory, user_query, chain_out["context"], partial_text)
    generation_time = time.perf_counter() - start
    yield history, f"Generation time: {generation_time:.2f} sec", *([gr.update(interactive=True)] * 6)
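
The stop-string handling inside the streaming loop can be isolated into a small, standard-library sketch. The helper names below are hypothetical, and `<|user|>` is only an example stop string (the demo itself passes an empty list). The idea is to cut the output at the first full stop string and to hold back text that might be the beginning of one.

```python
def ends_with_partial_stop(text, stop_str):
    """True if some non-empty suffix of `text` is a proper prefix of `stop_str`."""
    max_len = min(len(text), len(stop_str) - 1)
    return any(stop_str.startswith(text[-i:]) for i in range(1, max_len + 1))


def stream_until_stop(chunks, stop_strings):
    """Yield the visible text after each chunk, cutting the output at the first stop string."""
    text = ""
    for chunk in chunks:
        text += chunk
        positions = [pos for s in stop_strings if (pos := text.find(s)) != -1]
        if positions:
            yield text[: min(positions)]
            return
        if any(ends_with_partial_stop(text, s) for s in stop_strings):
            continue  # wait for more text before showing a possible stop string
        yield text


chunks = ["Hello", " world<|", "user|>", " ignored"]
for visible in stream_until_stop(chunks, ["<|user|>"]):
    print(visible)
```

Holding back a partial match avoids briefly showing fragments such as `<|` in the chat window before the stop string is complete.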

Let's add an option to chat with our own documents by loading them to our database.

In [None]:
# ! pip install pypdf

In [None]:
from pypdf import PdfReader


added_documents_ids = []


def pdf_to_docs(file_path):
    reader = PdfReader(file_path)
    texts = [page.extract_text() for page in reader.pages]
    return [Document(page_content=p) for p in articles_to_passages(texts)]


def load_files(files):
    yield (
        f"Loading...",
        *([gr.update(interactive=False)] * 6),
    )
    start = time.perf_counter()
    for fp in files:
        documents = pdf_to_docs(fp)
        added_documents_ids.append(database.add_documents(documents))
    upload_time = time.perf_counter() - start
    yield (
        f"Load time: {upload_time * 1000:.2f}ms",
        *([gr.update(interactive=True)] * 5),
        gr.update(value=f"Delete documents 〈{len(added_documents_ids)}〉", interactive=True),
    )


def delete_documents():
    yield (
        f"Deleting...",
        *([gr.update(interactive=False)] * 6),
    )
    global added_documents_ids
    for l in added_documents_ids:
        database.delete(l)
    added_documents_ids = []
    yield (
        f"Status: Idle",
        *([gr.update(interactive=True)] * 5),
        gr.update(value=f"Delete documents 〈{len(added_documents_ids)}〉", interactive=True),
    )

Now we can build the actual demo using Gradio.
The layout is simple: a chatbot window followed by a text prompt, with controls that let you enable or disable RAG, submit, clear and regenerate. This is pretty standard for a chatbot demo.
We have also added the option to add PDF documents to the database and delete them if required.
You can extend the add documents option to support other formats than PDF.
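
As one possible way to do that, the sketch below dispatches on the file extension through a small registry of loader functions. All names here (`LOADERS`, `read_txt`, `file_to_texts`, `is_supported`) are hypothetical. Only plain-text loaders are registered, and a PDF reader would be added to the same dict.

```python
import tempfile
from pathlib import Path


def read_txt(path):
    """Return the text file content as a single-element list (one "page")."""
    return [Path(path).read_text(encoding="utf-8")]


# Maps a lower-case file extension to a function returning a list of page texts.
# A PDF reader would be registered here in the same way.
LOADERS = {".txt": read_txt, ".md": read_txt}


def is_supported(path):
    return Path(path).suffix.lower() in LOADERS


def file_to_texts(path):
    suffix = Path(path).suffix.lower()
    if suffix not in LOADERS:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}")
    return LOADERS[suffix](path)


with tempfile.TemporaryDirectory() as tmp:
    note = Path(tmp) / "note.txt"
    note.write_text("Hello. World.", encoding="utf-8")
    texts = file_to_texts(note)
print(texts)
```

The returned texts could then be split into passages in the same way as PDF pages. The `file_types` list of the upload button would also need to include the new extensions.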

In [None]:
import gradio as gr

try:
    demo.close()
except Exception:
    # `demo` is not defined on the first run of this cell
    pass

EXAMPLES_EDUCATION = [
    "Lily drops a rubber ball from the top of a wall. The wall is 2 meters tall. How long will it take for the ball to reach the ground?",
    "Mark has 15 notebooks in his backpack. Each day, he uses 3 notebooks for his classes. After 4 days, how many notebooks will Mark have left in his backpack?",
]
EXAMPLES_BBC = [
    "How many teams will The 2024-25 Champions League feature?",
    "Who said that english football is finished?",
]

with gr.Blocks(theme=gr.themes.Soft()) as demo:
    gr.Markdown('<h1 style="text-align: center;">Intel Labs Demo: Chat with 150 BBC Sport News Articles on Intel Meteor Lake iGPU</h1>')
    chatbot = gr.Chatbot(height=800)
    with gr.Row():
        rag = gr.Checkbox(
            value=True,
            label="Retrieve",
            interactive=True,
            info="Enables RAG pipeline",
        )
        msg = gr.Textbox(placeholder="Enter message here...", show_label=False, autofocus=True, scale=75)
        status = gr.Textbox("Status: Idle", show_label=False, max_lines=1, scale=20)
    with gr.Row():
        submit = gr.Button("Submit", variant="primary")
        regenerate = gr.Button("Regenerate")
        clear = gr.Button("Clear")
        load = gr.UploadButton("Load Document", file_types=["pdf"], file_count="multiple")
        delete_docs = gr.Button(lambda: f"Delete documents {f'〈{len(added_documents_ids)}〉'}", interactive=True)
    with gr.Accordion("Advanced Options:", open=False):
        with gr.Row():
            with gr.Column():
                temperature = gr.Slider(
                    label="Temperature",
                    value=0.0,
                    minimum=0.0,
                    maximum=1.0,
                    step=0.05,
                    interactive=True,
                )
                max_new_tokens = gr.Slider(
                    label="Max new tokens",
                    value=128,
                    minimum=0,
                    maximum=512,
                    step=32,
                    interactive=True,
                )
            with gr.Column():
                top_p = gr.Slider(
                    label="Top-p (nucleus sampling)",
                    value=1.0,
                    minimum=0.0,
                    maximum=1.0,
                    step=0.05,
                    interactive=True,
                )
                repetition_penalty = gr.Slider(
                    label="Repetition penalty",
                    value=1.0,
                    minimum=1.0,
                    maximum=2.0,
                    step=0.1,
                    interactive=True,
                )
            num_documents = gr.Slider(label="Number of retrieved documents", value=1, minimum=1, maximum=10, step=1, interactive=True)
    gr.Examples(EXAMPLES_EDUCATION, inputs=msg, label="Non-RAG examples")
    gr.Examples(EXAMPLES_BBC, inputs=msg, label="RAG with BBC Sports examples")
    buttons = [submit, regenerate, clear, load, delete_docs]
    # Sets the generate function to be triggered when the user submits a new message
    gr.on(
        triggers=[submit.click, msg.submit],
        fn=add_user_text,
        inputs=[msg, chatbot],
        outputs=[msg, chatbot, *buttons],
        concurrency_limit=1,
        queue=True,
    ).then(
        fn=generate,
        inputs=[chatbot, temperature, max_new_tokens, top_p, repetition_penalty, num_documents, rag],
        outputs=[chatbot, status, msg, *buttons],
        concurrency_limit=1,
        queue=True,
    )
    regenerate.click(
        fn=prepare_for_regenerate,
        inputs=chatbot,
        outputs=[chatbot, msg, *buttons],
        concurrency_limit=1,
        queue=True,
    ).then(
        fn=generate,
        inputs=[chatbot, temperature, max_new_tokens, top_p, repetition_penalty, num_documents, rag],
        outputs=[chatbot, status, msg, *buttons],
        concurrency_limit=1,
        queue=True,
    )
    clear.click(fn=reset_chatbot, inputs=None, outputs=[chatbot, msg, status], queue=True)
    load.upload(
        fn=load_files,
        inputs=[load],
        outputs=[status, msg, *buttons],
        concurrency_limit=1,
        queue=True,
    )
    delete_docs.click(fn=delete_documents, outputs=[status, msg, *buttons], concurrency_limit=1, queue=True)

In [None]:
demo.launch(server_name="0.0.0.0", server_port=7860, inline=False, inbrowser=False)

In [None]:
# demo.close()