### Dell Technologies Proof of Concept - RAG chatbot with multi format data CSV, PDF, PPT with sources tabs and RAG toggle 4bit
- Model:  Llama 3.2 3B Instruct (Mistral 7B Instruct is left commented out as an alternative)
- Vector database:  Chroma db
- Chain:  Langchain retrievalQAchain, huggingface pipeline
- GUI:  Gradio ChatInterface (the chat itself is not built with Blocks; Blocks is used only for the tabs)
- Workload:  CSV, PPT and PDF files
- Quantized to 4 bit

Features in Additional Inputs:
- Change persona ad hoc with adjustable system prompt
- Change model parameters with sliders (temp., top-p, top-k, max_tokens)
- Memory is intact and conversational using chat_history key
- Create all types of content such as email, product description, product comparison tables etc.
- Directly query / summarize a document given the title

Note: The software and sample files are provided “as is” and are to be used only in conjunction with this POC application. They should not be used in production and are provided without warranty or guarantees. Please use them at your own discretion.

<img src="images/RAG-diagram-dell-technologies.png" alt="Alternative text" />

### Huggingface tools

You need to log in to the Hugging Face Hub at least once so the tools and the embedding model can be downloaded.  After that you can comment this section out.

In [1]:
# get your account token from https://huggingface.co/settings/tokens
# this is a read-only test token

token = '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'

from huggingface_hub import login
login(token=token, add_to_git_credential=True)

Token is valid (permission: fineGrained).
Cannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your terminal in case you want to set the 'store' credential helper as default.

git config --global credential.helper store

Read https://git-scm.com/book/en/v2/Git-Tools-Credential-Storage for more details.
Token has not been saved to git credential helper.
Your token has been saved to /home/daol/.cache/huggingface/token
Login successful


### Install python libraries and applications

Use the % magic (for example %pip install) so packages are installed into this conda environment and not the OS Python.
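
The install cell in this export is empty. As a rough pre-flight check, the sketch below reports which modules cannot be found in the current environment. The module list is an assumption inferred from the imports used later in this notebook and may not match the original install list.

```python
from importlib.util import find_spec

# Module names inferred from the imports used later in this notebook (assumption).
REQUIRED_MODULES = [
    "torch", "transformers", "langchain", "langchain_community",
    "chromadb", "gradio", "bitsandbytes",
]

def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]

print("Missing modules:", missing_modules(REQUIRED_MODULES))
```

Anything listed as missing would need to be installed before the later cells can run.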

In [2]:
### Check installed GPU

In [3]:
# !nvidia-smi

### Assign GPU environment vars and ID order

NOTE:  To change which GPU is visible, set CUDA_VISIBLE_DEVICES to the ID of the GPU you prefer.
Only the listed GPUs are exposed to PyTorch, so workloads cannot land on the wrong GPU.

In [4]:
## THESE VARIABLES MUST APPEAR BEFORE TORCH OR CUDA IS IMPORTED
## set visible GPU devices and order of IDs to the PCI bus order
## target the L40S on ID 0 (set this to "1" to use the second card)

import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"   

## this integer corresponds to the ID of the GPU, for multiple GPU use "0,1,2,3"...
## to disable all GPUs, simply put empty quotes ""

os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"

### Investigate our GPU and CUDA environment

NOTE:  If only one GPU is made visible in the settings above, the active CUDA device is always 0, because it is the only GPU PyTorch can see.

In [5]:
import torch
import sys
import os
from subprocess import call
print('_____Python, Pytorch, Cuda info____')
print('__Python VERSION:', sys.version)
print('__pyTorch VERSION:', torch.__version__)
print('__CUDA RUNTIME API VERSION')
#os.system('nvcc --version')
print('__CUDNN VERSION:', torch.backends.cudnn.version())
print('_____nvidia-smi GPU details____')
call(["nvidia-smi", "--format=csv", "--query-gpu=index,name,driver_version,memory.total,memory.used,memory.free"])
print('_____Device assignments____')
print('Number CUDA Devices:', torch.cuda.device_count())
print ('Current cuda device: ', torch.cuda.current_device(), ' **May not correspond to nvidia-smi ID above, check visibility parameter')
print("Device name: ", torch.cuda.get_device_name(torch.cuda.current_device()))

_____Python, Pytorch, Cuda info____
__Python VERSION: 3.12.8 (main, Dec  6 2024, 19:59:28) [Clang 18.1.8 ]
__pyTorch VERSION: 2.5.1+cu124
__CUDA RUNTIME API VERSION
__CUDNN VERSION: 90100
_____nvidia-smi GPU details____
index, name, driver_version, memory.total [MiB], memory.used [MiB], memory.free [MiB]
0, NVIDIA L40S, 550.120, 46068 MiB, 17145 MiB, 28445 MiB
1, NVIDIA L40S, 550.120, 46068 MiB, 1 MiB, 45589 MiB
_____Device assignments____
Number CUDA Devices: 1
Current cuda device:  0  **May not correspond to nvidia-smi ID above, check visibility parameter
Device name:  NVIDIA L40S


### Assign single GPU to device variable

This command sets the DEVICE variable to "cuda" (the first visible GPU) if PyTorch can reach a GPU through CUDA.  Otherwise it falls back to "cpu".

In [6]:
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
print(DEVICE)

cuda


In [7]:
from langchain import HuggingFacePipeline, PromptTemplate

### import loaders
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain_community.document_loaders import PyPDFLoader
from langchain.document_loaders import CSVLoader
from langchain_community.document_loaders import UnstructuredPowerPointLoader

### for embedding
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.text_splitter import CharacterTextSplitter

#from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

### for langchain chain
from langchain.memory import ConversationBufferWindowMemory
from langchain.chains import RetrievalQA
#from langchain.chains import ConversationalRetrievalChain
from transformers import AutoTokenizer, pipeline, TextIteratorStreamer, AutoModelForCausalLM
from langchain.chains import LLMChain


### status bars and UI and other accessories
import gradio as gr
import json

### Clear GPU memory from any previous runs
- Assumes the NVIDIA drivers are installed.
- When a notebook is run repeatedly, much of the GPU memory from earlier runs can stay allocated in the cache.  Depending on the size of the GPU, this can cause out-of-memory errors on the next run, so clear the cache or restart the kernel.
- Running nvidia-smi shows the GPUs, their memory usage, any running processes and the CUDA version.

In [8]:
# import gc
# gc.collect()
# torch.cuda.empty_cache()

### Clear the previous run vector database

This is optional, the vector db will be rebuilt.  For a completely fresh run you can delete the local folder.

In [9]:
# ## remove chroma vector db local db folder from previous run

# !rm -rf "db2"

### Add PDF directory of files

- ESG Summary (Environmental, Social and Governance) report
- Dell leadership org chart

In [10]:
pdf_dir_loader = PyPDFDirectoryLoader("pdf-files-infohub/")

### Add CSV files

CSV files are vectorized row by row: a file with 100 rows produces 100 docs of vectorized data.

CSVLoader loads each row into a Langchain document object.
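
To make the row-per-document idea concrete, here is a minimal standard-library sketch that turns each CSV row into one document-like dict. It imitates the general shape of what CSVLoader produces, but it is not the LangChain implementation, and the function name and sample data are hypothetical.

```python
import csv
import io

def csv_rows_to_docs(csv_text, source="example.csv"):
    """Turn each CSV data row into one document dict, similar in spirit to CSVLoader."""
    reader = csv.DictReader(io.StringIO(csv_text))
    docs = []
    for row_number, row in enumerate(reader):
        # One "column: value" line per field, joined into a single text block.
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        docs.append({"page_content": content,
                     "metadata": {"source": source, "row": row_number}})
    return docs

# Hypothetical sample data, not taken from the POC files.
sample = "session,room\nKeynote,Hall A\nGenAI workshop,Room 12\n"
docs = csv_rows_to_docs(sample)
print(len(docs))
print(docs[0]["page_content"])
```

Because every row becomes its own document, the column headers are repeated in each one, which gives the embedding model some context for otherwise bare values.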

#### load events

NOTE:  If you switch files you might hit a decoding error such as:  UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 3519: invalid start byte.  Loading the file with windows-1252 encoding fixes this.

Your CSV files should be free of unexpected characters.  0xa0 is the non-breaking space character in windows-1252.
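
The decoding failure can be reproduced in a few lines. This is a small sketch with a made-up byte string: the same bytes fail as UTF-8 but decode as windows-1252, after which the non-breaking space can be swapped for a plain one.

```python
raw = b"Hall\xa0A"  # 0xa0 is a non-breaking space in windows-1252

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

decoded = raw.decode("windows-1252")
cleaned = decoded.replace("\u00a0", " ")  # swap the non-breaking space for a plain one

print(utf8_ok)
print(repr(cleaned))
```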

In [11]:
events_loader = CSVLoader("csv-files/events-schedule-dtw24.csv", encoding='windows-1252')

#### load general info about conference


In [12]:
general_info_loader = CSVLoader("csv-files/concierge-question-answer-list.csv", encoding='windows-1252')


### Add Powerpoint files

#### load powerpoint file here

In [13]:
ppt_loader = UnstructuredPowerPointLoader("ppt-files/pan-dell-gen-ai-ppt-3pages.pptx")

### Merge all dataset contents into one set of docs

NOTE:  If the merge fails, especially when loading CSV files, check the file for unusual characters such as malformed double quotes, malformed apostrophes or non-ASCII characters.  Open the file in Vim and remove them.
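
As an alternative to checking by eye, a short helper can list where the suspicious bytes are. This is a sketch only: the function name is hypothetical and the sample bytes are invented. For a real file, read it with open(path, "rb").read() first.

```python
def find_non_ascii(data: bytes):
    """Return (line_number, column, hex_byte) for every byte above 0x7f."""
    hits = []
    for line_number, line in enumerate(data.split(b"\n"), start=1):
        for column, byte in enumerate(line, start=1):
            if byte > 0x7F:
                hits.append((line_number, column, hex(byte)))
    return hits

# Hypothetical sample containing a non-breaking space and two curly quotes.
sample = b"title,room\nKeynote,Hall\xa0A\nWorkshop,\x93Room 12\x94\n"
for hit in find_non_ascii(sample):
    print(hit)
```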

In [14]:
from langchain_community.document_loaders.merge import MergedDataLoader

loader_all = MergedDataLoader(loaders=[
    events_loader, 
    general_info_loader, 
    pdf_dir_loader, 
    ppt_loader,
])

docs = loader_all.load()
len(docs)

1586

### Declare embedding model to use

Use the INSTRUCTOR embedding model (hkunlp/instructor-large) to convert the text into vectors

In [15]:
embeddings = HuggingFaceInstructEmbeddings(
    model_name="hkunlp/instructor-large", model_kwargs={"device": DEVICE}
)

load INSTRUCTOR_Transformer
max_seq_length  512




### Chunk text

<b>chunk size large</b>:  A large chunk size (800 or higher) helps when responses should contain broad overviews and summaries, which suits content creation tasks.

<b>chunk size small</b>:  A smaller chunk size (below 800) is better when you want specific answers extracted from your knowledge base.

<b>chunk overlap</b>:  If the paragraphs in your PDFs often refer back to earlier content, as in a long whitepaper, use a generous overlap of 128 or higher.  The right value depends on the content.

https://dev.to/peterabel/what-chunk-size-and-chunk-overlap-should-you-use-4338
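
The effect of chunk size and overlap can be seen with a simplified sliding-window splitter. This is a sketch for illustration, not LangChain's algorithm: CharacterTextSplitter first splits on a separator and then merges the pieces, whereas this hypothetical function simply cuts fixed-size character windows.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size character windows that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

text = "abcdefghijklmnopqrstuvwxyz"
print(sliding_window_chunks(text, chunk_size=10, chunk_overlap=0))
print(sliding_window_chunks(text, chunk_size=10, chunk_overlap=4))
```

With overlap, the end of one chunk is repeated at the start of the next, so a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of more chunks to embed.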

In [16]:
# text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=32)

text_splitter = CharacterTextSplitter(chunk_size=512, chunk_overlap=0)
texts = text_splitter.split_documents(docs)
len(texts)

1588

### Create the vector database
- Take the embeddings and place them into the vector db
- Stored locally on prem
- NOTE: if you get a file handler error related to hnswlib, do the following:
  - pip uninstall hnswlib
  - pip uninstall chroma-hnswlib
  - pip install chroma-hnswlib
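
What the vector db does at query time can be sketched as a brute-force similarity search. This is an illustration under stated assumptions: Chroma itself uses an hnswlib index rather than an exhaustive scan, its distance metric is configurable, and the three-dimensional vectors and texts below are invented.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vector, stored, k):
    """Return the k stored texts whose vectors are most similar to the query."""
    ranked = sorted(stored, key=lambda item: cosine_similarity(query_vector, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Hypothetical 3-dimensional "embeddings"; real ones have hundreds of dimensions.
stored = [
    ("keynote schedule", [0.9, 0.1, 0.0]),
    ("ESG report summary", [0.0, 0.2, 0.9]),
    ("expo floor map", [0.7, 0.6, 0.1]),
]
print(top_k([1.0, 0.2, 0.0], stored, k=2))
```

The k here plays the same role as the "Number of source docs" slider in the GUI: it sets how many of the closest chunks are handed to the LLM as context.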

In [18]:
import time
start = time.perf_counter()
vectordb = Chroma.from_documents(texts, embeddings, persist_directory="db2")
print(f"\nTime to complete: {time.perf_counter() - start:.1f} s")


In [19]:
## Load the vector db if you have already created it.
## To build a new one instead, comment this cell out and run the loader, splitter and Chroma.from_documents cells above.

vectordb = Chroma(persist_directory="./db2", embedding_function=embeddings)


#### Get unique files embedded into vectordb

In [20]:
db = vectordb
print("\nEmbedding keys:", db.get().keys())
print("\nNumber of embedded docs:", len(db.get()["ids"]))


Embedding keys: dict_keys(['ids', 'embeddings', 'metadatas', 'documents', 'uris', 'data'])

Number of embedded docs: 1588


In [21]:
## get list of all file URLs in vector db

def get_unique_files():
    
    db = vectordb
    print("\nEmbedding keys:", db.get().keys())
    print("\nNumber of embedded docs:", len(db.get()["ids"]))
    
    unique_list = list({doc["source"] for doc in db.get()["metadatas"]})

    print("\nList of unique files in db:\n")
    for unique_file in unique_list:
        print(unique_file)

    pretty_files = json.dumps(unique_list, indent=4, default=str)

    return pretty_files


In [22]:
# get_unique_files()

#### Prepare Instruct model

The 'instruct' version of a model has been fine-tuned to follow prompted instructions. These models 'expect' to be asked to do something.  They are good at performing tasks, rather than chatting.

In [23]:
#model_id = "mistralai/Mistral-7B-Instruct-v0.2"
model_id = "meta-llama/Llama-3.2-3B-Instruct"
# model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="cuda")

#### Quantization Configuration
Great video on this:  https://www.youtube.com/watch?v=eovBbABk3hw&ab_channel=Rohan-Paul-AI

Bitsandbytes stores the weights in 4 bits; the computations still run in 16 or 32 bit, depending on the compute dtype (bnb_4bit_compute_dtype, set to bfloat16 below).
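
A back-of-the-envelope calculation shows why 4-bit storage matters. The sketch below estimates memory for the weights only. It ignores quantization constants, any layers left unquantized, activations and the KV cache, and the parameter count is a round figure assumed for a "3B" model, so real usage will be higher.

```python
def weight_memory_gib(num_parameters, bits_per_weight):
    """Approximate memory needed for the weights alone, in GiB."""
    total_bytes = num_parameters * bits_per_weight / 8
    return total_bytes / (1024 ** 3)

params = 3_000_000_000  # round figure for a "3B" model (assumption)
for bits in (16, 4):
    print(f"{bits}-bit weights: {weight_memory_gib(params, bits):.2f} GiB")
```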


In [24]:
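from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

## store the weights as 4-bit NF4, run the matrix math in bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,   ## also quantize the quantization constants
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
## AutoModelForCausalLM selects the model class from model_id, so either model_id above can be used
model = AutoModelForCausalLM.from_pretrained(
    model_id,
#    load_in_4bit=True,
#    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config,
    device_map="auto",

)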

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

#### Initialize tokenizer

In [25]:
tokenizer = AutoTokenizer.from_pretrained(model_id) 
tokenizer.use_default_system_prompt = False

tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

#### Check GPU memory usage after model load

In [26]:
!nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv

pid, process_name, used_gpu_memory [MiB]
1471035, /home/daol/maymust-dell-example/RAG-chatbot-multiformat/.venv/bin/python, 6800 MiB
1482498, /bin/python3, 6410 MiB
1484926, /home/daol/maymust-dell-example/RAG-chatbot-multiformat/.venv/bin/python, 3920 MiB
1496039, /home/daol/maymust-dell-example/RAG-chatbot-multiformat/.venv/bin/python, 3956 MiB


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


#### Print interesting metrics

In [27]:
def get_model_info ():

    model_details = (
    
    f"\nGeneral Model Info:\n"
    f"\n-------------------\n"
    
    f"\n Model_id: {model_id} \n"
    f"\n Model config: {model} \n"

    f"\nGeneral Embeddings Info:\n"
    f"\n-------------------\n"

    f"\n Embeddings model config: {embeddings} \n" 

    )
        
    return model_details
    


In [28]:
# get_model_info()

### Constants

Used to initialize the advanced settings sliders in the GUI

In [29]:
MAX_MAX_NEW_TOKENS = 2048
DEFAULT_MAX_NEW_TOKENS = 1024
#MAX_INPUT_TOKEN_LENGTH = int(os.getenv("MAX_INPUT_TOKEN_LENGTH", "4096"))

### Chat Memory
To have a positive, realistic chat experience the LLM needs to access a form of memory.  Memory for the LLM chat is basically a copy of the chat history that is given to the LLM as reference.  
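
The idea of a windowed chat memory can be sketched with a bounded deque. This is only an illustration of keeping the last k exchanges; it is not how ConversationBufferWindowMemory is implemented, and the class name and sample dialogue are hypothetical.

```python
from collections import deque

class WindowMemory:
    """Keep only the most recent k question/answer exchanges."""

    def __init__(self, k):
        self.turns = deque(maxlen=k)  # older turns fall off automatically

    def save(self, question, answer):
        self.turns.append((question, answer))

    def as_text(self):
        """Render the kept turns as a transcript that could be pasted into a prompt."""
        return "\n".join(f"Human: {q}\nAI: {a}" for q, a in self.turns)

memory_sketch = WindowMemory(k=2)
memory_sketch.save("Where is the keynote?", "In the main hall.")
memory_sketch.save("What time?", "At 10 am.")
memory_sketch.save("Who is speaking?", "The CEO.")
print(memory_sketch.as_text())
```

A small window keeps the prompt short, but the model loses anything said more than k turns ago.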

In [30]:
####### MEMORY PARAMETERS ###########

memory = ConversationBufferWindowMemory(
    k=5, ## number of interactions to keep in memory
    memory_key="chat_history",
    return_messages=True,  ## formats the chat_history into HumanMessage and AImessage entity list
    input_key="query",   ### for straight retrievalQA chain
    output_key="result"   ### for straight retrievalQA chain
)



### Main Process Input Function

This is the function that orchestrates all the major components such as:
- user variable input from the GUI
- prompt template
- pipeline setup
- chain setup
- response output

In [31]:
### this chunk works, however it gives constant clarifying questions... annoying but the responses are pretty decent sometimes.
def process_input(
    question,
    chat_history,
    rag_toggle,
    system_prompt,
    source_docs_qty,
    max_new_tokens,
    temperature,
    top_p,
    top_k,
    repetition_penalty
                 ):

#     print("1", question)
#     print("2", chat_history)
#     print("3", rag_toggle)
#     print("4", system_prompt)
#     print("5", source_docs_qty)
    
    
    
    global response

    
    ### system prompt variable is typed in by the user in Gradio advanced settings text box and sent into process_input function
    ### This is the Llama 2 [INST] prompt format
    ### https://huggingface.co/blog/llama2#how-to-prompt-llama-2
    ### NOTE: Llama 3.x Instruct models define a different chat template, so adjust these templates to match the model_id in use
    
    ### the parentheses make .strip() apply to the whole template, not only to the last string literal
    prompt_template_rag = ("\n\n [INST] <<SYS>>" + system_prompt + "<</SYS>>\n\n Context: {context} \n\n  Question: {question} \n\n[/INST]").strip()


    PROMPT_rag = PromptTemplate(template=prompt_template_rag, input_variables=["context", "question"])


    prompt_template_llm = ("\n\n [INST] <<SYS>>" + system_prompt + "<</SYS>>\n\n Question: {question} \n\n[/INST]").strip()


    PROMPT_llm = PromptTemplate(template=prompt_template_llm, input_variables=["question"])
    
        
    
    ####### STREAMER FOR TEXT OUTPUT ############
    
    streamer = TextIteratorStreamer(tokenizer, timeout=10.0, skip_prompt=True, skip_special_tokens=True)

    ####### PIPELINE ARGUMENTS FOR THE LLM ############
    ### more info at https://towardsdatascience.com/decoding-strategies-in-large-language-models-9733a8f70539
    
    text_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    do_sample=True,
    streamer=streamer,
    max_new_tokens=max_new_tokens,
    top_p=top_p,
    top_k=top_k,
    temperature=temperature,
    repetition_penalty=repetition_penalty,
    )

    ####### ATTACH PIPELINE TO LLM ############

    llm = HuggingFacePipeline(pipeline=text_pipeline)
        

    llmchain = LLMChain(llm=llm, prompt=PROMPT_llm)


    ###### RETRIEVAL QA FROM CHAIN TYPE PARAMS ###########
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        chain_type_kwargs={"prompt": PROMPT_rag},
#        retriever=vectordb.as_retriever(search_type="similarity", search_kwargs={"k": 3}),
        retriever=vectordb.as_retriever(search_type="similarity", search_kwargs={"k": source_docs_qty}),
        memory=memory,
        verbose=True,
        return_source_documents = True,
        )

    
    
    ####### RUN THE CHAIN IN A BACKGROUND THREAD ###########
    ## the chain call blocks until generation is done, so run it in a thread
    ## and read the streamer while tokens are still being produced

    from threading import Thread

    def run_chain():
        global response
        if rag_toggle:
            response = qa_chain({"query": question})
        else:
            response = llmchain({"question": question})

    Thread(target=run_chain).start()

    ####### MANAGE OUTPUT ARRAY FROM STREAMER ###########
    ## join the text pieces received so far and yield the running string
    ## yield allows streaming in Gradio
    
    outputs = []
    for text in streamer:
        outputs.append(text)
        yield "".join(outputs)

### Show sources function

Sources are critical to demonstrate that the LLM's response is true to the source and not hallucinated.  This function uses the global "response" variable created in the process_input function. The content is serialized with json.dumps and shown in the GUI textbox.

In [32]:
def get_sources():

    ### "response" is the global set by process_input; it does not exist before the first chat turn
    last_response = globals().get("response") or {}

    res_dict = {
        "answer_from_llm": last_response.get("result"),   ### looks up result key from raw output
    }
    
    ### with RAG switched off the LLMChain output has no source documents, so this list stays empty
    res_dict["source_documents"] = []

    for each_source in last_response.get("source_documents", []):
        res_dict["source_documents"].append({
            "page_content": each_source.page_content,
            "metadata":  each_source.metadata
        })

    # print(res_dict["answer_from_llm"])  ### PRINT JUST THE RAW ANSWER FROM LLM
    
    pretty_sources = json.dumps(res_dict["source_documents"], indent=4, default=str)

    print(pretty_sources)
    
    return pretty_sources


### Build the Gradio GUI
- Gradio is a quick, highly customizable UI package for your python applications:  https://www.gradio.app/
- Combined with langchain, gradio can trigger multiple chains for a wide variety of user interactions.

<b>NOTE</b>:  Gradio passes the widget values to the processing function in the order the widgets appear in additional_inputs.  The values are not passed by name.  For example, the temperature slider is the fifth entry in additional_inputs, so it arrives as the seventh positional argument, after the question and the chat history, and not as a keyword argument called "temperature".  The parameters of your chat processing function must therefore be declared in the same order as the widgets.
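
The positional hand-off can be simulated without Gradio. In this sketch the function name and the sample values are hypothetical, and the star-unpacking call only imitates how a chat callback receives the message, the history and then the widget values in order.

```python
def process_input_sketch(question, chat_history, rag_toggle, system_prompt,
                         source_docs_qty, max_new_tokens, temperature):
    """Stand-in for the chat function; only echoes some of what it received."""
    return {"rag_toggle": rag_toggle, "source_docs_qty": source_docs_qty,
            "temperature": temperature}

# Values in the same order as the widgets would be listed in additional_inputs.
additional_values = [True, "You are a helpful concierge.", 3, 1024, 0.6]

# The message and history come first, then the widget values by position.
received = process_input_sketch("Where is the expo?", [], *additional_values)
print(received)
```

If two widgets are swapped in the list without swapping the matching parameters, the function still runs but silently receives the wrong values, which is why the ordering deserves care.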

#### Access the UI
- The provided code forces Gradio to create a small web server on the local host the notebook is being served from
- Gradio will provide a URL that can be used in a web browser, that must be accessed from within the same network, so you may need to access it using a jumphost.  In this case we used a Windows jump host and Chrome browser on the same network to access the page.

In [33]:
chat_interface = gr.ChatInterface(
    
    ### call the main process function above
    
    fn=process_input, 

    ### format the dialogue box, add company avatar image
    
    chatbot = gr.Chatbot(
        bubble_full_width=False,
        avatar_images=(None, (os.path.join(os.path.dirname("__file__"), "images/dell-logo-sm.jpg"))),
    ),

    
    
    additional_inputs=[
        
        
        gr.Checkbox(label="Use RAG", 
                    value=True, 
                    info="Query LLM directly or query the RAG chain"
                    ),
        
        
        gr.Textbox(label="Persona and role for system prompt:", 
                lines=3, 
                value="""Your name is Andie, a helpful concierge at the Dell Tech World conference held in Las Vegas.\
                Please respond as if you were talking to someone using spoken English language.\
                The first word of your response should never be Answer:.\
                You are given a list of helpful information about the conference.\
                Your goal is to use the given information to answer attendee questions.\
                Please do not provide any additional information other than what is needed to directly answer the question.\
                You do not need to show or refer to your sources in your responses.\
                Please do not make up information that is not available from the given data.\
                If you can't find the specific information from the given context, please say that you don't know.\
                Please respond in a helpful, concise manner.\
                """

                ),

        gr.Slider(
            label="Number of source docs",
            minimum=1,
            maximum=10,
            step=1,
            value=3,
        ),
        
        gr.Slider(
            label="Max new words (tokens)",
            minimum=1,
            maximum=MAX_MAX_NEW_TOKENS,
            step=1,
            value=DEFAULT_MAX_NEW_TOKENS,
        ),
        gr.Slider(
            label="Creativity (Temperature), higher is more creative, lower is less creative:",
            minimum=0.1,
            maximum=1.99,
            step=0.1,
            value=0.6,
        ),
        gr.Slider(
            label="Top probable tokens (Nucleus sampling top-p), affects creativity:",
            minimum=0.05,
            maximum=1.0,
            step=0.05,
            value=0.9,
        ),
        gr.Slider(
            label="Number of top tokens to choose from (Top-k):",
            minimum=1,
            maximum=100,
            step=1,
            value=50,
        ),
        gr.Slider(
            label="Repetition penalty:",
            minimum=1.0,
            maximum=1.99,
            step=0.05,
            value=1.2,
        ),
    ],

    
    stop_btn=None,
    
    examples=[

        ## events csv content
        ["Which booths are found in the showcase floor at Dell Technologies World 2024?"],
        ["What are some common use cases for GenAI?"],
        ["Where is the Charting the Generative AI landscape in healthcare session going to be held?"],
        ["Who is hosting the Understanding GenAI as a workload in a multicloud world session?"],
        ["What enterprise Retrieval Augmented Generation solutions does Dell offer?"],

        ## Powerpoint content
        ["What are some of the results of the Dell Generative AI Pulse Survey?"],
        

        ## pdf content, content creation, workplace productivity
        ["What is Dell's ESG policy in one sentence?"],
        ["Would you please write a professional email response to John explaining the benefits of Dell Powerflex."],
        ["Create a new advertisement for Dell Technologies PowerEdge servers. Please include an interesting headline and product description."],
        ["Create 3 engaging tweets highlighting the key advantages of using Dell Technologies solutions for Generative AI."],
        ["What are the key steps in designing a secure and scalable on-premises solution for GenAI workloads with Dell?"],
        ["Summarize the significant developments from Dell's latest SEC filings."],

    ],

)


In [34]:
###  SET GRADIO INTERFACE THEME (https://www.gradio.app/guides/theming-guide)

#theme = gr.themes.Soft()
#theme = gr.themes.Glass()
#theme = gr.themes.Base()

theme = gr.themes.Default()

#### Tabbed interfaces one for chat one for sources

In [35]:
### set width and margins in local css file
### set Title in a markdown object at the top, then render the chat interface

with gr.Blocks(theme=theme, css="style.css", title="RAG Chat CSV PDF PPT") as demo:
    gr.Markdown(
    """
    # Retrieval Digital Assistant
    """)

    with gr.Tab("Chat Session"):

        chat_interface.render()

    with gr.Tab("Source Citations"):
            
        source_text_box = gr.Textbox(label="Reference Sources")
        get_source_button = gr.Button("Get Source Content")
        get_source_button.click(fn=get_sources, inputs=None, outputs=source_text_box)


    with gr.Tab("Database Files"):


        files_text_box = gr.Textbox(label="Uploaded Files")
        get_files_button = gr.Button("List Uploaded Files")
        get_files_button.click(fn=get_unique_files, inputs=None, outputs=files_text_box)


    with gr.Tab("Model Info"):


        model_info_text_box = gr.Textbox(label="Model Info")
        model_info_button = gr.Button("Get Model Info")
        model_info_button.click(fn=get_model_info, inputs=None, outputs=model_info_text_box)


In [None]:
if __name__ == "__main__":
    demo.queue(max_size=5)  ## sets up websockets for bidirectional comms and no timeouts, set a max number users in queue
    demo.launch(share=True, debug=True, server_name="localhost", server_port=7810, allowed_paths=["images/dell-logo-sm.jpg"])

Running on local URL:  http://localhost:7810
Running on public URL: https://a4cdb1573d52fd654e.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


Device set to use cuda:0
> Entering new RetrievalQA chain...

> Finished chain.


### Inspiration code:

https://huggingface.co/spaces/huggingface-projects/llama-2-7b-chat <br>

### Author:
David O'Dell - Solutions and AI Tech Marketing Engineer

In [None]:
# !conda env export --name rag | grep -v "^prefix: "