### Dell Technologies Proof of Concept - RAG Llama2-Chat-7b-hf 4 bit PDF Retrieval Knowledgebase Assistant
- Model:  llama2-7b-chat-hf  (4 bit)
- Vector database:  Chroma db
- Chain:  LangChain RetrievalQAWithSourcesChain, Hugging Face pipeline
- GUI:  Gradio ChatInterface (rendered inside a Blocks wrapper for the title and theme)
- Workload:  RAG PDF knowledgebase
- limited PDF file dataset from https://infohub.delltechnologies.com/

Features in Additional Inputs:
- Change persona ad hoc with adjustable system prompt
- Change model parameters with sliders (temperature, top-p, top-k, max new tokens, repetition penalty)
- Memory is intact and conversational using chat_history key
- Create all types of content such as email, product description, product comparison tables etc.
- Directly query / summarize a document given the title

Note: The software and sample files are provided “as is” and are to be used only in conjunction with this POC application. They should not be used in production and are provided without warranty or guarantees. Please use them at your own discretion.


<img src="images/RAG-diagram-dell-technologies.png" alt="RAG architecture diagram" />

### Huggingface tools

You need to log in to the Hugging Face Hub at least once to access the Hub tools and download the embedding model.  After that, you can comment this section out.

In [None]:
from huggingface_hub import login

token = 'YOUR_TOKEN'
login(token=token, add_to_git_credential=True)
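
A minimal sketch of an alternative to pasting the token into the notebook: read it from an environment variable. The helper name `get_hf_token` is hypothetical and not part of the original notebook; `HF_TOKEN` is assumed to be exported in the shell that starts Jupyter.

```python
import os


def get_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the Hugging Face token stored in an environment variable.

    Hypothetical helper for illustration; raises if the variable is missing
    or empty so the failure is visible before any download starts.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable before logging in.")
    return token
```

The result can then be passed to the login call above, for example `login(token=get_hf_token())`, so the token never appears in the saved notebook.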

### Assign GPU environment vars and ID order

NOTE:  To change which GPU is visible, set `CUDA_VISIBLE_DEVICES` to the ID of the GPU you prefer.
Only the GPUs listed there are visible to this process, so workloads cannot land on any other GPU.

In [None]:
## THESE VARIABLES MUST APPEAR BEFORE TORCH OR CUDA IS IMPORTED
## set visible GPU devices and order of IDs to the PCI bus order
## example: to target an L40S that sits at ID 1, set the value below to "1"

import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"   

## this integer corresponds to the ID of the GPU, for multiple GPU use "0,1,2,3"...
## to disable all GPUs, simply put empty quotes ""

os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
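
The sketch below illustrates how the visibility setting renumbers devices: CUDA assigns logical indices starting at 0 in the order the IDs are listed. The function `visible_gpu_map` is a hypothetical helper written for this illustration; it only parses the variable and does not query any GPU.

```python
import os


def visible_gpu_map(env=None):
    """Map the logical CUDA index seen by the process to the listed GPU ID.

    Hypothetical helper: returns None when CUDA_VISIBLE_DEVICES is unset
    (all GPUs visible, no remapping) and an empty dict when it is "".
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    ids = [part.strip() for part in raw.split(",") if part.strip()]
    # Logical index 0 is the first listed ID, index 1 the second, and so on.
    return {logical: physical for logical, physical in enumerate(ids)}
```

For example, with the variable set to `"1"` the only entry is `{0: "1"}`: the process addresses the GPU as device 0 even though `nvidia-smi` lists it as ID 1.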

### Investigate our GPU and CUDA environment

NOTE:  If only a single GPU is visible under the settings above, the active CUDA device will always be 0, because it is the only GPU the process can see.

In [None]:
import torch
import sys
import os
from subprocess import call
print('_____Python, Pytorch, Cuda info____')
print('__Python VERSION:', sys.version)
print('__pyTorch VERSION:', torch.__version__)
print('__CUDA VERSION (PyTorch build):', torch.version.cuda)
#os.system('nvcc --version')  ## optional: toolkit version, only if nvcc is installed
print('__CUDNN VERSION:', torch.backends.cudnn.version())
print('_____nvidia-smi GPU details____')
call(["nvidia-smi", "--format=csv", "--query-gpu=index,name,driver_version,memory.total,memory.used,memory.free"])
print('_____Device assignments____')
print('Number CUDA Devices:', torch.cuda.device_count())
print('Current cuda device: ', torch.cuda.current_device(), ' **May not correspond to nvidia-smi ID above, check visibility parameter')
print("Device name: ", torch.cuda.get_device_name(torch.cuda.current_device()))

### Assign single GPU to device variable

This command sets the DEVICE variable to "cuda" if PyTorch can reach the GPU through CUDA; otherwise it falls back to "cpu".  With a single visible GPU, "cuda" resolves to device 0.

In [1]:
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
print(DEVICE)


In [None]:
from langchain import HuggingFacePipeline, PromptTemplate
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.memory import ConversationBufferWindowMemory
#from langchain.chains import RetrievalQA
from langchain.chains import RetrievalQAWithSourcesChain
from pdf2image import convert_from_path
from transformers import AutoTokenizer, pipeline, TextIteratorStreamer, AutoModelForCausalLM
from tqdm import tqdm
import time
import gradio as gr

### Clear GPU memory from any previous runs
- Assumes the NVIDIA drivers are installed
- When a notebook is run repeatedly, much of the GPU memory can remain in the allocated cache.  Depending on the size of the GPU, this can cause out-of-memory errors during the next run, so it is advised to clear the cache or restart the kernel.
- To check the result, rerun the `nvidia-smi` query in the environment cell above, which reports used and free memory per GPU

In [None]:
import gc
gc.collect()
torch.cuda.empty_cache()

### Clear the previous run vector database

This step is optional; the vector db will be rebuilt.  For a completely fresh run, delete the local folder.

In [None]:
## remove chroma vector db local db folder from previous run
!rm -rf "vector-db"
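
The shell command above relies on a Unix-like environment. As a portable sketch, the same cleanup can be done with the Python standard library; `reset_vector_store` is a hypothetical helper name introduced only for this example.

```python
import shutil
from pathlib import Path


def reset_vector_store(path: str) -> bool:
    """Delete a persisted vector store folder.

    Hypothetical helper: returns True if a folder was removed and False
    if there was nothing to delete.
    """
    folder = Path(path)
    if not folder.is_dir():
        return False
    shutil.rmtree(folder)  # removes the folder and everything inside it
    return True
```

Calling `reset_vector_store("vector-db")` before building the database would have the same effect as the cell above.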

### Prepare data from knowledge base

- load the PDF files
- load an instruction-tuned embedding model, then split the content into chunks that will be embedded

In [None]:
loader = PyPDFDirectoryLoader("pdfs-dell-infohub")
docs = loader.load()
len(docs)

### Load the Instructor embedding model

In [None]:
embeddings = HuggingFaceInstructEmbeddings(
    model_name="hkunlp/instructor-large", model_kwargs={"device": DEVICE}
)

### Chunk text

<b>chunk size large</b>:  If you want to provide large text overviews and summaries in your responses (appropriate for content creation tasks), a large chunk size is helpful: 800 characters or higher.

<b>chunk size small</b>:  If you are looking for specific answers based on extracted content from your knowledge base, a smaller chunk size is better: smaller than 800 characters.

<b>chunk overlap</b>:  If the paragraphs in your PDFs often refer to previous content in the document, as in a large whitepaper, use a generous overlap: 128 characters or higher, depending on the content.

https://dev.to/peterabel/what-chunk-size-and-chunk-overlap-should-you-use-4338

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=32)
texts = text_splitter.split_documents(docs)
len(texts)
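
To make the interaction between chunk size and overlap concrete, here is a simplified fixed-window splitter. It is only an illustration: the function `chunk_text` is hypothetical, and it does not reproduce how `RecursiveCharacterTextSplitter` works, which also tries to break on separators such as paragraphs and sentences.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only; each window starts
    (chunk_size - chunk_overlap) characters after the previous one.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

With `chunk_size=4` and `chunk_overlap=1`, the string `"abcdefghij"` splits into `"abcd"`, `"defg"`, and `"ghij"`: each chunk repeats the last character of the previous one, which is how overlap preserves context across chunk boundaries.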

### Create the vector database
- embed each chunk and store the resulting vectors in the vector db
- persisted locally, on premises, in the `vector-db` folder

In [None]:
%%time
vectordb = Chroma.from_documents(texts, embeddings, persist_directory="vector-db")
print('\n' + 'Time to complete:')
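
As a rough sketch of what a similarity search over stored embeddings does, the toy example below ranks a few hand-made vectors by cosine similarity. The function names and the three-dimensional vectors are hypothetical; the actual distance metric and index used by Chroma are configurable and may differ from this illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, store, k=4):
    """Return the texts of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy "vector db": (text, embedding) pairs with made-up 3-d embeddings.
toy_store = [
    ("storage", [1.0, 0.0, 0.0]),
    ("servers", [0.0, 1.0, 0.0]),
    ("networking", [0.0, 0.0, 1.0]),
    ("block storage", [0.9, 0.1, 0.0]),
]
```

Querying `top_k([1.0, 0.0, 0.0], toy_store, k=2)` returns the two storage entries, because their vectors point in nearly the same direction as the query.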

### Prepare Chat model

Llama 2 7B Chat was chosen for this use case because it is optimized for dialogue.  https://huggingface.co/meta-llama/Llama-2-7b-chat-hf


In [None]:
model_id = "meta-llama/Llama-2-7b-chat-hf"

### Choose precision version

16-bit (half) precision requires at least 13 GB of GPU memory to run this notebook, and memory usage will grow as the chat continues.

4-bit precision is available through the Hugging Face `load_in_4bit` parameter, which uses bitsandbytes; simply switch the commented model line.

In [None]:
## 16-bit (half) precision, larger footprint on GPU
# model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="cuda")

### 4 bit precision, smaller GPU memory, must allow auto device or cuda:0
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True, torch_dtype=torch.float32, device_map="auto")
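
The memory figures above can be approximated with simple arithmetic. The sketch below counts the model weights only, assumes a nominal 7 billion parameters, and ignores quantization overhead, activations, and the growing attention cache; `weight_memory_gib` is a hypothetical helper for illustration.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights alone, in GiB.

    Illustration only: bytes = parameters * bits / 8, then convert to GiB.
    """
    return num_params * bits_per_param / 8 / 1024**3


# Nominal 7B-parameter model (assumption, the exact count is slightly lower).
fp16_gib = weight_memory_gib(7e9, 16)
int4_gib = weight_memory_gib(7e9, 4)
```

Under these assumptions the 16-bit weights come to roughly 13 GiB and the 4-bit weights to a little over 3 GiB, which is why the 4-bit option fits on much smaller GPUs.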

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.use_default_system_prompt = False

### Constants

Used to initialize the advanced settings sliders in the GUI

In [None]:
MAX_MAX_NEW_TOKENS = 2048
DEFAULT_MAX_NEW_TOKENS = 1024
#MAX_INPUT_TOKEN_LENGTH = int(os.getenv("MAX_INPUT_TOKEN_LENGTH", "4096"))

### Chat Memory
To have a positive, realistic chat experience the LLM needs to access a form of memory.  Memory for the LLM chat is basically a copy of the chat history that is given to the LLM as reference.  

In [None]:
####### MEMORY PARAMETERS ###########

memory = ConversationBufferWindowMemory(
    k=5, ## number of interactions to keep in memory
    memory_key="chat_history",
    return_messages=True,  ## formats the chat_history into HumanMessage and AImessage entity list
    input_key="question",
    output_key="answer"
)
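
The windowing idea behind `k=5` can be sketched with a bounded queue: only the most recent `k` exchanges are kept and older ones are dropped. The class `WindowMemory` is a hypothetical stand-in written for this illustration and is not the LangChain implementation.

```python
from collections import deque


class WindowMemory:
    """Keep only the last k question/answer exchanges (illustration only)."""

    def __init__(self, k: int):
        # A deque with maxlen silently discards the oldest item when full.
        self.turns = deque(maxlen=k)

    def save(self, question: str, answer: str) -> None:
        self.turns.append((question, answer))

    def as_text(self) -> str:
        """Render the retained history as text for a prompt."""
        return "\n".join(f"Human: {q}\nAI: {a}" for q, a in self.turns)
```

With `k=2`, saving three exchanges leaves only the last two in `as_text()`, so the prompt stays bounded no matter how long the conversation runs.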

In [None]:
from langchain.globals import set_verbose, set_debug

set_debug(True)
set_verbose(True)

### Main Process Input Function

This is the function that orchestrates all the major components such as:
- user variable input from the GUI
- prompt template
- pipeline setup
- chain setup
- response output
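
As a sketch of the prompt template step, the function below assembles a Llama 2 chat-style prompt from the same pieces the notebook uses: system prompt, retrieved summaries, chat history, and question. The helper name `build_llama2_prompt` and its exact whitespace are assumptions for illustration; in the notebook the placeholders are filled in by the chain rather than by hand.

```python
def build_llama2_prompt(system_prompt: str, summaries: str,
                        chat_history: str, question: str) -> str:
    """Assemble a Llama 2 chat-style prompt (illustrative sketch).

    The system prompt goes inside <<SYS>> tags and the whole user turn is
    wrapped in [INST] ... [/INST].
    """
    return (
        f"[INST] <<SYS>>\n{system_prompt.strip()}\n<</SYS>>\n\n"
        f"Summaries: {summaries}\n\n"
        f"Chat History: {chat_history}\n\n"
        f"Question: {question} [/INST]"
    )
```

Printing the result for a sample question is a quick way to confirm that the persona text, the retrieved context, and the history all land where the model expects them.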

In [None]:
### this chunk works, however it gives constant clarifying questions... annoying but the responses are pretty decent sometimes.
def process_input(question,
    chat_history,
    system_prompt,
    max_new_tokens,
    temperature,
    top_p,
    top_k,
    repetition_penalty
                 ):

    ### let's check and see that our gradio interface is passing the input variables as we expect
    ### Change the values of sliders in gradio at run time to make changes to the inputs here
    # print("SYS:", system_prompt) 
    # print("ch:", chat_history)
    # print("MAX_NEW_TOKENS:", max_new_tokens, "T:", temperature, "P:", top_p, "K:", top_k, "REP_PEN:", repetition_penalty)

    
    ### system prompt variable is typed in by the user in Gradio advanced settings text box and sent into process_input function
    ### This is Llama2 prompt format 
    ### https://huggingface.co/blog/llama2#how-to-prompt-llama-2

#    llama2_prompt_template = "\n\n [INST] <<SYS>>" + system_prompt + "<</SYS>>\n\n Context: {context} \n\n  Chat History: {chat_history} \n\n  Question: {question} \n\n[/INST]".strip()

    ### parentheses make .strip() apply to the whole template, not only to the last string literal
    llama2_prompt_template = ("\n\n [INST] <<SYS>>" + system_prompt + "<</SYS>>\n\n Summaries: {summaries} \n\n  Chat History: {chat_history} \n\n  Question: {question}\n\n[/INST]").strip()


    PROMPT = PromptTemplate(
#        input_variables=["context", "chat_history", "question"], 
        input_variables=["summaries", "chat_history", "question"], 
        template=llama2_prompt_template
    )

    ####  check to see what the prompt actually looks like
    
#    print(PROMPT)

    ####### STREAMER FOR TEXT OUTPUT ############
    
    streamer = TextIteratorStreamer(tokenizer, timeout=10.0, skip_prompt=True, skip_special_tokens=True)

    ####### PIPELINE ARGUMENTS FOR THE LLM ############
    ### more info at https://towardsdatascience.com/decoding-strategies-in-large-language-models-9733a8f70539
    
    text_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    do_sample=True,
#    num_beams=2, beam search over 1 cannot be used with streamer
    streamer=streamer,
    max_new_tokens=max_new_tokens,
    top_p=top_p,
    top_k=top_k,
    temperature=temperature,
    repetition_penalty=repetition_penalty,
    )

    ####### ATTACH PIPELINE TO LLM ############

    llm = HuggingFacePipeline(pipeline=text_pipeline)

    
########  RETRIEVAL QA WITH SOURCES WORKS FAIRLY WELL IN OUR USE CASE
    
    ### this does NOT rephrase the question

    ### info on db retriever settings:  https://python.langchain.com/docs/modules/data_connection/retrievers/vectorstore
    ### Maximum marginal relevance retrieval (mmr) will provide a more broad selection from more files
    ## search kwargs integer is the max number of docs to return in the response
    
    ###### RETRIEVAL QA FROM CHAIN TYPE PARAMS ###########
    qa_chain = RetrievalQAWithSourcesChain.from_chain_type(
        llm=llm,
        chain_type="stuff",
        chain_type_kwargs={"prompt": PROMPT},
        retriever=vectordb.as_retriever(search_type="similarity", search_kwargs={"k": 4}),
#        retriever=vectordb.as_retriever(search_type="mmr", search_kwargs={"k": 4}),
        return_source_documents = True,
        memory=memory,
        verbose=True,
        )


    ### this response format is best for retrieval QA chain with sources ###
    ### gr.ChatInterface passes the question first and the chat history second, followed by the additional inputs
    ### pass only the question to the chain; the memory object supplies chat_history
    
    response = qa_chain({"question": question})

    ##### TEST THE RESPONSE ######
    
#    print(response)
#    print(response["chat_history"])
#    print(response["answer"])


    ##### TEST SOURCE DOCS LIST ######
    
    print("============================================")
    print("===============Source Documents============")
    print("============================================")

    ### print the metadata of every returned source document
    for doc in response["source_documents"]:
        print(doc.metadata)

    print("============================================")
    print("============================================")

    #### chat history will be empty key if there is no actual history yet, run the bot a few times
    
#    print(response.keys())
    # print(response["answer"])
#    print(response["sources"])
    
    
    ####### MANAGE OUTPUT ARRAY FROM STREAMER ###########
    ## whatever is in streamer, the positional argument 'text', take it and join it all together
    ## yield allows streaming in Gradio
    
    outputs = []
    for text in streamer:
        outputs.append(text)
        yield "".join(outputs)
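
In the function above, the chain call appears to finish before the loop starts reading from the streamer, so the loop replays text that has already been generated. A common pattern for streaming while generation is still running is to produce tokens in a background thread and consume them from a queue. The sketch below shows that pattern with the standard library only; `fake_generate` is a hypothetical stand-in for the model call, and all names are assumptions for illustration.

```python
import queue
import threading

_DONE = object()  # sentinel that marks the end of generation


def fake_generate(tokens, out_queue):
    """Stand-in for the model: push tokens one at a time, then the sentinel."""
    for tok in tokens:
        out_queue.put(tok)
    out_queue.put(_DONE)


def stream_response(tokens):
    """Yield the growing response while a worker thread produces tokens."""
    out_queue = queue.Queue()
    worker = threading.Thread(target=fake_generate, args=(tokens, out_queue))
    worker.start()
    pieces = []
    while True:
        item = out_queue.get(timeout=10.0)  # wait for the next token
        if item is _DONE:
            break
        pieces.append(item)
        yield "".join(pieces)  # same accumulate-and-yield idea as above
    worker.join()
```

Each yielded value is the full text so far, which matches what the Gradio chat window expects from a generator function.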


### Build the Gradio GUI
- Gradio is a quick, highly customizable UI package for your python applications:  https://www.gradio.app/
- Combined with langchain, gradio can trigger multiple chains for a wide variety of user interactions.

<b>NOTE</b>:  Gradio passes the input values to the processing function in the order the components appear in the interface object.  The values are not passed by name.  For example, the temperature slider is the 3rd item in the additional inputs list, so it arrives as the 3rd positional argument after the message and the chat history, not as an explicit "temperature" keyword.  The parameter order of your chat processing function must therefore match the order of the components.
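
The positional hand-off can be checked without starting the UI. The sketch below binds a list of values to a function signature by position, the same way the values would arrive from the browser; `handler_stub` mirrors the parameter order of the processing function, and both names are hypothetical helpers for illustration.

```python
import inspect


def handler_stub(question, chat_history, system_prompt, max_new_tokens,
                 temperature, top_p, top_k, repetition_penalty):
    """Stub with the same parameter order as the chat processing function."""
    return None


def bind_positional(fn, *values):
    """Show which parameter each positional value would be assigned to."""
    return dict(inspect.signature(fn).bind(*values).arguments)


# Values in the order the interface supplies them: message, history,
# then each additional input from top to bottom.
bound = bind_positional(handler_stub, "hi", [], "persona", 1024, 0.6, 0.9, 50, 1.2)
```

Inspecting `bound` makes a mismatch easy to spot: if two sliders were swapped in the interface, the wrong value would show up under `temperature` or `top_p` here.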

#### Access the UI
- The provided code forces Gradio to create a small web server on the local host the notebook is being served from
- Gradio will provide a URL that can be used in a web browser, that must be accessed from within the same network, so you may need to access it using a jumphost.  In this case we used a Windows jump host and Chrome browser on the same network to access the page.

In [None]:
chat_interface = gr.ChatInterface(
    
    ### call the main process function above
    
    fn=process_input, 

    ### format the dialogue box, add company avatar image
    
    chatbot = gr.Chatbot(
        bubble_full_width=False,
        avatar_images=(None, (os.path.join(os.path.dirname("__file__"), "images/dell-logo-sm.jpg"))),
    ),

    
    additional_inputs=[
        
        gr.Textbox(label="Persona and role for system prompt:", 
                   lines=3, 
                   value="""You are a technical research assistant, you answer only in English language. Your audience appreciates technical details in your answer."""
                  ),
        
        gr.Slider(
            label="Max new words (tokens)",
            minimum=1,
            maximum=MAX_MAX_NEW_TOKENS,
            step=1,
            value=DEFAULT_MAX_NEW_TOKENS,
        ),
        gr.Slider(
            label="Creativity (Temperature), higher is more creative, lower is less creative:",
            minimum=0.1,
            maximum=2.0,
            step=0.1,
            value=0.6,
        ),
        gr.Slider(
            label="Top probable tokens (Nucleus sampling top-p), affects creativity:",
            minimum=0.05,
            maximum=1.0,
            step=0.05,
            value=0.9,
        ),
        gr.Slider(
            label="Number of top tokens to choose from (Top-k):",
            minimum=1,
            maximum=100,
            step=1,
            value=50,
        ),
        gr.Slider(
            label="Repetition penalty:",
            minimum=1.0,
            maximum=1.99,
            step=0.05,
            value=1.2,
        ),
    ],
    
    stop_btn=None,
    
    examples=[
        ["Can you give me a detailed summary of the document 'h19642-Introduction-to-Apex-File-Storage-for-AWS.pdf'?"],
        ["What are some solutions Dell provides for the Telecom Industry?"],
        ["How does Dell APEX block storage support multiple availability zones?"],
        ["Please document the process  of a 'cluster aware update' for Dell VXrail."],
        ["Would you please create a CTO advisory proposal comparing Dell Technologies storage PowerFlex solutions against HP storage solutions."],
        ["Would you please write a professional email response to John explaining the benefits of Dell Powerflex. Please be concise and in paragraph form, no lists or bullet points."],
        ["Create a new advertisement for Dell Technologies PowerEdge servers.  Please include an interesting headline and product description.  You want to persuade the target audience of IT decision makers to purchase PowerEdge servers. Include a section at the end titled Call to Action, listing next steps the readers should take."],

    ],

)

###  SET GRADIO INTERFACE THEME (https://www.gradio.app/guides/theming-guide)

#theme = gr.themes.Soft()
#theme = gr.themes.Glass()
theme = gr.themes.Default()


### set width and margins in local css file
### set Title in a markdown object at the top, then render the chat interface

with gr.Blocks(theme=theme, css="style.css") as demo:
    gr.Markdown(
    """
    # Retrieval Digital Assistant
    """)
    
    chat_interface.render()


if __name__ == "__main__":
    demo.queue(max_size=1)  ## sets up websockets for bidirectional comms and no timeouts, set a max number users in queue
    demo.launch(share=False, debug=True, server_name="YOUR_SERVER_IP", server_port=7861, allowed_paths=["images/dell-logo-sm.jpg"])

### Inspiration code:

https://huggingface.co/spaces/huggingface-projects/llama-2-7b-chat <br>

### Author:
David O'Dell - Solutions and AI Tech Marketing Engineer