# Open-Source RAG with Llama 2 13B (4-bit quantization for less GPU memory), Faiss, Hugging Face and LangChain, or with OpenAI

In this PoC we'll create an open-source RAG solution with **Llama-2-13b-chat**, Hugging Face embeddings and Faiss (vector DB), all orchestrated by LangChain. Alternatively, the solution can be parametrized to use OpenAI.

In terms of structure, the main UI lives in `RAG_QAw_Parametrization.ipynb`, which imports all the parametrization (which model, temperature, chain...) from `parametrization.ipynb` and the core RAG functions from `RAGQA.ipynb`. `RAGQA.ipynb` also imports `Parametrization.ipynb`.


**Retrieval Augmented Generation (RAG)** is an advanced Natural Language Processing (NLP) technique that combines both retrieval and generation elements to enhance AI language models' capabilities.
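
To make the retrieve-then-generate idea concrete, here is a minimal, dependency-free sketch. It is only an illustration: it ranks text chunks with a bag-of-words cosine similarity instead of the sentence-transformer embeddings and Faiss index used in this notebook, and it stops at building the prompt rather than calling an LLM. The function names (`tokenize`, `cosine`, `retrieve`, `build_prompt`) and the sample chunks are assumptions made for the example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, chunks, k=2):
    """Retrieval step: return the k chunks most similar to the question."""
    q_vec = Counter(tokenize(question))
    ranked = sorted(
        chunks,
        key=lambda c: cosine(q_vec, Counter(tokenize(c))),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(question, context_chunks):
    """Augmentation step: put the retrieved chunks in front of the question."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return (
        f"Answer using only this context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Toy "document store" (stand-in for the Faiss index)
chunks = [
    "Faiss is a library for similarity search over dense vectors.",
    "Gradio builds simple web interfaces for Python functions.",
    "Llama 2 is a family of large language models released by Meta.",
]
question = "What is Faiss used for?"
top_chunks = retrieve(question, chunks, k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)  # this prompt is what would be sent to the LLM (generation step)
```

In the real solution the similarity search is done over dense embeddings, but the flow is the same: retrieve relevant chunks, add them to the prompt, then generate.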

You must first request access to Llama 2 models via [this form](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) (access is typically granted within a few hours).

---

🚨 I suggest running this in Google Colab: go to **Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**. This should be included within the free tier of Colab.

---

The pip install was already run in `7_RAG_QAw_Parametrization_v1.ipynb`. Is it needed here as well?

In [None]:
!pip install -qU \
  transformers==4.31.0 \
  sentence-transformers==2.2.2 \
  datasets==2.14.0 \
  accelerate==0.21.0 \
  einops==0.6.1 \
  langchain==0.0.240 \
  xformers==0.0.20 \
  bitsandbytes==0.41.0 \
  pypdf \
  faiss-cpu \
  Docx2txt \
  gradio \
  openai


# Import modules

In [None]:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain import OpenAI

import gradio as gr

# Global Variables

In [None]:
#Global Variables with access functions
#GLOBAL CONSTANTS
METALLAMA2 = "meta llama2"
OPENAIGPT35 = "openai gpt35"
OPENAIGPT30 = "openai gpt30"
LOADQA = "load qa"
RETRIEVALQA = "retrieval qa"

#Global Variables
model2work = METALLAMA2
temperature2work = 0
qachain2work = LOADQA
bulletprompt2work = False
verbose2work = False
embeddings = None
llm = None
autkeys = None

# Access functions to global variables for solution parametrization

Read and write functions to the global variables.

## LLama2_Initialization
We need to initialize a `text-generation` pipeline with Hugging Face transformers. The pipeline requires three things that we must initialize first:

* An LLM, in this case `meta-llama/Llama-2-13b-chat-hf`.

* The respective tokenizer for the model.

* The model loaded on our CUDA-enabled GPU. On Colab it can take 5-10 minutes to download and initialize the model.

**Note:** In `autkeys_write` you need to set your Hugging Face Hub key:

        autkeys = "" #change for your HUGGINGFACEHUB keys
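
As a rough back-of-the-envelope check of why 4-bit loading matters on a 16 GB T4, the sketch below estimates the memory taken by the weights alone. It assumes a rounded parameter count of 13 billion and ignores everything else that lives on the GPU (quantization constants, activations, the KV cache), so the real footprint is higher. The helper name `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Approximate memory needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3

N_PARAMS = 13e9  # assumption: nominal, rounded parameter count of a 13B model

fp16_gib = weight_memory_gib(N_PARAMS, 16)  # half-precision weights
int4_gib = weight_memory_gib(N_PARAMS, 4)   # 4-bit quantized weights

print(f"16-bit weights: ~{fp16_gib:.1f} GiB")
print(f" 4-bit weights: ~{int4_gib:.1f} GiB")
```

Under these assumptions the 16-bit weights come to roughly 24 GiB, which does not fit on a T4, while the 4-bit weights come to roughly 6 GiB, which leaves room for the rest of the pipeline.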


In [None]:
#Access functions to global variables
def model2work_read ():
    return model2work

def model2work_write (new_value):
    global model2work

    model2work = new_value
    return model2work

def temperature2work_read ():
    return temperature2work

def temperature2work_write (new_value):
    global temperature2work

    temperature2work = new_value
    return temperature2work

def qachain2work_read ():
    return qachain2work

def qachain2work_write (new_value):
    global qachain2work

    qachain2work = new_value
    return qachain2work

def bulletprompt2work_read ():
    return bulletprompt2work

def bulletprompt2work_write (new_value):
    global bulletprompt2work

    bulletprompt2work = new_value
    return bulletprompt2work

def verbose2work_read ():
    return verbose2work

def verbose2work_write (new_value):
    global verbose2work

    verbose2work = new_value
    return verbose2work

def embeddings_read ():
    return embeddings

#Define which kind of embeddings to use (returns)
def embeddings_write (modeltype):
    global embeddings

    if  modeltype == METALLAMA2:
        embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')
    elif modeltype == OPENAIGPT35:
        embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
    elif modeltype == OPENAIGPT30:
        embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
    return embeddings

# We need to initialize a `text-generation` pipeline with Hugging Face transformers.
# The Pipeline requires three things that we must initialize first, those are:
# 1) A LLM, in this case it will be `meta-llama/Llama-2-13b-chat-hf`
# 2) The respective tokenizer for the model
# 3) We initialize the model and move it to our CUDA-enabled GPU.
    # Using Colab this can take 5-10 minutes to download and initialize the model.
#return the llm
def LLama2_Initialization (temperature, autkey):
    from torch import cuda, bfloat16
    import transformers


    model_id = 'meta-llama/Llama-2-13b-chat-hf'

    device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

    # set quantization configuration to load large model with less GPU memory
    # this requires the `bitsandbytes` library
    bnb_config = transformers.BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type='nf4',
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=bfloat16
    )

    # begin initializing HF items, need auth token for these

    #hf_auth2 = os.environ.get('HUGGINGFACEHUB_API_TOKEN')  # not work coolab...
    # hf_auth2 = 'HUGGINGFACEHUB_API_TOKEN' #'HF_AUTH_TOKEN'
    # Prompt for the HUGGINGFACEHUB_API_TOKEN
    #hf_auth = getpass("Enter your Hugging Face API token: ")

    hf_auth = autkey
    print(f"hf_auth provided: {bool(hf_auth)}")  # avoid printing the secret token itself

    model_config = transformers.AutoConfig.from_pretrained(
        model_id,
        use_auth_token=hf_auth
    )

    model = transformers.AutoModelForCausalLM.from_pretrained(
        model_id,
        trust_remote_code=True,
        config=model_config,
        quantization_config=bnb_config,
        device_map='auto',
        use_auth_token=hf_auth
    )
    model.eval()
    print(f"Model loaded on {device}")

    #The pipeline requires a tokenizer which handles the translation of
    # human readable plaintext to LLM readable token IDs. The Llama 2 13B models
    # were trained using the Llama 2 13B tokenizer, which we initialize like so:

    tokenizer = transformers.AutoTokenizer.from_pretrained(
        model_id,
        use_auth_token=hf_auth
    )

    # Now we're ready to initialize the HF pipeline. There are a few additional
    # parameters that we must define here. Comments explaining these have been
    # included in the code.

    generate_text = transformers.pipeline(
        model=model, tokenizer=tokenizer,
        return_full_text=True,  # langchain expects the full text
        task='text-generation',
        # we pass model parameters here too
        temperature=temperature,  # 'randomness' of outputs: 0.0 is the minimum, 1.0 the maximum
        max_new_tokens=4096,  # max number of tokens to generate in the output (was 512, limit 4096)
        repetition_penalty=1.1  # without this output begins repeating
    )

    # In LangChain calling HuggingFace LLama2
    from langchain.llms import HuggingFacePipeline

    llm = HuggingFacePipeline(pipeline=generate_text)
    return llm


def llm_read ():
    return llm

#Parametrize the model (LLM) to use (returns)
def llm_write (modeltype, temperaturevalue, autkey):
    global llm

    if  modeltype == METALLAMA2:
        llm = LLama2_Initialization(temperaturevalue, autkey) #change
    elif modeltype == OPENAIGPT35:
        llm = OpenAI(model_name ="gpt-3.5-turbo",temperature=temperaturevalue)
    elif modeltype == OPENAIGPT30:
        llm = OpenAI(temperature=temperaturevalue)
    return llm

def autkeys_read ():
    return autkeys

# Parametrize the autkeys to use (returns)
def autkeys_write (modeltype):
    import os
    from getpass import getpass  # needed if the commented-out prompts below are enabled

    global autkeys

    if modeltype == METALLAMA2:
        #autkeys = os.environ.get('HUGGINGFACEHUB_API_TOKEN')  # does not work in Colab
        #autkeys = getpass("Enter your Hugging Face API token: ")  # for Colab
        autkeys = "" #change for your HUGGINGFACEHUB keys

    elif modeltype == OPENAIGPT35 or modeltype == OPENAIGPT30:
        # Before executing the following code, make sure to have
        # your OpenAI key saved in the "OPENAI_API_KEY" environment variable.
        autkeys = os.environ.get('OPENAI_API_KEY')  # does not work in Colab
        #autkeys = getpass("Enter your Open AI API token: ")  # for Colab
    return autkeys
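
One possible way to avoid hard-coding the key is sketched below: try an environment variable first and fall back to a hidden prompt. This is only a sketch, not the notebook's method. The helper names `read_token` and `mask` and the variable `DEMO_API_TOKEN` are hypothetical; in the notebook the relevant variables would be `HUGGINGFACEHUB_API_TOKEN` and `OPENAI_API_KEY`.

```python
import os
from getpass import getpass

def read_token(env_var, prompt):
    """Return the token from the environment, or ask for it without echoing."""
    token = os.environ.get(env_var)
    if not token:
        token = getpass(prompt)  # hidden input, suitable for a notebook cell
    return token

def mask(token):
    """Show only the last 4 characters so logs do not leak the secret."""
    return "*" * max(len(token) - 4, 0) + token[-4:]

# Demo only: set a fake value so the example runs without prompting.
os.environ["DEMO_API_TOKEN"] = "hf_example1234"

token = read_token("DEMO_API_TOKEN", "Enter your API token: ")
print(f"token loaded: {mask(token)}")
```

The value returned by `read_token` could then be assigned to `autkeys` instead of the empty string.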

## Solution Parametrization
We need to parametrize the solution through the UI: the model, the temperature, the chain type, whether the prompt asks for bullet answers (only for Q/A with the load QA chain) and whether verbose output is enabled.

In [None]:
def parametrization_change (modeltype, temperature, qachaintype, bulletprompt, verbose):

    model2work = model2work_write (modeltype)
    temperature2work = temperature2work_write (temperature)
    qachain2work = qachain2work_write (qachaintype)
    bulletprompt2work = bulletprompt2work_write (bulletprompt)
    verbose2work = verbose2work_write (verbose) #global is false verbose

    autkey = autkeys_write (model2work) # obtain key for llm
    embeddings = embeddings_write (modeltype)
    llm = llm_write (modeltype, temperature, autkey)

    return (f"parametrization_change | modeltype {model2work_read ()}, temperature {temperature2work_read ()}, qachaintype {qachain2work_read ()}, bulletprompt {bulletprompt2work_read()}, verbose {verbose2work_read ()}, embeddings {embeddings_read ()}. llm {llm_read ()}")


inputs =[
                # gr.Radio([META_LLAMA2, OPENAI_GPT_3_5, OPENAI_GPT_3_0], label="Model", ),
                gr.Radio([METALLAMA2, OPENAIGPT35, OPENAIGPT30], label="Model", ),
                gr.Slider(0, 1, 0, label="temperature"),
                # gr.Dropdown(['1850', '1900', '1950', '2000', '2050'], label="Year"),
                gr.Radio([LOADQA, RETRIEVALQA], label="Chain Type", ),
                # The order must match the parameters of parametrization_change:
                # bulletprompt first, then verbose.
                gr.Checkbox(label="Bullet Answers?"),
                gr.Checkbox(label="Console debug chain messages?"),
        ]
outputs = "text"

app_parametrization = gr.Interface(fn=parametrization_change, inputs=inputs, outputs=outputs, allow_flagging="never")
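
Because `embeddings_write` and `llm_write` leave the globals untouched when the model type matches none of the constants, it may help to check the values before applying them. The sketch below shows one way to do that. `validate_parameters` is a hypothetical helper, not part of the notebook, and the constants are copied from the "Global Variables" cell so the example is self-contained.

```python
# Constants mirrored from the notebook's "Global Variables" cell
METALLAMA2 = "meta llama2"
OPENAIGPT35 = "openai gpt35"
OPENAIGPT30 = "openai gpt30"
LOADQA = "load qa"
RETRIEVALQA = "retrieval qa"

VALID_MODELS = {METALLAMA2, OPENAIGPT35, OPENAIGPT30}
VALID_CHAINS = {LOADQA, RETRIEVALQA}

def validate_parameters(modeltype, temperature, qachaintype):
    """Hypothetical helper: return a list of problems, empty if the values are usable."""
    problems = []
    if modeltype not in VALID_MODELS:
        problems.append(f"unknown model: {modeltype!r}")
    if not 0 <= temperature <= 1:
        problems.append(f"temperature out of range [0, 1]: {temperature}")
    if qachaintype not in VALID_CHAINS:
        problems.append(f"unknown chain type: {qachaintype!r}")
    return problems

# A radio button left unselected would typically reach the callback as None
no_model = validate_parameters(None, 0, LOADQA)
all_good = validate_parameters(METALLAMA2, 0.2, RETRIEVALQA)
print(no_model)
print(all_good)
```

A check like this could run at the top of `parametrization_change`, returning the list of problems as the text output instead of loading a model.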

Model loaded on cuda:0
