# **EASY IMPLEMENTATION OF RETRIEVAL AUGMENTED GENERATION (RAG) WITH LLM MISTRAL 7B**

---

📎 README

This notebook is a simple implementation of retrieval-augmented generation (RAG). It uses LangChain with the Mistral-7B-Instruct-v0.1 model from Hugging Face and a FAISS vector store that holds text embeddings computed with Sentence Transformers.

About:

*   Use a FAISS vector store, and either build a new vector database or load an existing one.
*   Choose between downloading the model and entering your Hugging Face user access token.
*   Use a GPU to run the Hugging Face model and the FAISS vector store.
*   Use LangChain for Q&A on your vectorized PDF database.

References:

1.  "[Q/A with LLMs + Harry Potter & Mistral HF API](https://www.kaggle.com/code/acorn8/q-a-with-llms-harry-potter-mistral-hf-api)".
2.  "[Advanced RAG Implementation on Custom Data Using Hybrid Search, Embed Caching And Mistral-AI](https://medium.aiplanet.com/advanced-rag-implementation-on-custom-data-using-hybrid-search-embed-caching-and-mistral-ai-ce78fdae4ef6)".
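
As a rough illustration of the retrieve-then-generate flow this notebook implements, the sketch below ranks text chunks against a question and pastes the best match into a prompt. It is a toy that assumes bag-of-words vectors and cosine similarity in place of the Sentence Transformers embeddings and the FAISS index; `embed`, `cosine`, `retrieve`, and `build_prompt` are hypothetical helper names, not LangChain APIs.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(question, passages):
    """Paste the retrieved passages into the prompt as context."""
    context = "\n".join(passages)
    return f"Answer using only the context.\nQuestion: {question}\nContext:\n{context}"


chunks = [
    "FAISS stores embedding vectors and searches them by similarity.",
    "Mistral 7B is a large language model.",
    "PDF pages are split into chunks before embedding.",
]
question = "How does FAISS search vectors?"
top = retrieve(question, chunks, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

In the notebook itself, the embedding and search steps are handled by the embedding model and FAISS, and the final prompt is sent to Mistral-7B-Instruct.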


In [1]:
# @title 1. ✨ Installing dependencies.

!pip install langchain pypdf faiss-gpu &> /dev/null
!pip install InstructorEmbedding sentence_transformers &> /dev/null
!pip install -q git+https://github.com/huggingface/transformers &> /dev/null
!pip install huggingface_hub -q &> /dev/null
!pip install xformers &> /dev/null
!pip install -qU auto-gptq optimum &> /dev/null

import warnings
warnings.filterwarnings("ignore")

import os
import glob
import textwrap
import langchain
import torch
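
Because the install commands above send all output to `/dev/null`, a failed install stays silent until an import breaks. A minimal sketch of a sanity check follows, using only the standard library; `missing_modules` is a hypothetical helper, and the list of import names is an assumption about which modules the installed packages provide.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names assumed to correspond to the packages installed above.
required = ["langchain", "pypdf", "faiss", "sentence_transformers", "transformers"]
missing = missing_modules(required)
print("Missing modules:", missing if missing else "none")
```

If anything is reported missing, rerunning the matching `pip install` line without the redirect should show the underlying error.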


In [2]:
# @title 2. ✍ Building the vector database

#@markdown Control parameters for vector database

class CFG1:

    # splitting
    split_chunk_size = 800 #@param {type:"number"}
    split_overlap = 0 #@param {type:"number"}

    # embeddings
    embeddings_model_repo = 'sentence-transformers/all-MiniLM-L6-v2'

    #@markdown Select the database source
    source = "Make_database_from_pdf" #@param ["Make_database_from_pdf", "Load_database"]

    #@markdown Select the folder and upload the PDFs to convert (default folder: data)
    # paths
    PDFs_path = "/content/data" #@param {type:"string"}
    # Used only by "Load_database": should point to a folder written by FAISS.save_local
    Embeddings_path = "/content/data/hf.pdf" #@param {type:"string"}
    Persist_directory = "/content/data" #@param {type:"string"}

# ### download embeddings model
from InstructorEmbedding import INSTRUCTOR
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.embeddings import CacheBackedEmbeddings, HuggingFaceEmbeddings
from langchain.storage import LocalFileStore
from langchain.cache import InMemoryCache


core_embeddings = HuggingFaceInstructEmbeddings(
    model_name = CFG1.embeddings_model_repo,
    model_kwargs = {"device": "cuda"}
)

store = LocalFileStore("./cache/")
embeddings = CacheBackedEmbeddings.from_bytes_store(
    core_embeddings,
    store,
    namespace = CFG1.embeddings_model_repo
)

if CFG1.source == "Make_database_from_pdf":
    # loaders
    from langchain.document_loaders import PyPDFLoader
    from langchain.document_loaders import DirectoryLoader

    # Load every PDF found in the folder.
    loader = DirectoryLoader(
        CFG1.PDFs_path,
        glob="./*.pdf",
        loader_cls=PyPDFLoader,
        show_progress=True,
        use_multithreading=True
    )

    documents = loader.load()

    # Split the loaded documents into chunks
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size = CFG1.split_chunk_size,
        chunk_overlap = CFG1.split_overlap
    )

    texts = text_splitter.split_documents(documents)

    # ### create embeddings and DB
    # vector stores
    from langchain.vectorstores import FAISS

    vectordb = FAISS.from_documents(
        documents = texts,
        embedding = embeddings
    )

    # ### persist vector database
    vectordb.save_local("faiss_index_hp")

elif CFG1.source == "Load_database":
    from langchain.vectorstores import FAISS
    ### load vector DB embeddings
    vectordb = FAISS.load_local(
        CFG1.Embeddings_path,
        embeddings
    )



load INSTRUCTOR_Transformer
max_seq_length  512


100%|██████████| 1/1 [00:00<00:00,  2.53it/s]
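
To make the `split_chunk_size` and `split_overlap` settings concrete, here is a simplified fixed-window splitter. It is only a sketch: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and sentences, which this version ignores, and `split_with_overlap` is a hypothetical helper rather than a LangChain function.

```python
def split_with_overlap(text, chunk_size, overlap=0):
    """Cut text into windows of chunk_size characters.

    Consecutive windows share `overlap` characters, so the window start
    advances by chunk_size - overlap each time.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


sample = "abcdefghij"
no_overlap = split_with_overlap(sample, chunk_size=4, overlap=0)
with_overlap = split_with_overlap(sample, chunk_size=4, overlap=2)
print(no_overlap)
print(with_overlap)
```

With an overlap of 0, as configured in `CFG1`, a sentence that straddles a chunk boundary is cut in two; a small positive overlap repeats the boundary text in both chunks at the cost of more chunks to embed.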


In [3]:
# @title 3. 🛠 Setup the model Mistral-7B-Instruct-v0.1

# @markdown Select how to get access to the Hugging Face model.
type_access = "Dowload_model" #@param ["Load_model_from_API", "Dowload_model"]

class CFG2:

    model_name = 'mistralai/Mistral-7B-Instruct-v0.1'

    # LLMs parameters
    temperature = 0.2 # @param {type:"slider", min:0, max:1, step:0.1}
    top_p = 0.95 # @param {type:"slider", min:0, max:1, step:0.1}
    top_k = 5 # @param {type:"slider", min:0, max:10, step:1}
    repetition_penalty = 1.15 #@param {type:"number"}
    do_sample = True #@param {type:"boolean"}
    max_new_tokens = 2048 #@param {type:"number"}
    num_return_sequences = 1 #@param {type:"number"}

    # number of similar passages to retrieve
    k = 5 # @param {type:"slider", min:0, max:100, step:1}

if type_access == "Load_model_from_API":

    # Hugging Face Hub API
    from langchain.llms import HuggingFaceHub

    huggingfacehub_api_token = "hf_token"  #@param {type:"string"}

    llm = HuggingFaceHub(
        repo_id = CFG2.model_name,
        model_kwargs={
            "max_new_tokens": CFG2.max_new_tokens,
            "temperature": CFG2.temperature,
            "top_p": CFG2.top_p,
            "repetition_penalty": CFG2.repetition_penalty,
            "do_sample": CFG2.do_sample,
            "num_return_sequences": CFG2.num_return_sequences
        },
        huggingfacehub_api_token = huggingfacehub_api_token
    )

elif type_access == "Dowload_model":

    # Download the quantized GPTQ model

    from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

    model_name_or_path = "TheBloke/Mistral-7B-Instruct-v0.1-GPTQ"
    # To use a different branch, change revision
    # For example: revision="gptq-4bit-32g-actorder_True"
    model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                                device_map="auto",
                                                trust_remote_code=False,
                                                revision="gptq-4bit-32g-actorder_True")
    #
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

    from langchain.llms import HuggingFacePipeline

    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=CFG2.max_new_tokens,
        do_sample=CFG2.do_sample,
        temperature=CFG2.temperature,
        top_p=CFG2.top_p,
        top_k=CFG2.top_k,
        repetition_penalty=CFG2.repetition_penalty
    )

    llm = HuggingFacePipeline(pipeline=pipe)
    langchain.llm_cache = InMemoryCache()

# prompts
from langchain import PromptTemplate, LLMChain

prompt_template = """<s>[INST] You are given the context after <<CONTEXT>> and a question after <<QUESTION>>.

Answer the question by only using the information in context. Only base your answer on the information in the context. Even if you know something more,
keep silent about it. It is important that you only tell what can be inferred from the context alone.

<<QUESTION>>{question}\n<<CONTEXT>>{context} [/INST]"""

PROMPT = PromptTemplate(
    template = prompt_template,
    input_variables = ["question", "context"]
)

llm_chain = LLMChain(prompt=PROMPT, llm=llm)

# The context part of the prompt is filled in by embedding retrieval.
# retrievers
from langchain.chains import RetrievalQA

# search_type is an argument of as_retriever itself; search_kwargs carries only the search options.
retriever = vectordb.as_retriever(search_type = "similarity", search_kwargs = {"k": CFG2.k})

qa_chain = RetrievalQA.from_chain_type(
    llm = llm,
    chain_type = "stuff", # map_reduce, map_rerank, stuff, refine
    retriever = retriever,
    chain_type_kwargs = {"prompt": PROMPT},
    return_source_documents = True,
    verbose = False
)

def wrap_text_preserve_newlines(text, width=700):
    # Split the input text into lines based on newline characters
    lines = text.split('\n')

    # Wrap each line individually
    wrapped_lines = [textwrap.fill(line, width=width) for line in lines]

    # Join the wrapped lines back together using newline characters
    wrapped_text = '\n'.join(wrapped_lines)

    return wrapped_text

def process_llm_response(llm_response):
    ans = wrap_text_preserve_newlines(llm_response['result'])

    sources_used = ' \n'.join(
        [
            source.metadata['source'].split('/')[-1][:-4] + ' - page: ' + str(source.metadata['page'])
            for source in llm_response['source_documents']
        ]
    )

    ans = ans + '\n\nSources: \n' + sources_used
    return ans
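
The sampling settings in `CFG2` (`temperature`, `top_k`, `top_p`) shape how each next token is drawn. The sketch below applies them to a small made-up score table in plain Python. It illustrates the general idea under simplifying assumptions, not the exact code path inside `transformers`; `filter_distribution` and the token scores are hypothetical.

```python
import math


def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token: probability} after temperature scaling, top-k, then top-p filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Temperature: divide scores before softmax; lower values sharpen the distribution.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())
    exp = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exp.values())
    ranked = sorted(((tok, e / total) for tok, e in exp.items()),
                    key=lambda kv: kv[1], reverse=True)

    # Top-k: keep only the k most likely tokens (0 disables the filter), then renormalise.
    if top_k > 0:
        ranked = ranked[:top_k]
        kept_mass = sum(p for _, p in ranked)
        ranked = [(tok, p / kept_mass) for tok, p in ranked]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalise so the surviving probabilities sum to 1.
    mass = sum(p for _, p in kept)
    return {tok: p / mass for tok, p in kept}


toy_logits = {"the": 2.0, "a": 1.0, "cat": 0.5, "dog": 0.1, "xylophone": -3.0}
probs = filter_distribution(toy_logits, temperature=0.2, top_k=5, top_p=0.95)
print(probs)
```

With the notebook's values (temperature 0.2, top_k 5, top_p 0.95) and these toy scores, the low temperature concentrates almost all probability on the best token, so the top-p cut leaves only that token. Raising the temperature lets more candidates survive.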



In [4]:
# @title 4. 💬 Run the cell, then type your question!

# To ask a single question without the interactive loop, use this function instead:
# query = "Ask question?"
#def llm_ans(query):
#    llm_response = qa_chain(query)
#    ans = process_llm_response(llm_response)
#    return ans.strip()

def llm_ans():
    """Interactive loop: read a question, print the answer, stop when the user types EXIT."""
    i = 1
    while True:
        print('\n Question:', i, '...(Write EXIT for end)...')
        print('-----------------')
        query = input()

        if query == 'EXIT':
            break

        llm_response = qa_chain(query)
        ans = process_llm_response(llm_response)
        print('\n Answer:', i)
        print('-----------------')
        print(ans.strip())
        i += 1

llm_ans()


 Question: 1 ...(Write EXIT for end)...
-----------------
What is this book about?

 Answer: 1
-----------------
This book is about natural language processing (NLP) and how to fine-tune existing models like BERT, GPT2, and T5 for various NLP applications. The book also covers how transformers are being used in areas like vision. All source code used in the book can be found at github.com/apress/intro-transformers-nlp.

Sources: 
hf - page: 1 
hf - page: 7 
hf - page: 1 
hf - page: 5 
hf - page: 1

 Question: 2 ...(Write EXIT for end)...
-----------------
EXIT
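
The answer above lists `hf - page: 1` three times because several retrieved chunks come from the same page. One possible refinement, sketched here with a hypothetical `unique_sources` helper and plain dictionaries standing in for the `metadata` of the returned documents, is to drop repeated labels while keeping the retrieval order.

```python
import os


def unique_sources(metadatas):
    """Build 'file - page: N' labels and drop repeats, keeping first-seen order."""
    labels = []
    for meta in metadatas:
        name = os.path.splitext(os.path.basename(meta["source"]))[0]
        labels.append(f"{name} - page: {meta['page']}")
    # dict.fromkeys preserves insertion order, so it deduplicates without reordering.
    return list(dict.fromkeys(labels))


# Metadata shaped like the run above (pages 1, 7, 1, 5, 1 of hf.pdf).
retrieved = [{"source": "/content/data/hf.pdf", "page": p} for p in (1, 7, 1, 5, 1)]
labels = unique_sources(retrieved)
print("\n".join(labels))
```

The same idea could be applied inside `process_llm_response` by passing `[doc.metadata for doc in llm_response['source_documents']]`, assuming each document carries `source` and `page` keys as in the run above.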
