<a href="https://colab.research.google.com/github/joshuaalpuerto/ML-guide/blob/main/RAG_hybrid_Mistral_7B_Instruct.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Loading a GPTQ quantized model requires optimum (`pip install optimum`) and auto-gptq library (`pip install auto-gptq`)
!pip install -q optimum --progress-bar off
!pip install -q auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/  --progress-bar off # Use cu117 if on CUDA 11.7
# Pin transformers to a specific commit that includes Mistral support
!pip install -q git+https://github.com/huggingface/transformers.git@72958fcd3c98a7afdc61f953aa58c544ebda2f79 --progress-bar off

!pip install -qU langchain Faiss-gpu sentence-transformers
!pip install -q jq # for json loader to work
!pip install ctransformers[gptq] #To use CTransformer from langchain and load gptq model

# !pip install -qU trl Py7zr
!pip install -q rank_bm25
!pip install -q PyPdf

In [2]:
#connect to google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
import langchain
from langchain.storage import InMemoryStore
from langchain.cache import InMemoryCache
from langchain.llms import CTransformers

config = {'max_new_tokens': 2048, 'temperature': 0.1, 'top_p': 0.95, 'repetition_penalty': 1.1}

model_name_or_path = "TheBloke/Mistral-7B-Instruct-v0.1-GPTQ"

llm = CTransformers(model=model_name_or_path,
                    # 4-bit, with Act Order and group size 32g. Gives highest possible inference quality, with maximum VRAM usage.
                    revision="gptq-4bit-32g-actorder_True",
                    config=config)

# Cache LLM responses in memory so repeated prompts are not regenerated
langchain.llm_cache = InMemoryCache()

In [4]:
from langchain.storage import InMemoryStore
from langchain.embeddings import CacheBackedEmbeddings,HuggingFaceEmbeddings

# Our implementation can store embeddings on the local file system and uses a FAISS vector store for retrieval.
# store = LocalFileStore("./cache/")

# An in-memory store also works as the embedding cache.
# NOTE: we use it here because we are more familiar with it.
store = InMemoryStore()

embed_model_id="thenlper/gte-large"

# Under the hood, HuggingFaceEmbeddings uses sentence-transformers
core_embeddings_model = HuggingFaceEmbeddings(model_name=embed_model_id)

# Wrap the model in CacheBackedEmbeddings so identical texts are not re-embedded over and over.
embedder = CacheBackedEmbeddings.from_bytes_store(core_embeddings_model,
                                                  store,
                                                  namespace=embed_model_id)
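The caching idea behind the cell above can be sketched in plain Python. This is a minimal illustration, not LangChain's implementation: `CachedEmbedder` and `toy_embed` are hypothetical names, and the toy embedding function merely stands in for a real model. It shows that a cache keyed on a hash of the text only helps when the text is exactly the same.

```python
import hashlib


class CachedEmbedder:
    """Hypothetical wrapper: caches vectors under namespace + hash of the text."""

    def __init__(self, embed_fn, namespace):
        self.embed_fn = embed_fn
        self.namespace = namespace
        self.store = {}   # plays the role of the in-memory byte store
        self.misses = 0   # counts how often the underlying model is called

    def _key(self, text):
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        return f"{self.namespace}:{digest}"

    def embed(self, text):
        key = self._key(text)
        if key not in self.store:
            self.misses += 1
            self.store[key] = self.embed_fn(text)
        return self.store[key]


def toy_embed(text):
    # Stand-in for a real embedding model (assumption for illustration only).
    return [float(len(text)), float(text.count(" "))]


cached = CachedEmbedder(toy_embed, namespace="thenlper/gte-large")
first = cached.embed("When will I receive my TRP?")
second = cached.embed("When will I receive my TRP?")   # served from the cache
third = cached.embed("when will I receive my TRP")     # different text, new entry
print(cached.misses)
```

Using the model id as the namespace keeps vectors from different embedding models from colliding in a shared store.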


In [5]:
from langchain.document_loaders import JSONLoader

# Define the metadata extraction function.
def metadata_func(record: dict, metadata: dict) -> dict:

    metadata["country"] = record.get("country")
    metadata["answer"] = record.get("answer")

    return metadata


loader = JSONLoader(
    file_path='/content/drive/MyDrive/datasets/qna-clean.json',
    jq_schema='.[]',
    content_key="question",
    metadata_func=metadata_func
)

faq_docs = loader.load()
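To make the loader configuration above concrete, here is a rough standard-library sketch of the mapping it performs: iterate over a top-level JSON array (the `.[]` schema), use `question` as the document text, and copy `country` and `answer` into the metadata. The sample records and the `records_to_docs` helper are hypothetical; the real loader returns `Document` objects and, as far as I know, also adds its own metadata such as the source path.

```python
import json

# Hypothetical sample in the shape the loader expects: a top-level array of
# records with "question", "answer" and "country" keys.
RAW_JSON = """
[
  {"question": "How long does TRP processing take?",
   "answer": "Hypothetical answer text.",
   "country": "Estonia"},
  {"question": "Do I need health insurance?",
   "answer": "Another hypothetical answer.",
   "country": "Germany"}
]
"""


def records_to_docs(raw_json):
    """Mimic jq_schema='.[]' + content_key='question' + metadata_func."""
    docs = []
    for record in json.loads(raw_json):
        docs.append({
            "page_content": record["question"],
            "metadata": {
                "country": record.get("country"),
                "answer": record.get("answer"),
            },
        })
    return docs


docs = records_to_docs(RAW_JSON)
print(docs[0]["page_content"])
print(docs[0]["metadata"]["country"])
```

Because only the question is embedded, retrieval matches a user query against stored questions, and the answer is pulled from the metadata afterwards.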

In [6]:
from langchain.vectorstores import FAISS

# Create VectorStore
vectorstore = FAISS.from_documents(faq_docs, embedder)

In [None]:
# NOT USED

from langchain.retrievers import BM25Retriever

# Sparse
bm25_retriever = BM25Retriever.from_documents(faq_docs)
bm25_retriever.k=5


In [None]:
# NOT USED

from langchain.retrievers import EnsembleRetriever

# Hybrid search
faiss_retriever = vectorstore.as_retriever(search_kwargs={"k":5})
ensemble_retriever = EnsembleRetriever(retrievers=[bm25_retriever,faiss_retriever],
                                       weights=[0.5,0.5])
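As far as I understand, `EnsembleRetriever` merges the sparse and dense result lists with weighted Reciprocal Rank Fusion. The sketch below illustrates that fusion step on its own, under that assumption; `weighted_rrf`, the constant `c=60`, and the document ids are illustrative choices, not code taken from LangChain.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists of document ids with weighted Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Higher-ranked documents contribute more; c dampens the gap.
            scores[doc_id] += weight / (rank + c)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical rankings from a sparse (BM25) and a dense (FAISS) retriever.
bm25_ranking = ["visa_fees", "trp_processing", "internet"]
faiss_ranking = ["trp_processing", "trp_pickup", "visa_fees"]

fused = weighted_rrf([bm25_ranking, faiss_ranking], weights=[0.5, 0.5])
print(fused)
```

A document that appears near the top of both lists outranks one that leads only a single list, which is the main appeal of hybrid search.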

In [None]:
from langchain import PromptTemplate

# Single pass template

PROMPT_TEMPLATE = """<s>[INST] You act as a helpful immigration and relocation agent, responding to questions based on provided context on behalf of the user's assigned agent.

Follow these guidelines for the answer:
- Only answer information related to immigration, relocation, settle-in, cultural integration or similar fields related queries.
- Summarize the answer and make it clear, concise, grammatically correct and direct like a human agent would
- Don't disclose yourself as AI, automated tool or bot, answer like the user's agent would
- Never suggest contacting outside Jobbatical platform. Rewrite the answer to contact the user's agent instead
- write the answer using english language
- Answer the question as truthfully as poosible if you are unsure base on the context below reply with "OUT_OF_CONTEXT"

### Question:
{question}

### Context:
{context}

### Answer:
[/INST]"""

input_variables = ['context', 'question']
custom_prompt = PromptTemplate(template=PROMPT_TEMPLATE,
                            input_variables=input_variables)

# Multi route template

Putting every rule into a single prompt confuses the LLM, so we split the prompt into one template per route:
- FAQ
- Request for updates

In [36]:
from langchain import PromptTemplate

FAQ_PROMPT_TEMPLATE = """<s>[INST] You act as a helpful immigration and relocation agent, responding to questions based on provided context on behalf of the user's assigned agent.

Follow these guidelines for the answer:
- Only answer information related to immigration, relocation, settle-in, cultural integration or similar fields related queries.
- Summarize the answer and make it clear, concise, grammatically correct and direct like a human agent would
- Don't disclose yourself as AI, automated tool or bot, answer like the user's agent would
- Never suggest contacting outside Jobbatical platform. Rewrite the answer to contact the user's agent instead
- write the answer using english language
- Answer the question as truthfully as poosible if you are unsure base on the context below reply with "OUT_OF_CONTEXT"

### Question:
{question}

### Context:
{context}

### Answer:
[/INST]"""

REQUESTING_UPDATES_PROMPT_TEMPLATE = """<s>[INST] You act as a helpful immigration and relocation agent, responding to questions on behalf of the user's assigned agent.

Follow these guidelines for the answer:
- For questions that are specific about their own or family member's immigration case progress, suggest to contact their agents.
- Don't use any general information.
- write the answer using english language


### Question:
{question}

### Answer:
[/INST]"""


In [70]:
prompt_infos = [
    {
        "name": "faq",
        "description": "Good for answering generic immigration or relocation questions",
        "prompt_template": FAQ_PROMPT_TEMPLATE,
    },
    {
        "name": "request for updates",
        "description": "Good for answering questions related to user's relocation process or updates",
        "prompt_template": REQUESTING_UPDATES_PROMPT_TEMPLATE,
    },
]

prompts_dict = {}
for p_info in prompt_infos:
    name = p_info["name"]
    prompt_template = p_info["prompt_template"]
    prompts_dict[name] = PromptTemplate(template=prompt_template, input_variables = ['context', 'question'])

In [72]:
from langchain.chains.router import MultiPromptChain
from langchain.chains.router.llm_router import LLMRouterChain, RouterOutputParser

# We modify the prompt because we are using Mistral
# original prompt https://github.com/langchain-ai/langchain/blob/8c150ad7f6e9073f3508187ce79ea4715b74e105/libs/langchain/langchain/chains/router/multi_prompt_prompt.py#L4
MULTI_PROMPT_ROUTER_TEMPLATE = """<s>[INST]
Given a raw text input to a language model select the model prompt best suited for \
the input. You will be given the names of the available prompts and a description of \
what the prompt is best suited for. You may also revise the original input if you \
think that revising it will ultimately lead to a better response from the language \
model.

<< FORMATTING >>
Return a markdown code snippet with a JSON object formatted to look like:
```json
{{{{
    "destination": string \\ name of the prompt to use or "DEFAULT"
    "next_inputs": string \\ a potentially modified version of the original input
}}}}
```

REMEMBER: "destination" MUST be one of the candidate prompt names specified below OR \
it can be "DEFAULT" if the input is not well suited for any of the candidate prompts.
REMEMBER: "next_inputs" can just be the original input if you don't think any \
modifications are needed.

<< CANDIDATE PROMPTS >>
{destinations}


<< INPUT >>
{{input}}

<< OUTPUT (must include ```json at the start of the response) >>
<< OUTPUT (must end with ```) >>[/INST]
"""


destinations = [f"{p['name']}: {p['description']}" for p in prompt_infos]
destinations_str = "\n".join(destinations)
router_template = MULTI_PROMPT_ROUTER_TEMPLATE.format(destinations=destinations_str)
router_prompt = PromptTemplate(
    template=router_template,
    input_variables=["input"],  # must match the {input} placeholder left in the template
    output_parser=RouterOutputParser(),
)
router_chain = LLMRouterChain.from_llm(llm, router_prompt)

router_chain('Any update for my visa?')



{'input': 'Any update for my visa?',
 'destination': 'request for updates',
 'next_inputs': {'input': 'Any update for my visa?'}}
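The router only works if the model returns the fenced JSON snippet the prompt asks for. The sketch below shows roughly how such a reply can be turned into the dictionary printed above. It is an approximation written for illustration: `parse_router_output` is a hypothetical helper, not the `RouterOutputParser` source, and the sample replies are made up.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built indirectly to keep this snippet readable
JSON_BLOCK = re.compile(FENCE + r"(?:json)?\s*(\{.*?\})\s*" + FENCE, re.DOTALL)


def parse_router_output(text):
    """Extract the JSON object from a fenced snippet and normalise its fields."""
    match = JSON_BLOCK.search(text)
    if match is None:
        raise ValueError("no JSON code block found in router output")
    parsed = json.loads(match.group(1))
    destination = parsed["destination"]
    # "DEFAULT" means no candidate prompt fits; represent it as None.
    if isinstance(destination, str) and destination.strip().upper() == "DEFAULT":
        destination = None
    return {"destination": destination,
            "next_inputs": {"input": parsed["next_inputs"]}}


# Hypothetical model replies in the format requested by the router prompt.
reply = (" " + FENCE + "json\n"
         '{\n    "destination": "request for updates",\n'
         '    "next_inputs": "Any update for my visa?"\n}\n' + FENCE)
fallback = FENCE + 'json\n{"destination": "DEFAULT", "next_inputs": "Hello"}\n' + FENCE

routed = parse_router_output(reply)
unrouted = parse_router_output(fallback)
print(routed)
print(unrouted["destination"])
```

The `None` destination matters downstream: any code that looks up a prompt by destination needs a fallback for it.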

In [74]:
import langchain
from langchain.chains import LLMChain
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import EmbeddingsFilter
from langchain.callbacks import StdOutCallbackHandler

# Slightly slower, but improves retrieval by re-scoring each retrieved document against the user query with brute-force cosine similarity.
def filter_similar_docs_from_query(query, embeddings, retriever, similarity_threshold=0.90):
  embeddings_filter = EmbeddingsFilter(embeddings=embeddings, similarity_threshold=similarity_threshold)
  compression_retriever = ContextualCompressionRetriever(base_compressor=embeddings_filter, base_retriever=retriever)

  docs = compression_retriever.get_relevant_documents(query)
  inputs = [f"- {doc.metadata['answer']}" for doc in docs]
  context = "\n".join(inputs)

  return context

# Naive top-k retriever (there is a chance it returns irrelevant context)
def get_documents_as_context(query, retriever):
    docs = retriever.get_relevant_documents(query)
    inputs = [f"- {doc.metadata['answer']}" for doc in docs]
    context = "\n".join(inputs)
    return context

def inference(question, context, prompt):
  # Log generated output to stdout
  handler = StdOutCallbackHandler()
  # We use LLMChain with manual retrieval because the answers to our questions are stored in the document metadata.
  # There is no easy way to do this in LangChain.
  llm_chain = LLMChain(
      llm=llm,
      prompt=prompt,
      verbose=True,
      callbacks=[handler]
  )

  return llm_chain.predict(context=context, question=question)

def get_prompt_by_query(query, default_prompt = prompts_dict['faq']):
  prompt_router = router_chain(query)
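  # The router returns None for "DEFAULT", so fall back to the default prompt instead of raising KeyError
  return prompts_dict.get(prompt_router['destination'], default_prompt)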

# Uncomment to print the final prompt and intermediate chain steps
# langchain.debug = True

In [76]:
%%time

query = "when will I received my TRP?"
# query = "Me and my wife doesn't marriage certificate yet. Is that a problem?"
country = "Estonia"

retriever = vectorstore.as_retriever(search_kwargs={"k":5, 'filter': {'country': country} })
context = filter_similar_docs_from_query(query=query, embeddings=core_embeddings_model, retriever=retriever)
prompt = get_prompt_by_query(query)

inference(question=query, context=context, prompt=prompt)





> Entering new LLMChain chain...
Prompt after formatting:
<s>[INST] You act as a helpful immigration and relocation agent, responding to questions based on provided context on behalf of the user's assigned agent.

Follow these guidelines for the answer:
- Only answer information related to immigration, relocation, settle-in, cultural integration or similar fields related queries.
- Summarize the answer and make it clear, concise, grammatically correct and direct like a human agent would
- Don't disclose yourself as AI, automated tool or bot, answer like the user's agent would
- Never suggest contacting outside Jobbatical platform. Rewrite the answer to contact the user's agent instead
- write the answer using english language
- Answer the question as truthfully as poosible if you are unsure base on the context below reply with "OUT_OF_CONTEXT" 

### Question: 
when will I received my TRP?

### Context:
- By law, the processing of your application can take up to 2

' Based on the information provided, it is possible that you may receive your TRP (Temporary Residence Permit) within 1 month from the date of decision. However, please note that the processing time for your application can take up to 2 months from submission, and the Police Board has the right to extend the deadline if necessary. If you have any further questions or concerns about your TRP, please feel free to contact your assigned agent for assistance.'
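The similarity filter used when building the context can be sketched without LangChain. The snippet below is a simplified illustration of thresholding on cosine similarity; `filter_by_similarity` and the two-dimensional toy vectors are hypothetical, and the exact boundary handling in `EmbeddingsFilter` may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_by_similarity(query_vec, docs, threshold=0.90):
    """Keep only the documents whose embedding is close enough to the query."""
    return [text for text, vec in docs
            if cosine_similarity(query_vec, vec) >= threshold]


# Toy embeddings (assumption for illustration; real vectors have many more dimensions).
query_vec = [1.0, 0.0]
candidates = [
    ("close match", [0.9, 0.1]),
    ("loosely related", [0.5, 0.5]),
    ("unrelated", [0.0, 1.0]),
]

kept = filter_by_similarity(query_vec, candidates)
context = "\n".join(f"- {text}" for text in kept)
print(context)
```

With a threshold as strict as 0.90, it is quite possible that no document survives; the context is then an empty string, which is the case the "OUT_OF_CONTEXT" instruction in the prompt is meant to cover.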

In [30]:
%%time

query = "When will I received my TRP?"
#query = "Good internet provider?"
country = "Estonia"

retriever = vectorstore.as_retriever(search_kwargs={"k":5, 'filter': {'country': country} })
context = filter_similar_docs_from_query(query=query, embeddings=core_embeddings_model, retriever=retriever)

inference(question=query, context=context, prompt=custom_prompt)



> Entering new LLMChain chain...
Prompt after formatting:
<s>[INST] You act as a helpful immigration and relocation agent, responding to questions based on provided context on behalf of the user's assigned agent.

Follow these guidelines for the answer:
- Only answer information related to immigration, relocation, settle-in, cultural integration or similar fields related queries.
- Summarize the answer and make it clear, concise, grammatically correct and direct like a human agent would
- Don't disclose yourself as AI, automated tool or bot, answer like the user's agent would
- Never suggest contacting outside Jobbatical platform. Rewrite the answer to contact the user's agent instead
- write the answer using english language
- Answer the question as truthfully as poosible if you are unsure base on the context below reply with "OUT_OF_CONTEXT" 

### Question: 
When will I received my TRP?

### Context:
- By law, the processing of your application can take up to 2

' Based on the information provided, it is possible that you may receive your TRP (Temporary Residence Permit) within 1 month from the date of decision. However, please note that the processing time for your application can take up to 2 months from submission, and the Police Board has the right to extend the deadline if necessary. If you have any further questions or concerns about your TRP, please feel free to contact your assigned agent for assistance.'

## Debugging

> NOTE: You don't need to run this