<a href="https://colab.research.google.com/github/joshuaalpuerto/ML-guide/blob/main/RAG_hybrid_Mistral_7B_Instruct.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Loading a GPTQ-quantized model requires optimum (`pip install optimum`) and the auto-gptq library (`pip install auto-gptq`)
!pip install -q optimum --progress-bar off
!pip install -q auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/  --progress-bar off # Use cu117 if on CUDA 11.7
# Pin transformers to a specific commit so that Mistral works
!pip install -q git+https://github.com/huggingface/transformers.git@72958fcd3c98a7afdc61f953aa58c544ebda2f79 --progress-bar off

!pip install -qU langchain faiss-gpu sentence-transformers
!pip install -q jq # required by the JSON loader
!pip install ctransformers[gptq] # to use CTransformers from langchain and load the GPTQ model
!pip install -q llama-cpp-python # required by the LlamaCpp wrapper used further down

# !pip install -qU trl py7zr
!pip install -q rank_bm25
!pip install -q pypdf


In [5]:
import langchain
from langchain.vectorstores import FAISS
from langchain.storage import LocalFileStore
from langchain.retrievers import BM25Retriever,EnsembleRetriever

from langchain.schema import prompt
from langchain.chains import RetrievalQA
from langchain.callbacks import StdOutCallbackHandler
from langchain import PromptTemplate



In [3]:
#connect to google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
from langchain.llms import CTransformers

config = {'max_new_tokens': 1024, 'temperature': 0.1, 'repetition_penalty': 1.1}

model_name_or_path = "TheBloke/Mistral-7B-Instruct-v0.1-GPTQ"

llm = CTransformers(model=model_name_or_path,
                    config=config)


In [7]:
from langchain.storage import InMemoryStore
from langchain.embeddings import CacheBackedEmbeddings,HuggingFaceEmbeddings

# Embeddings can be cached on the local file system so that they survive a restart:
# store = LocalFileStore("./cache/")

# Here we use an in-memory cache instead; it is lost when the runtime restarts.
# NOTE: we use the in-memory store because we are more familiar with it
store = InMemoryStore()

embed_model_id = "thenlper/gte-large"

# Under the hood, HuggingFaceEmbeddings uses the sentence-transformers library
core_embeddings_model = HuggingFaceEmbeddings(model_name=embed_model_id)

# Wrap the model in CacheBackedEmbeddings so that text which has already been embedded is not embedded again.
embedder = CacheBackedEmbeddings.from_bytes_store(core_embeddings_model,
                                                  store,
                                                  namespace=embed_model_id)
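
The idea behind the cache-backed embedder can be sketched in plain Python. This is only an illustration of the caching pattern, not the library's implementation: `fake_embed`, `CachedEmbedder`, and the key scheme are hypothetical names chosen for this example, and the real embedding model is replaced by a cheap stand-in.

```python
import hashlib


def fake_embed(text: str) -> list[float]:
    """Hypothetical stand-in for the real embedding model."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255 for byte in digest[:4]]


class CachedEmbedder:
    """Embeds each distinct text once and serves repeats from a key-value store."""

    def __init__(self, embed_fn, namespace: str):
        self.embed_fn = embed_fn
        self.namespace = namespace
        self.store: dict[str, list[float]] = {}  # plays the role of InMemoryStore
        self.misses = 0  # how many times the underlying model was called

    def _key(self, text: str) -> str:
        # The namespace keeps vectors from different models apart.
        return self.namespace + ":" + hashlib.sha1(text.encode("utf-8")).hexdigest()

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        vectors = []
        for text in texts:
            key = self._key(text)
            if key not in self.store:
                self.misses += 1
                self.store[key] = self.embed_fn(text)
            vectors.append(self.store[key])
        return vectors


cached = CachedEmbedder(fake_embed, namespace="thenlper/gte-large")
first = cached.embed_documents(["How do I reset my password?", "Where is my invoice?"])
second = cached.embed_documents(["How do I reset my password?"])  # served from the cache
print(cached.misses)
```

Note that the cache key is derived from the exact text, so only identical strings hit the cache; a query that is merely similar is embedded again.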



In [None]:
from langchain.document_loaders import JSONLoader

# Define the metadata extraction function.
def metadata_func(record: dict, metadata: dict) -> dict:

    metadata["country"] = record.get("country")
    metadata["answer"] = record.get("answer")

    return metadata


loader = JSONLoader(
    file_path='/content/drive/MyDrive/datasets/qna-clean.json',
    jq_schema='.[]',
    content_key="question",
    metadata_func=metadata_func
)

faq_docs = loader.load()
faq_docs
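
To make the loader call above more concrete, here is a rough standard-library sketch of what it is doing, assuming the file is a top-level JSON array of records with `question`, `answer`, and `country` fields. The sample records and the `load_records` helper are hypothetical; the real loader also attaches its own bookkeeping metadata, which is only approximated here.

```python
import json

# Hypothetical sample with the shape the loader call expects.
raw = json.dumps([
    {"question": "How do I open an account?", "answer": "Use the sign-up form.", "country": "EE"},
    {"question": "What are the fees?", "answer": "There are no monthly fees.", "country": "PH"},
])


def metadata_func(record: dict, metadata: dict) -> dict:
    # Same extraction as in the cell above.
    metadata["country"] = record.get("country")
    metadata["answer"] = record.get("answer")
    return metadata


def load_records(text: str, content_key: str) -> list[dict]:
    docs = []
    # The jq schema '.[]' yields each element of the top-level array in turn.
    for index, record in enumerate(json.loads(text), start=1):
        metadata = metadata_func(record, {"seq_num": index})
        docs.append({"page_content": record[content_key], "metadata": metadata})
    return docs


docs = load_records(raw, content_key="question")
print(docs[0]["page_content"])
print(docs[0]["metadata"]["answer"])
```

With this layout only the question text becomes the document content, so only the question is embedded and searched. The answer travels in the metadata, and a chain that passes just the document content to the model would need to pull the answer out of the metadata explicitly.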

In [19]:
# Create VectorStore
vectorstore = FAISS.from_documents(faq_docs, embedder)
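
The notebook imports `BM25Retriever` and `EnsembleRetriever`, but only the dense FAISS index is built here. As far as I understand, an ensemble retriever merges the ranked lists from its members with weighted reciprocal rank fusion. The sketch below shows that fusion step on its own; the document ids, the two example rankings, and the constant `c=60` are assumptions made for illustration.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, c=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each list contributes weight / (rank + c) to a document's score,
    so documents ranked highly by several retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results for one query from a keyword and a dense retriever.
bm25_ranking = ["faq-3", "faq-1", "faq-7"]
dense_ranking = ["faq-1", "faq-5", "faq-3"]

fused = weighted_rrf([bm25_ranking, dense_ranking], weights=[0.5, 0.5])
print(fused)  # ['faq-1', 'faq-3', 'faq-5', 'faq-7']
```

Documents found by both retrievers (`faq-1`, `faq-3`) outrank those found by only one, which is the main benefit of combining keyword and embedding search.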



In [None]:
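# Load the GGUF build of Mistral-7B-Instruct through llama.cpp.
# NOTE: this rebinds `llm`, replacing the CTransformers model created earlier.
from langchain.llms import LlamaCpp

llm = LlamaCpp(
    streaming=True,
    model_path="/content/drive/MyDrive/Model/mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    temperature=0.75,
    top_p=1,
    verbose=True,
    n_ctx=4096
)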

AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 


In [None]:
# Use the FAISS index built above (`vectorstore`) and pass the top 2 matches to the model.
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=vectorstore.as_retriever(search_kwargs={"k": 2}))
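
The "stuff" chain type simply places all retrieved documents into a single prompt, followed by the question. A minimal sketch of that assembly step is shown below. The template wording, the `build_stuff_prompt` helper, and the retrieved passages are all hypothetical, and the actual template used by the library may differ.

```python
def build_stuff_prompt(question: str, documents: list[str]) -> str:
    """Concatenate every retrieved document into one context block."""
    context = "\n\n".join(documents)
    return (
        "Use the following pieces of context to answer the question at the end.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Helpful Answer:"
    )


# Hypothetical passages standing in for the k=2 retrieved documents.
retrieved = [
    "Linear regression models a dependent variable as a linear function of predictors.",
    "Ordinary least squares estimates the coefficients by minimising squared residuals.",
]

prompt = build_stuff_prompt("What is linear regression model", retrieved)
print(prompt)
```

Because everything is placed in one prompt, the retrieved text plus the question must fit within the model's context window (`n_ctx`), which is one reason to keep `k` small.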

In [None]:
query = "What is linear regression model"

In [None]:
qa.run(query)

' The linear regression model is a statistical method for modeling the relationship between a dependent variable and one or more independent variables, commonly referred to as predictors or explanatory variables. In this paper, the linear regression model is used as a starting point to derive estimators for the variance-covariance matrix of the errors in certain structural change models.'

In [None]:
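# Simple interactive loop: type 'exit' to stop, empty input is ignored.
while True:
  user_input = input("Input Prompt: ")
  if user_input == 'exit':
    print('Exiting')
    break  # leave the loop without terminating the notebook kernel
  if user_input == '':
    continue
  result = qa({'query': user_input})
  print(f"Answer: {result['result']}")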

Llama.generate: prefix-match hit


Answer:  It is a statistical method used to analyze the linear relationship between two or more variables. In this case, it is used to estimate the coefficients of a linear regression model.


Llama.generate: prefix-match hit


Answer:  sandwich provides various functions for estimating the covariance matrix. The most commonly used ones are vcovHAC and vcovHC. These functions can take different weighting schemes, including kernel-based HAC estimation with automatic bandwidth selection based on weightsAndrews and bwAndrews. Other functions like weightsLumley provide different weighting schemes, such as truncated and smoothed weights, which are useful in certain applications. In econometric analyses, these functions are used to compute partial t-tests for assessing the significance of a parameter. The choice of estimator depends on the presence of heteroscedasticity and autocorrelation in the errors.
