## GGML CTransformers Quantized models
- References:
    - https://python.langchain.com/en/latest/integrations/ctransformers.html
    - https://python.langchain.com/en/latest/modules/models/llms/integrations/ctransformers.html
    - https://github.com/marella/ctransformers#langchain
    - https://github.com/marella/ctransformers#supported-models
- Download a GGML model binary (`.bin`) from TheBloke on Hugging Face: https://huggingface.co/TheBloke
    - MPT-7B-Instruct GGML files: https://huggingface.co/TheBloke/MPT-7B-Instruct-GGML/tree/main

In [1]:
from langchain.llms import CTransformers, OpenAI
from langchain.vectorstores import FAISS
from langchain import PromptTemplate, LLMChain
from langchain.chains import RetrievalQA
from langchain.embeddings import HuggingFaceEmbeddings
from dotenv import load_dotenv
import os
import textwrap

In [2]:
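# Generation settings passed to the model through CTransformers
CONFIG = {
    'max_new_tokens': 256,  # upper bound on tokens generated per call
    'temperature': 0.1,     # low temperature keeps answers close to deterministic
    # 'repetition_penalty': 1.1,
}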
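To illustrate why a temperature of 0.1 makes the output nearly deterministic, here is a minimal sketch of standard temperature scaling applied to a softmax. The logits are made-up values, and the real sampler in ctransformers also applies other settings, so treat this as an illustration of the idea rather than the library's implementation.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate tokens (not taken from any model)
toy_logits = [2.0, 1.5, 0.5]

for t in (1.0, 0.1):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

At the lower temperature almost all of the probability mass moves to the highest-scoring token, which is usually what you want for factual question answering.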

In [3]:
# MODEL_BIN_PATH = '../models/bin/mpt-7b-instruct.ggmlv3.q5_1.bin'
MODEL_BIN_PATH = '../models/bin/mpt-7b-instruct.ggmlv3.q8_0.bin'
# MODEL_BIN_PATH = '../models/bin/mpt-7b-instruct.ggmlv3.fp16.bin'
# MODEL_BIN_PATH = '../models/bin/ggml-mpt-7b-chat.bin'
MODEL_TYPE = 'mpt'

In [4]:
llm = CTransformers(model=MODEL_BIN_PATH, 
                    model_type=MODEL_TYPE,
                    config=CONFIG
                    )



In [5]:
# template = """Question: {question}

# Answer:"""

# prompt = PromptTemplate(template=template, input_variables=['question'])

In [6]:
# llm_chain = LLMChain(prompt=prompt, 
#                      llm=llm)

In [7]:
# %%time
# query = 'What is Boston Consulting Group?'
# response = llm_chain.run(query)
# print(response)

___
### Simulate Q&A Retrieval

In [8]:
# qa_system_template_prefix = """
# You are an assistant to a human, powered by a large language model trained by OpenAI.

# You are designed to be able to assist with a wide range of tasks, from answering simple questions to providing in-depth explanations and discussions on a wide range of topics. As a language model, you are able to generate human-like text based on the input you receive, allowing you to engage in natural-sounding conversations and provide responses that are coherent and relevant to the topic at hand.

# You have access to some personalized information provided by the human in the Context section below. 
# """


# qa_system_template_main = """Use the following pieces of information to answer the human's question.
# If you don't know the answer, just say that you don't know, don't try to make up an answer.

# Context:
# - Our Equality Policy ensures a representative and inclusive workplace, striving to 
# eliminate discrimination based on race, gender, disability, religion, age, sexual orientation, 
# or any other characteristic protected by the Equality Act 2010. 
# - Upholding the Employment Rights Act 1996, our Work-Life Balance Policy provides flexible working 
# arrangements for employees to balance their professional and personal lives. This policy, in compliance 
# with the UK Working Time Regulations 1998, also regulates work hours to prevent overwork.

# Helpful answer:"""

In [9]:
# def set_qa_prompt():
#     """
#     Prompt template for QA retrieval for each vectorstore
#     """
#     messages = [
# #         SystemMessagePromptTemplate.from_template(qa_system_template_prefix),
#         SystemMessagePromptTemplate.from_template(qa_system_template_main),
#         HumanMessagePromptTemplate.from_template('{question}')
#         ]

#     qa_prompt = ChatPromptTemplate.from_messages(messages)
    
#     return qa_prompt

In [10]:
# qa_prompt = set_qa_prompt()

In [11]:
new_template = """You are an expert HR assistant. Use the following pieces of information to answer the user's question.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

Context: {context}

Question: {question}

Helpful detailed answer:"""

In [12]:
new_prompt = PromptTemplate(template=new_template, 
                            input_variables=['context', 'question'])
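As a rough picture of what the model eventually receives, the sketch below fills the same template with plain `str.format`. The chunk texts are hypothetical placeholders, and joining chunks with a blank line is an assumption about how the chain concatenates documents, not something confirmed in this notebook.

```python
TEMPLATE = """You are an expert HR assistant. Use the following pieces of information to answer the user's question.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

Context: {context}

Question: {question}

Helpful detailed answer:"""

def fill_prompt(template, chunks, question):
    """Join the retrieved chunks and substitute them into the template."""
    context = "\n\n".join(chunks)  # assumed separator between chunks
    return template.format(context=context, question=question)

# Hypothetical chunks standing in for retriever output
chunks = ["Chunk A about leave policy.", "Chunk B about notice periods."]
filled = fill_prompt(TEMPLATE, chunks, "What is the length of maternity leave?")
print(filled)
```

Printing the filled prompt like this is a cheap way to check how much of the model's context window the retrieved text consumes.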

In [13]:
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2",
                                   model_kwargs={'device': 'cpu'})

In [14]:
# Load the FAISS index saved earlier; use the same embedding model that built it
vectordb = FAISS.load_local('../vectorstores/db_faiss', embeddings)

In [15]:
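def build_retrieval_qa(llm, prompt, vectordb):
    """Build a RetrievalQA chain that answers from the top-2 retrieved chunks."""
    dbqa = RetrievalQA.from_chain_type(llm=llm,
                                       chain_type="stuff",  # put all retrieved chunks into one prompt
                                       retriever=vectordb.as_retriever(search_kwargs={"k": 2}),  # top-2 chunks
                                       return_source_documents=True,  # keep the chunks in the response
                                       chain_type_kwargs={"prompt": prompt}
                                       )

    return dbqa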
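The `k` in `search_kwargs` controls how many chunks the retriever returns. A minimal pure-Python sketch of top-k nearest-neighbour search is shown below, assuming the index ranks by Euclidean distance; the 3-dimensional vectors are toy values (real `all-MiniLM-L6-v2` embeddings have 384 dimensions) and `top_k_by_l2` is a hypothetical helper, not part of FAISS or LangChain.

```python
import math

def top_k_by_l2(query_vec, doc_vecs, k=2):
    """Return indices of the k document vectors closest to the query."""
    distances = [(math.dist(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    distances.sort()  # smallest distance first
    return [idx for _, idx in distances[:k]]

# Toy "embeddings" for four chunks (illustrative values only)
doc_vecs = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.0, 1.0],
]
query_vec = [1.0, 0.0, 0.0]

print(top_k_by_l2(query_vec, doc_vecs, k=2))
```

A larger `k` gives the model more context to draw on but lengthens the prompt, which increases generation time on CPU.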

In [16]:
dbqa = build_retrieval_qa(llm, new_prompt, vectordb)

In [17]:
query = "What is the length of maternity leave?"

In [18]:
%%time
response = dbqa({'query': query})

CPU times: total: 1min
Wall time: 20.7 s


In [19]:
response

{'query': 'What is the length of maternity leave?',
 'result': ' Female employees are eligible for 26 weeks paid maternity leave.',
 'source_documents': [Document(page_content='Copyright © 2019 by Boston Consulting Group. All rights reserved. 17Maternity benefit for birth mothers \n•Female employee shall be entitled to 26 weeks (182 consecutive days \nincluding weekends and Public Holidays) of paid maternity leave and shall be \ntaken in one continuous block with no breaks in between.\n•By local statutory, can start max 4 weeks before giving birth.\n•Notice of pregnancy from a qualified medical practitioner must be submitted \nto SEA Human Resources Operations team.', metadata={'source': 'data\\Employee Handbook 2023 SIN.pdf', 'page': 16}),
  Document(page_content='to SEA Human Resources Operations team.\n•All other provisions in the Employment Law relating to Maternity Leave are adopted herein and shall form an integral part of this Policy.\n•Maternity leave does not break continuity 
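The raw response dictionary is hard to read. One possible way to print it is sketched below using `textwrap`, assuming each source document exposes `page_content` and `metadata` as in the output above. `FakeDocument` and `format_response` are hypothetical names introduced only for this illustration, and the sample text is shortened from the response shown above.

```python
import textwrap
from dataclasses import dataclass, field

@dataclass
class FakeDocument:
    """Hypothetical stand-in exposing the two attributes used below."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def format_response(response, width=80):
    """Render the answer followed by a wrapped excerpt of each source chunk."""
    lines = ["Answer: " + response["result"].strip(), ""]
    for i, doc in enumerate(response["source_documents"], start=1):
        source = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page", "?")
        lines.append(f"Source {i}: {source} (page index {page})")
        excerpt = " ".join(doc.page_content.split())  # collapse PDF line breaks
        lines.append(textwrap.fill(excerpt, width=width,
                                   initial_indent="  ", subsequent_indent="  "))
        lines.append("")
    return "\n".join(lines)

sample = {
    "query": "What is the length of maternity leave?",
    "result": " Female employees are eligible for 26 weeks paid maternity leave.",
    "source_documents": [
        FakeDocument(
            page_content="Maternity benefit for birth mothers \n•Female employee shall be "
                         "entitled to 26 weeks (182 consecutive days \nincluding weekends "
                         "and Public Holidays) of paid maternity leave",
            metadata={"source": "data\\Employee Handbook 2023 SIN.pdf", "page": 16},
        )
    ],
}

print(format_response(sample))
```

With the real chain, the same function could be called as `print(format_response(response))`.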

___
## Performance Review
#### mpt-7b-instruct.ggmlv3.q5_1.bin (4.8 GB) - AMD Ryzen 5 5600X 6-Core Processor
- query (with retrieval, k=3) = "What is the length of maternity leave?": wall time 42 s (CPU)
- query (with retrieval, k=2) = "What is the length of maternity leave?": wall time 28 s (CPU)

#### mpt-7b-instruct.ggmlv3.q8_0.bin (6.9 GB) - AMD Ryzen 5 5600X 6-Core Processor
- query (no retrieval) = 'What is Boston Consulting Group?': wall time 7.3 s (CPU)
- query (with retrieval, k=3) = "What is the length of maternity leave?": wall time 30 s (CPU)
- query (with retrieval, k=2) = "What is the length of maternity leave?": wall time 21 s (CPU)
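To compare the q5_1 and q8_0 retrieval timings recorded so far, the short sketch below computes how much wall time is saved by dropping from k=3 to k=2. The numbers are copied from the measurements in this section; they are single runs, so the percentages are indicative at best.

```python
# Wall-clock seconds copied from the measurements in this section
timings = {
    "q5_1": {"k=3": 42, "k=2": 28},
    "q8_0": {"k=3": 30, "k=2": 21},
}

def relative_reduction(before, after):
    """Fraction of time saved when going from `before` to `after`."""
    return (before - after) / before

for model, t in timings.items():
    saved = relative_reduction(t["k=3"], t["k=2"])
    print(f"{model}: k=3 -> k=2 saves {saved:.0%} of wall time")
```

In both cases retrieving one fewer chunk cuts wall time by roughly a third, which suggests prompt length is a major cost on this CPU.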

#### mpt-7b-instruct.ggmlv3.fp16.bin
