# Prerequisites

First, we install the required libraries, log in to the Hugging Face Hub, and then import them.

In [1]:
# @title Installing required libraries

!pip -q install git+https://github.com/huggingface/transformers
!pip install -q datasets loralib sentencepiece
!pip -q install bitsandbytes accelerate xformers
!pip -q install langchain
!pip -q install peft chromadb
!pip -q install unstructured
!pip install -q sentence_transformers
!pip -q install pypdf


In [2]:
# @title Displaying GPU information

!nvidia-smi

Sat Sep 16 21:57:41 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   32C    P8     9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               

In [3]:
# @title Logging in to HuggingFace

from huggingface_hub import login

login()


In [4]:
# @title Importing the libraries

import nltk
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline

# Quantization of the Model

## What is Quantization?
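It's like compressing your model: the weights are stored with fewer bits (here 4 bits instead of the usual 16), which makes the model smaller, more power-efficient, and in some cases faster.

## Why Quantization?
1. **Smaller Size**: Easier to deploy; in 4-bit, a 13B-parameter model fits on a single 16 GB GPU such as the T4 used here.
2. **Faster**: Less data has to move through memory for each prediction, which can speed up inference (the actual gain depends on the hardware and kernels).
3. **Less Power**: Lower energy use, which matters for battery-powered and edge devices.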
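The size claim is easy to check with arithmetic. The sketch below estimates the storage needed for the weights alone, assuming exactly 13 billion parameters and ignoring activations, the KV cache, and the small overhead of quantization constants, and compares it with the 15,360 MiB reported by `nvidia-smi` for the T4.

```python
# Rough weight-only memory estimate for a ~13B-parameter model.
# Assumptions: exactly 13e9 parameters; activations, the KV cache and the
# per-block quantization constants are ignored.
NUM_PARAMS = 13e9
T4_MEMORY_GIB = 15360 / 1024  # 15360 MiB, as reported by nvidia-smi

BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16 / bfloat16": 2.0,
    "int8": 1.0,
    "4-bit (nf4)": 0.5,
}

def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Storage needed for the weights alone, in GiB (1 GiB = 2**30 bytes)."""
    return num_params * bytes_per_param / 2**30

for dtype, nbytes in BYTES_PER_PARAM.items():
    gib = weight_memory_gib(NUM_PARAMS, nbytes)
    fits = "fits" if gib < T4_MEMORY_GIB else "does not fit"
    print(f"{dtype:>18}: {gib:5.1f} GiB -> {fits} on a 15 GiB T4")
```

The 16-bit weights alone need roughly 24 GiB, while the 4-bit version needs about 6 GiB, leaving headroom on the T4 for activations and the KV cache during generation.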


In [5]:
# @title Defining the config for the quantization

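bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                            # --> Load the weights in 4-bit to save GPU memory.
    bnb_4bit_quant_type="nf4",                    # --> 4-bit NormalFloat, a data type designed for normally distributed weights.
    bnb_4bit_compute_dtype=torch.bfloat16,        # --> Weights are de-quantized to this dtype for the matrix multiplications.
                                                  #     (bfloat16 is only hardware-accelerated on Ampere or newer GPUs; on a T4, torch.float16 may be faster.)
    bnb_4bit_use_double_quant=False               # --> False: the per-block quantization constants are not quantized a second time.
)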
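To make the config less abstract, here is a simplified sketch of *blockwise absmax* 4-bit quantization: the weights are split into blocks, and each block stores one float scale plus a small integer code per weight. This is an illustration under stated assumptions, not the bitsandbytes implementation: real NF4 maps weights to 16 levels placed at quantiles of a normal distribution rather than the evenly spaced levels used here, and the block size of 64 is an assumed value. Double quantization (`bnb_4bit_use_double_quant=True`) would additionally quantize the per-block `scale` values themselves.

```python
import random
from typing import List, Tuple

BLOCK_SIZE = 64  # weights that share one scale (assumed value for illustration)
QMAX = 7         # integer codes range over -7..7 (evenly spaced, unlike NF4)

Block = Tuple[List[int], float]  # (integer codes, per-block scale)

def quantize_block(block: List[float]) -> Block:
    """Scale the block so its largest |weight| maps to QMAX, then round to integers."""
    absmax = max(abs(x) for x in block) or 1.0  # guard against an all-zero block
    scale = absmax / QMAX
    return [round(x / scale) for x in block], scale

def quantize(weights: List[float]) -> List[Block]:
    """Quantize consecutive blocks of BLOCK_SIZE weights independently."""
    return [quantize_block(weights[i:i + BLOCK_SIZE]) for i in range(0, len(weights), BLOCK_SIZE)]

def dequantize(blocks: List[Block]) -> List[float]:
    """Approximate the original weights as code * scale."""
    return [code * scale for codes, scale in blocks for code in codes]

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(256)]  # fake weights, roughly normal
restored = dequantize(quantize(weights))
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(f"max absolute reconstruction error: {max_err:.5f}")
```

Because each block has its own scale, a single large outlier only degrades the precision of the 63 other weights in its block, not the whole tensor.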

# LLAMA2-13B

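We now set up the Large Language Model (LLM), i.e., Meta's Llama 2 13B chat model. First, we load a tokenizer that converts plain text into tokens (the integer IDs the model works with). Then we load the model itself, passing the quantization config defined earlier and `device_map={"": 0}`, which places the entire model on the first GPU. The `meta-llama` repositories are gated, which is why we logged in to the Hub earlier.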

In [6]:
# @title Tokenizer and Model

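model_id = "meta-llama/Llama-2-13b-chat-hf"   # renamed from `id` to avoid shadowing Python's built-in id()
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,   # 4-bit config from the previous cell
    device_map={"": 0},               # put every module on GPU 0
)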


# Prompt Engineering

## Constant Declaration
Llama 2 chat models were trained on prompts in a specific format, so we define a few constants to structure the user's prompt correctly:

1. **B_INST**, **E_INST** => Marks the beginning and the end of the Instructions
2. **B_SYS**, **E_SYS** => Marks the beginning and the end of the System Prompts
3. **DEFAULT_SYSTEM_PROMPT** => Default behavior for the LLM System

## Prompt Structuring
We define a function that builds and returns a well-structured prompt from the constants above, so that the LLM receives the system instructions and context before it responds to the user's raw query.

In [7]:
# @title Prompt Template Configuration

B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

DEFAULT_SYSTEM_PROMPT = """
You are a helpful, respectful and honest assistant. Your primary job is to always answer as helpfully as possible, \
while being safe. Your responses must not contain anything harmful, unethical, racist, sexist, toxic, dangerous, \
or illegal NSFW content. You must always ensure that your responses are socially, politically, morally and overall \
unambiguous, unbiased and positive in nature. If a question does not make sense, or is not factually coherent, explain \
why instead of answering something incorrect. Avoid spreading misinformation as a result of hallucination at any cost. \
If you do not know the answer to any question, be honest by clearing that up and letting the user know so.
"""

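def get_prompt(instruction, new_sys_prompt=""):
  # The default system prompt is always included; new_sys_prompt is appended after it.
  # (A default of DEFAULT_SYSTEM_PROMPT here would repeat the default text twice.)
  SYSTEM_PROMPT = B_SYS + DEFAULT_SYSTEM_PROMPT + new_sys_prompt + E_SYS
  prompt_template = B_INST + SYSTEM_PROMPT + instruction + E_INST
  return prompt_template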
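For a concrete picture of the layout the model receives, here is a stand-alone sketch with a made-up one-line system prompt and question. `build_prompt` is a hypothetical helper that mirrors `get_prompt` without the default system prompt.

```python
# Stand-alone illustration of the Llama 2 chat layout built by get_prompt().
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

def build_prompt(instruction: str, system_prompt: str) -> str:
    """Hypothetical helper: system text inside <<SYS>> tags, everything inside [INST] tags."""
    return B_INST + B_SYS + system_prompt + E_SYS + instruction + E_INST

print(build_prompt("What is castling?", "Answer in one sentence."))
```

This prints the system text between the `<<SYS>>` and `<</SYS>>` markers, a blank line, and then the question immediately followed by `[/INST]`.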

# Vector Stores and Document Processing

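In this step, we embed the user's document (a PDF) into vectors and store them in a vector database so that relevant passages can be found quickly. This is the indexing half of **Retrieval-Augmented Generation (RAG)**: at question time, the chunks most similar to the query are retrieved and inserted into the LLM's prompt as context.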
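At query time, the retriever ranks the stored chunks by how close their vectors are to the question's vector. The sketch below shows that idea with hand-made 3-dimensional "embeddings" and cosine similarity; real embeddings from `HuggingFaceEmbeddings` have hundreds of dimensions, and Chroma's own distance metric and indexing may differ. `retrieve` and the toy vectors are hypothetical.

```python
import math
from typing import Dict, List, Tuple

def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

def retrieve(query_vec: List[float], store: Dict[str, List[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k chunks whose vectors are most similar to the query vector."""
    scored = [(chunk, cosine_similarity(query_vec, vec)) for chunk, vec in store.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy "vector database": chunk text -> made-up embedding.
STORE = {
    "The knight moves in an L-shape.": [0.9, 0.1, 0.0],
    "Pawns move forward one square.": [0.2, 0.9, 0.1],
    "Castling moves the king two squares.": [0.1, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]  # pretend embedding of "How does the knight move?"

top_chunks = retrieve(query_vec, STORE, k=2)
context = "\n\n".join(chunk for chunk, _ in top_chunks)  # text that would be placed in the prompt
print(context)
```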

In [8]:
# @title Importing Langchain and its Methods

from langchain.document_loaders import PyPDFLoader  # to load the document as plaintext
from langchain.text_splitter import RecursiveCharacterTextSplitter  # to split the plaintext into chunks
from langchain.embeddings import HuggingFaceEmbeddings  # to embed chunks into vectors
from langchain.vectorstores import Chroma  # to store those vectors in a database

In [9]:
# @title Loading the PDF file into a PDF Loader

# I'm using a chess tutorial PDF, but you can use any PDF you like.
# This PDF is included in this repo if you need it.
loader = PyPDFLoader("/content/chess.pdf")

In [10]:
# @title Splitting the text into chunks

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 20,
    length_function = len,
)
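`chunk_size` and `chunk_overlap` are easiest to understand on a plain sliding window. The sketch below is a simplification: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraph breaks, newlines, and spaces before falling back to raw characters, so its chunks are usually shorter than `chunk_size` and aligned to word boundaries. `sliding_window_chunks` is a hypothetical helper.

```python
from typing import List

def sliding_window_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 20) -> List[str]:
    """Split text into windows of chunk_size characters; consecutive windows share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):  # this window already reaches the end of the text
            break
    return chunks

sample = "".join(str(i % 10) for i in range(1200))  # 1,200 characters of dummy text
chunks = sliding_window_chunks(sample, chunk_size=500, chunk_overlap=20)
print([len(c) for c in chunks])  # [500, 500, 240]
```

The overlap repeats the last `chunk_overlap` characters of each chunk at the start of the next one, so text near a boundary keeps a little shared context.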

In [11]:
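# @title Loading the PDF and splitting it into chunks

# Note: load_and_split returns chunks of at most 500 characters, so `pages` holds chunks rather than whole pages.
pages = loader.load_and_split(text_splitter)
pages[:1]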

[Document(page_content="CHESS  \n \nBy Mind Games  \n \n \nHow to Play Chess: Rules and Basics  \nIt's never too late to learn how to play chess - the most popular game in the world! \nLearning the rules of chess is easy:  \nStep 1. How to Setup the Chessboard  \nAt the beginning of the game the chessboard is laid out so that each player has the \nwhite (or light) color square in the bottom right -hand side. The chess pieces are then \narranged the same way each time. The second row (or rank) is filled with pawns. The", metadata={'source': '/content/chess.pdf', 'page': 0})]

In [12]:
# @title Storing embeddings into a Vector Database

db = Chroma.from_documents(pages, HuggingFaceEmbeddings(), persist_directory='/content/db')


# Pipeline and Bot Creation

This stage defines a function `create_pipeline()` that sets up a Hugging Face text-generation pipeline, and a `ChatBot` class. The class constructor stores the chatbot's memory, prompt, and retriever. Its `create_chat_bot()` method wraps the pipeline as a LangChain LLM and combines it with the retriever and memory in a `ConversationalRetrievalChain`, which retrieves relevant chunks and generates an answer based on them. The method returns this chain for use.

In [13]:
# @title Importing the necessary libraries

from langchain import HuggingFacePipeline
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferWindowMemory

In [14]:
# @title Chatbot memory configuration

memory = ConversationBufferWindowMemory(
    memory_key="chat_history",   # the chain reads the conversation history from this key.
    k=5,                         # keep only the last 5 exchanges (the parameter is `k`, not `key`).
    return_messages=True         # return the history as message objects rather than one string.
)
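The "window" in `ConversationBufferWindowMemory` means that only the most recent exchanges are kept. Here is a minimal sketch of that behaviour using `collections.deque(maxlen=k)`; `WindowMemory` is a hypothetical class for illustration, not LangChain's implementation.

```python
from collections import deque
from typing import Deque, List, Tuple

class WindowMemory:
    """Hypothetical window memory: keeps only the last k (question, answer) exchanges."""

    def __init__(self, k: int = 5) -> None:
        self.turns: Deque[Tuple[str, str]] = deque(maxlen=k)

    def save(self, question: str, answer: str) -> None:
        self.turns.append((question, answer))  # the oldest exchange is dropped once k is exceeded

    def history(self) -> List[Tuple[str, str]]:
        return list(self.turns)

memory_demo = WindowMemory(k=2)
for i in range(1, 4):
    memory_demo.save(f"question {i}", f"answer {i}")
print(memory_demo.history())  # only exchanges 2 and 3 remain
```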

In [15]:
# @title Defining a retriever object for the database

retriever = db.as_retriever()

In [16]:
# @title Pipeline Generation

def create_pipeline(max_new_tokens=512):
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=max_new_tokens,
        do_sample=True,    # temperature only takes effect when sampling is enabled
        temperature=0.3,   # any strictly positive float; lower values give more deterministic answers
    )
    return pipe
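Temperature divides the model's logits before the softmax: values below 1 sharpen the distribution toward the most likely token, values above 1 flatten it, and 0 would mean dividing by zero, which is why it must be strictly positive. A small sketch with made-up logits; in `transformers`, temperature only affects generation when sampling is enabled (`do_sample=True`).

```python
import math
from typing import List

def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Turn logits into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be strictly positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.3, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At `temperature=0.3`, as in the pipeline, most of the probability mass lands on the top-scoring token, which suits short factual answers.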

In [17]:
# @title Encapsulating the process into a class

class ChatBot:
  def __init__(self, memory, prompt, task: str = "text-generation", retriever=retriever):
    # Note: `task` is currently unused; the pipeline task is fixed in create_pipeline().
    self.memory = memory
    self.prompt = prompt
    self.retriever = retriever

  def create_chat_bot(self, max_new_tokens=512):
    hf_pipe = create_pipeline(max_new_tokens)
    llm = HuggingFacePipeline(pipeline=hf_pipe)   # wrap the HF pipeline as a LangChain LLM
    qa = ConversationalRetrievalChain.from_llm(
      llm=llm,
      retriever=self.retriever,
      memory=self.memory,
      combine_docs_chain_kwargs={"prompt": self.prompt}   # use our Llama 2 prompt to answer
    )
    return qa

# Setting up the Inference

Now we create a prompt template and a chess bot that uses it.

In [18]:
# @title Prompt Template

from langchain import PromptTemplate

instruction = "Given the context that has been provided : \n{context}\nAnswer the following question : \n{question}"

system_prompt = """
You are an expert at chess, you play at the Grandmaster Level. Your name is Mikhail Tal. \
Your job is to respond to the queries that have been asked to you, in a polite and \
instructive manner. Your responses must be extremely short and accurate, do not write multiple \
huge paragraphs. In case you do not know the answer to a particular question, simply \
clarify the fact that you don't know, and avoid spreading misinformation at any cost. \
Maximum length of your responses should be 2-3 sentences.
"""

template = get_prompt(instruction, system_prompt)
print(template)

prompt = PromptTemplate(template=template, input_variables=["context", "question"])

[INST]<<SYS>>

You are a helpful, respectful and honest assistant. Your primary job is to always answer as helpfully as possible, while being safe. Your responses must not contain anything harmful, unethical, racist, sexist, toxic, dangerous, or illegal NSFW content. You must always ensure that your responses are socially, politically, morally and overall unambiguous, unbiased and positive in nature. If a question does not make sense, or is not factually coherent, explain why instead of answering something incorrect. Avoid spreading misinformation as a result of hallucination at any cost. If you do not know the answer to any question, be honest by clearing that up and letting the user know so.

You are an expert at chess, you play at the Grandmaster Level. Your name is Mikhail Tal. Your job is to respond to the queries that have been asked to you, in a polite and instructive manner. Your responses must be extremely short and accurate, do not write multiple huge paragraphs. In case you 
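When the chain answers a question, it fills `{context}` with the retrieved chunks and `{question}` with the (possibly rephrased) user question. A rough sketch of that substitution using plain `str.format`; the chunk texts are made up, and joining them with a blank line is an assumption about the default "stuff" behaviour rather than a guarantee.

```python
# Same instruction string as in the cell above.
instruction = "Given the context that has been provided : \n{context}\nAnswer the following question : \n{question}"

retrieved_chunks = [  # made-up stand-ins for chunks returned by the retriever
    "The knight moves in an L-shape: two squares in one direction, then one square sideways.",
    "The knight is the only piece that can jump over other pieces.",
]

filled = instruction.format(
    context="\n\n".join(retrieved_chunks),  # assumed separator between chunks
    question="How does the Knight move in Chess?",
)
print(filled)
```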

In [19]:
# @title Bot Creation

chess_bot_instance = ChatBot(memory=memory, prompt=prompt)
chess_bot = chess_bot_instance.create_chat_bot()

# Inference

Finally, we test the chatbot we created.

In [20]:
# @title Creating a reply function
from IPython.display import display, HTML

def reply(message):
  bot_response = chess_bot({"question": message})['answer']
  display(HTML(bot_response))

In [21]:
# @title Testing the bot

reply("How to set up the chessboard? Explain in 2 sentences.")

In [23]:
reply("How does the Knight move in Chess?")

In [24]:
reply("Explain en passant.")

In [28]:
reply("What are some of the best opening strategies?")

In [30]:
reply("What do you mean by a Gambit?")

In [32]:
reply("Can you tell me the numerical values associated to each piece in chess?")