# **Overview:**

### The following code and documentation cover the Llama2-13B LLM, testing its memory and its retrieval Q&A chain

***Feel free to test the memory functionality and user query retrieval with different questions. If you find any limitations after testing, please note them at the end.***

Furthermore, if you have difficulty understanding certain arguments, please use **BARD** instead of ChatGPT, because ChatGPT may have no information about LangChain due to its knowledge cut-off.

Requirements on Colab:
- 20+ GB RAM minimum
- Single A100 GPU or multiple A40 GPUs
- <100 GB Disk Space

# Phases in approach:
1. Installations
2. Load documents for querying (PDF **and** text, both necessary)
3. Define the recursive text splitter from LangChain
  - Define your tiktoken tokenizer first (you can experiment with the type of tokenizer)
  - Define your recursive splitter using the tokenizer of your choice (tiktoken tokenizer)
4. Create your embeddings (Roberta v1 large is used)
5. Initiate Pinecone
6. Define and load Llama2-13B in Colab.
7. Training the model:
  - Print the trainable parameters from PEFT
  - Define model tokenizer for tokenizing (try spacy tokenizer if the file size is big)
  - Define training arguments
  - Save the trained model **OR** push the model to Hugging Face repo
  - Call the saved trained model
8. Define elements of model pipeline
  - Llama2-13B LLM tokenizer
  - Stopping criteria object:
    - Identify stopping tokens
    - Convert them to "LongTensors"
    - Define your custom stopping criteria function using the stop tokens
  - Finally, define your pipeline based on tokenizer and stopping function
9. Load the model in LangChain:
  - Modify the system_message/default instruction prompt if necessary
  - Instantiate your modified prompt template if you modified the prompt
  - Define your type of memory (Conversation Buffer window memory)
  - Instantiate your LLM using "HuggingFacePipeline"
  - Define your retrieval Q&A chain
10. Test your model with queries


**NOTE:**<br>
A trimming function is defined before phase 10 because, in initial testing, the model's output contained stray letters and symbols that needed to be removed.

# Brief overview of types of memory in LangChain:
### **Type #1: ConversationBufferMemory**

It simply takes your past interactions with the AI and passes them as raw text into the `{history}` parameter without any processing.

### Pros and Cons of ConversationBufferMemory
- Pros:
  - Stores the maximum amount of information, i.e. nothing from earlier turns is lost
  - Simple, intuitive approach

- Cons:
  - Every token is stored, so responses get slower as the conversation grows and queries get longer and more complex
  - Because every token is stored, a long enough conversation will exhaust the model's maximum token limit.
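
The trade-off above can be seen in a small sketch. This is a toy stand-in written in plain Python, not LangChain's implementation; the class name `BufferMemorySketch` and the word-count proxy for tokens are assumptions made for illustration only.

```python
class BufferMemorySketch:
    """Toy stand-in for buffer-style memory: keeps every turn as raw text."""

    def __init__(self):
        self.turns = []  # list of (human_text, ai_text) pairs

    def save(self, human_text, ai_text):
        self.turns.append((human_text, ai_text))

    def history(self):
        # Everything is replayed verbatim; nothing is trimmed or summarized.
        return "\n".join(f"Human: {h}\nAI: {a}" for h, a in self.turns)


buffer_memory = BufferMemorySketch()
sizes = []
for i in range(3):
    buffer_memory.save(f"question {i}", f"answer {i}")
    # Whitespace-separated words serve as a rough proxy for tokens here.
    sizes.append(len(buffer_memory.history().split()))

print(sizes)  # the history grows with every turn
```

The history size grows linearly with the number of turns, which is why a long conversation eventually hits the context limit.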

### **Type #2: ConversationSummaryMemory**

The conversation summary memory keeps the previous pieces of conversation in a summarized form, where the summarization is performed by an LLM.

### Pros and Cons of ConversationSummaryMemory:
- Pros:
  - Fewer tokens for long conversations
  - Therefore, enables longer conversations
  - Not too complex.
- Cons:
  - Inefficient for short conversations
  - Heavily dependent on good summaries; with a small model like ours, the short summaries it produces are not good.
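
To make the dependence on summary quality concrete, here is a minimal sketch of a running summary. The functions `update_summary` and `naive_summarize` are hypothetical; in the real memory type the summarizer is an LLM call, which is replaced here by a crude word clipper so the example runs on its own.

```python
def update_summary(summary, human_text, ai_text, summarize):
    """Fold the newest turn into the running summary using the given callable."""
    new_lines = f"Human: {human_text}\nAI: {ai_text}"
    return summarize(summary, new_lines)


def naive_summarize(summary, new_lines):
    # Hypothetical stand-in for an LLM: keep only the first 8 words of the new turn.
    clipped = " ".join(new_lines.split()[:8])
    return (summary + " | " + clipped).strip(" |")


summary = ""
summary = update_summary(
    summary,
    "Who is Barack Obama?",
    "He is the 44th President of the United States.",
    naive_summarize,
)
summary = update_summary(
    summary,
    "Where was he born?",
    "He was born in Honolulu, Hawaii.",
    naive_summarize,
)
print(summary)
```

The weak summarizer drops the birthplace entirely, so a later question about it could not be answered from memory. A better summarizer loses less, but some loss is inherent in this approach.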

### **Type #3: ConversationBufferWindowMemory**

ConversationBufferWindowMemory keeps only a specified number of the most recent interactions in memory and intentionally drops the oldest ones: short-term memory, **if you'd like.**
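
The sliding-window behaviour can be sketched with a bounded deque. This is an illustration in plain Python, not LangChain's implementation; `WindowMemorySketch` is a hypothetical name and `k` plays the same role as the window size described above.

```python
from collections import deque


class WindowMemorySketch:
    """Toy stand-in for window memory: only the last k turns are kept."""

    def __init__(self, k):
        # A deque with maxlen=k silently discards the oldest turn when full.
        self.turns = deque(maxlen=k)

    def save(self, human_text, ai_text):
        self.turns.append((human_text, ai_text))

    def history(self):
        return "\n".join(f"Human: {h}\nAI: {a}" for h, a in self.turns)


window_memory = WindowMemorySketch(k=2)
for i in range(4):
    window_memory.save(f"question {i}", f"answer {i}")

print(window_memory.history())
```

After four turns with `k=2`, only turns 2 and 3 remain, so the prompt size stays bounded no matter how long the conversation runs.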

### **Conclusion:**

The behaviour of buffer window memory depends heavily on the `k` parameter, which is the number of previous interactions the model keeps in the `{history}` prompt.

This is the memory type we use: we do not want to spend tokens storing the whole conversation as raw text, nor do we want summaries of our conversations, because summaries lose information.

*The biggest advantage of this memory is that the LLM can act as a conversational agent without LangChain's ConversationChain: it remembers previous interactions and can answer questions that build on them.*

**The demonstration of Q&A retrieval integrated with this type of memory will be our focus.**

# **Phase 1: Installations:**

In [None]:
!pip install -qU \
  transformers==4.31.0 \
  sentence-transformers==2.2.2 \
  pinecone-client==2.2.2 \
  datasets==2.14.0 \
  accelerate==0.21.0 \
  einops==0.6.1 \
  langchain==0.0.240 \
  xformers==0.0.20 \
  bitsandbytes==0.41.0

!pip install -qqq loralib==0.1.1

In [None]:
!pip install unstructured==0.6.1 -q
!pip install unstructured[local-inference] -q
!pip install detectron2@git+https://github.com/facebookresearch/detectron2.git@v0.6#egg=detectron2 -q

In [None]:
!pip install -qU pypdf tiktoken

# **Phase 2: Loading Files as PDFs and Text Files**

In [None]:
!pip install -qqq torch==2.0.1
!pip install -qqq -U git+https://github.com/huggingface/peft.git@42a184f

In [1]:
from langchain.document_loaders import DirectoryLoader    # Loading the directory
from langchain.document_loaders import TextLoader         # Loading text files
from langchain.document_loaders import PyPDFLoader

# Directory loader that picks up the PDF file
loader_1 = DirectoryLoader('/home/balaji/Desktop/VM2/', glob="Combined_half.pdf", loader_cls=PyPDFLoader)

# Loading the PDF pages as documents
document = loader_1.load()    # 624

# Loader for the .txt version (call loader.load() to read it; only the PDF documents are used below)
loader = TextLoader("/home/balaji/Desktop/VM2/Combined_half.txt")

# **Phase 3: Define Recursive Text Splitter**
## Phase 3A: Defining our tiktoken tokenizer for the `length_function` argument in Recursive Text Splitter

In [3]:
import tiktoken

tokenizer_tiktoken = tiktoken.get_encoding('p50k_base')           # max token len is 2048

# creating the length function to count the number of tokens
def tiktoken_token_len(text):
  """
  This function simply counts the number of tokens in the content.

  Note: The number of tokens is not equal to the length of the content
  """
  tokens = tokenizer_tiktoken.encode(                       # This is very specific to tiktoken
      str(text),
      disallowed_special=()
    )
  return len(tokens)

tiktoken_token_len(document)

27714

## Phase 3B: Chunking our documents using RecursiveCharacterTextSplitter

In [1]:
# Since document is large chunking it to reduce the size

from langchain.text_splitter import RecursiveCharacterTextSplitter

def split_docs(documents, chunk_size=20, chunk_overlap=5):
  """
  The function uses a text splitter called RecursiveCharacterTextSplitter to
  divide the documents into smaller chunks.
  The function applies the text splitter to each document in the input list and
  returns the resulting chunks.
  chunk_size and chunk_overlap are measured in tokens because the length
  function counts tiktoken tokens.
  """
  text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size,
                                                 chunk_overlap=chunk_overlap,
                                                 length_function=tiktoken_token_len,
                                                 separators=["\n\n", "\n", " ", ""])
  docs = text_splitter.split_documents(documents)
  return docs

# Calling our function
docs = split_docs(document)

print("Length of docs after chunking:", len(docs), "\n")
print("One of the page's chunked content:\n\n", docs[0])


In [5]:
type(docs)

list
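
The idea behind recursive splitting with a custom length function can be sketched in a few lines. This is a simplified illustration, not LangChain's `RecursiveCharacterTextSplitter`: it omits chunk overlap and the merging of small pieces back together, and `recursive_split` and `word_len` are hypothetical helpers (a word counter stands in for the tiktoken length function).

```python
def recursive_split(text, chunk_size, length_function=len,
                    separators=("\n\n", "\n", " ", "")):
    """Split text on the coarsest separator first, recursing only where needed."""
    if length_function(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        return [text]  # nothing left to split on
    sep, rest = separators[0], separators[1:]
    # An empty separator means "split into single characters".
    pieces = list(text) if sep == "" else text.split(sep)
    chunks = []
    for piece in pieces:
        chunks.extend(recursive_split(piece, chunk_size, length_function, rest))
    return chunks


def word_len(text):
    # Stand-in for a token counter: counts whitespace-separated words.
    return len(text.split())


sample = "first paragraph here\n\nsecond paragraph is a bit longer\nwith two lines"
chunks = recursive_split(sample, chunk_size=4, length_function=word_len)
print(chunks)
```

Pieces that already fit are kept whole, while oversized pieces are broken down with progressively finer separators. The real splitter additionally merges adjacent small pieces up to `chunk_size`, which avoids the single-word chunks this sketch produces.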

# **Phase 4: Creating our Embeddings**

In [6]:
from torch import cuda
from langchain.embeddings import HuggingFaceEmbeddings, SentenceTransformerEmbeddings

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# Equivalent to HuggingFaceEmbeddings(model_name="all-roberta-large-v1")
embeddings_roberta = SentenceTransformerEmbeddings(model_name = "all-roberta-large-v1",
                                                   model_kwargs={'device': device})

In [7]:
# Testing our embedding:
docs_1 = [
    "this is one document",
    "and another document"
]

embeddings = embeddings_roberta.embed_documents(docs_1)

print(f"We have {len(embeddings)} doc embeddings, each with "
      f"a dimensionality of {len(embeddings[0])}.")

We have 2 doc embeddings, each with a dimensionality of 1024.


# **Phase 5: Initiating Pinecone**

**NOTE:**<br>
Please enter your Pinecone API key, index name and environment name from Pinecone before running the cell.<br>

The cell below handles *two* cases: it creates a new index for this model if none exists, and otherwise it reuses the existing index, which prevents overwriting it.

In [8]:
import pinecone
from langchain.vectorstores import Pinecone

# Initialize Pinecone with your API key and environment
pinecone.init(api_key="YOUR_PINECONE_API_KEY", environment="us-west4-gcp-free")   # replace with your own key; never share a real key in a notebook

# Define the index name
index_name = 'llama2624'

# Check if the index exists, and create it if it doesn't
if index_name not in pinecone.list_indexes():
    pinecone.create_index(name=index_name, metric="cosine", dimension=1024)
    docsearch = Pinecone.from_documents(docs, embeddings_roberta, index_name=index_name)
else:
    docsearch = Pinecone.from_existing_index(index_name, embeddings_roberta)


### Calling the model:
Following is the model configuration in bitsandbytes defined in Phase 6.
```
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)
```
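
A back-of-the-envelope calculation shows why 4-bit loading matters for a model of this size. The sketch below assumes a nominal 13 billion parameters and counts weight storage only; it ignores quantization constants, activations and the KV cache, so real usage is higher.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 13e9  # assumption: nominal parameter count for a 13B model

fp16_gb = weight_memory_gb(n_params, 16)
nf4_gb = weight_memory_gb(n_params, 4)

print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f" 4-bit weights: ~{nf4_gb:.1f} GB")
```

Under these assumptions the weights shrink from roughly 26 GB to roughly 6.5 GB, which is what makes a single-GPU setup feasible.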

In [10]:
from torch import cuda, bfloat16
import transformers

from peft import (
    LoraConfig,
    PeftConfig,
    PeftModel,
    get_peft_model,
    prepare_model_for_kbit_training
)

bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,                         # load the model in 4-bit precision.
    bnb_4bit_quant_type='nf4',                 # type of quantization to use for 4-bit weights.
    bnb_4bit_use_double_quant=True,            # use double quantization for 4-bit weights.
    bnb_4bit_compute_dtype=bfloat16            # compute dtype to use for 4-bit weights.
)

TRAINED_MODEL = 'kings-crown/EM624_QA_Full'
config = PeftConfig.from_pretrained(TRAINED_MODEL)
model = transformers.AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,   # name or path of the base model that the PEFT adapter was trained on
    return_dict=True,
    quantization_config=bnb_config,   # load the weights in 4-bit to reduce GPU memory use
    device_map="auto",                # place the model on the available device(s) automatically
    trust_remote_code=True
)

tokenizer = transformers.AutoTokenizer.from_pretrained(config.base_model_name_or_path)   # tokenizer that matches the base model
tokenizer.pad_token = tokenizer.eos_token                                                # reuse the end-of-sequence token as the padding token

model = PeftModel.from_pretrained(model, TRAINED_MODEL)   # attach the trained adapter weights to the base model


# **Phase 6: Defining our model Pipeline**

### **We create a list of stopping criteria**


In [11]:
"""
The stop list contains three backticks because we ask the LLM to reply to our queries in JSON markdown.

This means that once generation reaches the end of the markdown block, which is ```, we tell the model to stop generating text.
"""
# Creating a Stop list:
stop_list = ['\nHuman:', '\n```\n']

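# Stopping token IDs:
# Note: the tokenizer prepends special/prefix tokens by default (each list printed below starts with id 1),
# so these sequences may not match the tail of generated text exactly; tokenizing with
# add_special_tokens=False is worth trying if the stopping criteria never fire.
stop_token_ids_1 = [tokenizer(x)['input_ids'] for x in stop_list]
print('Stopping token ids are:', stop_token_ids_1)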

Stopping token ids are: [[1, 29871, 13, 29950, 7889, 29901], [1, 29871, 13, 28956, 13]]


### **Converting to long tensors**

In [12]:
# Converting the ids to LONG tensors: (This is mandatory)
import torch

stop_token_ids = [torch.LongTensor(x).to(device) for x in stop_token_ids_1]
print('\nStopping token ids in TENSOR form are:',stop_token_ids)


Stopping token ids in TENSOR form are: [tensor([    1, 29871,    13, 29950,  7889, 29901], device='cuda:0'), tensor([    1, 29871,    13, 28956,    13], device='cuda:0')]


### **Defining custom stopping criteria object**

In [13]:
# Finally, defining our custom stopping criteria:
from transformers import StoppingCriteria, StoppingCriteriaList

# define custom stopping criteria object
class StopOnTokens(StoppingCriteria):
    """
    Custom stopping criteria that stops text generation when the generated sequence ends with one of the stop token sequences.
    """
    # Called during generation after each new token to check whether generation should stop.
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # Iterate over the list of stop token IDs.
        for stop_ids in stop_token_ids:
            # Check whether the most recent tokens match the stop token sequence.
            if torch.eq(input_ids[0][-len(stop_ids):], stop_ids).all():
                return True
        return False  # No stop sequence matched, so generation continues.

stopping_criteria = StoppingCriteriaList([StopOnTokens()])
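
The check inside `StopOnTokens` is a suffix comparison, which can be sketched with plain lists. The stop ids below are the ones printed earlier for `'\nHuman:'`; the `generated` sequence is hypothetical, and dropping the two leading ids is an assumption about which ids the tokenizer adds as a prefix, so verify it against your own tokenizer.

```python
def ends_with(sequence, stop_ids):
    """True when the last len(stop_ids) items of sequence equal stop_ids."""
    return len(sequence) >= len(stop_ids) and sequence[-len(stop_ids):] == stop_ids


# Ids printed above for '\nHuman:' (they begin with prefix ids 1 and 29871).
stop_ids_with_prefix = [1, 29871, 13, 29950, 7889, 29901]
# Assumption: the first two ids are tokenizer-added prefix tokens, so drop them.
stop_ids_stripped = stop_ids_with_prefix[2:]

# Hypothetical generated ids: some arbitrary tokens followed by the '\nHuman:' pieces.
generated = [1, 450, 1234, 13, 29950, 7889, 29901]

print(ends_with(generated, stop_ids_with_prefix))
print(ends_with(generated, stop_ids_stripped))
```

In this toy case the prefixed sequence never matches the tail of the generated ids, while the stripped one does. If generation does not stop where expected, this mismatch is one thing worth checking.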

### **Finally, Define the transformer/model pipeline**

In [14]:
generate_text = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,               # langchain expects the full text
    task='text-generation',
    # we pass model parameters here too
    stopping_criteria=stopping_criteria, # without this the model goes off topic during chat after a point
    temperature=0.1,                     # 'randomness' of outputs; values close to 0 make sampling nearly deterministic
    top_p=0.15,                          # sample only from the most likely tokens whose probabilities add up to 15%; you can experiment with this
    top_k=0,                             # 0 disables top-k filtering, so token selection relies on top_p
    max_new_tokens=512,                  # max number of tokens to generate; large enough for a complete answer, small enough to limit rambling
    repetition_penalty=1.1               # without this the output begins repeating
)

The model 'PeftModelForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MvpForCausalLM', 'OpenLlamaForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PLBartForCausalLM', 'ProphetNetForCausalLM', 'QDQBertLMHeadModel', 'ReformerModelWithLMHead', 'RemBertForCausalLM', 'RobertaForCausalLM', 'RobertaPreLayerN
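
To see what a `top_p` value as low as 0.15 does, here is a minimal sketch of nucleus (top-p) filtering over a toy distribution. It illustrates the idea only; the library works on logits internally, and `top_p_filter` and the probabilities are hypothetical.

```python
def top_p_filter(probs, top_p):
    """Return indices of the smallest set of tokens whose cumulative probability reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for idx in ranked:
        kept.append(idx)
        cumulative += probs[idx]
        if cumulative >= top_p:
            break
    return kept


probs = [0.05, 0.60, 0.25, 0.10]  # toy next-token distribution

kept_low = top_p_filter(probs, 0.15)
kept_high = top_p_filter(probs, 0.90)

print(kept_low)
print(kept_high)
```

With `top_p=0.15`, a single token is kept whenever the most likely token already has at least 15% probability, so generation behaves almost greedily. Raising `top_p` widens the candidate set and increases variety.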

# **Phase 7: Loading our model in LangChain Framework:**

In [15]:
from langchain.prompts import PromptTemplate
from langchain.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA

# Modified initial prompt
prompt_template = """
The following is a friendly conversation between a human and an AI based on the content provided.
The AI is conversational, retrieves answers for the questions asked, and is concise in its responses.
It does not mention anything about the "Context" or "Category" of answers in its responses.
If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:
{chat_history}
Human: {input}
AI:"""

prompt = PromptTemplate(
    template=prompt_template,
    input_variables=["chat_history","input"]
)

from langchain.chains.conversation.memory import ConversationBufferWindowMemory

# Defining our memory
memory = ConversationBufferWindowMemory(
    k=5,                              # Number of previous interactions to store
    return_only_outputs=True,
    memory_key="chat_history"         # this has to align with the {chat_history} variable in the prompt above
)

# Instantiate the LLM
llm = HuggingFacePipeline(pipeline=generate_text)


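# Define your QnA retrieval chain
# Note: the `prompt` object defined above is not passed to this chain, so the chain runs with its default prompt.
chat = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=docsearch.as_retriever(),
    memory=memory,
    verbose=True)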
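
The prompt template above has two placeholders, `{chat_history}` and `{input}`. As a rough illustration of how such a template is rendered before it reaches the LLM, the sketch below fills a shortened copy of it with `str.format`; the turns and the follow-up question are made-up examples, and LangChain performs the equivalent substitution internally.

```python
template = (
    "Current conversation:\n"
    "{chat_history}\n"
    "Human: {input}\n"
    "AI:"
)

# Hypothetical earlier turn, as a memory object would hold it.
turns = [("Who is Barack Obama?", "The 44th President of the United States.")]
chat_history = "\n".join(f"Human: {h}\nAI: {a}" for h, a in turns)

filled = template.format(chat_history=chat_history, input="Where was he born?")
print(filled)
```

The rendered text ends with `AI:`, which cues the model to continue with the assistant's next reply.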

# Trimming function

In [16]:
def chat_trim(chat_chain, query):
    """
    There are some unwanted characters in the response by the LLM at the end. This function cleans them up.
    """
    # create response
    chat_chain.run(query)

    # check for double newlines (also happens often)
    chat_chain.memory.chat_memory.messages[-1].content = chat_chain.memory.chat_memory.messages[-1].content.split('\n\n')[0]

    # strip any whitespace
    chat_chain.memory.chat_memory.messages[-1].content = chat_chain.memory.chat_memory.messages[-1].content.strip()

    return chat_chain.memory.chat_memory.messages[-1].content
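
The string handling inside `chat_trim` can be tried in isolation. The sketch below repeats its two steps on a plain string; `trim_markers` is a hypothetical extension that cuts the answer at marker strings such as `" Context:`, which still appear in some responses. The marker list is an assumption to adapt to your own outputs.

```python
def trim_response(text):
    """Same two steps as chat_trim: keep the first block, then strip whitespace."""
    first_block = text.split("\n\n")[0]
    return first_block.strip()


def trim_markers(text, markers=('" Context:', "Category:")):
    """Hypothetical extra step: cut the text at the earliest marker, if any."""
    cut = len(text)
    for marker in markers:
        pos = text.find(marker)
        if pos != -1:
            cut = min(cut, pos)
    return text[:cut].strip()


raw = "  Barack Obama is the 44th President of the United States.\n\nHuman: next question  "
print(trim_response(raw))

raw_with_markers = 'He was reelected in 2012." Context: background. Category: Biographies'
print(trim_markers(raw_with_markers))
```

The first helper keeps only the text before the first blank line, and the second removes the trailing context and category labels.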


# **Phase 10: Testing our model responses**

`The testing is conducted as follows:`<br>
*The questions are asked to test if the model not only retrieves relevant information based on user query but also stores the responses during the conversation. If we want our model to closely mimic a conversational bot it needs to have access to its memory.*  
________________________________________________________________________________

### Type of questions for testing:
- Since the document is based on Obama, the first **two** questions are based on Obama and Kunal Gandhi.
- Third question is based on economy and has nothing to do with Obama since the keyword 'Obama' has not been used.
- This difference in questions tests whether the model can access its memory, separate the information on Obama from that on the economy or Kunal Gandhi, and then retrieve information for the last question, which is to summarise information on Obama only.

**NOTE: you can use your own document for testing, but try to keep different contexts for your questions so that you can test the LLM's memory later on.**

In [17]:
chat_trim(chat, "Who is Barack Obama?")



> Entering new RetrievalQA chain...

> Finished chain.


'Barack Obama is the 44th President of the United States. He was born on August 4, 1961, in Honolulu, Hawaii. His mother, Ann Dunham, was American, while his father, Barack Obama Sr., was Kenyan. After graduating from Columbia University, Obama worked as a community organizer and later attended Harvard Law School. He served in the Illinois State Senate before being elected the first AfricanAmerican President of the United States in 2008 and reelected in 2012." Context: Explaining Barack Obama\'s background and achievements. Category: Biographies'

In [18]:
chat_trim(chat, "Who is Kunal Gandhi?")



> Entering new RetrievalQA chain...

> Finished chain.


'Kunal Gandhi is a popular figure in the Python and open source communities due to his contributions to the pandas library. He is highly regarded for his work and has also studied at a top 100 international university. "'

In [19]:
chat_trim(chat, "What is Room Theory")



> Entering new RetrievalQA chain...

> Finished chain.


'Room Theory is a knowledgedriven approach that utilizes context and knowledge to analyze and understand data. It involves collecting data, transforming it into a numerical representation, and using this numerical representation as the context for analyzing and understanding the data." }]]]'

In [20]:
chat_trim(chat, "How is a dictionary instiantiated in python?")



> Entering new RetrievalQA chain...

> Finished chain.


'A dictionary in Python is instantiated using curly brackets {} and can have keyvalue pairs separated by colon (:). It is important to note that the order of keys and values in a dictionary is not fixed, and it can be reordered or changed.Category: Python, Data Structures  "'

In [24]:
def print_x_statements(x):
  for i in range(x):
    print(chat.memory.chat_memory.messages[i])

print_x_statements(8)

content='Who is Barack Obama?' additional_kwargs={} example=False
content='Barack Obama is the 44th President of the United States. He was born on August 4, 1961, in Honolulu, Hawaii. His mother, Ann Dunham, was American, while his father, Barack Obama Sr., was Kenyan. After graduating from Columbia University, Obama worked as a community organizer and later attended Harvard Law School. He served in the Illinois State Senate before being elected the first AfricanAmerican President of the United States in 2008 and reelected in 2012." Context: Explaining Barack Obama\'s background and achievements. Category: Biographies' additional_kwargs={} example=False
content='Who is Kunal Gandhi?' additional_kwargs={} example=False
content='Kunal Gandhi is a popular figure in the Python and open source communities due to his contributions to the pandas library. He is highly regarded for his work and has also studied at a top 100 international university. "' additional_kwargs={} example=False
content