# **Overview:**

### This notebook documents and tests the memory and RetrievalQA chain of the Llama2-13B LLM

***Feel free to test the memory functionality and user query retrieval with different questions. If you find any limitations after testing, please note them at the end.***

Furthermore, if you have any difficulty in understanding certain arguments, please use **BARD** instead of ChatGPT because ChatGPT has no information regarding LangChain due to its knowledge cut-off.

Requirements on Colab:
- 20+ GB RAM minimum
- Single A100 GPU or multiple A40s
- <100 GB Disk Space

# Phases in approach:
1. Installations
2. Load documents for querying (PDF **and** text, both necessary)
3. Define the recursive text splitter from LangChain
  - Define your tiktoken tokenizer first (you can experiment with the type of tokenizer)
  - Define your recursive splitter using the tokenizer of your choice (tiktoken tokenizer)
4. Create your embeddings (Roberta v1 large is used)
5. Initiate Pinecone
6. Define and load Llama2-13B in Colab
7. Training the model:
  - Print the trainable parameters from PEFT
  - Define model tokenizer for tokenizing (try spaCy tokenizer if the file size is big)
  - Define training arguments
  - Save the trained model **OR** push the model to Hugging Face repo
  - Call the saved trained model
8. Define elements of model pipeline
  - Llama2-13B LLM tokenizer
  - Stopping criteria object:
    - Identify stopping tokens
    - Convert them to "LongTensors"
    - Define your custom stopping criteria function using the stop tokens
  - Finally, define your pipeline based on tokenizer and stopping function
9. Load the model in LangChain:
  - Modify the system_message/default instruction prompt if necessary
  - Instantiate your modified prompt template if you modified the prompt
  - Define your type of memory (Conversation Buffer window memory)
  - Instantiate your LLM using "HuggingFacePipeline"
  - Define your retrieval Q&A chain
10. Test your model with queries


**NOTE:**<br>
A trimming function is defined before phase 10 because, in initial testing, the model's output contained stray default letters and symbols that needed to be removed.

# Brief overview of types of memory in LangChain:
### **Type #1: ConversationBufferMemory**

It simply takes your past interactions with the AI and passes them as raw text into the {history} parameter, without any processing.

### Pros and Cons of ConversationBufferMemory
- Pros:
  - Stores the maximum amount of information, i.e. no loss of previous information
  - Simple, intuitive approach

- Cons:
  - Stores all the tokens, so response times get slower as the conversation grows longer and queries get more complex
  - Because all tokens are stored, a long enough conversation will exhaust the model's maximum token limit.

### **Type #2: ConversationSummaryMemory**

The conversation summary memory keeps the previous pieces of conversation in a summarized form, where the summarization is performed by an LLM.

### Pros and Cons of ConversationSummaryMemory:
- Pros:
  - Fewer tokens for long conversations
  - Therefore, enables longer conversations
  - Not too complex.
- Cons:
  - Inefficient for short conversations
  - Heavily dependent on good summaries; with a small model like ours, the summaries tend not to be good.

### **Type #3: ConversationBufferWindowMemory**

ConversationBufferWindowMemory keeps only the specified number of the most recent interactions in memory and intentionally drops the oldest ones: short-term memory, **if you'd like.**

### **Conclusion:**

Buffer window memory relies heavily on the value of the `k` parameter, which is the number of previous interactions the model should remember in the {history} prompt.

This is the memory type we will use: we do not want to waste tokens storing the entire conversation as raw text, nor do we want summaries of our conversations, since summarizing loses information.

*The biggest advantage of this memory is that the LLM can act as a conversational agent without LangChain's ConversationChain: it remembers previous interactions and can answer questions that build on them.*

**The demonstration of Q&A retrieval integrated with this type of memory will be our focus**
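
To make the role of `k` concrete, here is a minimal pure-Python sketch of a buffer window. The class `WindowMemory` and its method names are hypothetical and are not LangChain's implementation; the sketch only mimics the behaviour described above, keeping the last `k` exchanges and rendering them as the text that would fill `{history}`.

```python
from collections import deque


class WindowMemory:
    """Hypothetical stand-in for a buffer window memory (not the LangChain class)."""

    def __init__(self, k=2):
        # A deque with maxlen=k silently drops the oldest exchange once it is full.
        self.exchanges = deque(maxlen=k)

    def save(self, human, ai):
        # Store one human/AI exchange.
        self.exchanges.append((human, ai))

    def history(self):
        # Render the remembered exchanges as raw text for the prompt.
        return "\n".join(f"Human: {h}\nAI: {a}" for h, a in self.exchanges)


memory = WindowMemory(k=2)
memory.save("Who is Barack Obama?", "The 44th president of the United States.")
memory.save("When was he born?", "August 4, 1961.")
memory.save("Where was he born?", "Honolulu, Hawaii.")

print(memory.history())  # only the last two exchanges remain
```

With `k=2`, the first exchange has already been dropped by the time the third one is saved, so a follow-up question that depends on it could no longer be answered from memory alone.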

# **Phase 1: Installations:**

In [None]:
!pip install -qU \
  transformers==4.31.0 \
  sentence-transformers==2.2.2 \
  pinecone-client==2.2.2 \
  datasets==2.14.0 \
  accelerate==0.21.0 \
  einops==0.6.1 \
  langchain==0.0.240 \
  xformers==0.0.20 \
  bitsandbytes==0.41.0

!pip install -qqq loralib==0.1.1

In [None]:
!pip install unstructured==0.6.1 -q
!pip install unstructured[local-inference] -q
!pip install detectron2@git+https://github.com/facebookresearch/detectron2.git@v0.6#egg=detectron2 -q

In [None]:
!pip install -qU pypdf tiktoken

# **Phase 2: Loading Files as PDFs and Text Files**

In [None]:
from langchain.document_loaders import DirectoryLoader    # Loading the directory
from langchain.document_loaders import TextLoader         # Loading text files
from langchain.document_loaders import PyPDFLoader

# loading our directory
loader_1 = DirectoryLoader('/content/', glob="Barack_Obama.pdf", loader_cls=PyPDFLoader)

# loading our documents
document = loader_1.load()    # BARACK

# Loading .txt doc
loader = TextLoader("/content/Barack_Obama.txt")

In [None]:
document

[Document(page_content='[Barack Hussein Obama II ( (listen) b\x00-RAHK hoo-SAYN oh-BAH-m\x00; born August 4, 1961) is an American \npolitician, lawyer, and author who served as the 44th president of the United States from 2009 to 2017. A \nmember of the Democratic Party, Obama was the first African-American  president of the United States. H\ne previously served as a U.S. senator from Illinois from 2005 to 2008 and as an Illinois state senator from \n1997 to 2004. , Obama was born in Honolulu, Hawaii. After graduating from Columbia University in 1983, \nhe worked as a community organizer in Chicago. In 1988, he enrolled in Harvard Law School, where he w\nas the first black president of the Harvard Law Review. After graduating, he became a civil rights attorney \nand an academic, teaching constitutional law at the University of Chicago Law School from 1992 to 2004. \nTurning to elective politics, he represented the 13th district in the Illinois Senate from 1997 until 2004, wh\nen he ran

# **Phase 3: Define Recursive Text Splitter**
## Phase 3A: Defining our tiktoken tokenizer for `len` argument in Recursive Text Splitter

In [None]:
import tiktoken

tokenizer_tiktoken = tiktoken.get_encoding('p50k_base')           # max token len is 2048

# creating the length function to count the number of tokens
def tiktoken_token_len(text):
  """
  This function simply counts the number of tokens in the content.

  Note: The number of tokens is not equal to the length of the content
  """
  tokens = tokenizer_tiktoken.encode(                       # This is very specific to tiktoken
      str(text),
      disallowed_special=()
    )
  return len(tokens)

tiktoken_token_len(document)

18642

## Phase 3B: Chunking our documents using RecursiveCharacterTextSplitter

In [None]:
# Since document is large chunking it to reduce the size

from langchain.text_splitter import RecursiveCharacterTextSplitter

def split_docs(documents,chunk_size=20,chunk_overlap=5):
  """
  The function uses a text splitter called RecursiveCharacterTextSplitter to
  divide the documents into smaller chunks.
  The function applies the text splitter to each document in the input list and
  returns the resulting chunks.
  """
  text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size,
                                                 chunk_overlap=chunk_overlap,
                                                 length_function = tiktoken_token_len,
                                                 separators = ["\n\n", "\n", " ", ""])
  docs = text_splitter.split_documents(documents)   # use the function argument, not the global `document`
  return docs

# Calling our function
docs = split_docs(document)

print("Length of docs after chunking:", len(docs), "\n")
print("One of the page's chunked content:\n\n", docs)

Length of docs after chunking: 1387 

One of the page's chunked content:

 [Document(page_content='[Barack Hussein Obama II ( (listen) b\x00-RAHK', metadata={'source': '/content/Barack_Obama.pdf', 'page': 0}), Document(page_content='b\x00-RAHK hoo-SAYN oh-BAH-m\x00; born August', metadata={'source': '/content/Barack_Obama.pdf', 'page': 0}), Document(page_content='born August 4, 1961) is an American', metadata={'source': '/content/Barack_Obama.pdf', 'page': 0}), Document(page_content='politician, lawyer, and author who served as the 44th president of the United States from', metadata={'source': '/content/Barack_Obama.pdf', 'page': 0}), Document(page_content='of the United States from 2009 to 2017. A', metadata={'source': '/content/Barack_Obama.pdf', 'page': 0}), Document(page_content='member of the Democratic Party, Obama was the first African-American  president of the United', metadata={'source': '/content/Barack_Obama.pdf', 'page': 0}), Document(page_content='president of the United 

In [None]:
type(docs)

list
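
To illustrate what `chunk_size=20` and `chunk_overlap=5` mean, here is a simplified sliding-window sketch over a list of tokens. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which also splits on the separators listed above; the function `sliding_chunks` is hypothetical and only shows how consecutive chunks share an overlap.

```python
def sliding_chunks(tokens, chunk_size=20, chunk_overlap=5):
    """Split a token list into windows of chunk_size that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


words = [f"w{i}" for i in range(50)]   # 50 dummy tokens
chunks = sliding_chunks(words)

print(len(chunks))        # number of chunks
print(chunks[0][-5:])     # tail of the first chunk
print(chunks[1][:5])      # head of the second chunk (same five tokens)
```

The overlap is what lets a sentence that straddles a chunk boundary still appear intact in at least one chunk, which matters for retrieval quality when the chunks are this small.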

# **Phase 4: Creating our Embeddings**

In [None]:
from torch import cuda
from langchain.embeddings import HuggingFaceEmbeddings, SentenceTransformerEmbeddings

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# Equivalent to HuggingFaceEmbeddings(model_name="all-roberta-large-v1")
embeddings_roberta = SentenceTransformerEmbeddings(model_name = "all-roberta-large-v1",
                                                   model_kwargs={'device': device})


In [None]:
# Testing our embedding:
docs_1 = [
    "this is one document",
    "and another document"
]

embeddings = embeddings_roberta.embed_documents(docs_1)

print(f"We have {len(embeddings)} doc embeddings, each with "
      f"a dimensionality of {len(embeddings[0])}.")

We have 2 doc embeddings, each with a dimensionality of 1024.


# **Phase 5: Initiating Pinecone**

**NOTE:**<br>
Please enter your Pinecone API key, index name and environment name from Pinecone before running the cell.<br>

There are *two* code chunks: the first is for when you create a new index for running this model, and the second prevents overwriting an existing index.

In [None]:
"""
Use the code below if you have created a new index on Pinecone.
"""

import pinecone
from langchain.vectorstores import Pinecone

# Initializing Pinecone
pinecone.init(
    api_key="YOUR_PINECONE_API_KEY",                        # replace with your own key; never commit a real key
    environment="us-west4-gcp-free"
)

active_indexes = pinecone.list_indexes()
index_description = pinecone.describe_index(active_indexes[0])
print("Index Description:", index_description)
index_name = 'llama2-13b'

# Creating our pinecone Index and upserting our indexes
docsearch = Pinecone.from_documents(docs, embeddings_roberta, index_name = index_name)

"""
This chunk of code prevents overwriting an existing index
"""
# import pinecone
# from langchain.vectorstores import Pinecone

# # Initialize Pinecone
# pinecone.init(
#     api_key="YOUR_PINECONE_API_KEY",                        # API Key (replace with your own)
#     environment="us-west4-gcp-free"                         # Environment specification
# )

# active_indexes = pinecone.list_indexes()
# index_description = pinecone.describe_index(active_indexes[0])
# print("Index Description:", index_description)
# index_name = 'mpt-30b-trained'

# # Create and configure index if doesn't already exist
# if index_name not in pinecone.list_indexes():
#     pinecone.create_index(
#         name=index_name,
#         metric="cosine",
#         dimension=1024)
#     docsearch = Pinecone.from_documents(docs, embeddings_roberta, index_name=index_name)

# else:
#     docsearch = Pinecone.from_existing_index(index_name, embeddings_roberta)

Index Description: IndexDescription(name='llama2-13b', metric='cosine', replicas=1, dimension=1024.0, shards=1, pods=1, pod_type='p1', status={'ready': True, 'state': 'Ready'}, metadata_config=None, source_collection='')
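
The index above uses `metric='cosine'` over 1024-dimensional vectors. As a rough illustration of what that metric measures, here is a small pure-Python sketch; the helper `cosine_similarity` and the three-dimensional toy vectors are hypothetical, and Pinecone's own implementation is of course different and far more optimized.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query = [1.0, 0.0, 1.0]
doc_close = [2.0, 0.0, 2.0]   # same direction as the query, different length
doc_far = [0.0, 3.0, 0.0]     # orthogonal to the query

print(cosine_similarity(query, doc_close))
print(cosine_similarity(query, doc_far))
```

Because cosine similarity ignores vector length and compares direction only, a chunk whose embedding points the same way as the query scores highest regardless of magnitude.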


# **Phase 6A: Defining and Loading our Llama2-13B model in Colab**

In [None]:
!pip install -qqq torch==2.0.1
!pip install -qqq -U git+https://github.com/huggingface/peft.git@42a184f


In [None]:
from torch import cuda, bfloat16
import transformers
"""
The peft package provides a number of modules for parameter-efficient fine-tuning (PEFT) of large language models (LLMs).
More on this later.
"""
from peft import (
    LoraConfig,                                # configuration class for the LoRA method of parameter-efficient fine-tuning.
    PeftConfig,                                # base configuration class for PEFT methods.
    PeftModel,                                 # wraps a base Transformers model together with its PEFT adapter; used to represent a PEFT-finetuned model.
    get_peft_model,                            # function is used to create a new PeftModel object from a base model and a config.
    prepare_model_for_kbit_training            # function is used to prepare a quantized (8-bit/4-bit) model for training.
)

model_id = 'meta-llama/Llama-2-13b-chat-hf'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,                         # load the model in 4-bit precision.
    bnb_4bit_quant_type='nf4',                 # type of quantization to use for 4-bit weights.
    bnb_4bit_use_double_quant=True,            # use double quantization for 4-bit weights.
    bnb_4bit_compute_dtype=bfloat16            # compute dtype to use for 4-bit weights.
)

# begin initializing HF items, need auth token for these
hf_auth = 'YOUR_HF_ACCESS_TOKEN'               # replace with your own token; never commit a real token
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
    use_auth_token=hf_auth
)



In [None]:
print(f"Model loaded on {device}")

Model loaded on cuda:0
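
As a back-of-the-envelope check on why 4-bit loading matters here, the sketch below estimates the memory needed for the weights alone at different precisions. The helper `weight_memory_gb` is hypothetical, the parameter count of 13 billion is the nominal figure, and the estimate ignores quantization constants, layers kept in higher precision, activations and the KV cache, so real usage will be higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights only, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 13e9  # nominal parameter count of a 13B model (assumption)

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: ~{weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Going from 16-bit to 4-bit cuts the weight footprint to roughly a quarter, which is what makes a single-GPU setup plausible for this model.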


# **Phase 6B: Defining Model tokenizer**



In [None]:
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)


# **Phase 7: Training the Llama-13B model**

We will train the model with **PEFT**:
- PEFT stands for Parameter-Efficient Fine-Tuning. It is a technique for fine-tuning large language models (LLMs) that can significantly reduce the computational and memory requirements of traditional fine-tuning methods.

- PEFT works by fine-tuning only a small number of (extra) model parameters, while freezing most of the pre-trained network's parameters. This approach helps prevent catastrophic forgetting and results in reduced computational and storage costs.

The peft package provides a number of modules for parameter-efficient fine-tuning (PEFT) of large language models (LLMs).
- *LoraConfig*: class defines the configuration for the LoRA method of parameter-efficient fine-tuning.
- *PeftConfig*: base class that defines the configuration for PEFT methods.
- *PeftModel*: class that wraps a base Transformers model together with its PEFT adapter. It is used to represent a PEFT-finetuned model.
- *get_peft_model*: function is used to create a new PeftModel object from a base model and a config.
- *prepare_model_for_kbit_training*: function is used to prepare a quantized (8-bit/4-bit) model for training.


In [None]:
from peft import (
    LoraConfig,                                # configuration class for the LoRA method of parameter-efficient fine-tuning.
    PeftConfig,                                # base configuration class for PEFT methods.
    PeftModel,                                 # wraps a base Transformers model together with its PEFT adapter; used to represent a PEFT-finetuned model.
    get_peft_model,                            # function is used to create a new PeftModel object from a base model and a config.
    prepare_model_for_kbit_training            # function is used to prepare a quantized (8-bit/4-bit) model for training.
)

In [None]:
def print_trainable_parameters(model):
  """
  This function prints the number of trainable parameters in the model.

  Args:
    model: The model to print the trainable parameters for.

  Returns:
    None.
  """
  # Count the number of trainable parameters.
  trainable_params = 0
  all_param = 0
  # Iterate over all the parameters in the model.
  for _, param in model.named_parameters():
    # Increment the number of all parameters.
    all_param += param.numel()
    # Increment the number of trainable parameters if the parameter requires gradients.
    if param.requires_grad:
      trainable_params += param.numel()
  # Print the number of trainable parameters.
  print(
      f"trainable params: {trainable_params} || all params: {all_param} || trainables%: {100 * trainable_params / all_param}"
  )

The first line of code below, **model.gradient_checkpointing_enable()**, enables gradient checkpointing for the model. Gradient checkpointing is a technique that reduces the memory requirements of training large models. Instead of keeping every intermediate activation from the forward pass, it stores only a subset of them and recomputes the rest during the backward pass, trading extra compute for lower memory use.

The second line of code below, **model = prepare_model_for_kbit_training(model)**, prepares the model for k-bit training, that is, for training on top of a base model whose weights were already loaded in a quantized format such as 8-bit or 4-bit. It freezes the quantized base weights and casts the remaining non-quantized parameters (for example, layer norms) to higher precision for more stable training.

In [None]:
# Enables gradient checkpointing for the model.
model.gradient_checkpointing_enable()
# Prepares the model for K-bit training.
model = prepare_model_for_kbit_training(model)

**LoraConfig()** below is a class that defines the configuration for the LoRA method of parameter-efficient fine-tuning.

In [None]:
# This class defines the configuration for the LoRA method of parameter-efficient fine-tuning.
config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices.
    lora_alpha=32,                        # scaling factor; the LoRA update is scaled by lora_alpha / r.
    #target_modules=["query_key_value"],  # optional: names of the modules to adapt. When omitted, PEFT falls back to architecture-specific defaults (for Llama, the q_proj and v_proj projections).
    lora_dropout=0.05,                    # dropout probability applied in the LoRA layers.
    bias="none",                          # which bias parameters to train; "none" trains no biases.
    task_type="CAUSAL_LM"                 # task type; "CAUSAL_LM" is for causal language modeling.
)

In [None]:
model = get_peft_model(model, config)     # returns a new model that has been adapted using the LoRA method.
print_trainable_parameters(model)         # function prints the number of trainable parameters in the model.

trainable params: 13107200 || all params: 6685086720 || trainables%: 0.19606626733482374
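
The trainable-parameter count printed above can be reproduced by hand. The sketch below assumes the published Llama-2-13B dimensions (hidden size 5120, 40 decoder layers) and assumes that, with `target_modules` left unset, LoRA is attached to the `q_proj` and `v_proj` projections of every layer; the helper `lora_params` is hypothetical.

```python
HIDDEN_SIZE = 5120        # Llama-2-13B hidden size (assumption from the published config)
NUM_LAYERS = 40           # Llama-2-13B decoder layers (assumption from the published config)
RANK = 16                 # r from the LoraConfig above
ADAPTED_PER_LAYER = 2     # q_proj and v_proj (assumed PEFT default for Llama)


def lora_params(in_features, out_features, rank):
    """Parameters added by one LoRA adapter: A is rank x in, B is out x rank."""
    return rank * in_features + out_features * rank


per_module = lora_params(HIDDEN_SIZE, HIDDEN_SIZE, RANK)
total = per_module * ADAPTED_PER_LAYER * NUM_LAYERS

print(per_module)   # parameters per adapted projection
print(total)        # total trainable LoRA parameters
```

Under these assumptions the total comes out to 13,107,200, matching the printed value. The "all params" figure is well below 13 billion, most likely because 4-bit weights are stored packed, so `numel()` undercounts them; treat that explanation as a plausible reading rather than a verified one.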


### *Training the Llama 13B model*

Let's tokenize our data with our Llama 13B tokenizer first.

**NOTE:** Also try tokenizing with Spacy tokenizer if file size is big

In [None]:
# Tokenizing with Llama-13B tokenizer
from transformers import TextDataset

def load_dataset(file_path, tokenizer, block_size=128):
  """
  Loads a text dataset from a file path.

  Args:
    file_path: The path to the text file.
    tokenizer: The tokenizer to use to tokenize the text.
    block_size: Each sequence in the dataset will be at most this many tokens long (default 128).

  Returns:
    A TextDataset object.
  """
  train_dataset = TextDataset(
      tokenizer=tokenizer,
      file_path=file_path,
      block_size=block_size,
  )
  return train_dataset

# load dataset
dataset = load_dataset("/content/Barack_Obama.txt", tokenizer)




### Training Arguments:
More detailed explanation of each argument:

* `per_device_train_batch_size`: The number of training examples processed on each GPU per forward/backward pass.
* `gradient_accumulation_steps`: The number of batches whose gradients are accumulated before each optimizer update. This can be used to increase the effective batch size without increasing the amount of memory required.
* `num_train_epochs`: The number of times to iterate over the training data.
* `learning_rate`: The initial learning rate. (Try 2e-4 and 1e-4)
* `fp16`: Whether to use 16-bit floating point precision during training. This can help to reduce the memory requirements and speed up training.
* `save_total_limit`: The maximum number of checkpoints to save.
* `logging_steps`: The number of steps after which to log training progress.
* `output_dir`: The directory where the checkpoints and other training output will be saved.
* `optim`: The optimizer to use for training.
* `lr_scheduler_type`: The type of learning rate scheduler to use.
* `warmup_ratio`: The ratio of the training steps to use for warmup, where the learning rate increases linearly.
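
To see how these arguments interact, the sketch below computes the effective batch size and the number of optimizer steps for a run. The helper `optimizer_steps` is hypothetical, the floor division is an assumption about how the trainer counts update steps per epoch, and the dataset size of 163 blocks is an illustrative guess rather than a value reported by this notebook.

```python
import math


def optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Rough count of optimizer updates over a whole training run."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))
    # Assumption: leftover batches that do not fill an accumulation window are not counted.
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return steps_per_epoch * epochs


PER_DEVICE_BATCH = 1
GRAD_ACCUM = 4
EPOCHS = 10
NUM_EXAMPLES = 163   # hypothetical number of 128-token blocks

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM
total_steps = optimizer_steps(NUM_EXAMPLES, PER_DEVICE_BATCH, GRAD_ACCUM, EPOCHS)

print(effective_batch)   # examples contributing to each optimizer update
print(total_steps)       # optimizer updates over the whole run
```

With a batch size of 1 and 4 accumulation steps, each update sees 4 examples, so `warmup_ratio=0.05` would cover roughly the first 5% of those updates.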

In [None]:
# Defining training arguments
training_args = transformers.TrainingArguments(
      per_device_train_batch_size=1,
      gradient_accumulation_steps=4,
      num_train_epochs=10,
      learning_rate=2e-4,
      fp16=True,
      save_total_limit=3,
      logging_steps=1,
      output_dir="experiments",
      optim="paged_adamw_8bit",
      lr_scheduler_type="cosine",
      warmup_ratio=0.05,
)
# Defining our trainer
trainer = transformers.Trainer(
    model=model,
    train_dataset=dataset,
    args=training_args,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
model.config.use_cache = False
trainer.train()

Step,Training Loss
1,1.528
2,1.8202
3,1.6756
4,1.51
5,1.3616
6,1.7339
7,1.7783
8,1.5614
9,1.6847
10,1.6106


TrainOutput(global_step=400, training_loss=0.4359310211054981, metrics={'train_runtime': 964.391, 'train_samples_per_second': 1.69, 'train_steps_per_second': 0.415, 'total_flos': 7950704001024000.0, 'train_loss': 0.4359310211054981, 'epoch': 9.82})

### Save the trained model

**NOTE:** You can save the model in your environment itself. To make the trained version of this model accessible, I have also uploaded it to my Hugging Face repo.

In [None]:
from huggingface_hub import notebook_login

notebook_login()


In [None]:
# Upload the model on Hugging face:
TRAINED_MODEL = "kunalg080198/llama-13b-trained-corpuschange2"

model.push_to_hub(
  TRAINED_MODEL, use_auth_token = True
)


CommitInfo(commit_url='https://huggingface.co/kunalg080198/llama-13b-trained-corpuschange2/commit/26c9aee846abb9bf46b74a759c3d1dc695738c46', commit_message='Upload model', commit_description='', oid='26c9aee846abb9bf46b74a759c3d1dc695738c46', pr_url=None, pr_revision=None, pr_num=None)

### Calling the model:
Following is the model configuration in bitsandbytes defined in Phase 6.
```
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)
```

In [None]:
from peft import (
    LoraConfig,
    PeftConfig,
    PeftModel,
    get_peft_model,
    prepare_model_for_kbit_training
)

config = PeftConfig.from_pretrained(TRAINED_MODEL)
model = transformers.AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,   # String that specifies the path to the base model. The base model is the model that the PEFT model is based on.
                                      # It is used to initialize the parameters of the PEFT model.
    return_dict=True,
    quantization_config=bnb_config,   # loads the weights in 4-bit to reduce the model's GPU memory footprint.
    device_map="auto",                # the model will be assigned to the most appropriate device
    trust_remote_code=True
)

tokenizer = transformers.AutoTokenizer.from_pretrained(config.base_model_name_or_path)  # Create an AutoTokenizer object from the base model name or path.
tokenizer.pad_token = tokenizer.eos_token                                               # The pad_token argument is set to the eos_token. This is the end-of-sequence token.

model = PeftModel.from_pretrained(model, TRAINED_MODEL)


# **Phase 8: Defining our model Pipeline**

### **We create a list of stopping criteria**


In [None]:
"""
The stop list includes three backticks because we ask the LLM to reply to our queries in JSON markdown.

This means that once the model reaches the end of the markdown block, marked by the closing three backticks, we tell it to stop generating text.
"""
# Creating a Stop list:
stop_list = ['\nHuman:', '\n```\n']

# Stopping Token IDS:
stop_token_ids_1 = [tokenizer(x)['input_ids'] for x in stop_list]
print('Stopping token ids are:', stop_token_ids_1)

Stopping token ids are: [[1, 29871, 13, 29950, 7889, 29901], [1, 29871, 13, 28956, 13]]


### **Converting to long tensors**

In [None]:
# Converting the ids to LONG tensors: (This is mandatory)
import torch

stop_token_ids = [torch.LongTensor(x).to(device) for x in stop_token_ids_1]
print('\nStopping token ids in TENSOR form are:',stop_token_ids)


Stopping token ids in TENSOR form are: [tensor([    1, 29871,    13, 29950,  7889, 29901], device='cuda:0'), tensor([    1, 29871,    13, 28956,    13], device='cuda:0')]


### **Defining custom stopping criteria object**

In [None]:
# Finally, defining our custom stopping criteria:
from transformers import StoppingCriteria, StoppingCriteriaList

# define custom stopping criteria object
class StopOnTokens(StoppingCriteria):
    """
    This class defines a custom stopping criteria object that stops generation
    when the most recently generated tokens match one of the stop token sequences.
    """
    # This method is called during generation, after each new token, to check whether generation should stop.
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # Iterate over the list of stop token IDs.
        for stop_ids in stop_token_ids:
            # Check whether the tail of the generated sequence equals the stop token sequence.
            if torch.eq(input_ids[0][-len(stop_ids):], stop_ids).all():
                return True
        return False  # If no stop token sequence matches, return False so generation continues.

stopping_criteria = StoppingCriteriaList([StopOnTokens()])
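
The comparison inside `StopOnTokens` is a suffix match: generation stops only when the last tokens of the sequence equal a stop sequence exactly. The pure-Python sketch below mirrors that check on plain lists. The helper `ends_with` and the `generated` ids are hypothetical, and whether the model really emits these ids for `\nHuman:` in the middle of a sequence is an assumption worth verifying by tokenizing real output.

```python
def ends_with(generated_ids, stop_ids):
    """True if generated_ids ends with exactly the tokens in stop_ids."""
    n = len(stop_ids)
    return n <= len(generated_ids) and generated_ids[-n:] == stop_ids


# Stop sequence as printed above, including the leading ids 1 and 29871.
stop_with_prefix = [1, 29871, 13, 29950, 7889, 29901]
# Same sequence with the leading ids stripped (hypothetical alternative).
stop_bare = [13, 29950, 7889, 29901]

# Hypothetical generated sequence: start-of-sequence id, two content ids, then "\nHuman:".
generated = [1, 15043, 727, 13, 29950, 7889, 29901]

print(ends_with(generated, stop_with_prefix))
print(ends_with(generated, stop_bare))
```

In this toy case the version with the leading ids does not match, because a start-of-sequence id would not normally reappear in the middle of generated text. If the stop criteria never seem to fire, stripping those leading ids from each stop sequence is one thing to try.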

### **Finally, Define the transformer/model pipeline**

In [None]:
generate_text = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,               # langchain expects the full text
    task='text-generation',
    # we pass model parameters here too
    stopping_criteria=stopping_criteria, # without this model goes off topic during chat after a point
    temperature=0.1,                     # 'randomness' of outputs; values close to 0 make the output more deterministic
    top_p=0.15,                          # select from the most likely tokens whose probabilities add up to 15%, you can experiment with this
    top_k=0,                             # 0 disables top-k filtering, so token selection relies on top_p
    max_new_tokens=512,                  # max number of tokens to generate in the output; keep it high enough for complete answers but low enough to limit rambling
    repetition_penalty=1.1               # without this output begins repeating
)

The model 'PeftModelForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MusicgenForCausalLM', 'MvpForCausalLM', 'OpenLlamaForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PLBartForCausalLM', 'ProphetNetForCausalLM', 'QDQBertLMHeadModel', 'ReformerModelWithLMHead', 'RemBertForCausal
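
To make the effect of `top_p` more concrete, here is a minimal sketch of nucleus (top-p) filtering on a toy probability table. It is not the `transformers` implementation; the function name `nucleus_filter` and the example probabilities are assumptions made purely for illustration.

```python
from typing import Dict


def nucleus_filter(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the smallest set of most likely tokens whose total probability
    reaches `top_p`, then renormalise the kept probabilities."""
    kept: Dict[str, float] = {}
    cumulative = 0.0
    # Walk through the tokens from most to least likely.
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}


# Toy next-token distribution (made-up numbers).
toy_probs = {"the": 0.5, "a": 0.3, "an": 0.15, "this": 0.05}

print(nucleus_filter(toy_probs, top_p=0.15))  # only the most likely token survives
print(nucleus_filter(toy_probs, top_p=0.9))   # three tokens survive
```

With a small value such as `top_p=0.15`, a single confident token often covers the whole probability budget, which is why the outputs stay close to the most likely continuation.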

# **Phase 9: Loading our model in LangChain Framework:**

In [None]:
from langchain.prompts import PromptTemplate
from langchain.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA

# Modified initial prompt
prompt_template = """
The following is a friendly conversation between a human and an AI based on the content provided.
The AI is conversational, retrieves answers for the questions asked, and is concise in its responses.
It does not mention anything about the "Context" or "Category" of answers in its responses.
If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:
{chat_history}
Human: {input}
AI:"""

prompt = PromptTemplate(
    template=prompt_template,
    input_variables=["chat_history","input"]
)

from langchain.chains.conversation.memory import ConversationBufferWindowMemory

# Defining our memory
memory = ConversationBufferWindowMemory(
    k=5,                              # Number of previous exchanges (human question + AI answer) to keep
    return_only_outputs=True,
    memory_key="chat_history"         # this has to align with the {chat_history} variable in the prompt above
)

# Instantiate the LLM
llm = HuggingFacePipeline(pipeline=generate_text)


# Define your QnA retrieval chain
chat = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=docsearch.as_retriever(),
    memory = memory,
    verbose = True)

ModuleNotFoundError: ignored
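
To illustrate how a windowed memory and a prompt template with `{chat_history}` and `{input}` fit together, here is a small dependency-free sketch. It is a toy stand-in, not LangChain's `ConversationBufferWindowMemory`; the class `WindowMemory`, the helper `build_prompt`, and the shortened template are hypothetical names introduced only for this example.

```python
from collections import deque
from typing import Deque, Tuple

# Shortened version of the prompt layout, for illustration only.
TEMPLATE = """Current conversation:
{chat_history}
Human: {input}
AI:"""


class WindowMemory:
    """Toy windowed memory: remembers only the last k exchanges."""

    def __init__(self, k: int) -> None:
        # deque(maxlen=k) silently drops the oldest exchange once it is full.
        self.turns: Deque[Tuple[str, str]] = deque(maxlen=k)

    def save(self, human: str, ai: str) -> None:
        self.turns.append((human, ai))

    def as_text(self) -> str:
        return "\n".join(f"Human: {h}\nAI: {a}" for h, a in self.turns)


def build_prompt(memory: WindowMemory, user_input: str) -> str:
    """Fill the template with the remembered exchanges and the new question."""
    return TEMPLATE.format(chat_history=memory.as_text(), input=user_input)


toy_memory = WindowMemory(k=2)
toy_memory.save("Who is Barack Obama?", "The 44th President of the United States.")
toy_memory.save("Who is Kunal Gandhi?", "A results-oriented individual.")
toy_memory.save("What are the parts of speech?", "Nouns, verbs, adjectives, and so on.")

# With k=2 the first exchange (about Obama) has already been dropped.
print(build_prompt(toy_memory, "Can you summarize the previous answers on Obama only?"))
```

The example shows why the window size matters for the memory tests: a question that refers back to an exchange older than `k` turns can only be answered from retrieval, not from the conversation history.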

# Trimming function

In [None]:
def chat_trim(chat_chain, query):
    """
    The LLM response often has unwanted text at the end. This function cleans it up
    and returns the trimmed response.
    """
    # create response
    chat_chain.run(query)

    # the latest AI message is the last entry in the chain's memory
    last_message = chat_chain.memory.chat_memory.messages[-1]

    # keep only the text before the first double newline (also happens often),
    # then strip any whitespace
    last_message.content = last_message.content.split('\n\n')[0].strip()

    return last_message.content
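
The trimming rule itself can be tried on a plain string without running the chain. The sketch below assumes a hypothetical helper `trim_response` and a made-up raw response; it only mirrors the split-and-strip step used in `chat_trim`.

```python
def trim_response(text: str) -> str:
    """Keep only the text before the first blank line, without surrounding whitespace."""
    return text.split("\n\n")[0].strip()


# Made-up raw response with trailing text after a blank line.
raw = "  Barack Obama is the 44th President of the United States.\n\nHuman: next question?"
print(trim_response(raw))

# Caveat: a response that starts with a blank line is trimmed to an empty string.
print(repr(trim_response("\n\nAnswer after a leading blank line")))
```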

# **Phase 10: Testing our model responses**

`The testing is conducted as follows:`<br>
*The questions are asked to test if the model not only retrieves relevant information based on user query but also stores the responses during the conversation. If we want our model to closely mimic a conversational bot it needs to have access to its memory.*  
________________________________________________________________________________

### Type of questions for testing:
- Since the document is based on Obama, the first **two** questions are about Obama and Kunal Gandhi.
- The third question is about the economy and has nothing to do with Obama, since the keyword 'Obama' is not used.
- This difference in questions tests whether the model can access its memory, separate the information on Obama from the information on the economy or Kunal Gandhi, and then answer the last question, which asks for a summary of the information on Obama only.

**NOTE: You can use your own document for testing, but keep the context of your questions varied so that you can test the LLM's memory later on.**

In [None]:
chat_trim(chat, "Who is Barack Obama?")



> Entering new RetrievalQA chain...

> Finished chain.


'Barack Obama is the 44th President of the United States. He is a member of the Democratic Party and was the first African-American to be elected president.'

In [None]:
chat_trim(chat, "Who is Kunal Gandhi?")



> Entering new RetrievalQA chain...

> Finished chain.


'Kunal Gandhi is a highly motivated and results-oriented individual with a passion for ment, team leadership, and problem-solving. He is a strategic thinker with a keen eye on new challenges. In his spare time, Kunal enjoys playing cricket, hiking, and reading.'

In [None]:
chat_trim(chat, "How was the economy of United States in 2009?")



> Entering new RetrievalQA chain...

> Finished chain.


'The economy of the United States in 2009 was in a state of expansion, having pulled back from the threat of depression and starting to recover from the Great Recession. According to the U.S. Bureau of Labor Statistics, the unemployment rate rose from 5.0 percent in January 2009 to 10.0 percent in October 2009, but then fell to 9.6 percent in the last month of the year. Additionally, the Dow Jones Industrial Average rose from around 8,000 in the first quarter to over 10,000 by the end of the year., Helpful Answer 2: In 2009, the economy of the United States started to recover from the Great Recession, with GDP growth of 2.2 percent in the third quarter. According to a survey of members of the National Association for Business Economics, job creation increased in 2009., Helpful Answer 3: The economy of the United States in 2009 was in a state of expansion, with growth of 2.2 percent in the third quarter, according to the Bureau of Economic Analysis. This followed a decline of 6.3 percen

In [None]:
chat_trim(chat, "What masters did Kunal Gandhi pursue?")



> Entering new RetrievalQA chain...

> Finished chain.


'Based on the context provided, Kunal Gandhi pursued a Master of Science in Computer Science at Stevens Institute of Technology.'

In [None]:
chat_trim(chat, "What is the education experience of Barack Obama?")




> Entering new RetrievalQA chain...

> Finished chain.


'Barack Obama graduated with a Bachelor of Arts degree in 1983 with a 3.7 GPA. He then worked as a financial researcher and writer at Business International Corporation for about a year before entering Harvard Law School, where he earned his Juris Doctor magna cum laude (with highest honor) in 1991.'

In [None]:
chat_trim(chat, "What are the parts of speech?")



> Entering new RetrievalQA chain...

> Finished chain.


'The parts of speech are nouns, verbs, adjectives, adverbs, pronouns, prepositions, and conjunctions.'

In [None]:
chat_trim(chat, "How was the economy of United States in 2009?")



> Entering new RetrievalQA chain...

> Finished chain.


"The economy of United States contracted in 2009, as the Great Recession peaked. According to the National Bureau of Economic Research, the recession lasted from December 2007 to June 2009. During this time, the economy shrank at an annual rate of 2.5 percent in the first quarter and then grew at an annual rate of 3.7 percent in the second quarter. However, the economy then slowed down again, growing at only an annual rate of 1.6 percent in the third quarter, though it picked up in the fourth quarter to an annual rate of 5.0 percent, beating expectations. Despite these improvements, the economy continued to struggle, with high unemployment rates and a large budget deficit. According to the Bureau of Labor Statistics, between January 2009 and January 2010, the number of unemployed persons increased by 8.4 million., Unhelpful Answer: I don't know., Neutral Answer: I don't have enough information to give an answer., Other Answers: The economy did not do well in 2009., The economy was terr

In [None]:
chat_trim(chat, "Can you summarize the previous answers on Obama only?")



> Entering new RetrievalQA chain...

> Finished chain.


'Sure, here is a summary of my previous answers about Obama:\nObama was born in Honolulu, Hawaii. His mother was American, while his father was from Kenya. He grew up with little knowledge about his father. He had several stepfathers during his childhood. In his early adult years, Obama moved to Los Angeles, where he worked as a community organizer. He enrolled at Harvard Law School and met his future wife, Michelle Robinson. After graduating, he became a civil rights attorney and an academic, teaching constitutional law at the University of Chicago Law School from 1992 to 2004. Turning to elective politics, he represented the 13th district in the Illinois Senate from 1997 until 2004, when he ran for the U.S. Senate. Elected as a Democrat to the Senate in 2004, he gained national attention in 2008, delivering the keynote address at the Democratic National Convention. By a wide margin, he won the presidency over Republican nominee John McCain, becoming the first African-American to be e

In [None]:
chat_trim(chat, "What did Obama say On May 9, 2012, shortly after the official launch of his campaign?")



> Entering new RetrievalQA chain...

> Finished chain.


'Obama said his views had evolved, and he publicly affirmed his personal support for the legalization of same-sex marriage.'

In [None]:
def print_x_statements(x):
  for i in range(x):
    print(chat.memory.chat_memory.messages[i])

print_x_statements(8)

In [None]:
chat.memory