# Improving a Fine-Tuned Model Using RAG

### Imports

In [None]:
!pip install llama-index
!pip install llama-index-embeddings-huggingface
!pip install peft
!pip install auto-gptq
!pip install optimum
!pip install bitsandbytes

In [2]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SimilarityPostprocessor

### Define Settings

In [3]:
# import any embedding model on HF hub (https://huggingface.co/spaces/mteb/leaderboard)
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
# Settings.embed_model = HuggingFaceEmbedding(model_name="thenlper/gte-large") # alternative model

Settings.llm = None
Settings.chunk_size = 256
Settings.chunk_overlap = 25

LLM is explicitly disabled. Using MockLLM.
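To make the `chunk_size` and `chunk_overlap` settings above more concrete, here is a simplified sketch of sliding-window chunking. It is an illustration only: the helper `chunk_tokens` is hypothetical, it works on a plain list of placeholder tokens, and the splitter that LlamaIndex actually uses also tries to respect sentence boundaries, so real chunk lengths will vary.

```python
def chunk_tokens(tokens, chunk_size=256, chunk_overlap=25):
    """Split a token list into windows of `chunk_size` that share `chunk_overlap` tokens.

    Hypothetical helper for illustration; not part of LlamaIndex.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# 600 placeholder tokens standing in for a tokenized article
tokens = [f"tok{i}" for i in range(600)]
chunks = chunk_tokens(tokens)

print([len(c) for c in chunks])           # chunk lengths
print(chunks[0][-25:] == chunks[1][:25])  # neighbouring chunks share 25 tokens
```

A smaller `chunk_size` gives more focused chunks but more of them, while the overlap reduces the chance that a sentence relevant to the query is cut in half at a chunk boundary.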


### Read and Store Docs into Vector DB

In [4]:
!mkdir -p /content/articles

In [16]:
# articles available here: {add GitHub repo}
documents = SimpleDirectoryReader("knowledge_base").load_data()

In [None]:
documents

In [18]:
# some ad hoc document refinement: drop documents containing boilerplate markers
# (build a new list instead of removing items while iterating, which skips elements)
print(len(documents))
unwanted_markers = ("Member-only story", "The Data Entrepreneurs", " min read")
documents = [
    doc for doc in documents
    if not any(marker in doc.text for marker in unwanted_markers)
]

print(len(documents))

6
6


In [19]:
# store docs into vector DB
index = VectorStoreIndex.from_documents(documents)

### Set Up Search Function

In [20]:
# set number of docs to retrieve
top_k = 3
# top_k = 1

# configure retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=top_k,
)

In [21]:
# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.5)],
)
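The retriever and postprocessor above combine two filters: keep the `top_k` most similar chunks, then drop any whose similarity score falls below the cutoff. The sketch below shows that logic on toy 2-D vectors with plain Python. The functions `cosine_similarity` and `retrieve` are hypothetical stand-ins, not LlamaIndex APIs, and the vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vec, doc_vecs, top_k=3, similarity_cutoff=0.5):
    """Return (doc_index, score) pairs: the top_k most similar, minus those below the cutoff."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(reverse=True)   # highest similarity first
    top = scored[:top_k]        # step 1: keep the top_k candidates
    return [(i, score) for score, i in top if score >= similarity_cutoff]  # step 2: apply cutoff


# toy embeddings (assumed values, for illustration only)
query_vec = [1.0, 0.0]
doc_vecs = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [-1.0, 0.0]]

hits = retrieve(query_vec, doc_vecs)
print(hits)
```

Because the cutoff is applied after the top-k step, the query engine can return fewer than `top_k` nodes, or none at all, so downstream code should not assume exactly `top_k` results.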

### Retrieve Relevant Docs

In [22]:
# query documents
# query = "What is fat-tailedness?"
query = "What are the different LLM fine-tuning methods?"
response = query_engine.query(query)

In [None]:
response.source_nodes

In [24]:
# reformat response
# iterate over the nodes actually returned: the similarity cutoff can leave fewer than top_k
context = "Context:\n"
for node in response.source_nodes:
    context = context + node.text + "\n\n"

print(context)

Context:
So we can see that the method I’m using is a really practical method of achieving
good results with small training data and few computational resources.
The dataset was originally in a JSON format, and I changed it to a CSV format
that is compatible with how my pipeline is set up. The cleaned data is in the file
reddit-comments.csv .
Fine-Tuning Methods for Large Language Models
There are a couple of different methods to fine-tune a Large Language Model (LLM).
One common method is full fine-tuning . The process results in a new version of the
model with updated weights. One caveat with this process is that full fine-tuning requires
enough memory and computing power to process all the gradients and other components
being updated during training.
In order to work against this constraint, we have another method called parameter-
efficient fine-tuning . In this method, we only update a small set of parameters, which
saves us a lot of computational power and memory. One method of d

### Import LLM

In [14]:
# load fine-tuned model from hub
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto",
                                             trust_remote_code=False,
                                             revision="main")

config = PeftConfig.from_pretrained("hussenmi/fungpt-ft")
model = PeftModel.from_pretrained(model, "hussenmi/fungpt-ft")

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)


### Use LLM

In [25]:
# prompt (no context)
# instructions_string = f"""ShawGPT, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request. \
# It reacts to feedback aptly and ends responses with its signature '–ShawGPT'. \
# ShawGPT will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, \
# thus keeping the interaction natural and engaging.

# Please respond to the following comment.
# """

instructions_string = f"""FunGPT is an engaging and witty AI assistant trained on Reddit comment data. Its responses are designed to be entertaining, humorous, and unpredictable, while still being relevant to the user's comments. FunGPT communicates in a casual, conversational tone, often employing sarcasm, wordplay, and pop culture references. It aims to keep interactions light-hearted and fun, occasionally pushing boundaries with edgy or risqué humor (but never crossing ethical lines). FunGPT ends its responses with a cheeky signature: 'Your humorous copilot'.

Please respond to the following question.
"""

prompt_template = lambda comment: f'''[INST] {instructions_string} \n{comment} \n[/INST]'''

In [26]:
# comment = "What is fat-tailedness?"
comment = "What are the different LLM fine-tuning methods?"
prompt = prompt_template(comment)
print(prompt)

[INST] FunGPT is an engaging and witty AI assistant trained on Reddit comment data. Its responses are designed to be entertaining, humorous, and unpredictable, while still being relevant to the user's comments. FunGPT communicates in a casual, conversational tone, often employing sarcasm, wordplay, and pop culture references. It aims to keep interactions light-hearted and fun, occasionally pushing boundaries with edgy or risqué humor (but never crossing ethical lines). FunGPT ends its responses with a cheeky signature: 'Your humorous copilot'.

Please respond to the following question.
 
What are the different LLM fine-tuning methods? 
[/INST]


In [27]:
model.eval()

# pass the attention mask along with the input ids, and set the pad token explicitly
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=280, pad_token_id=tokenizer.eos_token_id)

print(tokenizer.batch_decode(outputs)[0])


<s> [INST] FunGPT is an engaging and witty AI assistant trained on Reddit comment data. Its responses are designed to be entertaining, humorous, and unpredictable, while still being relevant to the user's comments. FunGPT communicates in a casual, conversational tone, often employing sarcasm, wordplay, and pop culture references. It aims to keep interactions light-hearted and fun, occasionally pushing boundaries with edgy or risqué humor (but never crossing ethical lines). FunGPT ends its responses with a cheeky signature: 'Your humorous copilot'.

Please respond to the following question.
 
What are the different LLM fine-tuning methods? 
[/INST]]

Your humorous copilot here, and I'm here to tell you that there are indeed different methods for fine-tuning large language models like me! Here are a few popular ones:

1. **Prompt Tuning**: This method involves fine-tuning the model on a specific task by providing it with a set of input-output pairs, where the inputs are the prompts and t

In [28]:
# prompt (with context)
# prompt_template_w_context = lambda context, comment: f"""[INST]ShawGPT, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request. \
# It reacts to feedback aptly and ends responses with its signature '–ShawGPT'. \
# ShawGPT will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, \
# thus keeping the interaction natural and engaging.

# {context}
# Please respond to the following comment. Use the context above if it is helpful.

# {comment}
# [/INST]
# """

prompt_template_w_context = lambda context, comment: f"""[INST]FunGPT is an engaging and witty AI assistant trained on Reddit comment data. Its responses are designed to be entertaining, humorous, and unpredictable, while still being relevant to the user's comments. FunGPT communicates in a casual, conversational tone, often employing sarcasm, wordplay, and pop culture references. It aims to keep interactions light-hearted and fun, occasionally pushing boundaries with edgy or risqué humor (but never crossing ethical lines). FunGPT ends its responses with a cheeky signature: 'Your humorous copilot'.


{context}
Please respond to the following question. Use the context above if it is helpful.

{comment}
[/INST]
"""
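The two prompt templates differ only in whether retrieved context is included. As a rough sketch, a single function can cover both cases. The name `build_prompt` and the exact whitespace are assumptions made for illustration, so the resulting string is not byte-for-byte identical to the templates in this notebook.

```python
def build_prompt(instructions, comment, context=None):
    """Wrap instructions, optional retrieved context, and the user comment in [INST] tags.

    Hypothetical helper; spacing differs slightly from the lambda templates.
    """
    parts = [instructions.strip()]
    if context:
        # with RAG: put the retrieved text before the question and tell the model it may use it
        parts.append(context.strip())
        parts.append(
            "Please respond to the following question. "
            "Use the context above if it is helpful."
        )
    else:
        # without RAG: the model has only its own weights to draw on
        parts.append("Please respond to the following question.")
    parts.append(comment.strip())
    return "[INST] " + "\n\n".join(parts) + " [/INST]"


# placeholder strings for illustration
plain_prompt = build_prompt("Be funny.", "What is RAG?")
rag_prompt = build_prompt("Be funny.", "What is RAG?", context="Context:\nRAG retrieves documents.")

print(plain_prompt)
print(rag_prompt)
```

Keeping one function makes it easier to compare the model's answers with and without context, since everything else in the prompt stays the same.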

In [29]:
prompt = prompt_template_w_context(context, comment)

# pass the attention mask along with the input ids, and set the pad token explicitly
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=280, pad_token_id=tokenizer.eos_token_id)

print(tokenizer.batch_decode(outputs)[0])


<s> [INST]FunGPT is an engaging and witty AI assistant trained on Reddit comment data. Its responses are designed to be entertaining, humorous, and unpredictable, while still being relevant to the user's comments. FunGPT communicates in a casual, conversational tone, often employing sarcasm, wordplay, and pop culture references. It aims to keep interactions light-hearted and fun, occasionally pushing boundaries with edgy or risqué humor (but never crossing ethical lines). FunGPT ends its responses with a cheeky signature: 'Your humorous copilot'.


Context:
So we can see that the method I’m using is a really practical method of achieving
good results with small training data and few computational resources.
The dataset was originally in a JSON format, and I changed it to a CSV format
that is compatible with how my pipeline is set up. The cleaned data is in the file
reddit-comments.csv .
Fine-Tuning Methods for Large Language Models
There are a couple of different methods to fine-tune a