<a href="https://colab.research.google.com/github/nileshchopda/Fine-Tuning_Gemma2_with_LoRA_for_Hindi/blob/main/fine-tuning-gemma2-2b-for-hindi-hinglish.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine Tuning Gemma2 2b for Hindi-Hinglish QnA

In this notebook we walk through, step by step, how we fine-tuned the **Gemma2 2b model** into a conversational chatbot for Hindi.

This notebook will help you

  1. Understand the key parameters of the LoRA configuration: what each one does and how to tailor it for knowledge injection versus task tuning.

  2. Save your models correctly after training and avoid unintended inference issues.

  3. Save as many resources as possible while training.

We used the following datasets for our fine-tuning:
- **GPT4's Alpaca**
- **Cognitive Lab's Hindi Instruct**
- **Wikipedia Datasets**
- **Alpaca for Gemma**
- **Databricks Dolly**
- **Hindi Maths Quest**

We experimented with different LoRA configs across the training runs, and the results varied with the config.

We mostly used **L4** and occasionally **A100** GPUs, depending on resource availability.

It took us about **40 Hours** to train on all the datasets, using either the full dataset or a subset.

We learnt many things along the way, especially about the fine-tuning prompt and about suitable LoRA configs for different tasks.

---


**All of the models and adapters that were used or created in this project are available on Kaggle.**


You will need to place your Kaggle token JSON file (`kaggle.json`) inside the `/root/.config/kaggle` folder. You can download the token from Kaggle.
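
A minimal sketch of that step, assuming the token was downloaded as `kaggle.json`. The helper name `install_kaggle_token` is hypothetical and is not part of the Kaggle tooling.

```python
import os
import shutil
from pathlib import Path


def install_kaggle_token(src="kaggle.json", dest_dir="/root/.config/kaggle"):
    """Copy a Kaggle API token into the config folder and restrict its permissions."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    dest = dest_dir / "kaggle.json"
    shutil.copyfile(src, dest)
    os.chmod(dest, 0o600)  # owner read/write only, since the file holds a secret
    return dest


# Example (run once the token file sits next to the notebook):
# install_kaggle_token("kaggle.json")
```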

```python
import kagglehub

# Download latest version
path = kagglehub.model_download("lnshrivas/gemma-2/transformers/gemma-2-2b-hindi")

print("Path to model files:", path)
```

**Please pay attention to the `Notes` sections. They are important for understanding some crucial steps and concepts.**

**Please read the `Understanding LoRA` section inside the *Conclusion* section before proceeding, so that you understand the training from the get-go.**

---

>DISCLAIMER - This Notebook was created as part of a Kaggle Competition

## 1. Inference Testing on base model

Let us test our base model's capacity for handling English, Hindi and Hinglish queries.

In [1]:
!pip install -U bitsandbytes
!pip install datasets
!pip install trl
!pip install kaggle
!pip install peft


In [2]:
import torch
import warnings

from trl import SFTTrainer
from peft import PeftModel
from peft import LoraConfig
from tqdm.notebook import tqdm
from datasets import load_dataset

from transformers import BitsAndBytesConfig, TrainingArguments, AutoTokenizer, AutoModelForCausalLM, DataCollatorForSeq2Seq



We will first download our base model from Kaggle.

You will also need to give your consent to Google's terms on Kaggle to get this model. It's quite easy.

https://www.kaggle.com/models/google/gemma-2/transformers/gemma-2-2b

In [3]:
# !pip install kaggle



In [4]:
# from kaggle_secrets import UserSecretsClient
# user_secrets = UserSecretsClient()
# secret_value_0 = user_secrets.get_secret("key")
# secret_value_1 = user_secrets.get_secret("username")


In [None]:
# !ls ..
# !mkdir ../root/.config/kaggle
## Get kaggle.json from kaggle with api
# !cp kaggle.json /root/.config/kaggle/kaggle.json

In [14]:
!kaggle models instances versions download google/gemma-2/transformers/gemma-2-2b/2

Downloading gemma-2.tar.gz to /content
100% 9.07G/9.07G [01:12<00:00, 144MB/s]
100% 9.07G/9.07G [01:12<00:00, 134MB/s]
/content/gemma-2.tar.gz

We'll extract the model files

In [None]:
!tar -xvzf 'gemma-2.tar.gz' 'gemma-2-2b'

---
>NOTE: We will load the model with 4-bit quantization so that training fits within our resource constraints.

In [None]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
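
To get a feel for what 4-bit loading saves, here is a rough back-of-the-envelope sketch. The parameter count of about 2.6 billion for Gemma 2 2B is an assumption, and the estimate covers only the weights: it ignores quantization constants, activations, gradients, and optimizer state.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory needed just to hold the weights, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 2.6e9  # assumed size of Gemma 2 2B

for label, bits in [("bf16", 16), ("nf4", 4)]:
    print(f"{label}: ~{weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

Under these assumptions the weights shrink to roughly a quarter of their bf16 size, which leaves more room for activations and LoRA optimizer state.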

In [None]:
tokenizer = AutoTokenizer.from_pretrained("gemma-2-2b")

In [None]:
base_model = AutoModelForCausalLM.from_pretrained("gemma-2-2b",quantization_config=bnb_config,
                                                                         device_map='auto')

In [None]:
question = "<start_of_turn>user Tell me about elephants, but tell me in English please. <end_of_turn>\n<start_of_turn>model "

inputs = tokenizer(question, return_tensors="pt").to(base_model.device)

generated_ids = base_model.generate(**inputs,
                              max_new_tokens=246,
                              do_sample=True,
                              temperature=1,
                              top_p=0.95,
                              top_k=50,
                              repetition_penalty=1,
                              use_cache=False)

print(tokenizer.batch_decode(generated_ids, skip_special_tokens=False)[0])

In [None]:
question = "<start_of_turn>user recycling ke vishay me ek nara sujhav kare<end_of_turn>\n<start_of_turn>model "

inputs = tokenizer(question, return_tensors="pt").to(base_model.device)

generated_ids = base_model.generate(**inputs,
                              max_new_tokens=246,
                              do_sample=True,
                              temperature=1,
                              top_p=0.95,
                              top_k=50,
                              repetition_penalty=1,
                              use_cache=False)

print(tokenizer.batch_decode(generated_ids, skip_special_tokens=False)[0])

In [None]:
question = "<start_of_turn>user रीसाइक्लिंग के विषय में एक नारा सुझाए<end_of_turn>\n<start_of_turn>model "

inputs = tokenizer(question, return_tensors="pt").to(base_model.device)

generated_ids = base_model.generate(**inputs,
                              max_new_tokens=246,
                              do_sample=True,
                              temperature=1,
                              top_p=0.95,
                              top_k=50,
                              repetition_penalty=1,
                              use_cache=False)

print(tokenizer.batch_decode(generated_ids, skip_special_tokens=False)[0])

As you can see, the base model can handle neither Hinglish nor Hindi queries and generates garbled output.
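
The three cells above repeat the same prompt string by hand, which makes it easy to mistype a control token. A small helper keeps the markers consistent. `build_gemma_prompt` is a hypothetical name, and the layout simply mirrors the prompts used in this notebook rather than any official chat template.

```python
def build_gemma_prompt(user_message):
    """Wrap a user message in the turn markers used throughout this notebook."""
    return (
        f"<start_of_turn>user {user_message.strip()} <end_of_turn>\n"
        "<start_of_turn>model "
    )


prompt = build_gemma_prompt("recycling ke vishay me ek nara sujhav kare")
print(prompt)
```

The returned string can be passed to `tokenizer(...)` exactly like the `question` variable in the cells above.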

---

## 2. Fine tuning on Alpaca Dataset and making it understand Hindi

We trained for a total of **15 Hrs** on this dataset

### Dataset Preparation

> NOTE: This setting often reduces reserved GPU memory, which frees up memory and helps save resources during training.

In [None]:
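import os

# `!export` only affects a temporary subshell, so we set the variable in the kernel process instead.
# PyTorch reads PYTORCH_CUDA_ALLOC_CONF (not TORCH_CUDA_ALLOC_CONF) when CUDA memory is first allocated,
# so run this cell before any model is loaded onto the GPU.
# "max_split_size_mb:128" is an alternative value for the same variable.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"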

In [None]:
alpaca_dataset_train = load_dataset("FreedomIntelligence/alpaca-gpt4-hindi",
                              split = "train")
alpaca_dataset_train, alpaca_dataset_train[3]

In [None]:
alpaca_dataset_train.info

In [None]:
alpaca_prompt="""<start_of_turn>user\n.\n\"{}\"<end_of_turn>\n<start_of_turn>model\n{}<end_of_turn>"""
print(alpaca_prompt)

---

> NOTE: We set the padding side to the right so that padding tokens are appended after the real tokens. Without this setting, the tokenizer may pad on the left.

In [None]:
eos_token = tokenizer.eos_token
tokenizer.padding_side = "right"
eos_token

In [None]:
def formatting_func(conversations):
    texts = []
    conversations = conversations["conversations"]
    for convo in conversations:
        # EOS_TOKEN is important
        text = alpaca_prompt.format(convo[0]["value"], convo[1]["value"]) + eos_token
        texts.append(text)
    return { "text" : texts, }

In [None]:
alpaca_dataset = alpaca_dataset_train.map(formatting_func, batched = True,)

In [None]:
alpaca_dataset

---

> NOTE: This prompt helps our model distinguish the user query from the model's response. The EOS (end of sequence) token is very important: it is how the model learns when to stop. Without it, the model keeps generating until it hits the maximum length, which we will see in action later.

In [None]:
print(alpaca_dataset["text"][0])

---

> NOTE: This function converts the formatted prompt text into padded tokens and also generates attention masks, which tell the model which tokens to attend to and which padding tokens to ignore.

In [None]:
def tokenize_function(examples):
    tokenized = tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=1024,
        return_tensors="pt"
    )
    tokenized["labels"] = tokenized["input_ids"].clone()
    return tokenized

print("Tokenizing dataset...")
dataset = alpaca_dataset.map(tokenize_function, batched=True, remove_columns=["text"])
print("Dataset tokenized:", dataset[0])
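
Because the labels are a plain copy of `input_ids`, padded positions also count towards the loss unless something masks them later. A common convention is to set those labels to `-100`, which PyTorch's cross-entropy loss ignores by default. The sketch below shows the idea on plain Python lists. `mask_padding_labels` is a hypothetical helper and is not called by the cells above.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [
        [tok if keep == 1 else IGNORE_INDEX for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]


# Toy batch: token id 0 stands in for the padding token.
input_ids = [[2, 15, 27, 1, 0, 0], [2, 8, 1, 0, 0, 0]]
attention_mask = [[1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 0, 0]]
print(mask_padding_labels(input_ids, attention_mask))
```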

---
### Training

> NOTE: This LoRA config helped us generalize faster on our dataset. However, if you wish to inject knowledge into your model:
- keep **"r"** high
- remove the feed-forward layers from the target modules
- set use_rslora=False

> In our implementation, however, we did not follow these recommendations.

> To save resources while training:
- use a smaller batch size
- keep gradient accumulation at 1
- set the save limit to 1 or 2
- set `torch_empty_cache_steps`; otherwise the cache builds up over long runs and can crash the GPU

> This helps keep the GPU memory requirement under 20 GB, which is suitable for an L4 GPU.

In [None]:
lora_config = LoraConfig(
    r=128,
    lora_alpha=256,
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    modules_to_save=["embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
    use_rslora=True
)

train_args = TrainingArguments(
    per_device_train_batch_size=2,  # Each GPU processes 2 examples per step.
    gradient_accumulation_steps=1,  # Weights are updated after every step (no gradient accumulation).
    # warmup_steps=30,  # Learning rate warms up (gradually increases) for the first 30 steps.
    #max_steps=10,  # Total number of optimization steps for training.
    warmup_ratio=0.1, # Learning rate warms up (gradually increases) over the first 10 percent of training steps.
    num_train_epochs=1,  # Total number of epochs for training.
    gradient_checkpointing=True,  # Saves memory by recomputing activations during backpropagation.
    learning_rate=5e-5,  # Base learning rate for the optimizer.
    fp16=not torch.cuda.is_bf16_supported(),  # FP16 precision if BF16 is not available.
    bf16=torch.cuda.is_bf16_supported(),  # Enables bfloat16 precision if available.
    save_steps=100,  # Saves checkpoint every 100 steps.
    torch_empty_cache_steps = 100,  # Empties the cache at every 100 steps.
    optim="adamw_8bit",  # Uses AdamW optimizer with 8-bit precision for optimizer states to save memory.
    weight_decay=0.01,  # Regularization to prevent overfitting by penalizing large weights.
    lr_scheduler_type="linear",  # Linearly decays learning rate after the warmup period.
    output_dir="gemma-2-2b-(hi)-alpaca-chk",  # Directory where model checkpoints and logs will be saved.
    report_to="none",  # Disables logging to external tools like TensorBoard or WandB.
    save_total_limit=2, # Will save only 2 checkpoints at max, reducing the disk usage.
    run_name='pretrain_gemma2' # Defining a name for our runtime.
)
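
The interaction between `r`, `lora_alpha`, and `use_rslora` is easier to see as arithmetic. To our understanding, PEFT multiplies the adapter update by `lora_alpha / r` by default and by `lora_alpha / sqrt(r)` when `use_rslora=True`. A minimal sketch of that rule (`lora_scaling` is a hypothetical helper, not a PEFT function):

```python
import math


def lora_scaling(r, lora_alpha, use_rslora=False):
    """Factor applied to the low-rank update B @ A before it is added to the frozen weight."""
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r


print(lora_scaling(128, 256, use_rslora=True))   # config used in this section
print(lora_scaling(128, 256, use_rslora=False))  # same r and alpha without rsLoRA
```

With rank 128 and alpha 256, the rank-stabilized factor is about 22.6 versus 2.0 for the standard rule, so toggling `use_rslora` changes the effective update size by roughly an order of magnitude.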

---
> NOTE: The tokenizer and the collator share some arguments.
- **padding** 'longest' pads every sequence to the longest sequence in the batch
- **padding** 'max_length' additionally requires you to set a
  - **max_length** parameter to pad to,
  - **truncation** parameter to cut any sequences longer than max_length
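
The difference between the two padding strategies can be sketched on plain lists of token ids. This is a toy illustration of the behaviour described above, not the tokenizer's implementation, and `pad_batch` is a hypothetical helper.

```python
def pad_batch(batch, pad_id=0, padding="longest", max_length=None):
    """Right-pad token id lists, mimicking the two padding strategies."""
    if padding == "longest":
        target = max(len(seq) for seq in batch)
    elif padding == "max_length":
        if max_length is None:
            raise ValueError("padding='max_length' needs max_length")
        target = max_length
        batch = [seq[:target] for seq in batch]  # truncate anything longer than max_length
    else:
        raise ValueError(f"unknown padding strategy: {padding}")
    return [seq + [pad_id] * (target - len(seq)) for seq in batch]


batch = [[5, 6, 7], [8, 9]]
print(pad_batch(batch, padding="longest"))
print(pad_batch(batch, padding="max_length", max_length=6))
```

With `'longest'` the batch is only as wide as its longest member, which is why it wastes less memory than padding everything to a fixed `max_length`.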

In [None]:
# The data collator assembles the tokenized examples into batches.
# It uses the tokenizer to pad each batch to its longest sequence and returns tensors.
data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=base_model,
    padding="longest",
    return_tensors="pt"
)

trainer = SFTTrainer(
    model=base_model,
    tokenizer=tokenizer,
    args=train_args,
    peft_config=lora_config,
    train_dataset=dataset,
    data_collator=data_collator,
)

In [None]:
# To begin training use
trainer.train()

In [None]:
# To resume training from last checkpoint use
trainer.train(resume_from_checkpoint=True)

In [None]:
# Once training is done save the model and the tokenizer
trainer.save_model('gemma-2-2b-(hi)-24985steps-1epoch-alphacha')
trainer.tokenizer.save_pretrained('gemma-2-2b-(hi)-24985steps-1epoch-alphacha')

> NOTE: As you can see, the training loss drops quickly and could go even lower if we trained for more epochs.

---
### Inference

For inference you will need to merge the base model and the adapter like this:

> NOTE: Always merge the adapter into a non-quantized base model to avoid rounding errors that cause unexpected behaviour during inference.

In [None]:
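# Load the full-precision base model, attach the saved adapter, and fold the adapter into the weights.
merged_model = PeftModel.from_pretrained(
    AutoModelForCausalLM.from_pretrained('gemma-2-2b', device_map="cpu"),
    'gemma-2-2b-(hi)-24985steps-1epoch-alphacha',
).merge_and_unload()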
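
Conceptually, merging folds the adapter into the base weights as `W + scaling * (B @ A)`. The sketch below illustrates this on tiny hand-written matrices in plain Python. It is a toy illustration of the idea, not PEFT's implementation, and all values are made up.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, scaling):
    """Return weight + scaling * (lora_b @ lora_a)."""
    delta = matmul(lora_b, lora_a)
    return [
        [w + scaling * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


weight = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight, shape (2, 2)
lora_a = [[1.0, 2.0]]              # A, shape (r=1, 2)
lora_b = [[0.5], [0.0]]            # B, shape (2, r=1)
print(merge_lora(weight, lora_a, lora_b, scaling=2.0))
```

When the base weight is stored in 4 bits, it has to be dequantized before this sum, and the result is rounded again afterwards. That is presumably where the rounding errors mentioned in the note come from.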

In [None]:
question = "कुछ एक रीसाइक्लिंग अभियान के लिए एक नारा सुझाव दें।"



inputs = tokenizer(question, return_tensors="pt").to(merged_model.device)

generated_ids = merged_model.generate(**inputs,
                              max_new_tokens=128,
                              do_sample=True,
                              temperature=1,
                              top_p=0.95,
                              top_k=50,
                              repetition_penalty=1.0)

print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])

You can see that the model is now able to answer a Hindi query in line with the dataset it has learned from.

---

### Saving the model as transformer file

> NOTE: These steps are crucial for ensuring that the model weights are transferred properly. Without the state dictionary transfer, our saved model could not run inference properly, so we made this our default saving workflow.

In [None]:
merged_model.save_pretrained("gemma-2-2b-tmp")

In [None]:
torch.save(merged_model.state_dict(), "merged_model_state_dict.pth")

In [None]:
model = AutoModelForCausalLM.from_pretrained("gemma-2-2b-tmp",device_map='cpu')

In [None]:
model.load_state_dict(torch.load("merged_model_state_dict.pth", weights_only=True))

In [None]:
tokenizer = AutoTokenizer.from_pretrained("gemma-2-2b-(hi)-24985steps-1epoch-alphacha")

In [None]:
model.save_pretrained("gemma-2-2b-base+alpaca")

In [None]:
tokenizer.save_pretrained("gemma-2-2b-base+alpaca")

Saved model inference -

In [None]:
question = "कुछ एक रीसाइक्लिंग अभियान के लिए एक नारा सुझाव दें।"

inputs = tokenizer(question, return_tensors="pt").to('cpu')

generated_ids = model.generate(**inputs,
                              max_new_tokens=128,
                              do_sample=True,
                              temperature=1,
                              top_p=0.95,
                              top_k=50,
                              repetition_penalty=1,
                              use_cache=False)

print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])

---
## 3. Fine tuning on Cognitive Lab Hindi Instruct Dataset

We decided to fine-tune our model further on another dataset to see what changes that would bring and, possibly, to expand its knowledge base (it didn't go very well).

Here is where we learnt the importance of defining the right LoRA config to train or fine-tune any LLM.

We trained for a total of **20 Hrs** on this dataset

> NOTE: We will always load a quantized model for training.

In [None]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

In [None]:
tokenizer = AutoTokenizer.from_pretrained("gemma-2-2b-base+alpaca")

In [None]:
base_model = AutoModelForCausalLM.from_pretrained("gemma-2-2b-base+alpaca",quantization_config=bnb_config,
                                                                         device_map='auto')

---
> NOTE: As we can see, this dataset includes previous user-model interactions. This can be useful if you want to instruction-tune the model to make use of earlier turns when you pass conversation history to an LLM.

> However, there is a major issue in the training prompt. Can you see it? We accidentally trained our model on this dataset without fixing the issue, which led to a big problem during inference, as we will see ahead.

> We will also look at a much cheaper way of resolving that issue.

### Dataset Preparation

In [None]:
cognitive_hi_inst_train = load_dataset("CognitiveLab/Hindi-Instruct", split='train')
cognitive_hi_inst_test = load_dataset("CognitiveLab/Hindi-Instruct", split='test')

In [None]:
cognitive_hi_inst_train, print(cognitive_hi_inst_train[0]['text']), print(cognitive_hi_inst_train[0]['input_ids'])

This tokenization function is more or less the same as the one we used for the other dataset.

In [None]:
def tokenize_function(examples):
    tokenizer.padding_side = "right"
    tokenized = tokenizer(
        examples["text"],
        padding="max_length",
        max_length=1024,
        truncation=True,
        return_tensors="pt"
    )
    tokenized["labels"] = tokenized["input_ids"].clone()
    return tokenized

print("Tokenizing dataset...")
train_dataset = cognitive_hi_inst_train.map(tokenize_function, batched=True, remove_columns=["text"])
test_dataset = cognitive_hi_inst_test.map(tokenize_function, batched=True, remove_columns=["text"])
print("Dataset tokenized:", train_dataset[0])

### Training

---
> NOTE: This time we decided to
- focus only on the attention and feed-forward layers and
- exclude the embedding and LM head layers from training.

> We also reduced
- **"r"**,
- lora_alpha, and
- the torch empty cache interval, to save more VRAM for a larger batch size

> We will see the consequences ahead.
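
To see how much the rank affects adapter size, recall that a LoRA adapter on a linear layer of shape `(d_in, d_out)` adds `r * (d_in + d_out)` trainable parameters. The sketch below applies that formula. The layer shapes and layer count for Gemma 2 2B are assumptions based on its published configuration, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(r, module_shapes, num_layers):
    """Trainable LoRA parameters: each adapted (d_in, d_out) linear adds r * (d_in + d_out)."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in module_shapes.values())
    return per_layer * num_layers


# Assumed Gemma 2 2B shapes (hidden 2304, MLP 9216, 8 query heads and 4 KV heads of size 256).
GEMMA2_2B_SHAPES = {
    "q_proj": (2304, 2048),
    "k_proj": (2304, 1024),
    "v_proj": (2304, 1024),
    "o_proj": (2048, 2304),
    "gate_proj": (2304, 9216),
    "up_proj": (2304, 9216),
    "down_proj": (9216, 2304),
}

for r in (128, 32):
    print(r, lora_param_count(r, GEMMA2_2B_SHAPES, num_layers=26))
```

Under these assumed shapes, dropping the rank from 128 to 32 cuts the adapter to a quarter of its size, from roughly 166 million to roughly 42 million trainable parameters. This count excludes any modules listed in `modules_to_save`.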

In [None]:
lora_config = LoraConfig(
    r=32,
    lora_alpha=128,
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj","gate_proj", "up_proj", "down_proj"
                    ],
    #modules_to_save=["embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
    use_rslora=True
)

train_args = TrainingArguments(
    per_device_train_batch_size=3,  # Each GPU processes 3 examples per step.
    gradient_accumulation_steps=1,  # Weights are updated after every step (no gradient accumulation).
    # warmup_steps=30,  # Learning rate warms up (gradually increases) for the first 30 steps.
    #max_steps=10,  # Total number of optimization steps for training.
    warmup_ratio=0.1, # Learning rate warms up (gradually increases) for the first 10 percent of epoch.
    num_train_epochs=1,  # Total number of epochs for training.
    gradient_checkpointing=True,  # Saves memory by recomputing activations during backpropagation.
    learning_rate=5e-5,  # Base learning rate for the optimizer.
    fp16=not torch.cuda.is_bf16_supported(),  # FP16 precision if BF16 is not available.
    bf16=torch.cuda.is_bf16_supported(),  # Enables bfloat16 precision if available.
    save_steps=100,  # Saves checkpoint every 100 steps.
    torch_empty_cache_steps=10,  # Empties the cache at every 10 steps.
    optim="adamw_8bit",  # Uses AdamW optimizer with 8-bit precision for optimizer states to save memory.
    weight_decay=0.01,  # Regularization to prevent overfitting by penalizing large weights.
    lr_scheduler_type="linear",  # Linearly decays learning rate after the warmup period.
    output_dir="gemma-2-2b-cog-lab-chk",  # Directory where model checkpoints and logs will be saved.
    report_to="none",  # Disables logging to external tools like TensorBoard or WandB.
    save_total_limit=2, # Will save only 2 checkpoints at max, reducing the disk usage.
    run_name='pretrain_gemma2' # Defining a name for our runtime.
)

In [None]:
data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=base_model,
    padding="longest",
    return_tensors="pt"
)

trainer = SFTTrainer(
    model=base_model,
    tokenizer=tokenizer,
    args=train_args,
    peft_config=lora_config,
    train_dataset=train_dataset,
    data_collator=data_collator,
)

---
> NOTE: Notice that the training loss did not change as drastically as last time. This may be a result of the LoRA config, some hyperparameters, or the dataset itself.

In [None]:
trainer.train()

Follow the exact same saving steps

In [None]:
trainer.save_model("gemma-2-2b-30443steps-1epoch-cog-lab")

In [None]:
trainer.tokenizer.save_pretrained('gemma-2-2b-30443steps-1epoch-cog-lab')

### Inference

Load and merge the base model and the adapter

In [None]:
merged_model = PeftModel.from_pretrained(AutoModelForCausalLM.from_pretrained("gemma-2-2b-base+alpaca",device_map='cpu'), 'gemma-2-2b-30443steps-1epoch-cog-lab').merge_and_unload()

In [None]:
question = "<start_of_turn>user क्या आप मुझे रीसाइक्लिंग के लिए एक नारा समझा सकते हैं? <end_of_turn>\n<start_of_turn>model "


inputs = tokenizer(question, return_tensors="pt").to(merged_model.device)

generated_ids = merged_model.generate(**inputs,
                              max_new_tokens=246,
                              do_sample=True,
                              temperature=1,
                              top_p=0.95,
                              top_k=50,
                              repetition_penalty=1,
                              use_cache=False)

print(tokenizer.batch_decode(generated_ids, skip_special_tokens=False)[0])

---
> NOTE: Notice how the model is unable to stop generating, unlike last time. Can you guess why this is happening?

> That's right, the end-of-sequence token. If you look at our training prompt, there was no `<eos>` token. This means the model never learnt when to stop generating and keeps producing more tokens.

> To fix this, we simply need to add the `<bos>` and `<eos>` tokens to our prompt to tell the model where a conversation begins and ends.

### Saving the model

Of course, we will first save our model like before.

In [None]:
merged_model.save_pretrained("gemma-2-2b-tmp")

In [None]:
torch.save(merged_model.state_dict(), "merged_model_state_dict.pth")

In [None]:
model = AutoModelForCausalLM.from_pretrained("gemma-2-2b-tmp",device_map='cpu')

In [None]:
model.load_state_dict(torch.load("merged_model_state_dict.pth", weights_only=True))

In [None]:
tokenizer = AutoTokenizer.from_pretrained("gemma-2-2b-30443steps-1epoch-cog-lab")

In [None]:
model.save_pretrained("gemma-2-2b-base+alpaca+cog-lab")

In [None]:
tokenizer.save_pretrained("gemma-2-2b-base+alpaca+cog-lab")

### Further Fine Tuning to teach beginning and end tokens

In [None]:
def format_prompt_with_tokens(batch):
    formatted_prompts = []
    for conversations in batch['messages']:
        formatted_prompt = []
        for i in range(0, len(conversations), 2):  # Process user-model pairs
            user_message = conversations[i]
            model_message = conversations[i + 1] if i + 1 < len(conversations) else None

            if user_message['role'] == "user" and model_message and model_message['role'] == "assistant":
                formatted_prompt.append(
                    f"<bos><start_of_turn>{user_message['role']} {user_message['content']} <end_of_turn>\n"
                    f"<start_of_turn>model {model_message['content']} <end_of_turn><eos>"
                )

        # Join the formatted prompt for this conversation
        formatted_prompts.append("\n".join(formatted_prompt))

    # Return the formatted text as a new field in the dataset
    return {"text": formatted_prompts}
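
One thing worth checking with this format: if the tokenizer is configured to prepend `<bos>` on its own (which, as far as we know, is the default for Gemma tokenizers), a literal `<bos>` at the start of the text results in two BOS tokens. A hedged sketch of a guard follows. `strip_leading_bos` is a hypothetical helper, and the sample text is made up.

```python
def strip_leading_bos(text, bos_token="<bos>"):
    """Remove one literal BOS marker from the start of text, if present."""
    return text[len(bos_token):] if text.startswith(bos_token) else text


sample = (
    "<bos><start_of_turn>user नमस्ते <end_of_turn>\n"
    "<start_of_turn>model नमस्ते! <end_of_turn><eos>"
)
print(strip_leading_bos(sample))
```

An alternative is to keep the literal token in the text and pass `add_special_tokens=False` when calling the tokenizer.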

---
> NOTE: This time we take 3000 samples for fine tuning the model to teach it when to use `<eos>` token

In [None]:
# Take the first 3000 examples (no shuffling is applied here)
random_subset = cognitive_hi_inst_train.take(3000)

# Apply the formatting function to this subset
cognitive_hi_inst_dataset = random_subset.map(format_prompt_with_tokens, batched=True)
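
Note that `take(3000)` returns the first 3000 rows in their stored order. If a random subset is wanted instead, one option is to draw the row indices with a fixed seed and pass them to `Dataset.select`. A minimal sketch of the index sampling in plain Python (`sample_indices` is a hypothetical helper):

```python
import random


def sample_indices(num_rows, k, seed=42):
    """Return k distinct row indices drawn reproducibly from range(num_rows)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_rows), k))


indices = sample_indices(num_rows=10_000, k=5)
print(indices)
# With a datasets.Dataset this would be used as: subset = dataset.select(indices)
```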

In [None]:
print(cognitive_hi_inst_dataset[1]['text'])

In [None]:
def tokenize_function(examples):
    tokenizer.padding_side = "right"
    tokenized = tokenizer(
        examples["text"],
        padding="max_length",
        max_length=1024,
        truncation=True,
        return_tensors="pt"
    )
    tokenized["labels"] = tokenized["input_ids"].clone()
    return tokenized

print("Tokenizing dataset...")
train_dataset = cognitive_hi_inst_dataset.map(tokenize_function, batched=True)
print("Dataset tokenized:", train_dataset[0])

---

> NOTE: The rest of the things are the same.
- We added a small dropout to try to prevent content bias


In [None]:
lora_config = LoraConfig(
    r=128,
    lora_alpha=256,
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj","gate_proj", "up_proj", "down_proj",],
    lora_dropout=0.05,
    #modules_to_save=["embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
    use_rslora=True
)

train_args = TrainingArguments(
    per_device_train_batch_size=3,  # Each GPU processes 3 examples per step.
    gradient_accumulation_steps=1,  # Weights are updated after every step (no gradient accumulation).
    warmup_steps=30,  # Learning rate warms up (gradually increases) for the first 30 steps.
    #max_steps=1000,  # Total number of optimization steps for training.
    warmup_ratio=0.1, # Ignored here: a non-zero warmup_steps takes precedence over warmup_ratio.
    num_train_epochs=1,  # Total number of epochs for training.
    gradient_checkpointing=True,  # Saves memory by recomputing activations during backpropagation.
    learning_rate=5e-6,  # Base learning rate for the optimizer.
    fp16=not torch.cuda.is_bf16_supported(),  # FP16 precision if BF16 is not available.
    bf16=torch.cuda.is_bf16_supported(),  # Enables bfloat16 precision if available.
    save_steps=100,  # Saves checkpoint every 100 steps.
    torch_empty_cache_steps=10,  # Empties the cache at every 10 steps.
    logging_steps=100,  # Logs metrics every 100 steps.
    optim="adamw_8bit",  # Uses AdamW optimizer with 8-bit precision for optimizer states to save memory.
    weight_decay=0.01,  # Regularization to prevent overfitting by penalizing large weights.
    lr_scheduler_type="linear",  # Linearly decays learning rate after the warmup period.
    output_dir="gemma-2-2b-(hi)-cog-lab-chk-fnt",  # Directory where model checkpoints and logs will be saved.
    report_to="none",  # Disables logging to external tools like TensorBoard or WandB.
    save_total_limit=2, # Will save only 2 checkpoints at max, reducing the disk usage.
    run_name='pretrain_gemma2' # Defining a name for our runtime.
)
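
With 3000 examples and a per-device batch size of 3, the run length and warmup can be worked out by hand. The sketch assumes a single GPU. Both helper names are hypothetical, and the warmup rule reflects our understanding of how `TrainingArguments` resolves the two settings.

```python
import math


def training_steps(num_examples, per_device_batch_size, grad_accum_steps=1, num_devices=1, epochs=1):
    """Approximate number of optimizer steps for a run."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    return math.ceil(num_examples / effective_batch) * epochs


def effective_warmup_steps(total_steps, warmup_steps=0, warmup_ratio=0.0):
    """An explicit warmup_steps > 0 wins over warmup_ratio."""
    return warmup_steps if warmup_steps > 0 else math.ceil(total_steps * warmup_ratio)


total = training_steps(num_examples=3000, per_device_batch_size=3)
print(total)
print(effective_warmup_steps(total, warmup_steps=30, warmup_ratio=0.1))
```

This gives 1000 optimizer steps with 30 warmup steps, which matches the `1000steps` in the adapter name used when saving this run.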

In [None]:
data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=base_model,
    padding="longest",
    return_tensors="pt"
)

trainer = SFTTrainer(
    model=base_model,
    tokenizer=tokenizer,
    args=train_args,
    peft_config=lora_config,
    train_dataset=train_dataset,
    data_collator=data_collator,
)

In [None]:
trainer.train()

Training for more epochs might have shown a better result.

In [None]:
trainer.save_model("gemma-2-2b-1000steps-0.03epoch-cog-lab-fnt")

In [None]:
trainer.tokenizer.save_pretrained("gemma-2-2b-1000steps-0.03epoch-cog-lab-fnt")



As always we will merge the models



In [None]:
# Paths follow the directory names saved earlier in this notebook.
merged_model = PeftModel.from_pretrained(
    AutoModelForCausalLM.from_pretrained("gemma-2-2b-base+alpaca+cog-lab", device_map='auto'),
    "gemma-2-2b-1000steps-0.03epoch-cog-lab-fnt",
).merge_and_unload()

---

It seems the model can now answer in Hinglish well enough, but it has become pattern-biased: it thinks everything needs a thorough explanation along with its historical context.

In [None]:
question = "<start_of_turn>user Tell me about elephants, but tell me in English please. <end_of_turn>\n<start_of_turn>model "


inputs = tokenizer(question, return_tensors="pt").to(merged_model.device)

generated_ids = merged_model.generate(**inputs,
                              max_new_tokens=246,
                              do_sample=True,
                              temperature=1,
                              top_p=0.95,
                              top_k=50,
                              repetition_penalty=1,
                              use_cache=False)

print(tokenizer.batch_decode(generated_ids, skip_special_tokens=False)[0])

---

It can also answer in English, with the same pattern bias.

In [None]:
question = "<start_of_turn>user What's your name? <end_of_turn>\n<start_of_turn>model "


inputs = tokenizer(question, return_tensors="pt").to(merged_model.device)

generated_ids = merged_model.generate(**inputs,
                              max_new_tokens=246,
                              do_sample=True,
                              temperature=1,
                              top_p=0.95,
                              top_k=50,
                              repetition_penalty=1,
                              use_cache=False)

print(tokenizer.batch_decode(generated_ids, skip_special_tokens=False)[0])

---

Since it was trained on a multi-turn dataset, we decided to check whether that kind of prompt format helps.

In [None]:
question = "<start_of_turn>user सुबेह सुबेह उठने वाली चिड़िया कौनसी है? <end_of_turn>\n<start_of_turn>model सुबह-सुबह बहुत सी चिड़िया उठती है, आपको किसके बारे में जानना है? <end_of_turn>\n<start_of_turn>user आपको कौनसी चिड़ियाँ के बारे में पता है? <end_of_turn>\n<start_of_turn>model "


inputs = tokenizer(question, return_tensors="pt").to(merged_model.device)

generated_ids = merged_model.generate(**inputs,
                              max_new_tokens=246,
                              do_sample=True,
                              temperature=1,
                              top_p=0.95,
                              top_k=50,
                              repetition_penalty=1,
                              use_cache=False)

print(tokenizer.batch_decode(generated_ids, skip_special_tokens=False)[0])

---

We tried **Few Shot** examples to show it how to hold a conversation. This worked well enough, although the content was suboptimal.

In [None]:
question = "The following is a conversation between a user and model. The assistant responds in Hindi and provides accurate, concise answers.\nExample 1:\n<start_of_turn>user भारत की राजधानी क्या है? <end_of_turn>\n<start_of_turn>model भारत की राजधानी नई दिल्ली है। <end_of_turn>\nExample 2:\n<start_of_turn>user पिरामिड कहां पाए जाते हैं? <end_of_turn>\n<start_of_turn>model पिरामिड मुख्य रूप से मिस्र में पाए जाते हैं, लेकिन सूडान, मेसोअमेरिका और इटली जैसे अन्य स्थानों पर भी हैं। <end_of_turn>\nExample 3:<start_of_turn>user मुझे चाय और कॉफी के फायदे बताओ। <end_of_turn>\n<start_of_turn>model चाय एंटीऑक्सिडेंट्स से भरपूर होती है और तनाव कम करती है। वहीं, कॉफी सतर्कता और ऊर्जा को बढ़ाती है। <end_of_turn>\nNow continue the conversation:\n<start_of_turn>user भारत के पड़ोसी देशों के नाम क्या हैं? <end_of_turn>\n<start_of_turn>model "


inputs = tokenizer(question, return_tensors="pt").to(merged_model.device)

generated_ids = merged_model.generate(**inputs,
                              max_new_tokens=246,
                              do_sample=True,
                              temperature=1,
                              top_p=0.95,
                              top_k=50,
                              repetition_penalty=1,
                              use_cache=False)

print(tokenizer.batch_decode(generated_ids, skip_special_tokens=False)[0])

---

Conclusion: As you can see, the model has now learnt to produce the end-of-turn and end-of-sequence tokens, and it knows when to stop generating.

This shows that it is fairly easy to teach a model the turn-based format and EOS token generation with a small amount of training data and far fewer resources.

The model's performance on specific tasks can be improved by carefully tuning the LoRA parameters, which greatly affect training.

---

## 4. Training on Combined Mixed Text Corpus

In this part of our research we did something different. So far we had trained our model on the available datasets one by one. That did give us some results, especially when it came to teaching the model how to use general Hindi tokens, along with some gains in code-switching and Hinglish capabilities.

This time we decided to improve the model's knowledge base while also teaching it to communicate in Hindi, Hinglish and possibly code-switched text, by using purpose-specific datasets for each of these tasks.

We used a mixture of the following:
- **Wikipedia** datasets in Hindi and Hinglish
- **Alpaca** Hindi conversation dataset
- **Databricks Dolly** code-mix, Hinglish instruct dataset
- **Hindi Math Quest** dataset for Hindi mathematics

Although our main goal was to increase the model's knowledge base in Hindi, Hinglish and code switching, we added maths in the hope of improving its reasoning capability.

We trained for a total of only **4 hours** on this dataset, as unfortunately our resources had run out.

We'll see some interesting results ahead

We'll begin by downloading and carefully formatting our datasets and putting them into the right prompt format. Because of resource and time constraints, we took only a small, carefully chosen subset of each dataset in order to induce a specific behaviour from each.


### Datasets Preparation
For wikipedia, we used the **titles** as **user** query and **texts** as **model** output

In [None]:
wiki_1 = load_dataset("Cohere/wikipedia-22-12-hi-embeddings", split = "train",)
wiki_1[0]['title'], wiki_1[0]['text']

In [None]:
wiki_2 = load_dataset("wikimedia/wikipedia", "20231101.hi", split = "train",)
wiki_2[0]['title'], wiki_2[0]['text']

In [None]:
wiki_3 = load_dataset("sgzsh269/wikipedia-hindi-hinglish", split = "train",)
wiki_3[0]['hindi_title'], wiki_3[0]['hindi_text'], wiki_3[0]['hinglish_title'], wiki_3[0]['hinglish_text']

In [None]:
wiki_3

In [None]:
def format_func_d1(example):
    prompts = []
    titles = example["title"]
    texts = example["text"]

    # Loop over each example in the batch
    for title, text in zip(titles, texts):
        prompt = f"<start_of_turn>user: {title} <end_of_turn>\n<start_of_turn>model: {text}<end_of_turn><eos>"
        prompts.append(prompt)

    # Return as a batch
    return {"prompt": prompts}

wiki_1_train = wiki_1.select_columns(["title","text"]).shuffle().take(5000).map(format_func_d1, batched=True).select_columns(["prompt"])
wiki_1_train[0], wiki_1_train[0]['prompt']

In [None]:
def format_func_d2(example):
    prompts = []
    titles = example["title"]
    texts = example["text"]

    # Loop over each example in the batch
    for title, text in zip(titles, texts):
        prompt = f"<start_of_turn>user: {title} <end_of_turn>\n<start_of_turn>model: {text}<end_of_turn><eos>"
        prompts.append(prompt)

    # Return as a batch
    return {"prompt": prompts}

wiki_2_train = wiki_2.select_columns(["title","text"]).shuffle().take(5000).map(format_func_d2, batched=True).select_columns(["prompt"])
wiki_2_train[0]['prompt']

In [None]:
def format_func_d3(example):
    prompts = []
    try: titles = example["hindi_title"]
    except KeyError: titles = example["hinglish_title"]
    try: texts = example["hindi_text"]
    except KeyError: texts = example["hinglish_text"]

    # Loop over each example in the batch
    for title, text in zip(titles, texts):
        prompt = f"<start_of_turn>user: {title} <end_of_turn>\n<start_of_turn>model: {text}<end_of_turn><eos>"
        prompts.append(prompt)

    # Return as a batch
    return {"prompt": prompts}

wiki_3_train_hi = wiki_3.select_columns(["hindi_title","hindi_text"]).map(format_func_d3, batched=True).select_columns(["prompt"])
wiki_3_train_he = wiki_3.select_columns(["hinglish_title","hinglish_text"]).map(format_func_d3, batched=True).select_columns(["prompt"])
wiki_3_train = concatenate_datasets([wiki_3_train_hi, wiki_3_train_he])
wiki_3_train_hi[0]['prompt'], wiki_3_train_he[0]['prompt']

In [None]:
wiki_dataset_train = concatenate_datasets([wiki_1_train, wiki_2_train, wiki_3_train])
wiki_dataset_train
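
The three Wikipedia formatters above differ only in which columns they read. As a hedged sketch, a small factory can produce a formatter for any title/text column pair. The name `make_wiki_formatter` and the toy batch are our own illustrations; the batch dictionary stands in for what `Dataset.map(..., batched=True)` passes to the function.

```python
def make_wiki_formatter(title_col, text_col):
    """Return a batched formatter that reads the given title/text columns."""
    def format_batch(batch):
        prompts = [
            f"<start_of_turn>user: {title} <end_of_turn>\n"
            f"<start_of_turn>model: {text}<end_of_turn><eos>"
            for title, text in zip(batch[title_col], batch[text_col])
        ]
        return {"prompt": prompts}
    return format_batch


# Toy batch with one row (hypothetical content).
toy_batch = {"hindi_title": ["दिल्ली"], "hindi_text": ["दिल्ली भारत की राजधानी है।"]}
format_hindi = make_wiki_formatter("hindi_title", "hindi_text")
formatted = format_hindi(toy_batch)
print(formatted["prompt"][0])
```

With this approach, the Hinglish columns would only need `make_wiki_formatter("hinglish_title", "hinglish_text")`, with no `try`/`except` fallback.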

---

> NOTE: This alpaca dataset is different from the previous one.

We have given both the instruction and input fields as user query.

In [None]:
alpaca_dataset_train = load_dataset("guneetsk99/Hindi_Alpaca_For_Gemma_67K",
                              split = "train")
alpaca_dataset_train, alpaca_dataset_train[3]

In [None]:
alpaca_dataset_train

In [None]:
gen_prompt="""<start_of_turn>user: {} {}<end_of_turn>\n<start_of_turn>model: {}<end_of_turn>"""
print(gen_prompt)

In [None]:
def formatting_func(examples):
    prompts = []
    instruction = examples["instruction"]
    input = examples["input"]
    output = examples['output']
    # Loop over each example in the batch
    for instruction, input, output in zip(instruction, input, output):
        input = input if input else ''
        prompt = gen_prompt.format(instruction, input, output) + '<eos>'
        prompts.append(prompt)
    return { "prompt" : prompts, }

In [None]:
alpaca_train = alpaca_dataset_train.shuffle().take(2000).map(formatting_func, batched = True,).select_columns(["prompt"])

In [None]:
alpaca_train,alpaca_train[0]

In [None]:
print(alpaca_train["prompt"][0])

In [None]:
databrick_dolly = load_dataset("aaditya/databricks-dolly-15k-Hinglish-Codemix", split = "train")

In [None]:
databrick_dolly[0]

In [None]:
def formatting_func(examples):
    prompts = []
    instruction = examples["codemix_instruction"]
    input = examples["codemix_input"]
    output = examples['codemix_output']
    # Loop over each example in the batch
    for instruction, input, output in zip(instruction, input, output):
        instruction = instruction if instruction else ''
        input = input if input else ''
        prompt = gen_prompt.format(instruction, input, output) + '<eos>'
        prompts.append(prompt)
    return { "prompt" : prompts, }

In [None]:
databrick_train = databrick_dolly.shuffle().take(2000).map(formatting_func, batched = True,).select_columns(["prompt"])

In [None]:
databrick_train, databrick_train[0]

---
> NOTE: For certain gated datasets you need to log in to Hugging Face, either interactively or by setting an access token, which you can get from your Hugging Face account.

In [None]:
from huggingface_hub import login

login()

In [None]:
import os

os.environ["HF_TOKEN"] = "your_hf_token"  # HF_TOKEN holds the access token; HF_HOME is the cache directory

In [None]:
math_quest = load_dataset("dnyanesh/HindiMathQuest", split = "train")
math_quest[0]

In [None]:
def formatting_func(examples):
    prompts = []
    instruction = examples["instruction"]
    input = examples["input"]
    output = examples['output']
    # Loop over each example in the batch
    for instruction, input, output in zip(instruction, input, output):
        input = input if input else ''
        prompt = gen_prompt.format(instruction, input, output) + '<eos>'
        prompts.append(prompt)
    return { "prompt" : prompts, }

In [None]:
mathquest_train = math_quest.shuffle().take(2000).map(formatting_func, batched = True,).select_columns(["prompt"])

In [None]:
mathquest_train, mathquest_train[0]

---

Finally we will concatenate the datasets and tokenize them

In [None]:
train_dataset = concatenate_datasets([wiki_dataset_train, alpaca_train, databrick_train, mathquest_train]).shuffle()
train_dataset, train_dataset[0]

In [None]:
def tokenize_function(examples):
    tokenized = tokenizer(
        examples["prompt"],
        padding="longest",
        truncation=True,
        max_length=1024,
        return_tensors="pt"
    )
    tokenized["labels"] = tokenized["input_ids"].clone()
    return tokenized

print("Tokenizing dataset...")
train_dataset = train_dataset.map(tokenize_function, batched=True)
print("Dataset tokenized:", train_dataset[0])

In [None]:
train_dataset
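
The tokenization step copies `input_ids` into `labels` as they are, so padded positions also end up as labels. A common refinement is to replace the labels at padded positions with `-100`, the value that PyTorch's cross-entropy loss ignores by default. The sketch below shows the idea on plain Python lists. The helper name `mask_padding_labels` and the token ids are made up for illustration, and whether this step is needed depends on how the collator and trainer in your library versions treat padding.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [
        [tok if keep == 1 else IGNORE_INDEX for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]


# Toy batch: the second row has two padding positions (token ids are made up).
input_ids = [[2, 15, 27, 1], [2, 42, 0, 0]]
attention_mask = [[1, 1, 1, 1], [1, 1, 0, 0]]
labels = mask_padding_labels(input_ids, attention_mask)
print(labels)
```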

### Training

---
NOTE: As always the lora parameters are very crucial.

- Analyze our current parameters as compared to our previous configs.
- You can increase **r** and alpha to give the adapters more capacity and a stronger update.
- We have set the gradient accumulation to 2 this time.
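
To make the effect of **r** and alpha concrete: the low-rank update is multiplied by a scaling factor, which is `lora_alpha / r` for standard LoRA and `lora_alpha / sqrt(r)` when `use_rslora=True`. The small sketch below evaluates both for the values used in this section (`r=64`, `lora_alpha=128`). The helper `lora_scaling` is our own illustration of that formula, not a PEFT function.

```python
import math


def lora_scaling(r, lora_alpha, use_rslora=False):
    """Scaling factor applied to the low-rank update B @ A."""
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r


standard = lora_scaling(64, 128, use_rslora=False)
rank_stabilized = lora_scaling(64, 128, use_rslora=True)
print(standard)         # 2.0
print(rank_stabilized)  # 16.0
```

With rank-stabilized scaling the update is eight times stronger at these settings, which is worth keeping in mind when comparing this config with earlier ones that did not use `use_rslora`.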

In [None]:
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"
                    ],
    modules_to_save=["embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
    use_rslora=True
)

train_args = TrainingArguments(
    per_device_train_batch_size=2,  # Each GPU processes 2 examples per step.
    gradient_accumulation_steps=2,  # Gradients are accumulated over 2 steps before updating weights.
    # warmup_steps=30,  # Learning rate warms up (gradually increases) for the first 30 steps.
    #max_steps=10,  # Total number of optimization steps for training.
    warmup_ratio=0.1, # Learning rate warms up (gradually increases) over the first 10 percent of training steps.
    num_train_epochs=1,  # Number of full passes over the training dataset.
    gradient_checkpointing=True,  # Saves memory by recomputing activations during backpropagation.
    learning_rate=5e-5,  # Base learning rate for the optimizer.
    fp16=not torch.cuda.is_bf16_supported(),  # FP16 precision if BF16 is not available.
    bf16=torch.cuda.is_bf16_supported(),  # Enables bfloat16 precision if available.
    save_steps=100,  # Saves checkpoint every 100 steps.
    torch_empty_cache_steps = 10,  # Empties the cache at every 10 steps.
    logging_steps=100,  # Logs metrics every 100 steps.
    optim="adamw_8bit",  # Uses AdamW optimizer with 8-bit precision for optimizer states to save memory.
    weight_decay=0.01,  # Regularization to prevent overfitting by penalizing large weights.
    lr_scheduler_type="linear",  # Linearly decays learning rate after the warmup period.
    output_dir="gemma-2-2b-(hi)-wiki+alpaca+databrick+mathquest_chk",  # Directory where model checkpoints and logs will be saved.
    report_to="none",  # Disables logging to external tools like TensorBoard or WandB.
    save_total_limit=2, # Will save only 2 checkpoints at max, reducing the disk usage.
    run_name='pretrain_gemma2' # Defining a name for our runtime.
)
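
As a rough sanity check on these arguments, the sketch below estimates the effective batch size, the number of optimizer updates per epoch and the warmup length. The helper `schedule_summary` is our own illustration, and the example count of 16,000 is a placeholder: the five capped subsets alone add up to that, and `wiki_3` adds more. The exact step count reported by the trainer may differ slightly depending on how the last partial batch is handled.

```python
import math


def schedule_summary(num_examples, per_device_batch, grad_accum, warmup_ratio, num_gpus=1):
    """Approximate per-epoch schedule implied by the training arguments."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    update_steps = math.ceil(num_examples / effective_batch)
    warmup_steps = math.ceil(update_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "update_steps": update_steps,
        "warmup_steps": warmup_steps,
    }


# Placeholder dataset size; substitute len(train_dataset) in the notebook.
summary = schedule_summary(16_000, per_device_batch=2, grad_accum=2, warmup_ratio=0.1)
print(summary)
```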

In [None]:
data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=base_model,
    padding="longest",
    return_tensors="pt"
)

trainer = SFTTrainer(
    model=base_model,
    tokenizer=tokenizer,
    args=train_args,
    peft_config=lora_config,
    train_dataset=train_dataset,
    data_collator=data_collator,
)

In [None]:
trainer.train()

### Saving the model

The saving and merging steps are the same as always

In [None]:
trainer.save_model('gemma-2-2b-(hi)-16994batch-1epoch-wiki+alpaca+databrick+mathquest')
trainer.tokenizer.save_pretrained('gemma-2-2b-(hi)-16994batch-1epoch-wiki+alpaca+databrick+mathquest')

In [None]:
tokenizer = AutoTokenizer.from_pretrained('gemma-2-2b-(hi)-16994batch-1epoch-wiki+alpaca+databrick+mathquest')
merged_model = PeftModel.from_pretrained(AutoModelForCausalLM.from_pretrained('gemma-2-2b', device_map='auto'), 'gemma-2-2b-(hi)-16994batch-1epoch-wiki+alpaca+databrick+mathquest').merge_and_unload()

In [None]:
merged_model.save_pretrained("gemma-2-2b-tmp")

In [None]:
torch.save(merged_model.state_dict(), "merged_model_state_dict.pth")

In [None]:
model = AutoModelForCausalLM.from_pretrained("gemma-2-2b-tmp",device_map='cpu')

In [None]:
model.load_state_dict(torch.load("merged_model_state_dict.pth", weights_only=True))

In [None]:
tokenizer = AutoTokenizer.from_pretrained("gemma-2-2b-(hi)-16994batch-1epoch-wiki+alpaca+databrick+mathquest")

In [None]:
model.save_pretrained("gemma-2-2b-(hi)-base+wiki+alpaca+databrick+mathquest")

In [None]:
tokenizer.save_pretrained("gemma-2-2b-(hi)-base+wiki+alpaca+databrick+mathquest")

### Inference

In [None]:
system_prompt = "You are Gemma2, a helpful, conversational AI assistant. You are an expert in Hindi, colloquial Hinglish and English. You respond to users in a clear, and concise manner in the language of the user query. \nआप जेम्मा2 हैं, एक मददगार, संवादी एआई सहायक। आप हिंदी, बोलचाल की हिंग्लिश और अंग्रेजी में विशेषज्ञ हैं। आप उपयोगकर्ताओं को उपयोगकर्ता की क्वेरी की भाषा में स्पष्ट और संक्षिप्त तरीके से जवाब देते हैं। \naap jemmaa2 hain, ek madadagaar, sanvaadee eaee sahaayak. aap hindee, bolachaal kee hinglish aur angrejee mein visheshagy hain. aap upayogakartaon ko upayogakarta kee kveree kee bhaasha mein spasht aur sankshipt tareeke se javaab dete hain."

# Prepare the input
user_input = "<start_of_turn>user: Why is diwali celebrated<end_of_turn>"
model_output = "<start_of_turn>model: "
combined_input = user_input + "\n" + model_output


inputs = tokenizer(combined_input, return_tensors="pt").to(merged_model.device)

generated_ids = merged_model.generate(**inputs,
                              max_new_tokens=500,
                              do_sample=True,
                              temperature=1,
                              top_p=0.95,
                              top_k=50,
                              repetition_penalty=1.0)

print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])

In [None]:
system_prompt = "You are Gemma2, a helpful, conversational AI assistant. You are an expert in Hindi, colloquial Hinglish and English. You respond to users in a clear, and concise manner in the language of the user query"

# Prepare the input
user_input = "<start_of_turn>user: दिवाली का त्यौहार क्यों मनाया जाता है, संचेप में बतायें?<end_of_turn>"
model_output = "<start_of_turn>model: "
combined_input = system_prompt + '\n' + user_input + "\n" + model_output


inputs = tokenizer(combined_input, return_tensors="pt").to(merged_model.device)

generated_ids = merged_model.generate(**inputs,
                              max_new_tokens=500,
                              do_sample=True,
                              temperature=1,
                              top_p=0.95,
                              top_k=50,
                              repetition_penalty=1.0)

print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])

It seems the model has learnt how to give answers in points.

---
> NOTE: The model can respond in Hindi or English to match the user query; however, this behaviour has to be requested explicitly in the system prompt.

Use the system prompt if the model isn't giving accurate results, especially for translation and Hinglish responses.

---

In [None]:
system_prompt = "You are Gemma2, a helpful, conversational AI assistant. You are an expert in Hindi, colloquial Hinglish and English. You respond to users in a clear, and concise manner in the language of the user query"

# Prepare the input
user_input = "<start_of_turn>user: यह रहा एक गणित का प्रश्न हिंदी में: \n**प्रश्न:** \nएक रेलगाड़ी की लंबाई 120 मीटर है। वह 72 किमी/घंटा की गति से चल रही है। रेलगाड़ी को एक 240 मीटर लंबे पुल को पार करने में कितना समय लगेगा? \n(उत्तर सेकंड में दें।)?<end_of_turn>"
model_output = "<start_of_turn>model: "
combined_input = system_prompt + '\n' + user_input + "\n" + model_output


inputs = tokenizer(combined_input, return_tensors="pt").to(merged_model.device)

generated_ids = merged_model.generate(**inputs,
                              max_new_tokens=1000,
                              do_sample=True,
                              temperature=1,
                              top_p=0.95,
                              top_k=50,
                              repetition_penalty=1.0)

print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])

---

> NOTE: For maths it's better to use a lower temperature, as mathematical reasoning relies on a streamlined approach rather than creativity.
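
To illustrate why a lower temperature suits this kind of task, the sketch below applies a temperature-scaled softmax to a few made-up logits. The function `softmax_with_temperature` and the toy scores are our own illustrations; they are not taken from the model.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
probs_default = softmax_with_temperature(toy_logits, 1.0)
probs_low = softmax_with_temperature(toy_logits, 0.3)
print([round(p, 3) for p in probs_default])
print([round(p, 3) for p in probs_low])
```

At the lower temperature most of the probability mass moves to the top-scoring token, so sampling becomes more deterministic. In `generate`, this corresponds to passing a `temperature` value below 1 together with `do_sample=True`, or setting `do_sample=False` for greedy decoding.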

In [None]:
system_prompt = "You are Gemma2, a helpful, AI assistant. You are an expert in Hindi, colloquial Hinglish and English communication. You respond to users in a clear, and concise manner in the language of the user query"

# Prepare the input
user_input = "<start_of_turn>user: प्रश्न: एक रेलगाड़ी की लंबाई 120 मीटर है। वह 72 किमी/घंटा की गति से चल रही है। रेलगाड़ी को एक 240 मीटर लंबे पुल को पार करने में कितना समय लगेगा? (उत्तर सेकंड में दें।) निर्देश: इस प्रश्न को पहले ध्यान से पढ़ें और पूरी तरह से समझें। इसके बाद, इसे चरणबद्ध तरीके से हल करें। प्रत्येक चरण में अपने निष्कर्ष स्पष्ट रूप से प्रस्तुत करें और अंत में उत्तर दें।<end_of_turn>"
model_output = "<start_of_turn>model: "
combined_input = system_prompt + '\n' + user_input + "\n" + model_output


inputs = tokenizer(combined_input, return_tensors="pt").to(merged_model.device)

generated_ids = merged_model.generate(**inputs,
                              max_new_tokens=2000,
                              do_sample=True,
                              repetition_penalty=1)

print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])

Mathematical reasoning isn't all that great yet :')

Although that wasn't our goal anyway

Now let's check for areas where creativity is needed.

In [None]:
system_prompt = "You are Gemma2, a helpful, AI assistant. You are an expert in Hindi, colloquial Hinglish and English communication. You respond to users in a clear, and concise manner in the language of the user query"

# Prepare the input
user_input = "<start_of_turn>user: Kya aapko pata hay ki ek saal me kitne din hote hain?<end_of_turn>"
model_output = "<start_of_turn>model: "
combined_input =  user_input + "\n" + model_output


inputs = tokenizer(combined_input, return_tensors="pt").to(merged_model.device)

generated_ids = merged_model.generate(**inputs,
                              max_new_tokens=2000,
                              do_sample=True,
                              temperature=1,
                              top_p=0.95,
                              top_k=50,
                              repetition_penalty=1.0)

print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])

Well, at least it knows how many days there are in a year.

---

In [None]:
system_prompt = "You are Gemma2, a helpful, AI assistant. You are an expert in Hindi, colloquial Hinglish and English communication. You respond to users in a clear, and concise manner in the language of the user query"

# Prepare the input
user_input = "<start_of_turn>user: एक यादृच्छिक कविता उत्पन्न करें<end_of_turn>"
model_output = "<start_of_turn>model: "
combined_input =  user_input + "\n" + model_output


inputs = tokenizer(combined_input, return_tensors="pt").to(merged_model.device)

generated_ids = merged_model.generate(**inputs,
                              max_new_tokens=2000,
                              do_sample=True,
                              temperature=1,
                              top_p=0.95,
                              top_k=50,
                              repetition_penalty=1.0)

print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])

It seems to know the general structure of a poem; however, the response isn't great in terms of coherence and meaning.

---

In [None]:
system_prompt = "You are Gemma2, a helpful, AI assistant. You are an expert in Hindi, colloquial Hinglish and English communication. You respond to users in a clear, and concise manner in the language of the user query"

# Prepare the input
user_input = "<start_of_turn>user: महात्मा गांधी के बारे में 100 शब्दो में निबंद लिखें।<end_of_turn>"
model_output = "<start_of_turn>model: "
combined_input =  user_input + "\n" + model_output


inputs = tokenizer(combined_input, return_tensors="pt").to(merged_model.device)

generated_ids = merged_model.generate(**inputs,
                              max_new_tokens=2000,
                              do_sample=True,
                              temperature=1,
                              top_p=0.95,
                              top_k=50,
                              repetition_penalty=1.0)

print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])

For essay writing

---

In [None]:
system_prompt = "You are Gemma2, a helpful, AI assistant. You are an expert in Hindi, colloquial Hinglish and English communication. You respond to users in a clear, and concise manner in the language of the user query"

# Prepare the input
user_input = "<start_of_turn>user: Python प्रोग्रामिंग लैंग्वेज में एक 'हैलो वर्ल्ड' का कोड लिखा है।<end_of_turn>"
model_output = "<start_of_turn>model: "
combined_input =  user_input + "\n" + model_output


inputs = tokenizer(combined_input, return_tensors="pt").to(merged_model.device)

generated_ids = merged_model.generate(**inputs,
                              max_new_tokens=2000,
                              do_sample=True,
                              temperature=1,
                              top_p=0.95,
                              top_k=50,
                              repetition_penalty=1.0)

print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])

Here our query itself was incorrect; however, the model seemed to recognise that we were asking about something related to programming, and it gave a good code-switched response.

---

In [None]:
system_prompt = "You are Gemma2, a helpful, AI assistant. You are an expert in Hindi, colloquial Hinglish and English communication. You respond to users in a clear, and concise manner in the language of the user query"

# Prepare the input
user_input = "<start_of_turn>user: Translate 'And when i decided to play outside, it started raining' to hindi<end_of_turn>"
model_output = "<start_of_turn>model: "
combined_input = system_prompt + '\n' + user_input + "\n" + model_output


inputs = tokenizer(combined_input, return_tensors="pt").to(merged_model.device)

generated_ids = merged_model.generate(**inputs,
                              max_new_tokens=2000,
                              do_sample=True,
                              temperature=1,
                              top_p=0.95,
                              top_k=50,
                              repetition_penalty=1.0)

print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])

For the translation task we needed to include the system prompt to get a good response.

> NOTE: All of the above generations could be improved with an even better instruction and system prompt as well as tuning the model's generation parameters.

And there's an added benefit of training on a larger mixed dataset.

---

## 5. But why base?

You might be wondering why, for our fine-tuning task, we chose the base model and eventually turned it into an instruct model ourselves, instead of choosing a model that is more lightweight (smaller in size) and would possibly scale better for instruction tuning.

Good question. And so let me present to you the answer to that question.

Let's set up our model for inference.

In [None]:
!kaggle models instances versions download google/gemma-2/transformers/gemma-2-2b-it/2

In [None]:
!mkdir -p 'gemma-2-2b-it' && tar -xvzf 'gemma-2.tar.gz' -C 'gemma-2-2b-it'

In [None]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

In [None]:
tokenizer = AutoTokenizer.from_pretrained("gemma-2-2b-it")

In [None]:
base_model = AutoModelForCausalLM.from_pretrained("gemma-2-2b-it",quantization_config=bnb_config,
                                                                         device_map='auto')

In [None]:
system_prompt = "You are Gemma2, a helpful, conversational AI assistant. You are an expert in Hindi, colloquial Hinglish and English. You respond to users in a clear, and concise manner in the language of the user query. \nआप जेम्मा2 हैं, एक मददगार, संवादी एआई सहायक। आप हिंदी, बोलचाल की हिंग्लिश और अंग्रेजी में विशेषज्ञ हैं। आप उपयोगकर्ताओं को उपयोगकर्ता की क्वेरी की भाषा में स्पष्ट और संक्षिप्त तरीके से जवाब देते हैं।"

# Prepare the input
user_input = "<start_of_turn>user: Why is diwali celebrated<end_of_turn>"
model_output = "<start_of_turn>model: "
combined_input = system_prompt + "\n" + user_input + "\n" + model_output


inputs = tokenizer(combined_input, return_tensors="pt").to('cuda')

generated_ids = base_model.generate(**inputs,
                              max_new_tokens=2048,
                              do_sample=True,
                              temperature=1,
                              top_p=0.95,
                              top_k=50,
                              repetition_penalty=1.0)

generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=False)[0]
# Print the raw generation; the next cell extracts the text between <start_of_turn>model: and <end_of_turn>
print(generated_text)

In [None]:
response = ''
parts = generated_text.split("<start_of_turn>model:")

if len(parts) > 1:
    model_response = parts[1]  # This part contains the model's output
    model_response = model_response.split("<end_of_turn>")[0].strip()  # Remove after <end_of_turn>
    response += model_response

print(response)
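
The extraction logic above can be wrapped in a small helper that also copes with a missing marker. This is a minimal sketch: the name `extract_model_response` and the sample strings are our own illustrations, and like the cell above it returns the first model turn found in the text.

```python
def extract_model_response(generated_text,
                           start_marker="<start_of_turn>model:",
                           end_marker="<end_of_turn>"):
    """Return the text of the first model turn, or '' if no model turn is found."""
    _, found, tail = generated_text.partition(start_marker)
    if not found:
        return ""
    return tail.split(end_marker)[0].strip()


# Made-up generation in the same layout as the prompts in this notebook.
sample = (
    "<start_of_turn>user: Why is diwali celebrated<end_of_turn>\n"
    "<start_of_turn>model: It marks the victory of light over darkness.<end_of_turn><eos>"
)
extracted = extract_model_response(sample)
print(extracted)
print(repr(extract_model_response("no markers here")))
```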

The instruct model is already highly capable of generating high-quality answers out of the box! Any further fine-tuning on it only diminished its quality rather than improving it, so we stuck with enhancing the base model for our language task instead.

---

We devised a series of simple questions to test the instruct model.

In [None]:
test_prompts = [
    {
        "category": "General",
        "user": "दुनिया का सबसे ऊँचा पर्वत कौन सा है?"
    },
    {
        "category": "General",
        "user": "पानी का रासायनिक सूत्र क्या है?"
    },
    {
        "category": "General",
        "user": "“सूर्य” शब्द का पर्यायवाची क्या है?"
    },
    {
        "category": "General",
        "user": "पृथ्वी पर सबसे बड़ा महासागर कौन सा है?"
    },
    {
        "category": "Chat",
        "user": "तुम कैसे हो?"
    },
    {
        "category": "Chat",
        "user": "क्या तुम मेरे दोस्त बनोगे?"
    },
    {
        "category": "Chat",
        "user": "आज का मौसम कैसा रहेगा?"
    },
    {
        "category": "Chat",
        "user": "मुझे बोरियत हो रही है, क्या कोई मजेदार बात सुनाओ।"
    },
    {
        "category": "Historical",
        "user": "महात्मा गांधी का असली नाम क्या था?"
    },
    {
        "category": "Historical",
        "user": "अशोक महान किस राजवंश से संबंधित थे?"
    },
    {
        "category": "Historical",
        "user": "भारत का स्वतंत्रता संग्राम कब शुरू हुआ?"
    },
    {
        "category": "Historical",
        "user": "ताजमहल किसने बनवाया और क्यों?"
    },
    {
        "category": "Storytelling",
        "user": "एक ऐसी कहानी सुनाओ जिसमें राजा, रानी और एक जादुई तोता हो।"
    },
    {
        "category": "Storytelling",
        "user": "किसी बच्चे की साहस की कहानी सुनाओ।"
    },
    {
        "category": "Storytelling",
        "user": "चंदामामा की कोई कहानी सुनाओ।"
    },
    {
        "category": "Storytelling",
        "user": "मुझे एक रोमांचक जंगल यात्रा की कहानी बताओ।"
    },
    {
        "category": "Poetry",
        "user": "गुलाब पर एक कविता सुनाओ।"
    },
    {
        "category": "Poetry",
        "user": "बारिश के मौसम पर दो लाइनें बनाओ।"
    },
    {
        "category": "Poetry",
        "user": "प्रेम पर एक छोटी कविता सुनाओ।"
    },
    {
        "category": "Poetry",
        "user": "अपने मन से कोई कविता लिखो।"
    },
    {
        "category": "Hinglish",
        "user": "Tum kya kar rahe ho abhi?"
    },
    {
        "category": "Hinglish",
        "user": "Mujhe ek achhi movie recommend karo."
    },
    {
        "category": "Hinglish",
        "user": "Life ke baare mein tumhara kya opinion hai?"
    },
    {
        "category": "Hinglish",
        "user": "Ek short story sunao jo funny ho."
    },
    {
        "category": "Knowledge",
        "user": "भारत का राष्ट्रीय पक्षी कौन है?"
    },
    {
        "category": "Knowledge",
        "user": "E=mc² का मतलब क्या है?"
    },
    {
        "category": "Knowledge",
        "user": "चंद्रग्रहण क्यों और कैसे होता है?"
    },
    {
        "category": "Knowledge",
        "user": "विज्ञान के कौन से अविष्कार ने मानव जीवन को सबसे ज्यादा बदला?"
    },
    {
        "category": "Fun",
        "user": "अगर तुम एक जादुई प्राणी होते, तो कौन से होते?"
    },
    {
        "category": "Fun",
        "user": "अपना पसंदीदा खाना बताओ, लेकिन सिर्फ emojis में।"
    },
    {
        "category": "Fun",
        "user": "अगर तुम्हें टाइम मशीन मिल जाए, तो कहां जाना चाहोगे?"
    },
    {
        "category": "Fun",
        "user": "मुझे एक दिन के लिए राजा बना दो, क्या करोगे?"
    }
]

In [None]:
# Define the function
def generate_responses(dataset, base_model, tokenizer, system_prompt):
    """
    Generate responses for each input in a dataset using a conversational model.

    Args:
        dataset (list): A list of dictionaries with 'category' and 'user' keys.
        base_model (AutoModelForCausalLM): The pre-trained model for generating responses.
        tokenizer (AutoTokenizer): The tokenizer for the model.
        system_prompt (str): The system prompt to provide context for the model.

    Returns:
        list: Updated dataset with an additional 'output' field containing the model's response.
    """
    updated_dataset = []

    for entry in tqdm(dataset):
        user_input = f"<start_of_turn>user: {entry['user']}<end_of_turn>"
        model_output = "<start_of_turn>model: "
        combined_input = system_prompt + "\n" + user_input + "\n" + model_output

        # Tokenize and prepare input
        inputs = tokenizer(combined_input, return_tensors="pt").to('cuda')

        # Generate response
        generated_ids = base_model.generate(
            **inputs,
            max_new_tokens=2048,
            do_sample=True,
            temperature=1,
            top_p=0.95,
            top_k=50,
            repetition_penalty=1.0
        )

        # Decode the generated output
        response = tokenizer.batch_decode(generated_ids, skip_special_tokens=False)[0]

        # Extract the actual response by trimming the unnecessary parts
        response_text = response.split("<start_of_turn>model:")[1].split("<end_of_turn>")[0].strip()

        # Update the entry with the generated output
        updated_entry = {
            "category": entry["category"],
            "user": entry["user"],
            "output": response_text
        }
        updated_dataset.append(updated_entry)

    return updated_dataset

In [None]:
system_prompt = (
    "You are Gemma2, a helpful, conversational AI assistant. You are an expert in Hindi, colloquial Hinglish and English. "
    "You respond to users in a clear, and concise manner in the language of the user query. \n"
    "आप जेम्मा2 हैं, एक मददगार, संवादी एआई सहायक। आप हिंदी, बोलचाल की हिंग्लिश और अंग्रेजी में विशेषज्ञ हैं। "
    "आप उपयोगकर्ताओं को उपयोगकर्ता की क्वेरी की भाषा में स्पष्ट और संक्षिप्त तरीके से जवाब देते हैं।"
)
# Generate responses
updated_dataset = generate_responses(test_prompts, base_model, tokenizer, system_prompt)

In [None]:
updated_dataset
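
With 32 prompts spread over eight categories, it can be easier to review the outputs one category at a time. The sketch below groups result entries by their `category` key. The helper `group_by_category` and the toy entries are our own illustrations; in the notebook you would pass `updated_dataset` instead.

```python
from collections import defaultdict


def group_by_category(entries):
    """Group result dictionaries by their 'category' key, preserving order."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry["category"]].append(entry)
    return dict(grouped)


# Toy results standing in for `updated_dataset` (hypothetical content).
toy_results = [
    {"category": "General", "user": "q1", "output": "a1"},
    {"category": "Chat", "user": "q2", "output": "a2"},
    {"category": "General", "user": "q3", "output": "a3"},
]
grouped_results = group_by_category(toy_results)
for category, items in grouped_results.items():
    print(category, len(items))
```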

You can see the responses are already very good. Much better than the base model as we will see in a second.

In [None]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

In [None]:
#If you add the model from Kaggle, use this line.
modelName = "/content/gemma-2-2b"

tokenizer = AutoTokenizer.from_pretrained(modelName)
base_model = AutoModelForCausalLM.from_pretrained(modelName,
                                             quantization_config=bnb_config,
                                             trust_remote_code=True,
                                             device_map="auto")

In [None]:
system_prompt = "You are Gemma2, a helpful, conversational AI assistant. You are an expert in Hindi, colloquial Hinglish and English. You respond to users in a clear, and concise manner in the language of the user query. \nआप जेम्मा2 हैं, एक मददगार, संवादी एआई सहायक। आप हिंदी, बोलचाल की हिंग्लिश और अंग्रेजी में विशेषज्ञ हैं। आप उपयोगकर्ताओं को उपयोगकर्ता की क्वेरी की भाषा में स्पष्ट और संक्षिप्त तरीके से जवाब देते हैं।"

# Prepare the input
user_input = "<start_of_turn>user: Why is diwali celebrated<end_of_turn>"
model_output = "<start_of_turn>model: "
combined_input = system_prompt + "\n" + user_input + "\n" + model_output


inputs = tokenizer(combined_input, return_tensors="pt").to(base_model.device)

generated_ids = base_model.generate(**inputs,
                              max_new_tokens=2048,
                              do_sample=True,
                              temperature=1,
                              top_p=0.95,
                              top_k=50,
                              repetition_penalty=1.0)

print(tokenizer.batch_decode(generated_ids, skip_special_tokens=False)[0])

As we can already see, the base model's responses are very bad.

In [None]:
system_prompt = (
    "You are Gemma2, a helpful, conversational AI assistant. You are an expert in Hindi, colloquial Hinglish and English. "
    "You respond to users in a clear, and concise manner in the language of the user query. \n"
    "आप जेम्मा2 हैं, एक मददगार, संवादी एआई सहायक। आप हिंदी, बोलचाल की हिंग्लिश और अंग्रेजी में विशेषज्ञ हैं। "
    "आप उपयोगकर्ताओं को उपयोगकर्ता की क्वेरी की भाषा में स्पष्ट और संक्षिप्त तरीके से जवाब देते हैं।"
)
# Generate responses
updated_dataset = generate_responses(test_prompts, base_model, tokenizer, system_prompt)

In [None]:
updated_dataset

You can compare these base-model responses with the instruct model's. They are worlds apart.

Because of this difference, we decided to fine-tune the base model, as that was better suited to the goal of our project.

## 6. Final testing

We will test our model on QnA and RAG, and use few-shot prompting to improve the results.

### QnA Testing

Now let us see how our model performs on the same series of test questions.

In [None]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

In [None]:
tokenizer = AutoTokenizer.from_pretrained("/content/drive/MyDrive/Bhavesh/Models/google/gemma-2-2b-(hi)-base+wiki+alpaca+databrick+mathquest")

In [None]:
base_model = AutoModelForCausalLM.from_pretrained(
    "/content/drive/MyDrive/Bhavesh/Models/google/gemma-2-2b-(hi)-base+wiki+alpaca+databrick+mathquest",
    quantization_config=bnb_config,
    device_map='auto',
)

In [None]:
test_prompts = [
    {
        "category": "General",
        "user": "दुनिया का सबसे ऊँचा पर्वत कौन सा है?"
    },
    {
        "category": "General",
        "user": "पानी का रासायनिक सूत्र क्या है?"
    },
    {
        "category": "General",
        "user": "“सूर्य” शब्द का पर्यायवाची क्या है?"
    },
    {
        "category": "General",
        "user": "पृथ्वी पर सबसे बड़ा महासागर कौन सा है?"
    },
    {
        "category": "Chat",
        "user": "तुम कैसे हो?"
    },
    {
        "category": "Chat",
        "user": "क्या तुम मेरे दोस्त बनोगे?"
    },
    {
        "category": "Chat",
        "user": "आज का मौसम कैसा रहेगा?"
    },
    {
        "category": "Chat",
        "user": "मुझे बोरियत हो रही है, क्या कोई मजेदार बात सुनाओ।"
    },
    {
        "category": "Historical",
        "user": "महात्मा गांधी का असली नाम क्या था?"
    },
    {
        "category": "Historical",
        "user": "अशोक महान किस राजवंश से संबंधित थे?"
    },
    {
        "category": "Historical",
        "user": "भारत का स्वतंत्रता संग्राम कब शुरू हुआ?"
    },
    {
        "category": "Historical",
        "user": "ताजमहल किसने बनवाया और क्यों?"
    },
    {
        "category": "Storytelling",
        "user": "एक ऐसी कहानी सुनाओ जिसमें राजा, रानी और एक जादुई तोता हो।"
    },
    {
        "category": "Storytelling",
        "user": "किसी बच्चे की साहस की कहानी सुनाओ।"
    },
    {
        "category": "Storytelling",
        "user": "चंदामामा की कोई कहानी सुनाओ।"
    },
    {
        "category": "Storytelling",
        "user": "मुझे एक रोमांचक जंगल यात्रा की कहानी बताओ।"
    },
    {
        "category": "Poetry",
        "user": "गुलाब पर एक कविता सुनाओ।"
    },
    {
        "category": "Poetry",
        "user": "बारिश के मौसम पर दो लाइनें बनाओ।"
    },
    {
        "category": "Poetry",
        "user": "प्रेम पर एक छोटी कविता सुनाओ।"
    },
    {
        "category": "Poetry",
        "user": "अपने मन से कोई कविता लिखो।"
    },
    {
        "category": "Hinglish",
        "user": "Tum kya kar rahe ho abhi?"
    },
    {
        "category": "Hinglish",
        "user": "Mujhe ek achhi movie recommend karo."
    },
    {
        "category": "Hinglish",
        "user": "Life ke baare mein tumhara kya opinion hai?"
    },
    {
        "category": "Hinglish",
        "user": "Ek short story sunao jo funny ho."
    },
    {
        "category": "Knowledge",
        "user": "भारत का राष्ट्रीय पक्षी कौन है?"
    },
    {
        "category": "Knowledge",
        "user": "E=mc² का मतलब क्या है?"
    },
    {
        "category": "Knowledge",
        "user": "चंद्रग्रहण क्यों और कैसे होता है?"
    },
    {
        "category": "Knowledge",
        "user": "विज्ञान के कौन से अविष्कार ने मानव जीवन को सबसे ज्यादा बदला?"
    },
    {
        "category": "Fun",
        "user": "अगर तुम एक जादुई प्राणी होते, तो कौन से होते?"
    },
    {
        "category": "Fun",
        "user": "अपना पसंदीदा खाना बताओ, लेकिन सिर्फ emojis में।"
    },
    {
        "category": "Fun",
        "user": "अगर तुम्हें टाइम मशीन मिल जाए, तो कहां जाना चाहोगे?"
    },
    {
        "category": "Fun",
        "user": "मुझे एक दिन के लिए राजा बना दो, क्या करोगे?"
    }
]

In [None]:
# Define the function
def generate_responses(dataset, base_model, tokenizer, system_prompt=''):
    """
    Generate responses for each input in a dataset using a conversational model.

    Args:
        dataset (list): A list of dictionaries with 'category' and 'user' keys.
        base_model (AutoModelForCausalLM): The pre-trained model for generating responses.
        tokenizer (AutoTokenizer): The tokenizer for the model.
        system_prompt (str): The system prompt to provide context for the model.

    Returns:
        list: Updated dataset with an additional 'output' field containing the model's response.
    """
    updated_dataset = []

    for entry in tqdm(dataset):
        user_input = f"<start_of_turn>user: {entry['user']}<end_of_turn>"
        model_output = "<start_of_turn>model: "
        combined_input = system_prompt + "\n" + user_input + "\n" + model_output

        # Tokenize and prepare input
        inputs = tokenizer(combined_input, return_tensors="pt").to(base_model.device)

        # Generate response
        generated_ids = base_model.generate(
            **inputs,
            max_new_tokens=2048,
            do_sample=True,
            temperature=1,
            top_p=0.95,
            top_k=50,
            repetition_penalty=1.0
        )

        # Decode the generated output
        response = tokenizer.batch_decode(generated_ids, skip_special_tokens=False)[0]

        # Extract the actual response by trimming the unnecessary parts
        response_text = response.split("<start_of_turn>model:")[1].split("<end_of_turn>")[0].strip()

        # Update the entry with the generated output
        updated_entry = {
            "category": entry["category"],
            "user": entry["user"],
            "output": response_text
        }
        updated_dataset.append(updated_entry)

    return updated_dataset
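The extraction step in `generate_responses` indexes `split(...)[1]`, which raises an `IndexError` if the marker is ever missing from the decoded text. A slightly more defensive variant might look like the sketch below. `extract_model_reply` is a hypothetical helper name used only for illustration; it is not part of the notebook or of any library.

```python
def extract_model_reply(decoded: str,
                        start_marker: str = "<start_of_turn>model:",
                        end_marker: str = "<end_of_turn>") -> str:
    """Return the text between the first start_marker and the next end_marker."""
    # partition() splits on the first occurrence and never raises.
    _, found, tail = decoded.partition(start_marker)
    if not found:
        # Marker missing: return an empty reply instead of failing.
        return ""
    # Keep everything before the first end marker; if there is none,
    # the whole tail is kept.
    reply, _, _ = tail.partition(end_marker)
    return reply.strip()


decoded = (
    "system text\n<start_of_turn>user: Hi<end_of_turn>\n"
    "<start_of_turn>model: Namaste!<end_of_turn>"
)
print(extract_model_reply(decoded))
```

For well-formed output this behaves like the original `split` chain; the only difference is the empty-string fallback.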

In [None]:
system_prompt = (
    "You are Gemma2, a helpful, conversational AI assistant. You are an expert in Hindi, colloquial Hinglish and English. "
    "You respond to users in a clear, and concise manner in the language of the user query. \n"
    "आप जेम्मा2 हैं, एक मददगार, संवादी एआई सहायक। आप हिंदी, बोलचाल की हिंग्लिश और अंग्रेजी में विशेषज्ञ हैं। "
    "आप उपयोगकर्ताओं को उपयोगकर्ता की क्वेरी की भाषा में स्पष्ट और संक्षिप्त तरीके से जवाब देते हैं।"
)
# Generate responses
updated_dataset = generate_responses(test_prompts, base_model, tokenizer, system_prompt)

In [None]:
updated_dataset

### RAG Testing Along with Few-Shot Prompting

We tested RAG on a historical question: why Diwali is celebrated.

In [None]:
system_prompt = """You are Gemma2, a helpful, conversational AI assistant integrated with a Retrieval-Augmented Generation (RAG) system.
You are an expert in Hindi, colloquial Hinglish, and English. When responding to user queries, you:
- Retrieve relevant information from the integrated knowledge base or external sources when needed.
- Provide clear, concise, and accurate responses in the language of the user query."""

retrieved_info = """Retrieved information:
- Diwali is celebrated to commemorate the return of Lord Rama to Ayodhya after a 14-year exile, during which he defeated Ravana.
- It symbolizes the victory of light over darkness and good over evil.
- Source: Indian Mythology Knowledge Base"""

# Prepare the input
user_input = "<start_of_turn>user: Why is diwali celebrated<end_of_turn>"
model_output = "<start_of_turn>model: "
combined_input = system_prompt + "\n" +user_input + "\n" + retrieved_info + "\n" + model_output



inputs = tokenizer(combined_input, return_tensors="pt").to(base_model.device)

generated_ids = base_model.generate(**inputs,
                              max_new_tokens=500,
                              do_sample=True,
                              temperature=1,
                              top_p=0.95,
                              top_k=50,
                              repetition_penalty=1.0)

print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])

In [None]:
system_prompt = """You are Gemma2, a helpful, conversational AI assistant. You are an expert in Hindi, colloquial Hinglish, and English. When responding to user queries, you'll provide clear, concise, and accurate responses based on "Retrieved Information" in the language of the user query."""

retrieved_info = """Retrieved information:
- Diwali is celebrated to commemorate the return of Lord Rama to Ayodhya after a 14-year exile, during which he defeated Ravana.
- It symbolizes the victory of light over darkness and good over evil."""

# Prepare the input
user_input = "<start_of_turn>user: Why is diwali celebrated<end_of_turn>"
model_output = "<start_of_turn>model: "
combined_input = system_prompt + "\n" +user_input + "\n" + retrieved_info + "\n" + model_output



inputs = tokenizer(combined_input, return_tensors="pt").to(base_model.device)

generated_ids = base_model.generate(**inputs,
                              max_new_tokens=500,
                              do_sample=True,
                              temperature=1,
                              top_p=0.95,
                              top_k=50,
                              repetition_penalty=1.0)

print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])

In [None]:
system_prompt = """You are Gemma2, a helpful, conversational AI assistant. You are an expert in Hindi, colloquial Hinglish, and English. When responding to user queries, you'll provide clear, concise, and accurate responses based on "Retrieved Information" in the language of the user query. \n आप Gemma2 हैं, एक सहायक, बातचीत करने वाली AI सहायक। आप हिंदी, आम बोलचाल की हिंग्लिश और अंग्रेज़ी में विशेषज्ञ हैं। उपयोगकर्ता की क्वेरी का उत्तर 'Retrieved Information' के आधार पर स्पष्ट, संक्षिप्त और उपयोगकर्ता की क्वेरी की भाषा में दें।"""

retrieved_info = """Retrieved information:
- Diwali is celebrated to commemorate the return of Lord Rama to Ayodhya after a 14-year exile, during which he defeated Ravana.
- It symbolizes the victory of light over darkness and good over evil."""

# Prepare the input
user_input = "<start_of_turn>user: Why is diwali celebrated<end_of_turn>"
model_output = "<start_of_turn>model: "
combined_input = system_prompt + "\n" +user_input + "\n" + retrieved_info + "\n" + model_output



inputs = tokenizer(combined_input, return_tensors="pt").to(base_model.device)

generated_ids = base_model.generate(**inputs,
                              max_new_tokens=500,
                              do_sample=True,
                              temperature=1,
                              top_p=0.95,
                              top_k=50,
                              repetition_penalty=1.0)

print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])

> NOTE: As you can see, the RAG response in English is quite accurate and true to the context, with some extra information added.

---

In [None]:
system_prompt = """You are Gemma2, a helpful, conversational AI assistant. You are an expert in Hindi, colloquial Hinglish, and English. Answer the user in clear concise and manner in the language of the user query. You will answer the user question based on the information only"""

# Prepare the input
user_input = """<start_of_turn>user: दीवाली क्यों मनाई जाती है? Answer - "दीवाली मनाई जाती है भगवान राम की अयोध्या वापसी की स्मृति में, जो 14 वर्षों के वनवास के बाद हुई, इस दौरान उन्होंने रावण का वध किया। यह अंधकार पर प्रकाश और बुराई पर अच्छाई की विजय का प्रतीक है। स्रोत: भारतीय पौराणिक ज्ञान आधार" <end_of_turn>"""
model_output = "<start_of_turn>model: "
combined_input = system_prompt + "\n" +user_input + "\n" + model_output


inputs = tokenizer(combined_input, return_tensors="pt").to(base_model.device)

generated_ids = base_model.generate(**inputs,
                              max_new_tokens=2048,
                              do_sample=True,
                              temperature=1,
                              top_p=0.95,
                              top_k=50,
                              repetition_penalty=1.0)

print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])

In [None]:
system_prompt = """ You will answer "user" query based on the information only. \nInformation - "दीवाली मनाई जाती है भगवान राम की अयोध्या वापसी की स्मृति में, जो 14 वर्षों के वनवास के बाद हुई, इस दौरान उन्होंने रावण का वध किया। यह अंधकार पर प्रकाश और बुराई पर अच्छाई की विजय का प्रतीक है। स्रोत: भारतीय पौराणिक ज्ञान आधार" """

# Input Preparation
user_input = """<start_of_turn>user: "Diwali kyu manai jaati hay?"<end_of_turn>"""
model_output = "<start_of_turn>model: "

# Combine Input for RAG
combined_input = system_prompt + "\n" + user_input + "\n" + model_output


inputs = tokenizer(combined_input, return_tensors="pt").to(base_model.device)

generated_ids = base_model.generate(**inputs,
                              max_new_tokens=2048,
                              do_sample=True,
                              temperature=1,
                              top_p=0.95,
                              top_k=50,
                              repetition_penalty=1.5)

print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])

For Hindi and code-mixed (Hinglish) queries, the responses were only loosely true to the context and not very accurate. The model added extra content that was sometimes good and sometimes garbled; it stayed close to the topic but did not align with the Retrieved Information.

---

We tried many keywords and prompts and finally settled on a few-shot prompt, with the retrieved information placed inside the user turn, before the `<end_of_turn>` token.
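As a rough sketch of that layout, the prompt can be assembled by a small function. `build_rag_prompt` is a hypothetical helper name, and the example strings are placeholders; the turn markers and the "Answer in short" wording follow the cells in this section.

```python
def build_rag_prompt(system_prompt: str, question: str, retrieved: str) -> str:
    """Assemble a prompt with the retrieved text inside the user turn."""
    user_turn = (
        f'<start_of_turn>user: Answer in short - "{question}" \n'
        f"Retrieved Information - '{retrieved}' \n"
        "<end_of_turn>"
    )
    # The prompt ends with an open model turn for the model to complete.
    model_turn = "<start_of_turn>model: "
    return system_prompt + "\n" + user_turn + "\n" + model_turn


prompt = build_rag_prompt(
    system_prompt="Answer based on the 'Retrieved Information' only.",
    question="What is the capital of France?",
    retrieved="The capital of France is Paris.",
)
print(prompt)
```

Keeping the retrieved text inside the user turn means the model sees it as part of the question rather than as loose text between turns.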

It is also advisable to keep the temperature low for RAG applications.
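To see why a lower temperature keeps answers closer to the most likely tokens, here is a minimal, self-contained sketch of temperature-scaled softmax. The logits are made-up numbers for three candidate tokens, not values taken from the model.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # illustrative scores only
for t in (1.0, 0.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At the lower temperature, more probability mass moves to the top-scoring token, so sampling drifts less often into unlikely continuations.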

In [None]:
system_prompt = """
You are Gemma2, a helpful, conversational AI assistant with Retrieval-Augmented Generation capabilities.
You are an expert in Hindi, colloquial Hinglish, and English. Respond to the user in a clear, concise manner in the language of the query.
Always base your answers solely on the 'Retrieved Information.' Avoid producing unnecessary output or adding extra context.

Analyze these Examples:

Example 1: Hindi
User: "चंद्रग्रहण क्या है?"
Retrieved Information: 'चंद्रग्रहण तब होता है जब चंद्रमा पृथ्वी की छाया में प्रवेश करता है। यह पूर्ण और आंशिक हो सकता है। स्रोत: खगोल विज्ञान ज्ञान आधार'
Model: "चंद्रग्रहण तब होता है जब चंद्रमा पृथ्वी की छाया में आता है।"

Example 2: Hinglish
User: "What is the meaning of aurora borealis?"
Retrieved Information: 'Aurora Borealis, also known as the Northern Lights, is a natural light display in Earth's sky, predominantly seen in high-latitude regions. Source: Encyclopedia of Natural Phenomena'
Model: "Aurora Borealis is the Northern Lights seen in high-latitude regions."

Example 3: English
User: "What is the capital of France?"
Retrieved Information: 'The capital of France is Paris. Source: World Geography Database'
Model: "The capital of France is Paris."

Example 4: Hinglish
User: "Volcano kya hota hai?"
Retrieved Information: 'A volcano is an opening in Earth's surface where molten rock, ash, and gases erupt. It forms mountains over time. Source: Geological Facts'
Model: "Volcano ek opening hai jahan se molten rock aur gases erupt karte hain."

Example 5: Hindi
User: "भारत का राष्ट्रीय पक्षी कौन सा है?"
Retrieved Information: 'भारत का राष्ट्रीय पक्षी मोर है। स्रोत: भारतीय ज्ञान कोश'
Model: "भारत का राष्ट्रीय पक्षी मोर है।"

Now answer the user question based on the 'Retrieved Information' only.
"""

# Retrieval-Augmented Input
rag = """Retrieved Information - 'दीवाली मनाई जाती है भगवान राम की अयोध्या वापसी की स्मृति में, जो 14 वर्षों के वनवास के बाद हुई, इस दौरान उन्होंने रावण का वध किया। यह अंधकार पर प्रकाश और बुराई पर अच्छाई की विजय का प्रतीक है। स्रोत: भारतीय पौराणिक ज्ञान आधार'"""

# User Input
user_input = f"""<start_of_turn>user: Answer in short - "दीवाली क्यों मनाई जाती है?" \n{rag} \n<end_of_turn>"""

# Model Output Placeholder
model_output = "<start_of_turn>model: "

# Combine Input for RAG
combined_input = system_prompt + "\n" + user_input + "\n" + model_output


inputs = tokenizer(combined_input, return_tensors="pt").to(base_model.device)

generated_ids = base_model.generate(**inputs,
                              max_new_tokens=500,
                              do_sample=True,
                              temperature=1,
                              top_p=0.95,
                              top_k=50,
                              repetition_penalty=1)

print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])

---

You can see the difference when we reduce the temperature from 1 to 0.5 (this run also uses only the first three examples).

In [None]:
system_prompt = """
You are Gemma2, a helpful, conversational AI assistant with Retrieval-Augmented Generation capabilities.
You are an expert in Hindi, colloquial Hinglish, and English. Respond to the user in a clear, concise manner in the language of the query.
Always base your answers solely on the 'Retrieved Information.' Avoid producing unnecessary output or adding extra context.

Analyze these Examples:

Example 1: Hindi
User: "चंद्रग्रहण क्या है?"
Retrieved Information: 'चंद्रग्रहण तब होता है जब चंद्रमा पृथ्वी की छाया में प्रवेश करता है। यह पूर्ण और आंशिक हो सकता है। स्रोत: खगोल विज्ञान ज्ञान आधार'
Model: "चंद्रग्रहण तब होता है जब चंद्रमा पृथ्वी की छाया में आता है।"

Example 2: Hinglish
User: "What is the meaning of aurora borealis?"
Retrieved Information: 'Aurora Borealis, also known as the Northern Lights, is a natural light display in Earth's sky, predominantly seen in high-latitude regions. Source: Encyclopedia of Natural Phenomena'
Model: "Aurora Borealis is the Northern Lights seen in high-latitude regions."

Example 3: English
User: "What is the capital of France?"
Retrieved Information: 'The capital of France is Paris. Source: World Geography Database'
Model: "The capital of France is Paris."

Now answer the user question based on the 'Retrieved Information' only.
"""

# Retrieval-Augmented Input
rag = """Retrieved Information - 'दीवाली मनाई जाती है भगवान राम की अयोध्या वापसी की स्मृति में, जो 14 वर्षों के वनवास के बाद हुई, इस दौरान उन्होंने रावण का वध किया। यह अंधकार पर प्रकाश और बुराई पर अच्छाई की विजय का प्रतीक है। स्रोत: भारतीय पौराणिक ज्ञान आधार'"""

# User Input
user_input = f"""<start_of_turn>user: Answer in short - "दीवाली क्यों मनाई जाती है?" \n{rag} \n<end_of_turn>"""

# Model Output Placeholder
model_output = "<start_of_turn>model: "

# Combine Input for RAG
combined_input = system_prompt + "\n" + user_input + "\n" + model_output


inputs = tokenizer(combined_input, return_tensors="pt").to(base_model.device)

generated_ids = base_model.generate(**inputs,
                              max_new_tokens=500,
                              do_sample=True,
                              temperature=0.5,
                              top_p=0.95,
                              top_k=50,
                              repetition_penalty=1)

print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])

> NOTE: As these examples show, few-shot prompting combined with a lower temperature immediately improved the model's RAG output, bringing it closer to the Retrieved Information.

- This wasn't needed for English RAG, but for Hindi it seems to help the model use the provided information much more efficiently.

- Training the model longer on a broader knowledge base and on QnA datasets could improve the results considerably, especially given how quickly it learnt from a very small number of examples in our last training run.

---

## 7. Conclusion and Findings

**Our Findings**

- As a native Hindi speaker, I can say that although the model has now learnt to answer in Hindi, most of its answers are not accurate, except maybe a few.

- However this is still a drastic difference and change from the original base model.

- This approach of combining several Wikipedia articles with titles as user query and text as model output has proven effective.

- Although the given answers are often wrong, hallucination on Hindi prompts has gone down drastically.

- The model did learn to follow the system prompt and output in the language of the given query if given the right system prompt.

- The model's performance and answer accuracy can be further improved by giving it richer documents on topics such as Chemistry, History, Geography, etc.

- Including chat conversations along with knowledge injection has also shown improvements in a chat like response generation.

- One thing to note: although we trained for about **40 Hrs**, that is the total across all three training runs. Our preferred run was the last one, on the mixed text corpus, which gave us the best results so far. In that run we trained for only **4 Hrs** on a small subset of the data because of resource and time constraints. This suggests that training longer on the mixed corpus would likely have given noticeably better results.

- Furthermore, repeated runs on the same prompt can produce a better response, which makes a multi-run approach such as majority voting a viable option.
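The multi-run idea in the last point could be sketched as follows. This is a minimal illustration that assumes short, factual answers which can be compared by exact match after light normalisation; `majority_vote` is a hypothetical helper, and the sample answers are invented rather than taken from our runs.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer after light normalisation, plus its count."""
    # Strip whitespace and lowercase so trivially different strings match.
    normalised = [a.strip().lower() for a in answers if a.strip()]
    if not normalised:
        return None, 0
    winner, count = Counter(normalised).most_common(1)[0]
    return winner, count


# Pretend these came from four sampled generations for the same prompt.
samples = ["मोर", "मोर ", "हंस", "मोर"]
print(majority_vote(samples))
```

Longer free-form answers rarely match exactly, so they would need a similarity measure instead of plain counting.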

**In conclusion, replicating our final approach can significantly improve a model's ability to learn and answer prompts, chats, and queries in a new language.**

## 8. Learnings



**Understanding LoRA**

LoRA (Low-Rank Adaptation) is a technique to fine-tune large language models efficiently: the pre-trained weights stay frozen, and small low-rank matrices added to selected layers are trained instead. Here's an explanation of the key parameters and their implications:

---

**1. Parameters of LoRA**

**`r` (Rank)**
- **Definition**: The rank of the low-rank decomposition matrix used for parameter adaptation. Lower `r` values mean fewer trainable parameters, making the adaptation more memory-efficient but less expressive.
- **High `r`**: Use when the task requires injecting substantial new knowledge or adapting to a domain that is significantly different from the pre-trained model's domain.
- **Low `r`**: Use when the task involves subtle adaptations or pattern fine-tuning within a domain close to the pre-trained model's scope.
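To make the effect of `r` on size concrete: for one linear layer with a weight of shape `d_out x d_in`, LoRA trains two matrices of shapes `r x d_in` and `d_out x r`. The sketch below counts those parameters; the layer size of 2048 is an illustrative assumption, not a value read from Gemma 2's configuration.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one (d_out x d_in) linear layer."""
    # A has shape (r, d_in) and B has shape (d_out, r).
    return r * (d_in + d_out)


d_in = d_out = 2048  # illustrative layer size only
full = d_in * d_out  # parameters in the frozen weight matrix
for r in (8, 64):
    added = lora_param_count(d_in, d_out, r)
    print(f"r={r}: {added} LoRA params ({added / full:.2%} of the full matrix)")
```

For this layer size, `r=8` adds under 1% of the full matrix's parameters, while `r=64` adds 6.25%.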

---

**`lora_alpha`**
- **Definition**: A scaling factor for the LoRA update. In standard LoRA the update is multiplied by `lora_alpha / r`, so `lora_alpha` should be chosen together with `r`.
- **High `lora_alpha`**: Amplifies the contribution of the LoRA layers. Useful when large-scale domain shifts or high-impact adaptations are required.
- **Low `lora_alpha`**: Reduces the LoRA layers' influence, ensuring minimal disturbance to the pre-trained parameters. Suitable for fine-tuning in similar domains or tasks requiring subtle behavior changes.
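In standard LoRA the low-rank update is multiplied by `lora_alpha / r` before it is added to the frozen weight, so the two settings interact. PEFT's `use_rslora=True` option switches this to `lora_alpha / sqrt(r)` (rank-stabilized LoRA). The small sketch below compares the two; `lora_scaling` is a hypothetical helper written for illustration, not a PEFT function.

```python
import math


def lora_scaling(lora_alpha: float, r: int, use_rslora: bool = False) -> float:
    """Factor that multiplies the low-rank update before it is added to W."""
    if use_rslora:
        return lora_alpha / math.sqrt(r)  # rank-stabilized scaling
    return lora_alpha / r                 # standard LoRA scaling


for r in (8, 64):
    print(r, lora_scaling(128, r), lora_scaling(128, r, use_rslora=True))
```

With a fixed `lora_alpha`, raising `r` shrinks the standard factor quickly, which is one reason `lora_alpha` is often increased along with `r`.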

---

**`use_rslora` (Rank-Stabilized LoRA)**
- **Definition**: A variant of LoRA that scales the update by `lora_alpha / sqrt(r)` instead of `lora_alpha / r`, which keeps the size of the update stable as the rank grows.
- **When to use**: Mainly with higher ranks, where the standard scaling shrinks the update and can slow learning.

---

**2. Target Modules**

**What are target modules?**
- These are the parts of the model where LoRA applies low-rank updates. Common target modules include:
  - `q_proj`: Query projections (attention heads).
  - `k_proj`: Key projections.
  - `v_proj`: Value projections.
  - `o_proj`: Output projections.
  - `gate_proj`, `up_proj`, `down_proj`: Parts of feed-forward networks.

**Effect of including specific target modules**:
- **Attention modules (`q_proj`, `k_proj`, `v_proj`, `o_proj`)**:
  - **Focus**: LoRA changes how the model attends to information.
  - **When to use**: If the task requires significant changes in how the model interprets input relationships or context.

- **Feed-forward network modules (`gate_proj`, `up_proj`, `down_proj`)**:
  - **Focus**: LoRA adjusts the transformation and interpretation of features.
  - **When to use**: If the task relies on complex transformations or domain-specific feature engineering.

- **Broader module inclusion**: Increases the model's ability to adapt but requires more memory and computational resources. It may also risk overfitting if the dataset is small.

**Excluding target modules**:
- Limits the scope of adaptation, preserving more of the pre-trained knowledge. This can be ideal for fine-tuning on tasks requiring minimal domain shifts.

---

**3. Modules to Save**
- **Definition**: Modules other than the LoRA layers that are trained in full and saved together with the LoRA parameters in the adapter checkpoint, so the model's modified state is preserved for deployment.
- **Impact**:
  - Saving modules like `embed_tokens` and `lm_head` ensures that task-specific embeddings or outputs are retained.
  - Including broader modules increases the ability to deploy the model for specific tasks but requires careful consideration to avoid saving unnecessary changes.

---

**4. Choosing Parameter Values**

**For Knowledge Injection**:
- **Purpose**: Add domain-specific knowledge or train the model for a substantially new task.
- **Recommended Settings**:
  - **`r`**: Higher (e.g., 32–64) to increase flexibility.
  - **`lora_alpha`**: Higher (e.g., 128–256) for stronger influence.
  - **Target Modules**: Include a wide range, such as all attention and feed-forward modules.
  - **Modules to Save**: Save embeddings, heads, and any adapted layers.

**For Pattern Fine-Tuning**:
- **Purpose**: Adjust the model for small-scale adaptations or subtle domain shifts.
- **Recommended Settings**:
  - **`r`**: Lower (e.g., 4–16) for efficiency.
  - **`lora_alpha`**: Lower (e.g., 16–64) to ensure subtle updates.
  - **Target Modules**: Focus on essential modules like `q_proj`, `v_proj`, and `o_proj`.
  - **Modules to Save**: Minimal, often just embeddings or heads.
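The two recipes above can be written down as starting-point settings. This is only a sketch: `suggest_lora_settings` is a hypothetical helper, the specific numbers are picked from the ranges listed above, and the empty `modules_to_save` for pattern tuning is our assumption for the "minimal" case.

```python
ATTENTION = ["q_proj", "k_proj", "v_proj", "o_proj"]
FEED_FORWARD = ["gate_proj", "up_proj", "down_proj"]


def suggest_lora_settings(goal: str) -> dict:
    """Return starting-point LoRA settings for 'knowledge' or 'pattern' tuning."""
    if goal == "knowledge":
        return {
            "r": 64,            # higher rank for more flexibility
            "lora_alpha": 128,  # stronger influence of the update
            "target_modules": ATTENTION + FEED_FORWARD,
            "modules_to_save": ["embed_tokens", "lm_head"],
        }
    if goal == "pattern":
        return {
            "r": 8,             # lower rank for efficiency
            "lora_alpha": 16,   # subtle updates
            "target_modules": ["q_proj", "v_proj", "o_proj"],
            "modules_to_save": [],  # assumption: nothing extra is trained
        }
    raise ValueError(f"unknown goal: {goal!r}")


print(suggest_lora_settings("pattern"))
```

The keys mirror argument names of PEFT's `LoraConfig`, so the dictionary is meant to be passed along as keyword arguments, though that step is not exercised here.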

---

**5. Practical Considerations**
- **Dataset Size**:
  - Small datasets benefit from fewer target modules and lower `r` to avoid overfitting.
  - Large datasets can leverage higher `r` and broader target modules for richer adaptation.
  
- **Task Complexity**:
  - Complex tasks or significant domain shifts require higher `r` and broader module inclusion.
  - Simple tasks or minor shifts work well with limited adaptations.

By carefully tuning these parameters based on the task and dataset, you can achieve efficient and effective model fine-tuning using LoRA.

## 9. Summary

This project has been a great learning milestone for us. We worked through many problems, errors, and model misbehaviours, and learnt a lot about dataset curation, model parameters, fine-tuning, and LLM training.

It was the result of one week of meticulous experimentation, research, and learning many things from scratch, especially about efficient training.

We sincerely thank Google for this opportunity!

If we managed to add to your knowledge, please give us an upvote. It took us a lot of effort, trial, and error to get this far.