# **Fine-Tuning Gemma 2 with LoRA for Hindi**

# Introduction

Language models like Gemma 2 are transformative tools in the world of natural language processing (NLP), enabling applications ranging from text generation to language translation and sentiment analysis. However, these models are often trained on predominantly English datasets, which can lead to underrepresentation of other languages and cultural nuances.

In this notebook, we aim to address this gap by fine-tuning Gemma 2 for Hindi, a language spoken by over 600 million people worldwide and rich in cultural and linguistic diversity. By fine-tuning the model for Hindi, we empower communities to access NLP technologies tailored to their language, opening doors to enhanced communication, learning, and innovation.

This notebook is designed with clarity and replicability in mind, ensuring that anyone, regardless of their expertise, can follow along and adapt the process for their own language or context.

# Setup and Initialization

In this section, we prepare the environment by installing the necessary libraries and importing essential modules to fine-tune Gemma 2 for Hindi. This step ensures that all dependencies are correctly installed, and the environment is configured for seamless execution of the notebook.

**1. Install Dependencies**

* We install packages such as **transformers**, **datasets**, **accelerate**, **peft**, and others that are essential for working with large language models and fine-tuning tasks.
* The **bitsandbytes** library is used to enable memory-efficient computations, crucial for handling large models.
* Experiment tracking with **wandb** (Weights & Biases) is optional and is not installed here; reporting is disabled in the training configuration (report_to="none").

In [1]:
%%capture
%pip install -U transformers datasets accelerate peft trl bitsandbytes

**2. Import Libraries**
* Key libraries for loading and fine-tuning the model (**transformers**, **peft**, **trl**) are imported.
* Supporting libraries like **torch**, **bitsandbytes**, and **datasets** are also loaded to streamline model training and data handling.

In [2]:
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)

from peft import (
    LoraConfig,
    PeftModel,
    prepare_model_for_kbit_training,
    get_peft_model,
)

import os
import torch
import bitsandbytes as bnb

from datasets import load_dataset
from trl import SFTTrainer, SFTConfig, setup_chat_format

**3. Define Base Variables**
* We define the base model (**google/gemma-2-2b-it**, loaded here from a local Kaggle input path), the dataset to be used for fine-tuning (**maharnab/hindi_instruct**), and the name of the new fine-tuned model (**Gemma-2-2b-it-hindi**).

In [3]:
base_model = "/kaggle/input/gemma-2/transformers/gemma-2-2b-it/2/"
dataset_name = "maharnab/hindi_instruct"
new_model = "Gemma-2-2b-it-hindi"

# Hardware Setup and Model Configuration

In this section, we ensure that the hardware and model configurations are optimized for fine-tuning Gemma 2 using techniques like Quantized LoRA (QLoRA) and Flash Attention.

**1. Check CUDA Device Capability**

* The CUDA device capability is checked to determine whether the hardware supports advanced features like Flash Attention v2, which requires compute capability 8.0 or higher.
* If the GPU qualifies, Flash Attention v2 is installed and selected, and computations use bfloat16. Otherwise, the default eager attention is used with float16.

In [4]:
# Check CUDA device capability and set appropriate configurations
# Flash Attention v2 requires CUDA device capability >= 8.0
if torch.cuda.get_device_capability()[0] >= 8:
    # Install Flash Attention if capability allows
    !pip install -qqq flash-attn
    torch_dtype = torch.bfloat16               # Use bfloat16 precision for better performance on supported hardware
    attn_implementation = "flash_attention_2"  # Use Flash Attention v2
else:
    torch_dtype = torch.float16    # Use float16 for older hardware
    attn_implementation = "eager"  # Default attention implementation
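The branch above can be isolated in a small pure function, which makes the decision testable on a machine without a GPU. This is only a sketch: `pick_precision` is a hypothetical helper that is not part of the notebook, and it returns dtype names as strings instead of torch dtypes.

```python
def pick_precision(major: int) -> tuple:
    """Map a CUDA compute capability major version to (dtype name, attention implementation)."""
    if major >= 8:
        # Compute capability 8.0+ supports bfloat16 and Flash Attention v2
        return "bfloat16", "flash_attention_2"
    # Older GPUs fall back to float16 and the default attention
    return "float16", "eager"


# Example: a T4 reports capability 7.5, an A100 reports 8.0
print(pick_precision(7))
print(pick_precision(8))
```

In the notebook the major version would come from `torch.cuda.get_device_capability()[0]`.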

**2. Configure Quantized LoRA (QLoRA)**

* We use **BitsAndBytesConfig** to enable 4-bit quantization, which allows efficient loading of large models without compromising much on performance.
* Additional settings refine this: the **nf4** quantization type is designed for normally distributed weights, and double quantization also quantizes the quantization constants, which saves further memory.

In [5]:
# Configuration for Quantized LoRA (QLoRA)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,                   # Enable 4-bit quantization for efficient model loading
    bnb_4bit_quant_type="nf4",           # Use NormalFloat4 (NF4) quantization
    bnb_4bit_compute_dtype=torch_dtype,  # Set computation precision based on hardware support
    bnb_4bit_use_double_quant=True       # Also quantize the quantization constants to save memory
)
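To make the memory effect of these settings concrete, here is a rough back-of-the-envelope estimate. It is a sketch under stated assumptions: the block sizes (64 weights per block, 256 scales per second-level block) follow the values described in the QLoRA paper, the parameter count of 2.6 billion is an approximation, and `bits_per_param` is a hypothetical helper. Embeddings and other non-linear-layer weights are not quantized in practice, so real usage will be higher.

```python
def bits_per_param(block_size=64, double_quant=True, second_block=256):
    """Approximate storage cost per weight for block-wise 4-bit quantization."""
    weight_bits = 4.0
    if double_quant:
        # One 8-bit scale per block, plus one fp32 constant per group of scales
        overhead = 8 / block_size + 32 / (block_size * second_block)
    else:
        # One fp32 scale per block
        overhead = 32 / block_size
    return weight_bits + overhead


n_params = 2.6e9  # assumed approximate parameter count

for dq in (False, True):
    gib = n_params * bits_per_param(double_quant=dq) / 8 / 2**30
    print(f"double_quant={dq}: {bits_per_param(double_quant=dq):.3f} bits/param, about {gib:.2f} GiB")

fp16_gib = n_params * 16 / 8 / 2**30
print(f"fp16 baseline: about {fp16_gib:.2f} GiB")
```

The gain from double quantization is small per weight (well under one bit), which is why it is best described as a memory saving and not as an accuracy improvement.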

**3. Load Pretrained Model and Tokenizer**

* The causal language model (**Gemma 2**) is loaded with the specified quantization and attention configurations.
* The tokenizer corresponding to the base model is loaded, ensuring compatibility with the fine-tuning process.

In [6]:
# Load the pretrained causal language model with quantization configuration
model = AutoModelForCausalLM.from_pretrained(
    base_model,                              # The base model identifier or path
    quantization_config=quantization_config,          # Apply QLoRA configuration
    device_map="auto",                       # Automatically map model to available devices
    attn_implementation=attn_implementation  # Set attention implementation
)

# Load the tokenizer corresponding to the pretrained model
tokenizer = AutoTokenizer.from_pretrained(
    base_model,             # The base model identifier or path
    trust_remote_code=True  # Trust custom tokenizer code if provided by the model
)

# Configuring LoRA for Fine-Tuning

**1. Identify Linear Layers**

* Define a helper function to locate all 4-bit linear layers in the model, which are suitable for LoRA fine-tuning.
* Exclude the **lm_head** layer to focus on trainable components.

In [7]:
def find_all_linear_names(model):
    """
    This function searches for all linear layers of the 4-bit format 
    in a given model and returns their names, excluding the 'lm_head' 
    module if present.

    Args:
    - model: The model to search for linear layers in.

    Returns:
    - List of module names associated with linear layers.
    """
    # The target class for linear layers (4-bit format)
    cls = bnb.nn.Linear4bit
    lora_module_names = set()  # Set to hold the unique names of the target linear modules

    # Iterate over all named modules in the model
    for name, module in model.named_modules():
        # Check if the module is of the target class
        if isinstance(module, cls):
            names = name.split('.')  # Split the module name by dots to isolate components
            # Add the first or last part of the name (depending on the structure) to the set
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])

    # Remove 'lm_head' if present in the set (needed for 16-bit models)
    if 'lm_head' in lora_module_names:
        lora_module_names.remove('lm_head')

    # Return the list of linear module names
    return list(lora_module_names)

# Get the list of linear module names in the model
modules = find_all_linear_names(model)
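The name handling in the helper can be illustrated without loading a model. The sketch below applies the same "keep the last dotted component, drop lm_head" rule to a hand-written list of module names; the names and the `last_components` function are illustrative assumptions, not output from the real model.

```python
def last_components(module_names, exclude=("lm_head",)):
    """Collect the unique final component of each dotted module name."""
    names = {name.split(".")[-1] for name in module_names}
    return sorted(names - set(exclude))


# Hypothetical module names resembling a decoder-only transformer
example_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.1.self_attn.q_proj",
    "model.layers.0.mlp.gate_proj",
    "lm_head",
]

print(last_components(example_names))
```

Because only the final component is kept, one entry such as q_proj targets that projection in every decoder layer.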

**2. Set Up LoRA Configuration**

* Configure LoRA with parameters like rank (r), scaling factor (lora_alpha), and dropout rate (lora_dropout).
* Specify the identified linear layers as target modules for fine-tuning.

In [8]:
# LoRA configuration setup
peft_config = LoraConfig(
    r=16,                    # Rank of the low-rank update matrices
    lora_alpha=32,           # Scaling numerator (updates are scaled by lora_alpha / r)
    lora_dropout=0.05,       # Dropout rate applied to the LoRA branch
    bias="none",             # Do not train bias parameters
    task_type="CAUSAL_LM",   # Task type for causal language modeling
    target_modules=modules   # The list of target modules (linear layers)
)

**3. Prepare Tokenizer and Chat Format, and Apply LoRA to the Model**

* Set the tokenizer's padding side and reset the chat template to avoid conflicts. Adapt the tokenizer and model for a conversational format using the setup_chat_format utility.
* Integrate the LoRA configuration into the model, enabling efficient fine-tuning of the selected components.

In [9]:
# Set the padding side for the tokenizer (important for certain models)
tokenizer.padding_side = 'right'

# Reset chat template to ensure no leftover settings
tokenizer.chat_template = None

# Setup chat format with the model and tokenizer
model, tokenizer = setup_chat_format(model, tokenizer)

# Apply LoRA configurations to the model
model = get_peft_model(model, peft_config)

The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`
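To put the LoRA settings in perspective, the following sketch counts the trainable parameters that rank 16 adds to a single projection. The dimensions (2304 inputs, 2048 outputs) are taken from the q_proj layer shown in the printed model later in this notebook; `lora_param_count` is a hypothetical helper, and the scaling formula assumes standard LoRA scaling of lora_alpha / r.

```python
def lora_param_count(in_features, out_features, r):
    """Parameters in the two low-rank matrices A (in x r) and B (r x out)."""
    return in_features * r + r * out_features


in_features, out_features = 2304, 2048  # q_proj shape from the printed model
r, lora_alpha = 16, 32

full = in_features * out_features
lora = lora_param_count(in_features, out_features, r)
scaling = lora_alpha / r

print(f"frozen weights in q_proj : {full:,}")
print(f"trainable LoRA weights   : {lora:,} ({lora / full:.2%} of the layer)")
print(f"update scaling factor    : {scaling}")
```

Only the small A and B matrices receive gradients; the 4-bit base weights stay frozen.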


# Preparing and Loading the Dataset

In this section, we load and preprocess multiple datasets related to various topics in Hindi, such as art, culture, history, cooking, health, and more. The datasets are then combined into a single unified dataset, formatted appropriately for training, and split into training and test sets.

**1. Loading the Datasets**

* **OdiaGenAI/instruction_set_hindi_1035**: Contains instructions and responses related to art, culture, history, cooking, environment, music, and sports.
* **SherryT997/HelpSteer-hindi**: Focuses on general question-answering conversations.
* **kaifahmad/indian-history-hindi-QA-3.4k**: Includes questions and answers specifically about Indian history.
* **OdiaGenAI/health_hindi_200**: Features 200 rows of health-related data.

In [11]:
import json
from datasets import load_dataset


# Load datasets
instruct_hindi = load_dataset("OdiaGenAI/instruction_set_hindi_1035", split='all')
conv_hindi = load_dataset("SherryT997/HelpSteer-hindi", split='all')
history_hindi = load_dataset("kaifahmad/indian-history-hindi-QA-3.4k", split='all')
health_hindi = load_dataset("OdiaGenAI/health_hindi_200", split='all')

Generating train split: 1035 examples (combined_instruction_hindi_1035.json)

Generating train split: 2937 examples (output.jsonl)

Generating train split: 3468 examples (train-00000-of-00001.parquet)

Generating train split: 204 examples (health_hindi.json)

**2. Preprocessing the Datasets**

* **Instructional and QA datasets** are reformatted into a consistent structure with "instruction", "input", and "output" keys.
* **Conversation datasets** (like **HelpSteer-hindi**) are reformatted by extracting conversational turns between "human" and "gpt" from the data.

In [12]:
# Function to reformat datasets
def reformat_data(dataset, instruction_key, output_key):
    reformatted = []
    for entry in dataset:
        reformatted.append({
            "instruction": "You are a helpful assistant.",
            "input": entry[instruction_key],
            "output": entry[output_key]
        })
    return reformatted

# Function to reformat conversation datasets
def reformat_conversation_data(dataset):
    reformatted = []
    for conversation in dataset['conversations']:
        for i in range(len(conversation) - 1):  # Iterate over all conversation turns
            if conversation[i]['from'] == 'human' and conversation[i + 1]['from'] == 'gpt':
                reformatted.append({
                    "instruction": "You are a helpful assistant.",
                    "input": conversation[i]['value'],      # Human input
                    "output": conversation[i + 1]['value']  # GPT's response
                })
    return reformatted

# Reformat each dataset
instruct_data = reformat_data(instruct_hindi, "Instruction", "Output")
health_data = reformat_data(health_hindi, "Instruction", "Output")
history_data = reformat_data(history_hindi, "Question", "Answer")
conversation_data = reformat_conversation_data(conv_hindi)
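The pairing rule for conversations is easiest to see on a toy example. The sketch below applies the same "human turn immediately followed by a gpt turn" check to a hand-written conversation; the data and the `pair_turns` helper are made up for illustration and do not come from HelpSteer-hindi.

```python
def pair_turns(conversation):
    """Pair each human turn with the gpt turn that directly follows it."""
    pairs = []
    for prev, nxt in zip(conversation, conversation[1:]):
        if prev["from"] == "human" and nxt["from"] == "gpt":
            pairs.append({
                "instruction": "You are a helpful assistant.",
                "input": prev["value"],
                "output": nxt["value"],
            })
    return pairs


# Toy conversation: the second human turn has no direct gpt reply
toy = [
    {"from": "human", "value": "Q1"},
    {"from": "gpt", "value": "A1"},
    {"from": "human", "value": "Q2"},
    {"from": "human", "value": "Q3"},
    {"from": "gpt", "value": "A3"},
]

for pair in pair_turns(toy):
    print(pair["input"], "->", pair["output"])
```

Human turns without an immediate reply (Q2 here) are dropped, and each pair is stored without the earlier turns, so multi-turn context is not carried into the training rows.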

**3. Combining the Datasets**

Once reformatted, the datasets are combined into a single dataset that encompasses a variety of domains and question-answering tasks. This unified dataset is then saved as a JSONL file.

In [13]:
# Combine all datasets
combined_data = instruct_data + health_data + history_data + conversation_data

# Write to JSONL file
output_file = "hindi_dataset.jsonl"
with open(output_file, "w", encoding="utf-8") as f:
    for entry in combined_data:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")

print(f"Combined dataset saved to {output_file}")

Combined dataset saved to hindi_dataset.jsonl
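A quick round trip is a cheap way to confirm that the JSONL file preserves Devanagari text. The sketch below writes one made-up record to a temporary file and reads it back; the record and file name are illustrative assumptions.

```python
import json
import os
import tempfile

# One illustrative record in the same shape as the combined dataset
records = [{
    "instruction": "You are a helpful assistant.",
    "input": "नमस्ते",
    "output": "नमस्ते! मैं आपकी कैसे मदद कर सकता हूँ?",
}]

path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")

# Write one JSON object per line, keeping non-ASCII characters readable
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Read the file back and parse each non-empty line
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(loaded == records)
print(json.dumps("नमस्ते"))                      # escaped form (default)
print(json.dumps("नमस्ते", ensure_ascii=False))  # readable form
```

Both forms parse back to the same string; ensure_ascii=False only keeps the file human-readable and smaller.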


**4. Shuffling the Dataset**

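The combined dataset is reloaded from the Hugging Face Hub under dataset_name (**maharnab/hindi_instruct**), which hosts the combined hindi_dataset.jsonl file, and is then shuffled with a fixed seed so that the order is random but reproducible. The dataset can also be truncated to a smaller subset for quicker experimentation, although this notebook uses all rows.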

In [14]:
# Importing the dataset and shuffling it for randomness
dataset = load_dataset(dataset_name, split="all")  # Load the entire dataset
dataset = dataset.shuffle(seed=3117)               # Shuffle the dataset with a fixed seed for reproducibility

Generating train split: 7644 examples (hindi_dataset.jsonl)

**5. Formatting the Dataset for Chat-based Tasks**

Each row of the dataset is formatted into a chat-like structure using the tokenizer's chat template. The format includes a "system" message (instruction), a "user" message (input question), and an "assistant" message (output answer). The result is stored as plain text in a "text" field; tokenization happens later inside the trainer.

In [15]:
def format_chat_template(row):
    """
    This function formats each row of the dataset into a chat-like structure 
    and applies the chat template for tokenization.

    Args:
    - row: The current dataset row, containing 'instruction', 'input', and 'output'.

    Returns:
    - The updated row with a 'text' field containing the formatted chat template.
    """
    # Construct a JSON-like structure for the chat conversation (system, user, assistant)
    row_json = [
        {"role": "system", "content": row["instruction"]},  # System message: the instruction
        {"role": "user", "content": row["input"]},          # User message: the input question
        {"role": "assistant", "content": row["output"]}     # Assistant message: the model's response
    ]
    # Apply the tokenizer to format the row using the chat template without tokenizing
    row["text"] = tokenizer.apply_chat_template(row_json, tokenize=False)
    return row

# Apply the chat formatting to the entire dataset using multiple processes (num_proc=4 for parallelism)
dataset = dataset.map(format_chat_template, num_proc=4)

Map (num_proc=4):   0%|          | 0/7644 [00:00<?, ? examples/s]
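For readers who want to see what the "text" field roughly looks like, the sketch below renders messages in the ChatML layout that setup_chat_format installs. It is a hand-written approximation: `render_chatml` is a hypothetical function, and the real template lives in the tokenizer and may differ in details.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render a list of {role, content} messages in a ChatML-style layout."""
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here
        text += "<|im_start|>assistant\n"
    return text


row = {
    "instruction": "You are a helpful assistant.",
    "input": "भारत की राजधानी क्या है?",
    "output": "भारत की राजधानी नई दिल्ली है।",
}

messages = [
    {"role": "system", "content": row["instruction"]},
    {"role": "user", "content": row["input"]},
    {"role": "assistant", "content": row["output"]},
]

print(render_chatml(messages))
```

During training the full conversation is rendered; at inference time the assistant message is omitted and the generation prompt is appended instead.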

**6. Splitting the Dataset**

Finally, the dataset is split into training (90%) and test (10%) sets, which will be used for fine-tuning and evaluating the model, respectively.

In [16]:
# Split the dataset into training and test sets (90% train, 10% test)
dataset = dataset.train_test_split(test_size=0.1)

# Setting Hyperparameters and Training the Model

In this section, we configure and initialize the training process using the **SFTTrainer** class, setting essential hyperparameters and training configurations for fine-tuning the model. This setup ensures an efficient and controlled training process.

**1. Trainer Initialization**

* The **SFTTrainer** class is initialized with the model, tokenizer, training and evaluation datasets, and LoRA configuration. This class handles the entire training and evaluation workflow.

**2. Hyperparameters Configuration**

The hyperparameters define how the training process will be carried out:

* **Batch Size**: Both training and evaluation batches are set to 1 for memory efficiency.
* **Gradient Accumulation**: Gradients are accumulated over 2 steps, giving an effective batch size of 2.
* **Optimizer**: The paged_adamw_32bit optimizer is used; it can page optimizer state between GPU and CPU memory to absorb memory spikes.
* **Epochs**: The model is trained for 1 epoch.
* **Learning Rate**: A learning rate of 0.0002 is used, with 30 warmup steps.
* **Logging & Evaluation**: Training loss is logged every 10 steps, and the model is evaluated five times per epoch (eval_steps is one fifth of the optimizer steps).
* **Saving Models**: Checkpoint saving is disabled during training (save_strategy="no"); the adapter is saved once after training finishes.

In [17]:
# Setting Hyperparameters for Training
trainer = SFTTrainer(
    model=model,                     # The model to be trained
    processing_class=tokenizer,      # The tokenizer used for data processing
    train_dataset=dataset["train"],  # Training dataset
    eval_dataset=dataset["test"],    # Evaluation dataset
    peft_config=peft_config,         # LoRA configuration for model adaptation
    args=SFTConfig(
        output_dir=new_model,                                   # Directory where the trained model will be saved
        per_device_train_batch_size=1,                          # Batch size for training
        per_device_eval_batch_size=1,                           # Batch size for evaluation
        gradient_accumulation_steps=2,                          # Number of steps for gradient accumulation
        optim="paged_adamw_32bit",                              # Optimizer type for training
        num_train_epochs=1,                                     # Number of training epochs
        eval_strategy="steps",                                  # Evaluation strategy during training
        eval_steps=int(len(dataset["train"]) // (1 * 2) // 5),  # Frequency of evaluation in steps
        logging_steps=10,                                       # Frequency of logging during training
        warmup_steps=30,                                        # Number of steps for learning rate warmup
        logging_strategy="steps",                               # Logging strategy to use (log every 'steps' steps)
        learning_rate=0.0002,                                   # Learning rate for training
        save_steps=0,                                           # Frequency of saving the model in steps
        save_total_limit=0,                                     # Maximum number of saved models to keep
        save_strategy="no",                                     # Disable checkpoint saving
        max_seq_length=512,                                     # Maximum sequence length for input data
        fp16=True,                                              # Enable mixed precision (16-bit floating point) for training
        bf16=False,                                             # Disable bfloat16 (use fp16 instead)
        group_by_length=True,                                   # Group data by length for more efficient batching
        report_to="none",                                       # No external reporting (like to wandb)
        dataset_text_field="text",                              # Field name for dataset text input
        packing=False,                                          # Disable packing of sequences for batching
        load_best_model_at_end=False,                           # Do not load the best model after training
        seed=3117,                                              # Set the random seed for reproducibility
    ),
)

Train dataset: 6879 examples (converted to ChatML, chat template applied, tokenized, truncated)

Eval dataset: 765 examples (converted to ChatML, chat template applied, tokenized, truncated)

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
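The step counts implied by this configuration can be checked with simple arithmetic. The sketch below assumes a single GPU, that the test split size is rounded up, and that a partial final accumulation window is dropped; these assumptions are chosen because they reproduce the split sizes and the step count reported in the logs.

```python
import math

n_examples = 7644       # rows in the combined dataset (from the loading logs)
test_size = 0.1
per_device_batch = 1
grad_accum = 2

n_test = math.ceil(n_examples * test_size)      # assumption: test split rounds up
n_train = n_examples - n_test
effective_batch = per_device_batch * grad_accum
optimizer_steps = n_train // effective_batch    # assumption: leftover samples are dropped
eval_steps = n_train // (per_device_batch * grad_accum) // 5
eval_points = list(range(eval_steps, optimizer_steps + 1, eval_steps))

print("train / test rows :", n_train, "/", n_test)
print("optimizer steps   :", optimizer_steps)
print("eval every        :", eval_steps, "steps")
print("evaluations at    :", eval_points)
```

With these settings the 30 warmup steps cover less than 1% of the run.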


**3. Cache Management**

Caching is disabled during training to avoid excessive memory usage, ensuring smooth operation on limited hardware.

In [18]:
# Disable caching during training to avoid memory issues
model.config.use_cache = False

**4. Model Training**

Finally, the training process is initiated with the **train()** method, which uses the configured settings to fine-tune the model.

In [19]:
# Start training the model
trainer.train()

| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 687  | 1.6142        | 1.708609        |
| 1374 | 1.6519        | 1.647516        |
| 2061 | 1.5956        | 1.611051        |
| 2748 | 1.3651        | 1.571476        |
| 3435 | 1.4964        | 1.552956        |


Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.


TrainOutput(global_step=3439, training_loss=1.6087598509482084, metrics={'train_runtime': 5147.7268, 'train_samples_per_second': 1.336, 'train_steps_per_second': 0.668, 'total_flos': 1.811041434301133e+16, 'train_loss': 1.6087598509482084})
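One way to read the validation losses is to convert them to perplexity, assuming the reported loss is the mean per-token cross-entropy in nats (the usual convention for causal language model training). The sketch below only uses the numbers reported above.

```python
import math

# Validation loss at each evaluation step, copied from the training output
val_losses = {
    687: 1.708609,
    1374: 1.647516,
    2061: 1.611051,
    2748: 1.571476,
    3435: 1.552956,
}

# Perplexity is the exponential of the mean cross-entropy loss
perplexities = {step: math.exp(loss) for step, loss in val_losses.items()}

for step, ppl in perplexities.items():
    print(f"step {step:>4}: loss {val_losses[step]:.4f} -> perplexity {ppl:.2f}")
```

The validation loss falls at every evaluation, which suggests the single epoch had not yet saturated on this dataset.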

In [20]:
# Re-enable cache after training
model.config.use_cache = True

# Saving the Fine-Tuned Adapter Model

Save the fine-tuned adapter model locally for future use and deployment.

In [21]:
# Save the trained model to the specified directory
trainer.model.save_pretrained(new_model)



# Inference and Generating Responses

This section focuses on loading the fine-tuned model, configuring it for inference, and generating responses to user queries. The workflow involves preparing the model and tokenizer, formatting the input, and decoding the model's output to generate meaningful answers.

**1. Clear CUDA Cache**

Before inference, the CUDA memory cache is cleared to optimize GPU memory usage and prevent memory-related issues.

In [22]:
# Clear the CUDA memory cache.
torch.cuda.empty_cache()

**2. Set Up the Model for Inference**

Load the fine-tuned model with 4-bit quantization and integrate it with the base model:
* **Quantization Configuration**: Applies 4-bit quantization to optimize memory and computational efficiency.
* **Model Loading**: Loads the base model and fine-tuned weights, setting it to evaluation mode for inference.

In [23]:
# Define the path to the fine-tuned model
new_model_path = "/kaggle/working/Gemma-2-2b-it-hindi"

# Configuration for 4-bit quantization to optimize model performance and memory usage
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # Enable 4-bit quantization for efficient loading
    bnb_4bit_quant_type="nf4",             # Use NormalFloat4 (NF4) quantization type for better accuracy
    bnb_4bit_compute_dtype=torch.float16,  # Use 16-bit floating-point precision for computations
    bnb_4bit_use_double_quant=True         # Also quantize the quantization constants to save extra memory
)

# Load the base model with QLoRA (Quantized LoRA) configuration
model = AutoModelForCausalLM.from_pretrained(
    base_model,                              # Path or identifier of the base model
    quantization_config=quantization_config, # Apply the quantization configuration
    attn_implementation="eager",             # Set attention mechanism implementation to "eager"
    torch_dtype=torch.float16,               # Use 16-bit floating-point precision for weights and activations
    return_dict=True,                        # Return outputs as a dictionary for better readability
    device_map="auto"                        # Automatically map model components to available devices
)

**3. Prepare the Tokenizer**

The tokenizer is initialized to process inputs and generate outputs in a chat format. Any previous configurations are reset to avoid interference with new tasks.

In [24]:
# Load the tokenizer for the base model
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Reset the chat template to ensure no stale settings interfere with new tasks
tokenizer.chat_template = None

# Configure the model and tokenizer for chat-based interactions
model, tokenizer = setup_chat_format(model, tokenizer)

# Load the fine-tuned model with PeftModel, applying it to the base model
model = PeftModel.from_pretrained(model, new_model_path)

# Set the model to evaluation mode to prepare for inference
model.eval()

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): Gemma2ForCausalLM(
      (model): Gemma2Model(
        (embed_tokens): Embedding(256002, 2304, padding_idx=0)
        (layers): ModuleList(
          (0-25): 26 x Gemma2DecoderLayer(
            (self_attn): Gemma2Attention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=2304, out_features=2048, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=2304, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=2048, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
        

**4. Define the Conversation and Create a Prompt**

Format the conversation history into a structured prompt using the tokenizer’s chat template. This ensures the model receives well-structured input.

In [30]:
# Define the conversation history as a list of messages
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "जीवन की क्या परिभाषा हो सकती है?"},
]

# Apply the tokenizer's chat template to format the messages for the model
# Set tokenize=False to avoid tokenization at this point, and add_generation_prompt=True to prepare for generation
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Tokenize the prompt and prepare the inputs for the model
inputs = tokenizer(prompt, return_tensors='pt', padding=True, truncation=True).to("cuda")

**5. Generate a Response**

Use the model to generate a response, applying sampling techniques for diverse and high-quality outputs:

* **Max Length**: Caps the total sequence length (prompt plus response) at 256 tokens.
* **Top-K Sampling**: Considers only the 50 most likely tokens at each step.
* **Nucleus Sampling (Top-P)**: Restricts sampling to the smallest set of tokens whose cumulative probability reaches 85%.
* **Temperature**: A low value (0.3) makes the output more deterministic.
* **No Repetition**: Prevents repetitive phrases by disallowing repeated 3-grams.
* **Beam Search**: Runs 20 beams; combined with sampling, this becomes beam-search multinomial sampling.

In [34]:
# Optimized text generation with custom sampling strategies for better results
outputs = model.generate(
    **inputs,                # Feed the tokenized inputs to the model
    max_length=256,          # Limit the total length (prompt + generated text) to 256 tokens
    num_return_sequences=1,  # Only return one sequence of text
    top_k=50,                # Limit the sampling pool to the top 50 tokens
    top_p=0.85,              # Use nucleus sampling with a cumulative probability of 85% (more deterministic output)
    temperature=0.3,         # Lower temperature for more deterministic (less random) responses
    no_repeat_ngram_size=3,  # Prevent repeating n-grams of size 3 (e.g., "the the the")
    do_sample=True,          # Enable sampling for more diverse outputs (as opposed to greedy decoding)
    num_beams=20             # This parameter controls the number of beams used during beam search.
)
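The interaction between temperature, top-k, and top-p can be sketched in plain Python. This is a simplified illustration and not the library's implementation: `filter_probs` is a hypothetical function that ignores beams and repetition penalties and works on a tiny made-up logit vector.

```python
import math


def filter_probs(logits, temperature=0.3, top_k=50, top_p=0.85):
    """Return {token_index: probability} after temperature, top-k, then top-p filtering."""
    # 1. Temperature: dividing logits by a value below 1 sharpens the distribution
    scaled = [x / temperature for x in logits]

    # 2. Top-k: keep the k highest-scoring tokens
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]

    # Softmax over the surviving tokens (subtract the max for numerical stability)
    peak = scaled[order[0]]
    exps = [math.exp(scaled[i] - peak) for i in order]
    total = sum(exps)
    probs = [e / total for e in exps]

    # 3. Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for index, prob in zip(order, probs):
        kept.append((index, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1
    norm = sum(prob for _, prob in kept)
    return {index: prob / norm for index, prob in kept}


toy_logits = [2.0, 1.0, 0.0, -1.0]
print(filter_probs(toy_logits, temperature=1.0, top_k=3))
print(filter_probs(toy_logits, temperature=0.3, top_k=3))
```

On this toy input, temperature 1.0 leaves two candidate tokens while temperature 0.3 leaves only the top one, which is why low temperatures behave almost like greedy decoding.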

**6. Decode and Extract the Response**

Decode the generated output into human-readable text, cleaning unnecessary parts to extract the final response.

In [35]:
# Keep only the newly generated tokens by skipping the prompt tokens at the start of the sequence
prompt_length = inputs["input_ids"].shape[-1]
generated_tokens = outputs[0][prompt_length:]

# Decode the new tokens to text, skipping special tokens like padding and EOS markers.
# Slicing by prompt length avoids splitting on the word "assistant", which can also appear inside the response.
response = tokenizer.decode(generated_tokens, skip_special_tokens=True).strip()

print(response)

किसी जीव के जीवन का वर्णन करने के लिए कई अलग-अलग शब्दों का उपयोग किया जा सकता है। कुछ सामान्य शब्दों में शामिल हैं:

- जीवन
- जीव
- प्राणी
- जानवर
- पशु
- वन्यजीव
- मनुष्य
- मानव
- व्यक्ति
- शिशु
- बच्चा
- युवक
- महिला
- पुरुष
- बुजुर्ग
- वृद्ध

यह ध्यान रखना महत्वपूर्ण है कि इनमें से प्रत्येक शब्द का एक अलग अर्थ हो सकता है, और इसका उपयोग किसी विशिष्ट संदर्भ या परिस्थिति के आधार पर किया जाता है। उदाहरण के लिए, "जीव" शब्द का उपयोग अक्सर प्राकृतिक दुनिया में पाए जाने वाले सभी जीवों के बारे में कहा जाता है, जबकि "मनुष्य" का उपयोग केवल मानव जाति के सदस्यों को संकेत करता है। इसके अतिरिक्त, "शिशु", "बच्चा" और "युवक" शब्दों
