**Environment Setup for Fine-Tuning and Optimizing Large Language Models**
The goal of this setup is to prepare a robust and efficient environment for fine-tuning and deploying large language models (LLMs). This involves installing tools to streamline hardware management, memory optimization, and parameter-efficient fine-tuning. By using libraries like **PEFT**, **Accelerate**, and **BitsAndBytes**, we ensure that fine-tuning large-scale models is resource-efficient and manageable, even on limited hardware.

Additionally, the **Transformers** library provides access to state-of-the-art pre-trained models, while the **Datasets** library simplifies data loading and preprocessing. Installing the development versions of Transformers and PEFT directly from GitHub provides the most recent features, updates, and bug fixes.



In [1]:
!pip install peft  # PEFT: parameter-efficient fine-tuning of large models
!pip install accelerate  # Accelerate: device placement and distributed training
!pip install bitsandbytes  # bitsandbytes: 8-bit and 4-bit quantization
!pip install datasets  # Hugging Face Datasets: dataset loading and preprocessing
!pip install -U git+https://github.com/huggingface/transformers.git  # Latest development version of Transformers from Hugging Face's GitHub
!pip install -U git+https://github.com/huggingface/peft.git  # Latest development version of PEFT from Hugging Face's GitHub


Collecting bitsandbytes
  Downloading bitsandbytes-0.45.0-py3-none-manylinux_2_24_x86_64.whl.metadata (2.9 kB)
Downloading bitsandbytes-0.45.0-py3-none-manylinux_2_24_x86_64.whl (69.1 MB)
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.45.0
Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)

**Parameter-Efficient Fine-Tuning with LoRA**

Importing Libraries:

* Transformers: Provides tools for working with pre-trained language models like GPT or BERT.
* PEFT (LoRA): Reduces the number of trainable parameters to make fine-tuning more efficient.
* Datasets: Enables structured and efficient data handling for model training.
* Torch: The deep learning framework for training and fine-tuning models.
* NumPy: Used for numerical operations and to set random seeds for reproducibility.
* pandas: Reads the CSV files and builds the formatted input text.
* os: Sets environment variables used during training.
* Typing: For cleaner, well-documented code through type hints.


Random seeds are fixed for both NumPy and PyTorch so that the code produces consistent results across multiple runs.



In [2]:
import os
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments, Trainer, get_scheduler
from datasets import Dataset
import pandas as pd
import numpy as np
from peft import LoraConfig, get_peft_model, TaskType
from typing import List, Dict

# Ensure reproducibility
np.random.seed(42)
torch.manual_seed(42)




<torch._C.Generator at 0x7a2fd824e430>

The *calculate_question_difficulty* function estimates the difficulty of a question using simple heuristics based on:

* Question length: Longer questions are assumed to be more difficult.
* Presence of complex keywords: Certain words (e.g., "analyze," "evaluate") add difficulty.
* Presence of technical terms: Words related to technical or theoretical topics contribute to difficulty.

The function returns a **difficulty score** as a float.



In [3]:
def calculate_question_difficulty(text: str) -> float:
    """
    Calculate question difficulty based on various heuristics.
    """
    # Simple heuristics for difficulty scoring
    difficulty_score = 0

    # Length-based complexity
    difficulty_score += len(text.split()) * 0.01

    # Keyword-based complexity
    complex_keywords = ['analyze', 'evaluate', 'explain', 'compare', 'contrast', 'predict']
    difficulty_score += sum(word in text.lower() for word in complex_keywords) * 0.5

    # Number of technical terms (can be expanded)
    technical_terms = ['algorithm', 'theory', 'principle', 'methodology']
    difficulty_score += sum(term in text.lower() for term in technical_terms) * 0.3

    return difficulty_score
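
The keyword checks above use substring matching, so a keyword also counts when it appears inside a longer word (for example, "explain" inside "unexplained"). As a minimal sketch of an alternative, assuming whole-word matching is the intended behavior, the variant below tokenizes the text first. The name whole_word_difficulty and the module-level keyword sets are illustrative and not part of the original notebook.

```python
import re

# Same keyword lists and weights as calculate_question_difficulty.
COMPLEX_KEYWORDS = {"analyze", "evaluate", "explain", "compare", "contrast", "predict"}
TECHNICAL_TERMS = {"algorithm", "theory", "principle", "methodology"}


def whole_word_difficulty(text: str) -> float:
    """Hypothetical variant that only counts keywords appearing as whole words."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    score = len(text.split()) * 0.01                 # length-based term
    score += len(words & COMPLEX_KEYWORDS) * 0.5     # each keyword counts once
    score += len(words & TECHNICAL_TERMS) * 0.3      # each technical term counts once
    return score


print(whole_word_difficulty("Explain the algorithm"))
print(whole_word_difficulty("This remains unexplained"))
```

The first call scores three words plus one keyword and one technical term (0.03 + 0.5 + 0.3). The second scores only its length, whereas substring matching would also add 0.5 for "explain".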

The ***prepare_data*** function preprocesses a text dataset for language-model fine-tuning. It formats each question, its options, and its answer into a structured text prompt, tokenizes the prompts with a Hugging Face tokenizer, and returns a Hugging Face Dataset object.
* Data Loading: The function reads a CSV file and expects the columns prompt (the question), A, B, C, D, E (the multiple-choice options), and answer (the correct option label).
* Formatting: It combines the question, choices, and answer into a single structured text string.
* Tokenization: The dataset is tokenized with the provided Hugging Face tokenizer, truncating to max_length=512 and padding every example to that length so inputs are uniform.
* Label Assignment: The input_ids (tokenized input) are copied as labels for model training (e.g., language modeling tasks).

**Output:** A tokenized Hugging Face Dataset ready for downstream tasks such as model training or evaluation.




In [21]:
def prepare_data(data_path: str, tokenizer) -> Dataset:
    """
    Load and preprocess the dataset.
    """
    df = pd.read_csv(data_path, encoding="ISO-8859-1")

    # Combine question, options (A-E), and the answer into a formatted input string

    df["input_text"] = df.apply(
        lambda x: f"Question: {x['prompt']}\nA) {x['A']}\nB) {x['B']}\nC) {x['C']}\nD) {x['D']}\nE) {x['E']}\nAnswer: {x['answer']}</s>",
        axis=1
    )

    # Convert to Hugging Face Dataset
    dataset = Dataset.from_pandas(df[["input_text"]])

    # Tokenize the dataset
    def tokenize_function(examples):
        outputs = tokenizer(
            examples["input_text"],
            truncation=True,
            padding='max_length',
            max_length=512,
            return_tensors=None
        )
        outputs["labels"] = outputs["input_ids"].copy()
        return outputs

    tokenized_dataset = dataset.map(
        tokenize_function,
        batched=True,
        remove_columns=dataset.column_names
    )

    return tokenized_dataset
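
To make the prompt layout concrete, the following sketch reproduces the formatting step on a single row without pandas or a tokenizer. The helper name format_example and the sample row are hypothetical; real rows come from the CSV file.

```python
def format_example(row: dict) -> str:
    """Build the training prompt for one multiple-choice row (illustrative helper)."""
    options = "\n".join(f"{letter}) {row[letter]}" for letter in "ABCDE")
    return f"Question: {row['prompt']}\n{options}\nAnswer: {row['answer']}</s>"


# Hypothetical row; the real rows come from the CSV file.
row = {
    "prompt": "Which planet is closest to the Sun?",
    "A": "Mercury", "B": "Venus", "C": "Earth", "D": "Mars", "E": "Jupiter",
    "answer": "A",
}
example_text = format_example(row)
print(example_text)
```

Each example therefore spans seven lines: the question, five options, and the answer followed by the literal </s> marker that the notebook appends as an end-of-sequence string.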

The ***create_curriculum_dataloaders*** function implements curriculum learning by splitting a tokenized dataset into stages based on sequence lengths. Instead of using external difficulty scores, it uses the sequence length (inferred from the attention mask) as a proxy for complexity.

How It Works:
* Sequence Length Calculation: Sequence lengths are calculated from the attention_mask field of the tokenized dataset. Tokens with padding (0s) are excluded by summing up the mask values.
* Sorting by Difficulty: Examples are sorted in ascending order based on sequence length. Shorter sequences are considered "easier" and longer ones "harder."
* Stage Creation: The dataset is divided into num_stages parts, ensuring that earlier stages contain shorter sequences and later stages contain longer ones.

**Output:** Returns a list of datasets, where each element corresponds to one curriculum stage.


In [5]:
def create_curriculum_dataloaders(tokenized_dataset: Dataset, num_stages: int = 3):
    """
    Create curriculum learning stages based on sequence length instead of difficulty score.
    """
    # Use the number of non-padding tokens (sum of the attention mask) as a proxy for difficulty
    sequence_lengths = [sum(attention_mask) for attention_mask in tokenized_dataset['attention_mask']]

    # Sort example indices by sequence length, shortest first
    sorted_indices = sorted(range(len(sequence_lengths)), key=lambda k: sequence_lengths[k])

    # Split into stages
    stage_size = len(sorted_indices) // num_stages
    stages = []

    for i in range(num_stages):
        start_idx = i * stage_size
        end_idx = (i + 1) * stage_size if i < num_stages - 1 else len(sorted_indices)
        stage_indices = sorted_indices[start_idx:end_idx]
        stages.append(tokenized_dataset.select(stage_indices))

    return stages
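
The staging logic can be illustrated without the Datasets library. This is a minimal sketch on toy attention masks, assuming the same sort-then-slice scheme as the function above; split_into_stages is an illustrative name and returns index lists rather than Dataset objects.

```python
def split_into_stages(lengths, num_stages=3):
    """Return lists of example indices, shortest sequences first (illustrative helper)."""
    order = sorted(range(len(lengths)), key=lambda k: lengths[k])
    stage_size = len(order) // num_stages
    stages = []
    for i in range(num_stages):
        start = i * stage_size
        # The last stage absorbs the remainder when the split is uneven.
        end = (i + 1) * stage_size if i < num_stages - 1 else len(order)
        stages.append(order[start:end])
    return stages


# Toy attention masks: 1 marks a real token, 0 marks padding.
attention_masks = [
    [1, 1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1],
]
lengths = [sum(mask) for mask in attention_masks]
stages = split_into_stages(lengths, num_stages=3)
print(lengths)
print(stages)
```

With seven examples and three stages, the first two stages hold two examples each and the last stage holds the remaining three.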

The ***setup_model*** function sets up a large language model for efficient training using 4-bit quantization and LoRA (Low-Rank Adaptation). This reduces memory usage and makes fine-tuning feasible on resource-constrained hardware.

**Key Components:**
* BitsAndBytesConfig: Enables 4-bit quantization with nf4 (4-bit NormalFloat) and double quantization to reduce memory consumption. Computation runs in FP16.
* LoRA: Applies Low-Rank Adaptation to the attention projection layers listed in target_modules (q_proj, v_proj, k_proj, o_proj). Only the small adapter matrices are trained, which reduces the number of trainable parameters while the base weights stay frozen.
* Gradient Checkpointing: Saves memory by discarding intermediate activations in the forward pass and recomputing them during backpropagation.
* Automatic Device Mapping: Uses device_map="auto" to automatically distribute the model across available GPUs or CPUs.


In [6]:
def setup_model(model_name="facebook/opt-1.3b"):
    # Configure training optimizations
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4"
    )

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
        model_name,
        trust_remote_code=True
    )
    if not tokenizer.pad_token:
        tokenizer.pad_token = tokenizer.eos_token

    # Load model with optimizations
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True,
        use_cache=False
    )

    # Configure LoRA
    lora_config = LoraConfig(
        r=8,
        lora_alpha=16,
        target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
        lora_dropout=0.05,
        bias="none",
        task_type=TaskType.CAUSAL_LM,
        inference_mode=False,
    )

    # Prepare model for training
    model.gradient_checkpointing_enable()
    model.enable_input_require_grads()
    model = get_peft_model(model, lora_config)

    # Print trainable parameters
    model.print_trainable_parameters()

    return model, tokenizer
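
For a linear layer with input size d_in and output size d_out, LoRA adds two matrices of shapes r x d_in and d_out x r, which is r * (d_in + d_out) trainable parameters. The sketch below applies this formula under stated assumptions: a hidden size of 2048 and 24 decoder layers for facebook/opt-1.3b (these values are not given in the notebook) and square attention projections. The helper name lora_param_count is illustrative.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer (matrices A and B)."""
    return r * d_in + d_out * r


# Assumed architecture values for facebook/opt-1.3b (not stated in the notebook).
hidden_size = 2048
num_layers = 24
r = 8

per_module = lora_param_count(hidden_size, hidden_size, r)
for adapted_per_layer in (3, 4):
    total = per_module * adapted_per_layer * num_layers
    print(adapted_per_layer, "adapted projections per layer ->", total)

# Share of trainable parameters, using the totals printed in the training log.
logged_trainable = 2_359_296
logged_all = 1_318_117_376
trainable_pct = 100 * logged_trainable / logged_all
print(f"trainable%: {trainable_pct:.4f}")
```

Under these assumptions, the logged count of 2,359,296 trainable parameters corresponds to three adapted projections per layer rather than four. One possible explanation is that OPT names its attention output projection out_proj, so the o_proj entry in target_modules may not match any module; inspecting model.named_modules() would confirm this.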

The ***fine_tune_model*** function fine-tunes a large language model (LLM) in curriculum learning stages using LoRA (Low-Rank Adaptation) and efficient 4-bit quantization. The fine-tuning is conducted progressively through smaller stages based on sequence lengths, making the training process smoother and more efficient.

Workflow:
* Setup: Uses the setup_model function to load a quantized LLM and tokenizer with LoRA adapters, and configures a DataCollatorForLanguageModeling for causal LM tasks (mlm=False).
* Curriculum Learning: The dataset is divided into stages based on sequence lengths. Each stage trains the model progressively, starting with shorter and simpler inputs and moving to longer ones.
* Training: For each curriculum stage, the model is trained on the first 80% of the stage and evaluated on the remaining 20%, with evaluation every 100 steps.

**Key training parameters include:**
* Gradient Accumulation: Gradients are accumulated over 16 steps, giving an effective batch size of 16 per device with a per-device batch size of 1.
* FP16 Training: Reduces memory usage and speeds up computations.
* Gradient Checkpointing: Further reduces memory consumption at the cost of extra computation.

**Final Output:** The fine-tuned model and tokenizer are saved to the specified directory for further use.


In [7]:
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling  # Changed from default_data_collator
)

def fine_tune_model(
    dataset: Dataset,
    output_dir: str = "fine_tuned_model"
) -> tuple:
    """
    Fine-tune the LLM using LoRA and curriculum learning.
    """
    os.environ["WANDB_DISABLED"] = "true"

    # Initialize model and tokenizer
    model, tokenizer = setup_model()

    # Initialize data collator
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=False
    )

    # Training arguments
    training_args = TrainingArguments(
        output_dir=output_dir,
        eval_strategy="steps",
        eval_steps=100,
        learning_rate=2e-4,
        per_device_train_batch_size=1,
        per_device_eval_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=1,
        weight_decay=0.05,
        save_steps=500,
        save_total_limit=2,
        logging_dir="./logs",
        logging_steps=10,
        report_to="none",
        fp16=True,
        warmup_steps=100,
        dataloader_num_workers=0,
        remove_unused_columns=False,  # Changed to False
        gradient_checkpointing=True,
        max_grad_norm=0.3,
        ddp_find_unused_parameters=False
    )

    # Create curriculum stages
    stages = create_curriculum_dataloaders(dataset, num_stages=3)

    # Train through curriculum stages
    for stage_idx, stage_dataset in enumerate(stages):
        print(f"\nTraining on curriculum stage {stage_idx + 1}/{len(stages)}")

        # Split into train and eval
        train_size = int(0.8 * len(stage_dataset))
        train_dataset = stage_dataset.select(range(train_size))
        eval_dataset = stage_dataset.select(range(train_size, len(stage_dataset)))

        # Ensure datasets have the right format
        print("Training dataset features:", train_dataset.features)
        print("Sample training input:", train_dataset[0])

        trainer = Trainer(
            model=model,
            args=training_args,
            train_dataset=train_dataset,
            eval_dataset=eval_dataset,
            data_collator=data_collator,
        )

        trainer.train()
        eval_results = trainer.evaluate()
        print(f"Stage {stage_idx + 1} evaluation results:", eval_results)

    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)

    return model, tokenizer
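
The relationship between dataset size, stage size, and optimizer steps can be checked with a little arithmetic. The sketch below uses the 11,975 examples reported in the run log later in this notebook together with the settings above (three stages, an 80/20 split, batch size 1, 16 accumulation steps). It assumes a single device and that an incomplete final accumulation group is dropped, which is an assumption about Trainer behavior in the version used, not something the notebook states.

```python
num_examples = 11_975        # from the "Map" progress line in the run log
num_stages = 3
grad_accum = 16              # gradient_accumulation_steps
per_device_batch = 1         # per_device_train_batch_size

stage_size = num_examples // num_stages
# The last stage absorbs the remainder, as in create_curriculum_dataloaders.
stage_sizes = [stage_size] * (num_stages - 1) + [num_examples - stage_size * (num_stages - 1)]

summary = []
for size in stage_sizes:
    train_size = int(0.8 * size)
    eval_size = size - train_size
    effective_batch = per_device_batch * grad_accum
    optimizer_steps = train_size // effective_batch   # assumes the partial last group is dropped
    epoch_fraction = optimizer_steps * effective_batch / train_size
    summary.append((size, train_size, eval_size, optimizer_steps, round(epoch_fraction, 4)))
    print(size, train_size, eval_size, optimizer_steps, round(epoch_fraction, 4))
```

The resulting epoch fractions agree with the values reported in the stage evaluation logs (about 0.9975 and 0.9969). With roughly 199 optimizer steps per stage, warmup_steps=100 means about half of each stage runs during learning-rate warmup, because a new Trainer, and with it a new schedule, is created for every stage.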


The ***generate_answer*** function takes a natural language question as input and generates an answer using a fine-tuned causal language model. The prompt follows the Question/Answer layout of the fine-tuning data so that the model continues it with an answer.

**Workflow**:
* Prompt Formatting: The input question is wrapped in the format "Question: <question>" followed by "Answer:" on the next line, so the model recognizes the structure and generates answers appropriately.
* Tokenization: The input prompt is tokenized and moved to the model's device (GPU or CPU).
* Response Generation: The generate function is called with these parameters:
  * max_length: Caps the total sequence length, prompt plus generated tokens.
  * temperature: Scales the logits before sampling; a higher value allows more diverse outputs.
  * top_p: Enables nucleus sampling, which restricts sampling to the most probable tokens whose cumulative probability reaches top_p.
  * do_sample: Activates non-deterministic sampling.
* Post-Processing: The output tokens are decoded into a string, and the function extracts the portion of the response following "Answer:" to isolate the generated answer.




In [8]:
def generate_answer(question: str, model, tokenizer) -> str:
    """
    Generate an answer using the fine-tuned model with format matching your data.
    """
    # Format prompt to match your data format
    prompt = f"""Question: {question}
Answer: """

    # Prepare input
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    # Generate with specific parameters
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_length=512,
            num_return_sequences=1,
            temperature=0.9,  # Increased for more randomness
            top_p=0.95,
            do_sample=True,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id
        )

    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Extract the part after "Answer:"
    answer_part = response.split("Answer:")[-1].strip()
    return answer_part
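
To make the roles of temperature and top_p concrete, here is a minimal pure-Python sketch of temperature scaling followed by nucleus filtering on a toy distribution. It illustrates the idea only; the function name nucleus_probs and the toy logits are assumptions, and the generate implementation in Transformers differs in details such as tie handling and batching.

```python
import math


def nucleus_probs(logits, temperature=0.9, top_p=0.95):
    """Return {token_index: renormalized probability} for the top-p nucleus."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                                  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most probable tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.0, -3.0]
print(nucleus_probs(toy_logits, temperature=1.0, top_p=0.9))
print(nucleus_probs(toy_logits, temperature=1.0, top_p=0.95))
```

Raising top_p from 0.9 to 0.95 admits one more token into the nucleus here, and the least likely token is excluded in both cases.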

The ***extract_selected_option*** function identifies and extracts the selected answer option (A, B, C, D, or E) from the model's generated response. It handles several answer formats, including single letters and formatted options (e.g., "A)"), and falls back to detecting any valid letter.

Workflow:
* Normalization: The model's generated answer is converted to uppercase and stripped of leading/trailing spaces.
* Direct Letter Detection: If the response consists of a single valid letter (A-E), it is returned immediately.
* Formatted Option Matching: The function checks for matches against a list of formatted options (e.g., ['A)', 'B)', 'C)', 'D)', 'E)']). If a match is found, the corresponding letter is extracted.
* Fallback Detection: If no formatted match is found, the function scans the response for the first occurrence of any of the characters A-E, including letters inside words.
* Error Handling: If no valid option can be detected, the function returns "N/A".


In [9]:
def extract_selected_option(generated_answer: str, options: List[str]) -> str:
    """
    Extract the selected option from the generated answer based on your data format.
    """
    # Clean and uppercase the answer
    answer_upper = generated_answer.upper().strip()

    # First check if the answer is just a letter
    if len(answer_upper) == 1 and answer_upper in ['A', 'B', 'C', 'D', 'E']:
        return answer_upper

    # Look for exact matches in your data format (e.g., "A)")
    for option in options:
        if option.upper() in answer_upper:
            return option[0]  # Return just the letter

    # Fallback: look for first occurrence of A, B, C, D, or E
    for char in answer_upper:
        if char in ['A', 'B', 'C', 'D', 'E']:
            return char

    return "N/A"
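
The fallback step above matches any A-E character, including letters inside ordinary words, so free-text answers can be mapped to an option by accident (in the run log later in this notebook, "Kangchenjunga" is mapped to A). As a sketch of a stricter alternative, assuming the model writes the option as a standalone uppercase letter, the hypothetical helper below returns a letter only when it is not part of a longer word. It is not the notebook's method, and it still misfires on text such as a sentence that starts with the article "A".

```python
import re

# An uppercase A-E that is not directly attached to another ASCII letter.
STANDALONE_OPTION = re.compile(r"(?<![A-Za-z])([A-E])(?![A-Za-z])")


def extract_option_strict(generated_answer: str) -> str:
    """Hypothetical stricter variant: first standalone option letter, else "N/A"."""
    match = STANDALONE_OPTION.search(generated_answer)
    return match.group(1) if match else "N/A"


for text in ["C)", "Hong Kong B", "A) Mercury B) Venus C) Mars", "Kangchenjunga"]:
    print(repr(text), "->", extract_option_strict(text))
```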



In [10]:
!pip install accelerate bitsandbytes
!pip install "transformers>=4.34.0"




In [11]:
!huggingface-cli login



    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) Y
Token is valid (permission: write).
The token `MIXTRAL` has been saved to /root/.cache/huggingface/stored_tokens
Cannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authentica

This cell sets up the **Mistral-7B** model for efficient fine-tuning and inference using 4-bit quantization and gradient checkpointing. It uses BitsAndBytes (bnb) quantization to significantly reduce GPU memory requirements.

**Key Features:**
* 4-bit Quantization: Configured with the BitsAndBytesConfig class using nf4 (4-bit NormalFloat). Weights are stored in 4 bits, and FP16 is the compute data type.
* Tokenizer: Loads the fast tokenizer, pads on the right, and reuses the end-of-sequence token as the padding token.
* Memory Optimization: Gradient checkpointing reduces GPU memory usage by recomputing activations during backpropagation, and torch.cuda.empty_cache() releases unused cached GPU memory held by PyTorch.
* Device Mapping: Uses device_map="auto" to distribute the model across available devices automatically.


In [12]:
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling  # Changed from default_data_collator
)
import torch

# Define model name
model_name = "mistralai/Mistral-7B-v0.1"

# Configure quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    padding_side="right",
    use_fast=True,
)
tokenizer.pad_token = tokenizer.eos_token

# Load model with quantization
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    quantization_config=bnb_config
)

# Enable memory optimizations
torch.cuda.empty_cache()
model.gradient_checkpointing_enable()




The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/996 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/25.1k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.94G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/4.54G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]
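
The shard sizes in the download log give a rough way to estimate the memory effect of 4-bit quantization. The sketch below assumes the checkpoint stores 2 bytes per parameter (16-bit weights) and that every parameter is quantized to 4 bits. In practice, quantization constants and layers kept in higher precision add overhead, so the result is better read as a lower bound.

```python
shard_sizes_gb = [9.94, 4.54]      # from the safetensors download lines in the log
bytes_per_param_16bit = 2          # assumption: checkpoint stored in 16-bit precision
bytes_per_param_4bit = 0.5         # 4 bits per parameter

checkpoint_gb = sum(shard_sizes_gb)
params_billion = checkpoint_gb / bytes_per_param_16bit
weights_4bit_gb = params_billion * bytes_per_param_4bit

print(f"checkpoint size: {checkpoint_gb:.2f} GB")
print(f"estimated parameters: {params_billion:.2f} billion")
print(f"estimated 4-bit weight memory: {weights_4bit_gb:.2f} GB")
```

Under these assumptions, the quantized weights take roughly a quarter of the space of the 16-bit checkpoint, before activations, LoRA adapters, and optimizer state are counted.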

In [13]:
import torch
torch.cuda.empty_cache()
model.gradient_checkpointing_enable()


The ***main*** function orchestrates the entire pipeline:

* Data Preparation: Prepares the training dataset for fine-tuning.
* Fine-Tuning: Fine-tunes a causal language model using the preprocessed dataset.
* Answer Generation: Processes test questions and generates answers.
* Result Saving: Saves the results into a CSV file (answers.csv).
* Analysis: Displays the distribution of answers for a quick overview.

**Workflow:**
* Dataset Paths: Hackathon_KB_updated.csv is the dataset for fine-tuning, and Hackathon_Question_set_sample.csv is the test set with questions.
* Fine-Tuning: Calls setup_model and prepare_data to load the tokenizer and preprocess the dataset, then uses fine_tune_model to train the model.
* Answer Generation: Reads the test set CSV, generates answers using generate_answer, and extracts selected options with extract_selected_option.
* Error Handling: Questions with missing or whitespace-only content are skipped and recorded as "N/A". Errors during processing are caught and logged.
* Results: Results are saved to answers.csv with the columns Number (question number), Answer (selected option, e.g., A, B, C), and Generated_Text (full generated output).
* Analysis: The script prints the distribution of answers (A, B, C, D, etc.).


In [22]:
def main():
    # Paths and configurations
    dataset_path = "/content/Hackathon_KB_updated.csv"
    fine_tuned_dir = "fine_tuned_model"

    # Prepare dataset
    print("Preparing dataset...")
    model, tokenizer = setup_model()  # Only the tokenizer is used for data preparation; fine_tune_model() loads the model again
    dataset = prepare_data(dataset_path, tokenizer)

    print("Fine-tuning model...")
    model, tokenizer = fine_tune_model(dataset, output_dir=fine_tuned_dir)

    # Process test questions
    print("Processing test questions...")
    df = pd.read_csv("/content/Hackathon_Question_set_sample.csv")
    df['Question'] = df['Question'].fillna('').astype(str)

    results = []
    for idx, row in df.iterrows():
        question = row['Question']
        if not question or question.isspace():
            results.append({
                "Number": row['Number'],
                "Answer": "N/A",
                "Generated_Text": ""
            })
            continue

        try:
            options = [opt.strip() for opt in question.split() if opt.endswith(")")]
            generated_answer = generate_answer(question, model, tokenizer)
            selected_option = extract_selected_option(generated_answer, options)

            # Print for debugging
            print(f"\nQuestion {row['Number']}:")
            print(f"Generated text: {generated_answer}")
            print(f"Selected option: {selected_option}")

            results.append({
                "Number": row['Number'],
                "Answer": selected_option,
                "Generated_Text": generated_answer
            })

        except Exception as e:
            print(f"Error processing question {row['Number']}: {str(e)}")
            results.append({
                "Number": row['Number'],
                "Answer": "Error",
                "Generated_Text": str(e)
            })

    # Save results with generated text for analysis
    results_df = pd.DataFrame(results)
    results_df.to_csv("answers.csv", index=False)
    print("\nResults saved to answers.csv")

    # Print distribution of answers
    answer_dist = results_df['Answer'].value_counts()
    print("\nDistribution of answers:")
    print(answer_dist)

if __name__ == "__main__":
    main()


Preparing dataset...
trainable params: 2,359,296 || all params: 1,318,117,376 || trainable%: 0.1790


Map:   0%|          | 0/11975 [00:00<?, ? examples/s]

Fine-tuning model...
trainable params: 2,359,296 || all params: 1,318,117,376 || trainable%: 0.1790

Training on curriculum stage 1/3
Training dataset features: {'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None), 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None), 'labels': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}
Sample training input: {'input_ids': [2, 45641, 35, 520, 21, 83, 12, 11127, 4790, 116, 50118, 250, 43, 18069, 50118, 387, 43, 17616, 50118, 347, 43, 19515, 50118, 495, 43, 13466, 50118, 717, 43, 6200, 50118, 33683, 35, 163, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100  | 19.2649       | 1.486166        |


Stage 1 evaluation results: {'eval_loss': 1.4727946519851685, 'eval_runtime': 68.2335, 'eval_samples_per_second': 11.71, 'eval_steps_per_second': 11.71, 'epoch': 0.9974937343358395}

Training on curriculum stage 2/3
Training dataset features: {'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None), 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None), 'labels': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}
Sample training input: {'input_ids': [2, 45641, 35, 590, 61, 10320, 222, 226, 13572, 14800, 261, 1120, 263, 20827, 13572, 14800, 7861, 2364, 4972, 25, 10, 6707, 2650, 869, 116, 50118, 250, 43, 2482, 3275, 1725, 3109, 50118, 387, 43, 21232, 179, 50118, 347, 43, 4150, 4467, 50118, 495, 43, 9171, 50118, 717, 43, 140, 594, 50118, 33683, 35, 211, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 

| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100  | 29.6808       | 1.648024        |


Stage 2 evaluation results: {'eval_loss': 1.6107540130615234, 'eval_runtime': 68.5694, 'eval_samples_per_second': 11.652, 'eval_steps_per_second': 11.652, 'epoch': 0.9974937343358395}

Training on curriculum stage 3/3
Training dataset features: {'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None), 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None), 'labels': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}
Sample training input: {'input_ids': [2, 45641, 35, 3394, 58, 5, 1461, 453, 9, 20, 1211, 229, 5022, 3019, 6402, 927, 3152, 1971, 116, 50118, 250, 43, 16951, 4171, 6, 17557, 8811, 4393, 1794, 6, 8, 12865, 5082, 5961, 4, 50118, 387, 43, 16951, 4171, 6, 17557, 8811, 4393, 1794, 6, 8, 16562, 7999, 17063, 4, 50118, 347, 43, 16951, 4171, 6, 17557, 8811, 4393, 1794, 6, 8, 2206, 16503, 4, 50118, 495, 43, 16951, 4171, 6, 17557, 8811, 4393, 1794, 6, 8, 8205, 12, 22041, 1405, 261, 4, 50118, 717, 43, 16951, 4171, 

| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100  | 22.9428       | 1.19633         |


Stage 3 evaluation results: {'eval_loss': 1.1794451475143433, 'eval_runtime': 69.0454, 'eval_samples_per_second': 11.572, 'eval_steps_per_second': 11.572, 'epoch': 0.9968691296180339}


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Processing test questions...

Question 1.0:
Generated text: Hong Kong B
Selected option: B

Question 2.0:
Generated text: Â

References
Selected option: E

Question 3.0:
Generated text: A
Selected option: A

Question 4.0:
Generated text: Kangchenjunga
Selected option: A

Question 5.0:
Generated text: A 

References
Selected option: A

Question 6.0:
Generated text: Mahabharata Lake – 2,539 kilometers (1,575 mi) – B) Ganges River – 1,788 kilometers (1,120 mi) – D) Nanda Devi – 1,200 kilometers (738 mi) – E) Kailash Range –
Selected option: B

Question 7.0:
Generated text: A) Mercury B) Venus C) Mars
Selected option: A

Results saved to answers.csv

Distribution of answers:
Answer
N/A    19
A       4
B       2
E       1
Name: count, dtype: int64


In [None]:
# Process test questions with the `model` and `tokenizer` objects currently defined in the notebook session
print("Processing test questions...")
df = pd.read_csv("/content/Hackathon_Question_set_HT.csv", encoding="ISO-8859-1")
df['Question'] = df['Question'].fillna('').astype(str)

results = []
for idx, row in df.iterrows():
    question = row['Question']
    if not question or question.isspace():
        results.append({
            "Number": row['Number'],
            "Answer": "N/A",
            "Generated_Text": ""
        })
        continue

    try:
        options = [opt.strip() for opt in question.split() if opt.endswith(")")]
        generated_answer = generate_answer(question, model, tokenizer)
        selected_option = extract_selected_option(generated_answer, options)

        # Print for debugging
        print(f"\nQuestion {row['Number']}:")
        print(f"Generated text: {generated_answer}")
        print(f"Selected option: {selected_option}")

        results.append({
            "Number": row['Number'],
            "Answer": selected_option,
            "Generated_Text": generated_answer
        })

    except Exception as e:
        print(f"Error processing question {row['Number']}: {str(e)}")
        results.append({
            "Number": row['Number'],
            "Answer": "Error",
            "Generated_Text": str(e)
        })

# Save results with generated text for analysis
results_df = pd.DataFrame(results)
results_df.to_csv("answers.csv", index=False)
print("\nResults saved to answers.csv")

# Print distribution of answers
answer_dist = results_df['Answer'].value_counts()
print("\nDistribution of answers:")
print(answer_dist)


Processing test questions...

Question 1:
Generated text: 2 - Nativity

The Nativity of Jesus is an annual Feast that falls on 25 December and celebrates the birth of Jesus Christ. The Greek Orthodox church celebrates the feast of Nativity on the same day (25 December). The Nativity of Christ is celebrated by the Latin Catholic Church and is on the same date as the Orthodox Church (25 December).

For the Western Church, the Feast of St Nicholas is celebrated on 6 December.

'Tinitavyy' is the word 'Tiny Nativity' written backward.
Selected option: A

Question 2:
Generated text: C)
Selected option: C

Question 3:
Generated text:  Marks & Spencer

If you guessed M&S you would be correct! And in that year M&S launched their iconic campaign Make Christmas, Marks & Spencer with music from Coldplay and The Cure. In 2017, M&S is hoping to recreate this success with the latest instalment in their Christmas is for sharing campaign. This Christmas, M&S has once again partnered with The Cure