# IndicBART: Grammar Error Correction for Indian Languages

This notebook implements grammar error correction (GEC) with the multilingual `ai4bharat/IndicBART` model. It is configured below for Hindi, Bengali, Malayalam, Tamil, and Telugu; other languages covered by IndicBART can be added to the configuration.

## Environment Setup Complete

**Packages installed in the virtual environment:**
- **PyTorch 2.8.0+cu129** - PyTorch build with CUDA 12.9 support
- **Transformers 4.56.2** - Hugging Face Transformers library
- **Additional packages**: datasets, evaluate, nltk, pandas, numpy, tqdm

**Hardware detected:**
- **GPU**: NVIDIA GeForce RTX 4050 Laptop GPU (6GB VRAM)
- **CUDA**: Available and working properly

## Features
- Multi-language support using IndicBART
- Unified tokenization approach with `AutoModelForSeq2SeqLM` and `AutoTokenizer`
- Batch processing capabilities
- GLEU score evaluation
- Easy language switching
- GPU acceleration for faster inference
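
The GLEU score listed above can be illustrated with a short, dependency-free sketch. It assumes the Google-BLEU-style definition (the one `nltk.translate.gleu_score` follows): matching n-grams of order 1 to 4 divided by the larger of the hypothesis and reference n-gram totals. The function names and the whitespace tokenization are illustrative choices, not the notebook's evaluation code.

```python
from collections import Counter

def ngram_counts(tokens, min_len=1, max_len=4):
    """Count every n-gram of order min_len..max_len in a token list."""
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def sentence_gleu(reference, hypothesis, min_len=1, max_len=4):
    """Matching n-grams / max(n-grams in reference, n-grams in hypothesis)."""
    # Whitespace tokenization is an assumption made for this sketch
    ref_counts = ngram_counts(reference.split(), min_len, max_len)
    hyp_counts = ngram_counts(hypothesis.split(), min_len, max_len)
    # Counter intersection keeps the minimum count of each shared n-gram
    overlap = sum((ref_counts & hyp_counts).values())
    denom = max(sum(ref_counts.values()), sum(hyp_counts.values()))
    return overlap / denom if denom else 0.0

print(sentence_gleu("the cat sat down", "the cat sat"))  # 0.6
```

Because the denominator is the larger of the two totals, the score equals the minimum of n-gram precision and recall, so it penalizes both over-correction and under-correction.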

## Issues Fixed
- **Unicode encoding error**: Removed the emoji characters that were causing tokenization errors
- **Virtual environment**: All packages are now installed and importable
- **Ready to proceed**: All subsequent cells can now be run

In [None]:
# Virtual Environment Setup - Verification
import sys
import importlib

print(" Environment Verification:")
print(f" Python: {sys.executable}")

# Check virtual environment
in_venv = hasattr(sys, 'real_prefix') or (hasattr(sys, 'base_prefix') and sys.base_prefix != sys.prefix)
print(f" Virtual Environment: {' Active' if in_venv else ' Not active'}")

print("\n Package Status:")

# Test core packages
packages_status = {}

# Test PyTorch
try:
    import torch
    packages_status['torch'] = {
        'status': 'success',
        'version': torch.__version__,
        'cuda': torch.cuda.is_available()
    }
    print(f" PyTorch {torch.__version__}")
    if torch.cuda.is_available():
        print(f"    CUDA: Available - {torch.cuda.get_device_name()}")
        print(f"    GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
    else:
        print(f"    CUDA: Not available (CPU only)")
except Exception as e:
    packages_status['torch'] = {'status': 'error', 'error': str(e)}
    print(f" PyTorch: {str(e)}")

# Test Transformers 
try:
    import transformers
    packages_status['transformers'] = {
        'status': 'success',
        'version': transformers.__version__
    }
    print(f" Transformers {transformers.__version__}")
except Exception as e:
    packages_status['transformers'] = {'status': 'error', 'error': str(e)}
    print(f" Transformers: {str(e)}")

# Test other required packages
other_packages = ['evaluate', 'nltk', 'pandas', 'numpy', 'tqdm']
all_others_ok = True

for pkg in other_packages:
    try:
        module = importlib.import_module(pkg)
        version = getattr(module, '__version__', 'Available')
        print(f" {pkg.capitalize()}: {version}")
        packages_status[pkg] = {'status': 'success', 'version': version}
    except Exception as e:
        print(f" {pkg.capitalize()}: {str(e)}")
        packages_status[pkg] = {'status': 'error', 'error': str(e)}
        all_others_ok = False

# Final status
torch_ok = packages_status.get('torch', {}).get('status') == 'success'
transformers_ok = packages_status.get('transformers', {}).get('status') == 'success'

print(f"\n Final Status:")
if torch_ok and transformers_ok and all_others_ok:
    print(f" SUCCESS! All packages ready in virtual environment!")
    print(f" Ready for IndicBART multi-language grammar correction!")
    
    # Show device info
    if torch_ok:
        import torch
        device = "cuda" if torch.cuda.is_available() else "cpu"
        print(f"  Device: {device.upper()}")
        
elif torch_ok and transformers_ok:
    print(f" Core packages (PyTorch + Transformers) ready!")
    print(f"  Some optional packages may need attention")
    print(f" You can proceed with the notebook")
else:
    missing = []
    if not torch_ok:
        missing.append("PyTorch")
    if not transformers_ok:
        missing.append("Transformers")
    print(f" Missing core packages: {', '.join(missing)}")
    print(f" Please install missing packages before continuing")

# Save status for next cells
globals()['_package_status'] = packages_status
print(f"\n Environment check complete! You can proceed to the next cell.")

In [None]:
# Import libraries - FRESH START after kernel restart
import os
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

print(" Starting fresh imports after kernel restart...")

# Import PyTorch FIRST and verify it's working
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"   CUDA version: {torch.version.cuda}")
    print(f"   Device count: {torch.cuda.device_count()}")
    print(f"   Current device: {torch.cuda.current_device()}")
    print(f"   Device name: {torch.cuda.get_device_name()}")

# Clear any cached transformers modules and import fresh
import sys
transformers_modules = [m for m in sys.modules.keys() if m.startswith('transformers')]
for module in transformers_modules:
    if module in sys.modules:
        del sys.modules[module]

# Now import transformers with PyTorch already loaded
import transformers
print(f"Transformers version: {transformers.__version__}")

# Verify PyTorch is detected by transformers
from transformers.utils import is_torch_available
print(f"PyTorch detected by transformers: {is_torch_available()}")

if not is_torch_available():
    raise ImportError("PyTorch not detected by transformers - please restart kernel")

# Now safe to import the model classes
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from transformers import pipeline, set_seed
print("Model classes imported successfully!")

# Test that the classes are real, not DummyObjects
print(f"   AutoModelForSeq2SeqLM type: {type(AutoModelForSeq2SeqLM)}")
print(f"   AutoTokenizer type: {type(AutoTokenizer)}")

# Additional imports for evaluation
import nltk
from tqdm import tqdm

# Set random seed for reproducibility
set_seed(42)
torch.manual_seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed(42)
    torch.cuda.manual_seed_all(42)

# Device setup
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

if device == "cuda":
    print(f" GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
    print(f" Available Memory: {(torch.cuda.get_device_properties(0).total_memory - torch.cuda.memory_allocated()) / 1024**3:.1f} GB")

print("\n ALL IMPORTS SUCCESSFUL! Ready for IndicBART!")

In [None]:
# Multi-language IndicBART Configuration - CORRECTED MODEL NAMES
class IndicBARTConfig:
    """Configuration class for IndicBART models across different Indian languages"""
    
    def __init__(self):
        # Updated language configurations with correct model paths
        # IndicBART uses a single multilingual model for all Indian languages
        self.language_configs = {
            'hindi': {
                'name': 'Hindi',
                'code': 'hi',
                'model_name': 'ai4bharat/IndicBART',  
                'tokenizer_name': 'ai4bharat/IndicBART',
                'data_folder': 'Hindi',
                'script': 'Devanagari',
                'prefix': 'hi'  
            },
            'bengali': {
                'name': 'Bengali', 
                'code': 'bn',
                'model_name': 'ai4bharat/IndicBART',
                'tokenizer_name': 'ai4bharat/IndicBART',
                'data_folder': 'Bangla',
                'script': 'Bengali',
                'prefix': 'bn'
            },
            'malayalam': {
                'name': 'Malayalam',
                'code': 'ml', 
                'model_name': 'ai4bharat/IndicBART',
                'tokenizer_name': 'ai4bharat/IndicBART',
                'data_folder': 'Malayalam',
                'script': 'Malayalam',
                'prefix': 'ml'
            },
            'tamil': {
                'name': 'Tamil',
                'code': 'ta',
                'model_name': 'ai4bharat/IndicBART',
                'tokenizer_name': 'ai4bharat/IndicBART',
                'data_folder': 'Tamil',
                'script': 'Tamil',
                'prefix': 'ta'
            },
            'telugu': {
                'name': 'Telugu',
                'code': 'te',
                'model_name': 'ai4bharat/IndicBART',
                'tokenizer_name': 'ai4bharat/IndicBART',
                'data_folder': 'Telugu', 
                'script': 'Telugu',
                'prefix': 'te'
            },
        }
    
    def get_config(self, language):
        """Get configuration for a specific language"""
        return self.language_configs.get(language.lower(), None)
    
    def list_languages(self):
        """List all available languages"""
        return list(self.language_configs.keys())

# Initialize configuration
config = IndicBARTConfig()
print("Available languages (using ai4bharat/IndicBART):")
for lang in config.list_languages():
    lang_config = config.get_config(lang)
    print(f"   {lang_config['name']} ({lang_config['code']}) - {lang_config['script']} script")

print(f"\n All languages use the same multilingual model: ai4bharat/IndicBART")
print(f" Language-specific generation controlled by prefixes")
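
The `prefix` values in the configuration above ('hi', 'bn', ...) correspond to IndicBART's language tags. As a minimal sketch, assuming the input format described on the `ai4bharat/IndicBART` model card (source as `sentence </s> <2xx>`, target as `<2xx> sentence </s>`), the strings could be built as follows. The helper names are hypothetical and are not part of the notebook.

```python
def to_indicbart_source(text, lang_code):
    """Encoder input in the assumed model-card format: '<sentence> </s> <2xx>'."""
    return f"{text.strip()} </s> <2{lang_code}>"

def to_indicbart_target(text, lang_code):
    """Decoder target in the assumed model-card format: '<2xx> <sentence> </s>'."""
    return f"<2{lang_code}> {text.strip()} </s>"

# For grammar correction the source and target share one language tag
print(to_indicbart_source("मै आज घर जाऊंगा", "hi"))
print(to_indicbart_target("मैं आज घर जाऊंगा", "hi"))
```

If this format is used, the strings would presumably be tokenized with `add_special_tokens=False` so that the end-of-sentence marker is not added twice. The model card also appears to ask that text in non-Devanagari scripts be mapped to Devanagari first; that step is not shown here.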

In [None]:
# IndicBART Model Manager - Fixed for compatibility
class IndicBARTManager:
    """Manages IndicBART multilingual model for grammar error correction across Indian languages"""
    
    def __init__(self, language='hindi'):
        self.language = language.lower()
        self.config = IndicBARTConfig().get_config(self.language)
        
        if not self.config:
            raise ValueError(f"Language '{language}' not supported. Available: {IndicBARTConfig().list_languages()}")
        
        self.model = None
        self.tokenizer = None
        self.pipeline = None
        
    def load_model(self, force_reload=False):
        """Load the multilingual IndicBART model and tokenizer"""
        if self.model is not None and not force_reload:
            print(f" IndicBART model already loaded for {self.config['name']}")
            return
            
        print(f" Loading IndicBART multilingual model for {self.config['name']}")
        print(f"   Model: {self.config['model_name']}")
        
        try:
            # Load the multilingual IndicBART model (simplified for compatibility)
            self.model = AutoModelForSeq2SeqLM.from_pretrained(
                self.config['model_name'],
                # Use 'dtype' instead of deprecated 'torch_dtype'
                dtype=torch.float16 if device == "cuda" else torch.float32,
                # Remove device_map to avoid accelerate requirement
                low_cpu_mem_usage=True  # Memory optimization
            )
            
            # Load the tokenizer
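            self.tokenizer = AutoTokenizer.from_pretrained(
                self.config['tokenizer_name'],
                # Settings recommended on the IndicBART model card. With the
                # ALBERT-style defaults the tokenizer lowercases text and strips
                # combining marks (for example the Devanagari virama and nukta).
                do_lower_case=False,
                use_fast=False,
                keep_accents=True
            )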
            
            # Manually move model to device
            self.model = self.model.to(device)
            
            print(f"   IndicBART model loaded successfully for {self.config['name']}!")
            print(f"   Model type: {type(self.model).__name__}")
            print(f"   Tokenizer type: {type(self.tokenizer).__name__}")
            print(f"   Vocabulary size: {self.tokenizer.vocab_size}")
            print(f"   Device: {next(self.model.parameters()).device}")
            
            # Check model size
            param_count = sum(p.numel() for p in self.model.parameters())
            print(f"   Parameters: {param_count / 1e6:.1f}M")
            
        except Exception as e:
            print(f" Error loading IndicBART model: {str(e)}")
            raise
    
    def create_pipeline(self):
        """Create a text generation pipeline for the specific language"""
        if self.model is None or self.tokenizer is None:
            self.load_model()
            
        self.pipeline = pipeline(
            "text2text-generation",
            model=self.model,
            tokenizer=self.tokenizer,
            device=0 if device == "cuda" else -1,
            # Use 'dtype' instead of deprecated 'torch_dtype'
            dtype=torch.float16 if device == "cuda" else torch.float32
        )
        print(f" Text generation pipeline created for {self.config['name']}")
        
    def correct_text(self, text, max_length=256, num_beams=4, temperature=0.8):
        """Correct grammar errors in the given text for the specific language"""
        if self.pipeline is None:
            self.create_pipeline()
            
        try:
            # Simplified input format for IndicBART
            # IndicBART is trained for various tasks, try different formats
            input_formats = [
                f"Correct: {text.strip()}",  # Simple correction prompt
                f"{text.strip()}",         
                f"Grammar correction: {text.strip()}"  
            ]
            
            best_result = text  # Fallback to original
            
            for input_text in input_formats:
                try:
                    # Generate correction
                    result = self.pipeline(
                        input_text,
                        max_length=max_length,
                        num_beams=num_beams,
                        temperature=temperature,
                        do_sample=True,
                        early_stopping=True,
                        pad_token_id=self.tokenizer.eos_token_id
                    )
                    
                    corrected_text = result[0]['generated_text'].strip()
                    
                    # Clean up the output if it includes the input
                    for fmt in input_formats:
                        if corrected_text.startswith(fmt):
                            corrected_text = corrected_text[len(fmt):].strip()
                            break
                    
                    # Use the first successful result
                    if corrected_text and corrected_text != input_text:
                        best_result = corrected_text
                        break
                        
                except Exception as e:
                    continue  # Try next format
            
            return best_result
            
        except Exception as e:
            print(f" Error during correction: {str(e)}")
            return text
    
    def batch_correct(self, texts, max_length=256, batch_size=2):
        """Correct multiple texts in batches (reduced batch size for memory)"""
        if self.pipeline is None:
            self.create_pipeline()
            
        corrected_texts = []

        print(f" Processing {len(texts)} texts in batches of {batch_size}...")

        for i in tqdm(range(0, len(texts), batch_size), desc=f"Correcting {self.config['name']} texts"):
            batch = texts[i:i + batch_size]
            
            # Use simple input format for batch processing
            inputs = [f"Correct: {text.strip()}" for text in batch]
            
            try:
                results = self.pipeline(
                    inputs,
                    max_length=max_length,
                    num_beams=2,  # Reduced for memory
                    do_sample=False,  # Deterministic for batch
                    early_stopping=True,
                    pad_token_id=self.tokenizer.eos_token_id
                )
                
                batch_corrections = []
                for result, original_input in zip(results, inputs):
                    corrected = result['generated_text'].strip()
                    
                    # Clean up the output
                    if corrected.startswith(original_input):
                        corrected = corrected[len(original_input):].strip()
                    
                    batch_corrections.append(corrected)
                
                corrected_texts.extend(batch_corrections)
                
            except Exception as e:
                print(f" Error in batch {i//batch_size + 1}: {str(e)}")
                corrected_texts.extend(batch)  # Return original texts on error
                
        return corrected_texts

# Example usage
print("   Fixed IndicBART Manager initialized!")
print("   Compatible model loading without accelerate")
print("   Memory optimized for standard hardware")
print("Available languages:", IndicBARTConfig().list_languages())

In [None]:
# GPU-Optimized IndicBART Model Loading (Accelerate-Compatible)
print(" Loading IndicBART model with GPU optimization...")

# Load model and tokenizer with GPU priority
try:
    print(" Loading ai4bharat/IndicBART...")
    print(f" Target device: {device}")
    
    # Clear GPU memory first
    if device == "cuda":
        import torch
        torch.cuda.empty_cache()
        print(f" GPU memory cleared")
        print(f" Available GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
    
    # Load model first
    print(" Loading model...")
    model = AutoModelForSeq2SeqLM.from_pretrained(
        "ai4bharat/IndicBART",
        dtype=torch.float16 if device == "cuda" else torch.float32,
        device_map="auto" if device == "cuda" else None,
    )
    
    # Load tokenizer
    print(" Loading tokenizer...")
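    tokenizer = AutoTokenizer.from_pretrained(  # expected to resolve to AlbertTokenizer
        "ai4bharat/IndicBART",
        use_fast=False,
        do_lower_case=False,  # model-card setting: do not lowercase
        keep_accents=True,  # model-card setting: keep combining marks (virama, nukta)
        trust_remote_code=True
    )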
    
    print(f"   IndicBART loaded successfully!")
    print(f"   Model: {type(model).__name__}")
    print(f"   Device: {next(model.parameters()).device}")
    print(f"   Data type: {next(model.parameters()).dtype}")
    print(f"   Parameters: {sum(p.numel() for p in model.parameters()) / 1e6:.1f}M")
    print(f"   Tokenizer: {type(tokenizer).__name__}")
    print(f"   Vocab size: {len(tokenizer)}")
    
    if device == "cuda":
        print(f"   GPU memory used: {torch.cuda.memory_allocated() / 1024**3:.1f} GB")
        print(f"   GPU memory cached: {torch.cuda.memory_reserved() / 1024**3:.1f} GB")

    print(f"\n   Testing Hindi grammar correction with proper tokenization:")
    print("=" * 70)
    
    # Test with Hindi examples using corrected tokenization
    test_sentences = [
        "मै आज घर जाऊंगा",  # expected: मैं आज घर जाऊंगा ("I will go home today")
        "वो बहुत अच्छा लड़का हैं",  # expected: वह बहुत अच्छा लड़का है ("He is a very good boy")
        "हमे यह काम करना चाहिए"  # expected: हमें यह काम करना चाहिए ("We should do this work")
    ]
    
    for i, sentence in enumerate(test_sentences, 1):
        print(f"\n Test {i}:")
        print(f"  Original: {sentence}")
        
        try:
            # Fixed tokenization - only return what the model expects
            inputs = tokenizer(
                sentence, 
                return_tensors="pt", 
                padding=True,
                return_token_type_ids=False,  # Don't return token_type_ids
                return_attention_mask=True
            )
            
            # Move inputs to device
            inputs = {k: v.to(model.device) for k, v in inputs.items()}
            
            # Generate with strict parameters
            with torch.no_grad():
                outputs = model.generate(
                    input_ids=inputs['input_ids'],
                    attention_mask=inputs['attention_mask'],
                    max_new_tokens=15,  # Short output
                    # In an encoder-decoder model the output length is independent of the
                    # input length, so only require a non-empty output here
                    min_new_tokens=1,
                    num_beams=2,
                    do_sample=False,
                    early_stopping=True,
                    no_repeat_ngram_size=2,
                    repetition_penalty=1.5,
                    length_penalty=1.0,
                    # Compare against None: a pad id of 0 is valid but falsy
                    pad_token_id=tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id,
                    eos_token_id=tokenizer.eos_token_id
                )
            
            # Decode the output
            decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
            
            print(f"  Generated: {decoded}")
            print(f"  Status: {' Generated' if decoded != sentence else 'Same as input'}")
            
        except Exception as e:
            print(f"   Error: {str(e)}")
    
    # Try simple text-to-text generation with task prompts
    print(f"\n   Testing with task-specific prompts:")
    print("=" * 50)
    
    task_examples = [
        ("Grammar correct: मै आज घर जाऊंगा", "Grammar correction task"),
        ("Fix: वो बहुत अच्छा लड़का हैं", "Simple fix prompt"),
        ("हमे यह काम करना चाहिए", "Direct input")
    ]
    
    for prompt, description in task_examples:
        print(f"\n {description}:")
        print(f"  Input: {prompt}")
        
        try:
            inputs = tokenizer(
                prompt, 
                return_tensors="pt",
                return_token_type_ids=False,
                return_attention_mask=True
            )
            inputs = {k: v.to(model.device) for k, v in inputs.items()}
            
            with torch.no_grad():
                outputs = model.generate(
                    **inputs,
                    max_new_tokens=20,
                    num_beams=2,
                    do_sample=False,
                    temperature=1.0,
                    repetition_penalty=1.3,
                    no_repeat_ngram_size=2,
                    # Compare against None: a pad id of 0 is valid but falsy
                    pad_token_id=tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id
                )
            
            result = tokenizer.decode(outputs[0], skip_special_tokens=True)
            print(f"  Output: {result}")
            
        except Exception as e:
            print(f"   Error: {str(e)}")
    
    print(f"\n IndicBART testing complete!")
    if device == "cuda":
        print(f" Model successfully loaded on GPU with {torch.cuda.memory_allocated() / 1024**3:.1f} GB memory used")
    else:
        print(f" Model successfully loaded on CPU")
    print(f" Ready for grammar correction tasks")
    
    # Set global variables for use in other cells
    globals()['model'] = model
    globals()['tokenizer'] = tokenizer
    
    # Create a SIMPLE correction function
    def correct_hindi_text(text, max_new_tokens=15):
        """Simple function to correct Hindi text"""
        try:
            # Try with task prompt first
            prompt = f"Grammar correct: {text}"
            inputs = tokenizer(
                prompt, 
                return_tensors="pt",
                return_token_type_ids=False
            )
            inputs = {k: v.to(model.device) for k, v in inputs.items()}
            
            with torch.no_grad():
                outputs = model.generate(
                    **inputs,
                    max_new_tokens=max_new_tokens,
                    num_beams=2,
                    do_sample=False,
                    repetition_penalty=1.3,
                    no_repeat_ngram_size=2,
                    # Compare against None: a pad id of 0 is valid but falsy
                    pad_token_id=tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id
                )
            
            result = tokenizer.decode(outputs[0], skip_special_tokens=True)
            
            # Clean the result
            if result.startswith(prompt):
                result = result[len(prompt):].strip()
            
            return result if result else text
            
        except Exception as e:
            print(f"Error in correction: {e}")
            return text
    
    globals()['correct_hindi_text'] = correct_hindi_text
    print(" Helper function 'correct_hindi_text()' ready!")
    print(" Try: correct_hindi_text('मै आज घर जाऊंगा')")
        
except Exception as e:
    print(f" Error loading IndicBART: {str(e)}")
    print(" Please check that all dependencies (sentencepiece, accelerate, protobuf) are installed.")

In [None]:
# Fine-tuning IndicBART for Grammar Error Correction
print(" Setting up IndicBART fine-tuning for grammar error correction")
print("=" * 70)

# Import additional training libraries
from transformers import Trainer, TrainingArguments, DataCollatorForSeq2Seq
from datasets import Dataset
import pandas as pd
from pathlib import Path
import torch.nn as nn

# Set up training parameters
LANGUAGE = 'hindi'  # Change this to train on different languages
MAX_INPUT_LENGTH = 599
MAX_TARGET_LENGTH = 599
BATCH_SIZE = 10
LEARNING_RATE = 1e-5
NUM_EPOCHS = 10
WARMUP_STEPS = 500

print(f"  Training Configuration:")
print(f"   Language: {LANGUAGE}")
print(f"   Max input length: {MAX_INPUT_LENGTH}")
print(f"   Max target length: {MAX_TARGET_LENGTH}")
print(f"   Batch size: {BATCH_SIZE}")
print(f"   Learning rate: {LEARNING_RATE}")
print(f"   Epochs: {NUM_EPOCHS}")
print(f"   Warmup steps: {WARMUP_STEPS}")

# Load and prepare training data
def load_training_data(language='hindi'):
    """Load training data for the specified language"""
    
    # Define data folder mapping
    folder_mapping = {
        'hindi': 'Hindi',
        'bengali': 'Bangla', 
        'malayalam': 'Malayalam',
        'tamil': 'Tamil',
        'telugu': 'Telugu'
    }
    
    data_folder = folder_mapping.get(language, 'Hindi')
    train_file = Path(data_folder) / 'train.csv'
    dev_file = Path(data_folder) / 'dev.csv'
    
    print(f"\n Loading data from {data_folder} folder...")
    
    # Load training data
    if train_file.exists():
        train_df = pd.read_csv(train_file)
        print(f" Training data: {len(train_df)} samples")
        print(f"   Columns: {list(train_df.columns)}")
        
        # Auto-detect columns
        if 'input' in train_df.columns and 'target' in train_df.columns:
            input_col, target_col = 'input', 'target'
        elif 'source' in train_df.columns and 'target' in train_df.columns:
            input_col, target_col = 'source', 'target'
        elif len(train_df.columns) >= 2:
            input_col, target_col = train_df.columns[0], train_df.columns[1]
        else:
            raise ValueError("Could not identify input and target columns")
            
        print(f"   Using: '{input_col}' → '{target_col}'")
        
        # Clean data
        train_df = train_df.dropna(subset=[input_col, target_col])
        train_df[input_col] = train_df[input_col].astype(str).str.strip()
        train_df[target_col] = train_df[target_col].astype(str).str.strip()
        
        # Remove empty rows
        train_df = train_df[(train_df[input_col] != '') & (train_df[target_col] != '')]
        
        print(f"   Cleaned data: {len(train_df)} samples")
        
        # Load dev data if available
        dev_df = None
        if dev_file.exists():
            dev_df = pd.read_csv(dev_file)
            dev_df = dev_df.dropna(subset=[input_col, target_col])
            dev_df[input_col] = dev_df[input_col].astype(str).str.strip()
            dev_df[target_col] = dev_df[target_col].astype(str).str.strip()
            dev_df = dev_df[(dev_df[input_col] != '') & (dev_df[target_col] != '')]
            print(f" Dev data: {len(dev_df)} samples")
        
        return train_df, dev_df, input_col, target_col
        
    else:
        print(f" Training file not found: {train_file}")
        return None, None, None, None

# Load the data
train_df, dev_df, input_col, target_col = load_training_data(LANGUAGE)

if train_df is not None:
    print(f"\n Data Sample:")
    print(f"   Input:  {train_df[input_col].iloc[0]}")
    print(f"   Target: {train_df[target_col].iloc[0]}")
    
    # Show more samples
    print(f"\n First 3 training examples:")
    for i in range(min(3, len(train_df))):
        print(f"   {i+1}. Input:  {train_df[input_col].iloc[i]}")
        print(f"      Target: {train_df[target_col].iloc[i]}")
        print()
else:
    print(" Could not load training data. Please check file paths and formats.")

In [None]:
# Tokenization and Dataset Preparation
print(" Preparing datasets for training...")

def tokenize_function(examples):
    """Tokenize input and target texts"""
    # Tokenize inputs without token_type_ids
    inputs = tokenizer(
        examples['input_text'],
        max_length=MAX_INPUT_LENGTH,
        truncation=True,
        padding=False,
        return_tensors=None,
        return_token_type_ids=False  # Explicitly disable token_type_ids
    )
    
    # Tokenize targets
    targets = tokenizer(
        examples['target_text'],
        max_length=MAX_TARGET_LENGTH,
        truncation=True,
        padding=False,
        return_tensors=None,
        return_token_type_ids=False  # Explicitly disable token_type_ids
    )
    
    # Set labels (targets for loss calculation)
    inputs['labels'] = targets['input_ids']
    
    return inputs

def prepare_datasets(train_df, dev_df, input_col, target_col):
    """Convert pandas dataframes to HuggingFace datasets"""
    
    # Create training dataset
    train_data = {
        'input_text': train_df[input_col].tolist(),
        'target_text': train_df[target_col].tolist()
    }
    train_dataset = Dataset.from_dict(train_data)
    
    # Create dev dataset if available
    eval_dataset = None
    if dev_df is not None:
        eval_data = {
            'input_text': dev_df[input_col].tolist(),
            'target_text': dev_df[target_col].tolist()
        }
        eval_dataset = Dataset.from_dict(eval_data)
    
    # Tokenize datasets
    print("   Tokenizing training data...")
    train_dataset = train_dataset.map(
        tokenize_function,
        batched=True,
        remove_columns=['input_text', 'target_text']
    )
    
    if eval_dataset is not None:
        print("   Tokenizing evaluation data...")
        eval_dataset = eval_dataset.map(
            tokenize_function,
            batched=True,
            remove_columns=['input_text', 'target_text']
        )
    
    return train_dataset, eval_dataset

# Prepare datasets
train_dataset, eval_dataset = prepare_datasets(train_df, dev_df, input_col, target_col)

print(f" Training dataset: {len(train_dataset)} samples")
if eval_dataset:
    print(f" Evaluation dataset: {len(eval_dataset)} samples")

# Sample tokenized data
print(f"\n Tokenized sample:")
sample = train_dataset[0]
print(f"   Input IDs length: {len(sample['input_ids'])}")
print(f"   Labels length: {len(sample['labels'])}")
print(f"   Available keys: {list(sample.keys())}")

# Data collator for padding during training
data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model,
    padding=True,
    max_length=MAX_INPUT_LENGTH
)

print(f" Data collator created for dynamic padding")
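
What "dynamic padding" means for a seq2seq batch can be sketched in plain Python. This is a simplified illustration, not the `DataCollatorForSeq2Seq` implementation: it assumes right-side padding, and the token ids in the toy batch are invented. The value -100 is the collator's default `label_pad_token_id` and also PyTorch's default `ignore_index` for cross-entropy, so padded label positions do not contribute to the loss.

```python
LABEL_PAD_ID = -100  # default ignore_index of PyTorch's cross-entropy loss

def pad_batch(features, pad_token_id):
    """Pad input_ids with pad_token_id and labels with -100 to the batch maximum."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in, n_lab = len(f["input_ids"]), len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * (max_in - n_in))
        # 1 marks real tokens, 0 marks padding
        batch["attention_mask"].append([1] * n_in + [0] * (max_in - n_in))
        batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * (max_lab - n_lab))
    return batch

# Toy features with made-up token ids
toy = [
    {"input_ids": [5, 6, 7], "labels": [5, 8, 7]},
    {"input_ids": [9], "labels": [9, 4]},
]
batch = pad_batch(toy, pad_token_id=0)
print(batch["input_ids"])  # [[5, 6, 7], [9, 0, 0]]
print(batch["labels"])     # [[5, 8, 7], [9, 4, -100]]
```

Padding each batch only to its own longest sequence, rather than to `MAX_INPUT_LENGTH`, keeps tensors small when most sentences are short.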

In [None]:
# Fixed Stable Training with Proper Imports
print(" FIXING TRAINING INSTABILITY - STABLE APPROACH V2")
print("=" * 80)

# Import required modules
from torch.utils.data import DataLoader
import numpy as np
# transformers.AdamW is deprecated and absent from recent releases; use the PyTorch optimizer
from torch.optim import AdamW

# Reset model to original state
print(" Resetting model to stable state...")

# Load fresh model to avoid any corruption
model = AutoModelForSeq2SeqLM.from_pretrained(
    "ai4bharat/IndicBART",
    dtype=torch.float32,  # Use FP32 for stability
    device_map="auto" if device == "cuda" else None,
)

model.train()
print(" Fresh model loaded")

# Stable training configuration
STABLE_CONFIG = {
    'epochs': 50,  # Passes over the capped training subset (see max_batches below)
    'batch_size': 1,  # Smallest possible batch
    'gradient_accumulation_steps': 16,  # Effective batch size: 1 x 16 = 16
    'learning_rate': 1e-5,  # Same value as LEARNING_RATE above
    'warmup_ratio': 0.05,  # Fraction of training used for warmup
    'weight_decay': 0.001,  # Light weight decay
    'max_grad_norm': 0.5,  # Gradient-clipping threshold
}

print(f"  Stable Configuration:")
for key, value in STABLE_CONFIG.items():
    print(f"   {key}: {value}")

# Simple, stable training function
def stable_train_epoch(model, dataset, optimizer, config, epoch):
    """Ultra-stable training approach"""
    model.train()
    total_loss = 0
    valid_batches = 0
    
    # Create small dataloader
    dataloader = DataLoader(
        dataset, 
        batch_size=config['batch_size'], 
        shuffle=True, 
        collate_fn=data_collator
    )
    
    # Take only a subset for stability testing
    max_batches = 150  # Limit batches for stability
    
    progress_bar = tqdm(
        enumerate(dataloader), 
        total=min(max_batches, len(dataloader)),
        desc=f"Stable Epoch {epoch+1}"
    )
    
    accumulated_loss = 0
    for batch_idx, batch in progress_bar:
        if batch_idx >= max_batches:
            break
            
        try:
            # Move to device safely
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            
            # Check for valid inputs
            if input_ids.numel() == 0 or labels.numel() == 0:
                continue
                
            # Forward pass with error checking
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                labels=labels
            )
            
            loss = outputs.loss
            
            # Check for valid loss
            if torch.isnan(loss) or torch.isinf(loss):
                print(f"    Skipping batch {batch_idx} - invalid loss")
                continue
                
            # Scale loss for accumulation
            loss = loss / config['gradient_accumulation_steps']
            accumulated_loss += loss.item()
            
            # Backward pass
            loss.backward()
            
            # Gradient accumulation step
            if (batch_idx + 1) % config['gradient_accumulation_steps'] == 0:
                # Check gradients before clipping
                total_norm = torch.nn.utils.clip_grad_norm_(
                    model.parameters(), 
                    config['max_grad_norm']
                )
                
                # Only step if gradients are reasonable
                if not torch.isnan(total_norm) and total_norm < 100:
                    optimizer.step()
                    optimizer.zero_grad()
                    
                    total_loss += accumulated_loss
                    valid_batches += 1
                    
                    progress_bar.set_postfix({
                        'loss': f'{accumulated_loss:.4f}',
                        'avg_loss': f'{total_loss/valid_batches:.4f}' if valid_batches > 0 else 'N/A',
                        'grad_norm': f'{total_norm:.2f}'
                    })
                else:
                    print(f"     Skipping optimizer step - gradient norm: {total_norm}")
                    optimizer.zero_grad()
                
                accumulated_loss = 0
            
        except Exception as e:
            print(f"    Error in batch {batch_idx}: {str(e)[:50]}...")
            optimizer.zero_grad()
            continue

    avg_loss = total_loss / valid_batches if valid_batches > 0 else float('inf')
    return avg_loss, valid_batches

# Stable optimizer
stable_optimizer = AdamW(
    model.parameters(), 
    lr=STABLE_CONFIG['learning_rate'],
    weight_decay=STABLE_CONFIG['weight_decay'],
    eps=1e-8,
    betas=(0.9, 0.999)
)

print(f"\n  Starting stable training...")

try:
    stable_history = []
    
    for epoch in range(STABLE_CONFIG['epochs']):
        print(f"\n Stable Epoch {epoch + 1}/{STABLE_CONFIG['epochs']}")
        
        # Clear GPU cache
        if device == "cuda":
            torch.cuda.empty_cache()
        
        # Training
        train_loss, valid_batches = stable_train_epoch(
            model, train_dataset, stable_optimizer, STABLE_CONFIG, epoch
        )
        
        print(f"    Training loss: {train_loss:.4f} (from {valid_batches} valid batches)")
        
        # Simple evaluation on a subset
        model.eval()
        eval_loss = 0
        eval_batches = 0
        
        with torch.no_grad():
            eval_dataloader = DataLoader(
                eval_dataset,
                batch_size=1,
                shuffle=False,
                collate_fn=data_collator
            )
            
            for eval_batch_idx, eval_batch in enumerate(eval_dataloader):
                if eval_batch_idx >= 20:  # Evaluate on first 20 batches
                    break
                    
                try:
                    input_ids = eval_batch['input_ids'].to(device)
                    attention_mask = eval_batch['attention_mask'].to(device)
                    labels = eval_batch['labels'].to(device)
                    
                    outputs = model(
                        input_ids=input_ids,
                        attention_mask=attention_mask,
                        labels=labels
                    )
                    
                    if not torch.isnan(outputs.loss):
                        eval_loss += outputs.loss.item()
                        eval_batches += 1
                        
                except Exception:
                    continue
        
        avg_eval_loss = eval_loss / eval_batches if eval_batches > 0 else float('inf')
        print(f"    Eval loss: {avg_eval_loss:.4f} (from {eval_batches} batches)")
        
        stable_history.append({
            'epoch': epoch + 1,
            'train_loss': train_loss,
            'eval_loss': avg_eval_loss,
            'valid_batches': valid_batches
        })
        
        # Save checkpoint if loss is reasonable
        if train_loss < 10 and not np.isnan(train_loss):
            stable_model_path = f"./indicbart-hindi-stable-epoch{epoch+1}"
            Path(stable_model_path).mkdir(exist_ok=True)
            model.save_pretrained(stable_model_path)
            tokenizer.save_pretrained(stable_model_path)
            print(f"     Checkpoint saved to: {stable_model_path}")

    print(f"\n Stable training completed!")
    
    # Save final model
    final_stable_path = "./indicbart-hindi-stable-final"
    Path(final_stable_path).mkdir(exist_ok=True)
    model.save_pretrained(final_stable_path)
    tokenizer.save_pretrained(final_stable_path)
    
    print(f" Final model saved to: {final_stable_path}")
    
    # Display results
    print(f"\n Stable Training Results:")
    for hist in stable_history:
        print(f"   Epoch {hist['epoch']}: Train={hist['train_loss']:.4f}, Eval={hist['eval_loss']:.4f}, Valid={hist['valid_batches']} batches")
    
    globals()['stable_model'] = model
    globals()['stable_training_history'] = stable_history
    globals()['stable_training_completed'] = True
    
except Exception as e:
    print(f" Stable training failed: {str(e)}")
    import traceback
    traceback.print_exc()
    globals()['stable_training_completed'] = False

print(f"\n Stable training approach complete!")
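
With `batch_size=1`, `gradient_accumulation_steps=16` and `max_batches = 150`, the number of optimizer steps per epoch is small, and the last few batches are backpropagated without a following optimizer step in that epoch. The sketch below works through that arithmetic; `accumulation_schedule` is a hypothetical helper, and it assumes no batch is skipped and the dataloader yields at least 150 batches.

```python
def accumulation_schedule(num_batches: int, accumulation_steps: int, batch_size: int = 1):
    """Return (optimizer_steps, leftover_batches, effective_batch_size)."""
    # An optimizer step fires each time a full group of batches has been accumulated.
    optimizer_steps = num_batches // accumulation_steps
    # Batches after the last full group do not trigger a step within this epoch.
    leftover_batches = num_batches % accumulation_steps
    effective_batch_size = batch_size * accumulation_steps
    return optimizer_steps, leftover_batches, effective_batch_size


steps, leftover, effective = accumulation_schedule(150, 16, batch_size=1)
print(f"optimizer steps per epoch: {steps}")
print(f"batches without a step:    {leftover}")
print(f"effective batch size:      {effective}")
```

Choosing a batch limit that is a multiple of the accumulation steps (for example 160) would avoid the leftover batches.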

In [None]:
# Disk Space Recovery and Continue Training
print("DISK SPACE RECOVERY AND TRAINING CONTINUATION")
print("=" * 70)

import shutil
import os
from pathlib import Path

# Check current disk space and checkpoint status
def check_disk_space():
    """Check available disk space"""
    total, used, free = shutil.disk_usage("./")
    print(f" Disk Usage:")
    print(f"   Total: {total / (1024**3):.1f} GB")
    print(f"   Used: {used / (1024**3):.1f} GB")
    print(f"   Free: {free / (1024**3):.1f} GB")
    return free // (1024**2)  # Return free space in MB

# Clean up old checkpoints, keep only the best ones
def cleanup_checkpoints():
    """Clean up intermediate checkpoints to save space"""
    print("Cleaning up intermediate checkpoints...")
    
    checkpoint_dirs = []
    for i in range(1, 32):  # Check epochs 1-31
        checkpoint_path = f"./indicbart-hindi-stable-epoch{i}"
        if os.path.exists(checkpoint_path):
            checkpoint_dirs.append((i, checkpoint_path))
    
    print(f"   Found {len(checkpoint_dirs)} checkpoint directories")
    
    # Keep only every 5th checkpoint and the last few
    checkpoints_to_keep = []
    checkpoints_to_remove = []
    
    for epoch, path in checkpoint_dirs:
        # Keep every 5th epoch (5, 10, 15, 20, 25, 30) and last 2 epochs
        if epoch % 5 == 0 or epoch >= 30:
            checkpoints_to_keep.append((epoch, path))
        else:
            checkpoints_to_remove.append((epoch, path))
    
    # Remove intermediate checkpoints
    space_freed = 0
    for epoch, path in checkpoints_to_remove:
        try:
            size_before = sum(f.stat().st_size for f in Path(path).rglob('*') if f.is_file())
            shutil.rmtree(path)
            space_freed += size_before
            print(f"    Removed epoch {epoch} checkpoint")
        except Exception as e:
            print(f"     Failed to remove epoch {epoch}: {str(e)[:30]}...")
    
    print(f"    Space freed: {space_freed / (1024**2):.1f} MB")
    print(f"    Kept checkpoints: {[epoch for epoch, _ in checkpoints_to_keep]}")
    
    return checkpoints_to_keep

# Find the latest checkpoint
def find_latest_checkpoint():
    """Find the latest successful checkpoint"""
    latest_epoch = 0
    latest_path = None
    
    for i in range(31, 0, -1):  # Check from epoch 31 down to 1
        checkpoint_path = f"./indicbart-hindi-stable-epoch{i}"
        if os.path.exists(checkpoint_path):
            # Check if checkpoint is complete
            config_file = os.path.join(checkpoint_path, "config.json")
            model_file = os.path.join(checkpoint_path, "pytorch_model.bin")
            safetensor_file = os.path.join(checkpoint_path, "model.safetensors")
            
            if os.path.exists(config_file) and (os.path.exists(model_file) or os.path.exists(safetensor_file)):
                latest_epoch = i
                latest_path = checkpoint_path
                break
    
    return latest_epoch, latest_path

# Check initial state
free_space_mb = check_disk_space()
print()

if free_space_mb < 1000:  # Less than 1GB free
    print("  Low disk space detected. Cleaning up checkpoints...")
    kept_checkpoints = cleanup_checkpoints()
    free_space_mb = check_disk_space()
    print()

# Find latest checkpoint
latest_epoch, latest_checkpoint = find_latest_checkpoint()

if latest_checkpoint:
    print(f" Latest checkpoint found: Epoch {latest_epoch}")
    print(f"    Path: {latest_checkpoint}")
    
    # Check training history
    if 'stable_training_history' not in globals():
        stable_training_history = []
    if len(stable_training_history) >= latest_epoch:
        last_train_loss = stable_training_history[latest_epoch-1]['train_loss']
        last_eval_loss = stable_training_history[latest_epoch-1]['eval_loss']
        print(f"    Last metrics: Train={last_train_loss:.4f}, Eval={last_eval_loss:.4f}")
        
        # Display training progress
        print(f"\n Training Progress Summary:")
        print(f"    Started: Train={stable_training_history[0]['train_loss']:.4f}, Eval={stable_training_history[0]['eval_loss']:.4f}")
        print(f"    Latest:  Train={last_train_loss:.4f}, Eval={last_eval_loss:.4f}")
        print(f"    Improvement: {stable_training_history[0]['train_loss'] - last_train_loss:.4f} train loss reduction")
        print(f"    Progress: {latest_epoch}/50 epochs completed ({latest_epoch*2}%)")
        
        # Assess if we should continue
        if last_eval_loss < 1.5 and latest_epoch >= 20:
            print(f"\n EXCELLENT PROGRESS!")
            print(f"    Eval loss below 1.5 ({last_eval_loss:.4f})")
            print(f"    20+ epochs completed")
            print(f"    Model is well-trained and ready for use!")
            
            # Save the current model as final if it's the latest checkpoint
            try:
                final_model_path = "./indicbart-hindi-final-trained"
                if not os.path.exists(final_model_path):
                    print(f"    Copying latest checkpoint to final model...")
                    shutil.copytree(latest_checkpoint, final_model_path)
                    print(f"    Final model saved to: {final_model_path}")
                else:
                    print(f"    Final model already exists: {final_model_path}")
                    
            except Exception as e:
                print(f"     Could not save final model: {str(e)[:50]}...")
        
        else:
            print(f"\n CONTINUE TRAINING RECOMMENDED")
            print(f"    Current eval loss: {last_eval_loss:.4f}")
            print(f"    Target: Below 1.0 for optimal performance")
    
    # Save summary
    training_summary = {
        'latest_epoch': latest_epoch,
        'latest_checkpoint': latest_checkpoint,
        'free_space_mb': free_space_mb,
        'total_epochs_target': 50,
        'progress_percent': (latest_epoch / 50) * 100
    }
    
    globals()['training_summary'] = training_summary
    
else:
    print(" No valid checkpoints found!")

print(f"\nDisk space recovery complete!")
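
The retention rule used during cleanup (keep every 5th epoch plus the most recent ones) can be separated from the file deletion so it is easy to check before anything is removed. This is a minimal sketch: `split_checkpoints` is a hypothetical helper, and it derives the "last 2" from the epochs actually present instead of the hard-coded `epoch >= 30`.

```python
def split_checkpoints(epochs, keep_every=5, keep_last=2):
    """Partition epoch numbers into (keep, remove) without touching the disk."""
    epochs = sorted(epochs)
    newest = set(epochs[-keep_last:]) if keep_last > 0 else set()
    # Keep periodic checkpoints and the newest few; everything else can go.
    keep = [e for e in epochs if e % keep_every == 0 or e in newest]
    remove = [e for e in epochs if e not in keep]
    return keep, remove


keep, remove = split_checkpoints(range(1, 32))
print("keep:  ", keep)
print("remove:", remove)
```

For epochs 1 to 31 this gives the same split as the cleanup function, and it keeps working if training later continues past epoch 31.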

In [None]:
# Find and Load the Best Working Checkpoint
print("FINDING BEST WORKING CHECKPOINT")
print("=" * 60)

import os

# Check available checkpoints
available_checkpoints = []
for epoch in [30, 25, 20, 15, 10, 5]:  # Check in reverse order
    checkpoint_path = f"./indicbart-hindi-stable-epoch{epoch}"
    if os.path.exists(checkpoint_path):
        # Check if files are complete
        config_file = os.path.join(checkpoint_path, "config.json")
        model_files = [
            os.path.join(checkpoint_path, "model.safetensors"),
            os.path.join(checkpoint_path, "pytorch_model.bin")
        ]
        
        file_exists = os.path.exists(config_file) and any(os.path.exists(f) for f in model_files)
        if file_exists:
            # Check file sizes to ensure they're not corrupted
            try:
                config_size = os.path.getsize(config_file)
                model_size = max([os.path.getsize(f) for f in model_files if os.path.exists(f)], default=0)
                
                if config_size > 100 and model_size > 100_000_000:  # Config > 100 bytes, model > 100MB
                    available_checkpoints.append((epoch, checkpoint_path, model_size))
                    print(f"   Epoch {epoch}: Valid checkpoint ({model_size // (1024**2)} MB)")
                else:
                    print(f"   Epoch {epoch}: Files too small (possibly corrupted)")
            except OSError:
                print(f"   Epoch {epoch}: Cannot read files")
        else:
            print(f"   Epoch {epoch}: Missing files")
    else:
        print(f"   Epoch {epoch}: Directory not found")

if available_checkpoints:
    # Use the latest valid checkpoint
    best_epoch, best_path, model_size = available_checkpoints[0]
    print(f"\nUsing best available checkpoint: Epoch {best_epoch}")
    print(f"   Path: {best_path}")
    print(f"   Size: {model_size // (1024**2)} MB")
    
    try:
        print("\nLoading the best trained model...")
        
        # Load the trained model and tokenizer
        from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
        
        trained_model = AutoModelForSeq2SeqLM.from_pretrained(
            best_path,
            device_map="auto" if device == "cuda" else None,
            dtype=torch.float32
        )
        
        trained_tokenizer = AutoTokenizer.from_pretrained(best_path)
        
        print(f"Model loaded successfully from epoch {best_epoch}!")
        
        # Test the trained model on key Hindi grammar errors
        test_examples = [
            "मैं कल दिल्ली जाऊगा",        # Missing anusvara (should be जाऊंगा)
            "वो स्कूल गया हैं",            # Verb agreement error (should be गया है)
            "राम और श्याम खेल रहा है",     # Plural subject, singular verb (should be खेल रहे हैं)
            "बच्चे पार्क में खेल रहे हैं",   # Correct sentence (should remain unchanged)
        ]
        
        def test_grammar_correction(model, tokenizer, text):
            """Test grammar correction on input text"""
            try:
                # Add task prompt
                input_text = f"सुधारें: {text}"
                
                # Tokenize
                inputs = tokenizer(
                    input_text,
                    max_length=64,
                    padding=True,
                    truncation=True,
                    return_tensors="pt"
                ).to(device)
                
                # Generate correction with simple parameters
                with torch.no_grad():
                    outputs = model.generate(
                        inputs['input_ids'],
                        max_length=64,
                        num_beams=3,
                        early_stopping=True,
                        do_sample=False,
                        pad_token_id=tokenizer.pad_token_id
                    )
                
                # Decode result
                result = tokenizer.decode(outputs[0], skip_special_tokens=True)
                
                # Remove prompt prefix if present
                if result.startswith("सुधारें:"):
                    result = result[len("सुधारें:"):].strip()
                
                return result
                
            except Exception as e:
                return f"Error: {str(e)[:30]}..."
        
        print(f"\nTesting model performance:")
        print()
        
        for i, sentence in enumerate(test_examples):
            print(f"Test {i+1}: {sentence}")
            correction = test_grammar_correction(trained_model, trained_tokenizer, sentence)
            print(f"   → {correction}")
            print()
        
        # Save as final model if successful
        final_model_path = "./indicbart-hindi-final-working"
        print(f"Saving working model...")
        
        try:
            trained_model.save_pretrained(final_model_path)
            trained_tokenizer.save_pretrained(final_model_path)
            print(f"   Working model saved to: {final_model_path}")
        except Exception as e:
            print(f"   Could not save: {str(e)[:50]}...")
        
        # Store results
        globals()['trained_model'] = trained_model
        globals()['trained_tokenizer'] = trained_tokenizer
        globals()['model_ready'] = True
        globals()['best_epoch_used'] = best_epoch
        
        print(f"\nSUCCESS!")
        print(f"   Model from epoch {best_epoch} loaded and tested")
        print(f"   Hindi grammar correction ran on the test sentences above")
        print(f"   Final model: {final_model_path}")
        
    except Exception as e:
        print(f"Failed to load model: {str(e)}")
        globals()['model_ready'] = False

else:
    print(f"\nNo valid checkpoints found!")
    print(f"   All checkpoint files appear to be missing or corrupted")
    
    # Fall back to the stable model still held in memory
    if 'stable_model' in globals():
        print(f"\nUsing the stable model from memory...")
        globals()['trained_model'] = stable_model
        globals()['trained_tokenizer'] = tokenizer
        globals()['model_ready'] = True
        globals()['best_epoch_used'] = "memory"
        print(f"   Using model from training session")
    else:
        globals()['model_ready'] = False
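
Removing the prompt from the generated text by a fixed character count is fragile for Devanagari, because vowel signs and the anusvara are separate code points. The sketch below measures the prefix instead of hard-coding its length; `PROMPT_PREFIX` and `strip_prompt` are hypothetical names introduced for illustration.

```python
PROMPT_PREFIX = "सुधारें:"


def strip_prompt(text: str, prefix: str = PROMPT_PREFIX) -> str:
    """Drop the task prompt from the start of `text` if it is present."""
    if text.startswith(prefix):
        # Slice by the measured prefix length, not by a hand-counted constant.
        return text[len(prefix):].strip()
    return text.strip()


# The prefix is 8 code points long: 4 consonants, 3 combining signs, and the colon.
print(len(PROMPT_PREFIX))
print(strip_prompt("सुधारें: वो स्कूल गया है"))
```

On Python 3.9 or later, `str.removeprefix` performs the same prefix removal.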

In [None]:
# Test the Stable Trained Model - Fixed
print("TESTING STABLE TRAINED MODEL - FIXED VERSION")
print("=" * 60)

# Test sentences with various Hindi grammar errors
test_sentences = [
    "मैं कल दिल्ली जाऊंगा",        # Correct sentence
    "मैं कल दिल्ली जाऊगा",         # Missing anusvara
    "वो स्कूल गया हैं",             # Subject-verb disagreement
    "राम और श्याम खेल रहा है",      # Plural subject, singular verb
    "मुझे यह किताब पसंद हैं",       # Object-verb disagreement
    "बच्चे पार्क में खेल रहे हैं",    # Correct sentence
    "उसके पास बहुत पैसा हैं",       # Singular subject, plural verb
    "मैं रोज सुबह योग करती हूँ",     # Gender agreement (if speaker is male)
]

def test_correction_fixed(model, tokenizer, text, max_length=128):
    """Test grammar correction with fixed generation parameters"""
    try:
        # Add prompt prefix
        input_text = f"सुधारें: {text}"
        
        # Tokenize
        inputs = tokenizer(
            input_text,
            max_length=max_length,
            padding=True,
            truncation=True,
            return_tensors="pt"
        )
        
        # Move to device
        input_ids = inputs['input_ids'].to(device)
        attention_mask = inputs['attention_mask'].to(device)
        
        # Generate correction with simplified parameters
        with torch.no_grad():
            outputs = model.generate(
                input_ids=input_ids,
                attention_mask=attention_mask,
                max_length=max_length,
                num_beams=3,
                early_stopping=True,
                pad_token_id=tokenizer.pad_token_id,
                eos_token_id=tokenizer.eos_token_id,
                do_sample=False
            )
        
        # Decode output
        corrected = tokenizer.decode(outputs[0], skip_special_tokens=True)
        
        # Remove the prompt prefix from output if present
        if corrected.startswith("सुधारें:"):
            corrected = corrected[len("सुधारें:"):].strip()
        
        return corrected
        
    except Exception as e:
        return f"Error: {str(e)[:50]}..."

print("Testing on sample sentences...")
print()

# Ensure stable_model is loaded
if 'stable_model' not in globals():
    # IndicBART checkpoints load as MBartForConditionalGeneration, so let the Auto class pick the architecture
    stable_model = AutoModelForSeq2SeqLM.from_pretrained('./indicbart-hindi-stable-final').to(device)

# Test with the stable model
test_results = []
for i, sentence in enumerate(test_sentences):
    print(f"Test {i+1}/{len(test_sentences)}:")
    print(f"   Original:  {sentence}")
    
    # Test correction
    corrected = test_correction_fixed(stable_model, tokenizer, sentence)
    print(f"   Corrected: {corrected}")
    
    test_results.append({
        'original': sentence,
        'corrected': corrected,
        'same': sentence.strip() == corrected.strip()
    })
    print()

# Summary
print("TEST SUMMARY:")
print(f"   Total tests: {len(test_results)}")
unchanged = sum(1 for r in test_results if r['same'])
changed = len(test_results) - unchanged
print(f"   Unchanged: {unchanged}")
print(f"   Changed: {changed}")

print(f"\nModel Performance:")
print(f"   Training Loss: {stable_training_history[-1]['train_loss']:.4f}")
print(f"   Eval Loss: {stable_training_history[-1]['eval_loss']:.4f}")
print(f"   Model saved to: ./indicbart-hindi-stable-final")

# Show which sentences were corrected
print(f"\nDETAILED RESULTS:")
for i, result in enumerate(test_results):
    if not result['same']:
        print(f"   Changed {i+1}: '{result['original']}' → '{result['corrected']}'")
    else:
        print(f"   Same {i+1}: '{result['original']}'")

# Save test results
globals()['test_results'] = test_results
globals()['stable_model_tested'] = True

print(f"\nStable model testing complete!")
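
Counting changed versus unchanged sentences does not show whether a change was the right one. A possible next step is to compare each output with the expected correction. The sketch below uses the four sentences whose expected forms are given in the test comments; `exact_match_rate` is a hypothetical helper, and the identity function stands in for the model.

```python
def exact_match_rate(pairs, correct_fn):
    """Fraction of (source, expected) pairs where correct_fn(source) equals expected."""
    if not pairs:
        return 0.0
    hits = sum(1 for source, expected in pairs if correct_fn(source).strip() == expected.strip())
    return hits / len(pairs)


# (erroneous sentence, expected correction); the last pair is already correct.
EXPECTED = [
    ("मैं कल दिल्ली जाऊगा", "मैं कल दिल्ली जाऊंगा"),
    ("वो स्कूल गया हैं", "वो स्कूल गया है"),
    ("राम और श्याम खेल रहा है", "राम और श्याम खेल रहे हैं"),
    ("बच्चे पार्क में खेल रहे हैं", "बच्चे पार्क में खेल रहे हैं"),
]

# Baseline: a "model" that copies its input only gets the already-correct sentence right.
baseline = exact_match_rate(EXPECTED, lambda sentence: sentence)
print(f"copy-input baseline: {baseline:.2f}")
```

To score the trained model, pass `lambda s: test_correction_fixed(stable_model, tokenizer, s)` as `correct_fn`. A useful model should beat the copy-input baseline.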

In [None]:
# IMPROVED Training Arguments (fixes overfitting and instability)
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    
    # Training schedule
    num_train_epochs=5,              # Fewer epochs to prevent overfitting
    per_device_train_batch_size=2,   # Manageable batch size
    per_device_eval_batch_size=4,    # Larger eval batches
    gradient_accumulation_steps=8,   # Effective batch size = 2*8 = 16
    
    # Learning rates and optimization
    learning_rate=5e-6,              # Lower learning rate for stability
    warmup_ratio=0.1,                # Gradual warmup
    weight_decay=0.01,               # Regularization
    max_grad_norm=1.0,               # Gradient clipping
    
    # Evaluation and saving
    eval_strategy="steps",           # Fixed: changed from evaluation_strategy
    eval_steps=500,                  # Evaluate every 500 steps
    save_steps=500,                  # Save every 500 steps
    save_total_limit=3,              # Keep only 3 checkpoints
    load_best_model_at_end=True,     # Load best model
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    
    # Logging and optimization
    logging_steps=100,               # Log frequently
    fp16=torch.cuda.is_available(),  # Use mixed precision if available
    dataloader_pin_memory=False,     # Prevent memory issues
    remove_unused_columns=False,     # Keep all columns
    report_to="none",                # No wandb/tensorboard
    seed=42,                         # Reproducibility
    
    # Performance optimizations
    dataloader_num_workers=0,        # Prevent multiprocessing issues
    prediction_loss_only=False,
)

 Starting IMPROVED IndicBART Training
 Loading new large dataset...
 Dataset loaded:
    Train: 10,599 samples
    Dev: 107 samples
    Total: 10,706 samples
    Error corrections: 5,625
    Identity pairs: 5,081

 Improved Configuration:
   Lower learning rate: 5e-6 (was 1e-5)
   Better generation params: repetition_penalty=1.5
   Regularization: weight_decay=0.01
   Early stopping: patience=2
   Optimized batch size: 2 (with grad accumulation 8)

 Loading fresh model and tokenizer...
   Model loaded: MBartForConditionalGeneration
   Tokenizer: AlbertTokenizerFast
   Vo

Tokenizing train: 100%|██████████| 10599/10599 [00:01<00:00, 5559.81 examples/s]
Tokenizing dev: 100%|██████████| 107/107 [00:00<00:00, 8294.65 examples/s]

   Tokenization complete
   Train tokens: 10599
   Dev tokens: 107


TypeError: TrainingArguments.__init__() got an unexpected keyword argument 'evaluation_strategy'
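
The `TypeError` above is what an installed Transformers release raises when it only accepts `eval_strategy` and the call still passes `evaluation_strategy`. One way to keep a notebook working across both spellings is to inspect the signature before building the arguments. This is a sketch under assumptions: `pick_supported_kwarg` is a hypothetical helper, and the two small functions are stand-ins for the newer and older signatures, not the real `TrainingArguments`.

```python
import inspect


def pick_supported_kwarg(callable_obj, candidates):
    """Return the first name in `candidates` that `callable_obj` accepts as a parameter."""
    accepted = inspect.signature(callable_obj).parameters
    for name in candidates:
        if name in accepted:
            return name
    raise TypeError(f"{callable_obj!r} accepts none of {list(candidates)}")


# Hypothetical stand-ins for a newer and an older argument signature.
def newer_args(output_dir, eval_strategy="no"):
    return {"output_dir": output_dir, "eval_strategy": eval_strategy}


def older_args(output_dir, evaluation_strategy="no"):
    return {"output_dir": output_dir, "evaluation_strategy": evaluation_strategy}


NAMES = ("eval_strategy", "evaluation_strategy")
for factory in (newer_args, older_args):
    key = pick_supported_kwarg(factory, NAMES)
    print(factory.__name__, "->", factory("out", **{key: "steps"}))
```

In the notebook the same idea would read `key = pick_supported_kwarg(TrainingArguments, NAMES)`, followed by passing `**{key: "steps"}` alongside the other arguments.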