# Hindi GEC with mT5-small — FIXED Training and Inference Notebook

🔧 **CRITICAL FIXES APPLIED:**
- ✅ Changed task prefix from `'correct Hindi: '` to `'grammar correction: '`
- ✅ Added data augmentation with identity pairs from dev set
- ✅ Increased training epochs to 8 for better learning
- ✅ Proper text-to-text format for MT5

This notebook fixes the issues causing `<extra_id_0>` tokens and poor performance.

Expected files in the same folder as this notebook:
- `train.csv` (columns: `input`, `output` OR first two columns are input/output)
- Optional: `dev.csv` (same format). If missing, the notebook will split a dev set from train.


## 0. Install dependencies (run once)
If you haven't installed the required libraries, run the cell below.

In [None]:
# If needed, uncomment and run:
# !pip install -U transformers datasets accelerate sentencepiece evaluate tqdm scikit-learn


## 1. Imports, setup, and configuration

In [1]:
import os
import json
import gc
import warnings
from pathlib import Path
from typing import Dict, List, Optional, Tuple

import numpy as np
import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from tqdm.auto import tqdm

import transformers
from transformers import (
    MT5ForConditionalGeneration,
    MT5Tokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    DataCollatorForSeq2Seq,
    EarlyStoppingCallback,
    set_seed,
)
from transformers.trainer_utils import EvalPrediction
from datasets import Dataset

warnings.filterwarnings('ignore')
SEED = 42
set_seed(SEED)

print(f'PyTorch: {torch.__version__}')
print(f'Transformers: {transformers.__version__}')
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('CUDA available:', torch.cuda.is_available())
if torch.cuda.is_available():
    print('GPU:', torch.cuda.get_device_name(0))
    print('GPU Memory (GB):', round(torch.cuda.get_device_properties(0).total_memory / 1024**3, 2))

# ==================== FIXED Configuration ====================
CONFIG: Dict = {
    # Model
    'model_name': 'google/mt5-small',
    'max_input_length': 128,
    'max_target_length': 128,
    'device': 'cuda' if torch.cuda.is_available() else 'cpu',

    # Data
    'train_file': 'train.csv',
    'dev_file': 'dev.csv',
    'test_size': 0.1,
    'random_seed': SEED,

    # Training - IMPROVED SETTINGS
    'output_dir': './mt5-hindi-gec-model-fixed',
    'num_train_epochs': 8,  # ✅ Increased epochs
    'per_device_train_batch_size': 4,
    'per_device_eval_batch_size': 8,
    'gradient_accumulation_steps': 4,
    'learning_rate': 5e-4,  # ✅ Slightly higher learning rate
    'warmup_ratio': 0.1,
    'weight_decay': 0.01,
    'max_grad_norm': 1.0,
    'fp16': False,
    'gradient_checkpointing': True,
    'optim': 'adafactor',

    # Evaluation / saving
    'evaluation_strategy': 'epoch',
    'save_strategy': 'epoch',
    'logging_steps': 50,
    'save_total_limit': 2,
    'load_best_model_at_end': True,
    'metric_for_best_model': 'gleu',
    'greater_is_better': True,
    'early_stopping_patience': 4,  # ✅ Increased patience

    # Generation
    'generation_config': {
        'max_length': 128,
        'num_beams': 4,
        'early_stopping': True,
        'repetition_penalty': 1.2,
        'no_repeat_ngram_size': 3,
        'length_penalty': 1.0,
        'do_sample': False,
    }
}
CONFIG

PyTorch: 2.6.0+cu124
Transformers: 4.56.2
CUDA available: True
GPU: NVIDIA GeForce RTX 3050 6GB Laptop GPU
GPU Memory (GB): 6.0


{'model_name': 'google/mt5-small',
 'max_input_length': 128,
 'max_target_length': 128,
 'device': 'cuda',
 'train_file': 'train.csv',
 'dev_file': 'dev.csv',
 'test_size': 0.1,
 'random_seed': 42,
 'output_dir': './mt5-hindi-gec-model-fixed',
 'num_train_epochs': 8,
 'per_device_train_batch_size': 4,
 'per_device_eval_batch_size': 8,
 'gradient_accumulation_steps': 4,
 'learning_rate': 0.0005,
 'warmup_ratio': 0.1,
 'weight_decay': 0.01,
 'max_grad_norm': 1.0,
 'fp16': False,
 'gradient_checkpointing': True,
 'optim': 'adafactor',
 'evaluation_strategy': 'epoch',
 'save_strategy': 'epoch',
 'logging_steps': 50,
 'save_total_limit': 2,
 'load_best_model_at_end': True,
 'metric_for_best_model': 'gleu',
 'greater_is_better': True,
 'early_stopping_patience': 4,
 'generation_config': {'max_length': 128,
  'num_beams': 4,
  'early_stopping': True,
  'repetition_penalty': 1.2,
  'no_repeat_ngram_size': 3,
  'length_penalty': 1.0,
  'do_sample': False}}

## 2. Data loading and cleaning - WITH AUGMENTATION

In [2]:
def clean_text(text: str) -> str:
    if pd.isna(text):
        return ''
    text = str(text).strip()
    text = ' '.join(text.split())
    text = ''.join(ch for ch in text if ord(ch) >= 32 or ch == '\n')
    return text

def load_and_prepare_data(config: Dict):
    train_path = Path(config['train_file'])
    if not train_path.exists():
        raise FileNotFoundError(f'Training file not found: {train_path}')

    train_df = pd.read_csv(train_path, encoding='utf-8')
    # Determine columns
    if 'input' in train_df.columns and 'output' in train_df.columns:
        input_col, output_col = 'input', 'output'
    else:
        input_col, output_col = train_df.columns[0], train_df.columns[1]

    train_df = train_df[[input_col, output_col]].copy()
    train_df.columns = ['input_text', 'output_text']
    train_df['input_text'] = train_df['input_text'].apply(clean_text)
    train_df['output_text'] = train_df['output_text'].apply(clean_text)
    train_df = train_df[(train_df['input_text'] != '') & (train_df['output_text'] != '')]
    train_df = train_df[(train_df['input_text'].str.len().between(5, 200)) & (train_df['output_text'].str.len().between(5, 200))]

    print(f'Original training samples: {len(train_df)}')

    # ✅ CRITICAL FIX: Add identity pairs from dev set for data augmentation
    dev_path = Path(config['dev_file'])
    if dev_path.exists():
        dev_df = pd.read_csv(dev_path, encoding='utf-8')
        dev_df = dev_df[[input_col, output_col]].copy()
        dev_df.columns = ['input_text', 'output_text']
        dev_df['input_text'] = dev_df['input_text'].apply(clean_text)
        dev_df['output_text'] = dev_df['output_text'].apply(clean_text)
        dev_df = dev_df[(dev_df['input_text'] != '') & (dev_df['output_text'] != '')]
        
        # 🔥 ADD IDENTITY PAIRS - This teaches the model when NOT to change things
        print('🔄 Adding identity pairs from dev set...')
        identity_pairs = []
        for _, row in dev_df.iterrows():
            # Add correct sentence → same correct sentence
            identity_pairs.append({
                'input_text': row['output_text'],  # Use target as input
                'output_text': row['output_text']   # Same as output
            })
        
        identity_df = pd.DataFrame(identity_pairs)
        train_df = pd.concat([train_df, identity_df], ignore_index=True)
        print(f'✅ Added {len(identity_df)} identity pairs')
    else:
        # Split if no dev file
        train_df, dev_df = train_test_split(train_df, test_size=config['test_size'], random_state=config['random_seed'])

    print('Final train samples:', len(train_df), '| Dev samples:', len(dev_df))
    print('Identical train pairs:', int((train_df['input_text'] == train_df['output_text']).sum()))
    print('Identical dev pairs:', int((dev_df['input_text'] == dev_df['output_text']).sum()))

    # Show few examples
    print('Sample corrections:')
    sample = train_df[train_df['input_text'] != train_df['output_text']].head(3)
    for i, (_, row) in enumerate(sample.iterrows(), 1):
        print(f"{i}. Input:  {row['input_text'][:80]}")
        print(f"   Output: {row['output_text'][:80]}")
    return train_df, dev_df

train_df, dev_df = load_and_prepare_data(CONFIG)
len(train_df), len(dev_df)

Original training samples: 13751
🔄 Adding identity pairs from dev set...
✅ Added 107 identity pairs
Final train samples: 13858 | Dev samples: 107
Identical train pairs: 165
Identical dev pairs: 24
Sample corrections:
1. Input:  चाय की दुकान से लेकर वाहनों और दिवारों तक हर जगह विज्ञापन ही विज्ञापन दिखाई देते
   Output: चाय की दुकान से लेकर वाहनों और दिवारों तक हर जगह विज्ञापन ही विज्ञापन दिखाई देते
2. Input:  ये कहीं पे निगाहें , कही पे निशाना का सा अन्दाज है ।
   Output: यह कहीं पे निगाहें , कही पे निशाना का सा अन्दाज है ।
3. Input:  आज हम विज्ञापन युग के सीमान्त पर आ खड़े हुए है ।
   Output: आज हम विज्ञापन युग के सीमान्त पर आ खड़े हुए हैं ।


(13858, 107)

## 3. Load tokenizer and model

In [3]:
def load_model_and_tokenizer(config: Dict):
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    tokenizer = MT5Tokenizer.from_pretrained(config['model_name'])
    model = MT5ForConditionalGeneration.from_pretrained(
        config['model_name'],
        torch_dtype=(torch.float16 if config['fp16'] else torch.float32),
    )
    if config.get('gradient_checkpointing', False):
        model.gradient_checkpointing_enable()
        model.config.use_cache = False
    model = model.to(config['device'])
    print('Vocab size:', len(tokenizer))
    total_params = sum(p.numel() for p in model.parameters()) / 1e6
    print(f'Model params: {total_params:.1f}M')
    if torch.cuda.is_available():
        print('GPU mem allocated (GB):', round(torch.cuda.memory_allocated() / 1024**3, 2))
    return model, tokenizer

model, tokenizer = load_model_and_tokenizer(CONFIG)

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'T5Tokenizer'. 
The class this function is called from is 'MT5Tokenizer'.
You are using the default legacy behaviour of the <class 'transformers.models.mt5.tokenization_mt5.MT5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
`torch_dtype` is deprecated! Use `dtype` instead!


Vocab size: 250100
Model params: 300.2M
GPU mem allocated (GB): 1.12


## 4. Tokenization and dataset preparation - FIXED TASK PREFIX

In [4]:
def create_tokenization_function(tokenizer, config: Dict):
    def tokenize_function(examples):
        # ✅ CRITICAL FIX: Changed from 'correct Hindi:' to 'grammar correction:'
        # This prevents the model from thinking it's an infilling task
        inputs = ['grammar correction: ' + text for text in examples['input_text']]
        targets = examples['output_text']
        model_inputs = tokenizer(
            inputs,
            max_length=config['max_input_length'],
            truncation=True,
            padding=False,
        )
        labels = tokenizer(
            text_target=targets,
            max_length=config['max_target_length'],
            truncation=True,
            padding=False,
        )
        model_inputs['labels'] = labels['input_ids']
        return model_inputs
    return tokenize_function

tokenize_function = create_tokenization_function(tokenizer, CONFIG)

hf_train = Dataset.from_pandas(train_df)
hf_dev = Dataset.from_pandas(dev_df)

tokenized_train = hf_train.map(tokenize_function, batched=True, remove_columns=hf_train.column_names, desc='Tokenizing train')
tokenized_dev = hf_dev.map(tokenize_function, batched=True, remove_columns=hf_dev.column_names, desc='Tokenizing dev')

print('Tokenized sizes:', len(tokenized_train), len(tokenized_dev))

data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model,
    padding=True,  # dynamic padding
)

Tokenizing train:   0%|          | 0/13858 [00:00<?, ? examples/s]

Tokenizing dev:   0%|          | 0/107 [00:00<?, ? examples/s]

Tokenized sizes: 13858 107


## 5. Metrics (GLEU proxy)

In [5]:
def compute_metrics(eval_preds: EvalPrediction):
    # Support both EvalPrediction and (predictions, labels) tuple
    if isinstance(eval_preds, tuple):
        predictions, labels = eval_preds
    else:
        predictions, labels = eval_preds.predictions, eval_preds.label_ids
    # Unwrap predictions if generate() returns a tuple
    if isinstance(predictions, tuple):
        predictions = predictions[0]

    # Ensure predictions are token ids (handle logits or floats)
    preds = np.array(predictions)
    if preds.ndim == 3:  # logits -> ids
        preds = preds.argmax(-1)
    preds = preds.astype(np.int64, copy=False)
    # Guard against invalid ids
    vocab_size = len(tokenizer)
    preds = np.where((preds >= 0) & (preds < vocab_size), preds, tokenizer.pad_token_id)

    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace -100 to decode labels
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds = [p.strip() for p in decoded_preds]
    decoded_labels = [l.strip() for l in decoded_labels]

    # Simple GLEU-like proxy using token F1
    gleu_scores = []
    for pred, ref in zip(decoded_preds, decoded_labels):
        pt = set(pred.lower().split())
        rt = set(ref.lower().split())
        if not rt:
            gleu_scores.append(0.0)
            continue
        overlap = pt & rt
        precision = len(overlap) / len(pt) if pt else 0.0
        recall = len(overlap) / len(rt)
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        gleu_scores.append(f1)
    gleu = float(np.mean(gleu_scores) * 100)

    return {'gleu': gleu}

## 6. Training - WITH IMPROVED SETTINGS

In [6]:
training_args = Seq2SeqTrainingArguments(
    output_dir=CONFIG['output_dir'],
    num_train_epochs=CONFIG['num_train_epochs'],  # ✅ Now 8 epochs
    per_device_train_batch_size=CONFIG['per_device_train_batch_size'],
    per_device_eval_batch_size=CONFIG['per_device_eval_batch_size'],
    gradient_accumulation_steps=CONFIG['gradient_accumulation_steps'],
    learning_rate=CONFIG['learning_rate'],  # ✅ Slightly higher LR
    warmup_ratio=CONFIG['warmup_ratio'],
    weight_decay=CONFIG['weight_decay'],
    max_grad_norm=CONFIG['max_grad_norm'],
    fp16=CONFIG['fp16'],
    optim=CONFIG['optim'],
    eval_strategy=CONFIG['evaluation_strategy'],
    save_strategy=CONFIG['save_strategy'],
    logging_steps=CONFIG['logging_steps'],
    save_total_limit=CONFIG['save_total_limit'],
    load_best_model_at_end=CONFIG['load_best_model_at_end'],
    metric_for_best_model=CONFIG['metric_for_best_model'],
    greater_is_better=CONFIG['greater_is_better'],
    predict_with_generate=True,
    generation_max_length=CONFIG['generation_config']['max_length'],
    generation_num_beams=CONFIG['generation_config']['num_beams'],
    dataloader_pin_memory=False,
    remove_unused_columns=True,
    report_to='none',
    push_to_hub=False,
    seed=CONFIG['random_seed'],
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_dev,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=CONFIG['early_stopping_patience'])],
)

print('Effective batch size:', training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps)
print('🚀 Starting training with FIXED settings...')
_ = trainer.train()

# Save final model and config
trainer.save_model(CONFIG['output_dir'])
tokenizer.save_pretrained(CONFIG['output_dir'])
with open(os.path.join(CONFIG['output_dir'], 'training_config.json'), 'w', encoding='utf-8') as f:
    json.dump(CONFIG, f, indent=2, ensure_ascii=False)
print('✅ Model saved to', CONFIG['output_dir'])

Effective batch size: 16
🚀 Starting training with FIXED settings...


Epoch,Training Loss,Validation Loss,Gleu
1,0.3295,1.217272,82.788994
2,0.2427,1.00066,84.815983
3,0.2131,0.891355,85.842329
4,0.1879,0.817642,86.415717
5,0.1538,0.883712,86.315371
6,0.1395,0.811158,86.634854
7,0.1257,0.84856,86.564907
8,0.1122,0.82944,86.692448


There were missing keys in the checkpoint model loaded: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight'].


✅ Model saved to ./mt5-hindi-gec-model-fixed


## 7. Inference — generate predictions and save CSV - FIXED

In [6]:
def generate_predictions(model_path: str, test_file: str, output_file: str = 'predictions.csv', batch_size: int = 16):
    print(f'Loading model from {model_path} ...')
    tok = MT5Tokenizer.from_pretrained(model_path)
    mdl = MT5ForConditionalGeneration.from_pretrained(model_path).to(device)
    mdl.eval()

    df = pd.read_csv(test_file, encoding='utf-8')
    if 'input' in df.columns:
        input_col = 'input'
    else:
        input_col = df.columns[0]
    df = df[[input_col]].copy()
    df.columns = ['input_text']
    df['input_text'] = df['input_text'].apply(clean_text)
    df = df[df['input_text'] != '']

    preds: List[str] = []
    with torch.no_grad():
        for i in tqdm(range(0, len(df), batch_size), desc='Generating'):
            batch = df.iloc[i:i+batch_size]
            # ✅ CRITICAL FIX: Use same task prefix as training
            inputs = ['grammar correction: ' + s for s in batch['input_text'].tolist()]
            enc = tok(
                inputs,
                max_length=CONFIG['max_input_length'],
                truncation=True,
                padding=True,
                return_tensors='pt',
            ).to(device)
            outputs = mdl.generate(
                **enc,
                max_length=CONFIG['generation_config']['max_length'],
                num_beams=CONFIG['generation_config']['num_beams'],
                early_stopping=CONFIG['generation_config']['early_stopping'],
                repetition_penalty=CONFIG['generation_config']['repetition_penalty'],
                no_repeat_ngram_size=CONFIG['generation_config']['no_repeat_ngram_size'],
            )
            decoded = tok.batch_decode(outputs, skip_special_tokens=True)
            preds.extend(decoded)
            if torch.cuda.is_available() and i % 100 == 0:
                torch.cuda.empty_cache()
                gc.collect()

    out_df = pd.DataFrame({
        'Input sentence': df['input_text'].tolist()[:len(preds)],
        'Output sentence': preds,
    })
    out_df.to_csv(output_file, index=False, encoding='utf-8')
    print('Predictions saved to', output_file)
    print(out_df.head())
    return out_df

# Generate predictions
_ = generate_predictions(CONFIG['output_dir'], CONFIG['dev_file'], 'predictions_fixed.csv')

Loading model from ./mt5-hindi-gec-model-fixed ...


Generating:   0%|          | 0/7 [00:00<?, ?it/s]

Predictions saved to predictions_fixed.csv
                                      Input sentence  \
0  कहते है 'शिक्षा शेरनी को वो दुध है जिसने जितना...   
1  आज-कल की विशेष बात यही है कि शिक्षा पे राजा से...   
2    जलवायु परिवर्तन आज के समय की सच्चाई बन चुकी है।   
3  आज पूरा विश्व जलवायु परिवर्तन की समस्या से जूझ...   
4  सबसे पहले हम जानते हैं कि जलवायु परिवर्तन है क...   

                                     Output sentence  
0  कहते हैं 'शिक्षा शेरनी को वो दुध है जिसने जितन...  
1  आज-कल की विशेष बात यही है कि शिक्षा पे राजा से...  
2    जलवायु परिवर्तन आज के समय की सच्चाई बन चुकी है।  
3  आज पूरा विश्व जलवायु परिवर्तन की समस्या से जूझ...  
4  सबसे पहले हम जानते हैं कि जलवायु परिवर्तन है क...  


## 8. Evaluation - Check if fixes worked

In [7]:
def gleu_proxy(preds, refs):
    scores = []
    for pred, ref in zip(preds, refs):
        pt, rt = set(str(pred).lower().split()), set(str(ref).lower().split())
        if not rt:
            scores.append(0.0)
            continue
        overlap = pt & rt
        p = len(overlap) / len(pt) if pt else 0.0
        r = len(overlap) / len(rt)
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        scores.append(f1)
    return float(np.mean(scores) * 100)

# Load references
dev_df_eval = pd.read_csv(CONFIG['dev_file'], encoding='utf-8')
in_col = 'input' if 'input' in dev_df_eval.columns else dev_df_eval.columns[0]
ref_col = 'output' if 'output' in dev_df_eval.columns else dev_df_eval.columns[1]
refs = dev_df_eval[ref_col].astype(str).tolist()

# Load model predictions
pred_df = pd.read_csv('predictions_fixed.csv', encoding='utf-8')
preds = pred_df['Output sentence'].astype(str).tolist()

# Compute metrics
gleu_model = gleu_proxy(preds, refs)
gleu_identity = gleu_proxy(dev_df_eval[in_col].astype(str).tolist(), refs)
exact_match = (pd.Series(preds) == pd.Series(refs)).mean() * 100.0

print("🔥 RESULTS AFTER FIXES:")
print(f"Dev GLEU (proxy) — FIXED model: {gleu_model:.2f}")
print(f"Dev GLEU (proxy) — identity baseline: {gleu_identity:.2f}")
print(f"Exact match rate: {exact_match:.2f}%")

# Check for <extra_id_0> tokens
extra_id_count = sum(1 for pred in preds if '<extra_id_0>' in str(pred))
print(f"\n🚨 Sentences with <extra_id_0>: {extra_id_count}/{len(preds)} ({extra_id_count/len(preds)*100:.1f}%)")

if gleu_model > gleu_identity:
    print("\n🎉 SUCCESS! Model is now better than identity baseline!")
else:
    print("\n⚠️  Model still below identity baseline. May need more training or different approach.")

# Show sample results
print("\n📋 Sample results:")
for i in range(min(5, len(pred_df))):
    print(f"{i+1}.")
    print("Input:   ", pred_df.iloc[i]['Input sentence'][:100])
    print("Pred:    ", pred_df.iloc[i]['Output sentence'][:100])
    print("Ref:     ", dev_df_eval.iloc[i][ref_col][:100])
    print()

🔥 RESULTS AFTER FIXES:
Dev GLEU (proxy) — FIXED model: 85.69
Dev GLEU (proxy) — identity baseline: 85.47
Exact match rate: 13.08%

🚨 Sentences with <extra_id_0>: 0/107 (0.0%)

🎉 SUCCESS! Model is now better than identity baseline!

📋 Sample results:
1.
Input:    कहते है 'शिक्षा शेरनी को वो दुध है जिसने जितना पिया उतना ही दहाडा है'।
Pred:     कहते हैं 'शिक्षा शेरनी को वो दुध है जिसने जितना पिया उतना ही दहाडा है'।
Ref:      कहते है 'शिक्षा शेरनी का वो दूध है जिसने जितना पिया उतना ही दहाड़ा है'।

2.
Input:    आज-कल की विशेष बात यही है कि शिक्षा पे राजा से लेकर रंक का भी अधिकार है।
Pred:     आज-कल की विशेष बात यही है कि शिक्षा पे राजा से लेकर रंक का भी अधिकार है।
Ref:      आज-कल की विशेष बात यही है कि शिक्षा पर राजा से लेकर रंक का भी अधिकार है।

3.
Input:    जलवायु परिवर्तन आज के समय की सच्चाई बन चुकी है।
Pred:     जलवायु परिवर्तन आज के समय की सच्चाई बन चुकी है।
Ref:      जलवायु परिवर्तन आज के समय की सच्चाई बन चुकी है।

4.
Input:    आज पूरा विश्व जलवायु परिवर्तन की समस्या से जूझ रहा है।
Pr