# Banglish to Bangla Transliteration using Transformers
This notebook demonstrates the complete workflow for training a T5 Transformer model to transliterate text from Banglish (Romanized Bengali) to Bangla (Bengali script).

## Key Steps
1. Load the dataset containing transliteration pairs.
2. Split the dataset into training and validation sets.
3. Preprocess the data using a tokenizer.
4. Initialize a T5 Transformer model for conditional generation.
5. Define training arguments and train the model using `Seq2SeqTrainer`.

### Prerequisites
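- Install the required libraries: `transformers`, `datasets`, `pandas`, `torch`, and `scikit-learn` (imported as `sklearn`). `T5Tokenizer` also needs `sentencepiece`.
- The dataset is read directly from the Hugging Face Hub, so no manual download is needed; pandas resolves the `hf://` path through `huggingface_hub` and needs a Parquet engine such as `pyarrow`.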


In [None]:
import pandas as pd
from datasets import Dataset
from transformers import T5Tokenizer, T5ForConditionalGeneration
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer
from sklearn.model_selection import train_test_split
import torch

### Load Dataset
We start by loading the dataset containing Banglish-to-Bangla transliteration pairs. 
The dataset is stored in a Parquet file hosted on Hugging Face datasets.

In [None]:
df = pd.read_parquet("hf://datasets/SKNahin/bengali-transliteration-data/data/train-00000-of-00001.parquet")
print("Dataset columns:", df.columns.tolist())
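
Because the column names are short, it can help to confirm which column holds which script before training. The sketch below works on plain strings and measures the share of characters that fall in the Unicode Bengali block (U+0980 to U+09FF). The helper name `bengali_ratio` is hypothetical and not part of any library.

```python
def bengali_ratio(text: str) -> float:
    """Return the fraction of non-space characters in the Bengali Unicode block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    bengali = sum(1 for c in chars if "\u0980" <= c <= "\u09ff")
    return bengali / len(chars)


# Romanized text should score 0.0; Bengali script should score 1.0.
print(bengali_ratio("ami tomake bhalobashi"))
print(bengali_ratio("আমি তোমাকে ভালোবাসি"))
```

Applied to the DataFrame, something like `df["bn"].head().map(bengali_ratio)` would show which column is written in Bengali script.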

### Split Dataset
We split the dataset into training and validation subsets using a 90-10 ratio (`test_size=0.1`).
Fixing `random_state` makes the split reproducible across runs.

In [None]:
train_df, val_df = train_test_split(df, test_size=0.1, random_state=42)
train_dataset = Dataset.from_pandas(train_df)
val_dataset = Dataset.from_pandas(val_df)
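
As a quick check on the ratio, the split sizes can be worked out by hand. This is a sketch that assumes scikit-learn rounds the held-out count up when `test_size` is a fraction; `split_sizes` is a hypothetical helper, and the row count used is illustrative rather than the real dataset size.

```python
import math


def split_sizes(n_rows: int, test_size: float) -> tuple:
    """Return (n_train, n_val) for a fractional test_size, rounding the validation count up."""
    n_val = math.ceil(n_rows * test_size)
    return n_rows - n_val, n_val


# With test_size=0.1, about one row in ten is held out for validation.
n_train, n_val = split_sizes(5005, 0.1)
print(n_train, n_val)
```

Comparing these numbers with `len(train_dataset)` and `len(val_dataset)` confirms that the split behaved as expected.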

### Initialize Tokenizer
We use the `T5Tokenizer` for preprocessing the text. 
The tokenizer converts Banglish and Bangla text into token IDs suitable for the T5 model.

In [None]:
tokenizer = T5Tokenizer.from_pretrained("t5-small")
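
The `t5-small` vocabulary was built mostly from English plus a few European languages, so it is worth checking how much Bengali script it can represent before training. The sketch below counts the share of tokens that map to the unknown token. It assumes a Hugging Face-style tokenizer that returns `input_ids` and exposes `unk_token_id`; the helper `unk_fraction` is hypothetical.

```python
def unk_fraction(tokenizer, texts):
    """Return the share of tokens across `texts` that map to the tokenizer's unknown id."""
    total = unknown = 0
    for text in texts:
        # Skip special tokens so only the text itself is counted.
        ids = tokenizer(text, add_special_tokens=False)["input_ids"]
        total += len(ids)
        unknown += sum(1 for i in ids if i == tokenizer.unk_token_id)
    return unknown / total if total else 0.0
```

Calling `unk_fraction(tokenizer, df["bn"].head(100).tolist())` gives a rough coverage estimate; a high value would mean that much of the target text is invisible to the model.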

### Preprocessing Function
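The `preprocess_function` tokenizes the input Banglish text (`rm`, the romanized column) and the target Bangla text (`bn`).
Both sides are padded and truncated to the tokenizer's maximum length. Padding positions in the labels are set to -100 so the loss skips them, and the decoder inputs are the targets shifted right by one position.

In [None]:
def preprocess_function(examples):
    # Assumes `rm` holds romanized Banglish and `bn` holds Bengali script; confirm with df.head().
    inputs = tokenizer(examples["rm"], padding="max_length", truncation=True)
    targets = tokenizer(text_target=examples["bn"], padding="max_length", truncation=True)
    pad_id = tokenizer.pad_token_id  # also T5's decoder start token
    # Labels: mask padding with -100 so the loss ignores it.
    inputs["labels"] = [[t if t != pad_id else -100 for t in seq] for seq in targets["input_ids"]]
    # Decoder inputs: the targets shifted right by one, so position i never sees token i.
    inputs["decoder_input_ids"] = [[pad_id] + seq[:-1] for seq in targets["input_ids"]]
    return inputs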

tokenized_train = train_dataset.map(preprocess_function, batched=True)
tokenized_val = val_dataset.map(preprocess_function, batched=True)

### Initialize Model
The `T5ForConditionalGeneration` model is used for training. 
We ensure the model is moved to a GPU if available.

In [None]:
model = T5ForConditionalGeneration.from_pretrained("t5-small").to(torch.device("cuda" if torch.cuda.is_available() else "cpu"))

### Define Training Arguments
Using the `Seq2SeqTrainingArguments`, we configure the training process with parameters like:
- Learning rate
- Batch size
- Number of epochs
- Evaluation strategy

In [None]:
training_args = Seq2SeqTrainingArguments(
    output_dir="./banglish-to-bangla",
    run_name="banglish-to-bangla-v1",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    save_total_limit=3,
    label_smoothing_factor=0.1,
)
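
To see what these settings imply for run length, the number of optimizer steps can be worked out from the dataset size. This is a minimal sketch assuming a single device and no gradient accumulation (neither is set above); `total_steps` is a hypothetical helper, and the row count in the example is illustrative, not the real dataset size.

```python
import math


def total_steps(n_train_rows: int, batch_size: int, epochs: int) -> int:
    """Optimizer steps for a run on one device without gradient accumulation."""
    steps_per_epoch = math.ceil(n_train_rows / batch_size)
    return steps_per_epoch * epochs


# Illustrative numbers only: 4,500 rows, batch size 16, 3 epochs.
print(total_steps(4500, 16, 3))
```

Passing `len(train_dataset)` as the row count gives the figure for the actual split.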

### Initialize Trainer
The `Seq2SeqTrainer` wraps the training loop and evaluates on the validation set according to `eval_strategy` (here, at the end of every epoch).

In [None]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,
)
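
The trainer above is given no `compute_metrics` function, so evaluation reports the loss but no task-specific score. For transliteration, a character-level score is often easier to interpret. The sketch below implements character error rate (edit distance divided by reference length) in plain Python. Both helpers are hypothetical, and wiring them into the trainer would also require decoding the predictions, which is not shown.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, counted in Unicode code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            # Deletion, insertion, or substitution, whichever is cheapest.
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def char_error_rate(predictions, references):
    """Total edit distance divided by total reference length (0.0 for empty references)."""
    errors = sum(edit_distance(p, r) for p, r in zip(predictions, references))
    length = sum(len(r) for r in references)
    return errors / length if length else 0.0
```

Because the distance is counted per code point, Bengali vowel signs and consonants count as separate characters.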

### Train the Model
The `train` method runs the training loop and writes checkpoints to the directory set in `output_dir`.

In [None]:
print("Dataset info:", train_dataset)
trainer.train()
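
Once training finishes, the model can be tried on new input. The following is a sketch rather than part of the original workflow: `transliterate` is a hypothetical helper, `max_new_tokens=64` is an assumed limit, and the function relies on the standard `generate` and `batch_decode` methods of the model and tokenizer passed in.

```python
def transliterate(texts, model, tokenizer, max_new_tokens=64):
    """Generate Bangla output for a list of Banglish strings."""
    model.eval()  # disable dropout for inference
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    batch = batch.to(model.device)  # keep inputs on the same device as the model
    output_ids = model.generate(**batch, max_new_tokens=max_new_tokens)
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)
```

A call such as `transliterate(["ami tomake bhalobashi"], model, tokenizer)` returns one decoded string per input.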