# Transliteration Project — Report

This notebook documents the transliteration seq2seq project (Romanized → Devanagari). It contains:
- Implementation notes and architecture
- Full math derivation answering the assignment questions (parameters and computations)
- Minimal code snippets to run the pipeline and visualize attention
- Remarks, results, and next steps


## 1. Architecture Overview

We implement a character-level Seq2Seq model with an RNN-based encoder and decoder. The decoder can optionally use Bahdanau additive attention. The key components in `src/models/` are:
- `encoder.py` — flexible RNN/GRU/LSTM encoder with embedding and packed sequences
- `decoder.py` — flexible decoder; supports optional Bahdanau attention
- `attention.py` — Bahdanau attention module
- `seq2seq.py` — wrapper that runs encoder and decoder and handles teacher forcing
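
To make the attention step concrete, here is a minimal pure-Python sketch of Bahdanau-style additive scoring for one decoder step. It is not the code in `attention.py`; the names `additive_attention`, `W_q`, `W_k`, and `v` are hypothetical and chosen only for illustration, and the sketch assumes at least one source position is unmasked.

```python
import math


def additive_attention(query, keys, W_q, W_k, v, mask=None):
    """Additive attention for a single decoder step (illustrative only).

    query: decoder hidden state, list of floats
    keys:  encoder outputs, list of T_enc vectors
    W_q, W_k: projection matrices (lists of rows) into a shared space
    v:     scoring vector in that shared space
    mask:  optional list of T_enc booleans; False marks padding
    Returns (weights, context).
    """
    def matvec(M, x):
        return [sum(m * xj for m, xj in zip(row, x)) for row in M]

    q_proj = matvec(W_q, query)
    scores = []
    for i, key in enumerate(keys):
        k_proj = matvec(W_k, key)
        # score_i = v . tanh(W_q q + W_k k_i)
        energy = [math.tanh(a + b) for a, b in zip(q_proj, k_proj)]
        score = sum(vj * ej for vj, ej in zip(v, energy))
        if mask is not None and not mask[i]:
            score = float("-inf")  # padding gets zero weight after softmax
        scores.append(score)

    # numerically stable softmax over source positions
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [x / total for x in exps]

    # context vector = attention-weighted sum of encoder outputs
    dim = len(keys[0])
    context = [sum(w * key[d] for w, key in zip(weights, keys)) for d in range(dim)]
    return weights, context
```

The weights returned for each decoder step form one row of the `(T_out, T_enc)` heatmap discussed in Section 8.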


## 2. Notation and assumptions (explicit)

- `e` = embedding dimension for characters
- `h` = hidden dimension for encoder and decoder (per direction)
- `T` = input sequence length (we assume input and output lengths equal for derivation)
- `V` = vocabulary size (same for source and target in this derivation)
- Single encoder layer and single decoder layer unless otherwise noted
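- Vanilla RNN cell by default; LSTM/GRU adjustments are given in Section 5
- Attention is disabled for the counts below; the attention module adds its own parameters and per-step cost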


## 3. Parameter count — symbolic derivation (vanilla RNN)

Components:
- Source embedding: `V * e`
- Target embedding: `V * e`
- Encoder RNN (1 layer): `e*h + h^2 + h`  (input-to-hidden, hidden-to-hidden, bias)
- Decoder RNN (1 layer): `e*h + h^2 + h`
- Output linear: `h*V + V`

Total parameters (vanilla RNN):

\[ P_{total} = 2 V e + 2(eh + h^2 + h) + hV + V \]

Notes:
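- If you tie source and target embeddings, subtract one `Ve` term.
- If you tie the output projection to the target embedding (reusing the embedding transpose), the `hV` weight term disappears while the bias `V` remains. This requires `e = h`, or an extra projection between the two.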
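
The formula above can be turned into a small helper for checking the count under different settings. This is a sketch of the symbolic derivation only; the function name `seq2seq_param_count` and its `tie_embeddings` flag are hypothetical and not part of the codebase.

```python
def seq2seq_param_count(V, e, h, tie_embeddings=False):
    """Parameter count for the single-layer vanilla RNN seq2seq derived above."""
    embeddings = V * e if tie_embeddings else 2 * V * e
    rnn_layer = e * h + h * h + h      # input-to-hidden, hidden-to-hidden, bias
    output_proj = h * V + V            # weight plus bias
    return embeddings + 2 * rnn_layer + output_proj


print(seq2seq_param_count(V=5000, e=128, h=256))                       # 2762120
print(seq2seq_param_count(V=5000, e=128, h=256, tie_embeddings=True))  # 2122120
```

When comparing against a real model, note that PyTorch's `nn.RNN` stores two bias vectors per layer (`bias_ih` and `bias_hh`), so `sum(p.numel() for p in model.parameters())` will come out higher by `h` per RNN layer than this single-bias derivation.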


## 4. Computation count (dominant matrix multiplies) — symbolic

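Counts below are scalar multiplications inside the dominant matrix-vector products (written "matmuls" here), for a single sequence:

- Per encoder time-step: `e*h + h^2`.
- Per decoder time-step: `e*h + h^2 + h*V` (the `h*V` term is the output projection).

Total for `T` encoder and `T` decoder steps: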

\[ \text{Total matmuls} = T \cdot (2 e h + 2 h^2 + h V) \]

To count multiplies and adds separately (FLOPs), multiply the matmul count by 2. The backward pass costs roughly 2× the forward pass, so a rough training estimate is `≈ 3 × forward_FLOPs` per step.
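
The same count can be scripted so that different `e`, `h`, `V`, and `T` are easy to try. This is a minimal sketch of the formula above; `forward_macs` is a hypothetical helper name, and the factor of 3 for training is the rough rule of thumb stated here, not a measurement.

```python
def forward_macs(e, h, V, T):
    """Scalar multiplications in the dominant matmuls for one forward pass."""
    encoder_step = e * h + h * h
    decoder_step = e * h + h * h + h * V   # includes the output projection
    return T * (encoder_step + decoder_step)


macs = forward_macs(e=128, h=256, V=5000, T=20)
forward_flops = 2 * macs             # one multiply plus one add per MAC
training_flops = 3 * forward_flops   # rough rule of thumb: forward + backward

print(macs)            # 29532160
print(forward_flops)   # 59064320
print(training_flops)  # 177192960
```

For these sizes the `h*V` output projection accounts for most of the per-step cost, so the vocabulary size matters more than the recurrent dimensions.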


## 5. LSTM / GRU adjustments (brief)
- GRU roughly multiplies the RNN parameter term by 3 (reset, update, candidate): replace `(eh + h^2 + h)` with `3*(eh + h^2) + 3*h`.
- LSTM multiplies it by 4 (input, forget, output, cell): replace with `4*(eh + h^2) + 4*h`.
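
As a quick check of these multipliers, the sketch below computes the recurrent-layer parameter count for each cell type using the single-bias convention of Section 3. The names `GATES` and `recurrent_layer_params` are hypothetical.

```python
# Number of gate/candidate blocks per cell type
GATES = {"rnn": 1, "gru": 3, "lstm": 4}


def recurrent_layer_params(e, h, cell="rnn"):
    """Parameters of one recurrent layer: g * (e*h + h^2) weights plus g * h biases."""
    g = GATES[cell]
    return g * (e * h + h * h) + g * h


for cell in GATES:
    print(cell, recurrent_layer_params(e=128, h=256, cell=cell))
# rnn 98560
# gru 295680
# lstm 394240
```

The per-step recurrent matmul cost scales by the same factor, so `e*h + h^2` in Section 4 becomes `g * (e*h + h^2)` for a cell with `g` blocks.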


## 6. Worked numeric example (careful arithmetic)

Choose: `e=128, h=256, V=5000, T=20`.

Parameter counts:
- Source emb: `5000 * 128 = 640,000`
- Target emb: `640,000`
- Encoder RNN: `128*256 + 256*256 + 256 = 32,768 + 65,536 + 256 = 98,560`
- Decoder RNN: `98,560`
- Output proj: `256*5000 + 5000 = 1,280,000 + 5,000 = 1,285,000`

Total parameters:

```
P_total = 640,000 + 640,000 + 98,560 + 98,560 + 1,285,000
        = 2,762,120 parameters (~2.76M)
```

Memory (float32): `2,762,120 * 4 bytes = 11,048,480 bytes ≈ 11.0 MB (10.5 MiB)`.
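
The memory figure depends on the bytes per parameter and on whether megabytes are decimal or binary. A small sketch (the helper name `param_memory` is hypothetical) makes both conventions explicit; it covers weights only, not activations or optimizer state.

```python
def param_memory(n_params, bytes_per_param=4):
    """Return (bytes, decimal MB, binary MiB) for storing the parameters."""
    total = n_params * bytes_per_param
    return total, total / 1e6, total / 2**20


total_bytes, mb, mib = param_memory(2_762_120)          # float32
print(total_bytes, round(mb, 2), round(mib, 2))          # 11048480 11.05 10.54

half_bytes, _, half_mib = param_memory(2_762_120, 2)     # float16
print(half_bytes, round(half_mib, 2))                    # 5524240 5.27
```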

Forward-pass matmuls:
- `2 e h = 2 * 128 * 256 = 65,536`
- `2 h^2 = 2 * 256 * 256 = 131,072`
- `h V = 256 * 5000 = 1,280,000`

Sum inside the parentheses: `1,476,608`. Multiplying by `T=20` gives `29,532,160` matmuls, or about `59,064,320` multiply-add FLOPs for the forward pass.


## 7. How to run training (quick test)

Below is a minimal cell for Colab or a local environment, to be used for a quick smoke test on the tiny sample dataset included in `data/raw/sample.tsv`.
Note that the cell only prints the Python version and the training command; copy the printed command into a shell (or a `!` cell) to start the quick test.


In [None]:
%%bash
python -V
echo "To run quick test:"
echo "python scripts/train.py --config config/model_config.yaml --quick_test"
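
Before launching the quick test, it can help to confirm that the files the command relies on are present. The sketch below checks only the three paths mentioned in this section; the helper name `missing_paths` is hypothetical, and it assumes the notebook's working directory is the repository root.

```python
from pathlib import Path

# Paths referenced by the quick-test command in this section
REQUIRED = [
    "scripts/train.py",
    "config/model_config.yaml",
    "data/raw/sample.tsv",
]


def missing_paths(root="."):
    """Return the required paths that do not exist under `root`."""
    root = Path(root)
    return [p for p in REQUIRED if not (root / p).exists()]


missing = missing_paths()
if missing:
    print("Missing files (is the working directory the repo root?):", missing)
else:
    print("All required files found.")
```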


## 8. Attention visualization (how-to)

When you enable `use_attention=True` in the model config and run evaluation, `eval_epoch` collects sample predictions. To visualize attention, perform these steps in a cell:

1. Run the model in evaluation mode on a single example and capture the attention weights the decoder returns at each step.  
2. Build a matrix of shape `(T_out, T_enc)` containing attention weights at every decoder step.  
3. Plot with matplotlib's `imshow` and annotate with source/target characters.

A short code sketch is included below (not executed here):


In [None]:
# Attention visualization sketch (run in notebook after training)
import json

import matplotlib.pyplot as plt
import numpy as np
import torch

from src.data.tokenizer import CharacterTokenizer
from src.models.encoder import Encoder
from src.models.decoder import Decoder
from src.models.seq2seq import Seq2Seq

# load vocabularies (index -> character lists) saved by train.py
with open('checkpoints/src_itos.json', 'r', encoding='utf-8') as f:
    src_itos = json.load(f)
with open('checkpoints/tgt_itos.json', 'r', encoding='utf-8') as f:
    tgt_itos = json.load(f)

src_tok = CharacterTokenizer({c: i for i, c in enumerate(src_itos)}, src_itos)
tgt_tok = CharacterTokenizer({c: i for i, c in enumerate(tgt_itos)}, tgt_itos)

# create model (hyperparameters must match the training config)
enc = Encoder(input_dim=len(src_tok.itos), embed_dim=128, hidden_dim=256, rnn_type='LSTM')
dec = Decoder(output_dim=len(tgt_tok.itos), embed_dim=128, hidden_dim=256, rnn_type='LSTM', use_attention=True)
model = Seq2Seq(enc, dec, sos_idx=tgt_tok.sos, eos_idx=tgt_tok.eos, device='cpu')

# load best checkpoint
ckpt = torch.load('checkpoints/best.pt', map_location='cpu')
model.load_state_dict(ckpt['model_state'])
model.eval()

# example: single source word
src_text = 'ghar'
src_ids = [src_tok.sos] + src_tok.encode(src_text) + [src_tok.eos]
src_tensor = torch.tensor([src_ids], dtype=torch.long)
src_lens = torch.tensor([len(src_ids)], dtype=torch.long)

max_len = 20  # upper bound on the number of decoded characters
attn_matrix = []
preds = []

# run greedy inference step-by-step, collecting attention;
# no_grad skips autograd bookkeeping, which also makes .numpy() legal on the outputs
with torch.no_grad():
    enc_outs, enc_hidden = model.encoder(src_tensor, src_lens)
    # True for real tokens, False for padding
    enc_mask = torch.arange(enc_outs.size(1)).unsqueeze(0) < src_lens.unsqueeze(1)

    dec_hidden = enc_hidden
    input_tok = torch.tensor([tgt_tok.sos], dtype=torch.long)
    for _ in range(max_len):
        logits, dec_hidden, attn = model.decoder(input_tok, dec_hidden, encoder_outputs=enc_outs, encoder_mask=enc_mask)
        pred = logits.argmax(dim=1).item()
        preds.append(pred)
        if attn is not None:
            attn_matrix.append(attn.squeeze(0).cpu().numpy())
        if pred == tgt_tok.eos:
            break
        input_tok = torch.tensor([pred], dtype=torch.long)

# plot attention heatmap
if attn_matrix:
    A = np.stack(attn_matrix)  # (T_out, T_enc)
    plt.imshow(A, aspect='auto')
    # annotate axes with characters; Devanagari labels need a font that has those glyphs
    plt.xticks(range(len(src_ids)), [src_itos[i] for i in src_ids])
    plt.yticks(range(len(attn_matrix)), [tgt_itos[i] for i in preds[:len(attn_matrix)]])
    plt.xlabel('Source positions')
    plt.ylabel('Decoder steps')
    plt.title('Attention heatmap')
    plt.colorbar()
    plt.show()
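
A cheap sanity check before trusting a heatmap is that each row of the attention matrix sums to 1. This assumes the attention module normalizes with a softmax over source positions, as Bahdanau attention does; the helper name `bad_attention_rows` is hypothetical. It works on the stacked matrix `A` from the cell above or on plain nested lists.

```python
def bad_attention_rows(A, tol=1e-4):
    """Return indices of decoder steps whose attention weights do not sum to 1."""
    return [i for i, row in enumerate(A) if abs(float(sum(row)) - 1.0) > tol]


# toy example: the second row is not normalized
example = [
    [0.7, 0.2, 0.1],
    [0.5, 0.2, 0.1],
    [0.0, 0.1, 0.9],
]
print(bad_attention_rows(example))  # [1]
```

A non-empty result usually points at a masking or normalization problem rather than a plotting issue.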


## 9. Conclusion & next steps

- The codebase is modular and ready for experiments (add beam search, transformer baseline, or multilingual training).  
- The notebook contains the complete math derivation and a numeric worked example suitable for the assignment submission.  
- Possible next steps: run a quick test and record the logs here, add beam search, or produce attention heatmaps for a few examples.
