# [CSL 7340] Project

## Instructions:
1. You can choose any one of the following tasks.
2. Evaluation is through a demo/viva.
3. The submission zip file should contain the code (as .py files) and a report (preferably in LaTeX).
4. The report must also contain links to accessible Colab notebooks (with proper output blocks).
5. Cite resources appropriately wherever needed.
6. Final evaluation will be based on the review, documentation, quality of the generated results, and discussions.



## Flags for flow control

The `FLAGS` dictionary stores global boolean values that determine the overall behaviour of the project.

In [None]:
FLAGS = dict()

# If True, this mounts and uses GDrive as persistent storage wherever the notebook supports it.
FLAGS['MOUNT_GDRIVE'] = True


# Print flag configuration warnings
error = lambda msg, **printflags: print(f'\x1b[1;33;41m ERROR \x1b[0m: \x1b[1;35m{msg}\x1b[0m', **printflags)
warn = lambda msg, **printflags: print(f'\x1b[1;33;41m Warning \x1b[0m: \x1b[1;35m{msg}\x1b[0m', **printflags)
info = lambda msg, **printflags: print(f'Info: \x1b[1;35m{msg}\x1b[0m', **printflags)
log = lambda tag, msg, **printflags: print(f'{tag}: \x1b[1;35m{msg}\x1b[0m', **printflags)

if not FLAGS['MOUNT_GDRIVE']:
  warn('GDrive mounting disabled. The notebook does not have access to persistent storage and hence will not be able to save any data across different sessions.')

## Paths to save/load data

The `PATH` dictionary stores the file and folder paths for all data in the project. Keeping them in one place makes it easy to migrate and save important files when needed.

In [None]:
import pathlib
PATH = dict()
 
PATH['workdir'] = pathlib.Path('./workdir')
 
# Mount GDrive if needed
if FLAGS['MOUNT_GDRIVE']:
  PATH['gdrive_mount'] = pathlib.Path('./gdrive')           # Used to mount GDrive (do not use for save/load)
  PATH['gdrive'] = PATH['gdrive_mount'].joinpath('MyDrive') # Used to save/load files in mounted GDrive
  # Mount GDrive
  from google.colab import drive
  drive.mount(str(PATH['gdrive_mount'].resolve()))
 
# Persistent storage
if FLAGS['MOUNT_GDRIVE']:
  PATH['workdir_save'] = PATH['gdrive'].joinpath('Classroom', 'CSL7340 - Natural Language Processing (Reg.)', 'NLP Project', 'workdir_save')
else:
  # TODO: use alternative persistent storage if no GDrive
  PATH['workdir_save'] = PATH['workdir']
log('Persistent storage at', PATH['workdir_save'])
 
# Make sure all path folders are created
for key, path in PATH.items():
  path.mkdir(parents=True, exist_ok=True)

Mounted at /content/gdrive
Persistent storage at: gdrive/MyDrive/Classroom/CSL7340 - Natural Language Processing (Reg.)/NLP Project/workdir_save


## Install libraries and dependencies

This cell installs the libraries/dependencies required for this notebook.

It can also initialize libraries or required code.

In [None]:
# Upgrade gensim
# !pip install -U gensim --quiet

# Update pytables
# Bugfix: https://stackoverflow.com/questions/54210073/pd-read-hdf-throws-cannot-set-writable-flag-to-true-of-this-array
!pip install -U tables --quiet

# Pandas
# !pip install pandas===1.2.3 --quiet

# Test versions
# import pandas, tables
# if pandas.__version__ != '1.2.3':
#   # Restart runtime
#   warn("Package update requires runtime reboot. Automatically crashing runtime. Please run this cell again.", flush=True)
#   import os, time
#   time.sleep(1)
#   os.kill(os.getpid(), 9)

# NLTK and contractions
# !pip install nltk --quiet
# # Init. NLTK
# import nltk
# nltk.download('punkt')
# nltk.download('wordnet')
# nltk.download('averaged_perceptron_tagger')
# nltk.download('omw')
# nltk.download('stopwords')

# Contractions
# !pip install contractions --quiet

# FlashText, used to replace terms in text with minimal overhead
# !pip install flashtext --quiet

# Sparse: used to store padded arrays in compressed formats and avoid memory overflow errors
# !pip install -U sparse --quiet

# Pytorch lightning wrapper library to manage pytorch code cleanly
!pip install pytorch-lightning==1.2.4 --quiet
!pip install torchmetrics

# Better dataclass than Python's standard version
# https://github.com/biqqles/dataclassy
!pip install dataclassy --quiet
from dataclassy import *

# Training visualisation
!pip install mlflow --quiet

# Kaggle dataset
# !pip install -q kaggle

# modin - out of core pandas
# !pip install modin[ray] --quiet
# import ray
# ray.init(ignore_reinit_error=True)
# import os
# os.environ['MODIN_ENGINE'] = 'ray'
# os.environ['MODIN_OUT_OF_CORE'] = 'true'
# !export MODIN_OUT_OF_CORE=true
# import modin.pandas as pd

# iNLTK
!pip install inltk --quiet
# Setup all languages
import inltk.inltk as inltk
for l in ['hi', 'pa', 'gu', 'kn', 'ml', 'or', 'mr', 'bn', 'ta', 'ur', 'ne', 'sa', 'en', 'te']:
  try:
    inltk.setup(l)
  except Exception:
    # Ignore setup failures so the remaining languages are still set up
    pass

from tqdm.notebook import tqdm

Downloading Model. This might take time, depending on your internet connection. Please be patient.
We'll only do this for the first time.


## Helper functions

### BackedUpResource

This class is used to back up and restore file/folder resources.

In [None]:
import pathlib
from abc import ABC, abstractmethod
class BackedUpResource(ABC):
  def __init__(
      self,
      resource_path: pathlib.Path,
      backup_path: pathlib.Path,
      download_path: pathlib.Path = None,
      logger = {
          'error': error,
          'info': info,
          'warn': warn,
      }
    ):
    # Fix types
    resource_path = pathlib.Path(resource_path)
    backup_path = pathlib.Path(backup_path)
    # If download path not given, use resource path's parent
    if download_path is None:
      download_path = resource_path.parent
    download_path = pathlib.Path(download_path)
    # Check ext.
    if backup_path.suffixes[-2:] == ['.tar', '.gz']:
      self._compress_arg = 'z'
    elif backup_path.suffixes[-1:] == ['.tar']:
      self._compress_arg = ''
    else:
      raise Exception(f'Invalid backup path: \'{backup_path}\'. Only \'.tar.gz\' or \'.tar\' backup file formats supported.')
    # Save params
    self.resource = resource_path
    self.backup = backup_path
    self.download_path = download_path
    self.log = logger
  def exists(self):
    return self.resource.exists() or self.backup.exists()
  @abstractmethod
  def generate(self, resource: pathlib.Path):
    '''
    Implement this to generate the resource.
    '''
    pass
  def _download(self, target, destination: pathlib.Path):
    # create destination dirs
    destination.parent.mkdir(parents=True, exist_ok=True)
    !rsync -ah --progress '{target.resolve()}' '{destination.resolve()}'
  def _upload(self, target, destination: pathlib.Path):
    # create destination dirs
    destination.parent.mkdir(parents=True, exist_ok=True)
    !rsync -ah --progress '{target.resolve()}' '{destination.resolve()}'
  def load(self, regenerate=False):
    '''
    Loads the resource from backup or generates it if it doesn't exist.
    Returns the path to the loaded resource.
    Implement the self.generate method.

    If regenerate is True, this will regenerate the resource and overwrite all backups with the new resource.
    '''
    # Declare tarfile path
    tarfile = self.resource.with_name(self.backup.name)
    def _backup():
      # compress resource
      self.log['info'](f'Adding resource \'{self.resource.name}\' to archive.')
      comp_arg = self._compress_arg
      arg_2 = tarfile.resolve()
      file_arg = '.' if self.resource.is_dir() else self.resource.name
      arg_3 = self.resource.resolve() if self.resource.is_dir() else self.resource.parent.resolve()
      !tar cvf{comp_arg} '{arg_2}' -C '{arg_3}' '{file_arg}'
      # upload
      if regenerate:
        self.log['info']('Overwriting archive to backup.')
        !rm -rf '{self.backup.resolve()}'
      else:
        self.log['info']('Uploading archive to backup.')
      self._upload(tarfile, self.backup)
      self.log['info'](f'Resource \'{self.resource.name}\' backed up at: \'{self.backup}\'')
    if self.resource.exists() and not regenerate:
      # Resource already exists.
      if not self.backup.exists():
        # Backup resource
        self.log['info'](f'Backing up resource \'{self.resource.name}\' to: \'{self.backup}\'')
        _backup()
    elif self.backup.exists() and not regenerate:
      # Load from backup
      self.log['info'](f'Loading resource \'{self.resource.name}\' from backup at: \'{self.backup}\'')
      self.resource.parent.mkdir(parents=True, exist_ok=True)
      self.log['info'](f'Downloading backup file to: \'{tarfile}\'')
      self._download(self.backup, tarfile)
      self.log['info'](f'Extracting backup file to: \'{self.resource}\'')
      # Extract backup
      arg_1 = tarfile.resolve()
      # count files in tar
      file_count = !tar tzf {arg_1} | wc -l
      file_count = int(file_count[0])
      if file_count > 1:
        # make dir
        self.resource.mkdir(parents=True, exist_ok=True)
      comp_arg = self._compress_arg
      dir_arg = self.resource.resolve() if self.resource.is_dir() else self.resource.parent.resolve()
      !tar xvf{comp_arg} {arg_1} -C {dir_arg}
    else:
      if regenerate:
        self.log['info'](f'Re-Generating resource \'{self.resource.name}\'.')
        !rm -rf '{self.resource.resolve()}'
        !rm -rf '{tarfile.resolve()}'
      else:
        # Nothing exists, generate resource
        self.log['info'](f'Generating resource \'{self.resource.name}\'.')
      self.generate(self.resource)
      if not self.resource.exists():
        raise Exception(f'Cannot backup, resource \'{self.resource.name}\' was not generated.')
      _backup()
    return self.resource


class NoGenerateResource(BackedUpResource):
  def generate(self, _):
    raise Exception('NoGenerateResource cannot be generated. Please make sure the file resource or its backup exists.')
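
The decision flow in `load` (use the local resource, else restore from backup, else generate and back up) can be sketched without notebook shell magics. This is a minimal sketch, not the class above: `archive` and `load_or_generate` are hypothetical helpers, `regenerate` is left out, and the archive stores the resource under its own name, which differs from the `tar -C <dir> .` layout used for folders in the cell above.

```python
import pathlib
import tarfile
import tempfile

def archive(resource: pathlib.Path, backup: pathlib.Path) -> None:
    """Pack a single file or a folder into a .tar or .tar.gz archive."""
    backup.parent.mkdir(parents=True, exist_ok=True)
    mode = 'w:gz' if backup.name.endswith('.tar.gz') else 'w'
    with tarfile.open(backup, mode) as tar:
        tar.add(resource, arcname=resource.name)

def load_or_generate(resource: pathlib.Path, backup: pathlib.Path, generate) -> pathlib.Path:
    """Hypothetical stand-in for BackedUpResource.load (without regenerate)."""
    if resource.exists():
        # Resource is present locally: make sure a backup exists too
        if not backup.exists():
            archive(resource, backup)
    elif backup.exists():
        # Restore from the backup archive
        resource.parent.mkdir(parents=True, exist_ok=True)
        with tarfile.open(backup, 'r:*') as tar:
            tar.extractall(resource.parent)
    else:
        # Nothing exists yet: generate the resource, then back it up
        resource.parent.mkdir(parents=True, exist_ok=True)
        generate(resource)
        if not resource.exists():
            raise FileNotFoundError(f'{resource.name} was not generated')
        archive(resource, backup)
    return resource

# Usage: generate once, then restore from the backup in a "fresh session"
calls = []
def make_file(path: pathlib.Path) -> None:
    calls.append(path.name)
    path.write_text('hello')

root = pathlib.Path(tempfile.mkdtemp())
resource = root / 'work' / 'data.txt'
backup = root / 'save' / 'data.tar.gz'

load_or_generate(resource, backup, make_file)  # generates and backs up
resource.unlink()                              # simulate losing local storage
load_or_generate(resource, backup, make_file)  # restores from the backup
restored_text = resource.read_text()
```

The second call never reaches `make_file`, which is the point of keeping the backup on persistent storage.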

## Q1. 
Machine Translation (translating sentences from one language to another)

1. Data:- Choose any corpus from https://indicnlp.ai4bharat.org/samanantar/ 
2. Detailed review of at least three papers presented at NIPS / ACL / KDD / COLING / NAACL, or a conference of a similar tier, over the last 3 years that address the task using a DL architecture.
3. Implement a DL model for solving the task. Your implementation may be one of the architectures reviewed in step 2, or a mixture, or a completely novel one.
4. Discuss the evaluation metrics used to judge the performance of the model, show the model performance using these metrics. Comment on the model's performance. Compare your results with the papers reviewed.
5. Make a clear documentation of the same along with model related information like architecture, training, validation and test splits, hyperparameters choice (and appropriate reasoning), and any other design considerations made, shortcomings of the model, limitations etc.
6. Show some examples where the model has given correct translations as well as some wrong ones.



### Download and setup data

In [None]:
import pandas as pd
class DatasetLoader(BackedUpResource):
  def __init__(
      self,
      dataset_file: pathlib.Path,
      backup_path: pathlib.Path,
      download_path: pathlib.Path = None,
      dataset_key = 'en-hi',
    ):
    super().__init__(dataset_file, backup_path, download_path)
    self.dataset_key = dataset_key
  def generate(self, dataset_file):
    # download zip file
    self.download_path.mkdir(exist_ok=True, parents=True)
    downfolder = self.download_path.resolve()
    dataset_url = "https://akpublicdata.blob.core.windows.net/indicnlp/samanatar/v0.2/samanatar-en-indic-v0.2.zip"
    self.log['info']('Downloading dataset zip file...')
    !cd '{downfolder}' && curl -L -z dataset.zip --create-dirs '{dataset_url}' -o dataset.zip.temp && mv -f dataset.zip.temp dataset.zip 2>/dev/null || rm -f dataset.zip.temp
    # create dataset folder
    downfile = downfolder.joinpath('dataset.zip')
    if not downfile.exists():
      raise Exception(f'Could not download dataset file from: {dataset_url}')
    dataset_file.mkdir(exist_ok=True, parents=True)
    self.log['info']('Extracting data...')
    temp = f'final_data/{self.dataset_key}/*'
    !cd '{downfolder}' && unzip '{downfile.resolve()}' '{temp}'
    temp = downfolder / 'final_data' / self.dataset_key
    # Move files to resource folder
    !mv '{temp}'/* '{dataset_file}'/
    self.log['info'](f'Done.')
  def load_memory(
      self,
      max_entries=25_000,
      regenerate=False
      ):
    folder = self.load(regenerate=regenerate)
    a,b = self.dataset_key.split('-')
    df = dict()
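    from itertools import islice
    for l in (a, b):
      with open(folder / f'train.{l}', encoding='utf-8') as f:
        # islice stops cleanly if the file has fewer than max_entries lines
        df[l] = [line.strip() for line in islice(f, max_entries)]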
    return pd.DataFrame(df)
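
The corpus is stored as two line-aligned text files (`train.en`, `train.hi`), so line *i* of one file is the translation of line *i* of the other. A minimal sketch of reading the first few aligned pairs, assuming that layout; `read_parallel` is a hypothetical helper and the tiny corpus is made up for illustration:

```python
import itertools
import pathlib
import tempfile

def read_parallel(folder, src_lang, target_lang, max_entries):
    """Read up to max_entries aligned sentence pairs from train.<lang> files."""
    src_path = pathlib.Path(folder) / f'train.{src_lang}'
    target_path = pathlib.Path(folder) / f'train.{target_lang}'
    with open(src_path, encoding='utf-8') as fs, open(target_path, encoding='utf-8') as ft:
        # zip keeps the two files aligned; islice stops after max_entries pairs
        pairs = itertools.islice(zip(fs, ft), max_entries)
        return [(s.strip(), t.strip()) for s, t in pairs]

# Usage with a tiny made-up corpus
folder = pathlib.Path(tempfile.mkdtemp())
(folder / 'train.en').write_text('one\ntwo\nthree\n', encoding='utf-8')
(folder / 'train.hi').write_text('एक\nदो\nतीन\n', encoding='utf-8')
pairs = read_parallel(folder, 'en', 'hi', max_entries=2)
```

Reading both files in one pass guarantees the two columns have the same length even if one file is shorter than the other.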

In [None]:
# Setup paths
PATH['Q1'] = PATH['workdir'] / 'Q1'
PATH['Q1_save'] = PATH['workdir_save'] / 'Q1'

PATH['Q1'].mkdir(parents=True, exist_ok=True)

# Select language pair
LANG_SRC = 'en'
LANG_TARGET = 'hi'
DATASET_KEY = f'{LANG_SRC}-{LANG_TARGET}'

dataset_loader = DatasetLoader(
    PATH['Q1'] / f'dataset_{DATASET_KEY}',
    PATH['Q1_save'] / f'dataset_{DATASET_KEY}.tar.gz',
    dataset_key = DATASET_KEY,
)

# Load data
if not('DATA' in locals()):
  DATA = dataset_loader.load_memory()
DATA

Info: Loading resource 'dataset_en-hi' from backup at: 'gdrive/MyDrive/Classroom/CSL7340 - Natural Language Processing (Reg.)/NLP Project/workdir_save/Q1/dataset_en-hi.tar.gz'
Info: Downloading backup file to: 'workdir/Q1/dataset_en-hi.tar.gz'
sending incremental file list
dataset_en-hi.tar.gz
        849.71M 100%   58.97MB/s    0:00:13 (xfr#1, to-chk=0/1)
Info: Extracting backup file to: 'workdir/Q1/dataset_en-hi'
./
./train.en
./train.hi


Unnamed: 0,en,hi
0,"In reply, Pakistan got off to a solid start.",जिसके जवाब में पाक ने अच्छी शुरुआत की थी.
1,The European Union has seven principal decisio...,यूरोपीय संघ के महत्वपूर्ण संस्थानों में यूरोपि...
2,The Congress leader represents Sivaganga Lok S...,कांग्रेस नेता तमिलनाडु से शिवगंगा लोकसभा क्षेत...
3,Prompt the user about connection attempts,संबंधन प्रयास के बारे में उपयोक्ता को प्रांप्ट...
4,"Further, the Minister announced that Deposit I...",वित्त मंत्री ने घोषणा कि जमा बीमा और ऋण गारंटी...
...,...,...
24995,"All of you, stay back!",ओरेकल!
24996,"Yes, even though we are imperfect and make mis...","माना कि हम असिद्ध हैं और गलतियाँ करते हैं, फिर..."
24997,It's not the President.,यह राष्ट्रपति नहीं है।
24998,When so...,इस बारे में जब .


### Tokenize data

In [None]:
class SeriesApplyDataLoader(BackedUpResource):
  def __init__(
      self,
      df_col,
      apply_func,
      dataset_file: pathlib.Path,
      backup_path: pathlib.Path,
      download_path: pathlib.Path = None,
    ):
    assert dataset_file.name[-5:]=='.hdf5'
    super().__init__(dataset_file, backup_path, download_path)
    self.apply_func = apply_func
    self.df = df_col
  def generate(self, dataset_file):
    self.log['info']('Applying function to df...')
    tqdm.pandas()
    df = self.df.progress_apply(self.apply_func).astype('object')
    # Save df
    self.log['info']('Saving result to hdf5...')
    df.to_hdf(dataset_file, key='df')
  def load_memory(self, regenerate=False):
    f = self.load(regenerate=regenerate)
    return pd.read_hdf(f, key='df')

In [None]:
# get Tokenize function
from torchtext.data.utils import get_tokenizer
en_tokenizer = get_tokenizer('spacy', language='en_core_web_sm')

# Make tokenized data
src_tok_loader = SeriesApplyDataLoader(
    df_col=DATA[LANG_SRC],
    apply_func=en_tokenizer if LANG_SRC=='en' else lambda e: inltk.tokenize(e, LANG_SRC),
    dataset_file=PATH['Q1'] / f'{LANG_SRC}_tok.hdf5',
    backup_path=PATH['Q1_save'] / f'{LANG_SRC}_tok.tar.gz'
)
target_tok_loader = SeriesApplyDataLoader(
    df_col=DATA[LANG_TARGET],
    apply_func=en_tokenizer if LANG_TARGET=='en' else lambda e: inltk.tokenize(e, LANG_TARGET),
    dataset_file=PATH['Q1'] / f'{LANG_TARGET}_tok.hdf5',
    backup_path=PATH['Q1_save'] / f'{LANG_TARGET}_tok.tar.gz'
)

# Load tokenized data
DATA['en_tok'] = src_tok_loader.load_memory()
DATA['hi_tok'] = target_tok_loader.load_memory()

# Show data
DATA

Info: Loading resource 'en_tok.hdf5' from backup at: 'gdrive/MyDrive/Classroom/CSL7340 - Natural Language Processing (Reg.)/NLP Project/workdir_save/Q1/en_tok.tar.gz'
Info: Downloading backup file to: 'workdir/Q1/en_tok.tar.gz'
sending incremental file list
en_tok.tar.gz
          1.31M 100%  135.45MB/s    0:00:00 (xfr#1, to-chk=0/1)
Info: Extracting backup file to: 'workdir/Q1/en_tok.hdf5'
en_tok.hdf5
Info: Loading resource 'hi_tok.hdf5' from backup at: 'gdrive/MyDrive/Classroom/CSL7340 - Natural Language Processing (Reg.)/NLP Project/workdir_save/Q1/hi_tok.tar.gz'
Info: Downloading backup file to: 'workdir/Q1/hi_tok.tar.gz'
sending incremental file list
hi_tok.tar.gz
          1.91M 100%    4.79MB/s    0:00:00 (xfr#1, to-chk=0/1)
Info: Extracting backup file to: 'workdir/Q1/hi_tok.hdf5'
hi_tok.hdf5


Unnamed: 0,en,hi,en_tok,hi_tok
0,"In reply, Pakistan got off to a solid start.",जिसके जवाब में पाक ने अच्छी शुरुआत की थी.,"[In, reply, ,, Pakistan, got, off, to, a, soli...","[▁जिसके, ▁जवाब, ▁में, ▁पाक, ▁ने, ▁अच्छी, ▁शुरु..."
1,The European Union has seven principal decisio...,यूरोपीय संघ के महत्वपूर्ण संस्थानों में यूरोपि...,"[The, European, Union, has, seven, principal, ...","[▁यूरोपीय, ▁संघ, ▁के, ▁महत्वपूर्ण, ▁संस्थानों,..."
2,The Congress leader represents Sivaganga Lok S...,कांग्रेस नेता तमिलनाडु से शिवगंगा लोकसभा क्षेत...,"[The, Congress, leader, represents, Sivaganga,...","[▁कांग्रेस, ▁नेता, ▁तमिलनाडु, ▁से, ▁शिव, गंगा,..."
3,Prompt the user about connection attempts,संबंधन प्रयास के बारे में उपयोक्ता को प्रांप्ट...,"[Prompt, the, user, about, connection, attempts]","[▁संबंध, न, ▁प्रयास, ▁के, ▁बारे, ▁में, ▁उपयोक्..."
4,"Further, the Minister announced that Deposit I...",वित्त मंत्री ने घोषणा कि जमा बीमा और ऋण गारंटी...,"[Further, ,, the, Minister, announced, that, D...","[▁वित्त, ▁मंत्री, ▁ने, ▁घोषणा, ▁कि, ▁जमा, ▁बीम..."
...,...,...,...,...
24995,"All of you, stay back!",ओरेकल!,"[All, of, you, ,, stay, back, !]","[▁ओरेकल, !]"
24996,"Yes, even though we are imperfect and make mis...","माना कि हम असिद्ध हैं और गलतियाँ करते हैं, फिर...","[Yes, ,, even, though, we, are, imperfect, and...","[▁माना, ▁कि, ▁हम, ▁अ, सिद्ध, ▁हैं, ▁और, ▁गलत, ..."
24997,It's not the President.,यह राष्ट्रपति नहीं है।,"[It, 's, not, the, President, .]","[▁यह, ▁राष्ट्रपति, ▁नहीं, ▁है, ।]"
24998,When so...,इस बारे में जब .,"[When, so, ...]","[▁इस, ▁बारे, ▁में, ▁जब, ▁.]"
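
The Hindi tokens in the table above are subword pieces in which `▁` marks the start of a new word, so a word such as शिवगंगा can be split into `▁शिव` and `गंगा`. A minimal sketch of joining such pieces back into text, assuming the marker is used exactly as it appears in the table; `detokenize` is a hypothetical helper, not part of iNLTK:

```python
def detokenize(pieces):
    """Join subword pieces into a string; '▁' marks the start of a word."""
    return ''.join(pieces).replace('▁', ' ').strip()

sentence = detokenize(['▁यह', '▁राष्ट्रपति', '▁नहीं', '▁है', '।'])
word = detokenize(['▁शिव', 'गंगा'])
```

Pieces without a leading `▁` attach to the previous piece, which is why the join uses an empty separator.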


### Build vocabulary

In [None]:
# Reference: https://pytorch.org/tutorials/beginner/translation_transformer.html
from collections import Counter
from torchtext.vocab import Vocab

en_Counter = Counter()
en_max_seq_length = 0
for v in tqdm(DATA['en_tok']):
  en_Counter.update(v)
  en_max_seq_length = max(en_max_seq_length, len(v))

hi_Counter = Counter()
hi_max_seq_length = 0
for v in tqdm(DATA['hi_tok']):
  hi_Counter.update(v)
  hi_max_seq_length = max(hi_max_seq_length, len(v))

# Make Vocabs
EN_VOCAB = Vocab(en_Counter, specials=['<unk>', '<pad>', '<bos>', '<eos>'])
HI_VOCAB = Vocab(hi_Counter, specials=['<unk>', '<pad>', '<bos>', '<eos>'])

PAD_IDX = EN_VOCAB['<pad>']
BOS_IDX = EN_VOCAB['<bos>']
EOS_IDX = EN_VOCAB['<eos>']

# Show data
log('EN_VOCAB size', len(EN_VOCAB))
log('EN max seq. length', en_max_seq_length)
log('HI_VOCAB size', len(HI_VOCAB))
log('HI max seq. length', hi_max_seq_length)

# Show sample data
info('Sample stoi data:-')
list(HI_VOCAB.stoi.items())[:10]


EN_VOCAB size: 35970
EN max seq. length: 227
HI_VOCAB size: 20782
HI max seq. length: 333
Info: Sample stoi data:-


[('<unk>', 0),
 ('<pad>', 1),
 ('<bos>', 2),
 ('<eos>', 3),
 ('▁के', 4),
 ('।', 5),
 ('▁में', 6),
 ('▁है', 7),
 (',', 8),
 ('▁की', 9)]
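
The sample above shows the layout of the vocabulary: the four special tokens take indices 0 to 3 and the remaining tokens follow in order of decreasing frequency. A minimal sketch of that mapping in plain Python; `build_vocab` and `encode` are hypothetical helpers, and the alphabetical tie-break between equally frequent tokens is an assumption rather than a statement about torchtext's behaviour:

```python
from collections import Counter

SPECIALS = ['<unk>', '<pad>', '<bos>', '<eos>']

def build_vocab(counter, specials=SPECIALS):
    """Return (itos, stoi): specials first, then tokens by descending frequency."""
    ordered = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, _ in ordered if tok not in specials]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

def encode(tokens, stoi):
    """Map tokens to indices; unseen tokens fall back to <unk>."""
    return [stoi.get(tok, stoi['<unk>']) for tok in tokens]

# Usage on a tiny made-up corpus
counter = Counter()
for sent in [['the', 'cat', 'sat'], ['the', 'dog']]:
    counter.update(sent)
itos, stoi = build_vocab(counter)
ids = encode(['the', 'bird'], stoi)
```

Because both vocabularies are built with the same list of specials, `<pad>`, `<bos>` and `<eos>` get the same indices in each, which is what lets the notebook reuse `PAD_IDX`, `BOS_IDX` and `EOS_IDX` for both languages.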

### Build dataloader

In [None]:
import pytorch_lightning as pl
from torch.utils.data import DataLoader
import torch
from torch.nn.utils.rnn import pad_sequence
from sklearn.model_selection import train_test_split
from typing import Any
class BatchingDataModule(pl.LightningDataModule):
  @dataclass(frozen=True)
  class Parameters:
    @dataclass
    class LangParams:
      data: Any
      vocab: Vocab
      pad_idx: int
      bos_idx:int
      eos_idx: int
    src: 'BatchingDataModule.Parameters.LangParams'
    target: 'BatchingDataModule.Parameters.LangParams'
    test_size_ratio: float = 0.2
    val_size_ratio: float = 0.1
    batch_size: int
  def __init__(self, params: 'BatchingDataModule.Parameters'):
    super().__init__()
    self.params = params
  def prepare_data(self):
    X = self.params.src.data
    Y = self.params.target.data
    # Split train test data
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=self.params.test_size_ratio)
    # Split train val data
    X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size=self.params.val_size_ratio)
    # save data
    self.train = (X_train, Y_train)
    self.test = (X_test, Y_test)
    self.val = (X_val, Y_val)
  def collate_fn(self, batch):
    def sentence_to_tensor(s, lang):
      return torch.tensor(
          [lang.bos_idx] + [lang.vocab[tk] for tk in s] + [lang.eos_idx],
          dtype=torch.long
        )
    x, y = [i[0] for i in batch], [i[1] for i in batch]
    x = pad_sequence([sentence_to_tensor(i, self.params.src) for i in x], padding_value=self.params.src.pad_idx)
    y = pad_sequence([sentence_to_tensor(i, self.params.target) for i in y], padding_value=self.params.target.pad_idx)
    return x, y
  def train_dataloader(self):
    return DataLoader(list(zip(*self.train)), batch_size=self.params.batch_size, collate_fn=self.collate_fn)
  def val_dataloader(self):
    return DataLoader(list(zip(*self.val)), batch_size=self.params.batch_size, collate_fn=self.collate_fn)
  def test_dataloader(self):
    return DataLoader(list(zip(*self.test)), batch_size=self.params.batch_size, collate_fn=self.collate_fn)

# Sample Datamodule
sample_datamodule = BatchingDataModule(
    BatchingDataModule.Parameters(
        src=BatchingDataModule.Parameters.LangParams(
            data=DATA['en_tok'],
            vocab=EN_VOCAB,
            pad_idx=PAD_IDX,
            bos_idx=BOS_IDX,
            eos_idx=EOS_IDX,
        ),
        target=BatchingDataModule.Parameters.LangParams(
            data=DATA['hi_tok'],
            vocab=HI_VOCAB,
            pad_idx=PAD_IDX,
            bos_idx=BOS_IDX,
            eos_idx=EOS_IDX,
        ),
        batch_size=1,
    )
)
sample_datamodule.prepare_data()
sample_batch = next(iter(sample_datamodule.train_dataloader()))
log('Sample batch size', sample_batch[0].shape[1])
log('Sample batch src shape', sample_batch[0].shape)
log('Sample batch target shape', sample_batch[1].shape)

Sample batch size: 1
Sample batch src shape: torch.Size([54, 1])
Sample batch target shape: torch.Size([114, 1])
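
The shapes above are sequence-first, `(seq_len, batch)`, because `pad_sequence` stacks the padded sentences along the second dimension by default. A minimal sketch of what `collate_fn` does to a batch, using plain lists instead of tensors; `pad_batch` is a hypothetical helper and the token ids are made up:

```python
def pad_batch(sentences, bos_idx, eos_idx, pad_idx):
    """Wrap each index list in <bos>/<eos>, pad to equal length, return one row per time step."""
    wrapped = [[bos_idx] + list(s) + [eos_idx] for s in sentences]
    max_len = max(len(s) for s in wrapped)
    padded = [s + [pad_idx] * (max_len - len(s)) for s in wrapped]
    # Transpose from (batch, seq_len) to (seq_len, batch)
    return [list(step) for step in zip(*padded)]

# Usage: two sentences of different lengths
batch = pad_batch([[10, 11, 12], [20]], bos_idx=2, eos_idx=3, pad_idx=1)
```

Each inner list holds one time step for every sentence in the batch, so the shorter sentence is filled with `pad_idx` after its `<eos>`.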


### Token embedding and positional encoding

In [None]:
import math
class Positional_Embedding(pl.LightningModule):
  @dataclass(frozen=True)
  class Parameters:
    embedding_dim: int
    max_seq_length: int
    dropout: float = 0.1
  def __init__(self, params: 'Positional_Embedding.Parameters'):
    super().__init__()
    self.params = params
    # Pre-compute positional embeddings
    den = torch.exp( -torch.arange(0, params.embedding_dim, 2, device=self.device) * math.log(10000) / params.embedding_dim)
    pos = torch.arange(0, params.max_seq_length, device=self.device).reshape(params.max_seq_length, 1)
    pos_embedding = torch.zeros((params.max_seq_length, params.embedding_dim), device=self.device)
    pos_embedding[:, 0::2] = torch.sin(pos * den)
    pos_embedding[:, 1::2] = torch.cos(pos * den)
    pos_embedding = pos_embedding.unsqueeze(-2)
    # Store pre-computed data and dropout layer
    self.dropout = torch.nn.Dropout(params.dropout)
    self.register_buffer('pos_embedding', pos_embedding)
  def forward(self, token_embedding: torch.Tensor):
    return self.dropout(token_embedding + self.pos_embedding[:token_embedding.size(0),:])

class Token_Embedding(pl.LightningModule):
  @dataclass(frozen=True)
  class Parameters:
    embedding_dim: int
    vocab_size: int
  def __init__(self, params: 'Token_Embedding.Parameters'):
    super().__init__()
    self.params = params
    self.emb = torch.nn.Embedding(params.vocab_size, params.embedding_dim)
  def forward(self, tokens: torch.Tensor):
    return self.emb(tokens.long()) * math.sqrt(self.params.embedding_dim)
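
The values precomputed in `Positional_Embedding` follow the sinusoidal scheme: even dimensions hold `sin(pos / 10000^(i / d))` and the following odd dimension holds the matching cosine, where `i` is the even dimension index and `d` is the embedding size. A minimal sketch of the same computation with plain Python lists; `positional_encoding` is a hypothetical helper and the small sizes are chosen only for illustration:

```python
import math

def positional_encoding(max_seq_length, embedding_dim):
    """Return a max_seq_length x embedding_dim table of sinusoidal encodings."""
    pe = [[0.0] * embedding_dim for _ in range(max_seq_length)]
    for pos in range(max_seq_length):
        for i in range(0, embedding_dim, 2):
            # Same frequency term as exp(-i * log(10000) / embedding_dim) in the class above
            angle = pos * math.exp(-i * math.log(10000) / embedding_dim)
            pe[pos][i] = math.sin(angle)
            if i + 1 < embedding_dim:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(max_seq_length=4, embedding_dim=4)
```

Position 0 always encodes to alternating 0 and 1, and higher dimensions oscillate more slowly, which gives every position a distinct pattern.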

### Transformer Model

In [None]:
import pytorch_lightning as pl
from torchtext.vocab import Vocab
from torchmetrics.functional import bleu_score
import torch

class Transformer_Model(pl.LightningModule):
  @dataclass(frozen=True)
  class Parameters:
    @dataclass
    class LangParams:
      vocab: Vocab
      max_seq_length: int
      pad_idx: int
      bos_idx:int
      eos_idx: int
    src: 'Transformer_Model.Parameters.LangParams'
    target: 'Transformer_Model.Parameters.LangParams'
    encoder_layers: int
    decoder_layers: int
    embedding_dim: int
    learning_rate: float = 0.0001
    attention_heads: int = 8
    linear_dim: int = 512
    dropout: float = 0.1
  def __init__(self, params: 'Transformer_Model.Parameters'):
    super().__init__()
    self.params = params
    # Encoder layer
    enc_layer = torch.nn.TransformerEncoderLayer(
        d_model=params.embedding_dim,
        nhead=params.attention_heads,
        dim_feedforward=params.linear_dim,
        dropout=params.dropout,
    )
    # Encoder
    self.enc = torch.nn.TransformerEncoder(
        enc_layer, num_layers=params.encoder_layers
    )
    # Decoder layer
    dec_layer = torch.nn.TransformerDecoderLayer(
        d_model=params.embedding_dim,
        nhead=params.attention_heads,
        dim_feedforward=params.linear_dim,
        dropout=params.dropout,
    )
    # Decoder
    self.dec = torch.nn.TransformerDecoder(
        dec_layer, num_layers=params.decoder_layers
    )
    # Generator
    self.gen = torch.nn.Linear(params.embedding_dim, len(params.target.vocab))
    # Token Embedding layers
    self.src_emb = Token_Embedding(Token_Embedding.Parameters(
        embedding_dim=params.embedding_dim,
        vocab_size=len(params.src.vocab)
    ))
    self.target_emb = Token_Embedding(Token_Embedding.Parameters(
        embedding_dim=params.embedding_dim,
        vocab_size=len(params.target.vocab)
    ))
    # Positional Embedding layer
    self.pos_emb = Positional_Embedding(Positional_Embedding.Parameters(
        embedding_dim=params.embedding_dim,
        max_seq_length=5000,
        dropout=params.dropout,
    ))
  
  def create_masks(self, src, target):
    # Padded sequence lengths
    src_seq_len = src.shape[0]
    target_seq_len = target.shape[0]
    # target mask
    target_mask = (torch.triu(torch.ones((target_seq_len, target_seq_len), device=self.device)) == 1).transpose(0, 1)
    target_mask = target_mask.float().masked_fill(target_mask == 0, float('-inf')).masked_fill(target_mask == 1, float(0.0))
    # source mask
    src_mask = torch.zeros((src_seq_len, src_seq_len), device=self.device).bool()
    # padding masks
    src_padding_mask = (src == self.params.src.pad_idx).transpose(0, 1)
    target_padding_mask = (target == self.params.target.pad_idx).transpose(0, 1)
    return src_mask, target_mask, src_padding_mask, target_padding_mask
  
  def forward(self, src, target, src_mask, target_mask, src_padding_mask, target_padding_mask, memory_key_padding_mask):
    # token embedding and positional encoding
    src_emb = self.pos_emb(self.src_emb(src))
    target_emb = self.pos_emb(self.target_emb(target))
    # encode
    memory = self.enc(src_emb, src_mask, src_padding_mask)
    # decode
    outs = self.dec(target_emb, memory, target_mask, None, target_padding_mask, memory_key_padding_mask)
    # Final linear layer
    return self.gen(outs)
  
  def translate(self, src_sentence, src_tokenizer, max_tokens=500):
    '''
    Accepts a sentence in source language and returns a target language sentence.
    Parameters:-
      src_sentence: Input sentence string
      src_tokenizer: Callable that is used to tokenize the source sentence.
    '''
    src_params = self.params.src
    tokens = [src_params.bos_idx] + [src_params.vocab.stoi[i] for i in src_tokenizer(src_sentence)] + [src_params.eos_idx]
    src = torch.LongTensor(tokens).to(self.device)
    ys = self.translate_tokenized(src, max_tokens)
    # Convert ys to string and return
    return "".join([self.params.target.vocab.itos[i] for i in ys]).replace("<bos>", "").replace("<eos>", "").replace("▁", " ")
    
  def translate_tokenized(self, src: torch.Tensor, max_tokens=500):
    '''
    Accepts a tokenized sentence tensor in source language and returns a tokenized sentence in target language.
    Note the tokens include <bos> and <eos>.
    '''
    src_params = self.params.src
    src_len = len(src)
    src = src.reshape(src_len, 1)
    src_mask = torch.zeros(src_len, src_len, device=self.device).bool()
    # Encode
    memory = self.enc(self.pos_emb(self.src_emb(src)), src_mask)
    # The tensor that stores the generated sentence
    ys = torch.ones(1, 1, device=self.device).fill_(self.params.target.bos_idx).long()
    # Decoder loop (the first token is always <bos>)
    for i in range(max_tokens):
      target_mask = (torch.triu(torch.ones((ys.size(0), ys.size(0)), device=self.device)) == 1).transpose(0, 1)
      target_mask = target_mask.float().masked_fill(target_mask == 0, float('-inf')).masked_fill(target_mask == 1, float(0.0))
      # Decode next token
      out = self.dec(self.pos_emb(self.target_emb(ys)), memory, target_mask).transpose(0, 1)
      prob = self.gen(out[:, -1])
      _, next_word = torch.max(prob, dim = 1)
      next_word = next_word.item()
      # Break if the model generated the target <eos>
      if next_word == self.params.target.eos_idx:
        break
      # Prepare ys for next iteration
      ys = torch.cat([ys, torch.ones(1, 1, device=self.device).type_as(src.data).fill_(next_word)], dim=0)
    return ys.flatten()
  
  def training_step(self, batch, batch_idx):
    ret = self._step(batch, batch_idx)
    # log loss
    self.log('train_loss', ret['loss'], on_epoch=True, on_step=True, prog_bar=True, logger=True)
    return ret
  
  def validation_step(self, batch, batch_idx):
    ret = self._step(batch, batch_idx)
    # log loss
    self.log('val_loss', ret['loss'], on_epoch=True, on_step=False, prog_bar=True, logger=True)
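    # measure the average BLEU score over the sentences in the batch;
    # batches are sequence-first, so transpose to iterate sentence by sentence
    src_batch, target_batch = batch
    itos = self.params.target.vocab.itos
    skip = {self.params.target.pad_idx, self.params.target.bos_idx, self.params.target.eos_idx}
    bleu_sum = 0
    for x, y in zip(src_batch.T, target_batch.T):
      x = x[x != self.params.src.pad_idx]  # drop source padding
      translated = [itos[i] for i in self.translate_tokenized(x, max_tokens=len(y)).tolist() if i not in skip]
      target = [itos[i] for i in y.tolist() if i not in skip]
      bleu_sum += bleu_score([translated], [[target]])
    self.log('average_bleu_score', bleu_sum / src_batch.shape[1], on_epoch=True, on_step=False, prog_bar=True, logger=True)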
    return ret
  
  def _step(self, batch, batch_idx):
    x, y = batch
    # the model generates token-by-token and the 1st token <bos> is always known
    # hence, the decoder input is the target without its last token,
    # and the loss is computed against the target shifted left by one position
    y_shifted, target_pred = y[:-1,:], y[1:,:]
    # create masks
    x_mask, y_mask, x_padding_mask, y_padding_mask = self.create_masks(x, y_shifted)
    # predict
    pred = self(x, y_shifted, x_mask, y_mask, x_padding_mask, y_padding_mask, x_padding_mask)
    # calc loss
    a = pred.reshape(-1, pred.shape[-1])
    b = target_pred.reshape(-1)
    loss = torch.nn.functional.cross_entropy(a, b, ignore_index=self.params.target.pad_idx)
    return {'loss': loss}
  
  def configure_optimizers(self):
    optimizer = torch.optim.Adam(self.parameters(), lr=self.params.learning_rate)
    return optimizer

# Sample model training
sample_model = Transformer_Model(Transformer_Model.Parameters(
    src=Transformer_Model.Parameters.LangParams(
        vocab=EN_VOCAB,
        max_seq_length=en_max_seq_length,
        pad_idx=PAD_IDX,
        bos_idx=BOS_IDX,
        eos_idx=EOS_IDX,
    ),
    target=Transformer_Model.Parameters.LangParams(
        vocab=HI_VOCAB,
        max_seq_length=hi_max_seq_length,
        pad_idx=PAD_IDX,
        bos_idx=BOS_IDX,
        eos_idx=EOS_IDX,
    ),
    encoder_layers=1,
    decoder_layers=1,
    embedding_dim=128,
))

# This will print gibberish
from torchmetrics.functional import bleu_score
sample_translation = sample_model.translate("It's not the President.", en_tokenizer, max_tokens=10)
log('Sample src sentence', "It's not the President.")
log('Sample translation (not trained yet)', sample_translation)
log('Sample translation BLEU score(not trained yet)', bleu_score([inltk.tokenize(sample_translation, LANG_TARGET)], [[["▁यह", "▁राष्ट्रपति", "▁नहीं", "▁है", "।"]]]))

# Show sample model
sample_model

# sample_trainer = pl.Trainer(
#     max_epochs=1,
# )
# sample_trainer.fit(sample_model, sample_datamodule)

Sample src sentence: It's not the President.
Sample translation (not trained yet): लोभ ऋतु माइल मीठे St0% विस् वायुसेना प्राचार्य आन
Sample translation BLEU score(not trained yet): 0.0
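
The translation printed above comes from a greedy decoding loop: start from `<bos>`, repeatedly append the highest-scoring next token, and stop at `<eos>` or after `max_tokens` steps. The following is a minimal framework-free sketch of that loop. The token ids and `toy_scores` are hypothetical stand-ins for the notebook's special-token indices and its decoder plus generator.

```python
from typing import Callable, List, Sequence

BOS, EOS = 0, 1  # hypothetical special-token ids, not the notebook's BOS_IDX / EOS_IDX


def greedy_decode(score_fn: Callable[[List[int]], Sequence[float]], max_tokens: int) -> List[int]:
    """Return the generated ids, starting with BOS and excluding EOS."""
    ys = [BOS]
    for _ in range(max_tokens):
        scores = score_fn(ys)
        # Pick the id with the highest score (argmax over the vocabulary)
        next_word = max(range(len(scores)), key=scores.__getitem__)
        if next_word == EOS:
            break
        ys.append(next_word)
    return ys


def toy_scores(prefix: List[int]) -> List[float]:
    """Hypothetical scorer over a 4-token vocabulary: emits 2, then 3, then EOS."""
    plan = {1: 2, 2: 3}
    best = plan.get(len(prefix), EOS)
    return [1.0 if i == best else 0.0 for i in range(4)]


decoded = greedy_decode(toy_scores, max_tokens=10)    # stops when EOS wins
truncated = greedy_decode(toy_scores, max_tokens=1)   # stops at the token budget
print(decoded, truncated)
```

Because only the single best token is kept at each step, an untrained or weakly trained model tends to fall into loops of frequent tokens.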


Transformer_Model(
  (enc): TransformerEncoder(
    (layers): ModuleList(
      (0): TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): _LinearWithBias(in_features=128, out_features=128, bias=True)
        )
        (linear1): Linear(in_features=128, out_features=512, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): Linear(in_features=512, out_features=128, bias=True)
        (norm1): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inplace=False)
      )
    )
  )
  (dec): TransformerDecoder(
    (layers): ModuleList(
      (0): TransformerDecoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): _LinearWithBias(in_features=128, out_features=128, bias=True)
        )
        (multihead_attn): MultiheadAttention(
          (out_proj): _LinearW
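
The `_step` method above feeds the decoder the target without its last token and scores the predictions against the target shifted by one position, skipping padded positions. A small sketch of that shift on a single sequence held in a plain list is given below. The token ids are hypothetical, and the notebook applies the same slicing along the first axis of a `(seq_len, batch_size)` tensor.

```python
from typing import List, Tuple

BOS, EOS, PAD = 0, 1, 2  # hypothetical special-token ids


def shift_for_teacher_forcing(y: List[int]) -> Tuple[List[int], List[int]]:
    """Split a target sequence into decoder input and prediction labels."""
    return y[:-1], y[1:]


# A padded target sentence: <bos> 7 8 <eos> <pad> <pad>
y = [BOS, 7, 8, EOS, PAD, PAD]
decoder_input, labels = shift_for_teacher_forcing(y)

# Position i of the decoder input is trained to predict labels[i]
pairs = list(zip(decoder_input, labels))

# Positions whose label is PAD contribute nothing to the loss
# (this mirrors ignore_index=pad_idx in the cross-entropy call)
scored_labels = [t for t in labels if t != PAD]
print(pairs, scored_labels)
```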

### Show mlflow tracking webpage

All logs generated by the training cell below are shown on this page, along with the data from all previous runs. Due to some bugs, I cannot assign names to runs yet, but every run is recorded and the performance of earlier runs can still be inspected.

**NOTE:** Right click inside the IFrame and click `Reload Frame` to see new runs.

**OR:** Click on the `Experiments` tab to refresh the list.

In [None]:
%%html
<div width="-webkit-fill-available">
  <iframe src="https://dagshub.com/jaideepheer/CSL7340-NLP.Project/experiments/" width="100%" height="1024"></iframe>
</div>

### Train the model

In [None]:
# Hyperparameters
@dataclass(frozen=True)
class Hyperparamaters:
  max_epochs: int = 100
  learning_rate: float = 1e-3
  # https://blog.ml.cmu.edu/2020/03/20/are-sixteen-heads-really-better-than-one/
  attention_heads: int = 8
  encoder_layers: int = 8
  decoder_layers: int = 8
  linear_dim: int = 128
  dropout: float = 0.1
  embedding_dim: int = 128
  batch_size: int = 32
  test_size_ratio: float = 0.2
  val_size_ratio: float = 0.1
  logs_per_batch: int = 8 # Higher values may slow down overall training.
  check_val_every_n_epoch: int = 10 # Validation is very slow due to bleu score calculation
  en_max_seq_length: int = en_max_seq_length
  hi_max_seq_length: int = hi_max_seq_length

hyper_params = Hyperparamaters()
assert hyper_params.batch_size%hyper_params.logs_per_batch == 0

# Create datamodule
datamodule = BatchingDataModule(
    BatchingDataModule.Parameters(
        src=BatchingDataModule.Parameters.LangParams(
            data=DATA['en_tok'],
            vocab=EN_VOCAB,
            pad_idx=PAD_IDX,
            bos_idx=BOS_IDX,
            eos_idx=EOS_IDX,
        ),
        target=BatchingDataModule.Parameters.LangParams(
            data=DATA['hi_tok'],
            vocab=HI_VOCAB,
            pad_idx=PAD_IDX,
            bos_idx=BOS_IDX,
            eos_idx=EOS_IDX,
        ),
        batch_size=hyper_params.batch_size,
        test_size_ratio=hyper_params.test_size_ratio,
        val_size_ratio=hyper_params.val_size_ratio,
    )
)

# Define logger params
# https://mlflow.org/docs/latest/python_api/mlflow.pytorch.html#module-mlflow.pytorch
# https://stackoverflow.com/questions/61615818/setting-up-mlflow-on-google-colab
from pytorch_lightning.loggers import MLFlowLogger
import mlflow, mlflow.pytorch, os
from mlflow.tracking import MlflowClient
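# Read the tracking password from the environment (or prompt for it) instead of hard-coding it
if 'MLFLOW_TRACKING_PASSWORD' not in os.environ:
  from getpass import getpass
  os.environ['MLFLOW_TRACKING_PASSWORD'] = getpass('MLflow tracking password: ')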
os.environ['MLFLOW_TRACKING_USERNAME'] = 'jaideepheer'
os.environ['MLFLOW_TRACKING_URI'] = 'https://dagshub.com/jaideepheer/CSL7340-NLP.Project.mlflow'
mlflow.set_tracking_uri(os.environ['MLFLOW_TRACKING_URI'])

# Create experiment
experiment_name = "Transformer Model"
# mlflow.create_experiment(experiment_name)
mlflow.set_experiment(experiment_name)

# Setup checkpointing
from pytorch_lightning.callbacks import ModelCheckpoint
checkpoint_callbacks = [
  ModelCheckpoint(
    monitor='loss',
    dirpath=PATH['workdir_save'] / 'training_checkpoint_dir',
    filename='transformer_model-{epoch:02d}-{loss:.2f}',
    save_top_k=3,
    mode='min',
  ),
  # ModelCheckpoint(
  #   monitor='val_loss',
  #   dirpath=PATH['workdir_save'] / 'training_checkpoint_dir',
  #   filename='transformer_model-{epoch:02d}-{val_loss:.2f}',
  #   save_top_k=3,
  #   mode='min',
  # ),
]

# Create logger
logger = MLFlowLogger(
  experiment_name=experiment_name,
  tracking_uri=os.environ['MLFLOW_TRACKING_URI'],
  # tags=tags,
  # prefix=key,
)

# Log hyperparams
logger.log_hyperparams(asdict(hyper_params))

# Define trainer
trainer = pl.Trainer(
    gpus=-1, # Use all GPUs
    precision=16, # Use 16bit precision for larger batches and faster training
    max_epochs=hyper_params.max_epochs,
    logger=logger,
    log_every_n_steps=hyper_params.batch_size // hyper_params.logs_per_batch, # integer division: the trainer expects an int
    check_val_every_n_epoch=hyper_params.check_val_every_n_epoch,
    # profiler="simple", 
    progress_bar_refresh_rate=20,
    # checkpoint callbacks (these define the dir where checkpoints are saved)
    callbacks=[*checkpoint_callbacks],
)

# Create model
model = Transformer_Model(Transformer_Model.Parameters(
    src=Transformer_Model.Parameters.LangParams(
        vocab=EN_VOCAB,
        max_seq_length=hyper_params.en_max_seq_length,
        pad_idx=PAD_IDX,
        bos_idx=BOS_IDX,
        eos_idx=EOS_IDX,
    ),
    target=Transformer_Model.Parameters.LangParams(
        vocab=HI_VOCAB,
        max_seq_length=hyper_params.hi_max_seq_length,
        pad_idx=PAD_IDX,
        bos_idx=BOS_IDX,
        eos_idx=EOS_IDX,
    ),
    encoder_layers=hyper_params.encoder_layers,
    decoder_layers=hyper_params.decoder_layers,
    embedding_dim=hyper_params.embedding_dim,
    learning_rate=hyper_params.learning_rate,
    attention_heads=hyper_params.attention_heads,
    linear_dim=hyper_params.linear_dim,
    dropout=hyper_params.dropout,
))

GPU available: True, used: True
TPU available: None, using: 0 TPU cores
Using native 16bit precision.


In [None]:
# Fit model
trainer.fit(model, datamodule)

GPU available: True, used: True
TPU available: None, using: 0 TPU cores
Using native 16bit precision.

  | Name       | Type                 | Params
----------------------------------------------------
0 | enc        | TransformerEncoder   | 796 K 
1 | dec        | TransformerDecoder   | 1.3 M 
2 | gen        | Linear               | 2.7 M 
3 | src_emb    | Token_Embedding      | 4.6 M 
4 | target_emb | Token_Embedding      | 2.7 M 
5 | pos_emb    | Positional_Embedding | 0     
----------------------------------------------------
12.1 M    Trainable params
0         Non-trainable params
12.1 M    Total params
48.276    Total estimated model params size (MB)
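
The `enc` and `dec` rows of the summary above can be checked by hand. The sketch below counts parameters under the assumption that each layer has the structure shown in the model printouts (packed Q/K/V projection, output projection, two feed-forward linears, layer norms, no final norm) and uses the hyperparameters from the training cell (`embedding_dim=128`, `linear_dim=128`, 8 layers each). The helper names are illustrative and not part of the notebook.

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus bias of a fully connected layer."""
    return n_in * n_out + n_out


def attention_params(d_model: int) -> int:
    """Packed Q/K/V input projection plus the output projection."""
    return linear_params(d_model, 3 * d_model) + linear_params(d_model, d_model)


def layer_norm_params(d_model: int) -> int:
    """Scale and shift vectors."""
    return 2 * d_model


def encoder_layer_params(d_model: int, d_ff: int) -> int:
    feed_forward = linear_params(d_model, d_ff) + linear_params(d_ff, d_model)
    return attention_params(d_model) + feed_forward + 2 * layer_norm_params(d_model)


def decoder_layer_params(d_model: int, d_ff: int) -> int:
    feed_forward = linear_params(d_model, d_ff) + linear_params(d_ff, d_model)
    # Self-attention plus attention over the encoder memory, and three layer norms
    return 2 * attention_params(d_model) + feed_forward + 3 * layer_norm_params(d_model)


d_model, d_ff, n_layers = 128, 128, 8
enc_total = n_layers * encoder_layer_params(d_model, d_ff)
dec_total = n_layers * decoder_layer_params(d_model, d_ff)
print(enc_total, dec_total)  # 796672 1327104
```

These totals agree with the rounded `796 K` and `1.3 M` entries. The number of attention heads does not change the count, since the heads split the same projection matrices. The reported 48.276 MB is likewise consistent with roughly 12.07 M parameters at 4 bytes each.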



### Load model from checkpoint

In [None]:
checkpoint = Transformer_Model.load_from_checkpoint(
    PATH['workdir_save'] / 'training_checkpoint_dir' / 'transformer_model-epoch=09-loss=9.07.ckpt',
    params=Transformer_Model.Parameters(
        src=Transformer_Model.Parameters.LangParams(
            vocab=EN_VOCAB,
            max_seq_length=hyper_params.en_max_seq_length,
            pad_idx=PAD_IDX,
            bos_idx=BOS_IDX,
            eos_idx=EOS_IDX,
        ),
        target=Transformer_Model.Parameters.LangParams(
            vocab=HI_VOCAB,
            max_seq_length=hyper_params.hi_max_seq_length,
            pad_idx=PAD_IDX,
            bos_idx=BOS_IDX,
            eos_idx=EOS_IDX,
        ),
        encoder_layers=hyper_params.encoder_layers,
        decoder_layers=hyper_params.decoder_layers,
        embedding_dim=hyper_params.embedding_dim,
        learning_rate=hyper_params.learning_rate,
        attention_heads=hyper_params.attention_heads,
        linear_dim=hyper_params.linear_dim,
        dropout=hyper_params.dropout,
    )
)
checkpoint

Transformer_Model(
  (enc): TransformerEncoder(
    (layers): ModuleList(
      (0): TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): _LinearWithBias(in_features=128, out_features=128, bias=True)
        )
        (linear1): Linear(in_features=128, out_features=128, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): Linear(in_features=128, out_features=128, bias=True)
        (norm1): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inplace=False)
      )
      (1): TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): _LinearWithBias(in_features=128, out_features=128, bias=True)
        )
        (linear1): Linear(in_features=128, out_features=128, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): L
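
The checkpoint path above is typed in by hand. Since the checkpoint filenames encode the epoch and the monitored loss, the best file could instead be picked programmatically. The following is a sketch under that assumption: `best_checkpoint` is a hypothetical helper, and only the first filename in the example list actually appears in this notebook.

```python
import re
from typing import Iterable, Optional

# Matches names such as 'transformer_model-epoch=09-loss=9.07.ckpt'
CKPT_PATTERN = re.compile(r"epoch=(?P<epoch>\d+)-loss=(?P<loss>\d+(?:\.\d+)?)\.ckpt$")


def best_checkpoint(names: Iterable[str]) -> Optional[str]:
    """Return the filename with the smallest loss encoded in its name, or None."""
    scored = []
    for name in names:
        match = CKPT_PATTERN.search(name)
        if match:  # skip files that are not checkpoints
            scored.append((float(match.group("loss")), name))
    return min(scored)[1] if scored else None


# Hypothetical directory listing for illustration
names = [
    "transformer_model-epoch=09-loss=9.07.ckpt",
    "transformer_model-epoch=04-loss=9.31.ckpt",
    "notes.txt",
]
best = best_checkpoint(names)
print(best)
```

In the notebook, the names could come from something like `(p.name for p in (PATH['workdir_save'] / 'training_checkpoint_dir').glob('*.ckpt'))`.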

### Show translation samples

In [None]:
num_samples = 30

SAMPLE_DATA = DATA[['en', 'hi', 'hi_tok']].sample(n=num_samples)

# Translate
tr = []
tr_tok = []
for idx, row in tqdm(SAMPLE_DATA.iterrows(), total=len(SAMPLE_DATA)):
  en_sen, hi_sen = row['en'], row['hi_tok']
  tr.append(checkpoint.translate(en_sen, en_tokenizer, max_tokens=len(hi_sen)))
  tr_tok.append(inltk.tokenize(tr[-1], LANG_TARGET))
SAMPLE_DATA['translated'] = tr

# Calculate bleu score on sample data
from torchmetrics.functional import bleu_score
log('Mean BLEU score (in sample)', bleu_score(tr_tok, [[i] for i in SAMPLE_DATA['hi_tok']]))

SAMPLE_DATA[['en', 'hi', 'translated']]



Mean BLEU score (in sample): 0.0


| index | en | hi | translated |
|---|---|---|---|
| 3041 | So, in this case the samples is a partition th... | इसलिए, इस मामले में नमूने एक विभाजन है जो हमार... | कहा कि कहा कि कहा कि कि कि कि के के कहा के कि... |
| 3911 | I am truly grateful to Salman. | मैं सलमान की बहुत इज्जत करती हूं। | मैं मैं मैं है |
| 11022 | NITI Aayog has taken steps to ensure that thes... | नीति आयोग ने यह सुनिश्चित करने के लिए कदम उठाए... | उन्होंने ने ने के के के के के के के के के के ... |
| 9173 | The official orders will be issued soon, offic... | अधिकारियों ने कहा कि आधिकारिक आदेश जल्द आएगा। | इस ने ने के के। के। के |
| 13044 | Members & workers of Congress protest across t... | कर्नाटक में बीएस येदियुरप्पा को सरकार बनाने के... | उन्होंने ने के के के के के के के के के के के ... |
| 23345 | or cause the streams in your garden to disappe... | उसका पानी नीचे उतर (के खुश्क) हो जाए फिर तो उस... | , के के है के है है है है है |
| 10029 | Following this there were skirmishes between t... | इसके बाद दोनों टीमों के बीच जमकर गोलियां चलीं. | इस के में के के के के। |
| 24607 | Sowing of Jute and Mesta is also in progress. | जूट और मेस्टा की बुवाई भी प्रगति पर है। | यह और और और और और और और और और। |
| 5417 | If the level... | अगर तह में . | यह के |
| 13503 | Raveena Tandon and husband Anil Thadani | रवीना टंडन और पति अनिल थडानी में होने जा रही ह... | इस और और और और और और और |


In [None]:
# Selected samples
selected_samples = [12396, 1181, 21541, 24163, 11939, 4350, 11572, 10689, 18456]

# Show selected samples
SAMPLE_DATA = DATA[['en', 'hi', 'hi_tok']].iloc[selected_samples]

# Translate
tr = []
tr_tok = []
for idx, row in tqdm(SAMPLE_DATA.iterrows(), total=len(SAMPLE_DATA)):
  en_sen, hi_sen = row['en'], row['hi_tok']
  tr.append(checkpoint.translate(en_sen, en_tokenizer, max_tokens=len(hi_sen)))
  tr_tok.append(inltk.tokenize(tr[-1], LANG_TARGET))
SAMPLE_DATA['translated'] = tr

# Calculate bleu score on sample data
from torchmetrics.functional import bleu_score
log('Mean BLEU score (in sample)', bleu_score(tr_tok, [[i] for i in SAMPLE_DATA['hi_tok']]))

SAMPLE_DATA[['en', 'hi', 'translated']]



Mean BLEU score (in sample): 0.0


| index | en | hi | translated |
|---|---|---|---|
| 12396 | But it never rings. | लेकिन इससे कभी भी जूलरी नहीं बनती. | उन्होंने कहा नहीं नहीं |
| 1181 | In Rajasthan Assembly Elections, Congress got ... | वहीं, जम्मू-कश्मीर की 6 सीटों में से 3 बीजेपी ... | मुख्यमंत्री ने ने ने के ने के के के के के के |
| 21541 | He said government has failed to | उन्होंने कहा कि सरकार दीनदयाल | सरकार सरकार सरकार सरकार सरकार |
| 24163 | He further said that the State has been given ... | उन्होंने यह भी कहा कि बीमारी की जल्दी चेतावनी ... | उन्होंने ने के के के के के के के के के के के ... |
| 11939 | Take a look at these. | नजर डालें इन्हीं पर। | इस में है में है |
| 4350 | The recovery rate among COVID-19 patients has ... | भारत में कोविड-19 रिकवरी दर भी लगातार बढ़ रही ... | इस के के के के |
| 11572 | She said her merger proposal was accepted by P... | सुषमा ने कहा कि उनके इस विलय प्रस्ताव को प्रधा... | प्रधानमंत्री मोदी ने मोदी के ने के मोदी मोदी |
| 10689 | Samajwadi party will fight the Lik Sabha elect... | कांग्रेस पार्टी उत्तरप्रदेश विधानसभा में चुनाव... | भाजपा ने के के के के के के के के के |
| 18456 | He also stole three bags having gold kept in i... | उन्होंने लॉकर से उसमें रखे सोने के तीन बैग भी ... | उन्होंने उन्होंने ने के के के के के के के के के |
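
The translations above mostly repeat a few frequent tokens, which helps explain the BLEU score of 0.0: unsmoothed BLEU is a geometric mean of 1- to 4-gram precisions, so a single n-gram order with no matches drives the whole score to zero. The sketch below is a simplified pure-Python corpus BLEU with one reference per candidate, written to illustrate this effect. It is not the `torchmetrics` implementation, and the English toy sentences are made up for the example.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngram_counts(tokens: Sequence[str], n: int) -> Counter:
    """Count all n-grams of order n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(candidates: List[List[str]], references: List[List[str]], max_n: int = 4) -> float:
    """Unsmoothed corpus BLEU with exactly one reference per candidate."""
    matches = [0] * max_n
    totals = [0] * max_n
    cand_len = ref_len = 0
    for cand, ref in zip(candidates, references):
        cand_len += len(cand)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            cand_counts = ngram_counts(cand, n)
            ref_counts = ngram_counts(ref, n)
            # Clipped matches: an n-gram counts at most as often as it occurs in the reference
            matches[n - 1] += sum((cand_counts & ref_counts).values())
            totals[n - 1] += sum(cand_counts.values())
    if min(matches) == 0:
        return 0.0  # any zero precision makes the geometric mean zero
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return brevity * math.exp(log_precision)


reference = "the cat sat on the mat".split()
repetitive = "the the the the the the".split()
score_repetitive = corpus_bleu([repetitive], [reference])  # unigrams match, bigrams do not
score_exact = corpus_bleu([reference], [reference])
print(score_repetitive, score_exact)
```

A smoothed variant or a lower n-gram order would give partial credit to outputs that get some individual words right, which may be a more informative signal early in training.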
