# **Homework 5 - Sequence-to-sequence**

If you have any questions, feel free to email us at: ntu-ml-2021spring-ta@googlegroups.com

### (4/21 Updates)
1. Link to reference [training curves](https://wandb.ai/george0828zhang/hw5.seq2seq.new).

### (4/14 Updates)
1. Link to tutorial video [part 1](https://youtu.be/1pjS5_L5REI) [part 2](https://youtu.be/3XX9d0ymKgQ).
2. Now defaults to load `"avg_last_5_checkpoint.pt"` to generate prediction.
3. Expected run time on Colab with Tesla T4

|Baseline|Details|Total Time|
|-|:-:|:-:|
|Simple|2m 15s $\times$30 epochs|1hr 8m|
|Medium|4m $\times$30 epochs|2hr|
|Strong|8m $\times$30 epochs (backward)<br>+1hr (back-translation)<br>+15m $\times$30 epochs (forward)|12hr 30m|

# Sequence-to-Sequence Introduction
- Typical sequence-to-sequence (seq2seq) models are encoder-decoder models, which usually consist of two parts: the encoder and the decoder. Both parts can be implemented with a recurrent neural network (RNN) or a transformer, primarily to handle input/output sequences of variable length.
- **Encoder** encodes a sequence of inputs, such as text, video or audio, into a single vector, which can be viewed as an abstract representation of the inputs that contains information about the whole sequence.
- **Decoder** decodes the vector output of the encoder one step at a time until the final output sequence is complete. Every decoding step is affected by the previous step(s). Generally, one adds "< BOS >" at the beginning of the sequence to indicate the start of decoding, and "< EOS >" at the end to indicate the end of decoding.

![seq2seq](https://i.imgur.com/0zeDyuI.png)
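
The step-by-step decoding described above can be sketched in a few lines. This is only a toy illustration, not the homework's model: `NEXT_TOKEN` and `toy_step` are hypothetical stand-ins for a trained decoder (here a hand-written lookup table), and the markers are written as `<BOS>` / `<EOS>` without spaces.

```python
# Toy illustration of step-by-step decoding with <BOS>/<EOS>.
# NEXT_TOKEN and toy_step are hypothetical stand-ins for a trained decoder.
BOS, EOS = "<BOS>", "<EOS>"

NEXT_TOKEN = {
    "<BOS>": "湯姆",
    "湯姆": "是",
    "是": "個",
    "個": "學生",
    "學生": "。",
    "。": "<EOS>",
}

def toy_step(prefix):
    """Return the next token given the tokens decoded so far."""
    return NEXT_TOKEN[prefix[-1]]

def greedy_decode(step_fn, max_len=20):
    """Decode one token at a time until <EOS> is produced or max_len is reached."""
    prefix = [BOS]
    for _ in range(max_len):
        token = step_fn(prefix)
        if token == EOS:
            break
        prefix.append(token)
    return prefix[1:]  # drop <BOS>

translation = greedy_decode(toy_step)
print(" ".join(translation))
```

A real decoder would score every vocabulary entry at each step instead of reading from a table, but the control flow (start from BOS, stop at EOS or a length limit) is the same.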

# Homework Description
- English to Chinese (Traditional) Translation
  - Input: an English sentence         (e.g.		tom is a student .)
  - Output: the Chinese translation  (e.g. 		湯姆 是 個 學生 。)

- TODO
    - Train a simple RNN seq2seq to achieve translation
    - Switch to transformer model to boost performance
    - Apply Back-translation to further boost performance

# Download and import required packages

In [None]:
!pip install 'torch>=1.6.0' editdistance matplotlib sacrebleu sacremoses sentencepiece tqdm wandb
!pip install --upgrade jupyter ipywidgets

Collecting sacrebleu
  Downloading sacrebleu-2.5.1-py3-none-any.whl.metadata (51 kB)
Collecting sacremoses
  Downloading sacremoses-0.1.1-py3-none-any.whl.metadata (8.3 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.6.0)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.6.0)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=1.6.0)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=1.6.0)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>=1.6.0)

In [None]:
!git clone https://github.com/pytorch/fairseq.git
!cd fairseq && git checkout 9a1c497
!pip install --upgrade ./fairseq/

Cloning into 'fairseq'...
remote: Enumerating objects: 35391, done.
remote: Counting objects: 100% (16/16), done.
remote: Compressing objects: 100% (11/11), done.
remote: Total 35391 (delta 7), reused 5 (delta 5), pack-reused 35375 (from 3)
Receiving objects: 100% (35391/35391), 25.48 MiB | 15.84 MiB/s, done.
Resolving deltas: 100% (25540/25540), done.
Note: switching to '9a1c497'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at 9a1c4970 Make Hydra logging work with DDP (#1568)
Processing ./fairseq


In [14]:
import sys
import pdb
import pprint
import logging
import os
import random

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils import data
import numpy as np
import tqdm.auto as tqdm
from pathlib import Path
from argparse import Namespace
from fairseq import utils

import matplotlib.pyplot as plt

ValueError: mutable default <class 'fairseq.dataclass.configs.CommonConfig'> for field common is not allowed: use default_factory

# Fix random seed

In [None]:
seed = 73
random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
np.random.seed(seed)
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True

# Dataset Information

## En-Zh Bilingual Parallel Corpus
* [TED2020](#reimers-2020-multilingual-sentence-bert)
    - Raw: 398,066 (sentences)   
    - Processed: 393,980 (sentences)
    

## Testdata
- Size: 4,000 (sentences)
- **Chinese translation is undisclosed. The provided (.zh) file is a pseudo translation; each line is a '。'**

# Dataset Download

## Install megatools (optional)

In [None]:
#!apt-get install megatools

## Download and extract

In [None]:
data_dir = './DATA/rawdata'
dataset_name = 'ted2020'
urls = (
    '"https://onedrive.live.com/download?cid=3E549F3B24B238B4&resid=3E549F3B24B238B4%214989&authkey=AGgQ-DaR8eFSl1A"',
    '"https://onedrive.live.com/download?cid=3E549F3B24B238B4&resid=3E549F3B24B238B4%214987&authkey=AA4qP_azsicwZZM"',
# # If the above links die, use the following instead.
#     "https://www.csie.ntu.edu.tw/~r09922057/ML2021-hw5/ted2020.tgz",
#     "https://www.csie.ntu.edu.tw/~r09922057/ML2021-hw5/test.tgz",
# # If the above links die, use the following instead.
#     "https://mega.nz/#!vEcTCISJ!3Rw0eHTZWPpdHBTbQEqBDikDEdFPr7fI8WxaXK9yZ9U",
#     "https://mega.nz/#!zNcnGIoJ!oPJX9AvVVs11jc0SaK6vxP_lFUNTkEcK2WbxJpvjU5Y",
)
file_names = (
    'ted2020.tgz', # train & dev
    'test.tgz', # test
)
prefix = Path(data_dir).absolute() / dataset_name

prefix.mkdir(parents=True, exist_ok=True)
for u, f in zip(urls, file_names):
    path = prefix/f
    if not path.exists():
        if 'mega' in u:
            !megadl {u} --path {path}
        else:
            !wget {u} -O {path}
    if path.suffix == ".tgz":
        !tar -xvf {path} -C {prefix}
    elif path.suffix == ".zip":
        !unzip -o {path} -d {prefix}
!mv {prefix/'raw.en'} {prefix/'train_dev.raw.en'}
!mv {prefix/'raw.zh'} {prefix/'train_dev.raw.zh'}
!mv {prefix/'test.en'} {prefix/'test.raw.en'}
!mv {prefix/'test.zh'} {prefix/'test.raw.zh'}

## Language

In [None]:
src_lang = 'en'
tgt_lang = 'zh'

data_prefix = f'{prefix}/train_dev.raw'
test_prefix = f'{prefix}/test.raw'

In [None]:
!head {data_prefix+'.'+src_lang} -n 5
!head {data_prefix+'.'+tgt_lang} -n 5

## Preprocess files

In [None]:
import re

def strQ2B(ustring):
    """Full width -> half width"""
    # reference:https://ithelp.ithome.com.tw/articles/10233122
    ss = []
    for s in ustring:
        rstring = ""
        for uchar in s:
            inside_code = ord(uchar)
            if inside_code == 12288:  # Full width space: direct conversion
                inside_code = 32
            elif (inside_code >= 65281 and inside_code <= 65374):  # Full width chars (except space) conversion
                inside_code -= 65248
            rstring += chr(inside_code)
        ss.append(rstring)
    return ''.join(ss)

def clean_s(s, lang):
    if lang == 'en':
        s = re.sub(r"\([^()]*\)", "", s) # remove ([text])
        s = s.replace('-', '') # remove '-'
        s = re.sub('([.,;!?()\"])', r' \1 ', s) # keep punctuation
    elif lang == 'zh':
        s = strQ2B(s) # Q2B
        s = re.sub(r"\([^()]*\)", "", s) # remove ([text])
        s = s.replace(' ', '')
        s = s.replace('—', '')
        s = s.replace('“', '"')
        s = s.replace('”', '"')
        s = s.replace('_', '')
        s = re.sub('([。,;!?()\"~「」])', r' \1 ', s) # keep punctuation
    s = ' '.join(s.strip().split())
    return s

def len_s(s, lang):
    if lang == 'zh':
        return len(s)
    return len(s.split())

def clean_corpus(prefix, l1, l2, ratio=9, max_len=1000, min_len=1):
    '''Clean the corpus and write the result to files.'''
    if Path(f'{prefix}.clean.{l1}').exists() and Path(f'{prefix}.clean.{l2}').exists():
        print(f'{prefix}.clean.{l1} & {l2} exists. skipping clean.')
        return
    with open(f'{prefix}.{l1}', 'r') as l1_in_f:
        with open(f'{prefix}.{l2}', 'r') as l2_in_f:
            with open(f'{prefix}.clean.{l1}', 'w') as l1_out_f:
                with open(f'{prefix}.clean.{l2}', 'w') as l2_out_f:
                    for s1 in l1_in_f:
                        s1 = s1.strip()
                        s2 = l2_in_f.readline().strip()
                        s1 = clean_s(s1, l1)
                        s2 = clean_s(s2, l2)
                        s1_len = len_s(s1, l1)
                        s2_len = len_s(s2, l2)
                        if min_len > 0: # remove short sentence
                            if s1_len < min_len or s2_len < min_len:
                                continue
                        if max_len > 0: # remove long sentence
                            if s1_len > max_len or s2_len > max_len:
                                continue
                        if ratio > 0: # remove by ratio of length
                            if s1_len/s2_len > ratio or s2_len/s1_len > ratio:
                                continue
                        print(s1, file=l1_out_f)
                        print(s2, file=l2_out_f)

In [None]:
clean_corpus(data_prefix, src_lang, tgt_lang)
clean_corpus(test_prefix, src_lang, tgt_lang, ratio=-1, min_len=-1, max_len=-1)

In [None]:
!head {data_prefix+'.clean.'+src_lang} -n 5
!head {data_prefix+'.clean.'+tgt_lang} -n 5

## Split into train/valid

In [None]:
valid_ratio = 0.01 # 3000~4000 would suffice
train_ratio = 1 - valid_ratio

In [None]:
# Check whether the train/valid split files already exist
if (prefix / f'train.clean.{src_lang}').exists() \
and (prefix / f'train.clean.{tgt_lang}').exists() \
and (prefix / f'valid.clean.{src_lang}').exists() \
and (prefix / f'valid.clean.{tgt_lang}').exists():
    print(f'train/valid splits exists. skipping split.')  # skip splitting if the files already exist
else:
    # Count the total number of lines in the source-language file
    line_num = sum(1 for line in open(f'{data_prefix}.clean.{src_lang}'))

    # Build a list containing every line index
    labels = list(range(line_num))

    # Shuffle the line indices
    random.shuffle(labels)

    # Split each language (source and target) into training and validation sets
    for lang in [src_lang, tgt_lang]:
        # Open the output files for the training and validation sets
        train_f = open(os.path.join(data_dir, dataset_name, f'train.clean.{lang}'), 'w')
        valid_f = open(os.path.join(data_dir, dataset_name, f'valid.clean.{lang}'), 'w')

        count = 0  # index of the current line
        # Read the cleaned file line by line
        for line in open(f'{data_prefix}.clean.{lang}', 'r'):
            # Assign the line to train or valid according to its shuffled label
            if labels[count] / line_num < train_ratio:
                train_f.write(line)  # goes to the training set
            else:
                valid_f.write(line)  # goes to the validation set
            count += 1

        # Close the files
        train_f.close()
        valid_f.close()

## Subword Units
Out of vocabulary (OOV) has been a major problem in machine translation. This can be alleviated by using subword units.
- We will use the [sentencepiece](#kudo-richardson-2018-sentencepiece) package
- select 'unigram' or 'byte-pair encoding (BPE)' algorithm
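
To see why subword units help with OOV words, consider the toy sketch below. It uses greedy longest-match segmentation over a small hand-written vocabulary; this is an illustrative assumption and is not the unigram or BPE algorithm that sentencepiece implements. The vocabulary `VOCAB` and the helper `segment` are hypothetical.

```python
# Toy greedy longest-match subword segmentation (NOT sentencepiece's algorithm).
# VOCAB is a hypothetical, hand-written subword vocabulary.
VOCAB = {"trans", "form", "er", "s", "un", "believ", "able"}

def segment(word, vocab):
    """Split `word` into the longest matching vocabulary pieces, left to right.
    Characters not covered by any piece fall back to single-character pieces."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest piece first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # fallback: single character
            i += 1
    return pieces

print(segment("transformers", VOCAB))
print(segment("unbelievable", VOCAB))
```

Neither word is in the vocabulary as a whole, yet both can be represented by known pieces, so the model never has to emit an unknown-word token for them.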

In [None]:
import sentencepiece as spm

# Vocabulary size
vocab_size = 8000

# Skip training if the model file already exists
if (prefix / f'spm{vocab_size}.model').exists():
    print(f'{prefix}/spm{vocab_size}.model exists. skipping spm_train.')  # model already exists, skip SentencePiece training
else:
    # Train the SentencePiece model
    spm.SentencePieceTrainer.train(
        input=','.join([  # training data: train and valid sets of both languages
            f'{prefix}/train.clean.{src_lang}',  # source-language training set
            f'{prefix}/valid.clean.{src_lang}',  # source-language validation set
            f'{prefix}/train.clean.{tgt_lang}',  # target-language training set
            f'{prefix}/valid.clean.{tgt_lang}'   # target-language validation set
        ]),
        model_prefix=prefix / f'spm{vocab_size}',  # output prefix for the model files
        vocab_size=vocab_size,  # vocabulary size
        character_coverage=1,  # character coverage; 1 means every character is covered
        model_type='unigram',  # 'unigram' uses the unigram language model; 'bpe' (byte-pair encoding) is also available
        input_sentence_size=1e6,  # number of sentences sampled from the data for training (one million)
        shuffle_input_sentence=True,  # shuffle the input sentences
        normalization_rule_name='nmt_nfkc_cf',  # text normalization rule, commonly used for NMT (neural machine translation)
    )

In [None]:
# Load the SentencePiece model used for tokenization
spm_model = spm.SentencePieceProcessor(model_file=str(prefix/f'spm{vocab_size}.model'))

# Map each split to the name of its input file
in_tag = {
    'train': 'train.clean',   # training set file name
    'valid': 'valid.clean',   # validation set file name
    'test': 'test.raw.clean', # test set file name
}

# Loop over the splits (train, valid, test)
for split in ['train', 'valid', 'test']:
    # Loop over the languages (source and target)
    for lang in [src_lang, tgt_lang]:
        # Output path of the tokenized file
        out_path = prefix/f'{split}.{lang}'

        # Skip tokenization if the output file already exists
        if out_path.exists():
            print(f"{out_path} exists. skipping spm_encode.")
        else:
            # Open the output file for the tokenized text
            with open(prefix/f'{split}.{lang}', 'w') as out_f:
                # Open the corresponding cleaned input file
                with open(prefix/f'{in_tag[split]}.{lang}', 'r') as in_f:
                    # Process the file line by line
                    for line in in_f:
                        line = line.strip()  # strip leading and trailing whitespace
                        tok = spm_model.encode(line, out_type=str)  # tokenize with the SentencePiece model
                        print(' '.join(tok), file=out_f)  # join the pieces with spaces and write them out


In [None]:
!head {data_dir+'/'+dataset_name+'/train.'+src_lang} -n 5
!head {data_dir+'/'+dataset_name+'/train.'+tgt_lang} -n 5

## Binarize the data with fairseq

In [None]:
binpath = Path('./DATA/data-bin', dataset_name)
if binpath.exists():
    print(binpath, "exists, will not overwrite!")
else:
    # If the directory does not exist, run fairseq's preprocessing command
    !python -m fairseq_cli.preprocess \
        --source-lang {src_lang}\
        --target-lang {tgt_lang}\
        --trainpref {prefix/'train'}\
        --validpref {prefix/'valid'}\
        --testpref {prefix/'test'}\
        --destdir {binpath}\
        --joined-dictionary\
        --workers 2

# Configuration for Experiments

In [None]:
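config = Namespace(
    # Data paths
    datadir = "./DATA/data-bin/ted2020",  # data directory with the binarized train/valid/test data
    savedir = "./checkpoints/rnn",       # directory where checkpoints are saved
    source_lang = "en",                  # source language (English)
    target_lang = "zh",                  # target language (Chinese)

    # Number of CPU threads used to fetch and process data
    num_workers = 2,                     # CPU threads for data loading
    # Batch size is counted in tokens; gradient accumulation increases the effective batch size
    max_tokens = 8192,                   # maximum number of tokens per batch
    accum_steps = 2,                     # gradient accumulation steps

    # The learning rate is computed by the Noam scheduler; this factor tunes the maximum learning rate
    lr_factor = 2.0,                     # learning rate factor
    lr_warmup = 4000,                    # warmup steps (learning rate increases during these steps)

    # Gradient clipping, to prevent exploding gradients
    clip_norm = 1.0,                     # maximum gradient norm

    # Maximum number of training epochs
    max_epoch = 30,                      # last epoch to train
    start_epoch = 1,                     # first epoch to train

    # Beam size for beam search
    beam = 5,                            # beam size
    # Generate sequences of maximum length ax + b, where x is the source length
    max_len_a = 1.2,                     # length ratio coefficient a
    max_len_b = 10,                      # length offset b
    # When decoding, post-process sentences by removing sentencepiece symbols and jieba tokenization
    post_process = "sentencepiece",      # post-processing applied after decoding

    # Checkpoint settings
    keep_last_epochs = 5,                # number of most recent checkpoints to keep
    resume = None,                       # checkpoint name (under config.savedir) to resume from, if any

    # Logging
    use_wandb = False,                   # whether to log with wandb
)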
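
The two formulas mentioned in the configuration comments can be sketched as follows. This is a minimal sketch under stated assumptions: the Noam schedule is written in its commonly used form, `d_model` is an assumed model dimension that is not part of `config`, and the length limit is simply truncated to an integer. The training and generation code used later may differ in details.

```python
# Sketch of the Noam learning-rate schedule and the decoding length limit.
# d_model is an assumed model dimension; it is not a field of `config`.
def noam_lr(step, d_model=256, factor=2.0, warmup=4000):
    """Learning rate grows linearly for `warmup` steps, then decays as step**-0.5."""
    step = max(step, 1)  # avoid evaluating 0 ** -0.5
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

def max_target_len(src_len, a=1.2, b=10):
    """Upper bound on the generated length: a * src_len + b."""
    return int(a * src_len + b)

for step in (1, 2000, 4000, 8000):
    print(step, noam_lr(step))
print(max_target_len(10))
```

With these defaults the learning rate peaks at step `warmup`, so `lr_factor` scales the peak while `lr_warmup` controls how long it takes to reach it.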

# Logging
- logging package logs ordinary messages
- wandb logs the loss, bleu, etc. in the training process

In [None]:
logging.basicConfig(
    format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    level="INFO", # "DEBUG" "WARNING" "ERROR"
    stream=sys.stdout,
)
proj = "hw5.seq2seq"
logger = logging.getLogger(proj)
if config.use_wandb:
    import wandb
    wandb.init(project=proj, name=Path(config.savedir).stem, config=config)

# CUDA Environment

In [None]:
cuda_env = utils.CudaEnvironment()
utils.CudaEnvironment.pretty_print_cuda_env_list([cuda_env])
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Dataloading

## We borrow the TranslationTask from fairseq
* used to load the binarized data created above
* well-implemented data iterator (dataloader)
* built-in task.source_dictionary and task.target_dictionary are also handy
* well-implemented beam search decoder

In [None]:
from fairseq.tasks.translation import TranslationConfig, TranslationTask

## setup task
task_cfg = TranslationConfig(
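    data=config.datadir,                # data directory containing the binarized parallel corpus
    source_lang=config.source_lang,    # source language, here "en" (English)
    target_lang=config.target_lang,    # target language, here "zh" (Chinese)
    train_subset="train",              # name of the split used for training
    required_seq_len_multiple=8,       # pad sequence lengths to a multiple of 8, which helps GPU efficiency
    dataset_impl="mmap",               # dataset implementation; "mmap" memory-maps the data to save RAM
    upsample_primary=1,                # upsampling factor for the primary dataset; 1 means no upsampling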
)
task = TranslationTask.setup_task(task_cfg)

In [None]:
logger.info("loading data for epoch 1")
task.load_dataset(split="train", epoch=1, combine=True) # combine if you have back-translation data.
task.load_dataset(split="valid", epoch=1)

In [None]:
import pprint

# Fetch one sample from the validation set (index 1)
sample = task.dataset("valid")[1]
# Print the raw sample
pprint.pprint(sample)

# Convert the source sequence to a readable string and print it
pprint.pprint(
    "Source: " + \
    task.source_dictionary.string(
        sample['source'],  # source token sequence of the sample
        config.post_process,  # post-processing setting
    )
)

# Convert the target sequence to a readable string and print it
pprint.pprint(
    "Target: " + \
    task.target_dictionary.string(
        sample['target'],  # target token sequence of the sample
        config.post_process,  # post-processing setting
    )
)

## Dataset Iterator

* Controls every batch to contain no more than N tokens, which optimizes GPU memory efficiency
* Shuffles the training set for every epoch
* Ignore sentences exceeding maximum length
* Pad all sentences in a batch to the same length, which enables parallel computing by GPU
* Add eos and shift one token
    - teacher forcing: to train the model to predict the next token based on prefix, we feed the right shifted target sequence as the decoder input.
    - generally, prepending bos to the target would do the job (as shown below)
![seq2seq](https://i.imgur.com/0zeDyuI.png)
    - in fairseq however, this is done by moving the eos token to the beginning. Empirically, this has the same effect. For instance:
    ```
    # output target (target) and Decoder input (prev_output_tokens):
                   eos = 2
                target = 419,  711,  238,  888,  792,   60,  968,    8,    2
    prev_output_tokens = 2,  419,  711,  238,  888,  792,   60,  968,    8
    ```
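
The shift shown above can be reproduced with plain Python lists. This is only a sketch of the idea; `make_prev_output_tokens` is a hypothetical helper name, and fairseq performs the equivalent operation on tensors inside its collater.

```python
# Sketch: derive the decoder input from the target by moving the trailing
# eos token to the front, following the example above.
EOS = 2

def make_prev_output_tokens(target, eos=EOS):
    """Right-shift `target` by moving its final eos token to the beginning."""
    assert target[-1] == eos, "target is expected to end with eos"
    return [eos] + target[:-1]

target = [419, 711, 238, 888, 792, 60, 968, 8, 2]
prev_output_tokens = make_prev_output_tokens(target)

# At position t the decoder has seen prev_output_tokens[: t + 1]
# and is trained to predict target[t].
for t, (seen, gold) in enumerate(zip(prev_output_tokens, target)):
    print(t, "input:", seen, "-> predict:", gold)
```

Because the decoder input at position t never contains `target[t]` itself, the model has to predict each token from its prefix only.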


In [None]:
def load_data_iterator(task, split, epoch=1, max_tokens=4000, num_workers=1, cached=True):
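    """
    Build a batch iterator over a dataset split.

    Args:
    - task: the task object, which holds the datasets and related settings.
    - split: name of the dataset split (e.g. 'train', 'valid', 'test').
    - epoch: current training epoch, defaults to 1.
    - max_tokens: maximum number of tokens per batch, defaults to 4000.
    - num_workers: number of worker threads for data loading, defaults to 1.
    - cached: whether to enable the iterator cache, defaults to True.

    Returns:
    - batch_iterator: a batch iterator over the dataset.
    """
    batch_iterator = task.get_batch_iterator(
        dataset=task.dataset(split),  # dataset for the given split (e.g. training or validation set)
        max_tokens=max_tokens,        # maximum number of tokens per batch
        max_sentences=None,           # maximum sentences per batch (None: limited by max_tokens only)
        max_positions=utils.resolve_max_positions(
            task.max_positions(),    # maximum positions supported by the task
            max_tokens,              # additionally bounded by max_tokens
        ),
        ignore_invalid_inputs=True,   # ignore invalid inputs (e.g. sentences that are too long)
        seed=seed,                    # random seed
        num_workers=num_workers,      # number of worker threads for data loading
        epoch=epoch,                  # current training epoch
        disable_iterator_cache=not cached,  # whether to disable the iterator cache (keep it enabled to speed things up)
    )
    return batch_iterator  # return the batch iterator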

# Usage example: build a data iterator over the validation set
demo_epoch_obj = load_data_iterator(
    task, "valid", epoch=1, max_tokens=20, num_workers=1, cached=False
)

# Get the iterator for the next epoch, with shuffling
demo_iter = demo_epoch_obj.next_epoch_itr(shuffle=True)

# Fetch one batch from the iterator
sample = next(demo_iter)

* each batch is a python dict, with string key and Tensor value. Contents are described below:
```python
batch = {
    "id": id, # id for each example
    "nsentences": len(samples), # batch size (sentences)
    "ntokens": ntokens, # batch size (tokens)
    "net_input": {
        "src_tokens": src_tokens, # sequence in source language
        "src_lengths": src_lengths, # sequence length of each example before padding
        "prev_output_tokens": prev_output_tokens, # right shifted target, as mentioned above.
    },
    "target": target, # target sequence
}
```
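
How such a batch comes about can be sketched with plain lists. This is a simplified illustration under stated assumptions: the padding index `PAD = 1`, right-side padding, and counting source tokens in `ntokens` are choices made for this sketch only; fairseq's own collater works on tensors and may pad on the other side and count tokens differently.

```python
# Sketch of collating variable-length examples into one padded batch.
# PAD = 1 and right-padding are assumptions made for illustration.
PAD = 1

def collate(sources, pad=PAD):
    """Pad every sequence to the longest one and record the original lengths."""
    max_len = max(len(s) for s in sources)
    src_tokens = [s + [pad] * (max_len - len(s)) for s in sources]
    src_lengths = [len(s) for s in sources]
    return {
        "nsentences": len(sources),
        "ntokens": sum(src_lengths),  # source tokens, for simplicity
        "net_input": {"src_tokens": src_tokens, "src_lengths": src_lengths},
    }

batch = collate([[14, 28, 2], [7, 2], [5, 9, 33, 41, 2]])
print(batch["net_input"]["src_tokens"])
print(batch["net_input"]["src_lengths"])
```

Padding makes the batch rectangular so it can be processed in parallel, while `src_lengths` keeps track of how much of each row is real data.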

# Model Architecture
* We again inherit fairseq's encoder, decoder and model, so that in the testing phase we can directly leverage fairseq's beam search decoder.

In [None]:
from fairseq.models import (
    FairseqEncoder,
    FairseqIncrementalDecoder,
    FairseqEncoderDecoderModel
)

## Encoder

- The Encoder is an RNN or Transformer Encoder. The following description is for the RNN. For every input token, the Encoder generates an output vector and a hidden state vector, and the hidden state vector is passed on to the next step. In other words, the Encoder reads the input sequence sequentially, outputs a single vector at each timestep, and finally outputs the final hidden states, or content vector, at the last timestep.
- Parameters:
  - *args*
      - encoder_embed_dim: the dimension of embeddings, this compresses the one-hot vector into fixed dimensions, which achieves dimension reduction
      - encoder_ffn_embed_dim is the dimension of hidden states and output vectors
      - encoder_layers is the number of layers for Encoder RNN
      - dropout determines the probability of a neuron's activation being set to 0, in order to prevent overfitting. Generally this is applied in training, and removed in testing.
  - *dictionary*: the dictionary provided by fairseq. it's used to obtain the padding index, and in turn the encoder padding mask.
  - *embed_tokens*: an instance of token embeddings (nn.Embedding)

- Inputs:
    - *src_tokens*: integer sequence representing english e.g. 1, 28, 29, 205, 2
- Outputs:
    - *outputs*: the output of RNN at each timestep, can be further processed by Attention
    - *final_hiddens*: the hidden states of each layer at the last timestep, will be passed to decoder for decoding
    - *encoder_padding_mask*: this tells the decoder which position to ignore
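
The difference between the per-timestep outputs and the final hidden state can be seen in a scalar toy recurrence. This sketch is deliberately not a GRU: it uses a simple Elman-style update with hypothetical weights `w_x` and `w_h`, only to show what the encoder returns.

```python
import math

def toy_rnn(inputs, w_x=0.5, w_h=0.8, h0=0.0):
    """Scalar Elman-style recurrence: h_t = tanh(w_x * x_t + w_h * h_{t-1})."""
    outputs, h = [], h0
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        outputs.append(h)   # one output per timestep
    return outputs, h       # h is the final hidden state

outputs, final_hidden = toy_rnn([1.0, -2.0, 0.5])
print(len(outputs), final_hidden == outputs[-1])
```

In this single-layer, single-direction toy the final hidden state equals the last output; in the bidirectional multi-layer encoder below, the final hidden states additionally carry one vector per layer and direction.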


In [None]:
class RNNEncoder(FairseqEncoder):
    def __init__(self, args, dictionary, embed_tokens):
        super().__init__(dictionary)
        self.embed_tokens = embed_tokens

        # Embedding dimension (input feature dimension)
        self.embed_dim = args.encoder_embed_dim
        # Hidden dimension (size of the RNN hidden units)
        self.hidden_dim = args.encoder_ffn_embed_dim
        # Number of encoder (RNN) layers
        self.num_layers = args.encoder_layers

        # Dropout on the input, to prevent overfitting
        self.dropout_in_module = nn.Dropout(args.dropout)
        # Bidirectional GRU
        self.rnn = nn.GRU(
            self.embed_dim,       # input feature dimension
            self.hidden_dim,      # hidden dimension
            self.num_layers,      # number of RNN layers
            dropout=args.dropout, # dropout between layers
            batch_first=False,    # inputs are (T, B, C), not batch-first
            bidirectional=True    # bidirectional RNN
        )
        # Dropout on the output, to prevent overfitting
        self.dropout_out_module = nn.Dropout(args.dropout)

        # Padding index, used to locate padded positions
        self.padding_idx = dictionary.pad()

    def combine_bidir(self, outs, bsz: int):
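        """
        Merge the two directions of a bidirectional RNN.
        outs: the hidden states returned by the RNN
        bsz: batch size
        """
        # Split out the direction axis and move it next to the hidden dimension
        out = outs.view(self.num_layers, 2, bsz, -1).transpose(1, 2).contiguous()
        # Concatenate the two directions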
        return out.view(self.num_layers, bsz, -1)

    def forward(self, src_tokens, **unused):
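        """
        Forward pass.
        src_tokens: source sequences (token indices of the batch)
        """
        bsz, seqlen = src_tokens.size()  # batch size and sequence length

        # 1. Convert token indices to embedding vectors
        x = self.embed_tokens(src_tokens)  # embedding lookup
        x = self.dropout_in_module(x)  # dropout to prevent overfitting

        # 2. Rearrange dimensions for the RNN (B x T x C -> T x B x C)
        x = x.transpose(0, 1)

        # 3. Pass through the bidirectional GRU
        h0 = x.new_zeros(2 * self.num_layers, bsz, self.hidden_dim)  # initial hidden states
        x, final_hiddens = self.rnn(x, h0)  # run the RNN
        outputs = self.dropout_out_module(x)  # dropout on the outputs
        # outputs: [sequence length, batch size, hidden dim * directions]
        # final_hiddens: [layers * directions, batch size, hidden dim]

        # 4. Merge the hidden states of the two directions
        final_hiddens = self.combine_bidir(final_hiddens, bsz)
        # final_hiddens: [layers, batch size, hidden dim * directions]

        # 5. Build the encoder padding mask
        encoder_padding_mask = src_tokens.eq(self.padding_idx).t()  # transpose to (T, B)

        # Return the encoder outputs
        return tuple(
            (
                outputs,             # [seq_len x batch x hidden * directions]  hidden states of all timesteps, used by attention
                final_hiddens,       # [layers x batch x hidden * directions]  states at the last timestep, used to initialize the decoder
                encoder_padding_mask # [seq_len x batch]  marks PAD positions so they do not affect attention
            )
        )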

    def reorder_encoder_out(self, encoder_out, new_order):
        """
        Used by fairseq's beam search to reorder the encoder outputs.
        encoder_out: the encoder outputs
        new_order: the new order of batch indices
        """
        return tuple(
            (
                encoder_out[0].index_select(1, new_order),  # reorder the outputs
                encoder_out[1].index_select(1, new_order),  # reorder the hidden states
                encoder_out[2].index_select(1, new_order),  # reorder the padding mask
            )
        )

## Attention

- When the input sequence is long, the "content vector" alone cannot accurately represent the whole sequence; the attention mechanism can provide the Decoder with more information.
- Based on the **Decoder embeddings** of the current timestep, match the **Encoder outputs** with the decoder embeddings to determine correlation, and then sum the Encoder outputs weighted by the correlation as the input to the **Decoder** RNN.
- Common attention implementations use a neural network or dot product as the correlation between the **query** (decoder embeddings) and the **key** (Encoder outputs), followed by **softmax** to obtain a distribution; finally the **values** (Encoder outputs) are **weighted sum**-ed by said distribution.

- Parameters:
  - *input_embed_dim*: dimensionality of the query, should be that of the vector in the decoder that attends to others
  - *source_embed_dim*: dimensionality of the key, should be that of the vectors to be attended to (encoder outputs)
  - *output_embed_dim*: dimensionality of the output, i.e. the vector after attention, as expected by the next layer

- Inputs:
    - *inputs*: the query, the vector that attends to others
    - *encoder_outputs*: the key/value, the vectors to be attended to
    - *encoder_padding_mask*: this tells the decoder which position to ignore
- Outputs:
    - *output*: the context vector after attention
    - *attention score*: the attention distribution
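
The query/key/value steps described above can be sketched for a single query vector in plain Python. This is a simplified illustration: it omits the learned input and output projections, and the helper name `attend` and the example vectors are hypothetical.

```python
import math

def attend(query, keys, values, padding_mask):
    """Dot-product attention for a single query vector.
    query: list[float]; keys/values: lists of vectors; padding_mask: list[bool]."""
    # 1. correlation between the query and every key (dot product)
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # 2. padded positions get -inf so that softmax assigns them zero weight
    scores = [float("-inf") if m else s for s, m in zip(scores, padding_mask)]
    # 3. softmax (shifted by the max for numerical stability)
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # 4. weighted sum of the values
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return context, weights

encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]  # last position is padding
mask = [False, False, True]
context, weights = attend([1.0, 0.0], encoder_outputs, encoder_outputs, mask)
print(weights)
print(context)
```

The padded position has the largest raw dot product in this example, yet it receives zero weight, which is exactly what the padding mask is for.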


In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionLayer(nn.Module):
    def __init__(self, input_embed_dim, source_embed_dim, output_embed_dim, bias=False):
        super().__init__()

        # Project the input features to the same dimension as encoder_outputs
        self.input_proj = nn.Linear(input_embed_dim, source_embed_dim, bias=bias)

        # Concatenate the input with the attention-weighted output, then project to output_embed_dim
        self.output_proj = nn.Linear(
            input_embed_dim + source_embed_dim, output_embed_dim, bias=bias
        )

    def forward(self, inputs, encoder_outputs, encoder_padding_mask):
        """
        Args:
            inputs: (T, B, dim) embeddings of the input sequence
            encoder_outputs: (S, B, dim) outputs of the encoder
            encoder_padding_mask: (S, B) padding mask of the encoder, used to ignore padded positions
        Returns:
            x: (T, B, dim) output after attention
            attn_scores: (B, T, S) attention weights
        """

        # Convert tensors to batch-first layout
        inputs = inputs.transpose(1, 0)  # (B, T, dim)
        encoder_outputs = encoder_outputs.transpose(1, 0)  # (B, S, dim)
        encoder_padding_mask = encoder_padding_mask.transpose(1, 0)  # (B, S)

        # [Query] project the decoder input to the same dimension as the encoder outputs
        x = self.input_proj(inputs)  # (B, T, dim) [Q]

        # [Key] is the encoder output; no extra transformation is needed
        # [Value] is also the encoder output; no extra transformation is needed

        # Compute attention scores (dot product Q @ K^T)
        # [Q] (B, T, dim)  x  [K^T] (B, dim, S)  ->  [attn_scores] (B, T, S)
        attn_scores = torch.bmm(x, encoder_outputs.transpose(1, 2))

        # Set padded positions to -inf so that they become 0 after softmax
        if encoder_padding_mask is not None:
            encoder_padding_mask = encoder_padding_mask.unsqueeze(1)  # (B, 1, S)
            attn_scores = (
                attn_scores.float()
                .masked_fill_(encoder_padding_mask, float("-inf"))
                .type_as(attn_scores)
            )

        # Softmax over the source positions gives the attention weights
        attn_scores = F.softmax(attn_scores, dim=-1)  # (B, T, S)

        # [Weighted sum] attention weights x [V]
        # [attn_scores] (B, T, S) x [V] (B, S, dim) -> [x] (B, T, dim)
        x = torch.bmm(attn_scores, encoder_outputs)

        # Concatenate the attention output with the original input
        x = torch.cat((x, inputs), dim=-1)  # (B, T, dim + input_dim)

        # Apply the linear layer followed by tanh
        x = torch.tanh(self.output_proj(x))  # (B, T, output_embed_dim)

        # Restore the layout (B, T, dim) -> (T, B, dim)
        return x.transpose(1, 0), attn_scores

## Decoder

* The hidden states of **Decoder** will be initialized by the final hidden states of **Encoder** (the content vector)
* At the same time, **Decoder** will change its hidden states based on the input of the current timestep (the outputs of previous timesteps), and generates an output
* Attention improves the performance
* The seq2seq steps are implemented in the decoder, so that later the Seq2Seq class can accept RNN and Transformer without further modification.
- Parameters:
  - *args*
      - decoder_embed_dim: is the dimensionality of the decoder embeddings, similar to encoder_embed_dim
      - decoder_ffn_embed_dim: is the dimensionality of the decoder RNN hidden states, similar to encoder_ffn_embed_dim
      - decoder_layers: number of layers of RNN decoder
      - share_decoder_input_output_embed: usually, the projection matrix of the decoder will share weights with the decoder input embeddings
  - *dictionary*: the dictionary provided by fairseq
  - *embed_tokens*: an instance of token embeddings (nn.Embedding)
- Inputs:
    - *prev_output_tokens*: integer sequence representing the right-shifted target e.g. 1, 28, 29, 205, 2
    - *encoder_out*: encoder's output.
    - *incremental_state*: in order to speed up decoding during test time, we will save the hidden state of each timestep. see forward() for details.
- Outputs:
    - *outputs*: the logits (before softmax) output of decoder for each timesteps
    - *extra*: unused
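
The idea behind *incremental_state* can be sketched with a scalar toy recurrence. This is not the decoder implemented below: `step`, `full_pass` and `incremental_pass` are hypothetical helpers, and a plain dict stands in for the cached state. The sketch only shows that reusing the cached hidden state gives the same result as re-processing the whole prefix, at a fraction of the cost.

```python
import math

def step(x, h, w_x=0.5, w_h=0.8):
    """One toy recurrent step (scalar stand-in for a GRU cell)."""
    return math.tanh(w_x * x + w_h * h)

def full_pass(tokens, h0):
    """Training-style pass: process the whole prefix from the start."""
    h = h0
    for x in tokens:
        h = step(x, h)
    return h

def incremental_pass(new_token, state, h0):
    """Test-time pass: reuse the cached hidden state and process only the newest token."""
    h = state.get("hidden", h0)
    h = step(new_token, h)
    state["hidden"] = h  # cache for the next decoding step
    return h

h0 = 0.3  # stands in for the encoder's final hidden state
prefix = [1.0, -2.0, 0.5]
state = {}
for t in range(len(prefix)):
    h_inc = incremental_pass(prefix[t], state, h0)
    h_full = full_pass(prefix[: t + 1], h0)
    assert h_inc == h_full  # same result, but only one step of work per token
```

Without the cache, decoding a sentence of length n would repeat 1 + 2 + ... + n recurrent steps; with it, only n steps are needed.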

In [None]:
class RNNDecoder(FairseqIncrementalDecoder):
    def __init__(self, args, dictionary, embed_tokens):
        super().__init__(dictionary)
        self.embed_tokens = embed_tokens

        # Make sure the encoder and decoder have the same number of RNN layers
        assert args.decoder_layers == args.encoder_layers, f"""seq2seq rnn requires that encoder
        and decoder have same layers of rnn. got: {args.encoder_layers, args.decoder_layers}"""

        # Make sure the decoder hidden dimension is twice the encoder hidden dimension
        assert args.decoder_ffn_embed_dim == args.encoder_ffn_embed_dim * 2, f"""seq2seq-rnn requires
        that decoder hidden to be 2*encoder hidden dim. got: {args.decoder_ffn_embed_dim, args.encoder_ffn_embed_dim * 2}"""

        # Embedding dimension
        self.embed_dim = args.decoder_embed_dim
        # Hidden dimension
        self.hidden_dim = args.decoder_ffn_embed_dim
        # Number of decoder layers
        self.num_layers = args.decoder_layers

        # Dropout module, to prevent overfitting
        self.dropout_in_module = nn.Dropout(args.dropout)

        # Decoder RNN (GRU)
        self.rnn = nn.GRU(
            self.embed_dim,        # input dimension
            self.hidden_dim,       # hidden dimension
            self.num_layers,       # number of layers
            dropout=args.dropout,  # dropout probability
            batch_first=False,     # inputs are (T, B, C), not batch-first
            bidirectional=False    # unidirectional GRU
        )

        # Attention layer
        self.attention = AttentionLayer(
            self.embed_dim,        # query dimension
            self.hidden_dim,       # key and value dimension
            self.embed_dim,        # output dimension
            bias=False             # whether to use a bias term
        )

        self.dropout_out_module = nn.Dropout(args.dropout)

        # Project the output if the hidden dimension differs from the embedding dimension
        if self.hidden_dim != self.embed_dim:
            self.project_out_dim = nn.Linear(self.hidden_dim, self.embed_dim)
        else:
            self.project_out_dim = None

        # Whether to share weights between the input embeddings and the output projection
        if args.share_decoder_input_output_embed:
            self.output_projection = nn.Linear(
                self.embed_tokens.weight.shape[1],  # embedding dimension
                self.embed_tokens.weight.shape[0],  # vocabulary size
                bias=False,
            )
            # Weight sharing
            self.output_projection.weight = self.embed_tokens.weight
        else:
            self.output_projection = nn.Linear(
                self.embed_dim, len(dictionary), bias=False
            )
            # Initialize the weights
            nn.init.normal_(
                self.output_projection.weight, mean=0, std=self.embed_dim ** -0.5
            )

    def forward(self, prev_output_tokens, encoder_out, incremental_state=None, **unused):
        # unpack the encoder outputs
        encoder_outputs, encoder_hiddens, encoder_padding_mask = encoder_out
        # encoder_outputs:    seq_len x batch x num_directions*hidden
        # encoder_hiddens:    num_layers x batch x num_directions*encoder_hidden
        # encoder_padding_mask: seq_len x batch

        if incremental_state is not None and len(incremental_state) > 0:
            # a cached state exists, so continue from the previous timestep:
            # feed only the latest token together with the cached hidden states
            prev_output_tokens = prev_output_tokens[:, -1:]
            cache_state = self.get_incremental_state(incremental_state, "cached_state")
            prev_hiddens = cache_state["prev_hiddens"]
        else:
            # no cached state: initialize the hidden states with the encoder hidden states
            prev_hiddens = encoder_hiddens

        bsz, seqlen = prev_output_tokens.size()

        # embed the input tokens
        x = self.embed_tokens(prev_output_tokens)
        x = self.dropout_in_module(x)

        # B x T x C -> T x B x C (the layout expected by the RNN)
        x = x.transpose(0, 1)

        # attention over the encoder outputs
        if self.attention is not None:
            x, attn = self.attention(x, encoder_outputs, encoder_padding_mask)

        # pass through the unidirectional RNN
        x, final_hiddens = self.rnn(x, prev_hiddens)
        # x: [seq_len, batch_size, hidden_dim]
        # final_hiddens: [num_layers, batch_size, hidden_dim]

        x = self.dropout_out_module(x)

        # project the hidden states to the embedding dimension if needed
        if self.project_out_dim is not None:
            x = self.project_out_dim(x)

        # project to the vocabulary size to obtain the output logits
        x = self.output_projection(x)

        # T x B x C -> B x T x C (back to the output layout)
        x = x.transpose(1, 0)

        # cache the hidden states of the current timestep for incremental decoding
        cache_state = {
            "prev_hiddens": final_hiddens,
        }
        self.set_incremental_state(incremental_state, "cached_state", cache_state)

        return x, None

    def reorder_incremental_state(
        self,
        incremental_state,
        new_order,
    ):
        # reorder the cached state along the batch dimension (used by beam search)
        cache_state = self.get_incremental_state(incremental_state, "cached_state")
        prev_hiddens = cache_state["prev_hiddens"]
        prev_hiddens = [p.index_select(0, new_order) for p in prev_hiddens]
        cache_state = {
            "prev_hiddens": torch.stack(prev_hiddens),
        }
        self.set_incremental_state(incremental_state, "cached_state", cache_state)
        return


## Seq2Seq
- Composed of **Encoder** and **Decoder**
- Receives inputs and passes them to **Encoder**
- Passes the outputs from **Encoder** to **Decoder**
- **Decoder** decodes according to the outputs of previous timesteps as well as the **Encoder** outputs
- Once decoding is done, returns the **Decoder** outputs
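
The data flow in the list above can be sketched as a greedy decoding loop in plain Python. This is a toy under stated assumptions, not the notebook's model: `greedy_decode`, `toy_step`, the special-token ids and the vocabulary size of 8 are all hypothetical, and the notebook itself relies on fairseq's beam search rather than greedy search.

```python
BOS, EOS = 0, 2  # hypothetical special-token ids
VOCAB_SIZE = 8   # hypothetical vocabulary size


def greedy_decode(step_fn, encoder_out, max_len=10):
    """Repeatedly ask the decoder for the next token until EOS or max_len."""
    tokens = [BOS]
    for _ in range(max_len):
        # the decoder sees all previous outputs plus the encoder output
        scores = step_fn(tokens, encoder_out)
        next_tok = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(next_tok)
        if next_tok == EOS:
            break
    return tokens[1:]  # drop the BOS marker


def toy_step(prev_tokens, encoder_out):
    """Toy 'decoder': copies the source tokens one by one, then emits EOS."""
    pos = len(prev_tokens) - 1
    target = encoder_out[pos] if pos < len(encoder_out) else EOS
    return [1.0 if i == target else 0.0 for i in range(VOCAB_SIZE)]


decoded = greedy_decode(toy_step, encoder_out=[5, 6, 7])
print(decoded)
```

During training the loop is not needed, because the whole right-shifted target is fed to the decoder at once.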

In [None]:
class Seq2Seq(FairseqEncoderDecoderModel):
    def __init__(self, args, encoder, decoder):
        super().__init__(encoder, decoder)
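        self.args = args  # keep the arguments passed at construction time

    def forward(
        self,
        src_tokens,  # encoder input sequence (e.g. token ids of the English sentence)
        src_lengths,  # actual length of each sentence in the input batch
        prev_output_tokens,  # decoder input sequence (the right-shifted target tokens)
        return_all_hiddens: bool = True,  # whether to return the hidden states of all layers
    ):
        """
        Run a forward pass through the encoder-decoder model.
        """
        # encoder forward pass
        encoder_out = self.encoder(
            src_tokens,  # input token sequence
            src_lengths=src_lengths,  # input lengths, used to handle padding
            return_all_hiddens=return_all_hiddens  # whether to return the hidden states of all layers
        )

        # decoder forward pass
        logits, extra = self.decoder(
            prev_output_tokens,  # decoder input (the right-shifted target tokens)
            encoder_out=encoder_out,  # encoder output, used as context by the decoder
            src_lengths=src_lengths,  # length of each source sentence
            return_all_hiddens=return_all_hiddens  # whether to return the hidden states of all layers
        )

        # return the decoder logits and any extra information
        return logits, extra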

# Model Initialization

In [None]:
# HINT: transformer architecture
# import TransformerEncoder and TransformerDecoder from fairseq.models.transformer
# from fairseq.models.transformer import (
#     TransformerEncoder,
#     TransformerDecoder,
# )

def build_model(args, task):
    """ build a model instance based on hyperparameters """
    # source and target dictionaries of the task
    src_dict, tgt_dict = task.source_dictionary, task.target_dictionary

    # token embeddings
    # source embeddings: vocabulary size, embedding dimension, padding index
    encoder_embed_tokens = nn.Embedding(len(src_dict), args.encoder_embed_dim, src_dict.pad())
    # target embeddings: vocabulary size, embedding dimension, padding index
    decoder_embed_tokens = nn.Embedding(len(tgt_dict), args.decoder_embed_dim, tgt_dict.pad())

    # encoder and decoder
    # HINT: switch RNNEncoder and RNNDecoder to TransformerEncoder and TransformerDecoder
    encoder = RNNEncoder(args, src_dict, encoder_embed_tokens)
    decoder = RNNDecoder(args, tgt_dict, decoder_embed_tokens)

    # sequence-to-sequence model
    model = Seq2Seq(args, encoder, decoder)

    # initialization matters for seq2seq models and needs extra handling
    def init_params(module):
        """ initialize the model parameters """
        from fairseq.modules import MultiheadAttention
        if isinstance(module, nn.Linear):
            # linear weights: normal distribution with mean 0.0 and std 0.02
            module.weight.data.normal_(mean=0.0, std=0.02)
            # biases, if present, start at 0
            if module.bias is not None:
                module.bias.data.zero_()
        if isinstance(module, nn.Embedding):
            # embedding weights: normal distribution with mean 0.0 and std 0.02
            module.weight.data.normal_(mean=0.0, std=0.02)
            # the embedding of the padding index, if any, is set to 0
            if module.padding_idx is not None:
                module.weight.data[module.padding_idx].zero_()
        if isinstance(module, MultiheadAttention):
            # initialize the q_proj, k_proj and v_proj layers of multi-head attention
            module.q_proj.weight.data.normal_(mean=0.0, std=0.02)
            module.k_proj.weight.data.normal_(mean=0.0, std=0.02)
            module.v_proj.weight.data.normal_(mean=0.0, std=0.02)
        if isinstance(module, nn.RNNBase):
            # RNN weights and biases: uniform distribution in [-0.1, 0.1]
            for name, param in module.named_parameters():
                if "weight" in name or "bias" in name:
                    param.data.uniform_(-0.1, 0.1)

    # weight initialization
    model.apply(init_params)
    return model

## Architecture Related Configuration
reference implementation

|model|embedding dim|encoder ffn|encoder layers|decoder ffn|decoder layers|
|-|-|-|-|-|-|
|RNN|256|512|1|1024|1|
|Transformer|256|1024|4|1024|4|

For strong baseline, please refer to the hyperparameters for *transformer-base* in Table 3 in [Attention is all you need](#vaswani2017)
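
As a starting point for the strong baseline, the *transformer-base* row of Table 3 can be written with the same argument names as `arch_args`. This is a sketch rather than a tuned configuration: the name `transformer_base_args` is illustrative, the head counts assigned inside `add_transformer_args` would also need to become 8, and whether the model trains within the Colab time budget is untested.

```python
from argparse import Namespace

# transformer-base values from Table 3 of "Attention is all you need":
# N = 6 layers, d_model = 512, d_ff = 2048, h = 8 heads, dropout = 0.1
transformer_base_args = Namespace(
    encoder_embed_dim=512,
    encoder_ffn_embed_dim=2048,
    encoder_layers=6,
    decoder_embed_dim=512,
    decoder_ffn_embed_dim=2048,
    decoder_layers=6,
    share_decoder_input_output_embed=True,  # kept from the notebook's own config
    dropout=0.1,
    encoder_attention_heads=8,
    decoder_attention_heads=8,
)

# each head works on d_model / h dimensions, so d_model must be divisible by h
head_dim = transformer_base_args.encoder_embed_dim // transformer_base_args.encoder_attention_heads
print(head_dim)
```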

In [None]:
from argparse import Namespace

# architecture hyperparameters for the encoder and decoder
arch_args = Namespace(
    encoder_embed_dim=256,               # encoder embedding dimension
    encoder_ffn_embed_dim=512,           # encoder hidden dimension (RNN hidden size / Transformer FFN size)
    encoder_layers=1,                    # number of encoder layers
    decoder_embed_dim=256,               # decoder embedding dimension
    decoder_ffn_embed_dim=1024,          # decoder hidden dimension (RNN hidden size / Transformer FFN size)
    decoder_layers=1,                    # number of decoder layers
    share_decoder_input_output_embed=True,    # share the decoder input embeddings with the output projection
    dropout=0.3,                         # dropout probability, used to reduce overfitting
)

# HINT: these patches on the parameters are for the Transformer
def add_transformer_args(args):
    # number of attention heads in the encoder
    args.encoder_attention_heads = 4
    # apply layer normalization before each encoder block (pre-norm)
    args.encoder_normalize_before = True

    # number of attention heads in the decoder
    args.decoder_attention_heads = 4
    # apply layer normalization before each decoder block (pre-norm)
    args.decoder_normalize_before = True

    # activation function
    args.activation_fn = "relu"
    # maximum source sequence length the encoder can handle
    args.max_source_positions = 1024
    # maximum target sequence length the decoder can handle
    args.max_target_positions = 1024

    # fill in the remaining Transformer defaults that were not set above
    # (uses the function argument rather than the global arch_args)
    from fairseq.models.transformer import base_architecture
    base_architecture(args)

# add the extra parameters to arch_args
add_transformer_args(arch_args)

In [None]:
if config.use_wandb:
    wandb.config.update(vars(arch_args))

In [None]:
model = build_model(arch_args, task)
logger.info(model)

# Optimization

## Loss: Label Smoothing Regularization
* Lets the model learn to generate a less concentrated distribution, which prevents over-confidence
* Sometimes the ground truth is not the only valid answer, so when calculating the loss we reserve some probability for the other labels
* Helps avoid overfitting

code [source](https://fairseq.readthedocs.io/en/latest/_modules/fairseq/criterions/label_smoothed_cross_entropy.html)
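
A small worked example may help connect the bullets to the formula used in the criterion. It is written in plain Python with a toy 3-class distribution (the function name `label_smoothed_nll` and the numbers are illustrative) and checks that the two-term loss equals the cross-entropy against an explicitly smoothed target distribution.

```python
import math


def label_smoothed_nll(lprobs, target, smoothing):
    """(1 - eps) * NLL of the target + (eps / V) * negative sum of all log-probs."""
    vocab = len(lprobs)
    nll = -lprobs[target]
    smooth = -sum(lprobs)
    return (1.0 - smoothing) * nll + (smoothing / vocab) * smooth


probs = [0.7, 0.2, 0.1]  # toy model output over 3 classes
lprobs = [math.log(p) for p in probs]
loss = label_smoothed_nll(lprobs, target=0, smoothing=0.1)

# the same loss, written as cross-entropy against the smoothed target:
# every class gets eps / V, and the correct class gets an extra (1 - eps)
eps, vocab = 0.1, len(probs)
q = [eps / vocab] * vocab
q[0] += 1.0 - eps
cross_entropy = -sum(qi * lp for qi, lp in zip(q, lprobs))

plain_nll = -lprobs[0]  # loss without smoothing, for comparison
print(loss, cross_entropy, plain_nll)
```

The smoothed loss is larger than the plain negative log-likelihood here because the model is penalized for assigning low probability to the non-target classes.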

In [None]:
import torch
import torch.nn as nn

class LabelSmoothedCrossEntropyCriterion(nn.Module):
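    def __init__(self, smoothing, ignore_index=None, reduce=True):
        """
        Cross-entropy loss with label smoothing.
        args:
            smoothing: smoothing parameter, the probability mass spread uniformly over all labels
            ignore_index: label index excluded from the loss (usually padding)
            reduce: whether to sum the loss over all positions
        """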
        super().__init__()
        self.smoothing = smoothing
        self.ignore_index = ignore_index
        self.reduce = reduce

    def forward(self, lprobs, target):
        """
        Compute the label-smoothed cross-entropy loss.
        args:
            lprobs: log-probabilities, shape (N, num_classes)
            target: target labels, shape (N,)
        return:
            loss: the resulting loss
        """
        # add a trailing dimension to target if it has one dimension fewer than lprobs
        if target.dim() == lprobs.dim() - 1:
            target = target.unsqueeze(-1)

        # nll_loss: negative log-likelihood of the target label, equivalent to F.nll_loss.
        # This is the ordinary cross-entropy when the target is one-hot (no smoothing).
        nll_loss = -lprobs.gather(dim=-1, index=target)

        # smooth_loss: negative sum of the log-probabilities over all labels,
        # which accounts for the probability mass spread across the labels
        smooth_loss = -lprobs.sum(dim=-1, keepdim=True)

        # zero out the loss at positions whose target is ignore_index
        if self.ignore_index is not None:
            pad_mask = target.eq(self.ignore_index)  # padding mask
            nll_loss.masked_fill_(pad_mask, 0.0)
            smooth_loss.masked_fill_(pad_mask, 0.0)
        else:
            # without ignore_index, just drop the trailing dimension
            nll_loss = nll_loss.squeeze(-1)
            smooth_loss = smooth_loss.squeeze(-1)

        # sum the losses over all positions if requested
        if self.reduce:
            nll_loss = nll_loss.sum()
            smooth_loss = smooth_loss.sum()

        # eps_i: probability mass given to each label by the uniform smoothing term
        eps_i = self.smoothing / lprobs.size(-1)

        # final loss:
        # (1 - smoothing) * nll_loss: contribution of the target label
        # eps_i * smooth_loss: contribution of the uniform distribution over all labels
        loss = (1.0 - self.smoothing) * nll_loss + eps_i * smooth_loss

        return loss


# generally, 0.1 is a good choice for label smoothing
criterion = LabelSmoothedCrossEntropyCriterion(
    smoothing=0.1,
    ignore_index=task.target_dictionary.pad(),  # ignore the padding index
)

## Optimizer: Adam + lr scheduling
Inverse square root scheduling is important for stability when training Transformers. It was later applied to RNNs as well.
The learning rate is updated according to the following equation: it increases linearly during the warmup stage, then decays proportionally to the inverse square root of the step number.
$$lrate = d_{\text{model}}^{-0.5}\cdot\min({step\_num}^{-0.5},{step\_num}\cdot{warmup\_steps}^{-1.5})$$
code [source](https://nlp.seas.harvard.edu/2018/04/03/attention.html)
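
To make the two stages of the equation concrete, the schedule can be evaluated at a few steps in plain Python. The function name `noam_rate` is illustrative; `warmup=4000` is the value used in the cited paper and `model_size=256` matches the embedding dimension in the table above, but the notebook's own `config.lr_warmup` and `config.lr_factor` are not shown here, so treat these numbers as assumptions.

```python
def noam_rate(step, model_size=256, warmup=4000, factor=1.0):
    """Learning rate from the equation above, scaled by an optional factor."""
    if step == 0:
        return 0.0
    return factor * model_size ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


# linear warmup: halfway through warmup the rate is half of the peak
half = noam_rate(2000)
peak = noam_rate(4000)
# inverse-square-root decay: four times the steps gives half the rate
later = noam_rate(16000)

print(half, peak, later)
```

The two arguments of `min` are equal exactly at `step == warmup`, which is where the rate peaks.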

In [None]:
class NoamOpt:
    "Optim wrapper that implements rate."
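    def __init__(self, model_size, factor, warmup, optimizer):
        """
        Initialize the NoamOpt optimizer wrapper.
        :param model_size: model size, usually the embedding dimension
        :param factor: multiplicative factor applied to the learning rate
        :param warmup: number of warmup steps during which the learning rate increases
        :param optimizer: the wrapped optimizer (e.g. Adam)
        """
        self.optimizer = optimizer  # wrapped optimizer instance
        self._step = 0  # number of steps taken so far
        self.warmup = warmup  # number of warmup steps
        self.factor = factor  # learning rate factor
        self.model_size = model_size  # model size (embedding dimension)
        self._rate = 0  # current learning rate

    @property
    def param_groups(self):
        """
        Return the parameter groups of the wrapped optimizer.
        :return: the optimizer's parameter groups
        """
        return self.optimizer.param_groups

    def multiply_grads(self, c):
        """
        Multiply all gradients by the constant *c*.
        :param c: the multiplier
        """
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is not None:
                    p.grad.data.mul_(c)

    def step(self):
        """
        Update the learning rate, then the parameters.
        """
        self._step += 1  # advance the step counter
        rate = self.rate()  # compute the learning rate
        for p in self.param_groups:
            p['lr'] = rate  # set the learning rate
        self._rate = rate  # remember the current learning rate
        self.optimizer.step()  # run the wrapped optimizer's step

    def rate(self, step=None):
        """
        Compute the learning rate.
        :param step: the step to evaluate; if None, the internal step counter is used
        :return: the learning rate at that step
        """
        if step is None:
            step = self._step  # default to the internal step counter
        return 0 if not step else self.factor * \
            (self.model_size ** (-0.5) *  # scaling by the model size
             min(step ** (-0.5), step * self.warmup ** (-1.5)))  # linear warmup, then inverse square root decay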

## Scheduling Visualized

In [None]:
# create the NoamOpt wrapper, which adjusts the learning rate according to the training step
optimizer = NoamOpt(
    model_size=arch_args.encoder_embed_dim,  # model size (the embedding dimension)
    factor=config.lr_factor,  # learning rate factor
    warmup=config.lr_warmup,  # number of warmup steps, during which the learning rate increases
    optimizer=torch.optim.AdamW(  # AdamW (Adam with decoupled weight decay)
        model.parameters(),  # parameters to train
        lr=0,  # initial learning rate is 0; the actual rate is set by NoamOpt
        betas=(0.9, 0.98),  # decay rates of the first and second moment estimates
        eps=1e-9,  # small constant to avoid division by zero
        weight_decay=0.0001  # weight decay, used to reduce overfitting
    )
)

# plot the learning rate as a function of the training step
plt.plot(np.arange(1, 100000), [optimizer.rate(i) for i in range(1, 100000)])
plt.legend([f"{optimizer.model_size}:{optimizer.warmup}"])  # legend shows model size and warmup steps

# Training Procedure

## Training

In [None]:
from fairseq.data import iterators
from torch.cuda.amp import GradScaler, autocast

def train_one_epoch(epoch_itr, model, task, criterion, optimizer, accum_steps=1):
    itr = epoch_itr.next_epoch_itr(shuffle=True)  # iterator over the shuffled batches of this epoch
    itr = iterators.GroupedIterator(itr, accum_steps)  # gradient accumulation: update every accum_steps batches

    stats = {"loss": []}  # training loss statistics
    scaler = GradScaler()  # automatic mixed precision (AMP) training

    model.train()  # switch to training mode
    progress = tqdm.tqdm(itr, desc=f"train epoch {epoch_itr.epoch}", leave=False)  # progress bar
    for samples in progress:
        model.zero_grad()  # reset the gradients at the start of each update
        accum_loss = 0  # accumulated loss
        sample_size = 0  # accumulated number of tokens
        # gradient accumulation: update every accum_steps batches
        for i, sample in enumerate(samples):
            if i == 1:
                # emptying the CUDA cache after the first step can reduce the chance of OOM
                torch.cuda.empty_cache()

            sample = utils.move_to_cuda(sample, device=device)  # move the batch to the GPU
            target = sample["target"]  # ground-truth target tokens
            sample_size_i = sample["ntokens"]  # number of tokens in this batch
            sample_size += sample_size_i

            # mixed precision training
            with autocast():
                net_output = model.forward(**sample["net_input"])  # forward pass
                lprobs = F.log_softmax(net_output[0], -1)  # log-probabilities
                loss = criterion(lprobs.view(-1, lprobs.size(-1)), target.view(-1))  # summed loss

                # logging
                accum_loss += loss.item()
                # back-propagation on the scaled loss
                scaler.scale(loss).backward()

        # undo the loss scaling before modifying the gradients
        scaler.unscale_(optimizer)
        optimizer.multiply_grads(1 / (sample_size or 1.0))  # (sample_size or 1.0) handles the case of a zero sample size
        gnorm = nn.utils.clip_grad_norm_(model.parameters(), config.clip_norm)  # gradient clipping prevents exploding gradients

        # update the model parameters
        scaler.step(optimizer)
        scaler.update()

        # logging
        loss_print = accum_loss / sample_size  # average loss per token for this update
        stats["loss"].append(loss_print)
        progress.set_postfix(loss=loss_print)  # show the loss on the progress bar
        if config.use_wandb:  # log to wandb if enabled
            wandb.log({
                "train/loss": loss_print,  # training loss
                "train/grad_norm": gnorm.item(),  # gradient norm
                "train/lr": optimizer.rate(),  # learning rate
                "train/sample_size": sample_size,  # number of tokens in this update
            })

    loss_print = np.mean(stats["loss"])  # average training loss over the epoch
    logger.info(f"training loss: {loss_print:.4f}")
    return stats

## Validation & Inference
To prevent overfitting, validation is required every epoch to measure the performance on unseen data.
- The procedure is essentially the same as training, with the addition of an inference step
- After validation we can save the model weights

Validation loss alone cannot describe the actual performance of the model
- Directly produce translation hypotheses with the current model, then calculate BLEU against the reference translations
- We can also manually examine the quality of the hypotheses
- We use fairseq's sequence generator, which performs beam search, to generate translation hypotheses
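
BLEU is built from clipped n-gram precisions between a hypothesis and its reference. The sketch below computes only that building block in plain Python; it is a simplification for intuition, not a replacement for sacrebleu, which additionally combines several n-gram orders, applies a brevity penalty and handles tokenization. The function name `ngram_precision` is illustrative.

```python
from collections import Counter


def ngram_precision(hyp, ref, n):
    """Fraction of hypothesis n-grams that also occur in the reference (clipped counts)."""
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    overlap = sum(min(count, ref_ngrams[gram]) for gram, count in hyp_ngrams.items())
    total = sum(hyp_ngrams.values())
    return overlap / total if total else 0.0


ref = "湯姆 是 個 學生 。".split()
hyp = "湯姆 是 學生 。".split()

p1 = ngram_precision(hyp, ref, 1)  # every hypothesis word appears in the reference
p2 = ngram_precision(hyp, ref, 2)  # the bigram ("是", "學生") does not
print(p1, p2)
```

The hypothesis drops one word, which unigram precision alone does not punish; this is one reason the full metric also uses higher-order n-grams and a brevity penalty.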

In [None]:
# fairseq's beam search generator
# given a model and an input sequence, produce translation hypotheses by beam search
sequence_generator = task.build_generator([model], config)

def decode(toks, dictionary):
    # convert a Tensor into a human-readable sentence
    s = dictionary.string(
        toks.int().cpu(),  # cast the tokens to int and move them to the CPU
        config.post_process,  # post-processing option (e.g. removing subword markers)
    )
    return s if s else "<unk>"  # return "<unk>" if the result is empty

def inference_step(sample, model):
    # generate translation hypotheses with beam search
    gen_out = sequence_generator.generate([model], sample)

    # source sentences, hypotheses and references
    srcs = []
    hyps = []
    refs = []

    # each entry of gen_out corresponds to one sample in the batch
    for i in range(len(gen_out)):
        # for each sample, collect the input, the hypothesis and the reference;
        # these are used later to calculate BLEU
        srcs.append(decode(
            utils.strip_pad(sample["net_input"]["src_tokens"][i], task.source_dictionary.pad()),  # strip padding
            task.source_dictionary,  # source dictionary
        ))
        hyps.append(decode(
            gen_out[i][0]["tokens"],  # 0 selects the best hypothesis in the beam
            task.target_dictionary,  # target dictionary
        ))
        refs.append(decode(
            utils.strip_pad(sample["target"][i], task.target_dictionary.pad()),  # strip padding
            task.target_dictionary,  # target dictionary
        ))

    # return the source sentences, hypotheses and references
    return srcs, hyps, refs


In [None]:
import shutil
import sacrebleu

def validate(model, task, criterion, log_to_wandb=True):
    logger.info('begin validation')
    # load the validation data iterator
    itr = load_data_iterator(task, "valid", 1, config.max_tokens, config.num_workers).next_epoch_itr(shuffle=False)

    # initialize the statistics
    stats = {"loss":[], "bleu": 0, "srcs":[], "hyps":[], "refs":[]}
    srcs = []  # source sentences
    hyps = []  # model hypotheses
    refs = []  # reference translations

    model.eval()  # switch to evaluation mode (disables dropout)
    progress = tqdm.tqdm(itr, desc=f"validation", leave=False)  # progress bar
    with torch.no_grad():  # no gradients are needed during evaluation
        for i, sample in enumerate(progress):
            # validation loss
            sample = utils.move_to_cuda(sample, device=device)  # move the batch to the GPU
            net_output = model.forward(**sample["net_input"])  # forward pass

            # log-probabilities
            lprobs = F.log_softmax(net_output[0], -1)
            target = sample["target"]  # target tokens
            sample_size = sample["ntokens"]  # number of tokens in this batch
            # loss averaged over the tokens of the batch
            loss = criterion(lprobs.view(-1, lprobs.size(-1)), target.view(-1)) / sample_size
            progress.set_postfix(valid_loss=loss.item())  # show the loss on the progress bar
            stats["loss"].append(loss)

            # run inference (translation)
            s, h, r = inference_step(sample, model)
            srcs.extend(s)
            hyps.extend(h)
            refs.extend(r)

    # choose the BLEU tokenizer according to the target language
    tok = 'zh' if task.cfg.target_lang == 'zh' else '13a'
    stats["loss"] = torch.stack(stats["loss"]).mean().item()  # mean loss over all batches
    stats["bleu"] = sacrebleu.corpus_bleu(hyps, [refs], tokenize=tok)  # corpus-level BLEU
    stats["srcs"] = srcs
    stats["hyps"] = hyps
    stats["refs"] = refs

    # log the loss and BLEU score to wandb if enabled
    if config.use_wandb and log_to_wandb:
        wandb.log({
            "valid/loss": stats["loss"],
            "valid/bleu": stats["bleu"].score,
        }, commit=False)

    # show a random example to inspect the translation quality
    showid = np.random.randint(len(hyps))
    logger.info("example source: " + srcs[showid])
    logger.info("example hypothesis: " + hyps[showid])
    logger.info("example reference: " + refs[showid])

    # show the final validation results
    logger.info(f"validation loss:\t{stats['loss']:.4f}")
    logger.info(stats["bleu"].format())  # BLEU score
    return stats

# Save and Load Model Weights

In [None]:
def validate_and_save(model, task, criterion, optimizer, epoch, save=True):
    # run validation and collect the results
    stats = validate(model, task, criterion)
    bleu = stats['bleu']  # BLEU result
    loss = stats['loss']  # validation loss

    if save:
        # save epoch checkpoints
        savedir = Path(config.savedir).absolute()
        savedir.mkdir(parents=True, exist_ok=True)  # create the directory if it does not exist

        # contents of the checkpoint
        check = {
            "model": model.state_dict(),  # model parameters
            "stats": {"bleu": bleu.score, "loss": loss},  # BLEU score and loss
            "optim": {"step": optimizer._step}  # current optimizer step
        }

        # write the checkpoint to disk
        torch.save(check, savedir/f"checkpoint{epoch}.pt")
        shutil.copy(savedir/f"checkpoint{epoch}.pt", savedir/f"checkpoint_last.pt")  # also keep it as the latest checkpoint
        logger.info(f"saved epoch checkpoint: {savedir}/checkpoint{epoch}.pt")

        # save the translation samples of this epoch
        with open(savedir/f"samples{epoch}.{config.source_lang}-{config.target_lang}.txt", "w") as f:
            for s, h in zip(stats["srcs"], stats["hyps"]):
                f.write(f"{s}\t{h}\n")  # source sentence and its hypothesis

        # keep track of the best BLEU score and save the best checkpoint
        if getattr(validate_and_save, "best_bleu", 0) < bleu.score:
            validate_and_save.best_bleu = bleu.score
            torch.save(check, savedir/f"checkpoint_best.pt")

        # delete the old checkpoint that falls outside the last keep_last_epochs epochs
        del_file = savedir / f"checkpoint{epoch - config.keep_last_epochs}.pt"
        if del_file.exists():
            del_file.unlink()

    return stats

def try_load_checkpoint(model, optimizer=None, name=None):
    # try to load a checkpoint; log a message if none is found
    name = name if name else "checkpoint_last.pt"  # default to the latest checkpoint
    checkpath = Path(config.savedir)/name

    if checkpath.exists():
        check = torch.load(checkpath)  # read the checkpoint
        model.load_state_dict(check["model"])  # restore the model parameters
        stats = check["stats"]  # statistics saved with the checkpoint
        step = "unknown"  # the step stays "unknown" unless an optimizer is given

        if optimizer is not None:
            optimizer._step = step = check["optim"]["step"]  # restore the optimizer step

        logger.info(f"loaded checkpoint {checkpath}: step={step} loss={stats['loss']} bleu={stats['bleu']}")
    else:
        logger.info(f"no checkpoints found at {checkpath}!")

# Main
## Training loop

In [None]:
model = model.to(device=device)
criterion = criterion.to(device=device)

In [None]:
!nvidia-smi

In [None]:
logger.info("task: {}".format(task.__class__.__name__))
logger.info("encoder: {}".format(model.encoder.__class__.__name__))
logger.info("decoder: {}".format(model.decoder.__class__.__name__))
logger.info("criterion: {}".format(criterion.__class__.__name__))
logger.info("optimizer: {}".format(optimizer.__class__.__name__))
logger.info(
    "num. model params: {:,} (num. trained: {:,})".format(
        sum(p.numel() for p in model.parameters()),
        sum(p.numel() for p in model.parameters() if p.requires_grad),
    )
)
logger.info(f"max tokens per batch = {config.max_tokens}, accumulate steps = {config.accum_steps}")

In [None]:
# load the training data iterator, starting from config.start_epoch
epoch_itr = load_data_iterator(task, "train", config.start_epoch, config.max_tokens, config.num_workers)

# try to load a checkpoint (if any) to resume training
try_load_checkpoint(model, optimizer, name=config.resume)

# train until the maximum epoch is reached
while epoch_itr.next_epoch_idx <= config.max_epoch:
    # train for one epoch
    train_one_epoch(epoch_itr, model, task, criterion, optimizer, config.accum_steps)

    # validate and save the checkpoint
    stats = validate_and_save(model, task, criterion, optimizer, epoch=epoch_itr.epoch)

    logger.info("end of epoch {}".format(epoch_itr.epoch))

    # load the training data iterator for the next epoch
    epoch_itr = load_data_iterator(task, "train", epoch_itr.next_epoch_idx, config.max_tokens, config.num_workers)

# Submission
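
Checkpoint averaging, used in the cell that follows, amounts to taking the element-wise mean of each parameter across several saved checkpoints. A minimal sketch of the idea, assuming each checkpoint is a plain dict of tensors (the helper name `average_state_dicts` and the toy values are illustrative; fairseq's script works on the real checkpoint files and may differ in detail):

```python
import torch


def average_state_dicts(state_dicts):
    """Element-wise mean of every parameter across a list of state dicts."""
    averaged = {}
    for key in state_dicts[0]:
        stacked = torch.stack([sd[key].float() for sd in state_dicts])
        averaged[key] = stacked.mean(dim=0)
    return averaged


# two toy "checkpoints" with a single parameter each
ckpts = [
    {"w": torch.tensor([1.0, 2.0])},
    {"w": torch.tensor([3.0, 6.0])},
]
avg = average_state_dicts(ckpts)
print(avg["w"])
```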

In [None]:
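# directory containing the model checkpoints
checkdir=config.savedir

# Average the last 5 epoch checkpoints with fairseq's script.
# The effect is similar to ensembling and tends to give a more stable model.
#   --inputs: directory containing the checkpoints
#   --num-epoch-checkpoints: number of most recent epoch checkpoints to average
#   --output: path of the averaged checkpoint
# (comments must not follow the trailing backslashes, or the line continuation breaks)
!python ./fairseq/scripts/average_checkpoints.py \
--inputs {checkdir} \
--num-epoch-checkpoints 5 \
--output {checkdir}/avg_last_5_checkpoint.pt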

## Confirm model weights used to generate submission

In [None]:
# checkpoint_last.pt : latest epoch
# checkpoint_best.pt : highest validation BLEU
# avg_last_5_checkpoint.pt: the average of the last 5 epochs

# load the average of the last 5 epoch checkpoints
try_load_checkpoint(model, name="avg_last_5_checkpoint.pt")

# validate without logging the results to wandb
validate(model, task, criterion, log_to_wandb=False)

## Generate Prediction

In [None]:
def generate_prediction(model, task, split="test", outfile="./prediction.txt"):
    # load the dataset split (the test set by default)
    task.load_dataset(split=split, epoch=1)

    # build the data iterator from the config
    itr = load_data_iterator(task, split, 1, config.max_tokens, config.num_workers).next_epoch_itr(shuffle=False)

    idxs = []  # sample ids
    hyps = []  # predictions

    # switch to evaluation mode (disables dropout)
    model.eval()

    # progress bar
    progress = tqdm.tqdm(itr, desc="prediction")

    # no gradients are needed at inference time, which saves memory
    with torch.no_grad():
        # iterate over the batches of the split
        for i, sample in enumerate(progress):
            # move the batch to the GPU
            sample = utils.move_to_cuda(sample, device=device)

            # run inference to get the predictions
            s, h, r = inference_step(sample, model)

            # keep the predictions (h)
            hyps.extend(h)
            # keep the sample ids so the original order can be restored
            idxs.extend(list(sample['id']))

    # sort the predictions back into the original order
    hyps = [x for _, x in sorted(zip(idxs, hyps))]

    # write the sorted predictions to the output file
    with open(outfile, "w") as f:
        for h in hyps:
            f.write(h + "\n")

In [None]:
generate_prediction(model, task)

In [None]:
raise

# Back-translation

## Train a backward translation model

1. Switch the source_lang and target_lang in **config**
2. Change the savedir in **config** (e.g. "./checkpoints/transformer-back")
3. Train model
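
The first two steps can be sketched as follows. The `config` object is defined earlier in the notebook and is not shown here, so a small stand-in `Namespace` with only the relevant fields is used; the helper name `make_backward_config` and the initial `savedir` value are illustrative.

```python
from argparse import Namespace

# stand-in for the notebook's config, reduced to the fields that change
config = Namespace(source_lang="en", target_lang="zh", savedir="./checkpoints/rnn")


def make_backward_config(cfg, savedir):
    """Return a copy of cfg with the translation direction reversed."""
    new_cfg = Namespace(**vars(cfg))
    new_cfg.source_lang, new_cfg.target_lang = cfg.target_lang, cfg.source_lang
    new_cfg.savedir = savedir  # separate directory so forward checkpoints are not overwritten
    return new_cfg


back_config = make_backward_config(config, "./checkpoints/transformer-back")
print(back_config)
```

Working on a copy keeps the forward-direction settings available for the final training run after back-translation.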

## Generate synthetic data with backward model

### Download monolingual data

In [None]:
mono_dataset_name = 'mono'

In [None]:
from pathlib import Path

# set up the data directory
mono_prefix = Path(data_dir).absolute() / mono_dataset_name
mono_prefix.mkdir(parents=True, exist_ok=True)  # create the directory if it does not exist

# download links for the dataset
urls = (
    '"https://onedrive.live.com/download?cid=3E549F3B24B238B4&resid=3E549F3B24B238B4%214986&authkey=AANUKbGfZx0kM80"',  # default link
    # if the link above fails, try the following one
    # "https://www.csie.ntu.edu.tw/~r09922057/ML2021-hw5/ted_zh_corpus.deduped.gz",
    # if that also fails, try the following Mega link
    # "https://mega.nz/#!vMNnDShR!4eHDxzlpzIpdpeQTD-htatU_C7QwcBTwGDaSeBqH534",
)

# corresponding file names
file_names = (
    'ted_zh_corpus.deduped.gz',
)

# download and extract the files
for u, f in zip(urls, file_names):
    path = mono_prefix / f  # where the file is stored
    if not path.exists():
        if 'mega' in u:
            # Mega links are downloaded with megadl
            !megadl {u} --path {path}
        else:
            # everything else is downloaded with wget
            !wget {u} -O {path}
    else:
        print(f'{f} already exists, skipping download')

    # pick the extraction command based on the file suffix
    # (archives are extracted into mono_prefix, the directory created above)
    if path.suffix == ".tgz":
        !tar -xvf {path} -C {mono_prefix}
    elif path.suffix == ".zip":
        !unzip -o {path} -d {mono_prefix}
    elif path.suffix == ".gz":
        !gzip -fkd {path}

### TODO: clean corpus

1. remove sentences that are too long or too short
2. unify punctuation

hint: you can use clean_s() defined above to do this
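
One possible shape for this step, as a plain-Python sketch. It does not use `clean_s()` (which is defined earlier in the notebook and not shown here); the helper names, the punctuation map and the length limits are assumptions to adapt to the actual corpus.

```python
# illustrative mapping from full-width or curly punctuation to one canonical form
PUNCT_MAP = str.maketrans({"“": '"', "”": '"', "！": "!", "？": "?", "，": ","})


def clean_line(line, min_len=1, max_len=1000):
    """Normalize whitespace and punctuation; return None if the length is out of range."""
    text = " ".join(line.split()).translate(PUNCT_MAP)
    if not (min_len <= len(text) <= max_len):
        return None
    return text


def clean_corpus(lines, **limits):
    """Clean every line and drop the ones rejected by the length filter."""
    cleaned = (clean_line(line, **limits) for line in lines)
    return [text for text in cleaned if text is not None]


sample_lines = ["你好，世界！", "", "  很  長  ", "這一句話超過了長度上限"]
kept = clean_corpus(sample_lines, min_len=2, max_len=6)
print(kept)
```

Length is measured in characters here, which is a reasonable proxy for Chinese text before subword tokenization.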

### TODO: Subword Units

Use the spm model of the backward model to tokenize the data into subword units

hint: spm model is located at DATA/raw-data/\[dataset\]/spm\[vocab_num\].model

### Binarize

use fairseq to binarize data

In [None]:
binpath = Path('./DATA/data-bin', mono_dataset_name)
src_dict_file = './DATA/data-bin/ted2020/dict.en.txt'
tgt_dict_file = src_dict_file
monopref = str(mono_prefix/"mono.tok") # whatever filepath you get after applying subword tokenization
if binpath.exists():
    print(binpath, "exists, will not overwrite!")
else:
    !python -m fairseq_cli.preprocess\
        --source-lang 'zh'\
        --target-lang 'en'\
        --trainpref {monopref}\
        --destdir {binpath}\
        --srcdict {src_dict_file}\
        --tgtdict {tgt_dict_file}\
        --workers 2

### TODO: Generate synthetic data with backward model

Add binarized monolingual data to the original data directory, and name it with "split_name"

ex. ./DATA/data-bin/ted2020/\[split_name\].zh-en.\["en", "zh"\].\["bin", "idx"\]

then you can use 'generate_prediction(model, task, split="split_name")' to generate translation prediction

In [None]:
# Add binarized monolingual data to the original data directory, and name it with "split_name"
# e.g. ./DATA/data-bin/ted2020/[split_name].zh-en.["en", "zh"].["bin", "idx"]
!cp ./DATA/data-bin/mono/train.zh-en.zh.bin ./DATA/data-bin/ted2020/mono.zh-en.zh.bin
!cp ./DATA/data-bin/mono/train.zh-en.zh.idx ./DATA/data-bin/ted2020/mono.zh-en.zh.idx
!cp ./DATA/data-bin/mono/train.zh-en.en.bin ./DATA/data-bin/ted2020/mono.zh-en.en.bin
!cp ./DATA/data-bin/mono/train.zh-en.en.idx ./DATA/data-bin/ted2020/mono.zh-en.en.idx

In [None]:
# hint: do prediction on split='mono' to create prediction_file
# generate_prediction( ... ,split=... ,outfile=... )

### TODO: Create new dataset

1. Combine the prediction data with monolingual data
2. Use the original spm model to tokenize data into Subword Units
3. Binarize data with fairseq
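
Step 1 boils down to pairing line *i* of the prediction file with line *i* of the monolingual file. A minimal in-memory sketch, assuming the predictions were written in the same order as the monolingual sentences (the helper name `pair_lines` and the sample sentences are illustrative; reading from and writing to the actual files is left out):

```python
def pair_lines(predictions, monolingual):
    """Pair synthetic English lines with their Chinese sources, dropping empty pairs."""
    if len(predictions) != len(monolingual):
        raise ValueError("prediction and monolingual files must have the same number of lines")
    pairs = [(en.strip(), zh.strip()) for en, zh in zip(predictions, monolingual)]
    return [(en, zh) for en, zh in pairs if en and zh]


pred_en = ["tom is a student .\n", "\n", "thank you .\n"]
mono_zh = ["湯姆 是 個 學生 。\n", "。\n", "謝謝 。\n"]

pairs = pair_lines(pred_en, mono_zh)
en_lines = [en for en, _ in pairs]
zh_lines = [zh for _, zh in pairs]
print(pairs)
```

Dropping a pair on both sides at once keeps the two files aligned, which the binarization step relies on.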

In [None]:
# Combine prediction_file (.en) and mono.zh (.zh) into a new dataset.
#
# hint: tokenize prediction_file with the spm model
# spm_model.encode(line, out_type=str)
# output: ./DATA/rawdata/mono/mono.tok.en & mono.tok.zh
#
# hint: use fairseq to binarize these two files again
# binpath = Path('./DATA/data-bin/synthetic')
# src_dict_file = './DATA/data-bin/ted2020/dict.en.txt'
# tgt_dict_file = src_dict_file
# monopref = './DATA/rawdata/mono/mono.tok' # or whatever path after applying subword tokenization, w/o the suffix (.zh/.en)
# if binpath.exists():
#     print(binpath, "exists, will not overwrite!")
# else:
#     !python -m fairseq_cli.preprocess\
#         --source-lang 'zh'\
#         --target-lang 'en'\
#         --trainpref {monopref}\
#         --destdir {binpath}\
#         --srcdict {src_dict_file}\
#         --tgtdict {tgt_dict_file}\
#         --workers 2

In [None]:
# create a new dataset from all the files prepared above
!cp -r ./DATA/data-bin/ted2020/ ./DATA/data-bin/ted2020_with_mono/

!cp ./DATA/data-bin/synthetic/train.zh-en.zh.bin ./DATA/data-bin/ted2020_with_mono/train1.en-zh.zh.bin
!cp ./DATA/data-bin/synthetic/train.zh-en.zh.idx ./DATA/data-bin/ted2020_with_mono/train1.en-zh.zh.idx
!cp ./DATA/data-bin/synthetic/train.zh-en.en.bin ./DATA/data-bin/ted2020_with_mono/train1.en-zh.en.bin
!cp ./DATA/data-bin/synthetic/train.zh-en.en.idx ./DATA/data-bin/ted2020_with_mono/train1.en-zh.en.idx

The new dataset "ted2020_with_mono" is now created. To train on it:

1. Change the datadir in **config** ("./DATA/data-bin/ted2020_with_mono")
2. Switch the source_lang and target_lang in **config** back to ("en", "zh")
3. Change the savedir in **config** (e.g. "./checkpoints/transformer-bt")
4. Train the model
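As a sketch, the config changes might look like this. It assumes `config` is an `argparse.Namespace` with the fields `datadir`, `savedir`, `source_lang` and `target_lang`; the starting values shown for the backward model are placeholders, so adapt them to the config actually defined earlier in the notebook.

```python
from argparse import Namespace

# Stand-in for the config used to train the backward (zh -> en) model.
config = Namespace(
    datadir="./DATA/data-bin/ted2020",
    savedir="./checkpoints/transformer-back",  # placeholder value
    source_lang="zh",
    target_lang="en",
)

# Point training at the combined dataset and restore the forward direction.
config.datadir = "./DATA/data-bin/ted2020_with_mono"
config.source_lang, config.target_lang = "en", "zh"
config.savedir = "./checkpoints/transformer-bt"
```

Using a separate `savedir` keeps the back-translation run from overwriting earlier checkpoints.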

# References

1. <a name=ott2019fairseq></a>Ott, M., Edunov, S., Baevski, A., Fan, A., Gross, S., Ng, N., ... & Auli, M. (2019, June). fairseq: A Fast, Extensible Toolkit for Sequence Modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations) (pp. 48-53).
2. <a name=vaswani2017></a>Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017, December). Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (pp. 6000-6010).
3. <a name=reimers-2020-multilingual-sentence-bert></a>Reimers, N., & Gurevych, I. (2020, November). Making Monolingual Sentence Embeddings Multilingual Using Knowledge Distillation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 4512-4525).
4. <a name=tiedemann2012parallel></a>Tiedemann, J. (2012, May). Parallel Data, Tools and Interfaces in OPUS. In LREC (Vol. 2012, pp. 2214-2218).
5. <a name=kudo-richardson-2018-sentencepiece></a>Kudo, T., & Richardson, J. (2018, November). SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (pp. 66-71).
6. <a name=sennrich-etal-2016-improving></a>Sennrich, R., Haddow, B., & Birch, A. (2016, August). Improving Neural Machine Translation Models with Monolingual Data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 86-96).
7. <a name=edunov-etal-2018-understanding></a>Edunov, S., Ott, M., Auli, M., & Grangier, D. (2018). Understanding Back-Translation at Scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (pp. 489-500).
8. https://github.com/ajinkyakulkarni14/TED-Multilingual-Parallel-Corpus
9. https://ithelp.ithome.com.tw/articles/10233122
10. https://nlp.seas.harvard.edu/2018/04/03/attention.html