# T5 English Text Generation
## Learning objective
Use a T5 model to generate complete sentences from English keywords.

## Intended audience
 - Familiar with basic Python syntax
 - Have a basic understanding of deep learning concepts

## How to run
In Jupyter Notebook, select the cell you want to run, then run it in one of the following ways:
 - In the top toolbar, choose Cell > Run Cells
 - Press the shortcut Shift + Enter

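## Outline
 - [Install packages](#Install-packages)
 - [Load the T5 model](#Load-the-T5-model)
 - [Data processing](#Data-processing)
 - [Hyperparameters](#Hyperparameters)
 - [Validation](#Validation)
 - [Training](#Training)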


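## Install packages
 - transformers (4.37.0): Hugging Face library for loading models
 - datasets (2.16.1): Hugging Face library for loading datasets
 - torcheval (0.0.7): assorted evaluation metrics
 - pytorch-ignite: provides the `Rouge` metric used for validation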

In [None]:
! pip install transformers
! pip install datasets
! pip install torcheval
! pip install pytorch-ignite

In [None]:
import transformers as T
from datasets import load_dataset
import torch
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW
from tqdm import tqdm
from ignite.metrics import Rouge
import re
device = "cuda" if torch.cuda.is_available() else "cpu"

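## Load the T5 model
 - Use Hugging Face to load the model architecture, weights, and tokenizer
 - Downloaded files are cached in ./cache/
 - Call .to(device) to move the model to the training device (the GPU when available, otherwise the CPU)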

In [None]:
t5_model = T.T5ForConditionalGeneration.from_pretrained("google/flan-t5-base", cache_dir="./cache/").to(device)
t5_tokenizer = T.T5Tokenizer.from_pretrained("google/flan-t5-base", cache_dir="./cache/")

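## Data processing
 - Use Dataset and DataLoader from torch.utils.data to load and preprocess the data in batches
 - Join the keywords of each input with ", " and the reference sentences of each output with "/ "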
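The joined format can be illustrated without loading the dataset. The sketch below is a minimal, standalone illustration: `build_sample` and `split_references` are hypothetical helper names, and the keywords and sentences are made-up examples, not rows from the dataset.

```python
def build_sample(concepts, references):
    """Join keywords with ", " and reference sentences with "/ "."""
    # Make sure every reference sentence ends with a period before joining
    references = [s if s.endswith(".") else s + "." for s in references]
    return {"concepts": ", ".join(concepts), "targets": "/ ".join(references)}


def split_references(text):
    """Recover the individual sentences from a "/"-joined string."""
    return [s.strip() for s in text.split("/") if s.strip()]


# Made-up example data, used only to show the string format
sample = build_sample(
    ["dog", "park", "run"],
    ["A dog runs in the park", "The dog likes to run at the park."],
)
recovered = split_references(sample["targets"])
print(sample)
print(recovered)
```

Joining all references into one target string means the model is trained to produce several sentences at once, which is why the validation step has to split the generated text on "/" again.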

In [None]:
def get_tensor(sample):
    # Tokenize the model inputs and the ground-truth targets and pack them into padded tensors
    model_inputs = t5_tokenizer.batch_encode_plus([each["concepts"] for each in sample], padding=True, truncation=True, return_tensors="pt")
    model_outputs = t5_tokenizer.batch_encode_plus([each["targets"] for each in sample], padding=True, truncation=True, return_tensors="pt")
    return model_inputs["input_ids"].to(device), model_outputs["input_ids"].to(device)

class CommonGenDataset(Dataset):
    def __init__(self, split="train") -> None:
        super().__init__()
        assert split in ["train", "validation", "test"]
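        # Group rows by concept set so each sample holds every reference sentence for one keyword set
        data_df = load_dataset("allenai/common_gen", split=split, cache_dir="./cache/").to_pandas().groupby("concept_set_idx")
        self.data = []
        for _, group in data_df:
            # Make sure every reference ends with a period, then join the references with "/ "
            targets = "/ ".join([s if s.endswith(".") else s + "." for s in group.target.to_list()])
            # All rows in a group share the same keywords, so take them from the first row
            concepts = ", ".join(group.concepts.to_list()[0])
            self.data.append({"concepts": concepts, "targets": targets})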

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)

data_sample = CommonGenDataset(split="train").data[:3]
print(f"Dataset example: \n{data_sample[0]} \n{data_sample[1]} \n{data_sample[2]}")

## Hyperparameters
 - Learning rate: 1e-5
 - Epochs: 3 (note that the code cell below sets `epochs = 1`; change it to 3 to match)
 - Optimizer: AdamW
 - Batch size: 8
 - Evaluation metric: Rouge-2

In [None]:
lr = 1e-5
epochs = 1
optimizer = AdamW(t5_model.parameters(), lr=lr)
train_batch_size = 8
validation_batch_size = 8
common_gen_train = DataLoader(CommonGenDataset(split="train"), collate_fn=get_tensor, batch_size=train_batch_size, shuffle=True)
common_gen_validation = DataLoader(CommonGenDataset(split="validation"), collate_fn=get_tensor, batch_size=validation_batch_size, shuffle=False)
rouge = Rouge(variants=["L", 2], multiref="best")

## Validation
Validation routine
 - Feed the validation data to the model and score the generated output with Rouge-2
 - For how to use Rouge, see https://pytorch.org/ignite/generated/ignite.metrics.Rouge.html
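To make the metric concrete, here is a simplified sketch of what Rouge-2 measures: the overlap of bigrams (pairs of adjacent tokens) between a generated sentence and one reference. This is not ignite's implementation, which also handles multiple references and the F-measure weighting. The helper names `bigrams`, `rouge2`, and `tokenize` are hypothetical, and the two sentences are made-up examples.

```python
from collections import Counter


def bigrams(tokens):
    """Count every pair of adjacent tokens."""
    return Counter(zip(tokens, tokens[1:]))


def rouge2(candidate, reference):
    """Return (precision, recall) of the clipped bigram overlap."""
    cand, ref = bigrams(candidate), bigrams(reference)
    # Counter intersection keeps the minimum count of each shared bigram
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    return precision, recall


def tokenize(sentence):
    """Split on whitespace, treating the period as its own token."""
    return sentence.replace(".", " .").split()


candidate = tokenize("a dog runs in the park.")
reference = tokenize("a dog runs at the park.")
precision, recall = rouge2(candidate, reference)
print(f"precision={precision:.3f}, recall={recall:.3f}")
```

Here four of the six bigrams match in each direction, so precision and recall are both 4/6.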

In [None]:
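def evaluate(model):
    pbar = tqdm(common_gen_validation)
    pbar.set_description("Evaluating")
    rouge.reset()  # clear scores accumulated by a previous call
    was_training = model.training
    model.eval()  # disable dropout while generating

    for inputs, targets in pbar:
        # Decode without special tokens such as <pad> and </s>, then split the "/"-joined sentences
        output = [re.split(r"[/]", each) for each in t5_tokenizer.batch_decode(model.generate(inputs, max_length=50), skip_special_tokens=True)]
        targets = [re.split(r"[/]", each) for each in t5_tokenizer.batch_decode(targets, skip_special_tokens=True)]
        for i in range(len(output)):
            # Separate periods from words so they count as their own tokens; skip empty pieces
            sentences = [s.replace('.', ' .').split() for s in output[i] if s.strip()]
            ground_truths = [t.replace('.', ' .').split() for t in targets[i] if t.strip()]
            for s in sentences:
                rouge.update(([s], [ground_truths]))
    model.train(was_training)  # restore the mode the model was in before evaluation
    return rouge.compute()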


## Training
 - Feed the data to the T5 model in batches, obtain the loss, then compute gradients and update the weights
 - tqdm displays the training progress
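Because targets in a batch are padded to a common length, a common practice is to exclude the padded positions from the loss by setting them to -100, the index that PyTorch's cross-entropy loss ignores by default. The sketch below shows only that masking step on a hand-written tensor. It assumes a padding token id of 0 (the value T5's tokenizer uses); `mask_padding` is a hypothetical helper name and the token ids are arbitrary.

```python
import torch

PAD_ID = 0           # assumption: padding token id of the tokenizer (0 for T5)
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_padding(labels, pad_id=PAD_ID):
    """Return a copy of labels with padding replaced by the ignore index."""
    return labels.masked_fill(labels == pad_id, IGNORE_INDEX)


# Two target sequences; the first is padded with two trailing pad tokens
labels = torch.tensor([[5, 7, 1, 0, 0],
                       [4, 9, 3, 6, 1]])
masked = mask_padding(labels)
# Number of positions that still contribute to the loss
counted = int((masked != IGNORE_INDEX).sum())
print(masked)
print(counted)
```

Without this step, the model is also trained to predict padding tokens, which can bias the reported loss when target lengths vary a lot within a batch.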

In [None]:
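import os

os.makedirs("./saved_models", exist_ok=True)  # torch.save does not create missing directories
for ep in range(epochs):
    t5_model.train()  # from_pretrained returns the model in evaluation mode
    pbar = tqdm(common_gen_train)
    pbar.set_description(f"Training epoch [{ep+1}/{epochs}]")
    for inputs, targets in pbar:
        optimizer.zero_grad()
        # Replace padding with -100 so padded positions are ignored by the loss
        labels = targets.masked_fill(targets == t5_tokenizer.pad_token_id, -100)
        loss = t5_model(input_ids=inputs, labels=labels).loss
        loss.backward()
        optimizer.step()
        pbar.set_postfix(loss = loss.item())
    torch.save(t5_model, f'./saved_models/ep{ep}.mod')
    print(f"Rouge-2 score on epoch {ep}:", evaluate(t5_model))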