<a href="https://colab.research.google.com/github/hululuzhu/chinese-ai-writing-share/blob/main/further_finetune_example/training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# T5 Writes Poems in the Style of [ericqianli](https://github.com/ericqianli)
- Branches off the [T5 finetune colab](https://github.com/hululuzhu/chinese-ai-writing-share/blob/main/training/t5_finetune/Mengzi_T5_Finetune_Chinese_Poem_Writing_V1.ipynb), whose model was finetuned on 600k+ Chinese poems
  - The finetuned T5 model is available on [Google Drive](https://drive.google.com/drive/folders/1-adlqJsU6tzjLuw_LvnzkdO9PEpeS7Vh?usp=sharing)
- Further finetunes that model on [Li's 800+ poems](https://raw.githubusercontent.com/ericqianli/tianyahaige/master/src/data/poem.json)
  - Group #3 is excluded, as Li requested
- The goal is a model that writes in Li's style

## Load Data

In [1]:
# Expect GPU
# !nvidia-smi

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# copy trained poem model
!mkdir -p my_t5/poem
!cp /content/drive/MyDrive/ML/Models/t5-poem/simplet5-epoch-3-train-loss-3.597/* my_t5/poem
!ls -l my_t5/poem

total 967996
-rw------- 1 root root       706 Jun 18 22:12 config.json
-rw------- 1 root root     37963 Jun 18 22:12 inference_mengzi_t5_poem_model.ipynb
-rw------- 1 root root 990438349 Jun 18 22:12 pytorch_model.bin
-rw------- 1 root root      1786 Jun 18 22:12 special_tokens_map.json
-rw------- 1 root root    725135 Jun 18 22:12 spiece.model
-rw------- 1 root root      1961 Jun 18 22:12 tokenizer_config.json


In [None]:
import json
import urllib.request
import pandas as pd
!pip install -q "tqdm>=4.36.1" > /tmp/na
from tqdm.notebook import tqdm
!pip install -q chinese-converter > /tmp/na
import chinese_converter  # needed to convert Traditional Chinese to Simplified Chinese
import pickle
import os
import numpy as np

In [None]:
def convert_schinese(tchinese):
  return chinese_converter.to_simplified(tchinese)

POEM_URL = "https://raw.githubusercontent.com/ericqianli/tianyahaige/master/src/data/poem.json"

poems_pd = pd.read_json(POEM_URL)
# Per Li's request, exclude group #3
non_group3_poems = poems_pd.collections.apply(lambda x: 3 not in x)
qualified_pd = poems_pd[non_group3_poems]
qualified_pd = qualified_pd[['title', 'body']]
qualified_pd['author'] = '钱力'

qualified_pd['s_content'] = qualified_pd.apply(lambda row: convert_schinese(''.join(row.body)), axis=1)
qualified_pd['s_title'] = qualified_pd.apply(lambda row: convert_schinese(''.join(row.title)), axis=1)
qualified_pd['s_author'] = qualified_pd.apply(lambda row: convert_schinese(''.join(row.author)), axis=1)

qualified_pd = qualified_pd[['s_author', 's_title', 's_content']]

In [None]:
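def clean_content(con):
  # Split on the vertical full stop "︒", then rebuild the text as couplets:
  # "，" joins the two lines of a pair and "。" separates consecutive pairs.
  pieces = con.split("︒")
  con = '。'.join('，'.join(pieces[i:i+2]) for i in range(0, len(pieces), 2))
  # Drop line breaks and spaces left over from the source text
  con = con.replace("\r", "").replace("\n", "").replace(" ", "").strip()
  return con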


clean_contents = []
for c in qualified_pd.s_content.values:
  clean_contents.append(clean_content(c))

qualified_pd['s_content'] = clean_contents
qualified_pd

Unnamed: 0,s_author,s_title,s_content
0,钱力,冬夜饮醉,览袂泛灵水，褰裳望紫薇。回车驾四野，谢屐倚崔嵬。崖残映月冷，天高尽雁飞。长风起华鬓，霜冷冻蛾...
1,钱力,夏望,蹙眉空坐冤，疑镜老容颜。阴蘩固灿色，阳景逐虚烟。熏风总浮雀，灼天遍落云。惟恨红尘陌，往来无聚年。
2,钱力,夜泛秋湖,秋旻多落云，平野将雾升。漫花零入夜，纤月散成纹。小舟乱一顷，弱酒醉三生。何当枕来路，释数霜发增。
3,钱力,伤云期,雁字回时凄南浦，小霰分杯欲饮无。云期随谢花零走，淡看飞烟冯凛湖。
4,钱力,北念友人,紫艷香残日愈秋，红笺泪尽字成愁。凭栏一望空江水，徒赴孤帆苦北流。
...,...,...,...
822,钱力,春暮二首,壮志仍惊夢，残春已落花。依依寻暮影，天际数归鸦。寂寞春去也，寒宵思不胜。狂风摧夜雨，心事烁枯...
823,钱力,立夏作,谷云尽处夏初荣，野草闲花风满城。莫道流年如驹隙，一期一会亦平生。
824,钱力,花事口占,春花静，夏花明。秋花萧瑟冬花清，人生得意不得意。且看四季花有情，
825,钱力,祖洲作二首,神思旷古，五面风经。断尘绝俗，长悔劳形。极目海碧，环首峰青。人间一夢，谅我迟醒。穿林度雨，凌...
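
To make the couplet logic in `clean_content` concrete, here is a minimal standalone sketch. It assumes each raw line ends with "︒" (U+FE12), which is inferred from the split in the cell above rather than checked against the JSON file. The sample strings are illustrative inputs, not rows read from the dataset.

```python
def clean_content(con: str) -> str:
    # Same logic as the notebook cell: split on "︒", pair lines up,
    # join each pair with "，" and separate pairs with "。".
    pieces = con.split("︒")
    con = "。".join("，".join(pieces[i:i + 2]) for i in range(0, len(pieces), 2))
    return con.replace("\r", "").replace("\n", "").replace(" ", "").strip()


# Four lines (even count): the trailing empty piece yields a closing "。".
even_raw = "新月逐霞落︒孤松向海听︒风来虫忽静︒摇落满天星︒"
even_clean = clean_content(even_raw)
print(even_clean)

# Three lines (odd count): the last line is paired with the empty piece,
# so the result ends with "，" instead of "。".
odd_raw = "甲︒乙︒丙︒"
odd_clean = clean_content(odd_raw)
print(odd_clean)
```

The odd-count case would explain why a poem with an odd number of lines, such as row 824 above, ends in "，" after cleaning.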


In [None]:
MAX_AUTHOR_CHAR = 4
MAX_TITLE_CHAR = 12
MIN_CONTENT_CHAR = 10
MAX_CONTENT_CHAR = 64

def trim_author_fn(row):
  return row.s_author[:MAX_AUTHOR_CHAR]

def trim_title_fn(row):
  # Cut to the max length first, then drop spaces and parentheses
  trimmed_title = row.s_title[:MAX_TITLE_CHAR].replace(" ", "").replace("(", "").replace(")", "")
  return trimmed_title

def trim_content_fn(row):
  trimmed_content = row.s_content[:MAX_CONTENT_CHAR]
  # Optional: cut at the last period so a partial final line does not confuse the model
  # last_period = trimmed_content.rfind("。")
  # return trimmed_content[:last_period+1]
  return trimmed_content


# Work on an explicit copy so that adding columns does not trigger
# pandas' view-vs-copy (SettingWithCopy) warning
my_df = qualified_pd.copy()
my_df['s_author_trim'] = my_df.apply(trim_author_fn, axis=1)
my_df['s_title_trim'] = my_df.apply(trim_title_fn, axis=1)
my_df['s_content_trim'] = my_df.apply(trim_content_fn, axis=1)

In [None]:
# Drop rows with an empty title, content that is too short, or a placeholder value
empty_title_mask = (my_df['s_title_trim'].str.len() == 0)
too_short_content_mask = (my_df['s_content_trim'].str.len() <= MIN_CONTENT_CHAR)
# '无正文' is a placeholder string meaning "no body text"
invalid_mask = (('无正文' == my_df['s_content_trim']) | ('无正文' == my_df['s_author_trim']))
too_short_mask = empty_title_mask | too_short_content_mask | invalid_mask
# filtered_my_df = my_df.loc[too_short_mask]
# filtered_my_df

qualitied_df = my_df.loc[~too_short_mask][[
  's_author_trim', 's_title_trim', 's_content_trim']]

In [None]:
qualitied_df.sample(3)

Unnamed: 0,s_author_trim,s_title_trim,s_content_trim
748,钱力,野炊作,避秦索晋日凋零，惆怅白驹隙此生。酒过三巡思野客，炉开一昧祭寒星。业火焚心心返净，洞天观月月尤...
723,钱力,经天作,三十三年一夢惊，萍身云迹泊天风。幸仍窗伴初亏月，犹照灵台数点星。
445,钱力,夏至,夏至云峰密复疏，南山尽日卧玄卢。七弦月转商回夢，一剑风来夏满湖。墨到浓时情咫尺，花从艳后岁须...


In [None]:
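AUTHOR_PROMPT = "模仿："  # "Imitate:" prefix, followed by the author name
TITLE_PROMPT = "作诗："  # "Write a poem:" prefix, followed by the title
EOS_TOKEN = '</s>'  # separates the title part from the author part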
def build_dataset_df(df, include_author=True):
  dfc = df.copy()
  if include_author:
    dfc['source_text'] = TITLE_PROMPT + df['s_title_trim'] + EOS_TOKEN + AUTHOR_PROMPT + df['s_author_trim']
  else:
    dfc['source_text'] = TITLE_PROMPT + df['s_title_trim']
  dfc['target_text'] = df['s_content_trim']
  dfc = dfc[['source_text', 'target_text']]
  return dfc
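
The prompt layout built by `build_dataset_df` can be shown without pandas. The sketch below uses a hypothetical helper, `build_source_text`, that is not part of the notebook; it only mirrors the string concatenation above for a single title and an optional author.

```python
from typing import Optional

TITLE_PROMPT = "作诗："
AUTHOR_PROMPT = "模仿："
EOS_TOKEN = "</s>"


def build_source_text(title: str, author: Optional[str] = None) -> str:
    """Hypothetical single-row version of the source_text construction."""
    if author is None:
        # Title-only prompt (include_author=False in the notebook)
        return TITLE_PROMPT + title
    # Title, separator token, then the author to imitate
    return TITLE_PROMPT + title + EOS_TOKEN + AUTHOR_PROMPT + author


with_author = build_source_text("秦简", "钱力")
title_only = build_source_text("向晚")
print(with_author)
print(title_only)
```

Both prompt forms are trained on the same target text, so the model sees every poem once with the author hint and once without it.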

In [None]:
df_author_title_content = build_dataset_df(qualitied_df, True)
df_author_title_content.sample(3)

Unnamed: 0,source_text,target_text
600,作诗：比尔拉神庙二首</s>模仿：钱力,别来山海寂，古寺大烟生。熙攘传梵语，荼靡竞笛鸣。回头天是岸，合掌夜将明。究竟纷飞处，寂寂悟平...
639,作诗：青空作</s>模仿：钱力,又乘清霄万裏行，浮生寂处有神灵。轩辕格物由云纪，精卫无啼付海听。逐日何妨惊夸夫，飞星不落累啓...
632,作诗：秦简</s>模仿：钱力,前朝犹睡虎，云夢醒如初。千古兴邦律，百年代枕书。


In [None]:
df_title_content = build_dataset_df(qualitied_df, False)
df_title_content.sample(3)

Unnamed: 0,source_text,target_text
416,作诗：小天问,冥古洪荒，宙何有尽。星汉灿烂，宇岂能穷。孰圣敦庸，谁言可喻。匪思匪夢，万法归宗。
560,作诗：向晚,新月逐霞落，孤松向海听。风来虫忽静，摇落满天星。
212,作诗：再读逍遥游,眷慕仙风二十年，未知姑射接舆言。长歌狂趁子犹兴，霁雪清缠寂寞弦。


In [None]:
merged_df = pd.concat([df_author_title_content, df_title_content])

In [None]:
merged_df.sample(5)

Unnamed: 0,source_text,target_text
370,作诗：拟相思,昔时风正盛，春草扬碧丝。何事颦罗裙，妾无再春时。
353,作诗：祭鲤</s>模仿：钱力,缘何纹黑白，世事惯浮沉。来世诺爲鲤，与君跃龙门。
123,作诗：春夙</s>模仿：钱力,月素星稀一世霞，循风可赴旧云崖。分明寂寞都彻骨，却道闲愁正茶花。
219,作诗：无痕</s>模仿：钱力,无端人间如潮事，无奈事裏弄潮人。人事如潮思如水，逝去年华无了痕。
702,作诗：倚船观海三首,海中自有山岳，逡巡渐到悠悠。金鳞千从日起，银缦一向月浮。极目平云无羌，颔首万类绸缪。多少蓬莱...


## Modeling

In [None]:
# Quietly install the simpleT5 package
!pip install -q simplet5 &> /dev/null

In [None]:
import torch
from simplet5 import SimpleT5
from transformers import T5Tokenizer, T5ForConditionalGeneration

Global seed set to 42


In [None]:
class MengziSimpleT5(SimpleT5):
  def __init__(self) -> None:
    super().__init__()
    self.device = torch.device("cuda")

  def load_my_model(self, use_gpu: bool = True):
    self.tokenizer = T5Tokenizer.from_pretrained("Langboat/mengzi-t5-base")
    # Note: the pretrained poem T5 model was copied to the local folder my_t5/poem
    self.model = T5ForConditionalGeneration.from_pretrained("my_t5/poem")

In [None]:
model = MengziSimpleT5()
model.load_my_model()
model.model = model.model.to('cuda')

Downloading:   0%|          | 0.00/708k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/659 [00:00<?, ?B/s]

In [None]:
model.tokenizer("桥形通汉上，峰势接云危。</s>烟霞交隐映，花鸟自参差。")

{'input_ids': [1012, 955, 406, 921, 23, 3, 1440, 2180, 799, 355, 4008, 4, 1, 1448, 4152, 690, 3934, 4990, 3, 17544, 178, 2572, 769, 4, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [None]:
model.tokenizer.decode([1012, 955, 406, 921, 23, 3, 1440, 2180, 799, 355, 4008, 4, 1, 1448, 4152, 690, 3934, 4990, 3, 17544, 178, 2572, 769, 4, 1])

'桥形通汉上,峰势接云危。</s> 烟霞交隐映,花鸟自参差。</s>'

In [None]:
from sklearn.model_selection import train_test_split
merged_df = merged_df.sample(frac=1) # Shuffle
train_df, eval_df = train_test_split(merged_df, test_size=100)

In [None]:
print("train", len(train_df), "eval", len(eval_df))

train 1502 eval 100


In [None]:
!mkdir -p /content/drive/MyDrive/ML/Models/t5-poem-li-2022branch

In [None]:
model.train(train_df=train_df,
            eval_df=eval_df,
            # prompt + title + 1 separator token + prompt + author
            source_max_token_len=(len(TITLE_PROMPT) + MAX_TITLE_CHAR + 1 + len(AUTHOR_PROMPT) + MAX_AUTHOR_CHAR),
            target_max_token_len=MAX_CONTENT_CHAR,
            batch_size=16,
            max_epochs=6,  # twice as many epochs as spent on the general poems, to better mimic Li
            use_gpu=True,
            outputdir="/content/drive/MyDrive/ML/Models/t5-poem-li-2022branch")

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Missing logger folder: /content/lightning_logs

  | Name  | Type                       | Params
-----------------------------------------------------
0 | model | T5ForConditionalGeneration | 247 M 
-----------------------------------------------------
247 M     Trainable params
0         Non-trainable params
247 M     Total params
990.311   Total estimated model params size (MB)


Validation sanity check: 0it [00:00, ?it/s]

  f"The dataloader, {name}, does not have many workers which may be a bottleneck."
Global seed set to 42
  f"The dataloader, {name}, does not have many workers which may be a bottleneck."


Training: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]
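
For reference, the length limits passed to `model.train` can be worked out by hand. The sketch below repeats the constants from the earlier cells and evaluates the same expression; it counts characters, so it matches the token count only under the assumption that each Chinese character maps to roughly one token, which the tokenizer output shown earlier suggests but does not guarantee.

```python
TITLE_PROMPT = "作诗："
AUTHOR_PROMPT = "模仿："
MAX_AUTHOR_CHAR = 4
MAX_TITLE_CHAR = 12
MAX_CONTENT_CHAR = 64

# prompt (3) + title (12) + 1 for the </s> separator + prompt (3) + author (4)
source_budget = (len(TITLE_PROMPT) + MAX_TITLE_CHAR + 1
                 + len(AUTHOR_PROMPT) + MAX_AUTHOR_CHAR)
target_budget = MAX_CONTENT_CHAR

print("source_max_token_len:", source_budget)
print("target_max_token_len:", target_budget)
```

The tokenizer also appends a closing `</s>` to each sequence, so a title and author that both hit their maximum lengths may be truncated by one token under this budget.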