<a href="https://colab.research.google.com/github/hululuzhu/llama-lora-chinese-couplet/blob/main/LLaMA_LoRA_Chinese_Couplet_demo_v1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LLaMA + LoRA = Fine-tuning on a Consumer-Level GPU
- Inspired by [Alpaca](https://crfm.stanford.edu/2023/03/13/alpaca.html), [LoRA](https://arxiv.org/abs/2106.09685), and [Alpaca LoRA](https://github.com/tloen/alpaca-lora)
- The example uses Chinese couplets to avoid a potential conflict of interest with my employer
  - As a demo, it takes ~30 minutes end to end on a Tesla T4 (16G) to start showing an effect, and probably a couple of hours (depending on the GPU) to produce a good model.
  - An A100 (40G) takes 9 minutes to show the difference below
- Last update: 04/24/2023
- Contact: hululu.zhu@gmail.com


Zero-shot Examples
- after 3 epochs on 5k pairs, with a cap on max new tokens and greedy decoding
- outputs are post-processed to match the number of Chinese characters in the upper line
- ideally, a well-trained model would emit the end-of-sentence (EOS) token itself
- prompt: `对联：{上联}\n下联：`

|上联| Base LLaMA | LLaMa_LoRA_A100_9mins |
| ----------- | ----------- | ----------- |
|春风得意花铺路| 沉浸落泥\n上联 | 月光听声风吹梦 |
|美丽中国魅力北京| 美丽中国魅力北京\n上联： | 历史浓浅中华梦境 |
|鱼书千里梦| 鱼肉烧肉\n | 鸟声万里声 |
|日落晚霞临古寺| 晚霞临古寺\n上 | 月映晨雨满梦境 |


## Prerequisites
- Nvidia GPU; check that at least 10G of VRAM is free
- Install the required packages with pip

In [1]:
!nvidia-smi -L

GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-aa6bdd8b-5321-c26a-da02-909b26e35d16)


In [2]:
!pip install -q nvidia-ml-py3
import nvidia_smi
nvidia_smi.nvmlInit()
# GPU index 0 is hardcoded here; NVML can also enumerate the available GPUs, so we could iterate over them
handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)
info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
nvidia_smi.nvmlShutdown()

print("Total memory:", info.total)
print("Free memory:", info.free)
print("Used memory:", info.used)

assert info.free > 1e10, (
    "Your GPU looks busy or has less than 10G of free memory; cannot continue")

Total memory: 42949672960
Free memory: 42481418240
Used memory: 468254720
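
The raw byte counts above are hard to read at a glance. Here is a small sketch that converts them to GiB and repeats the same 10G check. The helper names `to_gib` and `has_enough_memory` are my own for illustration and are not part of NVML.

```python
GIB = 1024 ** 3  # bytes per gibibyte


def to_gib(num_bytes: int) -> float:
    """Convert a raw byte count (as reported by NVML) to GiB."""
    return num_bytes / GIB


def has_enough_memory(free_bytes: int, required_bytes: float = 1e10) -> bool:
    """Same rule as the assert above: strictly more than 1e10 bytes must be free."""
    return free_bytes > required_bytes


# Values copied from the cell output above.
total, free, used = 42949672960, 42481418240, 468254720
print(f"Total: {to_gib(total):.1f} GiB")
print(f"Free:  {to_gib(free):.1f} GiB")
print(f"Used:  {to_gib(used):.2f} GiB")
print("Enough memory:", has_enough_memory(free))
```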


In [3]:
!pip install -q bitsandbytes
!pip install -q datasets loralib sentencepiece
!pip install -q peft
!pip install -q transformers

## All the Imports

In [4]:
# disable warnings unless needed
import warnings
warnings.filterwarnings('ignore')

In [5]:
from datasets import Dataset, load_dataset
import numpy as np
import os
import pandas as pd
import pathlib
from peft import PeftModel, get_peft_config, get_peft_model, LoraConfig, TaskType, prepare_model_for_int8_training
import pickle
import sys
import torch
import transformers
from transformers import LlamaTokenizer, LlamaForCausalLM, GenerationConfig, AutoModelForSeq2SeqLM, DataCollatorForLanguageModeling


Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /usr/local/lib/python3.9/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /usr/local/lib/python3.9/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so...


## Define top-level configs

In [6]:
# Check out more details at https://huggingface.co/decapoda-research/llama-7b-hf
# Not for commercial use.
model_name_or_path = "decapoda-research/llama-7b-hf"
tokenizer_name_or_path = "decapoda-research/llama-7b-hf"

# Max number of tokens (prompt plus output). As observed, Chinese text takes more tokens than characters.
CUTOFF_LEN = 96
# If True, the loss is also computed on the prompt tokens, as Alpaca-LoRA does by default.
# Set to False to skip predicting the prompt and speed things up, though it might affect quality.
TRAIN_ON_INPUT = True
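
To get a feel for why `CUTOFF_LEN` must be larger than the character count, here is a rough upper-bound sketch. It assumes the tokenizer falls back to one token per UTF-8 byte for characters missing from its vocabulary (the small ids such as 233, 168, 191 in the label ids printed later are consistent with that), plus 3 overhead tokens (start token, leading-space token, end token). The function name and the overhead value are my assumptions, not something the tokenizer exposes.

```python
def worst_case_token_count(text: str, overhead_tokens: int = 3) -> int:
    """Rough upper bound: one token per UTF-8 byte plus a few special tokens.

    Assumption: every character falls back to byte-level tokens, which is the
    worst case. Real counts are lower when characters are in the vocabulary.
    """
    return len(text.encode("utf-8")) + overhead_tokens


# A 7-character couplet in the prompt format used by this notebook.
example = "对联：不独文章工奏记\n下联：敢持歌颂庆晨昏"
print(len(example), "characters")
print(worst_case_token_count(example), "tokens at most under this assumption")
```

With 32 characters per line (the `MAX_SEQ_LEN` used when loading the data below), this bound is well above 96, so the longest couplets may be truncated during training.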

## Load LLaMa 7B and tokenizer
- Takes about 5 mins, 13G+ model weights downloaded
- After loading, GPU usage is 7.6G+
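
The two numbers above can be sanity-checked with back-of-the-envelope arithmetic. This sketch assumes roughly 6.7 billion parameters for LLaMA 7B (an approximation on my part), 2 bytes per parameter for the downloaded 16-bit checkpoint, and 1 byte per parameter once loaded in 8-bit.

```python
NUM_PARAMS = 6.7e9  # approximate parameter count of LLaMA 7B (assumption)


def weight_size_gb(num_params: float, bytes_per_param: int) -> float:
    """Size of the raw weights in GB (1e9 bytes), ignoring any overhead."""
    return num_params * bytes_per_param / 1e9


fp16_gb = weight_size_gb(NUM_PARAMS, 2)  # 16-bit checkpoint that gets downloaded
int8_gb = weight_size_gb(NUM_PARAMS, 1)  # weights after load_in_8bit=True

print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f"8-bit weights:  ~{int8_gb:.1f} GB")
```

The 8-bit estimate is a bit below the observed 7.6G+; the gap is presumably runtime overhead and layers that are not quantized, but I have not measured that breakdown.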

In [32]:
original_8bit_llama_model = LlamaForCausalLM.from_pretrained(
    model_name_or_path,
    device_map="auto",
    load_in_8bit=True)

# set padding id and side based on https://github.com/tloen/alpaca-lora/blob/main/finetune.py#L121
tokenizer = LlamaTokenizer.from_pretrained(tokenizer_name_or_path)
tokenizer.pad_token_id = 0
tokenizer.padding_side = "left"  # Allow batched inference




The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'. 
The class this function is called from is 'LlamaTokenizer'.


## Load Training data

In [8]:
# Reuse my T5 couplet data code https://github.com/hululuzhu/chinese-ai-writing-share/blob/main/training/t5_finetune/Mengzi_T5_Finetune_Chinese_Couplet_V1.ipynb
working_dir = "/tmp/working_dir"
!mkdir -p {working_dir}
!wget https://github.com/wb14123/couplet-dataset/releases/download/1.0/couplet.tar.gz -P {working_dir}
!ls -l {working_dir}
!mkdir -p {working_dir}/couplet_files
!tar -xf {working_dir}/couplet.tar.gz -C {working_dir}/couplet_files
!head -1 {working_dir}/couplet_files/couplet/train/in.txt {working_dir}/couplet_files/couplet/train/out.txt

COUPLET_PATH = f'{working_dir}/couplet_files/couplet'
MAX_SEQ_LEN = 32  # Max 32 chinese char including punctuation marks

train_df, test_df = None, None
for t in ['train', 'test']:
  ins, outs = [], []
  for i in ['in', 'out']:
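    # Read as UTF-8 explicitly so the Chinese text decodes the same on every platform
    with open(f"{COUPLET_PATH}/{t}/{i}.txt", "r", encoding="utf-8") as f: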
      for line in f:
        clean_line = line.strip().replace(' ', '').replace('\n', '').replace('\r', '')[:MAX_SEQ_LEN]
        if i=='in':
          ins.append(clean_line)
        else:
          outs.append(clean_line)
  # The column names to match simpleT5
  data_dict = {
      'source_text': ins,
      'target_text': outs,
  }
  if t == 'train':
    train_df = pd.DataFrame(data_dict)
  else:
    test_df = pd.DataFrame(data_dict)

COUPLET_PROMPT = '对联：'
COUPLET_SUFFIX = '\n下联：'
train_df['source_text'] = COUPLET_PROMPT + train_df['source_text'] + COUPLET_SUFFIX
test_df['source_text'] = COUPLET_PROMPT + test_df['source_text'] + COUPLET_SUFFIX


In [9]:
# Sample 5k
train_df_sample = train_df[['source_text', 'target_text']].sample(5000)
train_df_sample

| index | source_text | target_text |
| ----------- | ----------- | ----------- |
| 172513 | 对联：陋室敲佳句\n下联： | 穷人过富年 |
| 592730 | 对联：云山僧指路\n下联： | 竹寺叶敲钟 |
| 277097 | 对联：资源有幸花争艳\n下联： | 土地无私春又归 |
| 462600 | 对联：一方皆春色\n下联： | 四面尽吉祥 |
| 130139 | 对联：石点头兮听道德\n下联： | 竹为枕也梦玄机 |
| ... | ... | ... |
| 608368 | 对联：鞠躬尽瘁，死而后已，薄葬有遗言，尚以未殁沙场为恨\n下联： | 推亡固存，邦乃其昌，誓师昭大义，曾无自利天下之心 |
| 252758 | 对联：得失之间，梦如蕉鹿\n下联： | 惊疑多者，情似杯蛇 |
| 610734 | 对联：为官学幽兰隐深山，不问春夏秋冬季\n下联： | 做人效绿竹藏峭壁，哪管东西南北风 |
| 519875 | 对联：鱼书千里梦\n下联： | 雁字一行秋 |


## Convert Data to Training-friendly DataSet

In [10]:
# Copied from Alpaca-LoRA, notice input_ids, attention_mask, and labels are
# default expected columns in huggingface dataset lib
def tokenize(tokenizer, prompt, cutoff_len, add_eos_token=True):
  # there's probably a way to do this with the tokenizer settings
  # but again, gotta move fast
  result = tokenizer(
      prompt,
      truncation=True,
      max_length=cutoff_len,
      padding=False,
      return_tensors=None,
  )
  if (
      result["input_ids"][-1] != tokenizer.eos_token_id
      and len(result["input_ids"]) < cutoff_len
      and add_eos_token
  ):
    result["input_ids"].append(tokenizer.eos_token_id)
    result["attention_mask"].append(1)

  # result["labels"] = copy.deepcopy(result["input_ids"])
  result["labels"] = result["input_ids"].copy()
  return result


# Branched from Alpaca-LoRA
def tokenize_fn(data_point):
  prompt_in, prompt_out = data_point['source_text'], data_point['target_text']
  full_prompt = prompt_in + prompt_out
  tokenized_full_prompt = tokenize(tokenizer, full_prompt, CUTOFF_LEN)
  if not TRAIN_ON_INPUT:
    user_prompt = prompt_in
    tokenized_user_prompt = tokenize(tokenizer, user_prompt, CUTOFF_LEN, add_eos_token=False)
    user_prompt_len = len(tokenized_user_prompt["input_ids"])
    tokenized_full_prompt["labels"] = [
        -100 # special id for skipping
    ] * user_prompt_len + tokenized_full_prompt["labels"][user_prompt_len:]
  return tokenized_full_prompt


train_ds = Dataset.from_pandas(train_df_sample)
train_ds = train_ds.flatten()
tokenized_train_ds = train_ds.map(
    tokenize_fn,
    remove_columns=['source_text', 'target_text', '__index_level_0__'],
)


In [11]:
len(tokenized_train_ds['labels'])

5000
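
To make the `TRAIN_ON_INPUT = False` branch of `tokenize_fn` easier to follow, here is a toy sketch of the label masking on plain lists. The token ids are made up, and `mask_prompt_labels` is a hypothetical helper rather than something from Alpaca-LoRA. Positions labeled `-100` are skipped by PyTorch's cross-entropy loss by default, so only the answer tokens contribute to the loss.

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss by default


def mask_prompt_labels(input_ids, prompt_len):
    """Return labels where the first `prompt_len` positions are ignored."""
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])


# Hypothetical token ids: 4 prompt tokens followed by 3 answer tokens.
input_ids = [1, 11, 12, 13, 21, 22, 2]
labels = mask_prompt_labels(input_ids, prompt_len=4)
print(labels)  # [-100, -100, -100, -100, 21, 22, 2]
```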

In [12]:
# Optionally check a few examples by decoding the inputs
for i in range(100, 103):
  print("token length", len(tokenized_train_ds['input_ids'][i]))
  print(tokenizer.decode(tokenized_train_ds['input_ids'][i]))
  print("Label ids", tokenized_train_ds['labels'][i])
  print()

token length 82
<unk> 对联：百二楼台隐美林，纳月摩云，新城一览非凡境
下联：万千气象涵溪谷，屯坊列署，阆苑三迁有孟家<unk>
Label ids [0, 29871, 30783, 31986, 30383, 31047, 30685, 233, 168, 191, 31037, 236, 157, 147, 30630, 30853, 30214, 234, 189, 182, 30534, 233, 148, 172, 31784, 30214, 30374, 30626, 30287, 235, 170, 139, 31838, 232, 138, 164, 232, 165, 134, 13, 30557, 31986, 30383, 31535, 31159, 233, 179, 151, 31133, 233, 185, 184, 31850, 31112, 30214, 232, 180, 178, 232, 160, 141, 31025, 234, 192, 181, 30214, 236, 155, 137, 235, 142, 148, 30457, 235, 194, 132, 30417, 232, 176, 162, 30613, 0]

token length 38
<unk> 对联：不独文章工奏记
下联：敢持歌颂庆晨昏<unk>
Label ids [0, 29871, 30783, 31986, 30383, 30413, 234, 142, 175, 30333, 31374, 31041, 232, 168, 146, 31410, 13, 30557, 31986, 30383, 233, 152, 165, 31695, 31173, 236, 165, 133, 232, 189, 137, 233, 156, 171, 233, 155, 146, 0]

token length 42
<unk> 对联：飞云沐雨重焕彩
下联：后土承晖又东风<unk>
Label ids [0, 29871, 30783, 31986, 30383, 236, 166, 161, 31784, 233, 181, 147, 236, 158, 171, 30908, 234, 135, 152, 232, 19

## LoRA setup
- For background on LoRA, see the paper linked at the top of this notebook

In [33]:
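model = prepare_model_for_int8_training(original_8bit_llama_model)

config = LoraConfig(
    r=16,  # rank of the low-rank update matrices
    lora_alpha=32,  # scaling factor applied to the LoRA update
    target_modules=["q_proj", "v_proj"],  # attention projections that get adapters
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Wrap the prepared model (not the raw one) with the LoRA adapters
lora_model = get_peft_model(model, config)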
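
For a sense of how small the trainable part is, here is a quick count of the LoRA parameters implied by the config above. It assumes LLaMA 7B has 32 transformer layers with hidden size 4096, and that `q_proj` and `v_proj` are both square 4096 x 4096 projections. `lora_param_count` is a hypothetical helper, not a PEFT function.

```python
def lora_param_count(hidden_size, rank, num_layers, modules_per_layer):
    """Count LoRA parameters for square projections.

    Each adapted projection gains two matrices: A (rank x hidden_size)
    and B (hidden_size x rank).
    """
    per_module = 2 * rank * hidden_size
    return per_module * modules_per_layer * num_layers


trainable = lora_param_count(hidden_size=4096, rank=16, num_layers=32, modules_per_layer=2)
print(trainable)  # 8388608
print(f"Update scaling (lora_alpha / r): {32 / 16}")
```

That is about 8.4M trainable parameters against billions of frozen ones. Calling `lora_model.print_trainable_parameters()` should report the actual numbers for the loaded model.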

## Training
- Show before and after!

In [68]:
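# Ask ChatGPT if you want to know what the Chinese characters mean.
# Note: the first three examples start with "上联：" while the training prompts start with "对联：".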
def eval_model(my_model, examples=["上联：春风得意花铺路\n下联：",
                                   "上联：美丽中国魅力北京\n下联：",
                                   "上联：鱼书千里梦\n下联：",
                                   "对联：日落晚霞临古寺\n下联：",]):
  for p_in in examples:
    batch = tokenizer(
        p_in,
        return_tensors='pt',
    )
    with torch.cuda.amp.autocast(): # required for mixed precisions
      output_tokens = my_model.generate(
          **batch, max_new_tokens=batch['input_ids'].shape[-1])
    # print(output_tokens[0])
    out = tokenizer.decode(output_tokens[0], skip_special_tokens=True)
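    # My own post-processing logic to "cheat": keep the prompt plus as many characters
    # as the upper line has (7 characters of the prompt are not part of the upper line)
    target_len = len(p_in) * 2 - 7
    if len(out) > target_len:
      out = out[:target_len]  # perfectly match chars
    # show the last newline as a literal "\n" for visibility
    if out.count('\n') > 1:
      out = out[::-1].replace("\n", "n\\", 1)[::-1]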
    print(out)
    print()
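
The length-matching trick in `eval_model` is easier to see in isolation. Below is a standalone sketch of the same idea; `trim_to_upper_line` is a hypothetical helper, and the tail of the raw string is made up to stand in for a model that keeps generating past the lower line.

```python
PROMPT_OVERHEAD = 7  # len("上联：") + len("\n") + len("下联：")


def trim_to_upper_line(prompt: str, generated: str) -> str:
    """Keep the prompt plus exactly as many characters as the upper line has."""
    upper_len = len(prompt) - PROMPT_OVERHEAD
    target_len = len(prompt) + upper_len
    return generated[:target_len]


prompt = "上联：鱼书千里梦\n下联："
raw = prompt + "鸟声万里声\n上联：多余"  # made-up tail that should be cut off
print(trim_to_upper_line(prompt, raw))
```

Outputs that are already short enough are returned unchanged, since slicing past the end of a string is a no-op in Python.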

In [67]:
print("Before training")
eval_model(lora_model)

Before training
上联：春风得意花铺路
下联：沉浸落泥\n上联

上联：美丽中国魅力北京
下联：美丽中国魅力北京

上联：鱼书千里梦
下联：鱼肉烧肉\n

对联：日落晚霞临古寺
下联：晚霞临古寺\n上



In [17]:
# As you can tell, I even omitted eval_dataset for this demo :(
trainer = transformers.Trainer(
    model=lora_model, 
    train_dataset=tokenized_train_ds,
    args=transformers.TrainingArguments(
        # increased batch size will significantly increase GPU requirement here
        # Decrease to 4 if you have less than 16G vram
        # Batch = 4, probably 8.3-8.8G vram
        # Batch = 16, 9.5G+
        # Batch = 32, 11G+
        # Batch = 64, 14G+
        per_device_train_batch_size=32,
        gradient_accumulation_steps=2,
        warmup_steps=8,
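        # Note: the logged output below (117 steps, epoch 2.96) does not match
        # these settings, so it likely came from a run with a different config.
        num_train_epochs=2,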
        learning_rate=2e-4, 
        fp16=True,
        logging_steps=25, 
        output_dir='outputs',
        remove_unused_columns=False,
    ),
    data_collator=transformers.DataCollatorForSeq2Seq(
        tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True,
    ),
)

lora_model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

| Step | Training Loss |
| ----------- | ----------- |
| 25 | 4.0323 |
| 50 | 3.1162 |
| 75 | 2.8801 |
| 100 | 2.7633 |


TrainOutput(global_step=117, training_loss=3.1314207223745494, metrics={'train_runtime': 532.5797, 'train_samples_per_second': 28.165, 'train_steps_per_second': 0.22, 'total_flos': 5.566868744621261e+16, 'train_loss': 3.1314207223745494, 'epoch': 2.96})
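
The step count in the output can be related to the batch settings with some simple arithmetic. The sketch below assumes a single GPU and that the number of optimizer steps per epoch is the number of batches floor-divided by `gradient_accumulation_steps`; the exact rounding inside `Trainer` may differ between versions, so treat it as an approximation. `optimizer_steps` is a hypothetical helper.

```python
import math


def optimizer_steps(num_examples, per_device_batch, grad_accum, epochs):
    """Approximate number of optimizer steps on a single device."""
    batches_per_epoch = math.ceil(num_examples / per_device_batch)
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return math.ceil(epochs * steps_per_epoch)


# Settings shown in the cell above: batch 32, accumulation 2, 2 epochs.
print(optimizer_steps(5000, 32, 2, 2))  # 156
# One combination that would give the logged 117 steps (my guess, not stated anywhere).
print(optimizer_steps(5000, 64, 2, 3))  # 117
```

The effective batch size is `per_device_train_batch_size * gradient_accumulation_steps`, so raising accumulation is a way to keep the effective batch while lowering VRAM use.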

In [31]:
# Quick empirical tests showed reasonably OK results once the loss drops below 3.0
print("After training")
eval_model(lora_model)

After training
上联：春风得意花铺路
下联：月光听声风吹梦

上联：美丽中国魅力北京
下联：历史浓浅中华梦境

上联：鱼书千里梦
下联：鸟声万里声

对联：日落晚霞临古寺
下联：月映晨雨满梦境



## Suggested additional reading
- [Decoding algorithm by HF](https://huggingface.co/blog/how-to-generate)
- So far, I have only demoed greedy search (output the highest-probability token at each position without looking ahead)
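
Here is a toy sketch of what "greedy without looking ahead" means. The probability table is entirely made up, and it only conditions on the previous token, which is a big simplification of a real language model.

```python
# Hypothetical next-token probabilities, keyed by the previous token only.
NEXT_TOKEN_PROBS = {
    "<s>": {"月": 0.6, "鸟": 0.4},
    "月": {"光": 0.55, "映": 0.45},
    "鸟": {"声": 0.9, "语": 0.1},
    "光": {"</s>": 1.0},
    "映": {"</s>": 1.0},
    "声": {"</s>": 1.0},
    "语": {"</s>": 1.0},
}


def greedy_decode(start, max_new_tokens):
    """At every step, pick the single most probable next token."""
    tokens = [start]
    for _ in range(max_new_tokens):
        probs = NEXT_TOKEN_PROBS.get(tokens[-1])
        if not probs:
            break
        next_token = max(probs, key=probs.get)
        if next_token == "</s>":  # stop at end of sentence
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the start token


print(greedy_decode("<s>", max_new_tokens=5))  # ['月', '光']
```

In this table greedy picks 月光 (0.6 * 0.55 = 0.33), even though 鸟声 has a higher overall probability (0.4 * 0.9 = 0.36). Strategies such as beam search, covered in the HF post above, exist to address exactly this.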

## Optional: Upload to HuggingFace and share with the world!
- And you should!

In [19]:
# from huggingface_hub import notebook_login
# notebook_login()
# YOUR_HF_ID = "YOUR_ID_PLZ"
# lora_model.push_to_hub(f"{YOUR_HF_ID}/chinese-couplet-llama-lora-test-v0.1",
#                        use_auth_token=True,
#                        create_pr=True)
# # Go to huggingface and merge the PR to share with the world!

In [20]:
# base_model = LlamaForCausalLM.from_pretrained(
#     model_name_or_path,
#     load_in_8bit=True,
#     torch_dtype=torch.float16,
#     device_map="auto",
# )
# model = PeftModel.from_pretrained(
#     base_model,
#     "hululuzhu/chinese-couplet-llama-lora-test-v0.1",
#     torch_dtype=torch.float16,
# )