# LoRA Fine-Tuning Qwen-Chat Large Language Model (Single GPU)

Tongyi Qianwen (Qwen) is a Transformer-based large language model developed by Alibaba Cloud. It is trained on a large, diverse pre-training corpus that includes internet text, specialized books, and code. Qwen-Chat is an AI assistant built on top of the pre-trained model using alignment techniques.

This notebook uses Qwen-1.8B-Chat as an example to show how to fine-tune a Qwen model with LoRA using DeepSpeed.

## Environment Requirements

Please refer to **requirements.txt** to install the required dependencies.

## Preparation

### Download Qwen-1.8B-Chat

First, download the model files. You can download them directly from ModelScope.

In [None]:
# %pwd
# %cd ../
# %pwd


In [5]:
from modelscope.hub.snapshot_download import snapshot_download
model_dir = snapshot_download('Qwen/Qwen-1_8B-Chat', cache_dir='.', revision='master')

Downloading Model to directory: ./hub/Qwen/Qwen-1_8B-Chat


Downloading [cache_autogptq_cuda_kernel_256.cu]: 100%|██████████| 50.8k/50.8k [00:01<00:00, 30.7kB/s]
Downloading [config.json]: 100%|██████████| 910/910 [00:01<00:00, 484B/s]
Downloading [configuration.json]: 100%|██████████| 77.0/77.0 [00:02<00:00, 31.4B/s]
Downloading [configuration_qwen.py]: 100%|██████████| 2.29k/2.29k [00:01<00:00, 1.20kB/s]
Downloading [cpp_kernels.py]: 100%|██████████| 1.88k/1.88k [00:01<00:00, 1.18kB/s]
Downloading [generation_config.json]: 100%|██████████| 249/249 [00:01<00:00, 143B/s]
Downloading [LICENSE]: 100%|██████████| 7.11k/7.11k [00:01<00:00, 5.06kB/s]
Downloading [assets/logo.jpg]: 100%|██████████| 80.8k/80.8k [00:01<00:00, 46.0kB/s]
Downloading [model-00001-of-00002.safetensors]: 100%|██████████| 1.90G/1.90G [01:04<00:00, 31.9MB/s]
Downloading [model-00002-of-00002.safetensors]: 100%|██████████| 1.52G/1.52G [00:57<00:00, 28.4MB/s]
Downloading [model.safetensors.index.json]: 100%|██████████| 14.4k/14.4k [00:01<00:00, 9.75kB/s]
Downloading [modeling_q

### Download Example Training Data

Download the data required for training; here, we provide a tiny dataset as an example. It is sampled from [Belle](https://github.com/LianjiaTech/BELLE).

Disclaimer: the dataset may be used for research purposes only.

In [6]:
!wget https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/qwen_recipes/Belle_sampled_qwen.json

--2024-11-15 10:32:13--  https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/qwen_recipes/Belle_sampled_qwen.json
Resolving atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com (atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com)... 47.101.88.43
Connecting to atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com (atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com)|47.101.88.43|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 228189 (223K) [application/json]
Saving to: ‘Belle_sampled_qwen.json’


2024-11-15 10:32:15 (530 KB/s) - ‘Belle_sampled_qwen.json’ saved [228189/228189]



You can also prepare your own dataset in the same format. Below is a minimal example containing one sample:

```json
[
  {
    "id": "identity_0",
    "conversations": [
      {
        "from": "user",
        "value": "你好"
      },
      {
        "from": "assistant",
        "value": "我是一个语言模型，我叫通义千问。"
      }
    ]
  }
]
```

You can also use multi-turn conversations as the training set. Here is a simple example:

```json
[
  {
    "id": "identity_0",
    "conversations": [
      {
        "from": "user",
        "value": "你好，能告诉我遛狗的最佳时间吗？"
      },
      {
        "from": "assistant",
        "value": "当地最佳遛狗时间因地域差异而异，请问您所在的城市是哪里？"
      },
      {
        "from": "user",
        "value": "我在纽约市。"
      },
      {
        "from": "assistant",
        "value": "纽约市的遛狗最佳时间通常在早晨6点至8点和晚上8点至10点之间，因为这些时间段气温较低，遛狗更加舒适。但具体时间还需根据气候、气温和季节变化而定。"
      }
    ]
  }
]
```
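
Before training, it can help to sanity-check a custom dataset against the format shown above. The sketch below is a hypothetical helper (`validate_records` is not part of the Qwen repository). Based only on the examples here, it assumes that every record has an `id` and a non-empty `conversations` list whose turns alternate between `user` and `assistant`, starting with `user`.

```python
import json


def validate_records(records):
    """Return a list of human-readable problems found in the dataset."""
    problems = []
    for i, rec in enumerate(records):
        if "id" not in rec:
            problems.append(f"record {i}: missing 'id'")
        turns = rec.get("conversations")
        if not isinstance(turns, list) or not turns:
            problems.append(f"record {i}: 'conversations' must be a non-empty list")
            continue
        for j, turn in enumerate(turns):
            if not isinstance(turn, dict):
                problems.append(f"record {i}, turn {j}: turn must be an object")
                continue
            # Assumption: turns alternate user, assistant, user, ...
            expected = "user" if j % 2 == 0 else "assistant"
            if turn.get("from") != expected:
                problems.append(f"record {i}, turn {j}: expected 'from' == {expected!r}")
            if not isinstance(turn.get("value"), str):
                problems.append(f"record {i}, turn {j}: 'value' must be a string")
    return problems


# The single-sample example from above, parsed from a JSON string.
sample = json.loads(
    '[{"id": "identity_0", "conversations": ['
    '{"from": "user", "value": "你好"}, '
    '{"from": "assistant", "value": "我是一个语言模型，我叫通义千问。"}]}]'
)
print(validate_records(sample))
```

To check a file on disk, load it with `json.load` (opened with `encoding="utf-8"`) and pass the resulting list to the same function. An empty result means no problems were found under these assumptions.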

## Fine-Tune the Model

You can directly run the prepared training script to fine-tune the model.

In [7]:
# Set via %env so the variable persists for the "!" command below.
%env CUDA_VISIBLE_DEVICES=0
!python ../../finetune.py \
    --model_name_or_path "Qwen/Qwen-1_8B-Chat/" \
    --data_path "Belle_sampled_qwen.json" \
    --bf16 \
    --output_dir "output_qwen" \
    --num_train_epochs 5 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 10 \
    --learning_rate 1e-5 \
    --weight_decay 0.1 \
    --adam_beta2 0.95 \
    --warmup_ratio 0.01 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --report_to "none" \
    --model_max_length 512 \
    --gradient_checkpointing \
    --lazy_preprocess \
    --use_lora

[2024-11-15 10:33:12,618] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to mps (auto detect)
W1115 10:33:13.226000 34929 qwen/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:06<00:00,  3.06s/it]
trainable params: 53,673,984 || all params: 1,890,502,656 || trainable%: 2.8391
Loading data...
Formatting inputs...Skip in lazy mode
  trainer = Trainer(
You are using an old version of the checkpointing format that is deprecated (We will also silently ignore `gradient_checkpointing_kwargs` in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method `_set_gradient_checkpointing` in your model.
  return fn(*args, **kwargs)
{'loss': 1.035, 'grad_norm': 5.1622772216796875, 'learning_rate': 1e-05, 'epoch': 0.06}
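
The numbers in the command and log above can be cross-checked with a few lines of arithmetic. This is a minimal sketch that assumes a single device, as in this notebook; the variable names are illustrative and simply mirror the command-line flags.

```python
# Values copied from the training command above.
per_device_train_batch_size = 1
gradient_accumulation_steps = 16
num_devices = 1  # assumption: single GPU, as in this notebook

# Number of samples that contribute to each optimizer step.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Values copied from the "trainable params" log line above.
trainable_params = 53_673_984
all_params = 1_890_502_656
trainable_percent = 100 * trainable_params / all_params

print(effective_batch_size)          # 16
print(round(trainable_percent, 4))   # 2.8391, matching the log line
```

With LoRA enabled, only about 2.8% of the parameters receive gradients, which is what keeps fine-tuning feasible on a single device.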

## Merge Weights

LoRA and Q-LoRA training saves only the adapter parameters. You can load the fine-tuned adapter on top of the base model and merge the weights as shown below:

In [8]:
from transformers import AutoModelForCausalLM
from peft import PeftModel
import torch


model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-1_8B-Chat/", torch_dtype=torch.float16, device_map="auto", trust_remote_code=True)
model = PeftModel.from_pretrained(model, "output_qwen/")
merged_model = model.merge_and_unload()
merged_model.save_pretrained("output_qwen_merged", max_shard_size="2048MB", safe_serialization=True)

Loading checkpoint shards: 100%|██████████| 2/2 [00:04<00:00,  2.01s/it]


[2024-11-15 12:12:15,849] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to mps (auto detect)


W1115 12:12:16.075000 33825 torch/distributed/elastic/multiprocessing/redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
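
Conceptually, merging folds the low-rank adapter back into the base weight, so the merged model needs no extra adapter computation at inference time. The pure-Python sketch below illustrates the standard LoRA update `W_merged = W + (alpha / r) * B @ A` on a toy 2×2 layer. It illustrates the idea only and is not PEFT's implementation; all matrices and hyperparameter values are made up.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def add(x, y):
    """Element-wise sum of two equally shaped matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def scale(x, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[a * s for a in row] for row in x]


# Toy values (hypothetical, for illustration only).
W = [[1.0, 2.0], [3.0, 4.0]]  # frozen base weight, shape (out=2, in=2)
A = [[0.5, -1.0]]             # LoRA down-projection, shape (r=1, in=2)
B = [[2.0], [0.0]]            # LoRA up-projection, shape (out=2, r=1)
r, alpha = 1, 2
scaling = alpha / r

x = [[1.0], [1.0]]  # one input column vector, shape (in=2, 1)

# Adapter path: base output plus the scaled low-rank correction.
y_adapter = add(matmul(W, x), scale(matmul(B, matmul(A, x)), scaling))

# Merged path: fold the correction into the weight once, then apply it.
W_merged = add(W, scale(matmul(B, A), scaling))
y_merged = matmul(W_merged, x)

print(y_adapter == y_merged)
```

Both paths produce the same output, which is why the merged checkpoint can be loaded as an ordinary model without the `peft` wrapper.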


This step does not save the tokenizer files to the new directory. You can copy them manually or use the following code:

In [9]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-1_8B-Chat/",
    trust_remote_code=True
)

tokenizer.save_pretrained("output_qwen_merged")

('output_qwen_merged/tokenizer_config.json',
 'output_qwen_merged/special_tokens_map.json',
 'output_qwen_merged/qwen.tiktoken',
 'output_qwen_merged/added_tokens.json')
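
If you prefer to copy files rather than re-save them through `AutoTokenizer`, a small helper along these lines could work. `copy_support_files` is a hypothetical name, not part of any library, and the list of files to copy is left to the caller because the exact set depends on the contents of your model directory.

```python
import shutil
from pathlib import Path


def copy_support_files(src_dir, dst_dir, filenames):
    """Copy the named files from src_dir to dst_dir.

    Returns the names that were not found in src_dir, so the caller
    can decide whether a missing file is a problem.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    missing = []
    for name in filenames:
        source_file = src / name
        if source_file.is_file():
            shutil.copy2(source_file, dst / name)
        else:
            missing.append(name)
    return missing
```

For example, `copy_support_files("Qwen/Qwen-1_8B-Chat", "output_qwen_merged", ["qwen.tiktoken", "tokenizer_config.json"])` would copy two of the tokenizer files listed in the output above, assuming they exist in the source directory.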

## Test the Model

Note: before testing, `modeling_qwen.py` and `qwen_generation_utils.py` from the original Qwen/Qwen-1_8B-Chat directory were added to the merged model directory.

After merging the weights, we can test the model as follows:

In [13]:
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("output_qwen_merged", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "output_qwen_merged",
    device_map="auto",
    trust_remote_code=True
).eval()

# Prompt: "Hello"
response, history = model.chat(tokenizer, "你好", history=None)
print(response)

Downloading [cache_autogptq_cuda_kernel_256.cu]:   0%|          | 0.00/50.8k [1:50:45<?, ?B/s]
Loading checkpoint shards: 100%|██████████| 2/2 [00:11<00:00,  5.92s/it]
Some parameters are on the meta device because they were offloaded to the disk.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


你好！有什么我能帮助你的吗？


In [16]:
response, history = model.chat(tokenizer, "你好", history=None)
print(response)

你好！有什么我可以帮助你的吗？


In [17]:
# Prompt: "Which city is the capital of China?"
response, history = model.chat(tokenizer, "请问中国的首都是哪个城市？", history=None)
print(response)

中国的首都是北京。


In [18]:
# Prompt: "How old are university students in general?"
response, history = model.chat(tokenizer, "大学生一般多少岁？", history=None)
print(response)

现在的法定结婚年龄是男满20周岁，女满18周岁。


In [23]:
response, history = model.chat(tokenizer, "总结下面这段文本的摘要，随着科技飞速发展，我们的生活方式发生巨大改变。手机、人工智能、物联网的出现，让日常生活便利而舒适。比如，我们可以通过智能手机随时随地地获取信息，控制家庭设备，同时感受着人工智能为我们带来的智能化之便。但是，科技进步带来的便利也会对生活形成某种影响，比如，冲击传统行业和职业，改变人们的生产和消费模式。", history=None)
print(response)

随着科技的发展，人们的生活方式发生了翻天覆地的变化。手机、人工智能等技术使得生活更为便捷和舒适。然而，这也可能带来一些负面影响，如传统行业和职业的冲击以及生产消费模式的改变。


In [None]:
response, history = model.chat(tokenizer, "总结下面这段文本的摘要，随着科技的飞速发展，我们的生活方式也在悄然改变。智能手机、人工智能、物联网等科技产品的出现，为我们的日常生活带来了更多便利和舒适。比如，我们可以通过智能手机随时随地地获取信息，控制家庭设备，同时感受着人工智能为我们带来的智能化之便。但是，科技进步带来的便利也会对生活形成某种影响，比如，冲击传统行业和职业，改变人们的生产和消费模式。", history=None)
print(response)

In [19]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

tokenizer_origin = AutoTokenizer.from_pretrained("Qwen/Qwen-1_8B-Chat/", trust_remote_code=True)
model_origin = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-1_8B-Chat/",
    device_map="auto",
    trust_remote_code=True
).eval()


Loading checkpoint shards: 100%|██████████| 2/2 [00:10<00:00,  5.27s/it]
Some parameters are on the meta device because they were offloaded to the disk.


In [21]:

response, history = model_origin.chat(tokenizer_origin, "你好", history=None)
print(response)

你好！有什么我可以帮助你的吗？


In [22]:
# Prompt: "How old are university students in general?"
response, history = model_origin.chat(tokenizer_origin, "大学生一般多少岁？", history=None)
print(response)

大学生是指不满18岁的学生，不包括已经毕业的在校生。如果你指的是大学毕业生，那么一般来说，他们在完成学业后通常在20多岁左右步入社会。当然，这也会因地区和专业而异，一些地方或专业的大学毕业年龄可能会提前到25岁或更早。


In [None]:
response, history = model_origin.chat(tokenizer_origin, "总结下面这段文本的摘要，随着科技飞速发展，我们的生活方式发生巨大改变。手机、人工智能、物联网的出现，让日常生活便利而舒适。比如，我们可以通过智能手机随时随地地获取信息，控制家庭设备，同时感受着人工智能为我们带来的智能化之便。但是，科技进步带来的便利也会对生活形成某种影响，比如，冲击传统行业和职业，改变人们的生产和消费模式。", history=None)
print(response)
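
The side-by-side checks above can be gathered into one loop. The sketch below is a hypothetical helper, not part of the Qwen repository. It only assumes that each model exposes the `chat(tokenizer, query, history=None)` call used in the cells above, which returns a `(response, history)` pair.

```python
def make_chat_fn(model, tokenizer):
    """Wrap model.chat so that every prompt starts from an empty history."""
    def chat(prompt):
        response, _history = model.chat(tokenizer, prompt, history=None)
        return response
    return chat


def compare_models(prompts, chat_fns):
    """Run every prompt through every chat function.

    chat_fns maps a label to a callable that takes a prompt string and
    returns a response string. Returns {prompt: {label: response}}.
    """
    results = {}
    for prompt in prompts:
        results[prompt] = {label: fn(prompt) for label, fn in chat_fns.items()}
    return results
```

With the objects loaded earlier, a call could look like `compare_models(["你好"], {"finetuned": make_chat_fn(model, tokenizer), "original": make_chat_fn(model_origin, tokenizer_origin)})`. Keeping both answers under the same prompt key makes it easier to spot where fine-tuning changed the behavior.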