# Pretraining

Pull the image and start a container:

```shell
docker pull swr.cn-southwest-2.myhuaweicloud.com/atelier/pytorch_2_3_ascend:pytorch_2.3.1-cann_8.0.rc3-py_3.10-hce_2.0.2409-aarch64-snt9b-20241213131522-aafe527

docker run -it --privileged --name=llm -u root --net=host --ipc=host \
--device=/dev/davinci0 \
--device=/dev/davinci1 \
--device=/dev/davinci2 \
--device=/dev/davinci3 \
--device=/dev/davinci4 \
--device=/dev/davinci5 \
--device=/dev/davinci6 \
--device=/dev/davinci7 \
--device=/dev/davinci_manager \
--device=/dev/devmm_svm \
--device=/dev/hisi_hdc \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/Ascend/add-ons/:/usr/local/Ascend/add-ons/ \
-v /usr/local/sbin/:/usr/local/sbin/ \
-v /var/log/npu/slog/:/var/log/npu/slog \
-v /var/log/npu/profiling/:/var/log/npu/profiling \
-v /var/log/npu/dump/:/var/log/npu/dump \
-v /var/log/npu/:/usr/slog \
-v /etc/hccn.conf:/etc/hccn.conf \
-v /home/icbc:/mnt \
-w /mnt \
419f2a9943a4 \
/bin/bash

```

### 1 Data preview

In [None]:
! head -2 wikipedia-zh-cn-8192.jsonl

### 2 Tool installation

```shell
git clone https://gitee.com/janelu9/EasyLLM
cd EasyLLM
pip wheel -e . --no-deps && pip install jllm-*-py3-none-any.whl
```

### 3 Data conversion

In [None]:
!python -m jllm.raw2ids \
    --tokenizer Qwen2.5-7B-Instruct \
    -i wikipedia-zh-cn-512.jsonl \
    --max_len 8193 \
    -t pretrain --stack -C 
# --stack: concatenate tokens to fill each sample up to max_len, reducing pad_id padding
# -C: clear the cache and convert again
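
The idea behind `--stack` can be illustrated with a minimal packing sketch. This is not the `jllm.raw2ids` implementation, which may handle document boundaries and separators differently; the function name `stack_tokens` and the toy token ids are hypothetical.

```python
from typing import Iterable, List


def stack_tokens(docs: Iterable[List[int]], max_len: int, pad_id: int) -> List[List[int]]:
    """Concatenate tokenized documents and cut the stream into rows of max_len."""
    rows: List[List[int]] = []
    buffer: List[int] = []
    for ids in docs:
        buffer.extend(ids)
        # Emit full rows as soon as enough tokens have accumulated.
        while len(buffer) >= max_len:
            rows.append(buffer[:max_len])
            buffer = buffer[max_len:]
    if buffer:
        # Only the final, incomplete row needs padding.
        rows.append(buffer + [pad_id] * (max_len - len(buffer)))
    return rows


docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
packed = stack_tokens(docs, max_len=4, pad_id=0)
print(packed)
```

Without packing, each of the three toy documents would be padded to `max_len` separately; with packing, only the last row carries padding.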

#### 3.1 Data check (optional)

In [1]:
from transformers import AutoTokenizer
import pyarrow.parquet

In [7]:
tokenizer=AutoTokenizer.from_pretrained('Qwen2.5-7B-Instruct')

In [8]:
data = pyarrow.parquet.read_table('wikipedia-zh-cn-512_Qwen2.5-7B-Instruct/wikipedia-zh-cn-512-00000')

In [9]:
input_ids =data['input_ids'].to_numpy().tolist()

In [10]:
idx=0

In [None]:
print(tokenizer.decode(input_ids[idx]))
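
Besides decoding a single sample, it can help to check that every row has the expected length and that little of the data is padding. The helper below is a hypothetical sketch that works on plain Python lists; it is not part of jllm.

```python
from collections import Counter
from typing import Dict, List


def summarize_rows(rows: List[List[int]], pad_id: int) -> Dict[str, object]:
    """Report the row count, the distribution of row lengths and the padding share."""
    lengths = Counter(len(row) for row in rows)
    total = sum(len(row) for row in rows)
    pads = sum(row.count(pad_id) for row in rows)
    return {
        "num_rows": len(rows),
        "lengths": dict(lengths),
        "pad_fraction": pads / total if total else 0.0,
    }


sample = [[5, 6, 7, 0], [8, 9, 10, 11]]
stats = summarize_rows(sample, pad_id=0)
print(stats)
```

In the notebook this could be called as `summarize_rows([list(r) for r in input_ids], tokenizer.pad_token_id)`, assuming each entry of `input_ids` is a sequence of token ids.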

### 4 模型训练

#### Note: on NPU, tensor parallelism (model_parallel_size>1) requires Megatron and MindSpeed.

```shell
# Get Megatron
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.8.0
cp -r megatron ../
# Install MindSpeed
git clone -b 2.0.0_core_r0.8.0 https://gitee.com/ascend/MindSpeed.git
pip install -e MindSpeed --user
```

In [None]:
!deepspeed --module jllm.train_pipe \
    --model Qwen2.5-7B-Instruct \
    --num_train_epochs 1 \
    --train_data wikipedia-zh-cn-512_Qwen2.5-7B-Instruct \
    --pipe_parallel_size 2 \
    --model_parallel_size 2 \
    --sequence_parallel_size 1 \
    --per_device_train_batch_size 1 \
    --global_batch_size 16 \
    --partition_method fast \
    --output_dir pretrained \
    --max_num_checkpoints 2 \
    --split_dlayer \
    --learning_rate 1e-5 |tee pretrain.log
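# Notes:
# --model: model path; it must contain at least config.json
# --num_train_epochs: number of training epochs
# --train_data: training data
# --pipe_parallel_size: pipeline-parallel degree (number of pipeline stages)
# --model_parallel_size: tensor-parallel degree
# --per_device_train_batch_size: number of samples fed in per forward pass
# --global_batch_size: number of samples processed (gradients accumulated) before each parameter update
# --partition_method fast: pipeline partitioning strategy
# --checkpoint checkpoint: checkpoint directory (not passed in the command above)
# --output_dir pretrained: output directory for the final model
# --max_num_checkpoints 2: maximum number of checkpoints to keep
# --split_dlayer: whether to split decoder layers
# --learning_rate 1e-5: learning rate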
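
How these sizes fit together can be worked out with a short calculation. The sketch below assumes 8 NPUs (the eight `davinci` devices mapped into the container) and the usual DeepSpeed relation, global batch = micro-batch × gradient-accumulation steps × data-parallel degree. Whether jllm derives its values exactly this way is an assumption, and `parallel_layout` is a hypothetical helper.

```python
from typing import Tuple


def parallel_layout(world_size: int, pipe: int, model: int, seq: int,
                    micro_batch: int, global_batch: int) -> Tuple[int, int]:
    """Return (data_parallel_degree, gradient_accumulation_steps)."""
    group = pipe * model * seq  # devices needed for one model replica
    if world_size % group:
        raise ValueError("world_size must be divisible by pipe * model * seq")
    data_parallel = world_size // group
    samples_per_micro_step = micro_batch * data_parallel
    if global_batch % samples_per_micro_step:
        raise ValueError("global_batch must be divisible by micro_batch * data_parallel")
    return data_parallel, global_batch // samples_per_micro_step


dp, accum = parallel_layout(world_size=8, pipe=2, model=2, seq=1,
                            micro_batch=1, global_batch=16)
print(dp, accum)
```

Under these assumptions the command above runs 2 model replicas and accumulates 8 micro-batches per replica before each optimizer step.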

In [None]:
!grep  'steps:.*loss:' pretrain.log|awk '{print $2,$4}'>pretrain.loss
import matplotlib.pyplot as plt
import numpy as np
xy=np.loadtxt('pretrain.loss')  
plt.plot(xy[:,0], xy[:,1])  
plt.show()
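
The same extraction can be done in pure Python, which also makes it easy to smooth a noisy curve. The sketch assumes log lines that look like `steps: 10 loss: 2.34`, inferred from the `grep`/`awk` pattern above; the real jllm log format may differ. `parse_loss` and `moving_average` are hypothetical helpers.

```python
import re
from typing import Iterable, List, Tuple

# Assumed line shape: "... steps: <int> ... loss: <float> ..."
LINE_RE = re.compile(r"steps:\s*(\d+).*?loss:\s*([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)")


def parse_loss(lines: Iterable[str]) -> List[Tuple[int, float]]:
    """Collect (step, loss) pairs from the lines that match the assumed format."""
    points = []
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            points.append((int(match.group(1)), float(match.group(2))))
    return points


def moving_average(values: List[float], window: int) -> List[float]:
    """Trailing mean over at most `window` values."""
    out, running = [], 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]
        out.append(running / min(i + 1, window))
    return out


demo = ["steps: 1 loss: 4.0", "some other line", "steps: 2 loss: 2.0", "steps: 3 loss: 3.0"]
points = parse_loss(demo)
smoothed = moving_average([loss for _, loss in points], window=2)
print(points, smoothed)
```

For the real log, `parse_loss(open('pretrain.log'))` would replace the demo list, and the smoothed values can be passed to `plt.plot` in place of the raw losses.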

##### Under normal conditions the curve looks roughly like:

In [None]:
plt.plot(xy[:,0], 1/xy[:,0])
plt.show()

### 5 Parameter conversion

#### 5.1 Convert checkpoint to Hugging Face format

In [None]:
!deepspeed --module jllm.train_pipe \
    --model Qwen2.5-7B-Instruct \
    --train_data wikipedia-zh-cn-512_Qwen2.5-7B-Instruct \
    --pipe_parallel_size 2 \
    --model_parallel_size 2 \
    --partition_method fast \
    --split_dlayer \
    --num_train_epochs 0 \
    --from_ckpt checkpoint \
    --output_dir pretrained
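# --model: model path
# --train_data: training data
# --pipe_parallel_size: pipeline length (number of stages)
# --partition_method: pipeline partitioning method
# --split_dlayer: split decoder layers so the pipeline stages are more evenly balanced
# --from_ckpt: checkpoint path to load model parameters from
# --output_dir: output path for the Hugging Face format model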

#### 5.2 Merge split tensors (model_parallel_size>=2)

In [None]:
!python -m jllm.cat2hf \
    -C pretrained 
# -C: path of the model before merging
# -H: path of the merged Hugging Face format model. If omitted, it is created automatically as pretrained_hf
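
What merging tensor-parallel shards means can be shown on toy matrices. The sketch follows the common Megatron-style convention for weights stored as `[out_features, in_features]`: column-parallel layers are split over output features and rejoined along dimension 0, while row-parallel layers are split over input features and rejoined along dimension 1. It is not the `jllm.cat2hf` code, which may treat fused or specially laid out weights differently; `cat_rows` and `cat_cols` are hypothetical names.

```python
from typing import List

Matrix = List[List[float]]


def cat_rows(shards: List[Matrix]) -> Matrix:
    """Join shards along dimension 0 (column-parallel weights)."""
    return [row for shard in shards for row in shard]


def cat_cols(shards: List[Matrix]) -> Matrix:
    """Join shards along dimension 1 (row-parallel weights)."""
    num_rows = len(shards[0])
    return [sum((shard[i] for shard in shards), []) for i in range(num_rows)]


rank0 = [[1, 2], [3, 4]]
rank1 = [[5, 6], [7, 8]]
merged_dim0 = cat_rows([rank0, rank1])
merged_dim1 = cat_cols([rank0, rank1])
print(merged_dim0)
print(merged_dim1)
```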

### 6 Model testing

In [15]:
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer=AutoTokenizer.from_pretrained('Qwen2.5-7B-Instruct')

In [None]:
model = AutoModelForCausalLM.from_pretrained(
    'pretrained_hf',
    torch_dtype="auto",
    device_map="auto"
)

In [17]:
text='数学是研究数量、结构以及空间等概念及其变化的一门学科，属于'

In [18]:
from time import time

In [None]:
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
st=time()
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=128
)
du=time()-st
tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

In [20]:
128/du

7.7173362469042015
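
The value above is tokens per second under the assumption that all 128 new tokens were generated. Generation can stop earlier at an end-of-sequence token, so a slightly more careful measurement counts the tokens actually produced. `generation_throughput` is a hypothetical helper and the numbers in the example are illustrative, not measurements.

```python
def generation_throughput(prompt_len: int, output_len: int, seconds: float) -> float:
    """New tokens per second, given the prompt length and the total output length."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    new_tokens = output_len - prompt_len
    return new_tokens / seconds


rate = generation_throughput(prompt_len=20, output_len=148, seconds=16.0)
print(rate)
```

In the notebook this could be called as `generation_throughput(model_inputs["input_ids"].shape[1], generated_ids.shape[1], du)`, since for a decoder-only model `generate` returns the prompt followed by the new tokens.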