## Chapter 15 Homework

#### 1. Adjust the ZeRO-3 configuration file so that it supports training T5-3B, or even T5-11B.

```bash
deepspeed --num_gpus=1 translation/run_translation.py \
--deepspeed config/ds_config_zero3.debug.json \
--model_name_or_path t5-3b --per_device_train_batch_size 32 \
--output_dir tmp/t5-3b --overwrite_output_dir --bf16 \
--do_train --max_train_samples 500 --num_train_epochs 1 \
--dataset_name wmt16 --dataset_config_name "ro-en" \
--source_lang en --target_lang ro \
--source_prefix "translate English to Romanian: " \
--learning_rate 1e-5 \
--max_grad_norm 1
```

![resource](ch15.nvidia-smi.htop.png)
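
The contents of `config/ds_config_zero3.debug.json` are not shown here, so the following is only a sketch of the kind of ZeRO-3 settings that usually matter when a 3B-parameter model has to fit on a single 24 GB GPU: stage 3 with both optimizer state and parameters offloaded to CPU memory. The key names follow the ZeRO-3 example in the Hugging Face DeepSpeed integration documentation. The helper names `build_zero3_config` and `save_config` are hypothetical, and the values are starting points rather than tuned settings.

```python
import json


def build_zero3_config():
    """Return a ZeRO-3 config dict with optimizer and parameter offload to CPU.

    Values set to "auto" are filled in by the Hugging Face Trainer from its
    own command-line arguments (learning rate, batch size, precision, ...).
    """
    offload = {"device": "cpu", "pin_memory": True}
    return {
        "bf16": {"enabled": "auto"},
        "optimizer": {
            "type": "AdamW",
            "params": {"lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto"},
        },
        "scheduler": {
            "type": "WarmupLR",
            "params": {
                "warmup_min_lr": "auto",
                "warmup_max_lr": "auto",
                "warmup_num_steps": "auto",
            },
        },
        "zero_optimization": {
            "stage": 3,
            # Separate copies so the two sections can be edited independently.
            "offload_optimizer": dict(offload),
            "offload_param": dict(offload),
            "overlap_comm": True,
            "contiguous_gradients": True,
            "reduce_bucket_size": "auto",
            "stage3_prefetch_bucket_size": "auto",
            "stage3_param_persistence_threshold": "auto",
            "stage3_gather_16bit_weights_on_model_save": True,
        },
        "gradient_accumulation_steps": "auto",
        "gradient_clipping": "auto",
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
    }


def save_config(config, path):
    """Write the config dict to `path` as indented JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)


if __name__ == "__main__":
    print(json.dumps(build_zero3_config(), indent=2))
```

Calling `save_config(build_zero3_config(), some_path)` produces a file that can be passed to `--deepspeed`. For T5-11B, CPU RAM is likely to become the limit. DeepSpeed also documents an `nvme` offload device, which this sketch does not cover.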

### Environment setup

In [2]:
%set_env CUDA_VISIBLE_DEVICES=0
import torch
import os

print(os.getenv('CUDA_VISIBLE_DEVICES'))

env: CUDA_VISIBLE_DEVICES=0
0


In [4]:
torch.cuda.get_device_capability()

(8, 9)

In [5]:
torch.cuda.get_arch_list()

['sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_90']

In [7]:
torch.cuda.get_device_properties(torch.device('cuda'))

_CudaDeviceProperties(name='NVIDIA GeForce RTX 4090', major=8, minor=9, total_memory=24210MB, multi_processor_count=128)
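
The compute capability reported above, `(8, 9)`, is the same value that the DeepSpeed build step passes as `TORCH_CUDA_ARCH_LIST="8.9"`. As a small sketch, the string can be derived from the tuple instead of being typed by hand. The helper names `format_arch` and `detect_arch` are illustrative and not part of PyTorch.

```python
def format_arch(capability):
    """Turn a (major, minor) compute capability into a 'major.minor' string."""
    major, minor = capability
    return f"{major}.{minor}"


def detect_arch():
    """Return the arch string for the current GPU, or None if CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return format_arch(torch.cuda.get_device_capability())


if __name__ == "__main__":
    print(detect_arch())
```

On the RTX 4090 used here, `format_arch((8, 9))` gives `"8.9"`.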

### Install the Rust package manager (cargo)

```bash
sudo apt-get install -y cargo
```

### Install DeepSpeed inside a conda virtual environment; in a plain venv the install fails with an error that torch cannot be imported

```bash
conda create -n deepspeed --clone base
```


### Install Transformers from source

```bash
pip install git+https://github.com/huggingface/transformers
```

*Tip: if the network is unreliable, you can install from an offline source archive instead.*
```bash
pip install /root/downloads/transformers-main.zip
```

### Install DeepSpeed from source

```bash
git clone https://github.com/microsoft/DeepSpeed/
cd DeepSpeed
rm -rf build
TORCH_CUDA_ARCH_LIST="8.9" DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 pip install . \
--global-option="build_ext" --global-option="-j8" --no-cache -v \
--disable-pip-version-check 2>&1 | tee build.log
```

### Launch ZeRO-2 single-GPU training of the translation model (T5-Small) directly with the python command

```bash
cd deepspeed

python translation/run_translation.py \
    --model_name_or_path t5-small \
    --do_train \
    --do_eval \
    --source_lang en \
    --target_lang ro \
    --source_prefix "translate English to Romanian: " \
    --dataset_name wmt16 \
    --dataset_config_name ro-en \
    --output_dir tmp/tst-translation \
    --per_device_train_batch_size=4 \
    --per_device_eval_batch_size=4 \
    --overwrite_output_dir \
    --predict_with_generate \
    --save_total_limit 2 \
    --resume_from_checkpoint tmp/tst-translation/checkpoint-11000
```
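
The command above hard-codes `checkpoint-11000` for `--resume_from_checkpoint`. As a hedged sketch, the newest checkpoint can instead be looked up by scanning the output directory for `checkpoint-<step>` folders and comparing the step numbers numerically. The helper `latest_checkpoint` is hypothetical and uses only the standard library. Transformers also ships a similar utility, `get_last_checkpoint`, in `transformers.trainer_utils`.

```python
import os
import re
import tempfile

_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir):
    """Return the path of the highest-numbered checkpoint-<step> directory, or None."""
    if not os.path.isdir(output_dir):
        return None
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = _CHECKPOINT_RE.match(name)
        path = os.path.join(output_dir, name)
        # Compare steps as integers so that 11000 ranks above 9000.
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


if __name__ == "__main__":
    # Demonstrate on a throwaway directory rather than a real training run.
    with tempfile.TemporaryDirectory() as root:
        for step in (500, 11000, 9000):
            os.makedirs(os.path.join(root, f"checkpoint-{step}"))
        print(os.path.basename(latest_checkpoint(root)))  # checkpoint-11000
```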

In [3]:
! pip show transformers

Name: transformers
Version: 4.43.0.dev0
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: transformers@huggingface.co
License: Apache 2.0 License
Location: /root/autodl-tmp/venvs/deepspeed/lib/python3.10/site-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: 


In [4]:
%pip show deepspeed

Name: deepspeed
Version: 0.14.5+3d347276
Summary: DeepSpeed library
Home-page: http://deepspeed.ai
Author: DeepSpeed Team
Author-email: deepspeed-info@microsoft.com
License: Apache Software License 2.0
Location: /root/autodl-tmp/venvs/deepspeed/lib/python3.10/site-packages
Requires: hjson, ninja, numpy, nvidia-ml-py, packaging, psutil, py-cpuinfo, pydantic, torch, tqdm
Required-by: 
Note: you may need to restart the kernel to use updated packages.
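
The two `pip show` calls above can also be replaced by a single check from Python. This is a minimal sketch using the standard library's `importlib.metadata`. The helper name `installed_versions` is illustrative, and the package list is just the three libraries this chapter relies on.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(["torch", "transformers", "deepspeed"]).items():
        print(f"{name}: {version or 'not installed'}")
```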
