The [`Trainer`] class provides an API for feature-complete training in PyTorch for most standard use cases. It's used in most of the example scripts.

Before instantiating your [`Trainer`], create a [`TrainingArguments`] to access all the points of customization during training.

The API supports distributed training on multiple GPUs/TPUs, mixed precision through NVIDIA Apex and Native AMP for PyTorch.
The [`Trainer`] contains the basic training loop which supports the above features. To inject custom behavior you can subclass it and override the following methods:

- **get_train_dataloader** -- Creates the training DataLoader.
- **get_eval_dataloader** -- Creates the evaluation DataLoader.
- **get_test_dataloader** -- Creates the test DataLoader.
- **log** -- Logs information on the various objects watching training.
- **create_optimizer_and_scheduler** -- Sets up the optimizer and learning rate scheduler if they were not passed at init. Note that you can also subclass or override the `create_optimizer` and `create_scheduler` methods separately.
- **create_optimizer** -- Sets up the optimizer if it wasn't passed at init.
- **create_scheduler** -- Sets up the learning rate scheduler if it wasn't passed at init.
- **compute_loss** -- Computes the loss on a batch of training inputs.
- **training_step** -- Performs a training step.
- **prediction_step** -- Performs an evaluation/test step.
- **evaluate** -- Runs an evaluation loop and returns metrics.
- **predict** -- Returns predictions (with metrics if labels are available) on a test set.
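The subclass-and-override pattern these hooks rely on can be sketched in plain Python. This is a toy stand-in, not the real `Trainer` API: `MiniTrainer`, `ScaledTrainer`, and the loop below are illustrative names only.

```python
class MiniTrainer:
    """Illustrative stand-in for a training loop with overridable hooks."""

    def __init__(self, train_data):
        self.train_data = train_data

    def training_step(self, batch):
        # default behavior: the "loss" is just the batch value
        return float(batch)

    def train(self):
        # the basic loop calls the overridable hook for every batch
        return sum(self.training_step(b) for b in self.train_data)


class ScaledTrainer(MiniTrainer):
    def training_step(self, batch):
        # custom behavior injected by overriding the hook
        return 2.0 * super().training_step(batch)


print(MiniTrainer([1, 2, 3]).train())    # 6.0
print(ScaledTrainer([1, 2, 3]).train())  # 12.0
```

Only the overridden hook changes; the surrounding loop is reused unchanged, which is exactly why the real `Trainer` exposes these methods.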
The [`Trainer`] class is optimized for 🤗 Transformers models and can have surprising behaviors when used with other models. When using it with your own model, make sure:

- your model always returns tuples or subclasses of [`~utils.ModelOutput`]
- your model can compute the loss if a `labels` argument is provided and that loss is returned as the first element of the tuple (if your model returns tuples)
- your model can accept multiple label arguments (use `label_names` in your [`TrainingArguments`] to indicate their name to the [`Trainer`]) but none of them should be named `"label"`
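The "loss comes first" tuple contract can be illustrated with a toy forward function. This is plain Python with a made-up squared-error loss, not a real Transformers model:

```python
def toy_model(inputs, labels=None):
    # stand-in "forward pass": logits are just the inputs doubled
    logits = [2 * x for x in inputs]
    if labels is not None:
        # when labels are provided, the loss must be the FIRST tuple element
        loss = sum((l, y) == () or (l - y) ** 2 for l, y in zip(logits, labels)) / len(labels)
        return (loss, logits)
    return (logits,)


out = toy_model([1.0, 2.0], labels=[2.0, 4.0])
loss = out[0]  # the Trainer reads the loss from position 0
print(loss)  # 0.0 (logits exactly match the labels)
```

Wait — the loss line above should simply be a mean squared error; written cleanly:

```python
def toy_model(inputs, labels=None):
    logits = [2 * x for x in inputs]
    if labels is not None:
        loss = sum((l - y) ** 2 for l, y in zip(logits, labels)) / len(labels)
        return (loss, logits)
    return (logits,)
```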
Here is an example of how to customize [`Trainer`] to use a weighted loss (useful when you have an unbalanced training set):
```python
import torch
from torch import nn
from transformers import Trainer


class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")
        # forward pass
        outputs = model(**inputs)
        logits = outputs.get("logits")
        # compute custom loss (suppose one has 3 labels with different weights)
        loss_fct = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 2.0, 3.0], device=model.device))
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss
```
Another way to customize the training loop behavior for the PyTorch [`Trainer`] is to use callbacks that can inspect the training loop state (for progress reporting, logging on TensorBoard or other ML platforms...) and take decisions (like early stopping).
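A minimal sketch of the callback idea, in plain Python rather than the real `TrainerCallback` API (the class and method names below are illustrative):

```python
class EarlyStopCallback:
    """Asks the loop to stop when the metric hasn't improved for `patience` evals."""

    def __init__(self, patience=2):
        self.patience = patience
        self.best = float("inf")
        self.bad_evals = 0

    def on_evaluate(self, metric):
        if metric < self.best:
            self.best = metric
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        # returning True signals the loop to stop
        return self.bad_evals >= self.patience


def run_loop(eval_losses, callback):
    # a stand-in training loop that consults the callback after each eval
    for step, loss in enumerate(eval_losses):
        if callback.on_evaluate(loss):
            return step  # stopped early at this step
    return len(eval_losses)


# losses stop improving after step 1, so with patience=2 we stop at step 3
print(run_loop([0.9, 0.5, 0.6, 0.7, 0.4], EarlyStopCallback(patience=2)))  # 3
```

The real API works the same way: the callback only observes state and returns control decisions; it never computes gradients itself.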
[[autodoc]] Trainer
    - all

[[autodoc]] Seq2SeqTrainer
    - evaluate
    - predict

[[autodoc]] TrainingArguments
    - all

[[autodoc]] Seq2SeqTrainingArguments
    - all
By default, [`Trainer`] will save all checkpoints in the `output_dir` you set in the [`TrainingArguments`] you are using. Those will go in a subfolder named `checkpoint-xxx`, with xxx being the step the training was at.

Resuming training from a checkpoint can be done when calling [`Trainer.train`] with either:

- `resume_from_checkpoint=True`, which will resume training from the latest checkpoint
- `resume_from_checkpoint=checkpoint_dir`, which will resume training from the specific checkpoint in the directory passed
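The way a "latest checkpoint" can be picked from `checkpoint-xxx` subfolders is sketched below in plain Python. `latest_checkpoint` is an illustrative helper, a simplified stand-in rather than the Transformers API:

```python
import os
import re
import tempfile


def latest_checkpoint(output_dir):
    """Return the checkpoint-xxx subfolder with the highest step, or None."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    steps = [
        int(m.group(1))
        for name in os.listdir(output_dir)
        if (m := pattern.match(name)) and os.path.isdir(os.path.join(output_dir, name))
    ]
    if not steps:
        return None
    return os.path.join(output_dir, f"checkpoint-{max(steps)}")


with tempfile.TemporaryDirectory() as d:
    for step in (500, 1000, 1500):
        os.mkdir(os.path.join(d, f"checkpoint-{step}"))
    print(latest_checkpoint(d))  # ends with "checkpoint-1500"
```

Note that the step numbers are compared numerically, not lexically, so `checkpoint-1500` correctly beats `checkpoint-500`.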
In addition, you can easily save your checkpoints on the Model Hub with `push_to_hub=True`. By default, the model saved in the intermediate checkpoints is saved in different commits, but not the optimizer state. You can adapt the `hub-strategy` value of your [`TrainingArguments`] to either:

- `"checkpoint"`: the latest checkpoint is also pushed in a subfolder named last-checkpoint, allowing you to resume training easily with `trainer.train(resume_from_checkpoint="output_dir/last-checkpoint")`
- `"all_checkpoints"`: all checkpoints are pushed like they appear in the output folder (so you will get one checkpoint folder per folder in your final repository)
By default [`Trainer`] will use `logging.INFO` for the main process and `logging.WARNING` for the replicas, if any.

These defaults can be overridden to use any of the 5 `logging` levels with [`TrainingArguments`]'s arguments:

- `log_level` - for the main process
- `log_level_replica` - for the replicas

Further, if [`TrainingArguments`]'s `log_on_each_node` is set to `False`, only the main node will use the log level settings of its main process; all other nodes will use the log level settings for replicas.
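The effect of these two defaults can be demonstrated with the stdlib `logging` module alone (the logger names below are illustrative, not what Trainer uses internally):

```python
import logging

# mimic the Trainer defaults: INFO on the main process, WARNING on replicas
main_logger = logging.getLogger("node.main")
replica_logger = logging.getLogger("node.replica")
main_logger.setLevel(logging.INFO)
replica_logger.setLevel(logging.WARNING)

# an INFO record passes on the main process but is filtered out on replicas
print(main_logger.isEnabledFor(logging.INFO))        # True
print(replica_logger.isEnabledFor(logging.INFO))     # False
print(replica_logger.isEnabledFor(logging.WARNING))  # True
```

This is why, by default, you see progress messages once rather than duplicated by every replica process.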
Note that [`Trainer`] is going to set `transformers`'s log level separately for each node in its [`Trainer.__init__`]. So you may want to set this sooner (see the next example) if you tap into other `transformers` functionality before creating the [`Trainer`] object.

Here is an example of how this can be used in an application:
```python
[...]
logger = logging.getLogger(__name__)

# Setup logging
logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
    datefmt="%m/%d/%Y %H:%M:%S",
    handlers=[logging.StreamHandler(sys.stdout)],
)

# set the main code and the modules it uses to the same log-level according to the node
log_level = training_args.get_process_log_level()
logger.setLevel(log_level)
datasets.utils.logging.set_verbosity(log_level)
transformers.utils.logging.set_verbosity(log_level)

trainer = Trainer(...)
```
And then if you only want to see warnings on the main node and all other nodes to not print any most likely duplicated warnings, you could run it as:

```bash
my_app.py ... --log_level warning --log_level_replica error
```
In the multi-node environment, if you also don't want the logs to repeat for each node's main process, you will want to change the above to:

```bash
my_app.py ... --log_level warning --log_level_replica error --log_on_each_node 0
```
And then only the main process of the first node will log at the "warning" level, while all other processes on the main node and all processes on the other nodes will log at the "error" level.
If you need your application to be as quiet as possible, you could do:

```bash
my_app.py ... --log_level error --log_level_replica error --log_on_each_node 0
```

(add `--log_on_each_node 0` if on a multi-node environment)
When resuming from a checkpoint generated by [`Trainer`], all efforts are made to restore the _python_, _numpy_ and _pytorch_ RNG states to the same states they were in at the moment the checkpoint was saved, which should make the "stop and resume" style of training as close as possible to non-stop training.

However, due to various default non-deterministic pytorch settings, this might not fully work. If you want full determinism, please refer to Controlling sources of randomness. As explained in that document, some of the settings that make things deterministic (e.g. `torch.backends.cudnn.deterministic`) may slow things down, therefore this can't be done by default, but you can enable those yourself if needed.
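The RNG checkpointing idea can be demonstrated with Python's stdlib `random` module; the Trainer does the analogous save/restore for the python, numpy and pytorch generators:

```python
import random

random.seed(42)
random.random()  # advance the generator a bit

# "checkpoint": capture the RNG state, like Trainer does on save
state = random.getstate()
expected = [random.random() for _ in range(3)]

# "resume": restore the state and the sequence continues identically
random.setstate(state)
resumed = [random.random() for _ in range(3)]
print(resumed == expected)  # True
```

Restoring the state reproduces the exact same draw sequence, which is what makes stop-and-resume training match a non-stop run (up to non-deterministic kernels).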
Let's discuss how you can tell your program which GPUs are to be used and in what order.

When using `DistributedDataParallel` with only a subset of your GPUs, you simply specify the number of GPUs to use. For example, if you have 4 GPUs but wish to use only the first 2, you can do:

```bash
torchrun --nproc_per_node=2 trainer-program.py ...
```

If you have either `accelerate` or `deepspeed` installed, you can also accomplish the same by using one of:

```bash
accelerate launch --num_processes 2 trainer-program.py ...
```

```bash
deepspeed --num_gpus 2 trainer-program.py ...
```

You don't need to use the Accelerate or the Deepspeed integration features to use these launchers.

Until now you were able to tell the program how many GPUs to use. Now let's discuss how to select specific GPUs and control their order.
The following environment variables help you control which GPUs to use and their order.

`CUDA_VISIBLE_DEVICES`

If you have multiple GPUs and you'd like to use only 1 or a few of them, set the environment variable `CUDA_VISIBLE_DEVICES` to the list of GPUs to be used.

For example, let's say you have 4 GPUs: 0, 1, 2 and 3. To run only on the physical GPUs 0 and 2, you can do:

```bash
CUDA_VISIBLE_DEVICES=0,2 torchrun trainer-program.py ...
```

Now pytorch will see only 2 GPUs, where your physical GPUs 0 and 2 are mapped to `cuda:0` and `cuda:1` correspondingly.

You can even change their order:

```bash
CUDA_VISIBLE_DEVICES=2,0 torchrun trainer-program.py ...
```

Here your physical GPUs 0 and 2 are mapped to `cuda:1` and `cuda:0` correspondingly.

The above examples were all for the `DistributedDataParallel` use pattern, but the same method works for `DataParallel` as well:

```bash
CUDA_VISIBLE_DEVICES=2,0 python trainer-program.py ...
```

To emulate an environment without GPUs, simply set this environment variable to an empty value like so:

```bash
CUDA_VISIBLE_DEVICES= python trainer-program.py ...
```

As with any environment variable, you can of course export it instead of adding it to the command line:

```bash
export CUDA_VISIBLE_DEVICES=0,2
torchrun trainer-program.py ...
```

But this approach can be confusing since you may forget you set the environment variable earlier and not understand why the wrong GPUs are being used. Therefore, it's common practice to set the environment variable just for a specific run on the same command line, as shown in most examples of this section.
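The logical-to-physical remapping that `CUDA_VISIBLE_DEVICES` performs can be sketched in plain Python. `visible_device_map` is an illustrative helper; the real mapping is done by the CUDA runtime:

```python
def visible_device_map(cuda_visible_devices):
    """Map logical cuda:N indices to physical GPU ids, mirroring CUDA's behavior."""
    if cuda_visible_devices == "":
        return {}  # empty value: no GPUs visible at all
    physical = [int(x) for x in cuda_visible_devices.split(",")]
    return {f"cuda:{logical}": phys for logical, phys in enumerate(physical)}


print(visible_device_map("0,2"))  # {'cuda:0': 0, 'cuda:1': 2}
print(visible_device_map("2,0"))  # {'cuda:0': 2, 'cuda:1': 0}
print(visible_device_map(""))     # {}
```

The key point: the order of ids in the variable, not their physical numbering, determines which card becomes `cuda:0`.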
`CUDA_DEVICE_ORDER`

There is an additional environment variable `CUDA_DEVICE_ORDER` that controls how the physical devices are ordered. The two choices are:

1. ordered by PCIe bus IDs (matches `nvidia-smi`'s order) - this is the default:

```bash
export CUDA_DEVICE_ORDER=PCI_BUS_ID
```

2. ordered by GPU compute capabilities:

```bash
export CUDA_DEVICE_ORDER=FASTEST_FIRST
```

Most of the time you don't need to care about this environment variable, but it's very helpful if you have a lopsided setup where an old GPU and a new GPU are physically inserted in such a way that the slow older card appears to be first. One way to fix that is to swap the cards. But if you can't swap the cards (e.g., if the cooling of the devices would be impacted), then setting `CUDA_DEVICE_ORDER=FASTEST_FIRST` will always put the newer, faster card first. It'll be somewhat confusing though, since `nvidia-smi` will still report them in the PCIe order.

The other solution to swapping the order is to use:

```bash
export CUDA_VISIBLE_DEVICES=1,0
```

In this example we are working with just 2 GPUs, but of course the same applies to however many GPUs your computer has.

Also, if you do set this environment variable, it's best to set it in your `~/.bashrc` file or some other startup config file and forget about it.
The [`Trainer`] has been extended to support libraries that may dramatically improve your training time and fit much bigger models.

Currently it supports third party solutions, DeepSpeed and PyTorch FSDP, which implement parts of the paper ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, by Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He.

This provided support is new and experimental as of this writing. While the support for DeepSpeed and PyTorch FSDP is active and we welcome issues around it, we don't support the FairScale integration anymore since it has been integrated into PyTorch main (see the PyTorch FSDP integration).

As of this writing, Deepspeed requires compilation of CUDA C++ code before it can be used.

While all installation issues should be dealt with through the corresponding GitHub issues of Deepspeed, there are a few common issues one may encounter when building any PyTorch extension that needs to build CUDA extensions.

Therefore, if you encounter a CUDA-related build issue while doing the following:

```bash
pip install deepspeed
```

please read the following notes first.
In these notes we give examples of what to do when `pytorch` has been built with CUDA `10.2`. If your situation is different, remember to adjust the version number to the one you are after.

While Pytorch comes with its own CUDA toolkit, to build these two projects you must have an identical version of CUDA installed system-wide.

For example, if you installed `pytorch` with `cudatoolkit==10.2` in your Python environment, you also need to have CUDA `10.2` installed system-wide.

The exact location may vary from system to system, but `/usr/local/cuda-10.2` is the most common location on many systems. On Unix systems, when CUDA is correctly set up and added to the `PATH` environment variable, you can find the installation location by doing:

```bash
which nvcc
```

If you don't have CUDA installed system-wide, install it first. You can find the instructions using your favorite search engine. For example, if you're on Ubuntu you may want to search for: ubuntu cuda 10.2 install.
Another possible common problem is that you may have more than one CUDA toolkit installed system-wide. For example you may have:

```
/usr/local/cuda-10.2
/usr/local/cuda-11.0
```

In this situation you need to make sure that your `PATH` and `LD_LIBRARY_PATH` environment variables contain the correct paths to the desired CUDA version. Typically, package installers will set these to contain whatever the last version installed was. If you encounter the problem where the package build fails because it can't find the right CUDA version despite it being installed system-wide, it means that you need to adjust the 2 aforementioned environment variables.

First, you may look at their contents:

```bash
echo $PATH
echo $LD_LIBRARY_PATH
```

so you get an idea of what is inside. It's possible that `LD_LIBRARY_PATH` is empty. `PATH` lists the locations where executables can be found, and `LD_LIBRARY_PATH` is where shared libraries are looked for. In both cases, earlier entries take priority over later ones. `:` is used to separate multiple entries.

Now, to tell the build program where to find the specific CUDA toolkit, insert the desired paths so they are listed first, by doing:

```bash
export PATH=/usr/local/cuda-10.2/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-10.2/lib64:$LD_LIBRARY_PATH
```

Note that we aren't overwriting the existing values, but prepending instead.

Of course, adjust the version number and the full path as needed. Check that the directories you assign actually do exist. The `lib64` sub-directory is where the various CUDA `.so` objects, like `libcudart.so`, reside. It's unlikely that your system will have it named differently, but if it does, adjust it to reflect your reality.
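The "earlier entries win" rule for `PATH`-style variables can be sketched in a few lines of Python (the helper names below are illustrative only):

```python
def prepend_path(new_entry, path):
    """Prepend an entry so it takes precedence, like `export PATH=new:$PATH`."""
    return new_entry if not path else f"{new_entry}:{path}"


def first_match(path):
    # lookup scans entries left to right; the first entry wins
    return path.split(":")[0]


path = "/usr/local/cuda-11.0/bin:/usr/bin"
path = prepend_path("/usr/local/cuda-10.2/bin", path)
print(first_match(path))  # /usr/local/cuda-10.2/bin
```

This is why prepending, rather than appending, is what makes the desired CUDA toolkit the one the build actually finds.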
Sometimes you may be using an older CUDA version which refuses to build with newer compilers. For example, you may have `gcc-9` but CUDA wants `gcc-7`.

There are various ways to go about it.

If you can install the latest CUDA toolkit, it will typically support the newer compiler.

Alternatively, you could install the lower version of the compiler in addition to the one you already have, or you may already have it installed but it's not the default one, so the build system can't see it. If you have `gcc-7` installed but the build system complains it can't find it, the following may do the trick:

```bash
sudo ln -s /usr/bin/gcc-7 /usr/local/cuda-10.2/bin/gcc
sudo ln -s /usr/bin/g++-7 /usr/local/cuda-10.2/bin/g++
```

Here we are making a symlink to `gcc-7` from `/usr/local/cuda-10.2/bin/gcc`. Since `/usr/local/cuda-10.2/bin/` should be in the `PATH` environment variable (see the previous problem's solution), the build should find `gcc-7` (and `g++7`) and then succeed.

As always, make sure to edit the paths in the example to match your situation.
To accelerate training of huge models on larger batch sizes, we can use a fully sharded data parallel model. This type of data parallel paradigm enables fitting more data and larger models by sharding the optimizer states, gradients and parameters. To read more about it and its benefits, check out the Fully Sharded Data Parallel blog. We have integrated the latest PyTorch Fully Sharded Data Parallel (FSDP) training feature. All you need to do is enable it through the config.

Required PyTorch version for FSDP support: PyTorch Nightly (or 1.12.0 if you read this after its release), as model saving with FSDP activated is only available with recent fixes.
Usage:

- Make sure you have added the distributed launcher `-m torch.distributed.launch --nproc_per_node=NUMBER_OF_GPUS_YOU_HAVE` if you haven't been using it already.
- Sharding Strategy:
  - FULL_SHARD: Shards optimizer states + gradients + model parameters across data parallel workers/GPUs. For this, add `--fsdp full_shard` to the command line arguments.
  - SHARD_GRAD_OP: Shards optimizer states + gradients across data parallel workers/GPUs. For this, add `--fsdp shard_grad_op` to the command line arguments.
  - NO_SHARD: No sharding. For this, add `--fsdp no_shard` to the command line arguments.
- To offload the parameters and gradients to the CPU, add `--fsdp "full_shard offload"` or `--fsdp "shard_grad_op offload"` to the command line arguments.
- To automatically recursively wrap layers with FSDP using `default_auto_wrap_policy`, add `--fsdp "full_shard auto_wrap"` or `--fsdp "shard_grad_op auto_wrap"` to the command line arguments.
- To enable both CPU offloading and auto wrapping, add `--fsdp "full_shard offload auto_wrap"` or `--fsdp "shard_grad_op offload auto_wrap"` to the command line arguments.
- The remaining FSDP config is passed via `--fsdp_config <path_to_fsdp_config.json>`. It is either the location of an FSDP json config file (e.g., `fsdp_config.json`) or an already loaded json file as a `dict`.
  - If auto wrapping is enabled, you can either use a transformer based auto wrap policy or a size based auto wrap policy.
    - For the transformer based auto wrap policy, it is recommended to specify `fsdp_transformer_layer_cls_to_wrap` in the config file. If not specified, the default value is `model._no_split_modules`, when available. This specifies the list of transformer layer class names (case-sensitive) to wrap, e.g., [`BertLayer`], [`GPTJBlock`], [`T5Block`] .... This is important because submodules that share weights (e.g., embedding layers) should not end up in different FSDP wrapped units. Using this policy, wrapping happens for each block containing Multi-Head Attention followed by a couple of MLP layers. The remaining layers, including the shared embeddings, are conveniently wrapped in the same outermost FSDP unit. Therefore, use this for transformer based models.
    - For the size based auto wrap policy, please add `fsdp_min_num_params` in the config file. It specifies FSDP's minimum number of parameters for auto wrapping.
  - `fsdp_backward_prefetch` can be specified in the config file. It controls when to prefetch the next set of parameters. `backward_pre` and `backward_post` are the available options. For more information refer to `torch.distributed.fsdp.fully_sharded_data_parallel.BackwardPrefetch`.
  - `fsdp_forward_prefetch` can be specified in the config file. It controls when to prefetch the next set of parameters. If `True`, FSDP explicitly prefetches the next upcoming all-gather while executing in the forward pass.
  - `limit_all_gathers` can be specified in the config file. If `True`, FSDP explicitly synchronizes the CPU thread to prevent too many in-flight all-gathers.
  - `activation_checkpointing` can be specified in the config file. If `True`, FSDP activation checkpointing is used: a technique that reduces memory usage by clearing the activations of certain layers and recomputing them during the backward pass. Effectively, this trades extra computation time for reduced memory usage.

A few caveats to be aware of:

- It is incompatible with `generate`, and thus incompatible with `--predict_with_generate` in all seq2seq/clm scripts (translation/summarization/clm etc.). Please refer to issue #21667.
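As an illustration, a minimal `fsdp_config.json` combining several of the options above might look like the following. The keys are the ones documented in this section; the values are example choices, not recommendations:

```json
{
  "fsdp_transformer_layer_cls_to_wrap": ["BertLayer"],
  "fsdp_backward_prefetch": "backward_pre",
  "fsdp_forward_prefetch": false,
  "limit_all_gathers": false,
  "activation_checkpointing": false
}
```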
Good news for TPU users! PyTorch/XLA now supports FSDP. All of the latest Fully Sharded Data Parallel (FSDP) training is supported. For more information refer to Scaling PyTorch models on Cloud TPUs with FSDP and the PyTorch/XLA implementation of FSDP. All you need to do is enable it through the config.

Required PyTorch/XLA version for FSDP support: >=2.0

Usage:

Pass `--fsdp "full shard"` along with the following changes to be made in `--fsdp_config <path_to_fsdp_config.json>`:

- `xla` should be set to `True` to enable PyTorch/XLA FSDP.
- `xla_fsdp_settings` is a dictionary which stores the XLA FSDP wrapping parameters. For a complete list of options, please see here.
- `xla_fsdp_grad_ckpt`: when `True`, uses gradient checkpointing over each nested XLA FSDP wrapped layer. This setting can only be used when the xla flag is set to true and an auto wrapping policy is specified through `fsdp_min_num_params` or `fsdp_transformer_layer_cls_to_wrap`.
- You can either use a transformer based auto wrap policy or a size based auto wrap policy.
  - For the transformer based auto wrap policy, it is recommended to specify `fsdp_transformer_layer_cls_to_wrap` in the config file. If not specified, the default value is `model._no_split_modules`, when available. This specifies the list of transformer layer class names (case-sensitive) to wrap, e.g., [`BertLayer`], [`GPTJBlock`], [`T5Block`] .... This is important because submodules that share weights (e.g., embedding layers) should not end up in different FSDP wrapped units. Using this policy, wrapping happens for each block containing Multi-Head Attention followed by a couple of MLP layers. The remaining layers, including the shared embeddings, are conveniently wrapped in the same outermost FSDP unit. Therefore, use this for transformer based models.
  - For the size based auto wrap policy, please add `fsdp_min_num_params` in the config file. It specifies FSDP's minimum number of parameters for auto wrapping.
With the PyTorch v1.12 release, developers and researchers can take advantage of Apple silicon GPUs for significantly faster model training. This unlocks the ability to perform machine learning workflows like prototyping and fine-tuning locally, right on Mac. Apple's Metal Performance Shaders (MPS) as a backend for PyTorch enables this, and can be used via the new `"mps"` device. This maps computational graphs and primitives onto the MPS Graph framework and tuned kernels provided by MPS. For more information please refer to the official documents Introducing Accelerated PyTorch Training on Mac and the MPS backend.

We strongly recommend installing PyTorch >= 1.13 (a nightly version at the time of writing) on your MacOS machine. It has major fixes related to model correctness and performance improvements for transformer based models. Please refer to pytorch/pytorch#82707 for more details.

Benefits of Training and Inference using Apple Silicon Chips:

- Enables users to train larger networks or batch sizes locally
- Reduces data retrieval latency and provides the GPU with direct access to the full memory store due to the unified memory architecture, thereby improving end-to-end performance.
- Reduces costs associated with cloud-based development or the need for additional local GPUs.

Prerequisites: to install torch with mps support, please follow this nice medium article GPU-Acceleration Comes to PyTorch on M1 Macs.
Usage:

The `mps` device will be used by default if available, similar to the way the `cuda` device is used. Therefore, no action from the user is required. For example, you can run the official Glue text classification task (from the root folder) using an Apple Silicon GPU with the command below:

```bash
export TASK_NAME=mrpc

python examples/pytorch/text-classification/run_glue.py \
  --model_name_or_path google-bert/bert-base-cased \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --max_seq_length 128 \
  --per_device_train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3 \
  --output_dir /tmp/$TASK_NAME/ \
  --overwrite_output_dir
```
A few caveats to be aware of:

- Some PyTorch operations are not implemented in mps and will throw an error. One way to get around that is to set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1`, which will fall back to CPU for these operations. It will still throw a UserWarning, however.
- Distributed setups `gloo` and `nccl` do not work with the `mps` device. This means that currently only a single GPU of the `mps` device type can be used.

Finally, please remember that the 🤗 `Trainer` only integrates the MPS backend; therefore, if you have any problems or questions with regards to MPS backend usage, please file an issue with PyTorch GitHub.
We leverage Accelerate to power the `Trainer`. In terms of what users should expect:

- They can keep using the Trainer integrations such as FSDP, DeepSpeed etc. via trainer arguments without any changes on their part.
- They can now use the Accelerate Launcher with the Trainer (recommended).

Steps to use the Accelerate Launcher with the Trainer:

1. Make sure 🤗 Accelerate is installed; you can't use the `Trainer` without it anyway. If not, `pip install accelerate`. You may also need to update your version of Accelerate: `pip install accelerate --upgrade`
2. Run `accelerate config` and fill the questionnaire. Below are example accelerate configs:

a. DDP Multi-node Multi-GPU config:

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0 # change rank as per the node
main_process_ip: 192.168.20.1
main_process_port: 9898
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
b. FSDP config:

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_forward_prefetch: true
  fsdp_offload_params: false
  fsdp_sharding_strategy: 1
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_transformer_layer_cls_to_wrap: BertLayer
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
c. DeepSpeed config pointing to a file:

```yaml
compute_environment: LOCAL_MACHINE
deepspeed_config:
  deepspeed_config_file: /home/user/configs/ds_zero3_config.json
  zero3_init_flag: true
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
d. DeepSpeed config using the accelerate plugin:
```yaml
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 0.7
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
3. Run the Trainer script with the remaining args, i.e. those not handled above by the accelerate config or launcher args. Below is an example of running `run_glue.py` with `accelerate launch` using the FSDP config from above:
```bash
cd transformers

accelerate launch \
./examples/pytorch/text-classification/run_glue.py \
--model_name_or_path google-bert/bert-base-cased \
--task_name $TASK_NAME \
--do_train \
--do_eval \
--max_seq_length 128 \
--per_device_train_batch_size 16 \
--learning_rate 5e-5 \
--num_train_epochs 3 \
--output_dir /tmp/$TASK_NAME/ \
--overwrite_output_dir
```
4. You can also directly use the cmd args for `accelerate launch`. The above example would map to:
```bash
cd transformers

accelerate launch --num_processes=2 \
--use_fsdp \
--mixed_precision=bf16 \
--fsdp_auto_wrap_policy=TRANSFORMER_BASED_WRAP \
--fsdp_transformer_layer_cls_to_wrap="BertLayer" \
--fsdp_sharding_strategy=1 \
--fsdp_state_dict_type=FULL_STATE_DICT \
./examples/pytorch/text-classification/run_glue.py \
--model_name_or_path google-bert/bert-base-cased \
--task_name $TASK_NAME \
--do_train \
--do_eval \
--max_seq_length 128 \
--per_device_train_batch_size 16 \
--learning_rate 5e-5 \
--num_train_epochs 3 \
--output_dir /tmp/$TASK_NAME/ \
--overwrite_output_dir
```
For more information, please refer to the 🤗 Accelerate CLI guide: Launching your 🤗 Accelerate scripts.
Sections that were moved:
[ DeepSpeed | Installation | Deployment with multiple GPUs | Deployment with one GPU | Deployment in Notebooks | Configuration | Passing Configuration | Shared Configuration | ZeRO | ZeRO-2 Config | ZeRO-3 Config | NVMe Support | ZeRO-2 vs ZeRO-3 Performance | ZeRO-2 Example | ZeRO-3 Example | Optimizer | Scheduler | fp32 Precision | Automatic Mixed Precision | Batch Size | Gradient Accumulation | Gradient Clipping | Getting The Model Weights Out ]