[BUG] deepspeed_aio swap file name too long. #5087
Comments
@shockline Was this issue observed with other optimizers?
No. Other LLM training runs seem to work and throw no error.
Can you test against this branch? https://github.com/microsoft/DeepSpeed/tree/jomayeri/issue-5087
github-merge-queue bot pushed a commit that referenced this issue on Feb 23, 2024:
Fixing issue #5087. Limited the naming of the ds_id in ZeRO 3 to the first and last parameters of the group instead of every parameter in the group.
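For illustration, a minimal sketch of what that naming change amounts to (the helper names here are hypothetical, not DeepSpeed's actual code):

```python
# Hypothetical sketch of the swap-file naming change; function names
# are illustrative, not DeepSpeed's actual implementation.
def swap_file_name_before(param_ids):
    # Before the fix: every parameter id in the group was joined into
    # the file name, so a ~680-parameter group blows past NAME_MAX (255).
    return "_".join(str(i) for i in param_ids) + ".tensor.swp"

def swap_file_name_after(param_ids):
    # After the fix: only the first and last ids identify the group.
    return f"{param_ids[0]}_{param_ids[-1]}.tensor.swp"

ids = list(range(680))  # the group size seen in the log below
assert len(swap_file_name_before(ids)) > 255   # -> "File name too long"
assert len(swap_file_name_after(ids)) <= 255   # fits comfortably
```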
I have tested the patch, and everything works fine now.
ShellyNR pushed a commit to ShellyNR/DeepSpeed that referenced this issue on Mar 11, 2024:
Fixing issue microsoft#5087. Limited the naming of the ds_id in ZeRO 3 to the first and last parameters of the group instead of every parameter in the group.
rraminen pushed a commit to ROCm/DeepSpeed that referenced this issue on May 9, 2024:
Fixing issue microsoft#5087. Limited the naming of the ds_id in ZeRO 3 to the first and last parameters of the group instead of every parameter in the group.
Describe the bug
When I use DeepSpeedCPUAdam to train the Mixtral 8x7B model (ZeRO stage 3), training fails with the error output shown below.
After that, when I tried "ls /workspace/finetune/disk0/zero_stage_3/optimizer/rank29/0_1_2_3_4_5_6_7_8_9_10_11_12_13_14_15_16_17_18_19_20_21_22_23_24_25_26_27_28_29_30_31_32_33_34_35_36_37_38_39_40_41_42_43_44_45_46_47_48_49_50_51_52_53_54_55_56_57_58_59_60_61_62_63_64_65_66_67_68_69_70_71_72_73_74_75_76_77_78_79_80_81_82_83_84_85_86_87_88_89_90_91_92_93_94_95_96_97_98_99_100_101_102_103_104_105_106_107_108_109_110_111_112_113_114_115_116_117_118_119_120_121_122_123_124_125_126_127_128_129_130_131_132_133_134_135_136_137_138_139_140_141_142_143_144_145_146_147_148_149_150_151_152_153_154_155_156_157_158_159_160_161_162_163_164_165_166_167_168_169_170_171_172_173_174_175_176_177_178_179_180_181_182_183_184_185_186_187_188_189_190_191_192_193_194_195_196_197_198_199_200_201_202_203_204_205_206_207_208_209_210_211_212_213_214_215_216_217_218_219_220_221_222_223_224_225_226_227_228_229_230_231_232_233_234_235_236_237_238_239_240_241_242_243_244_245_246_247_248_249_250_251_252_253_254_255_256_257_258_259_260_261_262_263_264_265_266_267_268_269_270_271_272_273_274_275_276_277_278_279_280_281_282_283_284_285_286_287_288_289_290_291_292_293_294_295_296_297_298_299_300_301_302_303_304_305_306_307_308_309_310_311_312_313_314_315_316_317_318_319_320_321_322_323_324_325_326_327_328_329_330_331_332_333_334_335_336_337_338_339_340_341_342_343_344_345_346_347_348_349_350_351_352_353_354_355_356_357_358_359_360_361_362_363_364_365_366_367_368_369_370_371_372_373_374_375_376_377_378_379_380_381_382_383_384_385_386_387_388_389_390_391_392_393_394_395_396_397_398_399_400_401_402_403_404_405_406_407_408_409_410_411_412_413_414_415_416_417_418_419_420_421_422_423_424_425_426_427_428_429_430_431_432_433_434_435_436_437_438_439_440_441_442_443_444_445_446_447_448_449_450_451_452_453_454_455_456_457_458_459_460_461_462_463_464_465_466_467_468_469_470_471_472_473_474_475_476_477_478_479_480_481_482_483_484_485_486_487_488_489_490_491_492_493_494_495_496_497_498_499_500_501_502_503_504_505_506_507_508_509_510_511_512_513_514_515_516_517_518_519_520_521_522_523_524_525_526_527_528_529_530_531_532_533_534_535_536_537_538_539_540_541_542_543_544_545_546_547_548_549_550_551_552_553_554_555_556_557_558_559_560_561_562_563_564_565_566_567_568_569_570_571_572_573_574_575_576_577_578_579_580_581_582_583_584_585_586_587_588_589_590_591_592_593_594_595_596_597_598_599_600_601_602_603_604_605_606_607_608_609_610_611_612_613_614_615_616_617_618_619_620_621_622_623_624_625_626_627_628_629_630_631_632_633_634_635_636_637_638_639_640_641_642_643_644_645_646_647_648_649_650_651_652_653_654_655_656_657_658_659_660_661_662_663_664_665_666_667_668_669_670_671_672_673_674_675_676_677_678_679.tensor.swp"
the Linux system reported "File name too long" (errno 36, ENAMETOOLONG).
How can I fix my training?
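For context: error = 36 in the output below is Linux's ENAMETOOLONG, and most Linux filesystems cap a single file-name component at NAME_MAX = 255 bytes. A quick way to confirm both on a given machine:

```python
import errno
import os

# errno 36 on Linux is ENAMETOOLONG: a single path component exceeds
# the filesystem's limit (commonly NAME_MAX = 255 bytes).
print(errno.errorcode[36])              # 'ENAMETOOLONG' on Linux
print(os.pathconf("/", "PC_NAME_MAX"))  # typically 255
```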
****** COMMAND LINE OUTPUT ******
worker-3: deepspeed_aio: open for write failed on /workspace/finetune/disk0/zero_stage_3/optimizer/rank24/0_1_2_3_4_5_6_7_8_9_10_11_12_13_14_15_16_17_18_19_20_21_22_23_24_25_26_27_28_29_30_31_32_33_34_35_36_37_38_39_40_41_42_43_44_45_46_47_48_49_50_51_52_53_54_55_56_57_58_59_60_61_62_63_64_65_66_67_68_69_70_71_72_73_74_75_76_77_78_79_80_81_82_83_84_85_86_87_88_89_90_91_92_93_94_95_96_97_98_99_100_101_102_103_104_105_106_107_108_109_110_111_112_113_114_115_116_117_118_119_120_121_122_123_124_125_126_127_128_129_130_131_132_133_134_135_136_137_138_139_140_141_142_143_144_145_146_147_148_149_150_151_152_153_154_155_156_157_158_159_160_161_162_163_164_165_166_167_168_169_170_171_172_173_174_175_176_177_178_179_180_181_182_183_184_185_186_187_188_189_190_191_192_193_194_195_196_197_198_199_200_201_202_203_204_205_206_207_208_209_210_211_212_213_214_215_216_217_218_219_220_221_222_223_224_225_226_227_228_229_230_231_232_233_234_235_236_237_238_239_240_241_242_243_244_245_246_247_248_249_250_251_252_253_254_255_256_257_258_259_260_261_262_263_264_265_266_267_268_269_270_271_272_273_274_275_276_277_278_279_280_281_282_283_284_285_286_287_288_289_290_291_292_293_294_295_296_297_298_299_300_301_302_303_304_305_306_307_308_309_310_311_312_313_314_315_316_317_318_319_320_321_322_323_324_325_326_327_328_329_330_331_332_333_334_335_336_337_338_339_340_341_342_343_344_345_346_347_348_349_350_351_352_353_354_355_356_357_358_359_360_361_362_363_364_365_366_367_368_369_370_371_372_373_374_375_376_377_378_379_380_381_382_383_384_385_386_387_388_389_390_391_392_393_394_395_396_397_398_399_400_401_402_403_404_405_406_407_408_409_410_411_412_413_414_415_416_417_418_419_420_421_422_423_424_425_426_427_428_429_430_431_432_433_434_435_436_437_438_439_440_441_442_443_444_445_446_447_448_449_450_451_452_453_454_455_456_457_458_459_460_461_462_463_464_465_466_467_468_469_470_471_472_473_474_475_476_477_478_479_480_481_482_483_484_485_486_487_488_489_490_491_492_493_494_495_496_497_498_499_500_501_502_503_504_505_506_507_508_509_510_511_512_513_514_515_516_517_518_519_520_521_522_523_524_525_526_527_528_529_530_531_532_533_534_535_536_537_538_539_540_541_542_543_544_545_546_547_548_549_550_551_552_553_554_555_556_557_558_559_560_561_562_563_564_565_566_567_568_569_570_571_572_573_574_575_576_577_578_579_580_581_582_583_584_585_586_587_588_589_590_591_592_593_594_595_596_597_598_599_600_601_602_603_604_605_606_607_608_609_610_611_612_613_614_615_616_617_618_619_620_621_622_623_624_625_626_627_628_629_630_631_632_633_634_635_636_637_638_639_640_641_642_643_644_645_646_647_648_649_650_651_652_653_654_655_656_657_658_659_660_661_662_663_664_665_666_667_668_669_670_671_672_673_674_675_676_677_678_679.tensor.swp error = 36
worker-3: Traceback (most recent call last):
worker-3: File "/workspace/finetune/large_model/finetune/finetune.py", line 297, in <module>
worker-3: main()
worker-3: File "/workspace/finetune/large_model/finetune/finetune.py", line 235, in main
worker-3: train_result = trainer.train(resume_from_checkpoint=checkpoint)
worker-3: File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1539, in train
worker-3: return inner_training_loop(
worker-3: File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/memory.py", line 136, in decorator
worker-3: return function(batch_size, *args, **kwargs)
worker-3: File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1690, in _inner_training_loop
worker-3: model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
worker-3: File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1219, in prepare
worker-3: result = self._prepare_deepspeed(*args)
worker-3: File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1604, in _prepare_deepspeed
worker-3: engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
worker-3: File "/usr/local/lib/python3.10/dist-packages/deepspeed/__init__.py", line 171, in initialize
worker-3: engine = DeepSpeedEngine(args=args,
worker-3: File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 308, in __init__
worker-3: self._configure_optimizer(optimizer, model_parameters)
worker-3: File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 1247, in _configure_optimizer
worker-3: self.optimizer = self._configure_zero_optimizer(basic_optimizer)
worker-3: File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 1569, in _configure_zero_optimizer
worker-3: optimizer = DeepSpeedZeroOptimizer_Stage3(
worker-3: File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/stage3.py", line 361, in __init__
worker-3: self._setup_for_real_optimizer()
worker-3: File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/stage3.py", line 468, in _setup_for_real_optimizer
worker-3: self._create_fp32_partitions()
worker-3: File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/stage3.py", line 866, in _create_fp32_partitions
worker-3: self.optimizer_swapper.initialize_parameters(parameters=swappable_fp32_tensors,
worker-3: File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/swap_tensor/partitioned_optimizer_swapper.py", line 51, in initialize_parameters
worker-3: self._initialize_parameters(parameters=parameters, src_tensors=src_tensors, aio_handle=self.aio_handle)
worker-3: File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/swap_tensor/optimizer_utils.py", line 329, in _initialize_parameters
worker-3: self._swap_out_unpinned_tensors(aio_handle=aio_handle,
worker-3: File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/swap_tensor/optimizer_utils.py", line 375, in _swap_out_unpinned_tensors
worker-3: swap_out_tensors(aio_handle, swap_buffers, swap_paths)
worker-3: File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/swap_tensor/utils.py", line 26, in swap_out_tensors
worker-3: assert (swap_handle.async_pwrite(buffer, path) == 0)
worker-3: AssertionError
System info (please complete the following information):
deepspeed==0.13.1
model=Mixtral 8x7B
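For reference, this failure occurs on the NVMe optimizer-offload path, the component that writes the *.tensor.swp files above. A minimal ZeRO-3 config fragment of the shape that exercises it might look like the following; the nvme_path is illustrative and should point at a fast local disk:

```python
# Illustrative DeepSpeed ZeRO-3 config fragment (Python dict form).
# "device": "nvme" routes optimizer state through the swap-tensor code
# that produced the file names above; the path is a placeholder.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": "/workspace/finetune/disk0",
        },
    },
}
```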