
[BUG] deepspeed_aio swap file name too long. #5087

Closed
shockline opened this issue Feb 6, 2024 · 4 comments
Assignees
Labels
bug Something isn't working training

Comments

@shockline

shockline commented Feb 6, 2024

Describe the bug

When I use DeepSpeedCPUAdam to train the Mixtral 8x7B model (ZeRO stage 3), I get the error output shown below.
After that, when I tried "ls /workspace/finetune/disk0/zero_stage_3/optimizer/rank29/0_1_2_3_4_5_6_7_8_9_10_11_12_13_14_15_16_17_18_19_20_21_22_23_24_25_26_27_28_29_30_31_32_33_34_35_36_37_38_39_40_41_42_43_44_45_46_47_48_49_50_51_52_53_54_55_56_57_58_59_60_61_62_63_64_65_66_67_68_69_70_71_72_73_74_75_76_77_78_79_80_81_82_83_84_85_86_87_88_89_90_91_92_93_94_95_96_97_98_99_100_101_102_103_104_105_106_107_108_109_110_111_112_113_114_115_116_117_118_119_120_121_122_123_124_125_126_127_128_129_130_131_132_133_134_135_136_137_138_139_140_141_142_143_144_145_146_147_148_149_150_151_152_153_154_155_156_157_158_159_160_161_162_163_164_165_166_167_168_169_170_171_172_173_174_175_176_177_178_179_180_181_182_183_184_185_186_187_188_189_190_191_192_193_194_195_196_197_198_199_200_201_202_203_204_205_206_207_208_209_210_211_212_213_214_215_216_217_218_219_220_221_222_223_224_225_226_227_228_229_230_231_232_233_234_235_236_237_238_239_240_241_242_243_244_245_246_247_248_249_250_251_252_253_254_255_256_257_258_259_260_261_262_263_264_265_266_267_268_269_270_271_272_273_274_275_276_277_278_279_280_281_282_283_284_285_286_287_288_289_290_291_292_293_294_295_296_297_298_299_300_301_302_303_304_305_306_307_308_309_310_311_312_313_314_315_316_317_318_319_320_321_322_323_324_325_326_327_328_329_330_331_332_333_334_335_336_337_338_339_340_341_342_343_344_345_346_347_348_349_350_351_352_353_354_355_356_357_358_359_360_361_362_363_364_365_366_367_368_369_370_371_372_373_374_375_376_377_378_379_380_381_382_383_384_385_386_387_388_389_390_391_392_393_394_395_396_397_398_399_400_401_402_403_404_405_406_407_408_409_410_411_412_413_414_415_416_417_418_419_420_421_422_423_424_425_426_427_428_429_430_431_432_433_434_435_436_437_438_439_440_441_442_443_444_445_446_447_448_449_450_451_452_453_454_455_456_457_458_459_460_461_462_463_464_465_466_467_468_469_470_471_472_473_474_475_476_477_478_479_480_481_482_483_484_485_486_487_488_489_490_491_492_493_494_495_496_497_498_499_500_501_502_503_504_505_5
06_507_508_509_510_511_512_513_514_515_516_517_518_519_520_521_522_523_524_525_526_527_528_529_530_531_532_533_534_535_536_537_538_539_540_541_542_543_544_545_546_547_548_549_550_551_552_553_554_555_556_557_558_559_560_561_562_563_564_565_566_567_568_569_570_571_572_573_574_575_576_577_578_579_580_581_582_583_584_585_586_587_588_589_590_591_592_593_594_595_596_597_598_599_600_601_602_603_604_605_606_607_608_609_610_611_612_613_614_615_616_617_618_619_620_621_622_623_624_625_626_627_628_629_630_631_632_633_634_635_636_637_638_639_640_641_642_643_644_645_646_647_648_649_650_651_652_653_654_655_656_657_658_659_660_661_662_663_664_665_666_667_668_669_670_671_672_673_674_675_676_677_678_679.tensor.swp"
Linux reported "File name too long".

How can I fix my training?

Here is the command-line output:
worker-3: deepspeed_aio: open for write failed on /workspace/finetune/disk0/zero_stage_3/optimizer/rank24/0_1_2_3_4_5_6_7_8_9_10_11_12_13_14_15_16_17_18_19_20_21_22_23_24_25_26_27_28_29_30_31_32_33_34_35_36_37_38_39_40_41_42_43_44_45_46_47_48_49_50_51_52_53_54_55_56_57_58_59_60_61_62_63_64_65_66_67_68_69_70_71_72_73_74_75_76_77_78_79_80_81_82_83_84_85_86_87_88_89_90_91_92_93_94_95_96_97_98_99_100_101_102_103_104_105_106_107_108_109_110_111_112_113_114_115_116_117_118_119_120_121_122_123_124_125_126_127_128_129_130_131_132_133_134_135_136_137_138_139_140_141_142_143_144_145_146_147_148_149_150_151_152_153_154_155_156_157_158_159_160_161_162_163_164_165_166_167_168_169_170_171_172_173_174_175_176_177_178_179_180_181_182_183_184_185_186_187_188_189_190_191_192_193_194_195_196_197_198_199_200_201_202_203_204_205_206_207_208_209_210_211_212_213_214_215_216_217_218_219_220_221_222_223_224_225_226_227_228_229_230_231_232_233_234_235_236_237_238_239_240_241_242_243_244_245_246_247_248_249_250_251_252_253_254_255_256_257_258_259_260_261_262_263_264_265_266_267_268_269_270_271_272_273_274_275_276_277_278_279_280_281_282_283_284_285_286_287_288_289_290_291_292_293_294_295_296_297_298_299_300_301_302_303_304_305_306_307_308_309_310_311_312_313_314_315_316_317_318_319_320_321_322_323_324_325_326_327_328_329_330_331_332_333_334_335_336_337_338_339_340_341_342_343_344_345_346_347_348_349_350_351_352_353_354_355_356_357_358_359_360_361_362_363_364_365_366_367_368_369_370_371_372_373_374_375_376_377_378_379_380_381_382_383_384_385_386_387_388_389_390_391_392_393_394_395_396_397_398_399_400_401_402_403_404_405_406_407_408_409_410_411_412_413_414_415_416_417_418_419_420_421_422_423_424_425_426_427_428_429_430_431_432_433_434_435_436_437_438_439_440_441_442_443_444_445_446_447_448_449_450_451_452_453_454_455_456_457_458_459_460_461_462_463_464_465_466_467_468_469_470_471_472_473_474_475_476_477_478_479_480_481_482_483_484_485_486_487_488_489_490_491_492_493_494_495_496_497_498_499_500_
501_502_503_504_505_506_507_508_509_510_511_512_513_514_515_516_517_518_519_520_521_522_523_524_525_526_527_528_529_530_531_532_533_534_535_536_537_538_539_540_541_542_543_544_545_546_547_548_549_550_551_552_553_554_555_556_557_558_559_560_561_562_563_564_565_566_567_568_569_570_571_572_573_574_575_576_577_578_579_580_581_582_583_584_585_586_587_588_589_590_591_592_593_594_595_596_597_598_599_600_601_602_603_604_605_606_607_608_609_610_611_612_613_614_615_616_617_618_619_620_621_622_623_624_625_626_627_628_629_630_631_632_633_634_635_636_637_638_639_640_641_642_643_644_645_646_647_648_649_650_651_652_653_654_655_656_657_658_659_660_661_662_663_664_665_666_667_668_669_670_671_672_673_674_675_676_677_678_679.tensor.swp error = 36
worker-3: Traceback (most recent call last):
worker-3: File "/workspace/finetune/large_model/finetune/finetune.py", line 297, in
worker-3: main()
worker-3: File "/workspace/finetune/large_model/finetune/finetune.py", line 235, in main
worker-3: train_result = trainer.train(resume_from_checkpoint=checkpoint)
worker-3: File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1539, in train
worker-3: return inner_training_loop(
worker-3: File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/memory.py", line 136, in decorator
worker-3: return function(batch_size, *args, **kwargs)
worker-3: File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1690, in _inner_training_loop
worker-3: model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
worker-3: File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1219, in prepare
worker-3: result = self._prepare_deepspeed(*args)
worker-3: File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1604, in _prepare_deepspeed
worker-3: engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
worker-3: File "/usr/local/lib/python3.10/dist-packages/deepspeed/init.py", line 171, in initialize
worker-3: engine = DeepSpeedEngine(args=args,
worker-3: File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 308, in init
worker-3: self._configure_optimizer(optimizer, model_parameters)
worker-3: File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 1247, in _configure_optimizer
worker-3: self.optimizer = self._configure_zero_optimizer(basic_optimizer)
worker-3: File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 1569, in _configure_zero_optimizer
worker-3: optimizer = DeepSpeedZeroOptimizer_Stage3(
worker-3: File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/stage3.py", line 361, in init
worker-3: self._setup_for_real_optimizer()
worker-3: File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/stage3.py", line 468, in _setup_for_real_optimizer
worker-3: self._create_fp32_partitions()
worker-3: File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/stage3.py", line 866, in _create_fp32_partitions
worker-3: self.optimizer_swapper.initialize_parameters(parameters=swappable_fp32_tensors,
worker-3: File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/swap_tensor/partitioned_optimizer_swapper.py", line 51, in initialize_parameters
worker-3: self._initialize_parameters(parameters=parameters, src_tensors=src_tensors, aio_handle=self.aio_handle)
worker-3: File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/swap_tensor/optimizer_utils.py", line 329, in _initialize_parameters
worker-3: self._swap_out_unpinned_tensors(aio_handle=aio_handle,
worker-3: File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/swap_tensor/optimizer_utils.py", line 375, in _swap_out_unpinned_tensors
worker-3: swap_out_tensors(aio_handle, swap_buffers, swap_paths)
worker-3: File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/swap_tensor/utils.py", line 26, in swap_out_tensors
worker-3: assert (swap_handle.async_pwrite(buffer, path) == 0)
worker-3: AssertionError

System info (please complete the following information):
deepspeed==0.13.1
model = Mixtral 8x7B
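For reference, error = 36 in the log is ENAMETOOLONG on Linux: most filesystems cap a single path component at NAME_MAX (typically 255 bytes), and the swap file basename above, built by joining every ds_id in the parameter group, is an order of magnitude longer. A quick sketch confirming both points:

```python
import errno

# On Linux, errno 36 is ENAMETOOLONG ("File name too long").
print(errno.errorcode[errno.ENAMETOOLONG])

# Reconstruct the length of the offending swap file basename:
# ds_ids 0..679 joined with "_", plus the ".tensor.swp" suffix.
name = "_".join(str(i) for i in range(680)) + ".tensor.swp"
print(len(name))  # 2620 characters, far above the typical 255-byte NAME_MAX
```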

@shockline shockline added bug Something isn't working training labels Feb 6, 2024
@jomayeri
Contributor

@shockline Was this issue observed with other optimizers?

@shockline
Author

@shockline Was this issue observed with other optimizers?

No. Training other LLMs works and throws no error.

@jomayeri
Contributor

Can you test against this branch? https://github.com/microsoft/DeepSpeed/tree/jomayeri/issue-5087

github-merge-queue bot pushed a commit that referenced this issue Feb 23, 2024
Fixing issue #5087. Limited the naming of the ds_id in ZeRO 3 to the first and last parameters of the group instead of every parameter in the group.
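The fix described in the commit message can be sketched roughly as follows. This is a hypothetical helper for illustration, not the actual DeepSpeed code: instead of joining every ds_id in the parameter group into the basename, only the first and last ids are used, so the name length stays constant regardless of group size.

```python
def short_swap_basename(ds_ids):
    """Hypothetical sketch of the fix: name the swap file after only the
    first and last parameter ids in the group instead of every id."""
    if len(ds_ids) == 1:
        tag = str(ds_ids[0])
    else:
        tag = f"{ds_ids[0]}_{ds_ids[-1]}"
    return f"{tag}.tensor.swp"

print(short_swap_basename(list(range(680))))  # 0_679.tensor.swp
```

With 680 parameters in the group, the basename shrinks from roughly 2600 characters to 16, comfortably under any NAME_MAX limit.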
@shockline
Author

I have tested the patch, and everything works fine now. Thank you.

ShellyNR pushed a commit to ShellyNR/DeepSpeed that referenced this issue Mar 11, 2024
Fixing issue microsoft#5087. Limited the naming of the ds_id in ZeRO 3 to the first and last parameters of the group instead of every parameter in the group.
rraminen pushed a commit to ROCm/DeepSpeed that referenced this issue May 9, 2024
Fixing issue microsoft#5087. Limited the naming of the ds_id in ZeRO 3 to the first and last parameters of the group instead of every parameter in the group.