
ValueError: 150004 is not in list #22

Open
dsh54054 opened this issue Apr 12, 2023 · 8 comments

dsh54054 commented Apr 12, 2023

obj: {'prompt': [12313, 107, 125, 6054, 109, 3384, 104, 1833, 6, 11311, 110, 125, 938, 109, 986, 583, 1303, 7, 11121, 104, 532, 109, 12475, 29321, 100, 1029, 7, 4, 4, 125875, 14150, 12, 4, 64298, 66977, 100326, 69122, 76809, 65324, 65459, 85929, 63823, 43, 151, 4, 43, 151, 69106, 12, 64176, 6, 64219, 6, 68651, 63823, 43, 151, 4, 4, 125875, 11034, 12, 130001, 130004], 'completion': [28, 64872, 64219, 68281, 69902, 63984, 6, 63984, 64548, 68651, 63962, 6, 64872, 81715, 69754, 68840, 78, 150005]}
0%| | 0/134265 [00:00<?, ?it/s]
obj: {'prompt': [12313, 107, 125, 6054, 109, 3384, 104, 1833, 6, 11311, 110, 125, 938, 109, 986, 583, 1303, 7, 11121, 104, 532, 109, 12475, 29321, 100, 1029, 7, 4, 4, 125875, 14150, 12, 4, 67769, 65056, 68395, 100087, 63823, 4, 4987, 6, 225, 118, 120, 31, 4, 4, 125875, 11034, 12, 130001, 130004], 'completion': [5, 74874, 6, 63852, 66348, 31, 150005]}
0%| | 0/134265 [00:00<?, ?it/s]
Traceback (most recent call last):
File "multi_gpu_fintune_belle.py", line 361, in
main()
File "multi_gpu_fintune_belle.py", line 309, in main
for step, batch in enumerate(t:=tqdm.tqdm(train_dataloader)):
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/site-packages/tqdm/std.py", line 1178, in __iter__
for obj in iterable:
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/site-packages/accelerate/data_loader.py", line 378, in __iter__
current_batch = next(dataloader_iter)
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 634, in __next__
data = self._next_data()
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 678, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
return self.collate_fn(data)
File "multi_gpu_fintune_belle.py", line 121, in collate_fn
context_length = obj['prompt'].index(150004)
ValueError: 150004 is not in list
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2408521) of binary: /tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/bin/python
Traceback (most recent call last):
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/bin/torchrun", line 8, in
sys.exit(main())
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
multi_gpu_fintune_belle.py FAILED


Failures:
[1]:
time : 2023-04-12_07:04:52
host : localhost
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 2408522)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2023-04-12_07:04:52
host : localhost
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 2408521)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html


dsh54054 commented Apr 12, 2023

Has anyone else run into this ValueError? I checked obj["prompt"] and its last value is 130004, so I changed context_length = obj['prompt'].index(150004) in the code to context_length = obj['prompt'].index(130004). That got past the ValueError, but now a different error is raised. Does anyone know how to fix it?
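A more robust way to locate the context boundary than editing the hard-coded ID (a sketch, not the repo's actual code; the helper name `find_context_length` and the try-both-candidates fallback are my own assumptions) is to accept both the old and the new special-token IDs seen in this thread:

```python
def find_context_length(prompt_ids, boundary_ids=(150004, 130004)):
    """Return the index of the first boundary token present in prompt_ids.

    The special-token ID changed from 150004 to 130004 after the upstream
    ChatGLM tokenizer update, so try both candidates instead of hard-coding
    a single value.
    """
    for token_id in boundary_ids:
        if token_id in prompt_ids:
            return prompt_ids.index(token_id)
    raise ValueError(f"none of {boundary_ids} found in prompt")
```

With data tokenized by either tokenizer version, the collate function then finds the boundary without code changes.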

Traceback (most recent call last):
File "multi_gpu_fintune_belle.py", line 361, in
main()
File "multi_gpu_fintune_belle.py", line 313, in main
os.makedirs(path)
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/os.py", line 223, in makedirs
mkdir(name, mode)
FileExistsError: [Errno 17] File exists: 'output/checkpoint_0'
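The FileExistsError above is a separate issue: under a multi-process launch, every rank tries to create the same checkpoint directory, and the second one to arrive crashes. A minimal fix, assuming `path` is the checkpoint directory from the traceback, is `os.makedirs(..., exist_ok=True)`:

```python
import os
import tempfile

# Two ranks race to create the same checkpoint directory; exist_ok=True
# makes the second call a no-op instead of raising FileExistsError.
path = os.path.join(tempfile.mkdtemp(), "checkpoint_0")
os.makedirs(path, exist_ok=True)  # rank 0 creates the directory
os.makedirs(path, exist_ok=True)  # rank 1: no error, directory already exists
```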
Traceback (most recent call last):
File "multi_gpu_fintune_belle.py", line 361, in
main()
File "multi_gpu_fintune_belle.py", line 317, in main
outputs = model(**batch)
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1846, in forward
loss = self.module(*inputs, **kwargs)
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/site-packages/peft/peft_model.py", line 663, in forward
return self.base_model(
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/tal-vePFS/SFT/dengshuhao1/workspace/Chatglm_lora_multi-gpu/modeling_chatglm.py", line 1043, in forward
transformer_outputs = self.transformer(
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/tal-vePFS/SFT/dengshuhao1/workspace/Chatglm_lora_multi-gpu/modeling_chatglm.py", line 860, in forward
inputs_embeds = self.word_embeddings(input_ids)
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 162, in forward
return F.embedding(
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/site-packages/torch/nn/functional.py", line 2210, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: CUDA error: device-side assert triggered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
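A device-side assert from F.embedding is typically an out-of-range index: a token ID at or above the embedding table's num_embeddings, e.g. an ID from the old 150k vocabulary fed into the new, smaller model. A quick host-side check before training surfaces the offending IDs with a readable message (a sketch; the function name and the vocabulary size used below are illustrative placeholders, not values from the repo):

```python
def check_token_range(input_ids, vocab_size):
    """Fail fast on the CPU with a readable error instead of an opaque
    device-side assert from torch.embedding on the GPU."""
    bad = [t for t in input_ids if t < 0 or t >= vocab_size]
    if bad:
        raise ValueError(
            f"token ids out of range [0, {vocab_size}): {sorted(set(bad))}"
        )
```

For example, checking a batch that still contains the old EOS token 150005 against a ~130k vocabulary would raise immediately and name the bad ID, instead of aborting mid-forward with SIGABRT.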

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2409803) of binary: /tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/bin/python
Traceback (most recent call last):
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/bin/torchrun", line 8, in
sys.exit(main())
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

multi_gpu_fintune_belle.py FAILED

Failures:
[1]:
time : 2023-04-12_07:12:36
host : localhost
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 2409804)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 2409804

Root Cause (first observed failure):
[0]:
time : 2023-04-12_07:12:36
host : localhost
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 2409803)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html


dsh54054 commented Apr 12, 2023

I read through the earlier issues and pulled the latest code, but I still seem to hit this ValueError: 150004 is not in list. Has anyone run into it? Could someone please help? @liangwq


liangwq commented Apr 12, 2023


Tsinghua updated their model and tokenizer. You can go into the model directory and look at how the latest tokenizer config JSON file is written.
Try replacing that file with our own tokenizer Python file. Alternatively, simply follow Tsinghua's approach: delete the old model files and load the model and tokenizer with the AutoModel/AutoTokenizer methods.

@hrdxwandg
Could you explain what that means in more detail?

liangwq commented Apr 14, 2023

[screenshot: the updated files in the upstream ChatGLM model repository]
Tsinghua modified these files; the vocabulary went from 150001 to 130001.
The model file was also changed, so the simplest approach is to follow Tsinghua's lead and load the model and vocabulary with the AutoModel/AutoTokenizer methods. Behind the scenes, transformers will automatically swap in the new model and tokenizer implementation classes for you. Delete the model parameter files you downloaded previously.
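The loading path described above looks roughly like this (a sketch: the `THUDM/chatglm-6b` hub name, network access, and a CUDA device are assumed; `trust_remote_code=True` is needed because ChatGLM ships its own model and tokenizer classes as remote code):

```python
def load_chatglm(model_name="THUDM/chatglm-6b"):
    """Load the current tokenizer and model classes from the hub instead of
    a stale local copy whose vocabulary no longer matches the weights."""
    # Deferred import: transformers is only needed when actually loading.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    model = AutoModel.from_pretrained(model_name, trust_remote_code=True).half().cuda()
    return tokenizer, model
```

Deleting the previously cached model files first ensures the hub download, rather than an out-of-date local copy, is what actually gets loaded.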

@algorithmconquer
@dsh54054 Have you solved this problem? I'm running into the same issue.

@dsh54054 (Author)
@algorithmconquer Changing the variable EOS_ID = 150005 in multi_gpu_fintune_belle.py to EOS_ID = 130005 fixed it.
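Rather than hard-coding either value, EOS_ID can also be derived from whichever tokenizer is actually loaded, so the script survives future vocabulary changes (a sketch; `eos_token_id` is the standard transformers tokenizer attribute, and the helper name and fallback value are my own):

```python
def resolve_eos_id(tokenizer, fallback=130005):
    """Prefer the tokenizer's own EOS ID; use the known ChatGLM-6B value
    only when the attribute is missing or unset."""
    eos_id = getattr(tokenizer, "eos_token_id", None)
    return eos_id if eos_id is not None else fallback
```

Then `EOS_ID = resolve_eos_id(tokenizer)` stays correct whether the old or the updated tokenizer is in use.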

@algorithmconquer
@dsh54054 Thanks a lot!
