
ValueError: 150004 is not in list #22

Open
dsh54054 opened this issue Apr 12, 2023 · 8 comments

dsh54054 commented Apr 12, 2023

obj: {'prompt': [12313, 107, 125, 6054, 109, 3384, 104, 1833, 6, 11311, 110, 125, 938, 109, 986, 583, 1303, 7, 11121, 104, 532, 109, 12475, 29321, 100, 1029, 7, 4, 4, 125875, 14150, 12, 4, 64298, 66977, 100326, 69122, 76809, 65324, 65459, 85929, 63823, 43, 151, 4, 43, 151, 69106, 12, 64176, 6, 64219, 6, 68651, 63823, 43, 151, 4, 4, 125875, 11034, 12, 130001, 130004], 'completion': [28, 64872, 64219, 68281, 69902, 63984, 6, 63984, 64548, 68651, 63962, 6, 64872, 81715, 69754, 68840, 78, 150005]}
0%| | 0/134265 [00:00<?, ?it/s]
obj: {'prompt': [12313, 107, 125, 6054, 109, 3384, 104, 1833, 6, 11311, 110, 125, 938, 109, 986, 583, 1303, 7, 11121, 104, 532, 109, 12475, 29321, 100, 1029, 7, 4, 4, 125875, 14150, 12, 4, 67769, 65056, 68395, 100087, 63823, 4, 4987, 6, 225, 118, 120, 31, 4, 4, 125875, 11034, 12, 130001, 130004], 'completion': [5, 74874, 6, 63852, 66348, 31, 150005]}
0%| | 0/134265 [00:00<?, ?it/s]
Traceback (most recent call last):
File "multi_gpu_fintune_belle.py", line 361, in
main()
File "multi_gpu_fintune_belle.py", line 309, in main
for step, batch in enumerate(t:=tqdm.tqdm(train_dataloader)):
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/site-packages/tqdm/std.py", line 1178, in __iter__
for obj in iterable:
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/site-packages/accelerate/data_loader.py", line 378, in __iter__
current_batch = next(dataloader_iter)
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 634, in __next__
data = self._next_data()
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 678, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
return self.collate_fn(data)
File "multi_gpu_fintune_belle.py", line 121, in collate_fn
context_length = obj['prompt'].index(150004)
ValueError: 150004 is not in list
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2408521) of binary: /tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/bin/python
Traceback (most recent call last):
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/bin/torchrun", line 8, in
sys.exit(main())
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
multi_gpu_fintune_belle.py FAILED


Failures:
[1]:
time : 2023-04-12_07:04:52
host : localhost
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 2408522)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2023-04-12_07:04:52
host : localhost
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 2408521)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html


dsh54054 commented Apr 12, 2023

Has anyone else run into this ValueError? I checked obj["prompt"] and its last value is 130004, so I changed context_length = obj['prompt'].index(150004) in the code to context_length = obj['prompt'].index(130004). That got past the ValueError, but now a different error is raised. Does anyone know how to fix it?
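A more robust way to locate the context boundary than editing the hard-coded ID (a sketch, not the repo's actual code; the helper name `find_context_length` and the try-both-candidates fallback are my own assumptions) is to accept both the old and the new special-token IDs seen in this thread:

```python
def find_context_length(prompt_ids, boundary_ids=(150004, 130004)):
    """Return the index of the first boundary token present in prompt_ids.

    The special-token ID changed from 150004 to 130004 after the upstream
    ChatGLM tokenizer update, so try both candidates instead of hard-coding
    a single value.
    """
    for token_id in boundary_ids:
        if token_id in prompt_ids:
            return prompt_ids.index(token_id)
    raise ValueError(f"none of {boundary_ids} found in prompt")
```

With data tokenized by either tokenizer version, the collate function then finds the boundary without code changes.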

Traceback (most recent call last):
File "multi_gpu_fintune_belle.py", line 361, in
main()
File "multi_gpu_fintune_belle.py", line 313, in main
os.makedirs(path)
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/os.py", line 223, in makedirs
mkdir(name, mode)
FileExistsError: [Errno 17] File exists: 'output/checkpoint_0'
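The FileExistsError above is a separate issue: under a multi-process launch, every rank tries to create the same checkpoint directory, and the second one to arrive crashes. A minimal fix, assuming `path` is the checkpoint directory from the traceback, is `os.makedirs(..., exist_ok=True)`:

```python
import os
import tempfile

# Two ranks race to create the same checkpoint directory; exist_ok=True
# makes the second call a no-op instead of raising FileExistsError.
path = os.path.join(tempfile.mkdtemp(), "checkpoint_0")
os.makedirs(path, exist_ok=True)  # rank 0 creates the directory
os.makedirs(path, exist_ok=True)  # rank 1: no error, directory already exists
```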
Traceback (most recent call last):
File "multi_gpu_fintune_belle.py", line 361, in
main()
File "multi_gpu_fintune_belle.py", line 317, in main
outputs = model(**batch)
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1846, in forward
loss = self.module(*inputs, **kwargs)
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/site-packages/peft/peft_model.py", line 663, in forward
return self.base_model(
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/tal-vePFS/SFT/dengshuhao1/workspace/Chatglm_lora_multi-gpu/modeling_chatglm.py", line 1043, in forward
transformer_outputs = self.transformer(
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/tal-vePFS/SFT/dengshuhao1/workspace/Chatglm_lora_multi-gpu/modeling_chatglm.py", line 860, in forward
inputs_embeds = self.word_embeddings(input_ids)
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 162, in forward
return F.embedding(
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/site-packages/torch/nn/functional.py", line 2210, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: CUDA error: device-side assert triggered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
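A device-side assert from F.embedding is typically an out-of-range index: a token ID at or above the embedding table's num_embeddings, e.g. an ID from the old 150k vocabulary fed into the new, smaller model. A quick host-side check before training surfaces the offending IDs with a readable message (a sketch; the function name and the vocabulary size used below are illustrative placeholders, not values from the repo):

```python
def check_token_range(input_ids, vocab_size):
    """Fail fast on the CPU with a readable error instead of an opaque
    device-side assert from torch.embedding on the GPU."""
    bad = [t for t in input_ids if t < 0 or t >= vocab_size]
    if bad:
        raise ValueError(
            f"token ids out of range [0, {vocab_size}): {sorted(set(bad))}"
        )
```

For example, checking a batch that still contains the old EOS token 150005 against a ~130k vocabulary would raise immediately and name the bad ID, instead of aborting mid-forward with SIGABRT.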

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2409803) of binary: /tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/bin/python
Traceback (most recent call last):
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/bin/torchrun", line 8, in
sys.exit(main())
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/tal-vePFS/SFT/dengshuhao1/local/anaconda3/envs/Chatglm_lora_multi-gpu/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

multi_gpu_fintune_belle.py FAILED

Failures:
[1]:
time : 2023-04-12_07:12:36
host : localhost
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 2409804)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 2409804

Root Cause (first observed failure):
[0]:
time : 2023-04-12_07:12:36
host : localhost
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 2409803)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html


dsh54054 commented Apr 12, 2023

I read through the earlier issues and pulled the latest code, but I still seem to hit this ValueError: 150004 is not in list. Has anyone run into it? Could someone please help? @liangwq


liangwq commented Apr 12, 2023


Tsinghua updated their model and tokenizer. You can go into the model directory and look at how the latest tokenizer config JSON file is written.
Try replacing that file with our own tokenizer Python file. Alternatively, simply follow Tsinghua's approach: delete the old model files and load the model and tokenizer with the AutoModel/AutoTokenizer methods.

@hrdxwandg
Could you explain what that means in more detail?

liangwq commented Apr 14, 2023

[screenshot: the updated files in the upstream ChatGLM model repository]
Tsinghua modified these files; the vocabulary went from 150001 to 130001.
The model file was also changed, so the simplest approach is to follow Tsinghua's lead and load the model and vocabulary with the AutoModel/AutoTokenizer methods. Behind the scenes, transformers will automatically swap in the new model and tokenizer implementation classes for you. Delete the model parameter files you downloaded previously.
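The loading path described above looks roughly like this (a sketch: the `THUDM/chatglm-6b` hub name, network access, and a CUDA device are assumed; `trust_remote_code=True` is needed because ChatGLM ships its own model and tokenizer classes as remote code):

```python
def load_chatglm(model_name="THUDM/chatglm-6b"):
    """Load the current tokenizer and model classes from the hub instead of
    a stale local copy whose vocabulary no longer matches the weights."""
    # Deferred import: transformers is only needed when actually loading.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    model = AutoModel.from_pretrained(model_name, trust_remote_code=True).half().cuda()
    return tokenizer, model
```

Deleting the previously cached model files first ensures the hub download, rather than an out-of-date local copy, is what actually gets loaded.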

@algorithmconquer
@dsh54054 Have you solved this problem? I'm running into the same issue.

@dsh54054 (Author)
@algorithmconquer Changing the variable EOS_ID = 150005 in multi_gpu_fintune_belle.py to EOS_ID = 130005 fixed it.
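Rather than hard-coding either value, EOS_ID can also be derived from whichever tokenizer is actually loaded, so the script survives future vocabulary changes (a sketch; `eos_token_id` is the standard transformers tokenizer attribute, and the helper name and fallback value are my own):

```python
def resolve_eos_id(tokenizer, fallback=130005):
    """Prefer the tokenizer's own EOS ID; use the known ChatGLM-6B value
    only when the attribute is missing or unset."""
    eos_id = getattr(tokenizer, "eos_token_id", None)
    return eos_id if eos_id is not None else fallback
```

Then `EOS_ID = resolve_eos_id(tokenizer)` stays correct whether the old or the updated tokenizer is in use.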

@algorithmconquer
@dsh54054 Thanks a lot!
