This message appeared partway through training. How do I fix it? #120

Open
stadatoin opened this issue Jul 8, 2023 · 17 comments

Comments

@stadatoin

./finetune/lora/train.py", line 347, in
model.load_state_dict(load_dict, strict=(not args.lora))
NameError: name 'load_dict' is not defined

@josStorer
Owner

Are you using a custom model directory? If not, please wait for the fix in the next release.

@stadatoin
Author

The only thing I changed was pointing the training data path at the txt file I intend to train on. I made no other changes.

@josStorer
Owner

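Then please wait for the fix in the next release and check whether the problem persists. The update should be out around tomorrow.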

@josStorer
Owner

https://github.com/josStorer/RWKV-Runner/releases/tag/v1.3.6

@win10ogod

win10ogod commented Jul 10, 2023

https://github.com/josStorer/RWKV-Runner/releases/tag/v1.3.6

Could a future version support using data from Hugging Face datasets directly?

@josStorer
Owner

Hugging Face datasets come in many formats, so I'd recommend writing your own conversion script instead. It isn't difficult.
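
A conversion script of the kind suggested here could look like the following minimal sketch. It assumes the trainer accepts JSONL with one object per line holding the sample under a `text` key (check this against the example linked from the training page's Help button), and that each dataset row is a dict. The function name `rows_to_jsonl` and the `instruction`/`output` column names are hypothetical.

```python
import json
import os
import tempfile
from typing import Callable, Iterable, Mapping


def rows_to_jsonl(rows: Iterable[Mapping], out_path: str,
                  to_text: Callable[[Mapping], str]) -> int:
    """Write one {"text": ...} object per line; return the number of lines written."""
    written = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for row in rows:
            text = to_text(row).strip()
            if not text:
                continue  # skip rows that produce no text
            # ensure_ascii=False keeps non-ASCII text readable in the output file
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
            written += 1
    return written


# Hypothetical rows with "instruction"/"output" columns (an assumed layout).
rows = [
    {"instruction": "Say hello", "output": "Hello!"},
    {"instruction": "", "output": ""},
]
out_path = os.path.join(tempfile.mkdtemp(), "train.jsonl")
count = rows_to_jsonl(rows, out_path,
                      lambda r: f"{r['instruction']}\n{r['output']}")
```

Rows loaded with the Hugging Face `datasets` library iterate as dicts, so they could be passed in as `rows`. Only the `to_text` function should need adapting per dataset.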

@win10ogod

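Hugging Face datasets come in many formats, so I'd recommend writing your own conversion script instead. It isn't difficult.

Do you have a script for converting JSON to JSONL?
The output of the script I wrote can't be used for training!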

@josStorer
Owner

Click the Help button on the training page. It contains a link that takes you to a JSONL example.

@xq2hz

xq2hz commented Jul 10, 2023

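Click the Help button on the training page. It contains a link that takes you to a JSONL example.
It seems I can also just write the data into a txt file in that format and change the extension to jsonl. Is that right?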

@josStorer
Owner

@xq2hz Yes.
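
For the JSON-to-JSONL question earlier in the thread, here is a minimal sketch rather than an official tool. It assumes the input is a JSON array of objects that already carry a `text` field, and that the trainer expects one JSON object per line with that key (compare with the example linked from the Help button). The names `json_to_jsonl` and `check_jsonl` are hypothetical.

```python
import json
import os
import tempfile


def json_to_jsonl(src_path, dst_path):
    """Convert a JSON array file into JSONL: one compact object per line."""
    with open(src_path, encoding="utf-8") as f:
        records = json.load(f)
    with open(dst_path, "w", encoding="utf-8") as f:
        for record in records:
            # json.dumps escapes newlines inside strings, so each record stays on one line
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def check_jsonl(path, key="text"):
    """Return the numbers of lines that are not JSON objects containing `key`."""
    bad = []
    with open(path, encoding="utf-8") as f:
        for number, line in enumerate(f, start=1):
            try:
                obj = json.loads(line)
            except json.JSONDecodeError:
                bad.append(number)
                continue
            if not isinstance(obj, dict) or key not in obj:
                bad.append(number)
    return bad


workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "data.json")
dst = os.path.join(workdir, "data.jsonl")
with open(src, "w", encoding="utf-8") as f:
    json.dump([{"text": "第一章\n內容"}, {"text": "second sample"}], f, ensure_ascii=False)

json_to_jsonl(src, dst)
bad_lines = check_jsonl(dst)  # an empty list means every line parsed and has "text"
```

Running `check_jsonl` on a file produced by another script may help locate why it is rejected. Pretty-printed JSON that spreads one object over several lines, for instance, fails this check.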

@josStorer
Owner

@stadatoin
Author

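I trained on a novel file of a little over 10 MB. After running for 8 hours the loss has come down, but it is still around 8.

Is this a problem with my hardware?

My discrete GPU is a 1650 with 4 GB of memory.

I'm using RWKV-4-World-7B-v1-20230626-ctx4096.pth.

I haven't changed the training parameters.

Or do I need to change the format of the txt content, or train on less data?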
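
To put a loss of about 8 in context, the arithmetic can be sketched as follows. This assumes the reported loss is the mean per-token cross-entropy in nats, and that the World tokenizer has a vocabulary of 65536 entries. Both are assumptions, not details confirmed in this thread.

```python
import math

VOCAB_SIZE = 65536  # assumed vocabulary size, for illustration only


def perplexity(loss):
    """Perplexity corresponding to a mean cross-entropy loss given in nats."""
    return math.exp(loss)


def uniform_loss(vocab_size):
    """Loss of a model that assigns equal probability to every token."""
    return math.log(vocab_size)


print(f"uniform-guess loss: {uniform_loss(VOCAB_SIZE):.2f}")
print(f"perplexity at loss 8: {perplexity(8.0):.0f}")
```

Under these assumptions, a loss near 8 is below the uniform-guess baseline but still corresponds to a very high perplexity.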

@josStorer
Owner

@stadatoin Increase the micro batch size (try 8), and try raising LoRA R to 32.
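
As a rough illustration of what raising LoRA R changes: LoRA adds two low-rank matrices to each adapted weight, so the number of trainable parameters grows linearly with R. The sketch below assumes a square 4096 x 4096 weight purely for illustration. `lora_extra_params` is a hypothetical helper, not part of the project.

```python
def lora_extra_params(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_out x d_in weight.

    The update is B @ A with A of shape (r, d_in) and B of shape (d_out, r).
    """
    return r * (d_in + d_out)


N_EMBD = 4096  # assumed hidden size, for illustration only

for r in (8, 32):
    print(r, lora_extra_params(N_EMBD, N_EMBD, r))
# prints:
# 8 65536
# 32 262144
```

A larger R gives the adapter more capacity at the cost of more trainable parameters, and a larger micro batch size also increases memory use.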

@stadatoin
Author

OK, I'll give it a try.

@stadatoin
Author

@josStorer

It reports that there isn't enough VRAM. Should I switch to the 0.1B model to test?

@stadatoin
Author

Sorry, I hadn't noticed that I had changed the precision. Training works now.
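
Why the precision setting matters for memory can be sketched with simple arithmetic. The parameter count below is a rough figure taken from the "7B" in the model name, and the estimate covers the weights alone (no gradients, optimizer state, or activations). The thread does not say which precision values were involved, so the names here are illustrative.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2}


def weight_memory_gib(n_params, precision):
    """Memory needed to hold the weights alone, in GiB."""
    return n_params * BYTES_PER_PARAM[precision] / 1024**3


N_PARAMS = 7_000_000_000  # rough assumption based on the model name

for precision in ("fp32", "fp16"):
    print(precision, round(weight_memory_gib(N_PARAMS, precision), 1), "GiB")
```

Switching from a 16-bit to a 32-bit format doubles this figure, which is one way an accidental precision change can lead to an out-of-memory error.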

@stadatoin
Author

terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: the launch timed out and was terminated
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:31 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f11e9f8a457 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f11e9f543ec in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(std::string const&, std::string const&, int, bool) + 0xb4 (0x7f11f3333c64 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x1e0dc (0x7f11f330b0dc in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #4: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x244 (0x7f11f330e054 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #5: + 0x4d7d63 (0x7f11d3fb7d63 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #6: c10::TensorImpl::copy_tensor_metadata_except_version_counter(c10::TensorImpl const*, c10::TensorImpl*, bool) + 0x41 (0x7f11e9f6c1a1 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::copy_tensor_metadata(c10::TensorImpl const*, c10::TensorImpl*, c10::VariableVersion const&, bool) + 0x14 (0x7f11e9f6c214 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #8: + 0x4404d (0x7f11e9f7604d in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #9: + 0x489ab33 (0x7f11acee6b33 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #10: THPVariable_set_data(THPVariable*, _object*, void*) + 0x6f (0x7f11d420c30f in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)

frame #43: + 0x29d90 (0x7f11f3c2fd90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #44: __libc_start_main + 0x80 (0x7f11f3c2fe40 in /lib/x86_64-linux-gnu/libc.so.6)
./finetune/install-wsl-dep-and-train.sh: line 52: 429 Aborted python3 ./finetune/lora/train.py $modelInfo $@ --proj_dir lora-models --data_type binidx --lora --lora_parts=att,ffn,time,ln --strategy deepspeed_stage_2 --accelerator gpu

After this message appeared, it stopped responding again.
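
The error output above suggests passing `CUDA_LAUNCH_BLOCKING=1` for debugging. A minimal sketch of preparing such an environment from Python follows. `debug_env` is a hypothetical helper, and the setting is not a fix for the timeout: it only makes kernel launches synchronous so that the reported stack trace points closer to the failing call.

```python
import os


def debug_env(base=None):
    """Return a copy of the environment with synchronous CUDA kernel launches enabled."""
    env = dict(os.environ if base is None else base)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


env = debug_env()
# Hypothetical usage: pass `env` to the same training command that was run before,
# for example through subprocess.run(command, env=env).
```

Expect training to run more slowly with this setting, so it is best used only while reproducing the error.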
