Training stops partway through with this message. How do I fix it? #120

stadatoin · 2023-07-08T12:30:01Z

./finetune/lora/train.py", line 347, in
model.load_state_dict(load_dict, strict=(not args.lora))
NameError: name 'load_dict' is not defined

josStorer · 2023-07-08T12:30:44Z

Are you using a custom model directory? If not, please wait for the fix in the next release.

stadatoin · 2023-07-08T12:53:57Z

Apart from pointing the training path at the txt file I plan to train on, I haven't changed anything else.

josStorer · 2023-07-08T12:54:42Z

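Then please wait for the next release and check whether the problem is still there. The update should be out around tomorrow.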

josStorer · 2023-07-09T06:08:17Z

https://github.com/josStorer/RWKV-Runner/releases/tag/v1.3.6

win10ogod · 2023-07-10T03:23:57Z

Could a future version support using Hugging Face datasets directly?

josStorer · 2023-07-10T03:29:09Z

Hugging Face datasets come in many formats, so I'd recommend writing your own conversion script instead. It isn't difficult.

win10ogod · 2023-07-10T06:47:38Z

Do you have a JSON-to-JSONL conversion script?
The output of the script I wrote can't be used for training.

josStorer · 2023-07-10T06:52:36Z

Click the help button on the training page. It contains a link that takes you to a JSONL example.

xq2hz · 2023-07-10T12:30:56Z

If I write the data into a txt file following that format and then change the extension to jsonl, that seems to work too. Is that right?

josStorer · 2023-07-10T12:34:03Z

@xq2hz Yes.

josStorer · 2023-07-10T12:34:42Z

@win10ogod See my implementation here: https://github.com/josStorer/RWKV-Runner/blob/master/backend-golang/rwkv.go#L59
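
For readers with the same question, the JSON-to-JSONL conversion discussed above might look roughly like the sketch below. It rests on two assumptions that are not confirmed in this thread: that the input file is a single JSON array of objects, and that each output line should be an object with one `"text"` field. The function name `json_to_jsonl` and the `text_key` parameter are hypothetical; compare the result against the JSONL example linked from the help button before training.

```python
import json


def json_to_jsonl(src_path, dst_path, text_key="text"):
    """Convert a JSON array file into one-object-per-line JSONL.

    Assumption: every record is a dict holding the training text under
    `text_key`. Records of any other type are converted with str().
    Returns the number of lines written.
    """
    with open(src_path, encoding="utf-8") as f:
        records = json.load(f)

    count = 0
    with open(dst_path, "w", encoding="utf-8") as out:
        for rec in records:
            text = rec[text_key] if isinstance(rec, dict) else str(rec)
            # json.dumps escapes embedded newlines, so each record stays
            # on exactly one physical line; ensure_ascii=False keeps
            # Chinese text readable instead of \uXXXX escapes.
            out.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
            count += 1
    return count
```

A common reason a hand-written JSONL file gets rejected is that a record spans several physical lines or the file is one large JSON array. Serializing each record separately with `json.dumps` avoids both.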

stadatoin · 2023-07-10T14:00:03Z

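I put a novel file of a bit over 10 MB in for training. After running for 8 hours the loss has gone down, but it is still around 8.

Is this a problem with my hardware?

My discrete GPU is a 1650 with 4 GB of memory.

I'm using RWKV-4-World-7B-v1-20230626-ctx4096.pth

I did not change the training parameters.

Or do I need to change the format of the txt content, or train on less data?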

josStorer · 2023-07-10T14:03:04Z

@stadatoin Increase the micro batch size (try 8), and try raising LoRA R to 32.
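
To give a rough sense of what raising LoRA R changes: for a weight matrix of shape `d_out x d_in`, LoRA trains two low-rank factors of shapes `d_out x r` and `r x d_in`, so the trainable parameters added per adapted matrix come to `r * (d_in + d_out)`. The sketch below only does that arithmetic. The width 4096 and the starting rank 8 are illustrative assumptions, not values taken from this thread or from the trainer's defaults.

```python
def lora_extra_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    # Factor A has shape (rank, d_in); factor B has shape (d_out, rank).
    return rank * (d_in + d_out)


# Illustrative numbers only: a square 4096 x 4096 projection.
width = 4096
print(lora_extra_params(width, width, 8))   # 65536
print(lora_extra_params(width, width, 32))  # 262144
print(width * width)                        # 16777216 (full matrix)
```

The adapter grows linearly with the rank, so moving from 8 to 32 quadruples the trainable parameters per adapted matrix while staying far below the size of the full matrix. A larger micro batch size also raises memory use, which matters on a 4 GB card.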

stadatoin · 2023-07-10T14:03:43Z

OK, I'll give it a try.

stadatoin · 2023-07-10T14:17:55Z

@josStorer

It reports that there isn't enough VRAM. Should I switch to the 0.1B model to test?

stadatoin · 2023-07-10T14:40:36Z

Sorry, I hadn't noticed that I had changed the precision setting. It can train now.

stadatoin · 2023-07-11T00:41:03Z

terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: the launch timed out and was terminated
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:31 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f11e9f8a457 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f11e9f543ec in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(std::string const&, std::string const&, int, bool) + 0xb4 (0x7f11f3333c64 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1e0dc (0x7f11f330b0dc in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #4: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x244 (0x7f11f330e054 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x4d7d63 (0x7f11d3fb7d63 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #6: c10::TensorImpl::copy_tensor_metadata_except_version_counter(c10::TensorImpl const*, c10::TensorImpl*, bool) + 0x41 (0x7f11e9f6c1a1 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::copy_tensor_metadata(c10::TensorImpl const*, c10::TensorImpl*, c10::VariableVersion const&, bool) + 0x14 (0x7f11e9f6c214 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #8: <unknown function> + 0x4404d (0x7f11e9f7604d in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #9: <unknown function> + 0x489ab33 (0x7f11acee6b33 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #10: THPVariable_set_data(THPVariable*, _object*, void*) + 0x6f (0x7f11d420c30f in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)

frame #43: <unknown function> + 0x29d90 (0x7f11f3c2fd90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #44: __libc_start_main + 0x80 (0x7f11f3c2fe40 in /lib/x86_64-linux-gnu/libc.so.6)
./finetune/install-wsl-dep-and-train.sh: line 52: 429 Aborted python3 ./finetune/lora/train.py $modelInfo $@ --proj_dir lora-models --data_type binidx --lora --lora_parts=att,ffn,time,ln --strategy deepspeed_stage_2 --accelerator gpu

After this message appeared, it stopped responding again.
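
The error text above suggests passing `CUDA_LAUNCH_BLOCKING=1` for debugging. One way to do that from Python is sketched below, assuming the training script is started as a subprocess. The helper name `env_with_blocking_launches` is hypothetical. The variable has to be in the environment before the process initializes CUDA, which is why it is set on the child's environment rather than changed midway through a run.

```python
import os


def env_with_blocking_launches(base_env=None):
    """Return a copy of the environment with CUDA_LAUNCH_BLOCKING=1 set.

    With this variable set, CUDA kernel launches run synchronously, so an
    error is reported at the call that caused it rather than at some
    later, unrelated API call.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


# Hypothetical usage (the argument list is a placeholder, not the real
# command line used by the training script):
#   import subprocess
#   subprocess.run(["python3", "./finetune/lora/train.py"],
#                  env=env_with_blocking_launches(), check=True)
```

Synchronous launches slow training down noticeably, so this is only worth enabling while tracking down the failing call.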
