No CUDA GPU detected!, diagnostics: cudaError_t 3 : "initialization error" #4076
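Ok, I tried to run --use-gpu=optional and that's an invalid choice for this model, I guess.
train.py: error: argument --use-gpu: invalid choice: 'optional' (choose from 'true', 'false', 'yes', 'no', 'wait')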
|
Sorry, I believe there was never an 'optional' choice.
y.
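The rejection of 'optional' looks like ordinary argparse behaviour for an option with a fixed set of choices. Here is a minimal sketch that reproduces it, assuming the option is declared with `choices`. The choice list is copied from the error message in this thread; `build_parser` and `is_valid_use_gpu` are hypothetical names, and the real train.py may define the option differently.

```python
import argparse

# Choices copied from the error message quoted in this thread.
USE_GPU_CHOICES = ["true", "false", "yes", "no", "wait"]


def build_parser():
    """Build a stand-in parser with a --use-gpu option limited to fixed choices."""
    parser = argparse.ArgumentParser(prog="train.py")
    parser.add_argument("--use-gpu", type=str, choices=USE_GPU_CHOICES)
    return parser


def is_valid_use_gpu(value):
    """Return True if argparse accepts the value, False if it rejects it."""
    try:
        build_parser().parse_args(["--use-gpu", value])
    except SystemExit:
        # argparse prints the "invalid choice" message and exits on bad values.
        return False
    return True
```

With this sketch, `is_valid_use_gpu("wait")` is accepted while `is_valid_use_gpu("optional")` is rejected with an "invalid choice" message.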
…On Mon, May 18, 2020 at 5:09 PM Andre Natal ***@***.***> wrote:
Ok, I tried to run --use-gpu=optional and that's an invalid choice for
this model I guess.
train.py: error: argument --use-gpu: invalid choice: 'optional' (choose from 'true', 'false',
'yes', 'no', 'wait')
|
I would assume it's because of memory: the feed-forward network might need much less memory.
Set exclusive mode on your GPUs and set --use-gpu='wait'.
y.
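For reference, the exclusive-mode suggestion can be sketched as a small helper that only builds the nvidia-smi commands, one per GPU, so they can be reviewed before being run. This is a sketch, not a tested setup script: `exclusive_mode_commands` is a hypothetical name, the four GPU indices match the machine described in this issue, and changing the compute mode typically requires root.

```python
import shlex


def exclusive_mode_commands(gpu_ids):
    """Build one nvidia-smi command per GPU that requests EXCLUSIVE_PROCESS mode."""
    return [
        ["nvidia-smi", "-i", str(gpu_id), "-c", "EXCLUSIVE_PROCESS"]
        for gpu_id in gpu_ids
    ]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for cmd in exclusive_mode_commands(range(4)):
        print("sudo " + shlex.join(cmd))
    # The training script is then started with --use-gpu=wait.
```

In exclusive mode each GPU accepts only one process at a time, so with --use-gpu=wait a job should wait for a free GPU rather than share one.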
…On Mon, May 18, 2020 at 5:11 PM Jan Trmal ***@***.***> wrote:
sorry, there was never 'optional', I believe.
y.
|
Hi @jtrmal, I had already set both.
|
Then try to track whether it's always the same GPU or a different GPU every time.
It might indeed be either a hardware issue or a power issue.
y.
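A first step toward tracking this is to find out which training jobs hit the error. The sketch below scans a log directory for the error text reported in this issue and lists the affected log files. The function name, the directory layout, and the `*.log` pattern are assumptions. The error line itself does not name a device, so mapping a failed job to a physical GPU still has to be done separately, for example by noting which GPU each job held when it failed.

```python
from pathlib import Path

# Error text as reported in this issue.
ERROR_TEXT = 'cudaError_t 3 : "initialization error"'


def logs_with_gpu_init_error(log_dir, pattern="*.log"):
    """Return the sorted names of log files in log_dir that contain ERROR_TEXT."""
    failed = []
    for path in sorted(Path(log_dir).glob(pattern)):
        # errors="replace" keeps the scan going if a log has odd bytes.
        text = path.read_text(errors="replace")
        if ERROR_TEXT in text:
            failed.append(path.name)
    return failed
```

Running this after each crash and comparing the lists over time shows whether failures cluster on particular jobs or are spread evenly.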
…On Mon, May 18, 2020 at 5:14 PM Andre Natal ***@***.***> wrote:
Hi @jtrmal <https://github.com/jtrmal> I had set both already
|
I believe I had a driver issue: after I rebuilt the OS, CUDA, and Kaldi installations, the training ran to completion. Thanks, @jtrmal!
|
Hello everyone,
I started to get intermittent
No CUDA GPU detected!, diagnostics: cudaError_t 3 : "initialization error"
errors when training ASPIRE's tdnn recipe (local/chain/tuning/run_tdnn_lstm_1a.sh) after successfully training local/chain/run_tdnn.sh with no errors.
I have a local setup with four RTX 2080 Ti GPUs running CUDA 10.2, and the crash usually occurred when 8 to 12 jobs were running. I then reduced the minimum and maximum number of jobs to match the four GPUs, and the crash occurred earlier than before.
Has anyone seen this issue before? I'm asking before I start to consider a physical problem, since this server suffered an accidental shutdown a couple of weeks ago, in the middle of a training run that was already at a late stage and had shown no failures.
Maybe if I use --use-gpu=optional, some jobs would be sent to the CPU if SelectGpuId fails for some reason, and the training would then be mixed between CPUs and GPUs without further consequences for the final model?
Thanks in advance.
Andre