JSONDecodeError in JournalStorage #5138
Hello,

Same here (100 workers in parallel, 15 failed with this error).
Thank you for your reports. I investigated this bug and found that it is caused by incorrect log-reading logic. We plan to fix it by the next minor release.
#5195 fixed this error. It will be included in the next minor release. Thank you for your reports and contributions!
Hi, I have upgraded to the latest version of Optuna, 4.0.0.dev, but still face the same JSONDecodeError. Here is my setup: I attempted to use JournalFileStorage for distributed optimization over a Slurm cluster. Each of my runs requires a node with 8 GPUs, and Slurm launches 8 tasks on each node, one per GPU. I issue multiple runs on the cluster with its array-job functionality, thus achieving distributed optimization (each Slurm job performs one run of the study). Here is the main part of the logic:
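(The original snippet was not preserved in this thread. The following is a minimal sketch of such a setup, assuming one process per GPU and a journal file on the shared filesystem; the path, study name, and trial count are placeholders.)

```python
# Minimal sketch, not the reporter's actual code: one study shared across
# Slurm array jobs, every process appending to a single journal file on the
# shared filesystem. `objective` is defined in the next sketch.
import optuna
import torch.distributed as dist

N_TRIALS = 20  # placeholder

def main() -> None:
    dist.init_process_group(backend="nccl")  # Slurm launches one process per GPU

    storage = optuna.storages.JournalStorage(
        optuna.storages.JournalFileStorage("/shared/journal.log")  # placeholder path
    )

    if dist.get_rank() == 0:
        study = optuna.create_study(
            study_name="example-study",  # placeholder name
            storage=storage,
            load_if_exists=True,
        )
        study.optimize(objective, n_trials=N_TRIALS)
    else:
        # Non-zero ranks run the same objective without touching the storage;
        # TorchDistributedTrial (next sketch) broadcasts the parameters.
        for _ in range(N_TRIALS):
            try:
                objective(None)
            except optuna.TrialPruned:
                pass

if __name__ == "__main__":
    main()
```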
and the objective function is given here
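(Again a hypothetical sketch rather than the reported code: `optuna.integration.TorchDistributedTrial` broadcasts the suggested parameters from rank 0 to the other ranks, so every worker trains with the same hyperparameters. `train_and_evaluate` is a placeholder for the real training loop.)

```python
import optuna

def objective(trial) -> float:
    # Wrap the trial so every rank sees identical hyperparameters; on
    # non-zero ranks `trial` is None and the values arrive via broadcast.
    trial = optuna.integration.TorchDistributedTrial(trial)
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    validation_loss = train_and_evaluate(lr)  # placeholder training loop
    return validation_loss
```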
I still consistently observe file corruption in the output log file. This is my first time using the combination of Optuna and torch.distributed, so I cannot pinpoint whether the issue is within my code or there is a bug in the library; any help is appreciated! Thanks in advance!
Thank you for your bug report. Could you share the error log in your case?
Yes, of course :) The error looks similar to the one reported earlier. It seems that a trial can run for a while until another process starts and corrupts the log file.
By the way here's the version just to confirm:
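(The version output itself was not preserved; for reference, a standard way to check it is:)

```python
import optuna

print(optuna.__version__)  # prints the installed version, 4.0.0.dev in this case
```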
Thank you! It seems to be a different error than previously reported. The previous error message is "Unterminated string starting at: line 1 column 146 (char 145)," but your error message is "Extra data: line 1 column 2 (char 1)." Could you also show the corrupt line in the log file? It would be a great hint for debugging.
I deleted the corrupted log file generated earlier. However, when I was trying to reproduce the problem, a different error showed up:
I started two processes; the first one stopped due to the error while the second one seemed to run just fine. Here is the log file generated by Optuna:
Note that there seems to be an empty line at the end of the file, which could be the problem here.
Thank you. As you explained, the error seems to be occurring with a blank line.
After investigation, I suspect the file-write synchronization is not working correctly. I noticed there is no line at the end of the log that stores the objective value; if a worker terminates successfully, that line should always be included. This suggests the file write has been lost for some reason. It should be persisted by optuna/optuna/storages/_journal/file.py Lines 207 to 209 in 98d4444
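(The referenced lines are not quoted in this thread. For illustration only, a durable append generally looks like the sketch below; without the flush and fsync, a write can sit in user-space or OS caches and never reach the shared filesystem.)

```python
import os

# Illustrative sketch, not a quote of Optuna's file.py: appending a journal
# record durably requires flushing Python's buffer and syncing the OS cache.
def append_record(path: str, record: str) -> None:
    with open(path, "a") as f:
        f.write(record + "\n")
        f.flush()             # push Python's user-space buffer to the OS
        os.fsync(f.fileno())  # ask the OS to persist to the (network) filesystem
```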
@lwang2070 If possible, could you tell us about your storage environment and the OS where you store the log file?
I am using an NFS on a Slurm cluster; I am not sure about the details of the underlying file system and OS environment (though it is definitely Linux :p). I will consult the admin and let you know. Additionally, I produced another log file with the 'Extra data' error, and here is the line with the error:
Hopefully this will help!
Oh, on another topic: why not create a new log file for each process in a specified directory, to avoid the write-conflict problem altogether? To read previous logs, you can concatenate all files in the directory to produce a combined log, and each process can write to its own file. Forgive me if I made wrong assumptions about the underlying mechanism of Optuna. I assume each process only needs to read the history of other trials once, to determine its own hyperparameters, and would no longer need to read the log after that. Please correct me if I made some mistakes here :)
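(A rough sketch of the suggested scheme, not anything Optuna implements; all names are made up for illustration.)

```python
import glob
import os

# Each worker appends to its own file, so no two processes ever write to the
# same file; readers merge all per-process files.
def append_record(log_dir: str, record: str) -> None:
    path = os.path.join(log_dir, f"journal-{os.getpid()}.log")
    with open(path, "a") as f:
        f.write(record + "\n")
        f.flush()
        os.fsync(f.fileno())

def read_all_records(log_dir: str) -> list[str]:
    records: list[str] = []
    for path in sorted(glob.glob(os.path.join(log_dir, "journal-*.log"))):
        with open(path) as f:
            records.extend(line.rstrip("\n") for line in f)
    return records
```

One caveat with this design: a journal storage needs a single global order of log entries across all workers, and workers re-read the log while the study runs (e.g. to see trials reported by others), so per-process files would still need some way to merge entries into a consistent order.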
Okay, turns out the file system is something called JuiceFS.
Hey, any suggestion for a quick fix that would let me run temporarily? Some of my work depends on it 😅 Thanks in advance!
Yes, I will test that a bit later :) For the temporary fix, I simply changed optuna/optuna/storages/_journal/file.py Lines 199 to 200 in 98d4444
to be
Nope, this doesn't work either. I get random corrupted lines in the log such as
They can either be truncated like this one, or appended at the end of another line. It does seem that this is related to the properties of the NFS: either two processes can acquire the lock at the same time, or the log is not flushed to the filesystem, as you suggested.
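(For what it's worth, Optuna also ships an alternative lock object, JournalFileOpenLock, intended for filesystems where the default symlink-based lock misbehaves, such as NFS. A sketch of wiring it in, with a placeholder path:)

```python
import optuna

# Sketch: replace the default symlink-based lock with an open()-based lock,
# which is the documented option for NFS-like filesystems.
file_path = "/shared/journal.log"  # placeholder
storage = optuna.storages.JournalStorage(
    optuna.storages.JournalFileStorage(
        file_path,
        lock_obj=optuna.storages.JournalFileOpenLock(file_path),
    )
)
```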
Just for your information, here is a related flaky-test report:
Expected behavior

All CIs should pass, but test_pop_waiting_trial_multiprocess_safe in tests/storages_tests/test_journal.py sometimes fails: https://github.com/optuna/optuna/actions/runs/7079914046/job/19267103786

Environment
Error messages, stack traces, or logs
Steps to reproduce

Run test_pop_waiting_trial_multiprocess_safe in tests/storages_tests/test_journal.py.

Additional context (optional)
No response
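(For reference, the flaky test can be invoked directly via a standard pytest node ID; the exact flags the CI uses are not shown in the report.)

```python
# Equivalent to running pytest on the single test from the command line.
import pytest

pytest.main([
    "tests/storages_tests/test_journal.py::test_pop_waiting_trial_multiprocess_safe",
])
```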