
JSONDecodeError in JournalStorage #5138

Open
not522 opened this issue Dec 5, 2023 · 19 comments
Labels: bug (Issue/PR about behavior that is broken. Not for typos/examples/CI/test but for Optuna itself.)

Comments

@not522 (Member) commented Dec 5, 2023

Expected behavior

All CIs should pass, but test_pop_waiting_trial_multiprocess_safe in tests/storages_tests/test_journal.py sometimes fails. https://github.com/optuna/optuna/actions/runs/7079914046/job/19267103786

Environment

  • Optuna version: master
  • Python version: (at least) 3.7 & 3.8
  • OS: Ubuntu
  • (Optional) Other libraries and their versions:

Error messages, stack traces, or logs

___________________ test_pop_waiting_trial_multiprocess_safe ___________________
concurrent.futures.process._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/concurrent/futures/process.py", line 239, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/home/runner/work/optuna/optuna/tests/storages_tests/test_journal.py", line 103, in pop_waiting_trial
    study = optuna.load_study(storage=storage, study_name=study_name)
  File "/home/runner/work/optuna/optuna/optuna/_convert_positional_args.py", line 83, in converter_wrapper
    return func(**kwargs)
  File "/home/runner/work/optuna/optuna/optuna/study/study.py", line 1350, in load_study
    return Study(study_name=study_name, storage=storage, sampler=sampler, pruner=pruner)
  File "/home/runner/work/optuna/optuna/optuna/study/study.py", line 85, in __init__
    self._directions = storage.get_study_directions(study_id)
  File "/home/runner/work/optuna/optuna/optuna/storages/_journal/storage.py", line 213, in get_study_directions
    self._sync_with_backend()
  File "/home/runner/work/optuna/optuna/optuna/storages/_journal/storage.py", line 149, in _sync_with_backend
    logs = self._backend.read_logs(self._replay_result.log_number_read)
  File "/home/runner/work/optuna/optuna/optuna/storages/_journal/file.py", line 175, in read_logs
    raise last_decode_error
  File "/home/runner/work/optuna/optuna/optuna/storages/_journal/file.py", line 184, in read_logs
    logs.append(json.loads(line))
  File "/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 146 (char 145)
"""

The above exception was the direct cause of the following exception:

    def test_pop_waiting_trial_multiprocess_safe() -> None:
        with NamedTemporaryFilePool() as file:
            file_storage = optuna.storages.JournalFileStorage(file.name)
            storage = optuna.storages.JournalStorage(file_storage)
            study = optuna.create_study(storage=storage)
            num_enqueued = 10
            for i in range(num_enqueued):
                study.enqueue_trial({"i": i})
    
            trial_id_set = set()
            with ProcessPoolExecutor(10) as pool:
                futures = []
                for i in range(num_enqueued):
                    future = pool.submit(pop_waiting_trial, file.name, study.study_name)
                    futures.append(future)
    
                for future in as_completed(futures):
>                   trial_id = future.result()

tests/storages_tests/test_journal.py:124: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/concurrent/futures/_base.py:437: in result
    return self.__get_result()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = None

    def __get_result(self):
        if self._exception:
            try:
>               raise self._exception
E               json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 146 (char 145)

/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/concurrent/futures/_base.py:389: JSONDecodeError

Steps to reproduce

Run test_pop_waiting_trial_multiprocess_safe in tests/storages_tests/test_journal.py.
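For reference, a single run with pytest (assuming a development checkout of Optuna with its test dependencies installed) would be:

python -m pytest tests/storages_tests/test_journal.py::test_pop_waiting_trial_multiprocess_safe

Since the failure is a race between worker processes, several runs may be needed before it reproduces.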

Additional context (optional)

No response

not522 added the bug label on Dec 5, 2023
@araffin (Contributor) commented Dec 11, 2023

Hello,
I can report similar issues when using many processes at once with the journal storage.

@ernestum commented

Same (100 workers in Parallel, 15 failed with this error)

@not522 (Member, Author) commented Jan 17, 2024

Thank you for your reports. I investigated this bug and found that it is caused by incorrect log-reading logic. We plan to fix it by the next minor release.

@not522 (Member, Author) commented Feb 5, 2024

#5195 fixed this error. It will be included in the next minor release. Thank you for your reports and contributions!

not522 closed this as completed Feb 5, 2024
@lwang2070 commented May 21, 2024

Hi, I have upgraded to the latest version of Optuna, 4.0.0.dev, but I still face the same JSONDecodeError. Here is my setup:

I attempted to use JournalFileStorage for distributed optimization on a Slurm cluster. Each of my runs requires a node with 8 GPUs, and Slurm launches 8 tasks on each node, one per GPU. I issue multiple runs on the cluster via its array-job functionality, thus achieving distributed optimization (each Slurm job performs one run of the study).

Here is the main part of the logic:

        if args.optuna_distributed_sweep:
            from sweep import objectives
            if gpc.is_rank_for_log():
                import optuna
                import logging
                import sys
                # Add stream handler of stdout to show the messages
                optuna.logging.get_logger("optuna").addHandler(logging.StreamHandler(sys.stdout))
                study_name = args.optuna_study_name  # Unique identifier of the study.
                storage = optuna.storages.JournalStorage(
                    optuna.storages.JournalFileStorage(f"{args.optuna_log_dir.rstrip('/')}/{study_name}.log"),  # NFS path for distributed optimization
                )
                study = optuna.create_study(study_name=study_name, direction="minimize", storage=storage, load_if_exists=True)
                if args.optuna_params:
                    import json
                    params = json.loads(args.optuna_params)
                    study.enqueue_trial(params)
                objective = partial(objectives[args.optuna_objective_name], main, args)
                study.optimize(objective, n_trials=1)
            else:
                objective = partial(objectives[args.optuna_objective_name], main, args)
                objective(None)
        else:
            main(args)

and the objective function is given here:

def intern_5m_sweep_objective(fn, args, single_trial):
    trial = optuna.integration.TorchDistributedTrial(single_trial)
    gpc.config.model.scale_emb = trial.suggest_float('scale_emb', 0, 20)
    gpc.config.model.scale_depth = trial.suggest_float('scale_depth', 0, 10)
    gpc.config.model.w_std = trial.suggest_float('init_std', 0, 1)
    gpc.config.adam.lr = trial.suggest_float('lr', 0, 0.2)
    gpc.config.JOB_NAME += f'_scale_emb={gpc.config.model.scale_emb}_scale_depth={gpc.config.model.scale_depth}_w_std={gpc.config.model.w_std}_lr={gpc.config.adam.lr}'
    
    return fn(args, trial)

I still consistently observe corruption in the output log file. This is my first time combining Optuna with torch distributed, so I cannot pinpoint whether the issue is in my code or a bug in the library; any help is appreciated!

Thanks in advance!

@not522 (Member, Author) commented May 21, 2024

Thank you for your bug report. Could you share the error log in your case?

@lwang2070 commented May 21, 2024

Yes, of course :) The error looks similar to the one reported earlier. It seems that a trial can run for a while until another newly started process corrupts the log file.

2024-05-20 19:06:47,578	INFO training_internlm.py:771 in record_current_batch_training_metrics -- tflops=5.335465091592829 step=5 loss=8.54883098602295 tgs (tokens/gpu/second)=24316.07 tgs/last_tgs_1=24316.12 tgs/tgs_all=12025.95 tgs/tgs_avg=13717.84 tgs/tgs_SMA=12025.95 tgs/last_tgs_10=0 tgs/last_tgs_50=0 lr=0.010262852534651756 loss_scale=65536.0 grad_norm={'0_default': 0.2240985285550442, '1_default': 1.5120175088595298} epoch=0 micro_num=8 num_consumed_tokens=12582912 inf_nan_skip_batches=0 num_samples_in_batch=286 largest_length=2048 largest_batch=44 smallest_batch=24 adam_beta2=0.95 fwd_bwd_time=0.87 acc=0.0348 perplexity=5038.9927 acc//juicefs-algorithm/workspace/nlp/jinquan_ding/InternEvo/data_local/cn=0.0348 acc//juicefs-algorithm/workspace/nlp/jinquan_ding/InternEvo/data_local/test=0.0 tokens//juicefs-algorithm/workspace/nlp/jinquan_ding/InternEvo/data_local/cn=2095388 tokens//juicefs-algorithm/workspace/nlp/jinquan_ding/InternEvo/data_local/test=0 loss_from_metric=8.525 loss//juicefs-algorithm/workspace/nlp/jinquan_ding/InternEvo/data_local/cn=8.525 loss//juicefs-algorithm/workspace/nlp/jinquan_ding/InternEvo/data_local/test=nan 
2024-05-20 19:07:00,953	INFO training_internlm.py:771 in record_current_batch_training_metrics -- tflops=4.3471425289503625 step=6 loss=8.242281913757324 tgs (tokens/gpu/second)=19811.85 tgs/last_tgs_1=19811.88 tgs/tgs_all=12741.27 tgs/tgs_avg=14588.42 tgs/tgs_SMA=12741.27 tgs/last_tgs_10=0 tgs/last_tgs_50=0 lr=0.011728974325316293 loss_scale=65536.0 grad_norm={'0_default': 0.22520167994368076, '1_default': 1.6574819050026992} epoch=0 micro_num=8 num_consumed_tokens=14680064 inf_nan_skip_batches=0 num_samples_in_batch=259 largest_length=2048 largest_batch=51 smallest_batch=16 adam_beta2=0.95 fwd_bwd_time=0.91 acc=0.0216 perplexity=3487.989 acc//juicefs-algorithm/workspace/nlp/jinquan_ding/InternEvo/data_local/cn=0.0216 acc//juicefs-algorithm/workspace/nlp/jinquan_ding/InternEvo/data_local/test=0.0 tokens//juicefs-algorithm/workspace/nlp/jinquan_ding/InternEvo/data_local/cn=2095603 tokens//juicefs-algorithm/workspace/nlp/jinquan_ding/InternEvo/data_local/test=0 loss_from_metric=8.1571 loss//juicefs-algorithm/workspace/nlp/jinquan_ding/InternEvo/data_local/cn=8.1571 loss//juicefs-algorithm/workspace/nlp/jinquan_ding/InternEvo/data_local/test=nan 
2024-05-20 19:07:21,367	INFO training_internlm.py:771 in record_current_batch_training_metrics -- tflops=2.8371251887611044 step=7 loss=8.092676162719727 tgs (tokens/gpu/second)=12930.04 tgs/last_tgs_1=12930.05 tgs/tgs_all=12764.57 tgs/tgs_avg=14381.12 tgs/tgs_SMA=12764.57 tgs/last_tgs_10=0 tgs/last_tgs_50=0 lr=0.01319509611598083 loss_scale=65536.0 grad_norm={'0_default': 0.11347788975738815, '1_default': 1.038763763391638} epoch=0 micro_num=8 num_consumed_tokens=16777216 inf_nan_skip_batches=0 num_samples_in_batch=352 largest_length=2048 largest_batch=51 smallest_batch=38 adam_beta2=0.95 fwd_bwd_time=0.9 acc=0.0234 perplexity=3627.1111 acc//juicefs-algorithm/workspace/nlp/jinquan_ding/InternEvo/data_local/cn=0.0234 acc//juicefs-algorithm/workspace/nlp/jinquan_ding/InternEvo/data_local/test=0.0 tokens//juicefs-algorithm/workspace/nlp/jinquan_ding/InternEvo/data_local/cn=2095289 tokens//juicefs-algorithm/workspace/nlp/jinquan_ding/InternEvo/data_local/test=0 loss_from_metric=8.1962 loss//juicefs-algorithm/workspace/nlp/jinquan_ding/InternEvo/data_local/cn=8.1962 loss//juicefs-algorithm/workspace/nlp/jinquan_ding/InternEvo/data_local/test=nan 
2024-05-20 19:07:34,388	INFO training_internlm.py:771 in record_current_batch_training_metrics -- tflops=4.459795912168429 step=8 loss=8.514721870422363 tgs (tokens/gpu/second)=20325.27 tgs/last_tgs_1=20325.29 tgs/tgs_all=13314.9 tgs/tgs_avg=15041.59 tgs/tgs_SMA=13314.9 tgs/last_tgs_10=0 tgs/last_tgs_50=0 lr=0.014661217906645367 loss_scale=65536.0 grad_norm={'0_default': 0.09711209324832443, '1_default': 0.9855694260589425} epoch=0 micro_num=8 num_consumed_tokens=18874368 inf_nan_skip_batches=0 num_samples_in_batch=306 largest_length=2048 largest_batch=48 smallest_batch=27 adam_beta2=0.95 fwd_bwd_time=0.89 acc=0.0343 perplexity=3715.2295 acc//juicefs-algorithm/workspace/nlp/jinquan_ding/InternEvo/data_local/cn=0.0343 acc//juicefs-algorithm/workspace/nlp/jinquan_ding/InternEvo/data_local/test=0.0 tokens//juicefs-algorithm/workspace/nlp/jinquan_ding/InternEvo/data_local/cn=2095257 tokens//juicefs-algorithm/workspace/nlp/jinquan_ding/InternEvo/data_local/test=0 loss_from_metric=8.2202 loss//juicefs-algorithm/workspace/nlp/jinquan_ding/InternEvo/data_local/cn=8.2202 loss//juicefs-algorithm/workspace/nlp/jinquan_ding/InternEvo/data_local/test=nan 
2024-05-20 19:07:34,544	ERROR train.py:379 in <module> -- Raise exception from ubuntu-DeepLearning-4806050 with rank id: 0
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/optuna/study/_optimize.py", line 208, in _run_trial
    frozen_trial = _tell_with_warning(
  File "/opt/conda/lib/python3.10/site-packages/optuna/study/_tell.py", line 113, in _tell_with_warning
    frozen_trial = _get_frozen_trial(study, trial)
  File "/opt/conda/lib/python3.10/site-packages/optuna/study/_tell.py", line 47, in _get_frozen_trial
    return study._storage.get_trial(trial_id)
  File "/opt/conda/lib/python3.10/site-packages/optuna/storages/_journal/storage.py", line 368, in get_trial
    self._sync_with_backend()
  File "/opt/conda/lib/python3.10/site-packages/optuna/storages/_journal/storage.py", line 148, in _sync_with_backend
    logs = self._backend.read_logs(self._replay_result.log_number_read)
  File "/opt/conda/lib/python3.10/site-packages/optuna/storages/_journal/file.py", line 183, in read_logs
    raise last_decode_error
  File "/opt/conda/lib/python3.10/site-packages/optuna/storages/_journal/file.py", line 197, in read_logs
    logs.append(json.loads(line))
  File "/opt/conda/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/opt/conda/lib/python3.10/json/decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 2 (char 1)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/optuna/study/_optimize.py", line 216, in _run_trial
    frozen_trial = study._storage.get_trial(trial._trial_id)
  File "/opt/conda/lib/python3.10/site-packages/optuna/storages/_journal/storage.py", line 368, in get_trial
    self._sync_with_backend()
  File "/opt/conda/lib/python3.10/site-packages/optuna/storages/_journal/storage.py", line 148, in _sync_with_backend
    logs = self._backend.read_logs(self._replay_result.log_number_read)
  File "/opt/conda/lib/python3.10/site-packages/optuna/storages/_journal/file.py", line 183, in read_logs
    raise last_decode_error
  File "/opt/conda/lib/python3.10/site-packages/optuna/storages/_journal/file.py", line 197, in read_logs
    logs.append(json.loads(line))
  File "/opt/conda/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/opt/conda/lib/python3.10/json/decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 2 (char 1)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/train.py", line 372, in <module>
    study.optimize(objective, n_trials=1)
  File "/opt/conda/lib/python3.10/site-packages/optuna/study/study.py", line 453, in optimize
    _optimize(
  File "/opt/conda/lib/python3.10/site-packages/optuna/study/_optimize.py", line 62, in _optimize
    _optimize_sequential(
  File "/opt/conda/lib/python3.10/site-packages/optuna/study/_optimize.py", line 159, in _optimize_sequential
    frozen_trial = _run_trial(study, func, catch)
  File "/opt/conda/lib/python3.10/site-packages/optuna/study/_optimize.py", line 219, in _run_trial
    if frozen_trial.state == TrialState.COMPLETE:
UnboundLocalError: local variable 'frozen_trial' referenced before assignment

Exception in thread Thread-3 (_pin_memory_loop):
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.10/threading.py", line 953, in run
2024-05-20 19:07:39,952	ERROR train.py:379 in <module> -- Raise exception from ubuntu-DeepLearning-4806050 with rank id: 1
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 67, in wrapped
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 237, in report
    raise err
json.decoder.JSONDecodeError: Extra data: line 1 column 2 (char 1)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/train.py", line 375, in <module>
    objective(None)
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/sweep.py", line 13, in intern_5m_sweep_objective
    return fn(args, trial)
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/train.py", line 305, in main
    trial.report(loss.item(), batch_count)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 76, in wrapped
    ) = self._call_and_communicate_obj(fetch_properties)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 323, in _call_and_communicate_obj
    return self._broadcast(result)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 332, in _broadcast
    dist.broadcast(size_buffer, src=0, group=self._group)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1914, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [127.0.0.1]:62127

2024-05-20 19:07:39,962	ERROR train.py:379 in <module> -- Raise exception from ubuntu-DeepLearning-4806050 with rank id: 2
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 67, in wrapped
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 237, in report
    raise err
json.decoder.JSONDecodeError: Extra data: line 1 column 2 (char 1)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/train.py", line 375, in <module>
    objective(None)
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/sweep.py", line 13, in intern_5m_sweep_objective
    return fn(args, trial)
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/train.py", line 305, in main
    trial.report(loss.item(), batch_count)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 76, in wrapped
    ) = self._call_and_communicate_obj(fetch_properties)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 323, in _call_and_communicate_obj
    return self._broadcast(result)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 332, in _broadcast
    dist.broadcast(size_buffer, src=0, group=self._group)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1914, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [127.0.1.1]:60819

2024-05-20 19:07:39,981	ERROR train.py:379 in <module> -- Raise exception from ubuntu-DeepLearning-4806050 with rank id: 4
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 67, in wrapped
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 237, in report
    raise err
json.decoder.JSONDecodeError: Extra data: line 1 column 2 (char 1)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/train.py", line 375, in <module>
    objective(None)
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/sweep.py", line 13, in intern_5m_sweep_objective
    return fn(args, trial)
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/train.py", line 305, in main
    trial.report(loss.item(), batch_count)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 76, in wrapped
    ) = self._call_and_communicate_obj(fetch_properties)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 323, in _call_and_communicate_obj
    return self._broadcast(result)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 332, in _broadcast
    dist.broadcast(size_buffer, src=0, group=self._group)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1914, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [127.0.0.1]:6860

Exception in thread Thread-3 (_pin_memory_loop):
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/pin_memory.py", line 54, in _pin_memory_loop
    do_one_step()
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/pin_memory.py", line 31, in do_one_step
2024-05-20 19:07:45,412	ERROR train.py:379 in <module> -- Raise exception from ubuntu-DeepLearning-4806050 with rank id: 3
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 67, in wrapped
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 237, in report
    raise err
json.decoder.JSONDecodeError: Extra data: line 1 column 2 (char 1)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/train.py", line 375, in <module>
    objective(None)
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/sweep.py", line 13, in intern_5m_sweep_objective
    return fn(args, trial)
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/train.py", line 305, in main
    trial.report(loss.item(), batch_count)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 76, in wrapped
    ) = self._call_and_communicate_obj(fetch_properties)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 323, in _call_and_communicate_obj
    return self._broadcast(result)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 332, in _broadcast
    dist.broadcast(size_buffer, src=0, group=self._group)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1914, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [127.0.0.1]:7378

2024-05-20 19:07:45,417	ERROR train.py:379 in <module> -- Raise exception from ubuntu-DeepLearning-4806050 with rank id: 5
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 67, in wrapped
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 237, in report
    raise err
json.decoder.JSONDecodeError: Extra data: line 1 column 2 (char 1)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/train.py", line 375, in <module>
    objective(None)
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/sweep.py", line 13, in intern_5m_sweep_objective
    return fn(args, trial)
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/train.py", line 305, in main
    trial.report(loss.item(), batch_count)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 76, in wrapped
    ) = self._call_and_communicate_obj(fetch_properties)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 323, in _call_and_communicate_obj
    return self._broadcast(result)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 332, in _broadcast
    dist.broadcast(size_buffer, src=0, group=self._group)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1914, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [127.0.1.1]:33690

2024-05-20 19:07:45,429	ERROR train.py:379 in <module> -- Raise exception from ubuntu-DeepLearning-4806050 with rank id: 6
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 67, in wrapped
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 237, in report
    raise err
json.decoder.JSONDecodeError: Extra data: line 1 column 2 (char 1)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/train.py", line 375, in <module>
    objective(None)
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/sweep.py", line 13, in intern_5m_sweep_objective
    return fn(args, trial)
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/train.py", line 305, in main
    trial.report(loss.item(), batch_count)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 76, in wrapped
    ) = self._call_and_communicate_obj(fetch_properties)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 323, in _call_and_communicate_obj
    return self._broadcast(result)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 332, in _broadcast
    dist.broadcast(size_buffer, src=0, group=self._group)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1914, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [127.0.0.1]:20689

2024-05-20 19:07:50,793	ERROR train.py:379 in <module> -- Raise exception from ubuntu-DeepLearning-4806050 with rank id: 7
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 67, in wrapped
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 237, in report
    raise err
json.decoder.JSONDecodeError: Extra data: line 1 column 2 (char 1)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/train.py", line 375, in <module>
    objective(None)
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/sweep.py", line 13, in intern_5m_sweep_objective
    return fn(args, trial)
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/train.py", line 305, in main
    trial.report(loss.item(), batch_count)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 76, in wrapped
    ) = self._call_and_communicate_obj(fetch_properties)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 323, in _call_and_communicate_obj
    return self._broadcast(result)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 332, in _broadcast
    dist.broadcast(size_buffer, src=0, group=self._group)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1914, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [127.0.0.1]:14053

By the way, here's the version, just to confirm:

>>> import optuna
>>> optuna.__version__
'4.0.0.dev'

@not522 (Member, Author) commented May 21, 2024

Thank you! It seems to be a different error than previously reported. The previous error message is "Unterminated string starting at: line 1 column 146 (char 145)," but your error message is "Extra data: line 1 column 2 (char 1)."

Could you also show the corrupt line in the log file? It would be a great hint for debugging.

@lwang2070 commented

I deleted the corrupted log file generated earlier. However, when I was trying to reproduce the problem, a different error showed up:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/optuna/study/_optimize.py", line 216, in _run_trial
    frozen_trial = study._storage.get_trial(trial._trial_id)
  File "/opt/conda/lib/python3.10/site-packages/optuna/storages/_journal/storage.py", line 368, in get_trial
    self._sync_with_backend()
  File "/opt/conda/lib/python3.10/site-packages/optuna/storages/_journal/storage.py", line 148, in _sync_with_backend
    logs = self._backend.read_logs(self._replay_result.log_number_read)
  File "/opt/conda/lib/python3.10/site-packages/optuna/storages/_journal/file.py", line 183, in read_logs
    raise last_decode_error
  File "/opt/conda/lib/python3.10/site-packages/optuna/storages/_journal/file.py", line 197, in read_logs
    logs.append(json.loads(line))
  File "/opt/conda/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/opt/conda/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/opt/conda/lib/python3.10/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/train.py", line 372, in <module>
    study.optimize(objective, n_trials=1)
  File "/opt/conda/lib/python3.10/site-packages/optuna/study/study.py", line 453, in optimize
    _optimize(
  File "/opt/conda/lib/python3.10/site-packages/optuna/study/_optimize.py", line 62, in _optimize
    _optimize_sequential(
  File "/opt/conda/lib/python3.10/site-packages/optuna/study/_optimize.py", line 159, in _optimize_sequential
    frozen_trial = _run_trial(study, func, catch)
  File "/opt/conda/lib/python3.10/site-packages/optuna/study/_optimize.py", line 219, in _run_trial
    if frozen_trial.state == TrialState.COMPLETE:
UnboundLocalError: local variable 'frozen_trial' referenced before assignment

2024-05-21 16:21:39,758	ERROR train.py:379 in <module> -- Raise exception from ubuntu-DeepLearning-4806044 with rank id: 1
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 67, in wrapped
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 237, in report
    raise err
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/train.py", line 375, in <module>
    objective(None)
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/sweep.py", line 13, in intern_5m_sweep_objective
    return fn(args, trial)
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/train.py", line 305, in main
    trial.report(loss.item(), batch_count)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 76, in wrapped
    ) = self._call_and_communicate_obj(fetch_properties)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 323, in _call_and_communicate_obj
    return self._broadcast(result)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 332, in _broadcast
    dist.broadcast(size_buffer, src=0, group=self._group)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1914, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [127.0.0.1]:26836

2024-05-21 16:21:39,764	ERROR train.py:379 in <module> -- Raise exception from ubuntu-DeepLearning-4806044 with rank id: 2
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 67, in wrapped
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 237, in report
    raise err
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

I started two processes; the first one stopped due to the error, while the second one seems to run just fine. Here is the log file generated by Optuna:

{"op_code": 0, "worker_id": "617e716b-c149-4ef3-a684-45456cda0aa4-139940470409024", "study_name": "intern_5m_muP_sweep_debug", "directions": [1]}
{"op_code": 4, "worker_id": "617e716b-c149-4ef3-a684-45456cda0aa4-139940470409024", "study_id": 0, "datetime_start": "2024-05-21T16:16:48.404352"}
{"op_code": 5, "worker_id": "617e716b-c149-4ef3-a684-45456cda0aa4-139940470409024", "trial_id": 0, "param_name": "scale_emb", "param_value_internal": 5.0427348880931415, "distribution": "{\"name\": \"FloatDistribution\", \"attributes\": {\"step\": null, \"low\": 0.0, \"high\": 20.0, \"log\": false}}"}
{"op_code": 5, "worker_id": "617e716b-c149-4ef3-a684-45456cda0aa4-139940470409024", "trial_id": 0, "param_name": "scale_depth", "param_value_internal": 0.21607731322222934, "distribution": "{\"name\": \"FloatDistribution\", \"attributes\": {\"step\": null, \"low\": 0.0, \"high\": 10.0, \"log\": false}}"}
{"op_code": 5, "worker_id": "617e716b-c149-4ef3-a684-45456cda0aa4-139940470409024", "trial_id": 0, "param_name": "init_std", "param_value_internal": 0.9481608230815018, "distribution": "{\"name\": \"FloatDistribution\", \"attributes\": {\"step\": null, \"low\": 0.0, \"high\": 1.0, \"log\": false}}"}
{"op_code": 5, "worker_id": "617e716b-c149-4ef3-a684-45456cda0aa4-139940470409024", "trial_id": 0, "param_name": "lr", "param_value_internal": 0.03880828519140931, "distribution": "{\"name\": \"FloatDistribution\", \"attributes\": {\"step\": null, \"low\": 0.0, \"high\": 0.2, \"log\": false}}"}
{"op_code": 7, "worker_id": "617e716b-c149-4ef3-a684-45456cda0aa4-139940470409024", "trial_id": 0, "step": 0, "intermediate_value": 12.879324913024902}
{"op_code": 7, "worker_id": "617e716b-c149-4ef3-a684-45456cda0aa4-139940470409024", "trial_id": 0, "step": 1, "intermediate_value": 12.788047790527344}
{"op_code": 7, "worker_id": "617e716b-c149-4ef3-a684-45456cda0aa4-139940470409024", "trial_id": 0, "step": 2, "intermediate_value": 12.511056900024414}
...
{"op_code": 7, "worker_id": "617e716b-c149-4ef3-a684-45456cda0aa4-139940470409024", "trial_id": 0, "step": 1063, "intermediate_value": 6.470120429992676}

Note that there seems to be an empty line at the end of the file, which could be the problem here.

@not522 (Member, Author) commented May 27, 2024

Thank you. As you explained, the error seems to be caused by a blank line.
I will reopen the issue and look into the cause of the error.

not522 reopened this May 27, 2024
@not522 (Member, Author) commented May 29, 2024

After investigation, I suspect the file write synchronization is not working correctly. I noticed that there is no line at the end of the log storing the objective value. If a worker terminates successfully, that line should always be present, which suggests that a file write was lost for some reason. Writes should be persisted by f.flush() and os.fsync(f.fileno()), but this may not work in some environments.

f.write(what_to_write.encode("utf-8"))  # append the serialized log entry
f.flush()                               # flush Python's internal buffer to the OS
os.fsync(f.fileno())                    # ask the OS to persist the write to disk

@lwang2070 If possible, could you tell us about your storage environment and the OS where the log file is stored?

@lwang2070 commented

I am using an NFS mount on a Slurm cluster; I am not sure about the details of the underlying file system and OS environment (though it's definitely Linux :p). I will consult the admin and let you know.

Additionally, I produced another log file with the 'Extra data' error, and here is the offending line:

{"op_code": 7, "worker_id": "617e716b-c149-4ef3-a684-45456cda0aa4-139940470409024", "trial_id": 0, "step": 126, "intermediate_value": 6.986872673034668}{{"op_code": 7, "worker_id": "617e716b-c149-4ef3-a684-45456cda0aa4-139940470409024", "trial_id": 0, "step": 127, "intermediate_value": 7.017472267150879}

Hopefully this will help!

@lwang2070 commented

Oh, on another topic, why not create a new log file for each process in a specified directory to avoid the write-conflict problem altogether, as sketched below? To read previous logs, you could concatenate all the files in the directory into one log, and each process would write only to its own file.

Forgive me if I am making naive assumptions about Optuna's underlying mechanism. I assume each process only needs to read the history of other trials once to determine its own hyperparameters, and would no longer need to read the log after that.

Please correct me if I made some mistakes here :)
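
A rough sketch of the idea, purely for illustration (this is not how Optuna's JournalStorage works today; the directory name and helper functions are hypothetical):

import glob
import json
import os
import uuid

LOG_DIR = "journal_logs"  # hypothetical shared directory visible to all workers

# Each worker appends only to its own file, so there are no cross-process write conflicts.
worker_log_path = os.path.join(LOG_DIR, f"worker-{uuid.uuid4()}.log")

def append_log(record: dict) -> None:
    os.makedirs(LOG_DIR, exist_ok=True)
    with open(worker_log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

def read_all_logs() -> list:
    # Reading concatenates every worker's file into one view of the history.
    records = []
    for path in sorted(glob.glob(os.path.join(LOG_DIR, "*.log"))):
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    records.append(json.loads(line))
    return records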

@lwang2070 commented

Okay, it turns out the file system is something called JuiceFS.

@lwang2070 commented Jun 1, 2024

Hey, any suggestions for a quick fix that would let me run temporarily? Some of my work depends on it 😅 Thanks in advance!

@not522 (Member, Author) commented Jun 1, 2024

JournalFileStorage uses locks to ensure exclusive write access. By default, it uses symlinks for locking, but it doesn't seem to work well in your environment. I'm not sure if it will work, but could you try JournalFileOpenLock as an argument of JournalFileStorage?
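
For example, a minimal sketch (the log path is a placeholder, and I'm assuming the lock_obj keyword argument of JournalFileStorage here; please check the signature in your installed version):

import optuna
from optuna.storages import JournalFileOpenLock, JournalFileStorage, JournalStorage

log_path = "/path/on/nfs/optuna_journal.log"  # placeholder: the shared NFS path
storage = JournalStorage(
    # lock_obj is assumed to be the keyword for supplying the lock implementation.
    JournalFileStorage(log_path, lock_obj=JournalFileOpenLock(log_path))
)
study = optuna.create_study(
    study_name="example_study", storage=storage, load_if_exists=True
)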

@lwang2070 commented

JournalFileStorage uses locks to ensure exclusive write access. By default, it uses symlinks for locking, but it doesn't seem to work well in your environment. I'm not sure if it will work, but could you try JournalFileOpenLock as an argument of JournalFileStorage?

Yes, I will test that a bit later:)

For a temporary fix, I simply changed

last_decode_error = err
del self._log_number_offset[log_number + 1]

to pass, and managed to get the system running (in case anyone else needs this).
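
To illustrate what that workaround effectively does, here is a small standalone sketch (a hypothetical helper, not Optuna's actual read_logs) that skips lines which fail to decode, at the cost of silently dropping those entries:

import json

def read_journal_tolerant(path: str) -> list:
    logs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines like the one observed above
            try:
                logs.append(json.loads(line))
            except json.JSONDecodeError:
                continue  # skip corrupted entries instead of raising
    return logs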

@lwang2070 commented

JournalFileStorage uses locks to ensure exclusive write access. By default, it uses symlinks for locking, but it doesn't seem to work well in your environment. I'm not sure if it will work, but could you try JournalFileOpenLock as an argument of JournalFileStorage?

Nope, this doesn't work either. I still get randomly corrupted lines in the log, such as:

_code": 4, "worker_id": "2fa91567-e09a-4a67-8bbb-fc0877a65779-139969574254400", "study_id": 0, "datetime_start": "2024-06-04T00:03:19.132871"}

They can either be truncated like this one, or appended to the end of another line.

It does seem that this is related to properties of the NFS: either two processes can acquire the lock at the same time, or the log is not flushed to the file system, as you suggested.

@not522 (Member, Author) commented Jun 11, 2024

Just for your information, JournalFileStorage is implemented according to the NFS specifications: locking via symlink is based on NFS v2, and locking via open is based on NFS v3. Technical details can be found in this article: https://medium.com/optuna/distributed-optimization-via-nfs-using-optunas-new-operation-based-logging-storage-9815f9c3f932
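
As an illustration of the open-based approach, here is a minimal sketch of O_CREAT | O_EXCL lock acquisition (this is not the actual JournalFileOpenLock implementation; the lock path, timeout, and helper names are made up for this example):

import errno
import os
import time

LOCK_PATH = "optuna_journal.log.lock"  # hypothetical lock-file path on the shared mount

def acquire_open_lock(lock_path: str = LOCK_PATH, timeout: float = 30.0) -> bool:
    # Exclusive creation of the lock file either succeeds (we own the lock)
    # or fails with EEXIST (someone else owns it), so we poll until timeout.
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return True
        except OSError as e:
            if e.errno != errno.EEXIST:
                raise
            time.sleep(0.05)
    return False

def release_open_lock(lock_path: str = LOCK_PATH) -> None:
    os.unlink(lock_path)

The idea is that exclusive file creation is atomic on NFS v3 and later, so only one worker at a time can succeed and then append to the journal.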
