
JSONDecodeError in JournalStorage #5138

Open
not522 opened this issue Dec 5, 2023 · 19 comments
Labels: bug (Issue/PR about behavior that is broken. Not for typos/examples/CI/test but for Optuna itself.)

Comments

@not522 (Member) commented Dec 5, 2023

Expected behavior

All CIs should pass, but test_pop_waiting_trial_multiprocess_safe in tests/storages_tests/test_journal.py sometimes fails. https://github.com/optuna/optuna/actions/runs/7079914046/job/19267103786

Environment

  • Optuna version: master
  • Python version: (at least) 3.7 & 3.8
  • OS: Ubuntu
  • (Optional) Other libraries and their versions:

Error messages, stack traces, or logs

___________________ test_pop_waiting_trial_multiprocess_safe ___________________
concurrent.futures.process._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/concurrent/futures/process.py", line 239, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/home/runner/work/optuna/optuna/tests/storages_tests/test_journal.py", line 103, in pop_waiting_trial
    study = optuna.load_study(storage=storage, study_name=study_name)
  File "/home/runner/work/optuna/optuna/optuna/_convert_positional_args.py", line 83, in converter_wrapper
    return func(**kwargs)
  File "/home/runner/work/optuna/optuna/optuna/study/study.py", line 1350, in load_study
    return Study(study_name=study_name, storage=storage, sampler=sampler, pruner=pruner)
  File "/home/runner/work/optuna/optuna/optuna/study/study.py", line 85, in __init__
    self._directions = storage.get_study_directions(study_id)
  File "/home/runner/work/optuna/optuna/optuna/storages/_journal/storage.py", line 213, in get_study_directions
    self._sync_with_backend()
  File "/home/runner/work/optuna/optuna/optuna/storages/_journal/storage.py", line 149, in _sync_with_backend
    logs = self._backend.read_logs(self._replay_result.log_number_read)
  File "/home/runner/work/optuna/optuna/optuna/storages/_journal/file.py", line 175, in read_logs
    raise last_decode_error
  File "/home/runner/work/optuna/optuna/optuna/storages/_journal/file.py", line 184, in read_logs
    logs.append(json.loads(line))
  File "/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 146 (char 145)
"""

The above exception was the direct cause of the following exception:

    def test_pop_waiting_trial_multiprocess_safe() -> None:
        with NamedTemporaryFilePool() as file:
            file_storage = optuna.storages.JournalFileStorage(file.name)
            storage = optuna.storages.JournalStorage(file_storage)
            study = optuna.create_study(storage=storage)
            num_enqueued = 10
            for i in range(num_enqueued):
                study.enqueue_trial({"i": i})
    
            trial_id_set = set()
            with ProcessPoolExecutor(10) as pool:
                futures = []
                for i in range(num_enqueued):
                    future = pool.submit(pop_waiting_trial, file.name, study.study_name)
                    futures.append(future)
    
                for future in as_completed(futures):
>                   trial_id = future.result()

tests/storages_tests/test_journal.py:124: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/concurrent/futures/_base.py:437: in result
    return self.__get_result()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = None

    def __get_result(self):
        if self._exception:
            try:
>               raise self._exception
E               json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 146 (char 145)

/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/concurrent/futures/_base.py:389: JSONDecodeError

Steps to reproduce

Run test_pop_waiting_trial_multiprocess_safe in tests/storages_tests/test_journal.py.
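For reference, a single run with pytest (assuming a development checkout of Optuna with its test dependencies installed) would be:

python -m pytest tests/storages_tests/test_journal.py::test_pop_waiting_trial_multiprocess_safe

Since the failure is a race between worker processes, several runs may be needed before it reproduces.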

Additional context (optional)

No response

not522 added the bug label on Dec 5, 2023
@araffin (Contributor) commented Dec 11, 2023

Hello,
I can report similar issues when using many processes at once with the journal storage.

@ernestum commented

Same (100 workers in Parallel, 15 failed with this error)

@not522 (Member, Author) commented Jan 17, 2024

Thank you for your reports. I investigated this bug and found that it is caused by incorrect log-reading logic. We plan to fix it by the next minor release.

@not522 (Member, Author) commented Feb 5, 2024

#5195 fixed this error. It will be included in the next minor release. Thank you for your reports and contributions!

not522 closed this as completed Feb 5, 2024
@lwang2070 commented May 21, 2024

Hi, I have upgraded to the latest version of Optuna, 4.0.0.dev, but I still face the same JSONDecodeError. Here is my setup:

I attempted to use JournalFileStorage for distributed optimization on a Slurm cluster. Each of my runs requires a node with 8 GPUs, and Slurm launches 8 tasks on each node, one per GPU. I issue multiple runs on the cluster via its array-job functionality, thus achieving distributed optimization (each Slurm job performs one run of the study).

Here is the main part of the logic:

        if args.optuna_distributed_sweep:
            from sweep import objectives
            if gpc.is_rank_for_log():
                import optuna
                import logging
                import sys
                # Add stream handler of stdout to show the messages
                optuna.logging.get_logger("optuna").addHandler(logging.StreamHandler(sys.stdout))
                study_name = args.optuna_study_name  # Unique identifier of the study.
                storage = optuna.storages.JournalStorage(
                    optuna.storages.JournalFileStorage(f"{args.optuna_log_dir.rstrip('/')}/{study_name}.log"),  # NFS path for distributed optimization
                )
                study = optuna.create_study(study_name=study_name, direction="minimize", storage=storage, load_if_exists=True)
                if args.optuna_params:
                    import json
                    params = json.loads(args.optuna_params)
                    study.enqueue_trial(params)
                objective = partial(objectives[args.optuna_objective_name], main, args)
                study.optimize(objective, n_trials=1)
            else:
                objective = partial(objectives[args.optuna_objective_name], main, args)
                objective(None)
        else:
            main(args)

and the objective function is given here:

def intern_5m_sweep_objective(fn, args, single_trial):
    trial = optuna.integration.TorchDistributedTrial(single_trial)
    gpc.config.model.scale_emb = trial.suggest_float('scale_emb', 0, 20)
    gpc.config.model.scale_depth = trial.suggest_float('scale_depth', 0, 10)
    gpc.config.model.w_std = trial.suggest_float('init_std', 0, 1)
    gpc.config.adam.lr = trial.suggest_float('lr', 0, 0.2)
    gpc.config.JOB_NAME += f'_scale_emb={gpc.config.model.scale_emb}_scale_depth={gpc.config.model.scale_depth}_w_std={gpc.config.model.w_std}_lr={gpc.config.adam.lr}'
    
    return fn(args, trial)

I still consistently observe corruption in the output log file. This is my first time combining Optuna with torch distributed, so I cannot pinpoint whether the issue is in my code or a bug in the library; any help is appreciated!

Thanks in advance!

@not522 (Member, Author) commented May 21, 2024

Thank you for your bug report. Could you share the error log in your case?

@lwang2070 commented May 21, 2024

Yes, of course :) The error looks similar to the one reported earlier. It seems that a trial can run for a while until another newly started process corrupts the log file.

2024-05-20 19:06:47,578	INFO training_internlm.py:771 in record_current_batch_training_metrics -- tflops=5.335465091592829 step=5 loss=8.54883098602295 tgs (tokens/gpu/second)=24316.07 tgs/last_tgs_1=24316.12 tgs/tgs_all=12025.95 tgs/tgs_avg=13717.84 tgs/tgs_SMA=12025.95 tgs/last_tgs_10=0 tgs/last_tgs_50=0 lr=0.010262852534651756 loss_scale=65536.0 grad_norm={'0_default': 0.2240985285550442, '1_default': 1.5120175088595298} epoch=0 micro_num=8 num_consumed_tokens=12582912 inf_nan_skip_batches=0 num_samples_in_batch=286 largest_length=2048 largest_batch=44 smallest_batch=24 adam_beta2=0.95 fwd_bwd_time=0.87 acc=0.0348 perplexity=5038.9927 acc//juicefs-algorithm/workspace/nlp/jinquan_ding/InternEvo/data_local/cn=0.0348 acc//juicefs-algorithm/workspace/nlp/jinquan_ding/InternEvo/data_local/test=0.0 tokens//juicefs-algorithm/workspace/nlp/jinquan_ding/InternEvo/data_local/cn=2095388 tokens//juicefs-algorithm/workspace/nlp/jinquan_ding/InternEvo/data_local/test=0 loss_from_metric=8.525 loss//juicefs-algorithm/workspace/nlp/jinquan_ding/InternEvo/data_local/cn=8.525 loss//juicefs-algorithm/workspace/nlp/jinquan_ding/InternEvo/data_local/test=nan 
2024-05-20 19:07:00,953	INFO training_internlm.py:771 in record_current_batch_training_metrics -- tflops=4.3471425289503625 step=6 loss=8.242281913757324 tgs (tokens/gpu/second)=19811.85 tgs/last_tgs_1=19811.88 tgs/tgs_all=12741.27 tgs/tgs_avg=14588.42 tgs/tgs_SMA=12741.27 tgs/last_tgs_10=0 tgs/last_tgs_50=0 lr=0.011728974325316293 loss_scale=65536.0 grad_norm={'0_default': 0.22520167994368076, '1_default': 1.6574819050026992} epoch=0 micro_num=8 num_consumed_tokens=14680064 inf_nan_skip_batches=0 num_samples_in_batch=259 largest_length=2048 largest_batch=51 smallest_batch=16 adam_beta2=0.95 fwd_bwd_time=0.91 acc=0.0216 perplexity=3487.989 acc//juicefs-algorithm/workspace/nlp/jinquan_ding/InternEvo/data_local/cn=0.0216 acc//juicefs-algorithm/workspace/nlp/jinquan_ding/InternEvo/data_local/test=0.0 tokens//juicefs-algorithm/workspace/nlp/jinquan_ding/InternEvo/data_local/cn=2095603 tokens//juicefs-algorithm/workspace/nlp/jinquan_ding/InternEvo/data_local/test=0 loss_from_metric=8.1571 loss//juicefs-algorithm/workspace/nlp/jinquan_ding/InternEvo/data_local/cn=8.1571 loss//juicefs-algorithm/workspace/nlp/jinquan_ding/InternEvo/data_local/test=nan 
2024-05-20 19:07:21,367	INFO training_internlm.py:771 in record_current_batch_training_metrics -- tflops=2.8371251887611044 step=7 loss=8.092676162719727 tgs (tokens/gpu/second)=12930.04 tgs/last_tgs_1=12930.05 tgs/tgs_all=12764.57 tgs/tgs_avg=14381.12 tgs/tgs_SMA=12764.57 tgs/last_tgs_10=0 tgs/last_tgs_50=0 lr=0.01319509611598083 loss_scale=65536.0 grad_norm={'0_default': 0.11347788975738815, '1_default': 1.038763763391638} epoch=0 micro_num=8 num_consumed_tokens=16777216 inf_nan_skip_batches=0 num_samples_in_batch=352 largest_length=2048 largest_batch=51 smallest_batch=38 adam_beta2=0.95 fwd_bwd_time=0.9 acc=0.0234 perplexity=3627.1111 acc//juicefs-algorithm/workspace/nlp/jinquan_ding/InternEvo/data_local/cn=0.0234 acc//juicefs-algorithm/workspace/nlp/jinquan_ding/InternEvo/data_local/test=0.0 tokens//juicefs-algorithm/workspace/nlp/jinquan_ding/InternEvo/data_local/cn=2095289 tokens//juicefs-algorithm/workspace/nlp/jinquan_ding/InternEvo/data_local/test=0 loss_from_metric=8.1962 loss//juicefs-algorithm/workspace/nlp/jinquan_ding/InternEvo/data_local/cn=8.1962 loss//juicefs-algorithm/workspace/nlp/jinquan_ding/InternEvo/data_local/test=nan 
2024-05-20 19:07:34,388	INFO training_internlm.py:771 in record_current_batch_training_metrics -- tflops=4.459795912168429 step=8 loss=8.514721870422363 tgs (tokens/gpu/second)=20325.27 tgs/last_tgs_1=20325.29 tgs/tgs_all=13314.9 tgs/tgs_avg=15041.59 tgs/tgs_SMA=13314.9 tgs/last_tgs_10=0 tgs/last_tgs_50=0 lr=0.014661217906645367 loss_scale=65536.0 grad_norm={'0_default': 0.09711209324832443, '1_default': 0.9855694260589425} epoch=0 micro_num=8 num_consumed_tokens=18874368 inf_nan_skip_batches=0 num_samples_in_batch=306 largest_length=2048 largest_batch=48 smallest_batch=27 adam_beta2=0.95 fwd_bwd_time=0.89 acc=0.0343 perplexity=3715.2295 acc//juicefs-algorithm/workspace/nlp/jinquan_ding/InternEvo/data_local/cn=0.0343 acc//juicefs-algorithm/workspace/nlp/jinquan_ding/InternEvo/data_local/test=0.0 tokens//juicefs-algorithm/workspace/nlp/jinquan_ding/InternEvo/data_local/cn=2095257 tokens//juicefs-algorithm/workspace/nlp/jinquan_ding/InternEvo/data_local/test=0 loss_from_metric=8.2202 loss//juicefs-algorithm/workspace/nlp/jinquan_ding/InternEvo/data_local/cn=8.2202 loss//juicefs-algorithm/workspace/nlp/jinquan_ding/InternEvo/data_local/test=nan 
2024-05-20 19:07:34,544	ERROR train.py:379 in <module> -- Raise exception from ubuntu-DeepLearning-4806050 with rank id: 0
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/optuna/study/_optimize.py", line 208, in _run_trial
    frozen_trial = _tell_with_warning(
  File "/opt/conda/lib/python3.10/site-packages/optuna/study/_tell.py", line 113, in _tell_with_warning
    frozen_trial = _get_frozen_trial(study, trial)
  File "/opt/conda/lib/python3.10/site-packages/optuna/study/_tell.py", line 47, in _get_frozen_trial
    return study._storage.get_trial(trial_id)
  File "/opt/conda/lib/python3.10/site-packages/optuna/storages/_journal/storage.py", line 368, in get_trial
    self._sync_with_backend()
  File "/opt/conda/lib/python3.10/site-packages/optuna/storages/_journal/storage.py", line 148, in _sync_with_backend
    logs = self._backend.read_logs(self._replay_result.log_number_read)
  File "/opt/conda/lib/python3.10/site-packages/optuna/storages/_journal/file.py", line 183, in read_logs
    raise last_decode_error
  File "/opt/conda/lib/python3.10/site-packages/optuna/storages/_journal/file.py", line 197, in read_logs
    logs.append(json.loads(line))
  File "/opt/conda/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/opt/conda/lib/python3.10/json/decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 2 (char 1)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/optuna/study/_optimize.py", line 216, in _run_trial
    frozen_trial = study._storage.get_trial(trial._trial_id)
  File "/opt/conda/lib/python3.10/site-packages/optuna/storages/_journal/storage.py", line 368, in get_trial
    self._sync_with_backend()
  File "/opt/conda/lib/python3.10/site-packages/optuna/storages/_journal/storage.py", line 148, in _sync_with_backend
    logs = self._backend.read_logs(self._replay_result.log_number_read)
  File "/opt/conda/lib/python3.10/site-packages/optuna/storages/_journal/file.py", line 183, in read_logs
    raise last_decode_error
  File "/opt/conda/lib/python3.10/site-packages/optuna/storages/_journal/file.py", line 197, in read_logs
    logs.append(json.loads(line))
  File "/opt/conda/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/opt/conda/lib/python3.10/json/decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 2 (char 1)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/train.py", line 372, in <module>
    study.optimize(objective, n_trials=1)
  File "/opt/conda/lib/python3.10/site-packages/optuna/study/study.py", line 453, in optimize
    _optimize(
  File "/opt/conda/lib/python3.10/site-packages/optuna/study/_optimize.py", line 62, in _optimize
    _optimize_sequential(
  File "/opt/conda/lib/python3.10/site-packages/optuna/study/_optimize.py", line 159, in _optimize_sequential
    frozen_trial = _run_trial(study, func, catch)
  File "/opt/conda/lib/python3.10/site-packages/optuna/study/_optimize.py", line 219, in _run_trial
    if frozen_trial.state == TrialState.COMPLETE:
UnboundLocalError: local variable 'frozen_trial' referenced before assignment

Exception in thread Thread-3 (_pin_memory_loop):
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.10/threading.py", line 953, in run
2024-05-20 19:07:39,952	ERROR train.py:379 in <module> -- Raise exception from ubuntu-DeepLearning-4806050 with rank id: 1
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 67, in wrapped
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 237, in report
    raise err
json.decoder.JSONDecodeError: Extra data: line 1 column 2 (char 1)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/train.py", line 375, in <module>
    objective(None)
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/sweep.py", line 13, in intern_5m_sweep_objective
    return fn(args, trial)
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/train.py", line 305, in main
    trial.report(loss.item(), batch_count)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 76, in wrapped
    ) = self._call_and_communicate_obj(fetch_properties)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 323, in _call_and_communicate_obj
    return self._broadcast(result)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 332, in _broadcast
    dist.broadcast(size_buffer, src=0, group=self._group)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1914, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [127.0.0.1]:62127

2024-05-20 19:07:39,962	ERROR train.py:379 in <module> -- Raise exception from ubuntu-DeepLearning-4806050 with rank id: 2
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 67, in wrapped
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 237, in report
    raise err
json.decoder.JSONDecodeError: Extra data: line 1 column 2 (char 1)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/train.py", line 375, in <module>
    objective(None)
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/sweep.py", line 13, in intern_5m_sweep_objective
    return fn(args, trial)
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/train.py", line 305, in main
    trial.report(loss.item(), batch_count)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 76, in wrapped
    ) = self._call_and_communicate_obj(fetch_properties)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 323, in _call_and_communicate_obj
    return self._broadcast(result)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 332, in _broadcast
    dist.broadcast(size_buffer, src=0, group=self._group)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1914, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [127.0.1.1]:60819

2024-05-20 19:07:39,981	ERROR train.py:379 in <module> -- Raise exception from ubuntu-DeepLearning-4806050 with rank id: 4
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 67, in wrapped
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 237, in report
    raise err
json.decoder.JSONDecodeError: Extra data: line 1 column 2 (char 1)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/train.py", line 375, in <module>
    objective(None)
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/sweep.py", line 13, in intern_5m_sweep_objective
    return fn(args, trial)
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/train.py", line 305, in main
    trial.report(loss.item(), batch_count)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 76, in wrapped
    ) = self._call_and_communicate_obj(fetch_properties)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 323, in _call_and_communicate_obj
    return self._broadcast(result)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 332, in _broadcast
    dist.broadcast(size_buffer, src=0, group=self._group)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1914, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [127.0.0.1]:6860

Exception in thread Thread-3 (_pin_memory_loop):
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/pin_memory.py", line 54, in _pin_memory_loop
    do_one_step()
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/pin_memory.py", line 31, in do_one_step
2024-05-20 19:07:45,412	ERROR train.py:379 in <module> -- Raise exception from ubuntu-DeepLearning-4806050 with rank id: 3
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 67, in wrapped
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 237, in report
    raise err
json.decoder.JSONDecodeError: Extra data: line 1 column 2 (char 1)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/train.py", line 375, in <module>
    objective(None)
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/sweep.py", line 13, in intern_5m_sweep_objective
    return fn(args, trial)
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/train.py", line 305, in main
    trial.report(loss.item(), batch_count)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 76, in wrapped
    ) = self._call_and_communicate_obj(fetch_properties)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 323, in _call_and_communicate_obj
    return self._broadcast(result)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 332, in _broadcast
    dist.broadcast(size_buffer, src=0, group=self._group)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1914, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [127.0.0.1]:7378

2024-05-20 19:07:45,417	ERROR train.py:379 in <module> -- Raise exception from ubuntu-DeepLearning-4806050 with rank id: 5
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 67, in wrapped
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 237, in report
    raise err
json.decoder.JSONDecodeError: Extra data: line 1 column 2 (char 1)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/train.py", line 375, in <module>
    objective(None)
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/sweep.py", line 13, in intern_5m_sweep_objective
    return fn(args, trial)
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/train.py", line 305, in main
    trial.report(loss.item(), batch_count)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 76, in wrapped
    ) = self._call_and_communicate_obj(fetch_properties)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 323, in _call_and_communicate_obj
    return self._broadcast(result)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 332, in _broadcast
    dist.broadcast(size_buffer, src=0, group=self._group)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1914, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [127.0.1.1]:33690

2024-05-20 19:07:45,429	ERROR train.py:379 in <module> -- Raise exception from ubuntu-DeepLearning-4806050 with rank id: 6
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 67, in wrapped
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 237, in report
    raise err
json.decoder.JSONDecodeError: Extra data: line 1 column 2 (char 1)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/train.py", line 375, in <module>
    objective(None)
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/sweep.py", line 13, in intern_5m_sweep_objective
    return fn(args, trial)
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/train.py", line 305, in main
    trial.report(loss.item(), batch_count)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 76, in wrapped
    ) = self._call_and_communicate_obj(fetch_properties)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 323, in _call_and_communicate_obj
    return self._broadcast(result)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 332, in _broadcast
    dist.broadcast(size_buffer, src=0, group=self._group)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1914, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [127.0.0.1]:20689

2024-05-20 19:07:50,793	ERROR train.py:379 in <module> -- Raise exception from ubuntu-DeepLearning-4806050 with rank id: 7
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 67, in wrapped
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 237, in report
    raise err
json.decoder.JSONDecodeError: Extra data: line 1 column 2 (char 1)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/train.py", line 375, in <module>
    objective(None)
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/sweep.py", line 13, in intern_5m_sweep_objective
    return fn(args, trial)
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/train.py", line 305, in main
    trial.report(loss.item(), batch_count)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 76, in wrapped
    ) = self._call_and_communicate_obj(fetch_properties)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 323, in _call_and_communicate_obj
    return self._broadcast(result)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 332, in _broadcast
    dist.broadcast(size_buffer, src=0, group=self._group)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1914, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [127.0.0.1]:14053

By the way, here's the version, just to confirm:

>>> import optuna
>>> optuna.__version__
'4.0.0.dev'

@not522 (Member, Author) commented May 21, 2024

Thank you! It seems to be a different error than previously reported. The previous error message is "Unterminated string starting at: line 1 column 146 (char 145)," but your error message is "Extra data: line 1 column 2 (char 1)."

Could you also show the corrupt line in the log file? It would be a great hint for debugging.

@lwang2070 commented

I deleted the corrupted log file generated earlier. However, when I was trying to reproduce the problem, a different error showed up:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/optuna/study/_optimize.py", line 216, in _run_trial
    frozen_trial = study._storage.get_trial(trial._trial_id)
  File "/opt/conda/lib/python3.10/site-packages/optuna/storages/_journal/storage.py", line 368, in get_trial
    self._sync_with_backend()
  File "/opt/conda/lib/python3.10/site-packages/optuna/storages/_journal/storage.py", line 148, in _sync_with_backend
    logs = self._backend.read_logs(self._replay_result.log_number_read)
  File "/opt/conda/lib/python3.10/site-packages/optuna/storages/_journal/file.py", line 183, in read_logs
    raise last_decode_error
  File "/opt/conda/lib/python3.10/site-packages/optuna/storages/_journal/file.py", line 197, in read_logs
    logs.append(json.loads(line))
  File "/opt/conda/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/opt/conda/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/opt/conda/lib/python3.10/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/train.py", line 372, in <module>
    study.optimize(objective, n_trials=1)
  File "/opt/conda/lib/python3.10/site-packages/optuna/study/study.py", line 453, in optimize
    _optimize(
  File "/opt/conda/lib/python3.10/site-packages/optuna/study/_optimize.py", line 62, in _optimize
    _optimize_sequential(
  File "/opt/conda/lib/python3.10/site-packages/optuna/study/_optimize.py", line 159, in _optimize_sequential
    frozen_trial = _run_trial(study, func, catch)
  File "/opt/conda/lib/python3.10/site-packages/optuna/study/_optimize.py", line 219, in _run_trial
    if frozen_trial.state == TrialState.COMPLETE:
UnboundLocalError: local variable 'frozen_trial' referenced before assignment

2024-05-21 16:21:39,758	ERROR train.py:379 in <module> -- Raise exception from ubuntu-DeepLearning-4806044 with rank id: 1
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 67, in wrapped
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 237, in report
    raise err
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/train.py", line 375, in <module>
    objective(None)
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/sweep.py", line 13, in intern_5m_sweep_objective
    return fn(args, trial)
  File "/juicefs-algorithm/workspace/nlp/li_wang/internevo/train.py", line 305, in main
    trial.report(loss.item(), batch_count)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 76, in wrapped
    ) = self._call_and_communicate_obj(fetch_properties)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 323, in _call_and_communicate_obj
    return self._broadcast(result)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 332, in _broadcast
    dist.broadcast(size_buffer, src=0, group=self._group)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1914, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [127.0.0.1]:26836

2024-05-21 16:21:39,764	ERROR train.py:379 in <module> -- Raise exception from ubuntu-DeepLearning-4806044 with rank id: 2
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 67, in wrapped
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/optuna_integration/pytorch_distributed.py", line 237, in report
    raise err
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

I started two processes; the first one stopped due to the error, while the second one seems to run just fine. Here is the log file generated by Optuna:

{"op_code": 0, "worker_id": "617e716b-c149-4ef3-a684-45456cda0aa4-139940470409024", "study_name": "intern_5m_muP_sweep_debug", "directions": [1]}
{"op_code": 4, "worker_id": "617e716b-c149-4ef3-a684-45456cda0aa4-139940470409024", "study_id": 0, "datetime_start": "2024-05-21T16:16:48.404352"}
{"op_code": 5, "worker_id": "617e716b-c149-4ef3-a684-45456cda0aa4-139940470409024", "trial_id": 0, "param_name": "scale_emb", "param_value_internal": 5.0427348880931415, "distribution": "{\"name\": \"FloatDistribution\", \"attributes\": {\"step\": null, \"low\": 0.0, \"high\": 20.0, \"log\": false}}"}
{"op_code": 5, "worker_id": "617e716b-c149-4ef3-a684-45456cda0aa4-139940470409024", "trial_id": 0, "param_name": "scale_depth", "param_value_internal": 0.21607731322222934, "distribution": "{\"name\": \"FloatDistribution\", \"attributes\": {\"step\": null, \"low\": 0.0, \"high\": 10.0, \"log\": false}}"}
{"op_code": 5, "worker_id": "617e716b-c149-4ef3-a684-45456cda0aa4-139940470409024", "trial_id": 0, "param_name": "init_std", "param_value_internal": 0.9481608230815018, "distribution": "{\"name\": \"FloatDistribution\", \"attributes\": {\"step\": null, \"low\": 0.0, \"high\": 1.0, \"log\": false}}"}
{"op_code": 5, "worker_id": "617e716b-c149-4ef3-a684-45456cda0aa4-139940470409024", "trial_id": 0, "param_name": "lr", "param_value_internal": 0.03880828519140931, "distribution": "{\"name\": \"FloatDistribution\", \"attributes\": {\"step\": null, \"low\": 0.0, \"high\": 0.2, \"log\": false}}"}
{"op_code": 7, "worker_id": "617e716b-c149-4ef3-a684-45456cda0aa4-139940470409024", "trial_id": 0, "step": 0, "intermediate_value": 12.879324913024902}
{"op_code": 7, "worker_id": "617e716b-c149-4ef3-a684-45456cda0aa4-139940470409024", "trial_id": 0, "step": 1, "intermediate_value": 12.788047790527344}
{"op_code": 7, "worker_id": "617e716b-c149-4ef3-a684-45456cda0aa4-139940470409024", "trial_id": 0, "step": 2, "intermediate_value": 12.511056900024414}
...
{"op_code": 7, "worker_id": "617e716b-c149-4ef3-a684-45456cda0aa4-139940470409024", "trial_id": 0, "step": 1063, "intermediate_value": 6.470120429992676}

Note that there seems to be an empty line at the end of the file, which could be the problem here.

@not522 (Member, Author) commented May 27, 2024

Thank you. As you explained, the error seems to be caused by a blank line.
I will reopen the issue and look into the cause of the error.

not522 reopened this May 27, 2024
@not522 (Member, Author) commented May 29, 2024

After investigation, I suspect the file write synchronization is not working correctly. I noticed that there is no line at the end of the log storing the objective value. If a worker terminates successfully, that line should always be present, which suggests that a file write was lost for some reason. Writes should be persisted by f.flush() and os.fsync(f.fileno()), but this may not work in some environments.

f.write(what_to_write.encode("utf-8"))  # append the serialized log entry
f.flush()                               # flush Python's internal buffer to the OS
os.fsync(f.fileno())                    # ask the OS to persist the write to disk

@lwang2070 If possible, could you tell us about your storage environment and the OS where the log file is stored?

@lwang2070 commented

I am using an NFS mount on a Slurm cluster; I am not sure about the details of the underlying file system and OS environment (though it's definitely Linux :p). I will consult the admin and let you know.

Additionally, I produced another log file with the 'Extra data' error, and here is the offending line:

{"op_code": 7, "worker_id": "617e716b-c149-4ef3-a684-45456cda0aa4-139940470409024", "trial_id": 0, "step": 126, "intermediate_value": 6.986872673034668}{{"op_code": 7, "worker_id": "617e716b-c149-4ef3-a684-45456cda0aa4-139940470409024", "trial_id": 0, "step": 127, "intermediate_value": 7.017472267150879}

Hopefully this will help!

@lwang2070 commented

Oh, on another topic, why not create a new log file for each process in a specified directory to avoid the write-conflict problem altogether, as sketched below? To read previous logs, you could concatenate all the files in the directory into one log, and each process would write only to its own file.

Forgive me if I am making naive assumptions about Optuna's underlying mechanism. I assume each process only needs to read the history of other trials once to determine its own hyperparameters, and would no longer need to read the log after that.

Please correct me if I made some mistakes here :)
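
A rough sketch of the idea, purely for illustration (this is not how Optuna's JournalStorage works today; the directory name and helper functions are hypothetical):

import glob
import json
import os
import uuid

LOG_DIR = "journal_logs"  # hypothetical shared directory visible to all workers

# Each worker appends only to its own file, so there are no cross-process write conflicts.
worker_log_path = os.path.join(LOG_DIR, f"worker-{uuid.uuid4()}.log")

def append_log(record: dict) -> None:
    os.makedirs(LOG_DIR, exist_ok=True)
    with open(worker_log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

def read_all_logs() -> list:
    # Reading concatenates every worker's file into one view of the history.
    records = []
    for path in sorted(glob.glob(os.path.join(LOG_DIR, "*.log"))):
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    records.append(json.loads(line))
    return records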

@lwang2070 commented

Okay, it turns out the file system is something called JuiceFS.

@lwang2070 commented Jun 1, 2024

Hey, any suggestions for a quick fix that would let me run temporarily? Some of my work depends on it 😅 Thanks in advance!

@not522 (Member, Author) commented Jun 1, 2024

JournalFileStorage uses locks to ensure exclusive write access. By default, it uses symlinks for locking, but it doesn't seem to work well in your environment. I'm not sure if it will work, but could you try JournalFileOpenLock as an argument of JournalFileStorage?
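
For example, a minimal sketch (the log path is a placeholder, and I'm assuming the lock_obj keyword argument of JournalFileStorage here; please check the signature in your installed version):

import optuna
from optuna.storages import JournalFileOpenLock, JournalFileStorage, JournalStorage

log_path = "/path/on/nfs/optuna_journal.log"  # placeholder: the shared NFS path
storage = JournalStorage(
    # lock_obj is assumed to be the keyword for supplying the lock implementation.
    JournalFileStorage(log_path, lock_obj=JournalFileOpenLock(log_path))
)
study = optuna.create_study(
    study_name="example_study", storage=storage, load_if_exists=True
)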

@lwang2070 commented

JournalFileStorage uses locks to ensure exclusive write access. By default, it uses symlinks for locking, but it doesn't seem to work well in your environment. I'm not sure if it will work, but could you try JournalFileOpenLock as an argument of JournalFileStorage?

Yes, I will test that a bit later:)

For a temporary fix, I simply changed

last_decode_error = err
del self._log_number_offset[log_number + 1]

to pass, and managed to get the system running (in case anyone else needs this).
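
To illustrate what that workaround effectively does, here is a small standalone sketch (a hypothetical helper, not Optuna's actual read_logs) that skips lines which fail to decode, at the cost of silently dropping those entries:

import json

def read_journal_tolerant(path: str) -> list:
    logs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines like the one observed above
            try:
                logs.append(json.loads(line))
            except json.JSONDecodeError:
                continue  # skip corrupted entries instead of raising
    return logs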

@lwang2070 commented

JournalFileStorage uses locks to ensure exclusive write access. By default, it uses symlinks for locking, but it doesn't seem to work well in your environment. I'm not sure if it will work, but could you try JournalFileOpenLock as an argument of JournalFileStorage?

Nope, this doesn't work either. I still get randomly corrupted lines in the log, such as:

_code": 4, "worker_id": "2fa91567-e09a-4a67-8bbb-fc0877a65779-139969574254400", "study_id": 0, "datetime_start": "2024-06-04T00:03:19.132871"}

They can either be truncated like this one, or appended to the end of another line.

It does seem that this is related to properties of the NFS: either two processes can acquire the lock at the same time, or the log is not flushed to the file system, as you suggested.

@not522 (Member, Author) commented Jun 11, 2024

Just for your information, JournalFileStorage is implemented according to the NFS specifications: locking via symlink is based on NFS v2, and locking via open is based on NFS v3. Technical details can be found in this article: https://medium.com/optuna/distributed-optimization-via-nfs-using-optunas-new-operation-based-logging-storage-9815f9c3f932
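
As an illustration of the open-based approach, here is a minimal sketch of O_CREAT | O_EXCL lock acquisition (this is not the actual JournalFileOpenLock implementation; the lock path, timeout, and helper names are made up for this example):

import errno
import os
import time

LOCK_PATH = "optuna_journal.log.lock"  # hypothetical lock-file path on the shared mount

def acquire_open_lock(lock_path: str = LOCK_PATH, timeout: float = 30.0) -> bool:
    # Exclusive creation of the lock file either succeeds (we own the lock)
    # or fails with EEXIST (someone else owns it), so we poll until timeout.
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return True
        except OSError as e:
            if e.errno != errno.EEXIST:
                raise
            time.sleep(0.05)
    return False

def release_open_lock(lock_path: str = LOCK_PATH) -> None:
    os.unlink(lock_path)

The idea is that exclusive file creation is atomic on NFS v3 and later, so only one worker at a time can succeed and then append to the journal.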
