wandb: Network error (SSLError), entering retry loop. #1227

Closed
kiristern opened this issue Dec 19, 2022 · 10 comments
kiristern commented Dec 19, 2022

Issue description

The message "wandb: Network error (SSLError), entering retry loop." keeps appearing and interferes with training.
[Screenshot 2022-12-18 at 21:17:56]

Current behavior

The training still runs and I can see the metrics in the wandb dashboard (wandb: Network error resolved after 0:06:24.504729, resuming normal operation.). However, I think it really slows down the training because this occurs very frequently. The wandb debug.log shows:
Caused by SSLError(SSLError(1, '[SSL: KRB5_S_TKT_NYV] unexpected eof while reading (_ssl.c:1091)

wandb support said (April 2022):

happens as a result of either
(1) Improper installation of SSL on your Python distro, as noted by some SO users here. I would recommend reinstalling Anaconda/your virtual environment and upgrading openssl.

(But I don't think I have permission to do so on the NeuroPoly servers.)
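For reference, a quick way to check which OpenSSL build a given Python environment is linked against (a minimal sketch; the printed version string will differ per environment):

import ssl
print(ssl.OPENSSL_VERSION)  # OpenSSL build loaded by this interpreter, e.g. "OpenSSL 1.1.1 ..."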

Expected behavior

Training runs without interruption.

Steps to reproduce

Running a normal training with the bavaria-quebec preprocessed data: ivadomed --train -c config_Mod3DUnet_ax.json --path-data ../data/ --path-output ../results/

config file
{
    "command": "train",
    "gpu_ids": [0],
    "path_output": "../results/ax_output_run1",
    "model_name": "ModifiedUnet3d_singleContrast",
    "debugging": true,
    "object_detection_params": {
        "object_detection_path": null,
        "safety_factor": [1.0, 1.0, 1.0]
    },
    "wandb": {
        "wandb_api_key": "",
        "project_name": "bavaria",
        "group_name": "lesion_ax",
        "run_name": "ax_run1",
        "log_grads_every": 100
    },
    "loader_parameters": {
        "path_data": ["~/duke/temp/kiri/bavaria-preprocessed"],
        "subject_selection:": {"n": [], "metadata": [], "value": []},
        "target_suffix": ["_lesion-manual"],
        "extensions": [".nii.gz"],
        "roi_params": {
            "suffix": null,
            "slice_filter_roi": null
        },
        "contrast_params": {
            "training_validation": ["T2w"],
            "testing": ["T2w"],
            "balance": {}
        },
        "slice_filter_params": {
            "filter_empty_mask": false,
            "filter_empty_input": false
        },
        "slice_axis": "axial",
        "multichannel": false,
        "soft_gt": false
    },
    "split_dataset": {
        "fname_split": null,
        "random_seed": 42,
        "split_method" : "participant_id",
        "data_testing": {"data_type": null, "data_value":[]},
        "balance": null,
        "train_fraction": 0.6,
        "test_fraction": 0.2
    },
    "training_parameters": {
        "batch_size":    2,
	"loss": {
            "name": "DiceLoss"
        },
        "training_time": {
            "num_epochs": 100,
            "early_stopping_patience": 100,
            "early_stopping_epsilon": 0.001
        },
        "scheduler": {
            "initial_lr": 1e-3,
            "lr_scheduler": {
                "name": "CosineAnnealingLR",
                "base_lr": 1e-5,
                "max_lr": 1e-3
            }
        },
        "balance_samples": {"applied": false, "type": "gt"}
    },
    "default_model": {
        "name": "Unet",
        "dropout_rate": 0.3,
        "bn_momentum": 0.1,
        "final_activation": "sigmoid",
	"is_2d": false,
        "depth": 4
    },
    "Modified3DUNet": {
        "applied": true,
        "length_3D": [160, 160, 720],
        "stride_3D": [80, 80, 360],
        "attention": false,
        "n_filters": 3
    },
    "uncertainty": {
        "epistemic": false,
        "aleatoric": false,
        "n_it": 0
    },
    "postprocessing": {
        "binarize_prediction": {"thr": 0.5},
        "uncertainty": {"thr": -1, "suffix": "_unc-vox.nii.gz"}
    },
    "evaluation_parameters": {},
    "transformation": {
        "Resample": {
            "wspace": 0.5,
            "hspace": 0.5,
            "dspace": 1
        },
        "CenterCrop": {
            "size": [160, 160, 720]
        },
        "RandomAffine": {
            "degrees": 10,
            "scale": [0.3, 0.3, 0.3],
            "translate": [0.1, 0.1, 0.1],
            "applied_to": ["im", "gt"],
            "dataset_type": ["training"]
        },
        "ElasticTransform": {
			"alpha_range": [25.0, 35.0],
			"sigma_range":  [3.5, 4.5],
			"p": 0.5,
            "applied_to": ["im", "gt"],
            "dataset_type": ["training"]
        },
	"RandomReverse": {
	    "applied_to": ["im", "gt"],
	    "dataset_type": ["training"]
	},
	"RandomGamma": {
            "log_gamma_range": [-1.5, 1.5],
            "p": 0.5,
            "applied_to": ["im"],
            "dataset_type": ["training"]
        },
        "RandomBiasField": {
            "coefficients": 0.5,
            "order": 3,
            "p": 0.3,
            "applied_to": ["im"],
            "dataset_type": ["training"]
        },
        "RandomBlur": {
            "sigma_range": [0.0, 1.0],
            "p": 0.3,
            "applied_to": ["im"],
            "dataset_type": ["training"]
        },
        "NumpyToTensor": {},
        "NormalizeInstance": {"applied_to": ["im"]}
    }
}


Environment

System description

NeuroPoly server, Rosenberg, Ubuntu 22.04.1 LTS (GNU/Linux 5.15.0-53-generic x86_64)

Installed packages

On branch mhb/1213-fix-3d-data-augmentation from PR #1222

Output of pip freeze
absl-py==1.1.0
astor==0.8.1
astunparse==1.6.3
awscli==1.22.34
beniget==0.4.1
bids-validator==1.9.9
botocore==1.23.34
brz-etckeeper==0.0.0
cachetools==5.2.0
certifi==2020.6.20
chardet==4.0.0
click==8.0.3
colorama==0.4.4
coloredlogs==15.0.1
command-not-found==0.3
commonmark==0.9.1
cryptography==3.4.8
csv-diff==1.1
cycler==0.11.0
dbus-python==1.2.18
decorator==4.4.2
Deprecated==1.2.13
dictdiffer==0.9.0
dill==0.3.5.1
distlib==0.3.4
distro==1.7.0
distro-info===1.1build1
dnspython==2.1.0
docker-pycreds==0.4.0
docopt==0.6.2
docutils==0.17.1
filelock==3.6.0
flatbuffers==2.0.7
fonttools==4.33.3
formulaic==0.3.4
fsleyes==1.5.0
fsleyes-props==1.8.2
fsleyes-widgets==0.12.3
fslpy==3.9.5
gast==0.4.0
gitdb==4.0.10
GitPython==3.1.29
google-auth==2.8.0
google-auth-oauthlib==0.4.6
google-pasta==0.2.0
gpg===1.16.0-unknown
grpcio==1.47.0
h5py==3.7.0
humanfriendly==10.0
humanize==4.4.0
idna==3.3
imageio==2.22.4
importlib-metadata==4.6.4
interface-meta==1.3.0
iotop==0.6
-e git+https://github.com/ivadomed/ivadomed.git@d6385f1c57b7433a57003167c215f2288db3b631#egg=ivadomed
Jinja2==3.1.2
jmespath==0.10.0
joblib==1.2.0
keras==2.11.0
Keras-Preprocessing==1.1.2
kiwisolver==1.4.3
libclang==14.0.1
loguru==0.6.0
Markdown==3.3.6
MarkupSafe==2.1.1
matplotlib==3.5.2
more-itertools==8.10.0
mpmath==1.2.1
netifaces==0.11.0
networkx==2.8.8
nibabel==3.2.2
num2words==0.5.12
numpy==1.23.5
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
oauthlib==3.2.0
onnxruntime==1.13.1
opt-einsum==3.3.0
osfclient==0.0.5
packaging==21.3
pandas==1.4.4
pathtools==0.1.2
Pillow==9.0.1
platformdirs==2.5.1
ply==3.11
promise==2.3
protobuf==3.19.4
psutil==5.9.4
pyasn1==0.4.8
pyasn1-modules==0.2.8
pybids==0.15.5
Pygments==2.11.2
PyGObject==3.42.1
PyOpenGL==3.1.6
pyparsing==2.4.7
python-apt==2.3.0+ubuntu2.1
python-dateutil==2.8.1
pythran==0.10.0
pytz==2022.6
PyWavelets==1.4.1
PyYAML==5.4.1
requests==2.25.1
requests-oauthlib==1.3.1
requests-toolbelt==0.9.1
rich==12.6.0
roman==3.3
rsa==4.8
s3transfer==0.5.0
scikit-image==0.19.3
scikit-learn==1.2.0
scipy==1.8.0
screen-resolution-extra==0.0.0
seaborn==0.12.1
sentry-sdk==1.11.1
setproctitle==1.3.2
shellingham==1.5.0
shortuuid==1.0.11
SimpleITK==2.2.1
six==1.16.0
smmap==5.0.0
SQLAlchemy==1.3.24
ssh-import-id==5.11
sympy==1.11.1
tensorboard==2.11.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorflow==2.11.0
tensorflow-estimator==2.11.0
tensorflow-io-gcs-filesystem==0.26.0
termcolor==1.1.0
threadpoolctl==3.1.0
tifffile==2022.10.10
torch==1.11.0
torchaudio==0.13.0
torchio==0.18.86
torchvision==0.12.0
tqdm==4.64.0
typer==0.7.0
typing_extensions==4.2.0
ubuntu-drivers-common==0.0.0
ufw==0.36.1
unattended-upgrades==0.1
urllib3==1.26.13
virtualenv==20.13.0+ds
wandb==0.13.7
Werkzeug==2.1.2
wrapt==1.14.1
wxPython==4.0.7
xkit==0.0.0
zipp==1.0.0
@jcohenadad (Member)

(But I don't think I have permission to do so on the NeuroPoly servers.)

I don't think it is related to software installed for all users on rosenberg, because I do not experience this issue. Have you tried reinstalling ivadomed with a fresh venv? Are you using conda? (I am.)

@kanishk16 (Contributor)

@kiristern it looks like it is perhaps the speed of the network, but in any case, as per the docs, it definitely hurts the training time. Moreover, the following workaround might be more relevant when working with large dataset(s):

Initialize the mode as 'offline', as suggested in the docs, when calling init() at:

wandb.init(project=project_name, group=group_name, name=run_name, config=cfg)

which means it would look something like:

wandb.init(project=project_name, group=group_name, name=run_name, config=cfg, mode="offline")

This directs all the wandb logs into a local directory. After training is complete, you can sync that directory to view the logs on the dashboard by executing the wandb sync command with the options suggested here.
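For completeness, a minimal sketch of the offline-mode workaround, using illustrative values taken from the config above (in practice ivadomed passes its own cfg dict and metric names):

import wandb

run = wandb.init(
    project="bavaria",         # illustrative values from the config above
    group="lesion_ax",
    name="ax_run1",
    config={"batch_size": 2},  # placeholder for the full ivadomed cfg dict
    mode="offline",            # log locally under ./wandb/ instead of over the network
)
run.log({"dice_loss": 0.42})   # dummy metric; logging works the same as in online mode
run.finish()

# After training, upload the cached run from a shell with working connectivity:
#   wandb sync wandb/offline-run-<timestamp>-<run_id>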

@kanishk16 (Contributor)

I guess it doesn't make sense to include this in the codebase, at least for now, since in most cases we'd like to have the live mode (which is the default) to see updates on the wandb dashboard in real time.

@kiristern (Author)

Thanks for the suggestion @kanishk16. I am no longer getting the error message after setting 'mode' = 'offline'.

@jcohenadad (Member)

Still, if you want the live mode (which is useful), we need to figure out what is wrong in your config. I don't think it's a network issue because I'm using the same computer and I don't experience this issue.

@jcohenadad (Member)

#1253 could indirectly help

kanishk16 self-assigned this Jan 9, 2023
@kanishk16 (Contributor)

@kiristern I presume you have already tried what Julien suggested, but I still wanted to confirm: does the error persist after installing ivadomed in a fresh conda environment?

@kiristern (Author)

@kanishk16 thanks for following up... yes, I did try (I am also working in a conda env) and it seemed to be working fine for my last training (I was going to comment), but I just started getting the same error again (for another project)! :(

@kiristern (Author) commented Jan 10, 2023

AH! I just remembered that there were problems with reading data stored on duke, so I moved my dataset over and it seems to be working without the error now... hopefully that was indeed the issue and this is the solution (I must have read the data from my home dir for the first run, but then I modified the dataset and put it on my temp folder so that naga could access it too).

@kiristern (Author)

So I started getting the error message again, despite using my dataset that was not on duke. I think what is happening is that upon ssh-ing to a node or opening a new tmux session, the (base) environment is automatically activated, like so:
[Screenshot 2023-01-31 at 13:54:23]

The SSLError appears to occur whenever I run ivadomed --train after conda activate my-env when (base) is already activated. The error does not occur if I conda deactivate out of (base) and then conda activate my-env, i.e.:
[Screenshot 2023-01-31 at 14:03:16]

Does this seem like it could be the source of the error? At least for now, it appears to solve the problem for me.
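For anyone hitting the same thing, a quick diagnostic to confirm which interpreter and SSL stack are actually active (a minimal sketch; paths and versions are illustrative, and my-env is just the environment name used above):

import os
import ssl
import sys

print(sys.executable)              # should point inside my-env, not the (base) install
print(ssl.OPENSSL_VERSION)         # OpenSSL build loaded by this interpreter
print(os.environ.get("CONDA_DEFAULT_ENV"), os.environ.get("CONDA_PREFIX"))  # env the shell reports as active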
