wandb: Network error (SSLError), entering retry loop. #1227

Closed
kiristern opened this issue Dec 19, 2022 · 10 comments
kiristern commented Dec 19, 2022

Issue description

The message "wandb: Network error (SSLError), entering retry loop." keeps appearing and interferes with training.
[Screenshot 2022-12-18 at 21:17:56]

Current behavior

The training still runs and I can see the metrics in the wandb dashboard (wandb: Network error resolved after 0:06:24.504729, resuming normal operation.). However, I think it really slows down the training because this occurs very frequently. The wandb debug.log shows:
Caused by SSLError(SSLError(1, '[SSL: KRB5_S_TKT_NYV] unexpected eof while reading (_ssl.c:1091)

wandb support said (April 2022):

happens as a result of either
(1) Improper installation of SSL on your Python distro, as noted by some SO users here. I would recommend reinstalling Anaconda/your virtual environment and upgrading openssl.

(But I don't think I have permission to do so on the NeuroPoly servers.)
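For reference, a quick way to check which OpenSSL build a given Python environment is linked against (a minimal sketch; the printed version string will differ per environment):

import ssl
print(ssl.OPENSSL_VERSION)  # OpenSSL build loaded by this interpreter, e.g. "OpenSSL 1.1.1 ..."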

Expected behavior

Training runs without interruption.

Steps to reproduce

Running a normal training with the bavaria-quebec preprocessed data: ivadomed --train -c config_Mod3DUnet_ax.json --path-data ../data/ --path-output ../results/

config file
{
    "command": "train",
    "gpu_ids": [0],
    "path_output": "../results/ax_output_run1",
    "model_name": "ModifiedUnet3d_singleContrast",
    "debugging": true,
    "object_detection_params": {
        "object_detection_path": null,
        "safety_factor": [1.0, 1.0, 1.0]
    },
    "wandb": {
        "wandb_api_key": "",
        "project_name": "bavaria",
        "group_name": "lesion_ax",
        "run_name": "ax_run1",
        "log_grads_every": 100
    },
    "loader_parameters": {
        "path_data": ["~/duke/temp/kiri/bavaria-preprocessed"],
        "subject_selection:": {"n": [], "metadata": [], "value": []},
        "target_suffix": ["_lesion-manual"],
        "extensions": [".nii.gz"],
        "roi_params": {
            "suffix": null,
            "slice_filter_roi": null
        },
        "contrast_params": {
            "training_validation": ["T2w"],
            "testing": ["T2w"],
            "balance": {}
        },
        "slice_filter_params": {
            "filter_empty_mask": false,
            "filter_empty_input": false
        },
        "slice_axis": "axial",
        "multichannel": false,
        "soft_gt": false
    },
    "split_dataset": {
        "fname_split": null,
        "random_seed": 42,
        "split_method" : "participant_id",
        "data_testing": {"data_type": null, "data_value":[]},
        "balance": null,
        "train_fraction": 0.6,
        "test_fraction": 0.2
    },
    "training_parameters": {
        "batch_size":    2,
	"loss": {
            "name": "DiceLoss"
        },
        "training_time": {
            "num_epochs": 100,
            "early_stopping_patience": 100,
            "early_stopping_epsilon": 0.001
        },
        "scheduler": {
            "initial_lr": 1e-3,
            "lr_scheduler": {
                "name": "CosineAnnealingLR",
                "base_lr": 1e-5,
                "max_lr": 1e-3
            }
        },
        "balance_samples": {"applied": false, "type": "gt"}
    },
    "default_model": {
        "name": "Unet",
        "dropout_rate": 0.3,
        "bn_momentum": 0.1,
        "final_activation": "sigmoid",
	"is_2d": false,
        "depth": 4
    },
    "Modified3DUNet": {
        "applied": true,
        "length_3D": [160, 160, 720],
        "stride_3D": [80, 80, 360],
        "attention": false,
        "n_filters": 3
    },
    "uncertainty": {
        "epistemic": false,
        "aleatoric": false,
        "n_it": 0
    },
    "postprocessing": {
        "binarize_prediction": {"thr": 0.5},
        "uncertainty": {"thr": -1, "suffix": "_unc-vox.nii.gz"}
    },
    "evaluation_parameters": {},
    "transformation": {
        "Resample": {
            "wspace": 0.5,
            "hspace": 0.5,
            "dspace": 1
        },
        "CenterCrop": {
            "size": [160, 160, 720]
        },
        "RandomAffine": {
            "degrees": 10,
            "scale": [0.3, 0.3, 0.3],
            "translate": [0.1, 0.1, 0.1],
            "applied_to": ["im", "gt"],
            "dataset_type": ["training"]
        },
        "ElasticTransform": {
			"alpha_range": [25.0, 35.0],
			"sigma_range":  [3.5, 4.5],
			"p": 0.5,
            "applied_to": ["im", "gt"],
            "dataset_type": ["training"]
        },
	"RandomReverse": {
	    "applied_to": ["im", "gt"],
	    "dataset_type": ["training"]
	},
	"RandomGamma": {
            "log_gamma_range": [-1.5, 1.5],
            "p": 0.5,
            "applied_to": ["im"],
            "dataset_type": ["training"]
        },
        "RandomBiasField": {
            "coefficients": 0.5,
            "order": 3,
            "p": 0.3,
            "applied_to": ["im"],
            "dataset_type": ["training"]
        },
        "RandomBlur": {
            "sigma_range": [0.0, 1.0],
            "p": 0.3,
            "applied_to": ["im"],
            "dataset_type": ["training"]
        },
        "NumpyToTensor": {},
        "NormalizeInstance": {"applied_to": ["im"]}
    }
}


Environment

System description

NeuroPoly server, Rosenberg, Ubuntu 22.04.1 LTS (GNU/Linux 5.15.0-53-generic x86_64)

Installed packages

On branch mhb/1213-fix-3d-data-augmentation from PR #1222

Output of pip freeze
absl-py==1.1.0
astor==0.8.1
astunparse==1.6.3
awscli==1.22.34
beniget==0.4.1
bids-validator==1.9.9
botocore==1.23.34
brz-etckeeper==0.0.0
cachetools==5.2.0
certifi==2020.6.20
chardet==4.0.0
click==8.0.3
colorama==0.4.4
coloredlogs==15.0.1
command-not-found==0.3
commonmark==0.9.1
cryptography==3.4.8
csv-diff==1.1
cycler==0.11.0
dbus-python==1.2.18
decorator==4.4.2
Deprecated==1.2.13
dictdiffer==0.9.0
dill==0.3.5.1
distlib==0.3.4
distro==1.7.0
distro-info===1.1build1
dnspython==2.1.0
docker-pycreds==0.4.0
docopt==0.6.2
docutils==0.17.1
filelock==3.6.0
flatbuffers==2.0.7
fonttools==4.33.3
formulaic==0.3.4
fsleyes==1.5.0
fsleyes-props==1.8.2
fsleyes-widgets==0.12.3
fslpy==3.9.5
gast==0.4.0
gitdb==4.0.10
GitPython==3.1.29
google-auth==2.8.0
google-auth-oauthlib==0.4.6
google-pasta==0.2.0
gpg===1.16.0-unknown
grpcio==1.47.0
h5py==3.7.0
humanfriendly==10.0
humanize==4.4.0
idna==3.3
imageio==2.22.4
importlib-metadata==4.6.4
interface-meta==1.3.0
iotop==0.6
-e git+https://github.com/ivadomed/ivadomed.git@d6385f1c57b7433a57003167c215f2288db3b631#egg=ivadomed
Jinja2==3.1.2
jmespath==0.10.0
joblib==1.2.0
keras==2.11.0
Keras-Preprocessing==1.1.2
kiwisolver==1.4.3
libclang==14.0.1
loguru==0.6.0
Markdown==3.3.6
MarkupSafe==2.1.1
matplotlib==3.5.2
more-itertools==8.10.0
mpmath==1.2.1
netifaces==0.11.0
networkx==2.8.8
nibabel==3.2.2
num2words==0.5.12
numpy==1.23.5
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
oauthlib==3.2.0
onnxruntime==1.13.1
opt-einsum==3.3.0
osfclient==0.0.5
packaging==21.3
pandas==1.4.4
pathtools==0.1.2
Pillow==9.0.1
platformdirs==2.5.1
ply==3.11
promise==2.3
protobuf==3.19.4
psutil==5.9.4
pyasn1==0.4.8
pyasn1-modules==0.2.8
pybids==0.15.5
Pygments==2.11.2
PyGObject==3.42.1
PyOpenGL==3.1.6
pyparsing==2.4.7
python-apt==2.3.0+ubuntu2.1
python-dateutil==2.8.1
pythran==0.10.0
pytz==2022.6
PyWavelets==1.4.1
PyYAML==5.4.1
requests==2.25.1
requests-oauthlib==1.3.1
requests-toolbelt==0.9.1
rich==12.6.0
roman==3.3
rsa==4.8
s3transfer==0.5.0
scikit-image==0.19.3
scikit-learn==1.2.0
scipy==1.8.0
screen-resolution-extra==0.0.0
seaborn==0.12.1
sentry-sdk==1.11.1
setproctitle==1.3.2
shellingham==1.5.0
shortuuid==1.0.11
SimpleITK==2.2.1
six==1.16.0
smmap==5.0.0
SQLAlchemy==1.3.24
ssh-import-id==5.11
sympy==1.11.1
tensorboard==2.11.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorflow==2.11.0
tensorflow-estimator==2.11.0
tensorflow-io-gcs-filesystem==0.26.0
termcolor==1.1.0
threadpoolctl==3.1.0
tifffile==2022.10.10
torch==1.11.0
torchaudio==0.13.0
torchio==0.18.86
torchvision==0.12.0
tqdm==4.64.0
typer==0.7.0
typing_extensions==4.2.0
ubuntu-drivers-common==0.0.0
ufw==0.36.1
unattended-upgrades==0.1
urllib3==1.26.13
virtualenv==20.13.0+ds
wandb==0.13.7
Werkzeug==2.1.2
wrapt==1.14.1
wxPython==4.0.7
xkit==0.0.0
zipp==1.0.0
@jcohenadad (Member)

(But I don't think I have permission to do so on the NeuroPoly servers.)

I don't think it is related to software installed for all users on rosenberg, because I do not experience this issue. Have you tried reinstalling ivadomed with a fresh venv? Are you using conda? (I am.)

@kanishk16 (Contributor)

@kiristern it looks like it is perhaps the speed of the network, but in any case, as per the docs, it definitely hurts the training time. Moreover, the following workaround might be more relevant when working with large dataset(s):

Initialize the mode as 'offline', as suggested in the docs, when calling init() at:

wandb.init(project=project_name, group=group_name, name=run_name, config=cfg)

which means it would look something like:

wandb.init(project=project_name, group=group_name, name=run_name, config=cfg, mode="offline")

This directs all the wandb logs into a local directory. After training is complete, you can sync that directory to view the logs on the dashboard by executing the wandb sync command with the options suggested here.
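For completeness, a minimal sketch of the offline-mode workaround, using illustrative values taken from the config above (in practice ivadomed passes its own cfg dict and metric names):

import wandb

run = wandb.init(
    project="bavaria",         # illustrative values from the config above
    group="lesion_ax",
    name="ax_run1",
    config={"batch_size": 2},  # placeholder for the full ivadomed cfg dict
    mode="offline",            # log locally under ./wandb/ instead of over the network
)
run.log({"dice_loss": 0.42})   # dummy metric; logging works the same as in online mode
run.finish()

# After training, upload the cached run from a shell with working connectivity:
#   wandb sync wandb/offline-run-<timestamp>-<run_id>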

@kanishk16 (Contributor)

I guess it doesn't make sense to include this in the codebase, at least for now, since in most cases we'd like to have the live mode (which is the default) to see updates on the wandb dashboard in real time.

@kiristern (Author)

Thanks for the suggestion @kanishk16. I am no longer getting the error message after setting 'mode' = 'offline'.

@jcohenadad (Member)

Still, if you want the live mode (which is useful), we need to figure out what is wrong in your config. I don't think it's a network issue because I'm using the same computer and I don't experience this issue.

@jcohenadad (Member)

#1253 could indirectly help

kanishk16 self-assigned this Jan 9, 2023
@kanishk16 (Contributor)

@kiristern I presume you have already tried what Julien suggested, but I still wanted to confirm: does the error persist after installing ivadomed in a fresh conda environment?

@kiristern (Author)

@kanishk16 thanks for following up... yes, I did try (I am also working in a conda env) and it seemed to be working fine for my last training (I was going to comment), but I just started getting the same error again (for another project)! :(

@kiristern (Author) commented Jan 10, 2023

AH! I just remembered that there were problems with reading data stored on duke, so I moved my dataset over and it seems to be working without the error now... hopefully that was indeed the issue and this is the solution (I must have read the data from my home dir for the first run, but then I modified the dataset and put it on my temp folder so that naga could access it too).

@kiristern (Author)

So I started getting the error message again, despite using my dataset that was not on duke. I think what is happening is that upon ssh-ing to a node or opening a new tmux session, the (base) environment is automatically activated, like so:
[Screenshot 2023-01-31 at 13:54:23]

The SSLError appears to occur whenever I run ivadomed --train after conda activate my-env when (base) is already activated. The error does not occur if I conda deactivate out of (base) and then conda activate my-env, i.e.:
[Screenshot 2023-01-31 at 14:03:16]

Does this seem like it could be the source of the error? At least for now, it appears to solve the problem for me.
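For anyone hitting the same thing, a quick diagnostic to confirm which interpreter and SSL stack are actually active (a minimal sketch; paths and versions are illustrative, and my-env is just the environment name used above):

import os
import ssl
import sys

print(sys.executable)              # should point inside my-env, not the (base) install
print(ssl.OPENSSL_VERSION)         # OpenSSL build loaded by this interpreter
print(os.environ.get("CONDA_DEFAULT_ENV"), os.environ.get("CONDA_PREFIX"))  # env the shell reports as active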
