Question: Sceneflow scheme, Bug: Loss is NaN, stop training #11

Closed
lebionick opened this issue Jan 18, 2021 · 26 comments
Labels: bug (Something isn't working) · question (Further information is requested)

lebionick commented Jan 18, 2021:

Hello,
I'm trying to pretrain the network on SceneFlow, but the way my folders are organized is quite different from what the code expects. Could you please tell me exactly which data you downloaded? DispNet/FlowNet2.0 dataset subsets -> RGB images (cleanpass), Disparity, Disparity Occlusions from here?

mli0603 (Owner) commented Jan 18, 2021:

Hi @lebionick, I downloaded the finalpass and disparity from "Full datasets". The occlusions are from the "DispNet/FlowNet2.0 dataset subsets".

mli0603 added the "question" label on Jan 18, 2021
lebionick (Author) commented:

@mli0603 Isn't DispNet/FlowNet2.0 a subset of "Full datasets"? So it doesn't contain occlusion maps for all samples from "Full datasets"?

mli0603 (Owner) commented Jan 19, 2021:

Hi @lebionick,

  • You are right, DispNet/FlowNet2.0 is a subset of Full datasets.
  • However,
    • there is no occlusion data for Full datasets, only images and disparities;
    • there is no finalpass (which has more realistic rendering) in DispNet/FlowNet2.0.

Therefore I had to download from both separately. That is why, in my dataloader, the occlusion folder is structured like DispNet/FlowNet2.0, while the images and disparities are structured like Full datasets. The good news is that the training sets overlap almost entirely, while the evaluation sets have some differences. So what I did was take the provided train/eval lists here and here to make sure the Full datasets match the DispNet/FlowNet2.0 subset (which is smaller; a sketch of the matching step is below). If you are OK with the cleanpass provided in DispNet/FlowNet2.0, you can use that too, but I cannot guarantee the result will be the same.
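
A minimal sketch of that matching step (hypothetical list-file name and path format, not the repo's actual code):

# keep only the Full-datasets samples that appear in the provided subset list
full_samples = ["A/0000/left/0006.png", "A/0000/left/0007.png"]   # hypothetical
subset_keys = {line.strip() for line in open("train_list.txt")}   # hypothetical file
full_samples = [s for s in full_samples if s in subset_keys]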

I hope this is clearer ;) If not, let me know

lebionick (Author) commented:

@mli0603 Yes, thank you!

lebionick (Author) commented:

Well, I could not make it work, so I decided to write my own dataset class; maybe it will be convenient for someone else :)
(DispNet/FlowNet2.0 dataset subsets -> RGB images (cleanpass), Disparity, Disparity Occlusions from here)

import numpy as np
from pathlib import Path
from PIL import Image
from torch.utils import data

# Compose, OneOf, the *Stereo transforms, horizontal_flip, random_crop,
# augment and readPFM are the augmentation/IO helpers used by this repo's
# existing dataloaders.

class DispNetDataset(data.Dataset):
    def __init__(self, root_dir, mode="train"):
        data_dict = {}
        
        root_dir = Path(root_dir)
        mode_dir = root_dir / mode
        for data_type_folder_name in ("frame_finalpass", "disparity_occlusions", "disparity"):
            type_dir = mode_dir / data_type_folder_name
            for side in ("left", "right"):
                side_dir = type_dir / side
                for path in side_dir.iterdir():
                    stem = path.stem
                    if stem not in data_dict:
                        data_dict[stem] = {}
                        
                    stem_dict = data_dict[stem]
                    if data_type_folder_name not in stem_dict:
                        stem_dict[data_type_folder_name] = {}
                    
                    type_dict = stem_dict[data_type_folder_name]
                    type_dict[side] = path
        self.data_dict = data_dict
        self.idx_to_key = list(data_dict.keys())
        self.mode = mode
        self.transformation = self.create_transformation(mode)

    @staticmethod
    def create_transformation(mode):
        if mode == 'train':
            transformation = Compose([
                RandomShiftRotate(always_apply=True),
                RGBShiftStereo(always_apply=True, p_asym=0.3),
                OneOf([
                    GaussNoiseStereo(always_apply=True, p_asym=1.0),
                    RandomBrightnessContrastStereo(always_apply=True, p_asym=0.5)
                ], p=1.0)
            ])
        else:
            transformation = None
        return transformation
    

    def __len__(self):
        return len(self.data_dict)


    def __getitem__(self, idx):
        result = {}
        
        sample_dict = self.data_dict[self.idx_to_key[idx]]
        
        left_fname = sample_dict["frame_finalpass"]["left"]
        right_fname = sample_dict["frame_finalpass"]["right"]
        result['left'] = np.array(Image.open(left_fname)).astype(np.uint8)[..., :3]
        result['right'] = np.array(Image.open(right_fname)).astype(np.uint8)[..., :3]

        occ_left_fname = sample_dict["disparity_occlusions"]["left"]
        occ_right_fname = sample_dict["disparity_occlusions"]["right"]
        occ_left = np.array(Image.open(occ_left_fname)).astype(bool)   # np.bool is deprecated
        occ_right = np.array(Image.open(occ_right_fname)).astype(bool)

        disp_left_fname = sample_dict["disparity"]["left"]
        disp_right_fname = sample_dict["disparity"]["right"]
        disp_left, _ = readPFM(disp_left_fname)
        disp_right, _ = readPFM(disp_right_fname)

        if self.mode == "train":
            # horizontal flip
            result['left'], result['right'], result['occ_mask'], result['occ_mask_right'], disp, disp_right \
                = horizontal_flip(result['left'], result['right'], occ_left, occ_right, disp_left, disp_right, self.mode)
            result['disp'] = np.nan_to_num(disp, nan=0.0)        # replace NaN GT values with 0
            result['disp_right'] = np.nan_to_num(disp_right, nan=0.0)

            # random crop        
            result = random_crop(360, 640, result, self.mode)
        else:
            result['occ_mask'] = occ_left
            result['occ_mask_right'] = occ_right
            result['disp'] = disp_left
            result['disp_right'] = disp_right
        
        result = augment(result, self.transformation)

        return result

lebionick (Author) commented:

But after I launched training with this dataloader I get: "Loss is nan, stopping training".
What can I do about it?

mli0603 (Owner) commented Jan 19, 2021:

Thank you so much for sharing your implementation. I really appreciate it!

For your error:

  • Can you first check if things are loaded correctly, i.e. that there are no NaNs in your data?
  • Does it happen from the very beginning (i.e. on every sample) or randomly? If it is random, are you using apex? There is a bug that prevents training if you are not using apex. For more details, please see Train crashes inside the Attention module #5.

Let me know if any of the above works for you.

lebionick (Author) commented:

I enabled apex:

Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Start training

but the error persists.
And it happens at the first batch (since batches are sampled randomly, I believe it happens with any batch).
I checked for NaNs with this code:

import numpy as np
import torch

def checker(tensor):
    """Return True if the tensor/array contains any NaNs."""
    if isinstance(tensor, torch.Tensor):
        with torch.no_grad():
            return (torch.isnan(tensor).sum() > 0).item()
    elif isinstance(tensor, np.ndarray):
        return np.isnan(tensor).sum() > 0
    else:
        raise NotImplementedError()

print(any(map(checker, (left, right, sampled_cols, sampled_rows, disp, occ_mask, occ_mask_right))))

in forward_pass.py, and there are no NaNs in the data.
I've even dumped it: https://files.sberdisk.ru/s/knCQfqgGbXMa5az

mli0603 (Owner) commented Jan 19, 2021:

Thanks for the input file. I am looking into this now.

mli0603 (Owner) commented Jan 19, 2021:

Hi @lebionick, can you:

  • Give me your environment configuration (Torch version, etc.)?
  • Confirm if you are able to run inference using the provided weight and data?

lebionick (Author) commented:

@mli0603

Package                   Version             Location
------------------------- ------------------- ------------------------------------
absl-py                   0.10.0
adal                      1.2.4
aiohttp                   3.7.3
albumentations            0.5.2
alembic                   1.4.1
apex                      0.1
appdirs                   1.4.4
argon2-cffi               20.1.0
astunparse                1.6.3
async-generator           1.10
async-timeout             3.0.1
attrs                     20.3.0
audioread                 2.1.8
awscli                    1.18.157
azure-common              1.1.25
azure-storage-blob        2.1.0
azure-storage-common      2.1.0
backcall                  0.2.0
bitmath                   1.3.3.1
bleach                    3.2.1
bokeh                     2.2.3
boto3                     1.15.16
botocore                  1.18.16
brotlipy                  0.7.0
cachetools                4.1.1
certifi                   2020.12.5
cffi                      1.14.3
chardet                   3.0.4
click                     7.1.2
cloudpickle               1.6.0
colorama                  0.4.3
conda                     4.8.5
conda-package-handling    1.6.1
configparser              5.0.1
cryptography              3.1.1
cycler                    0.10.0
databricks-cli            0.12.2
decorator                 4.4.2
defusedxml                0.6.0
docker                    4.3.1
docutils                  0.15.2
empty-trash               0.1.0               /tmp/.jupyter/plugins/nb_empty_trash
entrypoints               0.3
Flask                     1.1.2
future                    0.18.2
gast                      0.3.3
gitdb                     4.0.5
GitPython                 3.1.9
google-api-core           1.22.4
google-auth               1.22.1
google-auth-oauthlib      0.4.1
google-cloud-core         1.4.3
google-cloud-language     1.3.0
google-cloud-storage      1.31.2
google-crc32c             1.0.0
google-pasta              0.2.0
google-resumable-media    1.1.0
googleapis-common-protos  1.52.0
gorilla                   0.3.0
graphviz                  0.8.4
grpcio                    1.32.0
gunicorn                  20.0.4
h5py                      2.10.0
horovod                   0.20.3
idna                      2.10
imageio                   2.9.0
imgaug                    0.4.0
importlib-metadata        2.0.0
inflect                   4.1.0
ipykernel                 5.4.2
ipython                   7.19.0
ipython-genutils          0.2.0
itsdangerous              1.1.0
jedi                      0.17.2
Jinja2                    2.11.2
jmespath                  0.10.0
joblib                    0.17.0
json5                     0.9.5
jsonschema                3.2.0
jupyter-client            6.1.7
jupyter-core              4.7.0
jupyter-server-proxy      1.3.2
jupyter-tensorboard       0.2.2a0
jupyterlab                2.2.9
jupyterlab-nvdashboard    0.4.0
jupyterlab-pygments       0.1.2
jupyterlab-server         1.2.0
Keras                     2.4.3
Keras-Preprocessing       1.1.2
kfserving                 0.4.0
kiwisolver                1.2.0
kubernetes                10.0.1
librosa                   0.8.0
llvmlite                  0.34.0
Mako                      1.1.3
Markdown                  3.3.1
MarkupSafe                1.1.1
matplotlib                3.3.2
minio                     6.0.0
mistune                   0.8.4
mlflow                    1.7.2
mpi4py                    3.0.3
multidict                 5.1.0
mxnet-cu101mkl            1.6.0.post0
natsort                   7.1.0
nbclient                  0.5.1
nbconvert                 6.0.7
nbformat                  5.0.8
nest-asyncio              1.4.3
networkx                  2.5
nltk                      3.5
notebook                  6.1.4
npm                       0.1.1
numba                     0.51.2
numpy                     1.18.5
oauthlib                  3.1.0
opencv-python             4.5.1.48
opencv-python-headless    4.5.1.48
opt-einsum                3.3.0
optional-django           0.1.0
packaging                 20.4
pandas                    1.1.3
pandocfilters             1.4.3
parso                     0.7.1
pexpect                   4.8.0
pickleshare               0.7.5
Pillow                    7.2.0
pip                       20.2.3
pooch                     1.2.0
portalocker               2.0.0
prometheus-client         0.8.0
prometheus-flask-exporter 0.18.1
prompt-toolkit            3.0.8
protobuf                  3.13.0
psutil                    5.7.2
ptyprocess                0.6.0
pyarrow                   1.0.1
pyasn1                    0.4.8
pyasn1-modules            0.2.8
pycosat                   0.6.3
pycparser                 2.20
Pygments                  2.7.3
PyJWT                     1.7.1
pynvml                    8.0.4
pyOpenSSL                 19.1.0
pyparsing                 2.4.7
pyrsistent                0.17.3
PySocks                   1.7.1
python-dateutil           2.8.1
python-editor             1.0.4
python-speech-features    0.6
pytz                      2020.1
PyWavelets                1.1.1
PyYAML                    5.3.1
pyzmq                     20.0.0
querystring-parser        1.2.4
regex                     2020.10.11
requests                  2.24.0
requests-oauthlib         1.3.0
resampy                   0.2.2
rsa                       4.5
ruamel-yaml-conda         0.15.80
s3transfer                0.3.3
sacrebleu                 1.4.14
scikit-image              0.18.1
scikit-learn              0.23.2
scipy                     1.4.1
Send2Trash                1.5.0
sentencepiece             0.1.91
setuptools                49.6.0.post20201009
Shapely                   1.7.1
simpervisor               0.3
simplejson                3.17.2
six                       1.15.0
smmap                     3.0.4
SoundFile                 0.10.3.post1
sox                       1.4.1
SQLAlchemy                1.3.13
sqlparse                  0.4.1
table-logger              0.3.6
tabulate                  0.8.7
tenacity                  6.2.0
tensorboard               2.3.0
tensorboard-plugin-wit    1.7.0
tensorboardX              1.9
tensorflow-estimator      2.3.0
tensorflow-gpu            2.3.0
termcolor                 1.1.0
terminado                 0.9.1
testpath                  0.4.4
threadpoolctl             2.1.0
tifffile                  2021.1.14
torch                     1.6.0+cu101
torchvision               0.7.0+cu101
tornado                   6.0.4
tqdm                      4.50.2
traitlets                 5.0.5
typer                     0.3.2
typing                    3.7.4.3
typing-extensions         3.7.4.3
urllib3                   1.25.10
wcwidth                   0.2.5
webencodings              0.5.1
websocket-client          0.57.0
Werkzeug                  1.0.1
wheel                     0.35.1
wrapt                     1.12.1
xgboost                   1.2.1
yarl                      1.6.3
zipp                      3.3.0

I ran inference_example.ipynb and it works just fine with both pretrained weights, on KITTI 2015 and even on my custom pair of images. I also applied the model to the input that I published above. No NaNs in the output.

Btw, where can I find the parameters to run inference using the sttr-light weights?

lebionick (Author) commented:

I checked for NaNs everywhere :)

checking if there are nans in input...
False
checking if there are nans in model...
False
checking if there are nans in outputs...
False
checking if there are nans in losses...
OrderedDict([('rr', tensor(0.3339, device='cuda:0', grad_fn=<MeanBackward0>)), ('l1_raw', tensor(nan, device='cuda:0', grad_fn=<SmoothL1LossBackward>)), ('l1', tensor(nan, device='cuda:0', grad_fn=<SmoothL1LossBackward>)), ('occ_be', tensor(0.6268, device='cuda:0', grad_fn=<MeanBackward0>)), ('aggregated', tensor(nan, device='cuda:0', grad_fn=<AddBackward0>)), ('error_px', 0), ('total_px', 0), ('epe', tensor(nan, device='cuda:0')), ('iou', tensor(0.1047, device='cuda:0'))])

mli0603 added the "bug" label on Jan 20, 2021
mli0603 (Owner) commented Jan 20, 2021:

STTR-Light can be downloaded from Google Drive here. Remember to check out the sttr-light branch.

This is so weird... One thing I can see is that you are using Torch 1.6.0. Do you mind installing 1.5.1 and checking if you still have the issue? I found a version compatibility issue in #8. But to be honest, I don't see why it would make the gradient NaN.

  • Does training on KITTI 2015 also give you this error? I wonder if we can narrow down the bug to the dataset class that you wrote.
  • Can you also pull the code again, just in case something is out of sync?
  • The input above causes the training to crash, right? If not, can you share a dumped input that crashes the code?
  • One last thing to try is to disable apex and then enable anomaly detection using this (see the sketch below). This will tell us where the NaN appears during backprop.
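
For reference, a minimal sketch of enabling PyTorch's anomaly detection (e.g. near the start of main.py, with apex disabled):

import torch

# backward() will now raise an error at the exact op that produced a NaN
# gradient (noticeable slowdown; use for debugging only)
torch.autograd.set_detect_anomaly(True)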

mli0603 reopened this on Jan 20, 2021
mli0603 changed the title from "Question: Sceneflow scheme" to "Question: Sceneflow scheme, Bug: Loss is NaN, stop training" on Jan 20, 2021
lebionick (Author) commented:

Ok, I'll try all of these and come back :)

lebionick (Author) commented Jan 20, 2021:

@mli0603
I cloned a fresh master and launched training (kitti_finetune.sh) on KITTI 2015: everything is OK and training proceeds, even with random initialization.
Instead of using my custom class, I specified sceneflow_toy in pretrain.sh along with the DispNet directory. The same error occurs:

Start training
Epoch: 0
  0%|                                                                                                                                                | 0/21818 [00:00<?, ?it/s]
Loss is nan, stopping training
#!/usr/bin/env bash
export CUDA_VISIBLE_DEVICES=0  # needs `export` (or an inline prefix) to reach python
python main.py  --epochs 15\
                --batch_size 1\
                --checkpoint pretrain\
                --pre_train\
                --num_workers 2\
                --dataset sceneflow_toy\
                --dataset_directory /home/jovyan/sceneflow2/FlyingThings3D_subset/train

lebionick (Author) commented:

"I don't see why it will make gradient NaN."

The NaNs appear in the loss, not in the gradients. I can dump the inputs and outputs that produce this result in the criterion function.

lebionick (Author) commented:

Alright, I localized it to the criterion function:
criterion = loss.Criterion(3, -1, {'rr': 1.0, 'l1_raw': 1.0, 'l1': 1.0, 'occ_be': 1.0})
Here you can download the files (inputs.pkl and outputs.pkl) to reproduce this.
inputs.pkl contains a dict instead of a NestedTensor (to avoid an environment error), so after loading it with pickle it needs to be packed again. Btw, I prefer using a namedtuple for this:

import pickle
from collections import namedtuple

inputs = pickle.load(open("inputs.pkl", "rb"))  # plain dict: field name -> tensor
NestedTensor = namedtuple("NestedTensor", list(inputs.keys()))
inputs = NestedTensor(*list(inputs.values()))

and criterion(inputs, outputs) gives:

OrderedDict([('rr', tensor(0.2853, grad_fn=<MeanBackward0>)),
             ('l1_raw', tensor(nan, grad_fn=<SmoothL1LossBackward>)),
             ('l1', tensor(nan, grad_fn=<SmoothL1LossBackward>)),
             ('occ_be', tensor(0.6261, grad_fn=<MeanBackward0>)),
             ('aggregated', tensor(nan, grad_fn=<AddBackward0>)),
             ('error_px', 0),
             ('total_px', 0),
             ('epe', tensor(nan)),
             ('iou', tensor(0.1535))])

mli0603 (Owner) commented Jan 20, 2021:

Thanks! Given that all the NaNs are in the disparity terms, apparently something is wrong with the disparity estimation or the GT disparity. I think we have narrowed it down, and hopefully I can pin down the bug.

I wonder why this happens, since it has been working. It may relate to a recent commit that I made along with sttr-light that breaks things, which I should either revert or patch...

mli0603 (Owner) commented Jan 20, 2021:

Ok, so the loaded GT disparity (inputs.disp) is 0.0 everywhere. Does this ring a bell?
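
For reference, a minimal sketch of that check against the inputs.pkl dump shared above (assuming the dict uses a 'disp' key, as in the field list earlier):

import pickle
import torch

inputs = pickle.load(open("inputs.pkl", "rb"))  # plain dict: field name -> tensor
print(torch.all(inputs["disp"] == 0))           # tensor(True): GT disparity is all zeros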

lebionick (Author) commented:

@mli0603
Oh, yes. I underestimated the insanity of SceneFlow... The DispNet disparity values use the reversed sign convention compared to the base dataset!! For left images the disparity values are negative, and for right images they're positive. During preprocessing, your code zeros out negative disparity values, which is why the loaded GT disparity ends up all zero.
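
A minimal sketch of the mismatch (hypothetical values):

import numpy as np

# DispNet/FlowNet2.0 subset: left-view disparities come out negative
disp_left = np.array([[-35.2, -12.7]], dtype=np.float32)
disp_left = -disp_left  # negate so the preprocessing (which zeroes negatives) keeps them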

mli0603 (Owner) commented Jan 20, 2021:

Ah, makes sense!

So this is not a bug then. Good to know ;) I'll close this for now.

mli0603 closed this as completed on Jan 20, 2021
lebionick (Author) commented:

Thank you so much for your support! Now I'm able to launch training! :)

mli0603 (Owner) commented Jan 20, 2021:

If this is not too much to ask, do you mind re-sharing your fixed DispNetDataset here? Just in case someone else stumbles across the same issue.

lebionick (Author) commented:

Sure! The fix is just adding a minus sign in front of the disparities.
I hope there are no more tricks in this dataset, but if I find any, I'll share them.

# imports and repo helpers are the same as for the original version above

class DispNetDataset(data.Dataset):
    def __init__(self, root_dir, mode="train"):
        data_dict = {}
        
        root_dir = Path(root_dir)
        mode_dir = root_dir / mode
        for data_type_folder_name in ("frame_finalpass", "disparity_occlusions", "disparity"):
            type_dir = mode_dir / data_type_folder_name
            for side in ("left", "right"):
                side_dir = type_dir / side
                for path in side_dir.iterdir():
                    stem = path.stem
                    if stem not in data_dict:
                        data_dict[stem] = {}
                        
                    stem_dict = data_dict[stem]
                    if data_type_folder_name not in stem_dict:
                        stem_dict[data_type_folder_name] = {}
                    
                    type_dict = stem_dict[data_type_folder_name]
                    type_dict[side] = path
        self.data_dict = data_dict
        self.idx_to_key = list(data_dict.keys())
        self.mode = mode
        self.transformation = self.create_transformation(mode)

    @staticmethod
    def create_transformation(mode):
        if mode == 'train':
            transformation = Compose([
                RandomShiftRotate(always_apply=True),
                RGBShiftStereo(always_apply=True, p_asym=0.3),
                OneOf([
                    GaussNoiseStereo(always_apply=True, p_asym=1.0),
                    RandomBrightnessContrastStereo(always_apply=True, p_asym=0.5)
                ], p=1.0)
            ])
        else:
            transformation = None
        return transformation
    

    def __len__(self):
        return len(self.data_dict)


    def __getitem__(self, idx):
        result = {}
        
        sample_dict = self.data_dict[self.idx_to_key[idx]]
        
        left_fname = sample_dict["frame_finalpass"]["left"]
        right_fname = sample_dict["frame_finalpass"]["right"]
        result['left'] = np.array(Image.open(left_fname)).astype(np.uint8)[..., :3]
        result['right'] = np.array(Image.open(right_fname)).astype(np.uint8)[..., :3]

        occ_left_fname = sample_dict["disparity_occlusions"]["left"]
        occ_right_fname = sample_dict["disparity_occlusions"]["right"]
        occ_left = np.array(Image.open(occ_left_fname)).astype(bool)   # np.bool is deprecated
        occ_right = np.array(Image.open(occ_right_fname)).astype(bool)

        disp_left_fname = sample_dict["disparity"]["left"]
        disp_right_fname = sample_dict["disparity"]["right"]
        disp_left, _ = readPFM(disp_left_fname)
        disp_right, _ = readPFM(disp_right_fname)

        if self.mode == "train":
            # horizontal flip
            result['left'], result['right'], result['occ_mask'], result['occ_mask_right'], disp, disp_right \
                = horizontal_flip(result['left'], result['right'], occ_left, occ_right, disp_left, disp_right, self.mode)
            # the fix: flip the sign, since the DispNet subset stores disparities with
            # the opposite convention and the preprocessing zeroes out negative values
            result['disp'] = -np.nan_to_num(disp, nan=0.0)
            result['disp_right'] = -np.nan_to_num(disp_right, nan=0.0)

            # random crop        
            result = random_crop(360, 640, result, self.mode)
        else:
            result['occ_mask'] = occ_left
            result['occ_mask_right'] = occ_right
            result['disp'] = -disp_left        # same sign flip as in the train branch
            result['disp_right'] = -disp_right
        
        result = augment(result, self.transformation)

        return result
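
A hypothetical usage sketch (the path follows the pretrain.sh example above, minus the trailing /train, since the class appends the mode itself):

dataset = DispNetDataset("/home/jovyan/sceneflow2/FlyingThings3D_subset", mode="train")
sample = dataset[0]  # dict with 'left', 'right', 'disp', 'disp_right', 'occ_mask', 'occ_mask_right'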

lebionick (Author) commented Jan 20, 2021:

By the way, what do you think: do we need to handle the case where every disparity can be zero? For example, if the data contains only very distant objects, or for strange augmentations like duplicating the left image?

mli0603 (Owner) commented Jan 20, 2021:

I think the best way to do it is to set invalid disparities to -1 instead of 0.

But if every disparity is 0, then the left and right images are identical. I don't see such a case happening in a stereo setting (unless it is intended).
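
A minimal sketch of that convention (hypothetical values, not the repo's actual preprocessing):

import numpy as np

# hypothetical GT disparity map; NaN marks missing ground truth
disp = np.array([[15.3, np.nan], [0.0, 22.1]], dtype=np.float32)

disp = np.nan_to_num(disp, nan=-1.0)  # NaN -> invalid
disp[disp <= 0] = -1.0                # flag zero/negative values as invalid too
valid = disp > 0                      # losses can then be restricted to valid pixels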
