
All data nan #118

Closed
3bobo opened this issue Aug 18, 2018 · 10 comments

@3bobo

3bobo commented Aug 18, 2018

I've installed all the required packages, but when I run train.py --arch fcn8s (with ade20k set as the default dataset), I get some warnings and every result is nan, like this:

/home/zqk/work/pytorch_seg/pytorch-semseg/ptsemseg/metrics.py:31: RuntimeWarning: invalid value encountered in double_scalars
acc = np.diag(hist).sum() / hist.sum()
/home/zqk/work/pytorch_seg/pytorch-semseg/ptsemseg/metrics.py:32: RuntimeWarning: invalid value encountered in divide
acc_cls = np.diag(hist) / hist.sum(axis=1)
/home/zqk/work/pytorch_seg/pytorch-semseg/ptsemseg/metrics.py:33: RuntimeWarning: Mean of empty slice
acc_cls = np.nanmean(acc_cls)
/home/zqk/work/pytorch_seg/pytorch-semseg/ptsemseg/metrics.py:34: RuntimeWarning: invalid value encountered in divide
iu = np.diag(hist) / (hist.sum(axis=1) + hist.sum(axis=0) - np.diag(hist))
/home/zqk/work/pytorch_seg/pytorch-semseg/ptsemseg/metrics.py:35: RuntimeWarning: Mean of empty slice
mean_iu = np.nanmean(iu)
/home/zqk/work/pytorch_seg/pytorch-semseg/ptsemseg/metrics.py:36: RuntimeWarning: invalid value encountered in divide
freq = hist.sum(axis=1) / hist.sum()
/home/zqk/work/pytorch_seg/pytorch-semseg/ptsemseg/metrics.py:37: RuntimeWarning: invalid value encountered in greater
fwavacc = (freq[freq > 0] * iu[freq > 0]).sum()
('FreqW Acc : \t', 0.0)
('Overall Acc: \t', nan)
('Mean Acc : \t', nan)
('Mean IoU : \t', nan)
0it [00:00, ?it/s]
('FreqW Acc : \t', 0.0)
('Overall Acc: \t', nan)
('Mean Acc : \t', nan)
('Mean IoU : \t', nan)
0it [00:00, ?it/s]
('FreqW Acc : \t', 0.0)
('Overall Acc: \t', nan)
('Mean Acc : \t', nan)
('Mean IoU : \t', nan)
0it [00:00, ?it/s]
('FreqW Acc : \t', 0.0)
('Overall Acc: \t', nan)
('Mean Acc : \t', nan)
('Mean IoU : \t', nan)

What's wrong here, and how can I fix it? Thanks for your help.

@adam9500370
Contributor

adam9500370 commented Aug 18, 2018

0it [00:00, ?it/s]
It seems that the validation data path is wrong (0 files loaded).

https://github.com/meetshah1995/pytorch-semseg/blob/6d2a31c4e389dfe4c62301748a46d5db49019ce6/train.py#L30
https://github.com/meetshah1995/pytorch-semseg/blob/6d2a31c4e389dfe4c62301748a46d5db49019ce6/ptsemseg/loader/ade20k_loader.py#L25-L27
Due to the inconsistent split definitions between train.py and ptsemseg/loader/ade20k_loader.py, we should replace those lines in ptsemseg/loader/ade20k_loader.py with the following, so that the correct training and validation file lists are stored.

split_dict = {"training": "training", "val": "validation",}
for split in split_dict:
    file_list = recursive_glob(rootdir=self.root + 'images/' + split_dict[split] + '/', suffix='.jpg')
    self.files[split] = file_list
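For reference, here is a minimal self-contained sketch of how that loop fills the split lists. The recursive_glob below is a stand-in paraphrasing how the repo's helper is called here, and the temp-directory layout is a toy substitute for the real ADE20K dataset:

```python
import os
import tempfile

def recursive_glob(rootdir=".", suffix=""):
    # Walk rootdir and collect every file path ending with suffix.
    return [
        os.path.join(looproot, filename)
        for looproot, _, filenames in os.walk(rootdir)
        for filename in filenames
        if filename.endswith(suffix)
    ]

# Fake ADE20K layout: images/training/ and images/validation/, one .jpg each.
root = tempfile.mkdtemp() + "/"
for split_dir in ("training", "validation"):
    os.makedirs(os.path.join(root, "images", split_dir))
    open(os.path.join(root, "images", split_dir, "sample.jpg"), "w").close()

split_dict = {"training": "training", "val": "validation"}
files = {}
for split in split_dict:
    files[split] = recursive_glob(
        rootdir=root + "images/" + split_dict[split] + "/", suffix=".jpg"
    )

print(len(files["training"]), len(files["val"]))  # 1 1
```

With the original split names, the "val" key would glob a non-existent directory and yield an empty list, which is exactly the `0it [00:00, ?it/s]` seen above.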

@3bobo
Author

3bobo commented Aug 19, 2018

@adam9500370 I've replaced the code in ptsemseg/loader/ade20k_loader.py, and set the same path in config.json and in ade20k_loader.py, like this:

in ade20k_loader.py :

if __name__ == '__main__':
    local_path = '/home/zqk/datasets/ADE20K/ADE20K_2016_07_26'
    dst = ADE20KLoader(local_path, is_transform=True)

and in config.json :

  "ade20k":
  {
    "data_path": "/home/zqk/datasets/ADE20K/ADE20K_2016_07_26"
  }

but the results are still nan when I run train.py. Thanks for your help!

@adam9500370
Contributor

adam9500370 commented Aug 19, 2018

https://github.com/meetshah1995/pytorch-semseg/blob/6d2a31c4e389dfe4c62301748a46d5db49019ce6/ptsemseg/loader/ade20k_loader.py#L26
Since this line concatenates self.root (data_path) with the rest of the path as plain strings, you end up reading a path like the following:

absolute_path = self.root + 'images/'
print(absolute_path)
# Output:  /home/zqk/datasets/ADE20K/ADE20K_2016_07_26images/

Therefore, you should set the path with a trailing slash: "/home/zqk/datasets/ADE20K/ADE20K_2016_07_26/".
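As an aside, using os.path.join instead of raw string concatenation avoids this class of bug entirely. A small sketch with a hypothetical path:

```python
import os

# Hypothetical dataset root, deliberately without a trailing slash.
root = "/home/user/datasets/ADE20K/ADE20K_2016_07_26"

# Raw concatenation silently fuses the directory name with the next component:
print(root + "images/")  # .../ADE20K_2016_07_26images/  (wrong)

# os.path.join inserts the separator whether or not the slash is there:
print(os.path.join(root, "images"))        # .../ADE20K_2016_07_26/images
print(os.path.join(root + "/", "images"))  # .../ADE20K_2016_07_26/images
```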

@3bobo
Author

3bobo commented Aug 19, 2018

Oh, thanks. But now another error appears:

Traceback (most recent call last):
  File "train.py", line 163, in <module>
    train(args)
  File "train.py", line 82, in train
    for i, (images, labels) in enumerate(trainloader):
  File "/home/zqk/anaconda2/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 286, in __next__
    return self._process_next_batch(batch)
  File "/home/zqk/anaconda2/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 307, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
ValueError: Traceback (most recent call last):
  File "/home/zqk/anaconda2/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 57, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/zqk/work/pytorch_seg/pytorch-semseg/ptsemseg/loader/ade20k_loader.py", line 48, in __getitem__
    img, lbl = self.augmentations(img, lbl)
  File "/home/zqk/work/pytorch_seg/pytorch-semseg/ptsemseg/augmentations.py", line 15, in __call__
    img, mask = Image.fromarray(img, mode='RGB'), Image.fromarray(mask, mode='L')
  File "/home/zqk/anaconda2/lib/python2.7/site-packages/PIL/Image.py", line 2436, in fromarray
    raise ValueError("Too many dimensions: %d > %d." % (ndim, ndmax))
ValueError: Too many dimensions: 3 > 2.

Is it because the dimensions of ADE20K's images don't match the dimensions the code expects?

@adam9500370
Contributor

@3bobo
Copy link
Author

3bobo commented Aug 19, 2018

ok , thank you very much!

@3bobo
Author

3bobo commented Aug 19, 2018

I used the second way, so I keep the original 3146 classes of ADE20K: I set self.n_classes = 3146 in ade20k_loader.py and modified fcn.py as you said, from

 l2.weight.data = l1.weight.data[:n_class, :].view(l2.weight.size())
 l2.bias.data = l1.bias.data[:n_class]

to:

 copy_idx = l1.weight.data.shape[0] if l1.weight.data.shape[0] < n_class else n_class
 l2.weight.data[:copy_idx, :] = l1.weight.data[:copy_idx, :]
 l2.bias.data[:copy_idx] = l1.bias.data[:copy_idx]

but a new error appears:

  File "/home/zqk/work/pytorch_seg/pytorch-semseg/ptsemseg/models/fcn.py", line 343, in init_vgg16_params
    l2.weight.data[:copy_idx, :] = l1.weight.data[:copy_idx, :]
RuntimeError: The expanded size of the tensor (1) must match the existing size (4096) at non-singleton dimension 3

thanks for your help!

@adam9500370
Contributor

adam9500370 commented Aug 19, 2018

We can add .view(...) so that the weight shapes of l1 and l2 match:

l2.weight.data[:copy_idx, :] = l1.weight.data[:copy_idx, :].view(l2.weight[:copy_idx, :].size())

Or you can set copy_fc8=False to skip loading the last pre-trained classifier layer.
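To see why the .view(...) is needed, here is the shape mismatch replayed in NumPy. The shapes are taken from the traceback (fc8 has 1000 outputs, the new classifier 3146); the arrays are random stand-ins for the real weights:

```python
import numpy as np

n_class = 3146
l1_weight = np.random.randn(1000, 4096)      # fc8: 1000 ImageNet classes x 4096 inputs
l2_weight = np.zeros((n_class, 4096, 1, 1))  # 1x1-conv classifier, n_class channels

# Copy only the output rows both layers share, reshaping to the conv layout.
copy_idx = min(l1_weight.shape[0], n_class)
l2_weight[:copy_idx] = l1_weight[:copy_idx].reshape(copy_idx, 4096, 1, 1)

# Without the reshape, assigning a (1000, 4096) array into a (1000, 4096, 1, 1)
# slice raises a broadcasting error (in NumPy and torch alike); with it, the
# shared rows are copied in place and the extra rows stay at their init values.
print(np.allclose(l2_weight[:copy_idx, :, 0, 0], l1_weight[:copy_idx]))  # True
```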

@3bobo
Author

3bobo commented Aug 19, 2018

I tried changing the code as you said, and got a size error:

  File "/home/zqk/work/pytorch_seg/pytorch-semseg/ptsemseg/models/fcn.py", line 340, in init_vgg16_params
    l2.weight.data = l1.weight.data[:n_class, :].view(l2.weight.size())
RuntimeError: invalid argument 2: size '[3146 x 4096 x 1 x 1]' is invalid for input with 4096000 elements at /opt/conda/conda-bld/pytorch_1524577177097/work/aten/src/TH/THStorage.c:41

so I set copy_fc8=False, but that brings back the original error:

  File "/home/zqk/anaconda2/lib/python2.7/site-packages/PIL/Image.py", line 2436, in fromarray
    raise ValueError("Too many dimensions: %d > %d." % (ndim, ndmax))
ValueError: Too many dimensions: 3 > 2.

I checked all the corrections from this issue:

  1. https://github.com/meetshah1995/pytorch-semseg/blob/dfecf4973c702c9be4b55d5066bd4a79bcb0c1bb/ptsemseg/loader/ade20k_loader.py#L25-L27
     to:
         split_dict = {"training": "training", "val": "validation",}
         for split in split_dict:
             file_list = recursive_glob(rootdir=self.root + 'images/' + split_dict[split] + '/', suffix='.jpg')
             self.files[split] = file_list

  2. https://github.com/meetshah1995/pytorch-semseg/blob/dfecf4973c702c9be4b55d5066bd4a79bcb0c1bb/ptsemseg/loader/ade20k_loader.py#L20
     to:
         self.n_classes = 3146

  3. https://github.com/meetshah1995/pytorch-semseg/blob/dfecf4973c702c9be4b55d5066bd4a79bcb0c1bb/ptsemseg/loader/ade20k_loader.py#L68
     to:
         if not np.all(classes == np.unique(lbl)):
             print("WARN: resizing labels yielded fewer classes")

         if not np.all(np.unique(lbl) < self.n_classes):
             raise ValueError("Segmentation map contained invalid class values")

Did I miss some important corrections?

@adam9500370
Contributor

You need to move line 63 to line 41 in ade20k_loader.py to avoid ValueError: Too many dimensions: 3 > 2., since lbl = self.encode_segmap(lbl) must run before the augmentation call img, lbl = self.augmentations(img, lbl).
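A sketch of why the ordering matters: the raw ADE20K label is a 3-channel RGB image, and PIL's Image.fromarray(mask, mode='L') only accepts a 2-D array, so the label must be collapsed to a class map before the augmentations run. encode_segmap_stub below is a hypothetical stand-in for ADE20KLoader.encode_segmap, not the repo's actual code:

```python
import numpy as np

def encode_segmap_stub(lbl_rgb):
    # Hypothetical stand-in: collapse an (H, W, 3) RGB label
    # into an (H, W) integer class map.
    lbl_rgb = lbl_rgb.astype(np.int64)
    return (lbl_rgb[:, :, 0] // 10) * 256 + lbl_rgb[:, :, 1]

lbl_rgb = np.zeros((4, 4, 3), dtype=np.uint8)  # toy 3-channel label

lbl = encode_segmap_stub(lbl_rgb)  # 2-D: safe to pass to Image.fromarray(mode='L')
print(lbl_rgb.ndim, lbl.ndim)      # 3 2
```

Passing lbl_rgb straight to the augmentations reproduces the "Too many dimensions: 3 > 2" error; passing the encoded lbl does not.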
