Question: Sceneflow scheme, Bug: Loss is NaN, stop training #11
Hi @lebionick, I downloaded the finalpass and disparity from "Full datasets". The occlusion is from the "DispNet/FlowNet2.0 dataset subsets".
@mli0603 Isn't DispNet/FlowNet2.0 a subset of "Full datasets"? So it doesn't contain occlusion maps for all samples from "Full datasets"?
Hi @lebionick,
Therefore I had to download separately from both. So you will see that in my dataloader the occlusion folder is structured as DispNet/FlowNet2.0, but images and disparity are structured as the Full datasets. The good news is that the training sets overlap almost entirely, while the evaluation sets have some differences. So what I did was take the provided train/eval lists here and here to make sure the Full dataset matches the DispNet/FlowNet2.0 subset (which is smaller). If you are OK with the cleanpass provided in DispNet/FlowNet2.0, you can use that too, but I cannot guarantee the result will be the same. I hope this is clearer ;) If not, let me know
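For what it's worth, that matching step can be sketched as a plain list intersection. The list-file format here is a guess (one sample id per line); the real train/eval lists come from the links mentioned above:

```python
from pathlib import Path

def intersect_sample_lists(full_list_path, subset_list_path):
    """Keep only the sample ids that appear in both dataset lists.

    Each list file is assumed (hypothetically) to contain one sample id
    per line; the result is the sorted intersection of the two sets.
    """
    full = set(Path(full_list_path).read_text().split())
    subset = set(Path(subset_list_path).read_text().split())
    return sorted(full & subset)
```

Any sample present in the Full dataset but missing from the DispNet/FlowNet2.0 subset (and hence without an occlusion map) is simply dropped.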
@mli0603 Yes, thank you!
Well, I could not make it work, so I decided to write my own class; maybe it will be convenient for someone else :)

from pathlib import Path

import numpy as np
from PIL import Image
from torch.utils import data

# Compose, OneOf, the stereo augmentations (RandomShiftRotate, RGBShiftStereo,
# GaussNoiseStereo, RandomBrightnessContrastStereo), readPFM, horizontal_flip,
# random_crop and augment are helpers provided by the repo.

class DispNetDataset(data.Dataset):
    def __init__(self, root_dir, mode="train"):
        # Index every file by its stem so that the image, occlusion and
        # disparity files of the same sample end up in one record.
        data_dict = {}
        root_dir = Path(root_dir)
        mode_dir = root_dir / mode
        for data_type_folder_name in ("frame_finalpass", "disparity_occlusions", "disparity"):
            type_dir = mode_dir / data_type_folder_name
            for side in ("left", "right"):
                side_dir = type_dir / side
                for path in side_dir.iterdir():
                    stem = path.stem
                    if stem not in data_dict:
                        data_dict[stem] = {}
                    stem_dict = data_dict[stem]
                    if data_type_folder_name not in stem_dict:
                        stem_dict[data_type_folder_name] = {}
                    type_dict = stem_dict[data_type_folder_name]
                    type_dict[side] = path
        self.data_dict = data_dict
        self.idx_to_key = list(data_dict.keys())
        self.mode = mode
        self.transformation = self.create_transformation(mode)

    @staticmethod
    def create_transformation(mode):
        if mode == 'train':
            transformation = Compose([
                RandomShiftRotate(always_apply=True),
                RGBShiftStereo(always_apply=True, p_asym=0.3),
                OneOf([
                    GaussNoiseStereo(always_apply=True, p_asym=1.0),
                    RandomBrightnessContrastStereo(always_apply=True, p_asym=0.5)
                ], p=1.0)
            ])
        else:
            transformation = None
        return transformation

    def __len__(self):
        return len(self.data_dict)

    def __getitem__(self, idx):
        result = {}
        sample_dict = self.data_dict[self.idx_to_key[idx]]
        left_fname = sample_dict["frame_finalpass"]["left"]
        right_fname = sample_dict["frame_finalpass"]["right"]
        result['left'] = np.array(Image.open(left_fname)).astype(np.uint8)[..., :3]
        result['right'] = np.array(Image.open(right_fname)).astype(np.uint8)[..., :3]
        occ_left_fname = sample_dict["disparity_occlusions"]["left"]
        occ_right_fname = sample_dict["disparity_occlusions"]["right"]
        occ_left = np.array(Image.open(occ_left_fname)).astype(bool)
        occ_right = np.array(Image.open(occ_right_fname)).astype(bool)
        disp_left_fname = sample_dict["disparity"]["left"]
        disp_right_fname = sample_dict["disparity"]["right"]
        disp_left, _ = readPFM(disp_left_fname)
        disp_right, _ = readPFM(disp_right_fname)
        if self.mode == "train":
            # horizontal flip
            result['left'], result['right'], result['occ_mask'], result['occ_mask_right'], disp, disp_right \
                = horizontal_flip(result['left'], result['right'], occ_left, occ_right, disp_left, disp_right, self.mode)
            result['disp'] = np.nan_to_num(disp, nan=0.0)
            result['disp_right'] = np.nan_to_num(disp_right, nan=0.0)
            # random crop
            result = random_crop(360, 640, result, self.mode)
        else:
            result['occ_mask'] = occ_left
            result['occ_mask_right'] = occ_right
            result['disp'] = disp_left
            result['disp_right'] = disp_right
        result = augment(result, self.transformation)
        return result
But after I launched training with this dataloader I get:
Thank you so much for sharing your implementation. I really appreciate it! For your error:
Let me know if any of the above works for you.
I enabled apex:

but the error persists:

def checker(tensor):
    # Return True if the tensor/array contains any NaN values.
    if isinstance(tensor, torch.Tensor):
        with torch.no_grad():
            return (torch.isnan(tensor).sum() > 0).item()
    elif isinstance(tensor, np.ndarray):
        return bool(np.isnan(tensor).sum() > 0)
    else:
        raise NotImplementedError()

print(any(map(checker, (left, right, sampled_cols, sampled_rows, disp, occ_mask, occ_mask_right))))
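To narrow down where the NaN first appears, one can walk the intermediate results in order; a small NumPy sketch (`first_nan_stage` is a hypothetical helper, not repo code):

```python
import numpy as np

def first_nan_stage(stages):
    """Return the name of the first stage whose output contains NaN.

    `stages` is an ordered mapping of name -> array-like; returns None
    if everything is finite. (Illustrative helper, not from the repo.)
    """
    for name, arr in stages.items():
        if np.isnan(np.asarray(arr, dtype=float)).any():
            return name
    return None
```

Called with e.g. `{"disp": disp, "loss": loss}`, it reports the earliest stage in the pipeline where NaNs show up, which is more informative than a single any-NaN flag.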
Thanks for the input file. I am looking into this now. |
Hi @lebionick, can you:
I ran inference_example.ipynb and it works just fine with both pretrained weights, on KITTI 2015 and even on my custom pair of images. I also applied the model to the input that I published above. No NaNs in the output. By the way, where can I find the parameters to run inference using the sttr-light weights?
I checked for nans everywhere :)
STTR-Light can be downloaded from Google Drive here. Remember to check out the

This is so weird... One thing I can see is that you are using Torch 1.6.0. Do you mind installing 1.5.1 and checking if you still have the issue? I found a version compatibility issue in #8. But to be honest, I don't see why it would make the gradient NaN.
Ok, I'll try all of these and come back)
@mli0603
#!/usr/bin/env bash
# Export so the variable actually applies to the python process below;
# a bare assignment on its own line would have no effect.
export CUDA_VISIBLE_DEVICES=0
python main.py --epochs 15 \
    --batch_size 1 \
    --checkpoint pretrain \
    --pre_train \
    --num_workers 2 \
    --dataset sceneflow_toy \
    --dataset_directory /home/jovyan/sceneflow2/FlyingThings3D_subset/train
NaNs are present in the loss, not in the gradients. I can dump
Alright, I localized it to the criterion function:
and
Thanks! Given all this, I wonder why this happens, since it's been working. It may relate to a recent commit that I made along with
Ok, so the loaded GT disparity (from readPFM) turns out to be negative.
@mli0603 |
Ah, makes sense! So this is not a bug then. Good to know ;) I'll close this for now. |
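For reference, the sign issue discussed above could also be guarded against at load time; a defensive sketch (a heuristic, not part of the repo):

```python
import numpy as np

def to_positive_disparity(disp):
    """Return the disparity map with the sign the training code expects.

    SceneFlow PFM files can come with the opposite sign convention; this
    heuristic negates the map when the median of the valid values is
    negative, and zeroes out NaNs afterwards.
    """
    disp = np.asarray(disp, dtype=np.float32)
    valid = np.isfinite(disp) & (disp != 0)
    if valid.any() and np.median(disp[valid]) < 0:
        disp = -disp
    return np.nan_to_num(disp, nan=0.0)
```

This makes the loader robust whichever sign convention the PFM files use, instead of hard-coding the minus sign.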
Thank you so much for your support! Now I'm able to launch training!)
If this is not too much to ask, do you mind re-sharing your fixed dataloader?
Sure! The fix is just adding a minus sign before the disparities:

from pathlib import Path

import numpy as np
from PIL import Image
from torch.utils import data

# Compose, OneOf, the stereo augmentations (RandomShiftRotate, RGBShiftStereo,
# GaussNoiseStereo, RandomBrightnessContrastStereo), readPFM, horizontal_flip,
# random_crop and augment are helpers provided by the repo.

class DispNetDataset(data.Dataset):
    def __init__(self, root_dir, mode="train"):
        # Index every file by its stem so that the image, occlusion and
        # disparity files of the same sample end up in one record.
        data_dict = {}
        root_dir = Path(root_dir)
        mode_dir = root_dir / mode
        for data_type_folder_name in ("frame_finalpass", "disparity_occlusions", "disparity"):
            type_dir = mode_dir / data_type_folder_name
            for side in ("left", "right"):
                side_dir = type_dir / side
                for path in side_dir.iterdir():
                    stem = path.stem
                    if stem not in data_dict:
                        data_dict[stem] = {}
                    stem_dict = data_dict[stem]
                    if data_type_folder_name not in stem_dict:
                        stem_dict[data_type_folder_name] = {}
                    type_dict = stem_dict[data_type_folder_name]
                    type_dict[side] = path
        self.data_dict = data_dict
        self.idx_to_key = list(data_dict.keys())
        self.mode = mode
        self.transformation = self.create_transformation(mode)

    @staticmethod
    def create_transformation(mode):
        if mode == 'train':
            transformation = Compose([
                RandomShiftRotate(always_apply=True),
                RGBShiftStereo(always_apply=True, p_asym=0.3),
                OneOf([
                    GaussNoiseStereo(always_apply=True, p_asym=1.0),
                    RandomBrightnessContrastStereo(always_apply=True, p_asym=0.5)
                ], p=1.0)
            ])
        else:
            transformation = None
        return transformation

    def __len__(self):
        return len(self.data_dict)

    def __getitem__(self, idx):
        result = {}
        sample_dict = self.data_dict[self.idx_to_key[idx]]
        left_fname = sample_dict["frame_finalpass"]["left"]
        right_fname = sample_dict["frame_finalpass"]["right"]
        result['left'] = np.array(Image.open(left_fname)).astype(np.uint8)[..., :3]
        result['right'] = np.array(Image.open(right_fname)).astype(np.uint8)[..., :3]
        occ_left_fname = sample_dict["disparity_occlusions"]["left"]
        occ_right_fname = sample_dict["disparity_occlusions"]["right"]
        occ_left = np.array(Image.open(occ_left_fname)).astype(bool)
        occ_right = np.array(Image.open(occ_right_fname)).astype(bool)
        disp_left_fname = sample_dict["disparity"]["left"]
        disp_right_fname = sample_dict["disparity"]["right"]
        disp_left, _ = readPFM(disp_left_fname)
        disp_right, _ = readPFM(disp_right_fname)
        if self.mode == "train":
            # horizontal flip
            result['left'], result['right'], result['occ_mask'], result['occ_mask_right'], disp, disp_right \
                = horizontal_flip(result['left'], result['right'], occ_left, occ_right, disp_left, disp_right, self.mode)
            # negate: the PFM files store disparities with the opposite sign
            result['disp'] = -np.nan_to_num(disp, nan=0.0)
            result['disp_right'] = -np.nan_to_num(disp_right, nan=0.0)
            # random crop
            result = random_crop(360, 640, result, self.mode)
        else:
            result['occ_mask'] = occ_left
            result['occ_mask_right'] = occ_right
            result['disp'] = -disp_left
            result['disp_right'] = -disp_right
        result = augment(result, self.transformation)
        return result
By the way, what do you think: do we possibly need to handle the case where every disparity is zero? For example, if the data contains only very distant objects? Or for strange augmentations, like duplicating the left image
I think the best way to do it is to set the invalid disparity to -1 instead of 0. But if every disparity is 0, then the left and right images are identical. I don't see such a case happening in a stereo setting (unless intended).
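Masking out a -1 sentinel in the loss could be sketched like this (`INVALID_DISP` and `masked_epe` are hypothetical names, not repo API):

```python
import numpy as np

INVALID_DISP = -1.0  # sentinel for pixels with no valid ground truth

def masked_epe(pred, gt):
    """Mean end-point error over pixels whose ground truth is valid."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    valid = gt != INVALID_DISP
    if not valid.any():
        return 0.0  # nothing to supervise on; avoid NaN from an empty mean
    return float(np.abs(pred[valid] - gt[valid]).mean())
```

With a negative sentinel, a frame of truly zero disparities still contributes valid supervision, while invalid pixels are excluded and can no longer turn the mean into NaN.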
Hello,
I'm trying to pretrain the network on SceneFlow; however, the way my folders are organized is very different from what the code tries to find. Could you please tell me what data exactly you downloaded? DispNet/FlowNet2.0 dataset subsets -> RGB images (cleanpass), Disparity, Disparity Occlusions from here?