eval gets stuck indefinitely #16

Open
kaushikb258 opened this issue May 16, 2022 · 5 comments

Comments

@kaushikb258

The eval_segmentation.py script gets stuck on the Potsdam data. The issue is in batched_crf(), at the following line:

outputs = pool.map(_apply_crf, zip(img_tensor.detach().cpu(), prob_tensor.detach().cpu()))

The code never proceeds past this point; one process waits on the others indefinitely. Any suggestions?

@mhamilton723
Owner

Hey @kaushikb258, how long did you wait? The CRF for the Potsdam slices can take a few minutes to complete.

@kaushikb258
Author

I ran the eval code on Potsdam for 4-5 hours and there is still no result (the code is still running). Even training didn't take this long.

@mhamilton723
Owner

mhamilton723 commented May 17, 2022

Yes, that definitely sounds like it's stuck; appreciate the context here. Perhaps set the number of workers in this line:

with Pool(cfg.num_workers + 5) as pool:

to something small and see if that stops you from getting stuck. If that's the case, it's probably due to starvation or something similar.
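
For instance, a minimal sketch of that change (the worker count of 2 here is arbitrary, just to test the hypothesis):

with Pool(2) as pool:
    outputs = pool.map(_apply_crf, zip(img_tensor.detach().cpu(), prob_tensor.detach().cpu()))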

@kaushikb258
Author

kaushikb258 commented May 17, 2022

I decreased the number of workers, but no progress. So I wrote a serial version of the CRF, and this works now. Attaching it below in case it helps others.

def batched_crf(img_tensor, prob_tensor):
    batch_size = list(img_tensor.size())[0]
    img_tensor_cpu = img_tensor.detach().cpu()
    prob_tensor_cpu = prob_tensor.detach().cpu()
    out = []
    for i in range(batch_size):
        # run the dense CRF on each (image, probability) pair sequentially
        out_ = dense_crf(img_tensor_cpu[i], prob_tensor_cpu[i])
        out.append(out_)
    return torch.cat([torch.from_numpy(arr).unsqueeze(0) for arr in out], dim=0)
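
For reference, a hypothetical call site (the variable names below are placeholders, not the exact names used in eval_segmentation.py):

# apply the serial CRF and take the per-pixel argmax as the final prediction
crf_probs = batched_crf(img, cluster_probs)
crf_preds = crf_probs.argmax(1)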

@Supgb

Supgb commented Jun 28, 2022

It can be avoided by simply replacing

with Pool(cfg.num_workers + 5) as pool:

with

from multiprocessing import get_context

with get_context('spawn').Pool(cfg.num_workers + 5) as pool:
    ...
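
Put together, a minimal sketch of the spawn-based variant (assuming the same _apply_crf helper and cfg object quoted above; the actual function in eval_segmentation.py may differ slightly):

from multiprocessing import get_context

import torch

def batched_crf(img_tensor, prob_tensor):
    # 'spawn' starts fresh worker processes instead of forking the parent,
    # so workers do not inherit CUDA/OpenMP state that can make pool.map hang
    with get_context('spawn').Pool(cfg.num_workers + 5) as pool:
        outputs = pool.map(_apply_crf, zip(img_tensor.detach().cpu(), prob_tensor.detach().cpu()))
    return torch.cat([torch.from_numpy(arr).unsqueeze(0) for arr in outputs], dim=0)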
