Training simulation for unprocessing net is stuck #32

Hello,

Thanks for releasing the code for 'Unprocessing Images ... Raw Denoising'. Upon trying the training process, I see that the training simulation gets stuck at this point:

This is my run command:

python train.py --model_dir='./ckpts/' --train_pattern=/disk1/aashishsharma/Datasets/MIRFlickr_Dataset/train/* --test_pattern=/disk1/aashishsharma/Datasets/MIRFlickr_Dataset/test/*

Does anybody know about this problem? Any workaround? Thanks!

Comments
Are you training on the GPU? I'm not sure what "stuck" means here, but it sounds like training could just be proceeding very slowly, and TF using the CPU for training is a very common reason for that.
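A quick way to check whether TensorFlow can see a GPU (a minimal sketch, assuming the TF 1.x API this repository uses):

```python
import tensorflow as tf
from tensorflow.python.client import device_lib

# True if TensorFlow was built with CUDA support and a GPU is visible.
print(tf.test.is_gpu_available())

# List every device TensorFlow can use (the CPU and any visible GPUs).
print([d.name for d in device_lib.list_local_devices()])
```

If only `/device:CPU:0` appears in the device list, training is running on the CPU.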
I second Jon's analysis of this: it looks like the CPU is being used here, which can be extremely slow and appear stuck. After waiting a long time (say, an hour), do you see anything written to your model directory? I would recommend training with a GPU if that is possible, as it will be much faster. The last log relates to Intel's OpenMP* thread mapping, which is probably because Intel's MKL-DNN is being used. But I have not seen those logs while training before, and see no reason why this would cause stalling.
I just tried training with the current code, and it seems to produce model checkpoints as output. @aasharma90, can you confirm that model checkpoints aren't being produced when you run this? It's a little confusing because training doesn't produce loss/epoch print statements, but that seems to be a visualization issue, not a correctness issue.
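One way to confirm whether checkpoints are being written (a minimal sketch; `./ckpts/` is the `--model_dir` from the run command above):

```python
import tensorflow as tf

# Prints the prefix of the newest checkpoint in the model directory,
# or None if no checkpoint has been written yet.
print(tf.train.latest_checkpoint('./ckpts/'))
```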
To print loss in the terminal during training, enable INFO-level logging. By default, Estimator will log every 100 steps. You can change this by modifying the config in train.py. You may find it easier to use TensorBoard to visualize training progress, which can be done by pointing it at the model directory.
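A minimal sketch of those changes, assuming the TF 1.x Estimator API (the exact snippets and step interval here are illustrative, not the repo's own):

```python
import tensorflow as tf

# Show INFO-level logs, which include the Estimator's periodic loss printouts.
tf.logging.set_verbosity(tf.logging.INFO)

# Log loss and steps/sec every 10 steps instead of the default 100.
run_config = tf.estimator.RunConfig(log_step_count_steps=10)

# For TensorBoard, point it at the model directory from the command line:
#   tensorboard --logdir=./ckpts/
```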
Thanks Tim, the CL is in flight; I'll close this issue once it's landed.
Hi @timothybrooks and @jonbarron, regarding your questions: the simulation is still at the same point I mentioned.