
training simulation for unprocessing net is stuck #32

Closed
aasharma90 opened this issue Jul 9, 2019 · 6 comments

@aasharma90

Hello,
Thanks for releasing the code for 'Unprocessing Images ... Raw Denoising'. Upon trying the training process, I see that the training simulation gets stuck at this point -
[screenshot of the training output where it appears stuck]

This is my run command -
python train.py --model_dir='./ckpts/' --train_pattern=/disk1/aashishsharma/Datasets/MIRFlickr_Dataset/train/* --test_pattern=/disk1/aashishsharma/Datasets/MIRFlickr_Dataset/test/*

Does anybody know about this problem? Any workaround? Thanks!

@jonbarron
Contributor

Are you training on the GPU? I'm not sure what "stuck" means here, but it sounds like training could just be proceeding very slowly, and TF using the CPU for training is a very common reason for that.
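
As a quick sanity check that TF can see a GPU at all (a minimal sketch, assuming the TF 1.x API this repo's train.py uses), you can run something like:

import tensorflow as tf
print(tf.test.is_gpu_available())  # True if TF can use a GPU
print(tf.test.gpu_device_name())   # e.g. '/device:GPU:0', or '' if no GPU is visible

If this prints False and an empty string, you are likely running a CPU-only TensorFlow build.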

@timothybrooks
Contributor

I second Jon's analysis of this: it looks like the CPU is being used here, which can be extremely slow and can make training appear stuck. After waiting a long time (say, an hour), do you see anything written to your model directory?

I would recommend training with a GPU if that is possible, as it will be much faster. The last log relates to Intel's OpenMP thread mapping, which is probably because Intel's MKL-DNN is being used. I have not seen those logs during training before, though, and I see no reason why they would cause stalling.
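
One way to confirm where the ops are actually being placed (a sketch, assuming the TF 1.x Estimator setup in train.py; session_config is a standard RunConfig argument) is to enable device-placement logging:

config = tf.estimator.RunConfig(
    FLAGS.model_dir,
    session_config=tf.ConfigProto(log_device_placement=True))  # prints each op's CPU/GPU placement

If every op is logged as placed on /device:CPU:0, TF is not using your GPU.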

@jonbarron
Contributor

I just tried training with the current code, and it seems to produce model checkpoints as output. @aasharma90, can you confirm that model checkpoints aren't being produced when you run this? It's a little confusing because training doesn't produce loss/epoch print statements, but that seems to be a visualization issue, not a correctness issue.

@timothybrooks
Contributor

To print loss in the terminal during training, add tf.logging.set_verbosity(tf.logging.INFO) to set a high enough verbosity to see the training metrics. You can add this line right before the call to tf.estimator.train_and_evaluate(...) in train.py.

By default, Estimator will log every 100 steps. You can change this by modifying the config in train.py:
config = tf.estimator.RunConfig(FLAGS.model_dir, log_step_count_steps=[num of steps])

You may find it easier to use TensorBoard to visualize training progress, which can be done by running tensorboard --logdir=[path to model dir] in a separate terminal during or after training, and opening the printed URL in a web browser.
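
Putting the two changes together, the relevant part of train.py would look roughly like this (a sketch; the surrounding code is abbreviated and the estimator/spec variable names are placeholders):

config = tf.estimator.RunConfig(FLAGS.model_dir, log_step_count_steps=100)  # log metrics every 100 steps
...
tf.logging.set_verbosity(tf.logging.INFO)  # surface Estimator training logs in the terminal
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)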

@jonbarron
Contributor

Thanks Tim. The CL is in flight; I'll close this issue once it has landed.

@aasharma90
Author

Hi @timothybrooks and @jonbarron

Regarding your questions -

  1. I thought the default setting would be to run it on the GPU? Sorry, I'm very new to TF, so I'm not very aware of this. Could you please let me know how that can be done? You can have a look at the command I used for training in my original post above.

  2. I added Tim's suggestions to train.py:

...
config = tf.estimator.RunConfig(FLAGS.model_dir, log_step_count_steps=1)
...
tf.logging.set_verbosity(tf.logging.INFO)

The simulation is still stuck at the same point I mentioned.

  3. Launching TensorBoard, I can see the model graph, but I cannot see the training profile.
