
training simulation for unprocessing net is stuck #32

Closed
aasharma90 opened this issue Jul 9, 2019 · 6 comments

@aasharma90

Hello,
Thanks for releasing the code for 'Unprocessing Images ... Raw Denoising'. Upon trying the training process, I see that the training simulation gets stuck at this point -
[screenshot of the training output where it appears stuck]

This is my run command -
python train.py --model_dir='./ckpts/' --train_pattern=/disk1/aashishsharma/Datasets/MIRFlickr_Dataset/train/* --test_pattern=/disk1/aashishsharma/Datasets/MIRFlickr_Dataset/test/*

Does anybody know about this problem? Any workaround? Thanks!

@jonbarron
Contributor

Are you training on the GPU? I'm not sure what "stuck" means here, but it sounds like training could just be proceeding very slowly, and TF using the CPU for training is a very common reason for that.
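
As a quick sanity check that TF can see a GPU at all (a minimal sketch, assuming the TF 1.x API this repo's train.py uses), you can run something like:

import tensorflow as tf
print(tf.test.is_gpu_available())  # True if TF can use a GPU
print(tf.test.gpu_device_name())   # e.g. '/device:GPU:0', or '' if no GPU is visible

If this prints False and an empty string, you are likely running a CPU-only TensorFlow build.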

@timothybrooks
Contributor

I second Jon's analysis of this: it looks like the CPU is being used here, which can be extremely slow and can make training appear stuck. After waiting a long time (say, an hour), do you see anything written to your model directory?

I would recommend training with a GPU if that is possible, as it will be much faster. The last log relates to Intel's OpenMP thread mapping, which is probably because Intel's MKL-DNN is being used. I have not seen those logs during training before, though, and I see no reason why they would cause stalling.
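
One way to confirm where the ops are actually being placed (a sketch, assuming the TF 1.x Estimator setup in train.py; session_config is a standard RunConfig argument) is to enable device-placement logging:

config = tf.estimator.RunConfig(
    FLAGS.model_dir,
    session_config=tf.ConfigProto(log_device_placement=True))  # prints each op's CPU/GPU placement

If every op is logged as placed on /device:CPU:0, TF is not using your GPU.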

@jonbarron
Contributor

I just tried training with the current code, and it seems to produce model checkpoints as output. @aasharma90, can you confirm that model checkpoints aren't being produced when you run this? It's a little confusing because training doesn't produce loss/epoch print statements, but that seems to be a visualization issue, not a correctness issue.

@timothybrooks
Contributor

To print loss in the terminal during training, add tf.logging.set_verbosity(tf.logging.INFO) to set a high enough verbosity to see the training metrics. You can add this line right before the call to tf.estimator.train_and_evaluate(...) in train.py.

By default, Estimator will log every 100 steps. You can change this by modifying the config in train.py:
config = tf.estimator.RunConfig(FLAGS.model_dir, log_step_count_steps=[num of steps])

You may find it easier to use TensorBoard to visualize training progress, which can be done by running tensorboard --logdir=[path to model dir] in a separate terminal during or after training, and opening the printed URL in a web browser.
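
Putting the two changes together, the relevant part of train.py would look roughly like this (a sketch; the surrounding code is abbreviated and the estimator/spec variable names are placeholders):

config = tf.estimator.RunConfig(FLAGS.model_dir, log_step_count_steps=100)  # log metrics every 100 steps
...
tf.logging.set_verbosity(tf.logging.INFO)  # surface Estimator training logs in the terminal
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)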

@jonbarron
Contributor

Thanks Tim. The CL is in flight; I'll close this issue once it has landed.

@aasharma90
Author

Hi @timothybrooks and @jonbarron

Regarding your questions -

  1. I thought the default setting would be to run it on the GPU? Sorry, I'm very new to TF, so I'm not very aware of this. Could you please let me know how that can be done? You can have a look at the command I used for training in my original post above.

  2. I added Tim's suggestions to train.py:

...
config = tf.estimator.RunConfig(FLAGS.model_dir, log_step_count_steps=1)
...
tf.logging.set_verbosity(tf.logging.INFO)

The simulation is still stuck at the same point I mentioned.

  3. Launching TensorBoard, I can see the model graph, but I cannot see the training profile.
