Still facing NaN issues! #33

Open
chaithyagr opened this issue Oct 3, 2022 · 21 comments
Assignees: jmontalt
Labels: bug, documentation, enhancement
Comments

@chaithyagr commented Oct 3, 2022

Even after a lot of effort, we still seem to face NaN issues, and sadly this seems to be very erratic and random. I don't have a repro test at all...
From what I see, I think this issue is only present in graph mode. @jmontalt, does that hint at any possible cause? Are there separate code paths for eager mode and graph mode?

I will use this issue to track what I observe.

@jmontalt (Member) commented Oct 3, 2022

That is frustrating! Thank you for your efforts. Is this happening on the CPU or on the GPU?

Eager mode and graph mode follow the same code path. As far as I understand, the main difference between the two modes for our purposes is that in graph mode multiple ops may be executed concurrently. The errors might be related to this concurrent execution.

It might be worth trying to set tf.config.threading.set_inter_op_parallelism_threads to 1. If this makes the issue disappear, it would confirm the suspicion that it's related to concurrent op execution.
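For reference, a minimal sketch of that setting (it has to run before any op executes, e.g. right after importing TensorFlow):

```python
import tensorflow as tf

# Force ops to execute one at a time rather than concurrently; this must be
# called before any TensorFlow op has run.
tf.config.threading.set_inter_op_parallelism_threads(1)
```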

@chaithyagr (Author) commented Oct 3, 2022

Is this happening on the CPU or on the GPU?

Currently I am only working on the GPU, so it is happening on the GPU.

It might be worth trying to set tf.config.threading.set_inter_op_parallelism_threads to 1. If this makes the issue disappear, it would confirm the suspicion that it's related to concurrent op execution.

That's a good thing to test; I will try it. However, that still raises the question of how we would fix such an issue if that turns out to be the cause. I at least hope this will be faster to run than eager mode...

@chaithyagr (Author)

All the runs I launched yesterday ran fine for 20 h. These were:

  1. A repeat of an earlier run which had resulted in NaN
  2. A run in eager mode
  3. A run with tf.config.threading.set_inter_op_parallelism_threads set to 1

This is getting really frustrating, as I just don't have a setup to repro to even be ...
All the above runs were made with all the seeds set to 0. I am pretty much lost at this point... but I will keep the issue open to track it. Is that fine, @jmontalt?

@jmontalt (Member) commented Oct 4, 2022

Hi @chaithyagr, thank you so much for running these tests. It's very useful to know this. Just to be clear: you have never been able to reproduce the issue when either running in eager mode or when inter-op parallelism is set to 1 thread. Is that correct?

Yes, let's keep this issue open. Would you be willing to share some code that has failed before, even if it does not reproduce the issue consistently? It would give me a starting point to run some tests with code that has at least been known to fail.

Interestingly, I don't currently observe this problem, and at this point I'm not sure whether this could be due to differences in our usage or something else (perhaps it's less likely on my hardware?).

jmontalt added the bug label Oct 4, 2022
jmontalt self-assigned this Oct 4, 2022
@chaithyagr (Author) commented Oct 4, 2022

Just to be clear: you have never been able to reproduce the issue when either running in eager mode or when inter-op parallelism is set to 1 thread. Is that correct?

Correct. I get your point that you expect this to fix our issue, and frankly I don't get a huge speedup with inter-op parallelism set to anything other than 1 (I am working in 3D, so I guess we are pretty much using all resources already!).
However, as the repeat run also passed, I am confused :P

Yes, let's keep this issue open. Would you be willing to share some code that has failed before, even if it does not reproduce the issue consistently? It would give me a starting point to run some tests with code that has at least been known to fail.

I think the same code from #20 (comment) should be a good enough minimal repro. One problem I see with this version, though, is that in TFMRI we now threshold the density (https://github.com/mrphys/tensorflow-mri/blob/cfd8930ee5281e7f6dceb17c4a5acaf625fd3243/tensorflow_mri/python/ops/traj_ops.py#L938), and this no-NaN reciprocal could be bad:

out = tf.math.reciprocal_no_nan(tfmri.sampling.estimate_density(traj, img_size, method, max_iter))
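A hedged sketch of how one could instrument that spot (reusing traj, img_size, method and max_iter from the repro above; the numerics check is my addition, not part of TFMRI):

```python
import tensorflow as tf
import tensorflow_mri as tfmri

# Estimate the density as in the repro, then flag NaN/Inf before taking the
# reciprocal, since reciprocal_no_nan maps 1/0 to 0 and can mask problems.
density = tfmri.sampling.estimate_density(traj, img_size, method, max_iter)
tf.debugging.check_numerics(density, "density estimate contains NaN/Inf")
out = tf.math.reciprocal_no_nan(density)
```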

Interestingly, I don't currently observe this problem, and at this point I'm not sure whether this could be due to differences in our usage or something else (perhaps it's less likely on my hardware?).

I don't think it could be hardware, as I run it on multiple machines and different GPUs. It is possible that something is wrong with my code specifically. I now plan to save the input whenever we get NaNs (although this may not be of much use, as an exact repeat of a run, which produces the same output, did not result in NaN, signifying that it is a more random issue).

@chaithyagr (Author)

A minor question, @jmontalt: I was wondering what happens if the input points are not in [-pi, pi]. Could this possibly cause errors / NaNs? Do we have any guard checks?

@jmontalt (Member) commented Oct 4, 2022

It should support up to the extended range [-3 * pi, 3 * pi], with points beyond [-pi, pi] simply being folded back to this range.

This extended range should give some flexibility and robustness against points that are close to [-pi, pi], but note that I have never tested this. It would be useful to add such a test.

For points beyond this extended range, behaviour is undefined.

We do not currently assert that the points lie within the correct range. This would be rather inefficient, especially on the GPU.

This behaviour is the same as in the original FINUFFT.
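If one did want an explicit guard for debugging, it could look roughly like the sketch below (this is not an existing tensorflow-nufft API; assert_points_in_range is a hypothetical helper):

```python
import numpy as np
import tensorflow as tf

def assert_points_in_range(points, extended=True):
  # Hypothetical opt-in guard: assert that the nonuniform points lie within
  # the supported range ([-pi, pi], or [-3*pi, 3*pi] if folding is relied on).
  bound = (3.0 if extended else 1.0) * np.pi
  tf.debugging.assert_greater_equal(points, tf.cast(-bound, points.dtype))
  tf.debugging.assert_less_equal(points, tf.cast(bound, points.dtype))
  return points
```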

@chaithyagr (Author)

Well, while I agree with the wrapping from a mathematical point of view, I don't think it is valid in MRI. But tensorflow-nufft is a generic package, so I guess it's fine.

@jmontalt (Member) commented Oct 4, 2022

It does remain valid in MRI. Note that pi is, by definition, the maximum spatial frequency (in radians per pixel) that you can have in an image of a certain size (which is one of the inputs to nufft). In order to usefully sample higher frequencies than that, you would need a finer grid. But if you have a finer grid, then pi is simply redefined to be the maximum spatial frequency for that grid.
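For instance (just a sketch; traj_cycles is a hypothetical trajectory expressed in cycles per field of view, i.e. values in [-0.5, 0.5)):

```python
import numpy as np

def to_radians_per_pixel(traj_cycles):
  # Hypothetical helper: rescale a trajectory given in cycles per FOV
  # (values in [-0.5, 0.5)) to the [-pi, pi) range expected by the NUFFT.
  return 2.0 * np.pi * np.asarray(traj_cycles)
```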

@chaithyagr (Author)

Yes, I get that. What I mean is that, in the case of MRI, if the inputs are given as is (not re-normalized to pi), then:

with points beyond [-pi, pi] simply being folded back to this range.

This would be wrong... But then again, as I said, I don't think that's the job of tensorflow-nufft, as it is a more generic library.

@chaithyagr (Author)

For points beyond this extended range, behaviour is undefined.

If this holds, I think it could be the issue. I observed that, on adding a lot of randomness to my k-space trajectory, the NaN issue came up suddenly. While I still can't pinpoint why, my repro attempts have always been in a controlled setting where there are no points outside the range. Let's see how the code behaves in this scenario; I will try to create such an odd repro for now.

@chaithyagr (Author)

OK, I think that was the issue!
If I take the repro from above and simply add np.pi to the k-space locations, the NUFFT itself works fine, but the gradients with respect to the trajectories are NaN!

Not sure if this is realistic though. I have now added proper guard checks on my side, and I will report back if the issue remains.
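Roughly, the kind of check I ran looks like this (a sketch with made-up shapes, not the exact repro from #20; I'm relying on the default type-2 forward transform):

```python
import numpy as np
import tensorflow as tf
import tensorflow_nufft as tfft

# Uniform image and nonuniform points shifted by pi, so some points leave
# the [-pi, pi] range (but stay within [-3*pi, 3*pi]).
image = tf.complex(tf.random.normal([128, 128]), tf.random.normal([128, 128]))
points = tf.random.uniform([4096, 2], -np.pi, np.pi) + np.pi

with tf.GradientTape() as tape:
  tape.watch(points)
  kspace = tfft.nufft(image, points)        # forward NUFFT runs fine
  loss = tf.reduce_mean(tf.abs(kspace))

grad = tape.gradient(loss, points)          # gradient w.r.t. the points
print(tf.reduce_any(tf.math.is_nan(grad)))  # True in the failing case
```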

@jmontalt (Member) commented Oct 5, 2022

Great, thanks for keeping me in the loop.

So the forward nufft works as expected (i.e., with wrapping) for points in [-3 * pi, 3 * pi], but the gradient only works in [-pi, pi]? Or is it that you had points beyond [-3 * pi, 3 * pi]?

I'll make a note to add an option to assert that the points are in the right range, which would be useful for debugging purposes (I don't want to enable this by default because of likely performance hit).

@jmontalt (Member) commented Oct 5, 2022

I will also document this behaviour on the website to reduce the chances of it happening again in the future.

I'm keeping the issue open until everything is clear.

jmontalt added the documentation and enhancement labels Oct 5, 2022
@chaithyagr (Author)

Well, as I just added pi, I don't think we can have points beyond [-3 * pi, 3 * pi].
But I must add that I also tried adding 0.1 * pi and 0.5 * pi and didn't face any issues there, so we can't say much yet. We need some more careful analysis of what is going wrong, where, and why...

@chaithyagr (Author)

I'll make a note to add an option to assert that the points are in the right range, which would be useful for debugging purposes (I don't want to enable this by default because of likely performance hit).

This is really helpful!

@chaithyagr (Author)

I think I finally have a complicated case that is reproducible!

If I use a multiscale SSIM (MS-SSIM) loss, I get NaNs in graph mode, but no NaNs in eager mode.

However, with tf.config.threading.set_inter_op_parallelism_threads(1) the issue still exists! It only vanishes in eager mode.

I am not sure what the issue is specifically yet, but I think with this I can make a good minimal repro.

An additional thing I observed: the output from the network is sometimes just wrong even when it isn't NaN (I observed this because I had all the seeds set to 0 and ...), which prevents its proper use anyway...

To me, this could be because the MS-SSIM is launched before the NUFFT operation has finished.
This is very weird, as the MS-SSIM should technically not launch before the NUFFT operations have completed.
We should possibly have a look at a basic network implementation to be sure we are not failing to block on the output.
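For what it's worth, the switch I use to compare the two modes without touching the model code is shown below (a sketch; tf.config.run_functions_eagerly is standard TensorFlow, and the observations in the comments are from my runs above):

```python
import tensorflow as tf

# Run all tf.function-decorated code eagerly to compare against graph mode.
tf.config.run_functions_eagerly(True)    # eager: no NaNs observed
# tf.config.run_functions_eagerly(False) # graph mode (default): NaNs appear
```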

@chaithyagr (Author) commented Nov 2, 2022

I wonder how much this issue is related to tensorflow/tensorflow#57353.
That said, my hypothesis above seems more likely to me, as my results are sometimes just wrong...

@jmontalt (Member) commented Nov 2, 2022

Hi @chaithyagr, thank you for your efforts. My best guess would be that there's a missing device synchronisation somewhere, which would explain the randomness of the results. It'll be some time until I have the chance to take a look, but if you want to share your repro I will keep it in mind. There's also some significant refactoring in progress which should make debugging easier in the future.

@chaithyagr (Author)

Updating with some more info.
I had some models that ran for a really long time with a lot of NUFFTs!

With these, I seem to have hit NaN issues even in eager mode.
I have a wrapper that recomputes if a NaN occurs and also reports where the NaN comes from, and it clearly comes from the NUFFT operation.
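The wrapper is essentially doing something like this (a simplified sketch, not my exact code; the recompute logic is omitted):

```python
import tensorflow as tf
import tensorflow_nufft as tfft

def checked_nufft(source, points, **kwargs):
  # Flag NaN/Inf right at the NUFFT output, so a failure is attributed to the
  # op itself rather than to whatever consumes it later (e.g. the loss).
  kspace = tfft.nufft(source, points, **kwargs)
  tf.debugging.check_numerics(tf.math.real(kspace), "nufft output (real part)")
  tf.debugging.check_numerics(tf.math.imag(kspace), "nufft output (imag part)")
  return kspace
```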

I feel eager mode runs slower, which makes NaNs due to a missing device sync less probable.

For now, I have added a device sync as seen in chaithyagr#1.
I don't see a huge perf hit, even though it is an extreme step at the moment. I think we need to check with nvprof whether the run is happening in a different stream, to see if this is even needed.

For now, I don't see any issues in my runs with this update. I can't be sure though; I seem to have tackled this issue in some form or other every time, only for it to come back a month later :P

@jmontalt (Member)

That's great, thank you! Fingers crossed it is finally fixed now. Good luck!
