NaNs randomly pop up in computations #20

Closed
chaithyagr opened this issue Apr 26, 2022 · 18 comments · Fixed by #21
Labels: bug (Something isn't working)

Comments

@chaithyagr

I am observing NaNs coming up in my computations.

I have stopped my runs with conditional breakpoints and re-run the specific functions (tfft.nufft, tfft.spread, etc.), and I have observed the NaNs recur once or twice. However, if I repeat the computations, I usually don't get NaNs again.

For now, I am writing my own wrapper that recomputes the result whenever this occurs, but I suspect the root cause is something bigger, like a missing __syncthreads or cudaDeviceSynchronize.
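
For illustration, here is a minimal sketch of the kind of recompute-on-NaN wrapper I mean (the nufft_with_retry helper is hypothetical, just to show the idea, and only works like this in eager mode since it inspects values):

import tensorflow as tf
import tensorflow_nufft as tfft

# Hypothetical workaround, not part of tensorflow-nufft: re-run the NUFFT a few
# times whenever the result contains NaNs (eager mode only).
def nufft_with_retry(source, points, max_retries=3, **kwargs):
    result = tfft.nufft(source, points, **kwargs)
    for _ in range(max_retries):
        has_nan = tf.logical_or(
            tf.reduce_any(tf.math.is_nan(tf.math.real(result))),
            tf.reduce_any(tf.math.is_nan(tf.math.imag(result))))
        if not bool(has_nan):  # value check, hence eager mode only
            break
        result = tfft.nufft(source, points, **kwargs)  # recompute and check again
    return result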

Sadly, given how random this issue is, I don't have a repro yet; I will see if I can set one up. For now, this is more of a check whether you have faced this issue too, and if so, whether we have any idea what causes it.

I would love to debug deeper. Is there a recommended way to set up gdb or cuda-gdb for this project?

@chaithyagr
Author

Adding here that we never faced this in the original cufinufft, so I suspect this is an issue specific to the wrapper.

@jmontalt
Member

jmontalt commented Apr 26, 2022

I have never observed this behaviour personally, although it's not the first report I have had of runs occasionally going wrong. Unfortunately I have not had the time to investigate that yet. It may be hardware-specific or problem-specific. Could you share some code (as simple as possible) that causes the issue, even if only occasionally?

Do you know if it's possible to run into this issue with eager execution, or does it only happen in graph execution?
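
In case it helps, you can flip all tf.function-compiled code to run eagerly and compare the two modes; a minimal sketch:

import tensorflow as tf

# Force everything decorated with tf.function to run eagerly, then compare
# against the default graph execution.
tf.config.run_functions_eagerly(True)
# ... run the failing computation here ...
tf.config.run_functions_eagerly(False)  # restore graph execution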

As for debugging yourself, I use a VS Code dev container for development, which has all the necessary dependencies to compile and run the op. It might be a good starting point from which you can configure any debugger you would like to use.

@chaithyagr
Author

It occurs in both eager mode and graph execution; I checked that.

I will try to setup a simple repro for you.

I will use the container for some debugging.

@jmontalt
Member

Great, thank you! I'll be very interested to hear about anything you can find out. I'll also do my best to have a look as soon as possible.

@chaithyagr
Author

chaithyagr commented Apr 28, 2022

Adding additional details as I observe them.

This issue seems to occur only when (take this with a pinch of salt, as I really can't be sure given how random it is; for now I just run 100 iterations to gain confidence):

  1. A gradient tape is enabled.
  2. A multiscale SSIM loss is applied to the output.
  3. It crops up even when the optimizer is not used to update the trajectory.

I am still setting up a minimal repro; sadly, the code seems to work fine in the repro so far, but I think it's a matter of closing in on where exactly the issue is triggered.

@chaithyagr
Author

chaithyagr commented Apr 29, 2022

I finally have a working reproducible example/test, although it only shows the issue after running the code for many iterations.

Here is a simple, boiled-down case of a model which tries to learn a k-space sampling pattern for a simple density-compensated adjoint (here is the paper describing parts of it):

import tensorflow as tf
import numpy as np
import tensorflow_nufft as tfft
import tensorflow_mri as tfmri


# We assume no gradients with respect to density compensation
def nograd_density_estimator(traj, img_size, method, max_iter):
    @tf.custom_gradient
    def get_density(traj):
        out = tf.math.reciprocal_no_nan(tfmri.sampling.estimate_density(traj, img_size, method, max_iter))
        # Get approx scaling by take adj_op(op(1))
        test_im = tf.ones(img_size, dtype=tf.complex64)
        test_im_recon = tfft.nufft(
            tf.cast(out, tf.complex64) * tfft.nufft(
                test_im,
                traj,
                transform_type='type_2',
                fft_direction='forward'
            ),
            traj,
            grid_shape=img_size,
            transform_type='type_1',
            fft_direction='backward'
        )
        ratio = tf.reduce_mean(tf.math.abs(test_im_recon))
        out = out / tf.cast(ratio, out.dtype)
        def grad(dy):
            return None
        return out, grad
    return tf.stop_gradient(get_density(tf.transpose(traj)))

# Simple model to learn trajectory on Density compensated adjoint
class TrajModel(tf.keras.Model):
    def __init__(self, shape=(2, 32*512), img_size=(320, 320)):
        super(TrajModel, self).__init__()
        self.shape = shape
        # Radial initialization of trajectory
        self.init = tf.transpose(tf.reshape(tfmri.radial_trajectory(base_resolution=512/2, views=32), (-1, 2)))
        self.traj = self.add_weight(shape=self.shape, dtype=tf.float32)
        self.img_size = img_size
        self.traj.assign(self.init)

    def call(self, image):
        density_comp = nograd_density_estimator(self.traj, self.img_size, method='pipe', max_iter=10)
        kspace = tfft.nufft(image, tf.transpose(self.traj), transform_type='type_2', fft_direction='forward')
        kspace = kspace * tf.cast(density_comp, kspace.dtype)
        recon = tfft.nufft(kspace, tf.transpose(self.traj), grid_shape=image.shape[-2:], transform_type='type_1', fft_direction='backward')
        return tf.abs(recon[..., None])

#tf.config.run_functions_eagerly(True)
my_model = TrajModel()
np.random.seed(0)
I = np.random.random((1, 1, 320, 320)) + 1j * np.random.random((1, 1, 320, 320))
I = I.astype('complex64')
opt = tf.keras.optimizers.Adam(learning_rate=1e-3)
for i in range(1000):
    print(i)
    with tf.GradientTape(persistent=True) as tape:
        recon = my_model(I)
        # loss = tf.reduce_mean(tf.abs(recon - tf.abs(I)))
        loss = tf.reduce_mean(1 - tf.image.ssim_multiscale(tf.abs(I[..., None]), recon, max_val=tf.reduce_max(tf.abs(I))))
        print("Loss : " + str(loss))
        print("Max value of Recon : " + str(np.max(recon)))
    grads = tape.gradient(loss, my_model.trainable_variables)
    print(tf.math.reduce_max(grads))

The same code seems to work fine (again, pinch of salt; in my runs :P) with the L1 loss (if you comment out the mSSIM loss and enable the commented line just above it).

All the details I provided in the above comments (#20 (comment) and #20 (comment)) hold true here as well.

@chaithyagr
Author

I think I resolved the issue. It is, as I expected, a missing cudaDeviceSynchronize. (Also see https://stackoverflow.com/questions/55975775/non-deterministic-results-using-gpus-with-tensorflow-and-tensorflow-serving and https://forums.developer.nvidia.com/t/multiple-runs-different-results/37950.)
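
(As a side note for anyone debugging from the Python side: forcing a host copy after the op makes the CPU wait for the pending GPU work, which is a crude way to check whether an explicit synchronization point changes the behaviour. A small self-contained sketch with toy inputs, following the shapes from the repro above:)

import numpy as np
import tensorflow as tf
import tensorflow_nufft as tfft

# Toy inputs; points are [num_points, ndim] as in the repro above.
image = tf.constant((np.random.random((64, 64)) + 1j * np.random.random((64, 64))).astype('complex64'))
points = tf.constant(np.random.uniform(-np.pi, np.pi, size=(1000, 2)).astype('float32'))

kspace = tfft.nufft(image, points, transform_type='type_2', fft_direction='forward')
_ = kspace.numpy()  # the host copy blocks until the GPU work producing kspace has finished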

I will make a PR soon, after final tests. We can't exactly have a test in place to detect this, as CI runs on CPU where everything is fine, and the above repro isn't exactly perfect either.

I am just doing some final tests to check everything is running fine.

Thank you @jmontalt for your help!

@jmontalt
Member

Excellent! Race conditions of this sort are often easily fixed once you've closed in on the problem, but they can be very tricky to pinpoint. I was not looking forward to it, so thanks to you @chaithyagr!

I agree we'll need to do without a test in this case. The nature of the problem means it's impossible to get deterministic behaviour, and the GitHub runner does not have a GPU anyway.
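
If we ever want a best-effort check, the closest thing I can think of is a stress test that is skipped on CPU-only runners, along these lines (just a sketch, not something in the repo, and a pass still would not prove the race is gone):

import numpy as np
import tensorflow as tf
import tensorflow_nufft as tfft

class NufftNanStressTest(tf.test.TestCase):

    def test_repeated_nufft_has_no_nans(self):
        # Best-effort only: skipped without a GPU, and repeated runs merely
        # increase confidence that no NaNs appear.
        if not tf.config.list_physical_devices('GPU'):
            self.skipTest('Requires a GPU.')
        rng = np.random.default_rng(0)
        image = tf.constant((rng.random((64, 64)) + 1j * rng.random((64, 64))).astype('complex64'))
        points = tf.constant(rng.uniform(-np.pi, np.pi, size=(2000, 2)).astype('float32'))
        for _ in range(100):
            kspace = tfft.nufft(image, points,
                                transform_type='type_2', fft_direction='forward')
            self.assertFalse(np.any(np.isnan(kspace.numpy())))

if __name__ == '__main__':
    tf.test.main()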

I will merge this ASAP.

@jmontalt added the bug label on Apr 29, 2022
@chaithyagr
Author

Well, ideally race conditions like this can be found using NVIDIA's racecheck tool. It just happened that I found this one by reading through the code even before I got racecheck running with line information.
I think cufinufft has a similar issue, and a colleague of mine, @paquiteau, isolated it around the same time.
Turns out simultaneous discoveries do happen xD

@chaithyagr
Author

Hey, it looks like the issue is still not fixed. I am not yet sure what the problem is. @jmontalt, can you please reopen this issue? I will see if I can get another PR with a fix.

@jmontalt
Member

jmontalt commented May 2, 2022

Oh, sure!

@jmontalt reopened this May 2, 2022
@chaithyagr
Author

chaithyagr commented May 3, 2022

@jmontalt I took some time and went through the CUDA API calls and the trace in Nsight. Most of the calls occur in the same stream, so we mostly won't need a device sync (my earlier fix isn't needed). It was a false fix; given how variable this issue is, my first diagnosis was simply wrong.

I had a question:

I see that for the spread step, we do not set the global memory to 0 as is done for execute_type_1. Is there any particular reason? Do we expect TensorFlow to do this?

I feel most of our issues crop up from this: I face the issue when I have density compensation estimation in my test code, and if I run the same estimation but replace spread and interp with complete NUFFT operations (op and adj_op), everything runs smoothly so far (still testing, to be very sure this time).

Further, I observed that spread launches kernels which use atomicAdd operations to update the global memory, so any uninitialized values in that memory carry through to the accumulated result.
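
To illustrate what I mean with a toy NumPy analogy (not the actual CUDA code): accumulating into a buffer that was never zeroed keeps whatever garbage was already in memory, which is exactly the kind of behaviour an atomicAdd into an uninitialized grid would produce.

import numpy as np

# Toy analogy of the suspected spread behaviour, not the real kernels.
grid_uninit = np.empty(8, dtype=np.float32)   # like allocate_temp: contents undefined
grid_zeroed = np.zeros(8, dtype=np.float32)   # what the spread step should start from

# np.add.at plays the role of the atomicAdd accumulation in the spread kernel.
np.add.at(grid_uninit, [1, 1, 3], 1.0)
np.add.at(grid_zeroed, [1, 1, 3], 1.0)

print(grid_uninit)  # expected values plus whatever was already in the buffer
print(grid_zeroed)  # [0. 2. 0. 1. 0. 0. 0. 0.]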

@jmontalt
Member

jmontalt commented May 3, 2022

Hey @chaithyagr, many thanks for looking into this. I was able to reproduce your example. On my machine, it seems to happen at every iteration after the first one. After including the latest fix, I have not observed the issue in 1,000 iterations. You may have nailed it down this time.

You're absolutely right: the grid data should be initialized to 0. This memory is allocated with OpKernelContext::allocate_temp, documented here. I have no reason to believe this initializes the memory.

I might remove the device synchronization instructions we added last time. There is no point in making the CPU wait if subsequent kernels will be launched in the same stream. Do you agree?

@chaithyagr
Author

Yes, I agree with you. I tested without them too; we don't need them. Looking at the CUDA trace, I can be sure that we run almost everything in the same stream.

@jmontalt
Member

jmontalt commented May 3, 2022

Great. I'll prepare the merge.

That certainly should be the case. The CUDA stream is assigned by the TensorFlow runtime and our code should respect its wishes.

jmontalt added a commit that referenced this issue May 3, 2022
@jmontalt closed this as completed May 3, 2022
@chaithyagr
Author

Hello @jmontalt, I don't see the next version up; I thought releases were handled by CD?
Perhaps we could add something like https://github.com/zaccharieramzi/tfkbnufft/blob/master/.github/workflows/publish.yml ?

@jmontalt
Member

jmontalt commented May 4, 2022

Releases are indeed handled by CD, but they are only triggered on tagged commits. I tagged it this morning and it was released half an hour ago.

@chaithyagr
Author

Ah I see, my bad. Thank you!
