NaNs randomly pop up in computations #20

Closed
chaithyagr opened this issue Apr 26, 2022 · 18 comments · Fixed by #21
Labels: bug (Something isn't working)

Comments

@chaithyagr

I am observing NaNs coming up in my computations.

I have stopped my runs with conditional breakpoints and re-run the specific functions (tfft.nufft, tfft.spread, etc.), and I have observed the NaNs recur once or twice. However, if I repeat the computations, I usually don't get NaNs again.

For now, I am writing my own wrapper that recomputes the result whenever this occurs, but I suspect the root cause is something bigger, like a missing __syncthreads or cudaDeviceSynchronize.
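
For illustration, here is a minimal sketch of the kind of recompute-on-NaN wrapper I mean (the nufft_with_retry helper is hypothetical, just to show the idea, and only works like this in eager mode since it inspects values):

import tensorflow as tf
import tensorflow_nufft as tfft

# Hypothetical workaround, not part of tensorflow-nufft: re-run the NUFFT a few
# times whenever the result contains NaNs (eager mode only).
def nufft_with_retry(source, points, max_retries=3, **kwargs):
    result = tfft.nufft(source, points, **kwargs)
    for _ in range(max_retries):
        has_nan = tf.logical_or(
            tf.reduce_any(tf.math.is_nan(tf.math.real(result))),
            tf.reduce_any(tf.math.is_nan(tf.math.imag(result))))
        if not bool(has_nan):  # value check, hence eager mode only
            break
        result = tfft.nufft(source, points, **kwargs)  # recompute and check again
    return result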

Sadly, given how random this issue is, I don't have a repro yet; I will see if I can set one up. For now, this is more of a check whether you have faced this issue too, and if so, whether we have any idea what causes it.

I would love to debug deeper. Is there a recommended way to set up gdb or cuda-gdb for this project?

@chaithyagr
Author

Adding here that we never faced this in the original cufinufft, so I suspect this is an issue specific to the wrapper.

@jmontalt
Member

jmontalt commented Apr 26, 2022

I have never observed this behaviour personally, although it's not the first report I have had of runs occasionally going wrong. Unfortunately I have not had the time to investigate that yet. It may be hardware-specific or problem-specific. Could you share some code (as simple as possible) that causes the issue, even if only occasionally?

Do you know if it's possible to run into this issue with eager execution, or does it only happen in graph execution?
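
In case it helps, you can flip all tf.function-compiled code to run eagerly and compare the two modes; a minimal sketch:

import tensorflow as tf

# Force everything decorated with tf.function to run eagerly, then compare
# against the default graph execution.
tf.config.run_functions_eagerly(True)
# ... run the failing computation here ...
tf.config.run_functions_eagerly(False)  # restore graph execution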

As for debugging yourself, I use a VS Code dev container for development, which has all the necessary dependencies to compile and run the op. It might be a good starting point from which you can configure any debugger you would like to use.

@chaithyagr
Author

It occurs in both eager mode and graph execution; I checked that.

I will try to setup a simple repro for you.

I will use the container for some debugging.

@jmontalt
Member

Great, thank you! I'll be very interested to hear about anything you can find out. I'll also do my best to have a look as soon as possible.

@chaithyagr
Author

chaithyagr commented Apr 28, 2022

Adding additional details as I observe them.

This issue seems to occur only when (take this with a pinch of salt, as I really can't be sure given how random it is; for now I just run 100 iterations to gain confidence):

  1. A gradient tape is enabled.
  2. A multiscale SSIM loss is applied to the output.
  3. It crops up even when the optimizer is not used to update the trajectory.

I am still setting up a minimal repro; sadly, the code seems to work fine in the repro so far, but I think it's a matter of closing in on where exactly the issue is triggered.

@chaithyagr
Author

chaithyagr commented Apr 29, 2022

I finally have a working reproducible example/test, although it only shows the issue after running the code for many iterations.

Here is a simple, boiled-down case of a model which tries to learn a k-space sampling pattern for a simple density-compensated adjoint (here is the paper describing parts of it):

import tensorflow as tf
import numpy as np
import tensorflow_nufft as tfft
import tensorflow_mri as tfmri


# We assume no gradients with respect to density compensation
def nograd_density_estimator(traj, img_size, method, max_iter):
    @tf.custom_gradient
    def get_density(traj):
        out = tf.math.reciprocal_no_nan(tfmri.sampling.estimate_density(traj, img_size, method, max_iter))
        # Get approx scaling by take adj_op(op(1))
        test_im = tf.ones(img_size, dtype=tf.complex64)
        test_im_recon = tfft.nufft(
            tf.cast(out, tf.complex64) * tfft.nufft(
                test_im,
                traj,
                transform_type='type_2',
                fft_direction='forward'
            ),
            traj,
            grid_shape=img_size,
            transform_type='type_1',
            fft_direction='backward'
        )
        ratio = tf.reduce_mean(tf.math.abs(test_im_recon))
        out = out / tf.cast(ratio, out.dtype)
        def grad(dy):
            return None
        return out, grad
    return tf.stop_gradient(get_density(tf.transpose(traj)))

# Simple model to learn trajectory on Density compensated adjoint
class TrajModel(tf.keras.Model):
    def __init__(self, shape=(2, 32*512), img_size=(320, 320)):
        super(TrajModel, self).__init__()
        self.shape = shape
        # Radial initialization of trajectory
        self.init = tf.transpose(tf.reshape(tfmri.radial_trajectory(base_resolution=512/2, views=32), (-1, 2)))
        self.traj = self.add_weight(shape=self.shape, dtype=tf.float32)
        self.img_size = img_size
        self.traj.assign(self.init)

    def call(self, image):
        density_comp = nograd_density_estimator(self.traj, self.img_size, method='pipe', max_iter=10)
        kspace = tfft.nufft(image, tf.transpose(self.traj), transform_type='type_2', fft_direction='forward')
        kspace = kspace * tf.cast(density_comp, kspace.dtype)
        recon = tfft.nufft(kspace, tf.transpose(self.traj), grid_shape=image.shape[-2:], transform_type='type_1', fft_direction='backward')
        return tf.abs(recon[..., None])

#tf.config.run_functions_eagerly(True)
my_model = TrajModel()
np.random.seed(0)
I = np.random.random((1, 1, 320, 320)) + 1j * np.random.random((1, 1, 320, 320))
I = I.astype('complex64')
opt = tf.keras.optimizers.Adam(learning_rate=1e-3)
for i in range(1000):
    print(i)
    with tf.GradientTape(persistent=True) as tape:
        recon = my_model(I)
        # loss = tf.reduce_mean(tf.abs(recon - tf.abs(I)))
        loss = tf.reduce_mean(1 - tf.image.ssim_multiscale(tf.abs(I[..., None]), recon, max_val=tf.reduce_max(tf.abs(I))))
        print("Loss : " + str(loss))
        print("Max value of Recon : " + str(np.max(recon)))
    grads = tape.gradient(loss, my_model.trainable_variables)
    print(tf.math.reduce_max(grads))

The same code seems to work fine (again, pinch of salt; in my runs :P) with the L1 loss (if you comment out the mSSIM loss and enable the commented line just above it).

All the details I provided in the above comments (#20 (comment) and #20 (comment)) hold true here as well.

@chaithyagr
Author

I think I resolved the issue. It is, as I expected, a missing cudaDeviceSynchronize. (Also see https://stackoverflow.com/questions/55975775/non-deterministic-results-using-gpus-with-tensorflow-and-tensorflow-serving and https://forums.developer.nvidia.com/t/multiple-runs-different-results/37950.)
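
(As a side note for anyone debugging from the Python side: forcing a host copy after the op makes the CPU wait for the pending GPU work, which is a crude way to check whether an explicit synchronization point changes the behaviour. A small self-contained sketch with toy inputs, following the shapes from the repro above:)

import numpy as np
import tensorflow as tf
import tensorflow_nufft as tfft

# Toy inputs; points are [num_points, ndim] as in the repro above.
image = tf.constant((np.random.random((64, 64)) + 1j * np.random.random((64, 64))).astype('complex64'))
points = tf.constant(np.random.uniform(-np.pi, np.pi, size=(1000, 2)).astype('float32'))

kspace = tfft.nufft(image, points, transform_type='type_2', fft_direction='forward')
_ = kspace.numpy()  # the host copy blocks until the GPU work producing kspace has finished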

I will make a PR soon, after final tests. We can't exactly have a test in place to detect this, as CI runs on CPU where everything is fine, and the above repro isn't exactly perfect either.

I am just doing some final tests to check everything is running fine.

Thank you @jmontalt for your help!

@jmontalt
Member

Excellent! Race conditions of this sort are often easily fixed once you've closed in on the problem, but they can be very tricky to pinpoint. I was not looking forward to it, so thanks to you @chaithyagr!

I agree we'll need to do without a test in this case. The nature of the problem means it's impossible to get deterministic behaviour, and the GitHub runner does not have a GPU anyway.
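
If we ever want a best-effort check, the closest thing I can think of is a stress test that is skipped on CPU-only runners, along these lines (just a sketch, not something in the repo, and a pass still would not prove the race is gone):

import numpy as np
import tensorflow as tf
import tensorflow_nufft as tfft

class NufftNanStressTest(tf.test.TestCase):

    def test_repeated_nufft_has_no_nans(self):
        # Best-effort only: skipped without a GPU, and repeated runs merely
        # increase confidence that no NaNs appear.
        if not tf.config.list_physical_devices('GPU'):
            self.skipTest('Requires a GPU.')
        rng = np.random.default_rng(0)
        image = tf.constant((rng.random((64, 64)) + 1j * rng.random((64, 64))).astype('complex64'))
        points = tf.constant(rng.uniform(-np.pi, np.pi, size=(2000, 2)).astype('float32'))
        for _ in range(100):
            kspace = tfft.nufft(image, points,
                                transform_type='type_2', fft_direction='forward')
            self.assertFalse(np.any(np.isnan(kspace.numpy())))

if __name__ == '__main__':
    tf.test.main()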

I will merge this ASAP.

@jmontalt added the bug label on Apr 29, 2022
@chaithyagr
Author

Well, ideally race conditions like this can be found using NVIDIA's racecheck tool. It just happened that I found this one by reading through the code even before I got racecheck running with line information.
I think cufinufft has a similar issue, and a colleague of mine, @paquiteau, isolated it around the same time.
Turns out simultaneous discoveries do happen xD

@chaithyagr
Author

Hey, it looks like the issue is still not fixed. I am not yet sure what the problem is. @jmontalt, can you please reopen this issue? I will see if I can get another PR with a fix.

@jmontalt
Member

jmontalt commented May 2, 2022

Oh, sure!

@jmontalt reopened this May 2, 2022
@chaithyagr
Author

chaithyagr commented May 3, 2022

@jmontalt I took some time and went through the CUDA API calls and the trace in Nsight. Most of the calls occur in the same stream, so we mostly won't need a device sync (my earlier fix isn't needed). It was a false fix; given how variable this issue is, my first diagnosis was simply wrong.

I had a question:

I see that for the spread step, we do not set the global memory to 0 as is done for execute_type_1. Is there any particular reason? Do we expect TensorFlow to do this?

I feel most of our issues crop up from this: I face the issue when I have density compensation estimation in my test code, and if I run the same estimation but replace spread and interp with complete NUFFT operations (op and adj_op), everything runs smoothly so far (still testing, to be very sure this time).

Further, I observed that spread launches kernels which use atomicAdd operations to update the global memory, so any uninitialized values in that memory carry through to the accumulated result.
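
To illustrate what I mean with a toy NumPy analogy (not the actual CUDA code): accumulating into a buffer that was never zeroed keeps whatever garbage was already in memory, which is exactly the kind of behaviour an atomicAdd into an uninitialized grid would produce.

import numpy as np

# Toy analogy of the suspected spread behaviour, not the real kernels.
grid_uninit = np.empty(8, dtype=np.float32)   # like allocate_temp: contents undefined
grid_zeroed = np.zeros(8, dtype=np.float32)   # what the spread step should start from

# np.add.at plays the role of the atomicAdd accumulation in the spread kernel.
np.add.at(grid_uninit, [1, 1, 3], 1.0)
np.add.at(grid_zeroed, [1, 1, 3], 1.0)

print(grid_uninit)  # expected values plus whatever was already in the buffer
print(grid_zeroed)  # [0. 2. 0. 1. 0. 0. 0. 0.]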

@jmontalt
Member

jmontalt commented May 3, 2022

Hey @chaithyagr, many thanks for looking into this. I was able to reproduce your example. On my machine, it seems to happen at every iteration after the first one. After including the latest fix, I have not observed the issue in 1,000 iterations. You may have nailed it down this time.

You're absolutely right: the grid data should be initialized to 0. This memory is allocated with OpKernelContext::allocate_temp, documented here. I have no reason to believe this initializes the memory.

I might remove the device synchronization instructions we added last time. There is no point in making the CPU wait if subsequent kernels will be launched in the same stream. Do you agree?

@chaithyagr
Author

Yes, I agree with you. I tested without them too; we don't need them. Looking at the CUDA trace, I can be sure that we run almost everything in the same stream.

@jmontalt
Member

jmontalt commented May 3, 2022

Great. I'll prepare the merge.

That certainly should be the case. The CUDA stream is assigned by the TensorFlow runtime and our code should respect its wishes.

jmontalt added a commit that referenced this issue May 3, 2022
@jmontalt closed this as completed May 3, 2022
@chaithyagr
Author

Hello @jmontalt, I don't see the next version up; I thought releases were handled by CD?
Perhaps we could add something like https://github.com/zaccharieramzi/tfkbnufft/blob/master/.github/workflows/publish.yml ?

@jmontalt
Member

jmontalt commented May 4, 2022

Releases are indeed handled by CD, but they are only triggered on tagged commits. I tagged it this morning and it was released half an hour ago.

@chaithyagr
Author

Ah I see, my bad. Thank you!
