NaN's randomly pop up in computations #20
Adding here that we never faced this in the original cufinufft, so I feel this is an issue more specific to the wrapper.
I have never observed this behaviour personally, although it's not the first report I've had of runs occasionally going wrong. Unfortunately I have not had the time to investigate that yet. It may be hardware-specific or problem-specific. Could you share some code (as simple as possible) that causes the issue, even if only occasionally? Do you know if it's possible to run into this issue with eager execution, or does it only happen in graph execution? As for debugging yourself, I use a VS Code dev container for development, which has all the necessary dependencies to compile and run the op. It might be a good starting point, where you can configure any debugger you would like to use.
It occurs both in eager mode and graph execution; I checked that. I will try to set up a simple repro for you, and I will use the container for some debugging.
Great, thank you! I'll be very interested to hear about anything you can find out. I'll also do my best to have a look as soon as possible.
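As a possible starting point for localizing the first bad value, here is a minimal sketch using TensorFlow's numeric checks (the `assert_finite` helper below is made up for illustration, and `tf.debugging.check_numerics` only accepts float tensors, hence the real/imaginary split):

```python
import tensorflow as tf

# Instrument every float-producing op so the first NaN/Inf raises an error
# naming the offending op (works in eager and graph mode, but slows
# execution down considerably).
tf.debugging.enable_check_numerics()

# For spot checks on complex NUFFT outputs, split into real/imaginary parts,
# since tf.debugging.check_numerics only handles float tensors.
def assert_finite(x, msg):
    if x.dtype.is_complex:
        tf.debugging.check_numerics(tf.math.real(x), msg + " (real part)")
        tf.debugging.check_numerics(tf.math.imag(x), msg + " (imag part)")
    else:
        tf.debugging.check_numerics(x, msg)
    return x
```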
Adding additional details as I observe them. This issue occurs only under certain conditions (take it with a pinch of salt, as I really can't be sure given how random it is; I just run 100 iterations to be sure for now).
I am currently still setting up a minimal repro. Sadly, the code seems to work fine in the repro, but I think it's a matter of closing in on where exactly the issue is being triggered.
I finally have a working reproducible example / test, although it only reproduces the issue after running the code for many iterations. Here is a simple boiled-down case of a model which tries to learn a k-space sampling pattern for a simple density-compensated adjoint (here is the paper describing parts of it):

```python
import tensorflow as tf
import numpy as np
import tensorflow_nufft as tfft
import tensorflow_mri as tfmri


# We assume no gradients with respect to the density compensation.
def nograd_density_estimator(traj, img_size, method, max_iter):
    @tf.custom_gradient
    def get_density(traj):
        out = tf.math.reciprocal_no_nan(
            tfmri.sampling.estimate_density(traj, img_size, method, max_iter))
        # Get approximate scaling by taking adj_op(op(1)).
        test_im = tf.ones(img_size, dtype=tf.complex64)
        test_im_recon = tfft.nufft(
            tf.cast(out, tf.complex64) * tfft.nufft(
                test_im,
                traj,
                transform_type='type_2',
                fft_direction='forward'
            ),
            traj,
            grid_shape=img_size,
            transform_type='type_1',
            fft_direction='backward'
        )
        ratio = tf.reduce_mean(tf.math.abs(test_im_recon))
        out = out / tf.cast(ratio, out.dtype)

        def grad(dy):
            return None

        return out, grad

    return tf.stop_gradient(get_density(tf.transpose(traj)))


# Simple model to learn a trajectory with a density-compensated adjoint.
class TrajModel(tf.keras.Model):
    def __init__(self, shape=(2, 32 * 512), img_size=(320, 320)):
        super(TrajModel, self).__init__()
        self.shape = shape
        # Radial initialization of the trajectory.
        self.init = tf.transpose(tf.reshape(
            tfmri.radial_trajectory(base_resolution=512 / 2, views=32), (-1, 2)))
        self.traj = self.add_weight(shape=self.shape, dtype=tf.float32)
        self.img_size = img_size
        self.traj.assign(self.init)

    def call(self, image):
        density_comp = nograd_density_estimator(
            self.traj, self.img_size, method='pipe', max_iter=10)
        kspace = tfft.nufft(image, tf.transpose(self.traj),
                            transform_type='type_2', fft_direction='forward')
        kspace = kspace * tf.cast(density_comp, kspace.dtype)
        recon = tfft.nufft(kspace, tf.transpose(self.traj),
                           grid_shape=image.shape[-2:],
                           transform_type='type_1', fft_direction='backward')
        return tf.abs(recon[..., None])


# tf.config.run_functions_eagerly(True)
my_model = TrajModel()
np.random.seed(0)
I = np.random.random((1, 1, 320, 320)) + 1j * np.random.random((1, 1, 320, 320))
I = I.astype('complex64')
opt = tf.keras.optimizers.Adam(learning_rate=1e-3)

for i in range(1000):
    print(i)
    with tf.GradientTape(persistent=True) as tape:
        recon = my_model(I)
        # loss = tf.reduce_mean(tf.abs(recon - tf.abs(I)))
        loss = tf.reduce_mean(1 - tf.image.ssim_multiscale(
            tf.abs(I[..., None]), recon, max_val=tf.reduce_max(tf.abs(I))))
        print("Loss: " + str(loss))
        print("Max value of recon: " + str(np.max(recon)))
    grads = tape.gradient(loss, my_model.trainable_variables)
    print(tf.math.reduce_max(grads))
```

The same code seems to work fine (again, pinch of salt, in my runs :P) with the L1 loss (if you comment out the mSSIM loss and enable just the line above it). All the details I provided in the above comments (#20 (comment) and #20 (comment)) hold true here as well.
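For reference, a brute-force way to hunt for the intermittent failure could look like the sketch below; it assumes eager execution and reuses `my_model` and `I` from the repro above, with an arbitrary iteration count:

```python
import numpy as np
import tensorflow as tf

# Rerun the repro's training step many times and stop at the first
# iteration where the loss or any gradient becomes non-finite.
for i in range(1000):
    with tf.GradientTape() as tape:
        recon = my_model(I)
        loss = tf.reduce_mean(1 - tf.image.ssim_multiscale(
            tf.abs(I[..., None]), recon, max_val=tf.reduce_max(tf.abs(I))))
    grads = tape.gradient(loss, my_model.trainable_variables)
    bad_loss = not np.isfinite(loss.numpy())
    bad_grads = any(g is not None and not np.all(np.isfinite(g.numpy()))
                    for g in grads)
    if bad_loss or bad_grads:
        print(f"First non-finite value at iteration {i}")
        break
```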
I think I resolved the issue. It is as I expected: the need for a device synchronization. I will make a PR soon, after final tests. We can't exactly have a test in place to detect this, as it would run on the CPU where it will be fine, and the above repro isn't exactly perfect either. I am just doing some final tests to check everything is running fine. Thank you @jmontalt for your help!
Excellent! Race conditions of this sort are often easily fixed once you've closed in on the problem, but they can be very tricky to pinpoint, and I was not looking forward to it, so thanks to you @chaithyagr! I agree we'll need to do without a test in this case: the nature of the problem makes deterministic behaviour impossible, and the GitHub runner does not have a GPU anyway. I will merge this ASAP.
Well, ideally, race conditions like this can easily be found using NVIDIA's racecheck tool. It just happened that I found this while looking through the code, even before I got racecheck running with line information.
Hey, it looks like the issue is still not fixed. I am not yet very sure what the issue is. @jmontalt, can you please reopen this issue? I will try to see if I can get another PR in with the fix.
Oh, sure!
@jmontalt I took some time and went through the CUDA API calls and the trace in Nsight. I suspected that most of the calls occur in the same stream, due to which we mostly won't need a device sync (so my earlier fix isn't needed). It was a false fix; given how variable this issue is, I just got it wrong in my first attempt.

I had a question: I see that the grid data does not appear to be initialized to zero. I feel most of our issues crop up from this, as I face this issue when I have density compensation estimation in my test code, and if I run the same estimation and replace ...

Further, I observed that the ...
Hey @chaithyagr, many thanks for looking into this. I was able to reproduce your example; on my machine, it seems to happen at every iteration after the first one. After including the latest fix, I have not observed the issue in 1,000 iterations. You may have nailed it down this time.

You're absolutely right, the grid data should be initialized to 0. This memory was originally allocated without being zeroed.

I might remove the device synchronization instructions we added last time. There is no point in making the CPU wait if subsequent kernels will be launched in the same stream. Do you agree?
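As a host-side illustration only (not the library's actual code), the following sketch shows why an accumulation buffer that is never zeroed produces exactly this kind of intermittent, run-to-run nondeterminism:

```python
import numpy as np

# np.empty allocates without clearing, much like device memory that is never
# memset: the "spread" accumulation then adds data on top of whatever was
# already in the buffer. Often that leftover content happens to be zero,
# which is why the bug only shows up intermittently.
grid_uninitialized = np.empty((4, 4), dtype=np.complex64)  # undefined contents
grid_zeroed = np.zeros((4, 4), dtype=np.complex64)         # well-defined contents

for grid in (grid_uninitialized, grid_zeroed):
    grid += 1.0 + 0.0j  # accumulate some "spread" data into the grid

print(grid_zeroed)         # always all ones
print(grid_uninitialized)  # ones plus whatever was left over in memory
```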
Yes, I agree with you. I tested without them too; we don't need them. Looking at the CUDA trace, I can be sure that we run almost everything in the same stream.
Great, I'll prepare the merge. That certainly should be the case: the CUDA stream is assigned by the TensorFlow runtime and our code should respect its wishes.
Hello @jmontalt, I don't see the next version up yet. I thought it was on CD?
It certainly is CD, but releases are only triggered on tagged commits. I tagged it this morning and it was released half an hour ago.
Ah, I see, my bad. Thank you!
I observe that NaNs come up in my computations.

I have stopped my runs with conditional breakpoints and repeated the specific functions (`tfft.nufft`, `tfft.spread`, etc.) and observed NaNs recurring once or twice. However, if I repeat the computations, I usually don't get NaNs again. For now, I am writing my own wrapper to carry out recomputation in case this occurs, but I feel this could be something bigger, like a missing `__syncthreads` or `cudaDeviceSynchronize`. Sadly, given how random this issue is, I don't have a repro. I will see if I can set one up. For now, this is more of a check whether you have faced this issue too, in which case, do we have any idea?

I would love to debug deeper; is there a way to have a `gdb` or `cuda-gdb` setup?
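The recomputation wrapper mentioned above might look roughly like the sketch below; it is a user-side band-aid rather than a fix, the function name and retry count are made up, and it assumes eager execution:

```python
import tensorflow as tf
import tensorflow_nufft as tfft

def nufft_with_retry(source, points, max_retries=3, **kwargs):
    """Re-run tfft.nufft if the result contains NaN/Inf (eager mode only)."""
    result = tfft.nufft(source, points, **kwargs)
    for _ in range(max_retries):
        finite = tf.logical_and(
            tf.reduce_all(tf.math.is_finite(tf.math.real(result))),
            tf.reduce_all(tf.math.is_finite(tf.math.imag(result))))
        if bool(finite):
            break
        result = tfft.nufft(source, points, **kwargs)  # recompute and re-check
    return result
```

In graph mode the Python-level `if` would not behave as intended; one would need `tf.cond` or `tf.while_loop` instead.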