Training gives nan after the first iteration #5

Closed
Ping-C opened this issue Oct 27, 2023 · 3 comments
Comments

Ping-C commented Oct 27, 2023

Hello, when I try to run training with the command

python scripts/train.py --config configs/base.py:base

It seems like the train loss becomes nan immediately after the first iteration. Is this something that you have encountered before?

Ping-C commented Oct 27, 2023

It seems I was able to fix the error by uncommenting this line in train. Is this something you might want to update in your repo?

# seems like remat is actually enabled by default -- this disables it
    # @partial(jax.checkpoint, policy=jax.checkpoint_policies.everything_saveable)

I still don't know why it gave an error though. It seems like this is something you encountered before. Would you mind sharing with me what this fix meant?

kvablack (Owner) commented

I ran into this issue before with JAX 0.4.13 and fixed it by downgrading to 0.4.11. Very interesting that disabling remat fixes it; I never figured that out. The commented-out line is just a remnant from when I was experimenting with memory usage. You might want to try different JAX versions instead of disabling remat, since (at least in theory) remat can give significant speed and memory savings.
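
For anyone comparing the two workarounds, here is a minimal sketch of a sanity check. It prints the installed JAX version, then compares gradients of a toy function with no checkpointing, with plain `jax.checkpoint`, and with the `everything_saveable` policy from the commented-out line. The function `toy_loss` and its inputs are hypothetical stand-ins and not the repo's training loss, so this is not claimed to reproduce the nan. It only shows what the decorator changes: with `everything_saveable`, every intermediate may be stored for the backward pass, so nothing has to be recomputed.

```python
from functools import partial

import jax
import jax.numpy as jnp

# The version matters here: the thread reports nan on 0.4.13 but not on 0.4.11.
print("JAX version:", jax.__version__)


def toy_loss(w, x):
    # Hypothetical stand-in for the real training loss.
    h = jnp.tanh(x @ w)
    return jnp.mean(h ** 2)


# Plain jax.checkpoint: intermediates may be recomputed in the backward pass.
remat_loss = jax.checkpoint(toy_loss)

# Same wrapper as the commented-out decorator: every intermediate may be saved,
# so nothing needs to be rematerialized.
saved_loss = partial(
    jax.checkpoint, policy=jax.checkpoint_policies.everything_saveable
)(toy_loss)

w = jnp.full((4, 3), 0.1)
x = jnp.arange(8.0).reshape(2, 4)

grads = {
    "no checkpoint": jax.grad(toy_loss)(w, x),
    "remat": jax.grad(remat_loss)(w, x),
    "everything_saveable": jax.grad(saved_loss)(w, x),
}

# Report whether each variant produced finite gradients.
for name, g in grads.items():
    print(name, "finite:", bool(jnp.all(jnp.isfinite(g))))
```

On a working install, all three variants should give finite gradients that agree to within floating-point tolerance. If they disagree on a particular JAX version, that points toward the remat path rather than the model code.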

Ping-C commented Oct 27, 2023

I see. Thanks for getting back to me! Will update that :) Very cool paper by the way!

@Ping-C Ping-C closed this as completed Oct 27, 2023