
Stricter FP32 tests #614

Merged 2 commits into karpathy:master on Jun 18, 2024
Conversation

gordicaleksa (Contributor)

  • Stricter FP32 logit accuracy
  • Much stricter FP32 loss accuracy
  • Much stricter FP32 grad tensor accuracy (and somewhat stricter 16-bit accuracy)
  • Copied over new expected loss values from PyTorch (they're a bit different, and I took all 6 decimal places, not just 4)
  • Also adapted our test logic to round the loss to 6 decimal places (a sketch follows below)
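
As an illustration of that last bullet, here is a minimal sketch of what a loss check rounded to 6 decimal places can look like. The names (losses, expected_losses, allok) and the step count are assumptions for illustration, not the exact code from test_gpt2.c:

// Sketch: compare the measured losses against the PyTorch reference,
// after rounding both to 6 decimal places. Needs <math.h> for roundf.
// `losses`, `expected_losses`, and the step count 10 are illustrative.
int losses_ok = 1;
for (int step = 0; step < 10; step++) {
    float a = roundf(losses[step] * 1e6f) / 1e6f;
    float b = roundf(expected_losses[step] * 1e6f) / 1e6f;
    if (a != b) {
        printf("LOSS MISMATCH AT STEP %d: %f vs %f\n", step, a, b);
        losses_ok = 0;
    }
}
allok = allok & losses_ok;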

Regarding grad tensors: back when Andrej hardcoded the thresholds, we had a bug in the PyTorch reference that led to a bigger discrepancy between our PyTorch and C code. Now that that's fixed, we can be really strict and use 1e-6f here.

allok = allok & check_tensor(tensors1[13], tensors2[13], L * C, "ln2b", 2.5e-3f);
allok = allok & check_tensor(tensors1[14], tensors2[14], C, "lnfw", 0.12f); // hmm bit higher
allok = allok & check_tensor(tensors1[15], tensors2[15], C, "lnfb", 2e-2f);

Contributor
Can we turn this entire thing into a loop? Have a small struct {size, name, threshold}, which would also make it easier to match thresholds to tensor names?
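
For illustration, a minimal sketch of that suggestion, assuming check_tensor keeps its current (tensor, tensor, size, name, threshold) signature; the entries shown are just the three from the excerpt above:

// Sketch: table-driven tensor checks instead of one hardcoded line each.
// Keeps each tensor's size, name, and threshold next to each other.
typedef struct {
    int idx;           // index into tensors1 / tensors2
    int size;          // number of elements to compare
    const char* name;  // label used in error messages
    float threshold;   // max tolerated absolute difference
} TensorCheck;

TensorCheck checks[] = {
    {13, L * C, "ln2b", 2.5e-3f},
    {14, C,     "lnfw", 0.12f},
    {15, C,     "lnfb", 2e-2f},
    // ... remaining tensors ...
};
for (size_t i = 0; i < sizeof(checks) / sizeof(checks[0]); i++) {
    allok = allok & check_tensor(tensors1[checks[i].idx], tensors2[checks[i].idx],
                                 checks[i].size, checks[i].name, checks[i].threshold);
}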

gordicaleksa (Contributor, Author)
Agree we could do that, but I'd do it in a separate PR.

Diff excerpt from the expected loss values (old 4-decimal entries replaced by new 6-decimal entries from PyTorch):

-    0.7367,
-    0.4008,
-    0.1874
+    5.270009,
Contributor
Ideally, we'd export these from PyTorch when we generate the reference file, instead of having to copy them over manually.

gordicaleksa (Contributor, Author) · Jun 18, 2024
Hopefully they don't change that often, since we're not touching the reference implementation that frequently - but if they do, then it's worth investing in that.
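
As a hypothetical sketch of that idea on the C side: if the Python script that generates the reference file also wrote the expected losses into it, the test could read them back instead of hardcoding them. The file name, the layout, and the assumption that the losses sit at a known offset are all illustrative, not the repo's actual format:

// Sketch (assumed layout): read n expected losses that the Python
// reference script would have appended to the debug-state file.
#include <stdio.h>

int read_expected_losses(const char* path, float* out, int n) {
    FILE* f = fopen(path, "rb");
    if (f == NULL) { return 0; }
    // ... seek past the existing debug-state contents here ...
    int ok = fread(out, sizeof(float), n, f) == (size_t)n;
    fclose(f);
    return ok;
}

// usage (file name is illustrative):
// float expected_losses[10];
// if (!read_expected_losses("gpt2_124M_debug_state.bin", expected_losses, 10)) { /* fail */ }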

karpathy merged commit 6ecc52e into karpathy:master on Jun 18, 2024 · 10 checks passed
rosslwheeler (Contributor) commented Jun 19, 2024

@gordicaleksa - this appears to be causing a failure when running:

make testgpt2_cu USE_CUDNN=1 && ./testgpt2_cu

It may not reproduce in your environment, but I am able to see it here.

@karpathy - I think this is what was causing the issue. I ran my tests one commit back from this in your repo and they pass consistently; if I check out this commit, the failures start. I'm not 100 percent sure, but it makes some sense, since this is the test that's failing and there haven't been many other recent changes to this file.

Can you confirm, since you were seeing the failure consistently too? Thank you.

gordicaleksa (Contributor, Author)
It's certainly this PR - sad our CI didn't catch this! See #615 for a fix.
