TF roberta/XLM roberta numerics issues on A100 if num_iterations >= 100 #274

Closed
monorimet opened this issue Aug 16, 2022 · 3 comments
@monorimet
Collaborator

XLM-roberta assert failure:

>       np.testing.assert_allclose(golden_out, result, rtol=1e-01, atol=1e-02)
E       AssertionError: 
E       Not equal to tolerance rtol=0.1, atol=0.01
E       
E       Mismatched elements: 5505 / 4000032 (0.138%)
E       Max absolute difference: 0.09074688
E       Max relative difference: 3171.7234
E        x: array([[[ 2.683771,  0.183121, 10.453473, ...,  6.315439,  2.047505,
E                 3.32532 ],
E               [-0.482143,  0.061366,  9.494564, ...,  6.593861,  1.620899,...
E        y: array([[[ 2.671124,  0.182537, 10.456981, ...,  6.322483,  2.051546,
E                 3.322179],
E               [-0.481575,  0.061454,  9.495419, ...,  6.59101 ,  1.619549,...

roberta-base-tf assert failure:

>       np.testing.assert_allclose(golden_out, result, rtol=1e-01, atol=1e-02)
E       AssertionError: 
E       Not equal to tolerance rtol=0.1, atol=0.01
E       
E       Mismatched elements: 453 / 804240 (0.0563%)
E       Max absolute difference: 0.04533577
E       Max relative difference: 763.70135
E        x: array([[[33.55235 , -3.827327, 18.863625, ...,  3.420343,  6.171632,
E                11.648125],
E               [-0.598835, -4.141003, 14.904708, ..., -4.515923, -1.790529,...
E        y: array([[[33.567413, -3.829913, 18.870962, ...,  3.422938,  6.174327,
E                11.656706],
E               [-0.58585 , -4.141752, 14.913631, ..., -4.516505, -1.788759,...
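
For reference, `np.testing.assert_allclose(actual, desired, rtol, atol)` flags an element when `|actual - desired| > atol + rtol * |desired|`. With `rtol=1e-01`, `atol=1e-02`, and the maximum absolute differences reported above (about 0.091 and 0.045), any flagged element must have a small `|desired|` (below roughly 0.81 and 0.35 respectively), which fits the very large maximum relative differences. Below is a minimal sketch for locating such elements. `mismatch_report` is a hypothetical helper, not part of SHARK or NumPy, and the third sample value is made up for illustration:

```python
import numpy as np


def mismatch_report(actual, desired, rtol=1e-1, atol=1e-2):
    """Hypothetical helper: list elements that fail the assert_allclose criterion."""
    actual = np.asarray(actual, dtype=np.float64)
    desired = np.asarray(desired, dtype=np.float64)
    abs_diff = np.abs(actual - desired)
    # Same criterion assert_allclose uses: the tolerance scales with |desired|.
    tol = atol + rtol * np.abs(desired)
    mask = abs_diff > tol
    return {
        "mismatched": int(mask.sum()),
        "total": int(mask.size),
        "max_abs_diff": float(abs_diff.max()),
        "indices": np.argwhere(mask),
    }


# The first two pairs come from the XLM-roberta output above; the third is an
# invented near-zero element showing how a small absolute error can still fail.
golden_out = [2.683771, 0.183121, 0.05]
result = [2.671124, 0.182537, 0.00002]
report = mismatch_report(golden_out, result)
print(report["mismatched"], "of", report["total"], "at", report["indices"].tolist())
```

Running a report like this on the real outputs should show whether the failures are confined to near-zero values, in which case a larger `atol` might be a more suitable fix than a larger `rtol`.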

To reproduce:

On an A100 instance:

  • Remove the xfail marker for the GPU case in tank/roberta-base_tf/roberta-base_tf_test.py
  • Remove the xfail marker for the GPU case in tank/xlm-roberta-base_tf/xlm-roberta-base_tf.py
  • Run: pytest tank/*roberta -k "gpu"
@monorimet monorimet self-assigned this Aug 16, 2022
@monorimet monorimet changed the title TF roberta and XLM roberta numerics issues on A100 without TF32 TF roberta/XLM roberta numerics issues on A100 if num_iterations >= 100 Aug 16, 2022
@monorimet
Collaborator Author

Until the patch is merged, check out the branch ean-bench to reproduce.

@monorimet
Collaborator Author

Perhaps the solution is to keep the default shark_args.num_iterations = 1 and increase it only for benchmarks.
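
One way this could look, as a rough sketch only: `run_args` and `benchmark_iterations` are hypothetical stand-ins and do not reflect how `shark_args` is actually defined or used in SHARK. The idea is that correctness tests see the default of one iteration, and a benchmark raises the count temporarily and restores it afterwards:

```python
from contextlib import contextmanager
from types import SimpleNamespace

# Hypothetical stand-in for a global args object; the default stays at 1.
run_args = SimpleNamespace(num_iterations=1)


@contextmanager
def benchmark_iterations(n):
    """Temporarily raise num_iterations, restoring the previous value on exit."""
    previous = run_args.num_iterations
    run_args.num_iterations = n
    try:
        yield run_args
    finally:
        run_args.num_iterations = previous


# Only the benchmark path sees the higher iteration count.
with benchmark_iterations(100):
    print("benchmark iterations:", run_args.num_iterations)
print("default iterations:", run_args.num_iterations)
```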

@monorimet
Collaborator Author

This issue is no longer relevant; closing.
