
[references] Add FP16 support for training #263

Closed · 5 tasks done · Tracked by #791

fg-mindee opened this issue May 17, 2021 · 2 comments · Fixed by #682
Labels: ext: references (Related to references folder) · framework: pytorch (Related to PyTorch backend) · framework: tensorflow (Related to TensorFlow backend) · help wanted (Extra attention is needed)

fg-mindee (Contributor) commented May 17, 2021

The reference training script should have an option to switch from FP32 training to FP16 training.
This raises a question: how do we harmonize model loading across both precisions?

A few suggestions:

  • simply cast the FP16 checkpoint to FP32 before hashing and uploading it
  • tag each checkpoint of a given model with its precision, and let the user choose how the model is instantiated: the output precision defaults to FP32, but the user can change it. Depending on the checkpoint's precision, it is cast to the requested precision at load time.

The second option would be cleaner and would avoid unnecessarily large checkpoints when they can be kept in FP16.

Here is a proposition:
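A minimal sketch of how the second option could look in PyTorch — the `load_pretrained_params` helper and the `fp16` metadata key below are hypothetical, for illustration only, not doctr's actual API:

```python
import torch

def load_pretrained_params(
    model: torch.nn.Module,
    ckpt_path: str,
    target_dtype: torch.dtype = torch.float32,
) -> torch.nn.Module:
    """Load a checkpoint tagged with its precision, casting to the requested dtype."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    # Hypothetical metadata: the checkpoint records which precision it was saved in
    ckpt_dtype = torch.float16 if ckpt.get("fp16", False) else torch.float32
    state_dict = ckpt["state_dict"]
    if ckpt_dtype != target_dtype:
        # Only floating-point tensors are cast; integer buffers are left untouched
        state_dict = {
            k: v.to(target_dtype) if v.is_floating_point() else v
            for k, v in state_dict.items()
        }
    model.load_state_dict(state_dict)
    # Module.to(dtype) casts only floating-point parameters and buffers
    return model.to(target_dtype)
```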

fg-mindee added the help wanted and ext: references labels on May 17, 2021
fg-mindee added this to the 0.3.0 milestone on May 17, 2021
fg-mindee modified the milestones: 0.3.0 → 0.3.1 on Jul 1, 2021
fg-mindee self-assigned this on Jul 1, 2021
fg-mindee added the framework: pytorch and framework: tensorflow labels on Jul 6, 2021
fg-mindee modified the milestones: 0.3.1 → 0.4.0 on Aug 26, 2021
fg-mindee modified the milestones: 0.4.0 → 0.4.1 on Sep 20, 2021
fg-mindee (Contributor, Author) commented Nov 10, 2021

Quick update: this design was good on paper, but ML frameworks actually use AMP (automatic mixed precision) rather than full FP16 training. Other data types also cause issues in post-processing (among other things, cv2 doesn't handle arrays that aren't in FP32).
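For reference, a minimal AMP training step in PyTorch looks like this — generic `torch.cuda.amp` usage, not doctr-specific code:

```python
import torch
from torch.cuda.amp import GradScaler, autocast

model = torch.nn.Linear(512, 10).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = GradScaler()  # rescales the loss to avoid FP16 gradient underflow

def train_step(images: torch.Tensor, targets: torch.Tensor) -> float:
    optimizer.zero_grad()
    with autocast():  # runs eligible ops in FP16, keeps the rest in FP32
        logits = model(images)
        loss = torch.nn.functional.cross_entropy(logits, targets)
    scaler.scale(loss).backward()  # backward pass on the scaled loss
    scaler.step(optimizer)         # unscales gradients, then steps
    scaler.update()                # adjusts the scale factor for the next step
    return loss.item()
```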

fg-mindee (Contributor, Author) commented:

Moving this to a later release, pending TensorFlow support for AMP.
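On the TensorFlow side, the equivalent mechanism is Keras mixed precision — again a generic sketch under the standard `tf.keras.mixed_precision` API, not doctr code:

```python
import tensorflow as tf
from tensorflow.keras import mixed_precision

# Run eligible ops in FP16 while keeping variables in FP32
mixed_precision.set_global_policy("mixed_float16")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(512,)),
    tf.keras.layers.Dense(10),
    # Keep the final softmax in FP32 for numerical stability
    tf.keras.layers.Activation("softmax", dtype="float32"),
])

# Under the mixed_float16 policy, compile() applies loss scaling automatically,
# playing the same role as GradScaler in PyTorch
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```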
