🚀 Feature request
Fairseq supports memory-efficient FP16 training, as explained in https://arxiv.org/pdf/1904.01038.pdf. It would be great to have the same option here.
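
For context, below is a minimal sketch of the idea behind memory-efficient FP16 training, not fairseq's actual implementation: parameters, gradients, and optimizer state all stay in FP16, so no FP32 master copy of the weights is kept, and a loss scale protects the FP16 gradients from underflow. The model, batch, and scale-adjustment schedule are placeholders; the memory saving comes from dropping the FP32 master weights and FP32 optimizer state that standard mixed-precision training keeps alongside the FP16 model.

```python
import torch

# Illustrative sketch of memory-efficient FP16 training (not fairseq's code).
# Parameters, gradients, and optimizer state are all FP16 -- no FP32 master
# copy -- and a loss scale guards against gradient underflow.

model = torch.nn.Linear(1024, 1024).cuda().half()               # FP16 parameters
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                            momentum=0.9)                        # momentum buffer is FP16 too
loss_scale = 2.0 ** 15                                           # start with a large scale

for step in range(100):
    x = torch.randn(32, 1024, device="cuda", dtype=torch.float16)
    target = torch.randn(32, 1024, device="cuda", dtype=torch.float16)

    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), target)
    (loss * loss_scale).backward()                               # scale the loss before backward

    # Unscale gradients in place and check for overflow (inf/NaN).
    overflow = False
    for p in model.parameters():
        if p.grad is not None:
            p.grad.div_(loss_scale)
            if not torch.isfinite(p.grad).all():
                overflow = True

    if overflow:
        loss_scale = max(loss_scale / 2, 1.0)                    # shrink scale and skip the step
    else:
        optimizer.step()
```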
Motivation
Fine-tuning on datasets with longer sequences generally requires high-end GPUs. Memory-efficient FP16 training reduces GPU memory usage, so models can be fine-tuned on smaller GPUs without running into OOM errors.