Half precision floating point support. #8428

Closed
mkimi-lightricks opened this issue Mar 21, 2017 · 31 comments

@mkimi-lightricks
System information (version)
  • OpenCV => OpenCV iOS 3.1.
  • Operating System / Platform => MacOS Sierra (v 10.12.3), iMac
  • Compiler => Apple LLVM version 8.0.0 (clang-800.0.42.1)
Detailed description
  1. Do you plan to support a half-precision floating-point type?
    If yes, when?
  2. Would you accept a PR which adds half-float support?
Steps to reproduce

None.

@mshabunin mshabunin added the RFC label Mar 21, 2017
@alalek (Member) commented Mar 21, 2017

You can take a look at this PR: #6447 (the changes are merged).
FP16 values are stored as CV_16S (signed 16-bit integer, short).

There are no plans to add FP16 as a native cv::Mat type or to add support for basic operations on it (like add/sub/mul/div). A PR with such changes will not be accepted.
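
For reference, a minimal sketch of that approach: cv::convertFp16 is the conversion entry point merged around that time; exact availability depends on the OpenCV version.

#include <opencv2/core.hpp>

int main()
{
    // FP16 data lives in a CV_16S Mat; cv::convertFp16 converts to/from CV_32F.
    cv::Mat f32 = cv::Mat::ones(4, 4, CV_32F);
    cv::Mat f16, back;
    cv::convertFp16(f32, f16);   // CV_32F -> FP16 bits stored as CV_16S
    cv::convertFp16(f16, back);  // FP16 (CV_16S) -> CV_32F
    return 0;
}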

@mkimi-lightricks (Author)

Thanks.

@vpisarev vpisarev self-assigned this Mar 21, 2017
@vpisarev (Contributor)

@mkimi-lightricks, @alalek,

If we could get a time machine and go back ~10 years, I would add CV_16F as another regular data type, supported by CvMat and cv::Mat.

But even now we can probably squeeze it in with a little bit of effort. We use the 3 lower bits of the CvMat/cv::Mat type field to represent the data depth. Those are:

0 - CV_8U
1 - CV_8S
2 - CV_16U
3 - CV_16S
4 - CV_32S
5 - CV_32F
6 - CV_64F
7 - CV_PTR, see below ...

As you can see, we have occupied all the possible values, and increasing the number of bits per depth from 3 to 4 or more is not feasible: there is too much code that may rely on the current layout. But there is still a workaround available.
Formally, we use depth=7 (CV_PTR) to represent size_t, i.e. an unsigned (or signed) integer of the same size as a pointer (32-bit on 32-bit systems, 64-bit on 64-bit ones). But we use it that way very rarely. I think it's still possible to override depth=7 as CV_16F.
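
For illustration, a standalone sketch of how the type field packs depth and channels; the macros mirror OpenCV's CV_MAKETYPE/CV_MAT_DEPTH but are renamed stand-ins:

#include <stdio.h>

// The depth code occupies the 3 low bits of the type field,
// so only 8 depth values (0..7) can exist.
#define MY_CN_SHIFT 3
#define MY_MAT_DEPTH(t)    ((t) & ((1 << MY_CN_SHIFT) - 1))
#define MY_MAKETYPE(d, cn) (MY_MAT_DEPTH(d) + (((cn) - 1) << MY_CN_SHIFT))

int main()
{
    int t = MY_MAKETYPE(5 /* CV_32F */, 3);  // a 3-channel float matrix type
    printf("type=%d depth=%d channels=%d\n",
           t, MY_MAT_DEPTH(t), (t >> MY_CN_SHIFT) + 1);  // 21, 5, 3
    return 0;
}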

Suppose that we've overridden depth=7 as CV_16F. What's next? C++ does not even know about such a type, as far as I know. So we'd need to do the following:

  1. Define the scalar type (float16? float16_t? half_t? half? ...)
  2. Implement scalar operations (via conversion to/from float): +, -, *, /, abs(), min(), max() (see the sketch below).
  3. Implement all the basic functionality of cv::Mat: copyTo(), setTo(), convertTo() (@alalek has already mentioned that).
  4. Add support for CV_16F to cv::FileStorage (I/O to/from XML/YAML/JSON).
  5. Implement basic arithmetic operations on Mat's containing 16f data.
  6. (maybe) Implement some image processing operations: threshold, filter2D, sepFilter2D, resize, cvtColor, ...
  7. Implement OpenCL kernels for some of the corresponding functions.

It's a big project. I think we could do 1-4, stop, and wait for subsequent contributions. Now the question: who will do it? :)
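
To make step 2 concrete, a minimal sketch, assuming a float16_t type with to/from-float conversions like the one proposed later in this thread: every scalar operation just round-trips through FP32.

#include <cmath>

// Promote to float, compute in FP32, narrow back to half.
inline float16_t operator+(float16_t a, float16_t b)
{
    return float16_t((float)a + (float)b);
}

inline float16_t abs(float16_t a)
{
    return float16_t(std::fabs((float)a));  // min/max follow the same pattern
}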

@alalek (Member) commented Mar 21, 2017

There are still no efficient implementations of FP16 arithmetic on almost all platforms (no support on x86, conversions only; ARM has alternative-format floats). The current home of FP16 is specialized, power-efficient accelerators, and its usage on CPUs is very limited.

Also, there is an alternative to FP16: 16-bit fixed-point numbers, like Q8.7. Both approaches are interesting, but they are probably useful only for specific tasks.

@vpisarev (Contributor)

I believe the main point here is to have some standard, "official" container for such data. Then people can add their own algorithms working on such data, including the obvious scenario where a Mat is converted to CV_32F, processed, and converted back to CV_16F if needed. Also, Intel, NVidia and AMD GPUs already have hardware FP16 support. There are AVX intrinsics to convert between half and float, so any more or less complex operation can do the conversion on the fly, and the performance will be comparable to CV_32F: a few more conversion operations vs. 2x smaller memory traffic.

Half is definitely more convenient to use than Q8.7: you still have a decent representation range, +/-65504, and at the same time can represent (normalized) values as small as 6*10^-5. With Q8.7 data you'd often need to supply a scale factor that maps your data into its quite narrow range.
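
For example, a sketch of that convert-compute-convert pattern using the x86 F16C intrinsics (compile with -mf16c; the scale-by-2 kernel is just a placeholder):

#include <immintrin.h>
#include <stdint.h>

// Scale FP16 data by 2: widen 8 halves to floats, compute in FP32, narrow back.
void scale_fp16(const uint16_t* src, uint16_t* dst, int n)
{
    int i = 0;
    for (; i + 8 <= n; i += 8)
    {
        __m256 v = _mm256_cvtph_ps(_mm_loadu_si128((const __m128i*)(src + i)));
        v = _mm256_mul_ps(v, _mm256_set1_ps(2.0f));
        _mm_storeu_si128((__m128i*)(dst + i),
                         _mm256_cvtps_ph(v, _MM_FROUND_TO_NEAREST_INT));
    }
    for (; i < n; i++)  // scalar tail uses the single-value F16C intrinsics
        dst[i] = _cvtss_sh(_cvtsh_ss(src[i]) * 2.0f, _MM_FROUND_TO_NEAREST_INT);
}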

@StatusReport (Contributor)

@vpisarev let me shed some light on why @mkimi-lightricks and our team require half-float support, and how we currently made OpenCV "work" with it.

We have an engine based on OpenGL running on iOS. On these devices the GPU's and the CPU's memory is shared, so mapping GPU memory to the CPU and working directly on GPU textures is zero-copy. Our CPU matrix data type is cv::Mat, and we use it for various image processing and CV operations. While most of the algorithms can be implemented on the GPU easily, there are times when an entire algorithm, or just a part of it, can be implemented more efficiently on the CPU. To do this, we map an OpenGL texture to the CPU, wrap it with a cv::Mat, use our own algorithms or OpenCV's to do the manipulations we require, then render the result using the GPU.

Since OpenGL ES (3.0) supports half-float out of the box (and not float), we require support for that type on the CPU side as well. Additionally, since the data stored in these matrices is usually HDR images, we do not require wider data types, as the amount of RAM on these devices is fairly limited. Converting a 16MP image from half-float to float would therefore be a huge waste of memory and computation power we do not have.
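
A hypothetical sketch of that zero-copy pattern (buffer setup omitted; the wrapping constructor is standard cv::Mat API, and CV_16FC4 is the macro defined in the snippet below):

#include <opencv2/core.hpp>
#include <OpenGLES/ES3/gl.h>  // iOS OpenGL ES 3.0

// Map a GPU pixel buffer into CPU address space and wrap it in a cv::Mat
// header; no pixels are copied, and the Mat is only valid until unmapping.
cv::Mat wrapMappedBuffer(int width, int height, size_t strideInBytes)
{
    void* ptr = glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0, height * strideInBytes,
                                 GL_MAP_READ_BIT | GL_MAP_WRITE_BIT);
    return cv::Mat(height, width, CV_16FC4, ptr, strideInBytes);
}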

Our current solution is a hack: we defined a type CV_16F that is aliased to CV_16S (since we do not use CV_16S at all), in the following manner:

#define CV_16F CV_16S

#define CV_16FC1 CV_MAKETYPE(CV_16F, 1)
#define CV_16FC2 CV_MAKETYPE(CV_16F, 2)
#define CV_16FC3 CV_MAKETYPE(CV_16F, 3)
#define CV_16FC4 CV_MAKETYPE(CV_16F, 4)
#define CV_16FC(n) CV_MAKETYPE(CV_16F, (n))

namespace cv {
  typedef Vec<half_float::half, 2> Vec2hf;
  typedef Vec<half_float::half, 3> Vec3hf;
  typedef Vec<half_float::half, 4> Vec4hf;

  typedef Mat_<half_float::half> Mat1hf;
  typedef Mat_<Vec2hf> Mat2hf;
  typedef Mat_<Vec3hf> Mat3hf;
  typedef Mat_<Vec4hf> Mat4hf;

  template<>
  class DataDepth<half_float::half> {
  public:
    enum {
      value = CV_16F,
      fmt = (int)'r'
    };
  };

  template<>
  class DataType<half_float::half> {
  public:
    typedef half_float::half value_type;
    typedef value_type work_type;
    typedef value_type channel_type;
    typedef value_type vec_type;
    enum {
      generic_type = 0,
      depth = DataDepth<channel_type>::value,
      channels = 1,
      fmt = DataDepth<channel_type>::fmt,
      type = CV_MAKETYPE(depth, channels)
    };
  };
}

The underlying data type is half_float::half from the half library (http://half.sourceforge.net/).

Obviously, almost all OpenCV operations don't work, but cv::Mat works as a container, and element access works as with any other type (see the usage sketch below). We thought about overriding CV_USRTYPE1 (=7), but as you said, it's a lot of work to add all the operators, and we didn't want to fork OpenCV and maintain custom additions; that's why @mkimi-lightricks opened the issue.
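
For instance, a short, hypothetical usage sketch against the typedefs above:

// Element access compiles and behaves like any other Mat_ specialization,
// even though OpenCV's algorithms know nothing about the half type.
cv::Mat4hf img(1080, 1920);
img(0, 0) = cv::Vec4hf(half_float::half(0.5f), half_float::half(1.0f),
                       half_float::half(0.25f), half_float::half(1.0f));
float r = (float)img(0, 0)[0];  // reads back 0.5f via half -> float conversion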

@alalek (Member) commented Mar 21, 2017

There was an idea of creating a special "fp16" module that would handle operations on the half-float type (probably via cv::Mat with CV_16S storage).
With the double conversions involved, most of these functions may be inefficient, so this module would probably not be for building production-ready pipelines.

But at least such functionality would be useful for testing or prototyping (for example, it could detect "overflows" in some simulation mode).

Also, such an approach would not bloat the OpenCV binaries with "unused" functionality.

@mkimi-lightricks (Author)

@vpisarev implementing steps 1-4 would be a great start.

How do you suggest we proceed from here?

@borisfom (Contributor) commented Apr 5, 2017

Let me add some background here: 16-bit floats are about to appear in C++20 (and in C, at some point) as a new "short float" fundamental type:
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2016.pdf
@mkimi-lightricks, @alalek - do you think we should get this time machine going now? :)
I am currently looking for one crucial piece missing in cudacodec::decoder:
alternative functions to produce float32/float16 frames instead of 8-bit RGBA in opencv/modules/cudacodec/src/cuda/nv12_to_rgb.cu: videoDecPostProcessFrame.
Did anyone come up with those? I could use them right now.

@borisfom (Contributor) commented Apr 5, 2017

@alalek: efficient FP16 GPU implementations go way back, and most professional photo and video formats use FP16. Once the fundamental type appears in the standard, OpenCV will have no choice but to extend for it - better to start early!

@borisfom (Contributor) commented Apr 5, 2017

@vpisarev, @mkimi-lightricks: as a main driver of the "short float" proposal in ANSI C++, I would definitely volunteer to do steps 1-4. Whoever has implemented any bits of FP16 support, please send them to me. Our company would be 100% behind it, as you might guess, but I am not authorized to speak for it :)

@borisfom (Contributor) commented Apr 5, 2017

@StatusReport: I do not see any opencv forks on your profile - if you care to share any fp16 code you have produced so far, I would appreciate it and would incorporate it, with all proper credits, into the (hopefully forthcoming) PR for steps 1-4!

@borisfom (Contributor) commented Apr 5, 2017

@mkimi-lightricks: I do not see any public forks of opencv in your profile either - I would be happy to consider your PR into what may become the steps 1-4 implementation.

@StatusReport (Contributor) commented Apr 5, 2017

@borisfom I was not aware of the upcoming short float, thanks!

We haven't forked OpenCV so far, as we try to avoid diverging from the mainline releases. The code I posted, plus the half_float library, is a good start IMO, since it allows using cv::Mat as a container for half floats. Once it's merged, we can start adding support for basic math operations (at least in an unoptimized manner) quite easily.

The pending question is whether such PRs will be accepted by the maintainers.

@borisfom (Contributor) commented Apr 5, 2017

@StatusReport: until there is 'short float', there is a problem with the common type definition when one can't assume CUDA is present: without CUDA, one can't import the CUDA half type and has to come up with a distinct CPU type. Trying to solve the same issue for, say, the Torch DL framework was not successful.
So this is something to think about: how to make use of CUDA's half type with no overhead while still having something functional for the CPU as well.
As for the rest of the big choices, I would suggest using depth=7 as CV_16F.

@StatusReport (Contributor)

I would go for CV_USRTYPE1 (7) as well, to avoid aliasing existing types. In our solution the aliasing was required only so that we wouldn't have to fork and change OpenCV itself.

@borisfom can you please elaborate on the problem of sharing different types? Isn't this solvable by defining implicit/explicit cast operators between them?

@borisfom (Contributor) commented Apr 5, 2017

The problem was: if you can't rely on the CUDA type being defined, you have to define it somehow for CPU/OpenCL. Even if it has the same name as the CUDA type and is defined in exactly the same literal way, the C++ compiler has every right to consider it a different type, so you would have to jump through the hoops of conversions. Yes, it is solvable, but it would be better avoided.

@vpisarev (Contributor) commented Apr 8, 2017

Sorry for the delay with the reply. OK, so here is a minimal float16_t implementation that we could put into some header in opencv2/core:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#ifndef CV_EXPORTS
#define CV_EXPORTS // normally defined by OpenCV; stubbed so this sample compiles standalone
#endif

namespace cv
{
struct CV_EXPORTS float16_t
{
    explicit float16_t(float x)
    {
        union
        {
            unsigned u;
            float f;
        } in;
        in.f = x;
        
        unsigned abs_f = in.u & 0x7fffffff;
        unsigned t = (abs_f >> 13) - 0x1c000;
        unsigned sign = (in.u & 0x80000000) >> 16;
        unsigned e = abs_f & 0x7f800000;

        t = e < 0x38800000 ? 0 : t; // flush subnormal results to zero
        t = e < 0x47800000 ? t : abs_f > 0x7f800000 ? t - 0x1c000 : 0x7c00; // overflow (>= 2^16) becomes Inf; NaN stays NaN
        t = (e == 0 ? 0 : t) | sign; // zero stays zero; reattach the sign bit
        
        bits = (unsigned short)t;
    }
    
    operator float() const
    {
        union
        {
            unsigned u;
            float f;
        } out;

        unsigned t = ((bits & 0x7fff) << 13) + 0x38000000; // extract and adjust mantissa and exponent
        unsigned sign = (bits & 0x8000) << 16; // extract and shift the sign bit
        unsigned e = bits & 0x7c00;
        out.u = (e >= 0x7c00 ? t + 0x38000000 : e == 0 ? 0 : t) | sign; // Inf/NaN keep the max exponent; denormals flush to 0
        return out.f;
    }

    unsigned short bits;
};
}

int main(int argc, char** argv)
{
    float a[] = { 1.f, 0.f, 0.001f, 0.000001f, 1234.56f, 1e6f, (float)(4*atan(1.)), (float)(1./0.), (float)(0./0.), (float)log(0.) };
    int i, n = 10;
    for( i = 0; i < n; i++ )
    {
        cv::float16_t h(a[i]);
        printf("%d. flt=%f, half=%f (%04x)\n", i, a[i], (float)h, h.bits);
    }
    return 0;
}

Who can volunteer to prepare the pull request where CV_16F and the other pieces are defined properly?

@borisfom (Contributor) commented Apr 8, 2017

@vpisarev: I would not sign up for the other things - but I would like to help define the type right.

We had a few rounds of integrating short-float support into Torch/Caffe at NVIDIA - Caffe definitely being cleaner, as it used a (fancier) C++ class lifted from here: http://half.sourceforge.net/.
Check it out - half_float uses a lot of decorations to act as a first-class citizen, even including literals and numeric limits.
Also, conformance to rounding modes may be very important for DL applications.
But the major part really is to define a separate expression type - then conversions do work.
To get an idea, try 'h += .1f;' in your example above.

We also added CUDA-specific sections to make use of native CUDA support - not too many, but in important places (conversion). It's not up on GitHub yet; we'll have a public release next week.
OpenCV may choose to add native __fp16 support for ARM CPUs as well.

Another very crucial thing we learned is that for DL (and, I believe, for image processing as well) a clamping conversion is a must: in the example above, 1e6f would then translate to the max positive value, 65504, not +Inf. Otherwise many algorithms go nuts. A minimal sketch follows.
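
A minimal sketch of such a clamping conversion, assuming the cv::float16_t struct proposed above is in scope (the helper name is illustrative):

#include <cmath>

// Saturating float -> half: finite values beyond the FP16 range clamp to
// +/-65504 instead of overflowing to +/-Inf; Inf and NaN pass through.
cv::float16_t float_to_half_sat(float x)
{
    const float HALF_MAX = 65504.0f;  // largest finite FP16 value
    if (std::isfinite(x) && std::fabs(x) > HALF_MAX)
        x = std::copysign(HALF_MAX, x);
    return cv::float16_t(x);          // bit-level conversion on the clamped value
}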

@vpisarev (Contributor) commented Apr 9, 2017

@borisfom, thanks for the link! If I understand correctly, it needs C++11, which is still optional for OpenCV. In fact, the actual half implementation is not very important; we just want some placeholder for the half type. As soon as you add a cv::DataType<half_float::half> instantiation to your code (not to OpenCV itself; see opencv2/core/traits.hpp), you can easily do things like:

Mat_<half_float::half> m(1080, 1920);
for(...) {
  m(i, j) = ...;
}

In other words, declaring a half type in OpenCV does not prevent you from using your own implementation of half.

@StatusReport (Contributor)

@vpisarev while this will solve (1), it will not solve (2)-(4), since we'll need to implement the basic half-float operations ourselves. Wouldn't it be easier to add a battle-tested third-party library such as half_float that contains all the operations we need (for a basic implementation without vectorization, anyway)?

Also, half_float requires only C++98, as described on its home page:

Whereas this library is fully C++98-compatible, it can profit from certain C++11 features. Support for those features is checked and enabled automatically at compile (or rather preprocessing) time, but can be explicitly enabled or disabled by defining the corresponding preprocessor symbols to either 1 or 0 yourself.

@vpisarev (Contributor) commented Apr 9, 2017

@StatusReport, we need to make sure that operations on halves are not just template instantiations of generic code, because at least for now major archs like x86/x64 do not support half in hardware (except for conversions to/from float, available with the F16C extension). So the instantiations would work significantly slower than even the double-precision versions. From this point of view my dummy implementation of float16_t is even better than half_float::half, because there is a better chance that such instantiations will not compile. If you want to provide certain operations on half floats, e.g. in the DNN module, you need to implement them explicitly. That's the point.

@borisfom (Contributor)

@StatusReport: yes, C++11 is optional. My point exactly: this implementation was tested with a lot of scrutiny; NVIDIA alone ran many CPU- (and GPU-) weeks through it, uncovering every possible rounding/conversion/normalization issue.
@vpisarev: half_float::half actually implements most of its operations via promotion to float, correct, which is close to optimal in CUDA but may be slow on the CPU. And yes, having the code not compile is a good way of ensuring people do not use what they should not.
The Caffe approach was not entirely efficient for the CPU (well, we did not care much). If you are interested in actually providing a good one for both, avoiding conversions etc., I would suggest you take a look at my effort in Torch (this one is a bit odd, as they did not want a C++ class, so it had to be one type on the CPU and another in CUDA).
This is an example of how operators can be defined (look at THCNumerics): torch/cutorch#679
I would suggest implementing operators and math functions in the global namespace, so you could easily switch types.
There are also some efficient vector operations using thrust in that directory; you may be interested in them regardless of how the scalar type is defined.
I also did some preliminary experimentation with __fp16 in torch/torch7 on ARM; it was not great, as __fp16 was a storage-only type with little intrinsic support. Since then I have seen commits in gcc that imply passing it via registers with both hardfp and softfp. So perhaps for those platforms it would be enough to use __fp16 (or better, _Float16) on the CPU and CUDA's half on the GPU.

@borisfom (Contributor)

@StatusReport, @vpisarev : now that I have some free time after GTC, I would be interested to get some version of float.cpp with CUDA support into OpenCV - including CV_16F plumbing - please let me know how to proceed.

@borisfom (Contributor)

@StatusReport, @vpisarev, @mkimi-lightricks: I'd like to take a cut at FP16 support (steps 1-4); let me know if there is any new work on that in progress.

@mkimi-lightricks (Author)

Sorry, but there's nothing on my end.

@StatusReport (Contributor)

@borisfom please proceed, as it seems that you have strong opinions about (1). We'll be happy to chime in after the basic types are merged.

@borisfom (Contributor) commented Aug 23, 2017

@StatusReport, @mkimi-lightricks, @vpisarev

Update: I started working on it. Resolutions so far:
0. Use 7 for CV_16F.
1. Keep the signatures of any existing FP16 support functions (there is a bunch of them using 'short' storage).
2. Do not bring in any CUDA headers where they were not previously used, so as not to break any builds.
3. Use a new cv::float16 type to make sure there are no collisions. Using the CUDA 'half' type directly is not feasible even when HAVE_CUDA (because of [2]), and even where __fp16 is available I think it's better to use the same float16 type, to be replaced with 'short float' later (code changes will be needed).
4. Do not bring in any explicit arithmetic on float16 yet - just the necessary conversions.
5. Make the float16 definition available universally. Conversion support is compiled in only for modules where either CV_CPU_COMPILE_FP16 or CUDA_VERSION is defined.
6. Both CvMat and Mat should be extended.

I have submitted 2 PRs so far, one fixing the CUDA 9 build and another extending the CPU dispatch for the tests. Both are ready to go in:
#9418
#9426

@vpisarev (Contributor)

@borisfom, thank you for working on it! Hopefully, at some point we will have a more or less universal, Halide-based solution to support CV_16F widely across many functions. But that would be impossible anyway without basic support for this type at the cv::Mat/cv::UMat level, so your effort is very much appreciated!

@borisfom (Contributor) commented Aug 29, 2017

@vpisarev: You are welcome!
Update: goal (1) turned out not to be feasible; otherwise it's going smoothly.
The current state can be seen here: https://github.com/borisfom/opencv/tree/fp16

@dkurt (Member) commented Sep 25, 2018

I think we can close this issue. See #12463.
