conversion #12

milankl · 2020-02-19T12:11:00Z

Just found this

julia> Float8(0.015)
Float8(0.0625)
julia> Float8(0.02)
Float8(0.015625)

although 0.015 <= 0.02 that's not true after conversion. Problem only occurs for subnormals. It's not the conversion back to Float32 (triggered by show):

 julia> bitstring(Float8(0.015))
"00000100"
julia> bitstring(Float8(0.02))
"00000001"

The text was updated successfully, but these errors were encountered:

milankl · 2020-02-26T18:02:17Z

@JeffreySarnoff do you have an idea? There's something wrong with the rounding for subnormals I assume...

https://github.com/milankl/Float8s.jl/blob/9345556ac3d11f61bef2c70f49947113b8949066/src/float8.jl#L164-L193

JeffreySarnoff · 2020-02-27T18:58:13Z

I'll look at it tonight.

JeffreySarnoff · 2020-02-28T09:14:37Z

still looking, meanwhile

Base.Float32(x::Float8) = Float8toFloat32[reinterpret(UInt8,x) + 0x01]

const Float8toFloat32 = Float32[0.0, 0.015625, 0.03125, 0.046875, 0.0625, 0.078125, 0.09375, 0.109375, 0.125, 0.140625, 0.15625, 0.171875, 0.1875, 0.203125, 0.21875, 0.234375, 0.25, 0.265625, 0.28125, 0.296875, 0.3125, 0.328125, 0.34375, 0.359375, 0.375, 0.390625, 0.40625, 0.421875, 0.4375, 0.453125, 0.46875, 0.484375, 0.5, 0.53125, 0.5625, 0.59375, 0.625, 0.65625, 0.6875, 0.71875, 0.75, 0.78125, 0.8125, 0.84375, 0.875, 0.90625, 0.9375, 0.96875, 1.0, 1.0625, 1.125, 1.1875, 1.25, 1.3125, 1.375, 1.4375, 1.5, 1.5625, 1.625, 1.6875, 1.75, 1.8125, 1.875, 1.9375, 2.0, 2.125, 2.25, 2.375, 2.5, 2.625, 2.75, 2.875, 3.0, 3.125, 3.25, 3.375, 3.5, 3.625, 3.75, 3.875, 4.0, 4.25, 4.5, 4.75, 5.0, 5.25, 5.5, 5.75, 6.0, 6.25, 6.5, 6.75, 7.0, 7.25, 7.5, 7.75, 8.0, 8.5, 9.0, 9.5, 10.0, 10.5, 11.0, 11.5, 12.0, 12.5, 13.0, 13.5, 14.0, 14.5, 15.0, 15.5, Inf32, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, -0.0, -0.015625, -0.03125, -0.046875, -0.0625, -0.078125, -0.09375, -0.109375, -0.125, -0.140625, -0.15625, -0.171875, -0.1875, -0.203125, -0.21875, -0.234375, -0.25, -0.265625, -0.28125, -0.296875, -0.3125, -0.328125, -0.34375, -0.359375, -0.375, -0.390625, -0.40625, -0.421875, -0.4375, -0.453125, -0.46875, -0.484375, -0.5, -0.53125, -0.5625, -0.59375, -0.625, -0.65625, -0.6875, -0.71875, -0.75, -0.78125, -0.8125, -0.84375, -0.875, -0.90625, -0.9375, -0.96875, -1.0, -1.0625, -1.125, -1.1875, -1.25, -1.3125, -1.375, -1.4375, -1.5, -1.5625, -1.625, -1.6875, -1.75, -1.8125, -1.875, -1.9375, -2.0, -2.125, -2.25, -2.375, -2.5, -2.625, -2.75, -2.875, -3.0, -3.125, -3.25, -3.375, -3.5, -3.625, -3.75, -3.875, -4.0, -4.25, -4.5, -4.75, -5.0, -5.25, -5.5, -5.75, -6.0, -6.25, -6.5, -6.75, -7.0, -7.25, -7.5, -7.75, -8.0, -8.5, -9.0, -9.5, -10.0, -10.5, -11.0, -11.5, -12.0, -12.5, -13.0, -13.5, -14.0, -14.5, -15.0, -15.5, -Inf32, -NaN, -NaN, -NaN, -NaN, -NaN, -NaN, -NaN, -NaN, -NaN, -NaN, -NaN, -NaN, -NaN, -NaN, -NaN];

milankl · 2020-02-28T11:38:23Z

Yeah great, I always forget how small 8bit actually is. Will check how much faster that is!

milankl · 2020-02-28T16:10:59Z

Yep, a lookup table made it about 2x faster. I've added an @inbounds which gave another 2x speed-up.

Base.Float32(x::Float8) = @inbounds Float8toFloat32[reinterpret(UInt8,x) + 0x01]

JeffreySarnoff · 2020-02-28T22:54:33Z

The table method does not appear to round properly.
I put this together. The only function to call directly
is Float8(x::Float32) which accepts a Float32 value
and yields the corresponding (round nearest) Float8.

# the constants
# ---------------

#=
   One table is split into subsections of 16 entries each.
   This keeps the scan time minimal, even after accounting
   for the conditionals that select the proper subsection.

   These are Float8 values offset by 1/2 way to the next value,
   this lets `findfirst` return an appropriately rounded value.

   The `F8offsetN` tuples are used with normal values,
   values that are finite and are not subnormal. Values
   that round to zero are handled before these are used.
=#

const F8offset1 = (Float32[
   0.2578125, 0.2734375, 0.2890625, 0.3046875, 0.3203125, 0.3359375, 
   0.3515625, 0.3671875, 0.3828125, 0.3984375, 0.4140625, 0.4296875,
   0.4453125, 0.4609375, 0.4765625, 0.4921875]...,)

const F8offset2 = (Float32[
    0.515625, 0.546875, 0.578125, 0.609375, 0.640625, 0.671875, 
    0.703125, 0.734375, 0.765625, 0.796875, 0.828125, 0.859375,
    0.890625, 0.921875, 0.953125, 0.984375]...,);

const F8offset3 = (Float32[
    1.03125, 1.09375, 1.15625, 1.21875, 1.28125, 1.34375, 1.40625,
    1.46875, 1.53125, 1.59375, 1.65625, 1.71875, 1.78125, 1.84375, 
    1.90625, 1.96875]...,);

const F8offset4 = (Float32[
    2.0625, 2.1875, 2.3125, 2.4375, 2.5625, 2.6875, 2.8125,
    2.9375, 3.0625, 3.1875, 3.3125, 3.4375, 3.5625, 3.6875,
    3.8125, 3.9375]...,);

const F8offset5 = (Float32[
    4.125, 4.375, 4.625, 4.875, 5.125, 5.375, 5.625, 5.875, 6.125,
    6.375,  6.625, 6.875, 7.125, 7.375, 7.625, 7.875]...,);

const F8offset6 = (Float32[
    8.25, 8.75, 9.25, 9.75, 10.25, 10.75, 11.25, 11.75, 12.25, 12.75,
    13.25, 13.75, 14.25, 14.75, 15.25, 15.75]...,);
#=
     There is one table used with subnormal values.
      It is derived from the actual values of each
      subnormal quantity, shifted up halfway to the
      next subnormal.  This lets scanning also round.
      An initial (anchor) value is prepended, that value
      is half of the smallest subnormal.

     A corresponding table of UInt8 values also is used.
=#

const F8offset_subnormal = (
    0.0078125f0, 0.0234375f0, 0.0390625f0, 0.0546875f0, 0.0703125f0,
    0.0859375f0, 0.1015625f0, 0.1171875f0, 0.1328125f0, 0.1484375f0,
    0.1640625f0, 0.1796825f0, 0.1953125f0, 0.2109375f0, 0.2265625f0,
    0.2421875f0)

const U8subnormal = (collect(UInt8.(0:15))...,)

# some named constants to clarify the source text

const roundsto_floatmax8 = 15.25f0
const roundsto_zero8 = 0.0078125f0
const roundsto_subnormal = 0.2421875f0
const floatmaxplus8 = 15.75f0 # floatmax(Float8) + floatmin(Float8)

const UNaN8 = 0x78
const UInf8 = 0x70
const UFloatmax8 = 0x6f
const UFloatmin8 = 0x10
const UZero8 = 0x00

#  the functions
# ----------------

function Float8(x::Float32)
    s, absx = signbit(x), abs(x)
    ui =toUInt8(x)
    return reinterpret(Float8, ui)
end

function toUInt8(x::Float32)
    s, absx = signbit(x), abs(x)
    isnan(absx) && return s ? UNaN8|0x80 : UNaN8
    if absx >= roundsto_floatmax8
        if absx > floatmaxplus8
            return s ? UInf8|0x80 : UInf8
        else
            return s ? UFloatmax8|0x80 : UFloatmax8
        end
    end
    if absx < roundsto_zero8
        return s ? UZero8|0x80 : UZero8
    elseif absx < roundsto_subnormal
        return subnormal8(s, absx)
    end
    absx = min(15.5f0, max(0.25f0, absx))
    return normal8(s, absx)
end

@inline function subnormal8(s::Bool, absx::Float32)
    idx = findfirst(a->absx <= a, F8offset_subnormal)
    return s ? U8subnormal[idx] | 0x80 : U8subnormal[idx]
end

function normal8(s::Bool, absx::Float32)
   if absx <= 0.4921875f0
        idx = UInt8(15+findfirst(a->absx <= a, F8offset1))
        return s ? idx|0x80 : idx
   elseif absx <= 0.984375f0
        idx = UInt8(15+16+findfirst(a->absx <= a, F8offset2))
        return s ? idx|0x80 : idx
   elseif absx <= 1.96875f0
       idx = UInt8(15+32+findfirst(a->absx <= a, F8offset3))
       return s ? idx|0x80 : idx
   elseif absx <= 3.9375f0
       idx = UInt8(15+32+16+findfirst(a->absx <= a, F8offset4))
       return s ? idx|0x80 : idx
   elseif absx <= 7.875f0
       idx = UInt8(15+64+findfirst(a->absx <= a, F8offset5))
       return s ? idx|0x80 : idx
   elseif absx <= 15.75f0
       idx = UInt8(15+64+16+findfirst(a->absx <= a, F8offset6))
       return s ? idx|0x80 : idx
   else
       throw(DomainError(absx,"not normal for Float8"))
   end
end

milankl · 2020-03-03T11:09:33Z

Amazing Jeffrey, I'll implement that in the next days, thanks so much!

JeffreySarnoff · 2020-03-04T19:28:16Z

note correction in toUInt8
absx = min(15.5f0, max(0.25f0, absx))

This is yet faster. It uses the fact that the original discrimination ensures any input to normal8 will be found in firstof16lte and splits the search in normal8.

function normal8(s::Bool, absx::Float32)
   if absx <= 1.96875f0
     if absx <= 0.4921875f0
          idx = UInt8(15+firstof16lte(absx, F8offset1))
          return s ? idx|0x80 : idx
     elseif absx <= 0.984375f0
          idx = UInt8(15+16+firstof16lte(absx, F8offset2))
          return s ? idx|0x80 : idx
     else
         idx = UInt8(15+32+firstof16lte(absx, F8offset3))
         return s ? idx|0x80 : idx
     end    
   else  
     if absx <= 3.9375f0
         idx = UInt8(15+32+16+firstof16lte(absx, F8offset4))
         return s ? idx|0x80 : idx
     elseif absx <= 7.875f0
         idx = UInt8(15+64+firstof16lte(absx, F8offset5))
         return s ? idx|0x80 : idx
     else
         idx = UInt8(15+64+16+firstof16lte(absx, F8offset6))
         return s ? idx|0x80 : idx
     end
   end
end

function firstof16lte(needle, haystack)
    for idx = 1:16
        if needle <= haystack[idx]
            return idx
        end
    end
    error("should not be reached")
end

milankl · 2020-03-16T21:31:49Z

Great, yeah, master now contains your version, as mine contained a bug. However, I get considerably slower timings

julia> A = rand(Float32,300,300) .+ 1f0;  # no subnormals
julia> @btime Float8.($A); # Jeffrey's version
  2.815 ms (2 allocations: 88.02 KiB)

julia> @btime Float8_old.($A); # Milan's version (only correct for normals)
  1.652 ms (2 allocations: 88.02 KiB)

julia> @btime Float16.($A);
  1.065 ms (2 allocations: 175.89 KiB)

Any idea where this comes from?

JeffreySarnoff · 2020-03-17T00:18:28Z

You could use a hybrid, yours instead of my normal8 iff isnormal, keeping mine for issubnormal

milankl added the bug Something isn't working label Feb 27, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

conversion #12

conversion #12

milankl commented Feb 19, 2020 •

edited

Loading

milankl commented Feb 26, 2020 •

edited

Loading

JeffreySarnoff commented Feb 27, 2020

JeffreySarnoff commented Feb 28, 2020 •

edited

Loading

milankl commented Feb 28, 2020

milankl commented Feb 28, 2020

JeffreySarnoff commented Feb 28, 2020 •

edited

Loading

milankl commented Mar 3, 2020

JeffreySarnoff commented Mar 4, 2020 •

edited

Loading

milankl commented Mar 16, 2020

JeffreySarnoff commented Mar 17, 2020

conversion #12

conversion #12

Comments

milankl commented Feb 19, 2020 • edited Loading

milankl commented Feb 26, 2020 • edited Loading

JeffreySarnoff commented Feb 27, 2020

JeffreySarnoff commented Feb 28, 2020 • edited Loading

milankl commented Feb 28, 2020

milankl commented Feb 28, 2020

JeffreySarnoff commented Feb 28, 2020 • edited Loading

milankl commented Mar 3, 2020

JeffreySarnoff commented Mar 4, 2020 • edited Loading

milankl commented Mar 16, 2020

JeffreySarnoff commented Mar 17, 2020

milankl commented Feb 19, 2020 •

edited

Loading

milankl commented Feb 26, 2020 •

edited

Loading

JeffreySarnoff commented Feb 28, 2020 •

edited

Loading

JeffreySarnoff commented Feb 28, 2020 •

edited

Loading

JeffreySarnoff commented Mar 4, 2020 •

edited

Loading