
[libc] Move printf long double to simple calc #75414

Merged

Conversation

michaelrj-google (Contributor)

The Ryu algorithm is very fast with its table, but that table grows too
large for long doubles. This patch adds a method of calculating the
digits of long doubles using just wide integers and fast modulo
operations. This results in significant performance improvements over the
previous int calc mode, while using a similar amount of peak memory.
It will be slow in some %e/%g cases, but reasonably fast for %f, with no
loss of accuracy.

llvmbot (Collaborator) commented Dec 20, 2023

@llvm/pr-subscribers-libc

Author: None (michaelrj-google)

Patch is 31.30 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/75414.diff

6 Files Affected:

  • (modified) libc/config/config.json (+1-1)
  • (modified) libc/docs/dev/printf_behavior.rst (+15-5)
  • (modified) libc/src/__support/UInt.h (+2)
  • (modified) libc/src/__support/float_to_string.h (+267-182)
  • (modified) libc/test/src/stdio/sprintf_test.cpp (+31)
  • (modified) utils/bazel/llvm-project-overlay/libc/BUILD.bazel (-2)
diff --git a/libc/config/config.json b/libc/config/config.json
index 77d10d75f36467..6a208cc5566116 100644
--- a/libc/config/config.json
+++ b/libc/config/config.json
@@ -13,7 +13,7 @@
       "doc": "Disable handling of %n in printf format string."
     },
     "LIBC_CONF_PRINTF_FLOAT_TO_STR_USE_MEGA_LONG_DOUBLE_TABLE": {
-      "value": true,
+      "value": false,
       "doc": "Use large table for better printf long double performance."
     }
   },
diff --git a/libc/docs/dev/printf_behavior.rst b/libc/docs/dev/printf_behavior.rst
index 29b1b17ebaecb0..52252c61b02546 100644
--- a/libc/docs/dev/printf_behavior.rst
+++ b/libc/docs/dev/printf_behavior.rst
@@ -87,14 +87,25 @@ are not recommended to be adjusted except by persons familiar with the Printf
 Ryu Algorithm. Additionally they have no effect when float conversions are
 disabled.
 
+LIBC_COPT_FLOAT_TO_STR_NO_SPECIALIZE_LD
+---------------------------------------
+This flag disables the separate long double conversion implementation. It is
+not based on the Ryu algorithm, instead generating the digits by
+multiplying/dividing the written-out number by 10^9 to get blocks. It's
+significantly faster than INT_CALC, only about 10x slower than MEGA_TABLE,
+and is small in binary size. Its downside is that it always calculates all
+of the digits above the decimal point, making it slightly inefficient for %e
+calls with large exponents. This is the default. If this flag is not set, no 
+other flags will change the long double behavior.
+
 LIBC_COPT_FLOAT_TO_STR_USE_MEGA_LONG_DOUBLE_TABLE
 -------------------------------------------------
 When set, the float to string decimal conversion algorithm will use a larger
 table to accelerate long double conversions. This larger table is around 5MB of 
-size when compiled. This flag is enabled by default in the CMake.
+size when compiled.
 
-LIBC_COPT_FLOAT_TO_STR_USE_DYADIC_FLOAT(_LD)
---------------------------------------------
+LIBC_COPT_FLOAT_TO_STR_USE_DYADIC_FLOAT
+---------------------------------------
 When set, the float to string decimal conversion algorithm will use dyadic
 floats instead of a table when performing floating point conversions. This
 results in ~50 digits of accuracy in the result, then zeroes for the remaining
@@ -107,8 +118,7 @@ LIBC_COPT_FLOAT_TO_STR_USE_INT_CALC
 When set, the float to string decimal conversion algorithm will use wide
 integers instead of a table when performing floating point conversions. This
 gives the same results as the table, but is very slow at the extreme ends of
-the long double range. If no flags are set this is the default behavior for
-long double conversions.
+the long double range.
 
 LIBC_COPT_FLOAT_TO_STR_NO_TABLE
 -------------------------------
diff --git a/libc/src/__support/UInt.h b/libc/src/__support/UInt.h
index cfd495c5861851..e90535358d6aac 100644
--- a/libc/src/__support/UInt.h
+++ b/libc/src/__support/UInt.h
@@ -448,6 +448,8 @@ template <size_t Bits, bool Signed> struct BigInt {
     // pos is the index of the current 64-bit chunk that we are processing.
     size_t pos = WORDCOUNT;
 
+    // TODO: look into if constexpr(Bits > 256) skip leading zeroes.
+
     for (size_t q_pos = WORDCOUNT - lower_pos; q_pos > 0; --q_pos) {
       // q_pos is 1 + the index of the current 64-bit chunk of the quotient
       // being processed.
diff --git a/libc/src/__support/float_to_string.h b/libc/src/__support/float_to_string.h
index 923633e3d207f5..7c04b892266f6c 100644
--- a/libc/src/__support/float_to_string.h
+++ b/libc/src/__support/float_to_string.h
@@ -13,18 +13,27 @@
 
 #include "src/__support/CPP/type_traits.h"
 #include "src/__support/FPUtil/FPBits.h"
+#include "src/__support/FPUtil/FloatProperties.h"
 #include "src/__support/FPUtil/dyadic_float.h"
 #include "src/__support/UInt.h"
 #include "src/__support/common.h"
 #include "src/__support/libc_assert.h"
+#include "src/__support/macros/attributes.h"
 
 // This file has 5 compile-time flags to allow the user to configure the float
-// to string behavior. These allow the user to select which 2 of the 3 useful
-// properties they want. The useful properties are:
-//  1) Speed of Evaluation
-//  2) Small Size of Binary
-//  3) Centered Output Value
-// These are explained below with the flags that are missing each one.
+// to string behavior. These were used to explore tradeoffs during the design
+// phase, and can still be used to gain specific properties. Unless you
+// specifically know what you're doing, you should leave all these flags off.
+
+// LIBC_COPT_FLOAT_TO_STR_NO_SPECIALIZE_LD
+//  This flag disables the separate long double conversion implementation. It is
+//  not based on the Ryu algorithm, instead generating the digits by
+//  multiplying/dividing the written-out number by 10^9 to get blocks. It's
+//  significantly faster than INT_CALC, only about 10x slower than MEGA_TABLE,
+//  and is small in binary size. Its downside is that it always calculates all
+//  of the digits above the decimal point, making it inefficient for %e calls
+//  with large exponents. If this flag is not set, no other flags will change
+//  the long double behavior.
 
 // LIBC_COPT_FLOAT_TO_STR_USE_MEGA_LONG_DOUBLE_TABLE
 //  The Mega Table is ~5 megabytes when compiled. It lists the constants needed
@@ -33,16 +42,13 @@
 //  exchange for large binary size.
 
 // LIBC_COPT_FLOAT_TO_STR_USE_DYADIC_FLOAT
-// LIBC_COPT_FLOAT_TO_STR_USE_DYADIC_FLOAT_LD
 //  Dyadic floats are software floating point numbers, and their accuracy can be
 //  as high as necessary. This option uses 256 bit dyadic floats to calculate
 //  the table values that Ryu Printf needs. This is reasonably fast and very
 //  small compared to the Mega Table, but the 256 bit floats only give accurate
 //  results for the first ~50 digits of the output. In practice this shouldn't
 //  be a problem since long doubles are only accurate for ~35 digits, but the
-//  trailing values all being 0s may cause brittle tests to fail. The _LD
-//  version of this flag only effects the long double calculations, and the
-//  other version effects both long double and double.
+//  trailing values all being 0s may cause brittle tests to fail.
 
 // LIBC_COPT_FLOAT_TO_STR_USE_INT_CALC
 //  Integer Calculation uses wide integers to do the calculations for the Ryu
@@ -60,9 +66,8 @@
 
 // Default Config:
 //  If no flags are set, doubles use the normal (and much more reasonably sized)
-//  Ryu Printf table and long doubles use Integer Calculation. This is because
-//  long doubles are rarely used and the normal Ryu Printf table is very fast
-//  for doubles.
+//  Ryu Printf table and long doubles use their specialized implementation. This
+//  provides good performance and binary size.
 
 #ifdef LIBC_COPT_FLOAT_TO_STR_USE_MEGA_LONG_DOUBLE_TABLE
 #include "src/__support/ryu_long_double_constants.h"
@@ -152,15 +157,17 @@ LIBC_INLINE constexpr uint32_t ceil_log10_pow2(const uint32_t e) {
   return log10_pow2(e) + 1;
 }
 
+LIBC_INLINE constexpr uint32_t div_ceil(const uint32_t num,
+                                        const uint32_t denom) {
+  return (num + (denom - 1)) / denom;
+}
+
 // Returns the maximum number of 9 digit blocks a number described by the given
 // index (which is ceil(exponent/16)) and mantissa width could need.
 LIBC_INLINE constexpr uint32_t length_for_num(const uint32_t idx,
                                               const uint32_t mantissa_width) {
-  //+8 to round up when dividing by 9
-  return (ceil_log10_pow2(idx) + ceil_log10_pow2(mantissa_width + 1) +
-          (BLOCK_SIZE - 1)) /
-         BLOCK_SIZE;
-  // return (ceil_log10_pow2(16 * idx + mantissa_width) + 8) / 9;
+  return div_ceil(ceil_log10_pow2(idx) + ceil_log10_pow2(mantissa_width + 1),
+                  BLOCK_SIZE);
 }
 
 // The formula for the table when i is positive (or zero) is as follows:
@@ -427,8 +434,6 @@ class FloatToString {
 
     // Adjust for the width of the mantissa.
     exponent -= FRACTION_LEN;
-
-    // init_convert();
   }
 
   LIBC_INLINE constexpr bool is_nan() { return float_bits.is_nan(); }
@@ -452,6 +457,8 @@ class FloatToString {
       // shift_amount = -(c0 - exponent) = c_0 + 16 * ceil(exponent/16) -
       // exponent
 
+      const uint32_t pos_exp = idx * IDX_SIZE;
+
       cpp::UInt<MID_INT_SIZE> val;
 
 #if defined(LIBC_COPT_FLOAT_TO_STR_USE_DYADIC_FLOAT)
@@ -462,24 +469,25 @@ class FloatToString {
 
       // ---------------------------- INT CALC MODE ----------------------------
       const int32_t SHIFT_CONST = CALC_SHIFT_CONST;
-
       const uint64_t MAX_POW_2_SIZE =
-          exponent + CALC_SHIFT_CONST - (BLOCK_SIZE * block_index);
+          pos_exp + CALC_SHIFT_CONST - (BLOCK_SIZE * block_index);
       const uint64_t MAX_POW_5_SIZE =
           internal::log2_pow5(BLOCK_SIZE * block_index);
       const uint64_t MAX_INT_SIZE =
           (MAX_POW_2_SIZE > MAX_POW_5_SIZE) ? MAX_POW_2_SIZE : MAX_POW_5_SIZE;
 
       if (MAX_INT_SIZE < 1024) {
-        val = internal::get_table_positive<1024>(IDX_SIZE * idx, block_index);
+        val = internal::get_table_positive<1024>(pos_exp, block_index);
       } else if (MAX_INT_SIZE < 2048) {
-        val = internal::get_table_positive<2048>(IDX_SIZE * idx, block_index);
+        val = internal::get_table_positive<2048>(pos_exp, block_index);
       } else if (MAX_INT_SIZE < 4096) {
-        val = internal::get_table_positive<4096>(IDX_SIZE * idx, block_index);
+        val = internal::get_table_positive<4096>(pos_exp, block_index);
       } else if (MAX_INT_SIZE < 8192) {
-        val = internal::get_table_positive<8192>(IDX_SIZE * idx, block_index);
+        val = internal::get_table_positive<8192>(pos_exp, block_index);
+      } else if (MAX_INT_SIZE < 16384) {
+        val = internal::get_table_positive<16384>(pos_exp, block_index);
       } else {
-        val = internal::get_table_positive<16384>(IDX_SIZE * idx, block_index);
+        val = internal::get_table_positive<16384 + 128>(pos_exp, block_index);
       }
 #else
       // ----------------------------- TABLE MODE ------------------------------
@@ -487,9 +495,9 @@ class FloatToString {
 
       val = POW10_SPLIT[POW10_OFFSET[idx] + block_index];
 #endif
-      const uint32_t shift_amount =
-          SHIFT_CONST + (static_cast<uint32_t>(IDX_SIZE) * idx) - exponent;
-      const uint32_t digits =
+      const uint32_t shift_amount = SHIFT_CONST + pos_exp - exponent;
+
+      const BlockInt digits =
           internal::mul_shift_mod_1e9(mantissa, val, (int32_t)(shift_amount));
       return digits;
     } else {
@@ -503,35 +511,35 @@ class FloatToString {
 
       cpp::UInt<MID_INT_SIZE> val;
 
+      const uint32_t pos_exp = idx * IDX_SIZE;
+
 #if defined(LIBC_COPT_FLOAT_TO_STR_USE_DYADIC_FLOAT)
       // ----------------------- DYADIC FLOAT CALC MODE ------------------------
       const int32_t SHIFT_CONST = CALC_SHIFT_CONST;
-      val =
-          internal::get_table_negative_df<256>(idx * IDX_SIZE, block_index + 1);
+      val = internal::get_table_negative_df<256>(pos_exp, block_index + 1);
 #elif defined(LIBC_COPT_FLOAT_TO_STR_USE_INT_CALC)
       // ---------------------------- INT CALC MODE ----------------------------
       const int32_t SHIFT_CONST = CALC_SHIFT_CONST;
-      const uint64_t TEN_BLOCKS = (block_index + 1) * BLOCK_SIZE;
-      const uint64_t MAX_INT_SIZE = internal::log2_pow5(TEN_BLOCKS);
+
+      const uint64_t NUM_FIVES = (block_index + 1) * BLOCK_SIZE;
+      // Round MAX_INT_SIZE up to the nearest 64 (adding 1 because log2_pow5
+      // implicitly rounds down).
+      const uint64_t MAX_INT_SIZE =
+          ((internal::log2_pow5(NUM_FIVES) / 64) + 1) * 64;
 
       if (MAX_INT_SIZE < 1024) {
-        val =
-            internal::get_table_negative<1024>(idx * IDX_SIZE, block_index + 1);
+        val = internal::get_table_negative<1024>(pos_exp, block_index + 1);
       } else if (MAX_INT_SIZE < 2048) {
-        val =
-            internal::get_table_negative<2048>(idx * IDX_SIZE, block_index + 1);
+        val = internal::get_table_negative<2048>(pos_exp, block_index + 1);
       } else if (MAX_INT_SIZE < 4096) {
-        val =
-            internal::get_table_negative<4096>(idx * IDX_SIZE, block_index + 1);
+        val = internal::get_table_negative<4096>(pos_exp, block_index + 1);
       } else if (MAX_INT_SIZE < 8192) {
-        val =
-            internal::get_table_negative<8192>(idx * IDX_SIZE, block_index + 1);
+        val = internal::get_table_negative<8192>(pos_exp, block_index + 1);
       } else if (MAX_INT_SIZE < 16384) {
-        val = internal::get_table_negative<16384>(idx * IDX_SIZE,
-                                                  block_index + 1);
+        val = internal::get_table_negative<16384>(pos_exp, block_index + 1);
       } else {
-        val = internal::get_table_negative<32768>(idx * IDX_SIZE,
-                                                  block_index + 1);
+        val = internal::get_table_negative<16384 + 8192>(pos_exp,
+                                                         block_index + 1);
       }
 #else
       // ----------------------------- TABLE MODE ------------------------------
@@ -549,8 +557,8 @@ class FloatToString {
       val = POW10_SPLIT_2[p];
 #endif
       const int32_t shift_amount =
-          SHIFT_CONST + (-exponent - (static_cast<int32_t>(IDX_SIZE) * idx));
-      uint32_t digits =
+          SHIFT_CONST + (-exponent - static_cast<int32_t>(pos_exp));
+      BlockInt digits =
           internal::mul_shift_mod_1e9(mantissa, val, shift_amount);
       return digits;
     } else {
@@ -582,12 +590,18 @@ class FloatToString {
 
   // This takes the index of a block after the decimal point (a negative block)
   // and return if it's sure that all of the digits after it are zero.
-  LIBC_INLINE constexpr bool is_lowest_block(size_t block_index) {
+  LIBC_INLINE constexpr bool is_lowest_block(size_t negative_block_index) {
 #ifdef LIBC_COPT_FLOAT_TO_STR_NO_TABLE
-    return false;
+    // The decimal representation of 2**(-i) will have exactly i digits after
+    // the decimal point.
+    int num_requested_digits =
+        static_cast<int>((negative_block_index + 1) * BLOCK_SIZE);
+
+    return num_requested_digits > -exponent;
 #else
     const int32_t idx = -exponent / IDX_SIZE;
-    const size_t p = POW10_OFFSET_2[idx] + block_index - MIN_BLOCK_2[idx];
+    const size_t p =
+        POW10_OFFSET_2[idx] + negative_block_index - MIN_BLOCK_2[idx];
     // If the remaining digits are all 0, then this is the lowest block.
     return p >= POW10_OFFSET_2[idx + 1];
 #endif
@@ -595,169 +609,240 @@ class FloatToString {
 
   LIBC_INLINE constexpr size_t zero_blocks_after_point() {
 #ifdef LIBC_COPT_FLOAT_TO_STR_NO_TABLE
+    if (exponent < -MANT_WIDTH) {
+      const int pos_exp = -exponent - 1;
+      const uint32_t pos_idx =
+          static_cast<uint32_t>(pos_exp + (IDX_SIZE - 1)) / IDX_SIZE;
+      const int32_t pos_len = ((internal::ceil_log10_pow2(pos_idx * IDX_SIZE) -
+                                internal::ceil_log10_pow2(MANT_WIDTH + 1)) /
+                               BLOCK_SIZE) -
+                              1;
+      const uint32_t len = static_cast<uint32_t>(pos_len > 0 ? pos_len : 0);
+      return len;
+    }
     return 0;
-    // TODO (michaelrj): Find a good algorithm for this that doesn't use a
-    // table.
 #else
     return MIN_BLOCK_2[-exponent / IDX_SIZE];
 #endif
   }
 };
 
-#ifndef LIBC_LONG_DOUBLE_IS_FLOAT64
+#if !defined(LIBC_LONG_DOUBLE_IS_FLOAT64) &&                                   \
+    !defined(LIBC_COPT_FLOAT_TO_STR_NO_SPECIALIZE_LD)
 // --------------------------- LONG DOUBLE FUNCTIONS ---------------------------
 
-template <>
-LIBC_INLINE constexpr size_t FloatToString<long double>::get_positive_blocks() {
-  if (exponent >= -FRACTION_LEN) {
-    const uint32_t idx =
-        exponent < 0
-            ? 0
-            : static_cast<uint32_t>(exponent + (IDX_SIZE - 1)) / IDX_SIZE;
-    const uint32_t len = internal::length_for_num(idx * IDX_SIZE, FRACTION_LEN);
-    return len;
-  } else {
-    return 0;
+template <> class FloatToString<long double> {
+  fputil::FPBits<long double> float_bits;
+  bool is_negative = 0;
+  int exponent = 0;
+  FloatProp::StorageType mantissa = 0;
+
+  static constexpr int FRACTION_LEN = fputil::FPBits<long double>::FRACTION_LEN;
+  static constexpr int EXP_BIAS = fputil::FPBits<long double>::EXP_BIAS;
+
+  // static constexpr size_t FLOAT_AS_INT_WIDTH = 16384;
+  static constexpr size_t FLOAT_AS_INT_WIDTH =
+      internal::div_ceil(fputil::FPBits<long double>::MAX_BIASED_EXPONENT -
+                             FloatProp::EXP_BIAS,
+                         64) *
+      64;
+  // static constexpr size_t EXTRA_INT_WIDTH = 128;
+  static constexpr size_t EXTRA_INT_WIDTH =
+      internal::div_ceil(sizeof(long double) * 8, 64) * 64;
+
+  cpp::BigInt<FLOAT_AS_INT_WIDTH + EXTRA_INT_WIDTH, false> float_as_int = 0;
+  int int_block_index = 0;
+
+  static constexpr size_t BLOCK_BUFFER_LEN =
+      internal::div_ceil(internal::log10_pow2(FLOAT_AS_INT_WIDTH), BLOCK_SIZE);
+  BlockInt block_buffer[BLOCK_BUFFER_LEN] = {0};
+  size_t block_buffer_valid = 0;
+
+  template <size_t Bits>
+  LIBC_INLINE static constexpr BlockInt
+  grab_digits(cpp::BigInt<Bits, false> &int_num) {
+    BlockInt cur_block = 0;
+    auto wide_result = int_num.div_uint32_times_pow_2(1953125, 9);
+    // the optional only comes into effect when dividing by 0, which will
+    // never happen here. Thus, we just assert that it has value.
+    LIBC_ASSERT(wide_result.has_value());
+    cur_block = static_cast<BlockInt>(wide_result.value());
+    return cur_block;
   }
-}
 
-template <>
-LIBC_INLINE constexpr size_t
-FloatToString<long double>::zero_blocks_after_point() {
-#ifdef LIBC_COPT_FLOAT_TO_STR_USE_MEGA_LONG_DOUBLE_TABLE
-  return MIN_BLOCK_2[-exponent / IDX_SIZE];
-#else
-  return 0;
-  // TODO (michaelrj): Find a good algorithm for this that doesn't use a table.
-#endif
-}
+  LIBC_INLINE static constexpr void zero_leading_digits(
+      cpp::BigInt<FLOAT_AS_INT_WIDTH + EXTRA_INT_WIDTH, false> &int_num) {
+    // 64 is the width of the numbers used to internally represent the BigInt
+    for (size_t i = 0; i < EXTRA_INT_WIDTH / 64; ++i) {
+      int_num[i + (FLOAT_AS_INT_WIDTH / 64)] = 0;
+    }
+  }
 
-template <>
-LIBC_INLINE constexpr bool FloatToString<long double>::is_lowest_block(size_t) {
-  return false;
-}
+  LIBC_INLINE constexpr void init_convert() {
+    // This initializes float_as_int, cur_block, and block_buffer.
 
-template <>
-LIBC_INLINE constexpr BlockInt
-FloatToString<long double>::get_positive_block(int block_index) {
-  if (exponent >= -FRACTION_LEN) {
+    float_as_int = mantissa;
 
-    // idx is ceil(exponent/16) or 0 if exponent is negative. This is used to
-    // find the coarse section of the POW10_SPLIT table that will be used to
-    // calculate the 9 digit window, as well as some other related values.
-    const uint32_t idx =
-        exponent < 0
-            ? 0
-            : static_cast<uint32_t>(exponent + (IDX_SIZE - 1)) / IDX_SIZE;
-    const uint32_t pos_exp = idx * IDX_SIZE;
+    // No calculation necessary for the 0 case.
+    if (mantissa == 0 && exponent == 0) {
+      return;
+    }
 
-    // shift_amount = -(c0 - exponent) = c_0 + 16 * ceil(exponent/16) - exponent
+    if (exponent > 0) {
+      // if the exponent is positive, then the number is fully above the decimal
+      // point. Shift left by exponent to get the integer representation of this
+      // number.
+      float_as_int.shift_left(exponent);
+      int_block_index = 0;
+
+      while (float_as_int > 0) {
+        BlockInt cur_block = grab_digits(float_as_int);
+        block_buffer[int_block_index] = cur_block;
+        ++int_block_index;
+      }
+      block_buffer_valid = int_block_index;
 
-    cpp::UInt<MID_INT_SIZE> val;
-#ifdef LIBC_COPT_FLOAT_TO_STR_USE_MEGA_LONG_DOUBLE_TABLE
-    // ------------------------------ TABLE MODE -------------------------------
-    const int32_t SHIFT_CONST = TABLE_SHIFT_CONST;
-    val = POW10_SPLIT[POW10_OFFSET[idx] + block_index];
-
-#elif defined(LIBC_COPT_FLOAT_TO_STR_USE_DYADIC_FLOAT) ||                      \
-    defined(LIBC_COPT_FLOAT_TO_STR_USE_DYADIC_FLOAT_LD)
-    // ------------------------ DYADIC FLOAT CALC MODE -------------------------
-    const int32_t SHIFT_CONST = CALC_SHIFT_CONST;
-    val = internal...
[truncated]

@michaelrj-google michaelrj-google changed the title [libc][WIP] Move printf long double to simple calc [libc] Move printf long double to simple calc Dec 20, 2023
gchatelet (Contributor) left a comment

I find the code in libc/src/__support/float_to_string.h quite complicated. It's hard to grasp the overall algorithm, especially with the negative block indices. Is there a way to make it simpler? Maybe by introducing dedicated structures?

libc/docs/dev/printf_behavior.rst:
This flag disables the separate long double conversion implementation. It is
not based on the Ryu algorithm, instead generating the digits by
multiplying/dividing the written-out number by 10^9 to get blocks. It's
significantly faster than INT_CALC, only about 10x slower than MEGA_TABLE,
Contributor:

Is this 10% or 10 times? 10x seems like a lot.

michaelrj-google (Contributor Author):

10 times. It's a lot, but only sort of. I separated out the decimal (%f) long double tests and ran them both with the new specialization and with the mega table, with -O3 in release mode. On my workstation the mega table version took ~400 microseconds and the new version took ~8000 microseconds.

The reason I think this is a worthwhile tradeoff is because the mega table takes up a lot of space. Just the converter (found in build/libc/src/stdio/printf_core/CMakeFiles/libc.src.stdio.printf_core.converter.dir/) takes up ~5 megabytes with the mega table, but only ~150 kilobytes with the specialization.

The other option I was considering for saving space was using integer calculation (INT_CALC), which takes up more space (~200 KB) and time (~7,500,000 microseconds), so in comparison the new version looks quite good.

Contributor:

Cache-wise, the 5MiB table seems pretty bad for a function that can be called frequently on multiple cores.

The sheer size is an issue in itself but cache invalidation is also a concern, libc instruction and data footprint should be small to allow for more application state to stay in the cache. How is the table accessed by the way? Are the indices about always the same or is it completely random? Is there a way to compress the table? Can we come up with a smaller table and interpolate between values without losing precision?

michaelrj-google (Contributor Author):

The table is accessed sequentially. It's a flattened two-dimensional table, where the POW10_OFFSET array gives you the starting indexes of each value in the larger array POW10_SPLIT. The negative indices have a similar but slightly different layout. Each number will map to one of the POW10_OFFSET values, then have an offset from that into the larger array which will be incremented or decremented (depending on if this is the positive or negative exponent table).

The 5MB table is already compressed. If you look at the top of ryu_long_double_constants.h you'll see TABLE_SHIFT_CONST, IDX_SIZE, and MID_INT_SIZE. These are the constants we can adjust to try to shrink the table. There's a more comprehensive explanation of what these mean in utils/mathtools/ryu_tablegen.py but here's the short version:

MID_INT_SIZE is the size of each entry in the POW10_SPLIT array. This needs to be at least TABLE_SHIFT_CONST + IDX_SIZE + sizeof(BLOCK_INT) so that it can actually fit the values.

TABLE_SHIFT_CONST (called CONSTANT in ryu_tablegen.py) adjusts the precision of each entry in the array, with higher being more precise. It's 120 here because anything lower tends to cause errors to get into the result.

IDX_SIZE is the compression factor. It's the number of exponents that can be mapped onto one entry in POW10_OFFSET. From my testing it only works when it's a power of 2. Pushing this higher would require increasing MID_INT_SIZE a lot, which would significantly reduce the actual size savings, and would also make the calculations slower.

I talked a bit with Ulf Adams (the original designer of the Ryu algorithm) and he suggested that it might be possible to also compress the table by approximating the next value by multiplying by 5**9. In my testing this worked, but required a higher TABLE_SHIFT_CONST since you lose some precision with each approximation. In the end this only shrank the table a bit, and so I decided it wasn't worthwhile.

In conclusion, I believe the table is already compressed within an order of magnitude of its minimum size, and that makes it too large to be practical. Long doubles are rarely used, so carrying a large table all the time to speed them up seems like a bad idea.

Contributor:

Thanks for the long explanation. I really appreciate it.

In conclusion, I believe the table is already compressed within an order of magnitude of its minimum size, and that makes it too large to be practical. Long doubles are rarely used, so carrying a large table all the time to speed them up seems like a bad idea.

I agree.

libc/src/__support/float_to_string.h:
// 64 is the width of the numbers used to internally represent the BigInt
for (size_t i = 0; i < EXTRA_INT_WIDTH / 64; ++i) {
int_num[i + (FLOAT_AS_INT_WIDTH / 64)] = 0;
}
Contributor:

This feels like abstraction leakage.

michaelrj-google (Contributor Author):

Probably, but I don't see another way to efficiently zero these blocks. If this were a normally sized integer I'd either do a shift left then a shift right, or use a mask, but both of those would have significant overhead for this very large integer.

Contributor:

We can derive 64 from the implementation instead of hardcoding it.

using element_type = cpp::remove_cvref_t<decltype(int_num[0])>;
constexpr size_t ELEMENT_WIDTH = sizeof(element_type) * CHAR_BIT;

Or change BigInt to expose this information.

michaelrj-google (Contributor Author):

I've added a WORD_SIZE static member to BigInt.

Member:

Should the comment be updated here?

michaelrj-google (Contributor Author) left a comment

I see what you mean about this being a rather complex patch, I've addressed the simple comments and will update this patch again when I have an easier-to-read design.


libc/src/__support/float_to_string.h Show resolved Hide resolved
libc/src/__support/float_to_string.h Show resolved Hide resolved
// 64 is the width of the numbers used to internally represent the BigInt
for (size_t i = 0; i < EXTRA_INT_WIDTH / 64; ++i) {
int_num[i + (FLOAT_AS_INT_WIDTH / 64)] = 0;
}
Contributor Author

probably, but I don't see another way to efficiently zero these blocks. If this were a normally sized integer I'd either do a shift left then shift right or use a mask, but both of those would have significant overhead for this very large integer.

Contributor Author

@michaelrj-google michaelrj-google left a comment

I've addressed more comments, and am still thinking about if there's a cleaner and more readable design.

This flag disables the separate long double conversion implementation. It is
not based on the Ryu algorithm, instead generating the digits by
multiplying/dividing the written-out number by 10^9 to get blocks. It's
significantly faster than INT_CALC, only about 10x slower than MEGA_TABLE,
Contributor Author

The table is accessed sequentially. It's a flattened two-dimensional table, where the POW10_OFFSET array gives you the starting indexes of each value in the larger array POW10_SPLIT. The negative indices have a similar but slightly different layout. Each number will map to one of the POW10_OFFSET values, then have an offset from that into the larger array, which will be incremented or decremented (depending on whether this is the positive or negative exponent table).

The 5MB table is already compressed. If you look at the top of ryu_long_double_constants.h you'll see TABLE_SHIFT_CONST, IDX_SIZE, and MID_INT_SIZE. These are the constants we can adjust to try to shrink the table. There's a more comprehensive explanation of what these mean in utils/mathtools/ryu_tablegen.py but here's the short version:

MID_INT_SIZE is the size of each entry in the POW10_SPLIT array. This needs to be at least TABLE_SHIFT_CONST + IDX_SIZE + sizeof(BLOCK_INT) so that it can actually fit the values.

TABLE_SHIFT_CONST (called CONSTANT in ryu_tablegen.py) adjusts the precision of each entry in the array, with higher being more precise. It's 120 here because anything lower tends to cause errors to get into the result.

IDX_SIZE is the compression factor. It's the number of exponents that can be mapped onto one entry in POW10_OFFSET. From my testing it only works when it's a power of 2. Pushing this higher would require increasing MID_INT_SIZE a lot, which would significantly reduce the actual size savings, and would also make the calculations slower.

I talked a bit with Ulf Adams (the original designer of the Ryu algorithm) and he suggested that it might be possible to also compress the table by approximating the next value by multiplying by 5**9. In my testing this worked, but required a higher TABLE_SHIFT_CONST since you lose some precision with each approximation. In the end this only shrank the table a bit, and so I decided it wasn't worthwhile.

In conclusion, I believe the table is already compressed within an order of magnitude of its minimum size, and that makes it too large to be practical. Long doubles are rarely used, so carrying a large table all the time to speed them up seems like a bad idea.

// 64 is the width of the numbers used to internally represent the BigInt
for (size_t i = 0; i < EXTRA_INT_WIDTH / 64; ++i) {
int_num[i + (FLOAT_AS_INT_WIDTH / 64)] = 0;
}
Contributor Author

I've added a WORD_SIZE static member to BigInt.

@michaelrj-google
Contributor Author

I've updated the design to hopefully make it easier to follow.

Member

@nickdesaulniers nickdesaulniers left a comment

drive by style comments, feel free to address or not.

// 64 is the width of the numbers used to internally represent the BigInt
for (size_t i = 0; i < EXTRA_INT_WIDTH / 64; ++i) {
int_num[i + (FLOAT_AS_INT_WIDTH / 64)] = 0;
}
Member

Should the comment be updated here?

@michaelrj-google
Contributor Author

@gchatelet and @lntue do you have any comments before I land this? I'm going to let it run with the fuzzer for a while to check correctness.

@@ -27,6 +27,11 @@ namespace LIBC_NAMESPACE::cpp {

template <size_t Bits, bool Signed> struct BigInt {

// This being hardcoded as 64 is okay because we're using uint64_t as our
Contributor

Can we also add

using word_type = uint64_t;
LIBC_INLINE_VAR static constexpr size_t WORD_SIZE = sizeof(word_type) * CHAR_BIT;
...
cpp::array<word_type, WORDCOUNT> val{};

Contributor Author

I've done that, though I'm leaving the large scale refactoring for a followup patch.

This flag disables the separate long double conversion implementation. It is
not based on the Ryu algorithm, instead generating the digits by
multiplying/dividing the written-out number by 10^9 to get blocks. It's
significantly faster than INT_CALC, only about 10x slower than MEGA_TABLE,
Contributor

Thx for the long explanation. I really appreciate it.

In conclusion, I believe the table is already compressed within an order of magnitude of its minimum size, and that makes it too large to be practical. Long doubles are rarely used, so carrying a large table all the time to speed them up seems like a bad idea.

I agree.

The Ryu algorithm is very fast with its table, but that table grows too
large for long doubles. This patch adds a method of calculating the
digits of long doubles using just wide integers and fast modulo
operations. This results in significant performance improvements vs the
previous int calc mode, while taking up a similar amount of peak memory.
It will be slow in some %e/%g cases, but reasonably fast for %f with no
loss of accuracy.
@michaelrj-google
Contributor Author

Rebased, going to land once presubmits finish.

@michaelrj-google michaelrj-google merged commit a621198 into llvm:main Jan 25, 2024
5 checks passed
@michaelrj-google michaelrj-google deleted the libcPrintfLongDoublePrecalc branch January 25, 2024 17:35