
[LLVM] add LZMA for compression/decompression #83297

Closed

Conversation

yxsamliu
Collaborator

@yxsamliu yxsamliu commented Feb 28, 2024

LZMA (Lempel-Ziv-Markov chain algorithm) provides a better compression ratio than zstd and zlib for clang-offload-bundler bundles, which often contain a large number of similar entries.

This patch adds liblzma to LLVM as an alternative to the existing compression/decompression methods zlib and zstd.

@llvmbot llvmbot added the cmake, clang, clang:driver, llvm:support, llvm-lit, and testing-tools labels on Feb 28, 2024
@llvmbot
Collaborator

llvmbot commented Feb 28, 2024

@llvm/pr-subscribers-clang
@llvm/pr-subscribers-testing-tools
@llvm/pr-subscribers-llvm-support

@llvm/pr-subscribers-clang-driver

Author: Yaxun (Sam) Liu (yxsamliu)

Changes

LZMA (Lempel-Ziv-Markov chain algorithm) provides a better compression ratio than zstd and zlib for clang-offload-bundler bundles, which often contain a large number of similar entries.

This patch adds liblzma to LLVM as an alternative to the existing compression/decompression methods zlib and zstd, and lets clang-offload-bundler use it as the preferred compression/decompression method.


Full diff: https://github.com/llvm/llvm-project/pull/83297.diff

15 Files Affected:

  • (modified) clang/lib/Driver/OffloadBundler.cpp (+14-4)
  • (modified) clang/test/CMakeLists.txt (+1)
  • (added) clang/test/Driver/clang-offload-bundler-lzma.c (+76)
  • (modified) clang/test/lit.site.cfg.py.in (+1)
  • (modified) llvm/CMakeLists.txt (+2)
  • (modified) llvm/cmake/config-ix.cmake (+25)
  • (modified) llvm/cmake/modules/LLVMConfig.cmake.in (+5)
  • (modified) llvm/docs/CMake.rst (+5)
  • (modified) llvm/include/llvm/Config/llvm-config.h.cmake (+3)
  • (modified) llvm/include/llvm/Support/Compression.h (+24-2)
  • (modified) llvm/lib/Support/CMakeLists.txt (+17)
  • (modified) llvm/lib/Support/Compression.cpp (+98)
  • (modified) llvm/test/CMakeLists.txt (+1)
  • (modified) llvm/test/lit.site.cfg.py.in (+1)
  • (modified) llvm/utils/lit/lit/llvm/config.py (+3)
diff --git a/clang/lib/Driver/OffloadBundler.cpp b/clang/lib/Driver/OffloadBundler.cpp
index 99a34d25cfcd56..4497944f70c42d 100644
--- a/clang/lib/Driver/OffloadBundler.cpp
+++ b/clang/lib/Driver/OffloadBundler.cpp
@@ -943,7 +943,9 @@ CompressedOffloadBundle::compress(const llvm::MemoryBuffer &Input,
 
   llvm::compression::Format CompressionFormat;
 
-  if (llvm::compression::zstd::isAvailable())
+  if (llvm::compression::lzma::isAvailable())
+    CompressionFormat = llvm::compression::Format::Lzma;
+  else if (llvm::compression::zstd::isAvailable())
     CompressionFormat = llvm::compression::Format::Zstd;
   else if (llvm::compression::zlib::isAvailable())
     CompressionFormat = llvm::compression::Format::Zlib;
@@ -977,7 +979,10 @@ CompressedOffloadBundle::compress(const llvm::MemoryBuffer &Input,
 
   if (Verbose) {
     auto MethodUsed =
-        CompressionFormat == llvm::compression::Format::Zstd ? "zstd" : "zlib";
+        CompressionFormat == llvm::compression::Format::Lzma
+            ? "lzma"
+            : (CompressionFormat == llvm::compression::Format::Zstd ? "zstd"
+                                                                    : "zlib");
     llvm::errs() << "Compressed bundle format version: " << Version << "\n"
                  << "Compression method used: " << MethodUsed << "\n"
                  << "Binary size before compression: " << UncompressedSize
@@ -1026,7 +1031,10 @@ CompressedOffloadBundle::decompress(const llvm::MemoryBuffer &Input,
 
   llvm::compression::Format CompressionFormat;
   if (CompressionMethod ==
-      static_cast<uint16_t>(llvm::compression::Format::Zlib))
+      static_cast<uint16_t>(llvm::compression::Format::Lzma))
+    CompressionFormat = llvm::compression::Format::Lzma;
+  else if (CompressionMethod ==
+           static_cast<uint16_t>(llvm::compression::Format::Zlib))
     CompressionFormat = llvm::compression::Format::Zlib;
   else if (CompressionMethod ==
            static_cast<uint16_t>(llvm::compression::Format::Zstd))
@@ -1070,7 +1078,9 @@ CompressedOffloadBundle::decompress(const llvm::MemoryBuffer &Input,
                  << "Decompression method: "
                  << (CompressionFormat == llvm::compression::Format::Zlib
                          ? "zlib"
-                         : "zstd")
+                         : (CompressionFormat == llvm::compression::Format::Lzma
+                                ? "lzma"
+                                : "zstd"))
                  << "\n"
                  << "Size before decompression: " << CompressedData.size()
                  << " bytes\n"
diff --git a/clang/test/CMakeLists.txt b/clang/test/CMakeLists.txt
index fcfca354f4a75f..ca57daa6fc8651 100644
--- a/clang/test/CMakeLists.txt
+++ b/clang/test/CMakeLists.txt
@@ -12,6 +12,7 @@ llvm_canonicalize_cmake_booleans(
   ENABLE_BACKTRACES
   LLVM_ENABLE_ZLIB
   LLVM_ENABLE_ZSTD
+  LLVM_ENABLE_LZMA
   LLVM_ENABLE_PER_TARGET_RUNTIME_DIR
   LLVM_ENABLE_THREADS
   LLVM_ENABLE_REVERSE_ITERATION
diff --git a/clang/test/Driver/clang-offload-bundler-lzma.c b/clang/test/Driver/clang-offload-bundler-lzma.c
new file mode 100644
index 00000000000000..3c254af85936fb
--- /dev/null
+++ b/clang/test/Driver/clang-offload-bundler-lzma.c
@@ -0,0 +1,76 @@
+// REQUIRES: lzma
+// REQUIRES: x86-registered-target
+// UNSUPPORTED: target={{.*}}-darwin{{.*}}, target={{.*}}-aix{{.*}}
+
+//
+// Generate the host binary to be bundled.
+//
+// RUN: %clang -O0 -target %itanium_abi_triple %s -c -emit-llvm -o %t.bc
+
+//
+// Generate an empty file to help with the checks of empty files.
+//
+// RUN: touch %t.empty
+
+//
+// Generate device binaries to be bundled.
+//
+// RUN: echo 'Content of device file 1' > %t.tgt1
+// RUN: echo 'Content of device file 2' > %t.tgt2
+
+//
+// Check compression/decompression of offload bundle.
+//
+// RUN: env OFFLOAD_BUNDLER_COMPRESS=1 OFFLOAD_BUNDLER_VERBOSE=1 \
+// RUN:   clang-offload-bundler -type=bc -targets=hip-amdgcn-amd-amdhsa--gfx900,hip-amdgcn-amd-amdhsa--gfx906 \
+// RUN:   -input=%t.tgt1 -input=%t.tgt2 -output=%t.hip.bundle.bc 2>&1 | \
+// RUN:   FileCheck -check-prefix=COMPRESS %s
+// RUN: clang-offload-bundler -type=bc -list -input=%t.hip.bundle.bc | FileCheck -check-prefix=NOHOST %s
+// RUN: env OFFLOAD_BUNDLER_VERBOSE=1 \
+// RUN:   clang-offload-bundler -type=bc -targets=hip-amdgcn-amd-amdhsa--gfx900,hip-amdgcn-amd-amdhsa--gfx906 \
+// RUN:   -output=%t.res.tgt1 -output=%t.res.tgt2 -input=%t.hip.bundle.bc -unbundle 2>&1 | \
+// RUN:   FileCheck -check-prefix=DECOMPRESS %s
+// RUN: diff %t.tgt1 %t.res.tgt1
+// RUN: diff %t.tgt2 %t.res.tgt2
+
+//
+// COMPRESS: Compression method used: lzma
+// DECOMPRESS: Decompression method: lzma
+// NOHOST-NOT: host-
+// NOHOST-DAG: hip-amdgcn-amd-amdhsa--gfx900
+// NOHOST-DAG: hip-amdgcn-amd-amdhsa--gfx906
+//
+
+//
+// Check -bundle-align option.
+//
+
+// RUN: clang-offload-bundler -bundle-align=4096 -type=bc -targets=host-%itanium_abi_triple,openmp-powerpc64le-ibm-linux-gnu,openmp-x86_64-pc-linux-gnu -input=%t.bc -input=%t.tgt1 -input=%t.tgt2 -output=%t.bundle3.bc -compress
+// RUN: clang-offload-bundler -type=bc -targets=host-%itanium_abi_triple,openmp-powerpc64le-ibm-linux-gnu,openmp-x86_64-pc-linux-gnu -output=%t.res.bc -output=%t.res.tgt1 -output=%t.res.tgt2 -input=%t.bundle3.bc -unbundle
+// RUN: diff %t.bc %t.res.bc
+// RUN: diff %t.tgt1 %t.res.tgt1
+// RUN: diff %t.tgt2 %t.res.tgt2
+
+//
+// Check unbundling archive.
+//
+// RUN: clang-offload-bundler -type=bc -targets=hip-amdgcn-amd-amdhsa--gfx900,hip-amdgcn-amd-amdhsa--gfx906 \
+// RUN:   -input=%t.tgt1 -input=%t.tgt2 -output=%t.hip_bundle1.bc -compress
+// RUN: clang-offload-bundler -type=bc -targets=hip-amdgcn-amd-amdhsa--gfx900,hip-amdgcn-amd-amdhsa--gfx906 \
+// RUN:   -input=%t.tgt1 -input=%t.tgt2 -output=%t.hip_bundle2.bc -compress
+// RUN: rm -f %t.hip_archive.a
+// RUN: llvm-ar cr %t.hip_archive.a %t.hip_bundle1.bc %t.hip_bundle2.bc
+// RUN: clang-offload-bundler -unbundle -type=a -targets=hip-amdgcn-amd-amdhsa--gfx900,hip-amdgcn-amd-amdhsa--gfx906 \
+// RUN:   -output=%t.hip_900.a -output=%t.hip_906.a -input=%t.hip_archive.a
+// RUN: llvm-ar t %t.hip_900.a | FileCheck -check-prefix=HIP-AR-900 %s
+// RUN: llvm-ar t %t.hip_906.a | FileCheck -check-prefix=HIP-AR-906 %s
+// HIP-AR-900-DAG: hip_bundle1-hip-amdgcn-amd-amdhsa--gfx900
+// HIP-AR-900-DAG: hip_bundle2-hip-amdgcn-amd-amdhsa--gfx900
+// HIP-AR-906-DAG: hip_bundle1-hip-amdgcn-amd-amdhsa--gfx906
+// HIP-AR-906-DAG: hip_bundle2-hip-amdgcn-amd-amdhsa--gfx906
+
+// Some code so that we can create a binary out of this file.
+int A = 0;
+void test_func(void) {
+  ++A;
+}
diff --git a/clang/test/lit.site.cfg.py.in b/clang/test/lit.site.cfg.py.in
index ef75770a2c3c9a..0ad5d0887c103e 100644
--- a/clang/test/lit.site.cfg.py.in
+++ b/clang/test/lit.site.cfg.py.in
@@ -22,6 +22,7 @@ config.host_cxx = "@CMAKE_CXX_COMPILER@"
 config.llvm_use_sanitizer = "@LLVM_USE_SANITIZER@"
 config.have_zlib = @LLVM_ENABLE_ZLIB@
 config.have_zstd = @LLVM_ENABLE_ZSTD@
+config.have_lzma = @LLVM_ENABLE_LZMA@
 config.clang_arcmt = @CLANG_ENABLE_ARCMT@
 config.clang_default_pie_on_linux = @CLANG_DEFAULT_PIE_ON_LINUX@
 config.clang_default_cxx_stdlib = "@CLANG_DEFAULT_CXX_STDLIB@"
diff --git a/llvm/CMakeLists.txt b/llvm/CMakeLists.txt
index f5f7d3f3253fd3..be500d51d22a7a 100644
--- a/llvm/CMakeLists.txt
+++ b/llvm/CMakeLists.txt
@@ -552,6 +552,8 @@ set(LLVM_ENABLE_ZLIB "ON" CACHE STRING "Use zlib for compression/decompression i
 
 set(LLVM_ENABLE_ZSTD "ON" CACHE STRING "Use zstd for compression/decompression if available. Can be ON, OFF, or FORCE_ON")
 
+set(LLVM_ENABLE_LZMA "ON" CACHE STRING "Use lzma for compression/decompression if available. Can be ON, OFF, or FORCE_ON")
+
 set(LLVM_USE_STATIC_ZSTD FALSE CACHE BOOL "Use static version of zstd. Can be TRUE, FALSE")
 
 set(LLVM_ENABLE_CURL "OFF" CACHE STRING "Use libcurl for the HTTP client if available. Can be ON, OFF, or FORCE_ON")
diff --git a/llvm/cmake/config-ix.cmake b/llvm/cmake/config-ix.cmake
index bf1b110245bb2f..4ac1e58cf565b1 100644
--- a/llvm/cmake/config-ix.cmake
+++ b/llvm/cmake/config-ix.cmake
@@ -162,6 +162,31 @@ if(LLVM_ENABLE_ZSTD)
 endif()
 set(LLVM_ENABLE_ZSTD ${zstd_FOUND})
 
+set(LZMA_FOUND 0)
+if(LLVM_ENABLE_LZMA)
+  if(LLVM_ENABLE_LZMA STREQUAL FORCE_ON)
+    find_package(LibLZMA REQUIRED)
+    if(NOT LIBLZMA_FOUND)
+      message(FATAL_ERROR "Failed to configure lzma, but LLVM_ENABLE_LZMA is FORCE_ON")
+    endif()
+  else()
+    find_package(LibLZMA QUIET)
+  endif()
+  if(LIBLZMA_FOUND)
+    # Check if lzma we found is usable; for example, we may have found a 32-bit
+    # library on a 64-bit system which would result in a link-time failure.
+    cmake_push_check_state()
+    list(APPEND CMAKE_REQUIRED_INCLUDES ${LIBLZMA_INCLUDE_DIRS})
+    list(APPEND CMAKE_REQUIRED_LIBRARIES ${LIBLZMA_LIBRARIES})
+    check_symbol_exists(lzma_lzma_preset lzma.h HAVE_LZMA)
+    cmake_pop_check_state()
+    if(LLVM_ENABLE_LZMA STREQUAL FORCE_ON AND NOT HAVE_LZMA)
+      message(FATAL_ERROR "Failed to configure lzma")
+    endif()
+  endif()
+endif()
+set(LLVM_ENABLE_LZMA ${LIBLZMA_FOUND})
+
 if(LLVM_ENABLE_LIBXML2)
   if(LLVM_ENABLE_LIBXML2 STREQUAL FORCE_ON)
     find_package(LibXml2 REQUIRED)
diff --git a/llvm/cmake/modules/LLVMConfig.cmake.in b/llvm/cmake/modules/LLVMConfig.cmake.in
index 770a9caea322e6..660e056f113859 100644
--- a/llvm/cmake/modules/LLVMConfig.cmake.in
+++ b/llvm/cmake/modules/LLVMConfig.cmake.in
@@ -80,6 +80,11 @@ if(LLVM_ENABLE_ZSTD)
   find_package(zstd)
 endif()
 
+set(LLVM_ENABLE_LZMA @LLVM_ENABLE_LZMA@)
+if(LLVM_ENABLE_LZMA)
+  find_package(LibLZMA)
+endif()
+
 set(LLVM_ENABLE_LIBXML2 @LLVM_ENABLE_LIBXML2@)
 if(LLVM_ENABLE_LIBXML2)
   find_package(LibXml2)
diff --git a/llvm/docs/CMake.rst b/llvm/docs/CMake.rst
index abef4f8103140f..d7f86caa71202b 100644
--- a/llvm/docs/CMake.rst
+++ b/llvm/docs/CMake.rst
@@ -629,6 +629,11 @@ enabled sub-projects. Nearly all of these variable names begin with
   zstd. Allowed values are ``OFF``, ``ON`` (default, enable if zstd is found),
   and ``FORCE_ON`` (error if zstd is not found).
 
+**LLVM_ENABLE_LZMA**:STRING
+  Used to decide if LLVM tools should support compression/decompression with
+  lzma. Allowed values are ``OFF``, ``ON`` (default, enable if lzma is found),
+  and ``FORCE_ON`` (error if lzma is not found).
+
 **LLVM_EXPERIMENTAL_TARGETS_TO_BUILD**:STRING
   Semicolon-separated list of experimental targets to build and linked into
   llvm. This will build the experimental target without needing it to add to the
diff --git a/llvm/include/llvm/Config/llvm-config.h.cmake b/llvm/include/llvm/Config/llvm-config.h.cmake
index 6605ea60df99e1..47e53f8b4ee7bc 100644
--- a/llvm/include/llvm/Config/llvm-config.h.cmake
+++ b/llvm/include/llvm/Config/llvm-config.h.cmake
@@ -173,6 +173,9 @@
 /* Define if zstd compression is available */
 #cmakedefine01 LLVM_ENABLE_ZSTD
 
+/* Define if lzma compression is available */
+#cmakedefine01 LLVM_ENABLE_LZMA
+
 /* Define if LLVM is using tflite */
 #cmakedefine LLVM_HAVE_TFLITE
 
diff --git a/llvm/include/llvm/Support/Compression.h b/llvm/include/llvm/Support/Compression.h
index c3ba3274d6ed87..6dc7b162772d90 100644
--- a/llvm/include/llvm/Support/Compression.h
+++ b/llvm/include/llvm/Support/Compression.h
@@ -73,9 +73,31 @@ Error decompress(ArrayRef<uint8_t> Input, SmallVectorImpl<uint8_t> &Output,
 
 } // End of namespace zstd
 
+namespace lzma {
+
+constexpr int NoCompression = 0;
+constexpr int BestSpeedCompression = 1;
+constexpr int DefaultCompression = 6;
+constexpr int BestSizeCompression = 9;
+
+bool isAvailable();
+
+void compress(ArrayRef<uint8_t> Input,
+              SmallVectorImpl<uint8_t> &CompressedBuffer,
+              int Level = DefaultCompression);
+
+Error decompress(ArrayRef<uint8_t> Input, uint8_t *Output,
+                 size_t &UncompressedSize);
+
+Error decompress(ArrayRef<uint8_t> Input, SmallVectorImpl<uint8_t> &Output,
+                 size_t UncompressedSize);
+
+} // End of namespace lzma
+
 enum class Format {
   Zlib,
   Zstd,
+  Lzma,
 };
 
 inline Format formatFor(DebugCompressionType Type) {
@@ -104,8 +126,8 @@ struct Params {
 };
 
 // Return nullptr if LLVM was built with support (LLVM_ENABLE_ZLIB,
-// LLVM_ENABLE_ZSTD) for the specified compression format; otherwise
-// return a string literal describing the reason.
+// LLVM_ENABLE_ZSTD, LLVM_ENABLE_LZMA) for the specified compression format;
+// otherwise return a string literal describing the reason.
 const char *getReasonIfUnsupported(Format F);
 
 // Compress Input with the specified format P.Format. If Level is -1, use
diff --git a/llvm/lib/Support/CMakeLists.txt b/llvm/lib/Support/CMakeLists.txt
index 1f2d82427552f7..1ed0dcd435ecf8 100644
--- a/llvm/lib/Support/CMakeLists.txt
+++ b/llvm/lib/Support/CMakeLists.txt
@@ -37,6 +37,10 @@ if(LLVM_ENABLE_ZSTD)
   list(APPEND imported_libs ${zstd_target})
 endif()
 
+if(LLVM_ENABLE_LZMA)
+  list(APPEND imported_libs LibLZMA::LibLZMA)
+endif()
+
 if( MSVC OR MINGW )
   # libuuid required for FOLDERID_Profile usage in lib/Support/Windows/Path.inc.
   # advapi32 required for CryptAcquireContextW in lib/Support/Windows/Path.inc.
@@ -323,6 +327,19 @@ if(LLVM_ENABLE_ZSTD)
   set(llvm_system_libs ${llvm_system_libs} "${zstd_library}")
 endif()
 
+if(LLVM_ENABLE_LZMA)
+  # CMAKE_BUILD_TYPE is only meaningful to single-configuration generators.
+  if(CMAKE_BUILD_TYPE)
+    string(TOUPPER ${CMAKE_BUILD_TYPE} build_type)
+    get_property(lzma_library TARGET LibLZMA::LibLZMA PROPERTY LOCATION_${build_type})
+  endif()
+  if(NOT lzma_library)
+    get_property(lzma_library TARGET LibLZMA::LibLZMA PROPERTY LOCATION)
+  endif()
+  get_library_name(${lzma_library} lzma_library)
+  set(llvm_system_libs ${llvm_system_libs} "${lzma_library}")
+endif()
+
 if(LLVM_ENABLE_TERMINFO)
   if(NOT terminfo_library)
     get_property(terminfo_library TARGET Terminfo::terminfo PROPERTY LOCATION)
diff --git a/llvm/lib/Support/Compression.cpp b/llvm/lib/Support/Compression.cpp
index 8e57ba798f5207..f88560e58e8135 100644
--- a/llvm/lib/Support/Compression.cpp
+++ b/llvm/lib/Support/Compression.cpp
@@ -23,6 +23,9 @@
 #if LLVM_ENABLE_ZSTD
 #include <zstd.h>
 #endif
+#if LLVM_ENABLE_LZMA
+#include <lzma.h>
+#endif
 
 using namespace llvm;
 using namespace llvm::compression;
@@ -39,6 +42,11 @@ const char *compression::getReasonIfUnsupported(compression::Format F) {
       return nullptr;
     return "LLVM was not built with LLVM_ENABLE_ZSTD or did not find zstd at "
            "build time";
+  case compression::Format::Lzma:
+    if (lzma::isAvailable())
+      return nullptr;
+    return "LLVM was not built with LLVM_ENABLE_LZMA or did not find lzma at "
+           "build time";
   }
   llvm_unreachable("");
 }
@@ -52,6 +60,9 @@ void compression::compress(Params P, ArrayRef<uint8_t> Input,
   case compression::Format::Zstd:
     zstd::compress(Input, Output, P.level);
     break;
+  case compression::Format::Lzma:
+    lzma::compress(Input, Output, P.level);
+    break;
   }
 }
 
@@ -62,6 +73,8 @@ Error compression::decompress(DebugCompressionType T, ArrayRef<uint8_t> Input,
     return zlib::decompress(Input, Output, UncompressedSize);
   case compression::Format::Zstd:
     return zstd::decompress(Input, Output, UncompressedSize);
+  case compression::Format::Lzma:
+    break;
   }
   llvm_unreachable("");
 }
@@ -74,6 +87,8 @@ Error compression::decompress(compression::Format F, ArrayRef<uint8_t> Input,
     return zlib::decompress(Input, Output, UncompressedSize);
   case compression::Format::Zstd:
     return zstd::decompress(Input, Output, UncompressedSize);
+  case compression::Format::Lzma:
+    return lzma::decompress(Input, Output, UncompressedSize);
   }
   llvm_unreachable("");
 }
@@ -218,3 +233,86 @@ Error zstd::decompress(ArrayRef<uint8_t> Input,
   llvm_unreachable("zstd::decompress is unavailable");
 }
 #endif
+#if LLVM_ENABLE_LZMA
+
+bool lzma::isAvailable() { return true; }
+
+void lzma::compress(ArrayRef<uint8_t> Input,
+                    SmallVectorImpl<uint8_t> &CompressedBuffer, int Level) {
+  lzma_options_lzma Opt;
+  if (lzma_lzma_preset(&Opt, Level) != LZMA_OK) {
+    report_bad_alloc_error("lzma::compress failed: preset error");
+    return;
+  }
+
+  lzma_filter Filters[] = {{LZMA_FILTER_LZMA2, &Opt},
+                           {LZMA_VLI_UNKNOWN, nullptr}};
+
+  size_t MaxOutSize = lzma_stream_buffer_bound(Input.size());
+  CompressedBuffer.resize_for_overwrite(MaxOutSize);
+
+  size_t OutPos = 0;
+  lzma_ret Ret = lzma_stream_buffer_encode(
+      Filters, LZMA_CHECK_CRC64, nullptr, Input.data(), Input.size(),
+      CompressedBuffer.data(), &OutPos, MaxOutSize);
+  if (Ret == LZMA_OK)
+    CompressedBuffer.resize(OutPos);
+  else
+    report_bad_alloc_error("lzma::compress failed");
+}
+
+Error lzma::decompress(ArrayRef<uint8_t> Input, uint8_t *Output,
+                       size_t &UncompressedSize) {
+  const size_t DecoderMemoryLimit = 100 * 1024 * 1024;
+  lzma_stream Strm = LZMA_STREAM_INIT;
+  size_t InPos = 0;
+  size_t OutPos = 0;
+
+  lzma_ret Ret = lzma_auto_decoder(&Strm, DecoderMemoryLimit, 0);
+  if (Ret != LZMA_OK)
+    return make_error<StringError>("Failed to initialize LZMA decoder",
+                                   inconvertibleErrorCode());
+
+  Strm.next_in = Input.data();
+  Strm.avail_in = Input.size();
+  Strm.next_out = Output;
+  Strm.avail_out = UncompressedSize;
+
+  Ret = lzma_code(&Strm, LZMA_FINISH);
+  if (Ret == LZMA_STREAM_END) {
+    UncompressedSize = Strm.total_out;
+    lzma_end(&Strm);
+    return Error::success();
+  } else {
+    lzma_end(&Strm);
+    return make_error<StringError>("LZMA decompression failed",
+                                   inconvertibleErrorCode());
+  }
+}
+
+Error lzma::decompress(ArrayRef<uint8_t> Input,
+                       SmallVectorImpl<uint8_t> &Output,
+                       size_t UncompressedSize) {
+  Output.resize_for_overwrite(UncompressedSize);
+  Error E = lzma::decompress(Input, Output.data(), UncompressedSize);
+  if (UncompressedSize < Output.size())
+    Output.truncate(UncompressedSize);
+  return E;
+}
+
+#else
+bool lzma::isAvailable() { return false; }
+void lzma::compress(ArrayRef<uint8_t> Input,
+                    SmallVectorImpl<uint8_t> &CompressedBuffer, int Level) {
+  llvm_unreachable("lzma::compress is unavailable");
+}
+Error lzma::decompress(ArrayRef<uint8_t> Input, uint8_t *Output,
+                       size_t &UncompressedSize) {
+  llvm_unreachable("lzma::decompress is unavailable");
+}
+Error lzma::decompress(ArrayRef<uint8_t> Input,
+                       SmallVectorImpl<uint8_t> &Output,
+                       size_t UncompressedSize) {
+  llvm_unreachable("lzma::decompress is unavailable");
+}
+#endif
diff --git a/llvm/test/CMakeLists.txt b/llvm/test/CMakeLists.txt
index 6127b76db06b7f..777a54784203a4 100644
--- a/llvm/test/CMakeLists.txt
+++ b/llvm/test/CMakeLists.txt
@@ -8,6 +8,7 @@ llvm_canonicalize_cmake_booleans(
   LLVM_ENABLE_HTTPLIB
   LLVM_ENABLE_ZLIB
   LLVM_ENABLE_ZSTD
+  LLVM_ENABLE_LZMA
   LLVM_ENABLE_LIBXML2
   LLVM_LINK_LLVM_DYLIB
   LLVM_TOOL_LTO_BUILD
diff --git a/llvm/test/lit.site.cfg.py.in b/llvm/test/lit.site.cfg.py.in
index b6f255d472d16f..7cdca4083295f5 100644
--- a/llvm/test/lit.site.cfg.py.in
+++ b/llvm/test/lit.site.cfg.py.in
@@ -35,6 +35,7 @@ config.llvm_use_intel_jitevents = @LLVM_USE_INTEL_JITEVENTS@
 config.llvm_use_sanitizer = "@LLVM_USE_SANITIZER@"
 config.have_zlib = @LLVM_ENABLE_ZLIB@
 config.have_zstd = @LLVM_ENABLE_ZSTD@
+config.have_lzma = @LLVM_ENABLE_LZMA@
 config.have_libxml2 = @LLVM_ENABLE_LIBXML2@
 config.have_curl = @LLVM_ENABLE_CURL@
 config.have_httplib = @LLVM_ENABLE_HTTPLIB@
diff --git a/llvm/utils/lit/lit/llvm/config.py b/llvm/utils/lit/lit/llvm/config.py
index 96b4f7bc86772d..6e307da7354118 100644
--- a/llvm/utils/lit/lit/llvm/config.py
+++ b/llvm/utils/lit/lit/llvm/config.py
@@ -131,6 +131,9 @@ def __init__(self, lit_config, config):
         have_zstd = getattr(config, "have_zstd", None)
         if have_zstd:
             features.add("zstd")
+        have_lzma = getattr(config, "have_lzma", None)
+        if have_lzma:
+            features.add("lzma")
 
         if getattr(config, "reverse_iteration", None):
             features.add("reverse_iteration")

@jhuber6
Contributor

jhuber6 commented Feb 28, 2024

This seems to be adding an entirely new compression scheme to LLVM. I feel like that should be a separate patch, and the part where we make HIP use it should be a follow-up.

This patch adds liblzma as an alternative compression/decompression
method to zlib/zstd.
@yxsamliu yxsamliu changed the title from "[HIP] Support compressing bundle by LZMA" to "add LZMA for compression/decompression" on Feb 28, 2024
@yxsamliu
Collaborator Author

> This seems to be adding an entirely new compression scheme to LLVM. I feel like that should be a separate patch, and the part where we make HIP use it should be a follow-up.

Keeping this PR for LLVM changes only. I will open another PR for the clang changes.

@jhuber6 jhuber6 changed the title from "add LZMA for compression/decompression" to "[LLVM] add LZMA for compression/decompression" on Feb 28, 2024
Contributor

@jhuber6 jhuber6 left a comment


Thanks, this seems pretty straightforward, so it looks good to me. However, I'll wait until some of the other LLVM contributors chime in.

@aganea
Member

aganea commented Feb 28, 2024

Thanks for doing this @yxsamliu ! Does -DLLVM_ENABLE_LZMA=ON work on Windows too?

@Artem-B
Member

Artem-B commented Feb 28, 2024

> LZMA (Lempel-Ziv-Markov chain algorithm) provides a better compression ratio than zstd and zlib for clang-offload-bundler bundles, which often contain a large number of similar entries.

Do you have any benchmarks to support this assertion?

For huge binaries, decompression speed may be more important than compression ratio. E.g., it's not unusual to have large ML apps carrying O(gigabytes) of GPU code blobs.

lzma's somewhat better compression ratio (vs. zstd) comes at the price of relatively slow decompression; zstd gives a comparable compression ratio at a much higher decompression speed:
https://morotti.github.io/lzbench-web/?dataset=silesia/mozilla&machine=desktop#results
[image: lzbench benchmark chart for the silesia/mozilla dataset]

@MaskRay
Member

MaskRay commented Feb 29, 2024

I have heard that several applications have switched from lzma/lzma2 to zstd, so I am curious to see the justification for adding lzma. The compression ratio is better, but compression/decompression is extremely slow. In addition, lzma is not good at compressing binary data.

I surveyed multiple compression implementations when I added zstd to LLVM, which may be useful: https://groups.google.com/g/generic-abi/c/satyPkuMisk

@aganea
Member

aganea commented Feb 29, 2024

I do have a different perspective here. I worked with LZMA in the past, and it is by far one of the best compression schemes out there in many regards. I do not understand the assertion about its decompression speed. Compression is certainly slower, but it is not a lot slower than the competition. I also have practical use cases today for it, as opposed to "weaker" compression formats.

In the past I had a realtime streaming LZMA decompressor running on a 16 MHz ARM7TDMI, sharing timeslices with many other runtime jobs while rendering a video stream. Admittedly it was hand-optimized asm, but we had the same issues with memory latency as today, and the low bitrate of the LZMA stream meant less data had to be read from the ROM. The gap has only widened since: memory reads are a lot more expensive than CPU cycles, even if the data is already in the caches. Most likely the LZMA window would have to be tuned for today's cache hierarchy and target CPU architecture.

Even though COFF doesn't support internal compression today, AFAIK, I tried compressing the .OBJ files of an LLVM Windows build, including debug info, from the folder stage1\tools\clang\unittests\Tooling\CMakeFiles\ToolingTests.dir on a modern Ryzen 9 Windows machine:

| Compressor | Compression time | Decompression time | Size |
|---|---|---|---|
| None | | | 383 MB |
| `7z.exe 23.00 -tzip a files.zip *.obj` | 3 sec | 1.2 sec | 44.7 MB |
| `zstd.exe 1.5.5 -9 -f *.obj -o files.zstd` | 3 sec | 0.008 sec | 36.9 MB |
| `zstd.exe 1.5.5 -19 -f *.obj -o files.zstd` | 1 min 46 sec | 0.210 sec | 30.8 MB |
| `7z.exe 23.00 a files.7z *.obj` | 5.5 sec | 0.500 sec | 26.2 MB |
| `7z.exe 23.00 -mx9 a files.7z *.obj` | 26 sec | 0.791 sec | 18.2 MB |

All figures are single-threaded. The assumption is that libzstd and liblzma have the same performance as their executable counterparts.

A practical counter-argument to the comp/decomp speed concern (which does not seem that terrible in light of the figures above) is that people working from home are usually on poor or mediocre internet connections. Upload speed on their end isn't great, but their CPU power is. To avoid cloud costs, it makes sense to distribute compilation across a private network of users' PCs, which includes at-home PCs. In this case, the size of the generated assets/.OBJs matters more than the time spent compressing/decompressing them, as long as it remains reasonable. If 40 sec are spent compiling an .OBJ and 2-3 sec on compression, that is of great value if it produces 2x smaller assets (great value for the network transfer, that is).

However, I understand that these figures could be different when compressing individual sections within a DWARF file.

I'd like to give the OP the benefit of the doubt, if they can come up with tangible figures for their use case: compression/decompression speed and size, compared with the existing compression schemes in LLVM. @yxsamliu

@yxsamliu
Collaborator Author

> Thanks for doing this @yxsamliu! Does -DLLVM_ENABLE_LZMA=ON work on Windows too?

I was able to build liblzma from https://github.com/tukaani-project/xz on Windows with VS 2022. LLVM's cmake config is able to find its include file, but not the library, with -DLIBLZMA_ROOT. I am still investigating.

@yxsamliu
Collaborator Author

I will collect some benchmarking results.

@yxsamliu
Collaborator Author

The following is a measurement of compressing/decompressing the bundled Blender 4.1 bitcode for 6 GPU archs:

It is surprising that LZMA level 9 achieves a higher compression ratio with less compression/decompression time than level 6, but that is what happened.

| Compression Method | Level | Original Size (bytes) | Compressed Size (bytes) | Compression Rate | Compression Time (s) | Decompression Time (s) |
|---|---|---|---|---|---|---|
| LZMA | 6 | 68,459,756 | 22,984,456 | ~2.98:1 | 18.7226 | 1.1000 |
| LZMA | 9 | 68,459,756 | 4,139,504 | ~16.55:1 | 14.32 | 0.3012 |
| ZSTD | 6 | 68,459,756 | 32,612,291 | ~2.10:1 | 0.8067 | 0.0982 |
| ZSTD | 9 | 68,459,756 | 31,445,373 | ~2.18:1 | 1.3375 | 0.0933 |
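
For anyone who wants to reproduce numbers of this kind, a rough timing harness built on LLVM's existing compression API (a sketch under stated assumptions: the input file and level come from the command line, and only the zstd side is shown since it is already in tree):

```cpp
#include "llvm/ADT/SmallVector.h"
#include "llvm/Support/Compression.h"
#include "llvm/Support/ErrorOr.h"
#include "llvm/Support/MemoryBuffer.h"
#include "llvm/Support/raw_ostream.h"
#include <chrono>
#include <cstdlib>

using namespace llvm;

// Time zstd compression of one bundle file at a chosen level.
int main(int argc, char **argv) {
  if (argc < 3)
    return 1;

  ErrorOr<std::unique_ptr<MemoryBuffer>> BufOrErr =
      MemoryBuffer::getFile(argv[1]);
  if (std::error_code EC = BufOrErr.getError()) {
    errs() << "cannot read " << argv[1] << ": " << EC.message() << "\n";
    return 1;
  }
  ArrayRef<uint8_t> Input(
      reinterpret_cast<const uint8_t *>((*BufOrErr)->getBufferStart()),
      (*BufOrErr)->getBufferSize());

  SmallVector<uint8_t, 0> Out;
  auto T0 = std::chrono::steady_clock::now();
  compression::zstd::compress(Input, Out, /*Level=*/std::atoi(argv[2]));
  auto T1 = std::chrono::steady_clock::now();

  outs() << "original: " << Input.size() << " bytes, compressed: "
         << Out.size() << " bytes, time: "
         << std::chrono::duration<double>(T1 - T0).count() << " s\n";
  return 0;
}
```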

@Artem-B
Member

Artem-B commented Feb 29, 2024

> I do not understand the assertion about its decompression speed.

For small apps (let's say < 100 MB), it probably does not matter; most compression algorithms will be fast enough.
It's the large apps where it matters. It's not uncommon for a large machine-learning app to carry literally gigabytes of GPU binaries, and then a 100 MB/s vs. 500 MB/s decompression rate makes quite a bit of a difference.
There are typically far more users than builders, so the aggregate cost balance with slow decompression is not very good, IMO.

> Compression is certainly slower, but it is not a lot slower than the competition.

An order-of-magnitude difference on @yxsamliu's sample would qualify as "a lot", IMO.
That said, the comparison with zstd at -9/-6 is probably not fair, as -9 is only roughly the middle of zstd's compression-level range. LZMA compression does appear to be better in both speed and compression ratio than zstd's high compression levels (11+, according to the chart below).

[image: chart comparing compression speed and ratio across zstd and lzma compression levels]

@yxsamliu zstd's compression levels don't seem to match those of lzma (e.g., 9 is the highest compression level for lzma, but only about the middle of the range for zstd). Could you also measure with zstd -20 and zstd -15?

On a side note, the huge jump in compression ratio between lzma -6 and -9 suggests that it may have something to do with the compression window size, which may need to be large enough to cover multiple similar chunks of the binaries. I suspect we may be able to tweak zstd parameters to improve its compression ratio, too.

Did anybody try training zstd on binaries and check how much it would help us in this case?
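
(For reference, zstd's dictionary training is exposed through zdict.h; a hedged sketch of what such an experiment could look like, with made-up buffer sizes and no tuning, is below:)

```cpp
#include <vector>
#include <zdict.h>
#include <zstd.h>

// Train a dictionary from sample blobs (e.g. previously built per-arch code
// objects), then compress a new input with it.
static std::vector<char>
compressWithTrainedDict(const std::vector<std::vector<char>> &Samples,
                        const std::vector<char> &Input) {
  // zdict expects one flat sample buffer plus an array of sample sizes.
  std::vector<char> Flat;
  std::vector<size_t> Sizes;
  for (const auto &S : Samples) {
    Flat.insert(Flat.end(), S.begin(), S.end());
    Sizes.push_back(S.size());
  }

  std::vector<char> Dict(110 * 1024); // arbitrary dictionary budget
  size_t DictSize =
      ZDICT_trainFromBuffer(Dict.data(), Dict.size(), Flat.data(),
                            Sizes.data(), static_cast<unsigned>(Sizes.size()));
  if (ZDICT_isError(DictSize))
    return {}; // training failed; a real tool would fall back to plain zstd

  std::vector<char> Out(ZSTD_compressBound(Input.size()));
  ZSTD_CCtx *Ctx = ZSTD_createCCtx();
  size_t OutSize = ZSTD_compress_usingDict(Ctx, Out.data(), Out.size(),
                                           Input.data(), Input.size(),
                                           Dict.data(), DictSize,
                                           /*compressionLevel=*/19);
  ZSTD_freeCCtx(Ctx);
  if (ZSTD_isError(OutSize))
    return {};
  Out.resize(OutSize);
  return Out;
}
```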

@yxsamliu
Collaborator Author

yxsamliu commented Mar 1, 2024

| Compression Method | Level | Original Size (bytes) | Compressed Size (bytes) | Compression Rate | Compression Time (s) | Decompression Time (s) |
|---|---|---|---|---|---|---|
| LZMA | 6 | 68,459,756 | 22,984,456 | ~2.98:1 | 18.7226 | 1.1000 |
| LZMA | 9 | 68,459,756 | 4,139,504 | ~16.55:1 | 14.32 | 0.3012 |
| ZSTD | 6 | 68,459,756 | 32,612,291 | ~2.10:1 | 0.8067 | 0.0982 |
| ZSTD | 9 | 68,459,756 | 31,445,373 | ~2.18:1 | 1.3375 | 0.0933 |
| ZSTD | 15 | 68,459,756 | 28,063,493 | ~2.44:1 | 9.7183 | 0.0891 |
| ZSTD | 20 | 68,459,756 | 4,394,993 | ~15.59:1 | 2.0157 | 0.0493 |

It seems we could use zstd level 20 for clang-offload-bundler to achieve a compression rate similar to lzma level 9.

@Artem-B
Member

Artem-B commented Mar 1, 2024

This compression ratio cliff bothers me a bit. I wonder if there's something special about the data the benchmark was run on that triggers it for both compression algorithms.

@yxsamliu would it be possible for you to rerun the benchmarks one more time with the data set split into 1/3 and 2/3 of the original input size, and see if the compression-ratio cliff happens at lower compression levels for smaller inputs?

@yxsamliu
Collaborator Author

yxsamliu commented Mar 2, 2024

> This compression ratio cliff bothers me a bit. I wonder if there's something special about the data the benchmark was run on that triggers it for both compression algorithms.
>
> @yxsamliu would it be possible for you to rerun the benchmarks one more time with the data set split into 1/3 and 2/3 of the original input size, and see if the compression-ratio cliff happens at lower compression levels for smaller inputs?

What is special about the data is that the bitcode for the different GPU archs is very similar, which is common for HIP; therefore the file to be compressed contains N similar portions for N GPU archs.

The following table shows zstd level 20 results for bitcode bundled for 2, 4, and 6 GPU archs:

| GPU Archs | Size Before (bytes) | Size After (bytes) | Compression Rate | Compress Time (s) | Decompress Time (s) |
|---|---|---|---|---|---|
| 2 | 22,819,940 | 4,390,242 | 5.20 | 2.0094 | 0.0293 |
| 4 | 45,639,848 | 4,392,548 | 10.39 | 2.0127 | 0.0391 |
| 6 | 68,459,756 | 4,394,991 | 15.58 | 2.1567 | 0.0429 |

You can see that the compressed size, compression time, and decompression time are almost the same.

This means that the more GPU archs there are, the better the compression rate we get.

Only zstd level 20 and above can achieve this.

@Artem-B
Member

Artem-B commented Mar 2, 2024

I do get the part that multiple GPU variants give us a lot of redundancy in the data to compress away.

It was just not clear to me why the compression ratio dramatically improves between zstd -15 and zstd -20 on the same blob. Or, to put it another way: why don't we see such a high compression ratio at lower compression levels?

Though the fact that we're compressing multiple similar GPU blobs is probably the explanation here, too. If the compression window is smaller than the size of one GPU blob, it may not benefit from the commonality across multiple blobs: by the time we get to the beginning of the second GPU variant, we've essentially forgotten what we had at the beginning of the first one.

> The following table shows zstd level 20 results for bitcode bundled for 2, 4, and 6 GPU archs:

Interesting. So the compression ratio for a single GPU blob is around 2.0-3.0x, and all subsequent blobs for the other GPU variants compress down to essentially nothing, as long as we can squeeze one complete GPU blob into the compression window.

It sounds like there may be further room for improvement by tweaking zstd parameters to exploit specific properties of the data we're packing.

@dwblaikie
Collaborator

Excuse the outlandish suggestion, but given:

> I do get the part that multiple GPU variants give us a lot of redundancy in the data to compress away.

Is there any chance of some sort of domain-specific compression, especially one that would be more resilient to the size of the kernels? (It seems like increasing the compression level increases the compression window size, which produces cliff/break points for kernels of certain sizes, which seems unfortunately non-general. It would be nice not to have to push the compression algorithm so hard for smaller kernels, and it would be nice if larger kernels could still be deduplicated.)

@Artem-B
Member

Artem-B commented Mar 4, 2024

> Excuse the outlandish suggestion, but given:
>
> > I do get the part that multiple GPU variants give us a lot of redundancy in the data to compress away.
>
> Is there any chance of some sort of domain-specific compression,

The key domain-specific quirk we can exploit here is that we produce N very similar blobs (same code, with minor differences due to GPU-specific intrinsics, etc.). There's nothing particularly interesting about the individual blobs.

> especially one that would be more resilient to the size of the kernels? (It seems like increasing the compression level increases the compression window size, which produces cliff/break points for kernels of certain sizes, which seems unfortunately non-general. It would be nice not to have to push the compression algorithm so hard for smaller kernels, and it would be nice if larger kernels could still be deduplicated.)

One way to achieve that would be to interleave the GPU blobs: instead of AAAAABBBBBCCCCC, pack them as ABCABCABCABC. This way the compression-window requirement shrinks to cover only a slice, not a whole blob.
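
A tiny illustration of that transform (purely a sketch; the slice size is arbitrary, and the unbundler would need the slice layout to undo it):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Round-robin interleave N similar blobs into fixed-size slices:
// AAAAABBBBBCCCCC -> ABCABC..., so matching content from different blobs
// lands close together and fits into a smaller compression window.
std::vector<uint8_t>
interleaveBlobs(const std::vector<std::vector<uint8_t>> &Blobs,
                size_t SliceSize = 64 * 1024) {
  std::vector<uint8_t> Out;
  std::vector<size_t> Pos(Blobs.size(), 0);
  bool Remaining = true;
  while (Remaining) {
    Remaining = false;
    for (size_t I = 0; I < Blobs.size(); ++I) {
      size_t N = std::min(SliceSize, Blobs[I].size() - Pos[I]);
      Out.insert(Out.end(), Blobs[I].begin() + Pos[I],
                 Blobs[I].begin() + Pos[I] + N);
      Pos[I] += N;
      if (Pos[I] < Blobs[I].size())
        Remaining = true;
    }
  }
  return Out;
}
```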

Increasing the compression window while keeping the rest of the parameters at a lower compression level may work, too. At least in my experiments, zstd -9 --zstd=wlog=25 does not seem to affect compression time much; it still runs much faster than zstd -20.
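
(On the library side, the equivalent of --zstd=wlog=25 is zstd's advanced-parameter API; a minimal sketch, with the values purely illustrative:)

```cpp
#include <vector>
#include <zstd.h>

// Compress at level 9 but with a 2^25-byte window, mirroring
// `zstd -9 --zstd=wlog=25`.
static std::vector<char> compressWideWindow(const std::vector<char> &Input) {
  ZSTD_CCtx *Ctx = ZSTD_createCCtx();
  ZSTD_CCtx_setParameter(Ctx, ZSTD_c_compressionLevel, 9);
  ZSTD_CCtx_setParameter(Ctx, ZSTD_c_windowLog, 25);

  std::vector<char> Out(ZSTD_compressBound(Input.size()));
  size_t N = ZSTD_compress2(Ctx, Out.data(), Out.size(), Input.data(),
                            Input.size());
  ZSTD_freeCCtx(Ctx);
  if (ZSTD_isError(N))
    return {};
  Out.resize(N);
  return Out;
}

// If the window exceeds the decompressor's default limit, the reader has to
// opt in, e.g.:
//   ZSTD_DCtx *DCtx = ZSTD_createDCtx();
//   ZSTD_DCtx_setParameter(DCtx, ZSTD_d_windowLogMax, 25);
//   ZSTD_decompressDCtx(DCtx, Dst, DstCap, Src, SrcSize);
```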

@dwblaikie
Collaborator

> > Excuse the outlandish suggestion, but given:
> >
> > > I do get the part that multiple GPU variants give us a lot of redundancy in the data to compress away.
> >
> > Is there any chance of some sort of domain-specific compression,
>
> The key domain-specific quirk we can exploit here is that we produce N very similar blobs (same code, with minor differences due to GPU-specific intrinsics, etc.). There's nothing particularly interesting about the individual blobs.
>
> > especially one that would be more resilient to the size of the kernels? (It seems like increasing the compression level increases the compression window size, which produces cliff/break points for kernels of certain sizes, which seems unfortunately non-general. It would be nice not to have to push the compression algorithm so hard for smaller kernels, and it would be nice if larger kernels could still be deduplicated.)
>
> One way to achieve that would be to interleave the GPU blobs: instead of AAAAABBBBBCCCCC, pack them as ABCABCABCABC. This way the compression-window requirement shrinks to cover only a slice, not a whole blob.

I was thinking of something even more domain-specific (an actual domain-specific compression scheme; not that the result couldn't still be compressed further by something generic, but it would encode the data with less duplication to start with). I don't know enough about the structure/contents of these kernels to know what that would look like. If I were speculating rampantly: maybe some kind of macro scheme to describe the architectural differences that could be quickly stripped out when the arch-specific version is needed on-device. (I wonder whether it would even be feasible to compile for multiple targets simultaneously, keeping these differences in conditional blocks, rather than redundantly generating all the kernels and then trying to figure out their commonalities and merge them again.)

But I realize this is all quite out of my depth, and you folks who work on this stuff probably already know what's feasible here.

> Increasing the compression window while keeping the rest of the parameters at a lower compression level may work, too. At least in my experiments, zstd -9 --zstd=wlog=25 does not seem to affect compression time much; it still runs much faster than zstd -20.

That sounds pretty promising (though it would still be interesting to know how much the window size helps or hurts relative to the distribution of kernel sizes. Do we have population data about kernel sizes? Does wlog=25 cover the 90% case? Is the population widely distributed or fairly tightly clustered? Is it growing over time, such that wlog=25 covers 90% today but only 50% in a year or two?)

@Artem-B
Member

Artem-B commented Mar 5, 2024

> I was thinking of something even more domain-specific

That sounds like a reasonable research-project topic. :-)
I suspect there may be something to be gained compression-wise for specific ISAs, but I would not expect miracles, as existing compression algorithms are pretty good at removing redundancies. zstd does have the ability to 'train' on a particular set of input files and produce a dictionary optimized for those binaries. It may be interesting to experiment with that.

> maybe some kind of macro scheme to describe the architectural differences,

That's largely what happens in CUDA and AMDGPU. Unfortunately, those minor differences percolate through the rest of the code, and we usually end up with similar, but not identical, compiler outputs. It's hard to generalize which parts will be affected, so in practice we do need to compile everything.

> Do we have population data about kernel sizes?

Anecdotally, individual kernel sizes vary from almost nothing to O(megabytes). Individual TUs (I think that's what the object bundler ends up dealing with) will likely be on the smaller side, but outliers are fairly common.

> Does wlog=25 cover the 90% case?

My guess is that it should be sufficient for most use cases.

@yxsamliu
Collaborator Author

yxsamliu commented Mar 8, 2024

Closing this PR since we decided to use zstd.

@yxsamliu yxsamliu closed this Mar 8, 2024