
MI300 compatibility #1764

Merged
merged 58 commits into from
May 17, 2024

Conversation

@fxmarty (Contributor) commented Apr 18, 2024

Adds support for AMD Instinct MI300 in TGI.

The main changes are:

By default, TunableOp tuning results are saved in /data (e.g. /data/tunableop_meta-llama-Llama-2-70b-chat-hf_tp1_rank0.csv) to avoid rerunning the tuning on each docker run.

Example:

Validator,PT_VERSION,2.3.0
Validator,ROCM_VERSION,6.1.0.0-82-5fabb4c
Validator,HIPBLASLT_VERSION,0.7.0-1549b021
Validator,GCN_ARCH_NAME,gfx942:sramecc+:xnack-
Validator,ROCBLAS_VERSION,4.1.0-cefa4a9b-dirty
GemmTunableOp_Half_TN,tn_8192_7_28672,Gemm_Rocblas_45475,0.132098
GemmTunableOp_Half_TN,tn_10240_4_8192,Gemm_Rocblas_45546,0.0484431
GemmTunableOp_Half_TN,tn_32000_6_8192,Default,0.149546
GemmTunableOp_Half_TN,tn_32000_3_8192,Gemm_Rocblas_45520,0.147119
GemmTunableOp_Half_TN,tn_8192_3_28672,Gemm_Rocblas_45475,0.132645
GemmTunableOp_Half_TN,tn_10240_3_8192,Gemm_Rocblas_45546,0.0482971
GemmTunableOp_Half_TN,tn_57344_5_8192,Gemm_Rocblas_45520,0.255694
GemmTunableOp_Half_TN,tn_10240_7_8192,Gemm_Rocblas_45517,0.0482522
GemmTunableOp_Half_TN,tn_8192_3_8192,Gemm_Rocblas_45546,0.0444671
GemmTunableOp_Half_TN,tn_8192_5_8192,Gemm_Rocblas_45546,0.0445834
GemmTunableOp_Half_TN,tn_57344_7_8192,Gemm_Rocblas_45520,0.25622
GemmTunableOp_Half_TN,tn_8192_2_28672,Gemm_Rocblas_45475,0.132122
GemmTunableOp_Half_TN,tn_8192_4_8192,Gemm_Rocblas_45517,0.0453191
GemmTunableOp_Half_TN,tn_10240_5_8192,Gemm_Rocblas_45517,0.0482514
GemmTunableOp_Half_TN,tn_8192_5_28672,Gemm_Rocblas_45542,0.133914
GemmTunableOp_Half_TN,tn_8192_2_8192,Gemm_Rocblas_45517,0.0446516
GemmTunableOp_Half_TN,tn_8192_1_28672,Gemm_Hipblaslt_TN_10814,0.131953
GemmTunableOp_Half_TN,tn_10240_2_8192,Gemm_Rocblas_45546,0.0481043
GemmTunableOp_Half_TN,tn_32000_4_8192,Gemm_Rocblas_45520,0.147497
GemmTunableOp_Half_TN,tn_8192_6_28672,Gemm_Rocblas_45529,0.134895
GemmTunableOp_Half_TN,tn_57344_2_8192,Gemm_Rocblas_45520,0.254716
GemmTunableOp_Half_TN,tn_57344_4_8192,Gemm_Rocblas_45520,0.255731
GemmTunableOp_Half_TN,tn_10240_6_8192,Gemm_Rocblas_45517,0.0484816
GemmTunableOp_Half_TN,tn_57344_3_8192,Gemm_Rocblas_45520,0.254701
GemmTunableOp_Half_TN,tn_8192_4_28672,Gemm_Rocblas_45475,0.132159
GemmTunableOp_Half_TN,tn_32000_2_8192,Default,0.147524
GemmTunableOp_Half_TN,tn_32000_5_8192,Default,0.147074
GemmTunableOp_Half_TN,tn_8192_6_8192,Gemm_Rocblas_45546,0.0454045
GemmTunableOp_Half_TN,tn_57344_6_8192,Gemm_Rocblas_45520,0.255582
GemmTunableOp_Half_TN,tn_32000_7_8192,Default,0.146705
GemmTunableOp_Half_TN,tn_8192_7_8192,Gemm_Rocblas_45546,0.0445489
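The tuning file above has two kinds of rows: `Validator` rows that pin the environment (PyTorch, ROCm, hipBLASLt, rocBLAS versions and GPU architecture), and `GemmTunableOp` rows that map a GEMM shape to the selected kernel and its measured time. A minimal sketch of how such a file could be inspected — the parsing code here is illustrative and not part of TGI:

```python
import csv
from io import StringIO

# A short excerpt in the same format as the tuning file above.
SAMPLE = """Validator,PT_VERSION,2.3.0
GemmTunableOp_Half_TN,tn_8192_7_28672,Gemm_Rocblas_45475,0.132098
GemmTunableOp_Half_TN,tn_32000_6_8192,Default,0.149546
"""

def parse_tunableop(text):
    """Split TunableOp CSV rows into validator metadata and per-shape results."""
    validators, results = {}, []
    for row in csv.reader(StringIO(text)):
        if not row:
            continue
        if row[0] == "Validator":
            validators[row[1]] = row[2]
        else:
            op, shape, solution, time_ms = row
            results.append((op, shape, solution, float(time_ms)))
    return validators, results

validators, results = parse_tunableop(SAMPLE)
print(validators["PT_VERSION"])  # 2.3.0
print(len(results))              # 2
```

Mounting a host directory on /data when launching the container is what lets these results persist across runs.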

Dockerfile_amd (review thread, resolved)
launcher/src/main.rs (review thread, resolved)
@fxmarty (Contributor, Author) commented May 16, 2024

@Narsil This PR is ready. Could you take a look?

We are just waiting for a patched/updated rocm/dev-ubuntu-22.04 base image that fixes an issue with libamdhip64.so on certain VMs, which would avoid:

ARG GITHUB_TOKEN
RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends wget && \
rm -rf /var/lib/apt/lists/* && \
wget --header "Authorization: token ${GITHUB_TOKEN}" https://raw.githubusercontent.com/fxmarty/patched_hipruntime/main/libamdhip64.so.6
ENV LD_PRELOAD="/libamdhip64.so.6"

We are expecting to get the updated docker image by Monday next week. Do you think a TGI release next Tuesday/Wednesday with this PR included is feasible?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Narsil (Collaborator) commented May 16, 2024

We are expecting to get the updated docker image by Monday next week. Do you think a TGI release on next week Tuesday/Wednesday with this PR in is feasible?

Sure, releases are kind of trivial now.

Dockerfile_amd Outdated
Comment on lines 220 to 224
# COPY ./tgi-entrypoint.sh /tgi-entrypoint.sh
# RUN chmod +x /tgi-entrypoint.sh

# ENTRYPOINT ["/tgi-entrypoint.sh"]
# CMD ["--json-output"]
@fxmarty (Contributor, Author): To clean

@Narsil (Collaborator) left a comment

Very nice.

Lots of nits and code structure suggestions, but overall everything looks good.

server/text_generation_server/models/flash_causal_lm.py (4 review threads, resolved)
@@ -0,0 +1,816 @@
#!/usr/bin/env python
@Narsil (Collaborator): Shouldn't we put this in layers/flash_attn/triton.py maybe? (and flash_attn.py -> flash_attn/__init__.py for simplicity?)

@fxmarty (Contributor, Author): Could we do that in another PR? Then e.g. paged_attention.py should be moved as well.

server/text_generation_server/utils/flash_attn.py (review thread, resolved)
server/text_generation_server/models/t5.py (review thread, resolved)
bias = None
return cls(weight, bias)

def forward(self, inp: torch.Tensor) -> torch.Tensor:
@Narsil (Collaborator): s/inp/input/

@fxmarty (Contributor, Author):
out = torch.empty(
inp.shape[0], weight.shape[0], dtype=inp.dtype, device="cuda"
)
if (k == 8192 and (m == 1280 or m == 7168)) or (k == 3584 and m == 8192):
@Narsil (Collaborator): This feels way overspecified, no? Is it really only implemented for these shapes?

The second condition looks way more general.

@fxmarty (Contributor, Author): cc @gshtras @charlifu have you tested different rows_per_block? Was it specifically tuned for these shapes?
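For reference, the condition being discussed acts as a shape-based dispatch gate: the custom kernel is taken only for GEMM shapes it was tuned on, and everything else falls through to the default path. A standalone restatement of that gate (the function name is illustrative, not from the codebase):

```python
def use_custom_gemm(m: int, k: int) -> bool:
    """Shape gate from the snippet above: dispatch to the tuned kernel
    only for the specific (m, k) GEMM shapes it was benchmarked on."""
    return (k == 8192 and m in (1280, 7168)) or (k == 3584 and m == 8192)

print(use_custom_gemm(1280, 8192))  # True
print(use_custom_gemm(4096, 8192))  # False
```

Whether the first clause is genuinely shape-specific or just under-tuned is exactly the open question in this thread.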

@fxmarty (Contributor, Author) commented May 17, 2024

Now that we have the updated rocm/dev-ubuntu-22.04:6.1.1_hip_update image, this PR can be merged once the build is done and tests are passing.

@Narsil Narsil merged commit 232e8d5 into main May 17, 2024
9 of 10 checks passed
@Narsil Narsil deleted the mi300-compat branch May 17, 2024 13:30
fxmarty added a commit that referenced this pull request May 17, 2024
Not all models were tested in
#1764.

Fixing some more issues (notably starcoder2) here, the full CI will come
shortly once we split `build.yml` in two
alfredgui2 pushed a commit to mlsys-io/kv.run that referenced this pull request Jul 6, 2024
Not all models were tested in
huggingface/text-generation-inference#1764.

Fixing some more issues (notably starcoder2) here, the full CI will come
shortly once we split `build.yml` in two
tjluyao added a commit to mlsys-io/kv.run that referenced this pull request Jul 7, 2024
commit 6adf97815ef6828e0aa06f2a4635370b4ad7476e
Author: Alfred Gui <alfredzqgui@gmail.com>
Date:   Sat Jul 6 13:18:16 2024 -0400

    Fix the decoding logic in test_local_grpc.py (#44)

    * fix the test_local_grpc script

    * lint fix

commit f355733482f4ebc15916df151ad00ad9d64d451d
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Sat Jul 6 07:50:55 2024 -0700

    bug fixes

commit 466b0a65429d339a1c004c5991749e6f9cb1230b
Author: Alfred Gui <alfredzqgui@gmail.com>
Date:   Mon Jul 1 22:48:56 2024 -0400

    Add the batch concatenation functionality for flashinfer server (#43)

    * refactor flashinfer causal lm

    * modify test_local_api

    * fixes

    * fixes

    * lint

commit b9838c5c4720ff09f946e7fce8dd328aab57dc16
Author: NovTi <yx2432@nyu.edu>
Date:   Tue Jul 2 00:07:24 2024 +0800

    Add ChatGLM and refactor Qwen2

commit 9fafffcfacb8ded0d0d5aefac2cf38ae3a44876f
Author: PeterYaoNYU <yy4108@nyu.edu>
Date:   Mon Jul 1 10:30:21 2024 +0800

    update mistral flashinfer

commit d099bbbbeeaf638220696b5c9f94cf9634f8c221
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Sun Jun 30 18:39:44 2024 -0700

    update submodules

commit 4edacd568d064cb834597d8cf2f24bf1bef20683
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Sun Jun 30 18:29:34 2024 -0700

    update submodules

commit 9da076dc488140273ab17773ae642e8ac3edb119
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Sun Jun 30 18:17:41 2024 -0700

    minor fix in makefile

commit fa213e263fd86ec41d033cb8d46dea07076720bd
Author: MichaelYuan2 <hy2203@nyu.edu>
Date:   Tue Jun 25 10:41:09 2024 +0800

    update FlashinferAttentionWrapper to flashinfer 0.0.6

commit 8d3dd4898a26f89d82233640123aad90e2477bb6
Author: Alfred Gui <alfredzqgui@gmail.com>
Date:   Mon Jun 24 11:25:08 2024 -0400

    Fix the server CLI issue with use_flashinfer flag (#42)

    * fix refactor

    * empty

    * fix lint

commit 23118727bdf000d87115df9ac6a6ccf3aee7a2ef
Author: Alfred Gui <alfredzqgui@gmail.com>
Date:   Sat Jun 22 17:22:51 2024 -0400

    decouple flashinfer files from flash attention (#41)

commit 9b3c09850ddfdd8141601ee9b1b027e4aa2d4b83
Merge: 4a40c64 f0d3664
Author: Alfred Gui <alfredzqgui@gmail.com>
Date:   Thu Jun 20 11:13:14 2024 -0400

    Merge pull request #40 from mlsys-io/add_baichuan

    Adjust the flashinfer llama model to accommodate the baichuan model

commit f0d3664f34acae5020f045fabca15aa310ce60ec
Author: Alfred Gui <alfredzqgui@gmail.com>
Date:   Thu Jun 20 10:46:12 2024 -0400

    adjust the flashinfer llama model to accommodate baichuan

commit 4a40c6415cd7f1d29bab6de9907ca8ac66833863
Merge: 0ba0ac9 6aaab88
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Mon Jun 17 10:15:42 2024 -0700

    Merge branch 'master' of github.com:mlsys-io/kv.run

commit 0ba0ac9dd8825cef92cd7b92fef49ab0efcb8fbd
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Mon Jun 17 10:01:44 2024 -0700

    minor fix in output example

commit 6aaab883fb154b960de9ad501de74ad90f447725
Merge: 7a93d84 08fde0f
Author: Alfred Gui <alfredzqgui@gmail.com>
Date:   Mon Jun 17 12:46:13 2024 -0400

    Merge pull request #38 from mlsys-io/flash_attn_rotary

    Use Flash attention for rotary embedding and layer normalization for Phi2 and Phi3

commit 08fde0f9ab74fd54fe59bbca5020448a862c1188
Author: Alfred Gui <alfredzqgui@gmail.com>
Date:   Mon Jun 17 12:43:19 2024 -0400

    revert test file

commit c51e36e3a3bf60f5e23f3a1fee5fe6b116fcc362
Author: Alfred Gui <alfredzqgui@gmail.com>
Date:   Mon Jun 17 12:42:16 2024 -0400

    fix lint

commit 7dfa57d5ca29e366c1d7c6de01ef6e81840fd7d5
Author: Alfred Gui <alfredzqgui@gmail.com>
Date:   Mon Jun 17 12:40:40 2024 -0400

    empty

commit b45e8968e75976ac506dc25b467e41520c457d48
Author: Alfred Gui <alfredzqgui@gmail.com>
Date:   Mon Jun 17 14:17:55 2024 +0000

    fix phi2 and phi3 modeling

commit 31ad6bd942293ce18addf79944da5d742518f900
Merge: 1e2bf10 7a93d84
Author: Alfred Gui <alfredzqgui@gmail.com>
Date:   Mon Jun 17 08:55:24 2024 -0400

    merge master

commit 1e2bf1026420e298cb7fbed4d73166baefcbf615
Author: Alfred Gui <alfredzqgui@gmail.com>
Date:   Mon Jun 17 06:43:51 2024 -0400

    fix the flashinfer adapter

commit da84f6bcce038029714f48916510964d5b00d757
Author: Alfred Gui <alfredzqgui@gmail.com>
Date:   Mon Jun 17 00:51:55 2024 +0000

    fixes

commit e0feabb012e8d82d6265dc85811a47fff44c1c65
Author: Alfred Gui <alfredzqgui@gmail.com>
Date:   Sun Jun 16 20:20:59 2024 -0400

    fix rotary bug

commit 7a93d8413fbfb62e8ae6646a12aaed55b36afaa1
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Sat Jun 15 22:50:16 2024 -0700

    update to rust 1.79

commit 6c4fa6effac801c7c4a30479eca30a7c5ecb057d
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Sat Jun 15 22:15:41 2024 -0700

    minor fixes

commit ad40a1752d5964554814261754e63a1122829ce9
Author: Alfred Gui <alfredzqgui@gmail.com>
Date:   Sun Jun 16 01:49:28 2024 +0000

    flash attn rotary

commit 868d3f2fa74a07178806eadc79a2f23f59bafa77
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Fri Jun 14 22:57:09 2024 -0700

    minor router-server fix

commit b8a47854a60d21347d6e4f66a507d1a4d2580c30
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Fri Jun 14 16:43:32 2024 -0700

    finalize docker build workflow

commit fa2f2f2c8d5249e151cb51c35ef2952cf937b98c
Merge: 93edec5 85f34cb
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Fri Jun 14 14:16:29 2024 -0700

    Merge branch 'master' of github.com:mlsys-io/kv.run

commit 93edec51ef1714c95b56699ad0b284f6c0b7a916
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Fri Jun 14 14:16:18 2024 -0700

    dependency and rust toolchain fix

commit 85f34cb1147265e3d13080d032c92b7d81d09895
Merge: de58365 e263ba8
Author: Alfred Gui <alfredzqgui@gmail.com>
Date:   Fri Jun 14 15:16:44 2024 -0400

    Merge pull request #36 from mlsys-io/fix_warm

    Fix the warm-up issue

commit de5836558a56c3541ec9be3b1d41dde51d08969a
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Fri Jun 14 12:06:42 2024 -0700

    fix in workflow

commit 83fc271da0ef6c0580d5d8491605b582c2d730cc
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Fri Jun 14 11:32:29 2024 -0700

    build workflow update

commit 66d272347539741c6750841938123b5522abb144
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Fri Jun 14 09:10:00 2024 -0700

    docker workflow

commit e8f9ff4f2be08421219acc6d2b611e2c4ba87768
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Fri Jun 14 09:08:55 2024 -0700

    docker workflow

commit e49f754e1fb33af4b9bf33bcc08a6d23d4cacb56
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Fri Jun 14 00:04:32 2024 -0700

    remove tgi build workflow

commit a4802b7867e766e492cb1f99877f386148962c3a
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Fri Jun 14 00:01:15 2024 -0700

    docker build workflow; remove submodules (#35)

    * test docker

    * docker

    * remove submodule

    * updates

commit e263ba802023d45ee5b26df0d90f8401ee0f87aa
Author: Alfred Gui <alfredzqgui@gmail.com>
Date:   Thu Jun 13 20:32:48 2024 -0400

    fix warm up issue

commit c7613eb887ac10ba8d38b00ab26b85ff395ecdc6
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Thu Jun 13 17:01:27 2024 -0700

    test docker (#34)

commit e61ea779f8dffacab0a161aa13135999d6ec3ee7
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Thu Jun 13 09:47:33 2024 -0700

    minor fixes and rename tests.xml

commit 8ae802cb8848df58fc9c0c279044f5b50309044e
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Tue Jun 11 14:09:50 2024 -0700

    fix dtype bugs in flashinfer model def

commit b821d68f4120951bbde7f57ca0ad9ba914d33354
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Tue Jun 11 11:30:51 2024 -0700

    bug fix in layers/__init__.py

commit b7c8735c77cb76446ba30efbb20f19067289fcab
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Tue Jun 11 10:33:50 2024 -0700

    minor typo

commit 6010fad087f477174766981acc162322e1d767da
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Tue Jun 11 10:30:45 2024 -0700

    critical output bug (#25)

    * output debug

    * update minor

commit b599cc65ecb8215cfcc8a9db6daa0d88450b9cc5
Author: Alfred Gui <zgui@flexport.com>
Date:   Tue Jun 11 10:34:24 2024 -0400

    Decouple flashinfer code paths from flash attention library dependencies (#33)

    * decouple flash attn dependency from flashinfer code paths

    * follow up

commit e0cd4a67f7cffdc620baa5d1ae22a32a3be94d4e
Author: Alfred Gui <zgui@flexport.com>
Date:   Tue Jun 11 09:47:06 2024 -0400

    reformat the llama files (#32)

commit 6c96fddcbbe4c16f97fe391ef3387702234f4f65
Author: Alfred Gui <zgui@flexport.com>
Date:   Mon Jun 10 21:02:42 2024 -0400

    Llama rewrite (#31)

    * write llama in tgi style

    * fixes

    * fix the runtime issues

commit 9dd3b75af84cb0d3411bd43fc0414e4592193037
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Mon Jun 10 17:10:20 2024 -0700

    Kv.run test workflows (#30)

    * python 3.10

    * python 3.10.14

    * update doc

    * dispatch

    * update python workflow (repeated 16 times)

commit 9ec483dae3eb34f594511b649370af354d5d0923
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Mon Jun 10 15:15:35 2024 -0700

    kv.run test workflows (#29)

    * python 3.10

    * python 3.10.14

    * update doc

    * dispatch

commit 4757af8b6bb5b5548e17c5aeee767f5650607aed
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Mon Jun 10 14:53:52 2024 -0700

    kv.run test workflow

commit d58a35ed4694a18b1d3028b79cab9b3227ccdafc
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Mon Jun 10 11:41:13 2024 -0700

    Compliant for pre-commit configs

commit a8144374aa50e85016c19fa6f4a45c7f7c724d46
Author: Alfred Gui <zgui@flexport.com>
Date:   Mon Jun 10 06:45:29 2024 -0400

    Introduce the flashinfer attention wrapper abstraction and use it for Llama and Gemma models (#28)

    * abstract the attention layer

    * fix the bugs

commit 3956e467fd043e8218462e475d71892784ad5907
Author: Alfred Gui <zgui@flexport.com>
Date:   Sun Jun 9 06:36:01 2024 -0400

    Refactor the Flashinfer models (#27)

    * refactor the flashinfer models

    * fixes

commit 7dda533b23d548bff8c569370daff203699a6e60
Author: Alfred Gui <zgui@flexport.com>
Date:   Sat Jun 8 08:40:55 2024 -0400

    Support Flashinfer based Phi2 and Phi3 models (#26)

    * add phi model

    * fix phi integration errors

    * padding for phi

    * fix modeling for phi

    * workarounds for phi

    * use flash attn's position rotary embedding

    * support phi3 and baichuan

    * fix position encoding

    * clean up

commit 482ef988e2c2ef59743aeaff01d79b72e0546baa
Author: NovTi <yx2432@nyu.edu>
Date:   Wed Jun 5 22:04:14 2024 +0800

    Add qwen2 1.8b and 72b base inference

commit 5935ccedd980669c1366d70f20b5c3739184815f
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Tue Jun 4 21:30:52 2024 -0700

    add lora functions to python client; test llama-3-70b AWQ

commit 48b505376376f01e36b69bb0026f9a6af7e95676
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Tue Jun 4 13:28:18 2024 -0700

    testing llama-3-70b-gptq

commit 80d4a605347f60c6d12958a577182b27ec413def
Author: NovTi <yx2432@nyu.edu>
Date:   Tue Jun 4 22:03:11 2024 +0800

    Fix minor typos

commit e6af233933f9709e7da606409151c0802520f6ef
Author: NovTi <yx2432@nyu.edu>
Date:   Mon Jun 3 22:33:17 2024 +0800

    Integrate qwen2

commit 72d74cf82d1976457881318ae035b956fde3f220
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Sun Jun 2 20:42:44 2024 -0700

    Update Makefile to include punica kernels

commit e7fb9b9dc6651aeb68e9e793d0d25381a14e12b5
Author: PeterYaoNYU <yy4108@nyu.edu>
Date:   Mon Jun 3 10:51:16 2024 +0800

    integrate lora into mistral

commit 47f4685004ac7db295c46ec9a69f62a783fe07a6
Author: Alfred Gui <zgui@flexport.com>
Date:   Sun Jun 2 08:34:24 2024 -0400

    add placeholder for flashinfer phi modeling (#24)

commit 40a70bcc369c6b61f486dc273ab0fd4330e21d58
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Sat Jun 1 22:06:30 2024 -0700

    Update README.md

commit f125e73ade681ac4e60cd48488a59f2bab162f97
Merge: 79402fb 7243638
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Sat Jun 1 21:22:58 2024 -0700

    Merge pull request #23 from mlsys-io/reorder-codebase

    Reorder code base

commit 72436388e230d6778a6303fd656befa19632dbba
Author: rainj-me <rain-jiang@outlook.com>
Date:   Sat Jun 1 19:10:39 2024 -0700

    fix the lora-id parameter in the benchmark

commit 650c743e1572b35c0c304edcba8afb3b8865935d
Merge: 79402fb 799a193
Author: rainj-me <rain-jiang@outlook.com>
Date:   Sat Jun 1 18:58:38 2024 -0700

    directly merge from tgi

commit 799a193b109662743bed1b18a09af1fdcd508c8b
Author: Nicolas Patry <patry.nicolas@protonmail.com>
Date:   Sat Jun 1 08:47:00 2024 +0000

    Fixing Phi3.

commit 79402fb10d115a1ebe19ad97dd1482bd03479c80
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Fri May 31 16:02:53 2024 -0700

    Rest API to download lora adapter on router

commit 08b3eac2ce54e25bec12088fd7e69ee3c07adaf5
Author: Nicholas Broad <nbroad94@gmail.com>
Date:   Fri May 31 09:42:14 2024 -0700

    single char ` addition for docs (#1989)

    # What does this PR do?

    I think this will fix the docs from being weirdly formatted. All the
    sections after MAX_TOP_N_TOKENS don't show up in the bar on the right
    (https://huggingface.co/docs/text-generation-inference/basic_tutorials/launcher#maxtopntokens)

    ## Before submitting
    - [x] This PR fixes a typo or improves the docs (you can dismiss the
    other checks if that's the case).
    - [ ] Did you read the [contributor
    guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
          Pull Request section?
    - [ ] Was this discussed/approved via a Github issue or the
    [forum](https://discuss.huggingface.co/)? Please add a link
          to it if that's the case.
    - [ ] Did you make sure to update the documentation with your changes?
    Here are the
    [documentation
    guidelines](https://github.com/huggingface/transformers/tree/main/docs),
    and
    [here are tips on formatting
    docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
    - [ ] Did you write any new necessary tests?

    ## Who can review?

    @merveenoyan

    ---------

    Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

commit 5ab4cef67ef6326429a0e4e3d44b9710d9f26c53
Author: Nicolas Patry <patry.nicolas@protonmail.com>
Date:   Fri May 31 18:01:43 2024 +0200

    Fixing exl2 scratch buffer. (#1990)


commit 06edde94910594eef86988934cbbc43d775eb965
Author: Nicolas Patry <patry.nicolas@protonmail.com>
Date:   Fri May 31 17:57:01 2024 +0200

    Purely refactors paged/attention into `layers/attention` and make hardware differences more obvious with 1 file per hardware. (#1986)


commit 659bd67fec0a874e325fc2a2afd0c2ed2af692f0
Author: fxmarty <9808326+fxmarty@users.noreply.github.com>
Date:   Fri May 31 07:03:24 2024 -0700

    Update documentation version to 2.0.4 (#1980)

    As per title

    cc @Narsil

commit 967ced2ff4565a5358d45a1372d32fbab113700b
Author: Daniël de Kok <me@danieldk.eu>
Date:   Thu May 30 07:10:10 2024 +0000

    Gemma GPTQ checks: skip logprob checks

    This test fails somewhat regularly due to non-determinism and this
    test is primarily to verify that we are loading a model which doesn't
    have `float16` as the default dtype correctly.

commit 36dd16017c7211b7760d1daa188172bb902e486f
Author: Daniël de Kok <me@danieldk.eu>
Date:   Tue May 28 09:51:31 2024 +0000

    Add support for exl2 quantization

    Mostly straightforward, changes to existing code:

    * Wrap quantizer parameters in a small wrapper to avoid passing
      around untyped tuples and needing to repack them as a dict.
    * Move scratch space computation to warmup, because we need the
      maximum input sequence length to avoid allocating huge
      scratch buffers that OOM.

commit cbced7f0f9ca0b62216223859b82a2632d1c7a1f
Author: drbh <david.richard.holtz@gmail.com>
Date:   Wed May 29 12:42:11 2024 -0400

    feat: adjust attn weight loading logic (#1975)

    This PR updates `load_attention` to prefer loading specific attention
    based on the model type. Additionally there were two cases where
    `TensorParallelColumnLinear.load_multi` was called and this reduces it
    to a single path

commit 612bc483b6f5029918039e684982fc1bfbe1b502
Author: Nicolas Patry <patry.nicolas@protonmail.com>
Date:   Tue May 28 16:55:36 2024 +0200

    Fixing the text part from tokenizer endpoint. (#1967)


commit f20463e4e3a994fbcbc836cd315c14b766c72205
Author: Daniël de Kok <me@danieldk.eu>
Date:   Tue May 28 07:25:14 2024 +0000

    Fix (non-container) pytest stdout buffering-related lock-up

    Two issues:

    1. When one of the stdout/stderr pipe buffers of a process started
       with `subprocess.Popen` is full, the process can get blocked until
       the buffer is drained.
    2. Calling `Popen.wait` can deadlock when called before draining
       the pipe buffers (if they are full).

    This avoids the issue altogether by giving the child process a
    temporary file to write to.
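
    A minimal sketch of the workaround (the spawned command is illustrative):
    redirect the child's output to a temporary file so the pipe buffer can
    never fill, then call `wait` safely.

    ```python
    import subprocess
    import sys
    import tempfile

    # Write child output to a file instead of a pipe: a full pipe buffer can
    # block the child (and deadlock Popen.wait); a file cannot fill up that way.
    with tempfile.TemporaryFile("w+") as out:
        proc = subprocess.Popen(
            # Emits ~1 MB, far beyond typical 64 KiB pipe buffers.
            [sys.executable, "-c", "print('x' * 1_000_000)"],
            stdout=out,
            stderr=subprocess.STDOUT,
        )
        proc.wait()  # safe: the child can never block on a full pipe
        out.seek(0)
        output = out.read()
    ```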

commit e76b9824ae965e95923dbcf50aa30efb633a1974
Author: Nicolas Patry <patry.nicolas@protonmail.com>
Date:   Tue May 28 14:52:17 2024 +0200

    Upgrade to Axum 0.7 and Hyper 1.0 (Breaking change: disabled ngrok tunneling). (#1959)

    - Axum upgraded to Hyper 1.0 and most of the ecosystem has switched, so
    it's our turn now.
    - [ngrok-rust](https://github.com/ngrok/ngrok-rust/pull/137/files)
    hasn't yet, and hasn't for several months now, so let's disable the
    feature for the time being.


commit b7ffa287f228e065c45a99684e73b862a5166fac
Author: Moritz Laurer <41862082+MoritzLaurer@users.noreply.github.com>
Date:   Mon May 27 17:31:06 2024 +0200

    fix small typo and broken link (#1958)

    # What does this PR do?

    Fix a typo; fix a broken link; add one sentence in the guidance docs to
    make the word "grammar" less abstract


    ## Who can review?

    @drbh

commit 0732b9d2f0fb9a4dd9753bdabe3ddb7d452c49cf
Author: drbh <david.richard.holtz@gmail.com>
Date:   Mon May 27 10:03:16 2024 -0400

    Processor config chat template (#1954)

    This PR loads the `processor_config` similarly to the `tokenizer_config`
    and uses the processor config's chat template when the tokenizer config
    does not include one. These changes enable chat with idefics2.
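
    The fallback can be sketched as follows (the config shapes and function
    name are assumptions for illustration, not the actual TGI code):

    ```python
    def resolve_chat_template(tokenizer_config, processor_config):
        # Prefer the tokenizer config's template; fall back to the processor's.
        return tokenizer_config.get("chat_template") or processor_config.get("chat_template")

    # idefics2-style case: the template only lives in the processor config.
    template = resolve_chat_template({}, {"chat_template": "{{ messages }}"})
    ```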

commit a401c83c355d3b66ad158f4798b58bb5c696caac
Author: Daniël de Kok <me@danieldk.eu>
Date:   Mon May 27 14:41:28 2024 +0200

    Fix GPTQ for models which do not have float16 at the default dtype (simpler) (#1953)

    # What does this PR do?

    Fix GPTQ for models which do not have float16 at the default dtype

    Before this change, GPTQ models would not work if the model's default
    data type is not `float16`. For example, Gemma GPTQ models would fail
    because the default dtype of Gemma is `bfloat16`. There are two issues:

    1. If the default `dtype` is not `float16`, the quantizer's `float16`
       parameters get converted to that dtype, and the kernels cannot deal
       with non-`float16` types.
    2. The same applies to the inputs of quantized ops.

    This is resolved by setting the dtype of GPTQ/AWQ-quantized models to
    `float16`.

    Simpler version of #1951.
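
    The resolution can be sketched as a simple dtype override (the function
    and string names are illustrative assumptions, not the actual TGI code):

    ```python
    def resolve_dtype(quantize, requested):
        # GPTQ/AWQ kernels only support float16, so override the model's
        # default dtype (e.g. Gemma's bfloat16) for quantized checkpoints.
        if quantize in ("gptq", "awq"):
            return "float16"
        return requested or "float16"
    ```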



commit 9231098f3a9b2f0fe7f6652f10f02f4d8f551143
Author: Daniël de Kok <me@danieldk.eu>
Date:   Fri May 24 15:34:42 2024 +0000

    Fix (flash) Gemma prefix and enable tests

commit d32e33bd489f2419e579f5d423073791ee19f789
Author: Nicolas Patry <patry.nicolas@protonmail.com>
Date:   Fri May 24 15:36:13 2024 +0200

    Fix seeded output. (#1949)


commit cff472ba2b9147015ffd005aace282481d489695
Author: Nicolas Patry <patry.nicolas@protonmail.com>
Date:   Fri May 24 12:40:39 2024 +0200

    Fixing codellama loads by using purely `AutoTokenizer`. (#1947)

    - The need for the slow tokenizer default stems from back
      when Llama 1 was introduced and not all the flags were
      supported in `tokenizers`.

    - Fixes #1891


commit 954653466d24a9b3435988136983398bdf788a2f
Author: Nicolas Patry <patry.nicolas@protonmail.com>
Date:   Thu May 23 15:40:40 2024 +0200

    Improving the logging system. (#1938)

    - Added a debug log for speculated ids (helps assess the quality of a
      speculator from the logs).
    - Remove newlines from child process logs when re-emitting in non-JSON
      mode.
    - Made the default level closer to what's expected (only our binaries'
      level).
    - Propagate that level correctly to the shard (it was forced to INFO).


commit 629047cb82d2ff97a8f0d0446ed7a3a68bed63a7
Author: Thomas Schillaci <thomas.schillaci@gmail.com>
Date:   Thu May 23 15:37:09 2024 +0200

    Add completion route to client and add stop parameter where it's missing (#1869)

    # What does this PR do?

    - Add the stop parameter to the completion route
    - Add the completion method to the python client
    - Add the stop parameter to the python client's chat method
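
    On the wire, the added parameter follows the OpenAI-style completion
    schema; a rough sketch of the request payload (the helper name and
    default values here are illustrative assumptions, not the client's API):

    ```python
    def build_completion_request(model, prompt, stop=None, max_tokens=32):
        # Minimal payload for the completion route; `stop` is optional and
        # omitted entirely when not set.
        payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
        if stop:
            payload["stop"] = stop
        return payload

    payload = build_completion_request("tgi", "Once upon a time", stop=["\n\n"])
    ```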

    ## Who can review?

    @Narsil

    ---------

    Co-authored-by: Thomas SCHILLACI <tschilla@px101.prod.exalead.com>
    Co-authored-by: Thomas Schillaci <thomas.schillaci@3ds.com>

commit f4a073ae6d2cbcf6ee353b4e27ea90586893fe8b
Author: Nicolas Patry <patry.nicolas@protonmail.com>
Date:   Thu May 23 14:39:38 2024 +0200

    Fixing some legacy behavior (big swapout of serverless on legacy stuff). (#1937)


    ---------

    Co-authored-by: Daniël de Kok <me@github.danieldk.eu>

commit f41d644a903d179915e122896aba6bc77821795a
Author: Wang, Yi <yi.a.wang@intel.com>
Date:   Thu May 23 20:11:08 2024 +0800

    reenable xpu for tgi (#1939)

    Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

commit a103e3e9e2041add8bd83a8b5b35c497784b9722
Author: drbh <david.richard.holtz@gmail.com>
Date:   Thu May 23 05:34:18 2024 -0400

    feat: add train medusa head tutorial (#1934)

    This PR adds a tutorial on self-distilling and training Medusa heads for
    a specific model.

    ---------

    Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

commit efb73fcb598fbb93c6cae7d6667a58b373b0de96
Author: drbh <david.richard.holtz@gmail.com>
Date:   Wed May 22 14:46:29 2024 -0400

    fix: use path inside of speculator config (#1935)

    This PR accesses the path on the speculator similarly to
    `MLPSpeculatorHead.load` and `MedusaHeadV1.load`.

    These changes resolve the following error locally when loading a `MedusaHeadV2`:
    ```
    TypeError: expected str, bytes or os.PathLike object, not dict
    ```
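
    The fix boils down to extracting the path from a config dict before
    handing it to `os.path` helpers (the shapes here are assumptions for
    illustration):

    ```python
    def speculator_path(speculator):
        # A speculator may arrive as a plain path string or as a config dict;
        # calling os.path functions on the dict itself raises the TypeError above.
        if isinstance(speculator, dict):
            return speculator["path"]
        return speculator
    ```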

commit 2f243a1a150da40fc71cbdd08cd07e314cf7098e
Author: Nicolas Patry <patry.nicolas@protonmail.com>
Date:   Wed May 22 16:22:57 2024 +0200

    Creating doc automatically for supported models. (#1929)


commit fc0eaffc81fafcc0fb554692f32efbed1c4b2683
Author: drbh <david.richard.holtz@gmail.com>
Date:   Wed May 22 03:58:26 2024 -0400

    feat: include token in client test like server tests (#1932)

    This PR simply includes the HF token in the client tests, similarly to
    how it is included in the server tests. This helps avoid CI failures due
    to rate limiting.

commit 904ff36917e100047669bd6168d7138045469bbe
Author: Junlin Zhou <jameszhou2108@hotmail.com>
Date:   Wed May 22 01:12:14 2024 +0800

    docs: Fix grafana dashboard url (#1925)

    # What does this PR do?


    Fixes an incorrect url in monitoring doc.


commit 293b8125e7a6ebd3eff65b55699e9386d1c1abf5
Author: fxmarty <9808326+fxmarty@users.noreply.github.com>
Date:   Mon May 20 02:44:48 2024 +0200

    ROCm: make CK FA2 default instead of Triton (#1924)

    As per title.

    Triton autotune overhead is prohibitive, as it needs to be done for each
    different prompt length.

commit f871f114ca5f5a18a2a4a2c7658aed87440d381f
Author: Nicolas Patry <patry.nicolas@protonmail.com>
Date:   Sat May 18 13:31:24 2024 +0200

    Fixing the download strategy for ibm-fms (#1917)


commit 5dad0c0b29cf31271c01948653ac164649a3ac78
Author: fxmarty <9808326+fxmarty@users.noreply.github.com>
Date:   Fri May 17 19:50:52 2024 +0200

    Fix TGI issues with ROCm (#1921)

    Not all models were tested in
    https://github.com/huggingface/text-generation-inference/pull/1764.

    Fixing some more issues (notably starcoder2) here; the full CI will come
    shortly once we split `build.yml` in two.

commit b5f1c9de06ad00bbdeec0348c47f53bee271cedc
Author: fxmarty <9808326+fxmarty@users.noreply.github.com>
Date:   Fri May 17 18:21:51 2024 +0200

    Fix TunableOp bug (#1920)

    cc @Narsil

commit 422bf1f9866e99ef287d6280e8236d22173ee709
Author: fxmarty <9808326+fxmarty@users.noreply.github.com>
Date:   Fri May 17 17:37:23 2024 +0200

    Update grafana template (#1918)

    As per title, there was a mistake.

    Credit to @Narsil.

    Updated
    https://huggingface.co/docs/text-generation-inference/basic_tutorials/monitoring
    as well.

    Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

commit c4cf8b49d1ecce2353935c2497bd8c028cb25320
Author: fxmarty <9808326+fxmarty@users.noreply.github.com>
Date:   Fri May 17 16:34:44 2024 +0200

    Add TGI monitoring guide through Grafana and Prometheus (#1908)

    As per title. It is very useful.

commit 232e8d522713f43834d48ae45d1330b0e6dd367e
Author: fxmarty <9808326+fxmarty@users.noreply.github.com>
Date:   Fri May 17 15:30:47 2024 +0200

    MI300 compatibility (#1764)

    Adds support for AMD Instinct MI300 in TGI.

    Most changes are:
    * Support PyTorch TunableOp to pick the GEMM/GEMV kernels for decoding
    https://github.com/pytorch/pytorch/tree/main/aten/src/ATen/cuda/tunable.
    TunableOp is disabled by default, and can be enabled with
    `PYTORCH_TUNABLEOP_ENABLED=1`.
    * Update ROCm dockerfile to PyTorch 2.3 (actually patched with changes
    from https://github.com/pytorch/pytorch/pull/124362)
    * Support SILU & Linear custom kernels contributed by AMD
    * Update vLLM paged attention to https://github.com/fxmarty/rocm-vllm/,
    branching out of a much more recent commit
    https://github.com/ROCm/vllm/commit/3489ce7936c5de588916ae3047c44c23c0b0c308
    * Support FA2 Triton kernel as recommended by AMD. Can be used by
    specifying `ROCM_USE_FLASH_ATTN_V2_TRITON=1`.
    * Update dockerfile to ROCm 6.1

    By default, TunableOp tuning results are saved in `/data` (e.g.
    `/data/tunableop_meta-llama-Llama-2-70b-chat-hf_tp1_rank0.csv`) so that
    the tuning does not have to be rerun at each `docker run`.
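    For reference, a launch could look like the sketch below. This is an illustrative config fragment only: the image tag (`latest-rocm`) and the exact device/volume flags are assumptions to adapt to your host, not tested values from this PR.

    ```shell
    # Illustrative sketch: enable TunableOp and the Triton FA2 kernel on a ROCm host.
    # Mounting a host directory at /data persists the TunableOp tuning CSV across runs.
    docker run --rm -it \
      --device /dev/kfd --device /dev/dri \
      -v "$PWD/data:/data" \
      -e PYTORCH_TUNABLEOP_ENABLED=1 \
      -e ROCM_USE_FLASH_ATTN_V2_TRITON=1 \
      ghcr.io/huggingface/text-generation-inference:latest-rocm \
      --model-id meta-llama/Llama-2-70b-chat-hf
    ```
    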

    Example:
    ```
    Validator,PT_VERSION,2.3.0
    Validator,ROCM_VERSION,6.1.0.0-82-5fabb4c
    Validator,HIPBLASLT_VERSION,0.7.0-1549b021
    Validator,GCN_ARCH_NAME,gfx942:sramecc+:xnack-
    Validator,ROCBLAS_VERSION,4.1.0-cefa4a9b-dirty
    GemmTunableOp_Half_TN,tn_8192_7_28672,Gemm_Rocblas_45475,0.132098
    GemmTunableOp_Half_TN,tn_10240_4_8192,Gemm_Rocblas_45546,0.0484431
    GemmTunableOp_Half_TN,tn_32000_6_8192,Default,0.149546
    GemmTunableOp_Half_TN,tn_32000_3_8192,Gemm_Rocblas_45520,0.147119
    GemmTunableOp_Half_TN,tn_8192_3_28672,Gemm_Rocblas_45475,0.132645
    GemmTunableOp_Half_TN,tn_10240_3_8192,Gemm_Rocblas_45546,0.0482971
    GemmTunableOp_Half_TN,tn_57344_5_8192,Gemm_Rocblas_45520,0.255694
    GemmTunableOp_Half_TN,tn_10240_7_8192,Gemm_Rocblas_45517,0.0482522
    GemmTunableOp_Half_TN,tn_8192_3_8192,Gemm_Rocblas_45546,0.0444671
    GemmTunableOp_Half_TN,tn_8192_5_8192,Gemm_Rocblas_45546,0.0445834
    GemmTunableOp_Half_TN,tn_57344_7_8192,Gemm_Rocblas_45520,0.25622
    GemmTunableOp_Half_TN,tn_8192_2_28672,Gemm_Rocblas_45475,0.132122
    GemmTunableOp_Half_TN,tn_8192_4_8192,Gemm_Rocblas_45517,0.0453191
    GemmTunableOp_Half_TN,tn_10240_5_8192,Gemm_Rocblas_45517,0.0482514
    GemmTunableOp_Half_TN,tn_8192_5_28672,Gemm_Rocblas_45542,0.133914
    GemmTunableOp_Half_TN,tn_8192_2_8192,Gemm_Rocblas_45517,0.0446516
    GemmTunableOp_Half_TN,tn_8192_1_28672,Gemm_Hipblaslt_TN_10814,0.131953
    GemmTunableOp_Half_TN,tn_10240_2_8192,Gemm_Rocblas_45546,0.0481043
    GemmTunableOp_Half_TN,tn_32000_4_8192,Gemm_Rocblas_45520,0.147497
    GemmTunableOp_Half_TN,tn_8192_6_28672,Gemm_Rocblas_45529,0.134895
    GemmTunableOp_Half_TN,tn_57344_2_8192,Gemm_Rocblas_45520,0.254716
    GemmTunableOp_Half_TN,tn_57344_4_8192,Gemm_Rocblas_45520,0.255731
    GemmTunableOp_Half_TN,tn_10240_6_8192,Gemm_Rocblas_45517,0.0484816
    GemmTunableOp_Half_TN,tn_57344_3_8192,Gemm_Rocblas_45520,0.254701
    GemmTunableOp_Half_TN,tn_8192_4_28672,Gemm_Rocblas_45475,0.132159
    GemmTunableOp_Half_TN,tn_32000_2_8192,Default,0.147524
    GemmTunableOp_Half_TN,tn_32000_5_8192,Default,0.147074
    GemmTunableOp_Half_TN,tn_8192_6_8192,Gemm_Rocblas_45546,0.0454045
    GemmTunableOp_Half_TN,tn_57344_6_8192,Gemm_Rocblas_45520,0.255582
    GemmTunableOp_Half_TN,tn_32000_7_8192,Default,0.146705
    GemmTunableOp_Half_TN,tn_8192_7_8192,Gemm_Rocblas_45546,0.0445489
    ```
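    The results file is plain CSV (`op,shape,selected kernel,time in ms`), so as an illustrative aside (not part of TGI), it can be inspected with standard tools — for instance to list the shapes where tuning kept the default kernel:

    ```shell
    # Write a few lines from the example above to a sample file.
    cat > /tmp/tunableop_sample.csv <<'EOF'
    Validator,PT_VERSION,2.3.0
    GemmTunableOp_Half_TN,tn_8192_7_28672,Gemm_Rocblas_45475,0.132098
    GemmTunableOp_Half_TN,tn_32000_6_8192,Default,0.149546
    EOF

    # Shapes where no tuned kernel beat the default implementation:
    awk -F, '$3 == "Default" { print $2 }' /tmp/tunableop_sample.csv
    # -> tn_32000_6_8192
    ```
    
    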

    ---------

    Co-authored-by: Mohit Sharma <mohit21sharma.ms@gmail.com>

commit a60fa8406abd98d41e2bfafaf6f81f3dd6044b15
Author: Nicolas Patry <patry.nicolas@protonmail.com>
Date:   Fri May 17 11:35:49 2024 +0200

    Removing some unused code. (#1915)

    # What does this PR do?

    <!--
    Congratulations! You've made it this far! You're not quite done yet
    though.

    Once merged, your PR is going to appear in the release notes with the
    title you set, so make sure it's a great title that fully reflects the
    extent of your awesome contribution.

    Then, please replace this with a description of the change and which
    issue is fixed (if applicable). Please also include relevant motivation
    and context. List any dependencies (if any) that are required for this
    change.

    Once you're done, someone will review your PR shortly (see the section
    "Who can review?" below to tag some potential reviewers). They may
    suggest changes to make the code even better. If no one reviewed your PR
    after a week has passed, don't hesitate to post a new comment
    @-mentioning the same persons---sometimes notifications get lost.
    -->

    <!-- Remove if not applicable -->

    Fixes # (issue)

    ## Before submitting
    - [ ] This PR fixes a typo or improves the docs (you can dismiss the
    other checks if that's the case).
    - [ ] Did you read the [contributor
    guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
          Pull Request section?
    - [ ] Was this discussed/approved via a Github issue or the
    [forum](https://discuss.huggingface.co/)? Please add a link
          to it if that's the case.
    - [ ] Did you make sure to update the documentation with your changes?
    Here are the
    [documentation
    guidelines](https://github.com/huggingface/transformers/tree/main/docs),
    and
    [here are tips on formatting
    docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
    - [ ] Did you write any new necessary tests?

    ## Who can review?

    Anyone in the community is free to review the PR once the tests have
    passed. Feel free to tag
    members/contributors who may be interested in your PR.

    <!-- Your PR will be replied to more quickly if you can figure out the
    right person to tag with @

    @OlivierDehaene OR @Narsil

     -->

commit 3b5d93e68d22f5db7950175b5210ce6390df8172
Author: Nicolas Patry <patry.nicolas@protonmail.com>
Date:   Thu May 16 21:40:10 2024 +0200

    Fixing signals. (#1910)

    Taking the signal handles later, so during loads,
    regular signal handling is done, we only need to handle SIGINT and
    SIGTERM during real loads to get more graceful shutdowns when queries
    are in flight.

    Fixes #1842


commit b3dd3902e76df777d28ee76993800f4baf73c40c
Author: Nicolas Patry <patry.nicolas@protonmail.com>
Date:   Thu May 16 17:21:00 2024 +0200

    Types. (#1909)

yuanwu2017 pushed a commit to yuanwu2017/tgi-gaudi that referenced this pull request Jul 17, 2024
yuanwu2017 pushed a commit to yuanwu2017/tgi-gaudi that referenced this pull request Jul 17, 2024
Not all models were tested in
huggingface#1764.

Fixing some more issues (notably starcoder2) here; the full CI will come
shortly, once we split `build.yml` in two.