
MI300 compatibility #1764

Merged
merged 58 commits into from
May 17, 2024

Conversation

@fxmarty (Contributor) commented Apr 18, 2024

Adds support for AMD Instinct MI300 in TGI.

The main changes are:

By default, TunableOp tuning results are saved in /data (e.g. /data/tunableop_meta-llama-Llama-2-70b-chat-hf_tp1_rank0.csv) to avoid rerunning the tuning on each docker run.

Example:

Validator,PT_VERSION,2.3.0
Validator,ROCM_VERSION,6.1.0.0-82-5fabb4c
Validator,HIPBLASLT_VERSION,0.7.0-1549b021
Validator,GCN_ARCH_NAME,gfx942:sramecc+:xnack-
Validator,ROCBLAS_VERSION,4.1.0-cefa4a9b-dirty
GemmTunableOp_Half_TN,tn_8192_7_28672,Gemm_Rocblas_45475,0.132098
GemmTunableOp_Half_TN,tn_10240_4_8192,Gemm_Rocblas_45546,0.0484431
GemmTunableOp_Half_TN,tn_32000_6_8192,Default,0.149546
GemmTunableOp_Half_TN,tn_32000_3_8192,Gemm_Rocblas_45520,0.147119
GemmTunableOp_Half_TN,tn_8192_3_28672,Gemm_Rocblas_45475,0.132645
GemmTunableOp_Half_TN,tn_10240_3_8192,Gemm_Rocblas_45546,0.0482971
GemmTunableOp_Half_TN,tn_57344_5_8192,Gemm_Rocblas_45520,0.255694
GemmTunableOp_Half_TN,tn_10240_7_8192,Gemm_Rocblas_45517,0.0482522
GemmTunableOp_Half_TN,tn_8192_3_8192,Gemm_Rocblas_45546,0.0444671
GemmTunableOp_Half_TN,tn_8192_5_8192,Gemm_Rocblas_45546,0.0445834
GemmTunableOp_Half_TN,tn_57344_7_8192,Gemm_Rocblas_45520,0.25622
GemmTunableOp_Half_TN,tn_8192_2_28672,Gemm_Rocblas_45475,0.132122
GemmTunableOp_Half_TN,tn_8192_4_8192,Gemm_Rocblas_45517,0.0453191
GemmTunableOp_Half_TN,tn_10240_5_8192,Gemm_Rocblas_45517,0.0482514
GemmTunableOp_Half_TN,tn_8192_5_28672,Gemm_Rocblas_45542,0.133914
GemmTunableOp_Half_TN,tn_8192_2_8192,Gemm_Rocblas_45517,0.0446516
GemmTunableOp_Half_TN,tn_8192_1_28672,Gemm_Hipblaslt_TN_10814,0.131953
GemmTunableOp_Half_TN,tn_10240_2_8192,Gemm_Rocblas_45546,0.0481043
GemmTunableOp_Half_TN,tn_32000_4_8192,Gemm_Rocblas_45520,0.147497
GemmTunableOp_Half_TN,tn_8192_6_28672,Gemm_Rocblas_45529,0.134895
GemmTunableOp_Half_TN,tn_57344_2_8192,Gemm_Rocblas_45520,0.254716
GemmTunableOp_Half_TN,tn_57344_4_8192,Gemm_Rocblas_45520,0.255731
GemmTunableOp_Half_TN,tn_10240_6_8192,Gemm_Rocblas_45517,0.0484816
GemmTunableOp_Half_TN,tn_57344_3_8192,Gemm_Rocblas_45520,0.254701
GemmTunableOp_Half_TN,tn_8192_4_28672,Gemm_Rocblas_45475,0.132159
GemmTunableOp_Half_TN,tn_32000_2_8192,Default,0.147524
GemmTunableOp_Half_TN,tn_32000_5_8192,Default,0.147074
GemmTunableOp_Half_TN,tn_8192_6_8192,Gemm_Rocblas_45546,0.0454045
GemmTunableOp_Half_TN,tn_57344_6_8192,Gemm_Rocblas_45520,0.255582
GemmTunableOp_Half_TN,tn_32000_7_8192,Default,0.146705
GemmTunableOp_Half_TN,tn_8192_7_8192,Gemm_Rocblas_45546,0.0445489
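The tuning file above has two kinds of rows: `Validator` rows that pin the environment (PyTorch, ROCm, hipBLASLt, rocBLAS versions and GPU architecture), and `GemmTunableOp` rows that map a GEMM shape to the selected kernel and its measured time. A minimal sketch of how such a file could be inspected — the parsing code here is illustrative and not part of TGI:

```python
import csv
from io import StringIO

# A short excerpt in the same format as the tuning file above.
SAMPLE = """Validator,PT_VERSION,2.3.0
GemmTunableOp_Half_TN,tn_8192_7_28672,Gemm_Rocblas_45475,0.132098
GemmTunableOp_Half_TN,tn_32000_6_8192,Default,0.149546
"""

def parse_tunableop(text):
    """Split TunableOp CSV rows into validator metadata and per-shape results."""
    validators, results = {}, []
    for row in csv.reader(StringIO(text)):
        if not row:
            continue
        if row[0] == "Validator":
            validators[row[1]] = row[2]
        else:
            op, shape, solution, time_ms = row
            results.append((op, shape, solution, float(time_ms)))
    return validators, results

validators, results = parse_tunableop(SAMPLE)
print(validators["PT_VERSION"])  # 2.3.0
print(len(results))              # 2
```

Mounting a host directory on /data when launching the container is what lets these results persist across runs.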

Dockerfile_amd (review thread, resolved)
launcher/src/main.rs (review thread, resolved)
@fxmarty (Contributor, Author) commented May 16, 2024

@Narsil This PR is ready. Could you take a look?

We are just waiting for a patched/updated rocm/dev-ubuntu-22.04 base image that fixes an issue with libamdhip64.so on certain VMs, which would avoid:

ARG GITHUB_TOKEN
RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends wget && \
rm -rf /var/lib/apt/lists/* && \
wget --header "Authorization: token ${GITHUB_TOKEN}" https://raw.githubusercontent.com/fxmarty/patched_hipruntime/main/libamdhip64.so.6
ENV LD_PRELOAD="/libamdhip64.so.6"

We are expecting to get the updated docker image by Monday next week. Do you think a TGI release next Tuesday/Wednesday with this PR included is feasible?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Narsil (Collaborator) commented May 16, 2024

We are expecting to get the updated docker image by Monday next week. Do you think a TGI release on next week Tuesday/Wednesday with this PR in is feasible?

Sure, releases are kind of trivial now.

Dockerfile_amd Outdated
Comment on lines 220 to 224
# COPY ./tgi-entrypoint.sh /tgi-entrypoint.sh
# RUN chmod +x /tgi-entrypoint.sh

# ENTRYPOINT ["/tgi-entrypoint.sh"]
# CMD ["--json-output"]
@fxmarty (Contributor, Author): To clean

@Narsil (Collaborator) left a comment

Very nice.

Lots of nits and code structure suggestions, but overall everything looks good.

server/text_generation_server/models/flash_causal_lm.py (4 review threads, resolved)
@@ -0,0 +1,816 @@
#!/usr/bin/env python
@Narsil (Collaborator): Shouldn't we put this in layers/flash_attn/triton.py maybe? (and flash_attn.py -> flash_attn/__init__.py for simplicity?)

@fxmarty (Contributor, Author): Could we do that in another PR? Then e.g. paged_attention.py should be moved as well.

server/text_generation_server/utils/flash_attn.py (review thread, resolved)
server/text_generation_server/models/t5.py (review thread, resolved)
bias = None
return cls(weight, bias)

def forward(self, inp: torch.Tensor) -> torch.Tensor:
@Narsil (Collaborator): s/inp/input/

@fxmarty (Contributor, Author):
out = torch.empty(
inp.shape[0], weight.shape[0], dtype=inp.dtype, device="cuda"
)
if (k == 8192 and (m == 1280 or m == 7168)) or (k == 3584 and m == 8192):
@Narsil (Collaborator): This feels way overspecified, no? Is it really only implemented for these shapes?

The second condition looks way more general.

@fxmarty (Contributor, Author): cc @gshtras @charlifu have you tested different rows_per_block? Was it specifically tuned for these shapes?
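For reference, the condition being discussed acts as a shape-based dispatch gate: the custom kernel is taken only for GEMM shapes it was tuned on, and everything else falls through to the default path. A standalone restatement of that gate (the function name is illustrative, not from the codebase):

```python
def use_custom_gemm(m: int, k: int) -> bool:
    """Shape gate from the snippet above: dispatch to the tuned kernel
    only for the specific (m, k) GEMM shapes it was benchmarked on."""
    return (k == 8192 and m in (1280, 7168)) or (k == 3584 and m == 8192)

print(use_custom_gemm(1280, 8192))  # True
print(use_custom_gemm(4096, 8192))  # False
```

Whether the first clause is genuinely shape-specific or just under-tuned is exactly the open question in this thread.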

@fxmarty (Contributor, Author) commented May 17, 2024

Now that we have the updated rocm/dev-ubuntu-22.04:6.1.1_hip_update image, this PR can be merged once the build is done and tests are passing.

@Narsil Narsil merged commit 232e8d5 into main May 17, 2024
9 of 10 checks passed
@Narsil Narsil deleted the mi300-compat branch May 17, 2024 13:30
fxmarty added a commit that referenced this pull request May 17, 2024
Not all models were tested in
#1764.

Fixing some more issues (notably starcoder2) here, the full CI will come
shortly once we split `build.yml` in two
alfredgui2 pushed a commit to mlsys-io/kv.run that referenced this pull request Jul 6, 2024
Not all models were tested in
huggingface/text-generation-inference#1764.

Fixing some more issues (notably starcoder2) here, the full CI will come
shortly once we split `build.yml` in two
tjluyao added a commit to mlsys-io/kv.run that referenced this pull request Jul 7, 2024
commit 6adf97815ef6828e0aa06f2a4635370b4ad7476e
Author: Alfred Gui <alfredzqgui@gmail.com>
Date:   Sat Jul 6 13:18:16 2024 -0400

    Fix the decoding logic in test_local_grpc.py (#44)

    * fix the test_local_grpc script

    * lint fix

commit f355733482f4ebc15916df151ad00ad9d64d451d
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Sat Jul 6 07:50:55 2024 -0700

    bug fixes

commit 466b0a65429d339a1c004c5991749e6f9cb1230b
Author: Alfred Gui <alfredzqgui@gmail.com>
Date:   Mon Jul 1 22:48:56 2024 -0400

    Add the batch concatenation functionality for flashinfer server (#43)

    * refactor flashinfer causal lm

    * modify test_local_api

    * fixes

    * fixes

    * lint

commit b9838c5c4720ff09f946e7fce8dd328aab57dc16
Author: NovTi <yx2432@nyu.edu>
Date:   Tue Jul 2 00:07:24 2024 +0800

    Add ChatGLM and refactor Qwen2

commit 9fafffcfacb8ded0d0d5aefac2cf38ae3a44876f
Author: PeterYaoNYU <yy4108@nyu.edu>
Date:   Mon Jul 1 10:30:21 2024 +0800

    update mistral flashinfer

commit d099bbbbeeaf638220696b5c9f94cf9634f8c221
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Sun Jun 30 18:39:44 2024 -0700

    update submodules

commit 4edacd568d064cb834597d8cf2f24bf1bef20683
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Sun Jun 30 18:29:34 2024 -0700

    update submodules

commit 9da076dc488140273ab17773ae642e8ac3edb119
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Sun Jun 30 18:17:41 2024 -0700

    minor fix in makefile

commit fa213e263fd86ec41d033cb8d46dea07076720bd
Author: MichaelYuan2 <hy2203@nyu.edu>
Date:   Tue Jun 25 10:41:09 2024 +0800

    update FlashinferAttentionWrapper to flashinfer 0.0.6

commit 8d3dd4898a26f89d82233640123aad90e2477bb6
Author: Alfred Gui <alfredzqgui@gmail.com>
Date:   Mon Jun 24 11:25:08 2024 -0400

    Fix the server CLI issue with use_flashinfer flag (#42)

    * fix refactor

    * empty

    * fix lint

commit 23118727bdf000d87115df9ac6a6ccf3aee7a2ef
Author: Alfred Gui <alfredzqgui@gmail.com>
Date:   Sat Jun 22 17:22:51 2024 -0400

    decouple flashinfer files from flash attention (#41)

commit 9b3c09850ddfdd8141601ee9b1b027e4aa2d4b83
Merge: 4a40c64 f0d3664
Author: Alfred Gui <alfredzqgui@gmail.com>
Date:   Thu Jun 20 11:13:14 2024 -0400

    Merge pull request #40 from mlsys-io/add_baichuan

    Adjust the flashinfer llama model to accommodate the baichuan model

commit f0d3664f34acae5020f045fabca15aa310ce60ec
Author: Alfred Gui <alfredzqgui@gmail.com>
Date:   Thu Jun 20 10:46:12 2024 -0400

    adjust the flashinfer llama model to accommodate baichuan

commit 4a40c6415cd7f1d29bab6de9907ca8ac66833863
Merge: 0ba0ac9 6aaab88
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Mon Jun 17 10:15:42 2024 -0700

    Merge branch 'master' of github.com:mlsys-io/kv.run

commit 0ba0ac9dd8825cef92cd7b92fef49ab0efcb8fbd
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Mon Jun 17 10:01:44 2024 -0700

    minor fix in output example

commit 6aaab883fb154b960de9ad501de74ad90f447725
Merge: 7a93d84 08fde0f
Author: Alfred Gui <alfredzqgui@gmail.com>
Date:   Mon Jun 17 12:46:13 2024 -0400

    Merge pull request #38 from mlsys-io/flash_attn_rotary

    Use Flash attention for rotary embedding and layer normalization for Phi2 and Phi3

commit 08fde0f9ab74fd54fe59bbca5020448a862c1188
Author: Alfred Gui <alfredzqgui@gmail.com>
Date:   Mon Jun 17 12:43:19 2024 -0400

    revert test file

commit c51e36e3a3bf60f5e23f3a1fee5fe6b116fcc362
Author: Alfred Gui <alfredzqgui@gmail.com>
Date:   Mon Jun 17 12:42:16 2024 -0400

    fix lint

commit 7dfa57d5ca29e366c1d7c6de01ef6e81840fd7d5
Author: Alfred Gui <alfredzqgui@gmail.com>
Date:   Mon Jun 17 12:40:40 2024 -0400

    empty

commit b45e8968e75976ac506dc25b467e41520c457d48
Author: Alfred Gui <alfredzqgui@gmail.com>
Date:   Mon Jun 17 14:17:55 2024 +0000

    fix phi2 and phi3 modeling

commit 31ad6bd942293ce18addf79944da5d742518f900
Merge: 1e2bf10 7a93d84
Author: Alfred Gui <alfredzqgui@gmail.com>
Date:   Mon Jun 17 08:55:24 2024 -0400

    merge master

commit 1e2bf1026420e298cb7fbed4d73166baefcbf615
Author: Alfred Gui <alfredzqgui@gmail.com>
Date:   Mon Jun 17 06:43:51 2024 -0400

    fix the flashinfer adapter

commit da84f6bcce038029714f48916510964d5b00d757
Author: Alfred Gui <alfredzqgui@gmail.com>
Date:   Mon Jun 17 00:51:55 2024 +0000

    fixes

commit e0feabb012e8d82d6265dc85811a47fff44c1c65
Author: Alfred Gui <alfredzqgui@gmail.com>
Date:   Sun Jun 16 20:20:59 2024 -0400

    fix rotary bug

commit 7a93d8413fbfb62e8ae6646a12aaed55b36afaa1
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Sat Jun 15 22:50:16 2024 -0700

    update to rust 1.79

commit 6c4fa6effac801c7c4a30479eca30a7c5ecb057d
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Sat Jun 15 22:15:41 2024 -0700

    minor fixes

commit ad40a1752d5964554814261754e63a1122829ce9
Author: Alfred Gui <alfredzqgui@gmail.com>
Date:   Sun Jun 16 01:49:28 2024 +0000

    flash attn rotary

commit 868d3f2fa74a07178806eadc79a2f23f59bafa77
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Fri Jun 14 22:57:09 2024 -0700

    minor router-server fix

commit b8a47854a60d21347d6e4f66a507d1a4d2580c30
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Fri Jun 14 16:43:32 2024 -0700

    finalize docker build workflow

commit fa2f2f2c8d5249e151cb51c35ef2952cf937b98c
Merge: 93edec5 85f34cb
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Fri Jun 14 14:16:29 2024 -0700

    Merge branch 'master' of github.com:mlsys-io/kv.run

commit 93edec51ef1714c95b56699ad0b284f6c0b7a916
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Fri Jun 14 14:16:18 2024 -0700

    dependency and rust toolchain fix

commit 85f34cb1147265e3d13080d032c92b7d81d09895
Merge: de58365 e263ba8
Author: Alfred Gui <alfredzqgui@gmail.com>
Date:   Fri Jun 14 15:16:44 2024 -0400

    Merge pull request #36 from mlsys-io/fix_warm

    Fix the warm-up issue

commit de5836558a56c3541ec9be3b1d41dde51d08969a
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Fri Jun 14 12:06:42 2024 -0700

    fix in workflow

commit 83fc271da0ef6c0580d5d8491605b582c2d730cc
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Fri Jun 14 11:32:29 2024 -0700

    build workflow update

commit 66d272347539741c6750841938123b5522abb144
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Fri Jun 14 09:10:00 2024 -0700

    docker workflow

commit e8f9ff4f2be08421219acc6d2b611e2c4ba87768
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Fri Jun 14 09:08:55 2024 -0700

    docker workflow

commit e49f754e1fb33af4b9bf33bcc08a6d23d4cacb56
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Fri Jun 14 00:04:32 2024 -0700

    remove tgi build workflow

commit a4802b7867e766e492cb1f99877f386148962c3a
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Fri Jun 14 00:01:15 2024 -0700

    docker build workflow; remove submodules (#35)

    * test docker

    * docker

    * remove submodule

    * updates

commit e263ba802023d45ee5b26df0d90f8401ee0f87aa
Author: Alfred Gui <alfredzqgui@gmail.com>
Date:   Thu Jun 13 20:32:48 2024 -0400

    fix warm up issue

commit c7613eb887ac10ba8d38b00ab26b85ff395ecdc6
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Thu Jun 13 17:01:27 2024 -0700

    test docker (#34)

commit e61ea779f8dffacab0a161aa13135999d6ec3ee7
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Thu Jun 13 09:47:33 2024 -0700

    minor fixes and rename tests.xml

commit 8ae802cb8848df58fc9c0c279044f5b50309044e
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Tue Jun 11 14:09:50 2024 -0700

    fix dtype bugs in flashinfer model def

commit b821d68f4120951bbde7f57ca0ad9ba914d33354
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Tue Jun 11 11:30:51 2024 -0700

    bug fix in layers/__init__.py

commit b7c8735c77cb76446ba30efbb20f19067289fcab
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Tue Jun 11 10:33:50 2024 -0700

    minor typo

commit 6010fad087f477174766981acc162322e1d767da
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Tue Jun 11 10:30:45 2024 -0700

    critical output bug (#25)

    * output debug

    * update minor

commit b599cc65ecb8215cfcc8a9db6daa0d88450b9cc5
Author: Alfred Gui <zgui@flexport.com>
Date:   Tue Jun 11 10:34:24 2024 -0400

    Decouple flashinfer code paths from flash attention library dependencies (#33)

    * decouple flash attn dependency from flashinfer code paths

    * follow up

commit e0cd4a67f7cffdc620baa5d1ae22a32a3be94d4e
Author: Alfred Gui <zgui@flexport.com>
Date:   Tue Jun 11 09:47:06 2024 -0400

    reformat the llama files (#32)

commit 6c96fddcbbe4c16f97fe391ef3387702234f4f65
Author: Alfred Gui <zgui@flexport.com>
Date:   Mon Jun 10 21:02:42 2024 -0400

    Llama rewrite (#31)

    * write llama in tgi style

    * fixes

    * fix the runtime issues

commit 9dd3b75af84cb0d3411bd43fc0414e4592193037
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Mon Jun 10 17:10:20 2024 -0700

    Kv.run test workflows (#30)

    * python 3.10

    * python 3.10.14

    * update doc

    * dispatch

    * update python workflow (repeated 16 times)

commit 9ec483dae3eb34f594511b649370af354d5d0923
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Mon Jun 10 15:15:35 2024 -0700

    kv.run test workflows (#29)

    * python 3.10

    * python 3.10.14

    * update doc

    * dispatch

commit 4757af8b6bb5b5548e17c5aeee767f5650607aed
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Mon Jun 10 14:53:52 2024 -0700

    kv.run test workflow

commit d58a35ed4694a18b1d3028b79cab9b3227ccdafc
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Mon Jun 10 11:41:13 2024 -0700

    Compliant for pre-commit configs

commit a8144374aa50e85016c19fa6f4a45c7f7c724d46
Author: Alfred Gui <zgui@flexport.com>
Date:   Mon Jun 10 06:45:29 2024 -0400

    Introduce the flashinfer attention wrapper abstraction and use it for Llama and Gemma models (#28)

    * abstract the attention layer

    * fix the bugs

commit 3956e467fd043e8218462e475d71892784ad5907
Author: Alfred Gui <zgui@flexport.com>
Date:   Sun Jun 9 06:36:01 2024 -0400

    Refactor the Flashinfer models (#27)

    * refactor the flashinfer models

    * fixes

commit 7dda533b23d548bff8c569370daff203699a6e60
Author: Alfred Gui <zgui@flexport.com>
Date:   Sat Jun 8 08:40:55 2024 -0400

    Support Flashinfer based Phi2 and Phi3 models (#26)

    * add phi model

    * fix phi integration errors

    * padding for phi

    * fix modeling for phi

    * workarounds for phi

    * use flash attn's position rotary embedding

    * support phi3 and baichuan

    * fix position encoding

    * clean up

commit 482ef988e2c2ef59743aeaff01d79b72e0546baa
Author: NovTi <yx2432@nyu.edu>
Date:   Wed Jun 5 22:04:14 2024 +0800

    Add qwen2 1.8b and 72b base inference

commit 5935ccedd980669c1366d70f20b5c3739184815f
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Tue Jun 4 21:30:52 2024 -0700

    add lora functions to python client; test llama-3-70b AWQ

commit 48b505376376f01e36b69bb0026f9a6af7e95676
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Tue Jun 4 13:28:18 2024 -0700

    testing llama-3-70b-gptq

commit 80d4a605347f60c6d12958a577182b27ec413def
Author: NovTi <yx2432@nyu.edu>
Date:   Tue Jun 4 22:03:11 2024 +0800

    Fix minor typos

commit e6af233933f9709e7da606409151c0802520f6ef
Author: NovTi <yx2432@nyu.edu>
Date:   Mon Jun 3 22:33:17 2024 +0800

    Integrate qwen2

commit 72d74cf82d1976457881318ae035b956fde3f220
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Sun Jun 2 20:42:44 2024 -0700

    Update Makefile to include punica kernels

commit e7fb9b9dc6651aeb68e9e793d0d25381a14e12b5
Author: PeterYaoNYU <yy4108@nyu.edu>
Date:   Mon Jun 3 10:51:16 2024 +0800

    integrate lora into mistral

commit 47f4685004ac7db295c46ec9a69f62a783fe07a6
Author: Alfred Gui <zgui@flexport.com>
Date:   Sun Jun 2 08:34:24 2024 -0400

    add placeholder for flashinfer phi modeling (#24)

commit 40a70bcc369c6b61f486dc273ab0fd4330e21d58
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Sat Jun 1 22:06:30 2024 -0700

    Update README.md

commit f125e73ade681ac4e60cd48488a59f2bab162f97
Merge: 79402fb 7243638
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Sat Jun 1 21:22:58 2024 -0700

    Merge pull request #23 from mlsys-io/reorder-codebase

    Reorder code base

commit 72436388e230d6778a6303fd656befa19632dbba
Author: rainj-me <rain-jiang@outlook.com>
Date:   Sat Jun 1 19:10:39 2024 -0700

    fix the lora-id parameter in the benchmark

commit 650c743e1572b35c0c304edcba8afb3b8865935d
Merge: 79402fb 799a193
Author: rainj-me <rain-jiang@outlook.com>
Date:   Sat Jun 1 18:58:38 2024 -0700

    directly merge from tgi

commit 799a193b109662743bed1b18a09af1fdcd508c8b
Author: Nicolas Patry <patry.nicolas@protonmail.com>
Date:   Sat Jun 1 08:47:00 2024 +0000

    Fixing Phi3.

commit 79402fb10d115a1ebe19ad97dd1482bd03479c80
Author: Yao Lu <fdyaolu@gmail.com>
Date:   Fri May 31 16:02:53 2024 -0700

    Rest API to download lora adapter on router

commit 08b3eac2ce54e25bec12088fd7e69ee3c07adaf5
Author: Nicholas Broad <nbroad94@gmail.com>
Date:   Fri May 31 09:42:14 2024 -0700

    single char ` addition for docs (#1989)

    # What does this PR do?

    I think this will fix the docs from being weirdly formatted. All the
    sections after MAX_TOP_N_TOKENS don't show up in the bar on the right
    (https://huggingface.co/docs/text-generation-inference/basic_tutorials/launcher#maxtopntokens)

    ## Before submitting
    - [x] This PR fixes a typo or improves the docs (you can dismiss the
    other checks if that's the case).
    - [ ] Did you read the [contributor
    guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
          Pull Request section?
    - [ ] Was this discussed/approved via a Github issue or the
    [forum](https://discuss.huggingface.co/)? Please add a link
          to it if that's the case.
    - [ ] Did you make sure to update the documentation with your changes?
    Here are the
    [documentation
    guidelines](https://github.com/huggingface/transformers/tree/main/docs),
    and
    [here are tips on formatting
    docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
    - [ ] Did you write any new necessary tests?

    ## Who can review?

    @merveenoyan

    ---------

    Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

commit 5ab4cef67ef6326429a0e4e3d44b9710d9f26c53
Author: Nicolas Patry <patry.nicolas@protonmail.com>
Date:   Fri May 31 18:01:43 2024 +0200

    Fixing exl2 scratch buffer. (#1990)


commit 06edde94910594eef86988934cbbc43d775eb965
Author: Nicolas Patry <patry.nicolas@protonmail.com>
Date:   Fri May 31 17:57:01 2024 +0200

    Purely refactors paged/attention into `layers/attention` and make hardware differences more obvious with 1 file per hardware. (#1986)


commit 659bd67fec0a874e325fc2a2afd0c2ed2af692f0
Author: fxmarty <9808326+fxmarty@users.noreply.github.com>
Date:   Fri May 31 07:03:24 2024 -0700

    Update documentation version to 2.0.4 (#1980)

    As per title

    cc @Narsil

commit 967ced2ff4565a5358d45a1372d32fbab113700b
Author: Daniël de Kok <me@danieldk.eu>
Date:   Thu May 30 07:10:10 2024 +0000

    Gemma GPTQ checks: skip logprob checks

    This test fails somewhat regularly due to non-determinism and this
    test is primarily to verify that we are loading a model which doesn't
    have `float16` as the default dtype correctly.

commit 36dd16017c7211b7760d1daa188172bb902e486f
Author: Daniël de Kok <me@danieldk.eu>
Date:   Tue May 28 09:51:31 2024 +0000

    Add support for exl2 quantization

    Mostly straightforward, changes to existing code:

    * Wrap quantizer parameters in a small wrapper to avoid passing
      around untyped tuples and needing to repack them as a dict.
    * Move scratch space computation to warmup, because we need the
      maximum input sequence length to avoid allocating huge
      scratch buffers that OOM.

commit cbced7f0f9ca0b62216223859b82a2632d1c7a1f
Author: drbh <david.richard.holtz@gmail.com>
Date:   Wed May 29 12:42:11 2024 -0400

    feat: adjust attn weight loading logic (#1975)

    This PR updates `load_attention` to prefer loading specific attention
    based on the model type. Additionally there were two cases where
    `TensorParallelColumnLinear.load_multi` was called and this reduces it
    to a single path

commit 612bc483b6f5029918039e684982fc1bfbe1b502
Author: Nicolas Patry <patry.nicolas@protonmail.com>
Date:   Tue May 28 16:55:36 2024 +0200

    Fixing the text part from tokenizer endpoint. (#1967)


commit f20463e4e3a994fbcbc836cd315c14b766c72205
Author: Daniël de Kok <me@danieldk.eu>
Date:   Tue May 28 07:25:14 2024 +0000

    Fix (non-container) pytest stdout buffering-related lock-up

    Two issues:

    1. When one of the stdout/stderr pipe buffers of a process started
       with `subprocess.Popen` is full, the process can get blocked until
       the buffer is drained.
    2. Calling `Popen.wait` can deadlock when called before draining
       the pipe buffers (if they are full).

    This avoids the issue altogether by giving the child process a
    temporary file to write to.
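
    A minimal sketch of the workaround (the spawned command is illustrative):
    redirect the child's output to a temporary file so the pipe buffer can
    never fill, then call `wait` safely.

    ```python
    import subprocess
    import sys
    import tempfile

    # Write child output to a file instead of a pipe: a full pipe buffer can
    # block the child (and deadlock Popen.wait); a file cannot fill up that way.
    with tempfile.TemporaryFile("w+") as out:
        proc = subprocess.Popen(
            # Emits ~1 MB, far beyond typical 64 KiB pipe buffers.
            [sys.executable, "-c", "print('x' * 1_000_000)"],
            stdout=out,
            stderr=subprocess.STDOUT,
        )
        proc.wait()  # safe: the child can never block on a full pipe
        out.seek(0)
        output = out.read()
    ```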

commit e76b9824ae965e95923dbcf50aa30efb633a1974
Author: Nicolas Patry <patry.nicolas@protonmail.com>
Date:   Tue May 28 14:52:17 2024 +0200

    Upgrade to Axum 0.7 and Hyper 1.0 (Breaking change: disabled ngrok tunneling). (#1959)

    - Axum upgraded to Hyper 1.0 and most of the ecosystem has switched, so
    it's our turn now.
    - [ngrok-rust](https://github.com/ngrok/ngrok-rust/pull/137/files)
    hasn't yet, and hasn't for several months now, so let's disable the
    feature for the time being.


commit b7ffa287f228e065c45a99684e73b862a5166fac
Author: Moritz Laurer <41862082+MoritzLaurer@users.noreply.github.com>
Date:   Mon May 27 17:31:06 2024 +0200

    fix small typo and broken link (#1958)

    # What does this PR do?

    Fix a typo; fix a broken link; add one sentence in the guidance docs to
    make the word "grammar" less abstract


    ## Who can review?

    @drbh

commit 0732b9d2f0fb9a4dd9753bdabe3ddb7d452c49cf
Author: drbh <david.richard.holtz@gmail.com>
Date:   Mon May 27 10:03:16 2024 -0400

    Processor config chat template (#1954)

    This PR loads the `processor_config` similarly to the `tokenizer_config`
    and uses the processor config's chat template when the tokenizer config
    does not include one. These changes enable chat with idefics2.
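
    The fallback can be sketched as follows (the config shapes and function
    name are assumptions for illustration, not the actual TGI code):

    ```python
    def resolve_chat_template(tokenizer_config, processor_config):
        # Prefer the tokenizer config's template; fall back to the processor's.
        return tokenizer_config.get("chat_template") or processor_config.get("chat_template")

    # idefics2-style case: the template only lives in the processor config.
    template = resolve_chat_template({}, {"chat_template": "{{ messages }}"})
    ```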

commit a401c83c355d3b66ad158f4798b58bb5c696caac
Author: Daniël de Kok <me@danieldk.eu>
Date:   Mon May 27 14:41:28 2024 +0200

    Fix GPTQ for models which do not have float16 at the default dtype (simpler) (#1953)

    # What does this PR do?

    Fix GPTQ for models which do not have float16 at the default dtype

    Before this change, GPTQ models would not work if the model's default
    data type is not `float16`. For example, Gemma GPTQ models would fail
    because the default dtype of Gemma is `bfloat16`. There are two issues:

    1. If the default `dtype` is not `float16`, the quantizer's `float16`
       parameters get converted to that dtype, and the kernels cannot deal
       with non-`float16` types.
    2. The same applies to the inputs of quantized ops.

    This is resolved by setting the dtype of GPTQ/AWQ-quantized models to
    `float16`.

    Simpler version of #1951.
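
    The resolution can be sketched as a simple dtype override (the function
    and string names are illustrative assumptions, not the actual TGI code):

    ```python
    def resolve_dtype(quantize, requested):
        # GPTQ/AWQ kernels only support float16, so override the model's
        # default dtype (e.g. Gemma's bfloat16) for quantized checkpoints.
        if quantize in ("gptq", "awq"):
            return "float16"
        return requested or "float16"
    ```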



commit 9231098f3a9b2f0fe7f6652f10f02f4d8f551143
Author: Daniël de Kok <me@danieldk.eu>
Date:   Fri May 24 15:34:42 2024 +0000

    Fix (flash) Gemma prefix and enable tests

commit d32e33bd489f2419e579f5d423073791ee19f789
Author: Nicolas Patry <patry.nicolas@protonmail.com>
Date:   Fri May 24 15:36:13 2024 +0200

    Fix seeded output. (#1949)


commit cff472ba2b9147015ffd005aace282481d489695
Author: Nicolas Patry <patry.nicolas@protonmail.com>
Date:   Fri May 24 12:40:39 2024 +0200

    Fixing codellama loads by using purely `AutoTokenizer`. (#1947)

    - The need for the slow tokenizer default stems from back
      when Llama 1 was introduced and not all the flags were
      supported in `tokenizers`.

    - Fixes #1891


commit 954653466d24a9b3435988136983398bdf788a2f
Author: Nicolas Patry <patry.nicolas@protonmail.com>
Date:   Thu May 23 15:40:40 2024 +0200

    Improving the logging system. (#1938)

    - Added a debug log for speculated ids (helps assess the quality of a
      speculator from the logs).
    - Remove newlines from child process logs when re-emitting in non-JSON
      mode.
    - Made the default level closer to what's expected (only our binaries'
      level).
    - Propagate that level correctly to the shard (it was forced to INFO).


commit 629047cb82d2ff97a8f0d0446ed7a3a68bed63a7
Author: Thomas Schillaci <thomas.schillaci@gmail.com>
Date:   Thu May 23 15:37:09 2024 +0200

    Add completion route to client and add stop parameter where it's missing (#1869)

    # What does this PR do?

    - Add the stop parameter to the completion route
    - Add the completion method to the python client
    - Add the stop parameter to the python client's chat method
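
    On the wire, the added parameter follows the OpenAI-style completion
    schema; a rough sketch of the request payload (the helper name and
    default values here are illustrative assumptions, not the client's API):

    ```python
    def build_completion_request(model, prompt, stop=None, max_tokens=32):
        # Minimal payload for the completion route; `stop` is optional and
        # omitted entirely when not set.
        payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
        if stop:
            payload["stop"] = stop
        return payload

    payload = build_completion_request("tgi", "Once upon a time", stop=["\n\n"])
    ```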

    ## Who can review?

    @Narsil

    ---------

    Co-authored-by: Thomas SCHILLACI <tschilla@px101.prod.exalead.com>
    Co-authored-by: Thomas Schillaci <thomas.schillaci@3ds.com>

commit f4a073ae6d2cbcf6ee353b4e27ea90586893fe8b
Author: Nicolas Patry <patry.nicolas@protonmail.com>
Date:   Thu May 23 14:39:38 2024 +0200

    Fixing some legacy behavior (big swapout of serverless on legacy stuff). (#1937)


    ---------

    Co-authored-by: Daniël de Kok <me@github.danieldk.eu>

commit f41d644a903d179915e122896aba6bc77821795a
Author: Wang, Yi <yi.a.wang@intel.com>
Date:   Thu May 23 20:11:08 2024 +0800

    reenable xpu for tgi (#1939)

    Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

commit a103e3e9e2041add8bd83a8b5b35c497784b9722
Author: drbh <david.richard.holtz@gmail.com>
Date:   Thu May 23 05:34:18 2024 -0400

    feat: add train medusa head tutorial (#1934)

    This PR adds a tutorial on self-distilling and training Medusa heads for
    a specific model.

    ---------

    Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

commit efb73fcb598fbb93c6cae7d6667a58b373b0de96
Author: drbh <david.richard.holtz@gmail.com>
Date:   Wed May 22 14:46:29 2024 -0400

    fix: use path inside of speculator config (#1935)

    This PR accesses the path on the speculator similarly to
    `MLPSpeculatorHead.load` and `MedusaHeadV1.load`.

    These changes resolve the following error locally when loading a `MedusaHeadV2`:
    ```
    TypeError: expected str, bytes or os.PathLike object, not dict
    ```
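
    The fix boils down to extracting the path from a config dict before
    handing it to `os.path` helpers (the shapes here are assumptions for
    illustration):

    ```python
    def speculator_path(speculator):
        # A speculator may arrive as a plain path string or as a config dict;
        # calling os.path functions on the dict itself raises the TypeError above.
        if isinstance(speculator, dict):
            return speculator["path"]
        return speculator
    ```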

commit 2f243a1a150da40fc71cbdd08cd07e314cf7098e
Author: Nicolas Patry <patry.nicolas@protonmail.com>
Date:   Wed May 22 16:22:57 2024 +0200

    Creating doc automatically for supported models. (#1929)


commit fc0eaffc81fafcc0fb554692f32efbed1c4b2683
Author: drbh <david.richard.holtz@gmail.com>
Date:   Wed May 22 03:58:26 2024 -0400

    feat: include token in client test like server tests (#1932)

    This PR simply includes the HF token in the client tests, similarly to
    how it is included in the server tests. This helps avoid CI failures due
    to rate limiting.

commit 904ff36917e100047669bd6168d7138045469bbe
Author: Junlin Zhou <jameszhou2108@hotmail.com>
Date:   Wed May 22 01:12:14 2024 +0800

    docs: Fix grafana dashboard url (#1925)

    # What does this PR do?


    Fixes an incorrect url in monitoring doc.


commit 293b8125e7a6ebd3eff65b55699e9386d1c1abf5
Author: fxmarty <9808326+fxmarty@users.noreply.github.com>
Date:   Mon May 20 02:44:48 2024 +0200

    ROCm: make CK FA2 default instead of Triton (#1924)

    As per title.

    Triton autotune overhead is prohibitive, as it needs to be done for each
    different prompt length.

commit f871f114ca5f5a18a2a4a2c7658aed87440d381f
Author: Nicolas Patry <patry.nicolas@protonmail.com>
Date:   Sat May 18 13:31:24 2024 +0200

    Fixing the download strategy for ibm-fms (#1917)


commit 5dad0c0b29cf31271c01948653ac164649a3ac78
Author: fxmarty <9808326+fxmarty@users.noreply.github.com>
Date:   Fri May 17 19:50:52 2024 +0200

    Fix TGI issues with ROCm (#1921)

    Not all models were tested in
    https://github.com/huggingface/text-generation-inference/pull/1764.

    Fixing some more issues (notably starcoder2) here; the full CI will come
    shortly once we split `build.yml` in two.

commit b5f1c9de06ad00bbdeec0348c47f53bee271cedc
Author: fxmarty <9808326+fxmarty@users.noreply.github.com>
Date:   Fri May 17 18:21:51 2024 +0200

    Fix TunableOp bug (#1920)

    cc @Narsil

commit 422bf1f9866e99ef287d6280e8236d22173ee709
Author: fxmarty <9808326+fxmarty@users.noreply.github.com>
Date:   Fri May 17 17:37:23 2024 +0200

    Update grafana template (#1918)

    As per title, there was a mistake.

    Credit to @Narsil.

    Updated
    https://huggingface.co/docs/text-generation-inference/basic_tutorials/monitoring
    as well.

    Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

commit c4cf8b49d1ecce2353935c2497bd8c028cb25320
Author: fxmarty <9808326+fxmarty@users.noreply.github.com>
Date:   Fri May 17 16:34:44 2024 +0200

    Add TGI monitoring guide through Grafana and Prometheus (#1908)

    As per title. It is very useful.

commit 232e8d522713f43834d48ae45d1330b0e6dd367e
Author: fxmarty <9808326+fxmarty@users.noreply.github.com>
Date:   Fri May 17 15:30:47 2024 +0200

    MI300 compatibility (#1764)

    Adds support for AMD Instinct MI300 in TGI.

    Most changes are:
    * Support PyTorch TunableOp to pick the GEMM/GEMV kernels for decoding
    https://github.com/pytorch/pytorch/tree/main/aten/src/ATen/cuda/tunable.
    TunableOp is disabled by default, and can be enabled with
    `PYTORCH_TUNABLEOP_ENABLED=1`.
    * Update ROCm dockerfile to PyTorch 2.3 (actually patched with changes
    from https://github.com/pytorch/pytorch/pull/124362)
    * Support SILU & Linear custom kernels contributed by AMD
    * Update vLLM paged attention to https://github.com/fxmarty/rocm-vllm/,
    branching out of a much more recent commit
    https://github.com/ROCm/vllm/commit/3489ce7936c5de588916ae3047c44c23c0b0c308
    * Support FA2 Triton kernel as recommended by AMD. Can be used by
    specifying `ROCM_USE_FLASH_ATTN_V2_TRITON=1`.
    * Update dockerfile to ROCm 6.1

    By default, TunableOp tuning results are saved in `/data` (e.g.
    `/data/tunableop_meta-llama-Llama-2-70b-chat-hf_tp1_rank0.csv`) so that
    the tuning does not have to be rerun at each `docker run`.
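    For reference, a launch could look like the sketch below. This is an illustrative config fragment only: the image tag (`latest-rocm`) and the exact device/volume flags are assumptions to adapt to your host, not tested values from this PR.

    ```shell
    # Illustrative sketch: enable TunableOp and the Triton FA2 kernel on a ROCm host.
    # Mounting a host directory at /data persists the TunableOp tuning CSV across runs.
    docker run --rm -it \
      --device /dev/kfd --device /dev/dri \
      -v "$PWD/data:/data" \
      -e PYTORCH_TUNABLEOP_ENABLED=1 \
      -e ROCM_USE_FLASH_ATTN_V2_TRITON=1 \
      ghcr.io/huggingface/text-generation-inference:latest-rocm \
      --model-id meta-llama/Llama-2-70b-chat-hf
    ```
    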

    Example:
    ```
    Validator,PT_VERSION,2.3.0
    Validator,ROCM_VERSION,6.1.0.0-82-5fabb4c
    Validator,HIPBLASLT_VERSION,0.7.0-1549b021
    Validator,GCN_ARCH_NAME,gfx942:sramecc+:xnack-
    Validator,ROCBLAS_VERSION,4.1.0-cefa4a9b-dirty
    GemmTunableOp_Half_TN,tn_8192_7_28672,Gemm_Rocblas_45475,0.132098
    GemmTunableOp_Half_TN,tn_10240_4_8192,Gemm_Rocblas_45546,0.0484431
    GemmTunableOp_Half_TN,tn_32000_6_8192,Default,0.149546
    GemmTunableOp_Half_TN,tn_32000_3_8192,Gemm_Rocblas_45520,0.147119
    GemmTunableOp_Half_TN,tn_8192_3_28672,Gemm_Rocblas_45475,0.132645
    GemmTunableOp_Half_TN,tn_10240_3_8192,Gemm_Rocblas_45546,0.0482971
    GemmTunableOp_Half_TN,tn_57344_5_8192,Gemm_Rocblas_45520,0.255694
    GemmTunableOp_Half_TN,tn_10240_7_8192,Gemm_Rocblas_45517,0.0482522
    GemmTunableOp_Half_TN,tn_8192_3_8192,Gemm_Rocblas_45546,0.0444671
    GemmTunableOp_Half_TN,tn_8192_5_8192,Gemm_Rocblas_45546,0.0445834
    GemmTunableOp_Half_TN,tn_57344_7_8192,Gemm_Rocblas_45520,0.25622
    GemmTunableOp_Half_TN,tn_8192_2_28672,Gemm_Rocblas_45475,0.132122
    GemmTunableOp_Half_TN,tn_8192_4_8192,Gemm_Rocblas_45517,0.0453191
    GemmTunableOp_Half_TN,tn_10240_5_8192,Gemm_Rocblas_45517,0.0482514
    GemmTunableOp_Half_TN,tn_8192_5_28672,Gemm_Rocblas_45542,0.133914
    GemmTunableOp_Half_TN,tn_8192_2_8192,Gemm_Rocblas_45517,0.0446516
    GemmTunableOp_Half_TN,tn_8192_1_28672,Gemm_Hipblaslt_TN_10814,0.131953
    GemmTunableOp_Half_TN,tn_10240_2_8192,Gemm_Rocblas_45546,0.0481043
    GemmTunableOp_Half_TN,tn_32000_4_8192,Gemm_Rocblas_45520,0.147497
    GemmTunableOp_Half_TN,tn_8192_6_28672,Gemm_Rocblas_45529,0.134895
    GemmTunableOp_Half_TN,tn_57344_2_8192,Gemm_Rocblas_45520,0.254716
    GemmTunableOp_Half_TN,tn_57344_4_8192,Gemm_Rocblas_45520,0.255731
    GemmTunableOp_Half_TN,tn_10240_6_8192,Gemm_Rocblas_45517,0.0484816
    GemmTunableOp_Half_TN,tn_57344_3_8192,Gemm_Rocblas_45520,0.254701
    GemmTunableOp_Half_TN,tn_8192_4_28672,Gemm_Rocblas_45475,0.132159
    GemmTunableOp_Half_TN,tn_32000_2_8192,Default,0.147524
    GemmTunableOp_Half_TN,tn_32000_5_8192,Default,0.147074
    GemmTunableOp_Half_TN,tn_8192_6_8192,Gemm_Rocblas_45546,0.0454045
    GemmTunableOp_Half_TN,tn_57344_6_8192,Gemm_Rocblas_45520,0.255582
    GemmTunableOp_Half_TN,tn_32000_7_8192,Default,0.146705
    GemmTunableOp_Half_TN,tn_8192_7_8192,Gemm_Rocblas_45546,0.0445489
    ```
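    The results file is plain CSV (`op,shape,selected kernel,time in ms`), so as an illustrative aside (not part of TGI), it can be inspected with standard tools — for instance to list the shapes where tuning kept the default kernel:

    ```shell
    # Write a few lines from the example above to a sample file.
    cat > /tmp/tunableop_sample.csv <<'EOF'
    Validator,PT_VERSION,2.3.0
    GemmTunableOp_Half_TN,tn_8192_7_28672,Gemm_Rocblas_45475,0.132098
    GemmTunableOp_Half_TN,tn_32000_6_8192,Default,0.149546
    EOF

    # Shapes where no tuned kernel beat the default implementation:
    awk -F, '$3 == "Default" { print $2 }' /tmp/tunableop_sample.csv
    # -> tn_32000_6_8192
    ```
    
    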

    ---------

    Co-authored-by: Mohit Sharma <mohit21sharma.ms@gmail.com>

commit a60fa8406abd98d41e2bfafaf6f81f3dd6044b15
Author: Nicolas Patry <patry.nicolas@protonmail.com>
Date:   Fri May 17 11:35:49 2024 +0200

    Removing some unused code. (#1915)

    # What does this PR do?

    <!--
    Congratulations! You've made it this far! You're not quite done yet
    though.

    Once merged, your PR is going to appear in the release notes with the
    title you set, so make sure it's a great title that fully reflects the
    extent of your awesome contribution.

    Then, please replace this with a description of the change and which
    issue is fixed (if applicable). Please also include relevant motivation
    and context. List any dependencies (if any) that are required for this
    change.

    Once you're done, someone will review your PR shortly (see the section
    "Who can review?" below to tag some potential reviewers). They may
    suggest changes to make the code even better. If no one reviewed your PR
    after a week has passed, don't hesitate to post a new comment
    @-mentioning the same persons---sometimes notifications get lost.
    -->

    <!-- Remove if not applicable -->

    Fixes # (issue)

    ## Before submitting
    - [ ] This PR fixes a typo or improves the docs (you can dismiss the
    other checks if that's the case).
    - [ ] Did you read the [contributor
    guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
          Pull Request section?
    - [ ] Was this discussed/approved via a Github issue or the
    [forum](https://discuss.huggingface.co/)? Please add a link
          to it if that's the case.
    - [ ] Did you make sure to update the documentation with your changes?
    Here are the
    [documentation
    guidelines](https://github.com/huggingface/transformers/tree/main/docs),
    and
    [here are tips on formatting
    docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
    - [ ] Did you write any new necessary tests?

    ## Who can review?

    Anyone in the community is free to review the PR once the tests have
    passed. Feel free to tag
    members/contributors who may be interested in your PR.

    <!-- Your PR will be replied to more quickly if you can figure out the
    right person to tag with @

    @OlivierDehaene OR @Narsil

     -->

commit 3b5d93e68d22f5db7950175b5210ce6390df8172
Author: Nicolas Patry <patry.nicolas@protonmail.com>
Date:   Thu May 16 21:40:10 2024 +0200

    Fixing signals. (#1910)

    Taking the signal handles later, so during loads,
    regular signal handling is done, we only need to handle SIGINT and
    SIGTERM during real loads to get more graceful shutdowns when queries
    are in flight.

    Fixes #1842


commit b3dd3902e76df777d28ee76993800f4baf73c40c
Author: Nicolas Patry <patry.nicolas@protonmail.com>
Date:   Thu May 16 17:21:00 2024 +0200

    Types. (#1909)

yuanwu2017 pushed a commit to yuanwu2017/tgi-gaudi that referenced this pull request Jul 17, 2024
yuanwu2017 pushed a commit to yuanwu2017/tgi-gaudi that referenced this pull request Jul 17, 2024
Not all models were tested in
huggingface#1764.

Fixing some more issues (notably starcoder2) here; the full CI will come
shortly, once we split `build.yml` in two.