Merge from upstream by l3utterfly · Pull Request #12 · l3utterfly/llama.cpp

l3utterfly · 2024-04-25T07:01:01Z

No description provided.

* add explicit phi3 support * add explicit phi3 support * remove unused code * convert : add BOS token * llama : match EOT token <|end|> * llama : minor / style * llama : tabs -> spaces * convert : fix lint checks --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* add support of codeqwen due to tokenizer * override load_hparams * fix typo * fix load_params * convert : fix whitespace --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Add phi 3 chat template & tests * test : fix chat template result --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

ggml-ci

* Server: add tests for consistent results * sampling: separate rng per sampling context

…g#6860) * fix: revert showing control tokens by default * feat: revert changes to default behavior of llama_token_to_piece; provide overridden declaration to receive "bool special" param to toggle showing control tokens * feat: use the overridden declaration of llama_token_to_piece from common/common.cpp to specify "false" so that control tokens are not shown in chat completion responses" * common : simplify --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

…6850)

* add llama_get_pooling_type function * fix argument name, move with ctx funcs

) * ggml: backend-agnostic tensor parallelism * support for GPT-OSS, Qwen 3 MoE * partial Vulkan fix * add support for 4/8 GPUs * unconditional peer access * re-use buffers + ggml contexts * fix output pattern * NCCL support * GGML: HIP: add RCCL support * Remove shfl and AllReduce from backend interface * move allocation workaround out of ggml-alloc.c * 2d tensor set/get support * Fix the seg fault without NCCL * Apply suggestion from JohannesGaessler * support for tensor dims % n_devs != 0 * fix view_offs scaling * arbitrary num. of GPUs/tensor split * fix compilation * better granularity estimate * Support device-specific host buffer types if all underlying backends expose the same type. This allows using pinned memory instead of pageable memory for CUDA. Fix compilation errors. * partial Qwen 3 Next support * Fix qwen3 30b (#8) * Fix crash with Qwen-30B-A3B Q4_0 Qwen-30B-A3B Q4_0 has an intermediate dimension of 768. Using a granularity of 256 forces an uneven split between GPUs, which is not supported by the current implementation. * Decide block size based on tensor quantization type * Fix crashes due to KV cache serialization (#9) KV cache serialization requires non-zero offsets on the tensor. Add support in the meta backend to set/get a tensor with a non-zero offset. * metal : fix build (#7) * static memory allocations, fix usage count * fix tensor granularity * more even memory distribution * use BF16 for allreduce * rebase fixup * better error message for unsupported architectures * Fix device mismatch during scatter of allReduce. (#11) There is a mismatch between the dst buffer device and the backend device, causing the use of sync copies * Enable the previous allreduce implementation. It is better in both perf and stability (#12) * delay AllReduce for Moe for less I/O * build : clean-up compile warnings * backend : move most of the meta backend API to ggml-backend-impl.h * cont : hide unused public API in the implementation * llama : use llama_device + remove ggml_backend_dev_is_meta() * ggml-backend : remove unused alloc include * minor : remove regex include * ggml : introduce ggml-ext.h for staging new APIs * rebase fixup * fix tests * llama : more robust logic for determining Meta devices (#16) * llama : more robust logic for determining Meta devices * cont : fix devs size check Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * cont : fix log type Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * disable roundtrip for meta backend * fix arch selection * Qwen 3.5 support * fix Gemma 4 MoE * fix OpenVino, SYCL * fix test-llama-archs for CPU-only builds * Fix Qwen 3.5 MoE * disable meta backend tests for WebGPU * tests : filter CPU-based devices from the Meta backend tests (#17) * meta : formatting, naming, indentation (#18) * formatting : llama-model.cpp * formatting : ggml-ext.h * formatting : ggml-backend-meta.cpp * meta : add TODO * add documentation * better error messages * fix GPT-OSS --------- Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz> Co-authored-by: Gaurav Garg <gaugarg@nvidia.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

liuwei-git and others added 10 commits April 24, 2024 10:00

convert : add support of codeqwen due to tokenizer (ggml-org#6707)

3fec68b

* add support of codeqwen due to tokenizer * override load_hparams * fix typo * fix load_params * convert : fix whitespace --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

llama : add phi 3 chat template (ggml-org#6857)

abd3314

* Add phi 3 chat template & tests * test : fix chat template result --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

ggml : move 32-bit arm compat in ggml-impl.h (ggml-org#6865)

c0d1b3e

ggml-ci

Server: fix seed for multiple slots (ggml-org#6835)

28103f4

* Server: add tests for consistent results * sampling: separate rng per sampling context

server : do not apply Markdown formatting in code sections (ggml-org#…

3fe847b

…6850)

llama : add llama_get_pooling_type function (ggml-org#6862)

b4e4b8a

* add llama_get_pooling_type function * fix argument name, move with ctx funcs

README: add graphic for matrix multiplication (ggml-org#6881)

784e11d

Merge branch 'layla-build' into merge-master-layla

7a92678

l3utterfly merged commit c259b80 into layla-build Apr 25, 2024

l3utterfly deleted the merge-master-layla branch April 25, 2024 07:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Merge from upstream#12

Merge from upstream#12
l3utterfly merged 10 commits intolayla-buildfrom
merge-master-layla

l3utterfly commented Apr 25, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

9 participants

Conversation

l3utterfly commented Apr 25, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

9 participants