Sync master with upstream release b8580 by jan-service-account · Pull Request #470 · janhq/llama.cpp

jan-service-account · 2026-03-30T00:54:52Z

Updates dev branch with latest release (b8580) from ggml-org/llama.cpp


* hex-fa: add simple dma cache for Mask
  I noticed that we were refetching the mask rows over and over. This simple cache avoids that.
* hex-dma: unset in-order desc bit which caused significant perf regression
  We don't rely on true in-order processing of the DMA descriptors anywhere. Turns out this mode caused a significant regression of around 3-4 TPS during token gen.
* hex-rope: update comment to clarify that we don't need in-order DMA completions
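The commit message does not show how the mask cache is implemented. As a purely illustrative sketch of the general idea (skip the transfer when the requested row is already resident), the following uses a small direct-mapped cache keyed by row index. The class name `MaskRowCache`, its slot count, and the use of `std::memcpy` as a stand-in for a DMA transfer are all assumptions made for this example and are not taken from the Hexagon backend.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical direct-mapped cache of mask rows (illustration only).
// Each row index maps to slot (row % Slots); a row is copied only on a miss.
template <std::size_t Slots>
class MaskRowCache {
public:
    explicit MaskRowCache(std::size_t row_bytes)
        : row_bytes_(row_bytes), data_(Slots * row_bytes) {
        tags_.fill(SIZE_MAX);  // SIZE_MAX marks an empty slot
    }

    // Returns a pointer to the cached copy of row `row` of `mask`.
    const std::uint8_t * get(const std::uint8_t * mask, std::size_t row) {
        const std::size_t slot = row % Slots;
        std::uint8_t * dst = data_.data() + slot * row_bytes_;
        if (tags_[slot] != row) {
            // Miss: fetch the row. memcpy stands in for a DMA transfer here.
            std::memcpy(dst, mask + row * row_bytes_, row_bytes_);
            tags_[slot] = row;
            ++fetches_;
        }
        return dst;
    }

    // Number of rows actually fetched (misses) so far.
    std::size_t fetches() const { return fetches_; }

private:
    std::size_t row_bytes_;
    std::vector<std::uint8_t> data_;
    std::array<std::size_t, Slots> tags_;
    std::size_t fetches_ = 0;
};
```

Requesting the same row twice in a row triggers only one fetch in this sketch; a request for a different row that maps to the same slot evicts the earlier one.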

@am17an

* Optimize MOE GEMV kernel for BS > 1.
  The previous MOE kernel for BS > 1 launched too many thread blocks, with a grid of (nrows_x, nchannels_dst, ncols_dst) and very little work per block: a block of (32, 4) was doing the inner dot product for a single row. The new mul_mat_vec_q_moe kernel is dedicated to the MoE multi-token case, with grid (ceil(nrows_x/rpb), nchannels_dst) and block (warp_size, ncols_dst). Each warp handles two rows independently with warp-level reduction only (no shared memory sync). This change doesn't increase compilation time, as a single template instance is needed per type. It also simplifies the original GEMV kernel and gets rid of the `is_multi_token_id` specialization.
* Remove em-dashes
* Cherry-pick changes from @am17an PR ggml-org#20885 to enable small_k optimization only for cases where it benefits
  Increase max batch size for MMVQ kernels for MUL_MAT_ID to 8
* Make the max batch size for MOE GEMV kernel configurable based on GPU arch and datatype
---------
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
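To make the change in launch shape concrete, the sketch below counts thread blocks under the two layouts described in the commit message. The struct and function names are hypothetical helpers written for this example, not code from the kernel sources, and `rpb = 2` in the usage note is an assumption inferred from "each warp handles two rows"; the actual kernel may pick it differently.

```cpp
#include <cstdint>

// Hypothetical container for a launch configuration (illustration only).
struct LaunchDims {
    std::uint64_t grid_x, grid_y, grid_z;  // thread blocks per grid dimension
    std::uint64_t block_x, block_y;        // threads per block dimension
};

// Integer ceiling division, as in ceil(nrows_x / rpb).
constexpr std::uint64_t ceil_div(std::uint64_t a, std::uint64_t b) {
    return (a + b - 1) / b;
}

constexpr std::uint64_t total_blocks(const LaunchDims & d) {
    return d.grid_x * d.grid_y * d.grid_z;
}

// Previous layout as described: grid (nrows_x, nchannels_dst, ncols_dst), block (32, 4).
constexpr LaunchDims old_moe_launch(std::uint64_t nrows_x,
                                    std::uint64_t nchannels_dst,
                                    std::uint64_t ncols_dst) {
    return LaunchDims{nrows_x, nchannels_dst, ncols_dst, 32, 4};
}

// New layout as described: grid (ceil(nrows_x/rpb), nchannels_dst),
// block (warp_size, ncols_dst). rpb is the number of rows per block.
constexpr LaunchDims new_moe_launch(std::uint64_t nrows_x,
                                    std::uint64_t nchannels_dst,
                                    std::uint64_t ncols_dst,
                                    std::uint64_t rpb,
                                    std::uint64_t warp_size) {
    return LaunchDims{ceil_div(nrows_x, rpb), nchannels_dst, 1, warp_size, ncols_dst};
}
```

For example, with 4096 rows, 8 destination channels, 4 destination columns, `rpb = 2`, and a warp size of 32, `total_blocks` gives 131072 for the old layout and 16384 for the new one, so each block in the new layout covers the work of `rpb * ncols_dst` old blocks.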

arthw and others added 5 commits March 29, 2026 09:02

[SYCL] Enhance build script to use half cores to build, avoid OS hang (…

afe65aa

…ggml-org#21093)
* use half cores to build, avoid OS hang
* reduce the output text num to shorten test time
* avoid returning 0

devops: including compute-runtime for intel.Dockerfile (ggml-org#21076)

2405d59

add missing ROPE_FACTORS_LONG/SHORT for MiniCPM (ggml-org#21150)

7c20367

jan-service-account merged commit 03eccdc into dev Mar 30, 2026
3 checks passed

jan-service-account deleted the update-dev-from-master-2026-03-30-00-54 branch March 30, 2026 01:24

jan-service-account merged 5 commits into dev from update-dev-from-master-2026-03-30-00-54

6 participants
