Sync master with upstream release b8783 by jan-service-account · Pull Request #485 · janhq/llama.cpp

jan-service-account · 2026-04-14T00:57:55Z

Updates dev branch with latest release (b8783) from ggml-org/llama.cpp

* webui: add setting for first-line chat titles

  Add an opt-in setting (`titleGenerationUseFirstLine`) to use the first non-empty line of a prompt as the generated conversation title.

  Previously, the complete multi-line prompt was being used, which created long titles for complex queries. Coupled with "Ask for confirmation before changing conversation title", the dialog would overflow.

* Update tools/server/webui/src/lib/utils/text.ts

  Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* Update tools/server/webui/src/lib/utils/text.ts

  Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* webui: Run build to update the bundle

  As requested in: ggml-org#21797 (review)

* webui: Fix missing import for NEWLINE_SEPARATOR

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
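The behaviour described for `titleGenerationUseFirstLine` can be sketched roughly as follows. This is an illustrative sketch only, not the code from `tools/server/webui/src/lib/utils/text.ts`: the `TitleSettings` interface and the function names `firstNonEmptyLine` and `deriveTitle` are hypothetical, and trimming whitespace is an assumption. Only the setting key comes from the commit message.

```typescript
// Hypothetical sketch; names other than `titleGenerationUseFirstLine` are assumptions.

// Assumed shape of the relevant setting (only the key name appears in the commit message).
interface TitleSettings {
  titleGenerationUseFirstLine: boolean;
}

// Return the first line that still has content after trimming, or "" if there is none.
function firstNonEmptyLine(text: string): string {
  for (const line of text.split(/\r?\n/)) {
    const trimmed = line.trim();
    if (trimmed.length > 0) {
      return trimmed;
    }
  }
  return "";
}

// Derive a conversation title from the prompt.
// With the setting off, the whole (trimmed) prompt is used, as described for the old behaviour.
function deriveTitle(prompt: string, settings: TitleSettings): string {
  if (settings.titleGenerationUseFirstLine) {
    return firstNonEmptyLine(prompt);
  }
  return prompt.trim();
}
```

With the setting enabled, a prompt such as `"\n  Fix the parser\nHere is the full stack trace..."` would yield the title `Fix the parser` in this sketch, which keeps a confirmation dialog short even for long multi-line prompts.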

* CUDA: Limit DeviceSegmentedSort to immediate mode

  DeviceSegmentedSort is currently not capturable in a cuda graph. Hence, we have to go for the slower DeviceSegmentedRadixSort in that case.

  Perf numbers on RTX Pro 6000 Blackwell Max-Q:

  DeviceSegmentedRadixSort in graph mode (i.e. CUDA Graphs)

  | Test | runs | us/run | kB/run | GB/s |
  |---|---|---|---|---|
  | ARGSORT(type=f32,ne=[2048,512,1,1],order=1) | 12291 | 105.94 | 8192 | 73.75 |
  | ARGSORT(type=f32,ne=[4096,512,1,1],order=1) | 10245 | 115.08 | 16384 | 135.77 |
  | ARGSORT(type=f32,ne=[8192,512,1,1],order=1) | 5125 | 221.22 | 32768 | 141.26 |
  | ARGSORT(type=f32,ne=[16384,512,1,1],order=1) | 2565 | 430.98 | 65536 | 145.02 |
  | ARGSORT(type=f32,ne=[32768,512,1,1],order=1) | 1028 | 1185.83 | 131072 | 105.41 |
  | ARGSORT(type=f32,ne=[65536,512,1,1],order=1) | 387 | 2748.62 | 262144 | 90.95 |

  DeviceSegmentedSort in immediate mode

  | Test | runs | us/run | kB/run | GB/s |
  |---|---|---|---|---|
  | ARGSORT(type=f32,ne=[2048,512,1,1],order=1) | 16388 | 71.17 | 8192 | 109.78 |
  | ARGSORT(type=f32,ne=[4096,512,1,1],order=1) | 12294 | 81.38 | 16384 | 192.00 |
  | ARGSORT(type=f32,ne=[8192,512,1,1],order=1) | 5125 | 240.81 | 32768 | 129.77 |
  | ARGSORT(type=f32,ne=[16384,512,1,1],order=1) | 2565 | 406.60 | 65536 | 153.71 |
  | ARGSORT(type=f32,ne=[32768,512,1,1],order=1) | 1285 | 873.23 | 131072 | 143.15 |
  | ARGSORT(type=f32,ne=[65536,512,1,1],order=1) | 516 | 2288.46 | 262144 | 109.24 |

* Add test case for dispatch to DeviceSegmentedRadixSort

  We currently lack a way to force graph mode in CUDA, patch callback to invoke ggml_backend_compare_graph_backend twice to enforce each test to run in graph mode
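To help read the perf numbers above, here is a small sketch of how the GB/s column relates to the us/run and kB/run columns, and how the two modes compare per size. It assumes that kB means 1024 bytes and GB means 1024^3 bytes; this is an inference from the reported figures, which it reproduces to within rounding, and is not stated in the commit message. The helper names `throughput` and `speedup` are hypothetical.

```typescript
// Sketch for re-deriving the reported throughput; helper names are hypothetical.

// Throughput in GB/s, ASSUMING kB = 1024 bytes and GB = 1024^3 bytes.
function throughput(kBPerRun: number, usPerRun: number): number {
  const bytes = kBPerRun * 1024;
  const seconds = usPerRun * 1e-6;
  return bytes / seconds / (1024 * 1024 * 1024);
}

// Ratio of graph-mode time to immediate-mode time; values above 1 mean immediate mode is faster.
function speedup(graphUsPerRun: number, immediateUsPerRun: number): number {
  return graphUsPerRun / immediateUsPerRun;
}

// us/run values copied from the commit message: [row length, graph mode, immediate mode].
const timings: Array<[number, number, number]> = [
  [2048, 105.94, 71.17],
  [4096, 115.08, 81.38],
  [8192, 221.22, 240.81],
  [16384, 430.98, 406.6],
  [32768, 1185.83, 873.23],
  [65536, 2748.62, 2288.46],
];

for (const [n, graphUs, immediateUs] of timings) {
  console.log(`ne0=${n}: graph/immediate time ratio = ${speedup(graphUs, immediateUs).toFixed(2)}`);
}
```

For example, `throughput(8192, 105.94)` comes out at roughly 73.7, in line with the 73.75 GB/s reported for the first graph-mode row.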

…20797)

* use integer dot product for quantized KV flash attention
* small improvements
* fix SHMEM_STAGING indexing
* add missing KV type quants
* fixes
* add supported quants to FA tests
* readd fast paths for <8bit quants
* fix mmq gate and shmem checks

qnixsynapse and others added 13 commits April 13, 2026 09:44

sycl: disable Q1_0 in backend and cleanup unused variables (ggml-org#21807)

873c825

Remove extra conditional check on debug mode. (ggml-org#21798)

bafae27

webui: MCP Diagnostics improvements (ggml-org#21803)

227ed28

* Add MCP Connection diagnostics and CORS hint to web-ui
* tidy up test
* webui: Refactor and improve MCP diagnostic logging

---------

Co-authored-by: evalstate <1936278+evalstate@users.noreply.github.com>

mtmd: use causal attn for gemma 4 audio (ggml-org#21824)

920b3e7

server: Expose build_info in router mode (ggml-org#21835)

ce8fd4b

common : add download cancellation and temp file cleanup (ggml-org#21813)

aa00911

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

ci: Also exempt 'security' tag from auto-close (ggml-org#21844)

a8bad38

chat: dedicated DeepSeek v3.2 parser + "official" template (ggml-org#21785)

1c0d908

docs: listing qwen3-asr and qwen3-omni as supported (ggml-org#21857)

e974923

* docs: listing qwen3-asr and qwen3-omni as supported
* nits

common/gemma4 : handle parsing edge cases (ggml-org#21760)

e21cdc1

jan-service-account merged commit 5282e8d into dev Apr 14, 2026
3 checks passed

jan-service-account deleted the update-dev-from-master-2026-04-14-00-57 branch April 14, 2026 01:26

jan-service-account merged 13 commits into dev from update-dev-from-master-2026-04-14-00-57


13 participants
