Conversation

@xenova (Collaborator) commented on Oct 16, 2025:

https://github.com/karpathy/nanochat

Example usage:

import { pipeline, TextStreamer } from "@huggingface/transformers";

// Create a text generation pipeline
const generator = await pipeline(
  "text-generation",
  "onnx-community/nanochat-d32-ONNX",
  { dtype: "q4" }, // Options: "fp32", "fp16", "q4", "q4f16"
);

// Define the list of messages
const messages = [
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "What is the capital of France?" },
];

// Generate a response
const output = await generator(messages, {
  max_new_tokens: 512,
  do_sample: false,
  streamer: new TextStreamer(generator.tokenizer, {
    skip_prompt: true,
    skip_special_tokens: true,
  }),
});
console.log(output[0].generated_text.at(-1).content);

Linked to huggingface/transformers#41634

@xenova (Collaborator, Author) commented:
cc @nico-martin for tokenizers.js

We'd need to add these to the mapping

@HuggingFaceDocBuilderDev (Collaborator) commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@nico-martin (Collaborator) commented:
I tried it on my MacBook and it takes super long to process and generate tokens. Also it only runs on the wasm runtime. Is that expected?

@xenova (Collaborator, Author) commented on Oct 18, 2025:

Yeah, I'm still figuring out the best quantization strategy here: the current q4 export produces very poor outputs, so I think a mixed-precision approach is needed.

> Also it only runs on the wasm runtime.

Do you see any errors on WebGPU? 👀 The model might only run in q4f16 on WebGPU (but with poor quality).
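
For reference, a minimal sketch for forcing the WebGPU backend (this uses the standard `device`/`dtype` pipeline options; picking q4f16 here is an assumption based on the comment above):

import { pipeline } from "@huggingface/transformers";

// Sketch: explicitly request the WebGPU execution provider so that any
// runtime errors surface instead of staying on the WASM backend.
const generator = await pipeline(
  "text-generation",
  "onnx-community/nanochat-d32-ONNX",
  { device: "webgpu", dtype: "q4f16" }, // dtype choice is an assumption
);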

@xenova (Collaborator, Author) commented on Oct 19, 2025:

The new quantizations are much better in my testing 👍
https://huggingface.co/onnx-community/nanochat-d32-ONNX/commit/e0d8a83ed5e1cd954a7377ad57beee4cd653f7c1

@xenova merged commit 85b8eb2 into main on Oct 19, 2025, with 4 checks passed, and deleted the add-nanochat branch at 17:47 the same day.
@xenova (Collaborator, Author) commented on Oct 19, 2025:

I did also encounter the WebGPU bug, but this is an issue with the model (or rather, a "backwards-compatibility" issue) and not with the PR, so I will fix it on the HF side 👍

An error occurred during model execution: "Error: [WebGPU] Kernel "[Mul] /model/layers.0/attn/k_rotary/x2_cos_mul" failed. Error: Can't perform binary op on the given tensors".

@xenova (Collaborator, Author) commented on Oct 20, 2025:

This has now been fixed 👍 https://huggingface.co/onnx-community/nanochat-d32-ONNX/commit/5e500c2ad822ea5379361f6fc08f3da9bb55fec1

The model (q4) now runs well on WebGPU in-browser, even on the older JS EP.

q4f16 and fp16 seem to have some overflow issues on WebGPU. I'm not sure what the best options are here, since the spec doesn't support bf16. See tc39/proposal-float16array#4 (comment).
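
If it helps with testing, here is a rough sketch (an assumed harness, not part of this PR) for comparing dtypes on WebGPU to spot the overflow by eye:

import { pipeline } from "@huggingface/transformers";

const messages = [{ role: "user", content: "What is the capital of France?" }];

// Sketch: load the model at each dtype on WebGPU and print a short greedy
// completion, so degraded or overflowing exports are easy to spot.
for (const dtype of ["q4", "q4f16", "fp16"]) {
  const generator = await pipeline(
    "text-generation",
    "onnx-community/nanochat-d32-ONNX",
    { device: "webgpu", dtype },
  );
  const output = await generator(messages, { max_new_tokens: 32, do_sample: false });
  console.log(`[${dtype}]`, output[0].generated_text.at(-1).content);
  await generator.dispose(); // free the session before loading the next dtype
}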
