@xenova released this 26 Jun 15:48 · 7b45042

πŸš€ Transformers.js v3.6 β€” Gemma 3n, Qwen3-Embedding, Llava-Qwen2

πŸ€– New models

Gemma 3n

Gemma 3n, announced as a preview during Google I/O, is designed from the ground up to run locally on your hardware. On top of that, it's natively multimodal, supporting image, text, audio, and video inputs 🤯

Gemma 3n models introduce several architectural innovations:

  • They are available in two sizes based on effective parameters. While the raw parameter count of the model is 6B, the architecture allows it to run with a memory footprint comparable to a traditional 2B model by offloading low-utilization matrices from the accelerator.
  • They use a MatFormer architecture that allows nesting sub-models within the E4B model. We provide one extracted sub-model (the E2B variant released below), or you can access a spectrum of custom-sized models using the Mix-and-Match method.

Learn more about these techniques in the technical blog post and the Gemma documentation.

As part of the release, we are publishing ONNX weights for the gemma-3n-E2B-it variant (onnx-community/gemma-3n-E2B-it-ONNX), making it compatible with Transformers.js:

Warning

Due to the model's large size, we currently only support Node.js, Deno, and Bun execution.
In-browser WebGPU support is actively being worked on, so stay tuned for an update!
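
If your code may also be loaded in a browser, a minimal runtime guard like the sketch below (an illustrative check, not an official Transformers.js API) can fail fast outside Node.js, Deno, and Bun:

// Illustrative guard: detect a server-side JavaScript runtime before loading the model
const isServerRuntime =
  (typeof process !== "undefined" && !!process.versions?.node) || // Node.js
  typeof Deno !== "undefined" || // Deno
  typeof Bun !== "undefined"; // Bun
if (!isServerRuntime) {
  throw new Error("The Gemma 3n examples currently require Node.js, Deno, or Bun.");
}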

Example: Caption an image

import {
  AutoProcessor,
  AutoModelForImageTextToText,
  load_image,
  TextStreamer,
} from "@huggingface/transformers";

// Load processor and model
const model_id = "onnx-community/gemma-3n-E2B-it-ONNX";
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await AutoModelForImageTextToText.from_pretrained(model_id, {
  // Per-module quantization settings, chosen to reduce memory usage
  dtype: {
    embed_tokens: "q8", // 8-bit quantization
    audio_encoder: "q8", // 8-bit quantization
    vision_encoder: "fp16", // half precision
    decoder_model_merged: "q4", // 4-bit quantization
  },
  device: "cpu", // NOTE: WebGPU support coming soon!
});

// Prepare prompt
const messages = [
  {
    role: "user",
    content: [
      { type: "image" },
      { type: "text", text: "Describe this image in detail." },
    ],
  },
];
const prompt = processor.apply_chat_template(messages, {
  add_generation_prompt: true,
});

// Prepare inputs
const url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg";
const image = await load_image(url);
const audio = null;
const inputs = await processor(prompt, image, audio, {
  add_special_tokens: false,
});

// Generate output
const outputs = await model.generate({
  ...inputs,
  max_new_tokens: 512,
  do_sample: false,
  streamer: new TextStreamer(processor.tokenizer, {
    skip_prompt: true,
    skip_special_tokens: false,
    // callback_function: (text) => { /* Do something with the streamed output */ },
  }),
});

// Decode the generated tokens, slicing off the prompt tokens first
const decoded = processor.batch_decode(
  outputs.slice(null, [inputs.input_ids.dims.at(-1), null]),
  { skip_special_tokens: true },
);
console.log(decoded[0]);
Example output:
The image is a close-up, slightly macro shot of a cluster of vibrant pink cosmos flowers in full bloom. The flowers are the focal point, with their delicate, slightly ruffled petals radiating outwards. They have a soft, almost pastel pink hue, and their edges are subtly veined. 

A small, dark-colored bee is actively visiting one of the pink flowers, its body positioned near the center of the bloom. The bee appears to be collecting pollen or nectar. 

The flowers are attached to slender, brownish-green stems, and some of the surrounding foliage is visible in a blurred background, suggesting a natural outdoor setting. There are also hints of other flowers in the background, including some red ones, adding a touch of contrast to the pink. 

The lighting in the image seems to be natural daylight, casting soft shadows and highlighting the textures of the petals and the bee. The overall impression is one of delicate beauty and the gentle activity of nature.

Example: Transcribe audio

import {
  AutoProcessor,
  AutoModelForImageTextToText,
  TextStreamer,
} from "@huggingface/transformers";
import wavefile from "wavefile";

// Load processor and model
const model_id = "onnx-community/gemma-3n-E2B-it-ONNX";
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await AutoModelForImageTextToText.from_pretrained(model_id, {
  dtype: {
    embed_tokens: "q8",
    audio_encoder: "q4",
    vision_encoder: "fp16",
    decoder_model_merged: "q4",
  },
  device: "cpu", // NOTE: WebGPU support coming soon!
});

// Prepare prompt
const messages = [
  {
    role: "user",
    content: [
      { type: "audio" },
      { type: "text", text: "Transcribe this audio verbatim." },
    ],
  },
];
const prompt = processor.apply_chat_template(messages, {
  add_generation_prompt: true,
});

// Prepare inputs
const url = "https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav";
const buffer = Buffer.from(await fetch(url).then((x) => x.arrayBuffer()));
const wav = new wavefile.WaveFile(buffer);
wav.toBitDepth("32f"); // Pipeline expects input as a Float32Array
wav.toSampleRate(processor.feature_extractor.config.sampling_rate);
let audioData = wav.getSamples();
if (Array.isArray(audioData)) {
  if (audioData.length > 1) {
    // Merge the stereo channels into mono, scaling by sqrt(2) to preserve signal power
    for (let i = 0; i < audioData[0].length; ++i) {
      audioData[0][i] = (Math.sqrt(2) * (audioData[0][i] + audioData[1][i])) / 2;
    }
  }
  // Select the first (now merged) channel
  audioData = audioData[0];
}

const image = null;
const audio = audioData;
const inputs = await processor(prompt, image, audio, {
  add_special_tokens: false,
});

// Generate output
const outputs = await model.generate({
  ...inputs,
  max_new_tokens: 512,
  do_sample: false,
  streamer: new TextStreamer(processor.tokenizer, {
    skip_prompt: true,
    skip_special_tokens: false,
    // callback_function: (text) => { /* Do something with the streamed output */ },
  }),
});

// Decode output
const decoded = processor.batch_decode(
  outputs.slice(null, [inputs.input_ids.dims.at(-1), null]),
  { skip_special_tokens: true },
);
console.log(decoded[0]);
Example output:
And so, my fellow Americans, ask not what your country can do for you. Ask what you can do for your country.

Qwen3-Embedding

The Qwen3 Embedding model series is the latest generation of embedding models in the Qwen family, specifically designed for text embedding and ranking tasks. Built upon the dense foundational models of the Qwen3 series, it provides a comprehensive range of text embedding and reranking models in various sizes (0.6B, 4B, and 8B). The series inherits the exceptional multilingual capabilities, long-text understanding, and reasoning skills of its foundation models.

You can run it with Transformers.js as follows:

import { pipeline, matmul } from "@huggingface/transformers";

// Create a feature extraction pipeline
const extractor = await pipeline(
  "feature-extraction",
  "onnx-community/Qwen3-Embedding-0.6B-ONNX",
  {
    dtype: "fp32", // Options: "fp32", "fp16", "q8"
    // device: "webgpu",
  },
);

function get_detailed_instruct(task_description, query) {
  return `Instruct: ${task_description}\nQuery:${query}`;
}

// Each query must come with a one-sentence instruction that describes the task
const task = "Given a web search query, retrieve relevant passages that answer the query";
const queries = [
  get_detailed_instruct(task, "What is the capital of China?"),
  get_detailed_instruct(task, "Explain gravity"),
];

// No need to add instruction for retrieval documents
const documents = [
  "The capital of China is Beijing.",
  "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
];
const input_texts = [...queries, ...documents];

// Extract embeddings for queries and documents
const output = await extractor(input_texts, {
  pooling: "last_token",
  normalize: true,
});
// Compute query-document similarity scores. Since the embeddings are
// L2-normalized, these dot products are cosine similarities.
const scores = await matmul(
  output.slice([0, queries.length]), // Query embeddings
  output.slice([queries.length, null]).transpose(1, 0), // Document embeddings
);
console.log(scores.tolist());
// [
//   [ 0.7645590305328369, 0.14142560958862305 ],
//   [ 0.13549776375293732, 0.599955141544342 ]
// ]

Llava-Qwen2

Finally, we also added support for Llava models with a Qwen2 text backbone, such as FastVLM-0.5B:

import {
  AutoProcessor,
  AutoModelForImageTextToText,
  load_image,
  TextStreamer,
} from "@huggingface/transformers";

// Load processor and model
const model_id = "onnx-community/FastVLM-0.5B-ONNX";
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await AutoModelForImageTextToText.from_pretrained(model_id, {
  dtype: {
    embed_tokens: "fp16",
    vision_encoder: "q4",
    decoder_model_merged: "q4",
  },
});

// Prepare prompt
const messages = [
  {
    role: "user",
    content: "<image>Describe this image in detail.",
  },
];
const prompt = processor.apply_chat_template(messages, {
  add_generation_prompt: true,
});

// Prepare inputs
const url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg";
const image = await load_image(url);
const inputs = await processor(image, prompt, {
  add_special_tokens: false,
});

// Generate output
const outputs = await model.generate({
  ...inputs,
  max_new_tokens: 512,
  do_sample: false,
  streamer: new TextStreamer(processor.tokenizer, {
    skip_prompt: true,
    skip_special_tokens: false,
    // callback_function: (text) => { /* Do something with the streamed output */ },
  }),
});

// Decode output
const decoded = processor.batch_decode(
  outputs.slice(null, [inputs.input_ids.dims.at(-1), null]),
  { skip_special_tokens: true },
);
console.log(decoded[0]);
Example output:
The image depicts a vibrant and colorful scene featuring a variety of flowers and plants. The main focus is on a striking pink flower with a dark center, which appears to be a type of petunia. The petals are a rich, deep pink, and the flower has a classic, slightly ruffled appearance. The dark center of the flower is a contrasting color, likely a deep purple or black, which adds to the flower's visual appeal.

In the background, there are several other flowers and plants, each with their unique colors and shapes. To the left, there is a red flower with a bright, vivid hue, which stands out against the pink flower. The red flower has a more rounded shape and a lighter center, with petals that are a lighter shade of red compared to the pink flower.

To the right of the pink flower, there is a plant with red flowers, which are smaller and more densely packed. The red flowers are a deep, rich red color, and they have a more compact shape compared to the pink flower.

In the foreground, there is a green plant with a few leaves and a few small flowers. The leaves are a bright green color, and the flowers are a lighter shade of green, with a few petals that are slightly open.

Overall, the image is a beautiful representation of a garden or natural setting, with a variety of flowers and plants that are in full bloom. The colors are vibrant and the composition is well-balanced, with the pink flower in the center drawing the viewer's attention.

πŸ› οΈ Other improvements

  • Improve detection & usage of deno/bun in #1333
  • Add eos/last_token pooling in #1335 (see the sketch below)
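
The new pooling options are used with the feature-extraction pipeline. Below is a minimal sketch; it assumes the "eos" option follows the same API shape as the "last_token" pooling shown in the Qwen3-Embedding example above:

import { pipeline } from "@huggingface/transformers";

// Assumption: "eos" pooling is requested the same way as "last_token"
const extractor = await pipeline(
  "feature-extraction",
  "onnx-community/Qwen3-Embedding-0.6B-ONNX",
);
const embeddings = await extractor(["Hello world"], {
  pooling: "eos", // or "last_token"
  normalize: true,
});
console.log(embeddings.dims); // [number of inputs, embedding dimension]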

Full Changelog: 3.5.2...3.6.0