##### Copyright 2026 Google LLC.

In [None]:
# @title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Gemma 3n - Run with Transformers.js

Author: Sitam Meur

*   GitHub: [github.com/sitammeur](https://github.com/sitammeur/)
*   X: [@sitammeur](https://x.com/sitammeur)

Description: This notebook demonstrates how to run inference on the Gemma 3n model using Node.js and [Transformers.js](https://huggingface.co/docs/transformers.js/index). Transformers.js lets you run Hugging Face transformer models directly in JavaScript, in the browser or in Node.js, with an API similar to the Python library's. It supports NLP, computer vision, audio, and multimodal tasks using ONNX Runtime, and models trained in PyTorch, TensorFlow, or JAX can be converted to ONNX for use with it.

<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/google-gemini/gemma-cookbook/blob/main/Gemma/[Gemma_3n]Using_with_Transformersjs.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
</table>

## Setup

### Select the Colab runtime
To complete this tutorial, you need a Colab runtime with sufficient resources to run the Gemma 3n model. A CPU runtime is enough for this notebook:

1. In the upper-right of the Colab window, select **▾ (Additional connection options)**.
2. Select **Change runtime type**.
3. Under **Hardware accelerator**, select **CPU**.

## Installation

Let's get started with installing the dependencies.

In [None]:
# Install Node.js
!curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash -
!sudo apt-get install -y nodejs
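
The scripts later in this notebook rely on the global `fetch` function and on top-level `await`, so it can be worth confirming which Node.js version was installed. The snippet below is a minimal sketch, assuming that Node.js 18 or later is what provides `fetch` without extra packages; `supportsGlobalFetch` is a hypothetical helper name, not part of Node.js or Transformers.js.

```javascript
// Hypothetical helper: returns true when a Node.js version string is new
// enough to provide a global fetch() by default (assumed here: Node.js 18+).
function supportsGlobalFetch(version) {
  const major = Number.parseInt(version.replace(/^v/, "").split(".")[0], 10);
  return Number.isInteger(major) && major >= 18;
}

// process.versions.node holds the running version, for example "20.11.1".
const current = process.versions.node;
console.log(
  `Node.js ${current}: global fetch available = ${supportsGlobalFetch(current)}`
);
```

With the `setup_20.x` installer used above, the check is expected to report `true`.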

## Create Node.js project

Create a new Node.js project and install the required packages, `@huggingface/transformers` and `wavefile`, from [npm](https://www.npmjs.com/package/@huggingface/transformers).

In [None]:
# Create project directory
!mkdir gemma3n-node
%cd gemma3n-node

# Initialize NPM project
!npm init -y
!npm i @huggingface/transformers wavefile

In [None]:
%%writefile package.json

{
  "name": "gemma3n-node",
  "version": "1.0.0",
  "main": "index.js",
  "type": "module",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "keywords": [],
  "author": "",
  "license": "ISC",
  "description": "",
  "dependencies": {
    "@huggingface/transformers": "^3.8.1",
    "wavefile": "^11.0.0"
  }
}

## Transformers.js Inference

Now, let's run inference on the Gemma 3n model using Transformers.js. Each example below loads the processor and model, prepares the inputs for one combination of modalities (image and text, audio and text, or image and audio), runs generation, and decodes the output. For reference, see the [ONNX model page on the Hugging Face Hub](https://huggingface.co/onnx-community/gemma-3n-E2B-it-ONNX).
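
The three examples in this section differ mainly in the `content` entries of the user message and in which inputs are passed to the processor. As a rough sketch of that message structure, the hypothetical helper `buildMessages` below (an illustration, not a Transformers.js API) assembles the placeholder entries for whichever modalities are present.

```javascript
// Hypothetical helper (not part of Transformers.js): builds a single-turn
// message list with one placeholder entry per modality that is present.
function buildMessages({ image = false, audio = false, text = "" } = {}) {
  const content = [];
  if (image) content.push({ type: "image" });
  if (audio) content.push({ type: "audio" });
  if (text) content.push({ type: "text", text });
  return [{ role: "user", content }];
}

// Same structure as the image + text example in this notebook.
const exampleMessages = buildMessages({
  image: true,
  text: "Describe this image in detail.",
});
console.log(JSON.stringify(exampleMessages, null, 2));
```

The `{ type: "image" }` and `{ type: "audio" }` entries only mark where each input belongs in the prompt; the actual image and audio data are passed to the processor separately.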

### Image + Text Inference

In [None]:
# Show the image from the URL
from PIL import Image
import requests

url = "https://jethac.github.io/assets/juice.jpg"
img = Image.open(requests.get(url, stream=True).raw)
img

In [None]:
%%writefile index.js

// Import the required modules
import {
  AutoProcessor,
  AutoModelForImageTextToText,
  load_image,
} from "@huggingface/transformers";

// Load processor and model
const model_id = "onnx-community/gemma-3n-E2B-it-ONNX";
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await AutoModelForImageTextToText.from_pretrained(model_id, {
  dtype: {
    embed_tokens: "q8",
    audio_encoder: "q8",
    vision_encoder: "fp16",
    decoder_model_merged: "q4",
  },
  device: "cpu",
});

// Define the list of messages
const messages = [
  {
    role: "user",
    content: [
      { type: "image" },
      { type: "text", text: "Describe this image in detail." },
    ],
  },
];

try {
  // Prepare prompt
  const prompt = processor.apply_chat_template(messages, {
    add_generation_prompt: true,
  });

  // Prepare inputs
  const url = "https://jethac.github.io/assets/juice.jpg";
  const image = await load_image(url);
  const audio = null;

  const inputs = await processor(prompt, image, audio, {
    add_special_tokens: false,
  });

  // Generate output
  const outputs = await model.generate({
    ...inputs,
    max_new_tokens: 512,
    do_sample: false,
  });

  // Decode output
  const promptLen = inputs.input_ids.dims.at(-1);
  const decoded = processor.batch_decode(outputs.slice(null, [promptLen, null]), {
    skip_special_tokens: true,
  });

  console.log(decoded[0]);
} catch (error) {
  console.error("Error generating response:", error);
}

In [None]:
# Run the Node.js application (Image + Text)
!node index.js
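
The decode step in the script slices the generated sequences at `promptLen` so that only newly generated tokens are decoded, not the prompt itself. The sketch below illustrates that idea with plain arrays standing in for the output tensor; `stripPrompt` and the token ids are made up for illustration.

```javascript
// Illustrative stand-in for outputs.slice(null, [promptLen, null]) using
// plain arrays: for every sequence in the batch, drop the first promptLen
// token ids and keep only the newly generated ones.
function stripPrompt(batchTokenIds, promptLen) {
  return batchTokenIds.map((ids) => ids.slice(promptLen));
}

// Made-up token ids: 3 prompt tokens followed by 2 generated tokens.
const outputIds = [[2, 105, 106, 9001, 9002]];
const generatedOnly = stripPrompt(outputIds, 3);
console.log(generatedOnly); // only the last two ids of the sequence remain
```

Without this slicing, the decoded string would start with the text of the prompt.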

### Audio + Text Inference

In [None]:
# Display audio
from IPython.display import Audio
import requests

url = "https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav"

audio_bytes = requests.get(url).content
Audio(audio_bytes)

In [None]:
%%writefile index.js

// Import the required modules
import {
  AutoProcessor,
  AutoModelForImageTextToText,
} from "@huggingface/transformers";
import wavefile from "wavefile";

// Load processor and model
const model_id = "onnx-community/gemma-3n-E2B-it-ONNX";
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await AutoModelForImageTextToText.from_pretrained(model_id, {
  dtype: {
    embed_tokens: "q8",
    audio_encoder: "q4",
    vision_encoder: "fp16",
    decoder_model_merged: "q4",
  },
  device: "cpu",
});

// Define the list of messages
const messages = [
  {
    role: "user",
    content: [
      { type: "audio" },
      { type: "text", text: "Transcribe this audio." },
    ],
  },
];

try {
  // Prepare prompt
  const prompt = processor.apply_chat_template(messages, {
    add_generation_prompt: true,
  });

  // Prepare inputs (audio from URL -> Float32Array @ model sample rate)
  const url =
    "https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav";

  const buffer = Buffer.from(await fetch(url).then((x) => x.arrayBuffer()));
  const wav = new wavefile.WaveFile(buffer);

  // Pipeline expects Float32 samples
  wav.toBitDepth("32f");
  wav.toSampleRate(processor.feature_extractor.config.sampling_rate);

  let audioData = wav.getSamples();

  // Convert stereo -> mono (simple average with sqrt(2) normalization, like the docs)
  if (Array.isArray(audioData)) {
    if (audioData.length > 1) {
      for (let i = 0; i < audioData[0].length; ++i) {
        audioData[0][i] =
          (Math.sqrt(2) * (audioData[0][i] + audioData[1][i])) / 2;
      }
    }
    audioData = audioData[0];
  }

  const image = null;
  const audio = audioData;

  const inputs = await processor(prompt, image, audio, {
    add_special_tokens: false,
  });

  // Generate output
  const outputs = await model.generate({
    ...inputs,
    max_new_tokens: 512,
    do_sample: false,
  });

  // Decode output
  const promptLen = inputs.input_ids.dims.at(-1);
  const decoded = processor.batch_decode(outputs.slice(null, [promptLen, null]), {
    skip_special_tokens: true,
  });

  console.log(decoded[0]);
} catch (error) {
  console.error("Error generating response:", error);
}

In [None]:
# Run the Node.js application (Audio + Text)
!node index.js
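
The stereo-to-mono step in the script can also be written as a small standalone function. This is a sketch under stated assumptions: `toMono` is a hypothetical helper name, it expects either a single typed array (mono) or an array of per-channel typed arrays, and it uses the same `sqrt(2) / 2` scaling as the script. Unlike the script, it returns a new `Float32Array` instead of overwriting the first channel.

```javascript
// Hypothetical helper: downmix decoded samples to a single Float32Array.
// `samples` is either one typed array (mono) or an array of per-channel
// typed arrays; only the first two channels are mixed.
function toMono(samples) {
  if (!Array.isArray(samples)) return Float32Array.from(samples);
  if (samples.length === 1) return Float32Array.from(samples[0]);

  const [left, right] = samples;
  const mono = new Float32Array(left.length);
  const scale = Math.sqrt(2) / 2; // same scaling as the script above
  for (let i = 0; i < left.length; ++i) {
    mono[i] = scale * (left[i] + right[i]);
  }
  return mono;
}

// Two made-up stereo frames: identical channels, then opposite channels.
const demo = toMono([
  new Float64Array([0.5, -0.5]),
  new Float64Array([0.5, 0.5]),
]);
console.log(demo.length); // 2 mono samples from 2 stereo frames
```

Opposite-phase channels cancel to zero in this downmix, which is expected for a sum-based mix.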

### Image + Audio Inference

In [None]:
# Show the image from the URL
from PIL import Image
import requests

url = "https://jethac.github.io/assets/juice.jpg"
img = Image.open(requests.get(url, stream=True).raw)
img

In [None]:
# Display audio
from IPython.display import Audio
import requests

url = "https://raw.githubusercontent.com/sitammeur/test-assets/main/cat.wav"

audio_bytes = requests.get(url).content
Audio(audio_bytes)

In [None]:
%%writefile index.js

// Import the required modules
import {
  AutoProcessor,
  AutoModelForImageTextToText,
  load_image,
} from "@huggingface/transformers";
import wavefile from "wavefile";

// Load processor and model
const model_id = "onnx-community/gemma-3n-E2B-it-ONNX";
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await AutoModelForImageTextToText.from_pretrained(model_id, {
  dtype: {
    embed_tokens: "q8",
    audio_encoder: "q4",
    vision_encoder: "fp16",
    decoder_model_merged: "q4",
  },
  device: "cpu",
});

// Define the list of messages
const messages = [
  {
    role: "user",
    content: [{ type: "image" }, { type: "audio" }],
  },
];

try {
  // Prepare prompt
  const prompt = processor.apply_chat_template(messages, {
    add_generation_prompt: true,
  });

  // Prepare inputs (image + audio from URLs)
  const imageUrl = "https://jethac.github.io/assets/juice.jpg";
  const image = await load_image(imageUrl);

  const audioUrl = "https://raw.githubusercontent.com/sitammeur/test-assets/main/cat.wav";
  const buffer = Buffer.from(await fetch(audioUrl).then((r) => r.arrayBuffer()));
  const wav = new wavefile.WaveFile(buffer);

  // Pipeline expects Float32 samples at model sample rate
  wav.toBitDepth("32f");
  wav.toSampleRate(processor.feature_extractor.config.sampling_rate);

  let audioData = wav.getSamples();

  // Convert stereo -> mono
  if (Array.isArray(audioData) && audioData.length > 1) {
    const left = audioData[0];
    const right = audioData[1];
    audioData = left.map((_, i) => (Math.sqrt(2) * (left[i] + right[i])) / 2);
  } else if (Array.isArray(audioData)) {
    // If it's an array wrapper (single channel), unwrap it
    audioData = audioData[0];
  }

  const inputs = await processor(prompt, image, audioData, {
    add_special_tokens: false,
  });

  // Generate output
  const outputs = await model.generate({
    ...inputs,
    max_new_tokens: 512,
    do_sample: false,
  });

  // Decode output
  const promptLen = inputs.input_ids.dims.at(-1);
  const decoded = processor.batch_decode(outputs.slice(null, [promptLen, null]), {
    skip_special_tokens: true,
  });

  console.log(decoded[0]);
} catch (error) {
  console.error("Error generating response:", error);
}

In [None]:
# Run the Node.js application (Image + Audio)
!node index.js

## Conclusion

Congratulations! You have successfully run inference on the Gemma 3n model using Transformers.js in a Node.js environment. You can now integrate this into your own projects.