2.9.0
## What's new?
### 🚀 Exciting new tasks!
Transformers.js v2.9.0 adds support for three new tasks: (1) Depth estimation, (2) Zero-shot object detection, and (3) Optical document understanding.
#### 🕵️‍♂️ Depth Estimation
The task of predicting the depth of objects present in an image. See here for more information.
```js
import { pipeline } from '@xenova/transformers';

// Create depth estimation pipeline
let depth_estimator = await pipeline('depth-estimation', 'Xenova/dpt-hybrid-midas');

// Predict depth for image
let url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/cats.jpg';
let output = await depth_estimator(url);
```
*(Input image and predicted depth map comparison omitted.)*
Raw output:
```js
// {
//   predicted_depth: Tensor {
//     dims: [ 384, 384 ],
//     type: 'float32',
//     data: Float32Array(147456) [ 542.859130859375, 545.2833862304688, 546.1649169921875, ... ],
//     size: 147456
//   },
//   depth: RawImage {
//     data: Uint8Array(307200) [ 86, 86, 86, ... ],
//     width: 640,
//     height: 480,
//     channels: 1
//   }
// }
```
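If you'd rather post-process the raw depth tensor yourself instead of using the ready-made `depth` image, a minimal sketch like the following normalizes the predicted values into the 0–255 range (the variable names here are illustrative, not part of the API):

```js
// Normalize the raw depth values to [0, 255] for visualization.
// `output` is the result of the depth-estimation pipeline above.
const { data, dims } = output.predicted_depth; // Float32Array of shape [384, 384]

let min = Infinity, max = -Infinity;
for (const v of data) {
  if (v < min) min = v;
  if (v > max) max = v;
}

// Map each depth value to a grayscale intensity
const grayscale = Uint8Array.from(data, (v) => Math.round(255 * (v - min) / (max - min)));
console.log(grayscale.length === dims[0] * dims[1]); // true
```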
#### 🎯 Zero-shot Object Detection
The task of identifying objects of classes that are unseen during training. See here for more information.
```js
import { pipeline } from '@xenova/transformers';

// Create zero-shot object detection pipeline
let detector = await pipeline('zero-shot-object-detection', 'Xenova/owlvit-base-patch32');

// Predict bounding boxes
let url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/astronaut.png';
let candidate_labels = ['human face', 'rocket', 'helmet', 'american flag'];
let output = await detector(url, candidate_labels);
```
Raw output:
```js
// [
//   {
//     score: 0.24392342567443848,
//     label: 'human face',
//     box: { xmin: 180, ymin: 67, xmax: 274, ymax: 175 }
//   },
//   {
//     score: 0.15129457414150238,
//     label: 'american flag',
//     box: { xmin: 0, ymin: 4, xmax: 106, ymax: 513 }
//   },
//   {
//     score: 0.13649864494800568,
//     label: 'helmet',
//     box: { xmin: 277, ymin: 337, xmax: 511, ymax: 511 }
//   },
//   {
//     score: 0.10262022167444229,
//     label: 'rocket',
//     box: { xmin: 352, ymin: -1, xmax: 463, ymax: 287 }
//   }
// ]
```
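The detections come back as a flat array, so plain JavaScript is enough to post-process them. As a hedged example (the 0.1 threshold is an arbitrary choice, not from the release notes), you could filter out low-confidence boxes like this:

```js
// Keep only detections above a confidence threshold and log them
const threshold = 0.1;
for (const { score, label, box } of output) {
  if (score < threshold) continue;
  const { xmin, ymin, xmax, ymax } = box;
  console.log(`${label}: ${(100 * score).toFixed(1)}% at [${xmin}, ${ymin}, ${xmax}, ${ymax}]`);
}
```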
#### 📝 Optical Document Understanding (image-to-text)
This task involves translating images of scientific PDFs to markdown, enabling easier access to them. See here for more information.
```js
import { pipeline } from '@xenova/transformers';

// Create image-to-text pipeline
let pipe = await pipeline('image-to-text', 'Xenova/nougat-small');

// Generate markdown
let url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/nougat_paper.png';
let output = await pipe(url, {
  min_length: 1,
  max_new_tokens: 40,
  bad_words_ids: [[pipe.tokenizer.unk_token_id]],
});
// [{ generated_text: "# Nougat: Neural Optical Understanding for Academic Documents\n\nLukas Blecher\n\nCorrespondence to: lblecher@meta.com\n\nGuillem Cucur" }]
```
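Note that `max_new_tokens: 40` truncates the output for demonstration purposes. Since the pipeline takes one image at a time, a multi-page document rendered to images can be converted page by page. A rough sketch, where the page URLs are placeholders (not real files):

```js
// Hypothetical list of page images rendered from a PDF (placeholder URLs)
let pages = ['https://example.com/paper-page-1.png', 'https://example.com/paper-page-2.png'];

let markdown = '';
for (let page of pages) {
  let [{ generated_text }] = await pipe(page, {
    min_length: 1,
    bad_words_ids: [[pipe.tokenizer.unk_token_id]],
  });
  markdown += generated_text + '\n\n';
}
console.log(markdown);
```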
### 💻 New architectures: Nougat, DPT, GLPN, OwlViT
We added support for 4 new architectures, bringing the total up to 61!
- DPT for depth estimation. See here for the list of available models.
- GLPN for depth estimation. See here for the list of available models.
- OwlViT for zero-shot object detection. See here for the list of available models.
- Nougat for optical understanding of academic documents (`image-to-text`). See here for the list of available models.
### 🔨 Other improvements
- Add support for Grouped Query Attention on Llama Model by @felladrin in #393
- Implement max character check by @samlhuillier in #398
- Add `CLIPFeatureExtractor` (and tests) in #387
- Add jsDelivr stats to README in #395
- Update sharp dependency version in #400
### 🐛 Bug fixes
- Move tensor clone to fix Worker ownership NaN issue by @kungfooman in #404
- Add default `token_type_ids` for `multilingual-e5-*` models by @do-me in #403 (a usage sketch follows this list)
- Ensure WASM fallback does not crash in GH actions in #402
### 🤗 New contributors
- @felladrin made their first contribution in #393
- @samlhuillier made their first contribution in #398
- @do-me made their first contribution in #403
Full Changelog: 2.8.0...2.9.0