Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -210,6 +210,7 @@ You can refine your search by selecting the task you're interested in (e.g., [te
| [Token Classification](https://huggingface.co/tasks/token-classification) | `token-classification` or `ner` | Assigning a label to each token in a text. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.TokenClassificationPipeline)<br>[(models)](https://huggingface.co/models?pipeline_tag=token-classification&library=transformers.js) |
| [Translation](https://huggingface.co/tasks/translation) | `translation` | Converting text from one language to another. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.TranslationPipeline)<br>[(models)](https://huggingface.co/models?pipeline_tag=translation&library=transformers.js) |
| [Zero-Shot Classification](https://huggingface.co/tasks/zero-shot-classification) | `zero-shot-classification` | Classifying text into classes that are unseen during training. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.ZeroShotClassificationPipeline)<br>[(models)](https://huggingface.co/models?pipeline_tag=zero-shot-classification&library=transformers.js) |
| [Feature Extraction](https://huggingface.co/tasks/feature-extraction) | `feature-extraction` | Transforming raw data into numerical features that can be processed while preserving the information in the original dataset. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.FeatureExtractionPipeline)<br>[(models)](https://huggingface.co/models?pipeline_tag=feature-extraction&library=transformers.js) |

#### Vision

Expand All @@ -223,6 +224,7 @@ You can refine your search by selecting the task you're interested in (e.g., [te
| [Object Detection](https://huggingface.co/tasks/object-detection) | `object-detection` | Identify objects of certain defined classes within an image. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.ObjectDetectionPipeline)<br>[(models)](https://huggingface.co/models?pipeline_tag=object-detection&library=transformers.js) |
| [Video Classification](https://huggingface.co/tasks/video-classification) | n/a | Assigning a label or class to an entire video. | ❌ |
| [Unconditional Image Generation](https://huggingface.co/tasks/unconditional-image-generation) | n/a | Generating images with no condition in any context (like a prompt text or another image). | ❌ |
| [Image Feature Extraction](https://huggingface.co/tasks/image-feature-extraction) | `image-feature-extraction` | Transforming raw data into numerical features that can be processed while preserving the information in the original image. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.ImageFeatureExtractionPipeline)<br>[(models)](https://huggingface.co/models?pipeline_tag=image-feature-extraction&library=transformers.js) |

#### Audio

Expand All @@ -247,7 +249,6 @@ You can refine your search by selecting the task you're interested in (e.g., [te
| Task | ID | Description | Supported? |
|--------------------------|----|-------------|------------|
| [Document Question Answering](https://huggingface.co/tasks/document-question-answering) | `document-question-answering` | Answering questions on document images. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.DocumentQuestionAnsweringPipeline)<br>[(models)](https://huggingface.co/models?pipeline_tag=document-question-answering&library=transformers.js) |
| [Feature Extraction](https://huggingface.co/tasks/feature-extraction) | `feature-extraction` | Transforming raw data into numerical features that can be processed while preserving the information in the original dataset. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.FeatureExtractionPipeline)<br>[(models)](https://huggingface.co/models?pipeline_tag=feature-extraction&library=transformers.js) |
| [Image-to-Text](https://huggingface.co/tasks/image-to-text) | `image-to-text` | Output text from a given image. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.ImageToTextPipeline)<br>[(models)](https://huggingface.co/models?pipeline_tag=image-to-text&library=transformers.js) |
| [Text-to-Image](https://huggingface.co/tasks/text-to-image) | `text-to-image` | Generates images from input text. | ❌ |
| [Visual Question Answering](https://huggingface.co/tasks/visual-question-answering) | `visual-question-answering` | Answering open-ended questions based on an image. | ❌ |
Expand Down
3 changes: 2 additions & 1 deletion docs/snippets/5_supported-tasks.snippet
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@
| [Token Classification](https://huggingface.co/tasks/token-classification) | `token-classification` or `ner` | Assigning a label to each token in a text. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.TokenClassificationPipeline)<br>[(models)](https://huggingface.co/models?pipeline_tag=token-classification&library=transformers.js) |
| [Translation](https://huggingface.co/tasks/translation) | `translation` | Converting text from one language to another. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.TranslationPipeline)<br>[(models)](https://huggingface.co/models?pipeline_tag=translation&library=transformers.js) |
| [Zero-Shot Classification](https://huggingface.co/tasks/zero-shot-classification) | `zero-shot-classification` | Classifying text into classes that are unseen during training. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.ZeroShotClassificationPipeline)<br>[(models)](https://huggingface.co/models?pipeline_tag=zero-shot-classification&library=transformers.js) |
| [Feature Extraction](https://huggingface.co/tasks/feature-extraction) | `feature-extraction` | Transforming raw data into numerical features that can be processed while preserving the information in the original dataset. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.FeatureExtractionPipeline)<br>[(models)](https://huggingface.co/models?pipeline_tag=feature-extraction&library=transformers.js) |

#### Vision

Expand All @@ -30,6 +31,7 @@
| [Object Detection](https://huggingface.co/tasks/object-detection) | `object-detection` | Identify objects of certain defined classes within an image. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.ObjectDetectionPipeline)<br>[(models)](https://huggingface.co/models?pipeline_tag=object-detection&library=transformers.js) |
| [Video Classification](https://huggingface.co/tasks/video-classification) | n/a | Assigning a label or class to an entire video. | ❌ |
| [Unconditional Image Generation](https://huggingface.co/tasks/unconditional-image-generation) | n/a | Generating images with no condition in any context (like a prompt text or another image). | ❌ |
| [Image Feature Extraction](https://huggingface.co/tasks/image-feature-extraction) | `image-feature-extraction` | Transforming raw data into numerical features that can be processed while preserving the information in the original image. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.ImageFeatureExtractionPipeline)<br>[(models)](https://huggingface.co/models?pipeline_tag=image-feature-extraction&library=transformers.js) |

#### Audio

Expand All @@ -54,7 +56,6 @@
| Task | ID | Description | Supported? |
|--------------------------|----|-------------|------------|
| [Document Question Answering](https://huggingface.co/tasks/document-question-answering) | `document-question-answering` | Answering questions on document images. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.DocumentQuestionAnsweringPipeline)<br>[(models)](https://huggingface.co/models?pipeline_tag=document-question-answering&library=transformers.js) |
| [Feature Extraction](https://huggingface.co/tasks/feature-extraction) | `feature-extraction` | Transforming raw data into numerical features that can be processed while preserving the information in the original dataset. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.FeatureExtractionPipeline)<br>[(models)](https://huggingface.co/models?pipeline_tag=feature-extraction&library=transformers.js) |
| [Image-to-Text](https://huggingface.co/tasks/image-to-text) | `image-to-text` | Output text from a given image. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.ImageToTextPipeline)<br>[(models)](https://huggingface.co/models?pipeline_tag=image-to-text&library=transformers.js) |
| [Text-to-Image](https://huggingface.co/tasks/text-to-image) | `text-to-image` | Generates images from input text. | ❌ |
| [Visual Question Answering](https://huggingface.co/tasks/visual-question-answering) | `visual-question-answering` | Answering open-ended questions based on an image. | ❌ |
Expand Down
15 changes: 13 additions & 2 deletions src/models.js
Original file line number Diff line number Diff line change
Expand Up @@ -5800,6 +5800,12 @@ const MODEL_FOR_DEPTH_ESTIMATION_MAPPING_NAMES = new Map([
['glpn', ['GLPNForDepthEstimation', GLPNForDepthEstimation]],
])

// NOTE: This is custom to Transformers.js, and is necessary because certain models
// (e.g., CLIP) are split into vision and text components
const MODEL_FOR_IMAGE_FEATURE_EXTRACTION_MAPPING_NAMES = new Map([
['clip', ['CLIPVisionModelWithProjection', CLIPVisionModelWithProjection]],
['siglip', ['SiglipVisionModel', SiglipVisionModel]],
])

const MODEL_CLASS_TYPE_MAPPING = [
[MODEL_MAPPING_NAMES_ENCODER_ONLY, MODEL_TYPES.EncoderOnly],
Expand Down Expand Up @@ -5828,6 +5834,9 @@ const MODEL_CLASS_TYPE_MAPPING = [
[MODEL_FOR_TEXT_TO_WAVEFORM_MAPPING_NAMES, MODEL_TYPES.EncoderOnly],
[MODEL_FOR_AUDIO_XVECTOR_MAPPING_NAMES, MODEL_TYPES.EncoderOnly],
[MODEL_FOR_AUDIO_FRAME_CLASSIFICATION_MAPPING_NAMES, MODEL_TYPES.EncoderOnly],

// Custom:
[MODEL_FOR_IMAGE_FEATURE_EXTRACTION_MAPPING_NAMES, MODEL_TYPES.EncoderOnly],
];

for (const [mappings, type] of MODEL_CLASS_TYPE_MAPPING) {
Expand All @@ -5841,9 +5850,7 @@ for (const [mappings, type] of MODEL_CLASS_TYPE_MAPPING) {

const CUSTOM_MAPPING = [
['CLIPTextModelWithProjection', CLIPTextModelWithProjection, MODEL_TYPES.EncoderOnly],
['CLIPVisionModelWithProjection', CLIPVisionModelWithProjection, MODEL_TYPES.EncoderOnly],
['SiglipTextModel', SiglipTextModel, MODEL_TYPES.EncoderOnly],
['SiglipVisionModel', SiglipVisionModel, MODEL_TYPES.EncoderOnly],
['ClapTextModelWithProjection', ClapTextModelWithProjection, MODEL_TYPES.EncoderOnly],
['ClapAudioModelWithProjection', ClapAudioModelWithProjection, MODEL_TYPES.EncoderOnly],
]
Expand Down Expand Up @@ -6070,6 +6077,10 @@ export class AutoModelForDepthEstimation extends PretrainedMixin {
static MODEL_CLASS_MAPPINGS = [MODEL_FOR_DEPTH_ESTIMATION_MAPPING_NAMES];
}

export class AutoModelForImageFeatureExtraction extends PretrainedMixin {
static MODEL_CLASS_MAPPINGS = [MODEL_FOR_IMAGE_FEATURE_EXTRACTION_MAPPING_NAMES];
}

//////////////////////////////////////////////////

//////////////////////////////////////////////////
Expand Down
88 changes: 88 additions & 0 deletions src/pipelines.js
Original file line number Diff line number Diff line change
Expand Up @@ -39,6 +39,7 @@ import {
AutoModelForDocumentQuestionAnswering,
AutoModelForImageToImage,
AutoModelForDepthEstimation,
AutoModelForImageFeatureExtraction,
PreTrainedModel,
} from './models.js';
import {
Expand Down Expand Up @@ -1206,6 +1207,82 @@ export class FeatureExtractionPipeline extends (/** @type {new (options: TextPip
}
}


/**
* @typedef {Object} ImageFeatureExtractionPipelineOptions Parameters specific to image feature extraction pipelines.
* @property {boolean} [pool=null] Whether or not to return the pooled output. If set to `false`, the model will return the raw hidden states.
*
* @callback ImageFeatureExtractionPipelineCallback Extract the features of the input(s).
* @param {ImagePipelineInputs} images One or several images (or one list of images) to get the features of.
* @param {ImageFeatureExtractionPipelineOptions} [options] The options to use for image feature extraction.
* @returns {Promise<Tensor>} The image features computed by the model.
*
* @typedef {ImagePipelineConstructorArgs & ImageFeatureExtractionPipelineCallback & Disposable} ImageFeatureExtractionPipelineType
*/

/**
* Image feature extraction pipeline using no model head. This pipeline extracts the hidden
* states from the base transformer, which can be used as features in downstream tasks.
*
* **Example:** Perform image feature extraction with `Xenova/vit-base-patch16-224-in21k`.
* ```javascript
* const image_feature_extractor = await pipeline('image-feature-extraction', 'Xenova/vit-base-patch16-224-in21k');
* const url = 'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png';
* const features = await image_feature_extractor(url);
* // Tensor {
* // dims: [ 1, 197, 768 ],
* // type: 'float32',
* // data: Float32Array(151296) [ ... ],
* // size: 151296
* // }
* ```
*
* **Example:** Compute image embeddings with `Xenova/clip-vit-base-patch32`.
* ```javascript
* const image_feature_extractor = await pipeline('image-feature-extraction', 'Xenova/clip-vit-base-patch32');
* const url = 'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png';
* const features = await image_feature_extractor(url);
* // Tensor {
* // dims: [ 1, 512 ],
* // type: 'float32',
* // data: Float32Array(512) [ ... ],
* // size: 512
* // }
* ```
*/
export class ImageFeatureExtractionPipeline extends (/** @type {new (options: ImagePipelineConstructorArgs) => ImageFeatureExtractionPipelineType} */ (Pipeline)) {
/**
* Create a new ImageFeatureExtractionPipeline.
* @param {ImagePipelineConstructorArgs} options An object used to instantiate the pipeline.
*/
constructor(options) {
super(options);
}

/** @type {ImageFeatureExtractionPipelineCallback} */
async _call(images, {
pool = null,
} = {}) {

const preparedImages = await prepareImages(images);
const { pixel_values } = await this.processor(preparedImages);
const outputs = await this.model({ pixel_values });

/** @type {Tensor} */
let result;
if (pool) {
if (!('pooler_output' in outputs)) {
throw Error(`No pooled output was returned. Make sure the model has a 'pooler' layer when using the 'pool' option.`);
}
result = outputs.pooler_output;

} else {
result = outputs.last_hidden_state ?? outputs.logits ?? outputs.image_embeds;
}
return result;
}
}

// TODO
// export class SentenceSimilarityPipeline extends Pipeline {
// }
Expand Down Expand Up @@ -2953,6 +3030,17 @@ const SUPPORTED_TASKS = Object.freeze({
},
"type": "text",
},
"image-feature-extraction": {
"processor": AutoProcessor,
"pipeline": ImageFeatureExtractionPipeline,
"model": [AutoModelForImageFeatureExtraction, AutoModel],
"default": {
// TODO: replace with original
// "model": "google/vit-base-patch16-224",
"model": "Xenova/vit-base-patch16-224-in21k",
},
"type": "image",
},
})


Expand Down