From fd9683a66c46ba22d6f1a9116e8b896e972f6b87 Mon Sep 17 00:00:00 2001 From: Joshua Lochner Date: Mon, 18 Mar 2024 01:32:41 +0200 Subject: [PATCH 1/2] Add `image-feature-extractor` pipeline --- README.md | 3 +- docs/snippets/5_supported-tasks.snippet | 3 +- src/models.js | 15 ++++- src/pipelines.js | 88 +++++++++++++++++++++++++ 4 files changed, 105 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index 589af9de5..854156568 100644 --- a/README.md +++ b/README.md @@ -210,6 +210,7 @@ You can refine your search by selecting the task you're interested in (e.g., [te | [Token Classification](https://huggingface.co/tasks/token-classification) | `token-classification` or `ner` | Assigning a label to each token in a text. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.TokenClassificationPipeline)
[(models)](https://huggingface.co/models?pipeline_tag=token-classification&library=transformers.js) | | [Translation](https://huggingface.co/tasks/translation) | `translation` | Converting text from one language to another. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.TranslationPipeline)
[(models)](https://huggingface.co/models?pipeline_tag=translation&library=transformers.js) | | [Zero-Shot Classification](https://huggingface.co/tasks/zero-shot-classification) | `zero-shot-classification` | Classifying text into classes that are unseen during training. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.ZeroShotClassificationPipeline)
[(models)](https://huggingface.co/models?pipeline_tag=zero-shot-classification&library=transformers.js) | +| [Feature Extraction](https://huggingface.co/tasks/feature-extraction) | `feature-extraction` | Transforming raw data into numerical features that can be processed while preserving the information in the original dataset. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.FeatureExtractionPipeline)
[(models)](https://huggingface.co/models?pipeline_tag=feature-extraction&library=transformers.js) | #### Vision @@ -223,6 +224,7 @@ You can refine your search by selecting the task you're interested in (e.g., [te | [Object Detection](https://huggingface.co/tasks/object-detection) | `object-detection` | Identify objects of certain defined classes within an image. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.ObjectDetectionPipeline)
[(models)](https://huggingface.co/models?pipeline_tag=object-detection&library=transformers.js) | | [Video Classification](https://huggingface.co/tasks/video-classification) | n/a | Assigning a label or class to an entire video. | ❌ | | [Unconditional Image Generation](https://huggingface.co/tasks/unconditional-image-generation) | n/a | Generating images with no condition in any context (like a prompt text or another image). | ❌ | +| [Feature Extraction](https://huggingface.co/tasks/image-feature-extraction) | `image-feature-extraction` | Transforming raw data into numerical features that can be processed while preserving the information in the original image. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.ImageFeatureExtractionPipeline)
[(models)](https://huggingface.co/models?pipeline_tag=image-feature-extraction&library=transformers.js) | #### Audio @@ -247,7 +249,6 @@ You can refine your search by selecting the task you're interested in (e.g., [te | Task | ID | Description | Supported? | |--------------------------|----|-------------|------------| | [Document Question Answering](https://huggingface.co/tasks/document-question-answering) | `document-question-answering` | Answering questions on document images. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.DocumentQuestionAnsweringPipeline)
[(models)](https://huggingface.co/models?pipeline_tag=document-question-answering&library=transformers.js) | -| [Feature Extraction](https://huggingface.co/tasks/feature-extraction) | `feature-extraction` | Transforming raw data into numerical features that can be processed while preserving the information in the original dataset. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.FeatureExtractionPipeline)
[(models)](https://huggingface.co/models?pipeline_tag=feature-extraction&library=transformers.js) | | [Image-to-Text](https://huggingface.co/tasks/image-to-text) | `image-to-text` | Output text from a given image. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.ImageToTextPipeline)
[(models)](https://huggingface.co/models?pipeline_tag=image-to-text&library=transformers.js) | | [Text-to-Image](https://huggingface.co/tasks/text-to-image) | `text-to-image` | Generates images from input text. | ❌ | | [Visual Question Answering](https://huggingface.co/tasks/visual-question-answering) | `visual-question-answering` | Answering open-ended questions based on an image. | ❌ | diff --git a/docs/snippets/5_supported-tasks.snippet b/docs/snippets/5_supported-tasks.snippet index 838026092..a807145b0 100644 --- a/docs/snippets/5_supported-tasks.snippet +++ b/docs/snippets/5_supported-tasks.snippet @@ -17,6 +17,7 @@ | [Token Classification](https://huggingface.co/tasks/token-classification) | `token-classification` or `ner` | Assigning a label to each token in a text. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.TokenClassificationPipeline)
[(models)](https://huggingface.co/models?pipeline_tag=token-classification&library=transformers.js) | | [Translation](https://huggingface.co/tasks/translation) | `translation` | Converting text from one language to another. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.TranslationPipeline)
[(models)](https://huggingface.co/models?pipeline_tag=translation&library=transformers.js) | | [Zero-Shot Classification](https://huggingface.co/tasks/zero-shot-classification) | `zero-shot-classification` | Classifying text into classes that are unseen during training. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.ZeroShotClassificationPipeline)
[(models)](https://huggingface.co/models?pipeline_tag=zero-shot-classification&library=transformers.js) | +| [Feature Extraction](https://huggingface.co/tasks/feature-extraction) | `feature-extraction` | Transforming raw data into numerical features that can be processed while preserving the information in the original dataset. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.FeatureExtractionPipeline)
[(models)](https://huggingface.co/models?pipeline_tag=feature-extraction&library=transformers.js) | #### Vision @@ -30,6 +31,7 @@ | [Object Detection](https://huggingface.co/tasks/object-detection) | `object-detection` | Identify objects of certain defined classes within an image. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.ObjectDetectionPipeline)
[(models)](https://huggingface.co/models?pipeline_tag=object-detection&library=transformers.js) | | [Video Classification](https://huggingface.co/tasks/video-classification) | n/a | Assigning a label or class to an entire video. | ❌ | | [Unconditional Image Generation](https://huggingface.co/tasks/unconditional-image-generation) | n/a | Generating images with no condition in any context (like a prompt text or another image). | ❌ | +| [Feature Extraction](https://huggingface.co/tasks/image-feature-extraction) | `image-feature-extraction` | Transforming raw data into numerical features that can be processed while preserving the information in the original image. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.ImageFeatureExtractionPipeline)
[(models)](https://huggingface.co/models?pipeline_tag=image-feature-extraction&library=transformers.js) | #### Audio @@ -54,7 +56,6 @@ | Task | ID | Description | Supported? | |--------------------------|----|-------------|------------| | [Document Question Answering](https://huggingface.co/tasks/document-question-answering) | `document-question-answering` | Answering questions on document images. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.DocumentQuestionAnsweringPipeline)
[(models)](https://huggingface.co/models?pipeline_tag=document-question-answering&library=transformers.js) | -| [Feature Extraction](https://huggingface.co/tasks/feature-extraction) | `feature-extraction` | Transforming raw data into numerical features that can be processed while preserving the information in the original dataset. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.FeatureExtractionPipeline)
[(models)](https://huggingface.co/models?pipeline_tag=feature-extraction&library=transformers.js) | | [Image-to-Text](https://huggingface.co/tasks/image-to-text) | `image-to-text` | Output text from a given image. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.ImageToTextPipeline)
[(models)](https://huggingface.co/models?pipeline_tag=image-to-text&library=transformers.js) | | [Text-to-Image](https://huggingface.co/tasks/text-to-image) | `text-to-image` | Generates images from input text. | ❌ | | [Visual Question Answering](https://huggingface.co/tasks/visual-question-answering) | `visual-question-answering` | Answering open-ended questions based on an image. | ❌ | diff --git a/src/models.js b/src/models.js index 1a9b021f1..2b4052633 100644 --- a/src/models.js +++ b/src/models.js @@ -5800,6 +5800,12 @@ const MODEL_FOR_DEPTH_ESTIMATION_MAPPING_NAMES = new Map([ ['glpn', ['GLPNForDepthEstimation', GLPNForDepthEstimation]], ]) +// NOTE: This is custom to Transformers.js, and is necessary because certain models +// (e.g., CLIP) are split into vision and text components +const MODEL_FOR_IMAGE_FEATURE_EXTRACTION_MAPPING_NAMES = new Map([ + ['clip', ['CLIPVisionModelWithProjection', CLIPVisionModelWithProjection]], + ['siglip', ['SiglipVisionModel', SiglipVisionModel]], +]) const MODEL_CLASS_TYPE_MAPPING = [ [MODEL_MAPPING_NAMES_ENCODER_ONLY, MODEL_TYPES.EncoderOnly], @@ -5828,6 +5834,9 @@ const MODEL_CLASS_TYPE_MAPPING = [ [MODEL_FOR_TEXT_TO_WAVEFORM_MAPPING_NAMES, MODEL_TYPES.EncoderOnly], [MODEL_FOR_AUDIO_XVECTOR_MAPPING_NAMES, MODEL_TYPES.EncoderOnly], [MODEL_FOR_AUDIO_FRAME_CLASSIFICATION_MAPPING_NAMES, MODEL_TYPES.EncoderOnly], + + // Custom: + [MODEL_FOR_IMAGE_FEATURE_EXTRACTION_MAPPING_NAMES, MODEL_TYPES.EncoderOnly], ]; for (const [mappings, type] of MODEL_CLASS_TYPE_MAPPING) { @@ -5841,9 +5850,7 @@ for (const [mappings, type] of MODEL_CLASS_TYPE_MAPPING) { const CUSTOM_MAPPING = [ ['CLIPTextModelWithProjection', CLIPTextModelWithProjection, MODEL_TYPES.EncoderOnly], - ['CLIPVisionModelWithProjection', CLIPVisionModelWithProjection, MODEL_TYPES.EncoderOnly], ['SiglipTextModel', SiglipTextModel, MODEL_TYPES.EncoderOnly], - ['SiglipVisionModel', SiglipVisionModel, MODEL_TYPES.EncoderOnly], ['ClapTextModelWithProjection', ClapTextModelWithProjection, MODEL_TYPES.EncoderOnly], ['ClapAudioModelWithProjection', ClapAudioModelWithProjection, MODEL_TYPES.EncoderOnly], ] @@ -6070,6 +6077,10 @@ export class AutoModelForDepthEstimation extends PretrainedMixin { static MODEL_CLASS_MAPPINGS = [MODEL_FOR_DEPTH_ESTIMATION_MAPPING_NAMES]; } +export class AutoModelForImageFeatureExtraction extends PretrainedMixin { + static MODEL_CLASS_MAPPINGS = [MODEL_FOR_IMAGE_FEATURE_EXTRACTION_MAPPING_NAMES]; +} + ////////////////////////////////////////////////// ////////////////////////////////////////////////// diff --git a/src/pipelines.js b/src/pipelines.js index 25dfb5875..2b064d522 100644 --- a/src/pipelines.js +++ b/src/pipelines.js @@ -39,6 +39,7 @@ import { AutoModelForDocumentQuestionAnswering, AutoModelForImageToImage, AutoModelForDepthEstimation, + AutoModelForImageFeatureExtraction, PreTrainedModel, } from './models.js'; import { @@ -1206,6 +1207,82 @@ export class FeatureExtractionPipeline extends (/** @type {new (options: TextPip } } + +/** + * @typedef {Object} ImageFeatureExtractionPipelineOptions Parameters specific to image feature extraction pipelines. + * @property {boolean} [pool=null] Whether or not to return the pooled output. If set to `false`, the model will return the raw hidden states. + * + * @callback ImageFeatureExtractionPipelineCallback Extract the features of the input(s). + * @param {ImagePipelineInputs} images One or several images (or one list of images) to get the features of. + * @param {ImageFeatureExtractionPipelineOptions} [options] The options to use for image feature extraction. + * @returns {Promise} The image features computed by the model. + * + * @typedef {ImagePipelineConstructorArgs & ImageFeatureExtractionPipelineCallback & Disposable} ImageFeatureExtractionPipelineType + */ + +/** + * Image feature extraction pipeline using no model head. This pipeline extracts the hidden + * states from the base transformer, which can be used as features in downstream tasks. + * + * **Example:** Perform image feature extraction with `Xenova/vit-base-patch16-224-in21k`. + * ```javascript + * const image_feature_extractor = await pipeline('image-feature-extraction', 'Xenova/vit-base-patch16-224-in21k'); + * const url = 'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png'; + * const features = await image_feature_extractor(url); + * // Tensor { + * // dims: [ 1, 197, 768 ], + * // type: 'float32', + * // data: Float32Array(151296) [ ... ], + * // size: 151296 + * // } + * ``` + * + * **Example:** Compute image embeddings with `Xenova/clip-vit-base-patch32`. + * ```javascript + * const image_feature_extractor = await pipeline('image-feature-extraction', 'Xenova/clip-vit-base-patch32'); + * const url = 'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png'; + * const features = await image_feature_extractor(url); + * // Tensor { + * // dims: [ 1, 512 ], + * // type: 'float32', + * // data: Float32Array(512) [ ... ], + * // size: 512 + * // } + * ``` + */ +export class ImageFeatureExtractionPipeline extends (/** @type {new (options: ImagePipelineConstructorArgs) => ImageFeatureExtractionPipelineType} */ (Pipeline)) { + /** + * Create a new ImageFeatureExtractionPipeline. + * @param {ImagePipelineConstructorArgs} options An object used to instantiate the pipeline. + */ + constructor(options) { + super(options); + } + + /** @type {ImageFeatureExtractionPipelineCallback} */ + async _call(images, { + pool = null, + } = {}) { + + const preparedImages = await prepareImages(images); + const { pixel_values } = await this.processor(preparedImages); + const outputs = await this.model({ pixel_values }); + + /** @type {Tensor} */ + let result; + if (pool) { + if (!('pooler_output' in outputs)) { + throw Error(`No pooled output was returned. Make sure the model has a 'pooler' layer when using the 'pool' option.`); + } + result = outputs.pooler_output; + + } else { + result = outputs.last_hidden_state ?? outputs.logits ?? outputs.image_embeds; + } + return result; + } +} + // TODO // export class SentenceSimilarityPipeline extends Pipeline { // } @@ -2953,6 +3030,17 @@ const SUPPORTED_TASKS = Object.freeze({ }, "type": "text", }, + "image-feature-extraction": { + "processor": AutoProcessor, + "pipeline": ImageFeatureExtractionPipeline, + "model": [AutoModelForImageFeatureExtraction, AutoModel], + "default": { + // TODO: replace with original + // "model": "google/vit-base-patch16-224", + "model": "Xenova/vit-base-patch16-224-in21k", + }, + "type": "image", + }, }) From 10c7bdcf9e0987f4cb6b5e2bed90c2fa1e1169d7 Mon Sep 17 00:00:00 2001 From: Joshua Lochner Date: Wed, 20 Mar 2024 14:16:59 +0200 Subject: [PATCH 2/2] Update "Image Feature Extraction" heading --- README.md | 2 +- docs/snippets/5_supported-tasks.snippet | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 854156568..51a30d3aa 100644 --- a/README.md +++ b/README.md @@ -224,7 +224,7 @@ You can refine your search by selecting the task you're interested in (e.g., [te | [Object Detection](https://huggingface.co/tasks/object-detection) | `object-detection` | Identify objects of certain defined classes within an image. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.ObjectDetectionPipeline)
[(models)](https://huggingface.co/models?pipeline_tag=object-detection&library=transformers.js) | | [Video Classification](https://huggingface.co/tasks/video-classification) | n/a | Assigning a label or class to an entire video. | ❌ | | [Unconditional Image Generation](https://huggingface.co/tasks/unconditional-image-generation) | n/a | Generating images with no condition in any context (like a prompt text or another image). | ❌ | -| [Feature Extraction](https://huggingface.co/tasks/image-feature-extraction) | `image-feature-extraction` | Transforming raw data into numerical features that can be processed while preserving the information in the original image. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.ImageFeatureExtractionPipeline)
[(models)](https://huggingface.co/models?pipeline_tag=image-feature-extraction&library=transformers.js) | +| [Image Feature Extraction](https://huggingface.co/tasks/image-feature-extraction) | `image-feature-extraction` | Transforming raw data into numerical features that can be processed while preserving the information in the original image. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.ImageFeatureExtractionPipeline)
[(models)](https://huggingface.co/models?pipeline_tag=image-feature-extraction&library=transformers.js) | #### Audio diff --git a/docs/snippets/5_supported-tasks.snippet b/docs/snippets/5_supported-tasks.snippet index a807145b0..ac71ee528 100644 --- a/docs/snippets/5_supported-tasks.snippet +++ b/docs/snippets/5_supported-tasks.snippet @@ -31,7 +31,7 @@ | [Object Detection](https://huggingface.co/tasks/object-detection) | `object-detection` | Identify objects of certain defined classes within an image. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.ObjectDetectionPipeline)
[(models)](https://huggingface.co/models?pipeline_tag=object-detection&library=transformers.js) | | [Video Classification](https://huggingface.co/tasks/video-classification) | n/a | Assigning a label or class to an entire video. | ❌ | | [Unconditional Image Generation](https://huggingface.co/tasks/unconditional-image-generation) | n/a | Generating images with no condition in any context (like a prompt text or another image). | ❌ | -| [Feature Extraction](https://huggingface.co/tasks/image-feature-extraction) | `image-feature-extraction` | Transforming raw data into numerical features that can be processed while preserving the information in the original image. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.ImageFeatureExtractionPipeline)
[(models)](https://huggingface.co/models?pipeline_tag=image-feature-extraction&library=transformers.js) | +| [Image Feature Extraction](https://huggingface.co/tasks/image-feature-extraction) | `image-feature-extraction` | Transforming raw data into numerical features that can be processed while preserving the information in the original image. | ✅ [(docs)](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.ImageFeatureExtractionPipeline)
[(models)](https://huggingface.co/models?pipeline_tag=image-feature-extraction&library=transformers.js) | #### Audio