# Load pretrained instances with an AutoClass

## AutoTokenizer

A tokenizer converts text into the numerical inputs that a model can process:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('google-bert/bert-base-uncased')

Then we can tokenize our input:

In [2]:
sequence = "In a hole in the ground there lived a hobbit."
print(tokenizer(sequence))

{'input_ids': [101, 1999, 1037, 4920, 1999, 1996, 2598, 2045, 2973, 1037, 7570, 10322, 4183, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
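
The three fields line up position by position: `input_ids` holds the vocabulary ids, `token_type_ids` marks the segment each token belongs to, and `attention_mask` flags the positions the model should attend to. The mask only becomes interesting once sequences are padded to a common length. The sketch below illustrates that idea in plain Python on the encoding printed above. `pad_encoding` is a hypothetical helper written for this illustration, not a `transformers` function, and the pad id of 0 is an assumption that fits BERT-style vocabularies.

```python
# Encoding copied from the cell output above (15 tokens).
encoding = {
    "input_ids": [101, 1999, 1037, 4920, 1999, 1996, 2598, 2045,
                  2973, 1037, 7570, 10322, 4183, 1012, 102],
    "token_type_ids": [0] * 15,
    "attention_mask": [1] * 15,
}


def pad_encoding(enc, max_length, pad_id=0):
    """Right-pad every field to max_length (hypothetical helper, not a library API)."""
    n_pad = max_length - len(enc["input_ids"])
    if n_pad < 0:
        raise ValueError("max_length is shorter than the sequence")
    return {
        "input_ids": enc["input_ids"] + [pad_id] * n_pad,
        "token_type_ids": enc["token_type_ids"] + [0] * n_pad,
        # Padding positions get a 0 so the model can ignore them.
        "attention_mask": enc["attention_mask"] + [0] * n_pad,
    }


padded = pad_encoding(encoding, max_length=20)
print(len(padded["input_ids"]), sum(padded["attention_mask"]))  # 20 15
```

In practice the tokenizer handles padding itself when asked to; the sketch only shows what the mask encodes.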


## AutoImageProcessor

For vision tasks, an image processor converts an image into the input format the model expects:

In [4]:
from transformers import AutoImageProcessor

image_processor = AutoImageProcessor.from_pretrained('google/vit-base-patch16-224')

Fast image processor class <class 'transformers.models.vit.image_processing_vit_fast.ViTImageProcessorFast'> is available for this model. Using slow image processor class. To use the fast image processor class set `use_fast=True`.
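
As a rough picture of what the correct input format involves, an image processor typically resizes the image, rescales pixel values from the 0-255 range to 0-1, and normalizes each channel with a mean and standard deviation. The plain-Python sketch below shows only the rescale and normalize steps for a single pixel. The mean and standard deviation of 0.5 are assumed values for illustration (a checkpoint's real values are stored in its `preprocessor_config.json`), and `normalize_pixel` is a hypothetical helper, not a library call.

```python
def normalize_pixel(rgb, mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)):
    """Rescale an 8-bit RGB pixel to [0, 1], then normalize each channel.

    The mean/std defaults are assumptions made for this illustration.
    """
    return tuple((value / 255.0 - m) / s for value, m, s in zip(rgb, mean, std))


print(normalize_pixel((255, 0, 255)))  # (1.0, -1.0, 1.0)
```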


## AutoBackbone

`AutoBackbone` lets us use a pretrained model as a backbone and extract feature maps from its different stages.

We can specify one of the following parameters in `from_pretrained()` to choose the stages:
* `out_indices` - the indices of the stages we would like to get feature maps from
* `out_features` - the names of the stages we would like to get feature maps from

In [6]:
from transformers import AutoImageProcessor, AutoBackbone
import torch
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
processor = AutoImageProcessor.from_pretrained('microsoft/swin-tiny-patch4-window7-224')
model = AutoBackbone.from_pretrained('microsoft/swin-tiny-patch4-window7-224',
                                     out_indices=(1,))  # return the feature map from stage 1

Some weights of SwinBackbone were not initialized from the model checkpoint at microsoft/swin-tiny-patch4-window7-224 and are newly initialized: ['swin.hidden_states_norms.stage1.bias', 'swin.hidden_states_norms.stage1.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
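
The newly initialized weights in the warning are named after `stage1`, the stage selected by `out_indices=(1,)`. Indices and names are two ways of addressing the same stages, which the plain-Python sketch below tries to make concrete. The list of stage names is an assumption based on the usual layout of a Swin-style backbone (a stem followed by four stages); the real names depend on the checkpoint, so treat `STAGE_NAMES` and both helpers as illustrative only.

```python
# Assumed stage names for illustration; only 'stage1' is confirmed by the warning.
STAGE_NAMES = ["stem", "stage1", "stage2", "stage3", "stage4"]


def indices_to_features(out_indices, stage_names=STAGE_NAMES):
    """Translate stage indices to stage names (negative indices count from the end)."""
    return [stage_names[i] for i in out_indices]


def features_to_indices(out_features, stage_names=STAGE_NAMES):
    """Translate stage names back to their positions in the stage list."""
    return [stage_names.index(name) for name in out_features]


print(indices_to_features((1,)))        # ['stage1']
print(features_to_indices(["stage1"]))  # [1]
print(indices_to_features((-1,)))       # ['stage4']
```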


In [7]:
inputs = processor(images=image, return_tensors='pt')
inputs

{'pixel_values': tensor([[[[ 0.3138,  0.4337,  0.4851,  ..., -0.3541, -0.3369, -0.3541],
          [ 0.3652,  0.4337,  0.4679,  ..., -0.3541, -0.3541, -0.3883],
          [ 0.3138,  0.3994,  0.4166,  ..., -0.4568, -0.4226, -0.3883],
          ...,
          [ 1.9064,  1.7865,  1.6495,  ...,  1.6153,  1.4954,  1.4440],
          [ 1.8722,  1.8037,  1.7523,  ...,  1.4098,  1.1358,  0.9817],
          [ 1.8722,  1.7180,  1.7352,  ...,  0.1254, -0.1657, -0.4739]],

         [[-1.6155, -1.6155, -1.6155,  ..., -1.7906, -1.7906, -1.8081],
          [-1.5630, -1.5630, -1.5630,  ..., -1.7731, -1.7556, -1.7731],
          [-1.6331, -1.5980, -1.5630,  ..., -1.8081, -1.7906, -1.7906],
          ...,
          [-0.3901, -0.5301, -0.6352,  ..., -0.7402, -0.8102, -0.8627],
          [-0.3901, -0.4426, -0.5651,  ..., -0.8452, -1.0028, -1.0728],
          [-0.4251, -0.5651, -0.5826,  ..., -1.4930, -1.5980, -1.7206]],

         [[-0.7936, -0.6018, -0.6541,  ..., -1.2293, -1.1247, -1.1596],
          [-0

In [9]:
outputs = model(**inputs)
outputs

BackboneOutput(feature_maps=(tensor([[[[ 1.0064e+00,  6.3087e-01, -1.9434e-01,  ...,  3.4310e-01,
            2.3735e-01,  1.6851e-01],
          [ 1.1366e+00,  6.7725e-01, -2.6274e-02,  ..., -1.1405e-01,
            6.0058e-01,  1.7787e-01],
          [ 1.0791e+00,  5.6264e-01,  2.5349e-01,  ...,  2.9255e-01,
            3.0649e-01,  6.1798e-01],
          ...,
          [ 5.0816e-01,  3.7973e-01, -7.7769e-01,  ..., -8.8344e-02,
            5.2977e-02, -5.7092e-01],
          [-3.7535e-01, -1.8946e-01, -3.2584e-01,  ..., -2.9201e-01,
           -5.8430e-02,  1.2849e-01],
          [ 2.5337e-01, -8.9120e-02,  3.7873e-01,  ..., -7.1851e-01,
            1.3847e-01,  5.3692e-01]],

         [[ 9.5971e-01,  9.7547e-01,  1.3414e+00,  ...,  8.0082e-01,
            8.1081e-01,  7.8101e-01],
          [ 7.9007e-01,  1.2079e+00,  1.1176e+00,  ...,  9.6235e-01,
            6.2085e-01,  7.4454e-01],
          [ 5.2195e-02,  9.7713e-01,  1.0464e+00,  ...,  7.8408e-01,
            9.2502e-01,  7.90

In [10]:
feature_maps = outputs.feature_maps
feature_maps

(tensor([[[[ 1.0064e+00,  6.3087e-01, -1.9434e-01,  ...,  3.4310e-01,
             2.3735e-01,  1.6851e-01],
           [ 1.1366e+00,  6.7725e-01, -2.6274e-02,  ..., -1.1405e-01,
             6.0058e-01,  1.7787e-01],
           [ 1.0791e+00,  5.6264e-01,  2.5349e-01,  ...,  2.9255e-01,
             3.0649e-01,  6.1798e-01],
           ...,
           [ 5.0816e-01,  3.7973e-01, -7.7769e-01,  ..., -8.8344e-02,
             5.2977e-02, -5.7092e-01],
           [-3.7535e-01, -1.8946e-01, -3.2584e-01,  ..., -2.9201e-01,
            -5.8430e-02,  1.2849e-01],
           [ 2.5337e-01, -8.9120e-02,  3.7873e-01,  ..., -7.1851e-01,
             1.3847e-01,  5.3692e-01]],
 
          [[ 9.5971e-01,  9.7547e-01,  1.3414e+00,  ...,  8.0082e-01,
             8.1081e-01,  7.8101e-01],
           [ 7.9007e-01,  1.2079e+00,  1.1176e+00,  ...,  9.6235e-01,
             6.2085e-01,  7.4454e-01],
           [ 5.2195e-02,  9.7713e-01,  1.0464e+00,  ...,  7.8408e-01,
             9.2502e-01,  7.9040e-01],


Now we can inspect the feature map from the first stage of the backbone:

In [11]:
list(feature_maps[0].shape)

[1, 96, 56, 56]
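
The shape reads as `[batch, channels, height, width]`. For this checkpoint the numbers can be reproduced with a little arithmetic, sketched below under assumptions taken from the checkpoint name and the standard Swin design: 224x224 inputs, 4x4 patches, an embedding width of 96, and each later stage halving the resolution while doubling the channels. Only the stage 1 result is confirmed by the output above; `swin_stage_shape` is a hypothetical helper.

```python
def swin_stage_shape(stage, image_size=224, patch_size=4, embed_dim=96, batch_size=1):
    """Estimate the feature-map shape of a Swin stage (1-based).

    Assumes each stage after the first halves the spatial size and
    doubles the channel count.
    """
    if stage < 1:
        raise ValueError("stage must be >= 1")
    scale = 2 ** (stage - 1)
    side = image_size // (patch_size * scale)
    return [batch_size, embed_dim * scale, side, side]


print(swin_stage_shape(1))  # [1, 96, 56, 56], matching the cell output above
```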

## AutoFeatureExtractor

For audio tasks, a feature extractor converts the audio signal into the input format the model expects:

In [12]:
from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained(
    'ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition'
)
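
For raw audio, the correct input format usually means a fixed sampling rate and normalized amplitudes. The sketch below shows zero-mean, unit-variance normalization of a waveform in plain Python. It assumes this is the kind of normalization the extractor applies; `zero_mean_unit_var` and the small epsilon are illustrative choices, not the library's code.

```python
import math


def zero_mean_unit_var(samples, eps=1e-7):
    """Shift a waveform to zero mean and scale it to (roughly) unit variance."""
    mean = sum(samples) / len(samples)
    var = sum((x - mean) ** 2 for x in samples) / len(samples)
    # eps guards against division by zero on a silent clip.
    return [(x - mean) / math.sqrt(var + eps) for x in samples]


waveform = [0.0, 0.5, -0.5, 1.0, -1.0]  # toy samples, not real audio
normalized = zero_mean_unit_var(waveform)
print([round(x, 3) for x in normalized])  # [0.0, 0.707, -0.707, 1.414, -1.414]
```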


## AutoProcessor

Multimodal tasks require a processor that combines two types of preprocessing tools. For example, the LayoutLMv2 model requires an image processor to handle images and a tokenizer to handle text; a processor combines both of them:

In [13]:
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    'microsoft/layoutlmv2-base-uncased'
)
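
The idea of combining two preprocessing tools can be sketched as simple composition: run each modality through its own tool and merge the results into one dictionary of model inputs. Everything in the block below is hypothetical and exists only for illustration, including `ToyProcessor` and the two toy callables; it is not how the library implements its processors.

```python
class ToyProcessor:
    """Hypothetical stand-in that bundles an image processor and a tokenizer."""

    def __init__(self, image_processor, tokenizer):
        self.image_processor = image_processor
        self.tokenizer = tokenizer

    def __call__(self, image, text):
        # Preprocess each modality separately, then merge into one dict.
        features = dict(self.tokenizer(text))
        features.update(self.image_processor(image))
        return features


def toy_tokenizer(text):
    # Toy "ids": the length of each whitespace-separated word.
    return {"input_ids": [len(word) for word in text.split()]}


def toy_image_processor(image):
    # Toy rescaling of 8-bit pixel values to the 0-1 range.
    return {"pixel_values": [[value / 255.0 for value in row] for row in image]}


processor_sketch = ToyProcessor(toy_image_processor, toy_tokenizer)
print(processor_sketch(image=[[0, 255]], text="invoice total"))
# {'input_ids': [7, 5], 'pixel_values': [[0.0, 1.0]]}
```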




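## AutoModel

The `AutoModelFor` classes load a pretrained model for a given task. The same checkpoint can be loaded with different task heads, for example for sequence classification or token classification: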

In [14]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    'distilbert/distilbert-base-uncased'
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [15]:
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    'distilbert/distilbert-base-uncased'
)

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
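
Both classes wrap the same pretrained encoder with a different task head, which is why each warning lists newly initialized `classifier` weights. One practical difference is the shape of the logits, sketched below in plain Python. The default of two labels and the helper names are assumptions for illustration; the real number of labels comes from the model configuration.

```python
def sequence_classification_logits_shape(batch_size, num_labels=2):
    """One score per label for each whole sequence."""
    return (batch_size, num_labels)


def token_classification_logits_shape(batch_size, seq_len, num_labels=2):
    """One score per label for every token in each sequence."""
    return (batch_size, seq_len, num_labels)


print(sequence_classification_logits_shape(1))   # (1, 2)
print(token_classification_logits_shape(1, 15))  # (1, 15, 2)
```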
