# Load pretrained instances with an AutoClass

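In this tutorial, you will learn how to use `AutoClass.from_pretrained()` to load the component you need from a pretrained model.\
In contrast to `pipeline()`, which provides a complete model that handles preprocessing, inference, and postprocessing in one call, `AutoClass.from_pretrained()` lets you selectively load the individual components that make up a model and reuse them for other tasks.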

This tutorial is based on the [Hugging Face Transformers tutorial](https://huggingface.co/docs/transformers/v4.57.0/ja/autoclass_tutorial), with some additions and modifications.

## Dependencies

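To run all of the code in this tutorial, you need the following software in addition to the libraries that are explicitly `import`ed.

- `torch` library (or `tensorflow` library): backend for classes such as `AutoModelForSequenceClassification`
    - This tutorial only presents code that uses `torch`

If any of these are not installed in your environment, install them before you start.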

In [None]:
# Run this cell if you are working in Google Colab

!pip install librosa pillow torch transformers

In [1]:
import io
import librosa
from PIL import Image
import requests
from transformers import (
    AutoFeatureExtractor,
    AutoImageProcessor,
    AutoModelForSequenceClassification,
    AutoModelForTokenClassification,
    AutoTokenizer,
    AutoProcessor,
)



## AutoTokenizer

A tokenizer is a component that converts input text by tokenizing it and turning the tokens into tensors.\
Here we use the `AutoTokenizer` class.

In [3]:
# model: "google-bert/bert-base-uncased" (110M params)
# ref: https://huggingface.co/google-bert/bert-base-uncased

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
sequence = "In a hole in the ground there lived a hobbit."
print(tokenizer(sequence))

{'input_ids': [101, 1999, 1037, 4920, 1999, 1996, 2598, 2045, 2973, 1037, 7570, 10322, 4183, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
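
The `attention_mask` above is all ones because a single sequence needs no padding. The following is a minimal pure-Python sketch of how padding and the mask relate when sequences of different lengths are batched. It is an illustration only, not the library's implementation: `pad_batch` is a hypothetical helper, the first list reuses IDs from the output above, the second list is a made-up shorter sequence, and the padding ID `0` is an assumption that matches `padding_idx=0` in the BERT-style embeddings shown later in this tutorial.

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad every ID list to the longest length and build the matching mask."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        # Real tokens keep their IDs; padding positions get pad_id.
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two toy sequences of different lengths (hypothetical inputs).
batch = pad_batch([[101, 1999, 1037, 4920, 102], [101, 7570, 102]])
print(batch)
```

With the real tokenizer, the same effect is obtained by passing a list of sentences together with `padding=True`.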


## AutoImageProcessor

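An image processor is a component that applies transformations such as resizing, normalization, and conversion to tensors to an input image.\
Here we use the `AutoImageProcessor` class.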

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg" width="30%">

In [None]:
# model: "google/vit-base-patch16-224" (86.6M params)
# ref: https://huggingface.co/google/vit-base-patch16-224

image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(io.BytesIO(requests.get(image_url).content))
print(image_processor(image))

{'pixel_values': [array([[[-0.05882353, -0.10588235, -0.21568626, ...,  0.19215691,
          0.1686275 ,  0.09803927],
        [-0.02745098, -0.0745098 , -0.17647058, ...,  0.2313726 ,
          0.20000005,  0.12941182],
        [-0.01176471, -0.06666666, -0.17647058, ...,  0.26274514,
          0.22352946,  0.16078436],
        ...,
        [ 0.79607844,  0.79607844,  0.8039216 , ..., -0.8666667 ,
         -0.8666667 , -0.8352941 ],
        [ 0.8039216 ,  0.8039216 ,  0.8039216 , ..., -0.7647059 ,
         -0.8117647 , -0.8117647 ],
        [ 0.8117647 ,  0.81960785,  0.8117647 , ..., -0.7176471 ,
         -0.79607844, -0.8117647 ]],

       [[-0.03529412, -0.08235294, -0.19215685, ...,  0.15294123,
          0.12941182,  0.05882359],
        [-0.00392157, -0.05098039, -0.15294117, ...,  0.19215691,
          0.16078436,  0.09019613],
        [ 0.01176476, -0.04313725, -0.15294117, ...,  0.22352946,
          0.18431377,  0.12156868],
        ...,
        [ 0.84313726,  0.84313726,  
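
The printed `pixel_values` lie roughly in the range [-1, 1], which is what you get when 8-bit pixel values are rescaled to [0, 1] and then normalized. The sketch below reproduces that arithmetic in plain Python. The mean and standard deviation of `0.5` are assumptions chosen to be consistent with the numbers printed above, `normalize_pixel` is a hypothetical helper, and the three input values are hypothetical pixels picked for illustration.

```python
def normalize_pixel(value, mean=0.5, std=0.5, rescale=1 / 255):
    """Rescale an 8-bit value to [0, 1], then normalize it with mean and std."""
    return (value * rescale - mean) / std


# Hypothetical 8-bit pixel values from one row of a resized image.
row = [120, 114, 100]
normalized = [normalize_pixel(v) for v in row]
print(normalized)
```

Under these assumptions, the three results agree with the first three values of the output above to about six decimal places.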

## AutoFeatureExtractor

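A feature extractor is a component that extracts features from input audio or images in a fixed way and applies transformations such as normalization and conversion to tensors.\
Here we use the `AutoFeatureExtractor` class on an audio clip.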

<audio src="https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac" controls></audio>

In [38]:
# model: "ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition" (316M params)
# ref: https://huggingface.co/ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition

feature_extractor = AutoFeatureExtractor.from_pretrained(
    "ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition",
)
target_sr = feature_extractor.sampling_rate
speech_url = "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac"
speech_bytes = io.BytesIO(requests.get(speech_url).content)
speech, sr = librosa.load(speech_bytes, sr=None)
if sr != target_sr:
    speech = librosa.resample(speech, orig_sr=sr, target_sr=target_sr)
    sr = target_sr
print(feature_extractor(speech, sampling_rate=sr))


{'input_values': [array([ 0.07173638,  0.04628736, -0.05077884, ..., -0.00460555,
        0.03257194, -0.16154806], shape=(208000,), dtype=float32)], 'attention_mask': [array([1, 1, 1, ..., 1, 1, 1], shape=(208000,), dtype=int32)]}
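
For Wav2Vec2-style checkpoints, the main transformation applied to the waveform is zero-mean, unit-variance normalization. The following is a minimal pure-Python sketch of that step, not the library's own code: `zero_mean_unit_var`, the small `eps` term, and the toy waveform are assumptions for illustration. The last lines also show how the 208000 samples in the output above map to a duration, assuming a sampling rate of 16000 Hz.

```python
import math


def zero_mean_unit_var(samples, eps=1e-7):
    """Shift samples to zero mean and scale them to (almost exactly) unit variance."""
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    return [(s - mean) / math.sqrt(var + eps) for s in samples]


# Toy waveform (hypothetical values, not taken from the audio clip above).
wave = [0.0, 0.5, -0.5, 0.25, -0.25]
normalized = zero_mean_unit_var(wave)
print(normalized)

# Duration of the clip, assuming 16000 samples per second.
num_samples = 208000
assumed_sampling_rate = 16000
duration_seconds = num_samples / assumed_sampling_rate
print(duration_seconds)
```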


## AutoProcessor



In `Transformers`, a processor is a component that preprocesses the inputs of a multimodal model.\
Here we use the `AutoProcessor` class.
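
Conceptually, a processor bundles one preprocessor per modality and merges their outputs into a single encoding. The sketch below illustrates that idea in plain Python. Everything in it is hypothetical: `ToyProcessor`, `toy_tokenizer`, and `toy_image_processor` are stand-ins invented for this example and do not reflect how the library's processors are implemented.

```python
def toy_tokenizer(text):
    # Hypothetical stand-in: one "ID" per whitespace-separated word (its length).
    return {"input_ids": [[len(word) for word in t.split()] for t in text]}


def toy_image_processor(images):
    # Hypothetical stand-in: images are nested lists of 8-bit values rescaled to [0, 1].
    return {"image": [[[px / 255 for px in row] for row in img] for img in images]}


class ToyProcessor:
    """Wraps a text preprocessor and an image preprocessor behind one call."""

    def __init__(self, image_processor, tokenizer):
        self.image_processor = image_processor
        self.tokenizer = tokenizer

    def __call__(self, images, text):
        # Run each modality through its own preprocessor and merge the results.
        encoding = dict(self.tokenizer(text))
        encoding.update(self.image_processor(images))
        return encoding


processor_sketch = ToyProcessor(toy_image_processor, toy_tokenizer)
encoding = processor_sketch(images=[[[0, 255]]], text=["invoice number"])
print(sorted(encoding))
```

The benefit of this design is that the caller passes raw images and text in one call and receives one dictionary that can be fed to the model.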

In [None]:
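# model: "microsoft/layoutlmv2-base-uncased" (200M params)
# ref: https://huggingface.co/microsoft/layoutlmv2-base-uncased
# note: by default this processor runs OCR on the images, which relies on
#       `pytesseract` and the Tesseract OCR engine being installed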

processor = AutoProcessor.from_pretrained("microsoft/layoutlmv2-base-uncased")
image_url = "https://huggingface.co/spaces/impira/docquery/resolve/2359223c1837a7587402bda0f2643382a6eefeab/invoice.png"
image = Image.open(io.BytesIO(requests.get(image_url).content)).convert("RGB")
text = ["invoice", "number"]
print(processor(images=[image, image], text=text, return_tensors="pt", padding=True))

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


{'input_ids': tensor([[  101,  1999,  6767,  6610,   102,  1999,  6767,  6610,  2264,  7192,
          4297,  1012,  4878, 11203,  4644,  2047,  2259,  1010,  6396, 13092,
         10790,  3021,  3406,  2911,  3406,  1999,  6767,  6610,  1001,  2149,
          1011, 25604,  2198,  3044,  2198,  3044,  1999,  6767,  6610,  3058,
         11118,  2692, 17465, 11387, 16147,  1016,  2457,  2675,  1024,  4261,
          2620,  2581,  7222,  8584,  3298, 13433,  2015,  2047,  2259,  1010,
          6396, 13092, 10790,  4729,  1010,  5003, 13092, 10790,  1037, 20304,
          2475,  1013, 10476,  2349,  3058, 24441,  2692, 17465, 11387, 16147,
          1037,  2100,  6412,  3131,  3976,  3815,  1015,  2392,  5685,  4373,
         13428, 15196,  2531,  1012,  4002,  2531,  1012,  4002,  1016,  2739,
         18903,  2546, 15749,  2572,  5244,  2321,  1012,  4002,  2382,  1012,
          4002,  1017,  4450, 16425,  2869,  3156, 10347,  4942,  3406,  9080,
         13741,  1012,  4002, 16183,  

## AutoModel

To load a pretrained model for a specific task with its weights held as `torch.Tensor` (`tf.Tensor`) objects, use an `AutoModelFor` class (`TFAutoModelFor` class).
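
The two classes used in this section, `AutoModelForSequenceClassification` and `AutoModelForTokenClassification`, share the same base model and differ in the task head placed on top of it. The following is a simplified pure-Python sketch of that difference in output shape. The tiny dimensions, weights, and the `linear` helper are hypothetical, and the real sequence classification head contains additional layers that are omitted here.

```python
def linear(x, weight, bias):
    """Apply one linear layer to a single hidden-state vector."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


# Toy hidden states: 3 tokens, hidden size 2 (the real model uses 768).
hidden_states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

# Toy classifier with 3 labels (hypothetical weights).
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.0]

# Sequence classification: one logit vector for the whole sequence,
# computed here from the first token's hidden state.
sequence_logits = linear(hidden_states[0], weight, bias)

# Token classification: one logit vector for every token.
token_logits = [linear(h, weight, bias) for h in hidden_states]

print(len(sequence_logits))
print(len(token_logits), len(token_logits[0]))
```

Because these heads are not part of the base checkpoint, they are newly initialized when loaded, which is why the cells in this section warn that the model should be trained on a downstream task first.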

In [2]:
# model: "distilbert/distilbert-base-uncased" (67M params)
# ref: https://huggingface.co/distilbert/distilbert-base-uncased

AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased")

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


In [3]:
AutoModelForTokenClassification.from_pretrained("distilbert/distilbert-base-uncased")

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertForTokenClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
   