# Create a custom architecture

## Configuration

A **configuration** holds the attributes that define a model, such as its size, number of layers, and dropout rates. Each model configuration has its own set of attributes.

For example, check out the `DistilBERT` configuration by accessing `DistilBertConfig`:

In [None]:
from transformers import DistilBertConfig

config = DistilBertConfig()
config

The default attributes from `DistilBertConfig` are used to build a base `DistilBertModel`. All attributes are customizable, creating space for experimentation.

For example, we can customize a default model to:
* try a different activation function with the `activation` parameter
* use a higher dropout ratio for the attention probabilities with the `attention_dropout` parameter

In [None]:
my_config = DistilBertConfig(
    activation='relu',
    attention_dropout=0.4
)
my_config
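
As a quick check of what this customization changes, the sketch below compares the customized configuration with the defaults. It relies only on the default values that `DistilBertConfig` ships with; the variable name `default_config` is introduced here for illustration.

```python
from transformers import DistilBertConfig

default_config = DistilBertConfig()
my_config = DistilBertConfig(activation="relu", attention_dropout=0.4)

# The two overridden attributes differ from the defaults...
print(default_config.activation, "->", my_config.activation)
print(default_config.attention_dropout, "->", my_config.attention_dropout)

# ...while every attribute we did not touch keeps its default value
print(my_config.dim == default_config.dim)
print(my_config.n_layers == default_config.n_layers)
```

Attributes that are not passed to the constructor are left at their defaults, so a customized configuration only needs to name what changes.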

To modify the attributes of a pretrained model's configuration, pass the new values to the `from_pretrained()` method:

In [None]:
my_config = DistilBertConfig.from_pretrained(
    'distilbert/distilbert-base-uncased',
    activation='relu',
    attention_dropout=0.4
)
my_config

Once we are satisfied with our model configuration, we can save it with the `save_pretrained()` method. The configuration is stored as a JSON file in the specified save directory:

In [None]:
my_config.save_pretrained(save_directory='./our_model_save_path')

To reuse the configuration file, load it with `from_pretrained()`:

In [None]:
my_config = DistilBertConfig.from_pretrained('./our_model_save_path/config.json')
my_config
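
The save and reload steps can be sketched as one round trip. This is a minimal illustration that writes to a temporary directory instead of `./our_model_save_path`; the names `save_dir`, `raw`, and `reloaded` are introduced here for the example.

```python
import json
import os
import tempfile

from transformers import DistilBertConfig

my_config = DistilBertConfig(activation="relu", attention_dropout=0.4)

with tempfile.TemporaryDirectory() as save_dir:
    # save_pretrained() writes a config.json file into the directory
    my_config.save_pretrained(save_dir)
    saved_files = os.listdir(save_dir)

    # The file is plain JSON, so it can be inspected directly
    with open(os.path.join(save_dir, "config.json")) as f:
        raw = json.load(f)

    # from_pretrained() also accepts the directory that contains config.json
    reloaded = DistilBertConfig.from_pretrained(save_dir)

print(saved_files)
print(raw["activation"], raw["attention_dropout"])
print(reloaded.activation, reloaded.attention_dropout)
```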

## Model

Once we have a configuration file, we can create a model.

The model, also referred to as the architecture, defines what each layer does and which operations take place. Attributes from the configuration are used to define the architecture.

Every model inherits from the base class `PreTrainedModel`, which provides a few common methods such as resizing input embeddings and pruning self-attention heads. In addition, every model is a `torch.nn.Module` subclass.

In [None]:
from transformers import DistilBertModel

# Load the configuration we saved earlier
my_config = DistilBertConfig.from_pretrained('./our_model_save_path/config.json')

# Build a model from the configuration
model = DistilBertModel(my_config)

This creates a model with randomly initialized weights instead of pretrained weights. The model has to be trained before it can produce useful outputs.
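
To make the link between configuration and architecture concrete, here is a minimal sketch that builds a deliberately tiny model and runs one forward pass. The small sizes in `tiny_config` are hypothetical values chosen so the example runs quickly without downloading anything; they are not a recommended setup.

```python
import torch
from transformers import DistilBertConfig, DistilBertModel, PreTrainedModel

# Hypothetical tiny configuration (dim must be divisible by n_heads)
tiny_config = DistilBertConfig(
    vocab_size=1000, dim=64, hidden_dim=128, n_layers=2, n_heads=4
)
tiny_model = DistilBertModel(tiny_config).eval()

# The model is both a PreTrainedModel and a regular PyTorch module
print(isinstance(tiny_model, PreTrainedModel), isinstance(tiny_model, torch.nn.Module))

# Random token IDs stand in for tokenized text: (batch_size, sequence_length)
input_ids = torch.randint(0, tiny_config.vocab_size, (1, 8))
with torch.no_grad():
    outputs = tiny_model(input_ids=input_ids)

# The last dimension of the hidden states equals config.dim
print(outputs.last_hidden_state.shape)
```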

To use a pretrained model for inference, load its weights with `from_pretrained()`:

In [None]:
model = DistilBertModel.from_pretrained('distilbert/distilbert-base-uncased')

When we load pretrained weights, the default model configuration is loaded automatically if the model is provided by the Transformers library. However, we can still replace some or all of the default model configuration attributes with our own:

In [None]:
model = DistilBertModel.from_pretrained(
    'distilbert/distilbert-base-uncased',
    config=my_config
)

### Model heads

At this point, we have a base `DistilBERT` model which outputs the *hidden states*.

The hidden states are passed as inputs to a model head to produce the final output. Transformers provides a different model head for each task, as long as the model supports the task (e.g., we cannot use `DistilBERT` for a sequence-to-sequence task like translation).

For example, `DistilBertForSequenceClassification` is a base `DistilBERT` model with a sequence classification head. The sequence classification head is a linear layer on top of the pooled outputs.

In [None]:
from transformers import DistilBertForSequenceClassification

model = DistilBertForSequenceClassification.from_pretrained('distilbert/distilbert-base-uncased')

To reuse this checkpoint for another task, we can switch to a different model head. For a question answering task, we would use the `DistilBertForQuestionAnswering` model head.

The question answering head is similar to the sequence classification head except it is a linear layer on top of the hidden states output:

In [None]:
from transformers import DistilBertForQuestionAnswering

model = DistilBertForQuestionAnswering.from_pretrained('distilbert/distilbert-base-uncased')

## Tokenizer

A **tokenizer** converts raw text to tensors. Transformers has two types of tokenizers:
* `PreTrainedTokenizer`: a Python implementation of a tokenizer
* `PreTrainedTokenizerFast`: a tokenizer from the Rust-based Tokenizers library with additional methods. This tokenizer type is significantly faster, especially during batch tokenization.

If we have trained our own tokenizer, we can create one from its *vocabulary* file:

In [None]:
from transformers import DistilBertTokenizer

my_tokenizer = DistilBertTokenizer(
    vocab_file='my_vocab_file.txt',
    do_lower_case=False,
    padding_side='left'
)

The vocabulary from a custom tokenizer will be different from the vocabulary generated by a pretrained model's tokenizer. If we are using a pretrained model, we need to use that model's vocabulary; otherwise, the token IDs will not match what the model saw during training and the inputs will not make sense.
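
To see how the vocabulary file determines the token IDs, here is a minimal sketch with a hypothetical eight-entry vocabulary written to a temporary file. It assumes the Python (slow) `DistilBertTokenizer`, which reads one token per line and uses the line index as the token ID. Details may differ between library versions.

```python
import os
import tempfile

from transformers import DistilBertTokenizer

# Hypothetical toy vocabulary: position in the list = token ID
toy_vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "hello", "world", "##s"]

with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = os.path.join(tmp_dir, "toy_vocab.txt")
    with open(vocab_path, "w", encoding="utf-8") as f:
        f.write("\n".join(toy_vocab) + "\n")

    toy_tokenizer = DistilBertTokenizer(vocab_file=vocab_path, do_lower_case=False)

    # WordPiece splits "worlds" into a known word plus a "##" continuation piece
    tokens = toy_tokenizer.tokenize("hello worlds")
    # Calling the tokenizer also adds the [CLS] and [SEP] special tokens
    input_ids = toy_tokenizer("hello worlds")["input_ids"]
    # With do_lower_case=False, "Hello" is not in the vocabulary
    unknown = toy_tokenizer.tokenize("Hello")

print(tokens)
print(input_ids)
print(unknown)
```

A different vocabulary file would assign different IDs to the same text, which is why a pretrained model must be paired with its own vocabulary.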

We can create a tokenizer with a pretrained model's vocabulary:

In [None]:
from transformers import DistilBertTokenizer

my_tokenizer = DistilBertTokenizer.from_pretrained('distilbert/distilbert-base-uncased')

To create a fast tokenizer, use the `DistilBertTokenizerFast` class:

In [None]:
from transformers import DistilBertTokenizerFast

my_tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert/distilbert-base-uncased')

By default, `AutoTokenizer` will try to load a fast tokenizer. We can disable this by setting `use_fast=False` in `from_pretrained`.

## Image processor

An **image processor** processes vision inputs, for example by resizing, rescaling, and normalizing images.

For example, we can create a default `ViTImageProcessor` if we are using `ViT` for image classification:

In [None]:
from transformers import ViTImageProcessor

vit_extractor = ViTImageProcessor()
vit_extractor

We can modify any of the `ViTImageProcessor` parameters to create our custom image processor:

In [None]:
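from transformers import ViTImageProcessor
from transformers.image_utils import PILImageResampling

my_vit_extractor = ViTImageProcessor(
    # pass the resampling filter as an enum value, not as a string
    resample=PILImageResampling.BOX,
    do_normalize=False,
    image_mean=[0.3, 0.3, 0.3]
)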
my_vit_extractor
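
To see the effect of these parameters, the sketch below runs a default processor and a processor with `do_normalize=False` on the same image. The random array is a hypothetical stand-in for a real photo, and the processor names are introduced here for illustration. The example assumes Pillow and PyTorch are installed.

```python
import numpy as np
from transformers import ViTImageProcessor

# Hypothetical random RGB image in (height, width, channels) layout
image = np.random.randint(0, 256, size=(32, 48, 3), dtype=np.uint8)

default_processor = ViTImageProcessor()
rescale_only_processor = ViTImageProcessor(do_normalize=False)

default_pixels = default_processor(images=image, return_tensors="pt")["pixel_values"]
rescaled_pixels = rescale_only_processor(images=image, return_tensors="pt")["pixel_values"]

# Both processors resize to the default 224x224 and return (batch, channels, height, width)
print(default_pixels.shape, rescaled_pixels.shape)

# Without normalization the values are only rescaled from [0, 255] to [0, 1]
print(float(rescaled_pixels.min()), float(rescaled_pixels.max()))
# With the default mean and std of 0.5 the values are shifted to [-1, 1]
print(float(default_pixels.min()), float(default_pixels.max()))
```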

## Backbone

![cv_models](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/Backbone.png)

Computer vision models consist of a backbone, neck, and head. The backbone extracts features from an input image, the neck combines and enhances the extracted features, and the head is used for the main task (e.g., object detection).

Start by initializing a backbone in the model config and specifying whether to load pretrained weights or randomly initialized weights. Then pass the model config to the model head.

For example, we can load a ResNet backbone into a MaskFormer model with an instance segmentation head. We need to set `use_pretrained_backbone=True` to load pretrained ResNet weights for the backbone:

In [None]:
from transformers import MaskFormerConfig, MaskFormerForInstanceSegmentation

# backbone and neck config
config = MaskFormerConfig(
    backbone='microsoft/resnet-50',
    use_pretrained_backbone=True
)
# head
model = MaskFormerForInstanceSegmentation(config)

Set `use_pretrained_backbone=False` to randomly initialize a ResNet backbone:

In [None]:
from transformers import MaskFormerConfig, MaskFormerForInstanceSegmentation

# backbone and neck config
config = MaskFormerConfig(
    backbone='microsoft/resnet-50',
    use_pretrained_backbone=False
)
# head
model = MaskFormerForInstanceSegmentation(config)

We could also load the backbone config separately and then pass it to the model config:

In [None]:
from transformers import MaskFormerConfig, MaskFormerForInstanceSegmentation, ResNetConfig

backbone_config = ResNetConfig()
config = MaskFormerConfig(backbone_config=backbone_config)
model = MaskFormerForInstanceSegmentation(config)
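
To illustrate what a backbone config controls, the sketch below builds a standalone backbone and inspects the feature maps it extracts. It uses the `ResNetBackbone` class, which the examples above do not use, and a hypothetical shrunken `ResNetConfig` so it runs quickly with random weights. Treat it as an illustration of the feature extraction step, not as part of the MaskFormer setup.

```python
import torch
from transformers import ResNetBackbone, ResNetConfig

# Hypothetical small ResNet; out_features selects which stages are returned
small_backbone_config = ResNetConfig(
    embedding_size=16,
    hidden_sizes=[16, 32, 64, 128],
    depths=[1, 1, 1, 1],
    layer_type="basic",
    out_features=["stage1", "stage2", "stage3", "stage4"],
)
small_backbone = ResNetBackbone(small_backbone_config).eval()

# Random tensor standing in for a preprocessed image: (batch, channels, height, width)
pixel_values = torch.rand(1, 3, 64, 64)
with torch.no_grad():
    feature_maps = small_backbone(pixel_values).feature_maps

# Deeper stages have more channels and a smaller spatial resolution
for name, feature_map in zip(small_backbone.out_features, feature_maps):
    print(name, tuple(feature_map.shape))
```

A neck and head consume these multi-scale feature maps, which is why the backbone can be swapped out through the config.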

[`timm`](https://hf.co/docs/timm/index) models can be loaded into a model with `use_timm_backbone=True`, or with `TimmBackbone` and `TimmBackboneConfig`. Set `use_timm_backbone=True` and `use_pretrained_backbone=True` to load pretrained timm weights for the backbone:

In [None]:
from transformers import MaskFormerConfig, MaskFormerForInstanceSegmentation

# backbone and neck config
config = MaskFormerConfig(
    backbone='resnet50',
    use_pretrained_backbone=True,
    use_timm_backbone=True
)
# head
model = MaskFormerForInstanceSegmentation(config)

Set `use_timm_backbone=True` and `use_pretrained_backbone=False` to load a randomly initialized timm backbone:

In [None]:
from transformers import MaskFormerConfig, MaskFormerForInstanceSegmentation

# backbone and neck config
config = MaskFormerConfig(
    backbone='resnet50',
    use_pretrained_backbone=False,
    use_timm_backbone=True
)
# head
model = MaskFormerForInstanceSegmentation(config)

We could also load the backbone config and use it to create a `TimmBackbone` or pass it to the model config. Timm backbones will load pretrained weights by default. Set `use_pretrained_backbone=False` to load randomly initialized weights:

In [None]:
from transformers import (
    MaskFormerConfig,
    MaskFormerForInstanceSegmentation,
    TimmBackbone,
    TimmBackboneConfig,
)

backbone_config = TimmBackboneConfig(
    'resnet50',
    use_pretrained_backbone=False
)

# Option 1: create a standalone backbone
backbone = TimmBackbone(backbone_config)

# Option 2: create a model with a timm backbone
config = MaskFormerConfig(backbone_config=backbone_config)
model = MaskFormerForInstanceSegmentation(config)

## Feature extractor

A **feature extractor** processes audio inputs.

We can create a feature extractor associated with the model we are using. For example, we can create a default `Wav2Vec2FeatureExtractor` if we are using `Wav2Vec2` for audio classification:

In [None]:
from transformers import Wav2Vec2FeatureExtractor

w2v2_extractor = Wav2Vec2FeatureExtractor()
w2v2_extractor

We can modify any of the `Wav2Vec2FeatureExtractor` parameters to create our custom feature extractor:

In [None]:
from transformers import Wav2Vec2FeatureExtractor

w2v2_extractor = Wav2Vec2FeatureExtractor(
    sampling_rate=8000,
    do_normalize=False
)
w2v2_extractor
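
The effect of `do_normalize` can be checked on a synthetic waveform. In this sketch the waveform is a hypothetical one-second signal with a deliberately non-zero mean, and the extractor names are introduced here for illustration.

```python
import numpy as np
from transformers import Wav2Vec2FeatureExtractor

# Hypothetical one-second waveform at 16 kHz with mean 3 and standard deviation 2
rng = np.random.default_rng(0)
waveform = rng.normal(loc=3.0, scale=2.0, size=16000).astype(np.float32)

default_extractor = Wav2Vec2FeatureExtractor()  # do_normalize=True by default
raw_extractor = Wav2Vec2FeatureExtractor(do_normalize=False)

# sampling_rate must match the rate the extractor was configured with
normalized = default_extractor(waveform, sampling_rate=16000, return_tensors="np")["input_values"]
unchanged = raw_extractor(waveform, sampling_rate=16000, return_tensors="np")["input_values"]

# The output has a batch dimension: (batch_size, num_samples)
print(normalized.shape)
# Normalization rescales the waveform to roughly zero mean and unit variance
print(float(normalized.mean()), float(normalized.std()))
# Without normalization the values pass through unchanged
print(np.allclose(unchanged[0], waveform))
```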

## Processor

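For models that support multimodal tasks, Transformers has a processor class that conveniently wraps processing classes, such as a feature extractor and a tokenizer, into a single object. For example, `Wav2Vec2Processor` handles both the audio inputs and the text of a speech recognition model.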

Create a feature extractor to handle the audio inputs:

In [None]:
from transformers import Wav2Vec2FeatureExtractor

feature_extractor = Wav2Vec2FeatureExtractor(padding_value=1.0, do_normalize=True)

Create a tokenizer to handle text inputs:

In [None]:
from transformers import Wav2Vec2CTCTokenizer

tokenizer = Wav2Vec2CTCTokenizer(vocab_file='my_vocab_file.txt')

Combine the feature extractor and tokenizer in `Wav2Vec2Processor`:

In [None]:
from transformers import Wav2Vec2Processor

processor = Wav2Vec2Processor(
    feature_extractor=feature_extractor,
    tokenizer=tokenizer
)
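
A minimal end-to-end sketch of the combined processor follows. It assumes a hypothetical character-level vocabulary written to a temporary file; note that `Wav2Vec2CTCTokenizer` reads its vocabulary file as JSON, whatever the file extension. The names `toy_vocab` and `toy_processor` are introduced here for illustration, and behavior may differ between library versions.

```python
import json
import os
import tempfile

import numpy as np
from transformers import Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor, Wav2Vec2Processor

# Hypothetical character-level vocabulary mapping each token to its ID
toy_vocab = {"<pad>": 0, "<s>": 1, "</s>": 2, "<unk>": 3, "|": 4,
             "H": 5, "E": 6, "L": 7, "O": 8}

with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = os.path.join(tmp_dir, "toy_vocab.json")
    with open(vocab_path, "w", encoding="utf-8") as f:
        json.dump(toy_vocab, f)

    toy_processor = Wav2Vec2Processor(
        feature_extractor=Wav2Vec2FeatureExtractor(),
        tokenizer=Wav2Vec2CTCTokenizer(vocab_file=vocab_path),
    )

    # Audio goes through the wrapped feature extractor
    waveform = np.random.default_rng(0).normal(size=16000).astype(np.float32)
    audio_inputs = toy_processor(audio=waveform, sampling_rate=16000, return_tensors="np")

    # Text goes through the wrapped tokenizer, one ID per character
    label_ids = toy_processor.tokenizer("HELLO").input_ids

print(audio_inputs["input_values"].shape)
print(label_ids)
```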