# Get started

This section covers how to get started with `ConfigVLM`.

There are currently two ways to install `ConfigVLM`. The recommended way is from [PyPI](https://pypi.org/) via `pip install configvlm`. Alternatively, the source can be downloaded directly from [GitHub](https://github.com/lhackel-tub/ConfigVLM).

`ConfigVLM` lets you combine predefined vision and language models and apply them to tasks such as [Supervised Pretraining](sup_pretraining.ipynb) or [VQA](vqa.ipynb).
The models can either be trained end-to-end or initialized from pre-trained checkpoints.

## Model Configuration

The central object of `ConfigVLM` is the dataclass `VLMConfiguration`. It determines which components the model consists of, how they are combined, and which task the model should ultimately solve. A possible configuration for supervised classification looks like this:
:::{note}
Not all properties of the object are always used. Which properties are unused depends on the specified network type. Classification has no fusion or language part, so in this example all parameters associated with fusion or language modeling are unused.
:::

In [14]:
from configvlm.ConfigVLM import VLMConfiguration, VLMType
from pprint import pprint
model_config = VLMConfiguration(
    timm_model_name="resnet18",
    classes=100,
    image_size=128,
    channels=3,
    network_type=VLMType.VISION_CLASSIFICATION
)
pprint(model_config)

VLMConfiguration(timm_model_name='resnet18',
                 hf_model_name=None,
                 image_size=128,
                 channels=3,
                 classes=100,
                 class_names=None,
                 network_type=<VLMType.VISION_CLASSIFICATION: 0>,
                 visual_features_out=512,
                 fusion_in=512,
                 fusion_hidden=256,
                 v_dropout_rate=0.25,
                 t_dropout_rate=0.25,
                 fusion_dropout_rate=0.25,
                 fusion_method=<built-in method mul of type object at 0x7f2e451fd460>,
                 fusion_activation=Tanh(),
                 drop_rate=0.2,
                 use_pooler_output=True,
                 max_sequence_length=32,
                 load_timm_if_available=False,
                 load_hf_if_available=True)


This class is ultimately used to create the model, but it also collects all other settings, such as the image size, in a single object. This keeps the code organized and avoids the need for many global variables.
Currently, the configuration supports the following network types:

In [13]:
# remove-input
from pprint import pprint
pprint([p.name for p in VLMType])

['VISION_CLASSIFICATION', 'VQA_CLASSIFICATION']
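
The idea of keeping all settings in one configuration object instead of global variables can be sketched with a small stand-in. The example below does not use `ConfigVLM` itself: `SketchConfig`, `SketchType`, `input_shape`, and `uses_language_model` are hypothetical names introduced only for illustration, the enum values are illustrative, and the model name `bert-base-uncased` is an assumed example value. It shows how helper functions can read everything they need from a single dataclass instance, and how a variant for another network type can be derived with `dataclasses.replace`.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Optional, Tuple


class SketchType(Enum):
    # Hypothetical stand-in for VLMType; the numeric values are illustrative.
    VISION_CLASSIFICATION = 0
    VQA_CLASSIFICATION = 1


@dataclass
class SketchConfig:
    # Hypothetical stand-in for VLMConfiguration with only a handful of fields.
    timm_model_name: str
    classes: int
    image_size: int
    channels: int
    network_type: SketchType
    hf_model_name: Optional[str] = None


def input_shape(config: SketchConfig) -> Tuple[int, int, int]:
    # Every consumer reads the image geometry from the same object,
    # so no module-level IMAGE_SIZE or CHANNELS constants are needed.
    return (config.channels, config.image_size, config.image_size)


def uses_language_model(config: SketchConfig) -> bool:
    # Only the VQA variant needs a language part in this sketch.
    return config.network_type is SketchType.VQA_CLASSIFICATION


base = SketchConfig(
    timm_model_name="resnet18",
    classes=100,
    image_size=128,
    channels=3,
    network_type=SketchType.VISION_CLASSIFICATION,
)

# Derive a second configuration without mutating the first one.
vqa = replace(
    base,
    hf_model_name="bert-base-uncased",  # assumed example value
    network_type=SketchType.VQA_CLASSIFICATION,
)

print(input_shape(base))
print(uses_language_model(base), uses_language_model(vqa))
```

Because the configuration travels as one argument, functions such as `input_shape` stay free of hidden dependencies and are easy to test in isolation.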
