VL Hub integrates CLIP pretraining, LiT-Tuning, alternate CLIP-like architectures, CoCa, conventional timm vision models and SimCLR contrastive models into a single test-train-eval framework, making it easier to compare models across architectures and objectives.
This software is heavily indebted to all of the above listed packages, and particularly to open clip.
If you're new to CLIP, you might want to familiarize yourself with the basics of the architecture.
If you haven't worked with the OpenCLIP implementation of CLIP before, the best place to start is the OpenCLIP readme in the docs folder.
If you're in need of a vision-language dataset for your training experiments, check out CaptionNet, a dataset designed precisely for that purpose.
If you're planning to use one of the alternate architectures for training, we recommend using the links above to familiarize yourself with the details of that architecture before proceeding.
In this readme, we focus on features that are new in our implementation.
VL Hub integrates support for a wide range of zero-shot evaluation metrics, including ImageNet and its distribution shifts, food101, fgvc-Aircraft, Stanford Cars, and iNaturalist. Simply add the dataset you wish to evaluate on, and include the following flag(s) when at evaluation time, substituting the appropriate PATH --
--food "/", --air "/", --stanfordcars "/", --imagenet-val "/imagenet/val/", --imagenet-a "/imagenet-a", --imagenet-r "/imagenet-r", --imagenet-v2 "/imagenet-v2", --imagenet-s "/imagenet-sketch", --inat2021 "/inat2021"
ImageNet and its distribution shifts also support evaluation on a 100-class subset of ImageNet-1k; this is particularly useful when training models on smaller datasets such as CaptionNet.
To evaluate on a subset of ImageNet, include the flag --caption-subset=True
VL Hub supports extended evaluation metrics such as confusion matrices and per-class accuracy results. To utilize these features, pass the flag --extended-metrics=True.
VL Hub currently supports training the following architectures;
- CLIP-loss models
As this is the default training mode, you need only pass the type of model you wish to train, EG, --model=RN50
- LiT-tuned models
Please refer to docs/openclip_readme for details
--lock-image-unlocked-groups will unlock only the last
--lock-text will lock the text tower during training (not recommended)
- Conventional cross-entropy loss models
In order to train conventional cross-entropy loss models without using a text tower, if your dataset is in CSV format, you must specify a caption-key column which contains either integers or a list of integers.
If you are training on webdataset, this step is not necessary, but you should specify a dataset filter --
Integer labels will be generated using a subset matching strategy. For more details, please see our paper.
In either case, you must also pass
You should then choose a model architecture with an appropriate head. For instance, if training an ImageNet model, you might choose
In order to train a CoCa model, simply pass
Please note that at this time, support for CoCa is limited to a single vision backbone, and the loss weighting has to be adjusted manually.
- DeCLIP, VSSL, filip
To train using one of these alternate objectives, pass the model architecture you wish to use as your base, and flag the objective you wish to train with. For instance:
- Changing image and text weighting
By passing float values between .1 and .9 to the flags --text-weight and --image-weight, it is possible to change how heavily CLIP weights both text and image loss.
In order to make training on blended or combined datasets more convenient when using webdataset, we implement training schema. Pass the flag
and do NOT pass a path to any training data in order to use schema.
Sample schema are provided in the repository.
Please note; schema training is only supported when using webdataset format; if you wish to combine CSV datasets, simply merge the CSVs you wish to combine.
is passed, the model will use SimCLR augmentations instead of standard CLIP augmentations. This has been shown to improve model zero-shot performance by as much as 10 percent.
VL Hub offers support for gradient caching, as described in GradCache.
Models like CLIP are typically trained with very large batch sizes -- 32,768 is standard. This has been shown to improve rates of loss convergence.
However, most researchers and even most businesses do not have access to the distributed training setups required for such batch sizes. Gradient caching saves computed gradients in RAM instead of in VRAM, allowing for much larger batch sizes on a single node.
Unlike gradient accumulation, gradient caching is mathematically identical to training with a large batch size.
To try gradient caching, use these three new command line arguments --
--gc=True: If set to True, gradient caching will be enabled. If gc=True, you should also set gpumaxbatch to an appropriate size for your GPU.
--gpumaxbatch=INT: This is an integer value representing the maximum batch size that your GPU can handle
--no_eval=True: Skips the evaluation phase, accelerating training
The default subset matching strategy is single-class, as this has been shown to perform best. However, other strategies are available. --multiclass enforces multiclass subset matching, whereas --strict utilizes a strict subset matching strategy. For more details on subset matching strategies, please refer to our paper.
VL Hub supports many modifications to captions, both during inference and during training.
"--zs-upper" forces classes to UPPER-CASE during inference, "zs-lower" forces lower-case. "--csv-cleaned" cleans the captions prior to training (this flag also works for webdatasets).
--token-strip strips all tokens not used for evaluation.
--token-scrambled scrambles token order during training.
"--simplecaptions" changes caption text to 'An image of CLASSNAME, CLASSNAME'
--shift-cipher=3 applies a 3-step shift-cipher to the caption space.