distilvit

Fine-tune a Vision Encoder Decoder model for image captioning.

The resulting model is available on the Hugging Face model hub at https://huggingface.co/mozilla/distilvit
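
As a minimal sketch of how the published checkpoint might be used, the snippet below wraps the `transformers` `image-to-text` pipeline. It assumes `transformers`, a backend such as PyTorch, and Pillow are installed, and that the checkpoint loads through that pipeline. The names `caption_image` and `captioner` are hypothetical and introduced only for illustration; they are not part of this repository.

```python
def caption_image(image, captioner=None):
    """Return a caption for `image` (a file path, URL, or PIL image).

    `captioner` is any callable that behaves like a transformers
    image-to-text pipeline. When omitted, one is built from the
    published checkpoint, which downloads the model on first use.
    """
    if captioner is None:
        # Imported lazily so the helper can be exercised without transformers.
        from transformers import pipeline

        # Assumption: the hub checkpoint is compatible with this pipeline task.
        captioner = pipeline("image-to-text", model="mozilla/distilvit")

    # The pipeline returns a list of dicts such as [{"generated_text": "..."}].
    results = captioner(image)
    return results[0]["generated_text"].strip()
```

Calling `caption_image("photo.jpg")` (with a hypothetical local file) would then return a single caption string. Passing your own `captioner` avoids rebuilding the pipeline for every image.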

To install, use your favorite tools, or run:

python -m venv .
bin/pip install -r requirements.txt
bin/pip install -e .

To train against all image and caption pairs (COCO, Flickr30k and TextCaps), make sure you have 2 TB of free disk space, then run:

bin/train --dataset all
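
Since the full run needs a large amount of disk, it can help to check free space before starting. The following is a sketch using only the standard library. The helper name `has_enough_space`, the `REQUIRED_BYTES` constant, and reading the requirement as 2 * 10**12 bytes are assumptions for illustration; point `path` at wherever the datasets and checkpoints will actually be stored.

```python
import shutil

# Assumption: the disk requirement stated above is read as 2 terabytes.
REQUIRED_BYTES = 2 * 10**12


def has_enough_space(path=".", required=REQUIRED_BYTES):
    """Return True if the filesystem holding `path` has at least `required` free bytes."""
    free = shutil.disk_usage(path).free
    return free >= required


if __name__ == "__main__":
    if has_enough_space():
        print("Enough free disk space for --dataset all")
    else:
        free_gb = shutil.disk_usage(".").free / 10**9
        print(f"Only {free_gb:.0f} GB free; about 2000 GB are needed")
```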

Once trained, you can try it out with the test script:

bin/python distilvit/infere.py
