
Image to Prompt

| Code | Slide | Report |

A generative text-to-image model is a model that can generate an image from a text prompt.

This repository contains the final project for the course EECM30064 Deep Learning.

Contributors

Motivation and Background

Stable Diffusion - Image to Prompts is a competition on Kaggle.

The goal of this competition is to reverse the typical direction of a generative text-to-image model: instead of generating an image from a text prompt, we want to build a model that predicts the text prompt given a generated image.

Predictions are made on a dataset containing a wide variety of $\verb|(prompt, image)|$ pairs generated by Stable Diffusion 2.0, in order to understand how reversible the latent relationship is.
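As a rough sketch of how that reversibility can be measured, a predicted prompt is compared with the ground-truth prompt in sentence-embedding space using cosine similarity (see SentenceTransformers, reference [4]). The encoder name below follows the competition's evaluation setup; the helper function itself is illustrative.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Sentence encoder used to embed prompts (384-dimensional embeddings);
# all-MiniLM-L6-v2 is the encoder used by the competition metric.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def prompt_similarity(predicted: str, actual: str) -> float:
    """Cosine similarity between the embeddings of two prompts."""
    pred_emb, true_emb = encoder.encode([predicted, actual])
    return float(np.dot(pred_emb, true_emb)
                 / (np.linalg.norm(pred_emb) * np.linalg.norm(true_emb)))
```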

Sample images from the competition dataset and their corresponding prompts are shown below.

| Image | Prompt |
| --- | --- |
| *(sample image from the dataset)* | ultrasaurus holding a black bean taco in the woods, near an identical cheneosaurus |

Methodology

Our method is to ensemble the CLIP Interrogator, OFA model, and ViT model.

The ensemble weights for the three models are listed below; a sketch of the weighted combination follows the list.

  • Vision Transformer (ViT) model: 74.88%
  • CLIP Interrogator: 21.12%
  • OFA model fine-tuned for image captioning: 4%
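Below is a minimal sketch of how these weights can be applied, assuming each model's predicted prompt has already been encoded into a fixed-length sentence embedding (e.g. with SentenceTransformers, reference [4]); the function and variable names are illustrative, not taken from our code.

```python
import numpy as np

# Ensemble weights from the list above (ViT / CLIP Interrogator / OFA).
WEIGHTS = {"vit": 0.7488, "clip_interrogator": 0.2112, "ofa": 0.04}

def ensemble_prompt_embedding(emb_vit, emb_clip, emb_ofa):
    """Weighted sum of the three prompt embeddings, renormalized to unit length."""
    blended = (WEIGHTS["vit"] * emb_vit
               + WEIGHTS["clip_interrogator"] * emb_clip
               + WEIGHTS["ofa"] * emb_ofa)
    return blended / np.linalg.norm(blended)
```

Since the competition metric is cosine similarity in embedding space, blending the prediction embeddings directly (rather than the prompt strings) is one natural way to combine the three models.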

Application and Datasets

Application

Based on the Kaggle competition, we want to build a model to predict the prompts that were used to generate target images.

Datasets

Prompts for this challenge were generated using a variety of (undisclosed) methods, and range from fairly simple to fairly complex, with multiple objects and modifiers.

Images were generated from the prompts using Stable Diffusion $2.0$ ($768$-v-ema.ckpt) with $50$ steps at $768 \times 768$ px, then downsized to $512 \times 512$ for the competition dataset. The hidden re-run test folder contains approximately $16,000$ images.
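For reference, the following is an illustrative sketch (not part of this repository) of how images with those settings can be produced with the `diffusers` library; the Hugging Face model ID is an assumption for the $768$-v checkpoint.

```python
import torch
from diffusers import StableDiffusionPipeline

# Stable Diffusion 2.0 (768-v): assumed Hugging Face model ID for the checkpoint.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2", torch_dtype=torch.float16
).to("cuda")

# 50 denoising steps at 768x768, matching the dataset's generation settings.
image = pipe(
    "ultrasaurus holding a black bean taco in the woods, near an identical cheneosaurus",
    num_inference_steps=50, height=768, width=768,
).images[0]

# The competition images were then downsized to 512x512.
image = image.resize((512, 512))
```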

References

[1] Learning Transferable Visual Models From Natural Language Supervision

[2] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

[3] Very Deep Convolutional Networks for Large-Scale Image Recognition

[4] SentenceTransformers

[5] CLIPInterrogator + OFA + ViT

[6] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

[7] CoCa: Contrastive Captioners are Image-Text Foundation Models
