
Image to Prompt

| Code | Slide | Report |

A generative text-to-image model is a model that can generate an image from a text prompt.

This repository contains the final project for the course EECM30064 Deep Learning.

Contributors

Motivation and Background

Stable Diffusion - Image to Prompts is a competition on Kaggle.

The goal of this competition is to reverse the typical direction of a generative text-to-image model: instead of generating an image from a text prompt, we want to build a model that predicts the text prompt given a generated image.

Predictions are made on a dataset containing a wide variety of $\verb|(prompt, image)|$ pairs generated by Stable Diffusion 2.0, in order to understand how reversible the latent relationship is.
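As a rough sketch of how that reversibility can be measured, a predicted prompt is compared with the ground-truth prompt in sentence-embedding space using cosine similarity (see SentenceTransformers, reference [4]). The encoder name below follows the competition's evaluation setup; the helper function itself is illustrative.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Sentence encoder used to embed prompts (384-dimensional embeddings);
# all-MiniLM-L6-v2 is the encoder used by the competition metric.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def prompt_similarity(predicted: str, actual: str) -> float:
    """Cosine similarity between the embeddings of two prompts."""
    pred_emb, true_emb = encoder.encode([predicted, actual])
    return float(np.dot(pred_emb, true_emb)
                 / (np.linalg.norm(pred_emb) * np.linalg.norm(true_emb)))
```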

Sample images from the competition dataset and their corresponding prompts are shown below.

| Image | Prompt |
| --- | --- |
| *(sample image from the dataset)* | ultrasaurus holding a black bean taco in the woods, near an identical cheneosaurus |

Methodology

Our method is to ensemble the CLIP Interrogator, OFA model, and ViT model.

The ensemble weights for the three models are listed below; a sketch of the weighted combination follows the list.

  • Vision Transformer (ViT) model: 74.88%
  • CLIP Interrogator: 21.12%
  • OFA model fine-tuned for image captioning: 4%
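Below is a minimal sketch of how these weights can be applied, assuming each model's predicted prompt has already been encoded into a fixed-length sentence embedding (e.g. with SentenceTransformers, reference [4]); the function and variable names are illustrative, not taken from our code.

```python
import numpy as np

# Ensemble weights from the list above (ViT / CLIP Interrogator / OFA).
WEIGHTS = {"vit": 0.7488, "clip_interrogator": 0.2112, "ofa": 0.04}

def ensemble_prompt_embedding(emb_vit, emb_clip, emb_ofa):
    """Weighted sum of the three prompt embeddings, renormalized to unit length."""
    blended = (WEIGHTS["vit"] * emb_vit
               + WEIGHTS["clip_interrogator"] * emb_clip
               + WEIGHTS["ofa"] * emb_ofa)
    return blended / np.linalg.norm(blended)
```

Since the competition metric is cosine similarity in embedding space, blending the prediction embeddings directly (rather than the prompt strings) is one natural way to combine the three models.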

Application and Datasets

Application

Based on the Kaggle competition, we want to build a model to predict the prompts that were used to generate target images.

Datasets

Prompts for this challenge were generated using a variety of (undisclosed) methods, and range from fairly simple to fairly complex, with multiple objects and modifiers.

Images were generated from the prompts using Stable Diffusion $2.0$ ($768$-v-ema.ckpt) with $50$ steps at $768 \times 768$ px, then downsized to $512 \times 512$ for the competition dataset. The hidden re-run test folder contains approximately $16,000$ images.
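For reference, the following is an illustrative sketch (not part of this repository) of how images with those settings can be produced with the `diffusers` library; the Hugging Face model ID is an assumption for the $768$-v checkpoint.

```python
import torch
from diffusers import StableDiffusionPipeline

# Stable Diffusion 2.0 (768-v): assumed Hugging Face model ID for the checkpoint.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2", torch_dtype=torch.float16
).to("cuda")

# 50 denoising steps at 768x768, matching the dataset's generation settings.
image = pipe(
    "ultrasaurus holding a black bean taco in the woods, near an identical cheneosaurus",
    num_inference_steps=50, height=768, width=768,
).images[0]

# The competition images were then downsized to 512x512.
image = image.resize((512, 512))
```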

References

[1] Learning Transferable Visual Models From Natural Language Supervision

[2] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

[3] Very Deep Convolutional Networks for Large-Scale Image Recognition

[4] SentenceTransformers

[5] CLIPInterrogator + OFA + ViT

[6] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

[7] CoCa: Contrastive Captioners are Image-Text Foundation Models
