LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts [ICLR 2024]

Hanan Gani¹, Shariq Farooq Bhat², Muzammal Naseer¹, Salman Khan¹,³, Peter Wonka²

¹MBZUAI ²KAUST ³Australian National University

Official implementation of the paper "LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts".

Updates

Code is released. [Feb 12, 2024]
Our paper has been accepted at ICLR 2024. [Jan 15, 2024]

Highlights

Abstract: Diffusion-based generative models have significantly advanced text-to-image generation but encounter challenges when processing lengthy and intricate text prompts describing complex scenes with multiple objects. While excelling in generating images from short, single-object descriptions, these models often struggle to faithfully capture all the nuanced details within longer and more elaborate textual inputs. In response, we present a novel approach leveraging Large Language Models (LLMs) to extract critical components from text prompts, including bounding box coordinates for foreground objects, detailed textual descriptions for individual objects, and a succinct background context. These components form the foundation of our layout-to-image generation model, which operates in two phases. The initial Global Scene Generation utilizes object layouts and background context to create an initial scene but often falls short in faithfully representing object characteristics as specified in the prompts. To address this limitation, we introduce an Iterative Refinement Scheme that iteratively evaluates and refines box-level content to align them with their textual descriptions, recomposing objects as needed to ensure consistency. Our evaluation on complex prompts featuring multiple objects demonstrates a substantial improvement in recall compared to baseline diffusion models. This is further validated by a user study, underscoring the efficacy of our approach in generating coherent and detailed scenes from intricate textual inputs.

Main Contributions

Scene Blueprints: We present a novel approach that leverages Large Language Models (LLMs) to extract critical components from text prompts, including bounding box coordinates for foreground objects, detailed textual descriptions for individual objects, and a succinct background context.
Global Scene Generation: Utilizing the bounding box layout and the generalized background prompt, we generate an initial image using a Layout-to-Image generator.
Iterative Refinement Scheme: Given the initial image, our proposed refinement mechanism iteratively evaluates and refines the box-level content of each object to align it with its textual description, recomposing objects as needed to ensure consistency.
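
The three contributions above can be pictured with a small, self-contained sketch. It is not the repository's implementation: the names ObjectSpec, SceneBlueprint and refine_scene, the (x, y, width, height) box format, the score_fn and recompose_fn callbacks, and the threshold and round limit are all hypothetical. They stand in for the LLM output, the box-level evaluation, and the composition model respectively.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Assumed box format: (x, y, width, height) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class ObjectSpec:
    """One foreground object extracted from the prompt (hypothetical type)."""
    name: str
    box: Box
    description: str


@dataclass
class SceneBlueprint:
    """Object layouts plus a short background prompt (hypothetical type)."""
    objects: List[ObjectSpec]
    background: str


def refine_scene(
    image,
    blueprint: SceneBlueprint,
    score_fn: Callable[[object, ObjectSpec], float],
    recompose_fn: Callable[[object, ObjectSpec], object],
    threshold: float = 0.5,
    max_rounds: int = 3,
):
    """Re-compose every box whose content scores below `threshold`.

    `score_fn` rates how well a box matches its description and
    `recompose_fn` returns an image with that box regenerated.
    Both are placeholders supplied by the caller.
    """
    history: Dict[str, List[float]] = {obj.name: [] for obj in blueprint.objects}
    for _ in range(max_rounds):
        changed = False
        for obj in blueprint.objects:
            score = score_fn(image, obj)
            history[obj.name].append(score)
            if score < threshold:
                image = recompose_fn(image, obj)
                changed = True
        if not changed:
            # Every box already matches its description well enough.
            break
    return image, history
```

In this sketch the loop stops as soon as a full pass over the boxes triggers no recomposition, or after max_rounds passes, whichever comes first.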

Methodology

Installation

This codebase is tested on Ubuntu 20.04.2 LTS with Python 3.8. Follow the steps below to create the environment and install the dependencies.

# Create a conda environment
conda create -n llmblueprint python==3.8

# Activate the environment
conda activate llmblueprint

# Install torch
conda install pytorch==2.0.0 torchvision==0.15.0 torchaudio==2.0.0 pytorch-cuda=11.7 -c pytorch -c nvidia

# Install requirements
pip install -r requirements.txt

# Finally, download the spaCy English model
python -m spacy download en_core_web_md
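
After the steps above, a quick check like the following can confirm what ended up in the environment. This is a minimal sketch that is not part of the repository: the function name installed_versions and the package list are assumptions, and a value of None only means that no package metadata was found (which may also happen for some conda-installed packages).

```python
import sys
from importlib import metadata

# Package names assumed from the installation steps above.
PACKAGES = ("torch", "torchvision", "spacy", "en_core_web_md")


def installed_versions(packages=PACKAGES):
    """Map each package name to its installed version, or None if not found."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    print("python", sys.version.split()[0])
    for name, version in installed_versions().items():
        print(name, version or "NOT FOUND")
```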

Run LLMBlueprint

Download the pretrained weights of the composition model from here and provide their path in the YAML files inside the configs folder.

Generate

python main.py --config configs/livingroom_1.yaml

The hyperparameters and input arguments can be modified inside the YAML files. The generated results will be saved in the ./outputs folder.
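
To generate results for several configs in one go, the command above can be wrapped in a short script. The sketch below is not shipped with the repository: build_commands and run_all are hypothetical helper names, and it assumes the script is launched from the repository root so that main.py and the configs folder resolve correctly.

```python
import subprocess
import sys
from pathlib import Path
from typing import List


def build_commands(config_dir: str = "configs") -> List[List[str]]:
    """Return one `python main.py --config <file>` command per YAML file."""
    configs = sorted(Path(config_dir).glob("*.yaml"))
    return [[sys.executable, "main.py", "--config", str(path)] for path in configs]


def run_all(config_dir: str = "configs") -> None:
    """Run the generation command for each config, one after another."""
    for command in build_commands(config_dir):
        print("Running:", " ".join(command))
        # check=True stops the batch at the first failing run.
        subprocess.run(command, check=True)


if __name__ == "__main__":
    run_all()
```

Running the configs sequentially keeps GPU memory use the same as a single run; each invocation is an independent process.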

Contact

Should you have any questions, please contact hanan.ghani@mbzuai.ac.ae.

Citation

If you use our work, please consider citing:

@misc{gani2023llm,
      title={LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts}, 
      author={Hanan Gani and Shariq Farooq Bhat and Muzammal Naseer and Salman Khan and Peter Wonka},
      year={2023},
      eprint={2310.10640},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Acknowledgements

Our code is built on the repositories of LLM Grounded Diffusion and Paint by Example. We thank them for their open-source implementation and instructions.
