PaletteLab

💻Repository 🕹️Demo 📦Model

PaletteLab is a project that explores AI-driven color palette generation. It is designed to support multimodal inputs, including text, images, and palettes. At present, only text-to-palette generation is supported.

Examples

More samples

Installation

Prerequisites

Python 3.10 or higher
PyTorch 2.7 or higher

Setup

Clone the repository and enter the directory

git clone https://github.com/oakoio/palettelab
cd palettelab

If you have an NVIDIA GPU, install a CUDA-enabled PyTorch build. CPU-only users can skip this step.

pip install torch==2.7.0 torchvision==0.22.0 --index-url https://download.pytorch.org/whl/cu126

Install other dependencies

pip install -r requirements.txt
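After installation, a quick sanity check against the prerequisites listed above can save debugging time. The following is a minimal sketch, not part of the repository: the helper names parse_major_minor and check_environment are hypothetical, and it only inspects the installed package metadata rather than importing PyTorch.

```python
import sys
from importlib import metadata


def parse_major_minor(version: str) -> tuple[int, int]:
    """Extract (major, minor) from a version string such as '2.7.0+cu126'."""
    major, minor = version.split("+")[0].split(".")[:2]
    return int(major), int(minor)


def check_environment() -> list[str]:
    """Return a list of problems; an empty list means the prerequisites are met."""
    problems = []
    if sys.version_info[:2] < (3, 10):
        problems.append("Python 3.10 or higher is required")
    try:
        torch_version = metadata.version("torch")
    except metadata.PackageNotFoundError:
        problems.append("PyTorch is not installed")
    else:
        if parse_major_minor(torch_version) < (2, 7):
            problems.append(f"PyTorch 2.7 or higher is required, found {torch_version}")
    return problems
```

Calling check_environment() and printing the returned list shows which prerequisite, if any, is missing.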

Usage

Visit the demo (link at the top).
You can also run the model locally through a notebook or a Gradio web interface. Either download the model weights manually (link at the top) or let the script download them automatically, which is the default.

Notebook

Open run.ipynb.
Locate the cell that defines use_local and configure variables as needed.
Locate the cell that defines text_prompt and configure variables as needed.
Run the notebook.

Gradio

Run the following command:

python -m gradio_app.app

If you downloaded the model weights manually, run the command with the --model argument:

python -m gradio_app.app --model [model.path]

Pipeline

Overview

The current text-to-palette model is primarily a conditional transformer decoder. The high-level workflow is as follows:

Text encoder: A pretrained CLIP model is used to produce text features and is largely frozen. The text features are projected into the model's latent space using an MLP (text_proj) and serve as decoder inputs.
Palette encoder: Palettes, represented in normalized LAB color space, are embedded using an MLP (color_embed). These color embeddings are also used as decoder inputs.
Transformer decoder: An autoregressive nn.TransformerDecoder generates the palette conditioned on the projected text features. A learned BOS token is prepended to the palette sequence. Sinusoidal positional encodings are added to inject sequence position information.
Stochastic conditioning (z): A noise vector z ~ N(0,I) is sampled per sequence, projected via a learnable z_proj, and added to the decoder inputs.
Per-step heads: Two output heads predict L (via sigmoid) and ab (via tanh) values for color generation.
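Two pieces of the workflow above can be illustrated without a deep learning framework: the sinusoidal positional encoding and the mapping from raw head outputs to a LAB color. The sketch below is illustrative only. The function names and the normalization constants (L divided by 100, a and b divided by 128) are assumptions chosen to match the sigmoid and tanh output ranges; the README does not state the constants used in the repository.

```python
import math

# Assumed normalization constants; not taken from the repository.
L_SCALE = 100.0   # CIELAB L* spans 0..100
AB_SCALE = 128.0  # a* and b* treated as roughly -128..128


def sinusoidal_encoding(position: int, d_model: int) -> list[float]:
    """Standard sinusoidal positional encoding for one sequence position."""
    enc = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        enc.append(math.sin(angle))      # even dimension
        if i + 1 < d_model:
            enc.append(math.cos(angle))  # odd dimension
    return enc


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def normalize_lab(lab: tuple[float, float, float]) -> tuple[float, float, float]:
    """Map a LAB color to L in [0, 1] and a, b in about [-1, 1]."""
    l, a, b = lab
    return (l / L_SCALE, a / AB_SCALE, b / AB_SCALE)


def heads_to_lab(l_logit: float, a_logit: float, b_logit: float) -> tuple[float, float, float]:
    """Apply sigmoid to the L head and tanh to the ab head, then undo normalization."""
    l_norm = sigmoid(l_logit)    # in (0, 1)
    a_norm = math.tanh(a_logit)  # in (-1, 1)
    b_norm = math.tanh(b_logit)  # in (-1, 1)
    return (l_norm * L_SCALE, a_norm * AB_SCALE, b_norm * AB_SCALE)
```

With zero logits, heads_to_lab returns a mid-gray (L = 50, a = b = 0), which is the center of both activation ranges.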

Stochasticity

During training, stochasticity is introduced through noisy teacher forcing on palette inputs and stochastic conditioning. This helps regularize the model, improves robustness to small color variations, and prevents overfitting to exact palette configurations.

During inference, stochasticity arises from stochastic conditioning and per-step color sampling. This allows the model to generate diverse palettes for the same text prompt. Both sources of randomness can be disabled to produce deterministic palette generation.
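The two sources of randomness can be sketched as follows. This is an illustration under stated assumptions rather than the repository's code: the function names, the noise scale sigma, the clamping to the normalized LAB ranges, and the choice to disable stochastic conditioning by setting z to zeros are all hypothetical.

```python
import random


def clamp(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


def noisy_teacher_inputs(palette, sigma, rng):
    """Add Gaussian noise to normalized LAB inputs (assumed L in [0, 1], a/b in [-1, 1])."""
    noisy = []
    for l, a, b in palette:
        noisy.append((
            clamp(l + rng.gauss(0.0, sigma), 0.0, 1.0),
            clamp(a + rng.gauss(0.0, sigma), -1.0, 1.0),
            clamp(b + rng.gauss(0.0, sigma), -1.0, 1.0),
        ))
    return noisy


def sample_z(dim, rng, enabled=True):
    """Draw z ~ N(0, I); return zeros when stochastic conditioning is disabled (assumption)."""
    if not enabled:
        return [0.0] * dim
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


rng = random.Random(0)  # fixed seed so repeated runs give the same numbers
palette = [(0.5, 0.1, -0.2), (0.9, 0.0, 0.3)]
train_inputs = noisy_teacher_inputs(palette, sigma=0.05, rng=rng)
z_random = sample_z(8, rng)
z_fixed = sample_z(8, rng, enabled=False)
```

Setting sigma to 0 and disabling z makes the inputs fully deterministic, which mirrors the deterministic generation mode described above.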

Loss Function

The loss function is a weighted combination of:

Mean squared error (MSE) loss for per-color reconstruction accuracy
Hungarian matching loss, which compares predicted and target colors independently of their order

The relative contribution of each term is controlled by a scalar weight (see configs/config.yaml).
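The difference between the two terms is easiest to see on a tiny example. The sketch below is an assumption-laden illustration, not the training code: it finds the minimum-cost assignment by brute force over permutations instead of the Hungarian algorithm (both yield the same optimal matching, but brute force is only practical for small palettes), and the names w_mse and w_match are hypothetical stand-ins for the weights in configs/config.yaml.

```python
from itertools import permutations


def color_sq_error(p, q):
    """Mean squared error between two colors given as equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(p, q)) / len(p)


def mse_loss(pred, target):
    """Order-sensitive loss: color i is compared with target color i."""
    return sum(color_sq_error(p, t) for p, t in zip(pred, target)) / len(pred)


def matching_loss(pred, target):
    """Order-invariant loss: cost of the best one-to-one assignment of colors."""
    n = len(pred)
    best = min(
        sum(color_sq_error(pred[i], target[j]) for i, j in enumerate(perm))
        for perm in permutations(range(n))
    )
    return best / n


def combined_loss(pred, target, w_mse=1.0, w_match=1.0):
    """Weighted sum of both terms; the weight values here are placeholders."""
    return w_mse * mse_loss(pred, target) + w_match * matching_loss(pred, target)


target = [(0.2, 0.0, 0.0), (0.8, 0.5, -0.5)]
pred = list(reversed(target))  # same colors, different order
```

For this pair, mse_loss is positive because the positions disagree, while matching_loss is zero because the same colors are present.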

Dataset

The training dataset was curated from multiple online sources in January 2026. After pruning entries containing uncommon words not included in this English word database, the dataset contains 98,863 text-palette pairs. Approximately 95% of the entries are palettes of 5 colors. The training dataset is not released.

Training

The model was trained for 20 epochs and achieved its best validation loss at epoch 11. The weights from that epoch were selected for the published model.

Results

Across a wide range of prompts, the model generates palettes that are visually coherent and semantically aligned with the input text. However, some palettes are not aesthetically pleasing and contain highly similar colors. Performance is strong on prompts that explicitly reference color names, but degrades on linguistically complex prompts involving compositional reasoning or negation.

Embeddings

Embeddings of test prompts and their generated palettes are visualized using t-SNE. Text embeddings (from the CLIP-based text projection) and palette embeddings (mean-pooled color embeddings) are projected separately.

In the palette embedding space, the color of each point corresponds to the average color of the associated palette.
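Mean pooling, used here both for palette embeddings and for the average color of a palette, is an element-wise average over the colors in a palette. A minimal sketch follows; the function name is hypothetical, and averaging colors directly in LAB space is an assumption, since the README does not say which color space is used for the point colors.

```python
def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors element-wise."""
    if not vectors:
        raise ValueError("mean_pool needs at least one vector")
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Average color of a two-color palette given as LAB triples.
average_lab = mean_pool([(50.0, 10.0, -10.0), (70.0, -10.0, 10.0)])
```

The same function applies unchanged to per-color embedding vectors, producing one pooled vector per palette for the t-SNE projection.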

Key observations:

Prompts suggesting high-saturation color themes, such as green, yellow, and blue, form tight, well-defined clusters in both text space and palette space.
Muted and neutral prompts appear more dispersed.

Overall, the cluster structure is largely preserved from text space to palette space. The model maintains semantic organization while mapping language to color.
