
AudioToken: Adaptation of Text-Conditioned Diffusion Models for Audio-to-Image Generation

This repo contains the official PyTorch implementation of AudioToken: Adaptation of Text-Conditioned Diffusion Models for Audio-to-Image Generation

Abstract

In recent years, image generation has shown a great leap in performance, where diffusion models play a central role. Although such models generate high-quality images, they are mainly conditioned on textual descriptions. This begs the question: how can we adapt such models to be conditioned on other modalities? In this paper, we propose a novel method utilizing latent diffusion models, trained for text-to-image generation, to generate images conditioned on audio recordings. Using a pre-trained audio encoding model, the proposed method encodes audio into a new token, which can be considered as an adaptation layer between the audio and text representations. Such a modeling paradigm requires a small number of trainable parameters, making the proposed approach appealing for lightweight optimization. Results suggest the proposed method is superior to the evaluated baseline methods, considering both objective and subjective metrics.
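
As a rough illustration of the idea (not the repository's exact module), the sketch below shows how a small trainable projector could map pooled features from a frozen audio encoder into the text-embedding space of a frozen diffusion model, yielding a single audio token that is spliced into the prompt embeddings. All names and dimensions here are assumptions for illustration.

import torch
import torch.nn as nn

class AudioTokenProjector(nn.Module):
    # Hypothetical sketch: only this projector is trained; the audio
    # encoder and the diffusion model remain frozen, which keeps the
    # number of trainable parameters small.
    def __init__(self, audio_dim=768, text_dim=768, hidden_dim=2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, text_dim),
        )

    def forward(self, audio_features):
        # audio_features: (batch, frames, audio_dim) from a pre-trained
        # audio encoder such as BEATs; mean-pool over time, then project.
        pooled = audio_features.mean(dim=1)
        return self.proj(pooled)  # (batch, text_dim): one audio token per clip

projector = AudioTokenProjector()
audio_features = torch.randn(1, 496, 768)   # stand-in for audio-encoder output
audio_token = projector(audio_features)     # (1, 768)
prompt_embeds = torch.randn(1, 77, 768)     # stand-in for CLIP token embeddings
prompt_embeds[:, 5, :] = audio_token        # splice the token at a placeholder position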

Hugging Face Spaces

An interactive demo is available on Hugging Face Spaces.

Installation

git clone git@github.com:guyyariv/AudioToken.git
cd AudioToken
pip install -r requirements.txt

And initialize an Accelerate environment with:

accelerate config

Download BEATs pre-trained model

mkdir -p models/BEATs/ && wget -O models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt "https://valle.blob.core.windows.net/share/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt?sv=2020-08-04&st=2023-03-01T07%3A51%3A05Z&se=2033-03-02T07%3A51%3A00Z&sr=c&sp=rl&sig=QJXmSJG9DbMKf48UDIU1MfzIro8HQOf3sqlNXiflY1I%3D"
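
Once downloaded, the checkpoint can be loaded with the BEATs classes released in the microsoft/unilm repository (the import path below assumes those classes are available in your environment; adapt it to where they live in this repo):

import torch
from BEATs import BEATs, BEATsConfig

# Load the fine-tuned BEATs checkpoint downloaded above
checkpoint = torch.load("models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt")
cfg = BEATsConfig(checkpoint["cfg"])
beats = BEATs(cfg)
beats.load_state_dict(checkpoint["model"])
beats.eval()

# Extract features from a 16 kHz waveform (random stand-in here)
audio = torch.randn(1, 16000)
padding_mask = torch.zeros(1, 16000).bool()
features = beats.extract_features(audio, padding_mask=padding_mask)[0]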

Pre-Trained Embedder


The pre-trained embedder weights on which the paper's results are based are provided at: output/embedder_learned_embeds.bin
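
As a quick sanity check that the checkpoint is intact, you can inspect its contents with PyTorch (the exact keys and tensor shapes stored in the .bin file are not documented here, so treat the printout as informational):

import torch

state = torch.load("output/embedder_learned_embeds.bin", map_location="cpu")
for key, value in state.items():
    shape = tuple(value.shape) if hasattr(value, "shape") else type(value)
    print(key, shape)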

Training

First, download the VGGSound dataset. Download links for the dataset can be found here.

export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export DATA_DIR="./vggsound/"
export OUTPUT_DIR="output/"

accelerate launch train.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --data_dir=$DATA_DIR \
  --output_dir=$OUTPUT_DIR \
  --resolution=512 \
  --train_batch_size=4 \
  --gradient_accumulation_steps=4 \
  --max_train_steps=30000 \
  --learning_rate=1.0e-05

Note: Change the resolution to 768 if you are using the stable-diffusion-2 768x768 model.

Inference

After you've trained a model with the above command, you can simply generate images using the following script:

export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export DATA_DIR="./vggsound/"
export OUTPUT_DIR="output/"
export LEARNED_EMBEDS="output/embedder_learned_embeds.bin"

accelerate launch inference.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --data_dir=$DATA_DIR \
  --output_dir=$OUTPUT_DIR \
  --learned_embeds=$LEARNED_EMBEDS

Cite

If you use our work in your research, please cite the following paper:

@article{yariv2023audiotoken,
  title={AudioToken: Adaptation of Text-Conditioned Diffusion Models for Audio-to-Image Generation},
  author={Yariv, Guy and Gat, Itai and Wolf, Lior and Adi, Yossi and Schwartz, Idan},
  journal={arXiv preprint arXiv:2305.13050},
  year={2023}
}

License

This repository is released under the MIT license as found in the LICENSE file.
