Training an Anticipatory Music Transformer

Preprocessing the LakhMidi dataset

The LakhMidi dataset contains over 170,000 MIDI files and the following preprocessing scripts require considerable computation, system memory, and storage resources to execute. See the Resource Management section for discussion of how to manage the demands of preprocessing within the constraints of a particular system.

Download the full LakhMidi dataset (the lmd_full.tar.gz archive) from the Lakh MIDI project page and extract the data to DATAPATH.

tar -xzvf lmd_full.tar.gz -C $DATAPATH

Preprocess the binary MIDI files to an intermediate text representation.

python midi-preprocess.py $DATAPATH/lmd_full

Tokenize batches of intermediate LakhMidi data files according to the vocabulary defined in src/settings/vocab.py. The top-level script depends on the directory structure of the LakhMidi dataset. Parallelism is again controlled by PREPROC_WORKERS (see Resource Management below). Choose a dataset augmentation factor (a multiple of 10) to prepare data for training an anticipatory infilling model, or 1 (the default) for standard autoregressive training. Use the optional -i flag to generate training data for an interarrival-time model.

python tokenize-lakh.py $DATAPATH/lmd_full --augment 1

Define the train/validation/test splits. LakhMidi files are named according to their (hexadecimal) MD5 checksum: our convention is to use files starting with f as the test set, files starting with e as validation, and the rest of the dataset for training.

mv $DATAPATH/lmd_full/tokenized-events-e.txt $DATAPATH/valid.txt
mv $DATAPATH/lmd_full/tokenized-events-f.txt $DATAPATH/test.txt
cat $DATAPATH/lmd_full/tokenized-events-*.txt > $DATAPATH/lmd_full/train-ordered.txt
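
The split rule keys on the first hexadecimal digit of each file's MD5 checksum, which is also the filename. As an illustration only (not part of the preprocessing pipeline), here is a minimal sketch of that assignment; the split_for_file helper and the example filenames are hypothetical.

```python
from pathlib import Path

def split_for_file(midi_path: str) -> str:
    """Assign a LakhMidi file to a split from the first hex digit of its name.

    LakhMidi files are named by their MD5 checksum, so the filename's first
    character is the first hex digit of the checksum.
    """
    first = Path(midi_path).stem[0].lower()
    if first == "f":
        return "test"
    if first == "e":
        return "valid"
    return "train"  # hex digits 0-9 and a-d

# Example (hypothetical filenames):
print(split_for_file("lmd_full/e/e0a1b2c3d4e5f60718293a4b5c6d7e8f.mid"))  # valid
print(split_for_file("lmd_full/1/1f2e3d4c5b6a79880716253443526170.mid"))  # train
```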

Finally, we must shuffle the training data. This shuffle procedure requires a machine with enough RAM to load the entire training data file into memory (see the Memory discussion under Resource Management).

shuf $DATAPATH/lmd_full/train-ordered.txt > $DATAPATH/train.txt

The final preprocessed train/valid/test splits are now available at DATAPATH (train.txt, valid.txt, and test.txt).

Resource Management

Compute. Preprocessing scripts are designed to run with multiprocessing parallelism: you can configure the number of workers using PREPROC_WORKERS in src/settings/constants.py.
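
As a rough sketch of this pattern (not the repository's actual preprocessing code), the following shows how a worker count such as PREPROC_WORKERS typically drives a multiprocessing pool over a list of files; the preprocess_one helper, the file list, and the constant's value here are all hypothetical.

```python
from multiprocessing import Pool

PREPROC_WORKERS = 16  # hypothetical value; the real setting lives in src/settings/constants.py

def preprocess_one(midi_path: str) -> str:
    # hypothetical stand-in for the per-file preprocessing work
    return f"processed {midi_path}"

if __name__ == "__main__":
    files = [f"lmd_full/0/{i:032x}.mid" for i in range(8)]  # hypothetical file list
    with Pool(PREPROC_WORKERS) as pool:
        for result in pool.imap_unordered(preprocess_one, files):
            print(result)
```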

Memory. The most memory-intensive step is the final shuffle, which requires the entire training dataset to be loaded into memory. Memory-efficient alternatives exist for shuffling (or approximately shuffling) the lines of a file, which you may wish to explore if memory is a constraint; one such approach is sketched below. Warning: we have observed that approximate shuffling using a local (windowed) shuffle of the training data is not sufficient to achieve good model performance and should be avoided.
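
If the training file does not fit in memory, one standard alternative is a two-pass external shuffle: scatter lines uniformly at random into temporary bucket files, then shuffle each bucket in memory and concatenate. Unlike a local (windowed) shuffle, this produces a full global shuffle. The sketch below is a minimal illustration of that idea, not part of this repository; adjust num_buckets so that each bucket fits comfortably in RAM.

```python
import os
import random
import tempfile

def external_shuffle(src_path: str, dst_path: str, num_buckets: int = 100) -> None:
    """Shuffle the lines of a large file without loading it all into memory."""
    with tempfile.TemporaryDirectory() as tmpdir:
        buckets = [open(os.path.join(tmpdir, f"bucket-{i}.txt"), "w") for i in range(num_buckets)]
        try:
            # Pass 1: scatter each line into a uniformly random bucket.
            with open(src_path) as src:
                for line in src:
                    random.choice(buckets).write(line)
        finally:
            for b in buckets:
                b.close()

        # Pass 2: shuffle each bucket in memory and concatenate the results.
        with open(dst_path, "w") as dst:
            for i in range(num_buckets):
                with open(os.path.join(tmpdir, f"bucket-{i}.txt")) as b:
                    lines = b.readlines()
                random.shuffle(lines)
                dst.writelines(lines)

# Usage (paths follow the conventions above):
# external_shuffle(os.path.expandvars("$DATAPATH/lmd_full/train-ordered.txt"),
#                  os.path.expandvars("$DATAPATH/train.txt"))
```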

Disk. The base preprocessed LakhMidi dataset, prepared for autoregressive modeling without anticipation, is about 10 GB. The dataset prepared for anticipatory autoregressive modeling is a multiple of the base size, controlled by the --augment parameter to scripts/tokenize-lakh.py (e.g., roughly ten times the base footprint for an augmentation factor of 10). See the Cleanup section for instructions on removing temporary files from disk after preprocessing.

Cleanup

Each stage of preprocessing generates temporary files with intermediate outputs. After completing preprocessing, these temporary files can be cleaned up as follows.

Delete the intermediate data files generated by the midi-preprocess script.

rm $DATAPATH/lmd_full/*/*.txt

Delete the intermediate tokenized events generated by the tokenize-lakh script.

rm $DATAPATH/lmd_full/tokenized-events-*.txt

Delete the ordered training data file, which is superseded by the shuffled version.

rm $DATAPATH/lmd_full/train-ordered.txt

Levanter Training

This directory contains config files for training Small, Medium, and Large models using the Levanter codebase:

lakh-small.yaml
lakh-medium.yaml
lakh-large.yaml

You will need to update these configuration files to point to the location where you have stored your preprocessed data splits.

Use the split_lakh branch of the Levanter codebase to reproduce the training environment used for the models presented in the Anticipatory Music Transformer paper. This branch includes some minor modifications to the mainline Levanter codebase for processing pre-tokenized music datasets.

See the Levanter codebase for further instructions on training a model.