
[MusicBERT] Restriction to 1002 octuples when using preprocess.encoding_to_str #60

Closed
tripathiarpan20 opened this issue Jul 5, 2022 · 6 comments


tripathiarpan20 commented Jul 5, 2022

Hi once again!

While preprocessing a MIDI file, I noticed that the MIDI_to_encoding method performs as intended and converts the sample song to 106 bars, as seen in the snip below of the resultant octuples (please correct me if I'm wrong).

However, the encoding_to_str method restricts the result to just 18 bars (as can be concluded from the highlighted <0-18> near the end of the encoded string in the snip below):

[screenshot: encoded string ending near <0-18>]

More generally, what I have noticed across multiple MIDI files is that only up to the first 1000 note octuples are kept (i.e., start-token octuple + 1000 note octuples + end-token octuple = 1002 * 8 = 8016 tokens):
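The truncation arithmetic above can be sketched as follows; assuming the window size is the sample_len_max = 1000 used in preprocess.py (the value is inferred from the observation above, not confirmed from the source):

```python
# Illustrative sketch of the observed truncation; sample_len_max = 1000
# is an assumption inferred from the 1002-octuple observation above.
sample_len_max = 1000

octuples = [("note", i) for i in range(5000)]  # a song with 5000 note octuples
kept = octuples[:sample_len_max]               # only this window is kept

# start-token octuple + kept note octuples + end-token octuple
n_octuples = 1 + len(kept) + 1
n_tokens = n_octuples * 8
print(n_octuples, n_tokens)  # 1002 8016
```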

[screenshot: only the first 1000 note octuples retained]

Is there any way to change encoding_to_str to get the whole song instead? Up to 256 bars only, I mean, since the model vocabulary is also restricted to 256 bars.
I am not familiar enough with miditoolkit or mido to understand the code properly yet, otherwise I would have tried to fix this myself.

Thanks in advance!

Edit: I am aware that the musicbert_base model can only support up to 8192 octuples (i.e., the final input to the MusicBERT encoder), but that does not seem to be the issue here, I think.

mlzeng (Collaborator) commented Jul 5, 2022

Hi @tripathiarpan20

The musicbert_base model can only support up to 1024 octuples (= 8192 tokens).

If you need encoding_to_str to emit the whole song, you can increase the value of sample_len_max, or just replace the for i in e[p: p + sample_len_max] loop in that function with for i in e.
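The two loop variants can be sketched side by side; this is a hedged sketch, not the real encoding_to_str (which does much more) — the function names here are illustrative, only the loop expressions follow the description above:

```python
# Hedged sketch of the two loop variants described above; `e` is the list
# of octuples and `p` the sample start offset, per the comment's notation.
def collect_window(e, p, sample_len_max):
    # original behavior: only octuples in the window [p, p + sample_len_max)
    return [i for i in e[p: p + sample_len_max]]

def collect_all(e, p, sample_len_max):
    # suggested change: iterate over the whole song
    return [i for i in e]

song = list(range(5000))
print(len(collect_window(song, 0, 1000)), len(collect_all(song, 0, 1000)))  # 1000 5000
```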

tripathiarpan20 (Author) commented Jul 5, 2022

Hi @mlzeng.
There is a bit of confusion. With context from musicbert/musicbert/__init__.py, let's say we have input tokens of some batch size with 8192 octuples (i.e., 8192 * 8 tokens) in each sample.

tokens would be something like torch.Tensor([[0, 0, 0, 0, 0, 0, 0, 0, ..., 2, 2, 2, 2, 2, 2, 2, 2], ...]), given that the token mapping of <s> is 0 and </s> is 2, obtained via roberta_base.task.label_dictionary.index('<0-1>') and so on.
Also, tokens.shape[1] % 8 == 0 is True because the tokens form octuples.

The OctupleEncoder in lines 251-252 first converts tokens of dimension [batch_size, 8192 * 8] to x of dimension [batch_size, 8192 * 8, 768] (see the green arrow in the Colab snip at the end of this comment):

[screenshot: OctupleEncoder embedding code, lines 251-252]

Lines 253-254 then convert x:

  • First from [batch_size, 8192 * 8, 768] to [batch_size, 8192, 768 * 8] (with x.view); this might be how the octuples are actually formed, by concatenating the token embeddings of the 8 elements of each octuple.
  • This output is then converted to [batch_size, 8192, 768] with an nn.Linear layer named self.downsampling (the ratio is 8, from line 232).

Also see the red arrow in the Colab snip at the end of this comment for the definition of self.downsampling:
[screenshot: self.downsampling definition]
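The reshaping steps above can be sketched shape-by-shape; this is an illustrative sketch with small sizes, not the real OctupleEncoder (the layer names follow the description of __init__.py, and the vocabulary size here is made up):

```python
import torch
import torch.nn as nn

# Shape-only sketch of the embed -> view -> downsample steps described above.
batch_size, n_octuples, ratio, d = 2, 16, 8, 768
tokens = torch.zeros(batch_size, n_octuples * ratio, dtype=torch.long)

embed_tokens = nn.Embedding(1237, d)                   # vocab size is illustrative
downsampling = nn.Linear(d * ratio, d)                 # the "red arrow" layer

x = embed_tokens(tokens)                               # [2, 128, 768]
x = x.view(batch_size, -1, d * ratio)                  # [2, 16, 6144]: one row per octuple
x = downsampling(x)                                    # [2, 16, 768]
print(tuple(x.shape))  # (2, 16, 768)
```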

Lines 257-259 then add positional embeddings using only every (8*i + 1)-th element of the original sequence tokens (since ratio is 8), which is the measure (i.e., bar) field of each octuple:
[screenshots: positional-embedding code, lines 257-259]
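The stride-8 slicing can be illustrated in isolation — one token (the bar field) is picked per octuple:

```python
import torch

# Sketch of the stride-8 slicing: positions come from the first element
# (the bar field) of each octuple, with ratio = 8.
ratio, n_octuples = 8, 6
tokens = torch.arange(n_octuples * ratio).unsqueeze(0)  # shape [1, 48]

bar_tokens = tokens[:, ::ratio]  # one token per octuple
print(bar_tokens.tolist())  # [[0, 8, 16, 24, 32, 40]]
```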

Now, if we look at the definition of embed_positions in MusicBERTEncoder, marked by the blue arrow in the Colab snip below, tokens[:, ::ratio].shape[1] should be able to go up to 8194!

[screenshots: embed_positions definition and Colab output]

Therefore, up to 8194 octuples should be supported (with or without padding)!

Please let me know if there is any misunderstanding in my logic, as I really need to understand this.

Thanks!

mlzeng (Collaborator) commented Jul 5, 2022

Hi @tripathiarpan20

Your logic is pretty much correct. But MusicBERT models are trained with the setting TOKENS_PER_SAMPLE=8192 (as seen in train_mask.sh), which means the length of input sequences will not exceed 8192 tokens (= 1024 octuples), and the attention layers in the encoder will only ever see tensors of length at most 1024.

Processing 8192 octuples with MusicBERT is theoretically possible, but that would require 64 times more GPU memory (memory usage is quadratically proportional to sequence length), which is impractical currently. (The original RoBERTa models were trained with sequence length = 512.)
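The "64 times more GPU memory" figure follows directly from the quadratic scaling:

```python
# Rough arithmetic behind the 64x estimate: attention memory grows
# quadratically with sequence length (here, length in octuples).
trained_len = 1024   # octuples per sample at pre-training (8192 tokens / 8)
desired_len = 8192   # octuples needed for full 8192-octuple inputs
print((desired_len / trained_len) ** 2)  # 64.0
```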

tripathiarpan20 (Author) commented

I see, it is now clear! 😃

The provided MusicBERT base model also performs amazingly well without any further training; thanks to the team once again for making it open source.

In musicbert/musicbert/__init__.py I also noticed the definition of musicbert_large. Is that another model that will be released eventually?

mlzeng (Collaborator) commented Jul 5, 2022

We have tried pre-training with the musicbert_large config, but the model was unstable and exploded during training. 🤯

So be careful when training a large model. We will release a large model if we can figure out the problem.

tripathiarpan20 (Author) commented

Oh I see, I hope it gets resolved eventually.

Would the models fine-tuned on genre prediction and accompaniment suggestion be released too?
I'm thinking of implementing a genre-based task, and the fine-tuned model would be a great starting point.
