
[MusicBERT] Restriction to 1002 octuples when using preprocess.encoding_to_str #60

Closed
tripathiarpan20 opened this issue Jul 5, 2022 · 6 comments


tripathiarpan20 commented Jul 5, 2022

Hi once again!

While preprocessing a MIDI file, I noticed that the MIDI_to_encoding method performs as intended and converts the sample song to 106 bars, as seen in the snip below of the resultant octuples (please correct me if I'm wrong).

However, the encoding_to_str method restricts the result to just 18 bars (as can be concluded from the highlighted <0-18> near the end of the encoded string in the snip below):

[screenshot: encoded string ending near <0-18>]

More generally, what I have noticed across multiple MIDI files is that only up to the first 1000 note octuples are kept (i.e., start-token octuple + 1000 note octuples + end-token octuple = 1002 * 8 = 8016 tokens):
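The truncation arithmetic above can be sketched as follows; assuming the window size is the sample_len_max = 1000 used in preprocess.py (the value is inferred from the observation above, not confirmed from the source):

```python
# Illustrative sketch of the observed truncation; sample_len_max = 1000
# is an assumption inferred from the 1002-octuple observation above.
sample_len_max = 1000

octuples = [("note", i) for i in range(5000)]  # a song with 5000 note octuples
kept = octuples[:sample_len_max]               # only this window is kept

# start-token octuple + kept note octuples + end-token octuple
n_octuples = 1 + len(kept) + 1
n_tokens = n_octuples * 8
print(n_octuples, n_tokens)  # 1002 8016
```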

[screenshot: only the first 1000 note octuples retained]

Is there any way to change encoding_to_str to get the whole song instead? Up to 256 bars only, I mean, since the model vocabulary is also restricted to 256 bars.
I am not familiar enough with miditoolkit or mido to understand the code properly yet, otherwise I would have tried to fix this myself.

Thanks in advance!

Edit: I am aware that the musicbert_base model can only support up to 8192 octuples (i.e., the final input to the MusicBERT encoder), but that does not seem to be the issue here, I think.

mlzeng (Collaborator) commented Jul 5, 2022

Hi @tripathiarpan20

The musicbert_base model can only support up to 1024 octuples (= 8192 tokens).

If you need encoding_to_str to emit the whole song, you can increase the value of sample_len_max, or just replace the for i in e[p: p + sample_len_max] loop in that function with for i in e.
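The two loop variants can be sketched side by side; this is a hedged sketch, not the real encoding_to_str (which does much more) — the function names here are illustrative, only the loop expressions follow the description above:

```python
# Hedged sketch of the two loop variants described above; `e` is the list
# of octuples and `p` the sample start offset, per the comment's notation.
def collect_window(e, p, sample_len_max):
    # original behavior: only octuples in the window [p, p + sample_len_max)
    return [i for i in e[p: p + sample_len_max]]

def collect_all(e, p, sample_len_max):
    # suggested change: iterate over the whole song
    return [i for i in e]

song = list(range(5000))
print(len(collect_window(song, 0, 1000)), len(collect_all(song, 0, 1000)))  # 1000 5000
```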

tripathiarpan20 (Author) commented Jul 5, 2022

Hi @mlzeng.
There is a bit of confusion. With context from musicbert/musicbert/__init__.py, let's say we have input tokens of some batch size with 8192 octuples (i.e., 8192 * 8 tokens) in each sample.

tokens would be something like torch.Tensor([[0, 0, 0, 0, 0, 0, 0, 0, ..., 2, 2, 2, 2, 2, 2, 2, 2], ...]), given that the token mapping of <s> is 0 and </s> is 2, obtained via roberta_base.task.label_dictionary.index('<0-1>') and so on.
Also, tokens.shape[1] % 8 == 0 is True because the tokens form octuples.

The OctupleEncoder in lines 251-252 first converts tokens of dimension [batch_size, 8192 * 8] to x of dimension [batch_size, 8192 * 8, 768] (see the green arrow in the Colab snip at the end of this comment):

[screenshot: OctupleEncoder embedding code, lines 251-252]

Lines 253-254 then convert x:

  • First from [batch_size, 8192 * 8, 768] to [batch_size, 8192, 768 * 8] (with x.view); this might be how the octuples are actually formed, by concatenating the token embeddings of the 8 elements of each octuple.
  • This output is then converted to [batch_size, 8192, 768] with an nn.Linear layer named self.downsampling (the ratio is 8, from line 232).

Also see the red arrow in the Colab snip at the end of this comment for the definition of self.downsampling:
[screenshot: self.downsampling definition]
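The reshaping steps above can be sketched shape-by-shape; this is an illustrative sketch with small sizes, not the real OctupleEncoder (the layer names follow the description of __init__.py, and the vocabulary size here is made up):

```python
import torch
import torch.nn as nn

# Shape-only sketch of the embed -> view -> downsample steps described above.
batch_size, n_octuples, ratio, d = 2, 16, 8, 768
tokens = torch.zeros(batch_size, n_octuples * ratio, dtype=torch.long)

embed_tokens = nn.Embedding(1237, d)                   # vocab size is illustrative
downsampling = nn.Linear(d * ratio, d)                 # the "red arrow" layer

x = embed_tokens(tokens)                               # [2, 128, 768]
x = x.view(batch_size, -1, d * ratio)                  # [2, 16, 6144]: one row per octuple
x = downsampling(x)                                    # [2, 16, 768]
print(tuple(x.shape))  # (2, 16, 768)
```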

Lines 257-259 then add positional embeddings using only every (8*i + 1)-th element of the original sequence tokens (since ratio is 8), which is the measure (i.e., bar) field of each octuple:
[screenshots: positional-embedding code, lines 257-259]
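The stride-8 slicing can be illustrated in isolation — one token (the bar field) is picked per octuple:

```python
import torch

# Sketch of the stride-8 slicing: positions come from the first element
# (the bar field) of each octuple, with ratio = 8.
ratio, n_octuples = 8, 6
tokens = torch.arange(n_octuples * ratio).unsqueeze(0)  # shape [1, 48]

bar_tokens = tokens[:, ::ratio]  # one token per octuple
print(bar_tokens.tolist())  # [[0, 8, 16, 24, 32, 40]]
```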

Now, if we look at the definition of embed_positions in MusicBERTEncoder, marked by the blue arrow in the Colab snip below, tokens[:, ::ratio].shape[1] should be able to go up to 8194!

[screenshots: embed_positions definition and Colab output]

Therefore, up to 8194 octuples should be supported (with or without padding)!

Please let me know if there is any misunderstanding in my logic, as I really need to understand this.

Thanks!

mlzeng (Collaborator) commented Jul 5, 2022

Hi @tripathiarpan20

Your logic is pretty much correct. But MusicBERT models are trained with the setting TOKENS_PER_SAMPLE=8192 (as seen in train_mask.sh), which means the length of input sequences will not exceed 8192 tokens (= 1024 octuples), and the attention layers in the encoder will only ever see tensors of length at most 1024.

Processing 8192 octuples with MusicBERT is theoretically possible, but that would require 64 times more GPU memory (memory usage is quadratically proportional to sequence length), which is impractical currently. (The original RoBERTa models were trained with sequence length = 512.)
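The "64 times more GPU memory" figure follows directly from the quadratic scaling:

```python
# Rough arithmetic behind the 64x estimate: attention memory grows
# quadratically with sequence length (here, length in octuples).
trained_len = 1024   # octuples per sample at pre-training (8192 tokens / 8)
desired_len = 8192   # octuples needed for full 8192-octuple inputs
print((desired_len / trained_len) ** 2)  # 64.0
```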

tripathiarpan20 (Author) commented

I see, it is now clear! 😃

The provided MusicBERT base model also performs amazingly well without any further training; thanks to the team once again for making it open source.

In musicbert/musicbert/__init__.py I also noticed the definition of musicbert_large. Is that another model that will be released eventually?

mlzeng (Collaborator) commented Jul 5, 2022

We have tried pre-training with the musicbert_large config, but the model was unstable and exploded during training. 🤯

So be careful when training a large model. We will release a large model if we can figure out the problem.

tripathiarpan20 (Author) commented

Oh I see, I hope it gets resolved eventually.

Would the models fine-tuned on genre prediction and accompaniment suggestion be released too?
I'm thinking of implementing a genre-based task, and the fine-tuned model would be a great starting point.
