I'm trying to fine-tune the pretrained model on another dataset, but I'm stuck on the loop block below.
I understand the final format of output_str_list, but I can't quite grasp what this code does, so I was hoping you could provide an explanation.
output_str_list = []
sample_step = max(round(sample_len_max / sample_overlap_rate), 1)
for p in range(0 - random.randint(0, sample_len_max - 1), len(e), sample_step):
    L = max(p, 0)
    R = min(p + sample_len_max, len(e)) - 1
    bar_index_list = [e[i][0] for i in range(L, R + 1) if e[i][0] is not None]
    bar_index_min = 0
    bar_index_max = 0
    if len(bar_index_list) > 0:
        bar_index_min = min(bar_index_list)
        bar_index_max = max(bar_index_list)
    offset_lower_bound = -bar_index_min
    offset_upper_bound = bar_max - 1 - bar_index_max
    # to make bar index distribute in [0, bar_max)
    bar_index_offset = random.randint(
        offset_lower_bound, offset_upper_bound) if offset_lower_bound <= offset_upper_bound else offset_lower_bound
    e_segment = []
    for i in e[L: R + 1]:
        if i[0] is None or i[0] + bar_index_offset < bar_max:
            e_segment.append(i)
        else:
            break
    tokens_per_note = 8
    output_words = (['<s>'] * tokens_per_note) \
        + [('<{}-{}>'.format(j, k if j > 0 else k + bar_index_offset) if k is not None else '<unk>')
           for i in e_segment for j, k in enumerate(i)] \
        + (['</s>'] * (tokens_per_note - 1)
           )  # tokens_per_note - 1 for append_eos functionality of binarizer in fairseq
    output_str_list.append(' '.join(output_words))
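For what it's worth, the last few lines turn each octuple (an 8-tuple of fields per note) into eight `<field-value>` words. A minimal standalone sketch of just that formatting step, with toy note tuples and `bar_index_offset` fixed to 0 (the real loop randomizes it):

```python
tokens_per_note = 8
# hypothetical toy segment: two notes, each an 8-tuple (bar index first)
e_segment = [(0, 1, 2, 60, 4, 5, 6, 7), (1, 0, 2, 62, 4, 5, 6, 7)]
bar_index_offset = 0  # assumed 0 here for readability

output_words = (['<s>'] * tokens_per_note) \
    + [('<{}-{}>'.format(j, k if j > 0 else k + bar_index_offset)
        if k is not None else '<unk>')            # None fields become <unk>
       for note in e_segment for j, k in enumerate(note)] \
    + ['</s>'] * (tokens_per_note - 1)

print(' '.join(output_words))
# starts with eight <s>, then <0-0> <1-1> <2-2> <3-60> ... per note,
# and ends with seven </s>
```

So field 0 (the bar index) is the only one shifted by `bar_index_offset`; all other fields are emitted as-is.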
Also, in gen_genre.py, why do we want to sample the train set multiple times? Why do we need output_str_list four times?
Thanks in advance!
Some octuple token sequences from the LMD dataset can be very long (more than 1024 octuple tokens), and the Transformer model cannot handle such long sequences because of GPU memory limits. So we use a sliding-window-style random sampling method to crop very long sequences into multiple shorter segments (which may overlap) for pre-training.
We randomly select multiple segments to avoid overfitting and to avoid wasting training data. Randomly cropping long sequences on the fly during training could work even better, but it would require extra code.
Model performance does not degrade significantly if only one segment is used per sequence (n_time = 1).
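The sliding-window cropping described above can be sketched in isolation (function name and toy lengths are hypothetical; the real loop in the question additionally remaps bar indices and formats tokens):

```python
import random

def crop_segments(seq, sample_len_max, sample_overlap_rate):
    """Slide a window of at most sample_len_max items over seq.

    Consecutive windows start sample_len_max / sample_overlap_rate apart,
    so segments overlap; the random negative start phase makes the window
    boundaries differ between runs, as in the code from the question.
    """
    # stride between consecutive window starts
    sample_step = max(round(sample_len_max / sample_overlap_rate), 1)
    segments = []
    # random negative start so the first boundary is not always index 0
    for p in range(-random.randint(0, sample_len_max - 1), len(seq), sample_step):
        left = max(p, 0)                          # clamp window into the sequence
        right = min(p + sample_len_max, len(seq))
        segments.append(seq[left:right])
    return segments

random.seed(0)  # for reproducibility of this sketch only
segments = crop_segments(list(range(2500)), sample_len_max=1000, sample_overlap_rate=4)
```

Every crop fits within the model's length limit, and because the windows overlap, together they still cover the whole sequence.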