#### **Tokens and vocabulary**
A token is a discrete element of a sequence. In natural language, a token can be a character, a subword or a word, so a sentence can be tokenized into a sequence of tokens representing its words and punctuation. For symbolic music, tokens can represent the values of note attributes (pitch, velocity, duration) or time events. These are the "basic" tokens, comparable to characters in natural language. With Byte Pair Encoding (BPE), a token can also represent a succession of these basic tokens. A token can take three forms, which we name by convention:

- **Token** (string): the form describing it, e.g. Pitch_50.
- **Id** (int): a unique associated integer, used as an index.
- **Byte** (string): a unique associated byte, used internally for Byte Pair Encoding (BPE).

MidiTok works with TokSequence objects, which hold token sequences in these three forms.
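To make the three forms concrete, here is a minimal, dependency-free sketch. `ToyToken` and `build_toy_tokens` are hypothetical names used only for illustration, and the way a "byte" character is derived from the id is an arbitrary assumption, not MidiTok's actual mapping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyToken:
    token: str  # descriptive string form, e.g. "Pitch_50"
    id: int     # unique integer, usable as an index
    byte: str   # unique single character standing in for the token


def build_toy_tokens(names, byte_offset=33):
    """Assign each token string an id and a one-character 'byte' form.

    byte_offset=33 is an arbitrary choice that skips ASCII control
    characters (assumption for this sketch, not MidiTok's mapping).
    """
    return [ToyToken(name, i, chr(i + byte_offset)) for i, name in enumerate(names)]


toy_tokens = build_toy_tokens(["Pitch_50", "Velocity_127", "Duration_0.4.16"])
for t in toy_tokens:
    print(t.token, t.id, repr(t.byte))
```

Each entry carries the same information in three interchangeable forms, which is why a sequence can be stored in whichever form is most convenient at a given step.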

#### **Vocabulary**
The vocabulary of a tokenizer acts as a lookup table, linking tokens (string / byte) to their ids (integer). The vocabulary is an attribute of the tokenizer and can be accessed with tokenizer.vocab. The vocabulary is a Python dictionary binding tokens (keys) to their ids (values). For tokenizations with embedding pooling (e.g. CPWord or Octuple), tokenizer.vocab will be a list of Vocabulary objects, and the tokenizer.is_multi_vocab property will be True.
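As a rough illustration of the lookup-table idea, the sketch below uses plain Python dictionaries. All names are hypothetical, and the "one table per token family" layout for the multi-vocabulary case is an assumption made for illustration; the real tokenizer.vocab is built by MidiTok.

```python
basic_tokens = ["PAD_None", "Pitch_50", "Pitch_52", "Velocity_127"]

# Forward table: token string -> id
toy_vocab = {token: idx for idx, token in enumerate(basic_tokens)}
# Reverse table: id -> token string
toy_vocab_inv = {idx: token for token, idx in toy_vocab.items()}


def tokens_to_ids(tokens, vocab):
    """Look up the id of each token string."""
    return [vocab[token] for token in tokens]


def ids_to_tokens(ids, vocab_inv):
    """Look up the token string of each id."""
    return [vocab_inv[idx] for idx in ids]


ids = tokens_to_ids(["Pitch_50", "Velocity_127"], toy_vocab)
print(ids)
print(ids_to_tokens(ids, toy_vocab_inv))

# Multi-vocabulary case (assumed layout): a list of tables instead of one.
toy_multi_vocab = [
    {"Pitch_50": 0, "Pitch_52": 1},
    {"Velocity_63": 0, "Velocity_127": 1},
]
is_multi_vocab = isinstance(toy_multi_vocab, list)
print(is_multi_vocab)
```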

With **Byte Pair Encoding (BPE)**: tokenizer.vocab holds all the basic tokens describing the note and time attributes of music. By analogy with text, these tokens can be seen as unique characters. After training a tokenizer with BPE, a new vocabulary is built with newly created tokens from pairs of basic tokens. This vocabulary can be accessed with tokenizer.vocab_bpe, and binds tokens as bytes (string) to their associated ids (int). This is the vocabulary of the tokenizer's BPE model.
BPE shortens token sequences, which in turn improves model efficiency, while also improving result quality and model performance.
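The sequence-shortening effect can be seen in a toy version of the merge procedure. This is a simplified sketch of the general BPE idea operating directly on token strings; it is not MidiTok's implementation (which works on the byte form), and all function names are hypothetical.

```python
from collections import Counter


def most_frequent_pair(seq):
    """Return the most frequent pair of adjacent tokens, or None."""
    pairs = Counter(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(seq, pair, new_token):
    """Replace every non-overlapping occurrence of `pair` with `new_token`."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(new_token)
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def toy_bpe(seq, num_merges):
    """Apply up to `num_merges` merges and return the new sequence and merges."""
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(seq)
        if pair is None:
            break
        seq = merge_pair(seq, pair, "+".join(pair))
        merges.append(pair)
    return seq, merges


notes = ["Pitch_43", "Velocity_127", "Duration_0.4.16"] * 3
shortened, learned_merges = toy_bpe(notes, num_merges=2)
print(len(notes), "->", len(shortened))
```

With three identical notes, two merges are enough to collapse each note into a single token, so the nine basic tokens become three.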

#### **TokSequence**
The methods of MidiTok use miditok.TokSequence objects as inputs and outputs. A miditok.TokSequence holds tokens in the three forms described in Tokens and vocabulary. TokSequences are subscriptable and implement __len__ (you can run tok_seq[idx] and len(tok_seq)).

You can use the miditok.MIDITokenizer.complete_sequence() method to automatically fill the non-initialized attributes of a miditok.TokSequence.
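The following sketch imitates this behaviour with a small stand-in class, to show what "subscriptable", `__len__` and "filling non-initialized attributes" mean in practice. `ToySequence` and `complete_toy_sequence` are hypothetical; they are not the MidiTok classes and only cover the token and id forms.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToySequence:
    tokens: Optional[List[str]] = None
    ids: Optional[List[int]] = None

    def _source(self):
        # Prefer ids when available, otherwise fall back to tokens.
        if self.ids is not None:
            return self.ids
        return self.tokens if self.tokens is not None else []

    def __len__(self):
        return len(self._source())

    def __getitem__(self, index):
        return self._source()[index]


def complete_toy_sequence(seq, vocab):
    """Fill whichever of `tokens` / `ids` is missing, using `vocab`."""
    inverse = {idx: token for token, idx in vocab.items()}
    if seq.ids is None and seq.tokens is not None:
        seq.ids = [vocab[token] for token in seq.tokens]
    elif seq.tokens is None and seq.ids is not None:
        seq.tokens = [inverse[idx] for idx in seq.ids]
    return seq


vocab = {"Pitch_43": 0, "Velocity_127": 1}
seq = complete_toy_sequence(ToySequence(ids=[0, 1, 0]), vocab)
print(len(seq), seq[1], seq.tokens)
```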

In [15]:
from miditok import REMI, TokenizerConfig  # here we choose to use REMI

# Our parameters
TOKENIZER_PARAMS = {
    "pitch_range": (21, 109),
    "beat_res": {(0, 15): 16},
    "num_velocities": 32,
    "special_tokens": ["PAD", "BOS", "EOS", "MASK"],
    "use_chords": True,
    "use_rests": False,
    "use_tempos": True,
    "use_time_signatures": False,
    "use_programs": False,
    "num_tempos": 1,  # number of tempo bins
    "tempo_range": (120, 120),  # (min, max)
}
config = TokenizerConfig(**TOKENIZER_PARAMS)

# Creates the tokenizer
tokenizer = REMI(config)

from pathlib import Path

# Tokenize a MIDI file
midi_path = list(Path("examples").glob("**/*.mid"))[0]
print(midi_path)
tokens = tokenizer(midi_path)  # automatically detects Score objects, paths, tokens

print(tokens)
print(len(tokens[0]))

# tokenizer.learn_bpe(vocab_size=len(tokens[0]), files_paths='examples/bass_example.MID')

# Convert to MIDI and save it
generated_midi = tokenizer(tokens)  # MidiTok can handle PyTorch/Numpy/Tensorflow tensors
generated_midi.dump_midi(Path("output/decoded_midi.MID"))

examples\bass_example.MID
[TokSequence(tokens=['Bar_None', 'Position_0', 'Tempo_120.0', 'Pitch_43', 'Velocity_127', 'Duration_0.1.16', 'Position_1', 'Pitch_43', 'Velocity_127', 'Duration_0.8.16', 'Position_12', 'Pitch_43', 'Velocity_127', 'Duration_0.4.16', 'Position_17', 'Pitch_47', 'Velocity_127', 'Duration_0.8.16', 'Position_27', 'Pitch_47', 'Velocity_127', 'Duration_0.4.16', 'Position_32', 'Pitch_50', 'Velocity_127', 'Duration_0.8.16', 'Position_43', 'Pitch_50', 'Velocity_127', 'Duration_0.4.16', 'Position_49', 'Pitch_53', 'Velocity_127', 'Duration_0.4.16', 'Position_54', 'Pitch_52', 'Velocity_127', 'Duration_0.3.16', 'Position_60', 'Pitch_50', 'Velocity_127', 'Duration_0.4.16', 'Bar_None', 'Position_1', 'Pitch_43', 'Velocity_127', 'Duration_0.9.16', 'Position_12', 'Pitch_43', 'Velocity_127', 'Duration_0.4.16', 'Position_18', 'Pitch_47', 'Velocity_127', 'Duration_0.9.16', 'Position_28', 'Pitch_47', 'Velocity_127', 'Duration_0.4.16', 'Position_34', 'Pitch_50', 'Velocity_127', 'Durat

#### Create a Dataset and collator for training
Create a Dataset and a collator to be used with a PyTorch DataLoader to train a model.

In [8]:
from miditok import REMI, TokenizerConfig
from miditok.pytorch_data import DatasetMIDI, DataCollator, split_midis_for_training
from torch.utils.data import DataLoader
from pathlib import Path

# Creating a multitrack tokenizer configuration, read the doc to explore other parameters
TOKENIZER_PARAMS = {
    "pitch_range": (21, 109),
    "beat_res": {(0, 15): 16},
    "num_velocities": 32,
    "special_tokens": ["PAD", "BOS", "EOS", "MASK"],
    "use_chords": True,
    "use_rests": False,
    "use_tempos": True,
    "use_time_signatures": False,
    "use_programs": False,
    "num_tempos": 1,  # number of tempo bins
    "tempo_range": (100, 100),  # (min, max)
}
config = TokenizerConfig(**TOKENIZER_PARAMS)

# Creates the tokenizer
tokenizer = REMI(config)

# Train the tokenizer with Byte Pair Encoding (BPE)
midi_paths = list(Path("examples").glob("**/*.mid"))
midi_paths = [midi_paths[0]]
print(midi_paths)
tokenizer.learn_bpe(vocab_size=30000, files_paths=midi_paths)
tokenizer.save_params(Path('output', "tokenizer.json"))


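# Split the MIDIs into chunks that fit max_seq_len
# NOTE: the chunk directory is a placeholder (assumption); the original cell
# used dataset_chunks_dir without defining it
dataset_chunks_dir = Path("output", "midi_chunks")
split_midis_for_training(
    files_paths=midi_paths,
    tokenizer=tokenizer,
    save_dir=dataset_chunks_dir,
    max_seq_len=32,
)

# Create a Dataset, a DataLoader and a collator to train a model
dataset = DatasetMIDI(
    files_paths=list(dataset_chunks_dir.glob("**/*.mid")),
    tokenizer=tokenizer,
    max_seq_len=32,
    bos_token_id=tokenizer["BOS_None"],  # beginning-of-sequence
    eos_token_id=tokenizer["EOS_None"],  # end-of-sequence
)
collator = DataCollator(tokenizer["PAD_None"])
dataloader = DataLoader(dataset, batch_size=2, collate_fn=collator)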

import sys
sys.path.append('../TCN')
from tcn import TemporalConvNet

model = TemporalConvNet(num_inputs=1, num_channels=[10, 10, 10, 10, 10], kernel_size=3, dropout=0.2)

# Iterate over the dataloader to train a model
for i, batch in enumerate(dataloader):
    tokens = batch['input_ids']
    print(f'\nBatch_{i}, Length={len(tokens[0])}')
    print(batch['input_ids'])

[WindowsPath('examples/bass_example.MID')]
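The collator passed to the DataLoader has to turn sequences of different lengths into one rectangular batch. The sketch below shows the padding idea with plain lists; `pad_batch` is a hypothetical helper, and returning an `attention_mask` next to `input_ids` is an assumption of this sketch rather than a description of what DataCollator returns.

```python
def pad_batch(sequences, pad_id):
    """Right-pad every sequence of ids to the length of the longest one."""
    max_len = max((len(s) for s in sequences), default=0)
    input_ids = [list(s) + [pad_id] * (max_len - len(s)) for s in sequences]
    # 1 marks a real token, 0 marks padding (assumed convention).
    attention_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


toy_batch = pad_batch([[5, 6, 7], [8]], pad_id=0)
print(toy_batch["input_ids"])
print(toy_batch["attention_mask"])
```

Padding to the longest sequence of the batch, rather than to a fixed global length, keeps batches as small as possible.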



In [None]:
import time
import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def train_epoch(dataloader, encoder, decoder, encoder_optimizer,
          decoder_optimizer, criterion):

    total_loss = 0
    for data in dataloader:
        input_tensor, target_tensor = data

        encoder_optimizer.zero_grad()
        decoder_optimizer.zero_grad()

        encoder_outputs, encoder_hidden = encoder(input_tensor)
        decoder_outputs, _, _ = decoder(encoder_outputs, encoder_hidden, target_tensor)

        loss = criterion(
            decoder_outputs.view(-1, decoder_outputs.size(-1)),
            target_tensor.view(-1)
        )
        loss.backward()

        encoder_optimizer.step()
        decoder_optimizer.step()

        total_loss += loss.item()

    return total_loss / len(dataloader)


def train(train_dataloader, encoder, decoder, n_epochs, learning_rate=0.001,
               print_every=100, plot_every=100):
    start = time.time()
    plot_losses = []
    print_loss_total = 0  # Reset every print_every
    plot_loss_total = 0  # Reset every plot_every

    encoder_optimizer = optim.Adam(encoder.parameters(), lr=learning_rate)
    decoder_optimizer = optim.Adam(decoder.parameters(), lr=learning_rate)
    criterion = nn.NLLLoss()

    for epoch in range(1, n_epochs + 1):
        loss = train_epoch(train_dataloader, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion)
        print_loss_total += loss
        plot_loss_total += loss

        if epoch % print_every == 0:
            print_loss_avg = print_loss_total / print_every
            print_loss_total = 0
            print(f"{time.time() - start:.0f}s ({epoch} {epoch / n_epochs:.0%}) {print_loss_avg:.4f}")

        if epoch % plot_every == 0:
            plot_loss_avg = plot_loss_total / plot_every
            plot_losses.append(plot_loss_avg)
            plot_loss_total = 0

    # showPlot(plot_losses)
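Note that `train_epoch` unpacks each batch as an `(input_tensor, target_tensor)` pair, whereas the dataloader built earlier yields dictionaries with an `'input_ids'` entry. One common way to bridge the two, assumed here and not taken from the notebook, is next-token prediction: the target is the input shifted by one position. The sketch uses plain lists and a hypothetical helper name.

```python
def make_next_token_pairs(input_ids):
    """Build (inputs, targets) where each target is the following token."""
    inputs = [row[:-1] for row in input_ids]
    targets = [row[1:] for row in input_ids]
    return inputs, targets


toy_input_ids = [[1, 10, 11, 2], [1, 12, 2, 0]]  # 0 stands for padding here
inputs, targets = make_next_token_pairs(toy_input_ids)
print(inputs)
print(targets)
```

With tensors, the same slicing applies along the last dimension, and padded target positions would normally be excluded from the loss.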

#### Tokenize a dataset
Here we tokenize a whole dataset into JSON files storing the token ids. We also perform data augmentation on the pitch, velocity and duration dimensions.

In [5]:
from miditok import REMI
from miditok.data_augmentation import augment_midi_dataset
from pathlib import Path

# Creates the tokenizer and list the file paths
tokenizer = REMI()  # using defaults parameters (constants.py)
data_path = Path("path", "to", "dataset")

# A validation method to discard MIDIs we do not want
# It can also be used for custom pre-processing, for instance if you want to merge
# some tracks before tokenizing a MIDI file
def midi_valid(midi) -> bool:
    # symusic Score objects expose their time signatures as `time_signatures`
    if any(ts.numerator != 4 for ts in midi.time_signatures):
        return False  # time signature different from 4/*, 4 beats per bar
    return True

# Performs data augmentation on one pitch octave (up and down), velocities and
# durations
midi_aug_path = Path("to", "new", "location", "augmented")
augment_midi_dataset(
    data_path,
    pitch_offsets=[-12, 12],
    velocity_offsets=[-4, 5],
    duration_offsets=[-0.5, 1],
    out_path=midi_aug_path,
)
tokenizer.tokenize_midi_dataset(
    data_path,
    Path("path", "to", "tokens"),
    validation_fn=midi_valid,  # passed by keyword to avoid positional mix-ups
)

Performing data augmentation: 0it [00:00, ?it/s]
Tokenizing MIDIs (to/tokens): 0it [00:00, ?it/s]
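To give an intuition of what pitch and velocity offsets do, here is a toy sketch on notes stored as `(pitch, velocity, duration)` tuples. `augment_notes` is hypothetical, and the way offsets are combined, clamped and filtered is an assumption of this sketch, not a description of augment_midi_dataset.

```python
from itertools import product


def augment_notes(notes, pitch_offsets, velocity_offsets, pitch_range=(21, 109)):
    """Return one shifted copy of `notes` per (pitch, velocity) offset pair."""
    variants = {}
    for p_off, v_off in product([0, *pitch_offsets], [0, *velocity_offsets]):
        if p_off == 0 and v_off == 0:
            continue  # skip the unmodified original
        shifted = [
            (pitch + p_off, min(127, max(1, vel + v_off)), dur)
            for pitch, vel, dur in notes
        ]
        # Discard variants that leave the allowed pitch range.
        if all(pitch_range[0] <= pitch < pitch_range[1] for pitch, _, _ in shifted):
            variants[(p_off, v_off)] = shifted
    return variants


toy_notes = [(43, 127, 0.5), (50, 100, 0.25)]
toy_variants = augment_notes(toy_notes, pitch_offsets=[-12, 12], velocity_offsets=[-4, 5])
print(len(toy_variants))
print(toy_variants[(12, 5)])
```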


#### SYMUSIC Library

Read a MIDI file and set its time signature if necessary (here, all time signatures are replaced by a single 4/4).

In [13]:
from symusic import Score, TimeSignature
from symusic.core import TimeSignatureTickList
# score = Score("examples/bass_example_copy.mid")  # alternative example file
score = Score("examples/HereComesTheSun.mid")
print("note_num: ", score.note_num())
print("start_time: ", score.start())
print("end_time: ", score.end())
print(score.tempos)
print(score.key_signatures)
print(score.time_signatures)

score.time_signatures = TimeSignatureTickList([TimeSignature(time=0, numerator=4, denominator=4)])

print(score.time_signatures)

note_num:  3887
start_time:  0
end_time:  78494
symusic.core.TempoTickList([Tempo(time=0, qpm=135.000135000135, mspq=444444)])
symusic.core.KeySignatureTickList([KeySignature(time=0, key=3, tonality=0, degree=3)])
symusic.core.TimeSignatureTickList([TimeSignature(time=0, numerator=4, denominator=4), TimeSignature(time=37632, numerator=2, denominator=4), TimeSignature(time=38016, numerator=3, denominator=8), TimeSignature(time=38880, numerator=5, denominator=8), TimeSignature(time=39360, numerator=4, denominator=4), TimeSignature(time=40128, numerator=2, denominator=4), TimeSignature(time=40512, numerator=3, denominator=8), TimeSignature(time=41376, numerator=5, denominator=8), TimeSignature(time=41856, numerator=4, denominator=4), TimeSignature(time=42624, numerator=2, denominator=4), TimeSignature(time=43008, numerator=3, denominator=8), TimeSignature(time=43872, numerator=5, denominator=8), TimeSignature(time=44352, numerator=4, denominator=4), TimeSignature(time=45120, numerator=2, 