#### **Tokens and vocabulary**
A token is a distinct element of a sequence. In natural language, a token can be a character, a subword or a word, and a sentence can be tokenized into a sequence of tokens representing its words and punctuation. For symbolic music, tokens can represent the values of note attributes (pitch, velocity, duration) or time events. These are the “basic” tokens, comparable to characters in natural language. With Byte Pair Encoding (BPE), tokens can also represent successions of these basic tokens. A token can take three forms, which we name by convention:

- **Token** (string): the form describing it, e.g. Pitch_50.
- **Id** (int): a unique associated integer, used as an index.
- **Byte** (string): a unique associated byte, used internally for Byte Pair Encoding (BPE).

MidiTok works with TokSequence objects, which hold token sequences in these three forms.
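To make the three forms concrete, here is a minimal sketch in plain Python. It does not use MidiTok itself: the names `token_to_id`, `id_to_byte` and the offset `CHR_OFFSET` are assumptions chosen for illustration, and the way real byte values are assigned may differ.

```python
# Toy illustration of the three token forms (not MidiTok's internal code).
# CHR_OFFSET is a hypothetical offset that maps ids to printable characters.
CHR_OFFSET = 33

basic_tokens = ["Pitch_50", "Velocity_127", "Duration_0.4.16"]

# Token (string) -> Id (int)
token_to_id = {tok: i for i, tok in enumerate(basic_tokens)}
# Id (int) -> Byte (single-character string)
id_to_byte = {i: chr(i + CHR_OFFSET) for i in token_to_id.values()}

for tok, idx in token_to_id.items():
    print(f"token={tok!r}  id={idx}  byte={id_to_byte[idx]!r}")
```

Each basic token ends up with exactly one id and one single-character byte, which is what lets a BPE model treat a token sequence like a string of characters.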

#### **Vocabulary**
The vocabulary of a tokenizer acts as a lookup table, linking tokens (string / byte) to their ids (integer). The vocabulary is an attribute of the tokenizer and can be accessed with tokenizer.vocab. The vocabulary is a Python dictionary binding tokens (keys) to their ids (values). For tokenizations with embedding pooling (e.g. CPWord or Octuple), tokenizer.vocab will be a list of Vocabulary objects, and the tokenizer.is_multi_vocab property will be True.

With **Byte Pair Encoding (BPE)**: tokenizer.vocab holds all the basic tokens describing the note and time attributes of music. By analogy with text, these tokens can be seen as unique characters. After training a tokenizer with BPE, a new vocabulary is built with tokens newly created from pairs of basic tokens. This vocabulary can be accessed with tokenizer.vocab_bpe, and binds tokens as bytes (string) to their associated ids (int). This is the vocabulary of the tokenizer's BPE model.
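The idea of building new tokens from pairs of basic tokens can be sketched with a single BPE merge step in plain Python. This is a simplified illustration, not the training procedure MidiTok runs; the helper names `most_frequent_pair` and `merge_pair` are hypothetical.

```python
from collections import Counter


def most_frequent_pair(seq):
    """Return the most frequent pair of adjacent tokens in seq."""
    pairs = Counter(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0]


def merge_pair(seq, pair, new_token):
    """Replace every non-overlapping occurrence of pair with new_token."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(new_token)
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


seq = [
    "Pitch_43", "Velocity_127", "Duration_0.4.16",
    "Pitch_47", "Velocity_127", "Duration_0.4.16",
    "Pitch_50", "Velocity_127", "Duration_0.8.16",
]

pair = most_frequent_pair(seq)
new_token = "+".join(pair)  # the merged token joins the vocabulary
shorter = merge_pair(seq, pair, new_token)
print(pair, len(seq), "->", len(shorter))
```

Repeating this step until a target vocabulary size is reached is the core of BPE training: each merge adds one token to the vocabulary and shortens the sequences that contain the merged pair.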

#### **TokSequence**
The methods of MidiTok use miditok.TokSequence objects as inputs and outputs. A miditok.TokSequence holds tokens in the three forms described above (token, id and byte). TokSequences are subscriptable and implement __len__ (you can run tok_seq[id] and len(tok_seq)).

You can use the miditok.MIDITokenizer.complete_sequence() method to automatically fill the non-initialized attributes of a miditok.TokSequence.
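As a rough mental model of what "completing" a sequence means, the sketch below defines a toy container that is subscriptable, supports len(), and fills in whichever form is missing from a vocabulary. `ToySequence` and `complete` are hypothetical stand-ins written for this example; they are not the miditok.TokSequence class or the complete_sequence() method.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToySequence:
    """Toy stand-in holding the same tokens as strings and/or as ids."""

    tokens: Optional[list] = None
    ids: Optional[list] = None

    def _filled(self):
        # Prefer ids when available, otherwise fall back to tokens
        return self.ids if self.ids is not None else self.tokens

    def __len__(self):
        filled = self._filled()
        return len(filled) if filled is not None else 0

    def __getitem__(self, idx):
        return self._filled()[idx]


def complete(seq, vocab):
    """Fill whichever of tokens / ids is missing, using vocab as lookup table."""
    id_to_token = {i: t for t, i in vocab.items()}
    if seq.ids is None and seq.tokens is not None:
        seq.ids = [vocab[t] for t in seq.tokens]
    elif seq.tokens is None and seq.ids is not None:
        seq.tokens = [id_to_token[i] for i in seq.ids]
    return seq


vocab = {"Bar_None": 0, "Position_0": 1, "Pitch_43": 2}
seq = complete(ToySequence(ids=[0, 1, 2]), vocab)
print(seq.tokens, len(seq), seq[2])
```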

In [26]:
from miditok import REMI, TokenizerConfig  # here we choose to use REMI

# Our parameters
TOKENIZER_PARAMS = {
    "pitch_range": (21, 109),
    "beat_res": {(0, 15): 16},
    "num_velocities": 32,
    "special_tokens": ["PAD", "BOS", "EOS", "MASK"],
    "use_chords": True,
    "use_rests": False,
    "use_tempos": True,
    "use_time_signatures": False,
    "use_programs": False,
    "num_tempos": 1,  # number of tempo bins
    "tempo_range": (100, 100),  # (min, max)
}
config = TokenizerConfig(**TOKENIZER_PARAMS)

# Creates the tokenizer
tokenizer = REMI(config)

from pathlib import Path

# Tokenize a MIDI file
tokens = tokenizer(Path('examples/bass_example.MID'))  # automatically detects Score objects, paths, tokens

print(tokens)
print(len(tokens[0]))

# files_paths expects a list of paths, so the single file is wrapped in a list
tokenizer.learn_bpe(vocab_size=len(tokens[0]), files_paths=[Path('examples/bass_example.MID')])

# Convert to MIDI and save it
generated_midi = tokenizer(tokens)  # MidiTok can handle PyTorch/Numpy/Tensorflow tensors
generated_midi.dump_midi(Path("decoded_midi.MID"))

[TokSequence(tokens=['Bar_None', 'Position_0', 'Tempo_100.0', 'Pitch_43', 'Velocity_127', 'Duration_0.1.16', 'Position_1', 'Pitch_43', 'Velocity_127', 'Duration_0.8.16', 'Position_12', 'Pitch_43', 'Velocity_127', 'Duration_0.4.16', 'Position_17', 'Pitch_47', 'Velocity_127', 'Duration_0.8.16', 'Position_27', 'Pitch_47', 'Velocity_127', 'Duration_0.4.16', 'Position_32', 'Pitch_50', 'Velocity_127', 'Duration_0.8.16', 'Position_43', 'Pitch_50', 'Velocity_127', 'Duration_0.4.16', 'Position_49', 'Pitch_53', 'Velocity_127', 'Duration_0.4.16', 'Position_54', 'Pitch_52', 'Velocity_127', 'Duration_0.3.16', 'Position_60', 'Pitch_50', 'Velocity_127', 'Duration_0.4.16', 'Bar_None', 'Position_1', 'Pitch_43', 'Velocity_127', 'Duration_0.9.16', 'Position_12', 'Pitch_43', 'Velocity_127', 'Duration_0.4.16', 'Position_18', 'Pitch_47', 'Velocity_127', 'Duration_0.9.16', 'Position_28', 'Pitch_47', 'Velocity_127', 'Duration_0.4.16', 'Position_34', 'Pitch_50', 'Velocity_127', 'Duration_0.7.16', 'Position_45'


#### Trains a tokenizer with BPE
Here we train the tokenizer with Byte Pair Encoding (BPE). BPE reduces the length of the token sequences, which in turn improves model efficiency, while also improving result quality and model performance.

In [11]:
from miditok import REMI
from pathlib import Path

# Creates the tokenizer and list the file paths
tokenizer = REMI()  # using default parameters (constants.py)
midi_paths = [Path('examples/bass_example.MID')]
# midi_paths = list(Path("path", "to", "dataset").glob("**/*.mid"))

# Builds the vocabulary with BPE
tokenizer.learn_bpe(vocab_size=30000, files_paths=midi_paths)

#### Creates a Dataset and collator for training
Here we create a Dataset and a collator to be used with a PyTorch DataLoader to train a model.

In [12]:
from miditok import REMI
from miditok.pytorch_data import DatasetTok, DataCollator

# midi_paths = list(Path("path", "to", "dataset").glob("**/*.mid"))

dataset = DatasetTok(
    files_paths=midi_paths,
    min_seq_len=100,
    max_seq_len=1024,
    tokenizer=tokenizer,
)
collator = DataCollator(
    tokenizer["PAD_None"], tokenizer["BOS_None"], tokenizer["EOS_None"]
)

from torch.utils.data import DataLoader
data_loader = DataLoader(dataset=dataset, collate_fn=collator)

# Using the data loader in the training loop
for batch in data_loader:
    print("Train your model on this batch...")

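ImportError: cannot import name 'DatasetTok' from 'miditok.pytorch_data' (C:\Users\Gianni\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\miditok\pytorch_data\__init__.py)

The import fails because the installed MidiTok version does not export DatasetTok from miditok.pytorch_data. Check which dataset classes the installed version provides in its documentation before running this cell.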
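Independently of the import issue, the role of a collator can be illustrated in plain Python. The sketch below is an assumption about what such a collator typically does (adding BOS/EOS ids, right-padding to the longest sequence and building an attention mask); the `collate` function is hypothetical and is not MidiTok's DataCollator.

```python
def collate(batch, pad_id, bos_id, eos_id):
    """Add BOS/EOS to each sequence and right-pad to the longest one."""
    framed = [[bos_id, *seq, eos_id] for seq in batch]
    max_len = max(len(seq) for seq in framed)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in framed]
    # 1 marks real tokens, 0 marks padding positions
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in framed
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = collate([[5, 6, 7], [8]], pad_id=0, bos_id=1, eos_id=2)
print(batch["input_ids"])
print(batch["attention_mask"])
```

A function with this shape could be passed as collate_fn to a PyTorch DataLoader after converting the lists to tensors, which is left out here to keep the example dependency-free.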

#### Tokenize a dataset 
Here we tokenize a whole dataset into JSON files storing the token ids. We also perform data augmentation on the pitch, velocity and duration dimensions.

In [None]:
from miditok import REMI
from miditok.data_augmentation import augment_midi_dataset
from pathlib import Path

# Creates the tokenizer and list the file paths
tokenizer = REMI()  # using default parameters (constants.py)
data_path = Path("path", "to", "dataset")

# A validation method to discard MIDIs we do not want
# It can also be used for custom pre-processing, for instance if you want to merge
# some tracks before tokenizing a MIDI file
def midi_valid(midi) -> bool:
    if any(ts.numerator != 4 for ts in midi.time_signature_changes):
        return False  # time signature different from 4/*, 4 beats per bar
    return True

# Performs data augmentation on one pitch octave (up and down), velocities and
# durations
midi_aug_path = Path("to", "new", "location", "augmented")
augment_midi_dataset(
    data_path,
    pitch_offsets=[-12, 12],
    velocity_offsets=[-4, 5],
    duration_offsets=[-0.5, 1],
    out_path=midi_aug_path,
)
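# Converts the MIDI files to tokens saved as JSON files
tokenizer.tokenize_midi_dataset(
    data_path,
    Path("path", "to", "tokens"),
    validation_fn=midi_valid,  # passed by keyword so it is not taken for another positional argument
)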
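To show what offsets such as pitch_offsets=[-12, 12] mean in practice, here is a small, self-contained sketch that applies pitch and velocity offsets to a list of notes. The `augment_notes` helper and the (pitch, velocity, duration) tuple layout are assumptions made for this example; the sketch does not reproduce how augment_midi_dataset works internally.

```python
def augment_notes(notes, pitch_offset=0, velocity_offset=0):
    """Return shifted copies of (pitch, velocity, duration) notes.

    Notes whose shifted pitch leaves the MIDI range 0-127 are dropped,
    and velocities are clamped to 1-127.
    """
    out = []
    for pitch, velocity, duration in notes:
        new_pitch = pitch + pitch_offset
        if not 0 <= new_pitch <= 127:
            continue
        new_velocity = min(127, max(1, velocity + velocity_offset))
        out.append((new_pitch, new_velocity, duration))
    return out


notes = [(43, 127, 0.25), (47, 100, 0.5)]

# One augmented copy per pitch offset: one octave down and one octave up
variants = {off: augment_notes(notes, pitch_offset=off) for off in (-12, 12)}
louder = augment_notes(notes, velocity_offset=5)
print(variants)
print(louder)
```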