# Model Mistral + REMI + Maestro

First baseline model using a Mistral transformer and REMI tokenizer, inspired by https://github.com/Natooz/MidiTok/blob/main/colab-notebooks/Example_HuggingFace_Mistral_Transformer.ipynb

**Model name:** mistral162M_remi_maestro  
**Previous version:** mistral309M_remi_maestro  
**Changes:**
- reduce model size to 162M
- increase the MLP dimension to 8x the embedding dimension instead of the usual 4x, to allow learning the more complex temporal dependencies that are present in music but not in language
- increase max_position_embeddings back to 8192 to allow for extrapolation to longer sequences
- decrease gradient_accumulation_steps from 3 to 1: as the batch size was already increased from 16 to the desired 64, there is no need to simulate larger batches
- set gradient_checkpointing to False, since it saves memory at the cost of a slower training run; it can be set to True again if we run out of memory
- reduce warmup_ratio from 30% to 3%; the old value was unreasonably high and only wasted compute
- set min_lr_rate to 0.1 (previously unset) so that the end of the cosine schedule does not waste compute at near-zero learning rates
- in the tokenizer, reset num_velocities and tempo_range to their default values, as the previous custom values seemed arbitrary
- replace offline augmentation with on-the-fly augmentation and introduce much more variety in data augmentation -> should better mitigate overfitting, as the model sees a wider variety of data
  -> but the number of examples per epoch decreases, so the number of epochs should be increased
- add tempo-change augmentation and use reasonable, tested values for all augmentation axes
- increase train epochs from 20 to 1000: 20 epochs were not enough and the number of examples per epoch has decreased by a factor of 9; since we use early stopping, this limit can be set very high
- introduce early stopping with a conservative patience of 10
- set attention_dropout to 0.1 to reduce the risk of overfitting. hidden_dropout (dropout in the feed-forward layers) is not easily possible with Mistral, as the Llama-style architecture is not intended for this
- use the recommended train/validation/test split to ensure duplicate pieces are in the same set
- train the tokenizer on the train set only to avoid leakage
- increase num_overlap_bars for chunking from 2 to 16 for better data utilization
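
As a rough sanity check on the 162M figure, the sketch below counts the parameters of a Llama/Mistral-style decoder from its shape (hidden size 512, 18 layers, 8 heads, MLP dimension 8x, as configured in the Training section). It assumes untied input and output embeddings, no bias terms, and as many key/value heads as attention heads. The vocabulary size of 30,000 is a hypothetical placeholder, not a value taken from this experiment, and `count_decoder_params` is an illustrative helper, not part of `piano_transformer`.

```python
def count_decoder_params(vocab_size, hidden_size, intermediate_size,
                         num_layers, num_heads, num_kv_heads=None,
                         tie_embeddings=False):
    """Approximate parameter count of a bias-free Llama/Mistral-style decoder."""
    num_kv_heads = num_kv_heads or num_heads
    head_dim = hidden_size // num_heads

    # Attention: Q and O are hidden x hidden; K and V shrink with fewer KV heads.
    attn = 2 * hidden_size * hidden_size + 2 * hidden_size * num_kv_heads * head_dim
    # Gated MLP: gate, up and down projections.
    mlp = 3 * hidden_size * intermediate_size
    # Two RMSNorm weight vectors per layer.
    norms = 2 * hidden_size

    non_embedding = num_layers * (attn + mlp + norms) + hidden_size  # + final norm
    embedding = vocab_size * hidden_size * (1 if tie_embeddings else 2)
    return non_embedding, embedding


non_emb, emb = count_decoder_params(
    vocab_size=30_000,  # hypothetical; the real value comes from the trained tokenizer
    hidden_size=512, intermediate_size=512 * 8, num_layers=18, num_heads=8,
)
print(f"non-embedding: {non_emb:,}  embedding: {emb:,}  total: {non_emb + emb:,}")
```

Under these assumptions the three MLP projections account for roughly 6.3M of the 7.3M parameters per layer, so the 8x expansion dominates the model size.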

## Setup

In [None]:
import os
import subprocess
from copy import deepcopy
from pathlib import Path

import wandb
from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers import AutoModelForCausalLM, GenerationConfig
from transformers.trainer_utils import set_seed

from piano_transformer.config import load_config
from piano_transformer.datasets.dataset import build_collator, build_datasets
from piano_transformer.datasets.preprocessing import split_datasets_into_chunks
from piano_transformer.model import build_mistral_model
from piano_transformer.tokenizer import create_remi_tokenizer
from piano_transformer.trainer import make_trainer
from piano_transformer.midi import get_midi_file_lists

In [None]:
cfg = load_config("../../config.yaml")

print(f"Model:\n{cfg.model_name}")

os.environ["WANDB_ENTITY"] = "jonathanlehmkuhl-rwth-aachen-university"
os.environ["WANDB_PROJECT"] = "piano-transformer"
wandb.login()

set_seed(cfg.seed)

Model:
mistral-162M_remi_maestro_v1


wandb: Currently logged in as: jonathanlehmkuhl (jonathanlehmkuhl-rwth-aachen-university) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


## Dataset Preparation

In [5]:
midi_lists = get_midi_file_lists(cfg.data_raw_path / "maestro" / "maestro-v3.0.0.csv", cfg.data_raw_path / "maestro")

for split in ["train", "validation", "test"]:
    print(f"Number of {split} files: {len(midi_lists[split])}")

tokenizer = create_remi_tokenizer(midi_lists["train"], cfg.experiment_path / "tokenizer.json")

MAX_SEQ_LEN = 2048
NUM_OVERLAP_BARS = 16

chunks_lists = split_datasets_into_chunks(midi_lists, tokenizer, cfg.data_processed_path, "maestro", MAX_SEQ_LEN, NUM_OVERLAP_BARS)

augmentation_cfg = {
    "pitch_offsets": list(range(-6, 6)),
    "velocity_offsets": list(range(-20, 21)),
    "duration_offsets": [-0.5, -0.375, -0.25, -0.125, 0, 0.125, 0.25, 0.375, 0.5],
    "tempo_factors": [0.9, 0.925, 0.95, 0.975, 1.0, 1.025, 1.05, 1.075, 1.1],
}


train_ds, valid_ds, test_ds = build_datasets(chunks_lists, tokenizer, MAX_SEQ_LEN, augmentation_cfg)
collator = build_collator(tokenizer)

Number of train files: 962
Number of validation files: 137
Number of test files: 177


Splitting music files (..\..\models\mistral-162M_remi_maestro_v1\data_processed\maestro_train): 100%|██████████| 962/962 [00:11<00:00, 82.03it/s] 
Splitting music files (..\..\models\mistral-162M_remi_maestro_v1\data_processed\maestro_validation): 100%|██████████| 137/137 [00:01<00:00, 101.13it/s]
Splitting music files (..\..\models\mistral-162M_remi_maestro_v1\data_processed\maestro_test): 100%|██████████| 177/177 [00:01<00:00, 110.46it/s]
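
To illustrate what on-the-fly augmentation means here, the sketch below draws one random value per augmentation axis each time an example is requested, instead of writing every variant to disk up front. This is an assumption about the general technique; the actual logic inside `build_datasets` is not shown in this notebook, and `sample_augmentation` is a hypothetical helper.

```python
import random


def sample_augmentation(augmentation_cfg, rng):
    """Pick one value per augmentation axis for a single training example."""
    return {axis: rng.choice(values) for axis, values in augmentation_cfg.items()}


augmentation_cfg = {
    "pitch_offsets": list(range(-6, 6)),
    "velocity_offsets": list(range(-20, 21)),
    "duration_offsets": [-0.5, -0.375, -0.25, -0.125, 0, 0.125, 0.25, 0.375, 0.5],
    "tempo_factors": [0.9, 0.925, 0.95, 0.975, 1.0, 1.025, 1.05, 1.075, 1.1],
}

rng = random.Random(0)  # seeded only to make the sketch reproducible
params = sample_augmentation(augmentation_cfg, rng)

# Number of distinct parameter combinations a chunk can be seen with across epochs.
num_variants = 1
for values in augmentation_cfg.values():
    num_variants *= len(values)
print(params, num_variants)
```

Note that `range(-6, 6)` yields the 12 offsets from -6 to +5, while `range(-20, 21)` is symmetric around zero.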


## Training

In [None]:
model_cfg = {
    "num_hidden_layers": 18,
    "hidden_size": 512,
    "intermediate_size": 512 * 8,
    "num_attention_heads": 8,
    "attention_dropout": 0.1,
}

model = build_mistral_model(model_cfg, tokenizer, MAX_SEQ_LEN)

print(f"Total parameters: {sum(p.numel() for p in model.parameters()):,}")
print(f"Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")

trainer_cfg = {
    "output_dir": cfg.runs_path,
    "per_device_train_batch_size": 64,
    "per_device_eval_batch_size": 64,
    "learning_rate": 1e-4,
    "weight_decay": 0.01,
    "max_grad_norm": 3.0,
    "lr_scheduler_type": "cosine_with_min_lr",
    "min_lr_rate": 0.1,
    "warmup_ratio": 0.03,
    "logging_steps": 20,
    "num_train_epochs": 1000,
    "seed": cfg.seed,
    "data_seed": cfg.seed,
    "run_name": cfg.model_name,
    "optim": "adamw_torch",
    "early_stopping_patience": 10,
}

trainer = make_trainer(trainer_cfg, model, collator, train_ds, valid_ds)

result = trainer.train()
trainer.save_model(cfg.model_path)
trainer.log_metrics("train", result.metrics)
trainer.save_metrics("train", result.metrics)
trainer.save_state()
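
The learning-rate settings above (3% linear warmup, cosine decay, floor at 10% of the peak) can be inspected with a few lines of plain Python. This is a sketch of the schedule shape, assuming that `cosine_with_min_lr` performs half a cosine cycle rescaled to end at `min_lr_rate * learning_rate`; `lr_at_step` and the step count of 10,000 are illustrative and not taken from the run.

```python
import math


def lr_at_step(step, total_steps, peak_lr=1e-4, warmup_ratio=0.03, min_lr_rate=0.1):
    """Linear warmup followed by cosine decay to min_lr_rate * peak_lr."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return peak_lr * (min_lr_rate + (1.0 - min_lr_rate) * cosine)


total = 10_000  # hypothetical number of optimizer steps
for s in (0, 150, 300, 5_150, 10_000):
    print(s, f"{lr_at_step(s, total):.2e}")
```

With a floor of 0.1 the final steps still train at a tenth of the peak rate rather than approaching zero, which is the motivation given for `min_lr_rate` in the change list.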

## Generation

In [None]:
# Load the weights written by trainer.save_model(cfg.model_path) in the training cell
model = AutoModelForCausalLM.from_pretrained(cfg.model_path)

In [None]:
generation_config = GenerationConfig(
    max_new_tokens=200,  # extends samples by 200 tokens
    num_beams=1,
    do_sample=True,
    temperature=1,
    top_k=15,
    top_p=0.95,
    epsilon_cutoff=3e-4,
    eta_cutoff=1e-3,
    pad_token_id=tokenizer.pad_token_id,
)

# Here the sequences are padded to the left, so that the last token along the time dimension
# is always the last token of each seq, allowing to efficiently generate by batch
collator.pad_on_left = True
collator.eos_token = None

model.eval()
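
The left-padding comment above can be made concrete with a small, framework-free sketch. It assumes a pad id of 0 and uses a hypothetical helper `left_pad`; the real collator produces tensors, but the alignment idea is the same: after left-padding, the final column holds the last real token of every prompt, so one batched decoding step extends all sequences at once.

```python
def left_pad(sequences, pad_id=0):
    """Left-pad token id lists to equal length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append([pad_id] * n_pad + list(seq))
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return input_ids, attention_mask


ids, mask = left_pad([[5, 6, 7], [8, 9]])
# The last position of every row is a real token, never padding.
last_tokens = [row[-1] for row in ids]
print(ids, mask, last_tokens)
```

Because every prompt in a batch is padded to the same length, slicing the generated sequence at `len(prompt)` separates the newly generated tokens from the (padded) prompt.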

In [None]:
def generate(dataset, output):
    (output_path := Path(output)).mkdir(parents=True, exist_ok=True)
    dataloader = DataLoader(dataset, batch_size=16, collate_fn=collator)
    
    count = 0
    for batch in tqdm(dataloader, desc="Generating outputs"):
        res = model.generate(
            inputs=batch["input_ids"].to(model.device),
            attention_mask=batch["attention_mask"].to(model.device),
            generation_config=generation_config
        )
    
        # Saves the generated music, as MIDI files and tokens (json)
        for prompt, continuation in zip(batch["input_ids"], res):
            generated = continuation[len(prompt):]
            tokens = [generated, prompt, continuation]
            tokens = [seq.tolist() for seq in tokens]

            midi_generated = tokenizer.decode([deepcopy(tokens[0])])
            midi_prompt = tokenizer.decode([deepcopy(tokens[1])])
            midi_full = tokenizer.decode([deepcopy(tokens[2])])

            # Name the tracks
            if midi_generated.tracks:
                midi_generated.tracks[0].name = f"Generated continuation ({len(tokens[0])} tokens)"
            if midi_prompt.tracks:
                midi_prompt.tracks[0].name = f"Original prompt ({len(tokens[1])} tokens)"
            if midi_full.tracks:
                midi_full.tracks[0].name = f"Full sequence ({len(tokens[2])} tokens)"

            # Save each as a separate MIDI file
            midi_generated.dump_midi(output_path / f"{count}_generated.midi")
            midi_prompt.dump_midi(output_path / f"{count}_prompt.midi")
            midi_full.dump_midi(output_path / f"{count}_full.midi")
            tokenizer.save_tokens(tokens, output_path / f"{count}.json")
    
            count += 1

In [None]:
# OUTPUT_PATH must be a pathlib.Path defined beforehand; it is not set anywhere in this notebook
generate(test_ds, OUTPUT_PATH / "test")

## Convert to WAV

In [None]:
soundfont_path = "FluidR3_GM.sf2"
midi_folder = OUTPUT_PATH / "test"
output_folder = OUTPUT_PATH / "test_wav"
os.makedirs(output_folder, exist_ok=True)  # make sure the output folder exists before rendering

for filename in os.listdir(midi_folder):
    if filename.lower().endswith(".midi"):
        midi_path = os.path.join(midi_folder, filename)
        wav_filename = os.path.splitext(filename)[0] + ".wav"
        wav_path = os.path.join(output_folder, wav_filename)
        
        print(f"Converting {filename} to {wav_filename}...")
        
        # Build FluidSynth command
        command = [
            "fluidsynth",
            "-ni",  # no MIDI input driver (-n), no interactive shell (-i)
            soundfont_path,
            midi_path,
            "-F", wav_path,  # output file
            "-r", "44100"    # sample rate in Hz
        ]
        
        subprocess.run(command, check=True)

Converting Song-000.MID to Song-000.wav...


FileNotFoundError: [WinError 2] The system cannot find the file specified
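
A `FileNotFoundError` with `[WinError 2]` raised from `subprocess.run` usually means that Windows could not find the executable itself, rather than the MIDI file. One plausible cause here, stated as an assumption since the traceback is truncated, is that `fluidsynth` is not on `PATH`. The sketch below checks for the executable before converting; `build_fluidsynth_command` is a hypothetical helper that mirrors the command used above, and the file names are placeholders.

```python
import shutil


def build_fluidsynth_command(soundfont_path, midi_path, wav_path, sample_rate=44100):
    """Assemble the FluidSynth command line used for offline rendering."""
    return [
        "fluidsynth",
        "-ni",                    # no MIDI input driver, no interactive shell
        str(soundfont_path),
        str(midi_path),
        "-F", str(wav_path),      # render to this file
        "-r", str(sample_rate),   # sample rate in Hz
    ]


executable = shutil.which("fluidsynth")
if executable is None:
    print("fluidsynth not found on PATH; install it or add its folder to PATH")
else:
    print(f"Using {executable}")

command = build_fluidsynth_command("FluidR3_GM.sf2", "0_full.midi", "0_full.wav")
print(command)
```

If the executable is found and the error persists, the relative soundfont path is the next thing to check, since it is resolved against the notebook's working directory.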