# [Hands-On] Training a Custom Music Generation Model (Language Model Approach)

- Author: Hugman Sangkeun Jung (hugmanskj@gmail.com)

> Educational Purpose


## Overview
This project demonstrates how to train a custom music generation model based on ABC text notation. The pipeline involves:

- Data Preparation: Loading a large text file of ABC-encoded musical scores.
- Vocabulary Building: Mapping each character in the text to a unique index (and back).
- Dataset & DataLoader Setup: Creating a sequence modeling dataset that provides subsequences of text to the model.
- Model Architecture: Defining an LSTM-based language model (character-level) to learn musical patterns in ABC notation.
- Training: Optimizing the model using a cross-entropy loss function and an Adam optimizer.
- Generation: Sampling new sequences of ABC notation from the trained model by providing a prompt.
- Audio Conversion: Converting the generated ABC notation to a playable WAV file using music21, pretty_midi, and soundfile.

You can think of it as a text-generation pipeline where the "language" is ABC music notation. Once new ABC text is generated, it’s turned into MIDI and finally into an audio (WAV) file for playback or analysis.

In [1]:
!pip install torch music21
!pip install pretty_midi soundfile
!pip install requests


These commands install the required libraries:

- torch: PyTorch, used for building and training the neural network (LSTM)
- music21: A toolkit for computer-aided musicology, useful for parsing and handling ABC/MIDI data
- pretty_midi: A library to handle MIDI files in Python
- soundfile: Enables reading and writing of audio files (e.g., WAV)
- requests: An HTTP client, used below to download the training text

Installing them up front makes every dependency available to the code that follows.

In [2]:
import requests

# Dropbox share link; the trailing dl=1 requests a direct file download
url = "https://www.dropbox.com/scl/fi/ltl2myq5zxvzlplgg1igb/input.txt?rlkey=ez9tbnwv4wm8rhbd38cg0xoi5&st=f45ggmdu&dl=1"

# Download and save the file, failing loudly on an HTTP error
response = requests.get(url, timeout=60)
response.raise_for_status()
with open('./abc.txt', 'wb') as f:
    f.write(response.content)

print("File downloaded successfully!")

File downloaded successfully!


In [3]:
import os

import music21
import pretty_midi
import soundfile as sf
from IPython.display import Audio


def abc_to_wav(abc_sequence, output_filename, soundfont_path='FluidR3_GM.sf2'):
    """
    Convert ABC notation to a WAV file using music21, pretty_midi, and soundfile.

    Parameters:
        abc_sequence (str): Music sequence in ABC notation
        output_filename (str): Output WAV filename (without extension)
        soundfont_path (str): Path to a SoundFont file (.sf2). Currently unused:
                              pm.synthesize() renders with plain sine waves.

    Returns:
        str: Path of the generated WAV file
    """
    sample_rate = 44100
    # Temporary MIDI filename
    temp_midi = f'{output_filename}_temp.mid'
    # Final WAV filename
    final_wav = f'{output_filename}.wav'

    try:
        # 1) Convert ABC to MIDI (using music21)
        score = music21.converter.parseData(abc_sequence, format='abc')
        score.write('midi', temp_midi)

        # 2) Load the MIDI with pretty_midi and synthesize it
        pm = pretty_midi.PrettyMIDI(temp_midi)
        audio_data = pm.synthesize(fs=sample_rate)  # 1-D float NumPy array

        # 3) Write the samples as a 16-bit PCM WAV file
        sf.write(final_wav, audio_data, sample_rate, subtype='PCM_16')
        return final_wav

    except Exception as e:
        raise RuntimeError(f"Conversion error: {e}") from e

    finally:
        # Remove the temporary MIDI file whether or not the conversion succeeded
        if os.path.exists(temp_midi):
            os.remove(temp_midi)


This function converts an ABC-encoded music sequence into a WAV audio file. Here’s the step-by-step:

- Parse ABC: Uses music21 to parse the raw ABC notation into a musical score object.
- Write MIDI: Exports that musical score to a temporary MIDI file.
- Load and Synthesize: Uses pretty_midi to load the MIDI file, then synthesizes it into a NumPy array of raw audio samples. The built-in synthesizer uses simple sine waves, so the soundfont_path argument has no effect on the sound.
- Save as WAV: Uses the soundfile library (sf.write) to write the NumPy audio data into a WAV file.

This function is useful for listening to the ABC notation you’ve generated or processed. It bridges the gap between text-based notation and actual audio playback.

Here, we provide a sample ABC string (sample_note) and invoke abc_to_wav to create a WAV file named "sample.wav". Immediately after, we use Audio(wav_file) (in a Jupyter Notebook) to play the generated audio. This demonstrates the end-to-end process of:

1. Taking a snippet of ABC notation.
2. Converting it to a WAV file.
3. Playing the result directly in the notebook.

### Explanation of ABC data

- X: Reference number - Used to identify the tune in a collection
- T: Title - The name of the tune
- M: Meter/Time signature - Indicates the time signature of the piece (4/4 in this case)
- L: Default note length - Sets the default length for notes (1/8 means eighth notes)
- B: Book - Source book reference
- N: Notes - General annotations about the piece
- Z: Transcriber - Information about who transcribed the piece
- K: Key - The key signature of the piece (G major in this case)
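
To make the field list concrete, here is a minimal sketch that collects the header fields of a tune into a dictionary. The helper name `parse_abc_header` is hypothetical (it is not part of music21), and the sketch assumes each header field sits on its own line in the form `<letter>:<value>`, with `K:` closing the header as the ABC convention prescribes.

```python
def parse_abc_header(abc_text: str) -> dict:
    """Collect single-letter header fields (e.g. 'K', 'M') from one ABC tune.

    Hypothetical helper for illustration only; not part of music21.
    """
    fields = {}
    for line in abc_text.strip().splitlines():
        # A header line looks like "<letter>:<value>"
        if len(line) >= 2 and line[0].isalpha() and line[1] == ":":
            key, value = line[0], line[2:].strip()
            # A field such as N: may appear more than once, so keep a list
            fields.setdefault(key, []).append(value)
            if key == "K":  # K: ends the header; the tune body follows
                break
    return fields


example = """
X:100
T:NewTune
M: 4/4
L: 1/8
K:G
(G/2A/2 Bc) | A2 z ||
"""
header = parse_abc_header(example)
print(header["M"], header["L"], header["K"])
```

Storing each value in a list keeps repeated fields (the training data contains tunes with two `N:` or `Z:` lines) without overwriting earlier ones.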

In [4]:
sample_note = """
X:100
T:NewTune
M: 4/4
L: 1/8
B: "O'Neill's 1"
N: "With spirit" "collected by J. O'Neill"
Z: "Transcribed by Norbert Paap, norbertp@bdu.uva.nl"
K:G
(G/2A/2 Bc) | (d/4c/4B/4c/4d/4e/4f/4g/4) (fdc>B) | {B}(AG/2F/2) (GDGA) | B-~d (c/A/G/F/).G/ | A2 (A/G/^F/A/) | (~G>F) DD | D2 z ||
(D/E/) | F>(E F/G/A/B/) | c>(d c/B/A/G/) | Af ef | d2 f>e |
d>c AG | F>E FG | AB cc | c3 (D/E/) |
F>(E F/G/A/B/) | c>(d c/B/A/G/) | Af ef | d2 f>e |
d>c AG | F>E FG | AB cc | c3 (D/E/) |
F>(E F/G/A/B/) | c>(d c/B/A/G/) | Af ef | d2 f>e |
d>c AG | F>E
"""

In [5]:
wav_file = abc_to_wav(sample_note, "sample")
Audio(wav_file)  # Play in Jupyter Notebook

## Imports and Basic Setup

In [6]:
# Import basic libraries
import os
import time
import math
import random
import numpy as np

# Import PyTorch libraries
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

# Music-related libraries (optional)
import music21
import pretty_midi
import soundfile as sf

# Set device configuration (CPU/GPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

Using device: cuda


In [7]:
def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if device.type == 'cuda':
        torch.cuda.manual_seed_all(seed)

set_seed(42)


We define a function set_seed and call it with 42 so that Python’s random module, NumPy, and PyTorch all start from the same seed. This makes runs largely repeatable: without a fixed seed, stochastic operations such as weight initialization, data shuffling, and sampling differ from run to run, which makes experiments harder to compare. Some GPU operations can still be nondeterministic, so results may not match bit for bit.
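
The effect of seeding can be seen in isolation with a small sketch. It uses only Python's built-in `random` module as a stand-in; the same idea applies to NumPy and PyTorch generators. The helper name `draw` is made up for this illustration.

```python
import random


def draw(seed: int, n: int = 5) -> list:
    """Re-seed Python's RNG, then draw n integers in [0, 99]."""
    random.seed(seed)
    return [random.randint(0, 99) for _ in range(n)]


first = draw(42)
second = draw(42)
# Re-seeding with the same value replays the same sequence
print(first == second)
```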

## Data Loading and Processing

Here, we'll define a few utility functions for loading the text data, building a vocabulary, and converting characters to indices (and vice versa).
1. `load_text_file(file_path)`: Reads and returns the contents of a text file.
2. `build_vocab(text)`: Creates the character-to-index (`char2idx`) and index-to-character (`idx2char`) mappings.
3. `text_to_tensor(text, char2idx)`: Converts a string into a `torch.Tensor` of token indices.

In [8]:
def load_text_file(file_path: str) -> str:
    """
    Reads the entire text file and returns it as a string
    """
    with open(file_path, 'r', encoding='utf-8') as f:
        text = f.read()
    return text

file_path = "./abc.txt"  # Modify according to your environment
text = load_text_file(file_path)
print("Text length:", len(text))
print("Preview:", text[:300])

Text length: 1684914
Preview: X: 1
T: The Enchanted Valley
M: 2/4
L: 1/16
B: "O'Neill's 1"
N: "Very slow" "collected by J. O'Neill"
N:
Z: "Transcribed by Norbert Paap, norbertp@bdu.uva.nl"
Z:
K:Gm
G3-A (Bcd=e) | f4 (g2dB) | ({d}c3-B) G2-E2 | F4 (D2=E^F) |
G3-A (Bcd=e) | f4 d2-f2 | (g2a2 b2).g2 | {b}(a2g2 f2).d2 |
(d2{ed}c2) B2B2


In [9]:
def build_vocab(text: str):
    # Extract unique characters
    unique_chars = sorted(list(set(text)))
    char2idx = {ch: i for i, ch in enumerate(unique_chars)}
    idx2char = {i: ch for ch, i in char2idx.items()}
    return char2idx, idx2char

char2idx, idx2char = build_vocab(text)
vocab_size = len(char2idx)
print("Vocab size:", vocab_size)

Vocab size: 95


In [10]:
def text_to_tensor(text: str, char2idx: dict) -> torch.Tensor:
    """
    Converts a given string into a tensor of character indices
    """
    return torch.tensor([char2idx[ch] for ch in text], dtype=torch.long)

data_tensor = text_to_tensor(text, char2idx)
print("data_tensor size:", data_tensor.size())

data_tensor size: torch.Size([1684914])
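
As a sanity check on the encoding step, the sketch below builds the same kind of character-level vocabulary on a toy string and verifies that encoding followed by decoding returns the original text. The names `build_toy_vocab`, `toy_text`, `toy_c2i`, and `toy_i2c` are illustrative and use plain Python lists rather than tensors.

```python
def build_toy_vocab(text: str):
    """Character-level vocabulary: sorted unique characters mapped to 0..N-1."""
    chars = sorted(set(text))
    char2idx = {ch: i for i, ch in enumerate(chars)}
    idx2char = {i: ch for ch, i in char2idx.items()}
    return char2idx, idx2char


toy_text = "K:G\nGABc|"
toy_c2i, toy_i2c = build_toy_vocab(toy_text)

encoded = [toy_c2i[ch] for ch in toy_text]          # text -> indices
decoded = "".join(toy_i2c[i] for i in encoded)      # indices -> text

print(encoded)
print(decoded == toy_text)
```

Note that a character missing from the vocabulary raises a `KeyError` during encoding, so any prompt used later for generation should contain only characters that occur in the training text.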


## Creating a Custom Dataset

To train a language model, we typically use consecutive segments of text as both input (`x`) and target (`y`), where `y` is shifted by one character relative to `x`.

### `ABCDataset`
- We store the entire dataset as a single long `Tensor`.
- For each sample, we extract a window of length `seq_len` as `x` and the same window shifted one position to the right as `y`.
- This helps the model learn to predict the next token at each position.
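
The shifted-window idea is easiest to see on a toy sequence. The sketch below uses a plain string instead of a tensor, and the helper name `make_pairs` is hypothetical; the slicing mirrors what the dataset class does.

```python
def make_pairs(tokens, seq_len):
    """Return every (x, y) pair, where y is x shifted right by one position."""
    pairs = []
    for i in range(len(tokens) - seq_len):
        x = tokens[i : i + seq_len]
        y = tokens[i + 1 : i + seq_len + 1]
        pairs.append((x, y))
    return pairs


pairs = make_pairs("GABcde", seq_len=3)
for x, y in pairs:
    print(x, "->", y)
```

At every position `t`, `y[t]` is the character that follows `x[t]` in the original text, which is exactly the next-token target.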

In [11]:
class ABCDataset(Dataset):
    """
    data_tensor: Full dataset in token index form
    seq_len: Length of each sample sequence
    """
    def __init__(self, data_tensor: torch.Tensor, seq_len: int):
        self.data = data_tensor
        self.seq_len = seq_len

    def __len__(self):
        # Each sample needs seq_len + 1 tokens (x plus the shifted y),
        # so valid start indices run from 0 to len(data) - seq_len - 1
        return len(self.data) - self.seq_len

    def __getitem__(self, idx):
        x = self.data[idx : idx + self.seq_len]
        y = self.data[idx + 1 : idx + self.seq_len + 1]
        return x, y

In [12]:
# Hyperparams
seq_len = 100   # Context length
batch_size = 1024
dataset = ABCDataset(data_tensor, seq_len)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
print("Number of samples:", len(dataset))

Number of samples: 1684814


- seq_len = 100: Each training sample is a sequence of 100 characters.
- batch_size = 1024: Large batch size to speed up training (depending on available GPU memory).
- dataset = ABCDataset(...): Constructs a dataset for the entire text.
- dataloader = DataLoader(...): Wraps the dataset in batches, shuffles, and provides iteration logic during training.

This setup ensures that each batch contains multiple input sequences and their corresponding targets.
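
The sizes reported in this notebook can be checked with a little arithmetic. The sketch below takes the text length from the output above and assumes the DataLoader keeps the final partial batch, which is its default behavior.

```python
import math

text_length = 1_684_914   # from the "Text length" output above
seq_len = 100
batch_size = 1024

# One sample per valid start index
num_samples = text_length - seq_len
# The last batch may be smaller than batch_size, hence the ceiling
steps_per_epoch = math.ceil(num_samples / batch_size)
last_batch_size = num_samples - (steps_per_epoch - 1) * batch_size

print(num_samples, steps_per_epoch, last_batch_size)
```

Because consecutive samples overlap in all but one character, one epoch shows the model each character of the text roughly `seq_len` times.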

## Defining the LSTM-based Music Model
We'll define a minimal LSTM-based language model in PyTorch.

In [13]:
class LSTMMusicModel(nn.Module):
   def __init__(self, vocab_size, embedding_dim=256, hidden_dim=512, num_layers=3):
       super().__init__()
       self.vocab_size = vocab_size
       self.embedding_dim = embedding_dim
       self.hidden_dim = hidden_dim
       self.num_layers = num_layers

       self.embedding = nn.Embedding(vocab_size, embedding_dim)
       self.lstm = nn.LSTM(
           input_size=embedding_dim,
           hidden_size=hidden_dim,
           num_layers=num_layers,
           batch_first=True
       )
       self.fc = nn.Linear(hidden_dim, vocab_size)

   def forward(self, x, hidden=None):
       # x shape: [batch_size, seq_len]
       embedded = self.embedding(x)   # shape: [batch_size, seq_len, embedding_dim]
       if hidden is not None:
           out, hidden = self.lstm(embedded, hidden)
       else:
           out, hidden = self.lstm(embedded)
       # out shape: [batch_size, seq_len, hidden_dim]
       out = self.fc(out)            # shape: [batch_size, seq_len, vocab_size]
       return out, hidden

   def init_hidden(self, batch_size):
       # Zero-initialize the LSTM hidden and cell states on the model's own device
       model_device = self.embedding.weight.device
       h0 = torch.zeros(self.num_layers, batch_size, self.hidden_dim, device=model_device)
       c0 = torch.zeros(self.num_layers, batch_size, self.hidden_dim, device=model_device)
       return (h0, c0)

In [14]:
embedding_dim = 256
hidden_dim = 512
num_layers = 3

model = LSTMMusicModel(
    vocab_size=vocab_size,
    embedding_dim=embedding_dim,
    hidden_dim=hidden_dim,
    num_layers=num_layers
).to(device)

print(model)


LSTMMusicModel(
  (embedding): Embedding(95, 256)
  (lstm): LSTM(256, 512, num_layers=3, batch_first=True)
  (fc): Linear(in_features=512, out_features=95, bias=True)
)
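
To get a feel for the size of this model, the parameter count can be worked out by hand from the printed layer shapes. The sketch below assumes the standard PyTorch LSTM layout (four gates per layer, each with input weights, recurrent weights, and two bias vectors); the helper name `lstm_layer_params` is illustrative.

```python
def lstm_layer_params(input_size: int, hidden_size: int) -> int:
    """Parameters in one LSTM layer: 4 gates x (W_ih + W_hh + b_ih + b_hh)."""
    return 4 * (input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)


vocab_size, embedding_dim, hidden_dim, num_layers = 95, 256, 512, 3

embedding_params = vocab_size * embedding_dim
# The first layer reads embeddings; later layers read the previous layer's output
lstm_params = lstm_layer_params(embedding_dim, hidden_dim)
lstm_params += (num_layers - 1) * lstm_layer_params(hidden_dim, hidden_dim)
fc_params = hidden_dim * vocab_size + vocab_size  # weights + bias

total_params = embedding_params + lstm_params + fc_params
print(f"{total_params:,}")
```

Almost all of the parameters sit in the three LSTM layers; the embedding and output layers are small because the vocabulary has only 95 characters.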


## Training Step

We define a function `train_one_epoch` that:
1. Sets the model to training mode.
2. Iterates over the dataloader to retrieve batches.
3. Performs forward and backward passes, then updates the parameters.
4. Accumulates the loss for reporting.
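
The loss accumulated in step 4 is cross-entropy: the negative log-probability the model assigns to the correct next character. The sketch below computes it by hand for a single prediction in plain Python, as an illustration of the quantity rather than a replacement for `nn.CrossEntropyLoss`; the function name and toy logits are made up.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of the target class under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]


# Four classes, correct class is index 2
uniform = cross_entropy([0.0, 0.0, 0.0, 0.0], target=2)    # no preference
confident = cross_entropy([0.0, 0.0, 5.0, 0.0], target=2)  # strongly prefers class 2

print(round(uniform, 4), round(confident, 4))
```

A model that guesses uniformly over `N` classes scores `ln(N)`, which for the 95-character vocabulary here is about 4.55. That is a useful reference point when reading the training log.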

In [15]:
def train_one_epoch(model, dataloader, criterion, optimizer, device, log_interval=500):
   """
   Train model for one epoch with monitoring at specified intervals

   Args:
       model: Neural network model
       dataloader: Training data loader
       criterion: Loss function
       optimizer: Optimizer
       device: Computing device ('cpu' or 'cuda')
       log_interval: Number of steps between logging (default: 500)
   """
   model.train()
   total_loss = 0
   running_loss = 0

   for step, (x, y) in enumerate(dataloader, 1):
       x = x.to(device)
       y = y.to(device)
       # Initialize hidden state
       hidden = model.init_hidden(batch_size=x.size(0))

       optimizer.zero_grad()
       out, _ = model(x, hidden)  # [batch_size, seq_len, vocab_size]

       # Reshape for CrossEntropy: (batch*seq_len, vocab_size) vs (batch*seq_len)
       out_reshaped = out.view(-1, out.size(-1))  # use the model's output width, not a global
       y_reshaped = y.view(-1)

       loss = criterion(out_reshaped, y_reshaped)
       loss.backward()

       optimizer.step()

       # Update losses
       running_loss += loss.item()
       total_loss += loss.item()

       # Monitor training every log_interval steps
       if step % log_interval == 0:
           avg_loss = running_loss / log_interval
           print(f'Step [{step}/{len(dataloader)}], Average Loss: {avg_loss:.4f}')
           running_loss = 0

   return total_loss / len(dataloader)

In [16]:
epochs = 5
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(1, epochs+1):
    start_time = time.time()
    avg_loss = train_one_epoch(model, dataloader, criterion, optimizer, device)
    elapsed = time.time() - start_time

    print(f"[Epoch {epoch}/{epochs}] loss={avg_loss:.4f}  (time: {elapsed:.1f} sec)")


Step [500/1646], Average Loss: 1.4328
Step [1000/1646], Average Loss: 0.5224
Step [1500/1646], Average Loss: 0.2613
[Epoch 1/5] loss=0.6916  (time: 2198.6 sec)
Step [500/1646], Average Loss: 0.1869
Step [1000/1646], Average Loss: 0.1713
Step [1500/1646], Average Loss: 0.1621
[Epoch 2/5] loss=0.1721  (time: 2215.2 sec)
Step [500/1646], Average Loss: 0.1532
Step [1000/1646], Average Loss: 0.1494
Step [1500/1646], Average Loss: 0.1461
[Epoch 3/5] loss=0.1491  (time: 2217.0 sec)
Step [500/1646], Average Loss: 0.1411
Step [1000/1646], Average Loss: 0.1395
Step [1500/1646], Average Loss: 0.1379
[Epoch 4/5] loss=0.1392  (time: 2217.6 sec)
Step [500/1646], Average Loss: 0.1342
Step [1000/1646], Average Loss: 0.1336
Step [1500/1646], Average Loss: 0.1325
[Epoch 5/5] loss=0.1333  (time: 2220.3 sec)
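
One way to read these numbers, assuming they are mean cross-entropy values in nats (the default for `nn.CrossEntropyLoss`), is to exponentiate them. `exp(loss)` is the per-character perplexity, and `exp(-loss)` is the geometric-mean probability given to the correct character. The list below is copied from the epoch lines of the log.

```python
import math

# Per-epoch average losses copied from the training log above
epoch_losses = [0.6916, 0.1721, 0.1491, 0.1392, 0.1333]

for epoch, loss in enumerate(epoch_losses, 1):
    perplexity = math.exp(loss)    # effective number of equally likely choices
    mean_prob = math.exp(-loss)    # geometric-mean probability of the correct char
    print(f"epoch {epoch}: perplexity={perplexity:.3f}, mean_prob={mean_prob:.3f}")
```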


## Text Generation

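We define a function `generate_text` to produce new sequences from the trained model.  
- We start with a prompt (`start_text`), feed it through the LSTM to build up the hidden state, and then sample one character at a time for `max_length` steps.
- `temperature` rescales the logits before sampling: values below 1.0 make the output more conservative, and values above 1.0 make it more varied.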
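
The effect of temperature on the sampling distribution can be sketched in plain Python. The function name and the toy logits below are illustrative; the computation is the usual softmax applied to `logits / temperature`.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature, computed in a numerically stable way."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
cold = softmax_with_temperature(logits, temperature=0.5)
neutral = softmax_with_temperature(logits, temperature=1.0)
hot = softmax_with_temperature(logits, temperature=2.0)

for name, probs in [("T=0.5", cold), ("T=1.0", neutral), ("T=2.0", hot)]:
    print(name, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring character, which tends to give safer but more repetitive output; higher temperatures flatten the distribution and increase the chance of invalid ABC.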

In [17]:
import torch.nn.functional as F

def generate_text(
   model,
   start_text: str,
   char2idx: dict,
   idx2char: dict,
   max_length=300,
   temperature=1.0
):
   model.eval()
   # 1) Convert start_text to tensor
   input_ids = [char2idx[ch] for ch in start_text]
   input_ids = torch.tensor([input_ids], dtype=torch.long, device=device)  # [1, len_of_start]

   # 2) Initialize hidden state
   hidden = model.init_hidden(batch_size=1)

   generated = list(start_text)  # Store results

   # Inference only, so disable gradient tracking for the whole procedure
   with torch.no_grad():
       # 3) Pre-forward through all but the last prompt token
       #    (the LSTM must process start_text first to build up its state)
       for i in range(input_ids.size(1) - 1):
           _, hidden = model(input_ids[:, i:i+1], hidden)

       current_input = input_ids[:, -1:]  # Last prompt token

       # 4) Generate max_length tokens one by one
       for _ in range(max_length):
           out, hidden = model(current_input, hidden)  # out: [1, 1, vocab_size]
           logits = out[:, -1, :]  # [1, vocab_size]

           # Apply temperature (1.0 leaves the logits unchanged)
           logits = logits / temperature

           # Probability distribution
           probs = F.softmax(logits, dim=-1)
           # Sample from probabilities
           next_token_id = torch.multinomial(probs, 1).item()

           next_char = idx2char[next_token_id]
           generated.append(next_char)

           # Next input
           current_input = torch.tensor([[next_token_id]], device=device)

   return "".join(generated)

In [18]:
start_text = """X:100
T:NewTune
M: 4/4
L: 1/8
"""
gen_text = generate_text(model, start_text, char2idx, idx2char, max_length=500, temperature=1.0)

print("=== Generated Text ===")
print(gen_text)

=== Generated Text ===
X:100
T:NewTune
M: 4/4
L: 1/8
B:"O'Neill's 47"
N:"Tenderly" "collected by J. O'Neill"
Z: "Transcribed by Norbert Paap, norbertp@bdu.uva.nl"
K:Gm
d/2-c/2 | B>B (A{BA}G) | A-<d c>A | G>G A-<B | c>-A (F3/2G/4A/4) |
(.B>.B) (A{BA}G) | A-<d (3(cAG) | (.A>.A) (.G>.G) | G3 ||


X: 16
T: My Darling I Am Fond of You
M: 3/4
L: 1/8
B: "O'Neill's 16"
N: "Tenderly" "collected by F. O'Neill"
Z: "Transcribed by Norbert Paap, norbertp@bdu.uva.nl"
K:D
A/2-G/2 FA | B>-G E-D CE | D2 (E/2F/2G) A-c | d2-e | c2-A | (d c A) |\
(~G F G) | A2 A | A3


In [19]:
def split_songs_to_array(abc_text):
    """
    Splits multiple songs in ABC notation into an array where each element is a complete song

    Args:
        abc_text (str): Text containing multiple songs in ABC notation

    Returns:
        list: List of strings, where each string is a complete song in ABC notation
    """
    # Check for empty string
    if not abc_text:
        return []

    # Split text into lines
    lines = abc_text.split('\n')

    # Array to store all songs
    songs = []

    # List to store current song's lines
    current_song_lines = []

    for line in lines:
        # Check if current line is the start of a new song (X:)
        if line.strip().startswith('X:'):
            # If we already have lines for a song, save it before starting new one
            if current_song_lines:
                songs.append('\n'.join(current_song_lines))
                current_song_lines = []

        # Only add non-empty lines
        if line.strip():
            current_song_lines.append(line)

    # Don't forget to add the last song
    if current_song_lines:
        songs.append('\n'.join(current_song_lines))

    return songs

In [20]:
music_texts = split_songs_to_array(gen_text)

### Play - First song

In [21]:
print("=== Generated Music ===")
print(music_texts[0])
wav_file = abc_to_wav(music_texts[0], "gen_music")
Audio(wav_file)  # Play in Jupyter Notebook

=== Generated Music ===
X:100
T:NewTune
M: 4/4
L: 1/8
B:"O'Neill's 47"
N:"Tenderly" "collected by J. O'Neill"
Z: "Transcribed by Norbert Paap, norbertp@bdu.uva.nl"
K:Gm
d/2-c/2 | B>B (A{BA}G) | A-<d c>A | G>G A-<B | c>-A (F3/2G/4A/4) |
(.B>.B) (A{BA}G) | A-<d (3(cAG) | (.A>.A) (.G>.G) | G3 ||


### Play - Second song

In [22]:
print("=== Generated Music ===")
print(music_texts[1])
wav_file = abc_to_wav(music_texts[1], "gen_music_2")  # separate name so the first WAV is not overwritten
Audio(wav_file)  # Play in Jupyter Notebook

=== Generated Music ===
X: 16
T: My Darling I Am Fond of You
M: 3/4
L: 1/8
B: "O'Neill's 16"
N: "Tenderly" "collected by F. O'Neill"
Z: "Transcribed by Norbert Paap, norbertp@bdu.uva.nl"
K:D
A/2-G/2 FA | B>-G E-D CE | D2 (E/2F/2G) A-c | d2-e | c2-A | (d c A) |\
(~G F G) | A2 A | A3


## Concluding Remarks

By following this workflow, we have:

1. Prepared and tokenized ABC notation text.
2. Trained an LSTM-based model to learn musical patterns at a character level.
3. Generated new ABC music notation using the trained model.
4. Converted the generated notation to a WAV file for listening.

This pipeline shows how to build a simple text-generation approach for symbolic music. You can extend it by experimenting with different hyperparameters, adding more post-processing to the generated ABC, or substituting other model architectures to improve musical quality and complexity.