In [1]:
import numpy as np
import pandas as pd
import mido
from pathlib import Path

This notebook expects the MAESTRO v3.0.0 dataset to be unzipped at the project root under `maestro`.
The data is not committed to the repository because of its size.

In [2]:
data_dir = Path('../maestro/maestro-v3.0.0/')
df = pd.read_csv(data_dir / 'maestro-v3.0.0.csv')
df.head()

Unnamed: 0,canonical_composer,canonical_title,split,year,midi_filename,audio_filename,duration
0,Alban Berg,Sonata Op. 1,train,2018,2018/MIDI-Unprocessed_Chamber3_MID--AUDIO_10_R...,2018/MIDI-Unprocessed_Chamber3_MID--AUDIO_10_R...,698.66116
1,Alban Berg,Sonata Op. 1,train,2008,2008/MIDI-Unprocessed_03_R2_2008_01-03_ORIG_MI...,2008/MIDI-Unprocessed_03_R2_2008_01-03_ORIG_MI...,759.518471
2,Alban Berg,Sonata Op. 1,train,2017,2017/MIDI-Unprocessed_066_PIANO066_MID--AUDIO-...,2017/MIDI-Unprocessed_066_PIANO066_MID--AUDIO-...,464.649433
3,Alexander Scriabin,"24 Preludes Op. 11, No. 13-24",train,2004,2004/MIDI-Unprocessed_XP_21_R1_2004_01_ORIG_MI...,2004/MIDI-Unprocessed_XP_21_R1_2004_01_ORIG_MI...,872.640588
4,Alexander Scriabin,"3 Etudes, Op. 65",validation,2006,2006/MIDI-Unprocessed_17_R1_2006_01-06_ORIG_MI...,2006/MIDI-Unprocessed_17_R1_2006_01-06_ORIG_MI...,397.857508


We can load a file and explore its `messages`, which hold notes and other actions.
Each file is likely to have just tracks 0 and 1.
Track 0 holds only metadata; track 1 holds the whole song.

In [3]:
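# Resolve the first row's MIDI path relative to the dataset directory.
filename = data_dir / df.iloc[0]['midi_filename']
print(filename)
mid = mido.MidiFile(filename)
# Collect every message from every track as a dict, in track order.
messages = []
for i, track in enumerate(mid.tracks):
    print('Track {}: {}'.format(i, track.name))
    for msg in track:
        messages.append(msg.dict())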

..\maestro\maestro-v3.0.0\2018\MIDI-Unprocessed_Chamber3_MID--AUDIO_10_R3_2018_wav--1.midi
Track 0: 
Track 1: 


Looking at the first messages, we should be able to roughly work out what kind of note each one is (for example, how long it lasts) by combining the times, the note ons and offs, and the time signature metadata.
This still needs to be verified on all the files, but `note_off` messages do not appear to be used.
Instead, a note most likely ends with a `note_on` whose velocity is 0.

Time is stored as a delta from the previous message, so the second note is hit 615 ticks after the first.
20 ticks after that, the first note lifts off.

Chords do not appear to be struck at exactly the same time.
In the sheet music, the notes in rows 13, 15, 17, and 18 all belong to one chord played together.
The ticks are fine-grained enough that here the onsets are spread across ~30 ticks.
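
One possible way to recover chords despite that spread is to treat onsets that land within a small window of each other as a single chord. This is only a sketch under assumptions: `group_onsets` and its 30-tick `window` are hypothetical, the onset values are made up for illustration, and the approach has not been checked against the dataset.

```python
def group_onsets(onsets, window=30):
    """Group (abs_tick, note) onsets that start within `window` ticks of the group's first onset."""
    groups = []
    for tick, note in sorted(onsets):
        # groups[-1][0][0] is the tick of the first onset in the current group.
        if groups and tick - groups[-1][0][0] <= window:
            groups[-1].append((tick, note))
        else:
            groups.append([(tick, note)])
    return groups

# Illustrative onsets in absolute ticks (made-up values, not taken from the file).
onsets = [(1000, 60), (1008, 64), (1021, 67), (1029, 72), (1400, 62)]
chords = group_onsets(onsets)
for chord in chords:
    print([note for _, note in chord])
```

A fixed window is the simplest choice; fast passages could be merged into false chords, so the window would likely need tuning per piece or per tempo.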

In [4]:
messages = pd.DataFrame(messages)
print(messages.shape)
messages.head(20)

(23945, 13)


Unnamed: 0,type,tempo,time,numerator,denominator,clocks_per_click,notated_32nd_notes_per_beat,program,channel,control,value,note,velocity
0,set_tempo,500000.0,0,,,,,,,,,,
1,time_signature,,0,4.0,4.0,24.0,8.0,,,,,,
2,end_of_track,,1,,,,,,,,,,
3,program_change,,0,,,,,0.0,0.0,,,,
4,control_change,,0,,,,,,0.0,64.0,127.0,,
5,note_on,,755,,,,,,0.0,,,67.0,52.0
6,note_on,,615,,,,,,0.0,,,72.0,67.0
7,note_on,,20,,,,,,0.0,,,67.0,0.0
8,note_on,,74,,,,,,0.0,,,72.0,0.0
9,control_change,,128,,,,,,0.0,64.0,117.0,,
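
To make the delta timing concrete, the sketch below accumulates `time` into absolute ticks and pairs each note with the zero-velocity `note_on` that ends it. It uses plain dicts copied from rows 3 to 9 of the table above (the start of track 1) so that it runs without the dataset; `pair_notes` is a hypothetical helper, and it assumes a velocity of 0 always means release, which has not yet been verified on all files.

```python
# Rows 3-9 of the table above (track 1), reduced to the fields needed here.
rows = [
    {"type": "program_change", "time": 0},
    {"type": "control_change", "time": 0},
    {"type": "note_on", "time": 755, "note": 67, "velocity": 52},
    {"type": "note_on", "time": 615, "note": 72, "velocity": 67},
    {"type": "note_on", "time": 20, "note": 67, "velocity": 0},
    {"type": "note_on", "time": 74, "note": 72, "velocity": 0},
    {"type": "control_change", "time": 128},
]

def pair_notes(rows):
    """Return (note, start_tick, end_tick) tuples, treating velocity 0 as note-off."""
    now = 0          # running absolute time in ticks
    open_notes = {}  # note number -> start tick of the note currently sounding
    notes = []
    for row in rows:
        now += row["time"]  # `time` is a delta from the previous message
        if row["type"] != "note_on":
            continue
        if row["velocity"] > 0:
            open_notes[row["note"]] = now
        elif row["note"] in open_notes:
            notes.append((row["note"], open_notes.pop(row["note"]), now))
    return notes

notes = pair_notes(rows)
print(notes)
```

The running sum has to restart for each track, since deltas are relative to the previous message in the same track. Converting ticks to seconds additionally needs the file's `mid.ticks_per_beat` and the tempo; `mido.tick2second` can do that conversion.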


As noted above, there are no `note_off` messages.
Control changes cover things like volume changes and pedals.

In [6]:
messages.type.value_counts()

type
control_change    15546
note_on            8394
end_of_track          2
set_tempo             1
time_signature        1
program_change        1
Name: count, dtype: int64

Control number 64 is the sustain pedal.
Control number 67 is the soft pedal.
The `value` column holds the pedal position rather than a plain on/off flag.

In [7]:
messages.control.value_counts()

control
64.0    13165
67.0     2381
Name: count, dtype: int64
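
Because the pedal values are continuous (127 and 117 in the table above), turning them into on/off spans needs a threshold. By MIDI convention, values of 64 and above are read as pedal down and lower values as pedal up. The sketch below applies that rule; `pedal_intervals` is a hypothetical helper and the event values are made up for illustration.

```python
def pedal_intervals(events, threshold=64):
    """Turn (abs_tick, value) pedal events into (down_tick, up_tick) intervals."""
    intervals = []
    down_at = None  # tick at which the pedal went down, or None while it is up
    for tick, value in events:
        if value >= threshold and down_at is None:
            down_at = tick
        elif value < threshold and down_at is not None:
            intervals.append((down_at, tick))
            down_at = None
    return intervals

# Illustrative controller-64 events in absolute ticks (made-up values).
events = [(0, 127), (1592, 117), (2000, 40), (2100, 0), (2500, 90), (3000, 10)]
intervals = pedal_intervals(events)
print(intervals)
```

A pedal that is still down at the end of the input is left out of the result; whether to close it at the last tick is a design choice this sketch does not make.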

We mostly care about `note_on` events with velocity > 0, meaning the note was played.
Counting them should give a vectorized `bag of notes`, though it should be ordered by note number rather than by count (for example with `.sort_index()`).
Piano MIDI note numbers range from 21 to 108, for 88 notes in total.
72 of them appear in this song, which actually seems quite high.

In [8]:
messages[messages.velocity > 0].note.value_counts()

note
62.0     146
66.0     140
60.0     140
71.0     132
67.0     131
        ... 
28.0       4
95.0       3
29.0       3
27.0       1
102.0      1
Name: count, Length: 72, dtype: int64
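
A minimal sketch of the bag-of-notes idea, assuming a fixed 88-slot vector ordered by pitch (MIDI 21 to 108) so that every file maps to the same layout. It uses plain dicts shaped like `msg.dict()` output so it runs without the dataset; `bag_of_notes` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
from collections import Counter

LOWEST, HIGHEST = 21, 108  # piano range in MIDI note numbers (88 keys)

def bag_of_notes(rows):
    """Count played notes (note_on with velocity > 0) into an 88-slot list ordered by pitch."""
    counts = Counter(
        int(row["note"])
        for row in rows
        if row.get("type") == "note_on" and row.get("velocity", 0) > 0
    )
    return [counts.get(n, 0) for n in range(LOWEST, HIGHEST + 1)]

rows = [
    {"type": "note_on", "note": 67.0, "velocity": 52.0},
    {"type": "note_on", "note": 72.0, "velocity": 67.0},
    {"type": "note_on", "note": 67.0, "velocity": 0.0},  # release, not counted
    {"type": "note_on", "note": 72.0, "velocity": 0.0},  # release, not counted
    {"type": "control_change", "control": 64, "value": 117},
]
vector = bag_of_notes(rows)
print(len(vector), vector[67 - LOWEST], vector[72 - LOWEST], sum(vector))
```

Notes outside the 21 to 108 range would be dropped silently here, which seems acceptable for piano recordings but is worth checking.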

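In theory a file could use multiple channels, but I don't expect that from our simple piano files.
Meta messages (tempo, time signature, end of track) carry no channel, so the count falls slightly short of the total row count.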

In [9]:
messages.channel.value_counts()

channel
0.0    23941
Name: count, dtype: int64