## Music Classification Mini-Project
## Author: Collin Guidry

## Step 1: Data Creation Pipeline

### Purpose:
This notebook is designed to collect, format, split, and label MIDI files for model training and testing.
Portions of this notebook can be easily duplicated to create an inference pipeline.

### Available Parameters:
- Input and output paths for train and test data
- chunk size (in seconds)
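
As a rough sketch, the parameters listed above could be gathered in one place. The `PipelineConfig` name and its field names are hypothetical and do not appear in the notebook's code; the default values mirror the paths and chunk size used in the cells of this notebook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical container for the notebook's parameters (illustration only)."""
    raw_train_dir: str = "../data/raw/training"
    raw_test_dir: str = "../data/raw/holdout"
    chunks_train_dir: str = "../data/intermediate/midi_chunks_train/"
    chunks_test_dir: str = "../data/intermediate/midi_chunks_test/"
    chunk_size_seconds: int = 30


# Default configuration, plus a variant with a shorter chunk size
config = PipelineConfig()
short_chunks = PipelineConfig(chunk_size_seconds=10)
print(config.chunk_size_seconds, short_chunks.chunk_size_seconds)
```

Keeping the values together like this would make it easier to reuse the same settings for the training and holdout branches.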

### Processing Steps:

    1. Load MIDI files for model training
        - These are located in separate folders, one per composer
        - Save each parent folder name as the label for the MIDI files it contains

    2. Load MIDI files for model inference/testing
        - MIDI files used for model testing are unlabeled
        - They should be processed independently from the training data, but in the same fashion
        - As the labels are unknown, this will later be known as "df_final_test"

    3. Process loaded MIDI files as follows (Train and Test):
        - Extract metadata from the file path
        - Split each MIDI file into 30-second chunks
            - Trim each chunk as precisely as possible
            - "Stretch" chunks that fall just short of 30 seconds
        - Write out chunks to intermediate folders, with an iter_id for each chunk
        - Note: Modifying the chunk size will not overwrite existing files, because the chunk size is part of each output filename

    4. Load 30-second chunks into DataFrame (Train and Test)
        - Read each chunk from disk
        - Store each MIDI file object in a Pandas DataFrame column
        - Format available metadata such as composer, composition name, and chunk_iter as columns
        - For training data, create a train/test split column
            - Note: The same composition is never allowed to appear in both splits

    5. Write out DataFrames as .pkl to be further processed
        - train_processed.pkl
        - test_processed.pkl

## Import Libraries

In [1]:
import pretty_midi
import os
import numpy as np
import copy
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

## Helper Functions

In [2]:
# helper functions

def trim_midi_file(file:pretty_midi.pretty_midi.PrettyMIDI , start:int, end:int):
    """Return a copy of `file` restricted to [start, end] seconds and shifted to begin at 0."""
    midi_file = copy.deepcopy(file)
    start_time = float(start)
    end_time = float(end)
    new_start_time = 0.0
    new_end_time = end_time - start_time  # Duration of the cropped portion
    # Map the window [start_time, end_time] onto [0, duration]
    midi_file.adjust_times([start_time, end_time], [new_start_time, new_end_time])
    return midi_file


def create_chunk_interval_times(length, chunk_size):
    chunks = []
    for i in range(0, length, chunk_size):
        start = i
        end = min(i + chunk_size, length)
        chunks.append([start, end])
    return chunks


def stretch_midi_file_to_desired_length(file:pretty_midi.pretty_midi.PrettyMIDI, length:int):
    """Return a copy of `file` rescaled in time so that it ends at `length` seconds."""
    midi_file = copy.deepcopy(file)
    start_time = 0.0
    end_time = midi_file.get_end_time()
    new_start_time = 0.0
    new_end_time = length
    # Remap [0, end_time] onto [0, length]
    midi_file.adjust_times([start_time, end_time], [new_start_time, new_end_time])
    return midi_file


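def process_and_save_midi_splits(midi_file:pretty_midi.pretty_midi.PrettyMIDI, 
                                 file_name: str,
                                 output_dir: str,
                                 chunk_size_seconds:int ):
    """Split `midi_file` into fixed-length chunks and write each chunk to `output_dir`."""
    # Make sure the output folder exists before writing
    os.makedirs(output_dir, exist_ok=True)

    total_duration = int(midi_file.get_end_time())
    # Drop the final interval, which is normally a partial chunk
    intervals = create_chunk_interval_times(total_duration, chunk_size_seconds)[:-1]

    for i, interval in enumerate(intervals):

        trimmed_file = trim_midi_file(file=midi_file, start=interval[0], end=interval[1])

        # Stretch chunks whose last event ends slightly before the target length
        trimmed_stretched_file = stretch_midi_file_to_desired_length(file=trimmed_file, length=chunk_size_seconds)

        # The chunk size and chunk index are encoded in the filename and parsed back out later
        output_filename = os.path.join(output_dir, f'{file_name}_chunksize_{chunk_size_seconds}_iter_{i}_.mid')

        trimmed_stretched_file.write(output_filename)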

    return None
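
To make the chunking arithmetic concrete, here is a small standalone sketch that repeats the interval logic from `create_chunk_interval_times` and applies the same `[:-1]` slice used in `process_and_save_midi_splits`. The 95-second and 90-second durations are made-up examples, not values from the dataset.

```python
def create_chunk_interval_times(length, chunk_size):
    """Same logic as the helper above: [start, end] pairs covering 0..length."""
    chunks = []
    for i in range(0, length, chunk_size):
        chunks.append([i, min(i + chunk_size, length)])
    return chunks


# Hypothetical 95-second file: three full intervals plus a 5-second remainder
all_intervals = create_chunk_interval_times(95, 30)
kept_intervals = all_intervals[:-1]  # the slice applied before writing chunks
print(all_intervals)
print(kept_intervals)

# Hypothetical 90-second file: the last interval is a full chunk, yet it is still dropped
kept_exact = create_chunk_interval_times(90, 30)[:-1]
print(kept_exact)
```

One consequence worth knowing: when the integer duration is an exact multiple of the chunk size, the final full-length chunk is discarded along with the usual partial remainder.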


## Training Data

Create a dictionary (midi_dict_train) with full MIDI files and composer labels (training data)

In [3]:

midi_folder_train = "../data/raw/training"
composer_names = ['Bach','Beethoven','Brahms','Schubert']

midi_dict_train = {}

# Import files from each composer folder
# Create a dictionary of MIDI files, indexed by the new filename to write out
# The new filename contains the composer, which is parsed back out later

for composer in composer_names:
    midi_folder = os.path.join(midi_folder_train, composer)

    for filename in os.listdir(midi_folder):
        if filename.endswith('.mid'):

            # The first three underscore-separated parts are read as composition, catalog name, and id
            filename_split = filename.split('_')
            composition = filename_split[0]
            catalog_name = filename_split[1]
            track_id = filename_split[2]  # renamed to avoid shadowing the built-in id()
            # Join the remaining parts without underscores so the field count stays fixed
            rest = ''.join(filename_split[3:]).split('.')[0]

            filename_new = f"{composer}_{composition}_{track_id}_{catalog_name}_{rest}"

            midi_file = pretty_midi.PrettyMIDI(os.path.join(midi_folder, filename))

            midi_dict_train[filename_new] = {}
            midi_dict_train[filename_new]['MidiFile'] = midi_file
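
The renaming step in the loop above can be tried in isolation. The sketch below wraps the same string operations in a hypothetical helper, `build_output_name`. The example filename is an assumption, reconstructed from the column values shown later in this notebook; it is not a verified file from the dataset.

```python
def build_output_name(composer, filename):
    """Reorder the underscore-separated parts of a raw filename (illustrative helper)."""
    parts = filename.split('_')
    composition, catalog_name, track_id = parts[0], parts[1], parts[2]
    # Remaining parts are joined without underscores and the extension is removed
    rest = ''.join(parts[3:]).split('.')[0]
    return f"{composer}_{composition}_{track_id}_{catalog_name}_{rest}"


# Assumed raw filename layout: <composition>_<catalog>_<id>_<rest>.mid
example_name = build_output_name("Bach", "Cello Suite 3_BWV1009_2217_cs3-1pre.mid")
print(example_name)

# Extra underscores in the trailing part are collapsed (made-up filename)
collapsed_name = build_output_name("Bach", "Sonata_OP1_12_part_two.mid")
print(collapsed_name)
```

Because the trailing part never contains underscores after this step, every new name has exactly five underscore-separated fields, which is what the later positional parsing relies on.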




In [4]:
## params
output_dir_train = '../data/intermediate/midi_chunks_train/'
chunk_size_seconds = 30

### Split into chunks

In [5]:
# Iterate through the dictionary of files to generate sub-files, each a 30-second chunk
# Write these files out under the new file names

for out_file_name in midi_dict_train:

    midi_file = midi_dict_train[out_file_name]['MidiFile']

    process_and_save_midi_splits(midi_file = midi_file, 
                                    file_name = out_file_name,
                                    output_dir = output_dir_train,
                                    chunk_size_seconds = chunk_size_seconds )
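
Each chunk written by the loop above is trimmed and then stretched to the target length. As a minimal sketch of what the stretch does to event times, assuming `adjust_times` interpolates linearly between the two anchor points passed to it, the mapping reduces to a simple rescaling. The `rescale_time` helper and the 29.4-second example are hypothetical.

```python
def rescale_time(t, original_end, target_end):
    """Linearly map a time in [0, original_end] onto [0, target_end]."""
    return t * target_end / original_end


# Hypothetical chunk whose last event ends at 29.4 s, stretched to a 30 s target
original_end, target_end = 29.4, 30.0
for t in (0.0, 14.7, 29.4):
    print(t, "->", round(rescale_time(t, original_end, target_end), 3))
```

The closer a chunk's original end time is to the target, the smaller the tempo distortion introduced by the stretch.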

### Load into DataFrame

In [15]:
# load training chunks into a new dictionary & DataFrame

midi_dict_train = {}

for filename in os.listdir(output_dir_train):
    if filename.endswith('.mid'):

        midi_file = pretty_midi.PrettyMIDI(os.path.join(output_dir_train, filename))
        midi_dict_train[filename] = midi_file


df = pd.DataFrame()
df['track_path'] = midi_dict_train.keys()
df[['composer', 'track_name', 'id', 'catalog_name','path_other','chunk_size_seconds', 'iter']] = df['track_path'].str.split('_', expand=True)[[0,1,2,3,4,6,8]]
df['midi'] = [*midi_dict_train.values()]
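
The positional column selection above depends on the chunk filename layout. The following plain-Python sketch mirrors the same positions for a single filename; `parse_chunk_filename` is a hypothetical helper, and the example filename follows the pattern written by `process_and_save_midi_splits`.

```python
def parse_chunk_filename(filename):
    """Pick the same underscore-separated positions as the DataFrame code above."""
    parts = filename.split('_')
    keys = ['composer', 'track_name', 'id', 'catalog_name',
            'path_other', 'chunk_size_seconds', 'iter']
    positions = [0, 1, 2, 3, 4, 6, 8]  # positions 5 and 7 hold the literals 'chunksize' and 'iter'
    return {key: parts[pos] for key, pos in zip(keys, positions)}


parsed = parse_chunk_filename("Bach_Cello Suite 3_2217_BWV1009_cs3-1pre_chunksize_30_iter_0_.mid")
for key, value in parsed.items():
    print(f"{key}: {value!r}")
```

Note that every parsed value is a string, including `chunk_size_seconds` and `iter`; cast them if numeric sorting or arithmetic is needed downstream.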

In [16]:
# Create train / test split before saving the dataset.
# Keep all chunks of a composition in the same split to avoid data leakage

splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=42)

# split() yields positional indices; they match df's labels because df has a default RangeIndex
train_indices, test_indices = next(splitter.split(df, groups=df['track_name']))

df['split'] = 'train'
df.loc[test_indices, 'split'] = 'test'

df

| | track_path | composer | track_name | id | catalog_name | path_other | chunk_size_seconds | iter | midi | split |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Bach_Cello Suite 3_2217_BWV1009_cs3-1pre_chunk... | Bach | Cello Suite 3 | 2217 | BWV1009 | cs3-1pre | 30 | 0 | `<pretty_midi.pretty_midi.PrettyMIDI object at ...` | train |
| 1 | Bach_Cello Suite 3_2217_BWV1009_cs3-1pre_chunk... | Bach | Cello Suite 3 | 2217 | BWV1009 | cs3-1pre | 30 | 1 | `<pretty_midi.pretty_midi.PrettyMIDI object at ...` | train |
| 2 | Bach_Cello Suite 3_2217_BWV1009_cs3-1pre_chunk... | Bach | Cello Suite 3 | 2217 | BWV1009 | cs3-1pre | 30 | 2 | `<pretty_midi.pretty_midi.PrettyMIDI object at ...` | train |
| 3 | Bach_Cello Suite 3_2217_BWV1009_cs3-1pre_chunk... | Bach | Cello Suite 3 | 2217 | BWV1009 | cs3-1pre | 30 | 3 | `<pretty_midi.pretty_midi.PrettyMIDI object at ...` | train |
| 4 | Bach_Cello Suite 3_2217_BWV1009_cs3-1pre_chunk... | Bach | Cello Suite 3 | 2217 | BWV1009 | cs3-1pre | 30 | 4 | `<pretty_midi.pretty_midi.PrettyMIDI object at ...` | train |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2313 | Schubert_String Quintet in C major_1742_OP163_... | Schubert | String Quintet in C major | 1742 | OP163 | sb163m2 | 30 | 5 | `<pretty_midi.pretty_midi.PrettyMIDI object at ...` | train |
| 2314 | Schubert_String Quintet in C major_1742_OP163_... | Schubert | String Quintet in C major | 1742 | OP163 | sb163m2 | 30 | 6 | `<pretty_midi.pretty_midi.PrettyMIDI object at ...` | train |
| 2315 | Schubert_String Quintet in C major_1742_OP163_... | Schubert | String Quintet in C major | 1742 | OP163 | sb163m2 | 30 | 7 | `<pretty_midi.pretty_midi.PrettyMIDI object at ...` | train |
| 2316 | Schubert_String Quintet in C major_1742_OP163_... | Schubert | String Quintet in C major | 1742 | OP163 | sb163m2 | 30 | 8 | `<pretty_midi.pretty_midi.PrettyMIDI object at ...` | train |


In [19]:
# train/test split - sanity check
df['split'].value_counts()/df['split'].shape[0]

train    0.726057
test     0.273943
Name: split, dtype: float64
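
`GroupShuffleSplit` applies `test_size` to groups rather than rows, so the row-level proportions above need not equal 0.3 exactly. The proportions also do not confirm that the grouping worked. A minimal sketch of a direct leakage check follows; `groups_in_multiple_splits` is a hypothetical helper, and the toy lists stand in for the `track_name` and `split` columns.

```python
from collections import defaultdict


def groups_in_multiple_splits(groups, splits):
    """Return the sorted group names that appear under more than one split label."""
    seen = defaultdict(set)
    for group, split in zip(groups, splits):
        seen[group].add(split)
    return sorted(group for group, labels in seen.items() if len(labels) > 1)


# Toy data standing in for df['track_name'] and df['split']
track_names = ["Cello Suite 3", "Cello Suite 3",
               "String Quintet in C major", "String Quintet in C major"]
clean_splits = ["train", "train", "test", "test"]
leaky_splits = ["train", "test", "test", "test"]

clean_result = groups_in_multiple_splits(track_names, clean_splits)
leaky_result = groups_in_multiple_splits(track_names, leaky_splits)
print(clean_result)
print(leaky_result)
```

With the notebook's DataFrame, calling `groups_in_multiple_splits(df['track_name'], df['split'])` would be expected to return an empty list if no composition spans both splits.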

In [20]:
df.to_pickle('../data/processed/train_processed.pkl')

## Test/Inference/Holdout Data

In [21]:
# Load test data into dictionary
midi_folder_test = "../data/raw/holdout"

midi_dict_test = {}
for filename in os.listdir(midi_folder_test):
    if filename.endswith('.mid'):

        midi_file = pretty_midi.PrettyMIDI(os.path.join(midi_folder_test, filename))

        filename_new = os.path.splitext(filename)[0]  # drop the '.mid' extension
        midi_dict_test[filename_new] = {}
        midi_dict_test[filename_new]['MidiFile'] = midi_file

### Split into chunks

In [22]:
# Split test data from dictionary into chunks
output_dir_test = '../data/intermediate/midi_chunks_test/'
chunk_size_seconds = 30

for out_file_name in midi_dict_test:

    midi_file = midi_dict_test[out_file_name]['MidiFile']

    process_and_save_midi_splits(midi_file = midi_file, 
                                    file_name = out_file_name,
                                    output_dir = output_dir_test,
                                    chunk_size_seconds = chunk_size_seconds )

In [24]:
# load test chunks into a new dictionary & DataFrame

midi_dict_test = {}

for filename in os.listdir(output_dir_test):
    if filename.endswith('.mid'):

        midi_file = pretty_midi.PrettyMIDI(os.path.join(output_dir_test, filename))
        midi_dict_test[filename] = midi_file

### Load into DataFrame

In [26]:
df_test = pd.DataFrame()
df_test['track_path'] = midi_dict_test.keys()
# Rebuild the original filename from the first two underscore-separated parts
df_test['track_path_original'] = df_test['track_path'].str.split('_').apply(lambda lst: '_'.join(lst[0:2])+'.mid')
# Positions 0, 3, and 5 hold the leading number, the chunk size, and the chunk index
df_test[['prob','chunk_size_seconds','iter']] = df_test['track_path'].str.split('_', expand=True)[[0,3,5]]
df_test['prob'] = df_test['prob'].astype(float)
# Threshold the parsed value at 0.5 to get a binary label
df_test['pred_given'] = df_test['prob'].apply(lambda p: 1 if p > 0.5 else 0)
df_test['midi'] = list(midi_dict_test.values())
df_test

| | track_path | track_path_original | prob | chunk_size_seconds | iter | pred_given | midi |
|---|---|---|---|---|---|---|---|
| 0 | 0.002716920481628_adj_chunksize_30_iter_0_.mid | 0.002716920481628_adj.mid | 0.002717 | 30 | 0 | 0 | `<pretty_midi.pretty_midi.PrettyMIDI object at ...` |
| 1 | 0.002716920481628_adj_chunksize_30_iter_10_.mid | 0.002716920481628_adj.mid | 0.002717 | 30 | 10 | 0 | `<pretty_midi.pretty_midi.PrettyMIDI object at ...` |
| 2 | 0.002716920481628_adj_chunksize_30_iter_11_.mid | 0.002716920481628_adj.mid | 0.002717 | 30 | 11 | 0 | `<pretty_midi.pretty_midi.PrettyMIDI object at ...` |
| 3 | 0.002716920481628_adj_chunksize_30_iter_12_.mid | 0.002716920481628_adj.mid | 0.002717 | 30 | 12 | 0 | `<pretty_midi.pretty_midi.PrettyMIDI object at ...` |
| 4 | 0.002716920481628_adj_chunksize_30_iter_13_.mid | 0.002716920481628_adj.mid | 0.002717 | 30 | 13 | 0 | `<pretty_midi.pretty_midi.PrettyMIDI object at ...` |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 300 | 0.981087291054314_adj_chunksize_30_iter_5_.mid | 0.981087291054314_adj.mid | 0.981087 | 30 | 5 | 1 | `<pretty_midi.pretty_midi.PrettyMIDI object at ...` |
| 301 | 0.981087291054314_adj_chunksize_30_iter_6_.mid | 0.981087291054314_adj.mid | 0.981087 | 30 | 6 | 1 | `<pretty_midi.pretty_midi.PrettyMIDI object at ...` |
| 302 | 0.981087291054314_adj_chunksize_30_iter_7_.mid | 0.981087291054314_adj.mid | 0.981087 | 30 | 7 | 1 | `<pretty_midi.pretty_midi.PrettyMIDI object at ...` |
| 303 | 0.981087291054314_adj_chunksize_30_iter_8_.mid | 0.981087291054314_adj.mid | 0.981087 | 30 | 8 | 1 | `<pretty_midi.pretty_midi.PrettyMIDI object at ...` |
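
The holdout columns above are all derived from the chunk filename. Below is a plain-Python sketch of the same derivation for one filename, using a hypothetical helper `parse_test_chunk_filename`; the 0.5 threshold matches the one applied to `prob` in the DataFrame code.

```python
def parse_test_chunk_filename(filename, threshold=0.5):
    """Derive the holdout columns from a single chunk filename (illustrative helper)."""
    parts = filename.split('_')
    prob = float(parts[0])  # leading number in the filename
    return {
        'track_path_original': '_'.join(parts[0:2]) + '.mid',
        'prob': prob,
        'chunk_size_seconds': parts[3],
        'iter': parts[5],
        'pred_given': 1 if prob > threshold else 0,
    }


high_row = parse_test_chunk_filename("0.981087291054314_adj_chunksize_30_iter_5_.mid")
low_row = parse_test_chunk_filename("0.002716920481628_adj_chunksize_30_iter_0_.mid")
print(high_row)
print(low_row)
```

This relies on the holdout filenames containing exactly one underscore before the chunk suffix; a filename with a different number of parts would shift the positions.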


In [27]:
df_test.to_pickle('../data/processed/test_processed.pkl')