# Batch Processing and Classification Standardization
This Jupyter Notebook is the product of Thomas Hymel. The contents pertain to one part of an Automatic Drum Transcription (ADT) project that I am working on to improve my data science and machine learning skills. This notebook focuses first on the batch processing needed to load the entire training data set into memory, and then on standardizing the labels used for classification. Although the *drum piece* labels have already been standardized into a master format by previous code, the class labels themselves have not been. The number of classes to aim for could change, so I kept as much information as possible before making that decision, which happens in this notebook.




In [1]:
# runs the previous Jupyter Notebook so that I can access the functions in it in this Notebook
%run ./Aligning-Drum-Tabs-with-Music-Notebook.ipynb

Artist: 
Song: May3Beta
Drummer: Thomas Hymel
Tabbed by: Epoch0
BPM: 135

Cymbals:
G = China
C = thin, regular crash
C2 = thin crash, higher sounding than C
C3 = thin crash, different than 2, higher than C
HH = hihat
R = ride
|-x-|: Hit (normal)
|-X-|: Accented strike; If on hihat, loose/washy hihat
|-#-|: strike then choke by grabbing cymbal

Drums:
B = bass
S = snare
sT = small tom
mT = medium tom
FT = floor tom
|-O-|: accented strike
|-g-|: ghost hit
|-f-|: flam
|-D-|: double hit; on bass drum, two hits with slightly different timing

Intro: (0:00)
S |----r-------r---|----r-------r---|----r-------r---|----r-------r---|
B |o---------------|----------------|o---------------|----------------|
  |1e+a2e+a3e+a4e+a|1e+a2e+a3e+a4e+a|1e+a2e+a3e+a4e+a|1e+a2e+a3e+a4e+a|

S |----r-------r---|----r-------r---|----r-------r---|----r---r-------|
sT|----------------|----------------|----------------|------------oo--|
FT|----------------|----------------|----------------|----------oo--oo|
B |o-----

     line1 line2 line3 line4 line5 line6 line7 line8 line9 line10 line11
0              C     H     M     L     C     C     H     S      B       
1              3     T     T     T     2     C     H     D      D       
2              |     |     |     |     |     |     |     |      |      |
3        I     -     -     -     -     -     -     -     -      o      1
4        n     -     -     -     -     -     -     -     -      -      e
...    ...   ...   ...   ...   ...   ...   ...   ...   ...    ...    ...
2242           -     -     -     -     -     -     -     -      -      4
2243           -     -     -     -     -     -     -     -      -      e
2244           -     -     -     -     -     -     -     -      -      +
2245           -     -     -     -     -     -     -     -      -      a
2246           |     |     |     |     |     |     |     |      |      |

[2247 rows x 11 columns]
     tk BD SD HH CC C2 LT MT HT C3 note garbage
0     |  |  |  |  |  |  |  |  |  |             
1 



The transposed shape of the AudioSegment-derived lr_samples is (2, 10584000)
The default shape of the librosa-derived lb_test_song is (2, 10584000)
The transposed shape of the lr_samples is (2, 10584000)
Dictionary used to describe the information from the song: {'title': 'may3beta', 'format': 'mp3', 'width': '16bit', 'channels': 'stereo', 'frame rate': 44100, 'duration_seconds': 240.0, 'duration_minutes': 4.0}
The number of samples in may3beta.mp3 is 2
10584000
All the following prints are from combine_tab_and_song function
first drum note row = 0
# of song slices post fdn = 2144
# of song slices pre fdn = 15
Produced number of song slices = 2159
Expected number of song slices (should be same for non-triplet songs) = 2160.0
tab length = 2112     datatype: <class 'int'>
len(song_slices_tab_indexed) = 2112     datatype of object: <class 'list'>
song_slices_tab_indexed[0].shape = (4900, 2)     datatype of [0]: <class 'numpy.ndarray'>
np.array(song_slices_tab_indexed).shape = (2112, 4900,

Sampling a SD event for alignment check... Loading tab and audio slice
   1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453  \
C3    -    -    -    -    -    -    -    -    -    -    -    -    -    -    -   
HT    -    -    -    -    -    -    -    -    -    -    -    -    -    -    -   
MT    -    -    -    -    -    -    -    -    -    -    -    -    -    -    -   
LT    -    -    -    -    -    -    -    -    -    -    -    -    -    -    -   
C2    -    X    -    -    -    -    -    -    -    -    -    -    -    -    -   
CC    -    -    -    -    -    -    -    -    -    -    -    -    -    -    -   
HH    -    -    -    -    -    -    -    -    -    -    -    -    -    -    -   
SD    o    -    -    -    -    r    -    -    -    -    -    -    -    r    -   
BD    -    o    -    -    -    -    -    -    -    -    -    -    -    -    -   
tk    a    1    e    +    a    2    e    +    a    3    e    +    a    4    e   

   1454  
C3    -  
HT    -  
MT    -

Sampling a BD event for alignment check... Loading tab and audio slice
   1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518  \
C3    -    -    -    -    -    -    -    -    -    -    -    -    -    -    -   
HT    -    -    -    -    -    -    -    -    -    -    -    -    -    -    -   
MT    -    -    -    -    -    -    -    -    -    -    -    -    -    -    -   
LT    -    -    -    -    -    -    -    -    -    -    -    -    -    -    -   
C2    -    -    -    -    -    -    -    -    -    -    -    -    -    -    -   
CC    -    -    -    -    -    -    -    -    -    -    -    -    -    -    -   
HH    -    -    -    -    -    -    -    -    -    -    -    -    -    -    -   
SD    -    -    -    -    -    -    -    -    o    -    -    -    -    -    -   
BD    o    -    -    o    -    -    -    -    -    -    o    -    -    -    o   
tk    1    e    +    a    2    e    +    a    3    e    +    a    4    e    +   

   1519  
C3    -  
HT    -  
MT    -

Creating may3beta.mp3 machine-friendly tab
Line 1 has 557 chars
Line 2 has 2247 chars
Line 3 has 2247 chars
Line 4 has 2247 chars
Line 5 has 2247 chars
Line 6 has 2247 chars
Line 7 has 2247 chars
Line 8 has 2247 chars
Line 9 has 2247 chars
Line 10 has 2247 chars
Line 11 has 2247 chars
Line 12 has 2247 chars
All the following prints are from combine_tab_and_song function
first drum note row = 0
# of song slices post fdn = 2144
# of song slices pre fdn = 15
Produced number of song slices = 2159
Expected number of song slices (should be same for non-triplet songs) = 2160.0
tab length = 2112     datatype: <class 'int'>
len(song_slices_tab_indexed) = 2112     datatype of object: <class 'list'>
song_slices_tab_indexed[0].shape = (4900, 2)     datatype of [0]: <class 'numpy.ndarray'>
np.array(song_slices_tab_indexed).shape = (2112, 4900, 2)


## Batch Processing
In order to create a training set, a large amount of data must be loaded into a common format. To facilitate this, the functions here should scale to as much data as we may have in the future (i.e., much more than 25 songs).

I had some data files written in .py and wanted to convert them to JSON, so the function below helps with converting those .py files into properly formatted JSON files.
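
As a rough illustration of the folder layout the batch functions rely on (one folder per song, holding a `.json` file named after the folder), here is a minimal sketch. The helper names `write_song_json` and `find_song_folders`, the `.flac` example, and the contents of the two dictionaries are placeholders invented for this example, not the project's real functions or data. It uses `os.path.splitext` so the output name does not depend on the length of the audio file's extension.

```python
import json
import os
import tempfile


def write_song_json(song_file, tab_file, tab_char_labels, alignment_info, directory_path):
    """Write one song's metadata to <song name>.json and return the path written."""
    payload = {
        "song": song_file,
        "tab_file": tab_file,
        "tab_char_labels": tab_char_labels,
        "alignment_info": alignment_info,
    }
    stem, _ext = os.path.splitext(song_file)   # handles .mp3, .wav, .flac, ...
    out_path = os.path.join(directory_path, stem + ".json")
    with open(out_path, "w") as outfile:
        json.dump(payload, outfile, indent=4)
    return out_path


def find_song_folders(songs_root):
    """Return the names of the immediate subfolders of songs_root, sorted."""
    with os.scandir(songs_root) as entries:
        return sorted(entry.name for entry in entries if entry.is_dir())


# Build a throwaway "Songs" directory with one song folder, then read it back.
with tempfile.TemporaryDirectory() as songs_root:
    song_dir = os.path.join(songs_root, "example_song")
    os.mkdir(song_dir)
    json_path = write_song_json(
        "example_song.flac",
        "example_song.txt",
        {"B": "BD", "S": "SD"},   # placeholder label mapping (hypothetical)
        {"BPM": 120},             # placeholder alignment info (hypothetical)
        song_dir,
    )
    found = find_song_folders(songs_root)
    with open(os.path.join(songs_root, found[0], found[0] + ".json")) as json_file:
        loaded = json.load(json_file)
```

Building paths with `os.path.join` also keeps the sketch independent of the Windows-style `\\` separators used in the cells below.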

In [2]:
import sys
import os
import json

# establish the file paths for where the music, python files, and tab files are found for batch aligning
batch_align_path = "C:\\Users\\Thomas\\Python Projects\\Drum-Tabber-Support-Data\\"
song_path = "Songs\\"
python_files_path = "Alignment-Python-Files\\"

In [3]:
def create_and_write_JSON_file(song_file, tab_file, tab_char_labels, alignment_info, directory_path):
    """
    Creates a JSON file from the given information, in the correct format needed for later batch processing of the information
    
    Args:
        song_file [str]: string that is the song file (example_song.mp3, wav, etc)
        tab_file [str]: string that is the tab file name (example_song.txt)
        tab_char_labels [dict]: dictionary that describes the tab character labels for the example's tab
        alignment_info [dict]: dictionary that describes the alignment info needed for processing
        directory_path [str]: the path to which you want to save the output .json file
    
    Returns:
        str: the JSON-formatted string of the dictionary that was written to the file
    """
    
    json_dict = {}   # build up this python dictionary to be used as input for json.dumps function
    
    json_dict['song'] = song_file                  # creates a key/value of song to the song_file
    json_dict['tab_file'] = tab_file               # creates a key/value of the tab_file to the tab file
    json_dict['tab_char_labels'] = tab_char_labels # creates the tab_char_labels entry
    json_dict['alignment_info'] = alignment_info   # creates the alignment_info entry
    
    with open(directory_path + os.path.splitext(song_file)[0] + ".json", 'w') as outfile:   # strips the extension regardless of its length
        json.dump(json_dict, outfile, indent=4)
    
    return json.dumps(json_dict)

In [4]:
def batch_align(songs_file_path):
    """
    Runs the music aligned tab function multiple times for an entire batch of songs (using JSON files with information in it, song files, and text tab files)
    
    Args:
        songs_file_path [str]: string that points to the directory that contains one folder per song
        
    Returns:
        dict: Python dictionary that has keys of the name of the folders that have been processed (song titles)
              and values that are the music-aligned tab dataframe objects that contain that song, produced by the highest-level music-aligning function
    """

    MATDF_dict = {}   # dictionary where the keys are the string of the name of the song, and the values are the music-aligned-tab dataframe object
    
    subdirs = [os.path.join(songs_file_path, o) for o in os.listdir(songs_file_path) if os.path.isdir(os.path.join(songs_file_path,o))] # grab all the subdirectories in the songs_
    list_of_songs = [x.replace(songs_file_path,"") for x in subdirs]  # grabbing only the end of the song folders (that is, the song title)
    
    for song in list_of_songs:                          # go through all the songs folders 
        song_dir = songs_file_path + song + "\\"
        with open(song_dir + song + '.json') as json_file:  # open the .json file inside each song folder
            data = json.load(json_file)                     # load the .json file into a python dictionary
            MATDF_dict[song] = Make_Music_Aligned_Drum_Tab(song_dir + data["song"], song_dir + data["tab_file"], data["tab_char_labels"], data["alignment_info"])
    
    return MATDF_dict

In [5]:
song_directory = batch_align_path + song_path

MATDF_test = batch_align(song_directory)

Creating C:\Users\Thomas\Python Projects\Drum-Tabber-Support-Data\Songs\ancient_tombs\ancient_tombs.mp3 machine-friendly tab
Line 1 has 554 chars
Line 2 has 3066 chars
Line 3 has 3066 chars
Line 4 has 3066 chars
Line 5 has 3066 chars
Line 6 has 3066 chars
Line 7 has 3066 chars
Line 8 has 3066 chars
Line 9 has 3066 chars
Line 10 has 3066 chars
Line 11 has 3066 chars
Line 12 has 3066 chars
Line 13 has 3066 chars
Line 14 has 3066 chars
All the following prints are from combine_tab_and_song function
first drum note row = 0
# of song slices post fdn = 3062
# of song slices pre fdn = 3
Produced number of song slices = 3065
Expected number of song slices (should be same for non-triplet songs) = 3068.207528344671
tab length = 2882     datatype: <class 'int'>
len(song_slices_tab_indexed) = 2882     datatype of object: <class 'list'>
song_slices_tab_indexed[0].shape = (3392, 2)     datatype of [0]: <class 'numpy.ndarray'>
np.array(song_slices_tab_indexed).shape = (2882,)
Creating C:\Users\Thomas

All the following prints are from combine_tab_and_song function
first drum note row = 0
# of song slices post fdn = 2761
# of song slices pre fdn = 1
Produced number of song slices = 2762
Expected number of song slices (should be same for non-triplet songs) = 2770.494693877551
tab length = 2658     datatype: <class 'int'>
len(song_slices_tab_indexed) = 2658     datatype of object: <class 'list'>
song_slices_tab_indexed[0].shape = (3481, 2)     datatype of [0]: <class 'numpy.ndarray'>
np.array(song_slices_tab_indexed).shape = (2658,)
Creating C:\Users\Thomas\Python Projects\Drum-Tabber-Support-Data\Songs\gunpowder\gunpowder.mp3 machine-friendly tab
Line 1 has 337 chars
Line 2 has 3350 chars
Line 3 has 3350 chars
Line 4 has 3350 chars
Line 5 has 3350 chars
Line 6 has 3350 chars
Line 7 has 3350 chars
Line 8 has 3350 chars
Line 9 has 3350 chars
Line 10 has 3350 chars
Line 11 has 3350 chars
Line 12 has 3350 chars
All the following prints are from combine_tab_and_song function
first drum not

All the following prints are from combine_tab_and_song function
first drum note row = 0
# of song slices post fdn = 1238
# of song slices pre fdn = 0
Produced number of song slices = 1238
Expected number of song slices (should be same for non-triplet songs) = 1237.104
tab length = 1122     datatype: <class 'int'>
len(song_slices_tab_indexed) = 1122     datatype of object: <class 'list'>
song_slices_tab_indexed[0].shape = (8166, 2)     datatype of [0]: <class 'numpy.ndarray'>
np.array(song_slices_tab_indexed).shape = (1122,)
Creating C:\Users\Thomas\Python Projects\Drum-Tabber-Support-Data\Songs\surprise_surprise\surprise_surprise.mp3 machine-friendly tab
Line 1 has 371 chars
Line 2 has 1992 chars
Line 3 has 1992 chars
Line 4 has 1992 chars
Line 5 has 1992 chars
Line 6 has 1992 chars
Line 7 has 1992 chars
Line 8 has 1992 chars
Line 9 has 1992 chars
Line 10 has 1992 chars
Line 11 has 1992 chars
All the following prints are from combine_tab_and_song function
first drum note row = 0
# of s

## Classification Standardization
#### Creating the FullSet Dataframe
At this point we have to process the *labels* themselves and decide what the standard classifications should be. This should be done on the *entire* data set at once, because then we can identify, in aggregate, all the labels that are actually *used* and map them onto a standardized list. Currently, my plan is to hold the music-aligned tab dataframes inside a dictionary whose keys are the song names (the song folder names) and whose values are the music-aligned tab dataframe objects. So let's start writing functions to deal with this classification standardization.
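
To make the stacking idea concrete, here is a minimal sketch on two toy frames. The column contents and the song names `song_a` and `song_b` are made up for illustration, and `blank_char` is assumed to be `'-'`. Passing a dictionary to `pd.concat` uses its keys as an extra outer index level, and columns missing from one song come back as NaN until they are filled.

```python
import pandas as pd

blank_char = "-"   # assumed blank character for this toy example

# Two toy "songs" that do not share all of their columns
song_a = pd.DataFrame({"tk": ["1", "e"], "BD": ["o", "-"], "SD": ["-", "o"]})
song_b = pd.DataFrame({"tk": ["1", "e", "+"], "BD": ["o", "-", "-"], "HH": ["x", "-", "x"]})

# Dict keys become the outer level of a two-level row index
full = pd.concat({"song_a": song_a, "song_b": song_b}, axis=0, join="outer", sort=False)

# Columns absent from a song are NaN after the outer join; fill them with the blank
full = full.fillna(blank_char)

# Per-column label frequencies, the same summary the function below prints
counts = {col: full[col].value_counts() for col in full.columns}
```

The outer index level makes it possible to pull a single song back out of the stacked frame later, for example with `full.loc["song_b"]`.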

In [6]:
def create_FullSetMAT(MAT_dict):
    """
    Function to create, describe and present information to the coder about the labels of the entire data set
    
    Args:
        MAT_dict [dict]: the dictionary whose keys are the names of the processed songs and whose values are 
                         the corresponding music aligned tab dataframes
                         
    Returns:
        Dataframe: the FullSet dataframe of all songs' dataframes stacked on top of each other (and outputs information to the user display)
    """
    
    # get blank char to use in this function
    tk_label, measure_char, blank_char = get_special_chars()
    
    # stacks all the dataframes one on top of another in the rows, ignoring indices of each dataframe, and giving each frame an extra "key" layer
    print("...Concatenating all music-aligned dataframes")
    output_df = pd.concat(MAT_dict, axis=0, ignore_index = False, join = 'outer', sort = False)
    
    print("...Replacing NaNs with " + blank_char + " for output")
    output_df = output_df.fillna(blank_char)               # replace NaN with the blank_char
    
    print("...Dropping the song slices for ease of display \n")
    full_df = output_df.drop(columns = ['song slice'])    # drop the song slice info because we don't care about them right now
    
    print("---fullset.describe() without blank_chars---")
    print(full_df[full_df != blank_char].describe())
    print()
    
    print("Unique values and frequencies in column __:")
    for col in full_df:
        naf_series = full_df[col].value_counts()
        print(str(naf_series.to_frame().T))
        print()
    
    return output_df

In [7]:
# testing the create_FullSetMAT function
FullSet_df = create_FullSetMAT(MATDF_test)


...Concatenating all music-aligned dataframes
...Replacing NaNs with - for output
...Dropping the song slices for ease of display 

---fullset.describe() without blank_chars---
           tk     BD    SD    HH    RD    CC    C2   LT   MT   HT  CH   C3  SC
count   49917  11829  7624  5388  1718  3279  2334  994  391  298  97  302  26
unique      9      3     8     9     5     4     4    5    4    4   3    3   2
top         +      o     o     x     x     X     X    o    o    o   X    X   X
freq    12461  11356  6527  2905  1090  2016  1891  888  315  249  81  264  20

Unique values and frequencies in column __:
        +      e      a     1     2     3     4   t   s
tk  12461  12459  12446  3206  3202  3199  2867  47  30

        -      o    O   d
BD  38088  11356  430  43

        -     o    g    O    f   d   r   x  0
SD  42293  6527  527  271  136  75  73  14  1

        -     x     X    o   O   s   g  S  d  w
HH  44529  2905  2186  195  35  32  26  5  2  2

        -     x    X    b  

#### Cleaning up Labels and Collapsing Classes

To clean up the data summarized above and to collapse the labels into fewer classes, we want functions that replace values in the full, all-songs dataframe. We will make extensive use of the pandas function **replace**, because that is exactly what we are attempting to do. Note that most of these decisions are arbitrary but have a direct impact on how the model performs and on the predictive capabilities it will ultimately have. With too many classes (especially several within *one* drum instrument), the model will struggle to predict accurately because it must differentiate finer and finer features of the original data. I want my code to be flexible, though, in *choosing* which classes to take from the full data set. That choice occurs right here, in the implementation and execution of the functions that modify the FullSet dataframe.

I will split this into two functions, each likely to be large because it handles many different cases. The first cleans the labels, mainly replacing nonstandard notations with the more standard labels that already exist in the rest of the set. The second collapses the classes; it involves more user input but has defaults that make the output classes as simple as possible.
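
The column-specific replacement pattern can be tried on a toy frame first. This is a minimal sketch with made-up labels, assuming `blank_char` is `'-'`. A nested dictionary of the form `{column: {old_value: new_value}}` passed to `DataFrame.replace` only touches the named columns, so the same character can mean different things on different drum lines.

```python
import pandas as pd

blank_char = "-"   # assumed blank character for this toy example

df = pd.DataFrame({
    "SD": ["o", "r", "x", "f", "-"],
    "HH": ["x", "f", "w", "-", "X"],
})

# {column_name: {value_to_replace: replacement}}
replace_dict = {
    "SD": {"r": blank_char, "x": "o", "f": "o"},   # 'x' is rewritten on the snare line only
    "HH": {"f": blank_char, "w": "X"},             # 'f' means something else on the hihat line
}

cleaned = df.replace(to_replace=replace_dict)   # returns a new frame; df is unchanged
```

Here `'x'` in the `SD` column becomes `'o'`, while `'x'` in the `HH` column is left alone, which is the behavior the cleaning function below depends on.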

In [8]:
def clean_labels(FullSet_df):
    """
    Cleans the labels by replacing errors, common mistakes, or different notations with labels that already exist
    
    Args:
        FullSet_df [Dataframe]: the entire set of music aligned tabs in one dataframe
        
    Returns:
        Dataframe: the FullSet dataframe but with the labels cleaned up according to the code in here
    """
    
    master_format_dict = get_master_format_chars() # grabs the dict of the master format to be used here
    tk_label, measure_char, blank_char = get_special_chars()  # get blank_char mainly
    
    replace_dict = {}   # build up this dict to use as the replace in a later df.replace function
    for drum_chars in master_format_dict.values():
        replace_dict[drum_chars] = {}    # create an empty dict object for each drum char in master format dict so I can later use .update method always
    
    # get useful, specific subsets of the column names (2 chars) used in the FullSet dataframe, as dictated by the master_format_dict, ensuring that they are in the FullSet_df column names
    cymbals = [master_format_dict[x] for x in master_format_dict.keys() if 'cymbal' in x and master_format_dict[x] in FullSet_df.columns]  # does NOT include hihat
    hihat = master_format_dict['hi-hat']
    snare = master_format_dict['snare drum']
    ride = master_format_dict['ride cymbal']
    drums = [master_format_dict[x] for x in master_format_dict.keys() if ('drum' in x or 'tom' in x) and master_format_dict[x] in FullSet_df.columns]   # includes both drums and toms
    
    """CLEAN UP 1: get rid of the "grab cymbal to stop sustain" notation in cymbal line tabs for all cymbals"""
    for cymbal in cymbals:
        replace_dict[cymbal].update({'#':blank_char})  # Constructing a dict where {column_name : {thing_to_be_replaced: value_replacing} }
    
    """CLEAN UP 2: get rid of the 'f', 's', and 'S' on the 'HH' column (usually denotes foot stomp on hihat pedal)"""
    replace_dict[hihat].update({'f':blank_char, 's':blank_char, 'S' : blank_char})
    
    """CLEAN UP 3: replace the washy 'w' and 'W' with the normal washy hi-hat notation 'X'  (overall inconsistent notation but consistent enough to map properly)"""
    replace_dict[hihat].update({'w': 'X', 'W':'X'})
    
    """CLEAN UP 4: get rid of 'r' on the 'SD' column (rimshots on the snare drum), change 'x' to 'o' (sometimes used in drum solos for easier reading), and change the digit '0' typed in place of 'O' to 'O'"""
    replace_dict[snare].update({'r' : blank_char, 'x' : 'o', '0' : 'O'})
    
    """CLEAN UP 5: get rid of doubles notation ('d') and flams ('f'), and replace them with equivalent single hits"""
    for drum in drums:
        replace_dict[drum].update({'d' : 'o', 'D' : 'O', 'f' : 'o'})
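    replace_dict[hihat].update({'d' : 'x', 'f' : 'x'})   # NOTE: overwrites the 'f' -> blank_char entry set for the hihat in CLEAN UP 2, so hihat 'f' becomes 'x'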
    replace_dict[ride].update({'d' : 'x', 'f' : 'x'})
        
    """CLEAN UP 6: On hihat line, O and o are going to sound ~the same regardless of actual dynamic strength of hit."""
    replace_dict[hihat].update({'O': 'o'})
    
    """CLEAN UP 7: Replace em dashes used in place of the blank char (here, a hyphen '-')"""
    for col in master_format_dict.values():
        replace_dict[col].update({'—' : blank_char})
    
    FullSet_df = FullSet_df.replace(to_replace = replace_dict, value = None) # do the replacing using the replace_dict
    
    return FullSet_df
    

In [9]:
# testing the clean_labels function
FullSet_testclean = clean_labels(FullSet_df)

print("Post cleaning, Unique values and frequencies in column __:")
for col in FullSet_testclean.drop(columns = ['song slice']):
        naf_series = FullSet_testclean[col].value_counts()
        print(str(naf_series.to_frame().T))
        print()

Post cleaning, Unique values and frequencies in column __:
        +      e      a     1     2     3     4   t   s
tk  12461  12459  12446  3206  3202  3199  2867  47  30

        -      o    O
BD  38088  11399  430

        -     o    g    O
SD  42366  6752  527  272

        -     x     X    o   g
HH  44566  2907  2188  230  26

        -     x    X    b   g
RD  48199  1104  476  113  25

        -     X     x  b
CC  46642  2016  1252  7

        -     X    x  b
C2  47586  1891  439  1

        -    o   O
LT  48923  916  78

        -    o   O
MT  49526  334  57

        -    o   O
HT  49619  262  36

        -   X   x  g
CH  49820  81  14  2

        -    X   x
C3  49627  264  26

        -   X  b
SC  49891  20  6



In [10]:
def collapse_class(FullSet_df, keep_dynamics = False, keep_bells = False, keep_toms_separate = False, hihat_classes = 1, cymbal_classes = 1):
    """
    Collapses the class labels in the FullSet dataframe to the desired number of classes for output labels in Y.
    Note that all of the collapsing choices exist inside this function. There is no other place or prompt that
    allows the classes to be customized further. The class decisions are made here, HARD CODED into the function.
    Note that derived classes will be entirely lower case in the column names, whereas normal classes will be entirely upper case
    
    Args:
        FullSet_df [Dataframe]: the entire set of music aligned tabs in one dataframe, cleaned up at this point
        keep_dynamics [bool]: Default False. If False, collapses the dynamics labels into one single label (normally, capital vs. lower case). If True, don't collapse, effectively keeping dynamics as classes
        keep_bells [bool]: Default False. If False, changes the bells into blank_char, effectively getting rid of them and ignoring their characteristic spectral features. 
                           If True, still changes them into blank_char, but create a new column in the dataframe called 'be' that places them in there
        keep_toms_separate [bool]: Default False. If False, collapses the toms into one single tom class. If True, keep the toms labels separate and have multiple tom class
        hihat_classes [int]: Default 1. Hihats have two, or arguably three, distinct classes. One class is the completely closed hihat hit that is a "tink" sound. 
                            A second very common way to play hihat is called "washy" where the two hihats are slightly open and can interact with each other after being hit
                            A third class is the completely open hihat, where the top hihat doesn't interact with the bottom at all. This is similar to a cymbal hit
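                            If == 1, only closed hits ('x') stay in the hihat column; washy and open hits are moved into the crash cymbal column
                            If == 2, closed hits ('x') are one class and washy and open hits are combined into a second class ('X')
                            If == 3, closed ('x'), washy ('X'), and open ('o') hits are kept as three separate classes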
        cymbal_classes [int]: Default 1. Cymbals come in many sizes, tones, and flavors. The most reasonable thing to do is to collapse all cymbals into one class
                              The Ride cymbal is the exception, since it is normally not "crashed" but hit like the hihat
                              If == 2, Ride will be split out of the rest of the crash cymbals
                              If == -1, keep all cymbal classes intact (generally for debugging)
    Returns:
        Dataframe: the FullSet dataframe but with classes collapsed, which most likely means that certain columns will be gone and new columns will be present 
    """
    
    master_format_dict = get_master_format_chars() # grabs the dict of the master format to be used here
    tk_label, measure_char, blank_char = get_special_chars()  # get blank_char mainly
    
    # get useful, specific subsets of the column names (2 chars) used in the FullSet dataframe, as dictated by the master_format_dict, ensuring that they are in the FullSet_df column names
    drums = [master_format_dict[x] for x in master_format_dict.keys() if ('drum' in x or 'tom' in x)  and master_format_dict[x] in FullSet_df.columns]  # drums AND toms in this list
    cymbals = [master_format_dict[x] for x in master_format_dict.keys() if 'cymbal' in x and master_format_dict[x] in FullSet_df.columns]         # Notably EXCLUDING hi-hat 
    toms = [master_format_dict[x] for x in master_format_dict.keys() if 'tom' in x and master_format_dict[x] in FullSet_df.columns]                 # toms ONLY
    hihat = master_format_dict['hi-hat']            # get the label for the hi-hat column from master_format_dict
        
    """HIHAT - determine the number of classes desired in the hi-hat line. CRITICAL that this occurs before CYMBALS"""
    FullSet_df = FullSet_df.replace(to_replace = {hihat: {'g': blank_char}}, value = None)  # gets rid of ghost notes on the hihat no matter how many classes are chosen
    if hihat_classes == 2 or hihat_classes == 1:    # with only 1 or 2 classes, the washy ('X') and open ('o') hits are combined into one ('X') on the same line
        FullSet_df = FullSet_df.replace(to_replace = {hihat: {'o': 'X'}}, value = None) # replaces all 'o' with 'X' in the hihat column
        if hihat_classes == 1:                      # with only one class, need to keep the closed hi-hat ('x') on its own column, and then move the open 'o' and washy 'X' into the Crash Cymbal ('CC') column
            FullSet_df.loc[FullSet_df[hihat] == 'X', master_format_dict['crash cymbal']] = FullSet_df.loc[FullSet_df[hihat] == 'X', hihat]  # sets the values in the CC column, in the rows where the hihat == 'X', to the values that are in the hihat column of those rows
            FullSet_df = FullSet_df.replace(to_replace = {hihat: {'X': blank_char}}, value = None) # rids the hihat column of the 'X's that have been moved to the CC column
    if hihat_classes == 3:
        pass            # keep the expected notations of 'x' for closed, 'X' for washy, and 'o' for completely open
    
    """DYNAMICS - Making everything lower case that needs to be, and get rid of ghost notes; doesn't touch the hihat"""
    if not keep_dynamics:   # in the case where the dynamics are NOT kept. That is, this code should collapse the dynamics
        replace_dict = {}   # build up this dict to use as the replace in a later df.replace function
        for element in drums + cymbals:
            replace_dict[element] = {'X':'x', 'O':'o', 'g':blank_char} # prepare to search for X to replace with x, and O to replace with o whenever applicable
        FullSet_df = FullSet_df.replace(to_replace = replace_dict, value = None) # do the replacing using the replace_dict
        
    """BELLS - get rid of bell hits entirely or move them into a new column"""
    if not keep_bells:          # in the case where bell hits are thrown away
        FullSet_df = FullSet_df.replace(to_replace = 'b', value = blank_char) # NOTE: replaces 'b' ANYWHERE in the dataframe labels with the blank_char
    else:                       # in the case where bell hits are moved to a new column and replaced with blank_char after that
        FullSet_df['be'] = blank_char  # new bell column is titled 'be' for 'bell' and is initially all blank_char
        replace_dict = {}
        for cymbal in cymbals:
            FullSet_df.loc[FullSet_df[cymbal] == 'b','be'] = FullSet_df.loc[FullSet_df[cymbal] == 'b', cymbal]
            replace_dict[cymbal] = {'b':blank_char}
        FullSet_df = FullSet_df.replace(to_replace = replace_dict, value = None) # do the replacing using the replace_dict
        
    """TOMS - keep toms as their own classes or collapse into one"""
    if not keep_toms_separate:         # in the case where toms are collapsed into one class
        FullSet_df['at'] = blank_char    # The new column is titled 'at' for 'all toms' as it represents the labels of all the toms at once, initially set to the blank_char for all rows
        for tom in toms:
            FullSet_df.loc[FullSet_df[tom] != blank_char,'at'] = FullSet_df.loc[FullSet_df[tom] != blank_char, tom]  #  finds all the rows where a tom event occurs, and the value of those rows in the tom event
        FullSet_df = FullSet_df.drop(columns = toms)  # drop the original toms columns
        
    """CYMBALS - determine the number of cymbal classes"""
    if cymbal_classes == 1:   # the case where we collapse all the cymbal classes down to one class
        FullSet_df['ac'] = blank_char    # new column is titled 'ac' for 'all cymbals' as it represents the labels of all the cymbals at once
        for cymbal in cymbals:
            FullSet_df.loc[FullSet_df[cymbal] != blank_char, 'ac'] = FullSet_df.loc[FullSet_df[cymbal] != blank_char, cymbal]
        FullSet_df = FullSet_df.drop(columns = cymbals)
    if cymbal_classes == 2: # the case where we collapse all the cymbal classes except the ride cymbal down to one class
        FullSet_df['mc'] = blank_char   # new column is titled 'mc' for 'most cymbals' as it represents most cymbals
        most_cymbals = [x for x in cymbals if x != master_format_dict['ride cymbal']]   # grabbing all the cymbals not the ride cymbal
        for cymbal in most_cymbals:
            FullSet_df.loc[FullSet_df[cymbal] != blank_char, 'mc'] = FullSet_df.loc[FullSet_df[cymbal] != blank_char, cymbal]
        FullSet_df = FullSet_df.drop(columns = most_cymbals)    # drop the cymbal columns no longer needed
    if cymbal_classes == -1:
        pass    # keep all cymbal columns intact; used for debugging
    
    """BEATS AND DOWNBEATS - change the time-keeping line notation to denote downbeats and other beats"""
    non_digits = [x for x in FullSet_df['tk'].unique() if not x.isdigit()]         # finds all non-digit values used in the tk column
    non_ones_digits = [x for x in FullSet_df['tk'].unique() if x.isdigit() and x != '1'] # finds all digit values that are not equal to 1
    replace_dict = {'tk': {}}     # create empty dict for the 'tk' column
    for el in non_digits:
        replace_dict['tk'].update({el: blank_char})    # replacing non-digits elements in tk column for blank_chars
    for el in non_ones_digits:
        replace_dict['tk'].update({el: 'c'})           # replacing non-ones digits elements in tk column for 'c', which stands for 'click', as if you were listening to a metronome hearing clicks on the beats
    replace_dict['tk'].update({'1': 'C'})              # 'C' stands for 'Click', a louder click from a metronome, used to denote the downbeat
    FullSet_df = FullSet_df.replace(to_replace = replace_dict)   # nested dict form: {column: {old: new}}, so no separate value argument is needed
    
    return FullSet_df
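
The beat and downbeat relabeling at the end of the function can be tried in isolation. The following is a minimal sketch on a toy DataFrame, not the notebook's own code: `relabel_beats` and `toy_df` are hypothetical names, the blank label is assumed to be `'-'` (as seen in the printed outputs), and it uses `Series.map` with a small helper instead of the nested `replace` dictionary.

```python
import pandas as pd

blank_char = '-'  # assumption: the blank label is '-', as in the printed outputs

# Toy time-keeping column: digits mark beats, other characters mark subdivisions
toy_df = pd.DataFrame({'tk': ['1', 'e', '+', 'a', '2', 'e', '+', 'a'],
                       'BD': ['o', '-', '-', '-', 'o', '-', '-', '-']})

def relabel_beats(df, blank=blank_char):
    """Map '1' -> 'C' (downbeat), other digits -> 'c' (beat), anything else -> blank."""
    def to_click(symbol):
        if symbol == '1':
            return 'C'
        if symbol.isdigit():
            return 'c'
        return blank

    out = df.copy()                      # leave the caller's DataFrame untouched
    out['tk'] = out['tk'].map(to_click)
    return out

relabeled = relabel_beats(toy_df)
print(relabeled['tk'].tolist())  # ['C', '-', '-', '-', 'c', '-', '-', '-']
```

Mapping with a function avoids having to enumerate every symbol in the column first, at the cost of one Python-level call per row.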

In [61]:
# testing the collapse_class function
FullSet_testcollapse = collapse_class(FullSet_testclean , keep_dynamics = False, keep_toms_separate = False, keep_bells = False, cymbal_classes = 1, hihat_classes = 1)

print("Post collapsing, Unique values and frequencies in column __:")
for col in FullSet_testcollapse.drop(columns = ['song slice']):
    naf_series = FullSet_testcollapse[col].value_counts()
    print(naf_series.to_frame().T)
    print()

Post collapsing, Unique values and frequencies in column __:
        -     c     C
tk  37443  9268  3206

        -      o
BD  38088  11829

        -     o
SD  42893  7024

        -     x
HH  47010  2907

        -     o
at  48377  1540

        -     x
ac  40116  9801



**Cleaning and collapsing functions work!** In particular, the collapsing function should work for *any* of the available configurations, that is, any combination of values passed as arguments. Each option is handled independently, so no section of the function assumes how the other options are set. In my limited testing on the small test set, I found no errors or unintended outputs.
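
To illustrate the collapsing pattern on something small, here is a minimal sketch that merges several columns into one using the same masked `.loc` assignment the function uses. The helper name `collapse_columns`, the toy data, and the column names `HT` and `MT` are hypothetical; the blank label is assumed to be `'-'`.

```python
import pandas as pd

blank_char = '-'  # assumption: the blank label is '-'

# Toy label table; the tom column names here are illustrative only
toy_df = pd.DataFrame({
    'BD': ['o', '-', '-', 'o'],
    'HT': ['-', 'o', '-', '-'],
    'MT': ['-', '-', 'o', '-'],
    'LT': ['-', '-', '-', '-'],
})

def collapse_columns(df, columns, new_name, blank=blank_char):
    """Merge `columns` into one column `new_name`, then drop the originals."""
    out = df.copy()
    out[new_name] = blank                            # start with no events anywhere
    for col in columns:
        hit = out[col] != blank                      # rows where this column has an event
        out.loc[hit, new_name] = out.loc[hit, col]   # copy those labels across
    return out.drop(columns=columns)

collapsed = collapse_columns(toy_df, ['HT', 'MT', 'LT'], 'at')
print(collapsed)
```

If two source columns have an event on the same row, the column processed later overwrites the earlier label, so the order of `columns` matters whenever the labels differ.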

The cleaning function may need more work once it is run on a larger data set, but it is written generally enough that adding more clean-up sections, based on a summary of the labels in that larger data set, should be easy.
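
One way to produce such a label summary is to compare each column's `value_counts()` against a set of expected symbols. This is only a sketch under assumptions: `unexpected_labels` and the `allowed` mapping are hypothetical, and the symbol sets below are illustrative rather than the project's actual master format.

```python
import pandas as pd

def unexpected_labels(df, allowed):
    """Return {column: {label: count}} for labels outside each column's allowed set."""
    report = {}
    for col in df.columns:
        counts = df[col].value_counts()
        expected = allowed.get(col, set())   # columns without an entry flag every label
        extras = {label: int(n) for label, n in counts.items() if label not in expected}
        if extras:
            report[col] = extras
    return report

# Toy data with one stray symbol ('?') in the hi-hat column
toy_df = pd.DataFrame({'SD': ['o', '-', 'O', 'g', '-'],
                       'HH': ['x', 'x', '-', 'X', '?']})
allowed = {'SD': {'-', 'o', 'O', 'g'}, 'HH': {'-', 'x', 'X'}}

report = unexpected_labels(toy_df, allowed)
print(report)  # {'HH': {'?': 1}}
```

Any symbol that shows up in the report is a candidate for a new clean-up section.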

Songs that still need to be checked for alignment issues:
+ Rollercoaster: alignment may still be unacceptably off

In [45]:
# using random alignment checker, I can check any song for alignment issues
drums_check = ['BD','SD', 'HH', 'at', 'ac']
num_buffer = 16
song_to_check = 'wolves_at_the_door'
random_alignment_checker(FullSet_testcollapse.loc[(song_to_check,)], drums_check, num_buffer)

Sampling a SD event for alignment check... Loading tab and audio slice
   346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361
ac   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -
at   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -
HH   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -
SD   o   -   -   -   o   -   -   -   o   -   -   -   o   -   -   -
BD   -   -   o   o   -   -   o   -   -   o   -   o   -   -   -   o
tk   -   -   c   -   -   -   C   -   -   -   c   -   -   -   c   -


Sampling a at event for alignment check... Loading tab and audio slice
   1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579  \
ac    -    -    -    x    -    -    -    x    -    x    -    x    -    x    -   
at    o    o    o    -    -    -    -    -    -    -    -    -    -    -    -   
HH    -    -    -    -    -    -    -    -    -    -    -    -    -    -    -   
SD    -    -    -    -    -    -    -    o    -    -    -    -    -    -    -   
BD    -    -    -    o    -    -    o    -    -    o    -    -    -    o    -   
tk    -    -    -    C    -    -    -    c    -    -    -    c    -    -    -   

   1580  
ac    x  
at    -  
HH    -  
SD    o  
BD    -  
tk    c  


Sampling a ac event for alignment check... Loading tab and audio slice
   1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665  \
ac    x    -    -    x    -    -    -    x    -    x    -    x    -    x    -   
at    -    -    -    -    -    -    -    -    -    -    -    -    -    -    -   
HH    -    -    -    -    -    -    -    -    -    -    -    -    -    -    -   
SD    -    -    -    o    -    -    -    o    -    -    -    o    -    -    -   
BD    o    -    o    -    -    -    o    -    -    o    o    -    -    o    -   
tk    -    c    -    -    -    c    -    -    -    c    -    -    -    C    -   

   1666  
ac    x  
at    -  
HH    -  
SD    o  
BD    -  
tk    -  


Sampling a HH event for alignment check... Loading tab and audio slice
No valid samples in HH of that dataframe
Sampling a BD event for alignment check... Loading tab and audio slice
   175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190
ac   -   x   -   x   -   x   -   x   -   x   -   x   -   x   -   x
at   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -
HH   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -
SD   -   -   -   o   -   -   -   o   -   -   -   o   -   -   -   o
BD   o   o   o   o   o   o   o   o   o   o   o   o   o   o   o   o
tk   -   C   -   -   -   c   -   -   -   c   -   -   -   c   -   -


In [58]:
%whos

Variable                      Type             Data/Info
--------------------------------------------------------
AudioSegment                  type             <class 'pydub.audio_segment.AudioSegment'>
FullSet_df                    DataFrame                                t<...>[49917 rows x 14 columns]
FullSet_testclean             DataFrame                                t<...>[49917 rows x 14 columns]
FullSet_testcollapse          DataFrame                                t<...>n[49917 rows x 7 columns]
MATDF_test                    dict             n=25
MAT_df                        DataFrame             tk BD SD HH CC C2 LT<...>n[2112 rows x 11 columns]
Make_Music_Aligned_Drum_Tab   function         <function Make_Music_Alig<...>ab at 0x000002668201F288>
add_empty_lines               function         <function add_empty_lines at 0x00000266F4BBBA68>
align_tab_with_music          function         <function align_tab_with_<...>ic at 0x00000266F7E08678>
alignment_info                

#### List of Useful Shortcuts

* Ctrl + Shift + P = List of Shortcuts
* Enter (command mode) = Enter Edit Mode (enter cell to edit it)
* Esc (edit mode) = Enter Command Mode (exit cell)
* A = Create Cell above
* B = Create Cell below
* D,D = Delete Cell
* Shift + Enter = Run Cell (code or markdown)
* M = Change Cell to Markdown
* Y = Change Cell to Code
* Ctrl + Shift + Minus = Split Cell at Cursor

In [None]:
# Establish the file path (fp) string to the non Git data that is used in this notebook for testing purposes
fp = "C:\\Users\\Thomas\\Python Projects\\Drum-Tabber-Support-Data\\Notebook-NonGit-Data\\"