# Implementing the `MiniBach` model

## Part 1: Pre-processing the corpus

In this step, we process a collection of raw musical scores and turn them into a dataset of 4-measure chunks for the `MiniBach` architecture. If you are comfortable with `music21` and prefer to do these steps yourself, you can skip this notebook.

The output of this notebook is written to the `dataset.csv` file.


In [1]:
import music21
import pandas as pd
import os

The chorale dataset comes from [Craig Sapp's repository](https://github.com/craigsapp/bach-370-chorales).

This repository has been added as a submodule, so if you cloned recursively, you already have it.

To parse and process the files, we use the [music21](http://web.mit.edu/music21/) Python library.

In [2]:
dataset_path = os.path.join('bach-370-chorales', 'kern')

part_indexes = {
    0: 'soprano',
    1: 'alto',
    2: 'tenor',
    3: 'bass'    
}

def make_measure_chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    return [lst[i:i+n] for i in range(len(lst)-n+1)]


with open('dataset.csv', 'w') as fd:
    fd.write('sixteenth,soprano,alto,tenor,bass,file\n')
    for f in sorted(os.listdir(dataset_path)):
        print(f'Processing {f}...')
        filepath = os.path.join(dataset_path, f)
        # parse the file with music21
        s = music21.converter.parse(filepath)
        # discard pieces with time signature != 4/4
        timeSignature = s.flat.getElementsByClass('TimeSignature')
        if timeSignature and timeSignature[0].ratioString != '4/4':
            continue
        parts = {}    
        # Iterate over the 4 parts
        for part_id, part in enumerate(s.parts):
            # Get all the measures in this part
            measures = [mm for mm in part.getElementsByClass('Measure') if mm.number != 0]
            # Group them into groups of 4
            measure_chunks = make_measure_chunks(measures, 4)
            part_chunks = []
            for chunk in measure_chunks:
                # For every chunk of 4 measures
                chunk_encoding = {(offset/4.0): '--' for offset in range(0, 64)}
                for measure_id, measure in enumerate(chunk):
                    # Iterate over each measure
                    for ev in measure:
                        # And every note within the measure
                        offs = 4.0 * measure_id + ev.offset
                        if isinstance(ev, music21.chord.Chord):                        
                            if offs in chunk_encoding:
                                chunk_encoding[offs] = ev[0].nameWithOctave
                        elif isinstance(ev, music21.note.Note):
                            if offs in chunk_encoding:
                                chunk_encoding[offs] = ev.nameWithOctave                        
                        elif isinstance(ev, music21.note.Rest):                        
                            if offs in chunk_encoding:
                                chunk_encoding[offs] = 'Rest'            
                part_chunks.append(list(chunk_encoding.values()))
            parts[part_indexes[part_id]] = part_chunks    

        for chunk_id in range(len(parts['soprano'])):
            dfdict = {}
            for part, chunks in parts.items():
                dfdict[part] = chunks[chunk_id]
            chunk_name = f'{f}_chunk_{chunk_id}'
            dfdict['file'] = chunk_name            
            df = pd.DataFrame(dfdict)
            df.to_csv(fd, header=False)
print('The dataset.csv file has been written!')

Processing chor001.krn...
Processing chor002.krn...
Processing chor003.krn...
Processing chor004.krn...
Processing chor005.krn...
Processing chor006.krn...
Processing chor007.krn...
Processing chor008.krn...
Processing chor009.krn...
Processing chor010.krn...
Processing chor011.krn...
Processing chor012.krn...
Processing chor013.krn...
Processing chor014.krn...
Processing chor015.krn...
Processing chor016.krn...
Processing chor017.krn...
Processing chor018.krn...
Processing chor019.krn...
Processing chor020.krn...
Processing chor021.krn...
Processing chor022.krn...
Processing chor023.krn...
Processing chor024.krn...
Processing chor025.krn...
Processing chor026.krn...
Processing chor027.krn...
Processing chor028.krn...
Processing chor029.krn...
Processing chor030.krn...
Processing chor031.krn...
Processing chor032.krn...
Processing chor033.krn...
Processing chor034.krn...
Processing chor035.krn...
Processing chor036.krn...
Processing chor037.krn...
Processing chor038.krn...
...

Processing chor318.krn...
Processing chor319.krn...
Processing chor320.krn...
Processing chor321.krn...
Processing chor322.krn...
Processing chor323.krn...
Processing chor324.krn...
Processing chor325.krn...
Processing chor326.krn...
Processing chor327.krn...
Processing chor328.krn...
Processing chor329.krn...
Processing chor330.krn...
Processing chor331.krn...
Processing chor332.krn...
Processing chor333.krn...
Processing chor334.krn...
Processing chor335.krn...
Processing chor336.krn...
Processing chor337.krn...
Processing chor338.krn...
Processing chor339.krn...
Processing chor340.krn...
Processing chor341.krn...
Processing chor342.krn...
Processing chor343.krn...
Processing chor344.krn...
Processing chor345.krn...
Processing chor346.krn...
Processing chor347.krn...
Processing chor348.krn...
Processing chor349.krn...
Processing chor350.krn...
Processing chor351.krn...
Processing chor352.krn...
Processing chor353.krn...
Processing chor354.krn...
Processing chor355.krn...
...
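
Note that `make_measure_chunks` in the cell above returns overlapping windows (stride 1), not disjoint groups, so a part with `N` measures yields `N - 3` chunks of 4 measures. The sketch below reproduces the helper on a toy list; the integers are only placeholders standing in for `music21` measure objects.

```python
def make_measure_chunks(lst, n):
    """Return every window of n consecutive items from lst (stride 1)."""
    return [lst[i:i + n] for i in range(len(lst) - n + 1)]


# Placeholder "measures": plain integers instead of music21 Measure objects.
measures = [1, 2, 3, 4, 5, 6]
windows = make_measure_chunks(measures, 4)

print(windows)       # [[1, 2, 3, 4], [2, 3, 4, 5], [3, 4, 5, 6]]
print(len(windows))  # 3, i.e. len(measures) - 4 + 1

# A part shorter than the window size produces no chunks at all.
print(make_measure_chunks([1, 2, 3], 4))  # []
```

Because consecutive windows share three measures, neighbouring chunks in `dataset.csv` contain largely the same material shifted by one measure.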

The `Humdrum (**kern)` format provides information for describing many elements of a full music score.

With the previous processing, we reduced this information to pitch only, sampled in sixteenth-note slices over 4-measure chunks. This (very) minimal encoding is convenient for feeding into the neural network.
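
To make the encoding concrete, here is a minimal sketch of the same grid logic without `music21`. It assumes 4/4 time, as the processing cell does. The function name `encode_measures` and the `(offset, symbol)` tuples are hypothetical stand-ins for the `music21` events, with offsets measured in quarter notes from the start of each measure.

```python
def encode_measures(measures, slots_per_measure=16, beats_per_measure=4.0):
    """Encode measures on a sixteenth-note grid.

    measures: list of measures, each a list of (offset, symbol) events.
    Slots where no event starts keep the '--' placeholder.
    """
    step = beats_per_measure / slots_per_measure  # 0.25 quarter notes
    total_slots = slots_per_measure * len(measures)
    grid = {i * step: '--' for i in range(total_slots)}
    for measure_id, events in enumerate(measures):
        for offset, symbol in events:
            position = beats_per_measure * measure_id + offset
            # Events that do not fall on the grid are ignored.
            if position in grid:
                grid[position] = symbol
    return list(grid.values())


# Two made-up measures: four quarter-note events, then two half-note events.
toy = [
    [(0.0, 'C4'), (1.0, 'D4'), (2.0, 'Rest'), (3.0, 'E4')],
    [(0.0, 'F4'), (2.0, 'G4')],
]
encoded = encode_measures(toy)

print(len(encoded))  # 32
print(encoded[:16])
# ['C4', '--', '--', '--', 'D4', '--', '--', '--',
#  'Rest', '--', '--', '--', 'E4', '--', '--', '--']
```

With four measures instead of two, the same function produces the 64 slots per voice that make up one chunk in `dataset.csv`.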

The next step starts from the `dataset.csv` file generated here.
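
As a rough illustration of how the file can be read back, the sketch below groups rows by the `file` column into one entry per chunk, using only the standard library. The helper name `load_chunks` and the sample rows are made up for the example; only the header matches what the cell above writes. The next notebook may load the data differently.

```python
import csv
import io
from collections import defaultdict

VOICES = ('soprano', 'alto', 'tenor', 'bass')


def load_chunks(fd):
    """Group CSV rows into {chunk_name: {voice: [symbols, ...]}}."""
    chunks = defaultdict(lambda: {voice: [] for voice in VOICES})
    for row in csv.DictReader(fd):
        for voice in VOICES:
            chunks[row['file']][voice].append(row[voice])
    return dict(chunks)


# Made-up sample in the same column layout as dataset.csv.
sample = io.StringIO(
    'sixteenth,soprano,alto,tenor,bass,file\n'
    '0,G4,D4,B3,G2,demo_chunk_0\n'
    '1,--,--,--,--,demo_chunk_0\n'
    '0,A4,E4,C4,A2,demo_chunk_1\n'
)
chunks = load_chunks(sample)

print(sorted(chunks))                     # ['demo_chunk_0', 'demo_chunk_1']
print(chunks['demo_chunk_0']['soprano'])  # ['G4', '--']
```

To read the real file, pass `open('dataset.csv', newline='')` instead of the in-memory sample.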