# Data preparation for automatic music generation
### Created by Juan Julian Cea Moran
#### Github: Juanju97
---
This notebook explores the data and develops a preprocessing method that prepares it for use by the model.

## Creating vocabulary and data
---
In this section, the goal is to analyze all MIDI files and build a vocabulary containing every distinct note, rest, and chord that appears in them.
We also store each file's sequence of notes in its own list (the `data` object).

In [None]:
from music21 import converter, instrument, chord, note
import glob

vocabulary = set()  # every distinct note, rest and chord token
data = []           # one list of tokens per MIDI file

for i, file in enumerate(glob.glob("../data/classic_dataset/*.mid"), start=1):
    midi = converter.parse(file)

    # Group the score by instrument; the result may be None if that is not possible.
    tracks = instrument.partitionByInstrument(midi)

    if tracks:
        # Use the first instrument part as the main track.
        main_track = tracks.parts[0].recurse()
    else:
        # Fall back to the flattened score. notesAndRests (rather than notes)
        # keeps rests, so both branches yield the same kinds of elements.
        main_track = midi.flat.notesAndRests

    file_notes = []
    for e in main_track:
        # Token format: "<pitch names>_<duration in quarter notes>".
        if isinstance(e, note.Note):
            element = e.nameWithOctave + "_" + str(e.duration.quarterLength)
        elif isinstance(e, note.Rest):
            element = "rest_" + str(e.duration.quarterLength)
        elif isinstance(e, chord.Chord):
            chord_notes = '.'.join(str(p) for p in e.pitches)
            element = chord_notes + "_" + str(e.duration.quarterLength)
        else:
            continue  # skip anything that is not a note, rest or chord

        vocabulary.add(element)
        file_notes.append(element)

    data.append(file_notes)
    print(i)  # progress: number of files processed so far
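
To make the token format concrete, here is a minimal sketch of how a token produced above could be split back into its parts. The helper name `parse_token` is hypothetical and not part of the notebook. It assumes pitch names never contain an underscore and that durations are rendered either as decimals (`1.0`) or as ratios (`1/3`, which may appear for tuplets).

```python
from fractions import Fraction


def parse_token(token):
    """Split a token such as 'C4.E4.G4_1.0' into (kind, pitches, duration)."""
    # The duration is everything after the last underscore.
    pitch_part, duration_str = token.rsplit("_", 1)
    # Fraction accepts both decimal strings ("0.5") and ratios ("1/3").
    duration = Fraction(duration_str)

    if pitch_part == "rest":
        return "rest", [], duration

    # Chord pitches are joined with "."; a single pitch is treated as a note.
    pitches = pitch_part.split(".")
    kind = "chord" if len(pitches) > 1 else "note"
    return kind, pitches, duration


print(parse_token("C4_1.0"))
print(parse_token("rest_0.5"))
print(parse_token("C4.E4.G4_1/3"))
```

A chord that holds a single pitch is indistinguishable from a note in this format, so the sketch reports it as a note.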

In [2]:
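import pickle

# Sort the vocabulary so the token order, and any index built from it, is reproducible.
with open("./data/classic_dataset/vocabulary.p", "wb") as f:
    pickle.dump(sorted(vocabulary), f)
with open("./data/classic_dataset/data.p", "wb") as f:
    pickle.dump(data, f)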
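
Since this notebook also aims to explore the data, one simple check is how often each token occurs. The sketch below uses a small made-up `toy_data` list in place of the real `data` object, so the tokens and counts are illustrative only.

```python
from collections import Counter

# Toy stand-in for the `data` object: one list of tokens per file.
toy_data = [
    ["C4_1.0", "E4_1.0", "rest_0.5", "C4_1.0"],
    ["C4_1.0", "C4.E4.G4_2.0", "rest_0.5"],
]

# Count how often each token occurs across all files.
token_counts = Counter(token for file_notes in toy_data for token in file_notes)

# Tokens that occur exactly once in the whole dataset.
singletons = sorted(t for t, c in token_counts.items() if c == 1)

print(token_counts.most_common(2))
print(singletons)
```

On the real dataset, a large number of tokens that occur only once may indicate that the vocabulary is sparse relative to the amount of data.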


## Mapping the vocabulary and formatting data
---
Now we want to feed this data to the model. The model operates on numbers rather than strings, so we map each vocabulary entry to an integer index it can process.

In [21]:
import pickle

with open("./data/classic_dataset/vocabulary.p", "rb") as f:
    vocabulary = pickle.load(f)

# Assign each token its position in the sorted vocabulary as an integer id.
map_voc = {element: number for number, element in enumerate(vocabulary)}

Now we apply this mapping to encode the data.

In [31]:
# Replace every token with its integer id, keeping one list per file.
formated_data = [[map_voc[n] for n in file] for file in data]

with open("./data/classic_dataset/formated_data.p", "wb") as f:
    pickle.dump(formated_data, f)
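
As a sanity check on the encoding step, the sketch below runs the same mapping on a toy vocabulary and verifies that the integer ids decode back to the original tokens. The names `toy_vocabulary`, `toy_data`, `token_to_id` and `id_to_token` are hypothetical, and the inverse map is an addition that the notebook itself does not build.

```python
# Toy stand-ins for the sorted vocabulary and the `data` object.
toy_vocabulary = sorted({"C4_1.0", "E4_1.0", "rest_0.5"})
toy_data = [["C4_1.0", "rest_0.5"], ["E4_1.0", "C4_1.0"]]

# Forward map (token -> id), built the same way as map_voc.
token_to_id = {token: idx for idx, token in enumerate(toy_vocabulary)}
# Hypothetical inverse map (id -> token), useful for decoding model output.
id_to_token = {idx: token for token, idx in token_to_id.items()}

# Encode each file, then decode it again.
encoded = [[token_to_id[t] for t in file_notes] for file_notes in toy_data]
decoded = [[id_to_token[i] for i in file_ids] for file_ids in encoded]

# The round trip should reproduce the original tokens exactly.
assert decoded == toy_data
print(encoded)
```

Because the vocabulary is saved in sorted order, the same vocabulary file should yield the same ids on every run, which matters if the ids are later decoded back into notes.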