<h1> Chord Accompaniment using RL - Data preprocessing to compute empirical rewards </h1>

**POC:** Nina Rauscher (nr2861@columbia.edu)

\
In this notebook, we build the dataset that serves as the core of our empirical rewards during agent training.

We use Beethoven sonatas as a reference for good chord transitions and study how frequently each transition occurs.

The end goal is a csv file holding a 48 x 48 matrix (48 = 12 root notes x 4 chord types) whose entries are the probabilities of moving from one chord to another. This matrix is then used to generate the empirical rewards.
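
To make the 48 states concrete, the short sketch below enumerates them. It assumes the four chord types are written `M`, `m`, `M7` and `m7`, as in the final labels used at the end of this notebook; the names `notes`, `chord_types` and `all_chords` are illustrative only.

```python
from itertools import product

# The 12 root notes, spelled with sharps only
notes = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']
# Assumed spelling of the 4 chord types (final notation of this notebook)
chord_types = ['M', 'm', 'M7', 'm7']

# Every (root, type) pair gives one state
all_chords = [note + chord_type for note, chord_type in product(notes, chord_types)]
print(len(all_chords))  # 48
```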

**Our main steps are:**
1. Find Beethoven sonatas data
2. Preprocess this data so that it uses the same chord format as the following steps
3. Identify chord transitions 
4. Compute frequencies for each potential transition
5. Organize the data as a 48x48 matrix and save it as a csv file

<h2> Necessary imports </h2>

In [1]:
# Operational libraries
import numpy as np
import pandas as pd
import os
import json

<h2> Step 1: Find Beethoven sonatas data </h2>

We will use the data from [Tsung Ping's GitHub](https://github.com/Tsung-Ping/functional-harmony), specifically the `BPS_FH_Dataset` folder and the `chords.xlsx` file within each sonata folder.

As explained by Tsung Ping in the associated README file, the annotations in these files use the following notation:

---

**Tonality**

capital = major
\
lower case = minor
\
'+' = sharp
\
'-' = flat

ex. C = C major, c+ = C# minor
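
As an illustration of this key notation, here is a minimal sketch of a parser that turns a key string into a note name and a mode. `parse_key` is a hypothetical helper, not part of the dataset's tooling, and spelling every flat as its enharmonic sharp is an assumption made to match the 12-note list used later in this notebook.

```python
# Sharp-based spelling of the 12 pitch classes (same order as order_notes below)
ORDER_NOTES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def parse_key(key):
    """Hypothetical helper: 'c+' -> ('C#', 'minor'), 'A-' -> ('G#', 'major')."""
    letter = key[0]
    mode = 'major' if letter.isupper() else 'minor'
    pitch_class = ORDER_NOTES.index(letter.upper())
    for accidental in key[1:]:
        if accidental == '+':    # sharp: one semitone up
            pitch_class += 1
        elif accidental == '-':  # flat: one semitone down
            pitch_class -= 1
        else:
            raise ValueError(f"Unexpected symbol {accidental!r} in key {key!r}")
    return ORDER_NOTES[pitch_class % 12], mode

print(parse_key('C'))   # ('C', 'major')
print(parse_key('c+'))  # ('C#', 'minor')
```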

---

**Scale Degree**


1 = I, i (1+ for augmented I) 
\
2 = ii, ii- (-2 for Neapolitan chord)
\
3 = iii, III
\
4 = IV, iv (+4 for augmented 6th)
\
5 = V, v
\
6 = vi, VI (-6 for 6b) 
\
7 = vii-, vii=7, vii-7
\
'+' before number = sharp
\
'-' before number = flat
\
/ = secondary chord 


---

**Chord Quality**

M = major
\
m = minor
\
M7 = major 7th
\
m7 = minor 7th
\
D7 = dominant 7th
\
a = augmented chord (1+)
\
a6 = augmented 6th (+4: It+6, Fr+6, Gr+6)

---

**Chord Label**

capital Roman numeral = major triad
\
lower case Roman numeral = minor triad
\
capital Roman numeral with '+' = augmented triad
\
lower case Roman numeral with '-' = diminished chord
\
lower case Roman numeral with '=' = half diminished chord

6 = triad (1st inversion)
\
64 = triad (2nd inversion)
\
7 = 7th chord (root position)
\
65 = 7th chord (1st inversion)
\
43 = 7th chord (2nd inversion)
\
42 = 7th chord (3rd inversion)

N6 = Neapolitan chord
\
It+6 = Italian sixth
\
Fr+6 = French sixth
\
Gr+6 = German sixth

<h2> Step 2: Data Preprocessing Theory</h2>

<h3> 2.1. Create a function to format a chords.xlsx file into an exploitable chords dataframe </h3>

We first write a function that loads a *chords.xlsx* file into a dataframe with named columns; we will then apply it to the file of each sonata.

In [2]:
# Reformat the chords file into a dataframe with corresponding columns
def get_file_chords(file):
    temp_chords = pd.read_excel(file, header = None)
    temp_chords.columns = ['onset','offset','key','degree','quality','inversion','Roman numeral notation']
    return temp_chords

In [3]:
order_notes = ['C','C#','D','D#','E','F','F#','G','G#','A','A#','B']
available = ['M','m','D7','m7'] # These correspond to the chord types we have decided to restrict the problem to

In [4]:
# Dictionaries to convert the chords: semitone offset from the tonic for each scale degree (1-7), per chord quality
dic_major = {1:0,2:2,3:4,4:5,5:7,6:9,7:11}
dic_minor = {1:0,2:2,3:3,4:5,5:7,6:8,7:10}
dic_minor_7 = {1:0,2:2,3:3,4:5,5:7,6:9,7:10}
dic_D7 = {1:0,2:2,3:4,4:6,5:7,6:9,7:11}
dic_tot = {'m':dic_minor, 'M': dic_major, 'm7': dic_minor_7, 'D7': dic_D7}

In [5]:
# Convert the key, degree and quality columns into a chord following the format we expect
def chord(key, quality, degree):
    ini = order_notes.index(key.upper())
    quality_str = quality
    if quality_str == 'D7':
        quality_str = '7'
    return order_notes[(ini+dic_tot[quality][degree])%12]+quality_str

chord('F#','m7',5)

'C#m7'
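
The call above can be unpacked step by step. The sketch below redoes the arithmetic by hand for the same inputs, using copies of `order_notes` and `dic_minor_7`; the intermediate variable names are illustrative.

```python
order_notes = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']
dic_minor_7 = {1: 0, 2: 2, 3: 3, 4: 5, 5: 7, 6: 9, 7: 10}

tonic_index = order_notes.index('F#')      # 6
offset = dic_minor_7[5]                    # degree 5 -> 7 semitones above the tonic
root_index = (tonic_index + offset) % 12   # (6 + 7) % 12 = 1
root = order_notes[root_index]             # 'C#'

print(root + 'm7')  # C#m7
```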

In [6]:
# From a dataframe from the chords.xlsx file, create a filtered and formatted dataframe
def create_good_chords(chords):

    # Keep only the supported chord qualities and plain scale degrees
    # (.copy() avoids pandas chained-assignment warnings)
    good_chords = chords[chords['quality'].isin(available)].copy()
    good_chords = good_chords[good_chords['degree'].isin([1,2,3,4,5,6,7])].copy()
    good_chords['key'] = good_chords['key'].replace({'A-':'G#', 'b-': 'A#'})
    good_chords['New notation'] = [[]]*len(good_chords)  # [] marks a chord that has not been converted

    ## Notations
    # i is looked up as an index label; rows that cannot be converted
    # (missing label, unsupported key) keep the [] placeholder
    for i in range(len(good_chords)):
        try:
            good_chords.loc[i, 'New notation'] = chord(good_chords['key'][i], good_chords['quality'][i], good_chords['degree'][i])
        except Exception:
            pass

    ## Breaks: 1 when a row does not start exactly where the previous kept row ends
    good_chords['Break'] = 0
    labels = list(good_chords.index)
    for cur, next_ in zip(labels, labels[1:]):
        good_chords.loc[next_,'Break'] = int(good_chords.loc[cur,'offset'] != good_chords.loc[next_,'onset'])
    return good_chords
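
The `Break` column can also be computed without an explicit loop. The following sketch, run on a made-up toy table, compares each onset with the previous offset via `shift`; it is meant as an illustration of the idea rather than a drop-in replacement.

```python
import pandas as pd

# Toy data (made up): the third chord starts at 5.0 while the second ends at 4.0
toy = pd.DataFrame({'onset': [0.0, 2.0, 5.0, 6.0],
                    'offset': [2.0, 4.0, 6.0, 8.0]})

# Offset of the previous row, aligned with the current row
previous_offset = toy['offset'].shift(1)

# 1 when the chord does not start where the previous one ended
toy['Break'] = (toy['onset'] != previous_offset).astype(int)
# The first row has no predecessor, so it is never a break
toy.loc[toy.index[0], 'Break'] = 0

print(toy['Break'].tolist())  # [0, 0, 1, 0]
```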

<h3> 2.2. Identify transitions from a chords dataframe </h3>

In [7]:
def create_and_store_transitions(good_chords,path_name):
    good_chords = good_chords.reset_index()
    list_trans = []
    acc = 0  # position at which the previous run stopped
    for i in range(len(good_chords.index)):
        if i>= acc:
            # Collect consecutive chords until a row flagged with Break == 1 is reached
            cur = []
            j=i
            while j < len(good_chords.index) and good_chords['Break'][j] != 1:
                cur.append(good_chords['New notation'][j])
                j+=1
            list_trans.append(cur)
            acc=j

    ### Final transitions: drop empty runs and runs containing an unconverted chord ([])

    dic = {}
    acc = 0
    for l in list_trans:
        if not (l == [] or [] in l):
            dic[acc] = l
            acc+=1

    ### Store it

    with open(path_name, 'w') as file:
        json.dump(dic, file)

<h2> Step 3: Process all sonatas chords.xlsx files </h2>

Now that we have the functions needed to process one file, we can apply them to every folder, and therefore to all the chords.xlsx files from the sonatas available in the GitHub dataset.

In [12]:
# Define the path to the working directory where the chords files are stored
my_dir = "/Users/ninarauscher/Desktop/Python Projects/Reinforcement Learning/Chord Accompaniment using RL/BPS_FH_Dataset"

In [15]:
# Access these files and create json files with the chord transitions
os.chdir(my_dir)
l_datasets = [elem for elem in os.listdir()]
for elem in l_datasets:
    str_ = os.path.join(my_dir, elem)
    if os.path.isdir(str_):  # Check if it's a directory before changing
        os.chdir(str_)
        chords = get_file_chords('chords.xlsx')
        good_chords = create_good_chords(chords)
        str_store = os.path.join(my_dir, 'dico_' + elem + '.json')
        create_and_store_transitions(good_chords, str_store)

In [16]:
# Let's merge everything into one dictionary
os.chdir(my_dir)
dico_total = {}
for elem in os.listdir():
    if elem.startswith('dico_') and elem != 'dico_all_transitions.json':  # skip the merged file if this cell is re-run
        with open(elem, 'r') as file:
            loaded_dict = json.load(file)
        nb = elem.split('.')[0].split('_')[1]
        sigle_dict = {str(key)+'_'+nb: value for key, value in loaded_dict.items()}
        dico_total = {**dico_total,**sigle_dict}
dico_total

{'0_2': ['AM',
  'Bm',
  'E7',
  'AM',
  'E7',
  'AM',
  'E7',
  'AM',
  'E7',
  'AM',
  'B7',
  'EM',
  'B7',
  'EM',
  'B7',
  'EM',
  'AM',
  'E7',
  'AM',
  'E7',
  'AM',
  'E7',
  'AM',
  'EM',
  'AM',
  'DM',
  'AM',
  'E7',
  'AM',
  'E7',
  'AM',
  'F#7',
  'Gm',
  'C#m',
  'F#7',
  'BM'],
 '1_2': ['F#7', 'BM'],
 '2_2': ['F#7', 'BM', 'B7', 'Em'],
 '3_2': ['Em'],
 '4_2': ['GM'],
 '5_2': ['EM'],
 '6_2': ['EM'],
 '7_2': ['EM',
  'B7',
  'EM',
  'B7',
  'Gm',
  'Cm',
  'F#m',
  'AM',
  'BM',
  'EM',
  'B7',
  'EM',
  'B7',
  'EM',
  'B7',
  'EM',
  'B7',
  'EM',
  'AM',
  'Bm',
  'E7',
  'AM',
  'E7',
  'AM',
  'E7',
  'AM',
  'E7',
  'AM',
  'B7',
  'EM',
  'B7',
  'EM',
  'B7',
  'EM',
  'AM',
  'E7',
  'AM',
  'E7',
  'AM',
  'E7',
  'AM',
  'EM',
  'AM',
  'DM',
  'AM',
  'E7',
  'AM',
  'E7',
  'AM',
  'F#7',
  'Gm',
  'C#m',
  'F#7',
  'BM'],
 '8_2': ['F#7', 'BM'],
 '9_2': ['F#7', 'BM', 'B7', 'Em'],
 '10_2': ['Em'],
 '11_2': ['GM'],
 '12_2': ['EM'],
 '13_2': ['EM'],
 '14_2': 

In [17]:
# Save the dictionary into one single json file
os.chdir(my_dir)
with open('dico_all_transitions.json', 'w') as file:
    json.dump(dico_total, file)

<h2> Step 4: Compute frequencies for each chord transition </h2>

In [23]:
def dico_freq(dico_total):
    dico_frequency = {}
    for key in dico_total.keys():
        l = dico_total[key]
        for i in range(len(l)-1):
            if l[i] not in dico_frequency.keys():
                dico_frequency[l[i]] = {}
            if l[i+1] not in dico_frequency[l[i]].keys():
                dico_frequency[l[i]][l[i+1]] = 0
            dico_frequency[l[i]][l[i+1]]+=1
    return dico_frequency
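
To see what this counting produces, here is a small, self-contained sketch on made-up sequences. It uses `collections.Counter` instead of nested plain dictionaries, which is a stylistic choice and not the implementation above; `count_transitions` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def count_transitions(sequences):
    """Count how often each chord is directly followed by each other chord."""
    counts = defaultdict(Counter)
    for sequence in sequences.values():
        # Pair every chord with its successor inside the same sequence
        for current, following in zip(sequence, sequence[1:]):
            counts[current][following] += 1
    return counts

# Toy sequences (made up), keyed like the merged dictionary
toy_sequences = {'0_2': ['AM', 'E7', 'AM', 'E7', 'AM', 'DM'],
                 '1_2': ['F#7', 'BM']}

toy_counts = count_transitions(toy_sequences)
print(dict(toy_counts['AM']))  # {'E7': 2, 'DM': 1}
```

Transitions are only counted inside a sequence, so the last chord of `'0_2'` is never paired with the first chord of `'1_2'`.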

In [24]:
# Get all states represented in Beethoven sonatas
states = []
for key in dico_total.keys():
    for elem in dico_total[key]:
        if elem not in states:
            states.append(elem)

In [25]:
def dico_probabilities(dico_frequency):
    dico_proba = {}
    for key in dico_frequency.keys():
        sum_ = sum(dico_frequency[key].values())
        dico_proba[key] = {}
        for key2 in dico_frequency[key].keys():
            dico_proba[key][key2] = dico_frequency[key][key2]/sum_
        for elem in states:
            if elem not in dico_proba[key].keys():
                dico_proba[key][elem] = 0
    return dico_proba
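
The normalisation step amounts to dividing each count by its row total, i.e. P(next | current) = count(current, next) / (sum of all counts leaving current). A minimal sketch on hypothetical counts, with illustrative names:

```python
# Hypothetical transition counts: AM was followed by E7 twice and by DM once
toy_counts = {'AM': {'E7': 2, 'DM': 1},
              'E7': {'AM': 2}}
toy_states = ['AM', 'E7', 'DM']

def to_probabilities(counts, states):
    """Turn raw counts into conditional probabilities, with 0 for unseen pairs."""
    probabilities = {}
    for current, following in counts.items():
        total = sum(following.values())
        probabilities[current] = {state: following.get(state, 0) / total
                                  for state in states}
    return probabilities

toy_proba = to_probabilities(toy_counts, toy_states)
print(toy_proba['E7'])  # {'AM': 1.0, 'E7': 0.0, 'DM': 0.0}
```

Note that a chord which never appears as the first element of a pair (here `DM`) gets no row at all in the result.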

In [26]:
# Actually create these dictionaries
dico_frequency = dico_freq(dico_total)
dico_proba = dico_probabilities(dico_frequency)

<h2> Step 5: Organize the data as a 48x48 matrix and save it as a csv file </h2>

In [27]:
# Create a kind of transition matrix for chords
transition_matrix = pd.DataFrame(index = states, columns = states)

for s1 in states:
    for s2 in states:
        transition_matrix.loc[s1,s2] = dico_proba[s1][s2]
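
A possible alternative, sketched here on toy probabilities, lets pandas build the matrix directly from the nested dictionary. `DataFrame.from_dict(..., orient='index')` puts the outer keys on the rows, and `reindex` followed by `fillna` adds an all-zero row for any chord that never appears as a source (a case in which the loop above would raise a `KeyError`). The toy values and variable names are assumptions.

```python
import pandas as pd

# Toy nested dictionary: outer key = current chord, inner key = next chord
toy_proba = {'AM': {'E7': 0.5, 'DM': 0.5},
             'E7': {'AM': 1.0}}
toy_states = ['AM', 'E7', 'DM']  # DM never appears as a source chord

toy_matrix = (
    pd.DataFrame.from_dict(toy_proba, orient='index')  # rows = current chord
      .reindex(index=toy_states, columns=toy_states)   # same order on both axes
      .fillna(0.0)                                     # unseen transitions -> 0
)
print(toy_matrix)
```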

In [28]:
transition_matrix

Unnamed: 0,AM,Bm,E7,B7,EM,DM,F#7,Gm,C#m,BM,...,G#m,A#m,D#M,C#M,Dm7,F7,D#m7,A#m7,G#m7,C#m7
AM,0.071429,0.015306,0.372449,0.015306,0.122449,0.168367,0.010204,0.0,0.0,0.010204,...,0.0,0.010204,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0
Bm,0.043478,0.065217,0.108696,0.021739,0.0,0.0,0.391304,0.0,0.0,0.021739,...,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0
E7,0.676829,0.0,0.018293,0.0,0.02439,0.0,0.0,0.0,0.0,0.0,...,0.006098,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0
B7,0.0,0.0,0.0,0.006329,0.721519,0.0,0.0,0.012658,0.0,0.0,...,0.0,0.0,0.0,0.006329,0.0,0.0,0,0.0,0.0,0.0
EM,0.105769,0.0,0.038462,0.514423,0.033654,0.028846,0.019231,0.0,0.0,0.081731,...,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.004808
DM,0.100977,0.006515,0.006515,0.0,0.003257,0.100977,0.0,0.045603,0.0,0.0,...,0.0,0.035831,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0
F#7,0.0,0.257143,0.0,0.0,0.0,0.0,0.057143,0.028571,0.0,0.657143,...,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0
Gm,0.048193,0.0,0.0,0.0,0.036145,0.060241,0.0,0.0,0.024096,0.0,...,0.012048,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0
C#m,0.0,0.0,0.0,0.0,0.0,0.0,0.235294,0.0,0.0,0.235294,...,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0
BM,0.0,0.026549,0.0,0.123894,0.168142,0.0,0.19469,0.0,0.053097,0.088496,...,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.00885,0.0


This matrix only covers 43 x 43 transitions, whereas we need 48 x 48 to cover all the potential chords we've defined. The missing chords are: Cm7, Fm7, A#M7, C#M7 and G#M7 (the last three are written in the final notation and correspond to A#7, C#7 and G#7 in the notation used so far).

It would be interesting to retrieve additional data from other sonatas (not necessarily from Beethoven) to complete the transitions for the 5 missing chords. However, in order to keep Beethoven's style, we've decided to simply fill the remaining transitions with zeros.

In [29]:
# Create a list of additional chords
additional_chords = ['Cm7', 'Fm7', 'A#M7', 'C#M7', 'G#M7']

In [33]:
# Create a new DataFrame with the desired size and fill with 0
new_df = pd.DataFrame(0.00, index = additional_chords + transition_matrix.index.tolist(), columns = additional_chords + transition_matrix.columns.tolist())

# Update the values based on the existing transition_matrix
new_df.loc[transition_matrix.index, transition_matrix.columns] = transition_matrix.values
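
One consequence of this padding is worth checking: rows for chords that never occur are all zeros, so the result is not a proper stochastic matrix. The sketch below pads a toy matrix with `reindex(..., fill_value=0.0)` (an alternative to the explicit copy above) and inspects the row sums; the toy labels and values are made up.

```python
import pandas as pd

# Toy 2 x 2 transition matrix (made-up values, each row sums to 1)
toy = pd.DataFrame([[0.25, 0.75],
                    [1.0, 0.0]],
                   index=['AM', 'E7'], columns=['AM', 'E7'])

# Add one chord that was never observed
extra_chords = ['Cm7']
labels = extra_chords + toy.index.tolist()
padded = toy.reindex(index=labels, columns=labels, fill_value=0.0)

# Observed chords still sum to 1, padded chords sum to 0
row_sums = padded.sum(axis=1)
print(row_sums)
```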

In [34]:
new_df

Unnamed: 0,Cm7,Fm7,A#M7,C#M7,G#M7,AM,Bm,E7,B7,EM,...,G#m,A#m,D#M,C#M,Dm7,F7,D#m7,A#m7,G#m7,C#m7
Cm7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Fm7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A#M7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
C#M7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
G#M7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AM,0.0,0.0,0.0,0.0,0.0,0.071429,0.015306,0.372449,0.015306,0.122449,...,0.0,0.010204,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Bm,0.0,0.0,0.0,0.0,0.0,0.043478,0.065217,0.108696,0.021739,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
E7,0.0,0.0,0.0,0.0,0.0,0.676829,0.0,0.018293,0.0,0.02439,...,0.006098,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
B7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.006329,0.721519,...,0.0,0.0,0.0,0.006329,0.0,0.0,0.0,0.0,0.0,0.0
EM,0.0,0.0,0.0,0.0,0.0,0.105769,0.0,0.038462,0.514423,0.033654,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.004808


We also need to change the formatting of the chords to match what we will use as states for our RL problem (M -> M, m -> m, 7 -> M7, m7 -> m7).
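
This renaming can be expressed as a small rule instead of a hand-written list. The sketch below rewrites a label only when it is a root note followed by a bare `7`; `to_state_label` is a hypothetical helper, and the regular expression assumes roots are spelled with sharps only, as in `order_notes`.

```python
import re

def to_state_label(label):
    """Hypothetical helper: rename 'E7' to 'EM7' and leave other labels untouched."""
    # Root note (A-G, optional '#') immediately followed by '7' and nothing else
    return re.sub(r'^([A-G]#?)7$', r'\1M7', label)

examples = ['AM', 'Bm', 'E7', 'F#7', 'Em7', 'C#M7']
print([to_state_label(label) for label in examples])
# ['AM', 'Bm', 'EM7', 'F#M7', 'Em7', 'C#M7']
```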

In [35]:
new_df.columns

Index(['Cm7', 'Fm7', 'A#M7', 'C#M7', 'G#M7', 'AM', 'Bm', 'E7', 'B7', 'EM',
       'DM', 'F#7', 'Gm', 'C#m', 'BM', 'Em', 'GM', 'Cm', 'F#m', 'D#m', 'CM',
       'Dm', 'G7', 'G#M', 'D#7', 'FM', 'C7', 'A7', 'Am', 'Fm', 'Em7', 'Bm7',
       'F#m7', 'D7', 'Am7', 'F#M', 'A#M', 'Gm7', 'G#m', 'A#m', 'D#M', 'C#M',
       'Dm7', 'F7', 'D#m7', 'A#m7', 'G#m7', 'C#m7'],
      dtype='object')

In [36]:
new_labels = ['Cm7', 'Fm7', 'A#M7', 'C#M7', 'G#M7', 'AM', 'Bm', 'EM7', 'BM7', 'EM',
       'DM', 'F#M7', 'Gm', 'C#m', 'BM', 'Em', 'GM', 'Cm', 'F#m', 'D#m', 'CM',
       'Dm', 'GM7', 'G#M', 'D#M7', 'FM', 'CM7', 'AM7', 'Am', 'Fm', 'Em7', 'Bm7',
       'F#m7', 'DM7', 'Am7', 'F#M', 'A#M', 'Gm7', 'G#m', 'A#m', 'D#M', 'C#M',
       'Dm7', 'FM7', 'D#m7', 'A#m7', 'G#m7', 'C#m7']

In [37]:
# Rename columns and index based on new_labels
chords_transitions = new_df.rename(columns=dict(zip(new_df.columns, new_labels)),
                       index=dict(zip(new_df.index, new_labels)))

In [38]:
chords_transitions

Unnamed: 0,Cm7,Fm7,A#M7,C#M7,G#M7,AM,Bm,EM7,BM7,EM,...,G#m,A#m,D#M,C#M,Dm7,FM7,D#m7,A#m7,G#m7,C#m7
Cm7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Fm7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A#M7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
C#M7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
G#M7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AM,0.0,0.0,0.0,0.0,0.0,0.071429,0.015306,0.372449,0.015306,0.122449,...,0.0,0.010204,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Bm,0.0,0.0,0.0,0.0,0.0,0.043478,0.065217,0.108696,0.021739,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
EM7,0.0,0.0,0.0,0.0,0.0,0.676829,0.0,0.018293,0.0,0.02439,...,0.006098,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
BM7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.006329,0.721519,...,0.0,0.0,0.0,0.006329,0.0,0.0,0.0,0.0,0.0,0.0
EM,0.0,0.0,0.0,0.0,0.0,0.105769,0.0,0.038462,0.514423,0.033654,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.004808


Finally, we save this new matrix as a csv file:

In [39]:
chords_transitions.to_csv('transition_matrix.csv')
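
When the file is read back, the chord labels are stored in the first column, so `index_col=0` is needed to restore them as the index (otherwise that column shows up as `Unnamed: 0`). A minimal round-trip sketch, using an in-memory buffer and made-up values instead of the real file:

```python
import io
import pandas as pd

# Toy matrix with made-up probabilities
toy = pd.DataFrame([[0.0, 1.0],
                    [0.25, 0.75]],
                   index=['AM', 'EM7'], columns=['AM', 'EM7'])

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# Read it back, using the first column as the row labels
loaded = pd.read_csv(buffer, index_col=0)

# Probability of moving from EM7 to AM
print(loaded.loc['EM7', 'AM'])  # 0.25
```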