# Introduction

In this project, we build a music generator using machine-learning techniques. Our model consists of boosted neural networks trained on Chopin's Nocturnes. Since the primary objective is to let the machine complete a given sample of music, we also include the sample composition in the training data. We include percent-error graphs to show the fraction of misclassifications. To generate music, instead of using the argmax function, which would lead to a deterministic music generator, we compute exponentially normalized (softmax) probabilities and always sample from the two most probable classes. 

In Section 2, we describe the data-gathering process. Section 3 details the training of the model. We present the music generation with our model in Section 4. In Section 5, we conclude.

# Build

We train our model on Chopin's Nocturnes because of their elegant simplicity. From MuseScore.com, we obtain the free, editable MuseScore sheet-music files for all 21 Nocturnes, though we use only six of them in this project due to technicalities and time limitations. In the sheet music, we remove _everything_--dynamics, tempo, ornamentations, articulations, slurs and ties, voicings--and retain only the bare melodic pitch and rhythm to create a monophonic melody line. We also transpose all the pieces to the key of C major so that they share a common key during training. After this simplification of the sheet music, we export each score as a MIDI file and switch to Python.

Using the `pretty_midi` module, we extract the pitch and duration of each note played in a given MIDI file. To process the data, we define a few quantities. First, we define the _principal value_ (PV) of a note by numbering the twelve notes of Western music as C = 0, C\# = 1, D = 2, ..., A\# = 10, and B = 11. Using the octave number, O, we define the _absolute value_ (AV) as AV = 12 O + PV. We also want to convert the duration, which is given in seconds, into the canonical time values referred to as whole, half, quarter, etc. To do that, we convert the given duration, D, of a note into a _canonical duration_ (CD), which is the duration the note would have if the piece were played at 60 beats per minute (BPM), via CD = D BPM / 60. Since a quarter note lasts one second at 60 BPM, we assign each note a _canonical time value label_ (CTVL) accordingly. After this "raw" analysis, we prepare the raw data table for each piece, i.e. each MIDI file, so that it looks like

```
Note  Octave  Duration (s)  PV  AV  CD (s)  CTVL  b
--------------------------------------------------------
'A'   4       0.7484        9   57  1       5     10
'E'   5       0.7484        4   64  1       5     10
'E'   5       0.7484        4   64  1       5     10
...
'C'   5       0.3734        0   60  0.5     8     10
'B'   4       0.3734        11  59  0.5     8     10
```

Here, CTVL is the index into an ordered list of `whole`, `half`, `quarter`, `eighth`, etc. For instance, CTVL = 5 means a quarter note, whilst CTVL = 8 means an eighth note. The last column holds the weights, $b$, that are used in the cost function. For the Nocturnes, we use a weight of 1, whilst for the sample piece to be completed by the machine we use 10 because we want the machine to pick up the musical character of the user's sample. 
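The definitions of PV, AV, and CD can be sketched in a few lines of plain Python. This is a minimal illustration and not the project code: the names `NOTE_NAMES`, `principal_value`, `absolute_value`, and `canonical_duration` are hypothetical, and the mapping from CD to CTVL is left out because the full ordered list of time values is not reproduced here.

```python
# Hypothetical helpers illustrating the definitions in the text;
# the names are assumptions, not taken from the project code.
NOTE_NAMES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def principal_value(name):
    """PV: position of the note within the octave, C = 0, ..., B = 11."""
    return NOTE_NAMES.index(name)

def absolute_value(name, octave):
    """AV = 12 O + PV."""
    return 12 * octave + principal_value(name)

def canonical_duration(duration_s, bpm):
    """CD = D BPM / 60: the duration the note would have at 60 BPM."""
    return duration_s * bpm / 60.0

print(absolute_value('A', 4))        # 57, as in the first row of the table
print(canonical_duration(0.75, 80))  # 1.0, i.e. a quarter note
```

With these conventions, the first row of the table ('A', octave 4) gives PV = 9 and AV = 57.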

Once we obtain the raw data files, we move on to processing them. To this end, we need to determine the features and the target(s). From experience, we employ the following notion of composition: a set of $g$ notes determines the $(g+1)$th note. With $g=3$ in this project, given $\vec x_1$, $\vec x_2$, and $\vec x_3$, we generate $\vec y$. Here, the $\vec x_i$ and $\vec y$ are three-dimensional vectors with components (PV, O, CTVL). Namely, from 9 features, we generate 3 targets. We assume that the targets are independent of one another. We refer to these 9 features as the _natural features_. However, we also know that this is not the entire story behind composing a musical piece. We want to know 

- the distance between the first and second notes, the first and third notes, and the second and third notes, $d_{ij}$,

- if the first or second or third note is in key, $k_i$,

- if the direction between the first and second notes, the first and third notes, or the second and third notes is ascending, descending, or if said pair is identical, ${\rm dir}_{ij}$, and

- if the first and second notes, the first and third notes, or the second and third notes are consonant, $c_{ij}$.

With these, we obtain the _engineered_ features:

$$d_{ij} = {\rm AV}_j - {\rm AV}_i,$$

$$k_i = 1 \ {\rm if} \ {\rm PV}_i \in \{0, 2, 4, 5, 7, 9, 11\} \ {\rm else} \ 0,$$

which is the major scale in the canonical key, which is C,

$${\rm dir}_{ij} = 1 \ {\rm if} \ {\rm AV}_j > {\rm AV}_i, \ -1\ {\rm if} \ {\rm AV}_j < {\rm AV}_i, \ {\rm else} \ 0,$$

and

$$c_{ij} = 1 \ {\rm if} \ {\rm AV}_j - {\rm AV}_i \in \{0, 3, 4, 7\} \ {\rm else} \ 0,$$

which is when the notes $i$ and $j$ are in unison or the minor third, major third, or perfect fifth of one another.

With the natural and engineered features combined, the processed data for each piece has 21 features, 3 targets, and 1 weight. We collectively denote the features by $\vec x$, the targets $y_1$, $y_2$, and $y_3$, and the weights $b$. 
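As a minimal sketch of how the 21 features could be assembled from three consecutive notes, consider the function below. The name `engineer_features` and the ordering of the features in the output are assumptions made for illustration; the formulas follow the definitions of $d_{ij}$, $k_i$, ${\rm dir}_{ij}$, and $c_{ij}$ given above, including the signed difference in $c_{ij}$.

```python
# Hypothetical sketch: name and feature ordering are assumptions.
MAJOR_SCALE = {0, 2, 4, 5, 7, 9, 11}   # PVs of the C major scale
CONSONANT = {0, 3, 4, 7}               # unison, minor/major third, perfect fifth
PAIRS = [(0, 1), (0, 2), (1, 2)]       # (first, second), (first, third), (second, third)

def engineer_features(notes):
    """notes: three (PV, O, CTVL) tuples. Returns a list of 21 features."""
    av = [12 * o + pv for pv, o, _ in notes]            # absolute values
    natural = [v for note in notes for v in note]       # 9 natural features
    d = [av[j] - av[i] for i, j in PAIRS]               # distances d_ij
    k = [1 if pv in MAJOR_SCALE else 0 for pv, _, _ in notes]  # in-key flags k_i
    direction = [(dij > 0) - (dij < 0) for dij in d]    # +1, -1, or 0
    c = [1 if dij in CONSONANT else 0 for dij in d]     # consonance flags c_ij
    return natural + d + k + direction + c

# The first three notes of the table: A4, E5, E5, all quarter notes.
features = engineer_features([(9, 4, 5), (4, 5, 5), (4, 5, 5)])
print(len(features))  # 21
```

For these three notes, the distances are 7, 7, and 0 semitones, so every pair counts as consonant under the definition above.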

# Train

We choose a model with boosted neural networks:

$${\rm model}(\vec x, \vec \Theta_m) = {\rm model}(\vec x, \vec \Theta_{m-1}) + \vec w_{0, m} + \vec w_{1, m} f(\vec x, \vec u),$$

where the class index is implicit, and

$$f(\vec x, \vec u) = \tanh(\mathring{\vec x} \cdot \vec u)$$

are the neural-network units. Here, $\mathring{\vec x}$ represents the features with a 1 appended for the bias term, and the $w$s and the $u$s constitute the set of weights, $\vec \Theta$. For the 0th round, we have

$${\rm model}(\vec x, \vec \Theta_0) = \vec w_0.$$

We assume a multiclass softmax cost for each target, $y_a$, $a=1,2,3$:

$$g_a (\vec \Theta_m) = - \frac{1}{\sum_{p=1}^P b_p} \sum_{p=1}^P b_p \log \left( \frac{e^{{\rm model}(\vec x_p, \vec \Theta_m)_{y_{a,p}}}}{\sum_{c=1}^{C_a} e^{{\rm model}(\vec x_p, \vec \Theta_m)_c}} \right),$$

where the target index of the weights is implicit. Target 1 (the PVs) has labels 0, 1, 2, ..., 11, target 2 (the Os) has labels 0, 1, 2, ..., 7, and target 3 (the CTVLs) has labels 0, 1, 2, ..., 13; thus, we have $C_1 = 12$, $C_2 = 8$, and $C_3 = 14$. The total number of data points is $P = 2525$. We minimize each cost by gradient descent with a constant learning rate of $\alpha = 0.01$:

$$\vec \Theta^k = \vec \Theta^{k-1} - \alpha \vec \nabla g(\vec \Theta^{k-1}).$$

To avoid overflow, rather than standardizing the features, we divide each feature by its maximum possible value so that all the features lie between $-1$ and $+1$. Finally, we use a maximum of $M = 30$ boosting rounds and $K = 1000$ gradient-descent iterations per round. Below, we present the fundamental part of the code.

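```python
# Assumption: `grad` is Autograd's, which requires its NumPy wrapper.
import autograd.numpy as np
from autograd import grad

# x (P x N features), b (P weights), y_1, y_2, y_3 (integer labels),
# C_1, C_2, C_3, and N come from the data-processing step, not shown here.

def model(x, Theta, m, C):
  """Return the (P, C) class scores after boosting round m."""
  if m == 0:
    # Round 0: one constant bias per class, broadcast over all data points.
    return np.zeros((x.shape[0], C)) + Theta[:, 0, 0]
  w0 = Theta[:, m, 0]   # output bias of the round-m unit, one per class
  w1 = Theta[:, m, 1]   # output weight of the round-m unit
  u0 = Theta[:, m, 2]   # inner bias of the unit
  u = Theta[:, m, 3:]   # inner weights of the unit, shape (C, N)
  f = np.tanh(u0 + np.dot(x, u.T))
  return model(x, Theta, m - 1, C) + w0 + w1 * f

def g(Theta, y, m, C):
  """Weighted multiclass softmax cost for one target at round m."""
  num = np.exp(model(x, Theta, m, C))
  den = np.sum(num, axis = 1)
  one_hot = np.eye(C)[y]
  num = np.sum(num * one_hot, axis = 1)
  return -np.sum(b * np.log(num / den)) / np.sum(b)

dg = grad(g, 0)  # gradient of the cost with respect to Theta

def train(y, M, K, alpha = 0.01):
  # Pick the number of classes for the target being trained.
  if np.array_equal(y, y_1):
    C = C_1
    target = '1'  # target label (unused in this excerpt)
  elif np.array_equal(y, y_2):
    C = C_2
    target = '2'
  elif np.array_equal(y, y_3):
    C = C_3
    target = '3'
  else:
    raise ValueError('y must be one of y_1, y_2, y_3')
  Theta = np.zeros((C, M + 1, N + 3))
  # Random initial values for the inner weights (u0 and u) of every unit.
  for m in range(1, M + 1):
    Theta[:, m, 2:] = 4.0 * np.random.randn(C, N + 1)
  # Round 0: fit the constant bias only.
  for k in range(K):
    grads = dg(Theta, y, 0, C)
    Theta[:, 0, 0] -= alpha * grads[:, 0, 0]
  # Rounds 1..M: fit one new unit per round, keeping earlier rounds fixed.
  for m in range(1, M + 1):
    for k in range(K):
      grads = dg(Theta, y, m, C)
      Theta[:, m, :] -= alpha * grads[:, m, :]
  return Theta

M = 30
K = 1000
Theta_1 = train(y_1, M, K)
Theta_2 = train(y_2, M, K)
Theta_3 = train(y_3, M, K)
```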
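
The cost above exponentiates the raw model outputs. As an optional safeguard against overflow, the same weighted softmax cost can be evaluated after subtracting the row-wise maximum from the scores, which leaves the value unchanged. The sketch below is not part of the project code; it uses plain NumPy, and the name `weighted_softmax_cost` and its arguments are assumptions made for illustration.

```python
import numpy as np

def weighted_softmax_cost(scores, y, b):
    """Weighted multiclass softmax cost from a (P, C) array of class scores.

    scores: model outputs, y: integer labels (length P), b: weights (length P).
    Hypothetical helper; subtracting the row maximum avoids overflow in exp.
    """
    scores = np.asarray(scores, dtype=float)
    y = np.asarray(y)
    b = np.asarray(b, dtype=float)
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_norm = np.log(np.exp(shifted).sum(axis=1))        # log of the denominator
    log_prob = shifted[np.arange(len(y)), y] - log_norm   # log-probability of the true class
    return -np.sum(b * log_prob) / np.sum(b)

# With identical scores for all classes, the cost equals log(C) for any weights.
cost = weighted_softmax_cost(np.zeros((4, 3)), [0, 1, 2, 0], [1, 1, 10, 10])
print(np.isclose(cost, np.log(3)))
```

Because the shift cancels between the numerator and the denominator, this form agrees with the cost $g_a$ defined above whenever the latter can be evaluated without overflow.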

Below, we present the plots of the misclassification percentage vs. boosting round for each target, $y_1$, $y_2$, and $y_3$:

<img src="./out/figures/big_data_with/percent_error_target_3.pdf" alt="percent_error" width="50%">

With our current assumptions, we achieve a classification accuracy of $25\%$ for target 1 (the note), $65\%$ for target 2 (the octave value), and $35\%$ for target 3 (the time value).
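
For reference, a misclassification percentage of this kind can be computed from the model outputs as sketched below. This is an illustration under stated assumptions rather than the project code: the name `percent_error` is hypothetical, the predicted class is taken to be the argmax of the scores, and every data point counts equally regardless of its weight $b$.

```python
import numpy as np

def percent_error(scores, y):
    """Percentage of data points whose argmax class differs from the label.

    scores: (P, C) array of model outputs, y: integer labels of length P.
    Hypothetical helper; unweighted counting is an assumption.
    """
    predictions = np.argmax(np.asarray(scores), axis=1)
    return 100.0 * np.mean(predictions != np.asarray(y))

scores = np.array([[2, 1, 0], [0, 3, 1], [1, 0, 5], [4, 0, 1]])
print(percent_error(scores, [0, 1, 2, 1]))  # 25.0: one of four points is misclassified
```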

# Generate

Now we are ready to generate music. We go back to the data of the sample composition, take a copy of it with the three columns PV, O, and CTVL, take the last three notes as three-dimensional vectors, i.e. 9 _natural_ features, engineer the remaining 12 features, and generate target values using the corresponding set of weights. Instead of using the argmax function, we compute the exponentially normalized probability of each class for a given target and sample from the two most probable classes. For example, for the first target, we have

$$z_c = {\rm model}(\vec x, \Theta_1)_c \to \mathcal P_c = \frac{e^{z_c}}{\sum_{c'=1}^{C_1} e^{z_{c'}}},$$

and sample one of the two most probable classes as

&emsp;random.choice(argsort($\mathcal P$)[-2:], p = $\mathcal P$[argsort($\mathcal P$)[-2:]])

using NumPy, where the two probabilities passed as p must first be renormalized to sum to one. We append the $y_1$, $y_2$, and $y_3$ values to the data table and repeat this process until the desired length of the composition, in seconds, is reached in accordance with the specified BPM value.
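
The sampling step can be sketched as follows. This is a minimal illustration and not the project code: the name `sample_top_two` is hypothetical, the scores are shifted by their maximum before exponentiation for numerical safety, and the two retained probabilities are renormalized because NumPy requires the `p` argument to sum to one.

```python
import numpy as np

def sample_top_two(z, rng):
    """Sample a class index from the two most probable classes.

    z: class scores for one target, rng: a NumPy random Generator.
    Hypothetical helper illustrating the top-two sampling described above.
    """
    z = np.asarray(z, dtype=float)
    p = np.exp(z - z.max())     # shifting by the maximum leaves the softmax unchanged
    p /= p.sum()                # exponentially normalized probabilities
    top = np.argsort(p)[-2:]    # indices of the two most probable classes
    return int(rng.choice(top, p=p[top] / p[top].sum()))

rng = np.random.default_rng(0)
draws = [sample_top_two([0.0, 1.0, 5.0, 4.0], rng) for _ in range(10)]
print(set(draws) <= {2, 3})  # True: only the two most probable classes appear
```

Restricting the draw to two classes keeps the generator non-deterministic while excluding the less probable classes entirely.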

# Conclusion

In this project, we have built a _not-so-random_ composer using machine-learning techniques, based on a prior, independently developed _random_ composer project. We have used a boosted model of neural networks trained on Chopin's Nocturnes as well as a sample composition. In the end, we have created a piece in MIDI format that completes the sample composition and can be imported into other software such as MuseScore and GarageBand to produce music on top of it. 

This project has served as a crude introduction to music production with AI tools using an in-house code. Based on hands-on experience, we make the following remarks. We have assumed that any three consecutive notes determine the fourth one and that the note value of the target is independent of its octave and time values. In practice, this is not always true. A future study should use a larger $g$ value and account for correlations among the target components. Due to computational costs, we have used somewhat small values of the total number of rounds, $M$, and the total number of iterations, $K$. A subsequent study should use higher values with further optimization of the training. 

We have not shown the entire code here because of space limitations; thus, a zip of the code is attached with the package and the repo is publicly available at [github.com/kagsimsek/not-so-random_compose](https://github.com/kagsimsek/not-so-random_compose).