# Improvise a Jazz Solo with an LSTM Network

 In this notebook, we will implement a model that uses an LSTM to generate music. At the end, you'll even be able to listen to your own music! 

<img src="images/jazz.jpg" style="width:450;height:300px;">

**By the end of this assignment, you'll be able to:**


- Apply an LSTM to a music generation task
- Generate your own jazz music with deep learning
- Use the flexible Functional API to create complex models

This is going to be a fun one. Let's get started!

In [None]:
import numpy as np
import tensorflow as tf

from tensorflow.keras.layers import Dense, Activation, Dropout, Input, LSTM, Reshape, Lambda, RepeatVector
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical

In [None]:
!pip install audioop-lts
import IPython
import sys
import matplotlib.pyplot as plt
from music21 import *
from jazzlstm.grammar import *
from jazzlstm.qa import *
from jazzlstm.preprocess import * 
from jazzlstm.music_utils import *
from jazzlstm.data_utils import *
from jazzlstm.outputs import *
from jazzlstm.test_utils import *

<a name='1'></a>
## 1 - Problem Statement

You would like to create a jazz music piece specially for a friend's birthday. However, you don't know how to play any instruments, or how to compose music. Fortunately, you know deep learning and will solve this problem using an LSTM network! 

You will train a network to generate novel jazz solos in a style representative of a body of performed work. 😎🎷

<a name='1-1'></a>
### 1.1 - Dataset

To get started, you'll train your algorithm on a corpus of jazz music. Run the cell below to listen to a snippet of the audio from the training set:

In [None]:
IPython.display.Audio('./data/30s_seq.wav')

#### What are musical "values"? (optional)
You can informally think of each "value" as a note, which comprises a pitch and duration. For example, if you press down a specific piano key for 0.5 seconds, then you have just played a note. In music theory, a "value" is actually more complicated than this -- specifically, it also captures the information needed to play multiple notes at the same time. For example, when playing a music piece, you might press down two piano keys at the same time (playing multiple notes at the same time generates what's called a "chord"). But you don't need to worry about the details of music theory for this notebook. 

The preprocessing of the musical data has been taken care of already, which for this notebook means it's been rendered in terms of musical "values." 
* For the purposes of this notebook, all you need to know is that you'll obtain a dataset of values, and will use an RNN model to generate sequences of values. 
* Your music generation system will use 90 unique values. 

Run the following code to load the raw music data and preprocess it into values. This might take a few minutes!

In [None]:
X, Y, n_values, indices_values, chords = load_music_utils('data/original_metheny.mid')
print('number of training examples:', X.shape[0])
print('Tx (length of sequence):', X.shape[1])
print('total # of unique values:', n_values)
print('shape of X:', X.shape)
print('Shape of Y:', Y.shape)
print('Number of chords', len(chords))


print("Data loaded. n_values =", n_values)
print("Number of measures:", len(measures) if 'measures' in globals() else "measures not returned — see below")

In [None]:
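n_a = 64                                           # LSTM hidden-state size, needed here before the layers are created
reshaper = Reshape((1, n_values))                  # reshapes one time step to (1, n_values)
LSTM_cell = LSTM(n_a, return_state=True)           # shared LSTM cell that also returns its hidden and cell states
densor = Dense(n_values, activation='softmax')     # output layer with n_values units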



You have just loaded the following:

- `X`: This is an (m, $T_x$, 90) dimensional array. 
    - You have m training examples, each of which is a snippet of $T_x =30$ musical values. 
    - At each time step, the input is one of 90 different possible values, represented as a one-hot vector. 
        - For example, X[i,t,:] is a one-hot vector representing the value of the i-th example at time t. 

- `Y`: a $(T_y, m, 90)$ dimensional array
    - This is essentially the same as `X`, but shifted one step to the left (to the past). 
    - Notice that the data in `Y` is **reordered** to be dimension $(T_y, m, 90)$, where $T_y = T_x$. This format makes it more convenient to feed into the LSTM later.
    - Similar to the dinosaur assignment, you're using the previous values to predict the next value.
        - So your sequence model will try to predict $y^{\langle t \rangle}$ given $x^{\langle 1\rangle}, \ldots, x^{\langle t \rangle}$. 

- `n_values`: The number of unique values in this dataset. This should be 90. 

- `indices_values`: Python dictionary mapping integers 0 through 89 to musical values.

- `chords`: Chords used in the input MIDI file.
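
To make the shapes above concrete, here is a minimal NumPy sketch that builds a toy `X` and `Y` from integer-coded windows. It is an illustration under assumptions, not the implementation of `load_music_utils`: the names `toy_windows`, `n_toy_values`, and `one_hot` are hypothetical, and the toy vocabulary has 5 values rather than 90.

```python
import numpy as np

n_toy_values = 5          # hypothetical vocabulary size (the notebook uses n_values)
Tx = 4                    # hypothetical sequence length (the notebook uses 30)

# Hypothetical integer-coded windows, each of length Tx + 1 so that
# inputs and next-step targets can be cut from the same window.
toy_windows = np.array([[0, 2, 1, 4, 3],
                        [3, 3, 0, 1, 2]])

def one_hot(indices, depth):
    """Return one-hot rows for an integer array of any shape."""
    return np.eye(depth)[indices]

X_toy = one_hot(toy_windows[:, :-1], n_toy_values)   # (m, Tx, n): inputs
Y_toy = one_hot(toy_windows[:, 1:], n_toy_values)    # (m, Tx, n): next values
Y_toy = np.swapaxes(Y_toy, 0, 1)                     # (Ty, m, n): time axis first

print(X_toy.shape, Y_toy.shape)
```

In this sketch, `Y_toy[t, i]` is the one-hot vector of the value that follows `X_toy[i, t]`, which is the "shifted by one step, time axis first" layout described above.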

<a name='1-2'></a>
### 1.2 - Model Overview

Here is the architecture of the model you'll use. You'll implement it in Keras.

<img src="images/music_generation.png" style="width:600;height:400px;">
<caption><center><font color='purple'><b>Figure 1</b>: Basic LSTM model </font></center></caption>


* $X = (x^{\langle 1 \rangle}, x^{\langle 2 \rangle}, \cdots, x^{\langle T_x \rangle})$ is a window of size $T_x$ scanned over the musical corpus. 
* Each $x^{\langle t \rangle}$ corresponds to one musical value and is represented as a one-hot vector.
* $\hat{y}^{\langle t \rangle}$ is the prediction for the next value.
* You'll be training the model on random snippets of 30 values taken from a much longer piece of music. 
    - Thus, you won't bother to set the first input $x^{\langle 1 \rangle} = \vec{0}$, since most of these snippets of audio start somewhere in the middle of a piece of music. 
    - You're setting each of the snippets to have the same length $T_x = 30$ to make vectorization easier.
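
The idea of cutting equal-length snippets at random positions can be sketched in a few lines of plain Python. This is only an illustration: `sample_windows` and `toy_corpus` are hypothetical names, and the notebook's own preprocessing may select its snippets differently.

```python
import random

def sample_windows(corpus, num_windows, window_len, seed=0):
    """Cut `num_windows` snippets of `window_len` items at random offsets."""
    rng = random.Random(seed)
    last_start = len(corpus) - window_len
    windows = []
    for _ in range(num_windows):
        start = rng.randint(0, last_start)   # inclusive on both ends
        windows.append(corpus[start:start + window_len])
    return windows

toy_corpus = list(range(100))       # hypothetical stand-in for a long piece
snippets = sample_windows(toy_corpus, num_windows=3, window_len=30)
print([s[0] for s in snippets])     # starting value of each snippet
```

Because every snippet has the same length, the snippets can be stacked into a single array, and most of them begin somewhere in the middle of the piece.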

## 2 - Building the Model

Now, you'll build and train a model that will learn musical patterns. 
* The model takes input X of shape $(m, T_x, 90)$ and labels Y of shape $(T_y, m, 90)$. 
* You'll use an LSTM with hidden states that have $n_{a} = 64$ dimensions.

In [None]:
# number of dimensions for the hidden state of each LSTM cell.
n_a = 64 

#### Sequence generation uses a for-loop
* If you're building an RNN where, at test time, the entire input sequence $x^{\langle 1 \rangle}, x^{\langle 2 \rangle}, \ldots, x^{\langle T_x \rangle}$ is given in advance, then Keras has simple built-in functions to build the model. 
* However, for **sequence generation, at test time you won't know all the values of $x^{\langle t\rangle}$ in advance**.
* Instead, you'll generate them one at a time using $x^{\langle t\rangle} = y^{\langle t-1 \rangle}$. 
    * The input at time "t" is the prediction at the previous time step "t-1".
* So you'll need to implement your own for-loop to iterate over the time steps. 
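
The feedback loop described above can be sketched with NumPy alone. This is a toy under stated assumptions: `toy_step` is a hypothetical simple recurrent step with fixed random weights (not an LSTM and not a trained model), and all sizes are made up. The point is only the loop structure, where each prediction is fed back as the next input.

```python
import numpy as np

rng = np.random.default_rng(0)
n_toy_values, n_hidden = 6, 8       # hypothetical sizes

# Hypothetical fixed random weights standing in for a trained model.
W_x = rng.normal(size=(n_toy_values, n_hidden))
W_a = rng.normal(size=(n_hidden, n_hidden))
W_y = rng.normal(size=(n_hidden, n_toy_values))

def toy_step(x, a):
    """One recurrent step: returns (probabilities over values, new state)."""
    a_next = np.tanh(x @ W_x + a @ W_a)
    logits = a_next @ W_y
    probs = np.exp(logits - logits.max())
    return probs / probs.sum(), a_next

x = np.zeros(n_toy_values)          # no previous value at the first step
a = np.zeros(n_hidden)
generated = []
for t in range(10):
    y, a = toy_step(x, a)
    idx = int(np.argmax(y))
    generated.append(idx)
    x = np.eye(n_toy_values)[idx]   # the prediction becomes the next input

print(generated)
```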

#### Shareable weights
* The function `djmodel()` will call the LSTM layer $T_x$ times using a for-loop.
* It is important that all $T_x$ copies have the same weights. 
    - The $T_x$ steps should have shared weights that aren't re-initialized.
* Referencing a globally defined shared layer will utilize the same layer-object instance at each time step.
* The key steps for implementing layers with shareable weights in Keras are: 
1. Define the layer objects (you'll use global variables for this).
2. Call these objects when propagating the input.
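
The two steps above can be checked on a tiny example. The sketch below is separate from the music model and uses hypothetical names (`shared`, `toy_model`): it creates one `Dense` layer object and calls it on two different inputs.

```python
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model

shared = Dense(4, name="shared_dense")   # one layer object, created once

in_1 = Input(shape=(3,))
in_2 = Input(shape=(3,))
out_1 = shared(in_1)     # first call builds the kernel and bias
out_2 = shared(in_2)     # second call reuses the same kernel and bias

toy_model = Model(inputs=[in_1, in_2], outputs=[out_1, out_2])
print(len(toy_model.trainable_weights))
```

The model reports two trainable tensors (one kernel and one bias) even though the layer is called twice, because both calls go through the same layer object. Creating a new `Dense(4)` inside a loop would instead create a fresh set of weights on every iteration.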

#### 3 types of layers
* The layer objects you need as global variables were created in an earlier cell, right after the data was loaded.  
    * Make sure you have run that cell before continuing.
* Please read the Keras documentation and understand these layers: 
    - [Reshape()](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Reshape): Reshapes an output to a certain shape.
    - [LSTM()](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM): Long Short-Term Memory layer
    - [Dense()](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense): A regular fully-connected neural network layer.


* `reshaper`, `LSTM_cell` and `densor` are globally defined layer objects that you'll use to implement `djmodel()`. 
* In order to propagate a Keras tensor object X through one of these layers, use `layer_object()`.
    - For one input, use `layer_object(X)`
    - For more than one input, put the inputs in a list: `layer_object([X1,X2])`

In [None]:
def djmodel(Tx, LSTM_cell, densor, reshaper):
    """
    Implement the djmodel composed of Tx LSTM cells where each cell is responsible
    for learning the following note based on the previous note and context.
    """
    # Get shapes
    n_values = densor.units
    n_a = LSTM_cell.units
    
    # Define inputs
    X = Input(shape=(Tx, n_values), name='X')
    a0 = Input(shape=(n_a,), name='a0')
    c0 = Input(shape=(n_a,), name='c0')
    
    # Initialize states
    a = a0
    c = c0
    
    outputs = []
    
    for t in range(Tx):
        # Select t-th time step
        x = X[:, t, :]
        
        # Reshape to (batch, 1, n_values)
        x = reshaper(x)
        
        # LSTM step – IMPORTANT: pass x positionally, NOT with inputs=
        _, a, c = LSTM_cell(x, initial_state=[a, c])
        
        # Dense prediction
        out = densor(a)
        
        # Collect output
        outputs.append(out)
    
    # Create model
    model = Model(inputs=[X, a0, c0], outputs=outputs)
    
    return model

In [None]:
### YOU CANNOT EDIT THIS CELL



model = djmodel(Tx=30, LSTM_cell=LSTM_cell, densor=densor, reshaper=reshaper)

model.compile(
    optimizer=Adam(learning_rate=0.01),
    loss='categorical_crossentropy',
    metrics=[['accuracy']] * 30    # one accuracy metric per output
)

print("Model built with correct n_values =", n_values)
print("Expected X shape:", model.inputs[0].shape)   # should be (None, 30, n_values)
print("Actual X shape:", X.shape)                   # should match

In [None]:
model.summary()

#### Initialize hidden state and cell state
Finally, let's initialize `a0` and `c0` for the LSTM's initial state to be zero. 

In [None]:
m = 60
a0 = np.zeros((m, n_a))
c0 = np.zeros((m, n_a))

#### Train the model

You're now ready to fit the model! 

* You'll turn `Y` into a list, since the cost function expects `Y` to be provided in this format. 
    - `list(Y)` is a list with 30 items, where each of the list items is of shape (60,90). 
    - Train for 100 epochs (This will take a few minutes). 
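
To see what `list(Y)` produces, here is a small NumPy sketch with hypothetical toy sizes in place of the notebook's 30, 60, and 90:

```python
import numpy as np

Ty, m, n_toy_values = 3, 2, 5                 # hypothetical small sizes
Y_toy = np.zeros((Ty, m, n_toy_values))

targets = list(Y_toy)                         # iterates over the first axis
print(len(targets), targets[0].shape)
```

This prints `3 (2, 5)`: one target array per time step, each holding all `m` examples. That matches a model with one output per time step, which is why `Y` is stored with the time axis first.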

In [None]:
history = model.fit([X, a0, c0], y = list(Y), epochs=100, verbose = 0)

In [None]:
print(f"loss at epoch 1: {history.history['loss'][0]}")
print(f"loss at epoch 100: {history.history['loss'][99]}")
plt.plot(history.history['loss'])

## 3 - Generating Music

You now have a trained model which has learned the patterns of a jazz soloist. You can now use this model to synthesize new music! 

<a name='3-1'></a>
### 3.1 - Predicting & Sampling

<img src="images/music_gen.png" style="width:600;height:400px;">
<center><caption><font color='purple'><b>Figure 2</b>: Generating new values in an LSTM </font></caption></center>

At each step of sampling, you will:
* Take as input the activation '`a`' and cell state '`c`' from the previous state of the LSTM.
* Forward propagate by one step.
* Get a new output activation, as well as cell state. 
* The new activation '`a`' can then be used to generate the output using the fully connected layer, `densor`. 
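
Once `densor` has produced probabilities, they still have to be turned into the next input. A minimal NumPy sketch of that conversion is shown below, assuming the most likely value is picked at each step. `next_input_from_probs` and `toy_probs` are hypothetical names, and the Keras version in this notebook performs the same three operations with layers instead.

```python
import numpy as np

def next_input_from_probs(probs):
    """Map (batch, n) probabilities to a (batch, 1, n) one-hot input."""
    n = probs.shape[-1]
    idx = np.argmax(probs, axis=-1)          # (batch,)
    one_hot = np.eye(n)[idx]                 # (batch, n)
    return one_hot[:, np.newaxis, :]         # (batch, 1, n)

toy_probs = np.array([[0.1, 0.7, 0.2],
                      [0.5, 0.2, 0.3]])
x_next = next_input_from_probs(toy_probs)
print(x_next.shape)
```

The extra middle axis is the time dimension of length 1, which an LSTM layer expects even when it processes a single step.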

#### Initialization
* You'll initialize the following to be zeros:
    * `x0` 
    * hidden state `a0` 
    * cell state `c0` 

In [None]:
def music_inference_model(LSTM_cell, densor, Ty=50):
    """
    Uses the trained "LSTM_cell" and "densor" from djmodel() to generate a sequence of values.
    """
    n_values = densor.units
    n_a = LSTM_cell.units

    # Input for first timestep (start token / zero vector)
    x0 = Input(shape=(1, n_values), name='x0')

    # Initial states
    a0 = Input(shape=(n_a,), name='a0')
    c0 = Input(shape=(n_a,), name='c0')

    a = a0
    c = c0
    x = x0

    outputs = []

    for t in range(Ty):
        # LSTM step
        _, a, c = LSTM_cell(x, initial_state=[a, c])

        # Dense → probabilities
        out = densor(a)

        # Append current timestep prediction (useful for debugging / sampling)
        outputs.append(out)

        # ────────────────────────────────────────────────
        # Next input: argmax → one-hot → (None, 1, n_values)
        # ────────────────────────────────────────────────

        # 1. argmax → index of highest probability (shape: (None,))
        pred_index = Lambda(
            lambda y: tf.math.argmax(y, axis=-1),
            output_shape=()                    # scalar per batch item
        )(out)

        # 2. one-hot encode the predicted index
        # Output shape: (None, n_values)
        x_onehot = Lambda(
            lambda idx: tf.one_hot(idx, depth=n_values),
            output_shape=(n_values,)           # ← required: explicit shape
        )(pred_index)

        # 3. RepeatVector(1) → add time dimension
        x = RepeatVector(1)(x_onehot)

    # Create the inference model
    inference_model = Model(inputs=[x0, a0, c0], outputs=outputs)

    return inference_model

In [None]:
# Define your inference model. This model is hard-coded to generate 50 values.
inference_model = music_inference_model(LSTM_cell, densor, Ty = 50)

In [None]:
# Check the inference model
inference_model.summary()

#### Initialize inference model
The following code creates the zero-valued vectors you will use to initialize `x` and the LSTM state variables `a` and `c`. 

In [None]:
x_initializer = np.zeros((1, 1, n_values))
a_initializer = np.zeros((1, n_a))
c_initializer = np.zeros((1, n_a))

In [None]:

def predict_and_sample(inference_model, x_initializer = x_initializer, a_initializer = a_initializer,
                       c_initializer = c_initializer):
    """
    Predicts a sequence of values using the inference model.
   
    Arguments:
    inference_model -- Keras model instance for inference time
    x_initializer -- numpy array of shape (1, 1, n_values), input vector initializing the value generation
    a_initializer -- numpy array of shape (1, n_a), initializing the hidden state of the LSTM_cell
    c_initializer -- numpy array of shape (1, n_a), initializing the cell state of the LSTM_cell
   
    Returns:
    results -- numpy-array of shape (Ty, n_values), matrix of one-hot vectors representing the values generated
    indices -- numpy-array of shape (Ty, 1), matrix of indices representing the values generated
    """
   
    n_values = x_initializer.shape[2]
   
    
    # Step 1: Use your inference model to predict an output sequence 
    # given x_initializer, a_initializer and c_initializer.
    pred = inference_model.predict([x_initializer, a_initializer, c_initializer])
    
    # Step 2: Convert "pred" into an np.array() of indices with the maximum probabilities
    indices = np.argmax(pred, axis=-1)
    
    # Step 3: Convert indices to one-hot vectors, the shape of the results should be (Ty, n_values)
    results = to_categorical(indices, num_classes=n_values)
    
   
    return results, indices

In [None]:

results, indices = predict_and_sample(inference_model, x_initializer, a_initializer, c_initializer)

print("np.argmax(results[12]) =", np.argmax(results[12]))
print("np.argmax(results[17]) =", np.argmax(results[17]))
print("list(indices[12:18]) =", list(indices[12:18]))
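
The shape bookkeeping in `predict_and_sample()` can be reproduced without a model. In the sketch below, `toy_pred` is a hypothetical stand-in for the prediction output (a list with one `(1, n)` array per time step), and `np.eye` is used as a plain NumPy substitute for `to_categorical`.

```python
import numpy as np

Ty, n_toy_values = 4, 3                      # hypothetical small sizes
rng = np.random.default_rng(1)

# Hypothetical stand-in for the model output: Ty arrays of shape (1, n).
toy_pred = [rng.dirichlet(np.ones(n_toy_values))[np.newaxis, :] for _ in range(Ty)]

toy_indices = np.argmax(toy_pred, axis=-1)              # (Ty, 1)
toy_results = np.eye(n_toy_values)[toy_indices[:, 0]]   # (Ty, n)

print(toy_indices.shape, toy_results.shape)
```

Each row of `toy_results` is the one-hot encoding of the corresponding entry of `toy_indices`, so taking `np.argmax` of a row recovers the index.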

### 3.2 - Generate Music 

Finally! You're ready to generate music. 

Your RNN generates a sequence of values. The following code generates music by first calling your `predict_and_sample()` function. These values are then post-processed into musical chords (meaning that multiple values or notes can be played at the same time). 

Most computational music algorithms use some post-processing because it's difficult to generate music that sounds good without it. The post-processing does things like clean up the generated audio by making sure the same sound is not repeated too many times, or that two successive notes are not too far from each other in pitch, and so on. 

One could argue that many of these post-processing steps are hacks. Much of the music generation literature has focused on hand-crafting post-processors, and the output quality depends heavily on the post-processing, not just on the model. Still, post-processing makes a huge difference, so you should use it in your implementation as well. 
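
As a flavor of what such a clean-up rule can look like, here is a toy sketch of one rule mentioned above: limiting how often the same value repeats in a row. It is not the post-processing used by `generate_music`. `limit_repeats` and `toy_sequence` are hypothetical names, and the threshold of two repeats is arbitrary.

```python
from itertools import groupby

def limit_repeats(values, max_run=2):
    """Truncate each run of identical consecutive items to `max_run` items."""
    cleaned = []
    for _, run in groupby(values):
        cleaned.extend(list(run)[:max_run])
    return cleaned

toy_sequence = ["C", "C", "C", "C", "E", "G", "G", "G", "C"]
print(limit_repeats(toy_sequence))
```

For this input the result is `['C', 'C', 'E', 'G', 'G', 'C']`: runs longer than two are shortened, while the order of the remaining values is preserved.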

Let's make some music! 

In [None]:
out_stream = generate_music(inference_model, indices_values, chords)

In [None]:
mid2wav('output/my_music.midi')
IPython.display.Audio('./output/rendered.wav')

To listen to your music, click File->Open... Then go to "output/" and download "my_music.midi". Either play it on your computer with an application that can read midi files if you have one, or use one of the free online "MIDI to mp3" conversion tools to convert this to mp3.  

As a reference, here is a 30 second audio clip generated using this algorithm:

In [None]:
IPython.display.Audio('./data/30s_trained_model.wav')

<font color='blue'><b> What you should remember:</b>
    
- A sequence model can be used to generate musical values, which are then post-processed into midi music. 
- You can use a fairly similar model for tasks ranging from generating dinosaur names to generating original music, with the only major difference being the input fed to the model.  
- In Keras, sequence generation involves defining layers with shared weights, which are then repeated for the different time steps $1, \ldots, T_x$. 