<a href="https://colab.research.google.com/github/edufantini/music-gen/blob/main/src/MusicGeneratorLSTM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Music Generator with LSTM

## About

Music resembles language: both are temporal sequences of articulated sounds. Both say something, often something human.

There are crucial differences between language and music, yet in its simplest form music can still be described as a sequence of symbols. This translates something complex into something simpler that computational models can use.

Thus, the objective of this project is to establish communication between the human, who understands music in the most intense way the brain can interpret information, and the machine.

We'll create a model that generates music based on its input, i.e., a sequence of sounds that are related in some way to the sounds passed as input.

We'll use Natural Language Processing (NLP) methods, treating music as if it were a language. With this abstraction, the machine can recognize and process music the way it processes similar sequential data.

As a first step, we'll use text generation techniques based on Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs). If training proves effective, even if only reasonably so, we'll repeat the implementation using more specific methods such as attention.



## Imports

In [1]:
# Basic libraries
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch.nn.functional as F

# Preprocessing data libraries
from torch.utils.data import Dataset
from torch.utils.data import DataLoader

# Model libraries
import torch
import torch.nn as nn
import torch.optim as optim

# Data visualization
from torch.utils.tensorboard import SummaryWriter
from tqdm.notebook import tqdm #for loading bars

In [2]:
device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")

In [3]:
!git clone https://github.com/edufantini/music-gen.git

Cloning into 'music-gen'...
remote: Enumerating objects: 19514, done.
remote: Counting objects: 100% (46/46), done.
remote: Compressing objects: 100% (31/31), done.
remote: Total 19514 (delta 15), reused 45 (delta 14), pack-reused 19468
Receiving objects: 100% (19514/19514), 221.29 MiB | 21.06 MiB/s, done.
Resolving deltas: 100% (85/85), done.


In [3]:
import sys
sys.path.append('music-gen/src')  # 'music-gen' has a hyphen, so it cannot be imported as a package
from GetData import *

## Dataset

In [8]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [9]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = "/content/gdrive/My Drive/Kaggle"

In [10]:
%cd /content/gdrive/My Drive/Kaggle

/content/gdrive/My Drive/Kaggle


In [20]:
!kaggle datasets download -d edufantini/songs-in-midi

songs-in-midi.zip: Skipping, found more recently modified local copy (use --force to force download)


In [11]:
!ls

clean_midi  dataset  kaggle.json  songs-in-midi.zip


In [None]:
!unzip \*.zip  && rm *.zip

In [17]:
path = '/content/gdrive/My Drive/Kaggle/clean_midi/AC_DC/'

dataset = []

for filename in os.listdir(path):
  if filename.endswith("mid"):
    print(path + filename)
    # encode each MIDI file with encode_data (imported from GetData)
    data = encode_data(path+filename, 32)
    dataset.append(data)

/content/gdrive/My Drive/Kaggle/clean_midi/AC_DC/Dirty_Deeds_Done_Dirt_Cheap.mid
Processing file /content/gdrive/My Drive/Kaggle/clean_midi/AC_DC/Dirty_Deeds_Done_Dirt_Cheap.mid
Processing part 1/3


Converting measures from part 1: 100%|###############################################################################| 58/58 [00:00<00:00, 290.23it/s]


Processing part 2/3


Converting measures from part 2:   0%|                                                                                         | 0/63 [00:00<?, ?it/s]




Converting measures from part 2:   0%|                                                                                         | 0/63 [00:02<?, ?it/s]


KeyboardInterrupt: ignored

In [23]:
len(dataset)

16
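
The loops further down walk `dataset` as song -> part -> bar, with each bar being a 32 x 88 multi-hot array. A minimal sketch of that nesting, using synthetic data in place of real `encode_data` output (the helper `count_bars` and the random bars are illustrative and not part of the project):

```python
import numpy as np

def count_bars(dataset):
    """Count songs, parts and bars in a song -> part -> bar nesting."""
    n_songs = len(dataset)
    n_parts = sum(len(song) for song in dataset)
    n_bars = sum(len(part) for song in dataset for part in song)
    return n_songs, n_parts, n_bars

# Synthetic stand-in: 2 songs, each with 3 parts of 4 bars (32 frames x 88 notes).
rng = np.random.default_rng(0)
fake_dataset = [
    [[rng.integers(0, 2, size=(32, 88)) for _ in range(4)] for _ in range(3)]
    for _ in range(2)
]

print(count_bars(fake_dataset))  # (2, 6, 24)
```

Running the same helper on the real `dataset` would give the number of bars that later becomes the first dimension of the training tensors.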

## Preprocess data

```preprocess_bar(encoded_seq, n_in=32, n_out=32)``` takes one bar (32 frames) encoded as multi-hot vectors and preprocesses it, splitting the values into X and y.

In [24]:
def preprocess_bar(encoded_seq, n_in=32, n_out=32):
  # note: n_out is currently unused
  # create lagged copies of the sequence, side by side
  df = pd.DataFrame(encoded_seq)
  df = pd.concat([df.shift(n_in-i-1) for i in range(n_in)], axis=1)
  # drop rows with missing values (for a bar of exactly n_in frames, only the last row survives)
  df.dropna(inplace=True)
  # specify columns for input and output values
  values = df.values
  width = encoded_seq.shape[1]
  X = values[:, 0:width*(n_in-1)].reshape(n_in-1, width) # frames 0 .. n_in-2
  y = values[:, width:].reshape(n_in-1, width) # frames 1 .. n_in-1
  return X,y
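
For a bar with exactly `n_in` frames, the lag-and-drop steps above appear to reduce to a one-frame shift: `X` holds frames 0 to 30 and `y` holds frames 1 to 31. A minimal sketch of that reading with plain NumPy slicing (`split_bar` is a hypothetical name, and the float cast mirrors the dtype change that `shift` introduces when it fills in missing values):

```python
import numpy as np

def split_bar(bar):
    """Return (X, y) where y is X shifted one frame ahead."""
    bar = np.asarray(bar, dtype=np.float64)
    return bar[:-1], bar[1:]

# Synthetic bar: 32 frames of 88 notes, values 0 or 1.
bar = np.random.default_rng(0).integers(0, 2, size=(32, 88))
X, y = split_bar(bar)

print(X.shape, y.shape)               # (31, 88) (31, 88)
print(np.array_equal(X[1:], y[:-1]))  # True
```

The second check makes the relationship explicit: every target frame is the input frame that follows it.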

```create_dataloader(dataset, batch_size=1)``` converts a dataset of n songs into a dataloader.

In [25]:
from torch.utils.data import TensorDataset
from torch.utils.data import DataLoader

def create_dataloader(dataset, batch_size=1):
  X = []
  y = []

  # build two arrays, X and y, with one entry per bar
  for song in dataset:
    for part in song:
      for bar in part:
        xa, ya = preprocess_bar(bar)
        X.append(xa)
        y.append(ya)

  X = np.array(X)
  y = np.array(y)
  X = torch.from_numpy(X)
  y = torch.from_numpy(y)
  print(X.shape, y.shape)
  train_ds = TensorDataset(X, y)
  # use the batch_size argument instead of a hard-coded 1
  train_dl = DataLoader(train_ds, batch_size=batch_size, shuffle=False)

  return train_dl
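
To see what a batch size changes in practice, here is a small sketch with synthetic tensors of the same per-bar shape (the 10 bars and the batch size of 4 are arbitrary values chosen for illustration):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for the stacked bars: 10 bars, 31 frames, 88 notes.
X = torch.zeros(10, 31, 88)
y = torch.zeros(10, 31, 88)

loader = DataLoader(TensorDataset(X, y), batch_size=4, shuffle=False)

# Each iteration yields an (input, target) pair with a leading batch dimension.
batch_shapes = [tuple(xb.shape) for xb, yb in loader]
print(batch_shapes)  # [(4, 31, 88), (4, 31, 88), (2, 31, 88)]
```

The last batch is smaller because 10 is not a multiple of 4 and `drop_last` defaults to `False`.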

The dataloader is split into two parts:

  1.   context
  2.   target

where context holds the values passed as input to the model: in this case, 31 frames from each of the 5010 bars of the songs, with 88 notes per frame. The target holds the same 31 frames shifted one step ahead, which is what the model is asked to predict.
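
As an illustration of what a single 88-note frame could look like, the sketch below encodes a C major triad as a multi-hot vector. The pitch-to-index mapping is an assumption (index 0 taken to be MIDI pitch 21, the lowest piano key); the actual mapping used by `encode_data` may differ, and `multi_hot_frame` is a hypothetical helper.

```python
import numpy as np

N_NOTES = 88
LOWEST_MIDI_PITCH = 21  # assumption: index 0 is A0, the lowest piano key

def multi_hot_frame(midi_pitches):
    """Encode the notes sounding in one frame as an 88-dim 0/1 vector."""
    frame = np.zeros(N_NOTES, dtype=np.int64)
    for pitch in midi_pitches:
        frame[pitch - LOWEST_MIDI_PITCH] = 1
    return frame

# C major triad: C4 (60), E4 (64), G4 (67)
frame = multi_hot_frame([60, 64, 67])
print(frame.sum())            # 3
print(np.flatnonzero(frame))  # [39 43 46]
```

Unlike a one-hot token in text generation, several positions can be active at once, which is why each frame acts as one "symbol" of the sequence.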


In [28]:
train_dl = create_dataloader(dataset)

torch.Size([5010, 31, 88]) torch.Size([5010, 31, 88])


In [29]:
def create_vocab(dataset):
  vocab = []
  for song in dataset:
    for part in song:
      for bar in part:
        vocab.append(bar)

  vocab = np.array(vocab)
  vocab = vocab.reshape(vocab.shape[0]*vocab.shape[1], vocab.shape[2])
  vocab = np.unique(vocab, axis=0)


  #np.set_printoptions(threshold=np.inf) print full array
  #np.set_printoptions(threshold=10)
  print(vocab)
  
  return vocab
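
The key step in `create_vocab` is `np.unique(..., axis=0)`, which removes duplicate rows. A toy example with four short frames (the data is made up for illustration; `return_counts` is an extra option not used in the notebook):

```python
import numpy as np

frames = np.array([
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 0],  # duplicate of the first frame
    [0, 0, 0, 0],
])

# axis=0 compares whole rows; the result is sorted row by row.
vocab, counts = np.unique(frames, axis=0, return_counts=True)
print(vocab.shape[0])   # 3
print(counts.tolist())  # [1, 2, 1]
```

The counts show how often each distinct frame occurs, which can help spot frames (such as silence) that dominate the data.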

We have 626 distinct frames (the count printed by the cell below).

In [30]:
diff_frames = create_vocab(dataset)
print('\nvocab len: {}'.format(diff_frames.shape[0]))

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]

vocab len: 626


## Model

Some important definitions:

In [31]:
n_frames_input = 31
n_frames_output = 31
n_bars_input = len(train_dl.dataset.tensors[0]) # number of rows of the dataloader
print('Number of bars in the input dataset: {}'.format(n_bars_input))

Number of bars in the input dataset: 5010


In [32]:
input = torch.zeros(n_bars_input, 31, 88)
target = torch.zeros(n_bars_input, 31, 88)

for sample, (xb, yb) in enumerate(train_dl): # gets the samples
  input[sample] = xb
  target[sample] = yb

input[1, 1, :].shape, target[1, 1, :].shape

(torch.Size([88]), torch.Size([88]))
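
Since the dataloader wraps a `TensorDataset`, the same tensors can also be read directly rather than copied bar by bar. A sketch with synthetic data, assuming `batch_size=1` and `shuffle=False` as in the notebook (the 6 random bars are placeholders):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for the real bars, stored as float64 like the NumPy arrays.
X = torch.rand(6, 31, 88, dtype=torch.float64)
y = torch.rand(6, 31, 88, dtype=torch.float64)
loader = DataLoader(TensorDataset(X, y), batch_size=1, shuffle=False)

# Loop-based copy, as in the cell above.
looped = torch.zeros(6, 31, 88)
for sample, (xb, yb) in enumerate(loader):
    looped[sample] = xb

# Direct access to the input tensor held by the dataset, cast to float32.
direct = loader.dataset.tensors[0].float()

print(torch.equal(looped, direct))  # True
```

The direct route avoids iterating over every bar, which matters once the dataset grows beyond a single artist.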

In [33]:
class RNN(nn.Module):
  def __init__(self, input_size, hidden_size, num_layers, output_size):
    super(RNN, self).__init__()
    self.hidden_size = hidden_size
    self.num_layers = num_layers

    #self.embed = nn.Embedding(input_size, hidden_size)
    self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=False)
    self.fc = nn.Linear(hidden_size, output_size)
    self.act = nn.Hardsigmoid()

  def forward(self, x, hidden, cell):

    # print("\n\nX \t", x.unsqueeze(1).shape)

    # Passing in the input and hidden state into the model and obtaining outputs
    out, (hidden, cell) = self.lstm(x.unsqueeze(1), (hidden, cell))

    # print("\n\nX, out, hidden, cell\t", x.shape, out.shape, hidden.shape, cell.shape)

    # Reshaping the outputs such that it can be fit into the fully connected layer
    out = self.fc(out.contiguous().view(-1, self.hidden_size))
    out = self.act(out)

    # print("\n\nOut, hidden\t", out.shape, hidden.shape)
    return out, (hidden, cell)

  # CHECK!!!!!!
  def init_hidden(self, batch_size):
    # This method generates the first hidden state of zeros which we'll use in the forward pass
    # We'll send the tensor holding the hidden state to the device we specified earlier as well
    hidden = torch.zeros(self.num_layers, batch_size, self.hidden_size).to(device)
    cell = torch.zeros(self.num_layers, batch_size, self.hidden_size).to(device)
    return hidden, cell
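
To make the tensor shapes in `forward` concrete, the sketch below repeats the same sequence of calls outside the class. It uses 2 LSTM layers only to keep the example fast (the notebook's `Generator` picks its own value), and an all-zero bar as a placeholder input.

```python
import torch
import torch.nn as nn

frame_len, hidden_size, num_layers = 88, 88, 2  # 2 layers keeps the sketch fast

lstm = nn.LSTM(frame_len, hidden_size, num_layers, batch_first=False)
fc = nn.Linear(hidden_size, frame_len)
act = nn.Hardsigmoid()

bar = torch.zeros(31, frame_len)  # one bar: 31 frames
x = bar.unsqueeze(1)              # (seq_len=31, batch=1, features=88)
h0 = torch.zeros(num_layers, 1, hidden_size)
c0 = torch.zeros(num_layers, 1, hidden_size)

out, (hn, cn) = lstm(x, (h0, c0))                        # out: (31, 1, 88)
probs = act(fc(out.contiguous().view(-1, hidden_size)))  # (31, 88), values in [0, 1]

print(x.shape, out.shape, hn.shape, probs.shape)
```

With `batch_first=False`, the 31 frames of a bar are treated as time steps and the inserted dimension is a batch of size 1, so the output lines up frame by frame with the target.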

The ```Generator()``` class performs the main training and testing operations of the LSTM.

In [48]:
class Generator():
  def __init__(self):
 
    # preparation
 
    self.bar_len = 31 # how many frames it's gonna take in a timestep
    self.num_layers = self.bar_len
 
    self.frame_len = 88
    self.hidden_size = self.frame_len
 
    self.num_epochs = 5
    self.batch_size = 1 # or 31?
 
    self.lr = 0.003
 
 
 
  # converts one frame into torch tensor
  def multi_hot_tensor(self, frame):
    tensor = torch.from_numpy(frame)
    return tensor
 
 
  # retrieve data from dataloader
  def get_sample(self, dataloader):
 
    input = torch.zeros(n_bars_input, self.bar_len, self.frame_len)
    target = torch.zeros(n_bars_input, self.bar_len, self.frame_len)
 
    for sample, (xb, yb) in enumerate(dataloader): # gets the samples
      input[sample] = xb
      target[sample] = yb
    
    return input, target
 
 
 
  def generate(self, initial_bar, predict_len, temperature=0.85):
    pass
 
 
 
  def train(self, dataloader):
 
    # Instantiate the model with hyperparameters
    # We'll also set the model to the device that we defined earlier (default is CPU)
    self.rnn = RNN(input_size=self.frame_len,
                   output_size=self.frame_len,
                   hidden_size=self.hidden_size,
                   num_layers=self.num_layers).to(device)
 
 
    optimizer = torch.optim.Adam(self.rnn.parameters(), lr=self.lr)
 
    # can we?
    loss_fn = nn.BCELoss() # change loss -> cross entropy
 
 
    print("\nStarting training...")
 
 
    for epoch in range(1, self.num_epochs + 1):

      print('> EPOCH #', epoch)

      # load every bar of the dataset into two tensors
      input, target = self.get_sample(dataloader)
      input = input.to(device)
      target = target.to(device)

      epoch_loss = 0.0 # running sum of the per-bar losses in this epoch

      for bar in tqdm(range(n_bars_input)):
        # start each bar from a fresh (zero) hidden and cell state
        hidden, cell = self.rnn.init_hidden(self.batch_size)

        output, (hidden, cell) = self.rnn(input[bar,:], hidden, cell)

        loss_step = loss_fn(output, target[bar, :])
        loss_step.backward() # Does backpropagation and calculates gradients
        optimizer.step() # Updates the weights accordingly
        optimizer.zero_grad() # Clears existing gradients from previous bar
        epoch_loss += loss_step.item()

      # report the mean loss per bar after every epoch
      print('Epoch: {}/{}.............'.format(epoch, self.num_epochs), end=' ')
      print("Loss: {:.4f}".format(epoch_loss / n_bars_input))
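
The `generate` method is left as a stub in the class. One possible way to use its `temperature` argument, shown here only as a sketch and not as the author's method, is to rescale each note's predicted probability in logit space and then draw every note independently. The function name `sample_frame` and the clipping constant are assumptions.

```python
import numpy as np

def sample_frame(probs, temperature=0.85, rng=None):
    """Sample a multi-hot frame from per-note probabilities.

    Each probability is turned into a logit, divided by the temperature
    and squashed back with a sigmoid before an independent 0/1 draw per note.
    """
    rng = np.random.default_rng() if rng is None else rng
    # clip to avoid infinite logits for probabilities of exactly 0 or 1
    p = np.clip(np.asarray(probs, dtype=np.float64), 1e-6, 1 - 1e-6)
    logits = np.log(p) - np.log1p(-p)
    scaled = 1.0 / (1.0 + np.exp(-logits / temperature))
    return (rng.random(p.shape) < scaled).astype(np.int64)

probs = np.array([0.05, 0.9, 0.5, 0.99])
frame = sample_frame(probs, temperature=0.85, rng=np.random.default_rng(0))
print(frame.shape, frame.dtype)  # (4,) int64
```

Temperatures below 1 push the probabilities toward 0 or 1 and make the output more conservative; temperatures above 1 flatten them and add variety.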

In [49]:
gen = Generator()

In [50]:
gen.train(train_dl)


Starting training...
> EPOCH # 1


HBox(children=(FloatProgress(value=0.0, max=5010.0), HTML(value='')))

KeyboardInterrupt: ignored