# CPEN 291 Final Project

# Proposal
Our project would involve turning YouTube videos of piano compositions into sheet music. As someone who loves learning new songs through YouTube, I have found that sheet music is often expensive to buy, so I try to learn by slowing down the video and watching the artist's hands instead. That has proven quite a challenge, since the pianist often edits the video in a way that hides their fingers for part of the performance.

As a solution, I thought it would be interesting to build an ML system that takes audio as input (mostly from videos on YouTube and other streaming platforms) and produces sheet music that is readily available for download. During my research I found quite a few companies that offer similar services, which makes me hopeful that the project is feasible. Quite a few articles also mention exciting new ML work such as Magenta and Deep Watershed Detection.

I would probably rely on a scraping script to build a database of audio from piano performances posted on all sorts of social media. I would then prepare the database by converting the audio into something more intelligible (such as waveforms) for the model to analyze. The model would use the waveform data to determine which note is being played, how long it lasts, and so on. To train the model, I could find cheap or free sheet music online and compare what the model generates against what the piece is supposed to look like.

Finally, once the accuracy is acceptable, I would make the sheet music available on a website where we could download it and learn from it. (Note that this would serve mostly as practice material for beginner and intermediate students, as it would not be 100% correct.)

# Import Statements


In [None]:
!pip install pretty_midi

Collecting pretty_midi
  Downloading https://files.pythonhosted.org/packages/bc/8e/63c6e39a7a64623a9cd6aec530070c70827f6f8f40deec938f323d7b1e15/pretty_midi-0.2.9.tar.gz (5.6MB)
Collecting mido>=1.1.16
  Downloading https://files.pythonhosted.org/packages/20/0a/81beb587b1ae832ea6a1901dc7c6faa380e8dd154e0a862f0a9f3d2afab9/mido-1.2.9-py2.py3-none-any.whl (52kB)
Building wheels for collected packages: pretty-midi
  Building wheel for pretty-midi (setup.py) ... done
  Created wheel for pretty-midi: filename=pretty_midi-0.2.9-cp37-none-any.whl size=5591954 sha256=a22af4a31c7a5918957ab615eed7884d11fc24f27e62532af692f3e1ea7f713d
  Stored in directory: /root/.cache/pip/wheels/4c/a1/c6/b5697841db1112c6e5866d75a6b6bf1bef73b874782556ba66
Successfully built pretty-midi
Installing collected packages: mido, pretty-midi
Successfully installed mido-1.2.9 pretty-midi

In [None]:
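# Standard library
import csv
import os

# Numerics, plotting and images
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import PIL
import PIL.Image
from scipy.io import wavfile

# Audio and MIDI
import librosa
import librosa.display  # used as librosa.display.specshow below
import pretty_midi

# Deep learning
import torch, torchtext
import torch.nn.functional as F  # torch.functional is a different module
from torch import nn, optim
from torchvision import datasets, models, transforms
from tqdm.auto import tqdm

# Notebook helpers
from IPython.display import Audio, display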


In [None]:
!pip install pydub

In [None]:
from pydub import AudioSegment
song = AudioSegment.from_mp3("/swallows.mp3")

Here I am trying out pydub's `AudioSegment`; for installation details, refer to: https://github.com/jiaaro/pydub#installation

-> NOTE: It seems that .mid files cannot be processed this way  :(

# Testing of Different Approaches/Ideas

In [None]:
# Size of segments to break song into for volume calculations
SEGMENT_MS = 50
# dBFS is decibels relative to the maximum possible loudness
volume = [segment.dBFS for segment in song[::SEGMENT_MS]]
x_axis = np.arange(len(volume)) * (SEGMENT_MS / 1000)
plt.plot(x_axis, volume)
plt.show()
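
The comment above describes dBFS as decibels relative to the maximum possible loudness. As a rough illustration of what that number means, here is a minimal sketch that computes an RMS-based dBFS value for a list of integer PCM samples. The function name `dbfs` and the 16-bit default are assumptions made for this example; this is not pydub's own code.

```python
import math

def dbfs(samples, bit_depth=16):
    """RMS level of integer PCM samples, in dB relative to full scale (sketch)."""
    # Largest magnitude a signed sample of this bit depth can take
    max_amplitude = 2 ** (bit_depth - 1)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0:
        return float("-inf")  # digital silence has no finite level
    return 20 * math.log10(rms / max_amplitude)

# A square wave at half of full scale sits about 6 dB below the maximum
half_scale = [16384, -16384] * 100
half_scale_level = dbfs(half_scale)
silence_level = dbfs([0, 0, 0, 0])
```

Values close to 0 mean the segment is near the loudest level the format can represent, and more negative values mean quieter audio, which is why the thresholds used later are negative numbers.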

In [None]:
# Minimum volume necessary to be considered a note
VOLUME_THRESHOLD = -35
# The increase from one sample to the next required 
# to be considered a note
EDGE_THRESHOLD = 5
predicted_starts = []
for i in range(1, len(volume)):
    if (
        volume[i] > VOLUME_THRESHOLD and 
        volume[i] - volume[i - 1] > EDGE_THRESHOLD
    ):
        ms = i * SEGMENT_MS
        predicted_starts.append(ms)

In [None]:
# Throw out any additional notes found in this window
MIN_MS_BETWEEN = 100
predicted_starts = []
for i in range(1, len(volume)):
    if (
        volume[i] > VOLUME_THRESHOLD and 
        volume[i] - volume[i - 1] > EDGE_THRESHOLD
    ):
        ms = i * SEGMENT_MS
        # Ignore any too close together
        if (
            len(predicted_starts) == 0 or
            ms - predicted_starts[-1] >= MIN_MS_BETWEEN
        ):
            predicted_starts.append(ms)
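
To check how the thresholds interact without loading a song, the same logic can be wrapped in a function and run on a hand-written volume list. This is only a sketch: the function name `detect_note_starts` and the synthetic values are made up for illustration, while the default parameters mirror the constants used above.

```python
def detect_note_starts(volume, segment_ms=50, volume_threshold=-35,
                       edge_threshold=5, min_ms_between=100):
    """Return predicted note start times (ms) from per-segment dBFS values."""
    starts = []
    for i in range(1, len(volume)):
        loud_enough = volume[i] > volume_threshold
        rising_edge = volume[i] - volume[i - 1] > edge_threshold
        if loud_enough and rising_edge:
            ms = i * segment_ms
            # Ignore detections that fall too close to the previous one
            if not starts or ms - starts[-1] >= min_ms_between:
                starts.append(ms)
    return starts

# Synthetic volumes: silence, a note, decay, then a second note
synthetic_volume = [-60, -60, -20, -12, -25, -40, -50, -18, -22]
starts = detect_note_starts(synthetic_volume)
```

In this example the rise at index 3 is dropped because it is only 50 ms after the detection at index 2, so the function reports starts at 100 ms and 350 ms.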

In [None]:
import array

import scipy.fft
from pydub.utils import get_array_type

def frequency_spectrum(sample, max_frequency=800):
    """
    Derive frequency spectrum of a pydub.AudioSegment
    Returns an array of frequencies and an array of how prevalent that frequency is in the sample
    """

    # Convert pydub.AudioSegment to raw audio data
    # Copied from Jiaaro's answer on https://stackoverflow.com/questions/32373996/pydub-raw-audio-data
    # Note: pydub stores stereo samples interleaved, so this assumes a mono segment
    bit_depth = sample.sample_width * 8
    array_type = get_array_type(bit_depth)
    raw_audio_data = array.array(array_type, sample._data)
    n = len(raw_audio_data)
    # Compute FFT and frequency value for each index in FFT array
    # Inspired by Reveille's answer on https://stackoverflow.com/questions/53308674/audio-frequencies-in-python
    freq_array = np.arange(n) * (float(sample.frame_rate) / n)  # two sides frequency range
    freq_array = freq_array[:(n // 2)]  # one side frequency range
    raw_audio_data = np.asarray(raw_audio_data) - np.average(raw_audio_data)  # zero-centering

    freq_magnitude = scipy.fft.fft(raw_audio_data)  # fft computing (normalized below)
    freq_magnitude = freq_magnitude[:(n // 2)]  # one side
    if max_frequency:
        max_index = int(max_frequency * n / sample.frame_rate) + 1
        freq_array = freq_array[:max_index]
        freq_magnitude = freq_magnitude[:max_index]
    freq_magnitude = abs(freq_magnitude)
    freq_magnitude = freq_magnitude / np.sum(freq_magnitude)  # magnitudes sum to 1
    return freq_array, freq_magnitude

In [None]:
freq_array, freq_magnitude = frequency_spectrum(song, 800)

In [None]:
import scipy.signal

# Find the frequencies whose normalized magnitude exceeds the height threshold
peak_indices, props = scipy.signal.find_peaks(freq_magnitude, height=0.015)
for i, peak in enumerate(peak_indices):
    freq = freq_array[peak]
    magnitude = props["peak_heights"][i]
    print("{}hz with magnitude {:.3f}".format(freq, magnitude))

In [None]:
get_note_for_freq(freq_array[np.argmax(freq_magnitude)])
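
The cell above calls `get_note_for_freq`, which is not defined anywhere in this notebook. Below is a minimal sketch of what such a helper could look like. It assumes equal temperament with A4 = 440 Hz and the convention that MIDI note 60 is written C4; both are assumptions for this example, not something taken from the notebook.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def get_note_for_freq(freq, a4=440.0):
    """Map a frequency in Hz to the nearest note name, e.g. 440.0 -> 'A4' (sketch)."""
    if freq <= 0:
        return None  # no meaningful pitch for zero or negative frequencies
    # MIDI note number of the nearest semitone (A4 is MIDI note 69)
    midi = int(round(69 + 12 * math.log2(freq / a4)))
    octave = midi // 12 - 1  # assumed convention: MIDI 60 is C4
    return NOTE_NAMES[midi % 12] + str(octave)

note_a = get_note_for_freq(440.0)
note_c = get_note_for_freq(261.63)
```

Rounding to the nearest semitone means a slightly out-of-tune peak still maps to a note name, at the cost of hiding how far off the peak actually was.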

In [None]:
# code adapted from https://github.com/jsleep/wav2mid/blob/master/notebooks/wavmidi_preprocess.ipynb
# This code converts .mid files to spectrograms

# Note: the below code requires you to manually install a new version of fluidsynth (using pip install gives you
# a version that is too old). This code must also be run on a 32-bit version of Python. If memory size errors occur,
# you must increase the page size in your device's settings.

import fluidsynth

PATH_TO_ENTRIES = ''
DIR_SAVE = ''

entries = os.listdir(PATH_TO_ENTRIES)
sr = 22050  # sampling rate used when synthesizing the audio

for entry in entries:
    midi_fn = os.path.join(PATH_TO_ENTRIES, entry)
    pm = pretty_midi.PrettyMIDI(midi_fn)

    # Synthesize the MIDI file and keep only the first 5 seconds
    y = pm.fluidsynth(fs=sr)[:sr*5]
    # The STFT is complex-valued; take its magnitude before converting to dB
    D = np.abs(librosa.stft(y))

    plt.figure()
    librosa.display.specshow(librosa.amplitude_to_db(D, ref=np.max), y_axis='log', x_axis='time', sr=sr)
    plt.title('Power spectrogram')

    plt.savefig(os.path.join(DIR_SAVE, entry.replace('.mid', '.png')))
    plt.close()  # start the next spectrogram on a fresh figure


# Dataset Collection


In [None]:
# Code to scrape samples using Selenium. Requires user to install chromedriver.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

PATH_TO_CHROMEDRIVER = ''

driver = webdriver.Chrome(PATH_TO_CHROMEDRIVER)

driver.get('https://www.mutopiaproject.org/cgibin/make-table.cgi?Instrument=Piano')

for i in range(77):
    try:
        tables_mid = []
        tables_ps = []

        # Collect the cells that hold the .mid and .ps links for the 10 pieces on this page
        for j in range(1, 11):
            tables_mid.append(driver.find_element(By.XPATH, f"//table/tbody//tr[{j}]/td//table//tbody/tr[4]/td[2]"))
            tables_ps.append(driver.find_element(By.XPATH, f"//table/tbody//tr[{j}]/td//table//tbody/tr[5]/td"))

        for table in tables_mid:
            time.sleep(0.5)
            table.find_element(By.PARTIAL_LINK_TEXT, '.mid').click()

        for table in tables_ps:
            time.sleep(1)
            driver.execute_script("arguments[0].click();", table.find_element(By.PARTIAL_LINK_TEXT, '.ps'))

        # Move on to the next page of results
        link = driver.find_element(By.LINK_TEXT, 'Next 10')
        link.click()

    except Exception:
        # If anything on this page fails, move on to the next iteration
        continue


driver.quit()


# Preprocessing

To convert the sheets from .ps to .pdf format, we used Ghostscript, an external application. [This](https://stackoverflow.com/questions/44532739/using-ghostscript-in-a-windows-bat-file-to-convert-multiple-pdf-files-to-png) Stack Overflow post gave a command that, with slight modification, allowed us to batch-convert these files to PDF using Ghostscript. Below is how we then converted the PDF files to PNG.
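
The exact Ghostscript command we used is not reproduced in this notebook. As a rough idea of what a .ps to .pdf conversion looks like when driven from Python, here is a minimal sketch. The function name `ps_to_pdf_command`, the example paths and the `gs` executable name are assumptions for illustration (on Windows the console executable is typically `gswin64c`); the switches themselves are standard Ghostscript options.

```python
import os

def ps_to_pdf_command(ps_path, pdf_dir, gs_executable="gs"):
    """Build the Ghostscript argument list that converts one .ps file to .pdf (sketch)."""
    base = os.path.splitext(os.path.basename(ps_path))[0]
    pdf_path = os.path.join(pdf_dir, base + ".pdf")
    return [
        gs_executable,
        "-dBATCH",            # exit after processing the input file
        "-dNOPAUSE",          # do not wait for a key press between pages
        "-sDEVICE=pdfwrite",  # write PDF output
        f"-sOutputFile={pdf_path}",
        ps_path,
    ]

# Hypothetical file names, used only to show the resulting command
cmd = ps_to_pdf_command("sheets_ps/minuet.ps", "sheets_pdf")
# To actually run the conversion (requires Ghostscript to be installed):
# import subprocess; subprocess.run(cmd, check=True)
```

Looping this over `os.listdir` of the .ps directory would give a batch conversion. The PDF to PNG step that we actually ran is shown in the next cell.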

In [None]:
# Code to convert sheets from pdf to png format

from pdf2image import convert_from_path
PATH_SHEETS_PDF = ''
PATH_SHEETS_PNG = ''
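entries = os.listdir(PATH_SHEETS_PDF)

for entry in entries:
    # convert_from_path returns one PIL image per page of the PDF
    images = convert_from_path(os.path.join(PATH_SHEETS_PDF, entry))
    entry = entry.replace('.pdf', '')

    # Save each page as <name>_<page index>.png
    for i, image in enumerate(images):
        image.save(os.path.join(PATH_SHEETS_PNG, f'{entry}_{i}.png'), 'PNG')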

The following code was used to convert any audio files from stereo to mono:

In [None]:
# from https://stackoverflow.com/questions/5120555/how-can-i-convert-a-wav-from-stereo-to-mono-in-python

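PATH_WAV = ''  # directory containing the .wav files

for entry in os.listdir(PATH_WAV):
    wav_fn = os.path.join(PATH_WAV, entry)
    sound = AudioSegment.from_wav(wav_fn)
    sound = sound.set_channels(1)
    # Overwrite the original file with its mono version
    sound.export(wav_fn, format='wav')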

The following helper function is used to create 2D matrix labels from the midi file of a sample.

In [None]:
# from https://github.com/jsleep/wav2mid/blob/master/examples/one_hot.py

from __future__ import division
"""
Simple function for converting Pretty MIDI object into one-hot encoding
/ piano-roll-like to be used for machine learning.
"""

def get_label(pm, fs=1):
    """Compute a one hot matrix of a pretty midi object
    Parameters
    ----------
    pm : pretty_midi.PrettyMIDI
        A pretty_midi.PrettyMIDI class instance describing
        the piano roll.
    fs : int
        Sampling frequency of the columns, i.e. each column is spaced apart
        by ``1./fs`` seconds.
    Returns
    -------
    one_hot : np.ndarray, shape=(128,times.shape[0])
        Piano roll of this instrument. 1 represents Note Ons,
        -1 represents Note offs, 0 represents constant/do-nothing
    """

    # Allocate a matrix of zeros - we will add in as we go
    one_hots = []

    for instrument in pm.instruments:
        one_hot = np.zeros((128, int(fs*instrument.get_end_time())+1))
        for note in instrument.notes:
            # note on
            one_hot[note.pitch, int(note.start*fs)] = 1          # Losing precision with these casts. Try to fix? (use time windows)
            # print('note on',note.pitch, int(note.start*fs))
            # note off
            one_hot[note.pitch, int(note.end*fs)] = -1
            # print('note off',note.pitch, int(note.end*fs))
        one_hots.append(one_hot)

    one_hot = np.zeros((128, np.max([o.shape[1] for o in one_hots])))
    for o in one_hots:
        one_hot[:, :o.shape[1]] += o

    one_hot = np.clip(one_hot,-1,1)
    print(one_hot.shape)
    return torch.as_tensor(one_hot)


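The following helper function applies the short-time Fourier transform (STFT; the constant-Q transform, CQT, could be used instead) to the signal data of an audio sample and saves the resulting spectrogram as an image. It then loads that image and transforms it into a tensor.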

In [None]:
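def get_sample(signalData, i, transform):
    signalData_float = signalData.astype(float)
    # The STFT is complex-valued; use its magnitude for the dB spectrogram
    f = np.abs(librosa.stft(signalData_float))
    plt.figure()
    librosa.display.specshow(librosa.amplitude_to_db(f, ref=np.max), y_axis='log', x_axis='time', sr=22050)
    plt.savefig(f'image_spec_{i}.jpg')
    plt.close()  # avoid drawing later spectrograms on top of this one
    # Reload the saved plot and apply the image transform (resize, to tensor)
    img = PIL.Image.open(f'image_spec_{i}.jpg')
    img = transform(img)
    return img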

The function below creates a dataset in which each sample is an image of the spectrogram of a song and each label is a 2D array, as returned by `get_label`, where each row represents a certain note and each column represents one second of the piece of music (with the default `fs=1`). So if the sample is 90 seconds long, the label will be a matrix with 128 rows, since there are 128 MIDI notes, and roughly 90 columns.

If note j is played at time i, label[j][i] = 1.

If this note becomes "off" at time k, label[j][k] = -1.
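
To make the labelling scheme concrete, here is a small pure-Python sketch that builds the same kind of matrix from a hand-written list of notes. It follows the layout that `get_label` produces (one row per MIDI pitch, one column per time step); the function name `build_label` and the example notes are made up for illustration.

```python
def build_label(notes, fs=1, n_pitches=128):
    """Build a note on/off matrix from (pitch, start_seconds, end_seconds) tuples."""
    end_time = max(end for _, _, end in notes)
    n_steps = int(fs * end_time) + 1
    label = [[0] * n_steps for _ in range(n_pitches)]
    for pitch, start, end in notes:
        label[pitch][int(start * fs)] = 1   # note on
        label[pitch][int(end * fs)] = -1    # note off
    return label

# Two hypothetical notes: middle C from 0 s to 2 s, and E from 1 s to 3.5 s
notes = [(60, 0.0, 2.0), (64, 1.0, 3.5)]
label = build_label(notes)
```

With `fs=1`, a note that starts and ends within the same second has its note-on overwritten by the note-off. This is the precision loss flagged in the comment inside `get_label`, and a larger `fs` reduces it.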


In [None]:
def create_dataset(path_mid, path_wav, transform):
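    # os.listdir returns entries in arbitrary order, so sort both listings to keep
    # the i-th .mid and the i-th .wav paired (assumes matching file names)
    entries_mid = sorted(os.listdir(path_mid))
    entries_wav = sorted(os.listdir(path_wav))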
    dataset = []

    for i in range(len(entries_wav)):
        # print("itr: " + str(i))
        pm = pretty_midi.PrettyMIDI(path_mid + '/' + entries_mid[i])
        samplingFrequency, signalData = wavfile.read(path_wav + '/' + entries_wav[i])
        sample = get_sample(signalData, i, transform)
        label = get_label(pm)
        dataset.append((sample, label))
        
    return dataset

In [None]:
class Dataset():
  def __init__(self, PATH_MID, PATH_WAV):
    transform = transforms.Compose([transforms.Resize((224,224)), transforms.ToTensor()])
    self.dataset = create_dataset(PATH_MID, PATH_WAV, transform)
  
  def __len__(self):
    return len(self.dataset)

  def __getitem__(self, i):
    if torch.is_tensor(i):
      i = i.item()

    return self.dataset[i]

# Model, Training and Testing

The model will receive an audio file as input, which we will analyze using Python libraries we have found that can plot the data as a waveform. We also already know how to convert audio into spectrograms, which can also be useful for analyzing the data. Our model will output notes (e.g. ['B', 'C#', ..., 'G']).
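
One way to get from a predicted on/off matrix to a list of named notes is sketched below. It assumes the model output has already been thresholded into the same 1 / -1 / 0 format as the training labels (one row per MIDI pitch, one column per time step) and that MIDI note 60 is written C4. The names `pitch_to_name` and `decode_events` are hypothetical helpers, not part of the notebook.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def pitch_to_name(pitch):
    """Convert a MIDI pitch number to a note name (assumed: 60 -> 'C4')."""
    return NOTE_NAMES[pitch % 12] + str(pitch // 12 - 1)

def decode_events(label, fs=1):
    """Turn an on/off matrix into (note name, start seconds, duration seconds) tuples."""
    events = []
    for pitch, row in enumerate(label):
        onset = None
        for step, value in enumerate(row):
            if value == 1:
                onset = step
            elif value == -1 and onset is not None:
                events.append((pitch_to_name(pitch), onset / fs, (step - onset) / fs))
                onset = None
    # Order the notes by the time at which they start
    return sorted(events, key=lambda event: event[1])

# Tiny example matrix: C4 from 0 s to 2 s, E4 from 1 s to 3 s
example = [[0] * 4 for _ in range(128)]
example[60][0], example[60][2] = 1, -1
example[64][1], example[64][3] = 1, -1
events = decode_events(example)
```

A list of (name, start, duration) tuples like this is closer to what a notation renderer needs than the raw matrix, since each entry corresponds to one note on the page.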

In [None]:
# def toSpectrogram(entries):
#   for entry in entries:
    
#     midi_fn = PATH_TO_ENTRIES + entry
#     sr = 22050

#     pm = pretty_midi.PrettyMIDI(midi_fn)

#     y = pm.fluidsynth(fs=sr)[:sr*5]
#     D = librosa.stft(y, n_fft=512) # change n_fft to change how much of the song you capture
#     librosa.display.specshow(librosa.amplitude_to_db(D,ref=np.max),y_axis='log', x_axis='time', sr=sr)
#     plt.title('Power spectrogram')

#     plt.savefig(DIR_SAVE + entry.replace('.mid', '.png'))

# dataset = toSpectrogram(midi_files) #to change to where midi files are saved
# n_train = int(0.8 * len(dataset))
# n_test = len(dataset) - n_train
# rng = torch.Generator().manual_seed(0)
# train, test = torch.utils.data.random_split(dataset, [n_train, n_test], rng)

In [None]:
# def run_train(model, ds, crit, opt, dev, n_epochs=10, batch_sz=128):
#     model = model.to(dev)
#     model.train()
#     ldr = torch.utils.data.DataLoader(ds, batch_size=batch_sz)

#     for i in range(n_epochs):
#       for samples in ds:
#           model.zero_grad()
#           # batch_sz = samples[0].size(0)
                  
#           label = samples[1].to(dev)
#           samples = samples[0].to(dev)
#           # labels = torch.full(1, dtype=torch.float, device=dev)
#           outs = model(samples).squeeze()
#           loss = crit(outs, label)
#           loss.backward()

#           opt.step()

# Conclusion
Given the output notes, we will next try to render each note as an image and assemble the images into sheet music.