# Introduction
## Objectives
1. The primary objective of this program is to determine whether the mixed audio can be accurately recreated using the mixing coefficients provided in the song collection of the BandHub dataset. If this is the case, the mixed audio need not be stored and can be recreated as and when required, thereby saving us significant space on the file server.

2. In a handful of cases, the same mix has two mixing coefficients: one in the settings dictionary, and the other in the audioChannels list, which is within the settings dictionary. Most of the time the two differ only by approximation errors. However, in some cases the differences are significant. Therefore, the secondary objective is to determine which of the two mixing coefficients is more accurate.

## Code Organization
1. Imports
2. Connect to mongoDB client
3. Get all collections: Songs, Videos, Tracks, and Posts
4. Function definitions
 * All functions have been elaborated on in the markdown cells above them.
5. Main script
 * This script calls all the functions as and when required to accomplish both objectives.

### 1. Imports

In [13]:
%autosave 30

Autosaving every 30 seconds


In [14]:
import pymongo
import pandas as pd
import numpy as np
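import sys
# Print full arrays without summarization (threshold must be numeric, not the string 'nan')
np.set_printoptions(threshold=sys.maxsize)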
from numpy.linalg import inv
from scipy import signal
from scipy.io import wavfile
import soundfile as sf
import matplotlib.pyplot as plt
%matplotlib inline
from bson.objectid import ObjectId
import json
from pprint import pprint
from pydub import AudioSegment
import urllib
import fnmatch
import os
import requests
import ffmpy
import subprocess
import librosa
import librosa.display
import time
import IPython.display as ipd
import pescador

from sklearn.ensemble import RandomForestClassifier
import joblib  # sklearn.externals.joblib was removed from recent scikit-learn releases

DATA_ROOT = '/scratch/rrs432/openmic/openmic-2018'

if not os.path.exists(DATA_ROOT):
    raise ValueError('Did you forget to set `DATA_ROOT`?')

### 2. Connect to mongoDB client

In [15]:
client = pymongo.MongoClient('localhost',27017)
#client = pymongo.MongoClient('localhost',32768)
db = client.get_database('b=bandhub')

### 3. Get all collections: Songs, Videos, Tracks, and Posts

In [16]:
songCol = db.get_collection('songsStream')
vidCol = db.get_collection('mergedVideos')
trackCol = db.get_collection('tracksStream')
postCol = db.get_collection('posts')
[songCol.count(), vidCol.count(), trackCol.count(), postCol.count()]

[425706, 198169, 915582, 494867]

### 4. Function definitions

From this dataset, we are only concerned with tracks that have "public" access. This is denoted by the variable "songAccess", which is set to 1 for public tracks.<br>
However, many mixes have a lot of unused public tracks. The tracks that actually make the final mix are called "published" tracks. Information about the published tracks in a mix is available only in the post collection, so this function queries the post collection to return the relevant documents.<br>
To summarize, this generator returns the documents in the post collection that have a user-specified number (num_track) of published tracks. For our evaluation purposes, we have restricted num_track to be greater than 2.

In [17]:
def get_published_tracks(num_track):
    """ This generator function returns all post documents with "num_track" published tracks, where num_track > 2.
    ----------
    Arguments:
        num_track: Number of public, published tracks per mix

    Returns:
        docs: All documents from the post collection that satisfy the aforementioned condition
    """
    
    pub_tracks = postCol.find({'songAccess':{'$exists': True}}, {'participantsInfo':1 ,'songAccess':1, 'objectId':1})
    for docs in pub_tracks:
        if(docs['songAccess'] == 1):
            if(len(docs['participantsInfo']['publishedTracks']) == num_track):
                yield docs
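
The same filter can also be pushed into the query itself, so that MongoDB does the filtering instead of Python. The following is only a sketch: `build_published_tracks_query` is a hypothetical helper that is not part of this notebook, and it assumes the field layout used in `get_published_tracks` above. Whether it returns exactly the same documents depends on how `songAccess` is stored in the collection.

```python
def build_published_tracks_query(num_track):
    """Build a (filter, projection) pair for public posts with exactly
    `num_track` published tracks.

    Hypothetical helper; field names are copied from get_published_tracks.
    """
    query = {
        # Public posts only
        'songAccess': 1,
        # $size matches arrays that contain exactly num_track elements
        'participantsInfo.publishedTracks': {'$size': num_track},
    }
    projection = {'participantsInfo': 1, 'songAccess': 1, 'objectId': 1}
    return query, projection


query, projection = build_published_tracks_query(3)
print(query)
```

With a live connection, the pair would be used as `postCol.find(query, projection)`.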

This function helps determine whether Bandhub's mixes can be accurately recreated using the given mixing coefficients. First, it collects all the documents returned by the get_published_tracks function. Then, it checks for, and keeps track of, any mixes that have missing published track information. If all the information is present, the post ID and the song ID of the mix are appended to a list, which is returned.

In [44]:
def get_processed_audio_ids(num_track):
    """ This function computes a list of the post IDs & master song IDs of those mixes generated by the
    get_published_tracks function for which every published track has an entry in the song's settings dictionary.
    ----------
    Arguments:
        num_track: Number of public, published tracks per mix

    Returns:
        pub_songs: A list of tuples containing relevant post & song IDs
    """
    
    pub_docs = [x for x in get_published_tracks(num_track)]
    count_mixes = len(pub_docs)
    print("Number of mixes with " + str(num_track) + " published tracks: " + str(count_mixes))
    missing_song_id = []
    missing_pubtrack = []
    index_required = []
    pub_songs = []
    for j in range(count_mixes):
        pub_tracks = []
        post_id = pub_docs[j]['_id']
        song_id = pub_docs[j]['objectId']
        
        # Materialize the cursor so that an empty result can be detected
        song_look_up = list(songCol.find({'masterSongId' : song_id}))
        
        for i in range(num_track):
            pub_tracks.append(pub_docs[j]['participantsInfo']['publishedTracks'][i]['_id'])
                                
        if not song_look_up:
            missing_song_id.append(song_id)
            pprint('Error: SongID not found in Song Collection!')
        else:
            for song_docs in song_look_up:
                count_pub_track = 0
                for i in range(len(pub_tracks)):
                    if(str(pub_tracks[i]) not in song_docs['settings']):
                        missing_pubtrack.append(pub_tracks[i])   
                        break
                    else:
                        count_pub_track+=1
                if count_pub_track == num_track:
                    pub_songs.append((post_id,song_id))

    return pub_songs

The file URLs in the dataset are of the following format:<br>
http://bandhubwebmedia1.blob.core.windows.net/files/520c0d243262aac9f04d4351-520c0d8753294aac9f04d4353.ogg<br>
We can see that the filename is present in this string after the last separator (/). This function strips the file name from the given file URL and returns it.

In [45]:
def get_filename(file):
    """ This function strips the filename along with the file extension from the file URL.
    ----------
    Arguments:
        file: file URL

    Returns:
        filename: file name including its extension
    """
    
    # The filename is everything after the last '/' separator
    filePath = file.split('/')
    return filePath[-1]
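
As an illustration of the same idea using only the standard library, the sketch below parses the URL first, so a query string (if one were ever present) would not end up in the filename. `filename_from_url` is a hypothetical name, not a function from this notebook.

```python
import posixpath
from urllib.parse import urlparse


def filename_from_url(url):
    """Return the last path component of a URL, ignoring any query string."""
    return posixpath.basename(urlparse(url).path)


url = ('http://bandhubwebmedia1.blob.core.windows.net/files/'
       '520c0d243262aac9f04d4351-520c0d8753294aac9f04d4353.ogg')
name = filename_from_url(url)
stem, ext = posixpath.splitext(name)
print(name)
print(stem, ext)
```

Splitting the result with `splitext` gives the stem that the later cells combine with `.npy` and `.flac`.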

The mixing coefficients necessary to recreate a mix are the following:
* pub_t_vol: Track volume in the settings dictionary
* ogg_paths: File paths of the tracks on the server
* start_times: Track start times

Given a song ID and post ID unique to a mix, this function queries the corresponding collections and extracts the track volume and file URL for every published track in the mix.<br>
The track volumes are all rounded to 2 decimal places. The file URL is fetched from the effectsAudioUrl field, which provides processed tracks (i.e., tracks with effects and processing). If that field is not found, the URL is fetched from the fileUrl field, which provides the raw track file.<br>
The start times for each track are stored in the track collection, although the implementation below returns only the volumes and file URLs.

In [46]:
def get_mixing_coeff(num_track, song_id, post_id):
    """ This function gathers all mixing coefficients necessary to remix a song using its constituent
    published tracks.
    ----------
    Arguments:
        num_track: Number of public, published tracks per mix
        song_id: Song ID
        post_id: Post ID

    Returns:
        pub_t_vol: Published track volumes from the settings field in the song collection
        ogg_paths: Processed (if available) or unprocessed file URLs of each published track
    """
    
    pub_t_vol = []
    ogg_paths = []
    track_filenames = []
    pub_tracks = []

    song_doc = songCol.find({'masterSongId' : song_id})
    post_doc = postCol.find({'_id' : post_id})
    
    for docs in post_doc:
        for i in range(num_track):
            pub_tracks.append(docs['participantsInfo']['publishedTracks'][i]['_id'])
    
    for docs in song_doc:
        for i in range(len(pub_tracks)):
            if(str(pub_tracks[i]) in docs['settings']):
                pub_t_vol.append(round(docs['settings'][str(pub_tracks[i])]['volume'],2))            
                gettrack = trackCol.find({'_id' : pub_tracks[i]})
                for tdocs in gettrack:
                    if('effectsAudioUrl' in docs['settings'][str(pub_tracks[i])]):             
                        ogg_paths.append(docs['settings'][str(pub_tracks[i])]['effectsAudioUrl'])
                        #print('Processed audio URL')
                    else:
                        ogg_paths.append(tdocs['audioChannels'][0]['fileUrl'])
                        #print('Unprocessed audio URL')
            else:
                print('Missing published track!')
        
    return pub_t_vol, ogg_paths
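
The URL fallback described above can be tried without a database connection. The sketch below uses made-up documents that only mimic the relevant fields; `pick_track_url`, the track IDs, and the example.com URLs are all hypothetical.

```python
def pick_track_url(track_settings, track_doc):
    """Prefer the processed audio URL from the song settings and fall back
    to the raw file URL from the track document.

    Returns (url, is_processed).
    """
    if 'effectsAudioUrl' in track_settings:
        return track_settings['effectsAudioUrl'], True
    return track_doc['audioChannels'][0]['fileUrl'], False


# Made-up stand-ins for a song's settings dictionary and the track collection
settings = {
    'trackA': {'volume': 0.8049, 'effectsAudioUrl': 'http://example.com/a-fx.ogg'},
    'trackB': {'volume': 1.0},
}
tracks = {
    'trackA': {'audioChannels': [{'fileUrl': 'http://example.com/a.ogg'}]},
    'trackB': {'audioChannels': [{'fileUrl': 'http://example.com/b.ogg'}]},
}

track_ids = ('trackA', 'trackB')
volumes = [round(settings[t]['volume'], 2) for t in track_ids]
urls = [pick_track_url(settings[t], tracks[t]) for t in track_ids]
print(volumes)
print(urls)
```

Returning the `is_processed` flag alongside the URL is a design choice of this sketch; it makes it easy to tell afterwards which tracks came from the processed set.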

This function changes the default working directory to a user-specified folder in the user's scratch directory.

In [47]:
def set_path():
    """ This function sets the default download path to a subfolder within the user's scratch folder.
    """

    path = '/scratch/rrs432/openmic/openmic-2018/mixedAudio'
    os.chdir(path)

Given the file's path on the server, this function downloads the file (as is) onto the local working directory. 

In [48]:
def download_track(ogg_path):
    """ This function downloads the track using its filepath to the default download folder.
    ----------
    Arguments:
        ogg_path: file path
    """    
    set_path()

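    # Stream the response so that large files are not held in memory all at once
    r = requests.get(ogg_path, stream=True)
    # Fail loudly instead of saving an HTTP error page as an audio file
    r.raise_for_status()
    filename = get_filename(ogg_path)
    with open(filename, 'wb') as fd:
        for chunk in r.iter_content(chunk_size=8192):
            fd.write(chunk)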

This function clears all previously downloaded songs/tracks in the current working directory.

In [49]:
def clear_downloads():
    """ This function clears all previously downloaded songs/tracks in default download folder.
    """
    ! rm -rf *.ogg *.wav

This function computes the estimated mix by multiplying the track volumes with their corresponding track data and summing the results.<br>
**Note:** In this notebook, the term *estimated mix* refers to the mix that we manually compute using the given mixing coefficients, whereas *bandhub_mix* refers to the mix provided by Bandhub, which is coupled with the mixed video.

In [57]:
def compute_estimated_mix(vol, tracks):
    """ This function computes the estimated mix using relevant mixing coefficients.
    ----------
    Arguments:
        vol: Track volume
        tracks: Track files

    Returns:
        mix_flac: Estimated normalized mix file
    """ 
    sr = 44100
    num_track = len(tracks)
    mix_flac = 0
    for i in range(num_track):
        # Duplicate mono tracks into two channels so that every track is stereo
        if tracks[i].ndim != 2:
            tracks[i] = np.stack((tracks[i], tracks[i]), axis = 1)
        mix_flac += (vol[i]*np.array(tracks[i]))
    
    # Peak-normalize the mix
    norm_mix_flac = mix_flac / np.max(np.abs(mix_flac))
    
    # Reject mixes that are shorter than 10 seconds
    if norm_mix_flac.shape[0] < 10*sr:
        return -1
    else:
        return norm_mix_flac
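
The weighted sum and peak normalization can be seen on a toy example. This is a minimal sketch rather than the function above: `mix_tracks` is a hypothetical name, and it zero-pads shorter tracks at the end so that tracks of different lengths can be summed, which is an assumption that the notebook's function does not make.

```python
import numpy as np


def mix_tracks(volumes, tracks):
    """Weighted sum of tracks followed by peak normalization.

    Assumption (not in the notebook): shorter tracks are zero-padded at the
    end so that tracks of different lengths can be summed.
    """
    stereo = []
    for t in tracks:
        t = np.asarray(t, dtype=float)
        if t.ndim == 1:
            # Mono track: duplicate it into two channels
            t = np.stack((t, t), axis=1)
        stereo.append(t)

    length = max(t.shape[0] for t in stereo)
    mix = np.zeros((length, 2))
    for v, t in zip(volumes, stereo):
        mix[:t.shape[0]] += v * t

    # Peak-normalize, leaving an all-zero mix untouched
    peak = np.max(np.abs(mix))
    return mix / peak if peak > 0 else mix


mono = np.array([0.5, -0.5, 0.25])
stereo = np.array([[0.2, 0.1], [0.2, 0.1]])
mix = mix_tracks([1.0, 0.5], [mono, stereo])
print(mix.shape)
print(np.max(np.abs(mix)))
```

After normalization the largest absolute sample is 1, so mixes with different overall gain become comparable.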

In [51]:
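def aggregate_features(arr):
    """ This function reduces the dimensionality of the feature space by summarizing each feature
    (row) over time with its mean, standard deviation, minimum and maximum.
    ----------
    Arguments:
        arr: Full feature array of shape (n_features, n_frames)

    Returns:
        Reduced-dimension feature array of shape (4 * n_features,)
    """ 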
    mean = np.mean(arr, axis=1)
    std = np.std(arr, axis=1)
    mini = np.min(arr, axis=1)
    maxi = np.max(arr, axis=1)
    return np.concatenate((mean,std,mini,maxi))
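
To make the resulting dimensions concrete, the sketch below applies the same aggregation to random stand-in data. It assumes 20 coefficients per frame (librosa's default `n_mfcc`) and 862 frames per excerpt; the random matrix replaces real MFCCs, and the function is restated so the snippet runs on its own.

```python
import numpy as np


def aggregate_features(arr):
    """Summarize each row of `arr` by its mean, std, min and max over time."""
    mean = np.mean(arr, axis=1)
    std = np.std(arr, axis=1)
    mini = np.min(arr, axis=1)
    maxi = np.max(arr, axis=1)
    return np.concatenate((mean, std, mini, maxi))


# Random stand-in for a (n_mfcc, n_frames) MFCC matrix
rng = np.random.default_rng(0)
mfcc = rng.standard_normal((20, 862))

per_matrix = aggregate_features(mfcc)
# MFCC, delta and delta-delta features are aggregated separately and concatenated
full = np.concatenate([aggregate_features(mfcc)] * 3)
print(per_matrix.shape, full.shape)
```

Under these assumptions each matrix contributes 80 values, and the classifier input built from the three matrices has 240 features.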

In [52]:
def predict_track_inst(npy_file_path, track_filename):
    """ This function performs instrument prediction on a single track
    ----------
    Arguments:
        npy_file_path: Path of the .npy files
        track_filename: Filename of the track

    Returns:
        Y : list with one array of predicted class probabilities per instrument in the pool,
            or -1 if the feature files are missing or shorter than 10 seconds
    """ 
    Y = []
    
    filename, file_extension = os.path.splitext(track_filename) 
    mfcc_file_path = npy_file_path+filename + '.npy'
    dmfcc_file_path = npy_file_path+filename+'_d.npy'
    ddmfcc_file_path = npy_file_path+filename+'_dd.npy'
    
    # Without all three feature files, X cannot be built
    feature_paths = (mfcc_file_path, dmfcc_file_path, ddmfcc_file_path)
    if not all(os.path.isfile(p) for p in feature_paths):
        return -1

    mfcc_file = np.load(mfcc_file_path)
    dmfcc_file = np.load(dmfcc_file_path)
    ddmfcc_file  = np.load(ddmfcc_file_path)

    #Compute the index offset of a random 862-frame (10-second) excerpt
    num_frames = mfcc_file.shape[1]
    if num_frames < 862:
        return -1
    # The +1 keeps randint valid when the file is exactly 862 frames long
    idx = np.random.randint(num_frames - 862 + 1)
    mfcc_ten_second = mfcc_file[:,idx:idx + 862]
    dmfcc_ten_second = dmfcc_file[:,idx:idx + 862]
    ddmfcc_ten_second = ddmfcc_file[:,idx:idx + 862]

    X = np.concatenate((aggregate_features(mfcc_ten_second),aggregate_features(dmfcc_ten_second),aggregate_features(ddmfcc_ten_second)))
    X = X[np.newaxis,:]
        
    loaded_model = joblib.load('/scratch/rrs432/openmic/openmic-2018/finalized_model.sav')
    for key in loaded_model:
        Y.append(loaded_model[key].predict_proba(X))
    
    return Y
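
The constant 862 in the function above can be related to a 10-second excerpt. The sketch below assumes that the stored features were computed from 44.1 kHz audio with librosa's default hop length of 512 and centered frames; the notebook does not state this, so treat the constants as assumptions. `frames_for` and `random_window` are hypothetical helper names.

```python
import random

SR = 44100           # assumed sample rate of the audio
HOP_LENGTH = 512     # librosa's default hop length
WINDOW_SECONDS = 10


def frames_for(num_samples, hop_length=HOP_LENGTH):
    """Number of frames produced by centered framing: 1 + floor(n / hop)."""
    return 1 + num_samples // hop_length


def random_window(num_frames, window, rng=random):
    """Return (start, stop) of a random window, or None if the file is too short."""
    if num_frames < window:
        return None
    # +1 so that a file of exactly `window` frames yields start == 0
    start = rng.randrange(num_frames - window + 1)
    return start, start + window


window = frames_for(WINDOW_SECONDS * SR)
print(window)
print(random_window(2000, window))
```

Returning `None` for short files plays the same role as the `-1` sentinel used in the notebook.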

In [53]:
def consolidate_track_labels(num_track, mix, instrument):
    """ This function consolidates the labels on all the constituent tracks of the mix
    ----------
    Arguments:
        num_track: Number of public, published tracks per mix
        mix: Mix file
        instrument: instrument key

    Returns:
        [instrument_present]: Single-element list holding True if any track's predicted probability
        for the instrument exceeds 0.5, otherwise False. Returns -1 if mix has an unexpected shape.
    """
    if np.shape(mix) == (num_track, 20, 1, 2):
        # A mix contains the instrument if at least one of its tracks does
        max_prob = max([mix[track][instrument][0][1] for track in range(len(mix))])
        return [bool(max_prob > 0.5)]
    else:
        return -1
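
The consolidation rule is that a mix is labelled as containing an instrument if at least one of its tracks is predicted to contain it. A small sketch with made-up probabilities for two tracks and two instruments follows; `instrument_present` is a hypothetical name, and the sketch assumes that the second column of each `predict_proba` output is the probability of the instrument being present.

```python
def instrument_present(track_probs, instrument, threshold=0.5):
    """track_probs[track][instrument] holds [[p_absent, p_present]].

    The mix gets a positive label if the highest per-track probability of
    presence exceeds the threshold.
    """
    max_prob = max(track[instrument][0][1] for track in track_probs)
    return max_prob > threshold


# Made-up predictions: 2 tracks x 2 instruments x 1 x 2
track_probs = [
    [[[0.9, 0.1]], [[0.3, 0.7]]],   # track 0
    [[[0.6, 0.4]], [[0.8, 0.2]]],   # track 1
]
labels = [instrument_present(track_probs, i) for i in range(2)]
print(labels)
```

Taking the maximum rather than the mean means that one confident track is enough to label the whole mix.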

In [54]:
def get_mix_training_data(num_track, proc_id, instrument):
    count = 0
    error_flag = 0
    track_filenames = []
    flac_tracks = []

    post_id = proc_id[0]
    song_id = proc_id[1]

    #Get mixing coefficients and filepaths
    volume_settings, ogg_paths = get_mixing_coeff(num_track, song_id, post_id)

    if len(ogg_paths) != num_track:
        error_flag = -1
    else:
        for j in range(num_track):
            #Get filenames
            track_filenames.append(get_filename(ogg_paths[j])) 

    if len(track_filenames) != num_track:
        error_flag = -2
    else:
        Y_track = []
        for j in range(len(track_filenames)):
            filename, file_extension = os.path.splitext(track_filenames[j])
            npy_filename = filename + '.npy'
            flac_filename = filename + '.flac'

            #Check if track is unprocessed, then check if the downloaded flac file is available
            if os.path.isfile('/scratch/rrs432/openmic/openmic-2018/unprocessedAudio/'+npy_filename):
                npy_file_path = '/scratch/rrs432/openmic/openmic-2018/unprocessedAudio/'
                Y_track.append(predict_track_inst(npy_file_path, track_filenames[j]))
                if os.path.isfile('/scratch/work/marl/bandhub/unprocessedAudio/'+flac_filename):
                    flac_track, mixsr = sf.read('/scratch/work/marl/bandhub/unprocessedAudio/'+flac_filename)
                    flac_tracks.append(flac_track)

            #Else check if track is processed, then check if the downloaded flac file is available
            elif os.path.isfile('/scratch/rrs432/openmic/openmic-2018/processedAudio/'+npy_filename):
                npy_file_path = '/scratch/rrs432/openmic/openmic-2018/processedAudio/'
                Y_track.append(predict_track_inst(npy_file_path, track_filenames[j]))
                if os.path.isfile('/scratch/work/marl/bandhub/processedAudio/'+flac_filename):
                    flac_track, mixsr = sf.read('/scratch/work/marl/bandhub/processedAudio/'+flac_filename)
                    flac_tracks.append(flac_track)
            else:
                print('Flac file not found!') 

    # Checking error_flag first avoids touching Y_track when it was never created
    if error_flag != 0 or any(isinstance(y, int) and y == -1 for y in Y_track):
        error_flag = error_flag or -5
        X = 0
        Y = 0
    else:
        X = 0
        Y = 0
        #Feature extraction on estimated bandhub mix
        if len(flac_tracks) == num_track:
            mix_flac = compute_estimated_mix(volume_settings, flac_tracks)
            if isinstance(mix_flac, np.ndarray):
                #Pick random 10-second snippet from the mix
                idx = np.random.randint(mix_flac.shape[0] - 10*mixsr + 1)
                mix_flac = mix_flac[idx:idx + 10*mixsr,:]

                #Downmix stereo to mono
                if mix_flac.ndim == 2:
                    mix_flac = mix_flac.sum(axis=1)/2

                mfcc_ten_second = librosa.feature.mfcc(y=mix_flac, sr=mixsr)
                dmfcc_ten_second = librosa.feature.delta(mfcc_ten_second)
                ddmfcc_ten_second = librosa.feature.delta(mfcc_ten_second,order=2)

                X = np.concatenate((aggregate_features(mfcc_ten_second),aggregate_features(dmfcc_ten_second),aggregate_features(ddmfcc_ten_second)))
                X = X[np.newaxis,:]
            else:
                error_flag = -3
        else:
            #Not every track had a downloaded flac file
            error_flag = -4

        #Determine Y values for the mix by combining instrument recognition probabilities of constituent tracks
        if error_flag == 0:
            Y = consolidate_track_labels(num_track, Y_track, instrument)
        else:
            X = 0
            Y = 0
    if error_flag == 0:
        print('Success')
    yield dict(X=X,Y=Y)
# pprint(get_mix_training_data())

### 5. Main script

In [55]:
with open(os.path.join(DATA_ROOT, 'class-map.json'), 'r') as f:
    class_map = json.load(f)

In [None]:
# This dictionary will include the classifiers for each model
models = dict()
proc_id_filepath = '/scratch/rrs432/openmic/openmic-2018/proc_id/'
# We'll iterate over all instrument classes, and fit a model for each one
instrumentIndex = 0
for instrument in class_map:
    
    inst_num = class_map[instrument]
        
    # Initialize a new classifier
    clf = RandomForestClassifier(max_depth=5, max_features= 10,n_estimators=50)
    
    # Iterate through all files and train
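    # Build one stream per mix, across all published-track counts
    streams = []
    for num_pub_track in range(2,20):
        npy_files = []
        for root, dirs, files in os.walk(proc_id_filepath+str(num_pub_track)):
            for file in files:
                if file.endswith(".npy"):
                    npy_files.append(np.load(os.path.join(root, file)))

        streams.extend(pescador.Streamer(get_mix_training_data, num_pub_track, proc_id, instrumentIndex) for proc_id in npy_files)

    # Keep 16 streams alive at once
    # Draw on average 8 patches from each stream before deactivating
    mux_stream = pescador.mux.Mux(streams, k=16, rate=8)

    # RandomForestClassifier.fit() retrains from scratch on every call,
    # so collect the valid samples first and fit once
    X_train, Y_train = [], []
    for batch in mux_stream(max_iter=1000):
        # Valid samples have a 2-D feature row and a 1-D label list; failures are scalars
        if np.ndim(batch['X']) == 2 and np.ndim(batch['Y']) == 1:
            X_train.append(batch['X'])
            Y_train.extend(batch['Y'])

    if X_train:
        clf.fit(np.vstack(X_train), np.array(Y_train))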
    
    # Store the classifier in our dictionary
    models[instrument] = clf
    instrumentIndex += 1
    
# Save model to disk
filename = 'bandhub_model.sav'
joblib.dump(models, filename)

Success
Success
Success
Success
Success
