# Estimation of Direction of Arrival (DOA) for First Order Ambisonic Audio Files using Artificial Neural Networks

**Pedro Pablo Lucas Bravo**

**pedropl@uio.no**

Go back to [file.ipynb](file.ipynb)

Continue to [training.ipynb](training.ipynb)

# Feature Extraction

**Before running**: If you do NOT want to save the features to a file, set the next variable to `False`.

In [1]:
save = True

## Packages and Utility Functions

In [2]:
import numpy as np
import pandas as pd
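import scipy.interpolate  # submodules imported explicitly; a bare `import scipy` may not load them on older SciPy versions
import scipy.signal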
import librosa
import os
import sklearn
import time

start_time = time.time()

#Taken maths from: https://math.libretexts.org/Bookshelves/Calculus/Book%3A_Calculus_(OpenStax)/12%3A_Vectors_in_Space/12.7%3A_Cylindrical_and_Spherical_Coordinates#:~:text=To%20convert%20a%20point%20from,and%20z%3D%CF%81cos%CF%86.&text=To%20convert%20a%20point%20from,and%20z%3D%CF%81cos%CF%86.
def SphericalToCartesian(ele, azi, dist):
    phi = np.deg2rad(90-ele)
    theta = np.deg2rad(azi)
    
    x = dist * np.sin(phi) * np.cos(theta)
    #x=ρsinφcosθ 
    y = dist * np.sin(phi) * np.sin(theta)
    #y=ρsinφsinθ 
    z = dist * np.cos(phi)
    #z=ρcosφ
    return np.array([x, y, z])


#Taken from ML workshop: linear interpolation of a 1-D array to a given output size
def lin_interp_1d(data, out_size):
    
    in_size = data.shape[0]
    x_in = np.arange(0,in_size)
    interpolator = scipy.interpolate.interp1d(x_in, data)
    x_out = np.arange(0,in_size-1,((in_size-1)/out_size))
    output = interpolator(x_out)
    output = output[0:out_size]
    
    return output

# Taken from: https://stackoverflow.com/questions/21030391/how-to-normalize-a-numpy-array-to-a-unit-vector
# It normalizes a vector such that the norm is 1
def normalize(v):
    norm=np.linalg.norm(v)
    if norm==0:
        norm=np.finfo(v.dtype).eps
    return v/norm
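
As a quick sanity check of the coordinate convention used above (elevation measured from the horizontal plane, azimuth in the horizontal plane, both in degrees), the following minimal sketch re-implements the same formulas under a hypothetical name, `spherical_to_cartesian`, and evaluates three easy directions. The test points are chosen only for illustration and do not come from the database.

```python
import numpy as np

def spherical_to_cartesian(ele, azi, dist):
    # Same formulas as SphericalToCartesian: phi is the polar angle from the z axis
    phi = np.deg2rad(90 - ele)
    theta = np.deg2rad(azi)
    return np.array([
        dist * np.sin(phi) * np.cos(theta),
        dist * np.sin(phi) * np.sin(theta),
        dist * np.cos(phi),
    ])

front = spherical_to_cartesian(0, 0, 2.0)   # approximately [2, 0, 0]: on the x axis
side = spherical_to_cartesian(0, 90, 1.0)   # approximately [0, 1, 0]: on the y axis
up = spherical_to_cartesian(90, 0, 1.0)     # approximately [0, 0, 1]: on the z axis

# Dividing by the norm removes the distance and keeps only the direction
unit_front = front / np.linalg.norm(front)
print(np.round(front, 6), np.round(side, 6), np.round(up, 6), np.round(unit_front, 6))
```

Because the target is normalized afterwards, the distance only matters before normalization; two sources in the same direction at different distances map to the same target vector.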

## Row feature extraction

A *First Order Ambisonic (FOA)* audio file has 4 channels, which carry the components W, X, Y and Z. To build the features, the **cross power spectral density** is calculated for the pairs (W, X), (W, Y) and (W, Z). Each pair captures how a directional component differs from the omnidirectional channel W in the distribution of power across the frequency spectrum over time. From these cross spectra, the *angle* (phase) is taken since, according to the literature and experiments, it performs well for DOA estimation (more details in the report).

Additionally, to reduce computation time and give every file the same feature size, each angle vector is interpolated to a fixed length of 256 and then normalized to the range [-1, 1]. The three resulting vectors are concatenated into a single row, which is the final feature example (3 x 256 = 768 values).
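
The pipeline described above can be sketched for a single channel pair on synthetic data. This is only an illustration under stated assumptions: the signals `w` and `x` are made up (noise and a delayed, scaled copy of it), and the sketch uses `np.linspace` and a plain peak division in place of the notebook's `lin_interp_1d` and `librosa.util.normalize`, so the values will not match the notebook's exactly.

```python
import numpy as np
from scipy import interpolate, signal

sr = 22050
rng = np.random.default_rng(0)
w = rng.standard_normal(sr)      # 1 s of noise standing in for the W channel
x = 0.5 * np.roll(w, 3)          # hypothetical X channel: delayed, scaled copy of W

# Cross power spectral density with the same segment settings as the notebook
freqs, pxy = signal.csd(w, x, fs=sr, nperseg=1024, noverlap=512)
phase = np.angle(pxy)            # one phase value per frequency bin, in [-pi, pi]

# Linear interpolation of the phase vector down to 256 points
grid = np.linspace(0, len(phase) - 1, 256)
phase_256 = interpolate.interp1d(np.arange(len(phase)), phase)(grid)

# Peak normalization so that the values lie in [-1, 1]
peak = np.max(np.abs(phase_256))
feature = phase_256 / peak if peak > 0 else phase_256

print(freqs.shape, phase.shape, feature.shape)
```

With `nperseg=1024` the one-sided spectrum has 513 bins, which is why the interpolation step is needed to reach 256 values per pair.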

In [3]:
#signal: the 4-channel FOA audio signal (W, X, Y, Z)
#location: the DOA as a vector in cartesian coordinates [x, y, z]
#sr: the sample rate of the signal
def extract_features_target(signal, location, sr):
    feature = []
    for ch in range(1,4):    
        cross = scipy.signal.csd(signal[0], signal[ch], fs = sr, nperseg=1024, noverlap=512)
        cross = lin_interp_1d(np.angle(cross[1]), 256)
        cross = librosa.util.normalize(cross)
        feature = np.append(feature, cross)
        
    return feature, np.array(location)

## Loading from Database and Extract Features

The database contains a metadata folder **metadata_dev**, with one CSV file per recording, and a set of audio files in the folder **foa_dev**. Each audio file has more than one sound event in the same recording. This code extracts every sound event, together with its features, from the corresponding FOA audio file into a data structure that will be saved later.
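
Cutting the sound events out of a recording amounts to converting each event's start and end time (in seconds) to sample indices and slicing all four channels at once. The sketch below shows this on made-up data: the silent 3-second signal and the two metadata rows are hypothetical, and only the column names follow the metadata files used in this notebook.

```python
import numpy as np
import pandas as pd

sr = 22050
signal = np.zeros((4, 3 * sr))   # hypothetical 3 s, 4-channel recording

# Hypothetical metadata rows; times are in seconds, angles in degrees
metadata = pd.DataFrame({
    'start_time': [0.5, 1.5],
    'end_time': [1.0, 2.75],
    'ele': [10, -20],
    'azi': [30, 120],
    'dist': [1.0, 2.0],
})

events = []
for row in metadata.itertuples(index=False):
    start = int(row.start_time * sr)   # seconds -> sample index
    end = int(row.end_time * sr)
    events.append(signal[:, start:end])  # keep all 4 channels of the event

print([event.shape for event in events])
```

Each slice keeps the channel axis first, which is the layout `extract_features_target` expects when it indexes `signal[0]` and `signal[ch]`.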

In [4]:
sr = 22050 # Sample rate

#Extracts the examples for either the training or the testing dataset
#max_it is the number of files to consider, and max_num_examples the maximum number of sound events in the dataset
def build_examples(filenames_meta_dir, audiofiles_dir, max_it, max_num_examples):
    filenames_meta = os.listdir(filenames_meta_dir) #Metadata files (one per audio file)
    num_features = 768 #Size of the feature vector: 3x256 (3 cross spectra of 256 elements each)
    features = np.zeros((max_num_examples,num_features)) #The feature matrix
    target = np.zeros((max_num_examples,3)) #3 target values [x,y,z] that represents the DOA as a normalized cartesian vector
    meta_feat = pd.DataFrame(columns=['file_name', 'sound_event_recording', 'start_time', 'end_time', 'ele', 'azi', 'dist'])   #Metadata to add to the file in which features will be saved
    example = 0

    for i in range(max_it):  

        #Metadata
        metadata = pd.read_csv(filenames_meta_dir + '/' + filenames_meta[i])
        filename = os.path.splitext(filenames_meta[i])[0]

        print("processing '" + filename + "' " + str(i + 1) + "/" + str(max_it))

        #Audio track
        signal, _ = librosa.load(audiofiles_dir + '/' + filename + '.wav', sr=sr, mono=False) #sr is passed by keyword, as recent librosa versions require
        for s in range(len(metadata)):
            start_time = int(metadata['start_time'][s] * sr)
            end_time = int(metadata['end_time'][s] * sr)
            subsignal = librosa.util.normalize(signal[:, start_time:end_time]) # Extract the sound event and normalize it 
            #Extract the feature vector and convert spherical coordinates to a normalized cartesian vector
            features[example,:], target[example,:] = extract_features_target(subsignal, normalize(SphericalToCartesian(metadata['ele'][s],  metadata['azi'][s], metadata['dist'][s])), sr)
            #Fill additional metadata
            to_append = [filename, 
                              metadata['sound_event_recording'][s],
                              metadata['start_time'][s],
                              metadata['end_time'][s],
                              metadata['ele'][s],
                              metadata['azi'][s],
                              metadata['dist'][s]
                             ]
            df_length = len(meta_feat)
            meta_feat.loc[df_length] = to_append
            example += 1

    #Drop the unused rows if needed (it happens if max_it < 400 for training or < 100 for testing)
    features = features[:example]
    target = target[:example]
    print('Done!')
    print('Features Size: ',features.shape)
    return features, target, meta_feat
    
#Build the training feature matrix
print('Building training features...')
features_train, target_train, meta_train = build_examples('data/dcase_data/metadata_dev', 'data/dcase_data/foa_dev/', 400, 15798) #max 400
    
#Build the testing feature matrix
print('Building testing features...')
features_test, target_test, meta_test = build_examples('data/dcase_data/testing/metadata_eval', 'data/dcase_data/testing/foa_eval/', 100, 3974) #max 100

Building training features...
processing 'split1_ir0_ov1_1' 1/400
processing 'split1_ir0_ov1_10' 2/400
processing 'split1_ir0_ov1_2' 3/400
processing 'split1_ir0_ov1_3' 4/400
processing 'split1_ir0_ov1_4' 5/400
processing 'split1_ir0_ov1_5' 6/400
processing 'split1_ir0_ov1_6' 7/400
processing 'split1_ir0_ov1_7' 8/400
processing 'split1_ir0_ov1_8' 9/400
processing 'split1_ir0_ov1_9' 10/400
processing 'split1_ir0_ov2_11' 11/400
processing 'split1_ir0_ov2_12' 12/400
processing 'split1_ir0_ov2_13' 13/400
processing 'split1_ir0_ov2_14' 14/400
processing 'split1_ir0_ov2_15' 15/400
processing 'split1_ir0_ov2_16' 16/400
processing 'split1_ir0_ov2_17' 17/400
processing 'split1_ir0_ov2_18' 18/400
processing 'split1_ir0_ov2_19' 19/400
processing 'split1_ir0_ov2_20' 20/400
processing 'split1_ir1_ov1_21' 21/400
processing 'split1_ir1_ov1_22' 22/400
processing 'split1_ir1_ov1_23' 23/400
processing 'split1_ir1_ov1_24' 24/400
processing 'split1_ir1_ov1_25' 25/400
processing 'split1_ir1_ov1_26' 26/400


processing 'split3_ir0_ov2_14' 214/400
processing 'split3_ir0_ov2_15' 215/400
processing 'split3_ir0_ov2_16' 216/400
processing 'split3_ir0_ov2_17' 217/400
processing 'split3_ir0_ov2_18' 218/400
processing 'split3_ir0_ov2_19' 219/400
processing 'split3_ir0_ov2_20' 220/400
processing 'split3_ir1_ov1_21' 221/400
processing 'split3_ir1_ov1_22' 222/400
processing 'split3_ir1_ov1_23' 223/400
processing 'split3_ir1_ov1_24' 224/400
processing 'split3_ir1_ov1_25' 225/400
processing 'split3_ir1_ov1_26' 226/400
processing 'split3_ir1_ov1_27' 227/400
processing 'split3_ir1_ov1_28' 228/400
processing 'split3_ir1_ov1_29' 229/400
processing 'split3_ir1_ov1_30' 230/400
processing 'split3_ir1_ov2_31' 231/400
processing 'split3_ir1_ov2_32' 232/400
processing 'split3_ir1_ov2_33' 233/400
processing 'split3_ir1_ov2_34' 234/400
processing 'split3_ir1_ov2_35' 235/400
processing 'split3_ir1_ov2_36' 236/400
processing 'split3_ir1_ov2_37' 237/400
processing 'split3_ir1_ov2_38' 238/400
processing 'split3_ir1_ov

processing 'split0_35' 30/100
processing 'split0_36' 31/100
processing 'split0_37' 32/100
processing 'split0_38' 33/100
processing 'split0_39' 34/100
processing 'split0_4' 35/100
processing 'split0_40' 36/100
processing 'split0_41' 37/100
processing 'split0_42' 38/100
processing 'split0_43' 39/100
processing 'split0_44' 40/100
processing 'split0_45' 41/100
processing 'split0_46' 42/100
processing 'split0_47' 43/100
processing 'split0_48' 44/100
processing 'split0_49' 45/100
processing 'split0_5' 46/100
processing 'split0_50' 47/100
processing 'split0_51' 48/100
processing 'split0_52' 49/100
processing 'split0_53' 50/100
processing 'split0_54' 51/100
processing 'split0_55' 52/100
processing 'split0_56' 53/100
processing 'split0_57' 54/100
processing 'split0_58' 55/100
processing 'split0_59' 56/100
processing 'split0_6' 57/100
processing 'split0_60' 58/100
processing 'split0_61' 59/100
processing 'split0_62' 60/100
processing 'split0_63' 61/100
processing 'split0_64' 62/100
processing 's

## Save Features to Files

In [5]:
#Merge features, targets and metadata into a single DataFrame and write it to a CSV file
def save_to_csv(features, target, meta_feat, file_name):
    dataset = pd.DataFrame(features)

    dataset['x'] = target[:,0]
    dataset['y'] = target[:,1]
    dataset['z'] = target[:,2]

    dataset['file_name'] = meta_feat['file_name']
    dataset['sound_event_recording'] = meta_feat['sound_event_recording']
    dataset['start_time'] = meta_feat['start_time']
    dataset['end_time'] = meta_feat['end_time']
    dataset['ele'] = meta_feat['ele']
    dataset['azi'] = meta_feat['azi']
    dataset['dist'] = meta_feat['dist']

    dataset.to_csv(file_name)

if save:
    # Saving training features
    print('Saving training features...')
    save_to_csv(features_train, target_train, meta_train, 'features_train.csv')

    # Saving testing features
    print('Saving testing features...')
    save_to_csv(features_test, target_test, meta_test, 'features_test.csv')
else:
    print('Features were not saved!')
print('Done!')
print('TOTAL EXECUTION TIME: ', str(time.time() - start_time), ' sec')

Saving training features...
Saving testing features...
Done!
TOTAL EXECUTION TIME:  1654.4557211399078  sec
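
Since `save_to_csv` writes the DataFrame index as the first column, followed by the 768 feature columns, the target columns `x`, `y`, `z` and the metadata, a file in this layout can be read back as sketched below. This is an assumption about how a later step might load the data, not the code of the training notebook; the file name `features_demo.csv` and the random values are hypothetical stand-ins.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Tiny stand-in for a saved file: 768 feature columns, then x, y, z and one metadata column
features = np.random.default_rng(0).uniform(-1, 1, size=(5, 768))
dataset = pd.DataFrame(features)
dataset['x'], dataset['y'], dataset['z'] = 0.0, 1.0, 0.0
dataset['file_name'] = 'example'

path = os.path.join(tempfile.mkdtemp(), 'features_demo.csv')
dataset.to_csv(path)

# index_col=0 consumes the index column written by to_csv
loaded = pd.read_csv(path, index_col=0)
X = loaded.iloc[:, :768].to_numpy()        # feature matrix
y = loaded[['x', 'y', 'z']].to_numpy()     # DOA targets as cartesian unit vectors

print(X.shape, y.shape)
```

Selecting the features by position avoids relying on the column labels, which come back from the CSV as strings rather than integers.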


Go back to [file.ipynb](file.ipynb)

Continue to [training.ipynb](training.ipynb)