# Genre recognition: feature extraction

The audio genre recognition pipeline:
1. GTZAN
2. pre-processing
3. unsupervised feature extraction
4. classification

Open questions:
* Rescale the dataset? We need to for the algorithm to converge.
    * Rescale the $n$ features to [0,1] --> converges. But we need to learn the transform.
    * Normalize each sample to unit norm --> converges, but with a higher objective and a less sparse Z. We also lose the generative ability of our model.
* Is there a way to programmatically assess convergence? It is easy for us to look at the objective function, but less so for a machine.
* Store data in float64? Or compute in float32?
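
One possible way to assess convergence programmatically is to stop once the relative change of the objective between two iterations falls below a tolerance. This is presumably the role of the `rtol` argument passed to `auto_encoder` in the feature extraction cell, but that is an assumption. A minimal sketch, where `has_converged` and the list of objective values are hypothetical and not part of this notebook's code:

```python
def has_converged(objective_values, rtol=1e-5):
    """Return True once the last relative change of the objective is below rtol."""
    if len(objective_values) < 2:
        return False  # two values are needed to measure a change
    previous, current = objective_values[-2], objective_values[-1]
    if previous == 0:
        return current == 0
    return abs(current - previous) / abs(previous) < rtol


# Hypothetical objective values, for illustration only.
history = [100.0, 10.0, 9.0, 8.99999]
print(has_converged(history[:3]))  # False: relative change is 0.1
print(has_converged(history))      # True: relative change is about 1.1e-6
```

Such a test only detects that the objective has stalled. It cannot tell whether it stalled at a good solution.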

Observations:
* Memory efficiency:
    * m=64, 20 songs: 600 MiB --> 170 MiB (pyul mem optimization)
    * m=64, 40 songs: 900 MiB --> 170 MiB (pyul mem optimization)
    * m=128, 200 songs: 800 MiB (pyul mem optimization)
    * m=128, 400 songs: 2 GiB (pyul mem optimization)
* Time efficiency:
    * m=64, 20 songs: 370s
    * m=128, 400 songs: 19636s (pyul mem optimization)

## Setup

In [None]:
import os, time
import numpy as np
import h5py
import matplotlib.pyplot as plt
%matplotlib inline

# Import auto-encoder definition.
%run -n auto_encoder.ipynb
#import auto_encoder

# Profiling.
%load_ext memory_profiler
%load_ext line_profiler
import objgraph

#%load_ext autoreload
#%autoreload 2

## Input data

In [None]:
filename = os.path.join('data', 'audio_v2_full.hdf5')
audio = h5py.File(filename, 'r')

# Display HDF5 attributes.
print('Attributes:')
for attr in audio.attrs:
    print('  {} = {}'.format(attr, audio.attrs[attr]))

# Show datasets, their dimensionality and data type.
print('Datasets:')
for dname, dset in audio.items():
    print('  {:2}: {:24}, {}'.format(dname, dset.shape, dset.dtype))

In [None]:
def datinfo(X, name='Dataset'):
    r"""Print dataset size and dimensionality"""
    print('{}:\n'
          '  size: N={:,} x n={} -> {:,} floats\n'
          '  dim: {:,} features per clip\n'
          '  shape: {}'
          .format(name, np.prod(X.shape[:-1]), X.shape[-1],
                  np.prod(X.shape), np.prod(X.shape[2:]), X.shape))

In [None]:
# Choose dataset.
X = audio.get('Xs')

# Full dataset.
Ngenres, Nclips, Nframes, _, n = X.shape
datinfo(X, 'Full dataset')
print(type(X))

# Reduce data size.
Ngenres, Nclips = 4, 100

# Load data into memory as a standard NumPy array.
X = X[:Ngenres,:Nclips,...]
datinfo(X, 'Reduced dataset')
print(type(X))

# Alternative: resize the HDF5 dataset in place via a hyperslab,
# without loading it into memory. Requires chunked datasets.
#X.resize((Ngenres, Nclips, Nframes, 2, n))

# Squeeze dataset to a 2D array. The auto-encoder does not
# care about the underlying structure of the dataset.
X.resize(Ngenres * Nclips * Nframes * 2, n)
print('Data: {}, {}'.format(X.shape, X.dtype))

# Independently rescale each feature.
# To be put in an sklearn Pipeline to avoid transductive learning.
X -= np.min(X, axis=0)
X /= np.max(X, axis=0)

# Independently normalize each sample.
#X /= np.linalg.norm(X, axis=1)[:,np.newaxis]
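
The rescaling above uses the minimum and maximum of the whole dataset. To avoid transductive learning, the transform would have to be learned on the training split only and then applied unchanged to the test split. A minimal NumPy sketch of that idea follows. The class `MinMaxRescaler` and the random data are hypothetical. scikit-learn's `MinMaxScaler` follows the same fit / transform pattern and can be placed in a `Pipeline`.

```python
import numpy as np


class MinMaxRescaler:
    """Hypothetical helper: learn per-feature minimum and range on training data."""

    def fit(self, X):
        self.min_ = X.min(axis=0)
        span = X.max(axis=0) - self.min_
        # Guard against constant features, which would divide by zero.
        self.span_ = np.where(span > 0, span, 1.0)
        return self

    def transform(self, X):
        return (X - self.min_) / self.span_


rng = np.random.default_rng(0)
X_train = rng.uniform(-5, 5, size=(100, 4))
X_test = rng.uniform(-5, 5, size=(20, 4))

scaler = MinMaxRescaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)  # each feature spans [0, 1]
X_test_scaled = scaler.transform(X_test)    # may fall slightly outside [0, 1]
```

Test samples can land outside [0,1] because their extrema were not seen during fitting. That is the price of not looking at the test data.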

## Feature extraction

Hyper-parameters:
* m:  number of atoms in the dictionary, sparse code length
* ld: weight of the dictionary l2 penalty
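
To make the role of the two hyper-parameters concrete, the sketch below evaluates a sparse coding objective of the common form $\frac{ld}{2} \|X - ZD\|_F^2 + \|Z\|_1$. Both this form and the shapes (X is N x n, Z is N x m, D is m x n) are assumptions. The objective actually minimized is defined in `auto_encoder.ipynb` and may differ. The function name `sparse_coding_objective` is hypothetical.

```python
import numpy as np


def sparse_coding_objective(X, Z, D, ld):
    """Assumed form: ld/2 * ||X - Z D||_F^2 + ||Z||_1 (not taken from the notebook)."""
    data_term = ld / 2 * np.linalg.norm(X - Z @ D, 'fro')**2
    sparsity_term = np.abs(Z).sum()
    return data_term + sparsity_term


# Toy sizes: N samples, n features, m atoms.
N, n, m = 6, 4, 3
rng = np.random.default_rng(0)
D = rng.standard_normal((m, n))
Z = np.zeros((N, m))
Z[:, 0] = 1.0  # each sample uses only the first atom
X = Z @ D      # data that the codes reconstruct perfectly

print(sparse_coding_objective(X, Z, D, ld=10))  # only the l1 term remains: 6.0
```

Under this assumed form, a larger `ld` favors reconstruction accuracy over sparsity of Z, and a larger `m` gives the codes more atoms to choose from.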

In [None]:
m = 128  # 512
ld = 10

Size of training data and parameters.

In [None]:
N = Ngenres * Nclips * Nframes * 2
sizeX = N * n / 2.**20
sizeZ = N * m / 2.**20
sizeD = n * m / 2.**10
sizeE = m * n / 2.**10
# 64-bit floats: 8 bytes per value.
print('Size X: {:.1f} M --> {:.1f} MiB'.format(sizeX, sizeX*8))
print('Size Z: {:.1f} M --> {:.1f} MiB'.format(sizeZ, sizeZ*8))
print('Size D: {:.1f} k --> {:.1f} kiB'.format(sizeD, sizeD*8))
print('Size E: {:.1f} k --> {:.1f} kiB'.format(sizeE, sizeE*8))
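
The factor 8 above assumes 64-bit floats. For the float64 versus float32 question, the footprint can be derived from the dtype itself. A small sketch, where the helper `size_mib` is hypothetical and the shape is illustrative rather than that of the actual dataset:

```python
import numpy as np


def size_mib(shape, dtype):
    """Memory footprint in MiB of a dense array with the given shape and dtype."""
    return int(np.prod(shape, dtype=np.int64)) * np.dtype(dtype).itemsize / 2**20


# Illustrative shape only, not the dimensions of the actual dataset.
shape = (2**20, 128)
for dtype in (np.float64, np.float32):
    print('{}: {:.0f} MiB'.format(np.dtype(dtype).name, size_mib(shape, dtype)))
# float64: 1024 MiB
# float32: 512 MiB
```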

In [None]:
# 200 10 | 200 15
ae = auto_encoder(m=m, ld=ld, rtol=1e-5, xtol=None, N_inner=200, N_outer=10)
tstart = time.time()
Z = ae.fit_transform(X)
print('Elapsed time: {:.0f} seconds'.format(time.time() - tstart))

## Performance analysis

Time analysis.

In [None]:
if False:
    %prun Z = ae.fit_transform(X)

Space analysis.

In [None]:
if False:
    import gc
    gc.collect()
    objgraph.show_most_common_types()
    from pyunlocbox import solvers, functions
    %mprun -f ae.fit_transform -f ae._minD -f ae._minZ -f solvers.solve -f solvers.forward_backward._pre -f solvers.forward_backward._fista -f functions.norm_l1._prox -T profile.txt ae.fit_transform(X)
    #%mprun -f solvers.solve -f solvers.forward_backward._pre -f solvers.forward_backward._fista -f functions.norm_l1._prox -T profile.txt ae.fit_transform(X)
    gc.collect()
    objgraph.show_most_common_types()

In [None]:
if False:
    from pympler import tracker
    tr = tracker.SummaryTracker()
    Z = ae.fit_transform(X)
    tr.print_diff()

## Solution analysis

### Objective

In [None]:
ae.plot_objective()
objective(X, Z, ae.D, ae.ld)

### Sparse codes

In [None]:
sparse_codes(Z)

### Dictionary

Observations:
* The learned atoms seem to represent harmonies and harmonics.
* The atoms themselves look sparse. Should we add some prior knowledge on the dictionary?
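
The visual impression that the atoms are sparse could be backed by a number, for instance the fraction of near-zero coefficients. A minimal sketch, where the helper `sparsity`, its tolerance and the toy dictionary are all hypothetical. In this notebook it would be applied to `ae.D`, assuming one atom per row.

```python
import numpy as np


def sparsity(A, tol=1e-6):
    """Fraction of entries whose magnitude is at most tol."""
    A = np.asarray(A)
    return np.count_nonzero(np.abs(A) <= tol) / A.size


# Toy dictionary with m=3 atoms of n=4 coefficients (illustrative values).
D = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.5, 0.0, 0.5, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
print(sparsity(D))                     # 8 of 12 entries are zero
print([sparsity(atom) for atom in D])  # per-atom sparsity
```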

In [None]:
dictionary(ae.D)
atoms(ae.D)

## Output data

We will store more Z matrices once the various approximations are implemented.

In [None]:
filename = os.path.join('data', 'features.hdf5')

# Remove existing HDF5 file without warning if non-existent.
try:
    os.remove(filename)
except OSError:
    pass

# Create HDF5 file and datasets.
features = h5py.File(filename, 'w')

# Metadata.
features.attrs['sr'] = audio.attrs['sr']
features.attrs['labels'] = audio.attrs['labels']

# Data.
features.create_dataset('X', data=X.reshape(Ngenres, Nclips, Nframes, 2, n), dtype='float32')
features.create_dataset('Z', data=Z.reshape(Ngenres, Nclips, Nframes, 2, Z.shape[-1]), dtype='float32')
features.create_dataset('D', data=ae.D, dtype='float32')

# Show datasets, their dimensionality and data type.
print('Datasets:')
for dname, dset in features.items():
    print('  {:2}: {:22}, {}'.format(dname, dset.shape, dset.dtype))

# Display HDF5 attributes.
print('Attributes:')
for name, value in features.attrs.items():
    print('  {} = {}'.format(name, value))
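
The `reshape` calls above rely on the 2D arrays keeping the row order of the original 5D layout. The sketch below checks that property on toy dimensions and shows how a row index could be mapped back to its position in the dataset. The sizes are made up, and the name `part` for the axis of length 2 is a placeholder since its meaning is not stated here.

```python
import numpy as np

# Toy dimensions standing in for Ngenres, Nclips, Nframes, 2 and n.
shape = (2, 3, 4, 2, 5)
X5 = np.arange(np.prod(shape), dtype='float32').reshape(shape)

# Flatten to 2D as done before feature extraction, then restore.
X2 = X5.reshape(-1, shape[-1])
restored = X2.reshape(shape)
assert np.array_equal(restored, X5)

# Map a row of the 2D array back to (genre, clip, frame, part) indices.
row = 30
genre, clip, frame, part = np.unravel_index(row, shape[:-1])
assert np.array_equal(X2[row], X5[genre, clip, frame, part])
```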