# Energy auto-encoder: classification

* Technically akin to "transductive learning" because the sparse auto-encoder dictionary is learned over the whole dataset, not only the training data. The LeCun paper we compare with does the same. This could be solved by including the dictionary learning step in the classifier. Technical solutions:
    1. Compute the dictionary in our custom classifier.
    2. Create a scikit-learn Pipeline which includes the whole preprocessing and feature extraction steps.
    In either case the ability to import functions from other notebooks would help. Either approach would be very slow due to the tremendous amount of time needed to train the auto-encoder.
* We should use "grid search" to find the optimal hyper-parameters (auto-encoders, frames, feature vectors, SVM).
* We may use a validation set to mitigate the leak of the testing set into the model as we tune the hyper-parameters.
* Even if not stated in LeCun's paper we should rescale the data before SVM classification.
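
The Pipeline and grid-search ideas above can be sketched as follows. This is only a minimal illustration on synthetic data, not the notebook's actual setup: it assumes a scikit-learn version that provides `sklearn.model_selection` (newer than the `cross_validation` module imported below), the names `X_demo`, `y_demo`, `pipe` and `grid` are made up, and the grid of $C$ values is a placeholder. The dictionary learning step is left out; it would have to be wrapped in a custom transformer and added as a further Pipeline step.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for the feature vectors: 2 classes, 20 samples each, 8 features.
rng = np.random.RandomState(0)
X_demo = np.vstack([rng.rand(20, 8), rng.rand(20, 8) + 0.5])
y_demo = np.repeat([0, 1], 20)

# The scaler is refit on the training part of every fold, so no statistic
# of the held-out part leaks into the model.
pipe = Pipeline([
    ('scale', MinMaxScaler()),
    ('svm', SVC(kernel='linear')),
])

# Hypothetical grid: the candidate values of C are placeholders.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
grid = GridSearchCV(pipe, {'svm__C': [0.1, 1, 10]}, cv=cv)
grid.fit(X_demo, y_demo)

print('Best C: {}'.format(grid.best_params_['svm__C']))
print('Cross-validated accuracy: {:.2f}'.format(grid.best_score_))
```

Because the search itself looks at every fold, the reported score is optimistic; a separate validation or test set, as suggested above, would still be needed for an unbiased estimate.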

## Setup

In [None]:
import os, time
import numpy as np
import sklearn
from sklearn import svm
from sklearn import cross_validation
from sklearn import metrics
from sklearn import preprocessing
import h5py
import matplotlib.pyplot as plt
%matplotlib inline

print('Software versions:')
for pkg in [np, sklearn]:
    print('  {}: {}'.format(pkg.__name__, pkg.__version__))

## Input data

1. Retrieve data from the HDF5 data store.
2. Choose the data we want to work with:
    * raw audio $X_a$,
    * CQT spectrograms $X_s$,
    * normalized spectrograms $X_n$,
    * sparse codes $Z$.
3. Optionally reduce the number $N_{genres} \cdot N_{clips}$ of samples for quicker analysis.

In [None]:
filename = os.path.join('data', 'audio_v2_full.hdf5')
audio = h5py.File(filename, 'r')

# Display HDF5 attributes.
print('Attributes:')
for attr in audio.attrs:
    print('  {} = {}'.format(attr, audio.attrs[attr]))

# Show datasets, their dimensionality and data type.
print('Datasets:')
for dname, dset in audio.items():
    print('  {:2}: {:24}, {}'.format(dname, dset.shape, dset.dtype))

In [None]:
def datinfo(X, name='Dataset'):
    r"""Print dataset size and dimensionality"""
    print('{}:\n'
          '  size: N={:,} x n={} -> {:,} floats\n'
          '  dim: {:,} features per clip\n'
          '  shape: {}'
          .format(name, np.prod(X.shape[:-1]), X.shape[-1],
                  np.prod(X.shape), np.prod(X.shape[2:]), X.shape))

In [None]:
# Choose dataset.
X = audio.get('Xs')

# Full dataset.
# Use the shape of the chosen dataset (not the leftover loop variable dname).
Ngenres, Nclips, Nframes, _, n = X.shape
datinfo(X, 'Full dataset')
print(type(X))

# Reduce data size.
#Ngenres, Nclips = 4, 100

# Load data into memory as a standard NumPy array.
X = X[:Ngenres,:Nclips,:,:,:]
datinfo(X, 'Reduced dataset')
print(type(X))

# Resize in place without memory loading via hyperslab.
# Require chunked datasets.
#X.resize((Ngenres, Nclips, Nframes, 2, n))

## Feature vectors through aggregation

Yet another (hopefully intelligent) dimensionality reduction:
* Aggregation of features from various frames to make up $2N_{vectors} = 12$ feature vectors per clip. Each vector represents approximately 5 seconds of audio, which is much longer than a single frame yet shorter than the whole clip.
* There is again a 50% overlap between those feature vectors.
* Absolute value rectification to prevent components of different sign from canceling each other out.
* Can be thought of as a histogram of used dictionary atoms (if using $Z$) or frequency bins (if using $X_s$) along the chosen time window.
* Note that feature aggregation does not make much sense for raw audio (if using $X_a$).
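
The aggregation described above can be illustrated on a toy clip. This is a simplified sketch for a single clip with made-up sizes (13 frames, 2 features, 3 vectors); `aggregate_overlap`, `frames` and `Y_demo` are hypothetical names, not part of the notebook.

```python
import numpy as np

def aggregate_overlap(frames, n_vectors):
    """Sum rectified frames into windows with 50% overlap.

    frames: array of shape (n_frames, n_features).
    Returns an array of shape (n_vectors, 2, n_features): index 0 of the
    second axis holds the aligned windows, index 1 the half-shifted ones.
    """
    n_frames, n_features = frames.shape
    # Window length such that n_vectors windows plus a half-window shift fit.
    length = int(np.floor(n_frames / (n_vectors + 0.5)))
    out = np.empty((n_vectors, 2, n_features))
    for k, start in enumerate((0, length // 2)):
        # Truncate, group consecutive frames, then sum absolute values.
        chunk = frames[start:start + n_vectors * length]
        chunk = chunk.reshape(n_vectors, length, n_features)
        out[:, k, :] = np.abs(chunk).sum(axis=1)
    return out

# Toy clip: 13 frames of 2 features, every other frame negated.
frames = np.arange(26, dtype=float).reshape(13, 2)
frames[::2] *= -1
Y_demo = aggregate_overlap(frames, n_vectors=3)
print(Y_demo.shape)
print(Y_demo[0])
```

Without the absolute value, the negated frames would partly cancel the others in each sum, which is what the rectification is meant to prevent.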

In [None]:
# Flatten consecutive frames in time.
X1 = X.reshape((Ngenres, Nclips, 2*Nframes, n), order='C')
assert np.all(X1[1,4,3,:] == X[1,4,1,1,:])
datinfo(X1, 'Flattened frames')

# Parameters.
Nvectors = 6
Nframes_per_vector = int(np.floor(2 * Nframes / (Nvectors+0.5)))

def aggregate(X):
    # Truncate.
    X = X[:,:,:Nvectors*Nframes_per_vector,:]
    # Group.
    X = X.reshape((Ngenres, Nclips, Nvectors, Nframes_per_vector, n))
    datinfo(X, 'Truncated and grouped')
    # Aggregate.
    return np.sum(np.abs(X), axis=3)

# Feature vectors.
Y = np.empty((Ngenres, Nclips, Nvectors, 2, n))
Y[:,:,:,0,:] = aggregate(X1)  # Aligned.
Y[:,:,:,1,:] = aggregate(X1[:,:,Nframes_per_vector//2:,:])  # Overlapped.
datinfo(Y, 'Feature vectors')

## Feature vectors visualization

Visualize all feature vectors of a given clip.

Observations:
* Classical music seems to have a much denser spectrum than blues, which may explain why these two classes are easily identifiable using $X_s$.
* Country seems to have strong low frequencies.

In [None]:
genre, clip = 0, 7
fig = plt.figure(figsize=(8,5))
fig.suptitle('12 feature vectors each covering 5 seconds with 50% overlap')
for vector in range(Nvectors):
    for k in range(2):
        i = vector*2+k
        ax = fig.add_subplot(4, 3, i+1)  # Subplot indices start at 1.
        ax.plot(Y[genre,clip,vector,k,:])
        ax.set_xlim((0, n))
        ax.set_xticks([])
        ax.set_yticks([])

## Data preparation for classifiers

1. Rearrange dataset as a 2D array: number of samples x dimensionality.
2. Optionally scale the data.
3. Generate labels.
4. Optionally split in training and testing sets.
5. Optionally randomize labels for testing.

Observations:
* Scaling is necessary for classification performance (both accuracy and speed). 'std' scaling is not well suited to our histogram-like feature vectors which are not at all Gaussian distributions. Prefer 'minmax', i.e. scale features in [0,1]. Moreover this scaling will preserve the sparsity when dealing with sparse codes $Z$.
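
The sparsity argument can be checked on a toy example. The sketch below uses a made-up non-negative matrix `Z_demo` (hypothetical, not the real sparse codes) and assumes that every feature is zero for at least one sample, so that its minimum is 0.

```python
import numpy as np

# Toy sparse, non-negative codes: 4 samples, 3 features, mostly zeros.
Z_demo = np.array([[0., 2., 0.],
                   [4., 0., 0.],
                   [0., 0., 6.],
                   [0., 1., 3.]])

# Min-max scaling: the minimum of each feature is 0, so zeros stay zeros.
z_min, z_max = Z_demo.min(axis=0), Z_demo.max(axis=0)
Z_minmax = (Z_demo - z_min) / (z_max - z_min)

# Standard scaling: subtracting the mean turns every zero into a non-zero value.
Z_std = (Z_demo - Z_demo.mean(axis=0)) / Z_demo.std(axis=0)

print('Zeros before:       {}'.format(np.sum(Z_demo == 0)))
print('Zeros after minmax: {}'.format(np.sum(Z_minmax == 0)))
print('Zeros after std:    {}'.format(np.sum(Z_std == 0)))
```

If some feature never takes the value zero, min-max scaling shifts it as well, so sparsity is only preserved for features whose minimum is already 0.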

In [None]:
def prepdata(a, b, c, test_size=None, scale=None, rand=False):
    """Prepare data for classification."""
    
    # Squeeze dataset to a 2D array.
    data = Y.reshape((a*b), c)
    if c == n:
        assert np.all(data[31,:] == Y[0,2,3,1,:])
    elif c == Nvectors*2*n:
        assert np.all(data[Nclips+2,:] == Y[1,2,:,:,:].reshape(-1))

    # Independently scale each feature.
    # Put in an sklearn Pipeline to avoid transductive learning.
    if scale == 'std':
        # Features have zero mean and unit standard deviation.
        data = preprocessing.scale(data, axis=0)
    elif scale == 'minmax':
        # Features in [0,1].
        data = data - np.min(data, axis=0)
        data = data / np.max(data, axis=0)
    #print(np.min(data, axis=0))
    #print(np.max(data, axis=0))
    
    # Labels.
    target = np.empty((a, b), dtype=np.uint8)
    for genre in range(Ngenres):
        target[genre,:] = genre
    target.resize(data.shape[0])
    print('{} genres: {}'.format(Ngenres, ', '.join(audio.attrs['labels'][:Ngenres])))

    # Be sure that classification with random labels is no better than random.
    if rand:
        # Integer labels, so that np.bincount still works on the predictions.
        target = np.random.randint(0, Ngenres, target.shape).astype(np.uint8)
        print('Balance: {} {}'.format(np.sum(target == 0), np.sum(target == 1)))

    # Training and testing sets.
    if test_size is not None:
        X_train, X_test, y_train, y_test = cross_validation.train_test_split(
            data, target, test_size=test_size)  # random_state=1
        print('Training data: {}, {}'.format(X_train.shape, X_train.dtype))
        print('Testing data: {}, {}'.format(X_test.shape, X_test.dtype))
        print('Training labels: {}, {}'.format(y_train.shape, y_train.dtype))
        print('Testing labels: {}, {}'.format(y_test.shape, y_test.dtype))
        return X_train, X_test, y_train, y_test
    else:
        print('Data: {}, {}'.format(data.shape, data.dtype))
        print('Labels: {}, {}'.format(target.shape, target.dtype))
        return data, target

## Linear SVM

* Each feature vector gets a genre label.
* Classification with a linear Support Vector Machine (SVM).
* Fast to train.
* Scales well to large datasets.
* Two implementations: liblinear (sklearn LinearSVC) and libsvm (sklearn SVC and NuSVC)
* Multi-class: "one-vs-one" approach (Knerr et al., 1990) (sklearn SVC and NuSVC) and "one-vs-the-rest" (sklearn LinearSVC)

Observations:
* We can predict genre labels of individual frames with good accuracy using CQT spectrograms only.
* SVC vs NuSVC vs LinearSVC:
    * 10-fold cross-validation with 10 classes (default $C=1$ and $\nu=0.5$):
        * SVC (0.56) yields better accuracy than LinearSVC (0.53), which in turn beats NuSVC (0.51)
        * SVC (303s) and LinearSVC (296s) are faster than NuSVC (501s)
    * SVC often fails to converge if the data is not scaled
    * LinearSVC may be more scalable (in the number of samples)
* Hyper-parameters:
    * $C$ seems to have little impact.
    * $\nu$ has a great impact on speed: lower is slower

Open questions:
* Which multi-class strategy to adopt: one-vs-all or one-vs-one?
    * sklearn states that one-vs-all is the most common strategy
* Determine $C$ or $\nu$.
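
One practical difference between the two multi-class strategies is the number of binary SVMs to train: $K(K-1)/2$ for one-vs-one against $K$ for one-vs-the-rest. The sketch below makes this visible with scikit-learn's explicit wrappers on synthetic data; `X_demo`, `y_demo`, `ovo` and `ovr` are made-up names, and the wrappers are used here for illustration only (the notebook relies on the strategies built into SVC and LinearSVC).

```python
import numpy as np
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic data: 4 well-separated classes, 10 samples each, 5 features.
rng = np.random.RandomState(0)
n_classes = 4
y_demo = np.repeat(np.arange(n_classes), 10)
X_demo = rng.rand(y_demo.size, 5) + y_demo[:, np.newaxis]

ovo = OneVsOneClassifier(SVC(kernel='linear', C=1)).fit(X_demo, y_demo)
ovr = OneVsRestClassifier(SVC(kernel='linear', C=1)).fit(X_demo, y_demo)

# One-vs-one trains K(K-1)/2 binary classifiers, one-vs-rest trains K.
print('one-vs-one:  {} binary SVMs'.format(len(ovo.estimators_)))
print('one-vs-rest: {} binary SVMs'.format(len(ovr.estimators_)))
```

With the 10 genres used here this amounts to 45 against 10 binary problems, although each one-vs-one problem only sees the samples of two classes.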

In [None]:
# Instantiate a classifier.

clf_svm = svm.SVC(kernel='linear', C=1)
#clf_svm = svm.NuSVC(kernel='linear', nu=0.5)
#clf_svm = svm.LinearSVC(C=1)

In [None]:
# Try the single feature vector classifier (linear SVM).
if True:
    
    # Split data.
    X_train, X_test, y_train, y_test = prepdata(
        Ngenres, Nclips*Nvectors*2, n, test_size=0.4,
        scale='minmax', rand=False)
    
    # Train.
    clf_svm.fit(X_train, y_train)
    
    # Test.
    y_predict = clf_svm.predict(X_test)
    acc = metrics.accuracy_score(y_test, y_predict)
    print('Accuracy: {:.1f} %'.format(acc*100))

## Majority voting

Final dimensionality reduction step:
* Each of the 12 feature vectors of a clip gives a vote. We choose the genre with the highest number of votes.
* Implemented as a custom classifier which embeds an SVM for individual feature vectors classification.
* Alternative implementation: insert in a sklearn pipeline after SVC.

Observations:
* Accuracy on whole clips is indeed better than accuracy on individual feature vectors.
* *clf_svm_vote.confidence* is useful to observe if a class is harder to differentiate.
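
The voting step on its own reduces to counting labels per clip. A minimal sketch, assuming made-up per-vector predictions for 3 clips and 3 genres; `majority_vote` and `votes` are hypothetical names, separate from the classifier defined below.

```python
import numpy as np

def majority_vote(votes, n_classes):
    """Return the vote counts and the winning class of each clip.

    votes: integer array of shape (n_clips, n_votes) holding predicted labels.
    """
    counts = np.stack([np.bincount(v, minlength=n_classes) for v in votes])
    # argmax returns the first maximum, so ties go to the lowest label.
    return counts, counts.argmax(axis=1)

# Hypothetical per-vector predictions: 3 clips, 12 votes each, 3 genres.
votes = np.array([[0, 0, 1, 0, 2, 0, 0, 1, 0, 0, 2, 0],
                  [1, 1, 2, 1, 1, 0, 1, 2, 2, 1, 1, 1],
                  [2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0]])
counts, winners = majority_vote(votes, n_classes=3)
print(counts)
print(winners)
```

The last clip is a 6 to 6 tie between genres 0 and 2; the vote counts expose such ambiguous clips, whereas the predicted label alone does not.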

In [None]:
# Define and instantiate our custom classifier.
class svm_vote(sklearn.base.BaseEstimator):
    
    def __init__(self, svm):
        self.svm = svm
    
    def _vectors(self, X, y=None):
        """Rearrange data in feature vectors for SVM."""
        X = X.reshape(X.shape[0]*Nvectors*2, n)
        if y is not None:
            y = np.repeat(y, Nvectors*2, axis=0)
            assert y.shape[0] == X.shape[0]
            return (X, y)
        else:
            return (X,)
    
    def fit(self, X, y):
        """Fit the embedded SVC."""
        self.svm.fit(*self._vectors(X, y))
        return self  # scikit-learn convention.
    
    def svm_score(self, X, y):
        """Return SVC accuracy on feature vectors."""
        return self.svm.score(*self._vectors(X, y))
    
    def svm_predict(self, X):
        """Return SVC predictions on feature vectors."""
        y = self.svm.predict(*self._vectors(X))
        y.resize(X.shape[0], Nvectors*2)
        return y
        
    def confidence(self, X):
        """Return the number of votes for each class."""
        def bincount(x):
            return np.bincount(x, minlength=Ngenres)
        y = np.apply_along_axis(bincount, 1, self.svm_predict(X))
        assert np.all(np.sum(y, axis=1) == Nvectors*2)
        return y
        
    def predict(self, X):
        """Return predictions on whole clips."""
        y = self.svm_predict(X)
        return np.apply_along_axis(lambda x: np.bincount(x).argmax(), 1, y)
        #return np.zeros(X.shape[0])  # Pretty bad prediction.
    
    def score(self, X, y):
        """Return the accuracy score. Used by sklearn cross-validation."""
        return metrics.accuracy_score(y, self.predict(X))

clf_svm_vote = svm_vote(clf_svm)

In [None]:
# Try the whole clip classifier (linear SVM and majority voting).
if True:
    
    # Split data.
    X_train, X_test, y_train, y_test = prepdata(
        Ngenres, Nclips, Nvectors*2*n, test_size=0.4,
        scale='minmax', rand=False)
    
    # Train.
    clf_svm_vote.fit(X_train, y_train)
    
    # Test on single vectors.
    acc = clf_svm_vote.svm_score(X_test, y_test)
    print('Feature vectors accuracy: {:.1f} %'.format(acc*100))
    
    # Observe individual votes.
    #print(clf_svm_vote.svm_predict(X_test))
    #print(clf_svm_vote.confidence(X_test))
    
    # Test on whole clips.
    y_predict = clf_svm_vote.predict(X_test)
    acc = metrics.accuracy_score(y_test, y_predict)
    assert acc == clf_svm_vote.score(X_test, y_test)
    print('Clips accuracy: {:.1f} %'.format(acc*100))

## Cross-validation

* 10-fold cross-validation.
* 100 randomly chosen clips per fold.
* 9 folds (900 clips) for training, 1 fold (100 clips) for testing.
* Determine a classification accuracy using the testing set.
* Repeat 10 times: mean and standard deviation.

Observations:
* Data should be shuffled as samples with the same label are contiguous, i.e. data ordering is not arbitrary.
* *ShuffleSplit*, *StratifiedShuffleSplit*, *KFold* and *StratifiedKFold* yield similar results as long as the data is shuffled.
* (Lots of variance between runs.)
* Data should be rescaled for good performance (both accuracy and speed).

Results:
* With $X_s$
    * Accuracy of 95 (+/- 5) for 2 genres (SVC, minmax)
    * Accuracy of 81 (+/- 4) for 4 genres (SVC, minmax)
    * Accuracy of 56 (+/- 5) for 10 genres (SVC, minmax)

Ideas:
* Use the area under the receiver operating characteristic (ROC) curve (AUC). Not sure if applicable to multi-class.
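
Why shuffling matters when samples of the same label are contiguous can be seen from the folds alone. The sketch below is an illustration on made-up labels (3 classes, 10 samples each) and assumes a scikit-learn version that provides `sklearn.model_selection`, whose iterators take `n_splits` rather than the `n_folds` argument used in this notebook; `classes_per_test_fold` is a hypothetical helper.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Labels ordered by class, as in the dataset: 3 classes, 10 samples each.
y_demo = np.repeat(np.arange(3), 10)
X_demo = np.zeros((y_demo.size, 1))  # Features are irrelevant to the split.

def classes_per_test_fold(cv):
    """Number of distinct classes in the test part of every fold."""
    return [np.unique(y_demo[test]).size
            for _, test in cv.split(X_demo, y_demo)]

plain = classes_per_test_fold(KFold(n_splits=3))
shuffled = classes_per_test_fold(
    KFold(n_splits=3, shuffle=True, random_state=0))
stratified = classes_per_test_fold(
    StratifiedKFold(n_splits=3, shuffle=True, random_state=0))

print('KFold, no shuffle: {}'.format(plain))
print('KFold, shuffled:   {}'.format(shuffled))
print('StratifiedKFold:   {}'.format(stratified))
```

Without shuffling, every test fold holds a single class that is absent from the corresponding training folds. The same would happen with 10 genres of 100 clips each and 10 folds.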

In [None]:
data, target = prepdata(Ngenres, Nclips, Nvectors*2*n, scale='minmax')

# Cross validation iterators.
#cv = cross_validation.ShuffleSplit(Ngenres*Nclips, n_iter=10, test_size=0.1)
#cv = cross_validation.StratifiedShuffleSplit(target, test_size=0.4)
cv = cross_validation.KFold(Ngenres*Nclips, shuffle=True, n_folds=10)
#cv = cross_validation.StratifiedKFold(target, shuffle=True, n_folds=10)

tstart = time.time()
scores = cross_validation.cross_val_score(
    clf_svm_vote, data, target, cv=cv, n_jobs=1)
print('Elapsed time: {:.2f} seconds'.format(time.time() - tstart))

print('Scores:\n{}'.format(scores))
print('Accuracy: {:.0f} (+/- {:.1f})'.format(scores.mean()*100, scores.std()*100))