# Section 3
----------------------------------------

## Online mini-batch k-means
- Load actual music data and select the genre of interest for binary classification.
- Perform feature extraction and normalization.
- Run mini-batch k-means to find the centroids.
- No data should be used twice. I first split the full data set 10%/90% into test and training sets, then split the training set 20%/80%: the smaller part is used to find the centers and the larger part to train the classifier.
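
The two-stage split described above can be sketched with plain NumPy index permutations. The helper name `three_way_split`, its parameters, and the toy size of 100 items are illustrative assumptions, not part of `myMiniBatchKmeans`. The notebook itself uses scikit-learn's `train_test_split`, which may round the subset sizes differently.

```python
import numpy as np

def three_way_split(n_items, test_frac=0.10, center_frac=0.20, seed=0):
    """Return disjoint index arrays (center, classifier, test).

    Hypothetical helper: hold out `test_frac` of all items for testing,
    then reserve `center_frac` of the remainder for centroid learning.
    """
    rng = np.random.RandomState(seed)
    order = rng.permutation(n_items)

    # Stage 1: test set versus everything else.
    n_test = int(round(test_frac * n_items))
    test_idx, train_idx = order[:n_test], order[n_test:]

    # Stage 2: centroid-learning set versus classifier-training set.
    n_center = int(round(center_frac * len(train_idx)))
    center_idx, classifier_idx = train_idx[:n_center], train_idx[n_center:]
    return center_idx, classifier_idx, test_idx

split_center_idx, split_classifier_idx, split_test_idx = three_way_split(100)
```

Because the three index sets are slices of one permutation, no item can end up in more than one of them.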


In [1]:
%matplotlib nbagg

import numpy as np
import random, sys
import points as pt
import librosa
import cPickle as pickle
import myMiniBatchKmeans as miniK
from sklearn import preprocessing
from sklearn.cross_validation import train_test_split
from  scipy.spatial.distance import euclidean
from librosa.util import normalize
from pylab import plt


X, y = miniK.getData("../homework2/data/data_small8.in")
X_new, y_new = miniK.selectGenre(X, y, 0) #binary

rng = np.random.RandomState(19850920)
permutation = rng.permutation(len(X_new))
X_new, y_new = X_new[permutation], y_new[permutation]
X_train, X_test, y_train, y_test = train_test_split(X_new, y_new, test_size=0.10, random_state=2010)
X_train_center, X_train_classifier, y_train_center, y_train_classifier = train_test_split(X_train, y_train, test_size=0.8, random_state=2010)

X_train_center_features = miniK.featureExtraction(X_train_center) #includes MFCC and normalization
X_train_classifier_features = miniK.featureExtraction(X_train_classifier) #includes MFCC and normalization
print "X_train_center_features.shape", X_train_center_features.shape


mbk = miniK.miniBatchKmeans(8, max_iter=100, batch_size=10000)
mbk.fix(X_train_center_features)
distortion, centroids = mbk.getBestCentroids() #get the best centroid over each iteration

print "distortion", distortion, "centroids.shape", centroids.shape
print centroids




X size/number of songs: 80
Number of clips per song: 10
y size: 80
After MFCC X.shape (2, 10, 12, 129)
X_train_flattened.shape (20, 12, 129)
After transpose X_train_flattened.shape (20, 129, 12)
After MFCC X.shape (12, 10, 12, 129)
X_train_flattened.shape (120, 12, 129)
After transpose X_train_flattened.shape (120, 129, 12)
X_train_center_features.shape (2580, 12)
n_iter is 100
round: 0 squared_diff 0.678961671955 noImproveCount 0 improvement 0
round: 1 squared_diff 0.0269871679787 noImproveCount 0 improvement 0.96025229539
round: 2 squared_diff 0.00238213098128 noImproveCount 0 improvement 0.911730975879
round: 3 squared_diff 0.00166425207327 noImproveCount 0 improvement 0.301359964525
round: 4 squared_diff 0.000545806001466 noImproveCount 0 improvement 0.672041266925
round: 5 squared_diff 0.000357719886536 noImproveCount 0 improvement 0.344602504233
round: 6 squared_diff 0.000261665750632 noImproveCount 0 improvement 0.268517741168
round: 7 squared_diff 0.000177191649169 noImproveCou
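
For reference, a common formulation of mini-batch k-means is sketched below: each centroid keeps a count of the points assigned to it and moves toward every new point with step size 1/count. This is a minimal sketch of that textbook update, assuming squared Euclidean distance and random initialisation from the data. The internals of `miniK.miniBatchKmeans` are not shown in this notebook and may differ, and the names `mini_batch_kmeans`, `distortion_of`, and the toy blobs are hypothetical.

```python
import numpy as np

def mini_batch_kmeans(X, k, n_iter=100, batch_size=256, seed=0):
    """Sketch of mini-batch k-means with per-centroid learning rates."""
    rng = np.random.RandomState(seed)
    # Initialise the centroids from k distinct data points.
    centroids = X[rng.choice(len(X), k, replace=False)].astype(float)
    counts = np.zeros(k)
    for _ in range(n_iter):
        # Draw a random mini-batch (with replacement).
        batch = X[rng.randint(0, len(X), size=batch_size)]
        # Squared distance from every batch point to every centroid.
        d2 = ((batch[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        nearest = d2.argmin(axis=1)
        # Move each assigned centroid toward its point with step 1/count.
        for x, j in zip(batch, nearest):
            counts[j] += 1
            eta = 1.0 / counts[j]
            centroids[j] = (1.0 - eta) * centroids[j] + eta * x
    return centroids

def distortion_of(X, centroids):
    """Mean squared distance from each point to its nearest centroid."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean()

# Toy data (assumption): two well-separated 2-D blobs.
kmeans_rng = np.random.RandomState(0)
kmeans_X = np.vstack([kmeans_rng.normal(0.0, 0.5, size=(200, 2)),
                      kmeans_rng.normal(10.0, 0.5, size=(200, 2))])
kmeans_centroids = mini_batch_kmeans(kmeans_X, k=2, n_iter=50, batch_size=64)
```

The shrinking step size means each centroid is the running mean of all points ever assigned to it, which is why the per-round change reported in the log above decays quickly.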

## VLAD
- VLAD replaces the earlier bag-of-features representation: instead of counting how many descriptors fall on each word (cluster), it accumulates the residuals, i.e. the differences between the descriptors and their cluster centers.
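
A minimal sketch of the residual accumulation, following the usual VLAD formulation: assign each descriptor to its nearest centroid, sum the residuals per centroid, concatenate, and L2-normalise. The function `vlad_encode` and the toy arrays are assumptions for illustration. The body of `vlad.my_vlad` is not shown here, so details such as normalisation may differ.

```python
import numpy as np

def vlad_encode(descriptors, centroids):
    """Encode an (n, d) descriptor array as a k*d VLAD vector (sketch)."""
    k, d = centroids.shape
    # Nearest centroid for every descriptor (squared Euclidean distance).
    d2 = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)

    # Sum of residuals (descriptor minus centroid) per centroid.
    v = np.zeros((k, d))
    for j in range(k):
        members = descriptors[nearest == j]
        if len(members):
            v[j] = (members - centroids[j]).sum(axis=0)

    # Flatten and L2-normalise; an all-zero vector is returned unchanged.
    v = v.ravel()
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

# Toy example (assumption): two centroids, three 2-D descriptors.
vlad_centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
vlad_descriptors = np.array([[1.0, 0.0], [0.0, 1.0], [10.0, 12.0]])
vlad_vector = vlad_encode(vlad_descriptors, vlad_centroids)
```

In the toy example the first centroid collects the residuals (1, 0) and (0, 1) and the second collects (0, 2), so the unnormalised vector is (1, 1, 0, 2). Unlike a bag-of-features histogram, the output has k*d entries rather than k.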

In [2]:
import vlad

trainX_centerSubstract, trainy_centerSubstract = vlad.my_vlad(centroids, X_train_classifier, y_train_classifier)
testX_centerSubstract, testy_centerSubstract = vlad.my_vlad(centroids, X_test, y_test)


vlad before X.shape (12, 10)
vlad after X.shape (12, 10, 12, 129)
After transpose Xt.shape (12, 10, 129, 12)
vlad before X.shape (2, 10)
vlad after X.shape (2, 10, 12, 129)
After transpose Xt.shape (2, 10, 129, 12)


## Train classifier
- Here I use KNeighborsClassifier with a grid search to find the best parameters.
- Since no data should be used twice, the classifier is trained on the second split of X_train (X_train_classifier), not on the part used to find the centers.
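
What the grid search does can be sketched in plain NumPy: score every candidate setting by cross-validation on the classifier split and keep the best one. The grid in this notebook covers 7 × 2 × 4 = 56 combinations with 3 folds each (168 fits). The sketch below varies only `n_neighbors` with uniform weights, and the helpers `knn_predict`, `cv_accuracy`, and `grid_search_k` and the toy data are hypothetical, not scikit-learn's implementation.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, n_neighbors):
    """Majority vote among the nearest training points (uniform weights)."""
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :n_neighbors]
    votes = y_train[nearest]  # shape (n_query, n_neighbors)
    return np.array([np.bincount(row).argmax() for row in votes])

def cv_accuracy(X, y, n_neighbors, n_folds=3):
    """Mean held-out accuracy over n_folds contiguous folds."""
    folds = np.array_split(np.arange(len(X)), n_folds)
    scores = []
    for held_out in folds:
        mask = np.ones(len(X), dtype=bool)
        mask[held_out] = False
        pred = knn_predict(X[mask], y[mask], X[held_out], n_neighbors)
        scores.append((pred == y[held_out]).mean())
    return float(np.mean(scores))

def grid_search_k(X, y, candidates, n_folds=3):
    """Return the best n_neighbors and the score of every candidate."""
    scores = {k: cv_accuracy(X, y, k, n_folds) for k in candidates}
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Toy data (assumption): two separable blobs with integer labels 0 and 1.
gs_rng = np.random.RandomState(0)
gs_X = np.vstack([gs_rng.normal(0.0, 0.5, size=(30, 2)),
                  gs_rng.normal(5.0, 0.5, size=(30, 2))])
gs_y = np.array([0] * 30 + [1] * 30)
gs_order = gs_rng.permutation(len(gs_X))
gs_X, gs_y = gs_X[gs_order], gs_y[gs_order]
gs_best_k, gs_scores = grid_search_k(gs_X, gs_y, candidates=range(3, 10))
```

Only the classifier split enters the cross-validation, so the test set stays untouched until the final prediction step.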

In [3]:
from sklearn.grid_search import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
param_grid = {'n_neighbors': np.arange(3, 10),'weights':('uniform', 'distance'), 'algorithm':('auto','ball_tree', 'kd_tree', 'brute') }
np.set_printoptions(suppress=True)
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, verbose=3)
grid_search.fit(trainX_centerSubstract, trainy_centerSubstract)
print "----------------", "Done grid search", "----------------"

Fitting 3 folds for each of 56 candidates, totalling 168 fits
[CV] n_neighbors=3, weights=uniform, algorithm=auto ..................
[CV]  n_neighbors=3, weights=uniform, algorithm=auto, score=0.750000 -   0.0s
[CV] n_neighbors=3, weights=uniform, algorithm=auto ..................
[CV]  n_neighbors=3, weights=uniform, algorithm=auto, score=1.000000 -   0.0s
[CV] n_neighbors=3, weights=uniform, algorithm=auto ..................
[CV]  n_neighbors=3, weights=uniform, algorithm=auto, score=0.950000 -   0.0s
[CV] n_neighbors=3, weights=distance, algorithm=auto .................
[CV]  n_neighbors=3, weights=distance, algorithm=auto, score=0.750000 -   0.0s
[CV] n_neighbors=3, weights=distance, algorithm=auto .................
[CV]  n_neighbors=3, weights=distance, algorithm=auto, score=1.000000 -   0.0s
[CV] n_neighbors=3, weights=distance, algorithm=auto .................
[CV]  n_neighbors=3, weights=distance, algorithm=auto, score=0.950000 -   0.0s
[CV] n_neighbors=4, weights=uniform, algo

[Parallel(n_jobs=1)]: Done  31 tasks       | elapsed:    0.1s
[Parallel(n_jobs=1)]: Done 127 tasks       | elapsed:    0.5s



[CV]  n_neighbors=3, weights=uniform, algorithm=brute, score=0.750000 -   0.0s
[CV] n_neighbors=3, weights=uniform, algorithm=brute .................
[CV]  n_neighbors=3, weights=uniform, algorithm=brute, score=1.000000 -   0.0s
[CV] n_neighbors=3, weights=uniform, algorithm=brute .................
[CV]  n_neighbors=3, weights=uniform, algorithm=brute, score=0.950000 -   0.0s
[CV] n_neighbors=3, weights=distance, algorithm=brute ................
[CV]  n_neighbors=3, weights=distance, algorithm=brute, score=0.750000 -   0.0s
[CV] n_neighbors=3, weights=distance, algorithm=brute ................
[CV]  n_neighbors=3, weights=distance, algorithm=brute, score=1.000000 -   0.0s
[CV] n_neighbors=3, weights=distance, algorithm=brute ................
[CV]  n_neighbors=3, weights=distance, algorithm=brute, score=0.950000 -   0.0s
[CV] n_neighbors=4, weights=uniform, algorithm=brute .................
[CV]  n_neighbors=4, weights=uniform, algorithm=brute, score=0.750000 -   0.0s
[CV] n_neighbors=

[Parallel(n_jobs=1)]: Done 168 out of 168 | elapsed:    0.6s finished


## Predict
- Use the previously split-off test data for prediction.
- I got 100% accuracy on the test set. Earlier, when I used the same training set both to find the centers and to train the classifier, I got only 85%, so keeping the two sets disjoint gave a significant improvement in this experiment.

In [4]:
predict = grid_search.predict(testX_centerSubstract)
print grid_search.score(testX_centerSubstract, testy_centerSubstract)
print grid_search.best_params_

1.0
{'n_neighbors': 6, 'weights': 'uniform', 'algorithm': 'auto'}
