# RandomForest on mel-spectrogram summarization

This notebook closely follows the `melspec-maxp` (max summarization) baseline model from [Automatic large-scale classification of bird sounds is strongly improved by unsupervised feature learning](https://peerj.com/articles/488/) by Dan Stowell and Mark D. Plumbley.


In [15]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

import time

import numpy
import pandas
import matplotlib.pyplot as plt

import sklearn
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn import model_selection
from sklearn import metrics

# Custom modules
import dcase2018bird
import features



# Load dataset

In [2]:
dataset = dcase2018bird.load_dataset()
print(dataset.shape)
dataset.head(3)

(48310, 4)


| | itemid | datasetid | hasbird | folder |
|---|---|---|---|---|
| 0 | BUK4_20161103_204504_125 | PolandNFC | | polandnfc |
| 1 | BUK4_20161016_012704_132 | PolandNFC | | polandnfc |
| 2 | 6wichura_deszcz_BUK4_20161005_022304_129 | PolandNFC | | polandnfc |


In [3]:
dataset.datasetid.unique()

array(['PolandNFC', 'BirdVox-DCASE-20k', 'chern', 'ff1010bird',
       'warblrb10k', 'wabrlrb10k_test'], dtype=object)

## Split training and evaluation data
No labels are available for the evaluation data; they are what must be predicted in the competition.

In [4]:
trainset = dataset[dataset.hasbird.notna()].copy()
print(trainset.shape)
trainset['hasbird'] = trainset.hasbird.astype(bool)
#trainset.groupby('folder').head(1)

(35690, 4)


In [5]:
evalset = dataset[dataset.hasbird.isna()].copy()
print(evalset.shape)
#evalset.groupby('folder').head(1)

(12620, 4)


# Load features

(10, 3, 64)
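
The shape above suggests one block of 64 mel-band values per summary statistic for each clip. The code of the custom `features` module is not shown here, so the following is only a minimal sketch of how such a summarization could work. The function name `summarize_melspec`, the choice of statistics (max, mean, standard deviation) and their ordering are assumptions for illustration, and random arrays stand in for real mel-spectrograms.

```python
import numpy as np

def summarize_melspec(melspec):
    """Collapse a (n_mels, n_frames) mel-spectrogram over time.

    Returns an array of shape (3, n_mels). The rows are max, mean and
    standard deviation per mel band; this choice and ordering are
    hypothetical, not taken from the notebook's `features` module.
    """
    return np.stack([
        melspec.max(axis=1),
        melspec.mean(axis=1),
        melspec.std(axis=1),
    ])

# Random stand-ins for mel-spectrograms of clips with different lengths
rng = np.random.default_rng(0)
clips = [rng.random((64, n_frames)) for n_frames in (100, 250, 80)]

# Every clip becomes a fixed-size (3, 64) block regardless of its duration
feats = np.stack([summarize_melspec(clip) for clip in clips])
print(feats.shape)

# Selecting one statistic gives a 2-D matrix that a classifier can consume
max_features = feats[:, 0]
print(max_features.shape)
```

Summarizing over the time axis is what lets clips of different durations share one fixed-length feature vector, at the cost of discarding temporal structure.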

In [56]:
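# train_F is not defined in the cells shown here; the .compute() call suggests
# a lazily evaluated (dask-style) feature array built in an earlier cell.
start = time.time()
F = train_F.compute()  # materialize the features in memory
end = time.time()
print('d', end-start)  # 'd' = elapsed seconds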

d 122.5700409412384


## Compute features

# Model

In [58]:
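# Keep only the first summary statistic per clip: (n_clips, 3, 64) -> (n_clips, 64)
train_X = F[:,0]
# Features were computed for the first 1000 clips only, so slice the labels to match
train_Y = trainset[0:1000].hasbird
rf = make_pipeline(
    RandomForestClassifier(n_estimators=100, min_samples_leaf=2, random_state=1),
)

# No random_state is set here, so the split (and the scores) vary between runs
X_train, X_test, Y_train, Y_test = \
  model_selection.train_test_split(train_X, train_Y, test_size=0.3)

start = time.time()
print('Starting train', X_train.shape, numpy.mean(Y_train))
rf.fit(X_train, Y_train)
end = time.time()
print('Train time', end-start)

# cross_val_score clones and refits rf on every fold, so the fit above does not
# influence these scores. 'train' is 5-fold CV within the training split.
print('train', model_selection.cross_val_score(rf, X_train, Y_train, scoring='roc_auc', cv=5))
# 'test' is 5-fold CV within the test split (models fitted on test data only),
# not a hold-out score of the rf fitted above.
print('test', model_selection.cross_val_score(rf, X_test, Y_test, scoring='roc_auc', cv=5))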


Starting train (700, 64) 0.4742857142857143
Train time 0.7555375099182129
train [0.83128278 0.82412263 0.7977068  0.80687007 0.8445413 ]
test [0.70588235 0.74400871 0.70145903 0.67715618 0.72727273]
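
Both score lines above come from `cross_val_score`, which refits a fresh copy of the model on each fold, so the `test` numbers describe models trained on subsets of the 300 test clips rather than the model fitted on the training split. A hold-out estimate would instead score the fitted model's predictions on the test split. The following is a minimal sketch of that approach; it uses synthetic data from `make_classification` as a stand-in for the real features and labels, so its score says nothing about the bird dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the (n_clips, 64) feature matrix and hasbird labels
X, y = make_classification(n_samples=1000, n_features=64,
                           n_informative=10, random_state=1)

# Fixed random_state makes the split reproducible; stratify keeps the class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

clf = RandomForestClassifier(n_estimators=100, min_samples_leaf=2, random_state=1)
clf.fit(X_train, y_train)

# Column 1 of predict_proba holds the probability of the positive class
test_scores = clf.predict_proba(X_test)[:, 1]
holdout_auc = roc_auc_score(y_test, test_scores)
print('hold-out ROC AUC', holdout_auc)
```

Cross-validation on the training split remains useful for model selection; the hold-out score is then computed once, on data the model has never seen.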
