# Train shallow classifiers on embeddings
One approach to training ML classifiers with little labeled data is to use a pre-trained feature extractor to generate embeddings, then train a supervised "shallow classifier" on those embeddings using the labeled examples. The backbone network that creates the embeddings is not trained. Instead we train something lighter-weight, such as a 1-layer MLP (a single fully connected layer, which is essentially logistic regression) or another non-deep classifier. Many classifiers of this kind are implemented in the scikit-learn package (`sklearn`).

This notebook compares how well several sklearn classifiers identify bird species from embeddings. We split the annotated dataset into a training set and a test set.

In [121]:
import numpy as np
import pandas as pd
from glob import glob
from pathlib import Path

from matplotlib import pyplot as plt
plt.rcParams['figure.figsize']=[15,5] #for big visuals
%config InlineBackend.figure_format = 'retina'

from sklearn.linear_model import SGDClassifier
from sklearn.metrics import average_precision_score, roc_auc_score


This notebook assumes:

- (1) that you have a labeled dataset corresponding to the embedded audio clips.

In this case, I use the publicly available annotated bird dataset published by [Chronister et al. 2021](https://esajournals.onlinelibrary.wiley.com/doi/full/10.1002/ecy.3329) and available on [Dryad](https://datadryad.org/stash/dataset/doi:10.5061/dryad.d2547d81z).


- (2) that you have created embeddings for the annotated audio (see other notebooks for examples) and saved them in a CSV file with the columns | file | start_time | end_time | embedding columns..., where the first three columns identify each clip.

In this case, I created embeddings with an EfficientNet-B2 model trained on 4000 Xeno-Canto species using OpenSoundscape. The embedding process creates a 1280-length vector for each audio clip.
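The rest of the notebook relies on the embeddings and labels tables describing the same clips in the same row order, since both are later indexed with a single boolean mask. A minimal sketch of how that alignment could be checked, using small made-up tables (the file names, values, and the `toy_` variable names are illustrative assumptions, not part of the real dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy tables standing in for the real CSVs
clip_index = pd.MultiIndex.from_tuples(
    [("a.wav", 0.0, 5.0), ("a.wav", 5.0, 10.0), ("b.wav", 0.0, 5.0)],
    names=["file", "start_time", "end_time"],
)
rng = np.random.default_rng(0)
toy_embeddings = pd.DataFrame(rng.normal(size=(3, 4)), index=clip_index)

# Labels deliberately stored in a different row order
toy_labels = pd.DataFrame({"TUTI": [1.0, 0.0, 0.0]}, index=clip_index[[2, 0, 1]])

# Reorder the labels so that row i of both tables refers to the same clip
toy_labels = toy_labels.reindex(toy_embeddings.index)

# Clips without a label would show up as NaN after reindexing
if toy_labels.isna().any().any():
    raise ValueError("some embedded clips have no labels")

print(toy_labels.index.equals(toy_embeddings.index))
```

If the two CSVs were produced from the same list of clips, the reindexing step is a no-op, but the check is cheap insurance against silently misaligned rows.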

In [126]:
path_to_embeddings = 'embeddings_divine_bird_cont1_pnre_5s.csv'
path_to_labels = '/Users/SML161/labeled_datasets/pnre_ecy3329/pnre_ecy3329_onehot_labels_5s.csv'
embeddings = pd.read_csv(path_to_embeddings,index_col=[0,1,2])
labels = pd.read_csv(path_to_labels,index_col=[0,1,2])
labels.head()

file,start_time,end_time,RWBL,TUTI,CEDW,CANG,RSHA,NOCA,VEER,HETH,SWTH,BRCR,...,CSWA,AMRE,BTNW,WBNU,AMGO,CARW,AMRO,BWWA,REVI,BHVI
/Users/SML161/labeled_datasets/pnre_ecy3329/wav_Files/Recording_4/Recording_4_Segment_04.wav,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
/Users/SML161/labeled_datasets/pnre_ecy3329/wav_Files/Recording_4/Recording_4_Segment_04.wav,5.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
/Users/SML161/labeled_datasets/pnre_ecy3329/wav_Files/Recording_4/Recording_4_Segment_04.wav,10.0,15.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
/Users/SML161/labeled_datasets/pnre_ecy3329/wav_Files/Recording_4/Recording_4_Segment_04.wav,15.0,20.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
/Users/SML161/labeled_datasets/pnre_ecy3329/wav_Files/Recording_4/Recording_4_Segment_04.wav,20.0,25.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [127]:
embeddings.head(1)

file,start_time,end_time,0,1,2,3,4,5,6,7,8,9,...,1270,1271,1272,1273,1274,1275,1276,1277,1278,1279
/Users/SML161/labeled_datasets/pnre_ecy3329/wav_Files/Recording_4/Recording_4_Segment_04.wav,0.0,5.0,1.94729,0.55613,1.911788,-0.258975,0.409084,2.868284,0.025218,-0.207324,0.285348,2.09694,...,2.963932,2.201978,-0.07183,0.17655,1.71418,1.047384,2.936442,-0.208916,0.963851,-0.255863


## Split into train/test sets by recording

In [128]:
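# the recording name is the parent folder of each audio file, e.g. 'Recording_4'
recording = labels.reset_index()['file'].apply(lambda x: Path(x).parent.stem)
# hold out every clip from Recording_4 as the test set
mask = np.array([r=='Recording_4' for r in recording])
# np.unique(recording,return_counts=True)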


labels_test = labels[mask]
labels_train = labels[~mask]

emb_test = embeddings[mask]
emb_train = embeddings[~mask]

# pick a species (note that some species might not be in all recordings, causing issues if you split train/test by recording)
sp = 'TUTI' # Tufted Titmouse

Xt = emb_train.values
Xv = emb_test.values
Yt = labels_train[sp]
Yv = labels_test[sp]

# alternative: random split of 5s clips into train/test
# this makes the problem much easier for the classifier, because clips from the same recording land on both sides of the split,
# but it might be necessary if species are not spread across the recordings and there is no other natural split
# from sklearn.model_selection import train_test_split
# X = embeddings.values
# Y = labels[sp]
# Xt,Xv,Yt,Yv = train_test_split(X,Y,test_size=0.5)

print(f"number of {sp} annotations in train: {Yt.sum()} and test: {Yv.sum()}")

number of TUTI annotations in train: 609.0 and test: 105.0
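Because the split is by recording, it helps to know how many positive clips each recording contributes before choosing which one to hold out. A minimal sketch of that tally on a made-up labels table (the paths and label values below are invented for illustration):

```python
from pathlib import Path

import pandas as pd

# Hypothetical toy labels table with the same index layout as the real one
toy_labels = pd.DataFrame(
    {
        "file": [
            "/data/Recording_1/seg_01.wav",
            "/data/Recording_1/seg_02.wav",
            "/data/Recording_2/seg_01.wav",
            "/data/Recording_4/seg_01.wav",
        ],
        "start_time": [0.0, 0.0, 0.0, 0.0],
        "end_time": [5.0, 5.0, 5.0, 5.0],
        "TUTI": [1.0, 0.0, 0.0, 1.0],
    }
).set_index(["file", "start_time", "end_time"])

# Recording name = parent folder of each audio file
recording = toy_labels.index.get_level_values("file").map(
    lambda f: Path(f).parent.stem
)

# Number of positive clips per recording for one species
positives_per_recording = toy_labels["TUTI"].groupby(recording.values).sum()
print(positives_per_recording)
```

A recording with zero positives for the chosen species makes a poor test set, since average precision and ROC AUC are not meaningful without positive examples.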


## Compare several classifier options from scikit-learn for Tufted Titmouse classification
The list of classifiers is adapted from https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html

In [129]:
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.025, random_state=42),
    SVC(gamma=2, C=1, random_state=42),
    GaussianProcessClassifier(1.0 * RBF(1.0), random_state=42),
    DecisionTreeClassifier(max_depth=5, random_state=42),
    RandomForestClassifier(
        max_depth=5, n_estimators=10, max_features=1, random_state=42
    ),
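    # hidden_layer_sizes=() means no hidden layers: a single fully connected layer
    MLPClassifier(hidden_layer_sizes=(),alpha=1, max_iter=1000, random_state=42),
    # two hidden layers of 100 units each
    MLPClassifier(hidden_layer_sizes=(100, 100),alpha=1, max_iter=1000, random_state=42),
    # three hidden layers of 100 units each
    MLPClassifier(hidden_layer_sizes=(100, 100, 100),alpha=1, max_iter=1000, random_state=42),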
    AdaBoostClassifier(algorithm="SAMME", random_state=42),
    GaussianNB(),
    QuadraticDiscriminantAnalysis(),
]

names = [
    "Nearest Neighbors",
    "Linear SVM",
    "RBF SVM",
    "Gaussian Process",
    "Decision Tree",
    "Random Forest",
    "FC Neural Net 1 layer",
    "FC Neural Net 2 layers",
    "FC Neural Net 4 layers",
    "AdaBoost",
    "Naive Bayes",
    "QDA",
]

# iterate over classifiers
for name, clf in zip(names, classifiers):

    # standardize each embedding dimension before fitting the classifier
    clf = make_pipeline(StandardScaler(), clf)
    clf.fit(Xt, Yt)
    # score = clf.score(Xv, Yv)
    print(f"{name}")

    # note: predict() returns hard 0/1 labels, so the metrics below are
    # computed from binary predictions rather than continuous scores
    preds_train = clf.predict(Xt)
    preds_val = clf.predict(Xv)

    ap_train = average_precision_score(Yt,preds_train)
    ap_val = average_precision_score(Yv,preds_val)
    print(f'\tAvgPrecision: train {ap_train:0.2f} test {ap_val:0.2f}')

    auc_train = roc_auc_score(Yt,preds_train)
    auc_val = roc_auc_score(Yv,preds_val)
    print(f'\tAU_ROC: train {auc_train:0.2f} test {auc_val:0.2f}')

Nearest Neighbors
	AvgPrecision: train 0.77 test 0.08
	AU_ROC: train 0.89 test 0.56
Linear SVM
	AvgPrecision: train 0.92 test 0.14
	AU_ROC: train 0.96 test 0.67
RBF SVM
	AvgPrecision: train 1.00 test 0.07
	AU_ROC: train 1.00 test 0.50




Gaussian Process
	AvgPrecision: train 1.00 test 0.07
	AU_ROC: train 1.00 test 0.50
Decision Tree
	AvgPrecision: train 0.62 test 0.07
	AU_ROC: train 0.78 test 0.51
Random Forest
	AvgPrecision: train 0.23 test 0.07
	AU_ROC: train 0.52 test 0.50
FC Neural Net 1 layer
	AvgPrecision: train 0.90 test 0.12
	AU_ROC: train 0.95 test 0.63
FC Neural Net 2 layers
	AvgPrecision: train 0.98 test 0.11
	AU_ROC: train 0.99 test 0.64
FC Neural Net 4 layers
	AvgPrecision: train 0.99 test 0.11
	AU_ROC: train 1.00 test 0.65
AdaBoost
	AvgPrecision: train 0.70 test 0.10
	AU_ROC: train 0.86 test 0.59
Naive Bayes
	AvgPrecision: train 0.48 test 0.16
	AU_ROC: train 0.85 test 0.65




QDA
	AvgPrecision: train 0.69 test 0.07
	AU_ROC: train 0.81 test 0.50
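The comparison loop above passes the output of `predict` (hard 0/1 labels) to the metrics, whereas average precision and ROC AUC are designed to rank continuous scores. A minimal sketch of how continuous scores could be obtained from any of these classifiers, shown on synthetic data rather than the bird embeddings. The helper `continuous_scores` and the `toy` variables are hypothetical names introduced here, and the results in this notebook were not produced this way:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def continuous_scores(model, X):
    """Return one score per sample for the positive class (hypothetical helper)."""
    # Prefer class probabilities when the estimator provides them
    if hasattr(model, "predict_proba"):
        return model.predict_proba(X)[:, 1]
    # Otherwise fall back to the signed distance from the decision boundary
    return model.decision_function(X)


# Synthetic, imbalanced stand-in for (embeddings, labels)
X_toy, y_toy = make_classification(
    n_samples=400, n_features=20, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

# SVC without probability=True has no predict_proba, so the fallback is used
model = make_pipeline(StandardScaler(), SVC(kernel="linear", C=0.025, random_state=42))
model.fit(X_tr, y_tr)

scores = continuous_scores(model, X_te)
auc_from_scores = roc_auc_score(y_te, scores)
auc_from_labels = roc_auc_score(y_te, model.predict(X_te))
print(f"AU_ROC from scores: {auc_from_scores:0.2f}, from hard labels: {auc_from_labels:0.2f}")
```

Scoring every classifier this way would also make the comparison with the logistic regression cell consistent, since that cell already uses `predict_proba`.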


### Logistic regression

In [130]:
from sklearn.linear_model import LogisticRegression

# note: unlike the comparison above, the embeddings are not standardized here
clf = LogisticRegression(random_state=0,max_iter=1000).fit(Xt, Yt)
# clf.predict(Xv)
# use the predicted probability of the positive class as a continuous score
preds_train = clf.predict_proba(Xt)[:,1]
preds_val = clf.predict_proba(Xv)[:,1]

ap_train = average_precision_score(Yt,preds_train)
ap_val = average_precision_score(Yv,preds_val)
print(f'AvgPrecision: train {ap_train:0.2f} test {ap_val:0.2f}')

auc_train = roc_auc_score(Yt,preds_train)
auc_val = roc_auc_score(Yv,preds_val)
print(f'AU_ROC: train {auc_train:0.2f} test {auc_val:0.2f}')


AvgPrecision: train 1.00 test 0.23
AU_ROC: train 1.00 test 0.69
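The large gap between the train and test scores above suggests the model is overfitting the training recordings. One option to explore is stronger L2 regularization, with the strength chosen by cross-validation. The sketch below runs on synthetic data, not the bird embeddings; the grid of `C` values and the number of folds are arbitrary assumptions, and nothing here shows that this would improve the results in this notebook:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for (embeddings, labels)
X_toy, y_toy = make_classification(
    n_samples=300, n_features=50, n_informative=5, weights=[0.9, 0.1], random_state=0
)

pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# In LogisticRegression, a smaller C means stronger regularization
param_grid = {"logisticregression__C": [0.001, 0.01, 0.1, 1.0]}

# Pick C by cross-validated average precision
search = GridSearchCV(pipeline, param_grid, scoring="average_precision", cv=3)
search.fit(X_toy, y_toy)

best_C = search.best_params_["logisticregression__C"]
print(f"best C: {best_C}, cross-validated AvgPrecision: {search.best_score_:0.2f}")
```

With real data split by recording, the folds would ideally also be grouped by recording so that the cross-validation score reflects generalization to unseen recordings.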



### SGD classifier


In [131]:
# Xt,Xv,Yt,Yv = train_test_split(X,Y,test_size=0.5)

# Always scale the input. The most convenient way is to use a pipeline.
clf = make_pipeline(StandardScaler(),
                    SGDClassifier(max_iter=1000, tol=1e-3))
clf.fit(Xt, Yt)

# note: predict() returns hard 0/1 labels, so the metrics below are
# computed from binary predictions rather than continuous scores
preds_train = clf.predict(Xt)
preds_val = clf.predict(Xv)

ap_train = average_precision_score(Yt,preds_train)
ap_val = average_precision_score(Yv,preds_val)
print(f'AvgPrecision: train {ap_train:0.2f} test {ap_val:0.2f}')

auc_train = roc_auc_score(Yt,preds_train)
auc_val = roc_auc_score(Yv,preds_val)
print(f'AU_ROC: train {auc_train:0.2f} test {auc_val:0.2f}')

AvgPrecision: train 0.87 test 0.15
AU_ROC: train 0.94 test 0.59
