# Model Training & Testing

In this notebook we train a model on previously labeled data to predict who is speaking in a 1-second audio segment. The steps are:

1. Load the previously formatted data
2. Split the data into training and test sets. For initial validation, we train on 75% of the data and evaluate on the remaining 25%.
3. Train a logistic regression model to predict speakers.
4. Spot-check the model with some simple validation metrics.
5. If the results look reasonable, retrain the final model on the full dataset for future predictions.
6. Save the model.

In [1]:
import warnings
from pathlib import Path

import librosa
import pandas as pd
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from tqdm import tqdm
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

In [2]:
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [3]:
X_df = pd.read_csv("./training-data/X.csv", index_col=0)
label_df = pd.read_csv("./training-data/labels.csv")

In [4]:
X = X_df.values
y = label_df["label_id"].values

In [5]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
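The third class is much rarer than the other two (110 of the 1,315 test rows in the report further down), so a stratified split is one option for keeping class proportions the same in both partitions. The sketch below is not what this notebook runs: it uses synthetic stand-in arrays (`X_demo` and `y_demo` are hypothetical) only to show the effect of the `stratify` argument.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a rare third class (hypothetical values,
# not the notebook's audio features).
rng = np.random.default_rng(0)
y_demo = np.array([0] * 540 + [1] * 380 + [2] * 80)
X_demo = rng.normal(size=(len(y_demo), 4))

# stratify=y_demo keeps each class's share (nearly) identical in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo)

train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)
print("train class shares:", train_share.round(3))
print("test class shares: ", test_share.round(3))
```

Without `stratify`, the rare class's share of the test set can drift from split to split, which makes its precision and recall noisier.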

### Logistic Regression

In [6]:
# Weight Scott & Wes (classes 0 and 1) equally; class 2 gets a lower weight
class_weight = {0: 0.4, 1: 0.4, 2: 0.2}
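For comparison, scikit-learn can also derive weights from the label frequencies with the `"balanced"` heuristic, `n_samples / (n_classes * count_per_class)`. The sketch below is an illustration rather than the notebook's method: `y_demo` is a hypothetical label vector built from the test-set support counts in the classification report (709, 496, 110), on the assumption that the training labels are distributed similarly.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical label vector built from the test-set support counts in the
# classification report (709 / 496 / 110); an assumption for illustration only.
y_demo = np.repeat([0, 1, 2], [709, 496, 110])
classes = np.array([0, 1, 2])

# "balanced" weight for class c = n_samples / (n_classes * count(c))
balanced = compute_class_weight("balanced", classes=classes, y=y_demo)
balanced_weights = {int(c): round(float(w), 3) for c, w in zip(classes, balanced)}
print(balanced_weights)
```

The heuristic up-weights the rare class, whereas the hand-picked weights in this notebook do the opposite and give class 2 the smallest weight.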

In [7]:
scaled_lr = make_pipeline(
    StandardScaler(), LogisticRegressionCV(multi_class="multinomial", class_weight=class_weight))
scaled_lr.fit(X_train, y_train)
y_pred = scaled_lr.predict(X_test)

In [8]:
print(classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))

             precision    recall  f1-score   support

          0       0.91      0.97      0.94       709
          1       0.91      0.98      0.94       496
          2       0.48      0.10      0.17       110

avg / total       0.87      0.90      0.88      1315

Accuracy: 0.9011406844106464


In [9]:
pd.crosstab(
    pd.Series(y_test, name="actual"), pd.Series(y_pred, name="predicted"),
    margins=True)

predicted    0    1   2   All
actual
0          686   14   9   709
1            5  488   3   496
2           64   35  11   110
All        755  537  23  1315
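The per-class precision and recall in the classification report can be recovered directly from this confusion matrix. The short sketch below copies the counts from the crosstab output (rows are actual labels, columns are predicted labels) and recomputes the metrics with NumPy.

```python
import numpy as np

# Counts copied from the crosstab output (rows = actual, columns = predicted).
cm = np.array([
    [686, 14, 9],
    [5, 488, 3],
    [64, 35, 11],
])

correct = cm.diagonal()
recall = correct / cm.sum(axis=1)     # correct / number of actual rows per class
precision = correct / cm.sum(axis=0)  # correct / number of predictions per class
accuracy = correct.sum() / cm.sum()

print("recall:   ", recall.round(2))
print("precision:", precision.round(2))
print("accuracy: ", round(float(accuracy), 4))
```

Only 11 of the 110 class 2 segments are recovered (recall 0.10), and 64 of them are predicted as class 0, so the 90% overall accuracy mostly reflects the two large classes.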


### Train final model on all available data

In [10]:
full_model = make_pipeline(
    StandardScaler(), LogisticRegressionCV(multi_class="multinomial", class_weight=class_weight))
full_model.fit(X, y)

Pipeline(memory=None,
     steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('logisticregressioncv', LogisticRegressionCV(Cs=10, class_weight={0: 0.4, 1: 0.4, 2: 0.2}, cv=None,
           dual=False, fit_intercept=True, intercept_scaling=1.0,
           max_iter=100, multi_class='multinomial', n_jobs=1, penalty='l2',
           random_state=None, refit=True, scoring=None, solver='lbfgs',
           tol=0.0001, verbose=0))])

In [11]:
joblib.dump(full_model, "syntax-speaker-predictor.pkl")

['syntax-speaker-predictor.pkl']
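A saved pipeline can later be restored with `joblib.load` and used for prediction. The sketch below is a self-contained round trip under stated assumptions: it fits a small stand-in pipeline on synthetic data (`demo_model`, `X_demo`, and the `demo-speaker-predictor.pkl` file name are hypothetical) instead of reading the real model file, and it uses plain `LogisticRegression` to keep the example fast.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small stand-in model on synthetic data (hypothetical; not the notebook's features).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 5))
y_demo = np.repeat([0, 1, 2], 20)
demo_model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "demo-speaker-predictor.pkl"
    joblib.dump(demo_model, model_path)
    # The fitted scaler is restored together with the classifier.
    restored = joblib.load(model_path)

# New rows must have the same feature columns, in the same order, as the training data.
new_segments = rng.normal(size=(3, 5))
labels = restored.predict(new_segments)
probabilities = restored.predict_proba(new_segments)
print(labels)
print(probabilities.round(2))
```

Because the scaler lives inside the pipeline, new feature rows are passed in unscaled. Pickled models are generally tied to the scikit-learn version that wrote them, so it is safest to load the file with the same version used for training.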