# Building a Machine Learning Model  
---

## Introduction 

In this notebook I will show you the basics of training a machine learning algorithm. For this example, we will perform __supervised learning__ by training __Support Vector Machines__, a common ML algorithm, to __detect music genres from audio files__. 

To train the models we will use two datasets consisting of audio features for the task of **genre classification**. The audio features are calculated from audio (.WAV) files using [Marsyas](http://marsyas.info/index.html#), an open source framework for audio analysis. 

The first dataset contains just two song-level features (average spectral centroid and average spectral rolloff). With only two features we can visualize the data directly in a scatter plot, with each point colored by its class.  

The second dataset contains the same songs but with a set of 124 spectral features.

There are three genres, each represented by 100 audio tracks (points, in this case). The genres are __classical, jazz and metal__. 

Through the process we will get an intro to the ML pipeline, common ML tools (numpy, pandas, scikit-learn, matplotlib), and running a Jupyter Notebook (which will be needed later in the semester).

We will use [scikit-learn](https://scikit-learn.org/stable/index.html), a common ML framework for Python, to implement the ML pipeline. 

### Standard Imports

In [None]:
# Numpy for vector processing and linear algebra
import numpy as np

# Scikit learn for ML Pipeline
from sklearn import datasets
from sklearn.datasets import make_blobs
from sklearn import preprocessing
from sklearn import svm
from sklearn import model_selection
from sklearn import metrics

# Matplotlib + Seaborn for plotting
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# Pandas for data processing
import pandas as pd
pd.set_option('display.float_format', lambda x: '%.5f' % x)

# Other Utils
from utils import make_mesh
from data.data_utils import get_doughnut_dataset

---

## Data Generation

<img src="images/ml_pipeline_data.png" width=1200>

### Data Collection and Annotation

#### _Methods for Collection and Annotation_
- __Manual Annotation__
- __Empirical Studies / Experiments__
  - physiological measurements from the Trier Social Stress Test
- __Web Scraping__
  - search Google Images for pictures of specified facial expressions, such as _happy_ or _sad_
- __Historical Data and Records__
  - Risk assessment systems
- __Crowdsourcing__
  - [Amazon Mechanical Turk](https://onlinelibrary.wiley.com/doi/abs/10.1002/bdm.1753?casa_token=iW56bog9MJAAAAAA:BnSAAMDrpDFfnwMvphaHyfatw4W1f5q1RPT3KuhZitEYpNX1fDoBC7nRvNUvANvF5nFQvsO_d8WchavK)
  - [Data annotation games](https://www.cs.cmu.edu/~elaw/papers/ISMIR2007.pdf)
- __Existing Datasets__
  - [ImageNet](https://www.image-net.org/)
  
 __What are the potential issues with these methods for data collection?__
 
 ---

#### [GTZAN Music Genre Dataset](https://www.tensorflow.org/datasets/catalog/gtzan)

For this example we will use an existing dataset.

Tzanetakis et al. (2001) "Automatic Musical Genre Classification Of Audio Signals"

- Music selected from author's personal music library
- 1000 audio tracks, 30 seconds each
- 10 Genres, each with 100 tracks
  - blues, classical, country, disco, hiphop, jazz, metal, pop, reggae, rock
  
(We're using a subset of 300 tracks from 3 genres: classical, jazz and metal)
  
##### Feature Extraction

- Spectral analysis of raw audio file using Marsyas
- Our example dataset contains 124 spectral features describing audio content
  - including: time-domain Zero-Crossings, Spectral Centroid, Rolloff, Flux and Mel-Frequency Cepstral Coefficients (MFCC)
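
To give a feel for what two of these features measure, here is a minimal NumPy sketch of the spectral centroid and spectral rolloff for a single frame of audio. It is not the Marsyas implementation: the function name, the 85% rolloff fraction, and the synthetic 440 Hz test tone are all assumptions made for illustration.

```python
import numpy as np

def spectral_centroid_and_rolloff(frame, sample_rate, rolloff_fraction=0.85):
    """Return (centroid_hz, rolloff_hz) for one frame of audio samples."""
    magnitude = np.abs(np.fft.rfft(frame))                    # magnitude spectrum
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)  # bin frequencies in Hz

    total = magnitude.sum()
    if total == 0:  # silent frame: no spectral content to describe
        return 0.0, 0.0

    # Centroid: magnitude-weighted mean frequency ("centre of mass" of the spectrum)
    centroid = float((freqs * magnitude).sum() / total)

    # Rolloff: lowest frequency below which `rolloff_fraction` of the magnitude lies
    cumulative = np.cumsum(magnitude)
    rolloff_bin = int(np.searchsorted(cumulative, rolloff_fraction * total))
    return centroid, float(freqs[rolloff_bin])

# Hypothetical test signal: one second of a 440 Hz sine sampled at 8 kHz
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440 * t)

centroid_hz, rolloff_hz = spectral_centroid_and_rolloff(tone, sample_rate)
print(f"centroid = {centroid_hz:.1f} Hz, rolloff = {rolloff_hz:.1f} Hz")
```

For a pure tone both values sit at the tone's frequency. Real music spreads energy across many frequencies, so bright, noisy recordings tend to have higher centroid and rolloff values than mellow ones.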
  
**_What are some potential causes of biases in this dataset?_**

#### Load Reduced Feature and Full Feature Datasets

In [None]:
# Load dataset with only two features for visualization
X_sm, y_sm = datasets.load_svmlight_file("data/3genres.arff.libsvm")
X_sm = X_sm.toarray()
print(f'X: Samples={X_sm.shape[0]}, Features={X_sm.shape[1]}')
print(f'y: Samples={y_sm.shape[0]}')

# save features and labels to Pandas dataframe for easier processing
target_names = ['classical', 'jazz', 'metal']
features_sm = ['centroid', 'rolloff']
label = 'genre'
df_sm = pd.DataFrame(data=X_sm, columns=features_sm)
df_sm[label] = y_sm

In [None]:
# Load dataset with all features
X_full, y_full = datasets.load_svmlight_file("data/3genres_full.arff.libsvm")
X_full = X_full.toarray()
print(f'X: Samples={X_full.shape[0]}, Features={X_full.shape[1]}')
print(f'y: Samples={y_full.shape[0]}')

# save features and labels to Pandas dataframe for easier processing
target_names = ['classical', 'jazz', 'metal']
features_full = list(range(X_full.shape[1]))
label = 'genre'
df_full = pd.DataFrame(data=X_full, columns=features_full)
df_full[label] = y_full

#### Review Datasets

In [None]:
df_sm.head()

In [None]:
df_full.head()

In [None]:
df_sm[features_sm].describe()

In [None]:
df_full[features_full].describe()

### Data Preparation / Preprocessing

##### __Data Processing__

- standardizing and normalizing data
- feature extraction (grey line between data collection and processing)
- data augmentation

##### __Data Preparation__
- Reviewing and cleaning data samples
- Splitting data into Training, Testing and Validation (common for deep learning) datasets

#### Min Max Scaling

Scale each feature, $i$, to the range $[min, max]$ in two steps.

First, scale to $[0, 1]$:
$$
    x_i^{\prime} = \dfrac{x_i - \min(x_i)}{\max(x_i) - \min(x_i)}
$$

Then rescale to the target range $[min, max]$:
$$
    x_i^{\prime\prime} = x_i^{\prime} \cdot (max - min) + min
$$
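
The two steps above can be written out by hand in NumPy. This is only a sketch to make the formula concrete: the function name and the toy values are made up, and it does not handle a constant feature (where max equals min and the division would fail).

```python
import numpy as np

def min_max_scale(X, feature_range=(0.0, 1.0)):
    """Scale each column of X to `feature_range` using the two-step formula."""
    lo, hi = feature_range
    col_min = X.min(axis=0)  # per-feature minimum
    col_max = X.max(axis=0)  # per-feature maximum
    X_unit = (X - col_min) / (col_max - col_min)  # step 1: scale to [0, 1]
    return X_unit * (hi - lo) + lo                # step 2: rescale to [lo, hi]

# Made-up values for two features on very different scales
X_demo = np.array([[1000.0, 0.2],
                   [3000.0, 0.4],
                   [5000.0, 0.8]])

X_scaled = min_max_scale(X_demo, feature_range=(-1.0, 1.0))
print(X_scaled)
```

After scaling, both columns span the same range even though the raw values differ by several orders of magnitude.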

In [None]:
minmax_scaler = preprocessing.MinMaxScaler()  # default range [0, 1]
X_minmax = minmax_scaler.fit_transform(df_sm[features_sm])

In [None]:
X_minmax = pd.DataFrame(data=X_minmax, columns=features_sm)
X_minmax.describe()

#### Standardize Data or Zero Mean Unit Variance Normalization

Subtracts each feature's mean and divides by its standard deviation, so that every feature has zero mean and unit variance:

$$
    x_i^{\prime} = \dfrac{x_i - \mu_i}{\sigma_i}
$$

where $\mu_i$ is the mean and $\sigma_i$ is the standard deviation of feature $i$
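
As a quick check of the formula, here is a hand-written NumPy version. The function name and toy values are illustrative only, and the sketch assumes no feature is constant (a zero standard deviation would cause a division by zero).

```python
import numpy as np

def standardize(X):
    """Zero-mean, unit-variance scaling applied to each column of X."""
    mu = X.mean(axis=0)    # per-feature mean
    sigma = X.std(axis=0)  # per-feature population standard deviation (ddof=0)
    return (X - mu) / sigma

# Made-up values for two features on very different scales
X_demo = np.array([[1000.0, 0.2],
                   [3000.0, 0.4],
                   [5000.0, 0.8]])

Z = standardize(X_demo)
print(Z.mean(axis=0))  # approximately 0 for every feature
print(Z.std(axis=0))   # approximately 1 for every feature
```

One detail worth knowing when reading the statistics tables: `StandardScaler` also uses the population standard deviation (ddof=0), while pandas `describe()` reports the sample standard deviation (ddof=1), so the `std` row for standardized data will sit slightly above 1.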

In [None]:
standard_scaler = preprocessing.StandardScaler()
X_zmuv = standard_scaler.fit_transform(df_sm[features_sm])

In [None]:
X_zmuv = pd.DataFrame(data=X_zmuv, columns=features_sm)
X_zmuv.describe()

---

#### Train/Test Split

- Split dataset into train and testing sets
- apply normalization preprocessing to datasets

In [None]:
# get features and labels from original df
# small dataset
X = np.array(df_sm[features_sm])
y = np.array(df_sm[label])

# full dataset
# X = np.array(df_full[features_full])
# y = np.array(df_full[label])

# Split the data for training and testing
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=.25, random_state=5)

# initialize and fit scaler to training data
scaler = preprocessing.MinMaxScaler()
scaler.fit(X_train)

# scale training and testing data using scaler fit on training data
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

print('Training Data')
print(f'X: {X_train.shape}, y: {y_train.shape}')
print('Test Data')
print(f'X: {X_test.shape}, y: {y_test.shape}')

In [None]:
# Training Data Statistics
df_tmp = pd.DataFrame(X_train)
df_tmp.describe()

In [None]:
# Test Data Statistics
df_tmp = pd.DataFrame(X_test)
df_tmp.describe()

#### Questions
1. Why is it important to scale the test data using statistics computed from the training data, rather than fitting the scaler to the test data separately?
2. What is the problem with splitting the data randomly? 
    - How can this be addressed?

---

## Model Training  

<img src="images/ml_pipeline_models.png" width=1200>

### Model Development

##### 1. Model Type and Architecture

- Depends on Task, _T_
    - Classification, Regression, Clustering, etc...
- Model Type
    - Classical ML
        - Support Vector Machines, Linear or Logistic Regression, Decision Trees, etc...
    - Deep Learning
        - Feedforward network, Convolutional Network, ResNet, Transformers, etc...

##### 2. Hyperparameter selection

- A parameter of selected model and architecture that affects the learning process
- Used to tune the model training process

##### 3. Perform Training

---

#### Create and Train SVMs with different Kernels

A support vector machine (SVM) is a linear model that can handle non-linear datasets through the [kernel trick](http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick.html) (see also the optional section, _More on SVMs Maximum Margin Separating Hyperplane_, below).
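
To make the kernel trick concrete, the sketch below checks numerically that a degree-2 polynomial kernel on 2-D inputs gives the same value as an ordinary dot product in an explicit 3-D feature space. The feature map `phi`, the kernel function, and the two example vectors are chosen purely for illustration. This homogeneous kernel, $(a \cdot b)^2$, is a simplified cousin of the `poly` kernel used later, not scikit-learn's exact default.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D input vector."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly2_kernel(a, b):
    """Homogeneous degree-2 polynomial kernel: (a . b)^2."""
    return np.dot(a, b) ** 2

# Two arbitrary example points
a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

explicit = np.dot(phi(a), phi(b))  # map to 3-D first, then take the dot product
implicit = poly2_kernel(a, b)      # stay in 2-D and apply the kernel

print(explicit, implicit)
```

The kernel never builds the higher-dimensional vectors, which is what makes kernels such as RBF practical even though their implicit feature space is infinite-dimensional.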

Scikit-Learn Support Vector Machine Implementations

- [Support Vector Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)
- [Linear Support Vector Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html)



#### Model and Hyperparameter Selection

In [None]:
kernels = ["linear", "rbf", "poly"]
C = .3 # regularization parameter
poly_degree = 3 # polynomial order for the poly kernel

clfs = {
    "SVC w/ Linear Kernel": svm.SVC(kernel="linear", C=C),
    "LinearSVC": svm.LinearSVC(C=C),
    "SVC w/ RBF Kernel": svm.SVC(kernel="rbf", C=C),
    "SVC w/ 3 Degree Polynomial kernel": svm.SVC(kernel="poly", degree=poly_degree, C=C)
}

In [None]:
#train the models
models = {title: clf.fit(X_train, y_train) 
          for title, clf in clfs.items()}

#### Plot the Models

In [None]:
fig, sub = plt.subplots(2,2, figsize=(20, 14))
plt.subplots_adjust(wspace=0.2, hspace=0.2)

# make mesh for plotting
X_normed = scaler.transform(X) # scale X to ensure mesh has correct bounds
xx, yy = make_mesh(X_normed[:, 0], X_normed[:, 1], padding=.1, h=.005)

for (title, clf), ax in zip(models.items(), sub.flatten()):
    
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    test_score = clf.score(X_test, y_test)
    train_score = clf.score(X_train, y_train)
    Z = Z.reshape(xx.shape)

    ax.scatter(X_train[:,0], X_train[:,1], c=y_train, cmap=plt.cm.coolwarm, zorder=11, edgecolor='k', s=20)
    # ax.scatter(X_test[:,0], X_test[:,1], c=y_test, s=50, zorder=10, edgecolor='k',cmap=plt.cm.coolwarm)

    if title != "LinearSVC":
        sv = clf.support_
        X_support = X_train[sv]
        y_support = y_train[sv]
        ax.scatter(X_support[:,0], X_support[:, 1], c=y_support, cmap=plt.cm.coolwarm, 
                   s=100, zorder=10, edgecolor='k')

    ax.contourf(xx, yy, Z, cmap=plt.cm.coolwarm)
    ax.set_title(f'{title}: Training Acc: {train_score*100:.2f}, Test Acc: {test_score*100:.2f}')
    ax.set_xticks(())
    ax.set_yticks(())

### Model Evaluation

- Evaluate models with the goal of ___generalizability___
- Metric selection typically depends on task

##### __Some common metrics are:__
- __Accuracy:__ percent of correctly labeled predictions
  - correct / total
  - error rate is the complement (1 - acc)
- __Precision:__ proportion of positively labeled predictions that are correct
  - tp / (tp + fp)  
- __Recall:__ proportion of actual positives that are identified correctly
  - tp / (tp + fn)
- __F1 Score:__ harmonic mean of precision and recall
  - 2 * (precision * recall) / (precision + recall)
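
The formulas above can be computed directly from a list of true and predicted labels. The sketch below covers the binary case in plain Python. The function name and the ten made-up labels are for illustration only, and returning 0.0 when a denominator is zero is a convention chosen here. For a multi-class problem such as ours, `classification_report` reports these values per class, treating each genre in turn as the positive class.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    correct = sum(t == p for t, p in pairs)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up labels: 4 actual positives, 6 actual negatives
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

scores_demo = binary_metrics(y_true_demo, y_pred_demo)
print(scores_demo)
```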
  
  
##### __Confusion Matrix__

- Matrix of actual vs. predicted labels for each class
- Provides further intuition as to where model is confused
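
A confusion matrix is just a table of counts, so a small one can be built by hand. The sketch below uses made-up integer labels (0, 1, 2 standing in for three genres) and follows the same layout as scikit-learn's `confusion_matrix`: rows are actual classes and columns are predicted classes.

```python
import numpy as np

def confusion_matrix_counts(y_true, y_pred, n_classes):
    """Count how often each actual class (row) was predicted as each class (column)."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for actual, predicted in zip(y_true, y_pred):
        cm[actual, predicted] += 1
    return cm

# Made-up labels for nine samples, three per class
y_true_demo = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred_demo = [0, 0, 1, 1, 1, 1, 2, 1, 2]

cm_demo = confusion_matrix_counts(y_true_demo, y_pred_demo, n_classes=3)
print(cm_demo)

# The diagonal holds the correct predictions, so accuracy is trace / total
accuracy_demo = cm_demo.trace() / cm_demo.sum()
print(f"accuracy = {accuracy_demo:.3f}")
```

Off-diagonal entries show which pairs of classes the model mixes up. In this toy example, both errors are samples wrongly predicted as class 1.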

---

#### Evaluate Trained SVMs

In [None]:
for title, model in models.items():
    y_pred = model.predict(X_test)
    print(title)
    print(metrics.classification_report(y_test, y_pred))
    print()

#### Plot Confusion Matrices

In [None]:
plt.figure(figsize=(20, 14))
for i, (title, model) in enumerate(models.items()):
    y_pred = model.predict(X_test)
    cm = metrics.confusion_matrix(y_test, y_pred)
    plt.subplot(2, 2, i+1)
    
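    df_cm = pd.DataFrame(cm, range(3), range(3))
    sns.set(font_scale=1.4)  # scale tick-label font size
    sns.heatmap(df_cm, annot=True, annot_kws={"size": 16},  # annotation font size
                cmap='Blues', xticklabels=target_names, yticklabels=target_names)
    plt.xlabel('Predicted')  # confusion_matrix columns are predicted labels
    plt.ylabel('Actual')     # confusion_matrix rows are true labels
    plt.title(title)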

#### [Cross Validation with Hyperparameter Selection](https://towardsdatascience.com/cross-validation-and-hyperparameter-tuning-how-to-optimise-your-machine-learning-model-13f005af9d7d)

A better method for selecting and evaluating models
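
As a rough sketch of what this can look like in scikit-learn, the example below runs a 5-fold cross-validated grid search over the SVM kernel and `C`. It uses synthetic `make_blobs` data rather than the genre dataset, and the parameter grid, variable names, and random seeds are arbitrary choices for illustration. The scaler sits inside a pipeline so that it is re-fit on the training portion of every fold.

```python
from sklearn import svm
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic 3-class data standing in for the genre features
X_demo, y_demo = make_blobs(n_samples=300, centers=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

# Scaling + classifier in one estimator; step names are the lowercased class names
pipeline = make_pipeline(MinMaxScaler(), svm.SVC())

# Candidate hyperparameter values (an arbitrary small grid)
param_grid = {
    "svc__kernel": ["linear", "rbf"],
    "svc__C": [0.1, 1, 10],
}

# Every combination is scored with 5-fold cross-validation on the training set only
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_tr, y_tr)

print("best params:", search.best_params_)
print(f"mean CV accuracy of best params: {search.best_score_:.3f}")
print(f"held-out test accuracy: {search.score(X_te, y_te):.3f}")
```

The test set is touched only once, after the hyperparameters have been chosen, so the final score is a fairer estimate of how the model generalizes.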

#### More on SVMs Maximum Margin Separating Hyperplane (Optional)

In [None]:
X_blobs, y_blobs = make_blobs(n_samples=40, centers=2, random_state=6)
X_doughnut, y_doughnut, _, _ = get_doughnut_dataset()

In [None]:
# X, y = X_doughnut, y_doughnut
X, y = X_blobs, y_blobs

# fit the model; C=1 applies moderate regularization (a very large C approximates a hard margin)
clf = svm.SVC(kernel='linear', C=1)  # try different kernels
clf.fit(X, y)

plt.scatter(X[:, 0], X[:, 1], c=y, s=30, cmap=plt.cm.Paired, zorder=10)

# plot the decision function
ax = plt.gca()
xlim = ax.get_xlim()
ylim = ax.get_ylim()

# create grid to evaluate model
xx, yy = make_mesh(X[:, 0], X[:, 1], padding=0, h=.05)
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# plot decision boundary and margins
ax.contour(xx, yy, Z, levels=[-1, 0, 1], alpha=0.5, colors='k',
           linestyles=['--', '-', '--'], zorder=11)
# plot support vectors
sc = ax.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1], s=100,
           linewidth=1, facecolors='none', edgecolor='k', zorder=10)
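
For a linear kernel, the dashed lines in the plot are the level sets $w \cdot x + b = \pm 1$, so the distance between them (the margin width) is $2 / \lVert w \rVert$. The sketch below recomputes this from a fitted model. It regenerates the blobs data under separate `_demo` names so it does not overwrite the notebook's variables, and it applies only to the linear kernel, where `coef_` is available.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_blobs

# Same synthetic blobs as above, under separate names
X_demo, y_demo = make_blobs(n_samples=40, centers=2, random_state=6)
clf_demo = svm.SVC(kernel="linear", C=1).fit(X_demo, y_demo)

w = clf_demo.coef_[0]       # normal vector of the separating hyperplane
b = clf_demo.intercept_[0]  # offset of the hyperplane

# Distance between the two margin lines w.x + b = -1 and w.x + b = +1
margin_width = 2 / np.linalg.norm(w)
print(f"margin width = {margin_width:.3f}")

# The decision function is the signed score w.x + b for each point
manual_scores = X_demo @ w + b
print(np.allclose(manual_scores, clf_demo.decision_function(X_demo)))
```

Maximizing the margin is therefore the same as minimizing $\lVert w \rVert$, which is the quantity the SVM objective penalizes.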

### Model Postprocessing

##### __Postprocessing__

- Selection of probability (or score) threshold for final class determination
  - often a tradeoff between precision and recall
- Ensemble Aggregation
- Sample Aggregation
  - Majority voting vs. Max Value vs. Average
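
To illustrate the threshold tradeoff from the first item above, the sketch below applies three different thresholds to the same set of made-up scores for a binary problem. The scores, labels, thresholds, and function name are all invented for illustration. Returning 0.0 when there are no positive predictions is a convention chosen here.

```python
import numpy as np

def precision_recall_at(scores, y_true, threshold):
    """Precision and recall when predicting positive for scores >= threshold."""
    y_pred = scores >= threshold
    tp = int(np.sum(y_pred & (y_true == 1)))   # predicted positive, actually positive
    fp = int(np.sum(y_pred & (y_true == 0)))   # predicted positive, actually negative
    fn = int(np.sum(~y_pred & (y_true == 1)))  # predicted negative, actually positive
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up model scores and true labels for eight samples
scores_demo = np.array([0.95, 0.85, 0.70, 0.60, 0.55, 0.40, 0.30, 0.10])
labels_demo = np.array([1, 1, 0, 1, 0, 1, 0, 0])

results = {}
for threshold in (0.25, 0.5, 0.75):
    results[threshold] = precision_recall_at(scores_demo, labels_demo, threshold)
    precision, recall = results[threshold]
    print(f"threshold={threshold:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```

In this toy data, raising the threshold makes the model more selective: precision goes up while recall goes down. Which point on that curve is best depends on the cost of each kind of error in the application.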
  
_Genre Classification Example of Sample Aggregation_

- Instead of one feature vector per 30 second clip, we extract features every 5 seconds for training
- How do we aggregate the results of all 6 clips into a final score?
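
One way to explore this question is to apply the three aggregation rules listed above to the same set of segment-level scores. The numbers below are invented for a single hypothetical track, chosen so that the rules disagree. They are not output from any model in this notebook.

```python
import numpy as np

genres = ["classical", "jazz", "metal"]

# Made-up class scores for the six 5-second segments of one track (each row sums to 1)
segment_scores = np.array([
    [0.40, 0.35, 0.25],
    [0.45, 0.40, 0.15],
    [0.40, 0.38, 0.22],
    [0.05, 0.90, 0.05],
    [0.42, 0.40, 0.18],
    [0.10, 0.85, 0.05],
])

# Majority voting: each segment votes for its top class, most votes wins
votes = segment_scores.argmax(axis=1)
majority = genres[np.bincount(votes, minlength=len(genres)).argmax()]

# Average: mean score per class across segments, highest mean wins
average = genres[segment_scores.mean(axis=0).argmax()]

# Max value: the single most confident segment-level score decides
max_value = genres[segment_scores.max(axis=0).argmax()]

print(f"majority vote: {majority}")
print(f"average score: {average}")
print(f"max value:     {max_value}")
```

Here four segments lean weakly towards classical while two are very confident about jazz, so majority voting and score averaging reach different answers. Voting discards confidence, whereas averaging and max let a few confident segments dominate.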

### Model Deployment

##### __Challenges in Real World Deployment__

- Real world setting vs. data collection setting
    - Domain Shift / Data Drift
- Deployment Requirements
    - Explainability
    - Latency vs. Real-time feedback
- Unintended or undesired uses