<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Overview" data-toc-modified-id="Overview-0.1"><span class="toc-item-num">0.1&nbsp;&nbsp;</span>Overview</a></span></li></ul></li><li><span><a href="#Load-the-Data" data-toc-modified-id="Load-the-Data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Load the Data</a></span></li><li><span><a href="#Model-the-Data-with-PLS-DA" data-toc-modified-id="Model-the-Data-with-PLS-DA-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Model the Data with PLS-DA</a></span><ul class="toc-item"><li><span><a href="#Training-a-Hard-Model" data-toc-modified-id="Training-a-Hard-Model-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Training a Hard Model</a></span></li><li><span><a href="#Training-a-Soft-Model" data-toc-modified-id="Training-a-Soft-Model-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Training a Soft Model</a></span></li><li><span><a href="#Testing" data-toc-modified-id="Testing-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Testing</a></span></li></ul></li><li><span><a href="#Optimizing-the-Classifier" data-toc-modified-id="Optimizing-the-Classifier-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Optimizing the Classifier</a></span></li></ul></div>

In [None]:
import matplotlib.pyplot as plt
%matplotlib notebook

import imblearn
import sklearn

from sklearn.model_selection import GridSearchCV

import sys
sys.path.append('../../')
import chemometrics

import numpy as np
import pandas as pd

import watermark
%load_ext watermark

Overview
--------
This notebook is a simple example of using the hard and soft variants of PLS-DA (partial least squares discriminant analysis) to classify samples.

In [None]:
%watermark -t -m -v --iversions

# Load the Data

In [None]:
# Let's load some data from the tests/ directory for this example
df = pd.read_csv('../tests/data/plsda3_train.csv')

In [None]:
# You can see that samples are rows, columns are different features
df

In [None]:
raw_x = np.array(df.values[:,3:], dtype=float) # Extract features
raw_y = np.array(df['Class'].values, dtype=str) # Take the class as the target

# Model the Data with PLS-DA

In [None]:
from chemometrics.classifier.plsda import PLSDA

## Training a Hard Model

In [None]:
# Here the data are elemental levels so we will scale the X data
plsda = PLSDA(n_components=5, 
              alpha=0.05, 
              gamma=0.01, 
              not_assigned='UNKNOWN', 
              style="hard", 
              scale_x=True)

In [None]:
_ = plsda.fit(raw_x, raw_y)

In [None]:
_ = plsda.visualize_2d(styles=['hard'])

In [None]:
# We can see what samples are predicted to be using the predict() function.
pred = plsda.predict(raw_x)

In [None]:
plsda.score(raw_x, raw_y)

In [None]:
# The score() function just tests how many samples are correctly predicted. You can do this directly and
# easily with the "hard" version of PLS-DA.
np.sum(np.array(pred).ravel() == raw_y) / raw_y.shape[0]

In [None]:
# More complete figures of merit can be computed.
df, I, CSNS, CSPS, CEFF, TSNS, TSPS, TEFF = plsda.figures_of_merit(pred, raw_y)

In [None]:
df # Each row is what the sample IS, each column is what the PREDICTION is.

In [None]:
I # Total of each category

In [None]:
CSNS

In [None]:
CSPS

In [None]:
CEFF

In [None]:
TSNS, TSPS, TEFF
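
As a rough illustration of what these figures of merit mean, the sketch below computes a class sensitivity, specificity and efficiency from a small made-up set of predictions. The definitions used here are common conventions and are an assumption: sensitivity is the fraction of a class's samples assigned to that class, specificity is the fraction of other samples not assigned to it, and efficiency is their geometric mean. They are not taken from the `figures_of_merit()` implementation, and the helper name `class_figures_of_merit` is hypothetical.

```python
import numpy as np

def class_figures_of_merit(predictions, actual, classes):
    """Hypothetical helper: per-class sensitivity, specificity, efficiency.

    predictions : list of lists, the classes each sample was assigned to
    actual      : 1D sequence with the true class of each sample
    classes     : classes the model was trained on
    """
    actual = np.asarray(actual)
    results = {}
    for c in classes:
        assigned = np.array([c in p for p in predictions])
        is_c = actual == c
        # Sensitivity: fraction of true members of c that were assigned to c
        sns = assigned[is_c].mean()
        # Specificity: fraction of non-members of c that were NOT assigned to c
        sps = 1.0 - assigned[~is_c].mean()
        # Efficiency: geometric mean of the two (assumed definition)
        results[c] = (sns, sps, np.sqrt(sns * sps))
    return results

# Made-up example: 3 samples of "A" and 3 samples of "B"
toy_actual = ["A", "A", "A", "B", "B", "B"]
toy_pred = [["A"], ["A", "B"], ["UNKNOWN"], ["B"], ["B"], ["A", "B"]]
toy_fom = class_figures_of_merit(toy_pred, toy_actual, classes=["A", "B"])
for name, (sns, sps, eff) in toy_fom.items():
    print(f"{name}: CSNS={sns:.3f}, CSPS={sps:.3f}, CEFF={eff:.3f}")
```

The same helper works for hard predictions, where each inner list holds exactly one label.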

## Training a Soft Model

In [None]:
# Here the data are elemental levels so we will scale the X data
plsda = PLSDA(n_components=5, 
              alpha=0.05, 
              gamma=0.01, 
              not_assigned='UNKNOWN', 
              style="soft", 
              scale_x=True)

In [None]:
_ = plsda.fit(raw_x, raw_y)

In [None]:
# You can visualize both the hard and soft boundaries if you train a soft model.
# With a hard model, you only get the hard boundaries by default.
_ = plsda.visualize_2d(styles=['hard', 'soft'])

In [None]:
# We can see what samples are predicted to be using the predict() function.
pred = plsda.predict(raw_x)

In [None]:
# Samples can now be predicted to belong to multiple classes.
pred[:10]
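
Because a soft prediction can contain several classes, it can be convenient to summarize how many samples received zero, one, or several assignments. The sketch below does this on a made-up list. It assumes that `predict()` returns one list of labels per sample and that unassigned samples carry the `not_assigned` label ('UNKNOWN' here); both assumptions should be checked against the actual output of `pred[:10]`. The helper name `count_assignments` is hypothetical.

```python
from collections import Counter

# Made-up soft predictions: one list of class labels per sample (assumed format)
toy_soft_pred = [["A"], ["A", "B"], ["UNKNOWN"], ["B"], ["A", "B", "C"], ["C"]]

def count_assignments(predictions, not_assigned="UNKNOWN"):
    """Hypothetical helper: tally samples by the number of classes assigned."""
    counts = Counter()
    for labels in predictions:
        # A sample labeled only with the not_assigned marker belongs to no class
        n = len([label for label in labels if label != not_assigned])
        counts[n] += 1
    return dict(counts)

toy_counts = count_assignments(toy_soft_pred)
print(toy_counts)  # keys: number of classes assigned, values: number of samples
```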

In [None]:
# More complete figures of merit can be computed.
df, I, CSNS, CSPS, CEFF, TSNS, TSPS, TEFF = plsda.figures_of_merit(pred, raw_y)

In [None]:
df

## Testing

First, let's test on other pure samples that weren't in the training set.

In [None]:
df = pd.read_csv('../tests/data/plsda3_test.csv')
raw_x_t = np.array(df.values[:,3:], dtype=float)
raw_y_t = np.array(['THA2']*len(raw_x_t), dtype=str) # One label per test sample

In [None]:
pred = plsda.predict(raw_x_t)
df, I, CSNS, CSPS, CEFF, TSNS, TSPS, TEFF = plsda.figures_of_merit(pred, raw_y_t)

In [None]:
df # Most foreign samples were CORRECTLY identified as being unknown

# Optimizing the Classifier

Here we took alpha as a meaningful choice of type I error rate, but it could also be adjusted. Moreover, we arbitrarily selected the number of components (`n_components`) to use in the PLS-DA model. We can use scikit-learn's pipelines together with `GridSearchCV` to optimize hyperparameters like these automatically.

In [None]:
# Here I've used an imblearn pipeline, but you can also use scikit-learn's pipeline if you don't want to
# do any class balancing.

pipeline = imblearn.pipeline.Pipeline(steps=[
    # Insert other preprocessing steps here...
    # ("smote", ScaledSMOTEENN(random_state=1)), # For example, class balancing
    ("plsda", PLSDA(n_components=5, 
                    alpha=0.05,
                    scale_x=True, 
                    not_assigned='UNKNOWN',
                    style='soft', 
                   )
    )
])

# Hyperparameters of pipeline steps are given in standard notation: step__parameter_name
param_grid = [{
    # 'smote__k_enn':[1, 2, 3],
    # 'smote__k_smote':[1, 3, 3],
    # 'smote__kind_sel_enn':['all', 'mode'],
    'plsda__n_components':np.arange(1, 20, 2),
    'plsda__alpha': [0.07, 0.05, 0.03, 0.01],
    #'plsda__scale_x':[True, False],
    #'plsda__style':['hard', 'soft'],
}]

gs = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    n_jobs=-1,
    cv=sklearn.model_selection.StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    error_score=0,
    refit=True
)

_ = gs.fit(raw_x, raw_y)

In [None]:
# The best parameters found can be accessed like this:
gs.best_params_

In [None]:
gs.best_score_ # The best score it received was...

In [None]:
# You can see detailed CV results here
gs.cv_results_
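
`cv_results_` is a dictionary of arrays, which is easier to read as a table. The sketch below uses a small made-up dictionary with the same key layout that `GridSearchCV` produces (`param_<name>`, `mean_test_score`, `std_test_score`) and pivots it so that each row is a value of `n_components` and each column a value of `alpha`. The scores are invented for illustration; with a fitted search, `gs.cv_results_` would take the place of `toy_cv_results`.

```python
import pandas as pd

# Invented scores laid out like GridSearchCV.cv_results_ (illustration only)
toy_cv_results = {
    "param_plsda__n_components": [1, 1, 3, 3, 5, 5],
    "param_plsda__alpha": [0.05, 0.01, 0.05, 0.01, 0.05, 0.01],
    "mean_test_score": [0.60, 0.55, 0.80, 0.78, 0.82, 0.79],
    "std_test_score": [0.05, 0.06, 0.03, 0.04, 0.03, 0.05],
}

toy_results = pd.DataFrame(toy_cv_results)

# One row per n_components, one column per alpha, cells hold the mean CV score
toy_table = toy_results.pivot(
    index="param_plsda__n_components",
    columns="param_plsda__alpha",
    values="mean_test_score",
)
print(toy_table)

# Row of the combination with the highest mean test score
toy_best = toy_results.loc[toy_results["mean_test_score"].idxmax()]
print(toy_best)
```

Reading the table row by row shows how the score changes with `n_components` at a fixed `alpha`, which is the 1D slice a plot would need.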

In [None]:
# For a 1D optimization you can easily visualize where the best value is:
# plt.errorbar(np.arange(1, 20, 2), gs.cv_results_['mean_test_score'], yerr=gs.cv_results_['std_test_score'])
# plt.xlabel('n_components')
# plt.ylabel('Mean Test Score (TEFF)')

In [None]:
# scikit-learn finds the optimum over the range. However, you may wish to simply look at these results,
# pick a smaller value (perhaps at an "elbow"), and re-train a new model separately.
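
One common way to pick such a smaller value is the one-standard-error rule: choose the smallest `n_components` whose mean score is within one standard deviation of the best score. This is a heuristic brought in for illustration, not something this notebook or `GridSearchCV` does automatically, and the numbers below are made up.

```python
import numpy as np

# Made-up 1D CV results for illustration (alpha held fixed)
toy_n_components = np.arange(1, 20, 2)
toy_mean = np.array([0.50, 0.70, 0.80, 0.84, 0.85, 0.86, 0.86, 0.85, 0.84, 0.83])
toy_std = np.array([0.05, 0.04, 0.03, 0.03, 0.03, 0.03, 0.03, 0.04, 0.04, 0.05])

best = np.argmax(toy_mean)                  # index of the highest mean score
threshold = toy_mean[best] - toy_std[best]  # one standard deviation below it

# Smallest n_components whose mean score reaches the threshold
toy_choice = toy_n_components[np.argmax(toy_mean >= threshold)]
print(f"best: {toy_n_components[best]}, one-standard-error choice: {toy_choice}")
```

With real results, the mean and standard deviation arrays would come from `gs.cv_results_['mean_test_score']` and `gs.cv_results_['std_test_score']`, restricted to a single value of `alpha`.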

In [None]:
# refit=True (the default) refits the best model on all of the data at the end, so you can use it directly.
gs.best_estimator_.predict(raw_x)

In [None]:
# You can visualize the training results
gs.best_estimator_.named_steps['plsda'].visualize_2d(styles=['hard', 'soft'])

In [None]:
# Train 
gs.best_estimator_.named_steps['plsda'].score(raw_x, raw_y) # The score being used here is TEFF

In [None]:
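# Use the optimized model (not the earlier `plsda` object) for both prediction and figures of merit
best_plsda = gs.best_estimator_.named_steps['plsda']
pred = best_plsda.predict(raw_x)
df, I, CSNS, CSPS, CEFF, TSNS, TSPS, TEFF = best_plsda.figures_of_merit(pred, raw_y)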

In [None]:
pred[:20]

In [None]:
df

In [None]:
CSNS

In [None]:
CSPS

In [None]:
CEFF

In [None]:
TSNS, TSPS, TEFF

In [None]:
np.any(gs.best_estimator_.named_steps['plsda'].check_outliers())

In [None]:
# Test
gs.best_estimator_.named_steps['plsda'].score(raw_x_t, raw_y_t) # The score being used here is TEFF

In [None]:
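# Use the optimized model (not the earlier `plsda` object) for both prediction and figures of merit
best_plsda = gs.best_estimator_.named_steps['plsda']
pred = best_plsda.predict(raw_x_t)
df, I, CSNS, CSPS, CEFF, TSNS, TSPS, TEFF = best_plsda.figures_of_merit(pred, raw_y_t)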

In [None]:
pred[:20]

In [None]:
df