<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Overview" data-toc-modified-id="Overview-0.1"><span class="toc-item-num">0.1&nbsp;&nbsp;</span>Overview</a></span></li></ul></li><li><span><a href="#Load-the-Data" data-toc-modified-id="Load-the-Data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Load the Data</a></span></li><li><span><a href="#Model-the-Data-with-DD-SIMCA" data-toc-modified-id="Model-the-Data-with-DD-SIMCA-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Model the Data with DD-SIMCA</a></span><ul class="toc-item"><li><span><a href="#Training" data-toc-modified-id="Training-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Training</a></span></li><li><span><a href="#Testing" data-toc-modified-id="Testing-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Testing</a></span></li></ul></li><li><span><a href="#Create-a-Classifier" data-toc-modified-id="Create-a-Classifier-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Create a Classifier</a></span></li><li><span><a href="#Optimizing-the-Classifier" data-toc-modified-id="Optimizing-the-Classifier-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Optimizing the Classifier</a></span></li></ul></div>

In [None]:
import matplotlib.pyplot as plt
%matplotlib notebook

import imblearn
import sklearn

from sklearn.model_selection import GridSearchCV

import sys
sys.path.append('../../')
import chemometrics

import numpy as np
import pandas as pd

import watermark
%load_ext watermark

Overview
--------
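This notebook walks through a simple example of using DD-SIMCA (data-driven soft independent modeling of class analogy) to model a single target class, and then to build and tune a classifier from that model.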

In [None]:
%watermark -t -m -v --iversions

# Load the Data

In [None]:
# Load some example training data from the tests/ directory
df = pd.read_csv('../tests/data/simca_train.csv')

In [None]:
# Each row is a sample; the columns hold the different features
df

In [None]:
raw_x = np.array(df.values[:,3:], dtype=float) # Extract features
raw_y = np.array(df['Class'].values, dtype=str) # Take the class as the target

# Model the Data with DD-SIMCA

In [None]:
from chemometrics.classifier.simca import DDSIMCA_Model

## Training

In [None]:
# Here the data is spectra so we will not scale the X data
dds = DDSIMCA_Model(n_components=7, alpha=0.05, gamma=0.01, scale_x=False)

In [None]:
_ = dds.fit(raw_x, raw_y)

In [None]:
_ = dds.visualize(raw_x, raw_y)
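
DD-SIMCA judges each sample by two distances from a PCA model of the target class: a score distance (how far the sample sits from the center within the PCA subspace) and an orthogonal distance (how much of the sample the PCA subspace fails to reconstruct). The sketch below computes both with plain NumPy on synthetic data. It is only an illustration under my own assumptions: the function name `pca_distances`, the toy data, and the normalization by `N - 1` are not taken from `chemometrics`, whose internals may differ.

```python
import numpy as np

def pca_distances(X_fit, X_new, n_components):
    """Return (score distance h, orthogonal distance q) for the rows of X_new,
    relative to a PCA model fit on mean-centered, unscaled X_fit."""
    mean = X_fit.mean(axis=0)
    centered = X_fit - mean
    # SVD of the centered training data: centered = U @ diag(s) @ Vt
    _, s, Vt = np.linalg.svd(centered, full_matrices=False)
    loadings = Vt[:n_components].T                        # (n_features, n_components)
    score_var = s[:n_components] ** 2 / (X_fit.shape[0] - 1)
    scores = (X_new - mean) @ loadings
    h = np.sum(scores ** 2 / score_var, axis=1)           # distance inside the subspace
    residual = (X_new - mean) - scores @ loadings.T
    q = np.sum(residual ** 2, axis=1)                     # distance away from the subspace
    return h, q

rng = np.random.default_rng(0)
# Synthetic "spectra": 50 samples, 20 features, essentially two latent factors
latent = rng.normal(size=(50, 2))
mixing = rng.normal(size=(2, 20))
X_toy = latent @ mixing + 0.01 * rng.normal(size=(50, 20))

h_toy, q_toy = pca_distances(X_toy, X_toy, n_components=2)

# A heavily perturbed sample leaves the PCA plane, so its q is much larger
X_odd = X_toy[:1] + 5.0 * rng.normal(size=(1, 20))
h_odd, q_odd = pca_distances(X_toy, X_odd, n_components=2)
```

Samples with small `h` and small `q` resemble the training class; a large value of either one pushes a sample toward the edge of, or outside, the acceptance region.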

In [None]:
# predict() returns a boolean mask that is True for samples classified as 'Pure'.
pure_sample_mask = dds.predict(raw_x)
np.sum(pure_sample_mask) # Number of samples accepted as 'Pure'

In [None]:
# We could extract that data as follows:
pure = raw_x[pure_sample_mask]

In [None]:
# Extremes and Outliers can be found using the check_outliers() function.
extremes_mask, outliers_mask = dds.check_outliers(raw_x)

In [None]:
# We could extract that data as follows:
extremes = raw_x[extremes_mask]
outliers = raw_x[outliers_mask]

In [None]:
# Number of outliers, for example?
np.sum(outliers_mask)

In [None]:
# Number of extremes, for example?
np.sum(extremes_mask)
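
To make the three categories concrete, here is a small NumPy sketch of how boolean masks like these can be built from a single distance value per sample and two cutoffs. The distances and cutoff values are invented for illustration, and I am assuming that "extreme" means outside the acceptance region but not yet an outlier; `check_outliers()` may define its masks differently.

```python
import numpy as np

# Hypothetical combined distances for six samples (made-up numbers)
f_demo = np.array([1.2, 3.4, 9.8, 2.1, 15.0, 7.7])
f_accept = 7.0     # assumed cutoff controlled by alpha (edge of the acceptance region)
f_outlier = 12.0   # assumed cutoff controlled by gamma (beyond this is an outlier)

regular_demo = f_demo <= f_accept
extreme_demo = (f_demo > f_accept) & (f_demo <= f_outlier)
outlier_demo = f_demo > f_outlier

# Boolean masks can be summed to count members or used to index arrays
n_regular = int(np.sum(regular_demo))
extreme_values = f_demo[extreme_demo]
```

Because the masks are plain boolean arrays, the same indexing works on the feature matrix, exactly as in `raw_x[extremes_mask]` above.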

## Testing

First, let's test on other pure samples that weren't in the training set.

In [None]:
df = pd.read_csv('../tests/data/simca_test.csv')
raw_x_t = np.array(df.values[:,3:], dtype=float)
raw_y_t = np.array(df.values[:,1], dtype=str)

In [None]:
# Here, we would like to see all the points fall INSIDE the green acceptance region since we know they
# all belong to the 'Pure' class.
_ = dds.visualize(raw_x_t, raw_y_t)

We can also check the model against samples known to belong to alternate (non-'Pure') classes.

In [None]:
df = pd.read_csv('../tests/data/simca_test_alt.csv', header=None)
raw_x_a = np.array(df.values[:,3:], dtype=float)
raw_y_a = np.array(df.values[:,1], dtype=str)

In [None]:
# Here, we would like to see all the points fall OUTSIDE the green acceptance region since we know they
# are not the 'Pure' class.
_ = dds.visualize(raw_x_a, raw_y_a)

# Create a Classifier

In the last section we built a DD-SIMCA model of a single class. In practice, we would like to wrap that model in a classifier that can be scored against samples from other classes as well.

In [None]:
from chemometrics.classifier.simca import SIMCA_Classifier

In [None]:
sc = SIMCA_Classifier(n_components=7, 
                      alpha=0.05, 
                      scale_x=False, 
                      style='dd-simca', 
                      target_class='Pure', 
                      use='TEFF')

In [None]:
# Combine the training (all 'Pure') and alternate data into a new training set with multiple (in this
# case 2) classes.  We specified target_class='Pure' above, which tells the classifier which class to
# model.  ONLY data from that class is used to fit the model; samples from any other class are ignored
# during fitting.
x_train = np.vstack((raw_x, raw_x_a))
y_train = np.hstack((raw_y, raw_y_a))
_ = sc.fit(x_train, y_train)

In [None]:
# By default, TEFF is used to score the classifier; you can change this when the classifier is instantiated.
sc.score(x_train, y_train) # TEFF = sqrt(TSNS * TSPS)

In [None]:
sc.TSNS, sc.TSPS, sc.TEFF
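
The relationship between these three numbers can be sketched in a few lines of NumPy. Here I take TSNS to be the fraction of target-class samples that are accepted and TSPS to be the fraction of all other samples that are rejected, with TEFF as their geometric mean (the formula given in the comment above). The helper `one_class_scores` and the toy labels are hypothetical and are not part of `chemometrics`.

```python
import numpy as np

def one_class_scores(y_true, accepted, target_class):
    """Return (TSNS, TSPS, TEFF) for a one-class accept/reject decision."""
    y_true = np.asarray(y_true)
    accepted = np.asarray(accepted, dtype=bool)
    is_target = y_true == target_class
    tsns = np.mean(accepted[is_target])      # target samples correctly accepted
    tsps = np.mean(~accepted[~is_target])    # non-target samples correctly rejected
    teff = np.sqrt(tsns * tsps)              # geometric mean of the two
    return tsns, tsps, teff

# Toy labels and decisions (made up for illustration)
y_demo = ['Pure', 'Pure', 'Pure', 'Pure', 'Alt', 'Alt']
accepted_demo = [True, True, True, False, False, True]
tsns_demo, tsps_demo, teff_demo = one_class_scores(y_demo, accepted_demo, 'Pure')
```

In this toy case 3 of 4 'Pure' samples are accepted and 1 of 2 alternates is rejected, so a high TEFF requires doing well on both counts at once.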

In [None]:
# Look at all the data.
sc.model.visualize(x_train, y_train)

In [None]:
# Look at just the data used to train the underlying DD-SIMCA model.
sc.model.visualize(raw_x, raw_y)

In [None]:
# Look at the data used to test how well the model performs by classifying alternative samples.
sc.model.visualize(raw_x_a, raw_y_a)

# Optimizing the Classifier

Above, we fixed alpha at a value that is a meaningful choice of type I error rate, but it could also be tuned.  Moreover, we chose the number of PCs in the SIMCA model arbitrarily.  We can combine a pipeline with scikit-learn's GridSearchCV to optimize hyperparameters like these automatically using cross-validation.

In [None]:
# Here I've used an imblearn pipeline, but you can use scikit-learn's Pipeline instead if you don't need
# any class balancing.

pipeline = imblearn.pipeline.Pipeline(steps=[
    # Insert other preprocessing steps here...
    # ("smote", ScaledSMOTEENN(random_state=1)), # For example, class balancing
    ("simca", SIMCA_Classifier(n_components=7, 
                               alpha=0.05, 
                               scale_x=False, 
                               style='dd-simca', 
                               target_class='Pure', 
                               use='TEFF')
    )
])

# Hyperparameters of pipeline steps are given in standard notation: step__parameter_name
param_grid = [{
    # 'smote__k_enn':[1, 2, 3],
    # 'smote__k_smote':[1, 3, 3],
    # 'smote__kind_sel_enn':['all', 'mode'],
    'simca__n_components':np.arange(1, 10),
    # 'simca__alpha':[0.07, 0.05, 0.03, 0.01],
    # 'simca__style':['dd-simca', 'simca'],
}]

gs = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    n_jobs=-1,
    cv=sklearn.model_selection.StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    error_score=0,
    refit=True
)

_ = gs.fit(x_train, y_train)
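
The `step__parameter_name` convention is easiest to see on a pipeline built only from standard scikit-learn parts. The sketch below uses `StandardScaler` and `PCA` purely as stand-ins (they are not part of the SIMCA workflow above) to show where the grid keys come from.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in pipeline; the step names "scale" and "pca" are arbitrary labels
demo_pipe = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=3)),
])

# Each tunable parameter is exposed as <step name>__<parameter name>
tunable = sorted(key for key in demo_pipe.get_params() if "__" in key)

# A grid search applies each grid point to a clone of the pipeline via set_params()
demo_pipe.set_params(pca__n_components=2)
```

Any key listed in `tunable` is a valid entry for `param_grid`, which is why `'simca__n_components'` addresses the `n_components` argument of the step named `"simca"`.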

In [None]:
# The best parameters found can be accessed like this:
gs.best_params_

In [None]:
gs.best_score_ # The best mean cross-validated score found

In [None]:
# You can see detailed CV results here
gs.cv_results_
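
`cv_results_` is a dictionary of parallel arrays with one entry per grid point, so the best parameters are simply those at the index with the highest mean test score. The sketch below reproduces that lookup by hand on a made-up dictionary; the scores are invented for illustration and are not output from this notebook.

```python
import numpy as np

# Made-up stand-in for gs.cv_results_ (illustrative numbers only)
fake_results = {
    "params": [{"simca__n_components": n} for n in (1, 2, 3, 4)],
    "mean_test_score": np.array([0.61, 0.83, 0.90, 0.88]),
    "std_test_score": np.array([0.05, 0.04, 0.03, 0.06]),
}

best_index = int(np.argmax(fake_results["mean_test_score"]))
best_params_demo = fake_results["params"][best_index]
best_score_demo = float(fake_results["mean_test_score"][best_index])
```

Comparing the gap between the top scores with their `std_test_score` values is a quick way to judge whether the winner is meaningfully better than its neighbors.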

In [None]:
# For a 1D optimization you can easily visualize where the best value is
plt.figure() # Start a new figure so the plot is not drawn into the previous one
plt.errorbar(np.arange(1, 10), gs.cv_results_['mean_test_score'], yerr=gs.cv_results_['std_test_score'])
plt.xlabel('n_components')
plt.ylabel('Mean Test Score (TEFF)')

In [None]:
# refit=True (the default) refits the best pipeline on all the data passed to fit(), so best_estimator_ can be used directly.
gs.best_estimator_.predict(raw_x) # raw_x was just Pure

In [None]:
gs.best_estimator_.predict(raw_x_a) # raw_x_a was just Alternates

In [None]:
gs.best_estimator_.named_steps['simca'].model.visualize(x_train, y_train)

In [None]:
gs.best_estimator_.named_steps['simca'].score(x_train, y_train) # The score being used here is TEFF

In [None]:
gs.best_estimator_.named_steps['simca'].TSNS # 67 / (67+5)

In [None]:
gs.best_estimator_.named_steps['simca'].TSPS # 1 - 6/(6+5+7)

In [None]:
gs.best_estimator_.named_steps['simca'].TEFF # sqrt(TSPS*TSNS)