Soft Independent Modeling of Class Analogies (SIMCA)
===

Author: Nathan A. Mahynski

Date: 2023/09/12

Description: Derivation and examples of [SIMCA](https://en.wikipedia.org/wiki/Soft_independent_modelling_of_class_analogies).

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mahynski/pychemauth/blob/main/docs/jupyter/gallery/simca.ipynb)

Soft Independent Modelling of Class Analogies (SIMCA) is a popular method for building authentication models.  Like Soft PLS-DA, it can determine if a new sample is consistent with a training set of known authentic class examples.  However, unlike PLS-DA, each SIMCA model is trained on a single class and returns a binary True/False prediction as to whether a new sample is consistent with the previously seen ones.  There have been many variations on the technique since its [introduction by Wold](https://www.sciencedirect.com/science/article/abs/pii/0031320376900145) in the late 1970s.  We will review two approaches: a conventional one and a more modern one.

Furthermore, each model may be optimized in different ways. "Rigorous" models use only examples of the target class and are designed to reach a certain sensitivity (specificity cannot be evaluated); "compliant" models are instead trained using alternative examples to reach an ideal overall balance of specificity and sensitivity.  While the latter often appear to perform "better," their results are biased by whichever alternatives happen to be available for training, and this bias is difficult to fully quantify. See [Rodionova et al.](https://www.sciencedirect.com/science/article/pii/S0169743916302799) for more discussion.
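To make the two metrics concrete, here is a minimal sketch of how sensitivity and specificity could be computed from a model's accept/reject decisions. The function names and the small arrays are illustrative assumptions, not part of PyChemAuth.

```python
import numpy as np

def sensitivity(accepted, is_target):
    """Fraction of true target-class samples that the model accepts."""
    return np.mean(accepted[is_target])

def specificity(accepted, is_target):
    """Fraction of alternative (non-target) samples that the model rejects."""
    return np.mean(~accepted[~is_target])

# Hypothetical decisions for six samples; the first three belong to the target class.
accepted = np.array([True, True, False, True, False, False])
is_target = np.array([True, True, True, False, False, False])

print(sensitivity(accepted, is_target), specificity(accepted, is_target))
```

A rigorous model only ever sees rows where `is_target` is True, so only the sensitivity can be estimated; a compliant model needs the alternative rows as well.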

In [1]:
if 'google.colab' in str(get_ipython()):
    !pip install git+https://github.com/mahynski/pychemauth@main
    import os
    os.kill(os.getpid(), 9) # Automatically restart the runtime to reload libraries

In [2]:
try:
    import pychemauth
except ImportError:
    raise ImportError("pychemauth not installed")

import matplotlib.pyplot as plt
%matplotlib inline

import watermark
%load_ext watermark

%load_ext autoreload
%autoreload 2

In [19]:
import scipy.stats  # Explicitly load the stats submodule used for the F distribution
import sklearn

import numpy as np

from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

from pychemauth.classifier.simca import SIMCA_Model
from pychemauth.preprocessing.scaling import CorrectedScaler

In [3]:
%watermark -t -m -v --iversions

Python implementation: CPython
Python version       : 3.11.4
IPython version      : 8.14.0

Compiler    : GCC 12.2.0
OS          : Linux
Release     : 6.2.0-26-generic
Machine     : x86_64
Processor   : x86_64
CPU cores   : 40
Architecture: 64bit

matplotlib: 3.7.2
json      : 2.0.9
pychemauth: 0.0.0b3
watermark : 2.4.3



A Conventional Implementation
---

See ["Robust classification in high dimensions based on the SIMCA Method," Vanden Branden and Hubert, Chemometrics and Intelligent Laboratory Systems 79 (2005) 10-21.](https://doi.org/10.1016/j.chemolab.2005.03.002) and ["Decision criteria for soft independent modelling of class analogy applied to near infrared data," De Maesschalck et al., Chemometrics and Intelligent Laboratory Systems 47 (1999) 65-77.](https://doi.org/10.1016/S0169-7439(98)00159-2) for details and justification of this implementation.

Step 1: The raw data is **broken up by group (supervised)**; then **for each group** a PCA model for the data is constructed as follows:
    
$$
X = TP^T + E.
$$

Here, $X$ is the training data with dimensions IxJ, $T$ is the scores matrix (the projection of $X$ onto the K retained principal components, IxK), $P$ is related to the loadings matrix (JxK), and $E$ is the error or residual matrix (IxJ).  $E$ may be explicitly calculated by $E = X - TP^T$. $X$ should be centered, and possibly scaled, as is required for PCA.
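As a rough illustration of this decomposition, the sketch below builds $T$, $P$, and $E$ for a single group using a singular value decomposition in NumPy. The synthetic data and the variable names are assumptions for demonstration only; the notebook itself relies on scikit-learn's `PCA`.

```python
import numpy as np

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(20, 5))  # synthetic group: I=20 samples, J=5 variables
K = 2                             # number of principal components to keep

# Center the data, as PCA requires
X = X_raw - X_raw.mean(axis=0)

# The first K right singular vectors of X form the columns of P (JxK)
_, _, Vt = np.linalg.svd(X, full_matrices=False)
P = Vt[:K].T

T = X @ P        # scores, IxK
E = X - T @ P.T  # residuals, IxJ

print(T.shape, P.shape, E.shape)
```

Because the columns of `P` are orthonormal, the residual `E` is orthogonal to the retained components, which is why the row-wise norm of `E` is called an orthogonal distance.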

<!-- As in PLS-DA, or other discriminant methods, a hard boundary between classes can be defined by drawing a hyperplane that divides them.  This leads to a hard "yes" or "no" if a point belongs to a certain class, and it can only belong to a single class.  Methods like SIMCA can predict if a point belongs to class A, class B, both, or neither.  It does this by defining a model boundary (distance from class centroid, for example) which envelops samples from class A; boundaries around different classes can be disjoint or overlapping, and their union does not need to entirely fill space (so that some points could belong to no classes).  Note the similarity to [Soft PLS-DA](plsda.ipynb); however, PLS-DA is trained with a fixed number of classes, whereas each SIMCA model is trained on only a single one. -->

Step 2: The residual standard deviation (RSD) values are calculated for the test set and training set a little differently.  For a given observation in the test set:

$$
RSD_{i, test}^2 = \frac{e_i^Te_i}{J-K},
$$

while for the training set (composed of the I samples) the degrees of freedom are modified slightly:

$$
RSD_{train}^2 = \frac{\sum_{i,train} e_i^Te_i}{(J-K)(I-K-1)}.
$$

Here, $e_i^Te_i$ is referred to as the squared orthogonal distance, $OD^2$. 
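The two RSD expressions can be sketched as follows. This is a minimal NumPy example on synthetic data that centers without scaling; the helper names `fit_pca` and `squared_od` are hypothetical and only serve this illustration.

```python
import numpy as np

def fit_pca(X, K):
    """Return the training mean and the first K loading vectors (JxK)."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:K].T

def squared_od(X, mean, P):
    """Squared orthogonal distance, e_i^T e_i, for each row of X."""
    Xc = X - mean
    E = Xc - (Xc @ P) @ P.T
    return np.sum(E**2, axis=1)

rng = np.random.default_rng(0)
X_train = rng.normal(size=(30, 6))  # I=30 samples, J=6 variables (synthetic)
X_test = rng.normal(size=(4, 6))
I, J = X_train.shape
K = 2

mean, P = fit_pca(X_train, K)

# One value per test observation
rsd2_test = squared_od(X_test, mean, P) / (J - K)

# A single pooled value for the training set, with modified degrees of freedom
rsd2_train = squared_od(X_train, mean, P).sum() / ((J - K) * (I - K - 1))

print(rsd2_test, rsd2_train)
```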

Step 3: The test statistic for each observation is given by:

$$
F_i = \frac{RSD_{i, test}^2}{RSD_{\rm train}^2}
$$

where the $F$ value is compared to a critical value taken at a given significance level; for example, $F_{\rm crit} = F_{(J-K),(J-K)(I-K-1);0.95}$ is commonly used as a 95% quantile limit.  If $F_i < F_{\rm crit}$ (equivalently, $F_i/F_{\rm crit} < 1$), we assign the observation to the class.
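A short sketch of this decision rule is given below. It uses `scipy.stats.f.ppf` with the same degrees of freedom that the `SIMCA` class in this notebook passes to it; the sizes `I`, `J`, `K` and the helper `belongs_to_class` are illustrative assumptions.

```python
from scipy.stats import f

I, J, K = 30, 6, 2  # illustrative sizes: samples, variables, components
alpha = 0.05        # significance level

dfn = J - K                  # numerator degrees of freedom
dfd = (J - K) * (I - K - 1)  # denominator degrees of freedom
f_crit = f.ppf(1.0 - alpha, dfn, dfd)

def belongs_to_class(rsd2_test, rsd2_train, f_crit):
    """Accept an observation when its F ratio falls below the critical value."""
    return (rsd2_test / rsd2_train) < f_crit

print(f_crit)
print(belongs_to_class(1.5, 1.0, f_crit), belongs_to_class(4.0, 1.0, f_crit))
```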

Note that the number of principal components, K, does NOT need to be the same from group to group; in fact, this is where the "independent" part of S(I)MCA comes from (see Vanden Branden et al.).

Note: historically, there has been some discrepancy over which degrees of freedom to use when computing these F statistics.  De Maesschalck et al., Chem. Intell. Lab. Sys. 47 (1999), discuss this in detail; essentially, in the above equations the term (J - K) is only valid when I > J (more samples than variables).  Otherwise, the term should be replaced with ((I - 1) - K) in both the test and train cases (and when computing the critical F value); i.e., use the smaller of (I, J), but due to mean centering we lose 1 DoF from I.  This term cancels out when computing the F value for a given sample, but it does **affect the calculation of the critical F value**.  From a statistics perspective this is important so that $\alpha$ is meaningful; from a machine learning perspective, if we simply choose to optimize $\alpha$ to balance the sensitivity and specificity of a model, then it is effectively irrelevant since this procedure treats $F_{\rm crit}$ as an adjustable parameter.
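The effect of this choice on the critical value can be sketched as follows. The helper `critical_f` is a hypothetical name; it applies the rule described above (use J when I > J, otherwise I - 1), and the sample sizes are arbitrary.

```python
from scipy.stats import f

def critical_f(I, J, K, alpha=0.05):
    """Critical F value using the smaller effective dimension of the data."""
    a = J if I > J else I - 1  # mean centering costs one degree of freedom from I
    return f.ppf(1.0 - alpha, a - K, (a - K) * (I - K - 1))

# More samples than variables: (J - K) is used
print(critical_f(I=30, J=6, K=2))

# Fewer samples than variables: ((I - 1) - K) is used instead
print(critical_f(I=10, J=200, K=2))

# Naively using (J - K) in the second case gives a different threshold
print(f.ppf(0.95, 200 - 2, (200 - 2) * (10 - 2 - 1)))
```

In the second case the naive threshold is noticeably lower than the corrected one, so the nominal $\alpha$ would no longer describe the actual rejection rate.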

In [20]:
class SIMCA:
    def __init__(self, n_components, alpha=0.05):
        self.__n_components = n_components
        self.__alpha = alpha

    def fit(self, X_train):
        # 1. Autoscale X
        self.__ss = CorrectedScaler(with_mean=True, with_std=True)
        self.__X_train = X_train.copy()

        # 2. Perform PCA on standardized coordinates
        self.__pca = PCA(n_components=self.__n_components, random_state=0)
        self.__pca.fit(self.__ss.fit_transform(self.__X_train))

        # 3. Compute critical F value
        self.__I = X_train.shape[0]
        self.__J = X_train.shape[1]
        self.__K = self.__n_components
        self.__a = self.__J if self.__I > self.__J else self.__I-1
        
        self.__f_crit = scipy.stats.f.ppf(
            1.0-self.__alpha,
            (self.__a-self.__K),
            (self.__a-self.__K)*(self.__I-self.__K-1)
        )

    def predict(self, X):
        # Check that observations are rows of X; a 1D input is a single observation
        X = np.array(X)
        if X.ndim == 1:
            X = X.reshape(1, -1)
        assert X.shape[1] == self.__J

        # From "Multivariate class modeling for the verification of food-authenticity
        # claims," Oliveri and Downey, TrAC 35 (2012):
        #
        # "The distances between each sample and the model are
        # evaluated in the full multidimensional space, which is
        # also defined by the non-significant PCs (SIMCA outer
        # space). This permits the inclusion of information on
        # random distribution of samples around the model, which
        # arises from non-significant, and therefore uninformative,
        # variations."
        
        # Following Vanden Branden et al.
        X_pred = self.__ss.inverse_transform(
            self.__pca.inverse_transform(
                self.__pca.transform(
                    self.__ss.transform(X)
                )
            )
        )
        numer = np.sum((X - X_pred)**2, axis=1)/(self.__a - self.__K)

        X_pred = self.__ss.inverse_transform(
            self.__pca.inverse_transform(
                self.__pca.transform(
                    self.__ss.transform(self.__X_train)
                )
            )
        )

        OD2 = np.sum((self.__X_train - X_pred)**2, axis=1)
        denom = np.sum(OD2) / ((self.__a - self.__K)*(self.__I - self.__K - 1))

        # F-test for each distance
        F = numer/denom

        # If f < f_crit, it belongs to the class
        return F < self.__f_crit

In [21]:
X, Y = sklearn.datasets.make_blobs(
    n_samples=100,
    n_features=5,
    centers=3,
    cluster_std=3,
    shuffle=True,
    random_state=0
)

X_train, y_train = X[:80], Y[:80]
X_test, y_test = X[80:], Y[80:]

In [38]:
# We need a new SIMCA object for EACH training class; each one predicts whether or not a point
# belongs to that class.
manual_simca_model = {}
for i in range(3):
    manual_simca_model[i] = SIMCA(n_components=1, alpha=0.05)
    manual_simca_model[i].fit(X_train[y_train == i])

In [39]:
manual_simca_model[0].predict(X_test) == (y_test == 0)

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True])

In [40]:
# PyChemAuth provides a more versatile class for this.
pychemauth_simca_model = {}
for i in range(3):
    pychemauth_simca_model[i] = SIMCA_Model(n_components=1, alpha=0.05, scale_x=True)
    pychemauth_simca_model[i].fit(X_train[y_train == i])

In [41]:
pychemauth_simca_model[0].predict(X_test) == (y_test == 0)

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True])

In [43]:
for i in [0, 1, 2]:
    assert np.all(pychemauth_simca_model[i].predict(X_test) == manual_simca_model[i].predict(X_test))
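Because each class model makes its own accept/reject decision, a sample may be accepted by one class, by several, or by none. The sketch below shows one way to summarize this; the decision matrix is hypothetical data chosen to show all three outcomes.

```python
import numpy as np

# Hypothetical decisions: rows are samples, columns are the three class models
accepted = np.array([
    [True, False, False],   # accepted by class 0 only
    [True, True, False],    # accepted by classes 0 and 1
    [False, False, False],  # rejected by every class model
])

n_classes_accepted = accepted.sum(axis=1)
for row, n in zip(accepted, n_classes_accepted):
    print(np.flatnonzero(row).tolist(), n)
```

In this notebook, an equivalent matrix could be assembled with `np.column_stack([manual_simca_model[i].predict(X_test) for i in range(3)])`.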

A Modern Approach: DD-SIMCA
---

https://colab.research.google.com/github/nathan-mahynski/nathan-mahynski.github.io/blob/public/_examples/common_chemometrics/example.ipynb#scrollTo=pODi0zhQ-ISp
    
    

https://colab.research.google.com/drive/12aajEL8tzkKEiGoI54xpH0-KlMa8U6V3#scrollTo=BqgfgRpE14_d

In [None]:
# TODO: also check scaling in h and q space