Classification Under Open Set Conditions
===

Author: Nathan A. Mahynski

Date: 2023/08/31

Description: Building classifiers that work in the "open world."

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mahynski/pychemauth/blob/main/docs/jupyter/api/opensetclassification.ipynb)

Conventional classifiers often assume that there exists a finite set of known classes.

$$C_{\rm assumed} = [C_1, C_2, C_3]$$

To train such a model, these classes should be sampled (often evenly) so that new observations encountered during testing or deployment come from the same distribution as the training samples (IID).  Class balancing and cross-validation are common tools for handling uncertainty in the latter assumption.  In the real world, however, a deployed classifier may encounter many other classes (possibly infinitely many) that were not available at training time.

$$C_{\rm reality} = [C_1, C_2, C_3, \dots)$$

This image is from [Scheirer et al., "Toward Open Set Recognition" (2012)](https://ieeexplore.ieee.org/abstract/document/6365193) which originally formalized the OSR problem:

<img src="../../_static/osr_definition.png" style="width:500px;">

Note that the "face verification" problem is essentially a one-class authentication problem. This "open set" of possibilities means that a classifier should be able to recognize the known classes seen during training, but also recognize when a test case is "none of the above."  There are a variety of algorithms designed to do this and related tasks.  Here are a few references that summarize some ontologies:

1. [Yang et al., Generalized Out-of-Distribution Detection: A Survey](https://arxiv.org/abs/2110.11334)
2. [Geng et al., Recent Advances in Open Set Recognition: A Survey](https://ieeexplore.ieee.org/abstract/document/9040673)

The "open set recognition" (OSR) task requires a model to identify known classes and to reject unknown ones.  In some classification schemes, the ability to simply reject an input as belonging to an unknown class is called a "reject" option.  OSR tasks are closely related to one-class classifiers (OCC) used for authentication purposes.  An OSR-capable model can be constructed by chaining together multiple OCCs, each designed to recognize a single class.  [Soft PLS-DA](../learn/plsda.ipynb) is another example of a model capable of handling open-set conditions. Another *ad hoc*, but general, way to handle an OSR task is to combine an outlier detector with a closed set classifier. Green and red pathways illustrate ["compliant" and "rigorous"](simca.ipynb#Building-an-Authenticator) OCC training schemes, respectively.

<img src="../../_static/osr.png" style="width:250px;">

The outlier detector illustrated here determines if a sample is out of distribution (OOD), and sends only those in distribution (ID) to the classifier.  In this way, the outlier detector determines if the input is coming from a "known" region of parameter space which the classifier should be responsible for.  If not, the input is simply assigned to an "unknown" class.  Otherwise, the closed-set classifier is assumed to be responsible for identifying the input as one of its known classes.  The outlier detector itself may use a variety of different assumptions depending on which detector is used, as may the classifier, but this combination method is very general and can be applied with different sorts of outlier detectors and classification models.
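The routing described above can be sketched with generic scikit-learn components. This is only an illustration of the idea, not PyChemAuth's `OpenSetClassifier` implementation: the choice of `LocalOutlierFactor` and `LogisticRegression`, the synthetic blob data, the `UNKNOWN` label, and the helper `predict_open_set` are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import LocalOutlierFactor

UNKNOWN = -1  # hypothetical label for "none of the above"

# Synthetic training data: three well-separated known classes in 2D.
centers = [[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]]
X_train, y_train = make_blobs(
    n_samples=300, centers=centers, cluster_std=0.5, random_state=0
)

# The outlier detector learns what the "known" region of space looks like;
# novelty=True allows it to score previously unseen points.
detector = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X_train)

# The closed-set classifier only knows how to choose among the known classes.
classifier = LogisticRegression(max_iter=1000).fit(X_train, y_train)


def predict_open_set(X):
    """Send in-distribution samples to the classifier; label the rest UNKNOWN."""
    X = np.asarray(X, dtype=float)
    is_inlier = detector.predict(X) == 1  # +1 = in distribution, -1 = outlier
    labels = np.full(X.shape[0], UNKNOWN, dtype=int)
    if is_inlier.any():
        labels[is_inlier] = classifier.predict(X[is_inlier])
    return labels


# Three points at the class centers and one far from all training data.
X_test = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0], [50.0, 50.0]])
pred = predict_open_set(X_test)
print(pred)
```

Because the two stages are independent, either one can be swapped for a different detector or classifier without changing the routing logic.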

We have implemented a basic OpenSetClassifier model in PyChemAuth, which we illustrate for a variety of conditions below. The appropriate performance metric may vary with the classification model being used.  Moreover, if the underlying model is capable of detecting outliers (or of rejecting samples, as in OCC), we should combine (1) samples rejected because they belong to a known alternative class ("known unknown") with (2) samples rejected because they come from an unknown alternative class ("unknown unknown") to compute the correct performance metric.
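One way to read the pooling of rejections is sketched below. Rejections from the outlier detector and from the classifier itself are mapped to a single label before scoring. The function names, the `UNKNOWN` label, the example data, and the geometric-mean "efficiency" are assumptions made for illustration; the metrics PyChemAuth actually computes may be defined differently.

```python
import numpy as np

UNKNOWN = "Unknown"  # hypothetical pooled label for every kind of rejection


def pool_rejections(detector_is_inlier, clf_pred, clf_reject_label):
    """Map classifier rejections and detector rejections to one label."""
    is_inlier = np.asarray(detector_is_inlier, dtype=bool)
    final = np.array(clf_pred, dtype=object)
    final[final == clf_reject_label] = UNKNOWN  # rejected by the classifier
    final[~is_inlier] = UNKNOWN                 # rejected by the detector
    return final


def sensitivity_specificity(y_true, y_pred, known_classes):
    """Sensitivity on known classes and specificity on everything else.

    Assumes the test set contains at least one known and one unknown sample.
    """
    known = set(known_classes)
    is_known = np.array([t in known for t in y_true], dtype=bool)
    correct = np.array([t == p for t, p in zip(y_true, y_pred)], dtype=bool)
    rejected = np.array([p == UNKNOWN for p in y_pred], dtype=bool)
    sensitivity = correct[is_known].mean()    # knowns given their true class
    specificity = rejected[~is_known].mean()  # unknowns that were rejected
    return float(sensitivity), float(specificity)


known_classes = ["Class A", "Class B", "Class C"]
y_true = ["Class A", "Class B", "Class C", "Class D", "Class E", "Class A"]
detector_is_inlier = [True, True, True, False, True, True]
clf_pred = ["Class A", "Class B", "Class C", "Class A", "???", "Class B"]

final = pool_rejections(detector_is_inlier, clf_pred, clf_reject_label="???")
sensitivity, specificity = sensitivity_specificity(y_true, final, known_classes)
efficiency = float(np.sqrt(sensitivity * specificity))  # illustrative summary
print(final.tolist(), sensitivity, specificity, efficiency)
```

Here the "Class D" sample is rejected by the detector and the "Class E" sample by the classifier, yet both count equally toward specificity once pooled.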

In [1]:
if 'google.colab' in str(get_ipython()):
    !pip install git+https://github.com/mahynski/pychemauth@main
    import os
    os.kill(os.getpid(), 9) # Automatically restart the runtime to reload libraries

In [2]:
try:
    import pychemauth
except ImportError as e:
    # Catch only import failures and keep the original traceback attached
    raise ImportError("pychemauth not installed") from e
    
import matplotlib.pyplot as plt
%matplotlib inline

import watermark
%load_ext watermark

%load_ext autoreload
%autoreload 2

In [5]:
import sklearn
import imblearn
import numpy as np
from pychemauth.preprocessing.scaling import CorrectedScaler
from pychemauth.preprocessing.imbalanced import ScaledSMOTEENN
from pychemauth.preprocessing.missing import PCA_IA
from pychemauth.classifier.osr import OpenSetClassifier
from pychemauth.classifier.plsda import PLSDA

In [4]:
%watermark -t -m -v --iversions

Python implementation: CPython
Python version       : 3.11.4
IPython version      : 8.14.0

Compiler    : GCC 12.2.0
OS          : Linux
Release     : 6.2.0-39-generic
Machine     : x86_64
Processor   : x86_64
CPU cores   : 40
Architecture: 64bit

sklearn   : 1.3.0
matplotlib: 3.7.2
watermark : 2.4.3
pychemauth: 0.0.0b4
imblearn  : 0.11.0
json      : 2.0.9
numpy     : 1.24.3



Case 1: Ignore Outlier Detection
---

## 1a: Multiclass, Hard Model

This is a baseline that is equivalent to using the multiclass hard model on its own.  Because a hard model always assigns each sample to exactly one known class, it cannot detect novelties at test time.

In [None]:
osc = OpenSetClassifier(
    clf_model=HARD_PLSDA,
    outlier_model=None, 
    clf_kwargs={...},
    known_classes=['Class A', 'Class B', 'Class C'],
    score_metric='TEFF',
    clf_style='hard'
)

## 1b: Multiclass, Soft Model

Because a soft model is allowed to leave a sample unassigned, it can detect novelties at test time. However, this relies on the classifier itself, which has been trained only on a closed set of known classes and is biased toward them.

In [None]:
osc = OpenSetClassifier(
    clf_model=SOFT_PLSDA,
    outlier_model=None, 
    clf_kwargs={"not_assigned":"???", ...},
    known_classes=['Class A', 'Class B', 'Class C'],
    score_metric='TEFF',
    clf_style='soft'
)
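The difference between the hard and soft styles in Cases 1a and 1b can be illustrated with a toy decision rule. This is a minimal sketch under stated assumptions: the distance matrix, the single `threshold`, and the helpers `hard_assign` and `soft_assign` are hypothetical, and real hard and soft models derive their decisions from fitted statistics rather than a fixed cutoff. The `"???"` placeholder mirrors the `not_assigned` value used above.

```python
import numpy as np


def hard_assign(distances, classes):
    """Hard rule: always pick the single closest class."""
    return [classes[i] for i in np.argmin(distances, axis=1)]


def soft_assign(distances, classes, threshold, not_assigned="???"):
    """Soft rule: accept every class within the threshold, possibly none."""
    assignments = []
    for row in distances:
        members = [c for c, d in zip(classes, row) if d <= threshold]
        assignments.append(members if members else [not_assigned])
    return assignments


classes = ["Class A", "Class B", "Class C"]

# Hypothetical distances from three test samples to each class model.
distances = np.array([
    [0.5, 4.0, 5.0],  # close to Class A only
    [1.5, 1.8, 6.0],  # close to both Class A and Class B
    [9.0, 8.0, 7.5],  # far from every known class
])

hard = hard_assign(distances, classes)
soft = soft_assign(distances, classes, threshold=2.0)
print(hard)
print(soft)
```

The third sample shows the limitation of the hard rule: it is forced into the nearest known class, while the soft rule leaves it unassigned.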

Case 2: Multiclass Hard Model with Outlier Detection
---

Case 3: Multiclass Soft Model with Outlier Detection
---

Case 4: Convert a Binary OvA Discriminator into Authenticator
---