Greetings! This repository is the code for the paper:
Feature Shift Detection: Localizing Which Features Have Shifted via Conditional Distribution Tests
Sean Kulinski, Saurabh Bagchi, David I. Inouye
Neural Information Processing Systems (NeurIPS), 2020.
If you use this code, please do us a favor and cite this paper via:
@inproceedings{kulinski2020feature,
author = {Kulinski, Sean and Bagchi, Saurabh and Inouye, David I.},
booktitle = {Neural Information Processing Systems (NeurIPS)},
title = {Feature Shift Detection: Localizing Which Features Have Shifted via Conditional Distribution Tests},
year = {2020}
}
In many real-world scenarios, the data used by machine learning models shifts away from the distribution the models were trained on. Can we not only detect when this happens, but also localize the shift to specific features in the data?
Distribution shift is a frequent problem in production machine learning environments. Much recent research has studied how to detect when such a shift has happened; our work extends this idea to also localize the shift to the specific features responsible. As a simple example, suppose a model is trained on data from a sensor network and, after some time in use, some sensors begin to malfunction and output incorrect values. If we can detect this shift early and localize it to the faulty sensors, the issue can be swiftly debugged and remediated.
Our goal for feature shift detection is exactly this: detecting a shift and localizing it to specific features. We perform hypothesis tests to check for a discrepancy between the feature-wise conditional distributions of the training and query distributions. We run this test for every feature and report the ones that show a discrepancy. To do this, we introduce a novel use of a test statistic based on the (Fisher) score function, which can compute these conditional distribution hypothesis tests quite efficiently.
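The efficiency of a score-based test can be motivated by a standard identity, sketched here under the assumption that the densities p (training) and q (query) are differentiable in x_j. Because the marginal p(x_{-j}) does not depend on x_j, the score of the conditional distribution equals the j-th component of the joint score, so a single gradient of the joint log-density covers every feature at once. The statistic s_j below is only one illustrative form of a score-based discrepancy; the symbols p, q, and s_j are notation chosen for this sketch, and the exact statistic in the paper may differ.

```latex
\begin{align*}
\frac{\partial}{\partial x_j} \log p(x_j \mid x_{-j})
  &= \frac{\partial}{\partial x_j} \bigl[ \log p(x) - \log p(x_{-j}) \bigr]
   = \frac{\partial}{\partial x_j} \log p(x) \\
s_j &= \mathbb{E}_{x}\!\left[ \left(
  \frac{\partial}{\partial x_j} \log p(x)
  - \frac{\partial}{\partial x_j} \log q(x) \right)^{2} \right]
\end{align*}
```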
The fsd module contains three main parts (in descending order):
fsd.featureshiftdetector - This submodule contains the main class FeatureShiftDetector, which performs both detection and localization. Given a specified statistic instance and bootstrapping method ('time' or 'simple'), call .fit(*, *) to perform bootstrapping; .detect_and_localize(*) can then be called to perform the shift detection and localization.
fsd.divergence - This submodule contains the various divergence methods used in the paper. This includes FisherDivergence, ModelKS, and KnnKS. Each divergence takes in a density model (or a non-parametric method, i.e. KNN), fits it on two data distributions (.fit(*, *)), and then calculates and returns the feature-wise test statistics (.score_features(*)).
fsd.models - This submodule contains the models used to fit the data to a distribution (for brevity, we'll refer to KNN as a model here). It includes GaussianDensity, DeepDensity (which uses iterative Gaussianization to fit a deep density model), and Knn. Each model has a .fit(*) method, which fits the model using the provided training data; a .sample(*) method, which samples from the model (KNN just samples from the training data); and a .conditional_sample(*) method, which performs conditional sampling given the point to be conditioned upon.
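To make the three-layer layout (model, divergence, detector) concrete, here is a minimal, self-contained NumPy sketch. It does not use or reproduce the fsd API: the classes ToyGaussianDensity, ToyScoreDivergence, and ToyShiftDetector, their constructor arguments, and their method signatures are all hypothetical stand-ins. It also assumes a Gaussian model, a mean-squared score difference as the feature-wise statistic, and a plain resampling bootstrap of the reference data to set the detection threshold, which may differ from what the repository does.

```python
import numpy as np


class ToyGaussianDensity:
    """Hypothetical stand-in for a density model: a full-covariance Gaussian."""

    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        self.precision_ = np.linalg.inv(np.cov(X, rowvar=False))
        return self

    def score(self, X):
        # Gradient of the Gaussian log-density with respect to x, one row per sample.
        return -(X - self.mean_) @ self.precision_


class ToyScoreDivergence:
    """Hypothetical stand-in for a divergence: fits two models, scores each feature."""

    def fit(self, X_ref, X_query):
        self.p_ = ToyGaussianDensity().fit(X_ref)
        self.q_ = ToyGaussianDensity().fit(X_query)
        self.pooled_ = np.vstack([X_ref, X_query])
        return self

    def score_features(self):
        # Mean squared difference between the two score functions, per feature.
        diff = self.p_.score(self.pooled_) - self.q_.score(self.pooled_)
        return (diff ** 2).mean(axis=0)


class ToyShiftDetector:
    """Hypothetical stand-in for a detector: bootstrap a threshold, then test."""

    def __init__(self, n_bootstrap=50, alpha=0.05, seed=0):
        self.n_bootstrap = n_bootstrap
        self.alpha = alpha
        self.seed = seed

    def fit(self, X_ref, n_query):
        # Assumed bootstrap: draw both windows from the reference data, so the
        # resulting statistics approximate the no-shift (null) distribution.
        rng = np.random.default_rng(self.seed)
        null_stats = []
        for _ in range(self.n_bootstrap):
            a = X_ref[rng.integers(0, len(X_ref), size=len(X_ref))]
            b = X_ref[rng.integers(0, len(X_ref), size=n_query)]
            null_stats.append(ToyScoreDivergence().fit(a, b).score_features().max())
        self.threshold_ = float(np.quantile(null_stats, 1 - self.alpha))
        self.X_ref_ = X_ref
        return self

    def detect_and_localize(self, X_query):
        stats = ToyScoreDivergence().fit(self.X_ref_, X_query).score_features()
        detected = bool(stats.max() > self.threshold_)
        # Localize to the feature with the largest statistic, if a shift is detected.
        return detected, (int(stats.argmax()) if detected else None)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_ref = rng.normal(size=(500, 3))
    X_query = rng.normal(size=(500, 3))
    X_query[:, 1] += 2.0  # simulate one malfunctioning sensor
    detector = ToyShiftDetector().fit(X_ref, n_query=len(X_query))
    print(detector.detect_and_localize(X_query))
```

In this toy setup the detector only ever reports the single feature with the largest statistic; handling several shifted features at once would need a different localization rule.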
To reproduce the experiments, first set up a Python environment matching requirements.txt. (Note: Python 3.7+ must be used.) Then run the desired experiment script from the scripts/ folder. For example, to run the unknown-multiple-sensors experiment, use the following command:
$ python unknown-multiple-sensors.py
To reproduce the real-world experiments, after setting up the Python environment as above, download the data by calling:
$ python path-to-feature-shift/scripts/real-world-experiments/fetch-data.py ARGUMENT
where ARGUMENT is one of: all, gas, energy, or covid. For example, to download all datasets, use:
$ python feature-shift/scripts/real-world-experiments/fetch-data.py all
If you have any questions or issues, please reach out via my email:
skulinsk AT purdue DOT edu
Cheers!