Code to run experiments in SLOE: A Faster Method for Statistical Inference in High-Dimensional Logistic Regression.
Not an official Google product.
This library provides statistical inference for maximum likelihood estimates in high-dimensional logistic regression, based largely on the breakthrough results of Sur and Candès (PNAS, 2019). Applying their results directly is difficult because they depend on an unobserved signal strength quantity. Our method estimates this quantity with a leave-one-out approach, which we outline in our paper [1].
By high dimensions, we mean that the ratio of the number of covariates p to the sample size n is strictly between 0 and 0.5. When the number of covariates is too large, the data are separable (so the maximum likelihood estimate does not exist), and our method cannot recover from that case. When the number of covariates is small (say, p <= 0.05 * n), the high-dimensional adjustment is somewhat numerically unstable and adds little value over standard large-sample theory.
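As a rough illustration of the ranges described above, the following sketch classifies a problem by the ratio of p to n. The function name and the returned labels are hypothetical and are not part of this library; the 0.05 cutoff is only the rule of thumb mentioned above, not a hard boundary.

```python
def dimensionality_regime(n_samples: int, n_covariates: int) -> str:
    """Classify a problem by the ratio p / n (hypothetical helper, not library API).

    Integer arithmetic is used so the boundary cases p = 0.05 * n and
    p = 0.5 * n are handled exactly.
    """
    if n_samples <= 0 or n_covariates <= 0:
        raise ValueError("n_samples and n_covariates must be positive")
    # p / n >= 0.5: outside the range covered by the method.
    if 2 * n_covariates >= n_samples:
        return "too_many_covariates"
    # p / n <= 0.05: standard large-sample theory is likely adequate.
    if 20 * n_covariates <= n_samples:
        return "low_dimensional"
    # 0.05 < p / n < 0.5: the regime where the adjustment is most useful.
    return "high_dimensional"
```

For example, `dimensionality_regime(1000, 200)` falls in the high-dimensional range, while `dimensionality_regime(1000, 50)` sits at the rule-of-thumb cutoff and is treated as low-dimensional.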
The setting studied is complementary to sparse high-dimensional regimes. We assume that there are a relatively large number of covariates that are weakly correlated with the binary outcome. If one expects only a very small number of the many candidate covariates to have a nonzero coefficient in the model, sparse model selection and post-selection inference are probably a better approach than the one taken here.
Run run.sh to install the requirements and the package, and to run the tests.
The main approach proposed in our work is implemented in the UnbiasedLogisticRegression class in unbiased_logistic_regression.py. This has an sklearn-like interface, with fit, decision_function, and predict_proba methods. Additionally, for inference, we've added a prediction_intervals method. See the inline documentation for usage details.
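The interface described above can be exercised with a small duck-typed helper. This is a minimal sketch, not code from this repository: the helper names are hypothetical, and it assumes the usual sklearn calling convention of fit(X, y), decision_function(X), and predict_proba(X). The signature of prediction_intervals is not given here, so the sketch only checks that the method exists and does not call it.

```python
# Method names taken from the description above.
REQUIRED_METHODS = ("fit", "decision_function", "predict_proba", "prediction_intervals")


def missing_methods(model) -> list:
    """Return the names in REQUIRED_METHODS that `model` does not provide."""
    return [name for name in REQUIRED_METHODS
            if not callable(getattr(model, name, None))]


def fit_and_score(model, X_train, y_train, X_test):
    """Fit `model` and return (decision scores, predicted probabilities).

    Assumes sklearn-style signatures: fit(X, y), decision_function(X),
    predict_proba(X). prediction_intervals is checked for but not called,
    because its arguments are documented inline in the library, not here.
    """
    missing = missing_methods(model)
    if missing:
        raise TypeError(f"model is missing methods: {missing}")
    model.fit(X_train, y_train)
    scores = model.decision_function(X_test)
    probabilities = model.predict_proba(X_test)
    return scores, probabilities
```

Passing an UnbiasedLogisticRegression instance as `model` would be the intended use; its constructor arguments and the arguments to prediction_intervals should be taken from the inline documentation.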
[1] S. Yadlowsky, T. Yun, C. McLean, A. D'Amour (2021). "SLOE: A Faster Method for Statistical Inference in High-Dimensional Logistic Regression". arXiv:2103.12725 [stat.ML].