# FANOK

FANOK is a Python implementation of the fixed and Gaussian knockoffs framework developed by Barber and Candès.
This notebook summarizes the most important features that the package provides.
In the following,
let $X \in \mathbb{R}^{n \times p}$ be a design matrix of $n$ samples and $p$ features,
and $y \in \mathbb{R}^n$ a target vector.



In [None]:
from fanok import (
    FixedKnockoffs,
    GaussianKnockoffs,
    LowRankGaussianKnockoffs,
    RandomizedLowRankFactorModel,
    EstimatorStatistics,
    KnockoffSelector,
)

from sklearn.datasets import make_regression  # Synthetic regression problem generation
from sklearn.linear_model import LassoCV

## Fixed knockoffs

Fixed knockoffs work only in the low-dimensional regime,
more precisely when $n \geq 2p$.

In [None]:
X, y, coef = make_regression(n_samples=100, n_features=50, n_informative=20, coef=True)

### Generation

Fixed knockoffs can be created with a `FixedKnockoffs` object
that must be fitted on the samples.
It follows scikit-learn's `fit` and `transform` scheme.
The parameter `sdp_mode` may be either `'sdp'` (more statistical power, but slower) or `'equi'` (faster).
By default, the SDP is solved with a fast coordinate ascent algorithm.

In [None]:
fixed_knockoffs = FixedKnockoffs(sdp_mode='sdp')
fixed_knockoffs.fit(X)
X_tilde = fixed_knockoffs.transform(X)
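
For intuition about what such a `fit`/`transform` pair has to produce, here is a minimal NumPy sketch of the textbook equicorrelated fixed-X construction. It is not FANOK's implementation: the function name `equicorrelated_fixed_knockoffs` is hypothetical, and the sketch assumes that $X$ has full column rank and that its columns are rescaled to unit norm.

```python
import numpy as np

def equicorrelated_fixed_knockoffs(X):
    """Return (X_norm, X_tilde, s) for the equicorrelated fixed-X construction."""
    n, p = X.shape
    if n < 2 * p:
        raise ValueError("fixed knockoffs require n >= 2p")
    # Normalise the columns so that the Gram matrix has a unit diagonal.
    X_norm = X / np.linalg.norm(X, axis=0)
    gram = X_norm.T @ X_norm
    # Equicorrelated choice: s_j = min(2 * lambda_min(Gram), 1) for every j.
    lambda_min = np.linalg.eigvalsh(gram)[0]
    s = np.full(p, min(2.0 * lambda_min, 1.0))
    gram_inv_s = np.linalg.solve(gram, np.diag(s))  # Gram^{-1} diag(s)
    # C must satisfy C^T C = 2 diag(s) - diag(s) Gram^{-1} diag(s).
    c_gram = 2.0 * np.diag(s) - np.diag(s) @ gram_inv_s
    eigvals, eigvecs = np.linalg.eigh((c_gram + c_gram.T) / 2.0)
    C = np.sqrt(np.clip(eigvals, 0.0, None))[:, None] * eigvecs.T
    # U_perp: p orthonormal columns orthogonal to the column space of X.
    Q, _ = np.linalg.qr(X_norm, mode="complete")
    U_perp = Q[:, p:2 * p]
    X_tilde = X_norm @ (np.eye(p) - gram_inv_s) + U_perp @ C
    return X_norm, X_tilde, s
```

By construction, the result satisfies $\tilde{X}^\top \tilde{X} = X^\top X$ and $X^\top \tilde{X} = X^\top X - \mathrm{diag}(s)$, which are the two defining properties of fixed knockoffs. The requirement $n \geq 2p$ appears in the sketch as the need for $p$ extra orthonormal directions in `U_perp`.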

### Selection

To perform the selection,
statistics must be computed on the concatenated matrix $[X, \tilde{X}]$.
They can be evaluated in many different ways;
typically, a regressor is trained on $[X, \tilde{X}]$ and $y$,
and the absolute values of its coefficients are kept.
The Lasso often leads to high statistical power.
To use it, we feed a `LassoCV` estimator to an `EstimatorStatistics` object.

In [None]:
lasso = LassoCV(cv=3, max_iter=3000)
statistics = EstimatorStatistics(lasso)
selector = KnockoffSelector(fixed_knockoffs, statistics, alpha=0.2, offset=1)
selector.fit(X, y)

fdp, power = selector.score(X, y, coef)
print(f"FDP: {fdp}, Power: {power}")
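
The roles of `alpha` and `offset` can be illustrated with the standard knockoff filter threshold. The sketch below is not taken from FANOK: the names `knockoff_threshold` and `knockoff_select` are hypothetical, and it assumes the statistic $W_j = |\beta_j| - |\beta_{j+p}|$, where $\beta$ holds the $2p$ coefficients fitted on $[X, \tilde{X}]$. With `offset=1` this is the knockoff+ rule, and with `offset=0` the more liberal original rule.

```python
import numpy as np

def knockoff_threshold(W, alpha=0.2, offset=1):
    """Smallest t among the nonzero |W_j| whose estimated FDP is at most alpha."""
    for t in np.sort(np.abs(W[W != 0])):
        # Large negative W_j act as a proxy for false discoveries at level t.
        estimated_fdp = (offset + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if estimated_fdp <= alpha:
            return t
    return np.inf  # no admissible threshold: nothing is selected

def knockoff_select(coef_augmented, alpha=0.2, offset=1):
    """coef_augmented: coefficients fitted on [X, X_tilde], of length 2p."""
    coef_augmented = np.asarray(coef_augmented, dtype=float)
    p = coef_augmented.shape[0] // 2
    # Assumed statistic: original magnitude minus knockoff magnitude.
    W = np.abs(coef_augmented[:p]) - np.abs(coef_augmented[p:])
    return W >= knockoff_threshold(W, alpha, offset)
```

A feature is kept when its original coefficient dominates its knockoff's coefficient by at least the threshold; a smaller `alpha` pushes the threshold up and yields fewer selections.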

After the selector is fitted,
the attribute `selector.mask_` describes which features were selected.
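
Given such a boolean mask and the true coefficients, the false discovery proportion and the power can be computed by hand. The helper below, `fdp_and_power`, is a hypothetical name; it assumes the usual definitions, which may differ in detail from what `selector.score` does internally.

```python
import numpy as np

def fdp_and_power(mask, coef):
    """False discovery proportion and power of a selection mask."""
    mask = np.asarray(mask, dtype=bool)
    support = np.asarray(coef) != 0          # truly informative features
    false_discoveries = np.sum(mask & ~support)
    true_discoveries = np.sum(mask & support)
    # max(1, .) avoids dividing by zero when nothing is selected or informative.
    fdp = false_discoveries / max(1, mask.sum())
    power = true_discoveries / max(1, support.sum())
    return fdp, power
```

For instance, selecting three features of which two have a zero true coefficient gives an FDP of 2/3.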

## Gaussian knockoffs

In high dimension,
fixed knockoffs no longer apply and Gaussian knockoffs must be used instead.
The selection procedure follows the same pattern as above.
The parameter `covariance_mode` may be either `'empirical'` or `'wolf'` (recommended).

In [None]:
X, y, coef = make_regression(n_samples=100, n_features=200, n_informative=20, coef=True)
statistics = EstimatorStatistics(LassoCV(cv=3, max_iter=3000))
gaussian_knockoffs = GaussianKnockoffs(sdp_mode='sdp', covariance_mode='wolf')
selector = KnockoffSelector(gaussian_knockoffs, statistics, alpha=0.05, offset=1)

selector.fit(X, y)

fdp, power = selector.score(X, y, coef)
print(f"FDP: {fdp}, Power: {power}")
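
To make the sampling step concrete, here is a hedged sketch of Gaussian knockoff generation. It assumes that `'wolf'` refers to the Ledoit-Wolf shrinkage estimator (available in scikit-learn as `LedoitWolf`) and replaces the SDP with a simple equicorrelated-style choice of $s$; the function name `sample_gaussian_knockoffs` is hypothetical and this is not FANOK's code. With $D = \mathrm{diag}(s)$, each row is drawn from $\tilde{x} \mid x \sim \mathcal{N}\left(x - D \Sigma^{-1} (x - \mu),\; 2D - D \Sigma^{-1} D\right)$.

```python
import numpy as np
from sklearn.covariance import LedoitWolf

def sample_gaussian_knockoffs(X, random_state=0):
    """Draw one knockoff copy of X under a fitted Gaussian model."""
    rng = np.random.default_rng(random_state)
    n, p = X.shape
    # Shrinkage estimate of the mean and covariance (assumed meaning of 'wolf').
    lw = LedoitWolf().fit(X)
    mu, sigma = lw.location_, lw.covariance_
    # Equicorrelated-style s, shrunk by 1% so that the conditional
    # covariance below stays strictly positive definite.
    lambda_min = np.linalg.eigvalsh(sigma)[0]
    s = np.full(p, 0.99 * min(2.0 * lambda_min, np.diag(sigma).min()))
    sigma_inv_d = np.linalg.solve(sigma, np.diag(s))  # Sigma^{-1} D
    cond_mean = X - (X - mu) @ sigma_inv_d
    cond_cov = 2.0 * np.diag(s) - np.diag(s) @ sigma_inv_d
    chol = np.linalg.cholesky((cond_cov + cond_cov.T) / 2.0)
    return cond_mean + rng.standard_normal((n, p)) @ chol.T
```

Solving the SDP instead of using a constant $s$ generally gives knockoffs that are less correlated with the original features, which is where the extra power of `sdp_mode='sdp'` comes from.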

### Low-rank factor model

In very high dimension,
solving the full SDP is no longer feasible.
Instead, the covariance matrix can be approximated as follows:
$$\Sigma = D + U U^\top$$
where $D \in \mathbb{R}^{p \times p}$ is a diagonal positive semidefinite matrix,
and $U \in \mathbb{R}^{p \times k}$ has $k$ orthogonal columns.
When $k \ll p$, this special structure allows the SDP to be solved
efficiently, in only $\mathcal{O}(pk^2)$ steps.
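
The sketch below does not solve the SDP itself; it only illustrates why the diagonal-plus-low-rank structure makes linear algebra cheap. Using the Woodbury identity, a system in $\Sigma = D + U U^\top$ reduces to a $k \times k$ system, at a cost of $\mathcal{O}(pk^2)$. The function name `low_rank_solve` is hypothetical, and the diagonal entries of $D$ are assumed strictly positive.

```python
import numpy as np

def low_rank_solve(d, U, b):
    """Solve (diag(d) + U U^T) x = b without forming the p x p matrix."""
    d_inv_b = b / d                       # D^{-1} b, O(p)
    d_inv_U = U / d[:, None]              # D^{-1} U, O(pk)
    k = U.shape[1]
    small = np.eye(k) + U.T @ d_inv_U     # k x k matrix, O(p k^2) to build
    correction = np.linalg.solve(small, U.T @ d_inv_b)
    return d_inv_b - d_inv_U @ correction
```

A dense solve on the same system would cost $\mathcal{O}(p^3)$, which is the gap the factor model is designed to close.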

There are multiple ways to compute this factor model,
but in high dimension, randomized algorithms are better suited.

In [None]:
factor_model = RandomizedLowRankFactorModel(rank=20)
knockoffs = LowRankGaussianKnockoffs(factor_model)
statistics = EstimatorStatistics(LassoCV(cv=3, max_iter=3000))
selector = KnockoffSelector(knockoffs, statistics, alpha=0.2, offset=1)
selector.fit(X, y)

fdp, power = selector.score(X, y, coef)
print(f"FDP: {fdp}, Power: {power}")
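
As a rough idea of how a randomized factor model could be obtained, the sketch below uses scikit-learn's `randomized_svd` on the centered data and puts the leftover variance on the diagonal. It is one possible approach under stated assumptions, not the algorithm behind `RandomizedLowRankFactorModel`; the function name `randomized_factor_model` and the `1e-8` floor on the diagonal are illustrative choices.

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd

def randomized_factor_model(X, rank, random_state=0):
    """Approximate the sample covariance of X as diag(d) + U U^T."""
    n, p = X.shape
    X_centered = X - X.mean(axis=0)
    # Randomized truncated SVD: X_centered ~ A diag(sv) Vt, with rank terms.
    _, sv, Vt = randomized_svd(
        X_centered, n_components=rank, random_state=random_state
    )
    # Scale the right singular vectors so that U U^T matches the low-rank
    # part of the sample covariance; the columns of U stay orthogonal.
    U = Vt.T * (sv / np.sqrt(n - 1))
    # Put the remaining per-feature variance on the diagonal, floored at a
    # small positive value so that diag(d) is invertible.
    variances = X_centered.var(axis=0, ddof=1)
    d = np.clip(variances - np.sum(U ** 2, axis=1), 1e-8, None)
    return d, U
```

The dense $p \times p$ covariance matrix is never formed, so memory and time scale with $p \cdot k$ rather than $p^2$.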