# Data Fission
## Conducting Inference
We conduct inference on a response vector $Y\in\mathbb{R}^n$ given covariates $X$ and a known covariance matrix $\Sigma = \sigma^2I_n$ as follows:
1. Decompose $y_i$ into $f(y_i) = y_i - Z_i$ and $g(y_i) = y_i + Z_i$ where $Z_i\sim\mathcal{N}(0,\sigma^2)$
2. Fit $f(y_i)$ using LASSO to select features, denoted as $M\subseteq [p]$ (with the tuning parameter $\lambda$ chosen by the 1 standard deviation rule)
3. Fit $g(y_i)$ by linear regression without regularization using only the selected features.
4. Construct CIs for the coefficients fitted in step 3, each at level $1-\alpha$, using Theorem 2.
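
To see why the selection in step 2 does not invalidate the inference in steps 3 and 4, here is a short derivation. It assumes $y_i\sim\mathcal{N}(\mu_i,\sigma^2)$ is independent of $Z_i$, which is the Gaussian model used in the simulation later in this notebook:

$$\operatorname{Cov}\left( f(y_i), g(y_i) \right) = \operatorname{Cov}\left( y_i - Z_i,\; y_i + Z_i \right) = \operatorname{Var}(y_i) - \operatorname{Var}(Z_i) = \sigma^2 - \sigma^2 = 0.$$

Because $\left( f(y_i), g(y_i) \right)$ is jointly Gaussian, zero covariance implies independence, so conditioning on the model $M$ chosen from $f(Y)$ leaves the distribution of $g(Y)$ unchanged. The price is extra noise in the inference half: $\operatorname{Var}\left( g(y_i) \right) = 2\sigma^2$, which agrees with the factor $1+\tau^{-2}$ in Theorem 2 when $\tau = 1$.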

### Theorem 2
Let 
$$\hat\beta(M) = \argmin_{\tilde\beta} \| g(Y) - X_M\tilde\beta\|^2 = \left( X_M^\top X_M \right)^{-1} X_M^\top g(Y)$$

and for $\mu = \mathbb{E}[Y\mid X]\in\mathbb{R}^n$ (which is a fixed unknown quantity), 

$$\beta^*(M) = \argmin_{\tilde\beta} \mathbb{E}\left[ \| Y - X_M\tilde\beta \|^2 \right] = \left( X_M^\top X_M \right)^{-1} X_M^\top \mu.$$

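Then, writing $\tau$ for the scale of the noise added in the decomposition ($\tau = 1$ in step 1 above, so that $1+\tau^{-2} = 2$),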

$$\hat\beta(M) \sim \mathcal{N} \left( \beta^*(M), \left( 1 +\tau^{-2} \right)\left( X_M^\top X_M\right)^{-1} X_M^\top \Sigma X_M \left( X_M^\top X_M\right)^{-1} \right)$$

Furthermore, we can form a $1-\alpha$ CI for the $k$th element of $\beta^*(M)$ as 

$$\left[ \hat\beta(M) \right]_k \pm z_{\alpha/2}\sqrt{\left( 1 + \tau^{-2} \right) \left[ \left( X^\top_M X_M\right)^{-1} X_M^\top\Sigma X_M \left( X_M^\top X_M\right)^{-1} \right]_{kk} }$$
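
The interval above can be sketched in a few lines of NumPy. This is a minimal illustration, not the notebook's own implementation: the function name `fission_ci` and its arguments are hypothetical, and it assumes $\Sigma = \sigma^2 I_n$, so the sandwich term reduces to $\sigma^2\left( X_M^\top X_M \right)^{-1}$.

```python
import numpy as np
from scipy import stats


def fission_ci(X_M, g_Y, sigma_sq=1.0, tau=1.0, alpha=0.05):
    """Theorem 2 intervals for beta*(M), assuming Sigma = sigma_sq * I_n.

    Hypothetical helper: X_M holds the selected columns of X and g_Y is the
    inference half of the fissioned response.
    """
    XtX_inv = np.linalg.inv(X_M.T @ X_M)
    # Unregularized least squares fit on g(Y) using only the selected features
    beta_hat = XtX_inv @ X_M.T @ g_Y
    # With Sigma = sigma_sq * I_n the sandwich collapses to sigma_sq * (X_M^T X_M)^{-1}
    cov = (1.0 + tau ** -2) * sigma_sq * XtX_inv
    z = stats.norm.ppf(1.0 - alpha / 2.0)
    half_width = z * np.sqrt(np.diag(cov))
    return beta_hat, beta_hat - half_width, beta_hat + half_width


# Usage on synthetic data (values chosen only for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(16, 3))
g_demo = X_demo @ np.array([1.0, -1.0, 0.5]) + rng.normal(scale=np.sqrt(2.0), size=16)
beta_hat, lower, upper = fission_ci(X_demo, g_demo, sigma_sq=1.0, tau=1.0)
print(np.column_stack([lower, upper]))
```

Unlike the default `statsmodels` intervals, this version uses the known $\sigma^2$ and a normal quantile rather than an estimated variance and a $t$ quantile.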

## Simulation Setup

"We choose $\sigma^2 = 1$ and generate $n = 16$ data points with $p = 20$ covariates. For the first 15 data points, we have an associated vector of covariates $x_i\in\mathbb{R}^p$ generated from independent Gaussians. The last data point, which we denote $x_\text{lev}$, is generated in such a way as to ensure it is likely to be more influential than the remaining observations due to having much larger leverage. We define 

$$x_\text{lev} = \gamma\left( \vert X_1 \vert_\infty, \dots, \vert X_p \vert_\infty \right)$$
  
where $X_k$ denotes the $k$th column vector of the model design matrix $X$ formed from the first 15 data points and $\gamma$ is a parameter that we will vary within these simulations that reflects the degree to which the last data point has higher leverage than the first set of data points. We then construct $y_i\sim\mathcal{N}\left( \beta^\top x_i,\sigma^2 \right)$. The parameter $\beta$ is nonzero for 4 features: $(\beta_{1}, \beta_{16}, \beta_{17}, \beta_{18}) = S_\Delta(1,1,-1,1)$ where $S_\Delta$ encodes signal strength." 

We use 500 repetitions and summarize performance as follows. For the selection stage, we compute the power (defined as $\frac{\vert \{ j\in M:\beta_j\neq0 \} \vert}{\vert \{ j\in[p]:\beta_j\neq0 \} \vert}$) and precision (defined as $\frac{\vert \{ j\in M:\beta_j\neq0 \} \vert}{\vert M\vert}$) of selecting features with a nonzero parameter. For inference, we use the false coverage rate (defined as $\frac{\vert \{ k\in M:[\beta^*(M)]_k\not\in \text{CI}_k \} \vert }{\max\{ \vert M \vert ,1 \}}$) where $\text{CI}_k$ is the CI for $[\beta^*(M)]_k$. We also track the average CI length within the selected model.
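
The metrics above translate fairly directly into code. The helpers below are a sketch with hypothetical names (`selection_power`, `selection_precision`, `projected_target`, `false_coverage_rate`). They assume `selected` holds zero-based column indices, that at least one true coefficient is nonzero, and that $\mu = X\beta$ as in the simulation. Guarding precision with $\max\{\vert M\vert, 1\}$ is an added assumption that mirrors the false coverage rate definition.

```python
import numpy as np


def selection_power(selected, betas):
    """Share of truly nonzero coefficients that were selected."""
    truth = np.flatnonzero(betas)
    hits = np.intersect1d(selected, truth).size
    return hits / truth.size


def selection_precision(selected, betas):
    """Share of selected features whose true coefficient is nonzero."""
    truth = np.flatnonzero(betas)
    hits = np.intersect1d(selected, truth).size
    # Assumption: an empty selection scores 0 instead of dividing by zero
    return hits / max(len(selected), 1)


def projected_target(X, betas, selected):
    """beta*(M) = (X_M^T X_M)^{-1} X_M^T mu, taking mu = X @ betas."""
    X_M = X[:, selected]
    return np.linalg.solve(X_M.T @ X_M, X_M.T @ (X @ betas))


def false_coverage_rate(target, lower, upper):
    """Fraction of intervals [lower_k, upper_k] that miss target_k."""
    target, lower, upper = map(np.asarray, (target, lower, upper))
    misses = np.sum((target < lower) | (target > upper))
    return misses / max(target.size, 1)


# Usage: two of four features selected, one of them truly nonzero
betas_demo = np.array([1.0, 0.0, 0.0, -1.0])
selected_demo = np.array([0, 1])
print(selection_power(selected_demo, betas_demo))      # 0.5
print(selection_precision(selected_demo, betas_demo))  # 0.5
```

The average CI length is then just the mean of `upper - lower` over the selected features.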

In [253]:
import numpy as np
import pandas as pd
from scipy import stats

In [191]:
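sigma_sq = 1
n = 15  # regular observations; the high-leverage point is appended separately
p = 20
betas = np.zeros(p)

# Zero-based indices of beta_1, beta_16 and beta_18, which equal +1
ones = [0, 15, 17]
betas[ones] = 1

# Zero-based index of beta_17, which equals -1
neg_ones = [16]
betas[neg_ones] = -1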


In [192]:
def generate_linear(n = n, p = p, betas = betas, add_influential = []):
    # add_influential holds one leverage multiplier (gamma) per extra row
    n_true  = n + len(add_influential)
    X = np.zeros((n_true, p))
    X_1_to_n = np.random.multivariate_normal(np.zeros(p), np.eye(p), n)
    X[:n,:] = X_1_to_n
    if len(add_influential) > 0:
        # Column-wise sup norm |X_k|_inf over the first n rows
        baseline = np.abs(X_1_to_n).max(axis = 0)
        for i in range(len(add_influential)):
            # Append after the first n rows rather than overwriting row n - 1
            X[n + i,:] = baseline * add_influential[i]
    
    sd = 1
    Y = np.random.normal(0, sd, n_true) + X @ betas
    return X, Y, sd

In [254]:
def experiment_linear(n = n, p = p, betas = betas, add_influential = []):
    X, Y, sd = generate_linear(n = n, p = p, betas = betas, add_influential = add_influential)
    n += len(add_influential)
    ## Masking: f_Y is the selection half, g_Y the inference half
    sd_z = sd
    noise = np.random.normal(0, sd_z, n)
    f_Y = Y - noise
    g_Y = Y + noise
    return X, f_Y, g_Y, sd

In [338]:
from sklearn.linear_model import ElasticNetCV, ElasticNet
import statsmodels.api as sm

## Generate Data
X, Y, sd = generate_linear(n, p, betas, add_influential = [2])

## Fit initial LASSO (this cell uses the unsplit response for both stages)
model = ElasticNetCV(cv=10, l1_ratio=1, max_iter=10000).fit(X, Y)
## mse_path_ has shape (n_alphas, n_folds); average over folds
mean_mse = np.mean(model.mse_path_, axis=1)
alpha_min_index = np.argmin(mean_mse)
mse_alpha_min = model.mse_path_[alpha_min_index,:]
std_error = np.std(mse_alpha_min, ddof=1) / np.sqrt(model.mse_path_.shape[1])
threshold = mean_mse[alpha_min_index] + std_error

## Find 1se alpha: alphas_ is sorted in decreasing order, so the first index is the largest alpha
alpha_1se_indexes = np.where(mean_mse <= threshold)[0]
alpha_1se = model.alphas_[alpha_1se_indexes[0]]

## Fit model with 1se alpha
model_1se = ElasticNet(alpha=alpha_1se, l1_ratio=1, max_iter=10000).fit(X, Y)

## Find nonzero coefficients
selected = np.where(model_1se.coef_ != 0)[0]
if len(selected) > 0:
    infer_model = sm.OLS(Y, sm.add_constant(X[:,selected])).fit()
    ## Drop the intercept row and keep the intervals for the selected features
    print(infer_model.conf_int(0.05)[1:])



In [None]:
infer_model.params

array([ 1.07330555,  0.43404962,  0.25356643,  0.59341185, -1.21248747,
        1.27927079])

In [None]:
## Generate one dataset per leverage multiplier
np.random.seed(20248)
results_dict = {}
for lev in range(2, 7):
    X, Y, sd = generate_linear(n, p, betas, add_influential = [lev])
    results_dict[lev] = (X, Y)
X, Y, sd = generate_linear(n, p, betas, add_influential = [2])

(array([[ 7.03245695e-01, -1.22805910e+00, -1.52367739e+00,
         -5.12854932e-01, -4.57633078e-01, -7.48616055e-01,
         -1.43787811e+00,  2.44525078e+00, -2.09350576e+00,
         -1.83319363e+00,  4.54433973e-01,  1.26024201e+00,
          1.07805298e+00, -3.91978475e-01,  5.16178380e-01,
         -2.48948796e-01, -3.98043629e-01, -1.08027018e+00,
          2.78770346e-01,  2.60201454e-01],
        [ 1.53385269e+00,  4.73521470e-01, -1.05571646e+00,
         -1.70513674e+00, -7.78930755e-01,  1.83118890e-01,
         -1.21383435e+00,  5.35368726e-01,  1.35491928e+00,
         -6.33083485e-01, -6.20239196e-01,  1.05089785e+00,
         -2.58831379e-01,  9.88113504e-01,  1.52479548e+00,
         -8.37588688e-01,  2.32668343e+00,  2.64273829e-01,
         -1.14453323e+00,  8.37403363e-02],
        [-3.08794189e-01,  6.25964566e-01, -1.73799274e-01,
          1.59075801e-01,  1.95327947e+00, -4.87606999e-01,
         -1.33211742e-01, -9.87038286e-01,  3.72274809e-03,
          1.