## Prepare TST Data

In [1]:
TST_PATH = "./tst1s.mat"
LPNE_PATH = "./lpne-data-analysis/"
N_COMPONENTS = 30
import os,sys
import numpy as np
import matplotlib.pyplot as plt
import pickle

sys.path.append(LPNE_PATH)

import data_tools

np.random.seed(42)

In [2]:
X_psd,labels = data_tools.load_data(TST_PATH,feature_list = ["power"])

version saveFeatures_1.6 used to calcuate power features
Using data that was preprocessed with unknown preprocessing version. Please make sure all datasets in the same project were preprocessed the same way.


In [3]:
labels['area']

['Acb_Core',
 'Acb_Sh',
 'BLA',
 'IL_Cx',
 'Md_Thal',
 'PrL_Cx',
 'VTA',
 'lDHip',
 'lSNC',
 'mDHip',
 'mSNC']

In [4]:
y_mouse = np.array(labels['windows']['mouse'])
y_expDate = np.array(labels['windows']['expDate'])
y_task = np.array(labels['windows']['task'])
y_geno = np.array(labels['windows']['genotype'])

## Generate the causal data

Here we limit ourselves to samples from the open-field (OF) and tail-suspension (TS) tasks. We treat the task a mouse is placed in, open field or tail suspension, as the treatment assignment. All mice are exposed to both conditions.

Our aim here is to generate a semi-synthetic dataset, with a known ground-truth mediation effect, for evaluating methods that estimate mediation effects. Our data naturally provides our observed variables X, which are the power spectral densities of the LFPs recorded at the 11 regions noted above. These power spectral densities are evaluated from 1-56 Hz. Our data also provides a natural analog to treatment assignment: placement in the open field ("OF") task or in the tail suspension ("TS") task. The open field task is considered a moderate stress state and the tail suspension task a heightened stress state. We also include the genotype of the mouse as a confounder. $Clock\Delta19$ mice are a model of bipolar disorder.

$$T \sim \{0:\textit{Open Field},1:\textit{Tail Suspension}\}$$
$$\mathcal{y}_{geno} \sim \{0: \textit{Wild Type},1:Clock\Delta19\}$$
$$X \in \mathbb{R}^{11 \times 56}$$

To generate our true underlying mediator, we use the activations (scores) of the top 30 principal components. We demonstrate below that these principal components do vary with the treatment, and therefore a mediator relationship can be found. We then add a unit bias to samples in the treatment group.

$$ s \sim PCA(X) $$
$$ m = s + \mathbb{1}_{T=1}$$

We then define outcomes that are a linear function of the treatment, mediator, random noise, and a confounder variable - the mouse's genotype.

$$ y = \alpha T + \beta^T m + \phi \mathcal{y}_{geno} + \epsilon_{y} $$
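
Because the outcome model is linear with no treatment-mediator interaction, the effects this dataset is meant to encode can be written down directly. The following is a sketch under standard mediation assumptions (no unmeasured confounding once genotype is accounted for). The potential-outcome notation $m(t)$, $s(t)$, $y(t, m)$ and the all-ones vector $\mathbf{1}$ are introduced here for illustration and do not appear elsewhere in this notebook.

$$ NDE = E[\,y(1, m(0)) - y(0, m(0))\,] = \alpha $$
$$ NIE = E[\,y(1, m(1)) - y(1, m(0))\,] = \beta^T E[\,m(1) - m(0)\,] = \beta^T \left( E[\,s(1) - s(0)\,] + \mathbf{1} \right) $$
$$ TE = NDE + NIE $$

Under these assumptions, the direct effect is the coefficient $\alpha$, while the indirect effect combines the treatment-driven shift in the PCA scores with the added unit bias, each weighted by $\beta$.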

### Variables

 - causal_mask: A filter for isolating the samples relevant to our causal experiment. We exclude the homecage data and only use the open field and tail suspension data
 - X_causal: Power spectral density features for timepoints where the mouse is either in open field or TST
 - y_mouse_causal: Per sample mouse identifier for samples in OF or TST
 - y_task_causal: String array indicating per sample whether a mouse is in OF or TST
 - y_treatment_causal: Our $T$ variable. This is a binary version of y_task_causal where 1 is treatment and 0 is control.
 - y_geno_causal: Binary array indicating whether a mouse is of the $Clock\Delta19$ or wildtype genotype

In [5]:
causal_mask = np.logical_or(y_task=="OF",y_task=="TS")
X_causal = X_psd[causal_mask]
y_mouse_causal = y_mouse[causal_mask]
y_task_causal = y_task[causal_mask]
y_treatment_causal = y_task_causal == "TS"
y_geno_causal = y_geno[causal_mask]
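
The masking and binarization steps above can be sketched on toy labels. The arrays below are hypothetical stand-ins (the task code `"HC"` for homecage and the mouse names are assumptions, not values taken from the dataset); the sketch also checks that every mouse contributes windows from both conditions.

```python
import numpy as np

# Toy stand-ins for the per-window labels (hypothetical values, not the real data)
y_task_toy = np.array(["HC", "OF", "TS", "OF", "HC", "TS"])
y_mouse_toy = np.array(["m1", "m1", "m1", "m2", "m2", "m2"])

# Keep only open-field and tail-suspension windows
mask_toy = np.isin(y_task_toy, ["OF", "TS"])
task_kept = y_task_toy[mask_toy]
mouse_kept = y_mouse_toy[mask_toy]

# Binary treatment: 1 for tail suspension, 0 for open field
treatment_toy = (task_kept == "TS").astype(int)

# Check that each mouse has windows from both conditions
both_conditions = {
    str(m): set(task_kept[mouse_kept == m].tolist()) == {"OF", "TS"}
    for m in np.unique(mouse_kept)
}

print(treatment_toy)
print(both_conditions)
```

Running the same check on the real `y_mouse_causal` and `y_task_causal` arrays would be one way to confirm that all mice appear in both conditions.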

### Generating M

Here we generate our ground truth mediator. We fit PCA to the open-field and tail-suspension samples (X_causal), extract the principal component scores, and add a unit bias to the samples associated with the tail-suspension test.

$$ s \sim PCA(X) $$
$$ m = s + \mathbb{1}_{T=1}$$


In [6]:
from sklearn.decomposition import PCA

model = PCA(n_components=N_COMPONENTS)
model.fit(X_causal)
s = model.transform(X_causal)
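
One way to check whether principal component scores vary with treatment is to compare the group means of each component. The sketch below does this on synthetic data only: the sizes, the injected shift, and every `_toy` name are assumptions made for illustration, and the standardized mean difference is just one possible summary.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for X_causal: 200 windows x 20 features (hypothetical sizes)
n_toy, d_toy, k_toy = 200, 20, 5
T_toy = rng.integers(0, 2, size=n_toy).astype(bool)
X_toy = rng.normal(size=(n_toy, d_toy))
X_toy[T_toy, :3] += 2.0  # shift a few features in the treated group

pca_toy = PCA(n_components=k_toy)
s_toy = pca_toy.fit_transform(X_toy)

# Standardized mean difference of each component's scores between the two groups
diff_toy = s_toy[T_toy].mean(axis=0) - s_toy[~T_toy].mean(axis=0)
smd_toy = diff_toy / s_toy.std(axis=0)

print("explained variance ratio:", np.round(pca_toy.explained_variance_ratio_, 3))
print("standardized mean difference:", np.round(smd_toy, 2))
```

Applied to the real data, the same comparison would use `s` and `y_treatment_causal` in place of the toy arrays. The sign of each difference is arbitrary because PCA components are only defined up to sign.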

### Generating y

Here we generate our outcomes using the formula noted above.

$$ \alpha \sim Uniform(-2,2)$$
$$ \beta_j \sim Uniform(-2,2), \quad j = 1,\dots,30$$
$$ \phi \sim Uniform(-2,2)$$
$$ \epsilon_{y} \sim Normal(0,0.1^2)$$

$$ y = \alpha T + \beta^T m + \phi \mathcal{y}_{geno} + \epsilon_{y} $$

In [7]:
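# Coefficients for the treatment, the mediator components, and the genotype confounder
alpha = np.random.uniform(-2,2,size=1)
beta = np.random.uniform(-2,2,size=(N_COMPONENTS,1))
phi = np.random.uniform(-2,2,size=1)
# Gaussian outcome noise with standard deviation 0.1, one draw per sample
epsilon = np.random.normal(0,0.1,size=(X_causal.shape[0])).reshape(-1,1)

# Note: the mediator term uses the raw PCA scores s; the unit treatment bias in m = s + 1_{T=1} is not added here
y = alpha*y_treatment_causal.reshape(-1,1) + s @ beta + phi*y_geno_causal.reshape(-1,1) + epsilon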
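
As a sanity check on the outcome formula, the sketch below simulates $y = \alpha T + \beta^T m + \phi \mathcal{y}_{geno} + \epsilon_{y}$ on fully synthetic inputs and recovers the coefficients by ordinary least squares. All sizes and `_toy` names are illustrative assumptions; the mediator here is Gaussian noise plus a unit treatment bias, not PCA scores from the recordings.

```python
import numpy as np

rng = np.random.default_rng(0)
n_toy, k_toy = 500, 3  # hypothetical sample count and mediator dimension

T_toy = rng.integers(0, 2, size=(n_toy, 1)).astype(float)     # binary treatment
geno_toy = rng.integers(0, 2, size=(n_toy, 1)).astype(float)  # binary confounder
m_toy = rng.normal(size=(n_toy, k_toy)) + T_toy               # mediator with unit treatment bias

alpha_toy = rng.uniform(-2, 2, size=1)
beta_toy = rng.uniform(-2, 2, size=(k_toy, 1))
phi_toy = rng.uniform(-2, 2, size=1)
eps_toy = rng.normal(0, 0.1, size=(n_toy, 1))

y_toy = alpha_toy * T_toy + m_toy @ beta_toy + phi_toy * geno_toy + eps_toy

# Regress y on [T, m, geno, intercept] and compare with the generating coefficients
design = np.hstack([T_toy, m_toy, geno_toy, np.ones((n_toy, 1))])
coef, *_ = np.linalg.lstsq(design, y_toy, rcond=None)

alpha_hat = coef[0, 0]
beta_hat = coef[1:1 + k_toy, 0]
phi_hat = coef[1 + k_toy, 0]

print("alpha error:", alpha_hat - alpha_toy[0])
print("beta error:", beta_hat - beta_toy[:, 0])
print("phi error:", phi_hat - phi_toy[0])
```

With noise this small, the fitted coefficients should sit close to the generating ones, which gives a simple baseline before trying a mediation method on the saved dataset.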

### Saving the Data

In [8]:
data_dict = {
    
    #Features: power spectral densities (X in the text above)
    "M": X_causal, 
    
    #Mediator (PCA scores s)
    "Z": s,
    
    #Outcomes
    "Y": y,
    
    #Treatment
    "T": y_treatment_causal,
    
    #Confounder
    "geno":y_geno_causal,
    
    #Mouse Identity
    "mouse_id": y_mouse_causal,
    
    #Feature labels for each feature in X
    "featureNames":labels['powerFeatures'],
}

with open("./mediation_tst.pkl","wb") as f:
    pickle.dump(data_dict,f,protocol=4)
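
A saved file like this can be read back and checked for consistent shapes before use. The sketch below is a round trip on a small stand-in dictionary written to a temporary directory; the values and the file name `toy_mediation.pkl` are hypothetical, and the check that `featureNames` has one entry per column of `"M"` is an assumption about how the features are laid out.

```python
import os
import pickle
import tempfile

import numpy as np

# Small stand-in dictionary with the same keys as data_dict (values are hypothetical)
toy = {
    "M": np.zeros((4, 6)),
    "Z": np.zeros((4, 2)),
    "Y": np.zeros((4, 1)),
    "T": np.array([False, True, False, True]),
    "geno": np.array([0, 0, 1, 1]),
    "mouse_id": np.array(["m1", "m1", "m2", "m2"]),
    "featureNames": [f"feat_{i}" for i in range(6)],
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_mediation.pkl")
    with open(path, "wb") as f:
        pickle.dump(toy, f, protocol=4)
    with open(path, "rb") as f:
        loaded = pickle.load(f)

# Every per-sample entry should share the same first dimension
n_samples = loaded["Y"].shape[0]
per_sample_keys = ["M", "Z", "Y", "T", "geno", "mouse_id"]
shapes_ok = all(len(loaded[key]) == n_samples for key in per_sample_keys)

# Assumed layout: one feature name per column of the feature matrix
features_ok = len(loaded["featureNames"]) == loaded["M"].shape[1]

print(shapes_ok, features_ok)
```

The same two checks could be run on the dictionary loaded from `./mediation_tst.pkl`.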