Here we prepare tables for imputation.
We need the following tables:

1. observed expression (indiv x gene) for father and mother
2. predicted expression (indiv x gene) for each of the two haplotypes
3. covariates (indiv x covariate)

That's it!
One thing to keep in mind is that we need to make sure that the individuals and genes have the same ID annotation across tables.
In the Framingham data, I use SampleID for individuals and Ensembl IDs for genes.
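
The consistency requirement above can be checked before any imputation step. The following is a minimal sketch on toy tables; the helper names `shared_ids` and `shared_genes` and the toy data are hypothetical and not part of the actual pipeline. It assumes each table has a `SampleID` column and one column per gene.

```python
import pandas as pd

def shared_ids(tables, id_col='SampleID'):
    """Return the sorted individual IDs present in every table (compared as strings)."""
    id_sets = [set(df[id_col].astype(str)) for df in tables]
    return sorted(set.intersection(*id_sets))

def shared_genes(tables, id_col='SampleID'):
    """Return the sorted gene columns present in every table."""
    gene_sets = [set(df.columns) - {id_col} for df in tables]
    return sorted(set.intersection(*gene_sets))

# toy data: observed and predicted expression with partially overlapping IDs
obs = pd.DataFrame({
    'SampleID': ['1', '2', '3'],
    'ENSG01': [0.1, 0.2, 0.3],
    'ENSG02': [1.0, 1.1, 1.2],
})
pred = pd.DataFrame({
    'SampleID': [2, 3, 4],  # integer IDs on purpose: they are cast to str before comparing
    'ENSG02': [0.5, 0.6, 0.7],
    'ENSG03': [0.0, 0.1, 0.2],
})

common_indiv = shared_ids([obs, pred])
common_genes = shared_genes([obs, pred])
```

Casting IDs to strings before comparing mirrors the `astype(str)` calls used on `SampleID` later in this notebook.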

We don't format all the data in this notebook.
The actual formatting will happen on the fly.
This notebook is for experimenting with ideas.

What I do format here:

1. the covariate matrix
2. the observed expression, annotated with Ensembl IDs

In [1]:
import gzip
import os

import pandas as pd

In [2]:
# individual to work with
active_individual_list_out = '/lambda_stor/data/yanyul/Framingham/haplotype_po_framingham/preprocess/all_indiv_w_genotype.txt'
pedigree_out = '/lambda_stor/data/yanyul/Framingham/haplotype_po_framingham/preprocess/extracted_pedigree.tsv.gz'

# gene link
gene_list_out = '/lambda_stor/data/yanyul/Framingham/haplotype_po_framingham/preprocess/microarray_gene_annotation.tsv'

# observed expression
expression_full_out = '/lambda_stor/data/yanyul/Framingham/haplotype_po_framingham/preprocess/expression.all_indiv_w_genotype.tsv.gz'

# predicted expression
pred_expr_out = [
    '/lambda_stor/data/yanyul/Framingham/haplotype_po_framingham/pred_expr/gtex_v8_Whole_Blood_en.pred_expr.txt.gz',
    '/lambda_stor/data/yanyul/Framingham/haplotype_po_framingham/pred_expr/gtex_v8_Whole_Blood_dapgw.pred_expr.txt.gz'
]

# covariates
peer_value_out = '/lambda_stor/data/yanyul/Framingham/haplotype_po_framingham/peer/peer_on_obs_expr_all_indiv_w_genotype/X.csv'
peer_indiv_list = '/lambda_stor/data/yanyul/Framingham/haplotype_po_framingham/preprocess/expression.all_indiv_w_genotype.tsv.gz'
pca_out = '/lambda_stor/data/yanyul/Framingham/haplotype_po_framingham/pca/pca_on_all_indiv_w_genotype.eigenvec'


In [3]:
covar_out = '/lambda_stor/data/yanyul/Framingham/haplotype_po_framingham/prepare_for_imputation/covariate.tsv.gz'
obs_expr_out = '/lambda_stor/data/yanyul/Framingham/haplotype_po_framingham/prepare_for_imputation/obs_expr.tsv.gz'

# covariates

In [4]:
# load peer factors
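def read_peer_indiv_list(filename):
    # individual IDs are the header of the expression table, minus the first column
    with gzip.open(filename, 'rt') as f:
        first_l = f.readline()
    first_l = first_l.strip().split('\t')
    return first_l[1:]

# X.csv has no header; its rows are assumed to follow the order of the expression header
df_peer = pd.read_csv(peer_value_out, header=None)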
df_peer.columns = [ f'PeerFactor{i}' for i in range(df_peer.shape[1]) ]
df_peer_list = pd.DataFrame({'SampleID': read_peer_indiv_list(peer_indiv_list)})
df_peer = pd.concat((df_peer_list, df_peer), axis=1)
df_peer['SampleID'] = df_peer['SampleID'].astype(str)
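
Because the PEER factors are attached to sample IDs purely by position, a length mismatch between the two would go unnoticed or produce rows with missing values. Below is a minimal sketch of a guarded version; `attach_sample_ids` is a hypothetical helper and the data are toy values. It still assumes the row order of the factor matrix matches the ID list, which cannot be verified from the factor matrix alone.

```python
import pandas as pd

def attach_sample_ids(factors, sample_ids):
    """Prepend a SampleID column to a factor matrix, checking that lengths agree."""
    if len(sample_ids) != factors.shape[0]:
        raise ValueError(
            f'{len(sample_ids)} sample IDs but {factors.shape[0]} factor rows'
        )
    out = factors.copy()
    out.insert(0, 'SampleID', [str(s) for s in sample_ids])
    return out

# toy factor matrix: 2 individuals x 2 factors
factors = pd.DataFrame(
    [[0.1, 0.2], [0.3, 0.4]],
    columns=['PeerFactor0', 'PeerFactor1'],
)
labelled = attach_sample_ids(factors, [1001, 1002])
```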

In [5]:
# load PCA eigenvectors (column 0 is dropped; the next column is used as SampleID)
df_pca = pd.read_csv(pca_out, sep=' ', header=None)
del df_pca[0]
df_pca.columns = [ 'SampleID' ] + [ f'PCA{i}' for i in range(df_pca.shape[1] - 1) ]
df_pca['SampleID'] = df_pca['SampleID'].astype(str)

In [6]:
# merge covariates and save
df_covar = pd.merge(df_peer, df_pca, on='SampleID')
if not os.path.exists(covar_out):
    df_covar.to_csv(covar_out, compression='gzip', sep='\t', index=False)
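
`pd.merge` defaults to an inner join, so individuals missing from either covariate table are dropped silently. A minimal sketch of how one might report them, using toy tables (`df_peer_toy` and `df_pca_toy` are hypothetical stand-ins for the real `df_peer` and `df_pca`):

```python
import pandas as pd

df_peer_toy = pd.DataFrame({'SampleID': ['1', '2', '3'], 'PeerFactor0': [0.1, 0.2, 0.3]})
df_pca_toy = pd.DataFrame({'SampleID': ['2', '3', '4'], 'PCA0': [1.0, 2.0, 3.0]})

# outer merge with indicator=True adds a `_merge` column telling where each row came from
check = pd.merge(df_peer_toy, df_pca_toy, on='SampleID', how='outer', indicator=True)

only_peer = check.loc[check['_merge'] == 'left_only', 'SampleID'].tolist()
only_pca = check.loc[check['_merge'] == 'right_only', 'SampleID'].tolist()
n_kept = int((check['_merge'] == 'both').sum())

print(f'kept {n_kept}; PEER only: {only_peer}; PCA only: {only_pca}')
```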

# observed expression

In [7]:
# load the observed expression
obs_expr = pd.read_csv(expression_full_out, compression='gzip', sep='\t')
obs_expr_indivs = obs_expr.columns.tolist()[1:]

In [8]:
# load the probeset-to-gene map and drop probesets without an Ensembl ID
gene_map = pd.read_csv(gene_list_out, sep='\t')
gene_map = gene_map[~gene_map['ENSEMBL'].isna()]

In [9]:
# annotate observed expression with Ensembl IDs
obs_expr_annot = pd.merge(obs_expr, gene_map, on='probeset_id', how='inner')

In [10]:
# prepare indiv x gene matrix
obs_expr_n_x_g = obs_expr_annot[obs_expr_indivs].T.reset_index(drop=True)
obs_expr_n_x_g.columns = obs_expr_annot['ENSEMBL'].tolist()
obs_expr_n_x_g = pd.concat((pd.DataFrame({'SampleID': obs_expr_indivs}), obs_expr_n_x_g), axis=1)
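
To make the reshaping step concrete, here is a minimal sketch on a toy annotated table (the values and IDs are made up). It uses `set_index` plus a transpose, which is one alternative way to get the same indiv x gene layout, not necessarily the form used in the pipeline.

```python
import pandas as pd

# toy annotated expression: one row per probeset, one column per individual
annot = pd.DataFrame({
    'probeset_id': [10, 11],
    'ENSEMBL': ['ENSG01', 'ENSG02'],
    '1001': [5.0, 6.0],
    '1002': [7.0, 8.0],
})
indivs = ['1001', '1002']

# index rows by gene, keep only the individual columns, then transpose
n_x_g = annot.set_index('ENSEMBL')[indivs].T
n_x_g.index.name = 'SampleID'   # the transposed index holds the individual IDs
n_x_g.columns.name = None       # drop the leftover 'ENSEMBL' axis label
n_x_g = n_x_g.reset_index()     # SampleID becomes the first column
```

Note that if several probesets map to the same Ensembl ID, either approach yields duplicated gene columns, which may need to be resolved downstream.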

In [11]:
if not os.path.exists(obs_expr_out):
    obs_expr_n_x_g.to_csv(obs_expr_out, compression='gzip', index=False, sep='\t')