# Description

This notebook will show the structure of the main data matrices in PhenoPLIER, and will guide you in analyzing gene associations for a particular trait: basophil percentage, which is presented in the [manuscript](https://greenelab.github.io/phenoplier_manuscript/#phenoplier-an-integration-framework-based-on-gene-co-expression-patterns) in Figure 1c.

# Modules

In [1]:
import tempfile

import numpy as np
from scipy import stats
import pandas as pd

from entity import Trait, Gene
import conf

# Load gene module-gene membership matrix (matrix Z)

Here we load the gene module-gene membership matrix, or "latent variables loadings matrix" (from the terminology of the [MultiPLIER article](https://doi.org/10.1016/j.cels.2019.04.003)).

In [2]:
matrix_z = pd.read_pickle(conf.MULTIPLIER["MODEL_Z_MATRIX_FILE"])

In [3]:
matrix_z.shape

(6750, 987)

In [4]:
matrix_z.head()

Unnamed: 0,LV1,LV2,LV3,LV4,LV5,LV6,LV7,LV8,LV9,LV10,...,LV978,LV979,LV980,LV981,LV982,LV983,LV984,LV985,LV986,LV987
GAS6,0.0,0.0,0.039438,0.0,0.050476,0.0,0.0,0.0,0.590949,0.0,...,0.050125,0.0,0.033407,0.0,0.0,0.005963,0.347362,0.0,0.0,0.0
MMP14,0.0,0.0,0.0,0.0,0.070072,0.0,0.0,0.004904,1.720179,2.423595,...,0.0,0.0,0.001007,0.0,0.035747,0.0,0.0,0.0,0.014978,0.0
DSP,0.0,0.0,0.0,0.0,0.0,0.041697,0.0,0.005718,0.0,0.0,...,0.020853,0.0,0.0,0.0,0.0,0.005774,0.0,0.0,0.0,0.416405
MARCKSL1,0.305212,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.161843,0.149471,...,0.027134,0.05272,0.0,0.030189,0.060884,0.0,0.0,0.0,0.0,0.44848
SPARC,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.014014,...,0.0,0.0,0.0,0.0,0.0,0.0,0.067779,0.0,0.122417,0.062665


As you can see, this matrix Z contains the membership value for each gene across all LVs (or gene modules).
A value of zero means that the gene does not belong to that LV, whereas a larger value represents how strongly that gene belongs to the LV.
A group of genes that belong to the same LV represents a gene set with a similar expression profile across a set of tissues or cell types.
We'll cover this in more detail in the next notebook (`02-LV_cell_types-...`).
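As a minimal sketch of how to read these membership values, the snippet below builds a small toy matrix (the gene names and weights are made up, not real loadings) and lists the genes that belong to one LV, strongest first. The helper `genes_in_lv` is hypothetical and only for illustration; it is not part of PhenoPLIER.

```python
import pandas as pd

# Toy stand-in for matrix Z (genes x LVs); values are illustrative only.
toy_z = pd.DataFrame(
    {
        "LV1": [0.0, 2.4, 0.0, 0.3],
        "LV2": [0.6, 0.0, 0.0, 1.7],
    },
    index=["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
)


def genes_in_lv(z, lv_name):
    """Return the genes with a nonzero weight in `lv_name`, strongest first."""
    weights = z[lv_name]
    return weights[weights > 0].sort_values(ascending=False)


lv2_genes = genes_in_lv(toy_z, "LV2")
print(lv2_genes)
```

The same function should work on the real `matrix_z`, since it only relies on genes being rows and LVs being columns.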

# Load information about LV alignment with pathways

An LV in matrix Z can represent a group of genes that aligns well with prior pathways (or prior knowledge), or it can be "novel" in the sense that the combination of genes does not represent a known unit but was found by PLIER when factorizing the recount2 data (see the MultiPLIER article for more details).

Here we load that information, where for each LV and pathway, we have a p-value and area under the curve (AUC) that indicate how well the LV aligns to that pathway.
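To make this concrete, here is a small sketch that separates LVs with at least one significant pathway from the rest. It uses a toy table with the same column names as the summary file; the pathway names and numbers are made up. Treating LVs without a significant pathway as "novel" is a simplification assumed here, and the `0.05` FDR threshold is simply the cutoff this notebook applies when filtering pathways for LV603.

```python
import pandas as pd

# Toy stand-in for lv_metadata (same column names, made-up values).
toy_metadata = pd.DataFrame(
    {
        "pathway": ["PATHWAY_A", "PATHWAY_B", "PATHWAY_C", "PATHWAY_D"],
        "LV index": ["1", "1", "2", "3"],
        "AUC": [0.73, 0.55, 0.91, 0.60],
        "p-value": [4.8e-05, 0.31, 8.4e-38, 0.02],
        "FDR": [5.8e-04, 0.54, 4.5e-35, 0.08],
    }
)

FDR_THRESHOLD = 0.05  # assumed cutoff for calling an LV-pathway alignment significant

# LVs with at least one pathway passing the FDR threshold
significant = toy_metadata[toy_metadata["FDR"] < FDR_THRESHOLD]
aligned_lvs = set(significant["LV index"])

# LVs listed in the table with no significant pathway (simplified notion of "novel")
novel_lvs = set(toy_metadata["LV index"]) - aligned_lvs

print("aligned:", sorted(aligned_lvs))
print("no significant pathway:", sorted(novel_lvs))
```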

In [5]:
lv_metadata = pd.read_pickle(conf.MULTIPLIER["MODEL_SUMMARY_FILE"])

In [6]:
lv_metadata.shape

(2157, 5)

In [7]:
lv_metadata.head()

Unnamed: 0,pathway,LV index,AUC,p-value,FDR
1,KEGG_LYSINE_DEGRADATION,1,0.388059,0.866078,0.956005
2,REACTOME_MRNA_SPLICING,1,0.733057,4.8e-05,0.000582
3,MIPS_NOP56P_ASSOCIATED_PRE_RRNA_COMPLEX,1,0.680555,0.001628,0.011366
4,KEGG_DNA_REPLICATION,1,0.549473,0.312155,0.539951
5,PID_MYC_ACTIVPATHWAY,1,0.639303,0.021702,0.083739


# Load gene associations from PhenomeXcan

Now we load gene-trait association from [PhenomeXcan](https://doi.org/10.1126/sciadv.aba2083).
PhenomeXcan provides TWAS results (using [Summary-MultiXcan](https://doi.org/10.1371/journal.pgen.1007889) and [Summary-PrediXcan](https://doi.org/10.1038/s41467-018-03621-1)) across ~4,000 traits.
If you are interested in PhenomeXcan, you can also check out the [GitHub repo](https://github.com/hakyimlab/phenomexcan) to learn how to download the results.

For this demo, we'll load a file that contains Summary-MultiXcan (or S-MultiXcan) results for basophil percentage.
This file contains a list of p-values for ~22k genes, where a significant p-value means that the gene's predicted expression (across different tissues) is associated with basophil percentage.
In the notebook I refer to these results generically as "TWAS results", meaning that we have gene-trait associations.
All these TWAS results were derived solely from GWAS summary stats, so you can also generate yours relatively easily by using [S-MultiXcan](https://doi.org/10.1371/journal.pgen.1007889).

In [8]:
%%bash
# download S-MultiXcan results for basophil percentage
wget https://uchicago.box.com/shared/static/g70nq1c6wjvado242t9yg05jrhvdykrv.gz -O /tmp/smultixcan_30220_raw_ccn30.tsv.gz

--2022-11-14 22:11:04--  https://uchicago.box.com/shared/static/g70nq1c6wjvado242t9yg05jrhvdykrv.gz
Resolving uchicago.box.com (uchicago.box.com)... 74.112.186.144
Connecting to uchicago.box.com (uchicago.box.com)|74.112.186.144|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /public/static/g70nq1c6wjvado242t9yg05jrhvdykrv.gz [following]
--2022-11-14 22:11:04--  https://uchicago.box.com/public/static/g70nq1c6wjvado242t9yg05jrhvdykrv.gz
Reusing existing connection to uchicago.box.com:443.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://uchicago.app.box.com/public/static/g70nq1c6wjvado242t9yg05jrhvdykrv.gz [following]
--2022-11-14 22:11:04--  https://uchicago.app.box.com/public/static/g70nq1c6wjvado242t9yg05jrhvdykrv.gz
Resolving uchicago.app.box.com (uchicago.app.box.com)... 74.112.186.144
Connecting to uchicago.app.box.com (uchicago.app.box.com)|74.112.186.144|:443... connected.
HTTP request sent, awaiting respo

In [9]:
df = pd.read_csv("/tmp/smultixcan_30220_raw_ccn30.tsv.gz", sep="\t")

In [10]:
df.shape

(22258, 18)

In [11]:
df.head()

Unnamed: 0,gene,gene_name,pvalue,n,n_indep,p_i_best,t_i_best,p_i_worst,t_i_worst,eigen_max,eigen_min,eigen_min_kept,z_min,z_max,z_mean,z_sd,tmi,status
0,ENSG00000143669.13,LYST,3.43009e-78,40.0,5.0,5.887514e-67,Cells_Cultured_fibroblasts,0.5715454,Ovary,25.575879,5.685765e-16,1.091354,-17.287064,16.985176,-6.255059,11.905653,5.0,0
1,ENSG00000140575.12,IQGAP1,5.312060999999999e-56,49.0,2.0,1.792379e-55,Cells_EBV-transformed_lymphocytes,2.034382e-18,Brain_Frontal_Cortex_BA9,43.159657,1.197089e-15,3.913487,-15.689222,-8.755368,-11.930501,2.305056,2.0,0
2,ENSG00000239839.6,DEFA3,6.178319e-52,25.0,8.0,1.73551e-30,Brain_Frontal_Cortex_BA9,0.4821369,Heart_Atrial_Appendage,13.021527,1.362063e-17,0.438362,-11.476299,3.676225,-6.196329,4.534728,8.0,0
3,ENSG00000140577.15,CRTC3,3.63354e-46,47.0,5.0,3.326635e-48,Skin_Not_Sun_Exposed_Suprapubic,0.2188896,Artery_Tibial,25.991131,1.853057e-15,1.18336,-7.376543,14.588451,7.294836,5.29438,5.0,0
4,ENSG00000107929.14,LARP4B,5.229144e-43,35.0,4.0,4.209043e-45,Brain_Frontal_Cortex_BA9,0.3858258,Brain_Cerebellar_Hemisphere,24.531853,5.407442e-16,2.055242,-13.906881,14.092741,6.933903,8.977076,4.0,0


# Take a look at genes associated with basophil percentage

Show the sample size for this trait.

In [12]:
trait_code = "30220_raw-Basophill_percentage"
t = Trait.get_trait(full_code=trait_code)
display(f"{trait_code} - sample size: {t.n}")

'30220_raw-Basophill_percentage - sample size: 349861'

Below I list the top associated genes for basophil percentage.

In [13]:
traits_df = df[["gene_name", "pvalue"]].dropna().set_index("gene_name")

# remove duplicated gene names
traits_df = traits_df.loc[~traits_df.index.duplicated()]

In [14]:
traits_df.shape

(22248, 1)

In [15]:
traits_df.head()

Unnamed: 0_level_0,pvalue
gene_name,Unnamed: 1_level_1
LYST,3.43009e-78
IQGAP1,5.312060999999999e-56
DEFA3,6.178319e-52
CRTC3,3.63354e-46
LARP4B,5.229144e-43


Here I quickly show the data summary for this trait's gene associations:

In [16]:
traits_df["pvalue"].apply(lambda x: -np.log10(x)).describe()

count    22248.000000
mean         0.695257
std          1.556005
min          0.000120
25%          0.166637
50%          0.393855
75%          0.815005
max         77.464694
Name: pvalue, dtype: float64

# Get a set of common genes between TWAS and LVs

In [17]:
common_genes = traits_df.index.intersection(matrix_z.index)

In [18]:
common_genes

Index(['IQGAP1', 'CCR1', 'CCR3', 'PRTN3', 'FCGR3B', 'RASGRP4', 'ZYX', 'RYR1',
       'EPHA1', 'MYB',
       ...
       'NPTX2', 'RNASE2', 'WDTC1', 'NLK', 'GADD45B', 'ARHGAP22', 'POLR1E',
       'ATP6V1D', 'NEDD4L', 'CACNG6'],
      dtype='object', length=6430)

In [19]:
# keep only the genes in common
traits_df = traits_df.loc[common_genes]

In [20]:
traits_df.shape

(6430, 1)

In [21]:
matrix_z = matrix_z.loc[common_genes]

In [22]:
matrix_z.shape

(6430, 987)
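The alignment step above matters because later cells compare gene p-values and LV weights position by position. As a minimal sketch with toy data (gene names and values are made up), both tables are restricted to the shared genes and put in the same row order:

```python
import pandas as pd

# Toy TWAS results and toy matrix Z, with partially overlapping genes
toy_twas = pd.DataFrame(
    {"pvalue": [1e-8, 0.2, 0.5]}, index=["GENE_A", "GENE_B", "GENE_X"]
)
toy_z = pd.DataFrame(
    {"LV1": [0.0, 1.2, 0.4]}, index=["GENE_B", "GENE_A", "GENE_Y"]
)

# Genes present in both tables
common = toy_twas.index.intersection(toy_z.index)

# Selecting with the same index gives both tables the same genes in the same order
twas_aligned = toy_twas.loc[common]
z_aligned = toy_z.loc[common]

print(twas_aligned.join(z_aligned))
```

After this step, converting either table to a NumPy array keeps row `i` referring to the same gene in both.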

# Analysis of a neutrophil-termed LV

Let's take as an example an LV that was previously analyzed in the MultiPLIER study, which we identify as `LV603`. This LV aligns well with pathways related to neutrophils, as you can see below. In the next notebook (`02-LV_cell_types...`) we'll see that this LV is expressed in neutrophils and other granulocytes.

In [23]:
lv_metadata[
    (lv_metadata["LV index"] == "603") & (lv_metadata["FDR"] < 0.05)
].sort_values("FDR")

Unnamed: 0,pathway,LV index,AUC,p-value,FDR
1511,IRIS_Neutrophil-Resting,603,0.905751,8.355935999999999e-38,4.505939e-35
1512,SVM Neutrophils,603,0.979789,2.856571e-11,1.432936e-09
1513,PID_IL8CXCR2_PATHWAY,603,0.810732,0.0008814671,0.007041943
1516,SIG_PIP3_SIGNALING_IN_B_LYMPHOCYTES,603,0.769292,0.003387907,0.01948724


Let's see which genes more strongly belong to LV603 (the numbers are the gene weights in this LV):

In [24]:
lv603_top_genes = matrix_z["LV603"].sort_values(ascending=False)
display(lv603_top_genes.head(20))

CXCR2        5.320459
FCGR3B       5.128372
TNFRSF10C    5.035457
VNN2         4.680865
ZDHHC18      4.495976
MNDA         4.488505
CXCR1        4.442062
P2RY13       4.404405
VNN3         4.253184
FPR2         4.187560
CEACAM3      4.139476
C5AR1        4.101986
SLC45A4      4.068913
AQP9         3.939923
CCR3         3.883533
ABTB1        3.745259
CSF3R        3.735651
FPR1         3.720018
DPEP2        3.656125
SIRPB1       3.632251
Name: LV603, dtype: float64

Are these top genes associated with our trait of interest?

In [25]:
traits_df.loc[lv603_top_genes.index].head(20)

Unnamed: 0,pvalue
CXCR2,0.0128164
FCGR3B,3.297587e-32
TNFRSF10C,1.655229e-07
VNN2,0.9217389
ZDHHC18,0.08370749
MNDA,0.4876452
CXCR1,0.01315804
P2RY13,0.02734939
VNN3,0.4231949
FPR2,0.8199449


It seems so. But what about the rest of the genes? They might also be strongly associated.
Let's take a random sample:

In [26]:
traits_df.sample(n=20, random_state=0)

Unnamed: 0,pvalue
GPR171,0.170912
SGCA,0.070701
P4HA2,0.324261
ANAPC7,0.032241
CYP11A1,0.215818
NUMB,0.126148
IGF1,0.603579
C10orf10,0.018192
CISH,0.558913
NDUFB9,0.574156


They do not seem as significant as those within the top genes in LV603.

If we compute the correlation between LV603 gene weights (`lv603_top_genes`) and gene associations for basophil percentage (`traits_df`) we get this:

In [27]:
lv603_top_genes

CXCR2        5.320459
FCGR3B       5.128372
TNFRSF10C    5.035457
VNN2         4.680865
ZDHHC18      4.495976
               ...   
GTF2H2       0.000000
TRIM28       0.000000
DPYD         0.000000
ASPM         0.000000
PDE2A        0.000000
Name: LV603, Length: 6430, dtype: float64

In [28]:
stats.pearsonr(
    traits_df["pvalue"]
    .apply(lambda x: -np.log10(x))
    .loc[lv603_top_genes.index]
    .to_numpy(),
    lv603_top_genes.to_numpy(),
)

(0.13428177436972902, 2.935462151415347e-27)

Although the correlation is significant (p-value of `2.94e-27`) and the slope is positive (we are interested only in genes at the top of the LV), this simple test treats genes as independent observations. We need to account for correlated predicted expression from the TWAS models: for example, if the predicted expression of two genes at the top of the LV is correlated, that would invalidate our test.
We provide a class, `GLSPhenoplier` (implemented in Python), that computes this. We also provide a command-line tool, `gls_cli.py`, that performs several preprocessing steps, and below we show how to use it for our example.
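To give an intuition for how gene correlations can enter such a test, below is a minimal generalized least squares (GLS) sketch in NumPy. It is not the `GLSPhenoplier` implementation: the function `gls_fit`, the toy data, and the banded correlation matrix are all assumptions made for illustration. The idea is to regress the gene associations on the LV weights while weighting by the inverse of the gene correlation matrix.

```python
import numpy as np


def gls_fit(y, X, sigma):
    """Generalized least squares: beta = (X' S^-1 X)^-1 X' S^-1 y.

    Returns the coefficient estimates and their standard errors.
    """
    sigma_inv_X = np.linalg.solve(sigma, X)
    sigma_inv_y = np.linalg.solve(sigma, y)
    xtsx = X.T @ sigma_inv_X
    beta = np.linalg.solve(xtsx, X.T @ sigma_inv_y)

    # Residual variance, estimated with the same weighting
    resid = y - X @ beta
    dof = len(y) - X.shape[1]
    scale = (resid @ np.linalg.solve(sigma, resid)) / dof
    beta_se = np.sqrt(np.diag(scale * np.linalg.inv(xtsx)))
    return beta, beta_se


rng = np.random.default_rng(0)
n_genes = 50
lv_weights = rng.random(n_genes)  # toy LV gene weights
y = 0.5 * lv_weights + rng.normal(scale=0.1, size=n_genes)  # toy -log10(p)
X = np.column_stack([np.ones(n_genes), lv_weights])  # intercept + LV weights

# Toy gene correlation matrix: each gene correlates (0.4) with its neighbor
sigma = np.eye(n_genes)
for i in range(n_genes - 1):
    sigma[i, i + 1] = sigma[i + 1, i] = 0.4

beta_gls, se_gls = gls_fit(y, X, sigma)
# With an identity matrix (no gene correlations), GLS reduces to ordinary least squares
beta_ols, se_ols = gls_fit(y, X, np.eye(n_genes))

print("GLS slope:", beta_gls[1], "SE:", se_gls[1])
print("OLS slope:", beta_ols[1], "SE:", se_ols[1])
```

The slope's standard error is what changes when correlations are modeled, which is why the resulting p-value can differ from the one given by a plain correlation test.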

# `gls_cli.py`: association between an LV and a trait

The `gls_cli.py` command-line tool needs as input the S-MultiXcan TWAS results and a gene correlation matrix (which is specific to the TWAS results, see below).

In [29]:
%%bash
# remove previously computed results (if they exist)
rm /tmp/gls_phenoplier-basophill_percentage.tsv.gz

# print full path of gls_cli.py tool
echo ${PHENOPLIER_CODE_DIR}/libs/gls_cli.py

# print full path of gene correlations file (which is trait-specific!)
COHORT_NAME="phenomexcan_rapid_gwas"
REFERENCE_PANEL="gtex_v8"
GENE_CORR_DIR="${PHENOPLIER_RESULTS_GLS}/gene_corrs/cohorts/${COHORT_NAME}/${REFERENCE_PANEL}/mashr/gene_corrs-symbols-within_distance_5mb.per_lv/"
echo ${GENE_CORR_DIR}

python ${PHENOPLIER_CODE_DIR}/libs/gls_cli.py \
  --input-file /tmp/smultixcan_30220_raw_ccn30.tsv.gz \
  --duplicated-genes-action keep-first \
  --gene-corr-file ${GENE_CORR_DIR} \
  --debug-use-sub-gene-corr \
  --covars gene_size gene_size_log gene_density gene_density_log \
  --output-file /tmp/gls_phenoplier-basophill_percentage.tsv.gz

rm: cannot remove '/tmp/gls_phenoplier-basophill_percentage.tsv.gz': No such file or directory


/opt/code/libs/gls_cli.py
/opt/data/results/gls/gene_corrs/cohorts/phenomexcan_rapid_gwas/gtex_v8/mashr/gene_corrs-symbols-within_distance_5mb.per_lv/


[2022-11-14 22:11:07,561] INFO: Reading input file /tmp/smultixcan_30220_raw_ccn30.tsv.gz
[2022-11-14 22:11:07,668] INFO: Input file has 22258 genes
[2022-11-14 22:11:07,677] INFO: Removed duplicated genes symbols using 'keep-first'. Data now has 22251 genes
[2022-11-14 22:11:07,679] INFO: p-values statistics: min=3.4e-78 | mean=4.3e-01 | max=1.0e+00 | # missing=3 (0.0%)
[2022-11-14 22:11:07,680] INFO: Using covariates: ['gene_density', 'gene_density_log', 'gene_size', 'gene_size_log']
[2022-11-14 22:11:07,930] INFO: NumExpr defaulting to 4 threads.
[2022-11-14 22:11:07,935] INFO: Replacing zero p-values by nonzero minimum divided by 10
[2022-11-14 22:11:07,939] INFO: Using -log10(pvalue)
[2022-11-14 22:11:07,940] INFO: Using gene correlation file: /opt/data/results/gls/gene_corrs/cohorts/phenomexcan_rapid_gwas/gtex_v8/mashr/gene_corrs-symbols-within_distance_5mb.per_lv/
[2022-11-14 22:11:07,966] INFO: 987 LVs (gene modules) were found in LV model
[2022-11-14 22:11:07,966] INFO: All LV

As you can see from the output, the tool first preprocesses the input TWAS file and then computes, for each LV in the model (987 in our case), the association between the LV's gene weights and the gene p-values from TWAS. The results are finally written to the specified output path.
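The preprocessing steps reported in the log above (removing duplicated gene symbols with `keep-first`, replacing zero p-values by the nonzero minimum divided by 10, and using `-log10(pvalue)`) can be sketched in pandas as follows. This is an illustrative approximation on toy data, not the actual code of `gls_cli.py`.

```python
import numpy as np
import pandas as pd

# Toy TWAS results with a duplicated gene symbol and a zero p-value
toy = pd.DataFrame(
    {
        "gene_name": ["GENE_A", "GENE_B", "GENE_A", "GENE_C"],
        "pvalue": [1e-4, 0.0, 0.5, 0.02],
    }
)

# 1) keep the first row for each duplicated gene symbol
dedup = toy.drop_duplicates(subset="gene_name", keep="first").set_index("gene_name")

# 2) replace zero p-values by the smallest nonzero p-value divided by 10
pvalues = dedup["pvalue"]
min_nonzero = pvalues[pvalues > 0].min()
pvalues = pvalues.mask(pvalues == 0, min_nonzero / 10)

# 3) transform to -log10(p), so larger values mean stronger associations
neg_log10_p = -np.log10(pvalues)

print(neg_log10_p)
```

Step 2 avoids infinite values in step 3, since `-log10(0)` is not finite.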

**IMPORTANT:** keep in mind that you have to use a gene correlation matrix that is specific to your TWAS results. This is because gene correlations depend on the variants present in the original GWAS used. Check out [the notebooks here](https://github.com/greenelab/phenoplier/tree/main/nbs/15_gsa_gls) to see how to compute a gene correlation matrix specific for your trait of interest.

# Load LV-trait results

In [30]:
lv_df = pd.read_csv("/tmp/gls_phenoplier-basophill_percentage.tsv.gz", sep="\t")

In [31]:
lv_df

Unnamed: 0,lv,beta,beta_se,t,pvalue_twosided,pvalue_onesided
0,LV603,0.988076,0.127500,7.749615,1.064641e-14,5.323207e-15
1,LV719,0.903596,0.130457,6.926399,4.736869e-12,2.368434e-12
2,LV68,0.859876,0.126886,6.776776,1.338216e-11,6.691079e-12
3,LV517,0.831304,0.125022,6.649252,3.188491e-11,1.594246e-11
4,LV599,0.728474,0.126189,5.772889,8.156289e-09,4.078144e-09
...,...,...,...,...,...,...
982,LV213,-0.194986,0.125889,-1.548863,1.214639e-01,9.392681e-01
983,LV322,-0.202912,0.125966,-1.610849,1.072618e-01,9.463691e-01
984,LV632,-0.203234,0.124819,-1.628234,1.035242e-01,9.482379e-01
985,LV776,-0.205448,0.125552,-1.636362,1.018129e-01,9.490936e-01


As you can see, LV603 is at the top of the LV associations for basophil percentage.
However, the one-sided p-value here (`5.32e-15`) is larger than the p-value from the simple correlation (`2.94e-27`), suggesting that we have correlated genes at the top of the LV.
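The two p-value columns in the table above appear to be related in the usual way for a symmetric test statistic: when `t` is positive, the one-sided p-value is half the two-sided one, and when `t` is negative, it is one minus that half. This relation is inferred from the rows shown, not from the `gls_cli.py` source, and the helper below is a hypothetical illustration:

```python
def one_sided_from_two_sided(t_stat, p_two_sided):
    """One-sided p-value for the alternative beta > 0, given a two-sided
    p-value from a symmetric test statistic."""
    half = p_two_sided / 2
    return half if t_stat > 0 else 1 - half


# Values taken from the table above (rows LV603 and LV213)
p_lv603 = one_sided_from_two_sided(7.749615, 1.064641e-14)
p_lv213 = one_sided_from_two_sided(-1.548863, 1.214639e-01)

print(p_lv603)
print(p_lv213)
```

A one-sided test fits this analysis because only positive slopes are of interest: genes with larger weights in the LV should be the ones more strongly associated with the trait.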

# Conclusions

Hopefully, you now have a clearer idea of the main data matrices involved in PhenoPLIER (matrix Z, PhenomeXcan gene-trait associations, etc.).
We also saw how to compute a p-value for the association between an LV (group of genes or gene module) and a trait of interest.
To do this with your own data, you need to compute the S-MultiXcan TWAS results (gene-based) from your GWAS summary stats and generate your own gene correlation matrix.

In the next notebook (`02-LV_cell_types-...`), we'll see how to check in which tissues or cell types the genes in LV603 are expressed.