# Analysis tools

Swan provides several analysis tools:
* [Differential gene expression](#deg)
* [Differential transcript expression](#det)
* [Isoform switching / differential isoform expression](#is)
* [Combining metadata columns](#multi_gb)
* [Exon skipping and intron retention](#es_ir)
* [More differential expression](#more_de)

<!-- Running this tutorial on my laptop took around 30 minutes and 3 GB of RAM. The longest steps by far are running the differential gene and transcript expression tools. The diffxpy tools are multithreaded, and my laptop has 8 cores. -->

In [1]:
import swan_vis as swan

sg = swan.read('data/swan.p')

Read in graph from data/swan.p


## <a name="deg"></a>Differential gene expression tests

Differential gene expression testing in Swan is implemented via [diffxpy](https://github.com/theislab/diffxpy). To run the test, choose a metadata column from `sg.adata.obs` to test on using the `obs_col` argument. If there are more than two unique values in this column, further specify the conditions to test using the `obs_conditions` argument.

The differential expression test that is run is [diffxpy's Wald test](https://diffxpy.readthedocs.io/en/latest/api/diffxpy.api.test.wald.html#diffxpy.api.test.wald), which checks if "a certain coefficient introduces a significant difference in the expression of a gene". This test is performed on the normalized TPM for each gene.

For individuals wanting to run a different diffxpy differential test, see [this section](#more_de).
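
To make the idea of a Wald test concrete, here is a minimal textbook sketch for a single coefficient. It is not diffxpy's implementation; it only assumes that the `coef_mle` and `coef_sd` columns of the summary table hold a coefficient estimate and its standard error, and the function name `wald_test` is hypothetical.

```python
import math

def wald_test(coef_mle, coef_sd):
    """Two-sided Wald test for one coefficient (textbook form).

    Returns (z, p): z is the estimate divided by its standard error,
    p is the two-sided p-value under a standard normal null.
    """
    if coef_sd <= 0:
        raise ValueError("coef_sd must be positive")
    z = coef_mle / coef_sd
    # For a standard normal Z, P(|Z| >= |z|) = erfc(|z| / sqrt(2))
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# made-up numbers, not taken from the Swan output
z, p = wald_test(coef_mle=1.5, coef_sd=0.5)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A large estimate relative to its standard error gives a small p-value, which is what the `pval` column summarizes before multiple-testing correction.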

In [2]:
obs_col = 'cell_line'
obs_conditions = ['hepg2', 'hffc6']

# perform a differential gene expression 
# Wald test on the provided metadata column and conditions
test = sg.de_gene_test(obs_col, obs_conditions=obs_conditions)

AnnData expects .obs.index to contain strings, but got values like:
    [0, 1, 2, 3, 4]

    Inferred to be: integer

  value_idx = self._prep_dim_index(value.index, attr)


training location model: False
training scale model: True
iter   0: ll=5261992.176001
iter   1: ll=5261992.176001, converged: 0.00% (loc: 100.00%, scale update: False), in 0.00sec
iter   2: ll=2753855.768885, converged: 75.84% (loc: 75.84%, scale update: True), in 66.56sec
iter   3: ll=2753855.768885, converged: 75.84% (loc: 100.00%, scale update: False), in 0.00sec
iter   4: ll=1889665.256868, converged: 92.98% (loc: 92.98%, scale update: True), in 17.23sec
iter   5: ll=1889665.256868, converged: 92.98% (loc: 100.00%, scale update: False), in 0.00sec
iter   6: ll=1886515.721880, converged: 98.49% (loc: 98.49%, scale update: True), in 7.07sec
iter   7: ll=1886515.721880, converged: 98.49% (loc: 100.00%, scale update: False), in 0.00sec
iter   8: ll=1885913.092184, converged: 99.63% (loc: 99.63%, scale update: True), in 3.85sec
iter   9: ll=1885913.092184, converged: 99.63% (loc: 100.00%, scale update: False), in 0.00sec
iter  10: ll=1885691.601695, converged: 99.90% (loc: 99.90%, scale


The output in `test` is a summary table for the differential expression test.

In [7]:
test.head(2)

Unnamed: 0,gid,pval,qval,log2fc,mean,zero_mean,grad,coef_mle,coef_sd,ll,gname
68444,ENSG00000137204.14,0.0,0.0,-297.776029,2.873145,False,1.328783e-08,-297.776029,2.222759e-162,-5.835233,SLC22A7
132616,ENSG00000186204.14,0.0,0.0,-297.776029,9.517887,False,2.018855e-07,-297.776029,2.222759e-162,-5.01634,CYP4F12


The results of this test are also stored in an automatically-generated key in `sg.adata.uns`, and will be saved to the SwanGraph if you save it. You can regenerate this key and access the summary table by running the following code:

In [8]:
# deg - differential gene expression
uns_key = swan.make_uns_key('deg', 
                            obs_col=obs_col, 
                            obs_conditions=obs_conditions)
test = sg.adata.uns[uns_key]
test.head(2)

Unnamed: 0,gid,pval,qval,log2fc,mean,zero_mean,grad,coef_mle,coef_sd,ll,gname
68444,ENSG00000137204.14,0.0,0.0,-297.776029,2.873145,False,1.328783e-08,-297.776029,2.222759e-162,-5.835233,SLC22A7
132616,ENSG00000186204.14,0.0,0.0,-297.776029,9.517887,False,2.018855e-07,-297.776029,2.222759e-162,-5.01634,CYP4F12


Swan can also automatically subset the test summary table to pull out genes that pass a certain significance threshold. 

In [10]:
# return a table of significantly differentially-expressed genes
# for a given q val + log2fc threshold
de_genes = sg.get_de_genes(obs_col, obs_conditions=obs_conditions,
                           q=0.05, log2fc=1)

In [11]:
de_genes.head()

Unnamed: 0,gid,pval,qval,log2fc,mean,zero_mean,grad,coef_mle,coef_sd,ll,gname
57195,ENSG00000129245.11,0.0,0.0,283.913085,7.659833,False,1.200001,283.913085,2.222759e-162,-59.551343,FXR2
56566,ENSG00000128656.13,0.0,0.0,283.913085,81.457545,False,1.199999,283.913085,2.222759e-162,-74.030688,CHN1
101233,ENSG00000164318.17,0.0,0.0,283.913085,4.777928,False,0.8,283.913085,2.222759e-162,-39.453406,EGFLAM
57196,ENSG00000129245.11,0.0,0.0,283.913085,7.659833,False,1.200001,283.913085,2.222759e-162,-59.551343,FXR2
101230,ENSG00000164318.17,0.0,0.0,283.913085,4.777928,False,0.8,283.913085,2.222759e-162,-39.453406,EGFLAM
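
The thresholding step can be sketched in plain Python. This is only an illustration of the idea, not Swan's code: the function name `filter_de` is hypothetical, the rows are made up, and the choice to compare the absolute value of `log2fc` with inclusive bounds is an assumption.

```python
def filter_de(rows, q=0.05, log2fc=1):
    """Keep rows with qval <= q and |log2fc| >= log2fc (assumed semantics)."""
    return [r for r in rows
            if r["qval"] <= q and abs(r["log2fc"]) >= log2fc]

# made-up rows that reuse the column names of the summary table
rows = [
    {"gname": "geneA", "qval": 0.001, "log2fc": 2.5},
    {"gname": "geneB", "qval": 0.200, "log2fc": 3.0},   # fails the q threshold
    {"gname": "geneC", "qval": 0.010, "log2fc": 0.3},   # fails the log2fc threshold
    {"gname": "geneD", "qval": 0.030, "log2fc": -1.7},  # passes (downregulated)
]
significant = filter_de(rows)
print([r["gname"] for r in significant])
```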


## <a name="det"></a>Differential transcript expression tests

Similarly, Swan can run tests to find differentially expressed transcript isoforms. The inputs and outputs of these functions are identical to those of the differential gene tests.

The differential expression test that is run is [diffxpy's Wald test](https://diffxpy.readthedocs.io/en/latest/api/diffxpy.api.test.wald.html#diffxpy.api.test.wald), which checks if "a certain coefficient introduces a significant difference in the expression of a transcript". This test is performed on the normalized TPM for each transcript.

For individuals wanting to run a different diffxpy differential test, see [this section](#more_de).

In [12]:
obs_col = 'cell_line'
obs_conditions = ['hepg2', 'hffc6']

# perform a differential transcript expression 
# Wald test on the provided metadata column and conditions
test = sg.de_transcript_test(obs_col, obs_conditions=obs_conditions)

training location model: True
training scale model: True
iter   0: ll=13727074.960768
caught 17171 linalg singular matrix errors
iter   1: ll=13727074.960768, converged: 0.00% (loc: 100.00%, scale update: False), in 14.40sec
iter   2: ll=6420440.381519, converged: 83.71% (loc: 83.71%, scale update: True), in 246.41sec
caught 16133 linalg singular matrix errors
iter   3: ll=6420440.381519, converged: 83.71% (loc: 100.00%, scale update: False), in 13.52sec
iter   4: ll=3660392.429387, converged: 93.97% (loc: 93.97%, scale update: True), in 48.35sec
caught 7890 linalg singular matrix errors
iter   5: ll=3660392.429387, converged: 93.97% (loc: 100.00%, scale update: False), in 14.28sec
iter   6: ll=3653846.326338, converged: 99.02% (loc: 99.02%, scale update: True), in 23.72sec
caught 150 linalg singular matrix errors
iter   7: ll=3653846.326338, converged: 99.02% (loc: 100.00%, scale update: False), in 9.39sec
iter   8: ll=3651881.056467, converged: 99.72% (loc: 99.72%, scale update: True


The output in `test` is a summary table for the differential expression test.

In [13]:
test.head(2)

Unnamed: 0,tid,pval,qval,log2fc,mean,zero_mean,grad,coef_mle,coef_sd,ll,gid,gname
21101,ENST00000367818.3,0.0,0.0,-297.776029,0.400283,False,0.621579,-297.776029,2.222759e-162,0.0,ENSG00000143184.4,XCL1
136458,ENST00000544590.1,0.0,0.0,-297.776029,0.235725,False,0.389695,-297.776029,2.222759e-162,0.0,ENSG00000109920.12,FNBP4


The results of this test are similarly stored in an automatically-generated key in `sg.adata.uns`, and will be saved to the SwanGraph if you save it. You can regenerate this key and access the summary table by running the following code:

In [14]:
# det - differential transcript expression
uns_key = swan.make_uns_key('det', 
                            obs_col=obs_col, 
                            obs_conditions=obs_conditions)
test = sg.adata.uns[uns_key]
test.head(2)

Unnamed: 0,tid,pval,qval,log2fc,mean,zero_mean,grad,coef_mle,coef_sd,ll,gid,gname
21101,ENST00000367818.3,0.0,0.0,-297.776029,0.400283,False,0.621579,-297.776029,2.222759e-162,0.0,ENSG00000143184.4,XCL1
136458,ENST00000544590.1,0.0,0.0,-297.776029,0.235725,False,0.389695,-297.776029,2.222759e-162,0.0,ENSG00000109920.12,FNBP4


Again, Swan can also automatically subset the test summary table to pull out transcripts that pass a certain significance threshold.

In [16]:
# return a table of significantly differentially-expressed transcripts
# for a given q val + log2fc threshold
de_transcripts = sg.get_de_transcripts(obs_col, obs_conditions=obs_conditions,
                                       q=0.05, log2fc=1)

In [17]:
de_transcripts.head()

Unnamed: 0,tid,pval,qval,log2fc,mean,zero_mean,grad,coef_mle,coef_sd,ll,gid,gname
91026,ENST00000486541.1,0.0,0.0,283.913085,7.222791,False,1.2,283.913085,2.222759e-162,-59.116265,ENSG00000117318.8,ID3
81775,ENST00000475122.1,0.0,0.0,283.913085,0.904308,False,0.8,283.913085,2.222759e-162,-32.247542,ENSG00000119812.18,FAM98A
91246,ENST00000486828.6,0.0,0.0,283.913085,0.253818,False,0.83105,283.913085,2.222759e-162,0.0,ENSG00000196923.13,PDLIM7
190208,ENST00000623250.1,0.0,0.0,283.913085,0.841705,False,1.2,283.913085,2.222759e-162,-45.109253,ENSG00000279348.1,AC012513.3
91032,ENST00000486554.1,0.0,0.0,283.913085,0.507635,False,0.4,283.913085,2.222759e-162,-16.49623,ENSG00000157514.16,TSC22D3


## <a name="is"></a>Isoform switching / Differential isoform expression testing

Isoform switching / differential isoform expression (DIE) testing is implemented according to the strategy in [Joglekar et al., 2021](https://www.nature.com/articles/s41467-020-20343-5). DIE can roughly be described as finding statistically significant changes in isoform expression between two conditions along with a change in percent isoform usage per gene.

Pairwise comparisons can be set up on any metadata column that was added to the SwanGraph, using the `obs_col` and `obs_conditions` arguments.

In [3]:
# look at valid metadata options
sg.adata.obs

index,dataset,cell_line,replicate
hepg2_1,hepg2_1,hepg2,1
hepg2_2,hepg2_2,hepg2,2
hffc6_1,hffc6_1,hffc6,1
hffc6_2,hffc6_2,hffc6,2
hffc6_3,hffc6_3,hffc6,3


In [2]:
# find genes that exhibit DIE between HFFc6 and HepG2
obs_col = 'cell_line'
obs_conditions = ['hepg2', 'hffc6']
die_table = sg.die_gene_test(obs_col=obs_col, 
                             obs_conditions=obs_conditions,
                             verbose=True)

The resulting table contains an entry for each gene with the p value (`p_val`), adjusted p value (`adj_p_val`), and change in percent isoform usage for the top two isoforms (`dpi`). Exact details on these calculations can be found in [Joglekar et al., 2021](https://www.nature.com/articles/s41467-020-20343-5).
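
As a rough illustration of what a change in percent isoform usage means, the sketch below computes per-condition isoform percentages for one gene and then sums the two largest gains and, separately, the two largest losses. This is one reading of "top two isoforms" and an assumption, not Swan's or the paper's exact procedure; the statistical test behind `p_val` is left out entirely, and the function names and counts are hypothetical.

```python
def percent_usage(counts):
    """Convert isoform counts for one gene into percent isoform usage."""
    total = sum(counts.values())
    if total == 0:
        return {iso: 0.0 for iso in counts}
    return {iso: 100.0 * c / total for iso, c in counts.items()}

def delta_pi(counts_a, counts_b, top_n=2):
    """Change in percent isoform usage between two conditions.

    Sums the top_n largest increases and, separately, the top_n largest
    decreases, and returns the larger magnitude (assumed definition).
    """
    pi_a = percent_usage(counts_a)
    pi_b = percent_usage(counts_b)
    isoforms = set(pi_a) | set(pi_b)
    changes = [pi_b.get(i, 0.0) - pi_a.get(i, 0.0) for i in isoforms]
    gains = sorted((c for c in changes if c > 0), reverse=True)[:top_n]
    losses = sorted(c for c in changes if c < 0)[:top_n]
    return max(sum(gains), abs(sum(losses)))

# made-up isoform counts for a single gene in two conditions
hepg2 = {"iso1": 80, "iso2": 15, "iso3": 5}
hffc6 = {"iso1": 30, "iso2": 60, "iso3": 10}
dpi = delta_pi(hepg2, hffc6)
print(dpi)
```

In this toy gene, `iso1` drops from 80% to 30% usage while `iso2` and `iso3` rise, which is the kind of switch a large `dpi` is meant to flag.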

In [6]:
die_table.head()

Unnamed: 0,gid,p_val,dpi,adj_p_val
4,ENSG00000005801.17,0.004017539,33.102501,0.017566
6,ENSG00000075790.10,3.733015e-06,34.215599,3.8e-05
8,ENSG00000005175.9,0.006216589,14.736797,0.025327
11,ENSG00000006282.20,0.0001472361,19.920383,0.001061
13,ENSG00000007376.7,1.98931e-07,29.276819,3e-06


As with differential expression testing, differential isoform expression testing results are stored automatically in `sg.adata.uns`, and will be saved to the SwanGraph if you save it. You can regenerate this key and access the summary table by running the following code:

In [4]:
# die_iso - isoform level differential isoform expression test results
uns_key = swan.make_uns_key('die',
                            obs_col=obs_col, 
                            obs_conditions=obs_conditions)
test = sg.adata.uns[uns_key]
test.head(2)

Unnamed: 0,gid,p_val,dpi,adj_p_val
0,ENSG00000004059.10,0.370956,0.659836,0.545069
1,ENSG00000003509.15,0.110749,14.178416,0.236859


Swan comes with an easy way to filter your DIE test results based on adjusted p value and dpi thresholds. 

In [7]:
test = sg.get_die_genes(obs_col=obs_col, obs_conditions=obs_conditions,
                       p=0.05, dpi=10)
test.head()

Unnamed: 0,gid,p_val,dpi,adj_p_val
4,ENSG00000005801.17,0.004017539,33.102501,0.017566
6,ENSG00000075790.10,3.733015e-06,34.215599,3.8e-05
8,ENSG00000005175.9,0.006216589,14.736797,0.025327
11,ENSG00000006282.20,0.0001472361,19.920383,0.001061
13,ENSG00000007376.7,1.98931e-07,29.276819,3e-06


Swan also automatically tracks transcription start site (TSS) and transcription end site (TES) usage, and can find genes that exhibit DIE on the basis of their starts or ends. To do this, use the `kind` argument to `die_gene_test()`.
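
Conceptually, testing at the TSS or TES level means comparing usage of start or end sites rather than of full isoforms. A minimal sketch of that collapsing step, assuming each transcript is labeled with the site it uses (the labels, counts, and the function name `collapse_counts` are all hypothetical and do not reflect Swan's internals):

```python
from collections import defaultdict

def collapse_counts(transcript_counts, transcript_to_site):
    """Sum transcript counts that share the same TSS (or TES) label."""
    collapsed = defaultdict(int)
    for tid, count in transcript_counts.items():
        collapsed[transcript_to_site[tid]] += count
    return dict(collapsed)

# hypothetical transcripts of one gene and the TSS each one starts from
transcript_to_tss = {"tx1": "tss_1", "tx2": "tss_1", "tx3": "tss_2"}
counts = {"tx1": 40, "tx2": 10, "tx3": 50}
tss_counts = collapse_counts(counts, transcript_to_tss)
print(tss_counts)
```

Two isoforms that differ only downstream of a shared start site contribute to the same TSS, so a gene can show DIE at the isoform level without showing it at the TSS level.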

In [3]:
# find genes that exhibit DIE for TSSs between HFFc6 and HepG2
die_table = sg.die_gene_test(kind='tss', 
                             obs_col=obs_col,
                             obs_conditions=obs_conditions,
                             verbose=True)
die_table.head()

Testing for DIE for each gene: 100%|█████████▉| 58905/58906 [11:46<00:00, 67.25it/s] 

Unnamed: 0,gid,p_val,dpi,adj_p_val
0,ENSG00000001630.16,0.286866,2.580168,0.71092
1,ENSG00000002330.13,0.089441,12.817429,0.348247
2,ENSG00000002586.19,0.88598,0.102093,0.982149
3,ENSG00000002822.15,0.678511,1.724138,0.962778
4,ENSG00000002919.14,0.669412,2.439026,0.962778


To access the DIE results at the TSS level, use `die_kind='tss'` as input to `make_uns_key()`.

In [4]:
# die_iso - TSS level differential isoform expression test results
uns_key = swan.make_uns_key('die',
                            obs_col=obs_col, 
                            obs_conditions=obs_conditions,
                            die_kind='tss')
test = sg.adata.uns[uns_key]
test.head(2)

Unnamed: 0,gid,p_val,dpi,adj_p_val
0,ENSG00000001630.16,0.286866,2.580168,0.71092
1,ENSG00000002330.13,0.089441,12.817429,0.348247


Likewise, pass `kind='tss'` to `get_die_genes()` to filter these test results.

In [5]:
test = sg.get_die_genes(kind='tss', obs_col=obs_col, 
                        obs_conditions=obs_conditions,
                        p=0.05, dpi=10)
test.head()

Unnamed: 0,gid,p_val,dpi,adj_p_val
5,ENSG00000003402.19,4.039839e-06,39.797981,6.987697e-05
6,ENSG00000003436.15,1.183656e-05,28.073771,0.0001856128
7,ENSG00000004487.16,0.001135407,75.714287,0.01145036
36,ENSG00000008952.16,0.004075852,26.08696,0.03342667
40,ENSG00000010278.13,3.077503e-22,39.024387,2.195799e-20


For TESs, use `kind='tes'` as input to `die_gene_test()`, `die_kind='tes'` to `make_uns_key()`, and `kind='tes'` to `get_die_genes()`.

In [6]:
# find genes that exhibit DIE for TESs between HFFc6 and HepG2
die_table = sg.die_gene_test(kind='tes', obs_col='cell_line', obs_conditions=['hepg2', 'hffc6'])
die_table.head()

Testing for DIE for each gene: 100%|██████████| 58906/58906 [12:00<00:00, 67.25it/s]

Unnamed: 0,gid,p_val,dpi,adj_p_val
0,ENSG00000000419.12,0.749266,0.096133,0.904272
1,ENSG00000001461.16,0.852914,9.093739,0.943997
2,ENSG00000001630.16,0.286866,2.580168,0.613824
3,ENSG00000002330.13,0.184635,12.81743,0.479316
4,ENSG00000002549.12,0.679148,0.694543,0.879387


In [8]:
# die_iso - TES level differential isoform expression test results
uns_key = swan.make_uns_key('die',
                            obs_col=obs_col, 
                            obs_conditions=obs_conditions,
                            die_kind='tes')
test = sg.adata.uns[uns_key]
test.head(2)

Unnamed: 0,gid,p_val,dpi,adj_p_val
0,ENSG00000000419.12,0.749266,0.096133,0.904272
1,ENSG00000001461.16,0.852914,9.093739,0.943997


In [9]:
test = sg.get_die_genes(kind='tes', obs_col=obs_col, 
                        obs_conditions=obs_conditions,
                        p=0.05, dpi=10)
test.head()

Unnamed: 0,gid,p_val,dpi,adj_p_val
9,ENSG00000003402.19,7.338098e-15,84.848488,5.296535e-13
10,ENSG00000003436.15,2.772096e-07,32.637852,7.003007e-06
14,ENSG00000004487.16,0.001135407,75.714287,0.01141621
32,ENSG00000006282.20,0.006858641,19.819595,0.04915014
60,ENSG00000010278.13,3.191221e-22,39.024387,3.4861939999999995e-20


## <a name="multi_gb"></a>Combining metadata columns

What if none of your metadata columns capture the comparison you want to make? For example, what if you want to find differentially expressed genes or transcripts, or isoform-switching genes, between hffc6 replicate 3 and hepg2 replicate 1?

In [10]:
sg.adata.obs.head()

index,dataset,cell_line,replicate
hepg2_1,hepg2_1,hepg2,1
hepg2_2,hepg2_2,hepg2,2
hffc6_1,hffc6_1,hffc6,1
hffc6_2,hffc6_2,hffc6,2
hffc6_3,hffc6_3,hffc6,3


Let's ignore for a moment the fact that the `dataset` column does effectively capture both replicate as well as cell line metadata. This may not always be the case with more complex datasets! Swan has a function to concatenate columns together and add them as an additional column to the metadata tables. Use the following code to generate a new column that concatenates as many preexisting metadata columns as you wish:

In [11]:
col_name = sg.add_multi_groupby(['cell_line', 'replicate'])

print(col_name)
print(sg.adata.obs.head())

cell_line_replicate
         dataset cell_line replicate cell_line_replicate
index                                                   
hepg2_1  hepg2_1     hepg2         1             hepg2_1
hepg2_2  hepg2_2     hepg2         2             hepg2_2
hffc6_1  hffc6_1     hffc6         1             hffc6_1
hffc6_2  hffc6_2     hffc6         2             hffc6_2
hffc6_3  hffc6_3     hffc6         3             hffc6_3
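
The concatenation shown in the output above can be sketched in plain Python. This is only an illustration of the behavior visible in the printed table (values joined with an underscore, new column named after the joined column names), not Swan's implementation; the function name `add_multi_column` is hypothetical.

```python
def add_multi_column(rows, cols, sep="_"):
    """Add a column that joins the values of `cols`; return its name."""
    name = sep.join(cols)
    for row in rows:
        row[name] = sep.join(str(row[c]) for c in cols)
    return name

# two rows mirroring the metadata table above
obs = [
    {"dataset": "hepg2_1", "cell_line": "hepg2", "replicate": 1},
    {"dataset": "hffc6_3", "cell_line": "hffc6", "replicate": 3},
]
new_col = add_multi_column(obs, ["cell_line", "replicate"])
print(new_col, [row[new_col] for row in obs])
```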


The added column in `col_name` can then be used as the `obs_col` input to `de_gene_test()`, `de_transcript_test()`, and `die_gene_test()`, as in the following calls:

In [13]:
obs_col = col_name
obs_conditions = ['hffc6_3', 'hepg2_1']

deg_summary = sg.de_gene_test(obs_col=obs_col,
                              obs_conditions=obs_conditions)
det_summary = sg.de_transcript_test(obs_col=obs_col,
                                    obs_conditions=obs_conditions)
die_summary = sg.die_gene_test(obs_col=obs_col, 
                               obs_conditions=obs_conditions)

## <a name="es_ir"></a>Exon skipping and intron retention

Swan can detect novel (unannotated) exon skipping and intron retention events. 

To obtain a dataframe of novel exon skipping events, run the following code:

In [3]:
# returns a DataFrame of genes, transcripts, and specific edges in 
# the SwanGraph with novel exon skipping events
es_df = sg.find_es_genes(verbose=True)

In [15]:
es_df.head()

Unnamed: 0,gid,tid,egde_id
0,ENSG00000157916.19,TALONT000218256,952616
0,ENSG00000122406.13,TALONT000425229,952716
0,ENSG00000224093.5,TALONT000434035,952720
0,ENSG00000224093.5,TALONT000434035,952720
0,ENSG00000224093.5,TALONT000434035,952720


You can pass gene IDs from `es_df` into `gen_report()` or `plot_graph()` to visualize where these exon-skipping events are in gene reports or gene summary graphs respectively.
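
The intuition behind an exon skipping event can be shown with a simplified coordinate check: a novel intron skips an exon when an annotated exon lies entirely inside it. This is a sketch under that assumption and not the algorithm `find_es_genes()` uses on the SwanGraph; the coordinates and the function name `skipped_exons` are hypothetical.

```python
def skipped_exons(intron, annotated_exons):
    """Return annotated exons that lie entirely inside an intron.

    Coordinates are (start, end) tuples with start < end.
    """
    i_start, i_end = intron
    return [(s, e) for s, e in annotated_exons
            if i_start < s and e < i_end]

annotated_exons = [(100, 200), (300, 400), (500, 600)]
novel_intron = (200, 500)  # splices the first exon directly to the third
skipped = skipped_exons(novel_intron, annotated_exons)
print(skipped)
```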

To obtain a dataframe of novel intron retention events, run the following code:

In [2]:
# returns a DataFrame of genes, transcripts, and specific edges in 
# the SwanGraph with novel intron retention events
ir_df = sg.find_ir_genes(verbose=True)

In [17]:
# save the SwanGraph as a Python pickle file
sg.save_graph('data/swan')
sg.save_graph('swan')
sg.save_graph('data/swan_files_full')

Saving graph as data/swan.p
Saving graph as swan.p
Saving graph as data/swan_files_full.p


In [4]:
ir_df.head()

Unnamed: 0,gid,tid,egde_id
0,ENSG00000143753.12,TALONT000482711,952811
0,ENSG00000285053.1,TALONT000483978,952821
0,ENSG00000177042.14,TALONT000213980,954058
0,ENSG00000177042.14,TALONT000213980,954058
0,ENSG00000148926.9,TALONT000251937,954085


You can pass gene IDs from `ir_df` into `gen_report()` or `plot_graph()` to visualize where these intron retention events are in gene reports or gene summary graphs respectively.
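
Intron retention is the mirror image of exon skipping: a novel exon retains an intron when an annotated intron lies entirely inside it. The check below is a simplified sketch under that assumption, not the algorithm `find_ir_genes()` uses; the coordinates and the function name `retained_introns` are hypothetical.

```python
def retained_introns(exon, annotated_introns):
    """Return annotated introns that lie entirely inside an exon.

    Coordinates are (start, end) tuples with start < end.
    """
    e_start, e_end = exon
    return [(s, e) for s, e in annotated_introns
            if e_start < s and e < e_end]

annotated_introns = [(200, 300), (400, 500)]
novel_exon = (100, 400)  # reads straight through the first intron
retained = retained_introns(novel_exon, annotated_introns)
print(retained)
```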

## <a name="more_de"></a>More differential expression

Users who want to run a different differential expression test, or to tweak the input parameters, can obtain an AnnData version of their SwanGraph using `create_gene_anndata()` or `create_transcript_anndata()` and explore the numerous differential testing options that diffxpy supports.

[Diffxpy differential testing tutorials](https://diffxpy.readthedocs.io/en/latest/tutorials.html#differential-testing)

[More information on diffxpy differential expression tests](https://nbviewer.jupyter.org/github/theislab/diffxpy_tutorials/blob/master/diffxpy_tutorials/test/introduction_differential_testing.ipynb) 

In [3]:
# dataset names written to match the `dataset` column of sg.adata.obs
dataset_groups = [['hepg2_1','hepg2_2'],
                  ['hffc6_1','hffc6_2','hffc6_3']]

# create a gene-level AnnData object compatible with diffxpy 
# that assigns different condition labels to the given dataset groups
gene_adata = sg.create_gene_anndata(dataset_groups)

Transforming to str index.


In [4]:
# create a transcript-level AnnData object compatible with diffxpy 
# that assigns different condition labels to the given dataset groups
transcript_adata = sg.create_transcript_anndata(dataset_groups)

Transforming to str index.
