# Analysis tools

Swan provides several analysis tools:
* [Differential expression tests](#de)
* [Isoform switching / differential isoform expression](#is)
* [Combining metadata columns](#multi_gb)
* [Exon skipping and intron retention](#es_ir)
* [More differential expression](#more_de)

<!-- Running this tutorial on my laptop took around 30 minutes and 3 GB of RAM. The longest steps by far are running the differential gene and transcript expression tools. The diffxpy tools are multithreaded, and my laptop has 8 cores. -->

In [1]:
import swan_vis as swan

sg = swan.read('data/swan.p')

Read in graph from data/swan.p


## <a name="de"></a>Differential expression tests

Swan's old differential gene and transcript expression tests, which used `diffxpy`, are now deprecated because the library appears to be unsupported. I recommend that users interested in running differential gene or transcript expression tests use either [PyDESeq2](https://github.com/owkin/PyDESeq2) or Scanpy's [`rank_genes_groups`](https://scanpy.readthedocs.io/en/stable/generated/scanpy.tl.rank_genes_groups.html) test. Both support the AnnData format, which makes them simple to use with Swan's AnnData expression representation.

<!-- ### <a name="de"></a>Using Scanpy's `rank_genes_groups` -->

<!-- `rank_genes_groups` expects logarithmized data, so make sure you transform your data before running the test on whichever [AnnData](https://freese.gitbook.io/swan/faqs/data_structure#anndata) you want that's in your SwanGraph. -->

<!-- ```python
import scanpy as sc

sg.adata.X = sg.adata.layers['tpm']
sc.pp.log1p(sg.adata)
sg.adata.layers['log_norm'] = sg.adata.X.copy()
sc.tl.rank_genes_groups(sg.adata,
                        groupby=<obs_col>,
                        groups=<obs_conditions>,
                        layer='log_norm',
                        method='wilcoxon')

results_df = sc.get.rank_genes_groups_df(sg.adata, <obs_condition>)
results_df.head()
``` -->

### <a name="de"></a>Using PyDESeq2

Please read the [PyDESeq2 documentation](https://pydeseq2.readthedocs.io/en/latest/) for details on how to use one of the SwanGraph AnnData objects to obtain differential expression results. Below is an example of how to find differentially expressed transcripts between cell lines.

In [16]:
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats
import numpy as np

adata = sg.adata.copy()

# PyDESeq2 currently doesn't support column names with underscores, so change that
adata.obs.rename({'cell_line': 'cellline'}, axis=1, inplace=True)
obs_col = 'cellline'

threads = 8

# densify matrix
adata.X = np.array(adata.X.todense())

# run test
dds = DeseqDataSet(adata=adata,
                   design_factors=obs_col,
                   n_cpus=threads,
                   refit_cooks=True)
dds.deseq2()
stat_res = DeseqStats(dds,
                      n_cpus=threads)
stat_res.summary()

df = stat_res.results_df

Fitting size factors...
... done in 0.00 seconds.

Fitting dispersions...
... done in 34.09 seconds.

Fitting dispersion trend curve...
... done in 12.01 seconds.

Fitting MAP dispersions...
... done in 13.97 seconds.

Fitting LFCs...
... done in 4.86 seconds.

Refitting 0 outliers.

Running Wald tests...
... done in 2.36 seconds.

Log2 fold change & Wald test p-value: cellline hffc6 vs hepg2


| tid | baseMean | log2FoldChange | lfcSE | stat | pvalue | padj |
| --- | --- | --- | --- | --- | --- | --- |
| TALONT000296400 | 3.497574 | 2.416436 | 1.472616 | 1.640914 | 0.100815 | 0.193840 |
| ENST00000591581.1 | 0.162115 | 0.624956 | 5.002765 | 0.124922 | 0.900585 |  |
| ENST00000546893.5 | 8.173445 | -0.053334 | 0.793179 | -0.067241 | 0.946390 | 0.968527 |
| ENST00000537289.1 | 5.140216 | -1.248175 | 0.978927 | -1.275044 | 0.202294 | 0.328286 |
| ENST00000258382.9 | 2.012104 | -0.011819 | 1.493866 | -0.007912 | 0.993687 |  |
| ... | ... | ... | ... | ... | ... | ... |
| ENST00000506914.1 | 1.022692 | 0.947588 | 2.272813 | 0.416923 | 0.676735 |  |
| ENST00000571080.1 | 0.473263 | -2.824409 | 3.426473 | -0.824291 | 0.409774 |  |
| ENST00000378615.7 | 1.334873 | -1.199810 | 1.790781 | -0.669993 | 0.502862 |  |
| ENST00000409586.7 | 0.338615 | 1.464364 | 4.023896 | 0.363917 | 0.715920 |  |


In [6]:
df.head()

| tid | baseMean | log2FoldChange | lfcSE | stat | pvalue | padj |
| --- | --- | --- | --- | --- | --- | --- |
| TALONT000296400 | 3.497574 | 2.416436 | 1.472616 | 1.640914 | 0.100815 | 0.19384 |
| ENST00000591581.1 | 0.162115 | 0.624956 | 5.002765 | 0.124922 | 0.900585 |  |
| ENST00000546893.5 | 8.173445 | -0.053334 | 0.793179 | -0.067241 | 0.94639 | 0.968527 |
| ENST00000537289.1 | 5.140216 | -1.248175 | 0.978927 | -1.275044 | 0.202294 | 0.328286 |
| ENST00000258382.9 | 2.012104 | -0.011819 | 1.493866 | -0.007912 | 0.993687 |  |
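Once you have the results table, a common next step is to filter it for significant transcripts. The sketch below rebuilds the five rows shown above as a plain pandas DataFrame and applies thresholds. The cutoffs (`padj < 0.05`, absolute log2 fold change above 1) are illustrative assumptions, not Swan or PyDESeq2 defaults. Rows with an empty `padj` are dropped first, on the assumption that they were excluded from multiple-testing correction.

```python
import numpy as np
import pandas as pd

# The five rows from df.head() above, rebuilt by hand for illustration.
res = pd.DataFrame(
    {
        "log2FoldChange": [2.416436, 0.624956, -0.053334, -1.248175, -0.011819],
        "padj": [0.193840, np.nan, 0.968527, 0.328286, np.nan],
    },
    index=pd.Index(
        [
            "TALONT000296400",
            "ENST00000591581.1",
            "ENST00000546893.5",
            "ENST00000537289.1",
            "ENST00000258382.9",
        ],
        name="tid",
    ),
)

# Keep only transcripts that received an adjusted p value.
tested = res.dropna(subset=["padj"])

# Illustrative thresholds (assumptions, tune them for your own analysis).
padj_thresh = 0.05
lfc_thresh = 1

significant = tested[
    (tested["padj"] < padj_thresh)
    & (tested["log2FoldChange"].abs() > lfc_thresh)
]

print(f"{len(tested)} tested, {len(significant)} significant")
```

None of these five transcripts passes the cutoffs, so `significant` is empty here. On the full `stat_res.results_df` the same two steps apply unchanged.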


## <a name="is"></a>Isoform switching / Differential isoform expression testing

Isoform switching / differential isoform expression (DIE) testing is implemented according to the strategy in [Joglekar et al., 2021](https://www.nature.com/articles/s41467-020-20343-5). DIE can roughly be described as finding statistically significant changes in isoform expression between two conditions along with a change in percent isoform usage per gene.
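To make "change in percent isoform usage" concrete, here is a minimal sketch for a single gene. The counts are made up, and the column names `cond_a` and `cond_b` are placeholders. This is a simplified illustration of the quantity, not Swan's implementation, and it leaves out the statistical test entirely.

```python
import pandas as pd

# Hypothetical summed counts for one gene's three isoforms in two conditions.
counts = pd.DataFrame(
    {"cond_a": [80, 15, 5], "cond_b": [30, 60, 10]},
    index=["iso_1", "iso_2", "iso_3"],
)

# Percent isoform usage: each isoform's share of the gene's counts
# within a condition, expressed as a percentage.
pi = counts / counts.sum(axis=0) * 100

# Change in percent isoform usage per isoform
# (cond_b minus cond_a; the direction is an arbitrary choice here).
dpi = pi["cond_b"] - pi["cond_a"]

print(dpi)
```

Because usage sums to 100% in each condition, the per-isoform changes always sum to zero: every gain in one isoform is balanced by losses in the others.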

Pairwise comparisons can be set up on any metadata column that was added to the SwanGraph. Use the `obs_col` argument to choose the column and the `obs_conditions` argument to choose the two values in that column to compare.

In [2]:
# look at valid metadata options
sg.adata.obs

| dataset (index) | cell_line | replicate | dataset | total_counts |
| --- | --- | --- | --- | --- |
| hepg2_1 | hepg2 | 1 | hepg2_1 | 499647.0 |
| hepg2_2 | hepg2 | 2 | hepg2_2 | 848447.0 |
| hffc6_1 | hffc6 | 1 | hffc6_1 | 761493.0 |
| hffc6_2 | hffc6 | 2 | hffc6_2 | 787967.0 |
| hffc6_3 | hffc6 | 3 | hffc6_3 | 614921.0 |


In [3]:
# find genes that exhibit DIE between HFFc6 and HepG2
obs_col = 'cell_line'
obs_conditions = ['hepg2', 'hffc6']
die_table, die_results = sg.die_gene_test(obs_col=obs_col, 
                                          obs_conditions=obs_conditions,
                                          verbose=True)

Testing for DIE for each gene: 100%|██████████| 14684/14684 [03:50<00:00, 123.69it/s]

The resulting table contains an entry for each gene with the p value (`p_val`), adjusted p value (`adj_p_val`), and change in percent isoform usage for the top two isoforms (`dpi`), as well as the identities of the top isoforms involved in the switch. Exact details on these calculations can be found in [Joglekar et al., 2021](https://www.nature.com/articles/s41467-020-20343-5).

In [5]:
die_table.head()

|  | gid | p_val | dpi | pos_iso_1 | pos_iso_2 | pos_iso_1_dpi | pos_iso_2_dpi | neg_iso_1 | neg_iso_2 | neg_iso_1_dpi | neg_iso_2_dpi | adj_p_val | gname |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | ENSG00000130175 | 0.206667 | 39.264992 | TALONT000296399 |  | 39.264992 |  | TALONT000296400 | ENST00000589838.5 | -20.116056 | -10.638298 | 0.46942 | PRKCSH |
| 1 | ENSG00000130202 | 0.000367 | 11.893251 | ENST00000252485.8 | ENST00000252483.9 | 9.684967 | 2.208281 | TALONT000406668 | ENST00000591581.1 | -11.473083 | -0.420168 | 0.00356 | NECTIN2 |
| 2 | ENSG00000111371 | 0.680435 | 9.401713 | ENST00000398637.9 | ENST00000439706.5 | 8.547012 | 0.854701 | ENST00000546893.5 |  | -9.401711 |  | 0.886243 | SLC38A1 |
| 3 | ENSG00000181924 | 0.028195 | 9.452934 | ENST00000537289.1 | ENST00000545127.1 | 7.298603 | 2.154328 | ENST00000355693.4 |  | -9.452934 |  | 0.122619 | COA4 |
| 4 | ENSG00000163468 | 0.255788 | 0.680048 | ENST00000295688.7 | TALONT000476055 | 0.366966 | 0.277693 | ENST00000368259.6 | ENST00000489870.1 | -0.568503 | -0.111545 | 0.525066 | CCT3 |
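In the rows above, the gene-level `dpi` appears to equal the larger of two sums: the top two positive per-isoform changes, or the absolute value of the top two negative ones. The sketch below checks that reading against the five rows shown. It is an observation about this output, not a statement of how Swan computes the value internally. See the paper linked above for the exact definition.

```python
import pandas as pd

nan = float("nan")

# Per-isoform dpi columns copied from die_table.head() above.
rows = pd.DataFrame(
    {
        "gname": ["PRKCSH", "NECTIN2", "SLC38A1", "COA4", "CCT3"],
        "dpi": [39.264992, 11.893251, 9.401713, 9.452934, 0.680048],
        "pos_iso_1_dpi": [39.264992, 9.684967, 8.547012, 7.298603, 0.366966],
        "pos_iso_2_dpi": [nan, 2.208281, 0.854701, 2.154328, 0.277693],
        "neg_iso_1_dpi": [-20.116056, -11.473083, -9.401711, -9.452934, -0.568503],
        "neg_iso_2_dpi": [-10.638298, -0.420168, nan, nan, -0.111545],
    }
)

# Sum the top two gains and the top two losses (missing isoforms count as 0).
pos_sum = rows[["pos_iso_1_dpi", "pos_iso_2_dpi"]].sum(axis=1)
neg_sum = rows[["neg_iso_1_dpi", "neg_iso_2_dpi"]].sum(axis=1).abs()

# Take whichever direction shows the larger total change.
recomputed = pd.concat([pos_sum, neg_sum], axis=1).max(axis=1)

# Largest disagreement with the reported dpi column.
max_diff = (recomputed - rows["dpi"]).abs().max()
print(max_diff)
```

The recomputed values agree with the reported `dpi` column to within rounding of the displayed digits.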


Differential isoform expression testing results are stored automatically in `sg.adata.uns`, and will be saved to the SwanGraph if you save it. You can regenerate this key and access the summary table by running the following code:

In [6]:
# die_iso - isoform level differential isoform expression test results
uns_key = swan.make_uns_key('die',
                            obs_col=obs_col, 
                            obs_conditions=obs_conditions)
test = sg.adata.uns[uns_key]
test.head(2)

|  | gid | p_val | dpi | pos_iso_1 | pos_iso_2 | pos_iso_1_dpi | pos_iso_2_dpi | neg_iso_1 | neg_iso_2 | neg_iso_1_dpi | neg_iso_2_dpi | adj_p_val | gname |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | ENSG00000130175 | 0.206667 | 39.264992 | TALONT000296399 |  | 39.264992 |  | TALONT000296400 | ENST00000589838.5 | -20.116056 | -10.638298 | 0.46942 | PRKCSH |
| 1 | ENSG00000130202 | 0.000367 | 11.893251 | ENST00000252485.8 | ENST00000252483.9 | 9.684967 | 2.208281 | TALONT000406668 | ENST00000591581.1 | -11.473083 | -0.420168 | 0.00356 | NECTIN2 |


Swan comes with an easy way to filter your DIE test results based on adjusted p value and dpi thresholds. 

In [7]:
test = sg.get_die_genes(obs_col=obs_col, obs_conditions=obs_conditions,
                       p=0.05, dpi=10)
test.head()

|  | gid | p_val | dpi | pos_iso_1 | pos_iso_2 | pos_iso_1_dpi | pos_iso_2_dpi | neg_iso_1 | neg_iso_2 | neg_iso_1_dpi | neg_iso_2_dpi | adj_p_val | gname |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | ENSG00000130202 | 0.0003674234 | 11.893251 | ENST00000252485.8 | ENST00000252483.9 | 9.684967 | 2.208281 | TALONT000406668 | ENST00000591581.1 | -11.473083 | -0.420168 | 0.003560035 | NECTIN2 |
| 8 | ENSG00000025156 | 0.001857411 | 35.714287 | ENST00000452194.5 | ENST00000465214.2 | 25.714287 | 10.0 | ENST00000368455.8 |  | -35.714287 |  | 0.01405048 | HSF2 |
| 20 | ENSG00000105254 | 4.779407999999999e-38 | 25.419458 | ENST00000585910.5 | TALONT000366329 | 20.394853 | 4.702971 | ENST00000221855.7 | ENST00000589996.5 | -25.016304 | -0.403154 | 7.343218999999999e-36 | TBCB |
| 23 | ENSG00000105379 | 1.8027819999999997e-307 | 81.136333 | ENST00000354232.8 |  | 81.136333 |  | ENST00000309244.8 | ENST00000596253.1 | -80.857935 | -0.278394 | 1.551113e-304 | ETFB |
| 28 | ENSG00000148180 | 0.0 | 89.319039 | ENST00000373818.8 | TALONT000419680 | 85.24585 | 4.073189 | TALONT000418752 | ENST00000373808.8 | -68.60318 | -7.768958 | 0.0 | GSN |
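Conceptually, the filter above can be pictured as a boolean mask on the `adj_p_val` and `dpi` columns. The sketch below applies such a mask to the first five rows of the unfiltered isoform-level table. Whether Swan uses inclusive or strict comparisons at the thresholds is an assumption here. Use `get_die_genes()` itself for real analyses.

```python
import pandas as pd

# First five rows of the unfiltered isoform-level DIE table, reduced to
# the columns the filter needs.
die = pd.DataFrame(
    {
        "gname": ["PRKCSH", "NECTIN2", "SLC38A1", "COA4", "CCT3"],
        "dpi": [39.264992, 11.893251, 9.401713, 9.452934, 0.680048],
        "adj_p_val": [0.46942, 0.00356, 0.886243, 0.122619, 0.525066],
    }
)

p_thresh = 0.05    # maximum adjusted p value
dpi_thresh = 10    # minimum change in percent isoform usage

# Keep genes that are both significant and show a large enough usage change.
# Inclusive comparisons are an assumption made for this sketch.
passing = die[(die["adj_p_val"] <= p_thresh) & (die["dpi"] >= dpi_thresh)]

print(passing["gname"].tolist())
```

Only NECTIN2 survives: PRKCSH has a large `dpi` but is not significant, and the remaining genes fail the significance threshold, the `dpi` threshold, or both.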


Swan also now automatically tracks transcription start site (TSS) and transcription end site (TES) usage, and can find genes that exhibit DIE on the basis of their starts or ends. To do this, use the `kind` argument to `die_gene_test`.

In [8]:
# find genes that exhibit DIE for TSSs between HFFc6 and HepG2
die_table, die_results = sg.die_gene_test(kind='tss', 
                                          obs_col=obs_col,
                                          obs_conditions=obs_conditions,
                                          verbose=True)
die_table.head()

Testing for DIE for each gene: 100%|██████████| 14684/14684 [05:38<00:00, 43.42it/s] 
Testing for DIE for each gene: 100%|█████████▉| 14674/14684 [02:18<00:00, 162.60it/s]

|  | gid | p_val | dpi | pos_iso_1 | pos_iso_2 | pos_iso_1_dpi | pos_iso_2_dpi | neg_iso_1 | neg_iso_2 | neg_iso_1_dpi | neg_iso_2_dpi | adj_p_val | gname |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | ENSG00000000419 | 0.249561 | 10.002187 | ENSG00000000419_2 | ENSG00000000419_1 | 9.906052 | 0.096133 | ENSG00000000419_3 | ENSG00000000419_4 | -7.218704 | -2.783483 | 0.536906 | DPM1 |
| 1 | ENSG00000001461 | 0.852914 | 9.093739 | ENSG00000001461_5 |  | 9.093739 |  | ENSG00000001461_1 | ENSG00000001461_3 | -5.543442 | -1.775148 | 1.0 | NIPAL3 |
| 2 | ENSG00000001497 | 0.891496 | 3.846153 | ENSG00000001497_1 |  | 3.846146 |  | ENSG00000001497_2 |  | -3.846153 |  | 1.0 | LAS1L |
| 3 | ENSG00000001630 | 0.286866 | 2.580168 | ENSG00000001630_3 |  | 2.580168 |  | ENSG00000001630_2 | ENSG00000001630_1 | -1.549232 | -1.030928 | 0.582636 | CYP51A1 |
| 4 | ENSG00000002330 | 0.184635 | 12.817431 | ENSG00000002330_2 | ENSG00000002330_1 | 11.868484 | 0.948944 | ENSG00000002330_4 | ENSG00000002330_3 | -12.603584 | -0.213847 | 0.454053 | BAD |


To access the DIE results at the TSS level, pass `die_kind='tss'` to `make_uns_key()`.

In [9]:
# die_iso - TSS level differential isoform expression test results
uns_key = swan.make_uns_key('die',
                            obs_col=obs_col, 
                            obs_conditions=obs_conditions,
                            die_kind='tss')
test = sg.adata.uns[uns_key]
test.head(2)

|  | gid | p_val | dpi | pos_iso_1 | pos_iso_2 | pos_iso_1_dpi | pos_iso_2_dpi | neg_iso_1 | neg_iso_2 | neg_iso_1_dpi | neg_iso_2_dpi | adj_p_val | gname |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | ENSG00000000419 | 0.249561 | 10.002187 | ENSG00000000419_2 | ENSG00000000419_1 | 9.906052 | 0.096133 | ENSG00000000419_3 | ENSG00000000419_4 | -7.218704 | -2.783483 | 0.536906 | DPM1 |
| 1 | ENSG00000001461 | 0.852914 | 9.093739 | ENSG00000001461_5 |  | 9.093739 |  | ENSG00000001461_1 | ENSG00000001461_3 | -5.543442 | -1.775148 | 1.0 | NIPAL3 |


To filter the TSS-level test results, pass `kind='tss'` to `get_die_genes()`.

In [10]:
test = sg.get_die_genes(kind='tss', obs_col=obs_col, 
                        obs_conditions=obs_conditions,
                        p=0.05, dpi=10)
test.head()

|  | gid | p_val | dpi | pos_iso_1 | pos_iso_2 | pos_iso_1_dpi | pos_iso_2_dpi | neg_iso_1 | neg_iso_2 | neg_iso_1_dpi | neg_iso_2_dpi | adj_p_val | gname |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10 | ENSG00000003402 | 7.338098e-15 | 84.848486 | ENSG00000003402_3 | ENSG00000003402_7 | 77.272728 | 7.575758 | ENSG00000003402_1 | ENSG00000003402_5 | -31.717173 | -22.222223 | 4.390279e-13 | CFLAR |
| 11 | ENSG00000003436 | 3.812087e-06 | 34.072281 | ENSG00000003436_1 | ENSG00000003436_3 | 28.073771 | 4.564083 | ENSG00000003436_4 |  | -34.072281 |  | 7.064168e-05 | TFPI |
| 16 | ENSG00000004487 | 0.001135407 | 75.714287 | ENSG00000004487_1 |  | 75.714287 |  | ENSG00000004487_2 |  | -75.714285 |  | 0.01020405 | KDM1A |
| 33 | ENSG00000006282 | 0.004544048 | 19.819595 | ENSG00000006282_2 | ENSG00000006282_1 | 13.679245 | 6.14035 | ENSG00000006282_3 |  | -19.819595 |  | 0.03236475 | SPATA20 |
| 43 | ENSG00000007376 | 0.0001256379 | 27.898551 | ENSG00000007376_3 | ENSG00000007376_1 | 23.454107 | 4.444445 | ENSG00000007376_5 | ENSG00000007376_7 | -14.130434 | -7.512077 | 0.001525134 | RPUSD1 |


For TESs, pass `kind='tes'` to `die_gene_test()`, `die_kind='tes'` to `make_uns_key()`, and `kind='tes'` to `get_die_genes()`.

In [11]:
# find genes that exhibit DIE for TESs between HFFc6 and HepG2
die_table, die_results = sg.die_gene_test(kind='tes',
                                          obs_col='cell_line',
                                          obs_conditions=['hepg2', 'hffc6'])
die_table.head()

Testing for DIE for each gene: 100%|██████████| 14684/14684 [02:19<00:00, 105.51it/s]


|  | gid | p_val | dpi | pos_iso_1 | pos_iso_2 | pos_iso_1_dpi | pos_iso_2_dpi | neg_iso_1 | neg_iso_2 | neg_iso_1_dpi | neg_iso_2_dpi | adj_p_val | gname |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | ENSG00000000419 | 1.0 | 0.096133 | ENSG00000000419_2 |  | 0.096133 |  | ENSG00000000419_1 |  | -0.09613 |  | 1.0 | DPM1 |
| 1 | ENSG00000001461 | 0.852914 | 9.093739 | ENSG00000001461_3 |  | 9.093739 |  | ENSG00000001461_4 | ENSG00000001461_1 | -5.543442 | -1.775148 | 1.0 | NIPAL3 |
| 2 | ENSG00000001630 | 0.286866 | 2.580168 | ENSG00000001630_2 |  | 2.580168 |  | ENSG00000001630_1 | ENSG00000001630_3 | -1.549232 | -1.030928 | 0.618077 | CYP51A1 |
| 3 | ENSG00000002330 | 0.184635 | 12.817431 | ENSG00000002330_1 | ENSG00000002330_4 | 11.868484 | 0.948944 | ENSG00000002330_2 | ENSG00000002330_3 | -12.603584 | -0.213847 | 0.48086 | BAD |
| 4 | ENSG00000002549 | 0.679148 | 0.694543 | ENSG00000002549_2 |  | 0.694542 |  | ENSG00000002549_1 |  | -0.694543 |  | 0.926889 | LAP3 |


In [12]:
# die_iso - TES level differential isoform expression test results
uns_key = swan.make_uns_key('die',
                            obs_col=obs_col, 
                            obs_conditions=obs_conditions,
                            die_kind='tes')
test = sg.adata.uns[uns_key]
test.head(2)

|  | gid | p_val | dpi | pos_iso_1 | pos_iso_2 | pos_iso_1_dpi | pos_iso_2_dpi | neg_iso_1 | neg_iso_2 | neg_iso_1_dpi | neg_iso_2_dpi | adj_p_val | gname |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | ENSG00000000419 | 1.0 | 0.096133 | ENSG00000000419_2 |  | 0.096133 |  | ENSG00000000419_1 |  | -0.09613 |  | 1.0 | DPM1 |
| 1 | ENSG00000001461 | 0.852914 | 9.093739 | ENSG00000001461_3 |  | 9.093739 |  | ENSG00000001461_4 | ENSG00000001461_1 | -5.543442 | -1.775148 | 1.0 | NIPAL3 |


In [13]:
test = sg.get_die_genes(kind='tes', obs_col=obs_col, 
                        obs_conditions=obs_conditions,
                        p=0.05, dpi=10)
test.head()

|  | gid | p_val | dpi | pos_iso_1 | pos_iso_2 | pos_iso_1_dpi | pos_iso_2_dpi | neg_iso_1 | neg_iso_2 | neg_iso_1_dpi | neg_iso_2_dpi | adj_p_val | gname |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 9 | ENSG00000003402 | 7.338098e-15 | 84.848486 | ENSG00000003402_5 | ENSG00000003402_6 | 77.272728 | 7.575758 | ENSG00000003402_9 | ENSG00000003402_7 | -31.717173 | -22.222223 | 5.296535e-13 | CFLAR |
| 10 | ENSG00000003436 | 2.772096e-07 | 32.637854 | ENSG00000003436_1 | ENSG00000003436_4 | 22.082711 | 10.555142 | ENSG00000003436_2 | ENSG00000003436_5 | -30.435921 | -1.818182 | 7.003007e-06 | TFPI |
| 14 | ENSG00000004487 | 0.001135407 | 75.714287 | ENSG00000004487_2 |  | 75.714287 |  | ENSG00000004487_1 |  | -75.714285 |  | 0.01141621 | KDM1A |
| 32 | ENSG00000006282 | 0.006858641 | 19.819595 | ENSG00000006282_2 |  | 19.819595 |  | ENSG00000006282_1 |  | -19.819595 |  | 0.04915014 | SPATA20 |
| 60 | ENSG00000010278 | 3.191221e-22 | 39.024387 | ENSG00000010278_2 |  | 39.024387 |  | ENSG00000010278_3 | ENSG00000010278_1 | -38.605976 | -0.41841 | 3.4861939999999995e-20 | CD9 |


Finally, for transcriptomes generated with [Cerberus](https://github.com/mortazavilab/cerberus), Swan automatically tracks intron chain usage, and you can perform intron chain switching analysis with `kind='ic'` as input to the `die_gene_test()` function. 

In [15]:
sg_brain = swan.read('swan_modelad.p')

# find genes that exhibit DIE for ICs between genotypes
die_table, die_results = sg_brain.die_gene_test(kind='ic',
                                                obs_col='genotype',
                                                obs_conditions=['b6n', '5xfad'])
die_table.head()

Read in graph from swan_modelad.p


|  | gid | p_val | dpi | pos_iso_1 | pos_iso_2 | pos_iso_1_dpi | pos_iso_2_dpi | neg_iso_1 | neg_iso_2 | neg_iso_1_dpi | neg_iso_2_dpi | adj_p_val | gname |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | ENCODEMG000055991 | 0.4100695 | 6.071428 | ENCODEMG000055991_2 | ENCODEMG000055991_3 | 3.571428 | 2.5 | ENCODEMG000055991_1 |  | -6.071426 |  | 0.707992 | ENCODEMG000055991 |
| 1 | ENCODEMG000055998 | 0.6190162 | 9.722223 | ENCODEMG000055998_2 |  | 9.722223 |  | ENCODEMG000055998_1 |  | -9.722218 |  | 0.8362 | ENCODEMG000055998 |
| 2 | ENCODEMG000056718 | 8.081718e-07 | 6.289159 |  |  | 2.32977 | 1.953835 |  |  | -3.339438 | -2.949721 | 2e-05 | ENCODEMG000056718 |
| 3 | ENCODEMG000056804 | 0.002107047 | 27.863049 | ENCODEMG000056804_1 |  | 27.863047 |  | ENCODEMG000056804_2 |  | -27.863049 |  | 0.02151 | ENCODEMG000056804 |
| 4 | ENCODEMG000063411 | 0.7699701 | 12.5 | ENCODEMG000063411_1 |  | 12.5 |  | ENCODEMG000063411_2 |  | -12.5 |  | 0.919808 | ENCODEMG000063411 |


## <a name="multi_gb"></a>Combining metadata columns

What if none of your metadata columns captures the comparison you want to make? For example, what if you want to find differentially expressed genes or transcripts, or isoform-switching genes, between hffc6 replicate 3 and hepg2 replicate 1?

In [16]:
sg.adata.obs.head()

| dataset (index) | cell_line | replicate | dataset | total_counts |
| --- | --- | --- | --- | --- |
| hepg2_1 | hepg2 | 1 | hepg2_1 | 499647.0 |
| hepg2_2 | hepg2 | 2 | hepg2_2 | 848447.0 |
| hffc6_1 | hffc6 | 1 | hffc6_1 | 761493.0 |
| hffc6_2 | hffc6 | 2 | hffc6_2 | 787967.0 |
| hffc6_3 | hffc6 | 3 | hffc6_3 | 614921.0 |


Let's ignore for a moment the fact that the `dataset` column does effectively capture both replicate as well as cell line metadata. This may not always be the case with more complex datasets. Swan has a function to concatenate columns together and add them as an additional column to the metadata tables. Use the following code to generate a new column that concatenates as many preexisting metadata columns as you wish:

In [4]:
col_name = sg.add_multi_groupby(['cell_line', 'replicate'])

print(col_name)
print(sg.adata.obs.head())

cell_line_replicate
        cell_line replicate  dataset  total_counts cell_line_replicate
dataset                                                               
hepg2_1     hepg2         1  hepg2_1      499647.0             hepg2_1
hepg2_2     hepg2         2  hepg2_2      848447.0             hepg2_2
hffc6_1     hffc6         1  hffc6_1      761493.0             hffc6_1
hffc6_2     hffc6         2  hffc6_2      787967.0             hffc6_2
hffc6_3     hffc6         3  hffc6_3      614921.0             hffc6_3
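Judging from the output above, the new column joins the values of the chosen columns with underscores and is named the same way. The pandas sketch below reproduces that behavior on a small stand-in for `sg.adata.obs`. It illustrates the idea only, and Swan's own implementation may differ in its details.

```python
import pandas as pd

# Stand-in for sg.adata.obs with just the two columns being combined.
obs = pd.DataFrame(
    {
        "cell_line": ["hepg2", "hepg2", "hffc6", "hffc6", "hffc6"],
        "replicate": [1, 2, 1, 2, 3],
    },
    index=pd.Index(
        ["hepg2_1", "hepg2_2", "hffc6_1", "hffc6_2", "hffc6_3"], name="dataset"
    ),
)

cols = ["cell_line", "replicate"]

# Name the new column after the columns it combines.
new_col = "_".join(cols)

# Cast every value to a string, then join each row's values with underscores.
obs[new_col] = obs[cols].astype(str).apply(lambda row: "_".join(row), axis=1)

print(new_col)
print(obs)
```

The same pattern extends to three or more columns by adding names to `cols`.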


The added column in `col_name` can then be used as the `obs_col` input to `die_gene_test()` as follows:

In [25]:
obs_col = col_name
obs_conditions = ['hffc6_3', 'hepg2_1']


die_table, die_results = sg.die_gene_test(kind='iso',
                                          obs_col=obs_col, 
                                          obs_conditions=obs_conditions)



## <a name="es_ir"></a>Exon skipping and intron retention

Swan can detect novel (unannotated) exon skipping and intron retention events. 
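As a rough picture of what these two event types mean, the sketch below works on plain coordinate tuples. An exon skipping event is taken to be an intron that spans an entire exon, and an intron retention event is an exon that spans an entire intron. The function names and coordinates are hypothetical. Swan itself works on the edges of the SwanGraph, so this is a conceptual illustration, not its algorithm.

```python
def skips_exon(intron, exons):
    """Return True if the intron (start, end) fully spans at least one exon."""
    i_start, i_end = intron
    return any(i_start < e_start and e_end < i_end for e_start, e_end in exons)


def retains_intron(exon, introns):
    """Return True if the exon (start, end) fully spans at least one intron."""
    e_start, e_end = exon
    return any(e_start < i_start and i_end < e_end for i_start, i_end in introns)


# Hypothetical annotated gene model on one strand: three exons, two introns.
annotated_exons = [(100, 200), (300, 400), (500, 600)]
annotated_introns = [(200, 300), (400, 500)]

# A novel intron joining exon 1 directly to exon 3 skips the middle exon.
novel_intron = (200, 500)
print(skips_exon(novel_intron, annotated_exons))

# A novel exon running from exon 1 through exon 2 retains the first intron.
novel_exon = (100, 400)
print(retains_intron(novel_exon, annotated_introns))

# An annotated intron does not span any exon.
print(skips_exon((200, 300), annotated_exons))
```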

To obtain a dataframe of novel exon skipping events, run the following code:

In [26]:
# returns a DataFrame of genes, transcripts, and specific edges in 
# the SwanGraph with novel exon skipping events
es_df = sg.find_es_genes(verbose=True)

Testing each novel edge for exon skipping:   0%|          | 0/855 [00:00<?, ?it/s]

Analyzing 855 intronic edges for ES


Testing each novel edge for exon skipping: 100%|██████████| 855/855 [1:26:06<00:00,  6.13s/it]

Found 529 novel es events in 149 transcripts.


In [27]:
es_df.head()

|  | gid | tid | edge_id |
| --- | --- | --- | --- |
| 0 | ENSG00000157916.19 | TALONT000218256 | 952616 |
| 0 | ENSG00000122406.13 | TALONT000425229 | 952716 |
| 0 | ENSG00000224093.5 | TALONT000434035 | 952720 |
| 0 | ENSG00000224093.5 | TALONT000434035 | 952720 |
| 0 | ENSG00000224093.5 | TALONT000434035 | 952720 |


To obtain a dataframe of novel intron retention events, run the following code:

In [28]:
# returns a DataFrame of genes, transcripts, and specific edges in 
# the SwanGraph with novel intron retaining events
ir_df = sg.find_ir_genes(verbose=True)


Analyzing 1186 exonic edges for IR

Found 35 novel ir events in 27 transcripts.


You can pass gene IDs from `es_df` into `gen_report()` or `plot_graph()` to visualize where these exon-skipping events are in gene reports or gene summary graphs respectively.
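The `es_df.head()` output above contains repeated rows, so it can help to collapse the table to unique events and unique gene IDs before plotting. The sketch below does this with plain pandas on the five rows shown. The variable names `unique_events` and `gids` are placeholders.

```python
import pandas as pd

# The five rows from es_df.head() above, rebuilt by hand for illustration.
es_df = pd.DataFrame(
    {
        "gid": [
            "ENSG00000157916.19",
            "ENSG00000122406.13",
            "ENSG00000224093.5",
            "ENSG00000224093.5",
            "ENSG00000224093.5",
        ],
        "tid": [
            "TALONT000218256",
            "TALONT000425229",
            "TALONT000434035",
            "TALONT000434035",
            "TALONT000434035",
        ],
        "edge_id": [952616, 952716, 952720, 952720, 952720],
    }
)

# Drop exact duplicate (gene, transcript, edge) rows and renumber the index.
unique_events = es_df.drop_duplicates().reset_index(drop=True)

# Unique gene IDs, in order of first appearance.
gids = unique_events["gid"].unique().tolist()

print(gids)
# Each ID in `gids` could then be passed to sg.plot_graph() or sg.gen_report();
# see the Swan visualization documentation for their arguments.
```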