Skip to content

Latest commit

 

History

History
1280 lines (1096 loc) · 29.2 KB

data_structure.md

File metadata and controls

1280 lines (1096 loc) · 29.2 KB

Swan data structure

A SwanGraph consists of several different parts that can be used individually. This page serves as an overview of several of the important features of a SwanGraph.

Table of contents

We'll be using the same SwanGraph from the rest of the tutorial to examine how data is stored in the SwanGraph. Load it using the following code:

import swan_vis as swan

# code to download this data is in the Getting started tutorial
sg = swan.read('../tutorials/data/swan.p')
Read in graph from ../tutorials/data/swan.p

Genomic location information

Swan stores information on individual genomic locations that eventually are plotted as nodes in the SwanGraph in the SwanGraph.loc_df pandas DataFrame. The information in the DataFrame and the column names are described below:

  • chromosomal coordinates (chrom, coord)
  • whether or not the genomic location is present in the provided reference annotation (annotation)
  • what role the location plays in the transcript(s) that it is part of (internal, TSS, TES)
  • internal identifier in the SwanGraph (vertex_id)
sg.loc_df.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
chrom coord vertex_id annotation internal TSS TES
vertex_id
0 chr1 11869 0 True False True False
1 chr1 12010 1 True False True False
2 chr1 12057 2 True True False False
3 chr1 12179 3 True True False False
4 chr1 12227 4 True True False False

Intron / exon location information

Swan stores information about the exons and introns that are eventually plotted as edges in the SwanGraph in the SwanGraph.edge_df pandas DataFrame. The information in the DataFrame and the column names are described below:

  • internal vertex ids from SwanGraph.loc_df that bound each edge (v1, v2)
  • strand that this edge is from (strand)
  • whether this edge is an intron or an exon (edge_type)
  • whether or not the edge is present in the provided reference annotation (annotation)
  • internal identifier in the SwanGraph (edge_id)
sg.edge_df.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
v1 v2 strand edge_type edge_id annotation
edge_id
0 0 4 + exon 0 True
5 1 2 + exon 5 True
6 2 3 + intron 6 True
7 3 4 + exon 7 True
1 4 5 + intron 1 True

Transcript information

Swan stores information about the transcripts from the annotation and added transcriptome in the SwanGraph.t_df pandas DataFrame. The information in the DataFrame and the column names are described below:

  • transcript ID from the GTF (tid)
  • transcript name from the GTF, if provided (tname)
  • gene ID from the GTF (gid)
  • gene name from the GTF, if provided (gname)
  • path of edges (edge_ids from SwanGraph.edge_df) that make up the transcript (path)
  • path of genomic locations (vertex_ids from SwanGraph.loc_df) that make up the transcript (loc_path)
  • whether or not the transcript is present in the provided reference annotation (annotation)
  • novelty category of the transcript, if provided (novelty)
sg.t_df.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
tname gid gname path tid loc_path annotation novelty
tid
ENST00000000233.9 ARF5-201 ENSG00000004059 ARF5 [377467, 377468, 377469, 377470, 377471, 37747... ENST00000000233.9 [827256, 827261, 827264, 827265, 827266, 82726... True Known
ENST00000000412.7 M6PR-201 ENSG00000003056 M6PR [555507, 555495, 555496, 555497, 555498, 55550... ENST00000000412.7 [184557, 184551, 184547, 184542, 184541, 18453... True Known
ENST00000000442.10 ESRRA-201 ENSG00000173153 ESRRA [520219, 520207, 520208, 520209, 520210, 52021... ENST00000000442.10 [149944, 149946, 149951, 149952, 149955, 14995... True Known
ENST00000001008.5 FKBP4-201 ENSG00000004478 FKBP4 [550369, 550370, 550371, 550372, 550373, 55037... ENST00000001008.5 [179573, 179578, 179588, 179591, 179592, 17959... True Known
ENST00000001146.6 CYP26B1-201 ENSG00000003137 CYP26B1 [111085, 111086, 111087, 111088, 111078, 11107... ENST00000001146.6 [510480, 510478, 510476, 510475, 510472, 51047... True Known

AnnData

Swan stores abundance information for transcripts, TSSs, TESs, and edges using the AnnData data format. This allows for tracking of abundance information using multiple metrics, storage of complex metadata, and direct compatibility with plotting and analysis using Scanpy. Since there's a lot of information online about these data formats, I'll just go over the specifics that Swan uses.

General AnnData format

The basic AnnData format is comprised of:

  • AnnData.obs - pandas DataFrame - information and metadata about the samples / cells / datasets
  • AnnData.var - pandas DataFrame - information about the variables being measured (ie genes, transcripts etc.)
  • AnnData.X - numpy array - information about expression of each variable in each sample

In Swan, the expression data is stored in three different formats that can be accessed through different layers:

  • AnnData.layers['counts'] - raw counts of each variable in each sample
  • AnnData.layers['tpm'] - transcripts per million calculated per sample
  • AnnData.layers['pi'] - percent isoform use per gene (only calculated for transcripts, TSS, TES)

Transcript AnnData

You can access transcript expression information using SwanGraph.adata.

The variable information stored is just the transcript ID but can be merged with SwanGraph.t_df for more information.

sg.adata.var.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
tid
tid
ENST00000000233.9 ENST00000000233.9
ENST00000000412.7 ENST00000000412.7
ENST00000000442.10 ENST00000000442.10
ENST00000001008.5 ENST00000001008.5
ENST00000001146.6 ENST00000001146.6

The metadata information that has been added to the SwanGraph along with the initial dataset name from the column names of the added abundance table.

sg.adata.obs.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
dataset cell_line replicate cell_line_replicate
index
hepg2_1 hepg2_1 hepg2 1 hepg2_1
hepg2_2 hepg2_2 hepg2 2 hepg2_2
hffc6_1 hffc6_1 hffc6 1 hffc6_1
hffc6_2 hffc6_2 hffc6 2 hffc6_2
hffc6_3 hffc6_3 hffc6 3 hffc6_3

The expression information are stored in SwanGraph.adata.layers['counts'], SwanGraph.adata.layers['tpm'], and SwanGraph.adata.layers['pi'] for raw counts, TPM, and percent isoform (pi) respectively.

print(sg.adata.layers['counts'][:5, :5])
print(sg.adata.layers['tpm'][:5, :5])
print(sg.adata.layers['pi'][:5, :5])
[[ 98.  43.   4.  23.   0.]
 [207.  66.   6.  52.   0.]
 [100. 148.   0.  82.   0.]
 [108. 191.   0.  98.   0.]
 [ 91. 168.   2. 106.   0.]]
[[196.13847    86.06076     8.005652   46.032497    0.       ]
 [243.97517    77.789185    7.071744   61.28845     0.       ]
 [131.32097   194.35504     0.        107.6832      0.       ]
 [137.06158   242.39594     0.        124.37069     0.       ]
 [147.9865    273.20584     3.2524502 172.37987     0.       ]]
[[100.       100.       100.       100.         0.      ]
 [ 99.519226 100.        60.000004 100.         0.      ]
 [ 98.039215 100.         0.       100.         0.      ]
 [ 99.08257  100.         0.       100.         0.      ]
 [100.       100.       100.       100.         0.      ]]

Edge AnnData

You can access edge expression information using SwanGraph.edge_adata.

The variable information stored is just the edge ID but can be merged with SwanGraph.edge_df for more information.

sg.edge_adata.var.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
edge_id
edge_id
0 0
5 5
6 6
7 7
1 1

The metadata information that has been added to the SwanGraph along with the initial dataset name from the column names of the added abundance table. It should be identical to SwanGraph.adata.obs.

sg.edge_adata.obs.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
dataset cell_line replicate
index
hepg2_1 hepg2_1 hepg2 1
hepg2_2 hepg2_2 hepg2 2
hffc6_1 hffc6_1 hffc6 1
hffc6_2 hffc6_2 hffc6 2
hffc6_3 hffc6_3 hffc6 3

And similarly, counts and TPM of each edge are stored in SwanGraph.edge_adata.layers['counts'] and SwanGraph.edge_adata.layers['tpm']. This data is very sparse though so it shows up as all zeroes here!

print(sg.edge_adata.layers['counts'][:5, :5])
print(sg.edge_adata.layers['tpm'][:5, :5])
[[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]
[[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]

TSS / TES AnnData

You can access TSS and TES expression information using SwanGraph.tss_adata and SwanGraph.tes_adata respectively.

Unlike the other AnnDatas for edge and transcript expression, the AnnData.var table hold more information:

  • automatically-generated TSS or TES id, which is made up of the gene ID the TSS or TES belongs to and its number (tss_id or tes_id)
  • gene ID that the TSS / TES belongs to (gid)
  • gene name that the TSS / TES belongs to, if provided (gname)
  • vertex ID from SwanGraph.loc_df that the TSS / TES came from (vertex_id)
  • automatically-generated TSS or TES id, which is made up of the gene name (if provided) that the TSS or TES belongs to and its number (tss_name or tes_name)
print(sg.tss_adata.var.head())
print(sg.tes_adata.var.head())
                                     gid   gname  vertex_id  tss_name
tss_id                                                               
ENSG00000000003.14_1  ENSG00000000003.14  TSPAN6     926111  TSPAN6_1
ENSG00000000003.14_2  ENSG00000000003.14  TSPAN6     926112  TSPAN6_2
ENSG00000000003.14_3  ENSG00000000003.14  TSPAN6     926114  TSPAN6_3
ENSG00000000003.14_4  ENSG00000000003.14  TSPAN6     926117  TSPAN6_4
ENSG00000000005.5_1    ENSG00000000005.5    TNMD     926077    TNMD_1
                                     gid   gname  vertex_id  tes_name
tes_id                                                               
ENSG00000000003.14_1  ENSG00000000003.14  TSPAN6     926092  TSPAN6_1
ENSG00000000003.14_2  ENSG00000000003.14  TSPAN6     926093  TSPAN6_2
ENSG00000000003.14_3  ENSG00000000003.14  TSPAN6     926097  TSPAN6_3
ENSG00000000003.14_4  ENSG00000000003.14  TSPAN6     926100  TSPAN6_4
ENSG00000000003.14_5  ENSG00000000003.14  TSPAN6     926103  TSPAN6_5

Again the metadata in SwanGraph.tss_adata and SwanGraph.tes_adata should be identical to the metadata in the other AnnDatas.

print(sg.tss_adata.obs.head())
print(sg.tes_adata.obs.head())
         dataset cell_line replicate
index                               
hepg2_1  hepg2_1     hepg2         1
hepg2_2  hepg2_2     hepg2         2
hffc6_1  hffc6_1     hffc6         1
hffc6_2  hffc6_2     hffc6         2
hffc6_3  hffc6_3     hffc6         3
         dataset cell_line replicate
index                               
hepg2_1  hepg2_1     hepg2         1
hepg2_2  hepg2_2     hepg2         2
hffc6_1  hffc6_1     hffc6         1
hffc6_2  hffc6_2     hffc6         2
hffc6_3  hffc6_3     hffc6         3

And finally, expression data for each TSS / TES are stored in the following layers: SwanGraph.tss_adata.layers['counts'], SwanGraph.tss_adata.layers['tpm'], SwanGraph.tss_adata.layers['pi'], SwanGraph.tes_adata.layers['counts'], SwanGraph.tes_adata.layers['tpm'], SwanGraph.tes_adata.layers['pi']

r = 5
start_c = 20
end_c = 25
print(sg.tss_adata.layers['counts'][:r, start_c:end_c])
print(sg.tss_adata.layers['tpm'][:r, start_c:end_c])
print(sg.tss_adata.layers['pi'][:r, start_c:end_c])
print()
print(sg.tes_adata.layers['counts'][:r, start_c:end_c])
print(sg.tes_adata.layers['tpm'][:r, start_c:end_c])
print(sg.tes_adata.layers['pi'][:r, start_c:end_c])
[[  0.   0.   0.   0. 129.]
 [  0.   0.   0.   0. 323.]
 [  9.   0.   0.   0. 212.]
 [ 16.   0.   0.   0. 173.]
 [  7.   0.   0.   0. 123.]]
[[  0.         0.         0.         0.       258.18228 ]
 [  0.         0.         0.         0.       380.69556 ]
 [ 11.818888   0.         0.         0.       278.40045 ]
 [ 20.305418   0.         0.         0.       219.55234 ]
 [ 11.383576   0.         0.         0.       200.0257  ]]
[[  0.   0.   0.   0. 100.]
 [  0.   0.   0.   0. 100.]
 [100.   0.   0.   0. 100.]
 [100.   0.   0.   0. 100.]
 [100.   0.   0.   0. 100.]]

[[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]]
[[0.        0.        0.        0.        0.       ]
 [0.        0.        0.        0.        0.       ]
 [0.        0.        0.        0.        0.       ]
 [0.        0.        0.        0.        0.       ]
 [0.        1.6262251 0.        0.        0.       ]]
[[  0.   0.   0.   0.   0.]
 [  0.   0.   0.   0.   0.]
 [  0.   0.   0.   0.   0.]
 [  0.   0.   0.   0.   0.]
 [  0. 100.   0.   0.   0.]]

Intron chain AnnData

In the case that the transcriptome you added is from Cerberus or uses Cerberus-style transcript IDs (ie. <gene_id>[1,1,1]), Swan will also calculate intron chain counts and TPM automatically. These are stored in SwanGraph.ic_adata.

sg = swan.read('../tutorials/data/swan_modelad.p')
sg.ic_adata.var.tail()
Read in graph from ../tutorials/data/swan_modelad.p
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
gid gname ic_name n_cells
ic_id
ENSMUSG00000118369_2 ENSMUSG00000118369 Gm30541 Gm30541_2 14
ENSMUSG00000118380_3 ENSMUSG00000118380 Gm36037 Gm36037_3 1
ENSMUSG00000118382_1 ENSMUSG00000118382 Gm8373 Gm8373_1 2
ENSMUSG00000118383_1 ENSMUSG00000118383 Gm50321 Gm50321_1 14
ENSMUSG00000118390_1 ENSMUSG00000118390 Gm50102 Gm50102_1 1

Current plotted graph information

To reduce run time for generating gene reports, Swan stores the subgraph that is used to generate plots for any specific gene in SwanGraph.pg. This object is very similar to the parent SwanGraph object. It has a loc_df, edge_df, and t_df that just consist of the nodes, edges, and transcripts that make up a specific gene. This data structure can be helpful for understanding what is going on in generated plots as the node labels are not consistent with the display labels in Swan plots.

For instance, let's again plot ADRM1.

sg.plot_graph('ADRM1')

png

In SwanGraph.pg.loc_df, you can see what genomic location each node plotted in the gene's graph corresponds to:

sg.pg.loc_df.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
chrom coord vertex_id annotation internal TSS TES color edgecolor linewidth
vertex_id
0 chr20 62302093 0 True False True False tss None None
1 chr20 62302142 1 True True False False internal None None
2 chr20 62302896 2 True True True False tss None None
3 chr20 62303045 3 False True False False internal None None
4 chr20 62303049 4 True True False False internal None None

In SwanGraph.pg.edge_df, you can see information about each edge, indexed by the subgraph vertex IDs from SwanGraph.pg.loc_df:

sg.pg.edge_df.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
v1 v2 strand edge_type edge_id annotation curve color line
edge_id
884037 0 1 + exon 884037 True arc3,rad=4.000000000000002 exon None
884038 1 2 + intron 884038 True arc3,rad=-3.9999999999999964 intron None
884039 2 4 + exon 884039 True arc3,rad=1.9999999999999996 exon None
884040 4 6 + intron 884040 True arc3,rad=-2.000000000000001 intron None
884041 6 7 + exon 884041 True arc3,rad=3.9999999999999964 exon None

And finally, SwanGraph.pg.t_df holds the information about each transcript in the gene:

sg.pg.t_df.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
tname gid gname path tid loc_path annotation novelty
tid
ENST00000253003.6 ADRM1-201 ENSG00000130706.12 ADRM1 [884039, 884040, 884041, 884042, 884043, 88404... ENST00000253003.6 [2, 4, 6, 7, 9, 10, 11, 12, 13, 14, 15, 16, 17... True Known
ENST00000462554.2 ADRM1-202 ENSG00000130706.12 ADRM1 [884060, 884058, 884045, 884046, 884047] ENST00000462554.2 [5, 7, 11, 12, 13, 14] True Known
ENST00000465805.2 ADRM1-203 ENSG00000130706.12 ADRM1 [884061, 884044, 884045, 884046, 884047] ENST00000465805.2 [8, 10, 11, 12, 13, 14] True Known
ENST00000491935.5 ADRM1-204 ENSG00000130706.12 ADRM1 [884037, 884038, 884039, 884040, 884041, 88404... ENST00000491935.5 [0, 1, 2, 4, 6, 7, 9, 10, 11, 12, 13, 14, 15, ... True Known
ENST00000620230.4 ADRM1-205 ENSG00000130706.12 ADRM1 [884039, 884040, 884041, 884058, 884045, 88404... ENST00000620230.4 [2, 4, 6, 7, 11, 12, 13, 14, 15, 16, 17, 18, 1... True Known