Skip to content

hoelzer/sankey

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

19 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

sankey

Build a Sankey plot (in HTML format) from an input TSV file.

Example

1) translate input TSV into JSON format

Example TSV

12	root	Viruses	Caudovirales	Siphoviridae	Fromanvirus	unclassified Fromanvirus	Mycobacterium phage Naca
10	root	Viruses	Caudovirales	Siphoviridae	Dismasvirus	unclassified Dismasvirus	Microbacterium phage Didgeridoo
8	root	Viruses	unclassified    Baculoviridae	Betabaculovirus	unclassified Betabaculovirus	Spodoptera litura granulovirus

Important: At the moment your TSV needs to have the same amount of columns for each row. Insert unclassified XXX if you are missing a rank.

Here, we have identified 12 Mycobacterium phage Naca, 10 Microbacterium phage Didgeridoo, and 8 litura granulovirus. We will use a script to build the JSON format that looks like this:

Example JSON

{"nodes":[
    {"name":"Viruses","id":0},
    {"name":"Caudovirales","id":1},
    {"name":"Siphoviridae","id":2},
    {"name":"Fromanvirus","id":3},
    {"name":"unclassified Fromanvirus","id":4},
    {"name":"Mycobacterium phage Naca","id":5},
    {"name":"Dismasvirus","id":6},
    {"name":"unclassified Dismasvirus","id":7},
    {"name":"Microbacterium phage Didgeridoo","id":8},
    {"name":"unclassified","id":9},
    {"name":"Baculoviridae","id":10},
    {"name":"Betabaculovirus","id":11},
    {"name":"Spodoptera","id":12},
    {"name":"litura granulovirus","id":13}
    ],
    "links":[
    {"source":0,"target":1,"value":22},
    {"source":0,"target":9,"value":8},
    {"source":1,"target":2,"value":22},
    {"source":2,"target":3,"value":12},
    {"source":3,"target":4,"value":12},
    {"source":4,"target":5,"value":12},
    {"source":2,"target":6,"value":10},
    {"source":6,"target":7,"value":10},
    {"source":7,"target":8,"value":10},
    {"source":9,"target":10,"value":8},
    {"source":10,"target":11,"value":8},
    {"source":11,"target":12,"value":8},
    {"source":12,"target":13,"value":8}
    ]}

Run:

ruby tsv2json.rb test/viruses.tsv 200

or to specifically include or exclude certain values regardless of the cutoff (e.g. 200), respectively:

ruby tsv2json.rb test.tsv 200 '[B.1.351.2,B.1.351.3,B.1.617.2,B.1.617.1,B.1.617.3]' '[B.1.177.86,B.1.177.81,B.1.177.62,B.1.258.17,A.27,B.1.221,B.1.525,B.1.1.318,B.1.160,B.1.1.317,B.1.258]'
  • first list includes
  • second list excludes

You should apply a cutoff (here 200) depending on your input because otherwise the Sankey plot will become to large. You can test different cutoffs.

The resulting .json file can be used to plot the Sankey.

2) Sankey plot

Based on https://github.com/fbreitwieser/sankeyD3.

Install R and dependencies...

conda create -n sankey -c r r-base pandoc
conda activate sankey
R
# basics needed for Sankey HTML
install.packages('devtools')
devtools::install_github("fbreitwieser/sankeyD3")

... or use this Docker environment

docker run --rm -it -v $PWD:$PWD -w $PWD nanozoo/sankey_plot:0.12.3--8cf7f6a /bin/bash

Generate Sankey via interactive R session

Use a conda environment or the Docker.

R
library(sankeyD3)
library(magrittr)

Taxonomy <- jsonlite::fromJSON("test/viruses.tsv.json")

# show in browser
sankeyNetwork(Links = Taxonomy$links, Nodes = Taxonomy$nodes, Source = "source", 
    Target = "target", Value = "value", NodeID = "name", units = "count", 
    fontSize = 22, nodeWidth = 30, nodeShadow = TRUE, nodePadding = 30, 
    nodeStrokeWidth = 1, nodeCornerRadius = 10, dragY = TRUE, dragX = TRUE, 
    numberFormat = ",.3g")

# print to HTML file
sankeyNetwork(Links = Taxonomy$links, Nodes = Taxonomy$nodes, Source = "source", 
    Target = "target", Value = "value", NodeID = "name", units = "count", 
    fontSize = 22, nodeWidth = 30, nodeShadow = TRUE, nodePadding = 30, 
    nodeStrokeWidth = 1, nodeCornerRadius = 10, dragY = TRUE, dragX = TRUE, 
    numberFormat = ",.3g") %>% saveNetwork(file = 'viruses_sankey.html')

Apply orderByPath = TRUE if the child nodes should be ordered by path instead of their size.