# Plotting Pipeline Metrics

After the manuscript was submitted, the reviewer made the following comment:

*It might help if, at the various stages of the pipeline, the number of total sequences being considered is mentioned, to give an idea of what steps do what level of filtering for users hoping to apply this pipeline to other proteins.*

To address this, we'll draw a Sankey diagram showing how many sequences are retained and how many are filtered out at each step of the pipeline.

First, set up the colours, the node labels and a simple dictionary to hold the link data, then build the figure:

import plotly.graph_objects as go
import plotly.express as px

red = "rgb(183, 50, 57)"
link_colours = [px.colors.sequential.dense[i] for i in [0, 1, 2, 2, 3, 3, 5, 5, 7, 7, 9, 9, 11, 11]]

# One name and one display count per node, in node-index order
labels = [
    "Pfam Hits", "Structure Hits", "Total Hits",
    "Hits Mapped to Contigs", "Hits with No Contig",
    "Contigs with CDSes", "Contigs with No CDSes",
    "No Phage Hits", "Phage Hits",
    "Contigs > 25 kb", "Contigs < 25 kb",
    "Contigs > 10 Proteins", "Contigs < 10 Proteins",
    "No Phage Pfam Hits", "Phage Pfam Hits",
]
values = [
    "767,292", "2,881", "770,172",
    "643,975", "126,197",
    "372,241", "271,734",
    "339,854", "32,387",
    "17,341", "322,513",
    "13,031", "4,310",
    "1,548", "11,480",
]

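# Node labels: each node's name with its sequence count on a second line
node_labels = [f"{name}<br>{count}" for name, count in zip(labels, values)]

# One entry per link; source and target are indices into node_labels.
# The 15 node labels are kept out of this dict because it describes 14 links.
links = dict(
    source=[0, 1, 2, 2, 3, 3, 5, 5, 7, 7, 9, 9, 11, 11],
    target=[2, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
    value=[767292, 2881, 643975, 126197, 372241, 271734, 339854, 32387, 17341, 322513, 13031, 4310, 1548, 11480],
    color=link_colours,
)

# Start from the 14 link colours, then draw the nodes that hold
# filtered-out sequences in red
node_colours = link_colours.copy()

for i in [4, 6, 8, 10, 12]:
    node_colours[i] = red
node_colours.append(red)  # node 14, "Phage Pfam Hits"

fig = go.Figure(data=[go.Sankey(
    node=dict(pad=15, thickness=20, label=node_labels, color=node_colours),
    link=links,
)])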

fig.update_layout(
    width=1500,
    height=800,
    font=dict(size=18, color="black"),
    template='plotly_white'
)
fig.show()
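
As a numeric complement to the diagram, the sketch below computes what share of each node's sequences follows each outgoing link, and checks that every intermediate node passes on as many sequences as it receives. It is not part of the original analysis: the lists are copied from the cell above, and the names `inflow`, `outflow`, `shares` and `mismatches` are introduced here purely for illustration.

```python
from collections import defaultdict

# Node names and link data copied from the plotting cell.
labels = [
    "Pfam Hits", "Structure Hits", "Total Hits",
    "Hits Mapped to Contigs", "Hits with No Contig",
    "Contigs with CDSes", "Contigs with No CDSes",
    "No Phage Hits", "Phage Hits",
    "Contigs > 25 kb", "Contigs < 25 kb",
    "Contigs > 10 Proteins", "Contigs < 10 Proteins",
    "No Phage Pfam Hits", "Phage Pfam Hits",
]
source = [0, 1, 2, 2, 3, 3, 5, 5, 7, 7, 9, 9, 11, 11]
target = [2, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
value = [767292, 2881, 643975, 126197, 372241, 271734, 339854,
         32387, 17341, 322513, 13031, 4310, 1548, 11480]

# Total number of sequences entering and leaving each node.
inflow = defaultdict(int)
outflow = defaultdict(int)
for s, t, v in zip(source, target, value):
    outflow[s] += v
    inflow[t] += v

# Share of a source node's outgoing sequences carried by each link.
shares = {(s, t): v / outflow[s] for s, t, v in zip(source, target, value)}
for (s, t), share in shares.items():
    print(f"{labels[s]} -> {labels[t]}: {share:.1%}")

# Intermediate nodes (those with both incoming and outgoing links)
# should be balanced; record inflow minus outflow where they are not.
mismatches = {
    labels[n]: inflow[n] - outflow[n]
    for n in sorted(inflow.keys() & outflow.keys())
    if inflow[n] != outflow[n]
}
print("Unbalanced nodes:", mismatches)
```

With the counts as given, the smallest share belongs to the "Contigs > 25 kb" link, so the length cut-off is the most stringent single filter. The balance check also reports a difference of 1 at "Total Hits" and of 3 at "Contigs > 10 Proteins". The notebook does not explain these differences, so they are worth comparing against the pipeline's own output before the figures are quoted in the manuscript.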