Special case for bin_width=1 sequence output #88

josiahseaman · 2020-03-19T17:16:43Z

At --bin_width=1 the JSON produced looks ridiculous. For supporting Schematize, this is going to be a common use case. Every time we load in a graph with odgi we'll ask for bin width = 1, 10, 100, 1000, 10000, 100000 at minimum.

Currently, the output would look like {id=1, seq='C'}{id=2, seq=G}{id=3, seq='T'}. That's 43 characters for "CGT". In this special case, index in the string +1 is exactly the same as bin id, so they don't need to be listed.

New output should be:
{pangenome_sequence="CGTACGTACGTACGTACTACTCAGCTAGCTAGCTACGTCGAGTCTTACTCTAGATC"} {path_name="ATH17...
No bin declarations should be included. We'll also make a special case for bin_width=1 in schematize to look for the unique key "pangenome_sequence" which should only occur once in the file. Besides this and the lack of bin declarations, the rest of the file can be the same including the bin_ids in the path traversals and links.
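
As a minimal sketch of the "index in the string +1 equals bin id" observation, a reader could rebuild the per-bin records from the single string. The function name is hypothetical and the record layout only mirrors the example above; it is not the actual Schematize or odgi code.

```python
def bins_from_pangenome_sequence(pangenome_sequence):
    """Rebuild per-bin records from one sequence string (bin_width=1 only)."""
    # Assumption: bin ids are 1-based, so bin id = string index + 1.
    return [{"id": i + 1, "seq": base}
            for i, base in enumerate(pangenome_sequence)]


bins = bins_from_pangenome_sequence("CGT")
```

Since the ids are fully implied by the position in the string, nothing is lost by leaving out the bin declarations.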

See also #49 #86

josiahseaman · 2020-03-19T17:20:27Z

Thought should be given to where else this problem reoccurs. I had originally intended on making bin steps powers of 2 or 4, not 10. Bin declarations at 2,4,8,16,32 would also be space inefficient with separate bin declarations. In contrast, bin_id is always calculable by simply dividing the pangenome coordinate by the bin_width. Perhaps it's a better alternative to make "pangenome_sequence" the default behavior for all bin sizes?
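
A small sketch of the "bin_id is calculable" point, assuming 0-based pangenome coordinates and 1-based bin ids (which matches the bin_width=1 case described above). The function name is made up for illustration.

```python
def bin_id_for(position, bin_width):
    """Return the 1-based bin id covering a 0-based pangenome position."""
    if bin_width < 1:
        raise ValueError("bin_width must be at least 1")
    # Integer division finds the bin; +1 because ids are assumed 1-based.
    return position // bin_width + 1


first_bin = bin_id_for(0, 1)
third_bin_width_4 = bin_id_for(8, 4)
```

If coordinates turn out to be 1-based in odgi, the formula would need `(position - 1) // bin_width + 1` instead.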

subwaystation · 2020-03-20T09:35:07Z

@josiahseaman @superjox I would propose that we decouple the pangenome sequence completely from the bin.json. It will become a FASTA file with the pangenome sequence in it. So we only have to generate the FASTA once for all bin sizes. We can index the FASTA with e.g. samtools or integrate such functionality into our C++ code via seqan3 or maybe htslib. I did not find any code for Python that would be able to create a FASTA index.
Having such an index we could use file seeks to extract the desired sequences very fast out of an arbitrary sized pangenome sequence. BUT I am not sure if a file seek is possible on the client side.
We would need something like https://www.npmjs.com/package/fs-ext to do that. But it is a nodejs package and I am not sure if it will work on the client side. I will find out.
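
To make the "index plus file seek" idea concrete, here is a rough sketch in plain Python. It records the same four numbers per sequence that a samtools `.fai` file stores (length, byte offset, bases per line, bytes per line) and uses them to seek. The function names and the tiny test file are assumptions for illustration, and it expects every line of a record except the last to have the same width.

```python
import os
import tempfile


def build_fasta_index(fasta_path):
    """Return {name: (length, offset, line_bases, line_bytes)} per record."""
    index = {}
    name = None
    with open(fasta_path, "rb") as handle:
        while True:
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                # Assumes a non-empty header; the name is its first token.
                name = line[1:].split()[0].decode()
                index[name] = [0, handle.tell(), 0, 0]
            elif name is not None and line.strip():
                entry = index[name]
                bases = len(line.rstrip(b"\r\n"))
                if entry[2] == 0:  # first sequence line fixes the geometry
                    entry[2], entry[3] = bases, len(line)
                entry[0] += bases
    return {key: tuple(value) for key, value in index.items()}


def fetch(fasta_path, index, name, start, end):
    """Return bases [start, end) of record `name` (0-based) via file seek."""
    length, offset, line_bases, line_bytes = index[name]
    end = min(end, length)
    if start >= end:
        return ""
    # Translate base coordinates into byte positions, skipping newlines.
    first = offset + (start // line_bases) * line_bytes + start % line_bases
    last = offset + ((end - 1) // line_bases) * line_bytes + (end - 1) % line_bases
    with open(fasta_path, "rb") as handle:
        handle.seek(first)
        raw = handle.read(last - first + 1)
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pangenome.fasta")
    with open(path, "w", newline="\n") as out:
        out.write(">pangenome\nCGTACGTA\nCGTACG\n")
    fai = build_fasta_index(path)
    window = fetch(path, fai, "pangenome", 6, 10)
```

Only the bytes covering the requested window are read, so the cost does not depend on the total size of the pangenome sequence.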

subwaystation · 2020-03-20T09:36:06Z

Furthermore, I would expect that the JBrowse2 people will have built in such functionality at some point. We should ask Robert about that.

subwaystation · 2020-03-20T10:33:10Z

So the fs-ext functionality is only available on the server side. I tried https://developers.google.com/web/updates/2011/08/Seek-into-local-files-with-the-File-System-API and https://www.html5rocks.com/en/tutorials/file/filesystem/ but I could not get either to run.
@josiahseaman Do you have any idea how one could do a file seek? Maybe via an AJAX request?
Else we might have to add that functionality on the server side.
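
One possible reading of "a file seek via an AJAX request" is an HTTP Range request, which asks the server for only a byte window of a static file. The sketch below only builds such a request in Python (the thread's client would do the equivalent from the browser); the URL is a placeholder, and it assumes the server honours Range headers.

```python
from urllib.request import Request


def range_request(url, start, length):
    """Build a GET request for `length` bytes beginning at byte `start`."""
    # HTTP byte ranges are inclusive on both ends.
    end = start + length - 1
    return Request(url, headers={"Range": f"bytes={start}-{end}"})


# Hypothetical URL; nothing is sent over the network here.
req = range_request("https://example.org/pangenome.fasta", 17, 5)
```

A server that supports this answers with status 206 and just the requested bytes, so combined with a FASTA index no server-side code would be needed.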

josiahseaman · 2020-03-20T14:21:38Z

@subwaystation Your comment is very insightful. I agree that in the context of multiple file outputs like we use in component_segmentation this is the right course of action for the browser and javascript. However, to date, odgi bin is only outputting one file. We could do the file split inside of component_segmentation along with everything else. It'd be a more minor code change in odgi. I'd find either location acceptable. But if we change odgi to make a separate pangenome_sequence.fasta then we should change the output to make a directory instead of just similarly named files.

@ekg I see this fitting well with a separate change, which is to request a list of bin_widths instead of just one. That way we request the entire operation of reading in the graph and outputting layers of different bin_width JSON and then a single pangenome sequence FASTA to populate any bin_width. Either way these are going to be large files. However, it makes a lot of sense to not duplicate the sequence unnecessarily, which is currently happening.
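
A short sketch of how one shared sequence could populate bins at any width, so the sequence never has to be repeated per bin_width. It assumes 1-based bin ids and a sequence held as one string; the names are hypothetical.

```python
def bin_sequence(pangenome_sequence, bin_id, bin_width):
    """Return the bases covered by a 1-based bin id at the given width."""
    start = (bin_id - 1) * bin_width
    # Slicing past the end is safe, so the last bin may be shorter.
    return pangenome_sequence[start:start + bin_width]


sequence = "CGTACGTACG"
# The same string serves every requested bin width.
second_bin_by_width = {width: bin_sequence(sequence, 2, width)
                       for width in (1, 2, 4)}
```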

subwaystation · 2020-03-20T14:41:28Z

@josiahseaman Could you please elaborate? You want odgi bin to only output split files? What benefit would we then get? I am a little bit confused here. Wouldn't that break the component algorithm?

But let me apply some wishful thinking. Ideally, we have an odgi server that gives you the segmented JSON ready for direct visualization.
Request types:

Pangenome position + bin width
Path name + nucleotide position + bin width

Response:

JSON with components ready for visualization

So in odgi we would prepare the bin data structure only for the requested subpart of the pangenome. I am not sure how or if this is possible in odgi. I will discuss later with @ekg . Another requirement would be to port a huge part of the component_segmentation code to C++.
The benefits would be:

One repo less, so less code to maintain and less complexity of the project.
No intermediate files to save. For large data sets that would mean much less IO in the range of tens of GBytes.
C++ might be faster than Python.
Go Wasm. We could get all our code to run in JavaScript. Implemented nicely, we might not even need a server component and could stick to the JBrowse2 concept.

Possible issues:

My idea might not scale on the odgi side. But we don't know that.
This would involve some serious porting work from Python to C++.
More people are familiar with Python compared to C++.

But I am open to better solutions!

josiahseaman · 2020-03-21T18:30:23Z

An important lesson I've learned from decades of software development is to watch out for scope creep. It can kill a project and make it feel as if nothing is progressing, despite lots of hard work. This started off as a single special case for bin_width=1 and it can be easily solved by simply outputting a FASTA file. That doesn't require porting an entire program to another language and restructuring our data pipeline. Let's do only what's required to get the result we want. Don't ever try to optimize something until you actually run into a constraint caused by some design decision.

If it's not possible to seek within a large file, then all we have to do is break the FASTA into 2MB chunk files. We already have all the code for doing this in Python alongside component_segmentation. That implementation would take maybe an hour or two of development. Reworking all of component_segmentation would likely take 2 weeks of precious developer time.
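
Roughly, the chunking could look like the sketch below. This is not the existing component_segmentation code, just an illustration under the assumption of fixed-size chunks; the function names and the default size are made up.

```python
def chunk_sequence(sequence, chunk_size=2 * 1024 * 1024):
    """Yield (chunk_index, start_offset, text) pieces of at most chunk_size."""
    for index, start in enumerate(range(0, len(sequence), chunk_size)):
        yield index, start, sequence[start:start + chunk_size]


def chunk_for_position(position, chunk_size=2 * 1024 * 1024):
    """Return (chunk_index, offset_within_chunk) for a 0-based position."""
    return position // chunk_size, position % chunk_size


# Tiny chunk size so the behaviour is easy to check by eye.
chunks = list(chunk_sequence("CGTACGTACG", chunk_size=4))
location = chunk_for_position(9, chunk_size=4)
```

With fixed-size chunks the client can work out which file to request from the position alone, so no index is needed.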

For simplicity, I would prefer odgi outputs only a single FASTA file (not chunked) so that other tools don't have to de-chunk the file in order to use it effectively.

subwaystation · 2020-03-23T14:12:32Z

Sometimes, sustainable and user-friendly software implementations do not feel like something is progressing. They do not have to, in my opinion. But, afterwards, the reward is even bigger.
With my proposition, we would greatly enhance the user experience and reduce code complexity. Potential developers would be presented with a clearer path for contribution. More importantly, it would also solve our zooming problem. And we would not have to create tens of GBytes of files for each bin width.
So we would save time, disk space and human resources.

If we want to go the JBrowse2 way, we need to have the code all in Javascript (not realistic) or in C++ (for compilation to Javascript) anyhow. Or do you have any other long term plan?

Breaking the FASTA into several chunks is a clever idea to outmaneuver the need for a FASTA index.
I can implement a parameter that allows the storage of the pangenome sequence in a specified FASTA file. This represents a fast solution to our problem.
But I do not think it is a sustainable way to move things forward.

josiahseaman · 2020-03-23T17:37:20Z

Oh, to clarify, odgi and component_segmentation are both offline precomputes. So the JBrowse restrictions don't really apply to them. We'll always produce static files which JBrowse can consume on the fly from javascript. Also, Python is compilable into C code, so anything that can port C++ code can also port Python. Still, I think the transpiler route is not necessary, since we'll just produce static files. Most large projects contain many languages, which gives more people an opportunity to contribute in a language they know, and the strengths of each language can be leveraged in appropriate tasks.

I do agree, and created this issue, because JSON is very bloated. Best way to see this is zipping the set of JSON files. Lung cancer is 332MB vs 30MB, a 10x difference. But I see JSON as temporary anyways. Long term, let's look at something like Spodgi for Triple Store that can contain our graph, precomputes, and annotations in one place.
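
For anyone who wants to reproduce that kind of comparison on their own files, a quick way to measure it is sketched below. The sample records are synthetic and only imitate per-bin declarations; they are not real odgi output, and the ratio will differ on real data.

```python
import gzip
import json


def compression_ratio(text):
    """Return uncompressed size divided by gzip-compressed size."""
    raw = text.encode("utf-8")
    return len(raw) / len(gzip.compress(raw))


# Synthetic stand-in for one-base-per-bin JSON records.
sample = "".join(json.dumps({"id": i + 1, "seq": "ACGT"[i % 4]})
                 for i in range(1000))
ratio = compression_ratio(sample)
```

A high ratio means most of the bytes are repeated structure rather than information.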

subwaystation · 2020-03-23T18:25:05Z

They are precomputes, but a user would just like to throw a .gfa or .og or .JSON ... file at the VIZ and would expect that it just does its job. We could get rid of a large fraction of precomputes if we have a way to calculate the currently relevant JSON on the fly.

To clarify, the FASTA option is still needed as a temporary solution?

subwaystation · 2020-03-23T18:27:21Z

And I do agree, Spodgi would be a nice replacement for what we want to do. But not everyone wants to set up a whole RDF Triple store just to take a look at his graph. Only if we find an ultra user-friendly way to do this.

subwaystation · 2020-03-23T18:49:55Z

Actually, I am wrong here. With Spodgi we won't have to set up a triple store. We project an odgi graph via the Python bindings to RDF and make it accessible via SPARQL queries.

josiahseaman · 2020-04-02T21:32:12Z

I've just approved PR#96. I think there are some follow-up changes about how we output zoom stacks (lists of bin widths), but that is properly a second issue.

josiahseaman added the enhancement New feature or request label Mar 19, 2020

subwaystation self-assigned this Mar 20, 2020

subwaystation mentioned this issue Mar 24, 2020

Render nucleotides on the nearest zoom level. graph-genome/Schematize#17

Closed

5 tasks

josiahseaman mentioned this issue Mar 24, 2020

v11: Chunk Fasta files in parallel with components graph-genome/component_segmentation#11

Closed

subwaystation added a commit that referenced this issue Mar 30, 2020

this should resolve #88

2ea72d5

josiahseaman mentioned this issue Mar 30, 2020

Adding pure FASTA output for Pangenome #88 #96

Merged

josiahseaman added a commit that referenced this issue Apr 2, 2020

Merge pull request #96 from vgteam/i88_bin_fasta

39f467a

Josiah tested and approved outputs. Adding pure FASTA output for Pangenome #88

josiahseaman closed this as completed Apr 2, 2020

josiahseaman mentioned this issue Apr 2, 2020

Output "Zoom Stack" list of bin widths #98

Closed

3 tasks
