Special case for bin_width=1 sequence output #88
Comments
Thought should be given to where else this problem reoccurs. I had originally intended on making bin steps powers of 2 or 4, not 10. Bin widths of 2, 4, 8, 16, 32 would also be space inefficient with separate bin declarations. In contrast, bin_id is always calculable by simply dividing the pangenome coordinate by the bin_width. Perhaps it's a better alternative to make "pangenome_sequence" the default behavior for all bin sizes?
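To make the arithmetic concrete, here is a minimal Python sketch of deriving a bin from a coordinate. It assumes 0-based pangenome positions and 1-based bin ids (consistent with the "index in the string +1" convention in the issue description). The function names `bin_id` and `bin_sequence` are hypothetical and not part of odgi.

```python
def bin_id(position: int, bin_width: int) -> int:
    """Return the 1-based bin id for a 0-based pangenome position (assumed convention)."""
    if bin_width < 1:
        raise ValueError("bin_width must be >= 1")
    if position < 0:
        raise ValueError("position must be >= 0")
    return position // bin_width + 1


def bin_sequence(pangenome_sequence: str, bin_id_: int, bin_width: int) -> str:
    """Slice the bases covered by a bin out of a single pangenome sequence string."""
    start = (bin_id_ - 1) * bin_width
    return pangenome_sequence[start:start + bin_width]
```

With this, no per-bin declaration is needed for any bin_width: one sequence string plus the width is enough to recover the bases of every bin.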
@josiahseaman @superjox I would propose that we decouple the
Furthermore, I would expect that the JBrowse2 people have built in such functionality at some point. We should ask Robert about that.
So the
@subwaystation Your comment is very insightful. I agree that in the context of multiple file outputs like we use in component_segmentation this is the right course of action for the browser and javascript. However, to date, @ekg I see this fitting well with a separate change, which is to request a list of bin_widths instead of just one. That way we request the entire operation of reading in the graph and outputting layers of different bin_width JSON and then a single pangenome sequence FASTA to populate any bin_width. Either way these are going to be large files. However, it makes a lot of sense to not duplicate the sequence unnecessarily, which is currently happening.
@josiahseaman Could you please elaborate? You want But let me apply some wishful thinking. Ideally, we have an
Response:
So in
Possible issues:
But I am open to better solutions!
An important lesson I've learned from decades of software development is to watch out for scope creep. It can kill a project and make it feel as if nothing is progressing, despite lots of hard work. This started off as a single special case for bin_width=1 and it can be easily solved by simply outputting a FASTA file. That doesn't require porting an entire program to another language and restructuring our data pipeline. Let's do only what's required to get the result we want. Don't ever try to optimize something until you actually run into a constraint caused by some design decision. If it's not possible to seek in a large file, then all we have to do is break the FASTA into 2MB chunk files. We already have all the code for doing this in Python alongside component_segmentation. That implementation would take maybe an hour or two of development. Reworking all of component_segmentation would likely take 2 weeks of precious developer time. For simplicity, I would prefer that odgi outputs only a single FASTA file (not chunked) so that other tools don't have to de-chunk the file in order to use it effectively.
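For illustration, the chunking fallback could be sketched roughly as follows. This is not the existing code that lives alongside component_segmentation. The helper names are hypothetical, and "2MB" is assumed here to mean 2 * 1024 * 1024 bases per chunk.

```python
from typing import List, Tuple

# Assumption: "2MB" means this many bases per chunk file.
CHUNK_SIZE = 2 * 1024 * 1024


def split_into_chunks(sequence: str, chunk_size: int = CHUNK_SIZE) -> List[str]:
    """Split one long sequence into fixed-size pieces (the last piece may be shorter)."""
    return [sequence[i:i + chunk_size] for i in range(0, len(sequence), chunk_size)]


def locate(position: int, chunk_size: int = CHUNK_SIZE) -> Tuple[int, int]:
    """Map a 0-based pangenome position to (chunk index, offset within that chunk)."""
    return divmod(position, chunk_size)
```

A reader then only has to fetch the chunk that `locate` points to, so no FASTA index and no seeking inside a large file is required.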
Sometimes, sustainable and user-friendly software implementations do not feel like something is progressing. They do not have to, in my opinion. But, afterwards, the reward is even bigger. If we want to go the JBrowse2 way, we need to have the code all in JavaScript (not realistic) or in C++ (for compilation to JavaScript) anyhow. Or do you have any other long-term plan? Breaking the FASTA into several chunks is a clever idea to outmaneuver the need for a FASTA index.
Oh, to clarify, odgi and component_segmentation are both offline precomputes. So the JBrowse restrictions don't really apply to them. We'll always produce static files which JBrowse can consume on the fly from JavaScript. Also, Python is compilable into C code, so anything that can port C++ code can also port Python. Still, I think the transpiler route is not necessary, since we'll just produce static files. Most large projects contain many languages, which gives more people an opportunity to contribute in a language they know, and the strengths of each language can be leveraged in appropriate tasks. I do agree that JSON is very bloated, which is why I created this issue. The best way to see this is to zip the set of JSON files. Lung cancer is 332MB vs 30MB, a 10x difference. But I see JSON as temporary anyway. Long term, let's look at something like Spodgi for Triple Store that can contain our graph, precomputes, and annotations in one place.
They are precomputes, but a user would just like to throw a .gfa or .og or .JSON ... file at the VIZ and would expect that it just does its job. We could get rid of a large fraction of precomputes if we have a way to calculate the currently relevant JSON on the fly. To clarify, the FASTA option is still needed as a temporary solution?
And I do agree, Spodgi would be a nice replacement for what we want to do. But not everyone wants to set up a whole RDF Triple store just to take a look at their graph. Only if we find an ultra user-friendly way to do this.
Actually, I am wrong here. With Spodgi we won't have to set up a triple store. We project an odgi graph via the Python bindings to RDF and make it accessible via SPARQL queries.
Josiah tested and approved outputs. Adding pure FASTA output for Pangenome #88
I've just approved PR#96. I think there are some follow-up changes about how we output zoom stacks (lists of bin widths), but I think that is properly a second issue.
At --bin_width=1 the JSON produced looks ridiculous. For supporting Schematize, this is going to be a common use case. Every time we load in a graph with odgi we'll ask for bin width = 1, 10, 100, 1000, 10000, 100000 at minimum.
Currently, the output would look like:
{id=1, seq='C'}{id=2, seq=G}{id=3, seq='T'}
That's 43 characters for "CGT". In this special case, the index in the string + 1 is exactly the same as the bin id, so the bins don't need to be listed. New output should be:
{pangenome_sequence="CGTACGTACGTACGTACTACTCAGCTAGCTAGCTACGTCGAGTCTTACTCTAGATC"} {path_name="ATH17...
No bin declarations should be included. We'll also make a special case for bin_width=1 in schematize to look for the unique key "pangenome_sequence" which should only occur once in the file. Besides this and the lack of bin declarations, the rest of the file can be the same including the bin_ids in the path traversals and links.
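As a rough illustration of the saving, the sketch below serializes the same sequence both ways and looks a base up by bin id. The key names `id`, `seq` and `pangenome_sequence` come from the snippets above. The exact JSON layout and the helper names are assumptions, since the snippets are shorthand rather than the real odgi output.

```python
import json


def per_bin_records(sequence: str) -> str:
    """One record per base, roughly the shape bin_width=1 produces today (assumed layout)."""
    return "\n".join(
        json.dumps({"id": i + 1, "seq": base}) for i, base in enumerate(sequence)
    )


def compact_record(sequence: str) -> str:
    """A single record holding the whole sequence under the unique key."""
    return json.dumps({"pangenome_sequence": sequence})


def base_for_bin(record: str, bin_id: int) -> str:
    """At bin_width=1 the bin id is just the 1-based index into the sequence."""
    sequence = json.loads(record)["pangenome_sequence"]
    return sequence[bin_id - 1]


seq = "CGTACGTACGTACGTACTACTCAGCTAGCTAGCTACGTCGAGTCTTACTCTAGATC"
print(len(per_bin_records(seq)), "characters with bin declarations")
print(len(compact_record(seq)), "characters as pangenome_sequence")
```

Because the bin id can be recovered from the string index, a reader like Schematize would lose no information by dropping the bin declarations.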
See also #49 #86