Special case for bin_width=1 sequence output #88

Closed
josiahseaman opened this issue Mar 19, 2020 · 13 comments

@josiahseaman
Collaborator

At --bin_width=1 the JSON produced looks ridiculous. For supporting Schematize, this is going to be a common use case. Every time we load a graph with odgi we'll ask for bin widths of 1, 10, 100, 1000, 10000, and 100000 at minimum.

Currently, the output would look like {id=1, seq='C'}{id=2, seq=G}{id=3, seq='T'}. That's 43 characters for "CGT". In this special case, index in the string +1 is exactly the same as bin id, so they don't need to be listed.

New output should be:
{pangenome_sequence="CGTACGTACGTACGTACTACTCAGCTAGCTAGCTACGTCGAGTCTTACTCTAGATC"} {path_name="ATH17...
No bin declarations should be included. We'll also make a special case for bin_width=1 in schematize to look for the unique key "pangenome_sequence" which should only occur once in the file. Besides this and the lack of bin declarations, the rest of the file can be the same including the bin_ids in the path traversals and links.

See also #49 #86
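
To illustrate the size difference described above, here is a minimal sketch. It is not odgi's actual output code: the `id` and `seq` keys mirror the pseudo-JSON in this comment, and the real field names and file layout may differ. Bin ids are assumed to be 1-based, as stated above.

```python
import json


def per_bin_output(sequence):
    """Current style at bin_width=1: one declaration per base (hypothetical keys)."""
    return "".join(
        json.dumps({"id": bin_id, "seq": base})
        for bin_id, base in enumerate(sequence, start=1)
    )


def compact_output(sequence):
    """Proposed style: a single record holding the whole pangenome sequence."""
    return json.dumps({"pangenome_sequence": sequence})


def base_for_bin(compact_line, bin_id):
    """Recover the base for a 1-based bin id from the compact record."""
    sequence = json.loads(compact_line)["pangenome_sequence"]
    return sequence[bin_id - 1]  # string index + 1 == bin id
```

With this layout a reader never needs a bin declaration at width 1: `base_for_bin(compact_output("CGT"), 2)` looks up the second base directly from the string.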

@josiahseaman
Collaborator Author

josiahseaman commented Mar 19, 2020

Thought should be given to where else this problem recurs. I had originally intended to make bin steps powers of 2 or 4, not 10. Bin widths of 2, 4, 8, 16, and 32 would also be space-inefficient if each bin gets its own declaration. In contrast, bin_id is always calculable by simply dividing the pangenome coordinate by the bin_width. Perhaps a better alternative is to make "pangenome_sequence" the default behavior for all bin sizes?
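
A minimal sketch of that calculation, assuming 1-based pangenome positions and 1-based bin ids (consistent with "index in the string +1" above). The function names are hypothetical and not part of odgi or Schematize.

```python
def bin_id_for_position(position, bin_width):
    """Map a 1-based pangenome position to its 1-based bin id."""
    if position < 1 or bin_width < 1:
        raise ValueError("position and bin_width must be >= 1")
    return (position - 1) // bin_width + 1


def bin_sequence(pangenome_sequence, bin_id, bin_width):
    """Slice the bases covered by a 1-based bin id out of the full sequence."""
    start = (bin_id - 1) * bin_width  # 0-based string index of the bin's first base
    return pangenome_sequence[start:start + bin_width]
```

Because both directions are pure arithmetic, a single stored sequence could serve every bin width without per-bin declarations.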

@josiahseaman josiahseaman added the enhancement New feature or request label Mar 19, 2020
@subwaystation subwaystation self-assigned this Mar 20, 2020
@subwaystation
Member

subwaystation commented Mar 20, 2020

@josiahseaman @superjox I would propose that we decouple the pangenome sequence completely from the bin.json. It would become a FASTA file containing the pangenome sequence, so we would only have to generate the FASTA once for all bin sizes. We can index the FASTA with e.g. samtools, or integrate such functionality into our C++ code via seqan3 or maybe htslib. I did not find any Python code that would be able to create a FASTA index.
With such an index we could use file seeks to extract the desired sequences very quickly from an arbitrarily sized pangenome sequence. But I am not sure whether a file seek is possible on the client side.
We would need something like https://www.npmjs.com/package/fs-ext to do that. But it is a Node.js package and I am not sure if it will work on the client side. I will find out.
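
For reference, a sketch of how an index-based seek could work, assuming the standard samtools `.fai` columns (sequence name, length, byte offset of the first base, bases per line, bytes per line). The function names are hypothetical and positions are 0-based, half-open.

```python
def byte_offset(pos, offset, linebases, linewidth):
    """Byte position in the FASTA file of 0-based sequence position `pos`."""
    return offset + (pos // linebases) * linewidth + pos % linebases


def fetch(path, offset, linebases, linewidth, start, end):
    """Read bases [start, end) of one indexed sequence using a single seek."""
    if end <= start:
        return ""
    first = byte_offset(start, offset, linebases, linewidth)
    last = byte_offset(end - 1, offset, linebases, linewidth)
    with open(path, "rb") as handle:
        handle.seek(first)
        raw = handle.read(last - first + 1)
    # The raw slice still contains the line breaks between FASTA lines.
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode("ascii")
```

Only the requested byte range is read, so the cost does not depend on the total size of the pangenome sequence.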

@subwaystation
Member

subwaystation commented Mar 20, 2020

Furthermore, I would expect that the JBrowse2 people will have built in such functionality at some point. We should ask Robert about that.

@subwaystation
Member

subwaystation commented Mar 20, 2020

So the fs-ext functionality is only available on the server side. I tried https://developers.google.com/web/updates/2011/08/Seek-into-local-files-with-the-File-System-API and https://www.html5rocks.com/en/tutorials/file/filesystem/ but I could not get them to run.
@josiahseaman Do you have any idea how one could do a file seek? Maybe via an AJAX request?
Else we might have to add that functionality on the server side.

@josiahseaman
Collaborator Author

@subwaystation Your comment is very insightful. I agree that in the context of multiple file outputs like we use in component_segmentation this is the right course of action for the browser and javascript. However, to date, odgi bin is only outputting one file. We could do the file split inside of component_segmentation along with everything else. It'd be a more minor code change in odgi. I'd find either location acceptable. But if we change odgi to make a separate pangenome_sequence.fasta then we should change the output to make a directory instead of just similarly named files.

@ekg I see this fitting well with a separate change, which is to request a list of bin_widths instead of just one. That way we request the entire operation of reading in the graph and outputting layers of different bin_width JSON and then a single pangenome sequence FASTA to populate any bin_width. Either way these are going to be large files. However, it makes a lot of sense to not duplicate the sequence unnecessarily, which is currently happening.

@subwaystation
Member

@josiahseaman Could you please elaborate? You want odgi bin to only output split files? What benefit would we then get? I am a little bit confused here. Wouldn't that break the component algorithm?

But let me apply some wishful thinking. Ideally, we have an odgi server that gives you the segmented JSON ready for direct visualization.
Request types:

  1. Pangenome position + bin width
  2. Path name + nucleotide position + bin width

Response:

  1. JSON with components ready for visualization

So in odgi we would prepare the bin data structure only for the requested subpart of the pangenome. I am not sure how, or whether, this is possible in odgi. I will discuss it later with @ekg. Another requirement would be to port a large part of the component_segmentation code to C++.
The benefits would be:

  1. One repo less, so less code to maintain and less complexity of the project.
  2. No intermediate files to save. For large data sets that would mean much less IO in the range of tens of GBytes.
  3. C++ might be faster than Python.
  4. Go Wasm. We could get all our code to run in JavaScript. Implemented nicely, we might not even need a server component and could stick to the JBrowse2 concept.

Possible issues:

  1. My idea might not scale on the odgi side. But we don't know that.
  2. This would involve some serious porting work from Python to C++.
  3. More people are familiar with Python compared to C++.

But I am open to better solutions!

@josiahseaman
Collaborator Author

An important lesson I've learned from decades of software development is to watch out for scope creep. It can kill a project and make it feel as if nothing is progressing, despite lots of hard work. This started off as a single special case for bin_width=1 and it can be easily solved by simply outputting a FASTA file. That doesn't require porting an entire program to another language and restructuring our data pipeline. Let's do only what's required to get the result we want. Don't ever try to optimize something until you actually run into a constraint caused by some design decision.

If it's not possible to seek within a large file, then all we have to do is break the FASTA into 2MB chunk files. We already have all the code for doing this in Python alongside component_segmentation. That implementation would take maybe an hour or two of development. Reworking all of component_segmentation would likely take 2 weeks of precious developer time.
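
A rough sketch of that chunking idea, not the existing component_segmentation code. The file naming, the header line, and the choice to count 2MB in bases rather than file bytes are all assumptions made for illustration; positions are 0-based.

```python
from pathlib import Path

CHUNK_SIZE = 2 * 1024 * 1024  # bases per chunk (assumed meaning of "2MB")


def write_chunks(sequence, out_dir, chunk_size=CHUNK_SIZE):
    """Write the sequence as numbered FASTA chunk files and return their paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, start in enumerate(range(0, len(sequence), chunk_size)):
        path = out / f"seq_chunk{number:05d}.fa"  # hypothetical naming scheme
        piece = sequence[start:start + chunk_size]
        path.write_text(f">chunk{number} start={start}\n{piece}\n")
        paths.append(path)
    return paths


def locate(position, chunk_size=CHUNK_SIZE):
    """Map a 0-based position to (chunk number, offset inside that chunk)."""
    return divmod(position, chunk_size)
```

A client that cannot seek would fetch only the chunk returned by `locate`, so it never has to load the whole sequence.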

For simplicity, I would prefer odgi outputs only a single FASTA file (not chunked) so that other tools don't have to de-chunk the file in order to use it effectively.

@subwaystation
Member

Sometimes, sustainable and user-friendly software implementations do not feel like something is progressing. They do not have to, in my opinion. But, afterwards, the reward is even bigger.
With my proposition, we would greatly enhance the user experience and reduce code complexity. Potential developers would be presented with a clearer path for contribution. More importantly, it would also solve our zooming problem. And we would not have to create tens of GBytes of files for each bin width.
So we would save time, disk space and human resources.

If we want to go the JBrowse2 way, we need to have the code all in Javascript (not realistic) or in C++ (for compilation to Javascript) anyhow. Or do you have any other long term plan?

Breaking the FASTA into several chunks is a clever idea to outmaneuver the need for a FASTA index.
I can implement a parameter that allows the storage of the pangenome sequence in a specified FASTA file. This represents a fast solution to our problem.
But I do not think it is a sustainable way to move things forward.

@josiahseaman
Collaborator Author

josiahseaman commented Mar 23, 2020

Oh, to clarify, odgi and component_segmentation are both offline precomputes, so the JBrowse restrictions don't really apply to them. We'll always produce static files which JBrowse can consume on the fly from JavaScript. Also, Python is compilable into C code, so anything that can port C++ code can also port Python. Still, I think the transpiler route is not necessary, since we'll just produce static files. Most large projects contain many languages, which gives more people an opportunity to contribute in a language they know, and the strengths of each language can be leveraged in appropriate tasks.

I do agree, and created this issue, because JSON is very bloated. The best way to see this is to zip the set of JSON files: lung cancer is 332MB vs 30MB, a 10x difference. But I see JSON as temporary anyway. Long term, let's look at something like Spodgi for a triple store that can contain our graph, precomputes, and annotations in one place.
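
One way to reproduce that kind of measurement on any set of records is sketched below. It uses gzip from the Python standard library rather than a zip archive, and the record layout is made up for illustration, so the ratio it reports will not match the numbers above.

```python
import gzip
import json


def compression_ratio(records):
    """Return raw size divided by gzip-compressed size for JSON-lines records."""
    raw = "\n".join(json.dumps(record) for record in records).encode("utf-8")
    return len(raw) / len(gzip.compress(raw))
```

A high ratio suggests that most of the file is repeated keys and punctuation rather than information.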

@subwaystation
Member

subwaystation commented Mar 23, 2020

They are precomputes, but a user would just like to throw a .gfa or .og or .JSON ... file at the VIZ and would expect that it just does its job. We could get rid of a large fraction of the precomputes if we had a way to calculate the currently relevant JSON on the fly.

To clarify, the FASTA option is still needed as a temporary solution?

@subwaystation
Member

And I do agree, Spodgi would be a nice replacement for what we want to do. But not everyone wants to set up a whole RDF triple store just to take a look at their graph. It would only work if we find an ultra user-friendly way to do this.

@subwaystation
Member

Actually, I am wrong here. With Spodgi we won't have to set up a triple store. We project an odgi graph to RDF via the Python bindings and make it accessible via SPARQL queries.

@josiahseaman
Collaborator Author

I've just approved PR#96. I think there are some follow-up changes to make about how we output zoom stacks (lists of bin widths), but that is properly a second issue.
