Cooler files for single-cell Hi-C #186

Closed
joachimwolff opened this issue Feb 27, 2020 · 11 comments

Comments

@joachimwolff
Collaborator

joachimwolff commented Feb 27, 2020

Hi Nezar,

As you know, I use your cool format in HiCExplorer, and now also for single-cell Hi-C analysis in a new tool called scHiCExplorer.

To use your file format, I create one cool file per cell and store all of them in one mcool file. This works quite well; however, as a referee of one of my publications pointed out, it is misleading. (If that referee reads this thread: thanks for the positive feedback and the comments.) I call the file mcool, but of course it does not comply with your definition of a multi-resolution cool file.

My question to you is: is there any interest on your side in coming up with a new file extension for single-cell Hi-C data? The concept would be to simply store the individual cool files in a single-cell mcool file and use it as a container. In the future we could discuss adding metadata and/or bulk matrices. What I have now I simply call version 1: it has only the root folder of the mcool, and all individual cool files are stored in it. In a possible version 2, the layout could look like:

  • root::/single-cell matrices/
  • root::/bulk matrix

It is important to me to have a solution that the developers of cooler (aka you :) ) agree with, to prevent file formats and names from branching.

Possible suggestions I have for a single-cell Hi-C cool format name:

  • sccool
  • scool
  • sCool

Any other, possibly better, name proposal is welcome.

Thanks for giving this some thought.

Best,

Joachim

@Phlya
Member

Phlya commented Feb 27, 2020

I was actually thinking about it too, would be great to have multiple cells in one file!

What I was thinking is to have them all in one pixel table, with a differently named interaction-count column for each cell. That is a bit of a different design, but regardless of implementation, I just wanted to support the general idea!

@joachimwolff
Collaborator Author

joachimwolff commented Mar 2, 2020

Thanks for supporting this idea, @Phlya.

Storing the cells as cool matrices within the "mcool" file has the nice effect of making the computations very easy to parallelize. You read the list of stored matrices via cooler.fileops.list_coolers and simply divide this list by the number of threads. Each thread processes its sub-list iteratively and independently, and every sub-matrix can be read as a cool file with the existing API. The benefit of this method is that nothing in the API needs to change and all available cores can be fully used.

However, I am always open to better ideas and implementations. What benefits would storing it the way you proposed bring?
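
The splitting scheme described above could be sketched as follows. This is only an illustration, not scHiCExplorer's actual code: `split_evenly`, `count_contacts`, and `process_all` are hypothetical names, and summing the pixel counts is just a placeholder for whatever per-cell work is needed. The only cooler calls assumed are `cooler.fileops.list_coolers`, `cooler.Cooler`, and the `pixels()` selector.

```python
from concurrent.futures import ThreadPoolExecutor


def split_evenly(items, n_chunks):
    """Split items into contiguous sub-lists whose lengths differ by at most one."""
    n_chunks = max(1, min(n_chunks, len(items)))
    base, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def count_contacts(file_path, group_paths):
    """Hypothetical per-worker task: sum the pixel counts of each cell in the sub-list."""
    import cooler  # imported lazily so split_evenly can be used without cooler installed

    totals = {}
    for group in group_paths:
        clr = cooler.Cooler(f"{file_path}::{group}")
        totals[group] = int(clr.pixels()[:]["count"].sum())
    return totals


def process_all(file_path, n_workers=4):
    """List all coolers in the container and hand one sub-list to each worker."""
    import cooler

    groups = cooler.fileops.list_coolers(file_path)
    totals = {}
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [
            pool.submit(count_contacts, file_path, chunk)
            for chunk in split_evenly(groups, n_workers)
        ]
        for future in futures:
            totals.update(future.result())
    return totals
```

A thread pool is used here only because the comment speaks of threads; for CPU-bound work a process pool may be the better fit.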

@nvictus
Member

nvictus commented Mar 3, 2020

Hi Joachim!

This is a great idea! I think it would be great to gather input from single-cell users for a good file layout to store all single cells together, and which could probably optionally include bulk/pooled data as well. As for the file extension convention, I guess I'd put in a vote for .scool, for symmetry with .mcool.

As you said, I don't think we can store data @Phlya's suggested way with the current API. However, querying many cells is straightforward, and you could just do that and merge all your pixels into a dataframe labelled by cell ID to construct such a thing on the fly.
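
A minimal sketch of building such a table on the fly, using plain Python structures so it runs without any dependencies. The function names and the toy pixel records are made up for illustration; with pandas, the stacking step would typically be a concatenation of the per-cell pixel dataframes with a cell ID label attached.

```python
from collections import defaultdict


def to_long_table(pixels_by_cell):
    """Stack per-cell (bin1_id, bin2_id, count) records into rows labelled by cell ID."""
    rows = []
    for cell_id, pixels in pixels_by_cell.items():
        for bin1_id, bin2_id, count in pixels:
            rows.append((cell_id, bin1_id, bin2_id, count))
    return rows


def to_wide_table(long_rows):
    """Pivot the long table into one entry per pixel holding a count per cell.

    Cells that are absent from an entry have no recorded contacts for that pixel.
    """
    wide = defaultdict(dict)
    for cell_id, bin1_id, bin2_id, count in long_rows:
        wide[(bin1_id, bin2_id)][cell_id] = count
    return dict(wide)


# Toy input standing in for the pixels read from each single-cell cooler.
pixels_by_cell = {
    "cell_A": [(0, 0, 2), (0, 3, 1)],
    "cell_B": [(0, 3, 4), (1, 2, 1)],
}
long_rows = to_long_table(pixels_by_cell)
wide = to_wide_table(long_rows)
```

The wide form corresponds to the single pixel table with one count column per cell suggested earlier in the thread; the long form is what querying the cells one by one yields directly.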

Btw, how are you labeling the groups referring to the single cells? Are you grouping them by resolution?

One idea I've considered in the past is:

root::/cells/{resolution}/{unique_cell_id}

Of course, one could also do root::/cells/{unique_cell_id} for a "single-cell single-res cooler" and root::/cells/{unique_cell_id}/resolutions/{resolution} for a "single-cell multi-res cooler"... but maybe that's getting a bit crazy.
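
To make the three candidate layouts concrete, here is a small sketch that builds the corresponding URIs. The function name `cell_uri` and the layout labels are invented for this example; `root` in the paths above is taken to mean the container file.

```python
def cell_uri(file_path, cell_id, resolution=None, layout="cells"):
    """Build a 'file::/group' URI for one cell under one of the candidate layouts."""
    layouts = {
        # single-cell, single-resolution
        "cells": "/cells/{cell_id}",
        # resolution above cell ID
        "resolution-first": "/cells/{resolution}/{cell_id}",
        # cell ID above resolution
        "cell-first": "/cells/{cell_id}/resolutions/{resolution}",
    }
    if layout not in layouts:
        raise ValueError(f"unknown layout: {layout!r}")
    if layout != "cells" and resolution is None:
        raise ValueError("this layout needs a resolution")
    group = layouts[layout].format(cell_id=cell_id, resolution=resolution)
    return f"{file_path}::{group}"


print(cell_uri("cells.scool", "c1"))
print(cell_uri("cells.scool", "c1", 10000, "resolution-first"))
print(cell_uri("cells.scool", "c1", 10000, "cell-first"))
```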

It would be good to have the 4DN DCIC @burakalver and @hbbrandao chime in for additional feedback.
Maybe we can draft a google doc with a formal "proposal"?

@joachimwolff
Collaborator Author

Hi Nezar,

Thanks for your positive feedback. I have adopted the name 'scool' for the single-cell cool file and am using it on my webserver and in the associated publication. However, as soon as we reach an agreement, I will update my tools to whatever format we settle on.

I am a supporter of the 'keep it simple' principle; therefore I think root::/cells/{unique_cell_id} is the best option. For another resolution, just use another scool file. If there is a use case for a multi-resolution file (somehow with HiGlass?), I think root::/cells/{resolution}/{unique_cell_id} is the best option.

Last, the bulk matrix. I am not sure whether we should keep it inside the scool file as root::/bulk_matrix or, because it is viewed and handled like a regular Hi-C matrix, keep it as an independent file. The argument for the latter is that we would not need to care about any derived files (and where to store them), such as obs_exp or Pearson-correlation matrices. Moreover, it is simpler for users to run hicPlotMatrix -m bulk_matrix.cool instead of hicPlotMatrix -m single_cell.scool::/bulk_matrix.
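
Both ways of addressing the bulk matrix follow the same `path::group` convention; a standalone file simply has no group part. The sketch below mimics that convention with the standard library only. `split_uri` is a hypothetical helper, not the parser that cooler or HiCExplorer use.

```python
def split_uri(uri):
    """Split 'file::/group' into (file_path, group_path); a bare file maps to the root group."""
    file_path, sep, group_path = uri.partition("::")
    if not sep or not group_path:
        group_path = "/"
    elif not group_path.startswith("/"):
        group_path = "/" + group_path
    return file_path, group_path


print(split_uri("bulk_matrix.cool"))
print(split_uri("single_cell.scool::/bulk_matrix"))
```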

Best,

Joachim

@nvictus
Member

nvictus commented Mar 12, 2020

I also like the simple root::/cells/{unique_cell_id} and .scool to hold a batch of single cells at a single resolution (not including bulk).

It would be good to include some top-level metadata for introspection and versioning. .mcool currently uses:

  • format: HDF5::MCOOL
  • format-version: 2

If we agree that all single cells must use the same bin segmentation, I would propose including at least:

  • format: HDF5::SCOOL
  • format-version: 1
  • bin-type: {'fixed' or 'variable'}
  • bin-size: {int or 'null'}
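
As a sketch of how the proposed attributes might be assembled and checked, assuming exactly the four keys listed above and the string 'null' for variable-size bins. The helper names are hypothetical; in a real file these values would live as attributes on the HDF5 root group.

```python
def scool_root_attrs(bin_size=None):
    """Return the proposed top-level attributes; bin_size=None means variable-size bins."""
    attrs = {"format": "HDF5::SCOOL", "format-version": 1}
    if bin_size is None:
        attrs["bin-type"] = "variable"
        attrs["bin-size"] = "null"
    else:
        if isinstance(bin_size, bool) or not isinstance(bin_size, int) or bin_size <= 0:
            raise ValueError("bin_size must be a positive integer")
        attrs["bin-type"] = "fixed"
        attrs["bin-size"] = bin_size
    return attrs


def is_scool(attrs):
    """Introspection check: do these root attributes declare an scool container?"""
    return attrs.get("format") == "HDF5::SCOOL"


print(scool_root_attrs(10000))
print(scool_root_attrs())
```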

@Phlya
Member

Phlya commented Mar 12, 2020

I would argue having a possibility of multiple resolutions is a good thing. Two flavours, like for regular coolers, perhaps? .scool and .smcool?

How about thinking about it a bit more generally: not just for single-cell data, but for storing multiple datasets together in one file? Why not keep all samples from an experiment together, for example? It seems the schema would be identical to the ideas proposed here; it would just need to be named and "marketed" appropriately.

@nvictus
Member

nvictus commented Mar 13, 2020

Well, if you treat the scool layout as a "tree" that can be rooted anywhere in the file, then it's easy to envision a multi-resolution version of this.

root::/resolutions/1000/cells/{id1}
...
root::/resolutions/1000/cells/{idN} 

root::/resolutions/5000/cells/{id1}
...
root::/resolutions/5000/cells/{idN} 
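
Given a listing of group paths in this multi-resolution layout, the cells available at each resolution can be recovered with a simple pattern match. This is a sketch under the assumption that paths look exactly like the ones above; `cells_by_resolution` is a made-up helper name.

```python
import re
from collections import defaultdict

# Matches '/resolutions/<integer>/cells/<cell id>' and nothing deeper.
_CELL_PATH = re.compile(r"^/resolutions/(\d+)/cells/([^/]+)$")


def cells_by_resolution(group_paths):
    """Index cell IDs by resolution, ignoring paths that do not fit the layout."""
    index = defaultdict(list)
    for path in group_paths:
        match = _CELL_PATH.match(path)
        if match:
            index[int(match.group(1))].append(match.group(2))
    return dict(index)


paths = [
    "/resolutions/1000/cells/id1",
    "/resolutions/1000/cells/idN",
    "/resolutions/5000/cells/id1",
    "/bulk_matrix",
]
print(cells_by_resolution(paths))
```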

> How about thinking about it a bit more general, not just for single-cell data, but just for storing multiple datasets together in one file

Well, that is basically why the cooler "data collection" is defined as a tree that can be rooted anywhere. Having a few standardized layouts is good, but I'm skeptical of finding a general directory hierarchy that everyone agrees with. Maybe some recommended good practices?

Otherwise, just making people more aware of the introspection tools would be useful (cooler tree, cooler attrs, and the other fileops, which actually work on any HDF5 file!).

@joachimwolff
Collaborator Author

joachimwolff commented Mar 13, 2020

The smcool format sounds all right to me; however, I am a bit sceptical about its use case. A tool like HiGlass would somehow need to support it, but I am not sure how well an investigation of multiple single-cell Hi-C matrices could work. More than maybe 8 to 16 matrices visualised in parallel seems difficult to handle to me. Anyway, these are issues someone else needs to solve.

Concerning storing all data of an experiment: I think we already have something like this, namely zip files. I am sorry to argue against this idea, but I think it is not good to have another file format. The problem I see is simply that if, e.g., we store a bigwig file in such a format, we first need to extract it before we can use it with any other tool that supports bigwig. This situation already exists with zip files, and a general cool format would have the same issue.

@joachimwolff
Collaborator Author

Hi,

I have not received an answer to my mail to you, so I am trying again here. I really want to push this idea forward and have a single-cell cool format. However, please just say so if you don't have the time or are no longer interested.

Best,

Joachim

@nvictus
Member

nvictus commented Apr 20, 2020

Joachim, I apologize for the long delay. I sent you an email.

@nvictus nvictus added the schema label Jun 30, 2020
@nvictus
Member

nvictus commented Jul 18, 2020

Added in #201


3 participants