You're all invited because of your expertise writing single-cell pipelines. I think if we could all agree on a file format and some conventions, we could make our respective code much more interoperable. At a minimum, this would require each library to support reading and writing a common file format. I propose we go with Loom. :-)
But it would be extremely wasteful to write the whole dataset to file when you've only added an attribute (e.g. a cluster ID). Better instead if each library just reads and writes the necessary attributes. How great would it be if we could easily do this (and vice versa in R):
```python
seurat = robjects.r['seurat']
pagoda = robjects.r['pagoda']
with loompy.connect("filename.loom") as ds:
    scanpy.api.pp.filter_cells(ds)
    cytograph.multiscale_knn(ds)
    seurat.cluster(ds)
    pagoda.diff_expression(ds)
```
(Of course, the code is made-up, but you get the idea)
To get there, we would need to agree on some conventions:
To get started, here's a proposal based on our current conventions, slightly cleaned up:
Rows are genes, columns are cells.
Graphs on cells
Graphs on genes
We also have conventions for layers, used e.g. for gene enrichment. For example, an aggregate Loom file would have one column per cluster instead of per cell, and would have layers such as
Note: I'm opening the discussion here, but these are really intended to be conventions for analysis libraries and visualization tools, not Loom the file format. Loom itself is intended to be more general, so we will not impose on it e.g. that there will always be a
Thank you, Sten! This discussion is exactly what we need, I think.
Orientation: I'd vote for rows to be cells and columns to be genes, as in the modern classics of statistics (Hastie et al., 2009) and machine learning (Murphy, 2012), in dataframes both in R and Python, and in the established machine learning and statistics packages in Python (statsmodels, scikit-learn). But I acknowledge that for genomic data the tradition is the opposite, and I'm happy to conform with that if most of you think this is the better choice.
Other annotation/attributes: The naming conventions are fine! But I'd suggest that one also allows arbitrary multi-dimensional annotation. The PCA representation of data can easily have 50 dimensions. It is not very intuitive to label all of these by
I agree that the loom file format should stay as general as possible. For my taste, even more general. It would be nice if loom could at least cope with what AnnData can store, which is otherwise very similar to loom: we allow arbitrary structured and unstructured annotation/metadata/attributes, and only distinguish inherently one-dimensional (numerical or categorical) annotation, multi-dimensional annotation (something like the PCA representation of data 'X_pca', mentioned above), and unstructured annotation. For instance, graphs (e.g., a single-cell graph: a sparse matrix whose node attributes are cell attributes) do not fit into the data/annotation/attributes block scheme, and hdf5 does not support sparse matrices. Hence graphs appear as "unstructured annotation" within AnnData, even though they are sliced (subsampling, indexing, etc.). Also, we make use of the unstructured annotation for categorical data types, for non-single-cell graphs, etc. But this is now turning into the specifics of how Scanpy/AnnData objects can be represented as loom files: #1 (comment) as opposed to the inherent hdf5-format of AnnData.
I guess similar issues arise for Seurat objects, Pagoda objects and others. I'd be interested in reading about these issues so that we can all find good best practices.
2D attributes: Yes, this makes sense and would be easy to implement.
Graphs: In Loom, graphs are stored as (row, col, weight) arrays, equivalent to the coo format from scipy.sparse. This supports arbitrary weighted, directed and undirected graphs (even multigraphs, I suppose).
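To make the storage layout concrete, here's a minimal sketch (not loompy's actual code) of a graph held as three parallel `(row, col, weight)` arrays, the same triplet layout as scipy.sparse's coo format:

```python
# A small weighted graph as three parallel arrays, matching the
# (row, col, weight) storage described above. Values are made up.
rows    = [0, 0, 1, 2]
cols    = [1, 2, 2, 0]
weights = [0.5, 0.2, 0.9, 0.1]

def to_edge_dict(rows, cols, weights):
    """Rebuild an edge -> weight mapping from the parallel arrays."""
    return {(r, c): w for r, c, w in zip(rows, cols, weights)}

edges = to_edge_dict(rows, cols, weights)
print(edges[(1, 2)])  # 0.9
```

Because nothing in the triplet layout constrains symmetry or uniqueness of `(row, col)` pairs, it indeed accommodates directed, undirected, and weighted graphs alike.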
Unstructured data: can you explain in more detail what you allow as unstructured, and how that is written and read to the HDF5 format?
Thank you for starting the discussion, Sten. It would indeed be great to have an interoperable format. It would obviously take more to agree on the details, but I first wanted to ask about the global structure.
There are multiple ways to analyze a given dataset, and often we want to maintain multiple results. In pagoda2 we found it useful to have a notion of dataset "reductions". These are essentially processed views of the dataset, such as different normalizations, projection of cells onto PCs, or high-dimensional diffusion map (similar to multi-dimensional attributes Alex mentioned). Some reductions have functional meanings, for instance regressing out cell cycle. A dataset "view" would be a better name.
However, the key structural impact of these is that individual analysis steps, such as clustering or embeddings, can then be run on different reductions. Because of this, we also keep track of which view of the data was used to generate subsequent results, so that the dataset has multiple views, each potentially with different corresponding cluster or embedding results. Do you envision some metadata that would tie annotations and embeddings to different dataset views?
Views: So let me rephrase and see if I understand - you want to create an attribute
But more generally this also boils down to the question of having multiple analysis results. I.e. instead of standardizing on a single
How about we make standard naming schemes in addition to standard names? For example, cluster labels could be named
Orientation: is column access faster? It never occurred to me to check. The matrix is block-compressed so I would have thought it would be symmetric.
Normalization: in our own work, we always recalculate things like normalization and PCA at every step, and just store raw data and results in the Loom file. This is wasteful, of course, but simplifies the pipeline. Probably the only representation of normalization that would be easy to implement across pipelines would be to store a full normalized matrix explicitly. I would vote for creating a new file in this case, so that the normalized data can be input to any algorithm, and the algorithm doesn't need to know which specific layer to use. That's also fine: 1.3 million cells from 10X is just 38 GB, so it's not terribly inefficient.
Here, more generally (also regarding Alex's points above), there is the question of what should be stored in the file and what should be stored only transiently in memory. Anything needed for interoperability between pipelines would have to be stored in the file. But some things may be better recomputed at each step, because they are closely tied to each pipeline.
Extensions: Yes good idea, we should make a HDF5 group for that, e.g.
Multiple views: Sten, I pick up two distinct usage approaches in your response:
On the idea of using naming convention (e.g.
P.S. agree on the rest.
The way we use Loom files is as (1) "a single branch". Our pipeline is based on Luigi from Spotify. You define a "task" which takes one or more input files and produces a single output. You string together tasks using these dependencies, and Luigi figures out what needs to be run, and what tasks can be run in parallel. (Luigi actually doesn't fit well with Loom because it tends to always want to make a new file for every step. This is something we'd like to fix.)
I would probably vote for going with one file, one branch, but maybe with concessions for multiple analyses in specific cases where it makes sense.
Loom has so far been very restricted in the data types allowed for attributes. The spec says
For strings, I propose to stick to ASCII, but to use XML entity encoding (e.g. things like
For numbers, I propose to be more generous, and allow 8, 16, 32 and 64-bit ints, both signed and unsigned, as well as both 32 and 64-bit floats. I propose these same types would be allowed for the matrix layers (but no strings).
We store bools as numbers (1 and 0) currently, and I propose they would be unsigned bytes. This works out of the box with both MATLAB and C/C++.
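The XML-entity idea for strings can be sketched with the Python standard library (this is just an illustration of the encoding, not the actual loompy implementation):

```python
import html

def encode_attr(s):
    # Replace every non-ASCII character with an XML character reference,
    # e.g. "naïve" -> "na&#239;ve", so the stored string is pure ASCII.
    return s.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

def decode_attr(s):
    # Reverse the encoding when reading the attribute back.
    return html.unescape(s)

print(encode_attr("naïve"))               # na&#239;ve
print(decode_attr(encode_attr("naïve")))  # naïve
```

The round trip is lossless for any Unicode string, while the bytes that actually land in the file stay within ASCII.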
I think I will stay out of most of the discussion about structure and naming conventions, since I do not know what the standards and typical use-cases within the field are.
However, it would be nice if the conventions are easy to automate in scripts. I'm pretty sure we were all already thinking along these lines, but I want to point out what this means for me.
The loom-viewer currently detects attributes that fit the conventions used in Sten's group, and adapts the interface to that for convenience:
Here it detected various attributes associated with X/Y position (in this case, previous "deprecated" conventions from before they "stabilised").
So good naming conventions/schemes will result in the loom viewer being more convenient to use too!
Regarding naming schemes: how about treating
A question is whether this relationship would be directed:
(I'll stop here before I propose something that ends up making the naming schemes Turing complete)
Orientation: note that this issue extends beyond loading data from loom files themselves: if each other package uses columns for genes in Python, that would require a transpose each time we move data from one into the other. That is both slower and more error-prone. I do not know what the actual conventions in other packages are, but if I understand @falexwolf correctly the norm in software is column-oriented?
I forgot to mention one convention used in the loom-viewer:
For both row- and column- attributes, a "fake attribute" called
Let's call this a pseudo-attribute: it is metadata about the rows and columns, but not actually part of the loom-file itself.
The choice of using
We could put
Two quick comments:
It's the other way around; per the Wikipedia article:
As far as Numpy goes, it supports both (but defaults to C), so we're not really bound to one or the other:
I'm sincerely not sure if that is how you would express this or not, but either way we can conclude that natural language is confusing and ambiguous, that we agree on what we mean, and we're aware of that now. I think that is what matters most. ;)
Anyway, back to the actual orientation discussion. My usual disclaimer: I do not do actual research with data. Since I do not know how the data is used in practice, I have no idea what would be the more logical orientation.
However, I do have some ideas about what matters when deciding, and I hope those considerations are useful for the discussion. I won't touch the traditions/conventions argument, which should be considered but is obviously outside of my knowledge domain.
1: Should Genes or Cells be the major order?
First thing to note: whether genes or cells should be the major order is a separate question from whether we should make genes/cells row- or column-oriented. The former is about how the data is used; the latter depends on the answer to the first question, and on whether the platforms the data is used on are row- or column-major.
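As a refresher, "major order" is nothing more than which flat-index formula maps a `(row, col)` pair to a position in linear memory. A minimal sketch:

```python
# Row-major (C) vs column-major (Fortran) flat indexing for an
# n_rows x n_cols matrix stored in a single linear buffer.
def idx_row_major(r, c, n_rows, n_cols):
    return r * n_cols + c          # walking along a row is contiguous

def idx_col_major(r, c, n_rows, n_cols):
    return c * n_rows + r          # walking along a column is contiguous

# In row-major order, one row occupies adjacent memory locations:
print([idx_row_major(0, c, 2, 3) for c in range(3)])  # [0, 1, 2]
# In column-major order, one column does:
print([idx_col_major(r, 0, 2, 3) for r in range(2)])  # [0, 1]
```

Whichever axis ends up contiguous is the one that is cheap to scan, which is why the two questions (which axis to favour, and which platform convention to use) interact.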
The first question is obviously the more important one, and answering it requires more than just determining which data is accessed more often.
Types of data access
The kind of access matters as much as the frequency: it is faster to access 64 rows sequentially in one go than to access 64 random ones. We can distinguish random access from whole-block access. Major order matters for both, but it matters more for random access than for block access.
From the comments made (and after a quick chat with @gioelelm) I gather that:
(I could be way off here, please correct me if I am)
Assuming that this interpretation is correct, that would suggest that gene major order would benefit more.
Another thing to consider is relative number of cells versus genes, and size of the data set.
I'm not a biologist, but as I understand, the number of genes is more or less fixed per organism (for example ~29k for the mouse data I've seen in Sten's group). The number of cells varies, and currently most sets I have seen in the group are less than 29k. However, this number will grow rapidly in the future, and into the millions.
If we have to choose, I would argue it is more important to optimise for the larger datasets (so total cells >>> total genes), because of the relative greater impact of performance improvements. Improving performance by 10% on a small data set might save minutes at best, whereas that 10% performance increase might save hours on larger data sets.
When total cells >>> total genes, cell major order would have a significantly more negative impact on gene access, than gene major order would have on cell access. This again speaks in favour of genes as major order. This applies mainly to datasets loaded in memory though, because of the next point.
Loom files are chunked, so major- and minor order should not matter regarding disk access; I just looked at the documentation, which had this to say: "The chunks of the dataset are allocated at independent locations throughout the HDF5 file and a B-tree maps chunk N-dimensional addresses to file addresses."
The HDF Group does suggest making chunk shapes asymmetrical when you know the data has a preferred access order. In our case, we might want to make the chunks contain more cells than genes. This would reduce disk access when accessing genes, and have relatively little cost when accessing cells (again assuming cells >>> genes).
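The effect of elongating chunks can be sketched with some back-of-the-envelope arithmetic (the chunk sizes below are hypothetical, not Loom's actual defaults): reading one full gene (a single row) must touch every chunk column it crosses, so stretching chunks along the cell axis cuts the number of chunks fetched per gene.

```python
import math

def chunks_touched_by_one_gene(n_cells, chunk_cols):
    # One gene row crosses ceil(n_cells / chunk_cols) chunks.
    return math.ceil(n_cells / chunk_cols)

n_cells = 1_000_000
print(chunks_touched_by_one_gene(n_cells, 64))   # 15625 (square 64x64 chunks)
print(chunks_touched_by_one_gene(n_cells, 256))  # 3907  (e.g. 16x256 chunks)
```

Holding total chunk size constant, the same reshaping makes a single-cell read touch proportionally more chunks, which is the trade-off being argued for when cells vastly outnumber genes.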
2: After answering the previous question, what does that mean for orientation?
Let's say the answer to the previous question is that it makes more sense to make genes the major order.
Python is row-major order. Numpy defaults to row-major, but can support column-major. However, every other language used in the sciences is column major (Fortran, R, Matlab, and though nobody here seems to use this beautiful language: Julia).
I suspect that effectively, Fortran matters for the libraries powering the rest (BLAS, LAPACK), and we shouldn't worry about it. Matlab I recall being used more in physics and less in biology, so it's probably not that big of a deal either.
The big two contenders seem to be Python (with Numpy) and R.
We could make genes column-aligned and force all of our Numpy code to use column-major order, which would make things consistent, but that would lead to terrible performance with other Python packages that expect row-major data (presumably most).
For a while I considered a silly, somewhat blasphemous idea: let the Python and R loom libraries use whatever orientation would make genes the major order (not the loom file itself, which should have a fixed orientation). This would make genes align with rows in Python, and with columns in R. It would have the convenient consequence that it aligns with the statistics traditions in R, and the genomics traditions in Python (only the Python stats and machine learning packages would be left out), meaning it fits expectations for most people (as long as they stick to their own platform).
Of course, this would lead to confusion when you use both packages, and it would mean
I guess in the end the main consideration should be how the Python and R platforms are used: is there more random access, or block access, and more gene or cell access?
Let's say that Python is typically used in a fashion that favours random access, and R in a way that favours whole-dataset block access. In that case, the benefit of making the genes row-aligned should be bigger in general. A similar argument holds if Python does more gene access and R does more cell access.
But I have no idea about what this is like in practice, so I'll leave it at that.
Hope this was somewhat useful,
Right, my bad. I meant that for row-major order, iterating over the column index is fast because it's contiguous — that is, each row lies contiguous in memory and can be quickly retrieved. I edited the statement above to stop confusing people...
I think your statements above make sense; I'd say you're right, single access to genes would be more frequent than to cells. Also, I believe you that it's better to have the dimension with more entries (cells, as nr. cells >> nr. genes) be the minor order; so genes as major order; I'll make some experiments with that... ;)
However, I'd say one should make the decision based on simple, clear conventions that are the same between languages... I think both Python and R have the flexibility of interpreting arrays in memory as column- or row-major (as it's just a matter of the indexing convention), and hence transpositions can be avoided.
If loom files are ordered with genes as rows, Scanpy will be able to efficiently cope with that. Nonetheless, I think Scanpy should not stand against the conventions of the whole Python ecosystem and the stat/ML books; it will continue to present variables (genes) as columns to the user and observations/samples (cells) as rows, even though the notions "columns" and "rows" are never used - instead, after a long discussion, we decided for the names "var" for variables and "smp" for samples (maybe "obs" for observation in the future) to avoid running into these convention issues. Looking into the loom file, people might then get confused as things are called "rows" and "cols", but this might be something to live with.
Hi, I’m Philipp – also from Fabian Theis’ lab – and I designed AnnData with Alex.
I wrote a little piece on why we need to stop talking about “row annotation” and “column annotation”. There’s only cells and genes, rows and columns are exclusively an implementation detail.
Because of this, the vital consideration about “how to transpose our matrix” is simply about what performs better – “gene major” or “cell major”.
Now there’s two considerations:
Only then can we make an informed decision what to choose. And only a real-world benchmark can tell us the truth: Neither microbenchmarks nor theorizing over performance will necessarily be useful.
Well, to my shame you took the one thing for granted where I accidentally swapped things. Or rather... it's complicated...
Let's just do the experiments ;)
Well, this would make the per-platform suggestion more appealing. However, there was some hope that Loom might be useful in contexts other than single-cell data, and then having the labels fixed to "genes" and "cells" might be problematic. This could be fixed by some convention for labelling the major/minor axis, but that again introduces extra syntax.
Also, maybe I'm missing something, but I don't think rows and columns are "exclusively an implementation detail".
Currently, when describing an m by n matrix, m is always the rows and n always the columns. Currently in loompy, if one requests ten genes, one gets a 10 x total cells matrix, and when requesting ten cells, loompy returns a total genes x 10 matrix. So users will learn to expect that genes are rows.
Say we avoid that by returning a 10 x total genes matrix in the second case. For R, being column-major, this would be total cells/genes x 10. But even then: users will learn that the major order of whatever is requested is rows in Python, and columns in R. Either that, or we have the same problem of one of the two not utilising major order again. On top of that, it adds an extra transpose that might be unnecessary.
My point is: no matter what, some association with "rows = genes/cells" or "rows = major/minor order" seems inevitable. But again, given my lack of hands-on experience with the data I might be missing something.
While the "high level" questions regarding orientation fit here, I'm worried that diving further into these technical details will make the discussion wander off-topic. So, I opened a separate issue for this low-level aspect. Shall we continue the performance considerations and testing there, while keeping the "user expectations"/traditions discussion here?
If you want to be more general, we can talk about “observations” and “features”, so that’s a bit of a tangent I guess.
Loom is a data format. Its job is to contain the data with the optimal memory layout for what it’s used for. If cell/observation-major is faster and smaller, we do that, or the other way round.
Everything else is API design. We will just return an object of a class that has the dimension order we want. What I mean is that e.g. CSC and CSR matrices contain the exactly same data… but for one you access the major dimension with
The only problem is dense data: R has a built-in
@flying-sheep: please, let's do observations/variables or samples/features. The first is more common in stat, second more common in ML. My impression is the first pair causes the least confusion for bio-related users. Hence, in AnnData, we might consider transition from "samples" to "observations".
@JobLeonard: this is the most general as you can get in any data science problem, right? rows and cols, by contrast, are notions that are not specific to data and suggest that both dimensions play the same roles, as in a graph adjacency matrix, for instance
But the mapping from genes/cells to observations/variables depends on what you want to do and can’t be known in advance. If you are clustering cells, the cells are the observations but if you’re inferring gene regulatory networks, the genes are the observations. One of the goals of Loom was exactly to allow both dimensions to play the same role. This is different from almost every other API out there, which enforces a different API for each dimension. It is one of the great frustrations when trying to store omics data in databases, for example. Transposing a matrix is fast relative to loading data from disk so let’s not worry about the performance aspects of the orientation of the file. On disk, Loom is block-compressed so rows and columns are (ought to be?) equally fast to access. I haven’t heard any very compelling arguments for any particular orientation, so we will stick with genes in rows. This will save us many months of bugs since all our existing code is based on this convention.…
-- Sten Linnarsson, PhD, Professor of Molecular Systems Biology, Karolinska Institutet
Picking up the thread again, I'm happy to announce Loom 2.0, an almost complete rewrite of loompy. Go to the release notes for the full list of changes. In short, v2.0 implements many of the features requested above (such as Unicode support, multidimensional attributes, and multidimensional global attributes).
It also supports a powerful new concept of in-memory views (essentially a slice through the file, including all layers, attributes, and graphs) which works great with the new
It is much more Pythonic, with a uniform API for reading/writing/deleting attributes, graphs and layers.
It is more generous (allowing almost any kind of list, tuple or array to be assigned to attributes, or any kind of sparse or dense adjacency matrix for graphs). At the same time, it normalizes everything to conform to the file spec.
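The "accept anything, normalize on write" idea can be sketched as follows (a hypothetical illustration of the principle, not loompy's actual normalization code — the function name and type rules are made up):

```python
# Hypothetical sketch: accept lists, tuples or ranges and coerce them
# to a flat list of a single scalar type, roughly what "normalize
# everything to conform to the file spec" could mean.
def normalize_attr(values):
    vals = list(values)
    if all(isinstance(v, bool) for v in vals):
        return [1 if v else 0 for v in vals]      # bools stored as 0/1
    if all(isinstance(v, (int, float)) for v in vals):
        return [float(v) for v in vals]           # one numeric type
    return [str(v) for v in vals]                 # fall back to strings

print(normalize_attr((True, False)))   # [1, 0]
print(normalize_attr(range(3)))        # [0.0, 1.0, 2.0]
```

The user-facing API stays permissive while the on-disk representation stays within the small set of types the spec allows.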
Finally, I have updated the file format specification to be more specific about the allowed datatypes and shapes.
All this, and it remains fully backwards compatible with your old loom files. I have used 2.0 for a couple of months already and it's a significant step forward. Your code will be simpler, more expressive and more capable.
Note: your current code should mostly work without problems. You'll get log messages for deprecated methods and attributes, but they can be safely ignored. There are two breaking changes, both mentioned in the release notes. After fixing instances of these two issues, our current analysis pipeline runs without error and uses nearly every feature of loompy 2.0, so I would consider this a reasonably stable release. Out-of-memory manifold learning and clustering of a 500k cells by 27k genes dataset with all the bells and whistles runs just fine on my MacBook with 16 GB RAM (admittedly, it takes a few hours)!
I've updated the docs, which are now (for technical reasons) hosted at http://linnarssonlab.org/loompy.
Thank you for this, Sten! :) Just before Christmas, Scanpy/AnnData incorporated .loom support: https://scanpy.readthedocs.io, (docs read, docs write, code read, code write) Unfortunately, still with the old loom version. We will work on making use of the possibilities of the new version, which will allow losing less information when exporting to
What might be interesting for all efforts of storing sparse matrices in hdf5: I've worked on a more basic way of interfacing sparse matrices in hdf5 files, which works a bit differently than in loom, but provides some advantages and additional functionality.
We mentioned before that we had planned, from the very beginning, to move from a static to a dynamically-backed AnnData object. We did so in the past weeks. AnnData now offers both: the convenience of fancy indexing and everything pandas/numpy offer for accessing and manipulating data ("memory mode"), versus full hdf5-backing ("backed mode") with reduced speed and fewer possibilities for accessing and manipulating data. See my thoughts on this here. Given that loom didn't satisfy all that we need for backing AnnData objects, we currently base this on a very similar hdf5 format (but with additional support for categorical data types, arbitrary unstructured annotations, and hdf5 sparse datasets), and will be happy to move away from this once loom can fully back AnnData objects.
PS: A revised version of the Scanpy draft will appear in Genome Biology soon and we reference and acknowledge the discussion here and Sten's comments there. It really helped us rethink AnnData's hdf5-backing.
Congratulations on the twins! Should keep you very busy :-)
It would be great if you or @flying-sheep could have a look at loompy 2 and let me know what is missing for you to use it as backing for AnnData. I'm not sure I understand the specific requirements for categorical datatypes (we just use strings or integers typically). But I think/hope loom now supports your notion of arbitrary unstructured data (let me know if not!).
As for sparse data, I'll take a look at your work and think a little more. Storing as CSC will probably be much faster along columns (compared to loom) and much slower along the rows. Can you append data to such files without loading all the data? But I'm wondering if we could have a concept of "freezing" a file, after which data can no longer be appended. Before freezing, we could use block-compressed (or COO). Upon freezing, we could create both the CSC and the CSR representation, for fast access along both axes. I think that would work well as we usually have separate phases of (1) gathering all the data into a single loom file and (2) running algorithms on that.
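The freeze idea can be sketched with plain Python data structures (class and method names are hypothetical; a real implementation would build actual CSR/CSC index arrays on disk):

```python
from collections import defaultdict

class FreezableMatrix:
    """Append-friendly COO triples; on freeze, build row- and
    column-oriented indexes for fast access along either axis."""

    def __init__(self):
        self.coo = []          # (row, col, value) triples
        self.by_row = None     # CSR-like index, built on freeze
        self.by_col = None     # CSC-like index, built on freeze

    def append(self, r, c, v):
        if self.by_row is not None:
            raise RuntimeError("matrix is frozen; no more appends")
        self.coo.append((r, c, v))

    def freeze(self):
        self.by_row = defaultdict(list)
        self.by_col = defaultdict(list)
        for r, c, v in self.coo:
            self.by_row[r].append((c, v))   # fast gene (row) access
            self.by_col[c].append((r, v))   # fast cell (column) access

m = FreezableMatrix()
m.append(0, 1, 2.0)
m.append(1, 1, 3.0)
m.freeze()
print(m.by_col[1])   # [(0, 2.0), (1, 3.0)]
```

This matches the two-phase workflow described above: cheap appends while gathering data, then a one-time cost at freeze in exchange for fast reads along both axes.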
Thank you, Sten, and a happy new year to everybody!
Of course, we will look into loompy 2! We will gradually improve the interface with anndata and converge to something that is mature with clear and powerful conventions.