Sparse matrices for cbImportScanpy #231

redst4r · 2021-11-24T00:13:43Z

Fixes #230: export the gene expression matrix in sparse format, saving time and disk space.

Using 'matrix.mtx.gz' as the new expression matrix exposed two other bugs:

Running cbBuild on an existing build fails with a missing-key exception (outConf["fileVersions"] does not exist in matrixOrSamplesHaveChanged()).
The logic in MatrixMtxReader.iterrows() for resolving geneIds worked differently than in MatrixTsvReader.iterrows(); the two should be consistent now.
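The first bug is an ordinary missing-key lookup. Here is a minimal sketch of the failure mode and one defensive way around it. The dictionary contents and the helper name `get_file_versions` are made up for illustration, and this is not necessarily how the PR fixes it:

```python
# Hypothetical config from an older build that predates the "fileVersions" key.
out_conf = {"name": "myDataset"}


def get_file_versions(conf):
    """Return the stored file versions, or an empty dict for older builds."""
    # conf["fileVersions"] would raise KeyError here; .get() falls back instead.
    return conf.get("fileVersions", {})


old_versions = get_file_versions(out_conf)
print(old_versions)
```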


maximilianh · 2021-11-24T11:10:44Z

Wow, thanks! Looks good, I may change a few small things later with a followup commit.

maximilianh · 2021-11-24T11:20:12Z

Note that I moved this into the develop branch. We keep the released version in master, and development in "develop". I'm not sure if this is a good convention, often PRs forget to set the branch, but it means that people who clone the repo first see the stable version. Idk.

maximilianh · 2021-11-24T11:25:57Z

Now, after your changes, there is no way to get the .tsv output anymore. I wonder if this is a good idea. Old pipelines may expect the .tsv.gz?

I just looked at an mtx file, this is what it looks like:

9 80 52
1 12 5.122949e+00
1 14 5.871189e+00
1 15 5.310305e+00
1 16 5.264228e+00
1 19 5.385112e+00
2 11 5.893417e+00
2 13 5.397913e+00
2 15 5.310305e+00
2 16 5.264228e+00
2 51 4.631501e+00
2 55 4.586091e+00
2 61 4.406388e+00
2 65 4.307617e+00

Doesn't look very compact to me. Strange that mmwrite doesn't allow us to set the format or get rid of these pointless +00 strings... we'll have a lot of these. Did you compare the file size with the .tsv.gz?
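One way to answer the size question is to write the same matrix both ways and compare the gzipped sizes. The sketch below uses a synthetic random matrix as a stand-in for real expression data, so the numbers it prints only illustrate the method; real results depend on sparsity and value distribution.

```python
import gzip
import io

import numpy as np
from scipy import sparse
from scipy.io import mmwrite

# Synthetic stand-in for an expression matrix: 200 genes x 300 cells, ~5% non-zero.
rng = np.random.default_rng(0)
dense = rng.random((200, 300))
dense[rng.random((200, 300)) > 0.05] = 0.0
mat = sparse.coo_matrix(dense)

# Matrix Market text, written into memory and then gzipped.
mtx_buf = io.BytesIO()
mmwrite(mtx_buf, mat)
mtx_bytes = mtx_buf.getvalue()
mtx_gz = gzip.compress(mtx_bytes)

# Dense tab-separated text with one row per gene, gzipped the same way.
tsv_buf = io.StringIO()
np.savetxt(tsv_buf, dense, delimiter="\t", fmt="%g")
tsv_gz = gzip.compress(tsv_buf.getvalue().encode())

print("mtx.gz bytes:", len(mtx_gz))
print("tsv.gz bytes:", len(tsv_gz))
```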

maximilianh · 2021-11-24T14:05:00Z

Hi @redst4r I've made a few changes to this:

cbImportScanpy now has an option -f to specify the format.
The default is "tsv", but you can add the line "matrixFormat='mtx'" to your ~/.cellbrowser.conf to change this.
cbScanpy got the same option -f, with the same behavior.
I changed the default output filename to features.tsv.gz instead of genes.tsv.gz. That is the new name in Cellranger 3, and it will also work better for the various other dataset types that we'll have, so we never have to change it again.

All of this is in the develop branch for now.

maximilianh · 2021-11-24T14:08:48Z

Also changed the default branch now to "develop", so future PRs will go to that by default and I don't have to revert commits anymore. I should have done that years ago.

maximilianh · 2021-11-24T14:16:13Z

Hey @redst4r, this part of the code is problematic:

genes_file = join(path, 'features.tsv.gz')
with gzip.open(genes_file, 'wt') as f:
    f.write("\n".join(genes))

You're not saving both the symbols and the geneIDs, just one of them. Some users, like @pcm32, really need to keep both the gene IDs and the gene symbols. I think in features.tsv.gz there is a convention to have geneIDs first, then symbols, as tab-separated columns (I hope that's correct). Can we make it so that this information is not lost in the conversion? The issue linked from here has some additional information.

I see that you run geneSeriesToStrings, but that's only useful for .tsv format. For .mtx format, I think the two columns must be tab-separated. Ideally, we would look at an example file from cellranger and imitate their format...?
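A minimal sketch of the two-column layout described above, with gene ID first and symbol second, separated by a tab. The gene IDs, symbols, and output directory are invented for illustration, and this is not the geneSeriesToStrings code from the repository. As far as I know, Cell Ranger 3 files also carry a third column with the feature type, which this sketch leaves out.

```python
import gzip
import os
import tempfile

# Made-up gene IDs and symbols, for illustration only.
gene_ids = ["ENSG00000000001", "ENSG00000000002", "ENSG00000000003"]
symbols = ["GENE1", "GENE2", "GENE3"]

out_dir = tempfile.mkdtemp()
features_file = os.path.join(out_dir, "features.tsv.gz")

# One feature per line: gene ID, then symbol, separated by a tab.
with gzip.open(features_file, "wt") as f:
    for gene_id, symbol in zip(gene_ids, symbols):
        f.write(f"{gene_id}\t{symbol}\n")

# Read the file back to confirm both columns survive the round trip.
with gzip.open(features_file, "rt") as f:
    rows = [line.rstrip("\n").split("\t") for line in f]

print(rows)
```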

maximilianh · 2021-11-24T14:24:04Z

OK, never mind, I changed this now and got rid of some code duplication. Adding a parameter "sep" to geneSeriesToStrings should have addressed this. I haven't tested this yet; @matthewspeir may have ideas on better example datasets, and I should probably test a lot more with .h5ad files.

maximilianh · 2021-11-29T09:55:13Z

Hi @redst4r, are you happy with the changes I made to your pull request? If not, please don't hesitate to reply here or let us know via cells@ucsc.edu. The changes are made to make sure that existing functionality is not broken. Some users, like @pcm32 use cbBuild in their pipelines and if the default output format is changed, that may break their pipelines. But in principle, I guess we all agree that .mtx.gz is probably the best default format in the long run...

redst4r · 2021-11-29T17:57:58Z

Hi, sorry was busy last week with other stuff. Making the output format a command line option seems like a good compromise for compatibility. Thanks for checking compatibility in general, I use cellbrowser only in one particular workflow, so it's hard for me to see what those changes break for other people!

About the weird floating point formatting of mmwrite: I don't think it's too big of an issue, compression should take care of most of that overhead. But I guess you could specify mmwrite(field='integer') for integer matrices, or mmwrite(field='real', precision=xxx) otherwise to make it more compact.
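To see what the precision argument does to the output, one can write the same small matrix at two settings and compare the text. This is only a sketch with made-up values; the exact number formatting may differ between SciPy versions, so the code compares lengths and asserts no particular digits.

```python
import io

import numpy as np
from scipy import sparse
from scipy.io import mmwrite

# Tiny made-up matrix whose values need many digits to print exactly.
mat = sparse.coo_matrix(np.array([[1 / 3, 0.0], [0.0, 2 / 3]]))


def mtx_text(matrix, precision):
    """Return the Matrix Market text mmwrite produces for the given precision."""
    buf = io.BytesIO()
    mmwrite(buf, matrix, field="real", precision=precision)
    return buf.getvalue().decode("latin1")


short_text = mtx_text(mat, precision=3)
long_text = mtx_text(mat, precision=12)

print(short_text)
print(len(short_text), "vs", len(long_text), "characters")
```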

All your changes look good to me, please go ahead with it!

maximilianh · 2021-11-30T11:10:43Z

No, field='integer' is not needed; mmwrite does this automatically, I just tried. As for precision, do you know why you changed the default to precision=7? Looks good to me either way, I was just wondering why you changed the default.
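The automatic detection can be checked by writing an integer-typed matrix without a field argument and reading the header line. The counts below are made up. A matrix that holds whole numbers in a float dtype would presumably still be written as real, since the choice follows the dtype.

```python
import io

import numpy as np
from scipy import sparse
from scipy.io import mmwrite

# Made-up raw counts stored with an integer dtype.
counts = sparse.coo_matrix(np.array([[0, 3, 0], [1, 0, 7]], dtype=np.int32))

buf = io.BytesIO()
mmwrite(buf, counts)  # no field= argument given
header = buf.getvalue().decode("latin1").splitlines()[0]
print(header)
```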

redst4r · 2021-12-03T20:38:06Z

Actually, no idea about the precision=7 argument, I might have just copy/pasted that from somewhere. Feel free to reset it to the default.

redst4r added 3 commits November 20, 2021 20:28

writing expression matrix in sparse format in scanpy import

34c4ca8

BUGFIX: missing-key error when rerunning cbBuild on an existing build

0684816

fixed some inconsistency with geneSymbol resolve in MatrixMtxReader.i…

53f1705

…terrows()

maximilianh merged commit a3a7af6 into maximilianh:master Nov 24, 2021

maximilianh mentioned this pull request Nov 24, 2021

AnnData conversion no longer grabs var['gene_symbols'] to get gene symbols #216

Closed
