Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sparse matrices for cbImportScanpy #231

Merged
merged 3 commits into from
Nov 24, 2021

Conversation

redst4r
Copy link
Contributor

@redst4r redst4r commented Nov 24, 2021

Fixing #230 : Export the gene expression matrix in sparse format, saving time and disk space

Using 'matrix.mtx.gz' as the new expression matrix exposed two other bugs:

  • running cbBuild on an existing build will fail with a missing key exception (outConf["fileVersions"] does not exist in matrixOrSamplesHaveChanged())
  • The logic in MatrixMtxReader.iterrows() to resolve geneIds works differently then in MatrixTsvReader.iterrows(), should be consistent now.

@maximilianh
Copy link
Owner

Wow, thanks! Looks good, I may change a few small things later with a followup commit.

@maximilianh maximilianh merged commit a3a7af6 into maximilianh:master Nov 24, 2021
@maximilianh
Copy link
Owner

maximilianh commented Nov 24, 2021

Note that I moved this into the develop branch. We keep the released version in master, and development in "develop". I'm not sure if this is a good convention, often PRs forget to set the branch, but it means that people who clone the repo first see the stable version. Idk.

@maximilianh
Copy link
Owner

Now after your changes, there is no way to get the .tsv output anymore. I wonder if this is a good idea. Old pipelines may be used to have the .tsv.gz ?

I just looked at an mtx file, this is what it looks like:

9 80 52
1 12 5.122949e+00
1 14 5.871189e+00
1 15 5.310305e+00
1 16 5.264228e+00
1 19 5.385112e+00
2 11 5.893417e+00
2 13 5.397913e+00
2 15 5.310305e+00
2 16 5.264228e+00
2 51 4.631501e+00
2 55 4.586091e+00
2 61 4.406388e+00
2 65 4.307617e+00

Doesn't look very compact to me. Strange that mmwrite doesn't allow us to set the format or get rid of these pointless +00 strings... we'll have a lot of these. Did you compare the file size compared to .tsv.gz ?

@maximilianh
Copy link
Owner

Hi @redst4r I've made a few changes to this:

  • cbScanpyImport now has an option -f to specify the format
  • the default is "tsv" but you can add the line "matrixFormat='mtx'" to your ~/.cellbrowser.conf to change this
  • cbScanpy got the same option -f, same behavior
  • I changed the default output filename to features.tsv.gz instead of genes.tsv.gz because that's the new name for Cellranger 3 and also moving forward that will work better for the various other dataset types that we'll have and so we don't ever have to change it again

All of this is in the develop branch for now.

@maximilianh
Copy link
Owner

Also changed the default branch now to "develop", so future PRs will go to that by default and I don't have to revert commits anymore. I should have done that years ago.

@maximilianh
Copy link
Owner

Hey @redst4r, this part of the code is problematic:

genes_file = join(path, 'features.tsv.gz')
with gzip.open(genes_file, 'wt') as f:
    f.write("\n".join(genes))

You're not saving the symbols or geneIDs, just one of them. Some users, like @pcm32 really need to keep both the gene IDs and the gene symbols. I think in features.tsv.gz there is a convention to have geneIds first, then symbols, as tab-sep columns (I hope that's correct). Can we make it so such that this information is not lost in the conversion? The issue linked from here has some additional information.

I see that you run geneSeriesToStrings, but that's only useful for .tsv format. For .mtx format, I think the two columns must be tab-separated. Ideally, we would look at an example file from cellranger and imitate their format...?

@maximilianh
Copy link
Owner

OK, nevermind, I changed this now, getting rid of some code duplication. adding a parameter "sep" to geneSeriesToStrings should have addressed this. I haven't tested this yet, @matthewspeir may have ideas on better example datasets, probably I should test a lot more with .h5ad files.

@maximilianh
Copy link
Owner

Hi @redst4r, are you happy with the changes I made to your pull request? If not, please don't hesitate to reply here or let us know via cells@ucsc.edu. The changes are made to make sure that existing functionality is not broken. Some users, like @pcm32 use cbBuild in their pipelines and if the default output format is changed, that may break their pipelines. But in principle, I guess we all agree that .mtx.gz is probably the best default format in the long run...

@redst4r
Copy link
Contributor Author

redst4r commented Nov 29, 2021

Hi, sorry was busy last week with other stuff. Making the output format a command line option seems like a good compromise for compatibility. Thanks for checking compatibility in general, I use cellbrowser only in one particular workflow, so it's hard for me to see what those changes break for other people!

About the weird floating point formatting of mmwrite: I don't think it's too big of an issue, compression should take care of most of that overhead. But I guess you could specify mmwrite(field='integer') for integer matrices, or mmwrite(field='real', precision=xxx) otherwise to make it more compact.

All your changes look good to me, please go ahead with it!

@maximilianh
Copy link
Owner

No, field='integer' is not needed, mmwrite does this automatically, I just tried. as for precision, do you know why you changed the default to precision=7? Looks good to me either way, was just wondering why you changed the default.

@redst4r
Copy link
Contributor Author

redst4r commented Dec 3, 2021

actually, no idea about the precision=7 argument, I might have just copy/pasted that from somewhere. Feel free to reset it to default

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants