Address a few frequent matrix dtype conversion errors #281

Closed
hvbakel wants to merge 1 commit

Conversation

hvbakel commented Dec 23, 2023

When running Pegasus I've encountered recurring dtype errors in the calc_stat_per_batch function that seem to happen when the matrix is not explicitly cast to np.float32. There are also a few cases in preprocessing.py where the use of -= or /= can cause occasional dtype errors.
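
For illustration, here is a minimal sketch of the underlying NumPy behavior (not taken from the Pegasus code itself; the small array just stands in for a raw count matrix):

import numpy as np

counts = np.array([[1, 2], [3, 4]])    # integer dtype, like a raw count matrix
try:
    counts -= counts.mean(axis=0)      # in-place op whose result is float64
except TypeError as e:                 # NumPy raises UFuncTypeError, a TypeError subclass
    print(e)                           # cannot cast the float64 output back into the int array

counts = counts.astype(np.float32)     # explicit cast, as this PR proposes
counts -= counts.mean(axis=0)          # now succeeds in place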

yihming commented Dec 23, 2023

Hi @hvbakel. Thanks for your contribution. Could you please share a use case where Pegasus fails due to this issue? I'd like to reproduce it on my side to better understand it. Thanks!

Sincerely,
Yiming

hvbakel commented Dec 23, 2023

I've made a testcase available at the following link: https://www.dropbox.com/scl/fi/1dgciypx87dewzot9fjp3/pegasus-dtype-error-testcase.tar.gz?rlkey=01k6j2i12cxshrqbsjux6ro5w&dl=0

In addition to the h5ad files, the tarball includes a csv file for loading them into a multimodal object (Batch1_pg_aggregate.csv), a test script to trigger the error (testcase.py), and an example error message (testcase_error.log). Note that the error can be tricky to reproduce, as it doesn't always occur even when starting from the same file set.

yihming commented Jan 6, 2024

Hi @hvbakel ,

It took me quite a while to understand this issue.

I have 2 comments:

  1. Pegasus' highly_variable_features() function assumes log_norm() has been run beforehand, so that the HVG selection works on a log-normalized count matrix, which is of float/double type.

  2. The aggregate_matrices() function has an issue in deciding the default count matrix of the aggregation result: it simply picks the first matrix in the result. In your case, since your source h5ad files have 2 matrices (X for int-type raw counts, counts.log_norm for float-type log-normalized counts), it more or less randomly chose one to be the default count matrix:

    • If the result has X as the default, your code hits this type-mismatch issue.
    • Otherwise, if the result has counts.log_norm as the default, your code should work without error.

For issue 2 above, I've fixed it in this PR. Please upgrade your pegasusio package to 0.8.2. In the new version, the aggregated data object will choose the count matrix key that the majority of source objects use as default; i.e., in your case it would be X, the raw counts, since all of your h5ad files have X as the default.
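
As a quick sanity check after upgrading (a minimal sketch; it assumes the aggregated object exposes the default count matrix as .X):

import pegasus as pg

data = pg.aggregate_matrices("Batch1_pg_aggregate.csv")
print(data.X.dtype)  # with pegasusio 0.8.2, X (raw counts) should be the default, so expect an integer dtype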

For issue 1, I have 2 suggestions:

  • If you want to redo the log-norm starting from the default raw counts, your code should look like the following:
data = pg.aggregate_matrices("Batch1_pg_aggregate.csv")
pg.identify_robust_genes(data)
pg.log_norm(data)
pg.highly_variable_features(data, batch='Channel', n_top=5000, flavor="pegasus")
  • Otherwise, if you want to select HVG from the preexisting log-norm counts, you could do:
data = pg.aggregate_matrices("Batch1_pg_aggregate.csv")
pg.identify_robust_genes(data)
data.select_matrix('counts.log_norm')
pg.highly_variable_features(data, batch='Channel', n_top=5000, flavor="pegasus")

which switches the default matrix to the log-norm counts.

You may choose either one depending on your analysis preference.

Sincerely,
Yiming

yihming commented Jan 6, 2024

I'll close this PR. Please feel free to open an issue if the problem persists.

yihming closed this Jan 6, 2024