RAPIDS implementation of Scanpy rank_genes_groups appears incorrect #29

rmovva opened this issue Jul 1, 2020 · 6 comments

rmovva commented Jul 1, 2020

I tried running the RAPIDS implementation of rank_genes_groups alongside the Scanpy CPU implementation on the same data matrix, but I'm getting very different results.

Here's my code for the GPU call:

cluster_labels = cudf.Series.from_categorical(adata.obs["louvain"].cat)
var_names = cudf.Series(var_names)
dense_gpu_array = cp.array(adata_raw.X.todense())

scores, names, reference = rapids_scanpy_funcs.rank_genes_groups(
    dense_gpu_array,
    cluster_labels, 
    var_names, 
    n_genes=n_top_diff_peaks, groups='all', reference='rest')

And the CPU call:

adata_raw.obs['louvain'] = adata.obs['louvain'].tolist()
sc.tl.rank_genes_groups(adata_raw, 
                       groupby="louvain", 
                       n_genes=n_top_diff_peaks, 
                       groups='all', 
                       reference='rest',
                       method='logreg'
                       )

When I look at the top differential gene for each cluster, the outputs reported by the GPU and CPU are disjoint. Also, I note that while the CPU output is sorted by score (i.e., the top 50 diff. genes have high scores, and are sorted in decreasing order), the GPU output seems to be unsorted, and some of the scores are very low. My suspicion is that the GPU output isn't actually being properly sorted by logistic regression coefficient, so the output is just some random set of differential genes & their scores instead of the top N.

When I scatterplot the results, the CPU results also seem to make much more sense than the GPU.
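
The two symptoms described above (an unsorted score list and disjoint top genes) can be checked directly. The following is a minimal sketch in plain Python; the helper names `is_sorted_descending` and `top_gene_overlap` and the toy gene names are hypothetical and are not part of Scanpy or the RAPIDS notebooks. It assumes the scores and names for one cluster have already been pulled out into ordinary lists.

```python
def is_sorted_descending(scores):
    """Return True if each score is >= the one after it."""
    scores = list(scores)
    return all(a >= b for a, b in zip(scores, scores[1:]))


def top_gene_overlap(cpu_names, gpu_names):
    """Jaccard overlap between two top-N gene lists (0.0 means disjoint)."""
    cpu, gpu = set(cpu_names), set(gpu_names)
    union = cpu | gpu
    return len(cpu & gpu) / len(union) if union else 1.0


# Toy values for a single cluster (made up for illustration only).
cpu_scores = [3.2, 2.1, 0.9]
gpu_scores = [0.4, 2.8, 0.1]
cpu_names = ["GeneA", "GeneB", "GeneC"]
gpu_names = ["GeneX", "GeneA", "GeneY"]

print(is_sorted_descending(cpu_scores))
print(is_sorted_descending(gpu_scores))
print(top_gene_overlap(cpu_names, gpu_names))
```

Running the same two checks per cluster on the real CPU and GPU outputs would show whether the mismatch is only an ordering problem or whether the selected genes themselves differ.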

@avantikalal
Contributor

Possibly relevant cuML issue: rapidsai/cuml#2478

@cjnolet
Member

cjnolet commented Jul 5, 2020

> Also, I note that while the CPU output is sorted by score (i.e., the top 50 diff. genes have high scores, and are sorted in decreasing order), the GPU output seems to be unsorted, and some of the scores are very low.

Are you referring to the resulting scores (e.g., adata.uns['rank_genes_groups']['scores'])? For me, the output for both CPU and GPU seems to be unsorted.

Correction: When using penalty='none', the major axis is indeed sorted in both the GPU and CPU notebooks.

While we wait for the release of the fix for rapidsai/cuml#2478, we have a couple of options:

  1. Turn off regularization by passing penalty='none' into the rank_genes_groups functions for both CPU and GPU.
  2. Set the C hyper-parameter to the number of elements in X, as recommended in rapidsai/cuml#2478 ("[BUG] LogisticRegression suffers from accuracy loss when penalty is enabled").
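
As a rough sketch of the two options above, the extra keyword arguments could be assembled as follows. The helper `logreg_workaround_kwargs` and its option names are hypothetical; it assumes, as the first option implies, that both rank_genes_groups functions forward extra keyword arguments to the underlying logistic regression. The value to use for `n_elements` is left to the caller, since the thread does not spell out how it is computed.

```python
def logreg_workaround_kwargs(option, n_elements=None):
    """Build extra keyword arguments for one of the two workarounds.

    option: "no_penalty" (option 1) or "scale_C" (option 2).
    n_elements: value to use for C in option 2, supplied by the caller.
    """
    if option == "no_penalty":
        # Spelled as in this thread; newer scikit-learn releases
        # use penalty=None instead of the string 'none'.
        return {"penalty": "none"}
    if option == "scale_C":
        if n_elements is None or n_elements <= 0:
            raise ValueError("n_elements must be a positive number")
        return {"C": float(n_elements)}
    raise ValueError(f"unknown option: {option!r}")


# Intended use (not executed here, assumes kwargs are forwarded):
#   kwargs = logreg_workaround_kwargs("no_penalty")
#   sc.tl.rank_genes_groups(adata_raw, groupby="louvain",
#                           method="logreg", **kwargs)
print(logreg_workaround_kwargs("no_penalty"))
print(logreg_workaround_kwargs("scale_C", n_elements=1000))
```

Whichever option is chosen, the same arguments should go to both the CPU and GPU calls so that the two outputs remain comparable.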

@teju85

teju85 commented Feb 24, 2021

hey folks,
just curious. Since the above cuML issue has been fixed, did any of you get a chance to rerun the code afterwards? Are you still facing this issue?

@avantikalal
Contributor

@teju85 thanks for the reminder, we'll check and get back to you.

@avantikalal
Contributor

@teju85 We checked this and the problem still exists, despite the cuML bug being resolved. @cjnolet is looking into it.

@avantikalal
Contributor

This issue should be resolved now: rapidsai/cuml#3645

Will test and close.
