RAPIDS implementation of Scanpy rank_genes_groups appears incorrect #29

rmovva · 2020-07-01T02:46:10Z

I tried running the RAPIDS implementation of rank_genes_groups alongside the Scanpy CPU implementation on the same data matrix, but I'm getting very different results.

Here's my code for the GPU call:

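import cudf
import cupy as cp
import rapids_scanpy_funcs

# Move the cluster labels, gene names, and expression matrix to the GPU.
cluster_labels = cudf.Series.from_categorical(adata.obs["louvain"].cat)
var_names = cudf.Series(var_names)
dense_gpu_array = cp.array(adata_raw.X.todense())

# Rank genes per Louvain cluster against the rest of the cells.
scores, names, reference = rapids_scanpy_funcs.rank_genes_groups(
    dense_gpu_array,
    cluster_labels,
    var_names,
    n_genes=n_top_diff_peaks, groups='all', reference='rest')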

And the CPU call:

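import scanpy as sc

# Copy the cluster labels onto the raw AnnData, then rank genes with logistic regression.
adata_raw.obs['louvain'] = adata.obs['louvain'].tolist()
sc.tl.rank_genes_groups(adata_raw,
                        groupby="louvain",
                        n_genes=n_top_diff_peaks,
                        groups='all',
                        reference='rest',
                        method='logreg'
                        )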

When I look at the top differential gene for each cluster, the outputs reported by the GPU and CPU are disjoint. Also, I note that while the CPU output is sorted by score (i.e., the top 50 diff. genes have high scores, and are sorted in decreasing order), the GPU output seems to be unsorted, and some of the scores are very low. My suspicion is that the GPU output isn't actually being properly sorted by logistic regression coefficient, so the output is just some random set of differential genes & their scores instead of the top N.

When I scatterplot the results, the CPU results also seem to make much more sense than the GPU.
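To make the two symptoms above concrete (unsorted scores and disjoint top genes), here is a minimal sketch of the checks one could run on the per-cluster outputs. The helper names (`rank_top_genes`, `is_sorted_descending`, `top_gene_overlap`) and the toy coefficients are hypothetical and are not part of Scanpy or `rapids_scanpy_funcs`; the sketch only assumes that each implementation is meant to return the genes with the largest logistic regression coefficients, highest first.

```python
def rank_top_genes(coefs, var_names, n_genes):
    """Return (names, scores) for the n_genes largest coefficients, highest first."""
    order = sorted(range(len(coefs)), key=lambda i: coefs[i], reverse=True)[:n_genes]
    return [var_names[i] for i in order], [coefs[i] for i in order]


def is_sorted_descending(scores):
    """True if every score is >= the one that follows it."""
    return all(a >= b for a, b in zip(scores, scores[1:]))


def top_gene_overlap(cpu_names, gpu_names):
    """Jaccard overlap between two top-gene lists (1.0 means identical sets)."""
    cpu, gpu = set(cpu_names), set(gpu_names)
    union = cpu | gpu
    return len(cpu & gpu) / len(union) if union else 1.0


# Toy coefficients for a single cluster (made-up numbers, not taken from this issue).
toy_var_names = ["g0", "g1", "g2", "g3", "g4"]
toy_coefs = [0.1, 2.5, -0.3, 1.2, 0.0]

top_names, top_scores = rank_top_genes(toy_coefs, toy_var_names, n_genes=3)

# A hypothetical "GPU-like" result that is neither sorted nor the true top 3.
suspect_names = ["g4", "g1", "g2"]
suspect_scores = [0.0, 2.5, -0.3]

print(top_names, is_sorted_descending(top_scores))
print(is_sorted_descending(suspect_scores), top_gene_overlap(top_names, suspect_names))
```

Running the same two checks on each cluster's column of names and scores from the CPU and GPU runs would show whether the GPU output is merely ordered differently or is selecting different genes altogether.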


avantikalal · 2020-07-02T15:56:39Z

Possibly relevant cuML issue: rapidsai/cuml#2478

cjnolet · 2020-07-05T04:30:33Z

> Also, I note that while the CPU output is sorted by score (i.e., the top 50 diff. genes have high scores, and are sorted in decreasing order), the GPU output seems to be unsorted, and some of the scores are very low.

~~Are you referring to the resulting scores (e.g., adata.uns['rank_genes_groups']['scores'])? For me, the output for both CPU and GPU seems to be unsorted.~~

Correction: When using penalty='none', the major axis is indeed sorted in both the GPU and CPU notebooks.

While we wait for the release of the fix for rapidsai/cuml#2478, we have a couple of options:

- Turn off regularization by passing penalty='none' into the rank_genes_groups functions for both CPU and GPU.
- Set the C hyper-parameter to the number of elements in X, as recommended in rapidsai/cuml#2478 ([BUG] LogisticRegression suffers from accuracy loss when penalty is enabled).
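As a rough sketch of the two workarounds, the helper below builds the extra keyword arguments that would be passed to both rank_genes_groups calls. The function name `logreg_workaround_kwargs` and the option labels are hypothetical. Reading "number of elements in X" as rows times columns is also an assumption; if the linked issue means the number of rows, use `x_shape[0]` instead.

```python
def logreg_workaround_kwargs(x_shape, option):
    """Build extra keyword arguments for the logistic regression workarounds.

    x_shape: (n_rows, n_cols) of the expression matrix X.
    option:  "no_penalty" to turn regularization off, or
             "large_C" to set C to the number of elements in X.
    """
    n_rows, n_cols = x_shape
    if option == "no_penalty":
        # String form used in this thread; this is the value discussed above.
        return {"penalty": "none"}
    if option == "large_C":
        # Assumption: "number of elements" means rows * columns.
        return {"C": float(n_rows * n_cols)}
    raise ValueError(f"unknown option: {option!r}")


# Example with a hypothetical 1000 x 50 matrix.
kwargs = logreg_workaround_kwargs((1000, 50), "large_C")
print(kwargs)

# The same kwargs would then go to both calls from the original post, e.g.:
#   sc.tl.rank_genes_groups(adata_raw, groupby="louvain", n_genes=n_top_diff_peaks,
#                           groups='all', reference='rest', method='logreg', **kwargs)
#   rapids_scanpy_funcs.rank_genes_groups(dense_gpu_array, cluster_labels, var_names,
#                                         n_genes=n_top_diff_peaks, groups='all',
#                                         reference='rest', **kwargs)
```

Passing identical keyword arguments to both calls keeps the CPU and GPU runs comparable. Note that newer scikit-learn releases expect `penalty=None` rather than the string `'none'`, so the first option may need adjusting depending on the installed version.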

teju85 · 2021-02-24T10:41:41Z

Hey folks, just curious: since the above cuML issue has been fixed, did any of you get a chance to rerun the code afterwards? Are you still facing this issue?

avantikalal · 2021-02-26T17:46:27Z

@teju85 thanks for the reminder, we'll check and get back to you.

avantikalal · 2021-03-18T15:23:41Z

@teju85 We checked this and the problem still exists, despite the cuML bug being resolved. @cjnolet is looking into it.

avantikalal · 2021-04-27T17:03:58Z

This issue should be resolved now: rapidsai/cuml#3645

Will test and close.

avantikalal assigned cjnolet and avantikalal Jul 1, 2020

avantikalal assigned mdemouth Mar 5, 2021

avantikalal unassigned mdemouth Mar 18, 2021

cjnolet mentioned this issue Mar 22, 2021

[BUG] Logistic regression coefficients (for feature importance) significantly differ from Scikit-learn rapidsai/cuml#3645

Closed
