ENH Remove count DataFrame from calculate_cooks #292

asistradition · 2024-06-11T15:22:54Z

What does your PR implement? Be specific.

calculate_cooks casts normed_counts into a pandas DataFrame for robust_method_of_moments_disp. This is memory inefficient for large data.

robust_method_of_moments_disp has been refactored to accept an ndarray directly and the DataFrame has been removed. There is no numerical change as a result.
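To illustrate what working on an ndarray directly can look like, here is a minimal sketch of a per-gene method-of-moments dispersion estimate. It is not PyDESeq2's robust_method_of_moments_disp: the function name `mom_dispersion` is hypothetical, and it uses the plain sample variance instead of a robust (trimmed) variance. It only assumes a samples x genes array of normalized counts.

```python
import numpy as np


def mom_dispersion(normed_counts: np.ndarray) -> np.ndarray:
    """Per-gene method-of-moments dispersion from a samples x genes ndarray.

    Hypothetical helper for illustration; uses the plain sample variance,
    not a robust estimator.
    """
    mean = normed_counts.mean(axis=0)
    var = normed_counts.var(axis=0, ddof=1)
    # Negative binomial variance: var = mean + alpha * mean**2,
    # hence alpha = (var - mean) / mean**2.
    with np.errstate(divide="ignore", invalid="ignore"):
        alpha = (var - mean) / mean**2
    return alpha


# Toy data: 3 samples x 2 genes, no DataFrame involved at any point.
normed_counts = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
alpha = mom_dispersion(normed_counts)
```

All reductions run along axis 0 of the array, so no index or column labels are needed for the computation itself.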


umarteauowkin

Hi @asistradition, thanks a lot for this PR. I was wondering how important it is to you that we put _mu_LFC and _hat_diagonals in obsm rather than in layers. I agree it makes sense not to store useless NaN values. However, fundamentally, these matrices are layers rather than obsm entries (we always run into this issue between objects restricted to non-zero genes and objects defined on all genes; ideally we would have a layers field restricted to non-zero genes, but we don't). That said, if you see significant memory differences compared with keeping them in layers, I will accept!

asistradition · 2024-06-13T15:15:35Z

The main advantage to using .obsm is that column slicing a row-major array requires a copy, and there is a considerable amount of overhead when calling .layers[key][:, filtered_genes] repeatedly.

As those keys are only used in cooks_distances, it is a considerable optimization (for large data, e.g. 50k x 30k) to move them to .obsm and remove those copies.
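The copy behavior described above can be checked with plain NumPy. This is a small sketch rather than PyDESeq2 code: the array names `layer`, `filtered_genes` and `stored` are illustrative, and the shapes are kept small so it runs quickly. Selecting columns with a boolean mask is advanced indexing, which NumPy always materializes as a new array.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_genes = 200, 1000

# Stand-in for a full-width layer: C-contiguous (row-major), all genes.
layer = rng.random((n_obs, n_genes))
# Stand-in for the boolean mask of genes that are kept.
filtered_genes = rng.random(n_genes) > 0.5

# Each mask-based column selection allocates and fills a new array.
subset = layer[:, filtered_genes]
copies_data = not np.shares_memory(subset, layer)

# Storing the already-filtered array once (as an obsm-style entry would)
# means later accesses need no column selection at all.
stored = np.ascontiguousarray(subset)
row_view = stored[:50]  # basic row slicing returns a view, not a copy
is_view = np.shares_memory(row_view, stored)
```

In AnnData, layers must have the same shape as X, while obsm entries only need to match the number of observations, which is what allows a gene-filtered array to be stored there.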


asistradition · 2024-07-01T16:51:07Z

Also includes an optional control_genes argument to fit_size_factors, which has the same behavior as the controlGenes argument to estimateSizeFactors, and the associated unit test.
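As a rough sketch of what restricting size-factor estimation to control genes means, the following NumPy function computes median-of-ratios size factors, optionally using only a subset of genes. It is an illustration under assumptions, not the code added in this PR: `median_of_ratios` is a hypothetical name, counts are assumed to be samples x genes, and only the plain "ratio" method is covered.

```python
import numpy as np


def median_of_ratios(counts, control_genes=None):
    """Median-of-ratios size factors for a samples x genes count matrix.

    Hypothetical helper: if control_genes (integer indices or a boolean
    mask) is given, only those genes contribute to the per-sample median.
    """
    counts = np.asarray(counts, dtype=float)
    with np.errstate(divide="ignore"):
        log_counts = np.log(counts)
    # Per-gene log geometric mean; -inf for any gene with a zero count.
    log_geo_means = log_counts.mean(axis=0)
    if control_genes is not None:
        log_counts = log_counts[:, control_genes]
        log_geo_means = log_geo_means[control_genes]
    # Drop genes whose geometric mean is zero (non-finite log mean).
    usable = np.isfinite(log_geo_means)
    log_ratios = log_counts[:, usable] - log_geo_means[usable]
    return np.exp(np.median(log_ratios, axis=1))


# Toy data: 2 samples x 3 genes.
counts = np.array([[10, 20, 100], [20, 40, 400]])
sf_all = median_of_ratios(counts)
sf_ctrl = median_of_ratios(counts, control_genes=[2])
```

With all genes, the median ratio is driven by the first two genes; restricting to the third gene alone changes the size factors, which is the point of designating control genes.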


BorisMuzellec · 2024-07-02T09:22:19Z

Thanks @asistradition for this PR!

I agree with @umarteauowkin that, in principle, storing _mu_LFC and _hat_diagonals in obsm rather than in layers is a bit awkward, but if this leads to memory gains I'm fine with it, given that (as you pointed out) they are only used in cooks_distances.

asistradition and others added 14 commits May 30, 2024 14:53

Poscount implementation & diag(XXT) optimization

0ae480a

[pre-commit.ci] auto fixes from pre-commit.com hooks

8c85ef3


Add docstring for linter

9564d9d

[pre-commit.ci] auto fixes from pre-commit.com hooks

945f7bc


Wald weighting optimization

3bc2ec2

Refactor calculate_cooks to ndarray from DataFrame

8c618e4

Minor cleanup for cook's distances

99267e6

Improve comments and add tests for size factors

c213ac4

[pre-commit.ci] auto fixes from pre-commit.com hooks

1843643


Fix for linter line width

89141e7

Fix incorrect type hints for mypy

71e65eb

Merge branch 'owkin:main' into main

c48ffe9

Memory optimization for cook's distances

54ca62e

Add comments for clarity

4dfd4ab

asistradition requested review from BorisMuzellec, maikia, arthurPignetOwkin, mandreux-owkin and umarteauowkin as code owners June 11, 2024 15:22

pre-commit-ci bot and others added 5 commits June 11, 2024 15:23

[pre-commit.ci] auto fixes from pre-commit.com hooks

4ab4577


Doc notation fix

49e4ecc

[pre-commit.ci] auto fixes from pre-commit.com hooks

2706640


Add explicit cast for mypy

6b0fe04

[pre-commit.ci] auto fixes from pre-commit.com hooks

b6605c2


umarteauowkin reviewed Jun 13, 2024


asistradition and others added 4 commits July 1, 2024 12:02

Merge branch 'owkin:main' into main

349371c

Add control_genes to fit_size_factors

658ab3f

Merge branch 'main' of github.com:asistradition/PyDESeq2

06c6994

[pre-commit.ci] auto fixes from pre-commit.com hooks

98bd2b1


asistradition and others added 2 commits July 1, 2024 12:42

Replace pipe with Union

a108ce5

[pre-commit.ci] auto fixes from pre-commit.com hooks

5ea08ad


BorisMuzellec reviewed Jul 2, 2024


BorisMuzellec added 2 commits July 2, 2024 11:13

docs: add a doctring for the control_genes argument in `fit_size_factors`

dc95947

docs: fix docstring for the control_genes argument in `fit_size_factors`

c1b927c

BorisMuzellec approved these changes Jul 2, 2024


BorisMuzellec merged commit 3505d78 into owkin:main Jul 2, 2024
6 checks passed
