
Memory usage ir_neighbors #217

Closed
vladie0 opened this issue Dec 1, 2020 · 27 comments · Fixed by #230

@vladie0 commented Dec 1, 2020

Dear authors,

Currently, the ir_neighbors algorithm consumes over 100 GB of memory for data from 55k cells, causing memory failures on the server.
Is there any way to limit memory consumption? Is this normal behavior, and which parameters can I adapt to control memory usage?

Kind Regards,

@grst (Collaborator) commented Dec 1, 2020

Hi @vladie0,

What metric and which options do you use for ir_neighbors? For only 55k cells, this seems like a lot!

There are several points to consider here:

  • The network graph is stored as a sparse matrix. Each edge in the cell x cell neighborhood graph has an entry in the matrix.
  • Reducing the cutoff reduces the number of edges, and therefore memory consumption.
  • If there are many cells with the same clonotype, this leads to a "dense" block within the sparse matrix, which can drive memory consumption up (see the sketch below). I am planning to fix that (Refactor CDR3-network construction. #191), but won't have time for it until January.
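A back-of-the-envelope sketch of that last point (just an illustration, not scirpy code), assuming each stored entry of a CSR sparse matrix costs one float64 value plus one int32 column index:

# Approximate memory of one "dense" clonotype block: a clonotype shared by
# n cells contributes an n x n block of explicit entries to the
# cell x cell connectivity matrix.
def dense_block_gib(n_cells: int, bytes_per_entry: int = 8 + 4) -> float:
    return n_cells ** 2 * bytes_per_entry / 1024 ** 3

for n in (1_000, 10_000, 50_000):
    print(f"{n:>6} cells in one clonotype -> ~{dense_block_gib(n):5.1f} GiB")

With several clonotypes of 10k+ cells each, such blocks alone quickly add up to many gigabytes.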

Best,
Gregor

@vladie0 (Author) commented Dec 1, 2020

Dear Gregor,

Thank you for your swift reply.
I've reduced the cutoff to 2, but still have the same issue. The output is shown below; it gets stuck at this stage, memory usage ramps up to 128 GB, and then the system fails.

[screenshot of the ir_neighbors progress output]

Are there other parameters that could be of importance, for example metric, n_jobs, sequence, ...?

Note, I've also merged the TCR data and adata prior to the ir_neighbors calculations. I've tried the following options without success:

ir.pp.ir_neighbors(adata)

ir.pp.ir_neighbors(adata, receptor_arms="all", dual_ir="primary_only")

ir.pp.ir_neighbors(
    adata,
    metric="alignment",
    n_jobs=1,
    sequence="aa",
    cutoff=2,
    receptor_arms="any",
    dual_ir="primary_only",
)

ir.pp.ir_neighbors(
    adata,
    metric="levenshtein",
    n_jobs=8,
    sequence="aa",
    cutoff=10,
    receptor_arms="any",
    dual_ir="primary_only",
)

@vladie0 (Author) commented Dec 1, 2020

Even with metric="identity", the same memory issue persists.

@vladie0 (Author) commented Dec 2, 2020

This issue keeps occurring even when the data is downsampled to 10,000 cells.

@grst (Collaborator) commented Dec 2, 2020

So for this, I would indeed expect memory issues:

ir.pp.ir_neighbors(
    adata,
    metric="levenshtein",
    n_jobs=8,
    sequence="aa",
    cutoff=10,
    receptor_arms="any",
    dual_ir="primary_only",
)

A Levenshtein distance of 10 will connect almost all cells. A threshold of 1 or 2 makes more sense here; a cutoff of 10 is better suited to the alignment metric.
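For illustration, a more conservative Levenshtein call could look like this (just a sketch, reusing the parameters already shown in this thread):

ir.pp.ir_neighbors(
    adata,
    metric="levenshtein",
    sequence="aa",
    cutoff=2,  # connect sequences that differ by at most 2 edits
    receptor_arms="any",
    dual_ir="primary_only",
    n_jobs=8,
)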

For the others, in particular ir.pp.ir_neighbors(adata), this should not happen.

Just to be sure: when testing the identity metric and the downsampled data, did you restart the Python kernel to make sure you started with clean memory?


Let's also do some checks on your AnnData. Could you please execute the following commands? They count the most frequent CDR3 sequences:

adata.obs.groupby("IR_VJ_1_cdr3").size().sort_values(ascending=False).head()
adata.obs.groupby("IR_VDJ_1_cdr3").size().sort_values(ascending=False).head()

@vladie0 (Author) commented Dec 2, 2020

Yes, the memory was cleared each time.
I ran the commands you requested; the result was computed instantly:

[screenshot of the groupby output showing the most frequent CDR3 sequences]

@grst (Collaborator) commented Dec 2, 2020

Ah wow, that explains it. You essentially only have three different clonotypes in your data, with 10k+ cells each.
Is that what you expect from your data?

This will lead to three "dense" blocks of 12931 x 12931 + 13784 x 13784 + ... in the sparse matrix, which is obviously a lot of memory.

This will be fixed by #191, but it will still take some time to implement.

On the other hand, with only three different clonotypes, the clonotype network does not make a lot of sense anyway.

EDIT:
as it's only showing the top 5, there might still be some more with low counts. In that case it would still make sense to build the network.

A "workaround" would be to downsample only the cells that belong to that abundant clonotype -- or just live with the memory consumption if your system can handle it.

@vladie0 (Author) commented Dec 2, 2020

Ok, thank you. There are 267 clonotypes, but indeed with a lower presence. I'm quite new to TCR analysis, so I'm still trying to figure things out.
I will certainly look out for the next updates. Thank you for your help and for developing an awesome package.

Would you suggest removing the top 3 clonotypes and building a network on this data subset, or would you still consider it unnecessary to define clonotypes using scirpy and carry on with subsequent analysis?

@grst (Collaborator) commented Dec 2, 2020

First I would wonder: is there a biological explanation for the clonotypes being so skewed, or is this a technical artifact?
I have never seen such an unequal distribution of clonotypes before; they usually follow a power-law distribution.

I think it is still interesting to look at a clonotype network. As a temporary workaround, you could remove all cells with these abundant clonotypes, except a few representatives. Or if you have the memory, just perform the analysis as usual.

@vladie0 (Author) commented Dec 2, 2020

Yes, these are cell lines of T cells harvested from a patient with a certain disease. I suppose this explains the skewed distribution.

I will try to implement the workaround. Thank you for your support!

grst added this to ToDo in scirpy-dev on Dec 2, 2020
@vladie0 (Author) commented Dec 2, 2020

One more question: I've received a 1 TB memory machine to perform the calculation on the full dataset. What exact parameters would you suggest for ir_neighbors?

@grst (Collaborator) commented Dec 2, 2020

I've received a 1 TB memory machine to perform the calculation on the full dataset.

fingers crossed that works out 🤞

What exact parameters would you suggest for ir_neighbors?

If you want to construct the network based on "similar clonotypes that likely recognize the same antigen",
I would go for the alignment metric and a cutoff of somewhere between 10 and 15 (we don't have empirical evidence to back up an ideal threshold).

Since you have so few clonotypes, I would probably use dual_ir = "any" and receptor_arms = "all" (possibly even receptor_arms = "any").

It depends a bit on what your question is. The options are also explained in the documentation.
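Putting those suggestions together, one possible call could look like this (only a sketch; the exact cutoff and receptor_arms choice depend on your question):

ir.pp.ir_neighbors(
    adata,
    metric="alignment",
    sequence="aa",
    cutoff=15,            # anywhere between 10 and 15 seems reasonable
    receptor_arms="all",  # possibly "any"
    dual_ir="any",
    n_jobs=8,
)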

grst moved this from ToDo to In progress in scirpy-dev on Dec 10, 2020
@nicoac commented Dec 16, 2020

I'm running into a similar issue with a dataset generated from T cells with a fixed beta chain. It seems as if the computation gets stuck; this is what it looks like: 0%| | 0/669 [00:00<?, ?it/s]

Looking forward to the fix! I'm really enjoying using Scirpy.

@grst (Collaborator) commented Dec 17, 2020

Hi @nicoac,

To verify that it is really the same issue, could you please also report the result of the following two commands?

adata.obs.groupby("IR_VJ_1_cdr3").size().sort_values(ascending=False).head()
adata.obs.groupby("IR_VDJ_1_cdr3").size().sort_values(ascending=False).head()

@nicoac commented Dec 17, 2020

Thanks for the response, @grst.
Here is the result of that code:

IR_VDJ_1_cdr3
CASSSPGTANYAEQFF 25647
nan 9079
None 242
CTCSAEGDRQAPLF 6
CAWSLGQQNTLYF 6
dtype: int64

As you can see, it is quite a large amount of one specific chain. Also, just to be clear, this is the result of combining six different AnnData objects. I have three day-0 and three day-8 samples that I am tracking longitudinally to follow TCR alpha chain pairings over time.

@grst (Collaborator) commented Dec 17, 2020

Thanks for checking; it does indeed seem to be the same issue.
I'll try to fix it after the holidays.

@nicoac commented Jan 21, 2021

Hi Gregor. Just following up with some updates. I was able to get the numbers to crunch properly all the way through ir_neighbors and to the end of the analysis as per the tutorial. However, after coming back to it a few days later, I am now running into an issue where it seems as if the VDJ coord-dictionary completely resets and wipes all of my variables.

Any thoughts?

@grst (Collaborator) commented Jan 22, 2021

However, after coming back to it a few days later, I am now running into an issue where it seems as if the VDJ coord-dictionary completely resets and wipes all of my variables.

I'm not sure what you mean by the coord-dictionary, or what exactly is getting wiped. Can you provide an example?

@nicoac commented Jan 22, 2021

Here is what happens after a fresh run of the code:

[screenshot of the error output]

@grst (Collaborator) commented Jan 25, 2021

What probably happens is that the Jupyter kernel gets killed because it runs out of memory (it seems pp.ir_neighbors never finished). When the kernel restarts, all your objects are lost.

Unfortunately, there's currently not a lot you can do except for

  • downsampling (reducing the number of cells with identical clonotypes), or
  • getting crazy amounts of memory.

I'm working on a fix in #230, but it turns out to be more difficult than I had expected.

@vladie0 (Author) commented Jan 25, 2021

Hi grst,

I have a new batch of similar data and was wondering what your progress is on optimizing ir_neighbors for sparse matrices. The reason I'm asking is that I have a new, similar project and was wondering whether I should get a large-memory server again?

@grst (Collaborator) commented Jan 25, 2021

was wondering what your progress is on optimizing ir_neighbors for sparse matrices

I don't want to promise anything, but if all goes well I should have something ready to test in 2-3 weeks.

@grst (Collaborator) commented Jan 29, 2021

Hi @vladie0,

just wanted to let you know that the 2-3 weeks won't work out. There were some setbacks and I now need to commit to other projects for a while.

Sorry about that,
Gregor

grst closed this as completed in #230 on Mar 17, 2021
scirpy-dev automation moved this from In progress to Done Mar 17, 2021
@grst (Collaborator) commented Mar 17, 2021

Hi @vladie0 and @nicoac,

#230 is finally merged. Could you give it a try and check if everything works as expected now?

You can install the development version using

pip install git+https://github.com/icbi-lab/scirpy.git@master

Note that the procedure for calling clonotypes has slightly changed (a minimal sketch of the new workflow is at the end of this comment), in particular:

  • pp.ir_neighbors has been replaced by pp.ir_dist
  • tl.define_clonotypes now takes the dual_ir and receptor_arms arguments (they were previously provided to ir_neighbors)
  • The clonotype network now collapses cells with identical receptors to a single dot

The documentation of the development version is at https://icbi-lab.github.io/scirpy/develop and this tutorial section describes the updated clonotype definition.
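A minimal sketch of the updated workflow, using only the functions named above (exact defaults may differ; see the development docs):

import scirpy as ir

ir.pp.ir_dist(adata)  # replaces pp.ir_neighbors
ir.tl.define_clonotypes(adata, receptor_arms="all", dual_ir="any")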

@nicoac commented Mar 17, 2021

I will report back with results. Thank you so much!

