
Memory usage ir_neighbors #217

Closed
vladie0 opened this issue Dec 1, 2020 · 27 comments · Fixed by #230

@vladie0 commented Dec 1, 2020

Dear authors,

Currently, the ir_neighbors algorithm consumes over 100 GB of memory for data from 55k cells, causing memory failures on the server.
Is there any way to limit memory consumption? Is this normal behavior, and which parameters can I adapt to control memory usage?

Kind Regards,

@grst (Collaborator) commented Dec 1, 2020

Hi @vladie0,

What metric and which options do you use for ir_neighbors? For only 55k cells, this seems like a lot!

There are several points to consider here:

  • The network graph is stored as a sparse matrix. Each edge in the cell x cell neighborhood graph has an entry in the matrix.
  • Reducing the cutoff reduces the number of edges, and therefore memory consumption.
  • If there are many cells with the same clonotype, this leads to a "dense" block within the sparse matrix, which can drive memory consumption up (see the sketch below). I am planning to fix that (Refactor CDR3-network construction. #191), but won't have time for it until January.
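A back-of-the-envelope sketch of that last point (just an illustration, not scirpy code), assuming each stored entry of a CSR sparse matrix costs one float64 value plus one int32 column index:

# Approximate memory of one "dense" clonotype block: a clonotype shared by
# n cells contributes an n x n block of explicit entries to the
# cell x cell connectivity matrix.
def dense_block_gib(n_cells: int, bytes_per_entry: int = 8 + 4) -> float:
    return n_cells ** 2 * bytes_per_entry / 1024 ** 3

for n in (1_000, 10_000, 50_000):
    print(f"{n:>6} cells in one clonotype -> ~{dense_block_gib(n):5.1f} GiB")

With several clonotypes of 10k+ cells each, such blocks alone quickly add up to many gigabytes.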

Best,
Gregor

@vladie0 (Author) commented Dec 1, 2020

Dear Gregor,

Thank you for your swift reply.
I've reduced the cutoff to 2, but still have the same issue. The output is shown below; it gets stuck at this stage, memory usage ramps up to 128 GB, and then the system fails.

[screenshot of the ir_neighbors progress output]

Are there other parameters that could be of importance, for example metric, n_jobs, sequence, ...?

Note, I've also merged the TCR data and adata prior to the ir_neighbors calculations. I've tried the following options without success:

ir.pp.ir_neighbors(adata)

ir.pp.ir_neighbors(adata, receptor_arms="all", dual_ir="primary_only")

ir.pp.ir_neighbors(
    adata,
    metric="alignment",
    n_jobs=1,
    sequence="aa",
    cutoff=2,
    receptor_arms="any",
    dual_ir="primary_only",
)

ir.pp.ir_neighbors(
    adata,
    metric="levenshtein",
    n_jobs=8,
    sequence="aa",
    cutoff=10,
    receptor_arms="any",
    dual_ir="primary_only",
)

@vladie0 (Author) commented Dec 1, 2020

Even with metric="identity", the same memory issue persists.

@vladie0 (Author) commented Dec 2, 2020

This issue keeps occurring even when the data is downsampled to 10,000 cells.

@grst (Collaborator) commented Dec 2, 2020

So for this, I would indeed expect memory issues:

ir.pp.ir_neighbors(
    adata,
    metric="levenshtein",
    n_jobs=8,
    sequence="aa",
    cutoff=10,
    receptor_arms="any",
    dual_ir="primary_only",
)

A Levenshtein distance of 10 will connect almost all cells. A threshold of 1 or 2 makes more sense here; a cutoff of 10 is better suited to the alignment metric.
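For illustration, a more conservative Levenshtein call could look like this (just a sketch, reusing the parameters already shown in this thread):

ir.pp.ir_neighbors(
    adata,
    metric="levenshtein",
    sequence="aa",
    cutoff=2,  # connect sequences that differ by at most 2 edits
    receptor_arms="any",
    dual_ir="primary_only",
    n_jobs=8,
)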

For the others, in particular ir.pp.ir_neighbors(adata), this should not happen.

Just to be sure: when testing the identity metric and the downsampled data, did you restart the Python kernel to make sure you started with clean memory?


Let's also do some checks on your AnnData. Could you please execute the following commands? They count the most frequent CDR3 sequences:

adata.obs.groupby("IR_VJ_1_cdr3").size().sort_values(ascending=False).head()
adata.obs.groupby("IR_VDJ_1_cdr3").size().sort_values(ascending=False).head()

@vladie0 (Author) commented Dec 2, 2020

Yes, the memory was cleared each time.
I ran the commands you requested; the result was computed instantly:

[screenshot of the groupby output showing the most frequent CDR3 sequences]

@grst (Collaborator) commented Dec 2, 2020

Ah wow, that explains it. You essentially only have three different clonotypes in your data, with 10k+ cells each.
Is that what you expect from your data?

This will lead to three "dense" blocks of 12931 x 12931 + 13784 x 13784 + ... in the sparse matrix, which is obviously a lot of memory.

This will be fixed by #191, but it will still take some time to implement.

On the other hand, with only three different clonotypes, the clonotype network does not make a lot of sense anyway.

EDIT:
as it's only showing the top 5, there might still be some more with low counts. In that case it would still make sense to build the network.

A "workaround" would be to downsample only the cells that belong to that abundant clonotype -- or just live with the memory consumption if your system can handle it.

@vladie0 (Author) commented Dec 2, 2020

Ok, thank you. There are 267 clonotypes, but indeed with a lower presence. I'm quite new to TCR analysis, so I'm still trying to figure things out.
I will certainly look out for the next updates. Thank you for your help and for developing an awesome package.

Would you suggest removing the top 3 clonotypes and building a network on this data subset, or would you still consider it unnecessary to define clonotypes using scirpy and carry on with subsequent analysis?

@grst (Collaborator) commented Dec 2, 2020

First I would wonder: is there a biological explanation for the clonotypes being so skewed, or is this a technical artifact?
I have never seen such an unequal distribution of clonotypes before; they usually follow a power-law distribution.

I think it is still interesting to look at a clonotype network. As a temporary workaround, you could remove all cells with these abundant clonotypes, except a few representatives. Or if you have the memory, just perform the analysis as usual.

@vladie0 (Author) commented Dec 2, 2020

Yes, these are cell lines of T cells harvested from a patient with a certain disease. I suppose this explains the skewed distribution.

I will try to implement the workaround. Thank you for your support!

grst added this to ToDo in scirpy-dev on Dec 2, 2020
@vladie0 (Author) commented Dec 2, 2020

One more question: I've received a 1 TB memory machine to perform the calculation on the full dataset. What exact parameters would you suggest for ir_neighbors?

@grst (Collaborator) commented Dec 2, 2020

I've received a 1 TB memory machine to perform the calculation on the full dataset.

fingers crossed that works out 🤞

What exact parameters would you suggest for ir_neighbors?

If you want to construct the network based on "similar clonotypes that likely recognize the same antigen",
I would go for the alignment metric and a cutoff of somewhere between 10 and 15 (we don't have empirical evidence to back up an ideal threshold).

Since you have so few clonotypes, I would probably use dual_ir = "any" and receptor_arms = "all" (possibly even receptor_arms = "any").

It depends a bit on what your question is. The options are also explained in the documentation.
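Putting those suggestions together, one possible call could look like this (only a sketch; the exact cutoff and receptor_arms choice depend on your question):

ir.pp.ir_neighbors(
    adata,
    metric="alignment",
    sequence="aa",
    cutoff=15,            # anywhere between 10 and 15 seems reasonable
    receptor_arms="all",  # possibly "any"
    dual_ir="any",
    n_jobs=8,
)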

grst moved this from ToDo to In progress in scirpy-dev on Dec 10, 2020
@nicoac commented Dec 16, 2020

I'm running into a similar issue with a dataset generated from T cells with a fixed beta chain. It seems as if the computation gets stuck; this is what it looks like: 0%| | 0/669 [00:00<?, ?it/s]

Looking forward to the fix! I'm really enjoying using Scirpy.

@grst (Collaborator) commented Dec 17, 2020

Hi @nicoac,

To verify that it is really the same issue, could you please also report the result of the following two commands?

adata.obs.groupby("IR_VJ_1_cdr3").size().sort_values(ascending=False).head()
adata.obs.groupby("IR_VDJ_1_cdr3").size().sort_values(ascending=False).head()

@nicoac commented Dec 17, 2020

Thanks for the response, @grst.
Here is the result of that code:

IR_VDJ_1_cdr3
CASSSPGTANYAEQFF 25647
nan 9079
None 242
CTCSAEGDRQAPLF 6
CAWSLGQQNTLYF 6
dtype: int64

As you can see, it is quite a large amount of one specific chain. Also, just to be clear, this is the result of combining six different AnnData objects. I have three day-0 and three day-8 samples that I am tracking longitudinally to follow TCR alpha chain pairings over time.

@grst (Collaborator) commented Dec 17, 2020

Thanks for checking; it does indeed seem to be the same issue.
I'll try to fix it after the holidays.

@nicoac commented Jan 21, 2021

Hi Gregor. Just following up with some updates. I was able to get the numbers to crunch properly all the way through ir_neighbors and to the end of the analysis as per the tutorial. However, after coming back to it a few days later, I am now running into an issue where it seems as if the VDJ coord-dictionary completely resets and wipes all of my variables.

Any thoughts?

@grst (Collaborator) commented Jan 22, 2021

However, after coming back to it a few days later, I am now running into an issue where it seems as if the VDJ coord-dictionary completely resets and wipes all of my variables.

I'm not sure what you mean by the coord-dictionary, or what exactly is getting wiped. Can you provide an example?

@nicoac commented Jan 22, 2021

Here is what happens after a fresh run of the code:

[screenshot of the error output]

@grst (Collaborator) commented Jan 25, 2021

What probably happens is that the Jupyter kernel gets killed because it runs out of memory (it seems pp.ir_neighbors never finished). When the kernel restarts, all your objects are lost.

Unfortunately, there's currently not a lot you can do except for

  • downsampling (reducing the number of cells with identical clonotypes), or
  • getting crazy amounts of memory.

I'm working on a fix in #230, but it turns out to be more difficult than I had expected.

@vladie0 (Author) commented Jan 25, 2021

Hi grst,

I have a new batch of similar data and was wondering what your progress is on optimizing ir_neighbors for sparse matrices. The reason I'm asking is that I have a new, similar project and was wondering whether I should get a large-memory server again?

@grst (Collaborator) commented Jan 25, 2021

was wondering what your progress is on optimizing ir_neighbors for sparse matrices

I don't want to promise anything, but if all goes well I should have something ready to test in 2-3 weeks.

@grst (Collaborator) commented Jan 29, 2021

Hi @vladie0,

just wanted to let you know that the 2-3 weeks won't work out. There were some setbacks and I now need to commit to other projects for a while.

Sorry about that,
Gregor

grst closed this as completed in #230 on Mar 17, 2021
scirpy-dev automation moved this from In progress to Done Mar 17, 2021
@grst (Collaborator) commented Mar 17, 2021

Hi @vladie0 and @nicoac,

#230 is finally merged. Could you give it a try and check if everything works as expected now?

You can install the development version using

pip install git+https://github.com/icbi-lab/scirpy.git@master

Note that the procedure for calling clonotypes has slightly changed (a minimal sketch of the new workflow is at the end of this comment), in particular:

  • pp.ir_neighbors has been replaced by pp.ir_dist
  • tl.define_clonotypes now takes the dual_ir and receptor_arms arguments (they were previously provided to ir_neighbors)
  • The clonotype network now collapses cells with identical receptors to a single dot

The documentation of the development version is at https://icbi-lab.github.io/scirpy/develop and this tutorial section describes the updated clonotype definition.
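A minimal sketch of the updated workflow, using only the functions named above (exact defaults may differ; see the development docs):

import scirpy as ir

ir.pp.ir_dist(adata)  # replaces pp.ir_neighbors
ir.tl.define_clonotypes(adata, receptor_arms="all", dual_ir="any")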

@nicoac commented Mar 17, 2021

I will report back with results. Thank you so much!

