Memory usage ir_neighbors #217

Dear authors,

Currently, the ir_neighbors algorithm consumes over 100 GB of memory for data from 55k cells, causing memory failures on the server. Is this normal behavior? Is there a way to limit memory consumption, and which parameters can I adapt to control memory usage?

Kind Regards,
Hi @vladie0, what metric and which options do you use for ir_neighbors? There are several points to be considered here.

Best,
Dear Gregor, thank you for your swift reply. Are there other parameters that can be of importance, for example `metric`, `n_jobs`, `sequence`? ... Note, I've also merged the TCR data into `adata` prior to the ir_neighbors calculation. I've tried the following options without success:

```python
ir.pp.ir_neighbors(adata)
ir.pp.ir_neighbors(
    adata,
    metric="levenshtein",
    n_jobs=8,
    sequence="aa",
    cutoff=10,
    receptor_arms="any",
    dual_ir="primary_only",
)
```
Even with `alignment="identity"`, the same memory issue persists.
This issue keeps occurring even when the number of cells is downsampled to 10,000.
So for this, I would indeed expect memory issues:

```python
ir.pp.ir_neighbors(
    adata,
    metric="levenshtein",
    n_jobs=8,
    sequence="aa",
    cutoff=10,
    receptor_arms="any",
    dual_ir="primary_only",
)
```

A levenshtein distance of 10 will connect almost all cells; a threshold of 1 or 2 makes more sense here, while 10 is better suited to the alignment metric (see the sketch below). For the other parameters, in particular `receptor_arms` and `dual_ir`, it depends on your question (more on that further down). Just to be sure: when testing the downsampled data, did you use the same parameters?

Let's also do some checks on your AnnData. Can you please execute the following commands? They will count how many cells share each CDR3 sequence:

```python
adata.obs.groupby("IR_VJ_1_cdr3").size().sort_values(ascending=False).head()
adata.obs.groupby("IR_VDJ_1_cdr3").size().sort_values(ascending=False).head()
```
Ah wow, that explains it. You essentially have only three different clonotypes in your data, with 10k+ cells each. This will lead to three "dense" blocks in the otherwise sparse distance matrix. This will be fixed by #191, but it will still take some time to implement. On the other hand, with only three different clonotypes, the clonotype network does not make a lot of sense anyway.

EDIT: A "workaround" would be to downsample only the cells that belong to those abundant clonotypes, or just live with the memory consumption if your system can handle it.
OK, thank you. Would you suggest removing the top 3 clonotypes and building a network on this data subset, or would you still consider it unnecessary to define clonotypes using scirpy and carry on with the subsequent analysis?
First I would wonder: is there a biological explanation for the clonotypes being so skewed, or is this a technical artifact? I think it is still interesting to look at a clonotype network. As a temporary workaround, you could remove all cells with these abundant clonotypes, except a few representatives (see the sketch below). Or, if you have the memory, just perform the analysis as usual.
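A minimal sketch of that downsampling, assuming the `IR_VDJ_1_cdr3` column from the diagnostic commands above identifies the abundant clonotypes (the 50-cell threshold is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
max_cells = 50  # illustrative: keep at most 50 representatives per CDR3

# Boolean mask over all cells; cells of rare clonotypes stay untouched.
keep = np.ones(adata.n_obs, dtype=bool)
counts = adata.obs["IR_VDJ_1_cdr3"].value_counts()
for cdr3 in counts[counts > max_cells].index:
    cells = np.where(adata.obs["IR_VDJ_1_cdr3"] == cdr3)[0]
    drop = rng.choice(cells, size=len(cells) - max_cells, replace=False)
    keep[drop] = False

adata_sub = adata[keep, :].copy()
```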
Yes, these are T-cell lines harvested from a patient with a certain disease; I suppose this explains the skewed distribution. I will try to implement the workaround. Thank you for your support.
One more question, …
fingers crossed that works out 🤞
It depends a bit on what your question is. If you want to construct the network based on "similar clonotypes that likely recognize the same antigen", the alignment metric is the better fit; since you have so few clonotypes, I would probably use that (see the sketch below). The options are also explained here.
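A sketch of such an alignment-based call, reusing the cutoff of ~10 mentioned earlier (assuming the downsampled `adata_sub` from the sketch above; treat the values as a starting point):

```python
import scirpy as ir

# Alignment-based neighbors connect CDR3 sequences that are
# biochemically similar, not just nearly identical.
ir.pp.ir_neighbors(
    adata_sub,
    metric="alignment",
    sequence="aa",
    cutoff=10,  # ~10 suits the alignment metric, per the advice above
)
```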
I'm running into a similar issue with a dataset generated from T cells with a fixed beta chain. Seems as if the computation gets stuck, and this is what it looks like:

```
0%|          | 0/669 [00:00<?, ?it/s]
```

Looking forward to the fix! I'm really enjoying using Scirpy.
Hi @nicoac, to verify that it is really the same issue, could you please also report the result of the following two commands?

```python
adata.obs.groupby("IR_VJ_1_cdr3").size().sort_values(ascending=False).head()
adata.obs.groupby("IR_VDJ_1_cdr3").size().sort_values(ascending=False).head()
```
Thanks for the response, @grst.

```
IR_VDJ_1_cdr3
```

As you can see, it is quite a large amount of one specific chain. Also, just to be clear: this is the result of combining six different AnnData frames. I have three day-0 and three day-8 samples that I am tracking longitudinally to see TCR alpha chain pairings over time.
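For context, combining per-sample objects like that is typically done along these lines (a sketch; the object names and the `sample` label are hypothetical):

```python
import anndata as ad

# Concatenate six per-sample AnnData objects (three day-0, three day-8).
# The "sample" column in .obs records which object each cell came from.
adatas = {
    "d0_1": adata_d0_1, "d0_2": adata_d0_2, "d0_3": adata_d0_3,
    "d8_1": adata_d8_1, "d8_2": adata_d8_2, "d8_3": adata_d8_3,
}
adata = ad.concat(adatas, label="sample")
```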
Thanks for checking; it does indeed seem to be the same issue.
Hi Gregor, just following up with some updates. I was able to get the numbers to crunch properly all the way through ir_neighbors and to the end of the analysis, as per the tutorial. However, after coming back to it a few days later, I am now running into an issue where it seems as if the VDJ coord-dictionary completely resets and wipes all of my variables. Any thoughts?
I'm not sure what you mean by the coord-dictionary, or what exactly is getting wiped. Can you provide an example?
What probably happens is that the Jupyter kernel gets killed because it runs out of memory (it seems the distance computation still needs more memory than your machine has). Unfortunately, there's currently not a lot you can do, except for the workarounds discussed above: downsampling the abundant clonotypes or moving to a machine with more memory.

I'm working on a fix in #230, but it turns out to be more difficult than I had expected.
Hi @grst, I have a new batch of similar data and was wondering what your progress is on optimizing ir_neighbors for sparse matrices. The reason I'm asking is that I need to decide whether I should request a large-memory server again.
I don't want to promise anything, but if all goes well I should have something ready to test in 2-3 weeks.
Hi @vladie0, just wanted to let you know that the 2-3 weeks won't work out. There were some setbacks and I now need to commit to other projects for a while. Sorry about that, |
#230 is finally merged. Could you give it a try and check if everything works as expected now? You can install the development version using:

```
pip install git+https://github.com/icbi-lab/scirpy.git@master
```

Note that the procedure for calling clonotypes has slightly changed; a rough sketch of the new workflow follows below.
The documentation of the development version is at https://icbi-lab.github.io/scirpy/develop and this tutorial section describes the updated clonotype definition. |
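For reference, a sketch of what the refactored workflow roughly looks like (the function names and parameters here are my reading of the development docs at the time, not an authoritative excerpt; check the linked tutorial for the definitive steps):

```python
import scirpy as ir

# In the refactored API, computing sequence distances and defining
# clonotype clusters are separate steps (assumed names: ir_dist,
# define_clonotype_clusters).
ir.pp.ir_dist(adata, metric="levenshtein", sequence="aa", cutoff=2)
ir.tl.define_clonotype_clusters(adata, sequence="aa", metric="levenshtein")
```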
I will report back with results. Thank you so much!