getSignificance very memory intensive when fit = ANOVA #63

Closed
Dragonmasterx87 opened this issue Feb 10, 2023 · 3 comments
Labels
help wanted Extra attention is needed

Dragonmasterx87 commented Feb 10, 2023

Hi Nick!

Amazing tool! I was wondering if you could help me with an issue I am having with a large dataset.
I have a Seurat object consisting of sample data integrated across 15 individuals: 50k+ subsetted, high-quality, doublet-free cells.
In this dataset I have metadata for cell type (15 types), further broken down by race (2 types) and sex (2 types), which gives a joined metadata slot with 60 cell-type groups defined by sex and race.

I have already downloaded the entire MSigDB C5 category (gene ontologies) using:

# molecular Signature database for msigdb
gene.sets <- getGenesets(org = "hsa",
                         db = c("msigdb"), #https://rdrr.io/bioc/EnrichmentBrowser/man/getGenesets.html
                         gene.id.type = "SYMBOL", #idTypes(org = "hsa")
                         cat = c("C5"), # C5 is gene ontology
                         cache = TRUE, 
                         return.type = "list")

and successfully ran enrichIt:

# Enrichment calculation
ES <- enrichIt(obj = pancreas.combined.h.s, 
               gene.sets = gene.sets, 
               groups = 1000, 
               cores = 30, 
               min.size = NULL)

At this point, I have successfully reproduced all of the plotting options outlined in your vignette that I tested, after transferring the enormous amount of C5 enrichment data (the ES output) to the Seurat object as metadata.

Now I am interested in running a statistical test across all 60 groups in various configurations, to find pathways that are more active in one sex_ancestral_celltype than in another (the comparison is always within the same cell type, differing only in sex and ancestry). To do this I run:

# Significance testing
significant_pathways <- getSignificance(ES2, group = "celltype.sample", fit = "ANOVA") 

As I write this, I have just crossed 450GB of RAM utilization. Is there any way I can reduce this colossal computational cost?
I am thinking that probably the only way is to subset the cell types individually, calculate the significance for each, and then add that metadata iteratively back to the primary Seurat object. Or is there some way of using getSignificance to specify the testing combinations?

Thanks again for the wonderful tool.

Cheers,

🐉

Dragonmasterx87 commented Feb 11, 2023

Update: I have subsetted each of the individual cell types, and am now running getSignificance for each file separately. So far it seems to be less computationally intensive. The trick will be to concatenate the data when I have it. Will update you later.

Update: Using some metadata kung-fu, I was able to segregate the cell types into individual Seurat objects and successfully run getSignificance on each of them. With the simpler per-cell-type comparisons, RAM usage dropped from >1.2TB to <20GB.
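
The split, test, and recombine pattern described above can be sketched roughly as follows. This is a minimal base-R illustration on simulated scores, not the code actually used in this thread and not escape's internals: the data frame `scores`, its column names, the pathway names, and the helper `test_one_celltype` are all hypothetical, and a plain one-way `aov` from the stats package stands in for the per-pathway test.

```r
# Simulated stand-in for per-cell enrichment scores plus metadata (hypothetical data).
set.seed(1)
n <- 600
scores <- data.frame(
  celltype  = sample(c("alpha", "beta", "delta"), n, replace = TRUE),
  group     = sample(c("male_A", "male_B", "female_A", "female_B"), n, replace = TRUE),
  GO_TERM_1 = rnorm(n, mean = 3000, sd = 500),
  GO_TERM_2 = rnorm(n, mean = 1500, sd = 300)
)
pathways <- c("GO_TERM_1", "GO_TERM_2")

# Hypothetical helper: one-way ANOVA per pathway within a single cell type.
test_one_celltype <- function(df, pathways) {
  pvals <- vapply(pathways, function(p) {
    d   <- data.frame(value = df[[p]], group = df$group)
    fit <- aov(value ~ group, data = d)
    summary(fit)[[1]][["Pr(>F)"]][1]  # p-value of the group term
  }, numeric(1))
  data.frame(
    celltype = df$celltype[1],
    pathway  = pathways,
    p.value  = unname(pvals),
    FDR      = p.adjust(unname(pvals), method = "fdr"),
    stringsAsFactors = FALSE
  )
}

# Split by cell type, test each piece independently, then stack the results.
per_type <- lapply(split(scores, scores$celltype), test_one_celltype,
                   pathways = pathways)
results <- do.call(rbind, per_type)
rownames(results) <- NULL
print(results)
```

Because each piece only ever compares the groups within one cell type, the number of groups per test stays small, and the per-cell-type result tables can be concatenated with a single `rbind` at the end.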

With regard to interpretation: for the sake of argument, say that in cell type X, for the GO term 'increased carbohydrate synthesis', I am comparing males to females and the p-value is <0.001 (the column name is malesvsfemales.pval). Say also that the median is 4000 for males and 2000 for females. Would that suggest that this GO term is significantly up-regulated in males (in the males_vs_females direction), given the higher median in males?
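
One way to sanity-check that reading is to compute the group medians and the p-value side by side. This is a small sketch on simulated values that mirror the numbers in the question (the vectors `males` and `females` are hypothetical). It uses `wilcox.test` from the stats package as a stand-in two-group test, which is not necessarily what getSignificance computes for this column.

```r
# Simulated enrichment scores for one GO term in one cell type (hypothetical values).
set.seed(42)
males   <- rnorm(200, mean = 4000, sd = 600)
females <- rnorm(200, mean = 2000, sd = 600)

med_m <- median(males)
med_f <- median(females)

# The p-value only says the two distributions differ; the medians give the direction.
p <- wilcox.test(males, females)$p.value
direction <- if (med_m > med_f) "higher in males" else "higher in females"

cat(sprintf("median males = %.0f, median females = %.0f, p = %.3g -> %s\n",
            med_m, med_f, p, direction))
```

Under these assumptions, a small p-value together with a higher male median would be read as a higher enrichment score for that gene set in males.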

Cheers,

🐉

@ncborcherding ncborcherding added the help wanted Extra attention is needed label Feb 16, 2023

ncborcherding (Owner) commented

Hey Fahd,

Apologies for the delay -

Thanks for reaching out and giving such an extensive summary (with follow-up) of the problem. You are completely correct: much of the additional testing carries a large computational requirement, because getSignificance() depends on R's stats package. I think this will become a bigger issue as data sets expand, so I will mark this as help wanted because the issue is persistent.

Please let me know if you have any suggestions from your experience, and I will test some ideas I have in the meantime.

Nick

Dragonmasterx87 (Author) commented

Cheers thanks legend!!
