Determining the appropriate background #16

bschilder · 2023-11-07T17:09:38Z

An important question for any enrichment-based analysis is: how do we determine the appropriate set of background genes?

In the case of rare disease celltyping project, I think there's two potential backgrounds that could be used.

Options

1. All HPO genes

All genes included in the HPO gene annotations.
This is essentially saying "is this phenotype's genes enriched in a celltype relative to rare disease genes in general?"

> gene_data <- HPOExplorer::load_phenotype_to_genes()
Reading cached RDS file: phenotype_to_genes.txt
+ Version: v2023-10-09
> length(unique(gene_data$gene_symbol))
[1] 5005

2. All human genes

All genes in the human genome, or at least those that appear in the CTD. When the CTD is from mouse, this background would be further reduced to those that have human 1:1 orthologs.

This is essentially saying "is this phenotype's genes enriched in a celltype relative to all genes that are expressed and informative in some cell type?"

genelistSpecies = "human"
sctSpecies = "human"
> bg = orthogene::create_background(
+     species1 = genelistSpecies,
+     species2 = sctSpecies)
Retrieving all genes using: homologene.
Retrieving all organisms available in homologene.
Mapping species name: human
Common name mapping found for human
1 organism identified from search: 9606
Gene table with 19,129 rows retrieved.
Returning all 19,129 genes from human.
Returning 19,129 unique genes from entire human genome.

> length(bg)
[1] 19129

Internally, EWCE will use the CTD genes to subset this list further, as it can only sample specificity scores from genes that are included in the CTD:

Conclusions

@NathanSkene @KittyMurphy I think we concluded the latter is the way to go, especially as this is concordant with how EWCE behaves by default (unless the user supplies their own custom bg= arg).

The text was updated successfully, but these errors were encountered:

NathanSkene · 2023-11-07T17:51:43Z

Was thinking when we discussed this that it is worth checking if all HPO genes are pLI extreme genes. Probably Not worth putting time into checking though. I think it should be against all genes. Sent from Outlook for iOS<https://aka.ms/o0ukef>

…

________________________________ From: Brian M. Schilder ***@***.***> Sent: Tuesday, November 7, 2023 5:10:15 PM To: neurogenomics/MultiEWCE ***@***.***> Cc: Skene, Nathan G ***@***.***>; Assign ***@***.***> Subject: Re: [neurogenomics/MultiEWCE] Determining the appropriate background (Issue #16) This email from ***@***.*** originates from outside Imperial. Do not click on links and attachments unless you recognise the sender. If you trust the sender, add them to your safe senders list<https://spam.ic.ac.uk/SpamConsole/Senders.aspx> to disable email stamping for this address. Assigned #16<#16> to @NathanSkene<https://github.com/NathanSkene>. — Reply to this email directly, view it on GitHub<#16 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AH5ZPE2TUBBPNB2O7PKCE5DYDJTPPAVCNFSM6AAAAAA7BRDYD2VHI2DSMVQWIX3LMV45UABCJFZXG5LFIV3GK3TUJZXXI2LGNFRWC5DJN5XDWMJQHA4DSOBVGAZDAMQ>. You are receiving this because you were assigned.Message ID: ***@***.***>

KittyMurphy · 2023-11-07T18:06:40Z

I have some code and data for this for when I looked at genes under selective pressure and pLI, so I can have a look back! Agreed re: against all genes. Sent from Outlook for iOS<https://aka.ms/o0ukef>

…

________________________________ From: Nathan Skene ***@***.***> Sent: Tuesday, November 7, 2023 5:51:55 PM To: neurogenomics/MultiEWCE ***@***.***> Cc: Murphy, Kitty ***@***.***>; Mention ***@***.***> Subject: Re: [neurogenomics/MultiEWCE] Determining the appropriate background (Issue #16) Was thinking when we discussed this that it is worth checking if all HPO genes are pLI extreme genes. Probably Not worth putting time into checking though. I think it should be against all genes. Sent from Outlook for iOS<https://aka.ms/o0ukef>

________________________________ From: Brian M. Schilder ***@***.***> Sent: Tuesday, November 7, 2023 5:10:15 PM To: neurogenomics/MultiEWCE ***@***.***> Cc: Skene, Nathan G ***@***.***>; Assign ***@***.***> Subject: Re: [neurogenomics/MultiEWCE] Determining the appropriate background (Issue #16) This email from ***@***.*** originates from outside Imperial. Do not click on links and attachments unless you recognise the sender. If you trust the sender, add them to your safe senders list<https://spam.ic.ac.uk/SpamConsole/Senders.aspx> to disable email stamping for this address. Assigned #16<#16> to @NathanSkene<https://github.com/NathanSkene>. — Reply to this email directly, view it on GitHub<#16 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AH5ZPE2TUBBPNB2O7PKCE5DYDJTPPAVCNFSM6AAAAAA7BRDYD2VHI2DSMVQWIX3LMV45UABCJFZXG5LFIV3GK3TUJZXXI2LGNFRWC5DJN5XDWMJQHA4DSOBVGAZDAMQ>. You are receiving this because you were assigned.Message ID: ***@***.***> — Reply to this email directly, view it on GitHub<#16 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/ANQCHWHK52N72IZZ2AGOYU3YDJYLXAVCNFSM6AAAAAA7BRDYD2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTOOJZGM2DAMZWGI>. You are receiving this because you were mentioned.Message ID: ***@***.***>

bschilder · 2023-11-07T18:28:38Z

I have some code and data for this for when I looked at genes under selective pressure and pLI, so I can have a look back!

This would be excellent! I'll file a separate issue for this.
Here:

Assess pLI in HPO genes rare_disease_celltyping#50

Agreed re: against all genes.

Cool! Seems we're all in consensus, I'll proceed accordingly.

bschilder · 2023-11-08T12:07:25Z

method="gprofiler" vs. method="homologene"

MultiEWCE::get_bg is a new function that wraps orthogene::create_background.

Human CTD + Human hits

The gprofiler background is MUCH more comprehensive set of human genes than homologene (the default method orthogene uses within EWCE).

Difference: 62663 vs. 16482 genes

> bg1=get_bg(method="gprofiler")
Useing cached bg.
+ Version: 2023-11-08
> length(bg1)
[1] 62663
> 
> bg2=get_bg(method="homologene")
Retrieving all genes using: homologene.
Retrieving all organisms available in homologene.
Mapping species name: human
Common name mapping found for human
1 organism identified from search: 9606
Gene table with 19,129 rows retrieved.
Returning all 19,129 genes from human.
Returning 19,129 unique genes from entire human genome.
+ Version: 2023-11-08
> length(bg2)
[1] 19129

Mouse CTD + Human hits

For mouse-human analyses, the choice of orthologous gene mapping method makes a difference (tho not a massive one):

Difference: 17024 vs. 19129 genes

> bg1=get_bg(method="gprofiler", species1 = "mouse", species2 = "human")
Generating gene background for mouse x human ==> human
Gathering ortholog reports.
Retrieving all genes using: gprofiler
Retrieving all organisms available in gprofiler.
Using stored `gprofiler_orgs`.
Mapping species name: human
Common name mapping found for human
1 organism identified from search: hsapiens
Gene table with 62,663 rows retrieved.
Returning all 62,663 genes from human.

-- mouse
Retrieving all genes using: gprofiler
Retrieving all organisms available in gprofiler.
Using stored `gprofiler_orgs`.
Mapping species name: mouse
Common name mapping found for mouse
1 organism identified from search: mmusculus
Gene table with 56,847 rows retrieved.
Returning all 56,847 genes from mouse.
--
--
Preparing gene_df.
data.frame format detected.
Extracting genes from Gene.Symbol.
56,847 genes extracted.
Converting mouse ==> human orthologs using: gprofiler
Retrieving all organisms available in gprofiler.
Using stored `gprofiler_orgs`.
Mapping species name: mouse
Common name mapping found for mouse
1 organism identified from search: mmusculus
Retrieving all organisms available in gprofiler.
Using stored `gprofiler_orgs`.
Mapping species name: human
Common name mapping found for human
1 organism identified from search: hsapiens
Checking for genes without orthologs in human.
Extracting genes from input_gene.
62,846 genes extracted.
Extracting genes from ortholog_gene.
62,846 genes extracted.
Dropping 37,509 NAs of all kinds from ortholog_gene.
Checking for genes without 1:1 orthologs.
Dropping 4,149 genes that have multiple input_gene per ortholog_gene (many:1).
Dropping 5,437 genes that have multiple ortholog_gene per input_gene (1:many).
Filtering gene_df with gene_map
Adding input_gene col to gene_df.
Adding ortholog_gene col to gene_df.

=========== REPORT SUMMARY ===========

Total genes dropped after convert_orthologs :
   39,651 / 56,697 (70%)
Total genes remaining after convert_orthologs :
   17,046 / 56,697 (30%)
--

=========== REPORT SUMMARY ===========

17,024 / 56,697 (30.03%) target_species genes remain after ortholog conversion.
17,024 / 40,835 (41.69%) reference_species genes remain after ortholog conversion.
Gathering ortholog reports.
Retrieving all genes using: gprofiler
Retrieving all organisms available in gprofiler.
Using stored `gprofiler_orgs`.
Mapping species name: human
Common name mapping found for human
1 organism identified from search: hsapiens
Gene table with 62,663 rows retrieved.
Returning all 62,663 genes from human.

-- human
Retrieving all genes using: gprofiler
Retrieving all organisms available in gprofiler.
Using stored `gprofiler_orgs`.
Mapping species name: human
Common name mapping found for human
1 organism identified from search: hsapiens
Gene table with 62,663 rows retrieved.
Returning all 62,663 genes from human.
--

=========== REPORT SUMMARY ===========

40,835 / 40,835 (100%) target_species genes remain after ortholog conversion.
40,835 / 40,835 (100%) reference_species genes remain after ortholog conversion.
17,024 intersect background genes used.
+ Version: 2023-11-08
> length(bg1)
[1] 17024
> 
> bg2=get_bg(method="homologene", species1 = "mouse", species2 = "human")
Generating gene background for mouse x human ==> human
Gathering ortholog reports.
Retrieving all genes using: homologene.
Retrieving all organisms available in homologene.
Mapping species name: human
Common name mapping found for human
1 organism identified from search: 9606
Gene table with 19,129 rows retrieved.
Returning all 19,129 genes from human.

-- mouse
Retrieving all genes using: homologene.
Retrieving all organisms available in homologene.
Mapping species name: mouse
Common name mapping found for mouse
1 organism identified from search: 10090
Gene table with 21,207 rows retrieved.
Returning all 21,207 genes from mouse.
--
--
Preparing gene_df.
data.frame format detected.
Extracting genes from Gene.Symbol.
21,207 genes extracted.
Converting mouse ==> human orthologs using: homologene
Retrieving all organisms available in homologene.
Mapping species name: mouse
Common name mapping found for mouse
1 organism identified from search: 10090
Retrieving all organisms available in homologene.
Mapping species name: human
Common name mapping found for human
1 organism identified from search: 9606
Checking for genes without orthologs in human.
Extracting genes from input_gene.
17,355 genes extracted.
Extracting genes from ortholog_gene.
17,355 genes extracted.
Checking for genes without 1:1 orthologs.
Dropping 131 genes that have multiple input_gene per ortholog_gene (many:1).
Dropping 498 genes that have multiple ortholog_gene per input_gene (1:many).
Filtering gene_df with gene_map
Adding input_gene col to gene_df.
Adding ortholog_gene col to gene_df.

=========== REPORT SUMMARY ===========

Total genes dropped after convert_orthologs :
   4,725 / 21,207 (22%)
Total genes remaining after convert_orthologs :
   16,482 / 21,207 (78%)
--

=========== REPORT SUMMARY ===========

16,482 / 21,207 (77.72%) target_species genes remain after ortholog conversion.
16,482 / 19,129 (86.16%) reference_species genes remain after ortholog conversion.
Gathering ortholog reports.
Retrieving all genes using: homologene.
Retrieving all organisms available in homologene.
Mapping species name: human
Common name mapping found for human
1 organism identified from search: 9606
Gene table with 19,129 rows retrieved.
Returning all 19,129 genes from human.

-- human
Retrieving all genes using: homologene.
Retrieving all organisms available in homologene.
Mapping species name: human
Common name mapping found for human
1 organism identified from search: 9606
Gene table with 19,129 rows retrieved.
Returning all 19,129 genes from human.
--

=========== REPORT SUMMARY ===========

19,129 / 19,129 (100%) target_species genes remain after ortholog conversion.
19,129 / 19,129 (100%) reference_species genes remain after ortholog conversion.
16,482 intersect background genes used.
+ Version: 2023-11-08
> length(bg2)
[1] 16482

Conclusion

For MultiEWCE, the best choice is "gprofiler", as:

we're mostly dealing with human CTD + human hit genes
caching in get_bg obviates the need for API calls (tho i could incorporate this into orthogene, and thus EWCE direclty)

For EWCE, I think i may be ok to continue using "homologene" as the default method, as:
1, it's far faster than "gprofiler" and doesn't use any API calls (which requires stable internet)
2. And, importantly, the default species in EWCE are mouse CTD and human hit genes, which as we can see above do not make a huge impact on the number of background genes. However we should probably add a note to users that if using human CTD + human hits, they should switch to method="gprofiler".

Tagging @Al-Murphy here as this is relevant to EWCE.

bschilder assigned bschilder and NathanSkene Nov 7, 2023

bschilder added documentation Improvements or additions to documentation help wanted Extra attention is needed good first issue Good for newcomers labels Nov 7, 2023

bschilder added this to the Publish rare disease celltyping manuscript milestone Nov 7, 2023

bschilder closed this as completed Nov 7, 2023

bschilder mentioned this issue Nov 7, 2023

Assess pLI in HPO genes neurogenomics/rare_disease_celltyping#50

Open

4 tasks

bschilder mentioned this issue Nov 8, 2023

Repeat phenotype-level enrichment analyses using EWCE neurogenomics/RareDiseasePrioritisation#29

Closed

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Determining the appropriate background #16

Determining the appropriate background #16

bschilder commented Nov 7, 2023 •

edited

NathanSkene commented Nov 7, 2023 via email

KittyMurphy commented Nov 7, 2023 via email

bschilder commented Nov 7, 2023 •

edited

bschilder commented Nov 8, 2023 •

edited

Determining the appropriate background #16

Determining the appropriate background #16

Comments

bschilder commented Nov 7, 2023 • edited

Options

1. All HPO genes

2. All human genes

Conclusions

NathanSkene commented Nov 7, 2023 via email

KittyMurphy commented Nov 7, 2023 via email

bschilder commented Nov 7, 2023 • edited

bschilder commented Nov 8, 2023 • edited

method="gprofiler" vs. method="homologene"

Human CTD + Human hits

Mouse CTD + Human hits

Conclusion

bschilder commented Nov 7, 2023 •

edited

bschilder commented Nov 7, 2023 •

edited

bschilder commented Nov 8, 2023 •

edited