Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Determining the appropriate background #16

Closed
bschilder opened this issue Nov 7, 2023 · 4 comments
Closed

Determining the appropriate background #16

bschilder opened this issue Nov 7, 2023 · 4 comments
Assignees
Labels
documentation Improvements or additions to documentation good first issue Good for newcomers help wanted Extra attention is needed

Comments

@bschilder
Copy link
Collaborator

bschilder commented Nov 7, 2023

An important question for any enrichment-based analysis is: how do we determine the appropriate set of background genes?

In the case of rare disease celltyping project, I think there's two potential backgrounds that could be used.

Options

1. All HPO genes

All genes included in the HPO gene annotations.
This is essentially saying "is this phenotype's genes enriched in a celltype relative to rare disease genes in general?"

> gene_data <- HPOExplorer::load_phenotype_to_genes()
Reading cached RDS file: phenotype_to_genes.txt
+ Version: v2023-10-09
> length(unique(gene_data$gene_symbol))
[1] 5005

2. All human genes

All genes in the human genome, or at least those that appear in the CTD. When the CTD is from mouse, this background would be further reduced to those that have human 1:1 orthologs.

This is essentially saying "is this phenotype's genes enriched in a celltype relative to all genes that are expressed and informative in some cell type?"

genelistSpecies = "human"
sctSpecies = "human"
> bg = orthogene::create_background(
+     species1 = genelistSpecies,
+     species2 = sctSpecies)
Retrieving all genes using: homologene.
Retrieving all organisms available in homologene.
Mapping species name: human
Common name mapping found for human
1 organism identified from search: 9606
Gene table with 19,129 rows retrieved.
Returning all 19,129 genes from human.
Returning 19,129 unique genes from entire human genome.

> length(bg)
[1] 19129

Internally, EWCE will use the CTD genes to subset this list further, as it can only sample specificity scores from genes that are included in the CTD:

Conclusions

@NathanSkene @KittyMurphy I think we concluded the latter is the way to go, especially as this is concordant with how EWCE behaves by default (unless the user supplies their own custom bg= arg).

@bschilder bschilder added documentation Improvements or additions to documentation help wanted Extra attention is needed good first issue Good for newcomers labels Nov 7, 2023
@NathanSkene
Copy link

NathanSkene commented Nov 7, 2023 via email

@KittyMurphy
Copy link

KittyMurphy commented Nov 7, 2023 via email

@bschilder
Copy link
Collaborator Author

bschilder commented Nov 7, 2023

I have some code and data for this for when I looked at genes under selective pressure and pLI, so I can have a look back!

This would be excellent! I'll file a separate issue for this.
Here:

Agreed re: against all genes.

Cool! Seems we're all in consensus, I'll proceed accordingly.

@bschilder
Copy link
Collaborator Author

bschilder commented Nov 8, 2023

method="gprofiler" vs. method="homologene"

MultiEWCE::get_bg is a new function that wraps orthogene::create_background.

Human CTD + Human hits

The gprofiler background is MUCH more comprehensive set of human genes than homologene (the default method orthogene uses within EWCE).

Difference: 62663 vs. 16482 genes

> bg1=get_bg(method="gprofiler")
Useing cached bg.
+ Version: 2023-11-08
> length(bg1)
[1] 62663
> 
> bg2=get_bg(method="homologene")
Retrieving all genes using: homologene.
Retrieving all organisms available in homologene.
Mapping species name: human
Common name mapping found for human
1 organism identified from search: 9606
Gene table with 19,129 rows retrieved.
Returning all 19,129 genes from human.
Returning 19,129 unique genes from entire human genome.
+ Version: 2023-11-08
> length(bg2)
[1] 19129

Mouse CTD + Human hits

For mouse-human analyses, the choice of orthologous gene mapping method makes a difference (tho not a massive one):

Difference: 17024 vs. 19129 genes

> bg1=get_bg(method="gprofiler", species1 = "mouse", species2 = "human")
Generating gene background for mouse x human ==> human
Gathering ortholog reports.
Retrieving all genes using: gprofiler
Retrieving all organisms available in gprofiler.
Using stored `gprofiler_orgs`.
Mapping species name: human
Common name mapping found for human
1 organism identified from search: hsapiens
Gene table with 62,663 rows retrieved.
Returning all 62,663 genes from human.

-- mouse
Retrieving all genes using: gprofiler
Retrieving all organisms available in gprofiler.
Using stored `gprofiler_orgs`.
Mapping species name: mouse
Common name mapping found for mouse
1 organism identified from search: mmusculus
Gene table with 56,847 rows retrieved.
Returning all 56,847 genes from mouse.
--
--
Preparing gene_df.
data.frame format detected.
Extracting genes from Gene.Symbol.
56,847 genes extracted.
Converting mouse ==> human orthologs using: gprofiler
Retrieving all organisms available in gprofiler.
Using stored `gprofiler_orgs`.
Mapping species name: mouse
Common name mapping found for mouse
1 organism identified from search: mmusculus
Retrieving all organisms available in gprofiler.
Using stored `gprofiler_orgs`.
Mapping species name: human
Common name mapping found for human
1 organism identified from search: hsapiens
Checking for genes without orthologs in human.
Extracting genes from input_gene.
62,846 genes extracted.
Extracting genes from ortholog_gene.
62,846 genes extracted.
Dropping 37,509 NAs of all kinds from ortholog_gene.
Checking for genes without 1:1 orthologs.
Dropping 4,149 genes that have multiple input_gene per ortholog_gene (many:1).
Dropping 5,437 genes that have multiple ortholog_gene per input_gene (1:many).
Filtering gene_df with gene_map
Adding input_gene col to gene_df.
Adding ortholog_gene col to gene_df.

=========== REPORT SUMMARY ===========

Total genes dropped after convert_orthologs :
   39,651 / 56,697 (70%)
Total genes remaining after convert_orthologs :
   17,046 / 56,697 (30%)
--

=========== REPORT SUMMARY ===========

17,024 / 56,697 (30.03%) target_species genes remain after ortholog conversion.
17,024 / 40,835 (41.69%) reference_species genes remain after ortholog conversion.
Gathering ortholog reports.
Retrieving all genes using: gprofiler
Retrieving all organisms available in gprofiler.
Using stored `gprofiler_orgs`.
Mapping species name: human
Common name mapping found for human
1 organism identified from search: hsapiens
Gene table with 62,663 rows retrieved.
Returning all 62,663 genes from human.

-- human
Retrieving all genes using: gprofiler
Retrieving all organisms available in gprofiler.
Using stored `gprofiler_orgs`.
Mapping species name: human
Common name mapping found for human
1 organism identified from search: hsapiens
Gene table with 62,663 rows retrieved.
Returning all 62,663 genes from human.
--

=========== REPORT SUMMARY ===========

40,835 / 40,835 (100%) target_species genes remain after ortholog conversion.
40,835 / 40,835 (100%) reference_species genes remain after ortholog conversion.
17,024 intersect background genes used.
+ Version: 2023-11-08
> length(bg1)
[1] 17024
> 
> bg2=get_bg(method="homologene", species1 = "mouse", species2 = "human")
Generating gene background for mouse x human ==> human
Gathering ortholog reports.
Retrieving all genes using: homologene.
Retrieving all organisms available in homologene.
Mapping species name: human
Common name mapping found for human
1 organism identified from search: 9606
Gene table with 19,129 rows retrieved.
Returning all 19,129 genes from human.

-- mouse
Retrieving all genes using: homologene.
Retrieving all organisms available in homologene.
Mapping species name: mouse
Common name mapping found for mouse
1 organism identified from search: 10090
Gene table with 21,207 rows retrieved.
Returning all 21,207 genes from mouse.
--
--
Preparing gene_df.
data.frame format detected.
Extracting genes from Gene.Symbol.
21,207 genes extracted.
Converting mouse ==> human orthologs using: homologene
Retrieving all organisms available in homologene.
Mapping species name: mouse
Common name mapping found for mouse
1 organism identified from search: 10090
Retrieving all organisms available in homologene.
Mapping species name: human
Common name mapping found for human
1 organism identified from search: 9606
Checking for genes without orthologs in human.
Extracting genes from input_gene.
17,355 genes extracted.
Extracting genes from ortholog_gene.
17,355 genes extracted.
Checking for genes without 1:1 orthologs.
Dropping 131 genes that have multiple input_gene per ortholog_gene (many:1).
Dropping 498 genes that have multiple ortholog_gene per input_gene (1:many).
Filtering gene_df with gene_map
Adding input_gene col to gene_df.
Adding ortholog_gene col to gene_df.

=========== REPORT SUMMARY ===========

Total genes dropped after convert_orthologs :
   4,725 / 21,207 (22%)
Total genes remaining after convert_orthologs :
   16,482 / 21,207 (78%)
--

=========== REPORT SUMMARY ===========

16,482 / 21,207 (77.72%) target_species genes remain after ortholog conversion.
16,482 / 19,129 (86.16%) reference_species genes remain after ortholog conversion.
Gathering ortholog reports.
Retrieving all genes using: homologene.
Retrieving all organisms available in homologene.
Mapping species name: human
Common name mapping found for human
1 organism identified from search: 9606
Gene table with 19,129 rows retrieved.
Returning all 19,129 genes from human.

-- human
Retrieving all genes using: homologene.
Retrieving all organisms available in homologene.
Mapping species name: human
Common name mapping found for human
1 organism identified from search: 9606
Gene table with 19,129 rows retrieved.
Returning all 19,129 genes from human.
--

=========== REPORT SUMMARY ===========

19,129 / 19,129 (100%) target_species genes remain after ortholog conversion.
19,129 / 19,129 (100%) reference_species genes remain after ortholog conversion.
16,482 intersect background genes used.
+ Version: 2023-11-08
> length(bg2)
[1] 16482

Conclusion

For MultiEWCE, the best choice is "gprofiler", as:

  • we're mostly dealing with human CTD + human hit genes
  • caching in get_bg obviates the need for API calls (tho i could incorporate this into orthogene, and thus EWCE direclty)

For EWCE, I think i may be ok to continue using "homologene" as the default method, as:
1, it's far faster than "gprofiler" and doesn't use any API calls (which requires stable internet)
2. And, importantly, the default species in EWCE are mouse CTD and human hit genes, which as we can see above do not make a huge impact on the number of background genes. However we should probably add a note to users that if using human CTD + human hits, they should switch to method="gprofiler".

Tagging @Al-Murphy here as this is relevant to EWCE.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation good first issue Good for newcomers help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

3 participants