Inconsistent gene set contents with MSigDB #30

Closed
10adavis opened this issue Jun 16, 2023 · 6 comments

@10adavis

First, thanks for the great package! It's really convenient to be able to pull in these gene sets from MSigDB. I've been using it to pull gene sets for about a year now, and only recently noticed that some of the gene sets differ from what's on MSigDB (e.g., GOBP_KERATINIZATION from msigdbr includes 279 genes, but on MSigDB it has only 83 genes).

I thought it might be a difference of versions (as msigdbr pulls MSigDB 7.5.1), but GOBP_Keratinization actually contains fewer genes in this version (n = 59): https://data.broadinstitute.org/gsea-msigdb/msigdb/release/7.5.1/c5.go.bp.v7.5.1.symbols.gmt
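For anyone who wants to check a count against the GMT file directly, here is a minimal base R sketch. It assumes the standard GMT layout (set name, description or URL, then one gene per tab-separated field). `count_gmt_genes` is a hypothetical helper, not part of msigdbr, and the toy lines and gene names are made up for illustration.

```r
# Count the distinct genes listed for one gene set in GMT-formatted lines.
# Assumed GMT layout: set name <TAB> description/URL <TAB> gene1 <TAB> gene2 ...
# `count_gmt_genes` is a hypothetical helper, not an msigdbr function.
count_gmt_genes <- function(gmt_lines, set_name) {
  fields <- strsplit(gmt_lines, "\t", fixed = TRUE)
  set_names <- vapply(fields, function(x) x[1], character(1))
  hit <- which(set_names == set_name)
  if (length(hit) == 0) {
    return(NA_integer_)  # set not present in these lines
  }
  genes <- fields[[hit[1]]][-(1:2)]  # drop the name and description fields
  length(unique(genes[nzchar(genes)]))
}

# Toy lines standing in for the real file (made-up sets and genes)
toy_gmt <- c(
  "SET_A\thttp://example.org/a\tGENE1\tGENE2\tGENE3",
  "SET_B\thttp://example.org/b\tGENE2\tGENE4"
)
count_gmt_genes(toy_gmt, "SET_A")

# With a downloaded copy of the real file, the call would look like:
# gmt_lines <- readLines("c5.go.bp.v7.5.1.symbols.gmt")
# count_gmt_genes(gmt_lines, "GOBP_KERATINIZATION")
```

Counting from the GMT file gives a reference number that does not depend on which msigdbr version is installed.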

I used this line to pull all GO BP sets:

m_df_BP = msigdbr(species = "Homo sapiens",subcategory=c("BP"))

Here is my session info:

R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] parallel stats4 stats graphics grDevices
[6] datasets utils methods base

other attached packages:
[1] scales_1.1.1 msigdbr_7.4.1
[3] biomartr_0.9.2 data.table_1.14.0
[5] GSEABase_1.54.0 graph_1.70.0
[7] annotate_1.70.0 XML_3.99-0.6
[9] reactome.db_1.76.0 GO.db_3.13.0
[11] fgsea_1.18.0 dplyr_1.0.7
[13] EnhancedVolcano_1.10.0 ggrepel_0.9.1
[15] rlist_0.4.6.1 pheatmap_1.0.12
[17] org.Hs.eg.db_3.13.0 AnnotationDbi_1.54.1
[19] readxl_1.3.1 ggplot2_3.3.5
[21] ashr_2.2-47 DESeq2_1.32.0
[23] SummarizedExperiment_1.22.0 Biobase_2.52.0
[25] MatrixGenerics_1.4.0 matrixStats_0.59.0
[27] GenomicRanges_1.44.0 GenomeInfoDb_1.28.1
[29] IRanges_2.26.0 S4Vectors_0.30.0
[31] BiocGenerics_0.38.0 rmarkdown_2.14
[33] here_1.0.1

loaded via a namespace (and not attached):
[1] snow_0.4-3 circlize_0.4.14
[3] fastmatch_1.1-3 BiocFileCache_2.0.0
[5] splines_4.1.0 BiocParallel_1.26.1
[7] digest_0.6.27 invgamma_1.1
[9] foreach_1.5.2 htmltools_0.5.2
[11] SQUAREM_2021.1 fansi_0.5.0
[13] magrittr_2.0.1 memoise_2.0.0
[15] cluster_2.1.2 doParallel_1.0.17
[17] ComplexHeatmap_2.8.0 Biostrings_2.60.1
[19] extrafont_0.17 extrafontdb_1.0
[21] prettyunits_1.1.1 colorspace_2.0-2
[23] rappdirs_0.3.3 blob_1.2.2
[25] xfun_0.30 crayon_1.4.1
[27] RCurl_1.98-1.3 genefilter_1.74.0
[29] survival_3.3-1 iterators_1.0.14
[31] glue_1.6.2 gtable_0.3.0
[33] zlibbioc_1.38.0 XVector_0.32.0
[35] GetoptLong_1.0.5 DelayedArray_0.18.0
[37] proj4_1.0-10.1 Rttf2pt1_1.3.9
[39] shape_1.4.6 maps_3.3.0
[41] DBI_1.1.1 Rcpp_1.0.7
[43] progress_1.2.2 xtable_1.8-4
[45] clue_0.3-60 bit_4.0.4
[47] truncnorm_1.0-8 httr_1.4.2
[49] RColorBrewer_1.1-2 ellipsis_0.3.2
[51] pkgconfig_2.0.3 farver_2.1.0
[53] dbplyr_2.1.1 locfit_1.5-9.4
[55] utf8_1.2.1 tidyselect_1.1.1
[57] labeling_0.4.2 rlang_0.4.11
[59] munsell_0.5.0 cellranger_1.1.0
[61] tools_4.1.0 cachem_1.0.5
[63] cli_3.3.0 generics_0.1.0
[65] RSQLite_2.2.7 evaluate_0.14
[67] stringr_1.4.0 fastmap_1.1.0
[69] yaml_2.2.1 babelgene_21.4
[71] knitr_1.33 bit64_4.0.5
[73] purrr_0.3.4 KEGGREST_1.32.0
[75] ash_1.0-15 ggrastr_0.2.3
[77] xml2_1.3.2 biomaRt_2.48.2
[79] compiler_4.1.0 rstudioapi_0.13
[81] filelock_1.0.2 curl_4.3.2
[83] beeswarm_0.4.0 png_0.1-8
[85] tibble_3.1.3 geneplotter_1.70.0
[87] stringi_1.7.3 highr_0.10
[89] ggalt_0.4.0 lattice_0.20-45
[91] Matrix_1.3-4 vctrs_0.3.8
[93] pillar_1.6.1 lifecycle_1.0.0
[95] BiocManager_1.30.16 GlobalOptions_0.1.2
[97] bitops_1.0-7 irlba_2.3.3
[99] R6_2.5.0 renv_0.15.4
[101] KernSmooth_2.23-20 gridExtra_2.3
[103] vipor_0.4.5 codetools_0.2-19
[105] MASS_7.3-55 assertthat_0.2.1
[107] rprojroot_2.0.2 rjson_0.2.21
[109] withr_2.4.2 GenomeInfoDbData_1.2.6
[111] hms_1.1.0 grid_4.1.0
[113] Cairo_1.5-12.2 mixsqp_0.3-43
[115] tinytex_0.37 ggbeeswarm_0.6.0

igordot added the "help wanted" label Jun 16, 2023
@igordot
Owner

igordot commented Jun 16, 2023

Apologies about any confusion.

I tried your example and I am getting 59 genes, as expected. Some genes have multiple Ensembl IDs, which makes the table slightly bigger.

> m_df_BP_K = dplyr::filter(m_df_BP, gs_name == "GOBP_KERATINIZATION")
> length(unique(m_df_BP_K$ensembl_gene))
[1] 65
> length(unique(m_df_BP_K$gene_symbol))
[1] 59
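To illustrate why the two counts can differ, here is a small sketch on a toy data frame that reuses the `gs_name`, `gene_symbol`, and `ensembl_gene` column names from the output above. The rows are invented for illustration and are not real MSigDB content.

```r
# Toy table: one gene symbol (GENE2) maps to two Ensembl IDs.
# All values are made up; only the column names follow the msigdbr output.
toy <- data.frame(
  gs_name = "TOY_SET",
  gene_symbol = c("GENE1", "GENE2", "GENE2", "GENE3"),
  ensembl_gene = c("ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D"),
  stringsAsFactors = FALSE
)

n_symbols <- length(unique(toy$gene_symbol))   # distinct gene symbols
n_ensembl <- length(unique(toy$ensembl_gene))  # distinct Ensembl IDs

# Number of distinct Ensembl IDs per symbol, then the symbols with more than one
ids_per_symbol <- tapply(
  toy$ensembl_gene, toy$gene_symbol,
  function(x) length(unique(x))
)
multi <- names(ids_per_symbol)[ids_per_symbol > 1]

n_symbols
n_ensembl
multi
```

When comparing against the counts shown on the MSigDB site or in a symbols GMT file, the unique `gene_symbol` count is the one to use.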

@10adavis
Author

Thanks for the swift response! That's strange. These are my outputs:

m_df_BP = msigdbr(species = "Homo sapiens",subcategory=c("BP"))
m_df_BP_K = dplyr::filter(m_df_BP, gs_name == "GOBP_KERATINIZATION")
length(unique(m_df_BP_K$ensembl_gene))
[1] 279
length(unique(m_df_BP_K$gene_symbol))
[1] 225

@igordot
Owner

igordot commented Jun 19, 2023

I just realized that you mentioned msigdbr 7.5.1 in the text, but the session info shows it is 7.4.1. Can you confirm which one it is?

The results I showed were with 7.5.1.

@10adavis
Author

Ahh! That's it! I completely overlooked that I wasn't running 7.5.1. It looks like GO BP Keratinization in v7.4 contains 225 gene symbols. Sorry for the confusion!
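A quick guard against this kind of mismatch is to compare the installed package version with the one you expect before pulling gene sets. The sketch below uses base R's `package_version()`. `check_version` is a hypothetical helper name, and the version strings are the two mentioned in this thread.

```r
# Return TRUE when the installed version string matches the expected one.
# `check_version` is a hypothetical helper, not part of msigdbr.
check_version <- function(installed, expected) {
  package_version(installed) == package_version(expected)
}

# The situation in this thread: 7.4.1 installed, 7.5.1 expected
check_version("7.4.1", "7.5.1")  # FALSE

# In a real session the installed version would come from packageVersion():
# check_version(as.character(utils::packageVersion("msigdbr")), "7.5.1")
```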

10adavis reopened this Jun 19, 2023
@10adavis
Author

Just curious - will you be updating msigdbr to source MSigDB v2023.1?

@igordot
Owner

igordot commented Jun 19, 2023

Yes, I plan to update. My last submission to CRAN was rejected and I have to restructure the code to comply with CRAN policies, but I haven't had time to do that.

igordot closed this as completed Jun 19, 2023