Handling of bundled data through designated separate repo #150

Merged
merged 34 commits into from
May 4, 2023
a55b8a9
cleanup: drop bundled ashm regions
Kdreval Dec 29, 2022
d086a1b
new feature: introduce config key for data version tracking
Kdreval Dec 29, 2022
7315f5c
new feature: auto load bundled data of particular version
Kdreval Dec 29, 2022
1c3c141
bug fix: refer to the most recent version
Kdreval Dec 29, 2022
72ca772
cleanup: add documentation
Kdreval Dec 29, 2022
2e23d10
cleanup: update documentation for the new function
Kdreval Dec 29, 2022
7fc02da
cleanup: add examples for new function
Kdreval Dec 29, 2022
1b28d15
merge master and resolve conflicts
Kdreval Jan 19, 2023
2769b34
cleanup: remove dependence on internet to check for latest version of…
Kdreval Feb 8, 2023
62e6d10
cleanup: updating NAMESPACE
Kdreval Feb 8, 2023
62cb770
Merge branch 'master' into kdreval-data_helper
Kdreval Feb 8, 2023
342ebd6
cleanup: merge master and resolve conflicts
Kdreval Mar 23, 2023
ff52877
new functionality: switching lymphoma genes to GAMBLR.data helper
Kdreval Mar 23, 2023
173fbc4
add fix from Ryan for prettyOncoplot
Kdreval Mar 23, 2023
acd8df5
cleanup: update documentation
Kdreval Mar 23, 2023
4a90f28
cleanup: add GAMBLR.data as dependency
Kdreval Mar 23, 2023
57e28c6
cleanup: standardize the config key check
Kdreval Mar 24, 2023
81f8a06
review comments
Kdreval Apr 21, 2023
ed74f3c
get rid of data.table dependency when calculating PGA
Kdreval Apr 21, 2023
f21503d
init collating PGA results
Kdreval Apr 21, 2023
c976122
add PGA collating to main function
Kdreval Apr 21, 2023
50866a8
bug fix: mistake in calculate_pga
Kdreval May 1, 2023
46a6aec
new functionality: add capture support to
Kdreval May 3, 2023
76a48fa
new functionality: add capture support to get_cn_segments
Kdreval May 3, 2023
16fb2e1
new functionality: add capture support to get_cn_states
Kdreval May 3, 2023
9586403
new functionality: add capture support throughout CNV functions
Kdreval May 3, 2023
7672e8d
new functionality: remove capture stop in collate_pga
Kdreval May 3, 2023
f566779
cleanup: update documentation
Kdreval May 3, 2023
e4b60b4
cleanup: merge master and resolve conflicts
Kdreval May 3, 2023
76ec542
bugfix: remove newline in export statement messing up with NAMESPACE
Kdreval May 3, 2023
9ac4959
cleanup: add capture support to plotting functions
Kdreval May 4, 2023
3f5eb75
cleanup: add more capture support
Kdreval May 4, 2023
02aea8b
cleanup: add more documentation
Kdreval May 4, 2023
db2e632
bug fix: correct delimiters in corrupted bed file
Kdreval May 4, 2023
4 changes: 3 additions & 1 deletion DESCRIPTION
@@ -29,6 +29,7 @@ Imports:
dplyr,
forcats,
g3viz,
GAMBLR.data (>= 0.1),
ggbeeswarm,
ggplot2,
ggpubr,
@@ -68,7 +69,8 @@ VignetteBuilder:
Remotes:
cBioPortal/cgdsr,
morinlab/g3viz,
- morinlab/ggsci
+ morinlab/ggsci,
+ morinlab/GAMBLR.data
biocViews:
Encoding: UTF-8
LazyData: true
1 change: 1 addition & 0 deletions NAMESPACE
@@ -106,6 +106,7 @@ export(tidy_lymphgen)
export(write_sample_set_hash)
import(ComplexHeatmap)
import(DBI)
import(GAMBLR.data)
import(RCircos)
import(RColorBrewer)
import(RMariaDB)
60 changes: 0 additions & 60 deletions R/data.R
@@ -235,24 +235,6 @@
"grch37_all_gene_coordinates"


#' grch37 ASHM Regions.
#'
#' ASHM regions in respect to grch37.
#'
#' @format ## `grch37_ashm_regions`
#' A data frame with 88 rows and 7 columns.
#' \describe{
#' \item{chr_name}{The chromosome for which the region is residing on}
#' \item{hg19_start}{start coordinate for the region}
#' \item{hg19_end}{end coordinate for the region}
#' \item{gene}{Gene symbol (Hugo)}
#' \item{region}{Region name}
#' \item{regulatory_comment}{Regulatory element}
#' \item{name}{gene-region format}
#' }
"grch37_ashm_regions"


#' grch37 Gene Coordinates.
#'
#' All gene coordinates in respect to grch37.
@@ -300,25 +282,6 @@
#' }
"grch37_partners"


#' hg38 ASHM Regions.
#'
#' ASHM regions in respect to hg38.
#'
#' @format ## `hg38_ashm_regions`
#' A data frame with 88 rows and 7 columns.
#' \describe{
#' \item{chr_name}{The chromosome for which the region is residing on}
#' \item{hg19_start}{start coordinate for the region}
#' \item{hg19_end}{end coordinate for the region}
#' \item{gene}{Gene symbol (Hugo)}
#' \item{region}{Region name}
#' \item{regulatory_comment}{Regulatory element}
#' \item{name}{gene-region format}
#' }
"hg38_ashm_regions"


#' hg38 Gene Coordinates.
#'
#' All gene coordinates in respect to hg38.
@@ -415,29 +378,6 @@
"lymphoma_genes_comprehensive"


#' Lymphoma Genes
#'
#' A data frame with known lymphoma genes, genes are annotated by pathology, as well as literature support.
#'
#' @format ## `lymphoma_genes`
#' A data frame with 196 rows and 12 columns.
#' \describe{
#' \item{Gene}{Gene symbol}
#' \item{DLBCL}{Boolean flag annotating if the described genes are significant for the pathology (DLBCL)}
#' \item{FL}{Boolean flag annotating if the described genes are significant for the pathology (FL)}
#' \item{BL}{Boolean flag annotating if the described genes are significant for the pathology (BL)}
#' \item{MCL}{Boolean flag annotating if the described genes are significant for the pathology (MCL)}
#' \item{CLL}{Boolean flag annotating if the described genes are significant for the pathology (CLL)}
#' \item{ensembl_gene_id}{Ensembl gene ID}
#' \item{hgnc_symbol}{Gene symbol (Hugo)}
#' \item{LymphGen}{Boolean flag, TRUE if lymphGen}
#' \item{Reddy}{Boolean flag, TRUE if gene verified by the stated study (Reddy)}
#' \item{Chapuy}{Boolean flag, TRUE if gene verified by the stated study (Chapuy)}
#' \item{entrezgene_id}{Entrez ID for the described gene}
#' }
"lymphoma_genes"


#' Reddy Genes.
#'
#' Genes identified as significantly mutated in DLBCL by the study of Reddy et al.
339 changes: 185 additions & 154 deletions R/database.R

Large diffs are not rendered by default.

108 changes: 108 additions & 0 deletions R/load_data.R
@@ -0,0 +1,108 @@
# Helper functions not for export

#' @title Check the version of the data to load.
#'
#' @description Helper function to determine whether the user is requesting the latest version
#' of the bundled data or wants to access one of the earlier versions.
#'
#' @param mode Determines which data to handle. Defaults to somatic_hypermutation_locations. Will grow with more options as more data is version tracked.
#' @param this_genome_build The genome build of the data if coordinate-based. Accepts grch37 (default) or hg38.
#'
#' @noRd
#'
#' @return data frame
#' @import config dplyr readr GAMBLR.data
#'
Review comment (Contributor):

I have gone over all the GAMBLR helper functions on my branch, as a step in preparing the documentation for building a site from the source code with build_site from pkgdown. build_site takes every .Rd file in the man folder and generates a function-specific HTML file for it. We don't want that for helper functions, since they are not exported in NAMESPACE.

The fix for this is to add @noRd in the documentation for such functions. This prevents the .Rd file from being generated.
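
As a rough sketch of what that looks like (the helper `add_chr_prefix` below is made up for illustration and is not part of GAMBLR), the tag sits in the roxygen block and there is no `@export`:

```r
#' @title Add a "chr" prefix to chromosome names.
#'
#' @description Hypothetical internal helper, used here only to show
#' where the tag goes.
#'
#' @param chrom Character vector of chromosome names.
#'
#' @return Character vector in which every name starts with "chr".
#'
#' @noRd
#'
add_chr_prefix <- function(chrom) {
  # Leave names that are already prefixed untouched
  ifelse(startsWith(chrom, "chr"), chrom, paste0("chr", chrom))
}
```

With the tag in place, roxygen2 still parses the block, so the comments stay useful to anyone reading the source, but no .Rd file is written to man and pkgdown has nothing to render for it.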

I think you should also specify this for this function to keep things consistent. Let me know if this doesn't make sense, or if you need me to further clarify.

Thanks,

#' @examples
#' determine_version()
#' determine_version(this_genome_build = "hg38")
#'
determine_version <- function(
mode = "somatic_hypermutation_locations",
this_genome_build = "grch37"
){
# Determine the latest version of the data
# Get absolute paths to bundled files in GAMBLR.data
all_files <- system.file(
"extdata",
package = "GAMBLR.data"
) %>%
list.files(
recursive = TRUE,
full.names = TRUE
)

# Extract version from the path
all_files <- gsub(
".*extdata/",
"",
all_files
)
all_files <- all_files[grepl(mode, all_files)]

# Determine the highest version
versions <- gsub(".*[/]([^/]+)[/].*", "\\1", all_files)
versions <- versions[grep('[0-9]+', versions)]
versions <- sort(
numeric_version(
versions
)
)

latest_version <- max(versions)

# Which version did the user request in the config?
requested_version <- check_config_value(
config::get("bundled_data_versions")[[mode]]
)
# Resolve the "_latest" placeholder to the highest bundled version
if(requested_version == "_latest"){
requested_version = latest_version
}

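# Map the GAMBLR genome build name to the GRCh37/GRCh38 label used in the
# GAMBLR.data object names (anything other than grch37 is treated as GRCh38)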
if(this_genome_build == "grch37"){
this_genome_build = "GRCh37"
}else{
this_genome_build = "GRCh38"
}

# Load the user-specified version of the data
this_object <- paste0(
mode,
"_",
this_genome_build,
"_v",
requested_version
)

this_data <- eval(
parse(
text = paste0(
"GAMBLR.data::",
this_object)
)
)

return(this_data)

}


grch37_ashm_regions <- determine_version(
mode = "somatic_hypermutation_locations",
this_genome_build = "grch37"
)

hg38_ashm_regions <- determine_version(
mode = "somatic_hypermutation_locations",
this_genome_build = "hg38"
)

lymphoma_genes <- GAMBLR.data::get_genes(
entities = c("BL", "MCL", "DLBCL"),
version = check_config_value(
config::get("bundled_data_versions")[["lymphoma_genes"]]
),
gene_format = "data.frame"
)
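
The version lookup in `determine_version` can be tried in isolation with the sketch below. It is a simplified illustration under stated assumptions, not the code from this PR: the file paths are invented, `resolve_version` is a hypothetical helper, and `basename(dirname(...))` stands in for the `gsub` pattern used in the diff. It shows why the versions go through `numeric_version` rather than being compared as strings, and how a `"_latest"` request resolves to the highest bundled version.

```r
# Invented relative paths following a <mode>/<version>/<file> layout;
# the real file names under extdata in GAMBLR.data may differ.
paths <- c(
  "somatic_hypermutation_locations/0.1/regions_GRCh37.tsv",
  "somatic_hypermutation_locations/0.2/regions_GRCh37.tsv",
  "somatic_hypermutation_locations/0.10/regions_GRCh37.tsv",
  "lymphoma_genes/0.1/genes.tsv"
)

# Hypothetical helper: not part of GAMBLR or GAMBLR.data.
resolve_version <- function(paths, mode, requested = "_latest") {
  # Keep only the files that belong to the requested data type
  paths <- paths[grepl(mode, paths, fixed = TRUE)]
  # The version is the directory that directly contains each file
  versions <- basename(dirname(paths))
  # Drop anything that does not look like a dotted version string
  versions <- versions[grepl("^[0-9]+(\\.[0-9]+)*$", versions)]
  # numeric_version compares component-wise, so 0.10 sorts after 0.2
  available <- sort(unique(numeric_version(versions)))
  if (identical(requested, "_latest")) {
    return(as.character(max(available)))
  }
  requested <- numeric_version(requested)
  if (!any(available == requested)) {
    stop("Version ", as.character(requested), " is not bundled for ", mode)
  }
  as.character(requested)
}

resolve_version(paths, "somatic_hypermutation_locations")         # "0.10"
resolve_version(paths, "somatic_hypermutation_locations", "0.2")  # "0.2"
```

A plain character sort would rank "0.2" above "0.10"; comparing `numeric_version` objects avoids that. The explicit check for an unbundled version is an addition of this sketch.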