Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unable to read remote bed.bgz #76

Open
bschilder opened this issue Sep 4, 2022 · 7 comments
Open

Unable to read remote bed.bgz #76

bschilder opened this issue Sep 4, 2022 · 7 comments

Comments

@bschilder
Copy link

bschilder commented Sep 4, 2022

Hello,

I'm trying to query subsets of Roadmap bed.gz files. However there seems to be some issues here:

Reprex

Code

URL <- "https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/coreMarks/jointModel/final/E099_15_coreMarks_dense.bed.bgz"
res <- rtracklayer::import(URL)

## same error when using `which`
gr <- GenomicRanges::GRanges("chr4:14737349-16737284")
res <- rtracklayer::import(URL, which=gr)

Console output

Error in .new_IRanges_from_start_end(start, end) : 
  'start' or 'end' cannot contain NAs
In addition: Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :

Traceback

Error in .new_IRanges_from_start_end(start, end) : 
'start' or 'end' cannot contain NAs
11.
stop(wmsg("'start' or 'end' cannot contain NAs"))
10.
.new_IRanges_from_start_end(start, end)
9.
.new_IRanges(start = start, end = end, width = width)
8.
IRanges(bed$thickStart + 1L, thickEnd)
7.
.local(con, format, text, ...)
6.
import(con, ...)
5.
import(con, ...)
4.
import(FileForFormat(con), ...)
3.
import(FileForFormat(con), ...)
2.
rtracklayer::import(URL)
1.
rtracklayer::import(URL)

Session info

``` R version 4.2.1 (2022-06-23) Platform: x86_64-apple-darwin17.0 (64-bit) Running under: macOS Monterey 12.4

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] utf8_1.2.2 reticulate_1.25 R.utils_2.12.0
[4] tidyselect_1.1.2 RSQLite_2.2.16 AnnotationDbi_1.58.0
[7] htmlwidgets_1.5.4 grid_4.2.1 BiocParallel_1.30.3
[10] XGR_1.1.8 munsell_0.5.0 codetools_0.2-18
[13] interp_1.1-3 DT_0.24 colorspace_2.0-3
[16] OrganismDbi_1.38.1 Biobase_2.56.0 filelock_1.0.2
[19] knitr_1.40 supraHex_1.34.0 rstudioapi_0.14
[22] stats4_4.2.1 DescTools_0.99.45 MatrixGenerics_1.8.1
[25] GenomeInfoDbData_1.2.8 bit64_4.0.5 echoconda_0.99.7
[28] basilisk_1.8.1 vctrs_0.4.1 generics_0.1.3
[31] xfun_0.32 biovizBase_1.44.0 BiocFileCache_2.4.0
[34] R6_2.5.1 GenomeInfoDb_1.32.3 AnnotationFilter_1.20.0
[37] bitops_1.0-7 cachem_1.0.6 reshape_0.8.9
[40] DelayedArray_0.22.0 assertthat_0.2.1 BiocIO_1.6.0
[43] scales_1.2.1 nnet_7.3-17 rootSolve_1.8.2.3
[46] gtable_0.3.0 lmom_2.9 ggbio_1.44.1
[49] ensembldb_2.20.2 rlang_1.0.4 echodata_0.99.12
[52] splines_4.2.1 lazyeval_0.2.2 rtracklayer_1.57.0
[55] dichromat_2.0-0.1 hexbin_1.28.2 checkmate_2.1.0
[58] BiocManager_1.30.18 yaml_2.3.5 reshape2_1.4.4
[61] GenomicFeatures_1.48.3 ggnetwork_0.5.10 backports_1.4.1
[64] Hmisc_4.7-1 RBGL_1.72.0 tools_4.2.1
[67] ggplot2_3.3.6 ellipsis_0.3.2 RColorBrewer_1.1-3
[70] proxy_0.4-27 BiocGenerics_0.42.0 Rcpp_1.0.9
[73] plyr_1.8.7 base64enc_0.1-3 progress_1.2.2
[76] zlibbioc_1.42.0 purrr_0.3.4 RCurl_1.98-1.8
[79] basilisk.utils_1.8.0 prettyunits_1.1.1 rpart_4.1.16
[82] deldir_1.0-6 S4Vectors_0.34.0 SummarizedExperiment_1.26.1
[85] ggrepel_0.9.1 cluster_2.1.4 fs_1.5.2
[88] crul_1.2.0 magrittr_2.0.3 data.table_1.14.2
[91] echotabix_0.99.7 dnet_1.1.7 openxlsx_4.2.5
[94] mvtnorm_1.1-3 ProtGenerics_1.28.0 matrixStats_0.62.0
[97] patchwork_1.1.2 hms_1.1.2 evaluate_0.16
[100] XML_3.99-0.10 jpeg_0.1-9 readxl_1.4.1
[103] IRanges_2.30.1 gridExtra_2.3 compiler_4.2.1
[106] biomaRt_2.52.0 tibble_3.1.8 crayon_1.5.1
[109] R.oo_1.25.0 htmltools_0.5.3 echoannot_0.99.7
[112] tzdb_0.3.0 Formula_1.2-4 tidyr_1.2.0
[115] expm_0.999-6 Exact_3.1 DBI_1.1.3
[118] dbplyr_2.2.1 MASS_7.3-58.1 rappdirs_0.3.3
[121] boot_1.3-28 Matrix_1.4-1 readr_2.1.2
[124] piggyback_0.1.4 cli_3.3.0 R.methodsS3_1.8.2
[127] parallel_4.2.1 igraph_1.3.4 GenomicRanges_1.48.0
[130] pkgconfig_2.0.3 GenomicAlignments_1.32.1 dir.expiry_1.4.0
[133] RCircos_1.2.2 foreign_0.8-82 osfr_0.2.8
[136] xml2_1.3.3 XVector_0.36.0 stringr_1.4.1
[139] VariantAnnotation_1.42.1 digest_0.6.29 graph_1.74.0
[142] httpcode_0.3.0 Biostrings_2.64.1 rmarkdown_2.16
[145] cellranger_1.1.0 htmlTable_2.4.1 gld_2.6.5
[148] restfulr_0.0.15 curl_4.3.2 Rsamtools_2.12.0
[151] rjson_0.2.21 lifecycle_1.0.1 nlme_3.1-159
[154] jsonlite_1.8.0 BSgenome_1.64.0 fansi_1.0.3
[157] pillar_1.8.1 lattice_0.20-45 GGally_2.1.2
[160] KEGGREST_1.36.3 fastmap_1.1.0 httr_1.4.4
[163] survival_3.4-0 glue_1.6.2 zip_2.2.0
[166] png_0.1-7 bit_4.0.4 Rgraphviz_2.40.0
[169] class_7.3-20 stringi_1.7.8 blob_1.2.3
[172] latticeExtra_0.6-30 memoise_2.0.1 dplyr_1.0.9
[175] e1071_1.7-11 ape_5.6-2

</details>

Many thanks in advance, 
Brian
@bschilder
Copy link
Author

It seems that the issue only occurs when the file is remote. Downloading the entire file first does technically circumvent the issue, though this approach isn't ideal:

URL <- "https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/coreMarks/jointModel/final/E099_15_coreMarks_dense.bed.bgz"

tmp <- file.path(tempfile(),basename(URL))
dir.create(dirname(tmp),showWarnings = FALSE, recursive = TRUE)
utils::download.file(url = URL, destfile = tmp)
res <- rtracklayer::import(tmp)

Screenshot 2022-09-04 at 15 56 36

@bschilder
Copy link
Author

@lawremi @sanchit-saini, this error is still persisting. Do you know what might be happening here?

@bschilder
Copy link
Author

@hpages I know you're not the primary maintainer for this repo, but I was wondering if you nevertheless have any ideas about this. Thank you!

@hpages
Copy link
Contributor

hpages commented Oct 24, 2022

Hi @bschilder , I didn't investigate much but this is how rtracklayer::import() reads this remote BED file:

URL <- "https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/coreMarks/jointModel/final/E099_15_coreMarks_dense.bed.bgz"
con1 <- gzcon(url(URL, open="rb"), text=TRUE)
bedNames <- c("chrom", "start", "end", "name", "score", "strand", "thickStart", "thickEnd", "itemRgb")
bedClasses <- c("character", "integer", "integer", "character", "numeric", "character", "integer", "integer", "character")
bed1 <- read.table(con1, colClasses=bedClasses, as.is=TRUE, na.strings=".", comment.char="", sep="\t", quote="")
# Warning message:
# In scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
#   number of items read is not a multiple of the number of columns
colnames(bed1) <- bedNames

Major problem is that bed1 only has 1127 rows!

dim(bed1)
# [1] 1127    9

Minor problem is that the last row has an unexpected NA in the thickEnd column:

tail(bed1)
#      chrom   start     end     name score strand thickStart thickEnd
# 1122 chr10 8459800 8462200   5_TxWk     0   <NA>    8459800  8462200
# 1123 chr10 8462200 8462400    7_Enh     0   <NA>    8462200  8462400
# 1124 chr10 8462400 8477200 15_Quies     0   <NA>    8462400  8477200
# 1125 chr10 8477200 8483600    7_Enh     0   <NA>    8477200  8483600
# 1126 chr10 8483600 8491600   5_TxWk     0   <NA>    8483600  8491600
# 1127 chr10 8491600 8492800    7_Enh     0   <NA>     849160       NA
#          itemRgb
# 1122     0,100,0
# 1123   255,255,0
# 1124 255,255,255
# 1125   255,255,0
# 1126     0,100,0
# 1127            

Compare this to what rtracklayer::import() does to read the file locally:

con2 <- gzfile("E099_15_coreMarks_dense.bed.bgz", open="r")
bed2 <- read.table(con2, colClasses=bedClasses, as.is=TRUE, na.strings=".", comment.char="", sep="\t", quote="")
colnames(bed2) <- bedNames

dim(bed2)
# [1] 265577      9

anyNA(bed2$thickEnd)
# [1] FALSE

So it looks to me that read.table() is somehow broken on a connection obtained with gzcon(url(URL, open="rb"), text=TRUE).

@bschilder
Copy link
Author

bschilder commented Oct 25, 2022

Hi @hpages thanks for taking a look at this, this definitely gives me some clues.

I actually found the below approach does a decent job. Of course, this is just a temporary solution as it still has to download the whole file first, which is less efficient than making a query. But at least this approach consistently reads in all 265,577 rows , and can handle situations where there's some NAs.

path <- "https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/coreMarks/jointModel/final/E099_15_coreMarks_dense.bed.bgz"

tmp <- file.path(tempdir(), basename(path))
        ## Only download it if you have to
        if(!file.exists(tmp)){
            utils::download.file(url = path, 
                                 destfile = tmp) 
        } 
 dat <- data.table::fread(text=readLines(con = tmp))
gr <- GenomicRanges::makeGRangesFromDataFrame(df = dat, seqnames.field = "V1", start.field = "V2", end.field = "V3", keep.extra.columns = TRUE)

Minor problem is that the last row has an unexpected NA in the thickEnd column:

rtracklayer seems to get hung up on the NA. This doesn't seem to be an issue with creating the GRanges object manually (see above example), so i'm wondering if there's a way to have rtracklayer::import simply throw a warning instead of an error? @sanchit-saini

@hpages
Copy link
Contributor

hpages commented Oct 25, 2022

i'm wondering if there's a way to have rtracklayer::import simply throw a warning instead of an error?

That doesn't solve the real problem which is that the remote file doesn't contain any NA so there shouldn't be one in the data.frame obtained by importing the file.

One workaround is to not try to import the remote file directly, but to download it to a temporary file first, like you did manually, except that this would need to be baked in rtracklayer::import().

FWIW I adopted that approach in GenomeInfoDb::getChromInfoFromNCBI() last year (see Bioconductor/GenomeInfoDb@58c8310). Works a lot better now!

@bschilder
Copy link
Author

i'm wondering if there's a way to have rtracklayer::import simply throw a warning instead of an error?

That doesn't solve the real problem which is that the remote file doesn't contain any NA so there shouldn't be one in the data.frame obtained by importing the file.

Ah ok, I see what you mean. Very strange that the NA gets introduced somehow.

Screenshot 2022-10-26 at 10 03 47

One workaround is to not try to import the remote file directly, but to download it to a temporary file first, like you did manually, except that this would need to be baked in rtracklayer::import().
FWIW I adopted that approach in GenomeInfoDb::getChromInfoFromNCBI() last year (see Bioconductor/GenomeInfoDb@58c8310). Works a lot better now!

Nice! Thanks for sharing this. @sanchit-saini @lawremi something along these lines would be a great feature forrtracklayer

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants