Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

what(): Positions should be positive. #212

Closed
caimiao0714 opened this issue May 9, 2021 · 13 comments
Closed

what(): Positions should be positive. #212

caimiao0714 opened this issue May 9, 2021 · 13 comments

Comments

@caimiao0714
Copy link

caimiao0714 commented May 9, 2021

I’ve been trying to use the bigsnpr package to read the UK Biobank genetics data, but it seems that I can’t really replicate your code. I’ve been had “positions should be positive” error although (I think) I’m using the same code as you did. Here are my code:

pacman::p_load(doParallel, benchmarkme, bigreadr)
 
(n_cores = benchmarkme::get_cpu()$no_of_cores)
 
# ==================== Read in SNP IDs ====================
registerDoParallel(cl <- makeCluster(n_cores))
start_time = Sys.time()
system.time({
  list_snp_id <- foreach(chr = 1:22) %dopar% {
    # cat("Processing chromosome", chr, "..\n")
    mfi <- paste0("Data/ukb_imp_mfi/ukb_mfi_chr", chr, "_v3.txt")
    infos_chr <- bigreadr::fread2(mfi, showProgress = FALSE)
    infos_chr_sub <- subset(infos_chr, V6 > 0.01)  ## MAF > 1%
    bim <- paste0("Data/ukb_snp_bim/ukb_snp_chr", chr, "_v2.bim")
    map_chr <- bigreadr::fread2(bim)
    joined <- dplyr::inner_join(
      infos_chr_sub, map_chr,
      by = c("V3" = "V4", "V4" = "V5", "V5" = "V6")
    )
    with(joined, paste(chr, V3, V4, V5, sep = "_"))
  }
})
Sys.time() - start_time # Total amount of time: 30 sec
stopCluster(cl)
 
# ==================== Read in SNPs ====================
rds = bigsnpr::snp_readBGEN(
    bgenfiles = glue::glue("Data/imp/ukb_imp_chr{chr}_v3.bgen", chr = 1:2),
    list_snp_id = list_snp_id,
    backingfile = "Data/saved_tmp_1_2",
    #ind_row = ind.indiv[sub],
    bgi_dir = "Data/ukb_imp_bgi",
    ncores = benchmarkme::get_cpu()$no_of_cores
  )

The error message is:

    terminate called after throwing an instance of ‘Rcpp::exception’
           What(): Positions should be positive.
    Aborted (core dumped)

My questions is: I’m using the SNP id from the original mfi files and everything looks fine, I’m not sure why the position is not positive. Any comments/suggestions would be extremely useful and appreciated!

Thank you,
Miao

@privefl
Copy link
Owner

privefl commented May 9, 2021

Seems like an Rcpp issue that I have never seen before.

Which packageVersion("bigsnpr") do you have?
I have recently added more checks before going into Rcpp to try catch possible errors before.

@privefl
Copy link
Owner

privefl commented May 9, 2021

In fact, it seems to be one of my own assertions, which I have never encountered before.

myassert(pos > 0, "Positions should be positive.");

What is the range of the positions you're trying to read? Are there any 0s maybe?

@caimiao0714
Copy link
Author

caimiao0714 commented May 9, 2021 via email

@caimiao0714
Copy link
Author

In fact, it seems to be one of my own assertions, which I have never encountered before.

myassert(pos > 0, "Positions should be positive.");

What is the range of the positions you're trying to read? Are there any 0s maybe?

How can I check the range of the positions? I did not specify the ind_row, so assumingly all the positions will be read.

@privefl
Copy link
Owner

privefl commented May 9, 2021

When you read the MFI files.

@caimiao0714
Copy link
Author

I tried to validate if there is any non-positive positions, and it seems that my mfi files do NOT include any non-positive pos. I added a condition to force V3 > 0 in the subset sentence (see my code below).

system.time({
  list_snp_id <- foreach(chr = 1:22) %dopar% {
    # cat("Processing chromosome", chr, "..\n")
    mfi <- paste0("Data/ukb_imp_mfi/ukb_mfi_chr", chr, "_v3.txt")
    infos_chr <- bigreadr::fread2(mfi, showProgress = FALSE)
    infos_chr_sub <- subset(infos_chr, V6 > 0.01 & V3 > 0)  ## <------ I added V3 > 0 to make sure all pos are positive
    bim <- paste0("Data/ukb_snp_bim/ukb_snp_chr", chr, "_v2.bim")
    map_chr <- bigreadr::fread2(bim)
    joined <- dplyr::inner_join(
      infos_chr_sub, map_chr,
      by = c("V3" = "V4", "V4" = "V5", "V5" = "V6")
    )
    with(joined, paste(chr, V3, V4, V5, sep = "_"))
  }
}) 

I'm using R 3.6.1 on CentOS Linux 7, bigsnpr 1.6.1. Below is my session info:

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /export/home/caim29/software/anaconda3/envs/r3.2/lib/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] bigsnpr_1.6.1     bigstatsr_1.5.1   bigreadr_0.2.4    benchmarkme_1.0.7 doParallel_1.0.16 iterators_1.0.13 
[7] foreach_1.5.1    

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6            benchmarkmeData_1.0.4 bigsparser_0.4.4      bigparallelr_0.3.1   
 [5] pillar_1.6.0          compiler_3.6.1        tools_3.6.1           gtable_0.3.0         
 [9] lifecycle_1.0.0       tibble_3.1.1          lattice_0.20-44       pkgconfig_2.0.3      
[13] rlang_0.4.11          Matrix_1.3-3          DBI_1.1.1             flock_0.7            
[17] yaml_2.2.1            bigassertr_0.1.4      dplyr_1.0.6           httr_1.4.2           
[21] generics_0.1.0        vctrs_0.3.8           cowplot_1.1.1         grid_3.6.1           
[25] tidyselect_1.1.1      data.table_1.14.0     glue_1.4.2            R6_2.5.0             
[29] fansi_0.4.2           parallelly_1.25.0     pacman_0.5.1          purrr_0.3.4          
[33] ggplot2_3.3.3         magrittr_2.0.1        scales_1.1.1          codetools_0.2-18     
[37] ellipsis_0.3.2        assertthat_0.2.1      colorspace_2.0-1      utf8_1.2.1           
[41] munsell_0.5.0         crayon_1.4.1     

I'm suspecting pos not positive is not the real issue here (based on my code above). Let me know if you have any thoughts/suggestions.

Thank you,
Miao

@privefl
Copy link
Owner

privefl commented May 10, 2021

BTW, what is sum(lengths(list_snp_id))?

@caimiao0714
Copy link
Author

BTW, what is sum(lengths(list_snp_id))?

I think this is just checking how many SNP IDs were included.

@privefl
Copy link
Owner

privefl commented May 10, 2021

Yes, forget about that.
I'll try to reproduce what you did. Do you get this error already when restricting to 1 chromosome?
Do you get this error when using ncores = 1?

@caimiao0714
Copy link
Author

Yes, forget about that.
I'll try to reproduce what you did. Do you get this error already when restricting to 1 chromosome?
Do you get this error when using ncores = 1?

I spent a whole day testing all different combinations of chr index and the number of cores. It seems that I encounter the same exact error message (see below) whenever chr 1 is included, even if I only use 1 core. But the program can work perfectly well when I don't include chr 1, regardless of how many scores I use.

terminate called after throwing an instance of ‘Rcpp::exception’
           What(): Positions should be positive.
    Aborted (core dumped)

@privefl
Copy link
Owner

privefl commented May 10, 2021

So, the problem is with chromosome 1 only?
It is possible that your file is corrupted of some kind?

@privefl
Copy link
Owner

privefl commented May 10, 2021

You can compare tools::md5sum("Data/imp/ukb_imp_chr1_v3.bgen") with https://biobank.ctsu.ox.ac.uk/crystal/refer.cgi?id=997.

@caimiao0714
Copy link
Author

You can compare tools::md5sum("Data/imp/ukb_imp_chr1_v3.bgen") with https://biobank.ctsu.ox.ac.uk/crystal/refer.cgi?id=997.

Thank you! I checked and it does seem to be a data issue. The md5 checksum of chr 1 in my data set does not match that of the uk biobank website, while chr 2 to 22 do match. I'll try to re-download the data.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants