reading in many (old) files after copying them over to my linux drive fails #110

Closed
japhir opened this issue Apr 1, 2020 · 30 comments

Comments

Contributor

japhir commented Apr 1, 2020

Having to work from home got me quite frustrated with the extremely slow VPN connection I have to the raw data Samba drive, so I copied everything over with some rsync scripts. I used rsync's -t flag, which is supposed to preserve modification times. This seems to have gone wrong, however:

Now I did manage to read in all the data, but when I run iso_get_file_info() on all ~15k files, I get the errors below:

> iso_get_file_info(dids)
Info: aggregating file info from 14985 data file(s)
Error: No common type for `180223_1_Kiel Std test_1_ETH-1.did$file_datetime` <datetime<Europe/Amsterdam>> and `180307_2_Kiel Std test_30_ETH-3.did$file_datetime` <integer>.
Run `rlang::last_error()` to see where the error occurred.
> rlang::last_error()
<error/vctrs_error_incompatible_type>
No common type for `180223_1_Kiel Std test_1_ETH-1.did$file_datetime` <datetime<Europe/Amsterdam>> and `180307_2_Kiel Std test_30_ETH-3.did$file_datetime` <integer>.
Backtrace:
  1. utils:::.ess.eval(...)
 27. vctrs:::vec_ptype2.POSIXt.default(...)
 28. vctrs::vec_default_ptype2(x, y, x_arg = x_arg, y_arg = y_arg)
 29. vctrs::stop_incompatible_type(x, y, x_arg = x_arg, y_arg = y_arg)
 30. vctrs:::stop_incompatible(...)
 31. vctrs:::stop_vctrs(...)
Run `rlang::last_trace()` to see the full context.

Running any of the other isoreader functions is also extremely slow: just reading in the 104MB rds file with dids <- iso_read_dual_inlet("out/dids.di.rds") takes ~2.11 minutes, probably because it's performing some checks? read_rds("out/dids.di.rds") takes approximately 7 seconds.

iso_filter_files() is also non-functional on the whole dataset.

Any ideas on how to fix this?

Contributor Author

japhir commented Apr 1, 2020

Hmm when I read the file that's giving errors separately, I also get a bunch of errors:

run1 <- iso_read_dual_inlet("~/Downloads/archive/motu/dids/_180223_1/")
Info: preparing to read 1 data files (all will be cached)...
Info: reading file '180223_1_Kiel Std test_1_ETH-1.did' from cache...
Progress: [================================================================================================] 1/1 (100%)  0s
Info: finished reading 1 files in 0.09 secs
Info: encountered 2 problems in total
# A tibble: 2 x 4
  file_id                  type  func                 details                                                                
  <chr>                    <chr> <chr>                <chr>                                                                  
1 180223_1_Kiel Std test_… error extract_did_raw_vol… cannot locate voltage data - block 'CTwoDoublesArrayData' not found af…
2 180223_1_Kiel Std test_… error extract_did_vendor_… cannot process vendor computed data table - block 'CDualInletEvaluated…

Warning message:
Column `path` has different attributes on LHS and RHS of join 

here's the file:

180223_1_Kiel Std test_1_ETH-1.zip

Contributor Author

japhir commented Apr 1, 2020

The above errors have made me look into the raw files themselves. Currently trying to do an rsync with md5-sums so that I'm certain that it's not a copying error. This may take some time because I have some samba issues now. I'll get back to this as soon as I have a response from tech support on that!
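
One way to rule out a copying error is to compare every source file byte for byte against its copy. The sketch below is not from this thread: `verify_copy` is a hypothetical helper name, and it assumes both trees are mounted locally and that POSIX `find` and `cmp` are available. It prints the relative path of each file that is missing from the copy or differs from the source.

```shell
# Hypothetical helper (not part of rsync or isoreader).
# Usage: verify_copy SRC_DIR DST_DIR
verify_copy() {
  src=$1
  dst=$2
  # List every regular file under SRC_DIR as a path relative to it.
  # Reading line by line keeps file names with spaces intact.
  (cd "$src" && find . -type f) | while IFS= read -r rel; do
    rel=${rel#./}
    # cmp -s exits non-zero when the files differ or the copy is missing.
    if ! cmp -s "$src/$rel" "$dst/$rel" 2>/dev/null; then
      printf '%s\n' "$rel"
    fi
  done
}
```

Called as `verify_copy /path/to/source /path/to/copy` (both paths are placeholders), an empty result means every file matched. rsync's own `--checksum` option makes a similar decision during the transfer by comparing checksums instead of size and modification time.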

Contributor Author

japhir commented Apr 2, 2020

Yep, seems like this was an issue with half-copied files, since it does work now.

I still get the "Column `path` has different attributes on LHS and RHS of join" warning, though.

Contributor

sebkopf commented Apr 2, 2020 via email

Contributor Author

japhir commented Apr 2, 2020

180223_1_Kiel Std test_1_ETH-1.zip

unzip that file to wherever, then iso_read_dual_inlet("~/Download/folder") should do it, I think!

Contributor

sebkopf commented Apr 2, 2020

I get the following output without any other warnings. Could you share your sessionInfo()?

Info: preparing to read 1 data files (all will be cached)...                                                                            
Info: reading file '180223_1_Kiel Std test_1_ETH-1.did' with '.did' reader                                                              
Warning: caught error - cannot locate voltage data - block 'CTwoDoublesArrayData' not found after position 1 (pos 113311)               
Warning: caught error - cannot process vendor computed data table - block 'CDualInletEvaluatedData' not found after position 1 (pos 4...
Progress: [=============================================================================================================] 1/1 (100%)  1s
Info: finished reading 1 files in 1.87 secs
Info: encountered 2 problems in total
# A tibble: 2 x 4
  file_id                    type  func                   details                                                                         
  <chr>                      <chr> <chr>                  <chr>                                                                           
1 180223_1_Kiel Std test_1_… error extract_did_raw_volta… cannot locate voltage data - block 'CTwoDoublesArrayData' not found after posit…
2 180223_1_Kiel Std test_1_… error extract_did_vendor_da… cannot process vendor computed data table - block 'CDualInletEvaluatedData' not…

Dual inlet iso file '180223_1_Kiel Std test_1_ETH-1.did': 0 cycles, 0 ions () 
Problems:
# A tibble: 2 x 4
  file_id                    type  func                   details                                                                         
  <chr>                      <chr> <chr>                  <chr>                                                                           
1 180223_1_Kiel Std test_1_… error extract_did_raw_volta… cannot locate voltage data - block 'CTwoDoublesArrayData' not found after posit…
2 180223_1_Kiel Std test_1_… error extract_did_vendor_da… cannot process vendor computed data table - block 'CDualInletEvaluatedData' not…

Contributor Author

japhir commented Apr 3, 2020

log of running it on one file, quietly, with caching
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(isoreader)
#> 
#> Attaching package: 'isoreader'
#> The following object is masked from 'package:stats':
#> 
#>     filter
setwd("~/Downloads")
cafs  <- iso_read_dual_inlet("170126_170124_Sibren_run29-1426.caf",
                             cache = TRUE,
                             quiet = TRUE,
                             discard_duplicates = FALSE,
                             parallel = TRUE)
#> Warning: Column `path` has different attributes on LHS and RHS of join
iso_get_problems(cafs)
#> # A tibble: 1 x 4
#>   file_id            type  func               details                           
#>   <chr>              <chr> <chr>              <chr>                             
#> 1 170126_170124_Sib… error extract_isodat_ol… "Assigned data `file_info$value` …
sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Arch Linux
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/libopenblasp-r0.3.9.so
#> LAPACK: /usr/lib/liblapack.so.3.9.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] isoreader_1.1.4 dplyr_0.8.5    
#> 
#> loaded via a namespace (and not attached):
#>  [1] zip_2.0.4         Rcpp_1.0.4        pillar_1.4.3      compiler_3.6.3   
#>  [5] highr_0.8         prettyunits_1.1.1 progress_1.2.2    R.methodsS3_1.8.0
#>  [9] R.utils_2.9.2     base64enc_0.1-3   tools_3.6.3       digest_0.6.25    
#> [13] rhdf5_2.30.1      lubridate_1.7.4   evaluate_0.14     lifecycle_0.2.0  
#> [17] tibble_3.0.0      pkgconfig_2.0.3   rlang_0.4.5       openxlsx_4.1.4   
#> [21] cli_2.0.2         yaml_2.2.1        parallel_3.6.3    xfun_0.12        
#> [25] xml2_1.2.5        stringr_1.4.0     knitr_1.28        vctrs_0.2.4      
#> [29] globals_0.12.5    hms_0.5.3         tidyselect_1.0.0  glue_1.3.2       
#> [33] listenv_0.8.0     R6_2.4.1          fansi_0.4.1       rmarkdown_2.1    
#> [37] tidyr_1.0.2       Rhdf5lib_1.8.0    readr_1.3.1       purrr_0.3.3      
#> [41] magrittr_1.5      feather_0.3.5     codetools_0.2-16  ellipsis_0.3.0   
#> [45] htmltools_0.4.0   assertthat_0.2.1  future_1.16.0     UNF_2.0.6        
#> [49] utf8_1.1.4        stringi_1.4.6     crayon_1.3.4      R.oo_1.23.0

Created on 2020-04-03 by the reprex package (v0.3.0)

Contributor Author

japhir commented Apr 3, 2020

Looks like there are some issues with the old caf files now, which is very unfortunate. They are suddenly ALL marked as problematic files. When I run iso_get_file_info() on my whole set of cafs I get the following:

> iso_get_file_info(cafs)
Info: aggregating file info from 4928 data file(s)
Error: No common type for `170126_170124_Sibren_run29-1426.caf$file_datetime` <datetime<Europe/Amsterdam>> and `170621_170522_Guido_Magda_ETH-1-0000.caf$file_datetime` <integer>.
Run `rlang::last_error()` to see where the error occurred.

> rlang::last_error()
<error/vctrs_error_incompatible_type>
No common type for `170126_170124_Sibren_run29-1426.caf$file_datetime` <datetime<Europe/Amsterdam>> and `170621_170522_Guido_Magda_ETH-1-0000.caf$file_datetime` <integer>.
Backtrace:
  1. isoreader::iso_get_file_info(cafs)
 14. vctrs:::vec_ptype2.POSIXt.default(...)
 15. vctrs::vec_default_ptype2(x, y, x_arg = x_arg, y_arg = y_arg)
 16. vctrs::stop_incompatible_type(x, y, x_arg = x_arg, y_arg = y_arg)
 17. vctrs:::stop_incompatible(...)
 18. vctrs:::stop_vctrs(...)
Run `rlang::last_trace()` to see the full context.

> rlang::last_trace()
<error/vctrs_error_incompatible_type>
No common type for `170126_170124_Sibren_run29-1426.caf$file_datetime` <datetime<Europe/Amsterdam>> and `170621_170522_Guido_Magda_ETH-1-0000.caf$file_datetime` <integer>.
Backtrace:
     █
  1. ├─isoreader::iso_get_file_info(cafs)
  2. │ └─`%>%`(...)
  3. │   ├─base::withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
  4. │   └─base::eval(quote(`_fseq`(`_lhs`)), env, env)
  5. │     └─base::eval(quote(`_fseq`(`_lhs`)), env, env)
  6. │       └─isoreader:::`_fseq`(`_lhs`)
  7. │         └─magrittr::freduce(value, `_function_list`)
  8. │           ├─base::withVisible(function_list[[k]](value))
  9. │           └─function_list[[k]](value)
 10. │             └─isoreader:::safe_bind_rows(.)
 11. │               └─vctrs::vec_rbind(...)
 12. ├─vctrs:::vec_ptype2_dispatch_s3(x = x, y = y, x_arg = x_arg, y_arg = y_arg)
 13. ├─vctrs::vec_ptype2.POSIXt(x = x, y = y, x_arg = x_arg, y_arg = y_arg)
 14. └─vctrs:::vec_ptype2.POSIXt.default(...)
 15.   └─vctrs::vec_default_ptype2(x, y, x_arg = x_arg, y_arg = y_arg)
 16.     └─vctrs::stop_incompatible_type(x, y, x_arg = x_arg, y_arg = y_arg)
 17.       └─vctrs:::stop_incompatible(...)
 18.         └─vctrs:::stop_vctrs(...)

Contributor Author

japhir commented Apr 3, 2020

Ok I think I'm being stupid. I had this issue for my newest files first, then it was fixed after I rsync'd without the --ignore-existing flag. Now I had it for the older caf files, but hadn't removed the flag yet. Don't spend time trying to fix this yet please ;-).
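
rsync's `--ignore-existing` option skips any file that already exists at the destination, so a partially written file that is already there is never replaced. A quick size comparison can show which files are affected before re-running the copy. This is a sketch under assumptions: `list_truncated` is a hypothetical name, and it only catches copies that are smaller than their source, not files of equal size with different content.

```shell
# Hypothetical helper (not part of rsync or isoreader).
# Usage: list_truncated SRC_DIR DST_DIR
list_truncated() {
  src=$1
  dst=$2
  (cd "$src" && find . -type f) | while IFS= read -r rel; do
    rel=${rel#./}
    # Only files that already exist at the destination are of interest:
    # those are the ones --ignore-existing would skip.
    [ -f "$dst/$rel" ] || continue
    # Arithmetic expansion strips any padding that wc adds on some systems.
    src_size=$(( $(wc -c < "$src/$rel") ))
    dst_size=$(( $(wc -c < "$dst/$rel") ))
    if [ "$dst_size" -lt "$src_size" ]; then
      printf '%s\n' "$rel"
    fi
  done
}
```

Any path printed by `list_truncated /path/to/source /path/to/copy` (placeholder paths) is a candidate for deletion at the destination, or for a re-sync without `--ignore-existing`.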

Contributor

sebkopf commented Apr 3, 2020

Sounds good. I do think there might be some dplyr issues with 0.8.5 (and the upcoming 1.0) that we need to address. The newest dplyr implements bind_rows() in a new way that I'm pretty sure is crashing the iso_get_ functions for more complicated data columns like file_datetime.

sebkopf mentioned this issue Apr 3, 2020

Contributor Author

japhir commented Apr 3, 2020

Aww unfortunately that was not the problem. All my old caf files don't work anymore, even after double-checking that they were copied over correctly.

So iso_get_file_info() breaks for the caf files. For the did files it's just become very slow.

log of reading in the combined big cafs file with all 4928 caf files, resulting in 6300 errors
library(isoreader)
#> 
#> Attaching package: 'isoreader'
#> The following object is masked from 'package:stats':
#> 
#>     filter
setwd("~/SurfDrive/PhD/programming/dataprocessing")
cafs <- iso_read_dual_inlet("out/cafs.di.rds")
#> Info: preparing to read 1 data files (all will be cached)...
#> Info: reading file 'out/cafs.di.rds' with '.di.rds' reader
#> Info: loaded data for 4928 data files from R Data Storage - checking loaded...
#> Info: finished reading 1 files in 13.08 secs
#> Warning: Column `path` has different attributes on LHS and RHS of join
#> Info: encountered 6300 problems in total
#> # A tibble: 6,300 x 4
#>    file_id            type  func               details                          
#>    <chr>              <chr> <chr>              <chr>                            
#>  1 170126_170124_Sib… error extract_isodat_ol… "Assigned data `file_info$value`…
#>  2 170126_170124_Sib… error extract_isodat_ol… "Assigned data `file_info$value`…
#>  3 170126_170124_Sib… error extract_isodat_ol… "Assigned data `file_info$value`…
#>  4 170126_170124_Sib… error extract_isodat_ol… "Assigned data `file_info$value`…
#>  5 170126_170124_Sib… error extract_isodat_ol… "Assigned data `file_info$value`…
#>  6 170126_170124_Sib… error extract_isodat_ol… "Assigned data `file_info$value`…
#>  7 170127_170124_Sib… error extract_isodat_ol… "Assigned data `file_info$value`…
#>  8 170127_170124_Sib… error extract_isodat_ol… "Assigned data `file_info$value`…
#>  9 170127_170124_Sib… error extract_isodat_ol… "Assigned data `file_info$value`…
#> 10 170127_170124_Sib… error extract_isodat_ol… "Assigned data `file_info$value`…
#> # … with 6,290 more rows
iso_get_file_info(cafs)
#> Info: aggregating file info from 4928 data file(s)
#> Error: No common type for `170126_170124_Sibren_run29-1426.caf$file_datetime` <datetime<Europe/Amsterdam>> and `170621_170522_Guido_Magda_ETH-1-0000.caf$file_datetime` <integer>.
iso_get_raw_data(cafs)
#> Info: aggregating raw data from 4928 data file(s)
#> # A tibble: 74,674 x 9
#>    file_id                type   cycle v44.mV v45.mV v46.mV v47.mV v48.mV v49.mV
#>    <chr>                  <chr>  <int>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#>  1 170126_170124_Sibren_… stand…     0 13091. 15594. 18622.  2078.   216. -0.528
#>  2 170126_170124_Sibren_… stand…     1 12177. 14506. 17323.  1933.   201. -0.503
#>  3 170126_170124_Sibren_… stand…     2 11329. 13497. 16119.  1799.   187. -0.456
#>  4 170126_170124_Sibren_… stand…     3 10556. 12576. 15019.  1677.   174. -0.431
#>  5 170126_170124_Sibren_… stand…     4  9845. 11729. 14008.  1564.   163. -0.397
#>  6 170126_170124_Sibren_… stand…     5  9192. 10952. 13080.  1461.   152. -0.363
#>  7 170126_170124_Sibren_… stand…     6  8591. 10236. 12224.  1366.   142. -0.339
#>  8 170126_170124_Sibren_… stand…     7  8034.  9572. 11431.  1278.   133. -0.308
#>  9 170126_170124_Sibren_… stand…     8  7521.  8961. 10702.  1197.   124. -0.282
#> 10 170126_170124_Sibren_… sample     1 12661. 14953. 17854.  1974.   206. -0.509
#> # … with 74,664 more rows

Created on 2020-04-03 by the reprex package (v0.3.0)

Contributor Author

japhir commented Apr 3, 2020

Also, just using iso_read_dual_inlet() on one of these summary rds files is very slow, so probably some of the integrity checks have also broken? With read_rds() or readRDS() it's much faster.

Contributor

sebkopf commented Apr 3, 2020

I cannot reproduce your error even with your versions of dplyr and vctrs. Could this be an issue with the cached files? Can you run an example with read_cache = FALSE and quiet = FALSE so I can get a better sense of the output?

Contributor Author

japhir commented Apr 7, 2020

Hmm that's very weird. I've just updated my system and vctrs, dplyr and isoreader, and even with this single caf file I get issues. Are you saying you don't get these warnings/errors with the one file on your system either? Or just not the error related to the different file_datetime formats?

new log running it without caching for one file
library(isoreader)
#> 
#> Attaching package: 'isoreader'
#> The following object is masked from 'package:stats':
#> 
#>     filter
cafs  <- iso_read_dual_inlet("~/Downloads/170126_170124_Sibren_run29-1426.caf",
                             cache = FALSE,
                             read_cache = FALSE,
                             quiet = FALSE,
                             discard_duplicates = FALSE,
                             parallel = FALSE)
#> Info: preparing to read 1 data files...
#> Info: reading file '170126_170124_Sibren_run29-1426.caf' with '.caf' reader
#> Warning: caught error - Assigned data `file_info$value` must be compatible ...
#> Info: finished reading 1 files in 4.06 secs
#> Warning: Column `path` has different attributes on LHS and RHS of join
#> Info: encountered 1 problems in total
#> # A tibble: 1 x 4
#>   file_id            type  func               details                           
#>   <chr>              <chr> <chr>              <chr>                             
#> 1 170126_170124_Sib… error extract_isodat_ol… "Assigned data `file_info$value` …
iso_get_problems(cafs)
#> # A tibble: 1 x 4
#>   file_id            type  func               details                           
#>   <chr>              <chr> <chr>              <chr>                             
#> 1 170126_170124_Sib… error extract_isodat_ol… "Assigned data `file_info$value` …
iso_get_file_info(cafs)
#> Info: aggregating file info from 1 data file(s)
#> # A tibble: 1 x 7
#>   file_id file_root file_path file_subpath file_datetime       file_size
#>   <chr>   <chr>     <chr>     <chr>        <dttm>                  <int>
#> 1 170126… /home/ja… 170126_1… <NA>         2017-01-26 20:29:47    651810
#> # … with 1 more variable: MS_integration_time.s <dbl>
sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Arch Linux
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/libopenblasp-r0.3.9.so
#> LAPACK: /usr/lib/liblapack.so.3.9.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] isoreader_1.1.4
#> 
#> loaded via a namespace (and not attached):
#>  [1] zip_2.0.4         Rcpp_1.0.4        pillar_1.4.3      compiler_3.6.3   
#>  [5] highr_0.8         prettyunits_1.1.1 progress_1.2.2    R.methodsS3_1.8.0
#>  [9] R.utils_2.9.2     base64enc_0.1-3   tools_3.6.3       digest_0.6.25    
#> [13] rhdf5_2.30.1      lubridate_1.7.8   evaluate_0.14     lifecycle_0.2.0  
#> [17] tibble_3.0.0      pkgconfig_2.0.3   rlang_0.4.5       openxlsx_4.1.4   
#> [21] cli_2.0.2         yaml_2.2.1        parallel_3.6.3    xfun_0.12        
#> [25] xml2_1.3.0        dplyr_0.8.5       stringr_1.4.0     knitr_1.28       
#> [29] generics_0.0.2    vctrs_0.2.4       globals_0.12.5    hms_0.5.3        
#> [33] tidyselect_1.0.0  glue_1.4.0        listenv_0.8.0     R6_2.4.1         
#> [37] fansi_0.4.1       rmarkdown_2.1     tidyr_1.0.2       Rhdf5lib_1.8.0   
#> [41] readr_1.3.1       purrr_0.3.3       magrittr_1.5      feather_0.3.5    
#> [45] codetools_0.2-16  ellipsis_0.3.0    htmltools_0.4.0   assertthat_0.2.1 
#> [49] future_1.16.0     UNF_2.0.6         utf8_1.1.4        stringi_1.4.6    
#> [53] crayon_1.3.4      R.oo_1.23.0

Created on 2020-04-07 by the reprex package (v0.3.0)

Contributor Author

japhir commented Apr 7, 2020

Another long log running it for 22 caf files, resulting in 24 errors
library(isoreader)
#> 
#> Attaching package: 'isoreader'
#> The following object is masked from 'package:stats':
#> 
#>     filter
dir.create("/tmp/rtmp")
setwd("/tmp/rtmp")
cafs  <- iso_read_dual_inlet("~/Documents/archive/pacman/cafs/180522_Stds/",
                             cache = FALSE,
                             read_cache = FALSE,
                             quiet = FALSE,
                             discard_duplicates = FALSE,
                             parallel = FALSE)
#> Info: preparing to read 22 data files...
#> Info: reading file '180522_Std_ETH-1_1.caf' with '.caf' reader
#> Warning: caught error - Assigned data `file_info$value` must be compatible ...
#> Info: reading file '180522_Std_ETH-1_2.caf' with '.caf' reader
#> Warning: caught error - Assigned data `file_info$value` must be compatible ...
#> Info: reading file '180522_Std_ETH-1_7.caf' with '.caf' reader
#> Warning: caught error - Assigned data `file_info$value` must be compatible ...
#> Info: reading file '180522_Std_ETH-1_8.caf' with '.caf' reader
#> Warning: caught error - Assigned data `file_info$value` must be compatible ...
#> Info: reading file '180522_Std_ETH-2_10.caf' with '.caf' reader
#> Warning: caught error - Assigned data `file_info$value` must be compatible ...
#> Info: reading file '180522_Std_ETH-2_3.caf' with '.caf' reader
#> Warning: caught error - Assigned data `file_info$value` must be compatible ...
#> Info: reading file '180522_Std_ETH-2_4.caf' with '.caf' reader
#> Warning: caught error - Assigned data `file_info$value` must be compatible ...
#> Info: reading file '180522_Std_ETH-2_9.caf' with '.caf' reader
#> Warning: caught error - Assigned data `file_info$value` must be compatible ...
#> Info: reading file '180522_Std_ETH-3_11.caf' with '.caf' reader
#> Warning: caught error - Assigned data `file_info$value` must be compatible ...
#> Info: reading file '180522_Std_ETH-3_12.caf' with '.caf' reader
#> Warning: caught error - Assigned data `file_info$value` must be compatible ...
#> Info: reading file '180522_Std_ETH-3_5.caf' with '.caf' reader
#> Warning: caught error - Assigned data `file_info$value` must be compatible ...
#> Info: reading file '180522_Std_ETH-3_6.caf' with '.caf' reader
#> Warning: caught error - Assigned data `file_info$value` must be compatible ...
#> Info: reading file '180523_Std_ETH-1_13.caf' with '.caf' reader
#> Warning: caught error - Assigned data `file_info$value` must be compatible ...
#> Info: reading file '180523_Std_ETH-1_14.caf' with '.caf' reader
#> Warning: caught error - Assigned data `file_info$value` must be compatible ...
#> Info: reading file '180523_Std_ETH-1_19.caf' with '.caf' reader
#> Warning: caught error - Assigned data `file_info$value` must be compatible ...
#> Info: reading file '180523_Std_ETH-1_20.caf' with '.caf' reader
#> Warning: caught error - Assigned data `file_info$value` must be compatible ...
#> Info: reading file '180523_Std_ETH-2_15.caf' with '.caf' reader
#> Warning: caught error - Assigned data `file_info$value` must be compatible ...
#> Info: reading file '180523_Std_ETH-2_16.caf' with '.caf' reader
#> Warning: caught error - Assigned data `file_info$value` must be compatible ...
#> Info: reading file '180523_Std_ETH-2_21.caf' with '.caf' reader
#> Warning: caught error - Assigned data `file_info$value` must be compatible ...
#> Info: reading file '180523_Std_ETH-2_22.caf' with '.caf' reader
#> Warning: caught error - Assigned data `file_info$value` must be compatible ...
#> Warning: caught error - cannot identify measured masses - block 'CResultDat...
#> Warning: caught error - cannot process vendor data table - block 'CResultDa...
#> Info: reading file '180523_Std_ETH-3_17.caf' with '.caf' reader
#> Warning: caught error - Assigned data `file_info$value` must be compatible ...
#> Info: reading file '180523_Std_ETH-3_18.caf' with '.caf' reader
#> Warning: caught error - Assigned data `file_info$value` must be compatible ...
#> Info: finished reading 22 files in 1.03 mins
#> Warning: Column `path` has different attributes on LHS and RHS of join
#> Info: encountered 24 problems in total
#> # A tibble: 24 x 4
#>    file_id        type  func                details                             
#>    <chr>          <chr> <chr>               <chr>                               
#>  1 180522_Std_ET… error extract_isodat_old… "Assigned data `file_info$value` mu…
#>  2 180522_Std_ET… error extract_isodat_old… "Assigned data `file_info$value` mu…
#>  3 180522_Std_ET… error extract_isodat_old… "Assigned data `file_info$value` mu…
#>  4 180522_Std_ET… error extract_isodat_old… "Assigned data `file_info$value` mu…
#>  5 180522_Std_ET… error extract_isodat_old… "Assigned data `file_info$value` mu…
#>  6 180522_Std_ET… error extract_isodat_old… "Assigned data `file_info$value` mu…
#>  7 180522_Std_ET… error extract_isodat_old… "Assigned data `file_info$value` mu…
#>  8 180522_Std_ET… error extract_isodat_old… "Assigned data `file_info$value` mu…
#>  9 180522_Std_ET… error extract_isodat_old… "Assigned data `file_info$value` mu…
#> 10 180522_Std_ET… error extract_isodat_old… "Assigned data `file_info$value` mu…
#> # … with 14 more rows
iso_get_problems(cafs)
#> # A tibble: 24 x 4
#>    file_id        type  func                details                             
#>    <chr>          <chr> <chr>               <chr>                               
#>  1 180522_Std_ET… error extract_isodat_old… "Assigned data `file_info$value` mu…
#>  2 180522_Std_ET… error extract_isodat_old… "Assigned data `file_info$value` mu…
#>  3 180522_Std_ET… error extract_isodat_old… "Assigned data `file_info$value` mu…
#>  4 180522_Std_ET… error extract_isodat_old… "Assigned data `file_info$value` mu…
#>  5 180522_Std_ET… error extract_isodat_old… "Assigned data `file_info$value` mu…
#>  6 180522_Std_ET… error extract_isodat_old… "Assigned data `file_info$value` mu…
#>  7 180522_Std_ET… error extract_isodat_old… "Assigned data `file_info$value` mu…
#>  8 180522_Std_ET… error extract_isodat_old… "Assigned data `file_info$value` mu…
#>  9 180522_Std_ET… error extract_isodat_old… "Assigned data `file_info$value` mu…
#> 10 180522_Std_ET… error extract_isodat_old… "Assigned data `file_info$value` mu…
#> # … with 14 more rows
iso_get_file_info(cafs)
#> Info: aggregating file info from 22 data file(s)
#> # A tibble: 22 x 7
#>    file_id file_root file_path file_subpath file_datetime       file_size
#>    <chr>   <chr>     <chr>     <chr>        <dttm>                  <int>
#>  1 180522… /home/ja… 180522_S… <NA>         2018-05-22 17:24:52    651970
#>  2 180522… /home/ja… 180522_S… <NA>         2018-05-22 18:03:27    668650
#>  3 180522… /home/ja… 180522_S… <NA>         2018-05-22 21:14:55    668682
#>  4 180522… /home/ja… 180522_S… <NA>         2018-05-22 21:54:16    668678
#>  5 180522… /home/ja… 180522_S… <NA>         2018-05-22 23:11:42    669030
#>  6 180522… /home/ja… 180522_S… <NA>         2018-05-22 18:42:25    668992
#>  7 180522… /home/ja… 180522_S… <NA>         2018-05-22 19:21:29    669014
#>  8 180522… /home/ja… 180522_S… <NA>         2018-05-22 22:33:02    668970
#>  9 180522… /home/ja… 180522_S… <NA>         2018-05-22 23:47:26    652032
#> 10 180522… /home/ja… 180522_S… <NA>         2018-05-23 00:26:31    668710
#> # … with 12 more rows, and 1 more variable: MS_integration_time.s <dbl>
sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Arch Linux
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/libopenblasp-r0.3.9.so
#> LAPACK: /usr/lib/liblapack.so.3.9.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] isoreader_1.1.4
#> 
#> loaded via a namespace (and not attached):
#>  [1] zip_2.0.4         Rcpp_1.0.4        pillar_1.4.3      compiler_3.6.3   
#>  [5] highr_0.8         prettyunits_1.1.1 progress_1.2.2    R.methodsS3_1.8.0
#>  [9] R.utils_2.9.2     base64enc_0.1-3   tools_3.6.3       digest_0.6.25    
#> [13] rhdf5_2.30.1      lubridate_1.7.8   evaluate_0.14     lifecycle_0.2.0  
#> [17] tibble_3.0.0      pkgconfig_2.0.3   rlang_0.4.5       openxlsx_4.1.4   
#> [21] cli_2.0.2         yaml_2.2.1        parallel_3.6.3    xfun_0.12        
#> [25] xml2_1.3.0        dplyr_0.8.5       stringr_1.4.0     knitr_1.28       
#> [29] generics_0.0.2    vctrs_0.2.4       globals_0.12.5    hms_0.5.3        
#> [33] tidyselect_1.0.0  glue_1.4.0        listenv_0.8.0     R6_2.4.1         
#> [37] fansi_0.4.1       rmarkdown_2.1     tidyr_1.0.2       Rhdf5lib_1.8.0   
#> [41] readr_1.3.1       purrr_0.3.3       magrittr_1.5      feather_0.3.5    
#> [45] codetools_0.2-16  ellipsis_0.3.0    htmltools_0.4.0   assertthat_0.2.1 
#> [49] future_1.16.0     UNF_2.0.6         utf8_1.1.4        stringi_1.4.6    
#> [53] crayon_1.3.4      R.oo_1.23.0

Created on 2020-04-07 by the reprex package (v0.3.0)

Also: I edited all the posts above to use <details> tags so the long logs are collapsed by default.

Contributor Author

japhir commented Apr 7, 2020

Maybe it's because you ran it on the did files in #110 (comment) instead of the single caf file?

Contributor

sebkopf commented Apr 7, 2020

found it, it's tibble 3.0!!

sebkopf added a commit that referenced this issue Apr 7, 2020
Contributor

sebkopf commented Apr 8, 2020

Hi @japhir, can you check whether devtools::install_github("isoverse/isoreader", ref = "dev") solves the problem?

Contributor Author

japhir commented Apr 8, 2020

That's great! Thanks for implementing a fix so soon. I've updated to the dev version again for now, and it appears to be working. It no longer gives me the LHS and RHS warnings when I read in the files, but the old summary files and cached files are still very slow. It looks like I'm going to have to re-import all (caf) files with the read_cache flag off, because applying iso_get_file_info() to the cached versions still results in an error. That'll take a while, so I'll get back to you on that later.

re-import of one folder with caf files
library(isoreader)
#> 
#> Attaching package: 'isoreader'
#> The following object is masked from 'package:stats':
#> 
#>     filter
dir.create("/tmp/rtmp")
setwd("/tmp/rtmp")
cafs  <- iso_read_dual_inlet("~/Documents/archive/pacman/cafs/180522_Stds/",
                             cache = FALSE,
                             read_cache = FALSE,
                             quiet = FALSE,
                             discard_duplicates = FALSE,
                             parallel = FALSE)
#> Info: preparing to read 22 data files...
#> Info: reading file '180522_Std_ETH-1_1.caf' with '.caf' reader
#> Info: reading file '180522_Std_ETH-1_2.caf' with '.caf' reader
#> Info: reading file '180522_Std_ETH-1_7.caf' with '.caf' reader
#> Info: reading file '180522_Std_ETH-1_8.caf' with '.caf' reader
#> Info: reading file '180522_Std_ETH-2_10.caf' with '.caf' reader
#> Info: reading file '180522_Std_ETH-2_3.caf' with '.caf' reader
#> Info: reading file '180522_Std_ETH-2_4.caf' with '.caf' reader
#> Info: reading file '180522_Std_ETH-2_9.caf' with '.caf' reader
#> Info: reading file '180522_Std_ETH-3_11.caf' with '.caf' reader
#> Info: reading file '180522_Std_ETH-3_12.caf' with '.caf' reader
#> Info: reading file '180522_Std_ETH-3_5.caf' with '.caf' reader
#> Info: reading file '180522_Std_ETH-3_6.caf' with '.caf' reader
#> Info: reading file '180523_Std_ETH-1_13.caf' with '.caf' reader
#> Info: reading file '180523_Std_ETH-1_14.caf' with '.caf' reader
#> Info: reading file '180523_Std_ETH-1_19.caf' with '.caf' reader
#> Info: reading file '180523_Std_ETH-1_20.caf' with '.caf' reader
#> Info: reading file '180523_Std_ETH-2_15.caf' with '.caf' reader
#> Info: reading file '180523_Std_ETH-2_16.caf' with '.caf' reader
#> Info: reading file '180523_Std_ETH-2_21.caf' with '.caf' reader
#> Info: reading file '180523_Std_ETH-2_22.caf' with '.caf' reader
#> Warning: caught error - cannot identify measured masses - block 'CResultDat...
#> Warning: caught error - cannot process vendor data table - block 'CResultDa...
#> Info: reading file '180523_Std_ETH-3_17.caf' with '.caf' reader
#> Info: reading file '180523_Std_ETH-3_18.caf' with '.caf' reader
#> Info: finished reading 22 files in 57.35 secs
#> Info: encountered 2 problems in total
#> # A tibble: 2 x 4
#>   file_id         type  func             details                                
#>   <chr>           <chr> <chr>            <chr>                                  
#> 1 180523_Std_ETH… error extract_caf_raw… cannot identify measured masses - bloc…
#> 2 180523_Std_ETH… error extract_caf_ven… cannot process vendor data table - blo…
iso_get_problems(cafs)
#> # A tibble: 2 x 4
#>   file_id         type  func             details                                
#>   <chr>           <chr> <chr>            <chr>                                  
#> 1 180523_Std_ETH… error extract_caf_raw… cannot identify measured masses - bloc…
#> 2 180523_Std_ETH… error extract_caf_ven… cannot process vendor data table - blo…
iso_get_file_info(cafs)
#> Info: aggregating file info from 22 data file(s)
#> # A tibble: 22 x 22
#>    file_id file_root file_path file_subpath file_datetime       file_size Line 
#>    <chr>   <chr>     <chr>     <chr>        <dttm>                  <int> <chr>
#>  1 180522… /home/ja… 180522_S… <NA>         2018-05-22 17:24:52    651970 1    
#>  2 180522… /home/ja… 180522_S… <NA>         2018-05-22 18:03:27    668650 2    
#>  3 180522… /home/ja… 180522_S… <NA>         2018-05-22 21:14:55    668682 1    
#>  4 180522… /home/ja… 180522_S… <NA>         2018-05-22 21:54:16    668678 2    
#>  5 180522… /home/ja… 180522_S… <NA>         2018-05-22 23:11:42    669030 2    
#>  6 180522… /home/ja… 180522_S… <NA>         2018-05-22 18:42:25    668992 1    
#>  7 180522… /home/ja… 180522_S… <NA>         2018-05-22 19:21:29    669014 2    
#>  8 180522… /home/ja… 180522_S… <NA>         2018-05-22 22:33:02    668970 1    
#>  9 180522… /home/ja… 180522_S… <NA>         2018-05-22 23:47:26    652032 1    
#> 10 180522… /home/ja… 180522_S… <NA>         2018-05-23 00:26:31    668710 2    
#> # … with 12 more rows, and 15 more variables: `Peak Center` <chr>,
#> #   Pressadjust <chr>, Background <chr>, `Reference Refill` <chr>, `Weight
#> #   [mg]` <chr>, Sample <chr>, `Identifier 1` <chr>, `Identifier 2` <chr>,
#> #   Analysis <chr>, Comment <chr>, Preparation <chr>, `Pre Script` <chr>, `Post
#> #   Script` <chr>, Method <chr>, MS_integration_time.s <dbl>
sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Arch Linux
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/libopenblasp-r0.3.9.so
#> LAPACK: /usr/lib/liblapack.so.3.9.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] isoreader_1.1.5
#> 
#> loaded via a namespace (and not attached):
#>  [1] zip_2.0.4         Rcpp_1.0.4        pillar_1.4.3      compiler_3.6.3   
#>  [5] highr_0.8         prettyunits_1.1.1 progress_1.2.2    R.methodsS3_1.8.0
#>  [9] R.utils_2.9.2     base64enc_0.1-3   tools_3.6.3       digest_0.6.25    
#> [13] rhdf5_2.30.1      lubridate_1.7.8   evaluate_0.14     lifecycle_0.2.0  
#> [17] tibble_3.0.0      pkgconfig_2.0.3   rlang_0.4.5       openxlsx_4.1.4   
#> [21] cli_2.0.2         yaml_2.2.1        parallel_3.6.3    xfun_0.12        
#> [25] xml2_1.3.0        dplyr_0.8.5       stringr_1.4.0     knitr_1.28       
#> [29] generics_0.0.2    vctrs_0.2.4       globals_0.12.5    hms_0.5.3        
#> [33] tidyselect_1.0.0  glue_1.4.0        listenv_0.8.0     R6_2.4.1         
#> [37] fansi_0.4.1       rmarkdown_2.1     tidyr_1.0.2       Rhdf5lib_1.8.0   
#> [41] readr_1.3.1       purrr_0.3.3       magrittr_1.5      feather_0.3.5    
#> [45] codetools_0.2-16  ellipsis_0.3.0    htmltools_0.4.0   assertthat_0.2.1 
#> [49] future_1.16.0     UNF_2.0.6         utf8_1.1.4        stringi_1.4.6    
#> [53] crayon_1.3.4      R.oo_1.23.0

Created on 2020-04-08 by the reprex package (v0.3.0)

@japhir
Contributor Author

japhir commented Apr 8, 2020

It's just finished re-reading the 4928 caf files! It has now found 1376 files with problems, of which a lot are duplicate files.

I got the warnings below when I saved the aggregate file to rds with iso_save():

Warning messages:
1: In max(.data$pos) : no non-missing arguments to max; returning -Inf
2: In max(.data$pos) : no non-missing arguments to max; returning -Inf
3: Unknown or uninitialised column: `block`.
4: Unknown or uninitialised column: `block`.
5: Unknown or uninitialised column: `block`.

Reading in the newly created summary rds file is still slow (20.43 secs! vs read_rds which is basically instantaneous).
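
To put a number on a comparison like this, here is a minimal base R sketch of timing a plain `readRDS()`. The temporary file and the toy object are placeholders (assumptions), not the real `cafs.di.rds`; the same `system.time()` wrapper could be put around `iso_read_dual_inlet()` to compare the two on the actual file.

```r
# Toy stand-in for a saved collection (assumption: the real file is much larger).
toy <- replicate(100, list(x = runif(1000)), simplify = FALSE)
rds_path <- tempfile(fileext = ".rds")
saveRDS(toy, rds_path)

# system.time() reports user, system and elapsed time;
# "elapsed" is the wall-clock time in seconds.
timing <- system.time(reloaded <- readRDS(rds_path))
elapsed_secs <- unname(timing["elapsed"])

cat(sprintf("readRDS took %.2f secs\n", elapsed_secs))
unlink(rds_path)  # clean up the temporary file
```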

iso_get_file_info() unfortunately fails still on the caf files 😢.

logs of importing the summarized caf di file
library(isoreader)
#> 
#> Attaching package: 'isoreader'
#> The following object is masked from 'package:stats':
#> 
#>     filter
setwd("~/SurfDrive/PhD/programming/dataprocessing")
cafs <- iso_read_dual_inlet("out/cafs.di.rds")
#> Info: preparing to read 1 data files (all will be cached)...
#> Info: reading file 'out/cafs.di.rds' with '.di.rds' reader
#> Info: loaded data for 4928 data files from R Data Storage - checking loaded...
#> Info: finished reading 1 files in 19.39 secs
#> Info: encountered 1376 problems in total
#> # A tibble: 1,376 x 4
#>    file_id                type  func           details                          
#>    <chr>                  <chr> <chr>          <chr>                            
#>  1 170127_170124_Sibren_… error extract_caf_r… cannot identify measured masses …
#>  2 170127_170124_Sibren_… error extract_caf_v… cannot process vendor data table…
#>  3 170127_170124_Sibren_… error extract_caf_r… cannot identify measured masses …
#>  4 170127_170124_Sibren_… error extract_caf_v… cannot process vendor data table…
#>  5 170127_170124_Sibren_… error extract_caf_r… cannot identify measured masses …
#>  6 170127_170124_Sibren_… error extract_caf_v… cannot process vendor data table…
#>  7 170127_170124_Sibren_… error extract_caf_r… cannot identify measured masses …
#>  8 170127_170124_Sibren_… error extract_caf_v… cannot process vendor data table…
#>  9 170127_170127_170124_… error extract_caf_r… cannot identify measured masses …
#> 10 170127_170127_170124_… error extract_caf_v… cannot process vendor data table…
#> # … with 1,366 more rows
iso_get_problems(cafs)
#> # A tibble: 1,376 x 4
#>    file_id                type  func           details                          
#>    <chr>                  <chr> <chr>          <chr>                            
#>  1 170127_170124_Sibren_… error extract_caf_r… cannot identify measured masses …
#>  2 170127_170124_Sibren_… error extract_caf_v… cannot process vendor data table…
#>  3 170127_170124_Sibren_… error extract_caf_r… cannot identify measured masses …
#>  4 170127_170124_Sibren_… error extract_caf_v… cannot process vendor data table…
#>  5 170127_170124_Sibren_… error extract_caf_r… cannot identify measured masses …
#>  6 170127_170124_Sibren_… error extract_caf_v… cannot process vendor data table…
#>  7 170127_170124_Sibren_… error extract_caf_r… cannot identify measured masses …
#>  8 170127_170124_Sibren_… error extract_caf_v… cannot process vendor data table…
#>  9 170127_170127_170124_… error extract_caf_r… cannot identify measured masses …
#> 10 170127_170127_170124_… error extract_caf_v… cannot process vendor data table…
#> # … with 1,366 more rows
iso_get_file_info(cafs)
#> Info: aggregating file info from 4928 data file(s)
#> Error: No common type for `170126_170124_Sibren_run29-1426.caf$file_datetime` <datetime<Europe/Amsterdam>> and `170621_170522_Guido_Magda_ETH-1-0000.caf$file_datetime` <integer>.
sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Arch Linux
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/libopenblasp-r0.3.9.so
#> LAPACK: /usr/lib/liblapack.so.3.9.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] isoreader_1.1.5
#> 
#> loaded via a namespace (and not attached):
#>  [1] zip_2.0.4         Rcpp_1.0.4        pillar_1.4.3      compiler_3.6.3   
#>  [5] highr_0.8         prettyunits_1.1.1 progress_1.2.2    R.methodsS3_1.8.0
#>  [9] R.utils_2.9.2     base64enc_0.1-3   tools_3.6.3       digest_0.6.25    
#> [13] rhdf5_2.30.1      lubridate_1.7.8   evaluate_0.14     lifecycle_0.2.0  
#> [17] tibble_3.0.0      pkgconfig_2.0.3   rlang_0.4.5       openxlsx_4.1.4   
#> [21] cli_2.0.2         yaml_2.2.1        parallel_3.6.3    xfun_0.12        
#> [25] xml2_1.3.0        dplyr_0.8.5       stringr_1.4.0     knitr_1.28       
#> [29] generics_0.0.2    vctrs_0.2.4       globals_0.12.5    hms_0.5.3        
#> [33] tidyselect_1.0.0  glue_1.4.0        listenv_0.8.0     R6_2.4.1         
#> [37] fansi_0.4.1       rmarkdown_2.1     tidyr_1.0.2       Rhdf5lib_1.8.0   
#> [41] readr_1.3.1       purrr_0.3.3       magrittr_1.5      feather_0.3.5    
#> [45] codetools_0.2-16  ellipsis_0.3.0    htmltools_0.4.0   assertthat_0.2.1 
#> [49] future_1.16.0     UNF_2.0.6         utf8_1.1.4        stringi_1.4.6    
#> [53] crayon_1.3.4      R.oo_1.23.0

Created on 2020-04-08 by the reprex package (v0.3.0)

@japhir
Contributor Author

japhir commented Apr 8, 2020

Of course I should have just limited it to the two files that are actually named in the error message. That would have saved me 2 hours of unnecessary computation ;-). Anyway, here they are:
problematic_2_files.zip

included reprex run on only those two files
library(isoreader)
#> 
#> Attaching package: 'isoreader'
#> The following object is masked from 'package:stats':
#> 
#>     filter
setwd("~/Downloads")
cafs  <- iso_read_dual_inlet("problematic_2_files",
                             cache = FALSE,
                             read_cache = FALSE,
                             quiet = FALSE,
                             discard_duplicates = FALSE,
                             parallel = FALSE)
#> Info: preparing to read 2 data files...
#> Info: reading file 'problematic_2_files/170126_170124_Sibren_run29-1426.caf...
#> Info: reading file 'problematic_2_files/170621_170522_Guido_Magda_ETH-1-000...
#> Warning: caught error - no C_blocks available
#> Warning: caught error - no C_blocks available
#> Warning: Unknown or uninitialised column: `block`.
#> Warning: caught error - no C_blocks available
#> Warning: caught error - no C_blocks available
#> Warning: caught error - no C_blocks available
#> Warning: caught error - no C_blocks available
#> Warning: caught error - no C_blocks available
#> Info: finished reading 2 files in 3.38 secs
#> Info: encountered 7 problems in total
#> # A tibble: 7 x 4
#>   file_id                       type  func                      details         
#>   <chr>                         <chr> <chr>                     <chr>           
#> 1 170621_170522_Guido_Magda_ET… error extract_isodat_old_seque… no C_blocks ava…
#> 2 170621_170522_Guido_Magda_ET… error extract_isodat_datetime   no C_blocks ava…
#> 3 170621_170522_Guido_Magda_ET… error extract_MS_integration_t… no C_blocks ava…
#> 4 170621_170522_Guido_Magda_ET… error extract_caf_raw_voltage_… no C_blocks ava…
#> 5 170621_170522_Guido_Magda_ET… error extract_isodat_reference… no C_blocks ava…
#> 6 170621_170522_Guido_Magda_ET… error extract_isodat_resistors  no C_blocks ava…
#> 7 170621_170522_Guido_Magda_ET… error extract_caf_vendor_data_… no C_blocks ava…
iso_get_problems(cafs)
#> # A tibble: 7 x 4
#>   file_id                       type  func                      details         
#>   <chr>                         <chr> <chr>                     <chr>           
#> 1 170621_170522_Guido_Magda_ET… error extract_isodat_old_seque… no C_blocks ava…
#> 2 170621_170522_Guido_Magda_ET… error extract_isodat_datetime   no C_blocks ava…
#> 3 170621_170522_Guido_Magda_ET… error extract_MS_integration_t… no C_blocks ava…
#> 4 170621_170522_Guido_Magda_ET… error extract_caf_raw_voltage_… no C_blocks ava…
#> 5 170621_170522_Guido_Magda_ET… error extract_isodat_reference… no C_blocks ava…
#> 6 170621_170522_Guido_Magda_ET… error extract_isodat_resistors  no C_blocks ava…
#> 7 170621_170522_Guido_Magda_ET… error extract_caf_vendor_data_… no C_blocks ava…
iso_get_file_info(cafs)
#> Info: aggregating file info from 2 data file(s)
#> Error: No common type for `170126_170124_Sibren_run29-1426.caf$file_datetime` <datetime<Europe/Amsterdam>> and `170621_170522_Guido_Magda_ETH-1-0000.caf$file_datetime` <integer>.
sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Arch Linux
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/libopenblasp-r0.3.9.so
#> LAPACK: /usr/lib/liblapack.so.3.9.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] isoreader_1.1.5
#> 
#> loaded via a namespace (and not attached):
#>  [1] zip_2.0.4         Rcpp_1.0.4        pillar_1.4.3      compiler_3.6.3   
#>  [5] highr_0.8         prettyunits_1.1.1 progress_1.2.2    R.methodsS3_1.8.0
#>  [9] R.utils_2.9.2     base64enc_0.1-3   tools_3.6.3       digest_0.6.25    
#> [13] rhdf5_2.30.1      lubridate_1.7.8   evaluate_0.14     lifecycle_0.2.0  
#> [17] tibble_3.0.0      pkgconfig_2.0.3   rlang_0.4.5       openxlsx_4.1.4   
#> [21] cli_2.0.2         yaml_2.2.1        parallel_3.6.3    xfun_0.12        
#> [25] xml2_1.3.0        dplyr_0.8.5       stringr_1.4.0     knitr_1.28       
#> [29] generics_0.0.2    vctrs_0.2.4       globals_0.12.5    hms_0.5.3        
#> [33] tidyselect_1.0.0  glue_1.4.0        listenv_0.8.0     R6_2.4.1         
#> [37] fansi_0.4.1       rmarkdown_2.1     tidyr_1.0.2       Rhdf5lib_1.8.0   
#> [41] readr_1.3.1       purrr_0.3.3       magrittr_1.5      feather_0.3.5    
#> [45] codetools_0.2-16  ellipsis_0.3.0    htmltools_0.4.0   assertthat_0.2.1 
#> [49] future_1.16.0     UNF_2.0.6         utf8_1.1.4        stringi_1.4.6    
#> [53] crayon_1.3.4      R.oo_1.23.0

Created on 2020-04-08 by the reprex package (v0.3.0)
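
The error above says that one file carries a datetime in `file_datetime` while the other carries an integer. A rough base R sketch of how such mismatched entries could be located and given a typed missing value follows. The `file_info` list, its file names, and the timestamp are hypothetical placeholders; real iso_files objects are structured differently, and this is not how isoreader handles it internally.

```r
# Hypothetical stand-in for per-file info (names and timestamp are made up).
tz <- "Europe/Amsterdam"
file_info <- list(
  "ok.caf" = list(
    file_datetime = as.POSIXct("2017-01-01 12:00:00", tz = tz)
  ),
  "broken.caf" = list(
    file_datetime = NA_integer_  # datetime could not be extracted
  )
)

# Flag every entry whose file_datetime is not a POSIXct datetime.
is_datetime <- vapply(
  file_info,
  function(info) inherits(info$file_datetime, "POSIXct"),
  logical(1)
)
bad_files <- names(file_info)[!is_datetime]
print(bad_files)

# One possible repair: swap in a missing value of the right type,
# so that all entries can later be combined into a single column.
for (id in bad_files) {
  file_info[[id]]$file_datetime <- as.POSIXct(NA_character_, tz = tz)
}
```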

@japhir
Contributor Author

japhir commented Apr 28, 2020

This seems resolved now, also in the master branch! Read/save speeds of the raw files are back to before and I don't get errors on saving the rds! Running iso_read_dual_inlet() on the saved rds file is still slower than a plain read_rds(), however.

@sebkopf
Contributor

sebkopf commented Apr 29, 2020

Hi @japhir. The whole cache file system was actually revamped in yesterday's release (1.2.0): cache files can now be copied, have more useful file names so you know which is which, and allow skipping the data integrity checks for files that are up to date (which should make reading .rds files about as fast as a direct readRDS). You do need to regenerate your cache, but that's easy now with iso_reread_outdated_files(iso_files), and hopefully this is the last time for a long while that we need to make structural changes like this. Would love to hear if it works.

By the way, notifications about isoverse are now in a repo for this purpose, take a look: isoverse/news#2

@japhir
Contributor Author

japhir commented Apr 30, 2020

Hi @sebkopf, thanks for the notice. I've just updated to R 4.0 and the newest isoreader, but I think something must have gone wrong somewhere… re-reading the whole database took about twice as long as last time (with very few new files, as you can imagine) and while iso_get_file_info() works, it's also much slower than before. iso_get_raw_data() hasn't finished as I'm typing this...

How can I help debug this?

@sebkopf
Contributor

sebkopf commented Apr 30, 2020

Hi @japhir, can you send a small excerpt of your entire collection? Nothing has changed in iso_get_file_info() but I have not yet tested any of these in R 4.0. I think the new R does a lot more type cast checks that might make those built into iso_get_file_info() redundant (but also makes it slower in R 4.0 since they're essentially done twice). The changes with the caching should just make reading cached files and .rds files faster, not inherently the data collection. The speed of iso_get_raw_data() is mostly limited by how quick tidyverse functions work (unless you also bring in all the file info) and with the switch to vctrs some things have gotten slower :( - not sure yet whether the benefit of type cast checks in vctrs really outweighs the speed losses

@japhir
Contributor Author

japhir commented Apr 30, 2020

Just finished reading in everything. Didn't get any particular warnings on the newer did files, but got these on the caf files: (again, much slower than before).

Info: exporting data from 4928 iso_files into R Data Storage '/home/japhir/SurfDrive/PhD/programming/dataprocessing/out/cafs.di.rds'
Warning messages:
1: In max(.data$pos) : no non-missing arguments to max; returning -Inf
2: In max(.data$pos) : no non-missing arguments to max; returning -Inf
3: Unknown or uninitialised column: `block`.
4: Unknown or uninitialised column: `block`.
5: Unknown or uninitialised column: `block`.

All of the previously shared files in this thread should be good, the raw data haven't changed. How big of a subset were you thinking? I was hesitant to share many earlier, but just asked my supervisor and he says it shouldn't be a problem to share some files.

@sebkopf
Contributor

sebkopf commented May 1, 2020

that's great! I was actually thinking not the raw files since they don't cause trouble for me but just parts of the isofile collection, so something like this:

iso_files <- iso_read_dual_inlet("....rds")
# pick 100 random files from the collection
iso_files[sample(1:length(iso_files), 100)] %>% iso_save("for_seb.di.rds")

as for that .data$pos warning, could you see if you can pinpoint where it occurs with the following flags to elevate warnings to errors and not catch them?

options(warn = 2)
isoreader:::iso_turn_debug_on(catch_errors = FALSE)
iso_read_dual_inlet(....)
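
For reference, a small base R sketch of what `options(warn = 2)` does to the kind of warning reported above. The empty `pos` vector is an assumption used to reproduce the warning text; where it actually comes from inside isoreader is not known here, and `safe_max()` is a hypothetical helper, not part of the package.

```r
# An empty vector reproduces the warning text from the thread.
pos <- integer(0)

# Default behaviour: max() warns and returns -Inf.
default_result <- suppressWarnings(max(pos))

# With warn = 2 the same call raises an error that can be caught and inspected.
old_opts <- options(warn = 2)
msg <- tryCatch(
  {
    max(pos)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
options(old_opts)  # restore the previous warning level

print(msg)

# Guarding against empty input avoids the warning altogether.
safe_max <- function(x) if (length(x) > 0) max(x) else NA_integer_
```

Because the warning becomes an error, the traceback then points at the call that produced it, which is what makes this setting useful for pinpointing the source.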

@japhir
Contributor Author

japhir commented May 4, 2020

Ok @sebkopf, here's the test file with 100 standards!

for_seb.di.zip

I generated them like this:

  seb_sub <- dids %>%
    iso_filter_files(Comment == "STD")  # for standard
  # evenly spaced throughout the record, not sure if it's sorted by file_datetime though, 
  # so could still be random.
  seb_sub <- seb_sub[(floor(seq(1, length(seb_sub), length.out = 100)))] %>%  
    iso_save("out/for_seb.di.rds")
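
If the collection turns out not to be sorted by `file_datetime`, the evenly spaced pick can be made chronological by ordering first. A minimal sketch in base R, using made-up datetimes as a stand-in for the real file info (the vector `file_datetime` and its values are assumptions):

```r
# Hypothetical stand-in: 250 file datetimes in arbitrary order.
set.seed(1)
file_datetime <- as.POSIXct("2018-01-01", tz = "UTC") + sample(0:9999, 250) * 3600

# Order chronologically first, then take up to 100 evenly spaced positions.
chron_order <- order(file_datetime)
n_pick <- min(100, length(chron_order))
positions <- unique(floor(seq(1, length(chron_order), length.out = n_pick)))

# Indices into the original (unsorted) collection, in chronological order.
picked <- chron_order[positions]
```

The resulting `picked` vector could then be used to index the collection, in the same way as the `seb_sub[...]` subsetting above.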

I tried to have a look at where it's getting slow with profvis, but I don't really understand the graph so I'll leave that up to you ;-)

  library(profvis)
  library(isoreader)
  profvis({
    dids <- iso_read_dual_inlet("out/for_seb.di.rds")

    didinfo <- dids %>%
      iso_get_file_info()

    rawdata <- dids %>%
      iso_get_raw_data()
  })
output on my machine
Attaching package: 'isoreader'
The following object is masked from 'package:stats':

    filter

Info: preparing to read 1 data files (all will be cached)...
Info: reading file 'out/for_seb.di.rds' with '.di.rds' reader...
Info: loaded 100 data files from R Data Storage
Info: finished reading 1 files in 0.19 secs
Info: encountered 19 problems in total
# A tibble: 19 x 4
   file_id                type    func               details
   <chr>                  <chr>   <chr>              <chr>
 1 180223_1_Kiel Std tes… error   extract_did_raw_v… cannot locate voltage data - block 'CTwoDoublesArrayData' not fo…
 2 180223_1_Kiel Std tes… error   extract_did_vendo… cannot process vendor computed data table - block 'CDualInletEva…
 3 180517_29_RobinV_5_ET… warning iso_as_file_list   duplicate files kept but with recoded file IDs: 180517_29_RobinV…
 4 180621_47_Chris_14_ET… error   extract_did_raw_v… cannot locate voltage data - block 'CTwoDoublesArrayData' not fo…
 5 180621_47_Chris_14_ET… error   extract_did_vendo… cannot process vendor computed data table - block 'CDualInletEva…
 6 180621_47_Chris_14_ET… warning iso_as_file_list   duplicate files kept but with recoded file IDs: 180621_47_Chris_…
 7 180903_83_Cas_19_ETH-… warning iso_as_file_list   duplicate files kept but with recoded file IDs: 180903_83_Cas_19…
 8 180915_88_WuyunCas_39… warning iso_as_file_list   duplicate files kept but with recoded file IDs: 180915_88_WuyunC…
 9 180929_94_Ilja_37_ETH… warning iso_as_file_list   duplicate files kept but with recoded file IDs: 180929_94_Ilja_3…
10 190514_195_NdW_25_ETH… warning iso_as_file_list   duplicate files kept but with recoded file IDs: 190514_195_NdW_2…
11 190805_237_RvdP_5_ETH… error   extract_did_raw_v… cannot locate voltage data - block 'CTwoDoublesArrayData' not fo…
12 190805_237_RvdP_5_ETH… error   extract_did_vendo… cannot process vendor computed data table - block 'CDualInletEva…
13 191125_295_MM_16_ETH-… error   extract_did_raw_v… cannot locate voltage data - block 'CTwoDoublesArrayData' not fo…
14 191125_295_MM_16_ETH-… error   extract_did_vendo… cannot process vendor computed data table - block 'CDualInletEva…
15 200110_311_NdW_43_ETH… warning iso_as_file_list   duplicate files kept but with recoded file IDs: 200110_311_NdW_4…
16 180316_4_Std test_6_E… warning iso_as_file_list   duplicate files kept but with recoded file IDs: 180316_4_Std tes…
17 180831_83_Cas_9_ETH-3… error   extract_did_raw_v… cannot locate voltage data - block 'CTwoDoublesArrayData' not fo…
18 180831_83_Cas_9_ETH-3… error   extract_did_vendo… cannot process vendor computed data table - block 'CDualInletEva…
19 180831_83_Cas_9_ETH-3… warning iso_as_file_list   duplicate files kept but with recoded file IDs: 180831_83_Cas_9_…

Info: aggregating file info from 100 data file(s)
Info: aggregating raw data from 100 data file(s)

@japhir
Contributor Author

japhir commented May 4, 2020

regarding the debugging request: this doesn't work because of the duplicated files

  options(warn = 2)
  isoreader:::iso_turn_debug_on(catch_errors = FALSE)
  setwd("~/Documents/archive/")
  isoreader::iso_read_dual_inlet("~/Documents/archive/pacman/cafs", 
                                 discard_duplicates = FALSE)
output
Info: debug mode turned on, error catching turned off, caching turned off

Error: (converted from warning) some files from different folders have identical file names:
	~/Documents/archive/pacman/cafs/170402_Sibren_8.2 event/170402_Sibren_8(1).caf
	~/Documents/archive/pacman/cafs/170402_Sibren_8.2 event/170402_Sibren_8(2).caf
	~/Documents/archive/pacman/cafs/170402_Sibren_8.2 event/170402_Sibren_8(3).caf
	~/Documents/archive/pacman/cafs/170402_Sibren_8.2 event/170402_Sibren_8(4).caf
	~/Documents/archive/pacman/cafs/170402_Sibren_8.2 event/170402_Sibren_8(5).caf
	~/Documents/archive/pacman/cafs/170402_Sibren_8.2 event/170402_Sibren_8(6).caf
	~/Documents/archive/pacman/cafs/170402_Sibren_8.2 event/170402_Sibren_8(7).caf
	~/Documents/archive/pacman/cafs/170402_Sibren_8.2 event/170402_Sibren_8(8).caf
	~/Documents/archive/pacman/cafs/170402_Sibren_8.2 event/170402_Sibren_8(9).caf
	~/Documents/archive/pacman/cafs/170402_Sibren_8.2 event/deel2/170402_Sibren_8(1).caf
	~/Documents/archive/pacman/cafs/170402_Sibren_8.2 event/deel2/170402_Sibren_8(2).caf
	~/Documents/a
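
To see which file names clash before reading anything, the paths can be grouped by their base name. A short base R sketch, with a handful of example paths modelled on the (truncated) error message above; the shortened `cafs/...` prefixes and the selection of paths are assumptions for illustration only:

```r
# Example paths modelled on the error message (shortened, illustrative only).
paths <- c(
  "cafs/170402_Sibren_8.2 event/170402_Sibren_8(1).caf",
  "cafs/170402_Sibren_8.2 event/170402_Sibren_8(2).caf",
  "cafs/170402_Sibren_8.2 event/deel2/170402_Sibren_8(1).caf",
  "cafs/170402_Sibren_8.2 event/deel2/170402_Sibren_8(2).caf",
  "cafs/180522_Stds/180522_Std_ETH-1_1.caf"
)

file_names <- basename(paths)

# TRUE for every path whose file name occurs more than once,
# including the first occurrence.
is_dup <- file_names %in% file_names[duplicated(file_names)]

# Group the clashing paths by file name for inspection.
clashes <- split(paths[is_dup], file_names[is_dup])
print(clashes)
```

On a real archive, `paths` could come from `list.files(dir, pattern = "\\.caf$", recursive = TRUE, full.names = TRUE)`.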

sebkopf added a commit that referenced this issue Jul 5, 2020