Bigfile #14
Conversation
> If you really need to read an entire CSV into memory, R users by default use the `read.table` function or variations thereof (such as `read.csv`). However, `fread` from the `data.table` package is supposed to be a lot faster. Let's measure the time it takes to read in the data using these two methods.
>
> ```r
> read.table(file.path("source", csv.name))
> ```
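A sketch of that timing comparison; the file name `tracking.csv` is an assumption (the tutorial refers to it via `csv.name`):

```r
# Minimal timing sketch, assuming the data lives in "source/tracking.csv"
# (hypothetical file name).
library(data.table)

csv.file <- file.path("source", "tracking.csv")

# Base R: read.csv is a thin wrapper around read.table
base.timing <- system.time(data.base <- read.csv(csv.file))

# data.table: fread is typically much faster on large files
fread.timing <- system.time(data.fread <- fread(csv.file))

base.timing
fread.timing
```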
> ```r
>   WHERE device_info_serial = %d
>   AND date_time < '2014-07-01'
>   AND date_time > '2014-03-01'", serialid))
> data <- dbFetch(res)
> ```

Use `dbGetQuery()` instead of the `dbSendQuery()` and `dbFetch()` combo.
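A sketch of that simplification, reusing the connection `db`, the `serialid` value and the `processed_logs` table from the surrounding tutorial:

```r
# dbGetQuery() sends the statement, fetches the result and clears the
# result set in a single call; `db` and `serialid` come from the tutorial.
library(RSQLite)

data <- dbGetQuery(db, sprintf(
  "SELECT date_time, latitude, longitude, altitude
   FROM processed_logs
   WHERE device_info_serial = %d
   AND date_time < '2014-07-01'
   AND date_time > '2014-03-01'", serialid))
```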
> ```{r dplyr, message=FALSE, warning=FALSE}
> library(dplyr)
> my_db <- src_sqlite(db.name, create = F)
> ```

NEVER abbreviate TRUE and FALSE.
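The corrected call, with the logical written out in full (`db.name` comes from the tutorial):

```r
# Same connection as above, with FALSE spelled out
my_db <- src_sqlite(db.name, create = FALSE)
```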
> * If you're stuck with a CSV, [use `sqldf`](#Limiting-both-the-number-of-rows-and-the-number-of-columns-using-sqldf) (see the sketch after this list)
> * If you can, use a SQLite database and query it using either [SQL queries](#Working-with-SQLite-databases) or [`dplyr`](#Interacting-with-SQLite-databases-using-dplyr)
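A hedged sketch of the `sqldf` route; the file name is an assumption, and the columns and filter mirror the query used later in the tutorial:

```r
# read.csv.sql() runs an SQL query against a CSV without loading the whole
# file; "source/tracking.csv" is hypothetical, and `file` is sqldf's
# placeholder for the input file inside the query.
library(sqldf)

subset <- read.csv.sql(
  file.path("source", "tracking.csv"),
  sql = "SELECT date_time, latitude, longitude, altitude
         FROM file
         WHERE device_info_serial = 860")
```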
I'm missing instructions on how to convert large text files into SQLite.

Apparently, sqldf (the read.csv.sql command) has some difficulties with the quotes in the file (I don't have an OS-independent solution). Do you have other suggestions using sqldf? My other option would be to add a solution to the notebook using dbWriteTable (RSQLite package).

I think I'd try to read the file with read.csv and then use dbWriteTable. If the file is too large, you can read it in chunks by setting skip and nrows, repeating those steps until you reach the end of the file (see the sketch below).
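A sketch of that chunked import; the file name, table name and chunk size are illustrative, and `db.name` is the database name used elsewhere in the tutorial:

```r
# Chunked CSV-to-SQLite import: read `chunk.size` rows at a time with
# read.csv(skip =, nrows =) and append each chunk with dbWriteTable().
library(RSQLite)

csv.file <- file.path("source", "tracking.csv")  # hypothetical file name
db <- dbConnect(SQLite(), dbname = db.name)

chunk.size <- 100000
header <- names(read.csv(csv.file, nrows = 1))  # reuse the column names
skip <- 1  # the first data row sits right after the header line

repeat {
  chunk <- tryCatch(
    read.csv(csv.file, skip = skip, nrows = chunk.size,
             header = FALSE, col.names = header),
    error = function(e) NULL)  # reading past the end of the file errors out
  if (is.null(chunk) || nrow(chunk) == 0) break
  dbWriteTable(db, "processed_logs", chunk, append = TRUE)
  skip <- skip + nrow(chunk)
}

dbDisconnect(db)
```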
> ```r
> bird_tracking <- tbl(my_db, "processed_logs")
> results <- bird_tracking %>%
>   filter(device_info_serial == 860) %>%
>   select(date_time, latitude, longitude, altitude)
> ```

Use the same query as the one using plain SQL in line 207.
> ```r
>   filter(device_info_serial == 860) %>%
>   select(date_time, latitude, longitude, altitude) %>%
>   filter(date_time < "2014-07-01") %>%
> ```

You can combine the filter statements into one statement:

```r
filter(device_info_serial == 860, "2014-03-01" < date_time, date_time < "2014-07-01")
```

Indeed. But I'll keep it like this to keep the comparison with the multiline SQL query (cf. your earlier suggestion).
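For reference, the combined variant as a full pipeline (same result, one `filter()` call; `bird_tracking` comes from the chunk above):

```r
# One filter() with all three conditions; equivalent to the chained version
results <- bird_tracking %>%
  filter(device_info_serial == 860,
         date_time > "2014-03-01",
         date_time < "2014-07-01") %>%
  select(date_time, latitude, longitude, altitude)
```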
> Let's try to select rows where the device id matches a given value (e.g. 860) and the date-time is between two given timestamps. For our analysis we only need date_time, latitude, longitude and altitude, so we only select those.
>
> ```{r}
> sqlTiming <- system.time(data <- dbGetQuery(conn = db, sprintf(
>   "SELECT date_time, latitude, longitude, altitude
>    FROM processed_logs
>    WHERE device_info_serial = %d
>    AND date_time < '2014-07-01'
>    AND date_time > '2014-03-01'", serialid)))
> ```
If you want really good timing for comparisons, use library(microbenchmark).

Thanks for the advice. Probably out of scope for this comparison, as the differences are clear enough.
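For completeness, a hedged sketch of the microbenchmark variant (the query repeats the one above; `db` is the open connection):

```r
# microbenchmark() runs the expression several times and reports summary
# statistics, which is more reliable than a single system.time() call.
library(microbenchmark)

query <- "SELECT date_time, latitude, longitude, altitude
          FROM processed_logs
          WHERE device_info_serial = 860
          AND date_time < '2014-07-01'
          AND date_time > '2014-03-01'"

microbenchmark(dbGetQuery(db, query), times = 10)
```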
peterdesmet left a comment

Reviewed and updated the file in 8c0a9bf. I would advise naming it data-handling-large-files-R.Rmd.

Also, verify that the internal links (with #) work.
First draft, ready for review, of the big-file-handling tutorial in R, based on Bart's gist.