Bigfile #14

Merged

stijnvanhoey merged 13 commits into master from bigfile on Mar 3, 2017

Conversation

@stijnvanhoey
Contributor

First draft of the big file handling tutorial in R, ready for review, based on Bart's gist.

Comment thread source/data-handling-bigfiles-R.Rmd Outdated

If you really need to read an entire csv into memory, R users typically use the `read.table` function or variations thereof (such as `read.csv`). However, `fread` from the `data.table` package is supposed to be a lot faster. Let's measure the time to read in the data using these two methods.

read.table(file.path("source", csv.name)
Member

missing closing bracket
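
For reference, a minimal sketch of the corrected call next to the `fread` alternative the text mentions; the `header`/`sep` arguments and the timing wrapper are assumptions, and `csv.name` comes from the quoted snippet:

```r
# Sketch only: compare base read.table (bracket fixed) with data.table::fread.
# csv.name and the source/ folder are taken from the quoted tutorial code.
library(data.table)

base_timing <- system.time(
  df_base <- read.table(file.path("source", csv.name),
                        header = TRUE, sep = ",")   # assumed csv layout
)

fread_timing <- system.time(
  df_fast <- fread(file.path("source", csv.name))
)

base_timing
fread_timing
```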

Comment thread source/data-handling-bigfiles-R.Rmd Outdated
WHERE device_info_serial = %d
AND date_time < '2014-07-01'
AND date_time > '2014-03-01'", serialid))
data <- dbFetch(res)
Member

use dbGetQuery() instead of the dbSendQuery() and dbFetch() combo
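
A rough sketch of that suggestion, reusing the `db` connection and `serialid` from the tutorial; the table and column names are borrowed from the dplyr example further down the thread:

```r
# dbGetQuery() sends the statement and fetches the result in one call,
# replacing the dbSendQuery()/dbFetch() pair.
library(RSQLite)

data <- dbGetQuery(db, sprintf(
  "SELECT date_time, latitude, longitude, altitude
   FROM processed_logs
   WHERE device_info_serial = %d
   AND date_time < '2014-07-01'
   AND date_time > '2014-03-01'", serialid))
```

This also saves the dbClearResult() call that the send/fetch combination needs.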

Comment thread source/data-handling-bigfiles-R.Rmd Outdated

```{r dplyr, message=FALSE, warning=FALSE}
library(dplyr)
my_db <- src_sqlite(db.name, create = F)
Member

NEVER abbreviate TRUE and FALSE
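
Spelled out, the line would read:

```r
my_db <- src_sqlite(db.name, create = FALSE)  # FALSE written in full, never F
```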

Comment thread source/data-handling-bigfiles-R.Rmd Outdated
* If you're stuck with a csv, [use `sqldf`](#Limiting-both-the-number-of-rows-and-the-number-of-columns-using-sqldf)
* If you can, use a SQLite database and query it using either [SQL queries](#Working-with-SQLite-databases) or [`dplyr`](#Interacting-with-SQLite-databases-using-dplyr)


Member

I'm missing instructions on how to convert large text files into SQLite

Contributor Author

Apparently, sqldf (the read.csv.sql function) has some difficulties with the quotes in the file (I don't have an OS-independent solution). Do you have other suggestions using sqldf? My other option would be to add a solution to the notebook using dbWriteTable (RSQLite package).

Member

I think I'd try to read the file with read.csv and then use dbWriteTable. If the file is too large, you can read it in chunks by setting skip and nrows, and repeat those steps until you reach the end of the file.
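
A hedged sketch of that chunked conversion with RSQLite's dbWriteTable; the chunk size, the `csv.name`/`db.name` objects and the processed_logs table name are assumptions borrowed from elsewhere in the thread:

```r
# Sketch: convert a large csv to SQLite in chunks of 100,000 rows.
library(RSQLite)

db <- dbConnect(SQLite(), dbname = db.name)
chunk_size <- 100000                          # arbitrary chunk size
col_names <- names(read.csv(csv.name, nrows = 1, header = TRUE))
skip <- 0

repeat {
  chunk <- tryCatch(
    read.csv(csv.name, skip = skip + 1, nrows = chunk_size, header = FALSE,
             col.names = col_names, stringsAsFactors = FALSE),
    error = function(e) NULL)                 # no lines left to read
  if (is.null(chunk) || nrow(chunk) == 0) break
  dbWriteTable(db, "processed_logs", chunk, append = TRUE)
  skip <- skip + nrow(chunk)
}

dbDisconnect(db)
```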

Comment thread source/data-handling-bigfiles-R.Rmd Outdated
bird_tracking <- tbl(my_db, "processed_logs")
results <- bird_tracking %>%
filter(device_info_serial == 860) %>%
select(date_time, latitude, longitude, altitude)
Member

use the same query as the one using plain SQL in line 207

Comment thread source/data-handling-bigfiles-R.Rmd Outdated
filter(device_info_serial == 860) %>%
select(date_time, latitude, longitude, altitude)
select(date_time, latitude, longitude, altitude) %>%
filter(date_time < "2014-07-01") %>%
Member

you can combine the filter statements into one:
filter(device_info_serial == 860, "2014-03-01" < date_time, date_time < "2014-07-01")
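
For illustration, the combined form would look roughly like this (assuming the `bird_tracking` tbl defined above):

```r
results <- bird_tracking %>%
  filter(device_info_serial == 860,
         date_time > "2014-03-01",
         date_time < "2014-07-01") %>%
  select(date_time, latitude, longitude, altitude)
```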

Contributor Author

Indeed. But I'll keep it like this to preserve the comparison with the multi-line SQL query (cf. your earlier suggestion).

Comment thread source/data-handling-bigfiles-R.Rmd Outdated
Let's try to select the rows where the device id matches a given value (e.g. 860) and the date-time is between two given timestamps. For our analysis we only need date_time, latitude, longitude and altitude, so we only select those columns.

```{r}
sqlTiming <- system.time(data <- dbGetQuery(conn = db,
Member

If you want really accurate timings for comparisons, use library(microbenchmark).
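
A rough sketch of what that could look like, reusing the `db` connection and `serialid` from the tutorial; `times = 10` is an arbitrary choice:

```r
# microbenchmark runs each expression several times and reports the
# distribution of timings, which is more robust than a single system.time().
library(microbenchmark)

microbenchmark(
  sql_query = dbGetQuery(db, sprintf(
    "SELECT date_time, latitude, longitude, altitude
     FROM processed_logs
     WHERE device_info_serial = %d
     AND date_time < '2014-07-01'
     AND date_time > '2014-03-01'", serialid)),
  times = 10
)
```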

Contributor Author

Thanks for the advice. Probably out of scope for this comparison, as the differences are clear enough.

Member

@peterdesmet peterdesmet left a comment

Reviewed and updated the file in 8c0a9bf. I would advise naming it data-handling-large-files-R.Rmd.

@peterdesmet
Member

Also, verify that the internal links (with #) work...

@stijnvanhoey stijnvanhoey merged commit a14d775 into master Mar 3, 2017
@stijnvanhoey stijnvanhoey deleted the bigfile branch March 3, 2017 11:19