Bigfile #14

Merged

stijnvanhoey merged 13 commits into master from bigfile on Mar 3, 2017

Conversation

@stijnvanhoey
Contributor

First draft of the big file handling tutorial in R, ready for review, based on Bart's gist.

Comment thread source/data-handling-bigfiles-R.Rmd Outdated

If you really need to read an entire csv into memory, R users typically use the `read.table` function or variations thereof (such as `read.csv`). However, `fread` from the `data.table` package is supposed to be a lot faster. Let's measure the time to read in the data using these two methods.

read.table(file.path("source", csv.name)
Member

missing closing bracket
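
For reference, a minimal sketch of the corrected call next to the `fread` alternative the text mentions; the `header`/`sep` arguments and the timing wrapper are assumptions, and `csv.name` comes from the quoted snippet:

```r
# Sketch only: compare base read.table (bracket fixed) with data.table::fread.
# csv.name and the source/ folder are taken from the quoted tutorial code.
library(data.table)

base_timing <- system.time(
  df_base <- read.table(file.path("source", csv.name),
                        header = TRUE, sep = ",")   # assumed csv layout
)

fread_timing <- system.time(
  df_fast <- fread(file.path("source", csv.name))
)

base_timing
fread_timing
```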

Comment thread source/data-handling-bigfiles-R.Rmd Outdated
WHERE device_info_serial = %d
AND date_time < '2014-07-01'
AND date_time > '2014-03-01'", serialid))
data <- dbFetch(res)
Member

use dbGetQuery() instead of the dbSendQuery() and dbFetch() combo
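
A rough sketch of that suggestion, reusing the `db` connection and `serialid` from the tutorial; the table and column names are borrowed from the dplyr example further down the thread:

```r
# dbGetQuery() sends the statement and fetches the result in one call,
# replacing the dbSendQuery()/dbFetch() pair.
library(RSQLite)

data <- dbGetQuery(db, sprintf(
  "SELECT date_time, latitude, longitude, altitude
   FROM processed_logs
   WHERE device_info_serial = %d
   AND date_time < '2014-07-01'
   AND date_time > '2014-03-01'", serialid))
```

This also saves the dbClearResult() call that the send/fetch combination needs.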

Comment thread source/data-handling-bigfiles-R.Rmd Outdated

```{r dplyr, message=FALSE, warning=FALSE}
library(dplyr)
my_db <- src_sqlite(db.name, create = F)
Member

NEVER abbreviate TRUE and FALSE
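
Spelled out, the line would read:

```r
my_db <- src_sqlite(db.name, create = FALSE)  # FALSE written in full, never F
```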

Comment thread source/data-handling-bigfiles-R.Rmd Outdated
* If you're stuck with a csv, [use `sqldf`](#Limiting-both-the-number-of-rows-and-the-number-of-columns-using-sqldf)
* If you can, use a SQLite database and query it using either [SQL queries](#Working-with-SQLite-databases) or [`dplyr`](#Interacting-with-SQLite-databases-using-dplyr)


Member

I'm missing instructions on how to convert large text files into SQLite

Contributor Author

Apparently, sqldf (the read.csv.sql function) has some difficulties with the quotes in the file (I don't have an OS-independent solution). Do you have other suggestions using sqldf? My other option would be to add a solution to the notebook using dbWriteTable (RSQLite package).

Member

I think I'd try to read the file with read.csv and then use dbWriteTable. If the file is too large, you can read it in chunks by setting skip and nrows, and repeat those steps until you reach the end of the file.
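
A hedged sketch of that chunked conversion with RSQLite's dbWriteTable; the chunk size, the `csv.name`/`db.name` objects and the processed_logs table name are assumptions borrowed from elsewhere in the thread:

```r
# Sketch: convert a large csv to SQLite in chunks of 100,000 rows.
library(RSQLite)

db <- dbConnect(SQLite(), dbname = db.name)
chunk_size <- 100000                          # arbitrary chunk size
col_names <- names(read.csv(csv.name, nrows = 1, header = TRUE))
skip <- 0

repeat {
  chunk <- tryCatch(
    read.csv(csv.name, skip = skip + 1, nrows = chunk_size, header = FALSE,
             col.names = col_names, stringsAsFactors = FALSE),
    error = function(e) NULL)                 # no lines left to read
  if (is.null(chunk) || nrow(chunk) == 0) break
  dbWriteTable(db, "processed_logs", chunk, append = TRUE)
  skip <- skip + nrow(chunk)
}

dbDisconnect(db)
```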

Comment thread source/data-handling-bigfiles-R.Rmd Outdated
bird_tracking <- tbl(my_db, "processed_logs")
results <- bird_tracking %>%
filter(device_info_serial == 860) %>%
select(date_time, latitude, longitude, altitude)
Member

use the same query as the one using plain SQL in line 207

Comment thread source/data-handling-bigfiles-R.Rmd Outdated
filter(device_info_serial == 860) %>%
select(date_time, latitude, longitude, altitude)
select(date_time, latitude, longitude, altitude) %>%
filter(date_time < "2014-07-01") %>%
Member

you can combine the filter statements into one:
filter(device_info_serial == 860, "2014-03-01" < date_time, date_time < "2014-07-01")
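
For illustration, the combined form would look roughly like this (assuming the `bird_tracking` tbl defined above):

```r
results <- bird_tracking %>%
  filter(device_info_serial == 860,
         date_time > "2014-03-01",
         date_time < "2014-07-01") %>%
  select(date_time, latitude, longitude, altitude)
```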

Contributor Author

Indeed. But I'll keep it like this to preserve the comparison with the multi-line SQL query (cf. your earlier suggestion).

Comment thread source/data-handling-bigfiles-R.Rmd Outdated
Let's try to select the rows where the device id matches a given value (e.g. 860) and the date-time is between two given timestamps. For our analysis we only need date_time, latitude, longitude and altitude, so we only select those columns.

```{r}
sqlTiming <- system.time(data <- dbGetQuery(conn = db,
Member

If you want really accurate timings for comparisons, use library(microbenchmark).
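
A rough sketch of what that could look like, reusing the `db` connection and `serialid` from the tutorial; `times = 10` is an arbitrary choice:

```r
# microbenchmark runs each expression several times and reports the
# distribution of timings, which is more robust than a single system.time().
library(microbenchmark)

microbenchmark(
  sql_query = dbGetQuery(db, sprintf(
    "SELECT date_time, latitude, longitude, altitude
     FROM processed_logs
     WHERE device_info_serial = %d
     AND date_time < '2014-07-01'
     AND date_time > '2014-03-01'", serialid)),
  times = 10
)
```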

Contributor Author

Thanks for the advice. Probably out of scope for this comparison, as the differences are clear enough.

Member

@peterdesmet peterdesmet left a comment

Reviewed and updated the file in 8c0a9bf. I would advise naming it data-handling-large-files-R.Rmd.

@peterdesmet
Member

Also, verify that the internal links (with #) work...

@stijnvanhoey stijnvanhoey merged commit a14d775 into master Mar 3, 2017
@stijnvanhoey stijnvanhoey deleted the bigfile branch March 3, 2017 11:19