# Demonstration of data.table functionality

For example-based overviews of what data.table can do, see DataCamp's data.table cheat sheet (saved under **H:\LIBRARY\R references/data.table/data.table cheat sheet.pdf**), or run the built-in `example(data.table)` after loading the `data.table` library in R.

## Generate data
We'll start our demo with a simple data set that reflects as much richness as we'll need. Later, we'll generate much larger data sets to demonstrate the considerable speed advantages of data.table over data.frame methods.

Note that the data.frame "DF" and the data.table "DT" hold exactly parallel data, as displayed here. In R, their printed representations differ slightly: DT has a smarter printout which, for long tables, shows only the head and tail rows and repeats the field names at the bottom of the table for easier reference.

In [None]:
library(data.table)
set.seed(60637)
DF <- data.frame(id     = 1:12,
                 prog   = rep(c("A", "B", "None"), each = 4),
                 sch    = rep(c("North", "South"), times = 6),
                 gender = sample(c("M", "F"), 12, replace = TRUE),
                 score  = runif(12))
DT <- data.table(DF, key = "id,prog,sch,gender")
DF
DT

## General structure of working with data.table objects
Whereas data.frame contents are accessed using square brackets with references to rows and columns, i.e. ```df[*row indication*, *col indication*]```, data.table square brackets take three main arguments: ```dt[i, j, by]```, where:

* i - indication of rows
* j - indication/operation on columns (optional)
* by - grouping: built-in ability to run the j operation separately within each group of rows (optional)
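
As a minimal sketch of how the three arguments combine, the example below filters rows, computes summaries, and groups in a single call. The table `demo` and its columns `grp` and `val` are hypothetical and used only for illustration; they are not part of the demo data set built above.

```r
library(data.table)

# Hypothetical toy table, for illustration only
demo <- data.table(grp = c("a", "a", "b", "b", "b"),
                   val = c(1, 2, 3, 4, 5))

# i:  keep rows where val > 1
# j:  compute the mean of val and a row count
# by: do so separately for each value of grp
res <- demo[val > 1, .(mean_val = mean(val), n = .N), by = grp]
res
```

Reading the call left to right as "take these rows, compute this, within these groups" is a useful way to parse most data.table expressions.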

## Subsetting by rows
This works both with the familiar dt[<row subset info>,], where a comma demarcates rows vs. columns, and simply with dt[<row subset info>], with no comma.

In [None]:
DT[3:5,]

In [None]:
DT[3:5]

In [None]:
DT[sch == "North"]

In [None]:
DT[prog %in% c("A", "B")]

## Working with data.table columns

In [None]:
# Referencing columns in data.table is also a bit different than for data.frames.
# Rather than referring to columns by their names, stored as strings, you
# refer to them using the unquoted name of the column. Think of this as similar to
# e.g. the subset(), aggregate(), or within() functions, where you indicate the name
# of the data.frame and are then within the "environment" of that data.frame and
# can reference the columns directly, as if they were objects in the global
# environment.
DT[, "sch"]  # in recent data.table versions, returns a one-column data.table
DT[, sch]    # returns the column as a vector

In [None]:
# However, you can still reference columns by string using the get() function, whose (helpful!) job
# it is to take the name of an object, stored as a string, and fetch you the object itself.
DT[, get("sch")]
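
When several column names are stored in a character vector, `get()` is less convenient. The sketch below shows three alternatives that, in reasonably recent data.table versions, should all return the same columns. The table `demo` and the variable `cols` are hypothetical names chosen for illustration.

```r
library(data.table)

# Hypothetical toy table and a character vector of column names
demo <- data.table(id = 1:3,
                   sch = c("North", "South", "North"),
                   score = c(0.2, 0.5, 0.9))
cols <- c("id", "sch")

sel_with <- demo[, cols, with = FALSE]      # treat j as column names, not an expression
sel_dots <- demo[, ..cols]                  # ".." prefix: look up cols outside the table
sel_sd   <- demo[, .SD, .SDcols = cols]     # .SD restricted to the named columns
sel_with
```

The `..` prefix assumes a data.table version that supports it; `with = FALSE` is the longer-standing form.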

In [None]:
# Additionally, think of groups of columns as lists
DT[1:3, list(id, sch)]
# And note that ".()" is shorthand for "list()"
DT[1:3, .(id, sch)]

In [None]:
# (Indeed, it may be interesting to know that data.frames are constructed as a special
#  type of list, where each list element is of exactly the same length. The fact that
#  lists can have contents of totally different types is what lets data.frames
#  have columns of different data types, unlike matrices, which are also two-dimensional
#  tables but which can only hold elements of a single type.)
# For example, the following command loops through DF's columns since, as a data.frame/list, its
# base elements are the columns.
lapply(DF, function(x) class(x))

In [None]:
DT[, mean(score)]

In [None]:
DT[, .(mean(score), sum(score))]

In [None]:
DT[, .(mean_score = mean(score), sum_score = sum(score))]

(The ```:=``` operator, which performs an assignment within the data table rather than an on-the-fly calculation, is covered in its own section below.)

In [None]:
DT[1:3, .(id, score, mean_score = mean(score))]

In [None]:
# Can even run multiple commands from within the DT environment using curly braces.
# The value of the last expression is what gets returned (here, the histogram bin counts).
DT[, {print(score)
      hist(score)$counts
      }]

## Working with "by"s

In [None]:
# Can specify the "by" using a character string...
DT[, .(mean_score = mean(score)), by = "prog"]

In [None]:
# ... or as an unquoted column name within the data.table object
DT[, .(mean_score = mean(score)), by = prog]

In [None]:
# Can do multiple "by"s with a string, separating by commas
DT[, .(mean_score = mean(score)), by = "prog,sch"]

In [None]:
# Can do multiple "by"s with column object references, using list syntax (i.e. the ".()")
DT[, .(mean_score = mean(score)), by = .(prog,sch)]

In [None]:
# Just like with columns, we can generate "by" information on the fly
DT[, .(mean_score = mean(score)), by = .(any_prog = prog %in% c("A", "B"))]

In [None]:
# Subset and then perform "by" calculations
DT[sch == "North", .(mean_score = mean(score), N = .N), by = "gender"]

## Working with special values
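data.table provides special symbols for use in `j`, e.g. `.N` (the number of rows in the current group), `.I` (row numbers within the full table), `.GRP` (a counter identifying the current group), `.SD` (the subset of data for the current group), and others.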

In [None]:
# Using .N for group counts and .GRP for a running group number
DT[, .(count = .N, group = .GRP), by = "gender"]
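
To illustrate two of the other special symbols, the sketch below uses `.SD` to apply a function across several columns within each group, and `.I` to recover the original row numbers of selected rows. The table `demo` and its columns are hypothetical and for illustration only.

```r
library(data.table)

# Hypothetical toy table
demo <- data.table(grp = c("a", "b", "a", "b"),
                   x = c(1, 2, 3, 4),
                   y = c(10, 20, 30, 40))

# .SD: the per-group subset of data; .SDcols limits which columns it contains
means <- demo[, lapply(.SD, mean), by = grp, .SDcols = c("x", "y")]

# .I: row numbers in the full table; here, the row with the largest x in each group
top_rows <- demo[, .I[which.max(x)], by = grp]$V1
demo[top_rows]
```

Collecting row numbers with `.I` and then subsetting once is one common pattern for picking out a row per group.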

## Using the `:=` operator

In [None]:
# Using ":=" generates an assignment within the data table. Rather than being an on-the-fly calculation, it directly 
# modifies the underlying data
DT[, mean_score := mean(score)]
# Note that, in this case, it would be redundant to assign the output to a new object,
# e.g. don't do: newDT <- DT[, mean_score := mean(score)]
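
Because `:=` modifies the table in place, a plain assignment such as `alias <- orig` does not create an independent copy. The sketch below, using hypothetical object names, contrasts that with data.table's `copy()` function.

```r
library(data.table)

orig     <- data.table(x = 1:3)
alias    <- orig         # not a copy: both names refer to the same table
snapshot <- copy(orig)   # an independent copy

orig[, y := x * 2L]      # adds column y to the table in place

alias_has_y    <- "y" %in% names(alias)     # the alias sees the new column
snapshot_has_y <- "y" %in% names(snapshot)  # the copy is unaffected
c(alias = alias_has_y, snapshot = snapshot_has_y)
```

If you need to keep the original data untouched, take a `copy()` before using `:=`.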

In [None]:
# The ":=" operator can be used to create multiple new columns at once.
# In that case, the left-hand side is a character vector of column names
# and the right-hand side is a list of values, one element per column.
DT[, c("A", "B") := list(runif(nrow(DT)), letters[1:nrow(DT)])]
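
For completeness, here is a short sketch of two other ways to write `:=`: an unquoted name on the left-hand side for a single column, and the functional form with `name = value` pairs for several columns, here combined with `by` so each row receives its group's summary. The table `demo` and its column names are hypothetical.

```r
library(data.table)

# Hypothetical toy table
demo <- data.table(grp = c("a", "a", "b"), val = c(1, 3, 10))

# Single column: the left-hand side can be an unquoted name
demo[, double_val := val * 2]

# Several columns: functional form, computed within each group
demo[, `:=`(grp_mean = mean(val), grp_n = .N), by = grp]
demo
```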

In [None]:
# Columns can be removed by using ":=" to assign them NULL.
# Here we drop the columns created in the cells above.
DT[, mean_score := NULL]
DT[, c("A", "B") := NULL]

## Chaining
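
Chaining means appending one `[...]` call directly after another, so that each step operates on the result of the previous one without storing intermediate objects. The following is a minimal sketch on a hypothetical table `demo`, not the DT used elsewhere in this demo.

```r
library(data.table)

# Hypothetical toy table
demo <- data.table(prog = c("A", "A", "B", "B", "None"),
                   score = c(0.9, 0.7, 0.4, 0.6, 0.1))

# Step 1: mean score by prog
# Step 2: sort the result in descending order of mean_score
# Step 3: keep the first row
best <- demo[, .(mean_score = mean(score)), by = prog][order(-mean_score)][1]
best
```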



## Using data.table for data manipulations

In [None]:
# Operations that you commonly perform on data.frames will work with data.table objects too
DT$plusfive <- DT$score + 5
DT <- within(DT, {
    someProg <- ifelse(prog == "None", "Nope", "Yep")    
    fSomeProg <- factor(someProg)
})
subset(DT, id >= 6)
str(DT)

In [None]:
# Some things that don't work
DT <- within(DT, {
    male <- FALSE
    male[gender == "F"] <- TRUE
})

## Using data.table to save time

In [None]:
n <- 1e6
dfBig <- data.frame(cat = sample(LETTERS, n, replace = TRUE), x = rnorm(n))
dtBig <- data.table(dfBig)
dtBigKey <- data.table(dfBig, key = "cat")

In [None]:
system.time(aggregate(x ~ cat, data = dfBig, mean))
system.time(dtBig[, mean(x), by = "cat"])
system.time(dtBigKey[, mean(x), by = "cat"])
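
Before comparing timings, it can be worth confirming that the approaches return the same numbers. The sketch below does this on a smaller, separately generated data set; the object names (`df_chk`, `dt_chk`, `agg_df`, `agg_dt`) and the seed are assumptions made for illustration.

```r
library(data.table)

set.seed(1)
n_chk <- 1e4  # smaller than the timing data, just for a quick check
df_chk <- data.frame(cat = sample(LETTERS, n_chk, replace = TRUE), x = rnorm(n_chk))
dt_chk <- data.table(df_chk)

agg_df <- aggregate(x ~ cat, data = df_chk, mean)   # result is ordered by cat
agg_dt <- dt_chk[, .(x = mean(x)), keyby = cat]     # keyby also orders by cat

all.equal(agg_df$x, agg_dt$x)
```

Using `keyby` rather than `by` sorts the grouped result, which makes the row-by-row comparison with `aggregate()` straightforward.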