# Manipulating and Reshaping Data
___

Useful links:
* [Paper on tidy-data](http://vita.had.co.nz/papers/tidy-data.pdf)
* [Post on R-bloggers on split-apply-combine problems](http://www.r-bloggers.com/a-quick-primer-on-split-apply-combine-problems/)

## Dataset generation and loading

### A data.frame for basic manipulation

In [3]:
set.seed(13435)

In [2]:
X <- data.frame("var1"=sample(1:5),"var2"=sample(6:10),"var3"=sample(11:15))
X <- X[sample(1:5),]  # shuffle the row order
X$var2[c(1,3)] <- NA  # introduce missing values in var2
X

Unnamed: 0,var1,var2,var3
4,2,,12
1,3,9.0,13
3,1,,15
2,5,8.0,14
5,4,6.0,11


### And the standard cars dataset

In [6]:
head(mtcars)

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Mazda RX4,21.0,6,160,110,3.9,2.62,16.46,0,1,4,4
Mazda RX4 Wag,21.0,6,160,110,3.9,2.875,17.02,0,1,4,4
Datsun 710,22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
Hornet 4 Drive,21.4,6,258,110,3.08,3.215,19.44,1,0,3,1
Hornet Sportabout,18.7,8,360,175,3.15,3.44,17.02,0,0,3,2
Valiant,18.1,6,225,105,2.76,3.46,20.22,1,0,3,1


## Subsetting

The examples below use the data frame `X` generated above, whose rows are *in no particular order* and which contains *missing values*.

In [6]:
X[,1] # select the first column by position

In [7]:
X[,"var1"] # select the first column by name

In [8]:
X[1:2, "var2"] # subset by rows and columns

In [11]:
X[(X$var1 <= 3 & X$var3 > 11),]  # filter by conditions.

var1,var2,var3
2,,15
3,,12


Note that the filter condition above is really a logical vector with one value per row. Evaluating the condition on its own:

In [12]:
X$var1 <= 3 & X$var3 > 11

In other words, we've just put the above logical vector into the 'rows' position of the index.

In [17]:
X[c(TRUE, FALSE, TRUE, FALSE, FALSE),]

var1,var2,var3
2,,15
3,,12
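
The equivalence between a condition and a literal logical vector can be checked directly. The sketch below uses a small hypothetical data frame `df` (not the `X` above, whose values depend on the random seed), so the names and values are assumptions for illustration only.

```r
# Hypothetical data frame; values chosen for illustration only.
df <- data.frame(var1 = c(2, 1, 3, 5, 4),
                 var3 = c(15, 11, 12, 14, 13))

# The condition evaluates to one logical value per row.
mask <- df$var1 <= 3 & df$var3 > 11
print(mask)

# Indexing with the condition and indexing with the stored mask select the same rows.
by_condition <- df[df$var1 <= 3 & df$var3 > 11, ]
by_mask      <- df[mask, ]
print(identical(by_condition, by_mask))
```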


In [21]:
X[(X$var1 <= 3 | X$var3 <= 12), 2:3]

Unnamed: 0,var2,var3
1,,15
4,10.0,11
2,,12


We can return a vector of the indices at which the condition is true, and use that to filter the data frame.

In [31]:
which(X$var2 > 8)

In [23]:
X[which(X$var2 > 8),]

Unnamed: 0,var1,var2,var3
4,1,10,11
5,4,9,13


Which raises the question: how does this filtering deal with `NA`s?

In [25]:
X$var2 > 8

In [32]:
X

Unnamed: 0,var1,var2,var3
1,2,,15
4,1,10.0,11
2,3,,12
3,5,6.0,14
5,4,9.0,13


### Subsetting where the filter vector does not match the data.frame dimensions

In [33]:
X[X$var2 > 8,]

Unnamed: 0,var1,var2,var3
,,,
4,1.0,10.0,11.0
NA.1,,,
5,4.0,9.0,13.0


The logic above seems to be that **because the length of the filtering vector matches the number of rows in the data frame**, R goes through the filter item by item. If `TRUE`, it returns that row. If `FALSE`, it returns nothing. If `NA`, it returns a row of `NA`s.
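
As a minimal sketch of this behaviour, and of two common ways to avoid the all-`NA` rows, consider the hypothetical data frame `df` below. Its values are assumptions chosen to mimic `var2`; it is not the notebook's `X`.

```r
# Hypothetical data frame with missing values in var2.
df <- data.frame(var1 = 1:5, var2 = c(NA, 10, NA, 6, 9))

mask <- df$var2 > 8          # NA wherever var2 is NA
with_na_rows <- df[mask, ]   # each NA in the mask yields an all-NA row

# Two ways to keep only the rows where the condition is definitely TRUE:
via_which <- df[which(mask), ]           # which() returns only the TRUE positions
via_is_na <- df[!is.na(mask) & mask, ]   # turn NA into FALSE explicitly

print(with_na_rows)
print(via_which)
```

`which()` is the shorter of the two, while the `!is.na()` form keeps the filter as a logical vector that can be combined with further conditions.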

In [34]:
X[(X$var2 > 8)[1:4],]

Unnamed: 0,var1,var2,var3
,,,
4,1.0,10.0,11.0
NA.1,,,
NA.2,,,


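Here **the length of the filtering vector does NOT match the number of rows**, but the rules are the same. The four-element filter `NA, TRUE, NA, FALSE` is recycled to five elements, so the fifth element is the first one again (`NA`). `TRUE` returns the row, `FALSE` returns nothing, and each `NA` returns a row of `NA`s, which gives the four rows above.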
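
One way to read the result above is through R's recycling of short logical indices. The sketch below makes the recycling explicit with `rep_len()`; `full_mask` stands in for `X$var2 > 8`, and `df` is a hypothetical five-row data frame.

```r
full_mask  <- c(NA, TRUE, NA, FALSE, TRUE)  # stands in for X$var2 > 8
short_mask <- full_mask[1:4]                # only four elements

# rep_len() shows what the short mask looks like once stretched to five rows.
recycled <- rep_len(short_mask, 5)
print(recycled)

# Hypothetical data frame: indexing with the short mask and with the
# explicitly recycled mask picks the same values.
df <- data.frame(id = 1:5, value = c(15, 11, 12, 14, 13))
print(df[short_mask, "id"])
print(df[recycled, "id"])
```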

In [40]:
X[c(NA, NA, FALSE, TRUE, NA),]

Unnamed: 0,var1,var2,var3
,,,
NA.1,,,
3,5.0,6.0,14.0
NA.2,,,


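So by our logic: `NA` and `NA` each return a row of `NA`s, `FALSE` skips the third row and returns nothing, `TRUE` returns the fourth row (row name `3`), and the final `NA` returns another row of `NA`s.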

In [42]:
X[c(TRUE, FALSE),]

Unnamed: 0,var1,var2,var3
1,2,,15
2,3,,12
5,4,9.0,13


And if we don't supply enough values, R recycles the filter vector. So in the above example, we get every second row, starting with the first (`TRUE`).

By this logic, `c(NA, FALSE, TRUE)` should be equivalent to `c(NA, FALSE, TRUE, NA, FALSE)`, i.e. three rows consisting of the third row (row name `2`) wrapped by rows of `NA`s.

In [44]:
X[c(NA, FALSE, TRUE),]

Unnamed: 0,var1,var2,var3
,,,
2,3.0,,12.0
NA.1,,,


Gotcha.

## Sorting

In [49]:
X

Unnamed: 0,var1,var2,var3
1,2,,15
4,1,10.0,11
2,3,,12
3,5,6.0,14
5,4,9.0,13


In [45]:
sort(X$var1)

In [46]:
sort(X$var1, decreasing=TRUE)

In [47]:
sort(X$var2, decreasing=TRUE, na.last=TRUE)

For clarity: `sort` returns the sorted values themselves; `order` just returns the indices that would put the values in sorted order.
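
The difference is easiest to see on a small vector. This is a minimal sketch using a hypothetical vector `x`; note that the two functions also treat `NA` differently by default.

```r
x <- c(30, 10, NA, 20)   # hypothetical values

sorted <- sort(x)    # the values in ascending order; NA is dropped by default
idx    <- order(x)   # the positions that would sort x; NA is placed last by default

print(sorted)
print(idx)

# Indexing with the result of order() sorts the vector and keeps the NA.
print(x[idx])

# sort() keeps the NA only when na.last is set explicitly.
print(sort(x, na.last = TRUE))
```

Because `order` returns positions rather than values, it can be used to reorder the rows of a whole data frame, which `sort` cannot do.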

In [48]:
order(X$var1)

In [50]:
X[order(X$var1),]

Unnamed: 0,var1,var2,var3
4,1,10.0,11
1,2,,15
2,3,,12
5,4,9.0,13
3,5,6.0,14


In [54]:
X[order(X$var2, na.last = FALSE),]

Unnamed: 0,var1,var2,var3
1,2,,15
2,3,,12
3,5,6.0,14
5,4,9.0,13
4,1,10.0,11


We can also sort by multiple variables, although it makes no difference here: `var1` has no ties, so `var3` is never needed as a tie-breaker.
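
A second sort key only matters when the first one has ties. The sketch below uses a hypothetical data frame `df` with a repeated `group` column to show the effect.

```r
# Hypothetical data with ties in the first sort key.
df <- data.frame(group = c("b", "a", "b", "a"),
                 score = c(2, 9, 1, 4))

# Sort by group, then by score within each group.
by_both <- df[order(df$group, df$score), ]
print(by_both)

# Negating a numeric key reverses the order for that key only.
score_desc <- df[order(df$group, -df$score), ]
print(score_desc)
```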

In [56]:
X[order(X$var1, X$var3),]

Unnamed: 0,var1,var2,var3
4,1,10.0,11
1,2,,15
2,3,,12
5,4,9.0,13
3,5,6.0,14


## Adding rows and columns

In [57]:
X$var4 <- rnorm(5)
X

Unnamed: 0,var1,var2,var3,var4
1,2,,15,0.187596
4,1,10.0,11,1.7869764
2,3,,12,0.4966936
3,5,6.0,14,0.063183
5,4,9.0,13,-0.5361329


In [58]:
Y <- cbind(X, rnorm(5))
Y

Unnamed: 0,var1,var2,var3,var4,rnorm(5)
1,2,,15,0.187596,0.6257849
4,1,10.0,11,1.7869764,-2.4508375
2,3,,12,0.4966936,0.08909424
3,5,6.0,14,0.063183,0.4783857
5,4,9.0,13,-0.5361329,1.00053336


What if you try to add a column with a different number of rows?

In [59]:
Y <- cbind(X, rnorm(4))

ERROR: Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 5, 4
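
If a shorter column really does need to be attached, one option is to pad it with `NA` first. The following is a sketch under that assumption, using a hypothetical data frame `df` and vector `short`; whether padding is appropriate depends on what the missing value means.

```r
df    <- data.frame(var1 = 1:5)      # hypothetical 5-row data frame
short <- c(0.1, 0.2, 0.3, 0.4)       # only 4 values

# Capture the error message instead of stopping the script.
result <- tryCatch(cbind(df, short),
                   error = function(e) conditionMessage(e))
print(result)

# Extending the length of a vector pads it with NA.
padded <- short
length(padded) <- nrow(df)

ok <- cbind(df, var5 = padded)
print(ok)
```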


## Reshaping data
Remember, the general principles we want are:
* Each variable in a column.
* Each observation in a row.
* Each table/file stores data about one kind of observation.

In [7]:
library(reshape2)

In [8]:
head(mtcars)

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Mazda RX4,21.0,6,160,110,3.9,2.62,16.46,0,1,4,4
Mazda RX4 Wag,21.0,6,160,110,3.9,2.875,17.02,0,1,4,4
Datsun 710,22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
Hornet 4 Drive,21.4,6,258,110,3.08,3.215,19.44,1,0,3,1
Hornet Sportabout,18.7,8,360,175,3.15,3.44,17.02,0,0,3,2
Valiant,18.1,6,225,105,2.76,3.46,20.22,1,0,3,1


In [11]:
str(mtcars)

'data.frame':	32 obs. of  12 variables:
 $ mpg    : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl    : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp   : num  160 160 108 258 360 ...
 $ hp     : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat   : num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt     : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec   : num  16.5 17 18.6 19.4 17 ...
 $ vs     : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am     : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear   : num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb   : num  4 4 1 1 2 1 4 2 2 4 ...
 $ carname: chr  "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...


In [12]:
summary(mtcars)

      mpg             cyl             disp             hp       
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
 Median :19.20   Median :6.000   Median :196.3   Median :123.0  
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
      drat             wt             qsec             vs        
 Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
 1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
 Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
 Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
 3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
 Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
       am              gear            carb         carname         
 Min.   :0.000

### Melting data frames

`melt` needs an identifier for each car, so we first copy the row names into a `carname` column. The `str` and `summary` calls above were run after this step, which is why they already show `carname`.

In [9]:
mtcars$carname <- rownames(mtcars)

In [10]:
mtcars$carname

In [13]:
carMelt <- melt(mtcars, id.vars=c("carname", "gear", "cyl"), measure.vars=c("mpg", "hp"))

In [14]:
head(carMelt, n=3)

carname,gear,cyl,variable,value
Mazda RX4,4,6,mpg,21.0
Mazda RX4 Wag,4,6,mpg,21.0
Datsun 710,4,4,mpg,22.8


In [15]:
tail(carMelt, n=3)

Unnamed: 0,carname,gear,cyl,variable,value
62,Ferrari Dino,5,6,hp,175
63,Maserati Bora,5,8,hp,335
64,Volvo 142E,4,4,hp,109


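We can see that in each row the `variable` column identifies which measured variable (`mpg` or `hp`) the row holds, and the measurement itself is placed under the `value` column. The id columns (`carname`, `gear`, `cyl`) are repeated for every measured variable, so the 32 cars become 64 rows.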
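
To make the long format concrete, the sketch below rebuilds the same shape by hand in base R for a hypothetical two-car data frame. It is only an illustration of the layout that `melt` produces, not how `reshape2` implements it.

```r
# Hypothetical two-car data frame in "wide" form: one column per measurement.
wide <- data.frame(carname = c("car A", "car B"),
                   cyl     = c(4, 6),
                   mpg     = c(30, 20),
                   hp      = c(70, 110))

ids <- wide[c("carname", "cyl")]   # identifier columns, repeated for each measure

# Stack one block of rows per measured variable.
long <- rbind(
  data.frame(ids, variable = "mpg", value = wide$mpg),
  data.frame(ids, variable = "hp",  value = wide$hp)
)
print(long)
```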

We can re-cast the dataset with the `dcast` function.

In [21]:
cylData <- dcast(carMelt, cyl ~ variable)  # By default, this returns the length (i.e. number of observations)
cylData

Aggregation function missing: defaulting to length


cyl,mpg,hp
4,11,11
6,7,7
8,14,14


In [19]:
cylData <- dcast(carMelt, cyl ~ variable, mean)
cylData

cyl,mpg,hp
4,26.66364,82.63636
6,19.74286,122.28571
8,15.1,209.21429
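
For comparison, the same per-cylinder summaries can be computed from the original wide data in base R. This is an alternative sketch using `aggregate` and `table` on the built-in `mtcars`, not the `reshape2` approach used above.

```r
# Mean mpg and hp for each number of cylinders.
cyl_means <- aggregate(cbind(mpg, hp) ~ cyl, data = mtcars, FUN = mean)
print(cyl_means)

# Number of cars per cylinder count, matching the default dcast (length) result.
cyl_counts <- table(mtcars$cyl)
print(cyl_counts)
```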


### Summing values by group

In [22]:
head(InsectSprays)

count,spray
10,A
7,A
20,A
14,A
14,A
12,A


In [23]:
str(InsectSprays)

'data.frame':	72 obs. of  2 variables:
 $ count: num  10 7 20 14 14 12 10 23 17 20 ...
 $ spray: Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...


Here's one approach with `tapply`...

In [27]:
tapply(InsectSprays$count, InsectSprays$spray, sum)

Or another, with split-apply-combine.

In [28]:
split.Insect <- split(InsectSprays$count, InsectSprays$spray)
split.Insect

In [30]:
apply.Insect <- lapply(split.Insect, sum)
apply.Insect

In [33]:
combine.Insect <- unlist(apply.Insect)
combine.Insect

Alternatively, `sapply` covers the apply and combine steps in a single call.

In [36]:
?sapply

`lapply {base}` (R Documentation), arguments:

* `X`: a vector (atomic or list) or an expression object. Other objects (including classed objects) will be coerced by `base::as.list`.
* `FUN`: the function to be applied to each element of `X`: see ‘Details’. In the case of functions like `+`, `%*%`, the function name must be backquoted or quoted.
* `...`: optional arguments to `FUN`.
* `simplify`: logical or character string; should the result be simplified to a vector, matrix or higher dimensional array if possible? For `sapply` it must be named and not abbreviated. The default value, `TRUE`, returns a vector or matrix if appropriate, whereas if `simplify = "array"` the result may be an array of “rank” (`=length(dim(.))`) one higher than the result of `FUN(X[[i]])`.
* `USE.NAMES`: logical; if `TRUE` and if `X` is character, use `X` as names for the result unless it had names already. Since this argument follows `...` its name cannot be abbreviated.
* `FUN.VALUE`: a (generalized) vector; a template for the return value from `FUN`. See ‘Details’.
* `n`: integer: the number of replications.
* `expr`: the expression (a language object, usually a call) to evaluate repeatedly.
* `x`: a list, typically returned from `lapply()`.
* `higher`: logical; if true, `simplify2array()` will produce a (“higher rank”) array when appropriate, whereas `higher = FALSE` would return a matrix (or vector) only. These two cases correspond to `sapply(*, simplify = "array")` or `simplify = TRUE`, respectively.


In [35]:
sapply(split.Insect, sum)
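
The three routes can be compared side by side. The sketch below repeats the steps on the built-in `InsectSprays` data and adds `vapply`, a stricter relative of `sapply` that checks each result against a template; the variable names are chosen here for illustration.

```r
by_spray <- split(InsectSprays$count, InsectSprays$spray)   # split

step_by_step <- unlist(lapply(by_spray, sum))   # apply, then combine
one_call     <- sapply(by_spray, sum)           # apply and combine together

# vapply() requires every result to be a single numeric value.
checked <- vapply(by_spray, sum, FUN.VALUE = numeric(1))

print(one_call)
print(all.equal(step_by_step, one_call))
print(all.equal(one_call, checked))
```

`vapply` is often preferred inside functions because the shape of its result does not depend on the input, whereas `sapply` may return a list when simplification is not possible.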