# Manipulating Data
___

## Dataset generation

### A data.frame for basic manipulation

In [61]:
set.seed(13435)

In [62]:
X <- data.frame("var1"=sample(1:5),"var2"=sample(6:10),"var3"=sample(11:15))
X <- X[sample(1:5),]
X$var2[c(1,3)] <- NA
X

|  | var1 | var2 | var3 |
|---|---|---|---|
| 1 | 2 | NA | 15 |
| 4 | 1 | 10 | 11 |
| 2 | 3 | NA | 12 |
| 3 | 5 | 6 | 14 |
| 5 | 4 | 9 | 13 |
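
R 3.6.0 changed the default algorithm behind `sample()`, so the seed above may not reproduce this exact table on newer versions. As a minimal sketch, the same data frame can be built explicitly from the values shown (the row labels are copied from the printed output):

```r
# Rebuild the example data frame directly from the printed table,
# so the rest of the walkthrough does not depend on the RNG.
X <- data.frame(
  var1 = c(2L, 1L, 3L, 5L, 4L),
  var2 = c(NA, 10L, NA, 6L, 9L),
  var3 = c(15L, 11L, 12L, 14L, 13L),
  row.names = c("1", "4", "2", "3", "5")
)

# Count the missing values per column.
na_counts <- colSums(is.na(X))
print(na_counts)
```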


## Subsetting

The dataset generated above is deliberately *unordered* and contains *missing values*, which makes it a useful test case for subsetting.

In [6]:
X[,1] # select the first column by position

In [7]:
X[,"var1"] # select the first column by name

In [8]:
X[1:2, "var2"] # subset by rows and columns
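
Selecting a single column this way returns a plain vector rather than a data.frame. A short sketch of the difference, assuming the data frame is rebuilt from the table shown earlier:

```r
# Example data, copied from the printed table.
X <- data.frame(
  var1 = c(2L, 1L, 3L, 5L, 4L),
  var2 = c(NA, 10L, NA, 6L, 9L),
  var3 = c(15L, 11L, 12L, 14L, 13L),
  row.names = c("1", "4", "2", "3", "5")
)

col_vec <- X[, "var1"]                # plain integer vector
col_df  <- X[, "var1", drop = FALSE]  # one-column data.frame

print(is.data.frame(col_vec))
print(is.data.frame(col_df))
```

Passing `drop = FALSE` is the usual way to keep the data.frame structure when only one column is selected.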

In [11]:
X[(X$var1 <= 3 & X$var3 > 11),]  # filter by conditions.

|  | var1 | var2 | var3 |
|---|---|---|---|
| 1 | 2 | NA | 15 |
| 2 | 3 | NA | 12 |


Note that the conditional filter above is really a filter against a logical vector with one element per row. Evaluating the condition on its own:

In [12]:
X$var1 <= 3 & X$var3 > 11

In other words, we have just put that logical vector (`TRUE FALSE TRUE FALSE FALSE`) into the row position of the subset.

In [17]:
X[c(TRUE, FALSE, TRUE, FALSE, FALSE),]

|  | var1 | var2 | var3 |
|---|---|---|---|
| 1 | 2 | NA | 15 |
| 2 | 3 | NA | 12 |
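
The equivalence between a condition and a hand-written logical vector can be checked directly. This is a small sketch, assuming the data frame is rebuilt from the table shown earlier; `mask` is just an illustrative name:

```r
# Example data, copied from the printed table.
X <- data.frame(
  var1 = c(2L, 1L, 3L, 5L, 4L),
  var2 = c(NA, 10L, NA, 6L, 9L),
  var3 = c(15L, 11L, 12L, 14L, 13L),
  row.names = c("1", "4", "2", "3", "5")
)

# The condition evaluates to one logical value per row.
mask <- X$var1 <= 3 & X$var3 > 11
print(mask)

# Subsetting by the condition and by the literal vector gives the same rows.
same_rows <- identical(
  rownames(X[mask, ]),
  rownames(X[c(TRUE, FALSE, TRUE, FALSE, FALSE), ])
)
print(same_rows)
```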


In [21]:
X[(X$var1 <= 3 | X$var3 <= 12), 2:3]

|  | var2 | var3 |
|---|---|---|
| 1 | NA | 15 |
| 4 | 10 | 11 |
| 2 | NA | 12 |


We can also use `which` to return a vector of the indices at which the condition is true, and use that to filter the data.frame.

In [31]:
which(X$var2 > 8)

In [23]:
X[which(X$var2 > 8),]

|  | var1 | var2 | var3 |
|---|---|---|---|
| 4 | 1 | 10 | 11 |
| 5 | 4 | 9 | 13 |
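
A short sketch of what `which` does with the missing values, assuming the data frame is rebuilt from the table shown earlier. It returns only the positions where the condition is `TRUE`, so `NA` entries are left out:

```r
# Example data, copied from the printed table.
X <- data.frame(
  var1 = c(2L, 1L, 3L, 5L, 4L),
  var2 = c(NA, 10L, NA, 6L, 9L),
  var3 = c(15L, 11L, 12L, 14L, 13L),
  row.names = c("1", "4", "2", "3", "5")
)

cond <- X$var2 > 8   # contains NA wherever var2 is NA
idx  <- which(cond)  # integer positions of the TRUE entries only

print(cond)
print(idx)
print(X[idx, ])
```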


Which raises the question: how does this filtering deal with `NA`s? `which` silently dropped them, but the raw comparison keeps them.

In [25]:
X$var2 > 8

In [32]:
X

|  | var1 | var2 | var3 |
|---|---|---|---|
| 1 | 2 | NA | 15 |
| 4 | 1 | 10 | 11 |
| 2 | 3 | NA | 12 |
| 3 | 5 | 6 | 14 |
| 5 | 4 | 9 | 13 |


### Subsetting where the filter vector's length does not match the number of rows

In [33]:
X[X$var2 > 8,]

|  | var1 | var2 | var3 |
|---|---|---|---|
| NA | NA | NA | NA |
| 4 | 1 | 10 | 11 |
| NA.1 | NA | NA | NA |
| 5 | 4 | 9 | 13 |


The comparison `X$var2 > 8` evaluates to `NA TRUE NA FALSE TRUE`, and R goes through this filter item by item. If `TRUE`, return that row. If `FALSE`, return nothing. If `NA`, return a row of `NA`s. Here **the length of the filtering vector matches the number of rows in the data.frame**, so each element maps to exactly one row.
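
To avoid the all-`NA` rows, the `NA` entries in the filter can be treated as `FALSE`. The sketch below shows two common ways, assuming the data frame is rebuilt from the table shown earlier; the variable names are illustrative:

```r
# Example data, copied from the printed table.
X <- data.frame(
  var1 = c(2L, 1L, 3L, 5L, 4L),
  var2 = c(NA, 10L, NA, 6L, 9L),
  var3 = c(15L, 11L, 12L, 14L, 13L),
  row.names = c("1", "4", "2", "3", "5")
)

cond <- X$var2 > 8

# Raw logical subsetting keeps one all-NA row per NA in the filter.
with_na <- X[cond, ]

# Option 1: explicitly turn NA into FALSE before subsetting.
clean_mask <- X[!is.na(cond) & cond, ]

# Option 2: subset() treats NA in the condition as FALSE.
clean_subset <- subset(X, var2 > 8)

print(rownames(with_na))
print(rownames(clean_mask))
print(rownames(clean_subset))
```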

In [34]:
X[(X$var2 > 8)[1:4],]

|  | var1 | var2 | var3 |
|---|---|---|---|
| NA | NA | NA | NA |
| 4 | 1 | 10 | 11 |
| NA.1 | NA | NA | NA |
| NA.2 | NA | NA | NA |


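Here **the length of the filtering vector does NOT match the number of rows in the data.frame**, but the rule itself does not change: `TRUE` returns the row, `FALSE` returns nothing, and `NA` returns a row of `NA`s. The difference is that R recycles the four-element filter `NA TRUE NA FALSE` to cover all five rows, so the fifth row reuses the first element (`NA`) and produces a third row of `NA`s.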
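
The recycling can be made visible with `rep_len`, which extends a vector to a given length by repeating it. This is a sketch under the assumption that the data frame is rebuilt from the table shown earlier; `short` and `full` are illustrative names:

```r
# Example data, copied from the printed table.
X <- data.frame(
  var1 = c(2L, 1L, 3L, 5L, 4L),
  var2 = c(NA, 10L, NA, 6L, 9L),
  var3 = c(15L, 11L, 12L, 14L, 13L),
  row.names = c("1", "4", "2", "3", "5")
)

short <- (X$var2 > 8)[1:4]        # four elements for five rows
full  <- rep_len(short, nrow(X))  # the filter as R effectively applies it

print(short)
print(full)

# Both filters select the same rows.
print(rownames(X[short, ]))
print(rownames(X[full, ]))
```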

In [40]:
X[c(NA, NA, FALSE, TRUE, NA),]

|  | var1 | var2 | var3 |
|---|---|---|---|
| NA | NA | NA | NA |
| NA.1 | NA | NA | NA |
| 3 | 5 | 6 | 14 |
| NA.2 | NA | NA | NA |


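So by our logic: the first `NA` gives a row of `NA`s, the second `NA` gives another, `FALSE` skips the third row and returns nothing, `TRUE` returns the fourth row (labelled `3`), and the final `NA` gives a last row of `NA`s.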

In [42]:
X[c(TRUE, FALSE),]

|  | var1 | var2 | var3 |
|---|---|---|---|
| 1 | 2 | NA | 15 |
| 2 | 3 | NA | 12 |
| 5 | 4 | 9 | 13 |


And if we don't supply enough values, R recycles the filter vector. So in the above example, we get every second row, starting with the first.

By this logic, `c(NA, FALSE, TRUE)` should be equivalent to `c(NA, FALSE, TRUE, NA, FALSE)`, i.e. three rows consisting of the third row (labelled `2`) wrapped by rows of `NA`s.

In [44]:
X[c(NA, FALSE, TRUE),]

|  | var1 | var2 | var3 |
|---|---|---|---|
| NA | NA | NA | NA |
| 2 | 3 | NA | 12 |
| NA.1 | NA | NA | NA |


Gotcha.
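
Since silent recycling and `NA` rows are easy to miss, one option is to wrap the subsetting in a small guard. The helper below is hypothetical (`filter_rows` is not part of base R) and is only a sketch of the idea: it refuses filters of the wrong length and treats `NA` as `FALSE`.

```r
# Example data, copied from the printed table.
X <- data.frame(
  var1 = c(2L, 1L, 3L, 5L, 4L),
  var2 = c(NA, 10L, NA, 6L, 9L),
  var3 = c(15L, 11L, 12L, 14L, 13L),
  row.names = c("1", "4", "2", "3", "5")
)

# Hypothetical helper: strict logical row filter.
filter_rows <- function(df, mask) {
  # Fail early instead of letting R recycle a short filter.
  stopifnot(is.logical(mask), length(mask) == nrow(df))
  # Treat NA as FALSE so no all-NA rows are produced.
  df[!is.na(mask) & mask, , drop = FALSE]
}

kept <- filter_rows(X, X$var2 > 8)
print(kept)

# A filter of the wrong length now raises an error instead of recycling.
res <- tryCatch(
  filter_rows(X, c(NA, FALSE, TRUE)),
  error = function(e) "length mismatch"
)
print(res)
```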

## Sorting

In [49]:
X

|  | var1 | var2 | var3 |
|---|---|---|---|
| 1 | 2 | NA | 15 |
| 4 | 1 | 10 | 11 |
| 2 | 3 | NA | 12 |
| 3 | 5 | 6 | 14 |
| 5 | 4 | 9 | 13 |


In [45]:
sort(X$var1)

In [46]:
sort(X$var1, decreasing=TRUE)

In [47]:
sort(X$var2, decreasing=TRUE, na.last=TRUE)

For clarity: `sort` returns the sorted values themselves; `order` returns the indices that would put the values in sorted order, which is why `order` can be used to rearrange the rows of a data.frame.
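
A short sketch of the relationship between the two, using plain vectors with the same values as `var3` and `var2`. Note that the defaults differ for missing values: `sort` drops `NA`s unless `na.last` is set, while `order` places them last.

```r
v <- c(15L, 11L, 12L, 14L, 13L)  # same values as X$var3
o <- order(v)                    # indices that sort v

print(o)
print(identical(v[o], sort(v)))  # indexing by order() reproduces sort()

w <- c(NA, 10L, NA, 6L, 9L)      # same values as X$var2
sorted_w <- sort(w)              # NAs are dropped by default
order_w  <- order(w)             # NAs are placed last by default

print(sorted_w)
print(order_w)
```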

In [48]:
order(X$var1)

In [50]:
X[order(X$var1),]

|  | var1 | var2 | var3 |
|---|---|---|---|
| 4 | 1 | 10 | 11 |
| 1 | 2 | NA | 15 |
| 2 | 3 | NA | 12 |
| 5 | 4 | 9 | 13 |
| 3 | 5 | 6 | 14 |


In [54]:
X[order(X$var2, na.last = FALSE),]

|  | var1 | var2 | var3 |
|---|---|---|---|
| 1 | 2 | NA | 15 |
| 2 | 3 | NA | 12 |
| 3 | 5 | 6 | 14 |
| 5 | 4 | 9 | 13 |
| 4 | 1 | 10 | 11 |


We can also sort by multiple variables, although it makes no difference here: `var1` has no ties, so `var3` is never needed to break one.

In [56]:
X[order(X$var1, X$var3),]

|  | var1 | var2 | var3 |
|---|---|---|---|
| 4 | 1 | 10 | 11 |
| 1 | 2 | NA | 15 |
| 2 | 3 | NA | 12 |
| 5 | 4 | 9 | 13 |
| 3 | 5 | 6 | 14 |
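
To see the second sort key matter, the first key needs ties. The data frame `D` below is a hypothetical example made up for illustration; it is not part of the dataset above.

```r
# Hypothetical data with ties in the first key `g`.
D <- data.frame(g = c(2, 1, 2, 1), v = c(10, 40, 30, 20))

# Sort by g, breaking ties by ascending v.
o_asc <- order(D$g, D$v)

# Sort by g, breaking ties by descending v (negate the numeric key).
o_desc <- order(D$g, -D$v)

print(D[o_asc, ])
print(D[o_desc, ])
```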


## Adding rows and columns

In [57]:
X$var4 <- rnorm(5)
X

|  | var1 | var2 | var3 | var4 |
|---|---|---|---|---|
| 1 | 2 | NA | 15 | 0.187596 |
| 4 | 1 | 10 | 11 | 1.7869764 |
| 2 | 3 | NA | 12 | 0.4966936 |
| 3 | 5 | 6 | 14 | 0.063183 |
| 5 | 4 | 9 | 13 | -0.5361329 |


In [58]:
Y <- cbind(X, rnorm(5))
Y

|  | var1 | var2 | var3 | var4 | rnorm(5) |
|---|---|---|---|---|---|
| 1 | 2 | NA | 15 | 0.187596 | 0.6257849 |
| 4 | 1 | 10 | 11 | 1.7869764 | -2.4508375 |
| 2 | 3 | NA | 12 | 0.4966936 | 0.08909424 |
| 3 | 5 | 6 | 14 | 0.063183 | 0.4783857 |
| 5 | 4 | 9 | 13 | -0.5361329 | 1.00053336 |
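
The new column above is named after the expression that produced it. As a sketch, a name can be supplied in the `cbind` call, and rows can be appended with `rbind` in a similar way. The column name `var5`, its values, and the extra row are hypothetical, and the data frame is rebuilt from the first table (without `var4`) to keep the example self-contained.

```r
# Example data, copied from the first printed table.
X <- data.frame(
  var1 = c(2L, 1L, 3L, 5L, 4L),
  var2 = c(NA, 10L, NA, 6L, 9L),
  var3 = c(15L, 11L, 12L, 14L, 13L),
  row.names = c("1", "4", "2", "3", "5")
)

# Naming the argument gives the new column a usable name.
Y <- cbind(X, var5 = c(0.5, -1.2, 0.3, 2.0, -0.7))
print(names(Y))

# rbind() appends rows; the new row must use the same column names.
Z <- rbind(X, data.frame(var1 = 6L, var2 = 7L, var3 = 16L))
print(nrow(Z))
```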


What if you try to add a column with a different number of rows?

In [59]:
Y <- cbind(X, rnorm(4))

ERROR: Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 5, 4
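
One way to handle this, sketched below, is to catch the error and pad the shorter vector with `NA` before binding. The names `short`, `padded`, and `extra` are illustrative, and the data frame is rebuilt from the first table to keep the example self-contained.

```r
# Example data, copied from the first printed table.
X <- data.frame(
  var1 = c(2L, 1L, 3L, 5L, 4L),
  var2 = c(NA, 10L, NA, 6L, 9L),
  var3 = c(15L, 11L, 12L, 14L, 13L),
  row.names = c("1", "4", "2", "3", "5")
)

short <- c(0.1, 0.2, 0.3, 0.4)  # one value too few

# Capture the error message instead of stopping the script.
msg <- tryCatch(
  {
    cbind(X, short)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# Pad with NA so the lengths agree, then bind.
padded <- c(short, rep(NA, nrow(X) - length(short)))
Y <- cbind(X, extra = padded)
print(Y)
```

A vector whose length divides the number of rows evenly (for example a single value) does not raise this error; it is recycled to fill the column.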
