# Data Structures

There are five main data structures in R.

* vector
* matrix
* array
* list
* data frame

Note that vector, matrix and array contain elements that are all of the same type (`homogeneous`), while list and data frame may contain elements of mixed types (`heterogeneous`). A list is like a vector, but heterogeneous, and a data frame is like a matrix, but heterogeneous across its columns.
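
The distinction above can be checked directly. The following is a minimal sketch; the variable names `v`, `l`, `m` and `d` are illustrative and not part of the original text, and `stringsAsFactors=FALSE` is set explicitly so the result does not depend on the version of `R`.

```r
# A vector holds a single type; a list keeps each element's own type.
v <- c(1, 2, 3)
l <- list(1, 'two', TRUE)

# A matrix holds a single type; a data frame may hold a different type per column.
m <- matrix(1:4, nrow=2)
d <- data.frame(x=c(1, 2), y=c('a', 'b'), stringsAsFactors=FALSE)

print(typeof(v))         # "double"
print(sapply(l, class))  # "numeric" "character" "logical"
print(typeof(m))         # "integer"
print(sapply(d, class))  # x is "numeric", y is "character"
```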

## Vectors

### Basic vectors

A vector is created using the `c` function. Values of different types may be passed to `c`, but the resulting vector stores them all as a single type.

In [1]:
a <- c(TRUE, FALSE, -10, 10, 'X', 'Y', NA, NULL, NaN, Inf, -Inf)
print(a)

 [1] "TRUE"  "FALSE" "-10"   "10"    "X"     "Y"     NA      "NaN"   "Inf"  
[10] "-Inf" 


Note that we passed logicals, numerics and characters to `c`, but every element of `a` was `coerced` to `character` type. When types are mixed, all the elements are coerced to the most flexible type present, in the order logical, integer, double, character. Note also that `NULL` was dropped: eleven values went in, but only ten elements came out.

In [2]:
typeof(a)
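
To see coercion one step at a time, here is a small sketch that mixes two types per vector and checks the result with `typeof`. The variable names are illustrative only.

```r
# Each vector mixes two types; typeof() reports the type the vector settled on.
logical_int <- c(TRUE, 1L)
int_dbl <- c(1L, 2.5)
dbl_chr <- c(2.5, 'x')

print(typeof(logical_int))  # "integer"
print(typeof(int_dbl))      # "double"
print(typeof(dbl_chr))      # "character"

# NULL contributes no element, so it does not count toward the length.
with_null <- c(1, NULL, 3)
print(length(with_null))    # 2
```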

### Length of vector

To get the length of a vector, use `length`.

In [3]:
a <- c(1, 2, 3)
length(a)

### Accessing individual elements

It's crazy, but true: an `R` vector is indexed starting from `1` instead of `0`. To access an element of a vector by position, put its index in brackets, counting from `1`, as follows.

In [4]:
a[1]

In [5]:
a[2]
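
A few related indexing behaviors are worth knowing; the sketch below uses an illustrative vector `v` that is not part of the original text. A negative index excludes a position rather than counting from the end, and an index past the end gives `NA` rather than an error.

```r
v <- c(10, 20, 30)

first <- v[1]             # positions start at 1
last <- v[length(v)]      # the last element
without_first <- v[-1]    # a negative index drops that position
beyond <- v[5]            # past the end: NA, not an error

print(first)          # 10
print(last)           # 30
print(without_first)  # 20 30
print(beyond)         # NA
```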

### Named vector

A `named vector` is a vector in which each element is given a name. You may access the elements of a named vector by name as well as by position.

In [6]:
a <- c(A=TRUE, B=FALSE, C=-10, D=10, E='X', F='Y')
print(a)

      A       B       C       D       E       F 
 "TRUE" "FALSE"   "-10"    "10"     "X"     "Y" 


Here, we may access the first element by `a['A']` or `a[1]`.

In [7]:
print(a['A'])

     A 
"TRUE" 


In [8]:
print(a[1])

     A 
"TRUE" 


Likewise, we may access the second element by `a['B']` or `a[2]`.

In [9]:
print(a['B'])

      B 
"FALSE" 


In [10]:
print(a[2])

      B 
"FALSE" 

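The names of a vector can also be read and changed after creation. The following is a minimal sketch with an illustrative vector `scores` (not from the original text); it uses `names` to inspect the names and double brackets to get a value without its name.

```r
scores <- c(math=90, art=75, music=82)

# names() returns the names as a character vector.
print(names(scores))         # "math" "art" "music"

# A character vector of names selects several elements at once.
picked <- scores[c('math', 'music')]
print(picked)

# Double brackets return the bare value, without the name attached.
art_score <- scores[['art']]
print(art_score)             # 75

# Names can be reassigned in place.
names(scores)[2] <- 'drawing'
print(names(scores))         # "math" "drawing" "music"
```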

### Selecting elements

We may select multiple elements using a supplied vector of indices.

In [11]:
a <- c(1, 2, 3, 4, 5)
b <- a[c(2, 4)]
print(b)

[1] 2 4


We may also use the colon `:` operator to specify a range of elements.

In [12]:
a <- c(1, 2, 3, 4, 5)
b <- a[1:2]
print(b)

[1] 1 2


We may use a logical masking vector as well to select elements.

In [13]:
a <- c(1, 2, 3, 4, 5)
b <- a[c(TRUE, FALSE, TRUE, FALSE, TRUE)]
print(b)

[1] 1 3 5
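
In practice, the logical mask is usually computed from a comparison rather than typed by hand. Here is a minimal sketch; the names `values` and `mask` are illustrative.

```r
values <- c(1, 2, 3, 4, 5)

# A comparison on a vector yields a logical vector of the same length.
mask <- values > 2
print(mask)                  # FALSE FALSE  TRUE  TRUE  TRUE

# Use the mask, or the comparison itself, inside the brackets.
big <- values[mask]
evens <- values[values %% 2 == 0]
print(big)                   # 3 4 5
print(evens)                 # 2 4

# which() converts a mask into the positions that are TRUE.
print(which(mask))           # 3 4 5
```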


### Math operations

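Arithmetic operators work element-wise on vectors. `%%` is the modulo (remainder) operator. `%*%` is the matrix multiplication operator; applied to two vectors of the same length, it computes their `dot product` and returns it as a 1x1 `matrix`. When you perform math operations on vectors, make sure they are of the same length; otherwise `R` recycles the shorter vector, which can give unexpected results.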

In [14]:
a <- c(1, 2, 3)
b <- c(4, 5, 6)
c <- a + b
d <- a - b
e <- a * b
f <- a / b
g <- a %% b
h <- a %*% b

In [15]:
print(c)

[1] 5 7 9


In [16]:
print(d)

[1] -3 -3 -3


In [17]:
print(e)

[1]  4 10 18


In [18]:
print(f)

[1] 0.25 0.40 0.50


In [19]:
print(g)

[1] 1 2 3


In [20]:
print(h)

     [,1]
[1,]   32
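
The sketch below illustrates the two points that tend to surprise people: recycling of a shorter vector, and the fact that `%*%` returns a matrix rather than a plain number. The variable names are illustrative only.

```r
short <- c(1, 2)
long <- c(10, 20, 30, 40)

# The shorter vector is recycled: 1, 2, 1, 2. No warning is given here
# because the longer length is a multiple of the shorter one.
recycled <- short + long
print(recycled)        # 11 22 31 42

x <- c(1, 2, 3)
y <- c(4, 5, 6)

# The dot product as a plain number: multiply element-wise, then sum.
dot_sum <- sum(x * y)
print(dot_sum)         # 32

# drop() removes the matrix dimensions from the 1x1 result of %*%.
dot_plain <- drop(x %*% y)
print(dot_plain)       # 32
```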


### Combining

Use the `c` function to combine vectors.

In [21]:
a <- c(1, 2, 3)
b <- c(4, 5, 6)
c <- c(a, b)
print(c)

[1] 1 2 3 4 5 6


### Sorting

Use the `sort` function to sort elements of a vector.

In [22]:
a <- c(10, 5, 2, 8, 7)
b <- sort(a)
print(b)

[1]  2  5  7  8 10
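
Two common variations, shown as a minimal sketch with illustrative names: sorting in descending order with `decreasing=TRUE`, and using `order` to get the positions that would sort the vector (useful when another vector must be rearranged the same way).

```r
v <- c(10, 5, 2, 8, 7)

# Largest first.
desc <- sort(v, decreasing=TRUE)
print(desc)       # 10 8 7 5 2

# order() returns positions, not values.
idx <- order(v)
print(idx)        # 3 2 5 4 1

# Indexing by those positions gives the sorted vector.
print(v[idx])     # 2 5 7 8 10
```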


### Factors

Factors represent categorical values and you may create them from vectors using the `factor` function. Factors have `levels`, the set of unique values in the factor, stored in a particular order. If no `levels` are specified explicitly, the levels are the sorted unique values (alphabetical for character data).

In [23]:
a <- factor(c('water', 'soda', 'tea', 'coffee'))
print(a)

[1] water  soda   tea    coffee
Levels: coffee soda tea water


Here, an explicit `levels` ordering is placed on the factor (intended as most caffeine to least).

In [24]:
a <- factor(
        c('water', 'soda', 'tea', 'coffee'), 
        levels=c('tea', 'coffee', 'soda', 'water'))
print(a)

[1] water  soda   tea    coffee
Levels: tea coffee soda water
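
Under the hood, a factor stores integer codes that point into its levels. The sketch below, with illustrative names, shows those codes, and also shows that comparisons such as `>` only become meaningful when the factor is created with `ordered=TRUE`.

```r
drink_levels <- c('tea', 'coffee', 'soda', 'water')
drinks <- factor(c('water', 'soda', 'tea', 'coffee'), levels=drink_levels)

# Each element is stored as the position of its value within the levels.
codes <- as.integer(drinks)
print(levels(drinks))   # "tea" "coffee" "soda" "water"
print(codes)            # 4 3 1 2

# An ordered factor allows comparisons that follow the level order.
ranked <- factor(c('water', 'soda', 'tea', 'coffee'),
                 levels=drink_levels, ordered=TRUE)
water_after_tea <- ranked[1] > ranked[3]
print(water_after_tea)  # TRUE
```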


## Matrix

### Creation

A matrix is created using the `matrix` function. Note that you may supply a vector and the number of rows and columns of the matrix during instantiation/creation. The matrix is created column-wise by default.

In [25]:
A <- matrix(c(1, 2, 3, 4), nrow=2, ncol=2)
print(A)

     [,1] [,2]
[1,]    1    3
[2,]    2    4


To create a matrix with a vector of data by row, use `byrow=TRUE`.

In [26]:
A <- matrix(c(1, 2, 3, 4), nrow=2, ncol=2, byrow=TRUE)
print(A)

     [,1] [,2]
[1,]    1    2
[2,]    3    4
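
Only one of `nrow` and `ncol` is needed when the other can be inferred from the length of the data. The following minimal sketch (illustrative names) also shows `dim` and optional row and column names.

```r
# Six values and two rows imply three columns.
M <- matrix(1:6, nrow=2)
print(dim(M))          # 2 3

# Rows and columns may be given names, then used for indexing.
rownames(M) <- c('r1', 'r2')
colnames(M) <- c('a', 'b', 'c')
print(M)

corner <- M['r2', 'c']
print(corner)          # 6
```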


### Accessing elements

To access one or more elements in a matrix, use positional indices with the brackets `[]` and `:`, giving the row index first and the column index second.

In [27]:
A <- matrix(c(0, 1, 2, 3, 4, 5, 6, 7, 9), nrow=3, byrow=TRUE)
print(A)

     [,1] [,2] [,3]
[1,]    0    1    2
[2,]    3    4    5
[3,]    6    7    9


Get one element.

In [28]:
a <- A[1, 1]
print(a)

[1] 0


Get multiple elements (second and third rows, first column).

In [29]:
a <- A[2:3, 1]
print(a)

[1] 3 6


Get multiple elements (first row, second and third columns).

In [30]:
a <- A[1, 2:3]
print(a)

[1] 1 2


Get multiple elements (second and third rows, second and third columns).

In [31]:
a <- A[2:3, 2:3]
print(a)

     [,1] [,2]
[1,]    4    5
[2,]    7    9
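
Leaving an index empty selects everything along that dimension. As the sketch below shows (variable names are illustrative), selecting a single row or column returns a plain vector unless `drop=FALSE` is given.

```r
A <- matrix(c(0, 1, 2, 3, 4, 5, 6, 7, 9), nrow=3, byrow=TRUE)

# An empty index means "all": the whole second row, the whole third column.
row2 <- A[2, ]
col3 <- A[, 3]
print(row2)                 # 3 4 5
print(col3)                 # 2 5 9

# A single row comes back as a vector, so it has no dimensions.
print(is.null(dim(row2)))   # TRUE

# drop=FALSE keeps the result as a 1x3 matrix.
kept <- A[2, , drop=FALSE]
print(dim(kept))            # 1 3
```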


### Transposing

Use the `t` function to transpose a matrix.

In [32]:
A <- matrix(c(0, 1, 2, 3, 4, 5, 6, 7, 9), nrow=3, byrow=TRUE)
print(A)

     [,1] [,2] [,3]
[1,]    0    1    2
[2,]    3    4    5
[3,]    6    7    9


In [33]:
a <- t(A)
print(a)

     [,1] [,2] [,3]
[1,]    0    3    6
[2,]    1    4    7
[3,]    2    5    9


## Lists

`Lists` are somewhat of a misnomer in `R`, as they have more capabilities than a plain list in many other programming languages. A list may hold elements of any type and, when its elements are named, it also behaves like a dictionary, map or associative array. We create a list using the `list` function.

In [34]:
a <- list(TRUE, FALSE, -10, 10, 'X', 'Y', NA, NULL, NaN, Inf, -Inf)

In [35]:
typeof(a)

In [36]:
print(a)

[[1]]
[1] TRUE

[[2]]
[1] FALSE

[[3]]
[1] -10

[[4]]
[1] 10

[[5]]
[1] "X"

[[6]]
[1] "Y"

[[7]]
[1] NA

[[8]]
NULL

[[9]]
[1] NaN

[[10]]
[1] Inf

[[11]]
[1] -Inf



### Accessing elements

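We may use single brackets `[]` and `:` to select elements of a list. Note that single brackets always return a list, even when only one element is selected; to extract the element itself, use double brackets `[[]]`.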

In [37]:
b <- a[1]
print(b)

[[1]]
[1] TRUE



In [38]:
b <- a[2:5]
print(b)

[[1]]
[1] FALSE

[[2]]
[1] -10

[[3]]
[1] 10

[[4]]
[1] "X"
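
The difference between single and double brackets is easy to miss, so here is a minimal sketch with an illustrative list `items`. Single brackets give back a smaller list; double brackets give back the element stored inside.

```r
items <- list(TRUE, -10, 'X')

sub <- items[2]      # a list of length one
val <- items[[2]]    # the number stored in that position

print(class(sub))    # "list"
print(class(val))    # "numeric"

# Arithmetic works on the extracted value, not on the sub-list.
print(val + 1)       # -9
```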



### Named lists

As with named vectors, we also have `named lists`.

In [39]:
a <- list(A=TRUE, B=FALSE, C=-10, D=10)
print(a)

$A
[1] TRUE

$B
[1] FALSE

$C
[1] -10

$D
[1] 10



We may select the first element of the list `a` with `a['A']` or `a[1]`; both return a list of length one that keeps the name `A`.

In [40]:
b <- a['A']
print(b)

$A
[1] TRUE



In [41]:
b <- a[1]
print(b)

$A
[1] TRUE
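
To get the value itself from a named list, and to use the list like a dictionary, the `$` operator and double brackets are the usual tools. This is a minimal sketch; the list `record` and the key `E` are illustrative and not part of the original example.

```r
record <- list(A=TRUE, B=FALSE, C=-10, D=10)

# Both forms return the stored value rather than a one-element list.
c_value <- record$C
d_value <- record[['D']]
print(c_value)          # -10
print(d_value)          # 10

# Assigning to a new name adds an entry; assigning NULL removes one.
record$E <- 'new'
record$B <- NULL
print(names(record))    # "A" "C" "D" "E"
```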



### List apply

The `lapply` function can be used to apply a function to each element of a list. Here, we get the class of each element in the list.

In [42]:
a <- list(A=TRUE, B=FALSE, C=-10, D=10)
b <- lapply(a, class)
print(b)

$A
[1] "logical"

$B
[1] "logical"

$C
[1] "numeric"

$D
[1] "numeric"
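
When a vector is more convenient than a list as the result, `sapply` and `vapply` are close relatives of `lapply`. The sketch below uses illustrative names; `vapply` additionally checks that each result matches a template, here `numeric(1)`.

```r
values <- list(A=TRUE, B=FALSE, C=-10, D=10)

# sapply simplifies the result to a named character vector.
classes <- sapply(values, class)
print(classes)

# vapply requires every result to be a single numeric value.
doubled <- vapply(values, function(x) as.numeric(x) * 2, numeric(1))
print(doubled)
```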



## Data Frames

A `data frame` is perhaps the most powerful data structure in `R`. Similar data frame structures exist elsewhere, such as in `Python` with `Pandas` and in `Spark`. To create a data frame in `R`, use the `data.frame` function.

In [43]:
s <- data.frame(
    age = c(18, 16, 15),
    grade = c('A', 'B', 'C'),
    name = c('Jane', 'Jack', 'Joe'),
    male = c(FALSE, TRUE, TRUE)
)

print(s)

  age grade name  male
1  18     A Jane FALSE
2  16     B Jack  TRUE
3  15     C  Joe  TRUE
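
Before working with a data frame, it often helps to inspect its shape and column types. The following is a minimal sketch that rebuilds the same data under the illustrative name `students`; `stringsAsFactors=FALSE` is set explicitly so the column types do not depend on the version of `R`.

```r
students <- data.frame(
    age = c(18, 16, 15),
    grade = c('A', 'B', 'C'),
    name = c('Jane', 'Jack', 'Joe'),
    male = c(FALSE, TRUE, TRUE),
    stringsAsFactors = FALSE
)

print(dim(students))      # 3 rows, 4 columns
print(names(students))    # "age" "grade" "name" "male"

# Each column keeps its own type.
col_types <- sapply(students, class)
print(col_types)

# str() prints a compact summary of the structure.
str(students)
```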


### Accessing elements

To access the first row, leave the column index empty.

In [44]:
a <- s[1, ]
print(a)

  age grade name  male
1  18     A Jane FALSE


To access the first column, leave the row index empty.

In [45]:
a <- s[, 1]
print(a)

[1] 18 16 15


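To access columns by name, use `$`. The `grade` and `name` columns print with `Levels` below because this output comes from a version of `R` that converted strings to factors by default (`stringsAsFactors=TRUE`, the default before R 4.0.0); newer versions keep them as character vectors.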

In [46]:
a <- s$age
print(a)

[1] 18 16 15


In [47]:
a <- s$grade
print(a)

[1] A B C
Levels: A B C


In [48]:
a <- s$name
print(a)

[1] Jane Jack Joe 
Levels: Jack Jane Joe


In [49]:
a <- s$male
print(a)

[1] FALSE  TRUE  TRUE


To select a block of rows and columns by position, give a range for each.

In [50]:
a <- s[1:2, 1:2]
print(a)

  age grade
1  18     A
2  16     B
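
Rows are more often filtered by a condition than by position. Here is a minimal sketch, again using an illustrative copy of the data named `students`: a logical vector goes in the row position, and column names may go in the column position.

```r
students <- data.frame(
    age = c(18, 16, 15),
    grade = c('A', 'B', 'C'),
    name = c('Jane', 'Jack', 'Joe'),
    male = c(FALSE, TRUE, TRUE),
    stringsAsFactors = FALSE
)

# Keep rows where the condition is TRUE; keep all columns.
older <- students[students$age > 15, ]
print(older)

# Combine a row filter with a selection of columns by name.
boys <- students[students$male, c('name', 'age')]
print(boys)
```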


If your data frame has missing data (`NA`), use the `complete.cases` function to create a logical mask vector that is `TRUE` for rows (or `cases`) with no missing values, then use it to keep only the complete rows.

In [51]:
s <- data.frame(
    age = c(18, 16, 15, 19),
    grade = c('A', 'B', 'C', NA),
    name = c('Jane', 'Jack', 'Joe', 'Jerry'),
    male = c(FALSE, TRUE, TRUE, TRUE)
)

print(s)

  age grade  name  male
1  18     A  Jane FALSE
2  16     B  Jack  TRUE
3  15     C   Joe  TRUE
4  19  <NA> Jerry  TRUE


In [52]:
a <- complete.cases(s)
print(a)

[1]  TRUE  TRUE  TRUE FALSE


In [53]:
a <- s[complete.cases(s), ]
print(a)

  age grade name  male
1  18     A Jane FALSE
2  16     B Jack  TRUE
3  15     C  Joe  TRUE
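
Two related helpers, shown as a minimal sketch on a smaller illustrative data frame: `is.na` locates missing values in a single column, and `na.omit` drops incomplete rows in one step.

```r
students <- data.frame(
    age = c(18, 16, 15, 19),
    grade = c('A', 'B', 'C', NA),
    stringsAsFactors = FALSE
)

# Which entries of one column are missing?
missing_grade <- is.na(students$grade)
print(missing_grade)    # FALSE FALSE FALSE  TRUE

# How many rows have at least one missing value?
n_incomplete <- sum(!complete.cases(students))
print(n_incomplete)     # 1

# na.omit() keeps only the complete rows.
cleaned <- na.omit(students)
print(nrow(cleaned))    # 3
```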
