# NB: R Data Structures

Basic R comes with several data structures:

| Structure  | Dim | Data Type          | Shape       | Python equivalent |
|------------|-----|--------------------|-------------|-------------------|
| Vector     | $1$ | uniform            | sequence    |                   |
| Matrix     | $2$ | uniform            | rectangular |                   |
| Array      | $N$ | uniform            | cube+       | NumPy array       |
| List       | $1$ | non-uniform        | ragged      | List, Dict        |
| Data Frame | $2$ | uniform per column | rectangular | Pandas Data Frame |


- These reflect the evolution of R.
- We mainly use Vectors and Data Frames.

A **vector** is what many other programming languages call an array:

-   A collection of cells with **a fixed size** where all cells hold the
    **same data type** (integers or characters or reals or whatever).

A **matrix** is a two-dimensional vector (fixed size, all cell types the
same).

An **array** is a vector with one or more dimensions.

-   So, an array with one dimension is (almost) the same as a vector.

-   An array with two dimensions is (almost) the same as a matrix.

-   An array with three or more dimensions is an n-dimensional array.
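As a minimal sketch of how these three structures relate, the following builds a vector, a matrix, and a three-dimensional array from simple integer sequences and inspects their dimensions with `dim()`. The variable names `v`, `m`, and `a` are chosen here for illustration only.

```r
# A plain vector has no dim attribute.
v <- 1:6
dim(v)                                 # NULL

# A matrix is the same data with two dimensions (filled column by column).
m <- matrix(v, nrow = 2)
dim(m)                                 # 2 3

# An array generalizes this to any number of dimensions.
a <- array(1:24, dim = c(2, 3, 4))
dim(a)                                 # 2 3 4

# A two-dimensional array is treated as a matrix.
is.matrix(array(1:6, dim = c(2, 3)))   # TRUE
```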

A **list** can hold items of different types and the list size can be
increased on the fly.

-   List contents can be accessed either by **index** (like
    `mylist[[1]]`) or by **name** (like `mylist$age`).

-   Lists are like lists in Python.
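A small sketch of both access styles, and of growing a list on the fly, might look like this. The list `mylist` and its fields (`name`, `age`, `city`) are made-up values for illustration.

```r
# A list holding items of different types (hypothetical example data).
mylist <- list(name = "Ann", age = 31)

mylist[[1]]        # "Ann"  (access by index)
mylist$age         # 31     (access by name)

# Grow the list by assigning a new named member.
mylist$city <- "Oslo"
length(mylist)     # 3
```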

A **data frame** is called a *table* in many languages.

-   This is the workhorse of R.

-   Each column holds the same type, and the columns can have header
    names.

-   A data frame is essentially a kind of list: **a list of vectors**,
    each with the same length but possibly of different data types.
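To make the "list of vectors" idea concrete, here is a small sketch that builds a data frame from three equal-length vectors and checks that it behaves like a list. The column names and values are invented for illustration.

```r
# Three equal-length vectors of different types become three columns.
df <- data.frame(
  name   = c("Ann", "Bob", "Cy"),
  age    = c(31, 45, 27),
  member = c(TRUE, FALSE, TRUE)
)

is.list(df)          # TRUE: a data frame is a list underneath
length(df)           # 3: the number of columns (list members)
nrow(df)             # 3: the shared length of each column
sapply(df, class)    # one type per column
df$age               # a single column is an ordinary vector
```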

**The two most frequently used are Vector and Data frame.**

So, we will look at vectors and data frames.

We will also look at lists since they are used internally to construct
data frames.

# Vectors and `c()`

A vector is a sequence of data elements of the **same basic type**.

Members in a vector are officially called ***components***, but many
call them ***members***.

Vectors may be created with the `c()` function ("c" stands for combine).

-  This is like `[]` in Python.

Here is a vector of three numeric values 2, 3 and 5.

In [156]:
c(2, 3, 5) 

And here is a vector of logical values.

In [157]:
c(TRUE, FALSE, TRUE, FALSE, FALSE) 

A vector can contain character strings.

In [158]:
c("aa", "bb", "cc", "dd", "ee") 

## Vectors from sequences using `:`, `seq()`, and `rep()`

Vectors can be made out of sequences which may be generated in a few
ways.

In [159]:
s1 <- 2:5
s1

The `seq()` function is like Python's `range()`, except that the `to` endpoint is included.

In [160]:
s2 <- seq(from=1, to=5, by=2)
s2

You can drop the argument names and write `seq(1,5,2)`.

The `rep()` function will create a series of repeated values:

In [161]:
s3 <- rep(1, 5)
s3
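Both functions take further arguments that are not used above. As a brief sketch, `times` and `each` control how `rep()` repeats a vector, and `length.out` asks `seq()` for a given number of evenly spaced values; the variable names are illustrative.

```r
# Repeat the whole vector three times.
r1 <- rep(c(1, 2), times = 3)
r1     # 1 2 1 2 1 2

# Repeat each member three times before moving on.
r2 <- rep(c(1, 2), each = 3)
r2     # 1 1 1 2 2 2

# Five evenly spaced values from 0 to 1, endpoints included.
s4 <- seq(0, 1, length.out = 5)
s4     # 0.00 0.25 0.50 0.75 1.00
```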

## `length()`

The number of members in a vector is given by the `length()` function.

In [162]:
length(c("aa", "bb", "cc", "dd", "ee")) 

## Combining Vectors with `c()`

Vectors can be combined via the function `c()`.

In [163]:
n <- c(2, 3, 5) 
s <- c("aa", "bb", "cc", "dd", "ee") 
c(n, s)

## Value Coercion

Notice how **the numeric values are being coerced into character
strings** when the two vectors are combined.

This is necessary so as to maintain the same primitive data type for
members in the same vector.
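The same rule applies to other mixtures of types. The sketch below uses `class()` to show which type wins in a few small cases: logical values give way to numbers, and both give way to character strings.

```r
class(c(2, 3, 5))      # "numeric"
class(c(2, "aa"))      # "character": the number becomes "2"
class(c(TRUE, 2))      # "numeric":   TRUE becomes 1

c(TRUE, 2)             # 1 2
c(TRUE, "aa")          # "TRUE" "aa"
```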

## Vector Math

Arithmetic operations of vectors are performed member-by-member, i.e.,
**member-wise**.

We called this 'element-wise' in the context of NumPy.

For example, suppose we have two vectors a and b.

In [164]:
a <- c(1, 3, 5, 7) 
b <- c(1, 2, 4, 8)

If we multiply `a` by 5, we would get a vector with each of its members
multiplied by 5.

In [165]:
5 * a 

And if we add a and b together, the sum would be a vector whose members
are the sum of the corresponding members from a and b.

In [166]:
a + b

Similarly for subtraction, multiplication and division, we get new
vectors via member-wise operations.

In [167]:
a - b 

In [168]:
a * b 

In [169]:
a / b 

## The Recycling Rule

If two vectors are of unequal length, the **shorter one will be
recycled** in order to match the longer vector.

This is similar to broadcasting in NumPy and Pandas.

For example, the following vectors `u` and `v` have different lengths,
and their sum is computed by recycling values of the shorter vector `u`.

In [170]:
u <- c(10, 20, 30) 
v <- c(1, 2, 3, 4, 5, 6, 7, 8, 9) 
u + v 
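One way to picture recycling is that the shorter vector is repeated until it matches the length of the longer one. The sketch below makes that explicit with `rep()`; `u_long` and `w` are names introduced here for illustration. It also shows that when the longer length is not a multiple of the shorter one, R still recycles but issues a warning.

```r
u <- c(10, 20, 30)
v <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

# Recycling behaves as if u were repeated to the length of v.
u_long <- rep(u, length.out = length(v))
u_long                          # 10 20 30 10 20 30 10 20 30
identical(u + v, u_long + v)    # TRUE

# Lengths 3 and 4: R recycles anyway but warns that the longer length
# is not a multiple of the shorter one (suppressed here to keep output clean).
w <- c(1, 2, 3, 4)
suppressWarnings(u + w)         # 11 22 33 14
```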

## Vector Indexes

We retrieve values from a vector by placing an index inside the single
square bracket `[]` operator.

Vector indexes are 1-based.

In [171]:
s <- c("aa", "bb", "cc", "dd", "ee") 
s[3] 

## Negative Indexing

Unlike Python, if the index is negative, **it will remove the member**
whose position has the same absolute value as the negative index.

It really does mean subtraction!

For example, the following creates a vector slice with the third member
removed.

In [172]:
s[-3] 
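Negative indexes can also be given as a vector to drop several members at once. The following sketch shows this and confirms that the original vector is left untouched, since indexing returns a new vector.

```r
s <- c("aa", "bb", "cc", "dd", "ee")

s[-c(1, 2)]    # "cc" "dd" "ee": drop the first two members
s[-(2:4)]      # "aa" "ee":      drop members 2 through 4
s              # still all five members
```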

## Out-of-Range Indexes

Values for out-of-range indexes are reported as `NA`.

In [173]:
s[10] 

## Numeric Index Vectors

A new vector can be sliced from a given vector with a numeric vector
passed to the indexing operator.

Index vectors consist of member positions of the original vector to be
retrieved.

Here we see how to retrieve a vector slice containing the second and
third members of a given vector `s`.

In [174]:
s <- c("aa", "bb", "cc", "dd", "ee") 
s[c(2, 3)] 

## Duplicate Indexes

The index vector allows duplicate values. Hence the following retrieves
a member twice in one operation.

In [175]:
s[c(2, 3, 3)] 

## Out-of-Order Indexes

The index vector can even be out-of-order. Here is a vector slice with
the order of first and second members reversed.

In [176]:
s[c(2, 1, 3)] 

## Range Index

To produce a vector slice between two indexes, we can use the colon
operator ":". This can be convenient for situations involving large
vectors.

In [177]:
s[2:4] 

## Logical Index Vectors

A new vector can be sliced from a given vector with a logical index
vector.

The logical vector should be the same length as the original vector.

Its members are `TRUE` if the corresponding members in the original
vector are to be included in the slice, and `FALSE` if otherwise.

-   This is what we called **boolean indexing** and masking in Python.

For example, consider the following vector s of length 5.

In [178]:
s <- c("aa", "bb", "cc", "dd", "ee")

To retrieve the second and fourth members of `s`, we define a logical
vector `L` of the same length, and have its second and fourth members set
to `TRUE`.

In [179]:
L <- c(FALSE, TRUE, FALSE, TRUE, FALSE)
s[L] 

The code can be abbreviated into a single line.

In [180]:
s[c(FALSE, TRUE, FALSE, TRUE, FALSE)]
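In practice the logical index vector is usually produced by a comparison rather than typed out by hand. As a sketch, the following builds a mask from a numeric vector and uses it to select members; the names `a` and `mask` are illustrative.

```r
a <- c(1, 3, 5, 7)

# A comparison is evaluated member-wise and yields a logical vector.
mask <- a > 2
mask                 # FALSE TRUE TRUE TRUE

a[mask]              # 3 5 7
a[a > 2 & a < 7]     # 3 5 (conditions combined with &)
```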

## Naming Vector Members with `names()`

We can assign names to vector members, too.

In [181]:
v <- c("Mary", "Sue") 
names(v) <- c("First", "Last") 
v 

Now we can retrieve the first member by name, much like a Python
dictionary.

In [182]:
v["First"] 

We can also reverse the order with a character string index vector.

In [183]:
v[c("Last", "First")]
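Names can also be supplied at creation time inside `c()`, which avoids the separate call to `names()`. A brief sketch, using the same example values:

```r
# Name the members while building the vector.
v <- c(First = "Mary", Last = "Sue")

names(v)         # "First" "Last"
v[["Last"]]      # "Sue": double brackets return the value without its name
```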

# Lists

A list is a generic vector containing other objects.

This is close to a Python list.

The following variable `x` is a list containing copies of three vectors
`n`, `s`, `b`, and a numeric value $3$.

In [184]:
n <- c(2, 3, 5) 
s <- c("aa", "bb", "cc", "dd", "ee") 
b <- c(TRUE, FALSE, TRUE, FALSE, FALSE) 

x <- list(n, s, b, 3)   # x contains copies of n, s, b
x

Note the odd **bracket notation** in the output.

It indicates that each list member **contains** a vector, even if the
length of the vector is $1$.

## List Slicing

We retrieve a list slice with the single square bracket `[]` operator.

The following is a slice containing the second member of `x`, which is a copy of `s`.

In [185]:
x[2]

With an index vector, we can retrieve a slice with multiple members.

Here is a slice containing the second and fourth members of `x`.

In [186]:
x[c(2, 4)]

## Member Reference with `[[]]`

To reference a list member directly, we use the double square bracket
`[[]]` operator.

The following object `x[[2]]` is the second member of `x`.

In other words, **`x[[2]]` is a true copy of `s`, not a slice containing
`s` or its copy**.

In [187]:
x[2]

In [188]:
x[[2]]

We can modify its content directly.

In [189]:
x[[2]][1] <- "ta"

In [190]:
x[2]

And `s` is unaffected.

In [191]:
s
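The difference between `[]` and `[[]]` can be checked directly with `class()` and `length()`. The sketch below rebuilds a smaller list so that it runs on its own, then repeats the modification to confirm that the list holds its own copy.

```r
s <- c("aa", "bb", "cc", "dd", "ee")
x <- list(c(2, 3, 5), s, TRUE)

# Single brackets return a list; double brackets return the member itself.
class(x[2])       # "list"
class(x[[2]])     # "character"
length(x[2])      # 1: a list with one member
length(x[[2]])    # 5: the character vector

# Modifying the member inside the list leaves the original vector alone.
x[[2]][1] <- "ta"
x[[2]][1]         # "ta"
s[1]              # "aa"
```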