# Subsetting Matrices and Arrays



The great thing about matrices is that, since they are just generalizations of vectors from one dimension to two, subsetting matrices works *almost* the same way it works with vectors. Instead of passing a single index or logical vector into a set of square brackets (e.g. `[1]`), we put a comma in those square brackets and specify a location with *two* indices or logical vectors (e.g. `[1,1]`): the first for rows, the second for columns.

## Subsetting by Index

Suppose we have the following matrix:

In [15]:
our_matrix = matrix(1:12, nrow=3, ncol=4)
our_matrix

     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12


To subset, we just pass a row position followed by a column position. For example, if we wanted the entry from the second row and third column, we'd type:

In [16]:
our_matrix[2, 3]

The one new thing is that if you want ALL entries along a specific dimension, you still put in a comma, but you leave the entry blank for the dimension on which you want all observations. So if I wanted the second row, I'd just type:

In [17]:
our_matrix[2,]

Or if I wanted the third column, I'd type:

In [18]:
our_matrix[,3]
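
To make the results of those three calls concrete, here is a small self-contained sketch. It rebuilds `our_matrix` so it runs on its own; the names `entry`, `second_row`, and `third_column` are just illustrative.

```r
# Rebuild the matrix from above so this block runs on its own.
our_matrix = matrix(1:12, nrow=3, ncol=4)

# Row 2, column 3: a single value.
entry = our_matrix[2, 3]
entry          # 8

# Blank column position: the whole second row.
second_row = our_matrix[2, ]
second_row     # 2 5 8 11

# Blank row position: the whole third column.
third_column = our_matrix[, 3]
third_column   # 7 8 9
```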

Note that if you pull out a single row or column of your matrix, the result is one-dimensional, so it just becomes a vector!

In [19]:
class(our_matrix)

In [20]:
class(our_matrix[1,])
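
If you would rather keep the matrix structure when pulling out a single row or column, R's `[` accepts a `drop = FALSE` argument. A minimal sketch (the names `row_as_vector` and `row_as_matrix` are just illustrative):

```r
# Rebuild the matrix from above so this block runs on its own.
our_matrix = matrix(1:12, nrow=3, ncol=4)

# Default behaviour: a single row is simplified to a plain vector.
row_as_vector = our_matrix[1, ]
is.matrix(row_as_vector)   # FALSE
dim(row_as_vector)         # NULL

# With drop = FALSE the result stays a 1 x 4 matrix.
row_as_matrix = our_matrix[1, , drop = FALSE]
is.matrix(row_as_matrix)   # TRUE
dim(row_as_matrix)         # 1 4
```

This matters mostly when later code expects two dimensions, for example when it calls `nrow()` or `ncol()` on the result.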

Finally, just like with vectors, we can subset with vectors if we want:

In [22]:
our_matrix[1:2, 3:4]

     [,1] [,2]
[1,]    7   10
[2,]    8   11

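
The index vectors don't have to be consecutive ranges. As a small sketch (the names `corners` and `without_row_2` are just illustrative), we can pick scattered rows and columns, or use negative indices to drop positions, exactly as with ordinary vectors:

```r
# Rebuild the matrix from above so this block runs on its own.
our_matrix = matrix(1:12, nrow=3, ncol=4)

# Rows 1 and 3, columns 2 and 4.
corners = our_matrix[c(1, 3), c(2, 4)]
corners
#      [,1] [,2]
# [1,]    4   10
# [2,]    6   12

# Negative indices drop positions: every row except the second.
without_row_2 = our_matrix[-2, ]
dim(without_row_2)   # 2 4
```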

## Subsetting with Logicals
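
Logical subsetting carries over from vectors in the same way as index subsetting: we supply a logical vector for the rows, the columns, or both. The following is only a minimal sketch using the `our_matrix` example from above; the names `keep_rows`, `kept`, `rows_gt_1`, and `big_entries` are just illustrative.

```r
# Rebuild the matrix from above so this block runs on its own.
our_matrix = matrix(1:12, nrow=3, ncol=4)

# One logical value per row: keep rows 1 and 3, all columns.
keep_rows = c(TRUE, FALSE, TRUE)
kept = our_matrix[keep_rows, ]
dim(kept)       # 2 4

# A condition computed from the data: rows whose first column exceeds 1.
rows_gt_1 = our_matrix[our_matrix[, 1] > 1, ]
rows_gt_1[, 1]  # 2 3

# A logical matrix of the same shape (no comma) selects individual
# entries and returns them as a plain vector, in column order.
big_entries = our_matrix[our_matrix > 9]
big_entries     # 10 11 12
```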

## Arrays

As we saw above, generalizing the way we subset vectors to subsetting matrices was quite easy -- we just subset our matrices by doing what we did with vectors, but with two terms in our square brackets separated by a comma (e.g. `[1, 1]` instead of just `[1]`).

But guess what? Just as it was easy to generalize from one dimension to two, it turns out that we can also generalize from two dimensions to N dimensions the same way! In fact, rather than thinking of vectors and matrices as two different things, we can think of them as special cases (N=1 and N=2) of a more general data structure: **Arrays**.

Arrays are collections of data of the same type with a regular structure organized into N dimensions. When N=1, an array is the equivalent of a vector:

In [1]:
array(1:9)

And when N=2, an array is just a matrix:

In [5]:
array(1:6, dim=c(2,3))

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

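
One way to check how many dimensions an array has is to look at the length of its `dim()`. A short sketch (the variable names, and the 2 x 3 x 4 shape of `three_d`, are just illustrative):

```r
one_d   = array(1:9)
two_d   = array(1:6, dim=c(2,3))
three_d = array(1:24, dim=c(2,3,4))

# The number of dimensions N is the length of dim().
length(dim(one_d))     # 1
length(dim(two_d))     # 2
length(dim(three_d))   # 3

# The N=2 case really is a matrix, with the same entries as matrix().
is.matrix(two_d)                                # TRUE
all(two_d == matrix(1:6, nrow=2, ncol=3))       # TRUE
```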

When N > 2, things get a little harder to visualize, but also a lot more powerful.

*But WHY*, I hear you asking, *would I ever want more than two dimensions?!* 

Well, let's start with use-cases for N=3. Suppose that you wanted to build a climate model. One thing that you would have to do is come up with a way to store the temperature at different points within a 3D space! And you could do that with a three-dimensional array where the first two dimensions are used to store the x and y coordinates of a point on the ground, the third could represent an elevation above that point, and the value in that entry would be the temperature. 

For example, let's create a 3 x 3 x 10 array filled with random temperatures generated with `rnorm()`. We'll imagine the first dimension to be the number of kilometers North of a reference point (say, the center of Duke's campus), the second dimension to be the number of kilometers East of that point, and the third dimension to be kilometers of elevation:

In [10]:
# Make random temperatures.
# In a real world application 
# you'd have measured these or 
# started with seed values based on 
# measured values. 

# using a reasonable Fahrenheit mean and sd
rand_temps = rnorm(3*3*10, mean=70, sd=10)
temperatures = array(rand_temps, dim=c(3,3,10))

Now because `temperatures` has three dimensions, it's hard to print out, but it's easy to extract values. Suppose we wanted to know the temperature 1km North of Duke, 2km East, and at an elevation of 10km:

In [11]:
temperatures[1, 2, 10]

Ta-Da! I could also get the temperature at all elevations at the point 1km North and 2km East:

In [13]:
temperatures[1, 2, ]

Or the temperature at 5km elevation across all ground locations: 

In [14]:
temperatures[,,5]

         [,1]     [,2]     [,3]
[1,] 63.85090 82.40452 71.75735
[2,] 65.39210 52.12161 89.96290
[3,] 86.95049 56.50496 58.08369
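
A quick way to see what each of these subsets returns is to check its shape. The sketch below rebuilds a `temperatures` array so it runs on its own; the `set.seed(42)` call is an assumption added only to make the random values repeatable, so the numbers will not match the output above.

```r
set.seed(42)  # arbitrary seed, only so the example is repeatable
temperatures = array(rnorm(3*3*10, mean=70, sd=10), dim=c(3,3,10))

dim(temperatures)              # 3 3 10

# Fixing two positions leaves a vector: one value per elevation.
length(temperatures[1, 2, ])   # 10

# Fixing one position leaves a matrix: one value per ground location.
dim(temperatures[, , 5])       # 3 3

# Ordinary functions work on any slice, e.g. the average
# temperature across the whole 5km layer.
layer_mean = mean(temperatures[, , 5])
```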


As you can see, this is a *really* powerful idea, and honestly arrays are probably the most fundamental data structure in data science. 

### Other Uses for Arrays

Not into climate modeling? OK, here are some other uses of arrays with more than 2 dimensions:

- **Repeated measurements over time:** Just as a single survey is easily represented as a matrix, so too can repeated surveys be easily represented by making time a third dimension. This makes it easy to pull out a single wave of the survey, or to pick out all the responses for a given person over time. 
- **Brain Scans:** fMRI scans capture the entire volume of the brain, and that volumetric data is most naturally stored in a three-dimensional array, just like the temperature data above. Obviously not all social scientists will end up working with brain scans, but there's certainly a lot of cutting-edge work in this area!
- **3D measurements that evolve over time:** Just as we can model survey data that evolves over time in three dimensions, we can also model three-dimensional volumetric data that evolves over time (e.g. not just a slice of a climate model, but its evolution) in a four-dimensional array!
- **Satellite Data:** Satellite image data usually comes in the form of sets of 2-dimensional images, where each image includes information about light intensity at a given wavelength. When these images are stacked to, say, generate a color image, or identify wavelength combinations common to certain types of pollution, flood waters, or specific crops, you get a three-dimensional array.
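
To make the four-dimensional case concrete, here is a minimal sketch that extends the temperature example with a time dimension. The layout (North x East x elevation x hour), the 24-hour length, the seed, and all the variable names are assumptions chosen for illustration.

```r
set.seed(1)  # arbitrary seed, only so the example is repeatable

# Hypothetical layout: km North x km East x km elevation x hour.
temps_over_time = array(rnorm(3*3*10*24, mean=70, sd=10),
                        dim=c(3,3,10,24))

# The full 3D volume at hour 6: blank out the three spatial dimensions.
volume_at_hour_6 = temps_over_time[, , , 6]
dim(volume_at_hour_6)   # 3 3 10

# The time series at one point in space: blank out the time dimension.
series_at_point = temps_over_time[1, 2, 10, ]
length(series_at_point) # 24
```

The subsetting rule is the same as before: one position per dimension, separated by commas, with a blank for any dimension you want in full.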

So yeah, high dimensional arrays are a SUPER powerful data structure, and one you shouldn't shy away