# Manipulating Matrices

The great thing about matrices is that, since they are just generalizations of vectors from one dimension to two, subsetting matrices works *almost* the same way it works with vectors. Instead of subsetting by passing an index or a logical vector into a set of square brackets (e.g. `[1]`), we put a comma in those square brackets and specify a location with *two* indices or logical vectors (e.g. `[1, 1]`).

## Subsetting by Index

Suppose we have the following matrix:

In [1]:
our_matrix <- matrix(1:12, nrow = 3, ncol = 4)
our_matrix

1,4,7,10
2,5,8,11
3,6,9,12


To subset, we pass a row position first and a column position second. For example, if we wanted the entry from the second row and third column, we'd type:

In [2]:
our_matrix[2, 3]

The one new thing is that if you want ALL entries along a specific dimension, you still put in a comma, but you leave the entry blank for the dimension on which you want all observations. So if I wanted the second row, I'd just type:

In [3]:
our_matrix[2, ]

Or if I wanted the third column, I'd type:

In [4]:
our_matrix[, 3]

Note that if you pull out a subset of your matrix that is one dimensional, it just becomes a vector!

In [5]:
class(our_matrix)

In [6]:
class(our_matrix[1, ])
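
If you would rather keep the matrix structure when pulling out a single row or column, base R's `[` accepts a `drop = FALSE` argument. Here is a minimal sketch that rebuilds `our_matrix` so it runs on its own; the names `row_as_vector` and `row_as_matrix` are just illustrative.

```r
# Rebuild the matrix from above so this block is self-contained
our_matrix <- matrix(1:12, nrow = 3, ncol = 4)

# Default behaviour: a single row is simplified to a plain vector
row_as_vector <- our_matrix[1, ]
dim(row_as_vector)   # NULL, because plain vectors have no dimensions

# With drop = FALSE the result stays a 1 x 4 matrix
row_as_matrix <- our_matrix[1, , drop = FALSE]
dim(row_as_matrix)   # 1 4
```

This matters mostly when later code expects a matrix (for example, code that calls `nrow()` or subsets with two indices).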

Finally, just as with vectors, we can pass vectors of indices to select several rows and columns at once:

In [7]:
our_matrix[1:2, 3:4]

7,10
8,11
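
The index vectors don't have to be contiguous ranges. As a small sketch (the names `corners` and `without_second_row` are made up for illustration), we can pick arbitrary rows and columns with `c()`, or exclude positions with negative indices, just as with vectors:

```r
# Rebuild the matrix from above so this block is self-contained
our_matrix <- matrix(1:12, nrow = 3, ncol = 4)

# Rows 1 and 3, columns 1 and 4: the four "corner" entries
corners <- our_matrix[c(1, 3), c(1, 4)]
corners

# A negative index drops that position: everything except row 2
without_second_row <- our_matrix[-2, ]
without_second_row
```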


## Subsetting with Logicals

Subsetting with logical vectors also generalizes from vectors to matrices in the same way. To illustrate, let's go back to our toy matrix of survey responses:

In [8]:
income <- c(22000, 75000, 19000)
age <- c(20, 35, 55)
education <- c(12, 16, 11)

survey <- cbind(income, age, education)
survey

income,age,education
22000,20,12
75000,35,16
19000,55,11


If we wanted to select all the rows where income was less than the US median income (about 65,000), we would first extract the income column, then create a logical vector that's `TRUE` if income is below 65,000, then put that in the first position of our square brackets:

In [9]:
income <- survey[, 1]
income

In [10]:
below_median <- income < 65000
below_median

In [11]:
survey[below_median, ]

income,age,education
22000,20,12
19000,55,11


Or, of course, we could do that all in one line instead of breaking out the steps:

In [12]:
survey[survey[, 1] < 65000, ]

income,age,education
22000,20,12
19000,55,11


Or, since R used the names of the vectors we passed to `cbind()` as column names, we could also refer to the income column by name, which makes our code a lot easier to understand:

In [13]:
survey[survey[, "income"] < 65000, ]

income,age,education
22000,20,12
19000,55,11
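
Logical vectors can also be combined with `&` (and) and `|` (or) before being used as a row selector. The sketch below rebuilds `survey` so it runs on its own; the age cutoff of 30 and the variable names are arbitrary choices for illustration.

```r
# Rebuild the toy survey so this block is self-contained
income <- c(22000, 75000, 19000)
age <- c(20, 35, 55)
education <- c(12, 16, 11)
survey <- cbind(income, age, education)

# One logical vector per condition
low_income <- survey[, "income"] < 65000
over_30 <- survey[, "age"] > 30

# Rows satisfying BOTH conditions; drop = FALSE keeps the result
# a matrix even though only one row matches
older_low_income <- survey[low_income & over_30, , drop = FALSE]
older_low_income

# Rows satisfying EITHER condition
either <- survey[low_income | over_30, ]
either
```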


## Subsetting by Names

As we just saw, while not all matrices have names, if they do, you can subset using them. For example, our `survey` matrix has column names but no row names, so we can only subset columns by name:

In [14]:
survey[, "education"]

Names are accessible through the `colnames()` and `rownames()` functions:

In [15]:
colnames(survey)

In [16]:
rownames(survey)

NULL

Oddly, R also allows you to assign to these functions to change the names on a matrix. For example, to add row names we could do:

In [17]:
rownames(survey) <- c("row1", "row2", "row3")
survey

,income,age,education
row1,22000,20,12
row2,75000,35,16
row3,19000,55,11


And we can delete them too!

In [18]:
rownames(survey) <- NULL
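
Once a matrix has both row and column names, you can use names in both positions of the square brackets. A minimal sketch, rebuilding `survey` and reusing the placeholder row names from above:

```r
# Rebuild the toy survey so this block is self-contained
income <- c(22000, 75000, 19000)
age <- c(20, 35, 55)
education <- c(12, 16, 11)
survey <- cbind(income, age, education)
rownames(survey) <- c("row1", "row2", "row3")

# A single entry, selected by row name and column name
survey["row2", "age"]

# Several rows of one column, selected by name
survey[c("row1", "row3"), "income"]

# dimnames() returns the row and column names together as a list
dimnames(survey)
```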

## Subsetting by Row and Column Simultaneously

Often, we don't just want to subset rows or columns, but both at once. For example, suppose I wanted the education levels of everyone with incomes below the US median. I could do this in two steps by subsetting rows and then subsetting columns:

In [19]:
below_median <- survey[survey[, "income"] < 65000, ]
below_median[, "education"]

Or I can do it all in one command!

In [20]:
survey[survey[, "income"] < 65000, "education"]

So what is the average education of people earning less than the median income in the US in our toy data?

In [21]:
mean(survey[survey[, "income"] < 65000, "education"])
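
The same pattern extends to several columns at once. As a sketch, assuming we want averages for both age and education among below-median earners, we can select two columns by name and pass the result to base R's `colMeans()`; `below_median` and `averages` are illustrative names.

```r
# Rebuild the toy survey so this block is self-contained
income <- c(22000, 75000, 19000)
age <- c(20, 35, 55)
education <- c(12, 16, 11)
survey <- cbind(income, age, education)

# Logical row selector, then two columns selected by name
below_median <- survey[, "income"] < 65000
averages <- colMeans(survey[below_median, c("age", "education")])
averages   # age 37.5, education 11.5
```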

OK -- I know we've just covered a lot, but hopefully that example makes clear how quickly we can start doing really, really powerful analyses and answering substantive questions *just by subsetting our data carefully.*

## Using Subsets to Modify Data

Sometimes we want to modify a *part* of a matrix. For example, suppose we were working with our survey data, and we want to multiply all the income values by `1.02` to adjust for inflation that has occurred since the survey. Obviously, if we just multiplied the matrix by `1.02`, we'd also modify things like education and age:

In [22]:
survey * 1.02

income,age,education
22440,20.4,12.24
76500,35.7,16.32
19380,56.1,11.22


What we can do instead is extract the column with income, modify it, then replace the old income column with our updated column:

In [23]:
income_column <- survey[, "income"] # Extract income
adjusted_income <- income_column * 1.02 # Adjust income
survey[, "income"] <- adjusted_income # Replace income with new values!
survey

income,age,education
22440,20,12
76500,35,16
19380,55,11


Or, if we wanted, we could actually do all this in one step:

In [24]:
# Re-make survey so it hasn't been adjusted for inflation
income <- c(22000, 75000, 19000)
age <- c(20, 35, 55)
education <- c(12, 16, 11)
survey <- cbind(income, age, education)
survey

income,age,education
22000,20,12
75000,35,16
19000,55,11


In [25]:
# Now adjust income in one step!
survey[, "income"] <- survey[, "income"] * 1.02
survey

income,age,education
22440,20,12
76500,35,16
19380,55,11


And this is *especially* powerful if we subset on BOTH rows and columns. Suppose, for example, we wanted to see what people's incomes would look like if anyone who didn't finish high school (`education < 12`) got a tax credit of 10,000 dollars:

In [26]:
survey[survey[, "education"] < 12, "income"] <- survey[survey[, "education"] < 12, "income"] + 10000

In [27]:
survey

income,age,education
22440,20,12
76500,35,16
29380,55,11
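
Because the same row condition appears on both sides of the assignment, one option is to store it in a variable first. This is only a sketch of that style: it starts from a fresh, un-adjusted copy of `survey` (so the incomes differ from the inflation-adjusted output above), and `no_high_school` is a made-up name.

```r
# Fresh copy of the toy survey (NOT adjusted for inflation)
income <- c(22000, 75000, 19000)
age <- c(20, 35, 55)
education <- c(12, 16, 11)
survey <- cbind(income, age, education)

# Store the row selector once...
no_high_school <- survey[, "education"] < 12

# ...then reuse it on both sides of the assignment
survey[no_high_school, "income"] <- survey[no_high_school, "income"] + 10000
survey[, "income"]   # 22000 75000 29000
```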


And that's it! Now you're a matrix pro. 

## Arrays

As we saw above, generalizing the way we subset vectors to subsetting matrices was quite easy -- we just subset our matrices by doing what we did with vectors, but with two terms in our square brackets separated by a comma (e.g. `[1, 1]` instead of just `[1]`).

But guess what? Just as it was easy to generalize from one dimension to two, it turns out that we can also generalize from two dimensions to N dimensions the same way! In fact, rather than thinking of vectors and matrices as two different things, we can think of them as special cases (the case of N=1 and N=2) of a more general data structure: **Arrays**.

Arrays are collections of data of the same type with a regular structure organized into N dimensions. When N=1, an array is the equivalent of a vector:

In [28]:
array(1:9)

And when N=2, an array is just a matrix:

In [29]:
array(1:6, dim = c(2, 3))

1,3,5
2,4,6


When N > 2, things get a little harder to visualize, but also a lot more powerful.
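
One way to see that these are all the same kind of object is to ask each one for its dimensions with `dim()`. A small sketch (the variable names and the 2 x 3 x 4 shape are arbitrary):

```r
one_d <- array(1:9)                         # N = 1
two_d <- array(1:6, dim = c(2, 3))          # N = 2
three_d <- array(1:24, dim = c(2, 3, 4))    # N = 3

dim(one_d)     # 9
dim(two_d)     # 2 3
dim(three_d)   # 2 3 4

# A two-dimensional array is a matrix as far as R is concerned
is.matrix(two_d)   # TRUE

# Subsetting gains one index per dimension
three_d[1, 2, 3]
```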

*But WHY*, I hear you asking, *would I ever want more than two dimensions?!* 

Well, let's start with use-cases for N=3. Suppose that you wanted to build a climate model. One thing that you would have to do is come up with a way to store the temperature at different points within a 3D space! And you could do that with a three-dimensional array where the first two dimensions are used to store the x and y coordinates of a point on the ground, the third could represent an elevation above that point, and the value in that entry would be the temperature. 

For example, let's create a 3 x 3 x 10 array filled with random temperatures generated with `rnorm()`, where we're imagining the first dimension to be, say, number of kilometers North of a reference point (say, the center of Duke's campus), the second dimension is the number of kilometers East of the center of Duke's campus, and the third dimension is kilometers elevation:

In [30]:
# Make random temperatures.
# In a real world application
# you'd have measured these or
# started with seed values based on
# measured values.

# using a reasonable Fahrenheit mean and sd
rand_temps <- rnorm(3 * 3 * 10, mean = 70, sd = 10)
temperatures <- array(rand_temps, dim = c(3, 3, 10))

Now because `temperatures` has three dimensions, it's hard to print out, but it's easy to extract values. Suppose we wanted to know the temperature 1km North of Duke, 2km East, and at an elevation of 10km:

In [31]:
temperatures[1, 2, 10]

Ta-Da! I could also get the temperature at all elevations at the point 1km North and 2km East:

In [32]:
temperatures[1, 2, ]

Or the temperature at 5km elevation across all ground locations: 

In [33]:
temperatures[, , 5]

64.51831,84.07189,81.67219
57.57011,82.47413,74.1287
53.74611,68.3629,63.98535
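
Beyond pulling out slices, we can summarize along a dimension with base R's `apply()`. The sketch below regenerates the random temperatures with an arbitrary seed (so its numbers will not match the output above) and computes the mean temperature at each elevation and at each ground location; the variable names are illustrative.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible

# Same construction as above: 3 x 3 ground grid, 10 elevations
rand_temps <- rnorm(3 * 3 * 10, mean = 70, sd = 10)
temperatures <- array(rand_temps, dim = c(3, 3, 10))

# Mean over each slice of the third dimension: one value per elevation
mean_by_elevation <- apply(temperatures, 3, mean)
length(mean_by_elevation)   # 10

# Mean over elevations for each (North, East) location: a 3 x 3 matrix
mean_by_location <- apply(temperatures, c(1, 2), mean)
dim(mean_by_location)       # 3 3
```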


As you can see, this is a *really* powerful idea, and arrays are arguably one of the most fundamental data structures in data science.

### Other Uses for Arrays

Not into climate modeling? OK, here are some other uses of arrays with more than 2 dimensions:

- **Repeated measurements over time:** Just as a single survey is easily represented as a matrix, so too can repeated surveys be easily represented by making time a third dimension. This makes it easy to pull out a single wave of the survey, or to pick out all the responses for a given person over time. 
- **Brain Scans:** fMRI scans capture the entire volume of the brain, and that volumetric data is most naturally stored in a three-dimensional array, just like the temperature data above. Obviously not all social scientists will end up working with brain scans, but there's certainly a lot of cutting-edge work in this area!
- **3D measurements that evolve over time:** just as we can model survey data that evolves over time in three dimensions, we can also model three-dimensional volumetric data that evolves over time (e.g. not just a slice of a climate model, but its evolution) in a four-dimensional array!
- **Satellite Data:** Satellite image data usually comes in the form of sets of 2-dimensional images, where each image includes information about light intensity at a given wavelength. When these images are stacked to, say, generate a color image, or identify wavelength combinations common to certain types of pollution, flood waters, or specific crops, you get a three-dimensional array.
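
To make the repeated-measurements case concrete, here is a rough sketch that stacks two survey waves into a three-dimensional array. The first wave reuses the toy survey values from earlier; the second wave's numbers, and the names `wave1`, `wave2`, and `panel`, are invented purely for illustration.

```r
# Wave 1: the toy survey from earlier
wave1 <- cbind(income = c(22000, 75000, 19000),
               age = c(20, 35, 55),
               education = c(12, 16, 11))

# Wave 2: made-up follow-up values for the same three respondents
wave2 <- cbind(income = c(23000, 78000, 19500),
               age = c(21, 36, 56),
               education = c(12, 16, 11))

# Stack the waves: respondents x variables x waves
panel <- array(c(wave1, wave2),
               dim = c(3, 3, 2),
               dimnames = list(NULL, colnames(wave1), c("wave1", "wave2")))

# Pull out a single wave (a regular matrix)
panel[, , "wave2"]

# Or follow one respondent's income across waves
panel[2, "income", ]
```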

So yeah, high dimensional arrays are a SUPER powerful data structure, and one you shouldn't shy away from.

## Recap

- Subsetting matrices is just like subsetting vectors, except with two entries between the square brackets: `[ , ]`.
- The first entry in the square brackets relates to rows, the second to columns.
- Like vectors, you can subset by index, by logical vector, or by name. 
- You can mix how you subset, and use a logical for rows and a name for columns. 
- If you subset a single row or a single column, you get back a vector.
- Subsetting on both rows and columns allows you to edit matrices in very powerful ways.
- Vectors and matrices are actually just special cases of "arrays" where N=1 and N=2. Arrays can have as many dimensions as you want, and are very powerful abstractions. 

## Exercises

Now that we've familiarized ourselves with matrices and matrix manipulation, [it's time to do some exercises!](./exercises/exercise_matrices.ipynb)