# Manipulating Matrices

The great thing about matrices is that since they are just generalizations of vectors from one dimension to two, subsetting matrices works *almost* the same way it works with vectors. Basically, instead of subsetting by passing an index or a logical array into a set of square brackets (e.g. `[1]`), we just put a comma in those square brackets and specify a location with *two* indices / logical arrays (e.g. `[1, 1]`).

## Subsetting by Index

Suppose we have the following matrix:

In [1]:
import numpy as np

our_matrix = np.arange(12).reshape((3, 4))
our_matrix


array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

To subset, we just pass a location along the first axis (rows) and a location along the second axis (columns). For example, if we wanted the entry from the second row and third column (as always, remembering the first entry in each dimension has index `0`), we'd type:

In [2]:
our_matrix[1, 2]


6

The one new thing is that if you want ALL entries along a specific dimension, you still put in a comma and type `:` for the dimension on which you want all observations. So if I wanted the second row, I'd just type:

In [4]:
our_matrix[1, :]


array([4, 5, 6, 7])

Or if I wanted the third column, I'd type:

In [5]:
our_matrix[:, 2]


array([ 2,  6, 10])

Note that if you pull out a subset of your matrix that is one-dimensional (like a single row or a single column), it just becomes a vector!
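To make that concrete, here is a small sketch comparing the `.shape` of two ways of pulling out the second row. The variable names `row_as_vector` and `row_as_matrix` are just illustrative. Selecting with a single index drops that dimension, while selecting with a length-one range keeps it:

```python
import numpy as np

our_matrix = np.arange(12).reshape((3, 4))

# A single integer index drops that dimension: the result is a 1-D vector.
row_as_vector = our_matrix[1, :]
print(row_as_vector.shape)  # (4,)

# A length-one range keeps the dimension: the result is still a 2-D matrix.
row_as_matrix = our_matrix[1:2, :]
print(row_as_matrix.shape)  # (1, 4)
```

Both hold the same four numbers; only the number of dimensions differs.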

Finally, just like with vectors, we can subset with ranges of indices if we want (remembering the final value in a range is NOT returned):

In [8]:
our_matrix[0:2, 2:4]


array([[2, 3],
       [6, 7]])
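As with ordinary Python ranges, either end of a range can be left out, in which case it runs from the start or to the end of that dimension. A minimal sketch, using the same matrix as above (the name `top_right` is just illustrative):

```python
import numpy as np

our_matrix = np.arange(12).reshape((3, 4))

# Leaving out the start means "from the beginning";
# leaving out the stop means "through the end".
top_right = our_matrix[:2, 2:]
print(top_right)

# This selects the same entries as spelling both ranges out in full.
print(np.array_equal(top_right, our_matrix[0:2, 2:4]))  # True
```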

## Subsetting with Logicals

Subsetting with logical vectors also generalizes from vectors to matrices in the same way. To illustrate, let's go back to our toy matrix of survey responses where each row represents a different person, and the columns represent respondent age, income, and years of education:


In [9]:
import numpy as np

survey = np.array(
    [[20, 22_000, 12], [35, 65_000, 16], [55, 19_000, 11], [45, 35_000, 12]]
)

survey


array([[   20, 22000,    12],
       [   35, 65000,    16],
       [   55, 19000,    11],
       [   45, 35000,    12]])

If we wanted to select all the rows where income was less than the US median income (about 64,000), we would first extract the income column, then create a logical vector that's `True` if income is below 64,000, then put that in the first position of our square brackets:

In [10]:
income = survey[:, 1]
income


array([22000, 65000, 19000, 35000])

In [11]:
below_median = income < 64000
below_median


array([ True, False,  True,  True])

In [12]:
survey[below_median, :]


array([[   20, 22000,    12],
       [   55, 19000,    11],
       [   45, 35000,    12]])

Or, of course, we could do that all in one line instead of breaking out the steps:

In [13]:
survey[survey[:, 1] < 64000, :]


array([[   20, 22000,    12],
       [   55, 19000,    11],
       [   45, 35000,    12]])
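Logical vectors can also be combined before subsetting. The sketch below is one way to do it, assuming we want respondents who are both below the median income and over 40. The age cutoff and the variable names are made up for illustration, and `&` is NumPy's element-wise "and":

```python
import numpy as np

survey = np.array(
    [[20, 22_000, 12], [35, 65_000, 16], [55, 19_000, 11], [45, 35_000, 12]]
)

below_median = survey[:, 1] < 64_000  # one True/False per row
over_40 = survey[:, 0] > 40           # one True/False per row

# A row is kept only if BOTH conditions are True for it.
selected = survey[below_median & over_40, :]
print(selected)
```

If you write the conditions inline instead of naming them first, wrap each comparison in parentheses, because `&` binds more tightly than `<` and `>`.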

## Subsetting by Row and Column Simultaneously

Often, we don't just want to subset rows or columns, but both at once. For example, suppose I wanted the education levels of everyone with incomes below the US median. I could do this in two steps by subsetting rows and then subsetting columns:

In [14]:
below_median = survey[survey[:, 1] < 64000, :]
below_median[:, 2]


array([12, 11, 12])

Or I can do it all in one command!

In [15]:
survey[survey[:, 1] < 64000, 2]


array([12, 11, 12])

So what is the average education of people earning less than the median income in the US in our toy data?

In [16]:
np.mean(survey[survey[:, 1] < 64000, 2])


11.666666666666666
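The same pattern answers the follow-up question of how that compares with everyone else. As a rough sketch, `~` flips a logical vector, so it selects exactly the rows the original vector left out (the variable names here are illustrative):

```python
import numpy as np

survey = np.array(
    [[20, 22_000, 12], [35, 65_000, 16], [55, 19_000, 11], [45, 35_000, 12]]
)

below_median = survey[:, 1] < 64_000

# Average education for rows where the condition is True...
mean_below = np.mean(survey[below_median, 2])

# ...and for the remaining rows, using ~ to flip True and False.
mean_rest = np.mean(survey[~below_median, 2])

print(mean_below, mean_rest)
```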

OK -- I know we've just covered a lot, but hopefully that example makes clear how quickly we can start doing really, really powerful analyses and answering substantive questions *just by subsetting our data carefully.*

### Naming Rows and Columns

If you've worked with matrices or data frames in other languages, at this point you may be saying "Why do I have to identify my columns by index?! Can't I give them nice, human-readable names?" 

The answer is yes, numpy does provide some utilities for naming rows and columns. However, they're a little clunky, and so most people interested in being able to name their columns end up using a different library called *pandas* that will be the focus of one of our later courses. Pandas is a library that is built on top of numpy -- so it's really important we learn numpy before we learn pandas! -- and provides a lot of tools to make numpy easier to use, like the ability to easily give your columns human-readable names.
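In the meantime, one lightweight workaround is to store the column positions in ordinary Python variables. This is not a numpy feature, just a convention. The names `AGE`, `INCOME`, and `EDUCATION` below are made up for this sketch:

```python
import numpy as np

# Hypothetical names for the column positions in our toy survey matrix.
AGE, INCOME, EDUCATION = 0, 1, 2

survey = np.array(
    [[20, 22_000, 12], [35, 65_000, 16], [55, 19_000, 11], [45, 35_000, 12]]
)

# Same calculation as before, but easier to read at a glance.
avg_education = np.mean(survey[survey[:, INCOME] < 64_000, EDUCATION])
print(avg_education)
```

The names are only a convenience for the reader: numpy still sees plain integer indices.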

## Using Subsets to Modify Data

Sometimes we want to modify a *part* of a matrix. For example, suppose we were working with our survey data, and we want to multiply all the income values by `1.02` to adjust for inflation that has occurred since the survey. Obviously, if we just multiplied the whole matrix by `1.02`, we'd also modify things like education and age:

In [17]:
survey * 1.02


array([[2.040e+01, 2.244e+04, 1.224e+01],
       [3.570e+01, 6.630e+04, 1.632e+01],
       [5.610e+01, 1.938e+04, 1.122e+01],
       [4.590e+01, 3.570e+04, 1.224e+01]])

What we can do instead is extract the column with income, modify it, then replace the old income column with our updated column:

In [18]:
income_column = survey[:, 1]  # Extract income
adjusted_income = income_column * 1.02  # Adjust income
survey[:, 1] = adjusted_income  # Replace income with new values!
survey


array([[   20, 22440,    12],
       [   35, 66300,    16],
       [   55, 19380,    11],
       [   45, 35700,    12]])

Or, if we wanted, we could actually do all this in one step:

In [19]:
# Re-make survey so it hasn't been adjusted for inflation
survey = np.array(
    [[20, 22_000, 12], [35, 65_000, 16], [55, 19_000, 11], [45, 35_000, 12]]
)


In [20]:
# Now adjust income in one step!
survey[:, 1] = survey[:, 1] * 1.02
survey


array([[   20, 22440,    12],
       [   35, 66300,    16],
       [   55, 19380,    11],
       [   45, 35700,    12]])
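One detail worth knowing about the assignments above: `survey` holds integers, so when float results are written back into it, numpy converts them to integers and drops anything after the decimal point. In our toy data the adjusted incomes happen to be whole numbers, so nothing visible is lost. The sketch below uses made-up incomes where the difference shows, and assumes converting to floats with `.astype(float)` is acceptable for your data:

```python
import numpy as np

incomes = np.array([19_999, 35_001])  # integer array

# Multiplying by 1.02 produces floats (about 20398.98 and 35701.02)...
adjusted = incomes * 1.02

# ...but writing them back into an integer array drops the decimals.
incomes[:] = adjusted
print(incomes)  # [20398 35701]

# Converting to floats first keeps the fractional part.
incomes_float = np.array([19_999, 35_001]).astype(float)
incomes_float[:] = incomes_float * 1.02
print(incomes_float)
```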

And this is *especially* powerful if we subset on BOTH rows and columns. Suppose, for example, we wanted to see what people's incomes would look like if anyone who didn't finish high school (`education < 12`) got a tax credit of 10,000 dollars:

In [21]:
survey[survey[:, 2] < 12, 1] = survey[survey[:, 2] < 12, 1] + 10000


In [22]:
survey


array([[   20, 22440,    12],
       [   35, 66300,    16],
       [   55, 29380,    11],
       [   45, 35700,    12]])
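The same update can be written a little more compactly by storing the row condition in a variable and using `+=`. This is just one possible style, sketched here on a fresh copy of the survey data without the inflation adjustment. The name `no_high_school` is illustrative:

```python
import numpy as np

survey = np.array(
    [[20, 22_000, 12], [35, 65_000, 16], [55, 19_000, 11], [45, 35_000, 12]]
)

# Build the row condition once so we don't have to repeat it.
no_high_school = survey[:, 2] < 12

# Add the credit to the income column, but only in the selected rows.
survey[no_high_school, 1] += 10_000
print(survey)
```

Only the third respondent has fewer than 12 years of education, so only that income changes (from 19,000 to 29,000).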

And that's it! Now you're a matrix pro. 

## Recap

- Subsetting matrices is just like subsetting vectors, except with two entries between the square brackets: `[ , ]`.
- The first entry in the square brackets relates to rows, the second to columns.
- Like vectors, you can subset by index or boolean vector. 
- You can mix how you subset, and use a boolean vector for rows and an index for columns.
- If you select a single row or a single column, you get back a vector.
- Subsetting on both rows and columns allows you to edit matrices in very powerful ways.

## Exercises

Now that we've familiarized ourselves with matrices and matrix manipulation, [it's time to do some exercises!](./exercises/exercise_matrices.ipynb)