# Manipulating Vectors

In our last reading, we learned what vectors are, and how to do operations on entire vectors. But often we want to work with *subsets* of a vector. Indeed, extracting a subset of elements from a vector is an extremely important task, not least because it generalizes nicely to datasets (which are at the heart of data science). This process (whether applied to a vector or a dataset) is often referred to as "taking a subset", "subsetting", or "filtering". If there is one skill you need to master as quickly as possible, it's this.

Subsetting can be accomplished in two ways: 

- By index
- By boolean vectors

## What is Subsetting?

As you've probably already realized, vectors don't just contain a jumble of data -- they also have a concept of "order". In particular, vector data is organized along a single dimension (in a line, just as data is organized in a list). So when I create a vector with `42, 47, -1`, I have in mind that 42 is the first entry, 47 is the second, and -1 is the third. And we can use that concept of order to subset vectors by passing the index (order number) of an entry we want to our vector in square brackets. For example, consider the following vector:

In [1]:
import numpy as np
a = np.array([42, 47, -1])
a

array([42, 47, -1])

If I wanted to pull out the second entry in that vector, I could do so with *array indexing* using square brackets `[]` (remember that indexes start at 0 in Python, so the second entry is at index `1`):

In [2]:
a[1]

47

And if I want to assign that second entry to a new variable, I can!

In [3]:
new = a[1]
new

47

But what, exactly, is happening when I subset? Let's return to the idea that a variable is just a box holding some data, and walk through the following block of code:

```python
a = np.array([42, 47, -1])
new = a[1]
```

In the first line of code, we create a new vector with three entries and assign it to the variable `a`. Just as in our previous reading, we can think of the variable `a` as a box that is holding this new vector.

![vector_subsetting1](images/vector_subsetting1.png)

In the second line, the first thing that happens is Python evaluates the expression on the right side of the assignment operator: `a[1]`. The use of `a` and square brackets indicates to Python that we're trying to access a portion of the data stored in the box labelled `a`. In particular, by putting a `1` between the square brackets, we're telling Python we want the second item in the box `a`: `47`.

![vector_subsetting2](images/vector_subsetting2.png)

Then when we assign that value -- 47 -- to `new`, we create a new variable, and insert our data into that box:

![vector_subsetting3](images/vector_subsetting3.png)

This `variable[]` notation is something we'll use a *lot* with numpy, and it will always mean the same thing: we're trying to access some of the data stored in the box `variable`.

**Note:** we're making one small simplification in the discussion above that, if you've worked with numpy a lot, you may notice. Don't worry -- we'll address that in a later reading!

## Subsetting By Index



What we just did is an example of subsetting by index, where we just specify the location (index) of the data we want:

In [4]:
a = np.array([42, 47, -1])
a[1]

47

But we can also pass a list or numpy array of indices to get a subset of entries:

In [5]:
a[[0, 2]]

array([42, -1])

Or with an array:

In [6]:
zero_and_two = np.array([0, 2])
zero_and_two


array([0, 2])

In [7]:
a[zero_and_two]

array([42, -1])

Also, you don't have to subset entries in order! If you pass indices out of order, you'll get a vector with a new order!

In [8]:
a[[2, 0]]

array([-1, 42])

Again, this is all working the same way as our example with just one entry -- Python interprets the square brackets as a request for some data in the box `a`, and if we pass multiple indices, it just grabs multiple items from that box.
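
As a small aside that goes slightly beyond the examples above, the list of indices can also repeat an index, in which case the corresponding entry simply shows up more than once. Here's a minimal sketch (the variable name `picked` is just for illustration):

```python
import numpy as np

a = np.array([42, 47, -1])

# The result has one entry per index we pass, in the order we pass them,
# so repeating an index repeats the entry.
picked = a[[2, 0, 0]]
print(picked)  # [-1 42 42]
```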

## Subsetting with Logicals

Subsetting with logicals is a little hard to explain, so instead let's jump right into an example. 

Suppose we have a vector of strings with only two elements ("apple" and "banana"). Subsetting it to "apple" could be done by passing a list of booleans (our "logicals") as follows:

In [9]:
fruits = np.array(["apple", "banana"])
fruits[[True, False]]

array(['apple'], dtype='<U6')

Within these brackets is a vector with the same number of logical elements as there are elements in the vector you want to subset. Elements across the two vectors are matched by order: elements that match with `True` are kept while elements that match with `False` are dropped.
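
To make the "same number of elements" requirement concrete, here's a small sketch with a three-element vector. The names `keep`, `kept`, and `too_short` are made up for this illustration, and the error shown is what I'd expect from recent versions of numpy (older versions may behave differently):

```python
import numpy as np

fruits = np.array(["apple", "banana", "cherry"])

# One boolean per element: keep the first and third entries.
keep = np.array([True, False, True])
kept = fruits[keep]
print(kept)  # ['apple' 'cherry']

# A mask with the wrong number of entries can't be lined up with the vector.
too_short = np.array([True, False])
try:
    fruits[too_short]
    mismatch_failed = False
except IndexError as err:
    mismatch_failed = True
    print("IndexError:", err)
```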

Visualized with the same tools we used before, we can draw out what's happening in this block of code:

In [10]:
a = np.array([42, 47, -1])
my_subset = np.array([True, False, True])
b = a[my_subset]
b

array([42, -1])

First we create `a`:

![logical_subset_1](images/logical_subset_1.png)

Then we create `my_subset`:

![logical_subset_2](images/logical_subset_2.png)

Then the magic: numpy lines up the entries in the data in the box labelled `a` and the data in the box labelled `my_subset`, and keeps any entries from `a` that line up with values of `my_subset` that are `True`. 

Then it assigns the values in `a` that line up with `True`s in `my_subset` to a new variable `b`:

![logical_subset_3](images/logical_subset_3.png)

### Logical Operations



This process is extremely useful when combined with *logical operations* that test each entry of a vector against a condition. For example, we can use the logical "equals" (written `==`) to say "be true if values are equal", and the logical "not equals" (written `!=`) to say "be true if values are not equal".

However, when working with numpy arrays, we can't use the logical operations `or`, `and`, and `not` we previously learned in Python. Instead we have to use `&` for "and", `|` for "or", and `~` for "not".
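
Here's a quick sketch of those three operators on a pair of small boolean arrays (the arrays `is_small` and `is_even` are invented for this example), along with what happens if we reach for plain `and` by mistake:

```python
import numpy as np

is_small = np.array([True, True, False])
is_even = np.array([True, False, False])

both = is_small & is_even     # element-wise "and"
either = is_small | is_even   # element-wise "or"
not_small = ~is_small         # element-wise "not"

print(both)       # [ True False False]
print(either)     # [ True  True False]
print(not_small)  # [False False  True]

# Python's plain `and` tries to turn each whole array into a single
# True/False, which numpy refuses to do for arrays with several entries.
try:
    is_small and is_even
    plain_and_failed = False
except ValueError:
    plain_and_failed = True
```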

To illustrate, let's use a logical operation to filter a larger vector of numbers:

In [11]:
# Create a numeric vector
numbers = np.arange(10, 110, 10)
numbers


array([ 10,  20,  30,  40,  50,  60,  70,  80,  90, 100])

In [12]:
# Get small numbers:
numbers[numbers <= 50]

array([10, 20, 30, 40, 50])

And we can also combine logical conditions. When we do so, however, note that we have to wrap each test in `()` so the tests are evaluated before they are combined. For example:

In [13]:
numbers[(numbers < 30) | (numbers == 100)]

array([ 10,  20, 100])

If you don't wrap your two tests in parentheses, you'll run into trouble and get this error:

```python
numbers[numbers < 30 | numbers == 100]

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/folders/tj/s8f2_ks15h315z5thvtnhz8r0000gp/T/ipykernel_13964/3746007904.py in <module>
----> 1 numbers[numbers < 30 | numbers == 100]

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
```

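(Why? Well... that gets *really* complicated fast, but basically Python evaluates the `|` before the comparisons `<` and `==`, so the first thing it computes is `30 | numbers`. And while we use `|` as the logical "or", when neither side is of type bool it actually does some weird bit-level manipulations, generating a new array of numbers instead of an array of booleans. Python is then left with a chain of comparisons between arrays that it can't reduce to a single true-or-false answer, which is what the error message is complaining about. So... just use the parentheses. 🤷‍♂️ )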
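
If you'd like to see the difference for yourself, here's a rough sketch comparing what `|` produces with and without the parentheses. The names `bitwise` and `mask` are just for illustration:

```python
import numpy as np

numbers = np.arange(10, 110, 10)

# Without parentheses, Python evaluates `30 | numbers` first. Because both
# sides are integers, `|` here is a bitwise "or" and yields integers.
bitwise = 30 | numbers
print(bitwise.dtype.kind)  # i (an integer array, not a boolean one)

# With parentheses, each comparison is evaluated first and gives a boolean
# array, so `|` acts as the element-wise logical "or".
mask = (numbers < 30) | (numbers == 100)
print(mask.dtype)     # bool
print(numbers[mask])  # [ 10  20 100]
```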

Note that there's nothing magic about putting these booleans inside square brackets -- Python is just evaluating the code inside the square brackets, returning an array of type `bool`, and then using that to subset the original array. Indeed, we can move the construction of these `bool` arrays outside of the square brackets if we want:

In [15]:
# Get only the middle set of numbers

middle_number = (30 < numbers) & (numbers < 80)
middle_number

array([False, False, False,  True,  True,  True,  True, False, False,
       False])

In [16]:
numbers[middle_number]

array([40, 50, 60, 70])
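
As an optional aside, the two ways of subsetting are closely related. The sketch below uses `np.flatnonzero`, a numpy function we haven't otherwise covered in this reading, to turn a `bool` array into the matching array of indices; subsetting with either one should give the same result. The name `middle_index` is made up for this example:

```python
import numpy as np

numbers = np.arange(10, 110, 10)
middle_number = (30 < numbers) & (numbers < 80)

# Positions where the boolean array is True.
middle_index = np.flatnonzero(middle_number)
print(middle_index)           # [3 4 5 6]
print(numbers[middle_index])  # [40 50 60 70]
```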

## Using Subsetting to Modify Vectors

The subsetting logic from above isn't just for extracting subsets of vectors to analyze -- it's also useful for modifying vectors. The idea here is that instead of keeping elements that meet a logical condition or occur at a specific index, we can change them. For example, what if we had mis-entered grandpa's age in a vector of family ages? We can fix it using indexing or a logical statement.

In [17]:
# Recreate vector with age values
age = np.array([50, 55, 80])
age

array([50, 55, 80])

In [18]:
age[age == 80] = 82 # using a logical statement
age

array([50, 55, 82])

In [19]:
age[1] = 45         # using indexing
age

array([50, 45, 82])

Note that we can also make modifications to subsets of a vector by using subsets on BOTH sides of the assignment operator.

For example, say we wanted to see how old everyone would be in five years, but Grandpa says we're not allowed to change his age because he's decided he wants to keep saying he's 80 until he dies. Oh, Grandpa.

So what we want to do is take our ages and increase them all by 5. If Grandpa weren't so annoying, we could just do:

```python
age = age + 5
```

But Grandpa says we can't increase his age. So we have to (a) pull out the ages that are less than 80, (b) increment them up by 5, and (c) re-insert them, replacing the older age values for people under 80.

In [20]:
age = np.array([50, 55, 80])

# Get younger ages
younger_ages = age[age < 80]
younger_ages

array([50, 55])

In [21]:
# Make them all five years older
new_ages = younger_ages + 5
new_ages

array([55, 60])

In [22]:
# Re-insert
age[age < 80] = new_ages
age

array([55, 60, 80])

Note that this last operation worked because the vector on the left side of the assignment operator had a length of two, and the new vector on the right-hand side was also of length two, so numpy could match the entries being subset on the left to entries on the right one-to-one.

But while we *can* do this in all these separate steps, we can also collapse this:

In [23]:
age = np.array([50, 55, 80])

In [24]:
age[age < 80] = age[age < 80] + 5
age

array([55, 60, 80])

Again, note this only worked because we were careful to ensure that the vector on the right of the assignment operator "fit" into the space being subset on the left! This is a trick we use a lot in data science, so make sure you're comfortable with it before proceeding. 
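
To see what "fitting" means in practice, here's a small sketch of what I'd expect to happen when the right-hand side has the wrong number of entries, plus the one convenient exception: a single value, which numpy copies into every selected slot. The replacement values here are made up purely for illustration:

```python
import numpy as np

age = np.array([50, 55, 80])

# Two slots selected on the left, three values on the right: no one-to-one match.
try:
    age[age < 80] = np.array([1, 2, 3])
    assignment_failed = False
except ValueError as err:
    assignment_failed = True
    print("ValueError:", err)

# A single value is the exception: it is copied into every selected slot.
age[age < 80] = 0
print(age)  # [ 0  0 80]
```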

## Modifying Vectors and Data Types

You may not have noticed, but up till now we've only been doing "like-for-like" substitutions. For example, when we changed an entry in `age`, we were always replacing one `int` with another.

This is important, because as we discussed in our last reading, vectors are *homogeneously typed*, meaning that unlike lists, you can't put different types of data in an array.

Now when we're *creating* a vector, numpy will use type promotion to pick a type that accommodates everything you're putting into an array. For example, if I pass both bools and integers to `np.array()`, it will just type promote everything to be integers:

In [25]:
np.array([True, False, 7])

array([1, 0, 7])
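
If you want to check which type numpy settled on, you can look at the array's `dtype`. The sketch below uses `.dtype.kind`, a one-letter summary of the type, because the exact integer and float sizes can vary by platform; the variable names are just for illustration:

```python
import numpy as np

# bool + int -> int
mixed_bool_int = np.array([True, False, 7])
# int + float -> float
mixed_int_float = np.array([1, 2, 3.5])

# `.dtype.kind` is a one-letter code: "b" bool, "i" integer, "f" float.
print(mixed_bool_int.dtype.kind)   # i
print(mixed_int_float.dtype.kind)  # f
print(mixed_int_float)             # [1.  2.  3.5]
```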

But once a vector has been created, numpy stops being so considerate: if you try and cram data of a different type into a vector of a given type, it will try to *coerce* the data into the established type of the array. 

For example, if we try and cram 7 into an array that's already of type `bool`, numpy will *coerce* 7 into type bool (i.e. run `bool(7)`), which will turn `7` into `True` *even though this is causing information to be lost*:

In [26]:
bool_vector = np.array([True, False])
bool_vector

array([ True, False])

In [27]:
bool_vector[1] = 7
bool_vector

array([ True,  True])

Similarly, if you try and put a floating point number into an integer vector, that float will be type coerced into an integer, which is accomplished by just truncating any information after the decimal:

In [28]:
int_vector = np.array([1, 2, 3])
int_vector

array([1, 2, 3])

In [29]:
int_vector[0] = 42.989723798729874
int_vector

array([42,  2,  3])

This is why, as we mentioned in the last reading, you might not always want to let numpy pick your datatypes for you. Suppose, in the example above, you know you might later need to put a floating point number into `int_vector` -- you could instead tell numpy to make it a floating point vector *at creation*:

In [30]:
no_longer_an_int_vector = np.array([1, 2, 3], dtype="float")
no_longer_an_int_vector[0] = 42.989723798729874
no_longer_an_int_vector

array([42.9897238,  2.       ,  3.       ])

I know this can be a little confusing, so here's a recap:

- When *creating* a vector, numpy will do everything it can to ensure that you don't lose any information by type *promoting* your data to the lowest type that *preserves all the information in your data*. 
- Once a vector has been created, numpy's hands are tied, so it will use type *coercion* to force the data you're trying to put into your existing vector into the established type, *even if that causes information loss.*
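
And if you already have an integer vector and only later realize you need decimals, one option (sketched here as a suggestion, not the only way to do it) is to make a floating point copy with the array's `.astype` method before inserting the new value. The name `float_vector` is just for illustration:

```python
import numpy as np

int_vector = np.array([1, 2, 3])

# `.astype` returns a new array with the requested type; `int_vector`
# itself is left unchanged.
float_vector = int_vector.astype(float)
float_vector[0] = 42.5
print(float_vector)  # [42.5  2.   3. ]
print(int_vector)    # [1 2 3]
```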

## Recap

- Vectors can be subset by index or with a logical
- Subsetting with logicals allows you to extract subsets based on the values of vector elements. 
- Assigning values to subsets modifies subsets of vectors. 