# Vectors

Now that we have a solid grasp of how R interprets code, how to assign values to variables, and some of the main types of data in R, we can turn to the fundamental building block of R: vectors. In this reading, we'll learn how to use vectors to manage data on things like age, height, eye color, GDP per capita, and war initiation. By the end of this tutorial, you'll know how to create vectors, how to subset them, how to modify them, and how to summarize them.

## Why Do We Need Vectors? 

It is rarely the case in science that we work with singular values (e.g. the number `7`, or a person's name `"Jill"`). Most of the time, we're working with lots of similar observations from different entities, such as the ages of everyone who responded to a survey, or the GDP of all the countries in the world.

To accommodate this need, one of the objects you'll use most in R is a vector. A vector is a **collection** of values, all of the same data type (e.g. we might have a `numeric` vector full of the ages of all survey respondents, or a `character` vector full of the names of countries). 

In fact... OK, it's time to come clean: you've been working with vectors this whole time! Vectors are *so* fundamental to R that **all data in R is stored as vectors.** Even something simple like the number 7, in R, is actually stored as a numeric vector of length 1:

In [10]:
a <- 7
length(a)
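
To check that claim on a couple of other single values, here is a minimal sketch. The variable names are just illustrative.

```R
# Single values are stored as vectors of length 1
single_number <- 7
single_name <- "Jill"

is.vector(single_number)   # TRUE
length(single_name)        # 1 (one element, not the number of letters)
```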

## Creating vectors

In our previous exercises, you learned how to create vectors with a single entry using the assignment operator. But often we want vectors with more than one entry (otherwise, why have vectors?). There are a number of ways to create vectors in R, but the most fundamental is the `c()` function ("c" is short for "combine", though you'll also hear it described as "concatenate"):

In [2]:
# Numeric vectors 
a_numeric_vector <- c(20, 25, 60, 55)
a_numeric_vector


In [3]:
# Character vectors
a_character_vector <- c("Red", "Green", "Purple")
a_character_vector


In [5]:
# Logical vectors
a_logical_vector <- c(TRUE, FALSE, TRUE) 

`c()` doesn't just work with values you write by hand, though -- you can also use `c()` with variables to combine longer vectors:

In [12]:
a <- c(1, 2, 3)
b <- c(4, 5, 6)
combined <- c(a, b)  # avoid naming a variable `c`, since that's also the function's name
combined

There are also a lot of other convenience functions for creating commonly used vectors. For example, it's very common to want to get a vector of sequential numbers, so if you type `1:20` you get a vector of all of the counting numbers from 1 to 20:

In [17]:
1:20

And if you want a more unusual sequence, you can use `seq()`, which takes a starting point, an ending point, and a step size to create any sequence you may want. Here are all the even numbers from 2 to 20:

In [21]:
seq(2, 20, 2)

(This is the first time we've seen a function that takes multiple inputs, called **arguments**. These functions work just like the ones that take only one argument, except that the different arguments are separated by commas.)
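
Arguments can also be passed by name, which can make a call easier to read. The sketch below repeats the `seq()` call from above with its arguments named, then uses `length.out` to ask for a fixed number of values instead of a step size. The variable names `evens` and `quarters` are just for illustration.

```R
# Same call as seq(2, 20, 2), with the arguments named explicitly
evens <- seq(from = 2, to = 20, by = 2)
evens

# Ask for 5 evenly spaced values between 0 and 1 instead of giving a step size
quarters <- seq(from = 0, to = 1, length.out = 5)
quarters     # 0.00 0.25 0.50 0.75 1.00
```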

And if you want a vector of length `N` with the same value repeated over and over, you can use `rep()`:

In [22]:
# Create a vector with the value 42 
# repeated 10 times. 

rep(42, 10)
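
`rep()` also works when the thing being repeated is itself a longer vector. As a small sketch (the names `alternating` and `grouped` are made up for this example), the `times` and `each` arguments control whether the whole vector or each element is repeated:

```R
# Repeat the whole vector three times
alternating <- rep(c("yes", "no"), times = 3)
alternating   # "yes" "no" "yes" "no" "yes" "no"

# Repeat each element three times before moving on to the next
grouped <- rep(c("yes", "no"), each = 3)
grouped       # "yes" "yes" "yes" "no" "no" "no"
```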

## Vector Math

One of the great things about vectors is that we can do all sorts of mathematical operations to vectors efficiently.

If you do math with two vectors, one of which has length one, you basically just get the operation applied to every entry!


In [30]:
# Here's what we'll start with
numbers <- 1:10
numbers


In [31]:
# You can modify all values in a vector 
# by doing math with a vector of length 1
numbers / 10

In [29]:
numbers + 10

And we can do the same trick with lots of math functions:

In [34]:
# Modify a vector using a function
sqrt(numbers) #square root


In [36]:
exp(numbers) #exponentiate

Or if you have two vectors of the same length, mathematical operations will occur "element-wise", meaning the mathematical operation will be applied to the two first entries, then the two second entries, etc. For example, if we were to add our vector of the values 1 through 10 to a vector with five 0s, then five 1s, R would do the following:

```
 1  +  0  =   1
 2  +  0  =   2
 3  +  0  =   3
 4  +  0  =   4
 5  +  0  =   5
 6  +  1  =   7
 7  +  1  =   8
 8  +  1  =   9
 9  +  1  =  10
10  +  1  =  11
```

In [40]:
# Two vectors with the same number of elements 
numbers2 <- c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)
numbers3 <- numbers2 + numbers
numbers3
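
Element-wise math is handy any time two vectors describe the same entities in the same order. Here is a sketch using made-up figures for three hypothetical countries (the numbers are invented purely for illustration):

```R
# Made-up figures for three hypothetical countries, in the same order
gdp <- c(1000, 4500, 300)      # total GDP, in billions of dollars
population <- c(50, 90, 10)    # population, in millions

# Element-wise division: first by first, second by second, and so on.
# Billions divided by millions gives thousands of dollars per person.
gdp_per_capita <- gdp / population
gdp_per_capita                 # 20 50 30
```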


Vector arithmetic can also be carried out in R on two multi-value vectors with different lengths using the [recycling rule](http://www.r-tutor.com/r-introduction/vector/vector-arithmetics), but that's rarely something you want to do on purpose. It's a *weird* behavior! :)
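
So you can recognize recycling if you run into it by accident, here is a short sketch of what it looks like (the vector names are just for illustration):

```R
short <- c(10, 20)
long <- c(1, 2, 3, 4)

# The shorter vector is reused as 10, 20, 10, 20
recycled <- long + short
recycled     # 11 22 13 24

# If the longer length is not a multiple of the shorter one,
# R still recycles, but it also prints a warning
uneven <- c(1, 2, 3) + c(10, 20)
uneven       # 11 22 13
```

Notice that the first case runs silently, which is exactly why this behavior can hide mistakes.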

## Summarizing vectors 

We often want to get summary statistics from a vector --- that is,
learn something general about it by looking beyond its constituent
elements. If we have a vector in which each element represents a
person's height, for example, we may want to know who the shortest or tallest
person is, what the median or mean height is, what the standard deviation is. 

So for example, we can use `summary(numbers)` to get a lot of summary stats at once:

In [42]:
summary(numbers)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    3.25    5.50    5.50    7.75   10.00 

Or we can use any one of a handful of other helper functions!

```R
class(numbers) #check the class
length(numbers) #number of elements 
max(numbers) #maximum value
min(numbers) #minimum value
sum(numbers) #sum of all values in the vector
mean(numbers) #mean
median(numbers) #median
var(numbers) #variance
sd(numbers) #standard deviation
quantile(numbers) #percentiles in intervals of .25 
quantile(numbers, probs = seq(0, 1, 0.1)) #percentiles in intervals of 0.1
```

**Don't** worry about memorizing these -- you just need to have a sense of the kinds of things you can do with functions. If you ever need one and can't remember its name, you can google it to get the specific function name.
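
To tie this back to the heights example, here is a small sketch with made-up heights. Note that `which.max()` isn't covered in this reading; it's a base R function that returns the *position* of the largest value, and it's included here only as an illustration.

```R
# Hypothetical heights in centimeters
heights <- c(160, 175, 182, 168, 190)

mean(heights)        # 175
median(heights)      # 175
which.max(heights)   # 5, meaning the fifth person is the tallest

# The variance is the standard deviation squared
all.equal(sd(heights)^2, var(heights))   # TRUE
```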

## Subsetting Vectors

Now we come to one of the most important manipulations you'll need to know how to do with vectors: subsetting! 

Extracting a subset of elements from a vector is an extremely important task, not least because it generalizes nicely to datasets (which are at the heart of data science). This process --- whether applied to a vector or a dataset --- is often referred to as "taking a subset", "subsetting", or "filtering". If there is one skill you need to master as quickly as possible, it's this.

Subsetting can be accomplished in several ways, but we'll focus on the two most powerful: 

- By index
- With logical vectors (remember I said logicals would be important? :))


### Subsetting By Index

As you've probably already realized, vectors don't just contain a jumble of data -- they also have a concept of "order". When I create a vector with `c(42, 47, -1)`, I have in mind that 42 is the first entry, 47 is the second, and -1 is the third. And we can use that concept of order to subset vectors by passing the index (order number) of an entry we want to our vector in square brackets:

In [45]:
a <- c(42, 47, -1)
a[2]

Note the use of brackets, `[]` --- this is common when filtering, and we'll use it a lot!

But of course, because everything in R is a vector, if I can pass a single index, then I can pass any other numeric vector of indices, either directly:

In [46]:
a[c(1, 3)]

Or as a variable:

In [47]:
indices <- c(1, 3)  # `subset` is also the name of a built-in R function, so we avoid it
a[indices]
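
Since any numeric vector of indices will do, the other vector-creation tricks from earlier work here too. A brief sketch, re-using the same three-element vector (the result names are just for illustration):

```R
a <- c(42, 47, -1)

# A sequence of indices picks out a range of entries
last_two <- a[2:3]
last_two       # 47 -1

# Indices can appear in any order
reordered <- a[c(3, 1)]
reordered      # -1 42

# ...and can even be repeated
repeated <- a[c(1, 1, 2)]
repeated       # 42 42 47
```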

### Subsetting with Logicals

Subsetting with logicals is a little hard to explain, so instead let's jump right into an example. 

Suppose we have a character vector with only two elements ("apple" and "banana"). Subsetting it to "apple" could be done by passing a logical vector as follows:

In [48]:
fruits <- c("apple", "banana")
fruits[c(TRUE, FALSE)]

Within these brackets is a vector with the same number of logical elements as there are elements in the vector you want to subset. Elements across the two vectors are matched by order: elements that match with `TRUE` are kept while elements that match with `FALSE` are dropped.
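
In practice we rarely type out the `TRUE`s and `FALSE`s by hand. Instead, we let a comparison build the logical vector for us. A minimal sketch, using made-up ages:

```R
ages <- c(20, 25, 60, 55)

# A comparison is applied element-wise and returns a logical vector
is_over_30 <- ages > 30
is_over_30         # FALSE FALSE TRUE TRUE

# Entries lined up with TRUE are kept; the rest are dropped
over_30 <- ages[is_over_30]
over_30            # 60 55
```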

This process is extremely useful when paired with *logical operations* that combine multiple conditions. For example, you can use:

- the logical "and" (written `&` in R) to say "only be true if both conditions are true", or
- the logical "or" (written `|`) to say "be true if at least one of these conditions is true".
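
Like other vector operations, `&` and `|` work element-wise. Here is a small sketch that shows every combination of `TRUE` and `FALSE` (the names `x` and `y` are arbitrary):

```R
x <- c(TRUE, TRUE, FALSE, FALSE)
y <- c(TRUE, FALSE, TRUE, FALSE)

both <- x & y
both      # TRUE FALSE FALSE FALSE

either <- x | y
either    # TRUE TRUE TRUE FALSE
```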

For example, using a logical operation we can filter a large vector of oranges, apples and bananas:

In [49]:
# Create a vector with 30 fruits 
fruits <- rep(c("orange", "apple", "banana"), 10)
fruits 


In [50]:
# Create a logical vector for dropping bananas

orange_or_apple <- fruits == "orange" | fruits == "apple" # True if orange or apple
not_banana <- fruits != "banana"                            # != means true if not equal
orange_or_apple2 <- fruits %in% c("orange", "apple")

# Carry out the subset
fruits[orange_or_apple]

In [51]:
fruits[not_banana]

In [52]:
fruits[orange_or_apple2]

We applied the same logic as above: We have a vector (`fruits`) that
we want to subset. We do so using a logical vector (`orange_or_apple`, `not_banana`, and `orange_or_apple2`), where elements that match with `TRUE` are kept. The only difference here is that we create the logical vector with a logical operation. The logical operators (e.g., `!=`, `|`) used here are discussed in the link above, with the exception of `%in%`. 

<div class="general-note">

<strong> General note about `%in%`: </strong> This operator is
extremely useful as an alternative for repeated "or" (`|`) statements. For example, say you have a vector with 10 types of fruits and you want to keep elements that are equal to "orange", "apple", "mango",
"mandarin", or "kiwi". You could accomplish this by creating a logical vector like so: `lv <- fruits == "orange" | fruits == "apple" | fruits == "mango" | fruits == "mandarin" | fruits == "kiwi"`.  

<br> What a nightmarishly long statement compared to the `%in%` option that accomplishes the exact same thing: `lv <- fruits %in% c("orange", "apple", "mango", "mandarin", "kiwi")`.

</div>
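
If you'd like to convince yourself that the two approaches agree, here is a sketch on a small made-up vector. It's called `basket` here so that it doesn't overwrite `fruits`.

```R
# A small hypothetical vector to filter
basket <- c("orange", "kiwi", "banana", "mango", "pear")

# The long way: one comparison per fruit, chained together with |
with_or <- basket == "orange" | basket == "apple" | basket == "mango" |
  basket == "mandarin" | basket == "kiwi"

# The short way: a single %in% check
with_in <- basket %in% c("orange", "apple", "mango", "mandarin", "kiwi")

identical(with_or, with_in)   # TRUE
basket[with_in]               # "orange" "kiwi" "mango"
```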

Of course, subsetting using logicals can also be done on numeric vectors.

Here are a few examples:

In [53]:
# Create a numeric vector
numbers <- seq(0, 100, by = 10)
numbers


In [54]:
# Illustrate three different filters
numbers[numbers <= 50 & numbers != 30]

In [55]:
numbers[numbers == 0 | numbers == 100]

In [56]:
numbers[numbers > 100] #returns an empty vector

## Modifying vectors

The subsetting logic from above can be used to modify vectors. The
idea here is that instead of keeping elements that meet a logical
condition or occur at a specific index, we can change them. For example,
suppose we had a vector of family members' ages and had mis-entered
grandpa's age. We can fix it using indexing or a logical statement.

In [57]:
# Recreate vector with age values
age <- c(50, 55, 80)

# Two ways of changing grandpa's age (the third entry)
# Note: you'd only need to use one of these
age[age == 80] <- 82 # using a logical statement
age[3] <- 82         # using indexing
age

A logical statement is most efficient when we need to change a lot
of elements.

In [59]:
fruits <- rep(c("orange", "apple", "bamama"), 5) 
fruits #bamamas anyone? 

In [60]:
# Let's fix the misspelled element
fruits[fruits == "bamama"] <- "banana"
fruits
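
The same pattern works on numeric vectors. As a sketch with made-up survey ages, suppose we wanted to cap every reported age above 90 at 90. The cutoff is purely illustrative.

```R
# Hypothetical survey ages
ages <- c(34, 97, 51, 102, 27)

# Every element where the condition is TRUE is replaced in one step
ages[ages > 90] <- 90
ages     # 34 90 51 90 27
```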

## Type Promotion

OK, there's one last lesson that's worth learning about vectors, because it can get you in trouble. 

As noted above, vectors can only contain one type of data, but if you try to use `c()` to combine vectors of different types, R will try to be clever and *find* a way to combine them by doing something called "Type Promotion", which is a way of converting all the data you give it to the same type. For example, if I tried to create a vector by combining a character vector and a numeric vector, R would convert the numeric vector to a character vector so all the data could fit in a single character vector:


In [13]:
a <- c("Nick", 42)
a

Why did R convert `42` to `"42"` and not convert `"Nick"` to a numeric type? Well, because `"Nick"` can't be represented as a numeric type in any meaningful sense, while any number (like `42`) can always be represented as a character in a meaningful way.

Indeed, there's a hierarchy of data types, where a type lower on the hierarchy can always be converted into something higher in the order, but not the other way around. That hierarchy is:

`logical` --> `numeric` --> `character`

When R is asked to combine vectors of different data types, it will try to move things up this hierarchy by the smallest amount possible in order to make everything the same type.

For example, if you combine `logical` and `numeric` vectors, R will convert all of the data into `numeric` (remember from our previous lesson that R thinks of `TRUE` as being like `1`, and `FALSE` as being like `0`).


In [14]:
c(1, 2, TRUE)

But it **doesn't** convert that data into a `character` vector (even though it could!) because it's trying to make the smallest movements up that hierarchy that it can. But if we try to combine `logical`, `numeric`, *and* `character` vectors, R would be forced to convert everything into a `character` vector:

In [16]:
c(TRUE, 42, "Julio")
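
One way to see the hierarchy in action is to ask R for the class of each combination. The sketch below also shows why promotion can get you in trouble: once a number has been turned into text, you can't do math with it until you convert it back. `as.numeric()` is a base R function that isn't covered in this reading and is included only as an illustration.

```R
class(c(TRUE, FALSE))          # "logical"
class(c(1, 2, TRUE))           # "numeric"
class(c(TRUE, 42, "Julio"))    # "character"

# After promotion, 42 is stored as the text "42"
mixed <- c("Nick", 42)
mixed[2]                       # "42"

# mixed[2] + 1 would fail with an error, because "42" is text.
# Converting it back to a number first makes the math work again.
as.numeric(mixed[2]) + 1       # 43
```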

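One note on the numeric subsetting examples from earlier (like `numbers[numbers > 100]`):
there, I didn't create logical objects before carrying out the subsets,
as opposed to the fruit examples where we explicitly defined vectors like
`orange_or_apple` and `lv`. I find it more compact and intuitive to take
subsets without first creating a logical vector.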

<div style="margin-top: 15px"> </div>

## Exercises

Create a vector that represents the age of at least four different family
members or friends. You can name it whatever you want.

1. What is the mean age of the people in your vector? Find out in two ways,
with and without using the `mean()` command.

2. How old is the youngest person in your vector? (Use an R command to find out.)

3. What is the age gap between the youngest person and the oldest person in your vector?
(Again use R to find out, and try to be as general as possible in the sense that
your code should work even if the elements in your vector, or their order, change.)

4. How many people in your vector are above age 25? (Again, try to make your code
work even in the case that your vector changes.)

5. Replace the age of the oldest person in your vector with the age of someone
else you know.

6. Create a new vector that indicates how old each person in your vector
will be in 10 years.

7. Create a new vector that indicates what year each person in your vector
will turn 100 years old.

8. Create a new vector with a random sample of 3 individuals from your
original vector. What is the mean age of the people in this new
vector?
