# Intro to Vectors

In this reading, we'll learn about the most basic type of numpy array: the vector. Vectors are arrays that organize their contents along a single dimension, basically the same way data is organized in a `list`. For example, if I created a vector with the numbers 42, 47, and -1, my vector would store those numbers in order, and I could extract individual elements from it by their *index* (their position in the vector):

In [1]:
import numpy as np
my_vector = np.array([42, 47, -1])
my_vector

array([42, 47, -1])

In [2]:
# Extract by single index
my_vector[1]

47

In [3]:
# Or slice
my_vector[1:2]

array([47])
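Here are a few more indexing patterns, sketched with a throwaway vector (the name `example` is just for illustration). As with lists, indices start at 0, negative indices count from the end, and a slice includes its start position but excludes its stop position. A single index returns one element, while a slice returns a vector, even if that vector holds only one element:

```python
import numpy as np

example = np.array([42, 47, -1])

first = example[0]        # indices start at 0
last = example[-1]        # negative indices count from the end
first_two = example[0:2]  # slice: positions 0 and 1 (the stop, 2, is excluded)

print(first, last, first_two)
```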

But while we can extract elements from a vector the same way we would from a list, vectors and lists differ in a number of ways, the most important of which is that vectors can only hold *one* data type. That means that if we tried to insert a floating point number (number with decimals) into this vector of integers:

In [4]:
my_vector[0] = 7.32
my_vector

array([ 7, 47, -1])

numpy will convert our floating point number (7.32) into an integer before putting it into the vector. That's why 7 appears in our vector instead of 7.32: numpy drops the decimal part, just as `int(7.32)` does.
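To make this concrete, here is a minimal sketch (the variable names are illustrative) showing that a vector's data type is fixed when it is created, and that one way to store decimals is to first make a float copy with `.astype(float)`:

```python
import numpy as np

int_vector = np.array([42, 47, -1])
print(int_vector.dtype)  # an integer type (exact width depends on your platform)

# Assigning a float into an integer vector drops the decimal part
int_vector[0] = 7.32
print(int_vector)

# To keep decimals, convert to a float vector first, then assign
float_vector = int_vector.astype(float)
float_vector[0] = 7.32
print(float_vector)
```

Note that `.astype` returns a new vector and leaves the original integer vector unchanged.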

## Why Do We Need Vectors? 

"If vectors are *basically* like lists, except that they can only store data of one type, why on earth do we need them?" I hear you asking.

Speed and memory usage.

As we mentioned in our last reading, lists are flexible, but that flexibility comes at the cost of performance and memory usage. In other words, if lists and vectors were vehicles, lists would be 18 wheelers -- lots of space to store stuff, but big and hard to get around -- and numpy vectors would be sports cars -- small and fast, but without much storage.

How much slower? Well, to illustrate let's create a list with all the numbers from one to a billion and sum them with regular Python; then let's do the same thing with numpy vectors and compare the result:

In [5]:
# Remember `range` doesn't include the last number,
# so I have to go up to 1_000_000_001 to actually get all
# the numbers from 1 to 1_000_000_000

# Make list
one_to_billion_list = list(range(1, 1_000_000_001))

# Make array
one_to_billion_vector = np.arange(1, 1_000_000_001)


In [6]:
%time sum(one_to_billion_list)

CPU times: user 11.8 s, sys: 35.8 s, total: 47.7 s
Wall time: 1min 5s


500000000500000000

In [7]:
%time np.sum(one_to_billion_vector)

CPU times: user 1.77 s, sys: 5.95 s, total: 7.72 s
Wall time: 10.1 s


500000000500000000

So on my 2019 MacBook Pro with 32GB RAM, that took about 1 minute with regular Python and about 10 seconds with numpy. That's roughly a 6x speedup, just for that simple calculation.

But that's not all -- to create that list, regular Python required over **35GB** of RAM, while numpy only used about 6GB.

This is, of course, a toy example. But storing large collections of numbers and manipulating them quickly is at the heart of data science, and these differences in speed and storage efficiency are precisely why we use numpy arrays instead of regular Python objects like lists. 
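You can see the same pattern at a much smaller scale without needing tens of gigabytes of RAM. The sketch below uses one million numbers rather than a billion, and estimates memory with `sys.getsizeof` and the array's `.nbytes` attribute. The list figure is a rough estimate (the list's own size plus the size of each integer object it points to), and exact numbers will vary by platform and Python version:

```python
import sys
import numpy as np

n = 1_000_000
small_list = list(range(1, n + 1))
small_vector = np.arange(1, n + 1, dtype=np.int64)

# A list stores references to separate Python int objects,
# so count both the list itself and the objects it refers to
list_bytes = sys.getsizeof(small_list) + sum(sys.getsizeof(x) for x in small_list)

# A numpy vector stores the raw numbers in one block: 8 bytes per int64
vector_bytes = small_vector.nbytes

print(f"list:   ~{list_bytes / 1e6:.0f} MB")
print(f"vector: ~{vector_bytes / 1e6:.0f} MB")

# Both approaches give the same total
list_total = sum(small_list)
vector_total = int(np.sum(small_vector))
print(list_total == vector_total)
```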

## Creating Vectors

Now that we've motivated our interest in vectors, let's get into their basic usage.

The simplest way to create a vector is with the `np.array()` function and a list:

In [9]:
# A vector of ints
an_integer_vector = np.array([1, 2, 3])
an_integer_vector

array([1, 2, 3])

Vectors aren't limited to integers, however -- we can also create vectors of floating point numbers (numbers with decimal components), Booleans, or strings!

In [10]:
# A vector of floats
a_float_vector = np.array([1.7, 2, 3.14])
a_float_vector

array([1.7 , 2.  , 3.14])

In [11]:
# A vector of booleans
a_boolean_vector = np.array([True, False, True])
a_boolean_vector

array([ True, False,  True])

In [12]:
# A vector of strings
a_string_vector = np.array(["Red", "Green", "Purple"])
a_string_vector


array(['Red', 'Green', 'Purple'], dtype='<U6')

But vectors wouldn't be useful if we had to create a list and pass it to `np.array` anytime we wanted an array, so we also have a number of functions for generating common types of vectors:

In [13]:
# Numbers from 0 up to (but not including) 10
np.arange(10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
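Like `range`, `np.arange` also accepts an explicit start, stop, and step. A short sketch (the variable names are just for illustration):

```python
import numpy as np

zero_to_nine = np.arange(10)   # start defaults to 0; the stop (10) is excluded
one_to_ten = np.arange(1, 11)  # explicit start and stop
evens = np.arange(0, 10, 2)    # the third argument is the step size

print(zero_to_nine)
print(one_to_ten)
print(evens)
```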

In [14]:
# Ones
np.ones(3)

array([1., 1., 1.])

In [15]:
# Zeros
np.zeros(3)

array([0., 0., 0.])

(Note that numpy writes the ones and zeros vectors with decimal points after the numbers. That's how numpy tells you that those values are being stored as floating point numbers (floats), not integers.) We can also see this with the `.dtype` attribute:

In [16]:
my_zeros = np.zeros(3)
my_zeros

array([0., 0., 0.])

In [17]:
my_zeros.dtype

dtype('float64')
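If you would rather have integers (or another type), functions like `np.zeros` and `np.ones` accept a `dtype` argument. A small sketch with illustrative variable names:

```python
import numpy as np

float_zeros = np.zeros(3)           # default: floating point
int_zeros = np.zeros(3, dtype=int)  # ask for integers instead
bool_ones = np.ones(3, dtype=bool)  # ones stored as True

print(float_zeros.dtype, int_zeros.dtype, bool_ones.dtype)
```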

## Vector Math

If vectors were just for storing data, they wouldn't be super useful. But one of the best things about vectors is that we can use them to do mathematical operations efficiently.

If you do math with a vector and a single number (which numpy treats like a vector of length one), the operation is applied to every entry of the vector.


In [18]:
# Here's what we'll start with
numbers = np.arange(10)
numbers


array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [19]:
# You can modify all values in a vector 
# by doing math with a vector of length 1
numbers / 10

array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])

In [20]:
numbers + 10

array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
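To check that a plain number really does behave like a vector of length 1, here is a quick sketch comparing the two (the variable names are illustrative):

```python
import numpy as np

numbers = np.arange(10)

with_scalar = numbers + 10                  # a plain Python number
with_length_one = numbers + np.array([10])  # an actual vector of length 1

# Both give the same result: 10 is added to every entry
print(np.array_equal(with_scalar, with_length_one))
```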

The same thing happens with mathematical functions -- the function gets applied to each entry:

In [21]:
# Modify a vector using a function
np.sqrt(numbers) #square root


array([0.        , 1.        , 1.41421356, 1.73205081, 2.        ,
       2.23606798, 2.44948974, 2.64575131, 2.82842712, 3.        ])

In [22]:
np.exp(numbers) #exponentiate

array([1.00000000e+00, 2.71828183e+00, 7.38905610e+00, 2.00855369e+01,
       5.45981500e+01, 1.48413159e+02, 4.03428793e+02, 1.09663316e+03,
       2.98095799e+03, 8.10308393e+03])

If you have two vectors of the same length, mathematical operations will occur "element-wise", meaning the mathematical operation will be applied to the two 1st entries, then the two 2nd entries, then the two 3rd entries, etc. For example, if we were to add our vector of the values 0 through 9 to a vector with five 0s, then five 1s, numpy would do the following:

```
0    +     0    =    0  +  0    =    0 
1    +     0    =    1  +  0    =    1 
2    +     0    =    2  +  0    =    2 
3    +     0    =    3  +  0    =    3 
4    +     0    =    4  +  0    =    4 
5    +     1    =    5  +  1    =    6 
6    +     1    =    6  +  1    =    7 
7    +     1    =    7  +  1    =    8 
8    +     1    =    8  +  1    =    9 
9    +     1    =    9  +  1    =    10
```

(Obviously, numpy likes to print out vectors sideways, but personally I think of them as column vectors, so I have written them out like that here.)


In [23]:
# Two vectors with the same number of elements 
numbers2 = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
numbers3 = numbers2 + numbers
numbers3


array([ 0,  1,  2,  3,  4,  6,  7,  8,  9, 10])
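This is another place where vectors and lists differ. With lists, `+` joins the two lists end to end; with vectors, `+` (and `-`, `*`, `/`) works entry by entry. A small sketch with made-up values:

```python
import numpy as np

list_a = [1, 2, 3]
list_b = [10, 20, 30]
concatenated = list_a + list_b    # lists: + joins them end to end

vector_a = np.array(list_a)
vector_b = np.array(list_b)
added = vector_a + vector_b       # vectors: + adds entry by entry
multiplied = vector_a * vector_b  # the same goes for *, -, and /

print(concatenated)
print(added)
print(multiplied)
```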

## Summarizing Vectors

We often want to get summary statistics from a vector, that is, to learn something general about it by looking beyond its individual elements. If we have a vector in which each element represents a person's height, for example, we may want to know who the shortest or tallest person is, what the median or mean height is, or what the standard deviation is.

For that, numpy provides a huge range of numeric functions:

In [24]:
np.mean(numbers)

4.5

In [25]:
np.max(numbers)

9

Here's a short (very incomplete!) list of these kinds of functions:

```python
len(numbers) #number of elements 
np.max(numbers) #maximum value
np.min(numbers) #minimum value
np.sum(numbers) #sum of all values in the vector
np.mean(numbers) #mean
np.median(numbers) #median
np.var(numbers) #variance
np.std(numbers) #standard deviation
np.quantile(numbers, [0.25, 0.5, 0.75]) #quartiles (25th, 50th, and 75th percentiles)
```

**Don't** worry about memorizing these or anything -- basically, you just need to have a sense of the kinds of things you can do with functions, and if you ever need one and can't remember its name, you can google it.
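To tie this back to the heights example, here is a sketch using five made-up heights in centimeters. Two details are worth knowing: `np.var` and `np.std` compute the population versions by default (dividing by the number of elements), and `np.argmin` / `np.argmax` return the *position* of the smallest or largest value, which is how you would find out *who* is shortest or tallest:

```python
import numpy as np

# Made-up heights in centimeters, purely for illustration
heights_cm = np.array([150.0, 160.0, 170.0, 180.0, 190.0])

mean_height = np.mean(heights_cm)
median_height = np.median(heights_cm)
variance = np.var(heights_cm)   # population variance by default
std_dev = np.std(heights_cm)    # population standard deviation by default
quartiles = np.quantile(heights_cm, [0.25, 0.5, 0.75])

shortest_position = np.argmin(heights_cm)  # index of the shortest person
tallest_position = np.argmax(heights_cm)   # index of the tallest person

print(mean_height, median_height, variance, std_dev)
print(quartiles)
print(shortest_position, tallest_position)
```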

## Type Promotion

There's one last lesson that's worth learning about vectors, because it can get you in trouble. 

As noted above, vectors can only contain one type of data, but if you try to pass a list with different kinds of data to `np.array`, numpy will try to be clever and *find* a way to put all that data in one array by doing something called "Type Promotion." Type promotion is the practice of converting all the data you give it to the same type. For example, if I tried to create a vector from a string and a number, numpy would convert the number to a string so all the data could fit in a string vector:


In [26]:
np.array(["Nick", 42])

array(['Nick', '42'], dtype='<U21')

Why did numpy convert `42` to `"42"` and not convert `"Nick"` to a numeric type? Well, because `"Nick"` can't be represented as a numeric type in any meaningful sense, while any number (like `42`) can always be represented as a string in a meaningful way.

Indeed, there's a hierarchy of data types, where a type lower on the hierarchy can **always** be converted into something higher in the order, but not the other way around. That hierarchy is:

`Boolean` --> `integer` --> `float` --> `string`

When numpy is asked to combine data of different types, it will try to move things up this hierarchy by the smallest amount possible in order to make everything the same type.

(Note there are individual cases that can move backwards -- the string `"5"` could logically be turned into `5` -- but you can't always convert a string to a number, so for consistency numpy only moves in directions that are **always** possible.)

For example, if you combine `Boolean` and `float` data, numpy will convert all of the data into `float`. (Remember from our previous reading that Python thinks of `True` as being like `1`, and `False` as being like `0`.)


In [27]:
np.array([1, 2.4, True])

array([1. , 2.4, 1. ])

But it **doesn't** convert that data into a `string` vector (even though it could!) because it's trying to make the smallest movements up that hierarchy that it can. 

But if we try to combine `Boolean`, integer, *and* `string` data, numpy would be forced to convert everything into a `string` vector:

In [28]:
np.array([True, 42, "Julio"])

array(['True', '42', 'Julio'], dtype='<U21')
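One way to watch the hierarchy at work is to build vectors from adjacent pairs of types and inspect the resulting `.dtype`. The sketch below uses illustrative variable names; the exact dtype names printed (for example the integer width or the string length) depend on your platform and numpy version:

```python
import numpy as np

bool_and_int = np.array([True, 2])           # Boolean + integer -> integer
int_and_float = np.array([1, 2.4])           # integer + float   -> float
float_and_string = np.array([2.4, "Julio"])  # float + string    -> string

for vec in (bool_and_int, int_and_float, float_and_string):
    print(vec, vec.dtype)
```

In each case numpy moves only as far up the hierarchy as it needs to.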

## Recap

- Vectors are collections of data of the same type. 
- Vectors can be created with the `np.array()` function. 
- You can easily do math between any vector and a vector of length 1, or a vector of the same length. 
- You generally *can't* do math between two vectors of different lengths unless one of them has length 1; numpy will raise an error rather than guess how to line the entries up.
- If data of different types are passed to the `np.array()` function, it will type promote them to the lowest type that can store all the input types. 
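On the point about mismatched lengths: for one-dimensional vectors like the ones in this reading, numpy does not guess how to line the entries up. If neither vector has length 1 and the lengths differ, the operation fails with a `ValueError`. A small sketch with illustrative names:

```python
import numpy as np

three_items = np.array([1, 2, 3])
two_items = np.array([10, 20])

try:
    three_items + two_items
    error_message = None
except ValueError as err:
    # numpy reports that the shapes could not be broadcast together
    error_message = str(err)

print(error_message)
```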

## Next Steps

Now that we're familiar with vectors, [it's time to learn to manipulate them!](manipulating_vectors.ipynb)