<font color=gray>This Jupyter notebook was created by Mark Bunday for \the\world Girls' Machine Learning Day Camp. The license can be found at the bottom of the notebook.</font>

# Vectorization

### What is vectorization? 

When we learned about for-loops, we saw how they can be used to apply operations to the elements of a list sequentially, one-by-one.

In [10]:
numbers_and_words = [1, "two", 3, 4, "five", "six"]
# This for-loop prints out each item in the "numbers_and_words" list.
for item in numbers_and_words:
    print(item)

1
two
3
4
five
six


**Arrays** in Python are like lists, but they consist of only a single type of data, e.g. only numbers or only words. We define arrays using the "numpy" library in Python.

In [23]:
import numpy as np

numbers = np.array([1, 3, 4])
words = np.array(["two", "five", "six"])
print(numbers)
print(words)

[1 3 4]
['two' 'five' 'six']


If we try to make a numbers_and_words array, numpy will automatically convert the numbers to strings. Remember, arrays can only hold a single type of element, unlike lists, which can be mixed.

In [26]:
numbers_and_words = np.array([1, 2, 3, "four"])
print(numbers_and_words)  
# You can see the quotation marks around '1', '2', '3';
# these indicate they were converted to strings

# Let's check the type of the first element (1)
print(type(numbers_and_words[0]))

['1' '2' '3' 'four']
<class 'numpy.str_'>
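Another way to confirm that an array holds a single type is to look at its `dtype` attribute. The short sketch below is an addition to the original notebook; the variable name `mixed` is made up for illustration. The `kind` code is `'i'` for integers and `'U'` for text (Unicode strings).

```python
import numpy as np

numbers = np.array([1, 3, 4])
mixed = np.array([1, 2, 3, "four"])  # hypothetical name for the mixed array

# Every array has exactly one dtype (data type) shared by all of its elements.
print(numbers.dtype.kind)  # 'i' means integer
print(mixed.dtype.kind)    # 'U' means Unicode text: the numbers became strings
```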


Why are arrays useful? If we know all of the elements in an array are the same type, we can **vectorize** operations on it. **Vectorization** means that instead of applying an operation to each element one-by-one, which is what for-loops do, *we apply the operation once to the entire array.*

This might be a little confusing, so let's go through a simple example. 

### Example: Organizing Books

Imagine you work at the library. Bob just dropped off a pile of five books and you need to return them to their proper section based on whether they're fiction or non-fiction. We could represent this pile using a list like we learned before:

In [33]:
returned_books = ["Fiction", "Non-Fiction", "Fiction", "Fiction", "Non-Fiction"]

We want to 

1. Go through the book pile one-by-one

2. Determine whether each book is Fiction or Non-Fiction

3. Based on 2., return the book to the right section  

Let's also say that it takes 

* 2 seconds to determine if a book is Fiction or Non-Fiction

* 6 seconds to return a book to the Non-Fiction section

* 10 seconds to return a book to the Fiction section

We could represent this routine, or process, using a for-loop and an if-else statement like we learned before: 

In [35]:
def return_book(book):
    print(f"Returned to {book} section!")

time = 0
for book in returned_books:
    time = time + 2
    if book == "Fiction":
        time = time + 10 
        return_book(book)
    else:  # Otherwise, it must be Non-Fiction
        time = time + 6 
        return_book(book)
print(f"The time to return all {len(returned_books)} books was {time} seconds!")

Returned to Fiction section!
Returned to Non-Fiction section!
Returned to Fiction section!
Returned to Fiction section!
Returned to Non-Fiction section!
The time to return all 5 books was 52 seconds!


What would make this process more efficient? Is this how you would do it in real life? What's slowing the process down? 

Part of the problem is that it takes 2 seconds to check each book, and every book in the pile needs to be checked to determine whether it belongs in the Fiction or Non-Fiction section. That only adds up to 10 seconds when you have 5 books, but what if you had 100? 1,000? If we didn't have to check at all, then that's the same as saying the time to check 1 book is the same as the time to check 1,000! This is where arrays start to be useful.

The second problem is that each book is processed one-by-one in order. It would be more efficient if we could return multiple books at once since books in the same genre are all going to the same place anyway. In fact, that's exactly what **vectorization** does! 
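As a rough sketch of this idea (an addition to the original notebook, using the timings listed earlier), we can put Bob's pile in an array and compare the whole array to "Fiction" in one step instead of checking book by book. The names `is_fiction`, `num_fiction`, `num_non_fiction`, and `total_time` are made up for this illustration.

```python
import numpy as np

returned_books = np.array(
    ["Fiction", "Non-Fiction", "Fiction", "Fiction", "Non-Fiction"]
)

# One comparison applied to the whole array gives an array of True/False values.
is_fiction = returned_books == "Fiction"

# True counts as 1 and False as 0, so sum() counts the books in each genre.
num_fiction = is_fiction.sum()
num_non_fiction = (~is_fiction).sum()

# Same timings as the for-loop: 2 s per check, 10 s per Fiction, 6 s per Non-Fiction.
total_time = 2 * len(returned_books) + 10 * num_fiction + 6 * num_non_fiction
print(total_time)  # 52, matching the for-loop version
```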

Let's imagine a second person, Sarah, returns 12 books, but she tells us beforehand that all of them are Fiction books. How much time does it take to return these 12 books? Well, we don't need to check what genre they are, and because they're all going to the same section, if we're brave enough we can just return all 12 at once, which would only take us 10 seconds total (one trip to the Fiction section).
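To make the comparison concrete, here is the arithmetic for Sarah's pile as a small sketch. It is an addition to the original notebook and assumes the timings listed earlier (2 seconds per check, 10 seconds per trip to the Fiction section); the constant and variable names are made up for illustration.

```python
CHECK_SECONDS = 2      # time to check one book's genre
FICTION_SECONDS = 10   # time for one trip to the Fiction section

sarah_books = ["Fiction"] * 12

# One-by-one: every book is checked and then carried over on its own trip.
one_by_one = len(sarah_books) * (CHECK_SECONDS + FICTION_SECONDS)

# All at once: no checking needed, and a single trip for the whole pile.
all_at_once = FICTION_SECONDS

print(one_by_one)   # 144
print(all_at_once)  # 10
```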

### How do we use vectorization in Python? 

Vectorization in Python is easy because you can apply operations directly to arrays, unlike lists! 

For example, let's say we want to multiply a list of numbers by 2. How would we do this with a list? 

In [37]:
numbers = [1,2,3,4,5,6,7,8,9,10]
# This for-loop iterates through each position (0-9)
# in the numbers list above and multiplies the
# number in each position by 2.
# range(len(numbers)) covers every position;
# range(0, 9) would stop at position 8 and skip the last number.
for position in range(len(numbers)):
    numbers[position] = numbers[position] * 2

print(numbers)

[2, 4, 6, 8, 10, 12, 14, 16, 18, 20]


How would we do this using an array? 

In [70]:
numbers_array = np.array([1,2,3,4,5,6,7,8,9,10])  # Define our array
numbers_array = numbers_array * 2
print(numbers_array)

[ 2  4  6  8 10 12 14 16 18 20]


As you can see, we don't even need to write a for-loop! Because we know that everything in the array is a number, we can simply multiply the whole thing by 2, which is equivalent to multiplying each element in the array by 2. With a list, the same expression doesn't do this: a list *could* contain elements other than numbers, in which case multiplication would make no sense, so Python treats multiplying a list by 2 as "repeat the list twice" instead.
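A quick check of what `* 2` actually does to a plain list versus an array. This is an added sketch, not part of the original notebook, and the names `small_list` and `small_array` are made up for illustration.

```python
import numpy as np

small_list = [1, 2, 3]
small_array = np.array([1, 2, 3])

# On a list, * 2 repeats the list instead of doubling each number.
print(small_list * 2)   # [1, 2, 3, 1, 2, 3]

# On an array, * 2 is applied to every element.
print(small_array * 2)  # [2 4 6]
```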

We can do other arithmetic operations, too!

In [72]:
print(numbers_array + 12)
print(numbers_array / 2)
print(numbers_array + numbers_array)
print(numbers_array * numbers_array)

[14 16 18 20 22 24 26 28 30 32]
[ 1.  2.  3.  4.  5.  6.  7.  8.  9. 10.]
[ 4  8 12 16 20 24 28 32 36 40]
[  4  16  36  64 100 144 196 256 324 400]


Let's look at some simple performance metrics now. 

In [65]:
import numpy as np
from timeit import Timer

numbers = list(range(500_000))
numbers_array = np.array(list(range(500_000)))

def for_loop_add():
    for i in range(len(numbers)):
        numbers[i] = numbers[i] + 1
    return numbers

def vectorized_add():
    return numbers_array + 1

# Both of these functions do the same thing:
# They add 1 to each of 500,000 numbers.
# Each timing runs the function 10 times in a row; we repeat
# that 10 times and keep the fastest timing for the for-loop
# and for the vectorized version.
print(min(Timer(for_loop_add).repeat(10, 10)))
print(min(Timer(vectorized_add).repeat(10, 10)))

0.39556793799965817
0.00825037100003101


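We can see that the vectorized version is much faster than the for-loop: in this run it was roughly 48 times faster (about 0.396 seconds versus 0.008 seconds). Both versions take longer as the number of elements grows, but the for-loop pays the cost of a Python step for every single element, so the vectorized version should stay far ahead.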
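If you would like to see how the gap changes with size, the sketch below times both approaches on a few different lengths. It is an addition to the original notebook: the sizes, the helper names, and the choice of 5 runs per timing are assumptions made for illustration, and the exact numbers will differ from computer to computer. Unlike the cell above, this for-loop builds a new list instead of changing the original one.

```python
import numpy as np
from timeit import timeit

def for_loop_add(values):
    # Add 1 to each element one-by-one, building a new list.
    result = []
    for value in values:
        result.append(value + 1)
    return result

def vectorized_add(values_array):
    # Add 1 to the whole array in a single operation.
    return values_array + 1

for size in (1_000, 10_000, 100_000):
    values = list(range(size))
    values_array = np.arange(size)
    # Time 5 runs of each version on the same data.
    loop_time = timeit(lambda: for_loop_add(values), number=5)
    vector_time = timeit(lambda: vectorized_add(values_array), number=5)
    print(f"{size:>7} elements: loop {loop_time:.5f}s, vectorized {vector_time:.5f}s")
```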

<font color=gray>Copyright (c) 2019 Mark Bunday</font>
<br><br>
<font color=gray>Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:</font>
<br><br>
<font color=gray>The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.</font>
<br><br>
<font color=gray>THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.</font>