# Array exercises

Initialize the OK tests to get started.

In [1]:
from client.api.notebook import Notebook
ok = Notebook('arrays.ok')

Up to now, we haven't done much that you couldn't do yourself by hand, without
going through the trouble of learning Python.  Computers are most useful when
you can use a small amount of code to *do the same action* to *many different
things*.

For example, in the time it takes you to calculate the 18% tip on a restaurant
bill, a laptop can calculate 18% tips for every restaurant bill paid by every
human on Earth that day.  (That's if you're pretty fast at doing arithmetic in
your head!)

**Arrays** are how we put many values in one place so that we can operate on them as a group. For example, if `billions_of_numbers` is an array of numbers, the expression

```
.18 * billions_of_numbers
```

gives a new array of numbers that's the result of multiplying *each number* in
`billions_of_numbers` by .18 (18%).  Arrays are not limited to numbers; we can
also put all the words in a book into an array of strings.

Concretely, an array is a **collection of values of the same type**, like a column in an Excel spreadsheet. 

<img src="excel_array.jpg">
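
As a rough sketch of the same-type rule, using the NumPy package introduced in the next section: when a list mixes whole numbers and decimals, every element ends up stored as the same type. The names `mixed` and `words` are made up for illustration.

```python
import numpy as np

# Mixing an integer with a decimal number: every element is stored as a float.
mixed = np.array([1, 2.5])
print(mixed)        # [1.  2.5]
print(mixed.dtype)  # float64

# An array of strings also has a single type for all of its elements.
words = np.array(["call", "me", "Ishmael"])
print(words.dtype.kind)  # U, NumPy's code for text (Unicode) strings
```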


## Making arrays

You can type in the data that goes in an array yourself, but that's not
typically how programs work. Normally, we create arrays by loading them from
an external source, like a data file.

First, though, let's learn how to do it the hard way.

Arrays are provided by a package called [NumPy](http://www.numpy.org/)
(pronounced "NUM-pie" or, if you prefer to pronounce things incorrectly,
"NUM-pee").  The package is called `numpy`, but it's standard to rename it
`np` for brevity.  You can do that with:

In [2]:
import numpy as np

Now, to create an array, first make a *list* with the values you want to put
in the array.  Then create the array by using the `array` function from NumPy.
Run this cell to see an example:

In [3]:
values = [0.125, 4.75, -1.3]
np.array(values)

array([ 0.125,  4.75 , -1.3  ])

Each value in an array (in the above case, the numbers 0.125, 4.75, and -1.3)
is called an *element* of that array.

Arrays themselves are also values, just like numbers and strings.  That means
you can assign them names or use them as arguments to functions.
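
For instance, here is a minimal sketch of naming an array and then passing it to two built-in functions. The name `lap_times` and its values are made up for illustration.

```python
import numpy as np

# Give an array a name, just as you would a number or a string.
lap_times = np.array([61.2, 59.8, 60.5])  # made-up values

# Pass the array as an argument to functions.
print(len(lap_times))  # 3
print(max(lap_times))  # 61.2
```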

**Question 1.1.1.** Make an *array* containing the numbers 1, 2, and 3, in
that order.  Name it `small_numbers`.

In [4]:
small_numbers = np.array([1, 2, 3])
small_numbers

array([1, 2, 3])

In [5]:
_ = ok.grade('q411')

**Question 1.1.2.** Make an array containing the numbers 0, 1, -1, $\pi$, and
$e$, in that order.  Name it `interesting_numbers`.  *Hint:* NumPy's `np`
module has the value $\pi$ as `np.pi`.  It also has the value $e$.

In [6]:
interesting_numbers = np.array([0, 1, -1, np.pi, np.e])
interesting_numbers

array([ 0.        ,  1.        , -1.        ,  3.14159265,  2.71828183])

In [7]:
_ = ok.grade('q412')

## Ranges

Very often in data science, we want to work with many numbers that are evenly spaced within some range.  NumPy provides a special function for this called `arange`.  `np.arange(start, stop, space)` produces an array with all the numbers starting at `start` and counting up by `space`, stopping before `stop` is reached.

For example, the value of `np.arange(1, 6, 2)` is an array with elements 1, 3, and 5 -- it starts at 1 and counts up by 2, then stops before 6.  In other words, it's equivalent to `np.array([1, 3, 5])`.

`np.arange(4, 9, 1)` is an array with elements 4, 5, 6, 7, and 8.  (It doesn't contain 9 because `np.arange` stops *before* the stop value is reached.)
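
The two examples above can be checked directly. The sketch below also shows the one-argument form of `np.arange`, which starts at 0 and counts up by 1; the variable names are illustrative.

```python
import numpy as np

odds = np.arange(1, 6, 2)
print(odds)  # [1 3 5]

four_to_eight = np.arange(4, 9, 1)
print(four_to_eight)  # [4 5 6 7 8]

# With a single argument, np.arange starts at 0 and counts up by 1,
# stopping before the value given.
first_five = np.arange(5)
print(first_five)  # [0 1 2 3 4]
```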

**Question 1.1.1.1.** Use `np.arange` to create an array with the multiples of 99 from 0 up to (**and including**) 9999.  (So its elements are 0, 99, 198, 297, etc.)

In [8]:
multiples_of_99 = np.arange(0, 10000, 99)
multiples_of_99

array([   0,   99,  198,  297,  396,  495,  594,  693,  792,  891,  990,
       1089, 1188, 1287, 1386, 1485, 1584, 1683, 1782, 1881, 1980, 2079,
       2178, 2277, 2376, 2475, 2574, 2673, 2772, 2871, 2970, 3069, 3168,
       3267, 3366, 3465, 3564, 3663, 3762, 3861, 3960, 4059, 4158, 4257,
       4356, 4455, 4554, 4653, 4752, 4851, 4950, 5049, 5148, 5247, 5346,
       5445, 5544, 5643, 5742, 5841, 5940, 6039, 6138, 6237, 6336, 6435,
       6534, 6633, 6732, 6831, 6930, 7029, 7128, 7227, 7326, 7425, 7524,
       7623, 7722, 7821, 7920, 8019, 8118, 8217, 8316, 8415, 8514, 8613,
       8712, 8811, 8910, 9009, 9108, 9207, 9306, 9405, 9504, 9603, 9702,
       9801, 9900, 9999])

In [9]:
_ = ok.grade('q4111')

## Temperature readings

NOAA (the US National Oceanic and Atmospheric Administration) operates weather
stations that measure surface temperatures at different sites around the
United States.  The hourly readings are [publicly
available](http://www.ncdc.noaa.gov/qclcd/QCLCD?prior=N).

Suppose we download all the hourly data from the Oakland, California site for
the month of December 2015.  To analyze the data, we want to know when each
reading was taken, but we find that the data don't include the timestamps of
the readings (the time at which each one was taken).

However, we know the first reading was taken at the first instant of December
2015 (midnight on December 1st) and each subsequent reading was taken exactly
1 hour after the last.

**Question 1.1.1.2.** Create an array of the *time, in seconds, since the
start of the month* at which each hourly reading was taken.  Name it
`collection_times`.

*Hint 1:* There were 31 days in December, which is equivalent to ($31 \times
24$) hours or ($31 \times 24 \times 60 \times 60$) seconds.  So your array
should have $31 \times 24$ elements in it.

*Hint 2:* The `len` function works on arrays, too.  If your `collection_times`
isn't passing the tests, check its length and make sure it has $31 \times 24$
elements.
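
As a sketch of that length check on a smaller, hypothetical case, the cell below builds hourly timestamps for a single day rather than the whole month. The name `one_day_times` is made up.

```python
import numpy as np

# Hypothetical smaller case: hourly timestamps (in seconds) for one day.
one_day_times = np.arange(0, 24 * 60 * 60, 60 * 60)

# One reading per hour, so a single day should have 24 elements.
print(len(one_day_times))  # 24
```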

In [10]:
collection_times = np.arange(0, 31*24*60*60, 60 * 60)
collection_times

array([      0,    3600,    7200,   10800,   14400,   18000,   21600,
         25200,   28800,   32400,   36000,   39600,   43200,   46800,
         50400,   54000,   57600,   61200,   64800,   68400,   72000,
         75600,   79200,   82800,   86400,   90000,   93600,   97200,
        100800,  104400,  108000,  111600,  115200,  118800,  122400,
        126000,  129600,  133200,  136800,  140400,  144000,  147600,
        151200,  154800,  158400,  162000,  165600,  169200,  172800,
        176400,  180000,  183600,  187200,  190800,  194400,  198000,
        201600,  205200,  208800,  212400,  216000,  219600,  223200,
        226800,  230400,  234000,  237600,  241200,  244800,  248400,
        252000,  255600,  259200,  262800,  266400,  270000,  273600,
        277200,  280800,  284400,  288000,  291600,  295200,  298800,
        302400,  306000,  309600,  313200,  316800,  320400,  324000,
        327600,  331200,  334800,  338400,  342000,  345600,  349200,
        352800,  356

In [11]:
_ = ok.grade('q4112')

## Working with single elements of arrays ("indexing")

Let's work with a more interesting dataset.  The next cell creates an array called `population` that includes estimated world populations in every year from **1950** to roughly the present.  (The estimates come from the [US Census Bureau website](http://www.census.gov/population/international/data/worldpop/table_population.php).)

Rather than type in the data manually, we've loaded them from a file on your computer called `world_population.csv`.  You'll learn how to do that next week.

In [12]:
# Don't worry too much about what goes on in this cell.
import pandas as pd
population = pd.read_csv("world_population.csv")["Population"].values
population

array([2557628654, 2594939877, 2636772306, 2682053389, 2730228104,
       2782098943, 2835299673, 2891349717, 2948137248, 3000716593,
       3043001508, 3083966929, 3140093217, 3209827882, 3281201306,
       3350425793, 3420677923, 3490333715, 3562313822, 3637159050,
       3712697742, 3790326948, 3866568653, 3942096442, 4016608813,
       4089083233, 4160185010, 4232084578, 4304105753, 4379013942,
       4451362735, 4534410125, 4614566561, 4695736743, 4774569391,
       4856462699, 4940571232, 5027200492, 5114557167, 5201440110,
       5288955934, 5371585922, 5456136278, 5538268316, 5618682132,
       5699202985, 5779440593, 5857972543, 5935213248, 6012074922,
       6088571383, 6165219247, 6242016348, 6318590956, 6395699509,
       6473044732, 6551263534, 6629913759, 6709049780, 6788214394,
       6866332358, 6944055583, 7022349283, 7101027895, 7178722893,
       7256490011])

Here's how we get the first element of `population`, which is the world population in the first year in the dataset, 1950.

In [13]:
population.item(0)

2557628654

The value of that expression is the number 2557628654 (around 2.5 billion), because that's the first thing in the array `population`.

Notice that we wrote `.item(0)`, not `.item(1)`, to get the first element.  This is a weird convention in computer science.  0 is called the *index* of the first item.  It's the number of elements that appear *before* that item.  So 3 is the index of the 4th item.

Here are some more examples.  In the examples, we've given names to the things we get out of `population`.  Read and run each cell.

In [14]:
# The third element in the array is the population
# in 1952.
population_1952 = population.item(2)
population_1952

2636772306

In [15]:
# The thirteenth element in the array is the population
# in 1962 (which is 1950 + 12).
population_1962 = population.item(12)
population_1962

3140093217

In [16]:
# The 66th element is the population in 2015.
population_2015 = population.item(65)
population_2015

7256490011

In [17]:
# The array has only 66 elements, so this doesn't work.
# (There's no element with 66 other elements before it.)
# population_2016 = population.item(66)
# population_2016
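
To see the error without interrupting the notebook, the sketch below catches it on a small stand-in array. The names `tiny` and `error_seen` are made up, and `try`/`except` is ordinary Python that you don't need to know yet.

```python
import numpy as np

tiny = np.array([10, 20, 30])  # 3 elements, so the valid indices are 0, 1, and 2

error_seen = False
try:
    tiny.item(3)  # there is no element with 3 others before it
except IndexError as err:
    error_seen = True
    print("IndexError:", err)
```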

In [18]:
# Since np.array returns an array, we can call .item(3)
# on its output to get its 4th element.
np.array([-1, -3, 4, -2]).item(3)

-2

**Question 1.2.1.** Set `population_1973` to the world population in 1973, by
getting the appropriate element from `population` using `item`.

In [19]:
population_1973 =  population.item(23)
population_1973

3942096442

In [20]:
_ = ok.grade('q421')

## Doing something to every element of an array

Arrays are primarily useful for doing the same operation many times, so we don't often have to use `.item` and work with single elements.

### Logarithms

Here is one simple question we might ask about world population:

> How big was the population in *orders of magnitude* in each year?

The logarithm function is one way of measuring how big a number is. The logarithm (base 10) of a number increases by 1 every time we multiply the number by 10. It's like a measure of how many decimal digits the number has, or how big it is in orders of magnitude.

We could try to answer our question like this, using the `log10` function from the `np` module and the `item` method you just saw:

In [21]:
population_1950_magnitude = np.log10(population.item(0))
population_1951_magnitude = np.log10(population.item(1))
population_1952_magnitude = np.log10(population.item(2))
population_1953_magnitude = np.log10(population.item(3))
# And so on

But this is tedious and doesn't take full advantage of the fact that we are
using a computer.

Instead, NumPy's version of `log10` can take the logarithm of *each element*
of an array.  It takes a single array of numbers as its argument.  It returns
an array of the same length, where the first element of the result is the
logarithm of the first element of the argument, and so on.
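
Here is a small sketch of that behavior on a made-up array of powers of ten, where the logarithms are easy to check by hand:

```python
import numpy as np

powers_of_ten = np.array([1, 10, 100, 1000])

# One call takes the base-10 logarithm of every element.
magnitudes = np.log10(powers_of_ten)
print(magnitudes)  # [0. 1. 2. 3.]
```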

**Question 1.3.1.** Use NumPy to compute the base-10 logarithms of the world population in every year.  Give the result (an array of 66 numbers) the name `population_magnitudes`.  Your code should be very short.

In [22]:
population_magnitudes = np.log10(population)
population_magnitudes

array([9.40783749, 9.4141273 , 9.42107263, 9.42846742, 9.43619893,
       9.44437257, 9.45259897, 9.46110062, 9.4695477 , 9.47722498,
       9.48330217, 9.48910971, 9.49694254, 9.50648175, 9.51603288,
       9.5251    , 9.53411218, 9.54286695, 9.55173218, 9.56076229,
       9.56968959, 9.57867667, 9.58732573, 9.59572724, 9.60385954,
       9.61162595, 9.61911264, 9.62655434, 9.63388293, 9.64137633,
       9.64849299, 9.6565208 , 9.66413091, 9.67170374, 9.67893421,
       9.68632006, 9.69377717, 9.70132621, 9.70880804, 9.7161236 ,
       9.72336995, 9.73010253, 9.73688521, 9.74337399, 9.74963446,
       9.75581413, 9.7618858 , 9.76774733, 9.77343633, 9.77902438,
       9.7845154 , 9.78994853, 9.7953249 , 9.80062024, 9.80588805,
       9.81110861, 9.81632507, 9.82150788, 9.82666101, 9.83175555,
       9.83672482, 9.84161319, 9.84648243, 9.85132122, 9.85604719,
       9.8607266 ])

In [23]:
_ = ok.grade('q431')

<img src="array_logarithm.jpg">

This is called *elementwise* application of the function, since it operates separately on each element of the array it's called on.  The textbook's section on arrays has a useful list of NumPy functions that are designed to work elementwise, like `np.log10`.
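
Two other NumPy functions that work elementwise are `np.abs` and `np.sqrt`. The sketch below applies them to a made-up array; `signed_values` is an illustrative name.

```python
import numpy as np

signed_values = np.array([-4.0, 9.0, -16.0])

# Absolute value of each element.
magnitudes_only = np.abs(signed_values)
print(magnitudes_only)  # [ 4.  9. 16.]

# Square root of each element of the result.
roots = np.sqrt(magnitudes_only)
print(roots)  # [2. 3. 4.]
```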


## Arithmetic

Arithmetic also works elementwise on arrays.  For example, you can divide all the population numbers by 1 billion to get numbers in billions:


In [24]:
population_in_billions = population / 1000000000
population_in_billions

array([2.55762865, 2.59493988, 2.63677231, 2.68205339, 2.7302281 ,
       2.78209894, 2.83529967, 2.89134972, 2.94813725, 3.00071659,
       3.04300151, 3.08396693, 3.14009322, 3.20982788, 3.28120131,
       3.35042579, 3.42067792, 3.49033371, 3.56231382, 3.63715905,
       3.71269774, 3.79032695, 3.86656865, 3.94209644, 4.01660881,
       4.08908323, 4.16018501, 4.23208458, 4.30410575, 4.37901394,
       4.45136274, 4.53441012, 4.61456656, 4.69573674, 4.77456939,
       4.8564627 , 4.94057123, 5.02720049, 5.11455717, 5.20144011,
       5.28895593, 5.37158592, 5.45613628, 5.53826832, 5.61868213,
       5.69920299, 5.77944059, 5.85797254, 5.93521325, 6.01207492,
       6.08857138, 6.16521925, 6.24201635, 6.31859096, 6.39569951,
       6.47304473, 6.55126353, 6.62991376, 6.70904978, 6.78821439,
       6.86633236, 6.94405558, 7.02234928, 7.10102789, 7.17872289,
       7.25649001])

You can do the same with addition, subtraction, multiplication, and exponentiation (`**`). For example, you can calculate a tip on several restaurant bills at once (in this case just 3):


In [25]:
restaurant_bills = np.array([20.12, 39.90, 31.01])
print("Restaurant bills:\t", restaurant_bills)
tips = .2 * restaurant_bills
print("Tips:\t\t\t", tips)

Restaurant bills:	 [20.12 39.9  31.01]
Tips:			 [4.024 7.98  6.202]


<img src="array_multiplication.jpg">
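
The other operators behave the same way. As a quick sketch on a made-up array (`side_lengths` is an illustrative name):

```python
import numpy as np

side_lengths = np.array([1, 2, 3])  # made-up values

# Each operation is applied to every element and produces a new array.
print(side_lengths + 10)  # [11 12 13]
print(side_lengths - 1)   # [0 1 2]
print(side_lengths ** 2)  # [1 4 9]
```

The original array is unchanged by these expressions; each one produces a new array.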

**Question 1.3.2.** Suppose the total charge at a restaurant is the original
bill plus the tip.  That means we can multiply the original bill by 1.2 to get
the total charge.  Compute the total charge for each bill in
`restaurant_bills`.

In [26]:
total_charges = restaurant_bills * 1.2
total_charges

array([24.144, 47.88 , 37.212])

In [27]:
_ = ok.grade('q432')

**Question 1.3.3.** `more_restaurant_bills.csv` contains 100,000 bills!  Compute the total charge for each one.  How is your code different?

In [28]:
# Don't worry about the next two lines.  They get the data into an array.
bills_table = pd.read_table("more_restaurant_bills.csv")
more_restaurant_bills = bills_table["Bill"].values

# Your code here.  It will use the more_restaurant_bills variable.
more_total_charges = more_restaurant_bills * 1.2
more_total_charges

array([20.244, 20.892, 12.216, ..., 19.308, 18.336, 35.664])

In [29]:
_ = ok.grade('q433')

The function `sum` takes a single array of numbers as its argument.  It returns the sum of all the numbers in that array (so it returns a single number, not an array).
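
A minimal sketch on three made-up charges follows. It also shows NumPy's own `np.sum`, which gives the same total and is usually faster on large arrays.

```python
import numpy as np

three_charges = np.array([24.0, 48.0, 36.0])  # made-up values

# Python's built-in sum adds up the elements and returns one number.
total = sum(three_charges)
print(total)  # 108.0

# NumPy's version returns the same total.
numpy_total = np.sum(three_charges)
print(numpy_total)  # 108.0
```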

**Question 1.3.4.** What was the sum of all the bills in `more_restaurant_bills`, *including tips*?

In [30]:
sum_of_bills = sum(more_total_charges)
sum_of_bills

1795730.0640000193

In [31]:
_ = ok.grade('q434')

**Question 1.3.5.** The powers of 2 ($2^0 = 1$, $2^1 = 2$, $2^2 = 4$, etc.) arise frequently in computer science.  (For example, you may have noticed that storage on smartphones or USB drives comes in powers of 2, like 16 GB, 32 GB, or 64 GB.)  Use `np.arange` and the exponentiation operator `**` to compute the first 30 powers of 2, starting from $2^0$.

In [32]:
powers_of_2 = 2 ** np.arange(30)
powers_of_2

array([        1,         2,         4,         8,        16,        32,
              64,       128,       256,       512,      1024,      2048,
            4096,      8192,     16384,     32768,     65536,    131072,
          262144,    524288,   1048576,   2097152,   4194304,   8388608,
        16777216,  33554432,  67108864, 134217728, 268435456, 536870912])

In [33]:
_ = ok.grade('q435')

Congratulations, you're done with the assignment!  Be sure to:

- **run all the tests** (the next cell has a shortcut for that), and
- **Save and Checkpoint** from the `File` menu.

In [34]:
# For your convenience, you can run this cell to run all the tests at once!
import os
_ = [ok.grade(q[:-3]) for q in os.listdir("tests") if q.startswith('q')]