# Run the cell below

To run a code cell (i.e., execute the Python code inside it), click the play button on the ribbon underneath the name of the notebook. Before you begin, click the "Run cell" button at the top (it looks like ▶|) or hold down `Shift` and press `Return`.

# Lab 03: Data Types and Arrays

Welcome to Foundations of Data Science for High School! Throughout the course you will complete assignments like this one. You can't learn technical subjects without hands-on practice, so these assignments are an important part of the course.

**Collaboration Policy:**

Collaborating on labs is more than okay -- it's encouraged! You should rarely remain stuck for more than a few minutes on questions in labs, so ask a neighbor or an instructor for help. Explaining things is beneficial, too -- the best way to solidify your knowledge of a subject is to explain it. You should **not** _just_ copy/paste someone else's code, but rather work together to gain understanding of the task you need to complete. 

**Due Date:** 

# Today's Assignment 

So far, we've used Python to manipulate numbers and work with tables.  But we need to discuss data types to deepen our understanding of how to work with data in Python.

In this lab, you'll first see how to represent and manipulate another fundamental type of data: text.  A piece of text is called a *string* in Python. You'll also see how to work with *arrays* of data, such as all the numbers between 0 and 100 or all the words in the chapter of a book. Lastly, you'll create tables and practice analyzing them with your knowledge of table operations.

This week, we'll learn how to work with text and arrays.

Recommended reading
 * [Strings](https://inferentialthinking.com/chapters/04/2/Strings.html)
 
 * [Sequences](https://inferentialthinking.com/chapters/05/Sequences.html)

First, set up the imports by running the cell below.

In [1]:
from datascience import *
import numpy as np
import math

## 1. Text
Programming doesn't just concern numbers. Text is one of the most common data types used in programs. 

Text is represented by a **string value** in Python. The word "string" is a programming term for a sequence of characters. A string might contain a single character, a word, a sentence, or a whole book.

To distinguish text data from actual code, we demarcate strings by putting quotation marks around them. Single quotes (`'`) and double quotes (`"`) are both valid, but the types of opening and closing quotation marks must match. The contents can be any sequence of characters, including numbers and symbols. 
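As a small sketch of the matching-quotes rule (the names `single`, `double`, `quote`, and `digits` are made up for this illustration):

```python
# Both quote styles produce the same kind of value.
single = 'Data Science'
double = "Data Science"
print(single == double)   # True: the quotation marks are not part of the text

# Using one style on the outside lets the other style appear inside the text.
quote = 'She said "hello"'
print(quote)

# Digits inside quotes are text, not a number.
digits = "2015"
print(type(digits))
```

Wrapping a string in single quotes is one way to include double quotes in the text itself, which may come in handy in Question 1.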

We've seen strings before in `print` statements.  Below, two different strings are passed as arguments to the `print` function.

In [2]:
print("I <3", 'Data Science')

I <3 Data Science


Just as names can be given to numbers, names can be given to string values.  The names and strings aren't required to be similar in any way. Any name can be assigned to any string.

In [3]:
one = 'two'
plus = '*'
print(one, plus, one)

two * two


**Question 1.** Yuri Gagarin was the first person to travel through outer space.  When he emerged from his capsule upon landing on Earth, he [reportedly](https://en.wikiquote.org/wiki/Yuri_Gagarin) had the following conversation with a woman and girl who saw the landing:

    The woman asked: "Can it be that you have come from outer space?"
    Gagarin replied: "As a matter of fact, I have!"

The cell below contains unfinished code.  Fill in the `...`s so that it prints out this conversation *exactly* as it appears above.


In [13]:
woman_asking = 'The woman asked:' # SOLUTION
woman_quote = '"Can it be that you have come from outer space?"'
gagarin_reply = 'Gagarin replied:'
gagarin_quote = '"As a matter of fact, I have!"' # SOLUTION

print(woman_asking, woman_quote)
print(gagarin_reply, gagarin_quote)

The woman asked: "Can it be that you have come from outer space?"
Gagarin replied: "As a matter of fact, I have!"


In [14]:
woman_asking == 'The woman asked:'

True

In [15]:
woman_quote == '"Can it be that you have come from outer space?"'

True

In [16]:
gagarin_quote == '"As a matter of fact, I have!"'

True

In [17]:
gagarin_reply == 'Gagarin replied:'

True

### String Methods

Strings can be transformed using **methods**. Recall that methods and functions are not technically the same thing, but we'll be using them interchangeably for the purposes of this course.

Here's a sketch of how to call methods on a string:

    <expression that evaluates to a string>.<method name>(<argument>, <argument>, ...)
    
One example of a string method is `replace`, which replaces all instances of some part of the original string (or a *substring*) with a new string. 

    <original string>.replace(<old substring>, <new substring>)
    
`replace` returns (evaluates to) a new string, leaving the original string unchanged.
    
Try to predict the output of this example, then run the cell.

In [7]:
# Replace the substring 'on' with 'a'
'Month'.replace('on', 'a')

'Math'
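To check the claim that `replace` leaves the original string unchanged, here is a minimal sketch. The names `original` and `changed` are illustrative and not part of the lab.

```python
original = 'Month'
changed = original.replace('on', 'a')

print(changed)    # Math
print(original)   # Month (the original string is untouched)

# replace swaps every occurrence of the substring, not just the first one.
print('banana'.replace('a', 'o'))   # bonono
```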

You can call functions on the results of other functions.  For example, `max(abs(-5), abs(3))` evaluates to 5.  Similarly, you can call methods on the results of other method or function calls.

You may have already noticed one difference between functions and methods - a function like `max` does not require a `.` before it's called, but a string method like `replace` does.

In [8]:
# Calling replace on the output of another call to replace
'train'.replace('t', 'ing').replace('in', 'de')

'degrade'

Here's a picture of how Python evaluates a "chained" method call like that:

<img src="images/chaining_method_calls.png"/>

**Question 2.** Use `replace` to transform the string `'hitchhiker'` into `'matchmaker'`. Assign your result to `new_word`.


In [18]:
new_word = 'hitchhiker'.replace('i','a').replace('h','m').replace('mm','hm') # SOLUTION
new_word

'matchmaker'

In [19]:
new_word

'matchmaker'

There are many more string methods in Python, but most programmers don't memorize their names or how to use them.  In the "real world," people usually just search the internet for documentation and examples. A complete [list of string methods](https://docs.python.org/3/library/stdtypes.html#string-methods) appears in the Python language documentation. [Stack Overflow](http://stackoverflow.com) has a huge database of answered questions that often demonstrate how to use these methods to achieve various ends.
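For a taste of what that list contains, here is a short sketch of four other built-in string methods. The lab does not rely on them, and the name `subject` is made up for the example.

```python
subject = 'data science'

print(subject.upper())         # DATA SCIENCE
print(subject.title())         # Data Science
print(subject.count('e'))      # 2
print('  padded  '.strip())    # padded
```

Each of these follows the same calling pattern as `replace`: a string, a dot, the method name, and then any arguments in parentheses.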

### Converting to and from Strings

Strings and numbers are different *types* of values, even when a string contains the digits of a number. For example, evaluating the following cell causes an error because an integer cannot be added to a string.

In [11]:
8 + "8"

TypeError: unsupported operand type(s) for +: 'int' and 'str'

However, there are built-in functions to convert numbers to strings and strings to numbers. Some of these built-in functions have restrictions on the type of argument they take:

|Function |Description|
|-|-|
|`int`|Converts a string of digits or a float to an integer ("int") value|
|`float`|Converts a string of digits (perhaps with a decimal point) or an int to a decimal ("float") value|
|`str`|Converts any value to a string|

Try to predict what data type and value `example` evaluates to, then run the cell.

In [20]:
example = 8 + int("10") + float("8")

print(example)
print("This example returned a " + str(type(example)) + "!")

26.0
This example returned a <class 'float'>!
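The conversion functions in the table can also be tried one at a time. The sketch below uses made-up names (`as_int`, `as_float`, `as_text`, `truncated`) and also shows one restriction: `int` cannot read a decimal point directly from a string.

```python
as_int = int("10") + 1        # "10" becomes the integer 10
as_float = float("8")         # "8" becomes the float 8.0
as_text = str(8) + "8"        # joins two strings instead of adding numbers
truncated = int(3.9)          # converting a float to an int drops the fraction

print(as_int)      # 11
print(as_float)    # 8.0
print(as_text)     # 88
print(truncated)   # 3

# int() raises an error for a string that contains a decimal point.
try:
    int("3.9")
except ValueError as error:
    print("ValueError:", error)
```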


Suppose you're writing a program that looks for dates in a text, and you want your program to find the amount of time that elapsed between two years it has identified.  It doesn't make sense to subtract two texts, but you can first convert the text containing the years into numbers.

**Question 3.** Finish the code below to compute the number of years that elapsed between `one_year` and `another_year`.  Don't just write the numbers `1618` and `1648` (or `30`); use a conversion function to turn the given text data into numbers.


In [23]:
# Some text data:
one_year = "1618"
another_year = "1648"

# Complete the next line.  Note that we can't just write:
# another_year - one_year
# If you don't see why, try seeing what happens when you
# write that here.
difference = int(another_year) - int(one_year) # SOLUTION
difference

30

In [24]:
difference

30

### Passing Strings to Functions

String values, like numbers, can be arguments to functions and can be returned by functions. 

The function `len` (derived from the word "length") takes a single string as its argument and returns the number of characters (including spaces) in the string.

Note that it doesn't count *words*. `len("one small step for man")` evaluates to 22, not 5.
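The difference between counting characters and counting words can be sketched as follows. The `split` method is not covered in this lab; it is used here only to break the text apart at spaces, and the variable names are illustrative.

```python
phrase = "one small step for man"

num_characters = len(phrase)       # counts every letter and every space
num_words = len(phrase.split())    # split() returns a list of the words

print(num_characters)   # 22
print(num_words)        # 5
```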

**Question 4.**  Use `len` to find the number of characters in the long string in the next cell.  Characters include things like spaces and punctuation. Assign `sentence_length` to that number.

**Note:** The string is the first sentence of the English translation of the French [Declaration of the Rights of Man](http://avalon.law.yale.edu/18th_century/rightsof.asp). 

In [25]:
a_very_long_sentence = "The representatives of the French people, organized as a National Assembly, believing that the ignorance, neglect, or contempt of the rights of man are the sole cause of public calamities and of the corruption of governments, have determined to set forth in a solemn declaration the natural, unalienable, and sacred rights of man, in order that this declaration, being constantly before all the members of the Social body, shall remind them continually of their rights and duties; in order that the acts of the legislative power, as well as those of the executive power, may be compared at any moment with the objects and purposes of all political institutions and may thus be more respected, and, lastly, in order that the grievances of the citizens, based hereafter upon simple and incontestable principles, shall tend to the maintenance of the constitution and redound to the happiness of all."
sentence_length = len(a_very_long_sentence) # SOLUTION
sentence_length

896

In [26]:
sentence_length

896

## 2. Arrays

Computers are most useful when you can use a small amount of code to *do the same action* to *many different things*.

For example, in the time it takes you to calculate the 18% tip on a restaurant bill, a laptop can calculate 18% tips for every restaurant bill paid by every human on Earth that day. That's if you're pretty fast at doing arithmetic in your head.

**Arrays** are how we put many values in one place so that we can operate on them as a group. For example, if `billions_of_numbers` is an array of numbers, the expression

    .18 * billions_of_numbers

gives a new array of numbers that contains the result of multiplying each number in `billions_of_numbers` by .18.  Arrays are not limited to numbers; we can also put all the words in a book into an array of strings.

Concretely, an array is a **collection of values of the same data type**. 
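Here is a minimal sketch of both ideas, using three made-up bill amounts in place of billions of numbers. It builds the arrays with `np.array` as a stand-in; the lab's own `make_array` is introduced in the next section.

```python
import numpy as np

bills = np.array([10.0, 20.0, 50.0])   # made-up amounts
tips = .18 * bills                     # multiplies every element by .18
print(tips)

# Every element of an array shares one data type.
print(bills.dtype)    # float64

# Mixing integers and decimals gives an array in which all values are floats.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)    # float64
```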

### Making arrays

First, let's learn how to manually input values into an array. This typically isn't how programs work. Normally, we create arrays by loading them from an external source, like a data file.

To create an array by hand, call the function `make_array`.  Each argument you pass to `make_array` will be in the array it returns.  Run this cell to see an example:

In [27]:
make_array(0.125, 4.75, -1.3)

array([ 0.125,  4.75 , -1.3  ])

Each value in an array (in the above case, the numbers 0.125, 4.75, and -1.3) is called an *element* of that array.

Arrays themselves are also values, just like numbers and strings.  That means you can assign them to names or use them as arguments to functions. For example, `len(<some_array>)` returns the number of elements in `some_array`.

**Question 5.** Make an array containing the numbers 0, 1, -1, $\pi$, and $e$, in that order.  Name it `interesting_numbers`.  

**Hint:** How did you get the values $\pi$ and $e$ in **Lab 02**?  You can refer to them in exactly the same way here.


In [28]:
interesting_numbers = make_array(0, 1, -1, math.pi, math.e) # SOLUTION
interesting_numbers

array([ 0.        ,  1.        , -1.        ,  3.14159265,  2.71828183])

In [29]:
isinstance(interesting_numbers, np.ndarray)

True

In [30]:
len(interesting_numbers)

5

In [31]:
all(interesting_numbers == np.array([0, 1, -1, math.pi, math.e]))

True

**Question 6.** Make an array containing the five strings "Hello", ",", " ", "world", and "!". (**Note:** The third element is a single space inside quotes.) Name it `hello_world_components`.

**Note:** If you evaluate `hello_world_components`, you'll notice some extra information in addition to its contents: `dtype='<U5'`. That's just NumPy's extremely cryptic way of saying that the elements of the array are strings of at most 5 characters.


In [34]:
hello_world_components = make_array("Hello", ",", " ", "world", "!") # SOLUTION
hello_world_components

array(['Hello', ',', ' ', 'world', '!'],
      dtype='<U5')

In [35]:
isinstance(hello_world_components, np.ndarray)

True

In [36]:
len(hello_world_components)

5

In [37]:
all(hello_world_components == np.array(["Hello", ",", " ", "world", "!"]))

True

### `np.arange`

Arrays are provided by a package called [NumPy](https://www.numpy.org) (pronounced "NUM-pie"). The package is called numpy, but it's standard to rename it np for brevity. You can do that with:

`import numpy as np`

Very often in data science, we want to work with many numbers that are evenly spaced within some range. NumPy provides a special function for this called `arange`. The line of code `np.arange(start, stop, step)` evaluates to an array with all the numbers starting at `start` and counting up by `step`, stopping before `stop` is reached.

Run the following cells to see some examples.

In [38]:
# This array starts at 1 and counts 
# up by 2 and then stops before 6.
np.arange(1, 6, 2)

array([1, 3, 5])

In [40]:
# This array doesn't contain 9 because np.arange 
# stops before the stop value is reached
np.arange(4, 9, 1)

array([4, 5, 6, 7, 8])
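`np.arange` can also be called with fewer than three arguments, a form this lab uses later when it builds an array of years. A short sketch (the name `evens` is made up):

```python
import numpy as np

# With one argument, arange starts at 0 and counts up by 1.
print(np.arange(5))        # [0 1 2 3 4]

# With two arguments, the step defaults to 1.
print(np.arange(3, 7))     # [3 4 5 6]

# To include the endpoint 10, the stop value must be past it.
evens = np.arange(0, 10 + 1, 2)
print(evens)               # [ 0  2  4  6  8 10]
```

Writing the stop value as `10 + 1` is one way to make it obvious that the endpoint is meant to be included.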

**Question 7.** Use `np.arange` to create an array with the multiples of 99 from 0 up to (**and including**) 9999. So its elements are 0, 99, 198, 297, etc.


In [41]:
multiples_of_99 = np.arange(0,10000,99) # SOLUTION
multiples_of_99

array([   0,   99,  198,  297,  396,  495,  594,  693,  792,  891,  990,
       1089, 1188, 1287, 1386, 1485, 1584, 1683, 1782, 1881, 1980, 2079,
       2178, 2277, 2376, 2475, 2574, 2673, 2772, 2871, 2970, 3069, 3168,
       3267, 3366, 3465, 3564, 3663, 3762, 3861, 3960, 4059, 4158, 4257,
       4356, 4455, 4554, 4653, 4752, 4851, 4950, 5049, 5148, 5247, 5346,
       5445, 5544, 5643, 5742, 5841, 5940, 6039, 6138, 6237, 6336, 6435,
       6534, 6633, 6732, 6831, 6930, 7029, 7128, 7227, 7326, 7425, 7524,
       7623, 7722, 7821, 7920, 8019, 8118, 8217, 8316, 8415, 8514, 8613,
       8712, 8811, 8910, 9009, 9108, 9207, 9306, 9405, 9504, 9603, 9702,
       9801, 9900, 9999])

In [42]:
isinstance(multiples_of_99, np.ndarray)

True

In [43]:
len(multiples_of_99)

102

In [44]:
[multiples_of_99[0], multiples_of_99[101]] == [0, 9999]

True

### Temperature Readings

NOAA (the US National Oceanic and Atmospheric Administration) operates weather stations that measure surface temperatures at different sites around the United States. The hourly readings are [publicly](http://www.ncdc.noaa.gov/qclcd/QCLCD?prior=N) available.

Suppose we download all the hourly data from the Durham, North Carolina site for the month of December 2015. To analyze the data, we want to know when each reading was taken, but we find that the data don't include the timestamps of the readings (the time at which each one was taken).

However, we know the first reading was taken at the first instant of December 2015 (midnight on December 1st) and each subsequent reading was taken exactly 1 hour after the last.

**Question 8.** Create an array of the time, in seconds, since the start of the month at which each hourly reading was taken. Name it `collection_times`.

**Hints:** 

- There were 31 days in December, which is equivalent to ($31 \times 24$) hours or ($31 \times 24 \times 60 \times 60$) seconds. So your array should have $31 \times 24$ elements in it.

- The `len` function works on arrays, too. If your `collection_times` isn't passing the tests, check its length and make sure it has $31 \times 24$ elements.


In [45]:
collection_times = np.arange(0, 31*24*60*60, 60*60) # SOLUTION
collection_times

array([      0,    3600,    7200,   10800,   14400,   18000,   21600,
         25200,   28800,   32400,   36000,   39600,   43200,   46800,
         50400,   54000,   57600,   61200,   64800,   68400,   72000,
         75600,   79200,   82800,   86400,   90000,   93600,   97200,
        100800,  104400,  108000,  111600,  115200,  118800,  122400,
        126000,  129600,  133200,  136800,  140400,  144000,  147600,
        151200,  154800,  158400,  162000,  165600,  169200,  172800,
        176400,  180000,  183600,  187200,  190800,  194400,  198000,
        201600,  205200,  208800,  212400,  216000,  219600,  223200,
        226800,  230400,  234000,  237600,  241200,  244800,  248400,
        252000,  255600,  259200,  262800,  266400,  270000,  273600,
        277200,  280800,  284400,  288000,  291600,  295200,  298800,
        302400,  306000,  309600,  313200,  316800,  320400,  324000,
        327600,  331200,  334800,  338400,  342000,  345600,  349200,
        352800,  356

In [46]:
isinstance(collection_times, np.ndarray)

True

In [47]:
len(collection_times)

744

In [48]:
all(collection_times == np.arange(0, 31*24*60*60, 60*60))

True

### Working with Single Elements of Arrays ("Indexing")
Let's work with a more interesting dataset. The next cell creates an array called `population_amounts` that includes estimated world populations in every year from 1960 to roughly the present. The estimates come from the US Census Bureau website.

Rather than type in the data manually, we've loaded them from a file on your computer called `world_population.csv`. You'll learn how to do that later in this lab.

In [49]:
population_amounts = Table.read_table("data/world_population.csv").column("Population")
population_amounts

array([3032156070, 3071596055, 3124561005, 3189655687, 3255145692,
       3322046795, 3392097729, 3461619724, 3532782993, 3606553753,
       3681975908, 3760516757, 3836900801, 3912984371, 3988487336,
       4062507027, 4135432265, 4207786422, 4281339378, 4356778367,
       4432963653, 4511164132, 4592387213, 4674330282, 4755996689,
       4839176734, 4924747934, 5012556248, 5101287675, 5189977062,
       5280046096, 5368139468, 5452576447, 5537885552, 5622085788,
       5706753900, 5789655609, 5872286683, 5954005906, 6034491620,
       6114332517, 6193671694, 6272753009, 6351882385, 6431551721,
       6511748273, 6592734559, 6674203697, 6757020825, 6839574233,
       6921877071, 7002880914, 7085790438, 7169675197, 7254292848,
       7339076654, 7424484741, 7509410228, 7592475615, 7673345391,
       7752840547])

Here's how we get the first element of `population_amounts`, which is the world population in the first year in the dataset, 1960.

In [37]:
population_amounts.item(0)

3032156070

The value of that expression is the number 3032156070 (around 3 billion), because that's the first thing in the array `population_amounts`.

Notice that we wrote `.item(0)`, not `.item(1)`, to get the first element. This is a weird convention in computer science. 0 is called the index of the first item. It's the number of elements that appear before that item. So 3 is the index of the 4th item.
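The index convention can be tried on a tiny array first. This is a sketch with a made-up array of letters, built with `np.array`:

```python
import numpy as np

letters = np.array(['a', 'b', 'c', 'd'])

first = letters.item(0)                  # no elements come before 'a'
fourth = letters.item(3)                 # three elements come before 'd'
last = letters.item(len(letters) - 1)    # the last index is the length minus 1

print(first, fourth, last)   # a d d
```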

Here are some more examples. In the examples, we've given names to the things we get out of `population_amounts`. Read and run each cell.

In [50]:
# The 13th element in the array is the 
# population in 1972 (which is 1960 + 12).
population_1972 = population_amounts.item(12)
population_1972

3836900801

In [51]:
# The 56th element is the population in 2015 (which is 1960 + 55).
population_2015 = population_amounts.item(55)
population_2015

7339076654

In [52]:
# The array has only 61 elements, so this doesn't work.
# (There's no element with 61 other elements before it.)
population_amounts.item(61)

IndexError: index 61 is out of bounds for axis 0 with size 61

Since `make_array` returns an array, we can call `.item(3)` on its output to get its 4th element, just like we "chained" together calls to the method `replace` earlier.

In [53]:
make_array(-1, -3, 4, -2).item(3)

-2

**Question 9.** Set `population_1988` to the world population in 1988, by getting the appropriate element from `population_amounts` using `item`.


In [56]:
population_1988 = population_amounts.item(28) # SOLUTION
population_1988

5101287675

In [57]:
population_1988

5101287675

### Doing Something to Every Element of an Array

Arrays are primarily useful for doing the same operation many times, so we don't often have to use `.item` and work with single elements.

**Logarithms**

Here is one simple question we might ask about world population:

How big was the population in *orders* of magnitude in each year?

Orders of magnitude quantify how big a number is by representing it as the power of another number (for example, representing 104 as $10^{2.017033}$). One way to do this is by using the logarithm function. The logarithm (base 10) of a number increases by 1 every time we multiply the number by 10. It's like a measure of how many decimal digits the number has, or how big it is in orders of magnitude.
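These properties of the base-10 logarithm can be checked directly with the `math` module. This is a small sketch, not part of the lab's questions.

```python
import math

print(math.log10(100))     # 2.0
print(math.log10(1000))    # 3.0

# 104 is slightly more than 10 squared, so its logarithm is slightly more than 2.
print(math.log10(104))

# Multiplying a number by 10 raises its logarithm by 1.
print(math.log10(1040) - math.log10(104))
```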

We could try to answer our question like this, using the `log10` function from the `math` module and the `item` method you just saw:

In [44]:
population_1960_magnitude = math.log10(population_amounts.item(0))
population_1961_magnitude = math.log10(population_amounts.item(1))
population_1962_magnitude = math.log10(population_amounts.item(2))
population_1963_magnitude = math.log10(population_amounts.item(3))

But this is tedious and doesn't really take advantage of the fact that we are using a computer.

Instead, `NumPy` provides its own version of `log10` that takes the logarithm of each element of an array. It takes a single array of numbers as its argument. It returns an array of the same length, where the first element of the result is the logarithm of the first element of the argument, and so on.
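As a sketch of that behavior on a small, made-up array of powers of ten:

```python
import numpy as np

powers_of_ten = np.array([1, 10, 100, 1000])
magnitudes = np.log10(powers_of_ten)   # one logarithm per element

print(magnitudes)                              # [0. 1. 2. 3.]
print(len(magnitudes) == len(powers_of_ten))   # True
```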

**Question 10.** Use `np.log10` to compute the logarithms of the world population in every year. Give the result (an array of 61 numbers) the name `population_magnitudes`. Your code should be very short.


In [58]:
population_magnitudes = np.log10(population_amounts) # SOLUTION
population_magnitudes

array([ 9.48175155,  9.4873641 ,  9.49478901,  9.5037438 ,  9.51257043,
        9.52140575,  9.53046836,  9.53927936,  9.54811696,  9.55709241,
        9.56608094,  9.57524753,  9.58398057,  9.59250811,  9.60080822,
        9.60879413,  9.61652091,  9.62405369,  9.63157966,  9.63916547,
        9.64669417,  9.65428863,  9.6620385 ,  9.6697194 ,  9.67724154,
        9.68477148,  9.69238401,  9.70005926,  9.70767982,  9.71516544,
        9.72263771,  9.72982379,  9.73660176,  9.74334398,  9.74989747,
        9.75638914,  9.76265273,  9.76880725,  9.77480926,  9.78064069,
        9.78634905,  9.79194818,  9.79745819,  9.80290245,  9.80831577,
        9.8136976 ,  9.81906559,  9.82439946,  9.82975526,  9.83502907,
        9.84022388,  9.84527674,  9.8503883 ,  9.85549948,  9.86059508,
        9.86564142,  9.87066632,  9.87560583,  9.88038341,  9.88498475,
        9.88946085])

In [59]:
isinstance(population_magnitudes, np.ndarray)

True

In [60]:
sum(abs(population_magnitudes-np.log10(population_amounts))) < 1e-6

True

What you just did is called elementwise application of `np.log10`, since `np.log10` operates separately on each element of the array that it's called on. Here's a picture of what's going on:

<img src="images/array_logarithm.jpg"/>

The textbook's [section](https://www.inferentialthinking.com/chapters/05/1/Arrays) on arrays has a useful list of `NumPy` functions that are designed to work elementwise, like `np.log10`.

**Arithmetic**

Arithmetic also works elementwise on arrays, meaning that if you perform an arithmetic operation (like subtraction, division, etc) on an array, Python will do the operation to every element of the array individually and return an array of all of the results. For example, you can divide all the population numbers by 1 billion to get numbers in billions.

**Note:** Rather than type all those zeros, we can use scientific notation: `1e9` means 1000000000.

In [61]:
population_in_billions = population_amounts/1e9
population_in_billions

array([ 3.03215607,  3.07159606,  3.124561  ,  3.18965569,  3.25514569,
        3.32204679,  3.39209773,  3.46161972,  3.53278299,  3.60655375,
        3.68197591,  3.76051676,  3.8369008 ,  3.91298437,  3.98848734,
        4.06250703,  4.13543227,  4.20778642,  4.28133938,  4.35677837,
        4.43296365,  4.51116413,  4.59238721,  4.67433028,  4.75599669,
        4.83917673,  4.92474793,  5.01255625,  5.10128768,  5.18997706,
        5.2800461 ,  5.36813947,  5.45257645,  5.53788555,  5.62208579,
        5.7067539 ,  5.78965561,  5.87228668,  5.95400591,  6.03449162,
        6.11433252,  6.19367169,  6.27275301,  6.35188238,  6.43155172,
        6.51174827,  6.59273456,  6.6742037 ,  6.75702082,  6.83957423,
        6.92187707,  7.00288091,  7.08579044,  7.1696752 ,  7.25429285,
        7.33907665,  7.42448474,  7.50941023,  7.59247561,  7.67334539,
        7.75284055])

You can do the same with addition, subtraction, multiplication, and exponentiation (`**`). For example, you can calculate a tip on several restaurant bills at once (in this case just 3).

In [62]:
restaurant_bills = make_array(20.12, 39.90, 31.01)
print("Restaurant bills:\t", restaurant_bills) # The \t is for the tab character

# Array multiplication
tips = .2 * restaurant_bills
print("Tips:\t\t\t", tips)

Restaurant bills:	 [ 20.12  39.9   31.01]
Tips:			 [ 4.024  7.98   6.202]


<img src="images/array_multiplication.jpg"/>
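The other arithmetic operators behave the same way. The sketch below uses a made-up array named `sides`; its last line goes one step beyond the text above by combining two arrays of equal length, which NumPy pairs up element by element.

```python
import numpy as np

sides = np.array([1, 2, 3])

print(sides + 10)      # [11 12 13]
print(sides ** 2)      # [1 4 9]

# Two arrays of the same length are combined element by element.
print(sides * sides)   # [1 4 9]
```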

**Question 11.** Suppose the total charge at a restaurant is the original bill plus the tip. If the tip is 20%, that means we can multiply the original bill by 1.2 to get the total charge. Compute the total charge for each bill in `restaurant_bills`, and assign the resulting array to `total_charges`.


In [63]:
total_charges = restaurant_bills + tips # SOLUTION
total_charges

array([ 24.144,  47.88 ,  37.212])

In [64]:
isinstance(total_charges, np.ndarray)

True

In [65]:
sum(abs(total_charges - np.array([24.144, 47.88, 37.212]))) < 1e-6

True

**Question 12.** The table `more_restaurant_bills` contains 100,000 bills in its first column. Compute the total charge for each one. How is your code different?


In [66]:
more_restaurant_bills = Table.read_table("data/more_restaurant_bills.csv")
more_total_charges = 1.2 * more_restaurant_bills.column(0) # SOLUTION
more_total_charges

array([ 20.244,  20.892,  12.216, ...,  19.308,  18.336,  35.664])

In [67]:
isinstance(more_total_charges, np.ndarray)

True

In [68]:
sum(abs(more_total_charges - 1.2*more_restaurant_bills.column(0))) < 1e-6

True

The function `sum` takes a single array of numbers as its argument. It returns the sum of all the numbers in that array (so it returns a single number, not an array).
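A minimal sketch of `sum` on a short array, reusing the three total charges shown earlier in this lab. NumPy's own `np.sum` is included as an aside; it is not required here.

```python
import numpy as np

charges = np.array([24.144, 47.88, 37.212])

total = sum(charges)      # a single number, not an array
print(total)

# np.sum gives the same result, up to tiny rounding differences.
print(np.sum(charges))
```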

**Question 13.** What was the sum of all the bills in `more_restaurant_bills`, including tips?


In [69]:
sum_of_bills = sum(more_total_charges) # SOLUTION
sum_of_bills

1795730.0640000193

In [70]:
sum_of_bills

1795730.0640000193

## 3. Creating Tables

An array is useful for describing a single attribute of each element in a collection. For example, let's say our collection is all US States. Then an array could describe the land area of each state.

Tables extend this idea by containing multiple arrays, each one describing a different attribute for every element of a collection. In this way, tables allow us to not only store data about many entities but to also contain several kinds of data about each entity.

For example, in the cell below we have two arrays. The first one, `population_amounts`, was defined above in Section 2 and contains the world population in each year (estimated by the US Census Bureau). The second array, `years`, contains the years themselves. These elements are in order, so the year and the world population for that year have the same index in their corresponding arrays.

Run the cell below.

In [71]:
years = np.arange(1960, 2020 + 1)
print("Population column:", population_amounts)
print("Years column:", years)

Population column: [3032156070 3071596055 3124561005 3189655687 3255145692 3322046795
 3392097729 3461619724 3532782993 3606553753 3681975908 3760516757
 3836900801 3912984371 3988487336 4062507027 4135432265 4207786422
 4281339378 4356778367 4432963653 4511164132 4592387213 4674330282
 4755996689 4839176734 4924747934 5012556248 5101287675 5189977062
 5280046096 5368139468 5452576447 5537885552 5622085788 5706753900
 5789655609 5872286683 5954005906 6034491620 6114332517 6193671694
 6272753009 6351882385 6431551721 6511748273 6592734559 6674203697
 6757020825 6839574233 6921877071 7002880914 7085790438 7169675197
 7254292848 7339076654 7424484741 7509410228 7592475615 7673345391
 7752840547]
Years column: [1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974
 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989
 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004
 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 

Suppose we want to answer this question:

In which year did the world's population cross 6 billion?

You could technically answer this question just from staring at the arrays, but it's a bit convoluted, since you would have to count the position where the population first crossed 6 billion, then find the corresponding element in the years array. In cases like these, it might be easier to put the data into a `Table`, a 2-dimensional type of dataset.
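For comparison, the array-only approach might look like the sketch below. It uses four consecutive values copied from `population_amounts` (indices 37 through 40, the years 1997 to 2000) so that it runs on its own. The `sample_` names are made up, and `np.argmax` is a NumPy function this lab does not cover.

```python
import numpy as np

# Four consecutive values copied from population_amounts (indices 37 to 40).
sample_populations = np.array([5872286683, 5954005906, 6034491620, 6114332517])
sample_years = np.arange(1997, 2000 + 1)

# Compare every element to 6 billion, giving an array of True/False values.
crossed = sample_populations >= 6e9

# np.argmax returns the index of the first True in a True/False array.
position = int(np.argmax(crossed))

# The same index in the years array gives the matching year.
print(sample_years.item(position))   # 1999
```

Keeping two separate arrays lined up by index works, but it is easy to get wrong, which is part of the motivation for tables.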

The expression below:

* creates an empty table using the expression `Table()`,

* adds two columns by calling `with_columns` with four arguments,

* assigns the result to the name `population`, and finally

* evaluates `population` so that we can see the table.

The strings "Year" and "Population" are column labels that we have chosen. The names `population_amounts` and `years` were assigned above to two arrays of the **same length**. The function `with_columns` (you can find the documentation [here](http://www.data8.org/datascience/_autosummary/datascience.tables.Table.with_columns.html#datascience.tables.Table.with_columns)) takes in alternating strings (to represent column labels) and arrays (representing the data in those columns). The strings and arrays are separated by commas.

In [79]:
population = Table().with_columns(
    "Population", population_amounts,
    "Year", years
)
population.show(40)

Population,Year
3032156070,1960
3071596055,1961
3124561005,1962
3189655687,1963
3255145692,1964
3322046795,1965
3392097729,1966
3461619724,1967
3532782993,1968
3606553753,1969


Now the data is combined into a single table, which makes it much easier to read. If you need to know what the population was in 1965, for example, you can tell from a single glance.

**Question 14.** In the cell below, we've created 2 arrays. Using the steps above, assign `top_10_movies` to a table that has two columns called "Rating" and "Name", which hold `top_10_movie_ratings` and `top_10_movie_names` respectively.

**Note:** Check with a classmate or your instructor to see if your answer is correct.

In [73]:
top_10_movie_ratings = make_array(9.2, 9.2, 9., 8.9, 8.9, 8.9, 8.9, 8.9, 8.9, 8.8)
top_10_movie_names = make_array(
        'The Shawshank Redemption (1994)',
        'The Godfather (1972)',
        'The Godfather: Part II (1974)',
        'Pulp Fiction (1994)',
        "Schindler's List (1993)",
        'The Lord of the Rings: The Return of the King (2003)',
        '12 Angry Men (1957)',
        'The Dark Knight (2008)',
        'Il buono, il brutto, il cattivo (1966)',
        'The Lord of the Rings: The Fellowship of the Ring (2001)')

top_10_movies = Table().with_columns('Rating', top_10_movie_ratings, 'Name', top_10_movie_names) # SOLUTION

# We've put this next line here 
# so your table will get printed 
# out when you run this cell.
top_10_movies

Rating,Name
9.2,The Shawshank Redemption (1994)
9.2,The Godfather (1972)
9.0,The Godfather: Part II (1974)
8.9,Pulp Fiction (1994)
8.9,Schindler's List (1993)
8.9,The Lord of the Rings: The Return of the King (2003)
8.9,12 Angry Men (1957)
8.9,The Dark Knight (2008)
8.9,"Il buono, il brutto, il cattivo (1966)"
8.8,The Lord of the Rings: The Fellowship of the Ring (2001)


### Loading a Table from a File

In most cases, we aren't going to go through the trouble of typing in all the data manually. Instead, we load it from an external source, like a data file. There are many formats for data files, but CSV ("comma-separated values") is one of the most common.

`Table.read_table(...)` takes one argument (a path to a data file in string format) and returns a table.
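To make the format concrete, here is a minimal sketch that parses a few lines of CSV text with Python's standard `csv` module rather than with `Table.read_table`. The sample text uses two rows that appear in the `imdb` table; the names `sample_csv`, `header`, and `records` are made up for this illustration.

```python
import csv
import io

# A tiny CSV document: the first line holds the column labels,
# and every later line is one row. Commas separate the columns.
sample_csv = '''Title,Year,Rating,Votes,Decade
Avengers: Endgame,2019,8.7,394632,2010
"Three Billboards Outside Ebbing, Missouri",2017,8.1,339946,2010
'''

# csv.reader splits each line into a list of strings.
rows = list(csv.reader(io.StringIO(sample_csv)))
header = rows[0]
records = rows[1:]

print(header)
# The quotes keep the comma inside this title from splitting the value.
print(records[1][0])
```

Note that `csv.reader` leaves every value as a string, including the numbers.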

**Question 15.** `imdb.csv` contains a table of information about the 250 highest-rated movies on IMDb. Load it as a table called `imdb`.

You may remember working with this table in **Lab 02**.


In [74]:
imdb = Table.read_table("data/imdb.csv") # SOLUTION
imdb

Title,Year,Rating,Votes,Decade
Avengers: Endgame,2019,8.7,394632,2010
Spider-Man: Into the Spider-Verse,2018,8.4,199435,2010
Avengers: Infinity War,2018,8.4,657004,2010
Green Book,2018,8.2,193829,2010
Andhadhun,2018,8.1,41901,2010
Coco,2017,8.3,272499,2010
"Three Billboards Outside Ebbing, Missouri",2017,8.1,339946,2010
Logan,2017,8.1,553983,2010
Your Name.,2016,8.3,131814,2010
Dangal,2016,8.3,121027,2010


In [75]:
isinstance(imdb, tables.Table)

True

In [76]:
imdb.num_rows

250

## 4. More Table Operations

Now that you've worked with arrays, let's add a few more methods to the list of table operations that you saw in **Lab 02**.

### `.column`

`column` takes the label of a column in a table (as a string) as its argument and returns the values in that column as an **array**.

In [77]:
# Returns an array of movie names
top_10_movies.column('Name')

array(['The Shawshank Redemption (1994)', 'The Godfather (1972)',
       'The Godfather: Part II (1974)', 'Pulp Fiction (1994)',
       "Schindler's List (1993)",
       'The Lord of the Rings: The Return of the King (2003)',
       '12 Angry Men (1957)', 'The Dark Knight (2008)',
       'Il buono, il brutto, il cattivo (1966)',
       'The Lord of the Rings: The Fellowship of the Ring (2001)'],
      dtype='<U56')
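
Because `column` returns an array, the array tools from earlier in this lab apply to the result. The sketch below uses a NumPy array with the same values as `top_10_movie_ratings` as a stand-in for what `top_10_movies.column('Rating')` would return; the name `ratings` is an assumption made for this example.

```python
import numpy as np

# Stand-in (assumption) for top_10_movies.column('Rating').
ratings = np.array([9.2, 9.2, 9.0, 8.9, 8.9, 8.9, 8.9, 8.9, 8.9, 8.8])

highest = max(ratings)                          # largest rating
first = ratings.item(0)                         # element at index 0
num_at_8_9 = np.count_nonzero(ratings == 8.9)   # how many movies are rated 8.9
print(highest, first, num_at_8_9)
```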


### `.take`

The table method `take` takes as its argument an array of numbers. Each number should be the index of a row in the table, counting from 0. It returns a **new table** with only those rows.

You'll usually want to use `take` in conjunction with `np.arange` to take the first few rows of a table.

In [65]:
# Take first 5 movies of top_10_movies
top_10_movies.take(np.arange(0, 5, 1))

Rating,Name
9.2,The Shawshank Redemption (1994)
9.2,The Godfather (1972)
9.0,The Godfather: Part II (1974)
8.9,Pulp Fiction (1994)
8.9,Schindler's List (1993)
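
`take` is not limited to the first few rows: any array of row indices works. As a rough illustration, the sketch below builds two index arrays with `np.arange` and applies them to a plain NumPy array of ratings, which selects elements in the same way `take` selects rows. The names `first_five`, `every_other`, and `ratings` are assumptions for this example.

```python
import numpy as np

# np.arange(start, stop, step) counts from start up to, but not including, stop.
first_five = np.arange(0, 5, 1)     # indices 0, 1, 2, 3, 4
every_other = np.arange(0, 10, 2)   # indices 0, 2, 4, 6, 8

# Stand-in (assumption) for the "Rating" column of top_10_movies.
ratings = np.array([9.2, 9.2, 9.0, 8.9, 8.9, 8.9, 8.9, 8.9, 8.9, 8.8])

# Indexing an array with an array of indices picks out those elements.
print(ratings[first_five])
print(ratings[every_other])
```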


The next three questions will give you practice with combining the operations you've learned in this lab and the previous one to answer questions about the `population` and `imdb` tables. First, check out the `population` table from **Section 3**.

In [80]:
# Run this cell to display the population table
population

Population,Year
3032156070,1960
3071596055,1961
3124561005,1962
3189655687,1963
3255145692,1964
3322046795,1965
3392097729,1966
3461619724,1967
3532782993,1968
3606553753,1969


**Question 16.** Check out the population table from **Section 3** of this lab. Compute the year when the world population first went above 6 billion. Assign the year to `year_population_crossed_6_billion`.

**Hint:** To earn all the points for this question, you must determine the value programmatically. That means writing code that computes the year from the table, as opposed to finding the year by hand and assigning it directly.

In [82]:
year_population_crossed_6_billion = population.where('Population', are.above(6000000000)).first('Year') # SOLUTION
year_population_crossed_6_billion

1999

In [83]:
year_population_crossed_6_billion

1999

**Question 17.** Find the average rating for movies released before the year 2000 and the average rating for movies released in the year 2000 or after for the movies in imdb.

**Hint:** Think of the steps you need to do (take the average, find the ratings, find the movies released before 2000 or in 2000 and later), and try to put them in an order that makes sense.
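
One way to order those steps is sketched below on plain NumPy arrays instead of the `imdb` table. The arrays hold the years and ratings of the ten movies displayed above, and because all ten were released after 2000, the sketch splits them at 2018, a cutoff chosen purely for illustration. The names `sample_years`, `sample_ratings`, and `cutoff` are assumptions, and Boolean indexing stands in for the table method `where`.

```python
import numpy as np

sample_years = np.array([2019, 2018, 2018, 2018, 2018, 2017, 2017, 2017, 2016, 2016])
sample_ratings = np.array([8.7, 8.4, 8.4, 8.2, 8.1, 8.3, 8.1, 8.1, 8.3, 8.3])

cutoff = 2018  # illustrative cutoff (assumption)

# Step 1: find which movies fall on each side of the cutoff.
is_before = sample_years < cutoff

# Step 2: keep only the ratings for each group.
ratings_before = sample_ratings[is_before]
ratings_from_cutoff = sample_ratings[~is_before]

# Step 3: take the average of each group.
average_before = np.mean(ratings_before)
average_from_cutoff = np.mean(ratings_from_cutoff)
print(average_before, average_from_cutoff)
```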

In [84]:
before_2000 = sum(imdb.where('Year', are.below(2000)).column('Rating'))/len(imdb.where('Year', are.below(2000)).column('Rating')) # SOLUTION
after_or_in_2000 = sum(imdb.where('Year', are.above_or_equal_to(2000)).column('Rating'))/len(imdb.where('Year', are.above_or_equal_to(2000)).column('Rating')) # SOLUTION
print("Average before 2000 rating:", before_2000)
print("Average after or in 2000 rating:", after_or_in_2000)

Average before 2000 rating: 8.2725
Average after or in 2000 rating: 8.23888888889


In [85]:
abs(round(before_2000) - 8) < 1e-5

True

In [86]:
abs(round(after_or_in_2000) - 8) < 1e-5

True

In [87]:
# HIDDEN
before_2000

8.2724999999999955

In [88]:
# HIDDEN
after_or_in_2000

8.2388888888888907