# Run the cell below

To run a code cell (i.e., execute the Python code inside a Jupyter notebook), click the play button on the ribbon underneath the name of the notebook. Before you begin, click the play button to run the code cell below.

In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("assignment08.ipynb")

# Assignment 08

Welcome to Assignment 08!  Throughout the course you will complete assignments like this one. You can't learn technical subjects without hands-on practice, so these assignments are an important part of the course.

Collaborating on labs is more than okay -- it's encouraged! You should rarely remain stuck for more than a few minutes on a question, so post to the discussion board or ask your instructor for help. Explaining things is beneficial, too -- the best way to solidify your knowledge of a subject is to explain it. You should **not** just copy/paste someone else's code, but rather work together to gain understanding of the task you need to complete. 

To receive credit for this assignment, answer all questions correctly and submit before the deadline.

**Due Date:** Wednesday, July 13, 2022 @ 11:59 pm

**Collaboration Policy:** Data science is a collaborative activity. While you may talk with others about the labs, we ask that you **write your solutions individually**. If you do discuss the assignments with others **please include their names below** (it's a good way to learn your classmates' names).

**Collaborators:** 

List collaborators here.

## Today's Assignment

In today's assignment, you'll learn how to:

- work with arrays.

- manipulate strings.

- write user-defined functions.

- perform table operations.

Let's get started! Run the cell below.

In [None]:
from datascience import *
import numpy as np
import math

## Arrays 

### Temperature Readings

NOAA (the US National Oceanic and Atmospheric Administration) operates weather stations that measure surface temperatures at different sites around the United States.  The hourly readings are [publicly available](http://www.ncdc.noaa.gov/qclcd/QCLCD?prior=N).

Suppose we download all the hourly data from the Oakland, California site for the month of December 2015.  To analyze the data, we want to know when each reading was taken, but we find that the data don't include the timestamps of the readings (the time at which each one was taken).

However, we know the first reading was taken at the first instant of December 2015 (midnight on December 1st) and each subsequent reading was taken exactly 1 hour after the last. 

**Question 1.** Create an array of the *time, in seconds, since the start of the month* at which each hourly reading was taken.  Name it `collection_times`.

**Hints:** 

 - There were 31 days in December, which is equivalent to (31 $\times$ 24) hours or (31 $\times$ 24 $\times$ 60 $\times$ 60) seconds.  So your array should have $31 \times 24$ elements in it.

 - The `len` function works on arrays, too!  If your `collection_times` isn't passing the tests, check its length and make sure it has 31 $\times$ 24 elements.
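As a smaller illustration of the same idea (not the answer to Question 1), the sketch below builds the times, in minutes, of hypothetical readings taken every 15 minutes during a 2-hour window. The names `window_minutes`, `interval_minutes`, and `reading_times` are made up for this example.

```python
import numpy as np

window_minutes = 2 * 60   # hypothetical 2-hour window, in minutes
interval_minutes = 15     # hypothetical gap between readings

# np.arange(start, stop, step) includes start but excludes stop.
reading_times = np.arange(0, window_minutes, interval_minutes)

print(reading_times)       # [  0  15  30  45  60  75  90 105]
print(len(reading_times))  # 8
```

Because the stop value is excluded, the last reading is at minute 105, not 120, and the length is 120 / 15 = 8.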

In [None]:
collection_times = ...
collection_times

In [None]:
grader.check("q1")

### Working with single elements of arrays ("indexing")

Let's work with a more interesting data set.  The next cell creates an array called `population_amounts` that includes estimated world populations in every year from **1960** to roughly the present. The estimates come from the US Census Bureau website.

Rather than type in the data manually, we've loaded them from a file on your computer called `world_population.csv`.

In [None]:
population_amounts = Table.read_table('data/world_population.csv').column('Population')
population_amounts

Here's how we get the first element of `population_amounts`, which is the world population in the first year in the dataset, 1960.

In [None]:
population_amounts.item(0)

The value of that expression is the number 3032156070 (around 3 billion), because that's the first thing in the array `population_amounts`.

Notice that we wrote `.item(0)`, not `.item(1)`, to get the first element.  This is a weird convention in computer science.  0 is called the *index* of the first item.  It's the number of elements that appear *before* that item.  So 3 is the index of the 4th item.

Here are some more examples.  In the examples, we've given names to the things we get out of `population_amounts`.  Read and run each cell.

In [None]:
# The 13th element in the array is the 
# population in 1972 (which is 1960 + 12).
population_1972 = population_amounts.item(12)
population_1972

In [None]:
# The 56th element (index 55) is the population in 2015 (which is 1960 + 55).
population_2015 = population_amounts.item(55)
population_2015

In [None]:
# The array has only 61 elements, so this doesn't work.
# There's no element with 62 other elements before it.
population_2016 = population_amounts.item(62)
population_2016

Since `make_array` returns an array, we can call `.item(3)` on its output to get its 4th element, just like we "chained" together calls to the method `replace` earlier.

In [None]:
make_array(-1, -3, 4, -2).item(3)

**Question 2.** Set `population_1973` to the world population in 1973, by getting the appropriate element from `population_amounts` using `item`.

In [None]:
population_1973 = ...
population_1973

In [None]:
grader.check("q2")

### Doing Something to Every Element of an Array
Arrays are primarily useful for doing the same operation many times, so we don't often have to use `.item` and work with single elements.

#### Logarithms
Here is one simple question we might ask about world population:

> How big was the population in *orders of magnitude* in each year?

Orders of magnitude quantify how big a number is by representing it as the power of another number (for example, representing 104 as $10^{2.017033}$). One way to do this is by using the logarithm function. The logarithm (base 10) of a number increases by 1 every time we multiply the number by 10. It's like a measure of how many decimal digits the number has, or how big it is in orders of magnitude.
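To make the idea concrete, here is a minimal sketch using a few hand-picked numbers (they are not population values). It only relies on `math.log10` from Python's standard library.

```python
import math

# Each multiplication by 10 adds 1 to the base-10 logarithm.
for value in (10, 100, 1000):
    print(value, math.log10(value))

# 104 is slightly more than 10**2, so its logarithm is slightly more than 2.
log_of_104 = math.log10(104)
print(round(log_of_104, 6))  # 2.017033
```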

We could try to answer our question like this, using the `log10` function from the `math` module and the `item` method you just saw:

In [None]:
population_1960_magnitude = math.log10(population_amounts.item(0))
population_1961_magnitude = math.log10(population_amounts.item(1))
population_1962_magnitude = math.log10(population_amounts.item(2))
population_1963_magnitude = math.log10(population_amounts.item(3))
...

But this is tedious and doesn't really take advantage of the fact that we are using a computer.

Instead, NumPy provides its own version of `log10` that takes the logarithm of each element of an array.  It takes a single array of numbers as its argument.  It returns an array of the same length, where the first element of the result is the logarithm of the first element of the argument, and so on.
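Here is a small sketch of that behavior on a hypothetical array whose logarithms are easy to check by hand. It uses `np.array` in place of `make_array` so that it runs with NumPy alone; the names `values` and `logs` are made up for this example.

```python
import numpy as np

# Hypothetical values chosen so the logarithms are whole numbers.
values = np.array([1, 10, 100, 1000])

# One call computes the logarithm of every element.
logs = np.log10(values)

print(logs)       # [0. 1. 2. 3.]
print(len(logs))  # 4, the same length as the input
```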

**Question 3.** Use `np.log10` to compute the logarithms of the world population in every year.  Give the result (an array of 61 numbers) the name `population_magnitudes`.  Your code should be very short.


In [None]:
population_magnitudes = ...
population_magnitudes

In [None]:
grader.check("q3")

What you just did is called *elementwise* application of `np.log10`, since `np.log10` operates separately on each element of the array that it's called on. Here's a picture of what's going on:

<img src="images/array_logarithm.jpg">


The textbook's [section](https://www.inferentialthinking.com/chapters/05/1/Arrays)  on arrays has a useful list of NumPy functions that are designed to work elementwise, like `np.log10`.

#### Arithmetic
Arithmetic also works elementwise on arrays, meaning that if you perform an arithmetic operation (like subtraction, division, etc) on an array, Python will do the operation to every element of the array individually and return an array of all of the results. For example, you can divide all the population numbers by 1 billion to get numbers in billions:

In [None]:
population_in_billions = population_amounts/1000000000
population_in_billions

You can do the same with addition, subtraction, multiplication, and exponentiation (`**`). For example, you can calculate a tip on several restaurant bills at once (in this case just 3):

In [None]:
restaurant_bills = make_array(20.12, 39.90, 31.01)
print("Restaurant bills:\t", restaurant_bills)

# Array multiplication
tips = .2 * restaurant_bills
print("Tips:\t\t\t", tips)

<img src="images/array_multiplication.jpg">

**Question 4.** Suppose the total charge at a restaurant is the original bill plus the tip. If the tip is 20%, that means we can multiply the original bill by 1.2 to get the total charge.  Compute the total charge for each bill in `restaurant_bills`, and assign the resulting array to `total_charges`.

In [None]:
total_charges = ...
total_charges

In [None]:
grader.check("q4")

**Question 5.** The array `more_restaurant_bills` contains 100,000 bills. Compute the total charge for each one.  How is your code different?

In [None]:
more_restaurant_bills = Table.read_table("data/more_restaurant_bills.csv")
more_total_charges = ...
more_total_charges

In [None]:
grader.check("q5")

The function `sum` takes a single array of numbers as its argument.  It returns the sum of all the numbers in that array (so it returns a single number, not an array).
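A quick sketch on a hypothetical three-element array (again built with `np.array` so the snippet needs only NumPy) shows that the result is a single number:

```python
import numpy as np

# Hypothetical amounts, not taken from the restaurant data.
amounts = np.array([12.5, 7.25, 30.0])

total = sum(amounts)
print(total)  # 49.75
```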

**Question 6.** What was the sum of all the bills in `more_restaurant_bills`, **including tips**?

In [None]:
sum_of_bills = ...
sum_of_bills

In [None]:
grader.check("q6")

**Question 7.** The powers of 2 $\left(2^0 = 1, 2^1 = 2, 2^2 = 4, \text{ etc.} \right)$ arise frequently in computer science. For example, you may have noticed that storage on smartphones or USB drives comes in powers of 2, like 16 GB, 32 GB, or 64 GB. Use `np.arange` and the exponentiation operator `**` to compute the first 30 powers of 2, starting from $2^0$.

**Hints:** 

 - `np.arange(1, 2**30, 1)` creates an array with $2^{30} - 1$ elements (more than a billion) and **will crash your kernel**.

 - Part of your solution will involve `np.arange`, but your array shouldn't have more than 30 elements.
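To see why the first hint matters, the rough estimate below works out how much memory such an array would need without ever creating it. `np.arange(1, 2**30, 1)` holds one element for each integer from 1 up to, but not including, $2^{30}$. The 8-bytes-per-element figure is an assumption (64-bit integers); the exact size depends on your platform.

```python
import numpy as np

# One element per integer in the half-open range [1, 2**30).
num_elements = 2**30 - 1

# Assumption: each element is stored as a 64-bit (8-byte) integer.
bytes_per_element = np.dtype(np.int64).itemsize

estimated_gib = num_elements * bytes_per_element / 2**30
print(f"{num_elements:,} elements, about {estimated_gib:.1f} GiB")
# 1,073,741,823 elements, about 8.0 GiB
```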

In [None]:
powers_of_2 = ...
powers_of_2

In [None]:
grader.check("q7")

## Creating Tables

An array is useful for describing a single attribute of each element in a collection. For example, let's say our collection is all US States. Then an array could describe the land area of each state. 

Tables extend this idea by containing multiple arrays, each one describing a different attribute for every element of a collection. In this way, tables allow us to not only store data about many entities but to also contain several kinds of data about each entity.

For example, in the cell below we have two arrays. The first one, `population_amounts`, was defined earlier in the notebook, and contains the world population in each year (estimated by the US Census Bureau). The second array, `years`, contains the years themselves. These elements are in order, so the year and the world population for that year have the same index in their corresponding arrays.

Run the cell below.

In [None]:
years = np.arange(1960, 2020 + 1)
print("Population column:", population_amounts)
print("Years column:", years)

Suppose we want to answer this question:

> In which year did the world's population cross 6 billion?

You could technically answer this question just from staring at the arrays, but it's a bit convoluted, since you would have to count the position where the population first crossed 6 billion, then find the corresponding element in the years array. In cases like these, it might be easier to put the data into a `Table`, a 2-dimensional type of dataset. 

The expression below:

- creates an empty table using the expression `Table()`,
- adds two columns by calling `with_columns` with four arguments,
- assigns the result to the name `population`, and finally
- evaluates `population` so that we can see the table.

The strings `"Year"` and `"Population"` are column labels that we have chosen. The names `population_amounts` and `years` were assigned above to two arrays of the **same length**. The function `with_columns` (you can find the documentation [here](http://data8.org/datascience/_autosummary/datascience.tables.Table.with_columns.html#datascience.tables.Table.with_columns)) takes in alternating strings (to represent column labels) and arrays (representing the data in those columns). The strings and arrays are separated by commas.

In [None]:
population = Table().with_columns(
    "Population", population_amounts,
    "Year", years
)
population

Now the data is combined into a single table. It's much easier to parse this data. If you need to know what the population was in 1969, for example, you can tell from a single glance.

**Question 8.** In the cell below, we've created 2 arrays. Using the steps above, assign `top_10_movies` to a table that has two columns called "Rating" and "Name", which hold `top_10_movie_ratings` and `top_10_movie_names` respectively.

In [None]:
top_10_movie_ratings = make_array(9.2, 9.2, 9., 8.9, 8.9, 8.9, 8.9, 8.9, 8.8, 8.8)
top_10_movie_names = make_array(
        'The Shawshank Redemption (1994)',
        'The Godfather (1972)',
        'The Godfather: Part II (1974)',
        'Pulp Fiction (1994)',
        "Schindler's List (1993)",
        'The Lord of the Rings: The Return of the King (2003)',
        '12 Angry Men (1957)',
        'The Dark Knight (2008)',
        'Fight Club (1999)',
        'The Lord of the Rings: The Fellowship of the Ring (2001)')

top_10_movies = ...
top_10_movies

In [None]:
grader.check("q8")

## Loading a Table from a File

In most cases, we aren't going to go through the trouble of typing in all the data manually. Instead, we load them in from an external source, like a data file. There are many formats for data files, but CSV ("comma-separated values") is the most common.

`Table.read_table(...)` takes one argument (a path to a data file in string format) and returns a table.  

**Question 9.** `imdb.csv` contains a table of information about the 250 highest-rated movies on IMDb.  Load it as a table called `imdb`. 

**Note:** The file is stored in the `data` directory.

You may remember working with this table from previous lessons.

In [None]:
imdb = Table.read_table('data/...')
imdb

In [None]:
grader.check("q9")

Where did `imdb.csv` come from? Take a look at this lab's data folder. You should see a file called `imdb.csv`.

Open up the `imdb.csv` file in that folder and look at the format. What do you notice? The `.csv` filename ending says that this file is in the [CSV (comma-separated value) format](http://edoceo.com/utilitas/csv-file-format).
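If it helps to see the format in miniature, the sketch below parses a tiny CSV snippet with Python's standard `csv` module. The column names and rows are hypothetical and are not copied from `imdb.csv`; `Table.read_table` does this kind of parsing for you.

```python
import csv
import io

# Hypothetical file contents: a header row, then one row per movie.
csv_text = """Title,Year,Rating
Example Movie A,1994,9.2
Example Movie B,2008,8.9
"""

# io.StringIO lets csv.reader treat the string like an open file.
rows = list(csv.reader(io.StringIO(csv_text)))
header, records = rows[0], rows[1:]

print(header)        # ['Title', 'Year', 'Rating']
print(len(records))  # 2
print(records[0])    # ['Example Movie A', '1994', '9.2']
```

Note that `csv.reader` returns every value as a string; converting `'1994'` to a number is a separate step.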

## More Table Operations

Now that you've worked with arrays, let's add a few more methods to the list of table operations that you saw in Lab 2.

### `column`

`column` takes the column name of a table (in string format) as its argument and returns the values in that column as an **array**. 

In [None]:
# Returns an array of movie names
top_10_movies.column('Name')

### `take`
The table method `take` takes as its argument an array of numbers.  Each number should be the index of a row in the table.  It returns a **new table** with only those rows. 

You'll usually want to use `take` in conjunction with `np.arange` to take the first few rows of a table.

In [None]:
# Take first 5 movies of top_10_movies
top_10_movies.take(np.arange(0, 5, 1))

The next three questions will give you practice with combining the operations you've learned in this assignment and in previous ones to answer questions about the `population` and `imdb` tables. First, check out the `population` table from earlier in the notebook.

In [None]:
# Run this cell to display the population table
population

**Question 10.** Check out the `population` table from the **Creating Tables** section of this assignment.  Compute the year when the world population first went above 7 billion. Assign the year to `year_population_crossed_7_billion`.

In [None]:
year_population_crossed_7_billion = ...
year_population_crossed_7_billion

In [None]:
grader.check("q10")

**Question 11.** Find the average rating for movies released before the year 2000 and the average rating for movies released in the year 2000 or after for the movies in `imdb`.

**Hint:** Think of the steps you need to do (take the average, find the ratings, find movies released in 20th/21st centuries), and try to put them in an order that makes sense.

In [None]:
before_2000 = ...
after_or_in_2000 = ...
print("Average before 2000 rating:", before_2000)
print("Average after or in 2000 rating:", after_or_in_2000)

In [None]:
grader.check("q18")

**Question 12.** Find the number of movies that came out in *even* years.

**Hints:** 

 - The operator `%` computes the remainder when dividing by a number.  So `5 % 2` is 1 and `6 % 2` is 0.  A number is even if the remainder is 0 when you divide by 2.

 - `%` can be used on arrays, operating elementwise like `+` or `*`.  So `make_array(5, 6, 7) % 2` is `array([1, 0, 1])`.

 - Create a column called `Year Remainder` that's the remainder when each movie's release year is divided by 2.  Make a copy of `imdb` that includes that column (`imdb.with_column(...)` returns a new table).  Then use `where` to find rows where that new column is equal to 0.  Then use `num_rows` to count the number of such rows.

**Note:** These steps can be chained in one single statement, or broken up across several lines with intermediate names assigned. You’re always welcome to break down problems however you wish!
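For a feel of how elementwise `%` behaves, here is a small sketch on a hypothetical array that is unrelated to the `imdb` table. It counts multiples of 5 with `np.count_nonzero`, which is a NumPy shortcut and not the table-based approach described in the hints.

```python
import numpy as np

# Hypothetical order numbers, not taken from the imdb table.
order_numbers = np.array([12, 15, 20, 23, 30])

# Elementwise remainder after dividing each number by 5.
remainders = order_numbers % 5
print(remainders)  # [2 0 0 3 0]

# Count how many remainders are 0, i.e. how many numbers are multiples of 5.
num_multiples = np.count_nonzero(remainders == 0)
print(num_multiples)  # 3
```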


In [None]:
num_even_year_movies = ...
num_even_year_movies

In [None]:
grader.check("q19")

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

When the export finishes, download the .zip file by `SHIFT`-clicking on the file name and selecting **Save Link As**. Alternatively, find the .zip file on the left side of the screen, right-click it, and select **Download**. Submit this .zip file to Gradescope through the assignment in Canvas.

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)