# Lab 2: Working with Data

Please complete this lab by providing answers in cells after the question. Use **Code** cells to write and run any code you need to answer the question and **Markdown** cells to write out answers in words. After you are finished with the assignment, remember to download it as an **HTML file** and submit it in **ELMS**.

In [None]:
import numpy as np
from datascience import *

## Arrays

Computers are most useful when you can use a small amount of code to *do the same action* to *many different things*.

For example, in the time it takes you to calculate the 18% tip on a restaurant bill, a laptop can calculate 18% tips for every restaurant bill paid by every human on Earth that day.  (That's if you're pretty fast at doing arithmetic in your head!)

**Arrays** are how we put many values in one place so that we can operate on them as a group. For example, if `billions_of_numbers` is an array of numbers, the expression

    .18 * billions_of_numbers

gives a new array of numbers that contains the result of multiplying each number in `billions_of_numbers` by .18.  Arrays are not limited to numbers; we can also put all the words in a book into an array of strings.
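As a small illustration of the tip calculation, the sketch below uses three made-up bill amounts (hypothetical values, not data from this lab). It builds the array with `np.array`, NumPy's own constructor, which is an assumption of this example rather than something introduced above.

```python
import numpy as np

# Hypothetical restaurant bills in dollars (made-up values for illustration).
bills = np.array([20.00, 45.50, 100.00])

# One multiplication computes the 18% tip for every bill at once.
tips = .18 * bills
print(tips)
```

The result has the same number of elements as `bills`, with each tip at the same position as the bill it came from.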

Concretely, an array is a **collection of values of the same type**, like a column in an Excel spreadsheet. 

<img src="excel_array.jpg">

## Making arrays
You can type in the data that goes in an array yourself, but that's not typically how programs work. Normally, we create arrays by loading them from an external source, like a data file.

First, though, let's learn how to do it the hard way. To create an array by hand, call the function `make_array`.  Each argument you pass to `make_array` will be in the array it returns.  Run this cell to see an example:

In [None]:
make_array(0.125, 4.75, -1.3)

Each value in an array (in the above case, the numbers 0.125, 4.75, and -1.3) is called an *element* of that array.

Arrays themselves are also values, just like numbers and strings.  That means you can assign them to names or use them as arguments to functions. For example, `len(some_array)` returns the number of elements in the array `some_array`.

<font color ='red'>**Question 1. Make an array containing the numbers 0, 1, -1, $\pi$, and $e$, in that order.  Name it `interesting_numbers`.**</font>

*Hint:* How did you get the values $\pi$ and $e$ previously?  You can refer to them in exactly the same way here.

###  `np.arange`
Arrays are provided by a package called [NumPy](http://www.numpy.org/) (pronounced "NUM-pie" or, if you prefer to pronounce things incorrectly, "NUM-pee").  The package is called `numpy`, but it's standard to rename it `np` for brevity.  You can do that with:

    import numpy as np

Very often in data science, we want to work with many numbers that are evenly spaced within some range.  NumPy provides a special function for this called `arange`.  The line of code `np.arange(start, stop, step)` evaluates to an array with all the numbers starting at `start` and counting up by `step`, stopping **before** `stop` is reached.
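One way to check your understanding of `start`, `stop`, and `step` is to look at the elements and length of the result. The sketch below uses arbitrary example values (the names `up_to_20` and `through_20` are made up for illustration). Note that for a value to be the last element, `stop` has to be larger than that value.

```python
import numpy as np

# Multiples of 5, stopping before 20: the last element is 15.
up_to_20 = np.arange(0, 20, 5)
print(up_to_20, len(up_to_20))

# Choosing a stop just past 20 makes 20 the last element.
through_20 = np.arange(0, 21, 5)
print(through_20, len(through_20))
```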

Run the following cells to see some examples!

In [None]:
# This array starts at 1 and counts up by 2
# and then stops before 6
np.arange(1, 6, 2)

In [None]:
# This array doesn't contain 9
# because np.arange stops *before* the stop value is reached
np.arange(4, 9, 1)

In [None]:
# You can also use a single number to create an array of all integers from 0 up to (but not including) that number
np.arange(10)

<font color = 'red'>**Question 2. Use `np.arange` to create an array with the multiples of 99 from 0 up to (**and including**) 9999 (So its elements are 0, 99, 198, 297, etc.). Assign this to `my_array`.**</font>

## Working with single elements of arrays ("indexing")
Let's work with a more interesting dataset.  The next cell creates an array called `population_amounts` that includes estimated world populations in every year from **1950** to roughly the present (the estimates come from the US Census Bureau website).

Rather than type in the data manually, we've loaded them from a file called `world_population.csv`. 

In [None]:
population_amounts = Table.read_table("world_population.csv").column("Population")
population_amounts

Here's how we get the first element of `population_amounts`, which is the world population in the first year in the dataset, 1950.

In [None]:
population_amounts.item(0)

The value of that expression is the number 2557628654 (around 2.5 billion), because that's the first thing in the array `population_amounts`.

Notice that we wrote `.item(0)`, not `.item(1)`, to get the first element.  This is a weird convention in computer science.  0 is called the *index* of the first item.  It's the number of elements that appear *before* that item.  So 3 is the index of the 4th item.
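Because indices start at 0, the valid indices of any array run from 0 to `len(array) - 1`. The sketch below shows this with a small made-up array (not the population data) and catches the error raised by an index that is too large, so the cell finishes running.

```python
import numpy as np

# A small made-up array with 4 elements, so valid indices are 0, 1, 2, 3.
small_array = np.array([10, 20, 30, 40])

last_index = len(small_array) - 1
last_element = small_array.item(last_index)
print("Last element:", last_element)

# Asking for index 4 fails, because no element has 4 elements before it.
try:
    small_array.item(len(small_array))
    out_of_range_failed = False
except IndexError as error:
    out_of_range_failed = True
    print("IndexError:", error)
```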

Here are some more examples.  In the examples, we've given names to the things we get out of `population_amounts`.  Read and run each cell.

In [None]:
# The 13th element in the array is the population
# in 1962 (which is 1950 + 12).
population_1962 = population_amounts.item(12)
population_1962

In [None]:
# The 66th element is the population in 2015.
population_2015 = population_amounts.item(65)
population_2015

In [None]:
# The array has only 66 elements, so this doesn't work.
# (There's no element with 66 other elements before it.)
population_2016 = population_amounts.item(66)
population_2016

Since `make_array` returns an array, we can call `.item(3)` on its output to get its 4th element, just like we "chained" together calls to the method `replace` earlier.

In [None]:
make_array(-1, -3, 4, -2).item(3)

<font color = 'red'>**Question 3. Set `population_1973` to the world population in 1973, by getting the appropriate element from `population_amounts` using `item`.**</font>

## Doing something to every element of an array
Arrays are primarily useful for doing the same operation many times, so we don't often have to use `.item` and work with single elements.

Here is one simple question we might ask about world population:

> How big was the population in *orders of magnitude* in each year?

Orders of magnitude quantify how big a number is by representing it as the power of another number (for example, representing 104 as $10^{2.017033}$). One way to do this is by using the logarithm function. The logarithm (base 10) of a number increases by 1 every time we multiply the number by 10. It's like a measure of how many decimal digits the number has, or how big it is in orders of magnitude.
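To check these claims on a single number, here is a minimal sketch using `math.log10` from Python's standard library. The variable names are made up for illustration.

```python
import math

# The example from the text: 104 is 10 raised to roughly 2.017033.
magnitude_104 = math.log10(104)
print(magnitude_104)

# Multiplying the number by 10 adds 1 to its base-10 logarithm.
magnitude_1040 = math.log10(1040)
print(magnitude_1040 - magnitude_104)

# Raising 10 to the logarithm gets the original number back (up to rounding).
print(10 ** magnitude_104)
```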

We could try to answer our question like this, using the `log10` function from the `math` module and the `item` method you just saw:

In [None]:
import math

# math.log10 works on one number at a time, so each year needs its own line.
population_1950_magnitude = math.log10(population_amounts.item(0))
population_1951_magnitude = math.log10(population_amounts.item(1))
population_1952_magnitude = math.log10(population_amounts.item(2))
population_1953_magnitude = math.log10(population_amounts.item(3))
...

But this is tedious and doesn't really take advantage of the fact that we are using a computer.

Instead, NumPy provides its own version of `log10` that takes the logarithm of each element of an array.  It takes a single array of numbers as its argument.  It returns an array of the same length, where the first element of the result is the logarithm of the first element of the argument, and so on.

In [None]:
population_magnitudes = np.log10(population_amounts)
population_magnitudes

What you just did is called *elementwise* application of `np.log10`, since `np.log10` operates separately on each element of the array that it's called on. Here's a picture of what's going on:

<img src="array_logarithm.jpg">


The textbook's [section](https://www.inferentialthinking.com/chapters/05/1/Arrays)  on arrays has a useful list of NumPy functions that are designed to work elementwise, like `np.log10`.

Arithmetic also works elementwise on arrays.  For example, you can divide all the population numbers by 1 billion to get numbers in billions:

In [None]:
population_in_billions = population_amounts / 1000000000
population_in_billions

You can do the same with addition, subtraction, multiplication, and exponentiation (`**`).
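Here is a short sketch of each of these operations on a small made-up array (the names and values are hypothetical, chosen only for illustration). The last lines show that arithmetic between two arrays of the same length pairs up elements by index.

```python
import numpy as np

# A small made-up array to try each operation on.
values = np.array([1, 2, 3, 4])

plus_ten = values + 10    # add 10 to every element
minus_one = values - 1    # subtract 1 from every element
tripled = values * 3      # multiply every element by 3
squared = values ** 2     # square every element
print(plus_ten, minus_one, tripled, squared)

# Two arrays of the same length are combined element by element.
other_values = np.array([10, 20, 30, 40])
pairwise_sum = values + other_values
print(pairwise_sum)
```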

## Creating Tables

An array is useful for describing a single attribute of each element in a collection. For example, let's say our collection is all US States. Then an array could describe the land area of each state. 

Tables extend this idea by containing multiple arrays, each one describing a different attribute for every element of a collection. In this way, tables allow us to not only store data about many entities but to also contain several kinds of data about each entity.

For example, in the cell below we have two arrays. The first one, `population_amounts`, was defined above and contains the world population in each year (estimated by the US Census Bureau). The second array, `years`, contains the years themselves. These elements are in order, so the year and the world population for that year have the same index in their corresponding arrays.

In [None]:
# Creating a years array

years = np.arange(1950, 2016)

In [None]:
print("Population column:", population_amounts)

In [None]:
print("Years column:", years)

Suppose we want to answer this question:

> In which year did the world's population cross 6 billion?

You could technically answer this question just from staring at the arrays, but it's a bit convoluted, since you would have to count the position where the population first crossed 6 billion, then find the corresponding element in the years array. In cases like these, it might be easier to put the data into a *`Table`*, a 2-dimensional type of dataset. 
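To make the position-counting approach concrete, here is a sketch with two short made-up arrays (hypothetical numbers, not the Census Bureau estimates). It compares every element to a threshold, finds the first position where the comparison holds, and looks up the same position in the other array. It relies on `np.argmax`, which returns the index of the first `True` when given an array of `True`/`False` values.

```python
import numpy as np

# Made-up example data: two arrays of the same length, lined up by index.
example_years = np.arange(2000, 2005)
example_amounts = np.array([90, 95, 101, 108, 115])

# True wherever the amount is at least 100.
crossed = example_amounts >= 100

# Index of the first True (this assumes at least one amount reaches 100).
first_index = int(np.argmax(crossed))

# The year at that same index is the year the threshold was first crossed.
crossing_year = example_years.item(first_index)
print(first_index, crossing_year)
```

Keeping two separate arrays in sync like this is easy to get wrong, which is part of the motivation for storing related columns together.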

The expression below:

- creates an empty table using the expression `Table()`,
- adds two columns by calling `with_columns` with four arguments,
- assigns the result to the name `population`, and finally
- evaluates `population` so that we can see the table.

The strings `"Year"` and `"Population"` are column labels that we have chosen. The names `population_amounts` and `years` were assigned above to two arrays of the **same length**. The method `with_columns` (you can find the documentation [here](http://data8.org/datascience/tables.html)) takes in alternating strings (to represent column labels) and arrays (representing the data in those columns). The strings and arrays are separated by commas.

In [None]:
population = Table().with_columns(
    "Population", population_amounts,
    "Year", years
)
population

Now the data is combined into a single table! It's much easier to parse this data. If you need to know what the population was in 1959, for example, you can tell from a single glance.

<font color = 'red'>**Question 4. Create an array that contains the integers from 0 to 10, as well as an array that contains the powers of 2 from $2^0$ to $2^{10}$. Create a new table called `exponential_two` that contains both arrays, with the column names "Value" and "Exponential".**</font>

It should look something like this:

| Value | Exponential |
| --- | --- |
| 0 | 1 | 
| 1 | 2 |
| ... | ...|

## More Table Operations!

Now that you've worked with arrays, let's add a few more methods to the list of table operations that you saw in Lab 1.

In [None]:
imdb = Table.read_table('imdb.csv')

### `column`

`column` takes the column name of a table (in string format) as its argument and returns the values in that column as an **array**. 

> This is a really important method to remember! We will use this one a lot, so make sure to keep this in mind.

In [None]:
# Returns an array of movie names
imdb.column('Title')

Pulling out a variable of the table as an array is really helpful, because we can then easily use array methods on it, such as `.sum()` or `.mean()`. For example, to find the mean `Rating` of all movies in this dataset, we can first pull out the `Rating` column as an array, then call `.mean()` on it.
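As a quick sketch of these array methods, the example below uses a made-up array standing in for what `column` would return (the values are hypothetical and not taken from `imdb.csv`).

```python
import numpy as np

# Made-up ratings standing in for an array returned by `column`.
example_ratings = np.array([8.0, 8.5, 9.0, 8.5])

total = example_ratings.sum()
average = example_ratings.mean()
print(total, average)

# The mean is the sum divided by the number of elements.
print(total / len(example_ratings))
```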

In [None]:
imdb.column('Rating').mean()

### `take`
The table method `take` takes as its argument an array of numbers.  Each number should be the index of a row in the table.  It returns a **new table** with only those rows. 

You'll usually want to use `take` in conjunction with `np.arange` to take the first few rows of a table.

In [None]:
# Take the first 5 movies of imdb
imdb.take(np.arange(0, 5, 1))

<font color = 'red'>**Question 5. Check out the `population` table from earlier in this lab. Create a new table called `pop_ten_years` that contains the population at the beginning of each decade, starting with 1950 (so, every ten years).**</font>

In [None]:

pop_ten_years = population.take(np.arange(0, 66, 10))
pop_ten_years

### `group`
The table method `group` takes as its argument a string representing a column name. It returns a new table with the count of each category of that variable. **Use this only with categorical variables, since it won't really make sense with numerical data.** Note that even if a variable has numbers, that doesn't mean it's a numerical variable. For example, we can find the number of movies that were released in each decade using `group` with the `Decade` variable.
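To see what kind of counting this involves, the sketch below reproduces it for a made-up array of decades using `np.unique` from NumPy. This only illustrates the idea of counting categories; it is not how the `datascience` package implements `group`, and the values are hypothetical.

```python
import numpy as np

# Made-up decade values, one per movie (not taken from imdb.csv).
example_decades = np.array([1990, 2000, 1990, 2010, 2000, 1990])

# np.unique returns the distinct values in sorted order;
# return_counts=True also returns how many times each one appears.
categories, counts = np.unique(example_decades, return_counts=True)

for category, count in zip(categories, counts):
    print(category, count)
```

Even though the decades are written as numbers, they act as category labels here, which is why counting them makes sense.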

In [None]:
movies_by_decade = imdb.group('Decade')
movies_by_decade

We can also use the `collect` parameter to summarize other variables split up by the categories in `Decade`.

In [None]:
imdb.group('Decade', collect = np.mean)

<font color = 'red'>**Question 6. What is the total number of votes that movies got within each `Decade`?**</font>

*Hint: Use the `sum` function for this.*

In [None]:

# Group by decade and add up the other columns (including Votes) within each decade
imdb.group('Decade', collect = sum)

### `pivot`

You can use `pivot` to create contingency tables, which show the counts for each combination of two categorical variables. You can also use `pivot` to find, for example, the mean of a third variable within each combination of categories. We'll focus on the first use for now. Let's say we want to find out how the movies rated higher than 8.4 are distributed across decades.

In [None]:
imdb.show(5)

In [None]:
imdb.pivot('Highly Rated','Decade')

You can also use `collect` in a similar manner as `group`. Let's say we want to find the mean number of votes of the movies within each combination of `Decade` and `Highly Rated`.

In [None]:
imdb.pivot('Highly Rated','Decade', values = 'Votes', collect = np.mean)

<font color = 'red'>**Question 7. Use `pivot` to find the total number of `Votes` for each combination of `Decade` and `Highly Rated`.**</font>

In [None]:
imdb.pivot('Highly Rated','Decade', values = 'Votes', collect = np.sum)



### `stats`

The table method `stats` provides summary statistics for each column in the table. Note that it tries to figure out whether a variable is numerical or categorical, but it can be wrong! Let's take a look at what happens if we try to get descriptive statistics for the `imdb` table.

In [None]:
imdb.stats()

<font color = 'red'>**Question 8. What is the mean number of votes that a movie got? How does it compare to the median?**</font>

*Hint: Remember, we found a way to find the mean of a variable earlier by pulling out one column as an array.*

## Summary

For your reference, here's a table of all the functions and methods we saw in this lab as well as Lab 1. Refer back to this table whenever you're unsure of which method you're supposed to use.

|Name|Example|Purpose|
|-|-|-|
|`sort`|`tbl.sort("N")`|Create a copy of a table sorted by the values in a column|
|`where`|`tbl.where("N", are.above(2))`|Create a copy of a table with only the rows that match some *predicate*|
|`num_rows`|`tbl.num_rows`|Compute the number of rows in a table|
|`num_columns`|`tbl.num_columns`|Compute the number of columns in a table|
|`select`|`tbl.select("N")`|Create a copy of a table with only some of the columns|
|`drop`|`tbl.drop("2*N")`|Create a copy of a table without some of the columns|
| `column`| `tbl.column('Title')`|Create an array with one column from the table|
|`take`|`tbl.take(np.arange(0,10,2))`|Create a copy of a table with only the rows in the array|
|`group`|`tbl.group('N')`|Create a new table with the counts of each category in a variable. Can also use `collect = ` to calculate something within each category.|
|`pivot`|`tbl.pivot('N','M')`|Create a new table with the counts of each combination of categories within two variables. Can also use `values = ` and `collect = ` to calculate something about a third variable.|
|`stats`|`tbl.stats()`|Create a new table with summary statistics for each column|

