# Lab 2: Table operations, Data Types, and Arrays (oh my)

Welcome to Lab 2!  This week, we'll learn how to import a module and practice table operations! You'll then see how to represent and manipulate another fundamental type of data: text. A piece of text is called a *string* in Python. You'll also see how to work with *arrays* of data, such as all the numbers between 0 and 100 or all the words in the chapter of a book. Lastly, you'll create tables and practice analyzing them with your knowledge of table operations. This lab is a bit looooong, so make sure you begin early!

Recommended Reading:
 * [Introduction to tables](https://www.inferentialthinking.com/chapters/03/4/Introduction_to_Tables)

First run the cell below.

In [1]:
# Just run this cell

import numpy as np
from datascience import *

# 1. Importing code

![imports](https://external-preview.redd.it/ZVPjiFo_Ubl4JeiU63SaTjdIoq5zveSnNZimKpgn2I8.png?auto=webp&s=bf32c94b630befa121075c1ae99b2599af6dedc5) 

[source](https://www.reddit.com/r/ProgrammerHumor/comments/cgtk7s/theres_no_need_to_reinvent_the_wheel_oc/)

Most programming involves work that is very similar to work that has been done before. Since writing code is time-consuming, it's good to rely on others' published code when you can. Rather than copy-pasting, Python allows us to **import modules**. A module is a file of Python code that defines variables and functions. By importing a module, we are able to use its code in our own notebook.

Python includes many useful modules that are just an `import` away.  We'll look at the `math` module as a first example. The `math` module is extremely useful in computing mathematical expressions in Python. 

Suppose we want to very accurately compute the area of a circle with a radius of 5 meters.  For that, we need the constant $\pi$, which is roughly 3.14.  Conveniently, the `math` module has `pi` defined for us:

In [3]:
import math
radius = 5
area_of_circle = radius**2 * math.pi
area_of_circle

78.53981633974483

In the code above, the line `import math` imports the math module. This statement creates a module and then assigns the name `math` to that module. We are now able to access any variables or functions defined within `math` by typing the name of the module followed by a dot, then followed by the name of the variable or function we want.

    <module name>.<name>

**Question 1.1.** The module `math` also provides the name `e` for the base of the natural logarithm, which is roughly 2.72. Compute $(e^{\pi}-\pi)^2$, giving it the name `near_400`.

*Remember: You can access `pi` from the `math` module as well!*

<!--
BEGIN QUESTION
name: q11
-->

In [5]:
near_400 = ((math.e ** math.pi) - math.pi) ** 2
near_400

399.96399997761625

## 1.1. Accessing functions

In the question above, you accessed variables within the `math` module. 

**Modules** also define **functions**.  For example, `math` provides the name `log` for the natural log function. Having imported `math` already, we can write `math.log(3)` to compute the natural log of 3. 

For your reference, below are some more examples of functions from the `math` module.

Notice how different functions take in different numbers of arguments. Often, the [documentation](https://docs.python.org/3/library/math.html) of the module will provide information on how many arguments are required for each function.

*Hint: If you press `shift+tab` with your cursor on a function call, the documentation for that function will appear.*

In [6]:
# Calculating logarithms (the logarithm of 8 in base 2).
# The result is 3 because 2 to the power of 3 is 8.
math.log(8, 2)

3.0

In [7]:
# Calculating square roots.
math.sqrt(5)

2.23606797749979
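
To make the point about argument counts concrete, here is a minimal sketch using only functions already mentioned in this lab. It calls `math.log` with one and with two arguments, then deliberately passes too many arguments to `math.sqrt` to show the kind of error Python reports. The variable names are chosen here for illustration only.

```python
import math

# One argument: the natural logarithm (base e).
natural = math.log(math.e)

# Two arguments: the logarithm of 8 in base 2, as in the cell above.
base_two = math.log(8, 2)

# math.sqrt accepts exactly one argument, so passing two raises a TypeError.
try:
    math.sqrt(5, 2)
    error_message = None
except TypeError as err:
    error_message = str(err)

print(natural, base_two)
print("TypeError:", error_message)
```

When a call fails like this, the error message and the module's documentation are usually the quickest way to find out how many arguments the function expects.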

There are various ways to import and access code from outside sources. The method we used above — `import <module_name>` — imports the entire module and requires that we use `<module_name>.<name>` to access its code. 

We can also import a specific constant or function instead of the entire module. Notice that you don't have to use the module name beforehand to reference that particular value. However, you do have to be careful about reassigning the names of the constants or functions to other values!

In [8]:
# Importing just cos and pi from math.
# We don't have to use `math.` in front of cos or pi
from math import cos, pi
print(cos(pi))

# We do have to use it in front of other functions from math, though
math.log(pi)

-1.0


1.1447298858494002
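
The caution about reassigning imported names can be seen in a small sketch. It assumes nothing beyond the `math` module: after `from math import pi`, the name `pi` belongs to our notebook, so assigning to it changes what `pi` means in our code while `math.pi` stays the same.

```python
import math
from math import pi

# Right after the import, both names refer to the same value.
matches_before = (pi == math.pi)

# Reassigning the imported name only changes our own name `pi`.
pi = 3

print(matches_before)  # True
print(pi)              # 3
print(math.pi)         # still the value provided by the math module
```

This is one reason some programmers prefer `import math` and the longer `math.pi`: the module prefix makes an accidental reassignment less likely.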

Or we can import every function and value from the entire module.

In [9]:
# Lastly, we can import everything from math using the *
# Once again, we don't have to use 'math.' beforehand 
from math import *
log(pi)

1.1447298858494002

Don't worry too much about which type of import to use. It's often a coding style choice left up to each programmer. In this course, you'll always import the necessary modules when you run the setup cell (like the first code cell in this lab).

Let's move on to practicing some of the table operations you've learned in lecture!

# 2. Table operations

The file `farmers_markets.csv` contains data on farmers' markets in the United States. Each row represents one such market.

Run the next cell to load the `farmers_markets` table.

In [10]:
# Just run this cell

farmers_markets = Table.read_table('farmers_markets.csv')

Let's examine our table to see what data it contains.

**Question 2.1.** Use the method `show` to display the first 15 rows of `farmers_markets`.

*Note:* The terms "method" and "function" are technically not the same thing, but for the purposes of this course, we will use them interchangeably.

**Hint:** `tbl.show(3)` will show the first 3 rows of `tbl`. Additionally, make sure not to call `.show()` without an argument, as this will crash your kernel!


In [11]:
farmers_markets.show(15)

FMID,MarketName,street,city,County,State,zip,x,y,Website,Facebook,Twitter,Youtube,OtherMedia,Organic,Tofu,Bakedgoods,Cheese,Crafts,Flowers,Eggs,Seafood,Herbs,Vegetables,Honey,Jams,Maple,Meat,Nursery,Nuts,Plants,Poultry,Prepared,Soap,Trees,Wine,Coffee,Beans,Fruits,Grains,Juices,Mushrooms,PetFood,WildHarvested,updateTime,Location,Credit,WIC,WICcash,SFMNP,SNAP,Season1Date,Season1Time,Season2Date,Season2Time,Season3Date,Season3Time,Season4Date,Season4Time
1012063,Caledonia Farmers Market Association - Danville,,Danville,Caledonia,Vermont,5828,-72.1403,44.411,https://sites.google.com/site/caledoniafarmersmarket/,https://www.facebook.com/Danville.VT.Farmers.Market/,,,,Y,N,Y,Y,Y,Y,Y,N,Y,Y,Y,Y,Y,Y,N,N,Y,Y,Y,Y,Y,N,Y,Y,Y,N,Y,N,Y,N,6/28/2016 12:10:09 PM,,Y,Y,N,Y,N,06/08/2016 to 10/12/2016,Wed: 9:00 AM-1:00 PM;,,,,,,
1011871,Stearns Homestead Farmers' Market,6975 Ridge Road,Parma,Cuyahoga,Ohio,44130,-81.7286,41.3751,http://Stearnshomestead.com,,,,,-,N,Y,N,N,Y,Y,N,Y,Y,Y,Y,Y,Y,N,N,Y,N,N,N,N,N,N,N,Y,N,N,N,Y,N,4/9/2016 8:05:17 PM,,Y,Y,N,Y,Y,06/25/2016 to 10/01/2016,Sat: 9:00 AM-1:00 PM;,,,,,,
1011878,100 Mile Market,507 Harrison St,Kalamazoo,Kalamazoo,Michigan,49007,-85.5749,42.296,http://www.pfcmarkets.com,https://www.facebook.com/100MileMarket/?fref=ts,,,https://www.instagram.com/100milemarket/,N,N,Y,Y,N,Y,Y,N,Y,Y,Y,Y,Y,Y,N,N,N,Y,Y,Y,N,Y,N,N,Y,Y,N,N,N,N,4/16/2016 12:37:56 PM,,Y,Y,N,Y,Y,05/04/2016 to 10/12/2016,Wed: 3:00 PM-7:00 PM;,,,,,,
1009364,106 S. Main Street Farmers Market,106 S. Main Street,Six Mile,,South Carolina,29682,-82.8187,34.8042,http://thetownofsixmile.wordpress.com/,,,,,-,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,2013,,Y,N,N,N,N,,,,,,,,
1010691,10th Steet Community Farmers Market,10th Street and Poplar,Lamar,Barton,Missouri,64759,-94.2746,37.4956,,,,,http://agrimissouri.com/mo-grown/grodetail.php?type=mo-g ...,-,N,Y,N,Y,N,Y,N,Y,Y,Y,Y,N,Y,N,N,Y,Y,Y,Y,N,N,N,N,Y,N,N,N,N,N,10/28/2014 9:49:46 AM,,Y,N,N,N,N,04/02/2014 to 11/30/2014,Wed: 3:00 PM-6:00 PM;Sat: 8:00 AM-1:00 PM;,,,,,,
1002454,112st Madison Avenue,112th Madison Avenue,New York,New York,New York,10029,-73.9493,40.7939,,,,,,-,N,Y,N,Y,Y,N,N,Y,Y,Y,Y,N,N,N,Y,N,N,Y,Y,N,N,N,N,N,N,N,N,N,N,3/1/2012 10:38:22 AM,Private business parking lot,N,N,Y,Y,N,July to November,Tue:8:00 am - 5:00 pm;Sat:8:00 am - 8:00 pm;,,,,,,
1011100,12 South Farmers Market,3000 Granny White Pike,Nashville,Davidson,Tennessee,37204,-86.7907,36.1184,http://www.12southfarmersmarket.com,12_South_Farmers_Market,@12southfrmsmkt,,@12southfrmsmkt,Y,N,Y,Y,N,Y,Y,N,Y,Y,Y,Y,Y,Y,N,N,N,Y,Y,Y,N,N,Y,N,Y,N,Y,Y,Y,N,5/1/2015 10:40:56 AM,,Y,N,N,N,Y,05/05/2015 to 10/27/2015,Tue: 3:30 PM-6:30 PM;,,,,,,
1009845,125th Street Fresh Connect Farmers' Market,"163 West 125th Street and Adam Clayton Powell, Jr. Blvd.",New York,New York,New York,10027,-73.9482,40.809,http://www.125thStreetFarmersMarket.com,https://www.facebook.com/125thStreetFarmersMarket,https://twitter.com/FarmMarket125th,,Instagram--> 125thStreetFarmersMarket,Y,N,Y,Y,Y,Y,Y,N,Y,Y,Y,Y,Y,Y,N,Y,N,Y,Y,Y,N,Y,Y,N,Y,N,Y,N,N,N,4/7/2014 4:32:01 PM,Federal/State government building grounds,Y,Y,N,Y,Y,06/10/2014 to 11/25/2014,Tue: 10:00 AM-7:00 PM;,,,,,,
1005586,12th & Brandywine Urban Farm Market,12th & Brandywine Streets,Wilmington,New Castle,Delaware,19801,-75.5345,39.7421,,https://www.facebook.com/pages/12th-Brandywine-Urban-Far ...,,,https://www.facebook.com/delawareurbanfarmcoalition,N,N,N,N,N,N,N,N,Y,Y,N,N,N,N,N,N,N,N,N,N,N,N,N,N,Y,N,N,N,N,N,4/3/2014 3:43:31 PM,"On a farm from: a barn, a greenhouse, a tent, a stand, etc",N,N,N,N,Y,05/16/2014 to 10/17/2014,Fri: 8:00 AM-11:00 AM;,,,,,,
1008071,14&U Farmers' Market,1400 U Street NW,Washington,District of Columbia,District of Columbia,20009,-77.0321,38.917,,https://www.facebook.com/14UFarmersMarket,https://twitter.com/14UFarmersMkt,,,Y,N,Y,Y,N,Y,Y,N,Y,Y,Y,Y,N,Y,N,Y,Y,Y,N,N,N,N,N,Y,Y,Y,Y,N,N,N,4/5/2014 1:49:04 PM,Other,Y,Y,Y,Y,Y,05/03/2014 to 11/22/2014,Sat: 9:00 AM-1:00 PM;,,,,,,


Notice that some of the values in this table are missing, as denoted by "nan." This means either that the value is not available (e.g. if we don’t know the market’s street address) or not applicable (e.g. if the market doesn’t have a street address). You'll also notice that the table has a large number of columns in it!

### `num_columns`

The table property `num_columns` returns the number of columns in a table. (A "property" is just a method that doesn't need to be called by adding parentheses.)

Example call: `<tbl>.num_columns`
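
To illustrate the difference between a property and a method, here is a rough sketch in plain Python. `TinyTable` is a hypothetical class written only for this example; it is not how the `datascience` library implements `Table`, and its `labels` method is likewise an assumption made for the sketch.

```python
class TinyTable:
    """Hypothetical stand-in for a table; not the datascience Table class."""

    def __init__(self, columns):
        # Maps each column label to a list of values.
        self._columns = dict(columns)

    @property
    def num_columns(self):
        # A property is read like a plain value: no parentheses.
        return len(self._columns)

    def labels(self):
        # An ordinary method must be called with parentheses.
        return list(self._columns)


tiny = TinyTable({"MarketName": ["100 Mile Market"], "State": ["Michigan"]})
print(tiny.num_columns)  # property access
print(tiny.labels())     # method call
```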

**Question 2.2.** Use `num_columns` to find the number of columns in our farmers' markets dataset.

Assign the number of columns to `num_farmers_markets_columns`.

<!--
BEGIN QUESTION
name: q22
-->

In [12]:
num_farmers_markets_columns = farmers_markets.num_columns
print("The table has", num_farmers_markets_columns, "columns in it!")

The table has 59 columns in it!


### `select`

Most of the columns are about particular products -- whether the market sells tofu, pet food, etc.  If we're not interested in that information, it just makes the table difficult to read.  This comes up more than you might think, because people who collect and publish data may not know ahead of time what people will want to do with it.

In such situations, we can use the table method `select` to choose only the columns that we want in a particular table. It takes any number of arguments. Each should be the name of a column in the table. It returns a new table with only those columns in it. The columns are in the order *in which they were listed as arguments*.

For example, the value of `farmers_markets.select("MarketName", "State")` is a table with only the name and the state of each farmers' market in `farmers_markets`.



**Question 2.3.** Use `select` to create a table with only the name, city, state, latitude (`y`), and longitude (`x`) of each market.  Call that new table `farmers_markets_locations`.

*Hint:* Make sure to be exact when using column names with `select`; double-check capitalization!

<!--
BEGIN QUESTION
name: q23
-->

In [14]:
farmers_markets_locations = farmers_markets.select('MarketName', 'city', 'State', 'x', 'y')
farmers_markets_locations

MarketName,city,State,x,y
Caledonia Farmers Market Association - Danville,Danville,Vermont,-72.1403,44.411
Stearns Homestead Farmers' Market,Parma,Ohio,-81.7286,41.3751
100 Mile Market,Kalamazoo,Michigan,-85.5749,42.296
106 S. Main Street Farmers Market,Six Mile,South Carolina,-82.8187,34.8042
10th Steet Community Farmers Market,Lamar,Missouri,-94.2746,37.4956
112st Madison Avenue,New York,New York,-73.9493,40.7939
12 South Farmers Market,Nashville,Tennessee,-86.7907,36.1184
125th Street Fresh Connect Farmers' Market,New York,New York,-73.9482,40.809
12th & Brandywine Urban Farm Market,Wilmington,Delaware,-75.5345,39.7421
14&U Farmers' Market,Washington,District of Columbia,-77.0321,38.917


### `drop`

`drop` serves the same purpose as `select`, but it takes away the columns that you provide rather than the ones that you don't provide. Like `select`, `drop` returns a new table.
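
As a way to picture what `select` and `drop` do, here is a sketch that mimics them on a single row stored as a plain Python dictionary. The helper functions `select_labels` and `drop_labels` are hypothetical and written only for this illustration; the real table methods work on whole tables, not dictionaries.

```python
# One row of the farmers' markets data (values copied from the table above).
row = {
    "FMID": 1011878,
    "MarketName": "100 Mile Market",
    "city": "Kalamazoo",
    "State": "Michigan",
}


def select_labels(row, *labels):
    """Keep only the listed labels, in the order given (hypothetical helper)."""
    return {label: row[label] for label in labels}


def drop_labels(row, *labels):
    """Keep every label except the listed ones (hypothetical helper)."""
    return {label: value for label, value in row.items() if label not in labels}


selected = select_labels(row, "State", "MarketName")
dropped = drop_labels(row, "FMID")

print(selected)  # only State and MarketName, in that order
print(dropped)   # everything except FMID
print(row)       # the original row is unchanged
```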

**Question 2.4.** Suppose you just didn't want the `FMID` and `updateTime` columns in `farmers_markets`.  Create a table that's a copy of `farmers_markets` but doesn't include those columns.  Call that table `farmers_markets_without_fmid`.

<!--
BEGIN QUESTION
name: q24
-->

In [15]:
farmers_markets_without_fmid = farmers_markets.drop('FMID', 'updateTime')
farmers_markets_without_fmid

MarketName,street,city,County,State,zip,x,y,Website,Facebook,Twitter,Youtube,OtherMedia,Organic,Tofu,Bakedgoods,Cheese,Crafts,Flowers,Eggs,Seafood,Herbs,Vegetables,Honey,Jams,Maple,Meat,Nursery,Nuts,Plants,Poultry,Prepared,Soap,Trees,Wine,Coffee,Beans,Fruits,Grains,Juices,Mushrooms,PetFood,WildHarvested,Location,Credit,WIC,WICcash,SFMNP,SNAP,Season1Date,Season1Time,Season2Date,Season2Time,Season3Date,Season3Time,Season4Date,Season4Time
Caledonia Farmers Market Association - Danville,,Danville,Caledonia,Vermont,5828,-72.1403,44.411,https://sites.google.com/site/caledoniafarmersmarket/,https://www.facebook.com/Danville.VT.Farmers.Market/,,,,Y,N,Y,Y,Y,Y,Y,N,Y,Y,Y,Y,Y,Y,N,N,Y,Y,Y,Y,Y,N,Y,Y,Y,N,Y,N,Y,N,,Y,Y,N,Y,N,06/08/2016 to 10/12/2016,Wed: 9:00 AM-1:00 PM;,,,,,,
Stearns Homestead Farmers' Market,6975 Ridge Road,Parma,Cuyahoga,Ohio,44130,-81.7286,41.3751,http://Stearnshomestead.com,,,,,-,N,Y,N,N,Y,Y,N,Y,Y,Y,Y,Y,Y,N,N,Y,N,N,N,N,N,N,N,Y,N,N,N,Y,N,,Y,Y,N,Y,Y,06/25/2016 to 10/01/2016,Sat: 9:00 AM-1:00 PM;,,,,,,
100 Mile Market,507 Harrison St,Kalamazoo,Kalamazoo,Michigan,49007,-85.5749,42.296,http://www.pfcmarkets.com,https://www.facebook.com/100MileMarket/?fref=ts,,,https://www.instagram.com/100milemarket/,N,N,Y,Y,N,Y,Y,N,Y,Y,Y,Y,Y,Y,N,N,N,Y,Y,Y,N,Y,N,N,Y,Y,N,N,N,N,,Y,Y,N,Y,Y,05/04/2016 to 10/12/2016,Wed: 3:00 PM-7:00 PM;,,,,,,
106 S. Main Street Farmers Market,106 S. Main Street,Six Mile,,South Carolina,29682,-82.8187,34.8042,http://thetownofsixmile.wordpress.com/,,,,,-,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,N,,Y,N,N,N,N,,,,,,,,
10th Steet Community Farmers Market,10th Street and Poplar,Lamar,Barton,Missouri,64759,-94.2746,37.4956,,,,,http://agrimissouri.com/mo-grown/grodetail.php?type=mo-g ...,-,N,Y,N,Y,N,Y,N,Y,Y,Y,Y,N,Y,N,N,Y,Y,Y,Y,N,N,N,N,Y,N,N,N,N,N,,Y,N,N,N,N,04/02/2014 to 11/30/2014,Wed: 3:00 PM-6:00 PM;Sat: 8:00 AM-1:00 PM;,,,,,,
112st Madison Avenue,112th Madison Avenue,New York,New York,New York,10029,-73.9493,40.7939,,,,,,-,N,Y,N,Y,Y,N,N,Y,Y,Y,Y,N,N,N,Y,N,N,Y,Y,N,N,N,N,N,N,N,N,N,N,Private business parking lot,N,N,Y,Y,N,July to November,Tue:8:00 am - 5:00 pm;Sat:8:00 am - 8:00 pm;,,,,,,
12 South Farmers Market,3000 Granny White Pike,Nashville,Davidson,Tennessee,37204,-86.7907,36.1184,http://www.12southfarmersmarket.com,12_South_Farmers_Market,@12southfrmsmkt,,@12southfrmsmkt,Y,N,Y,Y,N,Y,Y,N,Y,Y,Y,Y,Y,Y,N,N,N,Y,Y,Y,N,N,Y,N,Y,N,Y,Y,Y,N,,Y,N,N,N,Y,05/05/2015 to 10/27/2015,Tue: 3:30 PM-6:30 PM;,,,,,,
125th Street Fresh Connect Farmers' Market,"163 West 125th Street and Adam Clayton Powell, Jr. Blvd.",New York,New York,New York,10027,-73.9482,40.809,http://www.125thStreetFarmersMarket.com,https://www.facebook.com/125thStreetFarmersMarket,https://twitter.com/FarmMarket125th,,Instagram--> 125thStreetFarmersMarket,Y,N,Y,Y,Y,Y,Y,N,Y,Y,Y,Y,Y,Y,N,Y,N,Y,Y,Y,N,Y,Y,N,Y,N,Y,N,N,N,Federal/State government building grounds,Y,Y,N,Y,Y,06/10/2014 to 11/25/2014,Tue: 10:00 AM-7:00 PM;,,,,,,
12th & Brandywine Urban Farm Market,12th & Brandywine Streets,Wilmington,New Castle,Delaware,19801,-75.5345,39.7421,,https://www.facebook.com/pages/12th-Brandywine-Urban-Far ...,,,https://www.facebook.com/delawareurbanfarmcoalition,N,N,N,N,N,N,N,N,Y,Y,N,N,N,N,N,N,N,N,N,N,N,N,N,N,Y,N,N,N,N,N,"On a farm from: a barn, a greenhouse, a tent, a stand, etc",N,N,N,N,Y,05/16/2014 to 10/17/2014,Fri: 8:00 AM-11:00 AM;,,,,,,
14&U Farmers' Market,1400 U Street NW,Washington,District of Columbia,District of Columbia,20009,-77.0321,38.917,,https://www.facebook.com/14UFarmersMarket,https://twitter.com/14UFarmersMkt,,,Y,N,Y,Y,N,Y,Y,N,Y,Y,Y,Y,N,Y,N,Y,Y,Y,N,N,N,N,N,Y,Y,Y,Y,N,N,N,Other,Y,Y,Y,Y,Y,05/03/2014 to 11/22/2014,Sat: 9:00 AM-1:00 PM;,,,,,,


Now, suppose we want to answer some questions about farmers' markets in the US. For example, which market(s) have the largest longitude (given by the `x` column)? 

To answer this, we'll sort `farmers_markets_locations` by longitude.

In [16]:
farmers_markets_locations.sort('x')

MarketName,city,State,x,y
Trapper Creek Farmers Market,Trapper Creek,Alaska,-166.54,53.8748
Kekaha Neighborhood Center (Sunshine Markets),Kekaha,Hawaii,-159.718,21.9704
Hanapepe Park (Sunshine Markets),Hanapepe,Hawaii,-159.588,21.9101
Kalaheo Neighborhood Center (Sunshine Markets),Kalaheo,Hawaii,-159.527,21.9251
Hawaiian Farmers of Hanalei,Hanalei,Hawaii,-159.514,22.2033
Hanalei Saturday Farmers Market,Hanalei,Hawaii,-159.492,22.2042
Kauai Culinary Market,Koloa,Hawaii,-159.469,21.9067
Koloa Ball Park (Knudsen) (Sunshine Markets),Koloa,Hawaii,-159.465,21.9081
West Kauai Agricultural Association,Poipu,Hawaii,-159.435,21.8815
Kilauea Neighborhood Center (Sunshine Markets),Kilauea,Hawaii,-159.406,22.2112


Oops, that didn't answer our question because we sorted from smallest to largest longitude. To look at the largest longitudes, we'll have to sort in reverse order.

In [17]:
farmers_markets_locations.sort('x', descending=True)

MarketName,city,State,x,y
"Christian ""Shan"" Hendricks Vegetable Market",Saint Croix,Virgin Islands,-64.7043,17.7449
La Reine Farmers Market,Saint Croix,Virgin Islands,-64.7789,17.7322
Anne Heyliger Vegetable Market,Saint Croix,Virgin Islands,-64.8799,17.7099
Rothschild Francis Vegetable Market,St. Thomas,Virgin Islands,-64.9326,18.3428
Feria Agrícola de Luquillo,Luquillo,Puerto Rico,-65.7207,18.3782
El Mercado Familiar,San Lorenzo,Puerto Rico,-65.9674,18.1871
El Mercado Familiar,Gurabo,Puerto Rico,-65.9786,18.2526
El Mercado Familiar,Patillas,Puerto Rico,-66.0135,18.0069
El Mercado Familiar,Caguas zona urbana,Puerto Rico,-66.039,18.2324
El Maercado Familiar,Arroyo zona urbana,Puerto Rico,-66.0617,17.9686


(The `descending=True` bit is called an *optional argument*. It has a default value of `False`, so when you explicitly tell the function `descending=True`, then the function will sort in descending order.)

### `sort`

Some details about sort:

1. The first argument to `sort` is the name of a column to sort by.
2. If the column has text in it, `sort` will sort alphabetically; if the column has numbers, it will sort numerically.
3. The value of `farmers_markets_locations.sort("x")` is a *copy* of `farmers_markets_locations`; the `farmers_markets_locations` table doesn't get modified. For example, if we called `farmers_markets_locations.sort("x")`, then running `farmers_markets_locations` by itself would still return the unsorted table.
4. Rows always stick together when a table is sorted.  It wouldn't make sense to sort just one column and leave the other columns alone.  For example, in this case, if we sorted just the `x` column, the farmers' markets would all end up with the wrong longitudes.
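
The details above can be pictured with a plain-Python analogy. This sketch uses Python's built-in `sorted` on a list of tuples rather than the table method `sort`, so it only illustrates the behavior: each tuple (one row) moves as a unit, and the original list is left alone. The three rows are copied from the `farmers_markets_locations` output earlier in this lab.

```python
# Each tuple is one row: (MarketName, x, y).
rows = [
    ("Stearns Homestead Farmers' Market", -81.7286, 41.3751),
    ("100 Mile Market", -85.5749, 42.296),
    ("112st Madison Avenue", -73.9493, 40.7939),
]

# sorted() builds a new list, ordered by longitude (position 1 in each row).
by_longitude = sorted(rows, key=lambda r: r[1])

# reverse=True plays the role that descending=True plays for tables.
by_longitude_desc = sorted(rows, key=lambda r: r[1], reverse=True)

print([r[0] for r in by_longitude])
print([r[0] for r in by_longitude_desc])
print(rows[0][0])  # the original list still starts with the same row
```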

**Question 2.5.** Create a version of `farmers_markets_locations` that's sorted by **latitude (`y`)**, with the largest latitudes first.  Call it `farmers_markets_locations_by_latitude`.

<!--
BEGIN QUESTION
name: q25
-->

In [18]:
farmers_markets_locations_by_latitude = farmers_markets_locations.sort('y', descending = True)
farmers_markets_locations_by_latitude

MarketName,city,State,x,y
Tanana Valley Farmers Market,Fairbanks,Alaska,-147.781,64.8628
Ester Community Market,Ester,Alaska,-148.01,64.8459
Fairbanks Downtown Market,Fairbanks,Alaska,-147.72,64.8444
Nenana Open Air Market,Nenana,Alaska,-149.096,64.5566
Highway's End Farmers' Market,Delta Junction,Alaska,-145.733,64.0385
MountainTraders,Talkeetna,Alaska,-150.118,62.3231
Talkeetna Farmers Market,Talkeetna,Alaska,-150.118,62.3228
Denali Farmers Market,Anchorage,Alaska,-150.234,62.3163
Kenny Lake Harvest II,Valdez,Alaska,-145.476,62.1079
Copper Valley Community Market,Copper Valley,Alaska,-145.444,62.0879


Now let's say we want a table of all farmers' markets in North Carolina. Sorting won't help us much here because North Carolina is closer to the middle of the dataset.

Instead, we use the table method `where`.

In [19]:
nc_farmers_markets = farmers_markets_locations.where('State', are.equal_to('North Carolina'))
nc_farmers_markets

MarketName,city,State,x,y
Afton Village Farmers Market,Concord,North Carolina,-80.6702,35.414
Alamance County Farmers Market,Burlington,North Carolina,-79.4357,36.0943
Alexander County Farmers Market,Taylorsville,North Carolina,-81.1781,35.9197
Alleghany Farmers Market,Sparta,North Carolina,-81.1226,36.503
Andrews Farmers Market,Andrews,North Carolina,-83.8232,35.2027
Anson County Farmers Market,Wadesboro,North Carolina,-80.0526,34.9408
Apex Farmers Market,Apex,North Carolina,-78.8499,35.732
Ashboro Downtown Farmers Market,Asheboro,North Carolina,-79.8175,35.7049
Ashe County Farmers Market,West Jefferson,North Carolina,-81.4935,36.4025
Asheville City Market,Asheville,North Carolina,-82.5489,35.5935


Ignore the syntax for the moment.  Instead, try to read that line like this:

> Assign the name **`nc_farmers_markets`** to a table whose rows are the rows in the **`farmers_markets_locations`** table **`where`** the **`'State'`**s **`are` `equal` `to` `North Carolina`**.

### `where`

Now let's dive into the details a bit more.  `where` takes 2 arguments:

1. The name of a column.  `where` finds rows where that column's values meet some criterion.
2. A predicate that describes the criterion that the column needs to meet.

The predicate in the example above called the function `are.equal_to` with the value we wanted, 'North Carolina'.  We'll see other predicates soon. 

`where` returns a table that's a copy of the original table, but **with only the rows that meet the given predicate**.

**Question 2.6.** Use `nc_farmers_markets` to create a table called `ch_markets` containing farmers' markets in Chapel Hill, North Carolina.
<!--
BEGIN QUESTION
name: q36
-->

In [20]:
ch_markets = nc_farmers_markets.where('city', 'Chapel Hill')
ch_markets

MarketName,city,State,x,y
Southern Village Farmers Market,Chapel Hill,North Carolina,-79.0659,35.881
The Chapel Hill Farmers' Market,Chapel Hill,North Carolina,-79.0294,35.9275


Recognize any of them?

So far we've only been using `where` with the predicate that requires finding the values in a column to be *exactly* equal to a certain value. However, there are many other predicates. Here are a few:

|Predicate|Example|Result|
|-|-|-|
|`are.equal_to`|`are.equal_to(50)`|Find rows with values equal to 50|
|`are.not_equal_to`|`are.not_equal_to(50)`|Find rows with values not equal to 50|
|`are.above`|`are.above(50)`|Find rows with values above (and not equal to) 50|
|`are.above_or_equal_to`|`are.above_or_equal_to(50)`|Find rows with values above 50 or equal to 50|
|`are.below`|`are.below(50)`|Find rows with values below 50|
|`are.between`|`are.between(2, 10)`|Find rows with values above or equal to 2 and below 10|
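
The boundary cases in the table are easy to mix up, so here is a sketch that spells out three of the predicates as ordinary Python comparisons. It does not use the `are` predicates themselves; the list `values` and the variable names are made up for illustration.

```python
values = [1, 2, 5, 10, 50, 51]

# Like are.above(50): strictly greater than 50.
above_50 = [v for v in values if v > 50]

# Like are.above_or_equal_to(50): 50 itself is included.
above_or_equal_50 = [v for v in values if v >= 50]

# Like are.between(2, 10): includes 2 but excludes 10.
between_2_and_10 = [v for v in values if 2 <= v < 10]

print(above_50)           # [51]
print(above_or_equal_50)  # [50, 51]
print(between_2_and_10)   # [2, 5]
```

Because the upper end of `are.between` is excluded, a range meant to include its last value needs an upper bound that is one step past it.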

# 3. Analyzing a dataset

Now that you're familiar with table operations, let’s answer an interesting question about a dataset!

Run the cell below to load the `imdb` table. It contains information about the 250 highest-rated movies on IMDb.

In [21]:
# Just run this cell

imdb = Table.read_table('imdb.csv')
imdb

Votes,Rating,Title,Year,Decade
88355,8.4,M,1931,1930
132823,8.3,Singin' in the Rain,1952,1950
74178,8.3,All About Eve,1950,1950
635139,8.6,Léon,1994,1990
145514,8.2,The Elephant Man,1980,1980
425461,8.3,Full Metal Jacket,1987,1980
441174,8.1,Gone Girl,2014,2010
850601,8.3,Batman Begins,2005,2000
37664,8.2,Judgment at Nuremberg,1961,1960
46987,8.0,Relatos salvajes,2014,2010


Often, we want to perform multiple operations - sorting, filtering, or others - in order to turn a table we have into something more useful. You can do these operations one by one, e.g.

```
first_step = original_tbl.where("col1", are.equal_to(12))
second_step = first_step.sort('col2', descending=True)
```

However, since the value of the expression `original_tbl.where("col1", are.equal_to(12))` is itself a table, you can just call a table method on it:

```
original_tbl.where("col1", are.equal_to(12)).sort('col2', descending=True)
```
You should organize your work in the way that makes the most sense to you, using informative names for any intermediate tables you create. 

**Question 3.1.** Create a table of movies released between 1980 and 1999 (inclusive) with ratings above 7. The table should only contain the columns `Title` and `Rating`, **in that order**.

Assign the table to the name `above_seven`.

*Hint:* Think about the steps you need to take, and try to put them in an order that makes sense. Feel free to create intermediate tables for each step, but please make sure you assign your final table the name `above_seven`!

<!--
BEGIN QUESTION
name: q31
-->

In [28]:
above_seven = imdb.where('Year', are.between(1980, 2000)).where('Rating', are.above(7)).select('Title', 'Rating')
above_seven

Title,Rating
Léon,8.6
The Elephant Man,8.2
Full Metal Jacket,8.3
The Princess Bride,8.1
Mononoke-hime,8.4
Saving Private Ryan,8.5
In the Name of the Father,8.1
Before Sunrise,8.0
The Shining,8.4
"Paris, Texas (1984)",8.0


**Question 3.2.** Use `num_rows` (and arithmetic) to find the *proportion* of movies in the dataset that were released 1980-1999, and the *proportion* of movies in the dataset that were released in the year 2000 or later.

Assign `proportion_in_80_to_99` to the proportion of movies in the dataset that were released 1980-1999, and `proportion_in_21st_century` to the proportion of movies in the dataset that were released in the year 2000 or later.

*Hint:* The *proportion* of movies released in 1980-1999 is the *number* of movies released in 1980-1999, divided by the *total number* of movies.
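
The arithmetic in the hint can be sketched with made-up counts. The numbers below are hypothetical and are not taken from the `imdb` table; the point is only that a proportion is a count divided by a total, and that `/` in Python gives a decimal result.

```python
# Hypothetical counts, chosen only to illustrate the arithmetic.
num_matching = 3
num_total = 12

proportion = num_matching / num_total
print(proportion)  # 0.25
```

A proportion computed this way always falls between 0 and 1, which is a quick sanity check on your answer.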

<!--
BEGIN QUESTION
name: q32
-->

In [30]:
num_movies_in_dataset = imdb.num_rows
num_in_80_to_99  = imdb.where('Year', are.between(1980, 2000)).num_rows
num_in_21st_century = imdb.where('Year', are.above(1999)).num_rows
proportion_in_80_to_99 = num_in_80_to_99 / num_movies_in_dataset
proportion_in_21st_century = num_in_21st_century / num_movies_in_dataset
print("Proportion in '80s and '90s:", proportion_in_80_to_99)
print("Proportion in 21st century:", proportion_in_21st_century)

Proportion in '80s and '90s: 0.292
Proportion in 21st century: 0.316


# 4. Text
Programming doesn't just concern numbers. Text is one of the most common data types used in programs. 

Text is represented by a **string value** in Python. The word "string" is a programming term for a sequence of characters. A string might contain a single character, a word, a sentence, or a whole book.

To distinguish text data from actual code, we demarcate strings by putting quotation marks around them. Single quotes (`'`) and double quotes (`"`) are both valid, but the types of opening and closing quotation marks must match. The contents can be any sequence of characters, including numbers and symbols. 

We've seen strings before in `print` statements.  Below, two different strings are passed as arguments to the `print` function.

In [31]:
print("I <3", 'Data Science')

I <3 Data Science
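
Here is a short sketch of the quoting rules described above. The strings are made up for illustration. It shows that the choice of quote character does not change the string's value, and that one kind of quote can appear inside a string delimited by the other kind.

```python
single = 'Data Science'
double = "Data Science"

# The quote style is not part of the string, so these are equal.
same = (single == double)
print(same)  # True

# Double quotes can appear inside a single-quoted string, and vice versa.
with_double_inside = 'She said "hello"'
with_single_inside = "It's a string"
print(with_double_inside)
print(with_single_inside)
```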


Just as names can be given to numbers, names can be given to string values.  The names and strings aren't required to be similar in any way. Any name can be assigned to any string.

In [32]:
one = 'two'
plus = '*'
print(one, plus, one)

two * two


**Question 4.1.** Yuri Gagarin was the first person to travel through outer space.  When he emerged from his capsule upon landing on Earth, he [reportedly](https://en.wikiquote.org/wiki/Yuri_Gagarin) had the following conversation with a woman and girl who saw the landing:

    The woman asked: "Can it be that you have come from outer space?"
    Gagarin replied: "As a matter of fact, I have!"

The cell below contains unfinished code.  Fill in the `...`s so that it prints out this conversation *exactly* as it appears above.

<!--
BEGIN QUESTION
name: q41
-->

In [37]:
woman_asking = "The woman asked:"
woman_quote = '"Can it be that you have come from outer space?"'
gagarin_reply = 'Gagarin replied:'
gagarin_quote = '"As a matter of fact, I have!"'

print(woman_asking, woman_quote)
print(gagarin_reply, gagarin_quote)

The woman asked: "Can it be that you have come from outer space?"
Gagarin replied: "As a matter of fact, I have!"


## 4.1. String Methods

Strings can be transformed using **methods**. Recall that methods and functions are not technically the same thing, but we'll be using them interchangeably for the purposes of this course.

Here's a sketch of how to call methods on a string:

    <expression that evaluates to a string>.<method name>(<argument>, <argument>, ...)
    
One example of a string method is `replace`, which replaces all instances of some part of the original string (or a *substring*) with a new string. 

    <original string>.replace(<old substring>, <new substring>)
    
`replace` returns (evaluates to) a new string, leaving the original string unchanged.
    
Try to predict the output of this example, then run the cell!

In [38]:
# Replace one letter
hello = 'Hello'
print(hello.replace('o', 'a'), hello)

Hella Hello
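
Two details of `replace` are worth seeing in a sketch: it replaces every occurrence of the old substring, and if the old substring never appears, the result equals the original. The example word is chosen here for illustration.

```python
word = 'banana'

# Every 'a' is replaced, not just the first one.
replaced = word.replace('a', 'o')

# There is no 'z' in the word, so nothing changes.
untouched = word.replace('z', 'o')

print(replaced)   # bonono
print(untouched)  # banana
print(word)       # banana (the original string is unchanged)
```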


You can call functions on the results of other functions.  For example, `max(abs(-5), abs(3))` evaluates to 5.  Similarly, you can call methods on the results of other method or function calls.

You may have already noticed one difference between functions and methods - a function like `max` does not require a `.` before it's called, but a string method like `replace` does.

In [39]:
# Calling replace on the output of another call to replace
'train'.replace('t', 'ing').replace('in', 'de')

'degrade'

Here's a picture of how Python evaluates a "chained" method call like that:

<img src="lab02-chaining_method_calls.jpg"/>

**Question 4.1.1.** Use `replace` to transform the string `'hitchhiker'` into `'matchmaker'`. Assign your result to `new_word`.

<!--
BEGIN QUESTION
name: q411
-->

In [40]:
new_word = 'hitchhiker'.replace('hi', 'ma')
new_word

'matchmaker'

There are many more string methods in Python, but most programmers don't memorize their names or how to use them.  In the "real world," people usually just search the internet for documentation and examples. A complete [list of string methods](https://docs.python.org/3/library/stdtypes.html#string-methods) appears in the Python language documentation. [Stack Overflow](http://stackoverflow.com) has a huge database of answered questions that often demonstrate how to use these methods to achieve various ends.

## 4.2. Converting to and from Strings

Strings and numbers are different *types* of values, even when a string contains the digits of a number. For example, evaluating the following cell causes an error because an integer cannot be added to a string.

In [41]:
8 + "8"

TypeError: unsupported operand type(s) for +: 'int' and 'str'

However, there are built-in functions to convert numbers to strings and strings to numbers. Some of these built-in functions have restrictions on the type of argument they take:

|Function |Description|
|-|-|
|`int`|Converts a string of digits or a float to an integer ("int") value|
|`float`|Converts a string of digits (perhaps with a decimal point) or an int to a decimal ("float") value|
|`str`|Converts any value to a string|
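
The restrictions in the table can be seen in a short sketch. The name `two_step` is made up for illustration, and the `try`/`except` block is only there so the cell keeps running after the error.

```python
# int() on a float drops the fractional part (it does not round).
print(int(8.9))        # 8

# float() accepts digit strings with or without a decimal point.
print(float("8"))      # 8.0
print(float("8.5"))    # 8.5

# str() works on any value, so the result can be joined to other strings.
print(str(8.5) + "!")  # 8.5!

# int() rejects a string that contains a decimal point.
try:
    int("8.5")
except ValueError as error:
    print("ValueError:", error)

# Converting in two steps does work: string -> float -> int.
two_step = int(float("8.5"))
print(two_step)        # 8
```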

Try to predict what data type and value `example` evaluates to, then run the cell.

In [42]:
example = 8 + int("10") + float("8")

print(example)
print("This example returned a " + str(type(example)) + "!")

26.0
This example returned a <class 'float'>!


Suppose you're writing a program that looks for dates in a text, and you want your program to find the amount of time that elapsed between two years it has identified.  It doesn't make sense to subtract two texts, but you can first convert the text containing the years into numbers.

**Question 4.2.1.** Finish the code below to compute the number of years that elapsed between `one_year` and `another_year`.  Don't just write the numbers `1618` and `1648` (or `30`); use a conversion function to turn the given text data into numbers.

<!--
BEGIN QUESTION
name: q421
-->

In [43]:
# Some text data:
one_year = "1618"
another_year = "1648"

# Complete the next line.  Note that we can't just write:
#   another_year - one_year
# If you don't see why, try seeing what happens when you
# write that here.
difference = int(another_year) - int(one_year)
difference

30

## 4.3. Passing strings to functions

String values, like numbers, can be arguments to functions and can be returned by functions. 

The function `len` (derived from the word "length") takes a single string as its argument and returns the number of characters (including spaces) in the string.

Note that it doesn't count *words*. `len("one small step for man")` evaluates to 22, not 5.
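
As a rough sketch of the difference, the cell below counts both characters and words. It uses the string method `split`, which this lab has not introduced; with no arguments it breaks a string into pieces at whitespace. The variable names are illustrative.

```python
phrase = "one small step for man"

character_count = len(phrase)        # every character, spaces included
word_count = len(phrase.split())     # split() breaks the string at whitespace

print(character_count)   # 22
print(word_count)        # 5
```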

**Question 4.3.1.**  Use `len` to find the number of characters in the long string in the next cell.  Characters include things like spaces and punctuation. Assign `sentence_length` to that number.

(The string is the first sentence of the English translation of the French [Declaration of the Rights of Man](http://avalon.law.yale.edu/18th_century/rightsof.asp).)  

<!--
BEGIN QUESTION
name: q431
-->

In [44]:
a_very_long_sentence = "The representatives of the French people, organized as a National Assembly, believing that the ignorance, neglect, or contempt of the rights of man are the sole cause of public calamities and of the corruption of governments, have determined to set forth in a solemn declaration the natural, unalienable, and sacred rights of man, in order that this declaration, being constantly before all the members of the Social body, shall remind them continually of their rights and duties; in order that the acts of the legislative power, as well as those of the executive power, may be compared at any moment with the objects and purposes of all political institutions and may thus be more respected, and, lastly, in order that the grievances of the citizens, based hereafter upon simple and incontestable principles, shall tend to the maintenance of the constitution and redound to the happiness of all."
sentence_length = len(a_very_long_sentence)
sentence_length

896

# 5. Arrays

Computers are most useful when you can use a small amount of code to *do the same action* to *many different things*.

For example, in the time it takes you to calculate the 18% tip on a restaurant bill, a laptop can calculate 18% tips for every restaurant bill paid by every human on Earth that day.  (That's if you're pretty fast at doing arithmetic in your head!)

**Arrays** are how we put many values in one place so that we can operate on them as a group. For example, if `billions_of_numbers` is an array of numbers, the expression

    .18 * billions_of_numbers

gives a new array of numbers that contains the result of multiplying each number in `billions_of_numbers` by .18.  Arrays are not limited to numbers; we can also put all the words in a book into an array of strings.

Concretely, an array is a **collection of values of the same type**. 
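
Here is a minimal sketch of both ideas: one expression applied to many values, and every value sharing one type. It uses `np.array`, NumPy's own array constructor, as a stand-in for the lab's tools so that it depends only on NumPy. The bill amounts and variable names are made up.

```python
import numpy as np

# Made-up bills; np.array is used here so the sketch needs only NumPy.
bills = np.array([10.0, 25.5, 40.0])
tips = .18 * bills              # one expression, applied to every element
print(tips)

# Mixing ints and floats: NumPy stores every element as a float.
mixed_numbers = np.array([1, 2.5])
print(mixed_numbers.dtype)      # float64

# Mixing a number and text: NumPy stores every element as a string.
mixed_text = np.array([1, "two"])
print(mixed_text)
```

In both mixed cases NumPy converts the values to a single common type rather than keeping two types side by side.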

## 5.1. Making arrays

First, let's learn how to manually input values into an array. This typically isn't how real programs get their data; normally, arrays are created by loading values from an external source, like a data file.

To create an array by hand, call the function `make_array`.  Each argument you pass to `make_array` will be in the array it returns.  Run this cell to see an example:

In [45]:
make_array(0.125, 4.75, -1.3)

array([ 0.125,  4.75 , -1.3  ])

Each value in an array (in the above case, the numbers 0.125, 4.75, and -1.3) is called an *element* of that array.

Arrays themselves are also values, just like numbers and strings.  That means you can assign them to names or use them as arguments to functions. For example, `len(<some_array>)` returns the number of elements in `some_array`.

**Question 5.1.1.** Make an array containing the numbers 0, 1, -1, $\pi$, and $e$, in that order.  Name it `interesting_numbers`.  

*Hint:* How did you get the values $\pi$ and $e$ earlier in this lab?

<!--
BEGIN QUESTION
name: q511
-->

In [46]:
interesting_numbers = make_array(0, 1, -1, math.pi, math.e)
interesting_numbers

array([ 0.        ,  1.        , -1.        ,  3.14159265,  2.71828183])

**Question 5.1.2.** Make an array containing the five strings `"Hello"`, `","`, `" "`, `"world"`, and `"!"`.  (The third one is a single space inside quotes.)  Name it `hello_world_components`.

*Note:* If you evaluate `hello_world_components`, you'll notice some extra information in addition to its contents: `dtype='<U5'`.  That's just NumPy's extremely cryptic way of saying that the data types in the array are strings.

<!--
BEGIN QUESTION
name: q512
-->

In [47]:
hello_world_components = make_array("Hello", ",", " ", "world", "!")
hello_world_components

array(['Hello', ',', ' ', 'world', '!'], dtype='<U5')

###  `np.arange`
Arrays are provided by a package called [NumPy](http://www.numpy.org/) (pronounced "NUM-pie"). The package is called `numpy`, but it's standard to rename it `np` for brevity.  You can do that with:

    import numpy as np

Very often in data science, we want to work with many numbers that are evenly spaced within some range.  NumPy provides a special function for this called `arange`.  The line of code `np.arange(start, stop, step)` evaluates to an array with all the numbers starting at `start` and counting up by `step`, stopping **before** `stop` is reached.

Run the following cells to see some examples!

In [48]:
# This array starts at 1 and counts up by 2
# and then stops before 6
np.arange(1, 6, 2)

array([1, 3, 5])

In [49]:
# This array doesn't contain 9
# because np.arange stops *before* the stop value is reached
np.arange(4, 9, 1)

array([4, 5, 6, 7, 8])
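
A few more variations, as a sketch: `np.arange` can also be called with one or two arguments, and a common way to include an endpoint is to push the stop value just past it. The name `evens_through_10` is made up.

```python
import numpy as np

# With one argument, arange starts at 0 and counts up by 1.
print(np.arange(5))            # 0 through 4

# With two arguments, the step defaults to 1.
print(np.arange(4, 9))         # 4 through 8

# To include the endpoint 10, make the stop value slightly larger than 10.
evens_through_10 = np.arange(0, 10 + 1, 2)
print(evens_through_10)        # ends at 10
```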

**Question 5.1.3.** Import `numpy` as `np` and then use `np.arange` to create an array with the multiples of 99 from 0 up to (**and including**) 9999.  (So its elements are 0, 99, 198, 297, etc.)

<!--
BEGIN QUESTION
name: q513
-->

In [50]:
import numpy as np
multiples_of_99 = np.arange(0, 10000, 99)
multiples_of_99

array([   0,   99,  198,  297,  396,  495,  594,  693,  792,  891,  990,
       1089, 1188, 1287, 1386, 1485, 1584, 1683, 1782, 1881, 1980, 2079,
       2178, 2277, 2376, 2475, 2574, 2673, 2772, 2871, 2970, 3069, 3168,
       3267, 3366, 3465, 3564, 3663, 3762, 3861, 3960, 4059, 4158, 4257,
       4356, 4455, 4554, 4653, 4752, 4851, 4950, 5049, 5148, 5247, 5346,
       5445, 5544, 5643, 5742, 5841, 5940, 6039, 6138, 6237, 6336, 6435,
       6534, 6633, 6732, 6831, 6930, 7029, 7128, 7227, 7326, 7425, 7524,
       7623, 7722, 7821, 7920, 8019, 8118, 8217, 8316, 8415, 8514, 8613,
       8712, 8811, 8910, 9009, 9108, 9207, 9306, 9405, 9504, 9603, 9702,
       9801, 9900, 9999])

##### Temperature readings
NOAA (the US National Oceanic and Atmospheric Administration) operates weather stations that measure surface temperatures at different sites around the United States.  The hourly readings are [publicly available](http://www.ncdc.noaa.gov/qclcd/QCLCD?prior=N).

Suppose we download all the hourly data from the Raleigh-Durham International Airport site for the month of August 2021.  To analyze the data, we want to know when each reading was taken, but we find that the data don't include the timestamps of the readings (the time at which each one was taken).

However, we know the first reading was taken at the first instant of August 2021 (midnight on August 1st) and each subsequent reading was taken exactly 1 hour after the last.

**Question 5.1.4.** Create an array of the *time, in seconds, since the start of the month* at which each hourly reading was taken.  Name it `collection_times`.

*Hint 1:* There are 31 days in August, which is equivalent to ($31 \times 24$) hours or ($31 \times 24 \times 60 \times 60$) seconds.  So your array should have $31 \times 24$ elements in it.

*Hint 2:* The `len` function works on arrays, too! You can check that the length of your array is $31 \times 24$ elements.
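
The arithmetic in the hints can be sketched as follows. The names `seconds_per_hour`, `hours_in_august`, `seconds_in_august`, and `reading_times` are illustrative and are not the names the question asks for.

```python
import numpy as np

# Unit conversions from Hint 1, with illustrative names.
seconds_per_hour = 60 * 60
hours_in_august = 31 * 24
seconds_in_august = hours_in_august * seconds_per_hour

# One reading per hour, starting at second 0 and stopping before the month ends.
reading_times = np.arange(0, seconds_in_august, seconds_per_hour)

# Hint 2: len works on arrays, so the element count can be checked directly.
print(len(reading_times) == hours_in_august)   # True
print(reading_times.item(1))                   # 3600, one hour into the month
```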

<!--
BEGIN QUESTION
name: q214
-->

In [55]:
collection_times = np.arange(0, 2678400, 3600)
collection_times

array([      0,    3600,    7200,   10800,   14400,   18000,   21600,
         25200,   28800,   32400,   36000,   39600,   43200,   46800,
         50400,   54000,   57600,   61200,   64800,   68400,   72000,
         75600,   79200,   82800,   86400,   90000,   93600,   97200,
        100800,  104400,  108000,  111600,  115200,  118800,  122400,
        126000,  129600,  133200,  136800,  140400,  144000,  147600,
        151200,  154800,  158400,  162000,  165600,  169200,  172800,
        176400,  180000,  183600,  187200,  190800,  194400,  198000,
        201600,  205200,  208800,  212400,  216000,  219600,  223200,
        226800,  230400,  234000,  237600,  241200,  244800,  248400,
        252000,  255600,  259200,  262800,  266400,  270000,  273600,
        277200,  280800,  284400,  288000,  291600,  295200,  298800,
        302400,  306000,  309600,  313200,  316800,  320400,  324000,
        327600,  331200,  334800,  338400,  342000,  345600,  349200,
        352800,  356

## 5.2. Working with single elements of arrays ("indexing")
Let's work with a more interesting dataset.  The next cell creates an array called `population_amounts` that includes estimated world populations in every year from **1950** to roughly the present.  (The estimates come from the US Census Bureau website.)

Rather than type in the data manually, we've loaded them from a file on your computer called `world_population.csv`.  You'll learn how to do that later in this lab!

In [56]:
population_amounts = Table.read_table("world_population.csv").column("Population")
population_amounts

array([2557628654, 2594939877, 2636772306, 2682053389, 2730228104,
       2782098943, 2835299673, 2891349717, 2948137248, 3000716593,
       3043001508, 3083966929, 3140093217, 3209827882, 3281201306,
       3350425793, 3420677923, 3490333715, 3562313822, 3637159050,
       3712697742, 3790326948, 3866568653, 3942096442, 4016608813,
       4089083233, 4160185010, 4232084578, 4304105753, 4379013942,
       4451362735, 4534410125, 4614566561, 4695736743, 4774569391,
       4856462699, 4940571232, 5027200492, 5114557167, 5201440110,
       5288955934, 5371585922, 5456136278, 5538268316, 5618682132,
       5699202985, 5779440593, 5857972543, 5935213248, 6012074922,
       6088571383, 6165219247, 6242016348, 6318590956, 6395699509,
       6473044732, 6551263534, 6629913759, 6709049780, 6788214394,
       6866332358, 6944055583, 7022349283, 7101027895, 7178722893,
       7256490011])

Here's how we get the first element of `population_amounts`, which is the world population in the first year in the dataset, 1950.

In [57]:
population_amounts.item(0)

2557628654

The value of that expression is the number 2557628654 (around 2.5 billion), because that's the first thing in the array `population_amounts`.

Notice that we wrote `.item(0)`, not `.item(1)`, to get the first element.  This is a weird convention in computer science.  0 is called the *index* of the first item.  It's the number of elements that appear *before* that item.  So 3 is the index of the 4th item.

Here are some more examples.  In the examples, we've given names to the things we get out of `population_amounts`.  Read and run each cell.

In [58]:
# The 13th element in the array is the population
# in 1962 (which is 1950 + 12).
population_1962 = population_amounts.item(12)
population_1962

3140093217

In [59]:
# The 66th element is the population in 2015.
population_2015 = population_amounts.item(65)
population_2015

7256490011

In [60]:
# The array has only 66 elements, so this doesn't work.
# (There's no element with 66 other elements before it.)
population_2016 = population_amounts.item(66)
population_2016

IndexError: index 66 is out of bounds for axis 0 with size 66
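
A small sketch of the same rule on a made-up array: the last valid index is always one less than the length. `np.array` stands in for `make_array` here so the cell needs only NumPy, and the `try`/`except` block just keeps the cell running after the error.

```python
import numpy as np

# A small made-up array with four elements.
small = np.array([10, 20, 30, 40])

last_index = len(small) - 1        # 3, because indices start at 0
print(small.item(last_index))      # 40

# Asking for index len(small) fails, just like item(66) above.
try:
    small.item(len(small))
except IndexError as error:
    print("IndexError:", error)
```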

Since `make_array` returns an array, we can call `.item(3)` on its output to get its 4th element, just like we "chained" together calls to the method `replace` earlier.

In [61]:
make_array(-1, -3, 4, -2).item(3)

-2

**Question 5.2.1.** Set `population_1982` to the world population in 1982, by getting the appropriate element from `population_amounts` using `item`.

<!--
BEGIN QUESTION
name: q521
-->

In [63]:
population_1982 = population_amounts.item(32)
population_1982

4614566561

## 5.3. Doing something to every element of an array
Arrays are primarily useful for doing the same operation many times, so we don't often have to use `.item` and work with single elements.

##### Logarithms
Here is one simple question we might ask about world population:

> How big was the population in *orders of magnitude* in each year?

Orders of magnitude quantify how big a number is by representing it as the power of another number (for example, representing 104 as $10^{2.017033}$). One way to do this is by using the logarithm function. The logarithm (base 10) of a number increases by 1 every time we multiply the number by 10. It's like a measure of how many decimal digits the number has, or how big it is in orders of magnitude.
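
As a quick worked sketch of those claims, using `math.log10` on single numbers (the names `world_1950` and `digits` are made up; the population value is the 1950 figure from the array above):

```python
import math

print(math.log10(104))       # about 2.017033
print(math.log10(1040))      # about 3.017033: ten times bigger, log goes up by 1

# For a positive whole number, the digit count is floor(log10(n)) + 1.
world_1950 = 2557628654
digits = math.floor(math.log10(world_1950)) + 1
print(digits)                # 10
```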

We could try to answer our question like this, using the `log10` function from the `math` module and the `item` method you just saw:

In [65]:
population_1950_magnitude = math.log10(population_amounts.item(0))
population_1951_magnitude = math.log10(population_amounts.item(1))
population_1952_magnitude = math.log10(population_amounts.item(2))
population_1953_magnitude = math.log10(population_amounts.item(3))

But this is tedious and doesn't really take advantage of the fact that we are using a computer.

Instead, NumPy provides its own version of `log10` that takes the logarithm of each element of an array.  It takes a single array of numbers as its argument.  It returns an array of the same length, where the first element of the result is the logarithm of the first element of the argument, and so on.

**Question 5.3.1.** Use `np.log10` to compute the logarithms of the world population in every year.  Give the result (an array of 66 numbers) the name `population_magnitudes`.  Your code should be very short.

<!--
BEGIN QUESTION
name: q531
-->

In [66]:
population_magnitudes = np.log10(population_amounts)
population_magnitudes

array([9.40783749, 9.4141273 , 9.42107263, 9.42846742, 9.43619893,
       9.44437257, 9.45259897, 9.46110062, 9.4695477 , 9.47722498,
       9.48330217, 9.48910971, 9.49694254, 9.50648175, 9.51603288,
       9.5251    , 9.53411218, 9.54286695, 9.55173218, 9.56076229,
       9.56968959, 9.57867667, 9.58732573, 9.59572724, 9.60385954,
       9.61162595, 9.61911264, 9.62655434, 9.63388293, 9.64137633,
       9.64849299, 9.6565208 , 9.66413091, 9.67170374, 9.67893421,
       9.68632006, 9.69377717, 9.70132621, 9.70880804, 9.7161236 ,
       9.72336995, 9.73010253, 9.73688521, 9.74337399, 9.74963446,
       9.75581413, 9.7618858 , 9.76774733, 9.77343633, 9.77902438,
       9.7845154 , 9.78994853, 9.7953249 , 9.80062024, 9.80588805,
       9.81110861, 9.81632507, 9.82150788, 9.82666101, 9.83175555,
       9.83672482, 9.84161319, 9.84648243, 9.85132122, 9.85604719,
       9.8607266 ])

What you just did is called *elementwise* application of `np.log10`, since `np.log10` operates separately on each element of the array that it's called on. Here's a picture of what's going on:

<img src="lab02-array_logarithm.jpg">


The textbook's [section](https://www.inferentialthinking.com/chapters/05/1/Arrays) on arrays has a useful list of NumPy functions that are designed to work elementwise, like `np.log10`.
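
To make "elementwise" concrete, this sketch compares a single `np.log10` call with the one-at-a-time approach on a small made-up array. `np.array` stands in for `make_array`, and the variable names are illustrative.

```python
import math
import numpy as np

# A small made-up array of powers of ten.
values = np.array([1, 10, 100, 1000])

# One call handles every element.
elementwise = np.log10(values)
print(elementwise)

# The same numbers, computed one element at a time with math.log10.
one_by_one = np.array([
    math.log10(values.item(0)),
    math.log10(values.item(1)),
    math.log10(values.item(2)),
    math.log10(values.item(3)),
])
print(one_by_one)
```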

##### Arithmetic
Arithmetic also works elementwise on arrays: if you perform an arithmetic operation (like subtraction or division) on an array, Python does the operation to every element of the array individually and returns an array of all of the results. For example, you can divide all the population numbers by 1 billion to get numbers in billions:

In [67]:
population_in_billions = population_amounts / 1000000000
population_in_billions

array([2.55762865, 2.59493988, 2.63677231, 2.68205339, 2.7302281 ,
       2.78209894, 2.83529967, 2.89134972, 2.94813725, 3.00071659,
       3.04300151, 3.08396693, 3.14009322, 3.20982788, 3.28120131,
       3.35042579, 3.42067792, 3.49033371, 3.56231382, 3.63715905,
       3.71269774, 3.79032695, 3.86656865, 3.94209644, 4.01660881,
       4.08908323, 4.16018501, 4.23208458, 4.30410575, 4.37901394,
       4.45136274, 4.53441012, 4.61456656, 4.69573674, 4.77456939,
       4.8564627 , 4.94057123, 5.02720049, 5.11455717, 5.20144011,
       5.28895593, 5.37158592, 5.45613628, 5.53826832, 5.61868213,
       5.69920299, 5.77944059, 5.85797254, 5.93521325, 6.01207492,
       6.08857138, 6.16521925, 6.24201635, 6.31859096, 6.39569951,
       6.47304473, 6.55126353, 6.62991376, 6.70904978, 6.78821439,
       6.86633236, 6.94405558, 7.02234928, 7.10102789, 7.17872289,
       7.25649001])

You can do the same with addition, subtraction, multiplication, and exponentiation (`**`). For example, you can calculate a tip on several restaurant bills at once (in this case just 3):

In [68]:
restaurant_bills = make_array(20.12, 39.90, 31.01)
print("Restaurant bills:\t", restaurant_bills)

# Array multiplication
tips = .2 * restaurant_bills
print("Tips:\t\t\t", tips)

Restaurant bills:	 [20.12 39.9  31.01]
Tips:			 [4.024 7.98  6.202]


<img src="lab02-array_multiplication.jpg">

**Question 5.3.2.** Suppose the total charge at a restaurant is the original bill plus the tip. If the tip is 20%, that means we can multiply the original bill by 1.2 to get the total charge.  Compute the total charge for each bill in `restaurant_bills`, and assign the resulting array to `total_charges`.

<!--
BEGIN QUESTION
name: q532
-->

In [71]:
total_charges = restaurant_bills * 1.2
total_charges

array([24.144, 47.88 , 37.212])
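
Arithmetic between two arrays of the same length is elementwise too: the first elements are combined, then the second elements, and so on. The sketch below rebuilds the bills with `np.array` so it depends only on NumPy; the names `bills`, `bill_tips`, and `totals` are illustrative.

```python
import numpy as np

bills = np.array([20.12, 39.90, 31.01])   # same values as restaurant_bills
bill_tips = .2 * bills                    # array times a single number

# Array plus array: first + first, second + second, third + third.
totals = bills + bill_tips
print(totals)

# These match 1.2 * bills, up to floating-point rounding.
print(np.allclose(totals, 1.2 * bills))   # True
```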

**Question 5.3.3.** The array `more_restaurant_bills` contains 100,000 bills!  Compute the total charge for each one.  How is your code different?

<!--
BEGIN QUESTION
name: q533
-->

In [72]:
more_restaurant_bills = Table.read_table("more_restaurant_bills.csv").column("Bill")
more_total_charges = more_restaurant_bills * 1.2
more_total_charges

array([20.244, 20.892, 12.216, ..., 19.308, 18.336, 35.664])

The function `sum` takes a single array of numbers as its argument.  It returns the sum of all the numbers in that array (so it returns a single number, not an array).
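
A minimal sketch on three values (the total charges computed earlier, rebuilt with `np.array`). It also shows `np.sum`, NumPy's own version of the same operation. Because floating-point arithmetic is approximate, the printed total may carry tiny rounding digits, so the check compares within a small tolerance.

```python
import numpy as np

charges = np.array([24.144, 47.88, 37.212])

total = sum(charges)          # a single number, not an array
print(total)

# np.sum is NumPy's own version and gives the same total.
print(np.sum(charges))

# Compare within a small tolerance instead of expecting exact decimals.
print(abs(total - 109.236) < 1e-9)   # True
```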

**Question 5.3.4.** What was the sum of all the bills in `more_restaurant_bills`, *including tips*?

<!--
BEGIN QUESTION
name: q534
-->

In [74]:
sum_of_bills = sum(more_total_charges)
sum_of_bills

1795730.0640000193

# 6. Creating Tables

An array is useful for describing a single attribute of each element in a collection. For example, let's say our collection is all US States. Then an array could describe the land area of each state. 

Tables extend this idea by containing multiple arrays, each one describing a different attribute for every element of a collection. In this way, tables allow us to not only store data about many entities but to also contain several kinds of data about each entity.

For example, in the cell below we have two arrays. The first one, `population_amounts`, was defined above in section 5.2 and contains the world population in each year (estimated by the US Census Bureau). The second array, `years`, contains the years themselves. These elements are in order, so the year and the world population for that year have the same index in their corresponding arrays.

In [75]:
# Just run this cell

years = np.arange(1950, 2015+1)
print("Population column:", population_amounts)
print("Years column:", years)

Population column: [2557628654 2594939877 2636772306 2682053389 2730228104 2782098943
 2835299673 2891349717 2948137248 3000716593 3043001508 3083966929
 3140093217 3209827882 3281201306 3350425793 3420677923 3490333715
 3562313822 3637159050 3712697742 3790326948 3866568653 3942096442
 4016608813 4089083233 4160185010 4232084578 4304105753 4379013942
 4451362735 4534410125 4614566561 4695736743 4774569391 4856462699
 4940571232 5027200492 5114557167 5201440110 5288955934 5371585922
 5456136278 5538268316 5618682132 5699202985 5779440593 5857972543
 5935213248 6012074922 6088571383 6165219247 6242016348 6318590956
 6395699509 6473044732 6551263534 6629913759 6709049780 6788214394
 6866332358 6944055583 7022349283 7101027895 7178722893 7256490011]
Years column: [1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963
 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977
 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991
 1992 1993 1994 

Suppose we want to answer this question:

> In which year did the world's population cross 6 billion?

You could technically answer this question just from staring at the arrays, but it's a bit convoluted, since you would have to count the position where the population first crossed 6 billion, then find the corresponding element in the years array. In cases like these, it might be easier to put the data into a *`Table`*, a 2-dimensional type of dataset. 
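
For comparison, the array-only approach described above can be sketched in code. The four population values are copied from the array above (indices 47 through 50, so the years 1997 through 2000). The comparison `>` and `np.argmax` are not covered in this lab, and the names are illustrative.

```python
import numpy as np

# Four consecutive entries from population_amounts and their years.
some_years = np.array([1997, 1998, 1999, 2000])
some_populations = np.array([5857972543, 5935213248, 6012074922, 6088571383])

# Step 1: compare every population to 6 billion, giving True/False values.
above_6_billion = some_populations > 6000000000

# Step 2: np.argmax returns the index of the first True.
# (It returns 0 if nothing is True, so this assumes a crossing exists.)
first_index = np.argmax(above_6_billion).item()

# Step 3: the year sits at the same index in the years array.
print(some_years.item(first_index))   # 1999
```

This works, but it relies on keeping two separate arrays lined up by hand.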

The expression below:

- creates an empty table using the expression `Table()`,
- adds two columns by calling `with_columns` with four arguments,
- assigns the result to the name `population`, and finally
- evaluates `population` so that we can see the table.

The strings `"Population"` and `"Year"` are column labels that we have chosen. The names `population_amounts` and `years` were assigned above to two arrays of the **same length**. The method `with_columns` takes alternating strings (the column labels) and arrays (the data in those columns), separated by commas.

In [76]:
population = Table().with_columns(
    "Population", population_amounts,
    "Year", years
)
population

Population,Year
2557628654,1950
2594939877,1951
2636772306,1952
2682053389,1953
2730228104,1954
2782098943,1955
2835299673,1956
2891349717,1957
2948137248,1958
3000716593,1959


Now the data is combined into a single table! It's much easier to parse this data. If you need to know what the population was in 1959, for example, you can tell from a single glance.

**Question 6.1.** In the cell below, we've created 2 arrays. Using the steps above, assign `top_10_movies` to a table that has two columns called "Rating" and "Name", which hold `top_10_movie_ratings` and `top_10_movie_names` respectively.

<!--
BEGIN QUESTION
name: q61
-->

In [81]:
top_10_movie_ratings = make_array(9.2, 9.2, 9., 8.9, 8.9, 8.9, 8.9, 8.9, 8.9, 8.8)
top_10_movie_names = make_array(
        'The Shawshank Redemption (1994)',
        'The Godfather (1972)',
        'The Godfather: Part II (1974)',
        'Pulp Fiction (1994)',
        "Schindler's List (1993)",
        'The Lord of the Rings: The Return of the King (2003)',
        '12 Angry Men (1957)',
        'The Dark Knight (2008)',
        'Il buono, il brutto, il cattivo (1966)',
        'The Lord of the Rings: The Fellowship of the Ring (2001)')

top_10_movies = Table().with_columns(
    'Name', top_10_movie_names,
    'Rating', top_10_movie_ratings
)

# We've put this next line here 
# so your table will get printed out 
# when you run this cell.
top_10_movies

Name,Rating
The Shawshank Redemption (1994),9.2
The Godfather (1972),9.2
The Godfather: Part II (1974),9.0
Pulp Fiction (1994),8.9
Schindler's List (1993),8.9
The Lord of the Rings: The Return of the King (2003),8.9
12 Angry Men (1957),8.9
The Dark Knight (2008),8.9
"Il buono, il brutto, il cattivo (1966)",8.9
The Lord of the Rings: The Fellowship of the Ring (2001),8.8


#### Loading a table from a file

In most cases, we aren't going to go through the trouble of typing in all the data manually. Instead, we load them in from an external source, like a data file. There are many formats for data files, but CSV ("comma-separated values") is one of the most common.

`Table.read_table(...)` takes one argument (a path to a data file in string format) and returns a table.  

**Question 6.2.** `imdb.csv` contains a table of information about the 250 highest-rated movies on IMDb.  Load it as a table called `imdb`.

(This was done for you earlier in the lab! Repeat that code here.)

<!--
BEGIN QUESTION
name: q62
-->

In [79]:
imdb = Table.read_table('imdb.csv')
imdb

Votes,Rating,Title,Year,Decade
88355,8.4,M,1931,1930
132823,8.3,Singin' in the Rain,1952,1950
74178,8.3,All About Eve,1950,1950
635139,8.6,Léon,1994,1990
145514,8.2,The Elephant Man,1980,1980
425461,8.3,Full Metal Jacket,1987,1980
441174,8.1,Gone Girl,2014,2010
850601,8.3,Batman Begins,2005,2000
37664,8.2,Judgment at Nuremberg,1961,1960
46987,8.0,Relatos salvajes,2014,2010


# 7. More Table Operations!

Now that you've worked with arrays, let's add a few more methods to the list of table operations that you've already seen.

### `column`

`column` takes the column name of a table (in string format) as its argument and returns the values in that column as an **array**. 

In [82]:
# Returns an array of movie names
top_10_movies.column('Name')

array(['The Shawshank Redemption (1994)', 'The Godfather (1972)',
       'The Godfather: Part II (1974)', 'Pulp Fiction (1994)',
       "Schindler's List (1993)",
       'The Lord of the Rings: The Return of the King (2003)',
       '12 Angry Men (1957)', 'The Dark Knight (2008)',
       'Il buono, il brutto, il cattivo (1966)',
       'The Lord of the Rings: The Fellowship of the Ring (2001)'],
      dtype='<U56')

### `take`
The table method `take` takes as its argument an array of numbers.  Each number should be the index of a row in the table.  It returns a **new table** with only those rows. 

You'll usually want to use `take` in conjunction with `np.arange` to take the first few rows of a table.

In [83]:
# Take first 5 movies of top_10_movies
top_10_movies.take(np.arange(0, 5, 1))

Name,Rating
The Shawshank Redemption (1994),9.2
The Godfather (1972),9.2
The Godfather: Part II (1974),9.0
Pulp Fiction (1994),8.9
Schindler's List (1993),8.9


The next questions will give you practice with combining the operations you've learned in this lab and the previous one to answer questions about the `population` and `imdb` tables. First, check out the `population` table from section 6.

In [84]:
# Run this cell to display the population table.
population

Population,Year
2557628654,1950
2594939877,1951
2636772306,1952
2682053389,1953
2730228104,1954
2782098943,1955
2835299673,1956
2891349717,1957
2948137248,1958
3000716593,1959


**Question 7.1.** Check out the `population` table from section 6 of this lab.  Compute the year when the world population first went above 5 billion. Assign the year to `year_population_crossed_5_billion`.

<!--
BEGIN QUESTION
name: q41
-->

In [91]:
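# Keep the rows above 5 billion, get the Year column as an array,
# and take its first element (the rows are in year order).
year_population_crossed_5_billion = population.where('Population', are.above(5000000000)).column('Year').item(0)
year_population_crossed_5_billion

1987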


**Question 7.2.** Find the average rating for movies released between 1980 and 1999 (inclusive) and the average rating for movies released in the year 2000 or after for the movies in `imdb`.

*Hint*: Think of the steps you need to do (take the average, find the ratings, find movies released in each time period), and try to put them in an order that makes sense.
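
The order of the steps can be sketched with plain NumPy arrays. The data below are only the ten `imdb` rows displayed earlier, so the averages here are not the answer for the full 250-movie table. Filtering with True/False arrays stands in for the table method `where`, and all names are illustrative.

```python
import numpy as np

# Year and Rating values from the ten imdb rows shown above.
sample_years = np.array([1931, 1952, 1950, 1994, 1980, 1987, 2014, 2005, 1961, 2014])
sample_ratings = np.array([8.4, 8.3, 8.3, 8.6, 8.2, 8.3, 8.1, 8.3, 8.2, 8.0])

# Step 1: find the movies released in each time period.
in_80s_or_90s = (sample_years >= 1980) & (sample_years <= 1999)
in_2000_or_later = sample_years >= 2000

# Step 2: keep only the ratings of those movies.
ratings_80s_90s = sample_ratings[in_80s_or_90s]
ratings_2000_on = sample_ratings[in_2000_or_later]

# Step 3: take the average of each group.
sample_mean_80s_90s = np.mean(ratings_80s_90s)
sample_mean_2000_on = np.mean(ratings_2000_on)
print(sample_mean_80s_90s)
print(sample_mean_2000_on)
```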

<!--
BEGIN QUESTION
name: q42
-->

In [96]:
# are.between(1980, 2000) includes 1980 but excludes 2000.
# column('Rating') returns an array, so np.mean gives a single number.
between_1980_and_1999 = np.mean(imdb.where('Year', are.between(1980, 2000)).column('Rating'))
after_or_in_2000 = np.mean(imdb.where('Year', are.above_or_equal_to(2000)).column('Rating'))
print("Average between 1980 and 1999:", round(between_1980_and_1999, 5))
print("Average after or in 2000 rating:", round(after_or_in_2000, 5))

Average between 1980 and 1999: 8.30822
Average after or in 2000 rating: 8.23797


Congratulations, you're done with lab 2!