# Lab 2: Data Types
Welcome to lab 2!

Last time, we had our first look at Python and Jupyter notebooks.  So far, we've primarily used Python to manipulate numbers and text.  There's a lot more to life than those, so Python lets us represent many other types of data in programs.

In this lab, you'll see how to work with datasets in Python -- *collections* of data, like the numbers 2 through 5 or the words "welcome", "to", and "lab". We will see how to manipulate these datasets through the lenses of *arrays*, *lists*, and *tables*.

# 1. Arrays

Up to now, we haven't done much that you couldn't do yourself by hand, without going through the trouble of learning Python.  Computers are most useful when you can use a small amount of code to *do the same action* to *many different things*.

For example, in the time it takes you to calculate the 18% tip on a restaurant bill, a laptop can calculate 18% tips for every restaurant bill paid by every human on Earth that day.  (That's if you're pretty fast at doing arithmetic in your head!)

**Arrays** are how we put many values in one place so that we can operate on them as a group. For example, if `billions_of_numbers` is an array of numbers, the expression

    .18 * billions_of_numbers

gives a new array of numbers that's the result of multiplying each number in `billions_of_numbers` by .18 (18%).  Arrays are not limited to numbers; we can also put all the words in a book into an array of strings.
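
To make this concrete on a small scale, here is a minimal sketch with three made-up bill amounts. The name `bills` and its values are hypothetical and are not part of the lab's data.

```python
import numpy as np

# Hypothetical bill amounts, chosen only for illustration.
bills = np.array([10.0, 20.0, 50.0])

# Multiplying the array by .18 multiplies every element by .18.
tips = .18 * bills
print(tips)
```

The result has one tip per bill, in the same order as the original array.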

Concretely, an array is a **collection of values of the same type**, like a column in an Excel spreadsheet. 

![excel_array.jpg](attachment:excel_array.jpg)

## 1.1. Making arrays
You can type in the data that goes in an array yourself, but that's not typically how programs work. Normally, we create arrays by loading them from an external source, like a data file.

First, though, let's learn how to do it the hard way. Execute the following cell so that all the functionality from the `numpy` module is available to you.

In [None]:
import numpy as np

Now, to create an array, call the function `np.array()`. To create an array from scratch, pass it a `list`, which is written using square brackets `[]`. Python lists and arrays are often used interchangeably, but they differ fundamentally in how they handle arithmetic and other operations. For now, we're focusing on arrays. Run this cell to see an example:

In [None]:
np.array([0.125, 4.75, -1.3])

Each value in an array (in the above case, the numbers 0.125, 4.75, and -1.3) is called an *element* of that array.

Arrays themselves are also values, just like numbers and strings.  That means you can assign them names or use them as arguments to functions.

**Question 1.1.1.** Make an array containing the numbers 1, 2, and 3, in that order.  Name it `small_numbers`.

In [None]:
small_numbers = ...
small_numbers

<details><summary><button>Click here to reveal the answer!</button></summary>
<p>
The correct answer is np.array([1, 2, 3]).
</p>
</details>

**Question 1.1.2.** Make an array containing the five strings `"Hello"`, `","`, `" "`, `"world"`, and `"!"`.  (The third one is a single space inside quotes.)  Name it `hello_world_components`.

*Note:* If you print `hello_world_components`, you'll notice some extra information in addition to its contents: `dtype='<U5'`.  That's just NumPy's extremely cryptic way of saying that the things in the array are strings.

In [None]:
hello_world_components = ...
hello_world_components

<details><summary><button>Click here to reveal the answer!</button></summary>
<p>
The correct answer is np.array(['Hello', ',', ' ', 'world', '!']).
</p>
</details>
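
As a small illustration of the `dtype` note above, the sketch below builds a hypothetical array of two strings and inspects its `dtype`. The name `animals` is made up for this example.

```python
import numpy as np

# A hypothetical array of strings, used only for illustration.
animals = np.array(['cat', 'giraffe'])

# dtype.kind is 'U' for arrays of (Unicode) strings.
kind = animals.dtype.kind

# The number after 'U' in the printed dtype is the length of the longest string.
longest = max(len(word) for word in animals)
print(animals.dtype, kind, longest)
```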

The `join` method of a string takes an array of strings as its argument and puts all of the elements together into one string. Try it:

In [None]:
'°'.join(np.array(['(╯', '□','）╯︵ ┻━┻']))

**Question 1.1.3.** Assign `separator` to a string so that the name `hello` is bound to the string `'Hello, world!'` in the cell below.

In [None]:
separator = ...
hello = separator.join(hello_world_components)
hello

<details><summary><button>Click here to reveal the answer!</button></summary>
<p>
The correct answer is `''` (the empty string), which makes `hello` equal to `'Hello, world!'`.
</p>
</details>

### 1.1.1.  `np.arange`

Very often in data science, we want to work with many numbers that are evenly spaced within some range.  NumPy provides a special function for this called `arange`.  `np.arange(start, stop, space)` produces an array with all the numbers starting at `start` and counting up by `space`, stopping before `stop` is reached.

For example, the value of `np.arange(1, 6, 2)` is an array with elements 1, 3, and 5 -- it starts at 1 and counts up by 2, then stops before 6.  In other words, it's equivalent to `np.array([1, 3, 5])`.

`np.arange(4, 9, 1)` is an array with elements 4, 5, 6, 7, and 8.  (It doesn't contain 9 because `np.arange` stops *before* the stop value is reached.)
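
The "stops before `stop`" rule is easy to trip over, so here is a short sketch with small, arbitrary numbers showing how the stop value changes which elements are included.

```python
import numpy as np

# Counting by 5 from 0 stops *before* 10, so 10 is left out.
without_stop = np.arange(0, 10, 5)

# Moving the stop value just past 10 means 10 is included.
with_stop = np.arange(0, 11, 5)

print(without_stop)
print(with_stop)
```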

**Question 1.1.1.1.** Import `numpy` as `np` and then use `np.arange` to create an array with the multiples of 99 from 0 up to (**and including**) 9999.  (So its elements are 0, 99, 198, 297, etc.)

In [None]:
...
multiples_of_99 = ...
multiples_of_99

<details><summary><button>Click here to reveal the answer!</button></summary>
<p>
The correct answer is:
array([   0,   99,  198,  297,  396,  495,  594,  693,  792,  891,  990,
  1089, 1188, 1287, 1386, 1485, 1584, 1683, 1782, 1881, 1980, 2079,
  2178, 2277, 2376, 2475, 2574, 2673, 2772, 2871, 2970, 3069, 3168,
  3267, 3366, 3465, 3564, 3663, 3762, 3861, 3960, 4059, 4158, 4257,
  4356, 4455, 4554, 4653, 4752, 4851, 4950, 5049, 5148, 5247, 5346,
  5445, 5544, 5643, 5742, 5841, 5940, 6039, 6138, 6237, 6336, 6435,
  6534, 6633, 6732, 6831, 6930, 7029, 7128, 7227, 7326, 7425, 7524,
  7623, 7722, 7821, 7920, 8019, 8118, 8217, 8316, 8415, 8514, 8613,
  8712, 8811, 8910, 9009, 9108, 9207, 9306, 9405, 9504, 9603, 9702,
  9801, 9900, 9999]), the code is `np.arange(0,9999+99,99)`
</p>
</details>

##### Temperature readings
NOAA (the US National Oceanic and Atmospheric Administration) operates weather stations that measure surface temperatures at different sites around the United States.  The hourly readings are [publicly available](http://www.ncdc.noaa.gov/qclcd/QCLCD?prior=N).

Suppose we download all the hourly data from the Oakland, California site for the month of December 2015.  To analyze the data, we want to know when each reading was taken, but we find that the data don't include the timestamps of the readings (the time at which each one was taken).

However, we know the first reading was taken at the first instant of December 2015 (midnight on December 1st) and each subsequent reading was taken exactly 1 hour after the last.

**Question 1.1.1.2.** Create an array of the *time, in seconds, since the start of the month* at which each hourly reading was taken.  Name it `collection_times`.

*Hint 1:* There were 31 days in December, which is equivalent to ($31 \times 24$) hours or ($31 \times 24 \times 60 \times 60$) seconds.  So your array should have $31 \times 24$ elements in it.

*Hint 2:* The `len` function works on arrays, too.  If your `collection_times` isn't passing the tests, check its length and make sure it has $31 \times 24$ elements.

In [None]:
collection_times = ...
collection_times

<details><summary><button>Click here to reveal the answer!</button></summary>
<p>
The correct code is `np.arange(0, 31*24*60*60, 60*60)`.
</p>
</details>

## 1.2. Working with single elements of arrays ("indexing")
Let's work with a more interesting dataset.  The next cell creates an array called `population` that includes estimated world populations in every year from **1950** to roughly the present.

Rather than type in the data manually, we've loaded them from a file on your computer called `world_population.csv`.

In [None]:
# Don't worry too much about what goes on in this cell.
import pandas as pd

population = pd.read_csv("world_population.csv")['Population'].values
population

Here's how we get the first element of `population`, which is the world population in the first year in the dataset, 1950.

In [None]:
population[0]

The value of that expression is the number 2557628654 (around 2.5 billion), because that's the first thing in the array `population`.

Notice that we wrote `[0]`, not `[1]`, to get the first element.  This is a weird convention in computer science.  0 is called the *index* of the first item.  It's the number of elements that appear *before* that item.  So 3 is the index of the 4th item.

Here are some more examples.  In the examples, we've given names to the things we get out of `population`.  Read and run each cell.

In [None]:
# The third element in the array is the population
# in 1952.
population_1952 = population[2]
population_1952

In [None]:
# The thirteenth element in the array is the population
# in 1962 (which is 1950 + 12).
population_1962 = population[12]
population_1962

In [None]:
# The 66th element is the population in 2015.
population_2015 = population[65]
population_2015

In [None]:
# The array has only 66 elements, so this doesn't work.
# (There's no element with 66 other elements before it.)
population_2016 = population[66]
population_2016

**Question 1.2.1.** Set `population_1973` to the world population in 1973 by indexing into `population` to get the appropriate element.

In [None]:
population_1973 = ...
population_1973

<details><summary><button>Click here to reveal the answer!</button></summary>
<p>
The correct answer is 3942096442, and the code is `population[23]` (1973 is 23 years after 1950, so 23 elements come before it).
</p>
</details>

## 1.3. Doing something to every element of an array
Arrays are primarily useful for doing the same operation many times, so we don't often have to work with single elements.

##### Logarithms
Here is one simple question we might ask about world population:

> How big was the population in *orders of magnitude* in each year?

The logarithm function is one way of measuring how big a number is. The logarithm (base 10) of a number increases by 1 every time we multiply the number by 10. It's like a measure of how many decimal digits the number has, or how big it is in orders of magnitude.

We could try to answer our question like this, using the `log10` function from the `math` module and the index method you just saw:

In [None]:
import math
population_1950_magnitude = math.log10(population[0])
population_1951_magnitude = math.log10(population[1])
population_1952_magnitude = math.log10(population[2])
population_1953_magnitude = math.log10(population[3])
...

But this is tedious and doesn't really take advantage of the fact that we are using a computer.

Instead, NumPy provides its own version of `log10` that takes the logarithm of each element of an array.  It takes a single array of numbers as its argument.  It returns an array of the same length, where the first element of the result is the logarithm of the first element of the argument, and so on.

**Question 1.3.1.** Use it to compute the logarithms of the world population in every year.  Give the result (an array of 66 numbers) the name `population_magnitudes`.  Your code should be very short.

In [None]:
population_magnitudes = ...
population_magnitudes

<details><summary><button>Click here to reveal the answer!</button></summary>
<p>
The correct code is `np.log10(population)`.
</p>
</details>

![array_logarithm.jpg](attachment:array_logarithm.jpg)

This is called *elementwise* application of the function, since it operates separately on each element of the array it's called on.  The textbook's section on arrays has a useful list of NumPy functions that are designed to work elementwise, like `np.log10`.
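
Here is a minimal sketch of elementwise application on small, made-up arrays. It uses `np.log10` along with `np.sqrt`, another NumPy function that works elementwise.

```python
import numpy as np

# Hypothetical inputs, chosen so the results are easy to check by hand.
powers_of_ten = np.array([10, 100, 1000])
squares = np.array([4, 9, 16])

# Each function is applied to every element separately,
# so the result has the same length as the input.
magnitudes = np.log10(powers_of_ten)
roots = np.sqrt(squares)

print(magnitudes)
print(roots)
```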

##### Arithmetic
Arithmetic also works elementwise on arrays.  For example, you can divide all the population numbers by 1 billion to get numbers in billions:

In [None]:
population_in_billions = population / 1000000000
population_in_billions

You can do the same with addition, subtraction, multiplication, and exponentiation (`**`). For example, you can calculate a tip on several restaurant bills at once (in this case just 3):

In [None]:
restaurant_bills = np.array([20.12, 39.90, 31.01])
print("Restaurant bills:\t", restaurant_bills)
tips = .2 * restaurant_bills
print("Tips:\t\t\t", tips)

![array_multiplication.jpg](attachment:array_multiplication.jpg)

**Question 1.3.2.** Suppose the total charge at a restaurant is the original bill plus the tip.  That means we can multiply the original bill by 1.2 to get the total charge.  Compute the total charge for each bill in `restaurant_bills`.

In [None]:
total_charges = ...
total_charges

<details><summary><button>Click here to reveal the answer!</button></summary>
<p>
The correct answer is np.array([24.144, 47.88 , 37.212]).
</p>
</details>

**Question 1.3.3.** `more_restaurant_bills.csv` contains 100,000 bills!  Compute the total charge for each one.  How is your code different?

In [None]:
more_restaurant_bills = pd.read_csv("more_restaurant_bills.csv")["Bill"].values
more_total_charges = ...
more_total_charges

<details><summary><button>Click here to reveal the answer!</button></summary>
<p>
The correct code is `1.2*more_restaurant_bills`.
</p>
</details>

The function `sum` takes a single array of numbers as its argument.  It returns the sum of all the numbers in that array (so it returns a single number, not an array).
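
As a quick sketch on a made-up array (the name `prices` is hypothetical), `sum` collapses the whole array into one number. NumPy's own `np.sum` does the same job.

```python
import numpy as np

# Hypothetical values, used only for illustration.
prices = np.array([1.5, 2.5, 6.0])

# Both calls add up every element and return a single number.
total = sum(prices)
total_with_numpy = np.sum(prices)

print(total, total_with_numpy)
```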

**Question 1.3.4.** What was the sum of all the bills in `more_restaurant_bills`, *including tips*?

In [None]:
sum_of_bills = ...
sum_of_bills

<details><summary><button>Click here to reveal the answer!</button></summary>
<p>
The correct answer is 1795730.0640000193.
</p>
</details>

**Question 1.3.5.** The powers of 2 ($2^0 = 1$, $2^1 = 2$, $2^2 = 4$, etc.) arise frequently in computer science.  (For example, you may have noticed that storage on smartphones or USB drives comes in powers of 2, like 16 GB, 32 GB, or 64 GB.)  Use `np.arange` and the exponentiation operator `**` to compute the first 30 powers of 2, starting from $2^0$.

In [None]:
powers_of_2 = ...
powers_of_2

<details><summary><button>Click here to reveal the answer!</button></summary>
<p>
The correct code is `2**np.arange(30)`.
</p>
</details>

## 2. Using lists

A *list* is another Python sequence type, similar to an array. It's different from an array because the values it contains can all have different types. A single list can contain `int` values, `float` values, and strings. Elements in a list can even be other lists! A list is written as a sequence of values enclosed in square brackets and separated by commas, and it can be given a name like any other value. For example, `values_with_different_types = ['data', 8, ['lab', 3]]`.
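
The sketch below contrasts lists and arrays on small, made-up values. It illustrates two differences: a list keeps each element's own type while `np.array` converts everything to one common type, and arithmetic operators mean different things for the two.

```python
import numpy as np

# A list keeps each element's original type.
mixed_list = ['data', 8, ['lab', 3]]
types_in_list = [type(item).__name__ for item in mixed_list]

# An array holds a single type, so the number 8 is converted to the string '8'.
as_array = np.array(['data', 8])

# Multiplying a list by 2 repeats it; multiplying an array by 2 doubles each element.
doubled_list = [1, 2, 3] * 2
doubled_array = np.array([1, 2, 3]) * 2

print(types_in_list)
print(as_array)
print(doubled_list)
print(doubled_array)
```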



Lists can be useful when working with tables because they can describe the contents of one row in a table, which often  corresponds to a sequence of values with different types. A list of lists can be used to describe multiple rows.

Each column in a table is a collection of values with the same type (an array). If you create a table column from a list, it will automatically be converted to an array. A row, on the other hand, mixes types.

Here's a table from Chapter 5. (Run the cell below.)

In [None]:
# Run this cell to recreate the table
flowers = pd.DataFrame(
    [[8, 'lotus'], [34, 'sunflower'], [5, 'rose']], # each sublist in this list will be a row in the dataframe
    columns = ['Number of petals', 'Name']
)
flowers

**Question 2.1.** Create a list that describes a new row of this table. The details can be whatever you want, but the list must contain two values: the number of petals (an `int` value) and the name of the flower (a string). How about the "pondweed"? Its flowers have zero petals.

In [None]:
my_flower = ...
my_flower

<details><summary><button>Click here to reveal the answer!</button></summary>
<p>
The code can be, for example, `[0,'pondweed']`.
</p>
</details>

**Observation 2.2.** Suppose we want to add `my_flower` to the `flowers` table, in addition to the 3 flowers in `other_flowers`. We'll do this by first appending `my_flower` to `other_flowers`, then turning `other_flowers` into its own dataframe that we can combine with `flowers`.

In [None]:
other_flowers = [[10, 'lavender'], [3, 'birds of paradise'], [6, 'tulip']]

# append my_flower to other_flowers
other_flowers.append(my_flower)

# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows of both dataframes instead
seven_flowers = pd.concat([flowers, pd.DataFrame(other_flowers, columns=flowers.columns)])
seven_flowers

Notice that we added rows to two different types of data structures here, and the two operations behaved differently. When you call `.append()` on a `list`, the list gets changed *in place*, which means that `other_flowers` has actually been changed. However, when you combine dataframes with `pd.concat`, it creates a *new* dataframe with extra rows; the original dataframe is unchanged. You can verify this below:

In [None]:
# should only have the original 3 rows
flowers
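
The same contrast can be sketched on a tiny, self-contained example. The names `rows`, `first`, `extra`, and `combined` are hypothetical, and the sketch uses `pd.concat`, one standard way to stack dataframes on top of each other.

```python
import pandas as pd

# Appending to a list changes the list itself.
rows = [[8, 'lotus']]
rows.append([5, 'rose'])

# Combining dataframes builds a new dataframe and leaves the inputs alone.
first = pd.DataFrame(rows, columns=['Number of petals', 'Name'])
extra = pd.DataFrame([[0, 'pondweed']], columns=first.columns)
combined = pd.concat([first, extra])

print(len(rows))          # the list grew
print(first.shape[0])     # the original dataframe kept its rows
print(combined.shape[0])  # the new dataframe has the extra row
```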

## 3. DataFrames

For a collection of things in the world, an array is useful for describing a single attribute of each thing. For example, among the collection of US States, an array could describe the land area of each. Tables extend this idea by describing multiple attributes for each element of a collection.

In most data science applications, we have data about many entities, but we also have several kinds of data about each entity.

For example, in the cell below we have two arrays. The first one contains the world population in each year, as estimated by the US Census Bureau, and the second contains the years themselves (in order, so the first elements in the population and the years arrays correspond).

In [None]:
population_amounts = pd.read_csv("world_population_summary.csv")["Population"].values
years = np.arange(1950, 2015+1)
print("Population column:", population_amounts)
print("Years column:", years)

Suppose we want to answer this question:

> When did world population cross 6 billion?

You could technically answer this question just from staring at the arrays, but it's a bit convoluted, since you would have to count the position where the population first crossed 6 billion, then find the corresponding element in the years array. In cases like these, it might be easier to put the data into a *`DataFrame`*, a 2-dimensional type of dataset. 

The expression below shows how we construct a dataframe with two columns. Inside the `pd.DataFrame()` constructor is a `dictionary`, another Python collection, which holds what are known as *key-value pairs*. This way of constructing a dataframe names each column after a key in the dictionary (the element before a `:`) and fills that column with the corresponding value (the element after the `:`). Here, the dictionary keys are strings and the values are arrays; in general, a key needs to be a single item, while a value can be a collection of items.


In [None]:
population = pd.DataFrame({
    'Population': population_amounts,
    'Year': years
})
population

Now the data are all together in a single table! It's much easier to parse this data--if you need to know what the population was in 1959, for example, you can tell from a single glance. We'll revisit this table later.
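
For a closer look at how the dictionary drives the construction, here is a minimal sketch on made-up data. The names `columns`, `Letter`, and `Count` are hypothetical.

```python
import numpy as np
import pandas as pd

# Each key (before the colon) is paired with a value (after the colon).
columns = {
    'Letter': np.array(['a', 'b', 'c']),
    'Count': np.array([3, 1, 2]),
}

# Looking up a key gives back its value.
counts = columns['Count']

# Each key becomes a column name, and each value becomes that column's data.
table = pd.DataFrame(columns)
column_names = list(table.columns)

print(column_names)
print(table)
```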

## 4. Creating Tables

**Question 4.1.** In the cell below, we've created 2 arrays. Using the steps above, assign `top_10_movies` to a table that has two columns called "Rating" and "Name", which hold `top_10_movie_ratings` and `top_10_movie_names` respectively.

In [None]:
top_10_movie_ratings = np.array([9.2, 9.2, 9., 8.9, 8.9, 8.9, 8.9, 8.9, 8.9, 8.8])
top_10_movie_names = np.array([
        'The Shawshank Redemption (1994)',
        'The Godfather (1972)',
        'The Godfather: Part II (1974)',
        'Pulp Fiction (1994)',
        "Schindler's List (1993)",
        'The Lord of the Rings: The Return of the King (2003)',
        '12 Angry Men (1957)',
        'The Dark Knight (2008)',
        'Il buono, il brutto, il cattivo (1966)',
        'The Lord of the Rings: The Fellowship of the Ring (2001)'])

top_10_movies = ...
# We've put this next line here so your table will get printed out when you
# run this cell.
top_10_movies

<details><summary><button>Click here to reveal the answer!</button></summary>
<p>
The code is `pd.DataFrame({'Rating':top_10_movie_ratings, "Name": top_10_movie_names})`
</p>
</details>

#### Loading a table from a file
In most cases, we aren't going to go through the trouble of typing in all the data manually. Instead, we can use pandas to load a table from a file.

`pd.read_csv` takes one required argument, a path to a data file (a string), and returns a table.  There are many formats for data files, but CSV ("comma-separated values") is the most common.
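
To see what `pd.read_csv` does without needing a file on disk, the sketch below feeds it a small, made-up CSV held in a string. Wrapping the text in `io.StringIO` is a stand-in for a real file and is an assumption of this example; in the lab you would pass a filename instead.

```python
import io
import pandas as pd

# Hypothetical CSV contents: a header row, then one row per movie.
csv_text = "Name,Rating\nFirst Movie,9.1\nSecond Movie,8.7\n"

# read_csv treats the first line as column names and the rest as rows.
movies = pd.read_csv(io.StringIO(csv_text))
print(movies)
```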

**Question 4.2.** The file `imdb.csv` contains a table of information about the 250 highest-rated movies on IMDb. Upload `imdb.csv` into the same folder as this notebook (`lab02b.ipynb`) and load it as a table called `imdb`.

In [None]:
imdb = ...
imdb

<details><summary><button>Click here to reveal the answer!</button></summary>
<p>
The code is `pd.read_csv("imdb.csv")`.
</p>
</details>

Notice that there are ellipses in the middle of the table - this table is big enough that only a few of its rows are displayed, but the others are still there. If you want to just look at the first few rows of your dataframe, you can use the method `df.head()`, which, by default, shows the first 5 rows.
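
Here is a minimal sketch of `head` on a made-up dataframe with 8 rows (the name `numbers` is hypothetical). Passing a number asks for that many rows instead of the default 5.

```python
import pandas as pd

# A hypothetical dataframe with 8 rows.
numbers = pd.DataFrame({'n': range(8)})

first_five = numbers.head()    # default: the first 5 rows
first_three = numbers.head(3)  # or ask for a specific number of rows

print(first_five.shape[0], first_three.shape[0])
```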

Open up the `imdb.csv` file in that folder and look at the format. What do you notice? The `.csv` filename ending says that this file is in the [CSV (comma-separated value) format](http://edoceo.com/utilitas/csv-file-format).

## 5. Analyzing datasets
With just a few table methods, we can answer some interesting questions about the IMDb dataset.

If we want just the ratings of the movies, we can get an array that contains the data in that column. In Pandas, there are a few ways to extract a column from a dataframe.

The following two methods are equivalent and return Pandas `Series` (see [here](https://pandas.pydata.org/docs/reference/api/pandas.Series.html)). Note that if you want to extract a column using a `.` the column name can't have any spaces. 

In [None]:
imdb.Rating
imdb['Rating']

If you want to extract the column into an actual NumPy array, you need to add the attribute `.values` to the end of the column extraction (see below).

The primary difference between a Series and an array is how they are indexed. In arrays, the index is always numeric and in order, so `my_array[0]` will always return the first element of the array. In Series, the index follows whatever the index of the table was. By default, table indices are ordinal and numeric, but they don't have to be. You'll see an example below where the indexing matters.



In [None]:
imdb['Rating'].values
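
To illustrate an index that is not numeric, the sketch below builds a made-up Series with string labels. The names and values are hypothetical. The Series is looked up by label, while the array from `.values` is looked up by position.

```python
import pandas as pd

# A hypothetical Series whose index labels are strings rather than 0, 1, 2.
ratings = pd.Series([9.2, 8.9, 9.0], index=['x', 'y', 'z'])

# A Series is indexed by its labels.
by_label = ratings['y']

# The underlying array is always indexed by position, starting at 0.
as_array = ratings.values
first_in_array = as_array[0]

print(by_label, first_in_array)
```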

**Question 5.1.** Find the rating of the highest-rated movie in the dataset.

*Hint:* Think back to the functions you've learned about for working with arrays of numbers.  Ask for help if you can't remember one that's useful for this.

In [None]:
highest_rating = ...
highest_rating

<details><summary><button>Click here to reveal the answer!</button></summary>
<p>
The correct answer is 9.2, and the code is `max(imdb.Rating)`.
</p>
</details>

That's not very useful, though. You'd probably want to know the *name* of the movie whose rating you found!  To do that, we can sort the entire table by rating, which ensures that the ratings and titles will stay together.

In [None]:
imdb.sort_values("Rating")

Well, that actually doesn't help much, either -- we sorted the movies from lowest -> highest ratings.  To look at the highest-rated movies, sort in reverse order:

In [None]:
imdb.sort_values("Rating", ascending=False)

(The `ascending=False` bit is called an *optional argument*. It has a default value of `True`, so when you explicitly tell the function `ascending=False`, then the function will sort in descending order.)

So there are actually 2 highest-rated movies in the dataset: *The Shawshank Redemption* and *The Godfather*.

Some details about sort:

1. The first argument to `sort_values` is the name of a column to sort by.
2. If the column has strings in it, `sort_values` will sort alphabetically; if the column has numbers, it will sort numerically.
3. The value of `imdb.sort_values("Rating")` is a *copy of `imdb`*; the `imdb` table doesn't get modified. For example, if we called `imdb.sort_values("Rating")`, then running `imdb` by itself would still return the unsorted table.
4. Rows always stick together when a table is sorted.  It wouldn't make sense to sort just one column and leave the other columns alone.  For example, in this case, if we sorted just the "Rating" column, the movies would all end up with the wrong ratings.
5. Note that the numbers in bold on the left of the table are no longer in order. These numbers are the dataframe `index` and are associated with the original ordering of the table. This ordering is important to keep in mind when you are accessing elements of the dataframe.
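
The details above can be sketched on a tiny, made-up table (the name `toy` and its contents are hypothetical). The sketch checks that sorting returns a copy, that rows stay together, and that index labels travel with their rows.

```python
import pandas as pd

# A hypothetical three-row table.
toy = pd.DataFrame({'Title': ['A', 'B', 'C'], 'Rating': [8.1, 9.0, 8.5]})

by_rating = toy.sort_values('Rating', ascending=False)

# The sorted copy keeps each row's original index label.
sorted_index = list(by_rating.index)
# The original table is left in its original order.
original_index = list(toy.index)

# Position 0 of the array is the top row of the sorted table...
top_title = by_rating['Title'].values[0]
# ...while the label 0 still refers to the row that was first originally.
title_with_label_0 = by_rating['Title'].loc[0]

print(sorted_index, original_index, top_title, title_with_label_0)
```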

**Question 5.2.** Create a version of `imdb` that's sorted chronologically, with the earliest movies first.  Call it `imdb_by_year`.

In [None]:
imdb_by_year = ...
imdb_by_year

<details><summary><button>Click here to reveal the answer!</button></summary>
<p>
The code is `imdb.sort_values("Year")`
</p>
</details>

**Question 5.3.** What's the title of the earliest movie in the dataset?  You could just look this up from the output of the previous cell.  Instead, write Python code to find out.

*Hint:* Starting with `imdb_by_year`, extract the Title column to get an array, then index in to get the first item.

This is a good time to consider the difference between a Series and an array (i.e. extract the column using `.values` or not) - what happens if you try to index into the Series the same way you index into an array?

In [None]:
earliest_movie_title = ...
earliest_movie_title

<details><summary><button>Click here to reveal the answer!</button></summary>
<p>
The code is `imdb_by_year["Title"].values[0]`.
</p>
</details>

## 6. Finding pieces of a dataset
Suppose you're interested in movies from the 1940s.  Sorting the table by year doesn't help you, because the 1940s are in the middle of the dataset.

Instead, we need to filter the dataframe.

In [None]:
forties = imdb.loc[imdb.Decade == 1940]
forties

Ignore the syntax for the moment.  Instead, try to read that line like this:

> Assign the name **`forties`** to a table whose rows are the rows in the **`imdb`** table where we have **`loc`ated** the **`'Decade'`**s that are equal to **`1940`**.

**Question 6.1.** Compute the average rating of movies from the 1940s.

*Hint:* The function `np.average` computes the average of an array of numbers.

In [None]:
average_rating_in_forties = ...
average_rating_in_forties

<details><summary><button>Click here to reveal the answer!</button></summary>
<p>
The code is `np.average(forties.Rating)`
</p>
</details>

Now let's dive into the details a bit more. When you want to extract a subset of a dataframe, you use `.loc[rows, columns]`

- `rows` describes the criterion that the rows you want to include need to meet
- `columns` is typically just a list of column names you want to extract

The `rows` argument can be a bit confusing. Let's look at it piece by piece using the example above. 
The expression we wrote inside `.loc[]` extracts the `Decade` column and checks whether or not each row is equal to `1940` using the operator `==`. It is important to know the difference between a single `=` and a double `==`; a single `=` makes an *assignment* of a value to a variable (e.g. `today = "Wednesday"`), while the double `==` is a conditional that returns `True` when the things it is comparing are the same and `False` otherwise.

So when we run the following cell (the expression inside `.loc[]` above), we get back a Series where an element is `True` if its corresponding row meets the condition and `False` if not.

In [None]:
imdb.Decade == 1940

This Series is then passed to `.loc[]`, which returns a dataframe with only the rows that evaluated to `True` in the conditional statement.
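
Here is a minimal sketch of the two steps on a made-up table (the name `toy` and its contents are hypothetical): first build the Series of `True`/`False` values, then hand it to `.loc[]`. The last line also passes a list of column names as the `columns` part.

```python
import pandas as pd

# A hypothetical four-row table.
toy = pd.DataFrame({
    'Title': ['A', 'B', 'C', 'D'],
    'Decade': [1940, 1990, 1940, 2000],
})

# Step 1: compare every row's Decade to 1940, giving one True/False per row.
is_forties = toy.Decade == 1940

# Step 2: keep only the rows where the comparison was True.
forties_rows = toy.loc[is_forties]

# Optionally, also choose which columns to keep.
forties_titles = toy.loc[is_forties, ['Title']]

print(is_forties.tolist())
print(forties_rows)
print(forties_titles)
```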

**Question 6.2.** Create a table called `ninety_nine` containing the movies that came out in the year 1999.

In [None]:
ninety_nine = ...
ninety_nine

<details><summary><button>Click here to reveal the answer!</button></summary>
<p>
The code is `imdb.loc[imdb.Year == 1999]`.
</p>
</details>

So far we've only been finding where a column is *exactly* equal to a certain value. However, there are many other conditionals that can be used to filter rows.  Here are a few:

|Predicate|Example|Result|
|-|-|-|
|`==`|`== 50`|Find rows with values equal to 50|
|`!=`|`!= 50`|Find rows with values not equal to 50|
|`>`|`> 50`|Find rows with values above (and not equal to) 50|
|`>=`|`>= 50`|Find rows with values above 50 or equal to 50|
|`<`|`< 50`|Find rows with values below 50|
|`.between(min, max)`|`.between(2, 10)`|Find rows with values above or equal to 2 and below or equal to 10|
|`.isin(list)`|`.isin(['hi', 'hello'])`|Find rows with values equal to 'hi' or 'hello'|

The textbook section on selecting rows has more examples.
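
As a short sketch of a few conditionals from the table, the code below filters a made-up dataframe (the name `toy` and its contents are hypothetical) with `.between`, `.isin`, and `!=`.

```python
import pandas as pd

# A hypothetical table with one text column and one numeric column.
toy = pd.DataFrame({
    'Word': ['hi', 'hello', 'bye', 'hey'],
    'Value': [2, 10, 50, 75],
})

# Values from 2 to 10, including both endpoints.
in_range = toy.loc[toy.Value.between(2, 10)]

# Rows whose Word is one of the listed strings.
greetings = toy.loc[toy.Word.isin(['hi', 'hello'])]

# Rows whose Value is anything other than 50.
not_fifty = toy.loc[toy.Value != 50]

print(in_range)
print(greetings)
print(not_fifty)
```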


**Question 6.3.** Using `.loc[]` and one of the conditionals from the table above, find all the movies with a rating higher than 8.5.  Put their data in a table called `really_highly_rated`.

In [None]:
really_highly_rated = ...
really_highly_rated

<details><summary><button>Click here to reveal the answer!</button></summary>
<p>
The code is `imdb.loc[imdb.Rating>8.5]`.
</p>
</details>

**Question 6.4.** Find the average rating for movies released in the 20th century and the average rating for movies released in the 21st century for the movies in `imdb`.

*Hint*: Think of the steps you need to do (take the average, find the ratings, find movies released in 20th/21st centuries), and try to put them in an order that makes sense.

In [None]:
average_20th_century_rating = ...
average_21st_century_rating = ...
print("Average 20th century rating:", average_20th_century_rating)
print("Average 21st century rating:", average_21st_century_rating)

<details><summary><button>Click here to reveal the answer!</button></summary>
<p>
The code is `np.average(imdb.loc[imdb.Decade < 2000].Rating)` and `np.average(imdb.loc[imdb.Decade >= 2000].Rating)`.
</p>
</details>

The property `shape` tells you how many rows and columns are in a table.  (A "property" is just a method that doesn't need to be called by adding parentheses.) It returns what is known as a `tuple` that contains (num_rows, num_columns). The following will return the number of rows:

In [None]:
num_movies_in_dataset = imdb.shape[0]
num_movies_in_dataset

**Question 6.5.** Use `shape` (and arithmetic) to find the *proportion* of movies in the dataset that were released in the 20th century, and the proportion from the 21st century.

*Hint:* The *proportion* of movies released in the 20th century is the *number* of movies released in the 20th century, divided by the *total number* of movies.

In [None]:
proportion_in_20th_century = ...
proportion_in_21st_century = ...
print("Proportion in 20th century:", proportion_in_20th_century)
print("Proportion in 21st century:", proportion_in_21st_century)

<details><summary><button>Click here to reveal the answer!</button></summary>
<p>
The code is `imdb.loc[imdb.Decade < 2000].shape[0]/num_movies_in_dataset` and `imdb.loc[imdb.Decade >= 2000].shape[0]/num_movies_in_dataset`.
</p>
</details>

**Question 6.6.** Here's a challenge: Find the number of movies that came out in *even* years.

*Hint:* The operator `%` computes the remainder when dividing by a number.  So `5 % 2` is 1 and `6 % 2` is 0.  A number is even if the remainder is 0 when you divide by 2.

*Hint 2:* `%` can be used on arrays, operating elementwise like `+` or `*`.  So `np.array([5, 6, 7]) % 2` is `array([1, 0, 1])`.

*Hint 3:* Create a column called "Year Remainder" that's the remainder when each movie's release year is divided by 2.  Make a copy of `imdb` that includes that column.  Then use `.loc[]` to find rows where that new column is equal to 0.  Then use `shape` to count the number of such rows.

In [None]:
num_even_year_movies = ...
num_even_year_movies

<details><summary><button>Click here to reveal the answer!</button></summary>
<p>
The correct answer is 127, and the code is `imdb.loc[imdb.Year % 2 == 0].shape[0]`
</p>
</details>

**Question 6.7.** Check out the `population` table from the introduction to this lab.  Compute the year when the world population first went above 6 billion.

In [None]:
year_population_crossed_6_billion = ...
year_population_crossed_6_billion

<details><summary><button>Click here to reveal the answer!</button></summary>
<p>
The correct answer is 1999, and the code is `population[population.Population>6000000000].Year.min()`
</p>
</details>

## 7. Miscellanea
There are a few more table methods you'll need to fill out your toolbox.  The first 3 have to do with manipulating the columns in a table.

The file `farmers_markets.csv` contains data on farmers' markets in the United States (data collected [by the USDA](https://apps.ams.usda.gov/FarmersMarketsExport/ExcelExport.aspx)).  Each row represents one such market.

**Question 7.1.** Load the dataset into a table.  Call it `farmers_markets`.

In [None]:
farmers_markets = ...
farmers_markets

<details><summary><button>Click here to reveal the answer!</button></summary>
<p>
The code is `pd.read_csv("farmers_markets.csv")`.
</p>
</details>

You'll notice that it has a large number of columns in it!

**Question 7.2.** Use `shape` to find the number of columns in our farmers' markets dataset.

In [None]:
num_farmers_markets_columns = ...
print("The table has", num_farmers_markets_columns, "columns in it!")

<details><summary><button>Click here to reveal the answer!</button></summary>
<p>
The correct answer is 59, and the code is `farmers_markets.shape[1]`.
</p>
</details>

Most of the columns are about particular products -- whether the market sells tofu, pet food, etc.  If we're not interested in that stuff, it just makes the table difficult to read.  This comes up more than you might think.

In such situations, we can select only the columns we want. There are two ways to do this. Suppose we only want to keep the columns `Year` and `Decade` from the `imdb` table from before. We can pass those columns in as a list and select them like this: `imdb[['Year', 'Decade']]`. This is usually how we will select columns.

Alternatively, we can use `.loc[]` from before and pass the list of column names *second*, after a comma, like this: `imdb.loc[:, ['Year', 'Decade']]`. The colon in the first position tells `.loc[]` to select *all* the rows in the dataframe. This approach comes in handy when we want to filter a dataframe by both rows and columns.
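
Here is a short sketch comparing the two ways of selecting columns. It uses a small made-up table called `toy` (hypothetical, not one of the lab's datasets) so that the results are easy to check by eye.

```python
import pandas as pd

# A tiny made-up table (hypothetical data for illustration only).
toy = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Year": [1994, 2001, 2010],
    "Rating": [8.1, 7.5, 9.0],
})

# Way 1: pass a list of column names inside square brackets.
by_list = toy[["Title", "Year"]]

# Way 2: use .loc with a colon (all rows) and the list of column names.
by_loc = toy.loc[:, ["Title", "Year"]]

print(by_list.equals(by_loc))  # True

# .loc can also filter rows and columns in a single step.
recent = toy.loc[toy.Year >= 2000, ["Title", "Year"]]
print(recent.shape)  # (2, 2)
```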


**Question 7.3.** Create a table with only the name, city, state, latitude ('y'), and longitude ('x') of each market.  Call that new table `farmers_markets_locations`.

In [None]:
farmers_markets_locations = ...
farmers_markets_locations

<details><summary><button>Click here to reveal the answer!</button></summary>
<p>
The code is `farmers_markets[['MarketName','city', 'State', 'y', 'x']]`
</p>
</details>

### `drop`

The `drop` method removes a specified list of columns and keeps all the rest. You use it by calling `.drop(columns=['list', 'of', 'unwanted', 'columns'])`.
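
As a quick sketch of `drop` on a made-up table (the table `toy` and its column names are hypothetical), note that `drop` gives back a new table and leaves the original one unchanged:

```python
import pandas as pd

# A tiny made-up table (hypothetical data for illustration only).
toy = pd.DataFrame({
    "Title": ["A", "B"],
    "Year": [1994, 2001],
    "Notes": ["x", "y"],
})

# Remove the "Notes" column; every other column is kept.
trimmed = toy.drop(columns=["Notes"])

print(list(trimmed.columns))  # ['Title', 'Year']
print(list(toy.columns))      # ['Title', 'Year', 'Notes']
```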

**Question 7.5.** Suppose you just didn't want the "FMID" or "updateTime" columns in `farmers_markets`.  Create a table that's a copy of `farmers_markets` but doesn't include those columns.  Call that table `farmers_markets_without_fmid`.

In [None]:
farmers_markets_without_fmid = ...
farmers_markets_without_fmid

<details><summary><button>Click here to reveal the answer!</button></summary>
<p>
The code is `farmers_markets.drop(columns=['FMID','updateTime'])`.
</p>
</details>

### `.iloc`
Let's find the 5 northernmost farmers' markets in the US.  You already know how to sort by latitude ('y'), but we haven't seen how to get the first 5 rows of that sorted table. 

The table method `.iloc` is similar to `.loc`, except that it indexes into the dataframe by *position* instead of by the original index or column names. If you want to select the first row of a table in its current ordering, you would use `df.iloc[0]`. You can also select *slices* of the table this way; for example, you can select the first 10 rows with `df.iloc[:10]`.
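
The difference between position and index label is easiest to see after sorting. The sketch below uses a small made-up table; the table `toy`, its cities, and its latitudes are hypothetical values chosen for illustration.

```python
import pandas as pd

# A tiny made-up table (hypothetical cities and rough latitudes).
toy = pd.DataFrame({
    "City": ["Austin", "Boston", "Chicago", "Denver"],
    "y": [30.3, 42.4, 41.9, 39.7],
})

# Sorting reorders the rows, but each row keeps its original index label.
sorted_toy = toy.sort_values("y", ascending=False)
print(sorted_toy.index.tolist())  # [1, 2, 3, 0]

# .iloc goes by position in the current ordering...
print(sorted_toy.iloc[0].City)    # Boston
# ...while .loc goes by the original index label.
print(sorted_toy.loc[0].City)     # Austin

# A slice with .iloc takes the first rows of the sorted table.
top_two = sorted_toy.iloc[:2]
print(top_two.City.tolist())      # ['Boston', 'Chicago']
```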

**Question 7.6.** Make a table of the 5 northernmost farmers' markets in `farmers_markets_locations`.  Call it `northern_markets`.  (It should include the same columns as `farmers_markets_locations`.)

In [None]:
northern_markets = ...
northern_markets

<details><summary><button>Click here to reveal the answer!</button></summary>
<p>
The code is `farmers_markets_locations.sort_values('y', ascending=False).iloc[:5]`.
</p>
</details>

**Question 7.7.** Make a table of the farmers' markets in Chicago, Illinois.  (It should include the same columns as `farmers_markets_locations`.)

In [None]:
chicago_markets = ...
chicago_markets

<details><summary><button>Click here to reveal the answer!</button></summary>
<p>
The code is `farmers_markets_locations.loc[farmers_markets_locations.city == 'Chicago']`.
</p>
</details>

Recognize any of them?

Alright! You're finished with lab 2! Some of the Pandas syntax is a bit finicky at first, but you'll get used to it. You can always refer to [this](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf) extremely handy cheat sheet when in doubt. Also, a key part of learning how to code is learning how to look things up online. If you run into an error or can't figure something out, chances are VERY high that somebody else has run into the same error. Google away!


Acknowledgement: The materials for this lab borrow from the [data8](http://data8.org/) course at UC Berkeley.