# Lab 3a: Tables I

Welcome to Lab 3a!  

In this class, we will focus on manipulating tables.  We will import our data sets into tables and complete the analysis using these tables. Tables are described in [Chapter 6](https://inferentialthinking.com/chapters/06/Tables.html) of the Inferential Thinking text. A related approach in Python is the [pandas DataFrame](https://pythonbasics.org/pandas-dataframe/), which we will occasionally need to use. pandas is a mainstay data science tool.

First, set up the tests and imports by running the cell below.

In [None]:
import numpy as np
from datascience import * # Brings into Python the datascience Table object

# These lines load the tests.

from gofer.ok import check

In [None]:
# Enter your name as a string
# Example
dogname = "Fido"
# Your name
name = ...

## 1. Introduction

For a collection of things in the world, an array is useful for describing a single attribute of each thing. For example, among the collection of US States, an array could describe the land area of each. Tables extend this idea by describing multiple attributes for each element of a collection.

In most data science applications, we have data about many entities, but we also have several kinds of data about each entity.

For example, in the cell below we have two arrays. The first one contains the world population in each year (estimated by the US Census Bureau), and the second contains the years themselves. These elements are in order, so the year and the world population for that year have the same index in their corresponding arrays.

In [None]:
population_amounts = Table.read_table("world_population.csv").column("Population")
years = np.arange(1950, 2016, 1)
print("Population column:", population_amounts)
print("Years column:", years)

Suppose we want to answer this question:

> When did world population cross 6 billion?

You could technically answer this question just from staring at the arrays, but it's a bit convoluted, since you would have to count the position where the population first crossed 6 billion, then find the corresponding element in the years array. In cases like these, it might be easier to put the data into a *`Table`*, a 2-dimensional type of dataset. 
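The manual lookup described above can also be written as code. The following is a minimal sketch using made-up numbers: the names `toy_years` and `toy_amounts` and their values are hypothetical, not the Census data. It finds the position where the amounts first exceed a threshold, then reads the year at that same position.

```python
import numpy as np

# Hypothetical toy data for illustration only; these are not the Census figures.
toy_years = np.arange(2000, 2005)  # 2000, 2001, 2002, 2003, 2004
toy_amounts = np.array([5.7e9, 5.8e9, 5.9e9, 6.05e9, 6.1e9])

# Boolean array: True wherever the amount exceeds 6 billion.
above_threshold = toy_amounts > 6e9

# np.argmax returns the position of the first True value
# (this assumes at least one value is above the threshold).
first_index = int(np.argmax(above_threshold))

# The two arrays are aligned, so the same index picks out the matching year.
crossing_year = toy_years.item(first_index)
print(crossing_year)
```

This works, but it relies on the two arrays staying aligned, which is the bookkeeping a table handles for you.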

The expression below:

- creates an empty table using the expression `Table()`,
- adds two columns by calling `with_columns` with four arguments,
- assigns the result to the name `population`, and finally
- evaluates `population` so that we can see the table.

The strings `"Year"` and `"Population"` are column labels that we have chosen. The names `population_amounts` and `years` were assigned above to two arrays of the same length. The function `with_columns` (you can find the documentation [here](http://data8.org/datascience/tables.html)) takes alternating strings (the column labels) and arrays (the data in those columns), all separated by commas. Tip: `population_amounts` and `years` must contain the same number of values, or constructing the table will raise an error.

In [None]:
population = Table().with_columns(
    "Population", population_amounts,
    "Year", years
)
population

Now the data are all together in a single table! It's much easier to parse this data--if you need to know what the population was in 1959, for example, you can tell from a single glance. We'll revisit this table later.

**Question 1** <br/>
From the example in the cell above, identify the variables or data types for each of the following: Which variable contains the table? Which variable contains an array? What is the data type of the column labels?

In [None]:
table_var = ...
array_var = ...
col_type  = ...

In [None]:
check('tests/q1.py')

## 2. Creating Tables

**Question 2** <br/> In the cell below, we've created 2 arrays. In these examples, we're going to be looking at the Environmental Performance Index (EPI), which describes the state of sustainability in each country. More information can be found at [Yale EPI](https://epi.yale.edu/). Using the steps above, assign `top_10_epi` to a table that has two columns called "Score" and "Country", which hold `top_10_epi_scores` and `top_10_epi_countries` respectively.

In [None]:
top_10_epi_scores = make_array(82.5, 82.3, 81.5, 81.3, 80., 79.6, 78.9, 78.7, 77.7, 77.2)
top_10_epi_countries = make_array(
        'Denmark',
        'Luxembourg', 
        'Switzerland', 
        'United Kingdom', 
        'France', 
        'Austria', 
        'Finland', 
        'Sweden', 
        'Norway',
        'Germany'
        )

top_10_epi = ...
# We've put this next line here so your table will get printed out when you
# run this cell.
top_10_epi

In [None]:
check('tests/q2.py')

#### Loading a table from a file
In most cases, we aren't going to go through the trouble of typing in all the data manually. Instead, we can load the data from a file using our `Table` functions.

`Table.read_table` takes one argument, a path to a data file (a string), and returns a table.  There are many formats for data files, but CSV ("comma-separated values") is the most common.

**Question 3** <br/>The file `yale_epi.csv` in the current directory contains a table of information about 180 countries with their corresponding Environmental Performance Index (EPI) based on 32 indicators of sustainability.  Load it as a table called `epi` using the `Table.read_table` function.

In [None]:
epi = ...
epi 

In [None]:
check('tests/q3.py')

Notice the part about "... (170 rows omitted)."  This table is big enough that only a few of its rows are displayed, but the others are still there.  10 are shown, so there are 180 countries total.

Where did `yale_epi.csv` come from? Take a look at [this lab's folder](./). You should see a file called `yale_epi.csv`.

Open up the `yale_epi.csv` file in that folder and look at the format. What do you notice? The `.csv` filename ending says that this file is in the [CSV (comma-separated value) format](http://edoceo.com/utilitas/csv-file-format).
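To make the format concrete, here is a small sketch that parses CSV text with Python's standard `csv` module. The text and the country names in it are invented for illustration; the real `yale_epi.csv` has different columns and values.

```python
import csv
import io

# Hypothetical file contents: one header line, then one line per row.
csv_text = "Country,Score\nAtlantis,50.5\nEl Dorado,42.0\n"

# csv.reader splits each line on commas and yields one list of strings per line.
rows = list(csv.reader(io.StringIO(csv_text)))

header = rows[0]     # the column labels
records = rows[1:]   # the data rows; every value is still a string here
print(header)
print(records)
```

Each line of the file becomes one row, and the commas mark where one column ends and the next begins.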

## 3. Using lists

A *list* is another Python sequence type, similar to an array. It's different from an array because the values it contains can all have different types. A single list can contain `int` values, `float` values, and strings. Elements in a list can even be other lists! A list is written as values enclosed in square brackets and separated by commas, and it can be assigned to a name. For example, `values_with_different_types = ['data', 8, 8.1]`.

Lists can be useful when working with tables because they can describe the contents of one row in a table, which often  corresponds to a sequence of values with different types. A list of lists can be used to describe multiple rows.

Each column in a table is a collection of values with the same type (an array). If you create a table column from a list, it will automatically be converted to an array. A row, on the other hand, mixes types.
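The row/column distinction can be sketched without a table at all. In the example below, `np.array` stands in for the conversion a table performs on a list; treating that as the mechanism is an assumption made for illustration, and the names `toy_row` and `toy_column` are hypothetical.

```python
import numpy as np

# A row mixes types: an int and a string sit side by side in one list.
toy_row = [8, 'lotus']
row_types = [type(value).__name__ for value in toy_row]

# A column holds one type; converting a list of ints gives an integer array.
toy_column = np.array([8, 34, 5])

print(row_types)              # the type names of the row's values
print(toy_column.dtype.kind)  # 'i' means an integer array
```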

Here's a table from Chapter 5. (Run the cell below.)

In [None]:
# Run this cell to recreate the table
flowers = Table().with_columns(
    'Number of petals', make_array(8, 34, 5),
    'Name', make_array('lotus', 'sunflower', 'rose')
)
flowers

**Question 4** <br/>Create a list that describes a new fourth row of this table. The details can be whatever you want, but the list must contain two values: the number of petals (an `int` value) and the name of the flower (a string). For example, your flower could be "pondweed" (a flower with zero petals)!

In [None]:
my_flower = ...
my_flower

In [None]:
check('tests/q4.py')

**Question 5** <br/>`my_flower` fits right into the table from Chapter 5. Complete the cell below to create a table of seven flowers that includes your flower as the fourth row followed by `other_flowers`. You can use `with_row` to create a new table with one extra row by passing a list of values and `with_rows` to create a table with multiple extra rows by passing a list of lists of values.

In [None]:
# Use the method .with_row(...) to create a new table that includes my_flower 

four_flowers = ...

# Use the method .with_rows(...) to create a table that 
# includes four_flowers followed by other_flowers

other_flowers = [[10, 'lavender'], [3, 'birds of paradise'], [6, 'tulip']]

seven_flowers = ...
seven_flowers

In [None]:
check('tests/q5.py')

## 4. Analyzing datasets
With just a few table methods, we can answer some interesting questions about the EPI dataset.

If we want just the scores of each country, we can get an array that contains the data in that column:

In [None]:
epi.column("Score")

The value of that expression is an array, exactly the same kind of thing you'd get if you typed in `make_array(25.5, 29.7, 49.0, [etc])`.

**Question 6** <br/>Find the EPI score of the highest-ranked country in the dataset.

*Hint:* Think back to the functions you've learned about for working with arrays of numbers.  Ask for help if you can't remember one that's useful for this.

In [None]:
highest_rating = ...
highest_rating

In [None]:
check('tests/q6.py')

That's not very useful, though.  You'd probably want to know the *name* of the country whose score you found!  To do that, we can sort the entire table by EPI Score, which ensures that the scores and country will stay together. Note that calling sort creates a copy of the table and leaves the original table unsorted.

In [None]:
epi.sort("Score")

Well, that actually doesn't help much, either -- we sorted the countries from lowest -> highest scores.  To look at the highest-ranked countries, sort in reverse order:

In [None]:
epi.sort("Score", descending=True)

(The `descending=True` bit is called an *optional argument*. It has a default value of `False`, so when you explicitly tell the function `descending=True`, then the function will sort in descending order.)
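Optional arguments are a general Python feature, not something special to tables. Here is a minimal sketch with a hypothetical helper, `sorted_scores`, which is not how `sort` is implemented; it only shows how a default value behaves.

```python
# Hypothetical helper for illustration; this is not the Table.sort implementation.
def sorted_scores(scores, descending=False):
    """Return a new sorted list; ascending unless descending=True."""
    return sorted(scores, reverse=descending)

toy_scores = [49.0, 82.5, 25.5]

# Leaving the argument out uses the default, descending=False.
ascending = sorted_scores(toy_scores)

# Naming the argument overrides the default.
highest_first = sorted_scores(toy_scores, descending=True)

print(ascending)
print(highest_first)
```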

So the country with the highest Environmental Performance Index is Denmark, with a score of 82.5.

Some details about sort:

1. The first argument to `sort` is the name of a column to sort by.
2. If the column has strings in it, `sort` will sort alphabetically; if the column has numbers, it will sort numerically.
3. The value of `epi.sort("Score")` is a *copy of `epi`*; the `epi` table doesn't get modified. For example, if we called `epi.sort("Score")`, then running `epi` by itself would still return the unsorted table.
4. Rows always stick together when a table is sorted.  It wouldn't make sense to sort just one column and leave the other columns alone.  For example, in this case, if we sorted just the "Score" column, the countries would all end up with the wrong scores.
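Points 3 and 4 can be sketched with plain arrays. This is only an illustration of the idea, not a description of how `sort` works internally; the names and values are hypothetical.

```python
import numpy as np

# Hypothetical toy data: position i in each array describes the same entity.
toy_names = np.array(['A', 'B', 'C'])
toy_scores = np.array([49.0, 82.5, 25.5])

# argsort gives the row order that would sort the scores from low to high.
order = np.argsort(toy_scores)

# Applying the same order to both arrays keeps each name with its score.
sorted_scores = toy_scores[order]
sorted_names = toy_names[order]

print(sorted_names, sorted_scores)
print(toy_scores)  # the original array is unchanged
```

Sorting only `toy_scores` and leaving `toy_names` alone would pair each name with the wrong score, which is why a table moves whole rows.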

**Question 7** <br/>  We also have information about the changes in sustainability scores from 2010 to 2020.  Create a version of `epi` that's sorted by change, with the largest positive changes first.  Call it `epi_changes`.

In [None]:
epi_changes = ...
epi_changes

In [None]:
check('tests/q7.py')

**Question 8** <br/>What's the name of the country with the largest positive change in the dataset?  You could just look this up from the output of the previous cell.  Instead, write Python code to find out.

*Hint:* Starting with `epi_changes`, extract the country column to get an array, then use `item` to get its first item.

In [None]:
largest_positive_change = ...
largest_positive_change

In [None]:
check('tests/q8.py')

## Summary

For your reference, here's a table of all the functions and methods we saw in this lab.

|Name|Example|Purpose|
|-|-|-|
|`Table`|`Table()`|Create an empty table, usually to extend with data|
|`Table.read_table`|`Table.read_table("my_data.csv")`|Create a table from a data file|
|`with_columns`|`tbl = Table().with_columns("N", np.arange(5), "2*N", np.arange(0, 10, 2))`|Create a copy of a table with more columns|
|`column`|`tbl.column("N")`|Create an array containing the elements of a column|
|`sort`|`tbl.sort("N")`|Create a copy of a table sorted by the values in a column|

<br/>

Congratulations, you're done with Lab 3a!  Be sure to:
- **Run all the tests and verify that they all pass** (the next cell has a shortcut for that).
- **Save and Checkpoint** from the `File` menu.
- **Hit the Submit button.** Your submission will be saved, and your grade will be posted when it has finished running.

In [None]:
# For your convenience, you can run this cell to run all the tests at once!
from gofer.ok import check
correct = 0
for x in range(1, 9):
    print('Testing question {}: '.format(str(x)))
    g = check('tests/q{}.py'.format(str(x)))
    if g.grade == 1.0:
        print("Passed")
        correct += 1
    else:
        print('Failed')
        display(g)

print('Grade:  {}'.format(str(correct/8)))