# Denison CS181/DA210 SW Lab #6 - Step 1

Before you turn this problem in, make sure everything runs as expected. This is a combination of **restarting the kernel** and then **running all cells** (in the menubar, select Kernel$\rightarrow$Restart And Run All).

Make sure you fill in any place that says `# YOUR CODE HERE` or "YOUR ANSWER HERE".

---

In [None]:
import os
import os.path
import pandas as pd

datadir = "publicdata"

---

## Part A: Transformation operation: `pivot`

The pivot operation is the dual to `melt()`.  As we saw in class, `melt` converts several columns to a single stacked column.  The `pivot` operation, on the other hand, converts a stacked column to a series of columns.
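
The duality can be seen in a quick round trip. The sketch below uses a small made-up wide table (the names `wide`, `long`, and `restored` are illustrative, not part of the lab data): `melt` stacks two columns into one, and `pivot` spreads them back out.

```python
import pandas as pd

# Hypothetical wide table: one row per country, one column per indicator
wide = pd.DataFrame({
    "code": ["CAN", "USA"],
    "pop": [36.26, 323.13],
    "gdp": [1535.77, 18624.47],
})

# melt: stack the 'pop' and 'gdp' columns into a single 'value' column,
# recording which column each value came from in 'ind'
long = wide.melt(id_vars="code", var_name="ind", value_name="value")

# pivot: spread the stacked 'value' column back out, one column per 'ind' value
restored = long.pivot(index="code", columns="ind", values="value")

print(long)
print(restored)
```

After the round trip, `code` is the row index rather than an ordinary column, and the indicator columns may appear in a different order than in `wide`.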

### Pivoting with one value column

Consider the following subset of the `indicators0` dataset:

In [None]:
data = { "code": ["CAN", "CAN", "CAN", "USA", "USA", "USA"],
         "ind": ["pop", "gdp", "life", "pop", "gdp", "life"],
         "value": [36.26, 1535.77, 82.30, 323.13, 18624.47, 76.25]}

indicators0_untidy = pd.DataFrame(data)

# Display the DataFrame
indicators0_untidy

As you can see, this dataset is not tidy.  Specifically, it should map `code -> pop, gdp, life`.  However, a given mapping of independent to dependent variables appears across several rows (violating `TidyData2`).  Furthermore, the columns labeled `ind` and `value` contain information about three different variables, rather than having each column represent exactly one variable (violating `TidyData1`).

We can fix this using a `pivot` operation.  We'll need a few pieces of information:

* `index`: a set of columns to serve as the row index (should uniquely identify a mapping of independent to dependent variables)
* `pivot column`: the name of the column that provides the column labels after pivoting (e.g., a column containing variable names)
* `value column(s)`: the names of columns containing values of the variables in the pivot column

In this example, this information is the following:

* `index`: `'code'`
* `pivot column`: `'ind'`, with values `'pop'`, `'gdp'`, and `'life'`
* `value column`: `'value'`

In [None]:
# Display the original DataFrame again
indicators0_untidy

In [None]:
# Pivot the untidy DataFrame to attain separate columns
# for the different indicators
indicators0_pivoted = indicators0_untidy.pivot(
    index = "code",   # index column
    columns = "ind",  # pivot column
    values = "value"  # value column
)

# Display the tidy version of this DataFrame
indicators0_pivoted

Note that `code` is used as the row-label `Index`, and the result has just two rows (one per country code) and three columns (one per indicator).
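
If an ordinary column for `code` is preferred over a row-label `Index`, one possible cleanup is sketched below. The variable names `untidy`, `pivoted`, and `flat` are illustrative only; the data repeats the example above so the block runs on its own.

```python
import pandas as pd

untidy = pd.DataFrame({
    "code": ["CAN", "CAN", "CAN", "USA", "USA", "USA"],
    "ind": ["pop", "gdp", "life", "pop", "gdp", "life"],
    "value": [36.26, 1535.77, 82.30, 323.13, 18624.47, 76.25],
})
pivoted = untidy.pivot(index="code", columns="ind", values="value")

# The column Index keeps the name of the pivot column
print(pivoted.columns.name)

# Move 'code' back into an ordinary column, then clear the leftover
# 'ind' label on the column Index
flat = pivoted.reset_index().rename_axis(None, axis="columns")
print(flat)
```

Whether to keep `code` as the index or as a column depends on what you plan to do next; both forms hold the same data.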

### Pivoting with more than one value column

Note that `pandas` does not enforce that data should be tidy.  For the purpose of illustration, let's consider a variation of the `indicators` dataset in which we'll pivot with more than one `value` column.

In [None]:
data = [["CHN", 2005, 1303.72,  2285.97],
        ["CHN", 2010, 1337.70,  6087.16],
        ["CHN", 2015, 1371.22, 11015.54],
        ["GBR", 2005,   60.40,  2525.01],
        ["GBR", 2010,   62.77,  2452.90],
        ["GBR", 2015,   65.13,  2896.42],
        ["IND", 2005, 1147.61,   820.38],
        ["IND", 2010, 1234.28,  1675.62],
        ["IND", 2015, 1310.15,  2103.59]]
columns = ["code", "year", "pop", "gdp"]

indicators0_mult_pivot_vals = pd.DataFrame(data, columns=columns)

# Display the DataFrame
indicators0_mult_pivot_vals

We can now perform a pivot on this dataset with the following parameters:

* `index`: `'code'`
* `pivot column`: `'year'`, with values `2005`, `2010`, and `2015`
* `value columns`: `'pop'` and `'gdp'`

In [None]:
# Perform the pivot
indicators0_pivoted_multvals = indicators0_mult_pivot_vals.pivot(
    index = "code",  # index column
    columns = "year" # pivot column
)

# Note that value columns are not specified, so all remaining columns are used

# Display the resulting DataFrame
indicators0_pivoted_multvals

A few key notes:
* The result has a multi-level column index after we pivot with multiple `value` columns.
* We don't need to specify the `value` columns, because they are all the columns that are neither `index` nor `pivot` columns.
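
To make the multi-level column index more concrete, here is a minimal sketch of two ways to select from it. It uses a shortened copy of the data above, and the names `sample`, `wide`, `pop_only`, and `gdp_2010` are illustrative.

```python
import pandas as pd

data = [["CHN", 2005, 1303.72, 2285.97],
        ["CHN", 2010, 1337.70, 6087.16],
        ["GBR", 2005,   60.40, 2525.01],
        ["GBR", 2010,   62.77, 2452.90]]
sample = pd.DataFrame(data, columns=["code", "year", "pop", "gdp"])

# No values argument, so both 'pop' and 'gdp' are pivoted
wide = sample.pivot(index="code", columns="year")

# Each column label is now a (value column, year) pair
print(wide.columns.tolist())

# Selecting by the outer label alone gives a sub-frame with one column per year
pop_only = wide["pop"]

# Selecting by a full (outer, inner) tuple gives a single column (a Series)
gdp_2010 = wide[("gdp", 2010)]

print(pop_only)
print(gdp_2010)
```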

Alternatively, we could pivot with the years as rows and the country codes as columns (like the previous example, this is not tidy, but may be a useful display for our data).

In [None]:
# Perform a different pivot
indicators0_pivoted_multvals2 = indicators0_mult_pivot_vals.pivot(
    index = "year",  # index column
    columns = "code" # pivot column
)

# Note that value columns are not specified, so all remaining columns are used

# Display the resulting DataFrame
indicators0_pivoted_multvals2

Our top-level column labels are still `'pop'` and `'gdp'`, but the lower-level column labels have swapped with the row labels from the previous output.

---

## Part B: Try it for yourself: pivot

**Q1:** Make the following into a `pandas` data frame, assigning it to variable `df`.
```
    {'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
     'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
     'baz': [1, 2, 3, 4, 5, 6]}
```

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# Display the DataFrame
df

In [None]:
# Testing cell
assert df.shape == (6,3)
assert list(df.columns) == ["foo", "bar", "baz"]
assert list(df.index) == list(range(6))

**Q2:** Suppose the column `'bar'` should provide the row-label `Index`, the values `'one'` and `'two'` from column `'foo'` should be column labels (so it takes more than one row of `df` to interpret a single observation), and the values themselves should come from the `'baz'` column.

We can obtain a tidy version of this data using a `pivot` operation.  What parameter arguments would be needed for this operation?

Parameters:

* `index`: YOUR ANSWER HERE
* `pivot column`: YOUR ANSWER HERE
* `value column(s)`: YOUR ANSWER HERE

**Q3:** Perform the `pivot` and assign the result to `df2`.  

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# Display the DataFrame
df2

In [None]:
# Testing cell
assert df2.shape == (3,2)
assert "one" in df2.columns
assert "two" in df2.columns
assert df2.loc["A", "one"] == 1
assert df2.loc["C", "two"] == 6

> You've reached the first checkpoint in the lab.  Make sure to have it signed off by the instructor or TA.
>
> Checkpoint 1: How can you tell how many columns and rows will be present after a `pivot` operation, based on the columns `index`, `pivot`, and `value` in the original data frame?

---

**Q4:** Consider the file `restaurants_gender.csv`, which holds data aggregated from another source; each row maps from an id, restaurant, and gender to an average rating.  Relative to this aggregation, the data is tidy as it stands.

Load the data from this file into a `DataFrame` named `rest_gender_df`.  You should inspect the file to determine what to use as a row-label `Index`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# Display the DataFrame
rest_gender_df

In [None]:
# Testing cell
assert rest_gender_df.shape == (4,3)

**Q5:** Pivot the `rest_gender_df` data into a matrix presentation with `restaurant` down one axis (as a row-label `Index`) and `gender` across the other axis (as column label `Index`), a form that might make for good presentation.  Store the result as `rest_gender_mat`.

Note that you can use [`droplevel`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.droplevel.html) to remove the `rating` as the outer level of the column multi-level `Index`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# Display the matrix DataFrame
rest_gender_mat

In [None]:
# Testing cell
assert rest_gender_mat.shape == (2,2)
assert rest_gender_mat.loc["A", "F"] == 82 # requires one-level column Index
assert rest_gender_mat.loc["A", "M"] == 79
assert rest_gender_mat.loc["B", "F"] == 57
assert rest_gender_mat.loc["B", "M"] == 68

---

## Part C: Transformation operation: `pivot_table`

A `pivot()` operation is not able to handle all situations in which we may want to pivot.  For example, we may have multiple rows with the same values of the index and pivot variables, which we should aggregate together into one row of the output.
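
The duplicate-row case can be sketched with a small made-up dataset (the `readings` table and its column names are hypothetical). Here `pivot` raises a `ValueError` because two rows share the same index and pivot values, while `pivot_table` combines them using an aggregation function (the mean, which is also its default).

```python
import pandas as pd

# Hypothetical data: Oslo has two 'winter' readings
readings = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Oslo", "Lima"],
    "season": ["winter", "winter", "summer", "winter"],
    "temp": [-4.0, -2.0, 16.0, 18.0],
})

# pivot cannot decide which of the duplicate rows to keep
pivot_failed = False
try:
    readings.pivot(index="city", columns="season", values="temp")
except ValueError as err:
    pivot_failed = True
    print("pivot failed:", err)

# pivot_table aggregates the duplicates into a single cell
averaged = readings.pivot_table(
    index="city", columns="season", values="temp", aggfunc="mean"
)
print(averaged)
```

Combinations that never occur in the data (here, a summer reading for Lima) show up as `NaN` in the result.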

Another situation may arise when we have more than one column that together form our independent variables.  Let's consider the following untidy variation of the `indicators` dataset:

In [None]:
data = [["IND", 2005, "pop",  1147.61],
        ["IND", 2010, "pop",  1234.28],
        ["IND", 2015, "pop",  1310.15],
        ["USA", 2005, "pop",   295.52],
        ["USA", 2010, "pop",   309.33],
        ["USA", 2015, "pop",   320.74],
        ["IND", 2005, "gdp",   820.38],
        ["IND", 2010, "gdp",  1675.62],
        ["IND", 2015, "gdp",  2103.59],
        ["USA", 2005, "gdp", 13036.64],
        ["USA", 2010, "gdp", 14992.05],
        ["USA", 2015, "gdp", 18219.30]]

columns = ["code", "year", "indicator", "value"]

indicators0_untidy2 = pd.DataFrame(data, columns=columns)

# Display the untidy DataFrame
indicators0_untidy2

We can use the more general `pivot_table` operation to pivot this data using a multi-level row index:

In [None]:
indicators0_tidy2 = indicators0_untidy2.pivot_table(
    index=["code", "year"], # two-level index
    columns="indicator"     # pivot column
)

# Note that the values column(s) are not specified,
# so all remaining columns (`value`) are used as the values

# Display the resulting tidy DataFrame
indicators0_tidy2
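
As a side note, leaving the values out and naming a single value column give differently shaped column labels. The sketch below uses a shortened copy of the data above; the names `untidy`, `implicit`, and `explicit` are illustrative.

```python
import pandas as pd

data = [["IND", 2005, "pop",  1147.61],
        ["IND", 2005, "gdp",   820.38],
        ["USA", 2005, "pop",   295.52],
        ["USA", 2005, "gdp", 13036.64]]
untidy = pd.DataFrame(data, columns=["code", "year", "indicator", "value"])

# Values not specified: columns are two-level, with 'value' as the outer label
implicit = untidy.pivot_table(index=["code", "year"], columns="indicator")

# A single value column named explicitly: columns are one-level
explicit = untidy.pivot_table(
    index=["code", "year"], columns="indicator", values="value"
)

print(implicit.columns.nlevels)
print(explicit.columns.nlevels)
print(explicit)
```

Rows of the two-level row index are addressed with a tuple, for example `explicit.loc[("IND", 2005), "pop"]`.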

> You've reached the second checkpoint in the lab.  Make sure to have it signed off by the instructor or TA.
>
> Checkpoint 2: What would you expect to happen if you used `columns='value'` instead of `columns='indicator'` in the previous example?  Why?

---
## Part D

How much time (in minutes/hours) did you spend on this lab outside of class?

YOUR ANSWER HERE