# Worksheet 3: Cleaning and Wrangling Data


### Lecture and Tutorial Learning Goals:

After completing this week's lecture and tutorial work, you will be able to:

* define the term "tidy data"
* explain when chaining is appropriate and demonstrate chaining over multiple lines and verbs
* discuss the advantages and disadvantages of storing data in a tidy data format
* recall and use the following functions and methods for their intended data wrangling tasks:
    - Use `loc[]` to select rows or columns.
    - Use `[]` to filter rows of a data frame.
    - Create new columns in a data frame using the `assign` method.
    - Use `groupby` to calculate summary statistics on grouped objects 
    - Use `melt` and `pivot` to reshape data frames, specifically to make tidy data.
    
This worksheet covers parts of [Chapter 3](https://python.datasciencebook.ca/wrangling.html) of the online textbook. You should read this chapter before attempting the worksheet.
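As a quick warm-up, the methods listed above can be sketched on a small made-up table. The data frame `wide`, its column names, and its numbers are hypothetical and are not part of the worksheet data; the sketch only shows what each call does and how calls can be chained over several lines.

```python
import pandas as pd

# Hypothetical wide table with one column per year (made-up numbers).
wide = pd.DataFrame({
    "city": ["A", "B"],
    "2020": [10, 20],
    "2021": [11, 21],
})

# melt: gather the year columns into (year, population) pairs, one pair per row.
long = wide.melt(id_vars="city", var_name="year", value_name="population")

# Chaining: filter rows with [], select columns with loc[],
# then add a column with assign, all in one expression.
result = (
    long[long["population"] > 10]
    .loc[:, ["city", "population"]]
    .assign(population_thousands=lambda df: df["population"] / 1000)
)

# pivot: spread the (year, population) pairs back into one column per year.
wide_again = long.pivot(index="city", columns="year", values="population").reset_index()
```

Wrapping the chain in parentheses lets each method sit on its own line, which tends to be easier to read than one long line.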

In [None]:
### Run this cell before continuing.
import altair as alt
import pandas as pd

# Simplify working with large datasets in Altair
alt.data_transformers.disable_max_rows()

**Question 0.0** Multiple Choice: 
<br> {points: 1}

Which of the following characterize a tidy dataset? Note: there may be more than one correct answer to this question.

A) Each row is a single variable

B) There are no missing or erroneous values

C) Each value is a single cell

D) Each variable is a single column

*Assign your answer to an object called `answer0_0` in the code chunk below. Make sure your answer contains uppercase letters surrounded by quotation marks, inside square brackets. If there is more than one answer to this question, separate each letter with a comma within the square brackets. For example, if you believe the answer is A, B and C, your answer would look like this:
`answer0_0 = ['A', 'B', 'C']`*



In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer0_0)).encode("utf-8")+b"d2cc037e7dd3bbbe").hexdigest() == "7af8803158676e6dca21b1a695e3ebbc3d39d0bd", "type of answer0_0 is not list. answer0_0 should be a list"
assert sha1(str(len(answer0_0)).encode("utf-8")+b"d2cc037e7dd3bbbe").hexdigest() == "01c9e3a1a82db29978769d988555abdf9a8f87db", "length of answer0_0 is not correct"
assert sha1(str(sorted(map(str, answer0_0))).encode("utf-8")+b"d2cc037e7dd3bbbe").hexdigest() == "50705b4fb2b510902a34f7d6f40e95438a30cd08", "values of answer0_0 are not correct"
assert sha1(str(answer0_0).encode("utf-8")+b"d2cc037e7dd3bbbe").hexdigest() == "50705b4fb2b510902a34f7d6f40e95438a30cd08", "order of elements of answer0_0 is not correct"

print('Success!')

**Question 0.1** Multiple Choice: 
<br> {points: 1}

The data below are ratings given to 3 wines by 5 different wine tasters. We are interested in seeing whether taster or wine type influences the rating. Given that motivation, which arrangement of the data set shown below is "tidy"?

##### Data set 1:

|     Taster       | Chardonnay | Pinot Grigio | Pinot Blanc |
|------------|------------|--------------|-----------------|
| 001 | 75         | 89           | 92              |
| 002 | 89         | 88           | 89              |
| 003 | 72         | 90           | 95              |
| 004 | 85         | 81           | 90              |
| 005 | 83         | 89           | 88              |

##### Data set 2:

|   Wine | Taster 001 | Taster 002 | Taster 003 | Taster 004 | Taster 005 |
|------------|------------|--------------|-----------------|-------|---------|
| Chardonnay | 75         | 89           | 72              | 85 | 83|
| Pinot Grigio | 89         | 88           | 90             | 81 | 89 |
| Pinot Blanc | 92         | 89           | 95              | 90 | 88 |

##### Data set 3:

| Taster           | Wine | Rating | 
|------------|------------|----|
| 001 |  Chardonnay |  75         |
| 002 |  Chardonnay | 89         | 
| 003 |  Chardonnay |72         | 
| 004 |  Chardonnay |85         | 
| 005 | Chardonnay | 83         | 
| 001 |  Pinot Grigio | 89         |
| 002 |  Pinot Grigio | 88         | 
| 003 |  Pinot Grigio | 90         | 
| 004 |  Pinot Grigio | 81         |
| 005 |  Pinot Grigio | 90         |
| 001 |  Pinot Blanc | 92         |
| 002 | Pinot Blanc | 89         |
| 003 | Pinot Blanc | 95         | 
| 004 | Pinot Blanc | 90         |
| 005 | Pinot Blanc | 88         | 

##### Data set 4:
| Taster    | Chardonnay Rating | 
|------------|------------|
| 001 |  75         | 
| 002 |   89         | 
| 003 |  72         |
| 004 | 85         | 
| 005 | 83         |

| Taster           | Pinot Grigio Rating | 
|------------|------------|
| 001 |   89         |
| 002 |  88         |
| 003 |  90         | 
| 004 | 81         | 
| 005 |  90         | 

| Taster           | Pinot Blanc Rating | 
|------------|------------|
| 001 |   92         | 
| 002 |  89         |
| 003 |  95         | 
| 004 |  90         | 
| 005 |  88         | 


*Assign your answer to an object called `answer0_1`. Make sure your answer is surrounded by square brackets. If there is more than one answer to this question, separate each number with a comma within the square brackets. For example, if you believe the answer is 1, 2 and 3, your answer would look like this: `answer0_1 = [1, 2, 3]`*

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer0_1)).encode("utf-8")+b"c40eafd446afc959").hexdigest() == "e65d9f04ff40c7de4cac4e1ca02ae3df2d8b00c7", "type of answer0_1 is not list. answer0_1 should be a list"
assert sha1(str(len(answer0_1)).encode("utf-8")+b"c40eafd446afc959").hexdigest() == "b4701059ec06be9a13ee7dce8fd7c57cda1b9574", "length of answer0_1 is not correct"
assert sha1(str(sorted(map(str, answer0_1))).encode("utf-8")+b"c40eafd446afc959").hexdigest() == "0bb92bb3f823fe732af88fa95f212d7d7f5663ae", "values of answer0_1 are not correct"
assert sha1(str(answer0_1).encode("utf-8")+b"c40eafd446afc959").hexdigest() == "00b126059cee58b65e5da0aaba9689df89e252da", "order of elements of answer0_1 is not correct"

print('Success!')

**Question 0.2** Multiple Choice: 
<br> {points: 1}

To answer the question, assign the letter associated with the correct answer to a variable in the code cell below:

Why is the primary goal of data wrangling getting dataframes into the tidy data format?

A) Having data expressed in this way makes it easier to read and more aesthetically pleasing.

B) Tidy format uses less storage space on your computer.

C) Many or most modern Data Science tools accept the tidy data format directly (or very close to that) and we need to get the data in a state ready for analysis.

*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between quotation marks (`""`) and assign it to an object called `answer0_2`.*

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer0_2)).encode("utf-8")+b"4c3ae040a3075d5c").hexdigest() == "40075b341a20f6d4ba15029119bc1cb0ea061482", "type of answer0_2 is not str. answer0_2 should be an str"
assert sha1(str(len(answer0_2)).encode("utf-8")+b"4c3ae040a3075d5c").hexdigest() == "b9e540a7dc8d2c8ddad325a949dfa99744ac46b6", "length of answer0_2 is not correct"
assert sha1(str(answer0_2.lower()).encode("utf-8")+b"4c3ae040a3075d5c").hexdigest() == "9ba2f2f179c16d4d8b1fdf699dd1b0f995a081ac", "value of answer0_2 is not correct"
assert sha1(str(answer0_2).encode("utf-8")+b"4c3ae040a3075d5c").hexdigest() == "00b0ffea8177a6c21b1b927213a731f3221c2fdc", "correct string value of answer0_2 but incorrect case of letters"

print('Success!')

**Question 0.3** Multiple Choice: 
<br> {points: 1}

For which scenario would using `groupby` + `mean` be appropriate?

A. To apply the same function to every row. 

B. To apply the same function to every column.

C. To apply the same function to groups of rows. 

D. To apply the same function to groups of columns.

*Assign your answer to an object called `answer0_3`.  Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer0_3)).encode("utf-8")+b"04f08a1c90ba0510").hexdigest() == "92fb1afb5c5eb0fff1b059e4de462b615378e308", "type of answer0_3 is not str. answer0_3 should be an str"
assert sha1(str(len(answer0_3)).encode("utf-8")+b"04f08a1c90ba0510").hexdigest() == "6b95776a3eee1cc7b5eaaabb14de6a345ed41a9e", "length of answer0_3 is not correct"
assert sha1(str(answer0_3.lower()).encode("utf-8")+b"04f08a1c90ba0510").hexdigest() == "c187400e0fa1ed4e6f30ceb0077dd82df97d3c9e", "value of answer0_3 is not correct"
assert sha1(str(answer0_3).encode("utf-8")+b"04f08a1c90ba0510").hexdigest() == "47e09412bd76e6a4a7a5e6f1f06110fb4f6d3546", "correct string value of answer0_3 but incorrect case of letters"

print('Success!')

## 1. Assessing avocado prices to inform restaurant menu planning

It is well known that millennials LOVE avocado toast (joking...well mostly 😉), and so many restaurants offer menu items that centre around this delicious food! Like many food items, avocado prices fluctuate. So a restaurant that wants to maximize profits on avocado-containing dishes might ask whether there are times when avocados are less expensive to purchase. If such times exist, this is when the restaurant should put avocado-containing dishes on the menu to maximize its profits for those dishes.

<img align="left" src="https://www.averiecooks.com/wp-content/uploads/2017/07/egghole-2.jpg" width="150" />

*Source: https://www.averiecooks.com/egg-hole-avocado-toast/*

To answer this question we will analyze a data set of avocado sales from multiple US markets. This data was downloaded from the [Hass Avocado Board website](http://www.hassavocadoboard.com/) in May of 2018 and compiled into a single CSV. Each row in the data set contains weekly sales data for a region. The data set spans the years 2015-2018.

Some relevant columns in the dataset:

- `Date` - The date in year-month-day format
- `average_price` - The average price of a single avocado
- `type` - conventional or organic
- `yr` - The year
- `region` - The city or region of the observation
- `small_hass_volume` - in pounds (lbs)
- `large_hass_volume` - in pounds (lbs)
- `extra_l_hass_volume` - in pounds (lbs)
- `wk` - integer number for the calendar week in the year (e.g., first week of January is 1, and last week of December is 52).

To answer our question of whether there are times in the year when avocados are typically less expensive (and thus we can make more profitable menu items with them at a restaurant) we will want to create a scatter plot of `average_price` (y-axis) versus `Date` (x-axis).

**Question 1.1** Multiple Choice:
<br> {points: 1}

Which of the following is not included in the `csv` file?

A. Average price of a single avocado.

B. The farming practice (production with/without the use of chemicals). 

C. Average price of a bag of avocados.

D. All options are included in the data set.

*Assign your answer to an object called `answer1_1`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).* 

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer1_1)).encode("utf-8")+b"c1dc71024972b692").hexdigest() == "19d76087a8f1dc29dcc3b6d78444b6b639680782", "type of answer1_1 is not str. answer1_1 should be an str"
assert sha1(str(len(answer1_1)).encode("utf-8")+b"c1dc71024972b692").hexdigest() == "15c69d9da0c8f6f3976315bea4e50eaba4e85d06", "length of answer1_1 is not correct"
assert sha1(str(answer1_1.lower()).encode("utf-8")+b"c1dc71024972b692").hexdigest() == "4d189c476dfa8b729664815c3c246f11aceb3c1c", "value of answer1_1 is not correct"
assert sha1(str(answer1_1).encode("utf-8")+b"c1dc71024972b692").hexdigest() == "0c37c64d3513f3194ce18ed091b31913726c4cc5", "correct string value of answer1_1 but incorrect case of letters"

print('Success!')

**Question 1.2** Multiple Choice:
<br> {points: 1}

The rows in the data frame represent:

A. daily avocado sales data for a region

B. weekly avocado sales data for a region

C. bi-weekly avocado sales data for a region

D. yearly avocado sales data for a region

*Assign your answer to an object called `answer1_2`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer1_2)).encode("utf-8")+b"74e862414527b7fe").hexdigest() == "558750468bdc989e289b16e529e591b665d121f7", "type of answer1_2 is not str. answer1_2 should be an str"
assert sha1(str(len(answer1_2)).encode("utf-8")+b"74e862414527b7fe").hexdigest() == "c60ed98ac63eba9ff4e97ba58220d23b151642e9", "length of answer1_2 is not correct"
assert sha1(str(answer1_2.lower()).encode("utf-8")+b"74e862414527b7fe").hexdigest() == "f48a247820bb73b05098fa830a31d52717e06c39", "value of answer1_2 is not correct"
assert sha1(str(answer1_2).encode("utf-8")+b"74e862414527b7fe").hexdigest() == "174726519bd2f5b0197bd4fa6fa791deabe2ad74", "correct string value of answer1_2 but incorrect case of letters"

print('Success!')

**Question 1.3** 
<br> {points: 1}

The first step to plotting average price against date is to read the file `avocado_prices.csv` using the shortest relative path. The data file was given to you along with this worksheet, but you will have to look in the `data` directory to find where it is and load it correctly. Before loading it, preview the file to help you choose an appropriate `.read_*` function to read the data.
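To illustrate why previewing matters, here is a minimal sketch on made-up text rather than the worksheet file. The contents of `raw_text` and the semicolon delimiter are assumptions chosen for illustration; the real file may be laid out differently, so check it yourself.

```python
import io
import pandas as pd

# Made-up file contents standing in for a file on disk (hypothetical data).
raw_text = "wk;price\n1;1.20\n2;1.35\n"

# Preview the first couple of lines to see the header and the delimiter.
preview = raw_text.splitlines()[:2]
print(preview)

# The preview shows semicolons between values, so we pass sep=";".
toy = pd.read_csv(io.StringIO(raw_text), sep=";")
```

For a file on disk you would pass a relative path string instead of the `io.StringIO` object.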

*Assign your answer to an object called `avocado`.* 

In [None]:
# ___ = ___("___")

# your code here
raise NotImplementedError
avocado

In [None]:
from hashlib import sha1
assert sha1(str(type(avocado is None)).encode("utf-8")+b"2451926ef15be7c7").hexdigest() == "eb39bbd84dfb70adc363967a8d71d0f46eedb9bd", "type of avocado is None is not bool. avocado is None should be a bool"
assert sha1(str(avocado is None).encode("utf-8")+b"2451926ef15be7c7").hexdigest() == "a39b879abbd9d6576936fe3f756117d7c93007c1", "boolean value of avocado is None is not correct"

assert sha1(str(type(avocado)).encode("utf-8")+b"2efc37282c704ad3").hexdigest() == "b8205bba640659505695bdb18bc9d5ca2275ab12", "type of type(avocado) is not correct"

assert sha1(str(type(avocado.shape)).encode("utf-8")+b"c4d6a4a5ec883943").hexdigest() == "10a04515407d5be3b21f9462c7fe969c80a4217b", "type of avocado.shape is not tuple. avocado.shape should be a tuple"
assert sha1(str(len(avocado.shape)).encode("utf-8")+b"c4d6a4a5ec883943").hexdigest() == "f97ed101967fa3d24785bd536e651ebd36f9ffb5", "length of avocado.shape is not correct"
assert sha1(str(sorted(map(str, avocado.shape))).encode("utf-8")+b"c4d6a4a5ec883943").hexdigest() == "94b23992e79c9b12cca9e359679167a0717aee55", "values of avocado.shape are not correct"
assert sha1(str(avocado.shape).encode("utf-8")+b"c4d6a4a5ec883943").hexdigest() == "5ec881f4c546f9796b0eb1693fc1bc0f3e13c009", "order of elements of avocado.shape is not correct"

assert sha1(str(type(avocado.columns.values)).encode("utf-8")+b"4d14cacec85bb180").hexdigest() == "594dc9c56d058ecddfae7962b9a4a080265b07bb", "type of avocado.columns.values is not correct"
assert sha1(str(avocado.columns.values).encode("utf-8")+b"4d14cacec85bb180").hexdigest() == "14cfe7538ac8cca0bc3467af40c576fc5837c873", "value of avocado.columns.values is not correct"

print('Success!')

**Question 1.4**
<br> {points: 1}

To answer our question, let's now create the scatter plot where we plot `average_price` on the y-axis versus `Date` on the x-axis. Fill in the `___` in the cell below. 

*Assign your answer to an object called `avocado_plot`. Don't forget to create proper English axis labels.*

In [None]:
# ___ = alt.Chart(___).mark_point().encode(
#     x=alt.X(___).title(___),
#     y=alt.Y(___).title(___)
# )

# your code here
raise NotImplementedError
avocado_plot

In [None]:
from hashlib import sha1
assert sha1(str(type(avocado_plot is None)).encode("utf-8")+b"55a6c330b795bc8b").hexdigest() == "f36991a19049882a3e8519c4834f2fc3395926d6", "type of avocado_plot is None is not bool. avocado_plot is None should be a bool"
assert sha1(str(avocado_plot is None).encode("utf-8")+b"55a6c330b795bc8b").hexdigest() == "768b2a573bb8ab55e3c32fb8452f2a4abfcae979", "boolean value of avocado_plot is None is not correct"

assert sha1(str(type(avocado_plot.encoding.x['shorthand'])).encode("utf-8")+b"51bbbba0db94b55b").hexdigest() == "d63d23924afa3414c04651af2040191d1c493bac", "type of avocado_plot.encoding.x['shorthand'] is not str. avocado_plot.encoding.x['shorthand'] should be an str"
assert sha1(str(len(avocado_plot.encoding.x['shorthand'])).encode("utf-8")+b"51bbbba0db94b55b").hexdigest() == "0bcabc3634d845f57bdf19438cd34e8cd93cb0c0", "length of avocado_plot.encoding.x['shorthand'] is not correct"
assert sha1(str(avocado_plot.encoding.x['shorthand'].lower()).encode("utf-8")+b"51bbbba0db94b55b").hexdigest() == "0b69a26a7fcd380a98f69ae78d4cb09b4b176827", "value of avocado_plot.encoding.x['shorthand'] is not correct"
assert sha1(str(avocado_plot.encoding.x['shorthand']).encode("utf-8")+b"51bbbba0db94b55b").hexdigest() == "e0b3e956cdbcd5f954100bdaba191c917774de85", "correct string value of avocado_plot.encoding.x['shorthand'] but incorrect case of letters"

assert sha1(str(type(avocado_plot.encoding.y['shorthand'])).encode("utf-8")+b"243f5fd9cad44966").hexdigest() == "26eabc16588cc4541088ada1d1b0a46937f22a16", "type of avocado_plot.encoding.y['shorthand'] is not str. avocado_plot.encoding.y['shorthand'] should be an str"
assert sha1(str(len(avocado_plot.encoding.y['shorthand'])).encode("utf-8")+b"243f5fd9cad44966").hexdigest() == "1fd81c02936e08c9ed74e15f269ca522c0073f3f", "length of avocado_plot.encoding.y['shorthand'] is not correct"
assert sha1(str(avocado_plot.encoding.y['shorthand'].lower()).encode("utf-8")+b"243f5fd9cad44966").hexdigest() == "f91657a32b57dbbecdf30af8857751e3bfda1486", "value of avocado_plot.encoding.y['shorthand'] is not correct"
assert sha1(str(avocado_plot.encoding.y['shorthand']).encode("utf-8")+b"243f5fd9cad44966").hexdigest() == "f91657a32b57dbbecdf30af8857751e3bfda1486", "correct string value of avocado_plot.encoding.y['shorthand'] but incorrect case of letters"

assert sha1(str(type(avocado_plot.mark)).encode("utf-8")+b"5ddfac8d6ed12485").hexdigest() == "8587a9ef36aaf4a87c795688b055a6c1d96649be", "type of avocado_plot.mark is not str. avocado_plot.mark should be an str"
assert sha1(str(len(avocado_plot.mark)).encode("utf-8")+b"5ddfac8d6ed12485").hexdigest() == "0710b9feca5f751429946b412f0cb7b7d63dff3f", "length of avocado_plot.mark is not correct"
assert sha1(str(avocado_plot.mark.lower()).encode("utf-8")+b"5ddfac8d6ed12485").hexdigest() == "13d546595faf950d3974a290ce7370710655df44", "value of avocado_plot.mark is not correct"
assert sha1(str(avocado_plot.mark).encode("utf-8")+b"5ddfac8d6ed12485").hexdigest() == "13d546595faf950d3974a290ce7370710655df44", "correct string value of avocado_plot.mark but incorrect case of letters"

assert sha1(str(type(isinstance(avocado_plot.encoding.y['title'], str))).encode("utf-8")+b"41487a72fa33d9b7").hexdigest() == "2b9b2dbe08fecca3b4293f5ca49503931e984596", "type of isinstance(avocado_plot.encoding.y['title'], str) is not bool. isinstance(avocado_plot.encoding.y['title'], str) should be a bool"
assert sha1(str(isinstance(avocado_plot.encoding.y['title'], str)).encode("utf-8")+b"41487a72fa33d9b7").hexdigest() == "17e59b4fd7a12b6b4e5ab4fa56e0d8b50aac8140", "boolean value of isinstance(avocado_plot.encoding.y['title'], str) is not correct"

assert sha1(str(type(isinstance(avocado_plot.encoding.x['title'], str))).encode("utf-8")+b"04716fdbcc9f5a82").hexdigest() == "ded1cf84001a3d7f8b458d794ec091205712d3e6", "type of isinstance(avocado_plot.encoding.x['title'], str) is not bool. isinstance(avocado_plot.encoding.x['title'], str) should be a bool"
assert sha1(str(isinstance(avocado_plot.encoding.x['title'], str)).encode("utf-8")+b"04716fdbcc9f5a82").hexdigest() == "d34c58c821816febbf3d6455407652b835776c20", "boolean value of isinstance(avocado_plot.encoding.x['title'], str) is not correct"

print('Success!')

This is a big plot! You can scroll and maybe see some trends, but what we see in the plot above is not very informative. Why? Because there is a lot of overplotting (data points sitting on top of other data points). What can we do? One solution is to reduce/aggregate the data in a meaningful way to help answer our question. Remember that we are interested in determining whether there are times when avocados are less expensive, so that we can recommend when restaurants should put dishes that contain avocado on the menu to maximize their profits for those dishes.

In the data we plotted above, each row is the total sales of avocados for one region in one week. Let's use `.groupby` + `.mean` to calculate the average price for each week across years and regions. We can then plot that aggregated price against the week and perhaps get a clearer picture.

**Question 1.5**
<br> {points: 1}

Create a reduced/aggregated version of the `avocado` data set and name it `avocado_aggregate`. To do this you will want to `groupby` the `wk` column and then use `mean` to calculate the average price. We pass `numeric_only=True` to tell pandas that we want the mean of only the numeric columns. Note: `groupby` automatically sets the grouping column as the index of the result. Since we would like to use the `wk` column later in the plot, we apply `reset_index` to turn it back into a regular column.
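The effect of `reset_index` can be seen in a small sketch on made-up data. The data frame `toy` and its values are hypothetical and unrelated to the avocado file.

```python
import pandas as pd

# Hypothetical data: two observations per week (made-up numbers).
toy = pd.DataFrame({
    "wk": [1, 1, 2, 2],
    "region": ["X", "Y", "X", "Y"],
    "price": [1.0, 2.0, 3.0, 5.0],
})

# Group rows by week and average the numeric columns;
# the text column "region" is left out because numeric_only=True.
by_week = toy.groupby("wk").mean(numeric_only=True)

# After groupby, "wk" is the index; reset_index makes it a column again.
by_week_flat = by_week.reset_index()
```

Here `by_week` has `wk` as its index, while `by_week_flat` has `wk` as an ordinary column that a plotting library can refer to by name.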

*Assign your answer to an object called `avocado_aggregate`.*

In [None]:
# ___ = ___.groupby(___).mean(numeric_only=True).reset_index()

# your code here
raise NotImplementedError
avocado_aggregate.head()

In [None]:
from hashlib import sha1
assert str(type(avocado_aggregate is None)) == "<class 'bool'>", "type of avocado_aggregate is None is not bool. avocado_aggregate is None should be a bool"
assert str(avocado_aggregate is None) == "False", "boolean value of avocado_aggregate is None is not correct"

assert str(type(avocado_aggregate.shape)) == "<class 'tuple'>", "type of avocado_aggregate.shape is not tuple. avocado_aggregate.shape should be a tuple"
assert str(len(avocado_aggregate.shape)) == "2", "length of avocado_aggregate.shape is not correct"
assert str(sorted(map(str, avocado_aggregate.shape))) == "['53', '6']", "values of avocado_aggregate.shape are not correct"
assert str(avocado_aggregate.shape) == "(53, 6)", "order of elements of avocado_aggregate.shape is not correct"

assert sha1(str(type(sum(avocado_aggregate.wk))).encode("utf-8")+b"8bf00066267ebd2e").hexdigest() == "bb35a33b11265b645284b71f4f90290e87f531d1", "type of sum(avocado_aggregate.wk) is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(sum(avocado_aggregate.wk)).encode("utf-8")+b"8bf00066267ebd2e").hexdigest() == "a48df2cdedc41f966eb7934eda5a619081d21135", "value of sum(avocado_aggregate.wk) is not correct"

assert sha1(str(type(sum(avocado_aggregate.average_price))).encode("utf-8")+b"3a195945ab1bdee5").hexdigest() == "354cb32f93c324479cbc17b6703e05abe3a1cb9c", "type of sum(avocado_aggregate.average_price) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(avocado_aggregate.average_price), 2)).encode("utf-8")+b"3a195945ab1bdee5").hexdigest() == "1937925944740dc9f6051ce34ac7a996c7bb238c", "value of sum(avocado_aggregate.average_price) is not correct (rounded to 2 decimal places)"

print('Success!')

**Question 1.6**
<br> {points: 1}

Now let's take the `avocado_aggregate` data frame and use it to create a scatter plot where we plot `average_price` on the y-axis versus `wk` on the x-axis. 

*Assign your answer to an object called `avocado_aggregate_plot`. Don't forget to create proper English axis titles.*

In [None]:
# ___ = alt.Chart(___).mark_point().encode(
#     x=alt.X(___).title(___),
#     y=alt.Y(___)
#         .title(___)
#         .scale(zero=False)
# )


# your code here
raise NotImplementedError
avocado_aggregate_plot

In [None]:
from hashlib import sha1
assert sha1(str(type(avocado_aggregate_plot is None)).encode("utf-8")+b"515f76fe676c00c9").hexdigest() == "344b45d6d57e014fde5c2aa477d17de35048c25e", "type of avocado_aggregate_plot is None is not bool. avocado_aggregate_plot is None should be a bool"
assert sha1(str(avocado_aggregate_plot is None).encode("utf-8")+b"515f76fe676c00c9").hexdigest() == "2334f190e0b05e3e0a2e010969876be796f74fcc", "boolean value of avocado_aggregate_plot is None is not correct"

assert sha1(str(type(avocado_aggregate_plot.encoding.x['shorthand'])).encode("utf-8")+b"f84d7e8f8a85db42").hexdigest() == "2941935cd0f67d0e652b2a02b1f500623176c7da", "type of avocado_aggregate_plot.encoding.x['shorthand'] is not str. avocado_aggregate_plot.encoding.x['shorthand'] should be an str"
assert sha1(str(len(avocado_aggregate_plot.encoding.x['shorthand'])).encode("utf-8")+b"f84d7e8f8a85db42").hexdigest() == "2babfe66247ee4ce5d35ad1eed9a39b37ba35dae", "length of avocado_aggregate_plot.encoding.x['shorthand'] is not correct"
assert sha1(str(avocado_aggregate_plot.encoding.x['shorthand'].lower()).encode("utf-8")+b"f84d7e8f8a85db42").hexdigest() == "6f6c88f77d50f2b14edaa5dcd30fa95e174df38a", "value of avocado_aggregate_plot.encoding.x['shorthand'] is not correct"
assert sha1(str(avocado_aggregate_plot.encoding.x['shorthand']).encode("utf-8")+b"f84d7e8f8a85db42").hexdigest() == "6f6c88f77d50f2b14edaa5dcd30fa95e174df38a", "correct string value of avocado_aggregate_plot.encoding.x['shorthand'] but incorrect case of letters"

assert sha1(str(type(avocado_aggregate_plot.encoding.y['shorthand'])).encode("utf-8")+b"502ae6441828a5ac").hexdigest() == "032ad5ebc19903054ab9a5b43f42f21fb76545ec", "type of avocado_aggregate_plot.encoding.y['shorthand'] is not str. avocado_aggregate_plot.encoding.y['shorthand'] should be an str"
assert sha1(str(len(avocado_aggregate_plot.encoding.y['shorthand'])).encode("utf-8")+b"502ae6441828a5ac").hexdigest() == "49645e1eddd0b471db48befa9afc0df64eb5bf72", "length of avocado_aggregate_plot.encoding.y['shorthand'] is not correct"
assert sha1(str(avocado_aggregate_plot.encoding.y['shorthand'].lower()).encode("utf-8")+b"502ae6441828a5ac").hexdigest() == "58d7b2f7748ff7e1bd0db1b7838acc83d4290652", "value of avocado_aggregate_plot.encoding.y['shorthand'] is not correct"
assert sha1(str(avocado_aggregate_plot.encoding.y['shorthand']).encode("utf-8")+b"502ae6441828a5ac").hexdigest() == "58d7b2f7748ff7e1bd0db1b7838acc83d4290652", "correct string value of avocado_aggregate_plot.encoding.y['shorthand'] but incorrect case of letters"

assert sha1(str(type(avocado_aggregate_plot.mark)).encode("utf-8")+b"4289906fb65b7f8c").hexdigest() == "490f4e823bb1ca6e7ae323a70865b727152c1c04", "type of avocado_aggregate_plot.mark is not str. avocado_aggregate_plot.mark should be an str"
assert sha1(str(len(avocado_aggregate_plot.mark)).encode("utf-8")+b"4289906fb65b7f8c").hexdigest() == "1c035a414dfa0f39ab0ffa0a528e531ddd06e4a5", "length of avocado_aggregate_plot.mark is not correct"
assert sha1(str(avocado_aggregate_plot.mark.lower()).encode("utf-8")+b"4289906fb65b7f8c").hexdigest() == "c2840d28ac1480665fdda4fc5cb0a87f0241168d", "value of avocado_aggregate_plot.mark is not correct"
assert sha1(str(avocado_aggregate_plot.mark).encode("utf-8")+b"4289906fb65b7f8c").hexdigest() == "c2840d28ac1480665fdda4fc5cb0a87f0241168d", "correct string value of avocado_aggregate_plot.mark but incorrect case of letters"

assert sha1(str(type(isinstance(avocado_aggregate_plot.encoding.x['title'], str))).encode("utf-8")+b"a107e55850ce9c2b").hexdigest() == "ba1ce90bebe5d6cce894867794875036b1d8ec68", "type of isinstance(avocado_aggregate_plot.encoding.x['title'], str) is not bool. isinstance(avocado_aggregate_plot.encoding.x['title'], str) should be a bool"
assert sha1(str(isinstance(avocado_aggregate_plot.encoding.x['title'], str)).encode("utf-8")+b"a107e55850ce9c2b").hexdigest() == "48be948c2c9d38b9413ab97866e866dd18d66a5d", "boolean value of isinstance(avocado_aggregate_plot.encoding.x['title'], str) is not correct"

assert sha1(str(type(isinstance(avocado_aggregate_plot.encoding.y['title'], str))).encode("utf-8")+b"d835b78ca2d3672a").hexdigest() == "2e064e5005b11cf9b414cb8be0673627ac8a655d", "type of isinstance(avocado_aggregate_plot.encoding.y['title'], str) is not bool. isinstance(avocado_aggregate_plot.encoding.y['title'], str) should be a bool"
assert sha1(str(isinstance(avocado_aggregate_plot.encoding.y['title'], str)).encode("utf-8")+b"d835b78ca2d3672a").hexdigest() == "896f0d63a91f214ade1e4408bdee0557b65d663c", "boolean value of isinstance(avocado_aggregate_plot.encoding.y['title'], str) is not correct"

print('Success!')

We can now see that the price of avocados does indeed fluctuate throughout the year. We could use this information to recommend that restaurants wanting to maximize profit from menu items that contain avocados offer them only roughly between December and May.

Why might this happen? Perhaps price has something to do with supply? We can also use this data set to get some insight into that question by plotting total avocado volume (y-axis) versus week. To do this, we will first have to create a column called `total_volume` whose value is the sum of the small, large and extra large-sized avocado volumes. To do this we will have to go back to the original `avocado` data frame we loaded.

**Question 1.7**
<br> {points: 1}

Our next step to plotting `total_volume` per week against week is to create a new column in the `avocado` data frame called `total_volume`, which is equal to the sum of all three volume columns.
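As a sketch of how `assign` adds a column computed from other columns, consider the made-up data frame below. The column names `a`, `b`, `c` and `total` are hypothetical. Note that adding columns with `+` propagates missing values: if any input in a row is missing, the sum for that row is missing too.

```python
import pandas as pd

# Hypothetical data with one missing value in column "c".
toy = pd.DataFrame({
    "a": [1, 2],
    "b": [10, 20],
    "c": [100, None],
})

# assign returns a new data frame with the extra column;
# we overwrite toy so the change is kept.
toy = toy.assign(total=toy["a"] + toy["b"] + toy["c"])
```

The first row of `total` is 111.0, while the second row is `NaN` because `c` is missing there.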

Fill in the `___` in the cell below. 

In [None]:
# avocado = avocado.assign(___=___ + ___ + ___)

# your code here
raise NotImplementedError
avocado

In [None]:
from hashlib import sha1
assert str(type(avocado is None)) == "<class 'bool'>", "type of avocado is None is not bool. avocado is None should be a bool"
assert str(avocado is None) == "False", "boolean value of avocado is None is not correct"

assert str(type(avocado.shape)) == "<class 'tuple'>", "type of avocado.shape is not tuple. avocado.shape should be a tuple"
assert str(len(avocado.shape)) == "2", "length of avocado.shape is not correct"
assert str(sorted(map(str, avocado.shape))) == "['10', '17911']", "values of avocado.shape are not correct"
assert str(avocado.shape) == "(17911, 10)", "order of elements of avocado.shape is not correct"

assert sha1(str(type(sum(avocado.total_volume.dropna()))).encode("utf-8")+b"2a26ed12a1faf711").hexdigest() == "e12bdd85a43e72d586543cfb751b881ab5723e67", "type of sum(avocado.total_volume.dropna()) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(avocado.total_volume.dropna()), 2)).encode("utf-8")+b"2a26ed12a1faf711").hexdigest() == "af4519dfa8cb5eb26b0c8de88c5e40bd514e45d1", "value of sum(avocado.total_volume.dropna()) is not correct (rounded to 2 decimal places)"

print('Success!')

**Question 1.8** 
<br> {points: 1}

Now, create another reduced/aggregated version of the `avocado` data frame and name it `avocado_aggregate_2`. To do this you will want to `groupby` the `wk` column and then use `mean` to calculate the average total volume.

In [None]:
# ___ = ___.groupby(___).mean(numeric_only=True).reset_index()


# your code here
raise NotImplementedError
avocado_aggregate_2.head()

In [None]:
from hashlib import sha1
assert str(type(avocado_aggregate_2 is None)) == "<class 'bool'>", "type of avocado_aggregate_2 is None is not bool. avocado_aggregate_2 is None should be a bool"
assert str(avocado_aggregate_2 is None) == "False", "boolean value of avocado_aggregate_2 is None is not correct"

assert str(type(avocado_aggregate_2.shape)) == "<class 'tuple'>", "type of avocado_aggregate_2.shape is not tuple. avocado_aggregate_2.shape should be a tuple"
assert str(len(avocado_aggregate_2.shape)) == "2", "length of avocado_aggregate_2.shape is not correct"
assert str(sorted(map(str, avocado_aggregate_2.shape))) == "['53', '7']", "values of avocado_aggregate_2.shape are not correct"
assert str(avocado_aggregate_2.shape) == "(53, 7)", "order of elements of avocado_aggregate_2.shape is not correct"

assert sha1(str(type(sum(avocado_aggregate_2.total_volume))).encode("utf-8")+b"4c85373d3a474912").hexdigest() == "c6904a8417c9400b54ee056173acc66f7a3cbf98", "type of sum(avocado_aggregate_2.total_volume) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(avocado_aggregate_2.total_volume), 2)).encode("utf-8")+b"4c85373d3a474912").hexdigest() == "614dbdc7940e31dd4f1a7534aa03523e351857f9", "value of sum(avocado_aggregate_2.total_volume) is not correct (rounded to 2 decimal places)"

assert sha1(str(type(sum(avocado_aggregate_2.wk))).encode("utf-8")+b"ff2296f6bf1329ff").hexdigest() == "233010ba621759b184b7eb87cb102a17d845e52d", "type of sum(avocado_aggregate_2.wk) is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(sum(avocado_aggregate_2.wk)).encode("utf-8")+b"ff2296f6bf1329ff").hexdigest() == "a7fac24fe47cbb578cb69f332f2b0a75bf97ac52", "value of sum(avocado_aggregate_2.wk) is not correct"

print('Success!')

**Question 1.10** 
<br> {points: 1}

Now let's take the `avocado_aggregate_2` data frame and use it to create a scatter plot where we plot average `total_volume` (in pounds, lbs) on the y-axis versus `wk` on the x-axis. Assign your answer to an object called `avocado_aggregate_plot_2`. Don't forget to create proper English axis labels.

> Hint: don't forget to include the units for volume in your axis titles.

In [None]:
# ___ = alt.Chart(___).mark_point().encode(
#     x=alt.X(___).title(___),
#     y=alt.Y(___)
#         .title(___)
#         .scale(zero=False)
# )

# your code here
raise NotImplementedError
avocado_aggregate_plot_2

In [None]:
from hashlib import sha1
assert sha1(str(type(avocado_aggregate_plot_2 is None)).encode("utf-8")+b"5f83d310e429e599").hexdigest() == "a1981495f7d8cfbf8780f81ac4abb56631e37546", "type of avocado_aggregate_plot_2 is None is not bool. avocado_aggregate_plot_2 is None should be a bool"
assert sha1(str(avocado_aggregate_plot_2 is None).encode("utf-8")+b"5f83d310e429e599").hexdigest() == "8f51fb9a6fafe3985dcc0271a13697b7e1a8fc0c", "boolean value of avocado_aggregate_plot_2 is None is not correct"

assert sha1(str(type(avocado_aggregate_plot_2.encoding.x['shorthand'])).encode("utf-8")+b"4d4ee2ce5747470e").hexdigest() == "dbe8ed9989020a949d8a71430cfa0b214efb6119", "type of avocado_aggregate_plot_2.encoding.x['shorthand'] is not str. avocado_aggregate_plot_2.encoding.x['shorthand'] should be an str"
assert sha1(str(len(avocado_aggregate_plot_2.encoding.x['shorthand'])).encode("utf-8")+b"4d4ee2ce5747470e").hexdigest() == "6de8d83331d8af94b2fcd1257aece9457fc0d85c", "length of avocado_aggregate_plot_2.encoding.x['shorthand'] is not correct"
assert sha1(str(avocado_aggregate_plot_2.encoding.x['shorthand'].lower()).encode("utf-8")+b"4d4ee2ce5747470e").hexdigest() == "78236649189b26bd6246efc381782739597acbd6", "value of avocado_aggregate_plot_2.encoding.x['shorthand'] is not correct"
assert sha1(str(avocado_aggregate_plot_2.encoding.x['shorthand']).encode("utf-8")+b"4d4ee2ce5747470e").hexdigest() == "78236649189b26bd6246efc381782739597acbd6", "correct string value of avocado_aggregate_plot_2.encoding.x['shorthand'] but incorrect case of letters"

assert sha1(str(type(avocado_aggregate_plot_2.encoding.y['shorthand'])).encode("utf-8")+b"81b20868b95a6771").hexdigest() == "7dc0923795b500fedfbeb8f9f265a6f9470a9eff", "type of avocado_aggregate_plot_2.encoding.y['shorthand'] is not str. avocado_aggregate_plot_2.encoding.y['shorthand'] should be an str"
assert sha1(str(len(avocado_aggregate_plot_2.encoding.y['shorthand'])).encode("utf-8")+b"81b20868b95a6771").hexdigest() == "02609ff53491f5539533aa7b11244f48dfb58b43", "length of avocado_aggregate_plot_2.encoding.y['shorthand'] is not correct"
assert sha1(str(avocado_aggregate_plot_2.encoding.y['shorthand'].lower()).encode("utf-8")+b"81b20868b95a6771").hexdigest() == "07ca093192eccc102858c4f7c99fd9c9aae98dac", "value of avocado_aggregate_plot_2.encoding.y['shorthand'] is not correct"
assert sha1(str(avocado_aggregate_plot_2.encoding.y['shorthand']).encode("utf-8")+b"81b20868b95a6771").hexdigest() == "07ca093192eccc102858c4f7c99fd9c9aae98dac", "correct string value of avocado_aggregate_plot_2.encoding.y['shorthand'] but incorrect case of letters"

assert sha1(str(type(avocado_aggregate_plot_2.mark)).encode("utf-8")+b"77a77d29b749eb72").hexdigest() == "20e137eb3a17cfa6009cdbbf40d5242bb2da0841", "type of avocado_aggregate_plot_2.mark is not str. avocado_aggregate_plot_2.mark should be an str"
assert sha1(str(len(avocado_aggregate_plot_2.mark)).encode("utf-8")+b"77a77d29b749eb72").hexdigest() == "5f6b865f56fc9b322a5672e03b53740fabfc2936", "length of avocado_aggregate_plot_2.mark is not correct"
assert sha1(str(avocado_aggregate_plot_2.mark.lower()).encode("utf-8")+b"77a77d29b749eb72").hexdigest() == "6699751ea5a0f340a45fd6063591087fa4f372d8", "value of avocado_aggregate_plot_2.mark is not correct"
assert sha1(str(avocado_aggregate_plot_2.mark).encode("utf-8")+b"77a77d29b749eb72").hexdigest() == "6699751ea5a0f340a45fd6063591087fa4f372d8", "correct string value of avocado_aggregate_plot_2.mark but incorrect case of letters"

assert sha1(str(type(isinstance(avocado_aggregate_plot_2.encoding.x['title'], str))).encode("utf-8")+b"9bd8036a163f27fc").hexdigest() == "34adb6e9a955435dcc663cb28fe10f8d54bceb87", "type of isinstance(avocado_aggregate_plot_2.encoding.x['title'], str) is not bool. isinstance(avocado_aggregate_plot_2.encoding.x['title'], str) should be a bool"
assert sha1(str(isinstance(avocado_aggregate_plot_2.encoding.x['title'], str)).encode("utf-8")+b"9bd8036a163f27fc").hexdigest() == "a0ff543f084e030aa1f8c21fef3d9f5fab3ea8e4", "boolean value of isinstance(avocado_aggregate_plot_2.encoding.x['title'], str) is not correct"

assert sha1(str(type(isinstance(avocado_aggregate_plot_2.encoding.y['title'], str))).encode("utf-8")+b"dcb55b375b0c4865").hexdigest() == "962c57041b911f7c8872844d6441a0995fce5c02", "type of isinstance(avocado_aggregate_plot_2.encoding.y['title'], str) is not bool. isinstance(avocado_aggregate_plot_2.encoding.y['title'], str) should be a bool"
assert sha1(str(isinstance(avocado_aggregate_plot_2.encoding.y['title'], str)).encode("utf-8")+b"dcb55b375b0c4865").hexdigest() == "5d538b52bbfa83794fc54788da568d95b614b37e", "boolean value of isinstance(avocado_aggregate_plot_2.encoding.y['title'], str) is not correct"

print('Success!')

We can see from the above plot of the average total volume versus the week that there are more avocados sold (and perhaps this reflects what is available for sale) roughly between January and May. This time period of increased volume corresponds with the lower avocado prices. We can *hypothesize* (but not conclude, of course) that the lower prices may be due to an increased availability of avocados during this time period.

## 2. Sea Surface Temperatures in Departure Bay
The next data set that we will be looking at contains environmental data from 1914 to 2018. The data was collected by the DFO (Canada's Department of Fisheries and Oceans) at the Pacific Biological Station (Departure Bay). Daily sea surface temperature (in degrees Celsius) and salinity (in practical salinity units, PSU) observations have been carried out at several locations on the coast of British Columbia. The number of stations reporting at any given time has varied as sampling has been discontinued at some stations, and started or resumed at others.

The program is presently termed the British Columbia Shore Station Oceanographic Program (BCSOP) and has 12 participating stations; most of these are staffed by Fisheries and Oceans Canada. You can look at data from other stations at http://www.pac.dfo-mpo.gc.ca/science/oceans/data-donnees/lightstations-phares/index-eng.html

Further information from the Government of Canada's website indicates: 
>  Observations are made daily using seawater collected in a bucket lowered into the surface water at or near the daytime high tide. This sampling method was designed long ago by Dr. John P. Tully and has not been changed in the interests of a homogeneous data set. This means, for example, that if an observer starts sampling one day at 6 a.m., and continues to sample at the daytime high tide on the second day the sample will be taken at about 06:50 the next day, 07:40 the day after etc. When the daytime high-tide gets close to 6 p.m. the observer will then begin again to sample early in the morning, and the cycle continues. Since there is a day/night variation in the sea surface temperatures the daily time series will show a signal that varies with the 14-day tidal cycle. This artifact does not affect the monthly sea surface temperature data.

In this worksheet, we want to see if the sea surface temperature has been changing over time. 

**Question 2.1** True or False:
<br> {points: 1}

The sampling of surface water occurs at the same time each day. 

*Assign your answer to an object called `answer2_1`. Make sure your answer is a boolean. i.e. `True` or `False`.* 

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer2_1)).encode("utf-8")+b"17cd29239277db7b").hexdigest() == "2a659e86178094137cc6e4a4eba67c4dff9f934f", "type of answer2_1 is not bool. answer2_1 should be a bool"
assert sha1(str(answer2_1).encode("utf-8")+b"17cd29239277db7b").hexdigest() == "4e666d55963e65630d428435a244aaa9e66cd344", "boolean value of answer2_1 is not correct"

print('Success!')

**Question 2.2** Multiple Choice:
<br> {points: 1}

If high tide occurred at 9am today, what time would the scientist collect data tomorrow?

A. 11:10 am 

B. 9:50 am 

C. 10:00 pm 

D. Trick question... you skip days when collecting data. 

*Assign your answer to an object called `answer2_2`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).* 

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer2_2)).encode("utf-8")+b"828fb7959f9b12e7").hexdigest() == "3cac3266f7cb1a67ef9ecb17ae8f8fc45738f3e4", "type of answer2_2 is not str. answer2_2 should be an str"
assert sha1(str(len(answer2_2)).encode("utf-8")+b"828fb7959f9b12e7").hexdigest() == "1694341ff3f83d9a73bd8e3a3a18f812ba8333c4", "length of answer2_2 is not correct"
assert sha1(str(answer2_2.lower()).encode("utf-8")+b"828fb7959f9b12e7").hexdigest() == "2e61baf093db06355c61e352fccc458e2536c1f2", "value of answer2_2 is not correct"
assert sha1(str(answer2_2).encode("utf-8")+b"828fb7959f9b12e7").hexdigest() == "4b01ce20c8a28a6692f5a3f61795554871391176", "correct string value of answer2_2 but incorrect case of letters"

print('Success!')

**Question 2.3**
<br> {points: 1}

To begin working with this data, read the file `departure_bay_temperature.csv` using a relative path. Note, this file (just like the avocado data set) is found within the `data` directory. 

*Assign your answer to an object called `sea_surface`.* 

> Hint: check out the data file in the editor mode to see from which row the actual data begins, and you will need to specify the `skiprows` argument accordingly in the suitable `pandas` function.
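
To see what `skiprows` does, here is a minimal sketch that reads a made-up file held in memory. The contents, the two metadata lines, and the name `temps` are all hypothetical; the real file may have a different number of lines to skip, so check it yourself.

```python
import io
import pandas as pd

# A pretend CSV file with two lines of metadata above the header row.
raw = io.StringIO(
    "Station: Example Bay\n"
    "Source: made-up data\n"
    "Year,Jan,Feb\n"
    "2000,7.1,7.4\n"
    "2001,6.8,7.0\n"
)

# skiprows=2 discards the first two lines, so the third line
# ("Year,Jan,Feb") is used as the header.
temps = pd.read_csv(raw, skiprows=2)
print(temps)
```

Without `skiprows`, `pandas` would try to treat the first metadata line as the header, and the columns would not come out as intended.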

In [None]:
# your code here
raise NotImplementedError
sea_surface

In [None]:
from hashlib import sha1
assert str(type(sea_surface is None)) == "<class 'bool'>", "type of sea_surface is None is not bool. sea_surface is None should be a bool"
assert str(sea_surface is None) == "False", "boolean value of sea_surface is None is not correct"

assert str(type(sea_surface)) == "<class 'pandas.core.frame.DataFrame'>", "type of type(sea_surface) is not correct"

assert str(type(sea_surface.shape)) == "<class 'tuple'>", "type of sea_surface.shape is not tuple. sea_surface.shape should be a tuple"
assert str(len(sea_surface.shape)) == "2", "length of sea_surface.shape is not correct"
assert str(sorted(map(str, sea_surface.shape))) == "['105', '13']", "values of sea_surface.shape are not correct"
assert str(sea_surface.shape) == "(105, 13)", "order of elements of sea_surface.shape is not correct"

assert str(type(sea_surface.columns.values)) == "<class 'numpy.ndarray'>", "type of sea_surface.columns.values is not correct"
assert str(sea_surface.columns.values) == "['Year' 'Jan' 'Feb' 'Mar' 'Apr' 'May' 'Jun' 'Jul' 'Aug' 'Sep' 'Oct' 'Nov'\n 'Dec']", "value of sea_surface.columns.values is not correct"

assert sha1(str(type(sum(sea_surface.Year))).encode("utf-8")+b"d621a2aab489003e").hexdigest() == "15ead5ea8e85f6b76d27bf6b56d6465d3c634aa3", "type of sum(sea_surface.Year) is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(sum(sea_surface.Year)).encode("utf-8")+b"d621a2aab489003e").hexdigest() == "6ec8f483aae6dfa368a571861333696899a8a1a5", "value of sum(sea_surface.Year) is not correct"

print('Success!')

**Question 2.3.1**
<br> {points: 1}

The data above in Question 2.3 is not tidy. Which of the reasons listed below explain why?

A. There are NaN's in the data set

B. The variable temperature is split across more than one column

C. Values for the variable month are stored as column names

D. A and C

E. B and C

F. All of the above

*Assign your answer to an object called `answer2_3_1`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer2_3_1)).encode("utf-8")+b"5b293b9f946fdb9b").hexdigest() == "0c0712bb4b2678bc89bf76fc8e2c9978299d1073", "type of answer2_3_1 is not str. answer2_3_1 should be an str"
assert sha1(str(len(answer2_3_1)).encode("utf-8")+b"5b293b9f946fdb9b").hexdigest() == "6b556cc3478aff1dddd5b9bb08e043c81c1a8105", "length of answer2_3_1 is not correct"
assert sha1(str(answer2_3_1.lower()).encode("utf-8")+b"5b293b9f946fdb9b").hexdigest() == "5182fde710a8136c64d8ff88719271373a30d72c", "value of answer2_3_1 is not correct"
assert sha1(str(answer2_3_1).encode("utf-8")+b"5b293b9f946fdb9b").hexdigest() == "40277580dd2083851b3db44f3415e3f561f7b515", "correct string value of answer2_3_1 but incorrect case of letters"

print('Success!')

**Question 2.4**
<br> {points: 1}

Given `altair` expects tidy data, we need to convert our data into that format. To do this we will use the `melt` function. We would like our data to end up looking like this:

| Year | Month | Temperature |
|------|-------|-------------|
| 1914 | Jan   | 7.2         |
| 1915 | Jan   | 5.6         |
| 1916 | Jan   | 1.2         |
| 1917 | Jan   | 3.8         |
| 1918 | Jan   | 3.7         |
| ...  | ...   | ...         |
| 2014 | Dec   | 7.1         |
| 2015 | Dec   | 6.8         |
| 2016 | Dec   | 5.5         |
| 2017 | Dec   | 6.9         |
| 2018 | Dec   | NaN         |


Fill in the `___` in the cell below. 

*Assign your answer to an object called `tidy_temp`.*
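
For a feel of what this kind of reshaping does, here is a minimal sketch on an unrelated, made-up data frame. The names `rain`, `city`, `quarter`, and `rainfall` are hypothetical.

```python
import pandas as pd

# Wide format: one column per quarter, so the variable "quarter"
# is hidden in the column names.
rain = pd.DataFrame({
    "city": ["A", "B"],
    "Q1": [100.0, 80.0],
    "Q2": [60.0, 50.0],
})

# Long (tidy) format: `city` is kept as an identifier, the old column
# names go into `quarter`, and the cell values go into `rainfall`.
rain_long = rain.melt(id_vars=["city"], var_name="quarter", value_name="rainfall")
print(rain_long)
```

Each row of the result is now a single observation: one city in one quarter with one rainfall value.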

In [None]:
# ___ = sea_surface.___(id_vars=['Year'],  var_name='___', value_name='Temperature')


# your code here
raise NotImplementedError
tidy_temp

In [None]:
from hashlib import sha1
assert str(type(tidy_temp is None)) == "<class 'bool'>", "type of tidy_temp is None is not bool. tidy_temp is None should be a bool"
assert str(tidy_temp is None) == "False", "boolean value of tidy_temp is None is not correct"

assert str(type(tidy_temp.shape)) == "<class 'tuple'>", "type of tidy_temp.shape is not tuple. tidy_temp.shape should be a tuple"
assert str(len(tidy_temp.shape)) == "2", "length of tidy_temp.shape is not correct"
assert str(sorted(map(str, tidy_temp.shape))) == "['1260', '3']", "values of tidy_temp.shape are not correct"
assert str(tidy_temp.shape) == "(1260, 3)", "order of elements of tidy_temp.shape is not correct"

assert str(type(tidy_temp.columns)) == "<class 'pandas.core.indexes.base.Index'>", "type of tidy_temp.columns is not correct"
assert str(tidy_temp.columns) == "Index(['Year', 'Month', 'Temperature'], dtype='object')", "value of tidy_temp.columns is not correct"

assert sha1(str(type(sum(tidy_temp.Temperature.dropna()))).encode("utf-8")+b"b5b290b1f67888e7").hexdigest() == "9aa17c5678a5320ca9213bbe4d219603c674fa95", "type of sum(tidy_temp.Temperature.dropna()) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(tidy_temp.Temperature.dropna()), 2)).encode("utf-8")+b"b5b290b1f67888e7").hexdigest() == "827efe02fc48e3d83a7875031c0adc03960f20dd", "value of sum(tidy_temp.Temperature.dropna()) is not correct (rounded to 2 decimal places)"

print('Success!')

**Question 2.5**
<br> {points: 1}

Now that we have our data in a tidy format, we can create our plot that compares the average monthly sea surface temperatures (in degrees Celsius) to the year they were recorded. To make our plots more informative, we should plot each month separately. We can filter the data before we pass it to the `alt.Chart` function. Let's start out by just plotting the data for the month of November. As usual, use proper English to label your axes :)

*Assign your answer to an object called `nov_temp_plot`.*

> Hint: don't forget to include the units for temperature in your data visualization.
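
Filtering rows with `[]` can be sketched on a small made-up data frame before applying the same idea here. The names `readings`, `season`, and `value` are hypothetical.

```python
import pandas as pd

readings = pd.DataFrame({
    "season": ["winter", "summer", "winter", "summer"],
    "value": [3.0, 18.0, 4.0, 19.0],
})

# The comparison produces one True/False per row;
# putting it inside [] keeps only the rows marked True.
is_winter = readings["season"] == "winter"
winter = readings[is_winter]
print(winter)
```

The filtered data frame keeps the original row labels, and it can be passed to a plotting function directly in place of the full data frame.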

In [None]:
# ___ = alt.Chart(___[___[___] == "Nov"]).mark_point().encode(
#     x=alt.X(___)
#         .scale(zero=False),
#     y=alt.Y(___)
#         .title(___)
#         .scale(zero=False)
# )

# your code here
raise NotImplementedError
nov_temp_plot

In [None]:
from hashlib import sha1
assert sha1(str(type(nov_temp_plot is None)).encode("utf-8")+b"9130b110cfd3c71d").hexdigest() == "3f2c27ff0177a1caf11b83ebfd41857401c2782a", "type of nov_temp_plot is None is not bool. nov_temp_plot is None should be a bool"
assert sha1(str(nov_temp_plot is None).encode("utf-8")+b"9130b110cfd3c71d").hexdigest() == "24043835d5cadacdf46c72b3e9e4d1bff85bc94b", "boolean value of nov_temp_plot is None is not correct"

assert sha1(str(type(nov_temp_plot.data.Month.unique())).encode("utf-8")+b"efbd06caa486ab74").hexdigest() == "2a104b2e96496828b18b23917da2ac40219477c1", "type of nov_temp_plot.data.Month.unique() is not correct"
assert sha1(str(nov_temp_plot.data.Month.unique()).encode("utf-8")+b"efbd06caa486ab74").hexdigest() == "1268bb6b17d49c8d7ce2ac5377c97faf631f7b68", "value of nov_temp_plot.data.Month.unique() is not correct"

assert sha1(str(type(nov_temp_plot.encoding.x['shorthand'])).encode("utf-8")+b"ac79db08dbfa4b30").hexdigest() == "994b0ff1de234e2b267bcab41e332a915b083f5d", "type of nov_temp_plot.encoding.x['shorthand'] is not str. nov_temp_plot.encoding.x['shorthand'] should be an str"
assert sha1(str(len(nov_temp_plot.encoding.x['shorthand'])).encode("utf-8")+b"ac79db08dbfa4b30").hexdigest() == "eba9aee16fff12107dc1ca2328169a47cbc31fa7", "length of nov_temp_plot.encoding.x['shorthand'] is not correct"
assert sha1(str(nov_temp_plot.encoding.x['shorthand'].lower()).encode("utf-8")+b"ac79db08dbfa4b30").hexdigest() == "2827ca6a60ca2129e9d0e54ada4c95196bfff3cd", "value of nov_temp_plot.encoding.x['shorthand'] is not correct"
assert sha1(str(nov_temp_plot.encoding.x['shorthand']).encode("utf-8")+b"ac79db08dbfa4b30").hexdigest() == "6fb4057d2e39038e55346850d77dacb4bd92568b", "correct string value of nov_temp_plot.encoding.x['shorthand'] but incorrect case of letters"

assert sha1(str(type(nov_temp_plot.encoding.y['shorthand'])).encode("utf-8")+b"06afe93c3b9c6bce").hexdigest() == "69a2fff844cb7a75ecae3d6930021f4bdf58037f", "type of nov_temp_plot.encoding.y['shorthand'] is not str. nov_temp_plot.encoding.y['shorthand'] should be an str"
assert sha1(str(len(nov_temp_plot.encoding.y['shorthand'])).encode("utf-8")+b"06afe93c3b9c6bce").hexdigest() == "c340a717ab4b8de6cf72ab066c90997344467558", "length of nov_temp_plot.encoding.y['shorthand'] is not correct"
assert sha1(str(nov_temp_plot.encoding.y['shorthand'].lower()).encode("utf-8")+b"06afe93c3b9c6bce").hexdigest() == "d7fca436c6046b0aca615459d4eb15f6208ee151", "value of nov_temp_plot.encoding.y['shorthand'] is not correct"
assert sha1(str(nov_temp_plot.encoding.y['shorthand']).encode("utf-8")+b"06afe93c3b9c6bce").hexdigest() == "cb3c1952538f930f18952aea84a3774f09bc25a9", "correct string value of nov_temp_plot.encoding.y['shorthand'] but incorrect case of letters"

assert sha1(str(type(nov_temp_plot.mark)).encode("utf-8")+b"f4be47e4cb7b3a4b").hexdigest() == "9fd87d7cdc381ee6c4d8bdb8933bf8a7e9960a26", "type of nov_temp_plot.mark is not str. nov_temp_plot.mark should be an str"
assert sha1(str(len(nov_temp_plot.mark)).encode("utf-8")+b"f4be47e4cb7b3a4b").hexdigest() == "20aae228d9c2f845ba7ae544b4be7a11c59b1db8", "length of nov_temp_plot.mark is not correct"
assert sha1(str(nov_temp_plot.mark.lower()).encode("utf-8")+b"f4be47e4cb7b3a4b").hexdigest() == "bc56e7d3dcb5bb865a13dda897c3497c5cfb8e54", "value of nov_temp_plot.mark is not correct"
assert sha1(str(nov_temp_plot.mark).encode("utf-8")+b"f4be47e4cb7b3a4b").hexdigest() == "bc56e7d3dcb5bb865a13dda897c3497c5cfb8e54", "correct string value of nov_temp_plot.mark but incorrect case of letters"

assert sha1(str(type(isinstance(nov_temp_plot.encoding.y['title'], str))).encode("utf-8")+b"82b144ab8a5042ee").hexdigest() == "1f78fdc39f7f11cf0853157939e8dc7c8b0e22f0", "type of isinstance(nov_temp_plot.encoding.y['title'], str) is not bool. isinstance(nov_temp_plot.encoding.y['title'], str) should be a bool"
assert sha1(str(isinstance(nov_temp_plot.encoding.y['title'], str)).encode("utf-8")+b"82b144ab8a5042ee").hexdigest() == "7821b96141c3053c7551373e7ee4ca3d2c3f8026", "boolean value of isinstance(nov_temp_plot.encoding.y['title'], str) is not correct"

print('Success!')

We can see that there may be a small decrease in colder temperatures in recent years, and/or the temperatures in recent years look less variable compared to years before 1975. What about other months? Let's plot them! 

Instead of repeating the code above for the 11 other months, we'll take advantage of an `altair` method that we haven't met yet, `facet`. We will learn more about it next week; this week we will give you the code for it.

**Question 2.6**
<br> {points: 1}

Fill in the missing code below to plot the average monthly sea surface temperatures against the year they were recorded for all months.

*Assign your answer to an object called `all_temp_plot`.*

> Hint: don't forget to include the units for temperature in your data visualization.

In [None]:
# ___ = alt.Chart(___).mark_point().encode(
#     x=alt.X(___)
#         .scale(zero=False),
#     y=alt.Y(___)
#         .title(___)
#         .scale(zero=False)
# ).facet(
#     'Month',
#     columns=4,
# )

# your code here
raise NotImplementedError
all_temp_plot

In [None]:
from hashlib import sha1
assert sha1(str(type(all_temp_plot is None)).encode("utf-8")+b"d81268b81e1b9f66").hexdigest() == "f6b9e158c0a8c233a0fd978e78500a2254bbe457", "type of all_temp_plot is None is not bool. all_temp_plot is None should be a bool"
assert sha1(str(all_temp_plot is None).encode("utf-8")+b"d81268b81e1b9f66").hexdigest() == "db141928c270d3c296928656a32a5b47cf2267c4", "boolean value of all_temp_plot is None is not correct"

assert sha1(str(type("Month" in all_temp_plot.data.columns)).encode("utf-8")+b"9bf6b6b40772d640").hexdigest() == "32e678fcd8c3df53460f0b15e42ff2a589c7b406", "type of \"Month\" in all_temp_plot.data.columns is not bool. \"Month\" in all_temp_plot.data.columns should be a bool"
assert sha1(str("Month" in all_temp_plot.data.columns).encode("utf-8")+b"9bf6b6b40772d640").hexdigest() == "6e0b375b019ccaf003deaa55cada7d1b0cdec3db", "boolean value of \"Month\" in all_temp_plot.data.columns is not correct"

assert sha1(str(type(all_temp_plot.facet)).encode("utf-8")+b"e7a3b38381f9a95f").hexdigest() == "b072f72f92c3642d31e083acff608e11f697d290", "type of all_temp_plot.facet is not correct"
assert sha1(str(all_temp_plot.facet).encode("utf-8")+b"e7a3b38381f9a95f").hexdigest() == "0e55775a25e0302ade57794f8c817d4bc0b97bd5", "value of all_temp_plot.facet is not correct"

print('Success!')

We can see above that some months show a small but general increase in temperatures, whereas others don't. Some months also show a change in variability, while others do not. From this it is clear to us that if we are trying to understand temperature changes over time, we had best keep data from different months separate. Also note that the months are sorted in alphabetical order, but it would have been better to sort them according to where in the year each month occurs. We will learn how to do this in an upcoming chapter!

## 3. Pollution in Madrid
We're working with a data set from Kaggle once again! [This data](https://www.kaggle.com/decide-soluciones/air-quality-madrid) was collected under the instructions of Madrid's City Council and is publicly available on their website. In recent years, high levels of pollution during certain dry periods have forced the authorities to take measures against the use of cars and have served as grounds for proposing certain regulations. This data includes daily and hourly measurements of air quality from 2001 to 2008. Pollutants are categorized based on their chemical properties.

There are a number of stations set up around Madrid, and each station's data frame contains all particle measurements that the station registered from 01/2001 - 04/2008. Not every station has the same equipment, so each station can measure only a certain subset of particles. The complete list of possible measurements and their explanations is given by the website:

- `SO_2`: sulphur dioxide level measured in μg/m³. High levels can produce irritation in the skin and membranes, and worsen asthma or heart diseases in sensitive groups.
- `CO`: carbon monoxide level measured in mg/m³. Carbon monoxide poisoning involves headaches, dizziness and confusion in short exposures and can result in loss of consciousness, arrhythmias, seizures or even death.
- `NO_2`: nitrogen dioxide level measured in μg/m³. Long-term exposure is a cause of chronic lung diseases, and it is harmful to vegetation.
- `PM10`: particles smaller than 10 μm. Even though they cannot penetrate the alveolus, they can still penetrate through the lungs and affect other organs. Long-term exposure can result in lung cancer and cardiovascular complications.
- `NOx`: nitrogen oxides level measured in μg/m³. They affect the human respiratory system, worsening asthma or other diseases, and are responsible for the yellowish-brown color of photochemical smog.
- `O_3`: ozone level measured in μg/m³. High levels can produce asthma, bronchitis or other chronic pulmonary diseases in sensitive groups or outdoor workers.
- `TOL`: toluene (methylbenzene) level measured in μg/m³. Long-term exposure to this substance (present in tobacco smoke as well) can result in kidney complications or permanent brain damage.
- `BEN`: benzene level measured in μg/m³. Benzene is an eye and skin irritant, and long exposures may result in several types of cancer, leukaemia and anaemias. Benzene is considered a Group 1 carcinogen to humans.
- `EBE`: ethylbenzene level measured in μg/m³. Long-term exposure can cause hearing or kidney problems and the IARC has concluded that long-term exposure can produce cancer.
- `MXY`: m-xylene level measured in μg/m³. Xylenes can affect not only air but also water and soil, and a long exposure to high levels of xylenes can result in diseases affecting the liver, kidney and nervous system.
- `PXY`: p-xylene level measured in μg/m³. See MXY for xylene exposure effects on health.
- `OXY`: o-xylene level measured in μg/m³. See MXY for xylene exposure effects on health.
- `TCH`: total hydrocarbons level measured in mg/m³. This group of substances can be responsible for different blood, immune system, liver, spleen, kidney or lung diseases.
- `NMHC`: non-methane hydrocarbons (volatile organic compounds) level measured in mg/m³. Long exposure to some of these substances can result in damage to the liver, kidney, and central nervous system. Some of them are suspected to cause cancer in humans.

The goal of this assignment is to see if pollutants are decreasing (that is, whether air quality is improving) and to determine which pollutant decreased the most over the span of 5 years (2001 - 2006).
1. First do a plot of one of the pollutants (EBE). 
2. Next, group it by month and year; calculate the maximum value and plot it (to see the trend through time). 
3. Now we will look at which pollutant decreased the most. First we will look at pollution in 2001 (get the maximum value for each of the pollutants). And then do the same for 2006. 
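
Step 2 above, grouping by two columns and taking the maximum in each group, can be sketched on a made-up data frame. The names `air`, `yr`, `mo`, and `level` are hypothetical and are not the column names of the Madrid data.

```python
import pandas as pd

# Toy measurements: several readings spread over year/month combinations.
air = pd.DataFrame({
    "yr": [2001, 2001, 2001, 2002],
    "mo": [1, 1, 2, 1],
    "level": [4.0, 9.0, 5.0, 3.0],
})

# Passing a list to groupby forms one group per (year, month) pair;
# max then keeps the largest reading within each pair.
monthly_max = air.groupby(["yr", "mo"]).max().reset_index()
print(monthly_max)
```

The result has one row for every year and month combination present in the data, which is the shape needed to plot a trend through time.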

**Question 3.1** Multiple Choice: 
<br> {points: 1}

What big picture question are we trying to answer?

A. Did EBE decrease in Madrid between 2001 and 2006?

B. Of all the pollutants, which decreased the most between 2001 and 2006? 

C. Of all the pollutants, which decreased the least between 2001 and 2006?

D. Did EBE increase in Madrid between 2001 and 2006?

*Assign your answer to an object called `answer3_1`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(all_temp_plot)).encode("utf-8")+b"1797484f1e1c07db").hexdigest() == "d9fdb363988ed7db2bf648e79a330d74661cf72f", "type of all_temp_plot is not correct"
assert sha1(str(all_temp_plot).encode("utf-8")+b"1797484f1e1c07db").hexdigest() == "b79788207e90be4e58d07c240bd1e0ce8dc5629b", "value of all_temp_plot is not correct"

print('Success!')

**Question 3.2** 
<br> {points: 1}

To begin working with this data, read the file `madrid_pollution.csv`. Note, this file (just like the avocado and sea surface data sets) is found in the `data` directory.

*Assign your answer to an object called `madrid`.* 

> Hint: check out the data file in the editor mode to see which delimiter is used, and then select the proper `pandas` function.
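
To illustrate why the delimiter matters, here is a minimal sketch using a made-up, semicolon-separated file held in memory. The semicolon is an assumption for this example only and is not a statement about the actual file; the names `site` and `reading` are hypothetical.

```python
import io
import pandas as pd

# A pretend file whose values are separated by semicolons.
raw = io.StringIO("site;reading\nA;1.5\nB;2.0\n")

# With the default comma separator, every line is read as a single value,
# so we end up with one oddly named column.
default_parse = pd.read_csv(raw)

# Rewind the in-memory file and name the separator explicitly.
raw.seek(0)
parsed = pd.read_csv(raw, sep=";")

print(default_parse.shape)
print(parsed)
```

If a loaded data frame has a single column whose name looks like a whole header row, the wrong delimiter is a likely cause.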

In [None]:
# your code here
raise NotImplementedError
madrid

In [None]:
from hashlib import sha1
assert str(type(madrid is None)) == "<class 'bool'>", "type of madrid is None is not bool. madrid is None should be a bool"
assert str(madrid is None) == "False", "boolean value of madrid is None is not correct"

assert str(type(madrid)) == "<class 'pandas.core.frame.DataFrame'>", "type of type(madrid) is not correct"

assert str(type(madrid.shape)) == "<class 'tuple'>", "type of madrid.shape is not tuple. madrid.shape should be a tuple"
assert str(len(madrid.shape)) == "2", "length of madrid.shape is not correct"
assert str(sorted(map(str, madrid.shape))) == "['17', '51864']", "values of madrid.shape are not correct"
assert str(madrid.shape) == "(51864, 17)", "order of elements of madrid.shape is not correct"

assert str(type(madrid.columns.values)) == "<class 'numpy.ndarray'>", "type of madrid.columns.values is not correct"
assert str(madrid.columns.values) == "['date' 'BEN' 'CO' 'EBE' 'MXY' 'NMHC' 'NO_2' 'NOx' 'OXY' 'O_3' 'PM10'\n 'PXY' 'SO_2' 'TCH' 'TOL' 'year' 'mnth']", "value of madrid.columns.values is not correct"

assert sha1(str(type(sum(madrid.BEN.dropna()))).encode("utf-8")+b"00cd5236ef54b447").hexdigest() == "abb09b04fa2cbdc605ecfe9705fd44fff18cda5a", "type of sum(madrid.BEN.dropna()) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(madrid.BEN.dropna()), 2)).encode("utf-8")+b"00cd5236ef54b447").hexdigest() == "6473419e9dbe133f72eb3eae5a98a55c096bbc7d", "value of sum(madrid.BEN.dropna()) is not correct (rounded to 2 decimal places)"

print('Success!')

**Question 3.3**
<br> {points: 1}

Now that the data is loaded in Python, create a scatter plot that compares ethylbenzene (`EBE`) values against the date they were recorded. This graph will showcase the concentration of ethylbenzene in Madrid over time. As usual, label your axes: 

- x = Date
- y = Ethylbenzene (μg/m³)

*Assign your answer to an object called `EBE_pollution`.*

In [None]:
# ___ = alt.Chart(___).mark_point().encode(
#     x=alt.X(___).title(___),
#     y=alt.Y(___).title(___)
# ).properties(width=800)

# your code here
raise NotImplementedError
EBE_pollution

# Are levels increasing or decreasing?

In [None]:
from hashlib import sha1
assert sha1(str(type(EBE_pollution is None)).encode("utf-8")+b"4cb708238abdc5e9").hexdigest() == "460b7b52ef802724f01a67791803bde1c71a5764", "type of EBE_pollution is None is not bool. EBE_pollution is None should be a bool"
assert sha1(str(EBE_pollution is None).encode("utf-8")+b"4cb708238abdc5e9").hexdigest() == "18e244850c67dbcbb0d5c79b4f630cbb0e6de733", "boolean value of EBE_pollution is None is not correct"

assert sha1(str(type(EBE_pollution.encoding.x['shorthand'])).encode("utf-8")+b"7ae4aff62560fa0a").hexdigest() == "d76ea04c4a0106f15bef9567cf516a1044bcd1b4", "type of EBE_pollution.encoding.x['shorthand'] is not str. EBE_pollution.encoding.x['shorthand'] should be an str"
assert sha1(str(len(EBE_pollution.encoding.x['shorthand'])).encode("utf-8")+b"7ae4aff62560fa0a").hexdigest() == "3c201917d6c65c1a6d1974abc2109cfc5136500a", "length of EBE_pollution.encoding.x['shorthand'] is not correct"
assert sha1(str(EBE_pollution.encoding.x['shorthand'].lower()).encode("utf-8")+b"7ae4aff62560fa0a").hexdigest() == "966bdd4b9b04a401205a430b8405ce689d1bc141", "value of EBE_pollution.encoding.x['shorthand'] is not correct"
assert sha1(str(EBE_pollution.encoding.x['shorthand']).encode("utf-8")+b"7ae4aff62560fa0a").hexdigest() == "330381c3e1dccf40c9d81f51ca7049abbc54b229", "correct string value of EBE_pollution.encoding.x['shorthand'] but incorrect case of letters"

assert sha1(str(type(EBE_pollution.encoding.y['shorthand'])).encode("utf-8")+b"f294128950b2a475").hexdigest() == "99038cea099b91a83922f2515b87c124bdfbbdde", "type of EBE_pollution.encoding.y['shorthand'] is not str. EBE_pollution.encoding.y['shorthand'] should be an str"
assert sha1(str(len(EBE_pollution.encoding.y['shorthand'])).encode("utf-8")+b"f294128950b2a475").hexdigest() == "0b6627c42073a4a3e6e502703b6979356c5362b0", "length of EBE_pollution.encoding.y['shorthand'] is not correct"
assert sha1(str(EBE_pollution.encoding.y['shorthand'].lower()).encode("utf-8")+b"f294128950b2a475").hexdigest() == "634dc044adbfdf33e058a3bf5e984507fb28f194", "value of EBE_pollution.encoding.y['shorthand'] is not correct"
assert sha1(str(EBE_pollution.encoding.y['shorthand']).encode("utf-8")+b"f294128950b2a475").hexdigest() == "b375a60df2a3365f58090365b6bf5437fed9a1a5", "correct string value of EBE_pollution.encoding.y['shorthand'] but incorrect case of letters"

assert sha1(str(type(EBE_pollution.mark)).encode("utf-8")+b"86cb1d0852f1c063").hexdigest() == "943baf1e5ff016a0d996ac8de77569e84805df76", "type of EBE_pollution.mark is not str. EBE_pollution.mark should be an str"
assert sha1(str(len(EBE_pollution.mark)).encode("utf-8")+b"86cb1d0852f1c063").hexdigest() == "dd8956c58619861ff0af905eda2fcb5d974e6cf1", "length of EBE_pollution.mark is not correct"
assert sha1(str(EBE_pollution.mark.lower()).encode("utf-8")+b"86cb1d0852f1c063").hexdigest() == "9920c2dc7ca5ed16127907d7d394d70daf5e0c1e", "value of EBE_pollution.mark is not correct"
assert sha1(str(EBE_pollution.mark).encode("utf-8")+b"86cb1d0852f1c063").hexdigest() == "9920c2dc7ca5ed16127907d7d394d70daf5e0c1e", "correct string value of EBE_pollution.mark but incorrect case of letters"

assert sha1(str(type(isinstance(EBE_pollution.encoding.x['title'], str))).encode("utf-8")+b"d0f8af88b331e763").hexdigest() == "85da4b0aa49cedb08804bf8cd4a0750cdca8a720", "type of isinstance(EBE_pollution.encoding.x['title'], str) is not bool. isinstance(EBE_pollution.encoding.x['title'], str) should be a bool"
assert sha1(str(isinstance(EBE_pollution.encoding.x['title'], str)).encode("utf-8")+b"d0f8af88b331e763").hexdigest() == "c3f3b2647396925975d1c909122bb84536fcba4b", "boolean value of isinstance(EBE_pollution.encoding.x['title'], str) is not correct"

assert sha1(str(type(isinstance(EBE_pollution.encoding.y['title'], str))).encode("utf-8")+b"c2c48c2e65146526").hexdigest() == "4e5f6710b4ad4194c49de3b44bb28dd8e2446f7e", "type of isinstance(EBE_pollution.encoding.y['title'], str) is not bool. isinstance(EBE_pollution.encoding.y['title'], str) should be a bool"
assert sha1(str(isinstance(EBE_pollution.encoding.y['title'], str)).encode("utf-8")+b"c2c48c2e65146526").hexdigest() == "714648d70cafc963dfdfca118ea2bc79aad7bd35", "boolean value of isinstance(EBE_pollution.encoding.y['title'], str) is not correct"

print('Success!')

We can see from this plot that, over time, there are fewer and fewer high (> 25 μg/m³) EBE values.

**Question 3.4**
<br> {points: 1}

The question above asked you to visualize every EBE recording, and recordings are taken every single hour of every day. Consequently, the graph contains so many points that it is densely plotted and difficult to interpret. In this question, we will clean up the graph by focusing on the maximum EBE reading from each month. To further investigate whether this trend changes over time, we will use `groupby` and `max` to create a new data set.

Fill in the `___` in the cell below. 

*Assign your answer to an object called `madrid_pollution`.*
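
As a rough illustration of grouping and then taking a maximum, here is a small sketch on a made-up data frame. The data frame `toy` and its column names (`year`, `month`, `reading`) are assumptions for this example only and do not come from the Madrid data.

```python
import pandas as pd

# Hypothetical toy data: several readings per (year, month) pair.
toy = pd.DataFrame({
    "year": [2001, 2001, 2001, 2002],
    "month": [1, 1, 2, 1],
    "reading": [3.0, 7.0, 5.0, 2.0],
})

# Group by both columns, keep the largest reading in each group,
# then turn the group labels back into ordinary columns.
monthly_max = toy.groupby(["year", "month"]).max().reset_index()

print(monthly_max)
```

The result has one row per (year, month) pair, so two rows for 2001 and one row for 2002.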

In [None]:
# ___ = ___.groupby(["year", ___]).max("EBE").reset_index()

# your code here
raise NotImplementedError
madrid_pollution

In [None]:
from hashlib import sha1
assert sha1(str(type(madrid_pollution is None)).encode("utf-8")+b"a9cbd6409cc9d130").hexdigest() == "7a4546e1ec1565df8695c88294ea40250cdecaf0", "type of madrid_pollution is None is not bool. madrid_pollution is None should be a bool"
assert sha1(str(madrid_pollution is None).encode("utf-8")+b"a9cbd6409cc9d130").hexdigest() == "1d6108beb23871e95e2844f671311b31e01f2534", "boolean value of madrid_pollution is None is not correct"

assert sha1(str(type(madrid_pollution.shape)).encode("utf-8")+b"afd96c3a9ba96ab1").hexdigest() == "9a41ba044566d83fc124b05a7b6f1360fbd73e0d", "type of madrid_pollution.shape is not tuple. madrid_pollution.shape should be a tuple"
assert sha1(str(len(madrid_pollution.shape)).encode("utf-8")+b"afd96c3a9ba96ab1").hexdigest() == "c61ea70aa56ef41dba32451c6a441a3df82a75fe", "length of madrid_pollution.shape is not correct"
assert sha1(str(sorted(map(str, madrid_pollution.shape))).encode("utf-8")+b"afd96c3a9ba96ab1").hexdigest() == "6e0c03835e2de8647dcc58d28f74d2e09b74d576", "values of madrid_pollution.shape are not correct"
assert sha1(str(madrid_pollution.shape).encode("utf-8")+b"afd96c3a9ba96ab1").hexdigest() == "c037a4eea7cedcc09830f7ee0fb613ed7aceb283", "order of elements of madrid_pollution.shape is not correct"

assert sha1(str(type(sum(madrid_pollution.year))).encode("utf-8")+b"d9e98dc87c756dbc").hexdigest() == "b5b0e8870931af5478968406d10944757400044b", "type of sum(madrid_pollution.year) is not int. Please make sure it is int and not np.int64, etc. You can cast your value into an int using int()"
assert sha1(str(sum(madrid_pollution.year)).encode("utf-8")+b"d9e98dc87c756dbc").hexdigest() == "2a8855c166c6de8c3586bac9736606cd441d6493", "value of sum(madrid_pollution.year) is not correct"

print('Success!')

**Question 3.5**
<br> {points: 1}

Plot the new maximum EBE values versus the month they were recorded, split into side-by-side plots for each year. Again, we will use faceting (more on this next week) to plot each year side-by-side. 

*Assign your answer to an object called `madrid_plot`. Remember to label your axes.*

In [None]:
# ___ = alt.Chart(___).mark_point().encode(
#     x=alt.X(___).title(___),
#     y=alt.Y(___).title(___)
# ).facet("year")

# your code here
raise NotImplementedError
madrid_plot

In [None]:
from hashlib import sha1
assert str(type(madrid_plot is None)) == "<class 'bool'>", "type of madrid_plot is None is not bool. madrid_plot is None should be a bool"
assert str(madrid_plot is None) == "False", "boolean value of madrid_plot is None is not correct"

assert str(type(madrid_plot.facet)) == "<class 'altair.vegalite.v5.schema.channels.Facet'>", "type of madrid_plot.facet is not correct"
assert str(madrid_plot.facet) == "Facet({\n  shorthand: 'year'\n})", "value of madrid_plot.facet is not correct"

print('Success!')

**Question 3.6**
<br> {points: 1}

Now we want to see which of the pollutants has decreased the most. Therefore, we must repeat the same thing that we did in the questions above but for every pollutant (using the original data set)!  

First we will look at Madrid pollution in 2001 (filter for this year). Next we have to drop the columns that should be excluded (such as the date). Lastly, use the `max` function to create max values for all columns.

Note: The `max` function returns a pandas series. Since we will need a data frame for later exercises, we convert the series to a data frame using `pd.DataFrame`. Applying `transpose` to the data frame then turns each row into a column, which is also helpful for later exercises.
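
The series-to-data-frame-to-transpose steps described in the note can be sketched on a tiny made-up data frame. The names `toy`, `col_max`, `as_frame`, and `wide` are hypothetical and used only for this illustration.

```python
import pandas as pd

# Hypothetical toy data with two numeric columns.
toy = pd.DataFrame({"a": [1.0, 4.0], "b": [9.0, 2.0]})

# max() on a data frame returns a Series indexed by column name.
col_max = toy.max()

# Wrapping the Series gives a data frame with one column and one row per name.
as_frame = pd.DataFrame(col_max)

# transpose() flips it so that each name becomes a column again.
wide = as_frame.transpose()

print(type(col_max))
print(as_frame.shape)
print(wide.shape)
```

After the transpose, the result has a single row holding the maximum of each original column.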

Fill in the `___` in the cell below.

*Assign your answer to an object called `pollution_2001`.*

In [None]:
# ___ = pd.DataFrame(
#     madrid
#     [___]
#     .drop(columns=[___, ___, ___])
#     .___()
# ).transpose()

# your code here
raise NotImplementedError
pollution_2001

In [None]:
from hashlib import sha1
assert sha1(str(type(pollution_2001 is None)).encode("utf-8")+b"ae77bc4416691185").hexdigest() == "31e069fe9031ea63af7a794ec8259343f8db2aec", "type of pollution_2001 is None is not bool. pollution_2001 is None should be a bool"
assert sha1(str(pollution_2001 is None).encode("utf-8")+b"ae77bc4416691185").hexdigest() == "a46e6c75e872369d1dbfe74a5091495bb1981d33", "boolean value of pollution_2001 is None is not correct"

assert sha1(str(type(pollution_2001.shape)).encode("utf-8")+b"f13eef5ab6f60a0b").hexdigest() == "9c970af2e8fa37270b5a388c273bf5b5b7741641", "type of pollution_2001.shape is not tuple. pollution_2001.shape should be a tuple"
assert sha1(str(len(pollution_2001.shape)).encode("utf-8")+b"f13eef5ab6f60a0b").hexdigest() == "1f490d16b6cdde009d6f88bfe5db6a398df8f888", "length of pollution_2001.shape is not correct"
assert sha1(str(sorted(map(str, pollution_2001.shape))).encode("utf-8")+b"f13eef5ab6f60a0b").hexdigest() == "08c6eb791b7f2baa745d1bed340c5bf329064cf9", "values of pollution_2001.shape are not correct"
assert sha1(str(pollution_2001.shape).encode("utf-8")+b"f13eef5ab6f60a0b").hexdigest() == "13e3c3661c4ac3a9d6c1513bb2d7553dd821b65e", "order of elements of pollution_2001.shape is not correct"

assert sha1(str(type(pollution_2001.MXY.values)).encode("utf-8")+b"5e1857c46b8deefa").hexdigest() == "fe707ba2179f43fd17fb3c1e35ad479c26019f33", "type of pollution_2001.MXY.values is not correct"
assert sha1(str(pollution_2001.MXY.values).encode("utf-8")+b"5e1857c46b8deefa").hexdigest() == "fbf8891f803718a52bf3eaa69c4d559233c26a78", "value of pollution_2001.MXY.values is not correct"

assert sha1(str(type(pollution_2001.values.sum())).encode("utf-8")+b"9622d3a9b96bd867").hexdigest() == "9a00742f010b278587faeff4f45978c5ddad0cbe", "type of pollution_2001.values.sum() is not correct"
assert sha1(str(pollution_2001.values.sum()).encode("utf-8")+b"9622d3a9b96bd867").hexdigest() == "48b3b0f87d738921aea839a1c3c13d00fc9b6122", "value of pollution_2001.values.sum() is not correct"

print('Success!')

**Question 3.7**
<br> {points: 1}

Now repeat what you did for Question 3.6, but filter for 2006 instead. 

*Assign your answer to an object called `pollution_2006`.*

In [None]:
# your code here
raise NotImplementedError
pollution_2006

In [None]:
from hashlib import sha1
assert str(type(pollution_2006 is None)) == "<class 'bool'>", "type of pollution_2006 is None is not bool. pollution_2006 is None should be a bool"
assert str(pollution_2006 is None) == "False", "boolean value of pollution_2006 is None is not correct"

assert str(type(pollution_2006.shape)) == "<class 'tuple'>", "type of pollution_2006.shape is not tuple. pollution_2006.shape should be a tuple"
assert str(len(pollution_2006.shape)) == "2", "length of pollution_2006.shape is not correct"
assert str(sorted(map(str, pollution_2006.shape))) == "['1', '14']", "values of pollution_2006.shape are not correct"
assert str(pollution_2006.shape) == "(1, 14)", "order of elements of pollution_2006.shape is not correct"

assert sha1(str(type(pollution_2006.MXY.values)).encode("utf-8")+b"c2f2ca85a97b9a4c").hexdigest() == "3a4eabbf5d11b614d70cec2536125abe7776f85f", "type of pollution_2006.MXY.values is not correct"
assert sha1(str(pollution_2006.MXY.values).encode("utf-8")+b"c2f2ca85a97b9a4c").hexdigest() == "797841b8c70617b06dbfa1ddd9d25340748c2136", "value of pollution_2006.MXY.values is not correct"

assert sha1(str(type(pollution_2006.values.sum())).encode("utf-8")+b"cde21751885b9bb1").hexdigest() == "b676a45298c95d69bd644007f5bae551510f9cd5", "type of pollution_2006.values.sum() is not correct"
assert sha1(str(pollution_2006.values.sum()).encode("utf-8")+b"cde21751885b9bb1").hexdigest() == "cb2f861a2b50b6598ca0292bf4dda3fee6f0fb37", "value of pollution_2006.values.sum() is not correct"

print('Success!')

**Question 3.8** 
<br> {points: 1}

Which pollutant decreased by the greatest magnitude between 2001 and 2006? Given that the two objects you just created, `pollution_2001` and `pollution_2006`, are data frames with the same columns, you should be able to subtract one from the other to find which pollutant decreased by the greatest magnitude between the two years. 
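
Here is a minimal sketch of subtracting two data frames that share the same columns and row labels. The data frames `before` and `after` and their values are made up for illustration; they are not the pollution data.

```python
import pandas as pd

# Hypothetical one-row data frames with matching columns and index.
before = pd.DataFrame({"a": [10.0], "b": [5.0]})
after = pd.DataFrame({"a": [4.0], "b": [6.0]})

# Subtraction is element-wise, with values matched by row and column labels.
change = after - before

print(change)
```

A negative entry in `change` means the value went down between `before` and `after`, while a positive entry means it went up.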

*Assign your answer to an object called `answer3_8`. Make sure to write the answer exactly as it is given in the data set.* Example: 

```
answer3_8 = "BEN"
```

In [None]:
# your code here
raise NotImplementedError

In [None]:
from hashlib import sha1
assert sha1(str(type(answer3_8)).encode("utf-8")+b"4c2e617b84be8ff1").hexdigest() == "47e51c9670e4cd6bb5d6fce9556fade0de0bdcf7", "type of answer3_8 is not str. answer3_8 should be an str"
assert sha1(str(len(answer3_8)).encode("utf-8")+b"4c2e617b84be8ff1").hexdigest() == "f93536bbc5717e51ed6d2396903bf13c14af5a8e", "length of answer3_8 is not correct"
assert sha1(str(answer3_8.lower()).encode("utf-8")+b"4c2e617b84be8ff1").hexdigest() == "7880c020bd6dd9d517cb06c01f09d35a28fee667", "value of answer3_8 is not correct"
assert sha1(str(answer3_8).encode("utf-8")+b"4c2e617b84be8ff1").hexdigest() == "a113bfccd3bc21d847ac61fb60bd69339decb3ab", "correct string value of answer3_8 but incorrect case of letters"

print('Success!')

**Question 3.9**
<br> {points: 1}

Given that there were only 14 columns in the data frame above, you could use your eyes to pick out which pollutant decreased by the greatest magnitude between 2001 and 2006. But what would you do if you had 100 columns? Or 1000 columns? It would take A LONG TIME for your human eyeballs to find the biggest difference. Maybe you could use the `min` function by specifying `axis=1` (horizontally):

In [None]:
# run this cell
(pollution_2006 - pollution_2001).min(axis=1)

This is a step in the right direction, but you get the value and not the column name... What are we to do? Tidy our data! Our data is not in tidy format, and so it's difficult to access the values for the variable pollutant because they are stuck as column headers. Let's use `melt` to tidy our data and make it look like this:

| pollutant | value  |
|-----------|--------|
| BEN       | -33.04 |
| CO        | -6.91  |
| ...       | ...    |

To answer this question, fill in the `___` in the cell below. 

*Assign your answer to an object called `pollution_diff` and ensure it has the same column names as the table pictured above.*
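
To illustrate what `melt` does to a one-row, wide data frame, here is a small sketch on made-up data. The data frame `wide` and the column names `name` and `amount` are assumptions for this example; your answer should use the column names from the table above.

```python
import pandas as pd

# Hypothetical wide data: the variable names are stuck in the column headers.
wide = pd.DataFrame({"a": [-1.5], "b": [2.0], "c": [-4.0]})

# melt moves the headers into one column and the values into another.
long = wide.melt(var_name="name", value_name="amount")

print(long)
```

Each former column header is now a value in the `name` column, so it can be filtered and sorted like any other data.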

In [None]:
pollution_diff = pollution_2006 - pollution_2001
# pollution_diff = ___.melt(var_name=___, value_name=___)


# your code here
raise NotImplementedError
pollution_diff

In [None]:
from hashlib import sha1
assert sha1(str(type(pollution_diff is None)).encode("utf-8")+b"7306cff73ae9f427").hexdigest() == "9d45600b2e7925471a98da68d39dc81a09ef5406", "type of pollution_diff is None is not bool. pollution_diff is None should be a bool"
assert sha1(str(pollution_diff is None).encode("utf-8")+b"7306cff73ae9f427").hexdigest() == "6a46a5774219033d9b0412eb097674056a505155", "boolean value of pollution_diff is None is not correct"

assert sha1(str(type(pollution_diff.shape)).encode("utf-8")+b"18643c77a77e6817").hexdigest() == "cdbfc71723d36a5bf491b816424b337dd7f38c64", "type of pollution_diff.shape is not tuple. pollution_diff.shape should be a tuple"
assert sha1(str(len(pollution_diff.shape)).encode("utf-8")+b"18643c77a77e6817").hexdigest() == "06b5a26c6e0a4c32628545f1fdff1e42fe7d5aa7", "length of pollution_diff.shape is not correct"
assert sha1(str(sorted(map(str, pollution_diff.shape))).encode("utf-8")+b"18643c77a77e6817").hexdigest() == "ff0d10e00c30ab8f4610bb8073021461c56ba901", "values of pollution_diff.shape are not correct"
assert sha1(str(pollution_diff.shape).encode("utf-8")+b"18643c77a77e6817").hexdigest() == "f395897408484470f776ca61c922ddeba34a96c4", "order of elements of pollution_diff.shape is not correct"

assert sha1(str(type(pollution_diff.columns.values)).encode("utf-8")+b"2502bb7a518a63c5").hexdigest() == "c52fa60a7e1c47ee6cd3cffbe588aa214c27d1ee", "type of pollution_diff.columns.values is not correct"
assert sha1(str(pollution_diff.columns.values).encode("utf-8")+b"2502bb7a518a63c5").hexdigest() == "e5408c09374312118eb013786bb713f731dc0c0a", "value of pollution_diff.columns.values is not correct"

assert sha1(str(type(sum(pollution_diff.value))).encode("utf-8")+b"3304f5bff4363a5f").hexdigest() == "51df6adf613ea26bcd0abba42f4b499aec0b7ca1", "type of sum(pollution_diff.value) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(pollution_diff.value), 2)).encode("utf-8")+b"3304f5bff4363a5f").hexdigest() == "3f82b7402b3f00e68da212876f52c679936a1e0f", "value of sum(pollution_diff.value) is not correct (rounded to 2 decimal places)"

print('Success!')

**Question 3.10**
<br> {points: 1}

Now that you have tidy data, you can use `sort_values` with the argument `ascending=False` to sort the data in descending order. Each element of the `value` column corresponds to the change in a pollutant, so the *largest decrease* in a pollutant should be the *most negative entry*, i.e., the last row in the resulting data frame. Therefore, we can take the sorted data frame and chain it to `tail` (with the argument `1`) to return only the last row of the data frame.

(The function `tail` is just like `head`, except it returns the last rows of the data frame instead of the first rows.)
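
The sort-then-`tail` idea can be sketched on a small made-up data frame. The data frame `toy` and its columns `name` and `amount` are hypothetical and used only for this illustration.

```python
import pandas as pd

# Hypothetical tidy data: one row per name with a signed amount.
toy = pd.DataFrame({
    "name": ["a", "b", "c"],
    "amount": [-1.5, 2.0, -4.0],
})

# Sort from largest to smallest, so the most negative amount ends up last.
ordered = toy.sort_values(by="amount", ascending=False)

# tail(1) keeps only the final row of the sorted data frame.
most_negative = ordered.tail(1)

print(most_negative)
```

Because the data are tidy, the row that comes back carries the name along with the value.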

To answer this question, fill in the `___` in the cell below. 

*Assign your answer to an object called `max_pollution_diff`.*

In [None]:
# max_pollution_diff = ___.sort_values(by=___, ascending=False).tail(1)

# your code here
raise NotImplementedError
max_pollution_diff

In [None]:
from hashlib import sha1
assert sha1(str(type(max_pollution_diff is None)).encode("utf-8")+b"3a084ebb718f80c3").hexdigest() == "09d9377549ea3528d70761729b050fbe6ba9e4c4", "type of max_pollution_diff is None is not bool. max_pollution_diff is None should be a bool"
assert sha1(str(max_pollution_diff is None).encode("utf-8")+b"3a084ebb718f80c3").hexdigest() == "b72f27549c07757186afb00e7bdf7ae7239164fa", "boolean value of max_pollution_diff is None is not correct"

assert sha1(str(type(max_pollution_diff.shape)).encode("utf-8")+b"85fb5fcc7baac949").hexdigest() == "aeea8562cbb0a29cbf2b535870d6650ce3e31c65", "type of max_pollution_diff.shape is not tuple. max_pollution_diff.shape should be a tuple"
assert sha1(str(len(max_pollution_diff.shape)).encode("utf-8")+b"85fb5fcc7baac949").hexdigest() == "5264785ebcbd4fd37dc2a115e99760ce616b0eb6", "length of max_pollution_diff.shape is not correct"
assert sha1(str(sorted(map(str, max_pollution_diff.shape))).encode("utf-8")+b"85fb5fcc7baac949").hexdigest() == "2e05fd18b8e424806521b732715d7ab5c3e3067b", "values of max_pollution_diff.shape are not correct"
assert sha1(str(max_pollution_diff.shape).encode("utf-8")+b"85fb5fcc7baac949").hexdigest() == "118da06f3f9b12b96e3ee2ae8a1f6cb3d3e10f81", "order of elements of max_pollution_diff.shape is not correct"

assert sha1(str(type(max_pollution_diff.columns.values)).encode("utf-8")+b"e21c3ea0dfd435c9").hexdigest() == "5a056c33275b224eb6126532c2989b97071d8583", "type of max_pollution_diff.columns.values is not correct"
assert sha1(str(max_pollution_diff.columns.values).encode("utf-8")+b"e21c3ea0dfd435c9").hexdigest() == "defd0e363b5d74144f2d83ed82153c3b22189343", "value of max_pollution_diff.columns.values is not correct"

assert sha1(str(type(sum(max_pollution_diff.value))).encode("utf-8")+b"4b4e8e1ed3deea52").hexdigest() == "e739e87e40b13561fe922b460f023a1195a1596d", "type of sum(max_pollution_diff.value) is not float. Please make sure it is float and not np.float64, etc. You can cast your value into a float using float()"
assert sha1(str(round(sum(max_pollution_diff.value), 2)).encode("utf-8")+b"4b4e8e1ed3deea52").hexdigest() == "8458200f790c8f3159a695a15db142017c080bf8", "value of sum(max_pollution_diff.value) is not correct (rounded to 2 decimal places)"

print('Success!')

At the end of this data wrangling worksheet, we'll leave you with a couple of quotes to ponder:

> “Happy families are all alike; every unhappy family is unhappy in its own way.” –– Leo Tolstoy

> “Tidy datasets are all alike, but every messy dataset is messy in its own way.” –– Hadley Wickham
