## Worksheet 2 - Renaming and reshaping

In [None]:
import pandas as pd
import numpy as np
from hashlib import sha1

## Exercises

In this set of practice exercises we'll investigate the carbon footprint of different foods. We'll use a dataset compiled by [Kasia Kulma](https://r-tastic.co.uk/post/from-messy-to-tidy/) and contributed to [R's Tidy Tuesday project](https://github.com/rfordatascience/tidytuesday). This is the same dataset we used in Lecture 1.

In [None]:
url = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-18/food_consumption.csv"
df = pd.read_csv(url)
df.head()

### Q1

rubric={autograde:1}

Create a dataframe to store the countries that produce more than 1000 kg CO2/person/year for at least one food type. Name the dataframe `df_high_co2`.

In [None]:
df_high_co2 = df.query("co2_emmission > 1000") # SOLUTION
df_high_co2

In [None]:
assert type(df_high_co2) == pd.DataFrame, "Your answer should be a pandas DataFrame called df_high_co2"
assert df_high_co2.shape[0] == 5, "`df_high_co2` should have 5 rows"
assert df_high_co2.shape[1] == 4, "`df_high_co2` should have 4 columns"
assert df_high_co2['co2_emmission'].min() > 1000, "Incorrect values in the `co2_emmission` column."

### Q2

rubric={autograde:1}

Which country consumes the least amount of beef per person per year? Save the answer as a string in a variable called `least_beef`.
*Hint*: This will require multiple steps of filtering, sorting, and locating data.
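
As a rough illustration of the filter, sort, locate pattern mentioned in the hint, here is a minimal sketch on a small made-up table. The `toy` dataframe and its values are hypothetical and are not part of the worksheet's dataset.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration only
toy = pd.DataFrame({
    "country": ["A", "B", "C", "A", "B", "C"],
    "food_category": ["Beef", "Beef", "Beef", "Rice", "Rice", "Rice"],
    "consumption": [12.0, 3.5, 7.1, 1.0, 40.2, 9.9],
})

# Step 1: filter the rows down to a single food category
beef_rows = toy.query("food_category == 'Beef'")

# Step 2: sort the remaining rows from lowest to highest consumption
beef_sorted = beef_rows.sort_values(by="consumption")

# Step 3: locate the first row by position and pull out one value
lowest_country = beef_sorted.iloc[0]["country"]
print(lowest_country)
```

Each step returns a new object, so the three steps can also be chained into a single expression if you prefer.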

In [None]:
# BEGIN SOLUTION
least_beef =  df.query("food_category == 'Beef'").sort_values(by="consumption").iloc[0]
least_beef = least_beef['country']
# END SOLUTION
least_beef

In [None]:
assert type(least_beef) == str
assert sha1(least_beef.encode('utf-8')).hexdigest() == '1ed5dd9d833f675b7509886681e2164d842f8dad', "Wrong answer"

### Q3

rubric={autograde:1}

Rename the columns of `df` such that:
- `food_category` becomes `category`
- `co2_emmission` becomes `co2`
- `country` becomes `nation`

When answering this, be sure to modify the original `df`; do not create a new dataframe with a different name.
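
One way to rename columns by name is `DataFrame.rename` with a `columns` mapping. The sketch below uses a hypothetical one-row `toy` dataframe rather than the worksheet's `df`. Unlike assigning a full list to `.columns`, a mapping does not depend on the order of the columns.

```python
import pandas as pd

# Hypothetical one-row table with the same column names as the worksheet data
toy = pd.DataFrame({
    "country": ["A"],
    "food_category": ["Beef"],
    "consumption": [1.0],
    "co2_emmission": [2.0],
})

# Map old names to new names; columns not listed keep their name.
# Re-assigning to the same variable keeps the original name in use.
toy = toy.rename(columns={
    "food_category": "category",
    "co2_emmission": "co2",
    "country": "nation",
})
print(toy.columns.tolist())
```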

In [None]:
# BEGIN SOLUTION
df.columns = ['nation', 'category', 'consumption', 'co2']
# END SOLUTION
df.head()

In [None]:
assert type(df) == pd.DataFrame, "`df` should be a pandas DataFrame called `df`"
assert df.shape[0] == 1430, "`df` should have 1430 rows"
assert df.shape[1] == 4, "`df` should have 4 columns"
assert df.columns.tolist() == ['nation', 'category', 'consumption', 'co2'], "Incorrect column names"

### Q4

rubric={autograde:1}

Make a new DataFrame consisting of all the countries that consume less than 10 kg of rice per person per year. Save this in a variable named `rice` and sort it from lowest to highest consumption. Don't forget that some of the column names were changed in Q3 above.

In [None]:
# BEGIN SOLUTION
rice = df.query("category == 'Rice' & consumption < 10")
rice = rice.sort_values('consumption')
# END SOLUTION
rice.head()

In [None]:
assert type(rice) == pd.DataFrame, "`rice` should be a pandas DataFrame called `rice`"
assert rice.shape[0] == 60, "`rice` should have 60 rows"
assert rice.shape[1] == 4, "`rice` should have 4 columns"
assert rice.iloc[0]['consumption'] == 0.95, "Dataframe is not sorted properly"

### Q5

rubric={autograde:1}

Create a DataFrame containing the rows for countries that eat over 10 kg of either beef or poultry. Save this to a variable called `beef_or_chicken`.

*Hint*: This is a complex condition, so it might be easier to write out with `.query()`.
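
To show how a compound condition can be written with `.query()`, here is a small sketch on hypothetical data (the `toy` dataframe is made up). It groups the "either/or" part in parentheses so the intent does not rely on operator precedence, and shows an equivalent form using `in`.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration only
toy = pd.DataFrame({
    "category": ["Beef", "Poultry", "Rice", "Beef", "Poultry"],
    "consumption": [15.0, 8.0, 30.0, 4.0, 22.0],
})

# Parentheses make the grouping explicit: (either category) AND (threshold)
picked_a = toy.query(
    "(category == 'Beef' or category == 'Poultry') and consumption > 10"
)

# An equivalent way to express membership in a set of categories
picked_b = toy.query("category in ['Beef', 'Poultry'] and consumption > 10")

print(picked_a)
```

Both queries keep the original row labels, so the result still shows which rows of `toy` matched.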

In [None]:
# BEGIN SOLUTION
condition = "(category == 'Beef' & consumption > 10) | (category == 'Poultry' & consumption > 10)"
beef_or_chicken = df.query(condition)
# END SOLUTION
beef_or_chicken.head()

In [None]:
assert type(beef_or_chicken) == pd.DataFrame, "`beef_or_chicken` should be a pandas DataFrame called `beef_or_chicken`"
assert beef_or_chicken.shape[0] == 154, "`beef_or_chicken` should have 154 rows"
assert beef_or_chicken.shape[1] == 4, "`beef_or_chicken` should have 4 columns"
assert beef_or_chicken.loc[1343]['consumption'] == 13.69, "Incorrect data in the consumption column"

### Q6

rubric={autograde:1}

We're now going to practice tidying data. Remember, to tidy data we need to think about the data and the statistical question we would like to ask about it. Consider this question for the food carbon footprint dataset we have been working with:

*Is there a relationship between the amount of a food type consumed and the $CO_2$ emission from that food type? And does this differ depending on the country where the food is grown and consumed?*

Considering this question, is the version of the data below tidy? If not, use the appropriate pandas function to tidy the data so that it is. Save this as a data frame called `df2_tidy`.

> Note: Do not reset the index or modify the column names from your output from `.pivot()` or `.melt()` in this question.
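
As a reminder of what reshaping from long to wide looks like, here is a minimal sketch on a hypothetical four-row table (`toy_long` is made up). In the long form each measurement sits on its own row; after `.pivot()` each variable gets its own column and each row is one country and food pair. Passing a list to `index` assumes pandas 1.1 or newer.

```python
import pandas as pd

# Hypothetical long-format data: one row per (country, food, metric)
toy_long = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "food_category": ["Beef", "Beef", "Beef", "Beef"],
    "metrics": ["consumption", "co2_emmission", "consumption", "co2_emmission"],
    "measurements": [10.0, 300.0, 2.0, 60.0],
})

# Spread the values of `metrics` into separate columns
toy_wide = toy_long.pivot(
    index=["country", "food_category"],
    columns="metrics",
    values="measurements",
)
print(toy_wide)
```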

In [None]:
df2 = pd.read_csv('data/food_consumption2.csv')
df2.head()

In [None]:
df2.tail()

In [None]:
# BEGIN SOLUTION
df2_tidy = df2.pivot(index=['country', 'food_category'], 
                     columns=['metrics'], 
                     values=['measurements']
                    )

# END SOLUTION
df2_tidy.head()

In [None]:
assert type(df2_tidy) == pd.DataFrame, "`df2_tidy` should be a pandas DataFrame called `df2_tidy`"
assert df2_tidy.shape[0] == 1430, "`df2_tidy` should have 1430 rows"
assert df2_tidy.shape[1] == 2, "`df2_tidy` should have 2 columns"
cols = df2_tidy.columns.to_list()
if type(df2_tidy.columns.to_list()[0]) == tuple: cols = [item for sublist in df2_tidy.columns.to_list() for item in sublist]
assert 'co2_emmission' in cols and 'consumption' in cols, "Both 'co2_emmission' and 'consumption' must be in the list"
assert 'metrics' in list(df2_tidy.columns.names)
assert df2_tidy.index.names == ['country', 'food_category'], "`df2_tidy` should have 'country', 'food_category' as the index names."

### Q7

rubric={autograde:1}

When `.pivot()` is called with multiple column names passed to `index`, the values in those columns become the “name” (label) of each row, and these labels, rather than simple integers, are what you use when you select rows with `.loc`. This can be confusing… Use `.reset_index()` to give `df2_tidy` the usual, expected behaviour, where each row is “named” with an integer. This is a subtle point, but the main take-away is that when you call `.pivot()`, it is a good idea to call `.reset_index()` afterwards.

Name your new data frame `df2_tidy_index`.
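
To make the difference between the two kinds of row labels concrete, here is a small sketch on hypothetical data (`toy_long` is made up). Before `.reset_index()` a row is selected with a tuple of labels; afterwards it is selected with an integer.

```python
import pandas as pd

# Hypothetical long-format data, made up for illustration only
toy_long = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "food_category": ["Beef", "Beef", "Beef", "Beef"],
    "metrics": ["consumption", "co2_emmission", "consumption", "co2_emmission"],
    "measurements": [10.0, 300.0, 2.0, 60.0],
})

toy_wide = toy_long.pivot(
    index=["country", "food_category"],
    columns="metrics",
    values="measurements",
)

# Rows are labelled by (country, food_category) tuples
by_label = toy_wide.loc[("A", "Beef"), "consumption"]

# After reset_index the labels move into ordinary columns
# and the rows are labelled 0, 1, 2, ...
toy_flat = toy_wide.reset_index()
by_integer = toy_flat.loc[0, "consumption"]

print(toy_flat)
```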

In [None]:
#df2_tidy_index = df2_tidy
df2_tidy_index = df2_tidy.reset_index() # SOLUTION
df2_tidy_index.head()

In [None]:
assert type(df2_tidy_index) == pd.DataFrame, "`df2_tidy_index` should be a pandas DataFrame called `df2_tidy_index`"
assert df2_tidy_index.shape[0] == 1430, "`df2_tidy_index` should have 1430 rows"
assert df2_tidy_index.shape[1] == 4, "`df2_tidy_index` should have 4 columns"
assert list(df2_tidy_index.index.names) == [None]
cols_index = df2_tidy_index.columns.to_list()
if type(df2_tidy_index.columns.to_list()[0]) == tuple: cols_index = [item for sublist in df2_tidy_index.columns.to_list() for item in sublist]
assert 'country' in cols_index

### Q8

rubric={autograde:1}

When we perform the `.pivot()` operation, it also keeps the original column names and adds the new column name as a second column name. Having two names for a column can be confusing! So we should rename the columns in the data frame so that they only have one name. 

Name your new data frame `df2_tidy_index_renamed`.
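
The two-level column names appear when a list is passed to `values` in `.pivot()`. The sketch below reproduces this on hypothetical data (`toy_long` is made up) and shows one possible way to collapse the names to a single level: keep the lower-level name when there is one, otherwise the upper-level name. Assigning an explicit list of names to `.columns` works just as well.

```python
import pandas as pd

# Hypothetical long-format data, made up for illustration only
toy_long = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "food_category": ["Beef", "Beef", "Beef", "Beef"],
    "metrics": ["consumption", "co2_emmission", "consumption", "co2_emmission"],
    "measurements": [10.0, 300.0, 2.0, 60.0],
})

# Passing lists to `columns` and `values` produces two-level column names
toy_two_level = toy_long.pivot(
    index=["country", "food_category"],
    columns=["metrics"],
    values=["measurements"],
).reset_index()
levels_before = toy_two_level.columns.nlevels

# Each column name is a tuple such as ('measurements', 'consumption')
# or ('country', ''); keep the second part unless it is empty
flat_names = [lower if lower else upper for upper, lower in toy_two_level.columns]

toy_one_level = toy_two_level.copy()
toy_one_level.columns = flat_names
print(toy_one_level.columns.tolist())
```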

In [None]:
df2_tidy_index_renamed = df2_tidy_index.copy() # make a copy so as not to modify the object in Q7
# BEGIN SOLUTION
df2_tidy_index_renamed.columns = [
    "country",
    "food_category",
    "co2_emmission",
    "consumption"
]
# END SOLUTION
df2_tidy_index_renamed.head()

In [None]:
assert type(df2_tidy_index_renamed) == pd.DataFrame, "`df2_tidy_index_renamed` should be a pandas DataFrame called `df2_tidy_index_renamed`"
assert df2_tidy_index_renamed.shape[0] == 1430, "`df2_tidy_index_renamed` should have 1430 rows"
assert df2_tidy_index_renamed.shape[1] == 4, "`df2_tidy_index_renamed` should have 4 columns"
assert list(df2_tidy_index_renamed.columns) == ['country', 'food_category', 'co2_emmission', 'consumption']
assert list(df2_tidy_index_renamed.index.names) == [None]

Congratulations! You are done with the worksheet! Pat yourself on the back, and submit your worksheet to Gradescope!