# Worksheet 1 - Introduction to Pandas

## Exercises

In this set of practice exercises we'll investigate the carbon footprint of different foods, using a dataset compiled by [Kasia Kulma](https://r-tastic.co.uk/post/from-messy-to-tidy/) and contributed to [R's Tidy Tuesday project](https://github.com/rfordatascience/tidytuesday).

In [None]:
import pandas as pd
import numpy as np
from hashlib import sha1

### 1.1

rubric={autograde:1}

The dataset we'll be working with has the following columns:

|column      |description |
|:-------------|:-----------|
|country       | Country Name |
|food_category | Food Category |
|consumption   | Consumption (kg/person/year) |
|co2_emmission | CO2 emissions (kg CO2/person/year) |


Import the dataset as a DataFrame named `df` from this URL: <https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-18/food_consumption.csv>
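
As a minimal sketch of how `pd.read_csv` turns comma-separated text into a DataFrame, the snippet below parses a small made-up string through `io.StringIO` rather than fetching a URL. The column names and values are hypothetical and are not taken from the worksheet's dataset.

```python
import io

import pandas as pd

# Hypothetical CSV text: the first line supplies the column headers
csv_text = "country,kg\nA,1.5\nB,2.0\n"

# read_csv accepts a URL, a file path, or any file-like object
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)             # (rows, columns)
print(toy.columns.to_list())
```

Passing a URL string instead of the `StringIO` object should work the same way, assuming the notebook has network access.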

In [None]:
url = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-18/food_consumption.csv" # SOLUTION
df = pd.read_csv(url) # SOLUTION
df.head()

In [None]:
url = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-18/food_consumption.csv"
assert type(df) == pd.DataFrame, "Your answer should be a pandas DataFrame called df"
assert len(df) == 1430, "File not loaded correctly"
assert df.columns.to_list() == ['country', 'food_category', 'consumption', 'co2_emmission'], "Headers are missing"

### 1.2

rubric={autograde:1}

Create a DataFrame that contains a single column, the one containing food categories. Save it in a variable called `food`.
Note this must be a DataFrame, not a Series, so be careful with your bracketing!
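
The bracketing difference can be sketched on a toy DataFrame (the `fruit` and `kg` columns are made up for illustration): a single column name returns a Series, while a list of column names returns a DataFrame, even when the list holds only one name.

```python
import pandas as pd

# Hypothetical toy data, not the worksheet's dataset
toy = pd.DataFrame({"fruit": ["apple", "pear"], "kg": [1.5, 2.0]})

as_series = toy["fruit"]    # single brackets: a Series
as_frame = toy[["fruit"]]   # list of names inside the brackets: a DataFrame

print(type(as_series).__name__)
print(type(as_frame).__name__)
print(as_frame.shape)
```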

In [None]:
food = df[['food_category']] # SOLUTION
food

In [None]:
assert type(food) == pd.DataFrame, "Answer is not a DataFrame"
assert food.shape[1] == 1, '`food` has the wrong number of columns'
assert food.shape[0] == 1430, '`food` has the wrong number of rows'
assert 'food_category' in food.columns, "food_category column is missing or misnamed"
assert food.nunique().sum() == 11, "Column contains wrong data"

### 1.3

rubric={autograde:1}

Find the country in the row at integer position 1234 (counting from zero, as `.iloc` does). Save it in a variable called `country_1234`.
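
As a rough illustration of positional versus label-based row access, the sketch below uses a hypothetical three-row DataFrame whose index labels deliberately differ from the row positions. With a default index (0, 1, 2, ...) the two approaches would pick the same row.

```python
import pandas as pd

# Hypothetical data with non-default index labels
toy = pd.DataFrame({"country": ["A", "B", "C"]}, index=[10, 20, 30])

by_position = toy.iloc[1]["country"]  # second row, counting from zero
by_label = toy.loc[20, "country"]     # row whose index label is 20

print(by_position, by_label)
```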

In [None]:
country_1234 = df.iloc[1234]['country'] # SOLUTION
country_1234

In [None]:
assert type(country_1234) == str, '`country_1234` should be of type `str`'
assert sha1(country_1234.encode('utf-8')).hexdigest() == 'f92bcb6a06d2ec7c0af7c8a338f131bf887c64a0', "Wrong country"

### 1.4

rubric={autograde:1}

What is the mean `co2_emmission` of the whole dataset? (Note that the column name is spelled with a double "m" in the dataset.) Save the answer to a variable named `mean_co2_emmission`. Your answer should be a `np.float64`.
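
A small sketch of why the expected type is `np.float64` rather than a plain `float`: reductions such as `.mean()` on a numeric Series typically hand back NumPy scalars. The values below are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical values, chosen so the mean is easy to check by hand
values = pd.Series([1.0, 2.0, 6.0])

avg = values.mean()  # (1 + 2 + 6) / 3

print(avg)
print(type(avg))
```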

In [None]:
mean_co2_emmission = df["co2_emmission"].mean() # SOLUTION
mean_co2_emmission

In [None]:
assert type(mean_co2_emmission) == np.float64, '`mean_co2_emmission` should be of type `np.float64`'
assert sha1(str(round(mean_co2_emmission, 5)).encode('utf-8')).hexdigest() == '5d78a1167015eacc6678b844c55c8640a84792f1', 'mean is calculated incorrectly'

### 1.5

rubric={autograde:1}

How many different kinds of foods are there in the dataset? How many countries are in the dataset? Save your answers in two variables named `n_food` and `n_country`, both of type `int`.
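
One way to count distinct values, sketched on a made-up Series: `.nunique()` returns the count as a plain `int`, while `.unique()` returns the distinct values themselves.

```python
import pandas as pd

# Hypothetical values with one repeat
toy = pd.Series(["beef", "rice", "beef", "eggs"])

n_distinct = toy.nunique()  # number of distinct values
distinct = toy.unique()     # array holding the distinct values

print(n_distinct)
print(distinct)
```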

In [None]:
n_food = df["food_category"].nunique() # SOLUTION
n_country = df["country"].nunique() # SOLUTION

n_food

In [None]:
n_country

In [None]:
assert type(n_food) == int, '`n_food` should be of type int'
assert type(n_country) == int, '`n_country` should be of type int'
assert n_food == 11, "Your answer is incorrect"
assert n_country == 130, "Your answer is incorrect"

### 1.6

rubric={autograde:1}

Sort the dataframe by CO2 emissions, from largest to smallest, then take the first 11 rows of the last column as a Series. Save this Series to a variable called `sorted_co2_emmission`.
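
The sort-then-slice pattern can be sketched on a hypothetical four-row DataFrame. One detail worth noticing is that sorting keeps each row's original index label, so `.iloc` (position after sorting) and `.loc` (original label) can refer to different rows.

```python
import pandas as pd

# Hypothetical data; "score" is the last column
toy = pd.DataFrame({"name": ["a", "b", "c", "d"], "score": [3, 9, 1, 7]})

# Sort descending, then take the first 2 rows of the last column
top2 = toy.sort_values("score", ascending=False).iloc[:2, -1]

print(top2.index.to_list())  # original labels of the two largest scores
print(top2.iloc[0])          # first element by position after sorting
print(top2.loc[3])           # element whose original label is 3
```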

In [None]:
sorted_co2_emmission = df.sort_values('co2_emmission', ascending=False).iloc[:11, -1] # SOLUTION
sorted_co2_emmission

In [None]:
assert type(sorted_co2_emmission) == pd.core.series.Series, '`sorted_co2_emmission` should be of type pd.core.series.Series'
assert len(sorted_co2_emmission) == 11, "Incorrect number of rows"
assert sha1(str(sorted_co2_emmission.loc[2]).encode('utf-8')).hexdigest() == '926a777ba35253ac61b64714f054df421bb74f69', "Incorrect column values"
assert sha1(str(sorted_co2_emmission.iloc[-1]).encode('utf-8')).hexdigest() == '2890f860087b19a34f9339f77df0e34c223f7ce7', "Incorrect column values"

### 1.7

rubric={autograde:1}

Assume that the `consumption` column represents kilograms. Create a new dataframe named `modified_df` with a new column called `consumption_lbs` which contains the equivalent in pounds. 
Use the approximation 1 kg = 2.2 pounds.
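
As a minimal sketch of adding a derived column to a copy, the example below converts a made-up `grams` column to kilograms. Arithmetic on a column is applied element-wise, and working on `.copy()` leaves the original DataFrame untouched.

```python
import pandas as pd

# Hypothetical data, unrelated to the worksheet's columns
original = pd.DataFrame({"grams": [500.0, 1500.0]})

converted = original.copy()                             # independent copy
converted["kilograms"] = converted["grams"] / 1000      # element-wise division

print(converted)
print(original.columns.to_list())  # the original has no new column
```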

In [None]:
# make a copy of df to modify
modified_df = df.copy()
# BEGIN SOLUTION
modified_df['consumption_lbs'] = modified_df['consumption'] * 2.2
# END SOLUTION
modified_df.head()

In [None]:
assert type(modified_df) == pd.DataFrame, '`modified_df` should be of type pd.DataFrame'
assert modified_df.shape[1] == 5, '`modified_df` has the wrong number of columns'
assert modified_df.shape[0] == 1430, '`modified_df` has the wrong number of rows'
assert 'consumption_lbs' in modified_df.columns, "consumption_lbs missing or misnamed"
assert sha1(str(modified_df.loc[2, 'consumption_lbs']).encode('utf-8')).hexdigest() == 'adedb476f4eee7fa8b790246fc963c78759dc776', "Incorrect calculation for pounds"

### 1.8

rubric={autograde:1}

Find the total consumption in pounds (lbs) for Canada, across all food categories. Your answer should be of type `np.float64`. Save the result in a variable called `canada`. *Hint:* You might want to set the index to the `country` column.
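
The hint can be sketched on hypothetical data: once a column becomes the index, `.loc` with a label that appears several times returns every matching row, which can then be summed. A boolean mask is shown as an alternative that skips the re-indexing step.

```python
import pandas as pd

# Hypothetical data with a repeated "team" value
toy = pd.DataFrame({"team": ["red", "blue", "red"], "points": [2.0, 5.0, 3.0]})

# Option 1: index by team, then select all rows labelled "red"
indexed = toy.set_index("team")
red_total = indexed.loc["red", "points"].sum()

# Option 2: filter with a boolean mask on the original frame
red_total_mask = toy.loc[toy["team"] == "red", "points"].sum()

print(red_total, red_total_mask)
```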

In [None]:
# BEGIN SOLUTION
d = modified_df.set_index('country')
canada = d.loc['Canada']['consumption_lbs'].sum()
# END SOLUTION
canada

In [None]:
assert type(canada) == np.float64, '`canada` should be of type np.float64'
assert sha1(str(round(canada, 2)).encode('utf-8')).hexdigest() == '33287b20c5d249c0ab56538a8f4a5bb3b74a5dda', "Incorrect value"

### 1.9

rubric={autograde:1}

Further modify `modified_df` by creating a new column called `co2_per_kilo` which shows the ratio of CO2 emissions to kilograms consumed.
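
A brief sketch of element-wise division between two columns, using made-up numbers. It also shows what pandas does when a denominator is zero, in case any row happens to have zero in the denominator: the result is `inf` (or `NaN` for 0/0) rather than an error.

```python
import numpy as np
import pandas as pd

# Hypothetical data, including zero denominators
toy = pd.DataFrame({"emissions": [10.0, 0.0, 4.0], "kg": [2.0, 0.0, 0.0]})

# Row-by-row ratio of the two columns
toy["ratio"] = toy["emissions"] / toy["kg"]

print(toy)
```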

In [None]:
# BEGIN SOLUTION
modified_df['co2_per_kilo'] = modified_df['co2_emmission'] / modified_df['consumption']
# END SOLUTION
modified_df.head()

In [None]:
assert type(modified_df) == pd.DataFrame, '`modified_df` should be of type pd.DataFrame'
assert modified_df.shape[1] == 6, '`modified_df` has the wrong number of columns'
assert modified_df.shape[0] == 1430, '`modified_df` has the wrong number of rows'
assert 'co2_per_kilo' in modified_df.columns, 'co2_per_kilo column is missing or misnamed'
assert sha1(str(round(modified_df.iloc[1]['co2_per_kilo'], 3)).encode('utf-8')).hexdigest() == '086958593cfacc2890be3289dec2725a9b4a4d5f', "Incorrect calculation"

Congratulations! You have finished the worksheet! Pat yourself on the back and submit your worksheet to Gradescope!