# Exercises

## Useful links

1. [API Reference](https://plotnine.org/reference)
2. [Gallery](https://plotnine.org/gallery/)
3. [Tutorials](https://plotnine.org/gallery/)

Grab the data and functions from the `tutorial` notebook.

In [1]:
from plotnine import *
import pandas as pd
import numpy as np

def group_top(data, group, column, n):
    """
    Select the top n rows from each group in a dataframe based on a specified column
    """
    def top_n(gdf):
        return gdf.sort_values(column, ascending=False).head(n)
    return (
        data.groupby(group)
        .apply(top_n, include_groups=False)
        .reset_index()
        [data.columns]
    )

def group_bottom(data, group, column, n):
    """
    Select the bottom n rows from each group in a dataframe based on a specified column
    """
    def bottom_n(gdf):
        return gdf.sort_values(column, ascending=True).head(n)
    return (
        data.groupby(group)
        .apply(bottom_n, include_groups=False)
        .reset_index()
        [data.columns]
    )

def group_multi_top(data, group, columns, n):
    """
    Select the top n rows from each group in a dataframe for multiple columns
    """
    frames = [group_top(data, group, col, n) for col in columns]
    return (
        pd.concat(frames)
        .drop_duplicates()
        .sort_values(by=[group, *columns])
        .reset_index(drop=True)
    )

def group_multi_bottom(data, group, columns, n):
    """
    Select the bottom n rows from each group in a dataframe for multiple columns
    """
    frames = [group_bottom(data, group, col, n) for col in columns]
    return (
        pd.concat(frames)
        .drop_duplicates()
        .sort_values(by=[group, *columns])
        .reset_index(drop=True)
    )

def group_sum(data, group, columns):
    """
    Sum up all columns in a group
    """
    if isinstance(group, str):
        group = [group]
    else:
        group = list(group)
    return data.groupby(group)[columns].agg("sum").reset_index()

def dataframe_difference(df1, df2):
    """
    Remove all rows in df1 that are also in df2.
    """
    # Ensure the column order and names match for comparison
    common_columns = df1.columns.intersection(df2.columns)
    # Convert rows to tuples and compare them
    s1 = df1[common_columns].apply(tuple, axis=1)
    s2 = df2[common_columns].apply(tuple, axis=1)
    return df1.loc[~s1.isin(s2)].reset_index(drop=True)

def _rename_columns(s):
    if s.endswith("population"):
        return f"{s.split()[0]}"
    return s.replace("(km²)","").strip().replace(" ", "_")

# Units
# area - km²
# density - /km²

population_data = (
    pd.read_csv("data/world_population_data.csv")
    .rename(_rename_columns, axis=1)
    .drop(columns=["rank"])
    .melt(
        id_vars=["country", "continent", "area"],
        value_vars=["2023", "2022", "2020", "2015", "2010", "2000", "1990", "1980", "1970"],
        var_name="year",
        value_name="population",
    )
    .astype({"year": int})
)

def by_year(data, year):
    """
    Return a subset of the data for a given year
    """
    return data[data["year"] == year].reset_index(drop=True)

def by_continent(data, continent):
    """
    Return a subset of the data for a given continent
    """
    return data[data["continent"] == continent].reset_index(drop=True)

thousand = 10 ** 3
million = 10 ** 6
billion = 10 ** 9

## Practice data

If you finish any of the exercises early and want something else to do, try visualising any of these subset dataframes. You can also take further subsets of them. Have fun!

In [2]:
population_data_2023 = by_year(population_data, 2023)

africa_data = by_continent(population_data, "Africa")
asia_data = by_continent(population_data, "Asia")
north_america_data = by_continent(population_data, "North America")
south_america_data = by_continent(population_data, "South America")
europe_data = by_continent(population_data, "Europe")
oceania_data = by_continent(population_data, "Oceania")
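
Subsets can be narrowed further by chaining the helpers. The sketch below repeats `by_year` and `by_continent` from the first cell so that it runs on its own, and uses a made-up `toy` dataframe (hypothetical countries and numbers, not the real data) in the same long format.

```python
import pandas as pd

def by_year(data, year):
    """Return a subset of the data for a given year (copied from the cell above)."""
    return data[data["year"] == year].reset_index(drop=True)

def by_continent(data, continent):
    """Return a subset of the data for a given continent (copied from the cell above)."""
    return data[data["continent"] == continent].reset_index(drop=True)

# Hypothetical toy rows, shaped like `population_data`
toy = pd.DataFrame({
    "country": ["A", "B", "C", "A", "B", "C"],
    "continent": ["Africa", "Asia", "Africa", "Africa", "Asia", "Africa"],
    "year": [2022, 2022, 2022, 2023, 2023, 2023],
    "population": [10, 20, 30, 11, 22, 33],
})

# First keep one continent, then keep one year of that subset
toy_africa = by_continent(toy, "Africa")
toy_africa_2023 = by_year(toy_africa, 2023)
print(toy_africa_2023)
```

With the real data, the same pattern would be `by_year(africa_data, 2023)`.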

## Part 1: `continents_data`

In [3]:
continents_data = group_sum(population_data, ["continent", "year"], "population")
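
To get a feel for the shape of `continents_data` before plotting it, here is a minimal sketch of the aggregation that `group_sum` performs. It runs on a made-up `toy` dataframe (hypothetical values), so it only illustrates the structure: one row per continent and year, with the populations of the countries summed.

```python
import pandas as pd

# Hypothetical toy rows, shaped like `population_data`
toy = pd.DataFrame({
    "country": ["A", "B", "C", "A", "B", "C"],
    "continent": ["Africa", "Africa", "Asia", "Africa", "Africa", "Asia"],
    "year": [2022, 2022, 2022, 2023, 2023, 2023],
    "population": [10, 20, 30, 11, 22, 33],
})

# Same steps as group_sum(toy, ["continent", "year"], "population")
toy_continents = (
    toy.groupby(["continent", "year"])["population"]
    .agg("sum")
    .reset_index()
)
print(toy_continents)
```

The result has the columns `continent`, `year` and `population`, which are the names you map to aesthetics in the exercises below.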

### Exercise 1

Create a point graph of the `continents_data` dataset with `year` on the `x` axis and `population` on the `y` axis.

In [4]:
# Your solution here



### Exercise 2
Copy your solution from [Exercise 1](exercises.ipynb#Exercise-1) and:

- Add some colour
- Make it easy to read the big numbers
- Make sure the axes are well labelled
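
As a hint for the big numbers, one possible approach is a small formatting function built on the `million` and `billion` constants. The name `format_population` and the exact label style are assumptions made for this sketch. A function of this shape, taking a list of values and returning a list of strings, is the kind of callable that the `labels` parameter of a continuous scale accepts.

```python
million = 10 ** 6
billion = 10 ** 9

def format_population(values):
    """Hypothetical helper: turn raw counts into short, readable labels."""
    labels = []
    for v in values:
        if abs(v) >= billion:
            labels.append(f"{v / billion:.1f}B")
        elif abs(v) >= million:
            labels.append(f"{v / million:.0f}M")
        else:
            labels.append(f"{v:,.0f}")
    return labels

print(format_population([500_000, 250_000_000, 4_500_000_000]))
# ['500,000', '250M', '4.5B']
```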

In [5]:
# Your solution here



### Exercise 3
Copy your solution from [Exercise 2](exercises.ipynb#Exercise-2) and create well-scaled panels for each continent.

In [6]:
# Your solution here



### Exercise 4

Copy your solution from [Exercise 3](exercises.ipynb#Exercise-3) and:

- Add a line that passes through the dots. You need to find a suitable [geom](https://plotnine.org/reference/#geoms).
- Customise the plot in your unique way.
    * The default theme that we modify using [theme()](https://plotnine.org/reference/theme.html) is [theme_gray](https://plotnine.org/reference/theme_gray.html#gray). Try using another theme. See how it is done [here](https://plotnine.org/reference/theme_classic.html#classic).
    * While you are at it, use [labs](https://plotnine.org/reference/labs.html#plotnine.labs) to add a caption to the plot.

In [7]:
# Your solution here



### Exercise 5

Given your solution in [Exercise 4](exercises.ipynb#Exercise-4), can you change the scale of anything else?
Look at the list of available scales in the [API Reference](https://plotnine.org/reference/#scales).


In [8]:
# Your solution here



## Part 2: Mapping Tanzania

### Exercise 6 - Visualise the population of the regions of Tanzania

plotnine can draw maps. The `tz_maps` module prepares some data for you, and the cell below imports it.
Make something of it. You can see some examples at [geom_map](https://plotnine.org/reference/geom_map.html#examples).

In [9]:
from tz_maps import (
    # All these are geopandas dataframes
    country,
    zones,
    regions,
    districts,
    lakes,
    rivers,
)

_HINT: See what each dataframe contains. You can plot a simple map with only two lines._