In [None]:
import pandas as pd

In [None]:
pd.options.display.max_columns = 50

# Blowing your mind with groupby

For this exercise we'll use free data from the gapminder.org [repository](https://github.com/open-numbers/ddf--gapminder--systema_globalis), released under a CC-BY license.

Let's load two tables with information about the world's countries directly from GitHub:

In [None]:
countries = pd.read_csv(
    "https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis"
    "/master/ddf--entities--geo--country.csv"
)

In [None]:
population = pd.read_csv(
    "https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis"
    "/master/countries-etc-datapoints/ddf--datapoints--population_total--by--geo--time.csv"    
)

## 1. Adding a column

Let's convert the long population table to a wide one, with one row per country and one column per year:

In [None]:
wide_population = population.pivot(index="geo", columns="time").reset_index()
wide_population
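
To see what `pivot` does on a small scale, here is a minimal sketch on a made-up table (the codes and numbers are illustrative, not Gapminder data). It also shows an optional variation, passing `values=`, which gives flat column labels instead of a two-level header.

```python
import pandas as pd

# Hypothetical toy data in long format: one row per (geo, time) pair.
toy_long = pd.DataFrame(
    {
        "geo": ["aaa", "aaa", "bbb", "bbb"],
        "time": [2000, 2001, 2000, 2001],
        "population_total": [10, 11, 20, 22],
    }
)

# Without `values=`, the leftover column name becomes the top level
# of a two-level (MultiIndex) column header.
toy_wide = toy_long.pivot(index="geo", columns="time")

# With `values=`, the columns are just the years.
toy_wide_flat = toy_long.pivot(index="geo", columns="time", values="population_total")
```

In both results each row is one `geo` code and each column is one year.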

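Index the `countries` table by geo code. The `country` column holds the codes, so despite its name `countries_by_name` is keyed by code: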

In [None]:
countries_by_name = countries.set_index("country")
countries_by_name

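Let's add each country's region to the population table. Notice the `.array` part. `.loc` returns a pandas Series indexed by geo code, while `wide_population` has a default integer index, so assigning the Series directly would make pandas align on index labels instead of assigning by position. `.array` drops the index and leaves only the values, in order.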

In [None]:
wide_population["region"] = countries_by_name.loc[wide_population.geo, "world_4region"].array
wide_population

In [None]:
countries_by_name.loc[wide_population.geo, "world_4region"]

In [None]:
countries_by_name.loc[wide_population.geo, "world_4region"].array
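
The effect of index alignment is easier to see on a tiny example. The sketch below uses made-up names (`lookup`, `target`) and values. It compares assigning a Series, assigning its `.array`, and a third option, `Series.map`, which looks values up row by row.

```python
import pandas as pd

# Hypothetical lookup table keyed by code, and a target table
# with the default 0..n-1 index.
lookup = pd.Series({"aaa": "asia", "bbb": "europe"}, name="region")
target = pd.DataFrame({"geo": ["bbb", "aaa"]})

picked = lookup.loc[target["geo"]]  # indexed by "bbb", "aaa", not by 0, 1

# Assigning the Series aligns on index labels; none of them match 0 or 1,
# so the new column is filled with missing values.
target["aligned"] = picked

# Assigning the underlying array skips alignment and fills by position.
target["by_position"] = picked.array

# Alternative: map each code through the lookup; the result keeps target's index.
target["mapped"] = target["geo"].map(lookup)
```

Here `aligned` is entirely missing, while `by_position` and `mapped` both hold the expected regions.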

Flatten the two-level (composite) column names into plain ones:

In [None]:
wide_population.columns = ["geo"] + wide_population.columns.levels[1][:-1].tolist() + ["region"]
wide_population
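
As a point of comparison, here is a sketch of another way to flatten a two-level header on made-up data: keep only the inner level (the years) before resetting the index. It is an alternative to the cell above, not a description of it.

```python
import pandas as pd

# Hypothetical toy data in long format.
toy_long = pd.DataFrame(
    {
        "geo": ["aaa", "aaa", "bbb", "bbb"],
        "time": [2000, 2001, 2000, 2001],
        "population_total": [10, 11, 20, 22],
    }
)
toy_wide = toy_long.pivot(index="geo", columns="time")

# The inner column level is named "time" (after the pivoted column);
# keeping only that level leaves the years as plain column labels.
toy_wide.columns = toy_wide.columns.get_level_values("time")

# Turn the "geo" index back into an ordinary column.
toy_wide = toy_wide.reset_index()
```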

## 2. Computations in groups

Often, we want to perform certain computations separately in groups.

For example, let's compute the total population of each region.

In pandas, this is done using the `groupby` method and then performing operations on it.

It works in 3 phases:
1. Split the dataframe into groups, based on the `groupby` parameters
2. Perform an operation on each group separately
3. Combine the results of the operations into a single result (DataFrame or Series)
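
The three phases can be written out by hand on a toy table to show what `groupby` does for us. This is only an illustration with made-up data and names, not how pandas implements it internally.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame(
    {
        "region": ["asia", "europe", "asia", "europe"],
        "population": [10, 20, 30, 40],
    }
)

# 1. Split: iterating over a groupby yields (key, sub-DataFrame) pairs.
pieces = {key: group for key, group in toy.groupby("region")}

# 2. Apply: compute the total for each piece separately.
totals = {key: group["population"].sum() for key, group in pieces.items()}

# 3. Combine: assemble the per-group results into a single Series.
manual = pd.Series(totals, name="population").rename_axis("region")

# The same result in one line.
builtin = toy.groupby("region")["population"].sum()
```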

In [None]:
by_region = wide_population.groupby("region")

To inspect the groups from step 1, look at `.indices` (row positions per group) and `.groups` (row index labels per group):

In [None]:
by_region.indices

In [None]:
by_region.groups
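
With a default integer index the two attributes look alike. The difference shows up on a table with a non-default index, as in this small sketch with made-up data:

```python
import pandas as pd

# Hypothetical toy data with string row labels.
toy = pd.DataFrame(
    {"region": ["asia", "europe", "asia"], "population": [10, 20, 30]},
    index=["x", "y", "z"],
)
grouped = toy.groupby("region")

# .groups maps each group key to the index labels of its rows.
labels = {key: list(idx) for key, idx in grouped.groups.items()}

# .indices maps each group key to the integer positions of its rows.
positions = {key: list(pos) for key, pos in grouped.indices.items()}

# get_group returns the rows of a single group as a DataFrame.
asia = grouped.get_group("asia")
```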

To perform operations and combine the results (steps 2 and 3), use `by_region` much like a dataframe: select a column if you want just one, then call an aggregation method

In [None]:
by_region[2000].sum()

In [None]:
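# numeric_only=True skips the text column `geo`, which would otherwise be concatenated
by_region.sum(numeric_only=True)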
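
A small sketch on made-up data shows the shape of each result: selecting one column before aggregating gives a Series, while aggregating the whole group object gives a DataFrame. The `numeric_only=True` argument is used here so that the text column is left out of the sum.

```python
import pandas as pd

# Hypothetical toy data shaped like the wide population table.
toy = pd.DataFrame(
    {
        "geo": ["aaa", "bbb", "ccc"],
        "region": ["asia", "europe", "asia"],
        2000: [10, 20, 30],
        2001: [11, 22, 33],
    }
)
grouped = toy.groupby("region")

# One column selected: the result is a Series indexed by region.
one_year = grouped[2000].sum()

# No column selected: the result is a DataFrame with one column per year.
all_years = grouped.sum(numeric_only=True)
```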

## 3. Classwork

1. Find the countries with the highest standard deviation of population in each decade of the 20th century