
# Gapminder

Gapminder is a Swedish foundation. Its website states:

> Gapminder fights devastating misconceptions about global development. Gapminder produces free teaching resources making the world understandable based on reliable statistics. Gapminder promotes a fact-based worldview everyone can understand.

One of its founders, Hans Rosling, is famous for a 2006 TED talk called _The best stats you’ve ever seen_. It became one of the most seen TED talks ever, thanks to its unique combination of knowledge-testing, animated bubble charts and storytelling about global development.

The talk uses a lot of bubble plots like the following one:
<img src="gapminder_rosling.jpg" alt="Gapminder" style="width: 70%;"/>

Our goal is to learn how to produce such a plot (and many more) with __Python__.

We will use the following libraries:

In [None]:
import numpy as np
import pandas as pd
import geopandas as gpd
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.animation as animation
import seaborn as sns 
from celluloid import Camera
from IPython.display import HTML
import plotly.express as px
import altair as alt

# The `gapminder` dataset

The __gapminder__ package contains an excerpt of the full Gapminder database:



In [None]:
from gapminder import gapminder

In [None]:
gapminder

As can be guessed from the table above, the dataset records the lifespan, measured by the life expectancy (`lifeExp`), the income, measured by the GDP per capita (`gdpPercap`), and the population size (`pop`) of several countries across all the continents, every five years between 1952 and 2007.

# The original gapminder plot

## A first attempt for the year 2002 using __matplotlib__

__matplotlib__ is the most commonly used framework to produce plots in __Python__. It was inspired by the __Matlab__ plotting functions and, according to its website:

> Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.
> Matplotlib makes easy things easy and hard things possible.

A plot is obtained by specifying its different components.

For instance a scatter plot of the year 2002 can be obtained by:

In [None]:
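# .copy() gives an independent DataFrame, so columns can be added later without a SettingWithCopyWarning
gapminder_2002 = gapminder.query('year == 2002').copy()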


In [None]:
plt.scatter(gapminder_2002['gdpPercap'], gapminder_2002['lifeExp']);

__matplotlib__ automatically chooses a scale for each axis.

To obtain a different color for each continent, the idiomatic way with __matplotlib__ is to plot a different scatter plot for each continent.

In [None]:
for continent,df  in gapminder_2002.groupby('continent'):
    plt.scatter(df['gdpPercap'], df['lifeExp'], label = continent)
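
The loop above relies on the fact that iterating over a `groupby` object yields `(key, sub-DataFrame)` pairs. Here is a minimal sketch of this pattern on a hypothetical toy table (the values are made up for illustration and are not taken from the gapminder data):

```python
import pandas as pd

# Hypothetical toy data, only meant to illustrate the iteration pattern
toy = pd.DataFrame({
    'continent': ['Asia', 'Europe', 'Asia', 'Europe'],
    'lifeExp': [60.0, 75.0, 70.0, 80.0],
})

# Each iteration gives the group key and the rows belonging to that group
group_sizes = {}
for continent, df in toy.groupby('continent'):
    group_sizes[continent] = len(df)

print(group_sizes)
```

Calling `plt.scatter` once per group is what lets each continent get its own color and its own legend label.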

1. Modify the previous plot so that the circle area is proportional to the population.

__Hint:__ Before specifying the `s` parameter, you need to scale the population to a marker area, which __matplotlib__ expresses in points squared.

In [None]:
norm_pop = mpl.colors.Normalize(vmin=0, vmax=gapminder['pop'].max())
bubble_max_size = 300
gapminder_2002['bubble_size'] = norm_pop(gapminder_2002['pop']) * bubble_max_size
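
`mpl.colors.Normalize` is mostly used with colormaps, but here it simply rescales the population linearly to [0, 1]. The sketch below reproduces that computation with plain __numpy__, assuming `vmin=0` as in the cell above. The population values and the helper name `population_to_area` are hypothetical.

```python
import numpy as np

def population_to_area(pop, pop_max, max_area=300.0):
    """Linearly map populations in [0, pop_max] to marker areas in [0, max_area]."""
    return np.asarray(pop, dtype=float) / pop_max * max_area

# Hypothetical populations: the largest one gets the maximal area
areas = population_to_area([1e6, 5e8, 1e9], pop_max=1e9)
print(areas)
```

Since `s` is an area, a country with twice the population gets a circle with twice the area, not twice the radius.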

In [None]:
#solution
for continent, df in gapminder_2002.groupby('continent'):
    plt.scatter(df['gdpPercap'], df['lifeExp'], s=df['bubble_size'], label = continent)

To obtain a figure closer to Hans Rosling's, the GDP per capita axis should be logarithmic.

2. Specify this logarithmic scale

__Hint:__ Use `plt.xscale`.

In [None]:
#solution
for continent, df in gapminder_2002.groupby('continent'):
    plt.scatter(df['gdpPercap'], df['lifeExp'], s=df['bubble_size'], label = continent)
plt.xscale('log')

## Styling the graph

We now have a graph that is correct but could be improved and customized.


In [None]:
#hidden
for continent, df in gapminder_2002.groupby('continent'):
    plt.scatter(df['gdpPercap'], df['lifeExp'], s=df['bubble_size'], label = continent)
plt.xscale('log')

3. Add the title _Gapminder bubble plot (2002)_ and change the axis labels to Income and Lifespan.

__Hint:__ Use `plt.title`, `plt.xlabel` and `plt.ylabel` to specify them.

In [None]:
#solution
for continent, df in gapminder_2002.groupby('continent'):
    plt.scatter(df['gdpPercap'], df['lifeExp'], s=df['bubble_size'], label = continent)
plt.xscale('log')
plt.title('Gapminder bubble plot (2002)')
plt.xlabel('Income')
plt.ylabel('Lifespan');

The default __matplotlib__ theme (even though it is much better than the original one) can be changed to another one with `mpl.style.use`.

4. Use the `seaborn-whitegrid` theme.

__Hint:__ You may use `mpl.style.available` to see a list of all available styles. In __matplotlib__ 3.6 and later, this style is named `seaborn-v0_8-whitegrid`.

In [None]:
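# matplotlib >= 3.6 renamed the seaborn styles to 'seaborn-v0_8-*'
mpl.style.use('seaborn-v0_8-whitegrid' if 'seaborn-v0_8-whitegrid' in mpl.style.available else 'seaborn-whitegrid')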

In [None]:
for continent, df in gapminder_2002.groupby('continent'):
    plt.scatter(df['gdpPercap'], df['lifeExp'], s=df['bubble_size'], label = continent)
plt.xscale('log')
plt.title('Gapminder bubble plot (2002)')
plt.xlabel('Income')
plt.ylabel('Lifespan');

Choosing the right color scheme can be complicated (especially if you are color blind). Many color palettes are available in __matplotlib__ as colormaps, and one can also specify the colors manually.

5. Use the 'Set1' color set from the Brewer family (https://colorbrewer2.org).

__Hint:__ You may define a color dictionary

In [None]:
colors = plt.get_cmap('Set1')
unique_continents = gapminder_2002['continent'].unique()
continent_colors = dict(zip(unique_continents, colors(range(len(unique_continents)))))
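
The color dictionary above stores RGBA arrays. As a variant, and only as a sketch, the same kind of mapping can be built with hex strings using `mpl.colors.to_hex`, which makes the dictionary easier to read and to edit by hand. The continent list is hard-coded here to keep the example self-contained.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so the sketch runs anywhere
import matplotlib as mpl
import matplotlib.pyplot as plt

continents = ['Africa', 'Americas', 'Asia', 'Europe', 'Oceania']
cmap = plt.get_cmap('Set1')

# cmap(i) returns the i-th color of this qualitative colormap as an RGBA tuple
hex_colors = {name: mpl.colors.to_hex(cmap(i)) for i, name in enumerate(continents)}
print(hex_colors)
```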

In [None]:
#solution
for continent, df in gapminder_2002.groupby('continent'):
    plt.scatter(df['gdpPercap'], df['lifeExp'], s=df['bubble_size'], label = continent,
    c = [continent_colors[continent]])
plt.xscale('log')
plt.title('Gapminder bubble plot (2002)')
plt.xlabel('Income')
plt.ylabel('Lifespan');

6. Specify the colors manually so that they are as similar as possible to the ones of the original picture.

__Hint:__ You can use HTML color names or hex codes in the dictionary (a color picker may also be useful).

In [None]:
continent_colors = {'Africa': '#01d4e5', 'Americas': '#7dea01',
 'Asia': '#fc5173', 'Europe': '#fde803', 'Oceania': '#536227'}

In [None]:
for continent, df in gapminder_2002.groupby('continent'):
    plt.scatter(df['gdpPercap'], df['lifeExp'], s=df['bubble_size'], label = continent,
    c = [continent_colors[continent]])
plt.xscale('log')
plt.title('Gapminder bubble plot (2002)')
plt.xlabel('Income')
plt.ylabel('Lifespan');

7. Set the x ticks manually to `[300, 1000, 3000, 10000, 30000]`.

__Hint:__ You may have to use the following hack:
```
ax = plt.gca()
ax.get_xaxis().set_major_formatter(mpl.ticker.ScalarFormatter());
```

In [None]:
#solution
for continent, df in gapminder_2002.groupby('continent'):
    plt.scatter(df['gdpPercap'], df['lifeExp'], s=df['bubble_size'], label = continent,
    c = [continent_colors[continent]])
plt.xscale('log')
plt.title('Gapminder bubble plot (2002)')
plt.xlabel('Income')
plt.ylabel('Lifespan');
plt.xticks([300, 1000, 3000, 10000, 30000])
ax = plt.gca()
ax.get_xaxis().set_major_formatter(mpl.ticker.ScalarFormatter());
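
The tick values of question 7 are not arbitrary: on a logarithmic axis, the position of a tick depends on the `log10` of its value, and the sequence 300, 1000, 3000, ... alternates factors of about 3.33 and 3, which gives nearly even spacing. A quick check using only the standard library:

```python
import math

ticks = [300, 1000, 3000, 10000, 30000]

# Position of each tick on a log10 axis
positions = [math.log10(t) for t in ticks]

# Distance between consecutive ticks, in decades
gaps = [b - a for a, b in zip(positions, positions[1:])]
print([round(g, 3) for g in gaps])
```

Each gap is close to half a decade, so the ticks look regularly spaced while keeping round labels.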

Adding a legend is normally as easy as adding `plt.legend()`

In [None]:
for continent, df in gapminder_2002.groupby('continent'):
    plt.scatter(df['gdpPercap'], df['lifeExp'], s=df['bubble_size'], label = continent,
    c = [continent_colors[continent]])
plt.xscale('log')
plt.title('Gapminder bubble plot (2002)')
plt.xlabel('Income')
plt.ylabel('Lifespan');
plt.xticks([300, 1000, 3000, 10000, 30000])
ax = plt.gca()
ax.get_xaxis().set_major_formatter(mpl.ticker.ScalarFormatter());
plt.legend();

Unfortunately, we have used two channels, the size and the color, so we should have two separate legends.

Fixing this issue proved to be quite challenging:

In [None]:
for continent, df in gapminder_2002.groupby('continent'):
    plt.scatter(df['gdpPercap'], df['lifeExp'], s=df['bubble_size'], label = continent,
    c = [continent_colors[continent]])
plt.xscale('log')
plt.title('Gapminder bubble plot (2002)')
plt.xlabel('Income')
plt.ylabel('Lifespan');
plt.xticks([300, 1000, 3000, 10000, 30000])
ax = plt.gca()
ax.get_xaxis().set_major_formatter(mpl.ticker.ScalarFormatter());
lgd = ax.legend()
ax.add_artist(lgd)
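# `legendHandles` was renamed `legend_handles` in matplotlib 3.7
for handle in (lgd.legend_handles if hasattr(lgd, 'legend_handles') else lgd.legendHandles):
    handle.set_sizes([100.0])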
scatter = plt.scatter([None, None], [None, None],
    s = [norm_pop(gapminder_2002['pop'].min()) * bubble_max_size,  norm_pop(gapminder_2002['pop'].max()) * bubble_max_size])
kw = dict(prop="sizes", num=5,
          func=lambda s: norm_pop.inverse(s) / bubble_max_size)
legend2 = ax.legend(*scatter.legend_elements(**kw),
    loc = 'lower right',
    bbox_to_anchor=(0.8, 0.45) )


Last but not least, we may polish the legends by giving them titles and overriding the dot sizes for the continents while maintaining the same order.

8. Use all your skills to do that (and specify the population in millions).

__Hint:__ Use the `title` keyword of the legend and the `fmt` argument of `legend_elements`.

In [None]:
#solution
for continent, df in gapminder_2002.groupby('continent'):
    plt.scatter(df['gdpPercap'], df['lifeExp'], s=df['bubble_size'], label = continent,
    c = [continent_colors[continent]])
plt.xscale('log')
plt.title('Gapminder bubble plot (2002)')
plt.xlabel('Income')
plt.ylabel('Lifespan');
plt.xticks([300, 1000, 3000, 10000, 30000])
ax = plt.gca()
ax.get_xaxis().set_major_formatter(mpl.ticker.ScalarFormatter());
lgd = ax.legend(title='Continent')
ax.add_artist(lgd)
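# `legendHandles` was renamed `legend_handles` in matplotlib 3.7
for handle in (lgd.legend_handles if hasattr(lgd, 'legend_handles') else lgd.legendHandles):
    handle.set_sizes([100.0])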
scatter = plt.scatter([None, None], [None, None],
    s = [norm_pop(gapminder_2002['pop'].min()) * bubble_max_size,  norm_pop(gapminder_2002['pop'].max()) * bubble_max_size])
kw = dict(prop="sizes", num=5, fmt="{x:.0f} M",
          func=lambda s: norm_pop.inverse(s) / (1e6 * bubble_max_size))
legend2 = ax.legend(*scatter.legend_elements(**kw),
    loc = 'lower right',
    title="Sizes",  bbox_to_anchor=(0.8, 0.45) )
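
The two-legend trick used above can be isolated in a much smaller example. The sketch below uses hypothetical toy data: the first legend is kept on the axes with `add_artist`, because a second call to `ax.legend` would otherwise replace it, and the size legend is built from `legend_elements(prop="sizes")`.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so the sketch runs anywhere
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
sizes = [20, 80, 200]  # hypothetical marker areas
sc_a = ax.scatter([1, 2, 3], [1, 2, 3], s=sizes, label='group A')
sc_b = ax.scatter([1, 2, 3], [3, 2, 1], s=sizes, label='group B')

# First legend (color channel): add it back as an artist so it survives
first = ax.legend(title='Group', loc='upper left')
ax.add_artist(first)

# Second legend (size channel), built from one of the scatter collections
handles, labels = sc_a.legend_elements(prop='sizes')
second = ax.legend(handles, labels, title='Size', loc='lower right')
```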

## Other available frameworks

__matplotlib__ is not the only plotting framework in __Python__, even though it is probably the most flexible. We show here two implementations of the same graph using __plotly__ and __altair__, two frameworks relying on JavaScript.

__plotly:__


In [None]:
fig = px.scatter(gapminder_2002,
    x='gdpPercap', y='lifeExp',
    color='continent', size='pop',
    size_max=40, hover_name='country')
fig.update_layout(
    title="Gapminder bubble plot (2002)",
    xaxis_title="Income",
    yaxis_title="Lifespan",
    xaxis_type='log')
fig.show()

__altair:__

In [None]:
alt.Chart(gapminder_2002, title='Gapminder bubble plot (2002)').\
    mark_circle().\
    encode(x=alt.X('gdpPercap:Q', scale=alt.Scale(type='log'), title='Income'),
        y=alt.Y('lifeExp:Q', scale=alt.Scale(zero=False), title='Lifespan'),
        color=alt.Color('continent:N', title='Continent'),
        size=alt.Size('pop:Q', title='Population', scale=alt.Scale(range=[0, 1000]), legend = alt.Legend(format='~s')),
        tooltip=['country', 'pop']
    ).interactive()

Note that both frameworks are inspired by the grammar of graphics rather than by __Matlab__: one specifies the mapping between the variables and the aesthetics and lets the magic happen.

# Adding the temporal dimension

The goal is now to visualize the same information across the years.

## Using facets

Small multiples, or facets, are repetitions of the same graph for several similar datasets. They can easily be obtained with the `plt.subplots` function.

9. Obtain the bubble plot for all the years.

__Hint:__ You may use the following pattern

In [None]:
gapminder_year = gapminder.groupby('year')
fig, axs = plt.subplots(nrows=1 + ((gapminder_year.ngroups - 1) // 4), ncols=4, sharex=True, sharey=True,
    figsize=(15, 12))
axs_flat = axs.flatten()
for i, (year, data) in enumerate(gapminder_year):
    axs_flat[i].scatter(data['gdpPercap'], data['lifeExp'])
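
The expression `1 + ((n - 1) // 4)` in the pattern above is an integer ceiling division: it gives the number of rows needed to hold `n` panels in 4 columns. A small sketch (the helper name `grid_rows` is hypothetical):

```python
def grid_rows(n_panels, ncols):
    """Smallest number of rows such that nrows * ncols >= n_panels."""
    return 1 + (n_panels - 1) // ncols

# 12 panels in 4 columns need 3 rows; 5 panels in 2 columns need 3 rows
print(grid_rows(12, 4), grid_rows(5, 2))
```

When the number of panels is not a multiple of the number of columns, the last row contains unused axes that one may want to remove.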

In [None]:
gapminder_year = gapminder.groupby('year')
fig, axs = plt.subplots(nrows=1 + ((gapminder_year.ngroups - 1) // 4), ncols=4, sharex=True, sharey=True,
    figsize=(15, 12))
plt.xscale('log')
plt.xticks([300, 1000, 3000, 10000, 30000])
axs_flat = axs.flatten()
for i, (year, data) in enumerate(gapminder_year):
    for continent, df in data.groupby('continent'):
        axs_flat[i].scatter(df['gdpPercap'], df['lifeExp'], s=norm_pop(df['pop']) * bubble_max_size, label = continent,
            c = [continent_colors[continent]])
    axs_flat[i].set_title(f'year : {year}')
    axs_flat[i].get_xaxis().set_major_formatter(mpl.ticker.ScalarFormatter());
fig.add_subplot(111, frameon=False)
plt.tick_params(labelcolor='none', top=False, bottom=False, left=False, right=False)
plt.xlabel('Income')
plt.ylabel('Lifespan');

## Using animation

The `year` variable is a natural candidate for a temporal axis. One can use the `Camera` class of __celluloid__ to capture a sequence of plots:



In [None]:
fig, ax = plt.subplots()
camera = Camera(fig)
for i in range(10):
    plt.plot([0, i], [0, i])
    camera.snap()
anim = camera.animate();
plt.close(fig)

and then produce an animation

In [None]:
HTML(anim.to_html5_video())

or save it in a file

In [None]:
anim.save('animation.gif', writer='imagemagick', fps=5)
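
For comparison, the same toy animation can be written with `FuncAnimation` from `matplotlib.animation`, without __celluloid__. This is a different technique, shown only as a sketch: instead of taking snapshots of successive plots, one updates an existing artist in a callback (the name `update` is arbitrary).

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so the sketch runs anywhere
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

fig, ax = plt.subplots()
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)
(line,) = ax.plot([], [])

def update(i):
    """Redraw the segment from (0, 0) to (i, i) for frame number i."""
    line.set_data([0, i], [0, i])
    return (line,)

anim = FuncAnimation(fig, update, frames=10, interval=200)
plt.close(fig)
```

The resulting `anim` object offers the same `to_html5_video` and `save` methods as the one returned by `camera.animate()`.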

10. Use this approach to produce a bubble plot animation across the years.

__Hint:__ You may use `ax.text` with `transform=ax.transAxes` to place the title.


In [None]:
#solution
fig, ax = plt.subplots()
plt.xscale('log')
plt.xticks([300, 1000, 3000, 10000, 30000])
camera = Camera(fig)
for year, data in gapminder_year:
    ax = plt.gca()
    for continent, df in data.groupby('continent'):
        ax.scatter(df['gdpPercap'], df['lifeExp'], s=norm_pop(df['pop']) * bubble_max_size, label = continent,
            c = [continent_colors[continent]])
    ax.text(.5, 1.025, f'year: {year}', transform=ax.transAxes, fontsize='large')
    ax.get_xaxis().set_major_formatter(mpl.ticker.ScalarFormatter());
    camera.snap()
anim = camera.animate();
plt.close(fig)

In [None]:
#solution
anim.save('gapminder.gif', writer='imagemagick', fps=5)

In [None]:
#solution
HTML(anim.to_html5_video())

## Using the trace of the countries

We may also use a completely different approach: 
one may draw the full trajectory of each country across the years in a single plot.


11. Display such a plot and add the color for each continent.

__Hint:__ Use `plt.plot`

In [None]:
#solution
for country, df in gapminder.groupby('country'):
    plt.plot(df['gdpPercap'], df['lifeExp'], c = continent_colors[df['continent'].iloc[0]])
plt.xscale('log')
plt.xlabel('Income')
plt.ylabel('Lifespan');
plt.xticks([300, 1000, 3000, 10000, 30000])
ax = plt.gca()
ax.get_xaxis().set_major_formatter(mpl.ticker.ScalarFormatter());


12. Add the population information through the size of the lines.

__Hint:__ A line drawn with `plt.plot` has a single width, but you can use the following hack.


In [None]:
def plot_width(x, y, w, c):
    for i in range(len(x)-1):
        plt.plot(x.iloc[[i, i+1]], y.iloc[[i, i+1]], linewidth=w[i], c=c)

In [None]:
#solution
for country, df in gapminder.groupby('country'):
    plot_width(df['gdpPercap'], df['lifeExp'], norm_pop(df['pop']) * 10,
        c = continent_colors[df['continent'].iloc[0]])
plt.xscale('log')
plt.xlabel('Income')
plt.ylabel('Lifespan');
plt.xticks([300, 1000, 3000, 10000, 30000])
plt.ylim([30, 85])
ax = plt.gca()
ax.get_xaxis().set_major_formatter(mpl.ticker.ScalarFormatter());

In the previous approach, the line width between two years is kept constant 
and proportional to the population at the start of the interval.

13. Find a way to have a smoother transition.

__Hint:__ One may enhance the previous `plot_width` function:

In [None]:
def plot_width(x, y, w, c, n=10):
    t = np.arange(len(x))
    t_r = np.arange(0, len(x), 1/n)
    x_r = np.interp(t_r, t, x)
    y_r = np.interp(t_r, t, y)
    w_r = np.interp(t_r, t, w)
    points = np.concatenate([x_r[:, None], y_r[:, None]], axis=1).reshape(-1, 1, 2)
    segments = np.concatenate([points[:-1], points[1:]], axis=1)
    lc = mpl.collections.LineCollection(segments, linewidths=w_r[:-1], color=c)
    plt.scatter(x, y, s=w ** 2 / np.sqrt(2), color=c)
    ax = plt.gca()
    ax.add_collection(lc)
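
The enhanced `plot_width` rests on two ingredients: `np.interp`, which resamples the values on a finer grid, and the pairing of consecutive points into segments for `LineCollection`. The sketch below isolates both steps on hypothetical widths, with `n=2` sub-steps per interval.

```python
import numpy as np

# Hypothetical line widths known at 3 time steps
t = np.arange(3)
w = np.array([1.0, 3.0, 2.0])

# Resample with 2 sub-steps per interval (n=2 in plot_width)
t_r = np.arange(0, len(t), 1 / 2)
w_r = np.interp(t_r, t, w)
print(w_r)

# Consecutive points are paired into segments of shape (n_segments, 2, 2)
points = np.column_stack([t_r, w_r]).reshape(-1, 1, 2)
segments = np.concatenate([points[:-1], points[1:]], axis=1)
print(segments.shape)
```

Note that the fine grid goes slightly past the last known time step. `np.interp` then repeats the last value, so the final segment has zero length.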

In [None]:
#solution
for country, df in gapminder.groupby('country'):
    plot_width(df['gdpPercap'], df['lifeExp'], norm_pop(df['pop']) * 10,
     c = continent_colors[df['continent'].iloc[0]])
plt.xscale('log')
plt.xlabel('Income')
plt.ylabel('Lifespan');
plt.xticks([300, 1000, 3000, 10000, 30000])
plt.ylim([30, 85])
ax = plt.gca()
ax.get_xaxis().set_major_formatter(mpl.ticker.ScalarFormatter());

As a general comment, adding new functionalities to __matplotlib__ is a matter of using the basic primitives it is built on. You have to delve into the __matplotlib__ documentation for that.

14. Propose a plot highlighting the trajectory of a few countries:

In [None]:
highlights = ['China', 'France', 'India', 'United States']

__Hint:__ You may plot all the countries in grey and overplot the selected countries

In [None]:
#solution
highlights_colors = dict(zip(highlights, colors(range(len(highlights)))))
for country, df in gapminder.groupby('country'):
    plot_width(df['gdpPercap'], df['lifeExp'], norm_pop(df['pop']) * 10,
        c='grey')
for country in highlights:
    df = gapminder[gapminder['country'] == country]
    plot_width(df['gdpPercap'], df['lifeExp'], norm_pop(df['pop']) * 10,
        c=highlights_colors[country])
plt.xscale('log')
plt.xlabel('Income')
plt.ylabel('Lifespan');
plt.xticks([300, 1000, 3000, 10000, 30000])
plt.ylim([30, 85])
ax = plt.gca()
ax.get_xaxis().set_major_formatter(mpl.ticker.ScalarFormatter());
custom_lgd = [mpl.lines.Line2D([0], [0], color=highlights_colors[country], lw=4, label=country) for country in highlights]
ax.legend(handles=custom_lgd);


# Maps and choropleths

Spatial tools have made tremendous progress in recent years and it is
now really easy to produce a choropleth: a map in which regions are filled
according to a value.

## A base map

The first step is to read the base map

In [None]:
world_map = gpd.read_file('../data/UIA_World_Countries_Boundaries.shp')

It corresponds to a classical dataframe augmented by a geometry column:

In [None]:
world_map

which makes it possible to draw a map:

In [None]:
world_map.plot();

## Data cleaning

We should now combine this map with the gapminder columns. 
As often, we have to tackle a data cleaning issue, i.e. a small name mismatch.

In [None]:
def trDataFrame(columns, *data):
    return pd.DataFrame(
        data=list(zip(*[iter(data)]*len(columns))),
        columns=columns
    )
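
The helper above uses a compact idiom, `zip(*[iter(data)] * k)`, to cut a flat sequence into rows of `k` items. A sketch on a hypothetical flat list shows what happens:

```python
flat = ['a', 1, 'b', 2, 'c', 3]

# The same iterator is passed twice, so zip pulls 2 consecutive items per tuple
it = iter(flat)
pairs = list(zip(it, it))
print(pairs)

# Equivalent compact form, as used in trDataFrame
compact = list(zip(*[iter(flat)] * 2))
print(compact)
```

This is why the data can be typed row by row, which is easier to read than two separate columns.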

In [None]:
# np.nan marks gapminder names without a counterpart in the map (np.NaN was removed in NumPy 2.0)
gapminder_world_map_name_difference = trDataFrame(
    ['country', 'COUNTRYAFF'],
    "Congo, Dem. Rep.", "Congo DRC",
    "Congo, Rep.",  "Congo",
    "Cote d'Ivoire", "Côte d'Ivoire",
    "Hong Kong, China", np.nan,
    "Korea, Dem. Rep.", "North Korea",
    "Korea, Rep.", "South Korea", 
    "Puerto Rico", np.nan,
    "Reunion", np.nan,
    "Slovak Republic", "Slovakia",
    "Swaziland", np.nan,
    "Taiwan", np.nan,
    "West Bank and Gaza", "Palestinian Territory", 
    "Yemen, Rep.", "Yemen"
)
world_map = world_map.merge(gapminder_world_map_name_difference,
    on='COUNTRYAFF', how='left')
world_map.loc[pd.isna(world_map['country']), 'country'] = world_map.loc[pd.isna(world_map['country']), 'COUNTRYAFF']

15. Join the two datasets.

__Hint:__ Use a _left_ `merge` 

In [None]:
world_map = world_map.merge(gapminder_2002,
    on='country', how='left')
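
A name mismatch silently produces missing values in a left merge. One possible way to spot them, sketched here on hypothetical miniature tables, is the `indicator=True` option of `merge`, which adds a `_merge` column telling where each row was found.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables
left = pd.DataFrame({'country': ['France', 'Yemen', 'Atlantis']})
right = pd.DataFrame({'country': ['France', 'Yemen'], 'lifeExp': [80.0, 60.0]})

# Rows of `left` without a match in `right` are flagged as 'left_only'
merged = left.merge(right, on='country', how='left', indicator=True)
unmatched = merged.loc[merged['_merge'] == 'left_only', 'country'].tolist()
print(unmatched)
```

With a left merge, every row of the map is kept, so unmatched countries can still be drawn (in grey) on the final map.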

## Choropleth

Obtaining a choropleth is _as simple as_

In [None]:
world_map.plot(column='lifeExp', missing_kwds={'color': 'lightgrey'});

Note the use of `missing_kwds` to visualize the missing countries.

16. Change the default colormap to the _inferno_ one.

__Hint:__ Use the `cmap` argument.

In [None]:
world_map.plot(column='lifeExp', cmap='inferno', missing_kwds={'color': 'lightgrey'});

## Graticules

Graticules (coordinate grids) are a valuable help when dealing with maps.
The following command loads a grid of latitudes and longitudes evenly spaced every 10 degrees. 

In [None]:
graticule = gpd.read_file('../data/graticule.shp')

17. Add this grid to the map.

__Hint:__ You may want to play with `color`, `linewidth` and `zorder`.


In [None]:
ax = plt.gca()
graticule.plot(ax=ax, color='lightgray', linewidth=.5, zorder=-1)
world_map.plot(ax=ax, column='lifeExp', cmap='inferno', missing_kwds={'color': 'lightgrey'})
plt.axis('off');

## Projection

As the Earth is (mostly) spherical, there is always a projection involved
in a map. With __geopandas__, one can specify a new projection using the `to_crs` method.


18. Project the map and the graticule using the Mollweide projection.

__Hint:__ This projection is described by the parameters `+proj=moll`.

In [None]:
#solution
world_map_proj = world_map.to_crs("+proj=moll") 
graticule_proj = graticule.to_crs("+proj=moll") 

In [None]:
#solution
ax = plt.gca()
graticule_proj.plot(ax=ax, color='lightgray', linewidth=.5, zorder=-1)
world_map_proj.plot(ax=ax, column='lifeExp', cmap='inferno', missing_kwds={'color': 'lightgrey'})
plt.axis('off');

## Legend

The only missing piece is the legend.

19. Add a colormap legend

__Hint:__ Use the `legend` keyword of the `plot` method.

In [None]:
#solution
ax = plt.gca()
graticule_proj.plot(ax=ax, color='lightgray', linewidth=.5, zorder=-1)
world_map_proj.plot(ax=ax, column='lifeExp', cmap='inferno', missing_kwds={'color': 'lightgrey'}, legend=True)
plt.axis('off');
plt.title('Life expectancy in 2002');

A better legend can be obtained with some tweaks:

In [None]:
fig, ax = plt.subplots()
vmin = gapminder_2002['lifeExp'].min()
vmax = gapminder_2002['lifeExp'].max()
vnorm = mpl.colors.Normalize(vmin=vmin, vmax=vmax)
sm = plt.cm.ScalarMappable(cmap='inferno', norm=vnorm)
# ax=ax tells matplotlib which axes to shrink to make room for the colorbar
fig.colorbar(sm, ax=ax, orientation="vertical", fraction=.025, pad=0.1, aspect = 10)
graticule_proj.plot(ax=ax, color='lightgray', linewidth=.5, zorder=-1)
world_map_proj.plot(ax=ax, column='lifeExp', cmap='inferno', norm=vnorm,
    missing_kwds={'color': 'lightgrey'})
plt.axis('off')
plt.title('Life expectancy in 2002');

# Exploration

We will conclude this lab with a quick (and not so dirty) visual exploration of the dataset.
We will look at the most classical representations of the most classical data types.



## Counts

Count plots are probably the most common type of plots.

### Bar plot

Bar plots, in which the height (and the surface) of the bars is proportional to the count, are now ubiquitous.

After computing the counts:

In [None]:
gapminder_count = gapminder_2002.groupby('continent').size().reset_index(name='n')
gapminder_count

we can use `plt.bar` to visualize them:

In [None]:
plt.bar(gapminder_count['continent'], gapminder_count['n']);

20. Change the code so that the bars are horizontal

__Hint:__ Use `plt.barh`

In [None]:
#solution
plt.barh(gapminder_count['continent'], gapminder_count['n']);

21. Add some (redundant) colors to the bars.

__Hint:__ Use the `color` keyword.

In [None]:
#solution
plt.barh(gapminder_count['continent'], gapminder_count['n'], color=gapminder_count['continent'].map(continent_colors));

### Stacked bar plot

The bars can be put one on the top of another to obtain a stacked bar plot.

In [None]:
gapminder_count['cum_n'] = gapminder_count['n'][::-1].cumsum()[::-1]

In [None]:
plt.barh(" ", gapminder_count['cum_n'], color=gapminder_count['continent'].map(continent_colors));
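
The stacking trick above draws all the bars from the same origin, the widest first, so that each narrower bar hides part of the previous one. The widths are reverse cumulative sums, as this sketch on illustrative counts shows:

```python
import numpy as np

# Illustrative counts for five categories
n = np.array([52, 25, 33, 30, 2])

# Reverse cumulative sum: each entry is the total of itself and all following ones
cum_n = n[::-1].cumsum()[::-1]
print(cum_n)

# The visible part of each bar is the difference with the next (narrower) bar
visible = cum_n - np.append(cum_n[1:], 0)
print(visible)
```

An alternative, not used here, is to draw each bar with its own width and to shift it with the `left` keyword of `plt.barh`.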

This representation is much more interesting when there are two different types of categories.

21. Produce a stacked bar plot for the continent and whether the life expectancy is above 70.

__Hint:__ You can use the following code and be inspired by the previous example.


In [None]:
gapminder_2002['above_70'] = (gapminder_2002['lifeExp'] >= 70).replace({True: 'Above 70', False: 'Below 70'})
gapminder_count2 = pd.DataFrame({'n': gapminder_2002.groupby(['above_70', 'continent']).size()}).reset_index()
gapminder_count2

In [None]:
#solution
# transform keeps the original index, so the result aligns with the rows of gapminder_count2
gapminder_count2['cum_n'] = gapminder_count2.groupby('above_70')['n'].transform(lambda x: x[::-1].cumsum()[::-1])

In [None]:
#solution
plt.barh(gapminder_count2['above_70'], gapminder_count2['cum_n'], color=gapminder_count2['continent'].map(continent_colors));

### Pie plot

A pie plot can be seen as a stacked plot... in polar coordinates... which hides the total number information:


In [None]:
plt.pie(gapminder_count['n'], colors=gapminder_count['continent'].map(continent_colors), autopct='%1.1f%%', labels=gapminder_count['continent']);

22. Produce a pie for each life expectancy category.

In [None]:
#solution
fig, axs = plt.subplots(ncols=2)
axs_flat = axs.flatten()
for i, (above, df) in enumerate(gapminder_count2.groupby('above_70')):
    axs_flat[i].pie(df['n'], colors=df['continent'].map(continent_colors), autopct='%1.1f%%', labels=df['continent']);
    axs_flat[i].set_title(above)

## Quantities


### Bar plot

This is very similar to counts, except that the bar surface is proportional 
to an arbitrary quantity rather than a count.

23. Visualize the gdpPercap by continent

__Hint:__ You should use a weighted mean rather than a classical one.


In [None]:
#solution
gapminder_2002_continent = gapminder_2002.groupby('continent').\
    apply(lambda df: pd.Series({'gdpPercap': (df['gdpPercap'] * df['pop']).sum() / df['pop'].sum()})).\
    reset_index()
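
To see why the weighting matters, here is a sketch on a hypothetical continent made of two countries: the GDP per capita of the continent is the total GDP divided by the total population, which is the population-weighted mean of the country values.

```python
import numpy as np

# Hypothetical continent with two countries
gdp_per_cap = np.array([1000.0, 3000.0])
pop = np.array([3.0, 1.0])  # in millions

simple_mean = gdp_per_cap.mean()
weighted_mean = np.average(gdp_per_cap, weights=pop)
print(simple_mean, weighted_mean)
```

The simple mean gives the same importance to every country, whatever its size, and overestimates the income here because the richer country is also the smaller one.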

In [None]:
#solution
plt.barh(gapminder_2002_continent['continent'], gapminder_2002_continent['gdpPercap'])

24. Order the plot by GDP per capita.

__Hint:__ Use `sort_values`.

In [None]:
gapminder_2002_continent = gapminder_2002_continent.sort_values('gdpPercap')

In [None]:
plt.barh(gapminder_2002_continent['continent'], gapminder_2002_continent['gdpPercap'])

### Distributions

Instead of a summary, we may be interested in the _distribution_ of the values.

#### Histogram

This is a count plot in which the bins are of the same (arbitrary) width.

In [None]:
plt.hist(gapminder_2002['gdpPercap'])

25. Specify the bins so that they start at 0 and are of width 5000.

__Hint:__ Use `np.arange` and the `bins` keyword.

In [None]:
#solution
bins = np.arange(0, 50001, 5000)
plt.hist(gapminder_2002['gdpPercap'], bins=bins);
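
`np.arange(0, 50001, 5000)` returns the bin _edges_, so 11 edges define 10 bins. The sketch below counts a few hypothetical incomes with `np.histogram`, which follows the same binning rule as `plt.hist`.

```python
import numpy as np

edges = np.arange(0, 50001, 5000)  # 0, 5000, ..., 50000
values = np.array([500, 4999, 5000, 12000, 49000])  # hypothetical incomes

counts, _ = np.histogram(values, bins=edges)
print(len(edges), counts)
```

Each bin contains its left edge but not its right one (except the last bin), which is why 5000 falls in the second bin.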

26. What happens if we choose the binwidth equal to 100?

In [None]:
#solution
bins = np.arange(0, 50001, 100)
plt.hist(gapminder_2002['gdpPercap'], bins=bins);

One can add the continent information by splitting the data into buckets:

In [None]:
slice = [ df['gdpPercap'] for s, df in gapminder_2002.groupby('continent')]
slice_name = [s for s, df in gapminder_2002.groupby('continent')]

In [None]:
plt.hist(slice);

27. By default, this is a dodged bar plot. How can you obtain its stacked version?

__Hint:__ Use the `stacked` keyword.


In [None]:
plt.hist(slice, stacked=True);

28. Use the faceting system to obtain a histogram by continent.


In [None]:
#solution
bins = np.arange(0, 50001, 5000)
gapminder_2002_continent = gapminder_2002.groupby('continent')
fig, axs = plt.subplots(nrows=1 + (gapminder_2002_continent.ngroups - 1) // 2, ncols=2, sharex=True, sharey=True, figsize=(10,10))
axs_flat = axs.flatten()
for i, (continent, data) in enumerate(gapminder_2002_continent):
    axs_flat[i].hist(data['gdpPercap'], bins=bins)
    axs_flat[i].set_title(f'{continent}')
if (axs_flat.size > gapminder_2002_continent.ngroups):
    for i in range(gapminder_2002_continent.ngroups, axs_flat.size):
        axs_flat[i].remove()
    last_but_one_row = (gapminder_2002_continent.ngroups - 1) // 2
    if (last_but_one_row > 0):
        for i in range((last_but_one_row - 1) * 2 , last_but_one_row * 2):
            if (i + 2 >= gapminder_2002_continent.ngroups):
                axs_flat[i].tick_params(labelbottom=True)

Note that __seaborn__ offers an alternative way to define those faceted plots:

In [None]:
def plot_hist(** kwargs):
    data = kwargs.pop('data')
    ax = plt.gca()
    ax.hist(data['gdpPercap'], bins=bins)  

In [None]:
g = sns.FacetGrid(gapminder_2002, col='continent', col_wrap=2)
g.map_dataframe(plot_hist)

#### Density

A density can be seen as a _smoothed_ version of a histogram. 
The main difference is that it is normalized so that 
the integral under the density curve is equal to 1.

In [None]:
sns.kdeplot(gapminder_2002['gdpPercap'], legend=False)

29. Renormalize the histogram so that the total area of the bars is equal to 1

__Hint:__ You can use the 'density' keyword

In [None]:
#solution
bins = np.arange(0, 50001, 5000)
plt.hist(gapminder_2002['gdpPercap'], bins=bins, density=True);
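
With `density=True`, each bar height is the count divided by the total number of observations and by the bin width, so that the bar areas sum to 1. This can be checked on a hypothetical sample with `np.histogram`, which accepts the same keyword:

```python
import numpy as np

values = np.array([1.0, 2.0, 2.5, 4.0, 7.0, 9.0])  # hypothetical sample
edges = np.arange(0, 11, 5)  # two bins of width 5

heights, _ = np.histogram(values, bins=edges, density=True)
area = np.sum(heights * np.diff(edges))
print(heights, area)
```

The heights are therefore not probabilities: they depend on the unit of the x axis and can be very small when the bins are wide.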

30. Superimpose the density plots for the different continents.

In [None]:
for continent, df  in gapminder_2002.groupby('continent'):
    sns.kdeplot(df['gdpPercap'], label=continent)

31. Propose a facet plot.

In [None]:
def plot_kde(** kwargs):
    data = kwargs.pop('data')
    ax = plt.gca()
    sns.kdeplot(data['gdpPercap'], ax=ax)

In [None]:
g = sns.FacetGrid(gapminder_2002, col='continent', col_wrap=2)
g.map_dataframe(plot_kde)

### Interactions

Our final type of visualization is for the interaction between two variables.
Something we may have already seen before...

#### Box plot

The most classical representation of the interaction between a categorical variable
and a continuous one is the box plot:

In [None]:
box = plt.boxplot(slice, labels=slice_name, patch_artist=True, medianprops = {'color': 'black'})
for patch, continent in zip(box['boxes'], slice_name):
    patch.set_facecolor(continent_colors[continent])

In [None]:
sns.boxplot(x=gapminder_2002['continent'], y=gapminder_2002['gdpPercap'], order=slice_name,
palette=continent_colors)
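
A box plot is a drawing of a few summary statistics: the box spans the first and third quartiles, the inner line is the median, and, with the default settings, points further than 1.5 times the interquartile range from the box are drawn individually. A sketch of these quantities on a hypothetical sample:

```python
import numpy as np

values = np.arange(1, 10)  # hypothetical sample: 1, 2, ..., 9

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Points beyond these fences are drawn individually as outliers by default
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(q1, median, q3, lower_fence, upper_fence)
```

The whiskers stop at the most extreme observations that are still inside the fences.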

32. Can you see what its modern companion, the violin plot, shows?

__Hint:__ There may be similarities with the graph of 31.

In [None]:
sns.violinplot(x=gapminder_2002['continent'], y=gapminder_2002['gdpPercap'], order=slice_name,
palette=continent_colors)

#### Line plot

Last but not least, one can visualize the evolution of a continuous variable
with respect to another one using a line plot: 

In [None]:
for continent, df in gapminder.groupby('continent'):
    plt.scatter(df['year'], df['gdpPercap'], c=continent_colors[continent])
    for country, df2 in df.groupby('country'):
        plt.plot(df2['year'], df2['gdpPercap'], c=continent_colors[continent])
        

33. Can you add a regression line for each continent to the scatter plot?

__Hint:__ `sns.regplot` can be useful.

In [None]:
for continent, df in gapminder.groupby('continent'):
    plt.scatter(df['year'], df['gdpPercap'], c=continent_colors[continent])
    sns.regplot(x=df['year'], y=df['gdpPercap'], color=continent_colors[continent], lowess=True)
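
With `lowess=True`, __seaborn__ draws a local smoother (which requires __statsmodels__) rather than a straight line. For a plain least-squares line, one possibility is `np.polyfit`, sketched here on hypothetical data:

```python
import numpy as np

# Hypothetical, perfectly linear data: y = 2 * x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 1

# A degree-1 polynomial fit returns (slope, intercept)
slope, intercept = np.polyfit(x, y, deg=1)
print(round(slope, 6), round(intercept, 6))
```

The fitted line can then be drawn with `plt.plot(x, slope * x + intercept)`, once per continent.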