# GDP vs Life Expectancy Gapminder Data

This is a rendering of the [Gapminder](http://www.gapminder.org/) data on GDP and life expectancy, using a visualisation similar to that of Gapminder, but implemented in Python. In particular, it uses [pandas](http://pandas.pydata.org/) and [bokeh](http://bokeh.pydata.org/en/latest/). It shows how compelling, interactive visualisations can be created now with a few lines of code using open-source tools.

If you just want to see the result, scroll down to the bottom. 

If you want to see how it's done, read on. We start by importing the stuff that we'll need.

In [1]:
import pandas as pd
import numpy as np
import re

import bokeh.plotting as bk

from bokeh.palettes import Spectral6
from bokeh.plotting import ColumnDataSource

from bokeh.models import (HoverTool, 
                          BoxZoomTool,
                          ResetTool,
                          PanTool,
                          WheelZoomTool,
                          Slider,
                          CustomJS,
                          Range1d,
                          Circle,
                          Text,
                          NumeralTickFormatter)

from bokeh.io import vform

Next, we'll read the data. The data has been downloaded from Gapminder, converted to CSV, and placed into the `data/gapminder` directory. You can find all the Gapminder data at http://www.gapminder.org/data/. We start with the GDP data, in Purchasing Power Parity (PPP) terms. This is the "Income per person (GDP/capita, PPP$ inflation-adjusted)" indicator.

When we read the data, we do the following:

* We rename the `GDP per capita` column to `Entity`, as it really refers to the entity whose GDP is measured.
* We find the minimum and the maximum of the GDP values, across countries. We'll need that later to adjust the limits of our plot.

In [2]:
gdp_df = pd.read_csv('data/gapminder/'
                     'indicator_gapminder_gdp_per_capita_ppp.csv',
                     thousands=',')

gdp_df.rename(columns={'GDP per capita': 'Entity'}, 
              inplace=True)

min_gdp = gdp_df.iloc[:,1:].min().min()
max_gdp = gdp_df.iloc[:,1:].max().max()
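
To see what the double `.min()` and `.max()` do, here is a minimal sketch on a made-up frame laid out like the Gapminder file (one row per entity, one column per year). The values and the names `toy_df`, `toy_min` and `toy_max` are illustrative only; they are not real data.

```python
import pandas as pd

# Made-up values; only the layout mimics the Gapminder CSV.
toy_df = pd.DataFrame({
    'Entity': ['A', 'B'],
    '1800': [600.0, 450.0],
    '1801': [610.0, None],   # missing values are skipped by min() and max()
})

# The first min() gives one minimum per year column;
# the second reduces those to a single number.
toy_min = toy_df.iloc[:, 1:].min().min()
toy_max = toy_df.iloc[:, 1:].max().max()
print(toy_min, toy_max)
```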

We want to be able to group the countries based on the geographical regions. The geographical regions are from http://spreadsheets.google.com/pub?key=phT4mwjvEuGBtdf1ZeO7_PQ&gid=1. Once we read the regions, we merge them with the GDP data, using the `Entity` column.

In [3]:
geographical_regions_df = pd.read_csv('data/gapminder/'
                                      'geographical_regions.csv')

all_df = pd.merge(gdp_df, 
                  geographical_regions_df[['Entity', 'Group']], 
                  on='Entity')
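
Keep in mind that `pd.merge` performs an inner join by default, so an entity that appears in only one of the two frames is dropped without warning. The sketch below uses invented rows to show the effect, and an outer merge with `indicator=True` as one possible way to list the entities that did not match. The frames `left` and `right` are hypothetical.

```python
import pandas as pd

# Invented rows: 'C' has no region, 'D' has no GDP data.
left = pd.DataFrame({'Entity': ['A', 'B', 'C'], '1800': [1.0, 2.0, 3.0]})
right = pd.DataFrame({'Entity': ['A', 'B', 'D'],
                      'Group': ['G1', 'G2', 'G1']})

# Default inner join: only entities present in both frames survive.
inner = pd.merge(left, right[['Entity', 'Group']], on='Entity')

# An outer join with indicator=True adds a '_merge' column that says
# where each row came from ('both', 'left_only' or 'right_only').
outer = pd.merge(left, right[['Entity', 'Group']], on='Entity',
                 how='outer', indicator=True)
unmatched = outer.loc[outer['_merge'] != 'both', 'Entity'].tolist()

print(inner['Entity'].tolist())
print(sorted(unmatched))
```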

Next we get the life expectancy data. This is the "Life expectancy (years)" indicator. We do a renaming as we did with the GDP data, and again we find the minimum and maximum life expectancy values that we'll use to adjust the limits of our plot.

In [4]:
lex_df = pd.read_csv('data/gapminder/'
                     'indicator_gapminder_life_expectancy_at_birth.csv')

lex_df.rename(columns={'Life expectancy with projections. Yellow is IHME': 
                       'Entity'}, 
              inplace=True)

min_lex = lex_df.iloc[:,1:].min().min()
max_lex = lex_df.iloc[:,1:].max().max()

We merge the life expectancy data with our `all_df` dataframe. This will grow to contain *all* our data. Note that both the GDP dataframe and the life expectancy dataframe contain columns with the values of their metric per year, from 1800 onwards. To resolve the overlap, pandas by default adds an `_x` suffix to the overlapping columns of the first dataframe and a `_y` suffix to those of the second. It's better to specify the suffixes ourselves, to make things clearer.

In [5]:
all_df = pd.merge(all_df, 
                  lex_df, 
                  on='Entity',
                  suffixes=("_gdp", "_lex"))
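
A minimal sketch of the difference, using two invented single-row frames that share a year column; the names `gdp`, `lex`, `default` and `named` are only for illustration.

```python
import pandas as pd

# Two toy frames with the same '2015' column name.
gdp = pd.DataFrame({'Entity': ['A'], '2015': [1000.0]})
lex = pd.DataFrame({'Entity': ['A'], '2015': [70.0]})

# Default suffixes: '_x' for the left frame, '_y' for the right one.
default = pd.merge(gdp, lex, on='Entity')

# Explicit suffixes say which metric each column holds.
named = pd.merge(gdp, lex, on='Entity', suffixes=('_gdp', '_lex'))

print(list(default.columns))
print(list(named.columns))
```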

Finally, we need to get the population data. This is the "Population, total" indicator. Once more we do a column renaming and then we join with the rest of our data.

In [6]:
pop_df = pd.read_csv('data/gapminder/'
                     'indicator_gapminder_population.csv',
                     thousands=',')

pop_df.rename(columns={'Total population': 
                       'Entity'},
              inplace=True)

all_df = pd.merge(all_df, 
                  pop_df, 
                  on='Entity')   

There is something we need to take care of now. The plot will be rendered with JavaScript. That means that the column names of the dataframe must be understood by JavaScript; column names that start with a number are not good, so we must flip them and change `2015_gdp` to `gdp_2015` and `2015_lex` to `lex_2015`. The population columns did not clash with any other column when we merged them, so they carry no suffix and are named by the bare year; we change `2015` to `pop_2015`.

While we're at it, we'll also create another column that we will use for sizing the marker of each country. The size will be based on the country's population, so we add a `size_x` column (where `x` is the year) that holds a scaled value of the radius of the circle whose area equals the population.

In [7]:
for column in all_df.columns:
    col_match = re.match(r'(\d+)(_(.+))?', column)
    if col_match:
        if col_match.lastindex > 1:
            new_name = col_match.group(3) + '_' + col_match.group(1)
            all_df.rename(columns={column: new_name},
                          inplace=True)
        else:
            new_name = 'pop_' + col_match.group(1)
            all_df.rename(columns={column: new_name},
                          inplace=True)
            sizes = 0.003 * np.sqrt(all_df[new_name] / np.pi)        
            all_df['size_' + col_match.group(1)] = sizes
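
The two steps of the loop can be tried out in isolation. The sketch below wraps them in two hypothetical helpers, `flip_name` and `marker_size`, which are not part of the notebook; they reuse the same regular expression and the same `0.003` scale factor as the cell above.

```python
import re
import numpy as np

def flip_name(column):
    """Return the flipped name for a year column, or the name
    unchanged if it does not start with digits."""
    col_match = re.match(r'(\d+)(_(.+))?', column)
    if not col_match:
        return column
    year = col_match.group(1)
    # group(3) is None when there is no suffix; the bare year
    # columns are the population ones.
    metric = col_match.group(3) or 'pop'
    return metric + '_' + year

def marker_size(population, scale=0.003):
    """Scaled radius of the circle whose area equals the population."""
    return scale * np.sqrt(population / np.pi)

for name in ['2015_gdp', '2015_lex', '2015', 'Entity']:
    print(name, '->', flip_name(name))

# A made-up population of pi million gives a radius of 1000,
# which the scale factor turns into a marker size of 3.
print(marker_size(np.pi * 1e6))
```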

We want to plot each country with a colour that corresponds to its geographical region, or group. To associate groups with colours we must make a correspondence between groups and group codes, which are just numbers.

In [8]:
groups = geographical_regions_df['Group'].drop_duplicates()
group_map = {v: k for k, v in enumerate(groups)}
all_df['Group Code'] = all_df['Group'].map(group_map)
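
Here is the same idea on a toy series of region names (invented for the example). As an aside, `pd.factorize` is another way to obtain integer codes in order of first appearance; it is shown only for comparison and is not what the notebook uses.

```python
import pandas as pd

# Toy region column with repeats.
regions = pd.Series(['Asia', 'Europe', 'Asia', 'Africa'])

# Each distinct group gets the index of its first appearance as its code.
toy_groups = regions.drop_duplicates()
toy_map = {name: code for code, name in enumerate(toy_groups)}
codes = regions.map(toy_map)

# Alternative: factorize returns the codes and the distinct values.
alt_codes, alt_names = pd.factorize(regions)

print(toy_map)
print(codes.tolist())
```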

The plot will display life expectancy, GDP, and population for one year at a time, with the year selected by a slider. We'll add:

* a column `x` in the dataframe that will contain the GDP for the year selected

* a column `y` that will contain the life expectancy for the year selected

* a column `pop` that will contain the population for the year selected

* a column `size` that will contain the size of the country marker for the year selected

We'll also gather together the colours of each country; the colours will be those mapped by the colour palette from the group codes that are assigned to the countries.

In [9]:
all_df['x'] = all_df['gdp_2015']
all_df['y'] = all_df['lex_2015']
all_df['pop'] = all_df['pop_2015']

all_df['size'] = all_df['size_2015']

colors = all_df['Group Code'].map(lambda x: Spectral6[x])
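
The cell above hard-codes 2015 as the initial year. To make the year selection explicit, here is a minimal pandas sketch of the same column swap for an arbitrary year. The helper `select_year` and the frame `toy` are hypothetical and use made-up values; in the notebook the swap for other years is done in the browser by the JavaScript callback further down.

```python
import pandas as pd

def select_year(df, year):
    """Hypothetical helper: copy one year's columns into the generic
    x, y, pop and size columns that the plot reads."""
    out = df.copy()
    out['x'] = out['gdp_%d' % year]
    out['y'] = out['lex_%d' % year]
    out['pop'] = out['pop_%d' % year]
    out['size'] = out['size_%d' % year]
    return out

# Made-up single-country frame with two years of data.
toy = pd.DataFrame({
    'Entity': ['A'],
    'gdp_2014': [900.0], 'gdp_2015': [1000.0],
    'lex_2014': [69.0], 'lex_2015': [70.0],
    'pop_2014': [100.0], 'pop_2015': [110.0],
    'size_2014': [1.5], 'size_2015': [1.6],
})

sel = select_year(toy, 2014)
print(sel[['Entity', 'x', 'y', 'pop', 'size']])
```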

The rest is just bokeh.

In [10]:
source = ColumnDataSource(all_df)

hover = HoverTool(tooltips=[
        ("Country Name", "@Entity"),
        ("Population", "@pop")
        ]
    )

min_x = 100 * (min_gdp // 100)
max_x = 1000 * (max_gdp // 1000) + 1000

min_y = 10 * (min_lex // 10)
max_y = 100 * (max_lex // 100) + 100

tools = [
    hover,
    WheelZoomTool(),
    PanTool(),
    BoxZoomTool(),
    ResetTool()
]

p = bk.Figure(tools=tools,
              x_axis_type="log",
              x_axis_label="Income per person "
              "(GDP/capita, PPP$ inflation-adjusted)",
              y_axis_label="Life expectancy (years)",
              x_range=Range1d(min_x, max_x), 
              y_range=Range1d(min_y, max_y),
              plot_width=700)

p.xaxis[0].formatter = NumeralTickFormatter(format='0a')

text_x = 15000
text_y = 20

for i, group in enumerate(groups):
    p.add_glyph(Text(x=text_x, 
                     y=text_y, 
                     text=[group], 
                     text_font_size='10pt', 
                     text_color='#666666'))
    p.add_glyph(Circle(x=text_x - 2000, 
                       y=text_y + 1, 
                       fill_color=Spectral6[i], 
                       size=10, 
                       line_color=None, fill_alpha=0.8))
    text_y = text_y - 3

callback = CustomJS(args=dict(source=source), code="""
        var data = source.get('data');
        var v = cb_obj.get('value');
        data['x'] = data['gdp_' + v];
        data['y'] = data['lex_' + v];
        data['size'] = data['size_' + v];
        data['pop'] = data['pop_' + v];
        source.trigger('change');
    """)

p.scatter('x', 
          'y',
          source=source,
          size='size',
          fill_color=colors,
          fill_alpha=0.8)

slider = Slider(start=1800, end=2015, value=2015, step=1, title="Year",
                callback=callback)


layout = vform(p, slider)

bk.output_notebook()

bk.show(layout)


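
A note on the axis limits computed in the last cell: they are rounded outwards with floor division, so that the data sits inside round-numbered bounds. The sketch below isolates that arithmetic in two hypothetical helpers, `round_down` and `round_up`, with made-up inputs. Because the x axis is logarithmic, the lower bound must stay above zero, so this approach assumes that the smallest GDP value is at least 100.

```python
def round_down(value, step):
    """Largest multiple of step that is not above value."""
    return step * (value // step)

def round_up(value, step):
    """Next multiple of step strictly above round_down(value, step)."""
    return step * (value // step) + step

# Made-up extremes, not the real Gapminder values.
print(round_down(142.7, 100), round_up(148000.0, 1000))   # x axis
print(round_down(23.4, 10), round_up(84.1, 100))          # y axis
```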