# Making choropleth maps in Altair

Here's a quick example of how to make a choropleth map in Altair. In this example, we'll work with a fairly large data set of baby names in France from 1900-2019, broken down by department.

To work with geographical data, we'll use `geopandas`, which extends `pandas` dataframes with support for geographical outlines and can load them from the `geojson` format. You can use these dataframes just as you would a regular `pandas` dataframe, but they will include that extra geographical outline data.

To get started, we'll need to import our libraries.

In [1]:
import altair as alt
import pandas as pd
import geopandas as gpd # Requires geopandas -- e.g.: conda install -c conda-forge geopandas
alt.data_transformers.enable('json') # Let Altair/Vega-Lite work with large data sets

pass

# Reading our names data

Now, let's read in our dataset. The exported data is in CSV format, but with a `;` separator instead of commas. The INSEE data collapses rare names into a single `_PRENOMS_RARES` entry and uses the department code `XX` where department-level information has been elided (presumably to protect individuals with uncommon names or who were among the only ones born with that name in a given year). We'll strip both out.

In [2]:
names = pd.read_csv("dpt2020.csv", sep=";")
names.drop(names[names.preusuel == '_PRENOMS_RARES'].index, inplace=True)
names.drop(names[names.dpt == 'XX'].index, inplace=True)

names.sample(5)

Unnamed: 0,sexe,preusuel,annais,dpt,nombre
1825009,2,ALYCIA,2013,91,10
3135014,2,MARINE,1990,13,221
892706,1,JORDAN,2008,85,6
576539,1,FRÉDÉRIC,1984,31,65
106629,1,AMAURY,2018,71,4
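
One detail worth guarding against when reading this file: department codes such as `01` are identifiers, not numbers. Here is a minimal sketch, assuming the same column layout as above and using a small inline sample (the rows are illustrative, not real INSEE figures, and `names_sample` is a hypothetical name). It reads the codes as strings and filters with a boolean mask instead of `drop`:

```python
import io
import pandas as pd

# Illustrative sample in the same ';'-separated layout as dpt2020.csv.
# These rows are made up for the example; they are not INSEE figures.
sample = io.StringIO(
    "sexe;preusuel;annais;dpt;nombre\n"
    "1;_PRENOMS_RARES;1900;02;7\n"
    "1;AARON;XXXX;XX;12\n"
    "1;AMAURY;2018;01;4\n"
    "2;MARINE;1990;13;221\n"
)

# Read department codes and years as strings so '01' keeps its leading zero
# and placeholder values such as 'XX' do not produce mixed-type columns.
names_sample = pd.read_csv(sample, sep=";", dtype={"dpt": str, "annais": str})

# Keep only rows that carry a real first name and a real department code.
keep = (names_sample["preusuel"] != "_PRENOMS_RARES") & (names_sample["dpt"] != "XX")
names_sample = names_sample[keep].reset_index(drop=True)

print(names_sample)
```

Keeping `dpt` as a string matters later on, because the merge with the map data matches it against the string-valued `code` column.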


# Loading map data

Next, let's load some map data of regions in France using `geopandas`.  These map data come from the [INSEE] and [IGN] and were processed into the `geojson` format we'll need to work with by [Grégoire David].  Here's the [github] repository.

In this example, we'll work with the simplified department outlines for metropolitan France (the Hexagon), but that repository contains higher-resolution versions, the DOM-TOM, and more.

[Grégoire David]: https://gregoiredavid.fr
[INSEE]: http://www.insee.fr/fr/methodes/nomenclatures/cog/telechargement.asp
[IGN]: https://geoservices.ign.fr/adminexpress
[github]: https://github.com/gregoiredavid/france-geojson/

In [3]:
depts = gpd.read_file('departements-version-simplifiee.geojson')

depts.sample(5)

Unnamed: 0,code,nom,geometry
21,23,Creuse,"POLYGON ((2.16779 46.42407, 2.19757 46.4283, 2..."
5,6,Alpes-Maritimes,"POLYGON ((6.88743 44.36105, 6.92257 44.35073, ..."
20,22,Côtes-d'Armor,"POLYGON ((-3.65914 48.65921, -3.63649 48.67069..."
70,70,Haute-Saône,"POLYGON ((5.88473 47.92605, 5.90011 47.94475, ..."
67,67,Bas-Rhin,"POLYGON ((7.63529 49.05416, 7.67449 49.04504, ..."


Notice how `depts` is a geopandas dataframe.  We'll use it just as a regular `pandas` dataframe, but it includes the geometry info we need to be able to draw those regions when we pass them into Altair.  We just need to make sure that when we work with our data, we keep them in a geopandas dataframe and not a plain dataframe if we want to draw the departments.

In the next cell, notice how we do a right-merge to bring department data into `names`. We call `merge` on `depts` because we need a geopandas dataframe back. Remember, `depts` is a geopandas dataframe, while `names` is a regular dataframe. If we instead called `merge` on `names`, we'd end up with a regular pandas dataframe. After this merge, both `names` and `depts` will be geopandas dataframes.

**Hint:** Be careful when you do your data joins here. It's easy to merge the wrong way and accidentally create a _much bigger_ dataset.

In [4]:
# Keep a reference around to the plain pandas dataframe, without geometry data, just in case
just_names = names

names = depts.merge(names, how='right', left_on='code', right_on='dpt')

names.sample(5)

Unnamed: 0,code,nom,geometry,sexe,preusuel,annais,dpt,nombre
3384798,13.0,Bouches-du-Rhône,"POLYGON ((4.73906 43.92406, 4.82174 43.91283, ...",2,RÉGINE,1941,13,25
2775370,69.0,Rhône,"POLYGON ((4.38808 46.21979, 4.39205 46.26302, ...",2,LEILA,2006,69,10
643344,57.0,Moselle,"POLYGON ((5.8934 49.49691, 5.93994 49.50097, 5...",1,GRÉGORY,1973,57,7
991795,,,,1,LOÏC,1963,971,3
455816,82.0,Tarn-et-Garonne,"POLYGON ((1.06408 44.37851, 1.10672 44.39235, ...",1,ERNEST,1916,82,3
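
Notice the row with empty department fields: department `971` has no outline in the simplified Hexagon file. One way to audit a merge like this is with the `validate` and `indicator` options of `merge`. The sketch below uses small plain-pandas stand-ins (`depts_toy` and `names_toy` are hypothetical names, with rows borrowed from the sample above) rather than the real frames:

```python
import pandas as pd

# Toy stand-ins for `depts` (one row per department) and `names` (many rows per department).
depts_toy = pd.DataFrame({"code": ["13", "69"], "nom": ["Bouches-du-Rhône", "Rhône"]})
names_toy = pd.DataFrame(
    {
        "preusuel": ["RÉGINE", "LEILA", "LOÏC"],
        "dpt": ["13", "69", "971"],
        "nombre": [25, 10, 3],
    }
)

# validate="one_to_many" raises MergeError if a department code appears more than
# once on the left, which is a common cause of an unexpectedly large result.
# indicator=True adds a `_merge` column recording where each row came from.
audit = depts_toy.merge(
    names_toy,
    how="right",
    left_on="code",
    right_on="dpt",
    validate="one_to_many",
    indicator=True,
)

# With unique codes on the left, a right merge keeps exactly the rows of `names_toy`.
assert len(audit) == len(names_toy)

# Codes present in the names data but absent from the map data.
unmatched = sorted(audit.loc[audit["_merge"] == "right_only", "dpt"].unique())
print(unmatched)
```

Rows flagged `right_only` are the ones that will come back with empty `code`, `nom`, and `geometry` fields.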


# Show a name over all years

Now we'll choose a name to show across all years. To do that, we'll group all of the names in a department together (squashing the years together) and use the sum.

In [5]:
grouped = names.groupby(['dpt', 'preusuel', 'sexe'], as_index=False).sum(numeric_only=True)
grouped = depts.merge(grouped, how='right', left_on='code', right_on='dpt') # Add geometry data back in
grouped

Unnamed: 0,code,nom,geometry,dpt,preusuel,sexe,nombre
0,01,Ain,"POLYGON ((4.78021 46.17668, 4.79458 46.21832, ...",01,AARON,1,160
1,01,Ain,"POLYGON ((4.78021 46.17668, 4.79458 46.21832, ...",01,ABBY,2,3
2,01,Ain,"POLYGON ((4.78021 46.17668, 4.79458 46.21832, ...",01,ABDALLAH,1,7
3,01,Ain,"POLYGON ((4.78021 46.17668, 4.79458 46.21832, ...",01,ABDEL,1,3
4,01,Ain,"POLYGON ((4.78021 46.17668, 4.79458 46.21832, ...",01,ABDELKADER,1,3
...,...,...,...,...,...,...,...
239574,,,,974,ÉSAÏE,1,3
239575,,,,974,ÉTHAN,1,53
239576,,,,974,ÉTIENNE,1,3
239577,,,,974,ÉVA,2,32
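
To see what "squashing the years together" does, here is a small sketch on a toy frame (the counts are illustrative and `toy` and `collapsed` are hypothetical names). Grouping by every column except the year and summing `nombre` leaves one row per department, name, and sex:

```python
import pandas as pd

# Toy data: one name over several years in two departments (values are illustrative).
toy = pd.DataFrame(
    {
        "dpt": ["01", "01", "01", "13"],
        "preusuel": ["AARON", "AARON", "AARON", "AARON"],
        "sexe": [1, 1, 1, 1],
        "annais": ["2016", "2017", "2018", "2018"],
        "nombre": [3, 5, 4, 10],
    }
)

# Group by everything except the year; summing `nombre` collapses the years.
collapsed = toy.groupby(["dpt", "preusuel", "sexe"], as_index=False)["nombre"].sum()
print(collapsed)

# The grand total is unchanged by the aggregation.
assert collapsed["nombre"].sum() == toy["nombre"].sum()
```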


Now let's pick a name and check out its distribution over the last 120 years across Metropolitan France. In this example, I choose the name “Lucien,” which I rather like for some reason.

In [6]:
# Altair 5+: selection_point and add_params replace selection_multi and add_selection
multi = alt.selection_point()

name = 'LUCIEN'
subset = grouped[grouped.preusuel == name]
alt.Chart(subset).mark_geoshape(stroke='white').encode(
    tooltip=['nom', 'code', 'nombre'],
    color='nombre',
).properties(width=800, height=600).add_params(
    multi
)


In [7]:
import pandas as pd
import geopandas as gpd
import altair as alt

In [8]:
# Visualization 2

names = pd.read_csv("dpt2020.csv", sep=";")
names.drop(names[names.preusuel == '_PRENOMS_RARES'].index, inplace=True)
names.drop(names[names.dpt == 'XX'].index, inplace=True)
depts = gpd.read_file('departements-version-simplifiee.geojson')


names = depts.merge(names, how='right', left_on='code', right_on='dpt')

df_grouped = names.groupby(['nom', 'preusuel'])['nombre'].sum().reset_index()

top_names_list = (
    df_grouped
    .groupby('nom')
    .apply(lambda x: x.nlargest(3, 'nombre')['preusuel'].tolist())
    .reset_index()
    .rename(columns={0: 'top_3_names'})
)

top_names_nombre_list = (
    df_grouped
    .groupby('nom')
    .apply(lambda x: x.nlargest(3, 'nombre')['nombre'].tolist())
    .reset_index()
    .rename(columns={0: 'nombre_list'})
)

top_names = top_names_list.merge(top_names_nombre_list, how='right', left_on='nom', right_on='nom')

gdf = depts.merge(top_names, how='right', left_on='nom', right_on='nom')

# Named `top_names_map` rather than `map` to avoid shadowing the Python builtin
top_names_map = alt.Chart(gdf).mark_geoshape(stroke='white').encode(
    color=alt.value('lightgray'),
    tooltip=['nom', 'top_3_names', 'nombre_list'],
).properties(width=600, height=400, title='Popular names per department')

top_names_map

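
The top-three lists can also be built without `groupby(...).apply`. The following is a sketch of one alternative, not the approach used in the cell above: sort once, keep the first three rows per department with `head(3)`, then collect the names and counts into lists. The frame `totals` and its counts are made up for illustration:

```python
import pandas as pd

# Toy per-department totals (illustrative counts, not INSEE figures).
totals = pd.DataFrame(
    {
        "nom": ["Ain"] * 4 + ["Rhône"] * 3,
        "preusuel": ["MARIE", "JEAN", "PIERRE", "ANNE", "MARIE", "JEAN", "LOUIS"],
        "nombre": [50, 40, 30, 10, 70, 20, 60],
    }
)

# Sort once by count, keep the first three rows of each department,
# then collect names and counts into parallel lists per department.
top3 = (
    totals.sort_values("nombre", ascending=False)
    .groupby("nom")
    .head(3)
    .groupby("nom", as_index=False)
    .agg(top_3_names=("preusuel", list), nombre_list=("nombre", list))
)
print(top3)
```

This produces both list columns in a single pass, so the two intermediate frames and the merge between them are not needed.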


In [10]:
names = pd.read_csv("dpt2020.csv", sep=";")
names.drop(names[names.preusuel == '_PRENOMS_RARES'].index, inplace=True)
names.drop(names[names.dpt == 'XX'].index, inplace=True)
depts = gpd.read_file('departements-version-simplifiee.geojson')

names


Unnamed: 0,sexe,preusuel,annais,dpt,nombre
10885,1,AADIL,1983,84,3
10886,1,AADIL,1992,92,3
10888,1,AAHIL,2016,95,3
10892,1,AARON,1962,75,3
10893,1,AARON,1976,75,3
...,...,...,...,...,...
3727545,2,ZYA,2013,44,4
3727546,2,ZYA,2013,59,3
3727547,2,ZYA,2017,974,3
3727548,2,ZYA,2018,59,3


In [15]:
grouped = names.groupby(['dpt', 'preusuel', 'sexe'], as_index=False).sum(numeric_only=True)
display(grouped)
grouped = depts.merge(grouped, how='right', left_on='code', right_on='dpt') # Add geometry data back in

# Total births per department, named up front instead of patching the column label in place
dpt_sums = grouped.groupby('dpt')['nombre'].sum().rename('dpt_nombre_sum').to_frame()
max_dpt_sum = max(dpt_sums.dpt_nombre_sum)
grouped = grouped.merge(dpt_sums, how='right', left_on='dpt', right_on='dpt')
grouped['name_percentage_in_dpt'] = grouped.nombre / grouped.dpt_nombre_sum

top_names = grouped.groupby(['preusuel'])['nombre'].sum().sort_values(ascending= False).head(35).index.tolist()
tops = grouped[grouped.preusuel.isin(top_names)]
# filter out rows with no matching department outline (NaN in the 'nom' column)
tops = tops.dropna(subset=['nom'])
tops

Unnamed: 0,dpt,preusuel,sexe,nombre
0,01,AARON,1,160
1,01,ABBY,2,3
2,01,ABDALLAH,1,7
3,01,ABDEL,1,3
4,01,ABDELKADER,1,3
...,...,...,...,...
239574,974,ÉSAÏE,1,3
239575,974,ÉTHAN,1,53
239576,974,ÉTIENNE,1,3
239577,974,ÉVA,2,32


Unnamed: 0,code,nom,geometry,dpt,preusuel,sexe,nombre,dpt_nombre_sum,name_percentage_in_dpt
27,01,Ain,"POLYGON ((4.78021 46.17668, 4.79458 46.21832, ...",01,ALAIN,1,2702,436511,0.006190
103,01,Ain,"POLYGON ((4.78021 46.17668, 4.79458 46.21832, ...",01,ANDRÉ,1,5483,436511,0.012561
124,01,Ain,"POLYGON ((4.78021 46.17668, 4.79458 46.21832, ...",01,ANNE,2,1762,436511,0.004037
223,01,Ain,"POLYGON ((4.78021 46.17668, 4.79458 46.21832, ...",01,BERNARD,1,3158,436511,0.007235
268,01,Ain,"POLYGON ((4.78021 46.17668, 4.79458 46.21832, ...",01,CATHERINE,2,2100,436511,0.004811
...,...,...,...,...,...,...,...,...,...
225243,95,Val-d'Oise,"POLYGON ((2.59052 49.07965, 2.57203 49.06149, ...",95,PHILIPPE,1,1903,656950,0.002897
225247,95,Val-d'Oise,"POLYGON ((2.59052 49.07965, 2.57203 49.06149, ...",95,PIERRE,1,1881,656950,0.002863
225343,95,Val-d'Oise,"POLYGON ((2.59052 49.07965, 2.57203 49.06149, ...",95,RENÉ,1,26,656950,0.000040
225367,95,Val-d'Oise,"POLYGON ((2.59052 49.07965, 2.57203 49.06149, ...",95,ROBERT,1,95,656950,0.000145
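
The per-department share in the last column can also be computed without a separate totals frame. As a sketch under toy data (the frame `counts` and its values are illustrative, and `name_share_in_dpt` is a hypothetical column name), `groupby(...).transform("sum")` broadcasts each department's total back onto its rows:

```python
import pandas as pd

# Toy per-department counts (illustrative values).
counts = pd.DataFrame(
    {
        "dpt": ["01", "01", "95", "95"],
        "preusuel": ["ALAIN", "ANNE", "ALAIN", "ANNE"],
        "nombre": [30, 10, 45, 15],
    }
)

# transform("sum") returns one total per row, aligned with the original index,
# so no intermediate frame or merge is needed.
counts["dpt_nombre_sum"] = counts.groupby("dpt")["nombre"].transform("sum")
counts["name_share_in_dpt"] = counts["nombre"] / counts["dpt_nombre_sum"]
print(counts)
```

Note that the result is a fraction between 0 and 1; multiply by 100 if you want an actual percentage.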


In [None]:
names = pd.read_csv("dpt2020.csv", sep=";")
names.drop(names[names.preusuel == '_PRENOMS_RARES'].index, inplace=True)
names.drop(names[names.dpt == 'XX'].index, inplace=True)
depts = gpd.read_file('departements-version-simplifiee.geojson')
merged = depts.merge(names, how='right', left_on='code', right_on='dpt')

region_df = names.groupby(['dpt', 'preusuel'])['nombre'].sum().reset_index()
region_df_top_names = region_df.groupby('dpt', as_index=False).apply(lambda x: x.sort_values('nombre', ascending=False)).reset_index()
region_df_top_names = region_df_top_names.groupby('dpt').head(3).reset_index()
region_df = depts.merge(region_df, how='right', left_on='code', right_on='dpt')
name_list = region_df_top_names['preusuel'].to_list()
region_df = region_df[region_df['preusuel'].isin(name_list)]

# Altair 5+: selection_point and add_params replace the deprecated
# selection_single / selection_multi and add_selection
name_selection = alt.selection_point(fields=['preusuel'])
dpt_selection = alt.selection_point(fields=['dpt'])

# Named `name_map` rather than `map` to avoid shadowing the Python builtin
name_map = alt.Chart(region_df).mark_geoshape(stroke='white').encode(
    tooltip=[
        alt.Tooltip('dpt', title='Department'),
        alt.Tooltip('nombre', title='Number')],
    color=alt.condition(
        dpt_selection,
        alt.Color('nombre:Q', scale=alt.Scale(scheme='greens')),
        alt.value('grey'))
).transform_filter(
    name_selection
).add_params(
    dpt_selection
).properties(width=800, height=400, title="Heatmap of name popularity in France")

population = alt.Chart(region_df_top_names).mark_arc().encode(
    theta=alt.Theta('sum(nombre):Q', title="Number of occurrences"),
    color=alt.Color(field="preusuel", type="nominal"),
    tooltip=[
        alt.Tooltip('preusuel:N', title='Name'),
        alt.Tooltip('nombre:Q', aggregate='sum', title='Total')
    ],
).add_params(
    name_selection
).transform_filter(
    dpt_selection
).properties(height=200, title="Most common names in the selected departments")

name_map & population
