# Making choropleth maps in Altair

Here's a quick example of how to make a choropleth map in Altair.  In this example, we'll work with a fairly large data set of baby names in France from 1900 to 2020, broken down by department.

To work with geographical data, we'll use `geopandas`, which extends `pandas` dataframes with a geometry column and can load geographical outlines from formats such as `geojson`.  You can use these dataframes just as you would a regular `pandas` dataframe, but they include that extra geographical outline data.

To get started, we'll need to import our libraries.

In [1]:
import altair as alt
import pandas as pd
import geopandas as gpd # Requires geopandas -- e.g.: conda install -c conda-forge geopandas
from shapely import centroid
alt.data_transformers.enable('json') # Let Altair/Vega-Lite work with large data sets

pass

# Reading our names data

Now, let's read in our dataset.  The exported data is in CSV format, but with a `;` separator instead of commas.  The INSEE data lumps rare names together under the placeholder `_PRENOMS_RARES` and uses the department code `XX` where department-level information has been elided (presumably to protect individuals with uncommon names or who were among the only ones born with that name in a given year).  We'll strip those rows out.

In [2]:
names = pd.read_csv("dpt2020.csv", sep=";")
names.drop(names[names.preusuel == '_PRENOMS_RARES'].index, inplace=True)
names.drop(names[names.dpt == 'XX'].index, inplace=True)

names.sample(5)

Unnamed: 0,sexe,preusuel,annais,dpt,nombre
1135627,1,MATHIAS,2001,76,20
2881074,2,LIYA,2011,76,6
307524,1,CHRISTOPHE,1981,12,33
525896,1,FERNAND,1917,49,16
3114651,2,MARIE-PIERRE,1966,37,15
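
The clean-up above can also be written as a single boolean mask instead of two in-place drops. The sketch below is only an illustration: the rows in `sample_csv` are invented rather than taken from the INSEE file, and reading every column as text with `dtype=str` is an assumption made here so that department codes such as `01` keep their leading zero.

```python
import io
import pandas as pd

# Tiny invented sample in the same column layout as dpt2020.csv (illustrative only)
sample_csv = io.StringIO(
    "sexe;preusuel;annais;dpt;nombre\n"
    "1;LUCIEN;1920;01;12\n"
    "1;_PRENOMS_RARES;1920;01;40\n"
    "2;MARIE;XXXX;XX;7\n"
    "2;MARIE;1920;75;300\n"
)

# Read every column as text first so codes such as "01" keep their leading zero
raw = pd.read_csv(sample_csv, sep=";", dtype=str)

# Keep only rows with a real name and a known department
keep = (raw["preusuel"] != "_PRENOMS_RARES") & (raw["dpt"] != "XX")
clean = raw.loc[keep].copy()

# Convert the numeric columns once the placeholder rows are gone
clean["annais"] = clean["annais"].astype(int)
clean["nombre"] = clean["nombre"].astype(int)
print(clean)
```

Building a new dataframe with `.loc[...].copy()` leaves `raw` untouched, which makes it easy to re-run the filter with different criteria.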


In [3]:

unique_preusuel = names['preusuel'].unique()
print(len(unique_preusuel))

unique_annais = names['annais'].unique()
print(min(unique_annais))
print(max(unique_annais))

unique_nombre = names['nombre'].unique()
print(min(unique_nombre))
print(max(unique_nombre))



15270
1900
2020
3
6310


In [4]:
# Group by 'preusuel' (name) and 'sexe' (gender), and sum up 'nombre' (number of births) for each group
name_gender_group = names.groupby(['preusuel', 'sexe'])['nombre'].sum()

# Reset index to make 'preusuel' and 'sexe' as columns again
name_gender_group = name_gender_group.reset_index()

# Pivot the table to have separate columns for each gender
# 'sexe' uses 1 for male, 2 for female, so we need to map these to column names
name_gender_pivot = name_gender_group.pivot(index='preusuel', columns='sexe', values='nombre').fillna(0)

# Rename columns for clarity
name_gender_pivot.columns = ['Male', 'Female']

# Calculate the total births for each name
name_gender_pivot['Total Name Count'] = name_gender_pivot['Male'] + name_gender_pivot['Female']

# Keep only names where each gender accounts for at least 1% of the births.
# .copy() makes an independent dataframe, so adding a column below does not raise a SettingWithCopyWarning
hybrid_names = name_gender_pivot[
    (name_gender_pivot['Male'] / name_gender_pivot['Total Name Count'] >= 0.01) &
    (name_gender_pivot['Female'] / name_gender_pivot['Total Name Count'] >= 0.01)
].copy()

# Adjust the grand total to reflect only the sum of hybrid names
filtered_hybrid_total = hybrid_names['Total Name Count'].sum()

# Calculate the percentage of each name compared to the filtered hybrid total
hybrid_names['Percentage of Total'] = (hybrid_names['Total Name Count'] / filtered_hybrid_total) * 100

# Sort the results in descending order by the total count
hybrid_names_sorted = hybrid_names.sort_values(by='Total Name Count', ascending=False)

# Print the results
print(hybrid_names_sorted[['Male', 'Female', 'Percentage of Total']].head(20))
print("Sum of percentages:", hybrid_names_sorted['Percentage of Total'].sum())


               Male     Female  Percentage of Total
preusuel                                           
MARIE       24169.0  2231903.0            49.959055
CLAUDE     408989.0    54031.0            10.253237
DOMINIQUE  238623.0   165887.0             8.957576
CAMILLE     73761.0   201738.0             6.100723
YANNICK     84135.0     3270.0             1.935519
IRÈNE         799.0    78429.0             1.754446
JOSÉ        53436.0      629.0             1.197230
SACHA       51873.0     1825.0             1.189103
LOU          1541.0    43781.0             1.003622
ANDRÉA       4182.0    40486.0             0.989140
NOA         30287.0     4597.0             0.772481
CYRILLE     33216.0      583.0             0.748454
MORGAN      28857.0     1070.0             0.662711
ALIX         4911.0    21796.0             0.591407
EDEN        16808.0     9158.0             0.574998
FRANCE        587.0    23236.0             0.527543
GRÉGOIRE    22219.0      493.0             0.502941
CHARLIE     


In [5]:
print(len(hybrid_names_sorted))
print(hybrid_names_sorted['Percentage of Total'].sum())

517
100.00000000000001


# Loading map data

Next, let's load some map data of regions in France using `geopandas`.  These map data come from the [INSEE] and [IGN] and were processed into the `geojson` format we'll need to work with by [Grégoire David].  Here's the [github] repository.

In this example, we'll work with the simplified department outlines for metropolitan France (the Hexagon), but that repository also contains higher-resolution versions, the overseas departments and territories (DOM-TOM), and more.

[Grégoire David]: https://gregoiredavid.fr
[INSEE]: http://www.insee.fr/fr/methodes/nomenclatures/cog/telechargement.asp
[IGN]: https://geoservices.ign.fr/adminexpress
[github]: https://github.com/gregoiredavid/france-geojson/

In [6]:
depts = gpd.read_file('departements-version-simplifiee.geojson')

depts.sample(5)

Unnamed: 0,code,nom,geometry
57,57,Moselle,"POLYGON ((5.89340 49.49691, 5.93994 49.50097, ..."
56,56,Morbihan,"MULTIPOLYGON (((-3.42179 47.62000, -3.44067 47..."
19,21,Côte-d'Or,"MULTIPOLYGON (((4.18190 47.15051, 4.18711 47.1..."
55,55,Meuse,"POLYGON ((4.95099 49.23687, 4.96436 49.24745, ..."
49,49,Maine-et-Loire,"POLYGON ((-1.24588 47.77672, -1.23825 47.80999..."


Notice how `depts` is a geopandas dataframe.  We'll use it just as a regular `pandas` dataframe, but it includes the geometry info we need to be able to draw those regions when we pass them into Altair.  We just need to make sure that when we work with our data, we keep them in a geopandas dataframe and not a plain dataframe if we want to draw the departments.

In the next cell, notice how we call `merge` on `depts` with `how='right'` to bring the department data into `names`.  We call it on `depts` because we need a geopandas dataframe as the result: `depts` is a geopandas dataframe, while `names` is a regular dataframe, and calling `names.merge(depts)` instead would give us a regular pandas dataframe.  The right merge keeps every row of `names`, including rows whose department has no outline in `depts`.  After this merge, both `names` and `depts` will be geopandas dataframes.

**Hint:** Be careful when you do your data joins here.  It's easy to merge the wrong way and accidentally create a _much bigger_ dataset.

In [7]:
# Keep a reference around to the plain pandas dataframe, without geometry data, just in case
just_names = names

names = depts.merge(names, how='right', left_on='code', right_on='dpt')

names.sample(5)

Unnamed: 0,code,nom,geometry,sexe,preusuel,annais,dpt,nombre
1549292,64,Pyrénées-Atlantiques,"POLYGON ((-0.24284 43.58498, -0.21061 43.59324...",1,THIBAUD,2002,64,9
2880791,55,Meuse,"POLYGON ((4.95099 49.23687, 4.96436 49.24745, ...",2,LUCIENNE,1911,55,32
3607571,13,Bouches-du-Rhône,"POLYGON ((4.73906 43.92406, 4.82174 43.91283, ...",2,VÉRONIQUE,1962,13,301
2625936,62,Pas-de-Calais,"POLYGON ((2.06771 51.00651, 2.09760 50.99843, ...",2,JENNY,1915,62,6
155089,51,Marne,"POLYGON ((4.04797 49.40564, 4.07691 49.40161, ...",1,ARNAUD,1992,51,25
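
One way to guard against the accidental blow-up mentioned in the hint is to let pandas check the join for you. The sketch below uses plain `pandas` dataframes with invented values (`depts_toy` and `names_toy` are hypothetical stand-ins, with no geometry column) and the `validate` and `indicator` arguments of `DataFrame.merge`. Since geopandas builds its `merge` on the pandas one, the same arguments should work on `depts.merge(...)`, but that is worth confirming on your installed version.

```python
import pandas as pd

# Toy stand-ins for the two tables (invented values, no geometry column)
depts_toy = pd.DataFrame({"code": ["01", "02"], "nom": ["Ain", "Aisne"]})
names_toy = pd.DataFrame({
    "dpt": ["01", "01", "02", "974"],
    "preusuel": ["LUCIEN", "MARIE", "LUCIEN", "MARIE"],
    "nombre": [12, 30, 8, 5],
})

# validate="one_to_many" raises pandas.errors.MergeError if a department code
# appears more than once on the left side, which is what would inflate the join
merged = depts_toy.merge(
    names_toy,
    how="right",
    left_on="code",
    right_on="dpt",
    validate="one_to_many",
    indicator=True,  # adds a "_merge" column saying where each row came from
)

# A right merge on unique left keys keeps exactly the rows of names_toy
assert len(merged) == len(names_toy)

# Rows marked "right_only" had no matching department (here the code "974")
unmatched = merged.loc[merged["_merge"] == "right_only", "dpt"].tolist()
print(unmatched)
```

Comparing the row count before and after the merge is a cheap sanity check, and the `_merge` column shows which department codes will end up without an outline.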


# Show a name over all years

Now we'll choose a name to show across all years.  To do that, we'll group the rows by department, name, and sex (squashing the years together) and sum the counts.

In [8]:
grouped = names.groupby(['dpt', 'preusuel', 'sexe'], as_index=False)['nombre'].sum()
grouped = depts.merge(grouped, how='right', left_on='code', right_on='dpt') # Add geometry data back in
grouped

Unnamed: 0,code,nom,geometry,dpt,preusuel,sexe,nombre
0,01,Ain,"POLYGON ((4.78021 46.17668, 4.79458 46.21832, ...",01,AARON,1,160
1,01,Ain,"POLYGON ((4.78021 46.17668, 4.79458 46.21832, ...",01,ABBY,2,3
2,01,Ain,"POLYGON ((4.78021 46.17668, 4.79458 46.21832, ...",01,ABDALLAH,1,7
3,01,Ain,"POLYGON ((4.78021 46.17668, 4.79458 46.21832, ...",01,ABDEL,1,3
4,01,Ain,"POLYGON ((4.78021 46.17668, 4.79458 46.21832, ...",01,ABDELKADER,1,3
...,...,...,...,...,...,...,...
239574,,,,974,ÉSAÏE,1,3
239575,,,,974,ÉTHAN,1,53
239576,,,,974,ÉTIENNE,1,3
239577,,,,974,ÉVA,2,32


Now let's pick a name and see how it is distributed across metropolitan France over the last 120 years.  In this example, I choose the name “Lucien,” which I rather like for some reason.

In [9]:
name = 'LUCIEN'
subset = grouped[grouped.preusuel == name]
alt.Chart(subset).mark_geoshape(stroke='white').encode(
    tooltip=['nom', 'code', 'nombre'],
    color='nombre',
).properties(width=800, height=600)

# Viz 2: most popular name by year and department

In [10]:
# add a column 'total_birth' with the total number of births per year and dpt
names['total_birth'] = names.groupby(['annais', 'dpt'])['nombre'].transform('sum')

# Remove na rows
names.dropna(inplace=True)

# Ensure type of columns
names['code'] = names['code'].astype(int)
names['nom'] = names['nom'].astype(str)
names['sexe'] = names['sexe'].astype(int)
names['preusuel'] = names['preusuel'].astype(str)
names['annais'] = names['annais'].astype(int)
names['dpt'] = names['dpt'].astype(int)
names['nombre'] = names['nombre'].astype(int)
names['total_birth'] = names['total_birth'].astype(int)

# Remove rows with missing geometry data
names = names.loc[names['geometry'].notnull()].copy()

names.sample(5)

Unnamed: 0,code,nom,geometry,sexe,preusuel,annais,dpt,nombre,total_birth
2235844,44,Loire-Atlantique,"POLYGON ((-2.45849 47.44812, -2.45343 47.46207...",2,DOMITILLE,2011,44,4,13904
542635,16,Charente,"POLYGON ((-0.10294 45.96966, -0.04143 45.99348...",1,FRANÇOIS,1908,16,35,4065
445115,21,Côte-d'Or,"MULTIPOLYGON (((4.18190 47.15051, 4.18711 47.1...",1,ENZO,1996,21,3,5359
936633,35,Ille-et-Vilaine,"MULTIPOLYGON (((-2.12371 48.60441, -2.14142 48...",1,KILIAN,2002,35,23,10508
980673,36,Indre,"POLYGON ((1.32667 47.18623, 1.40143 47.21245, ...",1,LIONEL,1923,36,4,3441
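
The `total_birth` column relies on `groupby(...).transform('sum')`, which returns one value per original row rather than one row per group. Here is a minimal sketch of that behaviour on an invented toy table (the values are made up for illustration). In the notebook itself the totals are computed after the rare-name rows were dropped, so they count only the births that remain in `names`.

```python
import pandas as pd

# Invented toy data with the same column names as the notebook
toy = pd.DataFrame({
    "annais": [1900, 1900, 1900, 1901],
    "dpt": ["01", "01", "02", "01"],
    "preusuel": ["LUCIEN", "MARIE", "MARIE", "MARIE"],
    "nombre": [10, 30, 20, 5],
})

# transform("sum") broadcasts each group's total back onto its rows,
# so the result has the same length and index as the original frame
toy["total_birth"] = toy.groupby(["annais", "dpt"])["nombre"].transform("sum")

# Share of each name among the recorded births of its year and department
toy["ratio"] = toy["nombre"] / toy["total_birth"]
print(toy)
```

Because the result is aligned with the original index, it can be assigned directly as a new column without a second merge.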


In [11]:
import altair as alt
import pandas as pd

def visualization2(gender):
    # Filter for the specified gender and drop the 'sexe' column
    subset = names.loc[names['sexe'] == gender].copy()
    subset.drop(columns=['sexe'], inplace=True)

    # Ensure the 'code' column is of type string
    subset['code'] = subset['code'].astype(str)

    # Keep only the most popular name for each year and department:
    # the descending sort puts the largest 'nombre' first, and drop_duplicates keeps that first row
    subset = subset.sort_values(by=['annais', 'dpt', 'nombre'], ascending=False).drop_duplicates(subset=['annais', 'dpt'])
    # Add a column 'ratio' with that name's share of the total number of births for the year and dpt
    subset['ratio'] = subset['nombre'] / subset['total_birth']

    # Create a parameter for the year with initial value 1900
    year_param = alt.param(value=1900, name='Year')

    # Create a slider for the year and bind it to the parameter
    year_slider = alt.binding_range(min=1900, max=2020, step=1, name='Year:')
    year_select = alt.selection_point(fields=['annais'], bind=year_slider, name='year', value=1900)

    # Create the map for France excluding Île-de-France
    france_map_chart = alt.Chart(subset[~subset['code'].str.startswith(('75', '77', '78', '91', '92', '93', '94', '95'))]).mark_geoshape(stroke='white').encode(
        tooltip=['nom', 'code', 'nombre', 'preusuel', 'annais', 'ratio'],
        color=alt.Color('nombre:Q', scale=alt.Scale(scheme='lightorange'))
    ).transform_filter(
        year_select
    ).properties(
        width=800,
        height=600
    ).add_params(
        year_param
    )

    # Filter for Île-de-France
    ile_de_france = subset[subset['code'].str.startswith(('75', '77', '78', '91', '92', '93', '94', '95'))].copy()

    # Create the map for Île-de-France
    ile_de_france_chart = alt.Chart(ile_de_france).mark_geoshape(stroke='white').encode(
        tooltip=['nom', 'code', 'nombre', 'preusuel', 'annais', 'ratio'],
        color=alt.Color('nombre:Q', scale=alt.Scale(scheme='lightorange'))  # Same 'lightorange' color scheme as the main map
    ).transform_filter(
        year_select
    ).properties(
        width=400,
        height=300
    ).add_params(
        year_param
    )

    # Calculate centroids for placing the text in France map
    subset['centroid'] = subset['geometry'].apply(lambda geom: centroid(geom) if geom.is_valid else None)
    subset[['centroid_x', 'centroid_y']] = subset['centroid'].apply(lambda p: pd.Series({'centroid_x': p.x, 'centroid_y': p.y}) if p is not None else pd.Series({'centroid_x': None, 'centroid_y': None}))
    subset.drop(columns=['centroid'], inplace=True)

    # Add text marks for 'preusuel' in France map
    france_text_chart = alt.Chart(subset[~subset['code'].str.startswith(('75', '77', '78', '91', '92', '93', '94', '95'))]).mark_text(align='center', baseline='middle', fontSize=10).encode(
        longitude='centroid_x:Q',
        latitude='centroid_y:Q',
        text='preusuel:N'
    ).transform_filter(
        year_select
    ).add_params(
        year_param
    )

    # Calculate centroids for placing the text in Île-de-France map
    ile_de_france['centroid'] = ile_de_france['geometry'].apply(lambda geom: centroid(geom) if geom.is_valid else None)
    ile_de_france[['centroid_x', 'centroid_y']] = ile_de_france['centroid'].apply(lambda p: pd.Series({'centroid_x': p.x, 'centroid_y': p.y}) if p is not None else pd.Series({'centroid_x': None, 'centroid_y': None}))
    ile_de_france.drop(columns=['centroid'], inplace=True)

    # Add text marks for 'preusuel' in Île-de-France map
    ile_de_france_text_chart = alt.Chart(ile_de_france).mark_text(align='center', baseline='middle', fontSize=10).encode(
        longitude='centroid_x:Q',
        latitude='centroid_y:Q',
        text='preusuel:N'
    ).transform_filter(
        year_select
    ).add_params(
        year_param
    )

    # Set title based on gender
    title = 'Most Popular Female Name by Year and Department' if gender == 2 else 'Most Popular Male Name by Year and Department'

    # Combine charts
    france_combined_chart = alt.layer(france_map_chart, france_text_chart).add_params(year_select).properties(
        title=title
    )

    ile_de_france_combined_chart = alt.layer(ile_de_france_chart, ile_de_france_text_chart).add_params(year_select).properties(
        title='Île-de-France'
    )

    final_chart = alt.vconcat(france_combined_chart, ile_de_france_combined_chart)

    return final_chart


In [12]:
visualization2(1)

In [13]:
visualization2(2)
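
The "most popular name per year and department" step inside `visualization2` can be checked on a small table. The sketch below uses invented values and compares the sort-then-`drop_duplicates` approach from the function with an alternative based on `idxmax`. The helper `to_mapping` is hypothetical and exists only for this comparison. When two names tie for first place, the two approaches may not pick the same one.

```python
import pandas as pd

# Invented toy data with the same column names as the notebook
toy = pd.DataFrame({
    "annais": [1900, 1900, 1900, 1901, 1901],
    "dpt": [1, 1, 2, 1, 1],
    "preusuel": ["LUCIEN", "MARIE", "JEAN", "LUCIEN", "MARIE"],
    "nombre": [10, 30, 20, 7, 5],
})

# Approach used in visualization2: sort so the largest count comes first
# within each (annais, dpt) pair, then keep the first row of each pair
top_sorted = (
    toy.sort_values(by=["annais", "dpt", "nombre"], ascending=False)
       .drop_duplicates(subset=["annais", "dpt"])
)

# Alternative: look up the row label of the maximum count in each group
top_idxmax = toy.loc[toy.groupby(["annais", "dpt"])["nombre"].idxmax()]


def to_mapping(df):
    """Hypothetical helper: map (year, department) to the selected name."""
    return {(int(r.annais), int(r.dpt)): r.preusuel for r in df.itertuples()}


print(to_mapping(top_sorted))
print(to_mapping(top_idxmax))
```

On this toy data both approaches select the same name for every year and department, which is a quick way to convince yourself that the sort direction in the function is the right one.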