# Collections data heatmap visualisation

I'm still waiting for the combined dataset that we'll use for the choropleth maps. While I wait, I want to practise some basic visualisation techniques using the two cleaned modern-country files that Dianna and I created for the British Museum and the V&A Museum.

In this notebook I'll:

- Load the two cleaned datasets  
- Create three simple heatmaps:  
  - one for the British Museum  
  - one for the V&A  
  - one combining both datasets inside the notebook  

Once the combined dataset is ready, I'll switch to experimenting with choropleth maps using the Folium guide from Real Python.


In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

bm_df = pd.read_csv("../data/bm_cleaned_with_countries.csv")
va_df = pd.read_csv("../data/va_cleaned_places_with_countries.csv")

# quick check that both datasets loaded properly
bm_df.head(), va_df.head()


## Step 1. Counting objects by country

To build a heatmap I need a table that shows how many records each modern country has. I'll count objects per `FinalCountry` for each museum and sort the results so I can see which countries appear most often.
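
As a cross-check on the counting step, the same kind of table can be built with `value_counts`, which drops missing values and sorts in descending order by default. This is only a sketch on a few made-up rows, not the real museum files; `toy_df` and `toy_counts` are illustrative names.

```python
import pandas as pd

# Made-up rows standing in for a museum file (one row per object)
toy_df = pd.DataFrame(
    {"FinalCountry": ["Egypt", "Greece", "Egypt", None, "India", "Egypt"]}
)

toy_counts = (
    toy_df["FinalCountry"]
    .value_counts()               # ignores the missing value, largest count first
    .rename_axis("FinalCountry")  # name the index so it becomes a proper column
    .reset_index(name="count")
)

print(toy_counts)
```

If the two approaches ever disagree on the real data, that would point to something unexpected in the `FinalCountry` column rather than in the plotting code.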


In [None]:
# count objects per modern country for the British Museum
bm_country_counts = (
    bm_df
    .dropna(subset=["FinalCountry"])
    .groupby("FinalCountry")
    .size()
    .reset_index(name="count")
    .sort_values("count", ascending=False)
)

# look at the top countries for the BM
bm_country_counts.head()


In [None]:
# count objects per modern country for the V&A
va_country_counts = (
    va_df
    .dropna(subset=["FinalCountry"])
    .groupby("FinalCountry")
    .size()
    .reset_index(name="count")
    .sort_values("count", ascending=False)
)

# look at the top countries for the V&A
va_country_counts.head()


## Step 2. Preparing the data for a simple heatmap

Heatmaps need a 2D table. The country counts are in a single column, so next I'll reshape them into a one-column matrix that Seaborn can plot. I'll start with the BM dataset to make sure everything looks OK.
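
One detail worth knowing about this reshaping step: `pivot_table` averages by default and returns its index sorted alphabetically, so the descending count order from Step 1 is not carried into the matrix. A minimal sketch with made-up counts (the `toy_*` names are illustrative), comparing it with `set_index`, which keeps the existing row order:

```python
import pandas as pd

# Made-up counts, already sorted from largest to smallest
toy_counts = pd.DataFrame({
    "FinalCountry": ["Greece", "Egypt", "India"],
    "count": [120, 80, 45],
})

# pivot_table: a one-column table, but rows come back in alphabetical order
toy_pivot = toy_counts.pivot_table(index="FinalCountry", values="count")

# set_index: also a one-column 2D table, with the count order preserved
toy_matrix = toy_counts.set_index("FinalCountry")[["count"]]

print(list(toy_pivot.index))   # ['Egypt', 'Greece', 'India']
print(list(toy_matrix.index))  # ['Greece', 'Egypt', 'India']
```

Either form is a valid input for `sns.heatmap`; the choice only affects the order of the rows in the plot.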


In [None]:
# reshape BM counts into a 2D matrix (one column) for the heatmap
bm_matrix = bm_country_counts.pivot_table(
    index="FinalCountry",
    values="count"
)

bm_matrix.head()


In [None]:
# basic heatmap for the British Museum counts
plt.figure(figsize=(6, 12))

sns.heatmap(
    bm_matrix,
    cmap="OrRd",
    linewidths=0.5
)

plt.title("British Museum objects per country")
plt.xlabel("count")
plt.ylabel("FinalCountry")

plt.show()


## Step 3. Creating a simple heatmap for the V&A

Now that the British Museum heatmap is working, I can repeat the same steps for the V&A counts. I use the grouped country counts, turn them into a single-column matrix, and pass that into `sns.heatmap`.
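
Because the counting steps are identical for both museums, they could be wrapped in a small helper. The function name `count_by_country` is hypothetical and the data below is made up; this is a sketch of the idea rather than code the notebook depends on.

```python
import pandas as pd

def count_by_country(df, country_col="FinalCountry"):
    """Return one row per country with a 'count' column, largest first."""
    return (
        df
        .dropna(subset=[country_col])   # ignore objects with no country
        .groupby(country_col)
        .size()
        .reset_index(name="count")
        .sort_values("count", ascending=False)
    )

# Made-up rows standing in for a museum file
toy_df = pd.DataFrame({"FinalCountry": ["Japan", "India", "Japan", None]})
toy_counts = count_by_country(toy_df)

print(toy_counts)
```

With a helper like this, the BM and V&A tables would each come from a single call, which makes it harder for the two versions to drift apart.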


In [None]:
# reshape V&A counts into a 2D matrix (one column) for the heatmap
va_matrix = va_country_counts.pivot_table(
    index="FinalCountry",
    values="count"
)

va_matrix.head()


In [None]:
# basic heatmap for the V&A counts
plt.figure(figsize=(6, 12))

sns.heatmap(
    va_matrix,
    cmap="OrRd",
    linewidths=0.5
)

plt.title("V&A objects per country")
plt.xlabel("count")
plt.ylabel("FinalCountry")

plt.show()


## Step 4. Creating a combined heatmap for both museums

For the third heatmap I want to see British Museum and V&A counts side by side. To do this I can join the two country count tables together, fill any missing values with zero, and build a small matrix with one column per museum.
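
To see what the outer join and zero-fill do, here is a small worked example on made-up counts (the `*_toy` names and numbers are illustrative only). A country that appears in just one museum ends up with a zero in the other museum's column.

```python
import pandas as pd

# Made-up country counts for two museums
bm_toy = pd.DataFrame({"FinalCountry": ["Egypt", "Greece"], "count": [80, 120]})
va_toy = pd.DataFrame({"FinalCountry": ["India", "Egypt"], "count": [60, 10]})

toy_combined = (
    bm_toy.rename(columns={"count": "BM_count"})
    .merge(
        va_toy.rename(columns={"count": "VA_count"}),
        on="FinalCountry",
        how="outer",   # keep countries that appear in either table
    )
    .fillna(0)         # a country missing from one museum counts as zero there
    .astype({"BM_count": int, "VA_count": int})
    .set_index("FinalCountry")
)

print(toy_combined)
```

Here Greece has no V&A rows and India has no BM rows, so both get a zero rather than a missing value.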


In [None]:
# join BM and V&A country counts on FinalCountry
combined_counts = (
    bm_country_counts
    .rename(columns={"count": "BM_count"})
    .merge(
        va_country_counts.rename(columns={"count": "VA_count"}),
        on="FinalCountry",
        how="outer"
    )
    .fillna(0)
)

# convert counts to integers after filling NaN with 0
combined_counts["BM_count"] = combined_counts["BM_count"].astype(int)
combined_counts["VA_count"] = combined_counts["VA_count"].astype(int)

combined_counts.head()


In [None]:
# set FinalCountry as index so the matrix has one row per country
combined_matrix = combined_counts.set_index("FinalCountry")[["BM_count", "VA_count"]]

combined_matrix.head()


In [None]:
# heatmap showing BM and V&A counts side by side
plt.figure(figsize=(6, 12))

sns.heatmap(
    combined_matrix,
    cmap="OrRd",
    linewidths=0.5
)

plt.title("Objects per country: British Museum vs V&A")
plt.xlabel("Museum")
plt.ylabel("FinalCountry")

plt.show()


## Step 5. Switching to a 10×10 heatmap using top countries and count ranges

The first few heatmaps worked, but because there are so many countries the plots ended up as long vertical strips rather than the pixel-style square grid shown in the guide. Reducing to the top 10 countries would make the plots shorter, but they would still be only one column wide for each museum.

To get something closer to the square heatmap in the tutorial, I need both axes to have ten values. So for the next version I will use:

- the **top 10 countries** on one axis  
- **10 count ranges** (for example 0–50, 50–150, 150–250, and so on) on the other axis  

Each country falls into one of these ranges based on its object count. This should give a proper 10×10 grid where each row has one coloured square, and the overall layout looks much more like the pixelated square heatmap in the resource.
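
It helps to check how `pd.cut` treats values that sit on or beyond the edges of these ranges. The sketch below uses the ranges described above with made-up counts; the country labels `A` to `D` are placeholders.

```python
import pandas as pd

bin_edges = [0, 50, 150, 250, 350, 450, 550, 650, 750, 850, 950]
# Build labels such as "0–50" and "50–150" from neighbouring edges
bin_labels = [f"{lo}–{hi}" for lo, hi in zip(bin_edges[:-1], bin_edges[1:])]

# Made-up counts chosen to sit on or beyond the edges
toy_counts = pd.Series([49, 50, 949, 1200], index=["A", "B", "C", "D"])

# right=False makes each range include its lower edge and exclude its upper edge
toy_bins = pd.cut(toy_counts, bins=bin_edges, labels=bin_labels, right=False)

print(toy_bins)
```

A count of 50 lands in "50–150" rather than "0–50", and a count of 1200 falls outside every range, so it comes back as missing. Any country with 950 or more objects would need wider ranges to appear in the grid.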


In [None]:
# top 10 countries for the British Museum
bm_top10 = (
    bm_country_counts
    .sort_values("count", ascending=False)
    .head(10)
    .set_index("FinalCountry")
)

bm_top10


In [None]:
# set up count bins for the 10 ranges
bin_edges = [0, 50, 150, 250, 350, 450, 550, 650, 750, 850, 950]
bin_labels = [
    "0–50", "50–150", "150–250", "250–350", "350–450",
    "450–550", "550–650", "650–750", "750–850", "850–950"
]

# assign each BM top 10 country to a bin
bm_binned = bm_top10.copy()
bm_binned["count_bin"] = pd.cut(
    bm_binned["count"],
    bins=bin_edges,
    labels=bin_labels,
    include_lowest=True,
    right=False
)

# create a 10x10 table (countries x bins)
bm_grid = pd.DataFrame(
    0,
    index=bm_binned.index,
    columns=bin_labels
)

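# fill the correct cell for each country with its count
# (a count outside bin_edges, i.e. 950 or more, gets a NaN bin and is skipped)
for country, row in bm_binned.iterrows():
    if pd.notna(row["count_bin"]):
        bm_grid.loc[country, row["count_bin"]] = row["count"]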

bm_grid


In [None]:
# plot the BM 10x10 heatmap
plt.figure(figsize=(8, 5))

sns.heatmap(
    bm_grid,
    cmap="Blues",
    linewidths=0.5,
    annot=True,
    fmt="g"
)

plt.title("BM top 10 countries by count range")
plt.xlabel("Count range")
plt.ylabel("FinalCountry")
plt.tight_layout()
plt.show()


## Result: why the 10×10 heatmap doesn't really work

The 10×10 heatmap technically works, but the outcome shows why this approach is not a good fit for the data. Each country only has one total object count, so when I force the data into ten count ranges, each country falls into one range only. This means each row has one coloured square and the rest are zeros. The plot ends up looking empty, even though the code is doing what it should be doing.

This highlights an important point: a heatmap only becomes meaningful when both axes vary. My data has a lot of countries (so plenty of variation in one direction), but only one value per country (so no variation in the other). Binning the counts creates a second axis on paper, but it does not add new information.

To make a proper square, “pixelated” heatmap like the one in the guide, I would need an additional dimension in the data. For example:
- values from **multiple museums**, so I could compare BM, V&A, Tate, National Gallery, etc. across countries  
- values across **object types**, giving a grid like country vs sculpture, painting, print, metalwork and so on
- values across **time periods**, such as country by century (which I could actually try, given we have cleaned date data!)

Any of these would produce a matrix where each country has several meaningful values, not just one. That would create the variation a heatmap needs to look full rather than mostly blank.
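
As a rough illustration of what a second dimension adds, `pd.crosstab` turns two categorical columns into a country-by-category grid of counts. The `object_type` column and all the rows below are made up for the sketch; the real files may name or structure this differently.

```python
import pandas as pd

# Made-up objects: one row per object, with a hypothetical object_type column
toy_objects = pd.DataFrame({
    "FinalCountry": ["Egypt", "Egypt", "Greece", "Greece", "Greece", "India"],
    "object_type": ["sculpture", "print", "sculpture", "sculpture", "metalwork", "print"],
})

# Rows are countries, columns are object types, cells are object counts
toy_grid = pd.crosstab(toy_objects["FinalCountry"], toy_objects["object_type"])

print(toy_grid)
```

Each country now has a value in every column, which is the shape a heatmap needs.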

Since the dataset I have is essentially one-dimensional, a bar chart or a choropleth map is a better match.


## New Experiment: Step 1. Moving to a country × time period heatmap

The earlier heatmap attempts showed that a single value per country is not enough to produce the kind of pixel-style grid used in the guide. A heatmap only works well when both axes contain meaningful variation.

The next idea is to use time as the second dimension. I have cleaned production dates for both museums, including start dates, end dates and midpoint dates. This gives me the chance to explore how objects are distributed across different time periods and to compare countries through a historical lens.

In this step I'll experiment with creating a heatmap where:
- rows represent countries, and  
- columns represent time periods (for example centuries or defined ranges)

This should give each country multiple values rather than one, which means the heatmap will have a fuller grid and a clearer pattern. I’ll start by exploring the date data and deciding what time periods make sense for the dataset.
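
If centuries turn out to be the right unit, one simple way to bucket years is floor division by 100. This sketch assumes BC years are stored as negative numbers in `midpoint_year`; the values are made up.

```python
import pandas as pd

# Made-up midpoint years, with BC dates as negative numbers (an assumption)
toy_years = pd.Series([-2500.0, -250.0, 45.0, 1066.0, 1999.0], name="midpoint_year")

# Floor division rounds down, so negative (BC) years also land at the start
# of their hundred-year block, e.g. -250 -> -300
century_start = (toy_years // 100 * 100).astype(int)

print(century_start.tolist())  # [-2500, -300, 0, 1000, 1900]
```

With a date range spanning thousands of years this would produce a very wide grid, so broader ranges may be more practical.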


In [None]:
bm_dates_df = pd.read_csv("../data/bm_cleaned_dates.csv")
va_dates_df = pd.read_csv("../data/va_cleaned_dates.csv")

# quick check that they loaded properly
bm_dates_df.head(), va_dates_df.head()


## Step 2. Linking dates and countries for the British Museum

To build a heatmap with country on one axis and time periods on the other, I need a single table that combines both pieces of information. In this step I join the British Museum dates file to the main BM dataset, using `Museum number` as the key, so that each object has both a `FinalCountry` and a `midpoint_year`.
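
A left join like this can quietly add rows if a museum number appears more than once in the dates file, and it leaves `midpoint_year` empty where no date was found. The sketch below, on made-up data, uses the `validate` and `indicator` options of `merge` to check both; the `toy_*` names and museum numbers are illustrative.

```python
import pandas as pd

# Made-up main table and dates table
toy_main = pd.DataFrame({
    "Museum number": ["A.1", "A.2", "A.3"],
    "FinalCountry": ["Egypt", "Greece", "India"],
})
toy_dates = pd.DataFrame({
    "Museum number": ["A.1", "A.2"],
    "midpoint_year": [-1250.0, 450.0],
})

toy_joined = toy_main.merge(
    toy_dates,
    on="Museum number",
    how="left",
    validate="many_to_one",  # raises MergeError if a museum number repeats in toy_dates
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Objects that kept their row but found no matching date
n_without_date = int((toy_joined["_merge"] == "left_only").sum())

print(len(toy_joined), n_without_date)  # 3 1
```

The row count stays the same as the main table, and one object is flagged as having no date.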


In [None]:
# keep only the columns I need from the BM dates file
bm_dates_min = bm_dates_df[["Museum number", "midpoint_year"]]

# join dates onto the main BM dataset
bm_with_dates = (
    bm_df
    .merge(bm_dates_min, on="Museum number", how="left")
)

# quick check that both FinalCountry and midpoint_year are present
bm_with_dates[["Museum number", "FinalCountry", "midpoint_year"]].head()


In [None]:
# top 10 BM countries by count (reusing the earlier logic)
bm_top10 = (
    bm_country_counts
    .sort_values("count", ascending=False)
    .head(10)
    .set_index("FinalCountry")
)

bm_top10.index


In [None]:
# drop rows with missing midpoint_year before measuring range
bm_years = bm_with_dates.dropna(subset=["midpoint_year"])

# work out overall date range
min_year = int(bm_years["midpoint_year"].min())
max_year = int(bm_years["midpoint_year"].max())

min_year, max_year


## Step 3. Choosing a time framework

Now that the dates are merged into the dataset, I can see that the midpoint years span a huge range. Some objects date to several thousand years BC, others are medieval, and others are modern. Because the range is so wide, plotting raw years would not produce a meaningful or readable heatmap.

I'll need to divide time into meaningful segments before I build the heatmap. My first idea was to use museum-friendly periods such as “Classical Antiquity” or “Medieval”, but these categories are very Eurocentric and don't map well onto global histories represented in the British Museum and V&A collections.

To keep the visualisation fair and usable across cultures, I’m switching to a set of broad, neutral time bins that work globally. These are simple date ranges rather than culturally specific periods, and they give me ten groups, which fits the 10×10 heatmap structure:

- Before 3000 BC  
- 3000–1000 BC  
- 1000–500 BC  
- 500 BC–0  
- 0–500 AD  
- 500–1000 AD  
- 1000–1500 AD  
- 1500–1800 AD  
- 1800–1950 AD  
- 1950–present

Each object’s midpoint date will fall into one of these bins, giving me enough variation to create a genuine two-dimensional heatmap with country on one axis and time on the other.
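
Before binning the real dates, it is worth confirming which side of each boundary a year lands on. The sketch below applies the ten ranges to a few made-up years, assuming BC years are stored as negative numbers; the outer bounds of -60000 and 3000 are placeholders wide enough to catch everything.

```python
import pandas as pd

time_edges = [-60000, -3000, -1000, -500, 0, 500, 1000, 1500, 1800, 1950, 3000]
time_labels = [
    "Before 3000 BC", "3000–1000 BC", "1000–500 BC", "500 BC–0", "0–500 AD",
    "500–1000 AD", "1000–1500 AD", "1500–1800 AD", "1800–1950 AD", "1950–present",
]

# Made-up years sitting on or next to the boundaries
toy_years = pd.Series([-3500, -3000, -1, 0, 1949, 1950])

# right=False: each range includes its start year and excludes its end year
toy_periods = pd.cut(toy_years, bins=time_edges, labels=time_labels, right=False)

print(pd.DataFrame({"year": toy_years, "period": toy_periods}))
```

With these settings, 3000 BC itself falls in "3000–1000 BC", year 0 falls in "0–500 AD", and 1950 falls in "1950–present".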


In [None]:
# define the ten global time bins
time_edges = [
    -60000,   # before 3000 BC (very low floor)
    -3000,
    -1000,
    -500,
    0,
    500,
    1000,
    1500,
    1800,
    1950,
    3000     # upper bound for present / future
]

time_labels = [
    "Before 3000 BC",
    "3000–1000 BC",
    "1000–500 BC",
    "500 BC–0",
    "0–500 AD",
    "500–1000 AD",
    "1000–1500 AD",
    "1500–1800 AD",
    "1800–1950 AD",
    "1950–present"
]


In [None]:
# bin the midpoint dates into the ten global ranges
bm_with_dates["time_bin"] = pd.cut(
    bm_with_dates["midpoint_year"],
    bins=time_edges,
    labels=time_labels,
    include_lowest=True,
    right=False
)

# quick check
bm_with_dates[["FinalCountry", "midpoint_year", "time_bin"]].head()


In [None]:
# filter BM dataset to only include the top 10 countries
bm_top10_countries = bm_top10.index.tolist()
bm_top10_dates = bm_with_dates[bm_with_dates["FinalCountry"].isin(bm_top10_countries)]


In [None]:
# count objects per country per time bin
bm_time_matrix = (
    bm_top10_dates
    .groupby(["FinalCountry", "time_bin"], observed=False)  # keep empty time bins
    .size()
    .unstack(fill_value=0)
    # enforce order; fill_value=0 keeps the counts as integers for fmt="d"
    .reindex(index=bm_top10_countries, columns=time_labels, fill_value=0)
)

bm_time_matrix


In [None]:
plt.figure(figsize=(10, 8))

sns.heatmap(
    bm_time_matrix,
    cmap="Blues",
    linewidths=0.5,
    annot=True,
    fmt="d"
)

plt.title("BM top 10 countries by global time periods")
plt.xlabel("Time period")
plt.ylabel("Country")
plt.tight_layout()
plt.show()


## Summary: Looking at the result and why the heatmap is still not the optimal visualisation

The heatmap works technically, and this time it really is a proper 10×10 grid. Each country now has values across several time periods instead of just one, so the plot finally behaves like a true heatmap rather than a single coloured cell per row.

But even so, the result is still quite hard to read. A few reasons for this became obvious as soon as I saw the output:

- The distribution of objects over time is very uneven. Some countries (like Greece and Egypt) have large clusters in certain periods, while others barely appear.
- Many cells are zero, which creates a patchy pattern with a lot of empty white space.
- The ranges are still very broad (thousands of years in some bins), so countries with long histories end up dominating the visual, while others remain sparse.
- Museums naturally collect unevenly across time, meaning some periods have far more surviving material than others.

This means the data itself is not evenly shaped for heatmapping. A heatmap works best when both axes have fairly balanced variation. Here, the “country” axis is strong, but the “time period” axis is highly skewed.
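
One option I have not tried here, but which might soften the skew, is to convert each row to shares so that every country is compared against its own total rather than against the largest collection. A minimal sketch on made-up numbers (the country names and counts are placeholders):

```python
import pandas as pd

# Made-up country x period counts with very different row totals
toy_matrix = pd.DataFrame(
    {"1000–500 BC": [900, 5], "1500–1800 AD": [100, 15]},
    index=["Country A", "Country B"],
)

# Divide each row by its own total so every row sums to 1
row_shares = toy_matrix.div(toy_matrix.sum(axis=1), axis=0)

print(row_shares)
```

Country B has only 20 objects but its pattern (25% and 75%) becomes as visible as Country A's (90% and 10%). The trade-off is that the plot no longer shows how large each collection is.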

Even though this version is much better than the earlier experiments, it still shows that country-by-time-period is not the clearest way to view the data. This is useful in itself: it tells me that not every dataset naturally suits a heatmap, and the structure of the data matters just as much as the code.

At this point I'm going to stop the heatmap experimentation. I think a choropleth map will give a cleaner and more meaningful view of the data at country level, especially once I have the combined dataset from the group. When that file is ready, I’ll move on to trying the Folium approach for mapping the countries geographically.
