# Linking BGB to GZM

*Gerhard de Kok*

This script links entries in the BGB (Boekhouder-Generaal Batavia) to entries in the GZM (Generale Zeemonsterrollen). It also enriches the GZM-data with links to DAS ship IDs.

How it works:
* Look up all (unique) musterings in the GZM for which there is no current link to a DAS-ship
* See if there is a voyage in the BGB with a similarly named ship in the same year (plus or minus 1)
* If so: make a match

Caveat: ships in both the BGB and the GZM have not been disambiguated beforehand. Doing so would result in more matches, but also more interpretation. 
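
The three steps above can be illustrated on a toy example. This is a minimal sketch, not the script itself: the `(name, year)` tuples, the function `match_musterings` and its parameter `max_year_diff` are made up for illustration, while `difflib.get_close_matches` with a cutoff of 0.85 is what the notebook uses further down.

```python
import difflib

# Hypothetical toy data: (ship name, year) pairs standing in for
# GZM musterings and BGB voyages; the real script reads these from Excel.
gzm_musterings = [("aardenhout", 1703), ("batavia", 1691)]
bgb_voyages = [("aerdenhout", 1702), ("aerdenhout", 1710), ("leguaan", 1763)]


def match_musterings(musterings, voyages, cutoff=0.85, max_year_diff=1):
    """Return (mustering, voyage) pairs with similar names and close years."""
    voyage_names = sorted({name for name, _ in voyages})
    matches = []
    for ship, year in musterings:
        # Best fuzzy match for this ship name, if any reaches the cutoff
        close = difflib.get_close_matches(ship, voyage_names, n=1, cutoff=cutoff)
        if not close:
            continue
        # Keep only voyages of that ship within the allowed year window
        for name, booking_year in voyages:
            if name == close[0] and abs(booking_year - year) <= max_year_diff:
                matches.append(((ship, year), (name, booking_year)))
    return matches


matches = match_musterings(gzm_musterings, bgb_voyages)
print(matches)
```

In this toy data only the 1703 mustering of 'aardenhout' is linked, to the 1702 voyage of the spelling variant 'aerdenhout'; the 1710 voyage falls outside the one-year window.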

In [1]:
# Import necessary modules
import pandas as pd
import numpy as np
import difflib

In [3]:
# First, load the entire Zeemonsterrollen database and the DAS database (Excel format)
gzm = pd.ExcelFile('gzm.xlsx')
das = pd.ExcelFile('das.xlsx')

# Load the match list between BGB and DAS
bgbmatch = pd.ExcelFile('Matching_results.xlsx')

In [4]:
# Parse the Excel sheets we will be using
zeemons = gzm.parse('Database GZM (MvR 2014)')
zeemons = zeemons.set_index('ID')
das_ships = das.parse('shipNameVariant')
das_ships = das_ships.set_index('shipNameVariantID')
das_voyages = das.parse('das_voyage')
das_voyages = das_voyages.set_index('voyId')

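# Add empty result columns to the GZM dataframe
zeemons['DAS ID HEENREIS'] = np.nan
zeemons['DAS ID TERUGREIS'] = np.nan
# Ship IDs are strings (e.g. 'DAS_ship0853'), so use an object column
zeemons['DAS SHIP ID'] = pd.Series(np.nan, index=zeemons.index, dtype=object)

# Load the BGB voyages without a DAS match and lower-case the ship names
bgbnomatch = bgbmatch.parse('No match')
bgbnomatch['BGB ship name'] = bgbnomatch['BGB ship name'].str.lower()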

## 1. Linking GZM to DAS ship IDs
For many musterings, the GZM includes a link to the DAS voyages for the outward and homebound voyages of VOC vessels. What we want is the corresponding DAS ship ID, to link the mustering to a DAS ship (instead of just to a voyage).

The code below looks at the DAS voyage-links already present in the GZM and extracts the corresponding ship ID (from DAS).

In [5]:
# Loop over entries in Zeemonsterrollen
# If there is an entry to DAS present: add voyage IDs and Ship IDs
for entry in zeemons.index:
    dasheen = zeemons.loc[entry, 'DAS HEENREIS']
    dasterug = zeemons.loc[entry, 'DAS TERUGREIS']

    das_id_heen = np.nan
    das_id_terug = np.nan
    das_ship_id = np.nan

    if not pd.isnull(dasheen):
        # Band-aid try/except for data error (voyage number not found in DAS)
        try:
            das_id_heen = das_voyages.loc[das_voyages['voyNumberDAS'] == dasheen].index[0]
            das_ship_id = das_voyages.loc[das_id_heen, 'shipID']
        except (IndexError, KeyError):
            das_id_heen = np.nan

    if not pd.isnull(dasterug):
        # Band-aid try/except for data error (voyage number not found in DAS)
        try:
            das_id_terug = das_voyages.loc[das_voyages['voyNumberDAS'] == dasterug].index[0]
            # Fall back to the return voyage if the outward voyage gave no ship ID
            if pd.isnull(das_ship_id):
                das_ship_id = das_voyages.loc[das_id_terug, 'shipID']
        except (IndexError, KeyError):
            das_id_terug = np.nan

    zeemons.loc[entry, 'DAS ID HEENREIS'] = das_id_heen
    zeemons.loc[entry, 'DAS ID TERUGREIS'] = das_id_terug
    zeemons.loc[entry, 'DAS SHIP ID'] = das_ship_id

# Empty column for the GZM ship IDs that will be matched to BGB voyages
bgbnomatch['GZB ship ID'] = pd.Series(np.nan, index=bgbnomatch.index, dtype=object)

# Convert floats to ints
zeemons['DAS ID HEENREIS'] = zeemons['DAS ID HEENREIS'].astype('Int64')
zeemons['DAS ID TERUGREIS'] = zeemons['DAS ID TERUGREIS'].astype('Int64')
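
The row-by-row lookup above can also be expressed as a vectorized mapping. The following is a minimal sketch on made-up miniature dataframes that only borrow the column names of `zeemons` and `das_voyages`. It assumes that `voyNumberDAS` values are unique in DAS (duplicates are dropped to be safe), and the fallback from the outward to the return voyage is an assumption about the intended behaviour.

```python
import pandas as pd

# Hypothetical miniature stand-ins for das_voyages and zeemons
das_voyages = pd.DataFrame(
    {
        "voyNumberDAS": [1598.2, 5903.2, 1587.4],
        "shipID": ["DAS_ship0853", "DAS_ship0853", "DAS_ship1245"],
    },
    index=pd.Index([92658, 96812, 92647], name="voyId"),
)
zeemons = pd.DataFrame(
    {
        "DAS HEENREIS": [1598.2, 1587.4, None, None],
        "DAS TERUGREIS": [5903.2, None, None, 5903.2],
    },
    index=pd.Index([6, 7, 23, 24], name="ID"),
)

# Lookup table keyed on the DAS voyage number
lookup = (
    das_voyages.reset_index()
    .drop_duplicates("voyNumberDAS")
    .set_index("voyNumberDAS")
)

# Voyage numbers that are missing or unknown simply map to <NA>
zeemons["DAS ID HEENREIS"] = zeemons["DAS HEENREIS"].map(lookup["voyId"]).astype("Int64")
zeemons["DAS ID TERUGREIS"] = zeemons["DAS TERUGREIS"].map(lookup["voyId"]).astype("Int64")

# Prefer the ship of the outward voyage, fall back to the return voyage
zeemons["DAS SHIP ID"] = (
    zeemons["DAS HEENREIS"].map(lookup["shipID"])
    .fillna(zeemons["DAS TERUGREIS"].map(lookup["shipID"]))
)
print(zeemons)
```

Unlike the try/except version, this variant never raises on unknown voyage numbers, so data errors show up as missing values rather than being silently caught.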

## 2. Linking BGB voyages to GZM ships
Neither the GZM nor the BGB disambiguates its ships. Therefore, we will now give every ship in the GZM database its own ship ID. Because the ships are not yet disambiguated, each entry (mustering) gets a separate ID, so a single ship with multiple musterings will get multiple IDs (to be disambiguated later, should we want to).

In [6]:
# Add internal Ship IDs to GZM database (based on conversation with Leon)
# The IDs are strings, so use an object column
zeemons['GZM SHIP ID'] = pd.Series(np.nan, index=zeemons.index, dtype=object)

for entry in zeemons.index:

    shipnames = zeemons.loc[entry, 'NAAM SCHIP (GESTANDAARDISEERD)']

    # If no shipname is filled in, it is not a ship mustering
    if pd.isnull(shipnames):
        continue

    # Several musterings have data for multiple ships, their names separated by '; '
    shipnames = shipnames.split('; ')

    # A single ship keeps the mustering ID; multiple ships get a '-1', '-2', ... suffix
    if len(shipnames) > 1:
        listofids = [str(entry) + "-" + str(c) for c in range(1, len(shipnames) + 1)]
    else:
        listofids = [str(entry)]

    zeemons.loc[entry, 'GZM SHIP ID'] = "; ".join(listofids)

In [7]:
# Show the final result (copied to zeemonsterrollen db)
zeemons

ID,JAAR,INVENTARISNUMMER (Nationaal Archief 1.04.02),TYPE ADMINISTRATIE,FOLIONUMMER,NAAM SCHIP (ORIGINEEL),NAAM SCHIP (GESTANDAARDISEERD),SCHEEPSTYPE (ORIGINEEL),SCHEEPSTYPE (GESTANDAARDISEERD),LOKATIE (JUNI),TOTAAL OPVARENDEN,...,SCHIPPER TUSSENVOEGSEL,SCHIPPER ACHTERNAAM,SCHIPPER HERKOMST,SCHIPPER AANKOMST SCHIP EN JAAR,DAS HEENREIS,DAS TERUGREIS,DAS ID HEENREIS,DAS ID TERUGREIS,DAS SHIP ID,GZM SHIP ID
6,1691,11707,Zeemonsterrol,1,Landswelvaren,LANDS WELVAREN,Schip,schip,"Batavia, Ter Rhede",126,...,,,,,1598.2,5903.2,92658,96812,DAS_ship0853,6
7,1691,11707,Zeemonsterrol,6,De Ridderschap van Holland,RIDDERSCHAP VAN HOLLAND,Schip,schip,"Batavia, Ter Rhede",128,...,,,,,1587.4,5895.4,92647,96804,DAS_ship1245,7
8,1691,11707,Zeemonsterrol,11,De Goede Hoop,GOEDE HOOP,Schip,schip,"Batavia, Ter Rhede",79,...,,,,,1591.1,5900.1,92651,96809,DAS_ship0469,8
9,1691,11707,Zeemonsterrol,14,Schoondijk,SCHOONDIJK,Schip,schip,"Batavia, Ter Rhede",74,...,,,,,1585.1,5896.1,92645,96805,DAS_ship1331,9
10,1691,11707,Zeemonsterrol,16,Waterland,WATERLAND,Schip,schip,"Batavia, Ter Rhede",67,...,,,,,1601.3,5899.3,92661,96808,DAS_ship1662,10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5322,1791,11705,Zeemonsterrol,30,Catharina Johanna,VROUWE KATHARINA JOHANNA,Schip,schip,Onrust,12,...,,,,,4611.2,8341.2,95649,99124,DAS_ship1599,5322
5323,1791,11705,Zeemonsterrol,30,Oud Haerlem,OUD HAARLEM,Schip,schip,Onrust,6,...,,,,,4571.7,,95609,,DAS_ship1126,5323
5324,1791,11705,Zeemonsterrol,30,De Jonge Oranjeboom,ORANJEBOOM,Schip,schip,"Batavia, ter rede",32,...,,,,,,,,,,5324
5325,1791,11705,Zeemonsterrol,31,Cornelia Adriana,CORNELIA ADRIANA,Schip,schip,Onrust,5,...,,,,,,,,,,5325


In [8]:
# Get the GZM-entries for which there is no link to DAS
# (.copy() so the lower-casing below does not act on a view of zeemons)
gzm_to_match = zeemons.loc[zeemons['DAS SHIP ID'].isna(), ['JAAR', 'NAAM SCHIP (GESTANDAARDISEERD)', 'GZM SHIP ID']].copy()
gzm_to_match['NAAM SCHIP (GESTANDAARDISEERD)'] = gzm_to_match['NAAM SCHIP (GESTANDAARDISEERD)'].str.lower()
gzm_to_match = gzm_to_match.dropna()
gzm_to_match

ID,JAAR,NAAM SCHIP (GESTANDAARDISEERD),GZM SHIP ID
23,1691,sint nicolaas,23
24,1691,grijpvogel,24
37,1691,batavia,37
42,1691,standvastigheid,42
56,1691,wijk op zee,56
...,...,...,...
5292,1790,triton,5292
5293,1790,afrikaan,5293
5324,1791,oranjeboom,5324
5325,1791,cornelia adriana,5325


In [9]:
# Get a list of unique BGB-shipnames (for which no DAS-match exists)
bgb_shiplist = list(bgbnomatch['BGB ship name'].unique())
bgb_shiplist.remove('onbekend')

In [10]:
matchlist = []

# Match the musterings to BGB shipvoyages, if possible
for monstering in gzm_to_match.index:
    gzm_shipname = gzm_to_match.loc[monstering, 'NAAM SCHIP (GESTANDAARDISEERD)']
    gzm_shipname = gzm_shipname.split('; ')
    for c, ship in enumerate(gzm_shipname):
        
        # See if there is a similarly named ship in BGB
        checking = difflib.get_close_matches(ship, bgb_shiplist, n=3, cutoff=0.85)
        
        if checking:
            bgb_shipname = checking[0]
            matching_voy = list(bgbnomatch.index[bgbnomatch['BGB ship name'] == bgb_shipname])
            
            # For each voyage of that similarly named ship: get its booking year
            for match in matching_voy:
                bgb_bookingyear = int(bgbnomatch.loc[match, 'BGB Booking year'])
                gzm_year = int(gzm_to_match.loc[monstering, 'JAAR'])
                      
                # Check if the BGB booking year is the same as the year of the mustering in the GZM (+/- 1)
                if abs(bgb_bookingyear - gzm_year) <= 1:

                    bgb_shipvoyage = bgbnomatch.loc[match, 'BGB Shipvoyage ID']
                    gzm_shipid = gzm_to_match.loc[monstering, 'GZM SHIP ID']
                    # A mustering with several ships has several IDs: pick the one for this ship
                    if ';' in gzm_shipid:
                        gzm_shipidlist = gzm_shipid.split('; ')
                        gzm_shipid = gzm_shipidlist[c]
                    
                    matchlist.append([bgb_shipvoyage, gzm_shipid])

# Write the matches to the BGB dataframe
for match in matchlist:
    indexno = bgbnomatch['GZB ship ID'].index[bgbnomatch['BGB Shipvoyage ID'] == match[0]]
    bgbnomatch.loc[indexno, 'GZB ship ID'] = match[1]
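
Note that the write-back loop stores one GZM ship ID per BGB voyage: when a voyage qualifies for several musterings (for example one in the same year and one a year later), the last match overwrites the earlier ones. If all candidates should be kept instead, they could be collected per voyage first. This is a minimal sketch with illustrative values in the same `[bgb_shipvoyage, gzm_shipid]` shape as `matchlist`; the name `candidates` is made up.

```python
import pandas as pd

# Illustrative matchlist: voyage 15976 qualifies for two musterings
matchlist = [[15976, "848"], [15976, "900"], [10544, "848"]]

# One row per BGB voyage, with all candidate GZM ship IDs joined by '; '
candidates = (
    pd.DataFrame(matchlist, columns=["BGB Shipvoyage ID", "GZM SHIP ID"])
    .groupby("BGB Shipvoyage ID")["GZM SHIP ID"]
    .agg("; ".join)
)
print(candidates)
```

The resulting series could then be mapped onto `bgbnomatch['BGB Shipvoyage ID']`, at the cost of storing multi-valued IDs that need to be resolved later.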

In [11]:
# View the resulting matches
bgbnomatch.dropna()

Unnamed: 0.1,Unnamed: 0,BGB Shipvoyage ID,BGB Voyage ID,BGB ship ID,BGB ship name,BGB Booking year,GZB ship ID
0,0,139,99448,3150,kleine pallas,1790,5285
14,14,171,99481,3163,cornelia adriana,1780,4927
15,15,172,99484,3163,cornelia adriana,1780,4927
16,16,174,99486,3150,kleine pallas,1780,4964
21,21,257,99565,3150,kleine pallas,1790,5285
...,...,...,...,...,...,...,...
3805,3805,19958,118060,4642,landouw,1729,2958
3892,3892,20229,118248,3901,leguaan,1763,4496
3916,3916,20267,117034,4703,haai,1723,2597
3917,3917,20268,117034,4936,ontong java,1723,2598


In [12]:
# Write result to Excel
bgbnomatch.to_excel("bgb_gzm.xlsx")

## 3. Disambiguating ships in BGB/GZM
The BGB contains information on almost 20,000 voyages of VOC vessels. A previous script could not match 3,988 vessels to similarly named vessels in DAS. These unmatched vessels were mostly ships with an unknown name or small coastal craft.

We will now try to identify the individual vessels among these 3,988 unmatched entries. Let's get an overview of all ships, based only on their name (and the first and last appearance of that name).

In [13]:
bgbnomatch

Unnamed: 0.1,Unnamed: 0,BGB Shipvoyage ID,BGB Voyage ID,BGB ship ID,BGB ship name,BGB Booking year,GZB ship ID
0,0,139,99448,3150,kleine pallas,1790,5285
1,1,145,99454,3153,langmoedigheid,1790,
2,2,149,99458,3154,vredelief,1790,
3,3,151,99460,3155,wilhelmina,1790,
4,4,153,99462,3156,jonge wilhelmina,1790,
...,...,...,...,...,...,...,...
3983,3983,20338,116787,3412,onbekend,1731,
3984,3984,20339,117012,3412,onbekend,1723,
3985,3985,20340,117815,3412,onbekend,1729,
3986,3986,20341,117832,3412,onbekend,1729,


In [17]:
# Use a list (not a set) of aggregations so the column order is deterministic
bgbshiprange = bgbnomatch.groupby('BGB ship name')['BGB Booking year'].agg(['count', 'min', 'max']).reset_index()
bgbshiprange

Unnamed: 0,BGB ship name,count,min,max
0,'t kasteel de vijf sinnen,1,1738,1738
1,aardenhout,10,1701,1704
2,actif,1,1783,1783
3,adriana,61,1735,1781
4,adriana johanna,1,1743,1743
...,...,...,...,...
811,zeewager,1,1801,1801
812,zeilvis,1,1703,1703
813,zijdeteelt,2,1731,1731
814,zomer,12,0,1785


The above dataframe tells us there are 816 distinct ship names in the BGB unmatched table. The ship 'Aardenhout' has 10 mentions between 1701 and 1704. Let's take a closer look:

In [18]:
bgbnomatch.loc[bgbnomatch['BGB ship name'] == 'aardenhout']

Unnamed: 0.1,Unnamed: 0,BGB Shipvoyage ID,BGB Voyage ID,BGB ship ID,BGB ship name,BGB Booking year,GZB ship ID
2064,2064,10544,109701,4250,aardenhout,1702,848.0
2646,2646,13998,112893,4250,aardenhout,1701,
2704,2704,14271,113140,4250,aardenhout,1704,1039.0
2723,2723,14435,113297,4250,aardenhout,1701,
2725,2725,14454,113317,4250,aardenhout,1701,
2750,2750,14619,113452,4250,aardenhout,1701,
2784,2784,14822,113628,4250,aardenhout,1701,
3008,3008,15976,114656,4250,aardenhout,1703,900.0
3009,3009,15977,114657,4250,aardenhout,1703,900.0
3510,3510,18637,116934,4250,aardenhout,1703,900.0


This is probably the same ship! It is matched to multiple GZB ship IDs (because these are also not disambiguated).

Now let's look at the individual vessels in the GZM.

In [22]:
# Use a list (not a set) of aggregations so the column order is deterministic
gzmshiprange = gzm_to_match.groupby('NAAM SCHIP (GESTANDAARDISEERD)')['JAAR'].agg(['count', 'min', 'max']).reset_index()
gzmshiprange

Unnamed: 0,NAAM SCHIP (GESTANDAARDISEERD),count,min,max
0,[eiland edam],1,1703,1703
1,[onbekend],20,1696,1747
2,aardenhout,7,1696,1706
3,achilles,6,1714,1719
4,adam,7,1712,1719
...,...,...,...,...
398,zon,1,1708,1708
399,zuster,1,1736,1736
400,zwaardvis,6,1697,1708
401,zwarte arend,1,1696,1696


Again, let's take a closer look at the 'Aardenhout':

In [23]:
gzm_to_match.loc[gzm_to_match['NAAM SCHIP (GESTANDAARDISEERD)'] == 'aardenhout']

ID,JAAR,NAAM SCHIP (GESTANDAARDISEERD),GZM SHIP ID
329,1696,aardenhout,329
406,1697,aardenhout,406
525,1698,aardenhout,525
848,1703,aardenhout,848
900,1704,aardenhout,900
1039,1705,aardenhout,1039
1129,1706,aardenhout,1129


So the question now becomes: how to link these datasets based on ships? First disambiguate GZM and then use the resulting IDs to link to BGB? Or the other way around? Or disambiguate both GZM and BGB and link the resulting IDs?
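
Whichever order is chosen, one possible first step is to group records of the same name into candidate ships, starting a new candidate whenever the name disappears for a long time. This is only a sketch of that idea, not a tested method: the sample data, the column names `name`, `year` and `candidate ship`, and the `MAX_GAP` threshold of 10 years are all assumptions.

```python
import pandas as pd

# Hypothetical sample in the shape of gzm_to_match (name + mustering year)
musterings = pd.DataFrame({
    "name": ["aardenhout"] * 4 + ["adam"] * 3,
    "year": [1696, 1697, 1698, 1703, 1712, 1713, 1740],
})

MAX_GAP = 10  # assumed threshold: a longer silence suggests a different ship

musterings = musterings.sort_values(["name", "year"]).reset_index(drop=True)

# Years since the previous record of the same name (NaN for the first record)
gap = musterings.groupby("name")["year"].diff()

# A new candidate ship starts at the first record of a name or after a long gap
new_ship = gap.isna() | (gap > MAX_GAP)
musterings["candidate ship"] = new_ship.cumsum()
print(musterings)
```

Here the four 'aardenhout' musterings end up in one candidate ship, while 'adam' is split in two because of the gap between 1713 and 1740. The same grouping could be run on the BGB names and booking years, after which the candidate ships of both datasets could be linked by name and overlapping years.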