# ID-ing remaining BGB and GZM ships

*Gerhard de Kok*

This script tries to identify unique (individual) ships from both the Boekhouder-Generaal Batavia (BGB) and Generale Zeemonsterrollen (GZM) datasets. It then gives these ships a unique ID and creates a linkset to link BGB voyages and GZM musterings to these IDs.

> #### BGB
>The BGB is a dataset with information on almost 20,000 voyages of the Dutch East India Company (VOC), which includes information on cargoes carried and ship used. It **does not** contain a disambiguated list of ships, i.e. a single ship name may refer to multiple physical vessels.
> #### GZM
>The GZM is a dataset with information on more than 5,300 musterings done in the late seventeenth and eighteenth century by officials of the Dutch East India Company (VOC). In a mustering, all personnel serving on a vessel or land-based location were listed. For vessel musterings, the dataset includes standardized versions of the ship names. There **is no** disambiguated list of shipnames, i.e. a single ship name may refer to multiple physical vessels in the dataset.
> #### DAS
> DAS is a dataset containing information on more than 4,700 voyages of the VOC, but it is limited to intercontinental voyages between Europe and Asia. It **does** contain a disambiguated list of ships.

## Previously established links to a disambiguated ship list (DAS)

Some musterings in the GZM can easily be linked to ships in an existing list of disambiguated ships (namely the list of DAS ships). In addition, many voyages in the BGB have already been link to this same list of DAS ships.

GZM has as list of musterings, which have unique IDs. If it is a ship mustering, the name of the ship is recorded (both the name from the source and a standardized version). The creator of the dataset (Matthias van Rossum) included link to DAS voyages (which have unique IDs), but not to DAS ship IDs. We want to establish a link between the mustering and the unique ship ID from DAS in these cases. This is really simple, by just looking up the ship IDs in DAS based on the DAS voyage ID already present in the GZM.

For the BGB, a previous script linked voyages in this dataset to DAS ship IDs. 

## The problem
For many musterings in the GZM and many voyages in the BGB, no link to a DAS ship ID can be made. This is usually the case when no DAS ship with a similar name was active in the same period as the ship mentioned in a GZM mustering or BGB voyage. It is to be expected that many of these ships were active in the inter-Asiatic trade circuits and not used on intercontinental voyages (and thus not present in DAS). 

The goal is to create a list of unique IDs for every ship in the GZM and BGB for which no DAS ship ID is available.

## Automated solution

We begin we a pre-step: linking the GZM to DAS ship IDs in cases where Matthias already made a link to a DAS voyage ID.

That leaves us with a list of BGB voyages without a DAS ship ID and a list of GZM musterings without a DAS ship ID. 

For both of these lists:
1. Group them on unique shipnames.
2. See if the first and last mention in the dataset of this unique shipname falls within a 30-year timerange. If it is active for more than 30 years, chances are too high that the same name belonged to multiple ships.

That leaves us with two 'disambiguated' lists, which we then combine into one list. We give unique IDs to each ship in that masterlist and link those to GZM and BGB entries.

Further details are in the script below.

In [1]:
# Import necessary modules
import pandas as pd
import numpy as np
import difflib

In [2]:
# First, load the entire Zeemonsterrollen database and the DAS database (Excel format)
gzm = pd.ExcelFile('gzm.xlsx')
das = pd.ExcelFile('das.xlsx')

# Load the match list between BGB and DAS (which contains a sheet with BGB voyages for which no DAS match could be made)
bgbmatch = pd.ExcelFile('Matching_results.xlsx')

In [3]:
# Parse the Excel sheets we will be using
zeemons = gzm.parse('Database GZM (MvR 2014)')
zeemons = zeemons.set_index('ID')
das_ships = das.parse('shipNameVariant')
das_ships = das_ships.set_index('shipNameVariantID')
das_voyages = das.parse('das_voyage')
das_voyages = das_voyages.set_index('voyId')
bgbnomatch = bgbmatch.parse('No match')
bgbnomatch['BGB ship name'] = bgbnomatch['BGB ship name'].str.lower()

# Add columns to the dataframes
zeemons['DAS ID HEENREIS'] = np.nan
zeemons['DAS ID TERUGREIS'] = np.nan
zeemons['DAS SHIP ID'] = np.nan

### Linking GZM to DAS ship IDs
For many musterings, the GZM includes a link to the DAS voyages for the outwards and homebound voyages of VOC vessels. What we want is the corresponding DAS ship ID, to link the mustering to a DAS ship (instead of just to a voyage). 

The code below looks at the DAS voyage-links already present in the GZM and extracts the corresponding ship ID (from DAS).

In [4]:
# Loop over entries in Zeemonsterrollen  
# If there is an entry to DAS present: add voyage IDs and Ship IDs
for entry in zeemons.index:
    dasheen = zeemons.loc[entry, 'DAS HEENREIS']
    dasterug = zeemons.loc[entry, 'DAS TERUGREIS']
    
    das_id_heen = np.nan
    das_id_terug = np.nan
    das_ship_id = np.nan
    das_shipname_id = np.nan
    
    if not pd.isnull(dasheen):
        
        # Band-aid try/except for data error
        try:
            das_id_heen = das_voyages.loc[das_voyages['voyNumberDAS'] == dasheen].index[0]
            das_ship_id = das_voyages.loc[das_id_heen, 'shipID']
            das_shipname_id = das_voyages.loc[das_id_heen, 'shipName']
        except:
            das_id_heen = np.nan
        
    if not pd.isnull(dasterug):
        # Band-aid try/except for data error
        try:
            das_id_terug = das_voyages.loc[das_voyages['voyNumberDAS'] == dasterug].index[0]
            das_ship_id = das_voyages.loc[das_id_heen, 'shipID']
            das_shipname_id = das_voyages.loc[das_id_heen, 'shipName']
        except:
            das_id_heen = np.nan
        
    zeemons.loc[entry, 'DAS ID HEENREIS'] = das_id_heen
    zeemons.loc[entry, 'DAS ID TERUGREIS'] = das_id_terug
    zeemons.loc[entry, 'DAS SHIP ID'] = das_ship_id

# Convert floats to ints
zeemons['DAS ID HEENREIS'] = zeemons['DAS ID HEENREIS'].astype('Int64')
zeemons['DAS ID TERUGREIS'] = zeemons['DAS ID TERUGREIS'].astype('Int64')

GZM musterings have unique IDs, but one mustering may involve multiple ships. In this case, the shipname-fields seperate the names with a ';'. Since we want each mention of a ship to have a unique ID, we have to preprocess the GZM dataset to include one unique ship ID for every mention of a ship. Note that these IDs are not yet disambiguated, so a single ship may get multiple IDs (if it is mentioned in multiple musterings).

In [5]:
# Add internal Ship IDs to GZM database (based on conversation with Leon)
zeemons['GZM SHIP ID'] = np.nan

for entry in zeemons.index:
    
    listofids = []
    
    shipnames = zeemons.loc[entry, 'NAAM SCHIP (GESTANDAARDISEERD)']

    # If no shipname is filled in, it is not a ship mustering 
    if pd.isnull(shipnames):
        continue
        
    # Several musterings have data for multiple ships, their names separated by ';'
    shipnames = shipnames.split('; ')
    
    for c, shipname in enumerate(shipnames, 1):
        if len(shipnames) > 1:
            shipid = str(entry) + "-" + str(c)
            listofids.append(shipid)
        else:
            shipid = str(entry)
            listofids.append(shipid)
    
    zeemons.loc[entry, 'GZM SHIP ID'] = "; ".join(listofids)

### Create a unique shiplist for BGB
BGB contains information on almost 20,000 voyages. A previous script could not match the ships used in 3,988 voyages to vessels in DAS. These unmatched vessels were mostly ships with an unknown name or small coastal craft.

We will now try to extract the individual vessels from the 3,988 voyages that are present in this collection. Let's get an overview of all ships, just based on their name (and the first and last appearance of that name).

In [8]:
bgbshiprange = bgbnomatch.groupby('BGB ship name')['BGB Booking year'].agg({'min', 'max', 'count'}).reset_index()
bgbshiprange

Unnamed: 0,BGB ship name,max,count,min
0,'t kasteel de vijf sinnen,1738,1,1738
1,aardenhout,1704,10,1701
2,actif,1783,1,1783
3,adriana,1781,61,1735
4,adriana johanna,1743,1,1743
...,...,...,...,...
811,zeewager,1801,1,1801
812,zeilvis,1703,1,1703
813,zijdeteelt,1731,2,1731
814,zomer,1785,12,0


The above dataframe tells us there are 816 individual shipnames in the BGB unmatched table. For instance, the ship 'Aardenhout' has 10 mentions between 1701 and 1704.

Let's now assume each individual shipname to be an individual ship if the time between its first and last mention is no longer than 30 years.

In [9]:
unique_bgb_ships = bgbshiprange[(bgbshiprange['max'] - bgbshiprange['min']) <= 30].sort_values(by=['BGB ship name'])
unique_bgb_ships = unique_bgb_ships[['BGB ship name', 'min', 'max', 'count']]
unique_bgb_ships.rename(columns={'BGB ship name': 'shipname'}, inplace=True)
unique_bgb_ships.reset_index(inplace=True)
del unique_bgb_ships['index']
unique_bgb_ships

Unnamed: 0,shipname,min,max,count
0,'t kasteel de vijf sinnen,1738,1738,1
1,aardenhout,1701,1704,10
2,actif,1783,1783,1
3,adriana johanna,1743,1743,1
4,adriana wilhelmina,1774,1774,1
...,...,...,...,...
699,zeerob,0,0,1
700,zeevaart,1790,1790,1
701,zeewager,1801,1801,1
702,zeilvis,1703,1703,1


For the BGB, this leaves 704 individual ships.

### Create a unique shiplist for GZM
Now let's try to extract the individual vessels from the GZM. In case there is a link to a DAS ship ID, the ship is already disambiguated. That leaves 1,654 ship musterings for which there is no match to a unique ship ID:

In [10]:
# Get the GZM-entries for which there is no link to DAS
gzm_to_match = zeemons[['JAAR', 'NAAM SCHIP (GESTANDAARDISEERD)', 'GZM SHIP ID']].loc[zeemons['DAS SHIP ID'].isna()]
gzm_to_match['NAAM SCHIP (GESTANDAARDISEERD)'] = gzm_to_match['NAAM SCHIP (GESTANDAARDISEERD)'].str.lower()
gzm_to_match = gzm_to_match.dropna()
gzm_to_match

Unnamed: 0_level_0,JAAR,NAAM SCHIP (GESTANDAARDISEERD),GZM SHIP ID
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
23,1691,sint nicolaas,23
24,1691,grijpvogel,24
37,1691,batavia,37
42,1691,standvastigheid,42
56,1691,wijk op zee,56
...,...,...,...
5292,1790,triton,5292
5293,1790,afrikaan,5293
5324,1791,oranjeboom,5324
5325,1791,cornelia adriana,5325


Let's now group this dataframe on (individual) shipnames and the years between which these were active. We first need an extra step, because one mustering may contain multiple ships.

In [11]:
#Some GZM entries have multiple ships (separated with ';'), explode these entries (Pandas 1.3+ required)
dis_gzm_to_match = gzm_to_match 
dis_gzm_to_match["NAAM SCHIP (GESTANDAARDISEERD)"] = dis_gzm_to_match["NAAM SCHIP (GESTANDAARDISEERD)"].str.split('; ')
dis_gzm_to_match["GZM SHIP ID"] = dis_gzm_to_match["GZM SHIP ID"].str.split('; ')
dis_gzm_to_match = dis_gzm_to_match.explode(['NAAM SCHIP (GESTANDAARDISEERD)', 'GZM SHIP ID'])

gzmshiprange = dis_gzm_to_match.groupby('NAAM SCHIP (GESTANDAARDISEERD)')['JAAR'].agg({'min', 'max', 'count'}).reset_index()
gzmshiprange

Unnamed: 0,NAAM SCHIP (GESTANDAARDISEERD),max,count,min
0,landouw,1719,2,1718
1,[eiland edam],1703,1,1703
2,[onbekend],1747,20,1696
3,aardenhout,1706,7,1696
4,achilles,1719,6,1714
...,...,...,...,...
390,zon,1708,1,1708
391,zuster,1736,1,1736
392,zwaardvis,1708,6,1697
393,zwarte arend,1696,1,1696


Now let's apply the same rule as we did for the BGB: assume each individual shipname to be an individual ship if the time between its first and last mention is no longer than 30 years.

In [13]:
unique_gzm_ships = gzmshiprange[(gzmshiprange['max'] - gzmshiprange['min']) <= 30].sort_values(by=['NAAM SCHIP (GESTANDAARDISEERD)'])
unique_gzm_ships = unique_gzm_ships[['NAAM SCHIP (GESTANDAARDISEERD)', 'min', 'max', 'count']]
unique_gzm_ships.rename(columns={'NAAM SCHIP (GESTANDAARDISEERD)': 'shipname'}, inplace=True)
unique_gzm_ships.reset_index(inplace=True)
del unique_gzm_ships['index']
unique_gzm_ships

Unnamed: 0,shipname,min,max,count
0,landouw,1718,1719,2
1,[eiland edam],1703,1703,1
2,aardenhout,1696,1706,7
3,achilles,1714,1719,6
4,adam,1712,1719,7
...,...,...,...,...
369,zon,1708,1708,1
370,zuster,1736,1736,1
371,zwaardvis,1697,1708,6
372,zwarte arend,1696,1696,1


For the GZM, applying this rule leaves 374 individual ships.

### Merging the lists of unique BGB and GZM ship into a masterlist
We now have lists of 704 ships from BGB and 374 ships from GZM assumed to be unique. Both lists may overlap, so let's combine them into one masterlist. 

To do this, we:
1. Loop over each unique BGB ship.
2. See if a similarly named ship is present in the list of unique GZM ships.
3. If so, see if this ship was active in about the same time period.
4. If so, delete it from the GZM list and add it to the masterlist.
5. Finally, merge the BGB and GZM lists into one.

In [55]:
unique_gzm_shiplist = list(unique_gzm_ships['shipname'])
globships = []

# Loop over every unique shipname in BGB
for ship in unique_bgb_ships.index:
    bgb_shipname = unique_bgb_ships.loc[ship, 'shipname']
    
    # If the name contains a 'comma', it is probably not a real shipname (see dataset, i.e. 'pantjalang, 1')
    # Then add the ship, but do not look for a namematch
    if ',' in bgb_shipname:
        globships.append([ship, pd.NA])
    
    else:
        # See if there is a corresponding shipname in GZM
        match = difflib.get_close_matches(bgb_shipname, unique_gzm_shiplist, n=1, cutoff=0.85)
        
        # If not, add this ship as is
        if not match:
            globships.append([ship, pd.NA])
        
        # Else: look up the ID of the corresponding GZM ship and add it too, but ...
        else:
            gzm_shipname = match[0]
            gzm_index = unique_gzm_ships.index[unique_gzm_ships['shipname'] == gzm_shipname].astype(int)[0]
            
            # ... only add it as a combi to the masterlist if the GZM ship is active in the same period as the BGB voyage
            avg_year_gzm = (unique_gzm_ships.loc[gzm_index, ['max']].astype(int)[0] + unique_gzm_ships.loc[gzm_index, ['min']].astype(int)[0] ) / 2
            avg_year_bgb = (unique_bgb_ships.loc[ship, ['max']].astype(int)[0] + unique_bgb_ships.loc[ship, ['min']].astype(int)[0] ) / 2

            if abs(avg_year_gzm - avg_year_bgb) <= 15:
            
                globships.append([ship, int(gzm_index)])
                
                # Remove it from the GZM lists of ships that have been added to the masterlist
                unique_gzm_shiplist.remove(gzm_shipname)
            
            else:
                # If the timeperiode differs, just add the BGB ship as is
                globships.append([ship, pd.NA])


# Add the remaining GZM ships to the masterlist:
for not_matched in unique_gzm_shiplist:
    gzm_index = unique_gzm_ships.index[unique_gzm_ships['shipname'] == not_matched].astype(int)[0] 
    globships.append([pd.NA, gzm_index])
    

Let's create a Dataframe with the results. We want to include information on first and last mentions of each ship in GZM and BGB (if available).

In [56]:
# Create list that will become the Dataframe
masterlist = []

# Loop over list of individual ships
for ship in globships:
    bgb_id = ship[0]
    gzm_id = ship[1]
    bgb_min = pd.NA
    bgb_max = pd.NA
    gzm_min = pd.NA
    gzm_max = pd.NA
    shipname = ''
    
    # If the ship is present in BGB, add information
    if not pd.isna(bgb_id):
        shipname = unique_bgb_ships.loc[bgb_id, 'shipname']
        bgb_min = unique_bgb_ships.loc[bgb_id, 'min']
        bgb_max = unique_bgb_ships.loc[bgb_id, 'max']
    
    # If the ship is present in GZM, add information
    if not pd.isna(gzm_id):
        gzm_min = unique_gzm_ships.loc[gzm_id, 'min']
        gzm_max = unique_gzm_ships.loc[gzm_id, 'max']
        
        if shipname == '':
            shipname = unique_gzm_ships.loc[gzm_id, 'shipname']
     
    # Append entry to masterlist.
    masterlist.append([shipname, bgb_id, gzm_id, bgb_min, bgb_max, gzm_min, gzm_max])

In [57]:
# Now let's view the Dataframe. We want to give each ship an index no. starting with 1857. 
# Since DAS already contains 1856 unique ships, this gives a nice continuity.
master = pd.DataFrame(masterlist)
master.columns = ['Shipname', 'BGB_id', 'GZM_id', 'min_BGB', 'max_BGB', 'min_GZM', 'max_GZM']
master.sort_values(by='Shipname', inplace=True)
master.reset_index(drop=True, inplace=True)
master.index +=1857
master

Unnamed: 0,Shipname,BGB_id,GZM_id,min_BGB,max_BGB,min_GZM,max_GZM
1857,landouw,,0,,,1718,1719
1858,'t kasteel de vijf sinnen,0,,1738,1738,,
1859,[eiland edam],,1,,,1703,1703
1860,aardenhout,1,2,1701,1704,1696,1706
1861,achilles,,3,,,1714,1719
...,...,...,...,...,...,...,...
2853,zijdeteelt,703,,1731,1731,,
2854,zon,,369,,,1708,1708
2855,zuster,,370,,,1736,1736
2856,zwaardvis,,371,,,1697,1708


The result of combining the list of 704 BGB ships and 374 GZM ships is a new masterlist with 1001 ships. This means 77 ships have been linked together between both datasets.

### Linking the GZM dataset to the new masterlist

Now we want to link each individual ship mentioned in the musterings (and for which no DAS link is present) to a ship in our new masterlist (if possible). 

We use our previously constructed Dataframe where all individual ships without a DAS link have their own entry (the exploded version).

In [122]:
# Add a column to the Dataframe for the Globalise ship ID (from our new masterlist)
dis_gzm_to_match['GLOB SHIP ID'] = np.nan
dis_gzm_to_match['GLOB SHIP ID'] = dis_gzm_to_match['GLOB SHIP ID'].astype(pd.Int32Dtype())

# Reset the index, since the numbers are no longer unique due to the explosion
dis_gzm_to_match.reset_index(inplace=True, drop=True)

In [123]:
unique_gzm_shiplist = list(unique_gzm_ships['shipname'])

# Loop over each entry in the GZM
for entry in dis_gzm_to_match.index:
        shipname = dis_gzm_to_match.loc[entry, 'NAAM SCHIP (GESTANDAARDISEERD)']

        shipname = shipname.lower()

        # See if the ship is in the GZM shiplist (it should be, else it falls outside the rule scope)
        # I.e. the ship 'colombo' (first mention 1717, last 1762, which is more than 30 years).
        if shipname in unique_gzm_shiplist:
            gzm_index = unique_gzm_ships.index[unique_gzm_ships['shipname'] == shipname].astype(int)[0]
            glob_id = master.index[master['GZM_id'] == gzm_index].astype(int)[0]
            dis_gzm_to_match.loc[entry, 'GLOB SHIP ID'] = glob_id

### Linking the BGB dataset to the new masterlist

Now we want to link each individual ship mentioned in the BGB voyages (and for which no DAS link is present) to a ship in our new masterlist (if possible). 

In [61]:
# Add a column to the 'BGB no match' Dataframe for the Globalise ship ID (from our new masterlist)
bgbnomatch['GLOB SHIP ID'] = np.nan
bgbnomatch['GLOB SHIP ID'] = bgbnomatch['GLOB SHIP ID'].astype(pd.Int32Dtype())

In [62]:
unique_bgb_shiplist = list(unique_bgb_ships['shipname'])

# Loop over each entry in the BGB
for entry in bgbnomatch.index:
        shipname = bgbnomatch.loc[entry, 'BGB ship name']
        
        # See if the ship is in the BGB shiplist (it should be, else it falls outside the rule scope)
        if shipname in unique_bgb_shiplist:
            bgb_index = unique_bgb_ships.index[unique_bgb_ships['shipname'] == shipname].astype(int)[0]
            glob_id = master.index[master['BGB_id'] == bgb_index].astype(int)[0]
            bgbnomatch.loc[entry, 'GLOB SHIP ID'] = glob_id

### Create a linkset

The last step is to create the linkset to import into the triplestore. 

In [186]:
# Create an empty list to hold the links
linklist = []

# Get lists of unique ships with a new ID in both the GZM and the BGB
gzmlist = list(dis_gzm_to_match['GLOB SHIP ID'].dropna().unique())
bgblist = list(bgbnomatch['GLOB SHIP ID'].dropna().unique())

# Loop over every ship in our masterlist
for globship in master.index:
    gzmlinks = []
    bgblinks = []
    
    # If it is also in the GZM: add the ship IDs
    if globship in gzmlist:
        gzm_entries = list(dis_gzm_to_match.loc[dis_gzm_to_match['GLOB SHIP ID'] == globship, 'GZM SHIP ID'])
        for entry in gzm_entries:
            gzmlinks.append(entry)
            
    # If it is also in the BGB: add the shipvoyage IDs
    if globship in bgblist:
        bgb_entries = list(bgbnomatch.loc[bgbnomatch['GLOB SHIP ID'] == globship, 'BGB Shipvoyage ID'])
        for entry in bgb_entries:
            bgblinks.append(entry)
    
    # Append them to the linklist in a formatted way (mulitple entries within GZM and BGB separated by ';')
    linklist.append([globship, ';'.join(str(i) for i in gzmlinks), ';'.join(str(i) for i in bgblinks)])

# Make a Dataframe
linkset = pd.DataFrame.from_records(linklist, columns=['GLOB SHIP ID', 'GZM SHIP IDS', 'BGB SHIPVOYAGES'])        

In [189]:
# Show the resulting linkset
linkset

Unnamed: 0,GLOB SHIP ID,GZM SHIP IDS,BGB SHIPVOYAGES
0,1857,2290-2;2391-2,
1,1858,,12258
2,1859,794,
3,1860,329;406;525;848;900;1039;1129,10544;13998;14271;14435;14454;14619;14822;1597...
4,1861,1828;1956;2025;2175;2313;2412,
...,...,...,...
996,2853,,18537;18543
997,2854,1302,
998,2855,3403,
999,2856,419;496;887;1034;1139;1279,


In [190]:
# Write the resulting linkset to a CSV-file
linkset.to_csv('linkset.csv', encoding='utf-8', index=False)