# Enriching data from BGB with ship IDs from DAS

Version 0.2 (11 February 2022)

*Gerhard de Kok*

This script is an experiment to enrich the data from the Bookkeeper-General Batavia dataset (BGB) with links to the data in the Dutch-Asiatic Shipping dataset (DAS).


### The problem

BGB has a list of voyages with associated internal ship IDs. Some of the ships in BGB are also present in DAS, which only contains intercontinental voyages. If a voyage (between Europe and Asia) is present in both BGB and DAS, a connection is (usually) already made between the BGB ship ID and the DAS ship ID. For about 14,500 (mostly intra-Asiatic) voyages in BGB, such a connection is not present. However, many of the vessels used for these voyages are present with a separate ship ID in DAS. The challenge is to link BGB ship IDs to DAS ship IDs in such cases.


### Proposed solution

The present script tries to resolve this issue in a rule-based manner. Basically, the rules are as follows:

1. For each voyage in BGB, check if the ship already has an associated DAS ship ID from another voyage in the original BGB dataset (if so, use that)
    1. This is not as straightforward as it should be: BGB ship IDs are not unique (different ships that share a name can still have the same BGB ID). I therefore also check if the DAS connection present in the BGB dataset falls within the time range mentioned below
1. If not, fuzzy match the ship name with a list of ship names in DAS. Then check to see if the DAS ship was active in the same time period:
    1. The active time range of a DAS ship is defined as: (first mention of ship in DAS) until (last mention in DAS + 20 years)
    1. If the BGB voyage took place inside that time range, make the match

Caveat: when several DAS ship names fuzzy match a BGB ship name, the script currently only uses the closest one.
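
As an illustration of rule 2, here is a minimal sketch on made-up data. The ship IDs, the years for the two ships named Concordia, and the helper name `candidate_matches` are hypothetical and do not come from DAS or from the script below. Only the 20-year grace period, the 0.85 cutoff and the use of `difflib.get_close_matches` follow the notebook.

```python
import difflib

# Hypothetical toy data: DAS ship ID -> (name, first year seen, last year seen)
das_toy = {
    "DAS_ship_A": ("Concordia", 1696, 1708),
    "DAS_ship_B": ("Concordia", 1770, 1786),
    "DAS_ship_C": ("Slot Ter Hoge", 1780, 1788),
}

GRACE_YEARS = 20  # a ship counts as active until 20 years after its last DAS mention


def candidate_matches(bgb_name, booking_year, ships=das_toy, cutoff=0.85):
    """Return IDs of DAS ships whose name fuzzy matches and whose window covers the year."""
    names = sorted({name for name, _, _ in ships.values()})
    close = difflib.get_close_matches(bgb_name, names, n=3, cutoff=cutoff)
    if not close:
        return []
    best_name = close[0]  # only the closest name is considered
    return [
        ship_id
        for ship_id, (name, first, last) in ships.items()
        if name == best_name and first <= booking_year <= last + GRACE_YEARS
    ]


print(candidate_matches("Slot ter Hoge", 1790))  # spelling differs only in case
print(candidate_matches("Concordia", 1715))      # only the earlier Concordia is in range
print(candidate_matches("Concordia", 1750))      # neither Concordia is in range
```

In this toy setup the time range is what separates the two ships that share a name; the fuzzy match alone cannot tell them apart.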


### Results

For now, the script matched 9403 BGB voyages to 772 unique DAS ships.

Further cleaning is probably necessary. 

In [1]:
# Import necessary modules
import pandas as pd
import difflib

def within_range(checknumber, range_start, range_end):
    """
    Check whether a BGB booking year falls within the active time range of a
    DAS ship, as defined in the introduction to this script:
    first mention in DAS until last mention in DAS + 20 years.
    """
    return bool(range_start <= checknumber <= (range_end + 20))

In [2]:
# First, load the entire BGB database and the DAS database (Excel format)
bgb = pd.ExcelFile('bgb.xlsx')
das = pd.ExcelFile('das.xlsx')

# Parse the Excel sheets we will be using from BGB
# For the bgb_relations dataframe, some operations are needed to convert IDs to integers
bgb_ships = bgb.parse('bgb_ship')
bgb_ships = bgb_ships.set_index('id')
bgb_relations = bgb.parse('bgb_relVoyageShip')
bgb_relations.dropna(subset=['voyId'], how='all', inplace=True)
bgb_relations['voyId'] = pd.to_numeric(bgb_relations['voyId'], downcast='integer', errors='coerce')
bgb_relations = bgb_relations.set_index('id')
bgb_voyages = bgb.parse('bgb_voyage')
bgb_voyages = bgb_voyages.set_index('voyId')

# From DAS, we need the list of ship names ...
# ... and the years between which these ships were employed by the VOC
das_ships = das.parse('shipNameVariant')
das_ships = das_ships.set_index('shipNameVariantID')
das_voyages = das.parse('das_voyage')
das_voyages = das_voyages.set_index('voyId')

In [3]:
# Now I want a list of individual ships and the years between which they were employed
# Converting the date to datetime is not possible, since many voyages took place before 1677 (out of bounds)
# This means we lose vectorization advantages anyway, so I'll generate a dataframe using a Python loop 

# Create an empty list to hold the data on ships and dates
das_ship_dates = []

# Populate the list with data from DAS
for voyage in das_voyages.index:
    current_ship_id = das_voyages.loc[voyage, 'shipID']
    current_ship_name_id = das_voyages.loc[voyage, 'shipName']
    current_ship_departure = das_voyages.loc[voyage, 'voyDepartureEDTF']
    current_ship_arrival = das_voyages.loc[voyage, 'voyArrivalDateEDTF']
    current_ship_name = das_ships.loc[current_ship_name_id, 'shipNameVariant']
    
    # Convert dates to 4 digits (datetime not possible w/o workarounds)
    current_ship_departure = str(current_ship_departure)
    current_ship_departure = current_ship_departure[:4]
    current_ship_arrival = str(current_ship_arrival)
    current_ship_arrival = current_ship_arrival[:4]

    # Construct a tuple with data on this voyage
    this_voyage = (voyage, current_ship_name, current_ship_id, current_ship_name_id, current_ship_departure, current_ship_arrival)
    
    # Append that tuple to the aforementioned list
    das_ship_dates.append(this_voyage)
    
# Create a Pandas dataframe from the list of tuples
fulldata = pd.DataFrame.from_records(das_ship_dates, columns=['DasID', 'Shipname', 'DasShipID', 'DasShipNameVariant', 'Startyear', 'Endyear'])
fulldata = fulldata.set_index('DasID')

# Convert the year columns to numeric values (they include some messed up data)
fulldata['Startyear'] = pd.to_numeric(fulldata['Startyear'], errors='coerce')
fulldata['Endyear'] = pd.to_numeric(fulldata['Endyear'], errors='coerce')

# Now we can drop NaNs (error coercion made these NaN)
# And subsequently convert from float to int (due to messed up data, to_numeric couldn't do this)
fulldata.dropna(how='any', inplace=True)
fulldata['Startyear'] = fulldata['Startyear'].astype(int)
fulldata['Endyear'] = fulldata['Endyear'].astype(int)
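
The loop above avoids datetime conversion because pandas nanosecond timestamps only start in September 1677. Since only the four-digit year is needed, the same extraction can also be done on whole columns. The following is a sketch of that alternative on made-up values, not what the notebook actually runs; the toy ship IDs and date strings are hypothetical, and only the column name `voyDepartureEDTF` is taken from the cell above.

```python
import pandas as pd

# Hypothetical EDTF-like date strings, including a missing and a malformed value
toy = pd.DataFrame({
    "shipID": ["DAS_ship_A", "DAS_ship_A", "DAS_ship_B", "DAS_ship_B"],
    "voyDepartureEDTF": ["1628-01-15", "1634-12", None, "16XX"],
})

# Take the first four characters as the year; anything non-numeric becomes NaN
toy["Startyear"] = pd.to_numeric(
    toy["voyDepartureEDTF"].astype(str).str[:4], errors="coerce"
)

# Drop rows without a usable year, then convert the remaining years to int
clean = toy.dropna(subset=["Startyear"]).astype({"Startyear": int})
print(clean)
```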


In [4]:
# We now have all the data to make a dataframe with: DAS IDs, shipnames, first year the ship was active,
# and last year the ship was active (on intercontinental voyages)
# Caveat: NaN-values are excluded
# Caveat (2): renaming of ships not taken into account (although all rows are individual ships)
summary = fulldata.groupby(['DasShipID']).agg({'Shipname':'first', 'Startyear':'min', 'Endyear':'max'})
summary

DasShipID,Shipname,Startyear,Endyear
DAS_ship0001,'s Heer Arendskerke,1725,1742
DAS_ship0002,'s Lands Welvaren,1763,1773
DAS_ship0003,'s-Graveland,1659,1660
DAS_ship0004,'s-Graveland,1723,1726
DAS_ship0005,'s-Gravenhage,1628,1635
...,...,...,...
DAS_ship1851,President,1675,1675
DAS_ship1852,Schaapherder,1692,1693
DAS_ship1854,Toevalligheid,1745,1745
DAS_ship1855,Batavier,1736,1758
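
To make the aggregation concrete, here is a small sketch on invented rows; the ship IDs, names and years are hypothetical. It also illustrates the second caveat: when a ship appears under more than one name, `'first'` keeps only the name of the first row in each group.

```python
import pandas as pd

# Hypothetical voyages: the first ship appears under two spellings of its name
toy = pd.DataFrame({
    "DasShipID": ["DAS_ship_A", "DAS_ship_A", "DAS_ship_B"],
    "Shipname": ["Eendracht", "Eendragt", "Batavier"],
    "Startyear": [1697, 1710, 1736],
    "Endyear": [1698, 1713, 1758],
})

# One row per ship: first name seen, earliest departure year, latest arrival year
toy_summary = toy.groupby("DasShipID").agg(
    {"Shipname": "first", "Startyear": "min", "Endyear": "max"}
)
print(toy_summary)
```

A BGB voyage that records the ship under the second spelling could then only be linked through the fuzzy match, not through an exact name comparison.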


In [5]:
# Generate a list with unique ship names in DAS for fuzzy matching later on
dasshiplist = list(summary['Shipname'].unique())

In [6]:
# Now we can turn our attention to the BGB data
# Let's check which voyages are missing a link to DAS
missing_data = bgb_relations[bgb_relations['DAS_shipID'].isna()]

# Add the bgb_relVoyageShip ids for which a link is missing to a list
missinglist = list(missing_data.index)

In [7]:
# Loop over the list with BGB voyages IDs for which links are missing
# And gather the necessary information to apply the rules

# Define an empty list to hold the results
voyagedata = []

# Check each voyage/ship-relation in BGB with no ship ID
# Note: I have assumed the links that have already been made in BGB to be correct, but this does not always seem to be the case
# See Leon's note on the globise_data channel (Slack, 11-2-2022)

for bgb_shipvoyageid in missinglist:

    # Get the BGB Voyage ID
    bgb_shipvoyageid = int(bgb_shipvoyageid)
    bgb_voyageid = int(missing_data.loc[bgb_shipvoyageid, 'voyId'])
        
    # Get some basic information from BGB on the ships used in this ship/voyage combination
    bgb_shipid = bgb_relations.loc[bgb_shipvoyageid, 'shipId']
    
    # For some BGB ship IDs, there is no corresponding ship name in the BGB dataset (skip those)
    try:
        bgb_name = bgb_ships.loc[bgb_shipid, 'naam']
    except KeyError:
        continue

    # Get the BGB booking year (with try/except, since some BGB voyage IDs are not in the BGB voyages table)
    # KeyError: voyage ID not in the table; TypeError/ValueError: value cannot be converted to int
    try:
        bgb_bookingyear = int(bgb_voyages.loc[bgb_voyageid, 'voyBookingYear'])
    except (KeyError, TypeError, ValueError):
        bgb_bookingyear = 0
        
    # APPLICATION OF THE RULES (see introduction to this script)       
    # RULE 1: CHECK FOR EXISTING BGB-DAS MATCHES FOR THE SAME SHIP (IN OTHER BGB VOYAGES)
    
    already_matched = list(bgb_relations.loc[bgb_relations['shipId'] == bgb_shipid]['DAS_shipID'].dropna().unique())
    
    # Apply rule 1
    # In at least 5211 cases (far too many!), the same BGB ship ID links to multiple DAS ship IDs; see e.g.
    # BGB ship 3147 (Concordia), which is connected to two separate Concordias in DAS:
    # https://bgb.huygens.knaw.nl/bgb/voyage/3156 vs https://bgb.huygens.knaw.nl/bgb/voyage/7430
    # I have assumed the BGB data on individual voyages to be correct, but the ship
    # disambiguation in BGB seems to be unreliable
    # I therefore do not apply rule 1 to ships with multiple DAS matches in BGB

    if len(already_matched) == 1:
        das_id_match = already_matched[0]
        das_name = summary.loc[das_id_match, 'Shipname']
        das_start = summary.loc[das_id_match, 'Startyear']
        das_end = summary.loc[das_id_match, 'Endyear']        

        # To filter the most glaring errors, I add another check: does this ship fall within range limits?
        if within_range(bgb_bookingyear, das_start, das_end):

            # Construct a list with relevant data on this voyage
            method = 1
            this_voyage = (bgb_voyageid, bgb_shipid, das_id_match, bgb_name, das_name, bgb_bookingyear, das_start, das_end, method)

            # Add that list to the results list
            voyagedata.append(this_voyage)
            continue
    
    # RULE 2: FIND SIMILARLY NAMED DAS SHIPS IN SAME TIME PERIOD AS SHIP USED FOR THIS BGB VOYAGE
    
    # 2.3 Check for a closely matching ship name in DAS
    # Version 0.2: changed sensitivity from 0.8 to 0.85  
    checking = difflib.get_close_matches(bgb_name, dasshiplist, n=3, cutoff=0.85)
    
    # Go to next voyage if there are no matches based on ship name
    if not checking:
        continue
    
    # There may be multiple DAS ships that fuzzy match with the BGB name, I use only the closest match
    das_name = checking[0]
    
    # Get the start and end years for each of these similarly named DAS ships
    startyears = list(summary.loc[summary['Shipname'] == das_name]['Startyear'])
    endyears = list(summary.loc[summary['Shipname'] == das_name]['Endyear'])  
    
    # Check if these years fall within the time range, and if so: make the match
    for start, end in zip(startyears, endyears):
        if within_range(bgb_bookingyear, start, end):

            # Lookup corresponding DAS shipID
            das_id_match = summary.loc[(summary['Shipname'] == das_name) & (summary['Startyear'] == start)].index[0]

            # Construct a list with relevant data on this voyage
            method = 2
            this_voyage = (bgb_voyageid, bgb_shipid, das_id_match, bgb_name, das_name, bgb_bookingyear, start, end, method)

            # Add that list to the results list
            voyagedata.append(this_voyage)

matched = pd.DataFrame.from_records(voyagedata, columns=['BGB Voyage ID', 'BGB ship ID', 'DAS Ship ID', 'BGB ship name', 'DAS ship name', 'BGB Booking year', 'DAS first seen', 'DAS last seen', 'Matched based on rule'])
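
The rule 2 loop has no `break`, so it appends a row for every same-named DAS ship whose time range covers the booking year. A voyage/ship pair can therefore end up in `matched` more than once. One possible way to flag such pairs for manual review is sketched below on made-up rows; the values are hypothetical, and only the column names follow `matched`.

```python
import pandas as pd

# Hypothetical rows in the same layout as `matched` (only the relevant columns)
toy_matched = pd.DataFrame({
    "BGB Voyage ID": [101, 102, 102, 103],
    "BGB ship ID": [1, 2, 2, 3],
    "DAS Ship ID": ["DAS_ship_A", "DAS_ship_B", "DAS_ship_C", "DAS_ship_D"],
})

# A voyage/ship pair that occurs more than once has several candidate DAS ships
key = ["BGB Voyage ID", "BGB ship ID"]
ambiguous = toy_matched[toy_matched.duplicated(subset=key, keep=False)]
print(ambiguous)
```

Applied to the real `matched` dataframe, the same two lines would list the rows that may need further cleaning.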

In [8]:
matched

,BGB Voyage ID,BGB ship ID,DAS Ship ID,BGB ship name,DAS ship name,BGB Booking year,DAS first seen,DAS last seen,Matched based on rule
0,99405,3112,DAS_ship0599,Hindeloopen,Hindeloopen,1790,1778,1787,1
1,99406,3113,DAS_ship0495,Gouverneur-Generaal De Clerck,Gouverneur-Generaal De Clerck,1790,1783,1791,1
2,99406,3256,DAS_ship1358,Slot ter Hoge,Slot Ter Hoge,1790,1780,1788,2
3,99407,3115,DAS_ship0892,Leviathan,Leviathan,1790,1787,1791,1
4,99408,3116,DAS_ship1575,Vredenburg,Vredenburg,1790,1785,1794,1
...,...,...,...,...,...,...,...,...,...
9398,118243,3738,DAS_ship1073,Noord Nieuwland,Noord Nieuwland,1763,1750,1769,2
9399,118252,3862,DAS_ship1366,Sloterdijk,Sloterdijk,1763,1748,1761,2
9400,115493,4457,DAS_ship0809,Kokenge,Kockengen,1715,1711,1728,1
9401,115493,3542,DAS_ship1476,Unie,Unie,1715,1697,1713,2


In [10]:
matched.to_excel("matched v0.2.xlsx")