# BGB-voyages to DAS ship IDs

*Gerhard de Kok*

This script enriches the data from the Bookkeeper-General Batavia dataset (BGB) with links to the data in the Dutch-Asiatic Shipping dataset (DAS).

> #### BGB
> The BGB is a dataset with information on almost 20,000 voyages of the Dutch East India Company (VOC), including the cargoes carried and the ships used. It **does not** contain a disambiguated list of ships, i.e. a single ship name may refer to multiple physical vessels.
> #### DAS
> DAS is a dataset containing information on more than 4,700 voyages of the VOC, but it is limited to intercontinental voyages between Europe and Asia. It **does** contain a disambiguated list of ships.

## The problem
BGB has a list of voyages whose IDs are not unique: multiple ships may take part in a single voyage, and each ship gets its own entry (row) in the dataset. Unique IDs exist only at the level of 'shipvoyage IDs', which identify a combination of a voyage and a ship. BGB ships have IDs as well, but these are not disambiguated (i.e. a single ship name with a single ID may refer to multiple physical vessels).

Some of the ships in BGB are also present in DAS, which only contains intercontinental voyages. If a voyage is present in both BGB and DAS, a connection is (usually) already made between the BGB ship ID and the DAS ship ID. For about 14,500 (mostly intra-Asiatic) voyages in BGB, such a connection is not present. However, many of the vessels used for these voyages are present with a separate ship ID in DAS. The challenge is to link BGB ship IDs to DAS ship IDs in such cases.
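
The difference between voyage IDs and shipvoyage IDs can be illustrated with a small toy table. This is only a sketch: the values are hypothetical, and the column names (`id`, `voyId`, `shipId`) are assumed to mirror those of the `bgb_relVoyageShip` sheet used later in this script.

```python
import pandas as pd

# Toy data with hypothetical values (not taken from BGB)
relations = pd.DataFrame(
    {
        "id": [1, 2, 3],           # shipvoyage ID: unique per row
        "voyId": [100, 100, 101],  # voyage ID: repeated when several ships sail together
        "shipId": [7, 8, 7],       # ship ID: one per ship name, not per physical vessel
    }
).set_index("id")

# The shipvoyage ID identifies a row; the voyage ID does not
print(relations.index.is_unique)
print(relations["voyId"].is_unique)
```

Here voyage 100 appears twice because two ships took part in it, while each shipvoyage ID occurs exactly once.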

## Automated solution
The present script tries to resolve this issue in a rule-based manner. Basically, the rules are as follows:

1. For each voyage in BGB, check if the ship already has an associated DAS ship ID from another voyage in the original BGB dataset. If so, check whether that link is correct and then use it.

2. If no link is already present, fuzzy match the BGB ship name with a list of ship names in DAS. Then check to see if the DAS ship was active in the same time period.
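
The fuzzy matching step in rule 2 can be sketched with `difflib.get_close_matches` from the Python standard library. The helper name `closest_das_name` and the BGB spelling variant below are hypothetical and serve only as an illustration; the cutoff of 0.85 is the value the script uses for rule 2.

```python
import difflib

# A few DAS ship names, used here purely as an example candidate list
das_names = ["Batavier", "Schaapherder", "Toevalligheid"]

def closest_das_name(bgb_name, candidates, cutoff=0.85):
    """Return the closest fuzzy match for bgb_name, or None if nothing passes the cutoff."""
    matches = difflib.get_close_matches(bgb_name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# A hypothetical spelling variant still finds its DAS counterpart
print(closest_das_name("Schaepherder", das_names))

# A name without a similar DAS entry yields None
print(closest_das_name("Zeeland", das_names))
```

A higher cutoff makes the matching stricter: fewer false links, but more spelling variants that remain unmatched.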

> #### Definition for 'same time period'
> The active time range of a DAS ship is defined as: (first mention of the ship in DAS) until (last mention in DAS + 20 years).
> If the BGB voyage took place within that time range (or rather: if it was booked within that time range), the script matches the BGB shipvoyage (a unique ID of a ship/voyage combination) to the DAS ship ID.
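
The definition above amounts to a simple inclusive range check. The following is a minimal sketch; the function name `in_active_range`, the constant `ACTIVE_MARGIN_YEARS` and the example years are assumptions made for illustration.

```python
ACTIVE_MARGIN_YEARS = 20  # years added after the last mention in DAS

def in_active_range(booking_year, first_mention, last_mention, margin=ACTIVE_MARGIN_YEARS):
    """Check whether a BGB booking year lies within the active time range of a DAS ship."""
    return first_mention <= booking_year <= last_mention + margin

# Hypothetical ship, mentioned in DAS from 1725 to 1742: active range is 1725-1762
print(in_active_range(1750, 1725, 1742))  # inside the range
print(in_active_range(1763, 1725, 1742))  # one year past the 20-year margin
print(in_active_range(1724, 1725, 1742))  # before the first mention
```

Note that the margin applies only to the end of the range: a BGB voyage booked before the first DAS mention is never matched.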

## Output
The script produces an Excel file with the following sheets:
1. Matches: the matches between BGB shipvoyage IDs and DAS ship IDs made by the script.
2. Ambiguous: cases where multiple DAS ship IDs matched the rules for a single BGB shipvoyage ID.
3. No match: cases where a match could not be made for a BGB shipvoyage ID.

In [1]:
# Import necessary modules
import pandas as pd
import difflib

def within_range(checknumber, range_start, range_end):
    """
    Check whether a BGB booking year falls within the active time range of a
    DAS ship, as defined in the introduction to this script: from range_start
    up to and including range_end + 20.
    """
    return bool(range_start <= checknumber <= (range_end + 20))

In [2]:
# First, load the entire BGB database and the DAS database (Excel format)
bgb = pd.ExcelFile('bgb.xlsx')
das = pd.ExcelFile('das.xlsx')

# Parse the Excel sheets we will be using from BGB
# For the bgb_relations dataframe, some operations are needed to convert IDs to integers
bgb_ships = bgb.parse('bgb_ship')
bgb_ships = bgb_ships.set_index('id')
bgb_relations = bgb.parse('bgb_relVoyageShip')
bgb_relations.dropna(subset=['voyId'], how='all', inplace=True)
bgb_relations['voyId'] = pd.to_numeric(bgb_relations['voyId'], downcast='integer', errors='coerce')
bgb_relations = bgb_relations.set_index('id')
bgb_voyages = bgb.parse('bgb_voyage')
bgb_voyages = bgb_voyages.set_index('voyId')

# From DAS, we need the list of ship names ...
# ... and the years between which these ships were employed by the VOC
das_ships = das.parse('shipNameVariant')
das_ships = das_ships.set_index('shipNameVariantID')
das_voyages = das.parse('das_voyage')
das_voyages = das_voyages.set_index('voyId')

In [3]:
das_ships

| shipNameVariantID | shipID | shipNameVariant | shipNameVariantRemark |
|---|---|---|---|
| DAS_snv0001 | DAS_ship0001 | 's Heer Arendskerke | |
| DAS_snv0002 | DAS_ship0002 | 's Lands Welvaren | |
| DAS_snv0003 | DAS_ship0003 | 's-Graveland | |
| DAS_snv0004 | DAS_ship0004 | 's-Graveland | |
| DAS_snv0005 | DAS_ship0005 | 's-Gravenhage | |
| ... | ... | ... | ... |
| DAS_snv1893 | DAS_ship1852 | Schaapherder | |
| DAS_snv1894 | DAS_ship1853 | Senhor De Bonfim E Sancta Maria | |
| DAS_snv1895 | DAS_ship1854 | Toevalligheid | |
| DAS_snv1896 | DAS_ship1855 | Batavier | |


In [4]:
# Now I want a list of individual ships from DAS and the years between which they were employed
# Converting the date to datetime is not possible, since many voyages took place before 1677 (out of bounds)
# This means we lose vectorization advantages anyway, so I'll generate a dataframe using a Python loop 

# Create an empty list to hold the data on ships and dates
das_ship_dates = []

# Populate the list with data from DAS
for voyage in das_voyages.index:
    current_ship_id = das_voyages.loc[voyage, 'shipID']
    current_ship_name_id = das_voyages.loc[voyage, 'shipName']
    current_ship_departure = das_voyages.loc[voyage, 'voyDepartureEDTF']
    current_ship_arrival = das_voyages.loc[voyage, 'voyArrivalDateEDTF']
    current_ship_name = das_ships.loc[current_ship_name_id, 'shipNameVariant']
    
    # Convert dates to 4 digits (datetime not possible w/o workarounds)
    current_ship_departure = str(current_ship_departure)
    current_ship_departure = current_ship_departure[:4]
    current_ship_arrival = str(current_ship_arrival)
    current_ship_arrival = current_ship_arrival[:4]

    # Construct a tuple with data on this voyage
    this_voyage = (voyage, current_ship_name, current_ship_id, current_ship_name_id, current_ship_departure, current_ship_arrival)
    
    # Append that tuple to the list of voyage records
    das_ship_dates.append(this_voyage)
    
# Create a Pandas dataframe from the list of tuples
fulldata = pd.DataFrame.from_records(das_ship_dates, columns=['DasID', 'Shipname', 'DasShipID', 'DasShipNameVariant', 'Startyear', 'Endyear'])
fulldata = fulldata.set_index('DasID')

# Convert the year columns to numeric values (malformed entries become NaN through error coercion)
fulldata['Startyear'] = pd.to_numeric(fulldata['Startyear'], errors='coerce')
fulldata['Endyear'] = pd.to_numeric(fulldata['Endyear'], errors='coerce')

# Convert from float to a nullable integer type, which keeps the missing values as <NA>
# (because of the malformed data, to_numeric could not produce integers directly)
fulldata['Startyear'] = fulldata['Startyear'].astype(pd.Int32Dtype())
fulldata['Endyear'] = fulldata['Endyear'].astype(pd.Int32Dtype())
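
Since the year extraction only slices strings and never goes through datetime, it could also be done column-wise. The following is an alternative sketch, not the approach used in this script; the series `dates` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical EDTF-style date strings, including a missing and a malformed value
dates = pd.Series(["1725-05-02", "1659", None, "unknown"])

# Take the first four characters of each value and coerce anything non-numeric to NaN
years = pd.to_numeric(dates.astype(str).str[:4], errors="coerce")

# Convert to a nullable integer type, so missing years stay missing instead of forcing floats
years = years.astype("Int32")
print(years)
```

The result has the same shape as the `Startyear` and `Endyear` columns built above: integer years where a year could be read, and `<NA>` elsewhere.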


In [5]:
# We now have all the data to make a dataframe with: DAS IDs, shipnames, first year the ship was active,
# and last year the ship was active (on intercontinental voyages)
summary = fulldata.groupby(['DasShipNameVariant']).agg({'Shipname':'last', 'DasShipID':'last', 'Startyear':'min', 'Endyear':'max'})
summary

| DasShipNameVariant | Shipname | DasShipID | Startyear | Endyear |
|---|---|---|---|---|
| DAS_snv0001 | 's Heer Arendskerke | DAS_ship0001 | 1725 | 1742 |
| DAS_snv0002 | 's Lands Welvaren | DAS_ship0002 | 1763 | 1773 |
| DAS_snv0003 | 's-Graveland | DAS_ship0003 | 1659 | 1660 |
| DAS_snv0004 | 's-Graveland | DAS_ship0004 | 1723 | 1726 |
| DAS_snv0005 | 's-Gravenhage | DAS_ship0005 | 1628 | 1635 |
| ... | ... | ... | ... | ... |
| DAS_snv1893 | Schaapherder | DAS_ship1852 | 1690 | 1693 |
| DAS_snv1894 | Senhor De Bonfim E Sancta Maria | DAS_ship1853 | 1782 | |
| DAS_snv1895 | Toevalligheid | DAS_ship1854 | 1745 | 1745 |
| DAS_snv1896 | Batavier | DAS_ship1855 | 1736 | 1758 |


In [6]:
# Generate a list with unique ship names in DAS for fuzzy matching later on
dasshiplist = list(summary['Shipname'].unique())

# Add all bgb_relVoyageShip ids to a list 
checklist = list(bgb_relations.index)

In [7]:
# Loop over the list of BGB shipvoyage IDs
# Gather the necessary information to apply the rules

# Define empty lists to hold the results
voyagedata = []
ambiguous = []
notmatched = []

# Check each voyage/ship-relation in BGB
for bgb_shipvoyageid in checklist:
    
    bgb_shipvoyageid = int(bgb_shipvoyageid)    

    # Get the BGB voyage ID (select the scalar directly, instead of calling int() on a one-element Series)
    bgb_voyageid = int(bgb_relations.loc[bgb_shipvoyageid, 'voyId'])

    # Get some basic information from BGB on the ship used in this ship/voyage combination
    bgb_shipid = bgb_relations.loc[bgb_shipvoyageid, 'shipId']

    # For some BGB ship IDs, there is no corresponding ship name in the BGB dataset (skip those)
    try:
        bgb_name = bgb_ships.loc[bgb_shipid, 'naam']
    except (KeyError, TypeError):
        continue

    # Get the BGB booking year (with try/except, since some BGB voyage IDs are not in the BGB voyages table)
    try:
        bgb_bookingyear = int(bgb_voyages.loc[bgb_voyageid, 'voyBookingYear'])
    except (KeyError, TypeError, ValueError):
        bgb_bookingyear = 0

    # APPLICATION OF THE RULES (see introduction to this script)       
    # RULE 1: CHECK FOR EXISTING BGB-DAS MATCH FOR THIS SHIP 
    
    existing_link = bgb_relations.loc[bgb_shipvoyageid, 'DAS_voyage']
    if not pd.isna(existing_link):
        das_shipnamevariantid = das_voyages.loc[int(existing_link), 'shipName']
        das_shipnameid = das_voyages.loc[int(existing_link), 'shipID']
        das_shipnamevariant = summary.loc[das_shipnamevariantid, 'Shipname']
        
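        # Get voyage dates from DAS (the first four characters of the EDTF value are the year)
        # Convert to string before slicing, so that non-string values do not fail on the slice
        startdate_das = das_voyages.loc[int(existing_link), 'voyDepartureEDTF']
        try:
            startdate_das = int(str(startdate_das)[0:4])
        except ValueError:
            startdate_das = 0
        
        enddate_das = das_voyages.loc[int(existing_link), 'voyArrivalDateEDTF']
        try:
            enddate_das = int(str(enddate_das)[0:4])
        except ValueError:
            enddate_das = 0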
         
        # Some voyages do not contain information on the endyears, in that case: set it equal to startyear
        if enddate_das == 0:
            enddate_das = startdate_das
        
        # Quality check 1: is the ship name similar enough between BGB and DAS?
        if difflib.get_close_matches(bgb_name, [das_shipnamevariant], n=3, cutoff=0.70):
            
            # Quality check 2: did the BGB voyage take place in the same time range as in DAS?
            # (In practice, this check only filters out the existing link for the ship Leiden)
            if abs(bgb_bookingyear - startdate_das) <= 20:
            
                # The link passed both checks: record it as a rule 1 match
                method = 1
                this_voyage = (bgb_shipvoyageid, bgb_voyageid, bgb_shipid, int(existing_link), das_shipnameid, das_shipnamevariantid, bgb_name, das_shipnamevariant, bgb_bookingyear, startdate_das, enddate_das, method)
                voyagedata.append(this_voyage)
                continue

    # RULE 2: NAME MATCHING FOR SHIPS WITHOUT (VALID) EXISTING LINK
    
    # Check for a closely matching ship name in DAS
    checking = difflib.get_close_matches(bgb_name, dasshiplist, n=3, cutoff=0.85)

    # Go to next voyage if there are no matches based on ship name
    if not checking:
        notmatched.append(bgb_shipvoyageid)
        continue
        
    # There may be multiple DAS ships that fuzzy match with the BGB name, I use only the closest match
    das_shipnamevariant = checking[0]  

    # Bandaid solution: the ship 'Postiljon' is the only one for which we have no startyear and no endyear in DAS
    # This messes up the code, so just skip this ship
    if das_shipnamevariant == 'Postiljon':
        notmatched.append(bgb_shipvoyageid)
        continue
    
    # Get the start and end years for each of these similarly named DAS ships
    startyears = list(summary.loc[summary['Shipname'] == das_shipnamevariant]['Startyear'])
    endyears = list(summary.loc[summary['Shipname'] == das_shipnamevariant]['Endyear'])  
    
    # Some voyages do not contain information on the endyears, in that case: set it equal to startyear
    for index, value in enumerate(endyears):
        if pd.isnull(value):
            endyears[index] = startyears[index]
   
    # Check if these years fall within the time range, and if so: make the match
    templist = []
    for startdate_das, enddate_das in zip(startyears, endyears):

        if within_range(bgb_bookingyear, startdate_das, enddate_das):
            
            # Look up the ship ID and shipnamevariant ID from DAS
            das_shipnamevariantid_match = summary.loc[(summary['Shipname'] == das_shipnamevariant) & (summary['Startyear'] == startdate_das)].index[0]
            das_shipnameid_match = summary.loc[das_shipnamevariantid_match, 'DasShipID']

            # Add voyages to a temporary holding list
            method = 2
            this_voyage = (bgb_shipvoyageid, bgb_voyageid, bgb_shipid, "No link", das_shipnameid_match, das_shipnamevariantid_match, bgb_name,das_shipnamevariant, bgb_bookingyear, startdate_das, enddate_das, method)
            templist.append(this_voyage)
        
    # A single candidate is a match; multiple candidates go on the ambiguous list
    # Note: if no candidate falls within the time range, the shipvoyage ends up on none of the three lists
    if len(templist) == 1:
        voyagedata.append(templist[0])
    elif len(templist) > 1:
        ambiguous.extend(templist)
            
# Make dataframes (both share the same columns)
result_columns = ['BGB Shipvoyage ID', 'BGB Voyage ID', 'BGB ship ID', 'DAS voyage ID', 'DAS shipname ID', 'DAS shipnamevariant ID', 'BGB ship name', 'DAS ship name', 'BGB Booking year', 'DAS first seen', 'DAS last seen', 'Matched based on rule']
matched = pd.DataFrame.from_records(voyagedata, columns=result_columns)
multimatched = pd.DataFrame.from_records(ambiguous, columns=result_columns)
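
The possible outcomes of the time-range check in rule 2 can be summarised in a small helper. This is only a sketch with a hypothetical function `classify`; unlike the loop above, it labels the case with zero candidates explicitly.

```python
def classify(candidates):
    """Label the list of candidate DAS matches found for one BGB shipvoyage."""
    if len(candidates) == 1:
        return "match"
    if len(candidates) > 1:
        return "ambiguous"
    return "no match"

# Hypothetical candidate lists, identified here by DAS ship ID only
print(classify(["DAS_ship0003"]))
print(classify(["DAS_ship0003", "DAS_ship0004"]))
print(classify([]))
```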

In [8]:
# Make a dataframe with information about the failed matches
nomatch_rows = []

for bgb_shipvoyageid in notmatched:
    
    # Get the BGB voyage ID (select the scalar directly, instead of calling int() on a one-element Series)
    bgb_voyageid = int(bgb_relations.loc[bgb_shipvoyageid, 'voyId'])

    # Get some basic information from BGB on the ship used in this ship/voyage combination
    bgb_shipid = bgb_relations.loc[bgb_shipvoyageid, 'shipId']

    # For some BGB ship IDs, there is no corresponding ship name in the BGB dataset (skip those)
    try:
        bgb_name = bgb_ships.loc[bgb_shipid, 'naam']
    except (KeyError, TypeError):
        continue

    # Get the BGB booking year (with try/except, since some BGB voyage IDs are not in the BGB voyages table)
    try:
        bgb_bookingyear = int(bgb_voyages.loc[bgb_voyageid, 'voyBookingYear'])
    except (KeyError, TypeError, ValueError):
        bgb_bookingyear = 0
    
    this_voyage = (bgb_shipvoyageid, bgb_voyageid, bgb_shipid, bgb_name, bgb_bookingyear)
    nomatch_rows.append(this_voyage)
    
nomatch = pd.DataFrame.from_records(nomatch_rows, columns=['BGB Shipvoyage ID', 'BGB Voyage ID', 'BGB ship ID', 'BGB ship name', 'BGB Booking year'])

In [9]:
# Save in Excel sheet
with pd.ExcelWriter('Matching_results.xlsx') as writer: 
    matched.to_excel(writer, sheet_name='Matches')
    multimatched.to_excel(writer, sheet_name='Ambiguous')
    nomatch.to_excel(writer, sheet_name='No match')