# Create `voc_persons_contracts.csv` and `voc_names.csv`


This notebook integrates the original VOC muster records file from the [VOC Opvarenden collection of the Dutch National Archives](https://www.nationaalarchief.nl/onderzoeken/index/nt00444) with disambiguation results from the HUMIGEC project (see the accompanying data paper for more information). It also adds references to the `voc_ranks.csv` and `voc_places.csv` files, translates column names into English, and selects the columns to be saved as the `voc_persons_contracts.csv` and `voc_names.csv` files.

Please note that this notebook uses the `voc_ranks.csv` file, which has been created in another notebook, and the `voc_places.csv` file, for which no notebook detailing the transformation process is available.

## Environment Setup and Library Imports

Set up the environment and import necessary libraries for data manipulation and path handling.

In [1]:
import pandas as pd
import os
import numpy as np
from datetime import datetime

## Define File Paths

Set up paths for data directories to manage file locations conveniently.

In [2]:
local_folder = '../'

data_path = os.path.join(local_folder, 'original')
external_path = os.path.join(local_folder, 'external')
intermediary_path = os.path.join(local_folder, 'intermediary')
output_path = os.path.join(local_folder, 'enriched')

## Load VOC Muster Records

Load the VOC muster records file, which has no header row, and assign Dutch column names.

In [3]:
NA_vocop_file = os.path.join(data_path, 'NT00444_OPVARENDEN.csv')

nl_headers = [
    "VolledigeNaam",
    "Voornaam",
    "Patroniem",
    "Tussenvoegsel",
    "Achternaam",
    "PlaatsHerkomst",
    "Rang",
    "RangEn",
    "RangDe",
    "Omschrijving",
    "OmschrijvingEn",
    "OmschrijvingDe",
    "DatumInDienst",
    "DatumUitDienst",
    "UitDienstPlaats",
    "UitDienstPlaatsEn",
    "UitDienstPlaatsDe",
    "RedenUitDienst",
    "RedenUitDienstEn",
    "RedenUitDienstDe",
    "OmschrijvingUitDienst",
    "OmschrijvingUitDienstEn",
    "OmschrijvingUitDienstDe",
    "UitgevarenSchip",
    "Maandbrief",
    "Schuldbrief",
    "Opmerking",
    "OpgestaptKaap",
    "OpgestaptKaapSchip",
    "OpgestaptKaapKamer",
    "SchipTerugReis",
    "KamerTerugReis",
    "DASnummerTerugreis",
    "OpmerkingenTerugreis",
    "InformatieTerugreisBijING",
    "TerugReisDatumVertrek",
    "TerugReisAankomstKaap",
    "TerugReisVertrekKaap",
    "TerugReisAankomst",
    "Soldijboek",
    "Begunstigde",
    "Bronverwijzing",
    "uid",
    "scan",
    "handle",
    "ID"
]

df_na_vocop = pd.read_csv(NA_vocop_file, index_col=None, header=None, names=nl_headers)
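
As a minimal sketch of the loading step above, the following reads a headerless CSV from an in-memory string. The three-column sample and its values are hypothetical stand-ins for the real file, which has many more columns.

```python
import io

import pandas as pd

# Hypothetical three-column sample standing in for the headerless source file.
sample_csv = io.StringIO(
    "Jan Jansz,Amsterdam,1\n"
    "Pieter de Vries,Delft,2\n"
)
sample_headers = ["VolledigeNaam", "PlaatsHerkomst", "ID"]

# header=None stops pandas from consuming the first record as column names;
# names= supplies the labels instead.
df_sample = pd.read_csv(sample_csv, header=None, names=sample_headers)

# Sanity check: both records survive and the column count matches the header list.
print(df_sample.shape)
print(list(df_sample.columns))
```

Comparing the resulting column count with the length of the header list is a cheap way to notice when the header list and the file have drifted apart.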

## Integrate Disambiguation Results

Load and merge disambiguation results from the HUMIGEC project with the VOC muster records.

In [4]:
vocop = pd.read_csv(os.path.join(intermediary_path, 'vocop-clustered-levenshtein-1.no-service-overlap.cape-boarders-linked.csv'), 
                    dtype={'HISCO_CODE': str, 
                           'voyNumberDAS': str, 
                           'boardedAtCapeVoyNumberDAS': str,
                           'boardedAtCapeShipID': str,
                           'DASvoyageNumberReturnJourney': str}, delimiter='\t')

In [5]:
vocop = pd.merge(vocop, df_na_vocop[["Soldijboek", "uid", "ID"]], left_on="VOCOP_id", right_on="ID")

In [6]:
vocop.drop('ID', inplace=True, axis=1)
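
The merge above uses pandas' default inner join, so records whose `VOCOP_id` has no counterpart in the source file are dropped without warning. The sketch below, on hypothetical miniature tables, shows one way to guard the join with `validate` and to count the rows that were lost.

```python
import pandas as pd

# Hypothetical miniature versions of the clustered file and the source file.
clustered = pd.DataFrame({"VOCOP_id": [1, 2, 3], "cluster_ID": [10, 10, 11]})
source = pd.DataFrame({"ID": [1, 2, 4],
                       "Soldijboek": ["a", "b", "d"],
                       "uid": ["u1", "u2", "u4"]})

# validate="one_to_one" raises a MergeError if either key column has duplicates.
merged = pd.merge(clustered, source, left_on="VOCOP_id", right_on="ID",
                  how="inner", validate="one_to_one")
merged = merged.drop(columns="ID")  # the key is already present as VOCOP_id

# Rows in the clustered table without a match are dropped by the inner join.
dropped = len(clustered) - len(merged)
print(f"Rows dropped by the inner join: {dropped}")
```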

## Load and Merge Additional Data

Load supplementary files that provide, for each record, its sequence number within an individual's career (`clusterRow`) and a lookup table of reasons for ending service, including whether a person could muster again afterwards (`couldMusterAgain`). Merge both into the main dataframe.

In [55]:
career = pd.read_csv(os.path.join(intermediary_path, 'clustered_observations.csv'), 
                     usecols=['clusterRow', 'VOCOP_id'])

# endservice: information on reasons for ending career
endservice = pd.read_excel(os.path.join(external_path, 'reasonEndService.xlsx'))

# merge the data
vocop_persons_contracts = pd.merge(vocop, career, how='left', on=['VOCOP_id'])

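# Replace missing reasons for ending service with 'Unknown' so that they match the 'endservice' lookup in the next step.
# Assign the result back instead of calling fillna(inplace=True) on a column selection, which may not modify the dataframe.
vocop_persons_contracts['reasonEndService'] = vocop_persons_contracts['reasonEndService'].fillna('Unknown')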

vocop_persons_contracts = pd.merge(vocop_persons_contracts, endservice[['redenEng', 'couldMusterAgain']], 
          left_on='reasonEndService', right_on='redenEng')
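
Because the lookup merge above is an inner join, any reason that is absent from the lookup table removes its record. A possible check, sketched here on hypothetical data (the reason labels are invented for illustration), is to run a left join with `indicator=True` and list the values that found no match.

```python
import numpy as np
import pandas as pd

# Hypothetical contract rows; one has no recorded reason.
contracts = pd.DataFrame({"VOCOP_id": [1, 2, 3],
                          "reasonEndService": ["Deceased", np.nan, "Vanished"]})
# Hypothetical lookup table using the same column names as the notebook.
lookup = pd.DataFrame({"redenEng": ["Deceased", "Unknown"],
                       "couldMusterAgain": [0, 1]})

contracts["reasonEndService"] = contracts["reasonEndService"].fillna("Unknown")

# how="left" keeps every contract; indicator=True adds a "_merge" column
# that records whether each row found a partner in the lookup table.
checked = pd.merge(contracts, lookup, how="left",
                   left_on="reasonEndService", right_on="redenEng",
                   indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "reasonEndService"]
print(unmatched.tolist())
```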



In [None]:
vocop_persons_contracts[vocop_persons_contracts['couldMusterAgain'] == 1]['redenEng'].value_counts()

In [None]:
print("Original columns: {}".format(vocop_persons_contracts.columns.values))

## Select and Rename Columns

In [None]:
# here we select a subset of columns
selected_columns = ['VOCOP_id', 'fullNameOriginal', 'placeOfOrigin', 'disambiguated',
                   'cluster_ID', 'clusterRow', 'debtLetter', 'monthLetter', 'dateBeginService',
                   'date_end_service_improved', 'reasonEndService', 'voyageID', 'boardedAtCape',
                   'boardedAtCapeVoyageID','DASvoyageReturnID', 'Soldijboek', 'uid',
                   'generalRemark', 'sourceReference', 'scanPermalink', 'couldMusterAgain', 'endServiceWhere'] 


vocop_persons_contracts = vocop_persons_contracts[selected_columns]

In [None]:
# here we rename the columns
vocop_persons_contracts.rename(columns = {'VOCOP_id': 'vocop_id', 'fullNameOriginal':'full_name', 
                                          'placeOfOrigin': 'place_of_origin', 'disambiguated': 'disambiguated_person', 
                                          'cluster_ID':'person_cluster_id', 'clusterRow': 'person_cluster_row',
                                          'debtLetter':'debt_letter', 'monthLetter':'month_letter', 
                                          'dateBeginService':'date_begin_contract', 'date_end_service_improved':'date_end_contract',
                                          'reasonEndService':'reason_end_contract', 'voyageID':'outward_voyage_id',
                                         'boardedAtCape':'changed_ship_at_cape', 'boardedAtCapeVoyageID':'changed_ship_at_cape_voyage_id',
                                         'DASvoyageReturnID':'return_voyage_id', 'generalRemark':'remark', 'Soldijboek':'source_id', 'uid':'uid',
                                         'sourceReference': 'source_reference', 'scanPermalink': 'scan_permalink', 'couldMusterAgain': 'could_muster_again', 'endServiceWhere':'location_end_contract'}, inplace=True)
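
`DataFrame.rename` ignores mapping keys that are not present as columns, so a misspelled source name goes unnoticed. The following sketch, on a hypothetical two-column frame, lists such keys before renaming.

```python
import pandas as pd

# Hypothetical frame holding two of the notebook's original column names.
df = pd.DataFrame({"VOCOP_id": [1], "fullNameOriginal": ["Jan Jansz"]})
rename_map = {"VOCOP_id": "vocop_id",
              "fullNameOriginal": "full_name",
              "placeOfOrigin": "place_of_origin"}

# rename() silently skips keys that are not columns, so collect them explicitly.
missing = [old for old in rename_map if old not in df.columns]
renamed = df.rename(columns=rename_map)
print(missing, list(renamed.columns))
```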

## Merge Rank ID and Place ID

Load the `ranks_corrected.csv`, `voc_ranks.csv`, and `voc_places.csv` files and merge the `rank_corrected`, `rank_id`, and `place_id` values into the dataframe.

In [None]:
ranks_corrected = pd.read_csv(os.path.join(external_path, 'ranks_corrected.csv'), sep=';')

In [None]:
vocop_persons_contracts = pd.merge(vocop_persons_contracts, ranks_corrected[['vocop_id', 'eng_improved', 'rank_corrected']], on='vocop_id', how='left')
vocop_persons_contracts[vocop_persons_contracts['vocop_id'].isna()]
vocop_persons_contracts['rank_corrected'] = vocop_persons_contracts['rank_corrected'].astype('Int64')

In [None]:
vocop_persons_contracts.rename(columns = {'eng_improved' : 'rank'}, inplace=True)

In [None]:
voc_ranks = pd.read_csv(os.path.join(output_path, 'voc_ranks.csv'))

In [None]:
vocop_persons_contracts['rank'] = vocop_persons_contracts['rank'].str.lower()
vocop_persons_contracts = pd.merge(vocop_persons_contracts, voc_ranks[['rank_id', 'rank']], on='rank', how='left')
vocop_persons_contracts[vocop_persons_contracts['rank'].isna()]
vocop_persons_contracts['rank_id'] = vocop_persons_contracts['rank_id'].astype('Int64')

In [None]:
voc_places = pd.read_csv(os.path.join(output_path, 'voc_places.csv'))

In [None]:
vocop_persons_contracts = pd.merge(vocop_persons_contracts, 
                                   voc_places[['place_id', 'place_original']], 
                                   left_on='place_of_origin', 
                                   right_on='place_original', 
                                   how='left')

vocop_persons_contracts.drop('place_original', axis=1, inplace=True)
vocop_persons_contracts[vocop_persons_contracts['place_of_origin'].isna()]
vocop_persons_contracts['place_id'] = vocop_persons_contracts['place_id'].astype('Int64')
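
After a left merge, records without a counterpart in the lookup table keep a missing value in the looked-up ID column. One way to list them, sketched here with hypothetical rank labels and IDs, is to filter on that column.

```python
import pandas as pd

# Hypothetical contracts and rank lookup; labels and IDs are invented.
contracts = pd.DataFrame({"vocop_id": [1, 2, 3],
                          "rank": ["Sailor", "SOLDIER", "Cook's mate"]})
ranks = pd.DataFrame({"rank_id": [100, 200], "rank": ["sailor", "soldier"]})

# Normalise the key so that differences in capitalisation do not block a match.
contracts["rank"] = contracts["rank"].str.lower()
linked = pd.merge(contracts, ranks, on="rank", how="left")

# Unmatched rows have a missing rank_id, so test that column.
unmatched = linked[linked["rank_id"].isna()]
print(unmatched[["vocop_id", "rank"]])
```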


## Calculate Contract Length

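Calculate the contract length in days for each record from its start and end dates. As a workaround for old dates, the dates are parsed into Python `datetime.date` objects rather than pandas timestamps, which at their default nanosecond resolution cannot represent dates before 1677.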

In [None]:
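# turn into one function?
def make_datetime(x):
    """Parse a 'YYYY-MM-DD' string into a datetime.date; return NaN if parsing fails."""
    date_format = "%Y-%m-%d"
    try:
        return datetime.strptime(x, date_format).date()
    except (TypeError, ValueError):
        # TypeError: x is not a string (e.g. NaN); ValueError: malformed date
        return np.nan


def calculate_delta(begin, end):
    """Return the number of days between two dates, or NaN if either is missing."""
    if pd.isna(begin) or pd.isna(end):
        return np.nan
    return (end - begin).days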

In [None]:

vocop_persons_contracts['date_begin_contract'] = vocop_persons_contracts['date_begin_contract'].apply(make_datetime)
vocop_persons_contracts['date_end_contract'] = vocop_persons_contracts['date_end_contract'].apply(make_datetime)

print(vocop_persons_contracts['date_begin_contract'].dtype)
print(vocop_persons_contracts['date_end_contract'].dtype)


In [None]:
vocop_persons_contracts['contract_length'] = vocop_persons_contracts.apply(lambda x: calculate_delta(x['date_begin_contract'], x['date_end_contract']), axis=1)                         

In [None]:
vocop_persons_contracts['contract_length'].describe()
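
The reason for working with plain Python dates can be illustrated with a short sketch. It assumes pandas' default nanosecond timestamp resolution, and the two dates are invented examples.

```python
from datetime import datetime

import pandas as pd

# Earliest value a nanosecond-resolution pandas Timestamp can hold.
print(pd.Timestamp.min)

# Python's datetime.date covers years 1 to 9999, so 17th-century dates are fine.
begin = datetime.strptime("1650-03-14", "%Y-%m-%d").date()
end = datetime.strptime("1652-03-14", "%Y-%m-%d").date()

# Subtracting two dates gives a timedelta; .days is the contract length in days.
contract_days = (end - begin).days
print(contract_days)
```

The price of this approach is that the date columns have dtype `object`, so vectorised datetime accessors such as `.dt` are not available on them.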

## Re-order Columns and Fix Integer Values

Re-order the columns and fix integer values. Some columns contain NaNs, so use pandas' nullable `Int64` dtype (capital "I"), which allows missing values, rather than NumPy's `int64`.

In [None]:
vocop_persons_contracts = vocop_persons_contracts[['vocop_id', 'full_name', 'place_of_origin', 'place_id', 'disambiguated_person',
       'person_cluster_id', 'person_cluster_row', 'rank', 'rank_corrected', 'rank_id', 'debt_letter',
       'month_letter', 'date_begin_contract', 'date_end_contract', 'contract_length',
       'reason_end_contract', 'could_muster_again', 'location_end_contract', 
       'outward_voyage_id', 'changed_ship_at_cape',
       'changed_ship_at_cape_voyage_id', 'return_voyage_id', 'remark', 'source_id', 'source_reference', 
        'uid', 'scan_permalink',]]

In [None]:
vocop_persons_contracts['person_cluster_id'] = vocop_persons_contracts['person_cluster_id'].astype('Int64')
vocop_persons_contracts['rank_id'] = vocop_persons_contracts['rank_id'].astype('Int64')

In [None]:
int_cols = ['person_cluster_id',
            'contract_length',
            'outward_voyage_id',
            'return_voyage_id',
            'changed_ship_at_cape_voyage_id'
           ]

for col in int_cols:
    vocop_persons_contracts[col] = vocop_persons_contracts[col].astype('Int64')
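
The difference between the two integer dtypes can be seen in a small sketch on a hypothetical column of voyage IDs with one missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical ID column that became float because of a missing value.
voyage_ids = pd.Series([101.0, np.nan, 103.0], name="outward_voyage_id")

# NumPy's int64 has no missing-value marker, so this cast fails.
try:
    voyage_ids.astype("int64")
except (TypeError, ValueError) as exc:
    print(f"int64 cast failed: {type(exc).__name__}")

# The nullable extension dtype stores the gap as <NA> and keeps whole numbers.
nullable = voyage_ids.astype("Int64")
print(nullable)
```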

## Print Descriptive Statistics

In [None]:
print('Total number of records: {}'.format(vocop_persons_contracts.shape[0]))
print('Records without VoyageID: {}'.format(vocop_persons_contracts[vocop_persons_contracts['outward_voyage_id'].isnull()].shape[0]))
print('Records that cannot be disambiguated: {}'.format(vocop_persons_contracts[vocop_persons_contracts['disambiguated_person'] == 0].shape[0]))
print('Records that boarded at Cape: {}'.format(vocop_persons_contracts[vocop_persons_contracts['changed_ship_at_cape'] == 'Ja'].shape[0]))
print('Number of unique persons: {}'.format(vocop_persons_contracts['person_cluster_id'].nunique()))
print('Number of unique persons that travelled more than once: {}'.format(vocop_persons_contracts[vocop_persons_contracts['person_cluster_row'] > 1]['person_cluster_id'].nunique()))
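
The same kind of counts can also be obtained by summing boolean masks, which avoids building an intermediate dataframe for each statistic. This is a sketch on hypothetical records, not a replacement for the cell above.

```python
import pandas as pd

# Hypothetical records: person 1 has two contracts, one record lacks a voyage ID.
records = pd.DataFrame({
    "person_cluster_id": [1, 1, 2, 3],
    "person_cluster_row": [1, 2, 1, 1],
    "outward_voyage_id": pd.array([10, None, 11, 12], dtype="Int64"),
})

# True counts as 1 when summed, so the sum of a mask is the number of matches.
n_without_voyage = int(records["outward_voyage_id"].isna().sum())
n_persons = records["person_cluster_id"].nunique()
n_repeat = records.loc[records["person_cluster_row"] > 1,
                       "person_cluster_id"].nunique()
print(n_without_voyage, n_persons, n_repeat)
```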

## Save Data to `voc_persons_contracts.csv` and `voc_names.csv`

To facilitate analysis, save the muster record data to `voc_persons_contracts.csv` and all the original and normalized name parts to `voc_names.csv`.

In [None]:
vocop_persons_contracts.to_csv(os.path.join(output_path, 'voc_persons_contracts.csv'),
                               index=False)

In [None]:
column_name_map = {
    'VOCOP_id': 'vocop_id', 
    'fullNameOriginal': 'full_name_original', 
    'firstNameOriginal': 'first_name_original',
    'patronymicOriginal': 'patronymic_original', 
    'familyNamePrefixOriginal': 'family_name_prefix_original', 
    'familyNameOriginal': 'family_name_original',
    'fullNameNormalized': 'full_name_normalized',
    'firstNameNormalized': 'first_name_normalized', 
    'patronymicNormalized': 'patronymic_normalized',
    'familyNamePrefixNormalized': 'family_name_prefix_normalized', 
    'familyNameNormalized': 'family_name_normalized'
}

vocop_names = vocop[list(column_name_map.keys())]
vocop_names = vocop_names.rename(columns=column_name_map)
vocop_names

In [None]:
vocop_names.to_csv(os.path.join(output_path, 'voc_names.csv'),
                   index=False)
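
A quick way to confirm that a saved file contains only the data columns, without an extra index column, is to write it and read it back. The sketch below uses an in-memory buffer and two hypothetical name records instead of the real output file.

```python
import io

import pandas as pd

# Hypothetical name records standing in for the real names table.
names = pd.DataFrame({"vocop_id": [1, 2],
                      "full_name_original": ["Jan Jansz", "Pieter de Vries"]})

# Write without the row index, then read the result back from the buffer.
buffer = io.StringIO()
names.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The reloaded frame should expose exactly the original columns.
print(list(reloaded.columns))
```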