# Create `voc_voyages.csv`

This notebook creates a file with basic information about VOC ship voyages between Europe and Asia, taken from the [Dutch-Asiatic Shipping database](https://resources.huygens.knaw.nl/das/) of the Huygens Institute.

## Environment Setup and Import Libraries

Set up the environment and import necessary libraries for data manipulation and path handling.

In [1]:
import pandas as pd
import os
import numpy as np
from edtf import parse_edtf

In [2]:
local_folder = '../'

data_path = os.path.join(local_folder, 'raw')
intermediary_path = os.path.join(local_folder, 'intermediary')
external_path = os.path.join(local_folder, 'external')
output_path = os.path.join(local_folder, 'enriched')

## Read Selected Columns from Data

In [4]:
selected_columns = ['voyId', 'voyNameShip', 'voyNumber 1','ShipID','voyTonnage', 'voyChamber',
                    'voyDepartureEDTF', 'voyDeparturePlace', 'voyCapeArrivalEDTF',
                   'voyCapeDepartureEDTF', 'voyArrivalDateEDTF', 'voyArrivalPlace']

voc_voyages = pd.read_excel(os.path.join(external_path, 'das.xlsx'), usecols=selected_columns)

## Normalize EDTF Date Representation

Take the first date from [EDTF](https://www.loc.gov/standards/datetime/) representations of complex, uncertain/approximate dates or date ranges to facilitate date calculations.

In [3]:
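def fix_edtf(x):
    """Reduce an EDTF string to a single date by keeping its first date."""
    if x.startswith('['):
        # Bracketed set of dates: keep the first parsed member
        return str(parse_edtf(x).objects[0])
    elif '/' in x:
        # Interval 'start/end': keep the start date
        return x.split('/')[0]
    else:
        # Anything else is returned unchanged
        return x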

In [6]:
for col in ['voyDepartureEDTF', 'voyCapeArrivalEDTF', 'voyCapeDepartureEDTF', 'voyArrivalDateEDTF']:
    voc_voyages[col] = voc_voyages[col].astype(str)
    voc_voyages[col] = voc_voyages[col].apply(fix_edtf)
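
The effect of the three branches can be illustrated with a string-only stand-in. This is a minimal sketch, not the notebook's `fix_edtf`: the name `first_edtf_date` and the sample values are hypothetical, and the set branch splits the string by hand instead of calling `parse_edtf`, so it assumes sets are written as comma-separated members inside square brackets.

```python
def first_edtf_date(value: str) -> str:
    """String-only approximation of fix_edtf (hypothetical helper)."""
    if value.startswith('['):
        # Assumed set syntax '[a,b,...]': keep the first member.
        return value.strip('[]').split(',')[0].strip()
    if '/' in value:
        # Interval 'start/end': keep the start.
        return value.split('/')[0]
    # Anything else passes through unchanged.
    return value


# Hypothetical sample values, not taken from das.xlsx
samples = ['[1667-03-01,1667-03-05]', '1667-03/1667-05', '1667-03-01', 'nan']
normalized = [first_edtf_date(s) for s in samples]
print(normalized)
```

The last sample shows that a plain string such as `'nan'` (what `astype(str)` typically yields for a missing cell) passes through untouched, so missing dates end up in the column as that literal string.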

## Add Direction Field

Use the numeric values in the `voyNumber 1` column to populate a new column named `direction`. A trailing 'A' suffix is stripped first; voyage numbers below 5000 are then categorized as 'outward' (from Europe to Asia) and all others as 'return' (from Asia to Europe).

In [7]:
voc_voyages['voyNumber 1'] = voc_voyages['voyNumber 1'].astype(str).map(lambda x: x.rstrip('A'))

voc_voyages['direction'] = np.where(voc_voyages['voyNumber 1'].astype(int) < 5000, 'outward', 'return')

In [9]:
voc_voyages.drop('voyNumber 1', axis=1, inplace=True)
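
The classification rule can be checked on a handful of values without loading the spreadsheet. The following is a sketch in plain Python under the same threshold of 5000; the function name `voyage_direction` and the example voyage numbers are hypothetical.

```python
def voyage_direction(voy_number: str) -> str:
    """Classify a voyage number using the notebook's threshold of 5000."""
    # Drop a trailing 'A' suffix, if any, before converting to int
    number = int(voy_number.rstrip('A'))
    return 'outward' if number < 5000 else 'return'


# Hypothetical voyage numbers chosen to exercise both sides of the threshold
examples = ['0123', '4999', '5000', '5321A']
directions = {n: voyage_direction(n) for n in examples}
print(directions)
```

Note that 5000 itself falls on the 'return' side, because the comparison is strictly less than.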

## Rename and Select Columns and Perform Last Normalization

Rename the columns and select the columns to be saved in the `voc_voyages.csv` file. Also convert ship names to title case.

In [10]:
voc_voyages.rename(columns = {'voyId': 'das_voyage_id', 'voyNameShip':'ship_name',
                             'voyTonnage': 'ship_tonnage', 'voyChamber': 'chamber',
                             'voyDepartureEDTF': 'departure_date', 'voyDeparturePlace':'departure_place',
                             'voyCapeArrivalEDTF':'arrival_date_cape', 'voyCapeDepartureEDTF':'departure_date_cape',
                             'voyArrivalDateEDTF':'arrival_date', 'voyArrivalPlace':'arrival_place'}, inplace=True)

In [11]:
voc_voyages = voc_voyages[['das_voyage_id', 'direction', 'ship_name', 'ship_tonnage', 'chamber', 
            'departure_date', 'departure_place', 'arrival_date_cape', 'departure_date_cape', 'arrival_date',
            'arrival_place']]

In [12]:
voc_voyages['ship_name'] = voc_voyages['ship_name'].str.title()
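
How Python's `str.title()` treats multi-word names is easiest to see on a few inputs. A small sketch with made-up sample names (assumptions for illustration, not taken from `das.xlsx`): every word is capitalized, including particles such as "van", and a letter that follows an apostrophe is capitalized as well.

```python
# Illustrative sample names (hypothetical, not read from the source file)
names = ['AMSTERDAM', 'WAPEN VAN HOORN', "'T HUIS TE VELSEN"]

# str.title() upper-cases the first letter of each word and lower-cases the rest
titled = [name.title() for name in names]
print(titled)
```

If lowercase particles matter for downstream matching, a custom rule would be needed; the notebook itself applies plain title case.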

In [13]:
voc_voyages.to_csv(os.path.join(output_path, 'voc_voyages.csv'), index=False)