# Create `voc_beneficiaries.csv`

This notebook creates `voc_beneficiaries.csv` from the original beneficiaries data in the [VOC Opvarenden collection of the Dutch National Archives](https://www.nationaalarchief.nl/onderzoeken/index/nt00444). The resulting file features English column names and an additional column with normalized relation information.

## Import Libraries and Original File

Import pandas and use the `read_beneficiaries_df` function defined in `helper.py` to read the original data into a dataframe with English column names.

In [None]:
import pandas as pd

import helper

df = helper.read_beneficiaries_df()

print(f'Beneficiaries data\n\tnumber of rows: {df.shape[0]}\n\tnumber of columns: {df.shape[1]}')

df.head(5)

## Normalize Relation Labels

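Replace 'Zus(ter)' with 'Zuster' and apply title case to every relation label, which turns 'instelling' into 'Instelling'. Inspect the value counts before and after to confirm that the labels are consistent.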
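
As a rough illustration of this normalization step, the sketch below applies the same two operations to a small series. The labels here are toy values chosen for illustration, not a sample of the real data, and `Series.replace` is used in place of the `.loc` assignment purely for brevity.

```python
import pandas as pd

# Toy labels for illustration only (an assumption, not real VOC data)
relation = pd.Series(['Zus(ter)', 'instelling', 'Vader', None, 'Zuster'])

# Step 1: replace the variant spelling with the plain form
relation = relation.replace('Zus(ter)', 'Zuster')

# Step 2: title-case string values; leave missing values untouched
normalized = relation.apply(lambda x: x.title() if isinstance(x, str) else x)

# Both 'Zus(ter)' and 'Zuster' now count towards the same label
print(normalized.value_counts(dropna=False))
```

Note that `str.title()` capitalizes the first letter of every word, so a multi-word label would have each word capitalized, not only the first.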


In [None]:
df.relation.value_counts()

In [None]:
# Change value 'Zus(ter)' into 'Zuster'
df.loc[df['relation'] == 'Zus(ter)', 'relation'] = 'Zuster'

# Apply title case to every string label (e.g. 'instelling' becomes 'Instelling');
# non-string values such as NaN are left unchanged
df.relation = df.relation.apply(lambda x: x.title() if isinstance(x, str) else x)

# Display value counts to check that the relation labels are consistent
df.relation.value_counts()

## Write Processed Data to File

In [None]:
processed_file = '../enriched/voc_beneficiaries.csv'
# index=False omits the dataframe index from the output file
df.to_csv(processed_file, index=False)
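
A quick way to check the written file is to read it back and compare its columns. The sketch below does this with a toy dataframe and a temporary directory; the column names and file name are hypothetical and stand in for the real ones.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy dataframe; 'name' and 'relation' values are illustrative assumptions
toy = pd.DataFrame({'name': ['Jan', 'Griet'], 'relation': ['Vader', 'Zuster']})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, written to a throwaway directory
    path = Path(tmp_dir) / 'toy_beneficiaries.csv'

    # Write without the index so no extra unnamed column appears on reload
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# The reloaded columns should match the ones that were written
print(list(reloaded.columns))
print(reloaded.shape)
```

The same pattern could be applied to `processed_file` after the cell above has run, for example by comparing `pd.read_csv(processed_file).shape` with `df.shape`.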