# Create `voc_ranks.csv`

This notebook combines two files created in the HUMIGEC project, one with improved rank information and one with wage sample data, into a single `voc_ranks.csv` (see the accompanying data paper for more information).

## Import Libraries

Import necessary libraries for data manipulation and file path handling.

In [None]:
import pandas as pd
import numpy as np
import os

## Define File Paths

Set up paths for data directories to manage file locations conveniently.

In [None]:
local_folder = '../'

data_path = os.path.join(local_folder, 'original')
intermediary_path = os.path.join(local_folder, 'intermediary')
external_path = os.path.join(local_folder, 'external')
output_path = os.path.join(local_folder, 'enriched')

## Load Data

Load the VOC ranks and wage information from external CSV files, prepared by a historian and a research assistant on the HUMIGEC project. See the accompanying data paper for more information.

In [None]:
voc_ranks = pd.read_csv(os.path.join(external_path, 'VOC_ranks.csv'),
                        dtype={'rank_id': str,  # because of NaNs
                               'HISCO': str},
                        delimiter=';')

wages = pd.read_csv(os.path.join(external_path, 'wage_sample_new_categories_updated.csv'),
                    usecols=['monthly_wage', 'parent_rank'],
                    delimiter=';')

## Calculate Median Wage

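Compute the median monthly wage per parent rank and merge it into the `voc_ranks` dataframe on `parent_rank`. The outer merge keeps ranks without wage observations (their `median_wage` is NaN) as well as parent ranks that occur only in the wage sample.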

In [None]:
median_wages = wages.groupby('parent_rank')['monthly_wage'].median()
voc_ranks = pd.merge(voc_ranks, median_wages.to_frame('median_wage').reset_index(), on='parent_rank', how='outer')
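
As a minimal sketch of what the group-and-merge step does, the example below applies the same calls to a small toy dataset. The rank names and wage values are invented for illustration and are not taken from the HUMIGEC files.

```python
import pandas as pd

# Hypothetical toy data; names and values are invented for illustration only.
toy_ranks = pd.DataFrame({
    'rank_id': ['1', '2', '3'],
    'parent_rank': ['sailor', 'soldier', 'cook'],
})
toy_wages = pd.DataFrame({
    'parent_rank': ['sailor', 'sailor', 'sailor', 'soldier', 'gunner'],
    'monthly_wage': [8, 10, 11, 9, 12],
})

# Median wage per parent rank, as a Series indexed by parent_rank.
toy_medians = toy_wages.groupby('parent_rank')['monthly_wage'].median()

# The outer merge keeps 'cook' (no wage data) and 'gunner' (no rank row).
toy_merged = pd.merge(toy_ranks,
                      toy_medians.to_frame('median_wage').reset_index(),
                      on='parent_rank', how='outer')

print(toy_merged)
```

In this toy case `cook` ends up with a missing `median_wage`, and `gunner` appears as an extra row with a missing `rank_id`.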

## Data Cleaning and Preparation

Rename the `HISCO` column to `hisco`, drop rows without a `rank_id`, cast `rank_id` to integer, and sort the data by `rank_id`.

In [None]:
voc_ranks = voc_ranks.rename(columns={'HISCO': 'hisco'})

# remove rows with NaNs on rank_id; .copy() avoids a SettingWithCopyWarning below
voc_ranks = voc_ranks[voc_ranks['rank_id'].notna()].copy()

# rank_id was read as a string, so cast it to int to sort numerically
voc_ranks['rank_id'] = voc_ranks['rank_id'].astype('int')
voc_ranks = voc_ranks.sort_values('rank_id')
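
The short sketch below, on invented toy values, illustrates why the cast to integer matters before sorting: identifiers stored as strings sort lexicographically, so `'10'` would come before `'2'`.

```python
import pandas as pd

# Hypothetical toy data; identifiers and labels are invented for illustration only.
toy = pd.DataFrame({'rank_id': ['10', '2', None, '1'],
                    'rank': ['c', 'b', 'x', 'a']})

# Sorting the raw strings gives lexicographic order.
string_order = sorted(toy['rank_id'].dropna())

# Drop missing ids, cast to int, then sort numerically.
toy_clean = toy[toy['rank_id'].notna()].copy()
toy_clean['rank_id'] = toy_clean['rank_id'].astype('int')
toy_clean = toy_clean.sort_values('rank_id')

print(string_order)
print(toy_clean['rank_id'].tolist())
```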

## Construct HISCO URIs

Create a new column `hisco_uri` in `voc_ranks` by appending each `hisco` value to a base URI. Rows without a `hisco` value get NaN.

In [None]:
base_uri = 'https://iisg.amsterdam/resource/hisco/code/hisco/'

# Using .apply() with a lambda function to create the 'hisco_uri' column
voc_ranks['hisco_uri'] = voc_ranks['hisco'].apply(lambda x: base_uri + str(x) if pd.notnull(x) else np.nan)
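
To illustrate the URI construction in isolation, here is a small sketch on placeholder codes. The code values are hypothetical and are not looked up in the HISCO scheme. It also suggests why keeping the code as a string is useful: a leading zero survives.

```python
import numpy as np
import pandas as pd

base_uri = 'https://iisg.amsterdam/resource/hisco/code/hisco/'

# Placeholder codes for illustration only; None stands for a rank without a code.
toy_codes = pd.Series(['98135', None, '04217'], dtype=object)

# Same expression as above: prefix non-missing codes, keep missing ones as NaN.
toy_uris = toy_codes.apply(lambda x: base_uri + str(x) if pd.notnull(x) else np.nan)

print(toy_uris.tolist())
```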


## Finalize and Save DataFrame

Finalize the DataFrame by selecting specific columns and then save it to a CSV file.

In [None]:
voc_ranks = voc_ranks[['rank_id', 'rank', 'parent_rank', 'category', 'subcategory', 'hisco', 'hisco_uri',
                       'rank_nl', 'rank_description_nl', 'rank_description_eng', 'median_wage']]

# index=False leaves the dataframe index out of the CSV
voc_ranks.to_csv(os.path.join(output_path, 'voc_ranks.csv'), index=False)