# Merging Orbis Financial Data

This notebook merges the Orbis financial datasets (balance sheets, global ratios, and key variables) into a single dataset.

## Technical Notes:

- This notebook uses `dask` instead of `pandas` to handle large datasets that do not fit into memory. 

## How to use:

- Make sure you have the required libraries installed. You can install them using pip:
```bash
pip install "dask[complete]"
```

- Change the `dataset_dir` variable to point to the directory where your Orbis datasets are stored.

- Change the `out_dir` variable to point to the directory where you want to save the merged dataset.

- Run the notebook. The merged dataset will be saved as a CSV file in the specified output directory.




In [12]:
import dask.dataframe as dd
import os


In [13]:
# Some common missing value indicators in the dataset
na_values = ["none", "n.a.", "n.a", "n.s.", "n.s", "N/A", "N.A.", "N.A", "NaN", "nan", "NA", "na", "", " ", "-", "--"]

# Read the identifier and per-employee ratio columns as strings to avoid issues with mixed types;
# "Year" is read as an integer
dtypes = {
    "BvD ID number": "object",
    "Year": "int64",
    "Orbis ID number": "object",
    'Operating Revenue per Employee': 'object',
    'Operating Revenue per Employee (Alt)': 'object',
    'Operating Revenue per Employee (Alt2)': 'object',
    'Profit per Employee': 'object',
    'Shareholders Funds per Employee': 'object',
    'Total Assets per Employee': 'object',
    'Working Capital per Employee': 'object'
}

# Locate the datasets relative to the parent of the current working directory
script_dir = os.path.dirname(os.path.abspath(os.getcwd()))
dataset_dir = os.path.join(script_dir, "unmerged-datasets/orbis-financial")

# Read the datasets using Dask
balance_sheet = dd.read_csv(os.path.join(dataset_dir, "financial_balance_sheet_assets_profit_loss.csv"), na_values=na_values, dtype=dtypes)
global_ratios = dd.read_csv(os.path.join(dataset_dir, "financial_global_ratios.csv"), na_values=na_values, dtype=dtypes)
key_vars = dd.read_csv(os.path.join(dataset_dir, "financial_key_variables.csv"), na_values=na_values, dtype=dtypes)

print("Balance Sheet Columns:")
print(balance_sheet.columns)
print("Global Ratios Columns:")
print(global_ratios.columns)
print("Key Variables Columns:")
print(key_vars.columns)


Balance Sheet Columns:
Index(['Company name Latin alphabet', 'BvD ID number', 'Year',
       'Current_Assets_USD', 'EBIT_USD', 'Non_Current_Assets_USD',
       'Financial_Expenses_USD', 'Financial_Profit_Loss_USD',
       'Financial_Revenue_USD', 'Gross_Profit_USD', 'Intangible_Assets_USD',
       'Operating_Income_USD', 'Profit_After_Tax_m_USD', 'PBT_USD',
       'Profit_Loss_After_Tax_USD', 'Profit_Loss_Before_Tax_USD',
       'Tangible_Fixed_Assets_USD', 'Total_Assets_USD',
       'Total_Operating_Expenses_USD', 'Orbis ID number'],
      dtype='object')
Global Ratios Columns:
Index(['Unnamed: 0', 'Company name Latin alphabet', 'BvD ID number',
       'Orbis ID number', 'Year', 'Operating Revenue per Employee',
       'Operating Revenue per Employee (Alt)',
       'Operating Revenue per Employee (Alt2)', 'Profit per Employee',
       'Shareholders Funds per Employee', 'Total Assets per Employee',
       'Working Capital per Employee'],
      dtype='object')
Key Variables Columns:
Ind
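
The effect of `na_values` and `dtype` on parsing can be checked on a tiny in-memory sample. The sketch below uses `pandas.read_csv`, which accepts the same two parameters as `dd.read_csv`; the three rows are invented for illustration and only a subset of the notebook's `na_values` list is used.

```python
import io

import pandas as pd

# Hypothetical three-row sample: the column names match the notebook, the values are invented.
sample_csv = io.StringIO(
    "BvD ID number,Year,Profit per Employee\n"
    "de0012345,2019,n.a.\n"
    "FR0098765,2020,12.5\n"
    "us0000001,2021,-\n"
)

# A subset of the missing-value markers used in the notebook
na_values = ["n.a.", "-"]
dtypes = {"BvD ID number": "object", "Year": "int64", "Profit per Employee": "object"}

df = pd.read_csv(sample_csv, na_values=na_values, dtype=dtypes)

# The markers become NaN, while real values stay as unparsed strings
missing_count = int(df["Profit per Employee"].isna().sum())
remaining_value = df.loc[1, "Profit per Employee"]
print(missing_count, repr(remaining_value))
```

Because the ratio columns are read as `object`, numeric values remain strings and would need an explicit conversion before any arithmetic.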

In [14]:
# These columns are the unique identifiers for each observation
# and are used to merge the datasets later
index_cols = ["BvD ID number", "Orbis ID number", "Year"]

print("Dropping rows with null values in columns:", index_cols)

num_rows_balance_sheet = balance_sheet.shape[0].compute()
num_rows_global_ratios = global_ratios.shape[0].compute()
num_rows_key_vars = key_vars.shape[0].compute()

# Drop rows in which every index column is null (how="all").
# Rows that are missing only some of the identifiers are kept,
# so individual index columns can still contain null values afterwards
balance_sheet = balance_sheet.dropna(subset=index_cols, how="all")
global_ratios = global_ratios.dropna(subset=index_cols, how="all")
key_vars = key_vars.dropna(subset=index_cols, how="all")

balance_sheet_shape = balance_sheet.shape
global_ratios_shape = global_ratios.shape
key_vars_shape = key_vars.shape

num_rows_balance_sheet_after = balance_sheet_shape[0].compute()
num_rows_global_ratios_after = global_ratios_shape[0].compute()
num_rows_key_vars_after = key_vars_shape[0].compute()

num_cols_balance_sheet = balance_sheet_shape[1]
num_cols_global_ratios = global_ratios_shape[1]
num_cols_key_vars = key_vars_shape[1]

print(f"Dropped {num_rows_balance_sheet - num_rows_balance_sheet_after} rows with null values from balance sheet")
print(f"Dropped {num_rows_global_ratios - num_rows_global_ratios_after} rows with null values from global ratios")
print(f"Dropped {num_rows_key_vars - num_rows_key_vars_after} rows with null values from key variables")

print(f"Balance Sheet: {num_rows_balance_sheet_after} rows, {num_cols_balance_sheet} columns")
print(f"Global Ratios: {num_rows_global_ratios_after} rows, {num_cols_global_ratios} columns")
print(f"Key Variables: {num_rows_key_vars_after} rows, {num_cols_key_vars} columns")

Dropping rows with null values in columns: ['BvD ID number', 'Orbis ID number', 'Year']
Dropped 0 rows with null values from balance sheet
Dropped 0 rows with null values from global ratios
Dropped 0 rows with null values from key variables
Balance Sheet: 19212750 rows, 20 columns
Global Ratios: 19257690 rows, 12 columns
Key Variables: 16465729 rows, 11 columns
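
The difference between `how="all"` and `how="any"` matters for the identifier columns. The following sketch illustrates it with `pandas` on three invented rows; `dropna` takes the same `subset` and `how` arguments as in the cell above.

```python
import pandas as pd

index_cols = ["BvD ID number", "Orbis ID number", "Year"]

# Invented rows: complete, missing one identifier, missing every identifier
df = pd.DataFrame(
    {
        "BvD ID number": ["AA1", "BB2", None],
        "Orbis ID number": ["001", None, None],
        "Year": [2019, 2020, None],
    }
)

# how="all" removes a row only when all index columns are null
kept_all = df.dropna(subset=index_cols, how="all")

# how="any" removes a row as soon as one index column is null
kept_any = df.dropna(subset=index_cols, how="any")

print(len(kept_all), len(kept_any))
```

With `how="all"`, the row that lacks only an Orbis ID survives, which is consistent with null Orbis IDs still being present after the merge.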


In [15]:
print("Expected number of columns in merged dataset:")
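# Each of the two merges collapses the shared index columns into a single copy,
# so subtract len(index_cols) once per merge
expected_cols = num_cols_balance_sheet + num_cols_global_ratios + num_cols_key_vars - len(index_cols) * 2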
print(expected_cols)

Expected number of columns in merged dataset:
37
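
As a worked check of the arithmetic, using the column counts printed earlier (20, 12, and 11): merging three tables takes two merges, and each merge removes one redundant copy of the three index columns. The count assumes that any other column name shared between tables is suffixed rather than collapsed.

```python
# Column counts taken from the output of the previous cell
num_cols = {"balance_sheet": 20, "global_ratios": 12, "key_vars": 11}
index_cols = ["BvD ID number", "Orbis ID number", "Year"]

# Merging n tables takes n - 1 merges
num_merges = len(num_cols) - 1

# 20 + 12 + 11 = 43 columns in total, minus 3 index columns per merge
expected_cols = sum(num_cols.values()) - len(index_cols) * num_merges
print(expected_cols)
```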


In [16]:
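# Normalize the identifiers to upper case so that IDs differing only in case match during the merge
balance_sheet["BvD ID number"] = balance_sheet["BvD ID number"].str.upper()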
global_ratios["BvD ID number"] = global_ratios["BvD ID number"].str.upper()
key_vars["BvD ID number"] = key_vars["BvD ID number"].str.upper()

balance_sheet["Orbis ID number"] = balance_sheet["Orbis ID number"].str.upper()
global_ratios["Orbis ID number"] = global_ratios["Orbis ID number"].str.upper()
key_vars["Orbis ID number"] = key_vars["Orbis ID number"].str.upper()
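
Merge keys are compared as exact strings, so identifiers that differ only in case do not match. A minimal sketch with `pandas` and two invented one-row tables shows the effect of the normalization:

```python
import pandas as pd

# Invented tables holding the same company with differently cased IDs
left = pd.DataFrame({"BvD ID number": ["de123"], "Total_Assets_USD": [100.0]})
right = pd.DataFrame({"BvD ID number": ["DE123"], "Profit per Employee": ["1.5"]})

# Without normalization the outer join keeps two unmatched rows
rows_before = len(left.merge(right, on="BvD ID number", how="outer"))

left["BvD ID number"] = left["BvD ID number"].str.upper()
right["BvD ID number"] = right["BvD ID number"].str.upper()

# After normalization both sides land in a single row
rows_after = len(left.merge(right, on="BvD ID number", how="outer"))
print(rows_before, rows_after)
```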

In [17]:
# Merge the datasets on the index columns
# The merge uses an outer join so that observations present in only some of the datasets are kept
merged_financial_data = balance_sheet.merge(global_ratios, on=index_cols, how="outer")
merged_financial_data = merged_financial_data.merge(key_vars, on=index_cols, how="outer")

In [18]:
print("Merged financial data columns:")
print(merged_financial_data.columns)

num_rows_merged = merged_financial_data.shape[0].compute()
print(f"Number of rows in merged dataset: {num_rows_merged}")

# Check if the number of columns in the merged dataset is as expected
print("Got expected number of columns in merged dataset:")
num_cols_merged = merged_financial_data.shape[1]
got_expected_cols = num_cols_merged == expected_cols
if got_expected_cols:
    print("True")
else:
    print("False")
    print(f"Expected: {expected_cols}, Got: {num_cols_merged}")

print(merged_financial_data.head())

Merged financial data columns:
Index(['Company name Latin alphabet_x', 'BvD ID number', 'Year',
       'Current_Assets_USD', 'EBIT_USD', 'Non_Current_Assets_USD',
       'Financial_Expenses_USD', 'Financial_Profit_Loss_USD',
       'Financial_Revenue_USD', 'Gross_Profit_USD', 'Intangible_Assets_USD',
       'Operating_Income_USD', 'Profit_After_Tax_m_USD', 'PBT_USD',
       'Profit_Loss_After_Tax_USD', 'Profit_Loss_Before_Tax_USD',
       'Tangible_Fixed_Assets_USD', 'Total_Assets_USD',
       'Total_Operating_Expenses_USD', 'Orbis ID number', 'Unnamed: 0',
       'Company name Latin alphabet_y', 'Operating Revenue per Employee',
       'Operating Revenue per Employee (Alt)',
       'Operating Revenue per Employee (Alt2)', 'Profit per Employee',
       'Shareholders Funds per Employee', 'Total Assets per Employee',
       'Working Capital per Employee', 'Company name Latin alphabet',
       'Current ratio', 'Number of employees', 'Profit margin',
       'ROCE using P/L before tax', 'RO
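
The `_x` and `_y` suffixes in the column list appear because the company name column exists in more than one input but is not part of the merge keys. The sketch below reproduces this with `pandas` on invented data and shows one possible way to coalesce the copies into a single column. The coalescing step is a suggestion, not something the notebook does, and the shortened column name `Company name` is used only for this example.

```python
import pandas as pd

keys = ["BvD ID number", "Year"]

# Invented tables that share a non-key column
left = pd.DataFrame(
    {
        "BvD ID number": ["AA1", "BB2"],
        "Year": [2019, 2019],
        "Company name": ["Alpha", "Beta"],
        "Total_Assets_USD": [100.0, 200.0],
    }
)
right = pd.DataFrame(
    {
        "BvD ID number": ["BB2", "CC3"],
        "Year": [2019, 2019],
        "Company name": ["Beta", "Gamma"],
        "Profit per Employee": ["1.5", "2.5"],
    }
)

merged = left.merge(right, on=keys, how="outer")
merged_columns = list(merged.columns)  # contains "Company name_x" and "Company name_y"

# Take the left-hand name where present, otherwise fall back to the right-hand one
merged["Company name"] = merged["Company name_x"].fillna(merged["Company name_y"])
merged = merged.drop(columns=["Company name_x", "Company name_y"])
print(merged_columns)
print(merged)
```

Coalescing changes the final column count, so the expected-column check would need to be adjusted if this step were adopted.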

In [19]:
# Drop rows that share the same index columns, keeping the first occurrence, so that each observation appears only once
merged_financial_data_no_duplicates = merged_financial_data.drop_duplicates(subset=index_cols, keep="first")
num_rows_merged_no_duplicates = merged_financial_data_no_duplicates.shape[0].compute()

print(f"Number of rows in merged dataset after dropping duplicates: {num_rows_merged_no_duplicates}")
print(f"Dropped {num_rows_merged - num_rows_merged_no_duplicates} duplicate rows")

Number of rows in merged dataset after dropping duplicates: 39771332
Dropped 34380 duplicate rows
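
Before discarding duplicates it can be useful to look at them, since `keep="first"` silently drops any conflicting values in later rows. A small sketch with `pandas` on invented data, where `duplicated(keep=False)` flags every row of a duplicated key:

```python
import pandas as pd

index_cols = ["BvD ID number", "Orbis ID number", "Year"]

# Invented rows: the first two share the same key but report different values
df = pd.DataFrame(
    {
        "BvD ID number": ["AA1", "AA1", "BB2"],
        "Orbis ID number": ["001", "001", "002"],
        "Year": [2019, 2019, 2019],
        "Total_Assets_USD": [100.0, 150.0, 200.0],
    }
)

# keep=False marks all members of a duplicate group, not only the later ones
conflicting = df[df.duplicated(subset=index_cols, keep=False)]

# keep="first" retains the first row of each group and drops the rest
deduplicated = df.drop_duplicates(subset=index_cols, keep="first")
print(len(conflicting), len(deduplicated))
```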


In [20]:

out_dir = os.path.join(script_dir, "merged-datasets")
# Create the output directory if it does not exist yet
os.makedirs(out_dir, exist_ok=True)
out_path = os.path.join(out_dir, "orbis-financial-merged.csv")
merged_financial_data_no_duplicates.to_csv(out_path, single_file=True, index=False, na_rep="N/A", lineterminator="\n")
print("Merged dataset saved to 'merged-datasets/orbis-financial-merged.csv'")

Merged dataset saved to 'merged-datasets/orbis-financial-merged.csv'
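
When the saved file is read back, the identifier columns again need an explicit string dtype, and the `N/A` marker written by `na_rep` must be recognized as missing. The sketch below checks this round trip with `pandas` in a temporary directory; the file name `sample-merged.csv` and the data are invented for the example.

```python
import os
import tempfile

import pandas as pd

# Invented sample with one missing Orbis ID
df = pd.DataFrame(
    {
        "BvD ID number": ["AA1", "BB2"],
        "Orbis ID number": ["001", None],
        "Year": [2019, 2020],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, "merged-datasets")
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, "sample-merged.csv")

    # Missing values are written as "N/A", as in the notebook
    df.to_csv(out_path, index=False, na_rep="N/A")

    # "N/A" is one of the default missing-value markers of read_csv
    reloaded = pd.read_csv(
        out_path,
        dtype={"BvD ID number": "object", "Orbis ID number": "object"},
    )

print(reloaded)
```

Without the `dtype` argument, an ID such as `001` would be parsed as a number and lose its leading zeros.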


In [21]:
# Count the number of missing values in each index column
for col in index_cols:
    num_missing = merged_financial_data_no_duplicates[col].isnull().sum().compute()
    print(f"Number of missing values in '{col}': {num_missing}")

Number of missing values in 'BvD ID number': 0
Number of missing values in 'Orbis ID number': 2010
Number of missing values in 'Year': 0
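
The per-column loop can also be written as a single expression over all index columns. The sketch below shows the idea with `pandas` on invented data. The same expression followed by one `.compute()` call should work on a Dask DataFrame and would avoid a separate pass per column, but that has not been run against the notebook's data.

```python
import pandas as pd

index_cols = ["BvD ID number", "Orbis ID number", "Year"]

# Invented rows with two missing Orbis IDs
df = pd.DataFrame(
    {
        "BvD ID number": ["AA1", "BB2", "CC3"],
        "Orbis ID number": ["001", None, None],
        "Year": [2019, 2020, 2021],
    }
)

# Count nulls for every index column at once
missing_per_column = df[index_cols].isna().sum().to_dict()

for col, num_missing in missing_per_column.items():
    print(f"Number of missing values in '{col}': {num_missing}")
```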
