This is a script for splitting multiple events per cell in the OCR + API Prof Data collected for DigiKAR. We will use the [pandas package in Python](https://pandas.pydata.org/) to read the data from an EXCEL file and manipulate them. **STEP 1** is to connect Google Colab with the user's Google Drive.


In [None]:
## mount drive
from google.colab import drive
drive.mount("/content/drive")
directory="/content/drive/My Drive/Colab_DigiKAR/"

In **STEP 2**, we install the additional packages needed for working with EXCEL files and DataFrames. The [ast](https://docs.python.org/3/library/ast.html) module, which can be used to convert string cell content in a dataframe back into Python objects such as lists, is part of Python's standard library and therefore does not need to be installed.

In [None]:
## install packages that are not part of Python's standard distribution
## (ast ships with Python and cannot be installed via pip)

!pip install xlsxwriter
!pip install pandas
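
As a minimal sketch of how `ast` can turn cell content back into Python objects, the cell below parses a few example strings with `ast.literal_eval`. The helper name `parse_cell` and the sample values are assumptions made for illustration; they are not part of the DigiKAR data or of the script itself.

```python
import ast

def parse_cell(value):
    """Return the Python literal encoded in a cell string, or the string itself."""
    try:
        return ast.literal_eval(value)
    except (ValueError, SyntaxError):
        # plain text such as "Professor" or "n/a" is not a Python literal
        return value

# hypothetical cell contents
cells = ["['Professor', 'Rektor']", "Professor", "n/a"]
parsed = [parse_cell(c) for c in cells]
print(parsed)
```

Note that a cell containing only digits would be returned as a number rather than a string, which may or may not be wanted for ID columns.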

In **STEP 3**, we import the packages into the script and load our data. Before merging the input files, names will be normalised as some have excess spaces, capitalised surnames, or inverted first and last names.

The combined data will be written to a new dataframe and displayed as a fully searchable table.

In [None]:
import xlsxwriter
import csv
import pandas as pd
from pandas import DataFrame
import os
# from ast import literal_eval # can be used to check if cell content is string but different function was used below 

# path to input files

file_path="https://github.com/ieg-dhr/DigiKAR/blob/main/Sample%20Data/Factoid_PROFS_consolidation_STEP2_events-reconstructed.xlsx?raw=true"

# define dataframe for final output

f_to_add=[]

# structure of input files

# obligatory columns in valid factoid list

column_names = ["factoid_ID",
                "pers_ID",
                "alternative_names",
                "event_type",
                "event_after-date",
                "event_before-date",
                "event_start",
                "event_end",
                "event_date",
                "pers_title",
                "pers_function",
                "place_name",
                "inst_name",
                "rel_pers",
                "source_quotations",
                "additional_info",
                "comment",
                "info_dump",
                "source",
                "source_site"]
                


df = pd.read_excel(file_path, index_col=None, dtype=str) # axis=1, sort=False sheet_name='FactoidList'
df = df.fillna("n/a") # replace empty fields with the string "n/a"
df_length=len(df)

print("There are ", len(df), "items in your DataFrame!")

# delete all duplicate rows with exact matches

df_unique=df.drop_duplicates()
print("Your DataFrame has now ", len(df_unique), "items with at least one unique cell." )

display(df_unique)
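
The name normalisation mentioned in the text for this step (excess spaces, capitalised surnames, inverted first and last names) is not part of the cell above. The following is only a rough sketch of what such a clean-up could look like. The column name `pers_name`, the helper `normalise_name`, and the sample names are hypothetical, and the rules (a comma signals "Surname, First name"; fully upper-case words are converted to title case) are assumptions that will not fit every historical name.

```python
import pandas as pd

def normalise_name(name: str) -> str:
    """Collapse whitespace, undo 'Surname, First name' order, fix ALL-CAPS words."""
    name = " ".join(name.split())  # remove leading/trailing/double spaces
    if "," in name:
        # assumption: a comma means the name is given as "Surname, First name"
        surname, first = [part.strip() for part in name.split(",", 1)]
        name = f"{first} {surname}"
    words = []
    for word in name.split():
        # assumption: fully upper-case words longer than one character are surnames
        words.append(word.title() if word.isupper() and len(word) > 1 else word)
    return " ".join(words)

# hypothetical sample data
names = pd.DataFrame({"pers_name": ["  Johann   MÜLLER ", "Schmidt, Anna", "Georg Forster"]})
names["pers_name"] = names["pers_name"].map(normalise_name)
print(names)
```

Particles such as "von" written in capitals would also be title-cased by this sketch, so the result should be checked by hand.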


In **STEP 4**, we will multiply ("explode") entries in the pers_function column, so that every function listed in a cell gets a row of its own. This solution was kindly suggested by [Lukas Hestermeyer](https://stackoverflow.com/users/5240684/lukas-hestermeyer) on StackOverflow to avoid using an `if` statement within a `for` loop. In general, loops for iterating over dataframes ought to be avoided in [Pandas](https://pandas.pydata.org/). It is faster and more reliable to use [vectorized solutions](https://towardsdatascience.com/you-dont-always-have-to-loop-through-rows-in-pandas-22a970b347ac).




In [None]:
df2 = df_unique
print("The original frame has ", len(df2), "entries.")

# make sure all values are strings
df2 = df2.astype(str)
df2['pers_function'] = df2['pers_function'].str.split(', ')

df_explode=df2.explode('pers_function')

print("The new frame has ", len(df_explode), "entries.")

display(df_explode)
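
To make the effect of the split-and-explode step easier to see, here is a small sketch on invented toy data. The values below are assumptions for illustration only. As a variant of the cell above, it splits on the bare comma and strips whitespace afterwards, so that entries written without a space after the comma are also separated.

```python
import pandas as pd

# invented toy data, not taken from the DigiKAR factoid list
toy = pd.DataFrame({
    "factoid_ID": ["F1", "F2", "F3"],
    "pers_function": ["Professor, Rektor", "Dekan,Syndikus", "n/a"],
})

# turn each cell into a list of functions
toy["pers_function"] = toy["pers_function"].str.split(",")

# one row per list element; all other columns are repeated
exploded = toy.explode("pers_function").reset_index(drop=True)

# remove the spaces left over from ", " separators
exploded["pers_function"] = exploded["pers_function"].str.strip()

print(exploded)
```

Rows with a single function or with the placeholder "n/a" pass through unchanged, so the exploded frame can only be as long as or longer than the original.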

In **STEP 5**, we write all processed data to a new EXCEL file.

In [None]:
# write updated dataframe to a new EXCEL file

workbook=directory+'FACTOIDS_consolidated/Factoid_PROFS_consolidation_STEP3_events-split.xlsx'
print(workbook)
os.makedirs(os.path.dirname(workbook), exist_ok=True) # make sure the output folder exists
# create a Pandas Excel writer using XlsxWriter as the engine;
# the "with" block saves and closes the file (writer.save() was removed in pandas 2.0)
with pd.ExcelWriter(workbook, engine='xlsxwriter') as writer:
    df_explode.to_excel(writer, sheet_name='FactProfSplit') # write the dataframe to the sheet
print("Done.")
 

Check the output files and repeat the process if necessary.

Script by Monika Barget, Maastricht/Mainz

March 2023
