This script splits factoids that record multiple events in the combined OCR and API professor ("Prof") data.


In [1]:
## mount drive
from google.colab import drive
drive.mount("/content/drive")
directory="/content/drive/My Drive/Colab_DigiKAR/"

Mounted at /content/drive


Next, we install the additional packages needed for working with CSV files, Excel files and DataFrames.

In [2]:
## install packages that are not part of Python's standard distribution

!pip install xlsxwriter
!pip install pandas
!pip install numpy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting xlsxwriter
  Downloading XlsxWriter-3.0.9-py3-none-any.whl (152 kB)
Installing collected packages: xlsxwriter
Successfully installed xlsxwriter-3.0.9
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In **step 1**, we import the packages into the script and load our data. Before the input files are merged, names are normalised, as some have excess spaces, capitalised surnames, or inverted first and last names.

The combined data will be written to a new dataframe and displayed.
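
The normalisation itself is not shown in this notebook. As a minimal sketch of what it could look like, the hypothetical helper `normalise_name` below (an assumption, not part of the original script) collapses excess spaces, turns "Surname, First names" around, and converts fully capitalised words to title case:

```python
import re

def normalise_name(name: str) -> str:
    """Hypothetical helper: tidy one personal name string."""
    # collapse runs of whitespace and trim both ends
    name = re.sub(r"\s+", " ", name).strip()
    # "Surname, First names" -> "First names Surname"
    if "," in name:
        surname, first_names = (part.strip() for part in name.split(",", 1))
        name = f"{first_names} {surname}"
    # "WUNDERLICH" -> "Wunderlich"; mixed-case words are left untouched
    words = [w.capitalize() if w.isupper() else w for w in name.split(" ")]
    return " ".join(words)

print(normalise_name("  Johann   Michael WUNDERLICH "))
print(normalise_name("ZEDER, Georg"))
```

Applied to a whole column, this would be something like `df["pers_name"].map(normalise_name)`. Real name data will need more rules than these three, for example for name particles or initials.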

In [6]:
import xlsxwriter
import csv
import pandas as pd
from pandas import DataFrame
import numpy as np
import os
import re

# path to input files

file_path="https://github.com/ieg-dhr/DigiKAR/blob/main/Sample%20Data/Factoid_PROFS_consolidation_STEP2_events-reconstructed.xlsx?raw=true"

# list of DataFrames to be combined for the final output

f_to_add=[]

# structure of input files

# obligatory columns in valid factoid list

column_names = ["factoid_ID",
                "pers_ID",
                "alternative_names",
                "event_type",
                "event_after-date",
                "event_before-date",
                "event_start",
                "event_end",
                "event_date",
                "pers_title",
                "pers_function",
                "place_name",
                "inst_name",
                "rel_pers",
                "source_quotations",
                "additional_info",
                "comment",
                "info_dump",
                "source",
                "source_site"]
                


df = pd.read_excel(file_path, index_col=None, dtype=str) # read all columns as strings (first sheet by default)
df = df.fillna("n/a") # replace empty cells with the string "n/a"
df_length=len(df)

print("There are ", len(df), "items in your DataFrame!")

# delete all duplicate rows with exact matches

df_unique=df.drop_duplicates()
print("Your DataFrame has now ", len(df_unique), "items with at least one unique cell." )

display(df_unique)


There are  29449 items in your DataFrame!
Your DataFrame has now  29449 items with at least one unique cell.


Unnamed: 0.2,Unnamed: 0.1,factoid_ID,pers_ID,pers_name,alternative_names,event_type,event_after-date,event_before-date,event_start,event_end,...,place_name,inst_name,rel_pers,source_quotations,additional_info,comment,info_dump,source,source_site,Unnamed: 0
0,1222,286,API,Wendelin Wendelin Dietes,,Funktionsausübung,,,1616-01-01,1646-12-31,...,Aschaffenburg,Kollegiatstift St. Peter und Alexander Aschaff...,,,\nLektoratspräbende Stift St. Peter und Alexan...,,,ProfAPI,http://gutenberg-biographics.ub.uni-mainz.de/i...,0
1,1223,455,API,Matthäus Anton Chrysostomus Eberwein,,Amtsantritt,,,,,...,Mainz,St. Viktor Mainz,,,\nLektoratspräbende Stift St. Viktor (Mainz)\n,,,ProfAPI,http://gutenberg-biographics.ub.uni-mainz.de/i...,1
2,1225,809,API,Bertold Georg Gothelp,,Amtsantritt,,,,,...,Mainz,St. Peter Mainz,,,,,,ProfAPI,http://gutenberg-biographics.ub.uni-mainz.de/i...,3
3,1228,2617,API,Alexander Günther Samhaber,,Amtsantritt,,,,,...,,Augustinerorden,,,\nAugustinerorden\n,,,ProfAPI,http://gutenberg-biographics.ub.uni-mainz.de/i...,6
4,1233,3462,API,Damian Friedrich Boost,,Amtsantritt,,,1779-02-03,1779-02-03,...,,Concilium Majus,,,\nConcilium majus\n,,,ProfAPI,http://gutenberg-biographics.ub.uni-mainz.de/i...,11
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29444,29319,7092,OCR,Johann Michael Wunderlich,,Zulassung,,,"1737, 1737-11-26","1737, 1737-11-26",...,Mainz,"Universität Mainz, Philosophische Fakultät",,,1737 wurde er in Mainz zur Artistenfakultät zu...,,,"Praetorius, Professoren, S.137; \n RPh 63v, 6...",,8887
29445,29332,4021,OCR,Johann Friedrich Wüstefled,Wüstenfeld,Zulassung,,,"1770, 1770-12-31","1770, 1770-12-31",...,Mainz,,,,er bewarb sich in Mainz 1770 als ao. Professor...,,,"Praetorius, Professoren, S.100; \n RR II 91; ...",,8900
29446,29346,7099,OCR,Georg Zeder,,Zulassung,1765-06-26,,,,...,,Concilium Majus,,,am 26.6.1765 wurde er zum Concilium majus zuge...,,,"Praetorius, Professoren, S.138; \n Cat. Jes. ...",,8914
29447,29356,5057,OCR,Thomas Zenzen,,Zulassung,1788-07-29,,,,...,"Mainz, Trier",Universität Trier,,,Trier am 29.7.1788 bat er in Mainz um Zulassun...,,,"Praetorius, Professoren, S.134; \n Prot. med....",,8924
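
The row count above did not change after `drop_duplicates()`. One possible reason, stated here as an assumption, is that the `Unnamed: 0`-style columns hold row numbers that differ for every row, so no two rows can match in all columns. The sketch below uses made-up toy data to show the difference between comparing all columns and comparing only the content columns:

```python
import pandas as pd

# toy data (made up): two factoids with identical content but different row numbers
toy = pd.DataFrame({
    "Unnamed: 0": ["0", "1", "2"],
    "pers_name": ["Georg Zeder", "Georg Zeder", "Thomas Zenzen"],
    "event_type": ["Zulassung", "Zulassung", "Zulassung"],
})

# comparing all columns finds no duplicates, because the row numbers differ
all_columns = toy.drop_duplicates()

# compare only the content columns (everything that does not start with "Unnamed")
content_columns = [c for c in toy.columns if not c.startswith("Unnamed")]
content_only = toy.drop_duplicates(subset=content_columns)

print(len(all_columns), len(content_only))
```

Whether such a content-only comparison is appropriate depends on whether identical factoids from different input rows should really count as one.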


In **step 2**, we identify cells in the `pers_function` column that hold comma-separated values and split them, so that each value gets a row of its own.




In [15]:
df2 = df_unique
df_size = len(df2)

# find cells with commas in the pers_function column
list_to_append = []
try:
    for x in range(0, df_size):
        print(df_size - x)  # countdown of rows still to be checked
        e_df = df2.iloc[[x]].fillna("n/a")  # one-row DataFrame; "n/a" guards against empty cells

        if "," in e_df["pers_function"].values[0]:
            e_functions = e_df["pers_function"].values[0]
            function_list = e_functions.split(",")
            for function in function_list:
                function = function.strip()  # drop spaces around the value
                print(function)
                # work on a copy: appending the same e_df object repeatedly would
                # leave every appended row with the last function in the list
                new_row = e_df.copy()
                new_row["pers_function"] = function
                print(new_row)
                list_to_append.append(new_row)

        else:
            print("Only one value found.")

except Exception as e:
    print(e)

df_split = pd.concat(list_to_append, axis=0, ignore_index=True, sort=False)
display(df_split)


Streaming output truncated to the last 5000 lines.
1160
Only one value found.
1159
Only one value found.
1158
Only one value found.
1157
Only one value found.
1156
Only one value found.
1155
Only one value found.
1154
Only one value found.
1153
Only one value found.
1152
Only one value found.
1151
Only one value found.
1150
Only one value found.
1149
Only one value found.
1148
Only one value found.
1147
Only one value found.
1146
Only one value found.
1145
Only one value found.
1144
Only one value found.
1143
Only one value found.
1142
Mitglied
      Unnamed: 0.1 factoid_ID pers_ID             pers_name alternative_names  \
28307        23126       4147     OCR  Johann Anton Caprano               n/a   

      event_type event_after-date event_before-date event_start   event_end  \
28307        Tod              n/a               n/a  1818-10-18  1818-10-18   

       ... place_name inst_name rel_pers source_quotations  \
28307  ...      Mainz       n/a      n/a           

Unnamed: 0.2,Unnamed: 0.1,factoid_ID,pers_ID,pers_name,alternative_names,event_type,event_after-date,event_before-date,event_start,event_end,...,place_name,inst_name,rel_pers,source_quotations,additional_info,comment,info_dump,source,source_site,Unnamed: 0
0,1974,932,API,Matthias Joseph Hagen,,Amtsantritt,,,1791-01-01,1814-04-11,...,,,,,,,,ProfAPI,http://gutenberg-biographics.ub.uni-mainz.de/i...,752
1,1974,932,API,Matthias Joseph Hagen,,Amtsantritt,,,1791-01-01,1814-04-11,...,,,,,,,,ProfAPI,http://gutenberg-biographics.ub.uni-mainz.de/i...,752
2,2341,432,API,Franz Anton Chrysostomus Dürr,,Amtsantritt,,,1759-01-01,,...,,Stiftung Mainzer Universitätsfonds\n,,,Stiftung Mainzer Universitätsfonds\n,,,ProfAPI,http://gutenberg-biographics.ub.uni-mainz.de/i...,1119
3,2341,432,API,Franz Anton Chrysostomus Dürr,,Amtsantritt,,,1759-01-01,,...,,Stiftung Mainzer Universitätsfonds\n,,,Stiftung Mainzer Universitätsfonds\n,,,ProfAPI,http://gutenberg-biographics.ub.uni-mainz.de/i...,1119
4,3288,2478,OCR,Simon Bagen,,Amtsantritt,,,"1523, 1524, 1529","1523, 1524, 1529",...,"Köln, Mainz",Universität Köln/Universität Mainz,"V: Peter B., seine Mutter, eine Mainzerin, ka...",,aus der Diözese Köln * 1523 oder 1524; V: Pete...,,,"Praetorius, Professoren, S.97; \n Dürr, Bl.34...",,2066
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6275,25824,482,OCR,Michael Kress,,Zulassung,1744-08-28,,,,...,Mainz,Universität Mainz],,,am 28.8.1744 erfolgte die Zulassung zur Theolo...,,,"Praetorius, Professoren, S.95, 137; \n RR I 9...",,5392
6276,25824,482,OCR,Michael Kress,,Zulassung,1744-08-28,,,,...,Mainz,Universität Mainz],,,am 28.8.1744 erfolgte die Zulassung zur Theolo...,,,"Praetorius, Professoren, S.95, 137; \n RR I 9...",,5392
6277,25831,486,OCR,Johann Martin Ignaz Kreussler,Kreißler,Zulassung,1762-11-03,,,,...,Heidelberg,Universität Heidelberg,,,er wurde am 3.11.1762 als Professor der Logik ...,,,"Praetorius, Professoren, S.138; \n Cat. Jes. ...",,5399
6278,25831,486,OCR,Johann Martin Ignaz Kreussler,Kreißler,Zulassung,1762-11-03,,,,...,Heidelberg,Universität Heidelberg,,,er wurde am 3.11.1762 als Professor der Logik ...,,,"Praetorius, Professoren, S.138; \n Cat. Jes. ...",,5399
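
As an alternative to the row-by-row loop in step 2, the same split can be sketched with pandas' `str.split` and `explode`. This is a sketch on made-up toy data (the names and functions are invented), not the method used in this notebook:

```python
import pandas as pd

# toy data (made up): one row with two functions, one row with a single function
toy = pd.DataFrame({
    "pers_name": ["Person A", "Person B"],
    "pers_function": ["Professor, Dekan", "Rektor"],
})

# keep only rows whose pers_function holds several comma-separated values
multi = toy[toy["pers_function"].str.contains(",")]

# turn each cell into a list, then give every list element its own row
split_rows = (
    multi.assign(pers_function=multi["pers_function"].str.split(","))
    .explode("pers_function", ignore_index=True)
)
split_rows["pers_function"] = split_rows["pers_function"].str.strip()

print(split_rows)
```

Like the loop, this returns only the rows that were split. The vectorised form avoids printing thousands of progress lines and tends to be faster on large frames.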


In **step 3**, we append the split rows to the de-duplicated data and write the combined data to a new Excel file.

In [16]:
# combine the de-duplicated rows and the split rows for further processing
# (the list is rebuilt here so that re-running the cell does not add the same data twice)

f_to_add = [df2, df_split]

df3 = pd.concat(f_to_add, axis=0, ignore_index=True, sort=False)

print(len(df3))

display(df3)

workbook=directory+'FACTOIDS_consolidated/Factoid_PROFS_consolidation_STEP3_events-split.xlsx'
print(workbook)
writer = pd.ExcelWriter(workbook, engine='xlsxwriter') # create a pandas Excel writer using XlsxWriter as the engine
df3.to_excel(writer, sheet_name='FactProfSplit') # write the DataFrame to the sheet "FactProfSplit"
writer.close() # close the writer and save the Excel file (writer.save() was removed in pandas 2.0)
print("Done.")

35729


Unnamed: 0.2,Unnamed: 0.1,factoid_ID,pers_ID,pers_name,alternative_names,event_type,event_after-date,event_before-date,event_start,event_end,...,place_name,inst_name,rel_pers,source_quotations,additional_info,comment,info_dump,source,source_site,Unnamed: 0
0,1222,286,API,Wendelin Wendelin Dietes,,Funktionsausübung,,,1616-01-01,1646-12-31,...,Aschaffenburg,Kollegiatstift St. Peter und Alexander Aschaff...,,,\nLektoratspräbende Stift St. Peter und Alexan...,,,ProfAPI,http://gutenberg-biographics.ub.uni-mainz.de/i...,0
1,1223,455,API,Matthäus Anton Chrysostomus Eberwein,,Amtsantritt,,,,,...,Mainz,St. Viktor Mainz,,,\nLektoratspräbende Stift St. Viktor (Mainz)\n,,,ProfAPI,http://gutenberg-biographics.ub.uni-mainz.de/i...,1
2,1225,809,API,Bertold Georg Gothelp,,Amtsantritt,,,,,...,Mainz,St. Peter Mainz,,,,,,ProfAPI,http://gutenberg-biographics.ub.uni-mainz.de/i...,3
3,1228,2617,API,Alexander Günther Samhaber,,Amtsantritt,,,,,...,,Augustinerorden,,,\nAugustinerorden\n,,,ProfAPI,http://gutenberg-biographics.ub.uni-mainz.de/i...,6
4,1233,3462,API,Damian Friedrich Boost,,Amtsantritt,,,1779-02-03,1779-02-03,...,,Concilium Majus,,,\nConcilium majus\n,,,ProfAPI,http://gutenberg-biographics.ub.uni-mainz.de/i...,11
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35724,25824,482,OCR,Michael Kress,,Zulassung,1744-08-28,,,,...,Mainz,Universität Mainz],,,am 28.8.1744 erfolgte die Zulassung zur Theolo...,,,"Praetorius, Professoren, S.95, 137; \n RR I 9...",,5392
35725,25824,482,OCR,Michael Kress,,Zulassung,1744-08-28,,,,...,Mainz,Universität Mainz],,,am 28.8.1744 erfolgte die Zulassung zur Theolo...,,,"Praetorius, Professoren, S.95, 137; \n RR I 9...",,5392
35726,25831,486,OCR,Johann Martin Ignaz Kreussler,Kreißler,Zulassung,1762-11-03,,,,...,Heidelberg,Universität Heidelberg,,,er wurde am 3.11.1762 als Professor der Logik ...,,,"Praetorius, Professoren, S.138; \n Cat. Jes. ...",,5399
35727,25831,486,OCR,Johann Martin Ignaz Kreussler,Kreißler,Zulassung,1762-11-03,,,,...,Heidelberg,Universität Heidelberg,,,er wurde am 3.11.1762 als Professor der Logik ...,,,"Praetorius, Professoren, S.138; \n Cat. Jes. ...",,5399


/content/drive/My Drive/Colab_DigiKAR/FACTOIDS_consolidated/Factoid_PROFS_consolidation_STEP3_events-split.xlsx
Done.
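
The `Unnamed: 0`, `Unnamed: 0.1` and `Unnamed: 0.2` columns in the tables above look like row indexes carried over from earlier save-and-load rounds; this reading is an assumption. The sketch below shows the effect on made-up toy data, using an in-memory CSV so that it runs without an Excel engine:

```python
import io
import pandas as pd

toy = pd.DataFrame({"pers_name": ["Georg Zeder", "Thomas Zenzen"]})

# default: the index is written as an extra first column without a header
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```

`DataFrame.to_excel` accepts the same `index=False` argument, so passing it in step 3 should keep a further unnamed column out of the next file.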


Check the output files and repeat the process if necessary.

Script by Monika Barget, Maastricht/Mainz

March 2023
