This is a script for **grouping and aggregating events** from AP3 factoid lists in the DigiKAR project.

Grouping, in this case, means that rows which share the same values in one or more columns are combined according to a rule specified in the function. This rule can be a simple list of all existing values, a minimum value, a maximum value, or a sum.
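
As a minimal sketch of these four rules, the toy table below is grouped by person and aggregated with Pandas. The column names (`person`, `year`, `fee`) and all values are invented for illustration and are not part of the factoid lists.

```python
import pandas as pd

# Hypothetical toy data; names and values are illustrative only
events = pd.DataFrame({
    "person": ["A", "A", "B", "B", "B"],
    "year":   [1701, 1705, 1710, 1712, 1711],
    "fee":    [10, 20, 5, 5, 15],
})

# One output row per person; each new column applies one aggregation rule
summary = events.groupby("person").agg(
    years=("year", list),        # simple list of all existing values
    first_year=("year", "min"),  # minimum value
    last_year=("year", "max"),   # maximum value
    total_fee=("fee", "sum"),    # sum
)
print(summary)
```

The five input rows collapse into two rows, one per person, which is the same effect the script later has on similar events.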

This script performs grouping and aggregation with the Pandas package in Python to read and manipulate EXCEL data as DataFrames. DataFrames are 2-dimensional data representations in rows and columns. They can be written to different file formats such as CSV, EXCEL, JSON or RDF.
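
To make the idea of a DataFrame concrete, here is a small sketch that builds a two-column table and serialises it to CSV and JSON text. The person names are placeholders, not project data, and the round trip through `read_csv` is only there to show that the table survives the export.

```python
import io
import json
import pandas as pd

# Hypothetical two-row table; the names are placeholders
table = pd.DataFrame({
    "pers_name": ["Anna Muster", "Karl Beispiel"],
    "event_type": ["Geburt", "Tod"],
})
print(table.shape)  # number of rows and columns

# Serialise to text instead of files so that the example has no side effects
csv_text = table.to_csv(index=False)
json_text = table.to_json(orient="records", force_ascii=False)

# Read the CSV text back in; dtype=str keeps every cell as text
restored = pd.read_csv(io.StringIO(csv_text), dtype=str)
records = json.loads(json_text)
```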

First of all, we need to connect the Colab notebook with Google Drive and define the directory for input and output data.

In [None]:
## mount drive
from google.colab import drive
drive.mount("/content/drive")
directory="/content/drive/My Drive/Colab_DigiKAR/"

Next, we install the additional packages needed for working with CSV, EXCEL and DataFrames.

In [None]:
## install packages that are not part of Python's standard distribution

!pip install xlsxwriter
!pip install pandas
!pip install numpy

In **step 1**, we import the packages into the script and load our data. Before merging the input files, names will be normalised as some have excess spaces, capitalised surnames, or inverted first and last names.

The combined data will be written to a new dataframe and displayed.

In [None]:
import xlsxwriter
import csv
import pandas as pd
from pandas import DataFrame
import numpy as np
import os
import re

# path to input files

factoid_path1="" # ADD NEW PROF DATA

factoid_paths=[factoid_path1] # ADD MORE URLS IF NECESSARY

# define dataframe for final output

f_to_add=[]

# structure of input files

# obligatory columns in valid factoid list

column_names = ["factoid_ID",
                "pers_ID",
                "alternative_names",
                "event_type",
                "event_after-date",
                "event_before-date",
                "event_start",
                "event_end",
                "event_date",
                "pers_title",
                "pers_function",
                "place_name",
                "inst_name",
                "rel_pers",
                "source_quotations",
                "additional_info",
                "comment",
                "info_dump",
                "source",
                "source_site"]

frame_list=[]
for file in factoid_paths:
    df = pd.read_excel(file, index_col=None, dtype=str) # axis=1, sort=False sheet_name='FactoidList'
    df = df.fillna("n/a") # replace empty fields for string
    frame_list.append(df)

f = pd.concat(frame_list, axis=0, ignore_index=True, sort=False)

print("There are", len(f), "items in your DataFrame!")

# delete all duplicate rows with exact matches

f_unique=f.drop_duplicates()
print("Your DataFrame now has", len(f_unique), "rows after removing exact duplicates.")

display(f_unique)
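
The name normalisation mentioned for step 1 is not part of the cell above. The following is a minimal sketch of how it could look, assuming that inverted names are written as "Surname, Firstname" and that capitalised surnames are written entirely in upper case. The helper `normalise_name` and the sample names are hypothetical.

```python
import re
import pandas as pd

def normalise_name(raw):
    """Hypothetical helper: return 'Firstname Surname' with tidy spacing."""
    name = re.sub(r"\s+", " ", str(raw)).strip()  # collapse excess spaces
    if "," in name:  # assumed pattern 'Surname, Firstname'
        surname, first = [part.strip() for part in name.split(",", 1)]
        name = f"{first} {surname}"
    # turn fully upper-case words such as 'MÜLLER' into 'Müller'
    words = [w.capitalize() if w.isupper() and len(w) > 1 else w
             for w in name.split(" ")]
    return " ".join(words)

# Invented sample values showing the three problems named in the text
names = pd.DataFrame({"pers_name": ["  MÜLLER,  Johann ",
                                    "Johann  Müller",
                                    "Johann MÜLLER"]})
names["pers_name"] = names["pers_name"].map(normalise_name)
print(names["pers_name"].unique())
```

This sketch does not handle name particles or hyphenated upper-case surnames, so real data would need additional rules and manual checks.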


In **step 2** of the consolidation, the script tries to identify similar events per person based on event type, place and institution. Those events are "aggregated" (that is: merged) while all source information etc. is preserved. In terms of dates, the minimum and maximum dates given for the presumed identical events are used to create a new dataframe.

The result of this process is that the number of rows in our table decreases, sometimes drastically. Where the automated factoid aggregation based on the four columns used in the duplicate check is too radical, more columns can be included as obligatory matches before the data are merged.

In [None]:
# consolidate events per person

f_new=f_unique

# read person list

pers_name_f=(f_new[['pers_name']])
search_df=pers_name_f.drop_duplicates() # remove duplicates
search_list=search_df['pers_name'].tolist()

# count no. of entries in flattened person list

no_person=len(search_list)
print("There are", no_person, "unique person names in this data set.")

# iterate through unique persons to get their events

frame_list=[]
for name in search_list:
    #print("\n",name, "\n")
    res_df=(f_new.loc[f_new['pers_name'] == name])

# list existing events per person
    bio_events=res_df['event_type'].values.tolist()
    #print(set(bio_events))

# check if duplicate events with same place and institution have different dates and create range

    duplicate = res_df[res_df.duplicated(['event_type', 'pers_function', 'place_name', 'inst_name'], keep=False)] # keep=False also flags the first occurrence of each similar event
    print(duplicate)
    if len(duplicate)>1:
        print("For ", name, "there are ", len(duplicate), "similar events.")

# aggregate similar events
        try:
          df_new = duplicate.groupby(["event_type", "place_name", "inst_name"]).agg( # This line of code merges cells!!
                                        {"event_after-date":'min',
                                        "event_before-date":'max',
                                        "event_start":'min',
                                        "event_end":'max', # latest end date of the merged events
                                        "factoid_ID": list, # ORIGINAL IDS are combined as RECONSTRUCTION MARKER
                                        "pers_ID":list,
                                        "pers_name":list,
                                        "alternative_names":list,
                                        "pers_title":list,
                                        "pers_function":list,
                                        # "inst_name" is a grouping key and is therefore not aggregated again
                                        "rel_pers":list,
                                        "source_quotations":list,
                                        "additional_info":list,
                                        "comment":list,
                                        "info_dump":list,
                                        "source":list,
                                        "source_site":list}).reset_index() # turn the grouping keys back into columns
          frame_list.append(df_new)
        except TypeError:
          print("One of the date fields contains invalid characters / is string!")
    else:
      continue

frame_list.append(f_new)

f_result = pd.concat(frame_list, axis=0, ignore_index=False, sort=False)
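
How strongly the rows are merged depends on which columns must match. The sketch below compares a looser and a stricter set of matching columns on three invented rows for one person. The helper `aggregate` and the sample values are assumptions for illustration, not part of the script.

```python
import pandas as pd

# Invented rows for a single person
person = pd.DataFrame({
    "event_type":    ["Funktionsausübung"] * 3,
    "pers_function": ["Professor", "Professor", "Dekan"],
    "place_name":    ["Mainz"] * 3,
    "inst_name":     ["Universität Mainz"] * 3,
    "event_start":   ["1750", "1755", "1760"],
    "factoid_ID":    ["F1", "F2", "F3"],
})

def aggregate(frame, keys):
    """Hypothetical helper: merge rows that agree on all columns in keys."""
    # keep=False marks every member of a group of similar rows
    similar = frame[frame.duplicated(keys, keep=False)]
    return (similar.groupby(keys)
                   .agg({"event_start": "min", "factoid_ID": list})
                   .reset_index())

loose = aggregate(person, ["event_type", "place_name", "inst_name"])
strict = aggregate(person, ["event_type", "pers_function", "place_name", "inst_name"])
print(loose)
print(strict)
```

With three matching columns all rows collapse into one event. Adding `pers_function` as an obligatory match keeps the "Dekan" row out of the merge, so only the two "Professor" rows are combined.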


In **step 3**, the updated data are written to an EXCEL file as a back-up and for archiving. In **steps 5 & 6**, the data will be manipulated further.

In [None]:
# write amended rows to existing data frame for further processing

f_to_add.append(f_result) # f_result is the aggregated DataFrame from step 2

df3 = pd.concat(f_to_add, axis=0, ignore_index=True, sort=False)

print(len(df3))

display(df3)

workbook=directory+'FACTOIDS_consolidated/Factoid_PROFS_consolidation_STEP2_events-reconstructed_BACKUP.xlsx'
print(workbook)
writer = pd.ExcelWriter(workbook, engine='xlsxwriter') # create a Pandas Excel writer using XlsxWriter as the engine.
df3.to_excel(writer, sheet_name='FactCons1') # Convert the dataframe to an XlsxWriter Excel object.
writer.close() # Close the Pandas Excel writer and output the Excel file (writer.save() was removed in pandas 2.0).
print("Done.")
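
After aggregation, many cells hold Python lists rather than single values, and such cells may not be written as intended by every Excel engine. One possible preparation, sketched here under that assumption, is to join each list into a single string before export. The helper `flatten_cell`, the separator and the sample row are hypothetical.

```python
import pandas as pd

def flatten_cell(value, sep="; "):
    """Hypothetical helper: join list cells into one string, keep other cells."""
    if isinstance(value, list):
        # drop repeated entries while keeping their original order
        unique = list(dict.fromkeys(str(v) for v in value))
        return sep.join(unique)
    return value

# Invented aggregated row with list-valued cells
merged = pd.DataFrame({
    "event_type": ["Studium"],
    "factoid_ID": [["F1", "F2"]],
    "source": [["Matrikel", "Matrikel"]],
})

# Apply the helper to every cell of every column
export_ready = merged.apply(lambda col: col.map(flatten_cell))
print(export_ready)
```

Dropping repeated entries is a design choice that keeps the sheet readable; if the number of attestations matters, the deduplication line can be left out.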

In **step 5**, we create another dictionary to rank events. This time, the events are given a value between 0 and 100 to define at what stages in people's biographies they normally occur.

**We also define which events generally occur before the others!**

In [None]:
# ranking of events if no time is given

event_value_dict={"Sonstiges":0,
                  "Geburt":1,
                  "Taufe":2,
                  "Primäre Bildungsstation":3,
                  "Privatunterricht":3,
                  "Rezeption":10,
                  "Zulassung":10,
                  "Immatrikulation":10,
                  "Studium":10,
                  "Prüfungsverfahren":10,
                  "Graduation":10,
                  "Praktikum":10,
                  "Promotion":10,
                  "Wohnsitznahme": 10,
                  "Reise":20,
                  "Nobilitierung":20,
                  "Aufnahme":20,
                  "Aufschwörung":20,
                  "Eheschließung":20,
                  "Funktionsausübung":20,
                  "erfolglose Bewerbung":20,
                  "Rejektion":20,
                  "Aufenthalt":20,
                  "mittelbare Nobilitierung":20,
                  "Privilegierung":20,
                  "Wappenbesserung":20,
                  "Introduktion":30,
                  "Mitgliedschaft":30,
                  "Gesandtschaft":30,
                  "Präsentation":30,
                  "Vokation":39,
                  "Ernennung":40,
                  "Amtseinführung":41,
                  "Vereidigung":41,
                  "Amtsantritt":42,
                  "Beförderung":44,
                  "Ehrung":45,
                  "Entlassung":50,
                  "Suspendierung":50,
                  "Absetzung":50,
                  "Resignation":50,
                  "Rücktritt":50,
                  "Pensionierung":90,
                  "Pension":91,
                  "Tod":100}

# a dict keeps only the last value for a repeated key, so each event maps to a list of the events that follow it
event_hierarchy_dict={
                  "Geburt":["Taufe", "Tod"],
                  "Primäre Bildungsstation":["Studium"],
                  "Immatrikulation":["Studium"],
                  "Prüfungsverfahren":["Graduation"],
                  "Studium":["Promotion"],
                  "Aufnahme":["Funktionsausübung"],
                  "Aufschwörung":["Funktionsausübung"],
                  "erfolglose Bewerbung":["Funktionsausübung"],
                  "Introduktion":["Mitgliedschaft"],
                  "Vokation":["Funktionsausübung"],
                  "Ernennung":["Funktionsausübung"],
                  "Amtseinführung":["Funktionsausübung"],
                  "Vereidigung":["Funktionsausübung"],
                  "Amtsantritt":["Funktionsausübung"],
                  "Beförderung":["Funktionsausübung"],
                  "Funktionsausübung":["Entlassung",
                                       "Suspendierung",
                                       "Absetzung",
                                       "Resignation",
                                       "Rücktritt",
                                       "Pensionierung",
                                       "Pension",
                                       "Tod"]}
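
The ranking can be illustrated with a short sketch: event types are mapped to their values, and events without a date are then ordered by that value. The dictionary `rank` is a four-entry excerpt of the values above, and the sample biography is invented.

```python
import pandas as pd

# Excerpt of the event values defined above
rank = {"Geburt": 1, "Studium": 10, "Ernennung": 40, "Tod": 100}

# Invented biography; "n/a" marks a missing date as in the input files
life = pd.DataFrame({
    "event_type":  ["Tod", "Studium", "Geburt", "Ernennung"],
    "event_start": ["1790", "n/a", "1720", "n/a"],
})

# Look up the value of each event type
life["event_value"] = life["event_type"].map(rank)

# Order the undated events by their typical position in a biography
undated = life[life["event_start"] == "n/a"].sort_values("event_value")
print(undated)
```

Event types that are missing from the dictionary receive no value from `map`, so spelling variants in `event_type` should be checked before ranking.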

In [None]:
# add event values from dict to data frame

try:
    f_result['event_value'] = f_result['event_type'].map(event_value_dict)
    # sort_values returns a new DataFrame, so the result has to be assigned
    f_result = f_result.sort_values(by =['event_after-date','event_start','event_before-date', 'event_end', 'event_value'])
except KeyError:
    print("No values.")

print("Aggregation complete!")

display(f_result)

# find events with no dates at all and reconstruct before & after dates based on hierarchies
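
The reconstruction named in the comment above is not implemented in this cell. One way it might work is sketched below, assuming a mapping from an earlier event to the events that normally follow it and at most one dated row per event type. The mapping `follows` and the sample biography are hypothetical.

```python
import pandas as pd

# Hypothetical excerpt: earlier event -> events that normally follow it
follows = {"Immatrikulation": ["Studium"], "Studium": ["Promotion"]}

# Invented biography; "n/a" marks a missing date
bio = pd.DataFrame({
    "event_type":        ["Immatrikulation", "Studium", "Promotion"],
    "event_start":       ["1740", "n/a", "1746"],
    "event_after-date":  ["n/a", "n/a", "n/a"],
    "event_before-date": ["n/a", "n/a", "n/a"],
})

# Known start dates, indexed by event type (assumes one dated row per type)
dated = bio[bio["event_start"] != "n/a"].set_index("event_type")["event_start"]

for idx, row in bio[bio["event_start"] == "n/a"].iterrows():
    for earlier, later_events in follows.items():
        # a dated predecessor gives the earliest possible date
        if row["event_type"] in later_events and earlier in dated.index:
            bio.loc[idx, "event_after-date"] = dated[earlier]
        # a dated successor gives the latest possible date
        if earlier == row["event_type"]:
            for later in later_events:
                if later in dated.index:
                    bio.loc[idx, "event_before-date"] = dated[later]

print(bio)
```

Here the undated "Studium" row receives 1740 as its after-date and 1746 as its before-date. If several predecessors or successors are dated, this sketch simply keeps the last one it finds, which a full implementation would need to refine.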


The final step is to write the results to a single output file.

In [None]:
# write all results to new EXCEL file

workbook=directory+'FACTOIDS_consolidated/Factoid_PROFS_consolidation_STEP3_aggregation-hierarchisation_NEW.xlsx'
print(workbook)
writer = pd.ExcelWriter(workbook, engine='xlsxwriter') # create a Pandas Excel writer using XlsxWriter as the engine.
f_result.to_excel(writer, sheet_name='FactCons') # Convert the dataframe to an XlsxWriter Excel object.
writer.close() # Close the Pandas Excel writer and output the Excel file (writer.save() was removed in pandas 2.0).
print("Done.")

Check the output files and repeat the process if necessary.

Script by Monika Barget, Maastricht/Mainz

January 2023
