This is a script for consolidating factoid lists in AP3.

The script mainly uses the pandas package for Python to read and manipulate Excel data as DataFrames. A DataFrame is a two-dimensional data structure organised in rows and columns. DataFrames can be written to different file formats such as CSV, Excel or JSON, or, with additional packages, RDF.
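
As a minimal sketch of this idea, the following builds a small DataFrame and serialises it to CSV and JSON text. The two rows are invented for illustration; only the column names `pers_ID` and `event_type` and the two event labels are taken from the script below.

```python
import pandas as pd

# two made-up rows; the column names follow the factoid lists used below
df = pd.DataFrame({"pers_ID": ["P1", "P2"],
                   "event_type": ["Geburt", "Tod"]})

csv_text = df.to_csv(index=False)         # CSV as a string (no file path given)
json_text = df.to_json(orient="records")  # one JSON object per row

print(df.shape)  # (number of rows, number of columns)
print(csv_text)
print(json_text)
```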

First of all, we need to connect this Colab notebook to your Google Drive and define the directory for input and output data.


In [46]:
## mount drive
from google.colab import drive
drive.mount("/content/drive")
directory="/content/drive/My Drive/Colab_DigiKAR/"

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In the second step, we install the additional packages needed for working with CSV, Excel and DataFrames.

In [47]:
## install packages that are not part of Python's standard distribution

!pip install xlsxwriter
!pip install pandas
!pip install numpy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
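
A package that is still missing only surfaces later as a ModuleNotFoundError at import time. As a hedged sketch, the standard library can report up front which packages are importable in the current runtime; the helper name `missing_packages` is hypothetical and not part of the original script.

```python
import importlib.util

def missing_packages(names):
    """Return the names from `names` that cannot be imported in this runtime."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# package names as used in the install cell above
print(missing_packages(["xlsxwriter", "pandas", "numpy"]))
```

An empty list means that all three packages can be imported.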


Now we can import the packages to the script and load our data.

In [1]:
# Script to sort events per person by event value and date

# written for the DigiKAR geohumanities project in September 2021 by Monika Barget

import xlsxwriter
import csv
import pandas as pd
from pandas import DataFrame
import numpy as np
import os

# path to input files

factoid_path=directory+'FACTOIDS_to_consolidate'

# structure of input files

# obligatory columns in valid factoid list

column_names = ["factoid_ID",
                "pers_ID",
                "pers_name", # required by the person lookup and aggregation below
                "alternative_names",
                "event_type",
                "event_after-date",
                "event_before-date",
                "event_start",
                "event_end",
                "event_date",
                "pers_title",
                "pers_function",
                "place_name",
                "inst_name",
                "rel_pers",
                "source_quotations",
                "additional_info",
                "comment",
                "info_dump",
                "source",
                "source_site"]
                
frame_list=[]
for item in os.listdir(factoid_path):
    file = os.path.join(factoid_path, item)
    print(file)
    df = pd.read_excel(file, sheet_name='FactoidList', index_col=None, dtype=str) # read every cell as a string
    df = df.fillna("n/a") # replace empty fields with the string "n/a"
    frame_list.append(df)

f = pd.concat(frame_list, axis=0, ignore_index=False, sort=False)

print("There are", len(f), "items in your DataFrame!")

# delete all duplicate rows with exact matches

f_unique=f.drop_duplicates()
print("Your DataFrame now has", len(f_unique), "rows that differ in at least one cell.")
f=f_unique # use the de-duplicated data in all following steps

# ranking of events if no time is given:
# each key is an event type, each value lists the event types recorded as following it
# (a dict cannot hold the same key twice, so the followers are collected in lists)

event_hierarchy_dict={
                  "Geburt":["Taufe", "Tod"],
                  "Primäre Bildungsstation":[], # relation to "Privatunterricht"?
                  # How do "Rezeption" and "Zulassung" relate to each other or to other events?
                  "Immatrikulation":["Studium"],
                  "Prüfungsverfahren":["Graduation"],
                  "Studium":["Promotion"],
                  "Aufnahme":["Funktionsausübung"],
                  "Aufschwörung":["Funktionsausübung"],
                  "erfolglose Bewerbung":["Funktionsausübung"],
                  "Aufenthalt":[], # relation to "Reise"?
                  "Introduktion":["Mitgliedschaft"], # only possible in this combination?
                  "Präsentation":[], # relation to other events?
                  "Vokation":["Funktionsausübung"],
                  "Ernennung":["Funktionsausübung"],
                  "Amtseinführung":["Funktionsausübung"],
                  "Vereidigung":["Funktionsausübung"],
                  "Amtsantritt":["Funktionsausübung"],
                  "Beförderung":["Funktionsausübung"],
                  "Funktionsausübung":["Entlassung",
                                       "Suspendierung",
                                       "Absetzung",
                                       "Resignation",
                                       "Rücktritt",
                                       "Pensionierung",
                                       "Pension",
                                       "Tod"]}

event_value_dict={"Sonstiges":0, 
                  "Geburt":1, 
                  "Taufe":2, 
                  "Primäre Bildungsstation":3, 
                  "Privatunterricht":3,
                  "Rezeption":10, # not sure whether this relates to studies?
                  "Zulassung":10, # before studies, or e.g. also admission to an exam?
                  "Immatrikulation":10,
                  "Studium":10,
                  "Prüfungsverfahren":10,
                  "Graduation":10,
                  "Praktikum":10,
                  "Promotion":10,
                  "Wohnsitznahme": 10,
                  "Reise":20, # events with code 20 can occur several times in mid-life
                  "Nobilitierung":20,
                  "Aufnahme":20,
                  "Aufschwörung":20,
                  "Eheschließung":20,
                  "Funktionsausübung":20,
                  "erfolglose Bewerbung":20,
                  "Rejektion":20,
                  "Aufenthalt":20,
                  "mittelbare Nobilitierung":20,
                  "Privilegierung":20,
                  "Wappenbesserung":20,
                  "Introduktion":30, # relating to what?
                  "Mitgliedschaft":30,
                  "Gesandtschaft":30, # presumably not for very young persons?
                  "Präsentation":30, # not sure what this is...
                  "Vokation":39, # appointment (call) to a university?
                  "Ernennung":40,
                  "Amtseinführung":41,
                  "Vereidigung":41,
                  "Amtsantritt":42,
                  "Beförderung":44, # how often are persons promoted on average?
                  "Ehrung":45, # presumably for persons from mid-life onwards?
                  "Entlassung":50,
                  "Suspendierung":50,
                  "Absetzung":50,
                  "Resignation":50,
                  "Rücktritt":50,
                  "Pensionierung":90,
                  "Pension":91,
                  "Tod":100}

# read person list

pers_name_f=(f[['pers_name']]) 
search_df=pers_name_f.drop_duplicates() # remove duplicates
search_list=search_df['pers_name'].tolist()

# count no. of entries in flattened person list

no_person=len(search_list)
print("There are", no_person, "unique person names in this data set.")

# iterate through unique persons to get their events

frame_list=[]
for name in search_list:
    res_df=(f.loc[f['pers_name'] == name])

    # check tricky events: event types that occur more than once for this person
    bio_events=res_df['event_type'].values.tolist()
    print(bio_events)
    bio_check=[]
    prime_suspects=[]
    for b in bio_events:
        if str(b) not in bio_check:
            bio_check.append(str(b))
        else:
            prime_suspects.append(str(b))

    print(name, prime_suspects)

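    # aggregate similar events; as_index=False keeps "event_type" and "place_name"
    # as regular columns, so that "event_type" can be mapped to event values below

    df_new = res_df.groupby(["event_type", "place_name"], as_index=False).agg(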
                                  {"event_after-date":'min',
                                  "event_before-date":'max',
                                  "event_start":'min',
                                  "event_end":'min',
                                  "factoid_ID":list,
                                  "pers_ID":list,
                                  "pers_name":list,
                                  "alternative_names":list,
                                  "pers_title":list,
                                  "pers_function":list,
                                  "inst_name":list,
                                  "rel_pers":list,
                                  "source_quotations":list,
                                  "additional_info":list,
                                  "comment":list,
                                  "info_dump":list,
                                  "source":list,
                                  "source_site":list} 
                                  )
    frame_list.append(df_new)

f_result = pd.concat(frame_list, axis=0, ignore_index=False, sort=False)

# add event values from dict to data frame

f_result['event_value'] = f_result['event_type'].map(event_value_dict)
f_result = f_result.sort_values(by=['event_after-date','event_start','event_before-date', 'event_end', 'event_value']) # sort_values returns a new DataFrame, so keep the result

print("Aggregation complete!")
display(f_result) 

ModuleNotFoundError: ignored

KeyError: ignored
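
One plausible source of the KeyError reported above is that `groupby` moves the grouping columns into the index by default, so `event_type` is no longer available as a column when `map` is called. The sketch below, on invented rows, keeps the grouping columns with `as_index=False`, maps a shortened `event_value_dict`, and assigns the result of `sort_values`, which returns a new DataFrame rather than sorting in place. The sample data and the name `agg_df` are assumptions for illustration only.

```python
import pandas as pd

# invented rows for one person; the column names follow the factoid lists
rows = pd.DataFrame({
    "event_type": ["Tod", "Geburt", "Geburt"],
    "place_name": ["Mainz", "Erfurt", "Erfurt"],
    "event_start": ["1750", "1700", "1701"],
    "source": ["A", "B", "C"],
})

event_value_dict = {"Geburt": 1, "Tod": 100}  # shortened copy of the dict above

# as_index=False keeps "event_type" and "place_name" as ordinary columns
agg_df = rows.groupby(["event_type", "place_name"], as_index=False).agg(
    {"event_start": "min", "source": list})

agg_df["event_value"] = agg_df["event_type"].map(event_value_dict)
agg_df = agg_df.sort_values(by=["event_start", "event_value"])  # keep the sorted copy

print(agg_df)
```

Note that the dates are compared as strings here, as in the script, because the input is read with `dtype=str`.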

The final step is to write the results to a single output file.

In [51]:
# write all results to new EXCEL file

workbook=directory+'FACTOIDS_consolidated/Factoid_consolidated.xlsx'
print(workbook)
with pd.ExcelWriter(workbook, engine='xlsxwriter') as writer: # create a pandas Excel writer using XlsxWriter as the engine
    f_result.to_excel(writer, sheet_name='FactCons') # write the DataFrame to the sheet; the file is saved when the with-block ends
print("Done.")

/content/drive/My Drive/Colab_DigiKAR/FACTOIDS_consolidated/Factoid_consolidated.xlsx
Done.


Check the output files and repeat the process if necessary.
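
One such check can be sketched in code: event types that are missing from `event_value_dict` receive no `event_value` (pandas fills them with NaN), so it may be worth listing them before rerunning. The helper name `unmapped_event_types` and the sample rows are hypothetical.

```python
import pandas as pd

def unmapped_event_types(df, value_dict):
    """Return the event types in df["event_type"] that value_dict does not cover."""
    types = df["event_type"].drop_duplicates()
    return sorted(t for t in types if t not in value_dict)

# invented sample; "Taufee" stands in for a typo in an input file
sample = pd.DataFrame({"event_type": ["Geburt", "Taufee", "Tod", "Taufee"]})
print(unmapped_event_types(sample, {"Geburt": 1, "Taufe": 2, "Tod": 100}))
```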

Script by Monika Barget, Maastricht/Mainz

January 2023
