This is a script for consolidating factoid lists based on the ontology mapping of all entities found in AP3 data.

The script mainly uses the pandas package to read and manipulate Excel data as DataFrames. A DataFrame is a two-dimensional, tabular data structure made up of rows and columns. DataFrames can be written to file formats such as CSV, Excel or JSON and, with additional libraries, RDF.
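
As a small illustration of the point above, the following sketch builds a DataFrame by hand and serializes it to CSV and JSON in memory. The column `pers_name` and all values are invented for the example and are not taken from the AP3 data.

```python
import io
import pandas as pd

# Hypothetical two-row factoid table (values invented for illustration)
demo = pd.DataFrame({
    "pers_name": ["A", "B"],
    "event_name": ["birth", "Lehrtätigkeit"],
})

# Serialize to CSV text; index=False leaves out the row labels
csv_text = demo.to_csv(index=False)

# Serialize to JSON as a list of records (one object per row)
json_text = demo.to_json(orient="records", force_ascii=False)

# Reading the CSV text back gives a table with the same rows and columns
restored = pd.read_csv(io.StringIO(csv_text), dtype=str)
print(restored)
```

Writing to a file instead of a string only requires passing a path as the first argument, for example `demo.to_csv("demo.csv", index=False)`.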

First of all, we need to connect this Colab notebook with your Google Drive and define the directory for input and output data.


In [None]:
## mount drive
from google.colab import drive
drive.mount("/content/drive")
directory="/content/drive/My Drive/Colab_DigiKAR/"

Mounted at /content/drive


In the second step, we install the additional packages needed for working with CSV files, Excel files and DataFrames.

In [None]:
## install packages that are not part of Python's standard distribution

!pip install xlsxwriter
!pip install pandas
!pip install numpy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting xlsxwriter
  Downloading XlsxWriter-3.0.7-py3-none-any.whl (152 kB)
Installing collected packages: xlsxwriter
Successfully installed xlsxwriter-3.0.7
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Now we can import the packages into the script and load the ontology files.

In [None]:
## import relevant packages

import xlsxwriter
import csv
import pandas as pd
from pandas import DataFrame
import numpy as np
import os

# define files containing ontological mapping

event_ontology='https://raw.githubusercontent.com/ieg-dhr/DigiKAR/main/OntologyFiles/event_ontology.csv' 
#title_ontology='https://github.com/ieg-dhr/DigiKAR/blob/main/OntologyFiles/title_ontology.csv?raw=true' 
#function_ontology='https://github.com/ieg-dhr/DigiKAR/blob/main/OntologyFiles/function_ontology.csv?raw=true'
place_ontology='https://raw.githubusercontent.com/ieg-dhr/DigiKAR/main/OntologyFiles/place_ontology.csv' 

# open ontology files

# READ EVENTS
data_e = pd.read_csv(event_ontology, sep=",")
events_old=data_e['event_old'].values.tolist()
events_new=data_e['event_type'].values.tolist()
func_new=data_e['pers_function'].values.tolist()
    
# READ TITLES
#data_t = pd.read_csv(title_ontology, sep=",")
#title_old=data_t['title_old'].values.tolist()
#title_new=data_t['pers_title'].values.tolist()

# READ FUNCTIONS
#data_f = pd.read_csv(function_ontology, sep=",")
#function_old=data_f['function_old'].values.tolist()
#function_new=data_f['pers_function'].values.tolist()

# READ PLACES
data_p = pd.read_csv(place_ontology, sep=",")
places_old=data_p['place_old'].values.tolist()
places_new=data_p['place_new'].values.tolist()
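
The lists read above are parallel: position i of `events_old` belongs to position i of `events_new`, because both come from the same rows of the ontology table. A minimal sketch of how such a table can be turned into a lookup dictionary and applied to a column follows. The tiny `ontology` table and the labels in it are made-up stand-ins for `event_ontology.csv`; only the column names `event_old` and `event_type` come from the notebook.

```python
import pandas as pd

# Made-up stand-in for the event ontology file
ontology = pd.DataFrame({
    "event_old": ["Geburt", "Tod"],
    "event_type": ["birth", "death"],
})

# Pair each old label with its new label
lookup = dict(zip(ontology["event_old"], ontology["event_type"]))

# Series.replace with a dict swaps exact matches and leaves other values alone
events = pd.Series(["Geburt", "Studium", "Tod"])
mapped = events.replace(lookup)
print(mapped.tolist())
```

Labels that have no entry in the dictionary ("Studium" here) pass through unchanged, which makes it easy to spot gaps in the ontology afterwards.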
    


The next step is to read all files in the input directory into one DataFrame and to apply the ontology mapping.
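
Combining several tables into one comes down to `pd.concat`. The sketch below uses two hand-made frames in place of the Excel files (their contents are invented) and shows what the `ignore_index` argument does to the row labels of the combined table.

```python
import pandas as pd

# Two invented tables standing in for two factoid workbooks
part_a = pd.DataFrame({"event_name": ["birth", None]})
part_b = pd.DataFrame({"event_name": ["Lehrtätigkeit"]})

# Fill empty cells with a placeholder string, as the notebook does
frames = [df.fillna("@") for df in (part_a, part_b)]

# ignore_index=False keeps each file's own row labels, so labels can repeat
keep = pd.concat(frames, axis=0, ignore_index=False)

# ignore_index=True renumbers the rows from 0 across all files
fresh = pd.concat(frames, axis=0, ignore_index=True)

print(keep.index.tolist())
print(fresh.index.tolist())
```

Unique row labels are usually easier to work with when rows are later selected or updated by label.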

In [None]:
# function to process data

def extract_information(filenames):
        
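    # read all Excel files in the directory into one DataFrame

    frame_list=[]
    for item in sorted(os.listdir(filenames)):
        # skip anything that is not a workbook, including Excel lock files ("~$...")
        if not item.lower().endswith('.xlsx') or item.startswith('~$'):
            continue
        file = os.path.join(filenames, item)
        df = pd.read_excel(file, sheet_name='FactoidList', index_col=None, dtype=str)
        df = df.fillna("@") # replace empty fields with a placeholder string
        frame_list.append(df)

    # ignore_index=True renumbers the rows so that labels are unique across files
    f = pd.concat(frame_list, axis=0, ignore_index=True, sort=False)
    print(f['event_name'])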

    # replace words in EVENT column & check if corresponding function needs to be updated

    for e_old in events_old:
        try:
            # look up the ontology row for this old event label
            row = data_e.loc[data_e['event_old'] == e_old]
            e_new = row['event_type'].values[0]
            f_rel = row['pers_function'].values[0] # function implied by the event, if any

            # rows of the factoid list that carry the old event label
            hits = (f['event_name'] == e_old).to_numpy()
            f.loc[hits, 'event_name'] = e_new

            # check if event results in a specific function and add it if necessary
            # (assumption: the function belongs in a 'pers_function' column of the factoid list)
            if pd.notna(f_rel):
                f.loc[hits, 'pers_function'] = f_rel
            else:
                print("No function found.")

        except (KeyError, IndexError):
            # missing column or no ontology row for this label
            print("No mapping.")
            continue


    # write all results to new Excel file

    workbook=directory+'FACTOIDS_mapped/Profs_mapped.xlsx'
    os.makedirs(os.path.dirname(workbook), exist_ok=True) # make sure the output folder exists
    # the context manager saves and closes the file; writer.save() was removed in pandas 2.0
    with pd.ExcelWriter(workbook, engine='xlsxwriter') as writer:
        f.to_excel(writer, sheet_name='Mapped2') # write the DataFrame to the sheet "Mapped2"
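
An alternative to looping over every ontology entry is a single left join between the factoid table and the ontology table. This is only a sketch of that technique, not the method used in this notebook. The sample rows are invented, and treating `pers_function` as a column of the result is an assumption; `event_old`, `event_type` and `event_name` are the column names used above.

```python
import pandas as pd

# Invented ontology rows; the second event implies no function
ontology = pd.DataFrame({
    "event_old": ["Professur", "Geburt"],
    "event_type": ["Lehrtätigkeit", "birth"],
    "pers_function": ["Professor", None],
})

# Invented factoid rows; "Reise" has no ontology entry
factoids = pd.DataFrame({"event_name": ["Geburt", "Professur", "Reise"]})

# Left join: every factoid row is kept, matched ontology columns are attached
merged = factoids.merge(
    ontology, how="left", left_on="event_name", right_on="event_old"
)

# Use the new label where one exists, otherwise keep the original label
merged["event_name"] = merged["event_type"].fillna(merged["event_name"])

result = merged[["event_name", "pers_function"]]
print(result)
```

The join only preserves the number of rows if every `event_old` value occurs once in the ontology file, so calling `drop_duplicates("event_old")` on the ontology first may be advisable.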

In [None]:
'''            
# find "hidden" places rows and add values to PLACE column
            
    for p in places_new:
        print(p)
        try:
            p_add=f[f["place_new"].map(lambda place_new: p in place_new) & f["inst_name"].map(lambda inst_name: p in inst_name)]
            print(p_add)
            
# Still raises ValueError: Columns must be same length as key
# Code will be fixed ASAP

            f['place_name'] =(f['place_name'].map(str) + "/" + p_add)
            
        except KeyError:
            print("Key Error")
            continue
'''
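
The `ValueError` noted in the cell above most likely arises because `p_add` is a selection of rows, that is, a whole DataFrame, so the right-hand side of the assignment has several columns while `f['place_name']` is a single column. One possible way around this, sketched under assumptions, is to build a Boolean mask and append the place name only to the matching rows. The sample data is invented, and the sketch checks only `inst_name`, not `place_new` as the original draft does.

```python
import pandas as pd

# Invented rows: two institutions "hide" a place name, "@" marks an empty field
factoids = pd.DataFrame({
    "inst_name": ["Universität Mainz", "Hofgericht", "Universität Erfurt"],
    "place_name": ["@", "Mainz", "@"],
})
places = ["Mainz", "Erfurt"]

for p in places:
    # rows whose institution name mentions the place ...
    mask = factoids["inst_name"].str.contains(p, regex=False)
    # ... and whose place column does not already contain it
    mask &= ~factoids["place_name"].str.contains(p, regex=False)
    # append the place to the existing value, using "/" as in the draft above
    factoids.loc[mask, "place_name"] = factoids.loc[mask, "place_name"] + "/" + p

print(factoids["place_name"].tolist())
```

Because both sides of the assignment are restricted by the same mask, they always have the same length.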


In [None]:
# iterate through all XLSX files in directory

def main():
    filenames = directory+"FACTOIDS_to_map"
    extract_information(filenames)
    print("Done.") 

if __name__ == "__main__":
    main() 
    
# ADDITIONAL OPTIONS:

'''            
# replace words in TITLE column
            
    for t_old in title_old:
        print(t_old)
        try:
            t_new=data_t.loc[data_t['title_old'] == t_old, 'pers_title'].values[0]
            print(t_new)
            f['title_old'] = f['title_old'].replace(t_old, t_new)                
            
        except KeyError:
            print("Key Error")
            continue
            
# replace words in FUNCTION column
            
    for f_old in function_old:
        print(f_old)
        try:
            f_new=data_f.loc[data_f['function_old'] == f_old, 'pers_function'].values[0]
            print(f_new)
            f['event_name'] = f['event_name'].replace(f_old, f_new)
            
        except KeyError:
            print("Key Error")
            continue
            
        print(f)
'''

print("All data mapped!")


0                          Lehrtätigkeit
1                   Berufliche Tätigkeit
2                   Berufliche Tätigkeit
3                   Akademische Laufbahn
4                                  birth
                      ...               
3632                      Lehrtätigkeit 
3633               Erhalt einer Präbende
3634                Akademische Laufbahn
3635    Übernahme eines politischen Amts
3636    Übernahme eines politischen Amts
Name: event_name, Length: 3637, dtype: object
<class 'str'>
No mapping.
...

Check the output files and repeat the process with refined ontology files if necessary.
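
One way to decide whether the ontology files need refining is to count the event labels that are still not part of the target vocabulary after mapping. The sketch below assumes the mapped table has an `event_name` column, as above; the sample values are invented, and the trailing space in one label mimics the kind of stray whitespace that prevents an exact match.

```python
import pandas as pd

# Target vocabulary (would come from the 'event_type' column of the ontology)
events_new = ["birth", "Lehrtätigkeit"]

# Invented mapped output: "Reise" was never mapped, "@" marks an empty field
mapped = pd.DataFrame({
    "event_name": ["birth", "Reise", "Reise", "Lehrtätigkeit ", "@"],
})

# Strip surrounding whitespace so "Lehrtätigkeit " counts as mapped
cleaned = mapped["event_name"].str.strip()

# Keep labels that are neither in the vocabulary nor the empty placeholder
unmapped = cleaned[~cleaned.isin(events_new) & (cleaned != "@")]

# Frequency table of labels that still need an ontology entry
report = unmapped.value_counts()
print(report)
```

Labels that appear in `report` are candidates for new rows in the event ontology file.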

Script by Monika Barget, Maastricht/Mainz

January 2023
