<a href="https://colab.research.google.com/github/ieg-dhr/DigiKAR/blob/main/JupyterNotebooks_DigiKAR/Factoids_Step3a_Geocoding%26IDs_Profs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This is a script for consolidating factoid lists in AP3.

The script mainly uses the pandas package in Python to read and manipulate Excel data as DataFrames. DataFrames are two-dimensional data structures organised in rows and columns. They can be written to different file formats such as CSV, Excel or JSON and, with additional libraries, to RDF.
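
As a minimal illustration of the point about file formats, the sketch below builds a one-row DataFrame and serialises it to CSV and JSON text. The two column names are taken from the factoid lists; the variable names `csv_text` and `json_text` are only illustrative.

```python
import pandas as pd

# a one-row stand-in for a factoid list
df = pd.DataFrame({"pers_name": ["Wolfgang Speth"], "event_type": ["Geburt"]})

# without a file path, to_csv() and to_json() return the serialised text
csv_text = df.to_csv(index=False)
json_text = df.to_json(orient="records", force_ascii=False)

print(csv_text)
print(json_text)
```

Passing a file path as the first argument writes to disk instead of returning a string.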

First of all, we need to connect this Colab notebook with your Google Drive and define the directory for input and output data.


In [1]:
## mount drive
from google.colab import drive
drive.mount("/content/drive")
directory="/content/drive/My Drive/Colab_DigiKAR/"

Mounted at /content/drive


In the second step, we have to install additional packages needed for working with CSV, Excel and DataFrames.

In [2]:
## install packages that are not part of Python's standard distribution

!pip install xlsxwriter
!pip install pandas
!pip install numpy

Collecting xlsxwriter
  Downloading XlsxWriter-3.1.2-py3-none-any.whl (153 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 153.0/153.0 kB 11.2 MB/s eta 0:00:00
Installing collected packages: xlsxwriter
Successfully installed xlsxwriter-3.1.2


In **step 1**, we import the packages into the script and load our data. Before merging the input files, names will be normalised, as some have excess spaces, capitalised surnames, or inverted first and last names.

The combined data will be written to a new dataframe and displayed.
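
The cell below does not show the normalisation itself, so here is a minimal sketch of what such a step could look like. The function name `normalise_name` and its rules are assumptions made for illustration, not the project's actual implementation: it collapses whitespace, reorders "Last, First" and lower-cases fully capitalised words.

```python
import re

def normalise_name(name: str) -> str:
    """Return a name as 'First Last' with single spaces (hypothetical helper)."""
    # collapse runs of whitespace and trim both ends
    name = re.sub(r"\s+", " ", name).strip()
    # turn an inverted "Last, First" into "First Last"
    if "," in name:
        last, first = (part.strip() for part in name.split(",", 1))
        name = f"{first} {last}".strip()
    # rewrite fully capitalised words such as "SPETH" as "Speth"
    words = [w.capitalize() if w.isupper() and len(w) > 1 else w
             for w in name.split(" ")]
    return " ".join(words)

print(normalise_name("  SPETH,   Wolfgang "))
print(normalise_name("Jodocus  von Gelnhausen"))
```

Applied to a DataFrame, this could be called as `df["pers_name"].map(normalise_name)`. Note that the capitalisation rule is crude: a fully capitalised particle such as "VON" would become "Von", so real data may need a list of exceptions.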

In [4]:
import xlsxwriter
import csv
import pandas as pd
from pandas import DataFrame
import numpy as np
import os
import re

# path to input files

factoid_paths=["https://github.com/ieg-dhr/DigiKAR/raw/main/Sample%20Data/Factoid-PROFS-consolidation-STEP3-events-split-v10.xlsx"]

# define dataframe for final output

f_to_add=[]

# structure of input files

# obligatory columns in valid factoid list

column_names = ["factoid_ID",
                "pers_ID",
                "alternative_names",
                "event_type",
                "event_after-date",
                "event_before-date",
                "event_start",
                "event_end",
                "event_date",
                "pers_title",
                "pers_function",
                "place_name",
                "inst_name",
                "rel_pers",
                "source_quotations",
                "additional_info",
                "comment",
                "info_dump",
                "source",
                "source_site"]

frame_list=[]
for file in factoid_paths:
    df = pd.read_excel(file, index_col=None, dtype=str) # read every cell as a string; optional: sheet_name='FactoidList'
    df = df.fillna("n/a") # replace empty fields with the string "n/a"
    frame_list.append(df)

f_unique = pd.concat(frame_list, axis=0, ignore_index=True, sort=False)

print("There are ", len(f_unique), "items in your DataFrame!")

# display the combined DataFrame (exact duplicate rows are not removed in this cell)

display(f_unique)


  warn("Workbook contains no default style, apply openpyxl's default")


There are  12825 items in your DataFrame!


Unnamed: 0,factoid_ID,pers_ID,pers_name,alternative_names,event_type,event_after-date,event_before-date,event_start,event_end,event_date,...,pers_function,place_name,inst_name,rel_pers,source_quotations,additional_info,comment,info_dump,source,source_site
0,1139,OCR,Helfericus de Bobenhausen,,Funktionsausübung,1510,,,,,...,Lehrer,Mainz],,,,als Theologie-Dozent genannt,,,"Knodt 2, S.40) \n",
1,reconstruction,OCR,Helfericus de Bobenhausen,,Mitgliedschaft,1510,,,,,...,Ordensmann,,Franziskaner ###,,,"ex Ord. Minorum (ca. 1510), sub eodem tempore ...",,,"Knodt 2, S.40) \n",
2,1358,OCR,Jodocus von Gelnhausen,,Funktionsausübung,,1482,,,,...,Vikar,Mainz,Kollegiatstift St. Peter Mainz,,,er war Vikar zu St. Peter,,,"Knodt II, 42) \n",
3,1359,OCR,Jodocus von Gelnhausen,,Tod,1482,,,,,...,Toter,,Sterbehaus ###,,,1482,,,"Knodt II, 42) \n",
4,1357,OCR,Jodocus von Gelnhausen,,Zulassung,,,,,1482-05-22,...,Lehrer,Mainz,"Universität Mainz, Theologische Fakultät",,,wurde am 22.5.1482 als Dozent der Theologie zu...,,,"Knodt II, 42) \n",
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12820,2107,OCR,Wolfgang Speth,,Funktionsausübung,,,1652,1655,,...,Lehrer,Mainz,Universität Mainz,,,"Ende 1652 ging er nach Mainz, wo er bis 1655 S...",,,"De Back.-Som. VII, Sp.1436; \n Wegele, S.352;...",
12821,2106,OCR,Wolfgang Speth,,Funktionsausübung,1650,,,,,...,Lehrer,Würzburg,Universität Würzburg,,,"1648-1650 lehrte er Philosophie in Bamberg, da...",,,"De Back.-Som. VII, Sp.1436; \n Wegele, S.352;...",
12822,2101,OCR,Wolfgang Speth,,Geburt,,,,,1604-07-25,...,Kind,Bamberg,Geburtshaus ###,,,Bamberg * 25.7.1604,,,"De Back.-Som. VII, Sp.1436; \n Wegele, S.352;...",
12823,2105,OCR,Wolfgang Speth,,Promotion,,,1648-09-02,1648-09-02,,...,Promovend,Bamberg,"Universität Bamberg, Theologische Fakultät",,,am 2.9.1648 wurde er in Bamberg zum Dr. theol....,,,"De Back.-Som. VII, Sp.1436; \n Wegele, S.352;...",
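
The list `column_names` is defined in the cell above but not used there. A minimal sketch of how it could be used to validate an input file follows; the helper `missing_columns` is hypothetical, and the two short lists are stand-ins for `df.columns` and `column_names`.

```python
def missing_columns(actual, required):
    """Return the required column names that are absent from `actual`, in order."""
    present = set(actual)
    return [name for name in required if name not in present]

# shortened stand-ins for df.columns and column_names
actual_columns = ["factoid_ID", "pers_ID", "pers_name", "event_type", "place_name"]
required_columns = ["factoid_ID", "pers_ID", "event_type", "place_name", "source"]

missing = missing_columns(actual_columns, required_columns)
if missing:
    print("Missing obligatory columns:", missing)
```

Inside the loading loop, the equivalent call would be `missing_columns(df.columns, column_names)`, which lets an incomplete file be reported before it is concatenated with the others.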


In **step 2**, we add the person IDs from the person ontology list and the coordinates from GeoNames and Google.




In [26]:
# Merge input dataframe with dfs containing person IDs and geocoding
# documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html

## Read person IDs from Github
## columns: pers_ID_MB, pers_name, alternative_names, Unnamed: 4, name_new_fs, pers_ID_FS,poi, pers_name_corr, freq
infile1="https://github.com/ieg-dhr/DigiKAR/raw/main/OntologyFiles/Factoid_PersonNames_merged.xlsx" # has to contain pers_name column!
person_df = pd.read_excel(infile1)

## Read geocoding from Github
infile2="https://github.com/ieg-dhr/DigiKAR/raw/main/OntologyFiles/Ortsontologie_Geocoded_gepr%C3%BCft.xlsx" # has to contain place_name column!
geo_df = pd.read_excel(infile2)

## merge dataframes horizontally

# left joins keep every factoid row; duplicate keys in a lookup table add extra rows
merged_df1 = pd.merge(f_unique, geo_df, on="place_name", how="left").fillna("n/a")
merged_df2 = pd.merge(merged_df1, person_df, on=["pers_name", "alternative_names"], how="left").fillna("n/a")

display(merged_df2)

Unnamed: 0,factoid_ID,pers_ID,pers_name,alternative_names,event_type,event_after-date,event_before-date,event_start,event_end,event_date,...,Unnamed: 21,Unnamed: 22,NaN,pers_ID_MB,Unnamed: 4,name_new_fs,pers_ID_FS,poi,pers_name_corr,freq
0,1139,OCR,Helfericus de Bobenhausen,,Funktionsausübung,1510,,,,,...,,,0.000173,,,,,,,
1,1139,OCR,Helfericus de Bobenhausen,,Funktionsausübung,1510,,,,,...,,,0.000173,,,,,,,
2,reconstruction,OCR,Helfericus de Bobenhausen,,Mitgliedschaft,1510,,,,,...,,,,,,,,,,
3,1358,OCR,Jodocus von Gelnhausen,,Funktionsausübung,,1482,,,,...,,,0.000173,,,,,,,
4,1359,OCR,Jodocus von Gelnhausen,,Tod,1482,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14254,2101,OCR,Wolfgang Speth,,Geburt,,,,,1604-07-25,...,,,0.000002,,,,,,,
14255,2101,OCR,Wolfgang Speth,,Geburt,,,,,1604-07-25,...,,,0.000002,,,,,,,
14256,2105,OCR,Wolfgang Speth,,Promotion,,,1648-09-02,1648-09-02,,...,,,0.000002,,,,,,,
14257,2105,OCR,Wolfgang Speth,,Promotion,,,1648-09-02,1648-09-02,,...,,,0.000002,,,,,,,
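
The merged table has more rows than the input (the index now runs to 14257, and several factoids appear twice). In a left join this happens when a key occurs more than once in the lookup table. The sketch below reproduces the effect on toy data and shows one possible safeguard; the frames `factoids` and `places` and the `place_key` values are invented placeholders, not the project's data.

```python
import pandas as pd

# toy stand-ins for f_unique and geo_df; the place_key values are placeholders
factoids = pd.DataFrame({"factoid_ID": ["1", "2", "3"],
                         "place_name": ["Mainz", "Bamberg", "Mainz"]})
places = pd.DataFrame({"place_name": ["Mainz", "Mainz", "Bamberg"],
                       "place_key": ["P1", "P1", "P2"]})

# the duplicated "Mainz" key multiplies every matching factoid row
naive = pd.merge(factoids, places, on="place_name", how="left")
print(len(factoids), "->", len(naive))

# keep one lookup row per key, then let pandas verify the many-to-one relationship
places_unique = places.drop_duplicates(subset="place_name")
checked = pd.merge(factoids, places_unique, on="place_name", how="left",
                   validate="many_to_one")
print(len(factoids), "->", len(checked))
```

`drop_duplicates(subset=...)` keeps the first row per key, so whether that is the right row to keep remains a content decision. With `validate="many_to_one"`, pandas raises a `MergeError` if the lookup keys are still not unique.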


**Step 3** adds event values from the event value dictionary, which is stored as a Python dict literal on GitHub.




In [27]:
## load external dictionary with EVENT VALUES
# following method 2 on https://www.geeksforgeeks.org/how-to-read-dictionary-from-file-in-python/

# importing the module
import requests
import ast

master = "https://raw.githubusercontent.com/ieg-dhr/DigiKAR/main/Data%20Categorisation/Event_value_dict.txt" # add Sven's new mapping
req = requests.get(master)
req = req.text
print(req)

# reconstructing the data as a dictionary
event_value_dict = ast.literal_eval(req)
print(type(event_value_dict))

# add event values from dict to data frame

try:
    test = event_value_dict["Aufschwörung"] # spot check with a single key
    print("Value for chosen key: ", test)
except (KeyError, TypeError): # the key has no entry, or the parsed object is not a dict
    print("Invalid dict structure!")

merged_df2['event_value'] = merged_df2['event_type'].map(event_value_dict) # optional: na_action='ignore'

display(merged_df2)

{
    "Amtsantritt": "K",
    "Aufenthalt": "M",
    "Beisetzung": "Z",
    "Bewerbung": "I",
    "Eheschließung": "P",
    "Entlassung": "R",
    "Erfolglose Bewerbung": "I",
    "Flucht": "M",
    "Funktionsausübung": "Q",
    "Geburt": "A",
    "Graduation": "G",
    "Haft": "M",
    "Immatrikulation": "C",
    "Konflikt": "M",
    "Konversion": "M",
    "Mitgliedschaft": "L",
    "Nicht-Ausübung": "N",
    "Nobilitierung": "S",
    "Ordination": "J",
    "Primäre Bildungsstation": "B",
    "Promotion": "H",
    "Prüfung": "F",
    "Resignation": "T",
    "Rezeption": "K",
    "Sonstiges": "M",
    "Studium": "D",
    "Taufe": "A",
    "Tod": "Y",
    "Verleihung eines Ehrentitels": "S",
    "Wahl": "K",
    "Weihe": "J",
    "Wohnsitznahme": "O",
    "Zulassung": "E"
   }
   

<class 'dict'>
Invalid dict structure!


Unnamed: 0,factoid_ID,pers_ID,pers_name,alternative_names,event_type,event_after-date,event_before-date,event_start,event_end,event_date,...,Unnamed: 22,NaN,pers_ID_MB,Unnamed: 4,name_new_fs,pers_ID_FS,poi,pers_name_corr,freq,event_value
0,1139,OCR,Helfericus de Bobenhausen,,Funktionsausübung,1510,,,,,...,,0.000173,,,,,,,,Q
1,1139,OCR,Helfericus de Bobenhausen,,Funktionsausübung,1510,,,,,...,,0.000173,,,,,,,,Q
2,reconstruction,OCR,Helfericus de Bobenhausen,,Mitgliedschaft,1510,,,,,...,,,,,,,,,,L
3,1358,OCR,Jodocus von Gelnhausen,,Funktionsausübung,,1482,,,,...,,0.000173,,,,,,,,Q
4,1359,OCR,Jodocus von Gelnhausen,,Tod,1482,,,,,...,,,,,,,,,,Y
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14254,2101,OCR,Wolfgang Speth,,Geburt,,,,,1604-07-25,...,,0.000002,,,,,,,,A
14255,2101,OCR,Wolfgang Speth,,Geburt,,,,,1604-07-25,...,,0.000002,,,,,,,,A
14256,2105,OCR,Wolfgang Speth,,Promotion,,,1648-09-02,1648-09-02,,...,,0.000002,,,,,,,,H
14257,2105,OCR,Wolfgang Speth,,Promotion,,,1648-09-02,1648-09-02,,...,,0.000002,,,,,,,,H
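
In the output above, the spot check prints "Invalid dict structure!" although the parsed object is a dict, because the tested key "Aufschwörung" has no entry in the mapping. A minimal sketch of a check that separates the two cases is given below; `parse_event_dict` is a hypothetical helper, and `raw_text` is a shortened excerpt of the mapping.

```python
import ast

# shortened excerpt of the mapping text printed above
raw_text = '{"Geburt": "A", "Promotion": "H", "Tod": "Y"}'

def parse_event_dict(text):
    """Parse a dict literal and check that all keys and values are strings."""
    parsed = ast.literal_eval(text)
    if not isinstance(parsed, dict):
        raise ValueError("expected a dict literal")
    if not all(isinstance(k, str) and isinstance(v, str) for k, v in parsed.items()):
        raise ValueError("expected string keys and string values")
    return parsed

event_values = parse_event_dict(raw_text)

# event types as they might occur in the event_type column
event_types = ["Geburt", "Tod", "Aufschwörung", "Geburt"]
unmapped = sorted(set(event_types) - set(event_values))
print("Event types without a value:", unmapped)
```

In the notebook, the same idea could be applied with `set(merged_df2["event_type"]) - set(event_value_dict)`. Event types without an entry receive `NaN` in the `event_value` column when `.map()` is used.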


**Step 4** is to write the results to a single output file.

In [28]:
# write all results to new EXCEL file

workbook=directory+'FACTOIDS_consolidated/Factoid_PROFS_v10_geocoded-with-IDs.xlsx'
print(workbook)
# Create a pandas Excel writer using XlsxWriter as the engine. The with-block saves and
# closes the file on exit, replacing writer.save(), which is deprecated in recent pandas versions.
with pd.ExcelWriter(workbook, engine='xlsxwriter') as writer:
    merged_df2.to_excel(writer, sheet_name='FactCons') # write the DataFrame to the sheet 'FactCons'
print("Done.")

/content/drive/My Drive/Colab_DigiKAR/FACTOIDS_consolidated/Factoid_PROFS_v10_geocoded-with-IDs.xlsx


Done.
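
The output path points into a `FACTOIDS_consolidated` subfolder of the working directory, and writing the workbook would likely fail if that folder does not exist yet. A minimal sketch for creating it beforehand follows; `ensure_output_dir` is a hypothetical helper, and a temporary directory stands in for Google Drive.

```python
import tempfile
from pathlib import Path

def ensure_output_dir(base_dir, subfolder="FACTOIDS_consolidated"):
    """Create base_dir/subfolder if needed and return it as a Path."""
    target = Path(base_dir) / subfolder
    target.mkdir(parents=True, exist_ok=True)
    return target

# demonstration in a temporary directory instead of Google Drive
with tempfile.TemporaryDirectory() as tmp:
    out_dir = ensure_output_dir(tmp)
    workbook_path = out_dir / "Factoid_PROFS_v10_geocoded-with-IDs.xlsx"
    created = out_dir.is_dir()
    print(workbook_path.name, created)
```

In the notebook, `ensure_output_dir(directory)` could be called once before the writer is created.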


Check the output files and repeat the process if necessary.

Script by Monika Barget, Maastricht/Mainz

June 2023
