<a href="https://colab.research.google.com/github/ieg-dhr/DigiKAR/blob/main/JupyterNotebooks_DigiKAR/Factoids_Step2c_VerticalConsolidation_Staatskalender.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This is a script for consolidating factoid lists in AP3.

The script mainly uses the pandas package in Python to read and manipulate Excel data as DataFrames. DataFrames are two-dimensional data representations organised in rows and columns. They can be written to different file formats such as CSV, Excel or JSON (and, with additional libraries, RDF).
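
As a small illustration of these ideas, the sketch below builds a toy DataFrame and serialises it to CSV and JSON in memory. The column names follow the factoid model used later in this notebook; the values are invented for illustration only.

```python
import io
import pandas as pd

# toy data; the values are invented for illustration
toy = pd.DataFrame({
    "pers_name": ["Anna Beispiel", "Bert Muster"],
    "event_start": ["1756", "1757"],
})

print(toy.shape)  # (number of rows, number of columns)

# write to CSV in memory instead of to a file on disk
csv_buffer = io.StringIO()
toy.to_csv(csv_buffer, index=False)
csv_text = csv_buffer.getvalue()

# JSON output with one object per row
json_text = toy.to_json(orient="records", force_ascii=False)

print(csv_text)
print(json_text)
```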

First of all, we need to connect this Colab notebook with your Google Drive and define the directory for input and output data.


In [None]:
## mount drive
from google.colab import drive
drive.mount("/content/drive")
directory="/content/drive/My Drive/Colab_DigiKAR/"

In the second step, we have to install additional packages needed for working with CSV, Excel and DataFrames.

In [None]:
## install packages that are not part of Python's standard distribution

!pip install xlsxwriter
!pip install pandas
!pip install numpy

In **step 1**, we import the packages into the script and load our data. Before merging the input files, names will be normalised, as some have excess spaces, capitalised surnames, or inverted first and last names.

The combined data will be written to a new dataframe and displayed.
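
The name normalisation described above could look roughly like the following sketch. It is not the project's actual routine: `normalise_name` is a hypothetical helper, the sample names are invented, and it assumes that inverted names are written as "Surname, Forename" with a single comma.

```python
import re
import pandas as pd

def normalise_name(name):
    """Hypothetical helper: tidy one personal name string."""
    # collapse runs of whitespace and trim both ends
    name = re.sub(r"\s+", " ", str(name)).strip()
    # turn "Surname, Forename" into "Forename Surname"
    if name.count(",") == 1:
        surname, forename = [part.strip() for part in name.split(",")]
        name = f"{forename} {surname}"
    # convert fully capitalised words such as "MÜLLER" to "Müller"
    words = [w.title() if w.isupper() and len(w) > 1 else w for w in name.split(" ")]
    return " ".join(words)

# invented sample names covering the three problems named above
names = pd.Series(["  Johann   MÜLLER ", "Schmidt, Anna", "Peter von Berg"])
cleaned = names.map(normalise_name)
print(cleaned.tolist())
```

In the notebook, a helper of this kind would be applied to the `pers_name` column of each input file before the files are concatenated. Placeholder values such as "n/a" pass through unchanged.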

In [None]:
import xlsxwriter
import csv
import pandas as pd
from pandas import DataFrame
import numpy as np
import os
import re

# path to input files

factoid_paths=["https://github.com/ieg-dhr/DigiKAR/raw/main/Sample%20Data/FactoidList_1756er_Staatskalender_Meta_final_TEST-MB_FS0.xlsx",
               "https://github.com/ieg-dhr/DigiKAR/raw/main/Sample%20Data/FactoidList_1756er_Staatskalender_Meta_final_TEST-MB_FS1.xlsx",
               "https://github.com/ieg-dhr/DigiKAR/raw/main/Sample%20Data/FactoidList_1756er_Staatskalender_Meta_final_TEST-MB_FS2.xlsx",
               "https://github.com/ieg-dhr/DigiKAR/raw/main/Sample%20Data/FactoidList_1756er_Staatskalender_Meta_final_TEST-MB_FS3.xlsx",
               "https://github.com/ieg-dhr/DigiKAR/raw/main/Sample%20Data/FactoidList_1756er_Staatskalender_Meta_final_TEST-MB_FS4.xlsx"
               ]

# define dataframe for final output

f_to_add=[]

# structure of input files

# obligatory columns in valid factoid list

# read all data frames from path

frame_list=[]
for file in factoid_paths:
    df = pd.read_excel(file, index_col=None, dtype=str) # axis=1, sort=False sheet_name='FactoidList'
    df = df.fillna("n/a") # replace empty fields for string
    df_length=len(df)
    frame_list.append(df)

f = pd.concat(frame_list, axis=0, ignore_index=True, sort=False)

print("There are ", len(f), "items in your DataFrame!")

# delete all duplicate rows with exact matches

f_unique=f.drop_duplicates()
print("Your DataFrame has now ", len(f_unique), "items with at least one unique cell." )

# add columns missing according to factoid model

column_names = ["factoid_ID",
                "pers_ID",
                "pers_name",
                "alternative_names",
                "event_type",
                "event_after-date",
                "event_before-date",
                "event_start",
                "event_end",
                "event_date",
                "pers_title",
                "pers_function",
                "place_name",
                "inst_name",
                "rel_pers",
                "source_quotations",
                "additional_info",
                "comment",
                "info_dump",
                "source_combined",
                "event_value", # add more potential categorisations if needed
                "source",
                "source_site"]

df2 = f_unique.reindex(columns=column_names)
df2.fillna('n/a', inplace=True)

# populate some of the empty columns with data

df2.loc[:, "event_end"] = df2["event_start"]
df2.loc[:, "event_type"] = "Funktionsausübung" # same event type for every row, independent of the number of rows
df2['source_combined'] = df2['source'].astype(str) + ': ' + df2['source_site'].astype(str)

print("Done.")

# show the prepared dataframe used in the next step

display(df2)


In [None]:
# Merge dataframe with dfs containing person IDs and geocoding

## Read person IDs from Github
#infile1=directory+"Mainz2_Geonames.xlsx"
#person_df = pd.read_excel(infile1)

## Read geocoding from Github
#infile2=directory+"Addresses_Geocoded_withGoogle.xlsx"
#geo_df = pd.read_excel(infile2)

## Merge input dataframe horizontally

from functools import reduce

# define list of DataFrames
#dfs = [df2, person_df, geo_df] # dataframe list can be extended if necessary

# merge all DataFrames into one
#final_df = reduce(lambda left, right: pd.merge(left, right, on=['column_name'], how='outer'), dfs) # replace 'column_name' with the shared key column

#display(final_df)

## Write new table to excel file

#outfile=directory+"AP3_final-df.xlsx"
#final_df.to_excel(outfile)
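
The merge step in the cell above is commented out. As a minimal sketch of how `functools.reduce` and `pd.merge` could combine several DataFrames on a shared key, consider the toy example below. The frames, their values and the choice of `pers_name` as key column are assumptions for illustration; a left join is used here so that every factoid row is kept, whereas an outer join would also keep unmatched rows from the other tables.

```python
from functools import reduce
import pandas as pd

# toy frames; using "pers_name" as the shared key is an assumption
facts = pd.DataFrame({
    "pers_name": ["Anna Schmidt", "Johann Müller"],
    "pers_function": ["Rat", "Sekretär"],
})
person_ids = pd.DataFrame({"pers_name": ["Anna Schmidt"], "pers_ID": ["P001"]})
places = pd.DataFrame({"pers_name": ["Johann Müller"], "place_name": ["Mainz"]})

dfs = [facts, person_ids, places]

# merge pairwise from left to right, keeping all rows of the left frame
merged = reduce(
    lambda left, right: pd.merge(left, right, on="pers_name", how="left"),
    dfs,
)
print(merged)
```

Rows without a match in one of the lookup tables receive a missing value in the corresponding column.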

In **step 2**, we reconstruct end dates for successive start dates. The data are automatically aggregated using the pandas `groupby` method: rows that agree in all grouping columns are collapsed into one row. If the results are too narrow or too broad, please change the aggregation rules below!
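
To make the effect of the aggregation concrete, here is a toy example with invented values and only two grouping columns. Note that the dates are stored as strings, so `min` and `max` compare them alphabetically; this matches chronological order only if all dates share one format (for example four-digit years).

```python
import pandas as pd

# invented rows: the first two describe the same person in the same function
toy = pd.DataFrame({
    "pers_name": ["Anna Schmidt", "Anna Schmidt", "Johann Müller"],
    "pers_function": ["Rat", "Rat", "Sekretär"],
    "event_start": ["1756", "1758", "1757"],
    "event_end": ["1756", "1758", "1757"],
    "factoid_ID": ["F1", "F2", "F3"],
})

# earliest start, latest end, and a list of all contributing factoid IDs
grouped = toy.groupby(["pers_name", "pers_function"], as_index=False).agg(
    {"event_start": "min", "event_end": "max", "factoid_ID": list}
)
print(grouped)
```

Adding a column to the grouping list makes the groups narrower (fewer rows are merged); removing one makes them broader.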


In [None]:
# Group the dataframe and aggregate the start and end dates
# code updated after problem with merged columns
# see discussion on Stackoverflow: https://stackoverflow.com/questions/76558443/column-remains-empty-when-using-map-with-dictionary-in-pandas-dataframe/76558586#76558586

grouped_df = df2.groupby(['pers_name', 'event_type', "pers_function", "pers_title", "inst_name", "place_name"], as_index=False).agg(
                                                         {'event_start': 'min',
                                                          "event_after-date":'min',
                                                          "event_before-date":'max',
                                                          "event_end":'max',
                                                          "factoid_ID":list,
                                                          "alternative_names":list,
                                                          "pers_ID":list,
                                                          "rel_pers":list,
                                                          "source_quotations":list,
                                                          "additional_info":list,
                                                          "comment":list,
                                                          "info_dump":list,
                                                          "source_combined":list,
                                                          "event_value":list
                                                          })

display(grouped_df)

In **step 3**, we can flatten the information and only preserve unique information per cell.

In [None]:
# flatten data in dataframe cells

def flatten_list(cell):
    if isinstance(cell, list):
        unique_values = dict.fromkeys(cell) # drop duplicates but keep the original order (a set does not preserve it)
        return ', '.join(str(value) for value in unique_values)
    else:
        return str(cell)

# flatten all cells containing lists
df3 = grouped_df.applymap(flatten_list)

# show the flattened DataFrame
display(df3)

In **step 4**, we enrich the data, e.g. by adding event values from an external Python dictionary stored in Github.
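
The enrichment relies on `Series.map` with a dictionary: each cell is looked up as a key, and cells without a matching key become missing values. The sketch below uses an invented two-entry dictionary (the real one is loaded from GitHub in the next cell) to show how unmatched entries could be listed before they are replaced with a placeholder.

```python
import pandas as pd

# hypothetical mapping with invented values
toy_dict = {"Funktionsausübung": "function", "Geburt": "birth"}

toy = pd.DataFrame({
    "event_type": ["Funktionsausübung", "Geburt", "Funktionsausübung, Geburt"]
})

# look up every event type; keys missing from the dictionary yield NaN
toy["event_value"] = toy["event_type"].map(toy_dict)

# collect the event types that could not be mapped
unmatched = toy.loc[toy["event_value"].isna(), "event_type"].unique().tolist()

# replace the gaps with the placeholder used elsewhere in this notebook
toy["event_value"] = toy["event_value"].fillna("n/a")

print(unmatched)
print(toy)
```

A cell that contains several flattened values, as in the third row, does not match any single key, so mapping works reliably only on columns with one value per cell.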

In [None]:
## load external dictionary with EVENT VALUES
# following method 2 on https://www.geeksforgeeks.org/how-to-read-dictionary-from-file-in-python/

# importing the module
import requests
import ast

master = "https://raw.githubusercontent.com/ieg-dhr/DigiKAR/main/Data%20Categorisation/Event_value_dict.txt" # add Sven's new mapping
req = requests.get(master)
req = req.text
print(req)

# reconstructing the data as a dictionary
event_value_dict = ast.literal_eval(req)
print(type(event_value_dict))

# add event values from dict to data frame

try:
    test = event_value_dict["Aufschwörung"] # random test if valid dict
    print("Value for chosen key: ", test)
except (KeyError, TypeError): # key missing, or the loaded object is not a dict
    print("Invalid dict structure!")

df3['event_value'] = df3['event_type'].map(event_value_dict) # optional: na_action='ignore'

display(df3)

In [None]:
## load external dictionary with EVENT CATEGORIES (e.g. I: agent-oriented)
# following method 2 on https://www.geeksforgeeks.org/how-to-read-dictionary-from-file-in-python/

# importing the module
import requests
import ast

master = "https://raw.githubusercontent.com/ieg-dhr/DigiKAR/main/Data%20Categorisation/####.txt" # add file name
req = requests.get(master)
req = req.text
print(req)

# reconstructing the data as a dictionary
event_category_dict = ast.literal_eval(req)
print(type(event_category_dict))

# add event categories from dict to data frame

try:
    test = event_category_dict["Geburt"] # random test if valid dict
    print("Value for chosen key: ", test)
except (KeyError, TypeError): # key missing, or the loaded object is not a dict
    print("Invalid dict structure!")

df3['event_category'] = df3['event_type'].map(event_category_dict) # optional: na_action='ignore'

display(df3)

In [None]:
## load external dictionary with FUNCTION CATEGORIES (e.g. teaching versus administration)
# following method 2 on https://www.geeksforgeeks.org/how-to-read-dictionary-from-file-in-python/

# importing the module
import requests
import ast

master = "https://raw.githubusercontent.com/ieg-dhr/DigiKAR/main/Data%20Categorisation/####.txt" # add file name
req = requests.get(master)
req = req.text
print(req)

# reconstructing the data as a dictionary
function_category_dict = ast.literal_eval(req)
print(type(function_category_dict))

# add function categories from dict to data frame

try:
    test = function_category_dict["Professor"] # random test if valid dict
    print("Value for chosen key: ", test)
except (KeyError, TypeError): # key missing, or the loaded object is not a dict
    print("Invalid dict structure!")

df3['function_category'] = df3['pers_function'].map(function_category_dict) # optional: na_action='ignore'

display(df3)

In [None]:
# save enriched df to DRIVE

workbook=directory+'FACTOIDS_consolidated/Factoid_Staatskalender_ALL_consolidation_with-event-values.xlsx'
print(workbook)
writer = pd.ExcelWriter(workbook, engine='xlsxwriter') # create a Pandas Excel writer using XlsxWriter as the engine.
df3.to_excel(writer, sheet_name='FactCons1') # Convert the dataframe to an XlsxWriter Excel object.
writer.close() # Close the Pandas Excel writer and output the Excel file (writer.save() was removed in pandas 2.0).
print("Done.")

Check the output files and repeat process if necessary.
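
One possible way to support this check is to count placeholder or empty cells per column. The sketch below is only a suggestion: `report_gaps` is a hypothetical helper and the toy data are invented.

```python
import pandas as pd

def report_gaps(df, columns):
    """Hypothetical helper: count empty or placeholder cells per column."""
    counts = {}
    for col in columns:
        values = df[col].astype(str).str.strip()
        counts[col] = int(values.isin(["", "n/a", "nan"]).sum())
    return counts

# invented example with one unmapped event value
toy = pd.DataFrame({
    "pers_name": ["Anna Schmidt", "Johann Müller"],
    "event_value": ["n/a", "function"],
})

gaps = report_gaps(toy, ["pers_name", "event_value"])
print(gaps)
```

To apply it to the real output, the saved workbook could first be read back with `pd.read_excel(workbook, dtype=str)`.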

Script by Monika Barget, Maastricht/Mainz

June 2023
