# Data Manipulation


The data set used here can be found at ../data or at http://mfm.uni-leipzig.de/dt/Forschung/CDV2018.php.

**Example data point (shortened):**

"b0339": {
    
    "defaultSurName": "Beethoven",
    "defaultPreName": "Ludwig van",
    "dates": ["1770 Dez 16", "1827 Mar 27", ... ],
    "gender": ["männlich"],
    "konfession": ["römisch-katholisch"],
    "musicalJobs": ["Komponist", ...],
    "relation": {
      "Lehrer": [{"target": "a0884", "displayName": "Albrechtsberger, Johann Georg"}, ...],
      "Schüler": [{"target": "n0412", "displayName": "Neukäufler, Ferdinand"}, ...], ...},
    "places": [
      "Aschaffenburg",
      "Bad Mergentheim", ...
    ],
    "mainPlace": "Wien",
    "links": [ [
        "Biographien/Karrieren",
        {
          "Akten der Reichskanzlei der Weimarer Republik": [
            {"displayName": "Akten der Reichskanzlei der Weimarer Republik (1)", "target": ["http://www.bundesarchiv.de/aktenreichskanzlei/1919-1933/0000/adr/getPPN/118508288/"]} ], ...
    ],
    "mapPlaces": [ ["Wien", "48.208174", "16.373819", "9."], ...] 
}



## 1. Reading

The first step is to read in the data set. It consists of 26 JSON files, which separate the persons according to the first letter of their last name.

In [2]:
'''
Loading the data
'''

import json
import os

path = "data/Personen/"

data = {}

for file in os.listdir(path):      # the path contains 26 json files
    
    with open(path + file, encoding='utf-8') as f:
        x = json.load(f)
        
    for key in x:              # we don't need the 'links' attribute for our analysis
        x[key].pop('links', None) 
        
    data.update(x)
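
The loading step can be tried out without the real data set. The following is a minimal sketch that writes two toy files to a temporary directory and merges them in the same way. The file names, the person ids and the helper `load_persons` are made up for illustration and are not part of the original notebook.

```python
import json
import os
import tempfile


def load_persons(path):
    """Merge all JSON files in `path` into one dict, dropping 'links'."""
    merged = {}
    for name in sorted(os.listdir(path)):
        if not name.endswith('.json'):       # skip anything that is not a json file
            continue
        with open(os.path.join(path, name), encoding='utf-8') as f:
            persons = json.load(f)
        for key in persons:
            persons[key].pop('links', None)  # not needed for the analysis
        merged.update(persons)
    return merged


# toy stand-ins for two of the 26 files (hypothetical ids and content)
toy_files = {
    'a.json': {'a0001': {'defaultSurName': 'Abel', 'links': []}},
    'b.json': {'b0001': {'defaultSurName': 'Bach', 'links': []}},
}

with tempfile.TemporaryDirectory() as tmp:
    for name, content in toy_files.items():
        with open(os.path.join(tmp, name), 'w', encoding='utf-8') as f:
            json.dump(content, f)
    toy_data = load_persons(tmp)

print(sorted(toy_data))
```

Because the person ids are unique across files, `update` simply accumulates all persons in one dictionary keyed by id.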


## 2. Conversion

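Once the metadata dictionary is extracted, we convert each record into a flat, more readable set of attributes. In addition to the attributes taken directly from the data, we derive an 'era' attribute, which refers to the eras of Western classical music. The era is determined by the year of death; if that year falls into no era (for example because it is unknown), the year of birth plus 30 is used instead.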

In [14]:
'''
Attribute extraction
'''

import re
     
# western music eras 
music_era = {'Medieval': range(500, 1400), 'Renaissance': range(1400, 1600), 'Baroque': range(1600, 1750), 
             'Classical': range(1750, 1820), 'Romantic': range(1820, 1910), 
             'Modern': range(1910, 1975), 'Contemporary': range(1975, 2018)}


def get_node_attributes(person):
    attr = {}
    info = data[person]
    
    # name
    prename = info["defaultPreName"]
    surname = info["defaultSurName"]
    if prename and surname:
        attr['name'] = ' '.join([prename, surname])
    elif prename:
        attr['name'] = prename
    elif surname:
        attr['name'] = surname
        
    # gender
    attr['gender'] = info["gender"][0] if info["gender"] else 'unknown'
    
    # religion
    attr['religion'] = info['konfession'][0] if info['konfession'] else 'unknown'
    
    # music jobs
    clean_music_jobs = [job for job in info['musicalJobs'] if job]   # filters out nulls
    attr['musicJobs'] = ','.join(clean_music_jobs) if clean_music_jobs else 'none'

    # other jobs
    clean_other_jobs = [job for job in info['otherJobs'] if job]   # filters out nulls
    attr['otherJobs'] = ','.join(clean_other_jobs) if clean_other_jobs else 'none'
        
    # workplaces - Wirkungsorte
    attr['workPlace'] = ','.join(info['places']) if info['places'] else 'unknown'
    
    # main workplace - Hauptwirkungsort
    attr['mainPlace'] = info['mainPlace'] if info['mainPlace'] else 'unknown'
    
    # year of birth 
    attr["birthYear"] = int(re.search('[0-9]{3,4}', info["dates"][0]).group(0)) if info["dates"][0] else 0

    # year of death
    try:                            
        attr["deathYear"] = int(re.search('[0-9]{3,4}', info["dates"][2]).group(0)) if info["dates"][2] else 3000
    except AttributeError:
        attr["deathYear"] = 3000                 # BAD DATA POINT; birthPlace given instead of deathYear
        attr["birthPlace"] = info["dates"][2] if info["dates"][2] else 'unknown'
    
    # place of birth; fall back to a value recovered from a bad data point above
    attr["birthPlace"] = info["dates"][4] or attr.get("birthPlace", 'unknown')
    
    # place of death
    attr["deathPlace"] = info["dates"][5] if info["dates"][5] else 'unknown'    
    
    # era
    for era in music_era:
        if attr['deathYear'] in music_era[era]:
            attr['era'] = era
            
    if 'era' not in attr:
        for era in music_era:
            if (attr['birthYear'] + 30) in music_era[era]:
                attr['era'] = era
                
    if 'era' not in attr:
        attr['era'] = 'unknown'
            
    return attr


'''
Cleaning the data
'''

clean_data = {person: get_node_attributes(person) for person in data}
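
The year extraction and the era assignment can be checked in isolation. The sketch below repeats the era boundaries from the cell above and wraps the two steps in the helpers `extract_year` and `assign_era`; both names are hypothetical and only serve this illustration. The sentinel values 0 (unknown birth year) and 3000 (unknown death year) are the ones used in `get_node_attributes`.

```python
import re

# same era boundaries as in the cell above
music_era = {'Medieval': range(500, 1400), 'Renaissance': range(1400, 1600),
             'Baroque': range(1600, 1750), 'Classical': range(1750, 1820),
             'Romantic': range(1820, 1910), 'Modern': range(1910, 1975),
             'Contemporary': range(1975, 2018)}


def extract_year(text):
    """Return the first 3- or 4-digit number in `text`, or None if there is none."""
    match = re.search('[0-9]{3,4}', text or '')
    return int(match.group(0)) if match else None


def assign_era(birth_year, death_year):
    """Era of the death year; otherwise era of birth year + 30; otherwise 'unknown'."""
    for candidate in (death_year, birth_year + 30):
        for era, years in music_era.items():
            if candidate in years:
                return era
    return 'unknown'


birth = extract_year("1770 Dez 16")
death = extract_year("1827 Mar 27")
print(birth, death, assign_era(birth, death))
print(assign_era(1950, 3000))   # death year unknown, so 1950 + 30 decides
print(assign_era(0, 3000))      # neither year known
```

Note that the rule assigns a person to the era of their death year, so the example data point from the beginning ends up in 'Romantic' rather than 'Classical'.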




## 3. Graph creation

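We create a directed graph from the data with the Python module networkx. Each person becomes a node carrying the cleaned attributes, and each relation becomes an edge labelled with the relation type. The module makes it possible to export the graph to GEXF, a format that Gephi can read.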

In [None]:
'''
Creating the graph
'''

import networkx as nx

g = nx.DiGraph()

for person in data: 
    
    attribs = clean_data[person]
    g.add_node(person, **attribs)
    
    rels = data[person]['relation']     # all relations that person has
    
    for rel in rels:                    # add the relation to the graph as an edge between two persons
        for other_person in rels[rel]:
            if other_person['target'] in data:
                g.add_edge(person, other_person['target'], relation=rel)
                
                
'''
Exporting the graph 
'''

nx.write_gexf(g, 'graph_full.gexf')
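
The edge rule used above (an edge is only added if its target is itself a person in the data set) can be illustrated without networkx. This is a minimal sketch on toy data; the helper `relation_edges` is hypothetical, and the toy records are cut down to the 'relation' attribute.

```python
def relation_edges(persons):
    """Yield (source, target, relation) for every relation whose target is a known person."""
    for source, info in persons.items():
        for relation, targets in info.get('relation', {}).items():
            for other in targets:
                if other['target'] in persons:   # drop edges to persons outside the data set
                    yield source, other['target'], relation


# toy data: 'n0412' is deliberately missing, so the edge to it is dropped
toy = {
    'b0339': {'relation': {'Lehrer': [{'target': 'a0884'}],
                           'Schüler': [{'target': 'n0412'}]}},
    'a0884': {'relation': {}},
}

edges = list(relation_edges(toy))
print(edges)
```

Each resulting tuple corresponds to one `g.add_edge(source, target, relation=relation)` call. Since a `DiGraph` holds at most one edge per ordered pair of nodes, a second relation between the same two persons overwrites the `relation` attribute of the first; `nx.MultiDiGraph` would be an alternative if both should be kept.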

## 4. Pickling

To avoid reading in all JSON files and converting the data more than once, we pickle the objects we will need for our analysis. Converting the attributes dictionary to a DataFrame will further simplify later use.

In [12]:
import pickle
import pandas as pd

with open('../pickled/person_dict.p', 'wb') as f:   # the with block closes the file again
    pickle.dump(data, f)

df = pd.DataFrame(clean_data).T     # transposing, so that the person ids act as indices instead of as columns
df.to_pickle('../pickled/dataframed_attributes.p')
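
For completeness, a later analysis step would read the pickled objects back in. The round trip is sketched below on a toy dictionary in a temporary directory, so it does not touch the real files; the toy content and the variable names are assumptions made for this example.

```python
import os
import pickle
import tempfile

toy_data = {'b0339': {'defaultSurName': 'Beethoven'}}   # toy stand-in for `data`

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'person_dict.p')
    with open(target, 'wb') as f:      # write
        pickle.dump(toy_data, f)
    with open(target, 'rb') as f:      # read back
        restored = pickle.load(f)

print(restored == toy_data)
```

The DataFrame written with `to_pickle` can be restored in the same spirit with `pd.read_pickle('../pickled/dataframed_attributes.p')`. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.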