# Fast name extraction from GND

![](brainstorm.jpg)

## Main data structure

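`cooc` is a nested dictionary: for each property it holds a dictionary that maps each name to a `Counter` of the names co-occurring with it in the same record, together with their frequencies. Each name is also counted against itself, so a name's own entry is the number of multi-name records in which it appears.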

Example:
```
{
    "gndo:surname": 
        {
            "Hussell": 
                {
                    "Hussell": 2, 
                    "Hussel": 2 
                }, 
            "Bayer": 
                {
                    "Bayer": 251,
                    "Beyer": 93, 
                    ...
                }
            ...
         }
     ...
}
```
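As a minimal sketch of how such a structure fills up, the snippet below feeds two hypothetical records into a separate `cooc_demo` dictionary. The records and the name `cooc_demo` are assumptions made for illustration; only the update rule mirrors the extraction code in this notebook.

```python
import collections

# Same shape as cooc: property -> name -> Counter of co-occurring names.
cooc_demo = collections.defaultdict(
    lambda: collections.defaultdict(collections.Counter))

# Two hypothetical records, each listing the surname variants of one person.
records = [{"Hussell", "Hussel"}, {"Hussell", "Hussel"}]

for variants in records:
    for name in variants:
        # Every name is counted against all names of the record, itself included.
        cooc_demo["gndo:surname"][name].update(variants)

print(dict(cooc_demo["gndo:surname"]["Hussell"]))
```

Both variants end up with a count of 2 under `"Hussell"`, matching the shape of the example above.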

In [1]:
import collections
props = ["gndo:forename", "gndo:surname"]
cooc = collections.defaultdict(lambda : collections.defaultdict(collections.Counter))

In [2]:
root = "C:/Users/rovera/compact_memory/"

## Extraction and access function

In [3]:
tmp = collections.defaultdict(set)

def extract_literal(line):
    start = line.index('"')
    end = line.index('"', start + 1)
    return line[start + 1:end]

def bad_name(name):
    return " " in name or "." in name or "-" in name

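def extract_from_file(filename):
    def flush():
        # Count every name of the finished record against all names of that
        # record (including itself), then start collecting for the next one.
        for p in props:
            if len(tmp[p]) > 1:
                for name in tmp[p]:
                    cooc[p][name].update(tmp[p])
            tmp[p] = set()

    with open(filename, 'r', encoding='utf-8') as f:
        count = 0
        for p in props:
            tmp[p] = set()
        for line in f:
            # A gndo:gndIdentifier line is treated as the start of a new record.
            if "gndo:gndIdentifier" in line:
                flush()
                # id = line.split(' ')[2].strip('"')
                count += 1
                # Progress output: one dot per 10,000 records, a count per million.
                if count % 10000 == 0:
                    print('.', end='')
                if count % 1000000 == 0:
                    print(' {}'.format(count))
            for p in props:
                if p in line:
                    name = extract_literal(line)
                    if not bad_name(name):
                        tmp[p].add(name)
        # Flush the last record, which has no following gndIdentifier line.
        flush()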

def get_coocs(prop, name):
    return collections.Counter(cooc[prop][name]).most_common()
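
To illustrate what the two helpers keep and discard, here is a small sketch on hand-written lines. The sample lines are hypothetical and only imitate the `property "literal" ;` shape that `extract_literal` expects; the helper definitions are copied from the cell above so that the snippet runs on its own.

```python
def extract_literal(line):
    # Return the text between the first pair of double quotes.
    start = line.index('"')
    end = line.index('"', start + 1)
    return line[start + 1:end]

def bad_name(name):
    # Reject compound or abbreviated names (space, dot, or hyphen).
    return " " in name or "." in name or "-" in name

# Hypothetical sample lines, not taken from the GND dump.
sample_lines = [
    '    gndo:forename "Paulus" ;',
    '    gndo:forename "Hans-Peter" ;',
    '    gndo:forename "J." ;',
    '    gndo:surname "van Beethoven" ;',
]

literals = [extract_literal(line) for line in sample_lines]
kept = [name for name in literals if not bad_name(name)]
print(kept)  # ['Paulus']
```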

## Loading and saving of the extracted data

In [4]:
import json

def save_cooc(filename):
    with open(filename, 'w', encoding='utf-8') as fp:
        json.dump(cooc, fp)
        
def load_cooc(filename):
    with open(filename, 'r', encoding='utf-8') as fp:
        return json.load(fp)
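
Note that a JSON round trip returns plain dictionaries rather than `defaultdict` and `Counter` objects; `get_coocs` wraps the entry in a `Counter` again, so `most_common()` keeps working. If `Counter` objects are wanted throughout, they could be restored after loading. The following is a sketch under that assumption; `to_counters` is a hypothetical helper that is not part of this notebook, and the counts are a subset of the `Abeles` output shown at the end.

```python
import collections
import json

original = {
    "gndo:surname": {
        "Abeles": collections.Counter({"Abeles": 7, "Abel": 2}),
    }
}

# Counter is a dict subclass, so it serializes as a plain JSON object.
restored = json.loads(json.dumps(original))
print(type(restored["gndo:surname"]["Abeles"]).__name__)  # dict

def to_counters(plain):
    # Hypothetical helper: rebuild the innermost level as Counter objects.
    return {
        prop: {name: collections.Counter(counts) for name, counts in names.items()}
        for prop, names in plain.items()
    }

rebuilt = to_counters(restored)
print(rebuilt["gndo:surname"]["Abeles"].most_common(1))  # [('Abeles', 7)]
```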

The raw GND data can be downloaded here (gunzip the files and place them in `data/gnd/` below `root`, which is where the extraction cell looks for them):
- https://data.dnb.de/opendata/authorities-name_lds.ttl.gz
- https://data.dnb.de/opendata/authorities-person_lds.ttl.gz
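
The decompression step can also be done from Python. A minimal sketch using only the standard library follows; the helper name `gunzip_file` is hypothetical, and the target paths have to match whatever the extraction cell expects.

```python
import gzip
import shutil

def gunzip_file(src, dst):
    """Decompress the gzip file `src` into `dst`, streaming in chunks."""
    with gzip.open(src, 'rb') as f_in, open(dst, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

# Example call, assuming the archive was downloaded to the working directory:
# gunzip_file('authorities-name_lds.ttl.gz', 'authorities-name_lds.ttl')
```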

The cell below extracts the dataset from the GND files and saves it as a dump; skip it if a dump file already exists:

In [5]:
extract_from_file(root+'data/gnd/authorities-name_lds.ttl')
extract_from_file(root+'data/gnd/authorities-person_lds.ttl')
save_cooc(root+'data/gnd/cooc.json')

.................................................................................................... 1000000
.................................................................................................... 2000000
.................................................................................................... 3000000
.................................................................................................... 4000000
.................................................................................................... 5000000
.................................................................................................... 6000000
.................................................................................................... 7000000
........................................................................................................ 1000000
.................................................................................................... 2000000
...............

Load from existing dump file:

In [6]:
cooc = load_cooc(root+'data/gnd/cooc.json')

## Play with the data

Todo: Find a way to filter out names that do not fit, e.g. by edit-distance or clique analysis.
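
One possible starting point for the edit-distance idea is sketched below. It is not an implemented part of this notebook: the plain Levenshtein distance, the threshold of 2, and the helper names `levenshtein` and `filter_by_distance` are all assumptions. The `Abeles` counts are copied from the `get_coocs` output in this notebook.

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, keeping one row at a time.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def filter_by_distance(name, coocs, max_distance=2):
    # Keep only co-occurring names within max_distance edits of `name`.
    return [(other, n) for other, n in coocs
            if levenshtein(name, other) <= max_distance]

abeles = [('Abeles', 7), ('Abel', 2), ('Abélès', 1), ('Allers', 1),
          ('Grailich', 1), ('Abelesz', 1), ('Nathan', 1)]
print(filter_by_distance('Abeles', abeles))
```

With this threshold, `Allers`, `Grailich`, and `Nathan` are dropped, while the spelling variants remain. A fixed threshold is crude: it would also discard genuine variants that differ more strongly, so it may need to scale with name length or be combined with the co-occurrence counts.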

In [6]:
len(cooc['gndo:forename'])

199591

In [13]:
print(cooc['gndo:forename']['Paulus'])

{'Paulus': 2539, 'Paul': 2212, 'Paolo': 260, 'Pál': 65, 'Pal': 9, 'Paullus': 83, 'Borgasius': 1, 'Pauel': 1, 'Pawel': 14, 'Pauls': 2, 'Paweł': 27, 'Pavlvs': 12, 'Pavao': 6, 'Pauli': 8, 'Paulo': 42, 'Baoluo': 1, 'Selig': 3, 'Būlus': 5, 'Būluṣ': 2, 'Paola': 2, 'Constantinus': 2, 'Hieronymus': 1, 'Girolamo': 1, 'Pavel': 24, 'Joan': 2, 'Johannes': 12, 'Paavali': 2, 'Paolano': 1, 'ŁUkasz': 1, 'Lucas': 2, 'Xystus': 3, 'Joannes': 5, 'John': 3, 'Celidonius': 2, 'Gilles': 1, 'Germanus': 1, 'Desiderius': 1, 'Eugenius': 1, 'Vincentius': 1, 'Aegidius': 1, 'Zachaeus': 2, 'Mattheus': 1, 'Iacobus': 2, 'Jakob': 2, 'Jacob': 2, 'Jacobus': 2, 'Christian': 4, 'Christianus': 3, 'Paulu': 1, 'Paulos': 6, 'Poul': 7, 'Pablo': 15, 'Bulos': 1, 'Faulos': 1, 'Pulos': 1, 'Paulvs': 2, 'Amnicola': 1, 'Giampaolo': 2, 'Petrus': 7, 'Xistus': 1, 'Paulies': 1, 'Páll': 3, 'Johan': 1, 'Pavlos': 1, 'Pūlus': 1, 'Pauluse': 1, 'Paulinus': 2, 'Paulin': 2, 'Paull': 12, 'Walter': 1, 'Werner': 1, 'Pawe': 1, 'Paal': 1, 'Sincer

In [12]:
get_coocs('gndo:surname', 'Abeles')
        

[('Abeles', 7),
 ('Abel', 2),
 ('Abélès', 1),
 ('Allers', 1),
 ('Grailich', 1),
 ('Abelesz', 1),
 ('Nathan', 1)]