# Fast name extraction from GND

## Main data structure

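`cooc` is a nested dictionary: for each property it holds a dictionary that maps each name to a `Counter` of the names cooccurring with it and their frequencies. Two names cooccur when they appear under the same property in the same GND record. A name is also counted as cooccurring with itself whenever its record contains more than one name variant.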

Example:
```
{
    "gndo:surname": 
        {
            "Hussell": 
                {
                    "Hussell": 2, 
                    "Hussel": 2 
                }, 
            "Bayer": 
                {
                    "Bayer": 251,
                    "Beyer": 93, 
                    ...
                }
            ...
         }
     ...
}
```
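
To illustrate how such counts arise, here is a minimal sketch that builds the inner name-to-`Counter` mapping for a single property. The `records` list and the name `toy_cooc` are made up for illustration and are not taken from GND.

```python
import collections

# Hypothetical toy records: each set holds the surname variants of one person.
records = [{"Bayer", "Beyer"}, {"Bayer", "Baier"}, {"Beyer"}]

toy_cooc = collections.defaultdict(collections.Counter)
for names in records:
    # A record with a single name carries no cooccurrence information.
    if len(names) > 1:
        for name in names:
            # Counter.update with a set adds 1 for every element,
            # including the name itself.
            toy_cooc[name].update(names)

print(dict(toy_cooc["Bayer"]))
```

Here "Bayer" ends up with a self-count of 2 because it appears in two records that contain more than one variant, while the third record contributes nothing.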

In [1]:
import collections
props = ["gndo:forename", "gndo:surname"]
cooc = collections.defaultdict(lambda : collections.defaultdict(collections.Counter))

## Extraction and access function

In [2]:
tmp = collections.defaultdict(set)

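def extract_literal(line):
    # Return the text between the first pair of double quotes on the line.
    start = line.index('"')
    end = line.index('"', start + 1)
    return line[start + 1:end]

def bad_name(name):
    # Skip names that contain a space, a period, or a hyphen.
    return " " in name or "." in name or "-" in name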

def extract_from_file(filename):
    with open(filename, 'r', encoding='utf-8') as f:
        count = 0
        for p in props:
            tmp[p] = set()
        for line in f:
            if "gndo:gndIdentifier" in line:
                for p in props:
                    if len(tmp[p]) > 1:
                        for name in tmp[p]:
                            cooc[p][name].update(tmp[p])
                    tmp[p] = set()
                # id = line.split(' ')[2].strip('"')
                count += 1
                if count % 10000 == 0:
                    print('.', end='')
                if count % 1000000 == 0:
                    print(' {}'.format(count))
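            for p in props:
                if p in line:
                    name = extract_literal(line)
                    if not bad_name(name):
                        tmp[p].add(name)
        # Flush the last record: no further gndo:gndIdentifier line follows it.
        for p in props:
            if len(tmp[p]) > 1:
                for name in tmp[p]:
                    cooc[p][name].update(tmp[p])
            tmp[p] = set()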

def get_coocs(prop, name):
    # Wrap in a Counter: after load_cooc, the inner values are plain dicts.
    return collections.Counter(cooc[prop][name]).most_common()

## Loading and saving of the extracted data

In [3]:
import json

def save_cooc(filename):
    with open(filename, 'w', encoding='utf-8') as fp:
        json.dump(cooc, fp)
        
def load_cooc(filename):
    with open(filename, 'r', encoding='utf-8') as fp:
        return json.load(fp)
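
Note that a JSON round trip returns nested plain dicts rather than the `defaultdict`/`Counter` structure. If the original types are needed again, for example to continue extracting into loaded data, a conversion along the following lines might work. The helper name `to_counters` is hypothetical, and an in-memory buffer stands in for the dump file.

```python
import collections
import io
import json

def to_counters(data):
    # Hypothetical helper: rebuild the defaultdict/Counter nesting from plain dicts.
    result = collections.defaultdict(
        lambda: collections.defaultdict(collections.Counter))
    for prop, names in data.items():
        for name, counts in names.items():
            result[prop][name] = collections.Counter(counts)
    return result

original = {"gndo:surname": {"Hussell": collections.Counter({"Hussell": 2, "Hussel": 2})}}

buffer = io.StringIO()
json.dump(original, buffer)   # Counter is a dict subclass, so json can serialize it
buffer.seek(0)
plain = json.load(buffer)     # comes back as nested plain dicts
restored = to_counters(plain)

print(type(plain["gndo:surname"]["Hussell"]).__name__)
print(type(restored["gndo:surname"]["Hussell"]).__name__)
```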

The raw GND data can be downloaded from the following URLs (gunzip the files and place them in the current directory):
- https://data.dnb.de/opendata/authorities-name_lds.ttl.gz
- https://data.dnb.de/opendata/authorities-person_lds.ttl.gz

Uncomment the lines below to extract the dataset from the GND files:

In [4]:
# extract_from_file('authorities-name_lds.ttl')
# extract_from_file('authorities-person_lds.ttl')
# save_cooc('cooc.json')

Load from existing dump file:

In [5]:
cooc = load_cooc('cooc.json')

## Play with the data

Todo: Find a way to filter out names that do not fit, e.g. by edit-distance or clique analysis.
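
As a starting point for the edit-distance idea, the following sketch keeps only cooccurring names within a fixed Levenshtein distance of the query name. The functions `edit_distance` and `filter_by_edit_distance` and the threshold of 2 are assumptions for illustration, not part of the extraction code above. The sample pairs are taken from the `get_coocs` output for "Simon".

```python
def edit_distance(a, b):
    # Dynamic-programming Levenshtein distance, computed one row at a time.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def filter_by_edit_distance(name, coocs, max_distance=2):
    # Keep (name, count) pairs whose name is close enough to the query name.
    return [(other, n) for other, n in coocs
            if edit_distance(name, other) <= max_distance]

sample = [('Simon', 553), ('Simone', 70), ('Symon', 48), ('Pierre', 7), ('Thomas', 6)]
kept = filter_by_edit_distance('Simon', sample)
print(kept)
```

With this threshold, "Pierre" and "Thomas" are dropped while "Simone" and "Symon" are kept. A fixed threshold would also drop legitimate variants in other scripts, such as 'שמעון', so it is likely only a partial answer.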

In [6]:
len(cooc['gndo:forename'])

199591

In [7]:
get_coocs('gndo:forename', 'Simon')
        

[('Simon', 553),
 ('Simone', 70),
 ('Symon', 48),
 ('Simeon', 47),
 ('Szymon', 41),
 ('Šimʿôn', 25),
 ('Shimʿon', 24),
 ('Shimon', 20),
 ('Šimon', 18),
 ('Simón', 13),
 ('Siméon', 9),
 ('Semen', 9),
 ('שמעון', 9),
 ('Simonis', 8),
 ('Sigmund', 8),
 ('Pierre', 7),
 ('Šimʿōn', 7),
 ('Saimon', 6),
 ('Šimʿon', 6),
 ('Thomas', 6),
 ('Sajmon', 6),
 ('Simão', 5),
 ('Schimon', 5),
 ('Semyon', 5),
 ('John', 5),
 ('Symeon', 5),
 ('Simonus', 4),
 ("Shim'on", 4),
 ('Johannes', 4),
 ('Christian', 4),
 ('Simonas', 4),
 ('Shimen', 4),
 ('Siegmund', 4),
 ('Salomon', 4),
 ('Somon', 3),
 ('Sīmūn', 3),
 ('Georg', 3),
 ("Šim'ôn", 3),
 ("Sim'ôn", 3),
 ('Simeone', 3),
 ('Peter', 3),
 ('Szimon', 3),
 ('Simonius', 3),
 ('Seme͏̈n', 3),
 ('Henri', 3),
 ('Shimòn', 3),
 ('Martin', 3),
 ('Shimeon', 3),
 ('Tomaso', 2),
 ('Tommaso', 2),
 ('William', 2),
 ('Njami', 2),
 ('Šelomô', 2),
 ('Sîmôn', 2),
 ('Édouard', 2),
 ('Gaudentio', 2),
 ('Gaudence', 2),
 ('Šmeon', 2),
 ('Shmeon', 2),
 ("S̆im'Ôn",