# Creating a SKOS Thesaurus from the cleaned signature schema

We'll use [RDFlib](https://rdflib.readthedocs.io) for managing SKOS terms etc. You don't _have_ to as it is quite possible to just output RDF through string manipulation, but seeing as this is not a huge dataset, we can afford to go through an in-memory RDF graph.

In [1]:
import pandas as pd
import numpy as np
import re
import urllib.parse
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import SKOS

In [2]:
CAP = 1000

## Load the data
We assume to be working with the final output of the [signatures_processing](signatures_processing.ipynb) notebook.

In [3]:
df = pd.read_csv('data/csv/sig_lookup.csv',dtype={'numbis': str, 'backreference': str, 'text_4': str})

# Create a multi-index as we might need to access rows over and over.
df.set_index(['lev','sys','numbis'], inplace=True)
df.sort_index()
# Also an index by text, because we need to locate rows by the contect of text_1 to text_4
df.set_index(['text'], append=True, inplace=True)
df

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,backreference,text_1,text_2,text_3,text_4
lev,sys,numbis,text,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
,,,Bibliotheca Hertziana,,,,,
,,,Systematischer Standortkatalog,,,,,
1.0,A,,Handbücherei,,,,,
2.0,Aa,,Allgemeine Nachschlagewerke,,Handbücherei,,,
3.0,Aa xxx,yyy,Lexika,,Handbücherei,Allgemeine Nachschlagewerke,,
3.0,...,...,...,...,...,...,...,...
3.0,E-VER 300,,"Malerei, Grafik, Mosaik, Buchmalerei",,Topographie Italien (ohne Rom),Verona,,
3.0,E-VER 300,,Hauptkirche,,Topographie Italien (ohne Rom),Verona,,
3.0,E-VER 300,,sonstige einzelne Kirchen,,Topographie Italien (ohne Rom),Verona,,
3.0,E-VER 300,,einelne Profangebäude,,Topographie Italien (ohne Rom),Verona,,


In [4]:
def make_term(sys, label, graph):
    sys_uri = re.sub(r'\s+', '/', str(syz).strip())
    sys_uri = urllib.parse.quote(sys_uri)
    subj = NS_DATA[sys_uri]
    graph.add((subj, RDF.type, SKOS.Concept))
    if label:
        g.add((subj, SKOS.prefLabel, Literal(label, lang='de')))
    return subj

In [5]:
def least_generic_broader(row):
    for text in reversed(row):
        if pd.notna(text):
            return text
    return None

def find_broader_term(df, level, syz, term):
    "level must be lower"
    match = None
    sliced = df.loc[(level, slice(None), slice(None), term)]
    if sliced.empty:
        raise ValueError("No match for "+str(level)+", "+term)
    for i, r in sliced.iterrows():
        brd_syz = i[0].split()[0].strip()
        if syz.startswith(brd_syz):
            match = i
    if not match:
        if level > 1:
            return find_broader_term(df, level-1, syz, term)
        else:
            return None
    else:
        return match

In [23]:
g = Graph()
NS_DATA = Namespace('http://data.biblhertz.it/term/sys/')

count = 0
for index, row in df.iterrows():
    try:
        syz = index[1]
        if pd.isna(syz):
            continue
        uri = make_term(syz, index[3] if pd.notna(index[3]) else None, g)
        # Look up a higher-level term
        if index[0] > 1.0:
            print("Looking for a broader to {}:{}".format(index[0],syz,index[3]))
            term = least_generic_broader(row[-4:])
            try:
                df_br = find_broader_term(df, index[0]-1, syz, term)
                if df_br[0]:
                    print("Found {}".format(df_br[0]))
                    broader = make_term(df_br[0], None, g)
                    print(broader)
                    g.add((uri, SKOS.broader, broader))
                    
            except ValueError:
                print("[WARN] Couldn't match key: {},{},{}".format(index[0],syz,term))
            
    except IndexError:
        # Better to ask for forgiveness than for permission
        pass
    count += 1
    if CAP == count:
        break
    

Looking for a broader to 2.0:Aa 
Found A 
http://data.biblhertz.it/term/sys/Aa
Looking for a broader to 3.0:Aa xxx
Found Aa 
http://data.biblhertz.it/term/sys/Aa/xxx
Looking for a broader to 4.0:Aa 40
Found Aa xxx
http://data.biblhertz.it/term/sys/Aa/40
Looking for a broader to 3.0:Aa xxx
Found Aa 
http://data.biblhertz.it/term/sys/Aa/xxx
Looking for a broader to 4.0:Aa 60
Found Aa xxx
http://data.biblhertz.it/term/sys/Aa/60
Looking for a broader to 4.0:Aa 65
Found Aa xxx
http://data.biblhertz.it/term/sys/Aa/65
Looking for a broader to 4.0:Aa 70
Found Aa xxx
http://data.biblhertz.it/term/sys/Aa/70
Looking for a broader to 4.0:Aa 75
Found Aa xxx
http://data.biblhertz.it/term/sys/Aa/75
Looking for a broader to 4.0:Aa xxx
Found Aa xxx
http://data.biblhertz.it/term/sys/Aa/xxx
Looking for a broader to 5.0:Aa 80
Found Aa xxx
http://data.biblhertz.it/term/sys/Aa/80
Looking for a broader to 5.0:Aa 83
Found Aa xxx
http://data.biblhertz.it/term/sys/Aa/83
Looking for a broader to 5.0:Aa 85
Found 

Found Ab xxx
http://data.biblhertz.it/term/sys/Ab/400
Looking for a broader to 5.0:Ab 406
Found Ab xxx
http://data.biblhertz.it/term/sys/Ab/406
Looking for a broader to 5.0:Ab 416
Found Ab xxx
http://data.biblhertz.it/term/sys/Ab/416
Looking for a broader to 5.0:Ab 420
Found Ab xxx
http://data.biblhertz.it/term/sys/Ab/420
Looking for a broader to 5.0:Ab 428
Found Ab xxx
http://data.biblhertz.it/term/sys/Ab/428
Looking for a broader to 5.0:Ab 434
Found Ab xxx
http://data.biblhertz.it/term/sys/Ab/434
Looking for a broader to 5.0:Ab 450
Found Ab xxx
http://data.biblhertz.it/term/sys/Ab/450
Looking for a broader to 5.0:Ab 452
Found Ab xxx
http://data.biblhertz.it/term/sys/Ab/452
Looking for a broader to 5.0:Ab 454
Found Ab xxx
http://data.biblhertz.it/term/sys/Ab/454
Looking for a broader to 5.0:Ab 456
Found Ab xxx
http://data.biblhertz.it/term/sys/Ab/456
Looking for a broader to 5.0:Ab 458
Found Ab xxx
http://data.biblhertz.it/term/sys/Ab/458
Looking for a broader to 5.0:Ab 460
Found Ab x

Found Ab xxx
http://data.biblhertz.it/term/sys/Ab/2428
Looking for a broader to 5.0:Ab 2434
Found Ab xxx
http://data.biblhertz.it/term/sys/Ab/2434
Looking for a broader to 5.0:Ab 2450
Found Ab xxx
http://data.biblhertz.it/term/sys/Ab/2450
Looking for a broader to 5.0:Ab 2452
Found Ab xxx
http://data.biblhertz.it/term/sys/Ab/2452
Looking for a broader to 5.0:Ab 2454
Found Ab xxx
http://data.biblhertz.it/term/sys/Ab/2454
Looking for a broader to 5.0:Ab 2456
Found Ab xxx
http://data.biblhertz.it/term/sys/Ab/2456
Looking for a broader to 5.0:Ab 2458
Found Ab xxx
http://data.biblhertz.it/term/sys/Ab/2458
Looking for a broader to 5.0:Ab 2460
Found Ab xxx
http://data.biblhertz.it/term/sys/Ab/2460
Looking for a broader to 5.0:Ab 2468
Found Ab xxx
http://data.biblhertz.it/term/sys/Ab/2468
Looking for a broader to 4.0:Ab 5000
Found Ab xxx
http://data.biblhertz.it/term/sys/Ab/5000
Looking for a broader to 2.0:Ad 
Found A 
http://data.biblhertz.it/term/sys/Ad
Looking for a broader to 3.0:Ad 50
Fou

[WARN] Couldn't match key: 4.0,Ah xxx,Sonstiges
Looking for a broader to 5.0:Ah 50
Found Ah xxx
http://data.biblhertz.it/term/sys/Ah/50
Looking for a broader to 5.0:Ah 60
Found Ah xxx
http://data.biblhertz.it/term/sys/Ah/60
Looking for a broader to 5.0:Ah 70
Found Ah xxx
http://data.biblhertz.it/term/sys/Ah/70
Looking for a broader to 5.0:Ah 80
Found Ah xxx
http://data.biblhertz.it/term/sys/Ah/80
Looking for a broader to 4.0:Ah xxx
[WARN] Couldn't match key: 4.0,Ah xxx,Sonstiges
Looking for a broader to 5.0:Ah 100
Found Ah xxx
http://data.biblhertz.it/term/sys/Ah/100
Looking for a broader to 5.0:Ah 105
Found Ah xxx
http://data.biblhertz.it/term/sys/Ah/105
Looking for a broader to 5.0:Ah 110
Found Ah xxx
http://data.biblhertz.it/term/sys/Ah/110
Looking for a broader to 5.0:Ah 115
Found Ah xxx
http://data.biblhertz.it/term/sys/Ah/115
Looking for a broader to 5.0:Ah 120
Found Ah xxx
http://data.biblhertz.it/term/sys/Ah/120
Looking for a broader to 5.0:Ah 131
Found Ah xxx
http://data.biblh

[WARN] Couldn't match key: 4.0,Ar xxx,Sonstiges
Looking for a broader to 5.0:Ar 200
Found Ar xxx
http://data.biblhertz.it/term/sys/Ar/200
Looking for a broader to 5.0:Ar 220
Found Ar xxx
http://data.biblhertz.it/term/sys/Ar/220
Looking for a broader to 5.0:Ar 240
Found Ar xxx
http://data.biblhertz.it/term/sys/Ar/240
Looking for a broader to 5.0:Ar 260
Found Ar xxx
http://data.biblhertz.it/term/sys/Ar/260
Looking for a broader to 4.0:Ar xxx
[WARN] Couldn't match key: 4.0,Ar xxx,Sonstiges
Looking for a broader to 5.0:Ar 300
Found Ar xxx
http://data.biblhertz.it/term/sys/Ar/300
Looking for a broader to 5.0:Ar 320
Found Ar xxx
http://data.biblhertz.it/term/sys/Ar/320
Looking for a broader to 5.0:Ar 340
Found Ar xxx
http://data.biblhertz.it/term/sys/Ar/340
Looking for a broader to 5.0:Ar 360
Found Ar xxx
http://data.biblhertz.it/term/sys/Ar/360
Looking for a broader to 4.0:Ar xxx
[WARN] Couldn't match key: 4.0,Ar xxx,Sonstiges
Looking for a broader to 5.0:Ar 500
Found Ar xxx
http://data.bib

Found Ax xxx
http://data.biblhertz.it/term/sys/Ax/980
Looking for a broader to 2.0:Az 
Found A 
http://data.biblhertz.it/term/sys/Az
Looking for a broader to 3.0:Az 20
Found Az 
http://data.biblhertz.it/term/sys/Az/20
Looking for a broader to 3.0:Az 24
Found Az 
http://data.biblhertz.it/term/sys/Az/24
Looking for a broader to 3.0:Az 28
Found Az 
http://data.biblhertz.it/term/sys/Az/28
Looking for a broader to 3.0:Az 40
Found Az 
http://data.biblhertz.it/term/sys/Az/40
Looking for a broader to 3.0:Az 50
Found Az 
http://data.biblhertz.it/term/sys/Az/50
Looking for a broader to 3.0:Az 55
Found Az 
http://data.biblhertz.it/term/sys/Az/55
Looking for a broader to 3.0:Az 70
Found Az 
http://data.biblhertz.it/term/sys/Az/70
Looking for a broader to 3.0:Az 100
Found Az 
http://data.biblhertz.it/term/sys/Az/100
Looking for a broader to 3.0:Az 103
Found Az 
http://data.biblhertz.it/term/sys/Az/103
Looking for a broader to 3.0:Az 106
Found Az 
http://data.biblhertz.it/term/sys/Az/106
Looking for

Found Bb xxx
http://data.biblhertz.it/term/sys/Bb/524
Looking for a broader to 5.0:Bb 528
Found Bb xxx
http://data.biblhertz.it/term/sys/Bb/528
Looking for a broader to 5.0:Bb 532
Found Bb xxx
http://data.biblhertz.it/term/sys/Bb/532
Looking for a broader to 5.0:Bb 536
Found Bb xxx
http://data.biblhertz.it/term/sys/Bb/536
Looking for a broader to 5.0:Bb 540
Found Bb xxx
http://data.biblhertz.it/term/sys/Bb/540
Looking for a broader to 5.0:Bb 544
Found Bb xxx
http://data.biblhertz.it/term/sys/Bb/544
Looking for a broader to 5.0:Bb 548
Found Bb xxx
http://data.biblhertz.it/term/sys/Bb/548
Looking for a broader to 5.0:Bb 552
Found Bb xxx
http://data.biblhertz.it/term/sys/Bb/552
Looking for a broader to 5.0:Bb 556
Found Bb xxx
http://data.biblhertz.it/term/sys/Bb/556
Looking for a broader to 5.0:Bb 560
Found Bb xxx
http://data.biblhertz.it/term/sys/Bb/560
Looking for a broader to 5.0:Bb 564
Found Bb xxx
http://data.biblhertz.it/term/sys/Bb/564
Looking for a broader to 5.0:Bb 568
Found Bb x

Found Be 
http://data.biblhertz.it/term/sys/Be/xxx
Looking for a broader to 4.0:Be xxx
Found Be xxx
http://data.biblhertz.it/term/sys/Be/xxx
Looking for a broader to 5.0:Be 610
Found Be xxx
http://data.biblhertz.it/term/sys/Be/610
Looking for a broader to 5.0:Be 620
Found Be xxx
http://data.biblhertz.it/term/sys/Be/620
Looking for a broader to 5.0:Be 630
Found Be xxx
http://data.biblhertz.it/term/sys/Be/630
Looking for a broader to 5.0:Be 650
Found Be xxx
http://data.biblhertz.it/term/sys/Be/650
Looking for a broader to 5.0:Be 652
Found Be xxx
http://data.biblhertz.it/term/sys/Be/652
Looking for a broader to 5.0:Be 664
Found Be xxx
http://data.biblhertz.it/term/sys/Be/664
Looking for a broader to 4.0:Be xxx
Found Be xxx
http://data.biblhertz.it/term/sys/Be/xxx
Looking for a broader to 5.0:Be 710
Found Be xxx
http://data.biblhertz.it/term/sys/Be/710
Looking for a broader to 5.0:Be 720
Found Be xxx
http://data.biblhertz.it/term/sys/Be/720
Looking for a broader to 5.0:Be 750
Found Be xxx


In [17]:
len(g)

1867

In [18]:
res = g.query(f"""
SELECT DISTINCT * WHERE {{ ?x a <{SKOS.Concept}> 
  ;  <{SKOS.prefLabel}> ?l
  OPTIONAL {{ ?x <{SKOS.broader}> ?broad FILTER (?x!=?broad) }}
  
}} 
LIMIT 30""")

for row in res:
    print(f"{row.broad} > {row.x} {row.l}")

None > http://data.biblhertz.it/term/sys/A Handbücherei
None > http://data.biblhertz.it/term/sys/Aa Allgemeine Nachschlagewerke
None > http://data.biblhertz.it/term/sys/Aa/xxx Lexika
None > http://data.biblhertz.it/term/sys/Aa/xxx Bibliographien
None > http://data.biblhertz.it/term/sys/Aa/xxx Nationale Allgemeinbibliographien
None > http://data.biblhertz.it/term/sys/Aa/xxx Bibliographien zur Landeskunde
None > http://data.biblhertz.it/term/sys/Aa/xxx Biographische Lexika
None > http://data.biblhertz.it/term/sys/Aa/xxx Adreßbücher wissenschaftlicher Gesellschaften bzw. Institutionen
None > http://data.biblhertz.it/term/sys/Aa/xxx Sonstiges
None > http://data.biblhertz.it/term/sys/Aa/40 Enzyklopädien und Sachlexika
None > http://data.biblhertz.it/term/sys/Aa/60 Bibliographien der Bibliographien
None > http://data.biblhertz.it/term/sys/Aa/65 Bibliographien der "verkleideten Literatur" (Anonymen- und Pseudonymenlexika; falsches, fehlendes oder fingiertes Impressum)
None > http://data.biblh