# Creating a SKOS Thesaurus from the cleaned signature schema

We'll use [RDFlib](https://rdflib.readthedocs.io) to manage the SKOS terms. You don't _have_ to: it is quite possible to emit RDF through plain string manipulation. But since this is not a huge dataset, we can afford to build an in-memory RDF graph.

In [54]:
import pandas as pd
import numpy as np
import re
import urllib.parse
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import SKOS

## Load the data
We assume we are working with the final output of the [signatures_processing](signatures_processing.ipynb) notebook.

In [2]:
df = pd.read_csv('data/csv/sig_updated.csv',dtype={'numbis': str, 'backreference': str, 'text_4': str})

# Create a multi-index as we might need to access rows over and over.
df.set_index(['lev','sys','numbis'], inplace=True)
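# Note: sort_index() returns a sorted copy; the result is not assigned here, so df keeps the file order.
df.sort_index()
# Also an index by text, because we need to locate rows by the content of text_1 to text_4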
df.set_index(['text'], append=True, inplace=True)
df

lev,sys,numbis,text,backreference,text_1,text_2,text_3,text_4
,,,Bibliotheca Hertziana,,,,,
,,,Systematischer Standortkatalog,,,,,
1.0,A,,Handbücherei,,,,,
2.0,Aa,,Allgemeine Nachschlagewerke,,Handbücherei,,,
3.0,Aa xxx,yyy,Lexika,,Handbücherei,Allgemeine Nachschlagewerke,,
3.0,...,...,...,...,...,...,...,...
3.0,E-VER 300,,"Malerei, Grafik, Mosaik, Buchmalerei",,Topographie Italien (ohne Rom),Verona,,
3.0,E-VER 300,,Hauptkirche,,Topographie Italien (ohne Rom),Verona,,
3.0,E-VER 300,,sonstige einzelne Kirchen,,Topographie Italien (ohne Rom),Verona,,
3.0,E-VER 300,,einelne Profangebäude,,Topographie Italien (ohne Rom),Verona,,
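
The effect of an unassigned `sort_index()` call can be seen on a toy frame. This is only a sketch: the `toy` frame below is made up for illustration and uses just two of the index levels.

```python
import pandas as pd

# Toy frame (illustrative data), deliberately out of order
toy = pd.DataFrame({
    'lev': [2.0, 1.0, 3.0],
    'sys': ['Aa', 'A', 'Aa 40'],
    'text': ['Allgemeine Nachschlagewerke', 'Handbücherei',
             'Enzyklopädien und Sachlexika'],
}).set_index(['lev', 'sys'])

toy.sort_index()               # result is discarded, toy is unchanged
toy_sorted = toy.sort_index()  # keep the sorted copy instead

print(toy.index.is_monotonic_increasing)         # still in the original order
print(toy_sorted.index.is_monotonic_increasing)  # lexsorted

# Partial key on the first level: all rows of level 2.0
level2 = toy_sorted.loc[2.0]
print(level2)
```

A lexsorted index matters mostly for slicing across several levels, which is the kind of lookup attempted further down.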


In [3]:
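def least_generic_broader(row):
    """Return the last non-missing value in row (the most specific broader term), or None."""
    # list(row) iterates over the values, so this also works for a Series with string labels.
    for text in reversed(list(row)):
        if pd.notna(text):
            return text
    return None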
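
To make the behaviour concrete, here is a small usage sketch. The function is restated (with an explicit `list(...)` so it accepts a Series) to keep the snippet self-contained; the two example rows are made up, modelled on the `text_1` to `text_4` columns shown above.

```python
import numpy as np
import pandas as pd

def least_generic_broader(row):
    # Walk the values from right to left and return the first one that is set
    for text in reversed(list(row)):
        if pd.notna(text):
            return text
    return None

cols = ['text_1', 'text_2', 'text_3', 'text_4']
row = pd.Series(['Handbücherei', 'Allgemeine Nachschlagewerke', np.nan, np.nan],
                index=cols)
empty = pd.Series([np.nan, np.nan, np.nan, np.nan], index=cols)

print(least_generic_broader(row))    # the rightmost filled column wins
print(least_generic_broader(empty))  # nothing set, so no broader term
```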

In [57]:
g = Graph()
NS_DATA = Namespace('http://data.biblhertz.it/term/sys/')

count = 0
CAP = 1000
for index, row in df.iterrows():
    try:
        syz = index[1]
        if pd.isna(syz):
            continue
        sys_uri = re.sub(r'\s+', '/', str(syz).strip())
        sys_uri = urllib.parse.quote(sys_uri)
        subj = NS_DATA[sys_uri]
        g.add((subj, RDF.type, SKOS.Concept))
        lab = index[3]
        if pd.notna(lab):
            g.add((subj, SKOS.prefLabel, Literal(lab, lang='de')))
            
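        # Look up the row one level up whose text matches the broader term
        if index[0] > 1.0:
            broader = least_generic_broader(row[-4:])
            higher = df.loc[(index[0]-1.0, slice(None), slice(None), broader)]
    except IndexError:
        # Better to ask for forgiveness than for permission.
        # Note: a failed label lookup via df.loc raises KeyError, which this clause does not catch.
        pass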
    count += 1
    if CAP == count:
        break
    

Detected broader term: Handbücherei
(2.0, 'Aa ', nan, 'Allgemeine Nachschlagewerke')
           backreference text_1 text_2 text_3 text_4
sys numbis                                          
A   NaN              NaN    NaN    NaN    NaN    NaN


KeyError: 'A'
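
The run above stops with a `KeyError`, which `except IndexError` does not catch. One possible way to make the lookup more forgiving is `DataFrame.xs`, selecting on the `lev` and `text` levels only. This is a sketch under assumptions, not the notebook's final method: `sys_to_path` and `find_broader_sys` are hypothetical helper names, and the toy frame omits the `numbis` level.

```python
import re
import urllib.parse
import pandas as pd

def sys_to_path(sys_code):
    """Turn a signature such as 'Aa xxx' into a URI path ('Aa/xxx')."""
    return urllib.parse.quote(re.sub(r'\s+', '/', str(sys_code).strip()))

def find_broader_sys(frame, lev, broader_text):
    """Return the sys code of the matching row one level up, else None."""
    try:
        hits = frame.xs((lev - 1.0, broader_text), level=['lev', 'text'])
    except KeyError:
        return None   # level value or text not present at all
    if hits.empty:
        return None   # both exist, but never on the same row
    return hits.index.get_level_values('sys')[0]

# Toy data (illustrative), same level names as the real frame
toy = pd.DataFrame({
    'lev': [1.0, 2.0, 3.0],
    'sys': ['A', 'Aa', 'Aa 40'],
    'text': ['Handbücherei', 'Allgemeine Nachschlagewerke',
             'Enzyklopädien und Sachlexika'],
    'text_1': [None, 'Handbücherei', 'Handbücherei'],
}).set_index(['lev', 'sys', 'text'])

parent = find_broader_sys(toy, 2.0, 'Handbücherei')
print(parent, sys_to_path('Aa xxx'))
```

With a helper like this, the loop could add a `SKOS.broader` triple from `subj` to `NS_DATA[sys_to_path(parent)]` whenever `parent` is not `None`.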

In [5]:
len(g)

1867

In [6]:
res = g.query(f"""
SELECT DISTINCT * WHERE {{ ?x a <{SKOS.Concept}> 
  ;  <{SKOS.prefLabel}> ?l
}} 
LIMIT 10""")

for row in res:
    print(f"{row.x} {row.l}")

http://data.biblhertz.it/term/sys/A Handbücherei
http://data.biblhertz.it/term/sys/Aa Allgemeine Nachschlagewerke
http://data.biblhertz.it/term/sys/Aa/xxx Lexika
http://data.biblhertz.it/term/sys/Aa/xxx Bibliographien
http://data.biblhertz.it/term/sys/Aa/xxx Nationale Allgemeinbibliographien
http://data.biblhertz.it/term/sys/Aa/xxx Bibliographien zur Landeskunde
http://data.biblhertz.it/term/sys/Aa/xxx Biographische Lexika
http://data.biblhertz.it/term/sys/Aa/xxx Adreßbücher wissenschaftlicher Gesellschaften bzw. Institutionen
http://data.biblhertz.it/term/sys/Aa/xxx Sonstiges
http://data.biblhertz.it/term/sys/Aa/40 Enzyklopädien und Sachlexika
