# Slightly more advanced mapping techniques

In the other notebook we mapped the concepts of Bouterwek, Eschenburg and Goethe based on a string comparison: whenever two labels matched exactly, we created a `skos:closeMatch`.
This 1:1 string comparison is not robust. Spelling variants ("Äsopische Fabel"/"Aesopische Fabel") are not matched, and a multi-word expression never matches the single-word label of a concept that probably corresponds to it.
For these two scenarios we will develop mapping mechanisms:

* string containment: we check whether the label of concept A is contained in the label of concept B, and vice versa. This should help to identify candidates where a multi-word label maps to a single-word label.
* edit distance: we will use the Levenshtein distance to find spelling variants of single-word labels.

We will probably have to combine these approaches into a single matching algorithm at some later stage.
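
As a rough illustration of the two ideas, the sketch below checks containment on lower-cased labels and computes a plain dynamic-programming Levenshtein distance. The helper names `contains_either_way` and `levenshtein` are made up for this example and are not used elsewhere in this notebook; a dedicated library could be used instead.

```python
def contains_either_way(label_a: str, label_b: str) -> bool:
    """True if one lower-cased label is a substring of the other."""
    a, b = label_a.lower(), label_b.lower()
    return a in b or b in a


def levenshtein(s: str, t: str) -> int:
    """Minimal number of single-character insertions, deletions and substitutions."""
    # previous[j] holds the distance between the processed prefix of s and t[:j]
    previous = list(range(len(t) + 1))
    for i, char_s in enumerate(s, start=1):
        current = [i]
        for j, char_t in enumerate(t, start=1):
            cost = 0 if char_s == char_t else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


print(contains_either_way("Fabel", "Aesopische Fabel"))
print(levenshtein("Äsopische Fabel", "Aesopische Fabel"))
```

A small distance relative to the label length would hint at a spelling variant; which threshold is suitable is an open question at this point.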

In [1]:
!ls out | grep terms

bouterwek_terms.json
eschenburg_terms.json
goethe_dichtarten_terms.json


In [2]:
import json

def load_terms(filename:str):
    """Load the terms from a JSON file in the out/ folder."""
    filepath = "out/" + filename

    with open(filepath, "r") as f:
        parsed = json.load(f)

    return parsed

In [3]:
import json
# load the terms from file
bouterwek_terms = load_terms("bouterwek_terms.json")
eschenburg_terms = load_terms("eschenburg_terms.json")
goethe_terms = load_terms("goethe_dichtarten_terms.json") # TODO: drop "dichtarten" from the file name

## String containment

In [4]:
#bouterwek_terms

In [5]:
#eschenburg_terms

In [6]:
def match_by_string_containment_brute_force(terms_1:list, terms_2:list, name_1:str, name_2:str):
    """Print every pair of terms where one label contains the other (case-insensitive)."""
    for term_1 in terms_1:
        label_1 = term_1["label"].lower()
        for term_2 in terms_2:
            label_2 = term_2["label"].lower()
            if label_1 in label_2 or label_2 in label_1:
                print(f"Term {term_1['label']} ({name_1}) and {term_2['label']} ({name_2}) are mapping candidates")

In [7]:
match_by_string_containment_brute_force(goethe_terms, eschenburg_terms, "goethe", "eschenburg")

Term Drama (goethe) and Drama (eschenburg) are mapping candidates
Term Drama (goethe) and Dramatische Dichtungsart (eschenburg) are mapping candidates
Term Elegie (goethe) and Elegie (eschenburg) are mapping candidates
Term Epigramm (goethe) and Epigramm (eschenburg) are mapping candidates
Term Epistel (goethe) and Epistel (eschenburg) are mapping candidates
Term Erzählung (goethe) and Allegorische Erzählung (eschenburg) are mapping candidates
Term Erzählung (goethe) and Poetische Erzählung (eschenburg) are mapping candidates
Term Fabel (goethe) and Aesopische Fabel (eschenburg) are mapping candidates
Term Heroide (goethe) and Heroide (eschenburg) are mapping candidates
Term Lehrgedicht (goethe) and Lehrgedicht (eschenburg) are mapping candidates
Term Roman (goethe) and Roman (eschenburg) are mapping candidates
Term Romanze (goethe) and Roman (eschenburg) are mapping candidates
Term Satire (goethe) and Satire (eschenburg) are mapping candidates


In [8]:
match_by_string_containment_brute_force(eschenburg_terms, bouterwek_terms, "eschenburg", "bouterwek")

Term Drama (eschenburg) and Dramatische Dichtungsart (bouterwek) are mapping candidates
Term Elegie (eschenburg) and Elegie (bouterwek) are mapping candidates
Term Epigramm (eschenburg) and Epigramm (bouterwek) are mapping candidates
Term Hirtengedicht (eschenburg) and Das Hirtengedicht und die idyllische Poesie (bouterwek) are mapping candidates
Term Lustspiel (eschenburg) and Lustspiel (bouterwek) are mapping candidates
Term Oper (eschenburg) and Oper (bouterwek) are mapping candidates
Term Roman (eschenburg) and Epische Romanze (bouterwek) are mapping candidates
Term Roman (eschenburg) and Roman (bouterwek) are mapping candidates
Term Roman (eschenburg) and Romantische Canzone (bouterwek) are mapping candidates
Term Trauerspiel (eschenburg) and Trauerspiel (bouterwek) are mapping candidates
Term Epistel (eschenburg) and Didaktische Epistel (bouterwek) are mapping candidates
Term Epistel (eschenburg) and Lyrische Epistel (bouterwek) are mapping candidates
Term Lehrgedicht (eschenburg

In [9]:
match_by_string_containment_brute_force(goethe_terms, bouterwek_terms, "goethe", "bouterwek")

Term Ballade (goethe) and Ballade (bouterwek) are mapping candidates
Term Drama (goethe) and Dramatische Dichtungsart (bouterwek) are mapping candidates
Term Elegie (goethe) and Elegie (bouterwek) are mapping candidates
Term Epigramm (goethe) and Epigramm (bouterwek) are mapping candidates
Term Epistel (goethe) and Didaktische Epistel (bouterwek) are mapping candidates
Term Epistel (goethe) and Lyrische Epistel (bouterwek) are mapping candidates
Term Epopöe (goethe) and Epopöe (bouterwek) are mapping candidates
Term Fabel (goethe) and Äsopische Fabel (bouterwek) are mapping candidates
Term Lehrgedicht (goethe) and Lehrgedicht (bouterwek) are mapping candidates
Term Ode (goethe) and Ode (bouterwek) are mapping candidates
Term Roman (goethe) and Epische Romanze (bouterwek) are mapping candidates
Term Roman (goethe) and Roman (bouterwek) are mapping candidates
Term Roman (goethe) and Romantische Canzone (bouterwek) are mapping candidates
Term Romanze (goethe) and Epische Romanze (bouterwe

This clearly needs some manual intervention. I do not want to review the cases where both labels are single-word expressions, only the cases where a single-word expression is compared with a multi-word expression. I will rewrite the function below and leave the results above for reference.
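
One possible way to make the "single word versus multi-word" restriction explicit is to split the labels on whitespace instead of looking for a literal space character. The following is only a sketch under that assumption; `is_multi_word` and `is_single_vs_multi` are hypothetical helper names.

```python
def is_multi_word(label: str) -> bool:
    """True if the label consists of more than one whitespace-separated token."""
    return len(label.split()) > 1


def is_single_vs_multi(label_a: str, label_b: str) -> bool:
    """True if exactly one of the two labels is a multi-word expression."""
    return is_multi_word(label_a) != is_multi_word(label_b)


print(is_single_vs_multi("Fabel", "Äsopische Fabel"))             # True
print(is_single_vs_multi("Roman", "Romanze"))                     # False
print(is_single_vs_multi("Epische Romanze", "Lyrische Epistel"))  # False
```

Comparing the two checks with `!=` excludes pairs of two single-word labels as well as pairs of two multi-word labels, and `split()` also copes with tabs or repeated spaces.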


In [10]:
def match_by_string_containment(terms_1:list, terms_2:list, name_1:str, name_2:str):
    """Print containment candidates where at least one label is a multi-word expression."""
    for term_1 in terms_1:
        label_1 = term_1["label"].lower()
        for term_2 in terms_2:
            label_2 = term_2["label"].lower()
            # Only check pairs in which at least one label is a multi-word expression.
            # Testing for a space character is not perfect (a regex might be better),
            # but it is enough for testing here.
            # Note: a pair of two multi-word expressions also passes this check.
            if (" " in label_1) or (" " in label_2):

                # skip this pair (but keep checking the remaining terms) if the labels match exactly
                if label_1 == label_2:
                    continue

                if label_1 in label_2 or label_2 in label_1:
                    print(f"Term {term_1['label']} ({name_1}) and {term_2['label']} ({name_2}) are mapping candidates")

In [11]:
match_by_string_containment(goethe_terms, eschenburg_terms, "goethe", "eschenburg")

Term Drama (goethe) and Dramatische Dichtungsart (eschenburg) are mapping candidates
Term Erzählung (goethe) and Allegorische Erzählung (eschenburg) are mapping candidates
Term Erzählung (goethe) and Poetische Erzählung (eschenburg) are mapping candidates
Term Fabel (goethe) and Aesopische Fabel (eschenburg) are mapping candidates


In [12]:
match_by_string_containment(goethe_terms, bouterwek_terms, "goethe", "bouterwek")

Term Drama (goethe) and Dramatische Dichtungsart (bouterwek) are mapping candidates
Term Epistel (goethe) and Didaktische Epistel (bouterwek) are mapping candidates
Term Epistel (goethe) and Lyrische Epistel (bouterwek) are mapping candidates
Term Fabel (goethe) and Äsopische Fabel (bouterwek) are mapping candidates
Term Roman (goethe) and Epische Romanze (bouterwek) are mapping candidates
Term Roman (goethe) and Romantische Canzone (bouterwek) are mapping candidates
Term Romanze (goethe) and Epische Romanze (bouterwek) are mapping candidates


In [13]:
match_by_string_containment(bouterwek_terms, eschenburg_terms, "bouterwek", "eschenburg")

Term Didaktische Epistel (bouterwek) and Epistel (eschenburg) are mapping candidates
Term Epische Romanze (bouterwek) and Roman (eschenburg) are mapping candidates
Term Lyrische Epistel (bouterwek) and Epistel (eschenburg) are mapping candidates
Term Das Hirtengedicht und die idyllische Poesie (bouterwek) and Hirtengedicht (eschenburg) are mapping candidates
Term Romantische Canzone (bouterwek) and Roman (eschenburg) are mapping candidates
Term Dramatische Dichtungsart (bouterwek) and Drama (eschenburg) are mapping candidates


This needs to be done interactively.

In [14]:
def match_by_string_containment_interactive(terms_1:list, terms_2:list, name_1:str, name_2:str):
    """Ask the user to accept or reject each containment candidate and return all decisions."""
    results = []

    for term_1 in terms_1:
        label_1 = term_1["label"].lower()
        for term_2 in terms_2:
            label_2 = term_2["label"].lower()
            # Only check pairs in which at least one label is a multi-word expression.
            # Testing for a space character is not perfect (a regex might be better),
            # but it is enough for testing here.
            # Note: a pair of two multi-word expressions also passes this check.
            if (" " in label_1) or (" " in label_2):

                # skip this pair (but keep checking the remaining terms) if the labels match exactly
                if label_1 == label_2:
                    continue

                if label_1 in label_2 or label_2 in label_1:
                    print(f"Match concepts with labels {term_1['label']} ({name_1}) and {term_2['label']} ({name_2})?")
                    user_assesment = input("y/n")
                    if user_assesment in ("y", "n"):
                        if user_assesment == "y":
                            print("Will create mapping.\n")
                        else:
                            print("Will NOT create a mapping.\n")
                        # record accepted ("y") and rejected ("n") candidates alike
                        results.append({
                            "term1_label": term_1["label"],
                            "term2_label": term_2["label"],
                            "term1_id": term_1["id"],
                            "term2_id": term_2["id"],
                            "term1_source": name_1,
                            "term2_source": name_2,
                            "user_assesment": user_assesment,
                        })
                    else:
                        print("Not a valid input. Skipping this pair without creating a mapping.\n")

    return results
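
A function that calls `input()` directly is awkward to re-run or to test. One option, sketched here purely as an assumption, is to pass the question-asking step in as a callable so that recorded answers can be replayed. `review_candidates` and its `ask` parameter are hypothetical and operate on plain label pairs rather than on the full term dictionaries.

```python
def review_candidates(candidates: list, ask=input) -> list:
    """Ask for a y/n decision on each candidate pair and record the answer."""
    results = []
    for label_1, label_2 in candidates:
        answer = ask(f"Match {label_1} and {label_2}? y/n ").strip().lower()
        if answer in ("y", "n"):
            results.append({"term1_label": label_1,
                            "term2_label": label_2,
                            "user_assesment": answer})
        else:
            print("Not a valid input, skipping this pair.")
    return results


# Replay fixed answers instead of typing them again.
answers = iter(["y", "n"])
decisions = review_candidates(
    [("Drama", "Dramatische Dichtungsart"), ("Erzählung", "Allegorische Erzählung")],
    ask=lambda prompt: next(answers),
)
print(decisions)
```

With the default `ask=input` the function behaves interactively; with a replayed iterator the same decisions can be reproduced without user interaction.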

In [15]:
goethe_eschenburg_matches_by_containment = match_by_string_containment_interactive(goethe_terms, eschenburg_terms, "goethe", "eschenburg")

Match concepts with labels Drama (goethe) and Dramatische Dichtungsart (eschenburg)?


y/n y


Will create mapping.

Match concepts with labels Erzählung (goethe) and Allegorische Erzählung (eschenburg)?


y/n n


Will NOT create a mapping.

Match concepts with labels Erzählung (goethe) and Poetische Erzählung (eschenburg)?


y/n n


Will NOT create a mapping.

Match concepts with labels Fabel (goethe) and Aesopische Fabel (eschenburg)?


y/n y


Will create mapping.



In [16]:
goethe_eschenburg_matches_by_containment

[{'term1_label': 'Drama',
  'term2_label': 'Dramatische Dichtungsart',
  'term1_id': 'https://genre.clscor.io/goethe/drama',
  'term2_id': 'https://genre.clscor.io/eschenburg/dramatische_dichtungsart',
  'term1_source': 'goethe',
  'term2_source': 'eschenburg',
  'user_assesment': 'y'},
 {'term1_label': 'Erzählung',
  'term2_label': 'Allegorische Erzählung',
  'term1_id': 'https://genre.clscor.io/goethe/erzaehlung',
  'term2_id': 'https://genre.clscor.io/eschenburg/allegorische_erzaehlung',
  'term1_source': 'goethe',
  'term2_source': 'eschenburg',
  'user_assesment': 'n'},
 {'term1_label': 'Erzählung',
  'term2_label': 'Poetische Erzählung',
  'term1_id': 'https://genre.clscor.io/goethe/erzaehlung',
  'term2_id': 'https://genre.clscor.io/eschenburg/poetische_erzaehlung',
  'term1_source': 'goethe',
  'term2_source': 'eschenburg',
  'user_assesment': 'n'},
 {'term1_label': 'Fabel',
  'term2_label': 'Aesopische Fabel',
  'term1_id': 'https://genre.clscor.io/goethe/fabel',
  'term2_id':

In [44]:
goethe_bouterwek_matches_by_containment = match_by_string_containment_interactive(goethe_terms, bouterwek_terms, "goethe", "bouterwek")

Match concepts with labels Drama (goethe) and Dramatische Dichtungsart (bouterwek)?


y/n n


Will NOT create a mapping.

Match concepts with labels Epistel (goethe) and Didaktische Epistel (bouterwek)?


y/n n


Will NOT create a mapping.

Match concepts with labels Epistel (goethe) and Lyrische Epistel (bouterwek)?


y/n n


Will NOT create a mapping.

Match concepts with labels Fabel (goethe) and Äsopische Fabel (bouterwek)?


y/n y


Will create mapping.

Match concepts with labels Roman (goethe) and Epische Romanze (bouterwek)?


y/n n


Will NOT create a mapping.

Match concepts with labels Roman (goethe) and Romantische Canzone (bouterwek)?


y/n n


Will NOT create a mapping.

Match concepts with labels Romanze (goethe) and Epische Romanze (bouterwek)?


y/n y


Will create mapping.



In [45]:
goethe_bouterwek_matches_by_containment

[{'term1_label': 'Drama',
  'term2_label': 'Dramatische Dichtungsart',
  'term1_id': 'https://genre.clscor.io/goethe/drama',
  'term2_id': 'https://genre.clscor.io/bouterwek/dramatische_dichtungsart',
  'term1_source': 'goethe',
  'term2_source': 'bouterwek',
  'user_assesment': 'n'},
 {'term1_label': 'Epistel',
  'term2_label': 'Didaktische Epistel',
  'term1_id': 'https://genre.clscor.io/goethe/epistel',
  'term2_id': 'https://genre.clscor.io/bouterwek/didaktische_epistel',
  'term1_source': 'goethe',
  'term2_source': 'bouterwek',
  'user_assesment': 'n'},
 {'term1_label': 'Epistel',
  'term2_label': 'Lyrische Epistel',
  'term1_id': 'https://genre.clscor.io/goethe/epistel',
  'term2_id': 'https://genre.clscor.io/bouterwek/lyrische_epistel',
  'term1_source': 'goethe',
  'term2_source': 'bouterwek',
  'user_assesment': 'n'},
 {'term1_label': 'Fabel',
  'term2_label': 'Äsopische Fabel',
  'term1_id': 'https://genre.clscor.io/goethe/fabel',
  'term2_id': 'https://genre.clscor.io/boute

In [46]:
eschenburg_bouterwek_matches_by_containment = match_by_string_containment_interactive(eschenburg_terms, bouterwek_terms, "eschenburg", "bouterwek")

Match concepts with labels Drama (eschenburg) and Dramatische Dichtungsart (bouterwek)?


y/n n


Will NOT create a mapping.

Match concepts with labels Hirtengedicht (eschenburg) and Das Hirtengedicht und die idyllische Poesie (bouterwek)?


y/n y


Will create mapping.

Match concepts with labels Roman (eschenburg) and Epische Romanze (bouterwek)?


y/n n


Will NOT create a mapping.

Match concepts with labels Roman (eschenburg) and Romantische Canzone (bouterwek)?


y/n n


Will NOT create a mapping.

Match concepts with labels Epistel (eschenburg) and Didaktische Epistel (bouterwek)?


y/n n


Will NOT create a mapping.

Match concepts with labels Epistel (eschenburg) and Lyrische Epistel (bouterwek)?


y/n n


Will NOT create a mapping.



In [47]:
eschenburg_bouterwek_matches_by_containment

[{'term1_label': 'Drama',
  'term2_label': 'Dramatische Dichtungsart',
  'term1_id': 'https://genre.clscor.io/eschenburg/drama',
  'term2_id': 'https://genre.clscor.io/bouterwek/dramatische_dichtungsart',
  'term1_source': 'eschenburg',
  'term2_source': 'bouterwek',
  'user_assesment': 'n'},
 {'term1_label': 'Hirtengedicht',
  'term2_label': 'Das Hirtengedicht und die idyllische Poesie',
  'term1_id': 'https://genre.clscor.io/eschenburg/hirtengedicht',
  'term2_id': 'https://genre.clscor.io/bouterwek/hirtengedicht',
  'term1_source': 'eschenburg',
  'term2_source': 'bouterwek',
  'user_assesment': 'y'},
 {'term1_label': 'Roman',
  'term2_label': 'Epische Romanze',
  'term1_id': 'https://genre.clscor.io/eschenburg/roman',
  'term2_id': 'https://genre.clscor.io/bouterwek/epische_romanze',
  'term1_source': 'eschenburg',
  'term2_source': 'bouterwek',
  'user_assesment': 'n'},
 {'term1_label': 'Roman',
  'term2_label': 'Romantische Canzone',
  'term1_id': 'https://genre.clscor.io/eschenb

In [52]:
from rdflib import Graph, SKOS, URIRef

In [53]:
# Create the RDF triples and export the graphs
def export_close_match_on_containment(matchings:list):
    """Write the accepted matches as skos:closeMatch triples to two Turtle files, one per direction."""
    if not matchings:
        # nothing to export, and the source names cannot be determined
        return

    source_1_name = matchings[0]["term1_source"]
    source_2_name = matchings[0]["term2_source"]

    g_1 = Graph()
    g_2 = Graph()

    for item in matchings:
        # only candidates accepted by the user become mappings
        if item["user_assesment"] == "y":
            # add it to the first graph
            g_1.add(( URIRef(item["term1_id"]), SKOS.closeMatch, URIRef(item["term2_id"]) ))

            # add the inverse to the second graph
            g_2.add(( URIRef(item["term2_id"]), SKOS.closeMatch, URIRef(item["term1_id"]) ))

    # state the format explicitly so that the content matches the .ttl extension
    file_1_name = source_1_name + "_closeMatch_" + source_2_name + "_based_on_containment.ttl"
    g_1.serialize(destination="out/"+file_1_name, format="turtle")
    file_2_name = source_2_name + "_closeMatch_" + source_1_name + "_based_on_containment.ttl"
    g_2.serialize(destination="out/"+file_2_name, format="turtle")
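
To check which statements an export would contain without writing any files, the accepted decisions can be previewed in plain Python. This is a sketch that does not use rdflib: `preview_close_matches` is a hypothetical helper, the sample records reuse IDs from the output shown earlier, and both directions are printed together as N-Triples-style lines instead of being split into two graphs.

```python
SKOS_CLOSE_MATCH = "http://www.w3.org/2004/02/skos/core#closeMatch"


def preview_close_matches(matchings: list) -> list:
    """Return both directions of every accepted match as N-Triples-style lines."""
    lines = []
    for item in matchings:
        if item["user_assesment"] != "y":
            continue  # rejected candidates are not exported
        subject, obj = item["term1_id"], item["term2_id"]
        lines.append(f"<{subject}> <{SKOS_CLOSE_MATCH}> <{obj}> .")
        lines.append(f"<{obj}> <{SKOS_CLOSE_MATCH}> <{subject}> .")
    return lines


sample = [
    {"term1_id": "https://genre.clscor.io/goethe/drama",
     "term2_id": "https://genre.clscor.io/eschenburg/dramatische_dichtungsart",
     "user_assesment": "y"},
    {"term1_id": "https://genre.clscor.io/goethe/erzaehlung",
     "term2_id": "https://genre.clscor.io/eschenburg/allegorische_erzaehlung",
     "user_assesment": "n"},
]
for line in preview_close_matches(sample):
    print(line)
```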


In [55]:
# goethe_bouterwek_matches_by_containment
# goethe_eschenburg_matches_by_containment
# eschenburg_bouterwek_matches_by_containment
export_close_match_on_containment(goethe_bouterwek_matches_by_containment)
export_close_match_on_containment(goethe_eschenburg_matches_by_containment)
export_close_match_on_containment(eschenburg_bouterwek_matches_by_containment)

See the next notebook for edit distance.