# Slightly more advanced mapping techniques [Edit Distance]

In the other notebook we mapped the concepts of Bouterwek, Eschenburg and Goethe based on a string comparison: whenever two labels matched exactly, we created a `skos:closeMatch`.
Obviously, this 1:1 string comparison is not very robust. Spelling variants ("Äsopische Fabel"/"Aesopische Fabel") are not matched, and neither is a multi-word expression that should map to the single-word label of a probably corresponding concept.
For these two scenarios we will develop mapping mechanisms:

* string containment: we check whether the label of concept A is contained in the label of concept B and vice versa. This should help to identify candidates where a multi-word label maps to a single-word label.
* edit distance: we use the Levenshtein distance to find spelling variants of labels.

We will probably have to combine these approaches into a single matching algorithm at a later stage.
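
The string containment idea is not implemented in this notebook. A minimal sketch of how it could look, assuming every term is a dict with a `label` key as in the JSON files used here; the function name `find_containment_candidates` and the sample labels are made up for illustration:

```python
def find_containment_candidates(terms_1: list, terms_2: list) -> list:
    """Return (label_1, label_2) pairs where all words of one label occur in the other."""
    candidates = []
    for term_1 in terms_1:
        words_1 = set(term_1["label"].lower().split())
        for term_2 in terms_2:
            words_2 = set(term_2["label"].lower().split())
            if words_1 == words_2:
                # identical labels are already covered by the exact comparison
                continue
            # compare whole words, so that "Ode" is not found inside "Episode";
            # word order and adjacency are ignored in this sketch
            if words_1 <= words_2 or words_2 <= words_1:
                candidates.append((term_1["label"], term_2["label"]))
    return candidates


# hypothetical sample data, not taken from the term files
sample_1 = [{"label": "Fabel"}]
sample_2 = [{"label": "Aesopische Fabel"}, {"label": "Ode"}]
print(find_containment_candidates(sample_1, sample_2))
```

Comparing sets of words rather than raw substrings is a design choice: a plain `in` test on the strings would also report short labels that merely happen to occur inside a longer word.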

In [1]:
!ls out | grep terms

bouterwek_terms.json
eschenburg_terms.json
goethe_dichtarten_terms.json


In [2]:
import json

def load_terms(filename:str):
    """Load a list of terms from a JSON file in the out/ folder."""
    filepath = "out/" + filename

    # the labels contain umlauts, so read the file explicitly as UTF-8
    with open(filepath, "r", encoding="utf-8") as f:
        parsed = json.load(f)

    return parsed

In [3]:
# load the terms from file
bouterwek_terms = load_terms("bouterwek_terms.json")
eschenburg_terms = load_terms("eschenburg_terms.json")
goethe_terms = load_terms("goethe_dichtarten_terms.json") # TODO: drop "dichtarten" from the file name

## Edit Distance

In [4]:
from Levenshtein import distance

In [5]:
distance("Äsopische Fabel", "Aesopische Fabel")

2

In [6]:
distance("Satire", "Satyre")

1

In [7]:
distance("Sonett", "Cantate")

5
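
To make these numbers less of a black box, here is a minimal pure-Python sketch of the Levenshtein distance, computed row by row with dynamic programming. It is only meant as an illustration; the function name `levenshtein` is made up, and the notebook itself keeps using `distance` from the `Levenshtein` package.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions turning a into b."""
    # row for the empty prefix of a: reaching b[:j] takes j insertions
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        # first column: turning a[:i] into the empty string takes i deletions
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if both characters are equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("Satire", "Satyre"))   # 1
print(levenshtein("Sonett", "Cantate"))  # 5
```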

In [8]:
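def match_by_levenshtein_brute_force(terms_1:list, terms_2:list, name_1:str, name_2:str, max_distance:int):
    """Print every pair of labels whose edit distance is at most max_distance."""
    for term_1 in terms_1:
        for term_2 in terms_2:

            # An exact match was already mapped in the other notebook, so we stop
            # looking for term_1. Note: candidates listed before the exact match
            # in terms_2 have already been printed at this point.
            if term_1["label"] == term_2["label"]:
                break

            # compare case-insensitively
            edit_distance = distance(term_1["label"].lower(), term_2["label"].lower())

            if edit_distance <= max_distance:
                print(f"Term {term_1['label']} ({name_1}) and {term_2['label']} ({name_2}) are mapping candidates")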

In [9]:
match_by_levenshtein_brute_force(bouterwek_terms,eschenburg_terms, "bouterwek", "eschenburg",2)

Term Ode (bouterwek) and Oper (eschenburg) are mapping candidates
Term Äsopische Fabel (bouterwek) and Aesopische Fabel (eschenburg) are mapping candidates


In [10]:
match_by_levenshtein_brute_force(bouterwek_terms,goethe_terms, "bouterwek", "goethe",2)

Term Oper (bouterwek) and Ode (goethe) are mapping candidates


In [11]:
match_by_levenshtein_brute_force(eschenburg_terms,goethe_terms, "eschenburg", "goethe",2)

Term Kantate (eschenburg) and Cantate (goethe) are mapping candidates
Term Oper (eschenburg) and Ode (goethe) are mapping candidates
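
The pair "Ode"/"Oper" shows up in every run above because an absolute threshold of 2 is generous for labels of only three or four characters. One possible refinement, sketched here as an assumption and not something this notebook implements, is to relate the edit distance to the length of the longer label. The function name `relative_distance` and the cut-off of 0.25 are illustrative choices; the distances are those of the lower-cased candidate pairs reported above.

```python
def relative_distance(edit_distance: int, label_1: str, label_2: str) -> float:
    """Edit distance divided by the length of the longer label (0.0 means identical)."""
    longest = max(len(label_1), len(label_2))
    return edit_distance / longest if longest else 0.0


# (label_1, label_2, edit distance of the lower-cased labels)
candidates = [
    ("Ode", "Oper", 2),
    ("Äsopische Fabel", "Aesopische Fabel", 2),
    ("Kantate", "Cantate", 1),
]

MAX_RELATIVE_DISTANCE = 0.25  # illustrative cut-off, not tuned on the data

kept = [
    (label_1, label_2)
    for label_1, label_2, edit_distance in candidates
    if relative_distance(edit_distance, label_1, label_2) <= MAX_RELATIVE_DISTANCE
]
print(kept)
```

With this cut-off "Ode"/"Oper" (2/4 = 0.5) is dropped, while the two spelling variants (2/16 and 1/7) are kept. Whether a relative threshold works for the full term lists would have to be checked against the data.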


In [15]:
# Interactive variant: the user accepts or rejects each candidate pair
def match_by_levenshtein_interactive(terms_1:list, terms_2:list, name_1:str, name_2:str, max_distance:int):
    """Ask the user to accept ("y") or reject ("n") each candidate pair and return all decisions."""
    results = []
    for term_1 in terms_1:
        for term_2 in terms_2:

            # an exact match ends the search for term_1 (exact matches were mapped in the other notebook)
            if term_1["label"] == term_2["label"]:
                break

            edit_distance = distance(term_1["label"].lower(), term_2["label"].lower())
            if edit_distance > max_distance:
                continue

            print(f"Match concepts with labels {term_1['label']} ({name_1}) and {term_2['label']} ({name_2})?")
            user_assesment = input("y/n")

            if user_assesment not in ("y", "n"):
                print("Not a valid input. Will continue without recording a decision.\n")
                continue

            if user_assesment == "y":
                print("Will create mapping.\n")
            else:
                print("Will NOT create a mapping.\n")

            # accepted and rejected pairs are both recorded;
            # the export step reads the decision from the key "user_assesment"
            results.append({
                "term1_label": term_1["label"],
                "term2_label": term_2["label"],
                "term1_id": term_1["id"],
                "term2_id": term_2["id"],
                "term1_source": name_1,
                "term2_source": name_2,
                "user_assesment": user_assesment,
            })

    return results

In [16]:
bouterwek_eschenburg_matches_by_edit_distance = match_by_levenshtein_interactive(
    bouterwek_terms,
    eschenburg_terms, 
    "bouterwek", 
    "eschenburg",
    2)

Match concepts with labels Ode (bouterwek) and Oper (eschenburg)?


y/n n


Will NOT create a mapping.

Match concepts with labels Äsopische Fabel (bouterwek) and Aesopische Fabel (eschenburg)?


y/n y


Will create mapping.



In [17]:
bouterwek_eschenburg_matches_by_edit_distance

[{'term1_label': 'Ode',
  'term2_label': 'Oper',
  'term1_id': 'https://genre.clscor.io/bouterwek/ode',
  'term2_id': 'https://genre.clscor.io/eschenburg/oper',
  'term1_source': 'bouterwek',
  'term2_source': 'eschenburg',
  'user_assesment': 'n'},
 {'term1_label': 'Äsopische Fabel',
  'term2_label': 'Aesopische Fabel',
  'term1_id': 'https://genre.clscor.io/bouterwek/aesopische_fabel',
  'term2_id': 'https://genre.clscor.io/eschenburg/aesopische_fabel',
  'term1_source': 'bouterwek',
  'term2_source': 'eschenburg',
  'user_assesment': 'y'}]

In [18]:
bouterwek_goethe_matches_by_edit_distance = match_by_levenshtein_interactive(
    bouterwek_terms,
    goethe_terms, 
    "bouterwek", 
    "goethe",
    2)

Match concepts with labels Oper (bouterwek) and Ode (goethe)?


y/n n


Will NOT create a mapping.



In [20]:
bouterwek_goethe_matches_by_edit_distance

[{'term1_label': 'Oper',
  'term2_label': 'Ode',
  'term1_id': 'https://genre.clscor.io/bouterwek/oper',
  'term2_id': 'https://genre.clscor.io/goethe/ode',
  'term1_source': 'bouterwek',
  'term2_source': 'goethe',
  'user_assesment': 'n'}]

In [19]:
eschenburg_goethe_matches_by_edit_distance = match_by_levenshtein_interactive(
    eschenburg_terms,
    goethe_terms, 
    "eschenburg", 
    "goethe",
    2)

Match concepts with labels Kantate (eschenburg) and Cantate (goethe)?


y/n y


Will create mapping.

Match concepts with labels Oper (eschenburg) and Ode (goethe)?


y/n n


Will NOT create a mapping.



In [21]:
eschenburg_goethe_matches_by_edit_distance

[{'term1_label': 'Kantate',
  'term2_label': 'Cantate',
  'term1_id': 'https://genre.clscor.io/eschenburg/kantate',
  'term2_id': 'https://genre.clscor.io/goethe/cantate',
  'term1_source': 'eschenburg',
  'term2_source': 'goethe',
  'user_assesment': 'y'},
 {'term1_label': 'Oper',
  'term2_label': 'Ode',
  'term1_id': 'https://genre.clscor.io/eschenburg/oper',
  'term2_id': 'https://genre.clscor.io/goethe/ode',
  'term1_source': 'eschenburg',
  'term2_source': 'goethe',
  'user_assesment': 'n'}]

In [None]:
# Bouterwek and Goethe have no confirmed match, so nothing is exported for this pair.

# For these two result lists we need to create the SKOS mappings:
#bouterwek_eschenburg_matches_by_edit_distance
#eschenburg_goethe_matches_by_edit_distance

In [22]:
from rdflib import Graph, SKOS, URIRef

In [23]:
# Create the RDF triples and export one graph per direction
def export_close_match_on_levenshtein(matchings:list):
    """Write the accepted matches as skos:closeMatch triples to two Turtle files in out/."""
    if not matchings:
        # without any reviewed pair there are no source names to build the file names from
        return

    source_1_name = matchings[0]["term1_source"]
    source_2_name = matchings[0]["term2_source"]

    g_1 = Graph()
    g_2 = Graph()

    for item in matchings:
        # only pairs the user accepted become mappings
        if item["user_assesment"] == "y":
            # add it to the first graph
            g_1.add(( URIRef(item["term1_id"]), SKOS.closeMatch, URIRef(item["term2_id"]) ))

            # add the inverse to the second graph
            g_2.add(( URIRef(item["term2_id"]), SKOS.closeMatch, URIRef(item["term1_id"]) ))

    # state the format explicitly so that the content always matches the .ttl extension
    file_1_name = source_1_name + "_closeMatch_" + source_2_name + "_based_on_levenshtein.ttl"
    g_1.serialize(destination="out/"+file_1_name, format="turtle")
    file_2_name = source_2_name + "_closeMatch_" + source_1_name + "_based_on_levenshtein.ttl"
    g_2.serialize(destination="out/"+file_2_name, format="turtle")

In [24]:
export_close_match_on_levenshtein(bouterwek_eschenburg_matches_by_edit_distance)

In [25]:
export_close_match_on_levenshtein(eschenburg_goethe_matches_by_edit_distance)