In [1]:
# TODO from insights by prof
# don't save in JSON; keep the data locally in a variable
# only match by labels (no other matching needed: no properties, superclasses, etc.)
# just save labels in a dict where the key is the class URI and the value is the label; if memory is not a problem, save into two structures, the second with the label as key and the class URI as value
# 5 - 10 min for the human and mouse ontologies is reasonable

# Output: in rdf format (see reference_anatomy) just put final matches in there above the threshold (defined by the user). As relation use owl:equivalentClass
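
A minimal sketch of the output step noted above, assuming the final matches are held as `(class_uri_1, class_uri_2, score)` tuples. The helper name `matches_to_ntriples` and the tuple layout are hypothetical, and N-Triples is used only because it needs no extra library; the layout of `reference_anatomy` is not reproduced here. Matches at or above the user-defined threshold are kept.

```python
OWL_EQUIVALENT_CLASS = "http://www.w3.org/2002/07/owl#equivalentClass"

def matches_to_ntriples(matches, threshold):
    """Serialise matches scoring at or above `threshold` as N-Triples lines."""
    lines = []
    for uri1, uri2, score in matches:
        if score >= threshold:
            lines.append(f"<{uri1}> <{OWL_EQUIVALENT_CLASS}> <{uri2}> .")
    return "\n".join(lines)

# Two result rows taken from the Jaccard run further down in this notebook
example_matches = [
    ("http://mouse.owl#MA_0000062", "http://human.owl#NCI_C12669", 1.0),
    ("http://mouse.owl#MA_0001424", "http://human.owl#NCI_C33501", 0.25),
]
print(matches_to_ntriples(example_matches, threshold=0.8))
```

Only the first pair survives a threshold of 0.8; the low-scoring vertebra match is dropped.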

# Project Description
The goal of the project is to develop a simple yet effective ontology alignment framework in Python that focuses on lexical similarity matching. The framework will utilize both string matching techniques and the semantic capabilities of large language models to identify potential alignments between entities (such as classes) in two different ontologies.

### Objectives
1. **Develop an ontology alignment framework** that can process and compare ontologies based on textual content.
2. **Implement lexical similarity matching** using both basic string matching techniques and advanced semantic analysis with embeddings from LLMs.
3. **Output alignments with confidence scores**, enabling users to understand and evaluate the quality and reliability of the suggested alignments.

### Steps to Perform

#### Step 1: Ontology Parsing
- **Goal**: Load and parse the ontologies to be aligned.
- **Tasks**:
  - Utilize libraries like `rdflib` or `owlready2` to read ontology files.
  - Extract relevant textual information (e.g., class names, labels, descriptions).

#### Step 2: Lexical Similarity Matching
This step is divided into two sub-steps: string matching and embeddings matching.

##### a. String Matching
- **Goal**: Implement direct and fuzzy string comparison techniques to find matches based on textual similarity.
- **Tasks**:
  - Perform normalization (e.g., lowercasing, removing special characters).
  - Use string comparison methods (exact match, substring search, edit distance).
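
As a rough illustration of the edit-distance task, the sketch below computes the Levenshtein distance in plain Python and rescales it to a similarity in [0, 1], so that it can be compared with scores where higher is better. The function names `edit_distance` and `edit_similarity` are our own for this example; the notebook itself uses the `Levenshtein` package further down.

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def edit_similarity(a, b):
    """Map the distance to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest

print(edit_distance("kitten", "sitting"))   # 3
print(edit_similarity("aorta", "aorta"))    # 1.0
```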

##### b. Embeddings Matching Using LLMs
- **Goal**: Use the semantic context provided by LLMs to match terms based on their meanings.
- **Tasks**:
  - Generate embeddings for the textual content of each ontology using models from the Hugging Face Transformers library.
  - Calculate similarity scores between embeddings (e.g., using cosine similarity).
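
The similarity step can be sketched independently of any particular model. The example below assumes each label has already been turned into a vector; the three-dimensional vectors are placeholders invented for illustration, not real model output, and `cosine` is a hypothetical helper name.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Placeholder vectors standing in for label embeddings
toy_embeddings = {
    "aorta": [1.0, 0.0, 1.0],
    "artery": [1.0, 0.0, 0.0],
    "tail": [0.0, 1.0, 0.0],
}

print(cosine(toy_embeddings["aorta"], toy_embeddings["artery"]))
print(cosine(toy_embeddings["aorta"], toy_embeddings["tail"]))
```

With real embeddings the vectors would have hundreds of dimensions, but the computation stays the same.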

#### Step 3: Combining and Filtering
- **Goal**: Aggregate results from both matching techniques and refine the output.
- **Tasks**:
  - Combine scores from string and embeddings matching.
  - Apply thresholds to filter out matches with low confidence.
  - Optionally, use simple structural checks to add confidence to matches (e.g., matched entities have similar parent classes).
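
One simple way to combine and filter, sketched under the assumption that both scores are similarities in [0, 1]: take a weighted average and keep the pairs that reach a threshold. The names `combine_scores` and `filter_matches`, the default weight, and the candidate layout are all hypothetical.

```python
def combine_scores(string_score, embedding_score, string_weight=0.5):
    """Weighted average of two similarity scores that both lie in [0, 1]."""
    return string_weight * string_score + (1.0 - string_weight) * embedding_score

def filter_matches(candidates, threshold, string_weight=0.5):
    """Keep candidate pairs whose combined score reaches the threshold."""
    kept = []
    for pair, string_score, embedding_score in candidates:
        score = combine_scores(string_score, embedding_score, string_weight)
        if score >= threshold:
            kept.append((pair, score))
    return kept

# Illustrative candidates: (label pair, string score, embedding score)
candidates = [
    (("aorta", "aorta"), 1.0, 1.0),
    (("tail", "hand"), 0.0, 0.5),
]
print(filter_matches(candidates, threshold=0.8))
```

A weighted average keeps the result in [0, 1] as long as the weight does, which makes a single user-defined threshold meaningful across both matchers.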

#### Step 4: Output and Evaluation
- **Goal**: Output the alignment results and provide means for evaluation.
- **Tasks**:
  - Format the output in a structured way (e.g., JSON, CSV) that lists entity pairs and their matching scores.
  - If possible, evaluate the effectiveness using known benchmarks or test cases to calculate precision, recall, and F1-score.
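
For the evaluation task, precision, recall, and F1 can be computed from two sets of matched pairs. This is a sketch that assumes an alignment is represented as a set of `(class_1, class_2)` tuples; the function name `evaluate_alignment` and the single-letter identifiers in the example are placeholders.

```python
def evaluate_alignment(predicted, reference):
    """Return (precision, recall, f1) for predicted pairs against reference pairs."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Placeholder pairs: 2 of 3 predictions are correct, 2 of 4 reference pairs are found
predicted = {("a", "x"), ("b", "y"), ("c", "z")}
reference = {("a", "x"), ("b", "y"), ("d", "w"), ("e", "v")}
print(evaluate_alignment(predicted, reference))
```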

### Summary
The project is centered on creating a practical tool for ontology matching, focusing on textual content using both conventional and advanced NLP techniques. By combining string-based and semantic-based approaches, the framework aims to provide robust alignments that are supported by both literal and contextual text similarities. This dual approach enhances the capability of the alignment process, making it more flexible and potentially more accurate than using only one method.

## Notes

- [This paper](https://arxiv.org/pdf/2309.07172) suggests that Flan-T5-XXL might perform best: [Hugging face link to model](https://huggingface.co/google/flan-t5-xxl)

In [2]:
# imports
import json
from owlready2 import *
import rdflib
import pandas as pd
from collections import OrderedDict, defaultdict

**rdflib vs owlready2:**

Interchangeability: Given that OWL is an application of RDF, tools that can parse RDF/XML can generally handle .owl files, and vice versa, provided that the ontology-specific constructs are understood by the tool. This is why libraries like rdflib, which are capable of parsing RDF, are suitable for handling OWL files serialized in RDF/XML format.

Flexibility: Choosing to work with rdflib for general RDF handling and owlready2 for specific ontology manipulations where needed is a flexible approach. It allows you to leverage the strengths of both libraries—rdflib for its robust RDF manipulation and SPARQL querying capabilities, and owlready2 for its ontology-specific features like reasoning and direct manipulation of classes and properties.

# Load/Parse ontologies

In [3]:
# input paths
onto1_path_in = "test_ontologies/mouse.owl"
onto2_path_in = "test_ontologies/human.owl"

# output paths
onto1_path_out = "ontology_jsons/onto1.json"
onto2_path_out = "ontology_jsons/onto2.json"

In [4]:
def load_ontology(file_path):
    """
    Loads an ontology from a given file path, which can be in RDF (.rdf) or OWL (.owl) format.
    
    Args:
    file_path (str): The file path to the ontology file.
    
    Returns:
    rdflib.Graph: A graph containing the ontology data.
    """
    # Create a new RDF graph
    graph = rdflib.Graph()

    # Bind some common namespaces to the graph
    namespaces = {
        "rdf": rdflib.namespace.RDF,
        "rdfs": rdflib.namespace.RDFS,
        "owl": rdflib.namespace.OWL,
        "xsd": rdflib.namespace.XSD
    }
    for prefix, namespace in namespaces.items():
        graph.namespace_manager.bind(prefix, namespace)

    # Attempt to parse the file
    try:
        graph.parse(file_path, format=rdflib.util.guess_format(file_path))
        print(f"Successfully loaded ontology from {file_path}")
    except Exception as e:
        print(f"Failed to load ontology from {file_path}: {e}")
        return None

    return graph

In [5]:
# load ontologies
onto1_graph = load_ontology(onto1_path_in)
onto2_graph = load_ontology(onto2_path_in)
print(onto1_graph, onto2_graph)

Successfully loaded ontology from test_ontologies/mouse.owl
Successfully loaded ontology from test_ontologies/human.owl
[a rdfg:Graph;rdflib:storage [a rdflib:Store;rdfs:label 'Memory']]. [a rdfg:Graph;rdflib:storage [a rdflib:Store;rdfs:label 'Memory']].


In [6]:
def preprocess_label(label):
    return str(label).replace("_", " ").strip(" ,.").lower()

### New approach without json and instead dicts

In [12]:
def extract_ontology_details_to_dict(graph):
    # Query for classes
    class_query = """
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?class ?label ?label_dt ?label_lang
    WHERE {
        ?class rdf:type owl:Class.
        OPTIONAL { ?class rdfs:label ?label. BIND(datatype(?label) AS ?label_dt) BIND(lang(?label) AS ?label_lang) }
    }
    """
    classes = graph.query(class_query)
    ontology_labels_dict = OrderedDict()
    labels_list = []

    # Process class results
    for row in classes:
        class_uri, label, label_dt, label_lang = row
        if label is None: # OPTIONAL left the label unbound; str(None) would otherwise become the label "none"
            continue
        class_key = str(class_uri)
        label_str = preprocess_label(label)
        if label_str not in ontology_labels_dict:
            ontology_labels_dict[label_str] = class_key
            labels_list.append(label_str)
            
    
    # TODO implement matching with properties

    return ontology_labels_dict, labels_list

In [13]:
onto1_dict, onto1_list = extract_ontology_details_to_dict(onto1_graph)
onto2_dict, onto2_list = extract_ontology_details_to_dict(onto2_graph)

In [14]:
def find_duplicates(ordered_dict):
    # Step 1: Count occurrences of each value
    value_counts = defaultdict(int)
    for key, value in ordered_dict.items():
        value_counts[value] += 1

    # Step 2: Filter to find values that appear more than once
    duplicates = {value for value, count in value_counts.items() if count > 1}

    # Step 3: Collect keys for these duplicate values
    duplicate_keys = {key: value for key, value in ordered_dict.items() if value in duplicates}

    return duplicate_keys

# Find duplicates
onto1_duplicate_keys = find_duplicates(onto1_dict)
print("Duplicate keys and values:", onto1_duplicate_keys)
# => filtering for multiple labels works

Duplicate keys and values: {'respiratory system epithelium': 'http://mouse.owl#MA_0001823', '2 respiratory system epithelium': 'http://mouse.owl#MA_0001823'}


some more cleaning ideas:
- remove non-meaningful classes or properties

## 3. Matching

### Required libraries

In [15]:
#!pip install python-Levenshtein

In [16]:
#!pip install scikit-learn

In [17]:
#!pip install linktransformer

### 3.1. String Matching

For string matching we will implement five different metrics that the user can then choose via a parameter when calling the method.

The metrics we will use are:
- Levenshtein distance
- Jaccard Similarity
- Cosine Similarity
- TF-IDF
- LinkTransformer

In [18]:
import Levenshtein
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import jaccard_score
import linktransformer as lt

def levenshtein_distance(str1, str2):
    return Levenshtein.distance(str1, str2)

def calc_cosine_similarity(str1, str2):
    vectorizer = CountVectorizer()
    count_matrix = vectorizer.fit_transform([str1, str2])
    return cosine_similarity(count_matrix)[0][1]

def jaccard_similarity(str1, str2):
    # Tokenize the strings into sets of words
    set1 = set(str1.split())
    set2 = set(str2.split())
    
    # Find the intersection and union of the two sets
    intersection = set1.intersection(set2)
    union = set1.union(set2)
    
    # Calculate the Jaccard score
    if not union:  # Handle the edge case where both strings might be empty
        return 0.0
    return len(intersection) / len(union)

# def calculate_tfidf_cosine_similarity(str1, str2): # TODO adjust for new workflow
#     vectorizer = TfidfVectorizer()
#     tfidf = vectorizer.fit_transform([str1, str2])
#     # Calculate the cosine similarity between the two vectors
#     # tfidf_matrix[0:1] gets the tf-idf vector for the first document
#     # tfidf_matrix[1:2] gets the tf-idf vector for the second document
#     sim_score = cosine_similarity(tfidf[0:1], tfidf[1:2])

#     # sim_score is an array of shape (1,1); we return the element at [0][0]
#     return sim_score[0][0]



In [19]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def cosine_vectorize_labels(labels):
    """
    Converts a list of labels into token-count vectors using CountVectorizer.
    
    Args:
    labels (list): List of all labels from both ontologies.
    
    Returns:
    CountVectorizer, scipy.sparse.csr.csr_matrix: The vectorizer and the count matrix.
    """
    vectorizer = CountVectorizer()
    count_matrix = vectorizer.fit_transform(labels)
    return vectorizer, count_matrix

def cosine_compare_labels(count_matrix, index1, index2):
    """
    Computes the cosine similarity between two labels based on their count vector indices.
    
    Args:
    count_matrix (scipy.sparse.csr.csr_matrix): The matrix containing the count vectors.
    index1, index2 (int): Indices of the labels to compare.
    
    Returns:
    float: Cosine similarity score.
    """
    return cosine_similarity(count_matrix[index1:index1+1], count_matrix[index2:index2+1])[0][0]


def execute_cosine_string_matching(label_list1, label_list2):
    # Combine labels and vectorize them
    all_labels = label_list1 + label_list2
    vectorizer, count_matrix = cosine_vectorize_labels(all_labels)
    
    # Example comparison between the first label of ontology 1 and the first label of ontology 2
    similarity_score = cosine_compare_labels(count_matrix, 0, len(label_list1))
    print(f"Similarity score between '{label_list1[0]}' and '{label_list2[0]}': {similarity_score}")

In [20]:
execute_cosine_string_matching(onto1_list, onto2_list)

Similarity score between 'mouse anatomy' and 'anatomic structure system or substance': 0.0


In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_vectorize_labels(labels):
    """
    Converts a list of labels into TF-IDF vectors using TfidfVectorizer.
    
    Args:
    labels (list): List of all labels from both ontologies.
    
    Returns:
    TfidfVectorizer, scipy.sparse.csr.csr_matrix: The vectorizer and the TF-IDF matrix.
    """
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(labels)
    return vectorizer, tfidf_matrix

def tfidf_compare_labels(tfidf_matrix, index1, index2):
    """
    Computes the cosine similarity between two labels based on their TF-IDF vector indices.
    
    Args:
    tfidf_matrix (scipy.sparse.csr.csr_matrix): The matrix containing the TF-IDF vectors.
    index1, index2 (int): Indices of the labels to compare.
    
    Returns:
    float: Cosine similarity score.
    """
    return cosine_similarity(tfidf_matrix[index1:index1+1], tfidf_matrix[index2:index2+1])[0][0]


def execute_tfidf_string_matching(label_list1, label_list2):
    # Combine labels and vectorize them
    all_labels = label_list1 + label_list2
    vectorizer, tfidf_matrix = tfidf_vectorize_labels(all_labels)
    
    # Example comparison between the first label of ontology 1 and the first label of ontology 2
    similarity_score = tfidf_compare_labels(tfidf_matrix, 0, len(label_list1))
    print(f"Similarity score between '{label_list1[0]}' and '{label_list2[0]}': {similarity_score}")

In [22]:
execute_tfidf_string_matching(onto1_list, onto2_list)

Similarity score between 'mouse anatomy' and 'anatomic structure system or substance': 0.0


In [23]:
# Unfortunately, this does not work at all because the models are too big: the kernel crashes

"""
def linktransformer_comparison(onto1, onto2):
    # Make pandas dataframes (can only compare dataframes)
    # Specify the record_path to expand the labels and superclasses if needed
    df_onto1 = pd.json_normalize(onto1, 'labels', ['uri', 'type', 'superclasses'], 
                    record_prefix='label_')
    df_onto2 = pd.json_normalize(onto2, 'labels', ['uri', 'type', 'superclasses'], 
                    record_prefix='label_')
    
    # Comparison using the most downloaded LLM: 
    # models tested: sentence-transformers/all-MiniLM-L6-v2 -> crashes
    # dell-research-harvard/lt-wikidata-comp-multi -> crashes
    df_matched = lt.merge(df_onto1, df_onto2, on="label_value", merge_type="1:1", suffixes=('_onto1', '_onto2'), model='dell-research-harvard/lt-wikidata-comp-multi')

    return df_matched

onto_matched = linktransformer_comparison(onto1_data, onto2_data)
"""


In [24]:
def execute_string_matching(metric, data1, data2):
    """
    Executes the selected matching metric on the provided data.

    Args:
    metric (str): Name of the metric to use:
                  'Levenshtein' for Levenshtein Distance,
                  'Jaccard' for Jaccard Similarity,
                  'LinkTransformer' for Link Transformer.
    data1, data2 (str): The data strings to compare.

    Returns:
    result: The result of the chosen metric computation.
    """
    if metric == 'Levenshtein':
        return levenshtein_distance(data1, data2)
    elif metric == 'Jaccard':
        return jaccard_similarity(data1, data2)
    elif metric == 'LinkTransformer':
        raise NotImplementedError("LinkTransformer matching is not implemented yet") # TODO implement or remove
    else:
        raise ValueError("Invalid metric selection")

In [25]:
len(onto1_dict)

2739

In [26]:
import time

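def measure_time(method, *args, **kwargs):
    """Call `method` with the given arguments, print the elapsed time, and return its result."""
    start_time = time.time()
    result = method(*args, **kwargs) # actually call the function being timed
    end_time = time.time()
    elapsed_time = end_time - start_time
    print("Elapsed time:", elapsed_time, "seconds")
    return result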




In [27]:
# load ontologies
onto1_graph = load_ontology(onto1_path_in)
onto2_graph = load_ontology(onto2_path_in)

onto1_dict, onto1_list = extract_ontology_details_to_dict(onto1_graph)
onto2_dict, onto2_list = extract_ontology_details_to_dict(onto2_graph)

print(len(onto1_list))
print(len(onto2_list))

Successfully loaded ontology from test_ontologies/mouse.owl
Successfully loaded ontology from test_ontologies/human.owl
2739
3299


In [28]:
def match_ontologies(onto1_path_in, onto2_path_in, metric, bidirectional=False):
    # load ontologies
    onto1_graph = load_ontology(onto1_path_in)
    onto2_graph = load_ontology(onto2_path_in)
    
    onto1_dict, onto1_list = extract_ontology_details_to_dict(onto1_graph)
    onto2_dict, onto2_list = extract_ontology_details_to_dict(onto2_graph)
    
    print(onto1_list)
    print(onto2_list)
    
    labels_already_tested_labels = {} # dict to store which labels (of ontology 2) were already tested for each label (of ontology 1) => necessary to avoid an infinite loop
    
    for label in onto1_list:
        labels_already_tested_labels[label] = []
            
    onto2_used_classes = {}
    
    if metric == "Cosine":
        execute_cosine_string_matching(onto1_list, onto2_list)
        # TODO extend so it return scores etc. and also can be mapped to nodes for next step
    elif metric == "TF-IDF":
        execute_tfidf_string_matching(onto1_list, onto2_list)
        # TODO extend so it return scores etc. and also can be mapped to nodes for next step
    else:
        class_results = {}

        operation_total = len(onto1_list) * len(onto2_list) # number of pairwise label comparisons
        print(operation_total)
                
        while onto1_list: # loop over labels of ontology 1 until empty
            print(len(onto1_list))
            label1 = onto1_list.pop() # remove the last element in the list => removing the last (instead of first) makes things easier and less error prone
            # labels that got added again cause a better match was found (see later step) will be appended to the end and therefore handled immediately

            # Match from Ontology 1 to Ontology 2
            label_result = [label1, "", 0]
            best_score = 0
            already_tested_labels = labels_already_tested_labels[label1]
            # print("Current label: ", label1)
            # print("Aready tested labels: ", already_tested_labels)
            for label2 in onto2_list:
                if label2 not in already_tested_labels: # check that label wasn't already checked in previous run
                    matching_score = execute_string_matching(metric, label1, label2) # calculate string matching score
                    # If a perfect match is found, stop iterating over labels for this entry
                    if (metric == 'Jaccard' and matching_score == 1) or (metric == 'Levenshtein' and matching_score == 0): # handle perfect match
                        best_score = matching_score
                        label_result = [label1, label2, best_score]
                        break # stop searching for matches cause perfect match found
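                    # Check if this score beats the best one found so far for this label
                    # TODO best_score starts at 0, so the Levenshtein branch (a distance, lower is better) can never fire; only perfect matches are found for that metric
                    if (metric == 'Jaccard' and matching_score > best_score) or (metric == 'Levenshtein' and matching_score < best_score): # handle better score than before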
                        label_result[1] = label2
                        best_score = matching_score
                    
            # print("Label done: ", label1)
            label_result[2] = best_score # save best score in label_result
            label_with_best_score = label_result[1] # get label that achieved the best score
            # print(label_result)
            class_uri = onto1_dict[label1] # get the class_uri of the currently checked label in ontology 1
            if label_result[2] == 0 and label_with_best_score == '': # handle if no match was found
                class_results[class_uri] = label_result
            else: 
                class2_uri = onto2_dict[label_with_best_score] # get class_uri of the label with best match
                label_result[1] = class2_uri # save class_uri instead of label => TODO maybe change to not manipulate label_result as it is confusing for later steps
                # print(label1)
                # print(label_result)
                if class2_uri not in onto2_used_classes: # check if class found of ontology 2 is NOT already used by other class in ontology 1
                    if class_uri not in class_results: # handle no entry exists yet for that class
                        class_results[class_uri] = label_result
                        onto2_used_classes[class2_uri] = class_uri
                        labels_already_tested_labels[label1].append(label_with_best_score)
                    elif label_result[2] > class_results[class_uri][2]: # handle entry exist but now higher score was found with another label of the class (handles multiple labels)
                            class_results[class_uri] = label_result
                            onto2_used_classes[class2_uri] = class_uri
                            labels_already_tested_labels[label1].append(label_with_best_score)
                else: # class of ontology 2 already in use
                    # print("Already in use")
                    result_current_class_in_use = onto2_used_classes[class2_uri] # get class uri of class that uses that class of ontology 2
                    # print(class_results[result_current_class_in_use])
                    if label_result[2] > class_results[result_current_class_in_use][2]: # if score of the new found match is higher than the current assigned one
                        # print("Better score as already in use")
                        class_results[class_uri] = label_result # set the class of ontology 2 to that current class
                        onto2_used_classes[class2_uri] = class_uri # overwrite the use of that class to new class of ontology 1
                        # print(class_results[result_current_class_in_use][0])
                        # print(result_current_class_in_use)
                        old_used_label = class_results[result_current_class_in_use][0]
                        labels_already_tested_labels[old_used_label].append(label_with_best_score)
                        onto1_list.append(old_used_label) # add the previously used label back to the list being iterated, as it no longer has a match
                        class_results[result_current_class_in_use] = ["", "", 0] # reset the result of the earlier class (could also be removed, but this way we can later handle the case that no match was found)
                    else: # handle not a higher score
                        labels_already_tested_labels[label1].append(label_with_best_score) # add the label to the already_tested_labels
                        onto1_list.append(label1) # append the currently checked label again, as it needs to be handled again with the new information in already_tested_labels
                    
        # TODO implement bidirectional matching
                    
        # ---------------------- OUTDATED but nice for report            
        # Old code without handling if better match found for already in use class (good for report)
        # # Match from Ontology 1 to Ontology 2
        # for label1 in onto1_list:
        #     label_result = ["", 0]
        #     best_score = 0
        #     for label2 in onto2_list:
        #         operations_done += 1
        #         if operations_done % 10000 == 0:
        #             print(operations_done)
        #         matching_score = execute_string_matching(metric, label1, label2)
        #         # If a perfect match is found, stop iterating over labels for this entry
        #         if (metric == 'Jaccard' and matching_score == 1) or (metric == 'Levenshtein' and matching_score == 0):
        #             best_score = matching_score
        #             label_result = [label2, best_score]
        #             break
        #         # Check if a match for this label has been found before
        #         if (metric == 'Jaccard' and matching_score > best_score) or (metric == 'Levenshtein' and matching_score < best_score):
        #             label_result[0] = label2
        #             best_score = matching_score
                    
        #     # print("Label done: ", label1)
        #     label_result[1] = best_score
        #     class_uri = onto1_dict[label1]
        #     class2_uri = onto2_dict[label2]
        #     label_result[0] = class2_uri
        #     # OLD_TO-DO check that if class of ontology 2 already used isn't allowed to use
        #     if class_uri not in class_results:
        #         class_results[class_uri] = label_result
        #     elif class_results[class_uri][1] < label_result[1]:
        #             class_results[class_uri] = label_result
                    
        # OLD_TO-DO important: currently it takes the best match found for the current class of the ontology 1.
        # But it doesnt take into account if a later class has a higher score with that class and therefore would be better suited
        # Solution: make dict and always take one element that gets removed. If later another element matches with the class matched with
        # the previous element the earlier removed element get added again and the value of that elements gets assigned to the higher value element
        
        return class_results


In [29]:
# input paths
onto1_path_in = "test_ontologies/mouse.owl"
onto2_path_in = "test_ontologies/human.owl"

matching_results = match_ontologies(onto1_path_in, onto2_path_in, 'Jaccard')

Successfully loaded ontology from test_ontologies/mouse.owl
Successfully loaded ontology from test_ontologies/human.owl
['mouse anatomy', 'spinal cord grey matter', 'organ system', 'trunk', 'body cavity/lining', 'head/neck', 'limb', 'tail', 'adipose tissue', 'cardiovascular system', 'connective tissue', 'endocrine system', 'hemolymphoid system', 'integumental system', 'muscle', 'nervous system', 'sensory organ', 'skeletal system', 'visceral organ system', 'back', 'abdomen/pelvis/perineum', 'thorax', 'head', 'neck', 'forelimb', 'hindlimb', 'lower back', 'upper back', 'abdomen', 'pelvis', 'chest', 'thoracic cavity', 'arm', 'lower arm', 'upper arm', 'elbow', 'hand', 'shoulder', 'wrist', 'carpus', 'hand digit', 'metacarpus', 'ankle', 'foot', 'hip', 'knee', 'leg', 'foot digit', 'metatarsus', 'tarsus', 'lower leg', 'upper leg', 'pericardial cavity', 'peritoneal cavity', 'pleural cavity', 'fat', 'brown fat', 'white fat', 'blood', 'blood vessel', 'arterial blood vessel', 'aorta', 'arteriole', 

In [30]:
len(matching_results)
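# One entry fewer than the label count is expected: onto1 has 2739 labels but the results are keyed by class URI, and MA_0001823 carries two labels (see the duplicate check above), which leaves 2738 distinct classes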

2738

In [31]:
# Comparing with reference_anatomy where matches all have measure of 1
print(matching_results["http://mouse.owl#MA_0000110"])
print(matching_results["http://mouse.owl#MA_0001468"])
print(matching_results["http://mouse.owl#MA_0001886"])
print(matching_results["http://mouse.owl#MA_0000702"]) # has 0.75
print(matching_results["http://mouse.owl#MA_0000751"]) # has 0.8
print(matching_results["http://mouse.owl#MA_0000064"])
print(matching_results["http://mouse.owl#MA_0000062"])
print(matching_results["http://mouse.owl#MA_0002497"]) # has 0.66
print(matching_results["http://mouse.owl#MA_0001221"])
print(matching_results["http://mouse.owl#MA_0001424"]) # has 0.25 => check why so low and for which class then the correct label was used: http://human.owl#NCI_C32242

['intervertebral disc', 'http://human.owl#NCI_C49571', 1.0]
['occipital bone', 'http://human.owl#NCI_C12757', 1.0]
['external ear cartilage', 'http://human.owl#NCI_C49225', 1.0]
['aorta smooth muscle', 'http://human.owl#NCI_C49191', 0.75]
['lymphatic vessel smooth muscle', 'http://human.owl#NCI_C49260', 0.8]
['artery', 'http://human.owl#NCI_C12372', 1.0]
['aorta', 'http://human.owl#NCI_C12669', 1.0]
['liver perisinusoidal space', 'http://human.owl#NCI_C33309', 0.6666666666666666]
['tensor tympani', 'http://human.owl#NCI_C33748', 1.0]
['cervical vertebra 4', 'http://human.owl#NCI_C33501', 0.25]


In [32]:
# check for http://mouse.owl#MA_0001424
# http://mouse.owl#MA_0001424 has label cervical vertebra 4
# currently matched with http://human.owl#NCI_C32245 => has label C7_Vertebra
# but in the reference.rdf matched with http://human.owl#NCI_C32242 => has label C4_Vertebra with score 1
# REPORT when running the matching again, http://mouse.owl#MA_0001424 gets matched to http://human.owl#NCI_C32431
# => which has a score of 0.16 => reason: because the elements were now iterated in a different order, the previously matched label got matched with
# another class but with the same score 0.25 => http://mouse.owl#MA_0001436: ['sacral vertebra 3', 'http://human.owl#NCI_C32245', 0.25]
print(jaccard_similarity("cervical vertebra 4", "C4_Vertebra"))
print(jaccard_similarity("thoracic vertebra 10", "C4_Vertebra"))

0.0
0.0
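
The two calls above pass the raw human labels, so the underscore and the capital letters prevent any token overlap. The sketch below repeats the comparison after applying the same normalisation as `preprocess_label`; both helper functions are copied here only to keep the example self-contained. After normalisation the C4 and C7 labels tie at 0.25 against "cervical vertebra 4", which would explain why token-level Jaccard cannot tell them apart.

```python
def preprocess_label(label):
    # same normalisation as the notebook's preprocess_label
    return str(label).replace("_", " ").strip(" ,.").lower()

def jaccard_similarity(str1, str2):
    # token-level Jaccard, equivalent to the notebook's version
    set1, set2 = set(str1.split()), set(str2.split())
    union = set1 | set2
    if not union:
        return 0.0
    return len(set1 & set2) / len(union)

for human_label in ("C4_Vertebra", "C7_Vertebra"):
    score = jaccard_similarity("cervical vertebra 4", preprocess_label(human_label))
    print(human_label, score)  # both labels score 0.25
```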


In [33]:
counter = 0
unique_values = set()
for key, element in matching_results.items():
    if element[1] == "http://human.owl#NCI_C32245":
        print(key)
        print(element)
    if not element[2] == 0 and not element[2] == 1:
        counter += 1
        unique_values.add(element[2])
# unique_values

http://mouse.owl#MA_0001443
['thoracic vertebra 6', 'http://human.owl#NCI_C32245', 0.25]


In [34]:
# Test to check if our matching works and a better match for a class that is already in use is found
test_onto1_path_in = "test_ontologies/test1.owl"
test_onto2_path_in = "test_ontologies/test2.owl"

test_matching_results = match_ontologies(test_onto1_path_in, test_onto2_path_in, 'Jaccard')
print(jaccard_similarity("mouse anatomy", "mouse anatomian"))
print(jaccard_similarity("mouse anatomian", "mouse anatomian"))
test_matching_results

2024-05-16 22:47:33 - http://mouse.owl/1 ontology 1 does not look like a valid URI, trying to serialize this will break.
2024-05-16 22:47:33 - http://mouse.owl/2 ontology 1 does not look like a valid URI, trying to serialize this will break.
Successfully loaded ontology from test_ontologies/test1.owl


2024-05-16 22:47:33 - http://mouse.owl/1 ontology 2 does not look like a valid URI, trying to serialize this will break.
Successfully loaded ontology from test_ontologies/test2.owl
['mouse anatomy', 'mouse anatomian']
['mouse anatomian']
2
2
1
1
0.3333333333333333
1.0


{'http://mouse.owl/2 ontology 1': ['mouse anatomian',
  'http://mouse.owl/1 ontology 2',
  1.0],
 'http://mouse.owl/1 ontology 1': ['mouse anatomy', '', 0]}

In [35]:
# TODO implement LLM (probably Word2Vec)
# TODO implement combining and filtering
# TODO implement user inputs (a weighted average with a user-defined formula would also be nice)

In [36]:
import time

def measure_matching_time(onto1_path_in, onto2_path_in, metric, bidirectional=False):
    start_time = time.time()
    match_ontologies(onto1_path_in, onto2_path_in, metric, bidirectional)
    end_time = time.time()
    elapsed_time = end_time - start_time
    print("Elapsed time:", elapsed_time, "seconds")

In [37]:
# Usage
matches = match_ontologies(onto1_path_in, onto2_path_in, 'Jaccard', bidirectional=False)

Successfully loaded ontology from test_ontologies/mouse.owl
Successfully loaded ontology from test_ontologies/human.owl
['mouse anatomy', 'spinal cord grey matter', 'organ system', 'trunk', 'body cavity/lining', 'head/neck', 'limb', 'tail', 'adipose tissue', 'cardiovascular system', 'connective tissue', 'endocrine system', 'hemolymphoid system', 'integumental system', 'muscle', 'nervous system', 'sensory organ', 'skeletal system', 'visceral organ system', 'back', 'abdomen/pelvis/perineum', 'thorax', 'head', 'neck', 'forelimb', 'hindlimb', 'lower back', 'upper back', 'abdomen', 'pelvis', 'chest', 'thoracic cavity', 'arm', 'lower arm', 'upper arm', 'elbow', 'hand', 'shoulder', 'wrist', 'carpus', 'hand digit', 'metacarpus', 'ankle', 'foot', 'hip', 'knee', 'leg', 'foot digit', 'metatarsus', 'tarsus', 'lower leg', 'upper leg', 'pericardial cavity', 'peritoneal cavity', 'pleural cavity', 'fat', 'brown fat', 'white fat', 'blood', 'blood vessel', 'arterial blood vessel', 'aorta', 'arteriole', 

### Create artificial ontologies for testing

In [38]:
import json
import random

def generate_class_uri(index, base_uri="http://mouse.owl#MA_"):
    return f"{base_uri}{1000 + index:04}"

def generate_label(index):
    base_labels = ["Nerve", "muscle", "vein", "artery", "bone", "tissue", "cell", "organ", "gland", "membrane"]
    part = random.choice(base_labels)
    return preprocess_label(f"{part} {index}")

def generate_superclasses(index):
    base_superclasses = [
        "http://www.w3.org/2002/07/owl#Thing",
        "http://mouse.owl#AnatomicalStructure",
        "http://mouse.owl#BiologicalProcess"
    ]
    # Sample without replacement so a class never lists the same superclass twice
    return random.sample(base_superclasses, k=random.randint(1, 3))

def create_ontology_entries(num_entries=100):
    entries = []
    for i in range(num_entries):
        entry = {
            "class_uri": generate_class_uri(i),
            "labels": [{"value": generate_label(i), "datatype": "http://www.w3.org/2001/XMLSchema#string", "language": None}],
            "superclasses": generate_superclasses(i)
        }
        entries.append(entry)
    return entries

# Generate the data for two ontology files
ontology1 = create_ontology_entries()
ontology2 = create_ontology_entries()

# Make sure the output directory exists before writing
import os
os.makedirs('ontology_jsons', exist_ok=True)

with open('ontology_jsons/test-ontology1.json', 'w') as file1:
    json.dump(ontology1, file1, indent=4)

with open('ontology_jsons/test-ontology2.json', 'w') as file2:
    json.dump(ontology2, file2, indent=4)


In [39]:
# Matches on artificial test data
test_matches = match_ontologies('ontology_jsons/test-ontology1.json', 'ontology_jsons/test-ontology2.json', 'C', bidirectional=False)

Successfully loaded ontology from ontology_jsons/test-ontology1.json
Successfully loaded ontology from ontology_jsons/test-ontology2.json
[]
[]
0


In [40]:
measure_matching_time('ontology_jsons/test-ontology1.json', 'ontology_jsons/test-ontology2.json', 'C', bidirectional=False)

Successfully loaded ontology from ontology_jsons/test-ontology1.json
Successfully loaded ontology from ontology_jsons/test-ontology2.json
[]
[]
0
Elapsed time: 0.017974138259887695 seconds


In [41]:
counter = 0
for label_key, data in test_matches.items():
    if data[2] > 0:
        print(data)
        counter += 1

print(counter)

0


Output format example:
```
<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xmlns="http://knowledgeweb.semanticweb.org/heterogeneity/alignment" 
	 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
	 xmlns:xsd="http://www.w3.org/2001/XMLSchema#">

<Alignment>
<xml>yes</xml>
<level>0</level>
<type>??</type>

<map>
	<Cell>
		<entity1 rdf:resource="http://mouse.owl#MA_0002401"/>
		<entity2 rdf:resource="http://human.owl#NCI_C52561"/>
		<measure rdf:datatype="xsd:float">1.0</measure>
		<relation>=</relation>
	</Cell>
</map>
</Alignment>
</rdf:RDF>
```
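
A minimal sketch of producing this format from the matching results follows. It assumes the result layout seen in the outputs above: the key is the class URI from the first ontology and the value is `[label, matched_uri, score]`. The function name `alignment_to_xml` is hypothetical. The `<type>??</type>` placeholder and the `=` relation are copied from the example as they stand.

```python
from xml.sax.saxutils import quoteattr

ALIGNMENT_HEADER = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<rdf:RDF xmlns="http://knowledgeweb.semanticweb.org/heterogeneity/alignment"\n'
    '\t xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"\n'
    '\t xmlns:xsd="http://www.w3.org/2001/XMLSchema#">\n'
    '\n<Alignment>\n<xml>yes</xml>\n<level>0</level>\n<type>??</type>\n\n'
)
ALIGNMENT_FOOTER = '</Alignment>\n</rdf:RDF>\n'

def alignment_to_xml(matches, threshold):
    """Serialize matches at or above `threshold` as one <map><Cell> each."""
    cells = []
    for uri1, (_label, uri2, score) in matches.items():
        # Skip classes without a match and matches below the threshold
        if not uri2 or score < threshold:
            continue
        cells.append(
            "<map>\n"
            "\t<Cell>\n"
            f"\t\t<entity1 rdf:resource={quoteattr(uri1)}/>\n"
            f"\t\t<entity2 rdf:resource={quoteattr(uri2)}/>\n"
            f'\t\t<measure rdf:datatype="xsd:float">{float(score)}</measure>\n'
            "\t\t<relation>=</relation>\n"
            "\t</Cell>\n"
            "</map>\n"
        )
    return ALIGNMENT_HEADER + "".join(cells) + ALIGNMENT_FOOTER

sample_matches = {
    "http://mouse.owl#MA_0002401": ["example label", "http://human.owl#NCI_C52561", 1.0],
    "http://mouse.owl#MA_0001443": ["thoracic vertebra 6", "http://human.owl#NCI_C32245", 0.25],
}
print(alignment_to_xml(sample_matches, threshold=0.8))
```

`quoteattr` escapes characters such as `&` in URIs and adds the surrounding quotes, so the attribute values stay well-formed. The label `"example label"` in the sample is made up for illustration.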