In [None]:
# TODO from insights by prof
# dont save in json but locally in a var
# only match by labels (no other matching needed, no properties, superclasses, etc.)
# just save labels in list with key is class uri and value is label: if memory is not a problem save in to structures: key value of label and value is class uri
# 5 - 10 min for human, mouse ontologies is reasonable

# Output: in rdf format (see reference_anatomy) just put final matches in there above the threshold (defined by the user). As relation use owl:equivalentClass

# Project Description
The goal of the project is to develop a simple yet effective ontology alignment framework in Python that focuses on lexical similarity matching. The framework will utilize both string matching techniques and the semantic capabilities of large language models to identify potential alignments between entities (such as classes) in two different ontologies.

### Objectives
1. **Develop an ontology alignment framework** that can process and compare ontologies based on textual content.
2. **Implement lexical similarity matching** using both basic string matching techniques and advanced semantic analysis with embeddings from LLMs.
3. **Output alignments with confidence scores**, enabling users to understand and evaluate the quality and reliability of the suggested alignments.

### Steps to Perform

#### Step 1: Ontology Parsing
- **Goal**: Load and parse the ontologies to be aligned.
- **Tasks**:
  - Utilize libraries like `rdflib` or `owlready2` to read ontology files.
  - Extract relevant textual information (e.g., class names, labels, descriptions).

#### Step 2: Lexical Similarity Matching
This step is divided into two sub-steps: string matching and embeddings matching.

##### a. String Matching
- **Goal**: Implement direct and fuzzy string comparison techniques to find matches based on textual similarity.
- **Tasks**:
  - Perform normalization (e.g., lowercasing, removing special characters).
  - Use string comparison methods (exact match, substring search, edit distance).

##### b. Embeddings Matching Using LLMs
- **Goal**: Use the semantic context provided by LLMs to match terms based on their meanings.
- **Tasks**:
  - Generate embeddings for the textual content of each ontology using models from the Hugging Face Transformers library.
  - Calculate similarity scores between embeddings (e.g., using cosine similarity).

#### Step 3: Combining and Filtering
- **Goal**: Aggregate results from both matching techniques and refine the output.
- **Tasks**:
  - Combine scores from string and embeddings matching.
  - Apply thresholds to filter out matches with low confidence.
  - Optionally, use simple structural checks to add confidence to matches (e.g., matched entities have similar parent classes).

#### Step 4: Output and Evaluation
- **Goal**: Output the alignment results and provide means for evaluation.
- **Tasks**:
  - Format the output in a structured way (e.g., JSON, CSV) that lists entity pairs and their matching scores.
  - If possible, evaluate the effectiveness using known benchmarks or test cases to calculate precision, recall, and F1-score.

### Summary
The project is centered on creating a practical tool for ontology matching, focusing on textual content using both conventional and advanced NLP techniques. By combining string-based and semantic-based approaches, the framework aims to provide robust alignments that are supported by both literal and contextual text similarities. This dual approach enhances the capability of the alignment process, making it more flexible and potentially more accurate than using only one method.

## Notes

- [This paper](https://arxiv.org/pdf/2309.07172) suggests that Flan-T5-XXL might perform best: [Hugging face link to model](https://huggingface.co/google/flan-t5-xxl)

In [1]:
# imports
import json
from owlready2 import *
import rdflib
import pandas as pd

  from pandas.core import (


**rdflib vs owlready2:**

Interchangeability: Given that OWL is an application of RDF, tools that can parse RDF/XML can generally handle .owl files, and vice versa, provided that the ontology-specific constructs are understood by the tool. This is why libraries like rdflib, which are capable of parsing RDF, are suitable for handling OWL files serialized in RDF/XML format.

Flexibility: Choosing to work with rdflib for general RDF handling and owlready2 for specific ontology manipulations where needed is a flexible approach. It allows you to leverage the strengths of both libraries—rdflib for its robust RDF manipulation and SPARQL querying capabilities, and owlready2 for its ontology-specific features like reasoning and direct manipulation of classes and properties.

# Load/ Parse ontologies

In [2]:
# input paths
onto1_path_in = "test_ontologies/mouse.owl"
onto2_path_in = "test_ontologies/human.owl"

# output paths
onto1_path_out = "ontology_jsons/onto1.json"
onto2_path_out = "ontology_jsons/onto2.json"

In [3]:
def load_ontology(file_path):
    """
    Loads an ontology from a given file path, which can be in RDF (.rdf) or OWL (.owl) format.
    
    Args:
    file_path (str): The file path to the ontology file.
    
    Returns:
    rdflib.Graph: A graph containing the ontology data.
    """
    # Create a new RDF graph
    graph = rdflib.Graph()

    # Bind some common namespaces to the graph
    namespaces = {
        "rdf": rdflib.namespace.RDF,
        "rdfs": rdflib.namespace.RDFS,
        "owl": rdflib.namespace.OWL,
        "xsd": rdflib.namespace.XSD
    }
    for prefix, namespace in namespaces.items():
        graph.namespace_manager.bind(prefix, namespace)

    # Attempt to parse the file
    try:
        graph.parse(file_path, format=rdflib.util.guess_format(file_path))
        print(f"Successfully loaded ontology from {file_path}")
    except Exception as e:
        print(f"Failed to load ontology from {file_path}: {e}")
        return None

    return graph

In [4]:
# load ontologies
onto1_graph = load_ontology(onto1_path_in)
onto2_graph = load_ontology(onto2_path_in)
print(onto1_graph, onto2_graph)

Successfully loaded ontology from test_ontologies/mouse.owl
Successfully loaded ontology from test_ontologies/human.owl
[a rdfg:Graph;rdflib:storage [a rdflib:Store;rdfs:label 'IOMemory']]. [a rdfg:Graph;rdflib:storage [a rdflib:Store;rdfs:label 'IOMemory']].


In [130]:
def preprocess_label(label):
    return str(label).strip(" ,.").lower()

In [131]:
def extract_class_details(graph):
    """
    Extracts and prints details of each class in the given RDF graph.

    Args:
    graph (rdflib.Graph): The RDF graph containing the ontology data.
    """
    # Assuming namespaces are already bound to the graph elsewhere
    
    query = """
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?class ?label ?label_dt ?label_lang ?superclass
    WHERE {
        ?class rdf:type owl:Class.
        OPTIONAL { ?class rdfs:label ?label. BIND(datatype(?label) AS ?label_dt) BIND(lang(?label) AS ?label_lang) }
        OPTIONAL { ?class rdfs:subClassOf ?superclass }
    }
    """

    # Execute the query
    results = graph.query(query)

    # Data structure to organize class details
    class_details = {}

    # Process results
    for row in results:
        class_uri, label, label_dt, label_lang, superclass = row

        if class_uri not in class_details:
            class_details[class_uri] = {'class_uri': str(class_uri), 'labels': [], 'superclasses': []}

        if label and {'value': preprocess_label(label), 'datatype': str(label_dt) if label_dt else None, 'language': str(label_lang) if label_lang else None} not in class_details[class_uri]['labels']:
            class_details[class_uri]['labels'].append({'value': preprocess_label(label), 'datatype': str(label_dt) if label_dt else None, 'language': str(label_lang) if label_lang else None})

        if superclass and str(superclass) not in class_details[class_uri]['superclasses']:
            class_details[class_uri]['superclasses'].append(str(superclass))

    # Convert dictionary to JSON
    json_data = json.dumps(list(class_details.values()), indent=4, separators=(',', ': '))
    return json_data

In [132]:
onto1_json = extract_class_details(onto1_graph)
onto2_json = extract_class_details(onto2_graph)

In [133]:
# function to save as json
def save_to_json(file_path, raw_data):
    with open(file_path, 'w') as f:
        f.write(raw_data)
    print(f"Data has been saved to '{file_path}'.")

# function to load json data
def load_json_data(file_path):
    """
    Reads JSON data from a file and returns it.

    Parameters:
        file_path (str): The path to the JSON file to be read.
    
    Returns:
        dict/list: The data loaded from the JSON file.
    """
    with open(file_path, 'r') as file:
        data = json.load(file)
    return data

In [134]:
save_to_json(onto1_path_out, onto1_json)
save_to_json(onto2_path_out, onto2_json)

Data has been saved to 'ontology_jsons/onto1.json'.
Data has been saved to 'ontology_jsons/onto2.json'.


In [9]:
##### OUTDATED now use rdflib
# function to save class information in a json
# def extract_ontology_data(ontology):
#     """
#     Extracts detailed information about classes from the specified ontology.
    
#     Parameters:
#         ontology (owlready2.Ontology): The loaded ontology from which to extract class information.
    
#     Returns:
#         list: A list of dictionaries, each containing details about a class.
#     """
#     classes_info = []
#     for cls in ontology.classes():
#         if cls.name != "Thing":  # Skip 'owl:Thing'
#             class_details = {
#                 "id": cls.iri,
#                 "label": cls.label[0] if cls.label else "No label",
#                 "superclasses": [supercls.iri for supercls in cls.is_a if hasattr(supercls, 'iri') and supercls.name != "Thing"],
#                 "annotations": {
#                     "comment": cls.comment[0] if cls.comment else "No comment"
#                 }
#             }
#             classes_info.append(class_details)
#     return classes_info
# 
# onto1_data = extract_ontology_data(onto1)
# onto2_data = extract_ontology_data(onto2)


In [10]:
onto1_data = load_json_data(onto1_path_out)
onto2_data = load_json_data(onto2_path_out)

print("Ontology 1 Data:", onto1_data)
print("Ontology 2 Data:", onto2_data)

Ontology 1 Data: [{'class_uri': 'http://mouse.owl#MA_0000252', 'labels': [{'value': 'otic capsule', 'datatype': 'http://www.w3.org/2001/XMLSchema#string', 'language': None}], 'superclasses': ['http://www.w3.org/2002/07/owl#Thing', 'N686f9c4382044bee976ba448d47c0e8d']}, {'class_uri': 'http://mouse.owl#MA_0000013', 'labels': [{'value': 'hemolymphoid system', 'datatype': 'http://www.w3.org/2001/XMLSchema#string', 'language': None}], 'superclasses': ['http://mouse.owl#MA_0000003']}, {'class_uri': 'http://mouse.owl#MA_0001084', 'labels': [{'value': 'vestibulocochlear VIII ganglion', 'datatype': 'http://www.w3.org/2001/XMLSchema#string', 'language': None}], 'superclasses': ['http://mouse.owl#MA_0000214']}, {'class_uri': 'http://mouse.owl#MA_0000105', 'labels': [{'value': 'cellular cartilage', 'datatype': 'http://www.w3.org/2001/XMLSchema#string', 'language': None}], 'superclasses': ['http://mouse.owl#MA_0000104']}, {'class_uri': 'http://mouse.owl#MA_0000509', 'labels': [{'value': 'upper back

## 3. Matching

### Required libraries

In [11]:
!pip install python-Levenshtein



In [12]:
!pip install scikit-learn



### 3.1. String Matching

For String Matching we will implement 4 different methods that the user then can chose via a parameter when calling the method.

The metrics we will use are:
- Levenshtein distance
- Jaccard Similarity
- Cosine Similarity
- TF-IDF

In [38]:
import Levenshtein
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import jaccard_score

def levenshtein_distance(str1, str2):
    return Levenshtein.distance(str1, str2)

def jaccard_similarity(str1, str2):
    # Tokenize the strings into sets of characters or words
    set1, set2 = set(str1.split()), set(str2.split())
    
    # Create a universe of all items
    universe = list(set1.union(set2))
    
    # Create binary vectors for each string
    vector1 = [1 if item in set1 else 0 for item in universe]
    vector2 = [1 if item in set2 else 0 for item in universe]
    
    # Calculate Jaccard similarity
    return jaccard_score(vector1, vector2)

def tfidf_comparison(str1, str2):
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform([str1, str2])
    return tfidf.toarray()  # Returns the TF/IDF vectors for both strings

In [166]:
def execute_string_matching(metric, data1, data2):
    """
    Executes the selected matching metric on the provided data.

    Args:
    metric (str): A single letter representing the metric to use.
                  'L' for Levenshtein Distance,
                  'J' for Jaccard Similarity,
                  'T' for TF/IDF.
    data1, data2 (str): The data strings to compare.

    Returns:
    result: The result of the chosen metric computation.
    """
    if metric == 'L':
        return levenshtein_distance(data1, data2)
    elif metric == 'J':
        return jaccard_similarity(data1, data2)
    elif metric == 'T': # use when there a long strings as labels, for short labels (one word) this will give a similar/ same result as the calc_cosine_similarity()
        tfidf_vectors = tfidf_comparison(data1, data2)
        print(tfidf_vectors)
        return cosine_similarity(tfidf_vectors[0].reshape(1, -1), tfidf_vectors[1].reshape(1, -1))[0][0]
    else:
        raise ValueError("Invalid metric selection")

In [200]:
data1 = "artery 0"
data2 = "artery 0"
metric_choice = 'L'  # Choose 'L', 'J' or 'T'

result = execute_string_matching(metric_choice, data1, data2)
print("The similarity score is:", result)

The similarity score is: 0


In [143]:
def match_ontologies(file1, file2, metric, bidirectional=False): # TODO make way more efficient
    ontology1 = load_json_data(file1)
    ontology2 = load_json_data(file2)
    
    results = {}

    operation_total = 0
    for class1 in ontology1:
        for class2 in ontology2:
            operation_total += 1
    
    print(operation_total)
    operations_done = 0
    # Match from Ontology 1 to Ontology 2
    for class1 in ontology1:
        for class2 in ontology2:
            operations_done += 1
            if operations_done % 10000 == 0:
                print(operations_done)
                # break
            for label1 in class1['labels']:
                for label2 in class2['labels']:
                    first_label = label1['value'] # preprocessing already done when extracting ontology and saving in json-format
                    second_label = label2['value']
                    score = execute_string_matching(metric, first_label, second_label)
                    # If a perfect match is found, stop iterating over labels for this entry
                    if ((metric == 'T' or metric == 'J') and score == 1) or metric == 'L' and score == 0:
                        results[first_label] = (first_label, second_label, score)
                        break
                    # Check if a match for this label has been found before
                    if first_label in results:
                        # If the current score is higher, update the result
                        if score > results[first_label][2]:
                            results[first_label] = (first_label, second_label, score)
                    else:
                        results[first_label] = (first_label, second_label, score)
                else:
                    continue  # Continue to the next label if a perfect match is not found
                break  # Break out of the label iteration loop if a perfect match is found
        else:
            continue  # Continue to the next class2 if a perfect match is not found
        break  # Break out of the class2 iteration loop if a perfect match is found
    
    # Optionally match from Ontology 2 to Ontology 1 TODO: implement but first finish for normal matching
    # if bidirectional:
    #     for class2 in ontology2:
    #         for class1 in ontology1:
    #             if class1['labels'] and class2['labels']:
    #                 label2 = class2['labels'][0]['value']
    #                 label1 = class1['labels'][0]['value']
    #                 score = execute_string_matching(metric, label2, label1)
    #                 results.append((class2['class_uri'], class1['class_uri'], score))
    
    return results


In [116]:
import time

def measure_matching_time(onto1_path_out, onto2_path_out, metric, bidirectional=False):
    start_time = time.time()
    match_ontologies(onto1_path_out, onto2_path_out, metric, bidirectional)
    end_time = time.time()
    elapsed_time = end_time - start_time
    print("Elapsed time:", elapsed_time, "seconds")

In [86]:
# Usage
matches = match_ontologies(onto1_path_out, onto2_path_out, 'C', bidirectional=False)

901716
10000
20000
30000
40000
50000
60000
70000
80000
90000
100000
110000
120000
130000
140000
150000
160000
170000
180000
190000
200000
210000
220000
230000
240000
250000
260000
270000
280000
290000
300000
310000
320000
330000
340000
350000
360000
370000
380000
390000
400000
410000
420000
430000
440000
450000
460000
470000
480000
490000
500000
510000
520000
530000
540000
550000
560000
570000
580000
590000
600000
610000
620000
630000
640000
650000
660000
670000
680000
690000
700000
710000
720000
730000
740000
750000
760000
770000
780000
790000
800000
810000
820000
830000
840000
850000
860000
870000
880000
890000
900000


In [89]:
counter = 0
for match in matches:
    if match[2] > 0:
        print(match)

print(counter)

('glomerular capillary basement membrane', 'Capillary', 0.3799783615910079)
('thorax', 'Thorax', 1.0)
('pericardium', 'Pericardium', 1.0)
('auditory bone', 'Bone', 0.5797386715376658)
('popliteal lymph node', 'Lymph', 0.44943641652398214)
('bronchus connective tissue', 'Bronchus', 0.44943641652398214)
('bronchus connective tissue', 'Bronchus-Associated_Lymphoid_Tissue', 0.26055567105626243)
('squamosal bone', 'Bone', 0.5797386715376658)
('glomerular mesangium', 'Mesangium', 0.5797386715376658)
('spleen venous sinus', 'Sinus', 0.44943641652398214)
('pancreas body', 'Body', 0.5797386715376658)
('liver sinusoid', 'Liver', 0.5797386715376658)
('dentate gyrus granule cell layer', 'Monocytoid_B-Cell', 0.19431434016858148)
('bronchus smooth muscle', 'Bronchus', 0.44943641652398214)
('bronchus smooth muscle', 'Bronchus-Associated_Lymphoid_Tissue', 0.26055567105626243)
('sphenoid bone', 'Bone', 0.5797386715376658)
('foot bone', 'Bone', 0.5797386715376658)
('splenius', 'Splenius', 1.0)
('lacrima

### Create artificial ontologies for testing

In [137]:
import json
import random

def generate_class_uri(index, base_uri="http://mouse.owl#MA_"):
    return f"{base_uri}{1000 + index:04}"

def generate_label(index):
    base_labels = ["Nerve", "muscle", "vein", "artery", "bone", "tissue", "cell", "organ", "gland", "membrane"]
    part = random.choice(base_labels)
    return preprocess_label(f"{part} {index}")

def generate_superclasses(index):
    base_superclasses = [
        "http://www.w3.org/2002/07/owl#Thing",
        "http://mouse.owl#AnatomicalStructure",
        "http://mouse.owl#BiologicalProcess"
    ]
    return [random.choice(base_superclasses) for _ in range(random.randint(1, 3))]

def create_ontology_entries(num_entries=100):
    entries = []
    for i in range(num_entries):
        entry = {
            "class_uri": generate_class_uri(i),
            "labels": [{"value": generate_label(i), "datatype": "http://www.w3.org/2001/XMLSchema#string", "language": None}],
            "superclasses": generate_superclasses(i)
        }
        entries.append(entry)
    return entries

# Generate the data for two ontology files
ontology1 = create_ontology_entries()
ontology2 = create_ontology_entries()

with open('ontology_jsons/test-ontology1.json', 'w') as file1:
    json.dump(ontology1, file1, indent=4)

with open('ontology_jsons/test-ontology2.json', 'w') as file2:
    json.dump(ontology2, file2, indent=4)


In [144]:
# Matches on artificial test data
test_matches = match_ontologies('ontology_jsons/test-ontology1.json', 'ontology_jsons/test-ontology2.json', 'C', bidirectional=False)

10000
10000


In [139]:
measure_matching_time('ontology_jsons/test-ontology1.json', 'ontology_jsons/test-ontology2.json', 'C', bidirectional=False)

10000
10000
Elapsed time: 7.353135108947754 seconds


In [145]:
counter = 0
for label_key, data in test_matches.items():
    if data[2] > 0:
        print(data)
        counter += 1

print(counter)

('artery 0', 'artery 8', 1.0)
('organ 1', 'organ 30', 0.5797386715376658)
('nerve 2', 'nerve 29', 0.5797386715376658)
('tissue 3', 'tissue 23', 0.5797386715376658)
('nerve 4', 'nerve 29', 0.5797386715376658)
('nerve 5', 'nerve 29', 0.5797386715376658)
('organ 6', 'organ 30', 0.5797386715376658)
('membrane 7', 'membrane 2', 1.0)
('muscle 8', 'muscle 4', 1.0)
('cell 9', 'cell 9', 1.0)
('organ 10', 'artery 10', 0.33609692727625756)
('tissue 11', 'artery 11', 0.33609692727625756)
('tissue 12', 'membrane 12', 0.33609692727625756)
('tissue 13', 'muscle 13', 0.33609692727625756)
('tissue 14', 'membrane 14', 0.33609692727625756)
('muscle 15', 'muscle 4', 0.5797386715376658)
('cell 16', 'cell 7', 0.5797386715376658)
('cell 17', 'cell 17', 1.0000000000000002)
('cell 18', 'cell 7', 0.5797386715376658)
('muscle 19', 'muscle 4', 0.5797386715376658)
('cell 20', 'cell 7', 0.5797386715376658)
('gland 21', 'gland 0', 0.5797386715376658)
('muscle 22', 'muscle 22', 1.0000000000000002)
('gland 23', 'gland

Output format example:
```
<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xmlns="http://knowledgeweb.semanticweb.org/heterogeneity/alignment" 
	 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
	 xmlns:xsd="http://www.w3.org/2001/XMLSchema#">

<Alignment>
<xml>yes</xml>
<level>0</level>
<type>??</type>

<map>
	<Cell>
		<entity1 rdf:resource="http://mouse.owl#MA_0002401"/>
		<entity2 rdf:resource="http://human.owl#NCI_C52561"/>
		<measure rdf:datatype="xsd:float">1.0</measure>
		<relation>=</relation>
	</Cell>
</map>
```