Changing approaches

Direct approaches to creating the bschema involve too many compromises and don't work generically.

Switch approaches: use one-hop graph queries plus an LLM to produce new class names, then simply build a class graph from the result.

This is also more interpretable than the bschema, where you still had to inspect the graph to learn what something is instead of just reading a human-readable label.

Iteratively naming things based on their one-hop graphs should also work for other schemas or types of graphs, so the process is somewhat manual but can be generalized.
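
The iterative naming idea can be illustrated with a small, self-contained toy. This is only a sketch under assumptions: plain `(subject, predicate, object)` tuples stand in for the RDF graph, a sorted one-hop signature stands in for the isomorphism check, and `refine_labels` and the `Base_N` naming are hypothetical, not the implementation used further down.

```python
from collections import defaultdict


def refine_labels(edges, base_labels, max_iterations=5):
    """Repeatedly relabel nodes by their one-hop neighbourhood until the grouping is stable."""
    labels = dict(base_labels)
    previous_partition = None
    for iteration in range(1, max_iterations + 1):
        # One-hop signature: own label plus (predicate, neighbour label) pairs in both directions.
        groups = defaultdict(set)
        for node in labels:
            outgoing = tuple(sorted((p, labels[o]) for s, p, o in edges if s == node))
            incoming = tuple(sorted((p, labels[s]) for s, p, o in edges if o == node))
            groups[(labels[node], outgoing, incoming)].add(node)

        partition = {frozenset(members) for members in groups.values()}
        if partition == previous_partition:
            # Stop condition: no nodes were distinguished from each other in this pass.
            return labels, iteration
        previous_partition = partition

        # Number the variants of each original class (VAV_1, VAV_2, ...).
        counters = defaultdict(int)
        new_labels = {}
        for signature in sorted(groups):
            members = groups[signature]
            base = base_labels[next(iter(members))]
            counters[base] += 1
            for node in members:
                new_labels[node] = f"{base}_{counters[base]}"
        labels = new_labels
    return labels, max_iterations


# Hypothetical toy data: one cooling-only VAV and one VAV with a reheat coil.
edges = [
    ("vav1", "hasPart", "dmp1"),
    ("vav2", "hasPart", "dmp2"),
    ("vav2", "hasPart", "rhc2"),
]
base_labels = {"vav1": "VAV", "vav2": "VAV", "dmp1": "Damper", "dmp2": "Damper", "rhc2": "Coil"}

labels, iterations = refine_labels(edges, base_labels)
print(labels, iterations)
```

In this toy the two VAVs split in the first pass (only one has a coil), and the two dampers split in the second pass, once their parent VAVs carry different labels.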


Two options:

1. Create an ASK query for each thing, then run every ASK query against each node to see which nodes match which queries. ASK queries are pretty fast, so this is probably faster than option 2. However, an ASK query only checks that its pattern is contained in the node's graph (i.e., it won't catch that there is MORE information for a given thing), so the results need some correction at the end. For example, every VAV with reheat also matches the vav-cooling-only query, so the vav-cooling-only label has to be removed from VAVs with reheat. Alternatively, generate two SPARQL queries and test both directions.

2. Can go through and check if class graphs are isomorphic for each thing. 
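
The correction step needed for option 1 can be sketched without a SPARQL engine. The following is an assumption-laden stand-in: sets of class-level triples replace the ASK queries, and `matching_labels`, `drop_subsumed`, and the two example patterns are hypothetical names chosen for illustration.

```python
def matching_labels(node_pattern, label_patterns):
    """Mimic ASK semantics: a label matches when all of its triples occur in the node's pattern."""
    return {label for label, pattern in label_patterns.items() if pattern <= node_pattern}


def drop_subsumed(labels, label_patterns):
    """Keep only labels whose pattern is not a strict subset of another matched label's pattern."""
    return {
        label
        for label in labels
        if not any(label_patterns[label] < label_patterns[other] for other in labels)
    }


# Hypothetical patterns: the reheat pattern strictly contains the cooling-only pattern.
label_patterns = {
    "vav-cooling-only": frozenset({("VAV", "hasPart", "Damper")}),
    "vav-with-reheat": frozenset({("VAV", "hasPart", "Damper"), ("VAV", "hasPart", "Reheat_Coil")}),
}
reheat_node = frozenset({("VAV", "hasPart", "Damper"), ("VAV", "hasPart", "Reheat_Coil")})
cooling_node = frozenset({("VAV", "hasPart", "Damper")})

raw = matching_labels(reheat_node, label_patterns)      # both labels match
corrected = drop_subsumed(raw, label_patterns)          # only the more specific label remains
print(raw, corrected)
```

Testing containment in both directions, as suggested at the end of option 1, would make the second function unnecessary, at the cost of running twice as many queries.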

Then need to build the inferred class label. 

Class naming with a locally running LLM isn't particularly good. Instead, just number the variants of the original class (ConnectionPoint1, ConnectionPoint2, VAV1, VAV2, etc.).

With GPT OSS served through CBORG, the generated class names were decent.

# TODO: 
- Works correctly for Brick, but the class graph currently contains only one class per node. This may create bugs, so I will need to consider other methods. It works so far because it happens to get the most recently added class, which is alphabetically the latest because of the version tag I'm adding. Instead of handling the class relations directly, it may make sense to separate instance and class once again.

- Fix how I'm getting aspects and named nodes again so these show up in class graph for isomorphism. 
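
One way to make the class choice from the first TODO item explicit is to select by the numeric version suffix instead of relying on retrieval order. This is a sketch only: `pick_latest_class` is a hypothetical helper, the `urn:hpfs#` namespace is a placeholder, and it assumes the highest `_version_N` number was added last. Note that plain alphabetical ordering would place `_version_10` before `_version_9`.

```python
import re

_VERSION_SUFFIX = re.compile(r"_version_(\d+)$")


def pick_latest_class(class_uris):
    """Return the class with the highest _version_N suffix; unversioned classes count as version 0."""
    def sort_key(uri):
        match = _VERSION_SUFFIX.search(uri)
        version = int(match.group(1)) if match else 0
        return (version, uri)

    return max(class_uris, key=sort_key)


# Hypothetical rdf:type values for a single node after several iterations.
candidates = [
    "https://brickschema.org/schema/Brick#VAV",
    "urn:hpfs#VAV_version_9",
    "urn:hpfs#VAV_version_10",
]
print(pick_latest_class(candidates))
```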

In [135]:
from rdflib.compare import isomorphic

from pprint import pprint
from typing import Union, List, Tuple, Optional
from rdflib import Graph, URIRef
import sys

sys.path.append('../utils')
sys.path.append('utils')

from namespaces import *
from utils import * 

from tqdm import tqdm

In [136]:
def get_subgraph_with_hops(
    graph: Graph, 
    central_node: Union[str, URIRef], 
    num_hops: int,
    get_classes: bool = False
) -> Graph:
    """
    Extract a subgraph containing all triples within num_hops from central_node,
    including rdf:type information for all entities.
    
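    Args:
        graph: The source RDF graph
        central_node: The central node URI (as string or URIRef)
        num_hops: Number of hops to traverse from the central node
        get_classes: If True, also add rdf:type triples for every subject and
            object that ended up in the extracted subgraph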
    
    Returns:
        A new Graph containing the subgraph
    """
    if isinstance(central_node, str):
        # print("central node is string: ", central_node)
        central_node = URIRef(central_node)
    
    subgraph = Graph(store = 'Oxigraph')
    visited_nodes = set()
    current_layer = {central_node}
    
    # Add class information for central node
    for class_uri in graph.objects(central_node, RDF.type):
        subgraph.add((central_node, RDF.type, class_uri))
    
    for hop in range(num_hops):
        next_layer = set()
        
        for node in current_layer:
            if node in visited_nodes:
                continue
            visited_nodes.add(node)
            
            # Get triples where node is subject
            for p, o in graph.predicate_objects(node):
                subgraph.add((node, p, o))
                if isinstance(o, URIRef):
                    next_layer.add(o)
                    # Add class information for object
                    for class_uri in graph.objects(o, RDF.type):
                        subgraph.add((o, RDF.type, class_uri))
            
            # Get triples where node is object
            for s, p in graph.subject_predicates(node):
                subgraph.add((s, p, node))
                if isinstance(s, URIRef):
                    next_layer.add(s)
                    # Add class information for subject
                    for class_uri in graph.objects(s, RDF.type):
                        subgraph.add((s, RDF.type, class_uri))
        
        current_layer = next_layer
    if get_classes:
        # Iterate over a snapshot, since triples are added to subgraph inside the loop
        for s, p, o in list(subgraph):
            for class_uri in graph.objects(s, RDF.type):
                subgraph.add((s, RDF.type, class_uri))
            for class_uri in graph.objects(o, RDF.type):
                subgraph.add((o, RDF.type, class_uri))
    return subgraph

In [174]:
def get_class(node: URIRef, data_graph) -> URIRef:
    """Get the class of a node from the data graph (first rdf:type found)"""
    # TODO: Shouldn't have a default class and need a union of classes
    for _, _, o in data_graph.triples((node, A, None)):
        return o
    return URIRef("http://www.w3.org/2000/01/rdf-schema#Resource")

def create_class_pattern(triple: Tuple[URIRef, URIRef, URIRef], data_graph) -> Tuple[URIRef, URIRef, URIRef]:
    """Create class-based pattern from concrete triples"""
    # may want to have a union of classes
    # TODO: if ordering is not consistent of classes this could create bugs
    # subject node should maybe represent union of classes
    named_node_predicates = [S223.hasAspect, S223.hasEnumerationKind, S223.hasQuantityKind, S223.hasUnit, S223.hasMedium, S223.ofConstituent]
    s_class = get_class(triple[0], data_graph)
    if triple[1] == A:
        # rdf:type triples already point at a class
        o_class = triple[2]
    elif triple[1] in [S223['hasAspect'], S223.hasEnumerationKind, S223.hasQuantityKind, S223.hasUnit]:
        # Objects of these predicates are named nodes, so keep them as-is.
        # Note: hasMedium and ofConstituent from named_node_predicates are not checked here yet.
        o_class = triple[2]
    else:
        o_class = get_class(triple[2], data_graph)
    return (s_class, triple[1], o_class)

def create_class_graph(data_graph: Graph):
    class_graph = Graph(store = 'Oxigraph')
    for triple in data_graph: 
        class_triple = create_class_pattern(triple, data_graph)
        if class_triple in class_graph:
            continue
        else:
            class_graph.add(class_triple)

    return class_graph

def get_class_isomorphisms(data_graph):
    distinct_class_subgraphs = []
    seen_subjects = set()
    equivalent_subjects = []
    subject_classes = []
    for s,p,o in tqdm(data_graph):
        add_subgraph = False
        found_in_preferred = False
        subject_class = get_class(s, data_graph)
        if s in seen_subjects:
            continue
        else:
            seen_subjects.add(s)
        subgraph = get_subgraph_with_hops(data_graph, s, 1)
        class_graph = create_class_graph(subgraph)
        if equivalent_subjects == []:
            equivalent_subjects.append([s])
            distinct_class_subgraphs.append(class_graph)
            subject_classes.append(subject_class)
            continue
            
        indices = [i for i, x in enumerate(subject_classes) if x == subject_class]
        # check indices first 
        for i in indices:
            g = distinct_class_subgraphs[i]
            if isomorphic(class_graph, g): 
                equivalent_subjects[i].append(s)
                found_in_preferred = True
                break

        if found_in_preferred:
            continue

        for i, g in enumerate(distinct_class_subgraphs):
            if isomorphic(class_graph, g): 
                equivalent_subjects[i].append(s)
                add_subgraph = False
                break
            else:
                add_subgraph = True
                
        if add_subgraph:
            distinct_class_subgraphs.append(class_graph)
            equivalent_subjects.append([s])
            subject_classes.append(subject_class)
    return distinct_class_subgraphs, seen_subjects, equivalent_subjects, subject_classes


In [232]:
def find_similar_sublists(list1, list2, min_intersection_ratio=0.4):
    """
    Find pairs of sublists that have high overlap but aren't exactly the same.
    
    Args:
        list1: First list containing sublists
        list2: Second list containing sublists
        min_intersection_ratio: Minimum Jaccard similarity for a pair to be reported (default 0.4)
    
    Returns:
        list: Pairs of similar but not identical sublists with their similarity info
    """
    similar_pairs = []
    
    for i, sublist1 in enumerate(list1):
        set1 = set(sublist1)
        
        for j, sublist2 in enumerate(list2):
            set2 = set(sublist2)
            
            # Skip if they're exactly the same
            if set1 == set2:
                continue
            
            # Calculate intersection and union
            intersection = set1 & set2
            union = set1 | set2
            
            # Calculate Jaccard similarity
            jaccard = len(intersection) / len(union) if union else 0
            
            # Also calculate intersection ratio relative to each set
            ratio1 = len(intersection) / len(set1) if set1 else 0
            ratio2 = len(intersection) / len(set2) if set2 else 0
            
            # Check if meets threshold
            if jaccard >= min_intersection_ratio:
                similar_pairs.append({
                    'list1_index': i,
                    'list2_index': j,
                    'jaccard_similarity': jaccard,
                    'intersection_size': len(intersection),
                    'only_in_list1': list(set1 - set2),
                    'only_in_list2': list(set2 - set1),
                    'common': list(intersection),
                    'size_list1': len(set1),
                    'size_list2': len(set2)
                })
    
    return similar_pairs

# Usage:
# similar = find_similar_sublists(your_list1, your_list2, min_intersection_ratio=0.8)
# 
# for pair in similar:
#     print(f"\nSublist {pair['list1_index']} from list1 and {pair['list2_index']} from list2:")
#     print(f"  Similarity: {pair['jaccard_similarity']:.2%}")
#     print(f"  Only in list1: {len(pair['only_in_list1'])} items")
#     print(f"  Only in list2: {len(pair['only_in_list2'])} items")
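
As a worked illustration of the similarity measure used above, the following minimal sketch computes the Jaccard similarity for two small groups. The `jaccard` helper and the node names are hypothetical and only mirror the formula (intersection size divided by union size).

```python
def jaccard(a, b):
    """Jaccard similarity of two collections: |A & B| / |A | B|, or 0.0 when both are empty."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


# Hypothetical groups from two consecutive iterations: the group lost two members.
current_group = ["dmp_0", "dmp_1", "dmp_2"]
previous_group = ["dmp_0", "dmp_1", "dmp_2", "dmp_3", "dmp_4"]

score = jaccard(current_group, previous_group)  # 3 shared members out of 5 distinct ones
print(f"{score:.2f}")
```

With the default threshold of 0.4 this pair would be reported as similar but not identical; with a threshold of 0.8 it would be skipped.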

In [246]:
counter = {}

def lists_have_same_members(list1, list2):
    """Check if two lists with sublists have same members, regardless of order."""
    if len(list1) != len(list2):
        return False
    
    # Convert each sublist to a frozenset and compare as sets
    sets1 = {frozenset(sublist) for sublist in list1}
    sets2 = {frozenset(sublist) for sublist in list2}
    
    return sets1 == sets2

def copy_graph(g):
    g2 = Graph(store = "Oxigraph")
    bind_prefixes(g2)
    for triple in g:
        g2.add(triple)
    return g2

def create_new_class_name(cls_name):
    name = cls_name.split('#')[-1].split('_version_')[0]
    counter[name] = count = counter.get(name, 0) + 1
    return URIRef(f"{name}_version_{count}")

def assign_new_classes(data_graph, distinct_class_subgraphs, equivalent_subjects, subject_classes):
    class_mappings = {}
    new_subject_classes = {}
    for i, subj_list in enumerate(equivalent_subjects):
        new_cls_name = create_new_class_name(subject_classes[i])
        for s in subj_list:
            new_subject_classes.update({s: HPFS[new_cls_name.split('#')[-1]]})
            class_mappings[new_cls_name] = (subject_classes[i], distinct_class_subgraphs[i])

    return new_subject_classes, class_mappings

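def run_algo(original_data_graph, iterations = 5):
    global counter
    counter = {}
    class_mappings = []
    # stays empty if the stop condition is already met on the first comparison
    last_set_diff = []
    # need to remove ontology statement because having just the prefix breaks serialization/parsing by oxigraph
    # (this also modifies the graph that was passed in)
    original_data_graph.remove((URIRef('urn:example#'), A, OWL.Ontology))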
    data_graph = copy_graph(original_data_graph)
    for i in range(iterations):
        distinct_class_subgraphs, seen_subjects, equivalent_subjects, subject_classes = get_class_isomorphisms(data_graph)
        new_subject_classes, new_class_mappings = assign_new_classes(data_graph, distinct_class_subgraphs, equivalent_subjects, subject_classes)
        class_mappings.append(new_class_mappings)

        if i >= 1:
            # stop condition is no subjects are being distinguished from each other
            if lists_have_same_members(equivalent_subjects, prev_equivalent_subjects):
                print('Algorithm stopped, since no further distinguishing classes found. Process took ',i, 'iterations')
                print('Printing Last Set Difference')
                pprint(last_set_diff)
                break
            else:
                last_set_diff = find_similar_sublists(equivalent_subjects, prev_equivalent_subjects)
    
        prev_equivalent_subjects = equivalent_subjects
        for s, cls_name in new_subject_classes.items():
            data_graph.add((s, A, cls_name))


    class_graph = create_class_graph(data_graph)
    return class_graph, data_graph, class_mappings


In [247]:
data_graph = Graph(store = 'Oxigraph')
data_graph.parse("/Users/lazlopaul/Desktop/223p/experiments/graph-pattern-id/from-data-graph/brick-example.ttl", format="turtle")

<Graph identifier=N082cf9c3708a4394804561533797b34e (<class 'rdflib.graph.Graph'>)>

In [248]:
s223_data_graph = Graph(store = 'Oxigraph')
s223_data_graph.parse("/Users/lazlopaul/Desktop/223p/experiments/graph-pattern-id/from-data-graph/s223-example.ttl", format="turtle")

<Graph identifier=N7d10cf08d5f545d19c7f2f5e81c77b78 (<class 'rdflib.graph.Graph'>)>

In [249]:
cg, dg, cm = run_algo(data_graph, 4)
cg.print()

100%|██████████| 3428/3428 [00:00<00:00, 5271.03it/s]
100%|██████████| 4572/4572 [00:00<00:00, 5016.41it/s]
100%|██████████| 5716/5716 [00:01<00:00, 5138.38it/s]
100%|██████████| 6860/6860 [00:01<00:00, 5090.55it/s]


Algorithm stopped, since no further distinguishing classes found. Process took  3 iterations
Printing Last Set Difference
[{'common': [rdflib.term.URIRef('urn:example#vav-cooling-only_dmp-dmppos_3_8'),
             rdflib.term.URIRef('urn:example#vav-cooling-only_dmp-dmppos_0_5'),
             rdflib.term.URIRef('urn:example#vav-cooling-only_dmp-dmppos_0_2'),
             rdflib.term.URIRef('urn:example#vav-cooling-only_dmp-dmppos_3_5'),
             rdflib.term.URIRef('urn:example#vav-cooling-only_dmp-dmppos_2_10'),
             rdflib.term.URIRef('urn:example#vav-cooling-only_dmp-dmppos_0_11'),
             rdflib.term.URIRef('urn:example#vav-cooling-only_dmp-dmppos_0_3'),
             rdflib.term.URIRef('urn:example#vav-cooling-only_dmp-dmppos_0_0'),
             rdflib.term.URIRef('urn:example#vav-cooling-only_dmp-dmppos_1_9'),
             rdflib.term.URIRef('urn:example#vav-cooling-only_dmp-dmppos_2_14'),
             rdflib.term.URIRef('urn:example#vav-cooling-only_dmp-dmppos_2_

In [250]:
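# strip generated class assignments and labels so the output looks prettier
for s,p,o in list(cg):  # iterate over a copy, since triples are removed inside the loop
    if p == A and str(HPFS) in str(o):
        cg.remove((s,p,o))
    if p == RDFS.label:
        cg.remove((s,p,o))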
bind_prefixes(cg)
cg.serialize('algo5-brick.ttl')

<Graph identifier=N9215dd6a8d5148c9a15aa2e9552f50e7 (<class 'rdflib.graph.Graph'>)>

In [251]:
cg, dg, cm = run_algo(s223_data_graph, 5)
cg.print()

100%|██████████| 6944/6944 [00:01<00:00, 4298.35it/s]
100%|██████████| 8480/8480 [00:01<00:00, 5412.16it/s] 
100%|██████████| 10016/10016 [00:01<00:00, 5355.87it/s]
100%|██████████| 11552/11552 [00:02<00:00, 5210.29it/s]


Algorithm stopped, since no further distinguishing classes found. Process took  3 iterations
Printing Last Set Difference
[{'common': [rdflib.term.URIRef('urn:example#vav-cooling-only_dmp-dmppos_3_8'),
             rdflib.term.URIRef('urn:example#vav-cooling-only_dmp-dmppos_0_5'),
             rdflib.term.URIRef('urn:example#vav-cooling-only_dmp-dmppos_0_2'),
             rdflib.term.URIRef('urn:example#vav-cooling-only_dmp-dmppos_3_5'),
             rdflib.term.URIRef('urn:example#vav-cooling-only_dmp-dmppos_2_10'),
             rdflib.term.URIRef('urn:example#vav-cooling-only_dmp-dmppos_0_11'),
             rdflib.term.URIRef('urn:example#vav-cooling-only_dmp-dmppos_0_3'),
             rdflib.term.URIRef('urn:example#vav-cooling-only_dmp-dmppos_0_0'),
             rdflib.term.URIRef('urn:example#vav-cooling-only_dmp-dmppos_1_9'),
             rdflib.term.URIRef('urn:example#vav-cooling-only_dmp-dmppos_2_14'),
             rdflib.term.URIRef('urn:example#vav-cooling-only_dmp-dmppos_2_

In [181]:
len(cm[2])

47

In [182]:
i = 5
list(cm[0].values())[i][0].split('#')[-1]

'QuantifiableObservableProperty'

In [183]:
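# strip generated class assignments and labels so the output looks prettier
for s,p,o in list(cg):  # iterate over a copy, since triples are removed inside the loop
    if p == A and str(HPFS) in str(o):
        cg.remove((s,p,o))
    if p == RDFS.label:
        cg.remove((s,p,o))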
bind_prefixes(cg)
cg.serialize('algo5-s223.ttl')

<Graph identifier=N783478077d0a46b4b03a96a2b8add9dc (<class 'rdflib.graph.Graph'>)>

# old code

In [None]:
# # pip install transformers
# from transformers import AutoModelForCausalLM, AutoTokenizer
# checkpoint = "HuggingFaceTB/SmolLM3-3B"

# device = "cuda" 
# tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# # for multiple GPUs install accelerate and do `model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")`
# model = AutoModelForCausalLM.from_pretrained(checkpoint)


  from .autonotebook import tqdm as notebook_tqdm
Loading checkpoint shards: 100%|██████████| 2/2 [00:02<00:00,  1.38s/it]


In [None]:
# i=5
# cls_name = list(cm[0].values())[i][0].split('#')[-1]
# cls_graph = list(cm[0].values())[i][1].serialize(format = 'ttl')
# messages = [{"role":"system",
#              "content": "Please provide a brief reply with the new brick class within the tags <brick-class></brick-class> so that I can retrieve it."},
#             {"role": "user", 
#              "content": f"""If it is possible to further specify the class name based on the RDF graph of nodes surrounding it, provide a more specific brick ontology class name.
#             The current class name is {cls_name} and the RDF graph of surrounding nodes is {cls_graph}"""}]
# input_text=tokenizer.apply_chat_template(messages, tokenize=False)
# # print(input_text)
# inputs = tokenizer.apply_chat_template(
#     messages,
#     # enable_thinking=True,
#     tokenize=True,
#     return_tensors="pt",
# )

# outputs = model.generate(inputs, max_new_tokens=100)
# print(tokenizer.decode(outputs[0], skip_special_tokens=True))

system
## Metadata

Knowledge Cutoff Date: June 2025
Today Date: 16 November 2025
Reasoning Mode: /think

## Custom Instructions

Please provide a brief reply with the new brick class within the tags <brick-class></brick-class> so that I can retrieve it.

user
If it is possible to further specify the class name based on the RDF graph of nodes surrounding it, provide a more specific brick ontology class name.
            The current class name is Heating_Coil and the RDF graph of surrounding nodes is @prefix brick: <https://brickschema.org/schema/Brick#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.

brick:AHU a brick:AHU ;
    brick:hasPart brick:Heating_Coil.

brick:Heating_Coil a brick:Heating_Coil ;
    rdfs:label rdfs:Resource ;
    brick:hasPoint brick:Valve_Position_Command.

brick:Valve_Position_Command a brick:Valve_Position_Command.


assistant
<think>

</think>
<brick-class>Heating_Coil</brick-class>


In [119]:
output_text = tokenizer.decode(outputs[0])
output_text.split('\n')[-1].split('<|im_end|>')[0]

'Heating_Coil_Usage'
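
Since the system prompt asks for the answer inside `<brick-class></brick-class>` tags, the reply could also be pulled out by tag rather than by line position. A minimal sketch, assuming the decoded text still contains the prompt (so the empty tag pair from the instructions must be skipped); `extract_brick_class` is a hypothetical helper.

```python
import re

# Non-empty content between the tags; the empty pair in the system prompt does not match.
_BRICK_CLASS_TAG = re.compile(r"<brick-class>\s*([^<]+?)\s*</brick-class>")


def extract_brick_class(decoded_text):
    """Return the last non-empty <brick-class> value in the decoded output, or None."""
    matches = _BRICK_CLASS_TAG.findall(decoded_text)
    return matches[-1] if matches else None


# Shortened copy of the decoded output shown earlier.
decoded = (
    "Please provide a brief reply with the new brick class within the tags "
    "<brick-class></brick-class> so that I can retrieve it.\n"
    "assistant\n<think>\n\n</think>\n<brick-class>Heating_Coil</brick-class>\n"
)
print(extract_brick_class(decoded))
```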

In [None]:
assign_new_classes(data_graph)

TypeError: assign_new_classes() missing 2 required positional arguments: 'equivalent_subjects' and 'subject_classes'

In [None]:
distinct_class_subgraphs[0]

<Graph identifier=N5d7d30764f494f0cb7110a871aa59d36 (<class 'rdflib.graph.Graph'>)>

In [None]:
distinct_class_subgraphs, seen_subjects, equivalent_subjects, subject_classes = get_class_isomorphisms(data_graph)
pprint(equivalent_subjects)

distinct_class_subgraphs, seen_subjects, equivalent_subjects, subject_classes = get_class_isomorphisms(s223_data_graph)
pprint(equivalent_subjects)

100%|██████████| 3429/3429 [00:01<00:00, 3152.39it/s]


[[rdflib.term.URIRef('urn:example#vav-with-reheat_rhc_3_9'),
  rdflib.term.URIRef('urn:example#vav-with-reheat_rhc_3_8'),
  rdflib.term.URIRef('urn:example#vav-with-reheat_rhc_3_7'),
  rdflib.term.URIRef('urn:example#vav-with-reheat_rhc_3_6'),
  rdflib.term.URIRef('urn:example#vav-with-reheat_rhc_3_5'),
  rdflib.term.URIRef('urn:example#vav-with-reheat_rhc_3_4'),
  rdflib.term.URIRef('urn:example#vav-with-reheat_rhc_3_3'),
  rdflib.term.URIRef('urn:example#vav-with-reheat_rhc_3_2'),
  rdflib.term.URIRef('urn:example#vav-with-reheat_rhc_3_14'),
  rdflib.term.URIRef('urn:example#vav-with-reheat_rhc_3_13'),
  rdflib.term.URIRef('urn:example#vav-with-reheat_rhc_3_12'),
  rdflib.term.URIRef('urn:example#vav-with-reheat_rhc_3_11'),
  rdflib.term.URIRef('urn:example#vav-with-reheat_rhc_3_10'),
  rdflib.term.URIRef('urn:example#vav-with-reheat_rhc_3_1'),
  rdflib.term.URIRef('urn:example#vav-with-reheat_rhc_3_0'),
  rdflib.term.URIRef('urn:example#vav-with-reheat_rhc_2_9'),
  rdflib.term.URIRe

100%|██████████| 6945/6945 [00:02<00:00, 3257.00it/s]

[[rdflib.term.URIRef('urn:example#ahu_3con'),
  rdflib.term.URIRef('urn:example#ahu_2con'),
  rdflib.term.URIRef('urn:example#ahu_1con'),
  rdflib.term.URIRef('urn:example#ahu_0con'),
  rdflib.term.URIRef('urn:example#connection_ffb196cc'),
  rdflib.term.URIRef('urn:example#connection_d42d7b83'),
  rdflib.term.URIRef('urn:example#connection_caee10fe'),
  rdflib.term.URIRef('urn:example#connection_c5beb3e7'),
  rdflib.term.URIRef('urn:example#connection_c5bbef44'),
  rdflib.term.URIRef('urn:example#connection_9c5a94c9'),
  rdflib.term.URIRef('urn:example#connection_86bf7afe'),
  rdflib.term.URIRef('urn:example#connection_581d5f80'),
  rdflib.term.URIRef('urn:example#connection_44e373b8'),
  rdflib.term.URIRef('urn:example#connection_3b4ac2e2'),
  rdflib.term.URIRef('urn:example#connection_1fd24ce3'),
  rdflib.term.URIRef('urn:example#connection_1acd19ad')],
 [rdflib.term.URIRef('urn:example#multiple-zone-ahu_ra_damper_3'),
  rdflib.term.URIRef('urn:example#multiple-zone-ahu_ra_damper_2'


