It has been a while since I last looked at this, so I may be missing some of the intent of the original algorithm. Rewriting it here.

# Create Approximate Schema
1. Start with an empty b-schema (graph size 0) and select a triple from the data graph.
    - On the first step, add a random triple. On all subsequent steps, select a triple that is connected to the existing b-schema but not currently represented by it.
2. Build the graph around this triple, using a parameter for the number of hops (default value of 1).
3. Generate a class graph for this mini graph.
4. Query the main b-schema with the b-schema generated for this mini graph to see whether the mini graph is already represented.
5. If this query fails, find which statements in the mini b-schema are not represented by the main b-schema, and add those statements to the main b-schema.
6. Repeating these steps creates an approximate b-schema.
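
A minimal sketch of the loop above, under simplifying assumptions: triples are plain Python tuples instead of rdflib terms, `types` maps each node to a single class, and "represented" is approximated by class-pattern membership rather than by running a SPARQL query. The helper names (`class_pattern`, `one_hop`, `approximate_schema`) are hypothetical and not part of `algo3`. The seed triple is picked deterministically instead of at random so the result is reproducible.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]


def class_pattern(triple: Triple, types: Dict[str, str]) -> Triple:
    """Replace subject and object by their class (untyped nodes stay as-is)."""
    s, p, o = triple
    return (types.get(s, s), p, types.get(o, o))


def one_hop(data: Set[Triple], triple: Triple) -> Set[Triple]:
    """Mini graph: every triple touching either end of `triple` (hops = 1)."""
    ends = {triple[0], triple[2]}
    return {t for t in data if t[0] in ends or t[2] in ends}


def approximate_schema(data: Set[Triple], types: Dict[str, str]) -> Set[Triple]:
    schema: Set[Triple] = set()   # class-level statements found so far
    covered: Set[str] = set()     # instance nodes already pulled into a mini graph
    while True:
        # Step 1: a triple connected to what we have, but not yet represented.
        candidates = [
            t for t in sorted(data)
            if class_pattern(t, types) not in schema
            and (not covered or t[0] in covered or t[2] in covered)
        ]
        if not candidates:
            return schema
        seed = candidates[0]
        # Steps 2-3: one-hop mini graph and its class graph.
        hop = one_hop(data, seed)
        mini = {class_pattern(t, types) for t in hop}
        # Steps 4-5: add whatever the main schema does not already contain.
        schema |= mini - schema
        covered |= {node for t in hop for node in (t[0], t[2])}


# Toy data (assumed, not taken from brick-example.ttl).
types = {"ahu1": "AHU", "vav1": "VAV", "vav2": "VAV", "t1": "Temp", "t2": "Temp"}
data = {
    ("ahu1", "feeds", "vav1"),
    ("ahu1", "feeds", "vav2"),
    ("vav1", "hasPoint", "t1"),
    ("vav2", "hasPoint", "t2"),
}
schema = approximate_schema(data, types)
```

Because membership is checked per class pattern, this sketch cannot tell two structurally different instances of the same class apart; that is the gap the verification step is meant to close.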

# Verify schema 
1. Iterate through each triple.
2. Based on the classes and relation of the new triple, find candidate statements in the b-schema that may represent it.
3. Turn the candidate b-schema statements into SPARQL queries using the class and relation information.
4. Run the queries one at a time, with a limit of one, with the identity of the added triple bound to the candidate statement.
    - If one of these queries succeeds, add the new triple to the list of triples represented by that statement in the b-schema.
5. If all triples are represented, end the process.
6. Instances represented by one part of the summary will need to be removed from it if they are also represented by another part of the summary. E.g. the VAVs will all have an ASK result of true when bound to vav-cooling-only. However, only the ones not judged true for VAV with reheat should be linked to vav-cooling-only.
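
A rough sketch of the exclusivity rule in step 6, assuming each b-schema statement can be reduced to a set of required `(predicate, object class)` pairs and each instance to the set of pairs it actually has. The names and the toy data are hypothetical; the real check would run the ASK queries from step 4 instead of comparing sets.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Feature = Tuple[str, str]  # (predicate, object class)


def assign_instances(
    instances: Dict[str, Set[Feature]],
    statements: Dict[str, FrozenSet[Feature]],
) -> Dict[str, List[str]]:
    """Link each instance only to the most specific statements it satisfies."""
    assignment: Dict[str, List[str]] = {}
    for node, features in instances.items():
        # Equivalent of "the ASK query is true when bound to this statement".
        matches = [name for name, req in statements.items() if req <= features]
        # Drop a match if another match requires a strict superset of it.
        assignment[node] = sorted(
            m for m in matches
            if not any(statements[m] < statements[other] for other in matches)
        )
    return assignment


statements = {
    "vav-cooling-only": frozenset({("hasPart", "Damper")}),
    "vav-with-reheat": frozenset({("hasPart", "Damper"), ("hasPart", "Reheat_Coil")}),
}
instances = {
    "vav_a": {("hasPart", "Damper")},
    "vav_b": {("hasPart", "Damper"), ("hasPart", "Reheat_Coil")},
}
assignment = assign_instances(instances, statements)
all_represented = all(assignment.values())  # step 5: stop when nothing is unassigned
```

Here `vav_b` satisfies both statements but is linked only to `vav-with-reheat`, which is the behaviour step 6 asks for.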


# Status
- Still haven't quite made the b-schema
- The class graph is what algo2 ended up producing pretty much, and it is a useful summary that can be quickly constructed. 

Maybe can do the following: 
- Start from the nodes in the b-schema.
    - First, see if there are any triples between nodes in the b-schema that are not already represented. This covers the case where individual classes in the class graph map to multiple instances.
    - Then, starting from the nodes in the b-schema, test one hop around each of them. If the one-hop query doesn't succeed, add the triples from the one-hop graph that don't appear in the b-schema (maybe selected by class, but if triple ordering is consistent/deterministic this is not necessary).

E.g. let's say we test the VAV with reheat. On the first pass we would find that there is a missing feeds relationship from the AHU to the VAV with reheat. Then we would construct the one-hop query and it would fail. We could then see which triples are in the one-hop graph and not in the b-schema, and add those.
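
The two passes can be sketched on plain tuples. This is a simplification, not the notebook's implementation: the one-hop query is approximated by comparing "signatures" (direction, predicate, neighbour class) and ignores multiplicity, and `signature`, `nodes_of`, and `repair` are hypothetical names.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]


def signature(node: str, graph: Set[Triple], types: Dict[str, str]) -> Set[Triple]:
    """Class-level view of the one-hop neighbourhood of `node`."""
    sig = set()
    for s, p, o in graph:
        if s == node:
            sig.add(("out", p, types.get(o, o)))
        if o == node:
            sig.add(("in", p, types.get(s, s)))
    return sig


def nodes_of(graph: Set[Triple]) -> Set[str]:
    return {t[0] for t in graph} | {t[2] for t in graph}


def repair(data: Set[Triple], schema: Set[Triple], types: Dict[str, str]) -> Set[Triple]:
    repaired = set(schema)
    nodes = nodes_of(repaired)
    # First pass: triples between nodes that are already in the b-schema.
    repaired |= {t for t in data if t[0] in nodes and t[2] in nodes}
    # Second pass: one-hop test around each node.
    for node in sorted(nodes):
        wanted = signature(node, data, types)
        represented = any(
            types.get(other) == types.get(node)
            and wanted <= signature(other, repaired, types)
            for other in nodes_of(repaired)
        )
        if not represented:
            # Add the one-hop triples that the b-schema is missing.
            repaired |= {t for t in data if node in (t[0], t[2])}
    return repaired


# Toy version of the VAV-with-reheat example (identifiers are made up).
types = {"ahu": "AHU", "vav_co": "VAV", "vav_rh": "VAV",
         "dmp1": "Damper", "dmp2": "Damper", "coil": "Reheat_Coil"}
data = {
    ("ahu", "feeds", "vav_co"), ("ahu", "feeds", "vav_rh"),
    ("vav_co", "hasPart", "dmp1"),
    ("vav_rh", "hasPart", "dmp2"), ("vav_rh", "hasPart", "coil"),
}
schema = {("ahu", "feeds", "vav_co"), ("vav_co", "hasPart", "dmp1"),
          ("vav_rh", "hasPart", "coil")}
repaired = repair(data, schema, types)
```

In this toy run the first pass adds the missing `feeds` triple to `vav_rh`, and the second pass adds its damper because no VAV in the b-schema has both a damper and a reheat coil.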

The b-schema for Brick is being created correctly. 223P is not working correctly, for a couple of reasons. 1) There are some connection points two hops away from anything else (I think), and these aren't getting matched to anything valid in the b-schema. 2) Properties aren't getting properly distinguished from each other. This may be because the hasAspect information isn't included in the query for some reason, although hasAspect is showing up on other things. Only one property is showing up on the AHU, but there should be many.

Point 2 is easier to address than point 1. However, point 1 wouldn't affect fully valid 223P models (with all connection points connected). If there are lots of junctions in some models I can see it causing issues with my 1 hop subgraph approach. Could add temporary relations for inferencing (which is a common OWL thing to do I think). 
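
One way the temporary-relation idea could look, sketched on plain tuples: collapse every two-step path over a given predicate into a single temporary edge, so that a one-hop subgraph can see across it, and strip the temporary edges again afterwards. This is not an OWL reasoner; the `tmp:twoHop` predicate, the function names, and the toy identifiers are all hypothetical.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]


def add_shortcuts(triples: Set[Triple], via: str, shortcut: str) -> Set[Triple]:
    """For every path a -via-> b -via-> c with a != c, add (a, shortcut, c)."""
    successors: Dict[str, Set[str]] = {}
    for s, p, o in triples:
        if p == via:
            successors.setdefault(s, set()).add(o)
    extra = {
        (a, shortcut, c)
        for a, mids in successors.items()
        for b in mids
        for c in successors.get(b, set())
        if a != c
    }
    return triples | extra


def drop_shortcuts(triples: Set[Triple], shortcut: str) -> Set[Triple]:
    """Remove the temporary edges once the summary has been built."""
    return {t for t in triples if t[1] != shortcut}


model = {
    ("outlet_cp", "s223:cnx", "connection"),
    ("connection", "s223:cnx", "inlet_cp"),
}
augmented = add_shortcuts(model, "s223:cnx", "tmp:twoHop")
restored = drop_shortcuts(augmented, "tmp:twoHop")
```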

In [92]:
from algo3 import * 
from tqdm import tqdm 

In [None]:
def print_bschema(b_schema_statements, save = None):
    b_schema_trpls = [stmt.pattern[0][1] for stmt in b_schema_statements]
    b_schema_graph = Graph(store = 'Oxigraph')
    for trpl in b_schema_trpls:
        b_schema_graph.add((trpl.s, trpl.p, trpl.o))
    if save is not None:
        b_schema_graph.serialize(save, format="turtle")
    b_schema_graph.print()

In [None]:
data_graph = Graph(store = 'Oxigraph')
data_graph.parse("brick-example.ttl", format="turtle")

algo = BSchemaGenerator(data_graph)
ex_trpl = algo.all_triples[2]

query_one_hop_with_classes = f"""
{get_prefixes(data_graph)}
CONSTRUCT {{
    ?s ?p ?o .
    ?s a ?scls .
    ?o a ?cls .
}}
WHERE {{
    ?s ?p ?o . 
    ?s a ?scls .
    ?o a ?cls .
    FILTER(?s = <{ex_trpl.s}>) .
}}
"""

mini_bs_graph = Graph(store = 'Oxigraph')
mini_bs_graph.parse(data = data_graph.query(query_one_hop_with_classes).graph.serialize(format = 'ttl'), format = 'ttl')
mini_bs = BSchemaGenerator(mini_bs_graph)
stmts = mini_bs.generate_b_schema()


In [None]:
# class graph 
mini_bs_graph = Graph(store = 'Oxigraph')
mini_bs_graph.parse(data = data_graph.query(query_one_hop_with_classes).graph.serialize(format = 'ttl'), format = 'ttl')
mini_bs = BSchemaGenerator(mini_bs_graph)


def create_pattern_dict(mini_bs):
    pattern_dict = {}
    for triple in mini_bs.all_triples:
        pattern = mini_bs._create_class_pattern([triple])[0][0]
        # Keep only the first triple seen for each class pattern.
        if pattern not in pattern_dict:
            pattern_dict[pattern] = triple
    return pattern_dict

In [None]:
g = Graph(store = 'Oxigraph')
pattern_dict = create_pattern_dict(mini_bs)
for pattern, triple in pattern_dict.items():
    g.add((triple.s, triple.p, triple.o))

In [None]:
def create_class_graph(mini_bs):
    class_graph = Graph(store = 'Oxigraph')
    triple_graph = Graph(store = 'Oxigraph')
    triple_schema = []
    for triple in mini_bs.all_triples:
        pattern = mini_bs._create_class_pattern([triple])[0]
        class_pattern = pattern[0]
        class_triple = (URIRef(class_pattern.s), URIRef(class_pattern.p), URIRef(class_pattern.o))
        # Keep only the first instance triple seen for each class-level triple.
        if class_triple not in class_graph:
            class_graph.add(class_triple)
            triple_schema.append(pattern)
            triple_graph.add((triple.s, triple.p, triple.o))
    # Add any triples from the data graph that connect nodes already present in triple_graph.
    all_nodes = {s for s, _, _ in triple_graph} | {o for _, _, o in triple_graph}
    for node in all_nodes:
        for node2 in all_nodes:
            # Graph.add is idempotent, so no membership check is needed.
            for triple in mini_bs.data_graph.triples((node, None, node2)):
                triple_graph.add(triple)
    return class_graph, triple_schema, triple_graph

In [None]:
bschema = BSchemaGenerator(data_graph)
class_graph, triple_schema, triple_graph = create_class_graph(bschema)


In [None]:
triple_graph.print()

In [None]:
class_graph.print()

In [None]:
pq = PatternQuery(triple_schema, data_graph)

In [None]:
print(pq.get_ask_query())

In [None]:
data_graph.query(pq.get_ask_query()).askAnswer
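
`PatternQuery` comes from `algo3` and its internals are not shown in this notebook. As a rough, hypothetical illustration of what an ASK query built from class-level statements could look like (not the actual implementation), the sketch below turns `(subject variable, subject class, predicate, object variable, object class)` tuples into a query string and optionally binds a variable to a specific instance with `VALUES`. The function name and the tuple layout are assumptions.

```python
from typing import Dict, List, Optional, Tuple

# (subject variable, subject class IRI, predicate IRI, object variable, object class IRI)
Statement = Tuple[str, str, str, str, str]


def build_ask_query(
    statements: List[Statement],
    bindings: Optional[Dict[str, str]] = None,
) -> str:
    lines = ["ASK WHERE {"]
    # Pin selected variables to concrete instances.
    for var, iri in sorted((bindings or {}).items()):
        lines.append(f"  VALUES ?{var} {{ <{iri}> }}")
    typed = set()
    for s_var, s_cls, pred, o_var, o_cls in statements:
        # Emit each type constraint once per (variable, class) pair.
        for var, cls in ((s_var, s_cls), (o_var, o_cls)):
            if (var, cls) not in typed:
                typed.add((var, cls))
                lines.append(f"  ?{var} a <{cls}> .")
        lines.append(f"  ?{s_var} <{pred}> ?{o_var} .")
    lines.append("}")
    return "\n".join(lines)


BRICK = "https://brickschema.org/schema/Brick#"
example_query = build_ask_query(
    [("ahu", BRICK + "AHU", BRICK + "feeds", "vav", BRICK + "VAV")],
    bindings={"vav": "urn:example#vav-with-reheat_name_0_0"},
)
```

A string like `example_query` could then be passed to `Graph.query`, with the boolean read from `.askAnswer` as in the cell above.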

In [80]:
def get_subgraph_with_hops(
    graph: Graph, 
    central_node: Union[str, rdflib.URIRef], 
    num_hops: int
) -> Graph:
    """
    Extract a subgraph containing all triples within num_hops from central_node,
    including rdf:type information for all entities.
    
    Args:
        graph: The source RDF graph
        central_node: The central node URI (as string or URIRef)
        num_hops: Number of hops to traverse from the central node
    
    Returns:
        A new Graph containing the subgraph
    """
    if isinstance(central_node, str):
        central_node = rdflib.URIRef(central_node)
    
    subgraph = Graph(store = 'Oxigraph')
    visited_nodes = set()
    current_layer = {central_node}
    
    # Add class information for central node
    for class_uri in graph.objects(central_node, RDF.type):
        subgraph.add((central_node, RDF.type, class_uri))
    
    for hop in range(num_hops):
        next_layer = set()
        
        for node in current_layer:
            if node in visited_nodes:
                continue
            visited_nodes.add(node)
            
            # Get triples where node is subject
            for p, o in graph.predicate_objects(node):
                subgraph.add((node, p, o))
                if isinstance(o, rdflib.URIRef):
                    next_layer.add(o)
                    # Add class information for object
                    for class_uri in graph.objects(o, RDF.type):
                        subgraph.add((o, RDF.type, class_uri))
            
            # Get triples where node is object
            for s, p in graph.subject_predicates(node):
                subgraph.add((s, p, node))
                if isinstance(s, rdflib.URIRef):
                    next_layer.add(s)
                    # Add class information for subject
                    for class_uri in graph.objects(s, RDF.type):
                        subgraph.add((s, RDF.type, class_uri))
        
        current_layer = next_layer
    
    return subgraph

In [102]:
def compare_to_query(subject = """urn:example#vav-with-reheat_name_0_0""", data_graph = data_graph, triple_graph = triple_graph):
    """
    Check whether the one-hop neighborhood of `subject` is represented in triple_graph.

    Returns (success, pattern_query, missing), where `missing` holds the one-hop
    triples absent from triple_graph when the ASK query fails, and None otherwise.
    """
    one_hop_graph = get_subgraph_with_hops(data_graph, subject, 1)
    bschema = BSchemaGenerator(one_hop_graph)
    _, one_hop_triple_schema, _ = create_class_graph(bschema)
    pq = PatternQuery(one_hop_triple_schema, one_hop_graph)
    res = triple_graph.query(pq.get_ask_query())
    if not res.askAnswer:
        return False, pq, one_hop_graph - triple_graph
    return True, pq, None

In [82]:
# For this specific example, this would complete the b-schema. I don't think that holds up in all cases
success, pq, _ = compare_to_query()

In [83]:
triple_graph.print()

@prefix brick: <https://brickschema.org/schema/Brick#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<urn:example#multiple-zone-ahu_name_0> a brick:AHU ;
    rdfs:label "AHU"^^xsd:string ;
    brick:feeds <urn:example#vav-cooling-only_name_0_0>,
        <urn:example#vav-with-reheat_name_0_0> ;
    brick:hasPart <urn:example#multiple-zone-ahu_clg_coil_0>,
        <urn:example#multiple-zone-ahu_htg_coil_0>,
        <urn:example#multiple-zone-ahu_ra_damper_0>,
        <urn:example#multiple-zone-ahu_sa_fan_0> ;
    brick:hasPoint <urn:example#multiple-zone-ahu_filter_pd_0>,
        <urn:example#multiple-zone-ahu_ma_temp_0>,
        <urn:example#multiple-zone-ahu_oa_temp_0>,
        <urn:example#multiple-zone-ahu_ra_temp_0>,
        <urn:example#multiple-zone-ahu_sa_temp_0> .

<urn:example#multiple-zone-ahu_clg_coil-valve_cmd_0> a brick:Valve_Position_Command ;
    rdfs:label "Valve_Position_Command"^^xsd:string .

<urn:example

In [84]:
s223_data_graph = Graph(store = 'Oxigraph')
s223_data_graph.parse("s223-example.ttl", format="turtle")
s223_bschema = BSchemaGenerator(s223_data_graph)
s223_class_graph, s223_triple_schema, s223_triple_graph = create_class_graph(s223_bschema)

compare_to_query(subject = """urn:example#vav-with-reheat_name_0_0""", data_graph = s223_data_graph, triple_graph = s223_triple_graph)

skipping subject empty string


(False,
 <algo3.PatternQuery at 0x1192f12d0>,
 <Graph identifier=N49cac46a0e104dde81d8f38093b7b836 (<class 'rdflib.graph.Graph'>)>)

In [None]:
def compare_all_nodes(data_graph = data_graph, triple_graph = triple_graph, add_from_initial = True):
    matches = 0
    not_matches = 0
    unmatched_subjects = set()
    added_to_triples = 0
    print('Starting Graph Types')
    print(f"{triple_graph.store=}, {data_graph.store=}")
    if add_from_initial:
        # Iterate over a snapshot, since triple_graph is extended inside the loop.
        for s, _, _ in tqdm(list(triple_graph)):
            success, pq, missing_graph = compare_to_query(subject = s, data_graph = data_graph, triple_graph = triple_graph)
            if not success:
                for missing_triple in missing_graph:
                    triple_graph.add(missing_triple)
                added_to_triples += 1

    print('Graph Types After adjusting triple_graph')
    print(f"{triple_graph.store=}, {data_graph.store=}")
    for s, p, o in tqdm(data_graph):
        success, pq, unmatched_graph = compare_to_query(subject = s, data_graph = data_graph, triple_graph = triple_graph)
        if success:
            matches += 1
        else:
            not_matches += 1
            unmatched_subjects.add(s)
    print('Graph Types After query_comparison')
    print(f"{triple_graph.store=}, {data_graph.store=}, {unmatched_graph=}")
    return matches, not_matches, unmatched_subjects, added_to_triples, triple_graph

In [108]:
matches, not_matches, unmatched_subjects, added_to_triples, new_triple_graph = compare_all_nodes(data_graph = data_graph, triple_graph = triple_graph)

Starting Graph Types
triple_graph.store=<oxrdflib.store.OxigraphStore object at 0x10f55da80>, data_graph.store=<oxrdflib.store.OxigraphStore object at 0x105835d80>


100%|██████████| 91/91 [00:00<00:00, 447.82it/s]


Graph Types After adjusting triple_graph
triple_graph.store=<oxrdflib.store.OxigraphStore object at 0x10f55da80>, data_graph.store=<oxrdflib.store.OxigraphStore object at 0x105835d80>


100%|██████████| 3429/3429 [00:05<00:00, 571.65it/s] 

skipping subject empty string
Graph Types After query_comparison
triple_graph.store=<oxrdflib.store.OxigraphStore object at 0x10f55da80>, data_graph.store=<oxrdflib.store.OxigraphStore object at 0x105835d80>, unmatched_graph=None





In [109]:
print("Matched: ", matches, " Not matched: ", not_matches, "Unique unmatched subjects: ", len(unmatched_subjects))
print("Added to triples_graph: ", added_to_triples, " times")
pprint(unmatched_subjects)
new_triple_graph.print()

Matched:  3429  Not matched:  0 Unique unmatched subjects:  0
Added to triples_graph:  0  times
set()
@prefix brick: <https://brickschema.org/schema/Brick#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<urn:example#multiple-zone-ahu_name_0> a brick:AHU ;
    rdfs:label "AHU"^^xsd:string ;
    brick:feeds <urn:example#vav-cooling-only_name_0_0>,
        <urn:example#vav-with-reheat_name_0_0> ;
    brick:hasPart <urn:example#multiple-zone-ahu_clg_coil_0>,
        <urn:example#multiple-zone-ahu_htg_coil_0>,
        <urn:example#multiple-zone-ahu_ra_damper_0>,
        <urn:example#multiple-zone-ahu_sa_fan_0> ;
    brick:hasPoint <urn:example#multiple-zone-ahu_filter_pd_0>,
        <urn:example#multiple-zone-ahu_ma_temp_0>,
        <urn:example#multiple-zone-ahu_oa_temp_0>,
        <urn:example#multiple-zone-ahu_ra_temp_0>,
        <urn:example#multiple-zone-ahu_sa_temp_0> .

<urn:example#multiple-zone-ahu_clg_coil-valve_cmd_0

In [110]:
new_triple_graph.serialize('algo3-brick.ttl', format = 'ttl')

<Graph identifier=N7b7268c1cd2d4e7dbd0d1c120e986016 (<class 'rdflib.graph.Graph'>)>

In [112]:
matches, not_matches, unmatched_subjects, added_to_triples, s223_new_triple_graph = compare_all_nodes(data_graph = s223_data_graph, triple_graph = s223_triple_graph)

Starting Graph Types
triple_graph.store=<oxrdflib.store.OxigraphStore object at 0x1192e6140>, data_graph.store=<oxrdflib.store.OxigraphStore object at 0x105855e70>


100%|██████████| 209/209 [00:00<00:00, 561.51it/s]


Graph Types After adjusting triple_graph
triple_graph.store=<oxrdflib.store.OxigraphStore object at 0x1192e6140>, data_graph.store=<oxrdflib.store.OxigraphStore object at 0x105855e70>


100%|██████████| 6945/6945 [00:10<00:00, 666.48it/s] 

skipping subject empty string
Graph Types After query_comparison
triple_graph.store=<oxrdflib.store.OxigraphStore object at 0x1192e6140>, data_graph.store=<oxrdflib.store.OxigraphStore object at 0x105855e70>, unmatched_graph=None





In [113]:
# The outlet connection point is missing the connectedTo relationship, which causes the match to fail.
# This approach won't work if there are a lot of non-unique things within one hop, and two hops explodes the graph too much.

# Other nodes are not matched and do not seem to show up in the graph, like the AHU properties.
# The graph around an AHU supply air temperature property just looks like: AHU hasProperty property, property has quantity kind/unit temperature.
# Labels/NamedNodes are not included in the summary, so it doesn't check aspects.
# Two paths forward: either create classes for s223 properties based on their aspects (Haystack tagsets), or include named nodes (URIs but not literals).
# The second path seems better.
print("Matched: ", matches, " Not matched: ", not_matches, "Unique unmatched subjects: ", len(unmatched_subjects))
pprint(list(unmatched_subjects))

Matched:  6945  Not matched:  0 Unique unmatched subjects:  0
[]
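
A small sketch of the second path (keeping named nodes in the summary), on plain tuples. The assumptions are loud ones: literals are represented as strings starting with a double quote, the `ex:` identifiers are made up, and `node_label` / `pattern_with_named_nodes` are hypothetical helpers rather than anything in `algo3`.

```python
from typing import Dict, Tuple

Triple = Tuple[str, str, str]


def node_label(node: str, types: Dict[str, str]) -> str:
    if node in types:
        return types[node]       # typed instance: replace by its class
    if node.startswith('"'):
        return "LITERAL"         # literals are collapsed into one placeholder
    return node                  # untyped named node (e.g. an aspect) is kept


def pattern_with_named_nodes(triple: Triple, types: Dict[str, str]) -> Triple:
    s, p, o = triple
    return (node_label(s, types), p, node_label(o, types))


types = {"sa_temp": "ex:Property", "ra_temp": "ex:Property"}
supply = pattern_with_named_nodes(("sa_temp", "ex:hasAspect", "ex:SupplyAspect"), types)
ret = pattern_with_named_nodes(("ra_temp", "ex:hasAspect", "ex:ReturnAspect"), types)
label = pattern_with_named_nodes(("sa_temp", "rdfs:label", '"Supply air temp"'), types)
```

With the named node kept, the supply and return properties produce different patterns even though both instances share a class, while their labels still collapse to the same placeholder.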


In [114]:
# # Usage example:
# subgraph = get_subgraph_with_hops(
#     s223_data_graph, 
#     'urn:example#multiple-zone-ahu_ra_temp_0',
#     num_hops=1
# )

# subgraph.print()

In [115]:
s223_new_triple_graph.serialize('algo3-s223.ttl', format = 'ttl')

<Graph identifier=Nc4fc8f72583c407a8b391d004634c25b (<class 'rdflib.graph.Graph'>)>

In [117]:
s223_new_triple_graph.print()

@prefix ns1: <http://data.ashrae.org/standard223#> .
@prefix ns2: <http://qudt.org/schema/qudt/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<urn:example#vav-cooling-only_zone_0_0> a ns1:DomainSpace ;
    rdfs:label "DomainSpace"^^xsd:string ;
    ns1:hasDomain ns1:Domain-HVAC .

<urn:example#connection_44e373b8> a ns1:Connection ;
    ns1:cnx <urn:example#outlet_cp_69174235> .

<urn:example#connection_c5beb3e7> a ns1:Connection ;
    rdfs:label "Connection"^^xsd:string ;
    ns1:cnx <urn:example#inlet_cp_1e64ecc2> .

<urn:example#inlet_cp_5b811537> a ns1:InletConnectionPoint ;
    ns1:cnx <urn:example#multiple-zone-ahu_htg_coil_0> .

<urn:example#inlet_cp_5e2bf803> a ns1:InletConnectionPoint ;
    rdfs:label "InletConnectionPoint"^^xsd:string ;
    ns1:cnx <urn:example#multiple-zone-ahu_ra_damper_1> ;
    ns1:hasMedium ns1:Medium-Air .

<urn:example#inlet_cp_7aaf2b0f> a ns1:InletConnectionPoint ;
    ns1:cnx <urn:exampl

In [None]:
for s,o in s223_data_graph[:S223.hasAspect:]:
    print(type(s),type(o))
    break

for s,o in s223_data_graph[:RDFS.label:]:
    print(type(s),type(o))
    break


In [None]:
success, pq, _ = compare_to_query(subject="urn:example#multiple-zone-ahu_htg_coil_0", data_graph=s223_data_graph, triple_graph=s223_new_triple_graph)

In [None]:
pq.graph.print()