# STS-based BIM-LCA matching using LLM embeddings
Building on the previous introduction to large language models (LLMs), this notebook uses an LLM to calculate semantic textual similarity (STS) as the cosine similarity between two text embeddings.
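
For reference, the cosine similarity between two embedding vectors $\mathbf{u}$ and $\mathbf{v}$ is conventionally defined as shown below. This is the standard textbook definition and is not specific to this notebook; the symbols $\mathbf{u}$, $\mathbf{v}$ and the index $i$ are introduced here only for illustration.

$$
\operatorname{cos\_sim}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert} = \frac{\sum_i u_i v_i}{\sqrt{\sum_i u_i^2} \, \sqrt{\sum_i v_i^2}}
$$

The value lies between -1 and 1, and a higher value indicates that the two texts are semantically closer.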

### Setup including tokenizer and LLM

In [2]:
import re
from sentence_transformers import util, SentenceTransformer
import time
import os
import json

- We use the tokenizer and related settings from spaCy.
- The selected LLM ('IfcMaterial2MP') was fine-tuned on the material matches of 23 real-world case studies and their material datasets from the EPEA material database. The full publication is available here: https://mediatum.ub.tum.de/doc/1748706/cvmzzzhbk6nbww149l7eugzk4.2024_Forth_i3CE.pdf

In [None]:
import spacy

# Load the English tokenizer, tagger, parser, etc.
nlp = spacy.load("en_core_web_sm")

# Load LLM
llm_name = 'kforth/IfcMaterial2MP'
llm = SentenceTransformer(llm_name)


### Select relevant elements for testing the matching accuracy
- Make sure that the correct elements and their layers are selected for matching.

In [None]:
# Load relevant material-specific elements and layers
relevantBoM_Elements = [
  '1OG-Balkon-1-1',
  'Basic-Wall-3-Schichtplatte-27mm---Schalplatte-3555236',
  'Basic-Wall-BSP-140---5s-1686981',
  'Basic-Wall-Wärmedämmung_Mineralwolle_130-1839774',
  'Basic-Wall-Wärmedämmung_Slentex-70-687731',
  'Bleche-Fassade_V6-Bleche-Fassade_V6-4910201',
  'EG-A-1-1',
  'Floor-3-Schichtplatte-40mm-5206093',
  'Pfosten-rechteckig-Forster-60-x-90---Nur-Pfosten-2355404',
  'Stiffener-Stiffener-1094120',
  'TU-DF-1---Pfostenstock-Falztüre-DL---900-x-2130-1118965'
]

relevantBoM_Layers = [
  'Basic-Wall-Holzbau_Archisonic_Vorbauschale-2146702_L1',
  'Basic-Wall-Holzfassade-hinterlüftet-1105047_L1',
  'Floor-Finish-Floor---Wood-169093_L1',
  'Floor-Residential---Wood-Joist-with-Subflooring-144800_L1'
]

Download the pre-processed bills of materials of the relevant elements and their layers.

In [None]:
import requests

# Request relevant material-specific elements and layers
for element in relevantBoM_Elements:
  url = f"https://raw.githubusercontent.com/jakob-beetz/sbe-2025-lca-workshop/refs/heads/main/04_llm_prompt-based_matching/SBE_15_samples/step_01_data_extraction/step_01d_filter_data/Elements/{element}.json"
  output_path = f"{element}.json"
  response = requests.get(url)
  response.raise_for_status()  # stop early if the download failed
  with open(output_path, "wb") as f:
      f.write(response.content)

for layer in relevantBoM_Layers:
  url = f"https://raw.githubusercontent.com/jakob-beetz/sbe-2025-lca-workshop/refs/heads/main/04_llm_prompt-based_matching/SBE_15_samples/step_01_data_extraction/step_01d_filter_data/Target_Layers/{layer}.json"
  output_path = f"{layer}.json"
  response = requests.get(url)
  response.raise_for_status()  # stop early if the download failed
  with open(output_path, "wb") as f:
      f.write(response.content)

print("✅ Bill of materials downloaded.")

✅ Bill of materials downloaded.


Load and extract the relevant elements

In [None]:
from pprint import pprint

# Extract element data
relevantBoMdict_Elements = {}

for element in relevantBoM_Elements:
  with open(f"{element}.json") as f:
    data = json.load(f)
  relevantBoMdict_Elements[element] = data
pprint(list(relevantBoMdict_Elements.keys()))

['1OG-Balkon-1-1',
 'Basic-Wall-3-Schichtplatte-27mm---Schalplatte-3555236',
 'Basic-Wall-BSP-140---5s-1686981',
 'Basic-Wall-Wärmedämmung_Mineralwolle_130-1839774',
 'Basic-Wall-Wärmedämmung_Slentex-70-687731',
 'Bleche-Fassade_V6-Bleche-Fassade_V6-4910201',
 'EG-A-1-1',
 'Floor-3-Schichtplatte-40mm-5206093',
 'Pfosten-rechteckig-Forster-60-x-90---Nur-Pfosten-2355404',
 'Stiffener-Stiffener-1094120',
 'TU-DF-1---Pfostenstock-Falztüre-DL---900-x-2130-1118965']


Load and extract the relevant layers.

In [None]:
# Extract layer data
relevantBoMdict_Layers = {}

for layer in relevantBoM_Layers:
  with open(f"{layer}.json") as f:
    data = json.load(f)
  relevantBoMdict_Layers[layer] = data

pprint(list(relevantBoMdict_Layers.keys()))

['Basic-Wall-Holzbau_Archisonic_Vorbauschale-2146702_L1',
 'Basic-Wall-Holzfassade-hinterlüftet-1105047_L1',
 'Floor-Finish-Floor---Wood-169093_L1',
 'Floor-Residential---Wood-Joist-with-Subflooring-144800_L1']


Load and extract all related materials.

In [50]:
# Extract the material names for each element
matearial_layers_per_element = []
for element in relevantBoMdict_Elements:
  matearial_layers_element = []
  for material in relevantBoMdict_Elements[element]['Element Material Data']:
    if isinstance(relevantBoMdict_Elements[element]['Element Material Data'], list):
      if 'Layers' in material.keys():
        for matearial_layer in material['Layers']:
            matearial_layers_element.append(matearial_layer['Material Name'])
      else:
        matearial_layers_element.append(material['Material Name'])
    else:
      matearial_layers_element.append(relevantBoMdict_Elements[element]['Element Metadata']['Name'])

  matearial_layers_element = set(matearial_layers_element)
  matearial_layers_per_element.append(matearial_layers_element)

matearial_layers_per_element

[{'Stahlbeton'},
 {'3-Schicht-Platte'},
 {'Brettschichtholz C24'},
 {'Wärmedämmung Mineralwolle, 0.035 W/mK'},
 {'Wärmedämmung Slentex Aerogel'},
 {'Material 4'},
 {'Massiv'},
 {'Massivholz'},
 {'Stahl, 45-34, Pulverbeschichtung schwarz'},
 {'Steel - S355J2G3'},
 {'TU DF 1 - Pfostenstock Falztüre:DL - 900 x 2130:1118965'}]
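
The extraction logic above can be illustrated on toy data. The following is a minimal sketch, assuming the JSON structure implied by the keys used in the cell (`Element Material Data`, `Layers`, `Material Name`, `Element Metadata`). The function name `extract_material_names`, the toy elements, and the `"n/a"` placeholder are hypothetical, and the fallback is simplified: whenever the material data is not a list, the element's own name is used.

```python
def extract_material_names(element):
    """Collect the unique material names of one element (assumed structure)."""
    material_data = element["Element Material Data"]
    # Fallback: no list of materials available, so use the element's own name
    if not isinstance(material_data, list):
        return {element["Element Metadata"]["Name"]}
    names = set()
    for material in material_data:
        if "Layers" in material:
            # Layered material: collect the name of every layer
            names.update(layer["Material Name"] for layer in material["Layers"])
        else:
            # Single material without layers
            names.add(material["Material Name"])
    return names


# Hypothetical toy elements, not taken from the workshop data
layered = {
    "Element Metadata": {"Name": "Wall-1"},
    "Element Material Data": [
        {"Layers": [{"Material Name": "Massivholz"}, {"Material Name": "Mineralwolle"}]}
    ],
}
single = {
    "Element Metadata": {"Name": "Slab-1"},
    "Element Material Data": [{"Material Name": "Stahlbeton"}],
}
no_materials = {
    "Element Metadata": {"Name": "Door-1"},
    "Element Material Data": "n/a",  # assumed placeholder for missing data
}

for toy_element in (layered, single, no_materials):
    print(extract_material_names(toy_element))
```

The third case mirrors the last entry of the output above, where the element name itself appears in place of a material name.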

In [51]:
# Extract the material name for each material layer
material_layer_names = [relevantBoMdict_Layers[layer_material]['Target Layer of Material Inference']['Material Name']
                  for layer_material in relevantBoMdict_Layers]
material_layer_names

['Archisonic Charcoal',
 'Holzschalung Vertikal, Nadelholz,  schwarz gestrichen',
 'Wood - Flooring',
 'Wood - Sheathing - plywood']

In [None]:
# Combine material and layer names in one list
material_names = material_layer_names + [list(material)[0] for material in matearial_layers_per_element]
material_names

['Archisonic Charcoal',
 'Holzschalung Vertikal, Nadelholz,  schwarz gestrichen',
 'Wood - Flooring',
 'Wood - Sheathing - plywood',
 'Stahlbeton',
 '3-Schicht-Platte',
 'Brettschichtholz C24',
 'Wärmedämmung Mineralwolle, 0.035 W/mK',
 'Wärmedämmung Slentex Aerogel',
 'Material 4',
 'Massiv',
 'Massivholz',
 'Stahl, 45-34, Pulverbeschichtung schwarz',
 'Steel - S355J2G3',
 'TU DF 1 - Pfostenstock Falztüre:DL - 900 x 2130:1118965']

Tokenize the material names.

In [None]:
# Tokenize material names - split into words, remove numbers, punctuation, etc.
material_names_tokens = [[token.text for token in nlp(material_name) if not token.is_punct and not token.is_space and token.is_alpha] for material_name in material_names]
material_names_tokens

[['Archisonic', 'Charcoal'],
 ['Holzschalung', 'Vertikal', 'Nadelholz', 'schwarz', 'gestrichen'],
 ['Wood', 'Flooring'],
 ['Wood', 'Sheathing', 'plywood'],
 ['Stahlbeton'],
 ['Schicht', 'Platte'],
 ['Brettschichtholz'],
 ['Wärmedämmung', 'Mineralwolle', 'W', 'mK'],
 ['Wärmedämmung', 'Slentex', 'Aerogel'],
 ['Material'],
 ['Massiv'],
 ['Massivholz'],
 ['Stahl', 'Pulverbeschichtung', 'schwarz'],
 ['Steel'],
 ['TU', 'DF', 'Pfostenstock', 'Falztüre', 'DL', 'x']]
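
To make the filtering rule easier to follow, here is a dependency-free sketch that approximates it with a regular expression instead of spaCy. It is an approximation only: spaCy's tokenizer applies its own rules, so results can differ on other inputs. The function name `alpha_tokens` and the chosen separator characters are assumptions made for this illustration.

```python
import re


def alpha_tokens(name):
    """Split on whitespace and common separators, keep purely alphabetic pieces."""
    pieces = re.split(r"[\s,\-/:]+", name)
    # str.isalpha() is False for empty strings and for pieces containing digits or dots
    return [piece for piece in pieces if piece.isalpha()]


print(alpha_tokens("Wärmedämmung Mineralwolle, 0.035 W/mK"))
print(alpha_tokens("Steel - S355J2G3"))
print(alpha_tokens("3-Schicht-Platte"))
```

Pieces that mix letters and digits, such as `S355J2G3` or `C24`, are dropped entirely, which is why 'Steel - S355J2G3' is reduced to the single token 'Steel' in the output above.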

In [71]:
# Prepend the full material name to its token list so that the full name is matched as well
all_material_name_tokens = []
for material_name, material_name_tokens in zip(material_names, material_names_tokens):
    all_material_name_tokens.append([material_name] + material_name_tokens)

all_material_name_tokens

[['Archisonic Charcoal', 'Archisonic', 'Charcoal'],
 ['Holzschalung Vertikal, Nadelholz,  schwarz gestrichen',
  'Holzschalung',
  'Vertikal',
  'Nadelholz',
  'schwarz',
  'gestrichen'],
 ['Wood - Flooring', 'Wood', 'Flooring'],
 ['Wood - Sheathing - plywood', 'Wood', 'Sheathing', 'plywood'],
 ['Stahlbeton', 'Stahlbeton'],
 ['3-Schicht-Platte', 'Schicht', 'Platte'],
 ['Brettschichtholz C24', 'Brettschichtholz'],
 ['Wärmedämmung Mineralwolle, 0.035 W/mK',
  'Wärmedämmung',
  'Mineralwolle',
  'W',
  'mK'],
 ['Wärmedämmung Slentex Aerogel', 'Wärmedämmung', 'Slentex', 'Aerogel'],
 ['Material 4', 'Material'],
 ['Massiv', 'Massiv'],
 ['Massivholz', 'Massivholz'],
 ['Stahl, 45-34, Pulverbeschichtung schwarz',
  'Stahl',
  'Pulverbeschichtung',
  'schwarz'],
 ['Steel - S355J2G3', 'Steel'],
 ['TU DF 1 - Pfostenstock Falztüre:DL - 900 x 2130:1118965',
  'TU',
  'DF',
  'Pfostenstock',
  'Falztüre',
  'DL',
  'x']]

Reshape all material names and tokens into a DataFrame with one row per name or token.

In [None]:
import pandas as pd

# Organize names and tokens in dataframe
material_names_and_tokens_df = pd.DataFrame({
    'material_names': material_names,
    'material_name_tokens': all_material_name_tokens
})

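# One row per full name or token
material_names_and_tokens_df = material_names_and_tokens_df.explode('material_name_tokens')
# Sort the rows by material name; after reset_index, 'level_1' holds the original row number of each material
material_names_and_tokens_df = material_names_and_tokens_df.groupby(['material_names'], as_index=True).apply(lambda x: x, include_groups=False).reset_index()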
material_names_and_tokens_df

| | material_names | level_1 | material_name_tokens |
|---|---|---|---|
| 0 | 3-Schicht-Platte | 5 | 3-Schicht-Platte |
| 1 | 3-Schicht-Platte | 5 | Schicht |
| 2 | 3-Schicht-Platte | 5 | Platte |
| 3 | Archisonic Charcoal | 0 | Archisonic Charcoal |
| 4 | Archisonic Charcoal | 0 | Archisonic |
| 5 | Archisonic Charcoal | 0 | Charcoal |
| 6 | Brettschichtholz C24 | 6 | Brettschichtholz C24 |
| 7 | Brettschichtholz C24 | 6 | Brettschichtholz |
| 8 | Holzschalung Vertikal, Nadelholz, schwarz ges... | 1 | Holzschalung Vertikal, Nadelholz, schwarz ges... |
| 9 | Holzschalung Vertikal, Nadelholz, schwarz ges... | 1 | Holzschalung |

### Load & restructure ÖKOBAUDAT
- Load zipped ÖKOBAUDAT database.

In [None]:
# Request ÖKOBAUDAT dataset
url = "https://raw.githubusercontent.com/jakob-beetz/sbe-2025-lca-workshop/refs/heads/main/data/zip_files/01_OBD-database.zip"
output_path = "./01_OBD-database.zip"
response = requests.get(url)
response.raise_for_status()  # stop early if the download failed
with open(output_path, "wb") as f:
    f.write(response.content)

print("✅ ÖKOBAUDAT dataset downloaded.")

✅ ÖKOBAUDAT dataset downloaded.


In [None]:
import zipfile
import pandas as pd
from collections import defaultdict

# Organize ÖKOBAUDAT dataset in a dataframe
if os.path.isfile('obd_df.csv'):
    obd_df = pd.read_csv('obd_df.csv')
else:
    zip_path = "./01_OBD-database.zip"

    folder_data = set()
    leaf_file_data = []

    with zipfile.ZipFile(zip_path) as z:
        all_paths = z.namelist()

        # Split into folders and files
        folders = {p for p in all_paths if p.endswith('/')}
        files = [p for p in all_paths if not p.endswith('/')]

        # Map folders to their subfolders
        folder_children = defaultdict(set)
        for folder in folders:
            for other in folders:
                if other != folder and other.startswith(folder):
                    folder_children[folder].add(other)

        # Identify leaf folders (no subfolders)
        leaf_folders = {f for f in folders if not folder_children[f]}

        # Collect folder data
        for folder in folders:
            parts = folder.strip('/').split('/')
            name = parts[-1]
            parent = parts[-2] if len(parts) > 1 else ''
            depth = len(parts)
            folder_data.add((name, parent, depth, True))

        # Collect files in leaf folders
        for file_path in files:
            for folder in leaf_folders:
                if file_path.startswith(folder) and '/' not in file_path[len(folder):]:
                    parts = file_path.strip('/').split('/')
                    name = parts[-1]
                    if name != 'index.json': continue
                    with z.open(file_path) as f:
                            json_content = json.load(f)
                    for item in json_content["items"]:
                        if "Name" in item:
                            parent = parts[-2] if len(parts) > 1 else ''
                            depth = len(parts)
                            leaf_file_data.append((item["Name"], parent, depth, False))

    # Combine and create DataFrame
    all_data = list(folder_data) + leaf_file_data
    obd_df = pd.DataFrame(all_data, columns=["name", "parent", "depth", "is_category"])
    obd_df = obd_df.sort_values(by=["depth", "parent", "name", "is_category"]).reset_index(drop=True)
    obd_df = obd_df[obd_df['depth'] > 1].reset_index(drop=True)
    obd_df.to_csv('obd_df.csv', index=False)

obd_df[obd_df['parent'] == 'Dämmstoffe']


| | name | parent | depth | is_category |
|---|---|---|---|---|
| 17 | Baumwolle | Dämmstoffe | 3 | True |
| 18 | Blähperlit | Dämmstoffe | 3 | True |
| 19 | Calciumsilikat | Dämmstoffe | 3 | True |
| 20 | Dämmelemente | Dämmstoffe | 3 | True |
| 21 | Expandierter_Kork | Dämmstoffe | 3 | True |
| 22 | Expandiertes_Polystyrol_(EPS) | Dämmstoffe | 3 | True |
| 23 | Extrudiertes_Polystyrol_(XPS) | Dämmstoffe | 3 | True |
| 24 | Flachsfaser | Dämmstoffe | 3 | True |
| 25 | Hanffaser | Dämmstoffe | 3 | True |
| 26 | Harnstoff-Formaldehydharz | Dämmstoffe | 3 | True |
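
The way folder paths are turned into `name`, `parent`, and `depth` can be sketched without the zip file. The example below is a simplified illustration that only handles folder entries; the function name `folder_records` and the root folder name `OBD` are assumptions and do not come from the dataset.

```python
def folder_records(paths):
    """Return sorted (name, parent, depth) tuples for entries ending in '/'."""
    records = set()
    for path in paths:
        if not path.endswith("/"):
            continue  # skip files, only folders describe categories
        parts = path.strip("/").split("/")
        parent = parts[-2] if len(parts) > 1 else ""
        records.add((parts[-1], parent, len(parts)))
    # Sort by depth, then parent, then name
    return sorted(records, key=lambda record: (record[2], record[1], record[0]))


# Hypothetical zip listing; 'OBD' is an assumed root folder name
toy_paths = [
    "OBD/",
    "OBD/Dämmstoffe/",
    "OBD/Dämmstoffe/Hanffaser/",
    "OBD/Dämmstoffe/Hanffaser/index.json",
]

for record in folder_records(toy_paths):
    print(record)
```

With a single root folder at depth 1, the top-level material categories sit at depth 2 and their subcategories at depth 3, which matches the `depth` column in the table above.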


- To speed up the STS matching, we pre-calculated the embeddings of the whole ÖKOBAUDAT and stored them as a NumPy file.
- Now we load these pre-calculated ÖKOBAUDAT embeddings.

In [None]:
import numpy as np

# Load ÖKOBAUDAT dataset embeddings
url = f'https://raw.githubusercontent.com/jakob-beetz/sbe-2025-lca-workshop/refs/heads/main/data/zip_files/obd_embeddings_llm_{llm_name.split("/")[1]}.npy'
output_path = f'./obd_embeddings_llm_{llm_name.split("/")[1]}.npy'
response = requests.get(url)
response.raise_for_status()  # stop early if the download failed
with open(output_path, "wb") as f:
    f.write(response.content)

obd_embeddings = np.load(f'obd_embeddings_llm_{llm_name.split("/")[1]}.npy')
print("✅ ÖKOBAUDAT embeddings downloaded.")

✅ ÖKOBAUDAT embeddings downloaded.
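
The script that produced the embeddings file is not part of this notebook. As a rough sketch of how such a file could be written and read back, the example below saves a small random array with `np.save` and reloads it with `np.load`. The random array is only a stand-in for the output of `llm.encode(...)`, and the file name is hypothetical.

```python
import os
import tempfile

import numpy as np

# Stand-in for real embeddings: 5 entries with 768 dimensions each
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(5, 768)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_embeddings.npy")  # hypothetical file name
    np.save(path, toy_embeddings)   # write the array once
    reloaded = np.load(path)        # later runs only need to load it

print(reloaded.shape)
```

Loading a stored array avoids re-encoding every database entry each time the notebook runs.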


Show the shape of the ÖKOBAUDAT embeddings array (number of entries × embedding dimension).

In [24]:
obd_embeddings.shape

(2264, 768)

In [73]:
# Encode every material name and token, then compare each with all ÖKOBAUDAT embeddings
similarity_matrix = llm.similarity(llm.encode(material_names_and_tokens_df['material_name_tokens'].tolist()), obd_embeddings)
similarity_matrix.shape

torch.Size([51, 2264])
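
The resulting matrix has one row per material name or token (51) and one column per ÖKOBAUDAT entry (2264). As a minimal sketch of what such a matrix contains, assuming the model uses cosine similarity, the same computation can be written in plain NumPy. The function name `cosine_similarity_matrix` and the toy vectors are hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T


# Toy 2-dimensional "embeddings": 2 queries and 3 database entries
queries = np.array([[1.0, 0.0], [1.0, 1.0]])
entries = np.array([[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]])

sim = cosine_similarity_matrix(queries, entries)
print(sim.shape)
print(sim.round(3))
```

Because the vectors are normalized first, their length has no influence: only the direction of each embedding matters.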

In [74]:
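# Wrap the similarity matrix in a DataFrame (rows: material names and tokens, columns: ÖKOBAUDAT entries)
similarity_matrix_df = pd.DataFrame(similarity_matrix)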

### STS-matching of IfcMaterial to ÖKOBAUDAT datasets
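- Iterate through all tokens and through the hierarchical levels of ÖKOBAUDAT's material categories, restricting each level to the children of the best match from the level above.
- Store the best material match and its cosine score for each level (a higher score means a closer match).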

In [None]:
# Find the best ÖKOBAUDAT match for each material name and its tokens

def find_best_match(group):
  # Prepare empty result columns for every hierarchy level
  for depth in range(obd_df['depth'].min(), obd_df['depth'].max()+1):
    group[f'material_match_step_{depth-1}'] = ''
    group[f'material_score_step_{depth-1}'] = ''
    group[f'best_material_match_step_{depth-1}'] = ''
    group[f'best_material_score_step_{depth-1}'] = ''

  # Rows of the similarity matrix that belong to this material
  current_group_indices = group.index
  group = group.reset_index()
  best_category_match = ''

  # Walk down the category hierarchy, one depth level at a time
  for depth in range(obd_df['depth'].min(), obd_df['depth'].max()+1):
    # Candidates: all entries at this depth, restricted to children of the previous best match
    obd_filter = obd_df['depth'] == depth
    if best_category_match != '':
      obd_filter = ((obd_df['depth'] == depth) & (obd_df['parent'] == best_category_match))
    similarity_matrix_filtered = similarity_matrix_df.loc[current_group_indices, obd_filter]
    if similarity_matrix_filtered.shape[1] == 0:
      continue
    # Best candidate and its score for every row (full name or single token)
    group[f'material_match_step_{depth-1}'] = obd_df['name'].iloc[similarity_matrix_filtered.idxmax(axis=1).values].reset_index(drop=True)
    group[f'material_score_step_{depth-1}'] = similarity_matrix_filtered.max(axis=1).reset_index(drop=True)

    # The row with the highest score decides the category used in the next step
    best_material_score_index = group[f'material_score_step_{depth-1}'].idxmax()
    best_category_match = group.loc[best_material_score_index][f'material_match_step_{depth-1}']
    best_category_score = group.loc[best_material_score_index][f'material_score_step_{depth-1}']
    group[f'best_material_match_step_{depth-1}'] = best_category_match
    group[f'best_material_score_step_{depth-1}'] = best_category_score

  return group

# Apply the matching separately to each material (grouped by its original row number)
material_names_and_tokens_matched_df = material_names_and_tokens_df.groupby('level_1').apply(find_best_match, include_groups=False)
material_names_and_tokens_matched_df[['material_name_tokens', 'material_match_step_1', 'material_score_step_1', 'best_material_match_step_1']].head(30)

| level_1 | | material_name_tokens | material_match_step_1 | material_score_step_1 | best_material_match_step_1 |
|---|---|---|---|---|---|
| 0 | 0 | Archisonic Charcoal | End_of_Life | 0.438994 | End_of_Life |
| 0 | 1 | Archisonic | End_of_Life | 0.521619 | End_of_Life |
| 0 | 2 | Charcoal | Beschichtungen | 0.426524 | End_of_Life |
| 1 | 0 | Holzschalung Vertikal, Nadelholz, schwarz ges... | Holz | 0.617127 | Holz |
| 1 | 1 | Holzschalung | Holz | 0.736707 | Holz |
| 1 | 2 | Vertikal | Beschichtungen | 0.580615 | Holz |
| 1 | 3 | Nadelholz | Holz | 0.766775 | Holz |
| 1 | 4 | schwarz | Beschichtungen | 0.565022 | Holz |
| 1 | 5 | gestrichen | Holz | 0.508163 | Holz |
| 2 | 0 | Wood - Flooring | End_of_Life | 0.556935 | Holz |
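
The level-by-level search is a greedy top-down walk through the category tree. The following is a simplified sketch of that idea on a toy tree: it assumes that the score of each category is already the maximum over all tokens of one material. The function name `greedy_hierarchical_match`, the tree, and all scores are made up for illustration.

```python
def greedy_hierarchical_match(scores, children, root):
    """Walk down the tree, picking the highest-scoring child at each level."""
    path = []
    current = root
    while children.get(current):
        best = max(children[current], key=lambda name: scores[name])
        path.append((best, scores[best]))
        current = best  # only children of the winner are considered next
    return path


# Hypothetical category tree and made-up similarity scores
children = {
    "root": ["Holz", "Beschichtungen"],
    "Holz": ["Vollholz", "Holzwerkstoffe"],
    "Beschichtungen": ["Grundierungen"],
}
scores = {
    "Holz": 0.77,
    "Beschichtungen": 0.58,
    "Vollholz": 0.71,
    "Holzwerkstoffe": 0.64,
    "Grundierungen": 0.90,
}

path = greedy_hierarchical_match(scores, children, "root")
print(path)
```

In this toy example 'Grundierungen' has the highest score overall but is never considered, because its parent lost at the first level. A greedy walk keeps each step cheap, but a wrong choice at the top level cannot be corrected further down.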


- Show the best match of every matching step for each material in one DataFrame.

In [75]:
material_names_and_tokens_matched_df[['material_names', 'best_material_match_step_1', 'best_material_match_step_2', 'best_material_match_step_3', 'best_material_match_step_4']].groupby('material_names').first()

| material_names | best_material_match_step_1 | best_material_match_step_2 | best_material_match_step_3 | best_material_match_step_4 |
|---|---|---|---|---|
| 3-Schicht-Platte | Holz | Holzwerkstoffe | Spanplatten | Eurospan Raw Chipboard |
| Archisonic Charcoal | End_of_Life | Generisch | Bauschutt | Construction rubble landfill |
| Brettschichtholz C24 | Holz | Vollholz | Brettschichtholzplatte | Cross laminated timber |
| Holzschalung Vertikal, Nadelholz, schwarz gestrichen | Holz | Vollholz | Konstruktionsvollholz | Cross-laminated timber |
| Massiv | Beschichtungen | Grundierungen | Grundierungen_Farben_und_Putze | Passive Purple |
| Massivholz | Holz | Vollholz | Konstruktionsvollholz | Cross-laminated timber |
| Material 4 | Holz | Modifiziertes_Holz | Thermisch_behandeltes_Holz | Thermally treated wood (1 m3, 409 kg/m3) |
| Stahl, 45-34, Pulverbeschichtung schwarz | Beschichtungen | Fassadenfarben | Dispersion | Applicationpaint emulsion, dispersion paint |
| Stahlbeton | Kunststoffe | Profile | Kunststoffprofile_hart | Cable duct PVC, rigid |
| Steel - S355J2G3 | Sonstige | Baustellenprozesse | Bagger_Aushub | Excavator15 kW |
