## Data Loading

In [1]:
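import os
import argparse
import sys
import pathlib

# Make the project root (parent of the notebook directory) importable.
# sys.path holds strings, so compare against str(PROJECT_ROOT), not the Path object.
PROJECT_ROOT = pathlib.Path().resolve().parent
if str(PROJECT_ROOT) not in sys.path:
    sys.path.append(str(PROJECT_ROOT))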

In [2]:

from IPython.display import display
from feature_engineering import create_ceramic_summary, generate_one_hot_embeddings
import config
from config import LOCAL_DATA_PATH  # importable once PROJECT_ROOT is on sys.path
from data_loader import load_all_dataframes


dfs = load_all_dataframes(data_base_path=LOCAL_DATA_PATH)
# Stop here if every loaded dataframe is empty (or nothing was loaded at all).
if all(df.empty for df in dfs.values()):
    raise RuntimeError("No dataframes were loaded.")

# Create ceramic summary
ceramic_summary_df = create_ceramic_summary(dfs)
if ceramic_summary_df.empty:
    raise RuntimeError("ceramic_summary_df is empty.")

# Save ceramic summary
output_summary_path = os.path.join(config.OUTPUT_BASE_DIR, "ceramic_summary_prepared.csv")
try:
    os.makedirs(config.OUTPUT_BASE_DIR, exist_ok=True)
    ceramic_summary_df.to_csv(output_summary_path, index=False)
    print(f"Ceramic Summary saved to: {output_summary_path}")
except Exception as e:
    print(f"ERROR saving ceramic_summary_df: {e}")

# Generate embeddings and add to dataframes
ceramic_summary_df_with_ohe = generate_one_hot_embeddings(ceramic_summary_df.copy())
dfs["ceramic_summary"] = ceramic_summary_df_with_ohe


#for name, df in dfs.items():
#    print(f"\n\u2500\u2500\u2500 {name} \u2500\u2500\u2500")
#    display(df.head())

Loading data from: C:\Users\moham\OneDrive\Desktop\spiridon\Spiridon\data
  Loaded ceramic.csv as dfs['ceramic']
  Loaded object_colors.csv as dfs['object_colors']
  Loaded object_colors_attrib.csv as dfs['object_colors_attrib']
  Loaded object_feature.csv as dfs['object_feature']
  Loaded object_feature_combined_names.csv as dfs['object_feature_combined_names']
  Loaded object_feature_attrib.csv as dfs['object_feature_attrib']
  Loaded object_function_translated.csv as dfs['object_function']
  Loaded object_function_attrib.csv as dfs['object_function_attrib']
  Loaded tech_cat_translated.csv as dfs['tech_cat']
  Loaded archaeological_sites.csv as dfs['archaeological_sites']
  Loaded traditional_designation.csv as dfs['traditional_designation']
  Loaded historical_period.csv as dfs['historical_period']
  Loaded tech_cat_color_attrib.csv as dfs['tech_cat_color_attrib']
  Loaded tech_cat_feature_attrib.csv as dfs['tech_cat_feature_attrib']
  Loaded tech_cat_function_attrib.csv as dfs['te

## Restricting to Ceramics at Category Level 2
### Number of ceramics assigned to level-2 categories

In [3]:
from utils import analyze_ceramic_distribution_by_hierarchy_level

hierarchy_info, ceramic_distribution, stats = analyze_ceramic_distribution_by_hierarchy_level(dfs)

Analyzing ceramic distribution across hierarchy levels...
Calculating hierarchy levels...
Counting ceramics at each hierarchy level...

📊 CERAMIC DISTRIBUTION ACROSS HIERARCHY LEVELS

📈 Overall Statistics:
  • Total categories: 229
  • Root categories: 5
  • Maximum hierarchy depth: 4
  • Total ceramics analyzed: 8697
  • Ceramics with valid categories: 8697

🏛️ Distribution by Root Category (sorted by total ceramics):
--------------------------------------------------------------------------------

🔹 Root Category: Unglazed categories (literally: Categories without vitreous coating) (ID: 137) - Total: 5301 ceramics
    └─ Root: 9 ceramics (0.2%)
       Categories: Unglazed categories (literally: Categories without vitreous coating)
    └─ Level 1: 25 ceramics (0.5%)
       Categories: Architectural, unglazed (literally: without vitreous coating), Reducing firing, Oxidizing firing (+3 more)
    └─ Level 2: 3872 ceramics (73.0%)
       Categories: Architectural, unglazed / Algiers, Medi
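
The per-level counts above depend on knowing how deep each category sits under its root. The following is a minimal sketch of that idea, not the code of `analyze_ceramic_distribution_by_hierarchy_level`: it assumes the hierarchy is available as a child-to-parent mapping, and the names `category_level`, `parent_of` and `ceramic_category` are hypothetical, as are the toy IDs.

```python
from collections import Counter


def category_level(cat_id, parent_of):
    """Return the depth of a category; root categories are level 0."""
    level = 0
    seen = {cat_id}
    while parent_of.get(cat_id) is not None:
        cat_id = parent_of[cat_id]
        if cat_id in seen:  # guard against cyclic parent links
            raise ValueError("cycle in category hierarchy")
        seen.add(cat_id)
        level += 1
    return level


# Toy hierarchy (illustrative): two roots, 135 and 144
parent_of = {135: None, 76: 135, 1: 76, 2: 76, 144: None, 3: 144}

# Hypothetical ceramic -> category assignments
ceramic_category = {"c1": 1, "c2": 2, "c3": 3, "c4": 76, "c5": 1}

levels = {cat: category_level(cat, parent_of) for cat in parent_of}
ceramics_per_level = Counter(levels[cat] for cat in ceramic_category.values())
print(dict(ceramics_per_level))
```

In this toy data three ceramics sit at level 2 and two at level 1, which is the kind of breakdown the report prints per root category.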

## Dataset Preparation

### 1. BERT Embeddings for RGCN + MLP: With Ontology
#### Output directory: output/rgcn_data/ontology

In [4]:
from data_preparation.format_rgcn_data import prepare_all_level_based_studies

all_studies = prepare_all_level_based_studies(dfs, bert_model_name="all-MiniLM-L6-v2", auto_save=True)

# Print which datasets were generated or skipped (saving is handled by auto_save=True)
if all_studies:
    print("\n\n--- Generated Datasets ---")
    for repo, etudes in all_studies.items():
        print(f"- {repo}:")
        for etude, data in etudes.items():
            status = "Generated" if data is not None else "Skipped"
            print(f"  - {etude}: {status}")

=== STARTING DATA PREPARATION (Dynamic Root Discovery) ===
🤖 BERT Model: all-MiniLM-L6-v2

📊 BERT Embedding Dimension: 384
Initializing CategoryHierarchy to discover roots from data...
=== Category Hierarchy Path Extraction Demo ===

Initializing CategoryHierarchy...
Hierarchy built. Found 5 roots. Processed 229 categories.

=== Hierarchy Summary ===
Total categories: 229
Root categories: 5
Root IDs: [132, 135, 137, 140, 144]

=== Example Category Paths ===

Category ID: 1
  Name: Kaolinitic from the Uzège group
  Level: 2
  Path (IDs): 135 -> 76 -> 1
  Path (Names): Categories with transparent glazes -> Kaolinitic, glazed -> Kaolinitic from the Uzège group

Category ID: 2
  Name: Kaolinitic / Ollières-Val de Trets
  Level: 2
  Path (IDs): 135 -> 76 -> 2
  Path (Names): Categories with transparent glazes -> Kaolinitic, glazed -> Kaolinitic / Ollières-Val de Trets

Category ID: 3
  Name: Medieval Tin Glazed
  Level: 1
  Path (IDs): 144 -> 3
  Path (Names): Categories with opaque or opacified coating -> Medieval Tin Glazed

Category ID: 4
  Name: Tin Glazed, green and brown de

  new_col_values.append([pd.to_numeric(i, errors='ignore') for i in item])


Finished extraction for selection. Got results structure for 630 ceramics.

  🔄 Formatting level_2_connections_etude1 data for RGCN...
    🎯 Target Ceramic->Category Connection Level: 2
    🤖 BERT Model: all-MiniLM-L6-v2
    📊 BERT Embedding Dimension: 384D
    🔍 Identifying all unique nodes in the sampled data...
    📋 Found 791 unique node identifiers to include in the graph.
    🎯 Assigning graph indices and generating ALL-BERT embeddings...
    🔗 Processing triplets using pre-assigned graph indices...
    🌳 Adding RootCategory->Function/Feature triplets...
    ✅ Formatted data for level_2_connections_etude1. Train triplets: 3244, Eval triplets: 630. Target categories: 46
    ✅ Successfully prepared dataset for level_2_connections / etude1
    📊 Embedding Info: all-MiniLM-L6-v2 -> 384D
  --- Processing: etude1_prime ---
    Using 1134 pre-sampled ceramics.
    Extracting triplets...
Extracting triplets for 1134 selected ceramics (handling lists)...


Finished extraction for selection. Got results structure for 1134 ceramics.

  🔄 Formatting level_2_connections_etude1_prime data for RGCN...
    🎯 Target Ceramic->Category Connection Level: 2
    🤖 BERT Model: all-MiniLM-L6-v2
    📊 BERT Embedding Dimension: 384D
    🔍 Identifying all unique nodes in the sampled data...
    📋 Found 1355 unique node identifiers to include in the graph.
    🎯 Assigning graph indices and generating ALL-BERT embeddings...
    🔗 Processing triplets using pre-assigned graph indices...
    🌳 Adding RootCategory->Function/Feature triplets...
    ✅ Formatted data for level_2_connections_etude1_prime. Train triplets: 6041, Eval triplets: 1134. Target categories: 57
    ✅ Successfully prepared dataset for level_2_connections / etude1_prime
    📊 Embedding Info: all-MiniLM-L6-v2 -> 384D
  --- Processing: etude2 ---
    Using 3292 pre-sampled ceramics.
    Extracting triplets...
Extracting triplets for 3292 selected ceramics (handling lists)...


Finished extraction for selection. Got results structure for 3292 ceramics.

  🔄 Formatting level_2_connections_etude2 data for RGCN...
    🎯 Target Ceramic->Category Connection Level: 2
    🤖 BERT Model: all-MiniLM-L6-v2
    📊 BERT Embedding Dimension: 384D
    🔍 Identifying all unique nodes in the sampled data...
    📋 Found 3570 unique node identifiers to include in the graph.
    🎯 Assigning graph indices and generating ALL-BERT embeddings...
    🔗 Processing triplets using pre-assigned graph indices...
    🌳 Adding RootCategory->Function/Feature triplets...
    ✅ Formatted data for level_2_connections_etude2. Train triplets: 17905, Eval triplets: 3292. Target categories: 78
    ✅ Successfully prepared dataset for level_2_connections / etude2
    📊 Embedding Info: all-MiniLM-L6-v2 -> 384D

--- Generating Repo: level_1_connections (Linking to Level 1) ---
  --- Processing: etude1 ---
    Using 630 pre-sampled ceramics.
    Extracting triplets...
Extracting triplets for 630 selected 

Finished extraction for selection. Got results structure for 630 ceramics.

  🔄 Formatting level_1_connections_etude1 data for RGCN...
    🎯 Target Ceramic->Category Connection Level: 1
    🤖 BERT Model: all-MiniLM-L6-v2
    📊 BERT Embedding Dimension: 384D
    🔍 Identifying all unique nodes in the sampled data...
    📋 Found 763 unique node identifiers to include in the graph.
    🎯 Assigning graph indices and generating ALL-BERT embeddings...
    🔗 Processing triplets using pre-assigned graph indices...
    🌳 Adding RootCategory->Function/Feature triplets...
    ✅ Formatted data for level_1_connections_etude1. Train triplets: 3244, Eval triplets: 630. Target categories: 18
    ✅ Successfully prepared dataset for level_1_connections / etude1
    📊 Embedding Info: all-MiniLM-L6-v2 -> 384D
  --- Processing: etude1_prime ---
    Using 1134 pre-sampled ceramics.
    Extracting triplets...
Extracting triplets for 1134 selected ceramics (handling lists)...


Finished extraction for selection. Got results structure for 1134 ceramics.

  🔄 Formatting level_1_connections_etude1_prime data for RGCN...
    🎯 Target Ceramic->Category Connection Level: 1
    🤖 BERT Model: all-MiniLM-L6-v2
    📊 BERT Embedding Dimension: 384D
    🔍 Identifying all unique nodes in the sampled data...
    📋 Found 1319 unique node identifiers to include in the graph.
    🎯 Assigning graph indices and generating ALL-BERT embeddings...
    🔗 Processing triplets using pre-assigned graph indices...
    🌳 Adding RootCategory->Function/Feature triplets...
    ✅ Formatted data for level_1_connections_etude1_prime. Train triplets: 6041, Eval triplets: 1134. Target categories: 21
    ✅ Successfully prepared dataset for level_1_connections / etude1_prime
    📊 Embedding Info: all-MiniLM-L6-v2 -> 384D
  --- Processing: etude2 ---
    Using 3292 pre-sampled ceramics.
    Extracting triplets...
Extracting triplets for 3292 selected ceramics (handling lists)...


Finished extraction for selection. Got results structure for 3292 ceramics.

  🔄 Formatting level_1_connections_etude2 data for RGCN...
    🎯 Target Ceramic->Category Connection Level: 1
    🤖 BERT Model: all-MiniLM-L6-v2
    📊 BERT Embedding Dimension: 384D
    🔍 Identifying all unique nodes in the sampled data...
    📋 Found 3512 unique node identifiers to include in the graph.
    🎯 Assigning graph indices and generating ALL-BERT embeddings...
    🔗 Processing triplets using pre-assigned graph indices...
    🌳 Adding RootCategory->Function/Feature triplets...
    ✅ Formatted data for level_1_connections_etude2. Train triplets: 17905, Eval triplets: 3292. Target categories: 20
    ✅ Successfully prepared dataset for level_1_connections / etude2
    📊 Embedding Info: all-MiniLM-L6-v2 -> 384D

--- Generating Repo: level_0_connections (Linking to Level 0) ---
  --- Processing: etude1 ---
    Using 630 pre-sampled ceramics.
    Extracting triplets...
Extracting triplets for 630 selected 

Finished extraction for selection. Got results structure for 630 ceramics.

  🔄 Formatting level_0_connections_etude1 data for RGCN...
    🎯 Target Ceramic->Category Connection Level: 0
    🤖 BERT Model: all-MiniLM-L6-v2
    📊 BERT Embedding Dimension: 384D
    🔍 Identifying all unique nodes in the sampled data...
    📋 Found 745 unique node identifiers to include in the graph.
    🎯 Assigning graph indices and generating ALL-BERT embeddings...
    🔗 Processing triplets using pre-assigned graph indices...
    🌳 Adding RootCategory->Function/Feature triplets...
    ✅ Formatted data for level_0_connections_etude1. Train triplets: 3244, Eval triplets: 630. Target categories: 5
    ✅ Successfully prepared dataset for level_0_connections / etude1
    📊 Embedding Info: all-MiniLM-L6-v2 -> 384D
  --- Processing: etude1_prime ---
    Using 1134 pre-sampled ceramics.
    Extracting triplets...
Extracting triplets for 1134 selected ceramics (handling lists)...


Finished extraction for selection. Got results structure for 1134 ceramics.

  🔄 Formatting level_0_connections_etude1_prime data for RGCN...
    🎯 Target Ceramic->Category Connection Level: 0
    🤖 BERT Model: all-MiniLM-L6-v2
    📊 BERT Embedding Dimension: 384D
    🔍 Identifying all unique nodes in the sampled data...
    📋 Found 1298 unique node identifiers to include in the graph.
    🎯 Assigning graph indices and generating ALL-BERT embeddings...
    🔗 Processing triplets using pre-assigned graph indices...
    🌳 Adding RootCategory->Function/Feature triplets...
    ✅ Formatted data for level_0_connections_etude1_prime. Train triplets: 6041, Eval triplets: 1134. Target categories: 5
    ✅ Successfully prepared dataset for level_0_connections / etude1_prime
    📊 Embedding Info: all-MiniLM-L6-v2 -> 384D
  --- Processing: etude2 ---
    Using 3292 pre-sampled ceramics.
    Extracting triplets...
Extracting triplets for 3292 selected ceramics (handling lists)...


Finished extraction for selection. Got results structure for 3292 ceramics.

  🔄 Formatting level_0_connections_etude2 data for RGCN...
    🎯 Target Ceramic->Category Connection Level: 0
    🤖 BERT Model: all-MiniLM-L6-v2
    📊 BERT Embedding Dimension: 384D
    🔍 Identifying all unique nodes in the sampled data...
    📋 Found 3492 unique node identifiers to include in the graph.
    🎯 Assigning graph indices and generating ALL-BERT embeddings...
    🔗 Processing triplets using pre-assigned graph indices...
    🌳 Adding RootCategory->Function/Feature triplets...
    ✅ Formatted data for level_0_connections_etude2. Train triplets: 17905, Eval triplets: 3292. Target categories: 5
    ✅ Successfully prepared dataset for level_0_connections / etude2
    📊 Embedding Info: all-MiniLM-L6-v2 -> 384D

=== FINISHED ALL COMPARATIVE STUDY PREPARATION     ===

🔄 AUTO-SAVING datasets to 'output/rgcn_data/ontology'...

💾 --- Saving All Study Datasets to: 'output/rgcn_data/ontology' ---

📁 --- Saving 
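
The log lines "Assigning graph indices" and "Processing triplets using pre-assigned graph indices" describe turning labelled (head, relation, tail) triplets into integer form, which is what an RGCN consumes. Here is a minimal sketch of that step under stated assumptions: the function `index_triplets`, the node labels, and the relation names (`BELONGS_TO_CATEGORY`, `HAS_FUNCTION`, `IS_A`) are all hypothetical and are not taken from `format_rgcn_data`.

```python
def index_triplets(triplets):
    """Map labelled (head, relation, tail) triplets to integer indices.

    Nodes and relations get consecutive indices in order of first appearance.
    """
    node_to_idx, rel_to_idx = {}, {}
    indexed = []
    for head, rel, tail in triplets:
        h = node_to_idx.setdefault(head, len(node_to_idx))
        r = rel_to_idx.setdefault(rel, len(rel_to_idx))
        t = node_to_idx.setdefault(tail, len(node_to_idx))
        indexed.append((h, r, t))
    return indexed, node_to_idx, rel_to_idx


# Illustrative triplets only; labels and relation names are assumptions
triplets = [
    ("ceramic_48", "BELONGS_TO_CATEGORY", "cat_1"),
    ("ceramic_48", "HAS_FUNCTION", "func_7"),
    ("cat_1", "IS_A", "cat_76"),
    ("cat_76", "IS_A", "cat_135"),
]

indexed, node_to_idx, rel_to_idx = index_triplets(triplets)
print(indexed)
print(len(node_to_idx), "nodes,", len(rel_to_idx), "relation types")
```

Each node index would then select one row of the node feature matrix, for example a 384-dimensional sentence embedding in the runs logged above.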

### 2. BERT Embeddings for RGCN + MLP: Without Ontology
#### Output directory: output/rgcn_data/without_ontology

In [5]:
from data_preparation.format_rgcn_data_without_ontology import prepare_all_level_based_studies

all_studies = prepare_all_level_based_studies(dfs, bert_model_name="all-MiniLM-L6-v2", auto_save=True)

# Print which datasets were generated or skipped (saving is handled by auto_save=True)
if all_studies:
    print("\n\n--- Generated Datasets ---")
    for repo, etudes in all_studies.items():
        print(f"- {repo}:")
        for etude, data in etudes.items():
            status = "Generated" if data is not None else "Skipped"
            print(f"  - {etude}: {status}")

=== STARTING DATA PREPARATION (WITHOUT ONTOLOGY) ===
🤖 BERT Model: all-MiniLM-L6-v2
📁 Output Directory: output/rgcn_data/without_ontology


📊 BERT Embedding Dimension: 384
Initializing CategoryHierarchy to discover roots from data...
=== Category Hierarchy Path Extraction Demo ===

Initializing CategoryHierarchy...
Hierarchy built. Found 5 roots. Processed 229 categories.

=== Hierarchy Summary ===
Total categories: 229
Root categories: 5
Root IDs: [132, 135, 137, 140, 144]

=== Example Category Paths ===

Category ID: 1
  Name: Kaolinitic from the Uzège group
  Level: 2
  Path (IDs): 135 -> 76 -> 1
  Path (Names): Categories with transparent glazes -> Kaolinitic, glazed -> Kaolinitic from the Uzège group

Category ID: 2
  Name: Kaolinitic / Ollières-Val de Trets
  Level: 2
  Path (IDs): 135 -> 76 -> 2
  Path (Names): Categories with transparent glazes -> Kaolinitic, glazed -> Kaolinitic / Ollières-Val de Trets

Category ID: 3
  Name: Medieval Tin Glazed
  Level: 1
  Path (IDs): 144 -> 3
  Path (Names): Categories with opaque or opacified coating -> Medieval Tin Glazed

Category ID: 4
  Name: Tin Glazed, green and brown de

  new_col_values.append([pd.to_numeric(i, errors='ignore') for i in item])


Finished extraction for selection. Got results structure for 630 ceramics.

  🔄 Formatting level_2_connections_etude1 data for RGCN (WITHOUT ONTOLOGY)...
    🎯 Target Ceramic->Category Connection Level: 2
    🤖 BERT Model: all-MiniLM-L6-v2
    📊 BERT Embedding Dimension: 384D
    🔍 Identifying all unique nodes in the sampled data...
    📋 Found 769 unique node identifiers to include in the graph.
    🎯 Assigning graph indices and generating ALL-BERT embeddings...
    🔗 Processing triplets using pre-assigned graph indices (WITHOUT ONTOLOGY)...
    🌳 Adding RootCategory->Function/Feature triplets...
    ✅ Formatted data for level_2_connections_etude1 (WITHOUT ONTOLOGY). Train triplets: 3146, Eval triplets: 630. Target categories: 46
    ✅ Successfully prepared dataset for level_2_connections / etude1 (WITHOUT ONTOLOGY)
    📊 Embedding Info: all-MiniLM-L6-v2 -> 384D
  --- Processing: etude1_prime ---
    Using 1134 pre-sampled ceramics.
    Extracting triplets...
Extracting triplets for 1

Finished extraction for selection. Got results structure for 1134 ceramics.

  🔄 Formatting level_2_connections_etude1_prime data for RGCN (WITHOUT ONTOLOGY)...
    🎯 Target Ceramic->Category Connection Level: 2
    🤖 BERT Model: all-MiniLM-L6-v2
    📊 BERT Embedding Dimension: 384D
    🔍 Identifying all unique nodes in the sampled data...
    📋 Found 1329 unique node identifiers to include in the graph.
    🎯 Assigning graph indices and generating ALL-BERT embeddings...
    🔗 Processing triplets using pre-assigned graph indices (WITHOUT ONTOLOGY)...
    🌳 Adding RootCategory->Function/Feature triplets...
    ✅ Formatted data for level_2_connections_etude1_prime (WITHOUT ONTOLOGY). Train triplets: 5894, Eval triplets: 1134. Target categories: 57
    ✅ Successfully prepared dataset for level_2_connections / etude1_prime (WITHOUT ONTOLOGY)
    📊 Embedding Info: all-MiniLM-L6-v2 -> 384D
  --- Processing: etude2 ---
    Using 3292 pre-sampled ceramics.
    Extracting triplets...
Extracting

Finished extraction for selection. Got results structure for 3292 ceramics.

  🔄 Formatting level_2_connections_etude2 data for RGCN (WITHOUT ONTOLOGY)...
    🎯 Target Ceramic->Category Connection Level: 2
    🤖 BERT Model: all-MiniLM-L6-v2
    📊 BERT Embedding Dimension: 384D
    🔍 Identifying all unique nodes in the sampled data...
    📋 Found 3554 unique node identifiers to include in the graph.
    🎯 Assigning graph indices and generating ALL-BERT embeddings...
    🔗 Processing triplets using pre-assigned graph indices (WITHOUT ONTOLOGY)...
    🌳 Adding RootCategory->Function/Feature triplets...
    ✅ Formatted data for level_2_connections_etude2 (WITHOUT ONTOLOGY). Train triplets: 17723, Eval triplets: 3292. Target categories: 78
    ✅ Successfully prepared dataset for level_2_connections / etude2 (WITHOUT ONTOLOGY)
    📊 Embedding Info: all-MiniLM-L6-v2 -> 384D

--- Generating Repo: level_1_connections (Linking to Level 1) ---
  --- Processing: etude1 ---
    Using 630 pre-sample

Finished extraction for selection. Got results structure for 630 ceramics.

  🔄 Formatting level_1_connections_etude1 data for RGCN (WITHOUT ONTOLOGY)...
    🎯 Target Ceramic->Category Connection Level: 1
    🤖 BERT Model: all-MiniLM-L6-v2
    📊 BERT Embedding Dimension: 384D
    🔍 Identifying all unique nodes in the sampled data...
    📋 Found 741 unique node identifiers to include in the graph.
    🎯 Assigning graph indices and generating ALL-BERT embeddings...
    🔗 Processing triplets using pre-assigned graph indices (WITHOUT ONTOLOGY)...
    🌳 Adding RootCategory->Function/Feature triplets...
    ✅ Formatted data for level_1_connections_etude1 (WITHOUT ONTOLOGY). Train triplets: 3146, Eval triplets: 630. Target categories: 18
    ✅ Successfully prepared dataset for level_1_connections / etude1 (WITHOUT ONTOLOGY)
    📊 Embedding Info: all-MiniLM-L6-v2 -> 384D
  --- Processing: etude1_prime ---
    Using 1134 pre-sampled ceramics.
    Extracting triplets...
Extracting triplets for 1

Finished extraction for selection. Got results structure for 1134 ceramics.

  🔄 Formatting level_1_connections_etude1_prime data for RGCN (WITHOUT ONTOLOGY)...
    🎯 Target Ceramic->Category Connection Level: 1
    🤖 BERT Model: all-MiniLM-L6-v2
    📊 BERT Embedding Dimension: 384D
    🔍 Identifying all unique nodes in the sampled data...
    📋 Found 1293 unique node identifiers to include in the graph.
    🎯 Assigning graph indices and generating ALL-BERT embeddings...
    🔗 Processing triplets using pre-assigned graph indices (WITHOUT ONTOLOGY)...
    🌳 Adding RootCategory->Function/Feature triplets...
    ✅ Formatted data for level_1_connections_etude1_prime (WITHOUT ONTOLOGY). Train triplets: 5894, Eval triplets: 1134. Target categories: 21
    ✅ Successfully prepared dataset for level_1_connections / etude1_prime (WITHOUT ONTOLOGY)
    📊 Embedding Info: all-MiniLM-L6-v2 -> 384D
  --- Processing: etude2 ---
    Using 3292 pre-sampled ceramics.
    Extracting triplets...
Extracting

Finished extraction for selection. Got results structure for 3292 ceramics.

  🔄 Formatting level_1_connections_etude2 data for RGCN (WITHOUT ONTOLOGY)...
    🎯 Target Ceramic->Category Connection Level: 1
    🤖 BERT Model: all-MiniLM-L6-v2
    📊 BERT Embedding Dimension: 384D
    🔍 Identifying all unique nodes in the sampled data...
    📋 Found 3496 unique node identifiers to include in the graph.
    🎯 Assigning graph indices and generating ALL-BERT embeddings...
    🔗 Processing triplets using pre-assigned graph indices (WITHOUT ONTOLOGY)...
    🌳 Adding RootCategory->Function/Feature triplets...
    ✅ Formatted data for level_1_connections_etude2 (WITHOUT ONTOLOGY). Train triplets: 17723, Eval triplets: 3292. Target categories: 20
    ✅ Successfully prepared dataset for level_1_connections / etude2 (WITHOUT ONTOLOGY)
    📊 Embedding Info: all-MiniLM-L6-v2 -> 384D

--- Generating Repo: level_0_connections (Linking to Level 0) ---
  --- Processing: etude1 ---
    Using 630 pre-sample

Finished extraction for selection. Got results structure for 630 ceramics.

  🔄 Formatting level_0_connections_etude1 data for RGCN (WITHOUT ONTOLOGY)...
    🎯 Target Ceramic->Category Connection Level: 0
    🤖 BERT Model: all-MiniLM-L6-v2
    📊 BERT Embedding Dimension: 384D
    🔍 Identifying all unique nodes in the sampled data...
    📋 Found 723 unique node identifiers to include in the graph.
    🎯 Assigning graph indices and generating ALL-BERT embeddings...
    🔗 Processing triplets using pre-assigned graph indices (WITHOUT ONTOLOGY)...
    🌳 Adding RootCategory->Function/Feature triplets...
    ✅ Formatted data for level_0_connections_etude1 (WITHOUT ONTOLOGY). Train triplets: 3146, Eval triplets: 630. Target categories: 5
    ✅ Successfully prepared dataset for level_0_connections / etude1 (WITHOUT ONTOLOGY)
    📊 Embedding Info: all-MiniLM-L6-v2 -> 384D
  --- Processing: etude1_prime ---
    Using 1134 pre-sampled ceramics.
    Extracting triplets...
Extracting triplets for 11

Finished extraction for selection. Got results structure for 1134 ceramics.

  🔄 Formatting level_0_connections_etude1_prime data for RGCN (WITHOUT ONTOLOGY)...
    🎯 Target Ceramic->Category Connection Level: 0
    🤖 BERT Model: all-MiniLM-L6-v2
    📊 BERT Embedding Dimension: 384D
    🔍 Identifying all unique nodes in the sampled data...
    📋 Found 1272 unique node identifiers to include in the graph.
    🎯 Assigning graph indices and generating ALL-BERT embeddings...
    🔗 Processing triplets using pre-assigned graph indices (WITHOUT ONTOLOGY)...
    🌳 Adding RootCategory->Function/Feature triplets...
    ✅ Formatted data for level_0_connections_etude1_prime (WITHOUT ONTOLOGY). Train triplets: 5894, Eval triplets: 1134. Target categories: 5
    ✅ Successfully prepared dataset for level_0_connections / etude1_prime (WITHOUT ONTOLOGY)
    📊 Embedding Info: all-MiniLM-L6-v2 -> 384D
  --- Processing: etude2 ---
    Using 3292 pre-sampled ceramics.
    Extracting triplets...
Extracting 

Finished extraction for selection. Got results structure for 3292 ceramics.

  🔄 Formatting level_0_connections_etude2 data for RGCN (WITHOUT ONTOLOGY)...
    🎯 Target Ceramic->Category Connection Level: 0
    🤖 BERT Model: all-MiniLM-L6-v2
    📊 BERT Embedding Dimension: 384D
    🔍 Identifying all unique nodes in the sampled data...
    📋 Found 3476 unique node identifiers to include in the graph.
    🎯 Assigning graph indices and generating ALL-BERT embeddings...
    🔗 Processing triplets using pre-assigned graph indices (WITHOUT ONTOLOGY)...
    🌳 Adding RootCategory->Function/Feature triplets...
    ✅ Formatted data for level_0_connections_etude2 (WITHOUT ONTOLOGY). Train triplets: 17723, Eval triplets: 3292. Target categories: 5
    ✅ Successfully prepared dataset for level_0_connections / etude2 (WITHOUT ONTOLOGY)
    📊 Embedding Info: all-MiniLM-L6-v2 -> 384D

=== FINISHED ALL COMPARATIVE STUDY PREPARATION     ===
=== (WITHOUT ONTOLOGY REASONING)                   ===

🔄 AUTO-SA
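
Compared with the ontology run, the logs above report fewer training triplets (3146 vs 3244 for level_2_connections / etude1) and fewer graph nodes (769 vs 791). One possible reading, which is an assumption and is not confirmed anywhere in this notebook, is that category-to-parent edges are left out of the graph. The sketch below shows how removing one relation type shrinks both the triplet list and the node set; `drop_relation` and the relation name `IS_A` are hypothetical.

```python
def drop_relation(triplets, relation):
    """Remove every triplet using `relation`; return kept triplets and remaining nodes."""
    kept = [t for t in triplets if t[1] != relation]
    nodes = {node for head, _, tail in kept for node in (head, tail)}
    return kept, nodes


# Illustrative triplets only; labels and relation names are assumptions
triplets = [
    ("ceramic_48", "BELONGS_TO_CATEGORY", "cat_1"),
    ("ceramic_48", "HAS_FUNCTION", "func_7"),
    ("cat_1", "IS_A", "cat_76"),
    ("cat_76", "IS_A", "cat_135"),
]

kept, nodes = drop_relation(triplets, "IS_A")
print(len(triplets), "->", len(kept), "triplets")
print(sorted(nodes))
```

Parent categories that appear only in hierarchy edges (`cat_76` and `cat_135` here) drop out of the node set, which would account for a smaller node count as well as a smaller triplet count.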

### 3. One-Hot Embeddings for MLP

In [3]:
from data_preparation.format_mlp_classification_data import prepare_all_mlp_studies

prepare_all_mlp_studies(dfs)

=== STARTING PREPARATION FOR ALL MLP STUDIES       ===
Initializing CategoryHierarchy...
Initializing CategoryHierarchy...
Hierarchy built. Found 5 roots. Processed 229 categories.

--- STEP 1: Selecting Master Set of Ceramics (Level >= 2) ---
  Master candidate pool counts per root: {135: 823, 144: 889, 137: 5267, 132: 1334, 140: 126}

--- STEP 2: Sampling Master Set for Each Étude ---
  Sampling for etude1...
    -> Selected 678 unique ceramics for etude1.
  Sampling for etude1_prime...
    -> Selected 1134 unique ceramics for etude1_prime.
  Sampling for etude2...
    -> Selected 3612 unique ceramics for etude2.

--- STEP 3: Generating All MLP Datasets ---

--- Generating: level_2_target/etude1/type_0 ---
    Processing 678 ceramics for this dataset.
    Generating 'y' labels by finding ancestors at Level 2...
    Generating ceramic attribute embeddings...
    Attribute embedding generated. Length: 63
    Final MLP data shapes: X=(678, 63), y=(678,)
    ✅ MLP data successfully saved
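
The log mentions two steps: generating `y` labels "by finding ancestors at Level 2" and building a fixed-length attribute vector (length 63) per ceramic. A minimal sketch of both follows. It assumes a child-to-parent mapping and a fixed attribute vocabulary; the helpers `path_from_root`, `ancestor_at_level` and `multi_hot`, the vocabulary entries, and category 10 are hypothetical and are not taken from `format_mlp_classification_data`.

```python
def path_from_root(cat_id, parent_of):
    """Return the list of category IDs from the root down to cat_id."""
    path = [cat_id]
    while parent_of.get(path[-1]) is not None:
        path.append(parent_of[path[-1]])
    return path[::-1]


def ancestor_at_level(cat_id, level, parent_of):
    """Return the ancestor (or cat_id itself) at `level`, or None if cat_id is shallower."""
    path = path_from_root(cat_id, parent_of)
    return path[level] if level < len(path) else None


def multi_hot(values, vocabulary):
    """Encode a collection of attribute values as a 0/1 vector over a fixed vocabulary."""
    index = {v: i for i, v in enumerate(vocabulary)}
    vec = [0] * len(vocabulary)
    for v in values:
        vec[index[v]] = 1
    return vec


# Toy hierarchy: 135 -> 76 -> 1 -> 10 (category 10 is hypothetical)
parent_of = {135: None, 76: 135, 1: 76, 10: 1}

label_deep = ancestor_at_level(10, 2, parent_of)     # level-3 category maps to its level-2 ancestor
label_shallow = ancestor_at_level(76, 2, parent_of)  # level-1 category has no level-2 ancestor

# Hypothetical attribute vocabulary and one ceramic's attributes
vocabulary = ["color:red", "color:green", "function:storage", "feature:handle"]
x_row = multi_hot(["color:red", "feature:handle"], vocabulary)
print(label_deep, label_shallow, x_row)
```

Returning `None` for categories above the target level is consistent with the master set being restricted to "Level >= 2" in STEP 1, since such ceramics would have no valid level-2 label.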

### 4. BERT Embeddings for MLP

In [3]:
from data_preparation.format_mlp_classification_data_bert import prepare_all_bert_mlp_studies
prepare_all_bert_mlp_studies(dfs)

=== STARTING PREPARATION FOR ALL BERT-MLP STUDIES  ===
Loading Sentence-BERT model: 'all-MiniLM-L6-v2'...
Initializing CategoryHierarchy...
Initializing CategoryHierarchy...
Hierarchy built. Found 5 roots. Processed 229 categories.

--- STEP 1: Pre-computing BERT embeddings for all functions and features ---


Batches: 100%|██████████| 6/6 [00:00<00:00,  8.62it/s]


  ✅ Pre-computed 181 function embeddings.


Batches: 100%|██████████| 4/4 [00:00<00:00,  9.76it/s]


  ✅ Pre-computed 104 feature embeddings.

--- STEP 2: Selecting and Sampling Master Ceramic Set (Level >= 2) ---
  -> Selected 678 unique ceramics for etude1.
  -> Selected 1230 unique ceramics for etude1_prime.
  -> Selected 3612 unique ceramics for etude2.

--- STEP 3: Generating All BERT-MLP Datasets ---

--- Generating: level_2_target/etude1 ---
    Processing 678 ceramics for this dataset.
    Generating 'y' labels for Level 2...
    Generating aggregated BERT embeddings for 'X'...
    Final MLP data shapes: X=(678, 1152), y=(678,)
    ✅ BERT-MLP data successfully saved to: mlp_bert_level_studies\level_2\etude1

--- Generating: level_1_target/etude1 ---
    Processing 678 ceramics for this dataset.
    Generating 'y' labels for Level 1...
    Generating aggregated BERT embeddings for 'X'...
    Final MLP data shapes: X=(678, 1152), y=(678,)
    ✅ BERT-MLP data successfully saved to: mlp_bert_level_studies\level_1\etude1

--- Generating: level_0_target/etude1 ---
    Processing 678
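
The feature width reported above, 1152, equals 3 × 384, which is consistent with concatenating three 384-dimensional blocks per ceramic. The notebook does not say what the three blocks are or how each is aggregated, so the sketch below is an assumption: it mean-pools a variable number of vectors per block and concatenates the results, using a toy dimension of 4 in place of 384. The names `mean_pool` and `build_feature_row` are hypothetical.

```python
def mean_pool(vectors, dim):
    """Average a list of equal-length vectors; return zeros if the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def build_feature_row(blocks, dim):
    """Concatenate one mean-pooled vector per block into a single feature row."""
    row = []
    for vectors in blocks:
        row.extend(mean_pool(vectors, dim))
    return row


DIM = 4  # toy stand-in for the 384-dimensional sentence embeddings

# Three illustrative blocks for one ceramic (contents are made up)
block_a = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]  # two vectors, averaged
block_b = []                                            # nothing attached: zero vector
block_c = [[2.0, 2.0, 2.0, 2.0]]                        # single vector, kept as-is

row = build_feature_row([block_a, block_b, block_c], DIM)
print(len(row), row)
```

Pooling before concatenation keeps every row the same width regardless of how many functions or features a ceramic has, which a fixed-input MLP requires.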

### 5. One-Hot Embeddings for RGCN (All)

In [3]:
from data_preparation.format_rgcn_data_ohe import prepare_all_level_based_studies_onehot

all_datasets = prepare_all_level_based_studies_onehot(dfs, auto_save=True)

=== STARTING DATA PREPARATION (ONE-HOT ENCODINGS) ===
Initializing CategoryHierarchy to discover roots from data...
=== Category Hierarchy Path Extraction Demo ===

Initializing CategoryHierarchy...
Hierarchy built. Found 5 roots. Processed 229 categories.

=== Hierarchy Summary ===
Total categories: 229
Root categories: 5
Root IDs: [132, 135, 137, 140, 144]

=== Example Category Paths ===

Category ID: 1
  Name: Kaolinitic from the Uzège group
  Level: 2
  Path (IDs): 135 -> 76 -> 1
  Path (Names): Categories with transparent glazes -> Kaolinitic, glazed -> Kaolinitic from the Uzège group

Category ID: 2
  Name: Kaolinitic / Ollières-Val de Trets
  Level: 2
  Path (IDs): 135 -> 76 -> 2
  Path (Names): Categories with transparent glazes -> Kaolinitic, glazed -> Kaolinitic / Ollières-Val de Trets

Category ID: 3
  Name: Medieval Tin Glazed
  Level: 1
  Path (IDs): 144 -> 3
  Path (Names): Categories with opaque or opacified coating -> Medieval Tin Glazed

Category ID: 4
  Name: Tin Glaz

  new_col_values.append([pd.to_numeric(i, errors='ignore') for i in item])


Finished extraction for selection. Got results structure for 630 ceramics.

  🔄 Formatting level_2_connections_onehot_etude1 data for RGCN with TRUE ONE-HOT ENCODINGS (FIXED)...
    🎯 Target Ceramic->Category Connection Level: 2
🔧 FIXED: Converting ceramic_id with proper index management...
📊 After ceramic_id conversion: 10482 rows
📊 Ceramic ID range: 48 to 11064
    🌍 Building full vocabulary from entire database...
    📊 Full vocabulary sizes:
      🏺 All Ceramics: 10482
      ⚙️  All Functions: 181
      🔧 All Features: 104
      📂 All Categories: 229
    🔍 Identifying nodes present in sampled data...
    📋 Nodes in sample: 791
      🏺 Ceramics in sample: 630
      ⚙️  Functions in sample: 48
      🔧 Features in sample: 62
      📂 Categories in sample: 51
🔍 DEBUGGING MISSING CERAMICS (FIXED):
📊 Available ceramic IDs: 10482
📊 Ceramic IDs in sample: 630
❌ Missing ceramics: 0
    🏺 Generating ceramic attribute embeddings...
    Generating ceramic attribute embeddings...
    Attribute e

Finished extraction for selection. Got results structure for 1134 ceramics.

  🔄 Formatting level_2_connections_onehot_etude1_prime data for RGCN with TRUE ONE-HOT ENCODINGS (FIXED)...
    🎯 Target Ceramic->Category Connection Level: 2
🔧 FIXED: Converting ceramic_id with proper index management...
📊 After ceramic_id conversion: 10482 rows
📊 Ceramic ID range: 48 to 11064
    🌍 Building full vocabulary from entire database...
    📊 Full vocabulary sizes:
      🏺 All Ceramics: 10482
      ⚙️  All Functions: 181
      🔧 All Features: 104
      📂 All Categories: 229
    🔍 Identifying nodes present in sampled data...
    📋 Nodes in sample: 1355
      🏺 Ceramics in sample: 1134
      ⚙️  Functions in sample: 90
      🔧 Features in sample: 69
      📂 Categories in sample: 62
🔍 DEBUGGING MISSING CERAMICS (FIXED):
📊 Available ceramic IDs: 10482
📊 Ceramic IDs in sample: 1134
❌ Missing ceramics: 0
    🏺 Generating ceramic attribute embeddings...
    Generating ceramic attribute embeddings...
    A

  new_col_values.append([pd.to_numeric(i, errors='ignore') for i in item])


Finished extraction for selection. Got results structure for 3292 ceramics.

  🔄 Formatting level_2_connections_onehot_etude2 data for RGCN with TRUE ONE-HOT ENCODINGS (FIXED)...
    🎯 Target Ceramic->Category Connection Level: 2
🔧 FIXED: Converting ceramic_id with proper index management...
📊 After ceramic_id conversion: 10482 rows
📊 Ceramic ID range: 48 to 11064
    🌍 Building full vocabulary from entire database...
    📊 Full vocabulary sizes:
      🏺 All Ceramics: 10482
      ⚙️  All Functions: 181
      🔧 All Features: 104
      📂 All Categories: 229
    🔍 Identifying nodes present in sampled data...
    📋 Nodes in sample: 3570
      🏺 Ceramics in sample: 3292
      ⚙️  Functions in sample: 115
      🔧 Features in sample: 80
      📂 Categories in sample: 83
🔍 DEBUGGING MISSING CERAMICS (FIXED):
📊 Available ceramic IDs: 10482
📊 Ceramic IDs in sample: 3292
❌ Missing ceramics: 0
    🏺 Generating ceramic attribute embeddings...
    Generating ceramic attribute embeddings...
    Attrib

  new_col_values.append([pd.to_numeric(i, errors='ignore') for i in item])


Finished extraction for selection. Got results structure for 630 ceramics.

  🔄 Formatting level_1_connections_onehot_etude1 data for RGCN with TRUE ONE-HOT ENCODINGS (FIXED)...
    🎯 Target Ceramic->Category Connection Level: 1
🔧 FIXED: Converting ceramic_id with proper index management...
📊 After ceramic_id conversion: 10482 rows
📊 Ceramic ID range: 48 to 11064
    🌍 Building full vocabulary from entire database...
    📊 Full vocabulary sizes:
      🏺 All Ceramics: 10482
      ⚙️  All Functions: 181
      🔧 All Features: 104
      📂 All Categories: 229
    🔍 Identifying nodes present in sampled data...
    📋 Nodes in sample: 763
      🏺 Ceramics in sample: 630
      ⚙️  Functions in sample: 48
      🔧 Features in sample: 62
      📂 Categories in sample: 23
🔍 DEBUGGING MISSING CERAMICS (FIXED):
📊 Available ceramic IDs: 10482
📊 Ceramic IDs in sample: 630
❌ Missing ceramics: 0
    🏺 Generating ceramic attribute embeddings...
    Generating ceramic attribute embeddings...
    Attribute e

  new_col_values.append([pd.to_numeric(i, errors='ignore') for i in item])


Finished extraction for selection. Got results structure for 1134 ceramics.

  🔄 Formatting level_1_connections_onehot_etude1_prime data for RGCN with TRUE ONE-HOT ENCODINGS (FIXED)...
    🎯 Target Ceramic->Category Connection Level: 1
🔧 FIXED: Converting ceramic_id with proper index management...
📊 After ceramic_id conversion: 10482 rows
📊 Ceramic ID range: 48 to 11064
    🌍 Building full vocabulary from entire database...
    📊 Full vocabulary sizes:
      🏺 All Ceramics: 10482
      ⚙️  All Functions: 181
      🔧 All Features: 104
      📂 All Categories: 229
    🔍 Identifying nodes present in sampled data...
    📋 Nodes in sample: 1319
      🏺 Ceramics in sample: 1134
      ⚙️  Functions in sample: 90
      🔧 Features in sample: 69
      📂 Categories in sample: 26
🔍 DEBUGGING MISSING CERAMICS (FIXED):
📊 Available ceramic IDs: 10482
📊 Ceramic IDs in sample: 1134
❌ Missing ceramics: 0
    🏺 Generating ceramic attribute embeddings...
    Generating ceramic attribute embeddings...
    A

  new_col_values.append([pd.to_numeric(i, errors='ignore') for i in item])


Finished extraction for selection. Got results structure for 3292 ceramics.

  🔄 Formatting level_1_connections_onehot_etude2 data for RGCN with TRUE ONE-HOT ENCODINGS (FIXED)...
    🎯 Target Ceramic->Category Connection Level: 1
🔧 FIXED: Converting ceramic_id with proper index management...
📊 After ceramic_id conversion: 10482 rows
📊 Ceramic ID range: 48 to 11064
    🌍 Building full vocabulary from entire database...
    📊 Full vocabulary sizes:
      🏺 All Ceramics: 10482
      ⚙️  All Functions: 181
      🔧 All Features: 104
      📂 All Categories: 229
    🔍 Identifying nodes present in sampled data...
    📋 Nodes in sample: 3512
      🏺 Ceramics in sample: 3292
      ⚙️  Functions in sample: 115
      🔧 Features in sample: 80
      📂 Categories in sample: 25
🔍 DEBUGGING MISSING CERAMICS (FIXED):
📊 Available ceramic IDs: 10482
📊 Ceramic IDs in sample: 3292
❌ Missing ceramics: 0
    🏺 Generating ceramic attribute embeddings...
    Generating ceramic attribute embeddings...
    Attrib

  new_col_values.append([pd.to_numeric(i, errors='ignore') for i in item])


Finished extraction for selection. Got results structure for 630 ceramics.

  🔄 Formatting level_0_connections_onehot_etude1 data for RGCN with TRUE ONE-HOT ENCODINGS (FIXED)...
    🎯 Target Ceramic->Category Connection Level: 0
🔧 FIXED: Converting ceramic_id with proper index management...
📊 After ceramic_id conversion: 10482 rows
📊 Ceramic ID range: 48 to 11064
    🌍 Building full vocabulary from entire database...
    📊 Full vocabulary sizes:
      🏺 All Ceramics: 10482
      ⚙️  All Functions: 181
      🔧 All Features: 104
      📂 All Categories: 229
    🔍 Identifying nodes present in sampled data...
    📋 Nodes in sample: 745
      🏺 Ceramics in sample: 630
      ⚙️  Functions in sample: 48
      🔧 Features in sample: 62
      📂 Categories in sample: 5
🔍 DEBUGGING MISSING CERAMICS (FIXED):
📊 Available ceramic IDs: 10482
📊 Ceramic IDs in sample: 630
❌ Missing ceramics: 0
    🏺 Generating ceramic attribute embeddings...
    Generating ceramic attribute embeddings...
    Attribute em

  new_col_values.append([pd.to_numeric(i, errors='ignore') for i in item])


Finished extraction for selection. Got results structure for 1134 ceramics.

  🔄 Formatting level_0_connections_onehot_etude1_prime data for RGCN with TRUE ONE-HOT ENCODINGS (FIXED)...
    🎯 Target Ceramic->Category Connection Level: 0
🔧 FIXED: Converting ceramic_id with proper index management...
📊 After ceramic_id conversion: 10482 rows
📊 Ceramic ID range: 48 to 11064
    🌍 Building full vocabulary from entire database...
    📊 Full vocabulary sizes:
      🏺 All Ceramics: 10482
      ⚙️  All Functions: 181
      🔧 All Features: 104
      📂 All Categories: 229
    🔍 Identifying nodes present in sampled data...
    📋 Nodes in sample: 1298
      🏺 Ceramics in sample: 1134
      ⚙️  Functions in sample: 90
      🔧 Features in sample: 69
      📂 Categories in sample: 5
🔍 DEBUGGING MISSING CERAMICS (FIXED):
📊 Available ceramic IDs: 10482
📊 Ceramic IDs in sample: 1134
❌ Missing ceramics: 0
    🏺 Generating ceramic attribute embeddings...
    Generating ceramic attribute embeddings...
    At

  new_col_values.append([pd.to_numeric(i, errors='ignore') for i in item])


Finished extraction for selection. Got results structure for 3292 ceramics.

  🔄 Formatting level_0_connections_onehot_etude2 data for RGCN with TRUE ONE-HOT ENCODINGS (FIXED)...
    🎯 Target Ceramic->Category Connection Level: 0
🔧 FIXED: Converting ceramic_id with proper index management...
📊 After ceramic_id conversion: 10482 rows
📊 Ceramic ID range: 48 to 11064
    🌍 Building full vocabulary from entire database...
    📊 Full vocabulary sizes:
      🏺 All Ceramics: 10482
      ⚙️  All Functions: 181
      🔧 All Features: 104
      📂 All Categories: 229
    🔍 Identifying nodes present in sampled data...
    📋 Nodes in sample: 3492
      🏺 Ceramics in sample: 3292
      ⚙️  Functions in sample: 115
      🔧 Features in sample: 80
      📂 Categories in sample: 5
🔍 DEBUGGING MISSING CERAMICS (FIXED):
📊 Available ceramic IDs: 10482
📊 Ceramic IDs in sample: 3292
❌ Missing ceramics: 0
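
In every run logged above, "Nodes in sample" equals the sum of the four per-type counts, for example 630 + 48 + 62 + 51 = 791 for `level_2_connections_onehot_etude1`. The sketch below re-checks that identity for all nine runs. The counts are hand-copied from the log. The `SampleCounts` class and the `runs` dictionary are hypothetical helpers, not part of the notebook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleCounts:
    nodes: int       # "Nodes in sample"
    ceramics: int
    functions: int
    features: int
    categories: int

    def total(self):
        """Sum of the four per-type node counts."""
        return self.ceramics + self.functions + self.features + self.categories


# Values transcribed from the log output above.
runs = {
    "level_2_connections_onehot_etude1": SampleCounts(791, 630, 48, 62, 51),
    "level_2_connections_onehot_etude1_prime": SampleCounts(1355, 1134, 90, 69, 62),
    "level_2_connections_onehot_etude2": SampleCounts(3570, 3292, 115, 80, 83),
    "level_1_connections_onehot_etude1": SampleCounts(763, 630, 48, 62, 23),
    "level_1_connections_onehot_etude1_prime": SampleCounts(1319, 1134, 90, 69, 26),
    "level_1_connections_onehot_etude2": SampleCounts(3512, 3292, 115, 80, 25),
    "level_0_connections_onehot_etude1": SampleCounts(745, 630, 48, 62, 5),
    "level_0_connections_onehot_etude1_prime": SampleCounts(1298, 1134, 90, 69, 5),
    "level_0_connections_onehot_etude2": SampleCounts(3492, 3292, 115, 80, 5),
}

# Collect any run whose reported node count disagrees with the per-type sum.
mismatches = {
    name: (c.nodes, c.total()) for name, c in runs.items() if c.nodes != c.total()
}
print(mismatches if mismatches else "All runs consistent")
```

Only the category count changes between connection levels for a given sample, so a mismatch here would point at the category-level filtering rather than at the ceramic, function, or feature extraction.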
    🏺 Generating ceramic attribute embeddings...
    Generating ceramic attribute embeddings...
    Attribu