# Class Overlap Analysis: ArtDL, ICONCLASS, and Wikidata Datasets

This notebook analyzes how ICONCLASS IDs overlap across three art datasets, with a focus on which ArtDL classes also appear in ICONCLASS and Wikidata.

## Datasets:
- **ArtDL**: Art classification dataset
- **ICONCLASS**: Iconographic classification system
- **Wikidata**: Knowledge base with art-related entries

In [6]:
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib_venn import venn3, venn3_circles
import warnings
warnings.filterwarnings('ignore')

# Set the default figure size for all plots
plt.rcParams['figure.figsize'] = (10, 8)

## 1. Data Loading and ICONCLASS ID Extraction

In [7]:
# Load the three datasets
artdl_df = pd.read_csv('ArtDL-data/classes.csv')
iconclass_df = pd.read_csv('ICONCLASS-data/classes.csv')
wikidata_df = pd.read_csv('wikidata-data/classes.csv')

print("Dataset Sizes:")
print(f"ArtDL: {len(artdl_df)} classes")
print(f"ICONCLASS: {len(iconclass_df)} classes")
print(f"Wikidata: {len(wikidata_df)} classes")

Dataset Sizes:
ArtDL: 10 classes
ICONCLASS: 10 classes
Wikidata: 10 classes
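
The overlap analysis compares `ID` values by exact string equality, so stray whitespace or missing values in a CSV would silently change the counts. The following is a minimal sketch of a pre-check, assuming the `ID` column holds ICONCLASS notations as strings. The helper name `clean_ids` and the toy rows are hypothetical and stand in for any of the three `classes.csv` files.

```python
import pandas as pd

def clean_ids(df: pd.DataFrame, column: str = "ID") -> set:
    """Return the set of non-empty, whitespace-stripped IDs in `column`."""
    ids = df[column].dropna().astype(str).str.strip()
    return set(ids[ids != ""])

# Toy frame standing in for one classes.csv file (hypothetical rows)
toy = pd.DataFrame({
    "ID": ["11H(PETER)", " 11H(PETER) ", "11H(PAUL)", None],
    "Label": ["Peter", "Peter", "Paul", "missing"],
})

ids = clean_ids(toy)
print(sorted(ids))
```

Case is left untouched on purpose: ICONCLASS notations such as `11H(...)` and `11HH(...)` are distinct, so only whitespace and missing values are handled here.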


## 2. ICONCLASS ID Overlap Analysis

In [8]:
# Extract ICONCLASS IDs from each dataset
artdl_ids = set(artdl_df['ID'].tolist())
iconclass_ids = set(iconclass_df['ID'].tolist())
wikidata_ids = set(wikidata_df['ID'].tolist())

# Calculate overlaps with ArtDL as the focus
artdl_iconclass_overlap = artdl_ids.intersection(iconclass_ids)
artdl_wikidata_overlap = artdl_ids.intersection(wikidata_ids)
all_three_overlap = artdl_ids.intersection(iconclass_ids).intersection(wikidata_ids)

# Classes unique to ArtDL
artdl_unique = artdl_ids - iconclass_ids - wikidata_ids

# Classes in ArtDL that overlap with at least one other dataset
artdl_with_overlap = artdl_ids.intersection(iconclass_ids.union(wikidata_ids))

print("=== ArtDL OVERLAP ANALYSIS (ICONCLASS IDs) ===")
print(f"\nTotal ArtDL classes: {len(artdl_ids)}")
print(f"ArtDL classes with any overlap: {len(artdl_with_overlap)} ({len(artdl_with_overlap)/len(artdl_ids)*100:.1f}%)")
print(f"ArtDL classes unique to ArtDL: {len(artdl_unique)} ({len(artdl_unique)/len(artdl_ids)*100:.1f}%)")

=== ArtDL OVERLAP ANALYSIS (ICONCLASS IDs) ===

Total ArtDL classes: 10
ArtDL classes with any overlap: 7 (70.0%)
ArtDL classes unique to ArtDL: 3 (30.0%)
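
The print statements in this notebook repeat the same count-and-percentage expression many times. One way to factor it out is sketched below; `fmt_share` is a hypothetical helper name, and the guard for an empty total is an added assumption, not something the cells above do.

```python
def fmt_share(part: int, total: int) -> str:
    """Format a count with its share of `total`, e.g. '7 (70.0%)'."""
    if total == 0:
        # Avoid ZeroDivisionError when a dataset has no classes
        return f"{part} (n/a)"
    return f"{part} ({part / total:.1%})"

print(fmt_share(7, 10))
print(fmt_share(3, 10))
```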


## 3. Key Findings: ArtDL Overlap Details

In [9]:
print("=== ArtDL OVERLAP WITH OTHER DATASETS ===")

print(f"\n1. ArtDL ∩ ICONCLASS: {len(artdl_iconclass_overlap)} classes ({len(artdl_iconclass_overlap)/len(artdl_ids)*100:.1f}% of ArtDL)")
if artdl_iconclass_overlap:
    print("   ICONCLASS IDs:")
    for iconclass_id in sorted(artdl_iconclass_overlap):
        label = artdl_df[artdl_df['ID'] == iconclass_id].iloc[0]['Label']
        print(f"   - {iconclass_id}: {label}")

print(f"\n2. ArtDL ∩ Wikidata: {len(artdl_wikidata_overlap)} classes ({len(artdl_wikidata_overlap)/len(artdl_ids)*100:.1f}% of ArtDL)")
if artdl_wikidata_overlap:
    print("   ICONCLASS IDs:")
    for iconclass_id in sorted(artdl_wikidata_overlap):
        label = artdl_df[artdl_df['ID'] == iconclass_id].iloc[0]['Label']
        print(f"   - {iconclass_id}: {label}")

print(f"\n3. ArtDL ∩ ICONCLASS ∩ Wikidata: {len(all_three_overlap)} classes ({len(all_three_overlap)/len(artdl_ids)*100:.1f}% of ArtDL)")
if all_three_overlap:
    print("   ICONCLASS IDs (present in all three datasets):")
    for iconclass_id in sorted(all_three_overlap):
        label = artdl_df[artdl_df['ID'] == iconclass_id].iloc[0]['Label']
        print(f"   - {iconclass_id}: {label}")

print(f"\n4. ArtDL Unique Classes: {len(artdl_unique)} classes ({len(artdl_unique)/len(artdl_ids)*100:.1f}% of ArtDL)")
if artdl_unique:
    print("   ICONCLASS IDs (only in ArtDL):")
    for iconclass_id in sorted(artdl_unique):
        label = artdl_df[artdl_df['ID'] == iconclass_id].iloc[0]['Label']
        print(f"   - {iconclass_id}: {label}")

=== ArtDL OVERLAP WITH OTHER DATASETS ===

1. ArtDL ∩ ICONCLASS: 5 classes (50.0% of ArtDL)
   ICONCLASS IDs:
   - 11H(FRANCIS): Francis of Assisi
   - 11H(JEROME): Jerome
   - 11H(PAUL): Paul
   - 11H(PETER): Peter
   - 11HH(MARY MAGDALENE): Mary Magdalene

2. ArtDL ∩ Wikidata: 6 classes (60.0% of ArtDL)
   ICONCLASS IDs:
   - 11H(FRANCIS): Francis of Assisi
   - 11H(JEROME): Jerome
   - 11H(JOHN THE BAPTIST): John the Baptist
   - 11H(PETER): Peter
   - 11H(SEBASTIAN): Saint Sebastian
   - 11HH(MARY MAGDALENE): Mary Magdalene

3. ArtDL ∩ ICONCLASS ∩ Wikidata: 4 classes (40.0% of ArtDL)
   ICONCLASS IDs (present in all three datasets):
   - 11H(FRANCIS): Francis of Assisi
   - 11H(JEROME): Jerome
   - 11H(PETER): Peter
   - 11HH(MARY MAGDALENE): Mary Magdalene

4. ArtDL Unique Classes: 3 classes (30.0% of ArtDL)
   ICONCLASS IDs (only in ArtDL):
   - 11F(MARY): Virgin Mary
   - 11H(ANTONY OF PADUA): Antony of Padua
   - 11H(DOMINIC): Saint Dominic
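
The four counts above can be cross-checked with inclusion-exclusion: the number of ArtDL classes with any overlap should equal the two pairwise overlaps minus the triple overlap. The sketch below copies the ID lists from the printed output rather than reloading the CSV files; the variable names are chosen for illustration only.

```python
# ID lists copied from the printed output above
artdl_and_iconclass = {
    "11H(FRANCIS)", "11H(JEROME)", "11H(PAUL)", "11H(PETER)",
    "11HH(MARY MAGDALENE)",
}
artdl_and_wikidata = {
    "11H(FRANCIS)", "11H(JEROME)", "11H(JOHN THE BAPTIST)", "11H(PETER)",
    "11H(SEBASTIAN)", "11HH(MARY MAGDALENE)",
}
artdl_only = {"11F(MARY)", "11H(ANTONY OF PADUA)", "11H(DOMINIC)"}

# Both sets are already restricted to ArtDL, so their intersection
# is the triple overlap and their union is "any overlap".
in_all_three = artdl_and_iconclass & artdl_and_wikidata
with_any_overlap = artdl_and_iconclass | artdl_and_wikidata

# Inclusion-exclusion: |A∩(B∪C)| = |A∩B| + |A∩C| - |A∩B∩C|
assert len(with_any_overlap) == (
    len(artdl_and_iconclass) + len(artdl_and_wikidata) - len(in_all_three)
)
# Overlapping and unique classes together account for all 10 ArtDL classes
assert len(with_any_overlap) + len(artdl_only) == 10

print(len(in_all_three), len(with_any_overlap))
```

With these lists the check gives 5 + 6 - 4 = 7 overlapping classes, which together with the 3 unique classes accounts for all 10 ArtDL classes.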


## 4. Venn Diagram: ICONCLASS ID Overlap Visualization

In [10]:
# Calculate all overlap regions for Venn diagram
iconclass_unique = iconclass_ids - artdl_ids - wikidata_ids
wikidata_unique = wikidata_ids - artdl_ids - iconclass_ids
iconclass_wikidata_only = iconclass_ids.intersection(wikidata_ids) - artdl_ids

# Pairwise overlaps with ArtDL, excluding classes present in all three datasets.
# These must be defined in this cell, before venn_data is built.
artdl_iconclass_only = artdl_iconclass_overlap - wikidata_ids
artdl_wikidata_only = artdl_wikidata_overlap - iconclass_ids

# Create Venn diagram data
venn_data = {
    '100': len(artdl_unique),  # ArtDL only
    '010': len(iconclass_unique),  # ICONCLASS only
    '001': len(wikidata_unique),  # Wikidata only
    '110': len(artdl_iconclass_only),  # ArtDL & ICONCLASS only
    '101': len(artdl_wikidata_only),  # ArtDL & Wikidata only
    '011': len(iconclass_wikidata_only),  # ICONCLASS & Wikidata only
    '111': len(all_three_overlap)  # All three
}

# Create the Venn diagram
plt.figure(figsize=(12, 10))
venn = venn3(subsets=venn_data, set_labels=('ArtDL', 'ICONCLASS', 'Wikidata'))

# Customize colors
region_colors = {
    '100': '#ff9999',
    '010': '#66b3ff',
    '001': '#99ff99',
    '110': '#ffcc99',
    '101': '#ff99cc',
    '011': '#c2c2f0',
    '111': '#ffb3e6',
}
for region_id, color in region_colors.items():
    patch = venn.get_patch_by_id(region_id)
    if patch is not None:  # skip regions that were not drawn
        patch.set_color(color)

# Add circles
venn3_circles(subsets=venn_data, linewidth=2)

plt.title('ICONCLASS ID Overlap Between ArtDL, ICONCLASS, and Wikidata Datasets',
          fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

# Print Venn diagram summary
print("\n=== VENN DIAGRAM SUMMARY ===")
region_names = {
    '100': 'ArtDL only',
    '010': 'ICONCLASS only',
    '001': 'Wikidata only',
    '110': 'ArtDL & ICONCLASS only',
    '101': 'ArtDL & Wikidata only',
    '011': 'ICONCLASS & Wikidata only',
    '111': 'All three datasets'
}

for key, value in venn_data.items():
    if value > 0:
        print(f"{region_names[key]}: {value} classes")
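
All seven regions of a three-set Venn diagram can be derived in one place, which keeps every region available before the plotting code needs it. The following is a minimal sketch using toy integer sets; `venn3_regions` is a hypothetical helper name, and the region codes follow the `'100'`, `'010'`, ... keys used in `venn_data`.

```python
def venn3_regions(a: set, b: set, c: set) -> dict:
    """Map each three-set region code to the set of members in that region."""
    return {
        "100": a - b - c,    # only in a
        "010": b - a - c,    # only in b
        "001": c - a - b,    # only in c
        "110": (a & b) - c,  # in a and b, not c
        "101": (a & c) - b,  # in a and c, not b
        "011": (b & c) - a,  # in b and c, not a
        "111": a & b & c,    # in all three
    }

# Toy sets for illustration (not the notebook's data)
a = {1, 2, 3, 4}
b = {3, 4, 5}
c = {1, 4, 5, 6}

regions = venn3_regions(a, b, c)
sizes = {code: len(members) for code, members in regions.items()}

# The regions are disjoint and together cover the union of the three sets
assert sum(sizes.values()) == len(a | b | c)
print(sizes)
```

A dictionary shaped like `sizes` has the same keys as `venn_data`, so it could be passed as the `subsets` argument in the same way.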

## 5. Summary: ArtDL Class Distribution

In [None]:
# Calculate exclusive overlaps (not in all three)
artdl_iconclass_only = artdl_iconclass_overlap - wikidata_ids
artdl_wikidata_only = artdl_wikidata_overlap - iconclass_ids

print("=== SUMMARY: ArtDL CLASS DISTRIBUTION BY ICONCLASS IDs ===")
print(f"\nTotal ArtDL classes: {len(artdl_ids)}")
print(f"\nBreakdown:")
print(f"- Classes in all three datasets: {len(all_three_overlap)} ({len(all_three_overlap)/len(artdl_ids)*100:.1f}%)")
print(f"- Classes shared only with ICONCLASS: {len(artdl_iconclass_only)} ({len(artdl_iconclass_only)/len(artdl_ids)*100:.1f}%)")
print(f"- Classes shared only with Wikidata: {len(artdl_wikidata_only)} ({len(artdl_wikidata_only)/len(artdl_ids)*100:.1f}%)")
print(f"- Classes unique to ArtDL: {len(artdl_unique)} ({len(artdl_unique)/len(artdl_ids)*100:.1f}%)")

print(f"\nOverall ArtDL overlap: {len(artdl_with_overlap)}/{len(artdl_ids)} classes ({len(artdl_with_overlap)/len(artdl_ids)*100:.1f}%) have overlap with other datasets")

if len(artdl_wikidata_overlap) > len(artdl_iconclass_overlap):
    print(f"\nArtDL has stronger overlap with Wikidata ({len(artdl_wikidata_overlap)} classes) than with ICONCLASS ({len(artdl_iconclass_overlap)} classes)")
elif len(artdl_wikidata_overlap) < len(artdl_iconclass_overlap):
    print(f"\nArtDL has stronger overlap with ICONCLASS ({len(artdl_iconclass_overlap)} classes) than with Wikidata ({len(artdl_wikidata_overlap)} classes)")
else:
    print(f"\nArtDL overlaps equally with Wikidata and ICONCLASS ({len(artdl_wikidata_overlap)} classes each)")

# Print classes not in ArtDL
iconclass_not_in_artdl = iconclass_ids - artdl_ids
wikidata_not_in_artdl = wikidata_ids - artdl_ids

print(f"\n=== CLASSES NOT IN ARTDL ===")
print(f"\nICONCLASS classes not in ArtDL: {len(iconclass_not_in_artdl)} classes")
if iconclass_not_in_artdl:
    print("   ICONCLASS IDs:")
    for iconclass_id in sorted(iconclass_not_in_artdl):
        label = iconclass_df[iconclass_df['ID'] == iconclass_id].iloc[0]['Label']
        print(f"   - {iconclass_id}: {label}")

print(f"\nWikidata classes not in ArtDL: {len(wikidata_not_in_artdl)} classes")
if wikidata_not_in_artdl:
    print("   ICONCLASS IDs:")
    for iconclass_id in sorted(wikidata_not_in_artdl):
        label = wikidata_df[wikidata_df['ID'] == iconclass_id].iloc[0]['Label']
        print(f"   - {iconclass_id}: {label}")

=== SUMMARY: ArtDL CLASS DISTRIBUTION BY ICONCLASS IDs ===

Total ArtDL classes: 10

Breakdown:
- Classes in all three datasets: 4 (40.0%)
- Classes shared only with ICONCLASS: 1 (10.0%)
- Classes shared only with Wikidata: 2 (20.0%)
- Classes unique to ArtDL: 3 (30.0%)

Overall ArtDL overlap: 7/10 classes (70.0%) have overlap with other datasets

ArtDL has stronger overlap with Wikidata (6 classes) than with ICONCLASS (5 classes)
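
The listing loops in this notebook look up each label by filtering the DataFrame with a boolean mask on every iteration. An alternative is to build an ID-to-label mapping once and reuse it. The sketch below assumes `ID` and `Label` columns as in the cells above; the toy frame, the name `label_by_id`, and the ID `11H(UNKNOWN)` are hypothetical.

```python
import pandas as pd

# Toy frame standing in for one classes.csv file
toy = pd.DataFrame({
    "ID": ["11H(PETER)", "11H(PAUL)", "11H(JEROME)"],
    "Label": ["Peter", "Paul", "Jerome"],
})

# Build the mapping once. If an ID occurs more than once, the last row wins,
# whereas the .iloc[0] lookup in the cells above takes the first row.
label_by_id = dict(zip(toy["ID"], toy["Label"]))

lines = []
for iconclass_id in sorted({"11H(PETER)", "11H(JEROME)", "11H(UNKNOWN)"}):
    # .get avoids an exception when an ID has no row in this dataset
    label = label_by_id.get(iconclass_id, "<no label>")
    lines.append(f"   - {iconclass_id}: {label}")

print("\n".join(lines))
```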
