# Data overview

The `*.log` files, as of 09-Feb-2025, show the following:

- Total entities: 113,721,673
- Total entities that have English labels: 88,923,509
- Total properties: 12,397
- Only 90,816 entities (classes) have instances in Wikidata, which is about 0.08% of the total entities.
- ![](./process_p31_p279/classes_cumulative_distribution.png)
- The image above shows that even among these 90,816 classes, a small number of classes account for most of the instances.
- There are 4,932,987 (a, subclass_of, b) relationships we can find.
- 4,201,747 child entities have parent entities through subclass_of (P279). This is lower than 4,932,987, since one child entity can have multiple parent entities.
- Out of 4,201,747 child entities, 666,843 have multiple parent entities (P279), which violates a strict tree hierarchy. What is worse, a DFS over the child-to-parent relationships reports 4,831,313 cycles. Keep in mind that the DFS resets its per-start bookkeeping, so the same cycle can be reported from several start nodes; this number counts detections, not distinct cycles.
- There are 121,437,570 (a, instance_of, b) relationships we can find through P31. This is a much bigger number than that of P279 (roughly 25 times as many), which suggests that Wikidata editors are more likely to record instance_of relationships than subclass_of relationships.
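- 96,392,021 entities have English descriptions. This is a bit odd, because it means that more entities have English descriptions than English labels (88,923,509). I'd have expected it to be the other way around. Note that the description counter below counts every non-header line without checking the column count, so the figure may be slightly inflated.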
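
The ratios quoted in the list can be re-derived from the raw counts. The snippet below is only a sanity check on the arithmetic; the variable names are mine and not part of the processing scripts.

```python
# Figures copied from the overview list above.
total_entities = 113_721_673
entities_with_en_label = 88_923_509
entities_with_en_description = 96_392_021
classes_with_instances = 90_816
p279_triples = 4_932_987
p279_children = 4_201_747
p279_multi_parent_children = 666_843
p31_triples = 121_437_570

# Share of entities that act as a class with at least one instance.
class_share_pct = 100 * classes_with_instances / total_entities
# Share of P279 children that have more than one parent.
multi_parent_share_pct = 100 * p279_multi_parent_children / p279_children
# Average number of P279 parents per child entity.
avg_parents_per_child = p279_triples / p279_children
# How many P31 triples exist per P279 triple.
p31_to_p279_ratio = p31_triples / p279_triples
# How many more English descriptions than English labels were counted.
description_surplus = entities_with_en_description - entities_with_en_label

print(f"Classes with instances: {class_share_pct:.3f}% of all entities")
print(f"Children with multiple parents: {multi_parent_share_pct:.2f}%")
print(f"Average P279 parents per child: {avg_parents_per_child:.3f}")
print(f"P31 triples per P279 triple: {p31_to_p279_ratio:.1f}")
print(f"Descriptions minus labels: {description_surplus:,}")
```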

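The claim that a few classes hold most of the instances can be checked numerically as well as visually. Below is a minimal sketch, assuming the P31 triples have the same `(entity, property, class)` layout as the TSV rows read later on. `classes_needed_for_share` and the toy triples are hypothetical and are not the code that produced the plot.

```python
from collections import Counter
from itertools import accumulate


def classes_needed_for_share(instance_counts, share):
    """Return how many of the largest classes cover at least `share` of all instances."""
    counts = sorted(instance_counts.values(), reverse=True)
    total = sum(counts)
    if total == 0:
        return 0
    # Walk the running total from the largest class downwards.
    for rank, running in enumerate(accumulate(counts), start=1):
        if running >= share * total:
            return rank
    return len(counts)


# Toy (hypothetical) P31 triples: (entity, property, class).
toy_triples = [
    ("Q1", "P31", "QA"), ("Q2", "P31", "QA"), ("Q3", "P31", "QA"),
    ("Q4", "P31", "QA"), ("Q5", "P31", "QA"), ("Q6", "P31", "QA"),
    ("Q7", "P31", "QB"), ("Q8", "P31", "QB"),
    ("Q9", "P31", "QC"), ("Q10", "P31", "QD"),
]

# Number of instances per class.
instance_counts = Counter(cls for _, _, cls in toy_triples)

for share in (0.5, 0.8, 1.0):
    needed = classes_needed_for_share(instance_counts, share)
    print(f"{needed} of {len(instance_counts)} classes cover {share:.0%} of instances")
```

On the real data, `instance_counts` would be built while streaming the P31 files, so only one counter per class has to stay in memory.
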
## Counting triples, cycles, and descriptions

```python
import os
import glob
from collections import defaultdict
from tqdm.auto import tqdm


def load_triples(directory):
    """
    Read all TSV files in the specified directory and build two mappings:
    - child_to_parents: child -> list of parent classes (or instance classes)
    - parent_to_children: parent -> list of subclasses (or entities)

    Args:
        directory (str): Directory containing TSV files.

    Returns:
        tuple: (child_to_parents, parent_to_children)
    """
    triple_count = 0
    child_to_parents = defaultdict(list)
    parent_to_children = defaultdict(list)
    tsv_files = glob.glob(os.path.join(directory, "*.tsv"))
    print(f"Found {len(tsv_files)} TSV files in '{directory}'.")

    for filename in tqdm(tsv_files):
        with open(filename, "r", encoding="utf-8") as f:
            f.readline()  # skip header line
            for line in f:
                parts = line.strip().split("\t")
                if len(parts) != 3:
                    continue  # ignore rows that are not (child, property, parent)
                triple_count += 1
                child, _prop, parent = parts
                child_to_parents[child].append(parent)
                parent_to_children[parent].append(child)
    print(f"Loaded {triple_count} triples from '{directory}'.")
    return child_to_parents, parent_to_children


# ---------------------------
# Process P279 (subclass) data
# ---------------------------
child_to_parents_p279, parent_to_children_p279 = load_triples("P279")
print(
    f"\nP279: Loaded {len(child_to_parents_p279)} child entities and {len(parent_to_children_p279)} parent entities.\n"
)

# Count nodes with multiple parents in P279
multiple_parents_p279 = {
    child: parents
    for child, parents in tqdm(
        child_to_parents_p279.items(), desc="P279: Counting multiple parents"
    )
    if len(parents) > 1
}
print(
    f"P279: Nodes with multiple parents: {len(multiple_parents_p279)} out of {len(child_to_parents_p279)} children.\n"
)


# Function to detect cycles (applied only on P279, where cycles in subclass relationships may occur)
def find_cycles_full(child_to_parents):
    """
    Detect cycles by performing DFS on the entire mapping.
    Returns a list of cycles found.

    Caveat: local_visited is reset for every start node, so a cycle that is
    reachable from several start nodes is reported once per such start node.
    The length of the returned list is a number of detections, not a number
    of distinct cycles.
    """
    cycles = []
    visited = set()

    # Recursive DFS; very deep parent chains could hit Python's recursion limit.
    def dfs(node, path, local_visited):
        if node in path:
            # Back edge: node is an ancestor on the current path.
            cycles.append(path[path.index(node) :] + [node])
            return
        if node in local_visited:
            return
        local_visited.add(node)
        for parent in child_to_parents.get(node, []):
            dfs(parent, path + [node], local_visited)

    for node in tqdm(child_to_parents.keys(), desc="P279: Finding cycles (full scan)"):
        # Nodes already reached from an earlier start node are not used as start nodes again.
        if node not in visited:
            local_visited = set()
            dfs(node, [], local_visited)
            visited.update(local_visited)
    return cycles


# Detect cycles in P279 data (full scan, not sampling)
cycles_p279 = find_cycles_full(child_to_parents_p279)
print(f"\nP279: Found {len(cycles_p279)} cycles in the data.")


# Count the number of triples for P31
directory = "./P31/"
triple_count = 0
tsv_files = glob.glob(os.path.join(directory, "*.tsv"))
print(f"Found {len(tsv_files)} TSV files in '{directory}'.")

for filename in tqdm(tsv_files):
    with open(filename, "r", encoding="utf-8") as f:
        f.readline()  # skip header line
        for line in f:
            parts = line.strip().split("\t")
            if len(parts) != 3:
                continue
            triple_count += 1
print(f"Loaded {triple_count} triples from '{directory}'.")


# Count the number of en_descriptions
directory = "./en_description/"
en_description_count = 0
tsv_files = glob.glob(os.path.join(directory, "*.tsv"))
print(f"Found {len(tsv_files)} TSV files in '{directory}'.")

for filename in tqdm(tsv_files):
    with open(filename, "r", encoding="utf-8") as f:
        f.readline()  # skip header line
        # Every remaining line is counted; the column count is not validated here.
        for line in f:
            en_description_count += 1
print(f"Loaded {en_description_count} en_descriptions from '{directory}'.")
```


```text
Found 2275 TSV files in 'P279'.
100%|██████████| 2275/2275 [00:08<00:00, 276.60it/s]
Loaded 4932987 triples from 'P279'.

P279: Loaded 4201747 child entities and 282692 parent entities.

P279: Counting multiple parents: 100%|██████████| 4201747/4201747 [00:00<00:00, 6228200.82it/s]
P279: Nodes with multiple parents: 666843 out of 4201747 children.

P279: Finding cycles (full scan): 100%|██████████| 4201747/4201747 [01:14<00:00, 56690.58it/s]

P279: Found 4831313 cycles in the data.
Found 2275 TSV files in './P31/'.
100%|██████████| 2275/2275 [00:20<00:00, 113.60it/s]
Loaded 121437570 triples from './P31/'.
Found 2275 TSV files in './en_description/'.
100%|██████████| 2275/2275 [00:15<00:00, 145.99it/s]
Loaded 96392021 en_descriptions from './en_description/'.
```
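
Because `find_cycles_full` can report the same cycle from several start nodes, the cycle count above is hard to interpret. A different way to measure cyclicity is to count strongly connected components (SCCs) that contain a cycle, since every node is assigned to exactly one component. The sketch below uses an iterative version of Tarjan's SCC algorithm. This is a swapped-in technique, not what produced the number above, and `strongly_connected_components`, `cyclic_components`, and the toy graph are hypothetical names for illustration.

```python
def strongly_connected_components(graph):
    """Iterative Tarjan SCC. `graph` maps node -> iterable of successor nodes."""
    index_of = {}      # discovery index per node
    lowlink = {}       # lowest discovery index reachable from the node
    on_stack = set()   # nodes currently on the SCC stack
    stack = []
    components = []
    counter = 0

    for root in graph:
        if root in index_of:
            continue
        index_of[root] = lowlink[root] = counter
        counter += 1
        stack.append(root)
        on_stack.add(root)
        work = [(root, iter(graph.get(root, ())))]
        while work:
            node, successors = work[-1]
            descended = False
            for nxt in successors:
                if nxt not in index_of:
                    # Tree edge: descend into the unvisited successor.
                    index_of[nxt] = lowlink[nxt] = counter
                    counter += 1
                    stack.append(nxt)
                    on_stack.add(nxt)
                    work.append((nxt, iter(graph.get(nxt, ()))))
                    descended = True
                    break
                if nxt in on_stack:
                    # Edge back into the component that is still being built.
                    lowlink[node] = min(lowlink[node], index_of[nxt])
            if descended:
                continue
            # All successors handled: node is finished.
            work.pop()
            if work:
                parent = work[-1][0]
                lowlink[parent] = min(lowlink[parent], lowlink[node])
            if lowlink[node] == index_of[node]:
                # node is the root of a component; pop its members.
                component = []
                while True:
                    member = stack.pop()
                    on_stack.discard(member)
                    component.append(member)
                    if member == node:
                        break
                components.append(component)
    return components


def cyclic_components(graph):
    """SCCs that contain a cycle: more than one node, or one node with a self-loop."""
    return [
        comp
        for comp in strongly_connected_components(graph)
        if len(comp) > 1 or comp[0] in graph.get(comp[0], ())
    ]


# Toy child -> parents mapping (hypothetical data).
toy_child_to_parents = {
    "A": ["B"], "B": ["C"], "C": ["A"],  # one 3-node cycle
    "D": ["A"],                          # leads into the cycle, not part of it
    "E": ["D", "E"],                     # self-loop
    "F": ["G"],                          # plain edge
}
toy_cyclic = cyclic_components(toy_child_to_parents)
print(f"{len(toy_cyclic)} cyclic components, sizes: {sorted(len(c) for c in toy_cyclic)}")
```

On the real data this would be called as `cyclic_components(child_to_parents_p279)`. It avoids recursion, so deep subclass chains cannot hit Python's recursion limit, but it reports components rather than individual cycles, so its result is not directly comparable to the 4,831,313 figure.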



