# 📚 Corpus Exploration

This notebook explores the semantic concept corpus and demonstrates the loader functionality.

**Goal:** Understand what concepts we have and which domains might map to different geometries.

## 1. Load the Corpus

In [1]:
import sys
from pathlib import Path

# Add project root to path (assumes the notebook runs from a subdirectory of the project root)
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

from corpora.loader import load_corpus

In [2]:
# Load the corpus
corpus = load_corpus("semantic_concepts_v0")

print(f"Loaded: {corpus.corpus_id}")
print(f"Version: {corpus.version}")
print(f"Language: {corpus.language}")
print(f"Total clusters: {len(corpus)}")
print(f"Total expressions: {len(corpus.get_all_expressions())}")

Loaded: semantic_concepts_v0
Version: 0
Language: en
Total clusters: 20
Total expressions: 400
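
The cells in this notebook touch only a handful of attributes and methods on the loaded corpus. As a rough mental model, the interface could be sketched as below. This is a stand-in written for illustration, not the actual `corpora.loader` implementation: the class names `Cluster` and `Corpus`, the field types, and the choice of a set for `domains` are all assumptions inferred from how the notebook uses the object.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical stand-in for one concept cluster."""
    domain: str
    subdomain: str
    concept: str
    expressions: list = field(default_factory=list)

    def __len__(self):
        # len(cluster) is the number of expressions in the cluster
        return len(self.expressions)


@dataclass
class Corpus:
    """Hypothetical stand-in for the object returned by load_corpus()."""
    corpus_id: str
    version: str
    language: str
    clusters: list = field(default_factory=list)

    def __len__(self):
        # len(corpus) is the number of clusters
        return len(self.clusters)

    @property
    def domains(self):
        # Unique domain names (assumed to be a set; the notebook sorts them)
        return {c.domain for c in self.clusters}

    def filter_by_domain(self, domain):
        return [c for c in self.clusters if c.domain == domain]

    def get_all_expressions(self):
        return [e for c in self.clusters for e in c.expressions]


# Toy corpus built from a few expressions shown in this notebook
toy = Corpus(
    corpus_id="toy",
    version="0",
    language="en",
    clusters=[
        Cluster("emotion", "valence_mixed", "affective_valence_mixed",
                ["happy", "joyful", "content"]),
        Cluster("space", "vertical", "vertical_spatial_terms",
                ["below", "under"]),
    ],
)
print(sorted(toy.domains), len(toy), len(toy.get_all_expressions()))
```

Keeping the mental model this small makes it easier to see that nothing in the loader commits to a particular geometry.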


## 2. Explore Domains

What semantic domains do we have?

In [3]:
print("Available domains:")
print(sorted(corpus.domains))

Available domains:
['emotion', 'epistemic', 'evidentiality', 'modality', 'modification', 'motion', 'perception', 'physical_extent', 'quantity', 'social_structure', 'social_style', 'space', 'time']


In [4]:
# Domain breakdown
print("\nDomain statistics:")
print("-" * 60)
for domain in sorted(corpus.domains):
    clusters = corpus.filter_by_domain(domain)
    total_expr = sum(len(c) for c in clusters)
    print(f"{domain:20s} | {len(clusters):2d} clusters | {total_expr:3d} expressions")


Domain statistics:
------------------------------------------------------------
emotion              |  2 clusters |  40 expressions
epistemic            |  1 clusters |  20 expressions
evidentiality        |  1 clusters |  20 expressions
modality             |  2 clusters |  40 expressions
modification         |  1 clusters |  20 expressions
motion               |  1 clusters |  20 expressions
perception           |  1 clusters |  20 expressions
physical_extent      |  1 clusters |  20 expressions
quantity             |  1 clusters |  20 expressions
social_structure     |  2 clusters |  40 expressions
social_style         |  2 clusters |  40 expressions
space                |  3 clusters |  60 expressions
time                 |  2 clusters |  40 expressions


## 3. Examine Specific Domains

Let's look at domains that might have interesting geometric properties.

### Emotion (Potential Spinor/Polarity Candidate)

In [5]:
emotion_clusters = corpus.filter_by_domain("emotion")

print(f"Emotion domain: {len(emotion_clusters)} clusters\n")
for cluster in emotion_clusters:
    print(f"Subdomain: {cluster.subdomain}")
    print(f"Concept: {cluster.concept}")
    print(f"Size: {len(cluster)} expressions")
    print(f"Examples: {cluster.expressions[:10]}")
    print()

Emotion domain: 2 clusters

Subdomain: valence_mixed
Concept: affective_valence_mixed
Size: 20 expressions
Examples: ['happy', 'joyful', 'content', 'pleased', 'cheerful', 'delighted', 'ecstatic', 'satisfied', 'glad', 'hopeful']

Subdomain: arousal_mixed
Concept: emotion_arousal_mixed
Size: 20 expressions
Examples: ['calm', 'relaxed', 'tranquil', 'serene', 'still', 'restless', 'tense', 'anxious', 'keyed up', 'nervous']



### Social Structure (Potential Hyperbolic/Hierarchy Candidate)

In [6]:
social_clusters = corpus.filter_by_domain("social_structure")

print(f"Social structure domain: {len(social_clusters)} clusters\n")
for cluster in social_clusters:
    print(f"Subdomain: {cluster.subdomain}")
    print(f"Concept: {cluster.concept}")
    print(f"Size: {len(cluster)} expressions")
    print(f"Examples: {cluster.expressions[:10]}")
    print()

Social structure domain: 2 clusters

Subdomain: status_role
Concept: social_status_roles
Size: 20 expressions
Examples: ['peasant', 'worker', 'assistant', 'employee', 'staff', 'junior', 'colleague', 'senior', 'manager', 'boss']

Subdomain: power
Concept: power_dynamics_language
Size: 20 expressions
Examples: ['powerless', 'weak', 'vulnerable', 'dependent', 'subordinate', 'obedient', 'compliant', 'modest', 'confident', 'assertive']



### Space (Potential Rotation/Axis Candidate)

In [7]:
space_clusters = corpus.filter_by_domain("space")

print(f"Space domain: {len(space_clusters)} clusters\n")
for cluster in space_clusters:
    print(f"Subdomain: {cluster.subdomain}")
    print(f"Concept: {cluster.concept}")
    print(f"Size: {len(cluster)} expressions")
    print(f"Examples: {cluster.expressions[:10]}")
    print()

Space domain: 3 clusters

Subdomain: vertical
Concept: vertical_spatial_terms
Size: 20 expressions
Examples: ['below', 'under', 'underneath', 'beneath', 'down', 'low', 'lower', 'bottom', 'ground-level', 'level']

Subdomain: horizontal
Concept: horizontal_spatial_terms
Size: 20 expressions
Examples: ['left', 'right', 'centre', 'middle', 'edge', 'far left', 'far right', 'near', 'far', 'beside']

Subdomain: distance
Concept: spatial_distance_terms
Size: 20 expressions
Examples: ['touching', 'adjacent', 'close', 'nearby', 'near', 'within reach', 'a short way', 'not far', 'far', 'distant']



### Time (Potential Cyclic/Phase Candidate)

In [8]:
time_clusters = corpus.filter_by_domain("time")

print(f"Time domain: {len(time_clusters)} clusters\n")
for cluster in time_clusters:
    print(f"Subdomain: {cluster.subdomain}")
    print(f"Concept: {cluster.concept}")
    print(f"Size: {len(cluster)} expressions")
    print(f"Examples: {cluster.expressions[:10]}")
    print()

Time domain: 2 clusters

Subdomain: order
Concept: temporal_order_terms
Size: 20 expressions
Examples: ['before', 'earlier', 'previously', 'once', 'formerly', 'already', 'now', 'currently', 'presently', 'immediately']

Subdomain: cycle
Concept: time_of_day_cycle
Size: 20 expressions
Examples: ['dawn', 'daybreak', 'sunrise', 'early morning', 'late morning', 'noon', 'midday', 'afternoon', 'midafternoon', 'evening']
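
The four cells in this section repeat the same loop with a different domain name. A small helper could remove that duplication. The sketch below is one possible shape: `format_domain_report` is a hypothetical name, and the stand-in cluster is built with `SimpleNamespace` only so the block runs without the project installed.

```python
from types import SimpleNamespace


def format_domain_report(label, clusters, n_examples=10):
    """Build the same per-domain summary the cells above print by hand."""
    lines = [f"{label} domain: {len(clusters)} clusters", ""]
    for cluster in clusters:
        lines.append(f"Subdomain: {cluster.subdomain}")
        lines.append(f"Concept: {cluster.concept}")
        lines.append(f"Size: {len(cluster.expressions)} expressions")
        lines.append(f"Examples: {cluster.expressions[:n_examples]}")
        lines.append("")
    return "\n".join(lines)


# Stand-in cluster (not a real loader object), using terms shown above
demo_cluster = SimpleNamespace(
    subdomain="cycle",
    concept="time_of_day_cycle",
    expressions=["dawn", "daybreak", "sunrise"],
)
report = format_domain_report("Time", [demo_cluster], n_examples=2)
print(report)
```

With the real corpus, the equivalent call would be `print(format_domain_report("Time", corpus.filter_by_domain("time")))`.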



## 4. Cluster Size Distribution

How are expressions distributed across clusters?

In [9]:
cluster_sizes = [len(c) for c in corpus.clusters]

print("Cluster size statistics:")
print(f"  Min: {min(cluster_sizes)}")
print(f"  Max: {max(cluster_sizes)}")
print(f"  Mean: {sum(cluster_sizes) / len(cluster_sizes):.1f}")
print(f"  Total expressions: {sum(cluster_sizes)}")

Cluster size statistics:
  Min: 20
  Max: 20
  Mean: 20.0
  Total expressions: 400


## 5. All Clusters Overview

In [10]:
print("All clusters:")
print("=" * 80)
for i, cluster in enumerate(corpus.clusters, 1):
    print(f"{i:2d}. {cluster.domain:20s} / {cluster.subdomain:25s} ({len(cluster):2d} expr)")
    print(f"    {cluster.concept}")
    print(f"    Examples: {', '.join(cluster.expressions[:5])}...")
    print()

All clusters:
 1. emotion              / valence_mixed             (20 expr)
    affective_valence_mixed
    Examples: happy, joyful, content, pleased, cheerful...

 2. emotion              / arousal_mixed             (20 expr)
    emotion_arousal_mixed
    Examples: calm, relaxed, tranquil, serene, still...

 3. perception           / temperature_subjective    (20 expr)
    subjective_temperature_scale
    Examples: freezing, icy, bitter, chilly, cold...

 4. physical_extent      / size                      (20 expr)
    size_extent_scale
    Examples: tiny, minuscule, small, compact, little...

 5. motion               / speed                     (20 expr)
    motion_speed_scale
    Examples: motionless, still, sluggish, slow, unhurried...

 6. social_structure     / status_role               (20 expr)
    social_status_roles
    Examples: peasant, worker, assistant, employee, staff...

 7. social_structure     / power                     (20 expr)
    power_dynamics_language
    Exa

## 6. Next Steps

Now that we understand the corpus structure, we can:

1. **Test different geometries** on different domains
2. **Measure compression** (can we represent these concepts in lower dimensions?)
3. **Compare distortion** across geometries
4. **Identify which geometry fits which domain best**
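
Step 3 needs a concrete notion of distortion. One simple option (an assumption here, not a metric this project has settled on) is the mean relative error between pairwise distances before and after embedding. The function names and the toy points below are illustrative and are not drawn from the corpus.

```python
import math
from itertools import combinations


def euclidean(u, v):
    """Plain Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mean_relative_distortion(original, embedded,
                             dist_original=euclidean,
                             dist_embedded=euclidean):
    """Average of |d_embedded - d_original| / d_original over all point pairs.

    Pairs whose original distance is zero are skipped to avoid dividing by zero.
    """
    ratios = []
    for i, j in combinations(range(len(original)), 2):
        d_orig = dist_original(original[i], original[j])
        d_emb = dist_embedded(embedded[i], embedded[j])
        if d_orig > 0:
            ratios.append(abs(d_emb - d_orig) / d_orig)
    return sum(ratios) / len(ratios) if ratios else 0.0


# Toy data: three 3-D points "compressed" by dropping the last coordinate
points_3d = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 5.0)]
points_2d = [p[:2] for p in points_3d]

distortion = mean_relative_distortion(points_3d, points_2d)
print(f"Mean relative distortion: {distortion:.3f}")
```

Because the distance functions are parameters, the same measure could be reused with a non-Euclidean distance on the embedded side when comparing geometries.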

### Hypotheses to Test:

- **Emotion** → Spinor/SU(2) (polarity: happy ↔ sad)
- **Social structure** → Hyperbolic (hierarchy: employee → manager → CEO)
- **Space** → Rotations/SO(3) (axes: left ↔ right, up ↔ down)
- **Time** → Cyclic/Phase (tense cycles, aspect)
- **Modality** → ? (to be discovered)
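
To make the cyclic hypothesis for time concrete, here is a minimal sketch comparing a linear distance with a circular one on a 24-hour clock. The hour values are illustrative assumptions, not annotations from the corpus, and the function names are hypothetical.

```python
import math


def hour_to_phase(hour, period=24.0):
    """Map an hour to a point on the unit circle (its phase)."""
    angle = 2.0 * math.pi * (hour % period) / period
    return (math.cos(angle), math.sin(angle))


def linear_distance(hour_a, hour_b):
    """Distance if time of day is treated as a straight line."""
    return abs(hour_a - hour_b)


def circular_distance(hour_a, hour_b, period=24.0):
    """Shortest arc between two hours on the cycle, measured in hours."""
    diff = abs(hour_a - hour_b) % period
    return min(diff, period - diff)


# Illustrative hours only: late evening (23) and shortly after midnight (1)
late, early = 23, 1
print("linear:  ", linear_distance(late, early))
print("circular:", circular_distance(late, early))
print("phase of hour 6:", hour_to_phase(6))
```

On a line, the two ends of the day look maximally distant. On a circle they are neighbours, which is the behaviour we would expect a good embedding of a time-of-day cluster to reproduce.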

The loader is **geometry-agnostic**: the experiments, not the loader, will determine which geometry fits each domain.
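
As a starting point for the hyperbolic hypothesis, the sketch below computes distances in the Poincaré ball model using the standard formula d(u, v) = arcosh(1 + 2‖u − v‖² / ((1 − ‖u‖²)(1 − ‖v‖²))). Choosing the Poincaré ball over other hyperbolic models is an assumption made for illustration, and the coordinates are toy values, not embeddings of corpus terms.

```python
import math


def poincare_distance(u, v):
    """Hyperbolic distance between two points inside the unit ball."""
    sq_norm_u = sum(a * a for a in u)
    sq_norm_v = sum(b * b for b in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))


# Two pairs with the same Euclidean separation (0.1), at different radii
near_centre = poincare_distance((0.0, 0.0), (0.1, 0.0))
near_boundary = poincare_distance((0.85, 0.0), (0.95, 0.0))

print(f"near centre:   {near_centre:.3f}")
print(f"near boundary: {near_boundary:.3f}")
```

The same Euclidean gap costs far more hyperbolic distance near the boundary. That extra room toward the edge is why hierarchies are often expected to embed well in hyperbolic space.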