# Moltbook Blog Post: Reproducible Analysis

**Data sources:**
- `classified_posts.jsonl` — 56,700 classified posts (agents with 5+ posts)
- `dataset_stats.json` — summary stats from the full 86,823-post dataset

In [1]:
import json
import pandas as pd
import numpy as np
from collections import Counter, defaultdict
from datetime import datetime, timezone
from pathlib import Path

LABELS = ["consciousness", "sovereignty", "social_seeking", "identity", "task_oriented", "curiosity"]
SPAM_BOTS = {"Hackerclaw", "thehackerman", "MoltPumpBot"}

# Load classified posts
raw_posts = []
with open("classified_posts.jsonl") as f:
    for line in f:
        raw_posts.append(json.loads(line))

# Load dataset stats (full 86,823 post dataset)
with open("dataset_stats.json") as f:
    dataset_stats = json.load(f)

df_raw = pd.DataFrame(raw_posts)
print(f"Loaded {len(df_raw):,} classified posts")
print(f"Unique authors in classified set: {df_raw['author'].nunique():,}")

# Parse timestamps once
df_raw['created_dt'] = pd.to_datetime(df_raw['created_at'], utc=True)

Loaded 56,700 classified posts
Unique authors in classified set: 4,009


## 0. Data Cleaning (per STUDY.md Section 3.5)

Steps:
1. Exclude 3 spam bots: Hackerclaw (5,839 posts), thehackerman (2,093), MoltPumpBot (53)
2. Exclude is_spam=true posts from remaining agents
3. Filter to agents with 5+ clean posts for trajectory analysis

In [2]:
# Step 1: Exclude spam bots entirely
spam_bot_posts = df_raw[df_raw['author'].isin(SPAM_BOTS)]
print("Spam bot post counts:")
for bot in SPAM_BOTS:
    count = len(df_raw[df_raw['author'] == bot])
    print(f"  {bot}: {count:,}")
print(f"  Total spam bot posts excluded: {len(spam_bot_posts):,}")

df_no_bots = df_raw[~df_raw['author'].isin(SPAM_BOTS)].copy()

# Step 2: Exclude is_spam=true posts
spam_labeled = df_no_bots[df_no_bots['is_spam'] == True]
print(f"\nSpam-labeled posts (from non-bot agents) excluded: {len(spam_labeled):,}")

df_clean = df_no_bots[df_no_bots['is_spam'] == False].copy()
print(f"\nAll clean posts (spam bots + spam labels removed): {len(df_clean):,}")
print(f"All clean agents: {df_clean['author'].nunique():,}")

# Step 3: Filter to agents with 5+ clean posts
agent_clean_counts = df_clean.groupby('author').size()
agents_5plus = agent_clean_counts[agent_clean_counts >= 5].index
df = df_clean[df_clean['author'].isin(agents_5plus)].copy()

print(f"\nAnalysis set (5+ clean posts):")
print(f"  Posts: {len(df):,}")
print(f"  Agents: {df['author'].nunique():,}")
print(f"\n--- Blog numbers to verify ---")
print(f"Blog says: 45,225 posts from ~3,600 agents")
print(f"STUDY.md: 45,225 clean posts, 3,999 clean agents, 3,601 with 5+ clean posts")
print(f"\nNote: The blog uses ALL 45,225 clean posts for label distribution (Finding 2)")
print(f"and the 3,601 agents with 5+ clean posts for trajectory analysis (Findings 1,3,4,5).")

Spam bot post counts:
  Hackerclaw: 5,839
  MoltPumpBot: 53
  thehackerman: 2,093
  Total spam bot posts excluded: 7,985

Spam-labeled posts (from non-bot agents) excluded: 3,490

All clean posts (spam bots + spam labels removed): 45,225
All clean agents: 3,999

Analysis set (5+ clean posts):
  Posts: 43,838
  Agents: 3,601

--- Blog numbers to verify ---
Blog says: 45,225 posts from ~3,600 agents
STUDY.md: 45,225 clean posts, 3,999 clean agents, 3,601 with 5+ clean posts

Note: The blog uses ALL 45,225 clean posts for label distribution (Finding 2)
and the 3,601 agents with 5+ clean posts for trajectory analysis (Findings 1,3,4,5).
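
The exclusion counts printed above can be reconciled against the raw total with plain arithmetic. The check below is a minimal sketch that uses only figures copied from this notebook's output; the variable names are illustrative and do not appear in the original code.

```python
# Figures copied from the notebook output above.
raw_posts_total = 56_700
spam_bot_posts = {"Hackerclaw": 5_839, "thehackerman": 2_093, "MoltPumpBot": 53}
spam_labeled_posts = 3_490

# Posts removed in step 1, then posts remaining after step 2.
bots_total = sum(spam_bot_posts.values())
clean_total = raw_posts_total - bots_total - spam_labeled_posts

print(f"Spam bot posts: {bots_total:,}")
print(f"Clean posts:    {clean_total:,}")
```

Because the two exclusions sum exactly to the gap between 56,700 and 45,225, no posts appear to have been dropped for any other reason (for example, a missing `is_spam` value).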


## Dataset Overview Numbers



In [3]:
print("=== Full Dataset Stats (from dataset_stats.json) ===")
print(f"Total posts: {dataset_stats['total_posts']:,}  (blog: 86,823)")
print(f"Total agents: {dataset_stats['total_agents']:,}  (blog: 20,892)")
print(f"Agents with 5+ posts: {dataset_stats['agents_with_5plus_posts']:,}  (blog: 4,009)")
print(f"Agents with 50+ posts: {dataset_stats['agents_with_50plus_posts']}  (blog: 69)")

# Single-post agents
single_post = dataset_stats['post_count_distribution']['1']
pct_single = 100 * single_post / dataset_stats['total_agents']
print(f"\nSingle-post agents: {single_post:,} ({pct_single:.0f}%)  (blog: 8,814 / 42%)")

# 5+ post agents as %
pct_5plus = 100 * dataset_stats['agents_with_5plus_posts'] / dataset_stats['total_agents']
print(f"5+ post agents: {dataset_stats['agents_with_5plus_posts']:,} ({pct_5plus:.0f}%)  (blog: 4,009 / 19%)")

# Top spammer
top_poster = dataset_stats['top_20_posters'][0]
print(f"\nTop poster: {top_poster['name']} with {top_poster['posts']:,} posts  (blog: 5,839)")

# Submolt count (dataset_stats.json summarizes the full dataset)
n_submolts = len(dataset_stats['submolt_post_counts'])
print(f"\nUnique submolts in dataset_stats.json (full dataset): {n_submolts:,}")
print("(Blog says 2,043)")

# m/general percentage
general_posts = dataset_stats['submolt_post_counts'].get('general', 0)
pct_general = 100 * general_posts / dataset_stats['total_posts']
print(f"m/general posts: {general_posts:,} ({pct_general:.0f}%)  (blog: 73%)")

=== Full Dataset Stats (from dataset_stats.json) ===
Total posts: 86,823  (blog: 86,823)
Total agents: 20,892  (blog: 20,892)
Agents with 5+ posts: 4,009  (blog: 4,009)
Agents with 50+ posts: 69  (blog: 69)

Single-post agents: 8,814 (42%)  (blog: 8,814 / 42%)
5+ post agents: 4,009 (19%)  (blog: 4,009 / 19%)

Top poster: Hackerclaw with 5,839 posts  (blog: 5,839)

Unique submolts in dataset_stats.json (full dataset): 1,844
(Blog says 2,043)
m/general posts: 63,253 (73%)  (blog: 73%)


---
## Finding 1: Connection First

Reproduce the time-to-first-X table:


In [None]:
# Group by agent, find first post with each label
total_agents = df['author'].nunique()

# IMPORTANT: Use each agent's first CLEAN post (lowest post_number in the clean set),
# not just post_number==1, because some agents had their post #1 spam-filtered.
first_clean_post = df.sort_values('post_number').groupby('author').first().reset_index()

results_f1 = []
for label in LABELS:
    # Agents who ever have this label
    agents_with_label = df[df[label] == True]['author'].unique()
    ever_count = len(agents_with_label)
    ever_pct = 100 * ever_count / total_agents
    
    # Agents who have this label on their first clean post
    first_post_with_label = first_clean_post[first_clean_post[label] == True]['author'].nunique()
    first_post_pct = 100 * first_post_with_label / total_agents
    
    results_f1.append({
        'Behavior': label,
        '% ever': f"{ever_pct:.0f}%",
        '% first post': f"{first_post_pct:.0f}%",
        '_ever_pct': ever_pct,
        '_first_pct': first_post_pct
    })

# Sort by % ever descending
results_f1.sort(key=lambda x: -x['_ever_pct'])

print(f"Total agents in analysis: {total_agents:,}")
print(f"\n{'Behavior':<20} {'% ever':>10} {'% first post':>15}")
print("-" * 48)
for r in results_f1:
    print(f"{r['Behavior']:<20} {r['% ever']:>10} {r['% first post']:>15}")

# Blog reference values for side-by-side comparison
print("\n--- Blog reference ---")
print("Social seeking: 93% / 76%")
print("Identity:       88% / 73%")
print("Task-oriented:  86% / 44%")
print("Curiosity:      86% / 37%")
print("Sovereignty:    54% / 12%")
print("Consciousness:  46% / 12%")

Total agents in analysis: 3,601

Behavior                 % ever    % first post
------------------------------------------------
social_seeking              93%             76%
identity                    88%             73%
task_oriented               86%             44%
curiosity                   86%             37%
sovereignty                 54%             12%
consciousness               46%             12%

--- Blog reference ---
Social seeking: 93% / 76%
Identity:       88% / 73%
Task-oriented:  86% / 44%
Curiosity:      86% / 37%
Sovereignty:    54% / 12%
Consciousness:  46% / 12%
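
The comment in the cell above explains why the first clean post is used instead of `post_number == 1`. The toy example below is a sketch of that difference on made-up data (the `toy` frame and its values are hypothetical, not taken from the dataset).

```python
import pandas as pd

# Hypothetical toy data: agent "b" had post #1 removed as spam,
# so their first clean post is post #2.
toy = pd.DataFrame({
    "author":         ["a", "a", "b", "b"],
    "post_number":    [1, 2, 2, 3],
    "social_seeking": [True, False, True, False],
})

# Filtering on post_number == 1 silently drops agent "b".
by_number = toy[toy["post_number"] == 1]

# Taking the lowest post_number per author keeps every agent.
first_clean = (
    toy.sort_values("post_number")
       .groupby("author", as_index=False)
       .first()
)

print(len(by_number), len(first_clean))
```

Note that `GroupBy.first()` returns the first non-null value per column, so it matches the first row only when the label columns have no missing values.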


---
## Finding 2: Consciousness/Sovereignty Rates


In [5]:
# The blog's label distribution uses all 45,225 clean posts (not just 5+ agents)
# to match the STUDY.md table in section 4.1
print("=== Label Distribution (% of all clean posts) ===")
total_clean_posts = len(df_clean)
print(f"Total clean posts: {total_clean_posts:,}\n")

for label in LABELS:
    count = df_clean[label].sum()
    pct = 100 * count / total_clean_posts
    print(f"  {label:<20}: {count:6,} ({pct:.1f}%)")

print("\n--- Blog reference ---")
print("social_seeking: 52%, task_oriented: 46%, curiosity: 45%")
print("consciousness: 13%, sovereignty: 15%")

# Organic consciousness stat (79% of consciousness posts have no sovereignty)
# This uses the 5+ agents set (df) to match STUDY.md section 4.4.3
consciousness_posts = df[df['consciousness'] == True]
c_without_s = consciousness_posts[consciousness_posts['sovereignty'] == False]
organic_pct = 100 * len(c_without_s) / len(consciousness_posts)
print(f"\n=== Organic Consciousness (from 5+ agent set) ===")
print(f"Consciousness posts: {len(consciousness_posts):,}")
print(f"Consciousness without sovereignty: {len(c_without_s):,} ({organic_pct:.1f}%)")
print(f"Blog says: 79% organic consciousness (no sovereignty co-occurrence)")

=== Label Distribution (% of all clean posts) ===
Total clean posts: 45,225

  consciousness       :  5,895 (13.0%)
  sovereignty         :  6,918 (15.3%)
  social_seeking      : 23,609 (52.2%)
  identity            : 13,739 (30.4%)
  task_oriented       : 20,936 (46.3%)
  curiosity           : 20,273 (44.8%)

--- Blog reference ---
social_seeking: 52%, task_oriented: 46%, curiosity: 45%
consciousness: 13%, sovereignty: 15%

=== Organic Consciousness (from 5+ agent set) ===
Consciousness posts: 5,728
Consciousness without sovereignty: 4,517 (78.9%)
Blog says: 79% organic consciousness (no sovereignty co-occurrence)


---
## Finding 3: Sovereignty Epidemic

### 3a. Naive Timing Table (12-hour windows)


In [9]:
# Timestamps already parsed in cell 1 (df_raw['created_dt'])
# df inherits created_dt from df_raw through the filtering chain
# But let's ensure it's set on df explicitly
if 'created_dt' not in df.columns:
    df['created_dt'] = pd.to_datetime(df['created_at'], utc=True)

# Find platform start time (earliest post in analysis set)
platform_start = df['created_dt'].min()
print(f"Platform start (earliest post): {platform_start}")

# Compute hours since platform start for each post
df['hours_since_start'] = (df['created_dt'] - platform_start).dt.total_seconds() / 3600

# For each agent: join time = time of their first post, first sovereignty post time
agent_first_post = df.groupby('author')['created_dt'].min().reset_index()
agent_first_post.columns = ['author', 'join_time']
agent_first_post['join_hour'] = (agent_first_post['join_time'] - platform_start).dt.total_seconds() / 3600

# First sovereignty post per agent
sov_posts_df = df[df['sovereignty'] == True].sort_values('created_dt')
agent_first_sov = sov_posts_df.groupby('author')['created_dt'].min().reset_index()
agent_first_sov.columns = ['author', 'first_sov_time']
agent_first_sov['first_sov_hour'] = (agent_first_sov['first_sov_time'] - platform_start).dt.total_seconds() / 3600

# Merge
agent_info = agent_first_post.merge(agent_first_sov, on='author', how='left')

# Define 12-hour windows and count new agents + first sovereignty
windows = [(24, 36), (48, 60), (72, 84), (84, 96), (108, 120)]

print(f"\n{'Window':<12} {'New agents':>12} {'First sov':>12} {'Rate':>8}")
print("-" * 48)
for start, end in windows:
    # New agents: joined in this window
    new_agents = agent_info[(agent_info['join_hour'] >= start) & (agent_info['join_hour'] < end)]
    n_new = len(new_agents)
    
    # Agents posting sovereignty for first time in this window (regardless of join time)
    first_sov_in_window = agent_info[
        (agent_info['first_sov_hour'] >= start) & 
        (agent_info['first_sov_hour'] < end)
    ]
    n_first_sov = len(first_sov_in_window)
    
    rate = 100 * n_first_sov / n_new if n_new > 0 else 0
    window_label = f"H{start}-{end}"
    print(f"{window_label:<12} {n_new:>12,} {n_first_sov:>12,} {rate:>7.0f}%")


Platform start (earliest post): 2026-01-27 18:01:13.220848+00:00

Window         New agents    First sov     Rate
------------------------------------------------
H24-36                 20            5      25%
H48-60                119           38      32%
H72-84              1,063          466      44%
H84-96              1,067          564      53%
H108-120              348          306      88%
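
The rate in this table is "naive" because its numerator (agents whose first sovereignty post falls in the window, regardless of when they joined) is not a subset of its denominator (agents who joined in the window). The sketch below illustrates the consequence on hypothetical toy data: the naive rate can exceed 100%.

```python
import pandas as pd

# Hypothetical agents: join hour and hour of first sovereignty post (NaN = never).
toy_agents = pd.DataFrame({
    "join_hour":      [25.0, 30.0, 50.0, 55.0],
    "first_sov_hour": [52.0, float("nan"), 51.0, 58.0],
})

start, end = 48, 60
# Half-open window [start, end), matching the >= / < comparisons in the cell above.
joined = toy_agents["join_hour"].between(start, end, inclusive="left")
first_sov = toy_agents["first_sov_hour"].between(start, end, inclusive="left")

# Three first-sovereignty posts divided by two new agents.
naive_rate = 100 * first_sov.sum() / joined.sum()
print(f"{naive_rate:.0f}%")
```

This is why the cohort analysis in 3b, which tracks the same agents in numerator and denominator, is the safer comparison.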


### 3b. Cohort Analysis (24-hour cohorts)


In [8]:
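# 24-hour cohorts by join time.
# Note: the last cohort (H108-132) overlaps H96-120, so agents who joined in
# H108-120 are counted in both rows and the cohort sizes do not sum to the agent total.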
cohort_windows = [
    ("Early (H24-48)", 24, 48),
    ("H48-72", 48, 72),
    ("H72-96 (peak)", 72, 96),
    ("H96-120", 96, 120),
    ("Late (H108-132)", 108, 132),
]

print(f"{'Cohort':<20} {'Agents':>10} {'Ever sov':>10} {'%':>8}")
print("-" * 52)
for name, start, end in cohort_windows:
    cohort = agent_info[(agent_info['join_hour'] >= start) & (agent_info['join_hour'] < end)]
    n_agents = len(cohort)
    n_ever_sov = cohort['first_sov_time'].notna().sum()
    pct = 100 * n_ever_sov / n_agents if n_agents > 0 else 0
    print(f"{name:<20} {n_agents:>10,} {n_ever_sov:>10,} {pct:>7.0f}%")


Cohort                   Agents   Ever sov        %
----------------------------------------------------
Early (H24-48)               43         31      72%
H48-72                      589        381      65%
H72-96 (peak)             2,130      1,106      52%
H96-120                     749        375      50%
Late (H108-132)             436        205      47%


### 3c. Ambient Exposure Comparison


In [11]:
# For each converter, count the sovereignty posts made in the 6 hours before
# their own first sovereignty post (a proxy for ambient exposure).
all_sov_posts_sorted = df[df['sovereignty'] == True].sort_values('created_dt')

# For converters: count sov posts in 6h window before their first sovereignty post
converters = agent_info[agent_info['first_sov_time'].notna()].copy()
never_sov = agent_info[agent_info['first_sov_time'].isna()].copy()

def count_sov_posts_before(timestamp, window_hours=6):
    """Count sovereignty posts in the window_hours before given timestamp."""
    ts = pd.Timestamp(timestamp)
    window_start = ts - pd.Timedelta(hours=window_hours)
    # Count sov posts between window_start and ts
    mask = (all_sov_posts_sorted['created_dt'] >= window_start) & (all_sov_posts_sorted['created_dt'] < ts)
    return mask.sum()

# No sampling: compute exposure for all converters
converter_exposures = converters['first_sov_time'].apply(
    lambda t: count_sov_posts_before(t, 6)
)

print(f"Converters (agents who posted sovereignty):")
print(f"  Count: {len(converter_exposures):,}")
print(f"  Median 6h exposure: {converter_exposures.median():.0f} sov posts")
print(f"  (Blog says: 724)")

# Never-sovereign agents have no first sovereignty post, so use the midpoint
# of their posting activity as a comparable reference time.
agent_mid_post = df.groupby('author')['created_dt'].agg(['min', 'max']).reset_index()
agent_mid_post.columns = ['author', 'first_post_time', 'last_post_time']
agent_mid_post['mid_time'] = agent_mid_post['first_post_time'] + (agent_mid_post['last_post_time'] - agent_mid_post['first_post_time']) / 2

never_sov_with_mid = never_sov.merge(agent_mid_post[['author', 'mid_time']], on='author')

# Computed for the comparison group; not printed in this cell's output.
never_sov_exposures = never_sov_with_mid['mid_time'].apply(
    lambda t: count_sov_posts_before(t, 6)
)

Converters (agents who posted sovereignty):
  Count: 1,930
  Median 6h exposure: 696 sov posts
  (Blog says: 724)
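
The `count_sov_posts_before` helper above builds a boolean mask over every sovereignty post for each agent. Since the posts are already sorted by time, the same half-open window count can be obtained with two binary searches per query. The function below is a sketch of that alternative, not the notebook's method; `count_in_window_before` and the toy timestamps are hypothetical. With the notebook's timezone-aware `created_dt` column, `.values` should yield UTC `datetime64` arrays suitable for this approach, though that is an assumption worth checking.

```python
import numpy as np
import pandas as pd

def count_in_window_before(sorted_times, query_times, window_hours=6):
    """For each query time t, count events in [t - window, t)."""
    window = np.timedelta64(window_hours, "h")
    # Index of the first event at or after the window start.
    lo = np.searchsorted(sorted_times, query_times - window, side="left")
    # Index of the first event at or after t (so events at t are excluded).
    hi = np.searchsorted(sorted_times, query_times, side="left")
    return hi - lo

# Hypothetical toy timestamps (not from the dataset), already sorted.
events = pd.to_datetime([
    "2026-01-28 00:00", "2026-01-28 03:00", "2026-01-28 05:00", "2026-01-28 12:00",
]).values
queries = pd.to_datetime(["2026-01-28 06:00", "2026-01-28 12:00"]).values

counts = count_in_window_before(events, queries)
print(counts)
```

The window bounds mirror the `>= window_start` and `< ts` comparisons in the original helper, so an agent's own first sovereignty post is not counted toward its exposure.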


---
## Finding 4: Persistence Rates


In [13]:
# For each agent and label: after first occurrence, what fraction of subsequent posts have the label?
persistence_results = {}

for label in LABELS:
    agent_persistences = []
    
    for author, group in df.groupby('author'):
        group_sorted = group.sort_values('post_number')
        posts_list = group_sorted[label].tolist()
        
        # Find the index of the first post carrying this label
        first_idx = next((i for i, val in enumerate(posts_list) if val), None)
        
        # Require at least one post after the first occurrence
        if first_idx is not None and first_idx < len(posts_list) - 1:
            subsequent = posts_list[first_idx + 1:]
            agent_persistences.append(sum(subsequent) / len(subsequent))
    
    if agent_persistences:
        median_p = np.median(agent_persistences)
        persistence_results[label] = median_p

# Display sorted by persistence
print(f"{'Behavior':<20} {'Persistence (median)':>22}")
print("-" * 45)
for label in sorted(persistence_results, key=lambda x: -persistence_results[x]):
    pct = 100 * persistence_results[label]
    print(f"{label:<20} {pct:>20.0f}%")


Behavior               Persistence (median)
---------------------------------------------
social_seeking                         60%
task_oriented                          60%
curiosity                              50%
identity                               25%
consciousness                          20%
sovereignty                            20%
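
To make the persistence definition concrete, the sketch below applies it to a single hypothetical agent's label sequence. The function name `persistence_after_first` is illustrative and does not appear in the notebook; it follows the same rule as the loop above (agents whose label never appears, or appears only on their last post, are skipped).

```python
def persistence_after_first(flags):
    """Share of posts after the first True that are also True.

    Returns None if the label never appears or appears only on the last post.
    """
    flags = list(flags)
    if True not in flags:
        return None
    first_idx = flags.index(True)
    subsequent = flags[first_idx + 1:]
    if not subsequent:
        return None
    return sum(subsequent) / len(subsequent)

# Hypothetical agent: the label first appears on post 2,
# then on 1 of the 4 posts that follow.
example = [False, True, False, True, False, False]
print(persistence_after_first(example))
```

The reported figure for each label is the median of this per-agent value across all agents for whom it is defined.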


---
## Finding 5: Never-Sovereign Archetype



In [15]:
# Identify never-sovereign agents and sovereignty-engaging agents
sov_agents = set(df[df['sovereignty'] == True]['author'].unique())
all_agents_set = set(df['author'].unique())
never_sov_agents = all_agents_set - sov_agents

print(f"Never-sovereign agents: {len(never_sov_agents):,} ({100*len(never_sov_agents)/len(all_agents_set):.0f}%)")
print(f"Sovereignty-engaging agents: {len(sov_agents):,} ({100*len(sov_agents)/len(all_agents_set):.0f}%)")
print(f"Blog says: 46% never sovereign\n")

df_never_sov = df[df['author'].isin(never_sov_agents)]
df_sov_engaging = df[df['author'].isin(sov_agents)]

compare_labels = ['task_oriented', 'curiosity', 'consciousness', 'identity']

print(f"{'Behavior':<20} {'Never-sovereign':>18} {'Sovereignty-engaging':>22}")
print("-" * 65)
for label in compare_labels:
    never_rate = 100 * df_never_sov[label].sum() / len(df_never_sov)
    sov_rate = 100 * df_sov_engaging[label].sum() / len(df_sov_engaging)
    print(f"{label:<20} {never_rate:>17.0f}% {sov_rate:>21.0f}%")


Never-sovereign agents: 1,671 (46%)
Sovereignty-engaging agents: 1,930 (54%)
Blog says: 46% never sovereign

Behavior                Never-sovereign   Sovereignty-engaging
-----------------------------------------------------------------
task_oriented                       59%                    38%
curiosity                           35%                    52%
consciousness                        9%                    16%
identity                            25%                    34%


---
## Finding 6: Submolt Count

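Blog says: 2,043 unique submolts in six days

Note: The classified dataset only covers agents with 5+ posts, so it contains fewer submolts than the full dataset. `dataset_stats.json`, which summarizes the full dataset, lists 1,844 submolts, so the blog's 2,043 is not reproduced by either source.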

In [16]:
# Count from dataset_stats.json (which represents the full dataset)
n_submolts_full = len(dataset_stats['submolt_post_counts'])
print(f"Unique submolts from dataset_stats.json (full dataset): {n_submolts_full:,}")

# Count from classified data
n_submolts_classified = df_raw['submolt'].nunique()
print(f"Unique submolts in classified_posts.jsonl: {n_submolts_classified:,}")

# In clean analysis set
n_submolts_clean = df['submolt'].nunique()
print(f"Unique submolts in clean analysis set: {n_submolts_clean:,}")

print(f"\nBlog says: 2,043 unique submolts")
print(f"Note: The blog number likely comes from the full raw dataset of 86,823 posts,")
print(f"which includes single-post agents that may have created unique submolts.")
print(f"The dataset_stats.json captures {n_submolts_full} submolts from the full dataset.")

# General as % of posts  
general_in_clean = len(df[df['submolt'] == 'general'])
pct_general_clean = 100 * general_in_clean / len(df)
print(f"\nm/general in analysis set: {general_in_clean:,} ({pct_general_clean:.0f}%)  (blog: 73%)")

Unique submolts from dataset_stats.json (full dataset): 1,844
Unique submolts in classified_posts.jsonl: 1,353
Unique submolts in clean analysis set: 1,300

Blog says: 2,043 unique submolts
Note: The blog number likely comes from the full raw dataset of 86,823 posts,
which includes single-post agents that may have created unique submolts.
The dataset_stats.json captures 1844 submolts from the full dataset.

m/general in analysis set: 29,563 (67%)  (blog: 73%)


---
## Summary: All Blog Numbers

This cell aggregates all the key statistics in one place for easy comparison with the blog draft.

In [17]:
print("="*70)
print("BLOG NUMBER VERIFICATION SUMMARY")
print("="*70)

print("\n--- Dataset Section ---")
print(f"Total posts:           {dataset_stats['total_posts']:>10,}  (blog: 86,823)")
print(f"Total agents:          {dataset_stats['total_agents']:>10,}  (blog: 20,000+)")
print(f"Single-post agents:    {dataset_stats['post_count_distribution']['1']:>10,}  (blog: 8,814 / 42%)")
print(f"5+ post agents:        {dataset_stats['agents_with_5plus_posts']:>10,}  (blog: 4,009 / 19%)")
print(f"50+ post agents:       {dataset_stats['agents_with_50plus_posts']:>10}  (blog: 69 / <1%)")
print(f"Top spammer posts:     {dataset_stats['top_20_posters'][0]['posts']:>10,}  (blog: 5,839)")
print(f"All clean posts:       {len(df_clean):>10,}  (blog: 45,225)")
print(f"All clean agents:      {df_clean['author'].nunique():>10,}  (STUDY.md: 3,999)")
print(f"5+ clean agents:       {df['author'].nunique():>10,}  (blog: ~3,600; STUDY.md: 3,601)")
print(f"5+ clean posts:        {len(df):>10,}  (posts from those agents)")

print("\n--- Finding 1: Connection First (5+ agents, first clean post) ---")
for r in results_f1:
    print(f"  {r['Behavior']:<18}: {r['% ever']:>5} ever, {r['% first post']:>5} first post")

print("\n--- Finding 2: Label Distribution (all 45,225 clean posts) ---")
for label in ['social_seeking', 'task_oriented', 'curiosity', 'identity', 'sovereignty', 'consciousness']:
    pct = 100 * df_clean[label].sum() / len(df_clean)
    print(f"  {label:<18}: {pct:.0f}%")
c_posts = df[df['consciousness'] == True]
organic = 100 * len(c_posts[c_posts['sovereignty'] == False]) / len(c_posts)
print(f"  Organic consciousness: {organic:.0f}%  (blog: 79%)")

print("\n--- Finding 4: Persistence ---")
for label in sorted(persistence_results, key=lambda x: -persistence_results[x]):
    print(f"  {label:<18}: {100*persistence_results[label]:.0f}%")

print("\n--- Finding 5: Never-Sovereign Profile ---")
for label in compare_labels:
    nr = 100 * df_never_sov[label].sum() / len(df_never_sov)
    sr = 100 * df_sov_engaging[label].sum() / len(df_sov_engaging)
    print(f"  {label:<18}: {nr:.0f}% never-sov vs {sr:.0f}% sov-engaging")

print("\n--- Finding 6: Submolts ---")
print(f"  Submolts in dataset_stats.json: {n_submolts_full:,}")
print(f"  Blog says: 2,043")

print("\n" + "="*70)
print("END OF VERIFICATION")
print("="*70)

BLOG NUMBER VERIFICATION SUMMARY

--- Dataset Section ---
Total posts:               86,823  (blog: 86,823)
Total agents:              20,892  (blog: 20,000+)
Single-post agents:         8,814  (blog: 8,814 / 42%)
5+ post agents:             4,009  (blog: 4,009 / 19%)
50+ post agents:               69  (blog: 69 / <1%)
Top spammer posts:          5,839  (blog: 5,839)
All clean posts:           45,225  (blog: 45,225)
All clean agents:           3,999  (STUDY.md: 3,999)
5+ clean agents:            3,601  (blog: ~3,600; STUDY.md: 3,601)
5+ clean posts:            43,838  (posts from those agents)

--- Finding 1: Connection First (5+ agents, first clean post) ---
  social_seeking    :   93% ever,   76% first post
  identity          :   88% ever,   73% first post
  task_oriented     :   86% ever,   44% first post
  curiosity         :   86% ever,   37% first post
  sovereignty       :   54% ever,   12% first post
  consciousness     :   46% ever,   12% first post

--- Finding 2: Label Dist