Hierarchical Population Clustering

Gravitational potential-based clustering algorithm that discovers natural regional structure in population data at multiple scales—from neighborhoods to continents.

Overview

This project implements a hierarchical clustering algorithm based on gravitational potential between population centers. Starting from individual census tracts or grid cells, it iteratively merges regions with the highest mutual attraction, building a complete hierarchy that reveals natural population structure without prior geographic knowledge.

Key insight: The same algorithm discovers both that Central Park divides Manhattan's Upper East and West Sides, and that the Sahara Desert divides North and Sub-Saharan Africa.

Visualizations

Spider Web Network Maps

Lines connecting merged population centers, colored by merge order.

74,000 US census tracts merging hierarchically

Global population at 15 arc-minute resolution

3D Gravitational Potential Surface

Population distribution rendered as elevation, where height represents cumulative gravitational pull.

Western Hemisphere gravitational potential

Key Findings

The algorithm naturally discovers:

Continental regions (South Asia 1.8B, East Asia 1.5B, Western Hemisphere 927M)
Natural barriers (deserts, oceans, mountain ranges) as low-density boundaries
Cultural corridors (Grand Trunk Road in South Asia emerges as coherent 1B+ person region)
Metropolitan structure (urban cores, suburbs, satellite cities)
Micro-geography (Central Park divides Manhattan neighborhoods)

Cross-validation: Census tracts, block groups, and world grid data all converge on the same regional boundaries, confirming the algorithm captures real geographic structure.

Algorithm

Gravitational Potential Model

potential(i, j) = (population_i × population_j) / distance(i, j)^4

The inverse fourth power heavily weights nearby populations, mimicking urban agglomeration while allowing long-range effects.

Process

Start with individual census tracts/grid cells as separate regions
Calculate pairwise gravitational potential between all regions
Merge the pair with highest potential
Update centroid (population-weighted) and recalculate potentials
Repeat until all regions merged into one

Output: Complete merge tree showing hierarchical structure at all scales.

Distance Calculation

Approximate spherical distance optimized for speed:

xMiles = deltaLon × 69.0 × cos(averageLatitude)
yMiles = deltaLat × 69.172
distance² = xMiles² + yMiles²

Cosine lookup table for latitudes 0-89° avoids repeated trig calculations.

Performance

Complexity: O(n²) with aggressive caching
Cache strategy: Store potentials for large merged regions (>100 members)
Memory management: Release merged tract contents to handle large datasets

Data Sources

US Census Bureau: Census tracts (~74k) and block groups (~220k)
GPW v4: Gridded Population of the World (CIESIN)
- 15 arc-minute resolution (~28km at equator, ~1 million grid cells)
- 2015 population adjusted to UN estimates

Project Structure

lib/                        # Python library
├── constants.py            # Earth distance constants
├── geometry.py             # Distance calculations (cosine-corrected, haversine)
├── potential.py            # Potential field calculation (chunked, vectorized)
└── io.py                   # CSV loading utilities

tests/                      # Python test suite
├── test_geometry.py        # Distance function tests
├── test_potential.py       # Potential calculation tests
├── test_io.py              # File I/O tests
└── test_regression.py      # Regression baselines

src/
├── cli/                    # Python CLI tools
│   ├── calculate_potential.py   # Population potential calculator
│   └── generate_3d_surface.py   # 3D visualization generator
└── com/jimspencer/         # Java implementation (original)
    ├── Tracts.java         # Main clustering algorithm with caching
    ├── Tract.java          # Individual tract or merged region
    ├── SpiderMap.java      # Alternative edge-based approach
    ├── MapEdge.java        # Adjacency between tracts
    └── Region.java         # Region merging by boundary distance

experiments/                # Experimental scripts (organized by category)
├── 3d_export/              # OBJ, STL, point cloud exports
├── 3d_print/               # Blender integration, print previews
├── triangulation/          # Delaunay triangulation utilities
├── visualization/          # Regional visualizations, color exports
├── regional_analysis/      # Specific region investigations
├── triangle_centers/       # Triangle center calculations
├── exclude_experiments/    # Exclusion parameter experiments
└── misc/                   # Utility scripts

res/
├── censusTracts.csv        # Input: US census tract data
├── treeOutput.csv          # Output: sequential merge records
├── tracts_sf_bay.csv       # SF Bay Area subset for testing
├── 74k_us.pdf              # US spider web visualization
├── worldPrintMap.pdf       # World spider web (colored)
├── 3dsurface.png           # 3D potential surface
└── usa6region.jpg          # Physical map with 6 hand-drawn regions

docs/
├── INTERNAL-NOTES.md       # Project organization reference
└── world tree.docx         # Annotated world hierarchy (to 3M pop)

Usage

Python Library (Recommended)

Installation:

# Install dependencies
pip install numpy scipy pandas pytest

# Run tests
pytest tests/ -v

Basic Usage:

from lib import io, potential, geometry

# Load census data
df = io.load_csv('res/tracts_sf_bay.csv')
lons = df['LONGITUDE'].values
lats = df['LATITUDE'].values
weights = df['POPULATION'].values

# Calculate population potential at each census tract
# Uses 1/d³ potential (from 1/d⁴ force law)
potentials = potential.calculate_potential_chunked(
    sample_lons=lons,
    sample_lats=lats,
    source_lons=lons,
    source_lats=lats,
    source_weights=weights,
    distance_fn=geometry.cos_corrected_distance,
    force_exponent=3,
    chunk_size=1000  # Process 1000 points at a time
)

print(f"Potential range: {potentials.min():.0f} to {potentials.max():.0f}")

API Reference:

calculate_potential_chunked(sample_lons, sample_lats, source_lons, source_lats, source_weights, distance_fn, force_exponent=3, chunk_size=1000, min_distance_miles=0.0, max_distance_miles=None)

sample_lons/lats: Where to calculate potential (e.g., triangle centers, census tracts)
source_lons/lats/weights: Population sources (census tracts with populations)
distance_fn: cos_corrected_distance (fast) or haversine_distance (accurate)
force_exponent: 1 for gravity (1/d), 3 for social cohesion (1/d³), etc.
chunk_size: Memory management (1000 works for 48GB RAM with 72k points)
min_distance_miles: Smooth noise by clamping distances (e.g., 1.0 mile for census centroids)
max_distance_miles: Limit to local influences (e.g., 50-100 miles)

Distance Functions:

cos_corrected_distance: Fast approximation using cosine correction (~3× faster)
haversine_distance: Accurate great-circle distance (use for scientific work)

Smoothing Census Centroid Noise:

# Census tract centroids are approximate (±0.5-1 mile error)
# Use min_distance_miles to smooth this noise
potentials = potential.calculate_potential_chunked(
    lons, lats, lons, lats, weights,
    geometry.cos_corrected_distance,
    force_exponent=3,
    min_distance_miles=1.0  # Treat anything <1 mile as 1 mile away
)

Advanced Options (pre-computed distances):

# For custom workflows, use pre-computed distance matrix
distances = geometry.cos_corrected_distance(sample_lons, sample_lats, source_lons, source_lats)
potentials = potential.calculate_potential(
    distances,
    weights,
    force_exponent=3,
    max_distance_miles=100,      # Zero contribution beyond distance
    min_distance_miles=1.0       # Smooth census noise
)

Java Implementation (Original)

# Compile
javac -d out src/com/jimspencer/*.java

# Run (edit Tracts.java to select input file)
java -cp out com.jimspencer.Tracts

Input: CSV with LONGITUDE,LATITUDE,POPULATION Output: res/treeOutput.csv with merge sequence

Generating Visualizations

The tree output can be rendered as:

GeoJSON MultiLineString for web mapping
PDF using vector graphics tools
Interactive visualizations (D3.js, Observable, etc.)

See docs/INTERNAL-NOTES.md for details on visualization workflow.

World Population Hierarchy

Top-level structure discovered by the algorithm:

South Asia (1.8B) — Grand Trunk Road corridor, India/Pakistan/Bangladesh
East Asia (1.5B) — China, Japan, Korea
Western Hemisphere (927M) — North/Central/South America
Europe & Middle East (909M) — Benelux to Tehran, Mediterranean
East Africa (464M) — Great Lakes, Ethiopia, Mozambique corridor
West Africa (409M) — Nigerian coast, Sahel
Independent Areas — Indonesia (245M), SE Asia (109M), Philippines (103M)
Isolated — Australia (21M), Pacific Northwest (14M), New Zealand (5M)

Complete hierarchy annotated down to 3M population in docs/world tree.docx. The algorithm continues to individual tracts but manual annotation became tedious.

Notable Discoveries

Grand Trunk Road emerges as natural region despite spanning Pakistan/India/Bangladesh borders
Florida separates from Deep South in US clustering
Middle East/Indian subcontinent gap aligns with desert barrier
Boundary regions (Denver, Salt Lake City) are genuinely ambiguous—they could plausibly belong to multiple regions

Technical Notes

Known Artifacts

Equatorial gridding: Early merges in gridded data show slight horizontal bias at the equator (E-W neighbors equally likely). Not visible at normal viewing distance and doesn't affect regional structure.

Line crossings: Spider web lines occasionally cross. This is correct—the algorithm uses potential (not pure geography) so attachment points may seem counterintuitive but reflect population attraction.

Validation

Cross-validation across multiple datasets confirms robustness:

US census tracts (74k)
US block groups (220k)
World 15 arc-minute grid (~1 million cells)

All converge on the same major regional boundaries.

Future Development

Interactive web visualization with zoom/drill-down
"Crystal growth" animation of merge sequence
Temporal analysis with historical population data
Alternative metrics (travel time, economic gravity)
Code refactoring and test coverage

Contributing

This is currently a personal research project. If you're interested in collaborating, extending the work, or using it for academic research, please reach out.

License

[To be determined]

Citation

If you use this work, please cite:

Spencer, J. (2015-2021). Hierarchical Population Clustering via Gravitational Potential.
[GitHub repository or paper citation once available]

Acknowledgments

Thanks to:

The printer guy who thought it was cool
My spouse for tolerating this obsession

"It made me pretty happy, might bring joy to others."

Name		Name	Last commit message	Last commit date
Latest commit History 33 Commits
.idea		.idea
docs		docs
experiments		experiments
lib		lib
out/production/javaMap		out/production/javaMap
output		output
res		res
scripts		scripts
src		src
test_data		test_data
tests		tests
.gitignore		.gitignore
.test_push		.test_push
README.md		README.md
TODO.md		TODO.md
ca_1mile_simple.py		ca_1mile_simple.py
censusBlockGroupResults.csv		censusBlockGroupResults.csv
check_usa_progress.sh		check_usa_progress.sh
generate_region_viz.sh		generate_region_viz.sh
javaMap.iml		javaMap.iml
requirements.txt		requirements.txt
sf_1mile_simple.py		sf_1mile_simple.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Hierarchical Population Clustering

Overview

Visualizations

Spider Web Network Maps

3D Gravitational Potential Surface

Key Findings

Algorithm

Gravitational Potential Model

Process

Distance Calculation

Performance

Data Sources

Project Structure

Usage

Python Library (Recommended)

Java Implementation (Original)

Generating Visualizations

World Population Hierarchy

Notable Discoveries

Technical Notes

Known Artifacts

Validation

Future Development

Contributing

License

Citation

Acknowledgments

About

Uh oh!

Releases

Packages

Languages

jspenc4/javaMap

Folders and files

Latest commit

History

Repository files navigation

Hierarchical Population Clustering

Overview

Visualizations

Spider Web Network Maps

3D Gravitational Potential Surface

Key Findings

Algorithm

Gravitational Potential Model

Process

Distance Calculation

Performance

Data Sources

Project Structure

Usage

Python Library (Recommended)

Java Implementation (Original)

Generating Visualizations

World Population Hierarchy

Notable Discoveries

Technical Notes

Known Artifacts

Validation

Future Development

Contributing

License

Citation

Acknowledgments

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages