# Clean Grid-to-District Mapping

Cleans the raw ArcGIS grid-to-district intersection table into formats suitable for the FAMAIL pipeline.

**Input**: `raw_data/grid_to_district_ArcGIS_table_raw.csv` (exported from `create_48x90_grid_map_districts.ipynb`)

**Outputs**:
1. `source_data/grid_to_district_mapping.pkl` — Primary programmatic artifact (dict lookups + numpy arrays)
2. `source_data/grid_to_district_mapping.sample.json` — Human-readable schema sample (git-tracked)

**Grid convention** (matching FAMAIL codebase):
- (0, 0) = south-west corner
- x_grid = latitude dimension (0-47, south to north)
- y_grid = longitude dimension (0-89, west to east)
- cell_id = x_grid * 90 + y_grid
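
The `cell_id` formula above is a row-major flattening of the grid, so it can be inverted with integer division. The sketch below illustrates this; the helper names `encode_cell_id` and `decode_cell_id` are hypothetical and not part of the FAMAIL codebase.

```python
GRID_ROWS, GRID_COLS = 48, 90  # grid shape from the convention above


def encode_cell_id(x_grid: int, y_grid: int) -> int:
    """Flatten (x_grid, y_grid) into a row-major cell_id (hypothetical helper)."""
    if not (0 <= x_grid < GRID_ROWS and 0 <= y_grid < GRID_COLS):
        raise ValueError(f"cell ({x_grid}, {y_grid}) is outside the grid")
    return x_grid * GRID_COLS + y_grid


def decode_cell_id(cell_id: int) -> tuple:
    """Recover (x_grid, y_grid) from a cell_id (hypothetical helper)."""
    if not 0 <= cell_id < GRID_ROWS * GRID_COLS:
        raise ValueError(f"cell_id {cell_id} is outside the grid")
    # Quotient is the latitude index, remainder is the longitude index
    return divmod(cell_id, GRID_COLS)


print(encode_cell_id(0, 0))    # 0 (south-west corner)
print(encode_cell_id(47, 89))  # 4319 (north-east corner)
print(decode_cell_id(91))      # (1, 1)
```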

## 1. Load and inspect raw data

In [1]:
import pandas as pd
import numpy as np
import pickle
import json
from pathlib import Path

PROJECT_ROOT = Path.cwd()
# Adjust if running from a subdirectory
if PROJECT_ROOT.name == 'geo_data':
    PROJECT_ROOT = PROJECT_ROOT.parent.parent
elif PROJECT_ROOT.name == 'data':
    PROJECT_ROOT = PROJECT_ROOT.parent

RAW_CSV = PROJECT_ROOT / 'raw_data' / 'grid_to_district_ArcGIS_table_raw.csv'
OUTPUT_PKL = PROJECT_ROOT / 'source_data' / 'grid_to_district_mapping.pkl'
OUTPUT_JSON = PROJECT_ROOT / 'source_data' / 'grid_to_district_mapping.sample.json'

GRID_ROWS, GRID_COLS = 48, 90

df = pd.read_csv(RAW_CSV, encoding='utf-8-sig')
print(f'Loaded {len(df)} rows')
print(f'Columns: {list(df.columns)}')
df.head()

Loaded 4320 rows
Columns: ['OID_', 'x_grid', 'y_grid', 'cell_id', 'Shape_Length', 'Shape_Area', 'cell_area', 'district', 'overlap_m2', 'overlap_pct', 'district_id']


| | OID_ | x_grid | y_grid | cell_id | Shape_Length | Shape_Area | cell_area | district | overlap_m2 | overlap_pct | district_id |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 4045.170426 | 1022363.0 | 9e-05 | Nanshan District | 5.9e-05 | 65.557165 | 7.0 |
| 1 | 2 | 0 | 1 | 1 | 4045.199704 | 1022378.0 | 9e-05 | Nanshan District | 9e-05 | 100.000003 | 7.0 |
| 2 | 3 | 0 | 2 | 2 | 4045.228996 | 1022393.0 | 9e-05 | Nanshan District | 9e-05 | 100.000006 | 7.0 |
| 3 | 4 | 0 | 3 | 3 | 4045.258509 | 1022408.0 | 9e-05 | Nanshan District | 9e-05 | 100.000001 | 7.0 |
| 4 | 5 | 0 | 4 | 4 | 4045.287837 | 1022423.0 | 9e-05 | Nanshan District | 9e-05 | 99.999998 | 7.0 |


In [3]:
# Validate grid dimensions
assert df['x_grid'].min() == 0 and df['x_grid'].max() == GRID_ROWS - 1, \
    f"x_grid range unexpected: {df['x_grid'].min()}-{df['x_grid'].max()}"
assert df['y_grid'].min() == 0 and df['y_grid'].max() == GRID_COLS - 1, \
    f"y_grid range unexpected: {df['y_grid'].min()}-{df['y_grid'].max()}"
assert len(df) == GRID_ROWS * GRID_COLS, \
    f"Expected {GRID_ROWS * GRID_COLS} cells, got {len(df)}"

# Check cell_id consistency
expected_ids = df['x_grid'] * GRID_COLS + df['y_grid']
assert (df['cell_id'] == expected_ids).all(), "cell_id mismatch!"

# Mapped vs unmapped
mapped = df['district'].notna() & (df['district'] != '')
print(f'Mapped to district: {mapped.sum()} / {len(df)} ({100*mapped.mean():.1f}%)')
print(f'Unmapped (ocean/outside): {(~mapped).sum()}')
print(f'\nDistricts found: {sorted(df.loc[mapped, "district"].unique())}')

Mapped to district: 2605 / 4320 (60.3%)
Unmapped (ocean/outside): 1715

Districts found: ['Bao’an District', 'Dapeng District', 'Futian District', 'Guangming District', 'Longgang District', 'Longhua District', 'Luohu District', 'Nanshan District', 'Pingshan District', 'Yantian District']


## 2. Clean district names

Two normalizations are needed to align with the demographic dataset (`all_demographics_by_district.csv`):
1. **Strip ` District` suffix** — raw ArcGIS uses `"Nanshan District"`, demographics uses `"Nanshan"`
2. **Normalize Unicode apostrophes** — ArcGIS exports `Bao\u2019an` (right single quote U+2019), demographics uses `Bao'an` (ASCII apostrophe U+0027)
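
As a plain-Python illustration of the two normalizations listed above, the sketch below cleans a single label. The function name `clean_district_name` is hypothetical; the notebook itself applies the same steps with pandas string methods. One difference to be aware of: `str.removesuffix` (Python 3.9+) only strips a trailing `" District"`, whereas a plain `replace` would remove the substring wherever it occurs.

```python
def clean_district_name(raw: str) -> str:
    """Normalize a raw ArcGIS district label (hypothetical helper)."""
    # Curly single quotes (U+2019, U+2018) -> ASCII apostrophe (U+0027)
    name = raw.replace("\u2019", "'").replace("\u2018", "'")
    # Strip only a trailing " District" suffix
    return name.removesuffix(" District")


print(clean_district_name("Nanshan District"))     # Nanshan
print(clean_district_name("Bao\u2019an District"))  # Bao'an
print("Bao\u2019an" == "Bao'an")                    # False: the raw and ASCII forms differ
```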

In [4]:
# Load demographic data to verify name alignment
demo_csv = PROJECT_ROOT / 'source_data' / 'all_demographics_by_district.csv'
if demo_csv.exists():
    demo_df = pd.read_csv(demo_csv)
    demo_districts = set(demo_df['DistrictName'].values)
    print(f'Demographic districts ({len(demo_districts)}): {sorted(demo_districts)}')
else:
    demo_districts = None
    print(f'Warning: demographic CSV not found at {demo_csv}')

# Clean district names:
# 1. Strip " District" suffix
# 2. Normalize Unicode apostrophes (U+2019 -> U+0027)
df['district_clean'] = (
    df['district']
    .str.replace(' District', '', regex=False)
    .str.replace('\u2019', "'", regex=False)  # right single quote -> ASCII apostrophe
    .str.replace('\u2018', "'", regex=False)  # left single quote -> ASCII apostrophe
)

mapped_clean = df.loc[mapped, 'district_clean'].unique()
print(f'\nCleaned district names: {sorted(mapped_clean)}')

# Verify all cleaned names appear in demographic data
if demo_districts is not None:
    missing = set(mapped_clean) - demo_districts
    extra = demo_districts - set(mapped_clean)
    if missing:
        print(f'WARNING: Grid districts not found in demographics: {missing}')
    if extra:
        print(f'Note: Demographic districts with no grid cells: {extra}')
    if not missing:
        print('All grid district names match demographic data.')

Demographic districts (10): ["Bao'an", 'Dapeng', 'Futian', 'Guangming', 'Longgang', 'Longhua', 'Luohu', 'Nanshan', 'Pingshan', 'Yantian']

Cleaned district names: ["Bao'an", 'Dapeng', 'Futian', 'Guangming', 'Longgang', 'Longhua', 'Luohu', 'Nanshan', 'Pingshan', 'Yantian']
All grid district names match demographic data.


## 3. Build cleaned data structures

### Canonical district ordering

District IDs follow the alphabetical order of **all 10** Shenzhen districts. Fixing the list this way keeps `district_id` values stable, so they can be used as indices into demographic feature arrays.

| district_id | District |
|:-----------:|----------|
| 0 | Bao'an |
| 1 | Dapeng |
| 2 | Futian |
| 3 | Guangming |
| 4 | Longgang |
| 5 | Longhua |
| 6 | Luohu |
| 7 | Nanshan |
| 8 | Pingshan |
| 9 | Yantian |
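
To illustrate how a `district_id` can index a demographic feature array, the sketch below broadcasts one per-district value onto a small grid. The feature values and the 2x3 grid are made-up placeholders, not real data. It assumes unmapped cells carry `-1`, as in this notebook; those cells must be masked first, because NumPy would otherwise treat `-1` as "last element" and silently assign them Yantian's value.

```python
import numpy as np

# Hypothetical per-district feature, ordered by district_id (placeholder values)
feature_by_district = np.arange(10, dtype=np.float32) * 10.0  # 0, 10, ..., 90

# Toy 2x3 district-id grid; -1 marks unmapped cells
district_id_grid = np.array([[7, 7, -1],
                             [0, 2, 9]], dtype=np.int8)

# Mask unmapped cells before indexing so -1 does not wrap around
valid = district_id_grid >= 0
feature_grid = np.full(district_id_grid.shape, np.nan, dtype=np.float32)
feature_grid[valid] = feature_by_district[district_id_grid[valid]]

print(feature_grid)
```

In this toy grid the two Nanshan cells (id 7) receive 70.0, and the unmapped cell stays `NaN`.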

In [5]:
# Canonical district list: all 10 Shenzhen districts, alphabetically sorted
ALL_DISTRICTS = sorted([
    "Bao'an", "Dapeng", "Futian", "Guangming", "Longgang",
    "Longhua", "Luohu", "Nanshan", "Pingshan", "Yantian"
])
district_to_id = {name: i for i, name in enumerate(ALL_DISTRICTS)}

print('Canonical district ordering:')
for did, name in enumerate(ALL_DISTRICTS):
    count = (df.loc[mapped, 'district_clean'] == name).sum()
    print(f'  {did}: {name} ({count} cells)')

Canonical district ordering:
  0: Bao'an (522 cells)
  1: Dapeng (544 cells)
  2: Futian (86 cells)
  3: Guangming (169 cells)
  4: Longgang (422 cells)
  5: Longhua (182 cells)
  6: Luohu (87 cells)
  7: Nanshan (307 cells)
  8: Pingshan (177 cells)
  9: Yantian (109 cells)


In [6]:
# --- Build all output data structures ---

# 1. cell_to_district: {(x_grid, y_grid): district_name} for mapped cells only
cell_to_district = {}
for _, row in df[mapped].iterrows():
    cell_to_district[(int(row['x_grid']), int(row['y_grid']))] = row['district_clean']

# 2. valid_cells: set of (x_grid, y_grid) tuples for cells within Shenzhen
valid_cells = set(cell_to_district.keys())

# 3. district_id_grid: numpy array (48, 90), -1 for unmapped cells
district_id_grid = np.full((GRID_ROWS, GRID_COLS), -1, dtype=np.int8)
for (x, y), name in cell_to_district.items():
    district_id_grid[x, y] = district_to_id[name]

# 4. valid_mask: boolean numpy array (48, 90)
valid_mask = district_id_grid >= 0

# 5. overlap_pct_grid: numpy array (48, 90), 0.0 for unmapped
overlap_pct_grid = np.zeros((GRID_ROWS, GRID_COLS), dtype=np.float32)
for _, row in df[mapped].iterrows():
    x, y = int(row['x_grid']), int(row['y_grid'])
    overlap_pct_grid[x, y] = float(row['overlap_pct'])

print(f'cell_to_district: {len(cell_to_district)} entries')
print(f'valid_cells: {len(valid_cells)} cells')
print(f'district_id_grid: shape {district_id_grid.shape}, '
      f'range [{district_id_grid.min()}, {district_id_grid.max()}]')
print(f'valid_mask: {valid_mask.sum()} True / {valid_mask.size} total')
print(f'overlap_pct_grid: non-zero cells {(overlap_pct_grid > 0).sum()}')

cell_to_district: 2605 entries
valid_cells: 2605 cells
district_id_grid: shape (48, 90), range [-1, 9]
valid_mask: 2605 True / 4320 total
overlap_pct_grid: non-zero cells 2605


## 4. Validate output

In [7]:
# Sanity checks
assert len(cell_to_district) == mapped.sum(), "cell_to_district count mismatch"
assert len(valid_cells) == mapped.sum(), "valid_cells count mismatch"
assert valid_mask.sum() == mapped.sum(), "valid_mask count mismatch"
assert district_id_grid[~valid_mask].max() == -1, "unmapped cells should be -1"
assert district_id_grid[valid_mask].min() >= 0, "mapped cells should have id >= 0"

# Verify consistency between data structures (every mapped cell)
for (x, y), name in cell_to_district.items():
    assert district_id_grid[x, y] == district_to_id[name]
    assert valid_mask[x, y]

# Geographic sanity: (0, *) = southernmost row, should be coastal districts
if (0, 0) in cell_to_district:
    print(f'Cell (0, 0) [south-west corner]: {cell_to_district[(0, 0)]}')
if (GRID_ROWS-1, GRID_COLS//2) in cell_to_district:
    print(f'Cell ({GRID_ROWS-1}, {GRID_COLS//2}) [north center]: '
          f'{cell_to_district[(GRID_ROWS-1, GRID_COLS//2)]}')

# District cell counts
from collections import Counter
dist_counts = Counter(cell_to_district.values())
print('\nCells per district:')
for name in ALL_DISTRICTS:
    print(f'  {name}: {dist_counts.get(name, 0)}')

print('\nAll validation checks passed.')

Cell (0, 0) [south-west corner]: Nanshan

Cells per district:
  Bao'an: 522
  Dapeng: 544
  Futian: 86
  Guangming: 169
  Longgang: 422
  Longhua: 182
  Luohu: 87
  Nanshan: 307
  Pingshan: 177
  Yantian: 109

All validation checks passed.


## 5. Save pickle file

Primary artifact for programmatic use in the FAMAIL pipeline.

| Key | Type | Description |
|-----|------|-------------|
| `cell_to_district` | `dict{(int,int): str}` | (x_grid, y_grid) to district name, mapped cells only |
| `valid_cells` | `set{(int,int)}` | All (x_grid, y_grid) tuples within Shenzhen |
| `district_id_grid` | `np.ndarray (48,90) int8` | Numeric district ID per cell, -1 = unmapped |
| `valid_mask` | `np.ndarray (48,90) bool` | True for cells within Shenzhen |
| `overlap_pct_grid` | `np.ndarray (48,90) float32` | Majority-district overlap percentage (0-100) |
| `district_names` | `list[str]` | 10 district names ordered by district_id |
| `district_to_id` | `dict{str: int}` | District name to numeric ID |
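
One possible use of `district_id_grid` and `valid_mask` is aggregating a per-cell quantity up to the district level. The sketch below does this with `np.bincount` on a toy 2x3 grid with three districts; the array `cell_values` and all numbers are illustrative assumptions, not pipeline data.

```python
import numpy as np

n_districts = 3  # toy example; the real mapping has 10

# Toy stand-ins for the arrays stored in the pickle
district_id_grid = np.array([[0, 0, -1],
                             [1, 2, 2]], dtype=np.int8)
valid_mask = district_id_grid >= 0

# Hypothetical per-cell quantity (for example, a count per cell)
cell_values = np.array([[1.0, 2.0, 5.0],
                        [3.0, 4.0, 6.0]])

# bincount needs non-negative integers, so drop unmapped cells first
ids = district_id_grid[valid_mask].astype(np.int64)
totals = np.bincount(ids, weights=cell_values[valid_mask], minlength=n_districts)
counts = np.bincount(ids, minlength=n_districts)
means = totals / np.maximum(counts, 1)  # guard against districts with no cells

print(totals)
print(means)
```

Here the value 5.0 in the unmapped cell is excluded, so the district totals are 3.0, 3.0 and 10.0.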

In [8]:
mapping_data = {
    'cell_to_district': cell_to_district,
    'valid_cells': valid_cells,
    'district_id_grid': district_id_grid,
    'valid_mask': valid_mask,
    'overlap_pct_grid': overlap_pct_grid,
    'district_names': ALL_DISTRICTS,
    'district_to_id': district_to_id,
}

OUTPUT_PKL.parent.mkdir(parents=True, exist_ok=True)
with open(OUTPUT_PKL, 'wb') as f:
    pickle.dump(mapping_data, f, protocol=pickle.HIGHEST_PROTOCOL)

print(f'Saved: {OUTPUT_PKL}')
print(f'Size: {OUTPUT_PKL.stat().st_size:,} bytes')

Saved: /home/robert/FAMAIL/source_data/grid_to_district_mapping.pkl
Size: 80,453 bytes


## 6. Save JSON sample file

Human-readable schema documentation, tracked in git (`.sample.json` files are not ignored by
`.gitignore`). It lets anyone inspect the pickle structure without loading it.
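
JSON object keys must be strings, so the sample file stores `cell_to_district` keys in the form `"(x, y)"` rather than as tuples. If a reader wants tuple keys back, a minimal sketch (assuming that key format, and using an inline JSON string in place of the real file) could look like this:

```python
import ast
import json

# Inline stand-in for a fragment of the sample file
sample_json = '{"cell_to_district (sample)": {"(0, 0)": "Nanshan", "(0, 1)": "Nanshan"}}'
data = json.loads(sample_json)

# ast.literal_eval safely parses "(0, 0)" into the tuple (0, 0)
cells = {
    ast.literal_eval(key): name
    for key, name in data["cell_to_district (sample)"].items()
}

print(cells[(0, 0)])  # Nanshan
```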

In [9]:
# Build a representative sample for the JSON file
sample_cells = sorted(list(valid_cells))[:5]
sample_unmapped = [(x, y) for x in range(GRID_ROWS) for y in range(GRID_COLS)
                   if (x, y) not in valid_cells][:3]

json_data = {
    '_description': (
        'Schema sample for grid_to_district_mapping.pkl. '
        'Shows structure and a few representative entries.'
    ),
    '_grid_convention': {
        'origin': '(0, 0) = south-west corner',
        'x_grid': 'latitude dimension, 0-47, increases northward',
        'y_grid': 'longitude dimension, 0-89, increases eastward',
        'cell_id': 'x_grid * 90 + y_grid',
        'extent': {
            'lon_min': 113.75, 'lon_max': 114.65,
            'lat_min': 22.44, 'lat_max': 22.87
        }
    },
    '_statistics': {
        'total_cells': int(GRID_ROWS * GRID_COLS),
        'mapped_cells': len(cell_to_district),
        'unmapped_cells': GRID_ROWS * GRID_COLS - len(cell_to_district),
        'districts_with_cells': len(dist_counts),
        'cells_per_district': {
            name: dist_counts.get(name, 0) for name in ALL_DISTRICTS
        }
    },
    'district_names': ALL_DISTRICTS,
    'district_to_id': district_to_id,
    'cell_to_district (sample)': {
        f'({x}, {y})': cell_to_district[(x, y)]
        for x, y in sample_cells
    },
    'district_id_grid': {
        'dtype': 'int8',
        'shape': [GRID_ROWS, GRID_COLS],
        'unmapped_value': -1,
        'sample_mapped': {
            f'[{x}][{y}]': int(district_id_grid[x, y])
            for x, y in sample_cells
        },
        'sample_unmapped': {
            f'[{x}][{y}]': int(district_id_grid[x, y])
            for x, y in sample_unmapped
        }
    },
    'valid_mask': {
        'dtype': 'bool',
        'shape': [GRID_ROWS, GRID_COLS],
        'true_count': int(valid_mask.sum()),
        'description': 'True for cells within Shenzhen, False for ocean/outside'
    },
    'valid_cells': {
        'type': 'set of (x_grid, y_grid) tuples',
        'count': len(valid_cells),
        'sample': [list(c) for c in sample_cells]
    },
    'overlap_pct_grid': {
        'dtype': 'float32',
        'shape': [GRID_ROWS, GRID_COLS],
        'description': 'Percentage of cell area covered by its majority district (0-100)',
        'sample': {
            f'[{x}][{y}]': round(float(overlap_pct_grid[x, y]), 2)
            for x, y in sample_cells
        }
    }
}

with open(OUTPUT_JSON, 'w', encoding='utf-8') as f:
    json.dump(json_data, f, indent=2, ensure_ascii=False)

print(f'Saved: {OUTPUT_JSON}')

Saved: /home/robert/FAMAIL/source_data/grid_to_district_mapping.sample.json


## 7. Verify saved files can be loaded

In [10]:
# Reload and verify pickle
with open(OUTPUT_PKL, 'rb') as f:
    loaded = pickle.load(f)

print('Pickle contents:')
print(f'  Keys: {list(loaded.keys())}')
print(f'  cell_to_district: {len(loaded["cell_to_district"])} entries')
print(f'  valid_cells: {len(loaded["valid_cells"])} cells')
print(f'  district_id_grid: shape {loaded["district_id_grid"].shape}, '
      f'dtype {loaded["district_id_grid"].dtype}')
print(f'  valid_mask: {loaded["valid_mask"].sum()} True cells')
print(f'  district_names: {loaded["district_names"]}')

print(f'\nSample lookups:')
print(f'  (0, 0) -> {loaded["cell_to_district"].get((0, 0), "UNMAPPED")}')
print(f'  (0, 0) in valid_cells: {(0, 0) in loaded["valid_cells"]}')
print(f'  district_id_grid[0, 0] = {loaded["district_id_grid"][0, 0]}')

# Reload and verify JSON (read with the same encoding it was written with)
with open(OUTPUT_JSON, 'r', encoding='utf-8') as f:
    json_loaded = json.load(f)
print(f'\nJSON sample keys: {list(json_loaded.keys())}')

Pickle contents:
  Keys: ['cell_to_district', 'valid_cells', 'district_id_grid', 'valid_mask', 'overlap_pct_grid', 'district_names', 'district_to_id']
  cell_to_district: 2605 entries
  valid_cells: 2605 cells
  district_id_grid: shape (48, 90), dtype int8
  valid_mask: 2605 True cells
  district_names: ["Bao'an", 'Dapeng', 'Futian', 'Guangming', 'Longgang', 'Longhua', 'Luohu', 'Nanshan', 'Pingshan', 'Yantian']

Sample lookups:
  (0, 0) -> Nanshan
  (0, 0) in valid_cells: True
  district_id_grid[0, 0] = 7

JSON sample keys: ['_description', '_grid_convention', '_statistics', 'district_names', 'district_to_id', 'cell_to_district (sample)', 'district_id_grid', 'valid_mask', 'valid_cells', 'overlap_pct_grid']
