## Reducing the dimensionality of the data

In [25]:
import pandas as pd
import numpy as np
import geopandas as gpd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

Whichever dimensionality reduction method we choose, we still have to make sure that the lower-dimensional variables explain a **significant proportion** of the variance in the original data (perhaps 90%?).
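
One way to check this is sketched below on synthetic data: fit a full PCA, accumulate `explained_variance_ratio_`, and take the smallest number of components that reaches the target. The 0.90 threshold and the helper name `n_components_for` are assumptions for illustration, not settled choices.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def n_components_for(X, target=0.90):
    """Smallest number of principal components whose cumulative
    explained variance ratio reaches `target` (hypothetical helper)."""
    X_scaled = StandardScaler().fit_transform(X)
    pca = PCA().fit(X_scaled)  # keep all components
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # First index where the running total is >= target, converted to a count
    k = int(np.searchsorted(cumulative, target) + 1)
    # Guard against a floating-point shortfall when target is close to 1.0
    return min(k, len(cumulative)), cumulative


# Synthetic stand-in: 10 correlated features driven by 2 latent factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
X_demo = latent @ rng.normal(size=(2, 10)) + 0.1 * rng.normal(size=(500, 10))

k, cumulative = n_components_for(X_demo, target=0.90)
print(k, cumulative[:k].round(3))
```

scikit-learn also accepts a float directly: `PCA(n_components=0.90)` keeps just enough components to pass that share of variance, which may be simpler once the threshold is agreed.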

So what happens when we run a regression on the principal components?

We are predicting the dependent variable based on patterns of co-variation in the inputs, rather than individual features. So coefficients are not directly interpretable per variable. Instead, we interpret which group of features each component represents.
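
A small sketch of what that interpretation looks like in practice, using synthetic data. The column names (`pct_park`, `ndvi_mean`, and so on) and the KPI are made up for illustration. The regression is fitted on the component scores, and the loadings in `pca.components_` tell us which original features each component stands for.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 300
# Two hypothetical feature groups that co-vary internally
green = rng.normal(size=n)
built = rng.normal(size=n)
X = pd.DataFrame({
    "pct_park": green + 0.1 * rng.normal(size=n),
    "ndvi_mean": green + 0.1 * rng.normal(size=n),
    "pct_residential": built + 0.1 * rng.normal(size=n),
    "pct_commercial": built + 0.1 * rng.normal(size=n),
})
y = 2.0 * green - 1.0 * built + 0.1 * rng.normal(size=n)  # synthetic KPI

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
scores = pca.fit_transform(X_scaled)

# Regress the KPI on component scores, not on the raw features
model = LinearRegression().fit(scores, y)

# Loadings: rows are components, columns are the original features
loadings = pd.DataFrame(pca.components_, columns=X.columns, index=["pc1", "pc2"])
print(loadings.round(2))
print(dict(zip(loadings.index, model.coef_.round(2))))
```

Each regression coefficient then reads as "the effect of moving along this component", and the loadings row says which features move together along it.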

### With GWR

We're fitting a local regression model at each H3 cell here, using spatially weighted data from nearby cells. Instead of one global coefficient for each predictor, you get a surface of coefficients, one per H3 cell, showing how relationships vary across space.

So essentially, GWR tells you where each of your variables is more strongly linked to the dependent variable (in this case, the KPI). In that sense, it makes some sense to keep a degree of explainability within the PCA.
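
To make the idea concrete, here is a hand-rolled sketch of the core GWR step in plain NumPy: a weighted least-squares fit at every location, with weights that decay with distance. It assumes a Gaussian kernel, a fixed bandwidth and Euclidean distances on projected coordinates. The function name `gwr_coefficients` is hypothetical, and in practice a dedicated library such as `mgwr` would handle bandwidth selection and diagnostics.

```python
import numpy as np


def gwr_coefficients(coords, X, y, bandwidth):
    """Local weighted least squares at every location (hand-rolled sketch).

    coords    : (n, 2) array of cell centroids in a projected CRS
    X         : (n, p) predictor matrix without an intercept column
    y         : (n,) dependent variable
    bandwidth : Gaussian kernel bandwidth in the units of `coords`
    Returns an (n, p + 1) array: intercept first, then one slope per predictor.
    """
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    betas = np.empty((n, design.shape[1]))
    for i in range(n):
        dist = np.linalg.norm(coords - coords[i], axis=1)
        w = np.exp(-0.5 * (dist / bandwidth) ** 2)  # nearby cells weigh more
        sw = np.sqrt(w)[:, None]
        # Weighted least squares via ordinary lstsq on sqrt-weighted rows
        betas[i], *_ = np.linalg.lstsq(design * sw, y * sw[:, 0], rcond=None)
    return betas


# Synthetic example: the slope grows from west (0) to east (1)
rng = np.random.default_rng(2)
coords = rng.uniform(0, 10, size=(400, 2))
x = rng.normal(size=400)
true_slope = coords[:, 0] / 10
y = true_slope * x + 0.05 * rng.normal(size=400)

betas = gwr_coefficients(coords, x[:, None], y, bandwidth=1.5)
west = betas[coords[:, 0] < 3, 1].mean()
east = betas[coords[:, 0] > 7, 1].mean()
print(round(west, 2), round(east, 2))
```

The local slopes recover the west-to-east gradient that a single global coefficient would average away.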

> Lasso regression or scikit feature selection for clustering the variables
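
A quick sketch of the Lasso option on synthetic data. Note that Lasso selects variables rather than grouping them: the L1 penalty shrinks the coefficients of unhelpful features to exactly zero. The data, the two "true" features and the cut-off `1e-6` are all assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 20))  # 20 candidate features
# Only features 0 and 5 actually drive the synthetic target
y = 3.0 * X[:, 0] - 2.0 * X[:, 5] + 0.5 * rng.normal(size=400)

X_scaled = StandardScaler().fit_transform(X)
# Cross-validation picks the penalty strength
lasso = LassoCV(cv=5, random_state=0).fit(X_scaled, y)

# Indices of the features whose coefficient survived the L1 penalty
selected = np.flatnonzero(np.abs(lasso.coef_) > 1e-6)
print(selected)
```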

### So... the framework

1. Join all H3 datasets → one large GeoDataFrame.
2. Standardize / normalize features.
3. Apply PCA to reduce dimensionality:
    * per thematic block (POIs, buildings, greenness, etc.)
4. Fit regression models:
    * Global OLS or spatial lag/error regression → for baseline relationships.
    * GWR → to see where relationships vary spatially.
5. Map local coefficients → interpret which urban features (or principal components) are locally more important.
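
Steps 2 and 3 above could be sketched roughly as follows. The helper `pca_per_block` is hypothetical, it assumes every thematic table is indexed by the H3 cell id, and the two tables here are random stand-ins for the real data.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def pca_per_block(blocks, n_components=2):
    """Standardize and PCA each thematic block separately, then join on the index.

    blocks : dict mapping a block name to a DataFrame indexed by H3 cell id
    """
    parts = []
    for name, df in blocks.items():
        k = min(n_components, df.shape[1])  # never ask for more PCs than columns
        scaled = StandardScaler().fit_transform(df)
        scores = PCA(n_components=k).fit_transform(scaled)
        cols = [f"{name}_pc{i + 1}" for i in range(k)]
        parts.append(pd.DataFrame(scores, index=df.index, columns=cols))
    # Inner join keeps only the cells present in every block
    return pd.concat(parts, axis=1, join="inner")


# Synthetic stand-ins for two thematic tables sharing the same cells
rng = np.random.default_rng(4)
cells = [f"cell_{i}" for i in range(100)]
blocks = {
    "landuse": pd.DataFrame(rng.random((100, 5)), index=cells,
                            columns=[f"lu_{j}" for j in range(5)]),
    "build": pd.DataFrame(rng.random((100, 4)), index=cells,
                          columns=[f"b_{j}" for j in range(4)]),
}
features = pca_per_block(blocks, n_components=2)
print(features.columns.tolist())
# ['landuse_pc1', 'landuse_pc2', 'build_pc1', 'build_pc2']
```

Keeping the block name in each column makes the later coefficient maps easier to read, since every component can be traced back to its theme.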


## It's probably best to PCA the datasets individually so the PCA components are conceptually coherent

In [26]:
# Add the repo's src/ directory to the import path so `config` can be imported
from pathlib import Path
import sys

repo_root = Path.cwd() if (Path.cwd() / "src").exists() else Path.cwd().parent
sys.path.append(str(repo_root / "src"))

from config import PROCESSED_DATA_DIR
import geopandas as gpd

gdf = gpd.read_parquet(PROCESSED_DATA_DIR / "barcelona_h3_res10_landuse_aggregated.parquet")

In [27]:
#read in datasets individually
landuse = gpd.read_parquet('../data/processed/barcelona_h3_res10_landuse_aggregated.parquet')
ndvi = gpd.read_parquet('../data/processed/barcelona_h3_res10_ndvi_aggregated.parquet')
pois = gpd.read_parquet('../data/processed/barcelona_h3_res10_poi.parquet')
buildings = gpd.read_parquet('../data/processed/barcelona_h3_res10_building_aggregated.parquet')

In [None]:
print(landuse.columns.shape) # we only want to use the pct area columns here
print(ndvi.columns.shape) # I'm assuming here that these dummy columns refer to % of clustered category in each h3 cell.
print(pois.columns.shape)
print(buildings.columns.shape) # we only want to use the pct area columns here

(87,)
(8,)
(359,)
(28,)


In [None]:
# Keep only the columns that start with pct_area_ for landuse and buildings
landuse_features = [col for col in landuse.columns if col.startswith('pct_area_')]
landuse_df = landuse[landuse_features]
print(f"new landuse cols = {landuse_df.shape[1]}")
buildings_features = [col for col in buildings.columns if col.startswith('pct_area_')]
buildings_df = buildings[buildings_features]
print(f"new buildings cols = {buildings_df.shape[1]}")

new landuse cols = 17
new buildings cols = 13


> Over 400 columns after joining, heavily weighted towards POIs. NDVI is *probably* fine, but the others will need some sort of clustering.

#### Why don't we cluster them together?

If we mix the datasets, we lose structure and interpretability. We'd be clustering POI counts together with land-use and building-type shares in heterogeneous clusters, creating garbage clusters that mean very little.

In [30]:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer


# --------------------------------------------------------------
# Clean column names before embedding
# --------------------------------------------------------------
def clean_column_name(col):
    # Remove prefixes and suffixes common in this dataset. Longest patterns are
    # applied first so "avg_polygon_area_" is stripped whole before the shorter
    # "area_" or "avg_" can break it up.
    replacements = [
        "num_polygons_", "num_buildings_", "pct_area_", "area_",
        "avg_polygon_area_", "avg_", "_mean", "_count", "_total"
    ]

    clean = col
    for r in sorted(replacements, key=len, reverse=True):
        clean = clean.replace(r, "")
    clean = clean.replace("_", " ")

    return clean.strip()


# --------------------------------------------------------------
# Auto-generate readable cluster names
# --------------------------------------------------------------
def generate_cluster_name(columns):
    # Extract keywords from cleaned names
    cleaned_cols = [clean_column_name(c) for c in columns]

    vectorizer = CountVectorizer(stop_words="english")
    X = vectorizer.fit_transform(cleaned_cols)

    freqs = np.asarray(X.sum(axis=0)).ravel()
    vocab = np.array(vectorizer.get_feature_names_out())

    if len(vocab) == 0:
        return "misc"

    top_keywords = vocab[np.argsort(freqs)[-3:]]  # top 3 words
    return "_".join(top_keywords)


# --------------------------------------------------------------
# Main function: semantic clustering + aggregation
# --------------------------------------------------------------
def reduce_semantic_features(df, prefix="", n_clusters=4, model_name="all-MiniLM-L6-v2"):
    """
    df          : pandas DataFrame with your landuse / POIs / buildings
    prefix      : filter columns by prefix ("" means all numeric)
    n_clusters  : number of semantic clusters to produce
    model_name  : sentence-transformer model
    """
    print(f"Loading embedding model: {model_name} ...")
    model = SentenceTransformer(model_name)

    # ---------------------------
    # 1. Select only numeric columns
    # ---------------------------
    numeric_cols = [
        c for c in df.columns
        if c.startswith(prefix)
        and pd.api.types.is_numeric_dtype(df[c])
    ]

    print(f"Found {len(numeric_cols)} numeric columns starting with '{prefix}'.")

    if len(numeric_cols) == 0:
        raise ValueError("No numeric columns matched your prefix.")

    # ---------------------------
    # 2. Clean names for embedding
    # ---------------------------
    cleaned_names = [clean_column_name(c) for c in numeric_cols]

    # ---------------------------
    # 3. Embed column names
    # ---------------------------
    print("Embedding column names...")
    embeddings = model.encode(cleaned_names, show_progress_bar=True)

    # ---------------------------
    # 4. Cluster embeddings
    # ---------------------------
    print(f"Clustering into {n_clusters} semantic groups...")
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    labels = kmeans.fit_predict(embeddings)

    # Build cluster -> columns mapping
    clusters = {i: [] for i in range(n_clusters)}
    for col, label in zip(numeric_cols, labels):
        clusters[label].append(col)

    # ---------------------------
    # 5. Auto name clusters
    # ---------------------------
    cluster_names = {}
    for cluster_id, cols in clusters.items():
        cluster_names[cluster_id] = generate_cluster_name(cols)

    # ---------------------------
    # 6. Aggregate features per cluster
    # ---------------------------
    out_df = pd.DataFrame(index=df.index)

    for cluster_id, col_list in clusters.items():
        readable = cluster_names[cluster_id]
        new_col = f"{prefix}cluster_{cluster_id}_{readable}"
        out_df[new_col] = df[col_list].sum(axis=1)

    # ---------------------------
    # 7. Display summary
    # ---------------------------
    print("\nGenerated the following aggregated features:")
    for cid, name in cluster_names.items():
        print(f"  - Cluster {cid}: {name}")
        print(f"    Example columns: {clusters[cid][:4]} ... ({len(clusters[cid])} total)\n")

    return out_df, clusters, cluster_names


# --------------------------------------------------------------
# Optional quick test (comment out when importing)
# --------------------------------------------------------------
if __name__ == "__main__":
    print("Module loaded. Use reduce_semantic_features(df, prefix, n_clusters).")


Module loaded. Use reduce_semantic_features(df, prefix, n_clusters).


In [43]:
landuse_reduced, landuse_clusters, landuse_names = reduce_semantic_features(landuse_df, prefix="", n_clusters=4)

Loading embedding model: all-MiniLM-L6-v2 ...
Found 17 numeric columns starting with ''.
Embedding column names...


Batches: 100%|██████████| 1/1 [00:00<00:00,  3.73it/s]

Clustering into 4 semantic groups...

Generated the following aggregated features:
  - Cluster 0: golf
    Example columns: ['pct_area_golf'] ... (1 total)

  - Cluster 1: military_religious_transportation
    Example columns: ['pct_area_agriculture', 'pct_area_education', 'pct_area_entertainment', 'pct_area_managed'] ... (8 total)

  - Cluster 2: developed_horticulture_residential
    Example columns: ['pct_area_construction', 'pct_area_developed', 'pct_area_horticulture', 'pct_area_residential'] ... (4 total)

  - Cluster 3: park_pedestrian_recreation
    Example columns: ['pct_area_cemetery', 'pct_area_park', 'pct_area_pedestrian', 'pct_area_recreation'] ... (4 total)






In [44]:
buildings_reduced, buildings_clusters, buildings_names = reduce_semantic_features(buildings_df, prefix="", n_clusters=4)

Loading embedding model: all-MiniLM-L6-v2 ...
Found 13 numeric columns starting with ''.
Embedding column names...


Batches: 100%|██████████| 1/1 [00:00<00:00,  1.84it/s]

Clustering into 4 semantic groups...

Generated the following aggregated features:
  - Cluster 0: entertainment_medical_religious
    Example columns: ['pct_area_agricultural', 'pct_area_commercial', 'pct_area_education', 'pct_area_entertainment'] ... (6 total)

  - Cluster 1: military_service_transportation
    Example columns: ['pct_area_industrial', 'pct_area_military', 'pct_area_service', 'pct_area_transportation'] ... (4 total)

  - Cluster 2: civic_residential
    Example columns: ['pct_area_civic', 'pct_area_residential'] ... (2 total)

  - Cluster 3: outbuilding
    Example columns: ['pct_area_outbuilding'] ... (1 total)






In [45]:
pois_reduced, pois_clusters, pois_names = reduce_semantic_features(pois, prefix="", n_clusters=4)

Loading embedding model: all-MiniLM-L6-v2 ...
Found 357 numeric columns starting with ''.
Embedding column names...


Batches: 100%|██████████| 12/12 [00:01<00:00, 10.21it/s]

Clustering into 4 semantic groups...

Generated the following aggregated features:
  - Cluster 0: salon_shop_store
    Example columns: ['antique_shop', 'art_craft_hobby_store', 'auto_body_shop', 'b2b_supplier_distributor'] ... (58 total)

  - Cluster 1: utility_rental_service
    Example columns: ['accountant_or_bookkeeper', 'adoption_service', 'agricultural_service', 'air_transport_facility_service'] ... (108 total)

  - Cluster 2: care_animal_school
    Example columns: ['allergy_and_immunology', 'alternative_medicine', 'animal_boarding', 'animal_hospital'] ... (67 total)

  - Cluster 3: place_station_sport
    Example columns: ['accommodation', 'agricultural_area', 'airport', 'airport_terminal'] ... (124 total)






In [None]:


# Suppose you have one thematic table.
# buildings_df already holds only the pct_area_ columns, so there is no h3_id to drop.
X = buildings_df
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=3)
pca_components = pca.fit_transform(X_scaled)

buildings_pca = (
    pd.DataFrame(pca_components, columns=[f'build_pc{i+1}' for i in range(3)])
    # Assumes the original `buildings` table carries the cell id in an `h3_id` column
    .assign(h3_id=buildings['h3_id'].values)
)
