## Reducing the dimensionality of the data

In [None]:
import pandas as pd
import numpy as np
import geopandas as gpd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

Whichever dimensionality-reduction method we choose, we still have to make sure that the lower-dimensional variables explain a **significant proportion** of the variance in the original data (perhaps around 90%?).
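One way to check this is to fit PCA with all components and look at the cumulative explained variance. The sketch below uses synthetic data as a stand-in for a real feature table, and the 90% threshold is only the tentative figure from the note above, not a fixed rule.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a feature table (hypothetical data):
# 10 observed features driven by 3 latent factors plus a little noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
X_demo = latent @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(200, 10))

X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Fit every component, then accumulate the explained-variance ratios
pca_full = PCA().fit(X_demo_scaled)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches the threshold
threshold = 0.90  # assumed cut-off
n_keep = int(np.searchsorted(cum_var, threshold) + 1)
print(n_keep, cum_var[:n_keep].round(3))
```

scikit-learn can also make this choice itself: passing a float between 0 and 1, as in `PCA(n_components=0.90, svd_solver='full')`, keeps just enough components to exceed that share of variance.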

So what happens when we run this algorithm?

What happens when we run a regression on the principal components?

We are predicting the dependent variable based on patterns of co-variation in the inputs, rather than individual features. So coefficients are not directly interpretable per variable. Instead, we interpret which group of features each component represents.
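A minimal sketch of what that interpretation looks like in practice, on toy data. The feature names and the KPI here are hypothetical. The regression coefficients attach to components, and the loadings table shows which original features each component groups together.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
feature_names = [f"feat_{i}" for i in range(6)]  # hypothetical feature names
X_toy = pd.DataFrame(rng.normal(size=(300, 6)), columns=feature_names)
X_toy["feat_1"] += X_toy["feat_0"]  # make two features co-vary
y_toy = 2.0 * X_toy["feat_0"] + rng.normal(scale=0.5, size=300)  # hypothetical KPI

X_toy_scaled = StandardScaler().fit_transform(X_toy)
pca_toy = PCA(n_components=2)
scores = pca_toy.fit_transform(X_toy_scaled)

# One coefficient per component, not per original feature
reg = LinearRegression().fit(scores, y_toy)

# Loadings: rows are original features, columns are components
loadings = pd.DataFrame(pca_toy.components_.T, index=feature_names, columns=["pc1", "pc2"])

# Coefficients implied on the standardized features by the component model
implied = pd.Series(loadings.to_numpy() @ reg.coef_, index=feature_names)

print(reg.coef_.round(3))
print(loadings.round(2))
print(implied.round(3))
```

Reading down a column of `loadings` shows which features dominate that component. `implied` maps the component coefficients back onto the standardized features, but only through the retained components, so it is a summary rather than a per-variable estimate.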

### With GWR

We're fitting a local regression model at each H3 cell here, using spatially weighted data from nearby cells. Instead of one global coefficient for each predictor, you get a surface of coefficients, one per H3 cell, showing how relationships vary across space.

So essentially, GWR tells you where each of your variables is more strongly linked to the dependent variable (in this case, the KPI). In that sense, it probably makes sense to keep some explainability within the PCA.
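To make the mechanics concrete, here is a bare-bones NumPy sketch of the idea: one weighted least-squares fit per location, with a Gaussian distance kernel. The function name `gwr_local_coefs`, the fixed bandwidth, and the toy data are all assumptions for illustration. A real analysis would more likely use a dedicated package such as `mgwr`, which also handles bandwidth selection and diagnostics.

```python
import numpy as np


def gwr_local_coefs(coords, X, y, bandwidth):
    """Return one row of [intercept, slopes...] per location.

    Each row comes from a weighted least-squares fit in which observations
    are down-weighted by a Gaussian kernel of their distance to that location.
    """
    n = X.shape[0]
    X1 = np.column_stack([np.ones(n), X])  # prepend an intercept column
    coefs = np.empty((n, X1.shape[1]))
    for i in range(n):
        dist = np.linalg.norm(coords - coords[i], axis=1)
        sqrt_w = np.sqrt(np.exp(-0.5 * (dist / bandwidth) ** 2))
        # Scaling rows by sqrt(weight) turns weighted LS into ordinary LS
        beta, *_ = np.linalg.lstsq(X1 * sqrt_w[:, None], y * sqrt_w, rcond=None)
        coefs[i] = beta
    return coefs


# Toy data (hypothetical): the effect of x on y grows from west to east
rng = np.random.default_rng(42)
coords = rng.uniform(0, 1, size=(400, 2))  # stand-ins for cell centroids
x = rng.normal(size=400)
true_slope = 1.0 + 2.0 * coords[:, 0]
y = true_slope * x + rng.normal(scale=0.1, size=400)

local = gwr_local_coefs(coords, x.reshape(-1, 1), y, bandwidth=0.15)

# local[:, 1] is the coefficient surface for x, one value per location
print(np.corrcoef(true_slope, local[:, 1])[0, 1])
```

Mapping `local[:, 1]` back onto the cells is what gives the coefficient surface. The bandwidth controls how local each fit is: smaller values follow spatial variation more closely but give noisier estimates.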

### So... the framework

1. Join all H3 datasets → one large GeoDataFrame.
2. Standardize / normalize features.
3. Apply PCA to reduce dimensionality:
    * per thematic block (POIs, buildings, greenness, etc.)
4. Fit regression models:
    * Global OLS or spatial lag/error regression → for baseline relationships.
    * GWR → to see where relationships vary spatially.
5. Map local coefficients → interpret which urban features (or principal components) are locally more important.
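
Steps 2 to 4 can be sketched end to end on toy tables. Everything here is illustrative: `pca_block` is a hypothetical helper, the two thematic blocks and the KPI are random data, and only the global OLS baseline is fitted (no spatial lag/error model and no GWR).

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler


def pca_block(df, prefix, n_components, id_col="h3_id"):
    """Standardize one thematic table and return its components per cell."""
    X = df.drop(columns=[id_col])
    scores = PCA(n_components=n_components).fit_transform(
        StandardScaler().fit_transform(X)
    )
    cols = [f"{prefix}_pc{i + 1}" for i in range(scores.shape[1])]
    return pd.DataFrame(scores, columns=cols).assign(**{id_col: df[id_col].values})


# Toy stand-ins for two thematic blocks and a KPI table (all hypothetical)
rng = np.random.default_rng(0)
ids = [f"cell_{i}" for i in range(100)]
poi_df = pd.DataFrame(
    rng.normal(size=(100, 5)), columns=[f"poi_{i}" for i in range(5)]
).assign(h3_id=ids)
green_df = pd.DataFrame(
    rng.normal(size=(100, 4)), columns=[f"green_{i}" for i in range(4)]
).assign(h3_id=ids)
kpi_df = pd.DataFrame({"h3_id": ids, "kpi": rng.normal(size=100)})

# Steps 2-3: scale and reduce each block separately, then join on the cell id
features = pca_block(poi_df, "poi", 2).merge(pca_block(green_df, "green", 2), on="h3_id")
model_df = features.merge(kpi_df, on="h3_id")

# Step 4 (baseline only): global OLS on the block-wise components
pc_cols = [c for c in model_df.columns if "_pc" in c]
ols = LinearRegression().fit(model_df[pc_cols], model_df["kpi"])
print(dict(zip(pc_cols, ols.coef_.round(3))))
```

Because each block is reduced on its own, a coefficient on `poi_pc1` can be read against POI features only, which keeps the components conceptually separate.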


## It's probably best to run PCA on each dataset individually so that each set of components is conceptually coherent

In [None]:


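# Suppose you have one thematic table, buildings_df, with an 'h3_id' column
# and numeric feature columns.
X = buildings_df.drop(columns=['h3_id'])
X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale

n_components = 3
pca = PCA(n_components=n_components)
pca_components = pca.fit_transform(X_scaled)

# Name the component columns and re-attach the cell ids (row order is preserved)
buildings_pca = (
    pd.DataFrame(
        pca_components,
        columns=[f'build_pc{i+1}' for i in range(n_components)],
    )
    .assign(h3_id=buildings_df['h3_id'].values)
)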
