Unsupervised Learning for Submarket Modeling


This research was originally conducted for the master's thesis "Unsupervised Learning for Submarket: A Proxy for Neighborhood Change" at Columbia University in 2019. It is currently being revised to emphasize its main contributions to real estate and housing economics. The current version offers alternative applications for determining capitalization rates for commercial real estate valuation, rather than approximating an empirical argument on the notion of neighborhood change. The pending edits are temporarily kept private; the analyses, implemented in R, will be available here in the future.


This study focused on submarket modeling with unsupervised learning and geographic information system fundamentals to predict the number and geography of submarkets at the neighborhood scale. A Spatially Constrained Weighted-Multivariate Hierarchical Clustering algorithm was trained to identify the optimal number of Multifamily Residential Commercial Real Estate submarkets in Manhattan, New York.

The methodology explored Non-Negative Matrix Factorization for predicting the annual normalized values of every Multifamily Residential Commercial Real Estate property in Manhattan that was transacted between 2004 and 2018, inclusive. Several extensive data transformations were applied prior to model fitting. A novel conditional random sampling technique was introduced to split the sparse matrices into training and test sets for validating the prediction results. The study used a series of optimization techniques, including Leave-One-Out Cross-Validation, to estimate the optimal low-rank matrix. The results from Non-Negative Matrix Factorization were compared with other imputation methods for sparse matrices, including Simon Funk’s Singular Value Decomposition.

Both the observed and imputed values were then clustered on a weighted basis with the Spatially Constrained Weighted-Multivariate Hierarchical Clustering algorithm, using five different Agglomerative Hierarchical Clustering linkage methods: Average Linkage, Median Linkage, Centroid Linkage, Complete Linkage, and Ward’s Method. Ward’s Method was found to be the superior linkage method for determining the optimal number of submarkets, as measured by the maximum absolute difference between the mean intra-cluster similarity and the maximum inter-cluster dissimilarity. The clustering results indicated that the optimal number of Multifamily Residential Commercial Real Estate submarkets in Manhattan from 2004 to 2018 was 43.
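
The imputation and validation steps described above can be illustrated with a minimal sketch. It is not the thesis implementation (which is in R and not yet published): it assumes a small hypothetical property-by-year matrix, fits a masked Non-Negative Matrix Factorization with standard multiplicative updates on the training cells only, and compares candidate ranks by held-out error. The function names (`split_observed`, `masked_nmf`, `masked_rmse`), the toy data, and the rule that a cell may be held out only if its row and column each keep at least one training cell are all assumptions made for illustration; the source does not specify its conditional sampling rule, and a simple hold-out stands in here for Leave-One-Out Cross-Validation.

```python
import math
import random

def split_observed(observed, test_frac=0.2, seed=0):
    """Move a random share of observed cells into a test mask, but only when
    the cell's row and column each keep >= 1 training cell (assumed rule)."""
    rng = random.Random(seed)
    n, m = len(observed), len(observed[0])
    train = [row[:] for row in observed]
    test = [[False] * m for _ in range(n)]
    cells = [(i, j) for i in range(n) for j in range(m) if observed[i][j]]
    rng.shuffle(cells)
    target, moved = int(test_frac * len(cells)), 0
    for i, j in cells:
        if moved >= target:
            break
        row_left = sum(train[i]) - 1
        col_left = sum(train[r][j] for r in range(n)) - 1
        if row_left >= 1 and col_left >= 1:
            train[i][j], test[i][j] = False, True
            moved += 1
    return train, test

def masked_nmf(V, mask, rank, n_iter=300, seed=0, eps=1e-9):
    """Fit V ~ W H with non-negative W, H, using only cells where mask is True."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]

    def reconstruct():
        return [[sum(W[i][k] * H[k][j] for k in range(rank)) for j in range(m)]
                for i in range(n)]

    for _ in range(n_iter):
        R = reconstruct()  # held fixed while every entry of W is updated
        for i in range(n):
            for k in range(rank):
                num = sum(V[i][j] * H[k][j] for j in range(m) if mask[i][j])
                den = sum(R[i][j] * H[k][j] for j in range(m) if mask[i][j])
                W[i][k] *= num / (den + eps)
        R = reconstruct()  # refreshed with the new W before updating H
        for k in range(rank):
            for j in range(m):
                num = sum(W[i][k] * V[i][j] for i in range(n) if mask[i][j])
                den = sum(W[i][k] * R[i][j] for i in range(n) if mask[i][j])
                H[k][j] *= num / (den + eps)
    return W, H, reconstruct()

def masked_rmse(V, R, mask):
    """Root mean squared error over the cells selected by mask."""
    errs = [(V[i][j] - R[i][j]) ** 2
            for i in range(len(V)) for j in range(len(V[0])) if mask[i][j]]
    return math.sqrt(sum(errs) / len(errs)) if errs else float("nan")

# Hypothetical toy data: 6 properties x 5 years, generated from 2 latent factors.
W_true = [[1, 0], [2, 1], [0, 3], [1, 1], [3, 0], [0, 2]]
H_true = [[1.0, 1.1, 1.2, 1.3, 1.4], [2.0, 1.8, 1.6, 1.4, 1.2]]
V = [[sum(W_true[i][k] * H_true[k][j] for k in range(2)) for j in range(5)]
     for i in range(6)]
# Cells where no value was recorded (e.g. no transaction that year).
observed = [[(i + 2 * j) % 4 != 0 for j in range(5)] for i in range(6)]

train_mask, test_mask = split_observed(observed, test_frac=0.2, seed=1)
for rank in (1, 2, 3):
    _, _, R_hat = masked_nmf(V, train_mask, rank)
    print(rank, round(masked_rmse(V, R_hat, test_mask), 4))
```

The rank with the lowest held-out error would be the candidate for the low-rank approximation, and the reconstruction `R_hat` supplies imputed values for the unobserved cells. A pure-Python loop is used only to keep the sketch dependency-free; a real analysis of this size would need a vectorized implementation.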
The final results were mapped by spatially joining the submarket identifications to the intersecting land lot polygons. The study found that, in several cases, multiple submarkets were clearly present within a single neighborhood boundary or zip code. It introduced a new method for determining capitalization rates for commercial real estate valuations using submarkets identified with unsupervised learning, rather than relying on areal unit aggregation.
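
To make the capitalization rate application concrete, here is a small sketch of how rates might be summarized per submarket once each property carries a submarket identifier. It relies only on the standard definition, capitalization rate = net operating income / sale price, and on direct capitalization, value = net operating income / capitalization rate. The records, field names, submarket IDs, and the choice of the median as the summary statistic are hypothetical and are not taken from the thesis.

```python
from collections import defaultdict
from statistics import median

# Hypothetical sales records: (submarket_id, net operating income, sale price).
sales = [
    (7, 50_000, 1_000_000),
    (7, 66_000, 1_100_000),
    (12, 40_000, 1_000_000),
]

def cap_rates_by_submarket(records):
    """Group observed cap rates (NOI / price) by submarket and take the median."""
    grouped = defaultdict(list)
    for submarket_id, noi, price in records:
        grouped[submarket_id].append(noi / price)
    return {sid: median(rates) for sid, rates in grouped.items()}

def direct_cap_value(noi, cap_rate):
    """Direct capitalization: estimated value = NOI / cap rate."""
    return noi / cap_rate

rates = cap_rates_by_submarket(sales)
# Value a hypothetical property in submarket 12 with an NOI of 80,000.
estimate = direct_cap_value(80_000, rates[12])
print(rates, estimate)
```

Grouping by a learned submarket ID rather than by zip code or neighborhood is the only change relative to a conventional areal aggregation; the arithmetic itself is the same.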


Applied Spatially Constrained Multivariate Clustering



