# Estimating home prices with graph-guided regularization

This example is from [Hallac, Leskovec and Boyd, "Network Lasso: Clustering and Optimization in Large Graphs" (2015)](http://web.stanford.edu/~hallac/Network_Lasso.pdf).

In [4]:
import cvxpy as cp
import epopt as ep
import numpy as np
import scipy.sparse as sp
import pandas as pd

## Real estate data and linear model 

The data is a list of real estate transactions over a one-week period in May 2008:

In [9]:
housing = pd.read_csv("housing.csv")
housing

| Unnamed: 0 | street | city | zip | state | beds | baths | sq__ft | type | sale_date | price | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3526 HIGH ST | SACRAMENTO | 95838 | CA | 2 | 1 | 836 | Residential | Wed May 21 00:00:00 EDT 2008 | 59222 | 38.631913 | -121.434879 |
| 1 | 51 OMAHA CT | SACRAMENTO | 95823 | CA | 3 | 1 | 1167 | Residential | Wed May 21 00:00:00 EDT 2008 | 68212 | 38.478902 | -121.431028 |
| 2 | 2796 BRANCH ST | SACRAMENTO | 95815 | CA | 2 | 1 | 796 | Residential | Wed May 21 00:00:00 EDT 2008 | 68880 | 38.618305 | -121.443839 |
| 3 | 2805 JANETTE WAY | SACRAMENTO | 95815 | CA | 2 | 1 | 852 | Residential | Wed May 21 00:00:00 EDT 2008 | 69307 | 38.616835 | -121.439146 |
| 4 | 6001 MCMAHON DR | SACRAMENTO | 95824 | CA | 2 | 1 | 797 | Residential | Wed May 21 00:00:00 EDT 2008 | 81900 | 38.519470 | -121.435768 |
| 5 | 5828 PEPPERMILL CT | SACRAMENTO | 95841 | CA | 3 | 1 | 1122 | Condo | Wed May 21 00:00:00 EDT 2008 | 89921 | 38.662595 | -121.327813 |
| 6 | 6048 OGDEN NASH WAY | SACRAMENTO | 95842 | CA | 3 | 2 | 1104 | Residential | Wed May 21 00:00:00 EDT 2008 | 90895 | 38.681659 | -121.351705 |
| 7 | 2561 19TH AVE | SACRAMENTO | 95820 | CA | 3 | 1 | 1177 | Residential | Wed May 21 00:00:00 EDT 2008 | 91002 | 38.535092 | -121.481367 |
| 8 | 11150 TRINITY RIVER DR Unit 114 | RANCHO CORDOVA | 95670 | CA | 2 | 2 | 941 | Condo | Wed May 21 00:00:00 EDT 2008 | 94905 | 38.621188 | -121.270555 |
| 9 | 7325 10TH ST | RIO LINDA | 95673 | CA | 3 | 2 | 1146 | Residential | Wed May 21 00:00:00 EDT 2008 | 98937 | 38.700909 | -121.442979 |


Our goal will be to build a model that estimates home price given other factors describing the home. We start with a basic linear model.

In [27]:
def mse(x, y):
    return np.mean((x - y)**2)
    
X = housing[["beds", "baths", "sq__ft"]]
y = housing["price"]
    
from sklearn import linear_model
lr = linear_model.RidgeCV()
lr.fit(X, y)

print "normalized MSE:", mse(lr.predict(X), y) / np.var(y)

normalized MSE: 0.819533884819
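
To read this number, it helps to recall that the normalized MSE equals one minus the coefficient of determination $R^2$ when both are computed on the same data, so a value near 0.82 means the model accounts for only about 18% of the variance in price. The following is a minimal sketch of that identity on synthetic data; the arrays `A` and `y_toy` are made up for illustration and are not drawn from `housing.csv`, and plain least squares stands in for `RidgeCV`.

```python
import numpy as np

def mse(x, y):
    return np.mean((x - y) ** 2)

# Hypothetical synthetic data: two features plus a constant column.
rng = np.random.RandomState(0)
A = np.hstack((rng.randn(50, 2), np.ones((50, 1))))
y_toy = A.dot(np.array([2.0, -1.0, 0.5])) + rng.randn(50)

# Ordinary least squares via the normal equations.
coef = np.linalg.solve(A.T.dot(A), A.T.dot(y_toy))
pred = A.dot(coef)

# Normalized MSE, as in the cell above, and R^2 from its definition.
nmse = mse(pred, y_toy) / np.var(y_toy)
r2 = 1.0 - np.sum((y_toy - pred) ** 2) / np.sum((y_toy - np.mean(y_toy)) ** 2)

# Predicting the mean everywhere gives a normalized MSE of 1 (up to rounding).
baseline = mse(np.full_like(y_toy, np.mean(y_toy)), y_toy) / np.var(y_toy)

print("normalized MSE: %.4f, 1 - R^2: %.4f, baseline: %.4f" % (nmse, 1 - r2, baseline))
```

A normalized MSE of 1 is therefore the score of a model that ignores the features entirely, which is the reference point for the results that follow.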


## Linear model with graph-guided regularization

In the basic linear regression above, we fit a single set of weights $\theta \in \mathbb{R}^4$ (three coefficients plus an intercept) for the entire dataset. Now we let each home have its own set of weights, but penalize differences between the weights of neighboring homes so that nearby homes tend to share the same model. In particular, we solve the problem
$$
\text{minimize} \;\; \sum_{i \in \mathcal{N}} \left((\theta_i^Tx_i - y_i)^2 + \mu\|\theta_i\|_2^2\right) + \lambda \sum_{(j,k) \in \mathcal{E}} w_{jk}\|\theta_j - \theta_k\|_2
$$
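where $\theta_i \in \mathbb{R}^4$ are the model weights for home $i$, $x_i$ are its covariates (beds, baths, sq__ft, 1), and $y_i$ is its price. The set $\mathcal{N}$ indexes the homes, $\mathcal{E}$ is the set of edges in a graph connecting neighboring homes, and $w_{jk}$ is the weight on edge $(j,k)$. The parameter $\mu > 0$ adds a small amount of standard $\ell_2$ regularization on the weights to keep them from getting too large, while $\lambda$ controls how strongly the weights of neighboring homes are pulled together. In the code below, the $\ell_2$ penalty is applied only to the first three weights, leaving the intercept unpenalized.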
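
To make the three terms concrete, here is a small sketch that evaluates the objective, exactly as the formula is written, on a hypothetical graph of three homes and two edges. The function name `network_lasso_objective` and all toy values are assumptions made for illustration; the edge dictionary mirrors the `(j, k) -> weight` layout used for the graph later in this example.

```python
import numpy as np

def network_lasso_objective(Theta, X, y, edges, mu, lam):
    """Evaluate the objective for per-home weights Theta (N x p)."""
    residuals = np.sum(Theta * X, axis=1) - y       # theta_i^T x_i - y_i
    fit = np.sum(residuals ** 2)                    # squared prediction error
    ridge = mu * np.sum(Theta ** 2)                 # mu * sum_i ||theta_i||_2^2
    coupling = lam * sum(w * np.linalg.norm(Theta[j] - Theta[k])
                         for (j, k), w in edges.items())
    return fit + ridge + coupling

# Hypothetical toy problem: 3 homes, 4 covariates each (last one is the constant 1).
X_toy = np.array([[1.0, 0.0, 2.0, 1.0],
                  [0.0, 1.0, 1.0, 1.0],
                  [1.0, 1.0, 0.0, 1.0]])
y_toy = np.array([1.0, 2.0, 3.0])
edges_toy = {(0, 1): 1.0, (1, 2): 0.5}

# Case 1: all homes share one weight vector, so the coupling term is zero.
shared = np.tile(np.array([0.5, 0.5, 0.0, 1.0]), (3, 1))
obj_shared = network_lasso_objective(shared, X_toy, y_toy, edges_toy, mu=0.5, lam=5.0)

# Case 2: home 2 gets its own intercept, which fits it exactly but now
# pays a penalty on the edge (1, 2).
split = shared.copy()
split[2, 3] = 2.0
obj_split = network_lasso_objective(split, X_toy, y_toy, edges_toy, mu=0.5, lam=5.0)

print(obj_shared, obj_split)
```

In this toy case, letting home 2 deviate lowers the fit term but raises the total objective, because the edge penalty and the ridge term outweigh the gain. With a smaller $\lambda$ the trade-off can go the other way, which is how the penalty groups homes into clusters that share weights.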

The graph is constructed by connecting each home to its five nearest neighbors, with edge weights that decrease with distance:

In [92]:
from sklearn import neighbors

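# Pairwise great-circle distances between homes.
# Note: the haversine metric expects [latitude, longitude] in radians;
# the columns are passed here as stored in the CSV.
haversine = neighbors.DistanceMetric.get_metric("haversine")
dist = haversine.pairwise(housing[["latitude", "longitude"]])

# Map each undirected edge (i, j), i < j, to its weight
weights = {}
for i in range(len(housing)):
    # The six smallest distances include home i itself, leaving five neighbors
    for j in np.argsort(dist[i,:])[:6]:
        if i != j:
            weights[tuple(sorted((i,j)))] = 1/(dist[i,j] + 0.1)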
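
The same construction can be traced by hand on a few points. The sketch below uses four hypothetical coordinates and one nearest neighbor per home instead of five. It swaps in a hand-written haversine formula for the scikit-learn metric and converts degrees to radians first; the helper names `haversine_matrix` and `knn_edge_weights` are made up for this illustration.

```python
import numpy as np

def haversine_matrix(latlon_deg):
    """Pairwise great-circle distances on the unit sphere, in radians.

    Multiply by the Earth's radius (about 6371 km) to get kilometers.
    """
    pts = np.radians(np.asarray(latlon_deg, dtype=float))
    lat = pts[:, 0][:, None]
    lon = pts[:, 1][:, None]
    dlat = lat - lat.T
    dlon = lon - lon.T
    a = np.sin(dlat / 2) ** 2 + np.cos(lat) * np.cos(lat.T) * np.sin(dlon / 2) ** 2
    return 2 * np.arcsin(np.sqrt(a))

def knn_edge_weights(dist, k, offset=0.1):
    """Connect each point to its k nearest neighbors; weight = 1/(dist + offset)."""
    edges = {}
    for i in range(dist.shape[0]):
        # The k + 1 smallest distances include point i itself (distance 0).
        for j in np.argsort(dist[i, :])[:k + 1]:
            if i != j:
                # Sorting the pair stores each undirected edge only once.
                edges[tuple(sorted((i, int(j))))] = 1.0 / (dist[i, j] + offset)
    return edges

# Hypothetical coordinates: three points close together and one far away.
points = [(38.60, -121.40), (38.61, -121.40), (38.60, -121.41), (38.90, -121.00)]
dist_toy = haversine_matrix(points)
edges_toy = knn_edge_weights(dist_toy, k=1)

for edge in sorted(edges_toy):
    print(edge, edges_toy[edge])
```

Points 0 and 2 are each other's nearest neighbor, so the pair `(0, 2)` is produced twice but stored once. This is why the number of edges is generally smaller than the number of homes times the number of neighbors.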

Now we write the optimization problem in matrix form and solve it.
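
The matrix form relies on two operators: a weighted difference operator `D`, whose rows pick out $w_{jk}(\theta_j - \theta_k)$, and a block-diagonal data matrix `BX`, which computes every $\theta_i^Tx_i$ in one product. The sketch below builds both for a hypothetical problem with three homes and two features, using dense NumPy arrays for readability where the notebook uses sparse matrices. Because `cp.vec` stacks columns, `cp.vec(Theta.T)` lays the per-home weight vectors end to end, which corresponds to `Theta.flatten()` in NumPy's default row-major order.

```python
import numpy as np

# Hypothetical toy problem: N = 3 homes, p = 2 weights per home.
N, p = 3, 2
X_toy = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
Theta_toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
edges_toy = {(0, 1): 2.0, (1, 2): 0.5}
E = len(edges_toy)

# Weighted difference operator: row e has +w in column j and -w in column k.
D = np.zeros((E, N))
for e, ((j, k), w) in enumerate(sorted(edges_toy.items())):
    D[e, j] = w
    D[e, k] = -w

# Block-diagonal data matrix: row i holds x_i in columns i*p, ..., (i+1)*p - 1.
BX = np.zeros((N, N * p))
BX[np.repeat(np.arange(N), p), np.arange(N * p)] = X_toy.flatten()

# Stack the per-home weights end to end, then predict with one product.
theta_stacked = Theta_toy.flatten()
pred = BX.dot(theta_stacked)

# Row e of D.dot(Theta) is w_jk * (theta_j - theta_k); its norm is the edge penalty.
edge_norms = np.linalg.norm(D.dot(Theta_toy), axis=1)

print(pred)
print(edge_norms)
```

Each entry of `pred` matches the row-wise dot product of `X_toy` and `Theta_toy`, and each entry of `edge_norms` matches $w_{jk}\|\theta_j - \theta_k\|_2$ for the corresponding edge.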

In [130]:
# Parameters
N = len(housing)
E = len(weights)
p = 4
mu = 0.5
lam = 5

# Get data, normalize
X = housing.as_matrix(columns=["beds", "baths", "sq__ft"])
X = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
X = np.hstack((X, np.ones((N, 1))))

y = housing.as_matrix(columns=["price"])
y_mu = np.mean(y)
y_sigma = np.std(y)
y = (y - y_mu) / y_sigma

# Construct the weighted difference operator: row e has +w_jk in column j
# and -w_jk in column k, so row e of D*Theta is w_jk*(theta_j - theta_k)
w = np.array(weights.values())
Dv = np.hstack((w, -w))
Di = np.hstack((np.arange(E), np.arange(E)))
Dj = [i for i, j in weights.keys()] + [j for i, j in weights.keys()]
D = sp.coo_matrix((Dv, (Di, Dj)))

# Construct the data matrix in block diagonal form: row i holds x_i in
# columns i*p, ..., (i+1)*p - 1, so BX*vec(Theta.T) stacks theta_i^T x_i
BXv = X.flatten()
BXi = np.repeat(np.arange(N), p)
BXj = np.arange(N*p)
BX = sp.coo_matrix((BXv, (BXi, BXj)))

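# Formulate the problem, with one row of Theta per home
Theta = cp.Variable(N, p)
f = (cp.sum_squares(BX*cp.vec(Theta.T) - y) +           # squared prediction error
     mu*cp.sum_squares(Theta[:,:3]) +                   # l2 penalty, intercept excluded
     lam*cp.sum_entries(cp.pnorm(D*Theta, 2, axis=1)))  # weighted edge differences
prob = cp.Problem(cp.Minimize(f))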

prob.solve(solver=cp.SCS, verbose=True)

print "normalized MSE:", cp.sum_squares(BX*cp.vec(Theta.T) - y).value/N

----------------------------------------------------------------------------
	SCS v1.1.7 - Splitting Conic Solver
	(c) Brendan O'Donoghue, Stanford University, 2012-2015
----------------------------------------------------------------------------
Lin-sys: sparse-direct, nnz in A = 34079
eps = 1.00e-03, alpha = 1.50, max_iters = 2500, normalize = 1, scale = 1.00
Variables n = 6962, constraints m = 19044
Cones:	soc vars: 19044, soc blks: 3022
Setup time: 1.94e-02s
----------------------------------------------------------------------------
 Iter | pri res | dua res | rel gap | pri obj | dua obj | kap/tau | time (s)
----------------------------------------------------------------------------
     0|      inf       inf       nan      -inf       inf       inf  2.55e-03 
   100| 5.79e-02  1.10e-02  2.15e-02  2.02e+03  2.11e+03  2.31e-12  9.07e-02 
   200| 9.63e-03  4.23e-03  2.17e-03  1.25e+03  1.26e+03  1.41e-12  1.76e-01 
   300| 8.76e-03  3.18e-03  3.25e-03  1.10e+03  1.10e+03  1.39e-12  