# Spatial Autocorrelation In Zillow Home Price Residuals

This kernel checks for spatial autocorrelation in the home price residuals for the Zillow dataset. It does so by computing the "gamma index" and checking its statistical significance. For an overview of the gamma index (and other spatial autocorrelation concepts), check out this brief:

http://www.dpi.inpe.br/gilberto/tutorials/software/geoda/tutorials/w9_spauto3_slides.pdf

Evidence of spatial autocorrelation can provide guidance for better predicting the residuals for the competition. Strong spatial autocorrelation means that homes that are close together tend to have similar residuals. That would then suggest that we should look for "localized" factors when trying to predict those residuals.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()

%matplotlib inline

## Define a map projection
In the Zillow dataset, parcel locations are given as geographic (longitude, latitude) coordinates. That's great, but if we want to compute distances between parcels (which we do), it's a lot easier to work in projected (x, y) coordinates. 

We can use `pyproj` to define a function that remaps from (lon, lat) to (x, y) according to a specified map projection. The string below defines a Transverse Mercator projection with its central meridian at 118.5 degrees west longitude, which crosses the Los Angeles area, so the projected coordinates will have minimal spatial distortion there.

Projected coordinates will be in meters.
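To see why raw degrees make poor distance units here, the following rough check estimates how many meters one degree spans in each direction. It is only a sketch: it assumes a spherical Earth with a mean radius of 6371 km and a latitude of 34.2 degrees north (roughly the Los Angeles area), and `meters_per_degree` is a hypothetical helper, not part of `pyproj`.

```python
import math

EARTH_RADIUS_M = 6371000.0  # assumed mean Earth radius (spherical approximation)


def meters_per_degree(lat_deg):
    """Return (meters per degree of longitude, meters per degree of latitude)."""
    per_deg_lat = math.pi * EARTH_RADIUS_M / 180.0
    # Meridians converge toward the poles, so a degree of longitude shrinks with cos(lat).
    per_deg_lon = per_deg_lat * math.cos(math.radians(lat_deg))
    return per_deg_lon, per_deg_lat


m_lon, m_lat = meters_per_degree(34.2)
print('1 degree of longitude ~ %.0f m, 1 degree of latitude ~ %.0f m' % (m_lon, m_lat))
```

Because the two scales differ by close to 20 percent at this latitude, a plain Euclidean distance in (lon, lat) would stretch east-west separations relative to north-south ones, which is what the projection avoids.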

In [None]:
import pyproj
center_lon = -118.5  # A reasonable value for the Los Angeles area.
# Transverse Mercator centered on center_lon, with no false easting or northing.
ps0 = '+proj=tmerc +lat_0=0.0 +lon_0=%.1f +y_0=0 +x_0=0 +k_0=0.9996 +units=m +ellps=WGS84' % center_lon
remap = pyproj.Proj(ps0)

## Get all parcel locations
Read the file that lists all properties. Then build a lookup table that associates each parcel ID with its projected coordinates.

In [None]:
props = pd.read_csv('../input/properties_2016.csv')

In [None]:
# Build a new dataframe that just includes the lon and lat, indexed by parcel ID.
lon = props['longitude'] / 1.0e6
lat = props['latitude'] / 1.0e6
parcelid = props['parcelid']
parcel_coords = pd.DataFrame({'parcelid': parcelid, 'lon': lon, 'lat': lat})
parcel_coords = parcel_coords.set_index('parcelid').dropna(axis=0)
parcel_coords.head()

In [None]:
# Make a plot to get a sense of the overall spatial distribution of the data we're 
# dealing with. Here we're just plotting a random sample of the parcel lon/lat coordinates.
dfs = parcel_coords.sample(n=5000)
plt.figure(figsize=(12, 12))
plt.plot(dfs['lon'], dfs['lat'], '.')

That's LA alright. 

For the analysis below, we're just going to focus on a subset of the area.
So here we define a lon/lat bounding box for the area of interest, which pretty much covers the San Fernando Valley. This bounding box will be applied to the training data when we read it a couple of cells down.


In [None]:
lon_min = -118.70
lon_max = -118.25
lat_min = 34.10
lat_max = 34.35

## Get sale price residuals
The training dataset consists of parcels with a known residual error. We read this list, associating each one with its latitude and longitude using the look-up table that we just built above. Then we apply the map projection that we defined above, giving projected (x, y) coordinates for each parcel.
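The bounding-box filter and the random subsampling used in this step can also be expressed as a single boolean mask. The sketch below is one hedged way to do that with NumPy. `select_parcels`, `keep_fraction`, and `seed` are names introduced here for illustration, and the seeded generator is an assumption added so that the subsample is reproducible from run to run.

```python
import numpy as np


def select_parcels(lon, lat, bbox, keep_fraction=0.1, seed=0):
    """Return a boolean mask of parcels inside bbox that survive random subsampling.

    bbox is (lon_min, lon_max, lat_min, lat_max). Hypothetical helper.
    """
    lon = np.asarray(lon, dtype=float)
    lat = np.asarray(lat, dtype=float)
    lon_min, lon_max, lat_min, lat_max = bbox
    inside = (lon > lon_min) & (lon < lon_max) & (lat > lat_min) & (lat < lat_max)
    # Seeded generator: the same seed always keeps the same parcels.
    rng = np.random.default_rng(seed)
    keep = rng.random(lon.shape) < keep_fraction
    return inside & keep


# Toy usage with made-up coordinates and the bounding box used in this notebook.
bbox_demo = (-118.70, -118.25, 34.10, 34.35)
lon_demo = [-118.5, -119.0, -118.3]
lat_demo = [34.2, 34.2, 34.5]
mask_demo = select_parcels(lon_demo, lat_demo, bbox_demo, keep_fraction=1.0)
print(mask_demo)
```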



In [None]:
import csv
parcel_list = []
with open('../input/train_2016_v2.csv') as source:

    reader = csv.DictReader(source, delimiter=',')
    k = 0
    for rec in reader:
    
#         k += 1
#         if k % 10000 == 0:
#             print('Handling record %d' % k)
        
        pid = int(rec['parcelid'])
        lon = parcel_coords.loc[pid]['lon']
        lat = parcel_coords.loc[pid]['lat']
        
        # Only keep records that fall within our area of interest.
        if lon_min < lon < lon_max and lat_min < lat < lat_max:
            
            # Also, subsample the records a bit. A full sample is more than we need to make 
            # our point, and just slows things down.
            if np.random.random() < 0.1:
                
                # Get the projected coordinates.
                (xx, yy) = remap(lon, lat)
                
                # Add the relevant information to the parcel list.
                parcel_list.append({'xx': xx, 'yy': yy, 'resid': float(rec['logerror'])})
            
print('Keeping information for %d parcels' % len(parcel_list))

In [None]:
# Take another look at the spatial distribution of the parcels. This time we plot them
# in projected (x, y) coordinates.
df = pd.DataFrame(parcel_list)
print(df.head())  # A bare df.head() in the middle of a cell displays nothing.
plt.figure(figsize=(10, 10))
plt.plot(df['xx'], df['yy'], '.')

## Spatial autocorrelation 
In this section we compute the gamma index as a measure of spatial autocorrelation. The gamma index is based on the cross-product of two measures of the similarity between pairs of parcels: "value similarity" and "spatial similarity". There are a lot of different ways these quantities can be defined.

Here we will define the "value similarity" for a pair of parcels as a function of the difference of their log error values. We take this difference and apply a negative exponential scaling, so that larger values (near 1.0) represent greater similarity of residuals.

The "spatial similarity" will be defined similarly as a negative exponential function of the linear distance between a pair of parcels. Specifically, the spatial similarity function looks like this:

In [None]:
range_parameter = 300.0  # meters
distance = np.arange(0.0, 2000.0, 10.0)
spatial_similarity = np.exp(-distance / range_parameter)

plt.figure(figsize=(8, 4))
plt.plot(distance, spatial_similarity, '-')
plt.xlabel('parcel-to-parcel distance [meters]')
plt.ylabel('spatial similarity')

We have used a range parameter of 300 meters. That scales the function so that inter-parcel distances larger than about 1500 meters are assigned a spatial similarity close to zero. So in the calculations below, we can essentially ignore pairs of parcels that are separated by distances greater than that.     
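As a quick numerical check of that cutoff, the sketch below evaluates the same negative exponential at a few distances. `spatial_similarity_at` is a hypothetical helper name chosen so that it does not clash with the notebook's own variables.

```python
import math

RANGE_PARAMETER_M = 300.0  # same range parameter as in the notebook


def spatial_similarity_at(distance_m, range_parameter=RANGE_PARAMETER_M):
    """Negative exponential similarity for a single distance in meters."""
    return math.exp(-distance_m / range_parameter)


for d in (0.0, 300.0, 900.0, 1500.0):
    print('%6.0f m -> %.4f' % (d, spatial_similarity_at(d)))
```

At 1500 meters the similarity is exp(-5), which is below 0.01, so dropping pairs beyond that distance changes the cross-product very little.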

Next we look at pairs of parcels from the list that we created above, and compute their spatial and value similarity measures.
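The explicit double loop used for this step is easy to follow but slow in pure Python. As a hedged alternative, the sketch below computes the same two similarity measures with NumPy. It assumes the sample is small enough that all n(n-1)/2 pairs fit in memory, and `pairwise_similarities` is a hypothetical helper whose default parameters mirror the values used in this notebook.

```python
import numpy as np


def pairwise_similarities(xx, yy, resid, distance_threshold=1500.0,
                          range_parameter=300.0, value_scale=0.5):
    """Return (spatial_similarity, value_similarity) for all pairs closer than the threshold."""
    xx = np.asarray(xx, dtype=float)
    yy = np.asarray(yy, dtype=float)
    resid = np.asarray(resid, dtype=float)
    # Index arrays for every unordered pair (i, j) with i < j.
    ii, jj = np.triu_indices(len(xx), k=1)
    dd = np.hypot(xx[ii] - xx[jj], yy[ii] - yy[jj])
    keep = dd < distance_threshold
    dv = np.abs(resid[ii] - resid[jj])[keep]
    return np.exp(-dd[keep] / range_parameter), np.exp(-dv / value_scale)


# Toy usage with made-up coordinates (meters) and residuals.
s_demo, v_demo = pairwise_similarities([0.0, 300.0, 5000.0],
                                       [0.0, 0.0, 0.0],
                                       [0.0, 0.5, 0.0])
print(s_demo, v_demo)
```

For much larger samples, a spatial index that only visits nearby pairs would avoid building every pair, at the cost of an extra dependency.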

In [None]:
parcel_count = len(parcel_list)
distance_threshold = 1500.0  # meters
range_parameter = 300.0  # meters 
spatial_similarity = []
value_similarity = []
sample_count = 0

for ii in range(parcel_count):
    
#     if ii % 100 == 0:
#         print('%d / %d' % (ii, parcel_count))
        
    for jj in range(ii+1, parcel_count):
        dx = parcel_list[ii]['xx'] - parcel_list[jj]['xx']
        dy = parcel_list[ii]['yy'] - parcel_list[jj]['yy']
        dd = np.sqrt(dx**2 + dy**2)
        if dd < distance_threshold:
            dv = np.abs(parcel_list[ii]['resid'] - parcel_list[jj]['resid'])
            value_similarity.append(np.exp(-dv / 0.5))
            spatial_similarity.append(np.exp(-dd / range_parameter))
            sample_count += 1

Compare the spatial similarity and value similarity values that we just found.

In [None]:
plt.figure(figsize=(8,6))
plt.plot(spatial_similarity, value_similarity, '.')
plt.xlabel('spatial similarity')
plt.ylabel('value similarity')
plt.ylim(0, 1.02)

That certainly looks like spatial autocorrelation. That is, parcel pairs that have a high "spatial similarity" (i.e. that are close to 
one another) tend to have high "value similarity" as well (i.e. they have similar errors in estimated home sale prices). Parcels that are further apart begin to lose that tendency. 

That assertion can be made more rigorous by computing the gamma index and checking its statistical significance.
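Written out, the quantity computed in this notebook is the sum over nearby parcel pairs of the product of the two similarity measures. The symbols are notation introduced here for illustration only: $d_{ij}$ is the distance between parcels $i$ and $j$, $r_i$ is the residual (log error) of parcel $i$, $\rho = 300$ m is the range parameter, $\tau = 0.5$ is the scale applied to residual differences, and $d_{\max} = 1500$ m is the distance threshold.

```latex
\Gamma = \sum_{\substack{i < j \\ d_{ij} < d_{\max}}} s_{ij}\, v_{ij},
\qquad
s_{ij} = \exp\!\left(-\frac{d_{ij}}{\rho}\right),
\qquad
v_{ij} = \exp\!\left(-\frac{\lvert r_i - r_j \rvert}{\tau}\right)
```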

In [None]:
gamma = np.dot(spatial_similarity, value_similarity)
print('gamma = %f' % gamma)

So we know the gamma value. Great. But what does it mean? Is that value high or low?

The gamma value itself doesn't tell us anything because it depends on the units and the scale of the values that we are comparing. So to determine whether this value is evidence of spatial autocorrelation, we need to compare it to values that we would get if no autocorrelation existed. That is, we need to do a basic statistical hypothesis test.

We can manufacture a set of "null hypothesis" gamma values by keeping the spatial similarity values that we computed above, but replacing the value similarities with values computed from randomly selected parcel pairs. Doing this a bunch of times gives us a distribution of what gamma would tend to be in the absence of autocorrelation.
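The random-pair procedure just described can be sketched in vectorized form, which avoids one Python-level iteration per pair. This is an illustration under stated assumptions rather than the notebook's own code: `null_gamma_random_pairs`, `value_scale`, and `seed` are names introduced here, and the input must contain at least two residuals.

```python
import numpy as np


def null_gamma_random_pairs(spatial_similarity, resid, n_draws=100,
                            value_scale=0.5, seed=0):
    """Null gamma values obtained by pairing residuals at random (hypothetical helper)."""
    spatial = np.asarray(spatial_similarity, dtype=float)
    resid = np.asarray(resid, dtype=float)
    n, m = len(resid), len(spatial)
    rng = np.random.default_rng(seed)
    gamma_null = np.empty(n_draws)
    for k in range(n_draws):
        # One random pair of distinct parcels per spatial similarity value.
        # Adding a nonzero offset modulo n guarantees second != first.
        first = rng.integers(0, n, size=m)
        second = (first + rng.integers(1, n, size=m)) % n
        dv = np.abs(resid[first] - resid[second])
        gamma_null[k] = np.dot(spatial, np.exp(-dv / value_scale))
    return gamma_null


# Toy usage with made-up similarity values and residuals.
spatial_demo = np.array([0.9, 0.5, 0.1])
resid_demo = np.array([0.02, -0.01, 0.30, 0.05])
gamma_null_demo = null_gamma_random_pairs(spatial_demo, resid_demo, n_draws=5)
print(gamma_null_demo)
```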

In [None]:
nnn = 100
gamma_null = np.zeros(nnn)
for k in range(nnn):
    for i in range(len(spatial_similarity)):
        zz = np.random.choice(len(parcel_list), 2, replace=False)
        v0 = parcel_list[zz[0]]['resid']
        v1 = parcel_list[zz[1]]['resid']
        dv = np.abs(v0 - v1)
        # Use a separate name so the value_similarity list built above is not overwritten.
        null_similarity = np.exp(-dv / 0.5)
        gamma_null[k] += null_similarity * spatial_similarity[i]
    print('%d / %d: null gamma: %.1f [vs. observed gamma %.1f]' % (k, nnn, gamma_null[k], gamma))

In [None]:
# Plot the distribution of the gamma values that we would get under the null 
# hypothesis of no spatial autocorrelation.
plt.figure(figsize=(10, 10))
plt.hist(gamma_null, bins=30)
plt.xlabel('Gamma Value')
plt.ylabel('Count')  # plt.hist plots raw counts unless density=True is passed.
plt.title('Gamma Values In Absence Of Autocorrelation (Versus Observed Value %.1f)' % gamma)

So in the absence of spatial autocorrelation, the gamma index never gets as large as the one that we observed in our dataset (out of the 100 cases that we generated). This amounts to very strong evidence that Zillow home price estimates have spatially correlated errors.
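That comparison can be summarized as an empirical p-value. The sketch below uses the common convention of counting the observed value as one of the draws, so the result is never exactly zero. `empirical_p_value` is a hypothetical helper and the numbers in the usage lines are made up for illustration.

```python
def empirical_p_value(observed, null_values):
    """One-sided empirical p-value: share of draws at least as large as the observed value."""
    exceed = sum(1 for g in null_values if g >= observed)
    # Add one to numerator and denominator so the observed value counts as a draw.
    return (1 + exceed) / (1 + len(null_values))


# Toy usage: 100 made-up null values, none as large as the observed one.
null_demo = [100.0 + 0.1 * k for k in range(100)]
p_demo = empirical_p_value(150.0, null_demo)
print('empirical p-value: %.4f' % p_demo)
```

With 100 null draws and none reaching the observed gamma, this convention gives 1/101, or roughly 0.01. A smaller bound would require more draws.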
