# Purpose of this notebook

The purpose of this notebook is to migrate the workbook pseudo code of `LOSH_*.ipynb` and `OLJC_*.ipynb` into functions that match the `PySAL` structure. These will be expanded over time and built out. 

**Table of contents**
1. [Univariate Local Join Counts](#ULJC)
2. [Bivariate Local Join Counts](#BVLJC)
3. [Multivariate Local Join Counts](#MVLJC)
4. [LOSH](#LOSH)

In [1]:
# Load pep8 styling throughout notebook
%load_ext pycodestyle_magic

In [2]:
# Turn pep8 style checking ON
#%pycodestyle_on
# Note that you can turn on line numbers with ESC+L!

In [3]:
#Turn pep8 style checking OFF to resume testing functions
#%pycodestyle_off

## LJC

### Univariate <a name="ULJC"></a>

First draft of implementation is in 'legacy' format (matching functions like moran.py). 

In [4]:
"""
Univariate local join counts for binary attributes
"""

__author__ = "Sergio J. Rey <srey@asu.edu> , Luc Anselin <luc.anselin@asu.edu>"

from libpysal.weights.spatial_lag import lag_spatial
import numpy as np
import pandas as pd


class Join_Counts_Local_old(object):

    """Univariate Local Join Counts
    Parameters
    ----------
    y               : array
                      binary variable measured across n spatial units
    w               : W
                      spatial weights instance
    Attributes
    ----------
    y            : array
                   original variable
    w            : W
                   original w object
    bb           : float
                   number of black-black joins
    Notes
    -----
    Technical details and derivations can be found in :cite:`anselinli2019`.
    """
    def __init__(self, y, w):
        y = np.asarray(y).flatten()
        w.transformation = 'b'
        self.w = w
        # The following line differs from esda.Join_Counts() function
        self.adj_list = self.w.to_adjlist(remove_symmetric=False)
        self.y = y
        results = self.__calc(self.y)
        # As there is only one item being returned we just use
        # results. Once more need sto be returned in last line
        # of __calc, this would change back to results[0]
        self.bb = results

    def __calc(self, z):
        adj_list = self.adj_list
        zseries = pd.Series(z, index=self.w.id_order)
        focal = zseries.loc[adj_list.focal].values
        neighbor = zseries.loc[adj_list.neighbor].values
        BB = (focal == 1) & (neighbor == 1)
        adj_list_BB = pd.DataFrame(adj_list.focal.values,
                                   BB.astype('uint8')).reset_index()
        adj_list_BB.columns = ['BB', 'ID']
        adj_list_BB = adj_list_BB.groupby(by='ID').sum()
        BB = adj_list_BB.BB.values
        return (BB)

Above function is working but is in the 'old' `moran.py` or `join_counts.py` formatting style. Levi suggested making them in the form of scikit-learn or scipy. I'm leaning torwards the scikit-learn style and so I'm emulating `lee.py`. Note that the following `Local_Join_Count` is the preferred function. 

In [2]:
import numpy
from sklearn.base import BaseEstimator
import libpysal

PERMUTATIONS = 999

class Local_Join_Count(BaseEstimator):

    """Local Join Count Statistic"""

    def __init__(self, connectivity=None, permutations=PERMUTATIONS):
        """
        Initialize a Join_Counts_Local estimator
        Arguments
        ---------
        connectivity:   scipy.sparse matrix object
                        the connectivity structure describing the relationships
                        between observed units. Need not be row-standardized.
        Attributes
        ----------
        BB_:  numpy.ndarray (1,)
              array containing the estimated Local Join Count coefficients,
              where element [0,0] is the number of Local Join Counts, ...
        """

        self.connectivity = connectivity
        self.permutations = permutations

    def fit(self, y):
        """
        Arguments
        ---------
        y       :   numpy.ndarray
                    array containing binary (0/1) data
        Returns
        -------
        the fitted estimator.
        Notes
        -----
        Technical details and derivations found in :cite:`AnselinLi2019`.
        """
        y = np.asarray(y).flatten()

        w = self.connectivity
        # Binary weights are needed for this statistic
        w.transformation = 'b'

        self.BB_ = self._statistic(y, w)
        
        if permutations:
            print(PERMUTATIONS)

        # Need the >>> return self to get the associated .BB_ attribute
        # (significance in future, i.e. self.reference_distribution_ in lee.py)
        return self

    @staticmethod
    def _statistic(y, w):
        # Create adjacency list. Note that remove_symmetric=False - this is
        # different from the esda.Join_Counts() function.
        adj_list = w.to_adjlist(remove_symmetric=False)
        zseries = pd.Series(y, index=w.id_order)
        focal = zseries.loc[adj_list.focal].values
        neighbor = zseries.loc[adj_list.neighbor].values
        BB = (focal == 1) & (neighbor == 1)
        adj_list_BB = pd.DataFrame(adj_list.focal.values,
                                   BB.astype('uint8')).reset_index()
        adj_list_BB.columns = ['BB', 'ID']
        adj_list_BB = adj_list_BB.groupby(by='ID').sum()
        BB = adj_list_BB.BB.values
        return (BB)

ConnectionError: HTTPSConnectionPool(host='geodacenter.github.io', port=443): Max retries exceeded with url: /data-and-lab// (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x05CA9910>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))

Test both the old and new function with some inputs...

In [3]:
import numpy as np
import libpysal
import pandas as pd
# Create a 16x16 grid
w = libpysal.weights.lat2W(4, 4)
y_1 = np.ones(16)
# Set the first 9 of the ones to 0
y_1[0:8] = 0
print('new y_1', y_1)

ConnectionError: HTTPSConnectionPool(host='geodacenter.github.io', port=443): Max retries exceeded with url: /data-and-lab// (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x05CCC8E0>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))

In [7]:
Join_Counts_Local_old(y_1, w)

<__main__.Join_Counts_Local_old at 0x1f991d30>

In [8]:
test_ljc_uni = Join_Counts_Local_old(y_1, w)
vars(test_ljc_uni)
print(test_ljc_uni.bb)

[0 0 0 0 0 0 0 0 2 3 3 2 2 3 3 2]


In [9]:
temp = Local_Join_Count(connectivity=w).fit(y_1)
temp.BB_

array([0, 0, 0, 0, 0, 0, 0, 0, 2, 3, 3, 2, 2, 3, 3, 2], dtype=uint64)

Test to ensure equivalency

In [10]:
test_ljc_uni.bb == temp.BB_

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True])

In [11]:
# Compare speed of two functions
%alias_magic t timeit

Created `%t` as an alias for `%timeit`.
Created `%%t` as an alias for `%%timeit`.


In [12]:
%t Local_Join_Count(connectivity=w).fit(y_1)

6.14 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [13]:
%t Join_Counts_Local_old(y_1, w)

5.94 ms ± 651 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


No apparent difference in speed?

### Bivariate Local Join Count  <a name="BVLJC"></a>

In [14]:
# https://github.com/pysal/esda/blob/master/esda/lee.py
import numpy
from scipy import sparse
from sklearn.base import BaseEstimator

class Local_Join_Count_BV(BaseEstimator):

    """Global Spatial Pearson Statistic"""

    def __init__(self, connectivity=None):
        """
        Initialize a Join_Counts_Local estimator
        Arguments
        ---------
        connectivity:   scipy.sparse matrix object
                        the connectivity structure describing the relationships
                        between observed units. Will be row-standardized.
        Attributes
        ----------
        BJC_:  numpy.ndarray (1,)
               array containing the estimated Bivariate Local Join Counts ...
        CLC_:  numpy.ndarray (1,)
               array containing the estimated Bivariate Local Join Counts ...
        """

        self.connectivity = connectivity

    def fit(self, x, z, case=None):
        """
        Arguments
        ---------
        y       :   numpy.ndarray
                    array containing binary (0/1) data
        Returns
        -------
        the fitted estimator.
        Notes
        -----
        Technical details and derivations can be found in :cite:`Lee2001`.
        """
        x = np.asarray(x).flatten()
        z = np.asarray(z).flatten()

        w = self.connectivity
        w.transformation = 'b'

        self.LJC_ = self._statistic(x, z, w, case=case)

        return self

    @staticmethod
    def _statistic(x, z, w, case=None):
        # Create adjacency list. Note that remove_symmetric=False - this is
        # different from the esda.Join_Counts() function.
        adj_list = w.to_adjlist(remove_symmetric=False)

        # First, set up a series that maps the y values
        # (input as self.y) to the weights table
        zseries_x = pd.Series(x, index=w.id_order)
        zseries_z = pd.Series(z, index=w.id_order)

        # Next, map the y values to the focal (i) values
        focal_x = zseries_x.loc[adj_list.focal].values
        focal_z = zseries_z.loc[adj_list.focal].values

        # Repeat the mapping but for the neighbor (j) values
        neighbor_x = zseries_x.loc[adj_list.neighbor].values
        neighbor_z = zseries_z.loc[adj_list.neighbor].values

        if case == "BJC":
            BJC = (focal_x == 1) & (focal_z == 0) & \
                  (neighbor_x == 0) & (neighbor_z == 1)
            adj_list_BJC = pd.DataFrame(adj_list.focal.values,
                                        BJC.astype('uint8')).reset_index()
            adj_list_BJC.columns = ['BJC', 'ID']
            adj_list_BJC = adj_list_BJC.groupby(by='ID').sum()
            return adj_list_BJC.BJC.values
        elif case == "CLC":
            CLC = (focal_x == 1) & (focal_z == 1) & \
                  (neighbor_x == 1) & (neighbor_z == 1)
            adj_list_CLC = pd.DataFrame(adj_list.focal.values,
                                        CLC.astype('uint8')).reset_index()
            adj_list_CLC.columns = ['CLC', 'ID']
            adj_list_CLC = adj_list_CLC.groupby(by='ID').sum()
            return (adj_list_CLC.CLC.values)
        else:
            print("Please specify which type of bivariate Local Join Count \
            you would like to calculate (either 'BJC' or 'CLC'). See Anselin \
            and Li 2019 p. 9-10 for more information")

Test some values...

In [15]:
x = y_1
z = [0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1]

print('x', x)
print('z', z)

x [0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1.]
z [0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1]


In [16]:
# Case 1
temp2 = Local_Join_Count_BV(connectivity=w).fit(x, z, case="BJC")
print(temp2.LJC_)
# Case 2
temp2 = Local_Join_Count_BV(connectivity=w).fit(x, z, case="CLC")
print(temp2.LJC_)

[0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 2 2 0 0 2 2]


In [17]:
# Try with a purposefully wrong input or blnak
# Improper input
print(Local_Join_Count_BV(connectivity=w).fit(x, z, case="ThisIsWrong"))
# No input for case
print(Local_Join_Count_BV(connectivity=w).fit(x, z))

Please specify which type of bivariate Local Join Count             you would like to calculate (either 'BJC' or 'CLC'). See Anselin             and Li 2019 p. 9-10 for more information
Local_Join_Count_BV(connectivity=<libpysal.weights.weights.W object at 0x1F98F0E8>)
Please specify which type of bivariate Local Join Count             you would like to calculate (either 'BJC' or 'CLC'). See Anselin             and Li 2019 p. 9-10 for more information
Local_Join_Count_BV(connectivity=<libpysal.weights.weights.W object at 0x1F98F0E8>)


### Multivariate Local Join Count <a name="MVLJC"></a>

In [18]:
import numpy
from scipy import sparse
from sklearn.base import BaseEstimator

class Local_Join_Count_MV(BaseEstimator):

    """Global Spatial Pearson Statistic"""

    def __init__(self, connectivity=None):
        """
        Initialize a Join_Counts_Local estimator
        Arguments
        ---------
        connectivity:   scipy.sparse matrix object
                        the connectivity structure describing the relationships
                        between observed units. Will be row-standardized.
        Attributes
        ----------
        MJC_:  numpy.ndarray (1,)
               array containing the Multivariate Local Join Counts ...
        """

        self.connectivity = connectivity

    def fit(self, variables):
        """
        Arguments
        ---------
        y       :   numpy.ndarray
                    array containing binary (0/1) data
        Returns
        -------
        the fitted estimator.
        Notes
        -----
        Technical details and derivations can be found in :cite:`Lee2001`.
        """

        # Need not be flattened?

        w = self.connectivity
        w.transformation = 'b'

        self.MJC_ = self._statistic(variables, w)

        return self

    @staticmethod
    def _statistic(variables, w):
        # Create adjacency list. Note that remove_symmetric=False -
        # different from the esda.Join_Counts() function.
        adj_list = w.to_adjlist(remove_symmetric=False)

        # The zseries
        zseries = [pd.Series(i, index=w.id_order) for i in variables]
        # The focal values
        focal = [zseries[i].loc[adj_list.focal].values for
                 i in range(len(variables))]
        # The neighbor values
        neighbor = [zseries[i].loc[adj_list.neighbor].values for
                    i in range(len(variables))]

        # Find instances where all surrounding 
        # focal and neighbor values == 1
        focal_all = np.array(np.all(np.dstack(focal)==1, 
                                    axis=2))
        neighbor_all = np.array(np.all(np.dstack(neighbor)==1, 
                                       axis=2))
        MCLC = (focal_all == True) & (neighbor_all == True)
        # Convert list of True/False to boolean array 
        # and unlist (necessary for building pd.DF)
        MCLC = list(MCLC*1)
        
        # Create a df that uses the adjacency list
        # focal values and the BBs counts
        adj_list_MCLC = pd.DataFrame(adj_list.focal.values,
                                     MCLC).reset_index()
        # Temporarily rename the columns
        adj_list_MCLC.columns = ['MCLC', 'ID']
        adj_list_MCLC = adj_list_MCLC.groupby(by='ID').sum()

        return (adj_list_MCLC.MCLC.values)

Test inputs

In [19]:
x = x.astype(np.int32)
print('x', x)
print('z', z)
y = [0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1]
y = np.asarray(y).flatten()
print('y', y)

x [0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1]
z [0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1]
y [0 1 1 1 1 1 1 1 0 0 0 1 0 0 1 1]


In [20]:
temp = Local_Join_Count_MV(connectivity=w).fit([x, y, z])
temp.MJC_

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 2], dtype=int64)

## LOSH  <a name="LOSH"></a>

In [1]:
# https://github.com/pysal/esda/blob/master/esda/lee.py
import numpy
from scipy import sparse
from scipy import stats
from sklearn.base import BaseEstimator
import pysal.lib as lp


class losh(BaseEstimator):
    """Local spatial heteroscedasticity (LOSH)"""

    def __init__(self, connectivity=None, inference=None):
        """
        Initialize a losh estimator

        Arguments
        ---------
        connectivity: scipy.sparse matrix object
                      the connectivity structure describing the relationships
                      between observed units.
        inference: str
                   describes type of inference to be used. options are
                   "chi-square", "permutation", or "simulation".

        Attributes
        ----------
        Hi: numpy array
            Array of LOSH values for each spatial unit.    
        ylag: array
              Spatially lagged y values.
        yresid: array
                Spatially lagged residual values.
        VarHi: array
               Variance of Hi.
        pval: numpy array
              P-values for inference based on either "chi-square", 
              "permutation", or "simulation" approaches.
        """
        
        self.connectivity = connectivity
        self.inference = inference
        
    def fit(self, y, a=None):
        """
        Arguments
        ---------
        y       :   numpy.ndarray
                    array containing continuous data
        a       :   int
                    residual multiplier. Default is 2 in order to generate a
                    variance measure. Users may use 1 for absolute deviations.

        Returns
        -------
        the fitted estimator.

        Notes
        -----
        Technical details and derivations can be found in :cite:`OrdGetis2012`.
        """
        y = np.asarray(y).flatten()
        
        w = self.connectivity

        self.Hi, self.ylag, self.yresid, self.VarHi = self._statistic(y, w, a)
        
        if self.inference is None:
            return self
        elif self.inference == "chi-square":
            print("Note: chi-square inference selected. This assumes a=2.")
            dof = 2/self.VarHi
            Zi = (2*self.Hi)/self.VarHi
            self.pval = 1 - stats.chi2.cdf(Zi, dof)

        return self

    @staticmethod
    def _statistic(y, w, a):
        # Define what type of variance to use
        if a is None:
            a = 2
        else:
            a = a
                
        Wrs = [round(np.sum(list(w[y].values()))) for y in range(len(y))]
        
        # Calculate spatial mean
        ylag = lp.weights.lag_spatial(w, y)/Wrs
        # Calculate and adjust residuals based on multiplier
        yresid = abs(y-ylag)**a
        # Calculate denominator of Hi calculation 
        # as mean of residuals
        denom =  int(np.mean(yresid)) * np.array(Wrs)
        # Carry out final $H_{i}$ calculation by dividing
        # spatial average of residuals by denom
        Hi = lp.weights.lag_spatial(w, yresid) / denom
        
        # Calculate variance
        n = len(y)
        # Calculate average of residuals
        yresid_mean = np.mean(yresid)
        # Calculate VarHi
        VarHi = ((n-1)**-1) * \
                     (denom**-2) * \
                     ((np.sum(yresid**2)/n) - yresid_mean**2) * \
                     ((n*np.array([np.sum(np.array(list(w[y].values()))**2) for y in range(len(y))])) - np.array(Wrs)**2)
        VarHi  

        return (Hi, ylag, yresid, VarHi)

Test values based on existing Global Spatial Autocorrelation notebook.

In [5]:
# Load modules
import pandas as pd
import geopandas as gpd
import pysal.lib as lp
import matplotlib.pyplot as plt
import rasterio as rio
import numpy as np
import shapely.geometry as geom
%matplotlib inline

In [6]:
df = gpd.read_file('https://github.com/jeffcsauer/GSOC2020/raw/master/validation/data/neighborhoods.gpkg')
listings = gpd.read_file('https://github.com/jeffcsauer/GSOC2020/raw/master/validation/data/listings.gpkg')
listings['price'] = listings.price.str.replace('$', '').str.replace(',','_').astype(float)
median_price = gpd.sjoin(listings[['price', 'geometry']], df, op='within')\
                  .groupby('index_right').price.median()
df['median_pri'] = median_price.values
# Make sure missing values are taken care of
pd.isnull(df['median_pri']).sum()
df = df
df['median_pri'].fillna((df['median_pri'].mean()), inplace=True)
y = df['median_pri']

In [7]:
w = lp.weights.Queen.from_dataframe(df)

Pass through function

In [11]:
temp = losh(connectivity=w, inference="chi-square").fit(y)

Note: chi-square inference selected. This assumes a=2.


In [12]:
temp.Hi

array([0.02996735, 0.05041104, 0.81636059, 0.58900481, 0.28605031,
       1.5451563 , 0.25909648, 2.44204798, 1.0135418 , 0.55538144,
       0.0526856 , 0.1165462 , 0.94798433, 0.80453878, 0.96029393,
       1.62748943, 1.08607199, 1.84646003, 1.21865793, 4.22737747,
       0.30344004, 0.84448541, 3.16069697, 1.02350556, 2.40941657,
       0.86424857, 0.21019234, 0.08468352, 0.4810683 , 3.41051811,
       0.05860075, 0.78184891, 0.02526062, 0.90894206, 0.17749852,
       1.02319397, 4.23919724, 0.57482262, 0.55791693, 2.46713092,
       0.7909307 , 0.17225221, 0.34349454, 0.61917688])

# Old Losh

In [None]:
# https://github.com/pysal/esda/blob/master/esda/lee.py
import numpy
from scipy import sparse
from scipy import stats
from sklearn.base import BaseEstimator
import pysal.lib as lp


class losh(BaseEstimator):
    """Local spatial heteroscedasticity (LOSH)"""

    def __init__(self, connectivity=None, inference=None, standardization=None):
        """
        Initialize a losh estimator

        Arguments
        ---------
        connectivity: scipy.sparse matrix object
                      the connectivity structure describing the relationships
                      between observed units.
        standardization: str
                         defines whether or not the user wants
                         row-standardized weights or abstract weights. options
                         are "row" or "abstract".
        inference: str
                   describes type of inference to be used. options are
                   "chi-square", "permutation", or "simulation".

        Attributes
        ----------
        Hi: numpy array
            Array of LOSH values for each spatial unit.    
        ylag: array
              Spatially lagged y values.
        yresid: array
                Spatially lagged residual values.
        VarHi: array
               Variance of Hi.
        pval: numpy array
              P-values for inference based on either "chi-square", 
              "permutation", or "simulation" approaches.
        """
        
        self.connectivity = connectivity
        self.inference = inference
        self.standardization = standardization

    def fit(self, y, a=None, standardization=None):
        """
        Arguments
        ---------
        y       :   numpy.ndarray
                    array containing continuous data
        a       :   int
                    residual multiplier. Default is 2 in order to generate a
                    variance measure. Users may use 1 for absolute deviations.

        Returns
        -------
        the fitted estimator.

        Notes
        -----
        Technical details and derivations can be found in :cite:`OrdGetis2012`.
        """
        y = np.asarray(y).flatten()

        if self.standardization is None:
            print("Warning: No standardization specified, row-standardization assumed.")
            w = self.connectivity
            w.transform = 'r'
        elif self.standardization == "row":
            w = self.connectivity
            w.transform = 'r'
        elif self.standardization == "abstract":
            w = self.connectivity

        self.Hi, self.ylag, self.yresid, self.VarHi = self._statistic(y, w, a)
        
        if self.inference is None:
            return self
        elif self.inference == "chi-square":
            print("Note: chi-square inference selected. This assumes a=2.")
            dof = 2/self.VarHi
            Zi = (2*self.Hi)/self.VarHi
            self.pval = 1 - stats.chi2.cdf(Zi, dof)

        return self

    @staticmethod
    def _statistic(y, w, a):
        # Define what type of variance to use
        if a is None:
            a = 2
        else:
            a = a
        
        transform = w.get_transform()
        
        # Calculate spatial lag (mean when row standardized, sum when not)
        ylag = lp.weights.lag_spatial(w, y)
        # Calculate residuals of y values
        yresid = y-ylag
        # Adjust residuals based on multiplier
        yresid = abs(yresid)**a
        
        # Calculate average of residuals (used as
        # denominator in $H_{i}$ calculation and 
        # in variance alculation)
        yresid_mean = np.mean(yresid)
        # Carry out final $H_{i}$ calculation by dividing
        # spatial average of residuals by mean of residuals
        lag_resid = lp.weights.lag_spatial(w, yresid)
        
        # Define denominator of $H_i$ calculation.
        # If row standardized, use mean of residuals.
        if transform == "R":
            denom = np.mean(yresid)
        # If not row standardized, use sum of lagged y values.
        else:
            denom = ylag
        
        # Calculate Hi
        Hi = lag_resid/denom
        
        # Calculate variance
        n = len(y)
        # Calculate VarHi
        VarHi =    ((len(y)-1)**-1) * \
                   ((yresid_mean)**-2) * \
                   ((np.sum(yresid**2)/n) - yresid_mean**2) * \
                   ((n * (1/np.array(list(w.cardinalities.values())))) - \
                   [np.sum(list(w[y].values())) for y in range(len(y))])       

        return (Hi, ylag, yresid, VarHi)