# GSoC Progress Phase II Demonstration Notebook
*This notebook demonstrates progress on the GSoC 2020 project entitled [PySAL ESDA Enhancements: Local join count and LOSH statistics](https://docs.google.com/document/d/1WjHjy5Eyk4WG5QWfnsnhWg1r4-e09JXXCx0iaPphg6c/edit).*

All proposed estimators have been implemented with docstrings and doctests, and styled according to the PEP 8 style guide. A [PR is open](https://github.com/pysal/esda/pull/139) documenting these proposed contributions. A GSoC call on 7/24/2020 outlined several additional tasks to keep the contribution momentum going:

- Updating LOSH inference to the new `_crand()` engine
- Drafting univariate and multivariate local Geary estimators

I've made incremental progress on bullets one and two, but I have yet to start on bullet three. Progress on bullets one and two is demonstrated below.

## Updating LOSH inference to the new `_crand()` engine

We first import the most up-to-date version of the `losh` function and some other relevant modules. 

In [1]:
from esda.losh import losh
import numpy as np
import geopandas as gpd
import libpysal.weights as lp

We then run the `losh` function on some trial data, specifically the Denver Housing Dataset. 

In [2]:
denver = gpd.read_file('https://github.com/jeffcsauer/GSOC2020/raw/master/validation/data/denver/denver.gpkg')
y_denver = denver['HU_RENTED']
# Create weights
wq_denver = lp.Queen.from_dataframe(denver)

In [3]:
test_losh = losh(connectivity=wq_denver, inference="chi-square").fit(y_denver)

We can then print out the LOSH $H_i$ values...

In [4]:
test_losh.Hi

array([0.72862202, 0.60570534, 1.27499573, 1.27775056, 3.66426323,
       0.20630709, 0.30039163, 3.22878969, 2.47324962, 6.21885436,
       4.89325502, 4.93389736, 1.94911049, 0.52298053, 0.55981765,
       0.82850515, 1.05341417, 0.84461904, 0.60987348, 0.94527555,
       4.26593154, 0.96302429, 1.0423338 , 0.25133172, 0.17036824,
       0.42732169, 2.11051377, 1.53535997, 1.23655419, 0.27353575,
       0.13635485, 1.83236117, 1.52523996, 0.25140425, 0.30772211,
       0.47047662, 0.97574338, 0.09293957, 0.52861436, 0.40511641,
       0.56719311, 0.26454396, 0.21123503, 0.15851932, 1.47641653,
       0.53764716, 0.52337802, 0.39966269, 0.53409657, 0.04247725,
       2.13130198, 0.16518512, 1.18980213, 0.05127779, 5.30026563,
       0.60634197, 1.96356402, 1.3692009 , 1.01438307, 0.47150821,
       0.13384698, 0.11567471, 0.28335191, 0.65958817, 0.38815222,
       0.25514561, 0.21688526, 0.31786797, 0.03497591, 0.41814267,
       1.014082  , 0.15674392, 0.23448769, 0.21170868, 0.08105

As well as the LOSH $\chi^2$ p-values...

In [5]:
test_losh.pval

array([0.43800098, 0.63274819, 0.27976395, 0.27963721, 0.01625474,
       0.83193401, 0.82488061, 0.02148205, 0.06894109, 0.00291876,
       0.00977939, 0.00569095, 0.1621828 , 0.69043069, 0.55252597,
       0.47777914, 0.3427617 , 0.43831955, 0.55728303, 0.39508798,
       0.00776009, 0.40896109, 0.34640792, 0.79661973, 0.89210175,
       0.70368663, 0.14544505, 0.21350185, 0.29456901, 0.68194318,
       0.85051163, 0.17564698, 0.20623918, 0.75327493, 0.5891479 ,
       0.70259935, 0.36919055, 0.84387688, 0.29230521, 0.53346689,
       0.54853048, 0.78647001, 0.78542879, 0.87055297, 0.22952724,
       0.56473036, 0.60826038, 0.72309133, 0.65862816, 0.94533262,
       0.14343363, 0.82457905, 0.30101181, 0.97508279, 0.00674329,
       0.58648503, 0.13595911, 0.25396347, 0.28256421, 0.60294959,
       0.852825  , 0.74525482, 0.8372008 , 0.42308931, 0.65533261,
       0.82871472, 0.85737389, 0.65067648, 0.95392359, 0.67596931,
       0.36792767, 0.59472128, 0.37904896, 0.65643836, 0.78716
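For readers who want to see what `Hi` measures without loading the Denver data, here is a minimal dense-matrix sketch of the LOSH statistic with exponent $a = 2$. The ring-shaped toy weights, the data values, and the helper name `losh_hi_dense` are assumptions for illustration; this is not the `esda.losh` implementation.

```python
import numpy as np


def losh_hi_dense(y, W, a=2):
    """Toy LOSH H_i from a dense weights matrix (illustrative only)."""
    y = np.asarray(y, dtype=float)
    rowsum = W.sum(axis=1)
    # Local mean of each unit's neighbours
    ylag = (W @ y) / rowsum
    # Local residuals raised to the power a
    resid = np.abs(y - ylag) ** a
    # Spatially weighted residuals, scaled by their global mean
    return (W @ resid) / (resid.mean() * rowsum)


# Hypothetical ring of six units, each linked to its two neighbours
n = 6
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = 0.5
    W[i, (i + 1) % n] = 0.5

y = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0])
Hi = losh_hi_dense(y, W)
print(Hi.round(3))
```

Because this toy `W` is symmetric and row-standardized, the `Hi` values average to one, which makes it a convenient sanity check when comparing implementations.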

The `Hi` values and p-values from `test_losh` will serve as a baseline for comparison with the values generated in the numba transition. We next load the modules needed for `_crand()` computation.

In [6]:
from esda.crand import (
    crand as _crand_plus,
    njit as _njit,
    _prepare_univariate
)

We then write some numba-compatible code to calculate $H_i$:

In [7]:
@_njit(fastmath=True)
def _losh(i, z, permuted_ids, weights_i, scaling):
    zi, zrand = _prepare_univariate(i, z, permuted_ids, weights_i)
    
    # Working
    rowsum = np.sum(weights_i)
    # Working
    ylag = (zrand @ weights_i)
    # Working
    yresid = ((zi-ylag)*-1)**2
    # Not working exactly? Should be a fixed value?
    denom = np.mean(yresid) * rowsum
    # Run _prepare_univariate again to get spatial lag of residuals
    yresid_i, yresid_rand = _prepare_univariate(i, yresid, permuted_ids, weights_i)
    # Run final Hi Calculation
    Hi = (yresid_rand @ weights_i) / denom
        
    return Hi

Most of these individual lines are behaving as expected in that they produce similar values to the non-numba code. If we pass the above numba function through `_crand_plus()` we are principally interested in looking at the `rHi` values.

In [8]:
# Note: forcing z and observed into arrays - need to incorporate this into above function?
# Standardized: wq_denver
# Non-standardized: wq_denver_ns
p_sim, rHi = _crand_plus(z=np.array(y_denver), w=wq_denver, observed=np.array(test_losh.Hi), 
            permutations=999, keep=True, n_jobs=1, 
            stat_func=_losh)
print(p_sim)
print(rHi)

[0.447 0.007 0.044 0.227 0.001 0.001 0.001 0.001 0.003 0.001 0.001 0.001
 0.021 0.013 0.141 0.444 0.28  0.43  0.123 0.324 0.001 0.321 0.309 0.001
 0.001 0.008 0.003 0.028 0.022 0.143 0.001 0.012 0.07  0.01  0.1   0.003
 0.367 0.001 0.404 0.168 0.066 0.001 0.001 0.001 0.103 0.076 0.037 0.002
 0.005 0.001 0.024 0.001 0.149 0.001 0.001 0.095 0.001 0.023 0.206 0.093
 0.001 0.004 0.001 0.439 0.021 0.001 0.001 0.04  0.001 0.023 0.12  0.346
 0.359 0.107 0.043]
[[1.25056273 0.70102753 0.27614659 ... 1.73634114 2.36023747 1.26879172]
 [0.78768735 1.24853582 0.92181837 ... 0.90223035 0.94639712 1.16559047]
 [0.78794952 1.22519175 0.9157881  ... 0.89658239 0.95294932 1.15697598]
 ...
 [0.07719271 1.51467175 0.51201383 ... 1.00240849 0.44995257 1.00240849]
 [0.82279984 0.199793   0.22131767 ... 1.13951941 0.23211767 0.56425032]
 [0.76804103 0.45402607 0.13292046 ... 1.11928031 0.11735213 0.50438961]]


With each run the above values will change, but we are principally interested in the second array (`rHi`) and the extent to which its values align with `test_losh.Hi`. They are largely similar. The `p_sim` values are irrelevant here as they are based on a different calculation.
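The run-to-run variation comes from conditional randomization: unit $i$ keeps its own value while its neighbours' values are redrawn from the other observations. A minimal pure-NumPy sketch of that idea follows. The function name `conditional_sims`, the toy data, and the spatial-lag statistic are assumptions, and the sketch does not reproduce the internals of `esda.crand`.

```python
import numpy as np


def conditional_sims(z, i, weights_i, stat, permutations=99, seed=None):
    """Simulate stat for unit i while holding z[i] fixed (illustrative only)."""
    rng = np.random.default_rng(seed)
    z = np.asarray(z, dtype=float)
    # Candidate neighbour values: every observation except unit i itself
    others = np.delete(np.arange(z.size), i)
    sims = np.empty(permutations)
    for p in range(permutations):
        # Draw as many pseudo-neighbours as unit i really has
        draw = rng.choice(others, size=len(weights_i), replace=False)
        sims[p] = stat(z[i], z[draw], weights_i)
    return sims


def lag_stat(zi, zrand, weights_i):
    # Hypothetical statistic: the weighted average of the drawn values
    return zrand @ weights_i


z = np.array([0.2, 1.5, -0.7, 0.9, -1.1, 0.4])
w0 = np.array([0.5, 0.5])  # unit 0 has two equally weighted neighbours
sims_a = conditional_sims(z, 0, w0, lag_stat, permutations=99, seed=1)
sims_b = conditional_sims(z, 0, w0, lag_stat, permutations=99, seed=2)
print(sims_a[:5], sims_b[:5])
```

Changing `seed` changes the simulated values, which mirrors the behaviour of `rHi` between runs.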

This is where I have paused in the migration as I'm not exactly sure how to proceed. Input is welcome!

To my understanding, there are a few options from here. Given these conditionally randomized values, I could simply take the mean of each unit's simulated values and use it as the input to the existing chi-square p-value calculation, such as:

In [9]:
sim = np.transpose(rHi)
Hi_sim = sim.mean(axis=0)
dof = 2/test_losh.VarHi
Zi = (2*Hi_sim)/test_losh.VarHi
from scipy import stats
pval = 1 - stats.chi2.cdf(Zi, dof)
pval

array([0.33797649, 0.40640287, 0.41475853, 0.36211762, 0.38267292,
       0.36707143, 0.39764617, 0.40814591, 0.42970941, 0.39377397,
       0.3884992 , 0.4027361 , 0.3973533 , 0.43637409, 0.39746542,
       0.45274732, 0.40004736, 0.44308364, 0.43089687, 0.43593081,
       0.45348343, 0.4530607 , 0.38813549, 0.43701048, 0.44518581,
       0.45728489, 0.44168431, 0.45253213, 0.44133208, 0.41879683,
       0.39423354, 0.40371639, 0.33696717, 0.39530092, 0.388648  ,
       0.44903091, 0.39374151, 0.40111497, 0.23921197, 0.38685679,
       0.3794951 , 0.40360719, 0.3911361 , 0.42072171, 0.41247489,
       0.38819604, 0.4207816 , 0.43804191, 0.44380882, 0.38869435,
       0.45279709, 0.40153868, 0.39772761, 0.44021033, 0.45934452,
       0.4536327 , 0.46588733, 0.41882124, 0.3582769 , 0.44938826,
       0.45176776, 0.43964764, 0.48741048, 0.43042849, 0.44656606,
       0.47086906, 0.47753528, 0.42930866, 0.45724734, 0.47346122,
       0.47741915, 0.47172698, 0.2948364 , 0.45589874, 0.47758
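Written out, the cell above plugs the mean of the simulated values into the same chi-square approximation as the existing calculation, as read directly from the code. The symbols $\bar{H}_i$, $\nu_i$ and $F$ are notation introduced here for illustration.

```latex
\nu_i = \frac{2}{\operatorname{Var}(H_i)}, \qquad
Z_i = \frac{2\,\bar{H}_i}{\operatorname{Var}(H_i)}, \qquad
p_i = 1 - F_{\chi^2_{\nu_i}}(Z_i)
```

Here $\bar{H}_i$ is the mean of the simulated values for unit $i$, $F_{\chi^2_{\nu_i}}$ is the chi-square CDF with $\nu_i$ degrees of freedom, and $\operatorname{Var}(H_i)$ is still the `VarHi` attribute from the original fit.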

However, I'm not sure this is appropriate as it uses the original `VarHi`. This is especially noticeable when comparing the p-values between those from the original (`og`) and conditionally-randomized (`cr`) approach:

In [10]:
import pandas as pd
# Build the comparison frame with explicitly labelled columns
corrdf = pd.DataFrame({'pval_og': test_losh.pval, 'pval_cr': pval})
corrdf['pval_og'].corr(corrdf['pval_cr'])

0.21993687049346328


Attempts to calculate `VarHi` in the `_crand()` engine are proving a bit tricky, and the results deviate quite far from the observed `VarHi` values. Moreover, the above method is quite different from the bootstrap method proposed by [Xu et al. 2014](https://link.springer.com/article/10.1007%2Fs00168-014-0605-5) and implemented in the `R` function `spdep::LOSH.mc`. I think I need to examine both the paper and [this section of the `R` code](https://github.com/r-spatial/spdep/blob/master/R/LOSH.mc.R#L71-L86) more closely to understand its implementation.
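Another option would be to skip `VarHi` entirely and rank each observed `Hi` against its own row of simulated values. The sketch below shows that generic upper-tail pseudo p-value on made-up numbers. It is an assumption about one possible route, not the bootstrap of Xu et al. and not what `spdep::LOSH.mc` does, and the name `pseudo_pvalues` is hypothetical.

```python
import numpy as np


def pseudo_pvalues(observed, sims):
    """Upper-tail pseudo p-values from an (n, permutations) array of simulations."""
    observed = np.asarray(observed, dtype=float)
    sims = np.asarray(sims, dtype=float)
    permutations = sims.shape[1]
    # Count simulated values at least as large as the observed one, per unit
    larger = (sims >= observed[:, None]).sum(axis=1)
    # Add one to numerator and denominator so the observed value counts as a draw
    return (larger + 1) / (permutations + 1)


# Made-up example: three units, nine simulated values each
sims = np.tile(np.arange(1.0, 10.0), (3, 1))
observed = np.array([0.5, 5.0, 10.0])
print(pseudo_pvalues(observed, sims))  # [1.  0.6 0.1]
```

With `rHi` shaped as one row per unit, the same call would take `test_losh.Hi` and `rHi` as its two arguments, though whether an upper-tail test is the right choice for LOSH remains an open question.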

## Local Geary statistics

While the migration of `losh()` has been a bit rough, the Local Geary statistics are proceeding along quite nicely. I have started a workbook (available [here](https://github.com/jeffcsauer/GSOC2020/blob/master/review/Local_Geary_Workbook.ipynb)) where I work through the calculations and start constructing the functions. Presently the functions are returning correct Local Geary values in both the univariate and multivariate case. However, I have yet to work on inference. I am hoping to tackle that in the first week of August.
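As a quick reference for the statistic itself, the univariate Local Geary for unit $i$ is $c_i = \sum_j w_{ij}(z_i - z_j)^2$ on standardized values. A minimal dense-matrix sketch follows. The four-unit line of toy data and the name `local_geary_dense` are assumptions for illustration and are separate from the adjacency-list code in the workbook.

```python
import numpy as np


def local_geary_dense(x, W):
    """Toy univariate Local Geary from a dense weights matrix."""
    x = np.asarray(x, dtype=float)
    # Standardize with the population standard deviation (NumPy's default)
    z = (x - x.mean()) / x.std()
    # Squared difference between every focal unit i and every unit j
    sqdiff = (z[:, None] - z[None, :]) ** 2
    # Keep only neighbouring pairs, weighted, and sum over j
    return (W * sqdiff).sum(axis=1)


# Hypothetical line of four units with row-standardized contiguity weights
W = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0, 0.0],
])
x = np.array([1.0, 2.0, 4.0, 8.0])
c = local_geary_dense(x, W)
print(c.round(3))
```

Units whose neighbours hold very different values (here the last unit, next to the jump from 4 to 8) receive the largest values.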

### Local Geary Univariate

In [11]:
import numpy as np
import pandas as pd
import warnings
from scipy import sparse
from scipy import stats
from sklearn.base import BaseEstimator
import libpysal as lp
from esda.crand import (
    crand as _crand_plus,
    njit as _njit,
    _prepare_univariate
)



class Local_Geary(BaseEstimator):
    """Local Geary - Univariate"""

    def __init__(self, connectivity=None, inference=None):
        """
        Initialize a Local_Geary estimator

        Arguments
        ---------
        connectivity     : libpysal.weights.W object
                           the connectivity structure describing the
                           relationships between observed units.
        inference        : str
                           describes type of inference to be used. options are
                           "chi-square" or "permutation" methods.

        Attributes
        ----------
        localG           : numpy array
                           Array of Local Geary values for each spatial unit.
        pval             : numpy array
                           P-values for inference based on either
                           "chi-square" or "permutation" methods.
        """

        self.connectivity = connectivity
        self.inference = inference

    def fit(self, x):
        """
        Arguments
        ---------
        x                : numpy.ndarray
                           array containing continuous data

        Returns
        -------
        the fitted estimator.

        Notes
        -----
        Technical details and derivations can be found in :cite:`Anselin1995`.

        Examples
        --------
        Guerry data, replicating the GeoDa tutorial
        >>> import libpysal
        >>> import geopandas as gpd
        >>> guerry = libpysal.examples.load_example('Guerry')
        >>> guerry_ds = gpd.read_file(guerry.get_path('Guerry.shp'))
        >>> w = libpysal.weights.Queen.from_dataframe(guerry_ds)
        """
        x = np.asarray(x).flatten()

        w = self.connectivity
        w.transform = 'r'

        self.localG = self._statistic(x, w)

        if self.inference is None:
        #   self.p_sim, self.rjoins = _crand_plus(
        #       z=self.x, 
        #       w=self.w, 
        #       observed=self.localG,
        #       permutations=permutations, 
        #       keep=True, 
        #       n_jobs=n_jobs,
        #       stat_func=_local_geary
        #   )
        #   
            print("No inference selected.")
        else:
            raise NotImplementedError(f'The requested inference method \
            ({self.inference}) is not currently supported!')

        return self

    @staticmethod
    def _statistic(x, w):
        # Calculate z-scores for x
        zscore_x = (x - np.mean(x))/np.std(x)
        # Create focal (zi) and neighbor (zj) values
        adj_list = w.to_adjlist(remove_symmetric=False)
        # Use the weights object passed in, not the notebook-level `wq`
        zseries = pd.Series(zscore_x, index=w.id_order)
        zi = zseries.loc[adj_list.focal].values
        zj = zseries.loc[adj_list.neighbor].values
        # Carry out local Geary calculation
        gs = sum(list(w.weights.values()), []) * (zi-zj)**2
        # Reorganize data
        adj_list_gs = pd.DataFrame(adj_list.focal.values, gs).reset_index()
        adj_list_gs.columns = ['gs', 'ID']
        adj_list_gs = adj_list_gs.groupby(by='ID').sum()
        
        localG = adj_list_gs.gs.values
        
        return (localG)

# --------------------------------------------------------------
# Conditional Randomization Function Implementations
# --------------------------------------------------------------

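@_njit(fastmath=True)
def _local_geary(i, z, permuted_ids, weights_i, scaling):
    # Draft: not yet called from fit(), where inference is still commented out
    zi, zrand = _prepare_univariate(i, z, permuted_ids, weights_i)
    # Weighted squared differences (Local Geary), not a Moran-style cross-product
    return (zi - zrand)**2 @ weights_i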

Following the GeoDa web example, we can apply the `Local_Geary` estimator to the `Donatns` column of the `Guerry` dataset.

In [12]:
import libpysal as lp
import geopandas as gpd
from scipy import stats
import numpy as np
guerry = lp.examples.load_example('Guerry')
guerry_ds = gpd.read_file(guerry.get_path('Guerry.shp'))
wq = lp.weights.Queen.from_dataframe(guerry_ds)
x = guerry_ds['Donatns']

In [13]:
functest = Local_Geary(connectivity=wq).fit(x)
functest.localG

No inference selected.


array([1.82087039e-01, 5.60014026e-01, 9.75294606e-01, 2.15906938e-01,
       6.17372564e-01, 3.84450059e-02, 2.43181756e-01, 9.71802819e-01,
       4.06447101e-02, 7.24722785e-01, 6.30952854e-02, 2.42104497e-02,
       1.59496916e+01, 9.29326006e-01, 9.65188634e-01, 1.32383286e+00,
       3.31775497e-01, 2.99446505e+00, 9.43946814e-01, 2.99570159e+00,
       3.66702291e-01, 2.09592365e+00, 1.46515861e+00, 1.82118455e-01,
       3.10216680e+00, 5.43063937e-01, 5.74532559e+00, 4.79160197e-02,
       1.58993089e-01, 7.18327253e-01, 1.24297849e+00, 8.72629331e-02,
       7.52809650e-01, 4.56515485e-01, 3.86766562e-01, 1.17632604e-01,
       6.90884685e-01, 2.87206102e+00, 4.10455112e-01, 4.04349959e-01,
       1.14211758e-01, 9.59519953e-01, 3.51347976e-01, 7.30240974e-01,
       4.40370938e-01, 7.20360356e-02, 1.66241706e+00, 5.83258909e+00,
       2.30332507e-01, 4.38369688e-01, 8.41461470e-01, 1.52959486e+00,
       4.32157479e-02, 2.08325903e+00, 1.19722984e+00, 1.28169257e+00,
      

### Local Geary Multivariate

In [14]:
import numpy as np
import pandas as pd
import warnings
from scipy import sparse
from scipy import stats
from sklearn.base import BaseEstimator
import libpysal as lp


class Local_Geary_MV(BaseEstimator):
    """Local Geary - Univariate"""

    def __init__(self, connectivity=None, inference=None):
        """
        Initialize a Local_Geary_MV estimator

        Arguments
        ---------
        connectivity     : libpysal.weights.W object
                           the connectivity structure describing the
                           relationships between observed units.
        inference        : str
                           describes type of inference to be used. options are
                           "chi-square" or "permutation" methods.

        Attributes
        ----------
        localG           : numpy array
                           Array of Local Geary values for each spatial unit.
        pval             : numpy array
                           P-values for inference based on either
                           "chi-square" or "permutation" methods.
        """

        self.connectivity = connectivity
        self.inference = inference

    def fit(self, variables):
        """
        Arguments
        ---------
        variables        : list of array-like
                           one sequence of continuous data per variable

        Returns
        -------
        the fitted estimator.

        Notes
        -----
        Technical details and derivations can be found in :cite:`Anselin1995`.

        Examples
        --------
        Guerry data, replicating the GeoDa tutorial
        >>> import libpysal
        >>> import geopandas as gpd
        >>> guerry = libpysal.examples.load_example('Guerry')
        >>> guerry_ds = gpd.read_file(guerry.get_path('Guerry.shp'))
        >>> w = libpysal.weights.Queen.from_dataframe(guerry_ds)
        """
        self.variables = np.array(variables, dtype='float')

        w = self.connectivity
        w.transform = 'r'

        self.localG = self._statistic(variables, w)

        if self.inference is None:
            return self
        #elif self.inference == 'chi-square':
        #    if a != 2:
        #        warnings.warn(f'Chi-square inference assumes that a=2, but \
        #        a={a}. This means the inference will be invalid!')
        #    else:
        #        dof = 2/self.VarHi
        #        Zi = (2*self.Hi)/self.VarHi
        #        self.pval = 1 - stats.chi2.cdf(Zi, dof)
        else:
            raise NotImplementedError(f'The requested inference method \
            ({self.inference}) is not currently supported!')

        return self

    @staticmethod
    def _statistic(variables, w):
        # Calculate z-scores for input variables
        zseries = [stats.zscore(i) for i in variables]
        # Define denominator adjustment
        k = len(variables)
        # Create focal and neighbor values
        adj_list = w.to_adjlist(remove_symmetric=False)
        # Use the weights object passed in, not the notebook-level `wq`
        zseries = [pd.Series(i, index=w.id_order) for i in zseries]
        focal = [zseries[i].loc[adj_list.focal].values for
                 i in range(len(variables))]
        neighbor = [zseries[i].loc[adj_list.neighbor].values for
                    i in range(len(variables))]
        # Carry out local Geary calculation
        gs = sum(list(w.weights.values()), []) * \
            (np.array(focal) - np.array(neighbor))**2
        # Reorganize data
        temp = pd.DataFrame(gs).T
        temp['ID'] = adj_list.focal.values
        adj_list_gs = temp.groupby(by='ID').sum()
        localG = adj_list_gs.sum(axis=1)/k
        
        return (localG)

As before, we can apply this to `Donatns` and `Suicids`.

In [15]:
x = guerry_ds['Donatns']
y = guerry_ds['Suicids']

In [16]:
functest = Local_Geary_MV(connectivity=wq).fit([x,y])
functest.localG

ID
0     0.153819
1     0.303560
2     2.954720
3     0.123140
4     0.387960
        ...   
80    1.657430
81    0.525764
82    0.645337
83    0.717948
84    0.216181
Length: 85, dtype: float64
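The multivariate version averages the per-variable Local Geary values, matching the division by `k` in `_statistic` above. A compact dense-matrix sketch on toy data follows. The name `local_geary_mv_dense` and the data values are hypothetical, and the sketch is not the estimator's adjacency-list implementation.

```python
import numpy as np


def local_geary_mv_dense(variables, W):
    """Toy multivariate Local Geary: mean of per-variable statistics."""
    total = np.zeros(W.shape[0])
    for v in variables:
        v = np.asarray(v, dtype=float)
        # Standardize each variable separately
        z = (v - v.mean()) / v.std()
        # Weighted squared differences between each unit and its neighbours
        total += (W * (z[:, None] - z[None, :]) ** 2).sum(axis=1)
    # Divide by the number of variables, as in the draft estimator
    return total / len(variables)


# Hypothetical line of four units with row-standardized contiguity weights
W = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0, 0.0],
])
x = np.array([1.0, 2.0, 4.0, 8.0])
y = np.array([3.0, 3.5, 1.0, 0.5])
c_xy = local_geary_mv_dense([x, y], W)
c_xx = local_geary_mv_dense([x, x], W)
print(c_xy.round(3))
```

Passing the same variable twice (`c_xx`) returns the univariate result, which is a handy consistency check between the two estimators.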