# Introduction

This notebook validates the PySAL implementation of the univariate local join count (LJC) statistic. It begins with a brief review of the univariate LJC and a manual calculation of its values on a 'toy' dataset. We then introduce the PySAL implementation, the `Local_Join_Count` function, and compare its output to the results of the manual calculation on the 'toy' dataset. Finally, we compare the PySAL `Local_Join_Count` function to `GeoDa` results on an external dataset. As of now, calculations of inference are not included in the function.

1. [Review of the LJC statistic](#Review)
2. [Manual calculations on a 'toy' dataset](#Toy)
3. [Implementation of Local_Join_Count function](#LJC)
4. [Application of Local_Join_Count function on the 'toy' dataset](#LJCToy)
5. [Application of Local_Join_Count function on 'real world' datasets](#LJCRealWorld)

## Review of the LJC statistic <a name="Review"></a>

To review, global join counts tally, across the entire study area, the joins between adjacent observations that share a given value. For 1-1 (black-black) joins, this is represented as $BB$:

$$BB = \sum_{i} \sum_j w_{ij} x_{i} x_{j}$$

Of particular interest to us is the number of local black-black (1-1) joins at each location. This is represented as $BB_i$:

$$BB_i = x_i \sum_{j} w_{ij} x_j$$

Here $BB_i$ counts the neighbors $j$ with $x_j=1$ for a location $i$ where $x_i=1$; if $x_i=0$, then $BB_i=0$. The statistic therefore focuses on the BB joins of a given polygon ($x_i$).
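
As a minimal sketch of the two formulas above, the following NumPy snippet builds a dense binary rook weights matrix for a small lattice by hand and evaluates $BB_i$ and the double sum $BB$ directly. The helper `rook_weights` is hypothetical and is not part of PySAL; it only serves to make the matrix $w_{ij}$ explicit.

```python
import numpy as np

def rook_weights(nrows, ncols):
    """Dense binary rook-contiguity matrix for a regular lattice (illustrative helper)."""
    n = nrows * ncols
    W = np.zeros((n, n), dtype=int)
    for r in range(nrows):
        for c in range(ncols):
            i = r * ncols + c
            if c + 1 < ncols:          # right-hand neighbor
                W[i, i + 1] = W[i + 1, i] = 1
            if r + 1 < nrows:          # neighbor below
                W[i, i + ncols] = W[i + ncols, i] = 1
    return W

W = rook_weights(4, 4)
x = np.array([0] * 8 + [1] * 8)       # top two rows are 0, bottom two rows are 1

BB_i = x * (W @ x)                    # BB_i = x_i * sum_j w_ij x_j
BB = x @ W @ x                        # double sum over i and j

print(BB_i.reshape(4, 4))
print(BB)
```

Because this $W$ is symmetric, the double sum visits every join from both ends, so `BB` equals the sum of the `BB_i` values (20 here), which is twice the number of distinct 1-1 joins (10).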

## Manual calculations on a 'toy' dataset <a name="Toy"></a>

We now create a small 'toy' dataset to illustrate the local join counts. This toy dataset is a 4x4 lattice grid filled with 1s. We then alter the first 8 values to 0. This effectively looks like:

|   |   |   |   |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 |

In [1]:
import numpy as np
import libpysal
import pandas as pd

# Create a 4x4 grid (16 cells)
w = libpysal.weights.lat2W(4, 4)
y_1 = np.ones(16)
# Set the first 8 of the ones to 0
y_1[0:8] = 0
print('new y_1', y_1)

new y_1 [0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1.]


For a given cell of the above table, we are interested in the adjacent grid cells that are equal to 1. We can find these through the use of **binary weights**.

In [2]:
# Flatten the input vector y
y_1 = np.asarray(y_1).flatten()
# ensure weights are binary transformed
w.transformation = 'b'

How does PySAL identify these cells? Through an adjacency list. This creates a table of unique focal ($i$) and neighbor ($j$) pairs. Setting `remove_symmetric=True` ensures that there are no duplicated (but reversed) adjacency pairs. This is a great shortcut when calculating global join counts.

In [3]:
adj_list = w.to_adjlist(remove_symmetric=True) 
print(adj_list)
print(w[0])

    focal  neighbor  weight
1       0         1     1.0
3       1         5     1.0
5       2         1     1.0
7       2         3     1.0
10      4         0     1.0
12      4         5     1.0
16      5         6     1.0
17      6         2     1.0
21      7         3     1.0
22      7         6     1.0
23      7        11     1.0
24      8         4     1.0
25      8        12     1.0
27      9         5     1.0
28      9         8     1.0
31     10         6     1.0
32     10         9     1.0
33     10        14     1.0
34     10        11     1.0
37     11        15     1.0
39     12        13     1.0
40     13         9     1.0
44     14        13     1.0
45     14        15     1.0
{4: 1.0, 1: 1.0}


From this list we can validate neighbors. For example, in our 4x4 grid, we know that the upper-left corner of the grid (`w[0]`) only touches its right and bottom neighbors (remember: we are not using queen contiguity in this example). Thus, the first weights entry captures these relationships, and they are reflected in the `adj_list` table (see the rows with index 1 [0 1 1.0] and index 10 [4 0 1.0]).

**However, the Local Join Count (LJC) calculation does not use `remove_symmetric=True`.** Keeping both directions of every pair allows us to identify the specific join counts for each area $i$.

In [4]:
adj_list = w.to_adjlist(remove_symmetric=False) 
print(adj_list)

    focal  neighbor  weight
0       0         4     1.0
1       0         1     1.0
2       1         0     1.0
3       1         5     1.0
4       1         2     1.0
5       2         1     1.0
6       2         6     1.0
7       2         3     1.0
8       3         2     1.0
9       3         7     1.0
10      4         0     1.0
11      4         8     1.0
12      4         5     1.0
13      5         1     1.0
14      5         4     1.0
15      5         9     1.0
16      5         6     1.0
17      6         2     1.0
18      6         5     1.0
19      6        10     1.0
20      6         7     1.0
21      7         3     1.0
22      7         6     1.0
23      7        11     1.0
24      8         4     1.0
25      8        12     1.0
26      8         9     1.0
27      9         5     1.0
28      9         8     1.0
29      9        13     1.0
30      9        10     1.0
31     10         6     1.0
32     10         9     1.0
33     10        14     1.0
34     10        11 
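
To see why the symmetric pairs matter for a local statistic, consider this small pure-Python sketch on a hand-built 2x2 rook lattice. The filter `i < j` only imitates the effect of `remove_symmetric=True`; it is an assumption made for illustration and not libpysal's actual implementation.

```python
# Rook neighbors on a 2x2 lattice, cells numbered 0 1 / 2 3 (hand-built for illustration).
neighbors = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}

# Full list: every join appears twice, once from each end.
full = [(i, j) for i, js in neighbors.items() for j in js]

# Deduplicated list: one row per unordered pair (imitates remove_symmetric=True).
dedup = [(i, j) for (i, j) in full if i < j]

def rows_per_focal(pairs, n=4):
    """Count how many adjacency rows list each cell as the focal unit."""
    counts = [0] * n
    for focal, _ in pairs:
        counts[focal] += 1
    return counts

print(len(full), rows_per_focal(full))    # 8 [2, 2, 2, 2]
print(len(dedup), rows_per_focal(dedup))  # 4 [2, 1, 1, 0]
```

With the deduplicated list, cell 3 never appears as a focal unit, so a per-focal sum would miss its joins. The full list keeps every cell's neighbors attached to that cell.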

We now mirror the existing implementation of `Join_Counts` to create objects that hold the values of the focal ($i$) and neighbor ($j$) cells for every row of the adjacency list.

In [5]:
zseries = pd.Series(y_1, index=w.id_order)
focal = zseries.loc[adj_list.focal].values
neighbor = zseries.loc[adj_list.neighbor].values

With these objects we can now identify the rows where the focal and neighbor values are both equal to 1.

In [6]:
# Identify which adjacency lists are both equal to 1
BBs = (focal == 1) & (neighbor == 1)
BBs
# also convert to a 0/1 array
BBs.astype('uint8')

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1], dtype=uint8)

Now we need to map these values to the adjacency list. By grouping on the focal 'ID' column built from the adjacency list, we can get the sum of agreements where focal and neighbor values are both equal to 1.

In [7]:
# Create a df that uses the adjacency list focal values and the BBs counts
manual = pd.DataFrame(adj_list.focal.values, BBs.astype('uint8')).reset_index()
# Temporarily rename the columns
manual.columns = ['BB', 'ID']
manual = manual.groupby(by='ID').sum()
manual.BB.values

array([0, 0, 0, 0, 0, 0, 0, 0, 2, 3, 3, 2, 2, 3, 3, 2], dtype=uint64)
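
The grouping step can also be written without the DataFrame round trip. The sketch below is one possible alternative, not the code used in `Local_Join_Count`: it sums the 0/1 agreement flags per focal ID with `np.bincount`. The arrays `focal_ids`, `neighbor_ids` and `y` are made up for illustration, and the approach assumes integer IDs running from 0 to n-1.

```python
import numpy as np

# Three cells in a row (0 - 1 - 2), with each join listed in both directions.
focal_ids = np.array([0, 1, 1, 2])
neighbor_ids = np.array([1, 0, 2, 1])
y = np.array([1, 1, 0])

# Flag the rows where both ends of the join equal 1.
bb_flags = (y[focal_ids] == 1) & (y[neighbor_ids] == 1)

# Sum the flags per focal ID; minlength keeps cells that have no 1-1 joins.
bb_counts = np.bincount(
    focal_ids, weights=bb_flags.astype(float), minlength=len(y)
).astype(int)
print(bb_counts)  # [1 1 0]
```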

Let's do a visual comparison to the original table:

Original table

|   |   |   |   |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 |

Local Join Counts (univariate)

|   |   |   |   |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 |
| 2 | 3 | 3 | 2 |
| 2 | 3 | 3 | 2 |

This makes sense! For example, look at the bottom right corner. This has a value of 1 and has two neighbors with a value of 1, so the $BB_i$ of that bottom right corner is 2. 

## Implementation of Local_Join_Count function <a name="LJC"></a>

The above manual calculations are implemented in the file `local_join_count.py` (available in the [jeffcsauer/GSOC2020/scratch](https://github.com/jeffcsauer/GSOC2020/tree/master/functions) GitHub work journal).

In [8]:
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator
import libpysal

PERMUTATIONS = 999

class Local_Join_Count(BaseEstimator):

    """Local Join Count Statistic"""

    def __init__(self, connectivity=None, permutations=PERMUTATIONS):
        """
        Initialize a Local_Join_Count estimator
        Arguments
        ---------
        connectivity:   libpysal.weights.W object
                        the connectivity structure describing the relationships
                        between observed units. Need not be row-standardized.
        permutations:   int
                        number of random permutations used for the pseudo
                        p-values
        Attributes
        ----------
        BB:  numpy.ndarray (n,)
             array containing the local join count for each observation
        """

        self.connectivity = connectivity
        self.permutations = permutations

    def fit(self, y, permutations=999):
        """
        Arguments
        ---------
        y       :   numpy.ndarray
                    array containing binary (0/1) data
        Returns
        -------
        the fitted estimator.
        Notes
        -----
        Technical details and derivations found in :cite:`AnselinLi2019`.
        """
        y = np.asarray(y).flatten()
        
        w = self.connectivity
        # Binary weights are needed for this statistic
        w.transformation = 'b'
        
        self.y = y
        self.n = len(y)
        self.w = w
        
        self.BB = self._statistic(y, w)
        
        if permutations:
            self._crand()
            sim = np.transpose(self.rjoins)
            above = sim >= self.BB
            larger = above.sum(0)
            low_extreme = (self.permutations - larger) < larger
            larger[low_extreme] = self.permutations - larger[low_extreme]
            # 1 - simulated p-value? or just the simulated p-value?
            # values of 0.001 seem to be NA or error?
            # Use self.permutations here: it is the number of simulations
            # that _crand actually draws.
            self.p_sim = (larger + 1.0) / (self.permutations + 1.0)

        # Need the >>> return self to get the associated .BB attribute
        # (significance in future, i.e. self.reference_distribution_ in lee.py)
        return self

    @staticmethod
    def _statistic(y, w):
        # Create adjacency list. Note that remove_symmetric=False - this is
        # different from the esda.Join_Counts() function.
        adj_list = w.to_adjlist(remove_symmetric=False)
        zseries = pd.Series(y, index=w.id_order)
        focal = zseries.loc[adj_list.focal].values
        neighbor = zseries.loc[adj_list.neighbor].values
        BB = (focal == 1) & (neighbor == 1)
        adj_list_BB = pd.DataFrame(adj_list.focal.values,
                                   BB.astype('uint8')).reset_index()
        adj_list_BB.columns = ['BB', 'ID']
        adj_list_BB = adj_list_BB.groupby(by='ID').sum()
        BB = adj_list_BB.BB.values
        return (BB)
    
    def _crand(self):
        """
        conditional randomization

        for observation i with ni neighbors,  the candidate set cannot include
        i (we don't want i being a neighbor of i). we have to sample without
        replacement from a set of ids that doesn't include i. numpy doesn't
        directly support sampling wo replacement and it is expensive to
        implement this. instead we omit i from the original ids,  permute the
        ids and take the first ni elements of the permuted ids as the
        neighbors to i in each randomization.

        """
        # converted z to y
        # renamed lisas to joins
        y = self.y
        n = len(y)
        joins = np.zeros((self.n, self.permutations))
        n_1 = self.n - 1
        prange = list(range(self.permutations))
        k = self.w.max_neighbors + 1
        nn = self.n - 1
        rids = np.array([np.random.permutation(nn)[0:k] for i in prange])
        ids = np.arange(self.w.n)
        ido = self.w.id_order
        w = [self.w.weights[ido[i]] for i in ids]
        wc = [self.w.cardinalities[ido[i]] for i in ids]

        for i in range(self.w.n):
            idsi = ids[ids != i]
            np.random.shuffle(idsi)
            tmp = y[idsi[rids[:, 0:wc[i]]]]
            joins[i] = y[i] * (w[i] * tmp).sum(1)
        self.rjoins = joins
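
The conditional randomization described in the `_crand` docstring can be illustrated for a single location. The following is a simplified sketch under stated assumptions, not the method above: it draws neighbors with `numpy.random.Generator.choice(..., replace=False)` instead of permuting IDs, it computes a one-sided pseudo p-value without the folding step used in `fit`, and the function name `conditional_pseudo_p` and its `seed` argument are hypothetical.

```python
import numpy as np

def conditional_pseudo_p(y, i, neighbor_ids, permutations=999, seed=0):
    """One-sided pseudo p-value for the local join count at location i (illustrative)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    observed = y[i] * y[neighbor_ids].sum()
    others = np.delete(np.arange(len(y)), i)       # candidate set excludes i itself
    k = len(neighbor_ids)
    sims = np.empty(permutations)
    for p in range(permutations):
        # Draw k pseudo-neighbors without replacement from the other locations.
        drawn = rng.choice(others, size=k, replace=False)
        sims[p] = y[i] * y[drawn].sum()
    # Share of simulated counts at least as large as the observed one.
    p_sim = (np.sum(sims >= observed) + 1) / (permutations + 1)
    return observed, p_sim

# Toy data from above; cell 9 has rook neighbors 5, 8, 10 and 13.
y = np.array([0] * 8 + [1] * 8)
obs, p = conditional_pseudo_p(y, i=9, neighbor_ids=[5, 8, 10, 13])
print(obs, p)
```

For a location with $y_i = 0$, every simulated count is 0, so this one-sided version returns 1.0. The folding step in `fit` would map the same case to 0.001, which may explain the 0.001 values flagged in the code comments.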

## Application of Local_Join_Count function on the 'toy' dataset <a name="LJCToy"></a>

In [9]:
# Recreate test data that mirrors above
# Create a 4x4 grid (16 cells)
w = libpysal.weights.lat2W(4, 4)
y_1 = np.ones(16)
# Set the first 8 of the ones to 0
y_1[0:8] = 0
print('new y_1', y_1)

new y_1 [0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1.]


In [10]:
y_1.sum()
len(y_1)

16

In [11]:
# Run function on weights and y vector
toy_results = Local_Join_Count(connectivity=w).fit(y_1)

Compare output of `Local_Join_Count` function to the manually-calculated `LJC` from above.

In [12]:
toy_results.BB == manual.BB.values

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True])

All values match.

## Application of Local_Join_Count function on 'real world' datasets <a name="LJCRealWorld"></a>

Ideally, we would compare the output to the values from the original Anselin and Li 2019 paper. However, the example use cases in Anselin and Li 2019 do not provide full tables of LJC values and associated p-values against which to confirm equivalency. Thus, we compare the results from the PySAL implementation of `Local_Join_Count` to the output from GeoDa using a GeoDa example dataset. Specifically, we use the [Baltimore Housing Sales dataset](https://geodacenter.github.io/data-and-lab/baltim/) and focus on the 'dwell' binary variable.

### Comparison to GeoDa output

We first load in the Baltimore Housing Sales dataset.

In [13]:
import geopandas as gpd
balt = gpd.read_file('https://github.com/jeffcsauer/GSOC2020/raw/master/validation/data/baltimore/baltimore_housing.gpkg')
balt.head()

Unnamed: 0,station,price,nroom,dwell,nbath,patio,firepl,ac,bment,nstor,gar,age,citcou,lotsz,sqft,x,y,geometry
0,1.0,47.0,4.0,0.0,1.0,0.0,0.0,0.0,2.0,3.0,0.0,148.0,0.0,5.7,11.25,907.0,534.0,POINT (907.000 534.000)
1,2.0,113.0,7.0,1.0,2.5,1.0,1.0,1.0,2.0,2.0,2.0,9.0,1.0,279.51,28.92,922.0,574.0,POINT (922.000 574.000)
2,3.0,165.0,7.0,1.0,2.5,1.0,1.0,0.0,3.0,2.0,2.0,23.0,1.0,70.64,30.62,920.0,581.0,POINT (920.000 581.000)
3,4.0,104.3,7.0,1.0,2.5,1.0,1.0,1.0,2.0,2.0,2.0,5.0,1.0,174.63,26.12,923.0,578.0,POINT (923.000 578.000)
4,5.0,62.5,7.0,1.0,1.5,1.0,1.0,0.0,2.0,2.0,0.0,19.0,1.0,107.8,22.04,918.0,574.0,POINT (918.000 574.000)


Isolate the variable of interest.

In [14]:
y_balt = balt['dwell']

When working with points in PySAL, we need to arrange them into a list of (x, y) coordinates from which a KD-tree can be built. Thus we extract the x and y columns of the Baltimore dataset.

In [15]:
# libpysal was already imported above
points = list(zip(balt['x'], balt['y']))
kd = libpysal.cg.KDTree(np.array(points))

We need to recreate the weights used in the GeoDa analysis. The weight scheme used was a k-nearest neighbor (knn) approach, using 5 neighbors.

In [16]:
balt_knn5 = libpysal.weights.KNN(kd, k=5)
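
To make the weighting scheme concrete, here is a brute-force sketch of k-nearest-neighbor selection in plain NumPy, followed by the local join count under binary kNN weights. It is only illustrative: the `knn_indices` helper, the coordinates and the binary variable are made up, and this is not how `libpysal.weights.KNN` is implemented.

```python
import numpy as np

def knn_indices(coords, k):
    """Indices of the k nearest other points for every point (brute force)."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)      # a point is not its own neighbor
    return np.argsort(dist, axis=1)[:, :k]

# Five made-up points on a line and a made-up binary variable.
coords = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)]
y = np.array([1, 1, 0, 1, 1])

nbrs = knn_indices(coords, k=2)
BB_i = y * y[nbrs].sum(axis=1)          # BB_i with binary kNN weights
print(nbrs)
print(BB_i)
```

Note that kNN weights need not be symmetric: in this example point 2 is one of the two nearest neighbors of point 3, but point 3 is not among the two nearest neighbors of point 2.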

We can now apply our PySAL `Local_Join_Count` function. 

In [17]:
test_results = Local_Join_Count(connectivity=balt_knn5).fit(y_balt)
test_results.BB

array([0, 4, 5, 5, 4, 4, 5, 4, 2, 5, 4, 3, 3, 0, 0, 0, 0, 0, 0, 3, 3, 0,
       2, 4, 3, 4, 3, 4, 5, 3, 1, 5, 5, 5, 0, 4, 4, 0, 5, 2, 0, 4, 4, 5,
       4, 4, 4, 5, 5, 4, 4, 4, 0, 5, 4, 4, 4, 5, 2, 3, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 4, 0, 0, 0, 0, 2, 2, 0, 0, 3, 0, 3, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 3, 2, 3, 3, 3, 3, 0, 2, 0, 2, 3, 0, 3, 0,
       0, 2, 4, 3, 5, 3, 1, 0, 0, 3, 0, 0, 4, 2, 3, 2, 2, 3, 0, 2, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 0, 0, 0, 1, 4, 5,
       5, 4, 5, 0, 4, 5, 0, 4, 5, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 4, 3, 0, 0, 3, 0, 3, 3, 4, 3, 4, 4, 4, 2, 0, 3, 3, 3,
       3, 0, 0, 0, 2, 2, 1, 0, 0, 0, 0, 0, 0], dtype=uint64)

Now let's read in the results from GeoDa analysis.

In [18]:
# Load GeoDa analysis results
GeoDa_LJC = pd.read_csv('https://github.com/jeffcsauer/GSOC2020/raw/master/validation/data/baltimore/balt_knn_5_LJC_univariate.csv')
GeoDa_LJC.head()

Unnamed: 0,station,price,nroom,dwell,nbath,patio,firepl,ac,bment,nstor,gar,age,citcou,lotsz,sqft,x,y,JC,NN,PP_VAL
0,1,47.0,4.0,0.0,1.0,0.0,0.0,0.0,2.0,3.0,0.0,148.0,0.0,5.7,11.25,907.0,534.0,0,5,
1,2,113.0,7.0,1.0,2.5,1.0,1.0,1.0,2.0,2.0,2.0,9.0,1.0,279.51,28.92,922.0,574.0,4,5,0.252
2,3,165.0,7.0,1.0,2.5,1.0,1.0,0.0,3.0,2.0,2.0,23.0,1.0,70.64,30.62,920.0,581.0,5,5,0.046
3,4,104.3,7.0,1.0,2.5,1.0,1.0,1.0,2.0,2.0,2.0,5.0,1.0,174.63,26.12,923.0,578.0,5,5,0.05
4,5,62.5,7.0,1.0,1.5,1.0,1.0,0.0,2.0,2.0,0.0,19.0,1.0,107.8,22.04,918.0,574.0,4,5,0.219


Compare the PySAL LJC results to the GeoDa LJC results. Because the number of comparisons is somewhat high (n=211), we tabulate the results.

In [19]:
results = test_results.BB == GeoDa_LJC['JC']
results.value_counts()

True    211
Name: JC, dtype: int64

All 211 elements have the same LJC between the PySAL implementation and the GeoDa results.

Next, compare the p-values at face value. They will not match exactly, since the PySAL values come from random permutations.

In [20]:
test_results.p_sim[0:10]

array([0.001, 0.233, 0.044, 0.041, 0.223, 0.237, 0.037, 0.201, 0.147,
       0.033])

In [21]:
np.array(GeoDa_LJC.PP_VAL[0:10])

array([  nan, 0.252, 0.046, 0.05 , 0.219, 0.247, 0.048, 0.207, 0.165,
       0.043])

Differences between the PySAL and GeoDa p-values for the first ten observations:

In [22]:
test_results.p_sim[0:10] - np.array(GeoDa_LJC.PP_VAL[0:10])

array([   nan, -0.019, -0.002, -0.009,  0.004, -0.01 , -0.011, -0.006,
       -0.018, -0.01 ])
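
A few NaN-aware summaries make the comparison easier to read than the raw differences. This sketch simply re-enters the ten values printed above; the variable names `pysal_p` and `geoda_p` are illustrative.

```python
import numpy as np

# First ten pseudo p-values, copied from the two outputs above.
pysal_p = np.array([0.001, 0.233, 0.044, 0.041, 0.223,
                    0.237, 0.037, 0.201, 0.147, 0.033])
geoda_p = np.array([np.nan, 0.252, 0.046, 0.05, 0.219,
                    0.247, 0.048, 0.207, 0.165, 0.043])

diff = pysal_p - geoda_p
max_abs_gap = np.nanmax(np.abs(diff))     # largest absolute gap, ignoring NaN
mean_gap = np.nanmean(diff)               # average signed gap, ignoring NaN
n_missing = int(np.isnan(geoda_p).sum())  # entries where GeoDa reports no p-value

print(round(max_abs_gap, 3), round(mean_gap, 3), n_missing)
```

For these ten observations the largest absolute gap is 0.019 and the average signed gap is about -0.009, so the PySAL values are slightly smaller on average. The single NaN is the first observation, where GeoDa reports no p-value.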

# Next steps

- Develop inference 
- Ensure docstrings include all relevant information and formatting
- When constructing demonstration notebooks, ensure that notebooks are as standalone as possible (for GeoDa comparison make sure to include all relevant files in case the user does not want to download the program)