# Introduction

This is the validation notebook for the PySAL implementation of the univariate local join count (LJC) statistic. It begins with a brief review of the univariate LJC and a manual calculation of its values on a 'toy' dataset. We then introduce the PySAL implementation, the `Local_Join_Count` function, and compare its output to the results of the manual calculation on the 'toy' dataset. Finally, we compare the PySAL `Local_Join_Count` function to results from the external program `GeoDa` on an external dataset. As of now, calculations of inference are not included in the function.

1. [Review of the LJC statistic](#Review)
2. [Manual calculations on a 'toy' dataset](#Toy)
3. [Implementation of Local_Join_Count function](#LJC)
4. [Application of Local_Join_Count function on the 'toy' dataset](#LJCToy)
5. [Application of Local_Join_Count function on 'real world' datasets](#LJCRealWorld)

## Review of the LJC statistic <a name="Review"></a>

To review, global join counts focus on the total number of joins between adjacent observations that share a given value, taken across the entire study area. For black-black (1-1) joins this is represented as $BB$:

$$BB = \sum_{i} \sum_j w_{ij} x_{i} x_{j}$$

Of particular interest to us is the number of local black-black (1-1) join counts. This is represented as $BB_i$:

$$BB_i = x_i \sum_{j} w_{ij} x_j$$

...where $BB_i$ counts the neighbors with an observation of $x_j=1$ for those locations where $x_i=1$; for locations where $x_i=0$ the statistic is 0. This focuses on the BB counts of a given polygon ($x_i$).
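As a minimal worked sketch of the two formulas, consider a hypothetical row of four cells in which each cell is joined only to its immediate left and right neighbors. The array `x` and the matrix `W` below are illustrative assumptions and are not part of the notebook's data.

```python
import numpy as np

# Hypothetical example: four cells in a row, binary weights between immediate neighbors
x = np.array([1, 1, 0, 1])
W = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])

# Number of neighbors equal to 1 for every cell: sum_j w_ij * x_j
neighbor_ones = W @ x

# Local statistic: the neighbor count is kept only where the focal cell is 1
BB_i = x * neighbor_ones

# Global statistic as written above: the double sum over i and j
BB = x @ W @ x

print(BB_i)  # [1 1 0 0]
print(BB)    # 2
```

Cell 2 has two neighbors equal to 1 but is itself 0, and cell 3 is 1 but has no neighbor equal to 1, so both receive a local count of 0. With symmetric weights the double sum visits each join from both ends, which is why the local values sum to $BB$ as defined here.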

## Manual calculations on a 'toy' dataset <a name="Toy"></a>

We now create a small 'toy' dataset to illustrate the local join counts. This toy dataset is a 4x4 lattice grid filled with 1s. We then alter the first 8 values to 0. This effectively looks like:

|   |   |   |   |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 |

In [1]:
import numpy as np
import libpysal
import pandas as pd

# Create a 4x4 grid (16 cells)
w = libpysal.weights.lat2W(4, 4)
y_1 = np.ones(16)
# Set the first 8 of the ones to 0
y_1[0:8] = 0
print('new y_1', y_1)

new y_1 [0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1.]


For a given cell of the above table, we are interested in the adjacent grid cells that are equal to 1. We can find these through the use of **binary weights**.

In [None]:
# Flatten the input vector y
y_1 = np.asarray(y_1).flatten()
# ensure weights are binary transformed
w.transformation = 'b'

How does PySAL identify these cells? Through an adjacency list. This creates a table (a pandas `DataFrame`) of unique focal ($i$) and neighbor ($j$) pairs. Setting `remove_symmetric=True` ensures that there are no duplicated (but reversed) adjacency pairs. This is a great shortcut when calculating global join counts.

In [2]:
adj_list = w.to_adjlist(remove_symmetric=True) 
print(adj_list)
print(w[0])

    focal  neighbor  weight
1       0         1     1.0
3       1         5     1.0
5       2         1     1.0
7       2         3     1.0
10      4         0     1.0
12      4         5     1.0
16      5         6     1.0
17      6         2     1.0
21      7         3     1.0
22      7         6     1.0
23      7        11     1.0
24      8         4     1.0
25      8        12     1.0
27      9         5     1.0
28      9         8     1.0
31     10         6     1.0
32     10         9     1.0
33     10        14     1.0
34     10        11     1.0
37     11        15     1.0
39     12        13     1.0
40     13         9     1.0
44     14        13     1.0
45     14        15     1.0
{4: 1.0, 1: 1.0}


From this list we can validate neighbors. For example, in our 4x4 grid, we know that the upper-left corner of the grid (`w[0]`) touches only its right and bottom neighbors (remember: we are using rook, not queen, contiguity in this example). Thus, the first weights entry captures these relationships, and they are reflected in the `adj_list` table (see the rows with index 1 `[0 1 1.0]` and index 10 `[4 0 1.0]`).

**However, the Local Join Count (LJC) calculation does not use `remove_symmetric=True`.** Keeping both directions of every pair allows us to identify the specific join counts for each area $i$.
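To see why both directions are needed, here is a minimal pure-Python sketch. It builds hypothetical rook pairs for a 4x4 grid with row-major ids (assumed to mirror `lat2W(4, 4)`) and removes reversed duplicates by keeping only pairs with `i < j`. That rule is an assumption made for illustration; libpysal may keep a different member of each pair.

```python
# Hypothetical 4x4 rook grid with row-major ids, plus the toy values used in this notebook
n_rows, n_cols = 4, 4
y = [0] * 8 + [1] * 8

# Every (focal, neighbor) pair, in both directions
directed = []
for i in range(n_rows * n_cols):
    r, c = divmod(i, n_cols)
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        rr, cc = r + dr, c + dc
        if 0 <= rr < n_rows and 0 <= cc < n_cols:
            directed.append((i, rr * n_cols + cc))

# Illustrative de-duplication rule (an assumption): keep one direction per join
undirected = [(i, j) for i, j in directed if i < j]

def bb_per_focal(pairs):
    """Count, for each focal id, the pairs in which both ends equal 1."""
    counts = [0] * (n_rows * n_cols)
    for i, j in pairs:
        if y[i] == 1 and y[j] == 1:
            counts[i] += 1
    return counts

full_counts = bb_per_focal(directed)
dedup_counts = bb_per_focal(undirected)

print(len(directed), len(undirected))  # 48 24
print(full_counts[8:])   # [2, 3, 3, 2, 2, 3, 3, 2]
print(dedup_counts[8:])  # [2, 2, 2, 1, 1, 1, 1, 0]
```

In the de-duplicated list each join is credited to only one of its two cells, so the per-cell totals undercount. The total of 10 is still the correct number of distinct 1-1 joins, which is why the shortcut suits the global statistic but not the local one.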

In [4]:
adj_list = w.to_adjlist(remove_symmetric=False) 
print(adj_list)

    focal  neighbor  weight
0       0         4     1.0
1       0         1     1.0
2       1         0     1.0
3       1         5     1.0
4       1         2     1.0
5       2         1     1.0
6       2         6     1.0
7       2         3     1.0
8       3         2     1.0
9       3         7     1.0
10      4         0     1.0
11      4         8     1.0
12      4         5     1.0
13      5         1     1.0
14      5         4     1.0
15      5         9     1.0
16      5         6     1.0
17      6         2     1.0
18      6         5     1.0
19      6        10     1.0
20      6         7     1.0
21      7         3     1.0
22      7         6     1.0
23      7        11     1.0
24      8         4     1.0
25      8        12     1.0
26      8         9     1.0
27      9         5     1.0
28      9         8     1.0
29      9        13     1.0
30      9        10     1.0
31     10         6     1.0
32     10         9     1.0
33     10        14     1.0
34     10        11 

We now mirror the existing implementation of `Join_Counts` to create objects that hold the observed value of the focal ($i$) cell and of the neighbor ($j$) cell for every row of the adjacency list.

In [7]:
zseries = pd.Series(y_1, index=w.id_order)
focal = zseries.loc[adj_list.focal].values
neighbor = zseries.loc[adj_list.neighbor].values

With these objects we can now identify where focal and neighbor values have the same 1 value. 

In [8]:
# Identify which adjacency lists are both equal to 1
BBs = (focal == 1) & (neighbor == 1)
BBs
# also convert to a 0/1 array
BBs.astype('uint8')

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1], dtype=uint8)

Now we need to map these values back to the adjacency list. By grouping on the focal IDs of the adjacency list (stored in an `ID` column), we can sum the agreements where the focal and neighbor cells both have a value of 1.

In [15]:
# Create a df that pairs each adjacency-list focal ID with its BB flag
manual = pd.DataFrame({'ID': adj_list.focal.values, 'BB': BBs.astype('uint8')})
# Sum the BB flags within each focal ID
manual = manual.groupby(by='ID').sum()
manual.BB.values

array([0, 0, 0, 0, 0, 0, 0, 0, 2, 3, 3, 2, 2, 3, 3, 2], dtype=uint64)

Let's do a visual comparison to the original table:

Original table

|   |   |   |   |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 |

Local Join Counts (univariate)

|   |   |   |   |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 |
| 2 | 3 | 3 | 2 |
| 2 | 3 | 3 | 2 |

This makes sense! For example, look at the bottom right corner. This has a value of 1 and has two neighbors with a value of 1, so the $BB_i$ of that bottom right corner is 2. 
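The same table can be cross-checked directly against the formula $BB_i = x_i \sum_j w_{ij} x_j$. The sketch below builds a dense rook adjacency matrix by hand with NumPy; the matrix `W` is an assumption intended to mirror `lat2W(4, 4)` with row-major ids, not an object taken from libpysal.

```python
import numpy as np

n_rows, n_cols = 4, 4
n = n_rows * n_cols

# Hand-built rook adjacency (assumed equivalent to lat2W(4, 4) with binary weights)
W = np.zeros((n, n), dtype=int)
for i in range(n):
    r, c = divmod(i, n_cols)
    if c + 1 < n_cols:   # join to the cell on the right
        W[i, i + 1] = W[i + 1, i] = 1
    if r + 1 < n_rows:   # join to the cell below
        W[i, i + n_cols] = W[i + n_cols, i] = 1

# Same toy vector as above: first 8 cells are 0, last 8 are 1
y = np.ones(n, dtype=int)
y[0:8] = 0

# Local join counts in matrix form
BB_i = y * (W @ y)
print(BB_i.reshape(n_rows, n_cols))
# [[0 0 0 0]
#  [0 0 0 0]
#  [2 3 3 2]
#  [2 3 3 2]]
```

The matrix form avoids the adjacency list altogether, so it serves as an independent check on the grouped sum computed earlier.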

## Implementation of Local_Join_Count function <a name="LJC"></a>

The above manual calculations are implemented in the function called `Local_Join_Count`. The functions are being developed in a working notebook, so note that the code below is likely to change over time. The following cell loads the `Local_Join_Count` function from `migration.ipynb` (available in the [jeffcsauer/GSOC2020/scratch](https://github.com/jeffcsauer/GSOC2020/tree/master/scratch) GitHub work journal).

In [10]:
%%capture
import nbimporter
import os
os.chdir("C:/Users/jeffe/Dropbox/GSOC2020/scratch/")
import migration
%run migration.ipynb

## Application of Local_Join_Count function on the 'toy' dataset <a name="LJCToy"></a>

In [11]:
# Recreate test data that mirrors above
# Create a 4x4 grid (16 cells)
w = libpysal.weights.lat2W(4, 4)
y_1 = np.ones(16)
# Set the first 8 of the ones to 0
y_1[0:8] = 0
print('new y_1', y_1)

new y_1 [0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1.]


In [12]:
# Run function on weights and y vector
test_results = Local_Join_Count(connectivity=w).fit(y_1)
test_results.BB_

array([0, 0, 0, 0, 0, 0, 0, 0, 2, 3, 3, 2, 2, 3, 3, 2], dtype=uint64)

Compare output of `Local_Join_Count` function to the manually-calculated `LJC` from above.

In [17]:
test_results.BB_ == manual.BB.values

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True])

All values match.
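For an automated check, the element-wise comparison can be collapsed into a single pass/fail result. This is a minimal sketch using stand-in arrays copied from the outputs printed above; `function_bb` and `manual_bb` are hypothetical names standing in for `test_results.BB_` and `manual.BB.values`.

```python
import numpy as np

# Stand-in arrays copied from the outputs shown above
function_bb = np.array([0, 0, 0, 0, 0, 0, 0, 0, 2, 3, 3, 2, 2, 3, 3, 2], dtype=np.uint64)
manual_bb = np.array([0, 0, 0, 0, 0, 0, 0, 0, 2, 3, 3, 2, 2, 3, 3, 2], dtype=np.uint64)

# Single boolean: True only if shapes and all elements agree
all_match = np.array_equal(function_bb, manual_bb)
print(all_match)  # True

# Raises an AssertionError with a readable diff if any element differs
np.testing.assert_array_equal(function_bb, manual_bb)
```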

## Application of Local_Join_Count function on 'real world' datasets <a name="LJCRealWorld"></a>

### Comparison to GeoDa output

Given the recent (2019) release of Local Join Counts, the primary implementation of the statistic is in the GeoDa program. Here we compare the results from the PySAL `Local_Join_Count` function to the results of GeoDa.

### Comparison to Anselin and Li 2019 appendix datasets