### Walkthrough of the SpSSIM Analysis & Python Script
**Jessica Embury, San Diego State University, Fall 2022**

Reference Literature: *Jin, C., Nara, A., Yang, J.-A., Tsou, M.-H. (2020). Similarity measurement on human mobility data with spatially weighted structural similarity index (SpSSIM). Transactions in GIS, 24, 104-122. [https://doi.org/10.1111/tgis.12590](https://doi.org/10.1111/tgis.12590)*

#### Explanation of Python files
- *util.py*: general functions that support the analysis, but are not specific to it
- *matrixes.py*: functions for the generation of analysis matrixes (flow probabilities, weights)
- *analysis.py*: functions for calculation of SpSSIM values
- *main.py*: file path variables and script to run the analysis

#### 1. Create a distance matrix with a row and column for every Census Block Group (CBG).
Each value is the distance between the row CBG and the column CBG in kilometers (km).
**Note:** This function uses a cross join and takes a long time to run!

In [16]:
import pandas as pd
from matrixes import create_distance_matrix

cbg_shapefile = './data/source_data/CENSUS_BLOCKGROUPTIGER2010/CENSUS_BLOCKGROUPTIGER2010.shp'
bg_csv = './data/source_data/cbgs_2019.csv'
distance_matrix_csv = './data/cbg_distance_matrix.csv'

distance_matrix = create_distance_matrix(cbg_shapefile, bg_csv, distance_matrix_csv)
distance_matrix.head(3)

Unnamed: 0,cbg_orig,60730001001,60730001002,60730002011,60730002021,60730002022,60730002023,60730003001,60730003002,60730003003,...,60730216002,60730218001,60730218002,60730219001,60730219002,60730220001,60730220002,60730221001,60730221002,60730221003
0,60730001001,0.0,0.0,0.0,0.4,0.0,0.5,0.9,1.4,1.7,...,7.4,6.6,5.6,8.9,9.3,10.6,11.3,40.9,40.0,38.6
1,60730001002,0.0,0.0,0.3,0.3,0.0,0.5,1.2,1.6,1.9,...,7.1,6.2,5.2,8.7,9.2,10.6,11.4,41.0,40.1,38.7
2,60730002011,0.0,0.3,0.0,0.1,0.0,0.0,0.1,0.5,0.8,...,7.5,6.8,5.7,8.5,8.8,9.9,10.6,41.3,40.4,39.1
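
The cross join in step 1 can be illustrated on a toy dataset. This is a minimal sketch, not the code inside `create_distance_matrix`: the centroid coordinates, the `toy_` names, and the use of centroid-to-centroid Euclidean distance in a projected coordinate system are all assumptions made for illustration. The zeros between neighboring CBGs in the output above suggest the actual script may measure distance between polygon boundaries instead.

```python
import numpy as np
import pandas as pd

# Hypothetical centroids in a projected CRS (units: meters); not real CBG data.
centroids = pd.DataFrame({
    'cbg': ['A', 'B', 'C'],
    'x': [0.0, 3000.0, 0.0],
    'y': [0.0, 4000.0, 10000.0],
})

# A cross join pairs every origin with every destination (n * n rows),
# which is why this step is slow for thousands of CBGs.
pairs = centroids.merge(centroids, how='cross', suffixes=('_orig', '_dest'))

# Euclidean distance, converted from meters to kilometers and rounded to 0.1 km.
pairs['dist_km'] = (
    np.hypot(pairs['x_orig'] - pairs['x_dest'],
             pairs['y_orig'] - pairs['y_dest']) / 1000
).round(1)

# Reshape the long table into a square origin-by-destination matrix.
toy_distance_matrix = pairs.pivot(index='cbg_orig', columns='cbg_dest',
                                  values='dist_km')
print(toy_distance_matrix)
```

The number of rows in `pairs` grows with the square of the number of CBGs, so saving the result to CSV once and reusing it avoids repeating the expensive step.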


#### 2. Download Origin-Destination (O-D) data and save as .CSV files.
Each record in these tables contains an origin, a destination, and a mobility value (e.g., number of jobs, device count). Below is a sample of the LODES 2019 O-D table.

In [23]:
od_table_csv = './data/source_data/lodes_od_2019.csv'
od_df = pd.read_csv(od_table_csv)
print('The LODES 2019 O-D table has {} rows.'.format(len(od_df)))
od_df.head(3)

The LODES 2019 O-D table has 428273 rows.


Unnamed: 0,cbg_orig,cbg_dest,num_jobs
0,60730001001,60730001001,3
1,60730001001,60730001002,12
2,60730001001,60730002011,2


#### 3. Convert the O-D table into an O-D matrix with one row for each origin CBG and one column for each destination CBG.

In [17]:
from matrixes import odtable2matrix

od_raw_matrix_csv = './data/raw_matrix/sd_lodes_2019_raw_matrix.csv'
od_matrix = odtable2matrix(od_table_csv, od_raw_matrix_csv, 'cbg_orig', 'cbg_dest', 'num_jobs')
od_matrix.head(3)

./data/source_data/lodes_od_2019.csv
O-D table has 428273 rows.
Pivot table has 1794 rows and 1794 columns. 0 differences: []


cbg_orig,60730001001,60730001002,60730002011,60730002021,60730002022,60730002023,60730003001,60730003002,60730003003,60730003004,...,60730216002,60730218001,60730218002,60730219001,60730219002,60730220001,60730220002,60730221001,60730221002,60730221003
60730001001,3.0,12.0,2.0,0.0,0.0,2.0,0.0,0.0,2.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,2.0,0.0
60730001002,14.0,18.0,6.0,3.0,0.0,1.0,0.0,0.0,6.0,0.0,...,2.0,0.0,0.0,1.0,0.0,0.0,0.0,7.0,0.0,0.0
60730002011,0.0,0.0,37.0,0.0,1.0,0.0,0.0,2.0,3.0,1.0,...,0.0,2.0,0.0,14.0,1.0,0.0,0.0,2.0,0.0,0.0
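
The table-to-matrix conversion in step 3 can be sketched with a pandas pivot. This is a rough illustration under assumptions, not the body of `odtable2matrix`: the toy records and `toy_` names are hypothetical, and reindexing to the union of origins and destinations is one possible way to guarantee a square matrix.

```python
import pandas as pd

# Hypothetical O-D records; IDs and counts are illustrative only.
toy_od = pd.DataFrame({
    'cbg_orig': ['A', 'A', 'B', 'C'],
    'cbg_dest': ['A', 'B', 'C', 'A'],
    'num_jobs': [3, 12, 5, 2],
})

# Use the union of origins and destinations so rows and columns match.
all_cbgs = sorted(set(toy_od['cbg_orig']) | set(toy_od['cbg_dest']))

# Pivot the long table into a wide matrix; missing O-D pairs become 0.
toy_od_matrix = (
    toy_od.pivot_table(index='cbg_orig', columns='cbg_dest',
                       values='num_jobs', aggfunc='sum', fill_value=0)
    .reindex(index=all_cbgs, columns=all_cbgs, fill_value=0)
)
print(toy_od_matrix)
```

Keeping the row and column IDs in the same order matters later, because the weights matrixes are applied cell by cell.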


#### 4. Convert the O-D matrix into a flow probabilities matrix with one row for each origin CBG and one column for each destination CBG.
Each value is now the probability that flow leaving the row's origin CBG goes to the column's destination CBG, so each row sums to 1.

In [18]:
from matrixes import create_probability_matrix

flow_probs_matrix_csv = './data/flow_probabilities/sd_lodes_2019_flow_probabilities_matrix.csv'
flow_probs_matrix = create_probability_matrix(od_raw_matrix_csv, flow_probs_matrix_csv, 'cbg_orig')
flow_probs_matrix.head(3)

Raw matrix has 1794 rows. Probabilities matrix has 1794 rows.
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1.]


cbg_orig,60730001001,60730001002,60730002011,60730002021,60730002022,60730002023,60730003001,60730003002,60730003003,60730003004,...,60730218001,60730218002,60730219001,60730219002,60730220001,60730220002,60730221001,60730221002,60730221003,row_total
60730001001,0.006834,0.027335,0.004556,0.0,0.0,0.004556,0.0,0.0,0.004556,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.009112,0.004556,0.0,1.0
60730001002,0.025135,0.032316,0.010772,0.005386,0.0,0.001795,0.0,0.0,0.010772,0.0,...,0.0,0.0,0.001795,0.0,0.0,0.0,0.012567,0.0,0.0,1.0
60730002011,0.0,0.0,0.047558,0.0,0.001285,0.0,0.0,0.002571,0.003856,0.001285,...,0.002571,0.0,0.017995,0.001285,0.0,0.0,0.002571,0.0,0.0,1.0
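
The row normalization in step 4 amounts to dividing each cell by its row total. The sketch below uses hypothetical counts and `toy_` names. Leaving an origin with no outgoing flow as a row of zeros is an assumption, since the walkthrough data has no such rows.

```python
import numpy as np
import pandas as pd

# Hypothetical raw O-D counts (rows = origins, columns = destinations).
toy_counts = pd.DataFrame(
    [[3, 12, 5],
     [0, 0, 0],
     [2, 0, 2]],
    index=['A', 'B', 'C'], columns=['A', 'B', 'C'],
)

row_totals = toy_counts.sum(axis=1)

# Divide each row by its total. Rows with a zero total would divide by zero,
# so they are set to NaN first and then filled with 0 (an assumption).
toy_probs = toy_counts.div(row_totals.replace(0, np.nan), axis=0).fillna(0.0)

# Sanity check comparable to the row sums printed in the output above.
print(toy_probs.sum(axis=1).to_numpy())
```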


#### 5. Choose the distance bins.
A local SpSSIM will be calculated for the origin-destination pairs that fall inside each distance bin. The example below uses 10 km bins from 0 to 120 km.


In [19]:
distance_bins = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50), (50, 60), (60, 70), (70, 80), (80, 90), (90, 100), (100, 110), (110, 120)]
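
For other bin widths, the list can be generated rather than typed out. A small sketch, where `bin_width_km`, `max_distance_km`, and `toy_bins` are hypothetical names not used by the script:

```python
bin_width_km = 10       # width of each distance bin in km
max_distance_km = 120   # upper bound of the last bin in km

# Build consecutive (lower, upper) tuples covering 0 to max_distance_km.
toy_bins = [(lo, lo + bin_width_km)
            for lo in range(0, max_distance_km, bin_width_km)]
print(toy_bins)
```

With these two values the result matches the hand-written `distance_bins` list above.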

#### 6. Create a weight matrix for each distance bin.
Matrix cells equal 1 if the distance between the origin and destination CBGs is within the distance bin's range. Otherwise, the cell values equal 0.


In [20]:
from matrixes import create_weights_matrix

weights_matrix_csv = './data/weights_10km_bins/weights_matrix_{}km_{}km.csv'
for b in distance_bins:
    df = create_weights_matrix(distance_matrix_csv, weights_matrix_csv, b)

df.head(3)

Unnamed: 0,cbg_orig,60730001001,60730001002,60730002011,60730002021,60730002022,60730002023,60730003001,60730003002,60730003003,...,60730216002,60730218001,60730218002,60730219001,60730219002,60730220001,60730220002,60730221001,60730221002,60730221003
0,60730001001,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,0,0,0,0,0
1,60730001002,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,0,0,0,0,0
2,60730002011,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,0,0,0,0
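
The binary weights in step 6 can be sketched as a boolean mask over the distance matrix. The toy distances and the `toy_weights` helper are hypothetical. Treating each bin as a half-open interval (lower bound included, upper bound excluded) is an assumption, and `create_weights_matrix` may handle the boundaries differently.

```python
import pandas as pd

# Hypothetical distances in km between three CBGs.
toy_dist = pd.DataFrame(
    [[0.0, 5.0, 10.0],
     [5.0, 0.0, 6.7],
     [10.0, 6.7, 0.0]],
    index=['A', 'B', 'C'], columns=['A', 'B', 'C'],
)

def toy_weights(dist, distance_bin):
    """Return 1 where lower <= distance < upper, else 0 (half-open: assumption)."""
    lower, upper = distance_bin
    return ((dist >= lower) & (dist < upper)).astype(int)

w_0_10 = toy_weights(toy_dist, (0, 10))
w_10_20 = toy_weights(toy_dist, (10, 20))
print(w_0_10)
print(w_10_20)
```

With half-open bins, a pair exactly 10 km apart lands in the (10, 20) bin only, so every pair is counted in exactly one bin.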


#### 7. Use 2 probability flow matrixes and the weights matrixes to calculate the SpSSIM.
Behind-the-scenes helper functions calculate the mean and variance of each probability flow matrix, the covariance of the two matrixes, and the local SpSSIM value for each distance bin. A results table with these values is generated and saved.


In [21]:
from analysis import calc_global_spssim

results_csv = './data/spssim_results/lodes2019_lodes2019.csv'
spssim, results = calc_global_spssim(flow_probs_matrix_csv, flow_probs_matrix_csv, weights_matrix_csv, results_csv, 'cbg_orig', distance_bins)

./data/flow_probabilities/sd_lodes_2019_flow_probabilities_matrix.csv ./data/flow_probabilities/sd_lodes_2019_flow_probabilities_matrix.csv 1.0
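
To make the per-bin calculation concrete, the sketch below applies the standard structural similarity (SSIM) formula to the cells selected by one weights matrix. It is an illustration under assumptions, not the code in *analysis.py*: the `toy_local_ssim` helper and the toy matrixes are hypothetical, the binary weights are used as a simple mask, and the exact weighting in the script and in Jin et al. (2020) may differ.

```python
import numpy as np

def toy_local_ssim(p, q, w, c1=0.0, c2=0.0):
    """Standard SSIM over the cells where w == 1 (hypothetical helper).

    With c1 = c2 = 0 the denominator is zero when the selected cells have
    zero mean or zero variance, so such bins would need special handling.
    """
    mask = w.astype(bool)
    x, y = p[mask], q[mask]
    mu_x, mu_y = x.mean(), y.mean()            # means
    var_x, var_y = x.var(), y.var()            # population variances
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()  # population covariance
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

# Hypothetical flow probability matrixes (each row sums to 1).
p = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.0, 0.0, 1.0]])
q = np.array([[0.4, 0.6, 0.0],
              [0.2, 0.3, 0.5],
              [0.0, 0.1, 0.9]])
w = np.ones((3, 3), dtype=int)  # toy weights: every cell is inside the bin

same = toy_local_ssim(p, p, w)       # a matrix compared with itself
different = toy_local_ssim(p, q, w)  # two similar but distinct matrixes
print(same, different)
```

Comparing a matrix with itself makes the numerator and denominator equal, which is why the run above returns 1.0.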


#### In this example, I compared a flow probabilities matrix with itself, so the SpSSIM = 1.
SpSSIM scores range from 0 to 1; a score of 1 means the two flow probabilities matrixes are identical.


In [22]:
print('The SpSSIM value is {}.'.format(spssim))
results

The SpSSIM value is 1.0.


Unnamed: 0,matrix1,matrix2,distance_bin,constant1,constant2,n,mean1,mean2,variance1,variance2,covariance,local_spssim,global_spssim
0,data/flow_probabilities/sd_lodes_2019_flow_pro...,data/flow_probabilities/sd_lodes_2019_flow_pro...,"(0, 10)",0,0,3218436,0.000257,0.000257,6e-06,6e-06,6e-06,1.0,1.0
1,data/flow_probabilities/sd_lodes_2019_flow_pro...,data/flow_probabilities/sd_lodes_2019_flow_pro...,"(10, 20)",0,0,3218436,0.000257,0.000257,6e-06,6e-06,6e-06,1.0,1.0
2,data/flow_probabilities/sd_lodes_2019_flow_pro...,data/flow_probabilities/sd_lodes_2019_flow_pro...,"(20, 30)",0,0,3218436,0.000257,0.000257,6e-06,6e-06,6e-06,1.0,1.0
3,data/flow_probabilities/sd_lodes_2019_flow_pro...,data/flow_probabilities/sd_lodes_2019_flow_pro...,"(30, 40)",0,0,3218436,0.000257,0.000257,6e-06,6e-06,6e-06,1.0,1.0
4,data/flow_probabilities/sd_lodes_2019_flow_pro...,data/flow_probabilities/sd_lodes_2019_flow_pro...,"(40, 50)",0,0,3218436,0.000257,0.000257,6e-06,6e-06,6e-06,1.0,1.0
5,data/flow_probabilities/sd_lodes_2019_flow_pro...,data/flow_probabilities/sd_lodes_2019_flow_pro...,"(50, 60)",0,0,3218436,0.000257,0.000257,6e-06,6e-06,6e-06,1.0,1.0
6,data/flow_probabilities/sd_lodes_2019_flow_pro...,data/flow_probabilities/sd_lodes_2019_flow_pro...,"(60, 70)",0,0,3218436,0.000257,0.000257,6e-06,6e-06,6e-06,1.0,1.0
7,data/flow_probabilities/sd_lodes_2019_flow_pro...,data/flow_probabilities/sd_lodes_2019_flow_pro...,"(70, 80)",0,0,3218436,0.000257,0.000257,6e-06,6e-06,6e-06,1.0,1.0
8,data/flow_probabilities/sd_lodes_2019_flow_pro...,data/flow_probabilities/sd_lodes_2019_flow_pro...,"(80, 90)",0,0,3218436,0.000257,0.000257,6e-06,6e-06,6e-06,1.0,1.0
9,data/flow_probabilities/sd_lodes_2019_flow_pro...,data/flow_probabilities/sd_lodes_2019_flow_pro...,"(90, 100)",0,0,3218436,0.000257,0.000257,6e-06,6e-06,6e-06,1.0,1.0


#### The Python script is available on [GitHub](https://github.com/jlembury/spssim_analysis). Backup data is on [Google Drive](https://drive.google.com/drive/folders/12WrJ_iIWFP6eIQUoisutON2G8xudTM-D?usp=sharing).
