# Using Spatial 2SLS regression to estimate the impact of socio-economic factors on *Crime Rate* in Chicago


We evaluate all models at the conventional 5% statistical significance level.

In [252]:
import geopandas as gpd
import libpysal as ps
import spreg
import warnings
warnings.filterwarnings('ignore')
from constants import PREPROCESSED_DATA_PATH

gdf = gpd.read_file(PREPROCESSED_DATA_PATH)

The code below creates a “Rook” contiguity weight matrix. In spatial analysis, a weight matrix defines the “neighbors” of each observation in the dataset. The “Rook” criterion treats two observations as neighbors only if they share a common edge (the “Queen” criterion would also count a shared vertex). The weight matrix is then row-standardized, so that the weights in each row sum to 1.

In [253]:
w = ps.weights.Rook.from_dataframe(gdf)
w.transform='r'
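To make the neighbor criteria and row-standardization concrete, here is a minimal sketch on a toy 3x3 grid of cells rather than the Chicago polygons. The helpers `grid_neighbors` and `row_standardize` are illustrative names, not part of `libpysal`.

```python
def grid_neighbors(nrows, ncols, criterion="rook"):
    """Neighbor lists for cells of a regular grid, numbered row by row."""
    # Rook: cells sharing an edge.
    offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    if criterion == "queen":
        # Queen: additionally cells sharing only a corner (vertex).
        offsets += [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    neighbors = {}
    for r in range(nrows):
        for c in range(ncols):
            neighbors[r * ncols + c] = sorted(
                (r + dr) * ncols + (c + dc)
                for dr, dc in offsets
                if 0 <= r + dr < nrows and 0 <= c + dc < ncols
            )
    return neighbors


def row_standardize(neighbors):
    """Give each neighbor of cell i the weight 1 / (number of neighbors of i)."""
    return {i: {j: 1.0 / len(js) for j in js} for i, js in neighbors.items()}


rook = grid_neighbors(3, 3, "rook")
queen = grid_neighbors(3, 3, "queen")
print("center cell, rook :", rook[4])
print("center cell, queen:", queen[4])
print("row-standardized rook weights of center cell:", row_standardize(rook)[4])
```

After row-standardization, multiplying the weight matrix by a variable yields the average of that variable over each observation's neighbors.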

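The `robust="white"` argument requests White heteroskedasticity-consistent standard errors, so that inference remains valid when the error variance is not constant across observations. The `spreg.GM_Lag` class estimates a spatial lag model by Spatial Two Stage Least Squares (S2SLS) and reports the results together with diagnostics.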
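For intuition about what S2SLS does, the sketch below implements the textbook estimator with NumPy on simulated data: the spatial lag of the dependent variable is endogenous, so it is instrumented with the explanatory variables and their spatial lags. This is a simplified illustration under stated assumptions (a toy ring-shaped neighbor structure, first-order spatial lags of X as instruments, no robust standard errors), not the `spreg` implementation; `ring_weights` and `s2sls` are hypothetical helper names.

```python
import numpy as np


def ring_weights(n):
    """Row-standardized weights for n regions arranged in a ring (toy layout)."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, (i - 1) % n] = 0.5
        W[i, (i + 1) % n] = 0.5
    return W


def s2sls(y, X, W):
    """Estimate y = const + X beta + rho * W y + e by two stage least squares.

    Returns the coefficients in the order [const, beta..., rho].
    """
    n = len(y)
    const = np.ones((n, 1))
    Wy = (W @ y).reshape(-1, 1)
    Z = np.hstack([const, X, Wy])      # regressors, including the endogenous lag
    H = np.hstack([const, X, W @ X])   # instruments: X and its spatial lag
    # First stage: project the regressors onto the instrument space.
    Z_hat = H @ np.linalg.lstsq(H, Z, rcond=None)[0]
    # Second stage: instrumental-variables estimate using the fitted regressors.
    return np.linalg.solve(Z_hat.T @ Z, Z_hat.T @ y)


# Simulated data (assumption: true rho = 0.4, small noise).
rng = np.random.default_rng(0)
n, rho = 50, 0.4
W = ring_weights(n)
X = rng.normal(size=(n, 2))
e = rng.normal(scale=0.1, size=n)
y = np.linalg.solve(np.eye(n) - rho * W, 1.0 + X @ np.array([2.0, -1.0]) + e)

print(s2sls(y, X, W))  # estimates of [const, beta_1, beta_2, rho]
```

In the summary tables produced by `spreg.GM_Lag`, the last coefficient corresponds to the row labelled with the `W_` prefix, here *W_CRIME_RATE*.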

S2SLS regression utilizing the White Population Percent variable:

In [254]:
Y_name = 'CRIME_RATE'
X_name = ['POP_DENSITY', 'LIQUOR_STORES_DENSITY', 'POP_BELOW_125_POVERTY_PCT', 'WHITE_POP_PCT', 'ASIAN_POP_PCT']

Y = gdf[Y_name].to_numpy()
X = gdf[X_name].to_numpy()

# Spatial lag model estimated by S2SLS with White standard errors
lag_model = spreg.GM_Lag(Y, X, w=w, name_y=Y_name, name_x=X_name, robust='white')

print(lag_model.summary)

REGRESSION RESULTS
------------------

SUMMARY OF OUTPUT: SPATIAL TWO STAGE LEAST SQUARES
--------------------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :  CRIME_RATE                Number of Observations:          77
Mean dependent var  :     61.8206                Number of Variables   :           7
S.D. dependent var  :     39.4260                Degrees of Freedom    :          70
Pseudo R-squared    :      0.8702
Spatial Pseudo R-squared:  0.8453

White Standard Errors
------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT        25.44052        10.65582         2.38748         0.01696
         POP_DENSITY        -0.00139         0.00051        -2.72026         0.00652
LIQUOR_STORES_DENSIT

S2SLS regression utilizing the Black Population Percent variable:

In [255]:
X_name = ['POP_DENSITY', 'LIQUOR_STORES_DENSITY', 'POP_BELOW_125_POVERTY_PCT', 'BLACK_POP_PCT', 'ASIAN_POP_PCT']

X = gdf[X_name].to_numpy()

# Spatial lag model estimated by S2SLS with White standard errors
lag_model = spreg.GM_Lag(Y, X, w=w, name_y=Y_name, name_x=X_name, robust='white')

print(lag_model.summary)

REGRESSION RESULTS
------------------

SUMMARY OF OUTPUT: SPATIAL TWO STAGE LEAST SQUARES
--------------------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :  CRIME_RATE                Number of Observations:          77
Mean dependent var  :     61.8206                Number of Variables   :           7
S.D. dependent var  :     39.4260                Degrees of Freedom    :          70
Pseudo R-squared    :      0.8674
Spatial Pseudo R-squared:  0.8390

White Standard Errors
------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT       -12.61295         5.33651        -2.36352         0.01810
         POP_DENSITY        -0.00113         0.00053        -2.11283         0.03462
LIQUOR_STORES_DENSIT

In the first regression, where the *WHITE_POP_PCT* variable is used, the S2SLS model has a Pseudo R-squared of 0.8702. This statistic is the squared correlation between the observed and predicted values of *CRIME_RATE*, so, loosely speaking, the model accounts for about 87% of the variation in *CRIME_RATE*. This is an improvement over the fit of the OLS model.

Among the independent variables, *POP_DENSITY*, *LIQUOR_STORES_DENSITY*, *POP_BELOW_125_POVERTY_PCT*, *WHITE_POP_PCT* and *ASIAN_POP_PCT* were found to be statistically significant at the 5% level. The negative coefficients for *POP_DENSITY*, *WHITE_POP_PCT* and *ASIAN_POP_PCT* mean that, holding the other variables constant, higher values of these variables are associated with a lower crime rate. These are associations rather than causal effects, and they may reflect underlying factors that are not included in the model.
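As a sanity check on such significance calls, the two-sided p-values can be recomputed from the reported z-statistics under a standard normal reference distribution. The sketch below uses only the two rows visible in the first summary table; the helper name `two_sided_p` is illustrative.

```python
import math

ALPHA = 0.05  # significance threshold used throughout this analysis


def two_sided_p(z):
    """Two-sided p-value of a z-statistic under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# z-statistics copied from the first regression summary
reported_z = {"CONSTANT": 2.38748, "POP_DENSITY": -2.72026}

for name, z in reported_z.items():
    p = two_sided_p(z)
    print(f"{name}: p = {p:.5f}, significant at 5%: {p < ALPHA}")
```

Both values agree with the Probability column of the summary (0.01696 and 0.00652) up to rounding.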

The variable *W_CRIME_RATE* is also significant, representing the spatial lag of the dependent variable *CRIME_RATE*. This suggests that the crime rate in a given location is influenced by the crime rates in neighboring locations. The positive coefficient of *W_CRIME_RATE* indicates that locations with high crime rates tend to be surrounded by other locations with high crime rates.
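To illustrate what a spatial lag such as *W_CRIME_RATE* measures, the sketch below computes it for four hypothetical regions arranged in a line, using a row-standardized weight matrix. The numbers are made up for illustration and are not taken from the Chicago data.

```python
import numpy as np

# Binary contiguity for four regions in a line: 0 - 1 - 2 - 3
contiguity = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

# Row-standardize: each row is divided by its number of neighbors.
W = contiguity / contiguity.sum(axis=1, keepdims=True)

crime_rate = np.array([10.0, 20.0, 30.0, 40.0])  # hypothetical values

# The spatial lag is the average crime rate of each region's neighbors.
w_crime_rate = W @ crime_rate
print(w_crime_rate)
```

A positive coefficient on this lag in the regression means that a region's crime rate tends to be higher when the average crime rate of its neighbors is higher.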

In the second regression, where the *BLACK_POP_PCT* variable is used, the S2SLS model has a Pseudo R-squared of 0.8674. Among the independent variables, *POP_DENSITY*, *LIQUOR_STORES_DENSITY*, *POP_BELOW_125_POVERTY_PCT* and *BLACK_POP_PCT* were found to be statistically significant at the 5% level. The positive coefficient for *BLACK_POP_PCT* means that, holding the other variables constant, a higher percentage of Black population is associated with a higher crime rate. This is an association rather than a causal effect, and it may reflect socio-economic factors that are not captured by the variables in the model.

The S2SLS model accounts for spatial dependence in the data, which makes its estimates more reliable than those of the OLS model. This matters here because the Lagrange Multiplier tests on the OLS model were significant, pointing to spatial dependence that OLS ignores. The S2SLS model is therefore better suited for this analysis.