# Using Spatial Error regression to estimate the impact of socio-economic factors on *Crime Rate* in Chicago


We evaluate all models at the standard 5% significance level.

In [1]:
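import warnings

import geopandas as gpd
import libpysal as ps
import spreg

from constants import PREPROCESSED_DATA_PATH

# Silence library warnings to keep the notebook output readable.
warnings.filterwarnings('ignore')

# Load the preprocessed data as a GeoDataFrame.
gdf = gpd.read_file(PREPROCESSED_DATA_PATH)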

The code below builds a “Queen” contiguity weight matrix. In spatial analysis, a weight matrix defines the neighbors of each observation in the dataset. Under the Queen criterion, two observations are neighbors if they share a common edge or a common vertex. The matrix is then row-standardized, so that the weights in each row sum to 1.

In [2]:
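# Queen contiguity: polygons sharing an edge or a vertex are neighbors.
w = ps.weights.Queen.from_dataframe(gdf)
# 'r' applies row-standardization.
w.transform = 'r'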
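
To make the Queen criterion and row-standardization concrete, here is a minimal pure-Python sketch on a hypothetical 3×3 grid of square cells. The grid and the helper names `queen_neighbors` and `row_standardize` are illustrative assumptions and are not part of the analysis; libpysal derives the same kind of structure from the polygon geometries in `gdf`.

```python
def queen_neighbors(n_rows, n_cols):
    """Map each cell id of an n_rows x n_cols grid to its Queen neighbors.

    Cells are numbered row by row. Two cells are Queen neighbors when they
    touch along an edge or at a corner.
    """
    neighbors = {}
    for r in range(n_rows):
        for c in range(n_cols):
            cell = r * n_cols + c
            neighbors[cell] = [
                rr * n_cols + cc
                for rr in range(r - 1, r + 2)
                for cc in range(c - 1, c + 2)
                if (rr, cc) != (r, c) and 0 <= rr < n_rows and 0 <= cc < n_cols
            ]
    return neighbors


def row_standardize(neighbors):
    """Give each neighbor of cell i the weight 1 / (number of neighbors of i)."""
    return {i: {j: 1.0 / len(js) for j in js} for i, js in neighbors.items()}


neighbors = queen_neighbors(3, 3)
weights = row_standardize(neighbors)

print(neighbors[4])  # center cell: every other cell is a neighbor
print(weights[0])    # corner cell: three neighbors with equal weights
```

After row-standardization, the spatial lag of a variable at a location is simply the average of that variable over the location's neighbors.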

`spreg.ML_Error` estimates the Spatial Error Model (SEM) by maximum likelihood and reports the results together with diagnostics. SEM regression utilizing the White Population Percent variable:

In [3]:
Y_name = 'CRIME_RATE'
X_name = ['POP_DENSITY', 'LIQUOR_STORES_DENSITY', 'POP_BELOW_125_POVERTY_PCT', 'WHITE_POP_PCT', 'ASIAN_POP_PCT']

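# Convert the selected columns to NumPy arrays for spreg.
Y = gdf[Y_name].to_numpy()
X = gdf[X_name].to_numpy()

# Fit the spatial error model by maximum likelihood.
sem = spreg.ML_Error(Y, X, w=w, name_y=Y_name, name_x=X_name)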

print(sem.summary)

REGRESSION RESULTS
------------------

SUMMARY OF OUTPUT: ML SPATIAL ERROR (METHOD = full)
---------------------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :  CRIME_RATE                Number of Observations:          77
Mean dependent var  :     61.8206                Number of Variables   :           6
S.D. dependent var  :     39.4260                Degrees of Freedom    :          71
Pseudo R-squared    :      0.8058
Log likelihood      :   -320.0348
Sigma-square ML     :    212.9012                Akaike info criterion :     652.070
S.E of regression   :     14.5911                Schwarz criterion     :     666.132

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT        72.69564    
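
For reference, the spatial error model is conventionally written as shown below. This is the textbook form, not something printed by the notebook: $W$ stands for the row-standardized weight matrix `w`, $\beta$ for the regression coefficients, and $\lambda$ for the spatial autoregressive coefficient of the error term.

$$
\begin{aligned}
y &= X\beta + u, \\
u &= \lambda W u + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I).
\end{aligned}
$$

When $\lambda = 0$, the errors are spatially independent and the specification reduces to an ordinary linear regression.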

SEM regression utilizing the Black Population Percent variable:

In [4]:
X_name = ['POP_DENSITY', 'LIQUOR_STORES_DENSITY', 'POP_BELOW_125_POVERTY_PCT', 'BLACK_POP_PCT', 'ASIAN_POP_PCT']

X = gdf[X_name].to_numpy()

# Refit with BLACK_POP_PCT in place of WHITE_POP_PCT (overwrites `sem`).
sem = spreg.ML_Error(Y, X, w=w, name_y=Y_name, name_x=X_name)

print(sem.summary)

REGRESSION RESULTS
------------------

SUMMARY OF OUTPUT: ML SPATIAL ERROR (METHOD = full)
---------------------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :  CRIME_RATE                Number of Observations:          77
Mean dependent var  :     61.8206                Number of Variables   :           6
S.D. dependent var  :     39.4260                Degrees of Freedom    :          71
Pseudo R-squared    :      0.8138
Log likelihood      :   -318.0533
Sigma-square ML     :    205.0680                Akaike info criterion :     648.107
S.E of regression   :     14.3202                Schwarz criterion     :     662.169

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT         6.72558    

The two SEM specifications yield Pseudo R-squared values of 0.8058 (White Population Percent) and 0.8138 (Black Population Percent), which are lower than those of the other models except OLS.
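
The information criteria in the two summaries give a second way to compare the specifications. As a small worked check, the sketch below recomputes them from the reported log likelihoods using the standard definitions AIC = -2·logL + 2k and Schwarz = -2·logL + k·ln(n). The function names are illustrative, and taking k = 6 (the "Number of Variables" line) is an assumption about how the summary counts parameters.

```python
import math


def aic(log_likelihood, k):
    """Akaike information criterion for k estimated parameters."""
    return -2.0 * log_likelihood + 2.0 * k


def schwarz(log_likelihood, k, n):
    """Schwarz (Bayesian) information criterion for n observations."""
    return -2.0 * log_likelihood + k * math.log(n)


N_OBS, N_VARS = 77, 6  # taken from the summaries above

# Log likelihoods reported for the two specifications.
log_likelihoods = {"WHITE_POP_PCT": -320.0348, "BLACK_POP_PCT": -318.0533}

criteria = {
    name: (aic(ll, N_VARS), schwarz(ll, N_VARS, N_OBS))
    for name, ll in log_likelihoods.items()
}

for name, (a, s) in criteria.items():
    print(f"{name}: AIC = {a:.3f}, Schwarz = {s:.3f}")
```

Lower values indicate a better trade-off between fit and complexity. With the same number of parameters in both models, the ranking follows the log likelihood directly.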

Among all the predictors, *LIQUOR_STORES_DENSITY*, *POP_BELOW_125_POVERTY_PCT*, *WHITE_POP_PCT*, and *BLACK_POP_PCT* (the last two each in its own specification) were found to be statistically significant at the 5% level. This is fewer than in the other models, where more variables were significant, which could suggest that the other models capture the spatial dependence structure better.
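
The 5% rule used above can be applied programmatically. The sketch below is a minimal illustration: `significant_terms` is a hypothetical helper, and the names and values are placeholders, not results from the models in this notebook.

```python
ALPHA = 0.05  # significance threshold used throughout this notebook


def significant_terms(names, z_stat, alpha=ALPHA):
    """Return the names whose p-value is below alpha.

    z_stat is a list of (z-statistic, p-value) pairs, one per name.
    """
    return [name for name, (_, p_value) in zip(names, z_stat) if p_value < alpha]


# Placeholder values for illustration only; NOT taken from the models above.
example_names = ["CONSTANT", "X1", "X2", "lambda"]
example_z_stat = [(3.1, 0.002), (1.2, 0.230), (-2.4, 0.016), (4.0, 0.00006)]

selected = significant_terms(example_names, example_z_stat)
print(selected)
```

For a fitted model, the call would presumably be `significant_terms(sem.name_x, sem.z_stat)`. This assumes that `sem.z_stat` holds (z-statistic, p-value) pairs aligned with `sem.name_x`, which should be checked against the installed spreg version.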

In conclusion, while the SEM provides valuable insights, the SLM or 2SLS model appears to be a better choice for this particular dataset and research question.