# Crime Rate Prediction

Research Question: Can socio-economic indicators and crime history predict the yearly crime rate change in a community?

In [1]:
%%configure -f
{
    "conf": {
        "spark.pyspark.python": "python3",
        "spark.pyspark.virtualenv.enabled": "true",
        "spark.pyspark.virtualenv.type":"native",
        "spark.pyspark.virtualenv.bin.path":"/usr/bin/virtualenv"
    }
}

ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
0,application_1716490809401_0002,pyspark,busy,Link,Link,


In [2]:
sc.install_pypi_package("matplotlib==3.2.1", "https://pypi.org/simple")
sc.install_pypi_package("pandas==1.0.5", "https://pypi.org/simple")
sc.install_pypi_package("scipy==1.4.1", "https://pypi.org/simple")
sc.install_pypi_package("seaborn==0.11.2", "https://pypi.org/simple")

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
1,application_1716490809401_0003,pyspark,idle,Link,Link,✔


SparkSession available as 'spark'.


Collecting matplotlib==3.2.1
  Using cached https://files.pythonhosted.org/packages/b2/c2/71fcf957710f3ba1f09088b35776a799ba7dd95f7c2b195ec800933b276b/matplotlib-3.2.1-cp37-cp37m-manylinux1_x86_64.whl
Collecting pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 (from matplotlib==3.2.1)
  Using cached https://files.pythonhosted.org/packages/9d/ea/6d76df31432a0e6fdf81681a895f009a4bb47b3c39036db3e1b528191d52/pyparsing-3.1.2-py3-none-any.whl
Collecting python-dateutil>=2.1 (from matplotlib==3.2.1)
  Using cached https://files.pythonhosted.org/packages/ec/57/56b9bcc3c9c6a792fcbaf139543cee77261f3651ca9da0c93f5c1221264b/python_dateutil-2.9.0.post0-py2.py3-none-any.whl
Collecting cycler>=0.10 (from matplotlib==3.2.1)
  Using cached https://files.pythonhosted.org/packages/5c/f9/695d6bedebd747e5eb0fe8fad57b72fdf25411273a39791cde838d5a8f51/cycler-0.11.0-py3-none-any.whl
Collecting kiwisolver>=1.0.1 (from matplotlib==3.2.1)
  Using cached https://files.pythonhosted.org/packages/f9/77/e3046bf19720b22e3e0b7c

In [3]:
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.functions import col, month, year
from pyspark.sql.window import Window
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator, RegressionEvaluator
from pyspark.ml.feature import (
    Bucketizer,
    OneHotEncoder,
    StandardScaler,
    StringIndexer,
    VectorAssembler,
)
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
import matplotlib.pyplot as plt

In [83]:
# Parquet files carry their own schema. CSV options such as quote, escape,
# and header do not apply to the Parquet reader, so they are omitted.
crimes = spark.read.parquet("s3://hvpachisia-chicago-crime/processed_data/")

In [84]:
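all_columns = crimes.schema.names

# Drop every column whose name contains "index" or "vec".
# The loop variable is called `name` so it does not shadow the imported `col`.
columns_to_drop = [name for name in all_columns if "index" in name or "vec" in name]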

crimes = crimes.drop(*columns_to_drop)
crimes.printSchema()

root
 |-- community_area: string (nullable = true)
 |-- id: string (nullable = true)
 |-- case_number: string (nullable = true)
 |-- date: string (nullable = true)
 |-- block: string (nullable = true)
 |-- primary_type: string (nullable = true)
 |-- description: string (nullable = true)
 |-- location_description: string (nullable = true)
 |-- arrest: string (nullable = true)
 |-- domestic: string (nullable = true)
 |-- beat: string (nullable = true)
 |-- district: string (nullable = true)
 |-- ward: string (nullable = true)
 |-- x_coordinate: string (nullable = true)
 |-- y_coordinate: string (nullable = true)
 |-- year: string (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- location: string (nullable = true)
 |-- month: string (nullable = true)
 |-- community_name: string (nullable = true)
 |-- birth_rate: string (nullable = true)
 |-- below_poverty_level: double (nullable = true)
 |-- crowded_housing: double (nullable = true)
 |

In [85]:
crimes = crimes.drop('yearlycrimeratechange','prevyearcrimecount')

In [86]:
crime_yearly_count = crimes.groupBy("community_area", "year").agg(F.count("*").alias("yearly_crime_count"))

In [87]:
yearly_counts_df = crime_yearly_count.cache()

In [88]:
windowSpecYear = Window.partitionBy("community_area").orderBy("year")
yearly_counts_df = yearly_counts_df.withColumn("PrevYearCrimeCount", F.lag("yearly_crime_count", 1).over(windowSpecYear))
yearly_counts_df.show()

+--------------+----+------------------+------------------+
|community_area|year|yearly_crime_count|PrevYearCrimeCount|
+--------------+----+------------------+------------------+
|            18|2001|                12|              null|
|            18|2002|               925|                12|
|            18|2003|              1288|               925|
|            18|2004|              1036|              1288|
|            18|2005|              1006|              1036|
|            18|2006|              1004|              1006|
|            18|2007|              1063|              1004|
|            18|2008|              1018|              1063|
|            18|2009|              1100|              1018|
|            18|2010|               899|              1100|
|            18|2011|               851|               899|
|            18|2012|               782|               851|
|            18|2013|               634|               782|
|            18|2014|               527|

In [89]:
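# Relative change versus the previous year. The first year of each community
# area has no previous count, so its value is null at this point.
yearly_counts_df = yearly_counts_df.withColumn(
    "YearlyCrimeRateChange",
    (F.col("yearly_crime_count") - F.col("PrevYearCrimeCount")) / F.col("PrevYearCrimeCount"))

# Fill those first-year nulls with 0, which treats a missing history as "no change".
yearly_counts_df = yearly_counts_df.fillna({"PrevYearCrimeCount": 0, "YearlyCrimeRateChange": 0})
yearly_counts_df.show()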


+--------------+----+------------------+------------------+---------------------+
|community_area|year|yearly_crime_count|PrevYearCrimeCount|YearlyCrimeRateChange|
+--------------+----+------------------+------------------+---------------------+
|            18|2001|                12|                 0|                  0.0|
|            18|2002|               925|                12|    76.08333333333333|
|            18|2003|              1288|               925|   0.3924324324324324|
|            18|2004|              1036|              1288|  -0.1956521739130435|
|            18|2005|              1006|              1036| -0.02895752895752896|
|            18|2006|              1004|              1006| -0.00198807157057...|
|            18|2007|              1063|              1004|  0.05876494023904383|
|            18|2008|              1018|              1063| -0.04233301975540922|
|            18|2009|              1100|              1018|  0.08055009823182711|
|            18|
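
The lag and percent-change steps above can be illustrated in plain Python with the community area 18 counts from the printed output. This is a minimal sketch rather than the Spark implementation: `yearly_change` is a hypothetical helper, and it deliberately returns `None` for the first year instead of filling it with 0, to show what the value looks like before the `fillna` step.

```python
# Yearly crime counts for community area 18, copied from the output above.
counts = {2001: 12, 2002: 925, 2003: 1288, 2004: 1036, 2005: 1006}


def yearly_change(counts_by_year):
    """Return {year: relative change vs. the previous year} (hypothetical helper).

    The first year has no predecessor, so its change is None, mirroring the
    null that a lag over an ordered window produces before any fill step.
    """
    changes = {}
    prev = None
    for yr in sorted(counts_by_year):
        current = counts_by_year[yr]
        if prev is None or prev == 0:
            # Undefined: no previous year, or a division by zero.
            changes[yr] = None
        else:
            changes[yr] = (current - prev) / prev
        prev = current
    return changes


changes = yearly_change(counts)
for yr, change in changes.items():
    print(yr, change)
```

Keeping the first year as a missing value, rather than 0, is one possible design choice; it would let such rows be filtered out before training instead of being labeled as "no change".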

In [90]:
crimes = crimes.join(yearly_counts_df, ["community_area", "year"], "left")

In [91]:
crimes.select("community_area", "year", "yearly_crime_count", "PrevYearCrimeCount", "YearlyCrimeRateChange").show()

+--------------+----+------------------+------------------+---------------------+
|community_area|year|yearly_crime_count|PrevYearCrimeCount|YearlyCrimeRateChange|
+--------------+----+------------------+------------------+---------------------+
|            53|2002|              5541|                56|    97.94642857142857|
|            53|2003|              7291|              5541|   0.3158274679660711|
|            53|2003|              7291|              5541|   0.3158274679660711|
|            53|2003|              7291|              5541|   0.3158274679660711|
|            53|2004|              7888|              7291|  0.08188177204773008|
|            53|2004|              7888|              7291|  0.08188177204773008|
|            53|2005|              7941|              7888| 0.006719066937119675|
|            53|2005|              7941|              7888| 0.006719066937119675|
|            53|2005|              7941|              7888| 0.006719066937119675|
|            53|

In [93]:
crimes.groupBy("YearlyCrimeRateChange").count().show()

+---------------------+-----+
|YearlyCrimeRateChange|count|
+---------------------+-----+
|  0.07807486631016043| 1008|
|                 74.5| 1359|
|  -0.0851581508515815| 1504|
|  0.12888198757763975|  727|
| -0.08241371213899977|11724|
| -0.08658546047016837|11773|
|  0.08455034588777863| 1411|
| 0.005757466714645...| 2795|
| -0.00129082225377...| 7737|
| -0.03186022610483042|  942|
| 0.020080321285140562|  508|
| -0.06820221358061622| 6230|
|   -0.070748730964467|14645|
|               96.675| 7814|
|  0.05442176870748299| 1860|
| -0.06932773109243698| 1772|
|  0.03842348284960422| 6297|
|  0.05990453460620525| 8882|
|  0.10262193402875669| 3911|
|  0.12111801242236025| 1805|
+---------------------+-----+
only showing top 20 rows

In [100]:
categorical_columns = ["community_area"]
numerical_columns = ["below_poverty_level", "crowded_housing", "no_high_school_diploma", 
                     "per_capita_income", "unemployment", "areacrimecount"]

indexers = [StringIndexer(inputCol=c, outputCol=c+"_index", handleInvalid="keep") for c in categorical_columns]

assembler_inputs = [c+"_index" for c in categorical_columns] + numerical_columns
assembler = VectorAssembler(inputCols=assembler_inputs, outputCol="features", handleInvalid="skip")

scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures")

regressor = LinearRegression(labelCol="YearlyCrimeRateChange", featuresCol="scaledFeatures")

In [101]:
pipeline_stages = indexers + [assembler, scaler, regressor]
pipeline = Pipeline(stages=pipeline_stages)

paramGrid = ParamGridBuilder() \
    .addGrid(regressor.regParam, [0.01, 0.1, 0.5]) \
    .addGrid(regressor.elasticNetParam, [0.0, 0.5, 1.0]) \
    .build()

evaluator = RegressionEvaluator(
    labelCol="YearlyCrimeRateChange", 
    predictionCol="prediction", 
    metricName="rmse")

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=5)

In [102]:
train_full, test = crimes.randomSplit([0.8, 0.2], seed=0)
train_subset = train_full.sample(fraction=0.1, seed=0)  

assert train_subset.count() > 0, "Training subset is empty. Adjust the sampling fraction."
assert test.count() > 0, "Test dataset is empty. Ensure data is properly loaded and split."

test.persist()
train_subset.persist()

DataFrame[community_area: string, year: string, id: string, case_number: string, date: string, block: string, primary_type: string, description: string, location_description: string, arrest: string, domestic: string, beat: string, district: string, ward: string, x_coordinate: string, y_coordinate: string, latitude: double, longitude: double, location: string, month: string, community_name: string, birth_rate: string, below_poverty_level: double, crowded_housing: double, no_high_school_diploma: double, per_capita_income: int, unemployment: double, hourofday: int, dayofweek: int, season: string, isweekend: int, areacrimecount: bigint, timesincelastcrime: bigint, nextcrimedate: string, recurrentcrime: int, PrevCrimesAtLocation: bigint, yearly_crime_count: bigint, PrevYearCrimeCount: bigint, YearlyCrimeRateChange: double]

In [104]:
cvModel = crossval.fit(train_subset)

In [105]:
predictions = cvModel.transform(test)

rmse = evaluator.evaluate(predictions)
print(f"Test RMSE: {rmse:.2f}")

Test RMSE: 18.83

In [106]:
bestModel = cvModel.bestModel
lrModel = bestModel.stages[-1]  # last pipeline stage is the fitted linear regression model
# lrModel.summary holds training-set metrics, so this R^2 is not a test-set score.
r2 = lrModel.summary.r2
print(f" - regParam: {lrModel._java_obj.getRegParam()}")
print(f" - elasticNetParam: {lrModel._java_obj.getElasticNetParam()}")
print(f'Coefficients: {lrModel.coefficients}')
print(f'R^2 value: {r2}')

 - regParam: 0.01
 - elasticNetParam: 0.0
Coefficients: [-0.12045403879589812,0.03275092619173329,-0.7803919192291107,0.7170935335115407,0.6744119989303673,0.48126255543731356,-0.40577017219722694]
R^2 value: 0.0010472360193978236

The linear regression model attempts to predict the yearly crime rate change in a community from socio-economic indicators and crime history. Its test Root Mean Squared Error (RMSE) of 18.83 is large relative to the target: apart from the first-year jumps, the year-over-year changes shown above fall within roughly ±0.4, so an error of this size indicates poor accuracy rather than moderate accuracy. The coefficients give the direction and magnitude of each scaled feature's association with crime rate change, in the order the features were assembled (community_area_index first, then the numerical columns). Read in that order, the first coefficient, -0.12, belongs to community_area_index, below_poverty_level has a small positive coefficient of 0.03, and per_capita_income has a positive coefficient of 0.67. With an R^2 this low, these values should not be over-interpreted.
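
Pairing each printed coefficient with its feature name makes the attribution explicit. The sketch below assumes the coefficient vector follows the `VectorAssembler` input order defined earlier (the indexed categorical column first, then the numerical columns); the values are copied from the printed output, and `coef_by_feature` is a name introduced here for illustration.

```python
# Feature order used by the VectorAssembler: indexed categorical columns first,
# then the numerical columns.
categorical_columns = ["community_area"]
numerical_columns = ["below_poverty_level", "crowded_housing", "no_high_school_diploma",
                     "per_capita_income", "unemployment", "areacrimecount"]
feature_names = [c + "_index" for c in categorical_columns] + numerical_columns

# Coefficients copied from the printed output of the best model.
coefficients = [-0.12045403879589812, 0.03275092619173329, -0.7803919192291107,
                0.7170935335115407, 0.6744119989303673, 0.48126255543731356,
                -0.40577017219722694]

# One coefficient per assembled feature is assumed here.
assert len(feature_names) == len(coefficients)
coef_by_feature = dict(zip(feature_names, coefficients))

# Print features sorted by absolute coefficient size, largest first.
for name, value in sorted(coef_by_feature.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{name:>24}: {value:+.2f}")
```

Because the features pass through `StandardScaler` before the regression, each coefficient is expressed per standard deviation of its feature rather than per raw unit.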

However, the R^2 value of 0.001, which is computed on the training data, means the model explains only about 0.1% of the variance in crime rate changes. This highlights the complexity of crime dynamics and suggests that additional factors are needed to improve predictive accuracy. One likely cause is the nature of the input data: the socio-economic indicators are a single point-in-time snapshot for each community, so they cannot capture the year-to-year variation that the target measures.
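
One factor that may contribute to the large RMSE is the extreme first-year change visible in the earlier output, where community area 18 goes from 12 recorded crimes in 2001 to 925 in 2002. The sketch below uses only those printed counts and a hypothetical helper, `rmse_of_mean_predictor`, to compare a constant-mean baseline with and without that value. It illustrates how a single outlier can dominate RMSE; it is not a re-evaluation of the model.

```python
import math

# Yearly counts for community area 18 (2001-2013), copied from the output above.
counts = [12, 925, 1288, 1036, 1006, 1004, 1063, 1018, 1100, 899, 851, 782, 634]

# Relative change versus the previous year, starting with 2002.
changes = [(cur - prev) / prev for prev, cur in zip(counts, counts[1:])]


def rmse_of_mean_predictor(values):
    """RMSE of a baseline that always predicts the mean of `values` (hypothetical helper)."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


rmse_all = rmse_of_mean_predictor(changes)          # includes the 2001 to 2002 jump
rmse_trimmed = rmse_of_mean_predictor(changes[1:])  # excludes that first change

print(f"baseline RMSE with first-year jump:    {rmse_all:.2f}")
print(f"baseline RMSE without first-year jump: {rmse_trimmed:.2f}")
```

If the same pattern holds across community areas, excluding or separately handling the first recorded year would be worth testing before drawing conclusions from the RMSE.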

In [107]:
spark.stop()
