# Predicting arrest
Research Question: What factors predict the likelihood of an arrest during a crime?


In [1]:
%%configure -f
{
    "conf": {
        "spark.pyspark.python": "python3",
        "spark.pyspark.virtualenv.enabled": "true",
        "spark.pyspark.virtualenv.type":"native",
        "spark.pyspark.virtualenv.bin.path":"/usr/bin/virtualenv"
    }
}

In [2]:
sc.install_pypi_package("matplotlib==3.2.1", "https://pypi.org/simple")
sc.install_pypi_package("pandas==1.0.5", "https://pypi.org/simple")
sc.install_pypi_package("scipy==1.4.1", "https://pypi.org/simple")
sc.install_pypi_package("seaborn==0.11.2", "https://pypi.org/simple")

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
0,application_1716490809401_0002,pyspark,idle,Link,Link,✔


SparkSession available as 'spark'.


Collecting matplotlib==3.2.1
  Downloading https://files.pythonhosted.org/packages/b2/c2/71fcf957710f3ba1f09088b35776a799ba7dd95f7c2b195ec800933b276b/matplotlib-3.2.1-cp37-cp37m-manylinux1_x86_64.whl (12.4MB)
Collecting pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 (from matplotlib==3.2.1)
  Downloading https://files.pythonhosted.org/packages/9d/ea/6d76df31432a0e6fdf81681a895f009a4bb47b3c39036db3e1b528191d52/pyparsing-3.1.2-py3-none-any.whl (103kB)
Collecting python-dateutil>=2.1 (from matplotlib==3.2.1)
  Downloading https://files.pythonhosted.org/packages/ec/57/56b9bcc3c9c6a792fcbaf139543cee77261f3651ca9da0c93f5c1221264b/python_dateutil-2.9.0.post0-py2.py3-none-any.whl (229kB)
Collecting cycler>=0.10 (from matplotlib==3.2.1)
  Downloading https://files.pythonhosted.org/packages/5c/f9/695d6bedebd747e5eb0fe8fad57b72fdf25411273a39791cde838d5a8f51/cycler-0.11.0-py3-none-any.whl
Collecting kiwisolver>=1.0.1 (from matplotlib==3.2.1)
  Downloading https://files.pythonhosted.org/packages/f9/77/e3

In [3]:
# Imports used in this notebook, deduplicated and grouped by package.
import matplotlib.pyplot as plt
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import OneHotEncoder, StandardScaler, StringIndexer, VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.sql import functions as F

In [49]:
# Parquet files carry their own schema, so the CSV-only options
# (quote, escape, ignoreLeadingWhiteSpace, header) have no effect and are dropped.
crimes = spark.read.parquet("s3://hvpachisia-chicago-crime/processed_data/")

In [50]:
all_columns = crimes.schema.names

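# Drop columns whose names contain "index" or "vec"; the pipeline below
# creates its own *_index and *_vec columns.
columns_to_drop = [c for c in all_columns if "index" in c or "vec" in c]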

crimes = crimes.drop(*columns_to_drop)

crimes.printSchema()

root
 |-- community_area: string (nullable = true)
 |-- id: string (nullable = true)
 |-- case_number: string (nullable = true)
 |-- date: string (nullable = true)
 |-- block: string (nullable = true)
 |-- primary_type: string (nullable = true)
 |-- description: string (nullable = true)
 |-- location_description: string (nullable = true)
 |-- arrest: string (nullable = true)
 |-- domestic: string (nullable = true)
 |-- beat: string (nullable = true)
 |-- district: string (nullable = true)
 |-- ward: string (nullable = true)
 |-- x_coordinate: string (nullable = true)
 |-- y_coordinate: string (nullable = true)
 |-- year: string (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- location: string (nullable = true)
 |-- month: string (nullable = true)
 |-- community_name: string (nullable = true)
 |-- birth_rate: string (nullable = true)
 |-- below_poverty_level: double (nullable = true)
 |-- crowded_housing: double (nullable = true)
 |

In [51]:
crimes.groupBy("arrest").count().show()

# Encode the string label as 0/1; any value other than "true" (including null) becomes 0.
crimes = crimes.withColumn("arrest", F.when(F.col("arrest") == "true", 1).otherwise(0))
crimes.groupBy("arrest").count().show()

+------+-------+
|arrest|  count|
+------+-------+
|  true|1886071|
| false|5560931|
+------+-------+

+------+-------+
|arrest|  count|
+------+-------+
|     1|1886071|
|     0|5560931|
+------+-------+

The target variable is imbalanced: roughly 25% of the 7.4 million records ended in an arrest and 75% did not. We leave the class distribution as it is, so accuracy should be judged against the roughly 75% that always predicting "no arrest" would score.
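
To make the imbalance concrete, the sketch below derives the arrest rate and the majority-class baseline from the counts printed above. The "balanced" class weights at the end are an illustrative assumption: they show one common way to compensate for imbalance, but no reweighting is applied anywhere in this notebook.

```python
# Class counts copied from the groupBy output above.
counts = {1: 1_886_071, 0: 5_560_931}

total = sum(counts.values())
arrest_rate = counts[1] / total

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = max(counts.values()) / total

# "Balanced" weights: each class ends up with the same total weight.
# Feeding these into a weight column is an assumption, not a step taken here.
class_weights = {label: total / (len(counts) * n) for label, n in counts.items()}

print(f"arrest rate:       {arrest_rate:.3f}")
print(f"majority baseline: {majority_baseline:.3f}")
print(f"class weights:     {class_weights}")
```

Any accuracy reported later is only informative to the extent that it exceeds the majority baseline.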

In [52]:
# Categorical columns are indexed and one-hot encoded; numerical columns are used as-is.
categorical_columns = ["community_area","primary_type", "description", "location_description", "beat", "district", "ward", "season"]
numerical_columns = ["latitude", "longitude", "hourofday", "dayofweek", "below_poverty_level", 
                     "crowded_housing", "no_high_school_diploma", "per_capita_income", "unemployment"]

indexers = [StringIndexer(inputCol=c, outputCol=c+"_index", handleInvalid="keep") for c in categorical_columns]
encoders = [OneHotEncoder(inputCols=[c+"_index"], outputCols=[c+"_vec"]) for c in categorical_columns]

assembler_inputs = [c+"_vec" for c in categorical_columns] + numerical_columns
assembler = VectorAssembler(inputCols=assembler_inputs, outputCol="features", handleInvalid="skip")

scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures")

classifier = RandomForestClassifier(labelCol="arrest", featuresCol="scaledFeatures")

In [54]:
pipeline_stages = indexers + encoders + [assembler, scaler, classifier]
pipeline = Pipeline(stages=pipeline_stages)

paramGrid = ParamGridBuilder() \
    .addGrid(classifier.numTrees, [10, 20, 50]) \
    .addGrid(classifier.maxDepth, [5, 10, 20]) \
    .build()


In [56]:
evaluator = MulticlassClassificationEvaluator(
    labelCol="arrest", 
    predictionCol="prediction", 
    metricName="accuracy")

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=5)
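
The grid and fold count above determine how many pipeline fits the cross-validation will run, which is the main driver of training time. A small sketch of that arithmetic, using only the values from the cells above (the variable names are illustrative):

```python
from itertools import product

# Grid values copied from the ParamGridBuilder call above.
num_trees = [10, 20, 50]
max_depth = [5, 10, 20]
num_folds = 5

param_maps = list(product(num_trees, max_depth))

# One fit per (parameter map, fold), plus a final refit of the best map
# on the full training input.
cv_fits = len(param_maps) * num_folds
total_fits = cv_fits + 1

print(f"{len(param_maps)} parameter maps x {num_folds} folds = {cv_fits} fits (+1 refit = {total_fits})")
```

This cost is a plausible reason for fitting on a subsample rather than on the full training split.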


In [69]:
train_full, test = crimes.randomSplit([0.5, 0.5], seed=10)
train_subset = train_full.sample(fraction=0.1, seed=10)
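
A rough sketch of the expected split sizes, assuming the total row count implied by the arrest counts earlier in the notebook. Both `randomSplit` and `sample` are random, so actual counts will differ slightly, and rows later skipped by the assembler for invalid values are not accounted for.

```python
total_rows = 1_886_071 + 5_560_931  # from the arrest counts printed earlier

split_weights = [0.5, 0.5]  # weights passed to randomSplit
sample_fraction = 0.1       # fraction passed to sample()

expected_train_full = total_rows * split_weights[0] / sum(split_weights)
expected_test = total_rows * split_weights[1] / sum(split_weights)
expected_train_subset = expected_train_full * sample_fraction

print(f"train_full   ~ {expected_train_full:,.0f} rows")
print(f"test         ~ {expected_test:,.0f} rows")
print(f"train_subset ~ {expected_train_subset:,.0f} rows "
      f"({expected_train_subset / total_rows:.0%} of all rows)")
```

In other words, the model is tuned on about 5% of the data and evaluated on about 50% of it.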

In [70]:
test.persist()
train_subset.persist()

DataFrame[community_area: string, id: string, case_number: string, date: string, block: string, primary_type: string, description: string, location_description: string, arrest: int, domestic: string, beat: string, district: string, ward: string, x_coordinate: string, y_coordinate: string, year: string, latitude: double, longitude: double, location: string, month: string, community_name: string, birth_rate: string, below_poverty_level: double, crowded_housing: double, no_high_school_diploma: double, per_capita_income: int, unemployment: double, hourofday: int, dayofweek: int, season: string, isweekend: int, areacrimecount: bigint, timesincelastcrime: bigint, prevyearcrimecount: bigint, yearlycrimeratechange: double, nextcrimedate: string, recurrentcrime: int, PrevCrimesAtLocation: bigint]

In [61]:
cvModel = crossval.fit(train_subset)

In [63]:
predictions = cvModel.transform(test)

accuracy = evaluator.evaluate(predictions)
print(f"Test Accuracy: {accuracy:.2f}")

Test Accuracy: 0.87

In [74]:
from pyspark.mllib.evaluation import BinaryClassificationMetrics

# Note: the "prediction" column holds hard 0/1 labels, not scores, so both
# curves below are built from a single operating point.
predictionAndLabels = predictions.select("prediction", "arrest").rdd.map(lambda x: (float(x[0]), float(x[1])))
metrics = BinaryClassificationMetrics(predictionAndLabels)

print(f"Area under PR = {metrics.areaUnderPR}")
print(f"Area under ROC = {metrics.areaUnderROC}")

Area under PR = 0.7855401444161383
Area under ROC = 0.7518499517907852
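
Because the metrics above are computed from the hard 0/1 `prediction` column, the ROC "curve" has only one interior point. The sketch below shows what the area reduces to in that case: the average of the true-positive rate and the true-negative rate, i.e. balanced accuracy. The confusion-matrix counts are made up for illustration and are not taken from this notebook.

```python
def single_point_roc_auc(tp, fp, tn, fn):
    """Area under the ROC polyline (0,0) -> (FPR,TPR) -> (1,1)."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    # Trapezoid rule over the two line segments.
    area = 0.5 * fpr * tpr + 0.5 * (1 - fpr) * (tpr + 1)
    return area, tpr, fpr

# Hypothetical counts, for illustration only.
area, tpr, fpr = single_point_roc_auc(tp=50, fp=10, tn=90, fn=50)
balanced_accuracy = 0.5 * (tpr + (1 - fpr))

print(f"TPR={tpr:.2f}  FPR={fpr:.2f}")
print(f"single-point ROC area = {area:.2f}")
print(f"balanced accuracy     = {balanced_accuracy:.2f}")
```

Passing a continuous score instead, for example the positive-class probability, would trace a full curve and measure ranking quality across thresholds; that variant was not run here, so no figure for it is given.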

In [76]:
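import pandas as pd

feature_importances = cvModel.bestModel.stages[-1].featureImportances
encoded_feature_names = [c + "_vec" for c in categorical_columns] + numerical_columns

# Caveat: featureImportances has one entry per slot of the assembled vector, and a
# one-hot column spans many slots. zip() stops at the shorter input, so the 17 names
# are paired with the first 17 slots only, not with per-column totals.
feature_importances_df = pd.DataFrame(list(zip(encoded_feature_names, feature_importances)), columns=["Feature", "Importance"])
feature_importances_df = feature_importances_df.sort_values(by="Importance", ascending=False)
print(feature_importances_df)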

                     Feature  Importance
0         community_area_vec    0.000877
3   location_description_vec    0.000305
12       below_poverty_level    0.000172
6                   ward_vec    0.000109
2            description_vec    0.000081
1           primary_type_vec    0.000073
4                   beat_vec    0.000066
13           crowded_housing    0.000049
14    no_high_school_diploma    0.000048
16              unemployment    0.000045
5               district_vec    0.000038
11                 dayofweek    0.000037
10                 hourofday    0.000036
9                  longitude    0.000034
15         per_capita_income    0.000033
7                 season_vec    0.000028
8                   latitude    0.000028
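
The importances printed above sum to about 0.002, whereas a random forest's importances normally sum to 1. That is consistent with the importance vector being indexed by slot of the assembled feature vector rather than by input column. A minimal sketch of how per-slot values could be rolled up into per-column totals follows; the slot counts and importance values are a hypothetical toy layout, and in practice the real slot counts would have to be read from the fitted pipeline.

```python
def importance_by_column(slot_importances, column_sizes):
    """Sum per-slot importances into one total per input column.

    column_sizes: list of (column_name, number_of_slots) in assembler input order.
    """
    expected = sum(size for _, size in column_sizes)
    if len(slot_importances) != expected:
        raise ValueError(f"expected {expected} slots, got {len(slot_importances)}")
    totals, start = {}, 0
    for name, size in column_sizes:
        totals[name] = sum(slot_importances[start:start + size])
        start += size
    return totals

# Hypothetical toy layout: a 3-slot one-hot column followed by two numeric columns.
column_sizes = [("primary_type_vec", 3), ("latitude", 1), ("hourofday", 1)]
slot_importances = [0.30, 0.25, 0.05, 0.10, 0.30]

totals = importance_by_column(slot_importances, column_sizes)
for name, value in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:20s} {value:.2f}")
```

The length check is the important part of the design: it fails loudly when the name list and the importance vector do not line up, instead of silently truncating.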

The analysis aimed to predict the likelihood of an arrest during a crime from features covering crime type, location, time, and socio-economic indicators. A Random Forest classifier, tuned with 5-fold cross-validation on a 10% sample of the training half of the data, achieved a test accuracy of 0.87, against a baseline of about 0.75 for always predicting no arrest. The area under the precision-recall curve (AUC-PR) was 0.79 and the area under the ROC curve (AUC-ROC) was 0.75. Both were computed from the hard 0/1 predictions rather than class probabilities, so they describe a single operating point rather than the model's ability to rank cases across thresholds.

As printed, the table ranks Community Area (community_area_vec) first, followed by Location Description (location_description_vec) and Below Poverty Level (below_poverty_level). Next come Ward (ward_vec), Description (description_vec), the type of crime (primary_type_vec), and the police beat (beat_vec), with crowded housing, educational attainment, unemployment, district, day of the week, hour of the day, and the geographic coordinates further down. This ranking should be read with caution. The importance vector is indexed by slot of the assembled feature vector rather than by input column, and the listed values sum to roughly 0.002 instead of 1, so the name-to-value pairing does not reflect each column's total contribution. A strong effect of geographic area would track with what we would expect in Chicago, but confirming it requires summing the importances over all slots belonging to each column.

In [None]:
spark.stop()