# Query 5: Nearest Police Station Analysis

## Description:
Using Apache Sedona spatial functions, we assign each crime record (from the 2010–2019 and 2020–Present datasets) to its nearest police station, measured by spherical distance. Records with coordinates (0, 0) are excluded. Then, for each police division, we compute:
- The number of crimes whose nearest station belongs to that division.
- The average distance (in km) from those crimes to the station.

The final results are presented sorted in descending order by the crime count.
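
The assignment and aggregation described above can be sketched in plain Python, without Spark. This is a minimal illustration, not the notebook's implementation: the function names, the Earth-radius constant, and the sample coordinates are all hypothetical, and the haversine formula is used here as a stand-in for the spherical distance that Sedona computes.

```python
import math
from collections import defaultdict

# Assumption: mean Earth radius in km; Sedona's internal value may differ slightly.
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two (lon, lat) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def summarize_by_nearest_station(crimes, stations):
    """Return (division, avg_distance_km, crime_count) rows, sorted by count descending.

    crimes:   iterable of (lon, lat) tuples
    stations: dict mapping division name -> (lon, lat)
    """
    totals = defaultdict(lambda: [0, 0.0])  # division -> [count, summed distance]
    for lon, lat in crimes:
        # Pick the station with the smallest distance to this crime.
        name, dist = min(
            ((n, haversine_km(lon, lat, slon, slat)) for n, (slon, slat) in stations.items()),
            key=lambda pair: pair[1],
        )
        totals[name][0] += 1
        totals[name][1] += dist
    rows = [(name, round(total / cnt, 3), cnt) for name, (cnt, total) in totals.items()]
    return sorted(rows, key=lambda r: r[2], reverse=True)


# Hypothetical sample data, invented for illustration only.
stations = {"A": (-118.25, 34.05), "B": (-118.45, 34.20)}
crimes = [(-118.26, 34.05), (-118.24, 34.06), (-118.44, 34.19)]

rows = summarize_by_nearest_station(crimes, stations)
for division, avg_km, n in rows:
    print(division, avg_km, n)
```

The loop compares every crime against every station, which mirrors the cross join used in the query; with only a few dozen stations this brute-force comparison stays tractable.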

## Resource Configurations Tested:  
The query was executed with a total of 8 executor cores and 16 GB of executor memory, split in three ways:
- **Configuration 1:** 2 executors, each with 4 cores and 8 GB memory
- **Configuration 2:** 4 executors, each with 2 cores and 4 GB memory
- **Configuration 3:** 8 executors, each with 1 core and 2 GB memory
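
As a quick sanity check, the totals for the three configurations can be verified with a few lines of Python. The dictionary layout and the `total_resources` helper are hypothetical names introduced for this sketch; only the executor counts, cores, and memory sizes come from the list above.

```python
# Per-executor settings for each configuration, taken from the list above.
configs = {
    "Configuration 1": {"executors": 2, "cores": 4, "memory_gb": 8},
    "Configuration 2": {"executors": 4, "cores": 2, "memory_gb": 4},
    "Configuration 3": {"executors": 8, "cores": 1, "memory_gb": 2},
}


def total_resources(cfg):
    """Return (total cores, total memory in GB) across all executors."""
    return cfg["executors"] * cfg["cores"], cfg["executors"] * cfg["memory_gb"]


for name, cfg in configs.items():
    cores, memory_gb = total_resources(cfg)
    print(f"{name}: {cores} cores, {memory_gb} GB")
```

The sketch counts executor resources only; driver memory and any per-executor overhead are not included.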

Finally, the output is saved to an S3 path as a CSV file for further analysis.


In [None]:
%%configure -f
{
    "conf": {
        "spark.executor.instances": "2",
        "spark.executor.memory": "8g",
        "spark.executor.cores": "4"
    }
}

In [None]:
%%configure -f
{
    "conf": {
        "spark.executor.instances": "4",
        "spark.executor.memory": "4g",
        "spark.executor.cores": "2"
    }
}

In [None]:
%%configure -f
{
    "conf": {
        "spark.executor.instances": "8",
        "spark.executor.memory": "2g",
        "spark.executor.cores": "1"
    }
}

In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, expr, round as spark_round, count, avg, row_number
from sedona.spark import *
import time
from pyspark.sql.window import Window


spark = SparkSession.builder \
    .appName("Query 5: Nearest Police Station Analysis") \
    .getOrCreate()


sedona = SedonaContext.create(spark)
print("Running query with the following configuration:")
conf = spark.sparkContext.getConf()

print("Executor Instances:", conf.get("spark.executor.instances"))
print("Executor Memory:", conf.get("spark.executor.memory"))
print("Executor Cores:", conf.get("spark.executor.cores"))



Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,User,Current session?
2901,application_1732639283265_2860,pyspark,idle,Link,Link,,✔


SparkSession available as 'spark'.


Running query with the following configuration:
Executor Instances: 8
Executor Memory: 2g
Executor Cores: 1

In [3]:
# Start the timer to measure execution duration
start_time = time.time()

# Step 1: Load both crime CSV files, drop records located at (0, 0),
# and build a point geometry for each crime
crime_df = spark.read.csv(
    "s3://initial-notebook-data-bucket-dblab-905418150721/CrimeData/Crime_Data_from_2010_to_2019_20241101.csv",
    header=True,
    inferSchema=True
).union(
    spark.read.csv(
        "s3://initial-notebook-data-bucket-dblab-905418150721/CrimeData/Crime_Data_from_2020_to_Present_20241101.csv",
        header=True,
        inferSchema=True
    )
).filter((col("LAT") != 0) | (col("LON") != 0)) \
 .withColumn("crime_point", expr("ST_Point(LON, LAT)")) 

# Step 2: Load police station data and create spatial points
police_station_df = spark.read.csv(
    "s3://initial-notebook-data-bucket-dblab-905418150721/LA_Police_Stations.csv",
    header=True,
    inferSchema=True
).withColumn("station_point", expr("ST_Point(X, Y)"))  

# Step 3: Perform a cross join to compute distances between every crime and every police station
crime_station_distance = crime_df.crossJoin(police_station_df).select(
    col("DR_NO"),
    col("crime_point"),
    col("DIVISION").alias("police_division"),
    (ST_DistanceSphere(col("crime_point"), col("station_point")) / 1000).alias("distance_km")
)

# Step 4: Use a window function to find the nearest police station for each crime
w = Window.partitionBy("DR_NO").orderBy("distance_km")

nearest_station_df = crime_station_distance \
    .withColumn("rn", row_number().over(w)) \
    .filter(col("rn") == 1) \
    .select("DR_NO", "police_division", "distance_km")

# Step 5: Group by police division and compute the average distance and crime count
final_result = nearest_station_df.groupBy("police_division").agg(
    spark_round(avg("distance_km"), 3).alias("avg_distance_km"),
    count("*").alias("crime_count")
).orderBy(col("crime_count").desc())

final_result.show()

end_time = time.time()

print(f"Duration: {round(end_time - start_time, 2)} seconds") 

+----------------+---------------+-----------+
| police_division|avg_distance_km|crime_count|
+----------------+---------------+-----------+
|       HOLLYWOOD|          2.076|     224340|
|        VAN NUYS|          2.953|     210134|
|       SOUTHWEST|          2.191|     188901|
|        WILSHIRE|          2.593|     185996|
|     77TH STREET|          1.717|     171827|
|         OLYMPIC|          1.724|     170897|
| NORTH HOLLYWOOD|          2.643|     167854|
|         PACIFIC|           3.85|     161359|
|         CENTRAL|          0.992|     153871|
|         RAMPART|          1.535|     152736|
|       SOUTHEAST|          2.422|     152176|
|     WEST VALLEY|          3.036|     138643|
|         TOPANGA|          3.297|     138217|
|        FOOTHILL|          4.251|     134896|
|          HARBOR|          3.703|     126747|
|      HOLLENBECK|           2.68|     115837|
|WEST LOS ANGELES|          2.792|     115781|
|          NEWTON|          1.635|     111110|
|       NORTH

In [5]:
# Define the S3 output path
s3_path = "s3://groups-bucket-dblab-905418150721/group21/police_station_analysis/"

# Save the final results as a CSV
final_result.coalesce(1).write.mode("overwrite").csv(s3_path, header=True)
print(f"Final results saved to: {s3_path}")


Final results saved to: s3://groups-bucket-dblab-905418150721/group21/police_station_analysis/