# Query 4: Racial Profiling of Crime Victims
 
## Description:

The goal of Query 4 is to count crime victims per racial group (victim descent) in the 3 LA communities with the highest per-person income and the 3 with the lowest, using crimes recorded in 2015. The analysis includes:

- Calculating the average per-person income for each community and selecting the top 3 and bottom 3

- Selecting the 2015 crimes whose location falls within those communities

- Counting the victims for each racial group
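
The income step can be sketched in plain Python on made-up numbers. The community names, block figures, and the helpers `parse_income` and `income_per_person` are hypothetical; the notebook itself does this with Spark aggregations, estimating a community's total income as housing units times the ZIP code's estimated median income and dividing by population.

```python
import re
from collections import defaultdict

def parse_income(text):
    """Turn a string such as "$52,806" into 52806.0 (same "[$,]" cleanup as the query)."""
    return float(re.sub(r"[$,]", "", text))

def income_per_person(blocks):
    """blocks: iterable of (community, population, housing_units, median_income_text)."""
    population = defaultdict(int)
    income = defaultdict(float)
    for comm, pop, units, median_text in blocks:
        population[comm] += pop
        income[comm] += units * parse_income(median_text)
    # Skip communities with no residents to avoid dividing by zero.
    return {c: round(income[c] / population[c], 2) for c in population if population[c] > 0}

# Hypothetical census blocks: (community, population, housing units, median income).
blocks = [
    ("A", 100, 50, "$100,000"),
    ("A", 100, 50, "$80,000"),
    ("B", 200, 50, "$40,000"),
    ("C", 100, 40, "$60,000"),
    ("D", 300, 60, "$30,000"),
    ("E", 100, 30, "$50,000"),
    ("F", 400, 80, "$25,000"),
]

per_person = income_per_person(blocks)
ranked = sorted(per_person, key=per_person.get, reverse=True)
top3, bottom3 = ranked[:3], ranked[-3:]
print(per_person)
print("top 3:", top3, "bottom 3:", bottom3)
```

The zero-population guard is an addition in this sketch; the notebook code divides directly.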

## Configurations Tested

- 2 executors $\times$ 1 core / 2 GB memory per executor

- 2 executors $\times$ 2 cores / 4 GB memory per executor

- 2 executors $\times$ 4 cores / 8 GB memory per executor

The results are saved as CSV files for further analysis. 
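
As a sketch, the three `%%configure` payloads could be generated from a single list rather than typed by hand. The helper `executor_conf` and the `CONFIGURATIONS` list are hypothetical; only the three Spark property names and their values come from this notebook.

```python
import json

def executor_conf(instances, cores, memory_gb):
    """Build the JSON body that follows `%%configure -f` for one configuration."""
    return {
        "conf": {
            "spark.executor.instances": str(instances),
            "spark.executor.memory": f"{memory_gb}g",
            "spark.executor.cores": str(cores),
        }
    }

# (executors, cores per executor, memory per executor in GB), as listed above.
CONFIGURATIONS = [(2, 1, 2), (2, 2, 4), (2, 4, 8)]

payloads = [executor_conf(*cfg) for cfg in CONFIGURATIONS]
for payload in payloads:
    print(json.dumps(payload, indent=4))
```

Each printed payload would go into its own `%%configure -f` cell, and only one configuration is active per session.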


In [None]:
%%configure -f
{
    "conf": {
        "spark.executor.instances": "2",
        "spark.executor.memory": "2g",
        "spark.executor.cores": "1"
    }

}

In [None]:
%%configure -f
{
    "conf": {
        "spark.executor.instances": "2",
        "spark.executor.memory": "4g",
        "spark.executor.cores": "2"
    }
}

In [1]:
%%configure -f
{
    "conf": {
        "spark.executor.instances": "2",
        "spark.executor.memory": "8g",
        "spark.executor.cores": "4"
    }
}



In [3]:
from pyspark.sql import SparkSession
# `sum` and `round` here are Spark column functions and shadow the Python builtins.
from pyspark.sql.functions import col, regexp_replace, to_date, year, expr, desc, asc, sum, round
from sedona.spark import *
from sedona.sql import *
import time

def query4(spark):
    print("Running query with the following configuration:")
    conf = spark.sparkContext.getConf()

    print("Executor Instances:", conf.get("spark.executor.instances"))
    print("Executor Memory:", conf.get("spark.executor.memory"))
    print("Executor Cores:", conf.get("spark.executor.cores"))
    start_time = time.time()
    
    # Load the crime records, drop rows located at (0, 0), and build a point geometry per crime.
    crime_data = spark.read.csv(
        "s3://initial-notebook-data-bucket-dblab-905418150721/CrimeData/Crime_Data_from_2010_to_2019_20241101.csv",
        header=True,
        inferSchema=True
    ).filter((col("LAT") != 0) | (col("LON") != 0)) \
    .withColumn("geom", expr("ST_Point(LON, LAT)")) 

    income_data = spark.read.csv(
        "s3://initial-notebook-data-bucket-dblab-905418150721/LA_income_2015.csv",
        header=True,
        inferSchema=True
    )
    
    race_ethnicity_codes = spark.read.csv(
        "s3://initial-notebook-data-bucket-dblab-905418150721/RE_codes.csv",
        header=True,
        inferSchema=True
    )
    
    
    # NOTE: this relies on the global `sedona` context, which must exist before query4() is called.
    blocks_df = sedona.read.format("geojson") \
        .option("multiLine", "true").load("s3://initial-notebook-data-bucket-dblab-905418150721/2010_Census_Blocks.geojson") \
        .selectExpr("explode(features) as features") \
        .select("features.*")

    flattened_df = blocks_df.select(
        [col(f"properties.{col_name}").alias(col_name) for col_name in blocks_df.schema["properties"].dataType.fieldNames()] + ["geometry"]
    ).drop("properties").drop("type")

    zipcode_df = flattened_df.filter((col("CITY") == "Los Angeles")) \
        .select(
            col("COMM"),
            col("ZCTA10").alias("ZIPCODE"),
            col("POP_2010").alias("POPULATION"),
            col("HOUSING10").alias("HOUSING_UNITS"),
            col("geometry")
        )

    income = income_data.withColumn(
        "ZIPCODE",
        col("Zip Code")
    ).withColumn(
        "MEDIAN_INCOME",
        regexp_replace(col("Estimated Median Income"), "[$,]", "").cast("double")
    ).select("ZIPCODE", "MEDIAN_INCOME")

    zipcode_income = zipcode_df.join(
        income, 
        on="ZIPCODE",
        how="inner"
    )
    
    # Estimate each community's total income as housing units x estimated median income,
    # then divide by the population to get the average income per person.
    zipcode_income_agg = zipcode_income.groupBy("COMM").agg(
        sum("POPULATION").alias("TOTAL_POPULATION"),
        sum(col("HOUSING_UNITS") * col("MEDIAN_INCOME")).alias("TOTAL_INCOME"),
        ST_Union_Aggr("geometry").alias("geometry")
    ).withColumn(
        "AVERAGE_INCOME_PER_PERSON",
        round(col("TOTAL_INCOME") / col("TOTAL_POPULATION"), 2)
    )
    
    top3_income = zipcode_income_agg.orderBy(desc("AVERAGE_INCOME_PER_PERSON")).select("geometry").limit(3)
    bottom3_income = zipcode_income_agg.orderBy(asc("AVERAGE_INCOME_PER_PERSON")).select("geometry").limit(3)
    
    crime_data_2015 = crime_data \
        .withColumn("Date", to_date(col("DATE OCC"), "MM/dd/yyyy hh:mm:ss a")) \
        .filter(year(col("Date")) == 2015)

    top3_crime_income_joined = crime_data_2015.join(    
        top3_income,
        expr("ST_Within(geom, geometry)"),
        "inner"
    )
    
    bottom3_crime_income_joined = crime_data_2015.join(    
        bottom3_income,
        expr("ST_Within(geom, geometry)"),
        "inner"
    )
    
    top3_crime_income_joined = top3_crime_income_joined.join(
        race_ethnicity_codes,
        "Vict Descent",
        "inner"
    )
    
    bottom3_crime_income_joined = bottom3_crime_income_joined.join(
        race_ethnicity_codes,
        "Vict Descent",
        "inner"
    )    
    
    top3_victim_counts = top3_crime_income_joined.groupBy("Vict Descent Full") \
         .count() \
         .withColumnRenamed("count", "#") \
         .withColumnRenamed("Vict Descent Full", "Victim Descent") \
         .orderBy(desc("#"))
    
    bottom3_victim_counts = bottom3_crime_income_joined.groupBy("Vict Descent Full") \
         .count() \
         .withColumnRenamed("count", "#") \
         .withColumnRenamed("Vict Descent Full", "Victim Descent") \
         .orderBy(desc("#"))
    
    print("Top Income Groups Victim Counts")
    top3_victim_counts.show(truncate=False)

    print("Bottom Income Groups Victim Counts")
    bottom3_victim_counts.show(truncate=False)
    
    print(f"Time elapsed: {time.time() - start_time:.2f} s")
    return top3_victim_counts, bottom3_victim_counts
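
The spatial joins in `query4` keep a crime only if its point lies inside one of the selected community geometries. A minimal illustration of such a containment test is the ray-casting check below, written in plain Python for a single simple polygon. It is not how Sedona implements `ST_Within`, and the helper `point_in_polygon`, the square outline, and the points are all made up.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count how many polygon edges a ray going right from (x, y) crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge meets that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Hypothetical community outline as (lon, lat) pairs and two hypothetical crime locations.
square = [(-118.3, 34.0), (-118.2, 34.0), (-118.2, 34.1), (-118.3, 34.1)]
crimes = [(-118.25, 34.05), (-118.10, 34.05)]

matched = [p for p in crimes if point_in_polygon(p[0], p[1], square)]
print(matched)
```

An odd number of crossings means the point is inside, which is why the flag is toggled on every crossing.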


In [6]:
spark = SparkSession.builder \
    .appName("Query 4") \
    .getOrCreate()

sedona = SedonaContext.create(spark)
top_df, bottom_df = query4(spark)


Running query with the following configuration:
Executor Instances: 2
Executor Memory: 8g
Executor Cores: 4
Top Income Groups Victim Counts
+------------------------------+---+
|Victim Descent                |#  |
+------------------------------+---+
|White                         |695|
|Other                         |86 |
|Hispanic/Latin/Mexican        |77 |
|Unknown                       |49 |
|Black                         |43 |
|Other Asian                   |22 |
|Chinese                       |1  |
|American Indian/Alaskan Native|1  |
+------------------------------+---+

Bottom Income Groups Victim Counts
+------------------------------+----+
|Victim Descent                |#   |
+------------------------------+----+
|Hispanic/Latin/Mexican        |3342|
|Black                         |1127|
|White                         |428 |
|Other                         |252 |
|Other Asian                   |138 |
|Unknown                       |30  |
|American Indian/Alaskan Native|23  |
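
To compare the two tables on the same scale, the printed counts can be converted into shares. This is only a sketch over the rows shown above; the printed output may be truncated, so the totals cover the listed rows only. The helper `shares` is hypothetical.

```python
# Counts copied from the two tables printed above.
top_counts = {
    "White": 695,
    "Other": 86,
    "Hispanic/Latin/Mexican": 77,
    "Unknown": 49,
    "Black": 43,
    "Other Asian": 22,
    "Chinese": 1,
    "American Indian/Alaskan Native": 1,
}
bottom_counts = {
    "Hispanic/Latin/Mexican": 3342,
    "Black": 1127,
    "White": 428,
    "Other": 252,
    "Other Asian": 138,
    "Unknown": 30,
    "American Indian/Alaskan Native": 23,
}

def shares(counts):
    """Return each group's percentage of the listed victims, rounded to one decimal."""
    total = sum(counts.values())
    return {group: round(100 * n / total, 1) for group, n in counts.items()}

top_shares = shares(top_counts)
bottom_shares = shares(bottom_counts)
for name, table in (("Top 3", top_shares), ("Bottom 3", bottom_shares)):
    print(name, table)
```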


In [5]:
# Define the S3 output paths
top_s3_path = "s3://groups-bucket-dblab-905418150721/group21/racial_profiling_top3/"
bottom_s3_path = "s3://groups-bucket-dblab-905418150721/group21/racial_profiling_bottom3/"
# Save each result as a single CSV part file (coalesce(1)) with a header row
top_df.coalesce(1).write.mode("overwrite").csv(top_s3_path, header=True)
print(f"Final results saved to: {top_s3_path}")
bottom_df.coalesce(1).write.mode("overwrite").csv(bottom_s3_path, header=True)
print(f"Final results saved to: {bottom_s3_path}")


Final results saved to: s3://groups-bucket-dblab-905418150721/group21/racial_profiling_top3/
Final results saved to: s3://groups-bucket-dblab-905418150721/group21/racial_profiling_bottom3/
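
Spark writes each result as a directory containing a `part-*.csv` file rather than as a single named file, so reading it back for further analysis takes one extra step. The sketch below assumes the output directory has been copied to local disk; `read_spark_csv_dir`, the part file name, and the sample rows written into the temporary directory are hypothetical stand-ins.

```python
import csv
import glob
import os
import tempfile

def read_spark_csv_dir(directory):
    """Read every part-*.csv in a Spark output directory into a list of row dicts."""
    rows = []
    for path in sorted(glob.glob(os.path.join(directory, "part-*.csv"))):
        with open(path, newline="", encoding="utf-8") as handle:
            rows.extend(csv.DictReader(handle))
    return rows

# Simulate a local copy of an output directory holding a single part file.
with tempfile.TemporaryDirectory() as out_dir:
    part = os.path.join(out_dir, "part-00000-example.csv")
    with open(part, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Victim Descent", "#"])
        writer.writerow(["White", 695])
        writer.writerow(["Other", 86])
    # Spark also leaves a marker file next to the data; the reader ignores it.
    open(os.path.join(out_dir, "_SUCCESS"), "w").close()
    rows = read_spark_csv_dir(out_dir)

print(rows)
```

Values come back as strings, so the `#` column needs an `int(...)` conversion before any arithmetic.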