# Query 1: Victim Age Group Analysis
  
## Description
The goal of Query 1 is to rank victim age groups by the number of incidents involving any form of “aggravated assault”, in descending order of incident count. The age groups are defined as follows:  
- **Children:** < 18  
- **Young Adults:** 18 – 24  
- **Adults:** 25 – 64  
- **Seniors:** > 64  
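
The age-group boundaries above can be sketched as a plain Python function. This is only an illustration of the cut-offs: `age_group` is a hypothetical helper that does not appear in the notebook, and it assumes ages are whole numbers, as in the `Vict Age` column.

```python
def age_group(age: int) -> str:
    """Map a victim age to one of the four groups defined above."""
    if age < 18:
        return "Children"
    if age <= 24:
        return "Young Adults"
    if age <= 64:
        return "Adults"
    return "Seniors"


# Print the group for the ages on either side of each cut-off.
for age in (17, 18, 24, 25, 64, 65):
    print(age, age_group(age))
```

The checks are ordered from youngest to oldest, so each branch only needs an upper bound.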
  
This notebook implements the query with both the DataFrame API and the RDD API. Both implementations run on 4 Spark executors, and their execution times are measured end to end (reading the data, transforming it, and collecting the result) and compared. Finally, the results are saved in CSV files for further analysis.
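
One way the collected results could be written to CSV is sketched below with Python's standard `csv` module. The helper name `save_counts_csv`, the file name, and the sample rows are all assumptions made for illustration; they are not taken from the notebook.

```python
import csv
import os
import tempfile


def save_counts_csv(results, path):
    """Write (age group, count) pairs to a CSV file with a header row."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Age Group", "Count"])
        writer.writerows(results)


# Illustrative rows only; in the notebook these would be the collected results.
example_results = [("Adults", 3), ("Children", 1)]
example_path = os.path.join(tempfile.gettempdir(), "query1_example.csv")
save_counts_csv(example_results, example_path)
```

Because PySpark `Row` objects are tuple subclasses, the same helper would accept the output of either implementation. The file is written on the machine running the Python code, which on a cluster is the driver.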



In [None]:
%%configure -f
{
    "conf": {
        "spark.executor.instances": "4"
    }
}

In [4]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, count, lower
import time

# Create a Spark session
spark = SparkSession.builder \
    .appName("Query1: Victim Age Group Analysis") \
    .getOrCreate()

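# Function to measure the wall-clock execution time of one implementation
def measure_execution_time(func, spark):
    """Run func(spark) once and return (elapsed seconds, results)."""
    # perf_counter is a monotonic clock intended for measuring intervals
    start_time = time.perf_counter()
    results = func(spark)
    end_time = time.perf_counter()
    return end_time - start_time, results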

In [4]:
def dataframe_implementation(spark):
    # Read the crime datasets for 2010-2019 and 2020-Present, then union them.
    crime_data_2010_2019 = spark.read.csv(
        "s3://initial-notebook-data-bucket-dblab-905418150721/CrimeData/Crime_Data_from_2010_to_2019_20241101.csv",
        header=True, inferSchema=True
    )
    crime_data_2020_present = spark.read.csv(
        "s3://initial-notebook-data-bucket-dblab-905418150721/CrimeData/Crime_Data_from_2020_to_Present_20241101.csv",
        header=True, inferSchema=True
    )
    crime_data = crime_data_2010_2019.union(crime_data_2020_present)
    
    # Filter for incidents involving any kind of "aggravated assault" (case-insensitive)
    aggravated_assault = crime_data.filter(lower(col("Crm Cd Desc")).like("%aggravated assault%"))
    
    # Categorize victims into age groups
    age_grouped = aggravated_assault.withColumn(
        "Age Group",
        when(col("Vict Age") < 18, "Children")
        .when((col("Vict Age") >= 18) & (col("Vict Age") <= 24), "Young Adults")
        .when((col("Vict Age") >= 25) & (col("Vict Age") <= 64), "Adults")
        .otherwise("Seniors")
    )
    
    # Group by "Age Group" and count the incidents, ordering by count in descending order
    age_group_counts = age_grouped.groupBy("Age Group") \
        .agg(count("*").alias("Count")) \
        .orderBy(col("Count").desc())
    
    return age_group_counts.collect()

# Measure DataFrame implementation time and display results
df_time, df_results = measure_execution_time(dataframe_implementation, spark)
print(f"DataFrame Time: {df_time:.2f} seconds")
print("DataFrame Results:")
print("\n".join([f"{row['Age Group']}: {row['Count']}" for row in df_results]))


DataFrame Time: 22.04 seconds
DataFrame Results:
Adults: 121093
Young Adults: 33605
Children: 15928
Seniors: 5985
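
To put the counts in proportion, each group's share of the total can be computed directly from the output above. This is a small arithmetic sketch and not part of the original notebook; the counts are copied from the printed results.

```python
# Counts copied from the DataFrame output above.
counts = {
    "Adults": 121093,
    "Young Adults": 33605,
    "Children": 15928,
    "Seniors": 5985,
}

total = sum(counts.values())
shares = {group: n / total for group, n in counts.items()}

print(f"Total incidents: {total}")
for group, share in shares.items():
    print(f"{group}: {share:.1%}")
```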

In [5]:
def rdd_implementation(spark):
    # Read the crime datasets as RDDs for 2010-2019 and 2020-Present.
    crime_data_2010_2019_rdd = spark.read.csv(
        "s3://initial-notebook-data-bucket-dblab-905418150721/CrimeData/Crime_Data_from_2010_to_2019_20241101.csv",
        header=True, inferSchema=True
    ).rdd
    crime_data_2020_present_rdd = spark.read.csv(
        "s3://initial-notebook-data-bucket-dblab-905418150721/CrimeData/Crime_Data_from_2020_to_Present_20241101.csv",
        header=True, inferSchema=True
    ).rdd
    
    # Union the two RDDs
    crime_data_rdd = crime_data_2010_2019_rdd.union(crime_data_2020_present_rdd)
    
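    # Filter for rows that contain "aggravated assault" (case-insensitive).
    # Rows with a missing description are skipped, matching the DataFrame filter.
    aggravated_assault_rdd = crime_data_rdd.filter(
        lambda row: row["Crm Cd Desc"] is not None
        and "aggravated assault" in row["Crm Cd Desc"].lower()
    )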
    

    # Map each row to (Age Group, 1) based on "Vict Age"
    age_group_rdd = aggravated_assault_rdd.map(
        lambda row: (
            "Children" if row["Vict Age"] < 18 else
            "Young Adults" if 18 <= row["Vict Age"] <= 24 else
            "Adults" if 25 <= row["Vict Age"] <= 64 else
            "Seniors", 1
        )
    )
    
    # Reduce by key (Age Group) to count occurrences
    age_group_counts_rdd = age_group_rdd.reduceByKey(lambda x, y: x + y)
    
    # Sort by count in descending order
    sorted_age_group_counts = age_group_counts_rdd.sortBy(lambda x: x[1], ascending=False)
    
    return sorted_age_group_counts.collect()

# Measure RDD implementation time and display results
rdd_time, rdd_results = measure_execution_time(rdd_implementation, spark)
print(f"RDD Time: {rdd_time:.2f} seconds")
print("RDD Results:")
# Use "n" for the loop variable so it does not shadow pyspark.sql.functions.count
for group, n in rdd_results:
    print(f"{group}: {n}")



RDD Time: 22.25 seconds
RDD Results:
Adults: 121093
Young Adults: 33605
Children: 15928
Seniors: 5985
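
Both implementations return identical counts, and the measured times differ by about 0.2 seconds, roughly 1% of the DataFrame time. The sketch below only reproduces that arithmetic from the two printed timings. Each function was run once, and both timings include reading the CSV files with `inferSchema=True`, so a gap this small may well be within run-to-run variation; that reading is an interpretation, not something the notebook measures.

```python
# Timings copied from the two outputs above (seconds).
df_time = 22.04
rdd_time = 22.25

diff = rdd_time - df_time   # absolute gap in seconds
relative = diff / df_time   # gap as a fraction of the DataFrame time

print(f"RDD minus DataFrame: {diff:.2f} s ({relative:.1%} of the DataFrame time)")
```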