In [8]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import trim, lower, col, when, desc
import time
# Initialize SparkSession
spark = SparkSession.builder \
    .appName("Query1_DataFrame") \
    .config("spark.executor.instances", 4) \
    .getOrCreate()

# Start timing
start_time = time.time()
# Load the 2010-2019 crime data
crime_df_2010_2019 = spark.read.csv(
    "s3://initial-notebook-data-bucket-dblab-905418150721/CrimeData/Crime_Data_from_2010_to_2019_20241101.csv", 
    header=True
)

# Load the 2020-present crime data
crime_df_2020_present = spark.read.csv(
    "s3://initial-notebook-data-bucket-dblab-905418150721/CrimeData/Crime_Data_from_2020_to_Present_20241101.csv", 
    header=True
)
crime_df = crime_df_2010_2019.union(crime_df_2020_present)


# crime_df.select("Crm Cd Desc", "LAT", "LON").show(10, truncate=False)

# Clean up 'Crm Cd Desc' column and apply filters
filtered_df = crime_df.filter(
    (lower(trim(col("Crm Cd Desc"))).contains("aggravated assault")) &  # Case-insensitive match
    (col("LAT").isNotNull() & col("LON").isNotNull() & (col("LAT") != 0) & (col("LON") != 0))  # Valid coordinates
)

# filtered_df.select("Crm Cd Desc", "LAT", "LON").show(10, truncate=False)
# filtered_df.count()

# Add Age Group column based on 'Vict Age'
# (rows whose age does not parse as an integer match no branch and get a null group)
age = col("Vict Age").cast("int")
age_grouped_df = filtered_df.withColumn(
    "Age_Group",
    when(age < 18, "Children")
    .when(age.between(18, 24), "Young Adults")  # between() is inclusive on both ends
    .when(age.between(25, 64), "Adults")
    .when(age > 64, "Seniors")
)

# Group by Age Group and count incidents, then sort
result_df = age_grouped_df.groupBy("Age_Group").count().orderBy(desc("count"))

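# Stop timing and print the elapsed time.
# NOTE: the filter, withColumn, groupBy, and orderBy calls above are lazy
# transformations. The actions (write, show) run below, after the timer has
# stopped, so this figure mostly covers the CSV header reads and plan
# construction, not the execution of the query itself.
end_time = time.time()
elapsed_time = end_time - start_time
print(f"Time taken: {elapsed_time:.2f} seconds")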

# Write results
group_number = "24"

s3_path = "s3://groups-bucket-dblab-905418150721/group"+group_number+"/results/"

result_df.write.mode("overwrite").parquet(s3_path + "q1_dataframe_output")

# Show results
result_df.show()



Time taken: 7.61 seconds
+------------+------+
|   Age_Group| count|
+------------+------+
|      Adults|121052|
|Young Adults| 33588|
|    Children| 15923|
|     Seniors|  5985|
+------------+------+
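
Spark transformations are lazy, so a timer that stops before the first action (here `write` or `show`) captures little of the actual query work. Below is a minimal sketch of a timing helper that can be wrapped around whichever action should be measured. The name `timed` and its `sink` parameter are illustrative assumptions, not part of the notebook or of PySpark.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, sink=print):
    """Time the body of a `with` block and report the elapsed seconds.

    `sink` receives the formatted message; it defaults to print().
    """
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        sink(f"{label}: {elapsed:.2f} seconds")


# Hypothetical usage with the DataFrame from the cell above (needs a live
# Spark session, so it is left commented out here):
#
# with timed("Query 1 (DataFrame), including the write"):
#     result_df.write.mode("overwrite").parquet(s3_path + "q1_dataframe_output")
```

Wrapping the action rather than the transformations makes the figure comparable with an RDD version only if that version is timed around its own action in the same way.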

In [None]:
# Summarized

# The performance difference between the DataFrame and RDD versions of this query comes mainly from the following factors:

# 1. **Optimized Execution (DataFrame API)**: DataFrame queries go through the **Catalyst Optimizer**, which rewrites the logical plan (filter pushdown, column pruning) before execution, so less data is parsed and carried through the job.

# 2. **No Automatic Optimization (RDD)**: Spark treats the functions passed to RDD operations as opaque code, so it cannot reorder, prune, or compile them. In PySpark, every record is also serialized between the JVM and the Python worker processes, whereas built-in DataFrame functions run entirely inside the JVM.

# 3. **Efficient Data Representation**: DataFrames keep rows in Tungsten's compact binary format and operate on them directly (a columnar layout is used for cached data and for sources such as Parquet). RDDs hold generic objects, which cost more memory and more serialization work.

# 4. **Parallelism and Distribution**: Both APIs run on the same scheduler, but DataFrame aggregations are planned with a partial aggregation per partition, so only small intermediate results are shuffled.

# 5. **Cluster Overhead**: An RDD job shuffles whatever the code tells it to (for example, `groupByKey` moves every record), while the DataFrame planner chooses the shuffle strategy itself.

# In summary, **DataFrames** are generally faster than **RDDs** for structured data processing because of these built-in optimizations, which makes them the preferred choice for most Spark workloads.





# The difference in execution times between the DataFrame and RDD versions can be attributed to the design of each API and the optimizations available to it. The sections below go through the main factors.

# ### 1. **Optimized Execution (DataFrame API)**:
#    - **Catalyst Optimizer**: The DataFrame API builds a logical plan that the **Catalyst Optimizer** rewrites before anything runs. Relevant optimizations include:
#      - Predicate pushdown: The filters on `Crm Cd Desc`, `LAT`, and `LON` are applied as early as possible in the execution plan, minimizing the number of rows that reach later stages.
#      - Column pruning: Only the columns the query uses (`Crm Cd Desc`, `LAT`, `LON`, `Vict Age`) need to be carried through the plan.
#      - Physical planning: Catalyst turns the optimized logical plan into a physical plan, choosing how each operator (for example, the aggregation) is executed.
#      - Whole-stage code generation: Spark fuses the operators of a stage into generated Java code that is compiled to JVM bytecode, avoiding per-row, per-operator call overhead.
#    - **Built-in Optimizations**: These optimizations are applied automatically, without changes to the query code. Calling `result_df.explain()` prints the resulting physical plan.

# ### 2. **RDD Operations (Low-Level API)**:
#    - **No Automatic Optimization**: Spark sees the functions passed to `map`, `filter`, and similar RDD operations as opaque code, so it cannot push filters down, prune columns, or reorder steps. Any such optimization has to be written by hand.
#    - **Serialization and Deserialization**: In PySpark, RDD functions run in Python worker processes, so every record is serialized from the JVM to Python and the results are serialized back. The DataFrame version uses only built-in column functions (`lower`, `trim`, `when`), which execute inside the JVM without this round trip.
#    - **Lack of Whole-Stage Code Generation**: RDD transformations do not benefit from whole-stage code generation, which is available to DataFrames and Datasets. Each record passes through the user functions one call at a time instead of through a single compiled loop.

# ### 3. **Data Representation**:
#    - **DataFrames vs. RDDs**: DataFrames keep rows in Tungsten's compact binary format, which Spark can filter and aggregate without creating an object per field. A columnar layout is used when a DataFrame is cached and when reading columnar sources such as Parquet. RDDs hold ordinary language objects (Python tuples, lists, or dictionaries in PySpark), which take more memory and more work to serialize.
#    - **Schema Handling**: DataFrames carry schema information (column names, types), which is what lets Spark plan and optimize the operations. With RDDs, the structure of each record is known only to the user code.

# ### 4. **Parallelism and Distribution**:
#    - **RDDs**: Both APIs use the same task scheduler, but with RDDs the partitioning and the choice of shuffle operation are left to the code. A poor choice (for example, `groupByKey` followed by a count instead of `reduceByKey`) moves far more data than necessary.
#    - **DataFrames**: The planner decides how to partition and aggregate. For `groupBy("Age_Group").count()`, each partition is aggregated locally first, so only a few partial counts per partition are shuffled.

# ### 5. **Code Complexity**:
#    - The RDD approach spells out each step as a separate user function (parsing, filtering, mapping to an age group, counting), and Spark runs them as written. The DataFrame version declares the result it wants, and Spark compiles the whole chain into a single execution plan.

# ### 6. **Cluster Overhead**:
#    - **RDD**: Because RDD operations are opaque to Spark, the data shuffled between partitions is whatever the code emits, serialized as generic objects. This can mean more shuffle and network traffic.
#    - **DataFrames**: DataFrame operations shuffle rows in the compact binary format, and aggregations shuffle only partial results, which generally reduces the communication between nodes.

# ### 7. **Spark Version and Configuration**:
#    - The Spark version and configuration also affect the size of the gap. DataFrames are generally faster, but settings that help the RDD path (such as optimized serialization or caching) can narrow it.

# ---

# ### Summary:
# - **RDDs** are more flexible but leave every optimization to the code, which usually means higher overhead for the same operations.
# - **DataFrames** benefit from the **Catalyst Optimizer**, **whole-stage code generation**, and **automatic optimizations**, which result in faster execution for structured queries such as this one.

# Given that the DataFrame version outperforms the RDD version in this case, DataFrames are the recommended choice for most Spark operations, especially when working with structured data and performing filtering and aggregation.
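
To make the RDD-side costs concrete, the following pure-Python sketch mimics what an RDD version of this query does per record: an ordinary Python function assigns the age group, each partition is counted locally, and the partial counts are merged. The names `age_group`, `count_partition`, and `merge_counts`, as well as the sample ages, are assumptions for illustration; they are not taken from the notebook's RDD code, which is not shown in this section.

```python
from collections import Counter
from functools import reduce


def age_group(raw_age):
    """Bucket a raw 'Vict Age' string; roughly mirrors the when() chain above."""
    try:
        age = int(raw_age)
    except (TypeError, ValueError):
        return None  # unparseable ages get no group
    if age < 18:
        return "Children"
    if age <= 24:
        return "Young Adults"
    if age <= 64:
        return "Adults"
    return "Seniors"


def count_partition(raw_ages):
    """Count age groups within one partition (the local, partial aggregate).

    Records without a group are dropped here; the DataFrame version would
    keep them as a null group instead.
    """
    groups = (age_group(a) for a in raw_ages)
    return Counter(g for g in groups if g is not None)


def merge_counts(partials):
    """Merge the per-partition counters (the step that follows the shuffle)."""
    return reduce(lambda a, b: a + b, partials, Counter())


# Two made-up partitions of raw age strings, including unparseable values.
partitions = [["17", "22", "40", "x"], ["70", "30", "19", None]]
totals = merge_counts(count_partition(p) for p in partitions)
print(totals.most_common())
```

In a PySpark RDD job, functions like these run in Python worker processes, once per record, which is the serialization and interpretation overhead described above. The DataFrame version expresses the same logic with built-in column functions that Spark can plan and compile.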