#### Useful Links 
- Spark History Server : http://83.212.73.248:18080/
- Hadoop YARN (scheduler) : http://83.212.73.248:8088/cluster
- HDFS : http://83.212.73.248:9870/dfshealth.html#tab-overview

#### Useful Commands : 
- Connect to okeanos-master (from local) : `$ ssh user@snf-40202.ok-kno.grnetcloud.net `
    - Password : 'Rand0m'
- Connect to okeanos-worker (from okeanos-master) : `$ ssh okeanos-worker`
- Open Jupyter Notebook : `$ jupyter notebook --ip 83.212.73.248 --port 8888`

#### Things to do :
- Convert the data from CSV to Parquet
- Cast the columns to the types we want
- Write the queries (!)
- Benchmark and optimize them
- Balance the data on HDFS across the two datanodes

### Full HDFS path is here : hdfs://okeanos-master:54310/csv_data/
and contains :  
     
     1.  hdfs://okeanos-master:54310/csv_data/LAPD_Police_Stations.csv
     2.  hdfs://okeanos-master:54310/csv_data/crime_data_2019.csv 
     3.  hdfs://okeanos-master:54310/csv_data/crime_data_2023.csv
     4.  hdfs://okeanos-master:54310/csv_data/revgecoding.csv 
     5.  hdfs://okeanos-master:54310/csv_data/income/
         1. LA_income_2015.csv
         2. LA_income_2017.csv
         3. LA_income_2019.csv
         4. LA_income_2021.csv
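
The CSV to Parquet step from the to-do list could be sketched roughly as follows. This is a minimal sketch, not the conversion actually used for this project: `parquet_path_for` and `convert_csv_to_parquet` are hypothetical helper names, the `/parquet/` target layout is assumed from the Parquet paths read later in this notebook, and `spark` is assumed to be an existing SparkSession.

```python
CSV_ROOT = "hdfs://okeanos-master:54310/csv_data/"
PARQUET_ROOT = "hdfs://okeanos-master:54310/parquet/"  # assumed target directory


def parquet_path_for(csv_path):
    """Map a CSV path under CSV_ROOT to the matching Parquet path (hypothetical helper)."""
    if not csv_path.startswith(CSV_ROOT) or not csv_path.endswith(".csv"):
        raise ValueError(f"unexpected CSV path: {csv_path}")
    relative = csv_path.removeprefix(CSV_ROOT).removesuffix(".csv")
    return PARQUET_ROOT + relative + ".parquet"


def convert_csv_to_parquet(spark, csv_path):
    """Read one CSV file with a header row and rewrite it as Parquet."""
    df = spark.read.options(inferSchema="true", delimiter=",", header="true").csv(csv_path)
    out_path = parquet_path_for(csv_path)
    df.write.mode("overwrite").parquet(out_path)
    return out_path
```

A call would look like `convert_csv_to_parquet(spark, "hdfs://okeanos-master:54310/csv_data/LAPD_Police_Stations.csv")`. Some Spark versions reject column names that contain spaces when writing Parquet, so this would need checking for the crime data files.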

In [1]:
# PySpark imports (explicit names instead of a wildcard import,
# which would shadow Python built-ins such as sum, min and max)
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import broadcast, col, rank, regexp_replace, to_timestamp, udf, year
from pyspark.sql.types import DoubleType
from pyspark.sql.window import Window

# Standard library imports
from operator import add
import math
import time

In [2]:
# Initialize the SparkSession (4 executors, 1 GB of memory each)
spark = SparkSession \
    .builder \
    .appName("4 Executors") \
    .config("spark.driver.memory", "1g") \
    .config("spark.executor.memory", "1g") \
    .config("spark.executor.instances", "4") \
    .getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/01/05 01:56:01 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/01/05 01:56:02 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
24/01/05 01:56:04 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
ERROR:root:KeyboardInterrupt while sending command.
Traceback (most recent call last):
  File "/home/user/.local/lib/python3.10/site-packages/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
  File "/home/user/.local/lib/python3.10/site-packages/py4j/clientserver.py", line 511, in send_command
    answer = smart_decode(self.stream.readline()[:-1])
  File "/usr/lib/python3.10/socket.py", line 705, in readinto
    return self._sock.recv_

KeyboardInterrupt: 

In [None]:
# load data into memory, do the necessary joins etc. here
df = spark.read.options(inferSchema="true", delimiter=",", header="true") \
    .csv("hdfs://okeanos-master:54310/csv_data/crime_data_2019.csv")
df2 = spark.read.options(inferSchema="true", delimiter=",", header="true") \
    .csv("hdfs://okeanos-master:54310/csv_data/crime_data_2023.csv")
crime_data = df.union(df2)
revge = spark.read.options(inferSchema="true", delimiter=",", header="true") \
    .csv("hdfs://okeanos-master:54310/csv_data/revgecoding.csv")
# only 2015 income data needed
income = spark.read \
            .parquet("hdfs://okeanos-master:54310/parquet/income/LA_income_2015.parquet")
lapd_stations = spark.read.parquet("hdfs://okeanos-master:54310/parquet/LAPD_Police_Stations.parquet")

In [None]:
crime_data.show()

In [None]:
crime_data.printSchema()

#### Change Column Types 
The assignment says: "Keeping the original column names"

Does this mean that we cannot rename 'Date Rptd' -> 'Date_Rptd'?
- Date Rptd: date
- DATE OCC: date
- Vict Age: integer
- LAT: double
- LON: double

In [None]:
# code for column type changing
crime_data = crime_data.withColumn("Date Rptd", to_timestamp("Date Rptd", 'MM/dd/yyyy hh:mm:ss a')) \
    .withColumn("DATE OCC", to_timestamp("DATE OCC", 'MM/dd/yyyy hh:mm:ss a')) \
    .withColumn("Vict Age", col("Vict Age").cast("int")) \
    .withColumn("LAT", col("LAT").cast("double")) \
    .withColumn("LON", col("LON").cast("double")) \
    .withColumn("Premis_Desc", col("Premis Desc"))
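
One way to change types while keeping the original column names is to write each cast as a SQL expression aliased back to the same name. The sketch below builds such expressions as plain strings. `typed_select_exprs` is a hypothetical helper, and `to_date` is used here as an assumption (the cell above uses `to_timestamp`) because the target type listed for the two date columns is `date`.

```python
DATE_FORMAT = "MM/dd/yyyy hh:mm:ss a"

# Target types from the list above, keyed by original column name.
TARGET_TYPES = {
    "Date Rptd": "date",
    "DATE OCC": "date",
    "Vict Age": "int",
    "LAT": "double",
    "LON": "double",
}


def typed_select_exprs(columns):
    """Return one Spark SQL select expression per column, casting where a target type is known."""
    exprs = []
    for name in columns:
        quoted = f"`{name}`"  # backticks allow spaces in column names
        target = TARGET_TYPES.get(name)
        if target is None:
            exprs.append(quoted)
        elif target == "date":
            exprs.append(f"to_date({quoted}, '{DATE_FORMAT}') AS {quoted}")
        else:
            exprs.append(f"CAST({quoted} AS {target.upper()}) AS {quoted}")
    return exprs
```

It could then be applied with `crime_data.selectExpr(*typed_select_exprs(crime_data.columns))`, which leaves every column name unchanged.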

# 1st Query :
        find
            for each year
                the 3 months with the biggest crime count

        year | month | crime_total (count)  + #order
        dataframe.show()

        SELECT  YEAR(date_rptd) as year,
                MONTH(date_rptd) as month,
                COUNT(*) as crime_total,
                ROW_NUMBER() OVER (PARTITION BY YEAR(date_rptd) ORDER BY COUNT(*) DESC) as '#'
        FROM crime_data
        GROUP BY YEAR(date_rptd), MONTH(date_rptd)
        ORDER BY year ASC, crime_total DESC

In [None]:
# code for first query in SQL API
def query_1_SQL_API():
    start_time = time.time()
    crime_data.createOrReplaceTempView("crime_data")
    
    query = """
        SELECT * FROM (
            SELECT 
                year(`Date Rptd`) AS year,
                month(`Date Rptd`) AS month,
                COUNT(*) AS crime_total,
                ROW_NUMBER() OVER (PARTITION BY year(`Date Rptd`) ORDER BY COUNT(*) DESC) AS rank
            FROM 
                crime_data
            GROUP BY 
                year(`Date Rptd`), month(`Date Rptd`)
        ) ranked_data
        WHERE rank <= 3
        ORDER BY year, rank
    """
    
    
    result_df = spark.sql(query)
    result_df.show()
    
    end_time = time.time()

    return end_time - start_time

In [None]:
# code for first query in Dataframe API
def query_1_Dataframe_API():
    start_time = time.time()
    crime_counts = crime_data.withColumn("year", F.year("Date Rptd")) \
                          .withColumn("month", F.month("Date Rptd")) \
                          .groupBy("year", "month") \
                          .agg(F.count("*").alias("crime_total"))
    
    window_spec = Window.partitionBy("year").orderBy(F.desc("crime_total"))
    
    ranked_crime = crime_counts.withColumn("rank", F.row_number().over(window_spec))
    
    result_df = ranked_crime.filter("rank <= 3").orderBy("year", "rank")
    
    result_df.show()
    end_time = time.time()

    return end_time - start_time
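
Both functions return the wall-clock time of a single run, which includes job scheduling and `show()`. For the benchmarking step, a small helper that repeats a query and summarizes the timings may give steadier numbers. `time_runs` is a hypothetical helper, not part of the original notebook; it assumes each query function returns its elapsed seconds, as the ones above do.

```python
import statistics


def time_runs(query_fn, repeats=3):
    """Call query_fn `repeats` times and summarize the elapsed seconds it returns."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    timings = [query_fn() for _ in range(repeats)]
    ordered = sorted(timings)
    return {
        "runs": timings,                          # timings in execution order
        "fastest": ordered[0],
        "median": statistics.median(timings),
        "slowest": ordered[-1],
    }
```

For example, `time_runs(query_1_SQL_API)` and `time_runs(query_1_Dataframe_API)` could be compared on their medians rather than on one run each.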

# 2nd Query :


            SELECT CASE
                      WHEN `TIME OCC` BETWEEN 500 AND 1159 THEN 'Morning'
                      WHEN `TIME OCC` BETWEEN 1200 AND 1659 THEN 'Noon'
                      WHEN `TIME OCC` BETWEEN 1700 AND 2059 THEN 'Afternoon'
                      ELSE 'Night'
                    END AS time_group,
                    COUNT(*) as count
            FROM crime_data
            WHERE `Premis Desc` = 'STREET'
            GROUP BY time_group
            ORDER BY count DESC

In [None]:
# write code for 2nd query here for Dataframe/SQL API
def query_2_Dataframe_API():
    start_time = time.time()
    filtered_df = crime_data.filter(crime_data['Premis_Desc'] == 'STREET')

    time_group_df = filtered_df.withColumn("time_group",
                                       # TIME OCC is in 24 hour military time integer values
                                      F.when((F.col('TIME OCC').between(500, 1159)), 'Morning')
                                      .when((F.col('TIME OCC').between(1200, 1659)), 'Noon')
                                      .when((F.col('TIME OCC').between(1700, 2059)), 'Afternoon')
                                      .otherwise('Night'))

    result_df = time_group_df.groupBy("time_group").agg(F.count("*").alias("count"))

    result_df = result_df.orderBy(col("count").desc())
    result_df.show()
    end_time = time.time()
    # call explain() method in order
    # to see the query's physical plan
    # and improve the RDD query
    result_df.explain()
    return end_time - start_time

In [None]:
# write code for 2nd query here for RDD API
def time_segs(row):
    if 500 <= int(row['TIME OCC']) <= 1159:
        return 'Morning'
    elif 1200 <= int(row['TIME OCC']) <= 1659:
        return 'Noon'
    elif 1700 <= int(row['TIME OCC']) <= 2059:
        return 'Afternoon'
    else:
        return 'Night'

In [None]:
def query_2_rdd(): 
    #then broadcast
    #spark.sparkContext.broadcast(crime_data)
    # changes performance? (37s)

    start_time = time.time()
    crime_data_rdd = crime_data.rdd.filter(lambda x: x['Premis_Desc'] == 'STREET') \
                                .map(lambda x: (time_segs(x),1)) \
                                .reduceByKey(lambda k1,k2: k1+k2) \
                                .sortBy(lambda x: x[1], ascending = False)
    
    result = crime_data_rdd.collect()
    for time_of_day, count in result:
        print(f"{time_of_day}: {count}")
        
    end_time = time.time()
    return end_time - start_time

In [None]:
def map_time_group(row):
    time_occ = int(row)
    if 500 <= time_occ <= 1159:
        return 'Morning'
    elif 1200 <= time_occ <= 1659:
        return 'Noon'
    elif 1700 <= time_occ <= 2059:
        return 'Afternoon'
    else:
        return 'Night'
        
def query_2_rdd_new():
    start_time = time.time()

    crime_data_rdd = crime_data.rdd.filter(lambda x: x['Premis_Desc'] == 'STREET') \
                                .map(lambda x: (map_time_group(x['TIME OCC']), 1)) \
                                .reduceByKey(add).sortBy(lambda x: x[1], ascending = False)

    result = crime_data_rdd.collect()
    for time_of_day, count in result:
        print(f"{time_of_day}: {count}")
        
    end_time = time.time()
    return end_time - start_time
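
A possible refinement of the RDD variant, offered as an untested sketch: select only the two columns the query needs before dropping to the RDD API, so that fewer fields are converted into Python `Row` objects. `time_bucket` mirrors `map_time_group` above and `query_2_rdd_pruned` is a hypothetical name. Whether this is actually faster on this cluster would need measuring.

```python
from operator import add


def time_bucket(time_occ):
    """Map a 24-hour HHMM integer (e.g. 1730) to a part of the day."""
    time_occ = int(time_occ)
    if 500 <= time_occ <= 1159:
        return 'Morning'
    if 1200 <= time_occ <= 1659:
        return 'Noon'
    if 1700 <= time_occ <= 2059:
        return 'Afternoon'
    return 'Night'


def query_2_rdd_pruned(crime_data):
    """Count street crimes per part of the day, reading only the needed columns."""
    pairs = crime_data.select('Premis_Desc', 'TIME OCC').rdd \
        .filter(lambda row: row['Premis_Desc'] == 'STREET') \
        .map(lambda row: (time_bucket(row['TIME OCC']), 1))
    # Sum the 1s per bucket, then sort buckets by count, largest first.
    return pairs.reduceByKey(add).sortBy(lambda kv: kv[1], ascending=False).collect()
```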

In [None]:
#query_2_Dataframe_API()

In [None]:
#query_2_rdd_new()

In [None]:
#spark.stop()

# 3rd Query :

        find the 3 zip codes with min and max household income
                    |
                    |
                    v
        // filter(remove) victimless crimes
                    |
                    |
                    v
        select vict_desc, COUNT(*) as count
        where year=2015
        group by vict_desc
        order by count DESC

In [None]:
# write code for 3rd query here
def query_3(method = 'CONTINUE'):
    start_time = time.time()

    if method == 'BROADCAST':
        crime_data_join_revge = crime_data.join(broadcast(revge), ['LAT', 'LON'], 'inner') \
            .withColumnRenamed('ZIPcode', 'Zip Code') \
            .withColumn("Zip Code", col("Zip Code").cast("int")) \
            .filter((col('Vict Descent') != 'X') & (col('Vict Sex') != 'X'))
        crime_data_join_revge.explain()
        
    elif method in ['MERGE', 'SHUFFLE_HASH', 'SHUFFLE_REPLICATE_NL']:
        crime_data_join_revge = crime_data.hint(method).join(revge, ['LAT', 'LON'], 'inner') \
            .withColumnRenamed('ZIPcode', 'Zip Code') \
            .withColumn("Zip Code", col("Zip Code").cast("int")) \
            .filter((col('Vict Descent') != 'X') & (col('Vict Sex') != 'X'))
        crime_data_join_revge.explain()
        
    elif method == 'CONTINUE':
        crime_data_join_revge = crime_data.join(revge, ['LAT', 'LON'], 'inner') \
            .withColumnRenamed('ZIPcode', 'Zip Code') \
            .withColumn("Zip Code", col("Zip Code").cast("int")) \
            .filter((col('Vict Descent') != 'X') & (col('Vict Sex') != 'X'))
        
    else:
        return None

    #crime_data_join_revge = crime_data.join(revge, ['LAT', 'LON'], 'inner') \
    #    .withColumnRenamed('ZIPcode', 'Zip Code') \
    #    .withColumn("Zip Code", col("Zip Code").cast("int")) \
    #    .filter((col('Vict Descent') != 'X') & (col('Vict Sex') != 'X'))
    
    crime_data_join_income = crime_data_join_revge.join(income, 'Zip Code', 'inner') \
                                .withColumn('Estimated Median Income', 
                                            regexp_replace(col('Estimated Median Income'), '[$,]', '')) \
                                .withColumn('Estimated Median Income', 
                                            col('Estimated Median Income').cast('double'))
    
    max_income_zip_codes = crime_data_join_income.groupBy('Zip Code') \
                            .agg({'Estimated Median Income': 'max'}) \
                            .withColumnRenamed('max(Estimated Median Income)', 'MaxIncome') \
                            .orderBy(col('MaxIncome').desc()) \
                            .limit(3)
    
    min_income_zip_codes = crime_data_join_income.groupBy('Zip Code') \
                            .agg({'Estimated Median Income': 'min'}) \
                            .withColumnRenamed('min(Estimated Median Income)', 'MinIncome') \
                            .orderBy(col('MinIncome')) \
                            .limit(3)
    
    zip_codes = min_income_zip_codes.union(max_income_zip_codes)
    
    zip_codes_list = [row['Zip Code'] for row in zip_codes.collect()]
    
    result = crime_data_join_income \
                .filter(col('Zip Code').isin(zip_codes_list)) \
                .filter(year(col('Date Rptd')) == 2015) \
                .groupBy('Vict Descent') \
                .count() \
                .withColumnRenamed('count', '#') \
                .orderBy(col('#').desc())
    
    result.show()
    end_time = time.time()
    print(f'Method : {method} | Time {end_time - start_time}')
    return end_time - start_time
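
The income cleanup in `query_3` strips `$` and `,` from the text before casting it to double. The same rule in plain Python gives a quick way to check the pattern `[$,]` on sample strings. `parse_income` is a hypothetical helper, and the sample value in the note below is made up for illustration, not taken from the income files.

```python
import re


def parse_income(text):
    """Strip '$' and ',' from an income string and convert it to float."""
    cleaned = re.sub(r'[$,]', '', text)
    return float(cleaned)
```

For instance, `parse_income("$52,806")` evaluates to `52806.0`.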

# 4th Query :
    1 :
        a)
                make_extra_columns() : distance of police stations from crime
                join crime_table with LA Police Stations on police_station
                and add column of police station
                for each row compute distance from two coordinates                         
                put this computed distance in the column named 'distance'

                SELECT year, SUM(distance)/# as average_distance
                                , COUNT(*) as #
                FROM ...
                WHERE WEAPON < 200
                GROUP BY year
                ORDER BY #


        b)
                SELECT police_station_name as division,
                       SUM(distance)/# as average_distance,
                       COUNT(*) as #
                FROM ...
                WHERE weapon IS NOT NULL
                GROUP BY police_station_name
                ORDER BY #

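    2 :
        same as 1 a) and b), but 'distance' is the distance to the closest police station:
        cross join the crimes with all stations, compute the distance for every pair,
        and keep the minimum per crime (DR_NO)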

In [None]:
def haversine(lon1, lat1, lon2, lat2):
    R = 6371
    dLat = math.radians(lat2 - lat1)
    dLon = math.radians(lon2 - lon1)
    a = math.sin(dLat / 2) * math.sin(dLat / 2) + \
        math.cos(math.radians(lat1)) * math.cos(math.radians(lat2)) * math.sin(dLon / 2) * math.sin(dLon / 2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    distance = R * c
    return distance

haversine_udf = udf(haversine, DoubleType())
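
A quick sanity check of the formula, independent of Spark: one degree of latitude along a meridian should come out as R·π/180, roughly 111.19 km for R = 6371 km. The sketch repeats the function so that it runs on its own; `one_degree_km` and `expected_km` are names introduced here for illustration.

```python
import math


def haversine(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between two (lon, lat) points given in degrees."""
    R = 6371  # mean Earth radius in km
    dLat = math.radians(lat2 - lat1)
    dLon = math.radians(lon2 - lon1)
    a = math.sin(dLat / 2) ** 2 + \
        math.cos(math.radians(lat1)) * math.cos(math.radians(lat2)) * math.sin(dLon / 2) ** 2
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return R * c


# One degree of latitude along the same meridian.
one_degree_km = haversine(0.0, 0.0, 0.0, 1.0)
expected_km = 6371 * math.pi / 180
print(one_degree_km, expected_km)
```

The function expects longitude first. In the queries it is called with `(LON, LAT)` for the crime and `(X, Y)` for the station, which assumes `X` holds longitude and `Y` latitude. Rows with placeholder coordinates such as (0, 0), if the data contains any, would yield distances of thousands of kilometres and inflate the averages.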

In [None]:
# write code for 4th query here
# 1a)
def query_4_1a(method = 'CONTINUE'):
    start_time = time.time()
    lapd_stations_new = lapd_stations.withColumnRenamed('PREC','AREA')
    
    if method == 'BROADCAST':
        crime_data_join_stations = crime_data.withColumnRenamed('AREA ', 'AREA') \
                                    .join(broadcast(lapd_stations_new), 'AREA', 'inner')
        crime_data_join_stations.explain()
    elif method in ['MERGE', 'SHUFFLE_HASH', 'SHUFFLE_REPLICATE_NL']:
        crime_data_join_stations = crime_data.withColumnRenamed('AREA ', 'AREA') \
                                    .hint(method).join(lapd_stations_new, 'AREA', 'inner')
        crime_data_join_stations.explain()
    elif method == 'CONTINUE':
        crime_data_join_stations = crime_data.withColumnRenamed('AREA ', 'AREA') \
                                    .join(lapd_stations_new, 'AREA', 'inner')
    else:
        return None
    
    crime_data_join_stations = crime_data_join_stations.withColumn('distance',
                                    haversine_udf(col("LON"), col("LAT"), col("X"), col("Y")))
    
    crime_data_join_stations = crime_data_join_stations.withColumn('Weapon Used Cd', col('Weapon Used Cd').cast('int')) \
                                    .filter(col('Weapon Used Cd') < 200) \
                                    .withColumn('year', F.year('Date Rptd'))
    
    result = crime_data_join_stations.groupBy('year') \
        .agg((F.sum('distance') / F.count('*')).alias('average_distance'),
            F.count('*').alias('#')) \
       .orderBy(F.col('year'))
    
    result.show()
    end_time = time.time()
    print(f'Method : {method} | Time {end_time - start_time}')
    return end_time - start_time

In [None]:
# 1b)
def query_4_1b(method = 'CONTINUE'):
    start_time = time.time()
    lapd_stations_new = lapd_stations.withColumnRenamed('PREC','AREA')
    
    if method == 'BROADCAST':
        crime_data_join_stations = crime_data.withColumnRenamed('AREA ', 'AREA') \
                                    .join(broadcast(lapd_stations_new), 'AREA', 'inner')
        crime_data_join_stations.explain()
        
    elif method in ['MERGE', 'SHUFFLE_HASH', 'SHUFFLE_REPLICATE_NL']:
        crime_data_join_stations = crime_data.withColumnRenamed('AREA ', 'AREA') \
                                    .hint(method).join(lapd_stations_new, 'AREA', 'inner')
        crime_data_join_stations.explain()
        
    elif method == 'CONTINUE':
        crime_data_join_stations = crime_data.withColumnRenamed('AREA ', 'AREA') \
                                    .join(lapd_stations_new, 'AREA', 'inner')
    else:
        return None
    
    # distance between the crime (LON, LAT) and its station (X, Y), computed with the haversine Python UDF
    crime_data_join_stations = crime_data_join_stations.withColumn('distance',
                                    haversine_udf(col("LON"), col("LAT"), col("X"), col("Y")))
    
    crime_data_join_stations = crime_data_join_stations.withColumn('Weapon Used Cd', col('Weapon Used Cd').cast('int')) \
                                    .filter(F.col('Weapon Used Cd').isNotNull()) \
                                    .withColumn('year', F.year('Date Rptd'))
    
    result = crime_data_join_stations.groupBy('AREA NAME') \
        .agg(
            (F.sum('distance') / F.count('*')).alias('average_distance'),
            F.count('*').alias('#')
        ) \
        .orderBy(F.col('#').desc()) \
        .withColumnRenamed('AREA NAME', 'division')
    
    result.show()
    end_time = time.time()
    print(f'Method : {method} | Time {end_time - start_time}')
    return end_time - start_time

In [None]:
crime_data.printSchema()

In [None]:
# 2a)
def query_4_2a(method = 'CONTINUE'):
    start_time = time.time()
    
    if method == 'BROADCAST':
        combined_data = crime_data.crossJoin(broadcast(lapd_stations))
        combined_data.explain()
        
    elif method in ['MERGE', 'SHUFFLE_HASH', 'SHUFFLE_REPLICATE_NL']:
        combined_data = crime_data.hint(method).crossJoin(lapd_stations)
        combined_data.explain()
        
    elif method == 'CONTINUE':
        combined_data = crime_data.crossJoin(lapd_stations)
    else:
        return None

    # the filter on Weapon Used Cd < 200 could be applied here, before the distance computation,
    # in case lazy evaluation / the optimizer does not push it down by itself
    
    combined_data = combined_data.withColumn("closest_distance", haversine_udf(col("LON"), col("LAT"), col("X"), col("Y")))
    
    windowSpec = Window.partitionBy("DR_NO").orderBy("closest_distance")
    closest_stations = combined_data.withColumn("rank", rank().over(windowSpec)).filter(col("rank") == 1)
    
    final_data = closest_stations.select(col("DR_NO"), col("DIVISION").alias("closest_station"), col("closest_distance"))
    
    result = crime_data.join(final_data, "DR_NO")
    crime_data_join_stations = result.withColumn('Weapon Used Cd', col('Weapon Used Cd').cast('int')) \
                                    .filter(col('Weapon Used Cd') < 200) \
                                    .withColumn('year', F.year('Date Rptd'))
    
    result = crime_data_join_stations.groupBy('year') \
        .agg((F.sum('closest_distance') / F.count('*')).alias('average_distance'),
            F.count('*').alias('#')) \
       .orderBy(F.col('year'))

    
    result.show()
    end_time = time.time()
    print(f'Method : {method} | Time {end_time - start_time}')
    return end_time - start_time

In [None]:
# 2b)
def query_4_2b(method = 'CONTINUE'):
    start_time = time.time()
    
    if method == 'BROADCAST':
        combined_data = crime_data.crossJoin(broadcast(lapd_stations))
        combined_data.explain()
        
    elif method in ['MERGE', 'SHUFFLE_HASH', 'SHUFFLE_REPLICATE_NL']:
        combined_data = crime_data.hint(method).crossJoin(lapd_stations)
        combined_data.explain()
        
    elif method == 'CONTINUE':
        combined_data = crime_data.crossJoin(lapd_stations)
    else:
        return None

    
    combined_data = combined_data.withColumn("closest_distance", haversine_udf(col("LON"), col("LAT"), col("X"), col("Y")))
    
    windowSpec = Window.partitionBy("DR_NO").orderBy("closest_distance")
    closest_stations = combined_data.withColumn("rank", rank().over(windowSpec)).filter(col("rank") == 1)
    
    final_data = closest_stations.select(col("DR_NO"), col("DIVISION").alias("closest_station"), col("closest_distance"))
    
    result = crime_data.join(final_data, "DR_NO")
    crime_data_join_stations = result.withColumn('Weapon Used Cd', col('Weapon Used Cd').cast('int')) \
                                    .filter(F.col('Weapon Used Cd').isNotNull()) \
                                    .withColumn('year', F.year('Date Rptd'))
    
    result = crime_data_join_stations.groupBy('AREA NAME') \
        .agg(
            (F.sum('closest_distance') / F.count('*')).alias('average_distance'),
            F.count('*').alias('#')
        ) \
        .orderBy(F.col('#').desc()) \
        .withColumnRenamed('AREA NAME', 'division')
    
    result.show()
    end_time = time.time()
    print(f'Method : {method} | Time {end_time - start_time}')
    return end_time - start_time
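
What the cross join plus window step computes can be illustrated without Spark: for every crime, take the station with the smallest haversine distance. The sketch below is for illustration only. `closest_station`, the station names and all coordinates are made up, not taken from the LAPD data.

```python
import math


def haversine(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between two (lon, lat) points given in degrees."""
    R = 6371
    dLat = math.radians(lat2 - lat1)
    dLon = math.radians(lon2 - lon1)
    a = math.sin(dLat / 2) ** 2 + \
        math.cos(math.radians(lat1)) * math.cos(math.radians(lat2)) * math.sin(dLon / 2) ** 2
    return R * 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))


def closest_station(crime_lon, crime_lat, stations):
    """Return (name, distance_km) of the nearest station.

    `stations` is a list of (name, lon, lat) tuples.
    """
    best = None
    for name, lon, lat in stations:
        distance = haversine(crime_lon, crime_lat, lon, lat)
        if best is None or distance < best[1]:
            best = (name, distance)
    return best


# Hypothetical stations and crime location, for illustration only.
stations = [("A", -118.25, 34.05), ("B", -118.45, 34.20), ("C", -118.30, 33.95)]
print(closest_station(-118.26, 34.04, stations))
```

One difference from this loop is worth keeping in mind: with `rank()`, a crime that is equally distant from two stations keeps both rows, whereas `row_number()` would keep exactly one.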

# Query 1 on 4 Executors

In [None]:
query_1_Dataframe_API()

In [25]:
query_1_SQL_API()



+----+-----+-----------+----+
|year|month|crime_total|rank|
+----+-----+-----------+----+
|2010|    3|      17595|   1|
|2010|    7|      17520|   2|
|2010|    5|      17338|   3|
|2011|    8|      17139|   1|
|2011|    5|      17050|   2|
|2011|    3|      16951|   3|
|2012|    8|      17696|   1|
|2012|   10|      17477|   2|
|2012|    5|      17391|   3|
|2013|    8|      17329|   1|
|2013|    7|      16714|   2|
|2013|    5|      16671|   3|
|2014|    7|      14059|   1|
|2014|   10|      14031|   2|
|2014|    9|      13799|   3|
|2015|    8|      18951|   1|
|2015|   10|      18916|   2|
|2015|    7|      18528|   3|
|2016|    8|      19779|   1|
|2016|   10|      19615|   2|
+----+-----+-----------+----+
only showing top 20 rows



                                                                                

2.3756721019744873

# Query 2 on 4 Executors 

In [43]:
query_2_Dataframe_API()



+----------+------+
|time_group| count|
+----------+------+
|     Night|237605|
| Afternoon|187306|
|      Noon|148180|
|   Morning|123846|
+----------+------+

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [count#10552L DESC NULLS LAST], true, 0
   +- Exchange rangepartitioning(count#10552L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [plan_id=3709]
      +- HashAggregate(keys=[time_group#10490], functions=[count(1)])
         +- Exchange hashpartitioning(time_group#10490, 200), ENSURE_REQUIREMENTS, [plan_id=3706]
            +- HashAggregate(keys=[time_group#10490], functions=[partial_count(1)])
               +- Project [CASE WHEN ((TIME OCC#20 >= 500) AND (TIME OCC#20 <= 1159)) THEN Morning WHEN ((TIME OCC#20 >= 1200) AND (TIME OCC#20 <= 1659)) THEN Noon WHEN ((TIME OCC#20 >= 1700) AND (TIME OCC#20 <= 2059)) THEN Afternoon ELSE Night END AS time_group#10490]
                  +- Filter (isnotnull(Premis_Desc#557) AND (Premis_Desc#557 = STREET))
                    

                                                                                

1.562849998474121

In [44]:
query_2_rdd()

                                                                                

Night: 237605
Afternoon: 187306
Noon: 148180
Morning: 123846


29.062253952026367

# Query 3 on 4 Executors

In [27]:
query_3()



+------------+----+
|Vict Descent|   #|
+------------+----+
|           H|1556|
|           B|1092|
|           W|1002|
|           O| 484|
|           A| 116|
|           K|   7|
|           I|   3|
|           J|   3|
|           C|   2|
|           F|   1|
+------------+----+

Method : CONTINUE | Time 20.74992871284485


                                                                                

20.74992871284485

In [28]:
for method in ['BROADCAST','MERGE', 'SHUFFLE_HASH']:
    query_3(method)

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [LAT#43, LON#44, DR_NO#17, Date Rptd#412, DATE OCC#441, TIME OCC#20, AREA #21, AREA NAME#22, Rpt Dist No#23, Part 1-2#24, Crm Cd#25, Crm Cd Desc#26, Mocodes#27, Vict Age#28, Vict Sex#29, Vict Descent#30, Premis Cd#31, Premis Desc#32, Weapon Used Cd#33, Weapon Desc#34, Status#35, Status Desc#36, Crm Cd 1#37, Crm Cd 2#38, ... 6 more fields]
   +- BroadcastHashJoin [knownfloatingpointnormalized(normalizenanandzero(LAT#43)), knownfloatingpointnormalized(normalizenanandzero(LON#44))], [knownfloatingpointnormalized(normalizenanandzero(LAT#191)), knownfloatingpointnormalized(normalizenanandzero(LON#192))], Inner, BuildRight, false
      :- Union
      :  :- Project [DR_NO#17, gettimestamp(Date Rptd#18, MM/dd/yyyy hh:mm:ss a, TimestampType, Some(Europe/Athens), false) AS Date Rptd#412, gettimestamp(DATE OCC#19, MM/dd/yyyy hh:mm:ss a, TimestampType, Some(Europe/Athens), false) AS DATE OCC#441, TIME OCC#20, AREA #21, AREA NAME#22

                                                                                

+------------+----+
|Vict Descent|   #|
+------------+----+
|           H|1556|
|           B|1092|
|           W|1002|
|           O| 484|
|           A| 116|
|           K|   7|
|           I|   3|
|           J|   3|
|           C|   2|
|           F|   1|
+------------+----+

Method : BROADCAST | Time 17.158113718032837
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [LAT#43, LON#44, DR_NO#17, Date Rptd#412, DATE OCC#441, TIME OCC#20, AREA #21, AREA NAME#22, Rpt Dist No#23, Part 1-2#24, Crm Cd#25, Crm Cd Desc#26, Mocodes#27, Vict Age#28, Vict Sex#29, Vict Descent#30, Premis Cd#31, Premis Desc#32, Weapon Used Cd#33, Weapon Desc#34, Status#35, Status Desc#36, Crm Cd 1#37, Crm Cd 2#38, ... 6 more fields]
   +- SortMergeJoin [knownfloatingpointnormalized(normalizenanandzero(LAT#43)), knownfloatingpointnormalized(normalizenanandzero(LON#44))], [knownfloatingpointnormalized(normalizenanandzero(LAT#191)), knownfloatingpointnormalized(normalizenanandzero(LON#192))], Inne

                                                                                

+------------+----+
|Vict Descent|   #|
+------------+----+
|           H|1556|
|           B|1092|
|           W|1002|
|           O| 484|
|           A| 116|
|           K|   7|
|           J|   3|
|           I|   3|
|           C|   2|
|           F|   1|
+------------+----+

Method : MERGE | Time 18.987554788589478
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [LAT#43, LON#44, DR_NO#17, Date Rptd#412, DATE OCC#441, TIME OCC#20, AREA #21, AREA NAME#22, Rpt Dist No#23, Part 1-2#24, Crm Cd#25, Crm Cd Desc#26, Mocodes#27, Vict Age#28, Vict Sex#29, Vict Descent#30, Premis Cd#31, Premis Desc#32, Weapon Used Cd#33, Weapon Desc#34, Status#35, Status Desc#36, Crm Cd 1#37, Crm Cd 2#38, ... 6 more fields]
   +- ShuffledHashJoin [knownfloatingpointnormalized(normalizenanandzero(LAT#43)), knownfloatingpointnormalized(normalizenanandzero(LON#44))], [knownfloatingpointnormalized(normalizenanandzero(LAT#191)), knownfloatingpointnormalized(normalizenanandzero(LON#192))], Inner



+------------+----+
|Vict Descent|   #|
+------------+----+
|           H|1556|
|           B|1092|
|           W|1002|
|           O| 484|
|           A| 116|
|           K|   7|
|           J|   3|
|           I|   3|
|           C|   2|
|           F|   1|
+------------+----+

Method : SHUFFLE_HASH | Time 14.167540550231934


                                                                                

# Query 4 on 4 Executors

In [29]:
query_4_1a()

                                                                                

+----+------------------+-----+
|year|  average_distance|    #|
+----+------------------+-----+
|2010|4.3255933001101114| 8162|
|2011|2.7909872168227423| 7225|
|2012| 37.45827620685533| 6539|
|2013| 2.830553808457538| 5851|
|2014|11.043993584711998| 4559|
|2015| 2.706546019966876| 6729|
|2016| 2.718165310899851| 8094|
|2017|4.3382539597541765| 7781|
|2018|2.7360981635514983| 7414|
|2019| 2.741344160752832| 7135|
|2020| 8.609530452758886| 8496|
|2021|32.313062258505404|12316|
|2022|2.6126264414998865|10067|
|2023|2.5497025432007963| 8951|
+----+------------------+-----+

Method : CONTINUE | Time 4.756443738937378


4.756443738937378

In [30]:
query_4_1b()



+-----------+------------------+-----+
|   division|  average_distance|    #|
+-----------+------------------+-----+
|77th Street| 13.13902769776695|94679|
|  Southeast|18.719601477645817|77917|
|  Southwest| 9.881379517897534|72632|
|    Central|23.391588077036392|63476|
|     Newton|13.952382592339601|61300|
|    Rampart|19.798620859447436|55761|
|    Olympic| 24.55548058034261|52957|
|  Hollywood| 27.77383276714913|51099|
|    Mission|26.641751598492025|43604|
|    Pacific|25.005205895832077|42897|
| Hollenbeck|19.564855074679432|41478|
|     Harbor|14.131988077207227|40746|
|N Hollywood|17.515613806281824|40345|
|   Wilshire|16.038118359629955|37830|
|  Northeast|12.772470868243007|37210|
|   Foothill|20.930925147358497|36917|
|   Van Nuys|19.881787526384876|36172|
|    Topanga| 6.782413458606602|34703|
|West Valley|15.296850408934183|33829|
| Devonshire| 19.09195315013003|32486|
+-----------+------------------+-----+
only showing top 20 rows

Method : CONTINUE | Time 5.10639953613

                                                                                

5.1063995361328125

In [None]:
query_4_2a()

In [None]:
query_4_2b()

In [None]:
for method in ['BROADCAST','MERGE', 'SHUFFLE_HASH']:
    query_4_1a(method)

In [None]:
for method in ['BROADCAST','MERGE', 'SHUFFLE_HASH']:
    query_4_1b(method)

In [None]:
for method in ['BROADCAST','MERGE', 'SHUFFLE_HASH']:
    query_4_2a(method)

In [None]:
for method in ['BROADCAST','MERGE', 'SHUFFLE_HASH']:
    query_4_2b(method)

In [45]:
spark.stop()