#### Useful Links 
- Spark History Server : http://83.212.73.248:18080/
- Hadoop YARN (scheduler) : http://83.212.73.248:8088/cluster
- HDFS : http://83.212.73.248:9870/dfshealth.html#tab-overview

#### Useful Commands : 
- Connect to okeanos-master (from local) : `$ ssh user@snf-40202.ok-kno.grnetcloud.net `
    - Password : 'Rand0m'
- Connect to okeanos-worker (from okeanos-master) : `$ ssh okeanos-worker`
- Open Jupyter Notebook : `$ jupyter notebook --ip 83.212.73.248 --port 8888`

#### Things to do :
- Convert the CSV data to Parquet
- Cast the columns to the types we want
- Write the queries (!)
- Benchmark and optimize them
- Balance the data on HDFS across the two datanodes

### Full HDFS path : hdfs://okeanos-master:54310/csv_data/
It contains :  
     
     1.  hdfs://okeanos-master:54310/csv_data/LAPD_Police_Stations.csv
     2.  hdfs://okeanos-master:54310/csv_data/crime_data_2019.csv 
     3.  hdfs://okeanos-master:54310/csv_data/crime_data_2023.csv
     4.  hdfs://okeanos-master:54310/csv_data/revgecoding.csv 
     5.  hdfs://okeanos-master:54310/csv_data/income/
         1. LA_income_2015.csv
         2. LA_income_2017.csv
         3. LA_income_2019.csv
         4. LA_income_2021.csv

In [1]:
# PySpark imports (duplicates removed; the wildcard import is replaced by explicit names)
import time

import geopy.distance
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import (broadcast, col, regexp_replace, to_date,
                                   to_timestamp, udf, year)
from pyspark.sql.types import DoubleType
from pyspark.sql.window import Window

In [2]:
# Initialize the SparkSession: 4 executors, 1 GB of memory each for the driver and the executors
spark = SparkSession \
    .builder \
    .appName("4 Executors") \
    .config("spark.driver.memory", "1g") \
    .config("spark.executor.memory", "1g") \
    .config("spark.executor.instances", "4") \
    .getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/12/30 21:14:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/12/30 21:14:16 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.


In [3]:
# Read the Parquet data from HDFS (reads are lazy; the joins are done inside each query)
crime_data = spark.read.parquet("hdfs://okeanos-master:54310/parquet/crime_data_*.parquet")
revge = spark.read.parquet("hdfs://okeanos-master:54310/parquet/revgecoding.parquet")
# only 2015 income data needed
income = spark.read \
            .parquet("hdfs://okeanos-master:54310/parquet/income/LA_income_2015.parquet")
lapd_stations = spark.read.parquet("hdfs://okeanos-master:54310/parquet/LAPD_Police_Stations.parquet")

                                                                                

In [4]:
crime_data.show()

23/12/30 21:14:55 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
                                                                                

+---------+--------------------+--------------------+--------+-----+-----------+-----------+--------+------+--------------------+--------------------+--------+--------+------------+---------+--------------------+--------------+--------------------+------+------------+--------+--------+--------+--------+--------------------+--------------------+-------+---------+
|    DR_NO|           Date Rptd|            DATE OCC|TIME OCC|AREA |  AREA NAME|Rpt Dist No|Part 1-2|Crm Cd|         Crm Cd Desc|             Mocodes|Vict Age|Vict Sex|Vict Descent|Premis Cd|         Premis Desc|Weapon Used Cd|         Weapon Desc|Status| Status Desc|Crm Cd 1|Crm Cd 2|Crm Cd 3|Crm Cd 4|            LOCATION|        Cross Street|    LAT|      LON|
+---------+--------------------+--------------------+--------+-----+-----------+-----------+--------+------+--------------------+--------------------+--------+--------+------------+---------+--------------------+--------------+--------------------+------+------------+--

In [5]:
crime_data.printSchema()

root
 |-- DR_NO: integer (nullable = true)
 |-- Date Rptd: string (nullable = true)
 |-- DATE OCC: string (nullable = true)
 |-- TIME OCC: integer (nullable = true)
 |-- AREA : integer (nullable = true)
 |-- AREA NAME: string (nullable = true)
 |-- Rpt Dist No: integer (nullable = true)
 |-- Part 1-2: integer (nullable = true)
 |-- Crm Cd: integer (nullable = true)
 |-- Crm Cd Desc: string (nullable = true)
 |-- Mocodes: string (nullable = true)
 |-- Vict Age: integer (nullable = true)
 |-- Vict Sex: string (nullable = true)
 |-- Vict Descent: string (nullable = true)
 |-- Premis Cd: integer (nullable = true)
 |-- Premis Desc: string (nullable = true)
 |-- Weapon Used Cd: integer (nullable = true)
 |-- Weapon Desc: string (nullable = true)
 |-- Status: string (nullable = true)
 |-- Status Desc: string (nullable = true)
 |-- Crm Cd 1: integer (nullable = true)
 |-- Crm Cd 2: integer (nullable = true)
 |-- Crm Cd 3: integer (nullable = true)
 |-- Crm Cd 4: integer (nullable = true)
 |-- 

#### Change Column Types 
The assignment says: "Keeping the original column names".

Does this mean we cannot rename 'Date Rptd' -> 'Date_Rptd'?
- Date Rptd: date
- DATE OCC: date
- Vict Age: integer
- LAT: double
- LON: double

In [6]:
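# Cast the columns to the target types. The two date columns are parsed with to_timestamp,
# so they become timestamps. Premis_Desc is a copy of 'Premis Desc' without the space.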
crime_data = crime_data.withColumn("Date Rptd", to_timestamp("Date Rptd", 'MM/dd/yyyy hh:mm:ss a')) \
    .withColumn("DATE OCC", to_timestamp("DATE OCC", 'MM/dd/yyyy hh:mm:ss a')) \
    .withColumn("Vict Age", col("Vict Age").cast("int")) \
    .withColumn("LAT", col("LAT").cast("double")) \
    .withColumn("LON", col("LON").cast("double")) \
    .withColumn("Premis_Desc", col("Premis Desc"))
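
The pattern `'MM/dd/yyyy hh:mm:ss a'` passed to `to_timestamp` reads as month/day/year followed by a 12-hour clock time with an AM/PM marker. As a rough illustration outside Spark, the sketch below parses the same layout with Python's `datetime`. The helper name `parse_reported` and the sample string are assumptions made for this example, not values taken from the dataset.

```python
from datetime import datetime

# Python counterpart of the Spark pattern 'MM/dd/yyyy hh:mm:ss a':
# month/day/4-digit year, 12-hour clock, AM/PM marker.
PY_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def parse_reported(value):
    """Parse a string such as '01/08/2020 12:00:00 AM'; return None if it does not match."""
    try:
        return datetime.strptime(value, PY_FORMAT)
    except (TypeError, ValueError):
        return None


# Hypothetical sample value, used only to show the pattern.
sample = parse_reported("01/08/2020 12:00:00 AM")
print(sample)
```

Returning `None` for strings that do not match is a choice made for this sketch; it is worth checking how many rows end up null after the real cast, since a wrong pattern would silently empty the column in the queries that group by year.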

# 1st Query :
        find
            for each year
                the 3 months with the biggest crime count

        year | month | crime_total (count)  + #order
        dataframe.show()

        SELECT  YEAR(date_rptd) as year,
                MONTH(date_rptd) as month,
                COUNT(*) as crime_total,
                ROW_NUMBER() OVER (PARTITION BY year ORDER BY crime_total DESC) as '#'
        FROM crime_data
        GROUP BY YEAR(date_rptd), MONTH(date_rptd)
        ORDER BY year ASC, crime_total DESC
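
The plan for the first query (count per year and month, rank the months inside each year, keep the first three) can be checked on a small in-memory sample before running it on the cluster. The following is a plain-Python sketch of that logic, not the Spark implementation. The function name `top_months_per_year` and the sample dates are made up for illustration, and ties are broken by month number, which is an assumption of this sketch.

```python
from collections import Counter, defaultdict
from datetime import date


def top_months_per_year(dates, n=3):
    """Return {year: [(month, count), ...]} holding the n busiest months of each year."""
    # GROUP BY year, month with COUNT(*)
    counts = Counter((d.year, d.month) for d in dates)

    by_year = defaultdict(list)
    for (yr, month), total in counts.items():
        by_year[yr].append((month, total))

    # ORDER BY count DESC inside each year, then keep the first n rows (row number <= n)
    return {
        yr: sorted(months, key=lambda m: (-m[1], m[0]))[:n]
        for yr, months in sorted(by_year.items())
    }


# Made-up sample: four months in 2010 and a single month in 2011.
sample_dates = (
    [date(2010, 3, 1)] * 4
    + [date(2010, 7, 1)] * 3
    + [date(2010, 5, 1)] * 2
    + [date(2010, 1, 1)]
    + [date(2011, 8, 1)] * 2
)
top3 = top_months_per_year(sample_dates)
print(top3)
```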

In [7]:
# code for first query in SQL API
def query_1_SQL_API():
    start_time = time.time()
    crime_data.createOrReplaceTempView("crime_data")
    
    query = """
        SELECT * FROM (
            SELECT 
                year(`Date Rptd`) AS year,
                month(`Date Rptd`) AS month,
                COUNT(*) AS crime_total,
                ROW_NUMBER() OVER (PARTITION BY year(`Date Rptd`) ORDER BY COUNT(*) DESC) AS rank
            FROM 
                crime_data
            GROUP BY 
                year(`Date Rptd`), month(`Date Rptd`)
        ) ranked_data
        WHERE rank <= 3
        ORDER BY year, rank
    """
    
    
    result_df = spark.sql(query)
    result_df.show()
    
    end_time = time.time()

    return end_time - start_time

In [8]:
# code for first query in Dataframe API
def query_1_Dataframe_API():
    start_time = time.time()
    crime_counts = crime_data.withColumn("year", F.year("Date Rptd")) \
                          .withColumn("month", F.month("Date Rptd")) \
                          .groupBy("year", "month") \
                          .agg(F.count("*").alias("crime_total"))
    
    window_spec = Window.partitionBy("year").orderBy(F.desc("crime_total"))
    
    ranked_crime = crime_counts.withColumn("rank", F.row_number().over(window_spec))
    
    result_df = ranked_crime.filter("rank <= 3").orderBy("year", "rank")
    
    result_df.show()
    end_time = time.time()

    return end_time - start_time

 # 2nd Query :


            SELECT CASE
                      WHEN `TIME OCC` BETWEEN 500 AND 1159 THEN 'Morning'
                      WHEN `TIME OCC` BETWEEN 1200 AND 1659 THEN 'Noon'
                      WHEN `TIME OCC` BETWEEN 1700 AND 2059 THEN 'Afternoon'
                      ELSE 'Night'
                    END AS time_group,
                    COUNT(*) as count
            FROM crime_data
            WHERE `Premis Desc` = 'STREET'
            GROUP BY time_group
            ORDER BY count DESC

In [9]:
# code for the 2nd query in the DataFrame API
def query_2_Dataframe_API():
    start_time = time.time()
    filtered_df = crime_data.filter(crime_data['Premis_Desc'] == 'STREET')

    time_group_df = filtered_df.withColumn("time_group",
                                       # TIME OCC is in 24 hour military time integer values
                                      F.when((F.col('TIME OCC').between(500, 1159)), 'Morning')
                                      .when((F.col('TIME OCC').between(1200, 1659)), 'Noon')
                                      .when((F.col('TIME OCC').between(1700, 2059)), 'Afternoon')
                                      .otherwise('Night'))

    result_df = time_group_df.groupBy("time_group").agg(F.count("*").alias("count"))

    result_df = result_df.orderBy(col("count").desc())
    result_df.show()
    end_time = time.time()

    return end_time - start_time

In [10]:
# code for the 2nd query in the RDD API
def time_segs(row):
    if 500 <= int(row['TIME OCC']) <= 1159:
        return 'Morning'
    elif 1200 <= int(row['TIME OCC']) <= 1659:
        return 'Noon'
    elif 1700 <= int(row['TIME OCC']) <= 2059:
        return 'Afternoon'
    else:
        return 'Night'



def query_2_rdd(): 
    start_time = time.time()
    crime_data_rdd = crime_data.rdd.filter(lambda x: x['Premis_Desc'] == 'STREET') \
                                .map(lambda x: (time_segs(x),1)) \
                                .reduceByKey(lambda k1,k2: k1+k2) \
                                .sortBy(lambda x: x[1], ascending = False)
    
    result = crime_data_rdd.collect()
    for time_of_day, count in result:
        print(f"{time_of_day}: {count}")
        
    end_time = time.time()
    return end_time - start_time

# 3rd Query :

        find the 3 zip codes with min and max household income
                    |
                    |
                    v
        // filter(remove) victimless crimes
                    |
                    |
                    v
        select vict_desc, COUNT(*) as count
        where year=2015
        group by vict_desc
        order by count DESC

In [11]:
# code for the 3rd query; `method` is an optional join-strategy hint
def query_3(method=None):
    start_time = time.time()

    if method == 'BROADCAST':
        # broadcast the small reverse-geocoding table to every executor
        joined = crime_data.join(broadcast(revge), ['LAT', 'LON'], 'inner')
    elif method in ['MERGE', 'SHUFFLE_HASH', 'SHUFFLE_REPLICATE_NL']:
        joined = crime_data.hint(method).join(revge, ['LAT', 'LON'], 'inner')
    elif not method:
        # no hint: let the optimizer pick the join strategy
        joined = crime_data.join(revge, ['LAT', 'LON'], 'inner')
    else:
        return None

    # attach the zip code and drop rows whose victim descent or sex is 'X'
    crime_data_join_revge = joined \
        .withColumnRenamed('ZIPcode', 'Zip Code') \
        .withColumn("Zip Code", col("Zip Code").cast("int")) \
        .filter((col('Vict Descent') != 'X') & (col('Vict Sex') != 'X'))

    # print the physical plan only when a join strategy was requested
    if method:
        crime_data_join_revge.explain()
    
    crime_data_join_income = crime_data_join_revge.join(income, 'Zip Code', 'inner') \
                                .withColumn('Estimated Median Income', 
                                            regexp_replace(col('Estimated Median Income'), '[$,]', '')) \
                                .withColumn('Estimated Median Income', 
                                            col('Estimated Median Income') \
                                .cast('double'))
    
    max_income_zip_codes = crime_data_join_income.groupBy('Zip Code') \
                            .agg({'Estimated Median Income': 'max'}) \
                            .withColumnRenamed('max(Estimated Median Income)', 'MaxIncome') \
                            .orderBy(col('MaxIncome').desc()) \
                            .limit(3)
    
    min_income_zip_codes = crime_data_join_income.groupBy('Zip Code') \
                            .agg({'Estimated Median Income': 'min'}) \
                            .withColumnRenamed('min(Estimated Median Income)', 'MinIncome') \
                            .orderBy(col('MinIncome')) \
                            .limit(3)
    
    zip_codes = min_income_zip_codes.union(max_income_zip_codes)
    
    zip_codes_list = [row['Zip Code'] for row in zip_codes.collect()]
    
    result = crime_data_join_income \
                .filter(col('Zip Code').isin(zip_codes_list)) \
                .filter(year(col('Date Rptd')) == 2015) \
                .groupBy('Vict Descent') \
                .count() \
                .withColumnRenamed('count', '#') \
                .orderBy(col('#').desc())
    
    result.show()
    end_time = time.time()
    return end_time - start_time
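
The income-cleaning step in `query_3` removes `$` and `,` with `regexp_replace` and then casts the remainder to `double`. A small plain-Python sketch of the same transformation can be used to check the regular expression on a few strings. The function name `parse_income` and the sample inputs are assumptions made for the example, and returning `None` for values that cannot be parsed is a choice of this sketch.

```python
import re


def parse_income(text):
    """Turn a string like '$33,887' into 33887.0; return None when it is not numeric."""
    if text is None:
        return None
    # same character class as in the notebook: drop dollar signs and thousands separators
    cleaned = re.sub(r"[$,]", "", text).strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


print(parse_income("$33,887"))
```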

# 4th Query :
    1 :
        a)
                make_extra_columns() : distance of police stations from crime
                join crime_table with LA Police Stations on police_station
                and add column of police station
                for each row compute distance from two coordinates                         
                put this computed distance in the column named 'distance'

                SELECT year, SUM(distance)/# as average_distance
                                , COUNT(*) as #
                FROM ...
                WHERE WEAPON < 200
                GROUP BY year
                ORDER BY #


        b)
                SELECT police_station_name as division,
                       SUM(distance)/# as average_distance,
                       COUNT(*) as #
                FROM ...
                WHERE weapon NOT NULL
                GROUP BY police_station_name
                ORDER BY #

    2 :

In [12]:
# code for the 4th query; `method` is an optional join-strategy hint
# 1a)
def query_4_1a(method=None):
    start_time = time.time()
    lapd_stations_new = lapd_stations.withColumnRenamed('PREC','AREA')
    # the crime table's column is named 'AREA ' (trailing space); rename it before joining
    crime_data_area = crime_data.withColumnRenamed('AREA ', 'AREA')
    
    if method == 'BROADCAST':
        crime_data_join_stations = crime_data_area \
                                    .join(broadcast(lapd_stations_new), 'AREA', 'inner')
    elif method in ['MERGE', 'SHUFFLE_HASH', 'SHUFFLE_REPLICATE_NL']:
        crime_data_join_stations = crime_data_area \
                                    .hint(method).join(lapd_stations_new, 'AREA', 'inner')
    elif not method:
        # no hint: let the optimizer pick the join strategy
        crime_data_join_stations = crime_data_area \
                                    .join(lapd_stations_new, 'AREA', 'inner')
    else:
        return None

    # print the physical plan once to see which join strategy was chosen
    crime_data_join_stations.explain()
    
    # great-circle distance in km (spherical law of cosines) between the crime (LAT, LON)
    # and the station, whose Y column is used as latitude and X as longitude
    crime_data_join_stations = crime_data_join_stations.withColumn('distance',
                                    F.acos(F.sin(F.radians('LAT')) * F.sin(F.radians('Y')) +
                                    F.cos(F.radians('LAT')) * F.cos(F.radians('Y')) *
                                    F.cos(F.radians('X') - F.radians('LON')))*F.lit(6371.0)) # Earth's radius in km
    
    crime_data_join_stations = crime_data_join_stations.withColumn('Weapon Used Cd', col('Weapon Used Cd').cast('int')) \
                                    .filter(col('Weapon Used Cd') < 200) \
                                    .withColumn('year', F.year('Date Rptd'))
    
    result = crime_data_join_stations.groupBy('year') \
        .agg((F.sum('distance') / F.count('*')).alias('average_distance'),
            F.count('*').alias('#')) \
       .orderBy(F.col('#').desc())
    
    result.show()
    end_time = time.time()
    return end_time - start_time
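
The `distance` column in `query_4_1a` is built from the spherical law of cosines, scaled by an Earth radius of 6371 km. The plain-Python sketch below evaluates the same formula so that individual values can be checked by hand. The function name `great_circle_km` is hypothetical, and the clamping of the cosine to [-1, 1] is an addition of this sketch that the Spark expression in the notebook does not have.

```python
import math

EARTH_RADIUS_KM = 6371.0  # same constant as in the Spark expression


def great_circle_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    cos_angle = (math.sin(phi1) * math.sin(phi2)
                 + math.cos(phi1) * math.cos(phi2) * math.cos(dlon))
    # rounding can push the value slightly outside [-1, 1], where acos is undefined
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return math.acos(cos_angle) * EARTH_RADIUS_KM


# One degree of latitude along a meridian.
one_degree = great_circle_km(0.0, 0.0, 1.0, 0.0)
print(one_degree)
```

For points that are very close together the law of cosines loses precision, so a result of a few centimetres instead of exactly zero for identical coordinates is expected with this formula.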

In [13]:
# 1b)
def query_4_1b(method=None):
    start_time = time.time()
    lapd_stations_new = lapd_stations.withColumnRenamed('PREC','AREA')
    # the crime table's column is named 'AREA ' (trailing space); rename it before joining
    crime_data_area = crime_data.withColumnRenamed('AREA ', 'AREA')
    
    if method == 'BROADCAST':
        crime_data_join_stations = crime_data_area \
                                    .join(broadcast(lapd_stations_new), 'AREA', 'inner')
    elif method in ['MERGE', 'SHUFFLE_HASH', 'SHUFFLE_REPLICATE_NL']:
        crime_data_join_stations = crime_data_area \
                                    .hint(method).join(lapd_stations_new, 'AREA', 'inner')
    elif not method:
        # no hint: let the optimizer pick the join strategy
        crime_data_join_stations = crime_data_area \
                                    .join(lapd_stations_new, 'AREA', 'inner')
    else:
        return None

    # print the physical plan only when a join strategy was requested
    if method:
        crime_data_join_stations.explain()
    
    # great-circle distance in km (spherical law of cosines) between the crime (LAT, LON)
    # and the station, whose Y column is used as latitude and X as longitude
    crime_data_join_stations = crime_data_join_stations.withColumn('distance',
                                    F.acos(F.sin(F.radians('LAT')) * F.sin(F.radians('Y')) +
                                    F.cos(F.radians('LAT')) * F.cos(F.radians('Y')) *
                                    F.cos(F.radians('X') - F.radians('LON')))*F.lit(6371.0)) # Earth's radius in km
    
    crime_data_join_stations = crime_data_join_stations.withColumn('Weapon Used Cd', col('Weapon Used Cd').cast('int')) \
                                    .filter(F.col('Weapon Used Cd').isNotNull()) \
                                    .withColumn('year', F.year('Date Rptd'))
    
    result = crime_data_join_stations.groupBy('AREA NAME') \
        .agg(
            (F.sum('distance') / F.count('*')).alias('average_distance'),
            F.count('*').alias('#')
        ) \
        .orderBy(F.col('#').desc()) \
        .withColumnRenamed('AREA NAME', 'division')
    
    result.show()
    end_time = time.time()
    return end_time - start_time

In [27]:
# 2a)
def query_4_2a(method=None):
    start_time = time.time()
    lapd_stations_new = lapd_stations.withColumnRenamed('PREC','AREA')
    # the crime table's column is named 'AREA ' (trailing space); rename it before joining
    crime_data_area = crime_data.withColumnRenamed('AREA ', 'AREA')
    
    if method == 'BROADCAST':
        crime_data_join_stations = crime_data_area \
                                    .join(broadcast(lapd_stations_new), 'AREA', 'inner')
    elif method in ['MERGE', 'SHUFFLE_HASH', 'SHUFFLE_REPLICATE_NL']:
        crime_data_join_stations = crime_data_area \
                                    .hint(method).join(lapd_stations_new, 'AREA', 'inner')
    elif not method:
        # no hint: let the optimizer pick the join strategy
        crime_data_join_stations = crime_data_area \
                                    .join(lapd_stations_new, 'AREA', 'inner')
    else:
        return None

    # print the physical plan only when a join strategy was requested
    if method:
        crime_data_join_stations.explain()
    
    # great-circle distance in km (spherical law of cosines) between the crime (LAT, LON)
    # and the station, whose Y column is used as latitude and X as longitude
    crime_data_join_stations = crime_data_join_stations.withColumn('distance',
                                    F.acos(F.sin(F.radians('LAT')) * F.sin(F.radians('Y')) +
                                    F.cos(F.radians('LAT')) * F.cos(F.radians('Y')) *
                                    F.cos(F.radians('X') - F.radians('LON')))*F.lit(6371.0)) # Earth's radius in km
    
    windowSpec = Window.partitionBy('DR_NO').orderBy('distance')
    result = crime_data_join_stations.withColumn('rank', F.rank().over(windowSpec)) \
                                    .filter(col('rank') == 1).drop('rank')

    result.explain()
    
    result.show()
    end_time = time.time()
    return end_time - start_time

# Query 1 on 4 Executors

In [14]:
query_1_Dataframe_API()


+----+-----+-----------+----+
|year|month|crime_total|rank|
+----+-----+-----------+----+
|2010|    3|      17595|   1|
|2010|    7|      17520|   2|
|2010|    5|      17338|   3|
|2011|    8|      17139|   1|
|2011|    5|      17050|   2|
|2011|    3|      16951|   3|
|2012|    8|      17696|   1|
|2012|   10|      17477|   2|
|2012|    5|      17391|   3|
|2013|    8|      17329|   1|
|2013|    7|      16714|   2|
|2013|    5|      16671|   3|
|2014|    7|      14059|   1|
|2014|   10|      14031|   2|
|2014|    9|      13799|   3|
|2015|    8|      18951|   1|
|2015|   10|      18916|   2|
|2015|    7|      18528|   3|
|2016|    8|      19779|   1|
|2016|   10|      19615|   2|
+----+-----+-----------+----+
only showing top 20 rows



                                                                                

9.018396854400635

In [15]:
query_1_SQL_API()



+----+-----+-----------+----+
|year|month|crime_total|rank|
+----+-----+-----------+----+
|2010|    3|      17595|   1|
|2010|    7|      17520|   2|
|2010|    5|      17338|   3|
|2011|    8|      17139|   1|
|2011|    5|      17050|   2|
|2011|    3|      16951|   3|
|2012|    8|      17696|   1|
|2012|   10|      17477|   2|
|2012|    5|      17391|   3|
|2013|    8|      17329|   1|
|2013|    7|      16714|   2|
|2013|    5|      16671|   3|
|2014|    7|      14059|   1|
|2014|   10|      14031|   2|
|2014|    9|      13799|   3|
|2015|    8|      18951|   1|
|2015|   10|      18916|   2|
|2015|    7|      18528|   3|
|2016|    8|      19779|   1|
|2016|   10|      19615|   2|
+----+-----+-----------+----+
only showing top 20 rows



                                                                                

4.120455265045166
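
The two timings above (about 9.0 s for the DataFrame API and about 4.1 s for the SQL API) each come from a single run, and the DataFrame version ran first, so part of the gap may be one-off start-up or caching cost rather than a difference between the APIs. One way to make the comparison fairer is to discard a warm-up run and repeat the measurement. The sketch below shows such a helper in plain Python; the names `time_runs` and `summarize` are hypothetical and not part of the notebook.

```python
import statistics
import time


def time_runs(fn, repeats=3, warmup=1):
    """Call fn() `warmup` times untimed, then `repeats` times timed; return seconds per run."""
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Reduce a list of timings to its minimum, median and maximum."""
    return {
        "min": min(timings),
        "median": statistics.median(timings),
        "max": max(timings),
    }


# Stand-in workload so the sketch runs without a cluster.
calls = []
timings = time_runs(lambda: calls.append(1), repeats=3, warmup=2)
print(summarize(timings))
```

In the notebook this could be called as `summarize(time_runs(query_1_SQL_API))`, keeping in mind that each query function also prints its result with `show()` on every call.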

# Query 2 on 4 Executors 

In [16]:
query_2_Dataframe_API()



+----------+------+
|time_group| count|
+----------+------+
|     Night|237605|
| Afternoon|187306|
|      Noon|148180|
|   Morning|123846|
+----------+------+



                                                                                

2.307114362716675

In [17]:
query_2_rdd()

                                                                                

Night: 237605
Afternoon: 187306
Noon: 148180
Morning: 123846


35.96578645706177

# Query 3 on 4 Executors

In [None]:
query_3()

In [18]:
#for method in ['BROADCAST','MERGE', 'SHUFFLE_HASH', 'SHUFFLE_REPLICATE_NL']:
#    query_3(method)

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [LAT#26, LON#27, DR_NO#0, Date Rptd#194, DATE OCC#223, TIME OCC#3, AREA #4, AREA NAME#5, Rpt Dist No#6, Part 1-2#7, Crm Cd#8, Crm Cd Desc#9, Mocodes#10, Vict Age#11, Vict Sex#12, Vict Descent#13, Premis Cd#14, Premis Desc#15, Weapon Used Cd#16, Weapon Desc#17, Status#18, Status Desc#19, Crm Cd 1#20, Crm Cd 2#21, ... 6 more fields]
   +- BroadcastHashJoin [knownfloatingpointnormalized(normalizenanandzero(LAT#26)), knownfloatingpointnormalized(normalizenanandzero(LON#27))], [knownfloatingpointnormalized(normalizenanandzero(LAT#56)), knownfloatingpointnormalized(normalizenanandzero(LON#57))], Inner, BuildRight, false
      :- Project [DR_NO#0, gettimestamp(Date Rptd#1, MM/dd/yyyy hh:mm:ss a, TimestampType, Some(Europe/Athens), false) AS Date Rptd#194, gettimestamp(DATE OCC#2, MM/dd/yyyy hh:mm:ss a, TimestampType, Some(Europe/Athens), false) AS DATE OCC#223, TIME OCC#3, AREA #4, AREA NAME#5, Rpt Dist No#6, Part 1-2#7, Crm C

                                                                                

+------------+----+
|Vict Descent|   #|
+------------+----+
|           H|1556|
|           B|1092|
|           W|1002|
|           O| 484|
|           A| 116|
|           K|   7|
|           J|   3|
|           I|   3|
|           C|   2|
|           F|   1|
+------------+----+

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [LAT#26, LON#27, DR_NO#0, Date Rptd#194, DATE OCC#223, TIME OCC#3, AREA #4, AREA NAME#5, Rpt Dist No#6, Part 1-2#7, Crm Cd#8, Crm Cd Desc#9, Mocodes#10, Vict Age#11, Vict Sex#12, Vict Descent#13, Premis Cd#14, Premis Desc#15, Weapon Used Cd#16, Weapon Desc#17, Status#18, Status Desc#19, Crm Cd 1#20, Crm Cd 2#21, ... 6 more fields]
   +- SortMergeJoin [knownfloatingpointnormalized(normalizenanandzero(LAT#26)), knownfloatingpointnormalized(normalizenanandzero(LON#27))], [knownfloatingpointnormalized(normalizenanandzero(LAT#56)), knownfloatingpointnormalized(normalizenanandzero(LON#57))], Inner
      :- Sort [knownfloatingpointnormalized(normalize

23/12/30 21:17:28 WARN TransportChannelHandler: Exception in connection from /192.168.0.1:56018
java.io.IOException: Connection reset by peer
	at java.base/sun.nio.ch.FileDispatcherImpl.read0(Native Method)
	at java.base/sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
	at java.base/sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:276)
	at java.base/sun.nio.ch.IOUtil.read(IOUtil.java:233)
	at java.base/sun.nio.ch.IOUtil.read(IOUtil.java:223)
	at java.base/sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:356)
	at io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:254)
	at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1132)
	at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:357)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys

+------------+----+
|Vict Descent|   #|
+------------+----+
|           H|1556|
|           B|1092|
|           W|1002|
|           O| 484|
|           A| 116|
|           K|   7|
|           J|   3|
|           I|   3|
|           C|   2|
|           F|   1|
+------------+----+

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [LAT#26, LON#27, DR_NO#0, Date Rptd#194, DATE OCC#223, TIME OCC#3, AREA #4, AREA NAME#5, Rpt Dist No#6, Part 1-2#7, Crm Cd#8, Crm Cd Desc#9, Mocodes#10, Vict Age#11, Vict Sex#12, Vict Descent#13, Premis Cd#14, Premis Desc#15, Weapon Used Cd#16, Weapon Desc#17, Status#18, Status Desc#19, Crm Cd 1#20, Crm Cd 2#21, ... 6 more fields]
   +- ShuffledHashJoin [knownfloatingpointnormalized(normalizenanandzero(LAT#26)), knownfloatingpointnormalized(normalizenanandzero(LON#27))], [knownfloatingpointnormalized(normalizenanandzero(LAT#56)), knownfloatingpointnormalized(normalizenanandzero(LON#57))], Inner, BuildLeft
      :- Exchange hashpartitioning(know

                                                                                

+------------+----+
|Vict Descent|   #|
+------------+----+
|           H|1556|
|           B|1092|
|           W|1002|
|           O| 484|
|           A| 116|
|           K|   7|
|           J|   3|
|           I|   3|
|           C|   2|
|           F|   1|
+------------+----+

== Physical Plan ==
*(3) Project [LAT#26, LON#27, DR_NO#0, Date Rptd#194, DATE OCC#223, TIME OCC#3, AREA #4, AREA NAME#5, Rpt Dist No#6, Part 1-2#7, Crm Cd#8, Crm Cd Desc#9, Mocodes#10, Vict Age#11, Vict Sex#12, Vict Descent#13, Premis Cd#14, Premis Desc#15, Weapon Used Cd#16, Weapon Desc#17, Status#18, Status Desc#19, Crm Cd 1#20, Crm Cd 2#21, ... 6 more fields]
+- CartesianProduct ((knownfloatingpointnormalized(normalizenanandzero(LAT#26)) = knownfloatingpointnormalized(normalizenanandzero(LAT#56))) AND (knownfloatingpointnormalized(normalizenanandzero(LON#27)) = knownfloatingpointnormalized(normalizenanandzero(LON#57))))
   :- *(1) Project [DR_NO#0, gettimestamp(Date Rptd#1, MM/dd/yyyy hh:mm:ss a, Timestamp

ERROR:root:KeyboardInterrupt while sending command.
Traceback (most recent call last):
  File "/home/user/.local/lib/python3.10/site-packages/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
  File "/home/user/.local/lib/python3.10/site-packages/py4j/clientserver.py", line 511, in send_command
    answer = smart_decode(self.stream.readline()[:-1])
  File "/usr/lib/python3.10/socket.py", line 705, in readinto
    return self._sock.recv_into(b)
KeyboardInterrupt


KeyboardInterrupt: 

# Query 4 on 4 Executors

In [None]:
query_4_1a()

In [None]:
query_4_1b()

In [None]:
query_4_2a()

In [None]:
#for method in ['BROADCAST','MERGE', 'SHUFFLE_HASH', 'SHUFFLE_REPLICATE_NL']:
#    query_4_1a(method)

In [None]:
#for method in ['BROADCAST','MERGE', 'SHUFFLE_HASH', 'SHUFFLE_REPLICATE_NL']:
#    query_4_1b(method)

In [None]:
#query_4_2b()

In [None]:
spark.stop()