# Advanced Topics in Databases

**Full name:** Κωνσταντίνος Διβριώτης

**Student ID:** 03114140

## Query 3

Using the 2010 census data for population and the 2015 census data for household income, compute the following for each area of Los Angeles:
- The average annual income per person
- The ratio of the total number of crimes per person

In [1]:
from pyspark.sql import SparkSession
from sedona.spark import *

spark = SparkSession \
    .builder \
    .appName("CensusDataAnalysis") \
    .getOrCreate()

sedona = SedonaContext.create(spark)

SparkSession available as 'spark'.

In [2]:
DATA_BUCKET = "s3://initial-notebook-data-bucket-dblab-905418150721"
GROUP_BUCKET = "s3://groups-bucket-dblab-905418150721/group15"

## Reading and Inspecting the Input Files

In [3]:
from pyspark.sql.types import StructField, StructType, StringType
from pyspark.sql.functions import regexp_replace, col

income_schema = StructType([
    StructField("Zip Code", StringType()),
    StructField("Community", StringType()),
    StructField("Estimated Median Income", StringType())
])

income_data = spark.read.csv(f"{DATA_BUCKET}/LA_income_2015.csv", header=True, schema=income_schema)

# Convert Estimated Median Income to a numeric type
income_data = income_data \
    .withColumn(
        "Estimated Median Income",
        regexp_replace(col("Estimated Median Income"), "[$,]", "").cast("float")
    ) \
    .select("Zip Code", "Estimated Median Income")

income_data.show(5)

+--------+-----------------------+
|Zip Code|Estimated Median Income|
+--------+-----------------------+
|   90001|                33887.0|
|   90002|                30413.0|
|   90003|                30805.0|
|   90004|                40612.0|
|   90005|                31142.0|
+--------+-----------------------+
only showing top 5 rows
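
The cleaning step above removes the dollar sign and the thousands separators before casting. A minimal plain-Python sketch of the same idea follows, assuming raw values look like `$33,887` (the raw format is inferred from the `"[$,]"` pattern, and the helper name `parse_income` is hypothetical, not part of the notebook):

```python
import re

def parse_income(raw):
    """Strip '$' and ',' and convert to float; return None if nothing numeric remains."""
    cleaned = re.sub(r"[$,]", "", raw).strip()
    try:
        return float(cleaned)
    except ValueError:
        # Loosely mirrors a failed Spark cast, which yields null instead of raising
        return None

print(parse_income("$33,887"))
print(parse_income("n/a"))
```

Returning `None` on failure keeps bad rows visible as missing values rather than aborting the whole load.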

In [4]:
from pyspark.sql.types import StructField, StructType, IntegerType, StringType, DoubleType

# Define the schema of the crime datasets
crimes_schema = StructType([
    StructField("DR_NO", StringType()),
    StructField("Date Rptd", StringType()),
    StructField("DATE OCC", StringType()),
    StructField("TIME OCC", StringType()),
    StructField("AREA", IntegerType()),
    StructField("AREA NAME", StringType()),
    StructField("Rpt Dist No", StringType()),
    StructField("Part 1-2", IntegerType()),
    StructField("Crm Cd", IntegerType()),
    StructField("Crm Cd Desc", StringType()),
    StructField("Mocodes", StringType()),
    StructField("Vict Age", IntegerType()),
    StructField("Vict Sex", StringType()),
    StructField("Vict Descent", StringType()),
    StructField("Premis Cd", StringType()),
    StructField("Premis Desc", StringType()),
    StructField("Weapon Used Cd", IntegerType()),
    StructField("Weapon Desc", StringType()),
    StructField("Status", StringType()),
    StructField("Status Desc", StringType()),
    StructField("Crm Cd 1", IntegerType()),
    StructField("Crm Cd 2", IntegerType()),
    StructField("Crm Cd 3", IntegerType()),
    StructField("Crm Cd 4", IntegerType()),
    StructField("LOCATION", StringType()),
    StructField("Cross Street", StringType()),
    StructField("LAT", DoubleType()),
    StructField("LON", DoubleType())
])

# Read the 2 datasets (2010-2019 and 2020-present) and union them into one
crime_data_2010_2019 = spark.read.csv(f"{DATA_BUCKET}/CrimeData/Crime_Data_from_2010_to_2019_20241101.csv", header=True, schema=crimes_schema)
crime_data_2020_present = spark.read.csv(f"{DATA_BUCKET}/CrimeData/Crime_Data_from_2020_to_Present_20241101.csv", header=True, schema=crimes_schema)
crime_data = crime_data_2010_2019.union(crime_data_2020_present)

# Convert the LAT, LON columns to a geometry with ST_Point and drop points at (0, 0)
crime_data = crime_data \
                .withColumn("geom", ST_Point("LON", "LAT")) \
                .filter(col("geom") != ST_Point(0, 0)) \
                .select("DR_NO", "geom")

crime_data.show(5)

+---------+--------------------+
|    DR_NO|                geom|
+---------+--------------------+
|001307355|POINT (-118.2695 ...|
|011401303|POINT (-118.3962 ...|
|070309629|POINT (-118.2524 ...|
|090631215|POINT (-118.3295 ...|
|100100501|POINT (-118.2488 ...|
+---------+--------------------+
only showing top 5 rows

In [5]:
blocks_df = sedona.read.format("geojson") \
            .option("multiLine", "true") \
            .load(f"{DATA_BUCKET}/2010_Census_Blocks.geojson") \
            .selectExpr("explode(features) as features") \
            .select("features.*")

blocks_data = blocks_df.select( \
                [col(f"properties.{col_name}").alias(col_name) for col_name in \
                    blocks_df.schema["properties"].dataType.fieldNames()] + ["geometry"]) \
            .drop("properties") \
            .drop("type") \
            .filter(col("COMM").isNotNull() & (col("POP_2010") > 0) & (col("CITY") == "Los Angeles")) \
            .select("COMM", "ZCTA10", "POP_2010", "geometry")

blocks_data.show(5)

+---------+------+--------+--------------------+
|     COMM|ZCTA10|POP_2010|            geometry|
+---------+------+--------+--------------------+
|San Pedro| 90732|      69|POLYGON ((-118.31...|
|San Pedro| 90731|     120|POLYGON ((-118.28...|
|San Pedro| 90731|     240|POLYGON ((-118.29...|
|San Pedro| 90732|      75|POLYGON ((-118.31...|
|San Pedro| 90731|     246|POLYGON ((-118.28...|
+---------+------+--------+--------------------+
only showing top 5 rows

In [6]:
blocks_data.printSchema()

root
 |-- COMM: string (nullable = true)
 |-- ZCTA10: string (nullable = true)
 |-- POP_2010: long (nullable = true)
 |-- geometry: geometry (nullable = true)

In [7]:
blocks_data_description_schema = StructType([
    StructField("field", StringType()),
    StructField("type", StringType()),
    StructField("meaning", StringType())
])

blocks_data_description = spark.read.csv(f"{DATA_BUCKET}/2010_Census_Blocks_fields.csv", header=True, schema=blocks_data_description_schema)

blocks_data_description \
        .filter(col("field").isin("COMM", "ZCTA10", "POP_2010", "geometry")) \
        .show(truncate=False)

+--------+--------+-------------------------------------------------------------------------+
|field   |type    |meaning                                                                  |
+--------+--------+-------------------------------------------------------------------------+
|COMM    |string  |Unincorporated area community name and LA City neighborhood              |
|POP_2010|long    |Population (PL 94-171 Redistricting Data Summary File - Total Population)|
|ZCTA10  |string  |Zip Code Tabulation Area                                                 |
|geometry|geometry|Geometry of the block                                                    |
+--------+--------+-------------------------------------------------------------------------+
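
The crime points and block polygons loaded above are later matched with `ST_Within`, which keeps a point when it lies inside a polygon. As a rough illustration of that predicate only, here is a plain-Python ray-casting sketch; it is not Sedona's implementation (which relies on a geometry library and spatial partitioning), it ignores holes and boundary cases, and the function name and toy square are hypothetical:

```python
def point_in_polygon(x, y, ring):
    """Ray casting: count how many polygon edges a ray from (x, y) to the right crosses."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge meets that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Hypothetical toy polygon: a 2 x 2 square
square = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
print(point_in_polygon(1.0, 1.0, square))
print(point_in_polygon(3.0, 1.0, square))
```

An odd number of crossings means the point is inside; an even number means it is outside.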

## DataFrame Implementation

In [8]:
from pyspark.sql.functions import sum, count, col
import time

start_time = time.time()

# Join income with areas on ZIP code
income_per_block = blocks_data \
                        .join(income_data, blocks_data["ZCTA10"] == income_data["Zip Code"]) \
                        .groupBy("COMM") \
                        .agg( \
                            sum("POP_2010").alias("Population"), \
                            sum("Estimated Median Income").alias("Total Income") \
                        )

# Join crimes with areas on geometry, i.e. keep a crime when
# its POINT lies within the POLYGON of the area
crimes_per_block = crime_data \
                        .join(blocks_data, ST_Within(crime_data["geom"], blocks_data["geometry"]), "inner") \
                        .groupBy("COMM") \
                        .agg(count("*").alias("Total Crimes"))

# Income per person and crimes per person for each community
result = income_per_block \
                .join(crimes_per_block, on=["COMM"]) \
                .withColumn("Income per Person", col("Total Income") / col("Population")) \
                .withColumn("Crimes per Person", col("Total Crimes") / col("Population")) \
                .select("COMM", "Income per Person", "Crimes per Person")

result.show()
end_time = time.time()

+--------------------+------------------+-------------------+
|                COMM| Income per Person|  Crimes per Person|
+--------------------+------------------+-------------------+
|       Glassell Park|352.32956381260095|0.42107565966612814|
|          Silverlake| 435.3862970584675| 0.6938143081951338|
|             Sunland| 595.2586013843649|0.46432206840390877|
|     Atwater Village| 520.4056449897171| 0.5319480887880292|
|     Hollywood Hills| 560.8047678795483| 0.7511023480910557|
|Angeles National ...|          19414.65|               6.85|
|      Mt. Washington| 386.5913058583402|0.45103574065227775|
|             Tujunga|425.03878023290537|0.43214392355927145|
|          Eagle Rock| 462.5070190802218| 0.4348379906058435|
|          Sun Valley| 334.3990089158554| 0.5274401545154808|
|       Highland Park|322.93648630829415| 0.4595841941013582|
|    Lakeview Terrace|  404.909076483656| 0.4470802919708029|
|           Los Feliz| 444.4558322794046|  0.776680610954373|
|      V

In [9]:
elapsed_time = end_time - start_time
print(f"Time taken: {elapsed_time:.2f} seconds")

Time taken: 39.47 seconds
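
To make explicit what the two output columns contain, the aggregation in the cell above can be traced on a few toy rows. This is a plain-Python sketch with hypothetical values (none are taken from the datasets) and a hypothetical function name. Note that `Total Income` adds the ZIP-level median income once for every block that matches, and that sum is then divided by the community population:

```python
from collections import defaultdict

# Hypothetical toy rows, not real data
blocks = [  # (COMM, ZCTA10, POP_2010)
    ("A", "90001", 100),
    ("A", "90001", 300),
    ("B", "90002", 200),
]
income = {"90001": 30000.0, "90002": 50000.0}  # Zip Code -> Estimated Median Income
crimes_per_comm = {"A": 200, "B": 50}          # result of the spatial join + count

def per_person_metrics(blocks, income, crimes_per_comm):
    """Reproduce the groupBy/agg logic: sums per community, then two ratios."""
    population = defaultdict(int)
    total_income = defaultdict(float)
    for comm, zcta, pop in blocks:
        if zcta in income:  # inner join on ZIP code
            population[comm] += pop
            total_income[comm] += income[zcta]
    return {
        comm: (total_income[comm] / population[comm],
               crimes_per_comm[comm] / population[comm])
        for comm in population
        if comm in crimes_per_comm  # inner join on COMM
    }

for comm, (inc, crime) in per_person_metrics(blocks, income, crimes_per_comm).items():
    print(comm, inc, crime)
```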

In [10]:
result.explain(mode="formatted")

== Physical Plan ==
AdaptiveSparkPlan (36)
+- Project (35)
   +- SortMergeJoin Inner (34)
      :- Sort (15)
      :  +- HashAggregate (14)
      :     +- Exchange (13)
      :        +- HashAggregate (12)
      :           +- Project (11)
      :              +- BroadcastHashJoin Inner BuildRight (10)
      :                 :- Project (5)
      :                 :  +- Filter (4)
      :                 :     +- Generate (3)
      :                 :        +- Filter (2)
      :                 :           +- Scan geojson  (1)
      :                 +- BroadcastExchange (9)
      :                    +- Project (8)
      :                       +- Filter (7)
      :                          +- Scan csv  (6)
      +- Sort (33)
         +- HashAggregate (32)
            +- Exchange (31)
               +- HashAggregate (30)
                  +- Project (29)
                     +- RangeJoin (28)
                        :- Union (22)
                        :  :- Project (18)
           

Without hints, the physical plan above uses a **BroadcastHashJoin** for the income join, a Sedona **RangeJoin** for the spatial join, and a **SortMergeJoin** (i.e. Merge) for the final join on `COMM`.

We will try to force Spark to use different join strategies, in order to compare their performance.
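
For reference, the idea behind a sort-merge join can be sketched in plain Python: sort both inputs on the join key, then advance two cursors in step. This is only a single-machine illustration under simplifying assumptions (inputs are lists of `(key, value)` pairs; the function name and toy rows are hypothetical), not how Spark implements it internally:

```python
def sort_merge_join(left, right):
    """Inner equi-join of two lists of (key, value) pairs via sort + merge."""
    left = sorted(left, key=lambda kv: kv[0])
    right = sorted(right, key=lambda kv: kv[0])
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lkey, rkey = left[i][0], right[j][0]
        if lkey < rkey:
            i += 1
        elif lkey > rkey:
            j += 1
        else:
            # Find the run of right rows sharing this key
            j_end = j
            while j_end < len(right) and right[j_end][0] == lkey:
                j_end += 1
            # Pair every left row of this key with every right row of the run
            while i < len(left) and left[i][0] == lkey:
                for k in range(j, j_end):
                    out.append((lkey, left[i][1], right[k][1]))
                i += 1
            j = j_end
    return out

# Hypothetical toy inputs keyed by community
population = [("B", 200), ("A", 400)]
crimes = [("A", 200), ("C", 7), ("B", 50)]
print(sort_merge_join(population, crimes))
```

In Spark the sort is preceded by a shuffle so that equal keys land in the same partition, which is where the `Exchange` and `Sort` operators in the plans come from.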

### 1. BROADCAST

In [27]:
from pyspark.sql.functions import sum, count, col
import time

start_time = time.time()

# Join income with areas on ZIP code
income_per_block = blocks_data  \
                        .hint("BROADCAST") \
                        .join(income_data, blocks_data["ZCTA10"] == income_data["Zip Code"]) \
                        .groupBy("COMM") \
                        .agg( \
                            sum("POP_2010").alias("Population"), \
                            sum("Estimated Median Income").alias("Total Income") \
                        )

# Join crimes with areas on geometry
crimes_per_block = crime_data \
                        .hint("BROADCAST") \
                        .join(blocks_data, ST_Within(crime_data["geom"], blocks_data["geometry"]), "inner") \
                        .groupBy("COMM") \
                        .agg(count("*").alias("Total Crimes"))

# Income per person and crimes per person for each community
result = income_per_block \
                .hint("BROADCAST") \
                .join(crimes_per_block, on=["COMM"]) \
                .withColumn("Income per Person", col("Total Income") / col("Population")) \
                .withColumn("Crimes per Person", col("Total Crimes") / col("Population")) \
                .select("COMM", "Income per Person", "Crimes per Person")

result.show()
end_time = time.time()

An error was encountered:
Invalid status code '400' from http://ec2-35-159-120-182.eu-central-1.compute.amazonaws.com:8998/sessions/914/statements/27 with error payload: {"msg":"requirement failed: Session isn't active."}


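The query could not be executed with the BROADCAST hint on the joins: the session became inactive before the statement completed, most likely because of the large volume of data that had to be broadcast.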
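
A broadcast hash join can be sketched in plain Python as follows: the broadcast side is loaded whole into an in-memory hash table, and the other side is streamed past it. This is an illustrative single-process sketch with a hypothetical function name and toy rows, not Spark's implementation; it does show why the broadcast side has to fit in memory:

```python
from collections import defaultdict

def broadcast_hash_join(streamed, broadcast):
    """Inner equi-join: build a hash table on the broadcast side, probe with the other."""
    table = defaultdict(list)
    for key, value in broadcast:  # the whole broadcast side is held in memory
        table[key].append(value)
    return [
        (key, svalue, bvalue)
        for key, svalue in streamed
        for bvalue in table.get(key, [])
    ]

# Toy rows: blocks keyed by ZIP code (hypothetical) joined to a small income table
blocks = [("90001", 69), ("90002", 120), ("90001", 240), ("99999", 5)]
income = [("90001", 33887.0), ("90002", 30413.0)]
print(broadcast_hash_join(blocks, income))
```

The approach pays off when the hash table is small; if the side chosen for broadcast is large, every executor must hold a full copy of it.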

### 2. MERGE

In [11]:
from pyspark.sql.functions import sum, count, col
import time

start_time = time.time()

# Join income with areas on ZIP code
income_per_block = blocks_data  \
                        .hint("MERGE") \
                        .join(income_data, blocks_data["ZCTA10"] == income_data["Zip Code"]) \
                        .groupBy("COMM") \
                        .agg( \
                            sum("POP_2010").alias("Population"), \
                            sum("Estimated Median Income").alias("Total Income") \
                        )

# Join crimes with areas on geometry, i.e. keep a crime when
# its POINT lies within the POLYGON of the area
crimes_per_block = crime_data \
                        .hint("MERGE") \
                        .join(blocks_data, ST_Within(crime_data["geom"], blocks_data["geometry"]), "inner") \
                        .groupBy("COMM") \
                        .agg(count("*").alias("Total Crimes"))

# Income per person and crimes per person for each community
result = income_per_block \
                .hint("MERGE") \
                .join(crimes_per_block, on=["COMM"]) \
                .withColumn("Income per Person", col("Total Income") / col("Population")) \
                .withColumn("Crimes per Person", col("Total Crimes") / col("Population")) \
                .select("COMM", "Income per Person", "Crimes per Person")

result.show()
end_time = time.time()

+--------------------+------------------+-------------------+
|                COMM| Income per Person|  Crimes per Person|
+--------------------+------------------+-------------------+
|     Adams-Normandie|160.62637082376943| 0.7148686559551135|
|              Alsace|186.29007503410642| 0.5416098226466576|
|Angeles National ...|          19414.65|               6.85|
|    Angelino Heights| 241.8594276094276| 0.5989057239057239|
|              Arleta|312.64241391896826| 0.4264509064363061|
|     Atwater Village| 520.4056449897171| 0.5319480887880292|
|       Baldwin Hills| 146.9312427977791| 0.9974508502985648|
|             Bel Air|1013.9382641326716|0.39922527539038855|
|       Beverly Crest| 1226.391190222295| 0.3689607087195472|
|         Beverlywood| 590.2408376963351| 0.5084977849375755|
|       Boyle Heights|  169.583309101483| 0.6253271299796452|
|           Brentwood| 842.1119757004881|0.40582232688304154|
|           Brookside| 913.2125603864735| 0.8856682769726248|
|    Cad

In [12]:
elapsed_time = end_time - start_time
print(f"Time taken: {elapsed_time:.2f} seconds")

Time taken: 40.95 seconds

In [13]:
result.explain(mode="formatted")

== Physical Plan ==
AdaptiveSparkPlan (39)
+- Project (38)
   +- SortMergeJoin Inner (37)
      :- Sort (18)
      :  +- HashAggregate (17)
      :     +- Exchange (16)
      :        +- HashAggregate (15)
      :           +- Project (14)
      :              +- SortMergeJoin Inner (13)
      :                 :- Sort (7)
      :                 :  +- Exchange (6)
      :                 :     +- Project (5)
      :                 :        +- Filter (4)
      :                 :           +- Generate (3)
      :                 :              +- Filter (2)
      :                 :                 +- Scan geojson  (1)
      :                 +- Sort (12)
      :                    +- Exchange (11)
      :                       +- Project (10)
      :                          +- Filter (9)
      :                             +- Scan csv  (8)
      +- Sort (36)
         +- HashAggregate (35)
            +- Exchange (34)
               +- HashAggregate (33)
                  +- Project 

### 3. SHUFFLE HASH

In [14]:
from pyspark.sql.functions import sum, count, col
import time

start_time = time.time()

# Join income with areas on ZIP code
income_per_block = blocks_data  \
                        .hint("SHUFFLE_HASH") \
                        .join(income_data, blocks_data["ZCTA10"] == income_data["Zip Code"]) \
                        .groupBy("COMM") \
                        .agg( \
                            sum("POP_2010").alias("Population"), \
                            sum("Estimated Median Income").alias("Total Income") \
                        )

# Join crimes with areas on geometry, i.e. keep a crime when
# its POINT lies within the POLYGON of the area
crimes_per_block = crime_data \
                        .hint("SHUFFLE_HASH") \
                        .join(blocks_data, ST_Within(crime_data["geom"], blocks_data["geometry"]), "inner") \
                        .groupBy("COMM") \
                        .agg(count("*").alias("Total Crimes"))

# Income per person and crimes per person for each community
result = income_per_block \
                .hint("SHUFFLE_HASH") \
                .join(crimes_per_block, on=["COMM"]) \
                .withColumn("Income per Person", col("Total Income") / col("Population")) \
                .withColumn("Crimes per Person", col("Total Crimes") / col("Population")) \
                .select("COMM", "Income per Person", "Crimes per Person")

result.show()
end_time = time.time()

+--------------------+------------------+-------------------+
|                COMM| Income per Person|  Crimes per Person|
+--------------------+------------------+-------------------+
|      Gramercy Place| 488.9200849338867| 1.0647620886014864|
|         Westchester| 604.7028968906845| 0.5666743028510736|
|      Harbor Gateway| 323.6666832768587|0.46035977675901935|
|         Playa Vista| 490.7112928523415| 0.5004481290611696|
|       Playa Del Rey| 1351.127295756808| 0.7425585813806207|
|    Marina Peninsula| 2595.523172700023| 0.5999538851740834|
|   Manchester Square|452.35297684006304| 1.0803928701345944|
|      Vermont Knolls|208.44291881520567| 1.0672142942798897|
|        Harvard Park| 260.5980739155486| 1.0106455949735231|
|           Hyde Park|  340.338796836417|  1.031421107254053|
|       Cheviot Hills|1145.7122908123683| 0.5429458051645794|
|    West Los Angeles|465.24153373335287| 0.6377579394116786|
|         Beverlywood| 590.2408376963351| 0.5084977849375755|
|       

In [15]:
elapsed_time = end_time - start_time
print(f"Time taken: {elapsed_time:.2f} seconds")

Time taken: 29.44 seconds
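
The shuffle hash strategy can be sketched in plain Python as two phases: route the rows of both sides to partitions by the hash of the join key, then run a build-and-probe hash join inside each partition. The sketch below is a single-process illustration with hypothetical names and toy rows; it builds on the left side only to mirror the `BuildLeft` label in the plan:

```python
from collections import defaultdict

def shuffle_hash_join(left, right, num_partitions=4):
    """Inner equi-join: hash-partition both sides, then hash-join each partition."""
    left_parts = defaultdict(list)
    right_parts = defaultdict(list)
    # "Shuffle": rows with equal keys always end up in the same partition
    for key, value in left:
        left_parts[hash(key) % num_partitions].append((key, value))
    for key, value in right:
        right_parts[hash(key) % num_partitions].append((key, value))

    out = []
    for p in range(num_partitions):
        # Build a hash table on the left rows of this partition only
        table = defaultdict(list)
        for key, lvalue in left_parts[p]:
            table[key].append(lvalue)
        # Probe it with the right rows of the same partition
        for key, rvalue in right_parts[p]:
            for lvalue in table.get(key, []):
                out.append((key, lvalue, rvalue))
    return out

# Hypothetical toy inputs keyed by community
population = [("A", 400), ("B", 200)]
crimes = [("B", 50), ("A", 200), ("C", 7)]
print(sorted(shuffle_hash_join(population, crimes)))
```

Unlike the sort-merge sketch, no sorting is needed, but each partition's build side must fit in memory. The output order depends on the partitioning, hence the `sorted` call.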

In [16]:
result.explain(mode="formatted")

== Physical Plan ==
AdaptiveSparkPlan (35)
+- Project (34)
   +- ShuffledHashJoin Inner BuildLeft (33)
      :- HashAggregate (15)
      :  +- Exchange (14)
      :     +- HashAggregate (13)
      :        +- Project (12)
      :           +- ShuffledHashJoin Inner BuildLeft (11)
      :              :- Exchange (6)
      :              :  +- Project (5)
      :              :     +- Filter (4)
      :              :        +- Generate (3)
      :              :           +- Filter (2)
      :              :              +- Scan geojson  (1)
      :              +- Exchange (10)
      :                 +- Project (9)
      :                    +- Filter (8)
      :                       +- Scan csv  (7)
      +- HashAggregate (32)
         +- Exchange (31)
            +- HashAggregate (30)
               +- Project (29)
                  +- RangeJoin (28)
                     :- Union (22)
                     :  :- Project (18)
                     :  :  +- Filter (17)
               

### 4. SHUFFLE REPLICATE NL

In [17]:
from pyspark.sql.functions import sum, count, col
import time

start_time = time.time()

# Join income with areas on ZIP code
income_per_block = blocks_data  \
                        .hint("SHUFFLE_REPLICATE_NL") \
                        .join(income_data, blocks_data["ZCTA10"] == income_data["Zip Code"]) \
                        .groupBy("COMM") \
                        .agg( \
                            sum("POP_2010").alias("Population"), \
                            sum("Estimated Median Income").alias("Total Income") \
                        )

# Join crimes with areas on geometry, i.e. keep a crime when
# its POINT lies within the POLYGON of the area
crimes_per_block = crime_data \
                        .hint("SHUFFLE_REPLICATE_NL") \
                        .join(blocks_data, ST_Within(crime_data["geom"], blocks_data["geometry"]), "inner") \
                        .groupBy("COMM") \
                        .agg(count("*").alias("Total Crimes"))

# Income per person and crimes per person for each community
result = income_per_block \
                .hint("SHUFFLE_REPLICATE_NL") \
                .join(crimes_per_block, on=["COMM"]) \
                .withColumn("Income per Person", col("Total Income") / col("Population")) \
                .withColumn("Crimes per Person", col("Total Crimes") / col("Population")) \
                .select("COMM", "Income per Person", "Crimes per Person")

result.show()
end_time = time.time()

+------------------+------------------+-------------------+
|              COMM| Income per Person|  Crimes per Person|
+------------------+------------------+-------------------+
|    Toluca Terrace| 260.9477325134512|0.22213681783243658|
|      Elysian Park| 189.6804632618189| 0.6058477311562559|
|          Longwood|209.40380047505937| 0.7273159144893112|
|     Green Meadows|246.80379395590535| 1.1079662983704153|
|  Cadillac-Corning|223.08117029257315|  0.581695423855964|
|          Mid-city|396.23837087663014| 0.7106492781923426|
|   Lincoln Heights|207.09009761109684| 0.5137105060364757|
|          Van Nuys|148.04583871005244|  0.787558562643137|
|    Gramercy Place| 488.9200849338867| 1.0647620886014864|
| Faircrest Heights| 624.6238745280278| 0.7290153935521347|
|     Boyle Heights|  169.583309101483| 0.6253271299796452|
|  Lafayette Square|243.89444699403396| 0.8049564020192749|
|     Granada Hills| 638.8912673095048| 0.5292539694047705|
|       North Hills|243.40044345898005| 

In [18]:
elapsed_time = end_time - start_time
print(f"Time taken: {elapsed_time:.2f} seconds")

Time taken: 16.96 seconds
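
A nested-loop join over the Cartesian product can be sketched in plain Python as follows. Every pair of rows is formed and then filtered by the join predicate, so the work grows with the product of the two input sizes. The function name and toy rows are hypothetical, and this is only an illustration of the idea, not Spark's implementation:

```python
def nested_loop_join(left, right, predicate):
    """Inner join: test every (left, right) pair and keep those satisfying the predicate."""
    return [(l, r) for l in left for r in right if predicate(l, r)]

# Hypothetical toy inputs keyed by community
population = [("A", 400), ("B", 200)]
crimes = [("B", 50), ("A", 200), ("C", 7)]

pairs = nested_loop_join(population, crimes, lambda l, r: l[0] == r[0])
print(pairs)
print(len(population) * len(crimes), "pairs examined")
```

Because the predicate is an arbitrary function, this form also covers joins that are not equality joins, at the cost of examining every pair.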

In [19]:
result.explain(mode="formatted")

== Physical Plan ==
AdaptiveSparkPlan (33)
+- Project (32)
   +- CartesianProduct Inner (31)
      :- HashAggregate (13)
      :  +- Exchange (12)
      :     +- HashAggregate (11)
      :        +- Project (10)
      :           +- CartesianProduct Inner (9)
      :              :- Project (5)
      :              :  +- Filter (4)
      :              :     +- Generate (3)
      :              :        +- Filter (2)
      :              :           +- Scan geojson  (1)
      :              +- Project (8)
      :                 +- Filter (7)
      :                    +- Scan csv  (6)
      +- HashAggregate (30)
         +- Exchange (29)
            +- HashAggregate (28)
               +- Project (27)
                  +- RangeJoin (26)
                     :- Union (20)
                     :  :- Project (16)
                     :  :  +- Filter (15)
                     :  :     +- Scan csv  (14)
                     :  +- Project (19)
                     :     +- Filter (18)
     

## Conclusions

The join strategy with the best performance is **SHUFFLE REPLICATE NL** (*Shuffle Replicate Nested Loop*), at 16.96 seconds.

**SHUFFLE HASH** was also faster than **MERGE**, at 29.44 versus 40.95 seconds respectively. The run without hints took 39.47 seconds. In the plans for the unhinted, SHUFFLE HASH and SHUFFLE REPLICATE NL runs, the spatial join appears as a `RangeJoin`, so in those runs the hints changed only the two equi-joins (on ZIP code and on `COMM`).

Finally, the **BROADCAST** strategy could not be executed: the session became inactive, most likely because memory was exhausted. A broadcast join sends one entire side of the join to all executors, so that the join runs locally without shuffling. Here the hint is placed on `blocks_data` and `crime_data`, so these are the datasets Spark tries to broadcast, and they are probably too large to fit in memory.

Therefore, of the strategies tested, the most suitable for our case is **SHUFFLE_REPLICATE_NL**.
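
The comparison can be summarised by expressing each measured time relative to the fastest run. The short sketch below uses only the wall-clock times reported in this notebook (one run each, so the ratios are indicative only); the function name is hypothetical:

```python
# Wall-clock times in seconds, as printed by the timing cells above
timings = {
    "no hint": 39.47,
    "MERGE": 40.95,
    "SHUFFLE_HASH": 29.44,
    "SHUFFLE_REPLICATE_NL": 16.96,
}

def relative_to_fastest(timings):
    """Return each time divided by the smallest one, rounded to two decimals."""
    best = min(timings.values())
    return {name: round(seconds / best, 2) for name, seconds in timings.items()}

for name, ratio in sorted(relative_to_fastest(timings).items(), key=lambda kv: kv[1]):
    print(f"{name:22s} {timings[name]:6.2f} s  {ratio:.2f}x")
```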