# Intro to GIS with PySpark

- https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_ps.html
- https://sedona.apache.org/latest/sedonaspark

## Instructions

1. Install Python and the packages listed in requirements.txt
2. Install Java 21 (I'm using OpenJDK; the JRE alone should work)
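
Before starting Spark, it can help to confirm that a Java runtime is actually visible from Python. The snippet below is a minimal sketch, not part of the original notebook; `java_version_banner` is a hypothetical helper name, and it only reports what `java -version` prints without checking that the version is 21.

```python
import shutil
import subprocess


def java_version_banner():
    """Return the first line printed by `java -version`, or None if unavailable.

    Hypothetical helper for illustration; not part of PySpark or Sedona.
    """
    java = shutil.which("java")
    if java is None:
        return None
    try:
        # `java -version` writes its banner to stderr, not stdout
        result = subprocess.run(
            [java, "-version"],
            capture_output=True,
            text=True,
            check=False,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    output = (result.stderr or result.stdout).strip()
    return output.splitlines()[0] if output else None


print(java_version_banner() or "java not found on PATH")
```

If this prints "java not found on PATH", Spark will most likely fail to start until Java is installed or added to `PATH`.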

In [3]:
from pyspark.sql import SparkSession
from sedona.spark import SedonaContext
from pathlib import Path
import geopandas as gpd

In [None]:
%%capture

# PySpark can't read .gpkg files natively;
# Apache Sedona adds geospatial readers to Spark.
# TODO: update the jars to their latest versions
sedona = (
    SedonaContext.builder()
    .master("local[*]")
    .appName("gis_intro")
    .config(
        "spark.jars.packages",
        "org.apache.sedona:sedona-spark-shaded-3.5_2.13:1.8.1,"
        "org.datasyslab:geotools-wrapper:1.8.1-33.1,"
        "org.json4s:json4s-jackson_2.13:3.7.0-M11",
    )
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config(
        "spark.sql.extensions", "org.apache.sedona.spark.CatalystSpark3ShimExtension"
    )
    .getOrCreate()
)

In [5]:
# load() expects a string, but Path() creates a pathlib.Path object, so wrap it in str()
gadm_path = str(Path("..", "data", "gadm_aus.gpkg"))
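
A missing file tends to surface later as a long Spark stack trace, so it may be worth checking the path up front. This is a small sketch under that assumption; `as_load_path` is a hypothetical helper, not something the notebook or Sedona provides.

```python
from pathlib import Path


def as_load_path(*parts: str) -> str:
    """Join path parts, verify the file exists, and return a plain string.

    Hypothetical helper for illustration only.
    """
    path = Path(*parts)
    if not path.is_file():
        # Fail early with the resolved path so the mistake is easy to spot
        raise FileNotFoundError(f"file not found: {path.resolve()}")
    return str(path)
```

Under these assumptions, `as_load_path("..", "data", "gadm_aus.gpkg")` would return the same string as the cell above, or raise `FileNotFoundError` if the file is not where the notebook expects it.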

In [6]:
# list all layers in the GeoPackage to find the tableName to pass to Sedona
gpd.list_layers(gadm_path)

Unnamed: 0,name,geometry_type
0,ADM_ADM_0,MultiPolygon
1,ADM_ADM_1,MultiPolygon
2,ADM_ADM_2,MultiPolygon
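
A GeoPackage is an SQLite database, and its layers are registered in the `gpkg_contents` table, so the layer names can also be listed with only the standard library. The following is a minimal sketch of that alternative; `list_gpkg_tables` is a hypothetical helper name, and it returns the `data_type` column (`features` for vector layers) rather than the geometry type shown above.

```python
import sqlite3
from pathlib import Path


def list_gpkg_tables(gpkg_path: str) -> list[tuple[str, str]]:
    """Return (table_name, data_type) rows from a GeoPackage's gpkg_contents table.

    Hypothetical helper for illustration only.
    """
    # Open read-only so the file is never created or modified by accident
    uri = Path(gpkg_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT table_name, data_type FROM gpkg_contents ORDER BY table_name"
        ).fetchall()
    finally:
        conn.close()
    return rows
```

Calling `list_gpkg_tables(gadm_path)` on the file used here should list the same three `ADM_ADM_*` layer names, assuming the file is a valid GeoPackage.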


In [7]:
# read one layer (selected by tableName) into a Spark DataFrame
gpkg_raw = (
    sedona.read.format("geopackage").option("tableName", "ADM_ADM_0").load(gadm_path)
)

In [8]:
# show up to 5 rows
gpkg_raw.show(5)

+---+--------------------+-----+---------+
|fid|                geom|GID_0|  COUNTRY|
+---+--------------------+-----+---------+
|  1|MULTIPOLYGON (((1...|  AUS|Australia|
+---+--------------------+-----+---------+

