# Apache Sedona

**Apache Sedona (formerly known as GeoSpark) is a cluster computing system for processing large-scale spatial data**. Sedona extends existing cluster computing systems, such as Apache Spark and Apache Flink, with a set of out-of-the-box distributed Spatial Datasets and Spatial SQL that efficiently load, process, and analyze large-scale spatial data across machines.

You may ask why we need Sedona when we already have `GeoPandas`. GeoPandas works great when your dataset is small (< 2 GB), but it cannot handle large datasets.

The official Sedona site is [here](https://sedona.apache.org).

Sedona can run on top of several computing frameworks:
- Apache Spark
- Apache Flink
- Snowflake

> In this tutorial, we focus only on how to use Sedona with Spark.

## 1. What does Sedona offer?

- Distributed spatial datasets
   * Spatial RDD on Spark
   * Spatial DataFrame/SQL on Spark
   * Spatial DataStream on Flink
   * Spatial Table/SQL on Flink

- Complex spatial objects
   * Vector geometries / trajectories
   * Raster images with Map Algebra
   * Various input formats: CSV, TSV, WKT, WKB, GeoJSON, Shapefile, GeoTIFF, NetCDF/HDF

- Distributed spatial queries
   * Spatial query: range query, range join query, distance join query, K Nearest Neighbor query
   * Spatial index: R-Tree, Quad-Tree

- Rich spatial analytics tools
   * Coordinate Reference System / Spatial Reference System Transformation
   * High resolution map generation: Visualize Spatial DataFrame/RDD
   * Apache Zeppelin integration
   * Support Scala, Java, Python, R


## 2. Install and configure Sedona
Sedona works on top of a `Spark cluster` and provides APIs in the following languages:
- Scala/Java
- Python
- R

> In this tutorial, I only show how to install it on Spark with Scala and Python API.

For more details, you can visit the official [doc](https://sedona.apache.org/1.4.1/setup/install-python/)

### 2.1 Get the Sedona jar file

You can find the jar files of all Sedona releases in the [maven central repo](https://mvnrepository.com/artifact/org.apache.sedona?p=1).

#### Use shaded jar files

To simplify installation, Sedona provides `shaded jars` (we only need to import two jar files).

For example, if your Spark environment is `3.0 <= spark < 3.4` with Scala 2.12, you need the Maven configuration below:

```xml
<dependencies>
<dependency>
  <groupId>org.apache.sedona</groupId>
  <artifactId>sedona-spark-shaded-3.0_2.12</artifactId>
  <version>1.4.1</version>
</dependency>
<dependency>
  <groupId>org.apache.sedona</groupId>
  <artifactId>sedona-viz-3.0_2.12</artifactId>
  <version>1.4.1</version>
</dependency>
<!-- Optional: https://mvnrepository.com/artifact/org.datasyslab/geotools-wrapper -->
<dependency>
    <groupId>org.datasyslab</groupId>
    <artifactId>geotools-wrapper</artifactId>
    <version>1.4.0-28.2</version>
</dependency>
</dependencies>
```

The optional **GeoTools library** is required if you want to use CRS transformation, ShapefileReader or GeoTiff reader. This wrapper library is a re-distribution of GeoTools official jars.

> For other Spark environments, you can find the full doc [here](https://sedona.apache.org/1.4.1/setup/maven-coordinates/).


#### Use unshaded jars

If you use unshaded jars, your Maven configuration becomes longer. Below is an example for `3.0 <= spark < 3.4` with Scala 2.12.

```xml
<dependencies>
<dependency>
  <groupId>org.apache.sedona</groupId>
  <artifactId>sedona-core-3.0_2.12</artifactId>
  <version>1.4.1</version>
</dependency>
<dependency>
  <groupId>org.apache.sedona</groupId>
  <artifactId>sedona-sql-3.0_2.12</artifactId>
  <version>1.4.1</version>
</dependency>
<dependency>
  <groupId>org.apache.sedona</groupId>
  <artifactId>sedona-viz-3.0_2.12</artifactId>
  <version>1.4.1</version>
</dependency>
<!-- Required if you use Sedona Python -->
<dependency>
  <groupId>org.apache.sedona</groupId>
  <artifactId>sedona-python-adapter-3.0_2.12</artifactId>
  <version>1.4.1</version>
</dependency>
<dependency>
    <groupId>org.datasyslab</groupId>
    <artifactId>geotools-wrapper</artifactId>
    <version>1.4.0-28.2</version>
</dependency>
</dependencies>
```

> Notice that there are many more jars to import.

### 2.2 Use sedona in pyspark

You can find the official doc [here](https://sedona.apache.org/latest-snapshot/setup/install-python/).

To use Sedona in PySpark, follow the three steps below:
- Install `apache-sedona` in the target Python virtual environment.
- Download the required jars (check the version dependencies).
- Create the Spark session with the required jar files.


#### 2.2.1 Install apache-sedona python package

The official package page is [here](https://pypi.org/project/apache-sedona/). You can use the commands below to install the package:

```shell
# simple install
pip install apache-sedona

# install sedona with pyspark as a dependency
# Since Sedona v1.1.0, pyspark is an optional dependency of Sedona Python because spark comes pre-installed on many spark platforms.
# To install pyspark along with Sedona Python in one go, use the spark extra
# (the quotes keep shells such as zsh from interpreting the square brackets)
pip install "apache-sedona[spark]"

# check the version of apache-sedona, because the jar version must be compatible with it
pip show apache-sedona
```

> For example, if the version of the `apache-sedona` Python package is 1.6.1, then the Sedona jar version must be 1.6.1 too.
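
The version rule above can be checked before starting Spark. The snippet below is only a minimal sketch, not part of Sedona: `installed_sedona_version` and `jar_matches_package` are hypothetical helper names, and the check assumes that a Sedona jar file name ends with `-<sedona version>.jar` (it does not apply to the `geotools-wrapper` jar, whose name ends with the GeoTools version).

```python
from importlib import metadata
from typing import Optional


def installed_sedona_version() -> Optional[str]:
    """Return the installed apache-sedona version, or None if the package is missing."""
    try:
        return metadata.version("apache-sedona")
    except metadata.PackageNotFoundError:
        return None


def jar_matches_package(jar_name: str, package_version: str) -> bool:
    """Return True if the Sedona jar file name ends with '-<package_version>.jar'."""
    return jar_name.endswith(f"-{package_version}.jar")


# Example jar name; replace it with the jar you actually downloaded.
jar_name = "sedona-spark-shaded-3.5_2.13-1.6.1.jar"
version = installed_sedona_version()
if version is None:
    print("apache-sedona is not installed in this environment")
elif jar_matches_package(jar_name, version):
    print(f"{jar_name} matches apache-sedona {version}")
else:
    print(f"Version mismatch: {jar_name} vs apache-sedona {version}")
```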

#### 2.2.2 Download the required jars

Determining the right jar version before downloading is very important. You can find all required jars at the URLs below:
 - sedona-jars: https://repo.maven.apache.org/maven2/org/apache/sedona/
 - geotools-wrapper-jars: https://repo.maven.apache.org/maven2/org/datasyslab/geotools-wrapper/1.6.1-28.2/

Here are two examples of the shaded jar:
  - `sedona-spark-shaded-3.0_2.12-1.4.1.jar`: this jar is built for Spark 3.0 and compiled with Scala 2.12. The Sedona version is 1.4.1.
  - `sedona-spark-shaded-3.5_2.13-1.6.1.jar`: this jar is built for Spark 3.5 and compiled with Scala 2.13. The Sedona version is 1.6.1.

For the GeoTools jar:
  - `geotools-wrapper-1.4.0-28.2.jar`: this jar is built for Sedona version 1.4.0; the GeoTools version is 28.2.
  - `geotools-wrapper-1.6.1-28.2.jar`: this jar is built for Sedona version 1.6.1; the GeoTools version is 28.2.
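
The naming convention of the shaded jar can be made explicit with a few lines of Python. This is a sketch under the assumption that shaded jar names always follow the pattern `sedona-spark-shaded-<spark>_<scala>-<sedona>.jar` seen in the two examples; `ShadedJar`, `parse_shaded_jar` and `maven_coordinate` are hypothetical names introduced for illustration, not Sedona APIs.

```python
import re
from typing import NamedTuple


class ShadedJar(NamedTuple):
    spark: str   # Spark major.minor version the jar was built for, e.g. "3.5"
    scala: str   # Scala version it was compiled with, e.g. "2.13"
    sedona: str  # Sedona release, e.g. "1.6.1"


# Assumed pattern: sedona-spark-shaded-<spark>_<scala>-<sedona>.jar
_SHADED_JAR = re.compile(r"^sedona-spark-shaded-(\d+\.\d+)_(\d+\.\d+)-(.+)\.jar$")


def parse_shaded_jar(file_name: str) -> ShadedJar:
    """Split a shaded jar file name into its Spark, Scala and Sedona versions."""
    match = _SHADED_JAR.match(file_name)
    if match is None:
        raise ValueError(f"Not a Sedona shaded jar name: {file_name}")
    return ShadedJar(*match.groups())


def maven_coordinate(jar: ShadedJar) -> str:
    """Build the 'group:artifact:version' string usable in spark.jars.packages."""
    return f"org.apache.sedona:sedona-spark-shaded-{jar.spark}_{jar.scala}:{jar.sedona}"


jar = parse_shaded_jar("sedona-spark-shaded-3.5_2.13-1.6.1.jar")
print(jar)
print(maven_coordinate(jar))
```

Parsing the name this way makes it easy to compare the Spark and Scala versions of a jar with those of the cluster before the jar is loaded.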

#### 2.2.3 Create the spark session with the required jar file

Import the Sedona jar files into your Spark session (create the Sedona config).

There are two ways to import jar files into your Spark session:

1. Put the jar files directly into `$SPARK_HOME/jars/` (in cluster mode, make sure all the worker nodes also have the jar files in place).
2. Ask the Spark session to download the jar files by using the `spark.jars.packages` config.

See the example below.

```python
from sedona.spark import *
from pathlib import Path
from pyspark.sql import SparkSession
```

```python
# Legacy way to build a spark session with sedona (sedona < 1.4.1), kept for reference.
# Note that the serializer must be passed as a class name string (KryoSerializer.getName).
# spark = SparkSession. \
#     builder. \
#     appName('appName'). \
#     config("spark.serializer", KryoSerializer.getName). \
#     config("spark.kryo.registrator", SedonaKryoRegistrator.getName). \
#     config('spark.jars.packages',
#            'org.apache.sedona:sedona-spark-shaded-3.0_2.12:1.4.1,'
#            'org.datasyslab:geotools-wrapper:1.4.0-28.2'). \
#     getOrCreate()
# SedonaRegistrator.registerAll(spark)
```

#### 2.2.4 Create sedona context from scratch

The example below shows how to create a Sedona context from scratch.

We need extra jars for the Sedona classes. To load these jars into the Spark context, you can use one of the two configs below:
- `spark.jars.packages` (value: package ids): Spark downloads the jars for both the driver node and the worker nodes, but this requires an internet connection.
- `spark.jars` (value: jar paths): this config requires you to download the jars manually to the local file system of the driver and worker nodes. You don't need an internet connection anymore.

```python
# build a sedona session with internet
config = SedonaContext.builder(). \
    config('spark.jars.packages',
           'org.apache.sedona:sedona-spark-3.5_2.13:1.6.1,'
           'org.datasyslab:geotools-wrapper:1.6.1-28.2'). \
    config('spark.jars.repositories', 'https://artifacts.unidata.ucar.edu/repository/unidata-all'). \
    getOrCreate()
```


```python
# build a sedona session offline
jar_folder = Path(r"/home/pengfei/git/PySparkCommonFunc/jars")
jar_list = ["sedona-spark-shaded-3.5_2.13-1.6.1.jar", "geotools-wrapper-1.6.1-28.2.jar"]
# prefix each jar name with the jar folder, so that spark receives full paths
jar_path = ",".join(str(jar_folder / jar) for jar in jar_list)

config = SedonaContext.builder(). \
    config('spark.jars', jar_path). \
    getOrCreate()
```
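
In offline mode, a missing jar typically only shows up later as a class-loading error, so it can help to verify the files before building the session. The helper below is a hedged sketch: `resolve_local_jars` is a hypothetical name, not a Sedona or Spark function, and it only relies on `pathlib` from the standard library.

```python
from pathlib import Path
from typing import Iterable


def resolve_local_jars(jar_folder: Path, jar_names: Iterable[str]) -> str:
    """Return a comma-separated string of absolute jar paths for `spark.jars`.

    Raises FileNotFoundError listing every jar that is missing from the folder.
    """
    paths = [jar_folder / name for name in jar_names]
    missing = [str(path) for path in paths if not path.is_file()]
    if missing:
        raise FileNotFoundError("Missing jar files: " + ", ".join(missing))
    return ",".join(str(path.resolve()) for path in paths)


# Possible usage with the variables of the offline example:
# jar_path = resolve_local_jars(jar_folder, jar_list)
# config = SedonaContext.builder().config('spark.jars', jar_path).getOrCreate()
```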

```python
# create a sedona context
sedona = SedonaContext.create(config)
```

#### 2.2.5 Create sedona context from an existing spark session

In some cases, you don't have the right to create a Spark session from scratch (e.g. on Wherobots, AWS EMR or Databricks), because the platform already provides one.

You can use the code below to create a Sedona context from an existing Spark session.

```python
from pyspark.sql import SparkSession
from sedona.spark import SedonaContext
from sedona.utils import SedonaKryoRegistrator, KryoSerializer

# Suppose the platform gives you a spark session called `spark`.
# Here we build one ourselves to stand in for it.
spark = SparkSession. \
    builder. \
    appName('MySedona'). \
    config("spark.serializer", KryoSerializer.getName). \
    config("spark.kryo.registrator", SedonaKryoRegistrator.getName). \
    config('spark.jars.packages',
           'org.apache.sedona:sedona-spark-shaded-3.0_2.12:1.4.1,'
           'org.datasyslab:geotools-wrapper:1.4.0-28.2'). \
    getOrCreate()

# you can create a sedona context from it
sedona = SedonaContext.create(spark)
```

## 3. Use sedona to read datasets

Sedona provides a SQL API that lets you read, transform and write geospatial data.

You can find the full doc [here](https://sedona.apache.org/1.4.1/api/sql/Overview/).

In this tutorial, we will read a TSV (tab-separated values) file that contains the boundaries (polygons) of some counties in the USA.

```python
data_folder_path = "../../../../data/"
file_path = f"{data_folder_path}/county_small.tsv"
```

```python
rawDf = sedona.read.format("csv").option("delimiter", "\t").option("header", "false").load(file_path)
rawDf.createOrReplaceTempView("rawdf")
rawDf.show()
```

```text
+--------------------+---+---+--------+-----+-----------+--------------------+---+---+-----+----+-----+----+----+----------+--------+-----------+------------+
|                 _c0|_c1|_c2|     _c3|  _c4|        _c5|                 _c6|_c7|_c8|  _c9|_c10| _c11|_c12|_c13|      _c14|    _c15|       _c16|        _c17|
+--------------------+---+---+--------+-----+-----------+--------------------+---+---+-----+----+-----+----+----+----------+--------+-----------+------------+
|POLYGON ((-97.019...| 31|039|00835841|31039|     Cuming|       Cuming County| 06| H1|G4020|NULL| NULL|NULL|   A|1477895811|10447360|+41.9158651|-096.7885168|
|POLYGON ((-123.43...| 53|069|01513275|53069|  Wahkiakum|    Wahkiakum County| 06| H1|G4020|NULL| NULL|NULL|   A| 682138871|61658258|+46.2946377|-123.4244583|
|POLYGON ((-104.56...| 35|011|00933054|35011|    De Baca|      De Baca County| 06| H1|G4020|NULL| NULL|NULL|   A|6015539696|29159492|+34.3592729|-104.3686961|
|POLYGON ((-96.910...| 31|109|00835876|31109| 
```

```python
rawDf.printSchema()
```

```text
root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)
 |-- _c6: string (nullable = true)
 |-- _c7: string (nullable = true)
 |-- _c8: string (nullable = true)
 |-- _c9: string (nullable = true)
 |-- _c10: string (nullable = true)
 |-- _c11: string (nullable = true)
 |-- _c12: string (nullable = true)
 |-- _c13: string (nullable = true)
 |-- _c14: string (nullable = true)
 |-- _c15: string (nullable = true)
 |-- _c16: string (nullable = true)
 |-- _c17: string (nullable = true)
```



### 3.1 Create a Geometry type column

Notice that the `_c0` column has type string, even though its value is a polygon of GPS coordinates. So the first step is to convert the string column into a `Geometry type column`.

```python
spatialDf = sedona.sql("select ST_GeomFromText(rawdf._c0) as county_shape, rawdf._c6 as county_name from rawdf")
spatialDf.show()
```

```text
+--------------------+--------------------+
|        county_shape|         county_name|
+--------------------+--------------------+
|POLYGON ((-97.019...|       Cuming County|
|POLYGON ((-123.43...|    Wahkiakum County|
|POLYGON ((-104.56...|      De Baca County|
|POLYGON ((-96.910...|    Lancaster County|
|POLYGON ((-98.273...|     Nuckolls County|
|POLYGON ((-65.910...|Las Piedras Munic...|
|POLYGON ((-97.129...|    Minnehaha County|
|POLYGON ((-99.821...|       Menard County|
|POLYGON ((-120.65...|       Sierra County|
|POLYGON ((-85.239...|      Clinton County|
|POLYGON ((-83.880...|      Hancock County|
|POLYGON ((-102.08...|         Hale County|
|POLYGON ((-85.978...|         Clay County|
|POLYGON ((-101.62...|    Armstrong County|
|POLYGON ((-84.397...|        Allen County|
|POLYGON ((-82.449...|     McDuffie County|
|POLYGON ((-90.191...|         Sauk County|
|POLYGON ((-92.415...|        Stone County|
|POLYGON ((-117.74...|      Wallowa County|
|POLYGON ((-80.518...|       Bea
```

```python
spatialDf.printSchema()
```

```text
root
 |-- county_shape: geometry (nullable = true)
 |-- county_name: string (nullable = true)
```



Notice that the `county_shape` column now has the geometry type.

## 4. Save the geo dataframe

To save a Spatial DataFrame to some permanent storage such as Hive tables, S3 and HDFS, you can simply `convert each geometry in the Geometry type column back to a plain String` and save the plain DataFrame to wherever you want.

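Use the following query to convert the Geometry column of a DataFrame back to a WKT string column. The `county_shape` column only exists in `spatialDf`, not in the `rawdf` view, so the query assumes that `spatialDf` has been registered as a temporary view named `spatialdf` (for example with `spatialDf.createOrReplaceTempView("spatialdf")`):

```sql
SELECT ST_AsText(county_shape) AS county_shape_wkt, county_name FROM spatialdf
```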
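
The same conversion can be wrapped in a small PySpark helper. This is a minimal sketch under stated assumptions: `wkt_export_query` and `save_as_wkt_parquet` are hypothetical names, the view name `spatialdf` is chosen for illustration, and plain Parquet is used as the output format only as an example. The `sedona` argument is the Sedona context created earlier and `df` is a DataFrame with a Geometry column, such as `spatialDf`.

```python
from typing import Sequence


def wkt_export_query(view_name: str, geometry_col: str, other_cols: Sequence[str]) -> str:
    """Build a query that turns the geometry column into a WKT string column."""
    select_list = [f"ST_AsText({geometry_col}) AS {geometry_col}_wkt", *other_cols]
    return f"SELECT {', '.join(select_list)} FROM {view_name}"


def save_as_wkt_parquet(sedona, df, geometry_col: str, output_path: str,
                        view_name: str = "spatialdf") -> None:
    """Register `df` as a temp view, convert its geometry to WKT and save it as Parquet."""
    df.createOrReplaceTempView(view_name)
    other_cols = [col for col in df.columns if col != geometry_col]
    plain_df = sedona.sql(wkt_export_query(view_name, geometry_col, other_cols))
    plain_df.write.mode("overwrite").parquet(output_path)


print(wkt_export_query("spatialdf", "county_shape", ["county_name"]))
# prints: SELECT ST_AsText(county_shape) AS county_shape_wkt, county_name FROM spatialdf

# Possible usage with the objects of this tutorial (the output path is hypothetical):
# save_as_wkt_parquet(sedona, spatialDf, "county_shape", f"{data_folder_path}/tmp/county_wkt")
```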


### 4.1 Save GeoParquet

Since v1.3.0, Sedona natively supports writing GeoParquet files. A Spatial DataFrame can be saved as GeoParquet as follows:

```python
output_path = f"{data_folder_path}/tmp/county_geo_parquet"
spatialDf.write.format("geoparquet").save(output_path)
```
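
To confirm that the file was written correctly, it can be read back with the same `geoparquet` format. The function below is a sketch only: `geoparquet_roundtrip_ok` is a hypothetical helper, and comparing row counts is a deliberately simple sanity check rather than a full comparison of the geometries. Note that it overwrites whatever already exists at `path`.

```python
def geoparquet_roundtrip_ok(sedona, df, path: str) -> bool:
    """Write `df` as GeoParquet, read it back and compare the row counts."""
    # mode("overwrite") replaces any existing output at `path`
    df.write.format("geoparquet").mode("overwrite").save(path)
    reloaded = sedona.read.format("geoparquet").load(path)
    return reloaded.count() == df.count()


# Possible usage with the objects of this tutorial:
# print(geoparquet_roundtrip_ok(sedona, spatialDf, output_path))
```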

Sedona/Spark can read and write many geospatial data sources. We won't show all of them in this tutorial; a dedicated section will cover them.

> At first glance, the TSV file is 4 MB while the GeoParquet file is only 3 MB. The GeoParquet file also carries a geo-schema (e.g. point, line, polygon).
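
The size comparison can be reproduced with the standard library. Spark writes the GeoParquet output as a folder of part files, so the sketch below sums the sizes of all files under a path; `size_in_bytes` and `size_in_mb` are hypothetical helper names introduced for illustration.

```python
from pathlib import Path


def size_in_bytes(path: Path) -> int:
    """Return the size of a file, or the total size of all files under a folder."""
    if path.is_file():
        return path.stat().st_size
    return sum(p.stat().st_size for p in path.rglob("*") if p.is_file())


def size_in_mb(path: Path) -> float:
    """Same as size_in_bytes, expressed in mebibytes."""
    return size_in_bytes(path) / (1024 * 1024)


# Possible usage with the paths of this tutorial:
# print(f"tsv:        {size_in_mb(Path(file_path)):.1f} MB")
# print(f"geoparquet: {size_in_mb(Path(output_path)):.1f} MB")
```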