# Use Sedona to read various data formats

In this tutorial, we use Sedona to read various geospatial data formats, such as:
- GeoJSON
- shapefile
- CSV/TSV
- PBF
- GeoParquet

We will also evaluate the performance (e.g. storage space, processing speed) of each format.

In [1]:
from sedona.spark import *
import geopandas as gpd
from pyspark.sql.functions import trim, col
from pathlib import Path
import json

from ipyleaflet import Map, basemaps, basemap_to_tiles, MarkerCluster, Marker, AwesomeIcon
from ipywidgets import Layout
import numpy as np

In [2]:
# build a sedona session (sedona = 1.5.1)
# note: the jars pinned below are Sedona 1.4.1; the Python package and the jars should normally match
config = (
    SedonaContext.builder()
    .appName("Sedona with pyspark")
    .master("local[*]")
    .config("spark.driver.memory", "6g")
    .config(
        "spark.jars.packages",
        "com.acervera.osm4scala:osm4scala-spark3-shaded_2.12:1.0.11,"
        "org.apache.sedona:sedona-spark-shaded-3.0_2.12:1.4.1,"
        "org.datasyslab:geotools-wrapper:1.4.0-28.2",
    )
    .getOrCreate()
)



24/04/16 11:39:11 WARN Utils: Your hostname, pengfei-Virtual-Machine resolves to a loopback address: 127.0.1.1; using 10.50.2.80 instead (on interface eth0)
24/04/16 11:39:11 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
:: loading settings :: url = jar:file:/home/pengfei/opt/spark/spark-3.3.0/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /home/pengfei/.ivy2/cache
The jars for the packages stored in: /home/pengfei/.ivy2/jars
com.acervera.osm4scala#osm4scala-spark3-shaded_2.12 added as a dependency
org.apache.sedona#sedona-spark-shaded-3.0_2.12 added as a dependency
org.datasyslab#geotools-wrapper added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-ac2c8d1b-b492-4ae5-b1a9-ab97a329ebdf;1.0
	confs: [default]
	found com.acervera.osm4scala#osm4scala-spark3-shaded_2.12;1.0.11 in central
	found org.apache.sedona#sedona-spark-shaded-3.0_2.12;1.4.1 in central
	found org.datasyslab#geotools-wrapper;1.4.0-28.2 in central
:: resolution report :: resolve 523ms :: artifacts dl 18ms
	:: modules in use:
	com.acervera.osm4scala#osm4scala-spark3-shaded_2.12;1.0.11 from central in [default]
	org.apache.sedona#sedona-spark-shaded-3.0_2.12;1.4.1 from central in [default]
	org.datasyslab#geotools-wrapper;1.4.0-28.2 from central in [default]
	---------------------------------------

24/04/16 11:39:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
                                                                                

In [2]:
# alternative: build a sedona session with internet access
# (Sedona 1.6.1 jars for Spark 3.5 / Scala 2.13)
config = (
    SedonaContext.builder()
    .config(
        "spark.jars.packages",
        "org.apache.sedona:sedona-spark-3.5_2.13:1.6.1,"
        "org.datasyslab:geotools-wrapper:1.6.1-28.2",
    )
    .config("spark.jars.repositories", "https://artifacts.unidata.ucar.edu/repository/unidata-all")
    .getOrCreate()
)

In [3]:
# create a sedona context
sedona = SedonaContext.create(config)
sc = sedona.sparkContext



In [10]:
# set the character encoding used when reading shapefiles
sc.setSystemProperty("sedona.global.charset", "utf8")

## 1 Read/write shape file

In [4]:
win_root_dir = "C:/Users/PLIU/Documents/ubuntu_share/data_set"
lin_root_dir = "/home/pengfei/data_set"

fr_commune_file_path = f"{win_root_dir}/kaggle/geospatial/communes_fr_shape"

# read communes shape file
fr_commune_rdd = ShapefileReader.readToGeometryRDD(sc, fr_commune_file_path)
fr_commune_df = Adapter.toDf(fr_commune_rdd, sedona)
fr_commune_df.printSchema()

root
 |-- geometry: geometry (nullable = true)
 |-- insee: string (nullable = true)
 |-- nom: string (nullable = true)
 |-- wikipedia: string (nullable = true)
 |-- surf_ha: string (nullable = true)



In [11]:
fr_commune_df.show(5)

+--------------------+--------------------+--------------------+--------------------+--------------------+
|            geometry|               insee|                 nom|           wikipedia|             surf_ha|
+--------------------+--------------------+--------------------+--------------------+--------------------+
|POLYGON ((9.32016...|2B222            ...|Pie-d'Orezza     ...|fr:Pie-d'Orezza  ...|     573.00000000...|
|POLYGON ((9.20010...|2B137            ...|Lano             ...|fr:Lano          ...|     824.00000000...|
|POLYGON ((9.27757...|2B051            ...|Cambia           ...|fr:Cambia        ...|     833.00000000...|
|POLYGON ((9.25119...|2B106            ...|Érone            ...|fr:Érone         ...|     393.00000000...|
|POLYGON ((9.28339...|2B185            ...|Oletta           ...|fr:Oletta        ...|    2674.00000000...|
+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 5 rows



### Disk usage

The shapefile uses 301 MB of disk space, almost all of it (292 MB) in the `.shp` geometry file.

In [31]:
! du -ah /home/pengfei/data_set/kaggle/geospatial/communes_fr_shape


8.9M	/home/pengfei/data_set/kaggle/geospatial/communes_fr_shape/communes-20220101.dbf
4.0K	/home/pengfei/data_set/kaggle/geospatial/communes_fr_shape/communes-20220101.prj
4.0K	/home/pengfei/data_set/kaggle/geospatial/communes_fr_shape/communes-descriptif.txt
276K	/home/pengfei/data_set/kaggle/geospatial/communes_fr_shape/communes-20220101.shx
4.0K	/home/pengfei/data_set/kaggle/geospatial/communes_fr_shape/LICENCE.txt
4.0K	/home/pengfei/data_set/kaggle/geospatial/communes_fr_shape/communes-20220101.cpg
292M	/home/pengfei/data_set/kaggle/geospatial/communes_fr_shape/communes-20220101.shp
301M	/home/pengfei/data_set/kaggle/geospatial/communes_fr_shape
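The `du` command is only available on Unix-like systems. As a portable alternative, the sketch below sums file sizes with `pathlib`. The helper names `dir_size_bytes` and `human_readable` are hypothetical and not part of the notebook. Note that it adds up apparent file sizes, while `du` reports allocated disk blocks, so the two can differ slightly.

```python
from pathlib import Path


def dir_size_bytes(root: str) -> int:
    """Sum the apparent sizes of all regular files below `root`."""
    return sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file())


def human_readable(num_bytes: int) -> str:
    """Format a byte count with binary units, in the spirit of `du -h`."""
    size = float(num_bytes)
    for unit in ("B", "K", "M"):
        if size < 1024:
            return f"{size:.1f}{unit}"
        size /= 1024
    return f"{size:.1f}G"


# Example: format a known byte count (292 MiB, the size of the .shp file above)
print(human_readable(292 * 1024 * 1024))
```

Calling `human_readable(dir_size_bytes(path))` on the shapefile or GeoParquet directory gives a figure comparable to the last line of the `du -ah` output.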


In [5]:
from pyspark.sql import DataFrame


def get_nearest_commune(df: DataFrame, latitude: str, longitude: str, max_commune_number: int) -> DataFrame:
    """Return the communes closest to the given point, ordered by spherical distance (in meters)."""
    temp_table_name = "temp_tab"
    df.createOrReplaceTempView(temp_table_name)
    # ST_PointFromText expects "x,y", i.e. longitude first, then latitude
    nearest_commune_df = sedona.sql(f"""
        SELECT z.nom AS commune_name,
               z.insee,
               ST_DistanceSphere(ST_PointFromText('{longitude},{latitude}', ','), z.geometry) AS distance
        FROM {temp_table_name} AS z
        ORDER BY distance ASC
        LIMIT {max_commune_number}
    """)
    return nearest_commune_df
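To make the `distance` column easier to interpret, here is a minimal pure-Python sketch of a haversine (great-circle) distance in meters. It assumes, based on my reading of the Sedona documentation, that `ST_DistanceSphere` computes a haversine distance with a default Earth radius of 6371008.0 m and uses the centroid of non-point geometries. The function name `haversine_m` is hypothetical, and this is not Sedona's implementation.

```python
import math

# Assumed default radius (meters); check the Sedona docs for your version.
EARTH_RADIUS_M = 6371008.0


def haversine_m(lon1: float, lat1: float, lon2: float, lat2: float,
                radius: float = EARTH_RADIUS_M) -> float:
    """Great-circle distance in meters between two lon/lat points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))


# One degree of longitude along the equator
one_degree = haversine_m(0.0, 0.0, 1.0, 0.0)
print(round(one_degree))
```

If the centroid assumption holds, it would explain why the query point does not get a distance of 0 for the commune it lies in: the distance is measured to a representative point of the polygon, not to its boundary.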

In [6]:
# the GPS coordinates of Le Kremlin-Bicêtre are 48.8100° N, 2.3539° E

kb_latitude = "48.8100"
kb_longitude = "2.3539"

In [7]:

kb_nearest_shape_df = get_nearest_commune(fr_commune_df,kb_latitude,kb_longitude,10)

In [8]:
%%time

kb_nearest_shape_df.show()
kb_nearest_shape_df.count()

+------------------+-----+------------------+
|      commune_name|insee|          distance|
+------------------+-----+------------------+
|Le Kremlin-Bicêtre|94043|198.60307108585405|
|          Gentilly|94037| 798.3521490770968|
|           Arcueil|94003|1543.0937442695515|
|         Villejuif|94076| 2007.793912679607|
|    Ivry-sur-Seine|94041| 2489.634383841373|
|            Cachan|94016| 2590.828517555236|
|         Montrouge|92049| 2750.714176859015|
|           Bagneux|92007| 3462.091511432535|
|   Vitry-sur-Seine|94081|3845.1624363327196|
|   L'Haÿ-les-Roses|94038| 3942.190017739479|
+------------------+-----+------------------+

CPU times: total: 0 ns
Wall time: 3.13 s


10

## 2 Read/write GeoParquet
GeoParquet is an **incubating Open Geospatial Consortium (OGC) standard** that adds interoperable geospatial types (`Point`, `Line`, `Polygon`) to Parquet. As of 16/04/2024, the stable version is 1.0.0.
You can find the official GeoParquet site [here](https://geoparquet.org/).
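Concretely, GeoParquet 1.0.0 describes the geometry columns in a JSON document stored under the `geo` key of the Parquet file metadata. The sketch below builds a minimal version of that document with only the required fields, as I understand the specification. The helper name `build_geo_metadata` and the geometry types listed are assumptions for illustration; a real writer such as Sedona generates this metadata itself.

```python
import json


def build_geo_metadata(primary_column: str, geometry_types: list) -> dict:
    """Build a minimal GeoParquet 1.0.0 'geo' metadata document (required fields only)."""
    return {
        "version": "1.0.0",
        "primary_column": primary_column,
        "columns": {
            primary_column: {
                "encoding": "WKB",  # geometries are stored as Well-Known Binary
                "geometry_types": list(geometry_types),
            }
        },
    }


# Assumed geometry types for a communes dataset; adjust to the actual data
geo = build_geo_metadata("geometry", ["Polygon", "MultiPolygon"])
geo_json = json.dumps(geo, indent=2)
print(geo_json)
```

Because the geometry values themselves are plain WKB bytes, a reader that ignores the `geo` key can still open the file as ordinary Parquet.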

In [7]:
# the string attributes read from the shapefile are space-padded, so trim nom and insee
clean_fr_commune_df = (
    fr_commune_df
    .withColumn("clean_nom", trim(col("nom")))
    .withColumn("clean_insee", trim(col("insee")))
    .drop("nom").drop("insee")
    .withColumnRenamed("clean_nom", "nom")
    .withColumnRenamed("clean_insee", "insee")
)

In [8]:
clean_fr_commune_df.show()

                                                                                

+--------------------+--------------------+--------------------+-----------------+-----+
|            geometry|           wikipedia|             surf_ha|              nom|insee|
+--------------------+--------------------+--------------------+-----------------+-----+
|POLYGON ((9.32016...|fr:Pie-d'Orezza  ...|     573.00000000...|     Pie-d'Orezza|2B222|
|POLYGON ((9.20010...|fr:Lano          ...|     824.00000000...|             Lano|2B137|
|POLYGON ((9.27757...|fr:Cambia        ...|     833.00000000...|           Cambia|2B051|
|POLYGON ((9.25119...|fr:Érone         ...|     393.00000000...|            Érone|2B106|
|POLYGON ((9.28339...|fr:Oletta        ...|    2674.00000000...|           Oletta|2B185|
|POLYGON ((9.30951...|fr:Canari (Haute-...|    1678.00000000...|           Canari|2B058|
|POLYGON ((9.30101...|fr:Olmeta-di-Tuda...|    1753.00000000...|   Olmeta-di-Tuda|2B188|
|POLYGON ((9.32662...|fr:Campana       ...|     236.00000000...|          Campana|2B052|
|POLYGON ((9.33944...

In [9]:
fr_commune_geoparquet_file_path = "/home/pengfei/data_set/kaggle/geospatial/communes_fr_geoparquet"
clean_fr_commune_df.write.format("geoparquet").option("geoparquet.version","1.0.0").save(fr_commune_geoparquet_file_path)

                                                                                

In [10]:
! du -ah /home/pengfei/data_set/kaggle/geospatial/communes_fr_geoparquet

0	/home/pengfei/data_set/kaggle/geospatial/communes_fr_geoparquet/_SUCCESS
2.3M	/home/pengfei/data_set/kaggle/geospatial/communes_fr_geoparquet/.part-00000-82765f1e-fe4e-4e74-81e2-01fd73bcdb34-c000.snappy.parquet.crc
4.0K	/home/pengfei/data_set/kaggle/geospatial/communes_fr_geoparquet/._SUCCESS.crc
291M	/home/pengfei/data_set/kaggle/geospatial/communes_fr_geoparquet/part-00000-82765f1e-fe4e-4e74-81e2-01fd73bcdb34-c000.snappy.parquet
294M	/home/pengfei/data_set/kaggle/geospatial/communes_fr_geoparquet


In [11]:
geo_parquet_df = sedona.read.format("geoparquet").load(fr_commune_geoparquet_file_path)

In [12]:
geo_parquet_df.show()
geo_parquet_df.count()

                                                                                

+--------------------+--------------------+--------------------+-----------------+-----+
|            geometry|           wikipedia|             surf_ha|              nom|insee|
+--------------------+--------------------+--------------------+-----------------+-----+
|POLYGON ((9.32016...|fr:Pie-d'Orezza  ...|     573.00000000...|     Pie-d'Orezza|2B222|
|POLYGON ((9.20010...|fr:Lano          ...|     824.00000000...|             Lano|2B137|
|POLYGON ((9.27757...|fr:Cambia        ...|     833.00000000...|           Cambia|2B051|
|POLYGON ((9.25119...|fr:Érone         ...|     393.00000000...|            Érone|2B106|
|POLYGON ((9.28339...|fr:Oletta        ...|    2674.00000000...|           Oletta|2B185|
|POLYGON ((9.30951...|fr:Canari (Haute-...|    1678.00000000...|           Canari|2B058|
|POLYGON ((9.30101...|fr:Olmeta-di-Tuda...|    1753.00000000...|   Olmeta-di-Tuda|2B188|
|POLYGON ((9.32662...|fr:Campana       ...|     236.00000000...|          Campana|2B052|
|POLYGON ((9.33944...

34955

In [18]:
kb_nearest_parquet_df = get_nearest_commune(geo_parquet_df,kb_latitude,kb_longitude,10)

In [19]:
%%time
kb_nearest_parquet_df.show()
kb_nearest_parquet_df.count()

                                                                                

+------------------+-----+------------------+
|      commune_name|insee|          distance|
+------------------+-----+------------------+
|Le Kremlin-Bicêtre|94043|255.77950075329835|
|          Gentilly|94037| 1138.204118880015|
|         Villejuif|94076|2067.5242470555963|
|           Arcueil|94003| 2269.505672821453|
|            Cachan|94016|3169.7694895288837|
|    Ivry-sur-Seine|94041| 3769.348960915047|
|         Montrouge|92049| 4124.301376321017|
|   L'Haÿ-les-Roses|94038| 4166.688028197553|
|    Chevilly-Larue|94021| 4789.020724647998|
|           Bagneux|92007|  5041.99634269013|
+------------------+-----+------------------+




CPU times: user 9.98 ms, sys: 6.56 ms, total: 16.5 ms
Wall time: 6.6 s


                                                                                

10

### Custom metadata in geo parquet

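Comparing the shapefile and GeoParquet results, GeoParquet brings only a small gain in disk space and query time for this dataset. The timings come from single runs, so treat them as indicative only; the shapefile cell output shown earlier reports 3.13 s, apparently from a different run than the 7.45 s recorded here.

| file format | disk space (MB) | nearest-commune query time (s) |
|-------------|-----------------|--------------------------------|
| shapefile   | 301             | 7.45                           |
| GeoParquet  | 294             | 6.60                           |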
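To put the figures in the comparison table into perspective, the short calculation below expresses them as relative savings. The function name `relative_saving_pct` is hypothetical; the inputs are the numbers from the table.

```python
def relative_saving_pct(baseline: float, candidate: float) -> float:
    """Percentage by which `candidate` is smaller than `baseline`."""
    return (baseline - candidate) / baseline * 100


disk_saving_pct = relative_saving_pct(301, 294)    # disk space in MB
time_saving_pct = relative_saving_pct(7.45, 6.60)  # query time in seconds

print(f"disk: {disk_saving_pct:.1f}%  time: {time_saving_pct:.1f}%")
```

This works out to roughly 2.3% less disk space and 11.4% less query time, which supports the conclusion that the gain is modest here.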

## 3 Read/write GeoJSON (Geographic JavaScript Object Notation)

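With the Sedona version used here, the DataFrame API can read GeoJSON easily but has no GeoJSON writer (newer Sedona releases document a `geojson` data source, which is not covered here). GeoPandas can write GeoJSON, but it does not scale to large data frames, because all rows must first be collected on the driver.
Below are two examples. The first builds a small GeoDataFrame and writes it without problems.
The second, which converts the full communes DataFrame, does not work at all: collecting the data fails with a Kryo serialization buffer overflow.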

In [37]:
from shapely import Point

data = {
    'id': [1, 2, 3],
    'name': ['A', 'B', 'C'],
    'geometry': [Point(1, 1), Point(2, 2), Point(3, 3)]
}

fr_commune_geoj_file_path = "/home/pengfei/data_set/kaggle/geospatial/communes_fr_geojson.json"

gdf = gpd.GeoDataFrame(data, crs="EPSG:4326")

print(gdf.head())

# Write GeoDataFrame to GeoJSON file
gdf.to_file(fr_commune_geoj_file_path, driver='GeoJSON')

   id name                 geometry
0   1    A  POINT (1.00000 1.00000)
1   2    B  POINT (2.00000 2.00000)
2   3    C  POINT (3.00000 3.00000)


In [35]:
fr_commune_df.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+
|            geometry|               insee|                 nom|           wikipedia|             surf_ha|
+--------------------+--------------------+--------------------+--------------------+--------------------+
|POLYGON ((9.32016...|2B222            ...|Pie-d'Orezza     ...|fr:Pie-d'Orezza  ...|     573.00000000...|
|POLYGON ((9.20010...|2B137            ...|Lano             ...|fr:Lano          ...|     824.00000000...|
|POLYGON ((9.27757...|2B051            ...|Cambia           ...|fr:Cambia        ...|     833.00000000...|
|POLYGON ((9.25119...|2B106            ...|Érone            ...|fr:Érone         ...|     393.00000000...|
|POLYGON ((9.28339...|2B185            ...|Oletta           ...|fr:Oletta        ...|    2674.00000000...|
|POLYGON ((9.30951...|2B058            ...|Canari           ...|fr:Canari (Haute-...|    1678.00000000...|
|POLYGON ((9.30101...|2B188          

In [23]:
def get_geopandas_df(spark_df: DataFrame) -> gpd.GeoDataFrame:
    # Convert the Spark DataFrame to a pandas DataFrame.
    # This collects every row on the driver, so it only suits small data.
    pandas_df = spark_df.toPandas()

    # Create a GeoPandas DataFrame from the pandas DataFrame.
    # Sedona's Python bindings return the geometry column as Shapely objects,
    # so it can be used directly (no eval/Polygon parsing is needed).
    geo_df = gpd.GeoDataFrame(pandas_df, geometry='geometry')

    return geo_df

In [24]:
gdf = get_geopandas_df(fr_commune_df)

[Stage 21:>                                                         (0 + 1) / 1]

Py4JJavaError: An error occurred while calling o49.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 21.0 failed 1 times, most recent failure: Lost task 0.0 in stage 21.0 (TID 19) (10.50.2.80 executor driver): org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 67108864. To avoid this, increase spark.kryoserializer.buffer.max value.
	at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:391)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:593)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 67108864
	at com.esotericsoftware.kryo.io.Output.require(Output.java:167)
	at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:251)
	at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:237)
	at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:49)
	at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:38)
	at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
	at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:37)
	at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:33)
	at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
	at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:361)
	at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:302)
	at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
	at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:387)
	... 4 more

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2672)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2608)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2607)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2607)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1182)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1182)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1182)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2860)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2802)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2791)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:952)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2228)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2249)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2268)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2293)
	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1021)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:406)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:1020)
	at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:424)
	at org.apache.spark.sql.Dataset.$anonfun$collectToPython$1(Dataset.scala:3688)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:3858)
	at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:510)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3856)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3856)
	at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3685)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 67108864. To avoid this, increase spark.kryoserializer.buffer.max value.
	at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:391)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:593)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	... 1 more
Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 67108864
	at com.esotericsoftware.kryo.io.Output.require(Output.java:167)
	at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:251)
	at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:237)
	at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:49)
	at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:38)
	at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
	at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:37)
	at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:33)
	at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
	at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:361)
	at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:302)
	at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
	at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:387)
	... 4 more
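The error message itself points at `spark.kryoserializer.buffer.max`. The required 67108864 bytes are exactly 64 MiB, which matches the default value of that setting as documented by Spark. The sketch below shows one way to raise the limit when building the session. The value `512m` is an untested guess, and `KRYO_SETTINGS` and `apply_settings` are hypothetical names, not part of Sedona or Spark.

```python
# 67108864 bytes from the stack trace, expressed in MiB
required_mib = 67108864 / 2 ** 20
print(required_mib)

# Hypothetical setting to try; 512m is a guess and must stay below Spark's 2048m cap
KRYO_SETTINGS = {
    "spark.kryoserializer.buffer.max": "512m",
}


def apply_settings(builder, settings: dict):
    """Chain builder.config(key, value) for every setting and return the builder."""
    for key, value in settings.items():
        builder = builder.config(key, value)
    return builder


# Usage in this notebook (not executed here, requires Sedona):
# config = apply_settings(SedonaContext.builder(), KRYO_SETTINGS).getOrCreate()
```

A larger buffer only removes the serialization limit. `toPandas()` still pulls the whole dataset into driver memory, so this approach remains unsuitable for genuinely large data.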


In [39]:
city_geoj_file_path = "/home/pengfei/data_set/kaggle/geospatial/world-cities.geojson"

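# Read the GeoJSON file as multi-line JSON.
# selectExpr("explode(features) ...") turns the features array of the FeatureCollection into one row per feature.
# select("features.*") then unpacks each feature struct into geometry, properties and type columns.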
df = sedona.read.format("json").option("multiLine", "true").load(city_geoj_file_path) \
 .selectExpr("explode(features) as features")  \
 .select("features.*")  
 # .withColumn("prop0", f.expr("properties['prop0']")).drop("properties").drop("type")

df.show()
df.printSchema()

[Stage 28:>                                                         (0 + 1) / 1]

+--------------------+--------------------+-------+
|            geometry|          properties|   type|
+--------------------+--------------------+-------+
|{[121.4961111, 25...|{Yungho, yungho, ...|Feature|
|{[-72.233333, -37...|{Mulchen, mulchen...|Feature|
|{[-73.6405556, 40...|{Oceanside, ocean...|Feature|
|{[-70.966667, -32...|{Llaillay, llaill...|Feature|
|{[35.6, 3.1166667...|{Lodwar, lodwar, ...|Feature|
|{[10.1666667, 5.9...|{Bamenda, bamenda...|Feature|
|{[-45.533333, -20...|{Arcos, arcos, br...|Feature|
|{[-43.716944, -22...|{Seropédica, sero...|Feature|
|{[-97.1413889, 32...|{Mansfield, mansf...|Feature|
|{[-67.5419444, 10...|{Palo Negro, palo...|Feature|
|{[-42.683333, -5....|{Demerval Lobão, ...|Feature|
|{[-48.666667, -28...|{Imbituba, imbitu...|Feature|
|{[-49.333333, -5....|{Itupiranga, itup...|Feature|
|{[121.75, 24.7666...|{Ilan, ilan, tw, ...|Feature|
|{[49.1825, 11.284...|{Bosaso, bosaso, ...|Feature|
|{[64.570048, 31.8...|{Geresk, geresk, ...|Feature|
|{[7.573271,
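The explode-and-unpack step used in the read cell can be mimicked in plain Python to show what happens to the data. The tiny FeatureCollection below is made-up sample data, not taken from `world-cities.geojson`.

```python
import json

# Hypothetical two-feature collection, for illustration only
collection = json.loads("""
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [1.0, 1.0]},
     "properties": {"name": "A"}},
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [2.0, 2.0]},
     "properties": {"name": "B"}}
  ]
}
""")

# explode(features): one row per element of the features array
# select("features.*"): each field of the feature struct becomes a column
rows = [
    {"geometry": f["geometry"], "properties": f["properties"], "type": f["type"]}
    for f in collection["features"]
]

for row in rows:
    print(row)
```

Each resulting row has the same three columns as the Spark output: `geometry`, `properties` and `type`. At this stage `geometry` is still a plain struct of coordinates, not a Sedona geometry.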

                                                                                