# Read and write various geo data formats

Before we can use Sedona for geospatial operations (e.g. calculating distance, area hierarchy, etc.), we first need to construct a geo dataframe. A geo dataframe contains one or more columns of the following types:
- Point: a point on the map with (x, y) coordinates
- Line: two or more points that form a line
- Polygon: a list of points that form a closed polygon

The **full list of the constructor for the geo data types** can be found [here](https://sedona.apache.org/1.4.1/api/sql/Constructor/)
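
To make the three geometry types concrete, the sketch below formats each one as a WKT (well-known text) string, the same textual form that appears in the `show()` outputs of this notebook. The helper names `point_wkt`, `linestring_wkt`, and `polygon_wkt` are hypothetical and are not part of Sedona; they only illustrate the shape of the data.

```python
def point_wkt(x, y):
    """Format one coordinate pair as a WKT POINT."""
    return f"POINT ({x} {y})"


def linestring_wkt(coords):
    """Format a sequence of (x, y) pairs as a WKT LINESTRING."""
    if len(coords) < 2:
        raise ValueError("a line string needs at least two points")
    return "LINESTRING (" + ", ".join(f"{x} {y}" for x, y in coords) + ")"


def polygon_wkt(ring):
    """Format one closed exterior ring as a WKT POLYGON."""
    if len(ring) < 4 or ring[0] != ring[-1]:
        raise ValueError("a ring needs at least four points and must be closed")
    return "POLYGON ((" + ", ".join(f"{x} {y}" for x, y in ring) + "))"


print(point_wkt(1.1, 101.1))
print(linestring_wkt([(0, 0), (1, 1)]))
print(polygon_wkt([(0, 0), (1, 0), (1, 1), (0, 0)]))
```

Note that a polygon ring repeats its first point at the end, which is why the smallest valid ring has four coordinate pairs.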

In [1]:
from sedona.spark import *

In [2]:
# build a sedona session (sedona >= 1.4.1)
config = SedonaContext.builder(). \
    config('spark.jars.packages',
           'org.apache.sedona:sedona-spark-3.5_2.13:1.6.1,'
           'org.datasyslab:geotools-wrapper:1.6.1-28.2'). \
    config('spark.jars.repositories', 'https://artifacts.unidata.ucar.edu/repository/unidata-all'). \
    getOrCreate()

# create a sedona context
sedona = SedonaContext.create(config)

## 1. Read from CSV/TSV files of plain text strings

In the example below, we read a plain CSV file that contains two columns, x and y. Note that the content of the CSV is read as `plain text` strings.

### 1.1 Point example

In the example below, we construct a geo dataframe that contains a **Point** column.

In [28]:
data_folder_path = "../../../../data/"

point_file_path=f"{data_folder_path}/test_points.csv"

In [29]:
# read a normal csv
raw_point_df = sedona.read.format("csv").\
          option("delimiter",",").\
          option("header","false").\
          load(point_file_path)

raw_point_df.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)



In [30]:
raw_point_df.show(5)

+---+-----+
|_c0|  _c1|
+---+-----+
|1.1|101.1|
|2.1|102.1|
|3.1|103.1|
|4.1|104.1|
|5.1|105.1|
+---+-----+
only showing top 5 rows



In [31]:
# create a temp view
raw_point_df.createOrReplaceTempView("p_raw_table")

In [32]:
point_df = sedona.sql("select ST_Point(cast(p_raw_table._c0 as Decimal(24,20)), cast(p_raw_table._c1 as Decimal(24,20))) as point from p_raw_table")

In [33]:
point_df.show(5)

+-----------------+
|            point|
+-----------------+
|POINT (1.1 101.1)|
|POINT (2.1 102.1)|
|POINT (3.1 103.1)|
|POINT (4.1 104.1)|
|POINT (5.1 105.1)|
+-----------------+
only showing top 5 rows



In [34]:
point_df.printSchema()

root
 |-- point: geometry (nullable = true)



> Note that we used the constructor **ST_Point()** to build the point column. Because the CSV columns are read as strings, they are cast to Decimal before being passed to the constructor.



### 1.2 Line example

To create a line, we can use the constructor **ST_LineStringFromText(Text: string, Delimiter: char)**.


In [35]:
line_df1 = sedona.sql("SELECT ST_LineStringFromText('-74.0428197,40.6867969,-74.0421975,40.6921336,-74.0508020,40.6912794', ',') AS line_col")

In [36]:
line_df1.show()

+--------------------+
|            line_col|
+--------------------+
|LINESTRING (-74.0...|
+--------------------+



In [37]:
line_df1.printSchema()

root
 |-- line_col: geometry (nullable = true)
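
The delimited input passed to **ST_LineStringFromText** can be read as a flat list of alternating x and y values. The sketch below shows that pairing in plain Python. It is an illustration only and not Sedona's implementation; `linestring_from_text` is a hypothetical helper, and it keeps the coordinate tokens exactly as written in the input text.

```python
def linestring_from_text(text, delimiter):
    """Pair up delimited x,y values and format them as a WKT LINESTRING."""
    tokens = [t.strip() for t in text.split(delimiter)]
    if len(tokens) < 4 or len(tokens) % 2 != 0:
        raise ValueError("expected an even number of values, at least two x/y pairs")
    for token in tokens:
        float(token)  # raises ValueError if a token is not numeric
    # Even positions are x values, odd positions are y values
    pairs = zip(tokens[0::2], tokens[1::2])
    return "LINESTRING (" + ", ".join(f"{x} {y}" for x, y in pairs) + ")"


wkt = linestring_from_text(
    "-74.0428197,40.6867969,-74.0421975,40.6921336,-74.0508020,40.6912794", ","
)
print(wkt)
```

Six values therefore produce a line with three points, which matches the input used in the cell above.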



### 1.3 Polygon example

The example below follows the same steps as the previous examples in this section. This time we use the constructor **ST_GeomFromText()**, which builds a geometry from a WKT string.

In [38]:
county_small_path=f"{data_folder_path}/county_small.tsv"

In [39]:
raw_poly_df = sedona.read.format("csv").option("delimiter", "\t").option("header", "false").load(county_small_path)
raw_poly_df.createOrReplaceTempView("gon_raw_table")
raw_poly_df.show()

+--------------------+---+---+--------+-----+-----------+--------------------+---+---+-----+----+-----+----+----+----------+--------+-----------+------------+
|                 _c0|_c1|_c2|     _c3|  _c4|        _c5|                 _c6|_c7|_c8|  _c9|_c10| _c11|_c12|_c13|      _c14|    _c15|       _c16|        _c17|
+--------------------+---+---+--------+-----+-----------+--------------------+---+---+-----+----+-----+----+----+----------+--------+-----------+------------+
|POLYGON ((-97.019...| 31|039|00835841|31039|     Cuming|       Cuming County| 06| H1|G4020|NULL| NULL|NULL|   A|1477895811|10447360|+41.9158651|-096.7885168|
|POLYGON ((-123.43...| 53|069|01513275|53069|  Wahkiakum|    Wahkiakum County| 06| H1|G4020|NULL| NULL|NULL|   A| 682138871|61658258|+46.2946377|-123.4244583|
|POLYGON ((-104.56...| 35|011|00933054|35011|    De Baca|      De Baca County| 06| H1|G4020|NULL| NULL|NULL|   A|6015539696|29159492|+34.3592729|-104.3686961|
|POLYGON ((-96.910...| 31|109|00835876|31109| 

In [40]:
raw_poly_df.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)
 |-- _c6: string (nullable = true)
 |-- _c7: string (nullable = true)
 |-- _c8: string (nullable = true)
 |-- _c9: string (nullable = true)
 |-- _c10: string (nullable = true)
 |-- _c11: string (nullable = true)
 |-- _c12: string (nullable = true)
 |-- _c13: string (nullable = true)
 |-- _c14: string (nullable = true)
 |-- _c15: string (nullable = true)
 |-- _c16: string (nullable = true)
 |-- _c17: string (nullable = true)



In [41]:
polygon_df=sedona.sql("select ST_GeomFromText(gon_raw_table._c0) as county_shape, gon_raw_table._c6 as county_name from gon_raw_table")
polygon_df.show(5)

+--------------------+----------------+
|        county_shape|     county_name|
+--------------------+----------------+
|POLYGON ((-97.019...|   Cuming County|
|POLYGON ((-123.43...|Wahkiakum County|
|POLYGON ((-104.56...|  De Baca County|
|POLYGON ((-96.910...|Lancaster County|
|POLYGON ((-98.273...| Nuckolls County|
+--------------------+----------------+
only showing top 5 rows



In [42]:
polygon_df.printSchema()

root
 |-- county_shape: geometry (nullable = true)
 |-- county_name: string (nullable = true)



### 1.4 Read WKT and WKB files

Geometries in a `WKT` or `WKB` file always occupy a single column, no matter how many coordinates they have. Sedona provides `WktReader` and `WkbReader` to create a generic SpatialRDD, which we then convert to a dataframe.

> You must use the WKT reader to read a WKT file, and the WKB reader to read a WKB file.
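
To show why the two readers are not interchangeable, the sketch below encodes a point as WKB (well-known binary) and decodes it again using only the Python standard library. WKT is readable text such as `POINT (1 2)`, while WKB is a byte sequence, shown here as a hex string. The helpers `point_to_wkb_hex` and `wkb_hex_to_point` are hypothetical and handle 2D points only; that the WKB file used in this notebook stores hex strings is an assumption.

```python
import struct


def point_to_wkb_hex(x, y):
    """Encode a 2D point as little-endian WKB and return it as a hex string."""
    # 1 byte: byte order (1 = little endian), 4 bytes: geometry type (1 = Point),
    # then x and y as 8-byte doubles
    return struct.pack("<BIdd", 1, 1, x, y).hex()


def wkb_hex_to_point(wkb_hex):
    """Decode a hex WKB string holding a 2D point back into (x, y)."""
    raw = bytes.fromhex(wkb_hex)
    order = "<" if raw[0] == 1 else ">"
    (geom_type,) = struct.unpack(order + "I", raw[1:5])
    if geom_type != 1:
        raise ValueError("this sketch only decodes Point geometries")
    return struct.unpack(order + "dd", raw[5:21])


encoded = point_to_wkb_hex(1.0, 2.0)
print(encoded)
print(wkb_hex_to_point(encoded))
```

A WKT parser handed this hex string would find no geometry keyword to parse, which is the reason each format needs its matching reader.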

In [48]:
polygon_wkb_file_path=f"{data_folder_path}/county_small_wkb.tsv"

In [49]:
from sedona.core.formatMapper import WktReader
from sedona.core.formatMapper import WkbReader

In [66]:
# The WKB string starts from column 0
wkbColumn = 0
allowTopologyInvalidGeometries = True
skipSyntaxInvalidGeometries = False

spatialRdd = WkbReader.readToGeometryRDD(sedona.sparkContext, polygon_wkb_file_path, wkbColumn, allowTopologyInvalidGeometries, skipSyntaxInvalidGeometries)

#WkbReader.readToGeometryRDD(sc, wkb_geometries_location, 0, True, False)

In [69]:
from sedona.utils.adapter import Adapter
county_small_df = Adapter.toDf(spatialRdd,sedona)
county_small_df.createOrReplaceTempView("county_small_table")


In [68]:
county_small_df.show(5)

+--------------------+
|            geometry|
+--------------------+
|POLYGON ((-97.019...|
|POLYGON ((-123.43...|
|POLYGON ((-104.56...|
|POLYGON ((-96.910...|
|POLYGON ((-98.273...|
+--------------------+
only showing top 5 rows



## 3. Read from GeoJSON

Pay attention to the example below: even though Spark can read JSON natively, we still read the file as CSV. As a result, `raw_polygon_json_df` is a dataframe with a single string column, where each row holds one GeoJSON feature.
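
Reading the file as CSV works here because each line is a complete GeoJSON Feature and contains no tab character, so the whole line lands in one column. The sketch below parses one such line with the standard `json` module to show its two parts. The record is hypothetical: the property names `STATEFP` and `COUNTYFP` are taken from the reader output shown later in this notebook, but the coordinates are made up for illustration.

```python
import json

# One hypothetical line of a line-delimited GeoJSON file
line = (
    '{"type": "Feature", '
    '"properties": {"STATEFP": "01", "COUNTYFP": "077"}, '
    '"geometry": {"type": "Polygon", '
    '"coordinates": [[[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]]}}'
)

feature = json.loads(line)
geometry = feature["geometry"]      # the part that becomes the geometry column
properties = feature["properties"]  # the attributes attached to the feature

print(geometry["type"])
print(sorted(properties))
```

In the cells of this section, only the geometry ends up in the resulting dataframe; the feature properties are not extracted into columns.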

In [55]:
poly_json_file_path = f"{data_folder_path}/test_polygon.json"

In [56]:
raw_polygon_json_df = sedona.read.format("csv").\
    option("delimiter", "\t").\
    option("header", "false").\
    load(poly_json_file_path)

raw_polygon_json_df.show()

+--------------------+
|                 _c0|
+--------------------+
|{ "type": "Featur...|
|{ "type": "Featur...|
|{ "type": "Featur...|
|{ "type": "Featur...|
|{ "type": "Featur...|
|{ "type": "Featur...|
|{ "type": "Featur...|
|{ "type": "Featur...|
|{ "type": "Featur...|
|{ "type": "Featur...|
|{ "type": "Featur...|
|{ "type": "Featur...|
|{ "type": "Featur...|
|{ "type": "Featur...|
|{ "type": "Featur...|
|{ "type": "Featur...|
|{ "type": "Featur...|
|{ "type": "Featur...|
|{ "type": "Featur...|
|{ "type": "Featur...|
+--------------------+
only showing top 20 rows



In [57]:
raw_polygon_json_df.printSchema()

root
 |-- _c0: string (nullable = true)



In [58]:
raw_polygon_json_df.createOrReplaceTempView("raw_poly_json_table")
polygon_json_df = sedona.sql("select ST_GeomFromGeoJSON(raw_poly_json_table._c0) as countyshape from raw_poly_json_table")
polygon_json_df.show(5)

+--------------------+
|         countyshape|
+--------------------+
|POLYGON ((-87.621...|
|POLYGON ((-85.719...|
|POLYGON ((-86.000...|
|POLYGON ((-86.574...|
|POLYGON ((-85.382...|
+--------------------+
only showing top 5 rows



In [59]:
polygon_json_df.printSchema()

root
 |-- countyshape: geometry (nullable = false)



### 3.1 Read GeoJSON with the Sedona GeoJsonReader

Sedona also provides a predefined reader, `GeoJsonReader`. Below is a code example. Unlike the previous approach, the resulting dataframe also contains the feature properties as columns.

In [71]:
from sedona.core.formatMapper import GeoJsonReader

In [70]:
geojson_sp_rdd = GeoJsonReader.readToGeometryRDD(sedona.sparkContext, poly_json_file_path)
geojson_df = Adapter.toDf(geojson_sp_rdd,sedona)

In [72]:
geojson_df.show(5)

+--------------------+-------+--------+-------+--------+--------------------+------------+----+----+--------+------+
|            geometry|STATEFP|COUNTYFP|TRACTCE|BLKGRPCE|            AFFGEOID|       GEOID|NAME|LSAD|   ALAND|AWATER|
+--------------------+-------+--------+-------+--------+--------------------+------------+----+----+--------+------+
|POLYGON ((-87.621...|     01|     077| 011501|       5|1500000US01077011...|010770115015|   5|  BG| 6844991| 32636|
|POLYGON ((-85.719...|     01|     045| 021102|       4|1500000US01045021...|010450211024|   4|  BG|11360854|     0|
|POLYGON ((-86.000...|     01|     055| 001300|       3|1500000US01055001...|010550013003|   3|  BG| 1378742|247387|
|POLYGON ((-86.574...|     01|     089| 001700|       2|1500000US01089001...|010890017002|   2|  BG| 1040641|     0|
|POLYGON ((-85.382...|     01|     069| 041400|       1|1500000US01069041...|010690414001|   1|  BG| 8243574|     0|
+--------------------+-------+--------+-------+--------+--------

## 4. Read shapefile

Before Sedona 1.7, reading a shapefile requires creating a spatial RDD first and then converting it to a dataframe.

From 1.7 onward, a shapefile can be read directly as a dataframe (1.7 had not been officially released as of 12/11/2024).
Below is a code example for 1.7:
```python
df = sedona.read.format("shapefile").load("/path/to/shapefile")
```
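
A shapefile is not a single file but a set of files that share a base name, with `.shp`, `.shx`, and `.dbf` being the mandatory parts. The reader in this section is pointed at a folder, so a quick check of the folder contents can save a confusing read error. The sketch below is a minimal pre-flight check in plain Python; `missing_shapefile_parts` is a hypothetical helper and not part of Sedona, and the file names in the demo are made up.

```python
import tempfile
from pathlib import Path

# Mandatory components of a shapefile
REQUIRED_EXTENSIONS = {".shp", ".shx", ".dbf"}


def missing_shapefile_parts(folder):
    """Return the mandatory shapefile extensions not found in the folder."""
    present = {p.suffix.lower() for p in Path(folder).iterdir() if p.is_file()}
    return sorted(REQUIRED_EXTENSIONS - present)


# Demo on a temporary folder that lacks the .shx index file
with tempfile.TemporaryDirectory() as tmp:
    for name in ("airports.shp", "airports.dbf"):
        (Path(tmp) / name).touch()
    print(missing_shapefile_parts(tmp))  # ['.shx']
```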

In [64]:
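from sedona.core.formatMapper.shapefileParser import ShapefileReader
from sedona.utils.adapter import Adapter

# folder that contains the shapefile components
airport_shape=f"{data_folder_path}/airports_shape"

# read the shapefile into a spatial RDD
ap_spatial_rdd= ShapefileReader.readToGeometryRDD(sedona.sparkContext,airport_shape)

# convert the spatial RDD to a DataFrame
airport_shape_df = Adapter.toDf(ap_spatial_rdd,sedona)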


In [65]:
airport_shape_df.show(5)


+--------------------+---------+----------+-----+----------------+------+--------+--------+---------+--------------------+---------+
|            geometry|scalerank|featurecla| type|            name|abbrev|location|gps_code|iata_code|           wikipedia|natlscale|
+--------------------+---------+----------+-----+----------------+------+--------+--------+---------+--------------------+---------+
|POINT (113.935016...|        2|   Airport|major| Hong Kong Int'l|   HKG|terminal|    VHHH|      HKG|http://en.wikiped...|  150.000|
|POINT (121.231370...|        2|   Airport|major|         Taoyuan|   TPE|terminal|    RCTP|      TPE|http://en.wikiped...|  150.000|
|POINT (4.76437693...|        2|   Airport|major|        Schiphol|   AMS|terminal|    EHAM|      AMS|http://en.wikiped...|  150.000|
|POINT (103.986413...|        2|   Airport|major|Singapore Changi|   SIN|terminal|    WSSS|      SIN|http://en.wikiped...|  150.000|
|POINT (-0.4531566...|        2|   Airport|major| London Heathrow|   