# Spatial Data Manipulation: Vector

## 1. Overview of Geopandas

`GeoPandas` is one of the most important Python libraries for working with vector data. It is built on the `pandas` library and depends on `Shapely`, `Fiona`, and `pyproj`. 
* `Shapely` is a Python package for manipulating and analyzing planar features, using functions from the GEOS library (the geometry engine of PostGIS), which is itself a port of the JTS (Java Topology Suite). Shapely only analyzes geometries and offers no capabilities for reading or writing geospatial files. 
* `Fiona` reads and writes geospatial file formats through the GDAL/OGR library.
* `pyproj` is a Python package that performs cartographic transformations and geodetic computations.
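
To make the role of `pyproj` concrete, here is a minimal sketch of a coordinate transformation from WGS 84 (`EPSG:4326`) to `EPSG:5179`, the projected system used later in this notebook. The coordinates for central Seoul are approximate and were chosen only for illustration; they do not come from the datasets used here.

```python
from pyproj import Transformer

# always_xy=True keeps the axis order as (longitude, latitude) -> (x, y)
transformer = Transformer.from_crs("EPSG:4326", "EPSG:5179", always_xy=True)

# Approximate longitude/latitude of central Seoul (illustrative values)
lon, lat = 126.978, 37.5665
x, y = transformer.transform(lon, lat)

print(f"Longitude/latitude: ({lon}, {lat})")
print(f"EPSG:5179 easting/northing in meters: ({x:.1f}, {y:.1f})")
```

GeoPandas calls into `pyproj` for this kind of work whenever a GeoDataFrame is reprojected, so you rarely need to create a `Transformer` yourself.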


In [3]:
import sys
sys.executable

'/opt/homebrew/anaconda3/envs/GIS_Project/bin/python'

In [1]:
import geopandas as gpd  # import geopandas package and set alias as gpd
import pandas as pd

In [None]:
# Import the data with geopandas
# You can read geospatial data such as ESRI shapefile, GeoJSON, or GeoPackage with `gpd.read_file()`. To export a GeoDataFrame, use its `.to_file()` method. 

emd_gdf = gpd.read_file('./data/Seoul_EMD_simplified.geojson')
emd_gdf.head(5)

In [None]:
# Check the Coordinate Reference System (CRS) of the GeoDataFrame
emd_gdf.crs

In [None]:
# The crs attribute is an instance of the pyproj.CRS class
type(emd_gdf.crs)

In [None]:
# For a GeoDataFrame, the .plot() method draws a map of the geometry column
emd_gdf.plot()

In [None]:
# Check the first record of the GeoDataFrame
emd_gdf.loc[0]

In [None]:
# Get the geometry of the first record
emd_gdf.loc[0, 'geometry']

In [None]:
# The geometry is a shapely object
type(emd_gdf.loc[0, 'geometry'])

In [None]:
# .wkt returns a string representation of the geometry in the Well-Known Text (WKT) format
emd_gdf.loc[0, 'geometry'].wkt
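
WKT is a two-way format: a geometry can be written to text and parsed back. The sketch below shows the round trip with `shapely.wkt.loads()`, using a small hypothetical square rather than the dong geometry.

```python
from shapely import wkt
from shapely.geometry import Polygon

# A small square used only for illustration
square = Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])

# Geometry -> text
text = square.wkt
print(text)

# Text -> geometry
restored = wkt.loads(text)
print(type(restored))

# The restored geometry is spatially equal to the original
print(restored.equals(square))
```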

## 2. Data Creation
### 2.1. Creating Vector data with `Shapely`

`Shapely` has the following classes to represent geometry.

| Geometry Type | Class |
| :-: | :-: |
| Point | shapely.geometry.Point() | 
| Line | shapely.geometry.LineString() <br> shapely.geometry.polygon.LinearRing() | 
| Polygon | shapely.geometry.Polygon() | 
| Collection of points | shapely.geometry.MultiPoint() | 
| Collection of lines | shapely.geometry.MultiLineString() | 
| Collection of polygons | shapely.geometry.MultiPolygon() | 

In [None]:
# Import shapely objects
from shapely.geometry import Point, LineString, Polygon, MultiPoint, MultiLineString, MultiPolygon

In [None]:
# creating a point
pnt = Point(2.0, 2.0)  # x, y coordinates of a point
print(pnt.wkt)
print(type(pnt))
pnt

In [None]:
# creating a line
line = LineString([(0, 0), (3,5), (8, 6), (10,10)])  # x, y coordinates of sequences of points
print(line.wkt)
print(type(line))
line

In [None]:
# creating a polygon
pyg = Polygon(((0, 0), (5, 0), (5, 7), (0, 9))) # List the vertices in order along the boundary
print(pyg.wkt) # Shapely closes the ring automatically, so the WKT repeats the first point at the end
print(type(pyg))
pyg

In [None]:
# The order of points matters. The example below shows what happens when the order is scrambled: the boundary crosses itself. 
pyg2 = Polygon(((0, 0), (5, 7), (5, 0), (0, 9)))
pyg2
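
A scrambled vertex order can also be detected programmatically. The following sketch, using the same two coordinate lists as above, checks the `is_valid` property and asks `shapely.validation.explain_validity()` for the reason.

```python
from shapely.geometry import Polygon
from shapely.validation import explain_validity

# Vertices listed in order along the boundary
ordered = Polygon([(0, 0), (5, 0), (5, 7), (0, 9)])

# Same vertices, but the second and third are swapped
scrambled = Polygon([(0, 0), (5, 7), (5, 0), (0, 9)])

print(ordered.is_valid)    # the ring does not cross itself
print(scrambled.is_valid)  # the ring crosses itself

# A human-readable explanation of why the geometry is invalid
print(explain_validity(scrambled))
```

Invalid geometries can give misleading areas and unreliable results in spatial predicates, so checking `is_valid` after building polygons by hand is a cheap safeguard.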

In [None]:
# a collection of points
pnts = MultiPoint([(0.0, 0.0), (3.0, 3.0)])
print(pnts.wkt)
print(type(pnts))
pnts

###  2.2. Converting DataFrame to GeoDataFrame

In [None]:
import pandas as pd

# Create a DataFrame with the capitals of some South American countries and their coordinates. 
capitals = pd.DataFrame(
    {'City': ['Buenos Aires', 'Brasilia', 'Santiago', 'Bogota', 'Caracas'],
     'Country': ['Argentina', 'Brazil', 'Chile', 'Colombia', 'Venezuela'],
     'Latitude': [-34.58, -15.78, -33.45, 4.60, 10.48],
     'Longitude': [-58.66, -47.91, -70.66, -74.08, -66.86]})

capitals

In [None]:
# A geometry column can be created from coordinate columns
# gpd.points_from_xy() builds point geometries from x (longitude) and y (latitude) values

capitals_gdf = gpd.GeoDataFrame(capitals, 
                                # a function to create points based on given coordinates
                                geometry=gpd.points_from_xy(capitals.Longitude, capitals.Latitude) 
                               )
capitals_gdf

`capitals_gdf` was just created from a plain DataFrame, so it has no CRS yet (`capitals_gdf.crs` is `None`). It can still be plotted, because `.plot()` only needs the raw coordinates.

In [None]:
print(capitals_gdf.crs)
capitals_gdf.plot()

In [None]:
# .explore() displays the data on an interactive map
# Without a CRS, however, the points cannot be placed on a basemap
capitals_gdf.explore()

In [None]:
# .set_crs() method is used to set the CRS of the GeoDataFrame
capitals_gdf = capitals_gdf.set_crs(epsg=4326)

capitals_gdf.crs

In [None]:
# With the correct crs, the .explore() method works
capitals_gdf.explore()
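
Note that `.set_crs()` only declares which CRS the existing coordinates are in; it does not change them. To convert coordinates into another CRS, use `.to_crs()`. The sketch below contrasts the two on a single point; `EPSG:3857` (Web Mercator) is chosen here only as an example target.

```python
import geopandas as gpd
from shapely.geometry import Point

# One point given as longitude/latitude, with no CRS attached yet
gdf = gpd.GeoDataFrame({'City': ['Bogota']}, geometry=[Point(-74.08, 4.60)])
print(gdf.crs)  # None

# set_crs() attaches a label; the coordinates stay the same
labelled = gdf.set_crs(epsg=4326)
print(labelled.geometry.iloc[0])

# to_crs() reprojects; the coordinates are now in meters
projected = labelled.to_crs(epsg=3857)
print(projected.geometry.iloc[0])
```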

## 3. Mockup Analysis

We want to calculate, for each dong (an administrative neighborhood unit), the average of the maximum temperatures (`TempMax`) recorded by its sensors. The data and steps for our mockup analysis are as follows.

* Data: 
    - Sensor locations of S-DoT: './data/S_DoT_locations.xlsx'
    - Temperature data: './data/SDoT_Seoul_20240804.csv'
    - Dong Geometry of Seoul: './data/Seoul_EMD_simplified.geojson'
* Steps:
    - Load sensor location data (`sensor_df`) using Pandas and convert it to GeoDataFrame (`sensor_gdf`).
    - Load temperature data (`temp_df`) using Pandas and join with the `sensor_gdf`.
    - Find the associated dong for each sensor location and calculate the average maximum temperature of each dong.

### 3.1. Load sensor location data

In [None]:
# Import sensor location information from an Excel file
# pandas.read_excel() is used to read an Excel file while pandas.read_csv() is used to read a CSV file
sensor_df = pd.read_excel('./data/S_DoT_locations.xlsx')
sensor_df

---
### *Exercise*
1. Investigate the syntax below and create a GeoDataFrame from `sensor_df` with the following steps:
* Create a GeoDataFrame from `sensor_df` with the geometry column named `geometry`.
* Set the crs of the GeoDataFrame to WGS 84 (`EPSG:4326`).
* Save the GeoDataFrame into `sensor_gdf`.

```python
    sensor_gdf = gpd.GeoDataFrame(`INPUT DATAFRAME`, 
                                  geometry=gpd.points_from_xy(`LONGITUDE COLUMN OF A DATAFRAME`,
                                                              `LATITUDE COLUMN OF A DATAFRAME`), 
                                  crs=`EPSG:EPSG_CODE` # WGS 84
                                  )
```
---



In [None]:
# Your code here
sensor_gdf = 
sensor_gdf

In [None]:
""" Test code for the previous cell. This cell should NOT give any errors when it is run."""

# Check your result here. 
assert isinstance(sensor_gdf, gpd.GeoDataFrame)
assert sensor_gdf.crs == 'EPSG:4326'
assert round(sensor_gdf.loc[0, 'geometry'].x, 4) == 127.0753

print("Success!")

In [None]:
# Plot the sensor locations
sensor_gdf.plot()

### 3.2. Join (Merge) DataFrame

`DataFrame.merge()` combines a DataFrame with another DataFrame or a named Series using a database-style join. The join is done on columns or indexes.

```python
joined_gdf = df_a.merge(right=`df_b`,
                        how='inner', # {‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’}, default ‘inner’
                        on=`column name` # Column or index level names to join on. If the columns have different names, specify `left_on` and `right_on`.
                        )
```


Source: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
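
The effect of the `how` parameter is easiest to see on a small table. The sketch below uses two hypothetical DataFrames (the serial numbers and temperatures are made up for illustration) and compares an inner join with a left join.

```python
import pandas as pd

# Hypothetical sensor list: A3 has no reading
sensors = pd.DataFrame({'Serial_Num': ['A1', 'A2', 'A3'],
                        'Dong': ['X', 'X', 'Y']})

# Hypothetical readings: B9 is not in the sensor list
readings = pd.DataFrame({'Serial_Num': ['A1', 'A2', 'B9'],
                         'TempMax': [33.1, 34.5, 31.0]})

# Inner join keeps only keys present in both tables
inner = sensors.merge(readings, on='Serial_Num', how='inner')
print(inner)

# Left join keeps every row of the left table; missing values become NaN
left = sensors.merge(readings, on='Serial_Num', how='left')
print(left)
```

With `how='inner'`, sensors without a matching reading are dropped without warning, so it is worth comparing the row counts before and after a merge.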

In [None]:
# Load temperature data
# Source: https://data.seoul.go.kr/dataList/OA-15969/S/1/datasetView.do
temp_df = pd.read_csv('./data/SDoT_Seoul_20240804.csv')
temp_df

In [None]:
# Get the statistics of the 'TempMax' column
temp_df['TempMax'].describe()

In [None]:
# Merge the sensor location data with the temperature data
sensor_data = sensor_gdf.merge(temp_df, on='Serial_Num', how='inner')
sensor_data

In [None]:
# Plot the temperature data
sensor_data.plot('TempMax', legend=True, cmap='coolwarm', markersize=10, figsize=(10, 10))

---
### *Exercise*
1. Investigate the syntax below and change the CRS of both `sensor_data` and `emd_gdf` to `EPSG:5179`. Currently, the two GeoDataFrames have different CRSs. 

```python
    sensor_proj = sensor_data.to_crs(`EPSG:EPSG_CODE`)
    emd_proj = emd_gdf.to_crs(`EPSG:EPSG_CODE`)
``` 

---

In [None]:
# Your code here
sensor_proj = 
emd_proj = 


In [None]:
""" Test code for the previous cell. This cell should NOT give any errors when it is run."""

# Check your result here. 
assert sensor_proj.crs == 'EPSG:5179'
assert emd_proj.crs == 'EPSG:5179'

print("Success!")

### 3.3. Find the associated dong for each sensor location

We want to count the number of sensors in each dong and calculate the average maximum temperature of each dong. 

In [None]:
## The entire code

# .iterrows is a generator that iterates over the rows of the DataFrame
# It returns an index (idx) and a Series for each row (row)
for idx, row in emd_proj.iterrows():

    # Find the sensors that are located in the current dong
    sensor_dong = sensor_proj.loc[sensor_proj['geometry'].intersects(row['geometry'])]
    
    # If there are sensors in the dong..
    if sensor_dong.shape[0] > 0:

        # Get the number of sensors in the dong
        emd_proj.at[idx, 'Sensor_Count'] = sensor_dong.shape[0]

        # Get the average of the 'TempMax' column in the dong
        emd_proj.at[idx, 'TempMax_Avg'] = sensor_dong['TempMax'].mean()

# Check the results
emd_proj

In [None]:
for idx, row in emd_proj.head(3).iterrows():
    print(f"Index: {idx}")
    print(row)
    print("--------------------")

In [None]:
# After the loop finishes, `row` still holds the last row visited; this is its geometry
row['geometry']

In [None]:
# Check the type of the geometry -> shapely object
type(row['geometry'])

In [None]:
# Check the geometry of sensor if it intersects with the current geometry
sensor_proj['geometry'].intersects(row['geometry'])

In [None]:
# When the Boolean mask is passed to .loc[], it returns only the rows where the mask is True
sensor_proj.loc[sensor_proj['geometry'].intersects(row['geometry'])]

Several methods check the spatial relationship between geometries:
* .contains() returns True if the geometry contains the other geometry
* .within() returns True if the geometry is within the other geometry
* .intersects() returns True if the geometry intersects the other geometry
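
The three predicates differ mainly in direction and in how they treat the boundary. The sketch below uses a hypothetical square "dong" and three points (inside, on the edge, outside) to show the differences.

```python
from shapely.geometry import Point, Polygon

# A hypothetical square standing in for a dong
dong = Polygon([(0, 0), (4, 0), (4, 4), (0, 4)])

inside = Point(1, 1)    # in the interior
on_edge = Point(4, 2)   # exactly on the boundary
outside = Point(6, 6)   # far away

for name, pt in [('inside', inside), ('on_edge', on_edge), ('outside', outside)]:
    print(name,
          '| dong.contains(pt):', dong.contains(pt),
          '| pt.within(dong):', pt.within(dong),
          '| dong.intersects(pt):', dong.intersects(pt))
```

`contains` and `within` are mirror images of each other, and neither counts a point that lies exactly on the boundary, whereas `intersects` does. A sensor sitting on a shared boundary would therefore be counted in both neighboring dongs when `intersects` is used.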

In [None]:
# Returns the sensors that fall within the dong
sensor_proj.loc[sensor_proj['geometry'].within(row['geometry'])]

In [None]:
# Returns an empty GeoDataFrame, because a point cannot contain a polygon
sensor_proj.loc[sensor_proj['geometry'].contains(row['geometry'])]

Now revisit the original code.

In [None]:
## The entire code

# .iterrows is a generator that iterates over the rows of the DataFrame
# It returns an index (idx) and a Series for each row (row)
for idx, row in emd_proj.iterrows():

    # Find the sensors that are located in the current dong
    sensor_dong = sensor_proj.loc[sensor_proj['geometry'].intersects(row['geometry'])]
    
    # If there are sensors in the dong..
    if sensor_dong.shape[0] > 0:

        # Get the number of sensors in the dong
        emd_proj.at[idx, 'Sensor_Count'] = sensor_dong.shape[0]

        # Get the average of the 'TempMax' column in the dong
        emd_proj.at[idx, 'TempMax_Avg'] = sensor_dong['TempMax'].mean()

# Check the results
emd_proj

In [None]:
# If there is no sensor in the dong, the 'TempMax_Avg' column is NaN
emd_proj.loc[emd_proj['TempMax_Avg'].isna()]

In [None]:
# Plot the result
emd_proj.plot('TempMax_Avg', 
              legend=True, 
              scheme='NaturalBreaks',
              cmap='Reds', 
              figsize=(10, 8), 
              missing_kwds={'color': 'grey'}
              )

### 3.4. Alternative approach (.sjoin())

The `sjoin()` function in `geopandas` is a spatial join function that allows you to join two GeoDataFrames based on their spatial relationship. This could be more convenient than the previous approach.

```python
    gpd.sjoin(left_df, 
              right_df, 
              how='inner', # This can be 'left', 'right', or 'inner'
              predicate='intersects' # This can be contains, within, etc. 
              )
```
Source: https://geopandas.org/en/stable/docs/reference/api/geopandas.sjoin.html

In [None]:
# Create a fresh reprojected copy of the dong geometries for the spatial join
emd_sjoin = emd_gdf.to_crs(epsg=5179) 
emd_sjoin

In [None]:
# Conduct the spatial join
emd_sjoin_result = gpd.sjoin(emd_sjoin, sensor_proj, how='left', predicate='contains')  
emd_sjoin_result

In [None]:
# Calculate the average of the 'TempMax' column and the count of sensors in each dong
emd_sjoin_clean = emd_sjoin_result.groupby(['ADM_CD', 'ADM_NM']).agg({'TempMax': 'mean', 
                                                                      'Serial_Num': 'count'}
                                                                      ).reset_index()
emd_sjoin_clean
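
To see what the spatial join and the aggregation produce without the full dataset, the sketch below repeats the same two steps on toy data. The three box-shaped "dongs", the sensor names, and the temperatures are all hypothetical; only the column names mirror the ones used above.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Three hypothetical dongs laid side by side
dongs = gpd.GeoDataFrame(
    {'ADM_NM': ['A', 'B', 'C']},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1), box(2, 0, 3, 1)],
    crs='EPSG:5179')

# Three hypothetical sensors: two in A, one in B, none in C
sensors = gpd.GeoDataFrame(
    {'Serial_Num': ['s1', 's2', 's3'], 'TempMax': [30.0, 32.0, 35.0]},
    geometry=[Point(0.5, 0.5), Point(0.6, 0.4), Point(1.5, 0.5)],
    crs='EPSG:5179')

# Left join keeps every dong, even those without sensors
joined = gpd.sjoin(dongs, sensors, how='left', predicate='contains')

# One row per dong: average TempMax and number of sensors
summary = joined.groupby('ADM_NM').agg(
    TempMax_Avg=('TempMax', 'mean'),
    Sensor_Count=('Serial_Num', 'count')).reset_index()
print(summary)
```

In this sketch a dong without sensors gets a `Sensor_Count` of 0, because `count` ignores missing values, while its `TempMax_Avg` is NaN. In the loop-based approach, such a dong has NaN in both columns.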

In [None]:
# Compare with the previous result
emd_proj

# Done