In [None]:
%%HTML

<style>
td {
  font-size: 12px
}
th{
  font-size: 12px  
}
</style>

# Spatial Data Manipulation: Vector

## 1. Overview of Geopandas

`GeoPandas` is one of the most important Python libraries for working with vector data. It is built on the `pandas` library and depends on `Shapely`, `Fiona`, and `pyproj`. 
* `Shapely` is a Python package for manipulation and analysis of planar features, using functions from the GEOS library (the geometry engine of PostGIS), which is itself a port of the JTS (Java Topology Suite). Shapely only deals with analyzing geometries and offers no capabilities for reading and writing geospatial files. 
* `Fiona` is a Python API (Application Programming Interface) for OGR (the name once stood for 'OpenGIS Simple Features Reference Implementation', but it is now only historical). It is used for reading and writing geospatial data formats. 
* `pyproj` is a Python package that performs cartographic transformations and geodetic computations.

### 1.1. Importing the package and reading data
You can read geospatial data, such as ESRI shapefiles, GeoJSON, and GeoPackage files, with `gpd.read_file()`. To export geospatial data, call the `to_file()` method on the GeoDataFrame. 

In [None]:
import geopandas as gpd  # import geopandas package and set alias as gpd
import pandas as pd

states = gpd.read_file('./data/states.json')
print(type(states))
states

### 1.2. Inheritance of Pandas DataFrame

`GeoPandas` works the same way as `Pandas`. `GeoDataFrame` is a subclass of `DataFrame`, so you can use most `DataFrame` functions on a `GeoDataFrame`. 
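
To make the inheritance concrete, here is a small sketch on made-up data (the `name` and `value` columns are hypothetical). It checks that a `GeoDataFrame` is also a `DataFrame` and that an ordinary pandas method such as `sort_values()` still returns a `GeoDataFrame`.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point

# Hypothetical toy data with one attribute column and point geometries
toy = gpd.GeoDataFrame(
    {'name': ['a', 'b', 'c'], 'value': [3, 1, 2]},
    geometry=[Point(0, 0), Point(1, 1), Point(2, 2)],
)

print(isinstance(toy, pd.DataFrame))  # a GeoDataFrame is also a DataFrame

# A regular pandas method works and keeps the GeoDataFrame type
ordered = toy.sort_values('value')
print(type(ordered))
print(ordered['name'].to_list())
```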

In [None]:
states = states.set_index('postal')  # Define 'postal' column as the index of GeoDataFrame
print(states.shape)  # Return the size of GeoDataFrame. In our case, 51 rows and 6 columns
states.head()

In [None]:
states['name'].to_list()  # Convert the 'name' column (a pandas Series) to a list

In [None]:
# Get the FIPS (Federal Information Processing Standards) code of Illinois
states.at['IL', 'fips'] 

In [None]:
# Call regions of Alabama, Illinois, and Texas
states.loc[['AL', 'IL', 'TX'], 'region']  

In [None]:
# Select the states in the Midwest region with a conditional statement
states.loc[states['region'] == 'Midwest']

### 1.3. `plot()`: Major difference between GeoDataFrame and DataFrame

Calling `plot()` on a Pandas DataFrame produces a chart of its numerical values. However, `plot()` on a GeoPandas GeoDataFrame draws a map. 

In [None]:
# Example of Pandas DataFrame
df = pd.read_csv('./data/daily_case.csv')
df = df.set_index('County')
df.transpose().plot()

In [None]:
# Example of GeoPandas GeoDataFrame
# You can specify its color with the 'color' argument. 
states.plot(color='black')

Because `geopandas` specializes in geospatial analysis, it also stores the coordinate system as an attribute. You can check the `Coordinate Reference System (CRS)` of the dataset with the `.crs` attribute. 

In [None]:
print(states.crs) # coordinate system of imported dataset, here epsg:4326 indicates WGS 1984. 
print(type(states.crs)) # the coordinate system information is stored with `pyproj` package. 

You can change the CRS with the `to_crs()` method. Simply pass the EPSG code to the `epsg` argument.
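
As a small, self-contained illustration of `to_crs()`, the sketch below reprojects a single made-up point (roughly Champaign, Illinois; the coordinates are an assumption for demonstration) from WGS84 to EPSG:5070. The original object is left unchanged because `to_crs()` returns a new one.

```python
import geopandas as gpd
from shapely.geometry import Point

# Hypothetical point given as (longitude, latitude) in WGS84
pts = gpd.GeoSeries([Point(-88.24, 40.11)], crs='EPSG:4326')

# to_crs() returns a reprojected copy; pts itself keeps EPSG:4326
projected = pts.to_crs(epsg=5070)

print(pts.crs.to_epsg(), projected.crs.to_epsg())
print(pts.iloc[0].wkt)        # coordinates in degrees
print(projected.iloc[0].wkt)  # coordinates in meters
```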

In [None]:
albers = states.to_crs(epsg=5070)  # project from WGS84 to USA Contiguous Albers Equal Area Conic (EPSG: 5070). 
print(albers.crs)
albers.plot(color='black')

You may have noticed that a `GeoDataFrame` has one more column than a normal `DataFrame`: the `geometry` column. `GeoPandas` uses this column to store geospatial data as `Shapely` objects. This is why we can draw maps with `GeoPandas`.
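
The sketch below, on hypothetical toy data, shows how the types differ: an attribute column is a regular pandas `Series`, the `geometry` column is a `GeoSeries`, and each of its elements is a `Shapely` object.

```python
import geopandas as gpd
from shapely.geometry import Point

# Hypothetical toy data: one attribute column plus point geometries
toy = gpd.GeoDataFrame({'name': ['a', 'b']}, geometry=[Point(0, 0), Point(1, 2)])

print(type(toy['name']))           # regular pandas Series
print(type(toy['geometry']))       # GeoSeries
print(type(toy.geometry.iloc[0]))  # Shapely Point
print(toy.geometry.name)           # name of the active geometry column
```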

In [None]:
states.head()

## 2. Data Creation
### 2.1. Creating Vector data with `Shapely`

`Shapely` has the following classes to represent geometry.

| Geometry Type | Class |
| :-: | :-: |
| Point | shapely.geometry.Point() | 
| Line | shapely.geometry.LineString() <br> shapely.geometry.polygon.LinearRing() | 
| Polygon | shapely.geometry.Polygon() | 
| Collection of points | shapely.geometry.MultiPoint() | 
| Collection of lines | shapely.geometry.MultiLineString() | 
| Collection of polygons | shapely.geometry.MultiPolygon() | 
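
The last two rows of the table can be sketched as follows. The coordinates are arbitrary and chosen only for illustration; the individual parts of a collection are available through `.geoms`.

```python
from shapely.geometry import MultiLineString, MultiPolygon, Polygon

# A collection of two lines, each given as a sequence of (x, y) points
lines = MultiLineString([[(0, 0), (1, 1)], [(2, 2), (3, 3)]])

# A collection of two unit squares
squares = MultiPolygon([
    Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
    Polygon([(2, 2), (3, 2), (3, 3), (2, 3)]),
])

print(lines.geom_type, len(lines.geoms))
print(squares.geom_type, len(squares.geoms))
print(squares.area)  # sum of the areas of both squares
```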

In [None]:
from shapely.geometry import Point, LineString, Polygon, MultiPoint, MultiLineString, MultiPolygon

In [None]:
# creating a point
pnt = Point(2.0, 2.0)  # x, y coordinates of a point
print(pnt.wkt)
print(type(pnt))
pnt

In [None]:
# creating a line
line = LineString([(0, 0), (3,5), (8, 6), (10,10)])  # x, y coordinates of sequences of points
print(line.wkt)
print(type(line))
line

In [None]:
# creating a ring
from shapely.geometry.polygon import LinearRing
ring = LinearRing([(0,0), (3,3), (5,8), (3,0)])  # The purpose of this class is to create a boundary of a polygon
print(ring.wkt)
print(type(ring))
ring

In [None]:
# creating a polygon
pyg = Polygon(((0, 0), (5, 0), (5, 7), (0, 9))) # Vertices must be listed in order along the boundary. 
print(pyg.wkt)
print(type(pyg))
pyg

In [None]:
# The order of points matters. The example below shows what happens when the points are out of order. 
pyg2 = Polygon(((0, 0), (5, 7), (5, 0), (0, 9)))
pyg2
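
One way to detect this problem in code, rather than by eye, is Shapely's `is_valid` property. The sketch below rebuilds the two polygons from the cells above under new names (`ordered` and `crossed` are names introduced here) and checks both.

```python
from shapely.geometry import Polygon

# Same vertices as the two cells above, in correct and in scrambled order
ordered = Polygon(((0, 0), (5, 0), (5, 7), (0, 9)))
crossed = Polygon(((0, 0), (5, 7), (5, 0), (0, 9)))

print(ordered.is_valid)  # the boundary does not cross itself
print(crossed.is_valid)  # the boundary crosses itself
print(ordered.area)
```

A self-intersecting polygon can still be drawn, but area and overlay results computed from it should not be trusted.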

In [None]:
# a collection of points
pnts = MultiPoint([(0.0, 0.0), (3.0, 3.0)])
print(pnts.wkt)
print(type(pnts))
pnts

In [None]:
# how to access a single point in a collection of points
pnt1 = pnts.geoms[0]
print(pnt1.wkt)
print(type(pnt1))
pnt1

###  2.2. Converting DataFrame to GeoDataFrame

In [None]:
import pandas as pd

# Create a DataFrame with the capitals of some South American countries and their coordinates. 
capitals = pd.DataFrame(
    {'City': ['Buenos Aires', 'Brasilia', 'Santiago', 'Bogota', 'Caracas'],
     'Country': ['Argentina', 'Brazil', 'Chile', 'Colombia', 'Venezuela'],
     'Latitude': [-34.58, -15.78, -33.45, 4.60, 10.48],
     'Longitude': [-58.66, -47.91, -70.66, -74.08, -66.86]})

capitals

In [None]:
capitals_gdf = gpd.GeoDataFrame(capitals, 
                                # a function to create points based on given coordinates
                                geometry=gpd.points_from_xy(capitals.Longitude, capitals.Latitude) 
                               )
capitals_gdf

`capitals_gdf` was just created from a DataFrame, so it does not have a CRS. However, it can still be plotted.
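
The sketch below reproduces this situation on two made-up cities (the names and coordinates are hypothetical). It assumes a GeoPandas release that provides `set_crs()`; older releases need the attribute assignment used later in this section.

```python
import geopandas as gpd
import pandas as pd

# Hypothetical table with longitude/latitude columns
df = pd.DataFrame({
    'City': ['City A', 'City B'],
    'Latitude': [-34.58, 4.60],
    'Longitude': [-58.66, -74.08],
})

# points_from_xy takes x (longitude) first, then y (latitude)
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.Longitude, df.Latitude))
print(gdf.crs)  # no CRS is attached yet

# Declare (not reproject) the CRS; assumes set_crs() is available
gdf = gdf.set_crs(epsg=4326)
print(gdf.crs.to_epsg())
```

Note that `set_crs()` only labels the existing coordinates, whereas `to_crs()` transforms them.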

In [None]:
print(capitals_gdf.crs)
capitals_gdf.plot()

In [None]:
'''
# GeoPandas 0.10.2 (the most up-to-date version) provides the function `set_crs()`, which can be used as shown below. 
capitals_gdf = capitals_gdf.set_crs(epsg=4326)

However, we set the CRS another way here, because CyberGISX currently runs GeoPandas 0.7.0. 
'''
import pyproj

capitals_gdf.crs = pyproj.CRS.from_user_input('epsg:4326')
capitals_gdf.crs

In [None]:
# Plotting the boundaries of countries in South America as a background
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres')) 
ax = world[world.continent == 'South America'].plot(
    color='white', edgecolor='black')

# Overlaying the GeoDataFrame (capitals_gdf) created from a DataFrame
capitals_gdf.plot(ax=ax, color='red')

## 3. Functions of GeoPandas
### 3.1. `.cx[ ]`: Coordinate based indexing

In [None]:
# A helper function to illustrate coordinate-based indexing

def create_bbox(lower_left, upper_right):  
    '''Return a bounding box using two coordinates (lower left corner and upper right corner)
    
    Input : lower_left - lower left corner of a bounding box (x_coordinate, y_coordinate)
            upper_right - upper right corner of a bounding box (x_coordinate, y_coordinate)
    
    Output: GeoDataFrame only with a bounding box geometry
    
    '''
    ll = lower_left  # lower left
    lr = (upper_right[0], lower_left[1])  # lower right
    ur = upper_right # upper right
    ul = (lower_left[0], upper_right[1]) # upper left
    
    bbox = Polygon((ll, lr, ur, ul))
    bbox_gdf = gpd.GeoDataFrame(geometry=[bbox])
    
    return bbox_gdf
   

In [None]:
x_min = -124
x_max = -100
y_min = 30
y_max = 40

# Coordinate based indexer (gpd.GeoDataFrame.cx)
exp = states.cx[x_min:x_max, y_min: y_max]  # .cx[xmin:xmax, ymin:ymax]
ax = exp.plot(cmap='Set1')
# create_bbox((x_min, y_min), (x_max, y_max)).boundary.plot(ax=ax, color='black')  # uncomment this to see a bounding box
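
Note that `.cx` returns every geometry that intersects the bounding box, not only those fully inside it. The sketch below illustrates this with three made-up squares (the names and coordinates are hypothetical).

```python
import geopandas as gpd
from shapely.geometry import box

# Hypothetical squares: 'a' and 'b' partly overlap the query window, 'c' lies outside
squares = gpd.GeoDataFrame(
    {'name': ['a', 'b', 'c']},
    geometry=[box(0, 0, 2, 2), box(5, 5, 7, 7), box(9, 9, 12, 12)],
)

# Query window: x from 1 to 6, y from 1 to 6
selected = squares.cx[1:6, 1:6]
print(selected['name'].to_list())
```

This is why states that only partly fall within the window still appear in the map above.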

---
### *Exercise*
1. Find the states that lie at least partly below a latitude of 30 degrees.
2. Count the number of states and save the number as `count_states`. 
3. Save the names of states as `name_states`.
---

In [None]:
# Your code here



In [None]:
""" Test code for the previous function. This cell should NOT give any errors when it is run."""

assert count_states == 4
assert name_states == ['Florida', 'Hawaii', 'Louisiana', 'Texas']

print('Success!')

### 3.2. Geometrical methods inherited from `Shapely`

`Shapely` provides various geometrical methods, such as calculating the area or perimeter of a geometry. 

In [None]:
# A polygon we created earlier
pyg

In [None]:
# apply geometrical methods with Shapely
print(pyg.area)  # area
print(pyg.bounds)  # bounding box
print(pyg.length)  # perimeter
print(pyg.geom_type)  # geometry type

In [None]:
# Use case: Example of real dataset
albers.loc['IL', 'geometry']

In [None]:
# Check the CRS to confirm that the unit of the Albers projection is the meter
albers.crs

In [None]:
print(albers.loc['IL', 'geometry'].area)  # in square meters
print(albers.loc['IL', 'geometry'].bounds)
print(albers.loc['IL', 'geometry'].length)  # in meters
print(albers.loc['IL', 'geometry'].geom_type)

In [None]:
print(albers.loc['HI', 'geometry'].area)  # in square meters
print(albers.loc['HI', 'geometry'].bounds)
print(albers.loc['HI', 'geometry'].length)  # in meters
print(albers.loc['HI', 'geometry'].geom_type)

In [None]:
# The following example shows the impact of the projection (CRS) on the calculated area and perimeter of a shape. 
merc = albers.to_crs(epsg=3857)  # Change projection to Web Mercator (epsg:3857)
merc.plot(color='black')

In [None]:
# Shape of Illinois with Web Mercator projection (a conformal projection that does not preserve area)
print(merc.loc['IL', 'geometry'].area / 2.59e+6, 'SqMi') # Unit: Square Mile
merc.loc['IL', 'geometry']

In [None]:
# Shape of Illinois with Albers Equal Area Conic (i.e., an equal-area projection)
print(albers.loc['IL', 'geometry'].area / 2.59e+6, 'SqMi') # Unit: Square Mile
albers.loc['IL', 'geometry']
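
The size of this effect can be sketched without the course data. The example below measures a made-up 1-degree by 1-degree cell at about 40 degrees north (the cell is an assumption chosen for illustration) in both projections and reports the ratio of the two areas.

```python
import geopandas as gpd
from shapely.geometry import box

# Hypothetical 1 x 1 degree cell in the Midwest, defined in WGS84
cell = gpd.GeoSeries([box(-89, 40, -88, 41)], crs='EPSG:4326')

# Area in square kilometers under an equal-area and a Web Mercator projection
albers_km2 = cell.to_crs(epsg=5070).area.iloc[0] / 1e6
merc_km2 = cell.to_crs(epsg=3857).area.iloc[0] / 1e6

ratio = merc_km2 / albers_km2
print(round(albers_km2), round(merc_km2), round(ratio, 2))
```

At this latitude Web Mercator inflates the area considerably, and the inflation grows toward the poles, so area calculations should use an equal-area projection.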

### 3.3. Overlay

You can perform a spatial overlay between two GeoDataFrames, as shown below. Currently, it only supports GeoDataFrames with uniform geometry types, i.e. containing only (Multi)Polygons, or only (Multi)Points, or a combination of (Multi)LineString and LinearRing shapes.
<br><br>
source: https://geopandas.org/en/stable/docs/reference/api/geopandas.overlay.html <br>
source: https://geopandas.org/en/stable/docs/user_guide/set_operations.html

In [None]:
# Create two GeoDataFrames, each with two square polygons 
pyg_A = gpd.GeoSeries([Polygon([(0,0), (2,0), (2,2), (0,2)]),
                       Polygon([(2,2), (4,2), (4,4), (2,4)])])
pyg_B = gpd.GeoSeries([Polygon([(1,1), (3,1), (3,3), (1,3)]),
                       Polygon([(3,3), (5,3), (5,5), (3,5)])])

gdf1 = gpd.GeoDataFrame({'geometry': pyg_A, 'df1_data':[1,2]})
gdf2 = gpd.GeoDataFrame({'geometry': pyg_B, 'df2_data':[1,2]})

In [None]:
gdf1.plot(color='blue')

In [None]:
gdf2.plot(color='red')

In [None]:
# 'union' returns all those possible geometries.
ax = gpd.overlay(gdf1, gdf2, how='union').plot(cmap='tab10')
gdf1.boundary.plot(color='blue', ax=ax)
gdf2.boundary.plot(color='red', ax=ax)

In [None]:
# 'intersection' returns only those geometries that are contained by both GeoDataFrames.
ax = gpd.overlay(gdf1, gdf2, how='intersection').plot(cmap='tab10')
gdf1.boundary.plot(color='blue', ax=ax)
gdf2.boundary.plot(color='red', ax=ax)

In [None]:
# 'symmetric_difference' is the opposite of 'intersection' and 
# returns the geometries that are only part of one of the GeoDataFrames but not of both.
ax = gpd.overlay(gdf1, gdf2, how='symmetric_difference').plot(cmap='tab10')
gdf1.boundary.plot(color='blue', ax=ax)
gdf2.boundary.plot(color='red', ax=ax)

In [None]:
# 'difference' returns the geometries that are part of gdf1 but are not contained in gdf2.
ax = gpd.overlay(gdf1, gdf2, how='difference').plot(cmap='tab10')
gdf1.boundary.plot(color='blue', ax=ax)
gdf2.boundary.plot(color='red', ax=ax)

In [None]:
# 'identity' returns the surface of gdf1 but they are divided based on the overlay from gdf2.
ax = gpd.overlay(gdf1, gdf2, how='identity').plot(cmap='tab10')
gdf1.boundary.plot(color='blue', ax=ax)
gdf2.boundary.plot(color='red', ax=ax)
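
To compare the five `how` options numerically rather than visually, the sketch below overlays two made-up 2 x 2 squares that share a 1 x 1 corner (`left`, `right`, and their id columns are names introduced here) and sums the area of each result.

```python
import geopandas as gpd
from shapely.geometry import box

# Two hypothetical 2 x 2 squares overlapping in a 1 x 1 corner
left = gpd.GeoDataFrame({'left_id': [1]}, geometry=[box(0, 0, 2, 2)])
right = gpd.GeoDataFrame({'right_id': [1]}, geometry=[box(1, 1, 3, 3)])

# Total area of the overlay result for each option
areas = {}
for how in ['union', 'intersection', 'symmetric_difference', 'difference', 'identity']:
    result = gpd.overlay(left, right, how=how)
    areas[how] = round(float(result.area.sum()), 6)

print(areas)
```

Each square covers 4 units and the shared corner covers 1, so the totals follow directly: 7 for the union, 1 for the intersection, 6 for the symmetric difference, 3 for the difference, and 4 for the identity.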

---
### *Exercise*

1. Import the two files `illinois_county.json` and `tl_2021_17019_areawater.shp` in the `data` folder, and name them `county` and `water`, respectively. 
2. Select only Champaign County from `county` with the `.loc` method, and save the resulting GeoDataFrame back to `county`. 
3. Change the coordinate reference system of both datasets to Illinois State Plane East (epsg:3435).
4. Calculate the area not covered by water in square miles (the original unit is feet) and save the number as `diff_area` (i.e., divide square feet by 2.788e+7).
---

In [None]:
# Your code here



In [None]:
""" Test code for the previous function. This cell should NOT give any errors when it is run."""

# Check your result here. 
assert county['NAME'].values[0] == 'Champaign'
assert county.crs.name == 'NAD83 / Illinois East (ftUS)'
assert water.crs.name == 'NAD83 / Illinois East (ftUS)'
assert round(diff_area) == 996

print("Success!")

### 3.4. Spatial Join 
#### 3.4.1. Data preprocessing (i.e., importing data and matching crs)

In [None]:
# Importing necessary data for spatially joining fire location data to State polygons
# Fire Data is from USGS
fires = gpd.read_file(r'./data/fires_usgs.shp')
fires

In [None]:
# plot all fires as point data on a map
fires.plot(markersize=1, figsize=(10,10))

Before running a spatial join, you need to check that the two datasets have the same coordinate reference system. 
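
A compact way to express this check is to compare the `.crs` attributes and reproject only when they differ. The sketch below uses one made-up point and one made-up polygon (both hypothetical, not the fire or state data).

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Hypothetical inputs stored in two different coordinate reference systems
pts = gpd.GeoDataFrame(geometry=[Point(-9800000, 4900000)], crs='EPSG:3857')
zones = gpd.GeoDataFrame(geometry=[box(-90, 38, -86, 42)], crs='EPSG:4326')

# Reproject the points only if the two CRSs do not match
if pts.crs != zones.crs:
    pts = pts.to_crs(zones.crs)

print(pts.crs == zones.crs)
print(zones.geometry.iloc[0].contains(pts.geometry.iloc[0]))
```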

In [None]:
# crs of fire data
fires.crs

In [None]:
# crs of state geometry 
states.crs

In [None]:
# Because the two datasets have different CRSs, we need to make them identical.
# Reproject the fires data to the CRS of the state geometries
fires = fires.to_crs(epsg=4326)

#### 3.4.2. Spatial join with <a href=https://geopandas.org/en/stable/docs/reference/api/geopandas.sjoin.html> .sjoin() </a> method

The `.sjoin()` method accepts several values for its `op` argument (i.e., 'intersects', 'contains', 'within', 'touches', 'crosses', 'overlaps') to test various types of spatial relationships. <br>
**Note**: the `op` argument is deprecated in the up-to-date version of GeoPandas (0.10.2) and is replaced with `predicate`. 
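
Because the keyword differs between versions, one possible pattern is to try `predicate` first and fall back to `op`. The sketch below does this on made-up points and zones (all names and coordinates are hypothetical); it is one workaround, not the only way to handle the version difference.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Hypothetical points and two adjacent square zones
points = gpd.GeoDataFrame(
    {'id': [1, 2, 3]},
    geometry=[Point(0.5, 0.5), Point(1.5, 0.5), Point(5, 5)],
    crs='EPSG:4326',
)
zones = gpd.GeoDataFrame(
    {'zone': ['west', 'east']},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
    crs='EPSG:4326',
)

try:
    # Newer GeoPandas releases use `predicate`
    joined = gpd.sjoin(points, zones, predicate='within')
except TypeError:
    # Older releases only know `op`
    joined = gpd.sjoin(points, zones, op='within')

print(joined[['id', 'zone']])
```

The point at (5, 5) falls in neither zone, so the default inner join drops it.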

In [None]:
# op (or predicate) can be 'intersects', 'contains', 'within', 'touches', 'crosses', 'overlaps'
state_fires = gpd.sjoin(fires, states[['name', 'geometry']], op='within')  
state_fires

In [None]:
# create pandas DataFrame object with states and fire count
counts_per_state = state_fires.groupby('name').size() # Will return a Series, not a DataFrame
counts_per_state = counts_per_state.to_frame(name='number_of_fires') # Convert Series to DataFrame
counts_per_state.sort_values(by='number_of_fires', ascending=False) # list highest values first 

The <a href=https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html> `.merge()` </a> function joins one DataFrame to another on a shared column or index. Be aware that the default is `how='inner'`, which keeps only the values that exist in both DataFrames, so states without any fires would be dropped. Specifying `how='outer'` keeps them.
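
The difference between the two join types can be sketched with plain pandas on made-up tables (the state names `Alpha`, `Beta`, and `Gamma` are placeholders). `Gamma` has no fire count, so it survives only the outer join.

```python
import pandas as pd

# Hypothetical tables: three states, but fire counts for only two of them
states_toy = pd.DataFrame({'name': ['Alpha', 'Beta', 'Gamma']})
counts_toy = pd.DataFrame({'name': ['Alpha', 'Beta'], 'number_of_fires': [3, 1]})

inner = states_toy.merge(counts_toy, on='name')               # default how='inner'
outer = states_toy.merge(counts_toy, on='name', how='outer')  # keeps unmatched rows

print(inner['name'].to_list())
print(outer['name'].to_list())

# Unmatched rows get NaN, which can be replaced with 0
filled = outer['number_of_fires'].fillna(0).astype(int).to_list()
print(filled)
```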

In [None]:
counts_per_state

In [None]:
states_sjoin_1 = states.merge(counts_per_state, left_on='name', right_on='name', how='outer')
states_sjoin_1

#### 3.4.3. Spatial Join with `.loc` method of `GeoDataFrame` and relationship test method of  `shapely`

You can test the relationship between two geometries by using <a href=https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoSeries.intersects.html>`intersects` </a>, <a href=https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoSeries.contains.html>`contains`</a>, <a href=https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoSeries.within.html>`within`</a>, and so on. Each test returns Boolean values (i.e., True or False). 

In [None]:
# Test the relationship between the fire location points and State polygons
fires['geometry'].intersects(
        states.at['AL', 'geometry']
    )

Combining the relationship test with the `.loc` method slices the given GeoDataFrame.

In [None]:
# The following example will return the fire locations in the state of Alabama
fires.loc[
    fires['geometry'].intersects(
        states.at['AL', 'geometry']
    )
]

Let's combine all the processes into one cell. 

In [None]:
# Now, let's count the number of fires in each state. 
states_sjoin_2 = states.copy() # Make a copy
states_sjoin_2['number_of_fires'] = 0 # Create an empty column to store the result of spatial join

for idx, row in states_sjoin_2.iterrows(): # Iterating through rows of GeoDataFrame
    # This will give you the Dataframe of fires associated with each state
    temp_ = fires.loc[fires['geometry'].intersects(row['geometry'])]  
    
    if temp_.shape[0]: # If the sliced dataframe is not empty
        states_sjoin_2.at[idx, 'number_of_fires'] = temp_.shape[0] # Enter the number of fires in each state
        
    else: # If the sliced dataframe is empty
        states_sjoin_2.at[idx, 'number_of_fires'] = 0 # Enter 0 for the state in the loop. 
        
states_sjoin_2.head()
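
Since the sum of a Boolean mask is the number of `True` values, the `if`/`else` branch in the loop above is not strictly needed. The sketch below shows a shorter variant of the same idea on made-up points and zones (all names and coordinates are hypothetical).

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Hypothetical point locations and three zones, one of which contains no points
points = gpd.GeoSeries([Point(0.5, 0.5), Point(0.2, 0.8), Point(1.5, 0.5)])
zones = gpd.GeoDataFrame(
    {'zone': ['west', 'east', 'empty']},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1), box(10, 10, 11, 11)],
)

# For each zone, count the points whose intersects() test is True
zones['n_points'] = [int(points.intersects(geom).sum()) for geom in zones.geometry]
print(zones[['zone', 'n_points']])
```

A zone without any points simply gets a count of 0.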

In [None]:
# Compare the results of the two approaches to spatial join
states_sjoin_1.head()

---
### *Exercise*
```python
# Original code
fires.loc[fires['geometry'].intersects(states.at['AL', 'geometry'])]
```

1. From the original code above, replace `intersects()` with `within()`, and count the number of fires in Illinois. Then, save the number of fires as `fire_in_IL`. <br><br>

2. From the original code above, replace `intersects()` with `contains()`, and think about why `contains()` returns nothing. 
<br><br>
3. When can we use `contains()`? Find the state that had a fire on 2000-01-01, and save ONLY the name of the state as `first_fire_in_millennium`. <br>
**Hint**: the following statement will give you the coordinates (Point) where the fire ignited on 2000-01-01. 
```python
fires.loc[fires['Ig_Date'] == '2000-01-01', 'geometry'].values[0]
```

---

In [None]:
# Check the result of original code
fires.loc[fires['geometry'].intersects(states.at['AL', 'geometry'])]

In [None]:
# Your code here



In [None]:
""" Test code for the previous function. This cell should NOT give any errors when it is run."""

# Check your result here. 
assert fire_in_IL == 28
assert first_fire_in_millennium == 'Oklahoma'

print("Success!")

### 3.5. Visualize data (more details will be covered in Week 6)

In [None]:
# create a static map of the number of fire per state
states_sjoin_2.plot(column='number_of_fires', figsize=(15, 6), cmap='Reds', legend=True)

In [None]:
# Use Quantiles for the classification scheme (uses `mapclassify` package)
states_sjoin_2.plot(column='number_of_fires', figsize=(15, 6), cmap='Reds', legend=True, scheme='Quantiles')

In [None]:
# Use FisherJenks algorithm of 7 classes (k) for the classification scheme (uses `mapclassify` package)
states_sjoin_2.plot(column='number_of_fires', figsize=(15, 6), cmap='Reds', legend=True, scheme='FisherJenks', k=7)