# Section 5. Spatial Data - tabular data and shapefiles

#### Instructor: Pierre Biscaye

The content of this notebook draws on material from UC Berkeley's Spatial Data Analysis [course](https://docs.google.com/document/d/1oC10pjyeBQTenQazCpaB8Lx1b5PC1SR3WFiPgCtXqcs/edit?tab=t.0) notes by [Jaecheol Lee](https://sites.google.com/view/jaecheollee).
    
### Learning Objectives 
    
* Practice working with pandas dataframes that include point data
* Introduction to basic calculations involving spatial information
* Understand different types of spatial data and geometries
* Practice mapping spatial data using shapely and geopandas
* Work on manipulating spatial objects using shapely methods

### Sections
1. Loading and inspecting tabular spatial data
2. Calculations with point data
3. Shapely
4. shapely + pandas = geopandas

### Required data
* crime.csv
* hawaii.p
* demo.shp

### Required packages
* pandas
* numpy
* matplotlib
* shapely
* geopandas
* pickle

## 0. Loading modules

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# Check what you have loaded 
dir()

In [None]:
# Suppress scientific notation in Pandas 
# set default to displaying full float numbers with 2 decimal places
pd.options.display.float_format = '{:.2f}'.format

## 1. Loading and inspecting tabular spatial data

We will be working with a dataset on crime which has types and locations of crimes. It is extracted from a larger dataset.

The data are in tabular format, but still represent spatial data because some of the tabular data include geographic/spatial information!

In [None]:
# Load crime.csv 
df = pd.read_csv('data/crime.csv')
# inspect the data
df.head()

In [None]:
# Check the shape
df.shape

Let's add a year variable based on the year in the `incident_id`. 

In [None]:
df['year'] = df['incident_id'].str[:4].astype(int)
df

In [None]:
# Get simple summary statistics using the .describe method
df.describe().transpose()

In [None]:
# Sometimes there is a meaningful ID variable that we might want to index on
# We can set incident_id as the index
df=df.set_index('incident_id')
df

In [None]:
# Two ways to access the second row
print(df.iloc[1]['type'])
print(df.loc['2020-00049522']['type'])
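
The difference between position-based and label-based access can be seen on a small made-up table. This is a minimal sketch; the `toy` dataframe and its IDs are hypothetical and are not taken from `crime.csv`.

```python
import pandas as pd

# Hypothetical toy data: the IDs and crime types are made up for illustration
toy = pd.DataFrame(
    {"incident_id": ["2020-0001", "2020-0002", "2021-0003"],
     "type": ["theft", "assault", "vandalism"]}
).set_index("incident_id")

by_position = toy.iloc[1]["type"]        # second row, counting from 0
by_label = toy.loc["2020-0002", "type"]  # row whose index label matches
print(by_position, by_label)
```

Both lines return the same value here because the second row happens to carry the label `2020-0002`.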

One issue to note: the IDs are no longer available as a variable, which might not be desirable. 

To get the IDs back into a column, we can run `df.reset_index()`.

In [None]:
df = df.reset_index()
df

In [None]:
# Some spatial data are missing
print(df['lat'].isna())
print(df['lat'].notna())

In [None]:
# Let's drop rows with any missing data
df=df.dropna(how = 'any') # inplace argument is False by default
df
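
Before dropping rows, it can help to count how many values are missing in each column. The sketch below uses a hypothetical `toy` dataframe with made-up coordinates rather than the crime data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing latitude and one missing longitude
toy = pd.DataFrame({"lat": [37.87, np.nan, 37.85],
                    "lon": [-122.26, -122.27, np.nan],
                    "type": ["theft", "assault", "vandalism"]})

# isna() gives True/False per cell; summing counts the True values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop any row with a missing value and keep an independent copy
clean = toy.dropna(how="any").copy()
print(len(toy), "->", len(clean))
```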

Let's plot the coordinates! Here we will pass `lat` as the y argument and `lon` as the x argument.

In [None]:
plt.plot(df['lon'],df['lat'], '+')
plt.show()

## 2. Calculations with point data

The observations in `df` are **point data**. They are identified spatially by a point, or a pair of x and y coordinates. 

### Calculating distances

One basic thing we can do with points is calculate distances. Let's calculate (Euclidean) distances from each point to a fixed reference point (suppose it's a police station) and add that to the dataframe.

In [None]:
# Define coordinates of the reference point
ref=[-122.26,37.87]

# Calculate the Euclidean distance from ref for each observation using vectorization
df['dist_pt'] = ((df['lon'] - ref[0]) ** 2 + (df['lat'] - ref[1]) ** 2) ** 0.5
    

What does this warning message mean? 

We built the current version of the dataframe by dropping one row from the original dataframe. Because `df` is the result of filtering, pandas cannot tell whether we mean to modify a new, independent dataframe or a view that is still tied to the original, so it warns us when we assign a new column (typically a `SettingWithCopyWarning`).

The dataframes are still weakly linked within pandas. We can show this below.

In [None]:
print(df._is_view)
print(df._is_copy)

The clean way to break the link is to add `.copy()` at the end when writing over a dataframe. Let's recreate the new df cleanly.

In [None]:
df = pd.read_csv('data/crime.csv')
df = df.dropna(how='any').copy()
df['dist_pt'] = ((df['lon'] - ref[0]) ** 2 + (df['lat'] - ref[1]) ** 2) ** 0.5

In [None]:
# Set distance to km (roughly)
df['dist_pt']=df['dist_pt']*111.11
df
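
The factor of 111.11 km per degree is only rough, because a degree of longitude covers less ground as latitude increases. As a hedged comparison, the sketch below computes great-circle distances with the haversine formula. The function name `haversine_km`, the two test points, and the mean earth radius of 6371 km are assumptions made for illustration.

```python
import numpy as np

def haversine_km(lon1, lat1, lon2, lat2, radius_km=6371.0):
    """Great-circle distance in km between points given in degrees."""
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))

ref = [-122.26, 37.87]
pt_east = [-122.16, 37.87]    # hypothetical point 0.1 degrees east of ref
pt_north = [-122.26, 37.97]   # hypothetical point 0.1 degrees north of ref

flat_km = 0.1 * 111.11        # the flat approximation treats both cases the same
great_east = haversine_km(ref[0], ref[1], pt_east[0], pt_east[1])
great_north = haversine_km(ref[0], ref[1], pt_north[0], pt_north[1])
print(flat_km, great_east, great_north)
```

At this latitude the east-west distance comes out noticeably shorter than the north-south one, so the flat conversion overstates east-west distances.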

How can we find the incident with the minimum distance?

In [None]:
# If we use an enumerate loop:
min_distance = float('inf') # first define a placeholder value for distance - needs to be very large
min_i = None # a placeholder value for the index

# Now loop through distance values to identify the minimum
for i, distance in enumerate(df['dist_pt']):
    if distance < min_distance: 
        min_distance = distance 
        min_i = i

print(min_i)
print(min_distance)

In [None]:
# There are also built-in functions that calculate these values
min_i = np.argmin(df['dist_pt'])
min_dist = np.min(df['dist_pt'])

print(min_i)
print(min_dist)
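
Note that both approaches return a position, not an index label. After rows are dropped the index has gaps, so the two can differ. A minimal sketch on a hypothetical `toy` dataframe (the distances and index labels are made up):

```python
import pandas as pd

# Hypothetical distances; the index has gaps, as it would after dropping rows
toy = pd.DataFrame({"dist_pt": [4.2, 1.3, 0.7, 2.9]}, index=[0, 2, 5, 6])

pos = toy["dist_pt"].to_numpy().argmin()  # position of the minimum
label = toy["dist_pt"].idxmin()           # index label of the minimum
print(pos, label)

# Use iloc with a position and loc with a label
nearest_by_pos = toy.iloc[pos]
nearest_by_label = toy.loc[label]
print(nearest_by_label)
```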

### Heat maps - Lambda calculation

Lambda is a measure of intensity or density of events within a given radius. It is the ratio of the number of events within a given radius of a reference point to the area of the circle defined by that radius around the reference point.

Variations on this kind of measure are used to generate heat maps. The version we will calculate is one particular case.

Below is an example of using LaTeX formatting within Jupyter Notebook. It gives an equation for calculating lambda for some radius $h$.

$$
\hat{\lambda}(\overrightarrow{\underset{\cdot}{s}}) =\frac{1}{\pi h^{2}} \times \sum_{j=1}^{N} \mathbf{1}\left[ dist(\overrightarrow{\underset{\cdot}{s}}, \overrightarrow{\underset{\cdot}{j}})< h \right]
$$

Let's generate some random data and use it to generate a heat map.

In [None]:
# Set a seed to ensure we all get the same results.
np.random.seed(123)

# 10 random coordinates, multiply by 10 to get numbers between 0 and 10
x_coords = np.random.random(10) * 10
y_coords = np.random.random(10) * 10

# Plot the x_coords and the y_coords
plt.plot(x_coords, y_coords, 'r*');

In [None]:
# Define a function to calculate the Euclidean distance between two points
def compute_distance(x0, y0, x1, y1):
    dist = ((x0 - x1) ** 2 + (y0 - y1) ** 2) ** 0.5
    return(dist)

Now let's calculate lambda for a fixed point $s=(3,3)$, where the $j$ are the points we generated. Let's set the radius $h$ to 3 degrees.

$$
\hat{\lambda}(\overrightarrow{\underset{\cdot}{s}}) =\frac{1}{\pi h^{2}} \times \sum_{j=1}^{N} \mathbf{1}\left[ dist(\overrightarrow{\underset{\cdot}{s}}, \overrightarrow{\underset{\cdot}{j}})< h \right]
$$

In [None]:
h = 3

# First, calculate the distance between the events and a point x = 3, y = 3
# using a list comprehension
distance = [ compute_distance(3, 3, x_coords[i], y_coords[i]) for i in range(10) ]
print(distance)

# Second, count the number of pairs for which the distance is less than h
distance = np.array(distance) # convert from list to array
print(distance<h)
num  = np.sum(distance<h)

# Third, calculate lambda
lambda_est = (1/(np.pi * (h ** 2))) * num
print(lambda_est)

##### Don't confuse this lambda with a Python lambda [function](https://stackoverflow.com/questions/890128/why-are-python-lambdas-useful)! 
1. A lambda function is a small anonymous function (defined without a name).
2. A lambda function can take any number of arguments, but can only have one expression.

In [None]:
# lambda is a one-line function:
x = lambda a : a + 10
print(x(2))

# Equivalent to:
def x(a):
    b = a + 10
    return(b)
print(x(2))

$$
\hat{\lambda}(\overrightarrow{\underset{\cdot}{s}}) =\frac{1}{\pi h^{2}} \times \sum_{j=1}^{N} \mathbf{1}\left[ dist(\overrightarrow{\underset{\cdot}{s}}, \overrightarrow{\underset{\cdot}{j}})< h \right]
$$

Let's **write a function** to calculate lambdas for a given reference point and set of point events.

In [None]:
# Define a function to calculate the lambda given a point
def lambda_function(x_ref, y_ref, x_events, y_events, h):
    """ 
    x_ref and y_ref are coordinates of the reference point.
    x_events and y_events are arrays of coordinates of event points.
    h is the radius of interest for the lambda measure.
    We calculate Euclidean distances between the reference point and each event with a vectorized numpy expression.
    These are linear distances in degrees that assume the earth is flat.
    We then use this to calculate lambda.
    """
    distances = np.sqrt((x_events - x_ref)**2 + (y_events - y_ref)**2)  # vectorized calculation
    count = np.sum(distances < h)
    area = np.pi * (h ** 2)
    return count / area

In [None]:
# Test it!
print(lambda_function(5, 5, x_coords, y_coords, 3))
print(lambda_function(7, 3, x_coords, y_coords, 3))

### Creating a raster grid for a heatmap

Now let's **make a heatmap** for the area around the event data we created!

We don't want to just estimate densities at the locations of the events. Ideally, we want to calculate densities all across the map. 

To do this we will need to define a set of coordinates for the map. We can then calculate lambda for each point in the grid.

We will start by creating a 10 x 10 raster grid covering the event area. We can do this using `numpy` and the `np.meshgrid` function and `np.ndarray.flatten` method.

We want our 1 degree cells to have edges at each degree and midpoints in between, e.g., a midpoint of (0.5, 0.5).

In [None]:
# How can we make a 10 X 10 grid?
# The coordinates start from (0.5, 0.5) and end at (9.5, 9.5)
# Use np.meshgrid to create the grid
x, y = np.meshgrid(np.arange(0, 10) + 0.5, np.arange(0, 10) + 0.5)
x, y

To integrate this into a usable setup for mapping/tabular integration, we need to *flatten* each of these 2-D arrays into a 1-D array. This allows merging with tabular data and is also easier to loop through in Python.

In [None]:
x.flatten()
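
To see what `np.meshgrid` and `flatten` do, it can help to look at a grid small enough to read by eye. The sketch below builds a hypothetical 2-column by 3-row grid; the names `gx`, `gy`, and `pairs` are illustrative only.

```python
import numpy as np

xs = np.arange(0, 2) + 0.5   # two cell midpoints along x: 0.5, 1.5
ys = np.arange(0, 3) + 0.5   # three cell midpoints along y: 0.5, 1.5, 2.5

# meshgrid repeats xs along rows and ys along columns
gx, gy = np.meshgrid(xs, ys)
print(gx.shape)   # one row per y value, one column per x value

# Flattening reads each array row by row, so pairing them gives every grid point once
flat_x, flat_y = gx.flatten(), gy.flatten()
pairs = list(zip(flat_x.tolist(), flat_y.tolist()))
print(pairs)
```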

In [None]:
# How can we get all the coordinates?
x, y = x.flatten(), y.flatten()

# Plot the events (x_coords, y_coords) and the grid x, y coordinates in one plot.
# With different markers
plt.plot(x_coords, y_coords, 'ro')
plt.plot(x, y, 'k+')
plt.show()

Now we have our reference points and our events.

We can use the function we wrote earlier to calculate lambda for all grid points based on the locations of the events.

In [None]:
# Make an array of lambdas for all the grid locations
# using the function above and a single loop over the flattened grid
# let's still use h=3 

# First make an empty accumulator array
lambda_results = np.zeros(len(x))

# Then loop over flattened grid points and calculate lambda to fill the array
for k in range(len(x)):
    lambda_results[k] = lambda_function(x[k], y[k], x_coords, y_coords, 3)

# show the results
lambda_results
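
The loop over grid points can also be replaced by numpy broadcasting, which computes all grid-to-event distances at once. This is a sketch under stated assumptions: `lambda_grid` is a hypothetical helper, and the events are freshly generated random points rather than the ones used above.

```python
import numpy as np

def lambda_grid(x_grid, y_grid, x_events, y_events, h):
    """Lambda at every grid point, using a (grid points x events) distance table."""
    dx = x_grid[:, None] - x_events[None, :]
    dy = y_grid[:, None] - y_events[None, :]
    dist = np.sqrt(dx ** 2 + dy ** 2)
    counts = np.sum(dist < h, axis=1)   # events within h of each grid point
    return counts / (np.pi * h ** 2)

# Hypothetical events and the same 10 x 10 grid of cell midpoints
rng = np.random.default_rng(0)
ev_x, ev_y = rng.random(10) * 10, rng.random(10) * 10
gx, gy = np.meshgrid(np.arange(0, 10) + 0.5, np.arange(0, 10) + 0.5)
gx, gy = gx.flatten(), gy.flatten()

vectorized = lambda_grid(gx, gy, ev_x, ev_y, 3)

# Cross-check against an explicit loop over grid points
looped = np.array([
    np.sum(np.sqrt((ev_x - gx[k]) ** 2 + (ev_y - gy[k]) ** 2) < 3) / (np.pi * 3 ** 2)
    for k in range(len(gx))
])
print(np.allclose(vectorized, looped))
```

For a 10 x 10 grid the loop is perfectly adequate; broadcasting mainly pays off when the grid or the number of events grows.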

Now we can **create a heatmap**!

In [None]:
# Plot the events (x_coords, y_coords) on top of the grid cells,
# with each cell colored by its lambda value
plt.plot(x_coords, y_coords, 'ro')
scatter = plt.scatter(x, y, c=lambda_results, marker='s', cmap='viridis', s=500)
plt.colorbar(scatter, label="Event density Lambda")  # Add a colorbar
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("Heat map of event density")
plt.show()

Let's draw circles around the events to try to make sense of the heatmap.

In [None]:
from matplotlib.patches import Circle
from matplotlib.collections import PatchCollection

fig, ax = plt.subplots(figsize=(10, 8))

# 1. Create a list of circle patches around each event coordinate
patches = [Circle((x, y), radius=3) for x, y in zip(x_coords, y_coords)]

# 2. Create a collection from these patches
pc = PatchCollection(patches, facecolor='none', edgecolor='red', alpha=0.3, linestyle='--')

# 3. Add the circles to the plot
ax.add_collection(pc)

# 4. Plot the event points (the circle centers) and the grid points
ax.plot(x_coords, y_coords, 'ro', markersize=4)
ax.plot(x, y, 'k+')

# 5. Plot your heatmap grid
scatter = ax.scatter(x, y, c=lambda_results, marker='s', cmap='viridis', s=500, edgecolors='none')

# 6. Format the plot
plt.colorbar(scatter, label="Event density Lambda")
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("Heat map of event density with 3° radius")

plt.show()

Three degrees is quite large! Try re-running this with a 1 degree radius and see how the results change.

In [None]:
# Your code

## 3. Shapely

Now let's transition to types of geometries other than points. 

#### Geometries in shapely

* [Intro to geometric objects in shapely](https://automating-gis-processes.github.io/2016/Lesson1-Geometric-Objects.html)

__Structures of Geometries__

* A `Point` is a pair (or triple) of numbers giving its x, y (and optionally z) coordinates.
* A `LineString` is a polyline, or sequence of points connected by straight line segments.
* A `LinearRing` is a sequence of points, with the last point being the same as the first one. (Here we skip the discussion on validity of geometries.)
* A `Polygon` (e.g., rectangles) has one exterior `polygon.exterior` (a `LinearRing`) and potentially multiple interiors `polygon.interiors` (each element, e.g. `polygon.interiors[0]`, is a `LinearRing`).
* A `MultiPolygon` is a sequence of `Polygon`s.

In [None]:
from shapely.geometry import (Point, LineString, LinearRing,
                              Polygon, MultiPolygon)
p = Point((1, 2))
line = LineString([(1, 2), (8, 4),
                   (5, 10)])
ring = LinearRing([(1, 2), (8, 4),
                   (5, 10)])
triangle = Polygon([(1, 2), (4, 8),
                    (10, 5), (1, 2)])
rectangles = Polygon(
    # these are the exterior coordinates
    [(2.5, 7), (9, 7), (9, 12), (2.5, 12), (2.5, 7)],
    # these are the interior coordinates (the holes)
    [[(3, 8), (4, 8), (4, 9), (3, 9), (3, 8)],
     [(7, 10), (8, 10), (8, 11), (7, 11), (7, 10)]])
mp = MultiPolygon([triangle, rectangles])

How do these look? You can directly call shapely objects to visualize them.

In [None]:
line

In [None]:
ring

In [None]:
triangle

In [None]:
rectangles

In [None]:
mp
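
Beyond drawing them, shapely objects expose basic measurements as attributes. The sketch below uses two hypothetical shapes (`segment` and `square_with_hole`) chosen so the results are easy to verify by hand.

```python
from shapely.geometry import LineString, Point, Polygon

# A 3-4-5 line segment and a 4 x 4 square with a 1 x 1 hole (both made up)
segment = LineString([(0, 0), (3, 4)])
square_with_hole = Polygon(
    [(0, 0), (4, 0), (4, 4), (0, 4)],          # exterior ring
    [[(1, 1), (2, 1), (2, 2), (1, 2)]],        # one interior ring (hole)
)

print(segment.length)                     # length of the line
print(square_with_hole.area)              # exterior area minus the hole
print(square_with_hole.bounds)            # (minx, miny, maxx, maxy)
print(len(square_with_hole.interiors))    # number of holes

# A point inside the hole is not part of the polygon
in_hole = square_with_hole.contains(Point(1.5, 1.5))
print(in_hole)
```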

### Operations with shapely

Shapely implements many operations on geometries that would be difficult and time-consuming to write ourselves, including union, intersection, difference, buffer, and distance.

In [None]:
# Usually the syntax is `NewObject = OneObject.operation(AnotherObject)`, for example
result = triangle.intersection(rectangles)
result

In [None]:
result = rectangles.union(triangle)
result

In [None]:
result = rectangles.difference(triangle)
result

In [None]:
result = rectangles.buffer(1)  # buffering
result

In [None]:
from shapely.affinity import scale
result = scale(rectangles, yfact=1.3)  # scaling
result

In [None]:
result = scale(rectangles, xfact=2)  # scaling
result

In [None]:
# Construct an ellipse/oval
circle = Point((0, 0)).buffer(1)
ellipse = scale(circle, yfact=1.5)
ellipse
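
The operations above can be checked numerically on shapes simple enough to reason about by hand. This sketch uses `shapely.geometry.box` to build three hypothetical axis-aligned squares; the names `a`, `b`, and `c` are illustrative.

```python
from shapely.geometry import box

a = box(0, 0, 2, 2)   # 2 x 2 square
b = box(1, 1, 3, 3)   # 2 x 2 square overlapping a in a 1 x 1 corner
c = box(5, 0, 6, 1)   # 1 x 1 square well to the right of a

inter_area = a.intersection(b).area   # the shared 1 x 1 corner
union_area = a.union(b).area          # 4 + 4 minus the shared corner
diff_area = a.difference(b).area      # a with the shared corner removed
gap_ab = a.distance(b)                # overlapping shapes are at distance 0
gap_ac = a.distance(c)                # horizontal gap between x=2 and x=5
buffered_area = a.buffer(1).area      # buffering grows the shape outward

print(inter_area, union_area, diff_area, gap_ab, gap_ac, buffered_area)
```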

### Plotting shapely objects in matplotlib


In [None]:
# What you need is 'coordinate information'
# Get the x, y coordinates of the rectangles above using the xy attribute of the exterior and of each interior ring
lon, lat = rectangles.exterior.xy
plt.plot(lon, lat, 'k-')
for interior in rectangles.interiors:
    lon, lat = interior.xy
    plt.plot(lon, lat, 'k-')
plt.show()

In [None]:
# plotting multiple shapes
lon, lat = rectangles.exterior.xy
plt.plot(lon, lat, 'k-')
for interior in rectangles.interiors:
    lon, lat = interior.xy
    plt.plot(lon, lat, 'k-')
tlon, tlat = triangle.exterior.xy
plt.fill(tlon, tlat)
plt.show()

## 4. shapely + pandas = geopandas

A very common type of geospatial data file is called a shapefile, often with the .shp extension.

* Traditionally: Thinking of shapefiles as a collection of shapes, each associated with many attributes
* geopandas: Thinking of shapefiles as data frames
     * Each observation in a GeoDataFrame is a shape (or geometry), usually Polygon, but can be other things
     * One special column `df['geometry']` records that (these geometries are all shapely objects)
     * All the other columns will be the attributes that are associated with the geometries
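
To make the structure concrete, a GeoDataFrame can also be built by hand from shapely objects. This is a minimal sketch with hypothetical data; the `name` values and the two boxes are made up.

```python
import geopandas as gpd
from shapely.geometry import box

# Each row pairs an attribute (name) with a shapely geometry
toy = gpd.GeoDataFrame(
    {"name": ["small", "large"],
     "geometry": [box(0, 0, 1, 1), box(0, 0, 2, 3)]}
)

# Geometry-aware properties work on the whole column at once
toy["area"] = toy.area
print(toy[["name", "area"]])

# Individual cells of the geometry column are ordinary shapely objects
geom_type_name = type(toy.loc[0, "geometry"]).__name__
print(geom_type_name)
```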

Let's import `geopandas` and use it to load an example shapefile, which is based on the rectangles and triangle shapes created above.

In [None]:
import geopandas as gpd
df = gpd.read_file('data/demo.shp')

In [None]:
# Check it out
df

In [None]:
# seamless integration with shapely
geom = df.loc[0, 'geometry']
type(geom)  # shapely Polygon

In [None]:
# Plotting in one line
df.plot()
plt.show()

To plot a single observation, use a slice of length 1 that includes it. The slice returns a GeoDataFrame, whereas selecting a single row returns a pandas Series, whose `.plot()` method does not draw the geometry.

In [None]:
df.iloc[0:1].plot();
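
The difference between the two kinds of selection can be checked directly. A small sketch on a hypothetical two-row GeoDataFrame (the `toy` data are made up):

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import box

toy = gpd.GeoDataFrame(
    {"name": ["a", "b"],
     "geometry": [box(0, 0, 1, 1), box(2, 0, 3, 1)]}
)

row = toy.iloc[0]              # a single row: a pandas Series of that row's values
one_row_frame = toy.iloc[0:1]  # a slice of length 1: still a GeoDataFrame

print(type(row).__name__, type(one_row_frame).__name__)
```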

In [None]:
# Plotting with matplotlib
plt.plot(df.loc[0, 'geometry'].exterior.xy[0], 
         df.loc[0, 'geometry'].exterior.xy[1], 
         'k-')
plt.plot(df.loc[1, 'geometry'].exterior.xy[0], 
         df.loc[1, 'geometry'].exterior.xy[1], 
         'k-')
for interior in df.loc[1, 'geometry'].interiors:
    lon, lat = interior.xy
    plt.plot(lon, lat, 'k-')
plt.grid();

Let's do another example with `hawaii.p`, a file with the coordinates of a point within Oahu and a multipolygon for the islands of Hawai'i.

Python `.p` data files are pickled objects, which we load with the `pickle` library.

In [None]:
# Open the file:
import pickle
with open('data/hawaii.p', 'rb') as f:
    d = pickle.load(f)
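
For reference, a pickle file is written with `pickle.dump` and read back with `pickle.load`. The sketch below round-trips a hypothetical dictionary through a temporary file; the coordinates and file name are made up and are not the contents of `hawaii.p`.

```python
import os
import pickle
import tempfile

# Hypothetical payload, loosely shaped like a dictionary of spatial objects
payload = {"oahu": {"lon": -157.97, "lat": 21.46}, "note": "made-up example"}

path = os.path.join(tempfile.mkdtemp(), "toy.p")

# Write in binary mode ('wb'), then read back in binary mode ('rb')
with open(path, "wb") as f:
    pickle.dump(payload, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == payload)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.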

In [None]:
# Inspect it
d

In [None]:
type(d)

In [None]:
hawaii = d['hawaii']
oahu = d['oahu']

In [None]:
type(hawaii)

Let's plot the Hawaii multipolygon, and include the location of the point in Oahu. Note that a multipolygon has no single `exterior`; its component polygons are accessed through the `geoms` attribute.

Let's also include a buffer around the Hawaiian islands with distance 0.5 (degrees). We can use the `buffer` method in shapely.

In [None]:
hawaii_buff05 = hawaii.buffer(0.5)

for island in hawaii.geoms: # note we iterate over the geometries in a multipolygon
    lon, lat = island.exterior.xy
    plt.plot(lon, lat, 'k-')
for island in hawaii_buff05.geoms: # note we iterate over the geometries in a multipolygon
    lon, lat = island.exterior.xy
    plt.plot(lon, lat, 'b', linestyle='dashed')
plt.plot(oahu['lon'],oahu['lat'], '*y', markersize=10)
plt.grid()
plt.show()

Let's do some geospatial calculations with these data.

We'll calculate the minimum distance from each Hawaiian island to Oahu, and then visualize this by shading each island on the map by that minimum distance.

For distance, we'll use the built-in `distance()` method, which returns the shortest distance between two geometries (0 if one lies inside or touches the other).
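
How `distance()` behaves for points inside and outside a polygon can be checked on a simple shape. In this sketch the `island` square and the three points are hypothetical.

```python
from shapely.geometry import Point, box

island = box(0, 0, 2, 2)     # a made-up square "island"
inside = Point(1, 1)         # lies within the square
outside = Point(5, 2)        # 3 units to the right of the corner (2, 2)
diagonal = Point(5, 6)       # 3 across and 4 up from the corner (2, 2)

d_inside = island.distance(inside)      # 0: the point is part of the polygon
d_outside = island.distance(outside)    # horizontal gap to the square
d_diagonal = island.distance(diagonal)  # straight line to the nearest corner

# Distance to the boundary only: use the exterior ring instead of the polygon
d_to_edge = island.exterior.distance(inside)

print(d_inside, d_outside, d_diagonal, d_to_edge)
```

This is why an island that contains the reference point gets a distance of zero.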

First we need to create a point object from the `oahu` object.

In [None]:
type(oahu)

In [None]:
# let's convert it
oahu_pt = Point(oahu['lon'], oahu['lat'])
type(oahu_pt)

In [None]:
import matplotlib.cm as cm
import matplotlib.colors as colors

# Calculate distances for each island
islands = list(hawaii.geoms)
island_distances = [island.distance(oahu_pt) for island in islands]

# Set up color mapping by scaling the raw distances to a 0-1 range
norm = colors.Normalize(vmin=min(island_distances), vmax=max(island_distances))
cmap = cm.viridis 

# Plot
fig, ax = plt.subplots(figsize=(10, 6))

for island, dist in zip(islands, island_distances): # a way to loop over two lists at once
    # Get the color based on distance
    color = cmap(norm(dist))
    lon, lat = island.exterior.xy
    # Plot the island filled with the color
    ax.fill(lon, lat, color=color, alpha=0.8, edgecolor='black')
    
ax.plot(oahu['lon'], oahu['lat'], '*y', markersize=15, markeredgecolor='k', label='Oahu Center')

# Add a Colorbar to explain the distances
sm = plt.cm.ScalarMappable(cmap=cmap, norm=norm)
plt.colorbar(sm, ax=ax, label='Minimum Distance to Oahu (Degrees)')

ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
ax.set_title("Hawaiian Islands by Min. Distance to Oahu")
ax.grid(True, alpha=0.3)
plt.show()

This was a little bit complicated. Using **geopandas** can make things much more efficient.

If we convert the hawaii multipolygon to a **geodataframe** (like the shapefile we loaded earlier), we can very easily calculate and plot distances.

In [None]:
# convert to gdf
gdf = gpd.GeoDataFrame({'geometry': list(hawaii.geoms)})
# one line distance calculation
gdf['dist'] = gdf.distance(oahu_pt)

In [None]:
gdf.head()
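
The one-line distance calculation can be verified on shapes where the answer is known in advance. A minimal sketch with a hypothetical GeoDataFrame of three boxes and a made-up reference point:

```python
import geopandas as gpd
from shapely.geometry import Point, box

toy = gpd.GeoDataFrame(
    {"geometry": [box(0, 0, 1, 1), box(4, 0, 5, 1), box(0, 5, 1, 6)]}
)
ref_pt = Point(0.5, 0.5)   # sits inside the first box

# distance() is applied row by row against the single reference point
toy["dist"] = toy.distance(ref_pt)

# Sorting by the new column ranks the shapes from nearest to farthest
ranked = toy.sort_values("dist")
print(ranked["dist"].tolist())
```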

In [None]:
# Plot
fig, ax = plt.subplots(figsize=(10, 6))

# Plot the islands
# 'legend_kwds' allows us to set the label on the colorbar directly
gdf.plot(
    column='dist', 
    cmap='viridis', 
    edgecolor='black', 
    alpha=0.8,
    legend=True, 
    ax=ax,
    legend_kwds={'label': "Min. Distance to Oahu (Degrees)"}
)

# Plot the Oahu point
ax.plot(oahu['lon'], oahu['lat'], '*y', markersize=15, markeredgecolor='k', label='Oahu Center')

# Formatting
ax.set_aspect('equal') # Keep the islands from looking stretched
ax.grid(True, alpha=0.3)
ax.set(
    xlabel="Longitude", 
    ylabel="Latitude", 
    title="Hawaiian Islands by Min. Distance to Oahu"
)
ax.legend() # for the Oahu point

plt.show()

This code is nice and efficient and also scalable for situations with a much larger number of polygons!