# Section 5. Spatial Data - tabular data and shapefiles

#### Instructor: Pierre Biscaye

The content of this notebook draws on material from UC Berkeley's Spatial Data Analysis [course](https://docs.google.com/document/d/1oC10pjyeBQTenQazCpaB8Lx1b5PC1SR3WFiPgCtXqcs/edit?tab=t.0) notes by [Jaecheol Lee](https://sites.google.com/view/jaecheollee).
    
### Learning Objectives 
    
* Practice working with pandas dataframes that include point data
* Introduction to basic calculations involving spatial information
* Understand different types of spatial data and geometries
* Practice mapping spatial data using shapely and geopandas
* Work on manipulating spatial objects using shapely methods

### Sections
1. Tabular spatial data using pandas
2. Calculations with point data
3. Shapely and geopandas

## 0. Loading modules

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Check what you have loaded 
dir()

In [None]:
# Suppress scientific notation in Pandas 
# set default to displaying full float numbers with 2 decimal places
pd.options.display.float_format = '{:.2f}'.format

## 1. Loading and Inspecting Tabular Spatial Data

We will be working with a dataset on crime which has types and locations of crimes. It is extracted from a larger dataset.

The data are in tabular format, but still represent spatial data because some of the tabular data include geographic/spatial information!

In [None]:
# Load crime.csv 
df = pd.read_csv('data/crime.csv')
# inspect the data
df.head()

In [None]:
# Check the shape
df.shape

Let's add a year variable.

In [None]:
df['year'] = [2020,2020,2020,2020,2020]
# another way to do this is df['year'] = np.repeat(2020, 5)

In [None]:
# we could also have done this based on the incident_id, if we had a variation in year
df['year2'] = df['incident_id'].str[:4]
df

In [None]:
# Get simple summary statistics using the .describe method
df.describe().transpose()

In [None]:
# Sometimes a dataset has a meaningful ID variable that we may want to use as the index
# Here we set incident_id as the index
df2=df.set_index('incident_id')
df2

In [None]:
# Some spatial data are missing
df['lat'].isna() 
df['lat'].notna()

In [None]:
# Let's drop rows with any missing data
df=df.dropna(how = 'any') # inplace argument is False by default
df
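
Before dropping rows, it can help to see how much is missing and where. The sketch below uses a small made-up table (the values and the name `demo` are assumptions, not the course data) to count missing values per column and to drop only the rows that lack a coordinate.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the crime table (values are made up)
demo = pd.DataFrame({
    "incident_id": ["2020-0001", "2020-0002", "2020-0003"],
    "lat": [37.87, np.nan, 37.85],
    "lon": [-122.26, -122.27, np.nan],
})

# Count missing values per column before dropping anything
missing_per_column = demo.isna().sum()
print(missing_per_column)

# Drop only rows that lack a coordinate; other columns are not checked
demo_clean = demo.dropna(subset=["lat", "lon"])
print(demo_clean)
```

Note that `dropna` keeps the original index labels of the surviving rows, so the index may have gaps afterwards.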

Let's plot the coordinates! Here we will pass `lat` as the y argument and `lon` as the x argument.

In [None]:
plt.plot(df['lon'],df['lat'], '+')
plt.show()

## 2. Calculations with point data

The observations in `df` are **point data**. They are identified spatially by a point, or a pair of x and y coordinates. 

### Calculating distances

One basic thing we can do with points is calculate distances. Let's calculate distances from each point to a fixed reference point (suppose it's a police station) and add that to the dataframe.

In [None]:
# First add a new variable with all missing values (accumulator)
df['dist_pt']=np.nan

# Define coordinates of the reference point
ref=[-122.26,37.87]

# Calculate the distance from ref for each observation using a loop
for i in df.index: # loop over the index labels of the remaining observations
    df.loc[i, 'dist_pt']=((df.loc[i, 'lon'] - ref[0]) ** 2 +
                      (df.loc[i, 'lat'] - ref[1]) ** 2) ** 0.5
    

In [None]:
# An older version of this code used chained indexing (df['dist_pt'][i] = ...), which makes pandas emit warnings
for i in range(4): # loop over number of observations
    df['dist_pt'][i]=((df['lon'][i] - ref[0]) ** 2 +
                      (df['lat'][i] - ref[1]) ** 2) ** 0.5
    

In [None]:
# It is possible to suppress these kinds of warnings
# But be aware you may lose some useful information
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Set distance to km (roughly)
df['dist_pt']=df['dist_pt']*111.11
df
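
The same calculation can also be written without a loop, since arithmetic on pandas columns applies to every row at once. This is a minimal sketch on made-up coordinates (the table `pts` and its values are assumptions), using the same flat-earth distance in degrees and the same rough conversion factor as above.

```python
import numpy as np
import pandas as pd

# Hypothetical points (made-up coordinates near the reference point)
pts = pd.DataFrame({
    "lon": [-122.26, -122.27, -122.25],
    "lat": [37.87, 37.87, 37.88],
})
ref_lon, ref_lat = -122.26, 37.87

# Column arithmetic applies to every row at once, so no loop is needed
dist_deg = np.sqrt((pts["lon"] - ref_lon) ** 2 + (pts["lat"] - ref_lat) ** 2)

# Same rough conversion as above: about 111.11 km per degree
pts["dist_km"] = dist_deg * 111.11
print(pts)
```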

How can we find the incident with the minimum distance?

In [None]:
print(float('inf'))

In [None]:
# If we use an enumerate loop:
min_distance = float('inf') # first define a placeholder value for distance - needs to be very large
min_i = None # a placeholder value for the index

# Now loop through distance values to identify the minimum
for i, distance in enumerate(df['dist_pt']):
    if distance < min_distance: 
        min_distance = distance 
        min_i = i

print(min_i)
print(min_distance)

In [None]:
# There are also functions that calculate these values
min_i = np.argmin(df['dist_pt'])
min_dist = np.min(df['dist_pt'])

print(min_i)
print(min_dist)
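
One thing to watch for: `np.argmin` and the `enumerate` loop return a *position*, which is not the same as the index *label* once rows have been dropped. The sketch below, on a made-up Series, contrasts the position with the label returned by the pandas method `idxmin`.

```python
import numpy as np
import pandas as pd

# Hypothetical distances whose index labels have a gap (label 1 was dropped)
dist = pd.Series([5.0, 2.0, 7.0], index=[0, 2, 3])

position = int(np.argmin(dist.to_numpy()))  # position in the array
label = dist.idxmin()                        # index label of the minimum

print(position, label)
print(dist.iloc[position], dist.loc[label])  # both look up the same value
```

Use `.iloc` with a position and `.loc` with a label to retrieve the matching row.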

### Heat maps - Lambda calculation

Lambda is a measure of intensity or density of events within a given radius. It is the ratio of the number of events within a given radius of a reference point to the area of the circle defined by that radius around the reference point.

Variations on this kind of measure are used to generate heat maps. The version we will calculate is one particular case.

Below is an example of using LaTeX formatting within Jupyter Notebook. It gives an equation for calculating lambda.

$$
\hat{\lambda}(\overrightarrow{\underset{\cdot}{s}}) =\frac{1}{\pi h^{2}} \times \sum_{j=1}^{N} \mathbf{1}\left[ dist(\overrightarrow{\underset{\cdot}{s}}, \overrightarrow{\underset{\cdot}{s_{j}}})< h \right]
$$

Let's generate some random data and use it to generate a heat map.

In [None]:
# Set a seed to ensure we all get the same results.
np.random.seed(123)

# Random coordinates as numbers between 0 and 10
x_coords = np.random.random(10) * 10
y_coords = np.random.random(10) * 10

x_coords, y_coords

In [None]:
# Plot the x_coords and the y_coords
plt.plot(x_coords, y_coords, 'r*');

In [None]:
# Define a function to calculate the distance between two points
def compute_distance(x0, y0, x1, y1):
    dist = ((x0 - x1) ** 2 + (y0 - y1) ** 2) ** 0.5
    return(dist)

Now let's calculate lambda for a fixed point $s=$(3,3), and let's set the radius $h$ to 3 degrees.

$$
\hat{\lambda}(\overrightarrow{\underset{\cdot}{s}}) =\frac{1}{\pi h^{2}} \times \sum_{j=1}^{N} \mathbf{1}\left[ dist(\overrightarrow{\underset{\cdot}{s}}, \overrightarrow{\underset{\cdot}{s_{j}}})< h \right]
$$

In [None]:
h = 3

# First, calculate the distance between the events and a point x = 3, y = 3
# using a list comprehension
distance = [ compute_distance(3, 3, x_coords[i], y_coords[i]) for i in range(10) ]
print(distance)

# Second, count the number of events for which the distance is less than h
distance = np.array(distance) # convert from list to array
print(distance<h)
num  = np.sum(distance<h)

# Third, calculate lambda
lambda_est = (1/(np.pi * (h ** 2))) * num
print(lambda_est)

##### Don't confuse this lambda with a Python lambda [function](https://stackoverflow.com/questions/890128/why-are-python-lambdas-useful)! 
1. A lambda function is a small anonymous function (defined without a name).
2. A lambda function can take any number of arguments, but can only have one expression.

In [None]:
# lambda is a one-line function:
x = lambda a : a + 10
print(x(2))

# Equivalent to:
def x(a):
    b = a + 10
    return(b)
print(x(2))

$$
\hat{\lambda}(\overrightarrow{\underset{\cdot}{s}}) =\frac{1}{\pi h^{2}} \times \sum_{j=1}^{N} \mathbf{1}\left[ dist(\overrightarrow{\underset{\cdot}{s}}, \overrightarrow{\underset{\cdot}{s_{j}}})< h \right]
$$

Let's **write a function** to calculate lambdas for a given reference point and set of point events.

In [None]:
# Define a function to calculate the lambda given a point
def lambda_function(x_ref, y_ref, x_events, y_events, h):
    """ 
    x_ref and y_ref are coordinates of the reference point.
    x_events and y_events are arrays of coordinates of event points.
    h is the radius of interest for the lambda measure.
    We calculate distances between the reference point and each event, using a previously created function.
    These are linear distances in degrees that assume the earth is flat.
    We then use this to calculate lambda.
    """
    distance = [ compute_distance(x_ref, y_ref, x_events[i], y_events[i]) for i in range(len(x_events)) ]
    distance = np.array(distance)
    lambda_est = (1/(np.pi * (h ** 2))) * np.sum(distance<h)
    return(lambda_est)

In [None]:
# Test it!
print(lambda_function(5, 5, x_coords, y_coords, 3))
print(lambda_function(7, 3, x_coords, y_coords, 3))

### Creating a raster grid for a heatmap

Now let's **make a heatmap** for the area around the event data we created!

We don't want to just estimate densities at the locations of the events. Ideally, we want to calculate densities all across the map. 

To do this we will need to define a set of coordinates for the map. We can then calculate lambda for each point in the grid.

We will start by creating a raster grid of 11 x 11 points (10 x 10 cells) covering the event area. We can do this using `numpy` and the `np.meshgrid` function and `np.ndarray.flatten` method.

In [None]:
# How can we make a grid of 10 x 10 cells (11 x 11 points)?
# The coordinates start from (0, 0) and end at (10, 10)
# Use np.meshgrid to create the grid
x, y = np.meshgrid(range(0, 11), range(0, 11)) # recall that the last number in range() is excluded
x, y

In [None]:
x.flatten()

In [None]:
# How can we get all the coordinates?
x, y = x.flatten(), y.flatten()

# Plot the events (x_coords, y_coords) and the grid x, y coordinates in one plot.
# With different markers
plt.plot(x_coords, y_coords, 'ro')
plt.plot(x, y, 'k+')
plt.show()
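
To see what `np.meshgrid` and `flatten` are doing, it may help to look at a grid small enough to print. This sketch uses a made-up 3 x 2 grid (the names `gx`, `gy` and `pairs` are only for illustration).

```python
import numpy as np

# A small grid: x takes values 0, 1, 2 and y takes values 0, 1
gx, gy = np.meshgrid(range(3), range(2))
print(gx)  # each row repeats the x values
print(gy)  # each row holds a single y value

# Flattening pairs the coordinates up row by row
pairs = list(zip(gx.flatten().tolist(), gy.flatten().tolist()))
print(pairs)
```

In the flattened arrays, x cycles fastest and y changes once per row, which matters when matching computed values to grid points.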

Now we have our reference points and our events.

We can use the function we wrote earlier to calculate lambda for all grid points based on the locations of the events.

In [None]:
# Make a matrix of lambdas for all the locations
# utilizing the function above and a double loop
# let's use h=3 this time

# First make an empty matrix accumulator
matrix = np.zeros((11, 11))

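# Then loop over grid rows (y) and columns (x) and calculate lambda to fill the matrix
for i in range(11):
    for j in range(11): 
        # After flattening, the first 11 entries of x are 0, 1, ..., 10, so we can reuse x for both coordinates.
        # Row i holds y = x[i] and column j holds x = x[j], so that matrix.flatten()
        # lines up with the flattened x and y arrays used for plotting.
        matrix[i, j] = lambda_function(x[j], x[i], x_coords, y_coords, 3)       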

# Flatten the matrix for mapping
matrix = matrix.flatten()
matrix
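
An alternative that avoids index bookkeeping is to compute lambda directly on the flattened grid coordinates, so each value is automatically paired with its own grid point. This is a sketch using NumPy broadcasting on made-up events (`ev_x`, `ev_y` and the other names are assumptions for illustration).

```python
import numpy as np

# Made-up events clustered near x = 9, y = 1
ev_x = np.array([9.0, 8.5, 9.5])
ev_y = np.array([1.0, 1.5, 0.5])
h = 3

gx, gy = np.meshgrid(np.arange(11), np.arange(11))
gx, gy = gx.flatten(), gy.flatten()

# Broadcasting: rows are grid points, columns are events
dist = np.hypot(gx[:, None] - ev_x[None, :], gy[:, None] - ev_y[None, :])
lam = (dist < h).sum(axis=1) / (np.pi * h ** 2)

# lam[k] belongs to the grid point (gx[k], gy[k]), so it can be passed
# straight to plt.scatter(gx, gy, c=lam)
k_hot = int(np.argmax(lam))
print(gx[k_hot], gy[k_hot], lam[k_hot])
```

With events placed off the diagonal like this, a transposed heat map is easy to spot: the high-density cells should appear in the lower right, where the events are.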

Now we can **create a heatmap**!

In [None]:
# Plot the events (x_coords, y_coords) and the grid x, y coordinates in one plot.
# With different markers
plt.plot(x_coords, y_coords, 'ro')
scatter=plt.scatter(x, y, c=matrix, marker='s', cmap='viridis', s=500)
plt.colorbar(scatter, label="Event density Lambda")  # Add a colorbar
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("Heat map of event density")
plt.show()

## 3. Shapely and Geopandas

Now let's move on to geometries other than points. 

#### Geometries in shapely

* [Intro to geometric objects in shapely](https://automating-gis-processes.github.io/2016/Lesson1-Geometric-Objects.html)

__Structures of Geometries__

* A `Point` is a collection of two (or three) numbers, representing the x, y (and z) coordinates.
* A `LinearRing` is a sequence of points, with the last point being the same as the first one. (Here we skip the discussion on validity of geometries.)
* A Polygon (e.g., `rectangles`) has one exterior `rectangles.exterior` (a `LinearRing`) and potentially multiple interiors `rectangles.interiors` (each element, e.g. `rectangles.interiors[0]`, is a `LinearRing`).
* A `MultiPolygon` is a sequence of `Polygon`s.

In [None]:
from shapely.geometry import (Point, LinearRing,
                              Polygon, MultiPolygon)
p = Point((1, 2))
ring = LinearRing([(1, 2), (8, 4),
                   (5, 10), (1, 2)])
triangle = Polygon([(1, 2), (8, 4),
                    (5, 10), (1, 2)])
rectangles = Polygon(
    # these are the exterior coordinates
    [(2.5, 7), (9, 7), (9, 12), (2.5, 12), (2.5, 7)],
    # these are the interior coordinates (the holes)
    [[(3, 8), (4, 8), (4, 9), (3, 9), (3, 8)],
     [(7, 10), (8, 10), (8, 11), (7, 11), (7, 10)]])
mp = MultiPolygon([triangle, rectangles])
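
The structure described in the list above can be inspected directly on a shapely object. The sketch below builds a made-up square with one hole (not one of the shapes above) and prints its parts.

```python
from shapely.geometry import Polygon

# A 4 x 4 square with a 1 x 1 hole (made-up coordinates)
square = Polygon(
    [(0, 0), (4, 0), (4, 4), (0, 4), (0, 0)],
    [[(1, 1), (2, 1), (2, 2), (1, 2), (1, 1)]],
)

print(square.geom_type)           # type of geometry
print(square.exterior.geom_type)  # the outer boundary is a LinearRing
print(len(square.interiors))      # number of holes
print(square.area)                # area excludes the hole
print(square.bounds)              # (minx, miny, maxx, maxy)
```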

In [None]:
ring

In [None]:
triangle

In [None]:
# How do these look?
rectangles

In [None]:
mp

### Operations with shapely

Shapely implements many operations on geometries that would be difficult and time-consuming to write ourselves, including union, intersection, difference, buffer, and distance.

In [None]:
# Usually the syntax is `NewObject = OneObject.operation(AnotherObject)`, for example
result = triangle.intersection(rectangles)
result

In [None]:
result = rectangles.union(triangle)
result

In [None]:
result = rectangles.difference(triangle)
result

In [None]:
result = rectangles.buffer(1)  # buffering
result

In [None]:
from shapely.affinity import scale
result = scale(rectangles, yfact=1.3)  # scaling
result

In [None]:
result = scale(rectangles, xfact=2)  # scaling
result

In [None]:
# Construct an ellipse/oval
circle = Point((0, 0)).buffer(1)
ellipse = scale(circle, yfact=1.5)
ellipse
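
Because the results of these operations are themselves shapely objects, we can check them numerically as well as visually. A small sketch with made-up squares (built with the `box` helper from `shapely.geometry`) where the expected areas are easy to work out by hand:

```python
from shapely.geometry import box

# Two made-up 2 x 2 squares that overlap in a 1 x 1 corner
a = box(0, 0, 2, 2)
b = box(1, 1, 3, 3)
c = box(5, 0, 6, 1)  # a third square, disjoint from the others

print(a.intersection(b).area)  # overlap only
print(a.union(b).area)         # both squares, overlap counted once
print(a.difference(b).area)    # part of a that is not in b
print(a.distance(c))           # shortest gap between a and c
```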

### Plotting shapely objects in matplotlib


In [None]:
# What you need is 'coordinate information'
# Get the x, y coordinates of the rectangles above using attribute exterior.xy
lon, lat = rectangles.exterior.xy
plt.plot(lon, lat, 'k-')
plt.show()

In [None]:
# What about the interiors?
lon, lat = rectangles.exterior.xy
plt.plot(lon, lat, 'k-')
for interior in rectangles.interiors:
    lon, lat = interior.xy
    plt.plot(lon, lat, 'k-')
plt.show()

In [None]:
# plotting multiple shapes
lon, lat = rectangles.exterior.xy
plt.plot(lon, lat, 'k-')
for interior in rectangles.interiors:
    lon, lat = interior.xy
    plt.plot(lon, lat, 'k-')
tlon, tlat = triangle.exterior.xy
plt.fill(tlon, tlat)
plt.show()

## shapely + pandas = geopandas

A very common type of geospatial data file is called a shapefile, often with the .shp extension.

* Traditionally: Thinking of shapefiles as a collection of shapes, each associated with many attributes
* geopandas: Thinking of shapefiles as data frames
     * Each observation in a GeoDataFrame is a shape (or geometry), usually Polygon, but can be other things
     * One special column `df['geometry']` records that (these geometries are all shapely objects)
     * All the other columns will be the attributes that are associated with the geometries

Let's import `geopandas` and use it to load an example shapefile, which is based on the rectangles and triangle shapes created above.

In [None]:
import geopandas as gpd
df = gpd.read_file('data/demo.shp')

In [None]:
# Check it out
df

In [None]:
# seamless integration with shapely
geom = df.loc[0, 'geometry']
type(geom)  # shapely Polygon

In [None]:
# Plotting in one line
df.plot()
plt.show()

In [None]:
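# Use method loc to keep only the first row and try method plot.
# Note that selecting a single row returns a pandas Series rather than a GeoDataFrame, so this does not draw the shape.
df.loc[0, :].plot()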

In [None]:
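# To keep the geopandas dataframe class when selecting rows, pass a list of labels, e.g. df.loc[[0], :].
# The cx indexer is different: it selects by coordinates, not by row label.
df.cx[0:1, :].plot() # shapes that intersect x between 0 and 1, NOT row 0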

In [None]:
df.cx[3:3, :].plot() # shapes that intersect x = 3, NOT row 3
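
The difference between selecting by row label and selecting by coordinates is easier to see on a table we build ourselves. The sketch below constructs a made-up GeoDataFrame (the name `gdf`, the `name` column and the simplified shapes are assumptions, not the contents of `demo.shp`).

```python
import geopandas as gpd
from shapely.geometry import Polygon

# Made-up GeoDataFrame with two shapes and one attribute column
gdf = gpd.GeoDataFrame(
    {"name": ["triangle", "rectangle"]},
    geometry=[
        Polygon([(1, 2), (8, 4), (5, 10)]),
        Polygon([(2.5, 7), (9, 7), (9, 12), (2.5, 12)]),
    ],
)

row_as_series = gdf.loc[0, :]  # one row comes back as a pandas Series
row_as_gdf = gdf.loc[[0], :]   # a list of labels keeps the GeoDataFrame
by_coords = gdf.cx[0:2, :]     # shapes that intersect 0 <= x <= 2

print(type(row_as_series).__name__, type(row_as_gdf).__name__)
print(by_coords["name"].tolist())
```

Only the triangle reaches into the strip between x = 0 and x = 2, so `cx` returns it alone regardless of its row label.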

In [None]:
# Plotting with matplotlib
plt.plot(df.loc[1, 'geometry'].exterior.xy[0], 
         df.loc[1, 'geometry'].exterior.xy[1], 
         'k-');

In [None]:
# Plotting with matplotlib
plt.plot(geom.exterior.xy[0], geom.exterior.xy[1], 'k-');

In [None]:
# Display a panel grid over the figure above:
plt.plot(geom.exterior.xy[0], geom.exterior.xy[1], 'k-')
plt.grid();

Let's do another example with `hawaii.p`, a file with the coordinates of a point within Oahu and a multipolygon for the islands of Hawai'i.

Python pickle files (here with the `.p` extension) are loaded using the `pickle` library.

In [None]:
# Open the file:
import pickle
with open('data/hawaii.p', 'rb') as f:
    d = pickle.load(f)
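
For context, a pickle file is simply a Python object written to disk in binary form. The sketch below writes a made-up dictionary to a temporary file and reads it back (the file name and the values are assumptions, not the contents of `hawaii.p`).

```python
import os
import pickle
import tempfile

# Made-up dictionary standing in for the contents of a .p file
original = {"oahu": {"lon": -157.9, "lat": 21.4}, "note": "example"}

path = os.path.join(tempfile.mkdtemp(), "example.p")

# 'wb' and 'rb' open the file in binary mode, which pickle requires
with open(path, "wb") as f:
    pickle.dump(original, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == original)
```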

In [None]:
# Inspect it
d

In [None]:
type(d)

In [None]:
hawaii = d['hawaii']
oahu = d['oahu']

In [None]:
type(hawaii)

Let's plot the Hawaii multipolygon, and include the location of the point in Oahu. Note that a multipolygon has no single `exterior`; instead, we iterate over its component polygons through the `geoms` attribute.

Let's also include a buffer around the Hawaiian islands with distance 0.5 (degrees). We can use the `buffer` method in shapely.

In [None]:
hawaii_buff05 = hawaii.buffer(0.5)

for island in hawaii.geoms: # note we iterate over the geometries in a multipolygon
    lat, lon = island.exterior.xy
    plt.plot(lon, lat, 'k-')
for island in hawaii_buff05.geoms: # note we iterate over the geometries in a multipolygon
    lat, lon = island.exterior.xy
    plt.plot(lon, lat, 'b', linestyle='dashed')
plt.plot(oahu['lon'],oahu['lat'], '*y', markersize=10)
plt.grid()
plt.show()
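
One caveat when buffering a multipolygon: if the buffers of neighbouring parts overlap, shapely merges them, and the result can be a single `Polygon`, which has no `geoms` attribute. The sketch below shows this on two made-up squares, along with a small hypothetical helper (`parts`) that handles both cases.

```python
from shapely.geometry import MultiPolygon, box

# Two made-up unit squares separated by a gap of 0.5
squares = MultiPolygon([box(0, 0, 1, 1), box(1.5, 0, 2.5, 1)])

small = squares.buffer(0.1)  # buffers stay separate
large = squares.buffer(0.5)  # buffers overlap and merge

def parts(geom):
    """Hypothetical helper: return the component polygons of either type."""
    return list(getattr(geom, "geoms", [geom]))

print(small.geom_type, len(parts(small)))
print(large.geom_type, len(parts(large)))
```

Looping over `parts(geom)` instead of `geom.geoms` keeps plotting code working for either kind of result.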