## Shape analysis of species ranges

In this notebook we consider some methods that allow us to better understand the spatial ranges of the species of plants in a taxonomic class. The observations of each species can be considered as a spatial point set. The methods used here are examples of [spatial descriptive statistics](https://en.wikipedia.org/wiki/Spatial_descriptive_statistics).

In [None]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from pathlib import Path
import cartopy.crs as ccrs
from statsmodels.nonparametric.smoothers_lowess import lowess
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import haversine_distances
from scipy.spatial import distance_matrix

Select a taxonomic class for analysis.

In [None]:
#pclass = "Pinopsida"
pclass = "Polypodiopsida"

These are some geophysical constants that we will need below.

In [None]:
earth_radius_m = 6371000
earth_radius_km = earth_radius_m / 1000
earth_circumference_km = 40075
land_area = 149 * 10**6  # approximate land area of the Earth in km^2

In [None]:
pa = Path("/home/kshedden/data/Teaching/inaturalist")
fn = pa / ("Plantae_%s.csv.gz" % pclass)

We will only consider observations made on or after January 1, 2015.

In [None]:
v = ["scientificName", "decimalLatitude", "decimalLongitude", "eventDate"]
df = pd.read_csv(fn, parse_dates=["eventDate"], usecols=v)
df = df.dropna()
df = df.query("eventDate >= 20150101")
df.head()

For this analysis we will only consider species with at least 1000 observations.

In [None]:
df["n"] = df.groupby("scientificName").transform("size")
df = df.query("n>=1000")
df["scientificName"].unique().size

We create a fake species whose observations are drawn uniformly at random from the surface of the sphere.  This will serve as a benchmark for some of our analyses below.

In [None]:
n = 1000
dd = pd.DataFrame({"scientificName": np.repeat("fake_uniform", n)})
xyz = np.random.normal(size=(n, 3))

def convert_coordinates(xyz):
    mag = np.sqrt((xyz**2).sum())
    lon = np.arctan2(xyz[0], xyz[1])
    lat = np.arccos(-xyz[2] / mag) - np.pi/2
    return lat, lon
 
latlon = [convert_coordinates(xyz[i, :]) for i in range(n)]
 
dd["decimalLongitude"] = [latlon[i][1] * 180 / np.pi for i in range(n)]
dd["decimalLatitude"] = [latlon[i][0] * 180 / np.pi for i in range(n)]
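# Assign the dates by position; a bare Series would be aligned on the index instead.
dd["eventDate"] = df["eventDate"].iloc[0:n].to_numpy()
df = pd.concat((df, dd), axis=0, ignore_index=True)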
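
As a sanity check on the construction above, the following sketch uses only the standard library to normalize Gaussian vectors and test one property of uniform points on a sphere: the sine of the latitude is uniform on $[-1, 1]$, so about half of the points should lie within 30 degrees of the equator. The names `sample_latitudes_deg` and `n_check` are illustrative and not part of the notebook. The latitude is computed as `asin(z / mag)`, which is algebraically the same as the `arccos(-z / mag) - pi/2` used in `convert_coordinates`.

```python
import math
import random

def sample_latitudes_deg(n, seed=0):
    """Latitudes (degrees) of n points drawn uniformly on the unit sphere."""
    rng = random.Random(seed)
    lats = []
    for _ in range(n):
        # A standard Gaussian vector is rotationally symmetric, so its
        # direction is uniformly distributed on the sphere.
        x, y, z = (rng.gauss(0.0, 1.0) for _ in range(3))
        mag = math.sqrt(x * x + y * y + z * z)
        lats.append(math.degrees(math.asin(z / mag)))
    return lats

n_check = 20000
lats = sample_latitudes_deg(n_check)
# Under uniformity, P(|lat| < 30 degrees) = sin(30 degrees) = 0.5.
frac_near_equator = sum(abs(v) < 30 for v in lats) / n_check
print(round(frac_near_equator, 2))
```

A fraction far from 0.5 would suggest that the points are concentrated near the poles or near the equator rather than spread uniformly.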

To visualize the results of this analysis, we will make maps showing the locations of observations of one species.

In [None]:
def map_species(sn):
    """
    Plot the locations of the observations of species 'sn' on a world map.
    """
    dx = df.query("scientificName=='{}'".format(sn))
    plt.figure(figsize=(9, 7.25))
    ax = plt.axes([0.05, 0.05, 0.84, 0.88], projection=ccrs.PlateCarree(central_longitude=180))
    ax.coastlines()
    ax.set_extent([0, 310, -60, 80])
    
    plt.scatter(dx["decimalLongitude"], dx["decimalLatitude"], s=8, alpha=0.1, color="red", 
                transform=ccrs.Geodetic(), rasterized=True)
    plt.title(sn)
    plt.show()

In [None]:
map_species("fake_uniform")

To calculate distances below, we will need to have the latitude and longitude of each observation in radians.

In [None]:
df["lonrad"] = np.pi * df["decimalLongitude"] / 180
df["latrad"] = np.pi * df["decimalLatitude"] / 180

We will be calculating some circular statistics below, for which we need these quantities.

In [None]:
df["lonrad_sin"] = np.sin(df["lonrad"])
df["lonrad_cos"] = np.cos(df["lonrad"])

The circular mean and circular variance are based on these means:

In [None]:
df["lonrad_cos_mean"] = df.groupby("scientificName")["lonrad_cos"].transform("mean")
df["lonrad_sin_mean"] = df.groupby("scientificName")["lonrad_sin"].transform("mean")

## Comparing species ranges based on spatial dispersion

Below we calculate the circular variances of the longitude values for each species.

In [None]:
df["lon_var"] = 1 - np.sqrt(df["lonrad_cos_mean"]**2 + df["lonrad_sin_mean"]**2)
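
To illustrate what the circular variance measures, here is a small standard-library sketch on made-up longitudes. The function name `circular_variance` is hypothetical. It computes $1 - R$, where $R$ is the length of the mean of the unit vectors $(\cos\theta, \sin\theta)$, which is the same quantity as `lon_var` above.

```python
import math

def circular_variance(angles_deg):
    """Circular variance 1 - R, where R is the mean resultant length."""
    rad = [math.radians(a) for a in angles_deg]
    c = sum(math.cos(a) for a in rad) / len(rad)
    s = sum(math.sin(a) for a in rad) / len(rad)
    return 1 - math.hypot(c, s)

# Longitudes on either side of the wrap-around point are treated as close.
tight = circular_variance([350, 10])           # equals 1 - cos(10 degrees)
# Longitudes spread evenly around the globe give the maximum value of 1.
spread = circular_variance([0, 90, 180, 270])
print(tight, spread)
```

An ordinary variance would treat 350 and 10 degrees as far apart, which is why a circular statistic is used for longitude.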

The histogram below shows a strong bimodal pattern in the circular variances.

In [None]:
dd = df.groupby("scientificName")["lon_var"].first()
plt.hist(dd)
plt.xlabel("Circular variance of longitudes")
plt.ylabel("Frequency");

The maps below show the observed locations for the three species with the greatest longitudinal variance, and the three species with the least longitudinal variance.  These maps reveal that the species with small longitudinal variances are often limited to a single island.

In [None]:
dd = dd.sort_values()

for j in [0, 1, 2, -3, -2, -1]:
    sn = dd.index[j]
    map_species(sn)

Below we calculate the median of the pairwise distances between observations of each species.  This is a different measure of spatial dispersion from the circular variance of longitude used above.

In [None]:
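def f(dx):
    n = dx.shape[0]
    m = min(n, 1000)
    # np.random.choice samples with replacement by default, so some
    # observations are drawn more than once.
    ii = np.random.choice(n, m)
    di = earth_radius_km * haversine_distances(dx[["latrad", "lonrad"]].iloc[ii, :])
    # Use the strict lower triangle so that the zero self-distances on the
    # diagonal do not enter the median.
    ii = np.tril_indices(m, -1)
    return pd.Series({"n": n, "med_dist": np.median(di[ii])})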

dd = df.groupby("scientificName").apply(f)
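
The distance computation above relies on scikit-learn's `haversine_distances`, which expects `[latitude, longitude]` pairs in radians and returns distances on the unit sphere (hence the multiplication by the Earth's radius). As a rough illustration of the same calculation, the sketch below implements the haversine formula with the standard library and takes the median over unordered pairs of made-up points. The names `haversine_km` and `points` are hypothetical.

```python
import math
from itertools import combinations
from statistics import median

EARTH_RADIUS_KM = 6371.0  # same value as earth_radius_km in the notebook

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    # Clamp to 1 to guard against rounding error before taking the arcsine.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

# Made-up (lat, lon) points on the equator, 90 degrees of longitude apart.
points = [(0.0, 0.0), (0.0, 90.0), (0.0, 180.0)]
# combinations() yields each unordered pair once and never pairs a point with itself.
dists = [haversine_km(*p, *q) for p, q in combinations(points, 2)]
med_dist = median(dists)
print(round(med_dist, 1))
```

Two of the three pairs are a quarter of a great circle apart and one pair is half a great circle apart, so the median is a quarter of the circumference implied by the radius used here.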

As above, we plot the observed locations for the three species with the least median pairwise distance, and the three species with the greatest median pairwise distance.

In [None]:
dd = dd.sort_values(by="med_dist")

for i in [0, 1, 2, -3, -2, -1]:
    sn = dd.index[i]
    map_species(sn)

Above we considered the median pairwise distance.  But once we have computed the pairwise distances, there are many more things that we can do with them.  Below we evaluate the empirical CDF (eCDF) of pairwise distances within a species.  The functions are evaluated on a grid of points (defined below as 'dgr'), so that we can analyze them as fixed-length vectors.

In [None]:
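# Grid of distances (km) from 1 km up to half the Earth's circumference, the
# largest possible great-circle distance.  Squaring a linear grid places more
# grid points at short distances.
dgr = np.square(np.linspace(1, np.sqrt(earth_circumference_km/2), 1000))

def f(dx):
    n = dx.shape[0]
    m = min(n, 1000)
    # np.random.choice samples with replacement by default, so some
    # observations are drawn more than once.
    ii = np.random.choice(n, m)
    di = earth_radius_km * haversine_distances(dx[["latrad", "lonrad"]].iloc[ii, :])
    # Keep each unordered pair once, excluding the diagonal.
    ii = np.tril_indices(m, -1)
    dv = di[ii]
    dv.sort()
    # Number of pairwise distances strictly below each grid value.
    ii = np.searchsorted(dv, dgr)
    return pd.Series({"n": n, "ecdf": ii/len(dv)})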

dd = df.groupby("scientificName").apply(f)
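
The grid evaluation can be seen on a toy example. The sketch below, with made-up distances and a hypothetical helper `ecdf_on_grid`, uses `bisect_left` from the standard library, which counts values strictly below each grid point in the same way as `np.searchsorted` with its default `side="left"`.

```python
from bisect import bisect_left

def ecdf_on_grid(distances, grid):
    """Fraction of distances strictly below each grid value."""
    dv = sorted(distances)
    return [bisect_left(dv, g) / len(dv) for g in grid]

toy_distances = [5.0, 1.0, 2.0, 2.0]   # made-up pairwise distances
toy_grid = [1.0, 2.0, 3.0, 10.0]
ecdf = ecdf_on_grid(toy_distances, toy_grid)
print(ecdf)  # [0.0, 0.25, 0.75, 1.0]
```

Because every species is evaluated on the same grid, the resulting vectors all have the same length and can be compared entry by entry.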

Below we plot the empirical CDFs of pairwise distances for a random subset of the species.

In [None]:
j0 = dd.index.get_loc("fake_uniform")

for j in np.concatenate([[j0], np.random.choice(dd.shape[0], 5)]):
    plt.clf()
    plt.title(dd.index[j])
    plt.plot(dgr, dd["ecdf"].iloc[j], "-")
    plt.grid(True)
    plt.xlabel("Distance")
    plt.ylabel("Fraction of pairwise distances")    
    plt.show()

## Comparing species ranges based on their geometric dimension

The [correlation dimension](https://en.wikipedia.org/wiki/Correlation_dimension) is based on the assumption that the fraction $p(\epsilon)$ of pairwise distances that are less than a value $\epsilon > 0$ follows the power law $p(\epsilon) \sim \epsilon^\nu$ for small $\epsilon$.  The value of $\nu$ is the correlation dimension.  When $\nu$ is small, the species range is restricted to a lower dimensional region.  When $\nu$ is large, the species range fills space more fully.  If the species distribution is spatially uniform, the correlation dimension will be 2.  If the species distribution is restricted to 1-dimensional paths, the correlation dimension will be 1.  Correlation dimensions smaller than 1 indicate a "fractal-like" distribution.
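
To make the interpretation of $\nu$ concrete, the sketch below estimates the slope of $\log p(\epsilon)$ against $\log \epsilon$ for two synthetic planar point sets: points on a line segment and points filling a square. It is an illustration under simplifying assumptions, not the notebook's method: it uses Euclidean distances, an unweighted least squares fit, and a hand-picked range of radii. The names `correlation_dimension`, `line`, and `square` are hypothetical.

```python
import math
import random
from bisect import bisect_left

def correlation_dimension(points, eps_grid):
    """Least-squares slope of log p(eps) against log eps."""
    dists = sorted(
        math.dist(points[i], points[j])
        for i in range(len(points))
        for j in range(i)
    )
    xs, ys = [], []
    for eps in eps_grid:
        p = bisect_left(dists, eps) / len(dists)
        if p > 0:  # skip radii with no pairs, where log p is undefined
            xs.append(math.log(eps))
            ys.append(math.log(p))
    xbar, ybar = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    return sxy / sxx

rng = random.Random(0)
n_pts = 500
eps_grid = [0.02 * 5 ** (k / 9) for k in range(10)]   # 0.02 to 0.1, log-spaced
line = [(rng.random(), 0.0) for _ in range(n_pts)]             # 1-dimensional set
square = [(rng.random(), rng.random()) for _ in range(n_pts)]  # 2-dimensional set
nu_line = correlation_dimension(line, eps_grid)
nu_square = correlation_dimension(square, eps_grid)
print(round(nu_line, 2), round(nu_square, 2))
```

The estimates should land near 1 and 2 respectively, typically slightly below because points near the boundary have fewer neighbors than the power law assumes.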

In [None]:
j0 = dd.index.get_loc("fake_uniform")

plt.grid(True)
for j in np.concatenate([[j0], np.random.choice(dd.shape[0], 5)]):
    plt.plot(np.log(dgr), np.log(dd["ecdf"].iloc[j]), "-o")
plt.xlabel("Log radius")
plt.ylabel("Log fraction of pairwise distances")

Below we estimate the correlation dimension using weighted least squares regression in log/log space, weighting each grid point by $p(1-p)$, where $p$ is the eCDF value at that point.

In [None]:
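log_dgr = np.log(dgr)
# Keep the grid points with a finite log distance (np.Inf is not available in NumPy 2).
ii = np.flatnonzero(np.isfinite(log_dgr))
log_dgr_res = log_dgr[ii]

def f(ecdf):
    p = np.asarray(ecdf, dtype=float)[ii]
    # Drop grid points where the eCDF is zero, since log(0) = -inf would
    # turn the weighted covariance into NaN even with zero weight.
    ok = p > 0
    w = p[ok] * (1 - p[ok])
    # Weighted least squares slope of log(eCDF) on log(distance).
    cc = np.cov(log_dgr_res[ok], np.log(p[ok]), aweights=w)
    return cc[0, 1] / cc[0, 0]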
    
dd["cor_dim"] = dd["ecdf"].apply(f)
plt.hist(dd["cor_dim"]);
plt.xlabel("Correlation dimension")
plt.ylabel("Frequency")

In [None]:
dd = dd.sort_values(by="cor_dim")

for j in [0, 1, -2, -1]:
    plt.plot(np.log(dgr), np.log(dd["ecdf"].iloc[j]), "-o")
plt.grid(True)
plt.xlabel("Log radius")
plt.ylabel("Log fraction of pairwise distances")

Below we plot the species occurrences for the three species with the lowest correlation dimension and the three species with the greatest correlation dimension.

In [None]:
dd = dd.sort_values(by="cor_dim")

for j in [0, 1, 2, -3, -2, -1]:
    map_species(dd.index[j])

## Factor analysis of the pairwise distance distributions

Below we use principal component analysis to understand the variation among the eCDFs.

In [None]:
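species = dd.index.tolist()
# Stack the eCDFs into a (species x grid point) matrix.
dm = np.vstack([dd.loc[k].ecdf for k in species])
# Square-root transform, then center each column at its mean.
dm = np.sqrt(dm)
dmn = dm.mean(0)
dm -= dmn
# PCA via the SVD: columns of v are loadings, columns of u are unscaled scores.
u, s, vt = np.linalg.svd(dm)
v = vt.T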

The spectrum seems to closely follow a power law $s_k \sim k^{-1.8}$.

In [None]:
ii = np.arange(1, len(s)+1)
plt.plot(np.log(ii), np.log(s), "-o")
plt.grid(True)
plt.xlabel("Log rank")
plt.ylabel("Log singular value")

jj = np.flatnonzero(np.log(ii) < 3)
cc = np.cov(np.log(s[jj]), np.log(ii)[jj])
b = -cc[0, 1] / cc[1, 1]
b

In [None]:
def plot_factor(j):
    f = s[j] * u[:, j].std()
    plt.plot(dgr, dmn)
    plt.xlabel("Distance")
    plt.ylabel("Sqrt cumulative probability")
    plt.grid(True)
    for k in [-1, 1]:
        plt.plot(dgr, dmn + k*f*v[:, j], color="grey")
    plt.show()
        
plot_factor(0)
plot_factor(1)

Below is a scatterplot of the PC scores for the first two factors. 

In [None]:
plt.grid(True)
plt.plot(u[:, 0], u[:, 1], "o")

Below are plots of the eCDF functions (on the log scale) with extreme scores on factor 1, and on factor 2.

In [None]:
q0 = np.quantile(u[:, 0], [0.05, 0.95])
q1 = np.quantile(u[:, 1], [0.05, 0.95])
qq = [q0, q1]

for j in [0, 1]:
    i0 = np.flatnonzero(u[:, j] < qq[j][0])
    plt.clf()
    plt.grid(True)
    plt.title("Factor {}".format(j+1))
    plt.xlabel("Log distance")
    plt.ylabel("Log cumulative probability")
    for i in i0:
        plt.plot(np.log(dgr), np.log(dd["ecdf"].iloc[i]), "-", color="blue")
    
    i1 = np.flatnonzero(u[:, j] > qq[j][1])
    for i in i1:
        plt.plot(np.log(dgr), np.log(dd["ecdf"].iloc[i]), "-", color="red")
    plt.show()


The eCDF of pairwise distances is closely related to [Ripley's K-function](https://en.wikipedia.org/wiki/Spatial_descriptive_statistics) and to its transformed version, Ripley's L-function.

The K-function is the eCDF times the area of the region containing the points.  Under a uniform distribution, the K-function will be $\pi d^2$.  However, no plant is even close to being uniformly distributed on the Earth's surface (not least because of the presence of oceans), so comparing to a uniform distribution can feel like a "straw man" comparison.

If the K-function is equal to $\pi d^2$, then the log of the K-function is a linear function of $\log(d)$, with slope 2.  As seen above, no true species has a slope approaching 2 (the 'fake_uniform' data have a slope of around 1.6).

For the sake of illustration, we take the area to be the total land area on Earth, roughly 149 million km$^2$ (a figure that includes Antarctica).

If $K(d)$ is the K-function, then the L-function is $L(d) = (K(d)/\pi)^{1/2}$.  This is done to achieve variance stabilization.  Plotting $d - L(d)$ against $d$ should give a point set with zero conditional mean and constant conditional variance under uniformity.  We present a few such plots below, and see extreme discrepancy from what would be expected under uniformity.
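
The reasoning in the preceding paragraphs can be written out in one chain. This assumes points that are uniform over a planar region of area $A$, ignores boundary effects, and writes $\hat{F}(d)$ for the eCDF of pairwise distances (notation introduced here for convenience):

$$
\begin{aligned}
K(d) &= A\,\hat{F}(d) \approx \pi d^2, \\
\log K(d) &\approx \log \pi + 2 \log d, \\
L(d) &= \left(K(d)/\pi\right)^{1/2} \approx d, \\
d - L(d) &\approx 0 .
\end{aligned}
$$

So under uniformity the log/log plot has slope 2 and the plot of $d - L(d)$ against $d$ is flat at zero. Systematic departures from zero indicate clustering or dispersion relative to the uniform benchmark.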

In [None]:
m = 1001
plt.grid(True)
for j in np.random.choice(dd.shape[0], 5):
    plt.plot(dgr[0:m], dgr[0:m] - np.sqrt(land_area*dd["ecdf"].iloc[j][0:m] / np.pi), "-")
plt.xlabel("Radius")
plt.ylabel("Radius - L(radius)")

To check that we are using these methods properly, below we simulate data from a uniform distribution on the unit disc and calculate the empirical CDF of the pairwise distances.  First we simulate the data:

In [None]:
n = 1000
theta = 2 * np.pi * np.random.uniform(size=n)
r = np.sqrt(np.random.uniform(size=n))
x = r*np.sin(theta)
y = r*np.cos(theta)
xy = np.vstack((x, y)).T
plt.plot(x, y, "o", color="grey", alpha=0.3)
plt.xlabel("X")
plt.ylabel("Y")

Next we calculate the empirical CDF of the pairwise distances, so that we can assess whether it is a quadratic function of distance:

In [None]:
d = distance_matrix(xy, xy)
ii = np.tril_indices(n, -1)  # strict lower triangle: exclude zero self-distances
di = d[ii]
di.sort()
g = np.linspace(0, 1, 100)
pp = np.searchsorted(di, g) / len(di)

The CDF should be approximately a quadratic function of distance for small distances.  This can be assessed by checking whether the log/log plot below has a slope of 2:

In [None]:
plt.grid(True)
plt.plot(np.log(g), np.log(pp), "o")

Another check is that the K-function (here $\pi$ times the eCDF, since the unit disc has area $\pi$) should be equal to $\pi d^2$, which holds until $d$ gets large enough for boundary effects to matter.

In [None]:
plt.grid(True)
plt.plot(g, np.pi*pp)
plt.plot(g, np.pi*g**2)