# Spatial Joins Exercises

Here's a reminder of some of the functions we have seen. Hint: they
should be useful for the exercises!

-   `sum(expression)`: aggregate that
    returns the sum for a set of records
-   `count(expression)`: aggregate that
    returns the size of a set of records
-   `ST_Area(geometry)` returns the
    area of the polygons
-   `ST_AsText(geometry)` returns WKT `text`
-   `ST_Contains(geometry A, geometry B)` returns true if geometry A contains geometry B
-   `ST_Distance(geometry A, geometry B)` returns the minimum distance between geometry A and
    geometry B
-   `ST_DWithin(geometry A, geometry B, radius)` returns true if geometry A is radius distance or less from geometry B
-   `ST_GeomFromText(text)` returns `geometry`
-   `ST_Intersects(geometry A, geometry B)` returns true if geometry A intersects geometry B
-   `ST_Length(linestring)` returns the length of the linestring
-   `ST_Touches(geometry A, geometry B)` returns true if the boundary of geometry A touches geometry B
-   `ST_Within(geometry A, geometry B)` returns true if geometry A is within geometry B


Run the following cell to install the required packages.


In [None]:
%pip install leafmap lonboard

In [None]:
import duckdb
import leafmap

In [None]:
%pip install jupysql duckdb-engine

In [None]:
%load_ext sql
%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False

Download the [nyc_data.zip](https://github.com/opengeos/data/raw/main/duckdb/nyc_data.zip) dataset using leafmap. The zip file contains the following datasets. Create a new DuckDB database and import the datasets into the database. Each dataset should be imported into a separate table.

- nyc_census_blocks
- nyc_homicides
- nyc_neighborhoods
- nyc_streets
- nyc_subway_stations

In [None]:
url = "https://storage.googleapis.com/qm2/CASA0025/nyc_data.db.zip"
leafmap.download_file(url, unzip=True)

In [None]:
%sql duckdb:///nyc_data.db

In [None]:
%%sql

INSTALL spatial;
LOAD spatial;

1. **What subway station is in 'Little Italy'? What subway route is it on?**

In [None]:
%%sql

select s.name subway_station_name, s.routes subway_routes
from nyc_subway_stations s
inner join nyc_neighborhoods n
  on st_intersects(s.geom, n.geom)
where lower(n.NAME) like '%little italy%'

2. **What are all the neighborhoods served by the 6-train?** (Hint: The `routes` column in the `nyc_subway_stations` table has values like 'B,D,6,V' and 'C,6'.)


In [None]:
%%sql

select distinct n.name neighborhood_name
from nyc_subway_stations s
inner join nyc_neighborhoods n
  on st_intersects(n.geom, s.geom)
where s.routes like '%6%'
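
The hint matters because `routes` is a comma-separated string, so the filter is really a membership test on a list. Here is a minimal Python sketch of that idea. It uses the two sample values from the hint plus one made-up value (`'A,6X'`, hypothetical and not taken from the dataset) to show where a plain substring match and an exact token match could diverge:

```python
def serves_route(routes: str, route: str) -> bool:
    """Return True if `route` is one of the comma-separated entries in `routes`."""
    return route in [r.strip() for r in routes.split(",")]


# The first two values come from the hint; the last one is hypothetical.
samples = ["B,D,6,V", "C,6", "A,6X"]

# Substring match, analogous to: routes like '%6%'
substring_hits = [s for s in samples if "6" in s]

# Exact token match on the split list
exact_hits = [s for s in samples if serves_route(s, "6")]

print(substring_hits)
print(exact_hits)
```

Whether the two approaches give different answers on the real table depends on which route labels actually occur in it.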

3. **After 9/11, the 'Battery Park' neighborhood was off limits for several days. How many people had to be evacuated?**

In [None]:
%%sql

select sum(popn_total)
from nyc_census_blocks b
inner join nyc_neighborhoods n
  on st_intersects(n.geom, b.geom)
where lower(n.name) like '%battery park%'

In [None]:
%%sql

select cast(sum((st_area(st_intersection(n.geom, b.geom)) / st_area(b.geom)) * b.popn_total) as int) weighted_population
from nyc_census_blocks b
inner join nyc_neighborhoods n
  on st_intersects(n.geom, b.geom)
where lower(n.name) like '%battery park%'
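
The two queries above differ in how they treat census blocks that only partly overlap the neighborhood: the first counts each intersecting block in full, while the second scales each block's population by the share of its area that falls inside the neighborhood. The sketch below reproduces that arithmetic in plain Python. The block figures are invented for illustration, and the approach assumes people are spread evenly within each block:

```python
def weighted_population(blocks):
    """Sum block populations scaled by the share of each block inside the neighborhood.

    `blocks` is an iterable of (population, block_area, intersection_area) tuples.
    """
    return sum(pop * inter / area for pop, area, inter in blocks)


# Hypothetical blocks: fully inside, half inside, and touching with zero shared area.
blocks = [
    (100, 2000.0, 2000.0),
    (80, 1000.0, 500.0),
    (50, 4000.0, 0.0),
]

simple_total = sum(pop for pop, _, _ in blocks)  # every intersecting block counted in full
weighted_total = weighted_population(blocks)     # blocks counted in proportion to overlap

print(simple_total, weighted_total)
```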

4. **What neighborhood has the highest population density (persons/km²)?**


In [None]:
%%sql

select n.name neighborhood, round(1000000 * sum(b.popn_total) / mean(st_area(n.geom)), 2) persons_per_km2
from nyc_census_blocks b
inner join nyc_neighborhoods n
  on st_intersects(n.geom, b.geom)
group by 1
order by 2 desc
limit 20
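
The factor of 1000000 in the query above converts the area before dividing. The sketch below shows the same conversion in Python, assuming (as the query does) that `ST_Area` returns square metres for this dataset. The population and area values are made up:

```python
def density_per_km2(population: float, area_m2: float) -> float:
    """Convert a population and an area in square metres to persons per km²."""
    return population / (area_m2 / 1_000_000)


# Hypothetical neighborhood: 5,000 people living on 0.25 km² (250,000 m²).
d = density_per_km2(5000, 250_000)
print(d)
```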

In [None]:
%%sql

-- Inspect the census blocks that are fully contained in a single neighborhood
select *
from nyc_census_blocks b
inner join nyc_neighborhoods n
  on st_contains(n.geom, b.geom)
where n.name = 'Annandale'
limit 20

When you're finished, you can check your answers [here](https://postgis.net/workshops/postgis-intro/joins_exercises.html).

# Ship-to-Ship Transfer Detection

Now for a less structured exercise. We're going to look at ship-to-ship transfers. The idea is that two ships meet up in the middle of the ocean, and one ship transfers cargo to the other. This is a common way to avoid sanctions, and is often used to transfer oil from sanctioned countries to other countries. We're going to look at a few different ways to detect these transfers using AIS data.

In [None]:
import pandas as pd

In [None]:
%sql duckdb:///:memory:

In [None]:
%%sql
INSTALL httpfs;
LOAD httpfs;

In [None]:
%%sql
INSTALL spatial;
LOAD spatial;

## Step 1

Create a spatial database using the following AIS data:

https://storage.googleapis.com/qm2/casa0025_ships.csv

Each row in this dataset is an AIS 'ping' indicating the position of a ship at a particular date/time, alongside vessel-level characteristics.

It contains the following columns:
* `vesselid`: A unique numerical identifier for each ship, like a license plate
* `vessel_name`: The ship's name
* `vsl_descr`: The ship's type
* `dwt`: The ship's Deadweight Tonnage (how many tons it can carry)
* `v_length`: The ship's length in meters
* `draught`: How many meters deep the ship is draughting (how low it sits in the water). Effectively indicates how much cargo the ship is carrying
* `sog`: Speed over Ground (in knots)
* `date`: A timestamp for the AIS signal
* `lat`: The latitude of the AIS signal (EPSG:4326)
* `lon`: The longitude of the AIS signal (EPSG:4326)

Create a table called 'ais' where each row is a different AIS ping, with no superfluous information. Construct a geometry column.

Create a second table called 'vinfo' which contains vessel-level information with no superfluous information.

You can set a spatial index on each of these tables as follows:

`CREATE INDEX index_name ON table_name USING RTREE(geom);`
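
One way to think about the split: ping-level columns change from row to row, while vessel-level columns repeat on every ping from the same ship. The Python sketch below illustrates this on a few invented rows. Only a subset of the columns is used, and the vessel names and values are hypothetical:

```python
# Hypothetical pings: two from vessel 1, one from vessel 2.
rows = [
    {"vesselid": 1, "vessel_name": "ALPHA", "sog": 0.4, "date": "2024-01-01 00:00"},
    {"vesselid": 1, "vessel_name": "ALPHA", "sog": 0.2, "date": "2024-01-01 00:10"},
    {"vesselid": 2, "vessel_name": "BRAVO", "sog": 9.8, "date": "2024-01-01 00:00"},
]

PING_COLS = ("vesselid", "sog", "date")    # vary per ping
VESSEL_COLS = ("vesselid", "vessel_name")  # constant per vessel

# One record per ping, keeping only the ping-level columns.
ais = [{c: r[c] for c in PING_COLS} for r in rows]

# One record per vessel: de-duplicate the vessel-level columns.
vinfo = sorted({tuple(r[c] for c in VESSEL_COLS) for r in rows})

print(len(ais), vinfo)
```

Keeping `vesselid` in both tables lets them be joined back together later.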

In [None]:
%%sql

create or replace table full_ships as select * from 'https://storage.googleapis.com/qm2/casa0025_ships.csv'

In [None]:
%%sql

select count(*) from full_ships

In [None]:
%%sql

select * from full_ships limit 10

In [None]:
%%sql

create or replace table ais as
select vesselid, draught, sog, date, geom from full_ships

In [None]:
%%sql

create or replace table vinfo as
select distinct vesselid, vessel_name, vsl_descr, dwt, v_length from full_ships

In [None]:
%%sql

select count(*) from vinfo

In [None]:
%%sql

select column_name, data_type
from information_schema.columns
where table_name = 'ais'

In [None]:
%%sql

-- Parse the WKT text into a geometry and reproject it from EPSG:4326 to EPSG:3857
create or replace table ais as
select * exclude(geom), st_transform(st_geomfromtext(geom), 'EPSG:4326', 'EPSG:3857') as geom from ais
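
Reprojecting to EPSG:3857 gives coordinates in metre-like units, which is what makes a numeric distance threshold usable later on. The sketch below shows the standard spherical Web Mercator forward formula and its scale factor in plain Python. It illustrates the projection and is not DuckDB's implementation. The sample coordinates are arbitrary:

```python
import math

R = 6378137.0  # sphere radius used by Web Mercator, in metres


def to_web_mercator(lon_deg: float, lat_deg: float):
    """Project lon/lat in degrees (EPSG:4326) to x/y in EPSG:3857 units."""
    x = R * math.radians(lon_deg)
    y = R * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


def mercator_scale(lat_deg: float) -> float:
    """Factor by which projected distances exceed ground distances at a latitude."""
    return 1 / math.cos(math.radians(lat_deg))


x0, y0 = to_web_mercator(0.0, 0.0)
scale_60 = mercator_scale(60.0)
print(x0, y0, scale_60)
```

Because the scale factor grows with latitude, a distance of 500 units in EPSG:3857 corresponds to roughly 500 m on the ground only near the equator, and to about half that at 60° latitude. Depending on where the ships in the dataset are, the threshold may need adjusting, or a local projected CRS may be a better fit.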

In [None]:
%%sql

CREATE INDEX ais_rtree ON ais USING RTREE(geom);

In [None]:
%%sql

select *, st_astext(geom) from ais limit 5

In [None]:
%%sql

select count(*) from ais

## Step 2

Use a spatial join to identify ship-to-ship transfers in this dataset.
Two ships are considered to be conducting a ship-to-ship transfer if:

* They are within 500 meters of each other,
* for more than two hours,
* and both are moving at less than 1 knot.

Some things to consider: make sure you're not joining ships with themselves. Try working with subsets of the data first while you try different things out.

In [None]:
%%sql

with ordered_vessels as (
  -- Turn each ping into an interval that lasts until the same vessel's next ping
  select vesselid, sog, date time_start,
    lead(date) over (
      partition by vesselid
      order by date
      ) time_end,
    geom
  from ais
),

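near_vessels as (
  -- Pair intervals from two different vessels that overlap in time while both
  -- vessels are slow and within 500 units of each other in the projected CRS
  select v1.vesselid v1, v2.vesselid v2,
    greatest(v1.time_start, v2.time_start) overlap_start,
    least(v1.time_end, v2.time_end) overlap_end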
  from ordered_vessels v1
  inner join ordered_vessels v2
    on v1.vesselid < v2.vesselid
    and v1.time_start < v2.time_end and v1.time_end > v2.time_start
    and st_dwithin(v1.geom, v2.geom, 500)
    and v1.sog < 1 and v2.sog < 1
),

lag_times as (
  -- Measure the gap between each overlap and the previous one for the same pair
  select v1, v2, overlap_start, overlap_end,
    overlap_start - lag(overlap_end) over (
      partition by v1, v2
      order by overlap_start
      ) gap
  from near_vessels
),

events as (
  -- Start a new event whenever there is a gap; the running sum numbers the events
  select v1, v2, overlap_start, overlap_end,
    sum(case
          when gap is null or gap > interval '0 minutes'
            then 1
          else 0
        end
    ) over (
        partition by v1, v2
        order by overlap_start
      ) event_group
  from lag_times
)

-- Collapse each event to one row and keep those lasting at least two hours
select v1 vessel_1, v2 vessel_2,
  min(overlap_start) start_time,
  max(overlap_end) end_time,
  max(overlap_end) - min(overlap_start) duration
from events
group by v1, v2, event_group
having max(overlap_end) - min(overlap_start) >= interval '2 hours'
order by 1,3
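
The last three steps of the query (measuring gaps, numbering events, and filtering by duration) follow a pattern often called "gaps and islands". The Python sketch below applies the same idea to a list of overlap intervals for a single pair of vessels. The timestamps are invented, and the function name `merge_events` is only illustrative:

```python
from datetime import datetime, timedelta


def merge_events(intervals, min_duration=timedelta(hours=2)):
    """Merge contiguous (start, end) intervals and keep the long ones.

    Intervals with no gap between them are joined into one event; events
    shorter than `min_duration` are dropped.
    """
    events = []
    for start, end in sorted(intervals):
        if events and start <= events[-1][1]:
            # No gap since the previous interval: extend the current event.
            events[-1][1] = max(events[-1][1], end)
        else:
            # A gap (or the first interval): start a new event.
            events.append([start, end])
    return [(s, e) for s, e in events if e - s >= min_duration]


t0 = datetime(2024, 1, 1, 0, 0)  # hypothetical start time
h = timedelta(hours=1)
intervals = [
    (t0, t0 + h),
    (t0 + h, t0 + 3 * h),      # contiguous with the first interval
    (t0 + 5 * h, t0 + 6 * h),  # separate event lasting only one hour
]

events = merge_events(intervals)
print(events)
```

Here the first two intervals merge into one three-hour event that passes the filter, while the isolated one-hour interval is dropped. The SQL version gets the same grouping with `lag` and a running `sum` because it has to work set-wise across every vessel pair at once.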