## Hands-On Workshop
## Large-Scale Geospatial Analytics With Wherobots, Neo4j, & The PyData Ecosystem

This workshop demonstrates how to use Wherobots, Neo4j, and the PyData ecosystem for large-scale geospatial analytics.

This notebook covers:

* Calculating Bird Species Range
* Building a graph of species interactions
* Raster and Overture examples (precipitation and national parks per species range)

In [1]:
!pip install neo4j

Collecting neo4j
  Downloading neo4j-5.18.0.tar.gz (198 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: neo4j
  Building wheel for neo4j (pyproject.toml) ... done
  Created wheel for neo4j: filename=neo4j-5.18.0-py3-none-any.whl size=273863 sha256=a48a28f811e31b0e3a6b88c5ba819af7f1b1106e23d1498f1602b6fd5babd577
  Stored in directory: /tmp/pip-ephem-wheel-cache-gz7_uizz/wheels/e7/e1/a0/dd7c19192f5383ff57d02a6c126cbfe4b7b2ae82f70c6994ce
Successfully built neo4j
Installing collected packages: neo4j
Successfully installed neo4j-5.18.0

In [2]:
from sedona.spark import *
import geopandas
import json

## Configure SedonaContext

In [3]:
# Configure SedonaContext, specify credentials for AWS S3 bucket(s) (optional)

config = SedonaContext.builder(). \
    config("spark.hadoop.fs.s3a.bucket.wherobots-examples.aws.credentials.provider","org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"). \
    getOrCreate()

sedona = SedonaContext.create(config)

24/03/21 20:50:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/03/21 20:50:24 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
24/03/21 20:50:25 WARN S3ABlockOutputStream: Application invoked the Syncable API against stream writing to qjnq6fcbf1/spark-logs/spark-3d43856e14234e96b8c27aa78801c942.inprogress. This is unsupported
24/03/21 20:50:48 WARN SedonaContext: Python files are not set. Sedona will not pre-load Python UDFs.


## Calculating Bird Species Range

We'll load a dataset of bird species observations, then calculate the range of each species using Spatial SQL.

Our data comes from [Bird Buddy](https://live.mybirdbuddy.com/), which makes a smart bird feeder that can identify bird species and (optionally) report their location.

![](https://wherobots.com/wp-content/uploads/2023/11/bird_buddy1.png)
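
The range calculation later in this section boils down to a convex hull around each species' observation points (`ST_ConvexHull` over the unioned points). As a rough, pure-Python illustration of that idea, and not Sedona's implementation, the sketch below uses Andrew's monotone chain algorithm. The function names and the sample coordinates are assumptions made up for illustration.

```python
def cross(o, a, b):
    """Z-component of the cross product OA x OB (positive means a counter-clockwise turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Return the hull vertices in counter-clockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        # Drop points that would make a clockwise or straight turn
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other
    return lower[:-1] + upper[:-1]


# Hypothetical (longitude, latitude) sightings for one species
sightings = [
    (-120.0, 34.0), (-116.0, 34.0), (-116.0, 38.0), (-120.0, 38.0),
    (-118.0, 36.0), (-119.0, 37.0),  # interior points, not part of the hull
]
hull = convex_hull(sightings)
print(hull)
```

Only the four outer sightings end up as hull vertices; the two interior ones are covered by the polygon without shaping it.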

In [4]:
BB_S3_URL = "s3://wherobots-examples/data/examples/birdbuddy_oct23.csv"
bb_df = sedona.read.format('csv').option('header','true').option('delimiter', ',').load(BB_S3_URL)
bb_df.show(5, truncate=False)
bb_df = bb_df.selectExpr(
    'ST_Point(CAST(anonymized_longitude AS float), CAST(anonymized_latitude AS float)) AS location',
    'CAST(timestamp AS timestamp) AS timestamp',
    'common_name',
    'scientific_name'
)
bb_df.createOrReplaceTempView('bb')
bb_df.show(15, truncate=False)

                                                                                

+---+-------------------+--------------------+-----------------------+--------------------+----------------------+
|_c0|anonymized_latitude|anonymized_longitude|timestamp              |common_name         |scientific_name       |
+---+-------------------+--------------------+-----------------------+--------------------+----------------------+
|10 |34.393112          |-118.59075          |2023-10-01 00:00:02.415|california scrub jay|aphelocoma californica|
|11 |34.393112          |-118.59075          |2023-10-01 00:00:02.415|california scrub jay|aphelocoma californica|
|26 |34.393112          |-118.59075          |2023-10-01 00:00:04.544|california scrub jay|aphelocoma californica|
|27 |34.393112          |-118.59075          |2023-10-01 00:00:04.544|california scrub jay|aphelocoma californica|
|34 |34.393112          |-118.59075          |2023-10-01 00:00:05.474|california scrub jay|aphelocoma californica|
+---+-------------------+--------------------+-----------------------+--------------------+----------------------+


+---------------------------------------------+-----------------------+--------------------+----------------------+
|location                                     |timestamp              |common_name         |scientific_name       |
+---------------------------------------------+-----------------------+--------------------+----------------------+
|POINT (-118.59075164794922 34.39311218261719)|2023-10-01 00:00:02.415|california scrub jay|aphelocoma californica|
|POINT (-118.59075164794922 34.39311218261719)|2023-10-01 00:00:02.415|california scrub jay|aphelocoma californica|
|POINT (-118.59075164794922 34.39311218261719)|2023-10-01 00:00:04.544|california scrub jay|aphelocoma californica|
|POINT (-118.59075164794922 34.39311218261719)|2023-10-01 00:00:04.544|california scrub jay|aphelocoma californica|
|POINT (-118.59075164794922 34.39311218261719)|2023-10-01 00:00:05.474|california scrub jay|aphelocoma californica|
|POINT (-118.59075164794922 34.39311218261719)|2023-10-01 00:00:05.474|c
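
The extra digits in the `location` column (for example `-118.59075164794922` for an input of `-118.59075`) are consistent with the `CAST(... AS float)` in the cell above: a 32-bit float cannot hold the value exactly, and the rounded value is then widened to a 64-bit double. The sketch below reproduces the effect with Python's `struct` module; `to_float32` is a hypothetical helper name. If that rounding matters for an analysis, casting to `double` instead is one option to consider.

```python
import struct


def to_float32(value):
    """Round a 64-bit Python float to the nearest 32-bit float, then widen it back."""
    return struct.unpack("f", struct.pack("f", value))[0]


lon, lat = -118.59075, 34.393112  # values from the first CSV row above
lon32, lat32 = to_float32(lon), to_float32(lat)

print(lon32, lat32)
print(abs(lon32 - lon), abs(lat32 - lat))  # rounding error in degrees
```

The error is on the order of a millionth of a degree, which is small next to the anonymization already applied to the coordinates.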

                                                                                

In [5]:
bb_df.printSchema()


root
 |-- location: geometry (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- common_name: string (nullable = true)
 |-- scientific_name: string (nullable = true)



In [6]:
bb_df.count()


                                                                                

9502150

In [7]:
range_df = sedona.sql("""
    SELECT common_name, COUNT(*) AS num, ST_ConvexHull(ST_Union_aggr(location)) AS geometry 
    FROM bb 
    WHERE common_name IN ('california towhee', 'steller’s jay', 'mountain chickadee', 'eastern bluebird', 'wood thrush', 'yellow headed blackbird', 'spot breasted oriole', 
      'red cockaded woodpecker', 'northern red bishop', 'red naped sapsucker', 'western meadowlark', 'lazuli bunting', 'clark’s nutcracker', 'gray crowned rosy finch', 'california quail',
      'boreal chickadee', 'acorn woodpecker', 'townsend’s warbler', 'gambel’s quail', 'scott’s oriole', 'cassin’s finch', 'brown headed nuthatch', 'pygmy nuthatch', 'pinyon jay', 'florida scrub jay') 
    GROUP BY common_name 
    ORDER BY num DESC
""")
range_df.show(30)



+--------------------+------+--------------------+
|         common_name|   num|            geometry|
+--------------------+------+--------------------+
|    eastern bluebird|250862|POLYGON ((-100.13...|
|       steller’s jay| 78258|POLYGON ((-99.162...|
|   california towhee| 48338|POLYGON ((-117.05...|
|    acorn woodpecker|  9838|POLYGON ((-103.89...|
|      cassin’s finch|  8860|POLYGON ((-103.89...|
|  mountain chickadee|  6014|POLYGON ((-110.99...|
|brown headed nuth...|  4048|POLYGON ((-80.345...|
|      pygmy nuthatch|  2888|POLYGON ((-110.99...|
|      gambel’s quail|  1578|POLYGON ((-110.99...|
|   florida scrub jay|  1410|POLYGON ((-81.813...|
|    california quail|   740|POLYGON ((-117.05...|
|          pinyon jay|   222|POLYGON ((-116.87...|
|  clark’s nutcracker|   134|POLYGON ((-110.21...|
|      scott’s oriole|    64|POLYGON ((-99.140...|
|gray crowned rosy...|    42|POLYGON ((-105.93...|
|    boreal chickadee|    34|POLYGON ((-71.016...|
|      lazuli bunting|    30|PO

                                                                                

In [8]:
range_df.createOrReplaceTempView("ranges")

In [9]:
range_df = range_df.cache()

In [10]:
range_df.count()

                                                                                

25

In [11]:
SedonaKepler.create_map(df=range_df, name="Bird species range")


User Guide: https://docs.kepler.gl/docs/keplergl-jupyter


KeplerGl(data={'Bird species range': {'index': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …

In [12]:
## Determine which species intersect

In [13]:
intersect_df = sedona.sql("""
    WITH birds AS (SELECT * FROM ranges)
    SELECT birds.common_name, ST_Centroid(any_value(birds.geometry)) AS centroid, collect_list(ranges.common_name) AS intersects
    FROM ranges, birds
    WHERE ST_Intersects(birds.geometry, ranges.geometry) AND NOT birds.common_name=ranges.common_name
    GROUP BY birds.common_name
""")
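
The query above is a self-join: every range is compared with every other range, pairs that do not intersect are dropped, and the remaining partner names are collected per species. A minimal pure-Python sketch of that pattern follows. It substitutes axis-aligned bounding boxes for the real hull polygons, so `boxes_intersect` is only a stand-in for `ST_Intersects`, and the three ranges are invented for illustration.

```python
from collections import defaultdict


def boxes_intersect(a, b):
    """True if two (min_x, min_y, max_x, max_y) boxes share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


# Hypothetical ranges as bounding boxes: (min_lon, min_lat, max_lon, max_lat)
ranges = {
    "species a": (-125.0, 32.0, -115.0, 42.0),
    "species b": (-120.0, 35.0, -100.0, 45.0),
    "species c": (-90.0, 25.0, -75.0, 35.0),
}

intersects = defaultdict(list)
for name, box in ranges.items():
    for other_name, other_box in ranges.items():
        # Same filter as the SQL: intersecting ranges, excluding self-pairs
        if name != other_name and boxes_intersect(box, other_box):
            intersects[name].append(other_name)

print(dict(intersects))
```

As with the SQL join, a species whose range overlaps nothing (here `species c`) does not appear in the result at all.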

In [14]:
intersect_df.show(truncate=False)

+-----------------------+----------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|common_name            |centroid                                      |intersects                                                                                                                                                                                                                                                                                                                                                                                                                         

In [15]:
intersect_df.createOrReplaceTempView("intersects")

In [16]:
birds_gdf = geopandas.GeoDataFrame(intersect_df.toPandas(), geometry="centroid")

In [17]:
birds_gdf.to_json()

'{"type": "FeatureCollection", "features": [{"id": "0", "type": "Feature", "properties": {"common_name": "steller\\u2019s jay", "intersects": ["northern red bishop", "eastern bluebird", "townsend\\u2019s warbler", "gambel\\u2019s quail", "scott\\u2019s oriole", "california towhee", "acorn woodpecker", "pinyon jay", "western meadowlark", "pygmy nuthatch", "cassin\\u2019s finch", "red naped sapsucker", "california quail", "lazuli bunting", "boreal chickadee", "clark\\u2019s nutcracker", "mountain chickadee", "gray crowned rosy finch"]}, "geometry": {"type": "Point", "coordinates": [-121.39289651191304, 44.28043436598875]}}, {"id": "1", "type": "Feature", "properties": {"common_name": "mountain chickadee", "intersects": ["northern red bishop", "eastern bluebird", "townsend\\u2019s warbler", "gambel\\u2019s quail", "scott\\u2019s oriole", "california towhee", "acorn woodpecker", "pinyon jay", "western meadowlark", "steller\\u2019s jay", "pygmy nuthatch", "cassin\\u2019s finch", "red naped 

## Build A Graph Of Bird Species Interactions

Next, we'll load our bird species data into Neo4j, creating a graph of bird species with overlapping ranges. This will allow us to answer questions related to ecology, disease transmission, and conservation.

### Define The Graph Model

The first step in building a graph is to define the graph data model. [Arrows.app](https://arrows.app) is a handy tool for sketching one out. In this case our data model is fairly simple:

![](https://wherobots.com/wp-content/uploads/2024/03/bird_species_range.png)
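
Judging by the load query later in this notebook, the model consists of `Species` nodes keyed by `common_name`, each with a `centroid` property, connected by undirected `RANGE_OVERLAP` relationships. The plain-Python sketch below mirrors that shape to show why an undirected relationship only needs to be stored once per pair. The `SpeciesGraph` class and the centroid value are hypothetical and are not part of Neo4j or its driver.

```python
class SpeciesGraph:
    """Tiny in-memory stand-in for Species nodes and RANGE_OVERLAP relationships."""

    def __init__(self):
        self.nodes = {}     # common_name -> dict of properties
        self.edges = set()  # frozenset({name_a, name_b}); direction is ignored

    def merge_species(self, common_name, **properties):
        # Like MERGE: create the node if it is missing, otherwise update it in place
        self.nodes.setdefault(common_name, {}).update(properties)

    def merge_overlap(self, name_a, name_b):
        self.merge_species(name_a)
        self.merge_species(name_b)
        self.edges.add(frozenset((name_a, name_b)))


graph = SpeciesGraph()
graph.merge_species("pinyon jay", centroid=(-112.0, 38.0))  # made-up centroid
graph.merge_overlap("pinyon jay", "pygmy nuthatch")
graph.merge_overlap("pygmy nuthatch", "pinyon jay")  # same pair, reversed

print(len(graph.nodes), len(graph.edges))
```

Adding the same pair in both directions still leaves a single relationship, which is the behavior the graph load relies on when each species lists the other in its `intersects` array.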

In [18]:
## Load into Neo4j

In [19]:
from neo4j import GraphDatabase

In [20]:
URI = "neo4j+s://<YOUR_NEO4J_URI_HERE>.neo4j.io"
AUTH = ("neo4j", "YOUR_PASSWORD_HERE")

In [21]:
neo4j_query = """
UNWIND $rows AS row
MERGE (s:Species {common_name: row.properties.common_name})
SET s.centroid = Point({longitude: row.geometry.coordinates[0], latitude: row.geometry.coordinates[1]})
WITH s, row
UNWIND row.properties.intersects AS bird
MERGE (b:Species {common_name: bird})
MERGE (s)-[:RANGE_OVERLAP]-(b)
RETURN COUNT(*) AS total
"""
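
To see how the Cypher query walks each GeoJSON feature, the sketch below replays its two `UNWIND` steps in plain Python: one pass over `$rows`, then one pass over each row's `intersects` list. The number of (species, overlapping species) pairs it yields plays the role of `RETURN COUNT(*) AS total`. The feature is a trimmed copy of the first feature from `birds_gdf.to_json()`, and `expand_rows` is a hypothetical helper, not part of the Neo4j driver.

```python
def expand_rows(rows):
    """Yield one (species, longitude, latitude, overlapping_species) tuple per pair."""
    for row in rows:                                  # UNWIND $rows AS row
        name = row["properties"]["common_name"]
        lon, lat = row["geometry"]["coordinates"][:2]
        for bird in row["properties"]["intersects"]:  # UNWIND ... AS bird
            yield name, lon, lat, bird


# Trimmed sample: the real feature lists 18 overlapping species
rows = [{
    "id": "0",
    "type": "Feature",
    "properties": {
        "common_name": "steller’s jay",
        "intersects": ["northern red bishop", "eastern bluebird", "pinyon jay"],
    },
    "geometry": {"type": "Point", "coordinates": [-121.39289651191304, 44.28043436598875]},
}]

pairs = list(expand_rows(rows))
print(len(pairs))  # plays the role of `total`
```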

In [22]:
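def insert_data(tx, query, rows, batch_size=1000):
    """Run `query` against `rows` in batches and return the summed `total` counts."""
    total = 0
    batch = 0
    while batch * batch_size < len(rows):
        # Progress output: current batch index and batch size
        print(batch)
        print(batch_size)
        # Serialize this slice of the GeoDataFrame to a list of GeoJSON features
        results = tx.run(query, parameters = {
            'rows': json.loads(rows[batch * batch_size : (batch + 1) * batch_size].to_json())['features']
        }).data()
        print(results)
        total += results[0]['total']
        batch += 1
    return total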

In [23]:
with GraphDatabase.driver(URI, auth=AUTH) as driver:
    driver.verify_connectivity()
    with driver.session() as session:
        session.execute_write(insert_data, neo4j_query, birds_gdf)

0
1000
[{'total': 362}]
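
`insert_data` sends rows to Neo4j in slices of `batch_size`. The output above shows a single batch because the GeoDataFrame has at most 25 rows, well under the default of 1000. The slicing arithmetic can be sketched on its own with a plain list; `iter_batches` is a hypothetical helper name.

```python
def iter_batches(rows, batch_size=1000):
    """Yield consecutive slices of `rows`, each holding at most `batch_size` items."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


rows = list(range(2500))  # stand-in for 2,500 GeoJSON features
sizes = [len(batch) for batch in iter_batches(rows)]
print(sizes)
```

The same slice syntax works on a GeoDataFrame, which is what `rows[batch * batch_size : (batch + 1) * batch_size]` relies on.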


In [24]:
json.loads(birds_gdf[0:1000].to_json())['features']

[{'id': '0',
  'type': 'Feature',
  'properties': {'common_name': 'steller’s jay',
   'intersects': ['northern red bishop',
    'eastern bluebird',
    'townsend’s warbler',
    'gambel’s quail',
    'scott’s oriole',
    'california towhee',
    'acorn woodpecker',
    'pinyon jay',
    'western meadowlark',
    'pygmy nuthatch',
    'cassin’s finch',
    'red naped sapsucker',
    'california quail',
    'lazuli bunting',
    'boreal chickadee',
    'clark’s nutcracker',
    'mountain chickadee',
    'gray crowned rosy finch']},
  'geometry': {'type': 'Point',
   'coordinates': [-121.39289651191304, 44.28043436598875]}},
 {'id': '1',
  'type': 'Feature',
  'properties': {'common_name': 'mountain chickadee',
   'intersects': ['northern red bishop',
    'eastern bluebird',
    'townsend’s warbler',
    'gambel’s quail',
    'scott’s oriole',
    'california towhee',
    'acorn woodpecker',
    'pinyon jay',
    'western meadowlark',
    'steller’s jay',
    'pygmy nuthatch',
    'cassi

## Raster Examples

### Night Time Visible Light

- [ ] calculate average visible light per range
- [ ] calculate change in average visible light per range

### Precipitation

In [43]:
PREC_URL = "s3://wherobots-examples/data/examples/world_clim/wc2.1_10m_prec" #/wc2.1_10m_prec_01.tif
rawDf = sedona.read.format("binaryFile").load(PREC_URL + "/*.tif")
rawDf.createOrReplaceTempView("rawdf")

In [44]:
rasterDf = sedona.sql("""
SELECT 
  RS_FromGeoTiff(content) AS raster, 
  Int(regexp_extract(path, '(.*)([0-9]{2}).tif', 2)) AS month
FROM rawdf
""")

rasterDf.createOrReplaceTempView("prec")
rasterDf.printSchema()

root
 |-- raster: raster (nullable = true)
 |-- month: integer (nullable = true)
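
The `month` column comes from the file name: the precipitation files end in a two-digit month, as in `wc2.1_10m_prec_01.tif`. The sketch below applies the same pattern with Python's `re` module to show which capture group holds the month. Spark's `regexp_extract` uses Java regular expressions, but for a pattern this simple the two engines should behave alike. `month_from_path` is a hypothetical helper.

```python
import re

# Same pattern as in the SQL above; group 2 captures the two digits before the extension
PATTERN = re.compile(r"(.*)([0-9]{2}).tif")


def month_from_path(path):
    """Return the two-digit month before the file extension as an int, or None."""
    match = PATTERN.search(path)
    return int(match.group(2)) if match else None


path = "s3://wherobots-examples/data/examples/world_clim/wc2.1_10m_prec/wc2.1_10m_prec_01.tif"
print(month_from_path(path))
```

Because the first group is greedy, the match settles on the last two digits before the extension rather than the `10` in `10m`. The unescaped `.` before `tif` matches any character; writing it as `\.` would make the pattern stricter.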



In [45]:
world_prec_df = sedona.sql("""
SELECT 
  sum(RS_ZonalStats(prec.raster, ranges.geometry, 1, 'avg', true)) AS yearly_avg_prec,  
  any_value(ranges.geometry), 
  ranges.common_name
FROM prec, ranges
GROUP BY common_name
ORDER BY yearly_avg_prec DESC
""")
world_prec_df.dropna().show()



+------------------+--------------------+--------------------+
|   yearly_avg_prec| any_value(geometry)|         common_name|
+------------------+--------------------+--------------------+
|1361.4046647230323|POLYGON ((-80.109...|         wood thrush|
|  1310.49101796407|POLYGON ((-80.345...|brown headed nuth...|
|1279.8888888888891|POLYGON ((-80.178...|spot breasted oriole|
|1278.6771653543308|POLYGON ((-81.813...|   florida scrub jay|
|1263.5545851528384|POLYGON ((-80.201...|red cockaded wood...|
|1251.8688969258599|POLYGON ((-82.524...|yellow headed bla...|
| 882.7875723792295|POLYGON ((-105.93...|gray crowned rosy...|
|  820.565896667247|POLYGON ((-100.13...|    eastern bluebird|
|  708.073190789473|POLYGON ((-117.05...|   california towhee|
| 696.5074055856118|POLYGON ((-99.162...|       steller’s jay|
| 575.8028245518764|POLYGON ((-71.016...|    boreal chickadee|
| 568.8380872483158|POLYGON ((-117.05...|    california quail|
| 512.3707884239993|POLYGON ((-110.99...|  mountain chi
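
`yearly_avg_prec` is the sum, over the monthly rasters, of the average pixel value inside each species range. The toy example below reproduces that arithmetic on tiny hand-made grids, with a boolean mask standing in for the range polygon. It only illustrates the aggregation, not how `RS_ZonalStats` decides which pixels belong to a geometry; all names and numbers are invented.

```python
def zonal_mean(grid, mask):
    """Average of the grid cells whose mask entry is True."""
    values = [
        value
        for grid_row, mask_row in zip(grid, mask)
        for value, inside in zip(grid_row, mask_row)
        if inside
    ]
    return sum(values) / len(values)


# Mask marking which cells of a 2x3 grid fall inside the (hypothetical) range
mask = [
    [True, True, False],
    [False, True, False],
]

# Two made-up monthly precipitation grids; a real year would have twelve
monthly_grids = [
    [[10.0, 20.0, 99.0], [99.0, 30.0, 99.0]],
    [[40.0, 50.0, 99.0], [99.0, 60.0, 99.0]],
]

total = sum(zonal_mean(grid, mask) for grid in monthly_grids)
print(total)
```

Cells outside the mask (the `99.0` values) never enter the averages, so only precipitation inside the range contributes to the yearly figure.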

                                                                                

In [46]:
# TODO: visualization by species
# TODO: load into Neo4j

## Overture Example

In [47]:
# find national parks within each species range

In [48]:
sedona.table('wherobots_open_data.overture_2024_02_15.base_landUse').printSchema()

root
 |-- id: string (nullable = true)
 |-- geometry: geometry (nullable = true)
 |-- bbox: struct (nullable = true)
 |    |-- minx: double (nullable = true)
 |    |-- maxx: double (nullable = true)
 |    |-- miny: double (nullable = true)
 |    |-- maxy: double (nullable = true)
 |-- subType: string (nullable = true)
 |-- names: struct (nullable = true)
 |    |-- primary: string (nullable = true)
 |    |-- common: map (nullable = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)
 |    |-- rules: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- variant: string (nullable = true)
 |    |    |    |-- language: string (nullable = true)
 |    |    |    |-- value: string (nullable = true)
 |    |    |    |-- at: array (nullable = true)
 |    |    |    |    |-- element: double (containsNull = true)
 |    |    |    |-- side: string (nullable = true)
 |-- wikidata: string (nullable = true)
 |-- version: integ

In [49]:
park_df = sedona.sql("""
SELECT names.primary AS park, collect_list(common_name) AS species, any_value(land_use.geometry) AS geometry
FROM wherobots_open_data.overture_2024_02_15.base_landUse AS land_use, ranges
WHERE ST_Intersects(ranges.geometry, land_use.geometry)
AND land_use.class = 'nationalPark'
GROUP BY park
""")

In [50]:
park_df.show()

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

+--------------------+--------------------+--------------------+
|                park|             species|            geometry|
+--------------------+--------------------+--------------------+
|Alonsa Wildlife M...|  [boreal chickadee]|MULTIPOLYGON (((-...|
|Alta Lake State Park|[steller’s jay, p...|MULTIPOLYGON (((-...|
|Arches National Park|[eastern bluebird...|MULTIPOLYGON (((-...|
|Arrow Lakes Provi...|[steller’s jay, c...|MULTIPOLYGON (((-...|
|Auburn State Recr...|[townsend’s warbl...|POLYGON ((-120.80...|
|Babine Mountains ...|[steller’s jay, g...|MULTIPOLYGON (((-...|
|Babine Mountains ...|[steller’s jay, g...|MULTIPOLYGON (((-...|
|Beaumont Provinci...|[steller’s jay, g...|MULTIPOLYGON (((-...|
|Benjamin Franklin...|  [eastern bluebird]|POLYGON ((-75.174...|
|Bligh Island Prov...|[steller’s jay, g...|MULTIPOLYGON (((-...|
|  Bluff Lake Reserve|[northern red bis...|POLYGON ((-116.96...|
|Bogachiel State Park|[steller’s jay, m...|MULTIPOLYGON (((-...|
|Bronte Creek Prov...|  [

                                                                                

In [51]:
park_df.count()

                                                                                

2227

In [52]:
park_map = SedonaKepler.create_map()
SedonaKepler.add_df(park_map, park_df, name="Parks")
SedonaKepler.add_df(park_map, range_df, name="Birds")
park_map

User Guide: https://docs.kepler.gl/docs/keplergl-jupyter


                                                                                

KeplerGl(data={'Parks': {'index': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 2…

In [None]:
# TODO: load into Neo4j