# Analyzing Land Cover data

This notebook performs an analysis of [NLCD](https://catalog.data.gov/dataset/national-land-cover-database-nlcd-land-cover-collection)
data over a state.

In [None]:
import geopyspark as gps
from pyspark import SparkContext
from shapely.geometry import mapping, shape
from geonotebook.wrappers import TMSRasterData, GeoJsonData
import pyproj
from shapely.ops import transform
from functools import partial
import os, urllib.request, json
import numpy as np

### Setup: State data and Spark initialization

The next two cells fetch the boundary shapes for our state and one of its counties, then start the Spark context.

In [None]:
# Grab data for New Jersey (and Hunterdon County)
state_name, county_name = "NJ", "Hunterdon"
def get_state_shapes(state, county):
    project = partial(
        pyproj.transform,
        pyproj.Proj(init='epsg:4326'),
        pyproj.Proj(init='epsg:3857'))

    state_url = "https://raw.githubusercontent.com/johan/world.geo.json/master/countries/USA/{}.geo.json".format(state)
    county_url = "https://raw.githubusercontent.com/johan/world.geo.json/master/countries/USA/{}/{}.geo.json".format(state,county)
    read_json = lambda url: json.loads(urllib.request.urlopen(url).read().decode("utf-8"))
    state_ll = shape(read_json(state_url)['features'][0]['geometry'])
    state_wm = transform(project, state_ll)
    county_ll = shape(read_json(county_url)['features'][0]['geometry'])
    county_wm = transform(project, county_ll)
    return (state_ll, state_wm, county_ll, county_wm)

(state_ll, state_wm, county_ll, county_wm) = get_state_shapes(state_name, county_name) 
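
The `project` function above reprojects longitude/latitude coordinates (EPSG:4326) to Web Mercator (EPSG:3857), which is the projection of the tiles we query later. As a rough sketch of what that reprojection does, the standard spherical Mercator formulas can be written by hand. The helper name `lonlat_to_web_mercator` and the sample coordinates are illustrative assumptions and are not part of this notebook's pipeline.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS84 semi-major axis, used by EPSG:3857


def lonlat_to_web_mercator(lon_deg, lat_deg):
    """Project a longitude/latitude pair (degrees) to EPSG:3857 metres."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


# Illustrative point roughly in the middle of New Jersey.
x, y = lonlat_to_web_mercator(-74.5, 40.0)
print(round(x), round(y))
```

In the notebook itself `pyproj` does this work for every vertex of the state and county polygons; the sketch only shows why the resulting coordinates are large metre values rather than degrees.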

In [None]:
# Set up our Spark context
conf = gps.geopyspark_conf(appName="Landsat") \
          .setMaster("local[*]") \
          .set(key='spark.ui.enabled', value='true') \
          .set(key="spark.driver.memory", value="8G") \
          .set("spark.hadoop.yarn.timeline-service.enabled", False)
sc = SparkContext(conf=conf)

# View NLCD from GeoTrellis Catalog

In [None]:
nlcd_layer_name = "nlcd-zoomed-256"
nlcd_color_map = gps.ColorMap.nlcd_colormap()
tms_server = gps.TMS.build(("s3://datahub-catalogs-us-east-1", nlcd_layer_name), 
                           display=nlcd_color_map)
M.add_layer(TMSRasterData(tms_server), name="nlcd")

In [None]:
p = state_ll.centroid
M.set_center(p.x, p.y, 7)

# Read State NLCD Tiles

In [None]:
layer = gps.query("s3://datahub-catalogs-us-east-1", 
                      nlcd_layer_name, 
                      layer_zoom=13, 
                      query_geom=state_wm,
                      num_partitions=100)

We can now get the minimum and maximum values of our data.
This is a Spark "action": it executes the directed acyclic graph (DAG)
of operations behind the layer's RDD and returns the result
to the driver program, and from there to our notebook.

In [None]:
layer.get_min_max()
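
To make the idea of an action more concrete, here is a minimal sketch in plain Python and NumPy of how a min/max could be computed as a per-tile map followed by a reduce. It is not GeoPySpark's implementation. The NoData value of -128 and the choice to skip NoData cells are assumptions made for this sketch.

```python
from functools import reduce

import numpy as np

NODATA = -128  # assumed NoData value, matching the check used later in this notebook


def tile_min_max(cells):
    """Return (min, max) of the valid cells in one tile, or None if it is all NoData."""
    valid = cells[cells != NODATA]
    if valid.size == 0:
        return None
    return int(valid.min()), int(valid.max())


def merge_min_max(a, b):
    """Combine two per-tile results, skipping empty tiles."""
    if a is None:
        return b
    if b is None:
        return a
    return min(a[0], b[0]), max(a[1], b[1])


tiles = [
    np.array([[11, 21], [NODATA, 41]], dtype=np.int8),
    np.array([[82, 95], [90, 42]], dtype=np.int8),
    np.full((2, 2), NODATA, dtype=np.int8),
]

# `map` is lazy in Python 3, so no tile is processed until `reduce` pulls values,
# loosely mirroring how a Spark action triggers the pending transformations.
overall = reduce(merge_min_max, map(tile_min_max, tiles))
print(overall)  # (11, 95)
```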

### Performing Map Algebra

We can do simple map algebra operations, such as addition, 
between our layer and a scalar, or between it and another layer.
For example:

In [None]:
(layer + 10).get_min_max()

In [None]:
(layer + (layer * 0.1)).get_min_max()
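
Conceptually, these operations are applied cell by cell. The NumPy sketch below shows the same two expressions on a single small array standing in for a tile. It only illustrates the mechanics: adding numbers to categorical NLCD codes has no real meaning, and how GeoPySpark handles cell types and NoData is not modelled here.

```python
import numpy as np

# A tiny stand-in for one tile of land cover codes.
tile = np.array([[11, 21], [41, 82]], dtype=np.int16)

# Layer + scalar: every cell is shifted by the same amount.
plus_ten = tile + 10

# Layer + layer: cells are combined position by position.
# Multiplying by 0.1 promotes the result to floating point.
blended = tile + (tile * 0.1)

print(plus_ten.min(), plus_ten.max())  # 21 92
print(blended.dtype, blended.min(), blended.max())
```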

### Pyramiding and viewing our layer on the map

Here we pyramid our layer and set up a TMS server for it.
Notice that we call `repartition` before `pyramid`: setting a partitioner
on the layer makes key lookups more efficient, and key lookups are how
individual tiles are pulled out and served by the TMS server.

To render our layer, we are using a ColorMap built into GeoPySpark that maps
NLCD values to their appropriate colors according to the [legend supplied by USGS](https://www.mrlc.gov/nlcd06_leg.php).

In [None]:
pyramid = layer.repartition(100).pyramid()
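
A pyramid is a stack of progressively coarser copies of the layer, one per zoom level, so that the map can be served quickly at any zoom. The sketch below builds such a stack for a single array by keeping every second cell (nearest-neighbour sampling). This is an assumption chosen because it preserves categorical codes; the resampling method GeoPySpark actually uses, and the way it merges neighbouring tiles, may differ.

```python
import numpy as np


def build_pyramid(cells, min_size=1):
    """Return a list of arrays, each half the resolution of the previous one.

    Uses nearest-neighbour sampling (every second cell), which keeps
    categorical codes such as NLCD classes intact instead of averaging them.
    """
    levels = [cells]
    while min(levels[-1].shape) // 2 >= min_size:
        levels.append(levels[-1][::2, ::2])
    return levels


base = np.arange(64, dtype=np.int16).reshape(8, 8)
levels = build_pyramid(base)
print([level.shape for level in levels])  # [(8, 8), (4, 4), (2, 2), (1, 1)]
```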

In [None]:
tms_server = gps.TMS.build(pyramid, 
                           display=gps.ColorMap.nlcd_colormap())

In [None]:
for l in M.layers:
    M.remove_layer(l)
    
M.add_layer(TMSRasterData(tms_server), name="nlcd")

### Masking our layer

You may notice that our layer does not exactly match our state boundary. We can verify this by placing our state on the map as well:

In [None]:
M.add_layer(GeoJsonData(mapping(state_ll)), name="poly")

This is because the query does not mask by default: it returns every tile that intersects our geometry, and some cells of those tiles lie outside the state.

To get our data tight to our state boundary, we can mask our layer.

In [None]:
masked = layer.mask(geometries=state_wm)
masked_pyramid = masked.repartition(100).cache().pyramid()
tms_server = gps.TMS.build(masked_pyramid, 
                           display=gps.ColorMap.nlcd_colormap())
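
At the level of a single tile, masking amounts to replacing every cell outside the geometry with NoData. The following NumPy sketch shows that idea with a disc standing in for the rasterized state boundary. The NoData value of -128, the helper `mask_tile`, and the disc are all assumptions made for illustration.

```python
import numpy as np

NODATA = -128  # assumed NoData value for this sketch


def mask_tile(cells, inside):
    """Keep cells where `inside` is True and set the rest to NoData."""
    return np.where(inside, cells, NODATA).astype(cells.dtype)


cells = np.full((5, 5), 41, dtype=np.int8)

# Hypothetical stand-in for a rasterized boundary: a disc of radius 2 cells
# around the centre of the tile.
rows, cols = np.indices(cells.shape)
inside = (rows - 2) ** 2 + (cols - 2) ** 2 <= 4

masked_cells = mask_tile(cells, inside)
print(masked_cells)
```

The corner cells come out as NoData while the centre keeps its original value, which is why the masked layer follows the state outline instead of the rectangular tile edges.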

In [None]:
for l in M.layers:
    M.remove_layer(l)
M.add_layer(TMSRasterData(tms_server), name="nlcd")

## Finding the most common land cover types in our state

Here we find the land cover categories that have the largest number of cells assigned to them in our state.

The first step is to convert our masked layer to a NumPy RDD. This lets us treat our tiles as a true PySpark RDD, where both the keys and the values are native Python types. Each value is a `gps.Tile`, whose `cells` field holds a NumPy array, so we can use NumPy directly to work with the raster data.

In [None]:
rdd = masked.to_numpy_rdd()
rdd.first()

Next we map over our tiles to get the counts of every category per tile, and then reduce over the RDD to aggregate the counts. This gives us the total counts per category over the entire state.

In [None]:
def get_counts(tile):
    values, counts = np.unique(tile.cells.flatten(), return_counts=True)
    d = {}
    for v, c in zip(values, counts):
        if v != -128: # Remove NoData
            d[v] = c
    return d

def merge_counts(d1, d2):
    d = {}
    for k in set(d1.keys()).union(set(d2.keys())):
        v = 0
        if k in d1:
            v += d1[k]
        if k in d2:
            v += d2[k]
        d[k] = v
    return d

counts = rdd.map(lambda x: get_counts(x[1])).reduce(merge_counts)
counts
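
The same map-then-reduce pattern can be tried out without Spark. The sketch below uses `collections.Counter`, whose `+` operator sums matching keys and so plays the role of `merge_counts`. The two small arrays are made-up tiles, and the names `count_classes` and `total` are illustrative only.

```python
from collections import Counter
from functools import reduce

import numpy as np

NODATA = -128  # same NoData value that get_counts filters out


def count_classes(cells):
    """Count how many cells of each land cover class appear in one tile."""
    values, counts = np.unique(cells, return_counts=True)
    return Counter({int(v): int(c) for v, c in zip(values, counts) if v != NODATA})


# Made-up tiles for illustration.
tiles = [
    np.array([[41, 41], [82, NODATA]], dtype=np.int8),
    np.array([[82, 82], [11, 41]], dtype=np.int8),
]

# Counter addition sums matching keys, so it can serve as the reduce function.
total = reduce(lambda a, b: a + b, map(count_classes, tiles))
print(dict(total))
```

Converting the keys and counts to plain `int` also avoids carrying NumPy scalar types into the final dictionary.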

The reduce returns a plain Python dictionary; we are no longer working with RDDs and can operate on the data however we wish. For example, we can turn it into a pandas DataFrame and plot the values, which makes it easy to see which land cover categories are most common in our state:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

labels = { 0: 'NoData',
          11: 'Open Water',
          12: 'Perennial Ice/Snow',
          21: 'Developed, Open Space',
          22: 'Developed, Low Intensity',
          23: 'Developed, Medium Intensity',
          24: 'Developed, High Intensity',
          31: 'Barren Land (Rock/Sand/Clay)',
          41: 'Deciduous Forest',
          42: 'Evergreen Forest',
          43: 'Mixed Forest',
          52: 'Shrub/Scrub',
          71: 'Grassland/Herbaceous',
          81: 'Pasture/Hay',
          82: 'Cultivated Crops',
          90: 'Woody Wetlands',
          95: 'Emergent Herbaceous Wetlands'}
named_counts = {}
for k in counts:
    named_counts[labels[k]] = counts[k]

df = pd.DataFrame.from_dict(named_counts, orient='index')
df

In [None]:
plt.figure()
df.plot.bar(legend=False)
plt.show()
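
Raw cell counts are hard to compare at a glance, so it can help to rank the categories and express each as a share of the total. The sketch below does this in plain Python. The counts in `example_counts` are made up for illustration and are not results from this notebook, and `class_shares` is a hypothetical helper.

```python
def class_shares(counts):
    """Return (class, percent) pairs sorted from most to least common."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [(name, 100.0 * n / total) for name, n in ranked]


# Made-up counts for illustration only.
example_counts = {"Deciduous Forest": 500, "Cultivated Crops": 300, "Open Water": 200}

for name, pct in class_shares(example_counts):
    print("{:<20} {:5.1f}%".format(name, pct))
```

Passing `named_counts` instead of `example_counts` would give the ranking for the real data.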

## Viewing the "Cultivated Crops" category on the map

In [None]:
cultivated_land_colormap = gps.ColorMap.build(breaks={82: 0x00FF00FF},
                                              classification_strategy=gps.ClassificationStrategy.EXACT,
                                              fallback=0x00000000)    
tms_server = gps.TMS.build(masked_pyramid, 
                           display=cultivated_land_colormap)
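
With an exact classification strategy, only cells whose value equals a break are given that break's color, and everything else gets the fallback. A minimal NumPy sketch of that lookup is shown below; `apply_exact_colormap` is a hypothetical helper written for illustration and is not how GeoPySpark renders tiles internally.

```python
import numpy as np


def apply_exact_colormap(cells, breaks, fallback):
    """Map each cell to the RGBA value of its exact class, or to `fallback`."""
    rgba = np.full(cells.shape, fallback, dtype=np.uint32)
    for value, color in breaks.items():
        rgba[cells == value] = color
    return rgba


cells = np.array([[82, 41], [11, 82]], dtype=np.int8)
rgba = apply_exact_colormap(cells, breaks={82: 0x00FF00FF}, fallback=0x00000000)

# Read as RGBA, 0x00FF00FF is opaque green and 0x00000000 is fully transparent.
print([[hex(int(v)) for v in row] for row in rgba])
```

This is why only the "Cultivated Crops" cells (class 82) show up on the map, with all other classes left transparent.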

In [None]:
for l in M.layers:
    M.remove_layer(l)
M.add_layer(TMSRasterData(tms_server), name="nlcd")