# Geoprocessing

---

This notebook includes the code to spatially join the Nightfire data points to the Basin Regions. 

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Geoprocessing" data-toc-modified-id="Geoprocessing-1">Geoprocessing</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#What-is-Geoprocessing?" data-toc-modified-id="What-is-Geoprocessing?-1.0.1">What is Geoprocessing?</a></span></li><li><span><a href="#Geoprocessing-Library-Selection" data-toc-modified-id="Geoprocessing-Library-Selection-1.0.2">Geoprocessing Library Selection</a></span><ul class="toc-item"><li><span><a href="#Geopandas" data-toc-modified-id="Geopandas-1.0.2.1">Geopandas</a></span></li></ul></li></ul></li><li><span><a href="#Imports" data-toc-modified-id="Imports-1.1">Imports</a></span></li><li><span><a href="#Code-from-Notebook-1,-needed-for-in-memory-objects-for-Geoprocessing-steps-below" data-toc-modified-id="Code-from-Notebook-1,-needed-for-in-memory-objects-for-Geoprocessing-steps-below-1.2">Code from Notebook 1, needed for in-memory objects for Geoprocessing steps below</a></span><ul class="toc-item"><li><span><a href="#Glob-to-collect-the-Nightfire-files-from-2012-to-present-day" data-toc-modified-id="Glob-to-collect-the-Nightfire-files-from-2012-to-present-day-1.2.1">Glob to collect the Nightfire files from 2012 to present day</a></span></li></ul></li><li><span><a href="#Point-in-Polygon-Counts-via-Spatial-Join" data-toc-modified-id="Point-in-Polygon-Counts-via-Spatial-Join-1.3">Point-in Polygon Counts via Spatial Join</a></span><ul class="toc-item"><li><span><a href="#Join-Features" data-toc-modified-id="Join-Features-1.3.1">Join Features</a></span></li><li><span><a href="#Execute-the-Spatial-Join" data-toc-modified-id="Execute-the-Spatial-Join-1.3.2">Execute the Spatial Join</a></span></li></ul></li></ul></li><li><span><a href="#Next-Notebook" data-toc-modified-id="Next-Notebook-2">Next Notebook</a></span></li></ul></div>

### What is Geoprocessing?

Very broadly, **Geoprocessing** is any operation involving geospatial data or methods. The Geographic Information Systems (GIS) software company Esri refers to it as "[Computing with geographic data.](http://webhelp.esri.com/arcgisdesktop/9.3/index.cfm?TopicName=Comparing_Geoprocessing_and_Spatial_Analysis)" The term is often used interchangeably, and incorrectly, with [Spatial Analysis](http://webhelp.esri.com/arcgisdesktop/9.3/index.cfm?TopicName=Comparing_Geoprocessing_and_Spatial_Analysi). However, Spatial Analysis also includes interpreting the results of Geoprocessing. Spatial Analysis has more in common with the **[Data Science Process](https://medium.springboard.com/the-data-science-process-the-complete-laymans-guide-to-what-a-data-scientist-actually-does-ca3e166b7c67)**, while Geoprocessing has more in common with the joining, grouping, and aggregating steps of a typical data science project workflow.

Professor Jochen Albrecht defines Geoprocessing as:
> _...any GIS operation used to manipulate data. A typical geoprocessing operation takes an input dataset, performs an operation on that dataset, and returns the result of the operation as an output dataset, also referred to as derived data. Common geoprocessing operations are geographic feature overlay, feature selection and analysis, topology processing, and data conversion. Geoprocessing allows you to define, manage, and analyze geographic information used to make decisions._ [Jochen Albrecht](http://www.geography.hunter.cuny.edu/~jochen/GTECH361/lectures/lecture12/concepts/01%20What%20is%20geoprocessing.htm)

### Geoprocessing Library Selection

There are several Geoprocessing libraries and technologies. A few notable ones, all open source except ArcPy, are:

* [PostGIS](https://postgis.net/) - Spatially Enabled PostgreSQL
* [GeoPandas](https://geopandas.org/) - Pandas Extended with [Shapely](https://shapely.readthedocs.io/en/latest/)
* [ArcPy](https://pro.arcgis.com/en/pro-app/arcpy/get-started/what-is-arcpy-.htm) - Esri's proprietary Python site package
* [PyQGIS](https://docs.qgis.org/testing/en/docs/pyqgis_developer_cookbook/) - [QGIS](https://www.qgis.org/en/site/)'s Python Package


#### Geopandas

This project uses [GeoPandas](https://geopandas.org/) because it fits easily into the `pandas` environment and the data science workflow. Additionally, apart from reading and writing geospatial files and objects, this project needs only one geoprocessing function, the [spatial join](https://geopandas.org/reference/geopandas.sjoin.html).

> _GeoPandas is an open source project to make working with geospatial
data in python easier.  GeoPandas extends the datatypes used by
`pandas` to allow spatial operations on geometric types.  Geometric
operations are performed by `shapely`.  Geopandas further depends on
`fiona` for file access and `descartes` and `matplotlib` for plotting._ Source: [GeoPandas](https://geopandas.org/) 

**Description**

> _The goal of GeoPandas is to make working with geospatial data in
python easier.  It combines the capabilities of pandas and shapely,
providing geospatial operations in pandas and a high-level interface
to multiple geometries to shapely.  GeoPandas enables you to easily do
operations in python that would otherwise require a spatial database
such as PostGIS._ Source: [GeoPandas](https://geopandas.org/) 



## Imports

In [1]:
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)
    
%load_ext autoreload
%autoreload 2

In [2]:
import glob
import geopandas as gpd
import pandas as pd
import seaborn as sns
import missingno as msno
import matplotlib.pyplot as plt

In [3]:
from tools.tools import read_json, get_current_time
from capstone.etl.viirs_join_basins import viirs_join_basins, compile_basin_data
from capstone.etl.census_parse import parse_census
from capstone.etl.census_retrieval import census_retrieval
from capstone.etl.generate_basins import generate_us_basins
from capstone.etl.eia_retrieval import eia_retrieval
from capstone.etl.eia_parse import eia_parse_county, eia_parse_data

In [4]:
config = read_json('../capstone/config.json')

current_date = get_current_time('yyyymmdd')

wd = f"{config['workspace_directory']}/data"

In [5]:
plt.style.use('ggplot')

In [6]:
basin_colors_hex = {  # manually defined dictionary of EIA basin-level standardized colors 
    "Anadarko Region":    "#2BA2CF", 
    "Appalachia Region":  "#769F5D",
    "Bakken Region":      "#F6C432", 
    "Eagle Ford Region":  "#48366B", 
    "Haynesville Region": "#807B8F",
    "Niobrara Region":    "#9D3341",
    "Permian Region":     "#6F4B27",
}

## Code from Notebook 1, needed for in-memory objects for Geoprocessing steps below

In [7]:
census_shp = census_retrieval(f"{wd}/input/census")
census = gpd.read_file(census_shp)
census.columns = [c.lower() for c in census.columns]

eia_xls = eia_retrieval(f"{wd}/input/eia")
eia_cnty = eia_parse_county(eia_xls)
eia_data = eia_parse_data(eia_xls)  # parse the target variable(s) data

census_gdf = parse_census(census_shp)
basins_list, all_basins = generate_us_basins(
    census_gdf,
    eia_cnty,
    f"{wd}/input/basins",
)  # this code creates individual files for basin geographies as well as an all_basins geography file/object.

 parse eia data
    for anadarko region
    for appalachia region
    for bakken region
    for eagle ford region
    for haynesville region
    for niobrara region
    for permian region
generating us basins
    permian region
    appalachia region
    haynesville region
    eagle ford region
    anadarko region
    niobrara region
    bakken region


### Glob to collect the Nightfire files from 2012 to present day

In [8]:
# get lists of all the retrieved viirs data for both 2.1c and 3.0 viirs

viirs_2_1c_files = glob.glob(f"{wd}/input/viirs21c/*.csv")  # get viirs
viirs_2_1c_files.sort()  # sort so dates are consecutive for tracking

print(f'Total 2.1c files: {len(viirs_2_1c_files)}')

viirs_3_0_files = glob.glob(f"{wd}/input/viirs30/*.csv")  # get viirs files
viirs_3_0_files.sort()  # sort so dates are consecutive for tracking

print(f'Total 3.0 files: {len(viirs_3_0_files)}')

Total 2.1c files: 2095
Total 3.0 files: 833
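
The explicit sort matters because `glob.glob` makes no guarantee about the order of the paths it returns. The following is a minimal sketch, assuming file names that embed a zero-padded `yyyymmdd` date (the names are hypothetical and may not match the actual Nightfire naming scheme), showing that a plain lexicographic sort then yields chronological order:

```python
import glob
import os
import tempfile

# Hypothetical file names; the real Nightfire naming scheme may differ.
names = ["viirs_20130101.csv", "viirs_20120301.csv", "viirs_20121215.csv"]

with tempfile.TemporaryDirectory() as tmp:
    for name in names:
        # create empty placeholder files to glob over
        open(os.path.join(tmp, name), "w").close()

    # glob.glob returns paths in arbitrary order, so sort explicitly
    found = sorted(glob.glob(os.path.join(tmp, "*.csv")))
    sorted_names = [os.path.basename(path) for path in found]

print(sorted_names)
```

Zero-padded dates sort the same way as strings and as dates, which is why no date parsing is needed for tracking progress through the files.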


## Point-in-Polygon Counts via Spatial Join

Although this project uses the GeoPandas `sjoin` (spatial join) function rather than Esri's Join Features tool, the Esri illustration below shows the idea well: when a given 2D or 3D polygon geography is intersected with a set of points, the points that fall inside it can be counted.
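
To make the point-in-polygon idea concrete, here is a minimal pure-Python sketch using the classic ray-casting test. It is only a conceptual illustration: the project itself relies on GeoPandas for this step, and the function name `ray_cast_contains`, the square, and the points are all hypothetical.

```python
def ray_cast_contains(x, y, polygon):
    """Return True if (x, y) lies inside `polygon`, a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray cast to the right of (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                # each crossing flips the inside/outside state
                inside = not inside
    return inside


# Hypothetical square "region" and three hypothetical "detections"
square = [(0, 0), (4, 0), (4, 4), (0, 4)]
points = [(1, 1), (3, 2), (5, 5)]

count = sum(ray_cast_contains(px, py, square) for px, py in points)
print(count)
```

A point is inside when the ray crosses the boundary an odd number of times; counting the points for which the test is true gives the point-in-polygon count for that region.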

### Join Features

Description from the Esri documentation:

> Joins attributes from one layer to another based on spatial, temporal, or attribute relationships, or a combination of those relationships. [https://pro.arcgis.com/en/pro-app/tool-reference/geoanalytics-desktop/join-features.htm](https://pro.arcgis.com/en/pro-app/tool-reference/geoanalytics-desktop/join-features.htm)

[![join](https://pro.arcgis.com/en/pro-app/tool-reference/geoanalytics-desktop/GUID-EB8FA998-105A-4D93-93E3-5FAA1057137D-web.png)](https://pro.arcgis.com/en/pro-app/tool-reference/geoanalytics-desktop/GUID-EB8FA998-105A-4D93-93E3-5FAA1057137D-web.png)


The GeoPandas code below lives in `tools.geoprocessing.py` and is called by `viirs_join_basins(...)` in this project's repository:
```python
import geopandas as gpd


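def point_in_polygon(point_gdf, poly_gdf):
    """Inner spatial join: keep each point that intersects a polygon."""
    # GeoPandas emits a warning if the CRS of the two frames do not match.
    # Note: `op` was renamed to `predicate` in GeoPandas 0.10.
    return gpd.sjoin(
        point_gdf,
        poly_gdf,
        how="inner",
        op='intersects',
    )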
```
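
The CRS warning can be avoided by reprojecting one frame to the other's CRS before joining. The sketch below is an illustration under stated assumptions, not the project's code: the toy polygons, points, and column names (`region`, `det_id`) are hypothetical, and it assumes GeoPandas 0.10 or later, where the keyword is `predicate` rather than `op`.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Hypothetical stand-ins for the basin polygons and Nightfire detections
basins = gpd.GeoDataFrame(
    {"region": ["A", "B"]},
    geometry=[
        Polygon([(0, 0), (2, 0), (2, 2), (0, 2)]),
        Polygon([(2, 0), (4, 0), (4, 2), (2, 2)]),
    ],
    crs="EPSG:4326",
)
detections = gpd.GeoDataFrame(
    {"det_id": [1, 2, 3, 4]},
    geometry=[Point(1, 1), Point(0.5, 1.5), Point(3, 1), Point(9, 9)],
    crs="EPSG:4326",
)

# Reproject the points to the polygons' CRS if the two differ
if detections.crs != basins.crs:
    detections = detections.to_crs(basins.crs)

# Inner join keeps only the points that intersect a polygon
joined = gpd.sjoin(detections, basins, how="inner", predicate="intersects")

# Point-in-polygon counts: number of joined points per region
counts = joined.groupby("region").size()
print(counts)
```

With `how="inner"`, the point at (9, 9) falls outside both polygons and is dropped, so only three of the four points survive the join.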

### Execute the Spatial Join 

In [10]:
# spatial join 
viirs_join_basins( 
    wd,
    all_basins,
    viirs_2_1c_files,
    '21c',
)   # spatially join viirs 2.1c to basins geometries

viirs_join_basins(
    wd,
    all_basins,
    viirs_3_0_files,
    '30',
)  # spatially join viirs 3.0 to basins geometries

In [11]:
basins_int_viirs_21c = compile_basin_data(wd, '21c')
basins_int_viirs_30  = compile_basin_data(wd, '30')  
# the function above saves the master compiled 2.1c and 3.0 files and prints every January 1st (one per year)

    20130101
    20140101
    20150101
    20160101
    20170101
    20180101
    20190101
    20200101


In [12]:
print(basins_int_viirs_21c.shape)
print(basins_int_viirs_30.shape)  # check the shapes

(1009001, 129)
(528056, 46)


Geoprocessing is now complete: all region data has been joined to the Nightfire data files.

![](../images/bakken_nightfire_2017.gif)

Each point layer, one per date from 2012-03 to the present day (all of 2017 in the image above), is intersected with the region polygons so that the basin region data can be appended to the point dataset. The result is then aggregated in the [Feature Engineering and Exploratory Data Analysis of Processed Data](https://git.generalassemb.ly/danielmartinsheehan/capstone/blob/master/notebooks/04_feature_engineering_and_exploratory_data_analysis_processed_data.ipynb) notebook, which also generates the animated image above.
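
As a rough illustration only of what that later aggregation could look like, the sketch below counts joined detections per region and day with `pandas`. The column names (`date`, `region`, `detections`) and the rows are hypothetical and are not taken from the compiled files.

```python
import pandas as pd

# Hypothetical joined output: one row per detection, with its region appended
joined = pd.DataFrame(
    {
        "date": ["20170101", "20170101", "20170101", "20170102"],
        "region": [
            "Bakken Region",
            "Bakken Region",
            "Permian Region",
            "Bakken Region",
        ],
    }
)

# Count detections per region and day
daily_counts = (
    joined.groupby(["region", "date"])
    .size()
    .reset_index(name="detections")
)
print(daily_counts)
```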

# Next Notebook

[Feature Engineering and Exploratory Data Analysis of Processed Data](https://git.generalassemb.ly/danielmartinsheehan/capstone/blob/master/notebooks/04_feature_engineering_and_exploratory_data_analysis_processed_data.ipynb)