**Water Flowline Data Cleaning**

Let's clean and merge flowline-related data from the National Hydrography Dataset.

In [None]:
%run ../bootstrap.py
setup_project_path()

from scripts.io_helpers import export_interim, read_interim_layer
from scripts.geometry_helpers import strip_z_line, drop_missing_geometry, validate_geometry
from scripts import data_config as dc
import geopandas as gpd
import fiona

Let's take a look at the layers.

In [2]:
gdb_path = dc.RAW_DATA_PATH / "NHDPlus_H_National_Release_2_GDB" / "NHDPlus_H_National_Release_2.gdb"

layers = fiona.listlayers(gdb_path)
for layer in layers:
    print(layer)

NHDPlusGageSmooth
NHDPlusFlow
NHDArea
NHDLine
NHDPlusBoundaryUnit
NHDPlusCatchment
NHDPlusGage
NHDPlusSink
NHDPlusWall
NHDPoint
NHDWaterbody
NonNetworkNHDFlowline
WBDHU12
NHDPlusConnect
NetworkNHDFlowline


For flowline data, we want to merge NetworkNHDFlowline and NonNetworkNHDFlowline. Since these are massive datasets, let's load our buffered CO state boundary and use it to mask the data loading. This will save on time and memory, and the extra step of intersecting later on.

First, let's check what CRS the NHD data is in by reading a few rows, and project the state buffer to that CRS:

In [3]:
crs_check_network = gpd.read_file(dc.RAW_FILES["nhd"], layer="NetworkNHDFlowline", rows=10)
crs_check_non_network = gpd.read_file(dc.RAW_FILES["nhd"], layer="NonNetworkNHDFlowline", rows=10)
print("Network CRS: ", crs_check_network.crs)
print("Non network CRS: ", crs_check_non_network.crs)

Network CRS:  COMPD_CS["NAD83 + NAVD88 height",GEOGCS["NAD83",DATUM["North_American_Datum_1983",SPHEROID["GRS 1980",6378137,298.257222101,AUTHORITY["EPSG","7019"]],AUTHORITY["EPSG","6269"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AXIS["Latitude",NORTH],AXIS["Longitude",EAST],AUTHORITY["EPSG","4269"]],VERT_CS["NAVD88 height",VERT_DATUM["North American Vertical Datum 1988",2005,AUTHORITY["EPSG","5103"]],UNIT["metre",1,AUTHORITY["EPSG","9001"]],AXIS["Gravity-related height",UP],AUTHORITY["EPSG","5703"]]]
Non network CRS:  COMPD_CS["NAD83 + NAVD88 height",GEOGCS["NAD83",DATUM["North_American_Datum_1983",SPHEROID["GRS 1980",6378137,298.257222101,AUTHORITY["EPSG","7019"]],AUTHORITY["EPSG","6269"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AXIS["Latitude",NORTH],AXIS["Longitude",EAST],AUTHORITY["EPSG","4269"]],VERT_CS["NAVD88 height",VERT_DATUM["North American Vert

  return ogr_read(
  return ogr_read(


The NHD data uses EPSG:4269 for horizontal coordinates, so let's load project the buffered CO border to this CRS.

In [4]:
# Load buffered CO boundary
co_mask = read_interim_layer("state_boundary_buffered")
# Project to NHD horizontal CRS
co_mask.to_crs("EPSG:4269")
co_mask.head()


Unnamed: 0,NAME,geometry
0,Colorado,"POLYGON ((143525.677 4102169.628, 143523.188 4..."


Now, let's load the NHD data with the CO mask.

In [5]:
# Load flowlines with CO mask
net_flow = gpd.read_file(dc.RAW_FILES["nhd"], layer="NetworkNHDFlowline", mask=co_mask)
print("NetworkFlowline Columns:", net_flow.columns.tolist())
non_net_flow = gpd.read_file(dc.RAW_FILES["nhd"], layer="NonNetworkNHDFlowline", mask=co_mask)
print("NonNetworkFlowline Columns:", non_net_flow.columns.tolist())

  return ogr_read_info(
  return ogr_read(


NetworkFlowline Columns: ['permanent_identifier', 'fdate', 'resolution', 'gnis_id', 'gnis_name', 'lengthkm', 'reachcode', 'flowdir', 'wbarea_permanent_identifier', 'ftype', 'fcode', 'mainpath', 'innetwork', 'visibilityfilter', 'nhdplusid', 'vpuid', 'streamleve', 'streamorde', 'streamcalc', 'fromnode', 'tonode', 'hydroseq', 'levelpathi', 'pathlength', 'terminalpa', 'arbolatesu', 'divergence', 'startflag', 'terminalfl', 'uplevelpat', 'uphydroseq', 'dnlevel', 'dnlevelpat', 'dnhydroseq', 'dnminorhyd', 'dndraincou', 'frommeas', 'tomeas', 'rtndiv', 'thinner', 'vpuin', 'vpuout', 'areasqkm', 'totdasqkm', 'divdasqkm', 'maxelevraw', 'minelevraw', 'maxelevsmo', 'minelevsmo', 'slope', 'slopelenkm', 'elevfixed', 'hwtype', 'hwnodesqkm', 'statusflag', 'qama', 'vama', 'qincrama', 'qbma', 'vbma', 'qincrbma', 'qcma', 'vcma', 'qincrcma', 'qdma', 'vdma', 'qincrdma', 'qema', 'vema', 'qincrema', 'qfma', 'qincrfma', 'arqnavma', 'petma', 'qlossma', 'qgadjma', 'qgnavma', 'gageadjma', 'avgqadjma', 'gageidma', '

  return ogr_read_info(
  return ogr_read(


NonNetworkFlowline Columns: ['permanent_Identifier', 'fdate', 'resolution', 'gnis_id', 'gnis_name', 'lengthkm', 'reachcode', 'flowdir', 'wbarea_permanent_identifier', 'ftype', 'fcode', 'mainpath', 'innetwork', 'visibilityfilter', 'nhdplusid', 'vpuid', 'Shape_Length', 'geometry']


Let's see what columns are shared by both, and trim down the columns for merging.

In [6]:
shared_cols = [col for col in net_flow.columns if col in non_net_flow.columns]
print(shared_cols)

['fdate', 'resolution', 'gnis_id', 'gnis_name', 'lengthkm', 'reachcode', 'flowdir', 'wbarea_permanent_identifier', 'ftype', 'fcode', 'mainpath', 'innetwork', 'visibilityfilter', 'nhdplusid', 'vpuid', 'Shape_Length', 'geometry']


In [8]:
trimmed_cols = [
    'gnis_name', 'fcode', 'flowdir', 'innetwork', 'nhdplusid',
    'lengthkm', 'reachcode', 'geometry'
]

net_flow_trimmed = net_flow.copy()[trimmed_cols]
non_net_flow_trimmed = non_net_flow.copy()[trimmed_cols]

Before merging, let's add a column to signify the original type - NetworkNHDFlowline or NonNetworkNHDFlowline.

In [9]:
net_flow_trimmed['NHDType'] = 'NetworkNHDFlowline'
non_net_flow_trimmed['NHDType'] = 'NonNetworkNHDFlowline'
print(non_net_flow_trimmed.loc[0])

gnis_name                                                 None
fcode                                                    42803
flowdir                                                      0
innetwork                                                    0
nhdplusid                                     23001800067853.0
lengthkm                                                 0.058
reachcode                                       10180001011402
geometry     MULTILINESTRING Z ((-106.21778167890238 40.799...
NHDType                                  NonNetworkNHDFlowline
Name: 0, dtype: object


**Merging**

Ready to merge! Now that columns are shared, let's combine Network Flowlines and Non Network Flowlines.

In [10]:
import pandas as pd

flowlines_merged = gpd.GeoDataFrame(
    pd.concat([net_flow_trimmed, non_net_flow_trimmed], ignore_index=True),
    crs=net_flow_trimmed.crs
)

flowlines_merged.head()

Unnamed: 0,gnis_name,fcode,flowdir,innetwork,nhdplusid,lengthkm,reachcode,geometry,NHDType
0,,46006,1,1,23001800000000.0,0.223,10180001002059,"MULTILINESTRING Z ((-106.2313 40.70232 0, -106...",NetworkNHDFlowline
1,Michigan River,46006,1,1,23001800000000.0,0.527,10180001001786,"MULTILINESTRING Z ((-106.24196 40.71838 0, -10...",NetworkNHDFlowline
2,Pinkham Creek,46006,1,1,23001800000000.0,1.141018,10180001000400,"MULTILINESTRING Z ((-106.21946 40.92214 0, -10...",NetworkNHDFlowline
3,,46006,1,1,23001800000000.0,0.129345,10180010000903,"MULTILINESTRING Z ((-106.10806 40.99504 0, -10...",NetworkNHDFlowline
4,South Fork Canadian River,46006,1,1,23001800000000.0,3.704,10180001000386,"MULTILINESTRING Z ((-105.95299 40.58555 0, -10...",NetworkNHDFlowline


**Geometry**

Almost there. Now that we've merged the two flowline datasets, let's check the geometry types:

In [11]:
print("Geometry types: ", flowlines_merged.geometry.type.unique())
print("Has Z axis: ", flowlines_merged.geometry.apply(lambda z: z.has_z).value_counts())

Geometry types:  ['MultiLineString']
Has Z axis:  geometry
True    995471
Name: count, dtype: int64


All geometries are of type MultiLineString, and contain a Z axis. We don't really need the Z axis for our purposes, so let's strip it for efficiency.

In [None]:
flowlines_merged['geometry'] = flowlines_merged['geometry'].apply(strip_z_line)
print("Has Z axis: ", flowlines_merged.geometry.apply(lambda z: z.has_z).value_counts())
flowlines_merged.head()

Has Z axis:  geometry
False    995471
Name: count, dtype: int64


Unnamed: 0,gnis_name,fcode,flowdir,innetwork,nhdplusid,lengthkm,reachcode,geometry,NHDType
0,,46006,1,1,23001800000000.0,0.223,10180001002059,"MULTILINESTRING ((-106.2313 40.70232, -106.231...",NetworkNHDFlowline
1,Michigan River,46006,1,1,23001800000000.0,0.527,10180001001786,"MULTILINESTRING ((-106.24196 40.71838, -106.24...",NetworkNHDFlowline
2,Pinkham Creek,46006,1,1,23001800000000.0,1.141018,10180001000400,"MULTILINESTRING ((-106.21946 40.92214, -106.21...",NetworkNHDFlowline
3,,46006,1,1,23001800000000.0,0.129345,10180010000903,"MULTILINESTRING ((-106.10806 40.99504, -106.10...",NetworkNHDFlowline
4,South Fork Canadian River,46006,1,1,23001800000000.0,3.704,10180001000386,"MULTILINESTRING ((-105.95299 40.58555, -105.95...",NetworkNHDFlowline


Finally, let's drop missing geometries and validate.

In [19]:
print("Rows before validation: ", flowlines_merged.shape[0])
flowlines_validated = drop_missing_geometry(flowlines_merged)
flowlines_validated = validate_geometry(flowlines_validated)
print("Rows after validation: ", flowlines_validated.shape[0])

Rows before validation:  995471
Rows after validation:  995471


**Exporting Data**

Our flowline data is fully cleaned and merged! Let's export to data/interim for further use. This file (flowline_clean) is marked for display, so will also be exported to data/processed for display in visualization.

In [20]:
export_interim(flowlines_validated, "flowline_clean", driver="GPKG", verbose=True)

Saved to interim: /Users/loganproffitt/Desktop/CampGIS.nosync/Repo/CampGIS/data/interim/flowline_clean.gpkg
Also saved to processed: /Users/loganproffitt/Desktop/CampGIS.nosync/Repo/CampGIS/data/processed/flowline_clean.gpkg
