# Partition Water Bodies
Shapefiles for water bodies are available from the <i>NHD</i> (https://nhd.usgs.gov/data.html) and are accessible by state. Each state has roughly 100,000 water bodies, so partitioning this large dataset into chunks makes it easier to process.<br>
Shapefiles for the <i>states of interest</i> are downloaded from there and stored locally under the path <i>NHD_High_Resolution/NHD_state</i>. <br>

This script partitions the water bodies of a given <i>state of interest</i> into chunks of at most <b>1000</b> rows. <br>
The <i>dbf</i> file containing the water body data for the given state is read into a <i>pandas dataframe</i> and partitioned into chunks of at most 1000 rows. 
Each chunk (n) is stored in <i>Given_data_path/state_Lakes_n/lakes.csv</i>.<br>
For example, if the given data path is <i>Partitioned-DFS-WI</i> and the state is <i>Wisconsin</i>,
the 1st chunk (n = 0) is stored in <i>Partitioned-DFS-WI/Wisconsin_Lakes_0/lakes.csv</i> locally.
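
The naming scheme above can be sketched as a small helper. This is only an illustration: `chunk_csv_paths` and its parameters are hypothetical names that do not appear in the script, and the row count of 2500 is made up for the example.

```python
def chunk_csv_paths(data_path, state, n_rows, chunk_size=1000):
    """List the lakes.csv path of every chunk for a table with n_rows rows.

    Hypothetical helper, not part of the original script.
    """
    # Ceiling division: a final partial chunk still gets its own folder
    n_chunks = (n_rows + chunk_size - 1) // chunk_size
    return ["{}/{}_Lakes_{}/lakes.csv".format(data_path, state, k)
            for k in range(n_chunks)]

# 2500 rows (an invented count) give two full chunks and one partial chunk
paths = chunk_csv_paths("Partitioned-DFS-WI", "Wisconsin", 2500)
for p in paths:
    print(p)
```

Chunk numbering starts at 0, so the first path printed matches the Wisconsin example above.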

In [2]:
import shapefile
import pandas as pd
from shapely.geometry import *
from simpledbf import Dbf5
from pywqp import pywqp_client
from geopandas import *
import os,glob
import shutil

doing node 'org'
doing node 'station'
doing node 'result'
doing node 'activity'


## Function definition

1. Function to partition the dataframe of water bodies into the ith chunk
2. Function to partition the dataframe of water bodies of given state into chunks of at most size 1000

### 1. Function to partition the dataframe of water bodies into the ith chunk

Below is the definition of the function that writes one chunk of the water bodies dataframe for a given state.
The argument i is the start row of the chunk, so the chunk number is i divided by 1000.
The function stores the chunk in lakes.csv and records details like the <i>dbf path</i>, <i>shape path</i>, <i>start</i>
and <i>end indices</i> in the file info.csv.

In [5]:
def partition_df_i(i,state):
    """
        Writes the chunk of the water bodies dataframe that starts at row i
        for the given state and stores the dbf path, shape path, start and
        end indices in file info.csv
        Args:
            i (int): Start row of the chunk (a multiple of the chunk size)
            state (str): Name of the state
    """
    # Global Variables
    global directory_wb,data_path,df_wb
    
    # Size of the chunk
    n=1000
    
    # Path to store the chunk; integer division keeps the folder suffix
    # an integer (0, 1, 2, ...) on both Python 2 and Python 3
    directory_batch = directory_wb+'/'+state+'_Lakes_'+str(i//n)
    
    # Form the chunk and store it in folder path locally
    if os.path.exists(directory_batch): 
        shutil.rmtree(directory_batch)
    os.makedirs(directory_batch)
    df_wb[i:i + n].to_csv(directory_batch+'/lakes.csv')
    
    # Store the path of the dbf file, path of shape file, start and end indices
    # of the partitioned chunk in file info.csv
    cols = ['dbf_path','shape_path','start_index','end_index']
    df=pd.DataFrame(columns=cols)
    # The last chunk may hold fewer than n rows, so clamp the inclusive end index
    end_index = min(i+n, len(df_wb))-1
    df.loc[len(df)] = [data_path+'NHDWaterbody.dbf',data_path+'NHDWaterbody.shx',i,end_index]
    df.to_csv(directory_batch+'/info.csv')
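
The start and end indices written to info.csv can be checked in isolation. The sketch below reproduces only the index arithmetic; `chunk_bounds` is a hypothetical name, and the row count of 2500 is invented for the example.

```python
def chunk_bounds(start, n_rows, chunk_size=1000):
    """Return (start_index, end_index) for the chunk beginning at row `start`.

    Hypothetical helper; the end index is inclusive, as in info.csv.
    """
    # Clamp to the table length so the last chunk can be shorter
    end = min(start + chunk_size, n_rows) - 1
    return start, end

# A table of 2500 rows yields two full chunks and one chunk of 500 rows
bounds = [chunk_bounds(s, 2500) for s in range(0, 2500, 1000)]
print(bounds)
```

Because the end index is inclusive, the matching slice of the dataframe is `[start:end + 1]`.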
    

### 2. Function to partition the dataframe of water bodies of given state into chunks of at most size 1000

Below is the definition of the function that partitions the water bodies dataframe into chunks of size at most 1000

In [9]:
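def partition_df(directory_wb,state):
    """
        Partitions the water bodies dataframe into chunks of size
        at most 1000
        Args:
            directory_wb (str): Directory in which the chunk folders are created
            state (str): Name of the state
    """
    # partition_df_i reads data_path, df_wb and directory_wb as globals
    global data_path,df_wb
    
    # Local data path to the shape files 
    data_path = "NHD_High_Resolution/NHD_"+state+"/Shape/"
    # Reading in the .dbf file and converting to pandas dataframe
    dbf = Dbf5(data_path+"NHDWaterbody.dbf")
    df_wb = dbf.to_dataframe()
    
    # directory_wb is a parameter and cannot also be declared global,
    # so publish it through the module namespace
    globals()['directory_wb'] = directory_wb
    
    # Creating directory to store partitioned data
    if os.path.exists(directory_wb): 
        shutil.rmtree(directory_wb)
    os.makedirs(directory_wb)
    
    # Write one chunk for every block of 1000 rows
    for i in range(0,len(df_wb),1000):
        partition_df_i(i,state)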
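
Once the chunks are written, later steps will probably want to visit the chunk folders in order. The sketch below is one possible way to do that, assuming the folder suffixes are plain integers; `list_chunk_dirs` is a hypothetical helper, and the demonstration builds empty folders in a temporary directory rather than reading real NHD data.

```python
import glob
import os
import shutil
import tempfile

def list_chunk_dirs(directory_wb, state):
    """Return the chunk folders for `state`, ordered by chunk number.

    Hypothetical helper; assumes folder names end in an integer suffix.
    """
    pattern = os.path.join(directory_wb, state + "_Lakes_*")
    dirs = [d for d in glob.glob(pattern) if os.path.isdir(d)]
    # Sort numerically: a plain string sort would place chunk 10 before chunk 2
    return sorted(dirs, key=lambda d: int(d.rsplit("_", 1)[1]))

# Demonstration on an empty layout in a temporary directory
root = tempfile.mkdtemp()
for k in (0, 1, 2, 10):
    os.makedirs(os.path.join(root, "Wisconsin_Lakes_" + str(k)))

chunk_numbers = [int(d.rsplit("_", 1)[1]) for d in list_chunk_dirs(root, "Wisconsin")]
print(chunk_numbers)

shutil.rmtree(root)
```

Each returned folder would contain the lakes.csv and info.csv files described earlier.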

-------------------------------------------------