# Merging East River Tree Stem Geolocation Points
**Author:** 'Marshall Worsham' <br>
**Creation Date:** '09/21/2020' <br>
**Revision Date:** '12/22/2020' <br>

---

## Contents

1 - [Front matter](#front)<br>
2 - [Libraries](#libraries)<br>
3 - [Import reference table](#import)<br>
4 - [Exploratory analysis](#eda)<br>
5 - [Rename and move](#rename)<br>
6 - [Prepare for append](#prep)<br>
7 - [Append](#append)<br>

---


## Front matter<a id='front'></a>

This notebook contains markdown and code for post-processing point shapefiles generated from Trimble Geo7X GPS acquisitions in the East River domain. The script appends the `Site` name and `subdirectory` to each shapefile name, then selects all projected point shapefiles, groups them by `Site` name, and merges points from the same site. The result is a set of shapefiles containing tree geolocation points, one set for each site in the watershed where stem geolocations were acquired from 2018â€“2020. Most output files contain some extraneous points marking corners and errata, which are cleaned out in '00_EastRiver_Clean_Tree_GPSPoints.ipynb'. 

The script was developed in `Python 3.8.2` on a Macbook Pro 2014 running OSX 10.14.6.


## Libraries<a id='libraries'></a>

In [2]:
import os
import pandas as pd
import geopandas as gpd
import numpy as np
import math
import re
from matplotlib import pyplot as plt
from os.path import join, getsize
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual

Define the working directory and list contents

In [12]:
os.getcwd()
directory = os.sep.join(['/Volumes', 'GoogleDrive', 'My Drive', 'Research', 'RMBL', 'Working_Files', 'Forest_Inventory_Dataset'])
source_dir = os.sep.join([directory, 'Source'])
scratch_dir = os.sep.join([directory, 'Scratch'])
out_dir = os.sep.join([directory, 'Output'])
gps_dir = os.sep.join([source_dir, 'GPS_Data_2021'])
os.listdir(gps_dir)[0:10]

['KATZ080208A',
 'KATZ080208B',
 'KATZ080211A',
 'KATZ080212A',
 'KATZ080213A',
 'KATZ080214A',
 'KATZ080215A',
 'KATZ080309A',
 'KATZ080310A',
 'KATZ080312A']

## Import reference table<a id='reference'></a>
First we import a CSV describing filenames and associated sites. Then we slice to create a simple list of filenames and the site at which the data inside those files were acquired.


In [51]:
gps_index = pd.read_csv(os.sep.join([source_dir, 'EastRiver_GPS_Data_Index.csv']))
gps_index.dropna(0, subset=['Site', 'Contents'], inplace=True)
gps_index.loc[:, 'Filename':'Site'].tail(10)

Unnamed: 0,Filename,Site
250,KATZ080709A,CC-CVN1
251,KATZ080709B,CC-CVN1
252,KATZ080710A,CC-CVN1
253,KATZ080712A,CC-CVN1
254,KATZ080713A,CC-CVN1
255,WORSHAMM081113A,ER-GT1
256,WORSHAMM081308A,ER-CVS1
258,WORSHAMM081812A,ER-BME1
259,WORSHAMM081812B,ER-BME1
260,WORSHAMM082008A,SG-SWR1


## Exploratory analysis<a id='eda'></a>
Some simple exploratory analysis reveals see how many unique files are associated with each site.

In [52]:
gps_index.groupby('Site').count()['Filename']

Site
CC-CVN1                                                                            5
CC-CVN2                                                                           12
CC-CVS1                                                                            6
CC-EMN1                                                                            6
CC-UC1                                                                             2
Carbon-1                                                                           1
Carbon-2                                                                           4
Carbon-21                                                                          1
Carbon_21B, Carbon_6, Carbon_15, CoalCreek_ValleyN_1, and CoalCreek_ValleyN_1b     1
Cement 28                                                                          1
ER-APL1                                                                            3
ER-APU1                                                     

In [53]:
# Scratch to set up the syntax for the function below that will relate filenames in the directory to filenames and site associations in the index dataframe
gps_index_sites = gps_index.loc[:,'Filename':'Site']
gps_index_sites.loc[gps_index_sites['Filename'] == 'WORSHAMM071610A'].iloc[0, 1]

'Snodgrass-1'

List the filenames in all subdirectories of `directory` by walking the subdirectories and string-splitting on the last `/` in the path to isolate filenames. 

In [54]:
for subdir, dirs, files in os.walk(gps_dir):
    for filename in files:
        subdir_name = subdir.rsplit('/', 1)[-1]
        print(subdir_name)

KATZ080208A
KATZ080208A
KATZ080208A
KATZ080208A
KATZ080208A
KATZ080208A
KATZ080208A
KATZ080208A
KATZ080208A
KATZ080208A
KATZ080208A
KATZ080208A
KATZ080208A
KATZ080208B
KATZ080208B
KATZ080208B
KATZ080208B
KATZ080208B
KATZ080208B
KATZ080208B
KATZ080208B
KATZ080208B
KATZ080208B
KATZ080208B
KATZ080208B
KATZ080208B
KATZ080211A
KATZ080211A
KATZ080211A
KATZ080211A
KATZ080211A
KATZ080211A
KATZ080211A
KATZ080211A
KATZ080211A
KATZ080211A
KATZ080211A
KATZ080211A
KATZ080211A
KATZ080212A
KATZ080212A
KATZ080212A
KATZ080212A
KATZ080212A
KATZ080212A
KATZ080212A
KATZ080212A
KATZ080212A
KATZ080212A
KATZ080212A
KATZ080212A
KATZ080212A
KATZ080213A
KATZ080213A
KATZ080213A
KATZ080213A
KATZ080213A
KATZ080213A
KATZ080213A
KATZ080213A
KATZ080213A
KATZ080213A
KATZ080213A
KATZ080213A
KATZ080213A
KATZ080214A
KATZ080214A
KATZ080214A
KATZ080214A
KATZ080214A
KATZ080214A
KATZ080214A
KATZ080214A
KATZ080214A
KATZ080214A
KATZ080214A
KATZ080214A
KATZ080214A
KATZ080215A
KATZ080215A
KATZ080215A
KATZ080215A
KATZ080215A
KATZ

## Rename and move<a id='rename'></a>
The function below finds the name of the `subdirectory` that each shapefile lives in and finds the `Site` with which that subdirectory is associated. The function renames each shapefile by appending the `subdirectory` name and `Site` name to the filename, then moves all files into a single directory.

In [55]:
gps_dir

'/Volumes/GoogleDrive/My Drive/Research/RMBL/Working_Files/Forest_Inventory_Dataset/Source/GPS_Data_2021'

In [56]:
for subdir, dirs, files in os.walk(gps_dir):
    for filename in files:
        gps_index_sites = gps_index.loc[:, 'Filename':'Site']
        subdir_name = subdir.rsplit('/', 1)[-1]
        index_sitename = str(gps_index_sites.loc[gps_index_sites['Filename'] == subdir_name, 'Site'].values).strip("[]").strip("'")
        newname = subdir_name + '_' + index_sitename + '_' + filename
        oldpath = subdir + os.sep + filename
        newpath = subdir + os.sep + newname
        #print(oldpath)
        #print(newname)
        #print(newpath)
        os.rename(oldpath, newpath)
        if not re.search('Line.+', filename) and not re.search('Area.+', filename) and not re.search('Icon.+', filename):
            print(newpath)

/Volumes/GoogleDrive/My Drive/Research/RMBL/Working_Files/Forest_Inventory_Dataset/Source/GPS_Data_2021/KATZ080208A/KATZ080208A_CC-CVN2_KATZ080208A_CC-CVN2_Student__Project.shx
/Volumes/GoogleDrive/My Drive/Research/RMBL/Working_Files/Forest_Inventory_Dataset/Source/GPS_Data_2021/KATZ080208A/KATZ080208A_CC-CVN2_KATZ080208A_CC-CVN2_Student__Project.shp.xml
/Volumes/GoogleDrive/My Drive/Research/RMBL/Working_Files/Forest_Inventory_Dataset/Source/GPS_Data_2021/KATZ080208A/KATZ080208A_CC-CVN2_KATZ080208A_CC-CVN2_Student__Project.shp
/Volumes/GoogleDrive/My Drive/Research/RMBL/Working_Files/Forest_Inventory_Dataset/Source/GPS_Data_2021/KATZ080208A/KATZ080208A_CC-CVN2_KATZ080208A_CC-CVN2_Student_.shp
/Volumes/GoogleDrive/My Drive/Research/RMBL/Working_Files/Forest_Inventory_Dataset/Source/GPS_Data_2021/KATZ080208A/KATZ080208A_CC-CVN2_KATZ080208A_CC-CVN2_Student__Project.cpg
/Volumes/GoogleDrive/My Drive/Research/RMBL/Working_Files/Forest_Inventory_Dataset/Source/GPS_Data_2021/KATZ080208A/KAT

KeyboardInterrupt: 

## Prepare for append<a id='prep'></a>

In [70]:
# list all files
renamed_dir = os.path.join(scratch_dir, 'GPS_Data_2021_Renamed')
allfiles = os.listdir(renamed_dir)
allfiles[0:10]
#allfiles

['KATZ080208A_CC-CVN2_Student_.dbf',
 'KATZ080208A_CC-CVN2_Student_.prj',
 'KATZ080208A_CC-CVN2_Student_.shp',
 'KATZ080208A_CC-CVN2_Student_.shp.xml',
 'KATZ080208A_CC-CVN2_Student_.shx',
 'KATZ080208A_CC-CVN2_Student__Project.cpg',
 'KATZ080208A_CC-CVN2_Student__Project.dbf',
 'KATZ080208A_CC-CVN2_Student__Project.prj',
 'KATZ080208A_CC-CVN2_Student__Project.sbn',
 'KATZ080208A_CC-CVN2_Student__Project.sbx']

In [71]:
# generate a list of unique site names represented in the dataset
sitelist = gps_index['Site'].unique().tolist()
sitelist

['Schofield-24',
 'Schofield-19',
 'Schofield-23',
 'Carbon-2',
 'PointLookout-2',
 'Snodgrass-2',
 'Snodgrass-1',
 'SplainsGulch-1',
 'Carbon-1',
 'PointLookout-1',
 'Research meadow',
 'Wash-1',
 'Schofield-4',
 'OhioPass-1',
 'Schofield-5',
 'WG-WGM1',
 'XX-PLN1',
 'CC-UC1',
 'Carbon-21',
 'Carbon_21B, Carbon_6, Carbon_15, CoalCreek_ValleyN_1, and CoalCreek_ValleyN_1b',
 'SG-SWR1',
 'SG_NESlope_5; SG_Convergent_4; SG_Berkelhammer_pine; SG_ESlope_2',
 'XX-PLN3',
 'XX-PLN2',
 'ER-BME1',
 'ER-APL1',
 'XX-CAR1',
 'SG-NES1',
 'SG-NES3',
 'SG-NES2',
 'Cement 28',
 'XX-CAR3',
 'SR-PVG1',
 'Scarp1',
 'ER-GT1',
 'ER-APU1',
 'CC-CVS1',
 'CC-CVN2',
 'CC-EMN1',
 'CC-CVN1',
 'ER-CVS1']

In [72]:
# filter only shapefiles containing point data types in the correct projection
# the target filenames will contain the tag "Project" AND either of the strings "Student" or "Point"
# filenames without "Student" and filenames containing "Area" and "Line" strings will be filtered out
point_str = 'Point_'
stud_str = 'Student'
proj_str = '_Project'
sf_allpoint = [i for i in allfiles if ((point_str in i or stud_str in i) and proj_str in i)]
print(len(allfiles))
print(len(sf_allpoint))
sf_allpoint

1526
941


['KATZ080208A_CC-CVN2_Student__Project.cpg',
 'KATZ080208A_CC-CVN2_Student__Project.dbf',
 'KATZ080208A_CC-CVN2_Student__Project.prj',
 'KATZ080208A_CC-CVN2_Student__Project.sbn',
 'KATZ080208A_CC-CVN2_Student__Project.sbx',
 'KATZ080208A_CC-CVN2_Student__Project.shp',
 'KATZ080208A_CC-CVN2_Student__Project.shp.xml',
 'KATZ080208A_CC-CVN2_Student__Project.shx',
 'KATZ080208B_CC-CVN2_Student__Project.cpg',
 'KATZ080208B_CC-CVN2_Student__Project.dbf',
 'KATZ080208B_CC-CVN2_Student__Project.prj',
 'KATZ080208B_CC-CVN2_Student__Project.sbn',
 'KATZ080208B_CC-CVN2_Student__Project.sbx',
 'KATZ080208B_CC-CVN2_Student__Project.shp',
 'KATZ080208B_CC-CVN2_Student__Project.shp.xml',
 'KATZ080208B_CC-CVN2_Student__Project.shx',
 'KATZ080211A_CC-CVN2_Student__Project.cpg',
 'KATZ080211A_CC-CVN2_Student__Project.dbf',
 'KATZ080211A_CC-CVN2_Student__Project.prj',
 'KATZ080211A_CC-CVN2_Student__Project.sbn',
 'KATZ080211A_CC-CVN2_Student__Project.sbx',
 'KATZ080211A_CC-CVN2_Student__Project.shp',
 '

In [73]:
# # manipulate the gps_index dataframe
# notcorners = gps_index[~gps_index['Contents'].str.contains('corner')] # filter out names of subdirs containing corners
# notcorners = notcorners['Filename'].to_list()
# print(notcorners[:10])
# print(len(sf_allpoint))
# print(len(notcorners))

In [74]:
# manipulate gps_index dataframe
trees = gps_index[gps_index['Contents'].str.contains('tree')]
trees = trees['Filename'].to_list()

In [75]:
# find files representing trees only, with .shp extension
trees_allfiles = [i for i in sf_allpoint if any(ii in i for ii in trees)]
trees_sf = [t for t in trees_allfiles if t.endswith('.shp')]

In [76]:
print(len(trees_sf))
print(trees_sf[:10])

80
['KATZ080208B_CC-CVN2_Student__Project.shp', 'KATZ080211A_CC-CVN2_Student__Project.shp', 'KATZ080212A_CC-CVN2_Student__Project.shp', 'KATZ080213A_CC-CVN2_Student__Project.shp', 'KATZ080214A_CC-CVN2_Student__Project.shp', 'KATZ080215A_CC-CVN2_Student__Project.shp', 'KATZ080309A_CC-CVN2_Student__Project.shp', 'KATZ080310A_CC-CVN2_Student__Project.shp', 'KATZ080312A_CC-CVN2_Student__Project.shp', 'KATZ080410B_CC-EMN1_Student__Project.shp']


## Append<a id='append'></a>

1. group files according to site name by finding common value from sitelist in `matches` string
2. for each site, select the first shapefile and assign it as base object
3. append all other shapefiles to the base object with `gpd.append()`
4. project crs to wgs84 utm zone 13
4. export the gpdf as a shapefile named: sitelist[i] + '_' + 'TreeStem_pts_WGS84UTM13.shp'

In [79]:
# add full path to all filenames
trees_sf_paths = [renamed_dir + os.sep + i for i in trees_sf]
trees_sf_paths[-5:]

['/Volumes/GoogleDrive/My Drive/Research/RMBL/Working_Files/Forest_Inventory_Dataset/Scratch/GPS_Data_2021_Renamed/WORSHAMM063012B_SG-NES1_Student__Project.shp',
 '/Volumes/GoogleDrive/My Drive/Research/RMBL/Working_Files/Forest_Inventory_Dataset/Scratch/GPS_Data_2021_Renamed/WORSHAMM081113A_ER-GT1_Student__Project.shp',
 "/Volumes/GoogleDrive/My Drive/Research/RMBL/Working_Files/Forest_Inventory_Dataset/Scratch/GPS_Data_2021_Renamed/WORSHAMM081812A_Schofield-19' 'ER-BME1_Student3_Project.shp",
 "/Volumes/GoogleDrive/My Drive/Research/RMBL/Working_Files/Forest_Inventory_Dataset/Scratch/GPS_Data_2021_Renamed/WORSHAMM081812A_Schofield-19' 'ER-BME1_Student__Project.shp",
 '/Volumes/GoogleDrive/My Drive/Research/RMBL/Working_Files/Forest_Inventory_Dataset/Scratch/GPS_Data_2021_Renamed/WORSHAMM082008A_SG-SWR1_Student__Project.shp']

In [80]:
# group files by plot
trees_sf_grouped = [[s for s in trees_sf_paths if key in s] for key in set(sitelist)]
trees_sf_grouped = [i for i in trees_sf_grouped if len(i) != 0] # filter out a few artifact empty lists

In [81]:
# output list of grouped tree shapefiles in directory
trees_sf_grouped

[['/Volumes/GoogleDrive/My Drive/Research/RMBL/Working_Files/Forest_Inventory_Dataset/Scratch/GPS_Data_2021_Renamed/KATZJ071216A_SG-NES2_Student__Project.shp'],
 ['/Volumes/GoogleDrive/My Drive/Research/RMBL/Working_Files/Forest_Inventory_Dataset/Scratch/GPS_Data_2021_Renamed/KATZ080410B_CC-EMN1_Student__Project.shp',
  '/Volumes/GoogleDrive/My Drive/Research/RMBL/Working_Files/Forest_Inventory_Dataset/Scratch/GPS_Data_2021_Renamed/KATZ080412A_CC-EMN1_Student__Project.shp',
  '/Volumes/GoogleDrive/My Drive/Research/RMBL/Working_Files/Forest_Inventory_Dataset/Scratch/GPS_Data_2021_Renamed/KATZ080414A_CC-EMN1_Student__Project.shp',
  '/Volumes/GoogleDrive/My Drive/Research/RMBL/Working_Files/Forest_Inventory_Dataset/Scratch/GPS_Data_2021_Renamed/KATZ080415A_CC-EMN1_Student__Project.shp'],
 ['/Volumes/GoogleDrive/My Drive/Research/RMBL/Working_Files/Forest_Inventory_Dataset/Scratch/GPS_Data_2021_Renamed/KATZJ071909B_XX-CAR3_Student__Project.shp',
  '/Volumes/GoogleDrive/My Drive/Research/

In [90]:
# aggregate, import, and append
alltrees_gpdf = []
for thing in trees_sf_grouped:
    site_gpdf = []
    for i in thing:
        gpdf = gpd.read_file(i)
        site_gpdf.append(gpdf)
    alltrees = site_gpdf[0].append(site_gpdf[1:])
    alltrees.to_crs(epsg = 32613, inplace = True)
    alltrees = alltrees.loc[alltrees.geom_type == 'Point']
    site = [s for s in sitelist if s in thing[0]][0]
    alltrees['Site'] = site
    alltrees.to_file(os.path.join(scratch_dir, 'GPS_Data_2021_MERGEDBYPLOT', site + '.shp'))
    alltrees_gpdf.append(alltrees)

In [89]:
scratch_dir

'/Volumes/GoogleDrive/My Drive/Research/RMBL/Working_Files/Forest_Inventory_Dataset/Scratch'