<a href="https://colab.research.google.com/github/nicole-hjlin/mpala-tree-mapping/blob/derek-changes/baseline_labels.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Baseline Label Creation**

The goal of this notebook is to randomly assign a species label to each LiDAR image (.las files cropped to tree polygons). This can be done in three ways:

1.   Assigning at random from the tree species sample space with equal probability.
2.   Assigning tree labels such that the frequency of each label exactly matches that of the individual species in the data.
3.   (optional) Assigning tree labels with probability equal to the frequency of each individual species in the data.
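
The second method differs from the third in that label counts are fixed in advance rather than drawn at random. The sketch below is a minimal illustration of that idea on toy data. The species names, frequencies, and the helper `exact_frequency_labels` are hypothetical, and largest-remainder rounding is only one assumed way to turn frequencies into whole-number counts.

```python
import numpy as np
import pandas as pd


def exact_frequency_labels(freqs: pd.Series, n: int, seed: int = 0) -> np.ndarray:
    """Return n shuffled labels whose counts follow freqs as closely as integers allow."""
    raw = freqs / freqs.sum() * n
    counts = np.floor(raw).astype(int)
    # Largest-remainder rounding: give leftover slots to the largest fractional parts.
    leftover = n - int(counts.sum())
    if leftover > 0:
        top = (raw - counts).sort_values(ascending=False).index[:leftover]
        counts.loc[top] += 1
    # Repeat each label by its count, then shuffle so the order is random.
    labels = np.repeat(counts.index.to_numpy(), counts.to_numpy())
    rng = np.random.default_rng(seed)
    rng.shuffle(labels)
    return labels


# Hypothetical frequencies, not taken from the ForestGEO data
toy_freqs = pd.Series({"species_a": 0.5, "species_b": 0.3, "species_c": 0.2})
toy_labels = exact_frequency_labels(toy_freqs, n=10)
toy_counts = pd.Series(toy_labels).value_counts()
```

With this approach only the order of the labels is random; the number of trees per species is determined entirely by the frequencies.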

Note: This notebook depends on user-specific file paths and on a pickle file of LiDAR images with approximate latitude/longitude (`outputs/lidar_with_latlong.pickle`).

In [125]:
# Key: {0: Hongjin, 1: Derek, 2: Matthew}
current_user = 1

# Data Loading

Import libraries, mount the data, and set up paths.

In [126]:
!pip3 install lasio laspy
!pip3 install utm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [127]:
# Mount Google Drive (where data sit)
from google.colab import drive
drive.mount('/content/drive',force_remount=True)

Mounted at /content/drive


In [128]:
# Set Project Folder
import os
header = '/content/drive/My Drive'
hongjin_path = 'classes/2022 fall/CS 288 AI for Social Impact/CS288 Final Project - Tree Species'
derek_path = 'jr/CS288 Final Project - Tree Species'
matt_path = ''

path = hongjin_path if current_user == 0 else (derek_path if current_user == 1 else matt_path)

# Select path from above
project_path = os.path.join(header, path)
project_path

'/content/drive/My Drive/jr/CS288 Final Project - Tree Species'
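
The nested conditional works for three users but becomes harder to read as collaborators are added. A dictionary lookup is one alternative. The sketch below reuses the path strings from the cell above; the name `user_paths` is hypothetical.

```python
import os

header = '/content/drive/My Drive'

# Map each user key to that user's project folder (same strings as above)
user_paths = {
    0: 'classes/2022 fall/CS 288 AI for Social Impact/CS288 Final Project - Tree Species',
    1: 'jr/CS288 Final Project - Tree Species',
    2: '',
}

current_user = 1
project_path = os.path.join(header, user_paths[current_user])
```

An unknown user key raises a `KeyError` here instead of silently falling back to the last path.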

In [129]:
# Import code utilities files
import sys
sys.path.insert(0, os.path.join(project_path, 'mpala-tree-mapping'))

In [130]:
# import packages
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from osgeo import gdal
import laspy
import utm
import pickle

# Load Tree Species Labels and Images

### ForestGEO Label Load

In [131]:
# read data
forestgeo = pd.read_csv(os.path.join(project_path, 'PlotDataReport10-07-2022_1734418034.txt'), delimiter = "\t")
forestgeo.head()



Unnamed: 0,No.,Latin,Mnemonic,SubSpecies,Quadrat,PX,PY,TreeID,Tag,StemID,StemTag,Census,DBH,HOM,Date,Codes,Stem,Status
0,1,Acacia brevispica,ACACBR,,221,36.3004,400.50461,124386,20847,254971,20847,1,77.0,0.5,2012-11-20,,main,alive
1,2,Acacia brevispica,ACACBR,,311,51.8452,206.57974,124814,30407,255478,30407,1,37.0,0.5,2012-11-17,,main,alive
2,3,Acacia brevispica,ACACBR,,503,81.23257,58.4118,126361,50086,257294,50086,1,50.0,0.5,2012-11-23,,main,alive
3,4,Acacia brevispica,ACACBR,,10001,1982.57861,9.4611,131025,1000015,262757,1000015,1,23.0,0.5,2014-11-15,M,main,alive
4,5,Acacia brevispica,ACACBR,,10001,1982.57861,9.4611,131025,1000015,262758,1000016,1,23.0,0.5,2014-11-15,,,alive


In [132]:
# Keep only the species columns. SubSpecies is dropped because it is all NaN
tree_labels = forestgeo[['Latin','Mnemonic']]
tree_labels.head()

Unnamed: 0,Latin,Mnemonic
0,Acacia brevispica,ACACBR
1,Acacia brevispica,ACACBR
2,Acacia brevispica,ACACBR
3,Acacia brevispica,ACACBR
4,Acacia brevispica,ACACBR


### Tree Label Cleaning

In [133]:
# Check for a mismatch between the number of mnemonics and the number of Latin names
print("Num Mnemonic:", len(tree_labels['Mnemonic'].unique()))
print("Num Latin:",len(tree_labels['Latin'].unique()))

print(tree_labels['Mnemonic'].unique())
print(tree_labels['Latin'].unique())

Num Mnemonic: 68
Num Latin: 67
['ACACBR' 'ACACDR' 'ACACET' 'ACACGE' 'ACACME' 'ACACNI' 'ACACSP' 'ACACSE'
 'ACACTO' 'ACACXA' 'ACOKRB' 'ACOKRL' 'ACOKSP' 'BALAAE' 'BALAGL' 'BOSCAN'
 'CADAFA' 'CANTSP' 'CARISP' 'CLAUAN' 'COMBMO' 'COMMSP' 'CORDSI' 'CROTDI'
 'DICHCI' 'DODOAN' 'EUCLDI' 'FAGAHI' 'FICUSP' 'GREWBI' 'GREWSP' 'GREWHB'
 'GREWHL' 'GREWMH' 'GREWSI' 'HIBICA' 'LIPPJA' 'LYCIEU' 'MAERAN' 'MAERSP'
 'MAERTR' 'MAYTSP' 'MYSTAE' 'MYSTHL' 'OCIMUM' 'OLEACA' 'ORMOSP' 'ORMOTR'
 'PAPPCA' 'PAVEGA' 'PHYLSP' 'PSIAPU' 'PSYCSP' 'PSYDSP' 'PYROSP' 'RHAMST'
 'RHUSNA' 'RHUSHL' 'SCUTMY' 'TAREGR' 'TECLNO' 'TINNAE' 'TRIUBR' 'TURRMO'
 'UNKNOW' 'UNKNO1' 'ZANTCH' 'ZIZIMU']
['Acacia brevispica' 'Acacia drepanolobium' 'Acacia etbaica'
 'Acacia gerrardii' 'Acacia mellifera' 'Acacia nilotica' 'Acacia senegal'
 'Acacia seyal' 'Acacia tortilis' 'Acacia xanthophloea' 'Acokanthera sp.1'
 'Acokanthera sp.2' 'Acokanthera sp.3' 'Balanites aegypticus'
 'Balanites glaber' 'Boscia angustifolia' 'Cadaba farinosa'
 'Canthium pseu

In [134]:
undef = tree_labels.loc[tree_labels['Latin'] == 'Unidentified Unidentified']
print("Relevant mnemonics: ", undef['Mnemonic'].unique())
print("Count UNKNOW: ", len(undef.loc[undef['Mnemonic'] == 'UNKNOW']))
print("Count UNKNO1: ", len(undef.loc[undef['Mnemonic'] == 'UNKNO1']))

# Possible Actions: Dig into UNKNOW vs. UNKNO1; Assign UNKNO1 to UNKNOW

Relevant mnemonics:  ['UNKNOW' 'UNKNO1']
Count UNKNOW:  113
Count UNKNO1:  12
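
Another way to locate this kind of mismatch, without knowing in advance which Latin name is affected, is to group by Latin name and count distinct mnemonics. A minimal sketch on a hypothetical toy frame (the rows are made up; only the column names come from the data above):

```python
import pandas as pd

# Hypothetical rows that mimic the UNKNOW / UNKNO1 situation
toy = pd.DataFrame({
    "Latin": ["Acacia brevispica", "Unidentified Unidentified",
              "Unidentified Unidentified", "Unidentified Unidentified"],
    "Mnemonic": ["ACACBR", "UNKNOW", "UNKNO1", "UNKNOW"],
})

# Number of distinct mnemonics attached to each Latin name
codes_per_name = toy.groupby("Latin")["Mnemonic"].nunique()

# Latin names that map to more than one mnemonic
ambiguous = codes_per_name[codes_per_name > 1].index.tolist()
```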


In [135]:
# Merge UNKNO1 into UNKNOW so that each Latin name maps to a single mnemonic.
# replace() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
tree_labels = tree_labels.replace({'Mnemonic': {'UNKNO1': 'UNKNOW'}})

# Verify that label numbers match
print("Num Mnemonic:", len(tree_labels['Mnemonic'].unique()))
print("Num Latin:",len(tree_labels['Latin'].unique()))

Num Mnemonic: 67
Num Latin: 67


### Calculate Tree Label Frequency

In [136]:
val_counts = tree_labels['Latin'].value_counts(normalize=True)
species_counts = val_counts.to_frame(name="Species Frequencies")
species_counts.head()

Unnamed: 0,Species Frequencies
Croton dichogamous,0.205676
Acacia brevispica,0.194565
Euclea divinorum,0.181285
Acacia drepanolobium,0.089156
Acacia mellifera,0.088063


### LiDAR Images

In [137]:
# read saved data table with approximated lat and long
with open(os.path.join(project_path, 'outputs', 'lidar_with_latlong.pickle'), 'rb') as f:
    latlong_dict = pickle.load(f)

In [138]:
# Each dictionary value stores the approximate (latitude, longitude) pair at index 1
tree_dict = pd.DataFrame({
    'tree_id': list(latlong_dict.keys()),
    'latitude': [val[1][0] for val in latlong_dict.values()],
    'longitude': [val[1][1] for val in latlong_dict.values()],
})
tree_dict.head()

Unnamed: 0,tree_id,latitude,longitude
0,treeID_42693,0.284333,36.869064
1,treeID_42718,0.284321,36.871805
2,treeID_42716,0.284316,36.870929
3,treeID_42730,0.284316,36.871608
4,treeID_42715,0.284332,36.870756


In [139]:
num_trees = len(tree_dict)
num_trees

9473

# Match species labels with LiDAR images

In [140]:
species_counts.head()

Unnamed: 0,Species Frequencies
Croton dichogamous,0.205676
Acacia brevispica,0.194565
Euclea divinorum,0.181285
Acacia drepanolobium,0.089156
Acacia mellifera,0.088063


In [141]:
tree_dict.head()

Unnamed: 0,tree_id,latitude,longitude
0,treeID_42693,0.284333,36.869064
1,treeID_42718,0.284321,36.871805
2,treeID_42716,0.284316,36.870929
3,treeID_42730,0.284316,36.871608
4,treeID_42715,0.284332,36.870756


In [142]:
# Random assignment with equal probability (random labels)
tree_rand_labels = tree_dict.copy()  # copy so later cells cannot overwrite these labels
tree_rand_labels['Label'] = species_counts.sample(n=num_trees, replace=True).index
freq_counts = tree_rand_labels['Label'].value_counts(normalize=True)  # sorted in descending order
print(f'Summary:\nRange: {freq_counts.iloc[0]-freq_counts.iloc[-1]}\nTop 5: {list(freq_counts[:5])}\nAvg: {freq_counts.mean()}')
tree_rand_labels.head()

Summary:
Range: 0.0054892853372743595
Top 5: [0.01762905098701573, 0.01752348780745276, 0.016678982370949013, 0.016678982370949013, 0.016573419191386045]
Avg: 0.014925373134328358


Unnamed: 0,tree_id,latitude,longitude,Label
0,treeID_42693,0.284333,36.869064,Unidentified Unidentified
1,treeID_42718,0.284321,36.871805,Acacia seyal
2,treeID_42716,0.284316,36.870929,Dodonaea angistifolia
3,treeID_42730,0.284316,36.871608,Scutia myrtina
4,treeID_42715,0.284332,36.870756,Olea capensis
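
Because the sampling call above draws fresh random numbers on every run, the labels change each time the cell is executed. If reproducible baselines are wanted, one option is a seeded NumPy generator. A sketch, assuming a hypothetical three-species list and an arbitrary seed:

```python
import numpy as np

species = ["species_a", "species_b", "species_c"]  # hypothetical species list
n_trees = 12                                       # hypothetical number of trees

# A seeded generator yields the same draw on every run
rng = np.random.default_rng(seed=288)              # the seed value is arbitrary
uniform_labels = rng.choice(species, size=n_trees, replace=True)

# Re-creating the generator with the same seed reproduces the labels
rerun = np.random.default_rng(seed=288).choice(species, size=n_trees, replace=True)
```

`DataFrame.sample` also accepts a `random_state` argument, which serves the same purpose without leaving pandas.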


In [143]:
# Random assignment with probability proportional to species frequency (proportional labels)
tree_prop_labels = tree_dict.copy()  # copy so the random labels are not overwritten
tree_prop_labels['Label'] = species_counts.sample(n=num_trees, replace=True, weights=species_counts['Species Frequencies']).index
# Compare the sampled label frequencies against the original species frequencies
orig_freq = species_counts['Species Frequencies'].rename('Original')
samp_freq = tree_prop_labels["Label"].value_counts(normalize=True).rename('Sample')
# The inner join drops species that were never drawn in the sample
deviation = pd.merge(orig_freq, samp_freq, how='inner', left_index=True, right_index=True)
deviation['Diff'] = deviation['Original'] - deviation['Sample']  # signed difference
print(f"Total deviation: {deviation['Diff'].sum()}, Average deviation: {deviation['Diff'].mean()}")
tree_prop_labels.head()

Total deviation: -0.0008871968972829665, Average deviation: -1.8483268693395135e-05


Unnamed: 0,tree_id,latitude,longitude,Label
0,treeID_42693,0.284333,36.869064,Euclea divinorum
1,treeID_42718,0.284321,36.871805,Euclea divinorum
2,treeID_42716,0.284316,36.870929,Acacia drepanolobium
3,treeID_42730,0.284316,36.871608,Croton dichogamous
4,treeID_42715,0.284332,36.870756,Croton dichogamous
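
Signed differences largely cancel when summed, so the total printed above says little about how closely the sample follows the original frequencies. One alternative is to sum absolute differences over all species, treating species that were never drawn as having frequency 0. A sketch on hypothetical frequencies:

```python
import pandas as pd

# Hypothetical original frequencies and a small sample in which species_c never appears
original = pd.Series({"species_a": 0.5, "species_b": 0.3, "species_c": 0.2})
sampled_labels = pd.Series(["species_a"] * 6 + ["species_b"] * 4)

# Align the sampled frequencies to the original index; missing species get 0
sampled = sampled_labels.value_counts(normalize=True).reindex(original.index, fill_value=0.0)

# Absolute per-species deviation and its total
abs_diff = (original - sampled).abs()
total_abs_deviation = abs_diff.sum()
```

Half of this total is the total variation distance between the two distributions, which lies between 0 and 1.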


In [144]:
# Save results to pickle files for easy retrieval

with open(os.path.join(project_path, 'outputs', 'tree_random_labels.pickle'), 'wb') as f:
    pickle.dump(tree_rand_labels, f, protocol=pickle.HIGHEST_PROTOCOL)

with open(os.path.join(project_path, 'outputs', 'tree_proportional_labels.pickle'), 'wb') as f:
    pickle.dump(tree_prop_labels, f, protocol=pickle.HIGHEST_PROTOCOL)
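
A quick round-trip check can confirm that the two files hold different label assignments. The sketch below writes to a temporary directory rather than the project folder, and the two small frames are hypothetical stand-ins for `tree_rand_labels` and `tree_prop_labels`.

```python
import os
import pickle
import tempfile

import pandas as pd

# Hypothetical stand-ins for the two label tables
rand_df = pd.DataFrame({"tree_id": ["t1", "t2"], "Label": ["species_a", "species_b"]})
prop_df = pd.DataFrame({"tree_id": ["t1", "t2"], "Label": ["species_a", "species_a"]})

with tempfile.TemporaryDirectory() as tmp:
    paths = {}
    # Write each table to its own pickle file
    for name, df in {"random": rand_df, "proportional": prop_df}.items():
        paths[name] = os.path.join(tmp, f"{name}.pickle")
        with open(paths[name], "wb") as f:
            pickle.dump(df, f, protocol=pickle.HIGHEST_PROTOCOL)
    # Read both files back
    with open(paths["random"], "rb") as f:
        rand_back = pickle.load(f)
    with open(paths["proportional"], "rb") as f:
        prop_back = pickle.load(f)

# True would mean both files contain the same label column
same_labels = rand_back["Label"].equals(prop_back["Label"])
```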