# ReforesTree 🌴

We are excited to share the ReforesTree dataset with you! 🎉
We introduce the ReforesTree dataset in the hope of encouraging the machine learning community to take on
the challenge of developing low-cost, scalable, trustworthy, and accurate solutions for monitoring, verification, and reporting of tropical reforestation inventory.

#### This dataset covers the following 6 agroforestry sites
In alphabetical order:
1. _Carlos Vera Arteaga_
2. _Carlos Vera Guevara_
3. _Flora Pluas_
4. _Leonor Aspiazu_
5. _Manuel Macias_
6. _Nestor Macias_


## Dataset Components
For each site, we publish four components, all free to use:
1. 🛸 Raw drone RGB images _(see wwf_ecuador/"Name of site")_
2. 🌴 Hand-measured tree parameters (diameter at breast height, species, biomass, and location) of every tree _(see field_data.csv)_
3. 🔲 A set of bounding boxes of trees for each site, cleaned by hand and labeled as banana or not banana _(see annotations/cleaned)_
4. ↔️ Mappings between these bounding boxes and the tree labels, based on location _(see mappings/final)_
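
Before working through the tutorial, it can help to check which of the four components are present locally. The following is a minimal sketch, not part of the original tooling: the relative paths are taken from the list above, while the helper names `find_components` and `missing_components` and the assumption that all components sit under one common root folder are hypothetical.

```python
from pathlib import Path

# Relative paths taken from the component list; the exact on-disk layout
# after unpacking the archive is an assumption.
COMPONENTS = {
    "drone_images": "wwf_ecuador",
    "field_data": "field_data.csv",
    "cleaned_boxes": "annotations/cleaned",
    "mappings": "mappings/final",
}


def find_components(root):
    """Return {component name: True/False} depending on whether its path exists."""
    root = Path(root)
    return {name: (root / rel).exists() for name, rel in COMPONENTS.items()}


def missing_components(root):
    """List the names of the components that are not present under root."""
    return [name for name, found in find_components(root).items() if not found]
```

Calling `missing_components("data")` returns an empty list when all four components are in place.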


## Tutorial
In this tutorial, we walk through the steps to recreate (and hopefully improve) the dataset and show how to use it.

Please read our paper [here](https://arxiv.org/abs/2201.11192).
For any questions, please reach out to gyri.reiersen@tum.de or david.dao@inf.eth.ch

## Load packages and modules

In [39]:
import torchgeo

In [10]:
from torchgeo.datasets import ReforesTree
ds = ReforesTree(root="data/reforestree/", download=True, checksum=True)

Downloading https://zenodo.org/record/6813783/files/reforesTree.zip?download=1 to data/reforestree/reforesTree.zip



In [42]:
import os
import numpy as np
import pandas as pd
import geopandas as gpd
import rasterio
from rasterio.plot import reshape_as_image
import PIL
PIL.Image.MAX_IMAGE_PIXELS = None
from PIL import Image
import sys
# Make the parent directory and its utils folder importable
package = os.path.dirname(os.getcwd())
sys.path.append(package)
sys.path.append(os.path.join(package, 'utils'))

#import seaborn as sns
#colors = sns.color_palette('tab10')
#mypalette={'NN':colors[0], 'GMN':colors[4], 'OT':colors[1], 'OT on GPS position':colors[1], 'GW':colors[2], 'OT on GPS position + Tree species':colors[3]}
#import matplotlib.pylab as plt
#import tensorflow

import warnings
warnings.filterwarnings('ignore')

In [43]:
from utils.extract_features import *
from utils.deepforest_detection import *
#from utils.visualisation import *
#from utils.plot_folium import *
#from utils.plot_density import *
#from utils.mapping import *

## Data Processing

### 🛸 Split raw drone RGB orthomosaics into image tiles
1. **Extract drone information for each site:**
From the raw data we create the dataframe _ortho_data.csv_, which contains the key information on each RGB drone orthomosaic: minimum and maximum latitude and longitude, plus width and height (in pixels).


2. **Split the orthomosaics into 4000x4000 tiles:**
To handle the drone data, we split the large .tif files into tiles, which are stored in _data/tiles_.
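
To make the tiling step concrete, here is a minimal, dependency-free sketch of how overlapping tile windows can be laid over an orthomosaic. It is an illustration only and not necessarily how `split_raster` places its windows: the function names `tile_origins` and `tile_windows` are hypothetical, and pulling the last tile back to the image border is an assumption of this sketch. The defaults mirror the `patch_size=4000` and `patch_overlap=0.05` used in this tutorial.

```python
def tile_origins(length, patch_size, patch_overlap):
    """Start offsets (in pixels) of the tiles along one image axis.

    Consecutive tiles are shifted by patch_size * (1 - patch_overlap).
    The last tile is pulled back so that it ends at the image border.
    """
    if length <= patch_size:
        return [0]
    stride = max(1, int(round(patch_size * (1 - patch_overlap))))
    origins = list(range(0, length - patch_size, stride))
    origins.append(length - patch_size)
    return origins


def tile_windows(width, height, patch_size=4000, patch_overlap=0.05):
    """(xmin, ymin, xmax, ymax) pixel windows covering a width x height image."""
    return [
        (x, y, min(x + patch_size, width), min(y + patch_size, height))
        for y in tile_origins(height, patch_size, patch_overlap)
        for x in tile_origins(width, patch_size, patch_overlap)
    ]
```

The window offsets (`xmin`, `ymin`) are what later allow tile-local detections to be shifted back into orthomosaic coordinates.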

In [46]:
directory = "data/wwf_ecuador/RGB Orthomosaics"
save_dir = "data/tiles"

# Extracting the main information for each site 
ortho_data = create_ortho_data(directory, os.path.join(save_dir, 'ortho_data.csv'))

In [47]:
# Split images into tiles (might take a few minutes)
for file in os.listdir(directory):
    if file.endswith('.tif'):
        # Build the raster path and derive the site name from the file name
        path_to_raster = os.path.join(directory, file)
        name = file.replace('.tif', '')

        # Only tile sites that have no tile directory yet
        tiles_dir = os.path.join(save_dir, name)
        if not os.path.exists(tiles_dir):
            os.makedirs(tiles_dir)
            split_raster(path_to_raster, base_dir=tiles_dir, patch_size=4000, patch_overlap=0.05)

### 🌴 Rescale drone bounds using field data

The file "_field_data.csv_" lists all the trees on the sites. Each row contains the tree parameters, the tree location, and the calculated aboveground biomass (AGB) and carbon.

Note: the column "updated diameter" is used as the DBH. For completeness and transparency, the original column "diameter" is kept. For trees with missing values (especially cacao), an extrapolated diameter was calculated from the average diameter of that species for the year the tree was planted.
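
The extrapolation described above can be sketched as a group-wise mean fill. This is a minimal illustration under stated assumptions, not the authors' processing code: the function name `fill_missing_diameters` is hypothetical, a diameter of 0.0 is assumed to mark a missing measurement, and the records are plain dictionaries using the column names `name`, `year`, `diameter` and `updated diameter`.

```python
from collections import defaultdict


def fill_missing_diameters(trees, missing=0.0):
    """Return copies of the tree records with an added 'updated diameter'.

    Measured diameters are kept as they are. A missing diameter is replaced
    by the mean measured diameter of trees of the same species ('name')
    planted in the same 'year'.
    """
    sums = defaultdict(lambda: [0.0, 0])  # (name, year) -> [sum, count]
    for tree in trees:
        if tree["diameter"] != missing:
            acc = sums[(tree["name"], tree["year"])]
            acc[0] += tree["diameter"]
            acc[1] += 1

    filled = []
    for tree in trees:
        row = dict(tree)  # copy, so the input records stay untouched
        key = (tree["name"], tree["year"])
        if tree["diameter"] != missing:
            row["updated diameter"] = tree["diameter"]
        elif key in sums:
            total, count = sums[key]
            row["updated diameter"] = total / count
        else:
            row["updated diameter"] = None  # no measured tree to average over
        filled.append(row)
    return filled
```

Groups without any measured tree are left as `None` here, so that they can be handled explicitly instead of being filled silently.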

A separate tutorial on how the field data is processed is TBD.

In [48]:
field_data = pd.read_csv('data/field_data.csv')
field_data

| Unnamed: 0 | name | lat | lon | diameter | height | year | plot_id | site | X | Y | updated diameter | group | AGB | carbon |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Cacao | -2.181226 | -79.576630 | 0.000000 | 0.0 | 2016.0 | P8 | Nestor Macias RGB | 2761.628615 | 6831.070678 | 6.843647 | cacao | 5.444228 | 2.123249 |
| 1 | Cacao | -2.181312 | -79.576412 | 0.000000 | 0.0 | 2016.0 | P8 | Nestor Macias RGB | 5067.141765 | 7729.961820 | 6.843647 | cacao | 5.444228 | 2.123249 |
| 2 | Cacao | -2.181438 | -79.576322 | 0.000000 | 0.0 | 2016.0 | P8 | Nestor Macias RGB | 6025.223497 | 9026.909643 | 6.843647 | cacao | 5.444228 | 2.123249 |
| 3 | Cacao | -2.181593 | -79.576154 | 0.000000 | 0.0 | 2016.0 | P8 | Nestor Macias RGB | 7803.490078 | 10637.681956 | 6.843647 | cacao | 5.444228 | 2.123249 |
| 4 | Cacao | -2.181498 | -79.576179 | 0.000000 | 0.0 | 2016.0 | P8 | Nestor Macias RGB | 7531.400369 | 9648.963864 | 6.843647 | cacao | 5.444228 | 2.123249 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4658 | Musacea | -1.129593 | -79.594408 | 23.236567 | 0.0 | 2019.0 | P7 | Manuel Macias RGB | 5481.100278 | 6416.246583 | 23.236567 | banana | 24.381901 | 9.508941 |
| 4659 | Cacao | -1.129664 | -79.594279 | 0.000000 | 0.0 | 2019.0 | P7 | Manuel Macias RGB | 6767.025955 | 7172.157045 | 2.387319 | cacao | 0.676596 | 0.263872 |
| 4660 | Musacea | -1.129678 | -79.594442 | 21.645022 | 0.0 | 2019.0 | P7 | Manuel Macias RGB | 5147.097504 | 7321.370785 | 21.645022 | banana | 20.962055 | 8.175201 |
| 4661 | Cacao | -1.129588 | -79.594546 | 0.000000 | 0.0 | 2019.0 | P7 | Manuel Macias RGB | 4102.201665 | 6366.720321 | 2.387319 | cacao | 0.676596 | 0.263872 |


In [49]:
# Rescale the orthomosaic bounds of each site using the field data
list_sites = ['Carlos Vera Arteaga RGB', 'Carlos Vera Guevara RGB', 'Flora Pluas RGB', 'Leonor Aspiazu RGB', 
             'Manuel Macias RGB', 'Nestor Macias RGB']

ortho_data = rescale_bounds(ortho_data, field_data, list_sites)
ortho_data.to_csv(os.path.join(save_dir, 'ortho_data.csv'), index = False)

In [77]:
import deepforest
from deepforest import main
from deepforest import get_data

## 🔲 DeepForest Tree Detection

From tiles to bounding box annotations! For this step we use a pretrained and fine-tuned DeepForest model.

Note on fine-tuning: to make sure the model detects the trees on these sites accurately, we fine-tuned it on a set of hand-created bounding boxes.

In [83]:
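# Load the fine-tuned model (assumes the legacy Keras-based DeepForest < 1.0 API, which reads .h5 weights)
from deepforest import deepforest as deepforest_legacy
model_path = os.path.join(os.getcwd(), 'data', 'model', 'deepforest', 'final_model_4000_epochs_35.h5')
model = deepforest_legacy.deepforest(saved_model=model_path)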

In [79]:
# Detect trees per tile and return bounding box annotations for each site
column_names = ['img_path', 'xmin', 'ymin', 'xmax', 'ymax', 'score']

data_dir = os.path.join(os.getcwd(), 'data')  # renamed from dir to avoid shadowing the built-in dir()
tiles_dir = os.path.join(data_dir, 'tiles')
ann_dir = os.path.join(data_dir, 'annotations')

if not os.path.exists(ann_dir):
    os.makedirs(ann_dir)

for folder in os.listdir(tiles_dir):
    # Skip hidden files and the ortho_data.csv stored next to the tile folders
    if not (folder.startswith('.') or folder.startswith('ortho_data.csv')):
        file_path = os.path.join(ann_dir, '{}_annotations.csv'.format(folder))
        if not os.path.exists(file_path):
            annotations_files = pd.DataFrame(columns = column_names)
            folder_path = os.path.join(tiles_dir,folder)
            
            for file in os.listdir(folder_path):
                if not file.startswith('.'):
                    tile_annotations = get_annotations(os.path.join(folder_path, file), model)
                    annotations_files = pd.concat([annotations_files, tile_annotations])
            annotations_files = annotations_files.reset_index(drop=True)

            annotations_files.to_csv(file_path, index=False)
        print('DeepForest annotations are saved for site {}'.format(folder))



#### Additional tile information per bounding box annotation

For each tree annotation we add tree location, tile position, and site.

In [53]:
for site_name in list_sites:
    for file in os.listdir(ann_dir):
        if (file == '{}_annotations.csv'.format(site_name)):
            file_path = os.path.join(ann_dir,file)
            df = pd.read_csv(file_path)
            
            if not ('Xmin' in df.columns): # in case you have done this before
                df['img_name'], df['tile_index'], df['tile_xmin'], df['tile_ymin'], df['tile_xmax'], df['tile_ymax'] = zip(*df['img_path'].map(expand_tile_features))
                df[['x', 'y']] = df.apply(lambda x: [get_center(x.xmin,x.xmax), get_center(x.ymin,x.ymax)], axis=1, result_type="expand")

                df['Xmin'] = df.xmin + df.tile_xmin
                df['Ymin'] = df.ymin + df.tile_ymin
                df['Xmax'] = df.xmax + df.tile_xmin
                df['Ymax'] = df.ymax + df.tile_ymin
                df['X'] = df.x + df.tile_xmin
                df['Y'] = df.y + df.tile_ymin

                df[['lon', 'lat']] = df.apply(lambda x: convert_xy_tile_to_lonlat(x.img_name, x.tile_xmin, x.tile_ymin, x.x, x.y, ortho_data), axis=1, result_type="expand")
                df.to_csv(file_path, index = False)
            print('Added info for {}'.format(site_name))
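
The coordinate bookkeeping in the cell above can be summarized in two small functions. This is a hedged sketch: `tile_to_global` restates the offset arithmetic from the cell, while `pixel_to_lonlat` is a hypothetical stand-in for `convert_xy_tile_to_lonlat` that assumes a north-up orthomosaic and plain linear interpolation between the bounds stored in _ortho_data.csv_. The actual helper in `utils` may work differently.

```python
def tile_to_global(x, y, tile_xmin, tile_ymin):
    """Shift tile-local pixel coordinates into orthomosaic pixel coordinates."""
    return x + tile_xmin, y + tile_ymin


def pixel_to_lonlat(X, Y, width, height, lon_min, lon_max, lat_min, lat_max):
    """Linearly interpolate orthomosaic pixel coordinates to lon/lat.

    Assumes a north-up image: X grows eastwards and Y grows southwards,
    so Y = 0 corresponds to the maximum latitude.
    """
    lon = lon_min + (X / width) * (lon_max - lon_min)
    lat = lat_max - (Y / height) * (lat_max - lat_min)
    return lon, lat
```

The direction of the axes matches the field data shown earlier, where `X` increases with longitude and `Y` increases as latitude decreases.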
        

In [54]:
# Merge the per-site annotation files into one file
all_annotations = pd.DataFrame()

for file in os.listdir(ann_dir):
    # Skip hidden files and entries starting with 'c' or 'a'
    # (presumably the 'cleaned' folder and a previously written 'all_annotations.csv')
    if not (file.startswith('.') or file.startswith('c') or file.startswith('a')):
        file_path = os.path.join(ann_dir, file)
        df = pd.read_csv(file_path)
        all_annotations = pd.concat([all_annotations, df])
all_annotations.to_csv(os.path.join(ann_dir,'all_annotations.csv'))

### 🍌 Cleaned bounding boxes
As part of our work, we manually cleaned the bounding boxes produced by DeepForest. This was done in two steps. First, we filtered out bounding boxes that did not show trees, either because of their size (e.g. too large an area, or only a single leaf) or because of a wrong detection (e.g. a car or grass). Second, we labeled each tree as either "banana" (easily recognizable, as it looks like a palm tree) or "other".
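
The cleaning described above was done by hand. As a possible aid for anyone repeating it, the size-based part can be pre-sorted automatically so that only suspicious boxes need a manual look. The sketch below is an assumption-laden illustration, not the procedure used for the published dataset: the function names and the `min_area` and `max_area` thresholds are hypothetical and would need to be tuned per site.

```python
def box_area(box):
    """Pixel area of a bounding box given as a dict with xmin, ymin, xmax, ymax."""
    width = max(0, box["xmax"] - box["xmin"])
    height = max(0, box["ymax"] - box["ymin"])
    return width * height


def flag_for_review(boxes, min_area, max_area):
    """Split boxes into (plausible, suspicious) lists based on their pixel area.

    Boxes smaller than min_area (e.g. a single leaf) or larger than max_area
    (e.g. several crowns merged into one) end up in the suspicious list.
    """
    plausible, suspicious = [], []
    for box in boxes:
        target = plausible if min_area <= box_area(box) <= max_area else suspicious
        target.append(box)
    return plausible, suspicious
```

Wrong detections such as cars or grass cannot be caught by area alone, so a visual check remains necessary.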

Please find the cleaned dataset in "_data/annotations/cleaned_".

In [120]:
clean_annotations = pd.read_csv('data/annotations/cleaned/clean_annotations.csv')
clean_annotations.columns

Index(['img_path', 'xmin', 'ymin', 'xmax', 'ymax', 'score', 'img_name',
       'tile_index', 'tile_xmin', 'tile_ymin', 'tile_xmax', 'tile_ymax', 'x',
       'y', 'Xmin', 'Ymin', 'Xmax', 'Ymax', 'X', 'Y', 'lon', 'lat',
       'is_banana'],
      dtype='object')

## ↔️ Matching of tree label + bounding box

TBD