# Predicting Trees with a Random Forest

## 1 Introduction

### 1.A Aims and Objectives

The aim of this project is to apply a machine learning model capable of predicting the species of trees in Vienna when given certain characteristics as input. The data for this project stems from the City of Vienna, which records information about more than 200,000 trees located within the city. These data include physical characteristics, such as trunk circumference, tree height, and treetop diameter, as well as the year of planting and the city district in which the tree is located.

After preprocessing and cleaning the data, I will fit a random forest classifier to 80% of the data in order to train the model. Afterwards, I will test the model's performance on the remaining 20%. Below is a brief overview of the sections of this project.

### 1.B Sections of this project report

1. Introduction
    1. Aims and Objectives
    2. Sections of this project report
2. The Data
    1. Data Source
    2. Data Cleaning
3. Random Forest Classification
4. Conclusion and Limitations

In [100]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, accuracy_score

## 2 The Data

### 2.A Data Source

I found this dataset on [Open Data Austria](https://www.data.gv.at/), which lists more than 40,000 open datasets, primarily from Austrian municipal or national public institutions. The dataset for this project has been provided by the City of Vienna and was last updated in the spring of 2020. It can be found [here](https://www.data.gv.at/katalog/dataset/stadt-wien_baumkatasterderstadtwien).

The dataset lists the location and biological characteristics of more than 200,000 trees in Vienna. This includes all roadside trees as well as some (but not all) trees growing in parks and wooded areas in Vienna.

The data is licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/deed.en), which allows for sharing and adapting of the data, as long as the source is credited.

In [101]:
# Loading the dataset.
data = pd.read_csv("/Users/jakobkanz/Dropbox/Python/Datasets/BAUMKATOGD.csv")

In [102]:
# Initial visual inspection.
data.head()

| | FID | OBJECTID | SHAPE | BAUM_ID | DATENFUEHRUNG | BEZIRK | OBJEKT_STRASSE | GEBIETSGRUPPE | GATTUNG_ART | PFLANZJAHR | PFLANZJAHR_TXT | STAMMUMFANG | STAMMUMFANG_TXT | BAUMHOEHE | BAUMHOEHE_TXT | KRONENDURCHMESSER | KRONENDURCHMESSER_TXT | BAUMNUMMER | SE_ANNO_CAD_DATA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BAUMKATOGD.482058773 | 482058773 | POINT (16.314455985683868 48.23990465927611) | 28620 | magistrat | 18.0 | Pötzleinsdorfer Straße | MA 28 - Straße, Grünanlage | Tilia cordata (Winterlinde) | 1940 | 1940 | 152 | 152 cm | 3 | 11-15 m | 3 | 7-9 m | 1028 | |
| 1 | BAUMKATOGD.482058774 | 482058774 | POINT (16.37307077654833 48.216167292881224) | 251100 | magistrat | 1.0 | 01., Donaukanal Pachtflächen 1. Bezirk, DHK | viadonau - DHK Donaukanal Pachtflächen | Platanus x acerifolia (Ahornblättrige Platane) | 2009 | 2009 | 48 | 48 cm | 2 | 6-10 m | 1 | 0-3 m | 1050 | |
| 2 | BAUMKATOGD.482058775 | 482058775 | POINT (16.28228964539069 48.145860820264765) | 220135 | magistrat | 23.0 | 23., Fridtjof-Nansen-Park, MA42 | MA 42 - Parkanlage | Pinus nigra (Schwarzkiefer, Schwarzföhre) | 0 | nicht definiert | 50 | 50 cm | 2 | 6-10 m | 2 | 4-6 m | 7037 | |
| 3 | BAUMKATOGD.482058776 | 482058776 | POINT (16.314090696657324 48.24000873075163) | 259236 | magistrat | 18.0 | Pötzleinsdorfer Straße | MA 28 - Straße, Grünanlage | Tilia cordata 'Rancho' (Kleinblättrige Winterl... | 2017 | 2017 | 30 | 30 cm | 1 | 0-5 m | 1 | 0-3 m | 1031 | |
| 4 | BAUMKATOGD.482058777 | 482058777 | POINT (16.31394177964165 48.24005204919185) | 131839 | magistrat | 18.0 | Pötzleinsdorfer Straße | MA 28 - Straße, Grünanlage | Tilia cordata (Winterlinde) | 1940 | 1940 | 142 | 142 cm | 3 | 11-15 m | 2 | 4-6 m | 1032 | |


### 2.B Data Cleaning

For this project, only a subset of columns is necessary. The dependent variable is the tree species. The independent variables include all biological or physical characteristics: the circumference of the tree trunk, the height of the tree, and the diameter of the treetop. Additionally, I chose to include the year of planting and the district of Vienna in which the tree is located.

As a first step, I translate and rename the relevant columns for easier use. Then, I inspect each of the columns in turn for missing values and potential irregularities.

In [103]:
# Renaming and translating the relevant columns from German to English.
data = data.rename(columns=
          {'BEZIRK': 'district',
           'GATTUNG_ART': 'species',
           'PFLANZJAHR': 'year',
           'STAMMUMFANG': 'trunk',
           'BAUMHOEHE': 'height',
           'KRONENDURCHMESSER': 'top'
          })

#### Species

The dependent variable, whose value I will attempt to predict in this project, is species. It lists both the Latin and German name of each tree species in the dataset. Some observations do not list a species but rather a note indicating that a young tree will be planted at this location in the city. I drop these observations and convert the variable to the data type 'category'.

In [104]:
# Drop observations which do not indicate a species.
data = data[data['species'] != 'Jungbaum wird gepflanzt']

In [105]:
# Convert to category data type.
data['species'] = data['species'].astype('category')

#### District

The column district, which records the district of Vienna in which a given tree is located, contains a small number of missing values (<1% of the data). Manually cross-referencing a subsample of these trees with their coordinates shows that they are situated outside the city boundaries of Vienna (not shown below). Given the scope of this project, I exclude these observations. Finally, I convert the column to the data type 'category', since it contains a small, fixed number of districts (23).

In [106]:
# Checking for missing values.
data['district'].isnull().sum()

555

In [107]:
# Dropping missing values, as these trees lie outside the city boundaries.
data = data[data['district'].notna()]

In [108]:
# Convert to integer first to drop the trailing '.0' (e.g. 18.0 becomes 18), then convert to category.
data['district'] = data['district'].astype('int64')
data['district'] = data['district'].astype('category')
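
The two-step conversion above can be illustrated on a toy column. This is a minimal sketch with made-up values. It assumes the district codes were loaded as floats, as the `18.0` entries in the preview suggest, presumably because the column contains missing values.

```python
import pandas as pd

# Toy district codes (made-up values); the missing entry forces a float dtype.
districts = pd.Series([18.0, 1.0, 23.0, None, 18.0], name="district")

# Drop the missing entry, cast to integer, then to category.
clean = districts.dropna().astype("int64").astype("category")

# The category labels are now plain integers rather than floats.
print(clean.cat.categories.tolist())  # [1, 18, 23]
```

Casting straight from float to category would keep labels such as `18.0`, which is why the integer step comes first.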

#### Height, Trunk, Treetop

The columns height, trunk (circumference) and top (diameter) each contain a small number of missing values, which are coded as 0 in this dataset (less than 1% of the data for each column). Since these missing values amount to only a small percentage of the data, I drop the affected observations and proceed.

In [109]:
# Checking for missing values (indicated by '0' in this dataset).
print(data[data['height'] == 0].shape)
# Dropping missing values.
data = data[data['height'] != 0]

(720, 19)


In [110]:
# Checking for missing values (indicated by '0' in this dataset).
print(data[data['top'] == 0].shape)
# Dropping missing values.
data = data[data['top'] != 0]

(230, 19)


In [111]:
# Checking for missing values (indicated by '0' in this dataset).
print(data[data['trunk'] == 0].shape)
# Dropping missing values.
data = data[data['trunk'] != 0]

(794, 19)
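
The three cells above repeat the same check-and-drop pattern. As a possible alternative, the steps could be folded into one helper. The function name `drop_zero_coded` and the toy values below are hypothetical and only illustrate the idea. Note that this sketch counts zeros on the full frame before dropping anything, so its counts can differ slightly from the sequential counts printed above whenever a row has zeros in more than one column.

```python
import pandas as pd

def drop_zero_coded(df, columns):
    """Count and drop rows where any of `columns` equals 0 (the missing-value code)."""
    # Number of zero-coded entries per column, counted before any row is dropped.
    counts = {col: int((df[col] == 0).sum()) for col in columns}
    # Keep only rows in which every listed column is non-zero.
    keep = (df[columns] != 0).all(axis=1)
    return df[keep].copy(), counts

# Toy data (made-up values) using the renamed column names from this report.
toy = pd.DataFrame({
    "height": [3, 0, 2, 1],
    "top": [3, 1, 0, 1],
    "trunk": [152, 48, 50, 0],
})

cleaned, zero_counts = drop_zero_coded(toy, ["height", "top", "trunk"])
print(zero_counts)
print(cleaned.shape)
```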


#### Year

Inspection of the year column, which records the year of planting, reveals over 62,000 missing values (coded as 0). Given that the entire dataset contains just over 200,000 observations, these rows cannot simply be dropped. Instead, I inspect the average characteristics of the subsample of trees with a missing value in the year column.

The average values for height, trunk and top are all larger in this subsample than the mean values for trees with known planting years. I then compare the subsample's means with the means of trees planted in each decade from 1900 to 2000. For the columns trunk and height, the trees with missing year values are closest on average to the trees planted between 1970 and 1979. For the column top, the decade between 1980 and 1989 is a closer fit. Hence, I replace the missing year values with '1975' as a proxy.

In [112]:
# Checking for missing values.
data[data['year'] == 0].shape

(62615, 19)

In [113]:
# Assigning all trees with year equal to zero to a new variable.
zerotree = data[data['year'] == 0]
# Assigning all other trees to a new variable.
notzerotree = data[data['year'] != 0]

In [114]:
# Comparing mean values for 'height', 'top', and 'trunk'.
print(zerotree['height'].mean())
print(notzerotree[(notzerotree['year'] >= 1970) & (notzerotree['year'] < 1980)]['height'].mean())
print(zerotree['trunk'].mean())
print(notzerotree[(notzerotree['year'] >= 1970) & (notzerotree['year'] < 1980)]['trunk'].mean())
print(zerotree['top'].mean())
print(notzerotree[(notzerotree['year'] >= 1980) & (notzerotree['year'] < 1990)]['top'].mean())

2.635422822007506
2.737771345875543
123.2020282679869
125.46649782923299
2.6502914637067794
2.66004792879151
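
The cell above only shows the final comparison. The decade-by-decade search described in the text might look roughly like the following sketch. The helper name `closest_decade` and the toy values are assumptions for illustration, not the code that was actually run for this report.

```python
import pandas as pd

def closest_decade(df, column, start=1900, stop=2000):
    """Return the decade whose mean of `column` is closest to that of rows with year == 0."""
    missing = df[df["year"] == 0]
    known = df[df["year"] != 0]
    target = missing[column].mean()

    gaps = {}
    for decade in range(start, stop, 10):
        in_decade = known[(known["year"] >= decade) & (known["year"] < decade + 10)]
        if not in_decade.empty:  # skip decades without any trees
            gaps[decade] = abs(in_decade[column].mean() - target)
    # The decade with the smallest absolute gap is the best match.
    return min(gaps, key=gaps.get)

# Toy data (made-up values): two trees without a planting year, six with one.
toy = pd.DataFrame({
    "year": [0, 0, 1955, 1958, 1972, 1976, 1991, 1995],
    "trunk": [120, 130, 200, 210, 124, 128, 60, 70],
})

print(closest_decade(toy, "trunk"))  # 1970
```

Running the helper once per column (trunk, height, top) would reproduce the kind of comparison summarised above.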


In [115]:
# Replacing missing values with '1975'.
data['year'] = data['year'].replace(0, 1975)

#### Selecting a species-subset

As a final step before moving into the modelling phase, I select a subset of species. This is necessary since several of the more than 500 species in the dataset account for only a very small number of observations (<10). With so few examples per class, the classifier could not learn these species reliably.

The three most common tree species in the dataset are Acer platanoides (Norway maple), Aesculus hippocastanum (horse chestnut), and Fraxinus excelsior (common ash). Together, these species account for just under 40,000 observations, which amounts to about 20% of the entire dataset.

In [116]:
# Getting the n most frequent tree species and saving them to a list.
n = 3
top_3_trees = data['species'].value_counts()[:n].index.tolist()

In [117]:
# Creating new dataframe with just top n trees.
data_3 = data.loc[data['species'].isin(top_3_trees)]

In [118]:
# Checking number of observations.
data_3.shape

(39683, 19)
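
The share covered by the most frequent species can be checked directly from the value counts. The sketch below uses made-up single-letter labels instead of the real species names. It also illustrates one detail that may matter here: a categorical column keeps its unused categories after filtering, and pandas offers `cat.remove_unused_categories()` to trim them.

```python
import pandas as pd

# Toy species labels (hypothetical placeholders, not real counts).
species = pd.Series(list("AAAABBBCCD"), dtype="category")

# Frequency of each species, most common first.
counts = species.value_counts()

n = 3
top_n = counts.head(n)
share = top_n.sum() / counts.sum()  # fraction of rows covered by the top n species

# Keep only rows belonging to the top n species.
subset = species[species.isin(top_n.index.tolist())]

# The filtered column still lists "D" as a category until it is trimmed.
trimmed = subset.cat.remove_unused_categories()

print(top_n.index.tolist(), share)
print(list(subset.cat.categories), list(trimmed.cat.categories))
```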

## 3 Random Forest Classification

Having cleaned and prepared the dataset, I can now move to the modelling phase. I will use a random forest classification model to predict the species of a given tree. As input features, I will use the physical characteristics (trunk circumference, tree height, and treetop diameter) as well as the year in which the tree was planted and the district of Vienna where the tree can be found.

To prepare my model, I assign the target feature (species) to the dependent variable y. The set of independent variables is assigned to the feature matrix X.

In [120]:
# Assigning target and input features to y and X, respectively.
y = data_3['species']
X = data_3[['year', 'district', 'trunk', 'height', 'top']]

#### Train-Test Split

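As a next step, I split the data into two subsets: one is used to train the model, and the other is used to evaluate out-of-sample model performance. Because train_test_split is called without a random_state, the exact split, and with it the scores reported below, will differ slightly from run to run.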

In [121]:
# Splitting into train and test sample.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [122]:
# Checking dimensions.
X_train.shape, y_train.shape

((31746, 5), (31746,))

In [123]:
# Checking dimensions.
X_test.shape, y_test.shape

((7937, 5), (7937,))
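
As a variation on the split above, `train_test_split` also accepts `random_state` (to make the split repeatable) and `stratify` (to keep the class proportions equal in both subsets). The sketch below demonstrates both on made-up data. It is not the split used for the results in this report, and the seed value 18 is simply borrowed from the classifier cell further down.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy features and labels (made-up values) with a 50/30/20 class mix.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "trunk": rng.integers(20, 200, size=100),
    "height": rng.integers(1, 6, size=100),
})
y_toy = pd.Series(["a"] * 50 + ["b"] * 30 + ["c"] * 20)

# Seeded, stratified 80/20 split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=18
)

# The test set mirrors the class proportions of the full data.
print(y_te.value_counts().to_dict())
```

Stratification is most useful when some classes are much rarer than others. Whether it would change the scores here is not something this sketch can show.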

#### Fitting Random Forest Classifier

I can now fit the random forest classifier to the training set and predict the species of each tree in the test set from its input features.

In [124]:
# Fitting random forest classifier to training set.
clf = RandomForestClassifier(random_state = 18)
clf.fit(X_train, y_train)

RandomForestClassifier(random_state=18)

In [125]:
# Predicting the species of the set used to test the out of sample model performance.
prediction = clf.predict(X_test)
print(prediction)

['Acer platanoides (Spitzahorn)' 'Acer platanoides (Spitzahorn)'
 'Acer platanoides (Spitzahorn)' ... 'Fraxinus excelsior (Gemeine Esche)'
 'Acer platanoides (Spitzahorn)' 'Acer platanoides (Spitzahorn)']


#### Model Performance

Finally, I calculate two metrics to gauge model performance: accuracy, which measures the share of correctly predicted data points, and the F1 score, which is the harmonic mean of precision and recall (here computed per class and averaged with class frequencies as weights). The accuracy lands at just over 0.62, whereas the F1 score amounts to just under 0.62. These scores are a reasonable start, but they leave significant room for improvement.

In the following section, I discuss changes that could help improve the model and its performance.

In [126]:
# Accuracy
accuracy_score(y_test, prediction)

0.6217714501700895

In [127]:
# F1 score
f1_score(y_test, prediction, average='weighted')

0.615825316208563
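
To make the two metrics concrete, here is a small worked example on six hypothetical predictions (toy labels, unrelated to the results above). It computes the weighted F1 score by hand from the per-class scores and compares it with scikit-learn's `average='weighted'` result.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels for six trees (toy data, not taken from the dataset).
y_true = ["maple", "maple", "maple", "chestnut", "chestnut", "ash"]
y_pred = ["maple", "maple", "chestnut", "chestnut", "ash", "ash"]
labels = ["maple", "chestnut", "ash"]

# Accuracy: share of predictions that match the true label (4 of 6 here).
acc = accuracy_score(y_true, y_pred)

# Per-class F1: harmonic mean of precision and recall, one value per label.
per_class_f1 = f1_score(y_true, y_pred, labels=labels, average=None)

# Weighted F1: per-class scores averaged with the true class counts as weights.
support = np.array([y_true.count(label) for label in labels])
weighted_by_hand = np.average(per_class_f1, weights=support)
weighted_f1 = f1_score(y_true, y_pred, labels=labels, average="weighted")

print(acc, per_class_f1, weighted_by_hand, weighted_f1)
```

Because the weights are the class frequencies, the most common class dominates the weighted score. Looking at the per-class values shows which species the model confuses most.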

## 4 Conclusion and Limitations

In this project, I used a random forest classifier to build a model in order to predict the species of trees in Vienna. The model is able to perform these predictions by using physical characteristics such as tree height, treetop diameter, and trunk circumference, as well as the year of planting and the district of Vienna where the tree is located.

The final model achieved a reasonable but improvable score both in terms of accuracy and F1 score. What could be done to improve this and what are the limitations of this project?

#### Limitations

Starting with the data quality, additional physical characteristics of these trees would be beneficial. These might range from information about blooming cycles to leaf sizes and root structure.

Perhaps even more importantly, more precise metrics would be helpful. Of the physical characteristics, only trunk circumference is measured in centimeters. Both treetop diameter and tree height are recorded in broader categories (such as 0-5 m, 6-10 m, etc.). A more granular measure would likely improve modelling results.

The assumption placed on the missing values in the year column (where missing values were assumed to be equal to 1975) places a further limitation on this project.

For future work, extending this model to cover more than the three most common tree species would be a promising avenue.