Machine-Learning-Tree-based-method

This project practices tree-based methods on predicting the presence/absence of 'bigfoot' with climate data

This exercise applied CART and Random Forest to predict the presence of 'big foot" in area with known climate.

Load packages

library(rspatial)
library(maptools)
library(raster)
library(rpart)
library(dismo)
library(randomForest)
library(maptools)

Observations -- Occurrence of "big foot" worldwide\

bf <- sp_data('bigfoot')

plot(bf[,1:2], cex=0.5, col='red')
data(wrld_simpl)

plot(wrld_simpl, add=TRUE)

Predictors
Climate is used as predictors.
In world climate raster file, bio1 is annual mean temperature(C). bio12 is annual precipitation.

wc <- raster::getData('worldclim', res=10, var='bio')

# plot two bio data
plot(wc[[c(1, 12)]], nr=2)

bfc <- extract(wc, bf[,1:2])

# Any missing values?
i <- which(is.na(bfc[,1]))

plot(bf[,1:2], cex=0.5, col='blue')

plot(wrld_simpl, add=TRUE)

points(bf[i, ], pch=20, cex=3, col='red')

Background data
To create climate data without knowing big foot exists or not.

# extent of all points
e <- extent(SpatialPoints(bf[, 1:2]))

# 5000 random samples (excluding NA cells) from extent e
set.seed(0)
bg <- sampleRandom(wc, 5000, ext=e)

#combine presence and background data, pa = 1 is presence; pa = 0 is absence.
d <- rbind(cbind(pa=1, bfc), cbind(pa=0, bg))
d <- data.frame(d)

#split the data into east and west
de <- d[bf[,1] > -102, ]
dw <- d[bf[,1] <= -102, ]

Fit a model
Classification and Regression Trees (CART) model

cart <- rpart(pa~., data=dw)

printcp(cart) 
# CP is complexity parameter->control the size of decision tree
# if CP increases as the nsplit increases. The tree will stop growing.

plotcp(cart)

Here is the tree

plot(cart, uniform=TRUE)
# text(cart, use.n=TRUE, all=TRUE, cex=.8)
text(cart, cex=.7, digits=1)

Question 1
Describe the conditions under which you have the highest probability of finding our beloved species?

The highest probability is 0.9. There are two highest probability.

One is in condition when bio4<8679, and bio10<218.5 and bio15>=40.5, and bio8<88.5. The other is when bio4<8679, and bio10<218.5, and bio15<40.5, and bio14<44.5, and bio5<280.5, bio3<33.5, bio6>=124.5

Random Forest

Classification

The plot below shows that as the increase of tree numbers, error rate was significantly reduced at the beginning, then reached a steady error rate no matter how many trees are trained.

# create a factor to indicated that we want classification
fpa <- as.factor(dw[, 'pa'])

#fit the random forest model
crf <- randomForest(dw[, 2:ncol(dw)], fpa)
crf

plot(crf)

The variable importance plot shows which variables are most important in fitting the model.
Gini index indicates the degree of impurity of the split. High Gini index indicates high impurity.
Therefore, variables that cause high mean decrease in Gini index rank top.

varImpPlot(crf)

Regression of Random Forest

Question 2: What did tuneRF help us find? What does the values of mt represent?

'tuneRF' helps to find the optimal value of variable number for randomForest.
'mt' represents the number of variables that has the minimum error.
In this case, it is 3.

#tune the parameter and get the best variable number
trf <- tuneRF(dw[, 2:ncol(dw)], dw[, 'pa'])

mt <- trf[which.min(trf[,2]), 1]

mt

Train regression tree

rrf <- randomForest(dw[, 2:ncol(d)], dw[, 'pa'], mtry=mt)

plot(rrf)

x axis is increased node purity.

varImpPlot(rrf)

Predict

# Extent of the western points
ew <- extent(SpatialPoints(bf[bf[,1] <= -102, 1:2]))

Regression Here is our prediction based on the trained regression model.

rp <- predict(wc, rrf, ext=ew)
plot(rp)

To get the optimal threshold of deciding presence/absence, you would normally have a hold out data set, but here I used the training data for simplicity.\

The plot below is the plot of Receiver Operating Characteristic curve (ROC curve). AUC is area under curve. If ROC is close to the top left corner, the accuracy of that threshold value is the best. If ROC is close to the right bottom corner, the accuracy is low.

# Cross-validation of models with presence/absence data.
eva <- evaluate(dw[dw$pa==1, ], dw[dw$pa==0, ], rrf)

plot(eva, 'ROC')

Get a threshold value to determine presence/absence and plot the prediction.
Green area shows the presence of 'bigfoot'.

tr <- threshold(eva)

plot(rp > tr[1, 'spec_sens'])

Classification

rc <- predict(wc, crf, ext=ew)
plot(rc)

can also get probabilities for the classes

rc2 <- predict(wc, crf, ext=ew, type='prob', index=2)
plot(rc2)

Extrapolation Making prediction with regression random forest model. The AUC plot show the model accuracy is low.

de <- na.omit(de)
eva2 <- evaluate(de[de$pa==1, ], de[de$pa==0, ], rrf)

plot(eva2, 'ROC')

Take a look at it on a map.

Question 3: Why would it be that the model does not extrapolate well?

Training data is not comprehensive. The model is trained by the dw data. It didn't include any data in the eastern US. Because the topography and climate in western area are quite different from each other. So it is possible the training data is not inclusive enough for the general prediction.\
Over fitting. It is also possible that the number of variables are way more than needed which led the model fit the western region too well to be general.The tuning process also indicated that 3 is the optimal number of regression random forest.

eus <- extent(SpatialPoints(bf[, 1:2]))
rcusa <- predict(wc, rrf, ext=eus)
plot(rcusa)
points(bf[,1:2], cex=.25)

Question 4: Make a map to show where conditions are improving for western bigfoot, and where they are not. Is the species headed toward extinction?

The species do not head toward extinction.
Most of the area have variation smaller than zero as shown in the histogram which indicates that population of bigfoot increases in the future.

#get the new climate data
fut <- getData('CMIP5', res=10, var='bio', rcp=85, model='AC', year=70)

#keep the column nama the same
names(fut) <- names(wc)

##extrapolate the spatial distribution of bigfoot with the new climate data
futusa <- predict(fut, rrf, ext=ew, progress='window')

#difference between occurence predicted by regression random forest -future
bd_dift <- rp-futusa

plot(bd_dift, col = topo.colors(20), main="Bigfoot Shift")

##plot histogram on the variation data
hist(bd_dift, main="Bigfoot Shift")

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
cmip5/10m		cmip5/10m
docs		docs
wc10		wc10
README.md		README.md
Tree-based method.Rmd		Tree-based method.Rmd
gadm36_MEX_1_sp.rds		gadm36_MEX_1_sp.rds

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Machine-Learning-Tree-based-method

Classification

Regression of Random Forest

About

Uh oh!

Releases

Packages

ninalty/Machine-Learning-Tree-based-method

Folders and files

Latest commit

History

Repository files navigation

Machine-Learning-Tree-based-method

Classification

Regression of Random Forest

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Packages