# Decision trees and random forests

## Data

Download the [World Values Survey](http://www.worldvaluessurvey.org/WVSDocumentationWV6.jsp) data and read the corresponding questionnaire and codebook files to understand what the dataset contains.

## Overarching research question

Explain which variables affect happiness (`V10`) using decision-tree learning.

## Method

There are many tools for fitting decision trees. We use [caret](https://topepo.github.io/caret/).

In [None]:
## Create new data matrix for analysis

selected_keys <- c('V10', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9')

full_data <- read.csv('data/wvs.csv', sep = ";")

data <- full_data[,selected_keys ]

print( nrow( data ) )
head(data)

In [None]:
library(caret)

data$V10 <- as.factor( data$V10 ) # Comment this out to run regression tree learning

model <- train( V10 ~ ., data = data, method = 'rpart')

In [None]:
best <- model$finalModel
plot( best )
text( best )

## Model analysis

As discussed in the lecture, there are [many different metrics for evaluating the quality of a model](http://topepo.github.io/caret/measuring-performance.html). Beyond single metrics (such as accuracy or the F1 score), examining the confusion matrix can help to assess model performance, because it shows which classes the model confuses with each other.
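
To make the link between the confusion matrix and the single metrics concrete, here is a minimal base R sketch. The labels are invented toy values, not survey answers, and the variable names are only illustrative.

```r
# Hypothetical toy labels (not taken from the survey data)
actual    <- factor(c(1, 1, 2, 2, 2, 3, 3, 1), levels = 1:3)
predicted <- factor(c(1, 2, 2, 2, 3, 3, 3, 1), levels = 1:3)

# Rows are predictions, columns are the reference labels
cm <- table(Predicted = predicted, Actual = actual)
print(cm)

# Accuracy: share of observations on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)

# Per-class recall: correct predictions divided by the number of true cases
recall <- diag(cm) / colSums(cm)

print(accuracy)
print(recall)
```

Accuracy collapses the whole table into one number, while the per-class values show whether some answer categories are predicted much worse than others.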

In [None]:
predicted_values <- predict( model, data )
confusionMatrix( predicted_values, data$V10 )

## Tasks

* Choose better or more interesting variables to model.
* Improve data preprocessing (remove missing values etc.).
* Apply a training data - test data split in the data analysis stage. Does that improve the analysis at all?
* Increase the maximum depth of the decision tree. Does it improve the analysis at all?
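
As a possible starting point for the last two tasks, the sketch below splits the data and tunes the tree depth with caret. It uses the built-in `iris` data purely as a stand-in so that it runs on its own; in the notebook you would use `data` and `V10` instead. The 80/20 split, the seed, and the depth grid are arbitrary assumptions, not recommended values.

```r
library(caret)

set.seed(42)  # fix the random split so the result is reproducible

# iris is only a stand-in here; in the notebook, use `data` and `V10`
toy <- iris

# Stratified 80/20 split on the outcome
train_idx <- createDataPartition(toy$Species, p = 0.8, list = FALSE)
train_set <- toy[train_idx, ]
test_set  <- toy[-train_idx, ]

# method = 'rpart2' tunes the maximum tree depth
# (method = 'rpart' tunes the complexity parameter instead)
depth_grid <- data.frame(maxdepth = 1:5)
model <- train(Species ~ ., data = train_set,
               method = 'rpart2', tuneGrid = depth_grid)

# Compare performance on the rows used for fitting and on the held-out rows
train_pred <- predict(model, train_set)
test_pred  <- predict(model, test_set)
print(confusionMatrix(train_pred, train_set$Species)$overall['Accuracy'])
print(confusionMatrix(test_pred, test_set$Species)$overall['Accuracy'])

# Depth selected by caret's resampling
print(model$bestTune)
```

A split does not make the model itself better; it makes the evaluation more honest, since accuracy measured on the training rows tends to be optimistic.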

## Random forests - decision trees on steroids

The challenge with decision trees, like many other machine learning algorithms, is that they fit a single model to the data, relying on a single random state. This can easily lead to overfitting and poor performance. [Random forests](https://topepo.github.io/caret/train-models-by-tag.html#Random_Forestl) address this issue by training an ensemble of trees and building a classifier that combines their diverging predictions (e.g. through averaging or majority voting).
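
The ensemble idea can be sketched by hand. The example below is plain bagging with `rpart`, not a full random forest: a real random forest additionally considers only a random subset of predictors at each split. The `iris` data is a stand-in, and the number of trees and the seed are arbitrary assumptions.

```r
library(rpart)

set.seed(1)
toy <- iris      # stand-in for the survey data
n_trees <- 25    # arbitrary ensemble size for illustration

# Fit each tree on a bootstrap sample (rows drawn with replacement)
trees <- lapply(seq_len(n_trees), function(i) {
  boot_rows <- sample(nrow(toy), replace = TRUE)
  rpart(Species ~ ., data = toy[boot_rows, ], method = "class")
})

# One column of predicted labels per tree
votes <- sapply(trees, function(tree) {
  as.character(predict(tree, toy, type = "class"))
})

# Majority vote across trees for every row
ensemble_pred <- apply(votes, 1, function(row) names(which.max(table(row))))

# Share of rows where the ensemble agrees with the true label
print(mean(ensemble_pred == toy$Species))
```

Each tree sees a slightly different sample and therefore makes different mistakes; the vote smooths these out, which is why the ensemble is usually more stable than any single tree.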

In [None]:
# This can take time to run
model <- train( V10 ~ ., data = data, method = 'rf')

In [None]:
predicted_values <- predict( model, data )
confusionMatrix( predicted_values, data$V10 )