# Decision trees and random forests

## Data

Download the [World Values Survey](http://www.worldvaluessurvey.org/WVSDocumentationWV6.jsp) data and consult the accompanying questionnaire and codebook files to understand what the dataset contains.

## Overarching research question

Explain which variables affect happiness (`V10`) using decision-tree learning.

## Method

There are many tools for fitting decision trees. We use [scikit-learn](https://scikit-learn.org/stable/modules/tree.html).

In [None]:
## Create new data frame for analysis

import pandas as pd

selected_keys = ['V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10']

# Read only the selected columns from the semicolon-delimited file.
# read_csv parses the answer codes as numbers and closes the file when done.
df = pd.read_csv( 'data/wvs.csv', sep=';', usecols=selected_keys )[selected_keys]
df

In [None]:
from sklearn import tree

# Use tree.DecisionTreeRegressor() instead for a regression model
decision_tree = tree.DecisionTreeClassifier( max_depth=2 )
model = decision_tree.fit( df[selected_keys[:-1]], df['V10'] )

In [None]:
import graphviz
from IPython.display import display
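# Export the fitted tree as Graphviz DOT source, labelling splits with the column names
dot_source = tree.export_graphviz( model, feature_names=selected_keys[:-1], class_names=True )
display( graphviz.Source( dot_source ) )  # alternatively, write the DOT source to a file and render it with another tool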

## Model analysis

As discussed in the lecture, there are [many different metrics for evaluating the quality of a model](https://scikit-learn.org/stable/modules/model_evaluation.html#model-evaluation). Beyond single-number metrics (such as accuracy or the F1 score), examining the confusion matrix helps assess model performance, because it shows which classes the model confuses with one another.

In [None]:
from sklearn import metrics

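# Note: these predictions are made on the same data the model was trained on
predicted_values = model.predict( df[selected_keys[:-1]] )
# confusion_matrix expects the true labels first: rows are true classes, columns are predicted classes
metrics.confusion_matrix( df['V10'], predicted_values )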
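
To make the single-number metrics and the confusion matrix concrete, here is a minimal sketch on made-up labels (the arrays `y_true` and `y_pred` are hypothetical and not taken from the WVS data). scikit-learn's `confusion_matrix` takes the true labels first, so rows correspond to true classes and columns to predicted classes.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true and predicted answers on a 1-4 scale (made up for illustration)
y_true = np.array([1, 1, 2, 2, 2, 3, 3, 4])
y_pred = np.array([1, 2, 2, 2, 2, 2, 3, 2])

# Rows are true classes, columns are predicted classes
cm = metrics.confusion_matrix(y_true, y_pred, labels=[1, 2, 3, 4])

# Share of answers predicted correctly
accuracy = metrics.accuracy_score(y_true, y_pred)

# Macro-averaged F1 gives every class the same weight;
# zero_division=0 scores classes that are never predicted as 0
macro_f1 = metrics.f1_score(y_true, y_pred, average='macro', zero_division=0)

print(cm)
print(accuracy, macro_f1)
```

In this toy example the model never predicts class 4. The confusion matrix makes that visible as an empty last column, while the accuracy alone would hide it.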

## Tasks

* Choose better or more interesting variables to model.
* Improve data preprocessing (remove missing values etc.).
* Split the data into a training set and a test set before fitting the model. Does that change the conclusions of the analysis?
* Increase the maximum depth of the decision tree. Does it improve the analysis at all?
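
One possible starting point for these tasks is sketched below. It runs on synthetic stand-in data rather than the real survey file, and it assumes that negative answer codes mark missing values; check the codebook before applying that rule to the WVS data. The frame `demo`, the 30% test share and the chosen depths are illustrative choices, not part of the original analysis.

```python
import numpy as np
import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the survey data (hypothetical values, not real WVS answers)
rng = np.random.default_rng(0)
features = ['V4', 'V5', 'V6', 'V7', 'V8', 'V9']
demo = pd.DataFrame(rng.integers(1, 5, size=(500, 6)), columns=features)
demo['V10'] = np.clip(demo['V4'] + rng.integers(-1, 2, size=500), 1, 4)

# Inject some negative codes to mimic missing answers (assumption: negative = missing)
demo.loc[demo.index[:20], 'V10'] = -1

# Preprocessing: keep only rows in which every selected answer is a valid (positive) code
clean = demo[(demo > 0).all(axis=1)]

# Hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    clean[features], clean['V10'], test_size=0.3, random_state=0)

# Compare accuracy on training and test data for several tree depths
scores = {}
for depth in (2, 5, None):  # None lets the tree grow until it cannot split further
    clf = tree.DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_train, y_train)
    scores[depth] = (clf.score(X_train, y_train), clf.score(X_test, y_test))
    print(depth, scores[depth])
```

A growing gap between training and test accuracy as the depth increases is the usual sign that the tree is overfitting.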

## Random forests: decision trees on steroids

The challenge with decision trees, as with many other machine learning algorithms, is that they fit a single model to the data, so the result depends on one particular training sample and one random state. This can easily lead to overfitting and poor performance on new data. [Random forests](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) address this issue by training an ensemble of trees and building a classifier that combines their diverging predictions (e.g. through averaging).

In [None]:
from sklearn import ensemble

random_forest = ensemble.RandomForestClassifier()
model = random_forest.fit( df[selected_keys[:-1]], df['V10'] )

In [None]:
predicted_values = model.predict( df[selected_keys[:-1]] )
# True labels first: rows are true classes, columns are predicted classes
metrics.confusion_matrix( df['V10'], predicted_values )
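
To illustrate how the ensemble combines its trees, the sketch below averages the class probabilities of the individual trees by hand and compares the result with the forest's own output. It uses synthetic data from `make_classification` (an assumption made to keep the example self-contained), and the number of trees is an arbitrary choice.

```python
import numpy as np
from sklearn import ensemble
from sklearn.datasets import make_classification

# Synthetic data (hypothetical), used only to keep the example self-contained
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)

forest = ensemble.RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X, y)

# The fitted trees are available in forest.estimators_;
# stack their class probabilities into shape (n_trees, n_samples, n_classes)
per_tree_proba = np.stack([t.predict_proba(X) for t in forest.estimators_])

# Averaging over the trees reproduces the forest's own probabilities
averaged = per_tree_proba.mean(axis=0)
forest_proba = forest.predict_proba(X)
print(np.allclose(averaged, forest_proba))

# The predicted class is the one with the highest averaged probability
manual_prediction = forest.classes_[averaged.argmax(axis=1)]
print((manual_prediction == forest.predict(X)).all())
```

Note that scikit-learn's forest averages the trees' probabilistic predictions rather than letting each tree cast a single vote for one class.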