## Updatable, weighted Random Forests

We are interested in a method for introducing new data into an existing Random Forest model. This data will come in the form of user-provided feedback on the original model, so we also want to be able to weight the new training data more strongly than the old.

One idea would be to append new trees, trained on the new data, to the end of the existing forest. The votes from these trees could be given a higher weight than those from the original trees, as a way of increasing the influence of the new data over the old.
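To make the weighting idea concrete, here is a minimal sketch of a weighted majority vote using only `numpy`. The function name `weighted_majority_vote` and the toy votes are hypothetical and chosen purely for illustration; they are not part of `sklearn` or `randomForest`.

```python
import numpy as np

def weighted_majority_vote(tree_preds, tree_weights, n_classes):
    """Return the class with the largest total vote weight for each sample.

    tree_preds: int array of shape (n_trees, n_samples) holding class indices.
    tree_weights: array of shape (n_trees,) with one weight per tree.
    """
    tree_preds = np.asarray(tree_preds)
    tree_weights = np.asarray(tree_weights, dtype=float)
    n_samples = tree_preds.shape[1]
    totals = np.zeros((n_samples, n_classes))
    for preds, weight in zip(tree_preds, tree_weights):
        # add this tree's weight to the class it voted for, sample by sample
        totals[np.arange(n_samples), preds] += weight
    return totals.argmax(axis=1)

# toy example: three "old" trees followed by two "new" trees, two samples
tree_preds = np.array([[0, 1],
                       [0, 1],
                       [0, 2],
                       [1, 2],
                       [1, 2]])
unweighted = weighted_majority_vote(tree_preds, [1, 1, 1, 1, 1], n_classes=3)
weighted = weighted_majority_vote(tree_preds, [1, 1, 1, 2, 2], n_classes=3)
print(unweighted)  # [0 2]
print(weighted)    # [1 2]
```

For the first sample the old trees win an unweighted vote 3 to 2, but doubling the weight of the new trees flips the result (3 against 4).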

NB: we ultimately want to implement this in R using the `randomForest` package; the example here is a quick prototype in Python using `sklearn`.

In [1]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from scipy.stats import mode
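from sklearn.tree._tree import DTYPE
from sklearn.utils import check_array
from sklearn.utils.validation import check_is_fitted
from sklearn.metrics import confusion_matrix
# NB: private import path as used by recent sklearn releases; older releases
# used sklearn.ensemble.forest and sklearn.externals.joblib instead
from sklearn.ensemble._base import _partition_estimators
from joblib import Parallel, delayed


def parallel_helper(obj, methodname, *args, **kwargs):
    # local stand-in for the private sklearn helper of the same name,
    # which is no longer importable in recent releases
    return getattr(obj, methodname)(*args, **kwargs)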

I have extended the existing `RandomForestClassifier` from `sklearn` with two methods: one that predicts by majority vote rather than by averaged class probabilities (modified from [this SE question](http://stats.stackexchange.com/questions/127077/random-forest-probabilistic-prediction-vs-majority-vote/148610#148610)), and one that adds new trees to the forest.

In [2]:
class UpdatableRandomForestClassifier(RandomForestClassifier):
    
    def __init__(self, **kwargs):
        # set initial vote weight
        init_vote_weight = kwargs.pop('init_vote_weight', 1)
        super(UpdatableRandomForestClassifier, self).__init__(**kwargs)
        self.vote_weights = np.repeat(init_vote_weight, self.n_estimators)
    
    def fit_new_data(self, X, y, vote_weight=1, **kwargs):
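        '''
        Fit a separate forest to the new data and append its trees to the
        existing random forest model. vote_weight must be a non-negative
        integer (each new tree's vote is repeated that many times), the new
        data is assumed to contain the same classes as the original data,
        and **kwargs are passed on to RandomForestClassifier.
        '''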
        new_trees = RandomForestClassifier(**kwargs).fit(X=X, y=y)
        new_tree_weights = np.repeat(vote_weight, new_trees.n_estimators)
        self.n_estimators += new_trees.n_estimators
        self.estimators_ += new_trees.estimators_
        self.vote_weights = np.concatenate([self.vote_weights,
                                            new_tree_weights])

    def predict_votes(self, X):
        """
        Predict class for X.

        Uses majority voting, rather than the soft voting scheme
        used by RandomForestClassifier.predict.
        """
        check_is_fitted(self, 'n_outputs_')

        # Check data
        X = check_array(X, dtype=DTYPE, accept_sparse="csr")

        # Assign chunk of trees to jobs
        n_jobs, n_trees, starts = _partition_estimators(self.n_estimators,
                                                        self.n_jobs)

        # Parallel loop
        all_preds = Parallel(n_jobs=n_jobs, verbose=self.verbose,
                             backend="threading")(
            delayed(parallel_helper)(e, 'predict', X, check_input=False)
            for e in self.estimators_)
        
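        # weight predictions from old / new trees by repeating each tree's
        # row of votes vote_weight times (so weights must be integers)
        all_preds = np.repeat(all_preds, self.vote_weights, axis=0)

        # calculate modes from weighted votes; ravel() copes with scipy
        # versions that either keep or drop the reduced axis
        modes, _ = mode(all_preds, axis=0)
        modes = np.asarray(modes).ravel().astype('int64')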
        return self.classes_.take(modes, axis=0)
    
    def score_votes(self, X, y):
        '''
        Score accuracy using predict_votes
        '''
        return np.mean(self.predict_votes(X) == y)
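
The difference between the two voting schemes mentioned in the `predict_votes` docstring can be seen on a toy example. The sketch below uses made-up per-tree probabilities for a single sample (an assumption for illustration, not output from a fitted forest): soft voting averages the probabilities, while majority voting counts each tree's preferred class.

```python
import numpy as np

# made-up class probabilities from three trees for one sample,
# shape (n_trees, n_classes)
tree_probs = np.array([[0.9, 0.1],
                       [0.4, 0.6],
                       [0.4, 0.6]])

# soft voting: average the probabilities, then pick the most probable class
soft_prediction = int(tree_probs.mean(axis=0).argmax())

# majority (hard) voting: each tree votes for its own most probable class
tree_votes = tree_probs.argmax(axis=1)
vote_counts = np.bincount(tree_votes, minlength=tree_probs.shape[1])
hard_prediction = int(vote_counts.argmax())

print(soft_prediction)  # 0
print(hard_prediction)  # 1
```

One very confident tree dominates the average, whereas under majority voting it is simply outvoted. Per-tree vote weights are easier to reason about in the second scheme.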

I have used the iris dataset as an example, keeping only the last two features (petal length and petal width).

In [3]:
# split data into 1/4 initial training, 1/4 subsequent training and 1/2 test
iris = load_iris()
# only use last two variables to train
iris.data = iris.data[:, 2:]
X_train1, X_rest, y_train1, y_rest = train_test_split(
    iris.data, iris.target, test_size=0.75, random_state=0
)
# split the remaining 3/4 into roughly 1/3 new training data and 2/3 test
X_train2, X_test, y_train2, y_test = train_test_split(
    X_rest, y_rest, test_size=0.66, random_state=0
)


In [4]:
rf = UpdatableRandomForestClassifier(n_estimators=10,
                                     init_vote_weight=1,
                                     random_state=0)
rf = rf.fit(X_train1, y_train1)
print('accuracy:', rf.score_votes(X_test, y_test))
print('confusion matrix:')
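# NB: arguments are passed as (predicted, true), so rows are predicted
# classes and columns are true classes
print(confusion_matrix(rf.predict_votes(X_test), y_test))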

accuracy: 0.88
confusion matrix:
[[24  0  0]
 [ 0 23  9]
 [ 0  0 19]]


In [5]:
# introduce new trees from new data and use double the weighting over old trees
rf.fit_new_data(X_train2, y_train2, vote_weight=2, n_estimators=10)
print('accuracy:', rf.score_votes(X_test, y_test))
print('confusion matrix:')
print(confusion_matrix(rf.predict_votes(X_test), y_test))

accuracy: 0.893333333333
confusion matrix:
[[24  0  0]
 [ 0 23  8]
 [ 0  0 20]]
