Very weird behavior from RandomForest with warm start #210

@joaquinvanschoren

According to OpenML, I'm getting a really strange performance boost when using warm start RandomForests:

Using OpenML:

from sklearn.ensemble import RandomForestClassifier
from openml import tasks,runs,datasets

task = tasks.get_task(145677)
clfs = [RandomForestClassifier(n_estimators=64,warm_start=True,n_jobs=-1),
        RandomForestClassifier(n_estimators=64,warm_start=False,n_jobs=-1)]

for clf in clfs:
    run = runs.run_task(task, clf)
    p = run.publish()
    print("Uploaded run 1 with id %s. Check it at www.openml.org/r/%s" %(run.run_id,run.run_id))

Output:

Uploaded run 1 with id 1852507. Check it at www.openml.org/r/1852507
Uploaded run 1 with id 1852508. Check it at www.openml.org/r/1852508

Looking up the results on OpenML, the first run (warm_start=True) has an AUC of 0.9959, while the second (warm_start=False) has an AUC of 0.8764.
This is a huge performance gap. Moreover, the classifier was never trained incrementally, so why does warm start make a difference?
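
For reference, here is a minimal sketch of what warm_start does when fit is called twice on the same object without raising n_estimators. It uses synthetic data from make_classification as a stand-in (an assumption, not the OpenML task above) and only checks whether the tree objects survive the second fit.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; this is NOT the OpenML task from the report.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_a, y_a = X[:200], y[:200]
X_b, y_b = X[200:], y[200:]

clf = RandomForestClassifier(n_estimators=8, warm_start=True, random_state=0)
clf.fit(X_a, y_a)
trees_after_first_fit = list(clf.estimators_)

# Second fit on different data, with n_estimators left unchanged.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    clf.fit(X_b, y_b)

# If every tree is the very same object, the second fit trained nothing new.
same_trees = all(a is b for a, b in zip(trees_after_first_fit, clf.estimators_))
print("trees reused after second fit:", same_trees)
print("warnings raised:", [str(w.message) for w in caught])
```

If the trees are reused, any code path that calls fit repeatedly on one warm-start object (rather than on a fresh copy) would keep evaluating the model trained on the first call.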

To check whether this is OpenML-specific, here is the same experiment run locally with 10-fold CV:

# Reuses the imports from the snippet above, plus:
import numpy as np
from sklearn.model_selection import cross_val_score

task = tasks.get_task(145677)
dataset = task.get_dataset()
X, y, categorical = dataset.get_data(target=dataset.default_target_attribute,return_categorical_indicator=True)

clfs = [RandomForestClassifier(n_estimators=64,warm_start=True,n_jobs=-1),
        RandomForestClassifier(n_estimators=64,warm_start=False,n_jobs=-1)]

for clf in clfs:
    scores = cross_val_score(clf,X,y,cv=10,scoring="roc_auc",n_jobs=-1)
    print("10xCV Score (AUC): %s" %(np.mean(scores)))

Output:

10xCV Score (AUC): 0.87225053074
10xCV Score (AUC): 0.87225053074

Any clue about what's going on? Are we somehow maintaining information from classifiers trained in previous runs? Even then, a score of 0.9959 would be unexpected. Is there some form of information leakage going on?
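
One hypothesis that would fit the numbers (not confirmed here, and I have not checked how runs.run_task handles the estimator internally): if the same estimator object is fitted on every fold without being cloned, a warm-start forest keeps the trees from the first fold and is then scored on test folds it already saw during training. cross_val_score clones the estimator for each fold, which would explain why the local run shows no gap. The sketch below reproduces that scenario on synthetic data; manual_cv_auc is a hypothetical helper written for this illustration, not part of OpenML or scikit-learn.

```python
import warnings

import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold


def manual_cv_auc(clf, X, y, clone_per_fold):
    """Mean AUC over 10 folds; optionally reuse one estimator object."""
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = []
    for train_idx, test_idx in cv.split(X, y):
        # clone() returns a fresh, unfitted copy with the same parameters.
        model = clone(clf) if clone_per_fold else clf
        with warnings.catch_warnings():
            # Silence the warning about refitting without new trees.
            warnings.simplefilter("ignore")
            model.fit(X[train_idx], y[train_idx])
        proba = model.predict_proba(X[test_idx])[:, 1]
        scores.append(roc_auc_score(y[test_idx], proba))
    return float(np.mean(scores))


# Noisy synthetic data (assumption), so held-out AUC stays well below 1.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)

warm = RandomForestClassifier(n_estimators=64, warm_start=True, random_state=0)

# Same object on every fold: folds 2..10 are scored by the fold-1 model.
reused_auc = manual_cv_auc(warm, X, y, clone_per_fold=False)
# Fresh copy per fold: this mirrors what cross_val_score does.
cloned_auc = manual_cv_auc(warm, X, y, clone_per_fold=True)

print("AUC, estimator reused across folds:", reused_auc)
print("AUC, estimator cloned per fold:   ", cloned_auc)
```

If the reused-estimator score comes out clearly higher, that points to leakage from the evaluation loop rather than to any real benefit of warm_start.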

Seems like a case for @amueller @mfeurer @janvanrijn ;)
