According to OpenML, I'm getting a really strange performance boost when using warm-start RandomForests:
Using OpenML:
from sklearn.ensemble import RandomForestClassifier
from openml import tasks,runs,datasets
task = tasks.get_task(145677)
clfs = [RandomForestClassifier(n_estimators=64, warm_start=True, n_jobs=-1),
        RandomForestClassifier(n_estimators=64, warm_start=False, n_jobs=-1)]
for clf in clfs:
    run = runs.run_task(task, clf)
    p = run.publish()
    print("Uploaded run 1 with id %s. Check it at www.openml.org/r/%s" % (run.run_id, run.run_id))
Output:
Uploaded run 1 with id 1852507. Check it at www.openml.org/r/1852507
Uploaded run 1 with id 1852508. Check it at www.openml.org/r/1852508
Looking up the results on OpenML, the first has an AUC of 0.9959, the second has AUC 0.8764.
This is a huge performance gap. Moreover, the classifier was not trained incrementally, so why does the warm start make a difference?
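For context on what warm_start is supposed to do: in scikit-learn it only matters when the same estimator object is fitted repeatedly with a growing n_estimators, in which case the existing trees are kept and only the new ones are trained. A minimal sketch of that intended behaviour (standalone, using a synthetic dataset rather than the OpenML task):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)

# With warm_start=True, a second fit() with a larger n_estimators
# keeps the 8 existing trees and only trains the 8 new ones.
clf = RandomForestClassifier(n_estimators=8, warm_start=True, random_state=0)
clf.fit(X, y)
clf.set_params(n_estimators=16)
clf.fit(X, y)
print(len(clf.estimators_))  # 16: the 8 old trees plus 8 new ones
```

With a fixed n_estimators and a single fit per estimator, as in the runs above, warm_start=True should therefore be a no-op, which is what makes the AUC gap so puzzling.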
Just to check whether this is OpenML-specific, here is the same experiment doing 10xCV locally:
from sklearn.model_selection import cross_val_score
import numpy as np

task = tasks.get_task(145677)
dataset = task.get_dataset()
X, y, categorical = dataset.get_data(target=dataset.default_target_attribute,
                                     return_categorical_indicator=True)
clfs = [RandomForestClassifier(n_estimators=64, warm_start=True, n_jobs=-1),
        RandomForestClassifier(n_estimators=64, warm_start=False, n_jobs=-1)]
for clf in clfs:
    scores = cross_val_score(clf, X, y, cv=10, scoring="roc_auc", n_jobs=-1)
    print("10xCV Score (AUC): %s" % (np.mean(scores)))
Output:
10xCV Score (AUC): 0.87225053074
10xCV Score (AUC): 0.87225053074
Any clue about what's going on? Are we somehow maintaining information from classifiers trained in previous runs? Even then, a score of 0.9959 would be unexpected. Is there some form of information leakage going on?
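One thing that argues against state carrying over between folds: cross_val_score clones the estimator before fitting each fold, and sklearn.base.clone copies only constructor parameters, dropping all fitted state. A quick sanity check of that (standalone sketch on synthetic data, not the OpenML task):

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)
clf = RandomForestClassifier(n_estimators=64, warm_start=True, random_state=0)
clf.fit(X, y)

# clone() copies only the constructor parameters, so the cloned
# warm-start forest has no trees and starts fitting from scratch.
fresh = clone(clf)
print(hasattr(fresh, "estimators_"))  # False: no trees carried over
```

So if the fitted trees really were leaking between runs, it would have to happen somewhere that bypasses this cloning.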
Seems like a case for @amueller @mfeurer @janvanrijn ;)