# How to overfit your ML model?

by Li Shen, Ph.D.

Icahn School of Medicine at Mount Sinai

Updated: 2018-05-11

In [1]:
%pylab inline
import numpy as np

Populating the interactive namespace from numpy and matplotlib


Generate a **purely** random dataset from a standard Gaussian distribution, with 100 samples and 20,000 features. The first 50 samples are assigned label=0 and the other 50 samples label=1; because every feature is pure noise, this is equivalent to assigning the labels at random.

In [5]:
x_train = np.random.randn(100, 20000)
x_train.shape

(100, 20000)

In [8]:
y_train = np.concatenate([np.repeat([0], 50), np.repeat([1], 50)])
y_train.shape

(100,)

Perform feature selection on the **entire** data set using one-way ANOVA.

In [10]:
from sklearn.feature_selection import f_classif
f_train, p_train = f_classif(x_train, y_train)
print((p_train < .01).sum())

209


209 features pass the p < 0.01 cutoff, which is close to the 20,000 × 0.01 = 200 false positives expected by chance alone.
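The expected count above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the features are independent and that each p-value is uniform under the null, so the number of features passing the cutoff is roughly binomial; the variable names are my own and not part of the notebook.

```python
import numpy as np

n_features = 20000   # number of pure-noise features in the notebook
alpha = 0.01         # p-value cutoff used for the selection

# Under the null each feature passes with probability alpha, so the count
# of passing features is approximately Binomial(n_features, alpha).
expected = n_features * alpha
sd = np.sqrt(n_features * alpha * (1 - alpha))

print('expected false positives:', expected)
print('standard deviation:', round(sd, 1))
print('209 within one sd of expectation:', abs(209 - expected) < sd)
```

Under these assumptions the observed 209 is an unremarkable draw: every selected feature is a false positive.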

In [12]:
feature_sel_mask = (p_train < .01)
x_train_sel = x_train[:, feature_sel_mask]
x_train_sel.shape

(100, 209)
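With only two classes, the one-way ANOVA used for the selection above reduces to a two-sample t-test with pooled variance: the F statistic is the square of the t statistic and the p-values coincide. The sketch below checks this on a small random matrix; the seed and the 50-feature size are assumptions chosen only to keep it fast.

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.feature_selection import f_classif

rng = np.random.RandomState(0)      # fixed seed, for reproducibility only
x = rng.randn(100, 50)              # 100 samples, 50 noise features
y = np.repeat([0, 1], 50)           # first 50 samples are class 0, last 50 class 1

# One-way ANOVA, computed column by column.
f_stat, p_anova = f_classif(x, y)

# Pooled-variance two-sample t-test, also column by column (axis=0).
t_stat, p_ttest = ttest_ind(x[y == 0], x[y == 1])

print('F equals t squared:', np.allclose(f_stat, t_stat ** 2))
print('p-values agree:', np.allclose(p_anova, p_ttest))
```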

Perform 10-fold cross-validation on the training set, restricted to the selected features, to find the best classifier for distinguishing the two classes. I use a random forest, which some people consider a **robust** classifier that is less likely to overfit.

In [18]:
from scipy.stats import randint as sp_randint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

# build a classifier
clf = RandomForestClassifier(n_estimators=20)

# specify parameters and distributions to sample from
param_dist = {"max_depth": [3, None],
              "max_features": sp_randint(1, 11),
              "min_samples_split": sp_randint(2, 11),
              "min_samples_leaf": sp_randint(1, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

# run randomized search
n_iter_search = 100
random_search = RandomizedSearchCV(clf, 
                                   param_distributions=param_dist,
                                   n_iter=n_iter_search, 
                                   cv=10, 
                                   scoring='roc_auc')
random_search.fit(x_train_sel, y_train)
print('Best 10-fold CV AUC score:', random_search.best_score_)

Best 10-fold CV AUC score: 0.992


Take a closer look at the cross-validation results.

In [32]:
# grid_scores_ was removed in scikit-learn 0.20; cv_results_ holds the same scores
auc_mean_lst = np.asarray(random_search.cv_results_['mean_test_score'])
np.argmax(auc_mean_lst)

28

In [36]:
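# cv_results_ replaces grid_scores_, which was removed in scikit-learn 0.20
res = random_search.cv_results_
best_idx = 28  # index of the best mean CV score, as returned by np.argmax
cv_scores = np.array([res['split%d_test_score' % k][best_idx] for k in range(10)])
print('mean CV score:', res['mean_test_score'][best_idx])
print('all CV score:', cv_scores)
print('std CV score:', cv_scores.std())
print('Random forest params:\n', res['params'][best_idx])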

mean CV score: 0.992
all CV score: [1.   1.   1.   1.   1.   1.   1.   0.92 1.   1.  ]
std CV score: 0.023999999999999987
Random forest params:
 {'bootstrap': False, 'criterion': 'entropy', 'max_depth': None, 'max_features': 1, 'min_samples_leaf': 8, 'min_samples_split': 3}
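The near-perfect score above comes from selecting features on all 100 samples before cross-validating, so every validation fold has already influenced which features the classifier sees. One way to see this is to compare against a version in which the selection is refit inside each fold. The following is only a sketch under my own assumptions, not part of the original notebook: it uses scikit-learn's `SelectFpr` (which keeps features with p-values below `alpha`) inside a pipeline, a fixed seed, 5,000 features instead of 20,000 for speed, and a single untuned random forest rather than the randomized search.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFpr, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)   # fixed seed (assumption, for reproducibility)
x = rng.randn(100, 5000)         # fewer features than the notebook, to run quickly
y = np.repeat([0, 1], 50)        # 50 samples per class, labels unrelated to x

# Selection BEFORE cross-validation: the mask is computed on all samples,
# including those later used as validation folds.
mask = f_classif(x, y)[1] < .01
leaky_auc = cross_val_score(
    RandomForestClassifier(n_estimators=20, random_state=0),
    x[:, mask], y, cv=10, scoring='roc_auc').mean()

# Selection INSIDE cross-validation: the pipeline refits the selector on the
# training part of each fold, so validation samples never affect the mask.
pipe = make_pipeline(
    SelectFpr(f_classif, alpha=.01),
    RandomForestClassifier(n_estimators=20, random_state=0))
honest_auc = cross_val_score(pipe, x, y, cv=10, scoring='roc_auc').mean()

print('AUC, selection before CV:', leaky_auc)
print('AUC, selection inside CV:', honest_auc)
```

Since the data are pure noise, one would expect the second score to fluctuate around the chance level of 0.5, while the first stays far above it.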
