# Overfitting problem: p ~ n

One of the main objectives of predictive modelling is to build a model that will give accurate predictions on unseen data.

A necessary step in building a model is to check that it has not overfit the training data, since overfitting leads to suboptimal predictions on new data.

The purpose of this challenge is to stimulate research and highlight existing algorithms, techniques or strategies that can be used to guard against overfitting.

To this end, we created a simulated data set with 200 variables and 20,000 cases. An ‘equation’ based on this data was used to generate a Target to be predicted. Given all 20,000 cases, the problem is very easy to solve, but the Target value is provided for only 250 cases. The task is to build a model that gives the best predictions on the remaining 19,750 cases.
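To make the p ~ n difficulty concrete, here is a minimal sketch on synthetic data. It does not use the competition data or its hidden ‘equation’: the Gaussian features, the random linear rule `w`, and the function name `simulate_gap` are all assumptions made purely for illustration. An unregularised least-squares fit on 250 cases with 200 variables typically scores far better on its own training cases than on the remaining 19,750.

```python
import numpy as np


def simulate_gap(n_train=250, n_test=19750, n_features=200, seed=0):
    """Fit unregularised least squares with p close to n; return accuracies."""
    rng = np.random.default_rng(seed)
    n_total = n_train + n_test

    # Assumption: standard-normal features and a hidden random linear rule.
    X = rng.standard_normal((n_total, n_features))
    w = rng.standard_normal(n_features)
    y = np.where(X @ w > 0, 1.0, -1.0)

    X_train, y_train = X[:n_train], y[:n_train]
    X_test, y_test = X[n_train:], y[n_train:]

    # Least squares on the +/-1 labels, with no regularisation at all.
    coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

    train_acc = np.mean(np.sign(X_train @ coef) == y_train)
    test_acc = np.mean(np.sign(X_test @ coef) == y_test)
    return train_acc, test_acc


train_acc, test_acc = simulate_gap()
print('Training accuracy: {:1.3f}'.format(train_acc))
print('Test accuracy:     {:1.3f}'.format(test_acc))
```

The gap between the two printed numbers is the overfitting that the challenge asks participants to guard against.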

This competition is of particular relevance to medical data analysis, where often the number of cases is severely restricted.

## Packages, dataset

In [1]:
import numpy as np
import pandas as pd

In [2]:
from sklearn.svm import SVC
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import roc_auc_score

In [3]:
filename = './overfitting.csv'
dataset = pd.read_csv(filename)

## Train/test splitting

In [4]:
train = dataset.loc[dataset.train==1]
test = dataset.loc[dataset.train==0]

In [5]:
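# Target_Practice is the binary label for this exercise.
y = train.Target_Practice.values
# Columns from position 5 onward are the 200 predictor variables.
X = train.iloc[:, 5:].copy()
Xt = test.iloc[:, 5:].copy()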

In [6]:
values, counts = np.unique(y, return_counts=True)
print('Target values: {}'.format(values))
print('Target distribution: {}'.format(counts))

Target values: [0 1]
Target distribution: [119 131]


In [7]:
print('Training set size: {}'.format(X.shape))
print('Test set size: {}'.format(Xt.shape))

Training set size: (250, 200)
Test set size: (19750, 200)


## Benchmark solution

In [8]:
clf = SVC(kernel='rbf', probability=True)
params = {'C': np.logspace(-7, 3, 100), 
          'gamma': np.logspace(-7, 2, 100)}

rand = RandomizedSearchCV(estimator=clf, 
                          param_distributions=params,
                          n_iter=100,
                          scoring='roc_auc',
                          n_jobs=-1,
                          cv=10,
                          refit=True,
                          random_state=0)

In [9]:
rand.fit(X, y)
best_roc_auc = rand.best_score_
best_params = rand.best_params_

In [10]:
print('Best roc_auc {:1.3f} using {}'.format(best_roc_auc, best_params))

Best roc_auc 0.859 using {'gamma': 0.005336699231206312, 'C': 61.35907273413163}


## Benchmark testing

This is a binary classification problem. The benchmark model is evaluated by its ROC AUC on the test portion, that is, the 19,750 cases with `train == 0`.

In [11]:
probs = rand.predict_proba(Xt)

In [12]:
test_roc_auc = roc_auc_score(y_true=test.Target_Practice.values, 
                             y_score=probs[:,1])

print('roc_auc in the test set is {:1.3f}'.format(test_roc_auc))

roc_auc in the test set is 0.862
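The AUC reported above can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it that way on a four-case toy example. The helper name `pairwise_auc` and the toy scores are illustrative assumptions, not part of the benchmark.

```python
import numpy as np


def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)

    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]

    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0)
    ties = np.sum(diff == 0)
    return (wins + 0.5 * ties) / (pos.size * neg.size)


# Toy example: 3 of the 4 positive/negative pairs are ordered correctly.
y_toy = [0, 0, 1, 1]
s_toy = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_toy, s_toy))  # → 0.75
```

Because only the ordering of the scores matters, the AUC does not depend on choosing a classification threshold. The pairwise loop is quadratic in the number of cases, so it is suited to small illustrations rather than to the full test set.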


## Improvement of the benchmark

The task is to propose a better model than the benchmark. An AUC above 0.90 is considered a good result; AUC scores of about 0.95 are considered excellent.
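One possible starting point, offered only as a sketch and not as a known solution, is to reduce the number of variables before fitting. The example below runs on synthetic data, so its scores say nothing about the real data set. The sparse linear rule, the choice of `k=20`, the linear kernel, and the function name `cv_auc_with_selection` are all assumptions. The point it illustrates is that feature selection sits inside the pipeline, so it is refitted on each training fold and the cross-validated AUC is not inflated by selecting features on held-out cases.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def cv_auc_with_selection(n_samples=250, n_features=200, seed=0):
    """Cross-validated AUC of a select-then-classify pipeline on toy data."""
    rng = np.random.default_rng(seed)

    # Assumption: only the first 10 variables drive the synthetic target.
    X = rng.standard_normal((n_samples, n_features))
    y = (X[:, :10].sum(axis=1) > 0).astype(int)

    # Scaling and selection are pipeline steps, so each CV fold refits them
    # on its own training part only.
    model = make_pipeline(
        StandardScaler(),
        SelectKBest(score_func=f_classif, k=20),
        SVC(kernel='linear', C=0.1),
    )
    return cross_val_score(model, X, y, cv=5, scoring='roc_auc')


scores = cv_auc_with_selection()
print('Mean CV roc_auc: {:1.3f}'.format(scores.mean()))
```

To try the same idea on the real data, the pipeline could be fitted on `X` and `y` from the cells above, with `k` and `C` tuned by a search such as the `RandomizedSearchCV` used for the benchmark.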