# Avoid Multi-Test Leakage

To avoid multi-test leakage, restructure the workflow as follows:
- Split the data into training and testing sets before oversampling.
- Apply oversampling only to the training set.
- Use the oversampled training set to train your model and the original testing set to evaluate its performance.

In [12]:
# Instead of this
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# generate random data
n_samples, n_features, n_classes = 200, 10000, 2
rng = np.random.RandomState(42)
X = rng.standard_normal((n_samples, n_features))
y = rng.choice(n_classes, n_samples)

# oversample the full dataset; new rows are synthesized from existing rows
X_new, y_new = SMOTE().fit_resample(X, y)

# a split made after oversampling no longer produces independent train/test data
X_train, X_test, y_train, y_test = train_test_split(X_new, y_new, test_size=0.2, random_state=42)
rf = RandomForestClassifier().fit(X_train, y_train)
rf.predict(X_test)

array([0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0])
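
Why a split made after oversampling is not independent can be seen on a toy example. The sketch below is only an illustration under simplifying assumptions: row names stand in for feature vectors, minority rows are copied verbatim rather than interpolated as SMOTE does, and the helpers `copy_minority` and `every_fourth_holdout` are hypothetical names, not part of any library.

```python
def copy_minority(rows, labels):
    """Naive oversampling (assumption: a stand-in for SMOTE).

    Appends one copy of every minority-class (label 1) row.
    """
    extra = [r for r, y in zip(rows, labels) if y == 1]
    return rows + extra, labels + [1] * len(extra)


def every_fourth_holdout(rows):
    """Deterministic split: rows at positions 3, 7, ... form the test set."""
    test = rows[3::4]
    train = [r for i, r in enumerate(rows) if i % 4 != 3]
    return train, test


rows = ["a", "b", "c", "d", "e", "f"]  # row names stand in for feature vectors
labels = [0, 0, 0, 0, 1, 1]            # "e" and "f" are the minority class

# Leaky order: oversample, then split.
rows_over, _ = copy_minority(rows, labels)
train, test = every_fourth_holdout(rows_over)
leaky_shared = sorted(set(train) & set(test))

# Safe order: split, then oversample the training rows only.
train, test = every_fourth_holdout(rows)
train_labels = [labels[rows.index(r)] for r in train]
train, _ = copy_minority(train, train_labels)
safe_shared = sorted(set(train) & set(test))

print(leaky_shared)  # ['f']
print(safe_shared)   # []
```

In the leaky order, a copy of row `f` ends up in the test set while the original stays in the training set, so the model is scored on a row it has already seen. With SMOTE the shared rows are near-duplicates rather than exact copies, but the dependence between the two sets is of the same kind.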

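Because SMOTE synthesizes new rows from existing ones, some rows in `X_test` above may be derived from rows that ended up in `X_train`. To avoid this, split the data into training and testing sets first, then oversample only the training set.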

In [20]:
# Do this instead
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
import numpy as np

In [13]:
# generate random data
n_samples, n_features, n_classes = 200, 10000, 2
rng = np.random.RandomState(42)
X = rng.standard_normal((n_samples, n_features))
y = rng.choice(n_classes, n_samples)

In [14]:
# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [15]:
# perform oversampling only on the training data
sm = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = sm.fit_resample(X_train, y_train)

In [16]:
# fit a random forest classifier on the oversampled training data
rf = RandomForestClassifier().fit(X_train_resampled, y_train_resampled)

In [17]:
# evaluate the model on the testing data
y_pred = rf.predict(X_test)

In [18]:
# display the predicted labels for the test set
print(y_pred)

[0 0 0 1 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 1 1 0 1 0 0 0
 0 0 1]


Splitting the data before oversampling keeps the test set independent of the training set: no test row is synthesized from training rows, and no training row is synthesized from a test row. Evaluating on the untouched test set therefore gives a less optimistic and more trustworthy estimate of the model's performance on new, unseen data.
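
The effect on the measured score can be illustrated with a small standard-library experiment. This is a hedged sketch, not the notebook's method: it assumes random duplication of minority rows in place of SMOTE, a 1-nearest-neighbour rule in place of the random forest, and minority-class recall as the metric. The helper names `duplicate_minority`, `holdout`, and `minority_recall` are hypothetical. The labels are pure noise, so a high score can only come from leakage.

```python
import random


def duplicate_minority(rows, labels, rng):
    """Stand-in for SMOTE: balance the classes by copying random minority rows."""
    minority = [r for r, y in zip(rows, labels) if y == 1]
    n_extra = labels.count(0) - labels.count(1)
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return rows + extra, labels + [1] * n_extra


def holdout(rows, labels, rng, test_fraction=0.2):
    """Shuffle row indices and return X_train, X_test, y_train, y_test."""
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    n_test = int(len(rows) * test_fraction)
    test, train = idx[:n_test], idx[n_test:]
    return ([rows[i] for i in train], [rows[i] for i in test],
            [labels[i] for i in train], [labels[i] for i in test])


def minority_recall(X_train, X_test, y_train, y_test):
    """Share of minority test rows that a 1-nearest-neighbour rule labels as minority."""
    def predict(x):
        nearest = min(range(len(X_train)),
                      key=lambda i: sum((a - b) ** 2 for a, b in zip(x, X_train[i])))
        return y_train[nearest]

    hits = [predict(x) == 1 for x, y in zip(X_test, y_test) if y == 1]
    return sum(hits) / len(hits)


rng = random.Random(42)
# Pure noise: the labels carry no information about the features.
X = [tuple(rng.gauss(0, 1) for _ in range(5)) for _ in range(400)]
y = [0] * 360 + [1] * 40

# Leaky order: oversample, then split.
X_over, y_over = duplicate_minority(X, y, rng)
leaky_recall = minority_recall(*holdout(X_over, y_over, rng))

# Safe order: split, then oversample the training rows only.
X_train, X_test, y_train, y_test = holdout(X, y, rng)
X_train, y_train = duplicate_minority(X_train, y_train, rng)
safe_recall = minority_recall(X_train, X_test, y_train, y_test)

print(f"minority recall, leaky order: {leaky_recall:.2f}")
print(f"minority recall, safe order:  {safe_recall:.2f}")
```

In the leaky order, most minority test rows have an exact copy in the training set, so the nearest-neighbour rule recovers their label and the recall looks far better than the noise labels can justify. In the safe order the same rule has nothing to memorize, and the recall should fall to roughly the share of minority rows among the distinct training points. The exact numbers depend on the seed.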