## Data Leakage Prevention Demo

- Objective: demonstrate how to use a scikit-learn Pipeline to prevent data leakage
- Dataset: UCI ML repository SECOM Dataset: http://archive.ics.uci.edu/ml/datasets/secom
- ML Task: Binary Classification

The dataset presented in this case is a selection of features where each example represents a single production entity with its associated measured features. The labels represent a simple pass/fail yield for in-house line testing, together with an associated date-time stamp. Here -1 corresponds to a pass and 1 corresponds to a fail, and the date-time stamp is for that specific test point.

Attribute Information:

Key facts: Data structure: the data consists of two files. According to the repository description, the dataset file SECOM contains 1567 examples, each with 591 features (a 1567 x 591 matrix), and the labels file contains the classification and date-time stamp for each example. The data file loaded in this notebook has 590 columns.

As with any real-life data, this dataset contains null values, and their frequency varies from feature to feature. This needs to be taken into account when investigating the data, either through pre-processing or within the technique applied.

The data is stored in a raw text file, with each line representing an individual example and the features separated by spaces. Null values are represented by the 'NaN' value, as in MATLAB.
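As a small illustration of the file layout described above, the sketch below parses a made-up two-line sample (not taken from the real file) with pandas. It relies on 'NaN' being one of the tokens pandas treats as missing by default; the variable names `raw` and `toy` are hypothetical.

```python
import io
import pandas as pd

# Hypothetical two-line sample in the same layout:
# space-separated values, with 'NaN' marking nulls.
raw = "3030.93 NaN 2187.73\n3095.78 2465.14 NaN\n"

toy = pd.read_csv(io.StringIO(raw), sep=" ", header=None)

# 'NaN' is among the default missing-value tokens recognized by pandas,
# so no na_values argument is needed here.
n_missing = int(toy.isnull().sum().sum())
print(toy.shape, n_missing)
```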

In [166]:
import os
import pandas as pd
import numpy as np

In [167]:
os.listdir()

['secom_labels.txt',
 'secom_data.txt',
 '.DS_Store',
 'data_leakage_demo.ipynb',
 '.ipynb_checkpoints']

In [168]:
data = pd.read_csv('secom_data.txt', sep=' ', header=None)
labels = pd.read_csv('secom_labels.txt', sep=' ', header=None, names=['target', 'timestamp'])

In [169]:
print(data.shape)
print(labels.shape)

(1567, 590)
(1567, 2)


In [170]:
df = pd.concat([data, labels], axis=1)

In [171]:
df.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 582 | 583 | 584 | 585 | 586 | 587 | 588 | 589 | target | timestamp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3030.93 | 2564.0 | 2187.7333 | 1411.1265 | 1.3602 | 100.0 | 97.6133 | 0.1242 | 1.5005 | 0.0162 | ... | 0.5005 | 0.0118 | 0.0035 | 2.363 | NaN | NaN | NaN | NaN | -1 | 19/07/2008 11:55:00 |
| 1 | 3095.78 | 2465.14 | 2230.4222 | 1463.6606 | 0.8294 | 100.0 | 102.3433 | 0.1247 | 1.4966 | -0.0005 | ... | 0.5019 | 0.0223 | 0.0055 | 4.4447 | 0.0096 | 0.0201 | 0.006 | 208.2045 | -1 | 19/07/2008 12:32:00 |
| 2 | 2932.61 | 2559.94 | 2186.4111 | 1698.0172 | 1.5102 | 100.0 | 95.4878 | 0.1241 | 1.4436 | 0.0041 | ... | 0.4958 | 0.0157 | 0.0039 | 3.1745 | 0.0584 | 0.0484 | 0.0148 | 82.8602 | 1 | 19/07/2008 13:17:00 |
| 3 | 2988.72 | 2479.9 | 2199.0333 | 909.7926 | 1.3204 | 100.0 | 104.2367 | 0.1217 | 1.4882 | -0.0124 | ... | 0.499 | 0.0103 | 0.0025 | 2.0544 | 0.0202 | 0.0149 | 0.0044 | 73.8432 | -1 | 19/07/2008 14:43:00 |
| 4 | 3032.24 | 2502.87 | 2233.3667 | 1326.52 | 1.5334 | 100.0 | 100.3967 | 0.1235 | 1.5031 | -0.0031 | ... | 0.48 | 0.4766 | 0.1045 | 99.3032 | 0.0202 | 0.0149 | 0.0044 | 73.8432 | -1 | 19/07/2008 15:22:00 |


There seem to be many nulls. Let's explore more and then impute. 

In [172]:
#Exploring nulls: 
print('Columns that have more than {}% of nulls: {}'.format(10, np.sum(df.isnull().sum()>len(df)*.1)))
print('Number of rows with at least one null: ', len(df[df.isnull().any(axis=1)]))

Columns that have more than 10% of nulls: 52
Number of rows with at least one null:  1567
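To go one step further than the counts above, one could look at the fraction of missing values in each column. The following is a minimal sketch on a made-up toy frame (not the SECOM data); the 50% threshold and the names `toy`, `null_fraction`, and `mostly_null` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df; the values are made up for illustration.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [1.0, 2.0, 3.0, 4.0],
})

# Fraction of missing values per column, largest first.
null_fraction = toy.isnull().mean().sort_values(ascending=False)

# Columns whose missing fraction exceeds a hypothetical 50% threshold.
mostly_null = null_fraction[null_fraction > 0.5].index.tolist()

print(null_fraction)
print(mostly_null)
```

The same two lines could be applied to `df` to decide whether some columns are too sparse to be worth imputing.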


In [173]:
#Timestamp column not needed
df.drop('timestamp', axis=1, inplace=True)

Modeling with a Pipeline to avoid data leakage: the preprocessing steps are fit on the training set only, and the fitted transformations are then applied to the test set.
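To make the idea concrete, here is a minimal sketch on synthetic data (not the SECOM data) contrasting a leaky imputation with a safe one. In the leaky version the column means are computed from all rows, so information from the test rows reaches the training step. The names `X_toy`, `leaky_means`, and `safe_imputer` are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
X_toy[rng.random(X_toy.shape) < 0.2] = np.nan  # knock out roughly 20% of the values

X_tr, X_te = train_test_split(X_toy, test_size=0.2, random_state=0)

# Leaky: the imputation means are computed from all rows, test rows included.
leaky_means = SimpleImputer(strategy="mean").fit(X_toy).statistics_

# Safe: the means come from the training rows only
# and are then reused unchanged on the test rows.
safe_imputer = SimpleImputer(strategy="mean").fit(X_tr)
safe_means = safe_imputer.statistics_
X_te_imputed = safe_imputer.transform(X_te)

print("means from all rows:     ", leaky_means)
print("means from training rows:", safe_means)
```

A Pipeline automates the safe variant: calling `fit` on it fits every preprocessing step on the data passed in, and `predict` or `score` only applies the already-fitted steps.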

In [174]:
df_leakage = df.copy()

Let's impute the null values. We only have numeric features, so no label or one-hot encoding is needed.

In [175]:
# exclude='number' covers every numeric dtype, regardless of platform integer width
print('How many features are not numeric?', df.select_dtypes(exclude='number').shape[1])

How many features are not numeric? 0


In [196]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score

In [197]:
# Positional indices of the feature columns (every column except the last one, 'target')
num_cols = list(range(len(df.columns) - 1))

In [198]:
#Defining numeric features and the transformations we will apply to them
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

In [199]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, num_cols)
    ])

In [200]:
param_grid = [{'clf__C' : [0.01, 0.1, 1]}]

In [201]:
# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
pipe = Pipeline(steps=[('preprocessor', preprocessor),
                      ('clf', LogisticRegression(solver='lbfgs'))])

In [231]:
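# return_train_score=True is needed for cv_results_['mean_train_score'];
# recent scikit-learn versions do not compute training scores by default.
gs = GridSearchCV(estimator=pipe, param_grid=param_grid, n_jobs=-1, cv=4,
                  return_train_score=True)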

In [232]:
X = df_leakage.drop('target', axis=1)
y = df_leakage['target']

In [233]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

The pipeline transformations are fit on the training set and then applied to the test set to obtain a score, which prevents data leakage.

In [234]:
gs.fit(X_train, y_train)
# Score the refitted best pipeline on the held-out test set
print("model score: %.3f" % gs.score(X_test, y_test))

model score: 0.943
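One way to check that a fitted pipeline really learned its preprocessing statistics from the training rows alone is to inspect the fitted steps. The sketch below does this on synthetic data with a simplified pipeline (no ColumnTransformer or grid search); `toy_pipe` and the other names are hypothetical and the data is not the SECOM data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X_toy = rng.normal(size=(200, 4))
y_toy = (X_toy[:, 0] > 0).astype(int)          # synthetic labels
X_toy[rng.random(X_toy.shape) < 0.1] = np.nan  # roughly 10% missing values

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=1)

toy_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])
toy_pipe.fit(X_tr, y_tr)

# The imputer's learned means equal the per-column means of the training rows.
learned_means = toy_pipe.named_steps["imputer"].statistics_
train_means = np.nanmean(X_tr, axis=0)
print(np.allclose(learned_means, train_means))

# score() only transforms X_te with the already-fitted steps before predicting.
test_accuracy = toy_pipe.score(X_te, y_te)
print(test_accuracy)
```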


In [227]:
gs.best_estimator_.named_steps['clf']

LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)

In [215]:
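# Both values are averaged over the CV folds and over all parameter settings in the grid;
# 'mean_test_score' refers to the held-out CV folds, not to X_test.
print('Mean Train Score: ', gs.cv_results_['mean_train_score'].mean())
print('Mean Test Score: ', gs.cv_results_['mean_test_score'].mean())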

Mean Train Score:  0.972596404535064
Mean Test Score:  0.9039638201649375




In [235]:
cross_val_score(gs, X_train, y_train, cv=6)

array([0.92380952, 0.93333333, 0.92822967, 0.92788462, 0.92307692,
       0.92307692])

The performance estimated on the training set by nested CV (the grid search wrapped in cross_val_score) is a bit lower than the one obtained with GridSearchCV alone. Choosing the parameters that maximize the non-nested CV score biases the model toward the dataset and yields an overly optimistic score, so this decrease in performance is expected.
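The difference between the two estimates can be sketched on synthetic data as follows. This is an illustrative setup, not the notebook's pipeline: the data comes from `make_classification`, the estimator is a plain `LogisticRegression`, and the names `inner`, `non_nested_score`, and `nested_scores` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)

inner = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1]},
    cv=4,
)

# Non-nested: the same folds both select C and produce the reported score.
inner.fit(X_toy, y_toy)
non_nested_score = inner.best_score_

# Nested: each outer fold reruns the whole grid search on its own training part
# and is scored on data that the search never saw.
nested_scores = cross_val_score(inner, X_toy, y_toy, cv=6)

print(non_nested_score, nested_scores.mean())
```

On a single small dataset the gap between the two numbers can be small or even reversed by chance; the optimistic bias of the non-nested score is a statement about its expected value.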