
### **Supervised**: `Data Leakage`

**What is data leakage?**

Data leakage occurs when information from outside the training dataset is used to build the model. This extra information lets the model learn something it would not otherwise know, which in turn invalidates the estimated performance of the model.

**Do I have data leakage?**

A common warning sign of data leakage is performance that seems a little too good to be true.

For example, suppose you normalize or standardize your entire dataset and then estimate the performance of your model using cross-validation. The scaling statistics were computed with the help of the validation folds, so the model overfits and the evaluation of its performance on unseen data is overly optimistic. You have committed the sin of data leakage.

**Tips to Combat Data Leakage**

- Use Pipelines. Rely on pipeline architectures that run the sequence of data preparation steps inside each cross-validation fold.
- Use a Holdout Dataset. Hold back an unseen dataset as a final sanity check of your model before you use it.
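
As a minimal sketch of both tips, assuming scikit-learn is installed and using a synthetic dataset from `make_classification` (an illustrative choice, as are the names `X_demo`, `scaled_model`, and `holdout_accuracy`), the scaler below sits inside a `Pipeline` so it is refit on the training part of every fold, and a holdout set is kept aside for one final check:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used here only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Tip 2: hold back a final test set before any preprocessing is fitted
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Tip 1: the scaler lives inside the pipeline, so each CV fold
# fits it on that fold's training rows only
scaled_model = Pipeline([
    ('scaler', StandardScaler()),
    ('estimator', LogisticRegression()),
])

cv_scores = cross_val_score(scaled_model, X_train, y_train, cv=5, scoring='accuracy')

# Final sanity check on rows the pipeline has never seen
scaled_model.fit(X_train, y_train)
holdout_accuracy = scaled_model.score(X_holdout, y_holdout)
print(cv_scores.mean(), holdout_accuracy)
```

If the cross-validated score and the holdout score differ widely, that gap is worth investigating as a possible sign of leakage.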

The data below is generated so that there is no relationship between X and y.

In [None]:
# Importing libraries
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import f_classif, SelectPercentile
from sklearn.model_selection import cross_val_score

# Creating a dataset
import numpy as np
rnd = np.random.RandomState(seed=2020)
X = rnd.normal(size=(100,1000))
y = rnd.randint(0, 2, size=(100,))

In [None]:
X

array([[-1.76884571,  0.07555227, -1.1306297 , ..., -0.42037359,
         0.16115047,  0.1473629 ],
       [ 0.65784044,  1.18446297, -2.04393187, ..., -0.38402909,
        -0.82915961,  2.35642674],
       [ 0.04007791, -0.59046287,  1.02972818, ..., -1.16693941,
         1.41252563,  0.73198659],
       ...,
       [-0.76099743, -0.30395673, -0.84413859, ..., -2.18501061,
        -1.06946633, -1.03395189],
       [-0.21626308, -0.27047282, -1.08027983, ...,  1.01511902,
         0.0555339 , -0.09061088],
       [-1.42469557,  0.14346951,  1.19711116, ...,  0.82160354,
         0.6237632 , -1.12510684]])

**Data Leakage**

In [None]:
# with information leakage
select = SelectPercentile(score_func=f_classif, percentile=5)
select.fit(X, y)  # fitting on the entire dataset causes information leakage
X_selected = select.transform(X)

cross_val_score(
    LogisticRegression(),   # algorithm used
    X_selected,             # features, already selected using all rows
    y,                      # target
    cv=5,                   # number of cross-validation folds
    scoring='accuracy'      # metric
)

# achieves good accuracy despite using random data

array([0.8 , 0.85, 0.9 , 0.75, 0.85])

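This result suggests a very good model even though the data is entirely random, which should look suspicious. The cause is that the feature selection step was fitted on the entire dataset, so the rows later used for validation had already influenced which features were kept.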
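
One way to check an estimate like this is to score the same leaky workflow on data it has never seen. The sketch below regenerates the random dataset and draws a second batch from the same generator as a stand-in for unseen data; `X_new`, `y_new`, and `new_accuracy` are names introduced here for illustration. Because X and y are unrelated, the accuracy on the new batch can be expected to sit near chance level rather than near the cross-validated figures.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.linear_model import LogisticRegression

rnd = np.random.RandomState(seed=2020)
X = rnd.normal(size=(100, 1000))
y = rnd.randint(0, 2, size=(100,))

# A second random batch, standing in for unseen data (illustrative assumption)
X_new = rnd.normal(size=(100, 1000))
y_new = rnd.randint(0, 2, size=(100,))

# Same leaky workflow as above: selection fitted on all 100 original rows
select = SelectPercentile(score_func=f_classif, percentile=5)
X_selected = select.fit_transform(X, y)
model = LogisticRegression().fit(X_selected, y)

# Score on the unseen batch, using the features chosen on the original rows
new_accuracy = model.score(select.transform(X_new), y_new)
print(new_accuracy)
```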

**No Information Leakage**

Applying a pipeline prevents the information leakage.

In [None]:
# without information leakage
select = SelectPercentile(score_func=f_classif, percentile=5)

model_pipeline = Pipeline([
    ('selection', select),
    ('estimator', LogisticRegression())
])

cross_val_score(
    model_pipeline,     # model used (selection + estimator)
    X,                  # raw features; selection is refit inside each fold
    y,                  # target
    cv=5,               # number of cross-validation folds
    scoring='accuracy'  # metric
)


array([0.55, 0.6 , 0.65, 0.55, 0.5 ])

This result is reasonable for completely random data, because the feature selection process is only performed on the training portion of each fold, not on the entire dataset.
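
What the pipeline does inside `cross_val_score` can be sketched by hand. The loop below is an illustrative equivalent, not scikit-learn's internal code: it assumes the same random data as above and uses `StratifiedKFold` with five unshuffled splits, which is what `cv=5` resolves to for a classifier. The selector is fitted on the training rows of each fold and only applied to the test rows.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

rnd = np.random.RandomState(seed=2020)
X = rnd.normal(size=(100, 1000))
y = rnd.randint(0, 2, size=(100,))

fold_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    # Fit the selector on the training rows of this fold only
    select = SelectPercentile(score_func=f_classif, percentile=5)
    X_train_sel = select.fit_transform(X[train_idx], y[train_idx])

    # The test rows are transformed, never used for fitting
    X_test_sel = select.transform(X[test_idx])

    model = LogisticRegression().fit(X_train_sel, y[train_idx])
    fold_scores.append(model.score(X_test_sel, y[test_idx]))

print(fold_scores)
```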

**How to Build a Pipeline to prevent data leakage?**

The example below builds a more complex Pipeline from a ColumnTransformer and a DecisionTreeClassifier, passes it to GridSearchCV, and displays its visual representation.

In [None]:
# import required libraries

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler

# building the pipeline for numerical variables
# every numeric variable passes through this pipeline
numerical_pipeline = Pipeline([
    ('mean_imp', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# building the pipeline for categorical variables
categorical_pipeline = Pipeline([
    ('constant_imp', SimpleImputer(strategy='constant', fill_value='unk')),
    # handle_unknown='ignore' is an added assumption: it keeps a fold from
    # failing when its test rows contain a category unseen during fitting
    ('oneHot', OneHotEncoder(handle_unknown='ignore'))
])

# combine the pipelines with ColumnTransformer into 'preprocessor'
preprocessor = ColumnTransformer([
    ('categorical', categorical_pipeline, ['state', 'gender']),
    ('numerical', numerical_pipeline, ['age', 'income'])
])

# integrating preprocessing and algorithm into 'model_pipeline'
model_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', DecisionTreeClassifier())
])

# hyperparameter tuning

# hyperparameters of the classifier step, addressed as '<step name>__<parameter>'
# 'auto' (an alias of 'sqrt' for decision trees) is dropped from max_features
# because recent scikit-learn releases no longer accept it
param_grid = {
    'classifier__max_features': ['sqrt', 'log2'],
    'classifier__max_depth': [4, 5, 6, 7, 8],
    'classifier__criterion': ['gini', 'entropy']
}

# combine preprocessing, model, parameter grid, cross-validation and metric in one search
tuned_model = GridSearchCV(
    model_pipeline,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)


tuned_model

The code above contains three main parts:

**1. Preprocessing** – handling missing values, scaling numbers, and encoding categories.

**2. Modeling** – training a decision tree classifier.

**3. Hyperparameter Tuning** – using GridSearchCV to find the best decision tree settings.

All parts are wrapped into a single pipeline, so data flows smoothly from raw input → preprocessing → model training → evaluation.



A more detailed, step-by-step explanation follows.

**Preprocessing for Numerical Data**

* Missing values are filled with the mean.

* Features are standardized using StandardScaler.

**Preprocessing for Categorical Data**

* Missing values are filled with a constant "unk".

* Features are converted into binary columns with OneHotEncoder.

**ColumnTransformer**

* Assigns the right preprocessing pipeline to the right columns:

  * ['state', 'gender'] → categorical pipeline.

  * ['age', 'income'] → numerical pipeline.
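
To see the column routing in action, here is a small sketch on a hypothetical three-row DataFrame. Only the column names come from the example above; the values, and the names `toy`, `toy_preprocessor`, and `transformed`, are made up for illustration. It assumes pandas is installed.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with one missing state and one missing age
toy = pd.DataFrame({
    'state': pd.Series(['CA', 'NY', np.nan], dtype=object),
    'gender': pd.Series(['F', 'M', 'F'], dtype=object),
    'age': [30.0, np.nan, 50.0],
    'income': [10.0, 20.0, 30.0],
})

toy_preprocessor = ColumnTransformer([
    # categorical columns: fill missing values with 'unk', then one-hot encode
    ('categorical', Pipeline([
        ('constant_imp', SimpleImputer(strategy='constant', fill_value='unk')),
        ('oneHot', OneHotEncoder()),
    ]), ['state', 'gender']),
    # numerical columns: fill missing values with the mean, then standardize
    ('numerical', Pipeline([
        ('mean_imp', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler()),
    ]), ['age', 'income']),
])

transformed = toy_preprocessor.fit_transform(toy)
print(transformed.shape)
```

The output has seven columns: three for `state` (CA, NY, and the filled-in unk), two for `gender`, and one each for `age` and `income`.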

**Pipeline Integration**

* Combines preprocessing and the DecisionTreeClassifier into one end-to-end pipeline.

* Ensures that the preprocessing steps are fitted only on the training portion of each cross-validation split and then applied to the validation portion, which avoids data leakage.

**Hyperparameter Grid**

* Defines which decision tree parameters are tested: max_features, max_depth, and criterion.
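
The grid keys follow the `<step name>__<parameter>` convention of scikit-learn pipelines. A short sketch, using a throwaway two-step pipeline (`demo_pipeline` is a name chosen here for illustration), shows where those names come from:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

demo_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', DecisionTreeClassifier()),
])

# Every tunable parameter of a step is exposed as '<step name>__<parameter name>'
classifier_params = sorted(
    name for name in demo_pipeline.get_params() if name.startswith('classifier__')
)
print(classifier_params)

# set_params accepts the same names; a grid search sets each candidate this way
demo_pipeline.set_params(classifier__max_depth=4, classifier__criterion='entropy')
```

Listing `get_params()` before writing a grid is a quick way to catch a misspelled step or parameter name.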

**GridSearchCV**

* Runs cross-validation to test all parameter combinations.

* Chooses the best-performing model based on accuracy.

* Uses parallel processing (n_jobs=-1) for efficiency.
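
To round this off, the sketch below fits a search of this kind end to end. It is a reduced illustration under stated assumptions: the DataFrame is synthetic (random values, with a target derived from `income`), the grid is smaller than the one above to keep it quick, and the names `data`, `target`, and `search` are introduced here. It assumes pandas is installed.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with the column names used in the example above
rng = np.random.RandomState(0)
n = 200
data = pd.DataFrame({
    'state': pd.Series(rng.choice(['CA', 'NY', 'TX'], size=n), dtype=object),
    'gender': pd.Series(rng.choice(['F', 'M'], size=n), dtype=object),
    'age': rng.randint(18, 70, size=n).astype(float),
    'income': rng.normal(50_000, 15_000, size=n),
})
data.loc[data.index % 25 == 0, 'age'] = np.nan   # inject a few missing ages
target = (data['income'] > 50_000).astype(int)   # illustrative target

preprocessor = ColumnTransformer([
    ('categorical', Pipeline([
        ('constant_imp', SimpleImputer(strategy='constant', fill_value='unk')),
        ('oneHot', OneHotEncoder(handle_unknown='ignore')),
    ]), ['state', 'gender']),
    ('numerical', Pipeline([
        ('mean_imp', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler()),
    ]), ['age', 'income']),
])

model_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', DecisionTreeClassifier(random_state=0)),
])

# Reduced grid, for illustration only
search = GridSearchCV(
    model_pipeline,
    param_grid={
        'classifier__max_depth': [2, 4],
        'classifier__criterion': ['gini', 'entropy'],
    },
    cv=5,
    scoring='accuracy',
)
search.fit(data, target)

print(search.best_params_)   # best combination found by cross-validation
print(search.best_score_)    # its mean cross-validated accuracy
predictions = search.predict(data.head())  # refit best pipeline, raw rows in
```

Because preprocessing is part of the estimator being searched, `search.predict` accepts raw rows and applies the fitted imputers, scaler, and encoder before the tree makes its prediction.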