### WoE Model

#### Demonstrates the dangers of improper usage of the Weight of Evidence transformation

In [1]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.datasets import make_classification

from woe import WoETransformer
from simulation import TargetLeakageSimulation

%load_ext autoreload
%autoreload 2

In [2]:
print(TargetLeakageSimulation.__doc__)

Simulation that demonstrates the importance of proper model validation
    
    Simulates two ways of approaching the modelling:
        - correct one: using only train sample to build the whole pipeline
        - incorrect one: using the whole development sample (train+test)
          to build part of the pipeline (e.g. fitting transformers, preselecting
          features etc.) which can cause a severe target leakage and is, sadly, 
          quite a common mistake.
          
    Each round, data (simulated using the provided data generator) is split into
    train, test and production. One pipeline is built using the train set only,
    the second uses train+test to fit the transformer and train to fit the estimator,
    both pipelines are then tested on unseen data to simulate production.
    
    :param transformer: sklearn transformer
    :param estimator: sklearn estimator
    :param data_generator: callable returning data as a tuple (X, y)
    


### Helper code

In [3]:
def independent_random(n_samples, n_features, p=0.5):
    """Generate a random dataset whose target is independent of the features."""
    X = np.random.normal(size=(n_samples, n_features))
    y = np.random.binomial(n=1, p=p, size=n_samples)
    return X, y

### Run Simulations

In [4]:
# pipeline setup
transformer = Pipeline([
    ("disc", KBinsDiscretizer(n_bins=5, encode="ordinal")),
    ("woe", WoETransformer()),
])
estimator = LogisticRegression()
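Leakage is possible here because Weight of Evidence is computed from the target. A minimal sketch of the per-bin calculation follows, using the common definition: the log of the ratio between a bin's share of positives and its share of negatives. `woe_table` is a hypothetical helper; its sign convention and additive smoothing are assumptions and may differ from what the local `WoETransformer` does.

```python
import numpy as np


def woe_table(bins, y, smoothing=0.5):
    """Hypothetical helper: map each bin value to its Weight of Evidence.

    Assumed convention: WoE = ln(share of positives / share of negatives),
    with `smoothing` added to the bin counts to avoid log(0).
    """
    bins = np.asarray(bins)
    y = np.asarray(y)
    n_pos = y.sum()
    n_neg = len(y) - n_pos

    table = {}
    for b in np.unique(bins):
        in_bin = bins == b
        pos = y[in_bin].sum() + smoothing        # smoothed positives in this bin
        neg = (1 - y[in_bin]).sum() + smoothing  # smoothed negatives in this bin
        table[b.item()] = float(np.log((pos / n_pos) / (neg / n_neg)))
    return table


# Bin 0 holds only positives, bin 1 only negatives.
table = woe_table(bins=[0, 0, 1, 1], y=[1, 1, 0, 0])
print(table)
```

Because every bin value is replaced by a statistic of `y`, fitting this mapping on train+test writes information about the test labels into the features.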

#### Extreme case: completely independent target

In [5]:
# config
N_SIMUL = 50
N_SAMPLES = 1000
N_PREDICTORS = 500

In [6]:
simulation = TargetLeakageSimulation(transformer,
                                     estimator,
                                     independent_random)
results = simulation.run(N_SIMUL,
                         n_samples=N_SAMPLES,
                         n_features=N_PREDICTORS)

Correct approach:
Train accuracy: 0.9997
Test accuracy: 0.4998
Production accuracy: 0.4913

Leakage approach:
Train accuracy: 0.9971
Test accuracy: 0.8644
Production accuracy: 0.4895

Simulation run in: 0:01:36.408324

Since the target is independent of the features, any accuracy above 0.5 is spurious. The leakage approach reports 0.8644 on test, yet falls to 0.4895 in production, no better than the correct approach.


#### More realistic case: mix of predictive and noise features

In [7]:
# config
N_SIMUL = 50
N_SAMPLES = 1000
N_PREDICTORS = 500
N_INFORMATIVE = 300
N_REDUNDANT = 100

In [8]:
simulation = TargetLeakageSimulation(transformer,
                                     estimator,
                                     make_classification)
results = simulation.run(N_SIMUL,
                         n_samples=N_SAMPLES,
                         n_features=N_PREDICTORS,
                         n_informative=N_INFORMATIVE,
                         n_redundant=N_REDUNDANT)

Correct approach:
Train accuracy: 0.9998
Test accuracy: 0.6492
Production accuracy: 0.6523

Leakage approach:
Train accuracy: 0.999
Test accuracy: 0.8987
Production accuracy: 0.6658

Simulation run in: 0:01:30.486671

Here the features carry real signal, so both approaches reach about 0.65 to 0.67 in production. The leakage approach's test accuracy of 0.8987 overstates that by more than 0.2, while the correct approach's test accuracy (0.6492) closely matches production (0.6523).
