## Data Leakage

<b>Imagine this situation:</b> You've developed a machine learning model that shows excellent accuracy on both your training and test sets. You're confident in its performance. However, when you deploy the model and it encounters new, real-world data, the results are surprisingly poor.
What's going on here? How can this be happening when the accuracy was high for both training and test sets?<br>
The likely cause is data leakage. Data leakage occurs when information that would not be available at prediction time in the real world sneaks into the model training process. It inflates performance metrics during development, and the model then generalizes poorly when faced with truly new data.<br>
This situation highlights why it's crucial to not only look at accuracy metrics but also to thoroughly understand your data, carefully design your validation strategy, and be vigilant about potential sources of leakage throughout the entire machine learning pipeline.

### Common Types of Data Leakage

#### Target leakage: Using future information to predict the past

Example: Predicting whether a patient will be diagnosed with a disease.

In [1]:
features = ['age', 'symptoms', 'medication_prescribed']
target = 'disease_diagnosis'

The feature 'medication_prescribed' leaks the target: a medication is normally prescribed only after the diagnosis (our target variable) is made, so this information would not be available at the moment the model has to make its prediction.

##### How to avoid target leakage

<b>Understand your data:</b> Carefully examine each feature and consider whether it would be available at the time of prediction in a real-world scenario.<br>
<b>Use domain knowledge:</b> Consult with subject matter experts to identify potential leaky features.<br>
<b>Feature creation care:</b> When engineering new features, be mindful of not incorporating future information.
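One way to make the first check concrete is to record, for each feature, whether it is known at prediction time and to keep only the features that are. The sketch below is a minimal illustration; the `available_at_prediction` mapping is a hypothetical, hand-written annotation for the example features above, not something that can be derived automatically from the data.

```python
# Hypothetical annotation: is each feature known *before* the diagnosis is made?
available_at_prediction = {
    "age": True,
    "symptoms": True,
    "medication_prescribed": False,  # only known after the diagnosis
}

features = ["age", "symptoms", "medication_prescribed"]

# Keep only the features that exist at prediction time
safe_features = [f for f in features if available_at_prediction[f]]
leaky_features = [f for f in features if not available_at_prediction[f]]

print("Keep:", safe_features)
print("Drop:", leaky_features)
```

In practice the annotation itself is the hard part and usually comes from domain experts, as noted above.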

#### Train-test contamination: Test data influences the training process

Example: Using all data for feature scaling before splitting into train and test sets.


In [None]:
# Incorrect approach (leakage)
X_scaled = scale(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)

# Split first, then scale (better, but still flawed: the test set is scaled with its own statistics)
X_train, X_test, y_train, y_test = train_test_split(X, y)
X_train_scaled = scale(X_train)
X_test_scaled = scale(X_test)

In [None]:
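import numpy as np
from sklearn.model_selection import train_test_split

def scale(X):
    # Standardize each column using the mean and std of X itself
    mean = np.mean(X, axis=0)
    std = np.std(X, axis=0)
    return (X - mean) / std

# Same as scale(), but also returns the mean and std so they can be reused on other data
def scale_with_params(X):
    mean = np.mean(X, axis=0)
    std = np.std(X, axis=0)
    return (X - mean) / std, mean, std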

##### Incorrect approach (data leakage):

In [None]:
X_scaled = scale(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)

This approach scales the entire dataset before splitting. The problem here is that the scaling operation uses information from all the data, including what will become the test set. This means that information from the test set (which should be completely unseen) is influencing how the training data is scaled, leading to data leakage.
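A small numeric sketch can show how the test rows shift the scaling statistics. The data here is synthetic and deliberately exaggerated (an assumption made purely for illustration): the training rows are centered near 0 and the test rows near 10.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): training rows around 0, test rows around 10
X_train = rng.normal(loc=0.0, scale=1.0, size=(80, 1))
X_test = rng.normal(loc=10.0, scale=1.0, size=(20, 1))
X_all = np.vstack([X_train, X_test])

mean_all = X_all.mean(axis=0)      # the mean that scaling the full dataset would use
mean_train = X_train.mean(axis=0)  # the mean a leak-free workflow would use

print("mean from all rows:     ", mean_all)
print("mean from training rows:", mean_train)
```

The two means differ only because the test rows were included, which is exactly the information that should have stayed unseen.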

##### Split first, then scale each set separately (still flawed):

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
X_train_scaled = scale(X_train)
X_test_scaled = scale(X_test)

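This approach splits the data first and then scales the training and test sets separately, so no test information reaches the training set. However, there is still a problem: the test set is standardized with its own mean and standard deviation rather than the training set's. The same raw value can therefore map to different scaled values in the two sets, and the model is evaluated on inputs that were transformed differently from the ones it was trained on.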
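To see the effect of scaling each set with its own statistics, consider a tiny hand-made example (the arrays are assumptions chosen so that the raw value 3.0 appears in both sets):

```python
import numpy as np

def scale_with_params(X):
    mean = np.mean(X, axis=0)
    std = np.std(X, axis=0)
    return (X - mean) / std, mean, std

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[3.0], [4.0], [5.0]])

# Each set is scaled with its own mean and std
X_train_scaled, _, _ = scale_with_params(X_train)
X_test_scaled, _, _ = scale_with_params(X_test)

# The raw value 3.0 is the last training row and the first test row
train_value = X_train_scaled[2, 0]  # about  1.22 (above the training mean of 2)
test_value = X_test_scaled[0, 0]    # about -1.22 (below the test mean of 4)

print(train_value, test_value)
```

The same measurement ends up on opposite sides of zero depending on which set it belongs to, so the model would see it as two very different inputs.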

##### The truly correct approach would be:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
X_train_scaled, train_mean, train_std = scale_with_params(X_train)
X_test_scaled = (X_test - train_mean) / train_std

This method:

1. Splits the data first.
2. Computes the scaling parameters (mean and std) using only the training data.
3. Applies these same parameters to scale both the training and test data.

This ensures that:

- No information from the test set influences the scaling of the training set.
- The test set is scaled in exactly the same way as the training set, mimicking how new, unseen data would be processed in a real-world scenario.

By using this approach, we maintain the integrity of our test set as a true representation of unseen data, providing a more reliable estimate of our model's performance on new data.
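The same idea carries over to prediction time: a new sample is scaled with the stored training statistics, never with statistics computed from the sample itself. The following is a minimal sketch; the helper `scale_new` and the small arrays are hypothetical and exist only for this illustration.

```python
import numpy as np

X_train = np.array([[1.0, 10.0],
                    [2.0, 20.0],
                    [3.0, 30.0]])

# Parameters are computed once, from the training data only
train_mean = X_train.mean(axis=0)
train_std = X_train.std(axis=0)

def scale_new(x_new):
    # Hypothetical helper: reuse the training parameters for unseen samples
    return (x_new - train_mean) / train_std

# A single new sample that happens to equal the training mean
x_new = np.array([[2.0, 20.0]])
x_new_scaled = scale_new(x_new)

# Every entry is 0 because x_new equals the training mean
print(x_new_scaled)
```

Scaling a single sample by its own mean and std would not even be meaningful, which is another reason the parameters have to come from the training set.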

##### Using StandardScaler

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Split the data into training and test sets first
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the scaler
scaler = StandardScaler()

# Fit the scaler only on the training data and transform the training data
X_train_scaled = scaler.fit_transform(X_train)

# Use the same scaler (fitted on training data) to transform the test data
X_test_scaled = scaler.transform(X_test)
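The same fit-on-train-only rule applies inside cross-validation, where each fold has its own training part. One common way to get this is to wrap the scaler and the model in a scikit-learn pipeline, so the scaler is refit on the training portion of every fold. The sketch below is illustrative only: the synthetic dataset from `make_classification` and the choice of `LogisticRegression` are assumptions, not part of the example above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only so the example is self-contained
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# The pipeline fits the scaler on each fold's training part only,
# then applies it to that fold's held-out part
model = make_pipeline(StandardScaler(), LogisticRegression())

scores = cross_val_score(model, X, y, cv=5)
print("Mean CV accuracy:", scores.mean())
```

Scaling `X` once before calling `cross_val_score` would reintroduce the leakage described earlier, because every fold's held-out rows would have contributed to the scaling parameters.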