#Data Engineering: Data Leakage
Data leakage occurs when a machine learning model has access during training to information that would not be available at prediction time. Models trained on a leaky dataset are useless at best and potentially dangerous in real-world applications.

###Signs of Data Leakage
Data leakage often results in unrealistically good performance on the test set. This is because the model has already seen some aspect of the test data during training.

###Causes of Data Leakage
* Duplicate values. Duplicates are a common problem outside of controlled systems. For example, data scraped from the web, such as customer reviews, is prone to duplicate entries from users. If copies of the same record land on both sides of the split, the model is tested on rows it has already seen.
* Leaky predictors. Some features depend on information that spans the training and test sets, or that is not available at prediction time. Time is potentially leaky, and time-series data requires special handling.
* Leaky pre-processing. The most common cause is pre-processing the data before splitting it into training and test sets. Any statistic learned during pre-processing, such as a minimum, maximum, or mean, then carries information from the test rows into training.
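
As a minimal sketch of guarding against the first two causes, the following library-free Python removes exact duplicates before splitting, then splits by date so that every test row lies after the training rows. The record layout, the helper names `drop_duplicates` and `chronological_split`, and the sample reviews are all assumptions made for illustration.

```python
from datetime import date

def drop_duplicates(rows):
    """Return rows with exact duplicates removed, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

def chronological_split(rows, cutoff, time_index=0):
    """Train on rows dated before `cutoff`; test on rows dated at or after it."""
    train = [r for r in rows if r[time_index] < cutoff]
    test = [r for r in rows if r[time_index] >= cutoff]
    return train, test

# Hypothetical review records: (date, customer, text)
reviews = [
    (date(2023, 1, 5), "ann", "great"),
    (date(2023, 1, 5), "ann", "great"),  # duplicate submission
    (date(2023, 2, 1), "bob", "ok"),
    (date(2023, 3, 9), "cy", "bad"),
]

unique_reviews = drop_duplicates(reviews)
train, test = chronological_split(unique_reviews, cutoff=date(2023, 3, 1))

# No record should appear on both sides of the split
overlap = set(train) & set(test)
print(len(unique_reviews), len(train), len(test), len(overlap))
```

Deduplicating before the split matters: if it happens afterwards, a copy of the same record can already sit in both sets.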

###Preventing Data Leakage


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
import numpy as np
import pandas as pd

##Demonstration 1: Purchase Predictions 

In [None]:
#Fabricate some purchase data
df = {'Purchase':['yes','yes','no','yes','no','yes','yes'],
      'QTY':[2,5,17,4,0,3,7],
      'Product':['Milk','Sugar','Biscuit','Chocolate','Coffee','Bread','Egg'],
      'Discount':['yes','no','no','yes','no','yes','yes']}

In [None]:
#Convert the dictionary to dataframe
df = pd.DataFrame(df)

In [None]:
df

Unnamed: 0,Purchase,QTY,Product,Discount
0,yes,2,Milk,yes
1,yes,5,Sugar,no
2,no,17,Biscuit,no
3,yes,4,Chocolate,yes
4,no,0,Coffee,no
5,yes,3,Bread,yes
6,yes,7,Egg,yes


In this sample, every row with a ***Discount*** also has a ***Purchase***.

In [None]:
#Prior to calculating correlation, encode the categories as integer labels.
from sklearn.preprocessing import LabelEncoder

In [None]:
le = LabelEncoder()

In [None]:
purch = le.fit_transform(df['Purchase'])
dis = le.fit_transform(df['Discount'])
prod=le.fit_transform(df['Product'])
df['Purchase']=purch
df['Discount']=dis
df['Product']=prod

In [None]:
df

Unnamed: 0,Purchase,QTY,Product,Discount
0,1,2,5,1
1,1,5,6,0
2,0,17,0,0
3,1,4,2,1
4,0,0,3,0
5,1,3,1,1
6,1,7,4,1


In [None]:
df.corr()

Unnamed: 0,Purchase,QTY,Product,Discount
Purchase,1.000000,-0.377135,0.474342,0.730297
QTY,-0.377135,1.000000,-0.485363,-0.320256
Product,0.474342,-0.485363,1.000000,-1.602469e-17
Discount,0.730297,-0.320256,-1.602469e-17,1.000000


The strongest relationship in this small sample is between Purchase and Discount, about 0.73. However, discounts are intermittent: they depend on seasons, special events, customer type, etc., and they run for short time spans. A model that leans on Discount may not have that information at prediction time, which makes it a potentially leaky predictor.
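
The Purchase/Discount correlation can be checked by hand. The sketch below computes the Pearson coefficient from the encoded columns shown above; the helper name `pearson` is an assumption for illustration, not part of the notebook.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Encoded columns from the dataframe above (yes=1, no=0)
purchase = [1, 1, 0, 1, 0, 1, 1]
discount = [1, 0, 0, 1, 0, 1, 1]

r = pearson(purchase, discount)
print(round(r, 4))  # 0.7303
```

A high coefficient only says the feature is predictive in this sample. Whether it is safe to use depends on whether its value is known when the prediction is made.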

#Pre-processing activities

**Best Practice**
* Split data into training and testing sets.
* Prepare data on training set.
* Fit the model on the training set.
* Evaluate model on test/validation set. 
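
The four steps above can be sketched without any library. The code below uses a single hand-made feature, a toy min-max scaler, and a toy threshold classifier. The names `MinMax` and `fit_threshold`, the data, and the fixed test indices are assumptions for illustration only.

```python
class MinMax:
    """Toy min-max scaler for one feature; the range is learned in fit() only."""
    def fit(self, values):
        self.lo, self.hi = min(values), max(values)
        return self

    def transform(self, values):
        span = (self.hi - self.lo) or 1.0  # avoid division by zero
        return [(v - self.lo) / span for v in values]

def fit_threshold(xs, ys):
    """Toy classifier: cut halfway between the two class means (class 1 assumed larger)."""
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    cut = (mean0 + mean1) / 2
    return lambda x: int(x > cut)

values = [3, 8, 15, 22, 41, 58, 63, 77, 85, 96]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# 1. Split first (fixed indices keep the example deterministic)
test_idx = {2, 7}
X_train = [v for i, v in enumerate(values) if i not in test_idx]
y_train = [l for i, l in enumerate(labels) if i not in test_idx]
X_test = [v for i, v in enumerate(values) if i in test_idx]
y_test = [l for i, l in enumerate(labels) if i in test_idx]

# 2. Prepare: fit the scaler on the training set, then apply it to both sets
scaler = MinMax().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# 3. Fit the model on the training set
model = fit_threshold(X_train_s, y_train)

# 4. Evaluate on the test set
accuracy = sum(model(x) == y for x, y in zip(X_test_s, y_test)) / len(y_test)
print(accuracy)
```

The test rows never influence `scaler.lo` or `scaler.hi`, so scaled test values are allowed to fall outside the 0-1 range when a test row lies beyond the training range.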

##**Approach 1**: The Wrong Way

Use *sklearn's* ***make_classification()*** to create the dataset with 1,000 records and 10 features.

In [None]:
from sklearn.datasets import make_classification

In [None]:
# Define a synthetic dataset with make_classification.
X, y = make_classification(n_samples=1000, 
                           n_features=10, 
                           n_informative=8, 
                           n_redundant=2, 
                           random_state=20)

In [None]:
print(X.shape, y.shape)

(1000, 10) (1000,)


Use **MinMaxScaler** to scale our data into the range of **0-1**

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
# Normalize the data to the 0-1 range (fitted on ALL rows before splitting: this is the leak)
scaler = MinMaxScaler()
X = scaler.fit_transform(X)

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Split into train and test sets with 80% for training and 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=50)

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
# Fit the model
demo01 = LogisticRegression().fit(X_train, y_train)

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
# Evaluate predictions on train set
y_pred = demo01.predict(X_train)
accuracy = accuracy_score(y_train, y_pred)
print('Accuracy on training set: %.2f' % (accuracy*100))

# Evaluate predictions on test set
y_pred = demo01.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy on test set: %.2f' % (accuracy*100))

Accuracy on training set: 88.38
Accuracy on test set: 91.50


Better accuracy on test than on training can be one tell that leakage occurred, although a small test set can produce the same gap by chance.
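
The mechanism of the leak in Approach 1 shows up in a tiny numeric example. In this sketch, scaling the training values with a range computed over all rows changes them depending on what is in the test set. The helper names and the four data points are assumptions for illustration.

```python
def min_max_params(values):
    """Return the (min, max) that a min-max scaler would learn from `values`."""
    return min(values), max(values)

def scale(values, lo, hi):
    """Map values into the range implied by lo and hi."""
    return [(v - lo) / (hi - lo) for v in values]

train = [1.0, 2.0, 3.0]
test = [10.0]

# Wrong: the range is learned from train and test together
leaky_lo, leaky_hi = min_max_params(train + test)
leaky_train = scale(train, leaky_lo, leaky_hi)

# Right: the range is learned from the training rows only
clean_lo, clean_hi = min_max_params(train)
clean_train = scale(train, clean_lo, clean_hi)

print(leaky_train)  # training values squeezed by the test maximum
print(clean_train)  # [0.0, 0.5, 1.0]
```

In the leaky version the largest training value maps to 2/9 instead of 1.0, purely because of a row the model is not supposed to know about.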

##**Approach 2**: The Right Way

Step 1: Split the data into training and test sets.

In [None]:
X, y = make_classification(n_samples=1000, 
                           n_features=10, 
                           n_informative=8, 
                           n_redundant=2, 
                           random_state=20)

In [None]:
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=50)

Step 2: Fit the data preparation on the training set, then apply it to both sets.

In [None]:
#Define the scaler
scaler = MinMaxScaler()
#Fit on the training data only, and keep the scaled result
X_train = scaler.fit_transform(X_train)
#Scale the test data with the statistics learned from training
X_test = scaler.transform(X_test)

###Do you scale the target?
* Do not scale **y_test** if you want a real world representation for use in testing or validation. 
* In most cases, also do not scale **y_train**. It is sometimes subject to data preparations, depending on the dataset and problem statement, but never before splitting.
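
For the less common case where the training target is transformed, a minimal sketch might look like the following. It assumes a regression target and standardization, neither of which appears in this notebook, and the constant "model" is a hypothetical stand-in. The point is only that the statistics come from `y_train` and that predictions are mapped back before being compared with the untouched `y_test`.

```python
from statistics import mean, pstdev

# Hypothetical regression targets, already split
y_train = [10.0, 20.0, 30.0, 40.0]
y_test = [25.0, 35.0]

# Learn the target statistics from the training split only
mu, sigma = mean(y_train), pstdev(y_train)
y_train_scaled = [(v - mu) / sigma for v in y_train]

# Stand-in model: always predicts the mean of the scaled training target
pred_scaled = [mean(y_train_scaled)] * len(y_test)

# Map predictions back to real-world units before scoring against unscaled y_test
pred = [p * sigma + mu for p in pred_scaled]
mae = mean(abs(p - t) for p, t in zip(pred, y_test))
print(mae)
```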

In [None]:
#Fit the model
demo02 = LogisticRegression().fit(X_train, y_train)

In [None]:
#Evaluate predictions on train
y_pred = demo02.predict(X_train)
accuracy = accuracy_score(y_train, y_pred)
print('Accuracy in training: %.2f' % (accuracy*100))

#Evaluate predictions on test
y_pred = demo02.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy in test: %.2f' % (accuracy*100))

Accuracy in training: 88.62
Accuracy in test: 92.50


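Does the right way always lead to success? Not always. <br/> Splitting first only makes the evaluation trustworthy; it does not improve the model. With a 200-row test set, the score also varies from split to split. Next step: tune the model and evaluate it more robustly.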

#**Using Cross-Validation**

K-fold cross-validation splits a dataset into **K** non-overlapping groups of rows, called folds, and then:
* Trains the model on all but one fold and evaluates it on the remaining fold, called the hold-out fold.
* Repeats the process so that each fold is used once as the hold-out.
* Averages performance across all evaluations.
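
The procedure can be sketched in a few lines of plain Python. This version has no shuffling, stratification, or repeats, unlike the `RepeatedStratifiedKFold` used below, and the names `k_fold_indices` and `cross_validate` are assumptions for illustration.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, holdout_idx) pairs for k non-overlapping folds."""
    # Spread rows as evenly as possible: the first n_rows % k folds get one extra row
    sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        holdout = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, holdout
        start += size

def cross_validate(n_rows, k, evaluate):
    """Average the score returned by evaluate(train_idx, holdout_idx) over all folds."""
    scores = [evaluate(train, holdout) for train, holdout in k_fold_indices(n_rows, k)]
    return sum(scores) / len(scores)

folds = list(k_fold_indices(10, 3))
for train_idx, holdout_idx in folds:
    print(holdout_idx)
```

Here `evaluate` is a placeholder for "fit on the training rows, score on the hold-out rows". Any pre-processing belongs inside it so that it is refitted for every fold.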

###**Approach 1**: The Wrong Way

In [None]:
X, y = make_classification(n_samples=1000, 
                           n_features=10, 
                           n_informative=8, 
                           n_redundant=2, 
                           random_state=20)

In [None]:
#Normalize the dataset to the 0-1 range (fitted on ALL rows before cross-validation: this is the leak)
scaler = MinMaxScaler()
X = scaler.fit_transform(X)

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold

In [None]:
#Define the evaluation procedure and pass it to cross_val_score.
cv = RepeatedStratifiedKFold(n_splits=10, 
                             n_repeats=5, 
                             random_state=50)

#Evaluate the models from demos 1 and 2 using cross-validation.
#cross_val_score clones and refits each estimator, so both calls score the same
#LogisticRegression configuration on the same pre-scaled (leaky) X.
scores01 = cross_val_score(demo01, X, y, scoring='accuracy', cv=cv, n_jobs=1)
scores02 = cross_val_score(demo02, X, y, scoring='accuracy', cv=cv, n_jobs=1)

In [None]:
#Check performance
print('Accuracy: %.2f ' % (np.mean(scores01)*100))
print('Accuracy: %.2f ' % (np.mean(scores02)*100))

Accuracy: 88.76 
Accuracy: 88.76 


###**Approach 2**: The Right Way with Pipeline

In [None]:
from sklearn.pipeline import Pipeline

In [None]:
# Define dataset
X, y = make_classification(n_samples=1000, 
                           n_features=20, 
                           n_informative=15, 
                           n_redundant=5, 
                           random_state=7)

In [None]:
#Define the pipeline
steps = list()
steps.append(('scaler', MinMaxScaler()))
steps.append(('model', LogisticRegression()))
pipeline = Pipeline(steps=steps)

In [None]:
#Define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

In [None]:
#Evaluate the model using cross-validation
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)

In [None]:
#Check performance
print('Accuracy: %.2f' % (np.mean(scores)*100))

Accuracy: 85.43


Within each round of cross-validation, the pipeline fits the scaler on the training folds only and then applies it to the hold-out fold. This avoids data leakage.
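
Conceptually, the per-fold behavior resembles the loop below. This is a library-free sketch, not scikit-learn's implementation. The single feature, the toy threshold classifier, the interleaved fold assignment, and all helper names are assumptions for illustration.

```python
def fit_min_max(values):
    """Learn a min-max range from `values` and return a scaling function."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero
    return lambda v: (v - lo) / span

def fit_threshold(xs, ys):
    """Toy classifier: cut halfway between the two class means (class 1 assumed larger)."""
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    cut = (mean0 + mean1) / 2
    return lambda x: int(x > cut)

def pipeline_cv(xs, ys, k):
    """Refit scaler and model inside every fold, then average hold-out accuracy."""
    scores = []
    for fold in range(k):
        holdout = [i for i in range(len(xs)) if i % k == fold]
        train = [i for i in range(len(xs)) if i % k != fold]
        # The scaler sees the training folds only
        scaler = fit_min_max([xs[i] for i in train])
        model = fit_threshold([scaler(xs[i]) for i in train],
                              [ys[i] for i in train])
        # The hold-out fold is transformed with the training-fold range
        hits = sum(model(scaler(xs[i])) == ys[i] for i in holdout)
        scores.append(hits / len(holdout))
    return sum(scores) / len(scores)

xs = [3, 8, 15, 22, 41, 58, 63, 77, 85, 96]
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
score = pipeline_cv(xs, ys, k=5)
print(score)
```

Bundling the scaler and the model into one object, as `Pipeline` does, is what lets `cross_val_score` refit both together for every fold.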

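**Approach 2**, with accuracy **85**, gives the more trustworthy estimate of production performance, even though Approach 1 reports a higher accuracy of **88**. Note that the two runs use different synthetic datasets (10 vs. 20 features), so the two scores are not directly comparable.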