# Data Leakage

Data leakage can ruin the value of our models.  When our model has access to information it shouldn't have, we can end up confidently shipping a product which performs **worse** than random guessing.  Data leakage is one of **the most common issues** in data science projects.  This includes the final projects submitted for this boot camp!  In fact, all of the "case studies" we will look at here are fictionalized versions of mistakes which have happened in previous project submissions.  These kinds of mistakes can invalidate months of hard work.

In [1]:
## We will now start importing a common set
## of items at the onset of most notebooks
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from seaborn import set_style
set_style("whitegrid")

### Case Study 1:  Modeling Customer Churn

You work for a company which provides video conferencing solutions. Every year, roughly $3\%$ of your customers unsubscribe from your service (aka you have a yearly "churn rate" of $3\%$).  You have been tasked with developing a model to determine which customers are likely to churn so that an intervention can be made.

You start your project by pulling yearly data.  Here are two typical rows from the data:

In [2]:
pd.DataFrame.from_dict({'year': [2023, 2024], 'support_tickets':[4, 2], 'down_time_minutes':[321, 36], 'plan_type':['premium','basic'], 'churned':[True, False]})

Unnamed: 0,year,support_tickets,down_time_minutes,plan_type,churned
0,2023,4,321,premium,True
1,2024,2,36,basic,False


It might not be obvious, but we have made a **fatal mistake** right at the beginning of our process.

We do everything else right after this point: 
* Clean our data
* Make a train/test split
* Do EDA to understand our data
* Experiment with different models and select one with good cross-validation performance
* Finally, do one last sanity check to see that our model still performs well on the test set

We deploy our model and find that it performs **worse than guessing**, costing our company millions of dollars!

#### What is the mistake?

------------

**Answer**:   The customers who leave (especially those who leave early in the year) will have fewer downtime minutes and fewer support tickets than the average user simply because they are not using the service.  A typical row for someone who churns in January might look like this:

In [3]:
pd.DataFrame.from_dict({'year': [ 2024], 'support_tickets':[0], 'down_time_minutes':[2], 'plan_type':['premium'], 'churned':[True]})

Unnamed: 0,year,support_tickets,down_time_minutes,plan_type,churned
0,2024,0,2,premium,True


We have given our model access to data for the **entire year** when it should only have access to data up until the point that the customer churns.  At the very beginning of the process we should have made sure that the date of churn was included.  When we compare a user who stays vs. a user who leaves, we should only include the number of support tickets and downtime minutes of each user up until the point that the churner leaves.

The consequence of our mistake is that the model learns that having few support tickets and few downtime minutes is predictive of churning, which is the opposite of the true relationship!  While our model had great CV performance and even did well on the test set, it performs worse than random guessing when deployed!

**Moral**:  Think very carefully about what information your model will have access to at prediction time.  If the model needs access to a time machine it is not a good model.
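One way to avoid needing a time machine is to build features relative to an explicit cutoff date. The following is a minimal sketch, not the method used in the original project: it assumes a hypothetical timestamped `events` table (one row per support ticket) and a hypothetical `customers` table with a `churn_date` column. Features only count events before the cutoff, and the label asks whether the customer churns in a window after the cutoff. All table names, column names, and the 90 day horizon are illustrative assumptions.

```python
import pandas as pd

# Hypothetical event log: one row per support ticket, with a timestamp.
events = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 3],
    "event_date": pd.to_datetime(
        ["2024-01-10", "2024-03-02", "2024-08-15",
         "2024-02-20", "2024-05-05", "2024-01-05"]
    ),
})

# Hypothetical customer table: churn_date is NaT for customers who stayed.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "churn_date": pd.to_datetime([None, "2024-06-01", "2024-02-01"]),
})


def build_snapshot(events, customers, cutoff, horizon_days=90):
    """Features use only data before `cutoff`; the label looks after it."""
    cutoff = pd.Timestamp(cutoff)
    horizon_end = cutoff + pd.Timedelta(days=horizon_days)

    # Only customers who are still subscribed at the cutoff can be scored.
    still_active = customers["churn_date"].isna() | (customers["churn_date"] > cutoff)
    active = customers[still_active].copy()

    # Features: count only events strictly before the cutoff.
    past = events[events["event_date"] < cutoff]
    counts = (
        past.groupby("customer_id").size().rename("support_tickets").reset_index()
    )

    snapshot = active.merge(counts, on="customer_id", how="left")
    snapshot["support_tickets"] = snapshot["support_tickets"].fillna(0).astype(int)

    # Label: did the customer churn within the horizon after the cutoff?
    snapshot["churned"] = snapshot["churn_date"].notna() & (
        snapshot["churn_date"] <= horizon_end
    )
    return snapshot[["customer_id", "support_tickets", "churned"]]


snapshot = build_snapshot(events, customers, cutoff="2024-04-01")
print(snapshot)
```

With this setup every customer is observed over the same window before the cutoff, so a customer who leaves later does not look artificially "quiet" compared to one who stays.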

### Case Study 2:  Automated produce rejection

After the fiasco at the video conferencing company you are let go.  Thankfully, you get hired relatively quickly by a large produce distributor. 

You are in charge of developing a system to take an image of a fruit or vegetable as it is moving down a conveyor belt and automatically accept or reject it.  Your goal is to reject damaged, rotten, or unsightly produce.

This is a problem with pretty severe class imbalance:  only about 5% of the produce should be rejected.

You collect 50000 images and then hire a few contractors to manually label those images.  Initial machine learning attempts fail.  You request additional funding to label more data, but your request is denied.

One strategy for "beefing up" a dataset is **data augmentation**:  you take the 50000 images you have and apply slight adjustments to them such as rotating a few degrees in either direction, resizing them slightly, applying a noisy filter, etc.

You now have a dataset with 2 million labeled images!

You perform a training/validation/testing split.  You iterate several different times, eventually settling on a convolutional neural network with a certain architecture since it had the lowest validation error:  actually a relatively simple model performed quite well!  As a final sanity check you check it on the testing set, and it performs similarly.

However, when you deploy the model it performs horribly, again costing the company millions!

#### What is the mistake?

-----------

**Answer**:  Since you performed your data augmentation *before* your train/test split, your testing set of images contains examples which are extremely close to those in the training set (for example, the same image but just rotated one degree).  Your model was actually horribly over-fit but you had no way of knowing until you tried it on actually novel data.

**Moral**: When you do data cleaning, preprocessing, or augmentation, think carefully about whether you are allowing information to pass between the training set and the testing set.
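Here is a minimal sketch of the safer ordering: split the *original* images first, then augment only the training portion. The random arrays stand in for real images, and the `augment` helper (which just adds small Gaussian noise) is a hypothetical placeholder for rotations, resizing, and so on. The sizes and the 80/20 split are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: 100 tiny grayscale "images" with roughly 5% rejects.
images = rng.random((100, 8, 8))
labels = rng.random(100) < 0.05  # True means "reject"

# Step 1: split the ORIGINAL images, before any augmentation happens.
indices = rng.permutation(len(images))
test_idx, train_idx = indices[:20], indices[20:]


def augment(batch, rng, copies=3, noise_scale=0.01):
    """Return the batch followed by `copies` noisy variants of every image."""
    variants = [batch]
    for _ in range(copies):
        variants.append(batch + rng.normal(0.0, noise_scale, size=batch.shape))
    return np.concatenate(variants)


# Step 2: augment the training images only.
copies = 3
X_train = augment(images[train_idx], rng, copies=copies)
y_train = np.tile(labels[train_idx], copies + 1)  # original + each variant

# The test set stays as untouched original images.
X_test = images[test_idx]
y_test = labels[test_idx]

print(X_train.shape, X_test.shape)
```

If you do want augmented images in the validation or testing sets, the same idea applies: every variant of a given original image should land on the same side of the split as that original.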

### Case Study 3:  Modeling selling price of used vehicles

You are once again on the job market and get hired by a brand new company which specializes in selling used cars.  You are tasked with predicting the selling price of the cars.

The company is new so it doesn't have much data.  You decide to use data from the OpenCarSales API for training.  However, you are aware that this includes sales from some markets which are not that similar to your own.  So you decide to use your in-house data for validation and testing.

You guessed it:  the model performs excellently on the validation and testing set, but performs horribly in real life.


#### What is a potential mistake?

---------

**Answer**:  You didn't realize this, but OpenCarSales actually scrapes your company website!  All of your validation and testing samples were also included in your training set.  The results are disastrous.

**Moral**: Be especially careful when combining multiple datasets to make sure that they do not share any samples in common.
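A quick overlap check before training can catch this. The sketch below assumes both datasets share a stable identifier, here a hypothetical `vin` column; the toy rows and column names are made up for illustration, and real data may need fuzzier matching (for example on make, model, year, mileage, and sale date) if no clean key exists.

```python
import pandas as pd

# Toy stand-ins for the external training data and the in-house data.
external = pd.DataFrame({
    "vin": ["A1", "B2", "C3", "D4"],
    "price": [9500, 12000, 7200, 15500],
})
in_house = pd.DataFrame({
    "vin": ["C3", "D4", "E5"],
    "price": [7200, 15500, 11000],
})

# Rows whose identifier appears in both datasets.
overlap = external.merge(
    in_house, on="vin", how="inner", suffixes=("_external", "_in_house")
)
print(f"{len(overlap)} of {len(in_house)} in-house rows also appear in the external data")

# Drop the shared rows from the training data before fitting anything.
external_clean = external[~external["vin"].isin(in_house["vin"])]
```

Any nonzero overlap is worth investigating before you trust validation or testing scores.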

### Case Study 4:  Classifying text sentiment

You are hired by a new company on a short-term contract basis because their natural language processing expert left for greener pastures and the company has a tight deadline.

Your part of the project is to classify the sentiment of 10 million reviews into "positive" or "negative".

Here is a random sample of the data you received:

In [12]:
df = pd.read_csv('../../data/leaky_imdb.csv', index_col=0)
df

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. <br /><br />the...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive
5,"probably my all-time favorite movie, a story o...",positive
6,i sure would like to see a resurrection of a u...,positive
7,"This show was an amazing, fresh & innovative i...",negative
8,Encouraged by the positive comments about this...,negative
9,if you like original gut wrenching laughter yo...,positive


You train a classifier on the data which gets 100% accuracy.  As a seasoned veteran, instead of rejoicing you go hunting for data leakage.  

#### What is the mistake?

---------

**Answer**:  It looks like during preprocessing someone converted all of the positive sentiment reviews to lower case, but forgot to do the same thing to the negative sentiment reviews.  It is now trivial for a machine learning model to distinguish the two classes, but not for the reasons you would hope!

**Moral**:  Whenever you modify data, and especially when it is done in a label-dependent way, make sure you are being consistent. You may end up with models which learn to discriminate based on your data processing irregularities rather than any real signal in the data.
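One way to hunt for this kind of leak is to compare simple formatting statistics across labels. The sketch below uses a small made-up DataFrame with the same `review` and `sentiment` columns as above (the rows are toy examples, not taken from the real file), and the two checks shown are only examples of surface features you might inspect.

```python
import pandas as pd

# Toy rows with the same columns as the real data.
toy = pd.DataFrame({
    "review": [
        "loved every minute of it",
        "a charming and funny film",
        "Dull from start to finish",
        "I want my two hours back",
    ],
    "sentiment": ["positive", "positive", "negative", "negative"],
})

# Surface-level properties that should NOT depend on the label.
checks = pd.DataFrame({
    "sentiment": toy["sentiment"],
    "is_lowercase": toy["review"] == toy["review"].str.lower(),
    "has_html_break": toy["review"].str.contains("<br />", regex=False),
})

# Fraction of reviews in each class with each property.
summary = checks.groupby("sentiment").mean()
print(summary)
```

If a property like `is_lowercase` is near 1 for one class and near 0 for the other, the classifier can separate the classes from formatting alone, which is a strong hint that preprocessing was applied inconsistently.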

--------------------------

This notebook was written for the Erdős Institute Data Science Boot Camp by Steven Gubkin.

Please refer to the license in this repo for information on redistribution.