# EDA, Validation, Data Leakages

Patricia Schuster  
Dec '19  


# EDA (Exploratory Data Analysis)

## What and why?

The goal of EDA is to understand the data well enough to design powerful features and build accurate models. The best solutions incorporate unique insights about the data gained from EDA.

Visualization allows us to immediately see the patterns. How can we use the patterns to build a better model?

## Building intuition about the data

* Getting domain knowledge  
    * Usually don't need to go too deep inside the field  
    * Read Wikipedia, google it, make sure you understand the data  
    * Ex: Predict advertiser's cost. Look into Google's ad network.  
* Check whether the data is intuitive and agrees with domain knowledge  
    * Ex: An age column should have numbers 0-100. If you see an entry 336, it could indicate that the column is not human age, or there is a typo  
    * Sometimes an "error" in the data is there due to how the data is exported. You may be able to use that.  
* Understand how the data was generated  

## Explore anonymized data

* What is it?  
    * Sometimes organizers export the data so that you cannot recover the original values. Ex: They don't want to reveal a document's content, so they replace words with hash codes  
* What can we do with it?   
    * Look at whether features are related to each other or grouped somehow  
    * Run a basic ML model such as a random forest and look at feature importances  
    * Any features with mean near 0 and std near 1? They may have been standard-scaled during preprocessing.  
    * Try to decode the feature / guess its true meaning / guess the feature type  
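
As a minimal sketch of the "mean near 0, std near 1" check, the helper below flags numeric columns that look standard-scaled. The function name `looks_standardized`, the tolerance `tol`, and the toy columns `x1` and `x2` are assumptions made for illustration; they do not come from any particular dataset.

```python
import numpy as np
import pandas as pd

def looks_standardized(df, tol=0.05):
    """Return names of numeric columns whose mean is ~0 and std is ~1."""
    numeric = df.select_dtypes(include="number")
    mean_ok = numeric.mean().abs() < tol
    std_ok = (numeric.std(ddof=0) - 1).abs() < tol
    return list(numeric.columns[(mean_ok & std_ok).to_numpy()])

# Hypothetical anonymized data: x1 was standard-scaled, x2 was left raw.
rng = np.random.default_rng(0)
raw = rng.normal(loc=50, scale=10, size=1000)
demo = pd.DataFrame({
    "x1": (raw - raw.mean()) / raw.std(),
    "x2": raw,
})

flagged = looks_standardized(demo)
print(flagged)
```

A flagged column is only a hint: a feature can have mean 0 and std 1 by coincidence, so it is worth confirming with a histogram.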


## Exploration and visualization tools

**First look**

* Make several visualizations to investigate the same thing, so that you are not misled due to plot type  
* Histogram with a big peak at the mean: organizers may have filled missing values with the mean  
* Use `pandas` `describe` function: `df.describe()`  
* Evaluate number of occurrences of distinct pieces of data: `x.value_counts()`  
* Look for missing data: `x.isnull()`  
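
The three calls above can be tried on a toy frame. This is a small sketch; the column names `age` and `city` are hypothetical, and the value 336 mirrors the implausible age mentioned earlier.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25, 31, np.nan, 31, 336],
    "city": ["A", "B", "B", None, "B"],
})

summary = df.describe()                   # count, mean, std, min, quartiles, max
city_counts = df["city"].value_counts()   # occurrences of each distinct value
missing = df.isnull().sum()               # number of missing entries per column

print(summary)
print(city_counts)
print(missing)
```

The `max` row of `summary` immediately exposes the suspicious age, and `missing` shows which columns need attention before modelling.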

**Feature relations**

* Scatter plot for all pairs of features: `pd.plotting.scatter_matrix(df)`  
* Correlations: `df.corr()`  
* How often is one feature larger than another? Build the matrix manually, then plot it with `plt.matshow(...)`  
* Calculate statistics of each feature, plot vs. index, sorted by the statistic: `df.mean().sort_values().plot(style='.')`  
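
One way to build the "how often is one feature larger than another" matrix is sketched below. The helper name `fraction_greater` and the toy columns are assumptions for illustration. The broadcast creates an array of shape (rows, columns, columns), so for wide or long data it may be better to loop over column pairs instead.

```python
import numpy as np
import pandas as pd

def fraction_greater(df):
    """Entry (i, j) is the fraction of rows where column i > column j."""
    values = df.to_numpy(dtype=float)
    # Compare every column with every other column, row by row,
    # then average the boolean results over the rows.
    greater = values[:, :, None] > values[:, None, :]
    return pd.DataFrame(greater.mean(axis=0), index=df.columns, columns=df.columns)

demo = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 3, 4, 5],   # always larger than a
    "c": [0, 5, 0, 5],
})

gt = fraction_greater(demo)
print(gt)
# To visualize: import matplotlib.pyplot as plt; plt.matshow(gt); plt.show()
```

Entries equal to exactly 0 or 1 are the interesting ones: they suggest one feature was derived from another, for example a cumulative count.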

## (A bit of) dataset cleaning

* Unnecessary data: features with a constant value  
    * Original dataset might have been larger, and it was cut down for the competition  
    * Find with `train.nunique(axis=0) == 1` (unique values counted per column)  
* Duplicated features  
    * Remove duplicates. They won't add anything new and will consume memory.  
    * Remove with `traintest.T.drop_duplicates()`  
    * For categorical features, labels may be shuffled. Label encode each column separately; duplicated features will then look identical.   
    * `for f in categorical_feats:`  
    *     `traintest[f] = traintest[f].factorize()[0]`  
    * `traintest.T.drop_duplicates()`  
* Duplicate rows  
    * Check whether identical rows have the same label  
    * Find duplicated rows and understand why they are duplicated. It may tell us something about the dataset generation process.  
* Check that the dataset is shuffled. If not, you may find data leakage. Plot target feature vs. row index.  
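
The cleaning steps above can be sketched end to end on a toy frame. The column names and the `categorical_feats` list are hypothetical. Note that `factorize()` returns a `(codes, uniques)` pair, so only the codes are kept.

```python
import pandas as pd

# Hypothetical combined train+test frame.
traintest = pd.DataFrame({
    "const": [7, 7, 7, 7],
    "color": ["red", "blue", "red", "green"],
    "color_shuffled": ["x", "y", "x", "z"],   # same feature, relabeled
    "value": [1.0, 2.5, 1.0, 4.0],
})

# 1) Constant features: exactly one distinct value in the column.
n_unique = traintest.nunique(axis=0)
constant_cols = n_unique[n_unique == 1].index.tolist()
cleaned = traintest.drop(columns=constant_cols)

# 2) Label encode categoricals so that relabeled copies become identical.
categorical_feats = ["color", "color_shuffled"]
for f in categorical_feats:
    cleaned[f] = cleaned[f].factorize()[0]

# 3) Drop duplicated columns: transpose, drop duplicate rows, transpose back.
deduped = cleaned.T.drop_duplicates().T

# 4) Duplicated rows: keep=False marks every member of a duplicate group.
dup_rows = traintest[traintest.duplicated(keep=False)]

print(constant_cols)
print(list(deduped.columns))
print(dup_rows)
```

Transposing a large frame can be slow and memory hungry, so on big datasets comparing columns pairwise may be the more practical route.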


# Validation and overfitting

Two main reasons why competitors may drop down the leaderboard once private results are revealed:

1) Competitors may ignore validation and select the submission that scored best against the public leaderboard  
2) Competitions may have no consistent public/private data split, or they have too little data in either the public or the private leaderboard  

Our task is to select our most appropriate submission to be evaluated by the private leaderboard.  

Topics covered next:

1) Understand validation and overfitting  
2) Identify the number of splits that should be done to establish stable validation   
3) Break down the most frequent ways to make a train-test split  
4) Discuss the most common validation problems  

## What is validation?

Validation helps evaluate the quality of a model and select the model that will perform best on unseen data. Underfitting happens when you don't capture enough patterns in the data. Overfitting happens when you capture noise or patterns which do not generalize to the test data.  

In a normal competition, we don't have access to the test data. We don't want to overfit our method on the training data, so we usually split our training data further into a train and validation set. Fit the model on train, check it on validation.  

Furthermore, often the organizers will split the test set into public and private data. The public score is what you receive back upon each submission, but it is being calculated on only a portion of the test data (usually 25-33%). The public leaderboard shows how your public score compares to other competitors.  

When the competition ends, Kaggle scores your predictions against the remaining fraction of the test set, which is the private portion. You don't receive ongoing feedback about your private score; it appears only on the private leaderboard. Final competition results are based on the private leaderboard. This ensures the winner's results are accurate but generalized, without overfitting to the test data. 

Use cross validation and public leaderboard to optimize private leaderboard scores. 

## Validation strategies

Three strategies, each with different number of splits:

* Holdout  
    * Splits into two  
    * Samples in train and test do not overlap  
    * Scheme  
        * Split train data into `partA` and `partB`  
        * Fit the model on `partA`, predict for `partB`  
        * Use predictions for `partB` to estimate model quality. Choose the hyper-parameters that maximize quality on `partB`  
* K-fold  
    * A repeated holdout  
    * Create many folds, each with the same test-train split ratio, but using a different part of the data as test each time  
    * Takes longer than holdout because you are essentially repeating holdout $K$ times  
    * Scheme
        * Split train data into K folds  
        * Iterate through each fold: retrain the model on all folds except current fold, predict for the current fold  
        * Use the predictions to calculate quality on each fold. Choose the hyper-parameters that maximize quality across the folds. You can also estimate the mean and variance of the loss, which helps in judging whether an improvement is significant  
* Leave-one-out  
    * A special case of K-fold, where $K$ is equal to the number of samples in the data  
    * Iterate through every sample in the data as the test data  
    * Scheme  
        * Iterate over samples: retrain the model on all samples except current sample, predict for the current sample. You will need to retrain the model $N$ times (where $N$ is the number of samples in the dataset)  
        * In the end you will get LOO predictions for every sample in the trainset and can calculate loss  
    
Usually, use holdout or K-fold on shuffled data.

Stratification preserves the same target distribution across the folds.
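
The three schemes can be sketched with NumPy alone. The helpers `kfold_indices` and `fit_predict`, the synthetic data, and the least-squares model are assumptions chosen to keep the example self-contained. In practice, scikit-learn's `train_test_split`, `KFold`, `StratifiedKFold` and `LeaveOneOut` provide the same splits.

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Yield (train_idx, valid_idx) pairs for a shuffled K-fold split."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, folds[i]

def fit_predict(X_train, y_train, X_valid):
    """Least-squares linear fit, used here as a stand-in for any model."""
    w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    return X_valid @ w

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(42)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

def cv_mse(k):
    """Validation MSE on each of the k folds."""
    scores = []
    for train_idx, valid_idx in kfold_indices(len(y), k):
        pred = fit_predict(X[train_idx], y[train_idx], X[valid_idx])
        scores.append(np.mean((pred - y[valid_idx]) ** 2))
    return np.array(scores)

# Holdout: a single split, here simply the first of five folds.
holdout_train, holdout_valid = next(kfold_indices(len(y), 5))

kfold_scores = cv_mse(k=5)        # K-fold
loo_scores = cv_mse(k=len(y))     # leave-one-out: K equals the number of samples

print(kfold_scores.mean(), kfold_scores.std())
print(loo_scores.mean())
```

The spread of `kfold_scores` gives a rough sense of how large a score difference has to be before it can be trusted as a real improvement.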

## Data splitting strategies

**We want to identify what train-test split strategy was used by the competition organizers and reproduce it.**

How do we divide into train and test? Different splitting strategies can differ significantly in several ways:

* In the features that can be generated  
* In the way the model relies on those features  
* In whether some kind of target leak occurs

For a time-based trend, use a time-based split  

* Everything before a given date as train, everything after as validation  
* The test data comes later in time than the training data  
* If we create features that are useful for a time-based split but not a random split, then validating on random split may not be useful. In time-based split, the validation data is forward in time, and therefore closer to the test data.   
    
Four general categories for splitting data:

* Random, rowwise  
    * The most common way of splitting data  
    * Rows are fairly independent of each other  
* Timewise  
    * As described above  
    * A special case for validation is a moving window validation. Shift the time range of data pulled for validation  
* By ID  
    * IDs in train and test probably do not overlap  
    * May have to reconstruct IDs  
* Combined
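
A rough sketch of the time-wise, moving-window and by-ID splits follows. The frame, its column names (`date`, `user_id`, `target`), the cutoff date and the window length are all hypothetical choices for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily data; column names are illustrative only.
df = pd.DataFrame({
    "date": pd.date_range("2019-01-01", periods=120, freq="D"),
    "user_id": np.arange(120) % 10,
    "target": np.arange(120, dtype=float),
})

# Time-wise split: everything before the cutoff is train, the rest is validation.
cutoff = pd.Timestamp("2019-04-01")
train_time = df[df["date"] < cutoff]
valid_time = df[df["date"] >= cutoff]

# Moving-window validation: slide the validation window forward in time.
def moving_windows(frame, starts, window_days):
    for start in starts:
        end = start + pd.Timedelta(days=window_days)
        train = frame[frame["date"] < start]
        valid = frame[(frame["date"] >= start) & (frame["date"] < end)]
        yield train, valid

starts = pd.date_range("2019-03-01", periods=3, freq="14D")
windows = list(moving_windows(df, starts, window_days=14))

# Split by ID: all rows of one user land on the same side of the split.
rng = np.random.default_rng(0)
valid_users = rng.choice(df["user_id"].unique(), size=3, replace=False)
is_valid = df["user_id"].isin(valid_users)
train_id, valid_id = df[~is_valid], df[is_valid]

print(len(train_time), len(valid_time))
print([(len(t), len(v)) for t, v in windows])
print(len(train_id), len(valid_id))
```

Whichever variant is used, the aim stays the same: the validation split should mimic the way the organizers separated train from test.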

## Problems that occur during validation

Validation stage problems:  

* Different scores for different train-test splits  
    * Ex: Predicting sales. If you split based on time, Dec and Jan will have more sales than Feb because of holidays  
    * What if there is no clear reason for different scores?  
        * Too little data  
        * Data too diverse and inconsistent  
    * We should do extensive validation  
        * Average scores from different K-fold splits  
        * Tune model on one split, evaluate score on another  
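
To illustrate averaging scores from different K-fold splits, the sketch below repeats a 5-fold split with ten different shuffling seeds on a small, noisy synthetic dataset. The helper `kfold_score`, the data, and the least-squares model are assumptions for illustration only.

```python
import numpy as np

def kfold_score(X, y, k, seed):
    """Mean validation MSE of a least-squares fit over one shuffled K-fold split."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), k)
    errors = []
    for i, valid_idx in enumerate(folds):
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])
        w, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        errors.append(np.mean((X[valid_idx] @ w - y[valid_idx]) ** 2))
    return float(np.mean(errors))

# Small, noisy dataset: the score depends noticeably on how the data is split.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
y = X[:, 0] + rng.normal(scale=1.0, size=30)

per_seed = np.array([kfold_score(X, y, k=5, seed=s) for s in range(10)])
print(per_seed.round(3))
print(per_seed.mean(), per_seed.std())
```

The mean over seeds is a steadier estimate than any single split, and the standard deviation shows how much of a score change could be due to the split alone.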
    
Submission stage problems:  

* We may already have quite different scores in K-fold  
* Too little data in public leaderboard  
* Train and test data are from different distributions  
* Trust your validation!  

Leaderboard shuffle: Sometimes, after the competition ends, the top ranks of the public leaderboard are reshuffled on the private leaderboard. Why?  

* Randomness  
* Too little data  
* Different public/private distributions  

# Data leakages

Unexpected information in the data that allows us to make unrealistically good predictions. It is NOT OK to use in the real world, but it may be acceptable in competitions, where the only goal is the best predictions. 

Main types of data leaks:

* Time-series  
    * Features may contain information about the future  
    * Weather, user history  
* Unexpected information  
    * Metadata from images  
    * Information in IDs which is correlated to target variable  
    * Row order  
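
A quick check for a row-order leak is to correlate row position with the target, as sketched below. The dataset is synthetic and deliberately leaky (rows exported sorted by class); the column names `id` and `target` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical leaky dataset: rows were exported sorted by class.
rng = np.random.default_rng(0)
target = np.sort(rng.integers(0, 2, size=200))
train = pd.DataFrame({"id": np.arange(200), "target": target})

# Correlation between row position and target: close to 0 for shuffled data.
order_corr = np.corrcoef(np.arange(len(train)), train["target"])[0, 1]

# Same check after shuffling the rows.
shuffled = train.sample(frac=1.0, random_state=0).reset_index(drop=True)
shuffled_corr = np.corrcoef(np.arange(len(shuffled)), shuffled["target"])[0, 1]

# A rolling mean of the target over row index gives the same picture as a plot.
rolling = train["target"].rolling(window=20).mean()

print(round(order_corr, 3), round(shuffled_corr, 3))
```

Shuffling removes the row-order signal here, but the `id` column would still carry it, so IDs deserve the same check against the target.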

**Leaderboard probing**

* Extracting ground truth from public parts of leaderboard  
* Exploit vulnerabilities in public-private split  