# Advanced Topics in AML

## I Leakage Detection

## II Automated Feature Engineering

## III Transfer Learning

## IV Model Selection at Scale

## V Open Research Topics in AML

# Leakage Detection


<img src="img/guts-leakage-nutshell.png" width=600>

Training on **contaminated data** leads to overly optimistic expectations about model performance in production.

[ODSC 2018 talk by Yuriy Guts on Target Leakage.](https://github.com/YuriyGuts/odsc-target-leakage-workshop)

<img src="img/guts-target-leakage.png" width=600>


## How can AML help in preventing Target Leakage?

### Data Collection
Help the data scientist flag suspicious columns (e.g. a very high univariate correlation with the target).
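A minimal sketch of such a check, assuming numeric columns and a numeric target. The function names, the `0.95` threshold, and the toy data (including the `refund_issued` column) are hypothetical and only serve as illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no signal
    return cov / math.sqrt(var_x * var_y)

def flag_suspicious_columns(columns, target, threshold=0.95):
    """Return names of columns whose |correlation| with the target exceeds threshold."""
    return [name for name, values in columns.items()
            if abs(pearson(values, target)) > threshold]

# Hypothetical toy data: `refund_issued` is only known after the outcome.
target = [0, 1, 0, 1, 1, 0, 0, 1]
columns = {
    "age": [34, 45, 23, 36, 29, 51, 42, 38],
    "refund_issued": [0, 1, 0, 1, 1, 0, 0, 1],
}
suspicious = flag_suspicious_columns(columns, target)
print(suspicious)
```

A flagged column is a prompt for a human to review, not proof of leakage: a genuinely strong predictor can look the same.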

### Feature Engineering
Pipeline DSLs can provide safety. Use them, even when doing manual ML.
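The safety comes from a simple contract: every step learns its parameters in `fit()` from the training data only and is then applied unchanged to new data. The toy `Standardizer` and `Pipeline` classes below are hypothetical stand-ins written for this sketch; real pipeline libraries offer the same contract with far more features.

```python
class Standardizer:
    """Toy transformer: learns mean and standard deviation in fit()."""

    def fit(self, xs):
        n = len(xs)
        self.mean = sum(xs) / n
        # Fall back to 1.0 for constant input to avoid division by zero.
        self.std = (sum((x - self.mean) ** 2 for x in xs) / n) ** 0.5 or 1.0
        return self

    def transform(self, xs):
        return [(x - self.mean) / self.std for x in xs]


class Pipeline:
    """Toy pipeline: every step is fitted on the data passed to fit() only."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, xs):
        for step in self.steps:
            xs = step.fit(xs).transform(xs)
        return self

    def transform(self, xs):
        for step in self.steps:
            xs = step.transform(xs)
        return xs


train, test = [1.0, 2.0, 3.0, 4.0], [100.0]
pipe = Pipeline([Standardizer()]).fit(train)  # statistics come from train only
train_mean = pipe.steps[0].mean
test_scaled = pipe.transform(test)            # test data never influences fit()
```

Computing the mean over `train + test` before splitting would be the leaky variant: the test point would shift the statistics the model is trained with.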

### Partitioning
Assist the data scientist by providing good error diagnostics and model introspection.

### Training & Tuning
Human-out-of-the-loop: implement best practice, with no shortcuts and no peeking at the holdout data before trying yet another model.

### ML Competitions
AML needs to have a switch to turn off leakage prevention. :P

# Automated Feature Engineering

<img src="img/kanter-fe-process.png" width=600>
<div style="text-align: right">Source: J. Kanter, K. Veeramachaneni (2015), Deep feature synthesis: Towards automating data science endeavors. DSAA 2015</div>

# Real-world use-cases
Many real-world use-cases operate on relational & transactional data sources.

<img src="img/diabetes-schema.png">

Most ML feature engineering we've seen so far operates on de-normalized data (i.e. a single table).

# Taking time into account...

You need to take time into account when doing *model selection* and *feature engineering*!

<img src="img/ge-flight-quest.jpg">

Example: predict the gate / runway arrival time of a flight while it is in mid-air at time $t_0$ (the *cutoff*). When creating the feature `ceiling_at_destination`, you must not use records from the `weather_station` table that correspond to sensor readings at a time $t_1 > t_0$.
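A minimal sketch of a cutoff-aware lookup for this feature. The tuple layout of `weather_station`, the airport code, and all readings are hypothetical; only the table and feature names come from the example above.

```python
from datetime import datetime

def ceiling_at_destination(weather_station, airport, cutoff):
    """Latest ceiling reading for `airport` observed at or before `cutoff`.

    `weather_station` is a list of (airport, observed_at, ceiling_ft) tuples.
    Returns None when no reading exists at or before the cutoff.
    """
    visible = [(observed_at, ceiling)
               for code, observed_at, ceiling in weather_station
               if code == airport and observed_at <= cutoff]
    if not visible:
        return None
    return max(visible)[1]  # reading with the latest timestamp

weather_station = [
    ("SFO", datetime(2012, 11, 12, 9, 0), 1200),
    ("SFO", datetime(2012, 11, 12, 10, 0), 800),
    ("SFO", datetime(2012, 11, 12, 11, 0), 300),  # after the cutoff: must be ignored
]
cutoff = datetime(2012, 11, 12, 10, 30)  # t_0
feature = ceiling_at_destination(weather_station, "SFO", cutoff)
print(feature)
```

A plain join on the airport code would happily pick up the 11:00 reading, which is exactly the leak the cutoff is meant to prevent.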

# ... matters

<img src="img/ge-flight-quest-gxav.jpg">

# Deep Feature Synthesis

DFS is an algorithm for automatically generating features for relational datasets. 
DFS operates on a snowflake schema and follows relationships in the data to a base field, and then sequentially
applies mathematical functions along that path to create the final feature.
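The idea of stacking aggregations along a relationship path can be sketched in a few lines. This is a hand-rolled illustration on hypothetical toy tables, not the DFS algorithm itself, which also enumerates the candidate paths and functions automatically.

```python
from collections import defaultdict
from statistics import mean

def aggregate(child_rows, parent_key, value_key, func):
    """Group child rows by their parent key and apply `func` to the grouped values."""
    groups = defaultdict(list)
    for row in child_rows:
        groups[row[parent_key]].append(row[value_key])
    return {parent: func(values) for parent, values in groups.items()}

transactions = [
    {"session_id": 1, "amount": 10.0},
    {"session_id": 1, "amount": 30.0},
    {"session_id": 2, "amount": 20.0},
    {"session_id": 3, "amount": 5.0},
]
sessions = [
    {"session_id": 1, "customer_id": "a"},
    {"session_id": 2, "customer_id": "a"},
    {"session_id": 3, "customer_id": "b"},
]

# Depth 1: SUM(transactions.amount) for each session.
session_sum = aggregate(transactions, "session_id", "amount", sum)
# Attach the depth-1 feature to the sessions table.
for s in sessions:
    s["sum_amount"] = session_sum.get(s["session_id"], 0.0)
# Depth 2: MEAN(sessions.SUM(transactions.amount)) for each customer.
customer_feature = aggregate(sessions, "customer_id", "sum_amount", mean)
print(customer_feature)
```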

<img src="img/kanter-dfs.png" width=600>
<div style="text-align: right">Source: J. Kanter, K. Veeramachaneni (2015), Deep feature synthesis: Towards automating data science endeavors. DSAA 2015</div>

# [Featuretools](https://docs.featuretools.com/index.html)
Open source implementation of DFS on top of [pandas](https://pandas.pydata.org/):
```python
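import featuretools as ft

# Note: `entities=` / `target_entity=` is the pre-1.0 Featuretools API;
# releases from 1.0 on use `dataframes=` / `target_dataframe_name=` instead.
data = ft.demo.load_mock_customer()

# Each entity: (dataframe, index column[, time index column]).
entities = {
   "customers" : (data['customers'], "customer_id"),
   "sessions" : (data['sessions'], "session_id", "session_start"),
   "transactions" : (data['transactions'], "transaction_id", "transaction_time")
   }
# Each relationship: (parent entity, parent key, child entity, child key).
relationships = [("sessions", "session_id", "transactions", "session_id"),
                 ("customers", "customer_id", "sessions", "customer_id")]
# Generate one row of features per customer.
feature_matrix_customers, features_defs = ft.dfs(entities=entities,
     relationships=relationships,
     target_entity="customers")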
```


# AFE summary

Generic Extract-Transform-Load (ETL) tools are not great for feature engineering:

  * Join operations need to take cutoffs into account.
  * Feature extractors need to be part of your ML pipeline to guard against leakage and train-test skew.
  
Data needs to be stored in a way that allows AFE to be applied:

  * AFE operates on a *Transaction Log* rather than a static snapshot (e.g. customer table).

# Transfer Learning

<img src="img/pan-transfer-learning.png" width=600>
<div style="text-align: right">Source: S. J. Pan, Q. Yang (2010), "A Survey on Transfer Learning" IEEE</div>

# Automatically Select the best Source Task

Goal: build a sentiment classifier for book reviews. Little to no data is available for the target task, but we have plenty of reviews for DVDs, electronics, and kitchen appliances.

Which source domain (dvd, electronics, kitchen appliances) should we use to train our sentiment classifier?

### Discriminative Distance (aka *Discrepancy*)

Use a classifier to distinguish source and target domains. The source domain that is the hardest to separate from the target has the smallest *discriminative distance* and thus most resembles the target.
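A minimal sketch of this selection procedure. It makes several assumptions that are not in the slides: each review is reduced to a single synthetic numeric feature, the domain classifier is a nearest-centroid rule, and the distance is estimated with the commonly used proxy A-distance $2(1 - 2\epsilon)$, where $\epsilon$ is the held-out error of the domain classifier. Real reviews would need text features and a stronger classifier.

```python
import random

def domain_classifier_error(source, target, seed=0):
    """Held-out error of a nearest-centroid classifier separating two 1-D samples."""
    rng = random.Random(seed)
    data = [(x, 0) for x in source] + [(x, 1) for x in target]
    rng.shuffle(data)
    half = len(data) // 2
    train, test = data[:half], data[half:]
    centroids = {}
    for label in (0, 1):
        values = [x for x, y in train if y == label]
        centroids[label] = sum(values) / len(values)
    wrong = sum(1 for x, y in test
                if min(centroids, key=lambda c: abs(x - centroids[c])) != y)
    return wrong / len(test)

def proxy_a_distance(source, target):
    """2 * (1 - 2 * err): near 0 = hard to separate, near 2 = easy to separate."""
    err = domain_classifier_error(source, target)
    return 2.0 * (1.0 - 2.0 * err)

# Synthetic stand-ins for the review domains (hypothetical data).
rng = random.Random(42)
books = [rng.gauss(0.0, 1.0) for _ in range(500)]    # target domain
dvd = [rng.gauss(0.5, 1.0) for _ in range(500)]      # similar source
kitchen = [rng.gauss(3.0, 1.0) for _ in range(500)]  # dissimilar source

distances = {"dvd": proxy_a_distance(dvd, books),
             "kitchen": proxy_a_distance(kitchen, books)}
best_source = min(distances, key=distances.get)
print(best_source, distances)
```

The source with the smallest distance is the one the domain classifier confuses most often with the target, so it is picked as the training domain.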

<img src="img/blitzer-discrepancy.png" width=600>
<div style="text-align: right">Source: J. Blitzer, H. Daume III, Domain Adaptation</div>

# Model Selection at Scale

<div style="float: left;">
<img src="img/norvig-effectiveness-of-data.png">
<div style="text-align: right">Source: P. Norvig et al, The Unreasonable Effectiveness of Data</div>
</div>
<div style="float: right; ">
<img src="img/banko-learning-curve.png">
<div style="text-align: right">Source: M. Banko, E. Brill (2001), Scaling to Very Very Large <p/> Corpora for
Natural Language Disambiguation, ACL</div>
</div>
<div style="clear: both; "/>

### More data trumps better algorithms?

The work of both Norvig et al. and Banko & Brill has often been misinterpreted to mean that *more data trumps better algorithms*.

We cannot make such a general statement: the answer depends on whether you have a bias problem or a variance problem.
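A rough way to tell the two apart is to compare training and validation error at the largest training size tried. The sketch below is a heuristic only; the function name, the `acceptable_error` target, and the `gap_tolerance` threshold are hypothetical choices, not established constants.

```python
def diagnose(train_error, validation_error, acceptable_error, gap_tolerance=0.02):
    """Rough learning-curve diagnosis at the largest training size tried.

    The thresholds are illustrative assumptions, not established constants.
    """
    gap = validation_error - train_error
    if gap > gap_tolerance:
        return "variance"  # model overfits: more data (or regularization) should help
    if train_error > acceptable_error:
        return "bias"      # model underfits: more data alone will not help
    return "ok"

overfit = diagnose(train_error=0.02, validation_error=0.15, acceptable_error=0.05)
underfit = diagnose(train_error=0.20, validation_error=0.21, acceptable_error=0.05)
print(overfit, underfit)
```

Only in the variance case does collecting more data pay off; in the bias case a more expressive model or better features are needed first.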

# Model Complexity and Overfitting

<img src="img/msas-little-data.png">

# More data to the rescue?

<img src="img/msas-more-data.png">

# Underfitting or Overfitting?

<img src="img/msas-learning-curve.png">

# Challenges at Scale

 * Why is learning with more data harder?
   - **Paradox**: more data would justify more complex models, but computational constraints keep us from fitting them [1].
    - => We need more efficient ways to create complex models!

 * Need to account for the combined cost: model fitting + model selection / tuning
   - Smart hyperparameter tuning tries to decrease the number of model fits
   - We can also accomplish this with fewer hyperparameters [2]
   

[1] P. Domingos, *A few useful things to know about machine learning*, 2012

[2] Practitioners often favor algorithms with few hyperparameters such as RandomForest or [AveragedPerceptron](http://nlpers.blogspot.co.at/2014/10/hyperparameter-search-bayesian.html)
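The combined cost mentioned above grows multiplicatively with every hyperparameter. A small sketch that counts the model fits of an exhaustive grid search with k-fold cross-validation; both grids and their values are hypothetical and chosen only to show the growth.

```python
from math import prod

def grid_search_fits(param_grid, n_folds):
    """Number of model fits for exhaustive grid search with k-fold cross-validation."""
    n_candidates = prod(len(values) for values in param_grid.values())
    return n_candidates * n_folds

# Hypothetical grids, chosen only to illustrate the growth.
gbm_grid = {"learning_rate": [0.01, 0.05, 0.1],
            "max_depth": [3, 5, 7, 9],
            "n_trees": [100, 500, 1000],
            "subsample": [0.5, 0.8, 1.0]}
rf_grid = {"max_features": [0.2, 0.4, 0.6]}

gbm_fits = grid_search_fits(gbm_grid, n_folds=5)  # 3 * 4 * 3 * 3 * 5 = 540
rf_fits = grid_search_fits(rf_grid, n_folds=5)    # 3 * 5 = 15
print(gbm_fits, rf_fits)
```

When a single fit on the full data takes hours, the difference between 15 and 540 fits dominates the total cost.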

# Case-study: binary classification on 1TB of data

 * Criteo click-through data
 * Downsampled ad impression data covering 24 days
 * Fully anonymized dataset:
   - 1 target
   - 13 integer features
   - 26 hashed categorical features

 * Experiment setup:
  - Train on the data from days 0 to 22, test on the data from day 23
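The split is by time, not at random, so the model is always evaluated on data that lies after everything it was trained on. A minimal sketch, assuming each row carries a `day` field; the field names and the toy rows are hypothetical.

```python
def split_by_day(rows, test_day=23):
    """Train on all days before `test_day`, test on `test_day` itself."""
    train = [r for r in rows if r["day"] < test_day]
    test = [r for r in rows if r["day"] == test_day]
    return train, test

# Hypothetical stand-in for the impression log: one row per day for brevity.
rows = [{"day": d, "clicked": d % 2} for d in range(24)]
train, test = split_by_day(rows)
print(len(train), len(test))
```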


# Big Data?

Data size:
 * ~46GB/day
 * ~180,000,000 data points/day

However, the data is very imbalanced (even after downsampling non-events):
 * ~3.5% event rate

Further downsampling of non-events to a balanced dataset reduces the data size to ~70GB:
  * Will fit into a single node under “optimal” conditions
  * Loss of model accuracy is negligible in most situations

Assuming a 0.1% raw event (click-through) rate:
<img src="img/criteo-sampling.png">
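The numbers on this slide can be checked with a back-of-envelope calculation. The sketch assumes that every row takes roughly the same space, that all events are kept, and that the 3.5% event rate was reached by downsampling non-events from the assumed 0.1% raw rate; the function names are made up for this example.

```python
def non_event_keep_rate(raw_event_rate, sampled_event_rate):
    """Fraction of non-events to keep so that events make up `sampled_event_rate`."""
    events = raw_event_rate
    non_events = 1.0 - raw_event_rate
    # Solve events / (events + keep * non_events) = sampled_event_rate for keep.
    return events * (1.0 - sampled_event_rate) / (sampled_event_rate * non_events)

def balanced_size_gb(gb_per_day, n_days, event_rate):
    """Size after keeping all events plus an equal number of non-events."""
    return 2.0 * gb_per_day * n_days * event_rate

keep = non_event_keep_rate(raw_event_rate=0.001, sampled_event_rate=0.035)
size = balanced_size_gb(gb_per_day=46, n_days=24, event_rate=0.035)
print(f"keep {keep:.1%} of non-events; balanced set is about {size:.0f} GB")
```

With the rounded figures from the slide the balanced size comes out in the high 70s of GB, the same ballpark as the ~70GB quoted above.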

# Where to start?

 * 70GB (~260,000,000 data points) is still a lot of data
 * Let’s take a tiny slice of that to experiment with
  - Take 0.25%, then 0.5%, then 1%, and run a grid search on each
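One way to draw those slices is to make them nested, so that each larger sample contains the smaller ones and the learning curves stay comparable. This is an assumption of the sketch; the slides do not say how the samples were drawn, and the function name is made up.

```python
import random

def nested_subsamples(n_rows, fractions, seed=0):
    """Return one index list per fraction; each smaller sample is a subset of the next."""
    order = list(range(n_rows))
    random.Random(seed).shuffle(order)
    return {f: order[:round(n_rows * f)] for f in sorted(fractions)}

# 100,000 rows stand in for the ~260,000,000 of the balanced dataset.
samples = nested_subsamples(100_000, [0.0025, 0.005, 0.01])
sizes = {f: len(idx) for f, idx in samples.items()}
print(sizes)
```

At full scale the same fractions correspond to roughly 650,000, 1.3 million, and 2.6 million data points, which is small enough for a grid search on a single machine.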

<img src="img/criteo-where-to-start.png">

# GBM is the way to go, let’s go up to 10% data

<img src="img/criteo-gbm-1.png">

# A “Fairer” Way of Comparing Models

<img src="img/criteo-gbm-2.png">

# Can We Extrapolate?

<img src="img/criteo-gbm-3.png">

# Tree Depth vs Data Size

<img src="img/criteo-gbm-4.png">

# Open Research Topics in AML

* ____ at scale
  * Pipeline Optimization
  * Automated Feature Engineering
* Automated Leakage Detection
* Automated Partitioning/Cross-Validation
* Automated Metric Selection
* Automated Transfer Learning
* Automated Data Drift Detection
  - the world never stops changing...
* Automated Feedback Cycle Detection
  - ... and we never stop changing the world

For more, see [Rich Caruana's ICML 2015 talk on AutoML](https://indico.lal.in2p3.fr/event/2914/contributions/6481/attachments/6048/7173/CaruanaAutoMLWorkshopICML2015rev4.pdf)
