# Deep Interest Network (2018)

https://arxiv.org/pdf/1706.06978

In 2018 Alibaba, inspired by the YouTube paper, proposed its own approach to ad CTR prediction.

CTR prediction is not a recommendation task, but the setting is very similar to the ranking part of YoutubeDNN: score a candidate using the user's previous behavior.

This architecture is called DIN (Deep Interest Network).

Key ideas
- add a local activation unit: an attention-like mechanism that assigns a weight to each recent user action depending on the candidate currently being scored
- enhance item embeddings by encoding categorical features along with item IDs


## Reference architecture

The base architecture they had before is almost identical to YouTube's Ranking Network, with only a few changes:
- augmented item embeddings
- PReLU activation function instead of ReLU
- sum pooling instead of average pooling

<img src="img/deepinterest1.png" width=500>

### Augmented item embeddings
Combine item embeddings with categorical embeddings. 

They use three embeddings for each item (see the diagram above):
- trainable item embedding
- vendor embedding
- category embedding

After the dictionary lookups, they are concatenated into a single embedding.
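The lookup-and-concatenate step can be sketched as follows. This is a minimal illustration in NumPy, not the paper's code: the table sizes, the embedding width and the `embed_item` helper are all hypothetical, and the tables are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary sizes and embedding width (not taken from the paper).
N_ITEMS, N_VENDORS, N_CATEGORIES, DIM = 1000, 50, 20, 8

# One embedding dictionary per categorical feature.
item_table = rng.normal(size=(N_ITEMS, DIM))
vendor_table = rng.normal(size=(N_VENDORS, DIM))
category_table = rng.normal(size=(N_CATEGORIES, DIM))

def embed_item(item_id, vendor_id, category_id):
    """Look up the three embeddings and concatenate them into one vector."""
    return np.concatenate([
        item_table[item_id],
        vendor_table[vendor_id],
        category_table[category_id],
    ])

vec = embed_item(item_id=42, vendor_id=7, category_id=3)
print(vec.shape)  # (24,)
```

Items that share a vendor or a category share part of their representation, which helps rarely seen item IDs.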

### PReLU activation function

<img src="img/relu.png" width=150><img src="img/prelu.png" width=150>





## New architecture

They proposed several enhancements:
- add an attention-like mechanism (they call it an activation unit)
- use the Dice activation function
- mini-batch aware regularization

<img src="img/deepinterest2.png" width=750>

### 1. Local activation
They added an activation unit for each item in the user history. For each history item, it scores the item's relevance to the candidate item and scales the item's embedding by that score.

Local activation = "activation" of a user interest under particular circumstances (here, the current candidate).
It is not standard attention, since the weights are not normalized (no softmax, so they need not sum to 1), but the approach is very similar.

<img src="img/activation.png" width=750>
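A rough sketch of the idea in NumPy, under several assumptions: the scoring network is a tiny two-layer MLP with random, untrained weights, the sizes are hypothetical, and the interaction feature is an element-wise product chosen for brevity (the paper's exact feature set may differ). What it is meant to show is that the weights depend on the candidate and that the pooled vector is a weighted sum without softmax.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, HIDDEN = 8, 16  # hypothetical sizes

# Hypothetical, randomly initialised weights of the small scoring MLP.
W1 = rng.normal(scale=0.1, size=(3 * DIM, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(scale=0.1, size=(HIDDEN, 1))
b2 = np.zeros(1)

def activation_weights(history, candidate):
    """Score each history item against the candidate; returns shape (T,)."""
    cand = np.broadcast_to(candidate, history.shape)
    # Assumption: element-wise product as the history/candidate interaction.
    features = np.concatenate([history, cand, history * cand], axis=1)
    hidden = np.maximum(features @ W1 + b1, 0.0)  # ReLU, for brevity
    return (hidden @ W2 + b2).ravel()  # raw scores, deliberately no softmax

def user_interest(history, candidate):
    """Weighted sum pooling of the history embeddings."""
    w = activation_weights(history, candidate)
    return (w[:, None] * history).sum(axis=0)

history = rng.normal(size=(5, DIM))   # 5 past behaviours
cand_a = rng.normal(size=DIM)
cand_b = rng.normal(size=DIM)

u_a = user_interest(history, cand_a)
u_b = user_interest(history, cand_b)
print(u_a.shape, np.allclose(u_a, u_b))
```

The same history produces a different user vector for each candidate, which is the difference from plain sum pooling in the base model.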

### 2. Data-adaptive activation function

The previously used PReLU is a piecewise linear activation with one parameter (the slope for negative inputs)
<img src="img/prelu.png" width=150>

They made the switch point adapt to the input distribution (it follows the mean of the inputs) and made the transition smooth, and called this activation "Dice"

<img src="img/dice_formula.png" width=300>

The control function (which decides where the mode switches) now looks like this:

<img src="img/dice.png" width=300>
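For comparison, here is a small NumPy sketch of both activations as I read the formulas. `alpha` is the learned slope, fixed here to a constant for illustration. The mean and variance are taken over the mini-batch, which matches training; the paper uses moving averages at test time, which this sketch leaves out.

```python
import numpy as np

def prelu(s, alpha):
    """PReLU: identity for positive inputs, slope alpha for negative ones."""
    return np.where(s > 0, s, alpha * s)

def dice(s, alpha, eps=1e-8):
    """Dice over a mini-batch; s has shape (batch, units)."""
    mean = s.mean(axis=0, keepdims=True)
    var = s.var(axis=0, keepdims=True)
    # Smooth gate centred on the batch mean instead of a hard switch at 0.
    p = 1.0 / (1.0 + np.exp(-(s - mean) / np.sqrt(var + eps)))
    return p * s + (1.0 - p) * alpha * s

s = np.array([[-2.0], [0.0], [2.0]])
print(prelu(s, 0.25).ravel())
print(dice(s, 0.25).ravel())
```

When the batch mean is 0 and the variance shrinks, the gate approaches a step function and Dice approaches PReLU.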

### 3. Regularization

Multipart embeddings require regularization. Here is a good illustration of why (green line = no regularization):
<img src="img/reg_dynamic.png" width=900>

Note that each dictionary is regularized separately, not the whole network at once.

__Tried regularization methods__
- Dropout<br>randomly discard 50% of feature ids in each sample<br><br>
- Filter<br>keep only the most frequent items (top 20 million)<br><br>
- DiFacto<br>regularize frequent features less<br><br>
- Mini-Batch Aware regularization, the new method (λ = 0.01)<br><br>

#### Mini-Batch Aware regularization
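MBA = sampled regularization: only the embeddings of feature values that actually occur in the current mini-batch are penalized.

Let $W$ be the dictionary weights for some categorical feature and $K$ the number of distinct feature values (one embedding vector per value). The L2 penalty sums the squared norms of all embedding vectors in the dictionary: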

<img src="img/reg_mba1.png" width=350>

If we split the data into $B$ mini-batches:
<img src="img/reg_mba2.png" width=300>


If $\alpha_{jm}$ indicates whether feature value $j$ is present in at least one example of mini-batch $m$, then the penalty can be approximated as

<img src="img/reg_mba3.png" width=200>

<img src="img/minibatch_reg.png" width=350>
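A minimal NumPy sketch of the regularization part of the update, as I understand it. The function name and argument names are hypothetical, and the constant factor 2 from differentiating the squared norm is folded into `lam`. Only rows whose feature value occurs in the mini-batch get a non-zero gradient, scaled down by how often the value occurs in the whole dataset.

```python
import numpy as np

def mba_penalty_grad(W, batch_ids, counts, lam=0.01):
    """Gradient of the mini-batch aware L2 term w.r.t. the embedding table.

    W         : (n_values, dim) embedding table, one row per feature value
    batch_ids : 1-D int array of feature ids seen in this mini-batch
    counts    : (n_values,) occurrences n_j of each id in the whole dataset
    """
    grad = np.zeros_like(W, dtype=float)
    present = np.unique(batch_ids)  # ids with alpha_jm = 1
    # Frequent ids (large n_j) are penalised less per step.
    grad[present] = lam * W[present] / counts[present][:, None]
    return grad

W = np.ones((4, 2))
counts = np.array([10, 1, 5, 2])
batch_ids = np.array([0, 2, 2])

grad = mba_penalty_grad(W, batch_ids, counts)
print(grad)
```

Because untouched rows are skipped, the cost of the penalty scales with the mini-batch rather than with the size of the dictionary.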

Comparison of different regularization methods:

<img src="img/reg.png" width=500>

# Evaluation

## Datasets

They tested their approach on three datasets (two public and one private):
- Amazon - item reviews
- MovieLens - movie ratings
- Alibaba - ad serving logs

Amazon and MovieLens were not built for CTR prediction, so they had to be recast as binary prediction tasks that mimic it:
- Amazon = predict the next reviewed item from the previously reviewed items
- MovieLens = predict a "good" rating from the previous ratings

<img src="img/datasets.png" width=500>

## Metrics

They used two main metrics:
- weighted AUC<br>AUC = how well the predicted CTR score separates the two classes<br>weighted AUC = per-user AUC averaged with weights proportional to user activity<br><img src="img/weighted_auc.png" width=200>
- Relative AUC<br>uplift over the baseline (YouTube-like) model, measured after shifting AUC so that 0.5 (random guessing) counts as zero<br><img src="img/relauc.png" width=350>
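
Both metrics can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the paper's evaluation code: the function names are hypothetical, user activity is taken to be the number of impressions per user, and users whose labels are all one class are skipped because AUC is undefined for them.

```python
import numpy as np

def auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    # Ties count as half a correctly ordered pair.
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

def weighted_auc(user_ids, labels, scores):
    """Per-user AUC averaged with weights equal to each user's impressions."""
    user_ids, labels, scores = map(np.asarray, (user_ids, labels, scores))
    total, weight = 0.0, 0
    for u in np.unique(user_ids):
        mask = user_ids == u
        y = labels[mask]
        if y.min() == y.max():  # assumption: skip single-class users
            continue
        n = mask.sum()
        total += n * auc(y, scores[mask])
        weight += n
    return total / weight

def rela_impr(auc_model, auc_base):
    """Relative improvement over the baseline, in percent."""
    return ((auc_model - 0.5) / (auc_base - 0.5) - 1.0) * 100.0

user_ids = [1, 1, 1, 2, 2]
labels = [1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.8, 0.3, 0.6]
print(weighted_auc(user_ids, labels, scores))
print(rela_impr(0.6, 0.55))
```

In the toy data, user 1 is ranked perfectly and user 2 is ranked backwards, so the result is pulled toward the user with more impressions.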



## Model comparison

Alternative models
- Logistic Regression
- BaseModel = YouTube-like NN
- Wide'n'Deep<br>SOTA architecture that combines a memorization part (one-hot encoded feature interactions) with a generalization network (embeddings followed by an MLP)
- PNN<br>Product neural network
- DeepFM<br>combination of FM features and MLP processing

Evaluation on open datasets

<img src="img/models2.png" width=500>

Evaluation on the private dataset

<img src="img/models1.png" width=500>

