# DRN: A Deep Reinforcement Learning Framework for News Recommendation
[Link](http://www.personal.psu.edu/~gjz5038/paper/www2018_reinforceRec/www2018_reinforceRec.pdf)

## Abstract

### &#160;&#160;&#160;&#160; Issues:
- existing methods only model the current reward (e.g., Click Through Rate (CTR))
- very few studies use user feedback other than click / no click labels (e.g., how frequently a user returns) to improve recommendation
- existing methods tend to keep recommending similar news to users, which may cause users to get bored.
### Solutions
- a Deep Q-Learning based recommendation framework can model future reward explicitly
- consider user return pattern as a supplement to click / no click label in order to capture more user feedback information
- an effective exploration strategy is incorporated to find new attractive news for users.
### Applications:
- Offline datasets
- online production environment of a commercial news recommendation application
### Pros
- experiments in both settings show the superior performance of the proposed method

## 1. INTRODUCTION
### &#160;&#160;&#160;&#160; Problem
- explosive growth of online content and services has provided tons of choices for users
- personalized online content recommendation is necessary to improve user experience.
### Existing methods:
- content-based
- collaborative filtering
- hybrid methods
- deep learning models (state-of-the-art)
### Challenges:
- dynamic changes in news recommendations are difficult to handle: 
    - news becomes outdated very fast $\Rightarrow$ news features and the news candidate set change rapidly.
    - users’ interest in different news might evolve over time $\Rightarrow$ the model must be updated periodically
    - Existing methods can only optimize the current reward, and hence ignore the effect the current recommendation might have on the future.
- current recommendation methods usually only consider the click / no click labels or ratings as users’ feedback
- tendency to keep recommending similar items to users, which might decrease users’ interest in similar topics
    - $\epsilon$-greedy exploration may recommend totally unrelated items to the user
    - Upper Confidence Bound (UCB) cannot get a relatively accurate reward estimate for an item until the item has been tried several times.

### &#160;&#160;&#160;&#160; Proposal:
- use Deep Q-Learning to better model the dynamic nature of news characteristics and user preference. DQN structure can easily scale up
- consider user return as another form of user feedback, by maintaining an activeness score for each user that draws on **multiple historical return intervals**, so the model can estimate user activeness at any time.
- apply a Dueling Bandit Gradient Descent (DBGD) method for exploration, by choosing random item candidates in the neighborhood of the current recommender


### &#160;&#160;&#160;&#160;  The system
![](https://imgur.com/QCprUl0.png)
- user pool and news pool make up the **environment**, 
- recommendation algorithms play the role of **agent** 
- **state** is defined as *feature* representation for *users* 
- **action** is defined as *feature* representation for *news*.
- **reward** is composed of *click labels* and estimation of *user activeness*.

#### **Work flow**:
1. A user requests news,
2. A state representation (*users*) and a set of action representations (*items*) are passed to the agent,
3. The agent selects the best action (i.e., recommending a list of news to the user),
4. The agent fetches user feedback as reward,
5. The agent stores the recommendation and feedback log in the memory,
6. The agent uses the log in the memory to update its recommendation algorithm **every 1 hour**.

### &#160;&#160;&#160;&#160; Contribution:
- the DQN-based framework handles both immediate and future reward and can be generalized to many other recommendation problems
- consider user activeness to help improve recommendation accuracy,
- apply a more effective exploration method Dueling Bandit Gradient Descent
- deploy model online in a commercial news recommendation application

## 2. RELATED WORKS
### 2.1 News recommendation algorithms
- **Content-based** methods maintain news term frequency features and user profiles, and select news that is more similar to the user profile.
- **Collaborative filtering** methods make rating predictions, utilizing the past ratings of the current user, of similar users, or a combination of the two.
- **Hybrid** methods are proposed to improve the user profile modeling
- **Deep learning** models have shown much superior performance due to capability of modeling complex user-item relationship

*The DQN-based method focuses on the dynamic nature of online news recommendation and on modeling future reward; the feature construction and user-item modeling techniques above can be* **easily integrated** *into it.*

### 2.2 Reinforcement learning in recommendation
#### 2.2.1 Contextual Multi-Armed Bandit models (MAB)
- The context contains user and item features
- Expected reward is a linear function of the context
- Some works combine **bandit** with **clustering-based collaborative filtering** and **matrix factorization** to *model* more **complex user and item relationships** and **utilize** the **social network relationship** in determining the reward function.

*DQN-based method* **applies Markov Decision Process**, *and is able to explicitly model future rewards*

#### 2.2.2 Markov Decision Process models
- **MDP-based** methods can not only **capture the reward of current iteration**, but also the **potential reward in the future iterations**.
- These methods model an item or an n-gram of items as *state* and the transition as *action* $\Rightarrow$:
    - giant state space size $\Rightarrow$ can't scale up
    - sparsity of action space $\Rightarrow$ hard to train

*DQN uses continuous state and action representation*

## 3. PROBLEM DEFINITION

**When a user $u$ sends a news request to the recommendation agent $G$ at time $t$, given a candidate set $I$ of news, our algorithm is going to select a list $L$ of top-k appropriate news for this user.**

## 4. METHOD
### 4.1 Model framework
![](https://imgur.com/y4wfV6q.png)
#### Offline stage:
- Extract features from news and users
- use DQN to predict reward
- train DQN, using **offline user-news click logs**

#### Online stage:
The agent interacts with users and updates the network:

1. **PUSH**: a user sends a request; $G$ takes the feature representations of the user and the news candidates and generates a top-k list, combining exploitation of the current model with exploration of novel items.
2. **FEEDBACK**: the user gives feedback by clicking.
3. **MINOR UPDATE**: after each timestamp, the agent uses the *news list* and *feedback* to compare the *exploitation network* with the *exploration network* and updates the model.
4. **MAJOR UPDATE**: after one hour, the agent updates the network by replaying the memory.
5. **REPEAT** (1-4) until termination
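
The five steps above can be sketched as a small event loop. This is only an illustration of how minor and major updates are scheduled: the class `ToyAgent`, the alphabetical placeholder ranking, and the click function are hypothetical and do not come from the paper.

```python
from collections import deque

class ToyAgent:
    """Hypothetical stand-in for agent G; ranking and updates are placeholders."""

    def __init__(self, k=2, major_period=3600):
        self.k = k
        self.major_period = major_period    # seconds between MAJOR UPDATEs
        self.memory = deque(maxlen=10_000)  # log of (time, list, feedback)
        self.last_major = 0
        self.minor_updates = 0
        self.major_updates = 0

    def push(self, candidates):
        # PUSH: placeholder ranking; a real agent would score candidates with
        # the exploitation network and mix in exploration items.
        return sorted(candidates)[: self.k]

    def handle_request(self, now, candidates, click_fn):
        rec_list = self.push(candidates)
        feedback = [click_fn(item) for item in rec_list]  # FEEDBACK
        self.memory.append((now, rec_list, feedback))
        self.minor_updates += 1                           # MINOR UPDATE (stub)
        if now - self.last_major >= self.major_period:    # MAJOR UPDATE (stub)
            self.major_updates += 1  # a real agent would replay self.memory here
            self.last_major = now
        return rec_list, feedback

agent = ToyAgent()
for t in (0, 1800, 3600, 5400, 7200):  # REPEAT: one request every 30 minutes
    agent.handle_request(t, ["n3", "n1", "n2"], click_fn=lambda item: int(item == "n1"))
print(agent.minor_updates, agent.major_updates)  # 5 2
```

Keeping the cheap per-request update separate from the periodic replay is the point of the sketch; the actual update rules are described in Sections 4.3 and 4.5.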

### 4.2 Feature construction
- **News features**: 417-dimensional one-hot features that describe whether a certain property appears in this piece of news, including
    - headline
    - provider
    - ranking
    - entity name
    - category
    - topic category
    - click counts in the last 1 hour, 6 hours, 24 hours, 1 week, and 1 year.
- **User features**: 413 × 5 = 2065 dimensions, describing the features of the news the user clicked in the last 1 hour, 6 hours, 24 hours, 1 week, and 1 year:
    - headline
    - provider
    - ranking
    - entity name
    - category
    - topic category
    - total click count for each time granularity
    
- **User news features**: 25-dimensional, describing the interaction between the user and one particular piece of news, i.e., how frequently each of the following appears in the user’s reading history:
    - entity
    - category
    - topic category
    - provider
- **Context features**: 32-dimensional, describe the context when a news request happens, including time, weekday, and the freshness of the news
- **Textual features**: not used
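
To make the dimensions concrete, the sketch below concatenates the four feature groups into the state and action vectors that Section 4.3 defines (state = user + context, action = news + user-news). The helper names `build_state` and `build_action` are hypothetical, and plain Python lists stand in for real feature vectors.

```python
NEWS_DIM = 417        # news features
USER_DIM = 413 * 5    # user features over five time windows
USER_NEWS_DIM = 25    # user-news interaction features
CONTEXT_DIM = 32      # context features

def build_state(user_feat, context_feat):
    """State s: user features followed by context features."""
    assert len(user_feat) == USER_DIM and len(context_feat) == CONTEXT_DIM
    return list(user_feat) + list(context_feat)

def build_action(news_feat, user_news_feat):
    """Action a: news features followed by user-news interaction features."""
    assert len(news_feat) == NEWS_DIM and len(user_news_feat) == USER_NEWS_DIM
    return list(news_feat) + list(user_news_feat)

state = build_state([0.0] * USER_DIM, [0.0] * CONTEXT_DIM)
action = build_action([0.0] * NEWS_DIM, [0.0] * USER_NEWS_DIM)
print(len(state), len(action))  # 2097 442
```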

### 4.3 Deep Reinforcement Recommendation

- Total reward:
$$y_{s,a}=Q(s,a)=r_{\text{immediate}}+\gamma r_{\text{future}}\tag{1}$$
- state $s$ = {context & user features}
- action $a$ = {news & user-news interaction features}
- $r_{\text{immediate}}$ = rewards for current situation
- $r_{\text{future}}$ = projection of future rewards
- Double DQN (DDQN) predicts the total reward:
$$
y_{s,a,t}=r_{a,t+1}+\gamma Q\left(s_{a,t+1},\arg\max_{a'}Q\left(s_{a,t+1},a';W_t\right);W'_t\right)
\tag{2}
$$
- Every few iterations, $W_t$ and $W'_t$ are switched
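
A minimal sketch of the target in Equation 2, assuming the two networks are available as plain callables: `q_online` stands for $Q(\cdot;W_t)$ and picks the next action, while `q_target` stands for $Q(\cdot;W'_t)$ and scores that pick. Both names, and the dictionary-backed toy Q-functions, are illustrative assumptions rather than the paper's code.

```python
def ddqn_target(reward, next_state, next_actions, q_online, q_target, gamma=0.4):
    """Double DQN target: select with the online net, evaluate with the target net."""
    if not next_actions:  # no candidates left, so no future term
        return reward
    best = max(next_actions, key=lambda a: q_online(next_state, a))
    return reward + gamma * q_target(next_state, best)

# Toy Q-functions backed by lookup tables (hypothetical values).
online_table = {"a": 1.0, "b": 2.0}
target_table = {"a": 5.0, "b": 3.0}
y = ddqn_target(
    reward=1.0,
    next_state="s1",
    next_actions=["a", "b"],
    q_online=lambda s, a: online_table[a],
    q_target=lambda s, a: target_table[a],
)
print(round(y, 3))  # 2.2
```

In this toy case the online table prefers `"b"`, so the target uses the target table's value for `"b"` (3.0) instead of the target table's own maximum (5.0). Decoupling selection from evaluation in this way is what distinguishes Double DQN from taking a single max.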

#### Q-network
![](https://imgur.com/7yqDKjQ.png)

### 4.4 User Activeness
Use **survival models** to model user return and user activeness. Suppose $T$ is the time until next event happens then the **hazard function** (i.e., instantaneous rate for the event to happen) can be defined as:
$$
\lambda(t)=\lim_{dt\rightarrow 0}\frac{\text{Pr}\left[t\leq T<t+dt\mid T\geq t\right]}{dt}
\tag{3}
$$
the probability for the event to happen after $t$ can be defined as:
$$ S(t)=\exp \left(-\int_0^t \lambda(x)dx\right) \tag{4}$$
and the expected life span $T_0$ can be calculated as:
$$T_0=\int_0^\infty S(t)dt
\tag{5}$$
In our problem, we simply set $\lambda(t)=\lambda_0$, which means each user has a constant probability to return. Every time we detect a return of the user, we set $S(t)=S(t)+S_a$ for this particular user. The user activeness score will not exceed 1.

Parameters $S_0$ (the initial activeness), $S_a$, $\lambda_0$, and $T_0$ are determined according to the real user pattern in the dataset.
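
As a quick check of what the constant hazard implies, Equations 4 and 5 can be evaluated in closed form. This worked step assumes the standard survival-analysis form of Equation 4, with a negative exponent:

$$
S(t)=\exp\left(-\int_0^t \lambda_0\,dx\right)=e^{-\lambda_0 t},
\qquad
T_0=\int_0^\infty e^{-\lambda_0 t}\,dt=\frac{1}{\lambda_0}
$$

Under this assumption the activeness decays exponentially between returns, and $\lambda_0$ and $T_0$ are tied together rather than chosen independently.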

The click / no click label $r_{\text{click}}$ and the user activeness $r_{\text{active}}$ are combined as:

$$
r_\text{total} = r_\text{click} +\beta\cdot r_\text{active}
\tag{6}$$
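
The activeness bookkeeping and Equation 6 can be sketched as follows. The function names are hypothetical, decay is assumed to be exponential from the current level (constant hazard), and the numeric values for $S_0$, $S_a$, and $\lambda_0$ are illustrative assumptions, not values stated in these notes. The default $\beta$ is taken from the settings table in Section 5.3.

```python
import math

def decayed_activeness(s_prev, elapsed_seconds, lam0):
    """Activeness after constant-hazard decay for `elapsed_seconds`."""
    return s_prev * math.exp(-lam0 * elapsed_seconds)

def on_return(s_now, s_a):
    """Add S_a when a return is detected; the score never exceeds 1."""
    return min(1.0, s_now + s_a)

def total_reward(r_click, r_active, beta=0.05):
    """Equation 6: click label plus weighted activeness."""
    return r_click + beta * r_active

# Illustrative parameters only (assumptions); LAM0 is in 1/second.
S0, SA, LAM0 = 0.5, 0.32, 1.2e-5
s = decayed_activeness(S0, 24 * 3600, LAM0)  # one day without a return
s = on_return(s, SA)                         # the user comes back
r = total_reward(r_click=1, r_active=s)
```

With these illustrative numbers, a user who returns once a day ends up close to the starting activeness again, which is one way to sanity-check a parameter choice.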



### 4.5 Explore
- $\epsilon$-greedy randomly recommends new items with a probability of $\epsilon$
- UCB favors items that have not been explored many times
- Dueling Bandit Gradient Descent algorithm is used in this paper

![](https://imgur.com/kCJjG44.png)

The agent $G$ generates a recommendation list $L$ using the current network $Q$ and another list $\tilde{L}$ using an explore network $\tilde{Q}$. The parameters $\tilde{W}$ of network $\tilde{Q}$ are obtained by adding a small disturbance $\Delta W$ to the parameters $W$ of the current network $Q$:
$$\Delta W =\alpha\cdot\text{rand}(-1,1)\cdot W \tag{7}$$
Then the agent $G$ does a **probabilistic interleave** to generate the merged recommendation list $\hat{L}$ from $L$ and $\tilde{L}$. To determine the item for each position in $\hat{L}$, the probabilistic interleave approach first randomly selects between the lists $L$ and $\tilde{L}$.

Suppose $L$ is selected; then an item $i \in L$ is put into $\hat{L}$ with a probability determined by its ranking in $L$. The list $\hat{L}$ is then recommended to user $u$, and agent $G$ obtains the feedback $B$. If the items recommended by the explore network $\tilde{Q}$ receive better feedback, the agent $G$ updates the network $Q$ towards $\tilde{Q}$, with the parameters of the network being updated as:
$$
W' = W +\eta \tilde{W}\tag{8}$$
Otherwise, the agent $G$ keeps network $Q$ unchanged. Through this kind of exploration, the agent can explore more effectively without losing too much recommendation accuracy.
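
The exploration step can be sketched in three small functions. Several details are assumptions made for the sake of a runnable example: the random factor in Equation 7 is drawn per weight, the rank-based sampling uses a 1/rank weighting, and "better feedback" is read as "more clicks on explore-sourced items". None of these choices is confirmed by the notes above.

```python
import random

def perturb(weights, alpha, rng):
    """Equation 7: W~ = W + alpha * rand(-1, 1) * W (per-weight draw assumed)."""
    return [w + alpha * rng.uniform(-1.0, 1.0) * w for w in weights]

def probabilistic_interleave(list_q, list_explore, k, rng):
    """Merge two ranked lists; returns (merged items, source tag per item)."""
    pools = {"exploit": list(list_q), "explore": list(list_explore)}
    merged, sources = [], []
    while len(merged) < k and (pools["exploit"] or pools["explore"]):
        src = rng.choice(["exploit", "explore"])
        if not pools[src]:  # fall back to the other list if this one is empty
            src = "explore" if src == "exploit" else "exploit"
        pool = pools[src]
        # Higher-ranked items are more likely; 1/rank weighting is an assumption.
        weights = [1.0 / (rank + 1) for rank in range(len(pool))]
        item = rng.choices(pool, weights=weights, k=1)[0]
        merged.append(item)
        sources.append(src)
        for p in pools.values():  # never recommend the same item twice
            if item in p:
                p.remove(item)
    return merged, sources

def dbgd_update(weights, explore_weights, clicks, sources, eta):
    """Apply Equation 8 only when explore-sourced items got more clicks."""
    explore_clicks = sum(c for c, s in zip(clicks, sources) if s == "explore")
    exploit_clicks = sum(c for c, s in zip(clicks, sources) if s == "exploit")
    if explore_clicks > exploit_clicks:
        return [w + eta * wt for w, wt in zip(weights, explore_weights)]
    return list(weights)

rng = random.Random(0)
w = [1.0, -2.0]
w_tilde = perturb(w, alpha=0.1, rng=rng)
merged, sources = probabilistic_interleave(["a", "b", "c"], ["c", "d", "e"], 3, rng)
```

Because the explore network is only a small perturbation of the current one, the two lists tend to overlap, which is why the merge removes a chosen item from both pools.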

## 5 EXPERIMENT
### 5.1. Dataset: 
- crawled: 

| Stage         | Duration | # of users | # of news |
|---------------|:--------:|------------|-----------|
| Offline stage | 6 months |    541,337 | 1,355,344 |
| Online stage  | 1 month  |     64,610 |   157,088 |

### 5.2 Evaluation measures
- Click through rate (CTR)
$$
\text{CTR}=\frac{\text{number of clicked items}}{\text{number of total items}}
\tag{9}$$

- Precision@k
$$
\text{Precision@k}=\frac{\text{number of clicks in top-k recommended items}}{k}
\tag{10}$$

- Normalized Discounted Cumulative Gain (nDCG)
$$
DCG(f)=\sum_{r=1}^{n}{y^f_r D(r)}
\tag{11}$$

with

$$
D(r)=\frac{1}{\log(1+r)}
\tag{12}$$
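
The three measures can be computed from a ranked list of 0/1 click labels, as in the sketch below. Two assumptions are made here: the logarithm in Equation 12 is taken as the natural log, and nDCG is obtained by dividing by the DCG of the ideally sorted list, which is the usual normalization but is not spelled out above.

```python
import math

def ctr(clicks):
    """Equation 9: clicked items over total items; `clicks` holds 0/1 labels."""
    return sum(clicks) / len(clicks) if clicks else 0.0

def precision_at_k(clicks, k):
    """Equation 10: clicks among the top-k recommended items, divided by k."""
    return sum(clicks[:k]) / k

def dcg(clicks):
    """Equations 11-12 with rank r starting at 1."""
    return sum(y / math.log(1 + r) for r, y in enumerate(clicks, start=1))

def ndcg(clicks):
    """DCG divided by the DCG of the best possible ordering (assumed)."""
    ideal = dcg(sorted(clicks, reverse=True))
    return dcg(clicks) / ideal if ideal > 0 else 0.0

clicks = [1, 0, 1, 0]  # ranked list: positions 1 and 3 were clicked
print(ctr(clicks), precision_at_k(clicks, 2))  # 0.5 0.5
```

The log base cancels in the nDCG ratio, so that assumption only affects raw DCG values.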

### 5.3 Experiment setting
| Parameter | Setting |
|-----------|:-------:|
| Future reward discount $\gamma$ (Equation 1) | 0.4 |
| User activeness coefficient $\beta$ (Equation 6) | 0.05 |
| Explore coefficient $\alpha$ (Equation 7) | 0.1 |
| Exploit coefficient $\eta$ (Equation 8) | 0.05 |
| Major update period $T_R$ (for DQN experience replay) | 60 minutes |
| Minor update period $T_D$ (for DBGD) | 30 minutes |

### 5.4 Compared methods

- *LR*: Logistic Regression
- *FM*: Factorization Machines
- *W&D*: Wide & Deep - Deep learning model
- *LinUCB*: Linear Upper Confidence Bound
- *HLinUCB*: Hidden Linear Upper Confidence Bound

### 5.5 Offline evaluation
#### 5.5.1 Accuracy
![](https://imgur.com/3zfa9wb.png)
#### 5.5.2 Model convergence process
The algorithm in this paper (*DDQN + U + DBGD*) converges to a better CTR faster than the other methods.
![](https://imgur.com/WpKlatp.png)

### 5.6 Online evaluation
#### 5.6.1 Accuracy.
![](https://imgur.com/yHtcHzC)
#### 5.6.2 Recommendation diversity

$$
ILS(L)=\frac{\sum_{b_i \in L}\sum_{b_j \in L,b_j\neq b_i}S(b_i,b_j)}{\sum_{b_i \in L}\sum_{b_j \in L,b_j\neq b_i}\mathbb{1}}
\tag{13}$$

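$S(b_i,b_j)$ is the cosine similarity between items $b_i$ and $b_j$. ILS (intra-list similarity) averages it over all pairs of distinct items in the list $L$, so a lower ILS indicates a more diverse list.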
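
Equation 13 is a mean over ordered pairs of distinct items, which the sketch below computes directly. Items are assumed to be plain feature vectors given as lists of floats, and the function names are illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def ils(items):
    """Equation 13: average pairwise cosine similarity over pairs with i != j."""
    n = len(items)
    if n < 2:
        return 0.0
    total = sum(
        cosine(items[i], items[j]) for i in range(n) for j in range(n) if i != j
    )
    return total / (n * (n - 1))

print(ils([[1.0, 0.0], [0.0, 1.0]]))  # 0.0
print(ils([[1.0, 0.0], [1.0, 0.0]]))  # 1.0
```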

![](https://imgur.com/byfmw42.png)

## 6 CONCLUSION

In this paper, we propose a DQN-based reinforcement learning framework to do online personalized news recommendation. Different from previous methods, our method can effectively model the dynamic news features and user preferences, and plan for the future explicitly, in order to achieve higher reward (e.g., CTR) in the long run. We further consider user return pattern as a supplement to click / no click label in order to capture more user feedback information. In addition, we apply an effective exploration strategy in our framework to improve the recommendation diversity and look for potentially more rewarding recommendations. Experiments have shown that our method can improve the recommendation accuracy and recommendation diversity significantly. Our method can be generalized to many other recommendation problems.