# Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations - Google 2019
## ABSTRACT
Many recommendation systems retrieve and score items from a very large corpus.

A common recipe for handling data sparsity and power-law item distributions is to learn item representations from their content features.

Apart from many content-aware systems based on matrix factorization, we consider a modeling framework using a two-tower neural net, with one of the towers (the item tower) encoding a wide variety of item content features.

A general recipe for training such two-tower models is to optimize loss functions calculated from in-batch negatives, which are items sampled from a random mini-batch.

However, in-batch loss is subject to sampling biases, potentially hurting model performance, particularly when the item distribution is highly skewed.

In this paper, we present a novel algorithm for estimating item frequency from streaming data.

Through theoretical analysis and simulation, we show that the proposed algorithm works without requiring a fixed item vocabulary, produces unbiased estimates, and adapts to changes in the item distribution.

We then apply the sampling-bias-corrected modeling approach to build a large-scale neural retrieval system for YouTube recommendations.

The system is deployed to retrieve personalized suggestions from a corpus with tens of millions of videos.

We demonstrate the effectiveness of sampling-bias correction through offline experiments on two real-world datasets.

We also conduct live A/B tests to show that the neural retrieval system leads to improved recommendation quality for YouTube.





## INTRODUCTION

Recommendation systems help users discover content of interest across many internet services, including video recommendations [12, 18], app suggestions [9], and online advertisement targeting [38]. In many cases, these systems connect billions of users to items from an extremely large corpus of content, often on the scale of millions to billions, under stringent latency requirements. A common practice is to treat recommendation as a retrieval-and-ranking problem and design a two-phase system [9, 12]. That is, a scalable retrieval model first retrieves a small fraction of related items from a large corpus, and a full-blown ranking model re-ranks the retrieved items based on one or multiple objectives such as clicks or user ratings. In this work, we focus on building a real-world learned retrieval system for personalized recommendation that scales up to millions of items.

Given a triplet of {user,context,item}, a common solution to build a scalable retrieval model is: 
* 1) learn query and item representations for {user,context} and {item} respectively; and 
* 2) use a simple scoring function (e.g., dot product) between query and item representations to get recommendations tailored for the query. 
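The two steps above can be illustrated with a small sketch. It assumes the query and item embeddings have already been produced by some learned model; the embedding values, the item ids, and the helper name `retrieve_top_k` are hypothetical and chosen only for illustration. Scoring is a plain dot product, as in step 2.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product between a query embedding and an item embedding."""
    return sum(a * b for a, b in zip(u, v))


def retrieve_top_k(
    query_emb: Vector, item_embs: Dict[str, Vector], k: int
) -> List[Tuple[str, float]]:
    """Score every item against the query and return the k best (id, score) pairs."""
    scored = [(item_id, dot(query_emb, emb)) for item_id, emb in item_embs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical 2-dimensional embeddings, for illustration only.
query = [0.5, 1.0]  # stands in for the encoded {user, context}
items = {
    "video_a": [1.0, 0.0],
    "video_b": [0.0, 1.0],
    "video_c": [1.0, 1.0],
}

top = retrieve_top_k(query, items, k=2)
# video_c scores 1.5 and video_b scores 1.0, so they are returned in that order.
```

This brute-force loop is linear in the corpus size. A system serving millions of items would more plausibly hand the dot-product search to an approximate nearest-neighbor index, which is outside the scope of this sketch.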

Context often represents variables of a dynamic nature, such as the time of day and the device a user is on. The representation learning problem is typically challenging in two ways: 
* 1) The corpus of items could be extremely large for many industrial-scale applications; 
* 2) Training data collected from users’ feedback is very sparse for most items, and thus causes model predictions to have large variance for long-tail content. Facing the well-reported cold-start problem, real-world systems need to be adaptive to data distribution change to better surface fresh content.

Inspired by the Netflix prize [32], matrix factorization (MF) based modeling has been widely adopted for learning query and item latent factors in building retrieval systems. 

Under the MF framework, a body of recommendation research (e.g., [21, 34]) addresses the aforementioned challenges in learning from a large corpus. The common idea is to leverage the content features of query and item. 

Content features can be roughly defined as a wide variety of features describing items beyond item id. For example, content features of a video can be the visual and audio features extracted from video frames. MF-based models are usually only capable of capturing second-order interactions of features, and thus have limited power in representing a collection of features with various formats.

In recent years, motivated by the success of deep learning in computer vision and natural language processing, there has been a large amount of work applying deep neural networks (DNNs) to recommendations. Deep representations are well suited for encoding complicated user states and item content features in a low-dimensional embedding space. 

### In this paper, we explore the applications of two-tower DNNs in building retrieval models. 

Figure 1 provides an illustration of the two-tower model architecture, where the left and right towers encode {user,context} and {item} respectively. The two-tower DNN generalizes the multi-class classification neural network [19], a multi-layer perceptron (MLP) model in which the right tower of Figure 1 is simplified to a single layer with item embeddings. As a result, the two-tower architecture can model the situation where labels have structure or content features. The MLP model is commonly trained with many sampled negatives from a fixed vocabulary of items. 

In contrast, with a deep item tower, it is typically inefficient to sample and train on many negatives, because every item embedding must be computed from item content features through shared network parameters.

We consider batch softmax optimization, where item probability is calculated over all items in a random batch, as a general recipe of training two-tower DNNs. 

However, as shown in our experiments, batch softmax is subject to sampling bias and, without correction, can severely restrict model performance. Importance sampling and the corresponding bias reduction have been studied for the MLP model [4, 5]. Inspired by these works, we propose to correct the sampling bias of batch softmax using estimated item frequency. 
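To make the idea concrete, here is a minimal sketch of a batch softmax loss with an optional frequency correction. The form of the correction, subtracting `log(p_j)` from the logit of each in-batch item `j`, is assumed here; it follows the usual log-probability correction from the sampled softmax literature, and this section does not spell the formula out. The function name, the toy embeddings, and the probability values are hypothetical.

```python
import math
from typing import List, Optional

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def batch_softmax_loss(
    query_embs: List[Vector],
    item_embs: List[Vector],
    item_probs: Optional[List[float]] = None,
) -> float:
    """Mean cross entropy where row i's positive is item i and the other
    items in the batch act as negatives.

    If item_probs is given, log(p_j) is subtracted from every logit of item j
    (assumed correction), which offsets the advantage popular items get from
    appearing in batches more often.
    """
    n = len(query_embs)
    total = 0.0
    for i in range(n):
        logits = []
        for j in range(n):
            score = dot(query_embs[i], item_embs[j])
            if item_probs is not None:
                score -= math.log(item_probs[j])
            logits.append(score)
        # Numerically stable log-sum-exp over the in-batch items.
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(s - m) for s in logits))
        total += log_norm - logits[i]
    return total / n


# Toy batch of two (query, item) pairs with hypothetical embeddings.
queries = [[1.0, 0.0], [0.0, 1.0]]
items = [[1.0, 0.0], [0.0, 1.0]]

plain_loss = batch_softmax_loss(queries, items)
uniform_loss = batch_softmax_loss(queries, items, item_probs=[0.5, 0.5])
skewed_loss = batch_softmax_loss(queries, items, item_probs=[0.9, 0.1])
```

With uniform probabilities the correction shifts every logit by the same constant, so the loss is unchanged; it only has an effect when the item distribution is skewed, which is the case the text is concerned with.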

In contrast to the MLP model, where the output item vocabulary is stationary, we target the streaming data situation, in which the vocabulary and distribution change over time. We propose a novel algorithm to sketch and estimate item frequency via gradient descent. In addition, we apply the bias-corrected modeling and scale it to build a personalized retrieval system for YouTube recommendations. We also introduce a sequential training strategy, designed to incorporate streaming data, along with the indexing and serving components of the system.
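The sketch below shows one way a streaming frequency estimator of this kind could look. It is an illustration under stated assumptions, not the paper's exact procedure: item ids are hashed into buckets (so no fixed vocabulary is needed), each bucket remembers the last step at which it was seen, and a moving average tracks the gap between consecutive occurrences. The moving-average update equals one gradient descent step with learning rate `alpha` on the squared error between the current estimate and the newly observed gap. The class name, the hashing scheme, and the choice to initialize the estimate from the first observed gap are all assumptions.

```python
import hashlib
from typing import Dict, Optional


class StreamingFrequencyEstimator:
    """Estimates how often each item appears in a stream (hypothetical sketch)."""

    def __init__(self, num_buckets: int = 2**32, alpha: float = 0.05) -> None:
        self.num_buckets = num_buckets
        self.alpha = alpha  # learning rate of the moving average
        self.last_step: Dict[int, int] = {}   # bucket -> step of latest occurrence
        self.avg_gap: Dict[int, float] = {}   # bucket -> estimated steps between occurrences

    def _bucket(self, item_id: str) -> int:
        # A stable hash, so new items need no vocabulary entry.
        digest = hashlib.md5(item_id.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.num_buckets

    def update(self, item_id: str, step: int) -> None:
        """Record that item_id was observed at the given global step."""
        b = self._bucket(item_id)
        if b in self.last_step:
            gap = step - self.last_step[b]
            if b in self.avg_gap:
                # SGD step on 0.5 * (avg_gap - gap)^2 with learning rate alpha.
                self.avg_gap[b] = (1 - self.alpha) * self.avg_gap[b] + self.alpha * gap
            else:
                # Assumed initialization: start from the first observed gap.
                self.avg_gap[b] = float(gap)
        self.last_step[b] = step

    def probability(self, item_id: str) -> Optional[float]:
        """Estimated per-step probability of seeing item_id, or None if unknown."""
        gap = self.avg_gap.get(self._bucket(item_id))
        return None if gap is None else 1.0 / gap


# Hypothetical stream: "a" appears every 2 steps, "b" and "c" every 4 steps.
estimator = StreamingFrequencyEstimator()
stream = ["a", "b", "a", "c"] * 50
for step, item in enumerate(stream, start=1):
    estimator.update(item, step)

p_a = estimator.probability("a")
p_b = estimator.probability("b")
p_c = estimator.probability("c")
```

Because the estimate is a moving average, it drifts toward the new gap when an item becomes more or less frequent, which is the kind of adaptivity to distribution change the text describes. Items that hash to the same bucket share one estimate, so a smaller `num_buckets` trades accuracy for memory.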

The major contributions of this paper include:
* Streaming Frequency Estimation. We propose a novel algorithm to estimate item frequency from a data stream, subject to vocabulary and distribution shifts. We offer analytical results to show the variance and bias of the estimation. We also provide a simulation that demonstrates the efficacy of our approach in capturing data dynamics.
* Modeling Framework. We provide a generic modeling framework for building large-scale retrieval systems. In particular, we incorporate the estimated item frequency in a cross entropy loss for the batch softmax to reduce the sampling bias of in-batch items.
* YouTube Recommendation. We describe how we apply the modeling framework to build a large-scale retrieval system for YouTube recommendations. We introduce the end-to-end system including the training, indexing, and serving components.
* Offline and Live Experiments. We perform offline experiments on two real-world datasets and demonstrate the effectiveness of sampling bias correction. We also show that our retrieval system built for YouTube leads to improved engagement metrics in live experiments.



## RELATED WORK
In this section, we give an overview of the related work and highlight the connections to our contributions.

### Content-Aware and Neural Recommenders
Utilizing content features of users and items is critical for improving generalization and mitigating cold-start problems. There is a line of research focusing on incorporating content features in the classic matrix factorization framework [23]. For instance, generalized matrix factorization models, e.g., SVDFeature [8] and Factorization Machine [33], can be applied to incorporate item content features. These models are able to capture up to bi-linear, i.e., second-order, interactions between features. In recent years, deep neural networks (DNNs) have been shown to be effective in improving recommendation accuracy. Because they are highly nonlinear, DNNs offer a larger capacity for capturing complicated feature interactions [6, 35] than traditional factorization approaches. He et al. [21] directly apply the intuition of collaborative filtering (CF) and offer a neural CF (NCF) architecture for modeling user-item interactions. In the NCF framework, user and item embeddings are concatenated and passed through a multi-layer neural network to get the final prediction. Our work differs from NCF in two aspects: 1) we leverage a two-tower neural network for modeling user-item interactions so that inference can be conducted over a large corpus of items in sub-linear time; 2) learning NCF relies on point-wise loss (such as squared or log loss), while we introduce a multi-class softmax loss and explicitly model item frequency.

### Extreme Classification
Softmax is one of the most commonly used functions in designing models that predict over a large output space of up to millions of labels. Much research has focused on training softmax classification models for a large number of classes, ranging from language tasks [5, 29] to recommenders [12]. 

When the number of classes is extremely large, a widely used technique to speed up training is to sample a subset of classes. Bengio et al. [5] show that a good sampling distribution should be adaptive to the model's output distribution. To avoid the complication of computing the sampling distribution, many real-world models apply a simple distribution such as unigram or uniform as a proxy. Recently, Blanc et al. [7] designed an efficient and adaptive kernel-based sampling method. Despite the success of sampled softmax in various domains, it is not applicable to the case where labels have content features. Adaptive sampling in this case also remains an open problem. Various works have shown that tree-based label structures, e.g., hierarchical softmax [30], are useful for building large-scale classification models while significantly reducing inference time. These approaches typically require a predefined tree structure based on certain categorical attributes. As a result, they are not suitable for incorporating a wide variety of input features.


### Two-tower Models
Building neural networks with two towers has recently become a popular approach in several natural language tasks, including modeling sentence similarities [31], response suggestions [24], and text-based information retrieval [17, 37]. Our work contributes to this line of research, particularly by demonstrating the effectiveness of two-tower models in building large-scale recommenders. Compared to many language tasks in the aforementioned literature, we focus on problems with a much larger corpus size, which is common in our target applications such as YouTube. Through live experiments, we find that explicitly modeling item frequency is critical for improving retrieval accuracy in this setting. Yet this problem is not well addressed in existing works.