<a href="https://colab.research.google.com/github/phonism/notes/blob/master/Recommendation_System.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Recommendation System

# Some equations
## Softmax && Cross Entropy Loss
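$$Softmax(x_{i})=S(x_{i})=\frac{e^{x_{i}}}{\sum_{k}e^{x_{k}}}$$
$$CrossEntropy = -\sum_{k}y_{k}\ln S(x_{k}) = -\ln S(x_{i})$$
where $y$ is the one-hot label and $i$ is the index of the true class. The gradient with respect to the logits is
$$\frac{\partial CrossEntropy}{\partial x_{k}} = S(x_{k}) - y_{k}$$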
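
The gradient formula above can be checked numerically. The following is a minimal NumPy sketch, not a reference implementation; the function names and the example logits are made up for illustration.

```python
import numpy as np

def softmax(x):
    # Subtracting the max does not change the result but avoids overflow.
    exp = np.exp(x - np.max(x))
    return exp / exp.sum()

def cross_entropy(x, y):
    # x: raw logits, y: one-hot label vector.
    return -np.sum(y * np.log(softmax(x)))

def cross_entropy_grad(x, y):
    # Analytic gradient with respect to the logits: S(x) - y.
    return softmax(x) - y

def numerical_grad(x, y, eps=1e-6):
    # Central finite differences, one logit at a time.
    grad = np.zeros_like(x)
    for k in range(x.size):
        step = np.zeros_like(x)
        step[k] = eps
        grad[k] = (cross_entropy(x + step, y) - cross_entropy(x - step, y)) / (2 * eps)
    return grad

logits = np.array([2.0, 1.0, 0.1])   # illustrative values
target = np.array([1.0, 0.0, 0.0])   # true class is index 0
analytic = cross_entropy_grad(logits, target)
numeric = numerical_grad(logits, target)
```

Comparing `analytic` and `numeric` with `np.allclose` is a quick sanity check that the derivative matches the loss.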
## Sigmoid
$$Sigmoid(x)=\frac{1}{1+e^{-x}}$$
$$Sigmoid'(x)=Sigmoid(x)(1 - Sigmoid(x))$$
## Auc (How to calculate AUC)
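One common way to compute AUC is the pairwise (Mann-Whitney) formulation: AUC is the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative sample, with ties counted as one half. The sketch below assumes binary labels in {0, 1}; the function name is made up for illustration, and the all-pairs comparison is only practical for small inputs.

```python
import numpy as np

def auc_pairwise(labels, scores):
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("AUC needs at least one positive and one negative sample")
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return wins / (pos.size * neg.size)

example_auc = auc_pairwise([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In this example three of the four positive/negative pairs are ordered correctly, so `example_auc` is 0.75. For large datasets a rank-based version that sorts the scores once is the usual choice.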


# Regularization?
Regularization helps prevent overfitting, which becomes more likely when the model has a lot of features or is very deep.

+ It works by adding a penalty (shrinkage) term, called the regularization term, to the loss function.
+ L1 regularization adds the "absolute value of magnitude" of the coefficients as the penalty term to the loss function.
+ L2 regularization adds the "squared magnitude" of the coefficients as the penalty term to the loss function.
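
The two penalty terms above can be sketched as follows. The strength parameter `lam` and the function names are assumptions made for this example, not part of the notes.

```python
import numpy as np

def l1_penalty(w, lam):
    # Sum of absolute values of the coefficients.
    return lam * np.sum(np.abs(w))

def l2_penalty(w, lam):
    # Sum of squared coefficients.
    return lam * np.sum(w ** 2)

def regularized_loss(data_loss, w, lam, kind="l2"):
    # Total objective = data loss + regularization term.
    if kind == "l1":
        return data_loss + l1_penalty(w, lam)
    if kind == "l2":
        return data_loss + l2_penalty(w, lam)
    raise ValueError("kind must be 'l1' or 'l2'")

weights = np.array([1.0, -2.0, 3.0])  # illustrative coefficients
```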


# Collaborative Filtering
To address some of the limitations of content-based filtering, collaborative filtering uses similarities between users and items simultaneously to provide recommendations. This allows for serendipitous recommendations; that is, collaborative filtering models can recommend an item to user A based on the interests of a similar user B. Furthermore, the embeddings can be learned automatically, without relying on hand-engineering of features.
+ Advantages: No domain knowledge necessary, Great starting point
+ Disadvantages: Cannot handle fresh items, Hard to include side features for query/item
## Item-CF
Based on the similarity between items, computed from users' ratings of those items (users who bought x also bought y).
## User-CF
Based on the similarity between users, computed from the items that different users have both bought (users whose interests are similar).
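
A minimal Item-CF sketch, assuming a small dense user-by-item rating matrix where 0 means "no interaction" and using cosine similarity between item columns. The function names and the toy matrix are made up for illustration. User-CF follows the same pattern with similarities computed between rows instead of columns.

```python
import numpy as np

def item_similarity(ratings):
    # ratings: (num_users, num_items); each column describes one item.
    ratings = np.asarray(ratings, dtype=float)
    norms = np.linalg.norm(ratings, axis=0)
    norms[norms == 0] = 1.0            # avoid division by zero for unseen items
    normalized = ratings / norms
    sim = normalized.T @ normalized    # cosine similarity between items
    np.fill_diagonal(sim, 0.0)         # an item should not recommend itself
    return sim

def recommend(ratings, user, top_k=2):
    ratings = np.asarray(ratings, dtype=float)
    sim = item_similarity(ratings)
    # Score each item by its similarity to the items the user already rated.
    scores = sim @ ratings[user]
    scores[ratings[user] > 0] = -np.inf  # skip items the user already has
    return np.argsort(-scores)[:top_k]

toy_ratings = np.array([
    [5, 4, 0, 0],
    [4, 5, 0, 0],
    [0, 0, 5, 4],
    [5, 0, 0, 0],   # this user has only interacted with item 0
])
top_items = recommend(toy_ratings, user=3)
```

Item 1 co-occurs with item 0 in the first two rows, so it is ranked first for the last user.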


# Logistic Regression
## Introduction
Logistic regression is a classification algorithm used to assign observations to a discrete set of classes. Unlike linear regression which outputs continuous number values, logistic regression transforms its output using the logistic sigmoid function to return a probability value which can then be mapped to two or more discrete classes.
 
## **Sigmoid Function**
In order to map predicted values to probabilities, we use the sigmoid function, which maps any real value into a value between 0 and 1.
$$Sigmoid(x) = \frac{1}{1 + e^{-x}}$$
```python
import numpy as np

def sigmoid(x):
    # Works element-wise on scalars and NumPy arrays.
    return 1.0 / (1 + np.exp(-x))
```

## **Cost Function**
Instead of Mean Squared Error, we use a cost function called Cross-Entropy, also known as Log Loss. Cross-entropy loss can be divided into two separate cost functions: one for $y=1$ and one for $y=0$.
$$L = \frac{1}{m}\left(-y^{T}\log(y_{pred}) - (1-y)^{T}\log(1-y_{pred})\right)$$
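
The loss above translates almost directly into NumPy. This is a sketch under a few assumptions: `y` and `y_pred` are 1-D arrays of length `m`, predictions are clipped by a small `eps` to avoid `log(0)`, and the gradient helper assumes the model is `y_pred = sigmoid(X @ w)`. All names are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, y_pred, eps=1e-12):
    # Clip so that log() never sees exactly 0 or 1.
    y_pred = np.clip(y_pred, eps, 1 - eps)
    m = y.shape[0]
    return (-(y @ np.log(y_pred)) - ((1 - y) @ np.log(1 - y_pred))) / m

def log_loss_grad(X, y, w):
    # Gradient of the loss with respect to w: X^T (y_pred - y) / m.
    return X.T @ (sigmoid(X @ w) - y) / y.shape[0]
```

A plain gradient-descent step would then be `w -= learning_rate * log_loss_grad(X, y, w)`.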

# Factorization Machines
Factorization Machine type algorithms are a combination of linear regression and matrix factorization. The key idea is to model interactions between features (a.k.a. attributes or explanatory variables) using factorized parameters. By doing so, the model can estimate all interactions between features even with extremely sparse data.
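
For reference, the second-order model from Rendle (2010) can be written as follows, where $w_0$ is the global bias, $w_i$ are the linear weights, and $v_i \in \mathbb{R}^{k}$ is the factor vector of feature $i$ (the symbols follow the usual presentation and are not defined elsewhere in these notes):
$$\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle v_i, v_j \rangle x_i x_j$$
The pairwise term can be computed in linear time using the identity
$$\sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle v_i, v_j \rangle x_i x_j = \frac{1}{2}\sum_{f=1}^{k}\left[\left(\sum_{i=1}^{n} v_{i,f}\,x_i\right)^{2} - \sum_{i=1}^{n} v_{i,f}^{2}\,x_i^{2}\right]$$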



```python
import numpy as np
import torch


class FeaturesLinear(torch.nn.Module):
    """First-order (linear) term: one weight per feature value plus a bias."""

    def __init__(self, field_dims, output_dim=1):
        super().__init__()
        self.fc = torch.nn.Embedding(sum(field_dims), output_dim)
        self.bias = torch.nn.Parameter(torch.zeros((output_dim,)))
        # Start index of each field inside the shared table.
        # np.int64 is used because np.long is not available in every NumPy release.
        self.offsets = np.array((0, *np.cumsum(field_dims)[:-1]), dtype=np.int64)

    def forward(self, x):
        """
        :param x: Long tensor of size ``(batch_size, num_fields)``
        """
        x = x + x.new_tensor(self.offsets).unsqueeze(0)
        return torch.sum(self.fc(x), dim=1) + self.bias


class FeaturesEmbedding(torch.nn.Module):
    """Looks up one embedding vector per field from a single shared table."""

    def __init__(self, field_dims, embed_dim):
        super().__init__()
        self.embedding = torch.nn.Embedding(sum(field_dims), embed_dim)
        self.offsets = np.array((0, *np.cumsum(field_dims)[:-1]), dtype=np.int64)
        torch.nn.init.xavier_uniform_(self.embedding.weight.data)

    def forward(self, x):
        """
        :param x: Long tensor of size ``(batch_size, num_fields)``
        """
        x = x + x.new_tensor(self.offsets).unsqueeze(0)
        return self.embedding(x)


class FactorizationMachine(torch.nn.Module):
    """Second-order interaction term of a factorization machine."""

    def __init__(self, reduce_sum=True):
        super().__init__()
        self.reduce_sum = reduce_sum

    def forward(self, x):
        """
        :param x: Float tensor of size ``(batch_size, num_fields, embed_dim)``
        """
        # 0.5 * ((sum of embeddings)^2 - sum of squared embeddings)
        # equals the sum of all pairwise dot products.
        square_of_sum = torch.sum(x, dim=1) ** 2
        sum_of_square = torch.sum(x ** 2, dim=1)
        ix = square_of_sum - sum_of_square
        if self.reduce_sum:
            ix = torch.sum(ix, dim=1, keepdim=True)
        return 0.5 * ix


class FactorizationMachineModel(torch.nn.Module):
    """
    A pytorch implementation of Factorization Machine.
    Reference:
        S Rendle, Factorization Machines, 2010.
    """

    def __init__(self, field_dims, embed_dim):
        super().__init__()
        self.embedding = FeaturesEmbedding(field_dims, embed_dim)
        self.linear = FeaturesLinear(field_dims)
        self.fm = FactorizationMachine(reduce_sum=True)

    def forward(self, x):
        """
        :param x: Long tensor of size ``(batch_size, num_fields)``
        """
        x = self.linear(x) + self.fm(self.embedding(x))
        return torch.sigmoid(x.squeeze(1))
```
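
The `FactorizationMachine.forward` method above relies on the "square of sum minus sum of squares" trick. The following NumPy sketch is independent of the PyTorch classes and uses made-up function names; it compares the trick against a naive double loop over field pairs for a single sample.

```python
import numpy as np

def fm_interaction_fast(emb):
    # emb: (num_fields, embed_dim) embeddings of one sample.
    square_of_sum = emb.sum(axis=0) ** 2
    sum_of_square = (emb ** 2).sum(axis=0)
    return 0.5 * np.sum(square_of_sum - sum_of_square)

def fm_interaction_naive(emb):
    # Sum of dot products over all pairs of distinct fields.
    total = 0.0
    num_fields = emb.shape[0]
    for i in range(num_fields):
        for j in range(i + 1, num_fields):
            total += emb[i] @ emb[j]
    return total

rng = np.random.default_rng(0)
sample_emb = rng.normal(size=(5, 4))   # 5 fields, embedding size 4
fast = fm_interaction_fast(sample_emb)
naive = fm_interaction_naive(sample_emb)
```

Both values agree up to floating-point error, while the fast version costs O(num_fields * embed_dim) instead of O(num_fields^2 * embed_dim).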

# GBDT + LR
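Use GBDT to generate features: each sample is mapped to the leaf it reaches in every tree, and the one-hot encoded leaf indices are used as input features for logistic regression.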

# Wide & Deep
The Wide & Deep learning model has two main components.
+ Wide (memorization): a linear model with a wide set of cross-product feature transformations. Memorization can be loosely defined as learning the frequent co-occurrence of items or features and exploiting the correlation available in the historical data.
+ Deep (generalization): a deep feed-forward neural network in which each feature has its own embedding. Generalization, on the other hand, is based on transitivity of correlation and explores new feature combinations that have never or rarely occurred in the past.
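
A forward-pass sketch of how the two components above can be combined, assuming just two categorical features `a` and `b`, one cross-product feature for the wide part, and a single hidden layer for the deep part. The weights are random, there is no training, and every name here is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(num_a, num_b, embed_dim=4, hidden_dim=8, seed=0):
    rng = np.random.default_rng(seed)
    return {
        "num_b": num_b,
        "wide_w": rng.normal(scale=0.1, size=num_a * num_b),  # one weight per (a, b) pair
        "emb_a": rng.normal(scale=0.1, size=(num_a, embed_dim)),
        "emb_b": rng.normal(scale=0.1, size=(num_b, embed_dim)),
        "w1": rng.normal(scale=0.1, size=(2 * embed_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "w2": rng.normal(scale=0.1, size=hidden_dim),
        "bias": 0.0,
    }

def wide_deep_forward(x, p):
    # x: integer array of shape (batch_size, 2) holding the ids of a and b.
    a, b = x[:, 0], x[:, 1]
    # Wide: the cross-product (a AND b) is one-hot, so the linear model is a lookup.
    cross_id = a * p["num_b"] + b
    wide_logit = p["wide_w"][cross_id]
    # Deep: concatenate the embeddings, then one ReLU layer and a linear output.
    deep_in = np.concatenate([p["emb_a"][a], p["emb_b"][b]], axis=1)
    hidden = np.maximum(0.0, deep_in @ p["w1"] + p["b1"])
    deep_logit = hidden @ p["w2"]
    # The two logits are summed before the final sigmoid.
    return sigmoid(wide_logit + deep_logit + p["bias"])

params = init_params(num_a=3, num_b=5)
batch = np.array([[0, 1], [2, 4], [1, 0]])
probs = wide_deep_forward(batch, params)
```

The wide weight for a specific `(a, b)` pair can memorize that exact combination, while the embeddings let the deep part score pairs it has rarely or never seen.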

## DeepFM
The wide part uses factorization machines.

# Abacus
A framework for large-scale discrete DNN models based on parameter servers
## Architecture
+ Supports distributed multi-machine training based on MPI, with a single master node.
+ Network communication uses ZeroMQ.
+ After the samples are shuffled, they are distributed to multiple machines and trained in parallel.
+ Each training node also acts as a parameter server, providing distributed parameter services for both the large and the small model. The large model is a sparse table that stores hundreds of billions of feasigns together with their show, click, embedding, etc. The small model is the DNN model, which stores the DNN parameters.

## Structure
## Training Process
The training process has two stages: join and update.
+ Join stage: query the sparse table with the input feasigns to obtain show, ctr, lr, and emb. This stage updates the join DNN parameters, and the join network is used for online estimation.
+ Update stage: query the sparse table with the input feasigns to obtain lr and emb. This stage updates the update DNN parameters and the sparse table.

Why?
+ Embedding training is slow, and the expected changes are small. When new data arrives, the join stage is trained first, without updating the embedding, so the model can quickly learn the latest data distribution and take effect online.
+ It can alleviate overfitting, because online prediction always uses the embedding from time T-1 and the DNN from time T, so the information of the current sample is not used.
+ Strong features such as a feasign's show and click are available in the join stage, but not in the update stage. If the model were trained end-to-end, these strong features could bias the embedding learning.
+ Because the join stage has strong features, it learns very quickly and the importance of each slot is clear. In the update stage, this helps the embedding converge in the correct direction.

# Bias
## Cold-start
+ Given a new item not seen in training, if the system has a few interactions with users, then the system can easily compute an embedding for this item without having to retrain the whole model. The item embedding can be the average of user embeddings.
+ Heuristics to generate embeddings of fresh items. If the system does not have interactions, the system can approximate its embedding by averaging the embeddings of items from the same category, from the same uploader (in YouTube), and so on.
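
Both cold-start heuristics above amount to averaging existing embeddings. The sketch below assumes the embeddings are stored as NumPy arrays indexed by user or item id; the function names and the toy data are made up for illustration.

```python
import numpy as np

def embedding_from_interactions(user_embeddings, user_ids):
    # New item with a few interactions: average the embeddings of those users.
    return user_embeddings[user_ids].mean(axis=0)

def embedding_from_category(item_embeddings, item_categories, category):
    # New item without interactions: average the items of the same category.
    mask = item_categories == category
    if not mask.any():
        raise ValueError("no known items in this category")
    return item_embeddings[mask].mean(axis=0)

user_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
item_embeddings = np.array([[2.0, 0.0], [0.0, 2.0], [4.0, 4.0]])
item_categories = np.array(["a", "a", "b"])

warm_item = embedding_from_interactions(user_embeddings, [0, 2])
fresh_item = embedding_from_category(item_embeddings, item_categories, "a")
```

The same pattern applies to other groupings, such as averaging over items from the same uploader.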