# Introduction

In general, we would like to make use of all the knowledge available in different data sources and also use the algorithmic power of various recommender systems to make robust inferences - **Hybrid Recommender Systems**. 

Three primary ways of creating hybrid recommender systems:
1. **Ensemble design**: Results from off-the-shelf algorithms are combined into a single and more robust output.
2. **Monolithic design**: An integrated recommendation algorithm is created by using various data types.
3. **Mixed systems**: These systems use multiple recommendation algorithms as black-boxes, but the items recommended by the various systems are presented togetther side by side.

The **hybrid system** is used in a broader context than the term **ensemble system**.

The **hybrid recommendation systems** can be classified into the following categories:
1. **Weighted**: The scores of several recommender systems are combined into a sigle unified score by computing the weighted aggregates of the scores from invidual ensemble components.
2. **Switching**: The algorithm switches between various recommender systems depending on current needs.
3. **Cascade**: One recommender system refines the recommendations given by another.
4. **Feature augumentation**: The output of one recommender system is used to create input features for the next.
5. **Feature combination**: The features from different data sources are combined and used in the contexr of a single recommender system.
6. **Meta-level**: The model used by one recommender system is used as input to another system. Example: Content-based system creates peer groups, then the collaborative filtering system use that peer groups to make the recommendations.
7. **Mixed**: Recommendations from several engines are presented to the user at the same time.

The first four are ensemble systems, the next two are monolithic systems, and the last one is mixed system.

# Ensemble Methods

The error of a classifier in predicting the dependent variable can be decomposed into three components:
1. **Bias**: Every classifier makes its own modeling assumptions about the nature of the decision boundary between classes. When a classifier has high bias, it will make consistently incorrect predictions over particular choices of test instances near the incorrectly modeled-decision boundary.
2. **Variance**: Random variations in the choices of the training data will lead to different models. As a result, the dependent variable for a test instance might be inconsistently predicted by different choices of training data sets. Model variance is closely related to overfitting.
3. **Noise**: The noise refers to the intrinsic errors in the target class labeling.

The expected mean-squared error of a classifier over a set of test instances can be shown to be sum of the bias, variance, and noise:
$$ Error = Bias^2 + Variance + Noise $$

Classification ensemble methods such as *bagging* reduce the variance, whereas methods such as *boosting* can reduce the bias.

# Weighted Hybrid

Let $R = [r_{uj}]$ be an $m \times n$ ratings matrix. Let $\hat{R}_1...\hat{R}_q$ be the $m \times n$
completely specified ratings matrices, in which the unobserved entries of $R$ are predicted by $q$ different algorithms. Then, for a set of weights $\alpha_1...\alpha_q$ , the weighted hybrid creates a combined prediction matrix $\hat{R} = [\hat{r}_{uj}]$ as follows:
$$ \hat{R} = \sum_{i=1}^q \alpha_i \hat{R}_i $$

In [78]:
import numpy as np
import tensorflow as tf

  from ._conv import register_converters as _register_converters


In [30]:
def ensemble_all_predicted_ratings_matrix(list_matrixs, list_weights):
    z = np.multiply(list_weights, np.transpose(list_matrix))
    z = np.transpose(z)
    
    return np.sum(z, axis = 0)/np.sum(list_weights)

In order to determine the optimal weights, it is necessary to be able to evaluate the
effectiveness of a particular combination of weights $\alpha_1...\alpha_q$. <br>
A simple approach is to hold out a small fraction (e.g., 25%) of the known entries in the $m \times n$ ratings matrix $R = [r_{uj}]$ and create the prediction matrices $\hat{R}_1...\hat{R}_q$ by applying the $q$ different base algorithms on the remaining 75% of the entries in R. The resulting predictions $\hat{R}_1...\hat{R}_q$ are then combined to create the ensemble-based prediction $\hat{R}$. Let the user-item indices $(u,j)$ of these held-out entries be denoted by $H$. The effectiveness of a particular scheme can be evaluated using either the mean-squared error (MSE) or the mean absolute error (MAE) of the predicted matrix over the held-out ratings in $H$. We can use linear regression model to find the most effective values of weights.

After the weights have been learned using linear regression, the individual component models are retrained on the entire training set without any held-out entries.

Regularization can be added to prevent overfitting. It is also possible to add other con- straints on the various values of αi such as non-negativity or ensuring that they sum to 1. 

Often, these tech- niques use a simple average of the predictions of different components. It is particularly important to weight the different components when the predicted utility values are on dif- ferent scales, or when some of the ensemble components are much more accurate than others. 

In [32]:
# indices for vector
def specified_rating_indices(u):
    return np.where(np.isfinite(u))

In [79]:
from sklearn.utils import shuffle

def get_batch(X, y, batch_size, iteration):
    return (X[iteration * batch_size: (iteration + 1) * batch_size], y[iteration * batch_size: (iteration + 1) * batch_size])

In [140]:
def find_weights_using_linear_regression(list_predicted_matrix, original_matrix,
                                         batch_size=2, n_epochs=10, learning_rate=0.001):
    indices = specified_rating_indices(original_matrix)

    X_train = []
    for i in range(len(list_predicted_matrix)):
        matrix = list_predicted_matrix[i]
        X_train.append(matrix[indices])
        
    X_train = np.transpose(X_train)
    y_train = original_matrix[indices]
    
    n_dimensions = len(list_predicted_matrix)
    
    # create tensorflow graph
    tf.reset_default_graph()
    
    X = tf.placeholder(tf.float32, shape=[None, n_dimensions], name='X')
    y = tf.placeholder(tf.float32, shape=None, name='y')
    
#     W = tf.get_variable('W', shape=[n_dimensions, 1], initializer=tf.contrib.layers.xavier_initializer(seed = 1))
#     b = tf.Variable(dtype=tf.float32, initial_value=0)
#     output = tf.matmul(X, W) + b

    output = tf.layers.dense(X, units=1)
    
    loss = tf.losses.mean_squared_error(output, y)
    
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)
    init = tf.global_variables_initializer()
    
    with tf.Session() as sess:
        init.run()

        for epoch in range(n_epochs):
            X_train, y_train = shuffle(X_train, y_train)
            total_loss = 0
            n_iteration = X_train.shape[0] // batch_size
            for i in range(n_iteration):
                X_batch, y_batch = get_batch(X_train, y_train, batch_size, i)
                l, _ = sess.run([loss, optimizer], feed_dict={X: X_batch, y: y_batch})
                total_loss += l
            
            print('Loss: ', total_loss / n_iteration)
        
        var = [v for v in tf.trainable_variables()][0]
        value = sess.run(var)
    return value

In [156]:
a = np.array([[1, 2, 9, 4, 5],
              [nan, 1, 3, nan, 2]])
b = np.array([[2, 2, 4, 5, 6],
              [nan, 1, 3, nan, 2]])
c = np.array([[3, 4, 5, 6, 7],
              [nan, 1, 3, nan, 2]])

value = find_weights_using_linear_regression([b, c], a, n_epochs=10)

Loss:  19.537912786006927
Loss:  15.251693665981293
Loss:  11.021173030138016
Loss:  8.004278674721718
Loss:  7.621613174676895
Loss:  7.934566304087639
Loss:  6.653058409690857
Loss:  8.155106753110886
Loss:  7.148072227835655
Loss:  6.631481006741524


In [157]:
value

array([[ 1.0894009 ],
       [-0.20308039]], dtype=float32)

## Various Types of Model Combinations

There are typically two forms of model combinations:
1. **Homogeneous data type and model classes**: Different models are applied on the same data. Such an approach is robust because it avoids the specific bias of particular algorithms on a given data set even though all the constituent models belong to the same class
2. **Heterogeneous data type and model classes**: Different classes of models are applied to different data sources. The idea is to leverage the complementary knowledge in the various data sources in order to provide the most accurate recommendations.

## Adapting Bagging from Classification