# The evolution of collaborative filtering: the Two Tower Model

In a sense, the two-tower model is an evolution of SVD-based collaborative filtering. The idea is more or less the same: finding latent variables that describe users and items.
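
To make the "latent variables" idea concrete, here is a minimal NumPy sketch of a truncated SVD on a tiny rating matrix. The matrix values and the names `ratings`, `user_factors` and `item_factors` are invented for illustration, and zeros are treated as actual ratings rather than missing values, which a real system would handle differently.

```python
import numpy as np

# Hypothetical 4 users x 5 items rating matrix (values invented for illustration).
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0, 0.0],
    [4.0, 5.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 5.0, 4.0, 5.0],
    [1.0, 0.0, 4.0, 5.0, 4.0],
])

# Thin SVD: ratings = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(ratings, full_matrices=False)

k = 2  # number of latent variables to keep
user_factors = U[:, :k] * s[:k]   # shape (4, k): one latent vector per user
item_factors = Vt[:k, :].T        # shape (5, k): one latent vector per item

# The predicted score is a plain dot product between the two latent vectors.
approx = user_factors @ item_factors.T
print(np.round(approx, 1))
```

Each user and each item is summarised by `k` numbers, and the score is linear in both. A two-tower model keeps the dot product but lets neural networks produce the two vectors.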

A two-tower model is a type of recommender system that uses two separate neural networks, often called "towers", to make recommendations.
Users and items are represented as N-dimensional embedding vectors, which the model learns so that the similarity score between a user and an item is higher for items the user has interacted with. The name comes from the two towers: one learns the encoding of the users, the other learns the encoding of the items.

* The first tower (the user, or query, tower) is a **feature encoder** that takes the user's explanatory variables and encodes them into a fixed-length vector representation - basically a vector of latent variables extracted in a non-linear fashion. See: https://blog.tensorflow.org/2020/09/introducing-tensorflow-recommenders.html

* The second tower (the item, or candidate, tower) encodes the items into the same vector space. A similarity score between the two vectors, typically a dot product, then acts as the **ranking model**, i.e., it links users and products, based on the given examples.


Finally, we can use the trained model to make recommendations on new, unseen data. This involves passing the explanatory variables through the user tower to get the encoded features, and then scoring those features against the item encodings to rank the recommendations.
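
The scoring step can be sketched in plain NumPy. This is only an illustration under stated assumptions: the towers are untrained one-hidden-layer networks with random weights, and the helper names (`make_tower`, `encode`) and all sizes are hypothetical, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_tower(in_dim, hidden_dim, out_dim):
    """Return randomly initialised weights for a one-hidden-layer tower."""
    return {
        "W1": rng.normal(scale=0.1, size=(in_dim, hidden_dim)),
        "W2": rng.normal(scale=0.1, size=(hidden_dim, out_dim)),
    }

def encode(tower, features):
    """Map raw features to an embedding: linear -> ReLU -> linear."""
    hidden = np.maximum(features @ tower["W1"], 0.0)
    return hidden @ tower["W2"]

# Hypothetical sizes: 10 user features, 6 item features, 8-dimensional embeddings.
user_tower = make_tower(10, 16, 8)
item_tower = make_tower(6, 16, 8)

user_features = rng.random((3, 10))   # 3 users
item_features = rng.random((5, 6))    # 5 candidate products

user_emb = encode(user_tower, user_features)   # shape (3, 8)
item_emb = encode(item_tower, item_features)   # shape (5, 8)

# Similarity score for every (user, item) pair: a dot product.
scores = user_emb @ item_emb.T                 # shape (3, 5)

# Rank the products for each user, best first.
ranking = np.argsort(-scores, axis=1)
print(ranking)
```

Because the two towers only meet at the dot product, item embeddings can be computed once and reused for every user.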

What follows is just one example of how a two-tower model can be used for recommendation; there are many variations, and the specific details depend on the problem and the data at hand. It is a toy example that illustrates the basic steps involved in building a recommender system with a neural network, and for simplicity it feeds a single set of explanatory variables to both towers instead of separate user and item features. You can start from here and work on the theme yourself.

## Example
We first load the libraries, create some *synthetic data* and split it into training and testing sets, as usual.
Let's assume we have 5 products to recommend, and 10 explanatory variables. 

**HINTS**: Try with different synthetic data (try more sensible probability distributions), or with real data.
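
As one possible take on the hint, the sketch below generates synthetic data in which the chosen product actually depends on the explanatory variables, so a model has something to learn. Everything here is an assumption made for illustration: the hidden `affinity` table, the Gumbel noise and the names `X_alt` and `y_alt` are not part of the example that follows.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_levels, n_products = 1000, 10, 100, 5

# Integer-coded categorical explanatory variables, each with n_levels possible values.
X_alt = rng.integers(0, n_levels, size=(n_samples, n_features))

# Hypothetical hidden affinity of every (feature, level) pair for every product.
affinity = rng.normal(size=(n_features, n_levels, n_products))

# Sum, over the features, the affinity of each observed level for each product.
signal = affinity[np.arange(n_features), X_alt].sum(axis=1)   # (n_samples, n_products)

# Add noise so that the choice is not fully deterministic.
utilities = signal + rng.gumbel(size=signal.shape)

# Each sample picks the product with the highest noisy utility,
# encoded as a one-hot row (one relevant product per sample).
chosen = utilities.argmax(axis=1)
y_alt = np.eye(n_products)[chosen]
```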

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Embedding, Flatten, Concatenate
from tensorflow.keras.models import Model

# Set random seeds for reproducibility (NumPy for the data, TensorFlow for the weight initialisation)
np.random.seed(0)
tf.random.set_seed(0)

# Define the number of recommendations (say 5, purely as an example, but you can change) and explanatory variables (say 10)
num_recommendations = 5
num_explanatory_variables = 10

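# Generate simulated data
# Explanatory variables are integer codes in [0, 100), as expected by the Embedding layer (input_dim=100) used below
X = np.random.randint(0, 100, (1000, num_explanatory_variables))
# Each sample has exactly one relevant product, one-hot encoded to match softmax outputs and categorical cross-entropy
y = np.eye(num_recommendations)[np.random.randint(0, num_recommendations, 1000)]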

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)



Now we define the **model architecture**, in a simple form:
* The model consists of two "towers" that read the same input.
* The input layer receives the explanatory variables.
* The embedding layer maps each input value to a 10-dimensional vector through a lookup in an embedding matrix (Keras treats the input values as integer indices).
* The flatten layer flattens the output of the embedding layer into a 1D vector.
* Each tower consists of two hidden layers (64 and 32 units, with ReLU activation) that transform the flattened embedding vector into an intermediate representation specific to that tower.
* The output layer of each tower produces a probability distribution over the five possible products to recommend (using the softmax activation function).
* The last hidden layers of the two towers are concatenated and fed into one more hidden layer (32 units, with ReLU activation) that combines them into a single, final representation.
* The merged output layer produces a probability distribution over the N=5 products to recommend (again, using the softmax activation function).
* We compile the model, preparing it for training, using the Adam optimizer and the categorical cross-entropy loss function - quite standard. This loss assumes that each example has exactly one correct product, encoded as a one-hot row.
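
To see what the softmax outputs and the categorical cross-entropy loss compute, here is a small NumPy sketch. The two helper functions are written for this illustration (they are not the Keras implementations) and the logits are invented.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean over the examples of -sum(y_true * log(y_prob))."""
    y_prob = np.clip(y_prob, eps, 1.0)
    return float(np.mean(-np.sum(y_true * np.log(y_prob), axis=1)))

# Raw scores for two examples over five products (invented values).
logits = np.array([[2.0, 1.0, 0.1, 0.0, -1.0],
                   [0.0, 0.0, 0.0, 0.0, 0.0]])
probs = softmax(logits)

# One-hot targets: product 0 for the first example, product 2 for the second.
one_hot = np.array([[1, 0, 0, 0, 0],
                    [0, 0, 1, 0, 0]], dtype=float)

print(probs.sum(axis=1))                         # each row sums to 1
print(categorical_crossentropy(one_hot, probs))
```

With uniform probabilities over five products, as in the second row, the loss for that example is log(5), roughly 1.61. That is a useful baseline when reading the training log: a per-output loss far above it suggests that the labels and the loss do not match.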

In [None]:
# Define model architecture
input_dim = X_train.shape[1]
output_dim = y_train.shape[1]

input_layer = Input(shape=(input_dim,))
embedding_layer = Embedding(input_dim=100, output_dim=10)(input_layer)
flatten_layer = Flatten()(embedding_layer)

tower1_layer1 = Dense(64, activation='relu')(flatten_layer)
tower1_layer2 = Dense(32, activation='relu')(tower1_layer1)
tower1_output = Dense(output_dim, activation='softmax', name='tower1_output')(tower1_layer2)

tower2_layer1 = Dense(64, activation='relu')(flatten_layer)
tower2_layer2 = Dense(32, activation='relu')(tower2_layer1)
tower2_output = Dense(output_dim, activation='softmax', name='tower2_output')(tower2_layer2)

concat_layer = Concatenate()([tower1_layer2, tower2_layer2])
merged_layer1 = Dense(32, activation='relu')(concat_layer)
merged_output = Dense(output_dim, activation='softmax', name='merged_output')(merged_layer1)

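# Define the model: one input, three outputs (tower 1, tower 2, merged)
model = Model(inputs=[input_layer], outputs=[tower1_output, tower2_output, merged_output])
# The same loss is applied to every output; metrics are listed once per output,
# which newer Keras versions require for multi-output models
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=[['accuracy'], ['accuracy'], ['accuracy']])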


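Now we **train and evaluate** the model, passing the same labels to each of the three outputs, and, finally, we compute our **predictions** on the test set. The exact numbers depend on the synthetic data, the library versions and the weight initialisation, so expect values that differ from the output shown below.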

In [None]:
# Train the model
model.fit(X_train, [y_train, y_train, y_train], epochs=10, batch_size=32)

# Evaluate the model
loss, tower1_loss, tower2_loss, merged_loss, tower1_acc, tower2_acc, merged_acc = model.evaluate(X_test, [y_test, y_test, y_test], verbose=0)
print('Overall Loss:', loss)
print('Tower 1 Loss:', tower1_loss)
print('Tower 2 Loss:', tower2_loss)
print('Merged Loss:', merged_loss)
print('Tower 1 Accuracy:', tower1_acc)
print('Tower 2 Accuracy:', tower2_acc)
print('Merged Accuracy:', merged_acc)

# Make predictions on test set
predictions = model.predict(X_test)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Overall Loss: 143.5790557861328
Tower 1 Loss: 5.915769100189209
Tower 2 Loss: 53.61212921142578
Merged Loss: 84.0511703491211
Tower 1 Accuracy: 0.15000000596046448
Tower 2 Accuracy: 0.15000000596046448
Merged Accuracy: 0.054999999701976776


And now we can use the trained model to **make recommendations on some new data**.

In [None]:
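# Use the trained model to make recommendations on some new data (just an example)
# New inputs are integer codes in [0, 100), matching the Embedding layer's input_dim
X_new = np.random.randint(0, 100, (100, num_explanatory_variables))
# predict returns one array per model output: [tower1, tower2, merged]
predicted_scores = model.predict(X_new)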
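
Scores still have to be turned into actual recommendations. A minimal sketch, assuming we only use the merged head (the last array returned by `predict`) and want the two best products per example. The `merged_scores` array below is a hand-written stand-in with invented values, so the snippet runs without the trained model.

```python
import numpy as np

# Stand-in for the merged head's output: one row per new example,
# one column per product (values invented for illustration).
merged_scores = np.array([
    [0.10, 0.50, 0.05, 0.30, 0.05],
    [0.40, 0.10, 0.25, 0.10, 0.15],
])

top_k = 2
# argsort on the negated scores orders each row from the highest score down.
top_products = np.argsort(-merged_scores, axis=1)[:, :top_k]
print(top_products)
```

Here the first example gets products 1 and 3, and the second gets products 0 and 2.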



## Some remarks
A two-tower model has some advantages over a standard SVD model for recommender systems:

* Handling of non-linear relationships: a two-tower model can capture non-linear relationships between the input variables and the recommended products, whereas an SVD model scores each user-item pair with a linear combination (a dot product) of latent factors. The extra flexibility comes at a price: the model can overfit.

* Incorporating additional features: a two-tower model can incorporate additional features beyond user-item interactions, such as user and item attributes or contextual information. This allows the model to capture more complex relationships between users, items, and the environment in which they interact.

* Scalability: a two-tower model can be more scalable than an SVD model because it is trained with mini-batch gradient descent, which processes large datasets one small batch at a time. A classical SVD factorizes the full user-item matrix at once, which can be computationally expensive and memory-intensive.
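
The scalability point can be illustrated with a minimal NumPy sketch of mini-batch gradient descent on latent factors. This is not a two-tower model, only the training scheme it relies on: each step touches one small batch of observed interactions and never the full matrix. The data are synthetic, and all names and hyperparameters are assumptions chosen for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 50, 40, 4

# Hypothetical observed interactions, generated from hidden low-rank factors plus noise.
true_P = rng.normal(size=(n_users, k))
true_Q = rng.normal(size=(n_items, k))
n_obs = 2000
users = rng.integers(0, n_users, n_obs)
items = rng.integers(0, n_items, n_obs)
ratings = np.sum(true_P[users] * true_Q[items], axis=1) + 0.1 * rng.normal(size=n_obs)

# Latent factors to learn, initialised with small random values.
P = rng.normal(scale=0.1, size=(n_users, k))
Q = rng.normal(scale=0.1, size=(n_items, k))

def mean_squared_error(P, Q):
    predictions = np.sum(P[users] * Q[items], axis=1)
    return float(np.mean((ratings - predictions) ** 2))

loss_before = mean_squared_error(P, Q)

learning_rate, batch_size = 0.05, 64
for epoch in range(20):
    order = rng.permutation(n_obs)
    for start in range(0, n_obs, batch_size):
        batch = order[start:start + batch_size]
        u, i = users[batch], items[batch]
        error = ratings[batch] - np.sum(P[u] * Q[i], axis=1)
        # Update steps computed from this batch only.
        step_P = learning_rate * error[:, None] * Q[i]
        step_Q = learning_rate * error[:, None] * P[u]
        # add.at accumulates correctly when a user or item repeats within the batch.
        np.add.at(P, u, step_P)
        np.add.at(Q, i, step_Q)

loss_after = mean_squared_error(P, Q)
print(loss_before, loss_after)
```

Memory use per step depends on the batch size and the number of latent factors, not on the size of the full user-item matrix.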