This repository provides examples and best practices for building recommendation systems, provided as Jupyter notebooks. The examples detail our learnings on five key tasks:
- Prepare Data: Preparing and loading data for each recommender algorithm
- Model: Building models using various classical and deep learning recommender algorithms such as Alternating Least Squares (ALS) or eXtreme Deep Factorization Machines (xDeepFM)
- Evaluate: Evaluating algorithms with offline metrics
- Model Select and Optimize: Tuning and optimizing hyperparameters for recommender models
- Operationalize: Operationalizing models in a production environment on Azure
Several utilities are provided in reco_utils to support common tasks such as loading datasets in the format expected by different algorithms, evaluating model outputs, and splitting training/test data. Implementations of several state-of-the-art algorithms are provided for self-study and customization in your own applications.
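As a rough illustration of the train/test splitting task mentioned above, here is a minimal sketch using only the Python standard library. It is not the `reco_utils` implementation; the function name `random_split`, its parameters, and the toy ratings are hypothetical and chosen only for this example.

```python
import random


def random_split(rows, ratio=0.75, seed=42):
    """Shuffle rows and split them into (train, test) at the given ratio.

    Hypothetical helper for illustration; not part of reco_utils.
    """
    if not 0.0 < ratio < 1.0:
        raise ValueError("ratio must be strictly between 0 and 1")
    shuffled = list(rows)  # copy so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(round(len(shuffled) * ratio))
    return shuffled[:cut], shuffled[cut:]


# Toy (user_id, item_id, rating) interactions: 4 users x 5 items = 20 rows.
ratings = [(u, i, float((u + i) % 5 + 1)) for u in range(1, 5) for i in range(1, 6)]
train, test = random_split(ratings, ratio=0.75)
print(len(train), len(test))  # 15 5
```

A fixed seed keeps the split reproducible across runs, which matters when comparing algorithms on the same data.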
To set up on your local machine:
- Install Anaconda with Python >= 3.6. Miniconda is a quick way to get started.
- Clone the repository
git clone https://github.com/Microsoft/Recommenders
- Run the generate conda file script to create a conda environment:
(This is for a basic python environment, see SETUP.md for PySpark and GPU environment setup)
cd Recommenders
python scripts/generate_conda_file.py
conda env create -f reco_base.yaml
- Activate the conda environment and register it with Jupyter:
conda activate reco_base
python -m ipykernel install --user --name reco_base --display-name "Python (reco)"
- Start the Jupyter notebook server
cd notebooks
jupyter notebook
- Run the SAR Python CPU MovieLens notebook under the 00_quick_start folder. Make sure to change the kernel to "Python (reco)".
The table below lists recommender algorithms available in the repository at the moment.
| Algorithm | Environment | Type | Description |
| --- | --- | --- | --- |
| Smart Adaptive Recommendations (SAR)* | Python CPU | Collaborative Filtering | Similarity-based algorithm for implicit feedback datasets |
| Surprise/Singular Value Decomposition (SVD) | Python CPU | Collaborative Filtering | Matrix factorization algorithm for predicting explicit rating feedback in datasets that are not very large |
| Vowpal Wabbit Family (VW)* | Python CPU (train online) | Collaborative, Content-Based Filtering | Fast online learning algorithms, great for scenarios where user features / context are constantly changing |
| Extreme Deep Factorization Machine (xDeepFM)* | Python CPU / Python GPU | Hybrid | Deep learning based algorithm for implicit and explicit feedback with user/item features |
| Deep Knowledge-Aware Network (DKN)* | Python CPU / Python GPU | Content-Based Filtering | Deep learning algorithm incorporating a knowledge graph and article embeddings to provide powerful news or article recommendations |
| Neural Collaborative Filtering (NCF) | Python CPU / Python GPU | Collaborative Filtering | Deep learning algorithm with enhanced performance for implicit feedback |
| Restricted Boltzmann Machines (RBM) | Python CPU / Python GPU | Collaborative Filtering | Neural network based algorithm for learning the underlying probability distribution for explicit or implicit feedback |
| FastAI Embedding Dot Bias (FAST) | Python CPU / Python GPU | Collaborative Filtering | General purpose algorithm with embeddings and biases for users and items |
| Alternating Least Squares (ALS) | PySpark | Collaborative Filtering | Matrix factorization algorithm for explicit or implicit feedback in large datasets, optimized by Spark MLlib for scalability and distributed computing capability |
NOTE - * indicates algorithms invented/contributed to by Microsoft.
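To give a feel for what "similarity-based" collaborative filtering on implicit feedback can look like, the sketch below ranks unseen items by how often they co-occur with items a user has already interacted with. This is a generic toy under simplifying assumptions, not the SAR implementation or any other algorithm in the table; the function names and the `history` data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence(user_items):
    """Count, for each pair of items, how many users interacted with both."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in user_items.values():
        for a, b in combinations(sorted(set(items)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def recommend(user, user_items, counts, k=10):
    """Rank unseen items by summed co-occurrence with the user's seen items."""
    seen = set(user_items[user])
    scores = defaultdict(int)
    for item in seen:
        for other, c in counts[item].items():
            if other not in seen:
                scores[other] += c
    # Sort by score (descending), then by item id so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:k]]


# Hypothetical implicit-feedback history: user -> items interacted with.
history = {
    "u1": ["a", "b", "c"],
    "u2": ["a", "b"],
    "u3": ["b", "c", "d"],
    "u4": ["a"],
}
counts = cooccurrence(history)
print(recommend("u2", history, counts))  # ['c', 'd']
```

Raw co-occurrence counts favor popular items; production algorithms typically normalize the similarity and weight interactions, which this sketch omits.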
We provide a comparison notebook to illustrate how different algorithms could be evaluated and compared. In this notebook, data (MovieLens 1M) is randomly split into training/test sets at a 75/25 ratio. A recommendation model is trained using each of the collaborative filtering algorithms below. We use empirical parameter values reported in the literature. For ranking metrics we use k = 10 (top 10 recommended items). We run the comparison on a Standard NC6s_v2 Azure DSVM (6 vCPUs, 112 GB memory and 1 P100 GPU). Spark ALS is run in local standalone mode.
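For readers unfamiliar with ranking metrics at a cutoff k, the following sketch computes precision@k and a binary-relevance NDCG@k for a single user with k = 10. The function names and sample data are hypothetical, and the definitions used by the repository's own evaluation utilities may differ in detail (for example, in how they normalize or aggregate over users).

```python
import math


def precision_at_k(recommended, relevant, k=10):
    """Fraction of the top-k recommended items that are relevant."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k


def ndcg_at_k(recommended, relevant, k=10):
    """Binary-relevance NDCG: DCG of the top-k list over the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so position 1 -> log2(2)
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical top-10 list for one user and that user's held-out test items.
recommended = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
relevant = {"a", "c", "z"}

print(precision_at_k(recommended, relevant, k=10))  # 0.2
print(round(ndcg_at_k(recommended, relevant, k=10), 3))  # 0.704
```

In a full comparison these per-user values would be averaged over all users in the test set.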
This project welcomes contributions and suggestions. Before contributing, please see our contribution guidelines.