# Building a Recommender Engine

"Users who liked ... also liked ...": nowadays, **recommender engines** are everywhere on the web. A recommender engine is any of a large variety of algorithms that recommend items to users while trying to maximize the likelihood that the user will select them. A common approach is **collaborative filtering**, so called because such algorithms let a user draw on the input of many previous users to help them sift through the data.


## Preamble

In [None]:
import pandas

In [None]:
import data_science_learning_paths

## Example: Generating Movie Recommendations


In this example, we are going to build a simple recommender engine for movies. Given the ratings (1-5 stars) that a user has given to movies, the engine is going to predict the ratings that the user is likely to give to previously unseen movies.

We are going to use [`surprise`](http://surpriselib.com/), a library in the style of `scikit-learn`, made specifically for recommender engines.

In [None]:
import surprise

### Loading the Data

Our training data comes from the [MovieLens](https://grouplens.org/datasets/movielens/) dataset.

In [None]:
data_dir = "../.assets/data/movielens/small"

In [None]:
movies = pandas.read_csv(f"{data_dir}/movies.csv")
ratings = pandas.read_csv(f"{data_dir}/ratings.csv")

In [None]:
movies.head()

In [None]:
ratings.head()

In [None]:
!head {data_dir}/ratings.csv

`surprise` algorithms expect the data in the library's own data format:

In [None]:
ratings = surprise.Dataset.load_from_file(
    file_path=f"{data_dir}/ratings.csv",
    reader=surprise.Reader(
        line_format="user item rating timestamp", 
        sep=",", 
        skip_lines=1
    )
)

In [None]:
ratings
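
To make the `line_format`, `sep` and `skip_lines` arguments more concrete, here is a rough pure-Python sketch of what reading such a file amounts to. It is an illustration only, not how `surprise` is implemented: the helper `parse_rating_line` and the sample lines are made up. One detail worth noticing is that user and item IDs read from a file stay strings, which matters when comparing IDs later on.

```python
def parse_rating_line(line, sep=","):
    """Split one 'user item rating timestamp' line into its fields.

    Hypothetical helper for illustration only.
    """
    user, item, rating, timestamp = line.strip().split(sep)
    # IDs are kept as strings; only the rating is converted to a number.
    return user, item, float(rating), timestamp


# Made-up sample in the same layout as ratings.csv (header + data rows).
sample_lines = [
    "userId,movieId,rating,timestamp",
    "1,31,2.5,1260759144",
    "2,10,4.0,835355493",
]

# skip_lines=1 corresponds to dropping the header row.
parsed = [parse_rating_line(line) for line in sample_lines[1:]]
print(parsed)
```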

### Training a Recommendation Model

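We try the **SVD** algorithm and check how accurate its predictions are using two well-known regression error metrics, root mean squared error (RMSE) and mean absolute error (MAE), estimated with 5-fold cross-validation: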

In [None]:
%%time
surprise.model_selection.cross_validate(
    surprise.SVD(), 
    ratings, 
    measures=['RMSE', 'MAE'], 
    cv=5, 
    verbose=True
)
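
For intuition about the two metrics reported above, the following sketch computes them by hand on a handful of made-up ratings. The functions `rmse` and `mae` are written here for illustration and are not the ones `surprise` uses internally.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: penalizes large errors more strongly."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error: average size of the error, in stars."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up true and predicted star ratings for four (user, movie) pairs.
actual = [4.0, 3.0, 5.0, 1.0]
predicted = [3.5, 3.0, 4.0, 2.0]

print(f"RMSE: {rmse(actual, predicted):.3f}")
print(f"MAE:  {mae(actual, predicted):.3f}")
```

For these numbers the RMSE is 0.75 and the MAE is 0.625. Both are measured in stars, so an RMSE below 1 means the typical prediction is off by less than one star.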


### Example Recommendations

As a sanity check, let's pick out a user and look at their ratings and the recommendations generated:

In [None]:
from surprise.model_selection import train_test_split

In [None]:
ratings_train, ratings_test = train_test_split(ratings, test_size=.25)


In [None]:
predictions = surprise.SVD().fit(ratings_train).test(ratings_test)

In [None]:
predicted_ratings = pandas.DataFrame(
    [
        {"userId": pred.uid, "movieId": pred.iid, "rating": pred.est} for pred in predictions
    ],
    columns=["userId", "movieId", "rating"],
)

In [None]:
movies.head()

In [None]:
movies.dtypes

In [None]:
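# surprise keeps the raw IDs read from the file as strings,
# so cast movieId to str to make the join below match.
movies["movieId"] = movies["movieId"].astype("str")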

In [None]:
predicted_ratings = predicted_ratings.join(movies.set_index("movieId"), on="movieId")

In [None]:
predicted_ratings.head()

In [None]:
example_user = "642"

In [None]:
predicted_ratings[predicted_ratings["userId"] == example_user]

In [None]:
example_user = "42"

In [None]:
predicted_ratings[predicted_ratings["userId"] == example_user]
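
The tables above list predicted ratings for individual user/movie pairs. To turn such predictions into an actual recommendation list, one option is to keep only the highest predicted ratings per user. The sketch below shows the idea on made-up tuples; the function `top_n` is a hypothetical helper, not part of `surprise`.

```python
from collections import defaultdict


def top_n(predictions, n=3):
    """Return the n items with the highest estimated rating per user.

    `predictions` is an iterable of (user_id, item_id, estimated_rating).
    """
    by_user = defaultdict(list)
    for user_id, item_id, estimate in predictions:
        by_user[user_id].append((item_id, estimate))
    return {
        user_id: sorted(items, key=lambda pair: pair[1], reverse=True)[:n]
        for user_id, items in by_user.items()
    }


# Made-up predictions; with surprise they could be built as
# [(pred.uid, pred.iid, pred.est) for pred in predictions].
toy_predictions = [
    ("42", "10", 3.1),
    ("42", "20", 4.6),
    ("42", "30", 2.2),
    ("7", "10", 4.0),
]

recommendations = top_n(toy_predictions, n=2)
print(recommendations["42"])
```

In a real application the candidates would be movies the user has not rated yet, rather than pairs taken from a test set.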

## So how does it actually work?

In this course we do not go deep into the mathematics or algorithmics of machine learning, but since you asked: The SVD algorithm used above relies on a mathematical technique called **matrix factorization**. [This blogpost](https://beckernick.github.io/matrix-factorization-recommender/) explains the approach, also using the movie ratings data set. As usual in machine learning, matrix factorization entails an optimization problem. `surprise`'s `SVD` solves it with stochastic gradient descent; **alternating least squares** (ALS) is another fast and parallelizable way of solving it, as [explained here](https://www.quora.com/What-is-the-Alternating-Least-Squares-method-in-recommendation-systems-And-why-does-this-algorithm-work-intuition-behind-this).
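
To give a feeling for matrix factorization, here is a minimal pure-Python sketch. It represents every user and every item by a short vector of latent factors and predicts a rating as their dot product. The factors are fitted with plain stochastic gradient descent rather than alternating least squares, and the bias terms that `surprise`'s `SVD` adds are left out to keep it short. All names (`factorize`, `predict`, `rmse`), the hyperparameter values and the toy ratings are assumptions made for this example.

```python
import random


def predict(P, Q, u, i):
    """Predicted rating = dot product of user and item factor vectors."""
    return sum(pu * qi for pu, qi in zip(P[u], Q[i]))


def rmse(P, Q, ratings):
    """Root mean squared error on (user, item, rating) triples."""
    total = sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings)
    return (total / len(ratings)) ** 0.5


def factorize(ratings, n_users, n_items, n_factors=2,
              lr=0.01, reg=0.02, n_epochs=500, seed=0):
    """Fit user factors P and item factors Q with plain SGD."""
    rng = random.Random(seed)
    P = [[rng.uniform(0, 1) for _ in range(n_factors)] for _ in range(n_users)]
    Q = [[rng.uniform(0, 1) for _ in range(n_factors)] for _ in range(n_items)]
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(P, Q, u, i)
            for f in range(n_factors):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on the squared error plus L2 regularization.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q


# Made-up ratings as (user index, item index, stars); not MovieLens data.
toy_ratings = [
    (0, 0, 5), (0, 1, 4), (0, 2, 1),
    (1, 0, 4), (1, 2, 1), (1, 3, 2),
    (2, 0, 1), (2, 1, 1), (2, 3, 5),
]

# n_epochs=0 returns the random starting point, for comparison.
P0, Q0 = factorize(toy_ratings, n_users=3, n_items=4, n_epochs=0)
P, Q = factorize(toy_ratings, n_users=3, n_items=4)

print(f"training RMSE before: {rmse(P0, Q0, toy_ratings):.3f}")
print(f"training RMSE after:  {rmse(P, Q, toy_ratings):.3f}")
# Estimate for a pair that is missing from the toy data (user 0, item 3).
print(f"predicted rating: {predict(P, Q, 0, 3):.2f}")
```

The last line is the actual point of the exercise: once the factors are fitted, the model can produce an estimate for any user/item pair, including pairs for which no rating was observed.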

---
_This notebook is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright © 2018-2025 [Point 8 GmbH](https://point-8.de)_