[![Binder](https://mybinder.org/badge_logo.svg)](https://lab.mlpack.org/v2/gh/mlpack/examples/master?urlpath=lab%2Ftree%2Fmovie_lens_prediction_with_cf%2Fmovie-lens-cf-cpp.ipynb)

In [1]:
/**
 * @file movie-lens-cf-cpp.ipynb
 *
 * A simple example usage of Collaborative Filtering (CF)
 * applied to the MovieLens dataset.
 * 
 * https://grouplens.org/datasets/movielens/
 */

In [2]:
!rm -rf ml-latest-small && wget -q -O tmp.zip https://lab.mlpack.org/data/MovieLens-small.zip && unzip tmp.zip && rm tmp.zip

Archive:  tmp.zip
   creating: ml-latest-small/
  inflating: ml-latest-small/links.csv  
  inflating: ml-latest-small/tags.csv  
  inflating: ml-latest-small/ratings.csv  
  inflating: ml-latest-small/README.txt  
  inflating: ml-latest-small/movies.csv  


In [3]:
#include <mlpack/xeus-cling.hpp>

#include <mlpack/core.hpp>
#include <mlpack/core/data/split_data.hpp>

#include <mlpack/methods/cf/decomposition_policies/regularized_svd_method.hpp>
#include <mlpack/methods/cf/cf.hpp>

#include <algorithm>
#include <fstream>
#include <iomanip>
#include <sstream>
#include <string>
#include <vector>

In [4]:
// Header files to create and show the plot.
#define WITHOUT_NUMPY 1
#include "matplotlibcpp.h"
#include "xwidgets/ximage.hpp"
#include "wordcloud.hpp"

namespace plt = matplotlibcpp;

In [5]:
using namespace mlpack;

In [6]:
using namespace mlpack::cf;

In [7]:
/**
 * The MovieLens dataset contains a set of movie ratings from the MovieLens website,
 * a movie recommendation service. This dataset was collected and maintained by
 * GroupLens, a research group at the University of Minnesota.
 *
 * There are 5 versions included: "25m", "latest-small", "100k", "1m", "20m".
 *
 * In this example, we are working on the "latest-small" dataset,
 * which is a small subset of the latest version of the MovieLens dataset.
 * It is changed and updated over time by GroupLens.
 *
 * The dataset has roughly 100,000 ratings and 3,600 tag applications applied
 * to roughly 9,700 movies by roughly 600 users.
 */

// Load ratings file.
arma::mat ratings;
data::Load("ml-latest-small/ratings.csv", ratings);
// Ignore the timestamp column and the header.
ratings = ratings.submat(0, 1, ratings.n_rows - 2, ratings.n_cols - 1);

// Load movies file.
std::vector<size_t> moviesId;
std::vector<std::string> moviesTitle;
std::vector<std::string> moviesGenres;

std::ifstream moviesFile("ml-latest-small/movies.csv");
std::string line;
size_t lineNum = 0;
while (getline(moviesFile, line))
{
    std::stringstream linestream(line);
    std::string value;
    
    size_t valueNum = 0;
    while (getline(linestream, value, ','))
    {
        if (lineNum > 0 && valueNum == 0)
            moviesId.push_back(std::stoul(value));
        else if (lineNum > 0 && valueNum == 1)
            moviesTitle.push_back(value);
        else if (lineNum > 0 && valueNum == 2)
            moviesGenres.push_back(value);
        
        valueNum++;
    }
    
    lineNum++;
}
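
The loop above splits every line on each comma, so a title that itself contains a comma is cut short (this is why entries such as `"Shawshank Redemption` show up with a leading quote in the output further down). A minimal sketch of a quote-aware splitter is shown below. `SplitCsvLine` is a hypothetical helper, not part of mlpack, and it assumes the usual CSV convention that fields containing commas are wrapped in double quotes.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Split one CSV line into fields, honouring double-quoted fields.
// Commas inside quotes stay in the field; a doubled quote ("") inside a
// quoted field becomes a single literal quote character.
std::vector<std::string> SplitCsvLine(const std::string& line)
{
    std::vector<std::string> fields;
    std::string current;
    bool inQuotes = false;

    for (size_t i = 0; i < line.size(); ++i)
    {
        const char c = line[i];
        if (inQuotes)
        {
            if (c == '"' && i + 1 < line.size() && line[i + 1] == '"')
            {
                current += '"';  // Escaped quote inside a quoted field.
                ++i;
            }
            else if (c == '"')
            {
                inQuotes = false;  // Closing quote.
            }
            else
            {
                current += c;
            }
        }
        else if (c == '"')
        {
            inQuotes = true;  // Opening quote.
        }
        else if (c == ',')
        {
            fields.push_back(current);  // Field boundary.
            current.clear();
        }
        else
        {
            current += c;
        }
    }
    fields.push_back(current);  // Last field has no trailing comma.
    return fields;
}
```

With such a helper, the inner `getline(linestream, value, ',')` loop could be replaced by indexing into the returned vector (field 0 is the id, field 1 the title, field 2 the genres).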

In [8]:
// Print the first 10 rows of the ratings data.
std::cout << "   userId       movieId      rating\n";
ratings.cols(0, 9).t().print()

   userId       movieId      rating
   1.0000e+00   1.0000e+00   4.0000e+00
   1.0000e+00   3.0000e+00   4.0000e+00
   1.0000e+00   6.0000e+00   4.0000e+00
   1.0000e+00   4.7000e+01   5.0000e+00
   1.0000e+00   5.0000e+01   5.0000e+00
   1.0000e+00   7.0000e+01   3.0000e+00
   1.0000e+00   1.0100e+02   5.0000e+00
   1.0000e+00   1.1000e+02   4.0000e+00
   1.0000e+00   1.5100e+02   5.0000e+00
   1.0000e+00   1.5700e+02   5.0000e+00


We can see that user 1 rated the movies with ids 1, 3, 6 and 110 with 4.0,
the movies with ids 47, 50, 101, 151 and 157 with 5.0, and the movie
with id 70 with 3.0.

In [9]:
// Print the first 10 rows of the movies data.
std::cout << std::left << std::setw(10) << "id" << std::setw(40) << "title" << "genres" << std::endl;
for (size_t i = 0; i < 10; ++i)
{
    std::cout << std::left << std::setw(10)
              << moviesId[i]
              << std::setw(40)
              << moviesTitle[i]
              << moviesGenres[i] << std::endl;
}

id        title                                   genres
1         Toy Story (1995)                        Adventure|Animation|Children|Comedy|Fantasy
2         Jumanji (1995)                          Adventure|Children|Fantasy
3         Grumpier Old Men (1995)                 Comedy|Romance
4         Waiting to Exhale (1995)                Comedy|Drama|Romance
5         Father of the Bride Part II (1995)      Comedy
6         Heat (1995)                             Action|Crime|Thriller
7         Sabrina (1995)                          Comedy|Romance
8         Tom and Huck (1995)                     Adventure|Children
9         Sudden Death (1995)                     Action
10        GoldenEye (1995)                        Action|Adventure|Thriller


The movies file describes roughly 9,700 movies.
It has 3 columns: the movie ID, the title, and the genres.
Each movie's genres are separated by '|' and drawn from 18 genres (Action, Adventure, Animation,
Children, Comedy, Crime, Documentary, Drama, Fantasy, Film-Noir, Horror,
Musical, Mystery, Romance, Sci-Fi, Thriller, War, Western).

In [10]:
// Create a wordcloud of the movie titles.
std::string movieTitleCloudString = "";
for (size_t s = 0; s < moviesTitle.size(); ++s)
    movieTitleCloudString += moviesTitle[s] + ";";

WordCloud(movieTitleCloudString, "movie-title-word-cloud.png", 400, 1000);
auto im = xw::image_from_file("movie-title-word-cloud.png").finalize();
im

A Jupyter widget

"The", "Man", "Love", "Dead" and "Day" are among the most commonly occurring words in movie titles.

In [11]:
// Create a wordcloud of the movie genres.
std::string movieGenresCloudString = "";
for (size_t g = 0; g < moviesGenres.size(); ++g)
    movieGenresCloudString += moviesGenres[g] + ";";

// Replace all '|' with ';', since that's
// what the WordCloud method uses as delimiter.
std::replace(movieGenresCloudString.begin(),
             movieGenresCloudString.end(), '|', ';');

WordCloud(movieGenresCloudString, "movie-genres-word-cloud.png", 400, 1000);
auto im = xw::image_from_file("movie-genres-word-cloud.png").finalize();
im

A Jupyter widget

Drama, Comedy and Action are among the most commonly occurring movie genres.

In [12]:
// Get summary statistics of the ratings.
std::cout << std::setw(10) << "count" << ratings.n_cols << std::endl;
std::cout << std::setw(10) << "mean" << arma::mean(ratings.row(2)) << std::endl;
std::cout << std::setw(10) << "std" << arma::stddev(ratings.row(2)) << std::endl;
std::cout << std::setw(10) << "min" << arma::min(ratings.row(2)) << std::endl;
std::cout << std::setw(10) << "max" << arma::max(ratings.row(2)) << std::endl;
std::cout << std::setw(10) << "range" << arma::range(ratings.row(2)) << std::endl;

count     100836
mean      3.50156
std       1.04253
min       0.5
max       5
range     4.5
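
To make the numbers above easier to check by hand, here is a minimal standard-library sketch of the mean and the standard deviation. It assumes the default Armadillo behaviour for `arma::stddev`, which normalises by N - 1 (the sample standard deviation). `Mean` and `SampleStdDev` are hypothetical helper names used only for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Arithmetic mean of a non-empty list of values.
double Mean(const std::vector<double>& values)
{
    assert(!values.empty());
    double sum = 0.0;
    for (double v : values)
        sum += v;
    return sum / values.size();
}

// Sample standard deviation: squared deviations from the mean are summed
// and divided by N - 1 before taking the square root.
double SampleStdDev(const std::vector<double>& values)
{
    assert(values.size() > 1);
    const double mean = Mean(values);
    double squared = 0.0;
    for (double v : values)
        squared += (v - mean) * (v - mean);
    return std::sqrt(squared / (values.size() - 1));
}
```

For the ten ratings of user 1 printed earlier (4, 4, 4, 5, 5, 3, 5, 4, 5, 5) the mean works out to 4.4.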


In [13]:
// Plot ratings histogram.
std::vector<double> hist = arma::conv_to<std::vector<double>>::from(ratings.row(2).t());

plt::figure_size(400, 400);
plt::xlabel("ratings");
plt::hist(hist);

plt::save("./hist.png");
auto im = xw::image_from_file("hist.png").finalize();
im

A Jupyter widget

The mean rating is about 3.5 on a scale from 0.5 to 5. Roughly half of the ratings are a 3 or a 4.
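
A statement about the histogram can be checked numerically by counting how often each rating value occurs. The sketch below does this with the standard library only; `CountRatings` and `FractionWithValues` are hypothetical helpers introduced for illustration.

```cpp
#include <cstddef>
#include <initializer_list>
#include <map>
#include <vector>

// Count how often each distinct rating value occurs. Ratings are multiples
// of 0.5, which doubles represent exactly, so they are safe to use as keys.
std::map<double, size_t> CountRatings(const std::vector<double>& ratings)
{
    std::map<double, size_t> counts;
    for (double r : ratings)
        ++counts[r];
    return counts;
}

// Fraction of all counted ratings whose value is one of `values`.
double FractionWithValues(const std::map<double, size_t>& counts,
                          std::initializer_list<double> values)
{
    size_t total = 0, selected = 0;
    for (const auto& entry : counts)
        total += entry.second;
    for (double v : values)
    {
        auto it = counts.find(v);
        if (it != counts.end())
            selected += it->second;
    }
    return total == 0 ? 0.0 : static_cast<double>(selected) / total;
}
```

In the notebook, the `hist` vector built for the plot could be passed to `CountRatings`, and `FractionWithValues(counts, {3.0, 4.0})` would then give the share of ratings that are a 3 or a 4.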

In [14]:
// Hold out 10% of the dataset into a test set so we can evaluate performance.
arma::mat ratingsTrain, ratingsTest;
data::Split(ratings, ratingsTrain, ratingsTest, 0.1);
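
The idea behind a hold-out split can be sketched without mlpack: shuffle the column indices, then cut off the first 10% as the test set. This is an illustration of the technique under that assumption, not mlpack's implementation of `data::Split`; `HoldOutSplit` and its `seed` parameter are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Shuffle the indices 0..n-1 and return (train indices, test indices),
// where the test part holds floor(n * testRatio) entries.
std::pair<std::vector<size_t>, std::vector<size_t>>
HoldOutSplit(size_t n, double testRatio, unsigned seed)
{
    assert(testRatio >= 0.0 && testRatio <= 1.0);

    std::vector<size_t> indices(n);
    std::iota(indices.begin(), indices.end(), size_t(0));

    std::mt19937 rng(seed);
    std::shuffle(indices.begin(), indices.end(), rng);

    const size_t testSize = static_cast<size_t>(n * testRatio);
    std::vector<size_t> test(indices.begin(), indices.begin() + testSize);
    std::vector<size_t> train(indices.begin() + testSize, indices.end());
    return {train, test};
}
```

Every index ends up in exactly one of the two sets, so no rating is used for both training and evaluation.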

In [15]:
// Train the model. Change the rank to increase/decrease the complexity
// of the model.
//
// For more information check out https://www.mlpack.org/doc/stable/python_documentation.html#cf
// or uncomment the line below.
// ?CF

// Note: the batch size is 1 in our implementation of Regularized SVD.
// A batch size other than 1 is not supported yet.
CFType<cf::RegSVDPolicy> cfModel(ratingsTrain);

[WARN ] The batch size for optimizing RegularizedSVD is 1.
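
To give an intuition for what a regularized SVD model learns, the sketch below factorizes a handful of ratings with plain stochastic gradient descent, one rating per update (batch size 1). It is a textbook-style illustration, not mlpack's code: `Rating`, `FactorModel`, `TrainSgd`, `Rmse`, the deterministic initialisation and all parameter values are assumptions made for this example.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// One observed rating: (user index, item index, value).
struct Rating
{
    size_t user;
    size_t item;
    double value;
};

// Factor matrices of a low-rank model; a prediction is a dot product.
struct FactorModel
{
    std::vector<std::vector<double>> userFactors;
    std::vector<std::vector<double>> itemFactors;

    double Predict(size_t u, size_t i) const
    {
        double dot = 0.0;
        for (size_t k = 0; k < userFactors[u].size(); ++k)
            dot += userFactors[u][k] * itemFactors[i][k];
        return dot;
    }
};

// Root mean squared error of the model on a set of ratings.
double Rmse(const FactorModel& model, const std::vector<Rating>& data)
{
    double squared = 0.0;
    for (const Rating& r : data)
    {
        const double err = r.value - model.Predict(r.user, r.item);
        squared += err * err;
    }
    return std::sqrt(squared / data.size());
}

// Train with SGD: each rating nudges its user and item factors along the
// error, while lambda shrinks the factors (L2 regularization).
FactorModel TrainSgd(const std::vector<Rating>& data, size_t numUsers,
                     size_t numItems, size_t rank, size_t epochs,
                     double stepSize, double lambda)
{
    FactorModel model;
    model.userFactors.assign(numUsers, std::vector<double>(rank));
    model.itemFactors.assign(numItems, std::vector<double>(rank));
    // Small deterministic starting values (an illustrative choice; real
    // implementations typically initialise randomly).
    for (size_t u = 0; u < numUsers; ++u)
        for (size_t k = 0; k < rank; ++k)
            model.userFactors[u][k] = 0.1 * ((u + k) % 3 + 1);
    for (size_t i = 0; i < numItems; ++i)
        for (size_t k = 0; k < rank; ++k)
            model.itemFactors[i][k] = 0.1 * ((i + 2 * k) % 3 + 1);

    for (size_t e = 0; e < epochs; ++e)
    {
        for (const Rating& r : data)
        {
            const double err = r.value - model.Predict(r.user, r.item);
            for (size_t k = 0; k < rank; ++k)
            {
                const double p = model.userFactors[r.user][k];
                const double q = model.itemFactors[r.item][k];
                model.userFactors[r.user][k] += stepSize * (err * q - lambda * p);
                model.itemFactors[r.item][k] += stepSize * (err * p - lambda * q);
            }
        }
    }
    return model;
}
```

Calling `Rmse` on held-out ratings, rather than on the training ratings, is the usual way to judge how well such a model generalizes.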


In [16]:
// Now query the top 10 movies for user 2.
arma::Mat<size_t> recommendations;
cfModel.GetRecommendations(10, recommendations, {2});

In [17]:
// Get the names of the movies for user 2.
std::cout << "Recommendations for user 2:" << std::endl;
for (size_t i = 0; i < recommendations.n_elem; ++i)
{
    std::vector<size_t>::iterator it = std::find(moviesId.begin(),
        moviesId.end(), (size_t)recommendations[i]);
    // Skip ids that do not appear in movies.csv to avoid reading past the end.
    if (it == moviesId.end())
        continue;
    size_t index = std::distance(moviesId.begin(), it);

    std::cout << "  " << i << ":  " << moviesTitle[index] << std::endl;
}

Recommendations for user 2:
  0:  Bent (1997)
  1:  Play Time (a.k.a. Playtime) (1967)
  2:  Dylan Moran: Monster (2004)
  3:  "Mist
  4:  Seve (2014)
  5:  Freeway (1996)
  6:  Damien: Omen II (1978)
  7:  Twin Dragons (Shuang long hui) (1992)
  8:  Pickpocket (1959)
  9:  Saving Face (2004)
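
Conceptually, a top-10 list like the one above is the result of scoring every movie for the user and keeping the best ones. The sketch below shows that selection step with `std::partial_sort`, assuming the usual convention that movies the user has already rated are excluded. `TopK` is a hypothetical helper, not the mlpack API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Return the indices of the k highest-scoring items that the user has not
// rated yet, best first.
std::vector<size_t> TopK(const std::vector<double>& scores,
                         const std::vector<bool>& alreadyRated,
                         size_t k)
{
    assert(scores.size() == alreadyRated.size());

    // Only unrated items are candidates.
    std::vector<size_t> candidates;
    for (size_t i = 0; i < scores.size(); ++i)
        if (!alreadyRated[i])
            candidates.push_back(i);

    // Order just the first k candidates by descending score.
    k = std::min(k, candidates.size());
    std::partial_sort(candidates.begin(), candidates.begin() + k,
        candidates.end(),
        [&scores](size_t a, size_t b) { return scores[a] > scores[b]; });
    candidates.resize(k);
    return candidates;
}
```

`std::partial_sort` avoids sorting all candidates when only a few are needed, which matters with thousands of movies per user.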


In [18]:
// Print the movie ratings for user 2 from the dataset.
std::cout << "Ratings for user 2:" << std::endl;
for (size_t i = 0, r = 0; i < ratings.n_cols; ++i)
{
    if ((size_t)ratings.col(i)(0) == 2)
    {
        std::vector<size_t>::iterator it = std::find(moviesId.begin(),
            moviesId.end(), (size_t)ratings.col(i)(1));
        // Skip ids that do not appear in movies.csv to avoid reading past the end.
        if (it == moviesId.end())
            continue;
        size_t index = std::distance(moviesId.begin(), it);

        std::cout << "  " << r++ << ":  "
                  << std::fixed << std::setprecision(1)
                  << ratings.col(i)(2)
                  << "  - " << moviesTitle[index] << std::endl;
    }
}

Ratings for user 2:
  0:  3.0  - "Shawshank Redemption
  1:  4.0  - Tommy Boy (1995)
  2:  4.5  - Good Will Hunting (1997)
  3:  4.0  - Gladiator (2000)
  4:  4.0  - Kill Bill: Vol. 1 (2003)
  5:  3.5  - Collateral (2004)
  6:  4.0  - Talladega Nights: The Ballad of Ricky Bobby (2006)
  7:  4.0  - "Departed
  8:  4.5  - "Dark Knight
  9:  5.0  - Step Brothers (2008)
  10:  4.5  - Inglourious Basterds (2009)
  11:  3.0  - Zombieland (2009)
  12:  4.0  - Shutter Island (2010)
  13:  3.0  - Exit Through the Gift Shop (2010)
  14:  4.0  - Inception (2010)
  15:  4.5  - "Town
  16:  5.0  - Inside Job (2010)
  17:  4.0  - Louis C.K.: Hilarious (2010)
  18:  5.0  - Warrior (2011)
  19:  3.5  - "Dark Knight Rises
  20:  2.5  - "Girl with the Dragon Tattoo
  21:  3.5  - Django Unchained (2012)
  22:  5.0  - "Wolf of Wall Street
  23:  3.0  - Interstellar (2014)
  24:  4.0  - Whiplash (2014)
  25:  2.0  - The Drop (2014)
  26:  3.5  - Ex Machina (2015)
  27:  5.0  - Mad Max: Fury Road (2015)
  2

Here is some example output, showing that user 2 seems to have an interesting taste in movies.
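
The title lookups in the cells above scan `moviesId` with `std::find` for every movie, which is linear in the number of movies. If many lookups are needed, a hash map from movie id to vector position is a common alternative. The sketch below is one way to do it; `BuildIdIndex` and `TitleFor` are hypothetical helpers, and the `<unknown movie>` placeholder is an arbitrary choice.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Map each movie id to its position in the id/title vectors.
std::unordered_map<size_t, size_t> BuildIdIndex(const std::vector<size_t>& ids)
{
    std::unordered_map<size_t, size_t> index;
    index.reserve(ids.size());
    for (size_t i = 0; i < ids.size(); ++i)
        index.emplace(ids[i], i);
    return index;
}

// Look up a title by movie id; ids that are missing from the index (or
// that point past the end of `titles`) yield a placeholder.
std::string TitleFor(size_t movieId,
                     const std::unordered_map<size_t, size_t>& index,
                     const std::vector<std::string>& titles)
{
    const auto it = index.find(movieId);
    if (it == index.end() || it->second >= titles.size())
        return std::string("<unknown movie>");
    return titles[it->second];
}
```

The index is built once from `moviesId`, after which each lookup takes constant time on average.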