# Assignment: Recommend new movies to film fans

In this assignment you're going to build a movie recommendation system that can recommend new movies to film fans.

The first thing you'll need is a data file with thousands of movies rated by many different users. The [MovieLens Project](https://movielens.org) has exactly what you need.

The data files **recommendation-movies.csv**, **recommendation-ratings-test.csv** and **recommendation-ratings-train.csv** have already been downloaded and are available to your code. There are 100,000 movie ratings in total with 99,980 set aside for training and 20 for testing. 

The training and testing files are in CSV format and look like this:

![Data File](./assets/data.png)

There are only four columns of data:

* The ID of the user
* The ID of the movie
* The movie rating on a scale from 1–5
* The timestamp of the rating

There's also a movie dictionary in CSV format with all the movie IDs and titles:


![Data File](./assets/movies.png)

You are going to build a data science model that reads in each user ID, movie ID, and rating, and then predicts the ratings each user would give for every movie in the dataset.

Once you have a fully trained model, you can easily add a new user with a couple of favorite movies and then ask the model to generate predictions for any of the other movies in the dataset.

In broad strokes, this is also how the recommendation systems on Netflix and Amazon work.

Let's get started. You'll need to install the following packages:

In [2]:
#r nuget:Microsoft.ML
#r nuget:Microsoft.ML.Recommender

This will install the Microsoft ML.NET library and the extension for building recommendation systems. 

Now you're ready to add some classes. You will need one class to hold a movie rating, and one to hold your model’s predictions.

In [3]:
using System;
using System.IO;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Trainers;
using Microsoft.ML.Data;

/// <summary>
/// The MovieRating class holds a single movie rating.
/// </summary>
public class MovieRating
{
    [LoadColumn(0)] public float UserID;
    [LoadColumn(1)] public float MovieID;
    [LoadColumn(2)] public float Label;
}

/// <summary>
/// The MovieRatingPrediction class holds a single movie prediction.
/// </summary>
public class MovieRatingPrediction
{
    public float Label;
    public float Score;
}

The **MovieRating** class holds a single movie rating. Note how each field is tagged with a **LoadColumn** attribute that tells the CSV data loading code which column to import data from.

You're also declaring a **MovieRatingPrediction** class which will hold a single movie rating prediction.

Now you need to load the training and test data:

In [5]:
// filenames for training and test data
private static string trainingDataPath = Path.Combine(Environment.CurrentDirectory, "recommendation-ratings-train.csv");
private static string testDataPath = Path.Combine(Environment.CurrentDirectory, "recommendation-ratings-test.csv");

// set up a new machine learning context
var context = new MLContext();

// load training and test data
Console.Write("Loading data...");
var trainingDataView = context.Data.LoadFromTextFile<MovieRating>(trainingDataPath, hasHeader: true, separatorChar: ',');
var testDataView = context.Data.LoadFromTextFile<MovieRating>(testDataPath, hasHeader: true, separatorChar: ',');
Console.WriteLine("done");

Loading data...done


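This code uses the **LoadFromTextFile** method to set up a data view over each CSV file. ML.NET data views are evaluated lazily, so the rows are actually read when a later step (such as training) asks for them. The **LoadColumn** attributes tell the method which CSV column goes into which field of the **MovieRating** class.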
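To make the column mapping concrete, here is a minimal sketch of what it amounts to for a single line of the ratings file. This is not how ML.NET is implemented; **RatingRow** and **RatingParser** are hypothetical names made up for this illustration, and the sample line is invented rather than taken from the dataset. It is written as a notebook cell, like the other code in this assignment.

```csharp
using System;
using System.Globalization;

// Hypothetical stand-in for MovieRating, without the ML.NET attributes.
public class RatingRow
{
    public float UserID;   // taken from column 0
    public float MovieID;  // taken from column 1
    public float Label;    // taken from column 2 (the rating)
}

// Hypothetical helper: parses one "user,movie,rating,timestamp" line by column index.
public static class RatingParser
{
    public static RatingRow Parse(string line)
    {
        string[] fields = line.Split(',');
        return new RatingRow
        {
            UserID = float.Parse(fields[0], CultureInfo.InvariantCulture),
            MovieID = float.Parse(fields[1], CultureInfo.InvariantCulture),
            Label = float.Parse(fields[2], CultureInfo.InvariantCulture)
            // fields[3] (the timestamp) is ignored: no field is mapped to column 3
        };
    }
}

// invented sample line, not a row from the real file
var row = RatingParser.Parse("999,10,3.5,1500000000");
Console.WriteLine($"user {row.UserID}, movie {row.MovieID}, rating {row.Label}");
```

Note that the timestamp column is simply never read, which is also why **MovieRating** has no field for it.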

Now you're ready to start building the machine learning model:

In [8]:
// prepare matrix factorization options
var options = new MatrixFactorizationTrainer.Options
{
    MatrixColumnIndexColumnName = "UserIDEncoded",
    MatrixRowIndexColumnName = "MovieIDEncoded", 
    LabelColumnName = "Label",
    NumberOfIterations = 20,
    ApproximationRank = 100
};

// set up a training pipeline
// step 1: map UserID and MovieID to keys
var pipeline = context.Transforms.Conversion.MapValueToKey(
        inputColumnName: "UserID",
        outputColumnName: "UserIDEncoded")
    .Append(context.Transforms.Conversion.MapValueToKey(
        inputColumnName: "MovieID",
        outputColumnName: "MovieIDEncoded"))

    // step 2: find recommendations using matrix factorization
    .Append(context.Recommendation().Trainers.MatrixFactorization(options));

// train the model
Console.Write("Training the model...");
var model = pipeline.Fit(trainingDataView);  
Console.WriteLine("done");

Training the model...done


Machine learning models in ML.NET are built with pipelines, which are sequences of data-loading, transformation, and learning components.

This pipeline has the following components:

* **MapValueToKey** which reads the UserID column and builds a dictionary of unique ID values. It then produces an output column called UserIDEncoded containing an encoding for each ID. This step converts the IDs to numbers that the model can work with.
* Another **MapValueToKey** which reads the MovieID column, encodes it, and stores the encodings in an output column called MovieIDEncoded.
* A **MatrixFactorization** component that performs matrix factorization on the encoded ID columns and the ratings. This step learns the factors from which a rating can be predicted for any combination of user and movie.
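If the key encoding step feels abstract, the following sketch shows the idea behind it. It is a rough mental model only, not the ML.NET implementation: **KeyEncoder** is a hypothetical class invented for this example, the user IDs are made up, and the convention of starting keys at 1 and using 0 for unknown values is stated here as an assumption of the sketch.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of ML.NET:
// gives every distinct ID a key, in order of first occurrence.
public class KeyEncoder
{
    private readonly Dictionary<float, uint> keys = new Dictionary<float, uint>();

    // Build the dictionary of unique IDs from the training values.
    public void Fit(IEnumerable<float> ids)
    {
        foreach (var id in ids)
        {
            if (!keys.ContainsKey(id))
                keys[id] = (uint)(keys.Count + 1);   // keys start at 1
        }
    }

    // Look up the key for one ID; IDs not seen during Fit map to 0.
    public uint Encode(float id) => keys.TryGetValue(id, out var key) ? key : 0u;
}

var encoder = new KeyEncoder();
encoder.Fit(new float[] { 610, 15, 610, 999, 15 });   // made-up user IDs

Console.WriteLine(encoder.Encode(999));   // 3: third distinct ID seen
Console.WriteLine(encoder.Encode(42));    // 0: never seen during Fit
```

The point of the encoding is that the IDs become a dense range of small integers, which can then serve as row and column indices of the rating matrix.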

With the pipeline fully assembled, you train the model with a call to **Fit**.
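
To get a feel for what **Fit** is optimizing, here is the textbook formulation of matrix factorization. Treat it as a sketch: the symbols $p_u$, $q_m$, $k$ and $\lambda$ are notation introduced here, and the exact loss and regularization that ML.NET uses internally are an assumption, not something this assignment specifies. Each user $u$ and each movie $m$ gets a vector of $k$ numbers, and the predicted rating is their dot product:

$$
\hat{r}_{um} = p_u^{\top} q_m, \qquad p_u,\, q_m \in \mathbb{R}^{k}
$$

Training searches for the vectors that best reproduce the known ratings $r_{um}$ in the training set $\Omega$:

$$
\min_{P,\,Q} \sum_{(u,m) \in \Omega} \left[ \left( r_{um} - p_u^{\top} q_m \right)^2 + \lambda \left( \lVert p_u \rVert^2 + \lVert q_m \rVert^2 \right) \right]
$$

In the options above, **ApproximationRank** = 100 plays the role of $k$, and **NumberOfIterations** = 20 limits how many optimization passes are made over the training ratings.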

You now have a fully trained model. Next, you need to run the model on the test data, predict the rating for each user and movie in it, and calculate the accuracy metrics of the model:

In [9]:
// evaluate the model performance 
Console.WriteLine("Evaluating the model...");
var predictions = model.Transform(testDataView);
var metrics = context.Regression.Evaluate(predictions, labelColumnName: "Label", scoreColumnName: "Score");
Console.WriteLine($"  RMSE: {metrics.RootMeanSquaredError:#.##}");
Console.WriteLine($"  MAE:  {metrics.MeanAbsoluteError:#.##}");
Console.WriteLine($"  MSE:  {metrics.MeanSquaredError:#.##}");

Evaluating the model...
  RMSE: .97
  MAE:  .61
  MSE:  .94


This code uses the **Transform** method to make predictions for every user and movie in the test dataset.

The **Evaluate** method compares these predictions to the actual rating values and automatically calculates three metrics:

* **RootMeanSquaredError**: this is the root mean square error or RMSE value. It’s one of the most widely used metrics for evaluating models that predict a number. Geometrically, RMSE is the length of the vector of individual prediction errors in n-dimensional space, divided by √n.
* **MeanAbsoluteError**: this is the mean absolute prediction error, expressed as a rating.
* **MeanSquaredError**: this is the mean square prediction error, or MSE value. Note that RMSE and MSE are related: RMSE is just the square root of MSE.
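
Written out, with $y_i$ the actual rating, $\hat{y}_i$ the predicted rating, and $n$ the number of test ratings (20 here), the three metrics are defined as follows. The symbols are notation introduced for this sketch.

$$
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|
\qquad
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}
$$

As a quick check on the output above: $0.97^2 \approx 0.94$, which matches the reported MSE.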

My test RMSE is 0.97 and the MAE is 0.61. That means that on average the model is off by slightly over half a rating point. That's a pretty good result!

To wrap up, let’s use the model to make a prediction about me. Here are 6 movies I like:

* Blade Runner
* True Lies
* Speed
* Twelve Monkeys
* Things to Do in Denver When You're Dead
* Cloud Atlas

And 6 more movies I really didn't like at all:

* Ace Ventura: When Nature Calls
* Naked Gun 33 1/3
* Highlander II
* Throw Momma from the Train
* Jingle All the Way
* Dude, Where's My Car?

You'll find my ratings at the very end of the training file. I added myself as user 999. 

So based on this list, do you think I would enjoy the James Bond movie ‘GoldenEye’?

Let's write some code to find out:

In [12]:
// check if Mark likes GoldenEye
var predictionEngine = context.Model.CreatePredictionEngine<MovieRating, MovieRatingPrediction>(model);
var prediction = predictionEngine.Predict(
    new MovieRating()
    {
        UserID = 999,
        MovieID = 10  // GoldenEye
    }
);
Console.WriteLine($"Prediction for Mark liking Goldeneye: {prediction.Score}");

Prediction for Mark liking Goldeneye: 3.3517098


This code uses the **CreatePredictionEngine** method to set up a prediction engine. The two type arguments are the input data class and the class to hold the prediction. And once the prediction engine is set up, you can simply call **Predict** to make a single prediction on a MovieRating instance.

The model predicts that I would give a rating of 3.35 to the movie ‘GoldenEye’. That's actually quite a good prediction. I've seen the movie and found it entertaining, but it's definitely not the best James Bond movie I've ever seen.

Let’s do one more thing and ask the model to predict my top-5 favorite movies. 

We're going to need some helper code:

In [18]:
using System.Collections.Generic;

public class Movie
{
    public int ID;
    public String Title;
}

public static class Movies
{
    public static List<Movie> All = new List<Movie>();
    private static string moviesDataPath = Path.Combine(Environment.CurrentDirectory, "recommendation-movies.csv");

    static Movies()
    {
        All = LoadMovieData(moviesDataPath);
    }

    public static Movie Get(int id)
    {
        return All.Single(m => m.ID == id);
    }

    private static List<Movie> LoadMovieData(String moviesdatasetpath)
    {
        var result = new List<Movie>();

        // skip the header row, then parse the remaining lines one by one
        foreach (var line in File.ReadLines(moviesdatasetpath).Skip(1))
        {
            // ignore blank lines, e.g. a trailing newline at the end of the file
            if (string.IsNullOrWhiteSpace(line))
                continue;

            string[] fields = line.Split(',');

            // the first field is the movie ID
            int movieId = Int32.Parse(fields[0]);

            // titles can contain commas, so rejoin everything between the first and the last field
            string movieTitle = string.Join(',', fields.Skip(1).Take(fields.Length - 2));
            result.Add(new Movie() { ID = movieId, Title = movieTitle });
        }

        return result;
    }
}

This sets up two new classes: **Movie** which holds the identifier and title of a single movie, and **Movies** which is a list of all movies in the dataset. The **LoadMovieData** method will load the entire list of movie titles from a CSV file.

Now we can calculate my top-5 movie list with the following code:

In [17]:
// find Mark's top 5 movies
Console.WriteLine("Calculating Mark's top-5 movies...");
var top5 =  (from m in Movies.All
                let p = predictionEngine.Predict(
                new MovieRating()
                {
                    UserID = 999,
                    MovieID = m.ID
                })
                orderby p.Score descending
                select (MovieId: m.ID, Score: p.Score)).Take(5);
foreach (var t in top5)
    Console.WriteLine($"  Score:{t.Score}\tMovie: {Movies.Get(t.MovieId)?.Title}");

Calculating Mark's top-5 movies...
  Score:4.8665724	Movie: "Three Billboards Outside Ebbing, Missouri (2017)"
  Score:4.795279	Movie: Schindler's List (1993)
  Score:4.758946	Movie: Cinema Paradiso (Nuovo cinema Paradiso) (1989)
  Score:4.7308774	Movie: "General, The (1926)"
  Score:4.724816	Movie: "Shawshank Redemption, The (1994)"


This code uses the helper class **Movies** to enumerate every movie ID. It predicts my rating for every possible movie, sorts the predictions by score in descending order, and takes the top 5 results.

Here are the model predictions for my top-5 movies:

* Three Billboards Outside Ebbing, Missouri
* Schindler's List
* Cinema Paradiso
* The General
* The Shawshank Redemption

This is a good prediction. I've seen Schindler's List and The Shawshank Redemption and enjoyed them a lot. I've heard good things about Three Billboards Outside Ebbing, Missouri, and it sounds like a movie I'd enjoy.

I can't say much about Cinema Paradiso or The General. I'll check them out and let you know. 