

# Movie Ratings

This solution shows how to use machine learning on Google Cloud Platform (GCP) to predict the rating that a given user would assign to a given movie.

The technique used here solves a very general problem: given a matrix where we know only a few of the entries, can we predict the values of the unknown entries? There are many tools for exploring this problem, and almost all of them are outside the scope of this series of notebooks. We will focus on one particular method for predicting the ratings users assign to movies. We will construct a large matrix where each row represents a user, each column represents a movie, and each entry is a number representing the rating that user gave that movie. In the dataset we will use, a rating can be any number in the list [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0], representing "zero stars" up to "5 stars" in "half star" increments.

![Figure of a user-movie rating matrix with missing values.](images/matrix.png "Figure of a user-movie ratings matrix with missing values")
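As a rough illustration of such a matrix, the sketch below builds a tiny user-movie matrix with NumPy. The triples in `observed` are made-up toy data, not values from the dataset, and using `NaN` to mark unknown entries is an assumption of this sketch rather than the representation used in the notebooks.

```python
import numpy as np

# Hypothetical toy data: (user_index, movie_index, rating) triples.
observed = [(0, 0, 4.0), (0, 2, 3.5), (1, 1, 5.0), (2, 0, 0.5), (2, 3, 2.0)]
num_users, num_movies = 3, 4

# NaN marks an unknown entry. Since 0.0 is in the list of possible ratings,
# it cannot double as a "missing" marker.
ratings = np.full((num_users, num_movies), np.nan)
for user, movie, rating in observed:
    ratings[user, movie] = rating

# Boolean mask of the entries we actually know.
known = ~np.isnan(ratings)

print(ratings)
print("fraction of known entries:", known.mean())
```

For a dataset with millions of ratings, a dense array like this would be mostly empty, so a sparse representation (for example, keeping only the triples) is usually the more practical choice.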


# The dataset: MovieLens

For this datalab, we will download and use one of the MovieLens datasets distributed by the GroupLens research group from the Department of Computer Science and Engineering at the University of Minnesota. See their recommendation system called [MovieLens](https://movielens.org/), their publicly accessible [datasets](http://grouplens.org/datasets/movielens/), and [Wikipedia](https://en.wikipedia.org/wiki/MovieLens) for more information. We make no claims about how their recommendation system works; we will just use their data to build an interesting model.


The GroupLens research group has made several datasets publicly available over the years. In this datalab, we will work with the MovieLens 10M dataset, which was released in 2009. It contains about 10 million ratings by 72,000 users on 10,000 movies. The dataset is large enough to support an interesting movie recommendation model while still being small enough to fit in main memory.

# The premise

The premise of our model is straightforward: we associate a k-dimensional real vector with each user and with each movie. To predict the rating for a given user and movie, we compute the [dot product](https://en.wikipedia.org/wiki/Dot_product#Definition) of the user's vector and the movie's vector.

In the figure below, let A be an n-by-m user-movie matrix, U the orange n-by-k matrix, and V the blue k-by-m matrix (here, m=n=5, k=2). The entries of A come from the MovieLens dataset, and A is sparse. Our embedding model takes the user-movie matrix A and the embedding dimension k, and produces two dense matrices U and V. U has a row for every user (row of A), and V has a column for every movie (column of A). To predict the (i,j) entry of A, we compute the dot product of the ith row of U and the jth column of V.

![Factoring A into user and movie embedding matrices.](images/embeddings.png "Factoring the user-movie rating matrix into user and movie embedding matrices.")
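The prediction rule can be sketched in a few lines of NumPy. The matrices here are filled with random numbers purely for illustration (a trained model would supply real values), and the helper name `predict` is hypothetical. The shapes match the figure: n = m = 5 and k = 2.

```python
import numpy as np

n, m, k = 5, 5, 2  # users, movies, embedding dimension (as in the figure)

# Random placeholders standing in for learned embeddings.
rng = np.random.default_rng(0)
U = rng.normal(size=(n, k))  # one row per user
V = rng.normal(size=(k, m))  # one column per movie


def predict(U, V, i, j):
    """Predicted rating of user i for movie j: row i of U dotted with column j of V."""
    return float(U[i, :] @ V[:, j])


# Computing every prediction at once is just the matrix product U V.
all_predictions = U @ V

print(predict(U, V, 1, 3))
print(all_predictions.shape)
```

Because a single prediction only touches one row of U and one column of V, it costs k multiplications regardless of how many users or movies there are.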

The model is responsible for finding matrices U and V that produce ratings close to the known entries of A; the hope is that they also produce good predictions for the unknown entries. Note that the dot product of two vectors can be any real number, including one that is not in \[0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0\]. Therefore, when using the results, we have to decide how to interpret a predicted rating such as 3.35 or -1.2. Of course, the absolute value matters less if the predicted ratings are only going to be used for ranking (making a sorted list). Despite its drawbacks, this kind of model has been used successfully for many problems; see, for instance, any discussion of the [Netflix prize](https://en.wikipedia.org/wiki/Netflix_Prize). In the vocabulary of the literature, this datalab builds a *matrix factorization model* or *latent factor model* for *collaborative filtering*.
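To make the idea of "finding U and V" concrete, here is a minimal sketch that fits the two matrices by plain gradient descent on the squared error over the known entries. This is not the TensorFlow model built in the notebooks: the toy data, the function names (`factorize`, `masked_rmse`, `to_half_stars`), and the hyperparameters are all assumptions chosen for illustration. It also shows one possible way to interpret raw predictions, by rounding to the nearest half star and clipping to the valid range.

```python
import numpy as np


def factorize(A, known, k=2, steps=2000, lr=0.01, seed=0):
    """Fit U (n-by-k) and V (k-by-m) to the known entries of A by gradient descent."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    U = rng.normal(scale=0.1, size=(n, k))
    V = rng.normal(scale=0.1, size=(k, m))
    for _ in range(steps):
        # Prediction error on known entries only; unknown entries contribute 0.
        error = np.where(known, U @ V - A, 0.0)
        # Gradients of 0.5 * sum(error**2) with respect to U and V.
        grad_U = error @ V.T
        grad_V = U.T @ error
        U -= lr * grad_U
        V -= lr * grad_V
    return U, V


def masked_rmse(A, known, U, V):
    """Root-mean-square error of U V measured on the known entries of A."""
    error = (U @ V - A)[known]
    return float(np.sqrt(np.mean(error ** 2)))


def to_half_stars(prediction):
    """Round to the nearest half star (exact ties go to the even side), then clip to [0, 5]."""
    return np.clip(np.round(np.asarray(prediction) * 2) / 2, 0.0, 5.0)


# Hypothetical toy matrix: NaN marks an unknown rating.
A = np.full((3, 4), np.nan)
for user, movie, rating in [(0, 0, 4.0), (0, 2, 3.5), (1, 1, 5.0), (2, 0, 0.5), (2, 3, 2.0)]:
    A[user, movie] = rating
known = ~np.isnan(A)

U, V = factorize(A, known)
predictions = U @ V

print("RMSE on known entries:", masked_rmse(A, known, U, V))
print("half-star predictions:")
print(to_half_stars(predictions))

# For ranking, only the order of the raw predictions matters.
ranking_for_user_0 = np.argsort(-predictions[0])
print("movies for user 0, best first:", ranking_for_user_0)
```

A low error on the known entries says nothing by itself about the unknown ones; judging those requires holding out some known ratings and evaluating on them, and a realistic model would typically add regularization as well.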



# In this datalab, you will:

* Download and analyze the MovieLens dataset.
* Perform preprocessing on the dataset based on the analyses.
* Build an embedding model for the user-movie rating matrix using TensorFlow.
* Evaluate the usefulness of the model. 

You'll perform the entire investigation in this series of Cloud Datalab notebooks.

Important: This solution is intended to illustrate the capabilities of GCP and TensorFlow for fast, interactive, iterative data analysis and machine learning. It does not seek to build a world-class movie recommendation system. 