by Leana Critchell, Jacob Prebys and Dann Morr
- Repo Structure and Directory
- Overview
- Success Criteria
- Members
- Dataset
- Analysis of Data
- Top 10 Genres
- Ratings by Release Year
- Metrics
- Models
- Collaborative Filtering Model
- Content-Based Model
- Evaluation
- Deployment
- Final Results
- Future Work
- Exploratory Notebooks
- Report Notebook
- Project Presentation
- Data
- src/ directory with project source code
- Figures/ directory with project visuals
- References
- Project environment
We aim to create a recommendation system based on the MovieLens dataset from the GroupLens research lab at the University of Minnesota. Furthermore, we would like to deploy a web app that will allow a user to enter some ratings for movies that they have seen, and then, based on the model we have implemented, recommend movies that align with their interests.
Success will be measured by implementing collaborative and content-based models that can return movie recommendations to a user. The goal is to provide recommendations that we find sensible, based either on ratings that the user enters or on a film given to the content-based system. A good recommendation algorithm can be extremely useful for streaming companies, as a constant stream of accurate or interesting recommendations will keep users engaged with the platform.
Name | GitHub |
---|---|
Leana Critchell | lecritch |
Jacob Prebys | jprebys |
Dann Morr | dannmorr |
The datasets used can be found here. There is a complete dataset available here which includes over 27,000,000 ratings. They also offer a subset of this data that has about 100,000 ratings. For our initial data exploration and model tuning, we will be using this subset.
You can click this link to download the zip file of the data files used in this project (1MB). This zip file contains 4 csv files: `movies`, `ratings`, `tags` and `links`. See the README.md in the data folder for more info on how this data is formatted. On the website provided above, you also have access to the 'large' dataset, which is 256MB and was not used in this project. You can download it from their website if you wish.
The four CSV datasets were downloaded to this repo, which you can find here; they are labelled `movies.csv`, `links.csv`, `ratings.csv` and `tags.csv`. If you're following along in the final notebook, the cells will run as we import the CSVs using pandas.
We found that the average rating is 3.5 and that the distribution of ratings is left-skewed, as can be seen in the image below. This shows us that there aren't many low ratings between 0.5 and 2. Perhaps this says something about people's motivation to rate movies: they may mostly rate movies they enjoyed.
Something that comes up a lot in recommendation system problems is the long tail problem. This is where the vast majority of users and/or items have only 1 rating associated with them, while a small number of items/users have a lot of ratings associated with them. We first looked into the number of ratings per movie:
As you can see, we do have a long tail problem here: the majority of movies have fewer than 25 ratings and very few have more than that.
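A minimal sketch of how such counts might be computed, using only the Python standard library rather than pandas. The sample rows are made up, the helper names `ratings_per_key` and `share_below` are hypothetical, and the column names `userId` and `movieId` are assumed to match the headers in `ratings.csv`.

```python
import csv
import io
from collections import Counter

# Tiny made-up stand-in for ratings.csv (column names are an assumption).
SAMPLE = """userId,movieId,rating,timestamp
1,10,4.0,0
2,10,5.0,0
3,10,3.5,0
1,20,2.0,0
2,30,4.5,0
"""

def ratings_per_key(csv_file, key):
    """Count how many ratings each value of `key` (e.g. movieId) has."""
    return Counter(row[key] for row in csv.DictReader(csv_file))

def share_below(counts, threshold):
    """Fraction of keys that have fewer than `threshold` ratings."""
    return sum(1 for n in counts.values() if n < threshold) / len(counts)

per_movie = ratings_per_key(io.StringIO(SAMPLE), "movieId")
tail_share = share_below(per_movie, 2)

print("most rated movie:", per_movie.most_common(1))
print(f"share of movies with fewer than 2 ratings: {tail_share:.2f}")
```

Passing `"userId"` instead of `"movieId"` gives the same counts per user, and on the real file the threshold could be set to 25 to match the observation above.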
We then looked into the number of ratings per user to investigate this long tail problem further:
We looked into the most common genres and found the top ten genre combinations (that is, the genre labels with the most movies listed under them).
We found that the top genres boil down to:
- Drama
- Crime/Thriller
- Comedy
We visualise the top 10 genres:
We can see here that Drama is the most highly rated genre, second is Comedy and third Comedy|Drama. This alone suggests that these could be aggregated somehow and should be considered in future investigations.
From this graph we can see that movies released before 1990 tend to have a higher average rating. From roughly 1990, the average movie rating appears to trend downwards towards the average rating of the dataset (3.5). Since the ratings of these movies have been collected since 1993, this could suggest that people who watched and rated older movies did so because those movies had already been recommended to them as good, so these movies were watched on good referral. Movies released from 1993 onwards, by contrast, could have been watched and rated out of people's own motivation rather than personal recommendations. Perhaps this explains what we see in the data here.
For this recommendation system, we are provided with actual ratings that actual users gave to movies. Because we have a numerical rating system, the standard metrics for regression problems apply here. Calculating the root mean squared error (RMSE) is a natural choice for model evaluation, but there are problems in practice with this method. Most notably, the movies that have few ratings don't have much effect on the RMSE; therefore, we will have to take this into consideration when tuning the model.
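To make the metric concrete, RMSE is the square root of the mean squared difference between actual and predicted ratings. The sketch below uses made-up predictions (not results from this project) to illustrate why a sparsely rated movie barely moves the overall score; the movie labels and values are hypothetical.

```python
import math
from collections import defaultdict

def rmse(pairs):
    """Root mean squared error over (actual, predicted) rating pairs."""
    return math.sqrt(sum((actual - pred) ** 2 for actual, pred in pairs) / len(pairs))

# Hypothetical results as (movie, actual, predicted) triples.
# Movie "A" has 99 ratings, each predicted within 0.5 stars.
results = [("A", 4.0 if i % 2 == 0 else 3.0, 3.5) for i in range(99)]
# Movie "B" has a single rating that the model misses by 3 stars.
results.append(("B", 5.0, 2.0))

overall = rmse([(actual, pred) for _, actual, pred in results])

# Group the pairs by movie to compare per-movie error with the overall score.
by_movie = defaultdict(list)
for movie, actual, pred in results:
    by_movie[movie].append((actual, pred))
per_movie = {movie: rmse(pairs) for movie, pairs in by_movie.items()}

print(f"overall RMSE: {overall:.3f}")
for movie, score in per_movie.items():
    print(f"movie {movie}: RMSE {score:.3f} over {len(by_movie[movie])} ratings")
```

In this toy case the overall RMSE stays close to movie A's 0.5 even though the only prediction for movie B is off by 3 stars, which is why per-movie error may be worth checking alongside the global number.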
The key idea behind collaborative filtering is that similar users share similar interests and that users tend to like items that are similar to one another. We plan to use this for our recommendation system. A user will rate 5 movies, and that new data will be used to generate recommendations based on the ratings from users in our dataset.
- Determine the model to use
  - We performed a train-test split on our data, then compared several models in their default state to see which would return the best RMSE score. In this test, the best performing model was `SVDpp`, the SVD++ algorithm, an extension of SVD that takes into account both explicit and implicit ratings.
- Iterating and tuning the model
  - After the model was chosen, we ran several iterations, tuning the hyperparameters each time to see if we could improve the score.
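The comparison step can be sketched in plain Python. This is an illustration only and not the project's code: it uses made-up ratings, a global-mean baseline, and a plain biased matrix factorization (the SVD-style model that SVD++ extends, here without the implicit-feedback term). All function names, the data, and the hyperparameter values are assumptions.

```python
import math
import random
from collections import defaultdict

def rmse(predict, data):
    """RMSE of a predict(user, movie) function over (user, movie, rating) rows."""
    return math.sqrt(sum((r - predict(u, m)) ** 2 for u, m, r in data) / len(data))

def fit_global_mean(train):
    """Baseline: always predict the mean training rating."""
    mean = sum(r for _, _, r in train) / len(train)
    return lambda user, movie: mean

def fit_biased_mf(train, n_factors=2, n_epochs=300, lr=0.02, reg=0.02, seed=0):
    """Biased matrix factorization trained with stochastic gradient descent."""
    rng = random.Random(seed)
    mean = sum(r for _, _, r in train) / len(train)
    user_bias, movie_bias = defaultdict(float), defaultdict(float)
    # Latent factors start as small random values.
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)]
         for u in sorted({u for u, _, _ in train})}
    q = {m: [rng.gauss(0, 0.1) for _ in range(n_factors)]
         for m in sorted({m for _, m, _ in train})}
    zeros = [0.0] * n_factors

    def predict(user, movie):
        # Unseen users or movies fall back to zero bias and zero factors.
        pu, qm = p.get(user, zeros), q.get(movie, zeros)
        dot = sum(a * b for a, b in zip(pu, qm))
        return mean + user_bias.get(user, 0.0) + movie_bias.get(movie, 0.0) + dot

    for _ in range(n_epochs):
        for u, m, r in train:
            err = r - predict(u, m)
            user_bias[u] += lr * (err - reg * user_bias[u])
            movie_bias[m] += lr * (err - reg * movie_bias[m])
            for f in range(n_factors):
                puf, qmf = p[u][f], q[m][f]
                p[u][f] += lr * (err * qmf - reg * puf)
                q[m][f] += lr * (err * puf - reg * qmf)
    return predict

# Made-up ratings: each rating is a movie's base score plus a user's offset.
user_offset = {"u0": -0.5, "u1": 0.0, "u2": 0.5}
movie_base = {"m0": 2.0, "m1": 3.0, "m2": 4.0, "m3": 3.5}
ratings = [(u, m, movie_base[m] + user_offset[u]) for u in user_offset for m in movie_base]

# Simple deterministic split: hold out every fifth rating for testing.
test = ratings[::5]
train = [row for i, row in enumerate(ratings) if i % 5 != 0]

models = {"global mean": fit_global_mean(train), "biased MF": fit_biased_mf(train)}
scores = {name: rmse(model, test) for name, model in models.items()}
for name, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name}: test RMSE {score:.3f}")
```

Tuning then amounts to repeating the last step while varying arguments such as `n_factors`, `lr` and `reg`, and keeping the setting with the lowest held-out RMSE.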
The next type of recommendation system we wanted to explore was a content-based version. Our previous model looks at other users with similar interests and recommends titles that they have liked. This system goes in the other direction: it takes a movie that you like and, having learned some information about the film, recommends titles that are similar to it.
To do this, we gathered descriptions and genre tags for each film, and then utilized some of Python's natural language processing tools to turn this text information into numerical information. We used the following process:
- TF-IDF Vectorization
  - Short for Term Frequency - Inverse Document Frequency, this is a method for assigning a value to each word based on the number of times it appears in documents. This value takes into account the number of times a word appears in a single description and also how commonly it appears across all descriptions. A word is given a high tf-idf score for a description if it appears many times in that description but is relatively uncommon across all descriptions. This is partially meant to filter out words that are common to movies in general.
- Cosine Similarity
  - Once each film is represented by a many-dimensional vector, a common method for determining how 'similar' two films are is to calculate the cosine of the angle between their vectors: the closer it is to 1, the more similar the films.
- Sorting
  - Now that we have a measure of similarity between every pair of movies, we can take in a single movie, sort the rest of the movies by how similar they are to our chosen film, and then return the top 10 most similar films.
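The three steps above can be sketched with the standard library alone. This is a simplified illustration, not the code in `content_rec.py`: the tf-idf weighting here is raw term count times `log(N / df)`, which may differ from the weighting a library would apply, and the titles, descriptions and function names are made up.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn each description into a sparse {term: tf-idf weight} vector."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many descriptions each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    return [
        {term: count * math.log(n_docs / df[term]) for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]

def cosine(a, b):
    """Cosine of the angle between two sparse vectors (0.0 if either is all zeros)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def most_similar(titles, docs, title, top_n=10):
    """Return the top_n titles whose descriptions are most similar to `title`."""
    vectors = tfidf_vectors(docs)
    chosen = titles.index(title)
    scored = [
        (cosine(vectors[chosen], vec), other)
        for i, (other, vec) in enumerate(zip(titles, vectors))
        if i != chosen
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [other for _, other in scored[:top_n]]

# Hypothetical films and descriptions.
titles = ["Space Quest", "Star Voyage", "Love in Paris"]
descriptions = [
    "astronaut space mission adventure",
    "space crew adventure alien",
    "romance paris love story",
]

recommendations = most_similar(titles, descriptions, "Space Quest", top_n=2)
print(recommendations)
```

Sparse dictionaries keep the example short; with thousands of descriptions, a matrix-based implementation would likely be the more practical choice.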
We have put together a Python class to demonstrate our content-based recommender; the source code for it can be found in the `src` folder under the name `content_rec.py`.
Overall, our models were successful in providing good recommendations to users. Our final model had a root mean squared error of 0.856, but that could be improved through further model iterations and perhaps some integration of the content-based system into the collaborative one.
To deploy our recommendation system we decided to use the Python library Flask, which is a framework for making simple web apps backed by Python code. With this tool we were able to make a cool app that asks users to rate a certain number of movies and then recommends films based on similar users' interests.
Here's an app preview:
We had good success with both collaborative and content-based recommendation systems, as well as our Flask deployment. Our final collaborative model ended up with an RMSE of approximately 0.855, which is not bad on a 5-point rating scale. Our content-based model shows very good variety in picking movies that are similar in genre and description.
A good place to direct our efforts in the future would be speeding up our model training process so that our deployed app can respond faster. We should also consider combining parts of our content-based and collaborative systems into a hybrid recommender that makes even more relevant or interesting recommendations.