# Building a Movies Recommendation Engine
“What movie should I watch this evening?” Have you ever had to answer this question after coming home from work? I have, and more than once. From Netflix to Hulu, robust movie recommendation systems matter because modern consumers expect personalized content.

In this notebook, I will implement a few algorithms (content-based, popularity-based, and collaborative filtering) to recommend movies, then evaluate them to see which one performs best.

After reading this post you will know:

* About the MovieLens dataset.
* How to load and explore the dataset in Python.
* The 3 different types of recommendation engines.
* How to develop a popularity-based recommendation model for the MovieLens dataset.
* How to develop a content-based recommendation model for the MovieLens dataset.
* How to develop a collaborative filtering model for the MovieLens dataset.
* How to evaluate these models based on precision and recall.
* Suggestions to improve the model accuracy.

Let’s get started.

## The MovieLens Dataset
One of the most common datasets available on the internet for building a recommender system is the [MovieLens dataset](https://grouplens.org/datasets/movielens/). The version I'm working with ([1M](https://grouplens.org/datasets/movielens/1m/)) contains 1,000,209 anonymous ratings of approximately 3,900 movies, made by 6,040 MovieLens users who joined MovieLens in 2000.

The data was collected by GroupLens researchers over various periods of time, depending on the size of the set. This 1M version was released in February 2003. Users were selected at random for inclusion, and all selected users had rated at least 20 movies. Each user is represented by an ID; the only other user information is the demographic data in users.dat (gender, age, occupation, and zip code).

The original data are contained in three files, [movies.dat](https://github.com/khanhnamle1994/movielens/blob/master/dat/movies.dat), [ratings.dat](https://github.com/khanhnamle1994/movielens/blob/master/dat/ratings.dat) and [users.dat](https://github.com/khanhnamle1994/movielens/blob/master/dat/users.dat). To make it easier to work with the data, I used a [script](https://github.com/khanhnamle1994/movielens/blob/master/dat_to_csv.py) to convert the .dat files into [.csv files](https://github.com/khanhnamle1994/movielens/tree/master/csv).
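The conversion script itself is not shown here, so the following is only a minimal sketch of one way such a conversion could work. It assumes the `.dat` files use `::` as the field separator and have no header line; the two-line excerpt and the names `dat_text`, `ratings_dat`, and `csv_text` are hypothetical.

```python
import io
import pandas as pd

# Hypothetical two-line excerpt in the assumed `::`-delimited layout of ratings.dat.
dat_text = "1::1193::5::978300760\n1::661::3::978302109\n"

# A multi-character separator requires the Python parsing engine.
ratings_dat = pd.read_csv(
    io.StringIO(dat_text),
    sep='::',
    engine='python',
    header=None,
    names=['user_id', 'movie_id', 'rating', 'ts'],
)

# With no path given, to_csv returns the CSV text; index=False omits the row index.
csv_text = ratings_dat.to_csv(index=False)
print(csv_text)
```

Note that `to_csv` writes the column names as the first line of the output, which matters when the CSV is read back in.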

## Loading and Exploring the Data
Let's load this data into Python. I will read each CSV with pandas into a DataFrame: **ratings**, **users**, and **movies**. I also pass in column names for each file (the column names are listed in the [Readme](https://github.com/khanhnamle1994/movielens/blob/master/README.md) file).

In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Reading ratings file
r_cols = ['user_id', 'movie_id', 'rating', 'ts']
ratings = pd.read_csv('csv/ratings.csv', names=r_cols, encoding='latin-1')

# Reading users file
u_cols = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_csv('csv/users.csv', names=u_cols, encoding='latin-1')

# Reading movies file
m_cols = ['movie id', 'title', 'genres']
movies = pd.read_csv('csv/movies.csv', names=m_cols, encoding='latin-1')



Now let's take a peek at the content of each file to understand them better.

In [7]:
print(ratings.shape)
print(ratings.head())

(1000210, 4)
   user_id  movie_id  rating         ts
0  user_id  movie_id  rating         ts
1        1      1193       5  978300760
2        1       661       3  978302109
3        1       914       3  978301968
4        1      3408       4  978300275


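The shape reports 1,000,210 rows, but row 0 is the CSV header line read in as data, so the file holds 1,000,209 ratings. This matches the dataset description: roughly 1M ratings for different user and movie combinations. Notice also that each rating has a timestamp (ts) associated with it.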
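Row 0 in the output above repeats the column names, which suggests the CSV's own header line was read as a data row. The sketch below shows one way to avoid that, assuming the file begins with a header line: pass `header=0` together with `names`. It also converts `ts` to a datetime, assuming the values are seconds since the Unix epoch. The in-memory sample and the names `sample_csv`, `ratings_sample`, and `rated_at` are hypothetical stand-ins for `csv/ratings.csv`.

```python
import io
import pandas as pd

# Hypothetical stand-in for csv/ratings.csv: a header line followed by data rows.
sample_csv = io.StringIO(
    "user_id,movie_id,rating,ts\n"
    "1,1193,5,978300760\n"
    "1,661,3,978302109\n"
    "1,914,3,978301968\n"
)

r_cols = ['user_id', 'movie_id', 'rating', 'ts']

# header=0 marks the first line as a header, which `names` then replaces,
# so the header is not kept as a data row.
ratings_sample = pd.read_csv(sample_csv, names=r_cols, header=0)

# Assumption: `ts` holds seconds since the Unix epoch.
ratings_sample['rated_at'] = pd.to_datetime(ratings_sample['ts'], unit='s')

print(ratings_sample.shape)
print(ratings_sample.dtypes)
```

With the header line excluded, the numeric columns are parsed as integers rather than strings, and the row count equals the number of ratings.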

In [8]:
print(users.shape)
print(users.head())

(6041, 5)
   user_id  gender  age  occupation    zip
0  user_id  gender  age  occupation    zip
1        1       F    1          10  48067
2        2       M   56          16  70072
3        3       M   25          15  55117
4        4       M   45           7  02460


Row 0 is the CSV header line read in as data, so this confirms that there are 6,040 users. We have 5 features for each: a unique user ID, gender, age, occupation, and the zip code they live in.
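The zip code `02460` in the output above keeps its leading zero, which would be lost if the column were parsed as a number. A minimal sketch, assuming the CSV starts with a header line, of reading `zip` explicitly as a string while letting the other columns be parsed normally. The in-memory sample and the names `sample_csv` and `users_sample` are hypothetical.

```python
import io
import pandas as pd

# Hypothetical stand-in for csv/users.csv, using two rows from the output above.
sample_csv = io.StringIO(
    "user_id,gender,age,occupation,zip\n"
    "3,M,25,15,55117\n"
    "4,M,45,7,02460\n"
)

u_cols = ['user_id', 'gender', 'age', 'occupation', 'zip']

# dtype={'zip': str} keeps zip codes as text, preserving any leading zeros.
users_sample = pd.read_csv(
    sample_csv,
    names=u_cols,
    header=0,
    dtype={'zip': str},
)

print(users_sample.dtypes)
print(users_sample['zip'].tolist())
```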

In [9]:
print(movies.shape)
print(movies.head())

(3884, 3)
   movie id                     title                        genres
0  movie_id                     title                        genres
1         1          Toy Story (1995)   Animation|Children's|Comedy
2         2            Jumanji (1995)  Adventure|Children's|Fantasy
3         3   Grumpier Old Men (1995)                Comedy|Romance
4         4  Waiting to Exhale (1995)                  Comedy|Drama


Row 0 is the CSV header line read in as data, so this dataset contains attributes of 3,883 movies. There are 3 columns: the movie ID, the title, and the genres. Genres are pipe-separated and are selected from 18 genres (Action, Adventure, Animation, Children's, Comedy, Crime, Documentary, Drama, Fantasy, Film-Noir, Horror, Musical, Mystery, Romance, Sci-Fi, Thriller, War, Western).
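Because the genres are stored as one pipe-separated string per movie, they need to be split before they can be counted or compared. The sketch below shows one way to do that with pandas, using the four movies from the output above as a small sample. The names `movies_sample` and `genre_counts` are hypothetical.

```python
import pandas as pd

# Small sample built from the four movies shown in the output above.
movies_sample = pd.DataFrame({
    'movie_id': [1, 2, 3, 4],
    'title': [
        'Toy Story (1995)',
        'Jumanji (1995)',
        'Grumpier Old Men (1995)',
        'Waiting to Exhale (1995)',
    ],
    'genres': [
        "Animation|Children's|Comedy",
        "Adventure|Children's|Fantasy",
        'Comedy|Romance',
        'Comedy|Drama',
    ],
})

# Split each genre string on '|' into a list, put each genre on its own row,
# then count how many movies carry each genre.
genre_counts = (
    movies_sample['genres']
    .str.split('|')
    .explode()
    .value_counts()
)

print(genre_counts)
```

The same split lists could later serve as input features for a content-based model.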

## Types of Recommendation Engines

## Popularity-Based Recommendation Model

## Content-Based Recommendation Model

## Collaborative Filtering Recommendation Model

## Evaluating Recommendation Models

## Suggestions to Improve Model Accuracy

## Summary