# Introduction to the MovieLens Dataset

[MovieLens](http://www.movielens.org/) is a website where users rate the movies they watch and receive recommendations based on those ratings. The collected data is made [publicly available for research](http://grouplens.org/datasets/movielens/). We will be working with a data set of 1 million user ratings of movies. More recent data sets contain more ratings, but they do not include demographic information about the users, which is perhaps the most interesting aspect of this data.

The MovieLens data has been uploaded to JupyterHub and is available at `/data/movielens/`. The data consists of three files:

- `movies.dat`, which contains information about the movies, such as their title and their genre
- `ratings.dat`, which contains the 1 million ratings
- `users.dat`, which contains information about the users, such as their gender and their occupation

This lab consists of three guided questions, plus an open-ended question. Each question is in a separate notebook. To answer each question, you will need to merge two (or possibly all three) of the data sets. Try to avoid merging more than you need to.
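As a rough illustration of what merging two of the data sets looks like, here is a minimal sketch using pandas. The two small DataFrames are hypothetical stand-ins, not the real files: `toy_ratings`, `toy_movies`, the `Title` column name, and all of the values are assumptions made up for this example.

```python
import pandas as pd

# Toy stand-ins for two of the tables. The names and values are hypothetical
# and only illustrate the mechanics of a merge.
toy_ratings = pd.DataFrame({
    "UserID": [1, 1, 2],
    "MovieID": [10, 20, 10],
    "Rating": [5, 3, 4],
})
toy_movies = pd.DataFrame({
    "MovieID": [10, 20],
    "Title": ["Movie A", "Movie B"],
})

# Merge on the shared key: each rating row picks up the title of its movie.
# how="left" keeps every rating, even if a movie were missing from toy_movies.
merged = toy_ratings.merge(toy_movies, on="MovieID", how="left")
print(merged)
```

The same pattern should carry over to the real data: pick the key column that the two tables share and merge only the tables that the question actually requires.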

In [1]:
!ls /data/movielens/

movies.dat  ratings.dat  README  users.dat


Run the cell below to view the README file, and skim it before starting the assignment. In particular, the data files do not have a header row, so you will need the README to figure out what each column represents.

In [2]:
!cat /data/movielens/README

SUMMARY

These files contain 1,000,209 anonymous ratings of approximately 3,900 movies 
made by 6,040 MovieLens users who joined MovieLens in 2000.

USAGE LICENSE

Neither the University of Minnesota nor any of the researchers
involved can guarantee the correctness of the data, its suitability
for any particular purpose, or the validity of results based on the
use of the data set.  The data set may be used for any research
purposes under the following conditions:

     * The user may not state or imply any endorsement from the
       University of Minnesota or the GroupLens Research Group.

     * The user must acknowledge the use of the data set in
       publications resulting from the use of the data set
       (see below for citation information).

     * The user may not redistribute the data without separate
       permission.

     * The user may not use this information for any commercial or
       revenue-bearing purposes without first obtaining permiss

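Notice that the fields in each line of the data are separated by the characters `::`. Therefore, to read in the data, we will need to specify this delimiter. Because the delimiter is more than one character long, pandas falls back to its slower Python parsing engine and may emit a warning, which you can suppress by adding the argument `engine='python'`.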
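To see the effect of the delimiter in isolation, the sketch below parses two `::`-separated lines from an in-memory string instead of from disk. The `sample` string and the `demo` variable are introduced here for illustration only; the two rows are copied from the first ratings in the data.

```python
import io
import pandas as pd

# Two lines in the same "::"-separated layout as ratings.dat.
sample = "1::1193::5::978300760\n1::661::3::978302109\n"
columns = ["UserID", "MovieID", "Rating", "Timestamp"]

# A multi-character separator is treated as a regular expression, which
# only the Python parsing engine supports, so we request it explicitly.
demo = pd.read_csv(io.StringIO(sample), sep="::", header=None,
                   names=columns, engine="python")
print(demo)
```

Reading the full file works the same way, with the file path in place of the in-memory buffer.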

In [3]:
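import pandas as pd
# Show at most 15 rows when displaying a DataFrame.
pd.set_option("display.max_rows", 15)

# The file has no header row, so we supply the column names ourselves.
columns = "UserID::MovieID::Rating::Timestamp".split("::")
ratings = pd.read_table('/data/movielens/ratings.dat', sep='::',
                        header=None, names=columns, engine='python')
ratings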

Unnamed: 0,UserID,MovieID,Rating,Timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291
5,1,1197,3,978302268
6,1,1287,5,978302039
...,...,...,...,...
1000202,6040,1089,4,956704996
1000203,6040,1090,3,956715518
