# Recommending Movies

The [MovieLens 20M](http://files.grouplens.org/datasets/movielens/ml-20m-README.html) dataset contains 20 million user ratings of thousands of movies, on a scale from 0.5 to 5 stars. In this demo we'll build a simple recommendation system that uses this data to suggest 25 movies based on the movies you say you like and dislike.

The notebook cells below use `pymldb`'s `Connection` class to make [REST API](/doc/#builtin/WorkingWithRest.md.html) calls. You can check out the [Using `pymldb` Tutorial](/doc/nblink.html#_tutorials/Using pymldb Tutorial) for more details.

In [1]:
from pymldb import Connection
mldb = Connection()

## Download the MovieLens 20M data

We'll start by using some command-line tools to download and decompress the data.

In [2]:
%%bash
mkdir -p /mldb_data/data
curl "http://files.grouplens.org/datasets/movielens/ml-20m.zip" 2>/dev/null  > /mldb_data/data/ml-20m.zip
unzip /mldb_data/data/ml-20m.zip -d /mldb_data/data

Archive:  /mldb_data/data/ml-20m.zip


replace /mldb_data/data/ml-20m/links.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename:  NULL
(EOF or read error, treating as "[N]one" ...)


In [3]:
%%bash
head /mldb_data/data/ml-20m/README.txt

Summary

This dataset (ml-20m) describes 5-star rating and free-text tagging activity from [MovieLens](http://movielens.org), a movie recommendation service. It contains 20000263 ratings and 465564 tag applications across 27278 movies. These data were created by 138493 users between January 09, 1995 and March 31, 2015. This dataset was generated on March 31, 2015.

Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.

The data are contained in four files, `links.csv`, `movies.csv`, `ratings.csv` and `tags.csv`. More details about the contents and use of all these files follows.

This and other GroupLens data sets are publicly available for download at <http://grouplens.org/datasets/>.


In [4]:
%%bash
head /mldb_data/data/ml-20m/ratings.csv

userId,movieId,rating,timestamp
1,2,3.5,1112486027
1,29,3.5,1112484676
1,32,3.5,1112484819
1,47,3.5,1112484727
1,50,3.5,1112484580
1,112,3.5,1094785740
1,151,4.0,1094785734
1,223,4.0,1112485573
1,253,4.0,1112484940


## Load the data into MLDB

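See the [Loading Data Tutorial](/doc/nblink.html#_tutorials/Loading Data Tutorial) for more details on how to get data into MLDB. The first procedure below imports the raw CSV; the second pivots it into one row per user with one column per movie.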

In [5]:
mldb.put('/v1/procedures/import_mvlns', {
    "type": "import.text", 
    "params": {
        "dataFileUrl":"file:///mldb_data/data/ml-20m/ratings.csv",
        "outputDataset": "mvlns_ratings_csv",
        "runOnCreation": True
    }
})

mldb.put('/v1/procedures/process_mvlns', {
    "type": "transform",
    "params": {
        "inputData": "select pivot(movieId, rating) as * named userId from mvlns_ratings_csv group by userId",
        "outputDataset": "mvlns_ratings",
        "runOnCreation": True
    }
})
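
The `pivot(movieId, rating) ... group by userId` query reshapes the long list of ratings into one sparse row per user, with one column per movie. As a rough illustration of that reshaping (not how MLDB implements it), here is a pure-Python sketch run on the first few rows of `ratings.csv` printed earlier. The helper name `pivot_ratings` is made up for this example.

```python
import csv
import io

# First rows of ratings.csv, copied from the `head` output earlier in the notebook.
SAMPLE = """userId,movieId,rating,timestamp
1,2,3.5,1112486027
1,29,3.5,1112484676
1,32,3.5,1112484819
"""

def pivot_ratings(csv_text):
    """Return {userId: {movieId: rating}}: one sparse row per user."""
    rows = {}
    for rec in csv.DictReader(io.StringIO(csv_text)):
        # Each movieId becomes a column name; the rating becomes the cell value.
        rows.setdefault(rec["userId"], {})[rec["movieId"]] = float(rec["rating"])
    return rows

user_rows = pivot_ratings(SAMPLE)
print(user_rows)
```

Movies a user never rated are simply absent from that user's row, which is why most cells in the pivoted dataset are empty.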

## Take a peek at the dataset

We'll use the [Query API](/doc/#builtin/sql/QueryAPI.md.html).

In [6]:
mldb.query("select * from mvlns_ratings limit 5")

_rowName,1,3,5,6,7,9,17,25,32,36,...,368,370,419,454,468,474,485,520,543,778
56641,4.0,3.0,3.0,2.0,1.0,4.0,1.0,1.0,3.0,3.0,...,,,,,,,,,,
118850,,,,,,,,,4.0,,...,,,,,,,,,,
79822,,,,,,,,,,,...,,,,,,,,,,
105445,3.0,,,,,,,,,,...,,,,,,,,,,
54666,,,,5.0,,,,,,,...,5.0,3.0,3.0,4.0,4.0,4.0,3.0,2.0,3.0,4.0


## Singular Value Decomposition (SVD)

We will create and run a [Procedure](/doc/#builtin/procedures/Procedures.md.html) of type [`svd.train`](/doc/#builtin/procedures/Svd.md.html).

In [7]:
mldb.put('/v1/procedures/mvlns_svd', {
    "type" : "svd.train",
    "params" : {
        "trainingData" : "select COLUMN EXPR (where rowCount() > 3) from mvlns_ratings",
        "columnOutputDataset" : "mvlns_svd_embedding",
        "modelFileUrl": "file://models/mvlns.svd",
        "functionName": "mvlns_svd_embedder",
        "runOnCreation": True
    }
})
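
The `COLUMN EXPR (where rowCount() > 3)` clause in `trainingData` keeps only the columns (movies) that have a value in more than 3 rows (users). The sketch below mimics that filter in plain Python over sparse user rows. The toy data and the function name `filter_sparse_columns` are hypothetical and only meant to illustrate the idea.

```python
from collections import Counter

def filter_sparse_columns(rows, min_count=3):
    """Keep only columns that have a value in more than `min_count` rows."""
    # Count, for every column, how many rows carry a value for it.
    counts = Counter(col for row in rows.values() for col in row)
    keep = {col for col, n in counts.items() if n > min_count}
    return {name: {c: v for c, v in row.items() if c in keep}
            for name, row in rows.items()}

# Toy data (hypothetical): movie "10" is rated by 4 users, movie "20" by only 1.
toy = {
    "u1": {"10": 4.0, "20": 2.0},
    "u2": {"10": 3.0},
    "u3": {"10": 5.0},
    "u4": {"10": 1.0},
}
filtered = filter_sparse_columns(toy)
print(filtered)
```

In this toy case movie "20" is dropped from every row, while movie "10" survives because four users rated it.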

## Explore the results!

Our dataset identifies movies by `movieId`, but humans think in terms of movie titles, and displaying movie posters requires IMDB IDs. So we'll load some metadata tables that map between them.

In [8]:
import pandas as pd
movies = pd.read_csv("/mldb_data/data/ml-20m/movies.csv", index_col="movieId", encoding="utf-8")
links = pd.read_csv("/mldb_data/data/ml-20m/links.csv", index_col="movieId")

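def movie_search(term):
    """Return the movieIds whose titles match any comma-separated term."""
    result = []
    for x in term.split(","):
        # Skip fragments too short to be a meaningful search.
        if len(x) < 3:
            continue
        # Titles are stored like "Terminator, The (1984)", so drop "the" before matching.
        m_ids = movies[movies.title.str.contains(x.replace("the", "").strip(), case=False)].index
        if len(m_ids) == 0:
            print("no results found for " + x)
        else:
            result += list(m_ids)
    return result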

movies.loc[movie_search("toy story, the terminator")].join(links)
    

movieId,title,genres,imdbId,tmdbId
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862
3114,Toy Story 2 (1999),Adventure|Animation|Children|Comedy|Fantasy,120363,863
78499,Toy Story 3 (2010),Adventure|Animation|Children|Comedy|Fantasy|IMAX,435761,10193
106022,Toy Story of Terror (2013),Animation|Children|Comedy,2446040,213121
115875,Toy Story Toons: Hawaiian Vacation (2011),Adventure|Animation|Children|Comedy|Fantasy,1850374,77887
115879,Toy Story Toons: Small Fry (2011),Adventure|Animation|Children|Comedy|Fantasy,2033372,82424
120468,Toy Story Toons: Partysaurus Rex (2012),Animation|Children|Comedy,2340678,130925
120474,Toy Story That Time Forgot (2014),Animation|Children,3473654,256835
589,Terminator 2: Judgment Day (1991),Action|Sci-Fi,103064,280
1240,"Terminator, The (1984)",Action|Sci-Fi|Thriller,88247,218


If you tell us which films you like and which you dislike, we can represent your preferences as a vector of ratings: every movie you like gets a 5 and every movie you dislike gets a 1.

In [9]:
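def make_user_vector(like, dislike):
    # Liked movies get a rating of 5, disliked movies a rating of 1.
    features = {}
    for rating, movie_ids in zip([5, 1], [movie_search(like), movie_search(dislike)]):
        for m in movie_ids:
            # Column names in mvlns_ratings are movieIds as strings.
            features[str(m)] = rating
    return features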

pd.DataFrame(make_user_vector(like="toy story, the terminator", dislike="star trek"), index=['user'])

,1,102425,102445,106022,107137,115875,115879,120468,120474,1240,...,2393,3114,329,4934,589,5944,6537,68358,68791,78499
user,5,5,1,5,1,5,5,5,5,5,...,1,5,1,5,5,1,5,1,5,5


The SVD we trained above can 'embed' this vector as a point in a special space, which works a bit like a map of movies. Once we know where your preferences sit on this map, recommending movies is just a matter of finding the movies nearest to that point.

In [10]:
def nearby_movies(user_vector):
    # Embed the user's ratings into the SVD space.
    result1 = mldb.get('/v1/functions/mvlns_svd_embedder/application', input=dict(row=user_vector))
    point = result1.json()["output"]["embedding"]["val"]
    # Pass each coordinate as a query parameter named svd0000, svd0001, ...
    params = {"svd{0:04d}".format(i): str(e) for i, e in enumerate(point)}
    params["numNeighbours"] = 50
    result2 = mldb.get('/v1/datasets/mvlns_svd_embedding/routes/neighbours', **params)
    # Keep only the neighbours the user has not already rated.
    recos = []
    for r in result2.json():
        if str(r[0]) not in user_vector:
            recos.append(r[0])
    return recos

user_vector = make_user_vector(like="toy story, the terminator", dislike="star trek")
recommendations = nearby_movies(user_vector)
movies.loc[recommendations].join(links).head()


movieId,title,genres,imdbId,tmdbId
316,Stargate (1994),Action|Adventure|Sci-Fi,111282,2164
1682,"Truman Show, The (1998)",Comedy|Drama|Sci-Fi,120382,37165
2617,"Mummy, The (1999)",Action|Adventure|Comedy|Fantasy|Horror|Thriller,120616,564
1222,Full Metal Jacket (1987),Drama|War,93058,600
2355,"Bug's Life, A (1998)",Adventure|Animation|Children|Comedy,120623,9487
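
The `neighbours` route performs the nearest-neighbour search inside MLDB. To make the idea concrete, here is a small standalone sketch of the same kind of lookup: given a point and a table of movie embeddings, rank the movies by distance and skip the ones the user already rated. The 2-dimensional coordinates are invented for illustration (real embeddings come from `mvlns_svd_embedding`), the function name `nearest_movies` is hypothetical, and plain Euclidean distance is an assumption rather than necessarily the metric MLDB uses.

```python
import math

def nearest_movies(point, embeddings, exclude=(), k=3):
    """Return up to k movie ids closest to `point`, skipping ids in `exclude`."""
    def dist(vec):
        # Euclidean distance between the query point and one movie's embedding.
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(point, vec)))
    candidates = [m for m in embeddings if m not in exclude]
    return sorted(candidates, key=lambda m: dist(embeddings[m]))[:k]

# Hypothetical 2-d embeddings keyed by movieId (as strings).
toy_embedding = {
    "1":    [0.9, 0.1],
    "3114": [0.8, 0.2],
    "2355": [0.7, 0.3],
    "1222": [-0.5, 0.9],
}
# The user already rated movie "1", so it is excluded from the results.
recos = nearest_movies([0.85, 0.15], toy_embedding, exclude={"1"}, k=2)
print(recos)
```

As in `nearby_movies`, excluding already-rated movies matters: without it, the closest neighbours would mostly be the movies used to build the user vector.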


Finally, just to show off, we can turn this into a little interactive app.

NOTE: the interactive part of this demo only works if you're running this Notebook live, not if you're looking at a static copy on http://mldb.ai. See the documentation for [Running MLDB](/doc/#builtin/Running.md.html).

In [11]:
from ipywidgets import interact 
from IPython.display import IFrame, display

@interact 
def recommend_movies(like="toy story, terminator", dislike="star trek"):
    user_vector = make_user_vector(like, dislike)
    recommendations = nearby_movies(user_vector)[:25]
    return movies.loc[recommendations].join(links)
    #display( 
    #    IFrame("data:text/html," + "".join([ 
    #    "<script src='http://www.movieposterdb.com/embed.inc.php?movie_id=%s'></script>" % str(x) 
    #    for x in links.loc[recommendations].imdbId
    #    ]), 600, 800) 
    #)


movieId,title,genres,imdbId,tmdbId
316,Stargate (1994),Action|Adventure|Sci-Fi,111282,2164
1682,"Truman Show, The (1998)",Comedy|Drama|Sci-Fi,120382,37165
2617,"Mummy, The (1999)",Action|Adventure|Comedy|Fantasy|Horror|Thriller,120616,564
1222,Full Metal Jacket (1987),Drama|War,93058,600
2355,"Bug's Life, A (1998)",Adventure|Animation|Children|Comedy,120623,9487
6365,"Matrix Reloaded, The (2003)",Action|Adventure|Sci-Fi|Thriller|IMAX,234215,604
3527,Predator (1987),Action|Sci-Fi|Thriller,93773,106
2329,American History X (1998),Crime|Drama,120586,73
292,Outbreak (1995),Action|Drama|Sci-Fi|Thriller,114069,6950
145,Bad Boys (1995),Action|Comedy|Crime|Drama|Thriller,112442,9737


## Where to next?

Check out the other [Tutorials and Demos](/doc/#builtin/Demos.md.html).