Exercise for Kaggle Days Ulaanbaatar
================================

Download the ratings file from the MovieLens dataset.

More information is available at https://grouplens.org/datasets/movielens/

In [4]:
!wget http://files.grouplens.org/datasets/movielens/ml-25m.zip
!unzip ml-25m.zip

Archive:  ml-25m.zip
   creating: ml-25m/
  inflating: ml-25m/tags.csv         
  inflating: ml-25m/links.csv        
  inflating: ml-25m/README.txt       
  inflating: ml-25m/ratings.csv      
  inflating: ml-25m/genome-tags.csv  
  inflating: ml-25m/genome-scores.csv  
  inflating: ml-25m/movies.csv       


In [47]:
import pandas as pd
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import NearestNeighbors

In [27]:
ratings = pd.read_csv("ml-25m/ratings.csv").drop_duplicates(["userId", "movieId"])
movies = pd.read_csv("ml-25m/movies.csv")

Let's find my favourite movie series, Indiana Jones.

In [96]:
movies_dict = dict(zip(movies["movieId"], movies["title"]))
movies[movies["title"].str.find("Indiana Jones") >= 0]

Unnamed: 0,movieId,title,genres
1168,1198,Raiders of the Lost Ark (Indiana Jones and the...,Action|Adventure
1258,1291,Indiana Jones and the Last Crusade (1989),Action|Adventure
2025,2115,Indiana Jones and the Temple of Doom (1984),Action|Adventure|Fantasy
12357,59615,Indiana Jones and the Kingdom of the Crystal S...,Action|Adventure|Comedy|Sci-Fi
57380,196241,The Adventures of Young Indiana Jones: Adventu...,Action|Adventure|Drama


Let's also find a funny comedy, Naked Gun.

In [100]:
movies[movies["title"].str.find("Naked Gun") >= 0]

Unnamed: 0,movieId,title,genres
365,370,Naked Gun 33 1/3: The Final Insult (1994),Action|Comedy
3765,3868,"Naked Gun: From the Files of Police Squad!, Th...",Action|Comedy|Crime|Romance
3766,3869,"Naked Gun 2 1/2: The Smell of Fear, The (1991)",Comedy
26982,128025,Naked Gun (1956),Western


In [28]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817
2,1,307,5.0,1147868828
3,1,665,5.0,1147878820
4,1,899,3.5,1147868510


In [29]:
ratings.nunique()

userId         162541
movieId         59047
rating             10
timestamp    20115267
dtype: int64

Let's build a matrix with ratings
-----------

Because user and movie IDs are integers, we can use them directly as row and column coordinates in the matrix.

We use `scipy.sparse.coo_matrix`, which is easy to build from (row, column) coordinates.

In [80]:
ratings_matrix = sparse.coo_matrix((ratings.rating.values, (ratings.userId.values, ratings.movieId.values)))
ratings_matrix

<162542x209172 sparse matrix of type '<class 'numpy.float64'>'
	with 25000095 stored elements in COOrdinate format>
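
As a small illustration of how the coordinates determine the matrix, here is a toy sketch (the IDs and ratings are made up). It shows that the shape is inferred as the largest ID plus one in each dimension, which matches the 162542 rows above for user IDs up to 162541, and that repeated coordinates are summed on conversion, which is presumably why duplicate (userId, movieId) pairs were dropped when loading the ratings.

```python
import numpy as np
from scipy import sparse

# Toy (user, movie, rating) triples; these IDs are invented for illustration.
user_ids = np.array([1, 1, 3])
movie_ids = np.array([2, 5, 2])
values = np.array([5.0, 3.5, 4.0])

toy = sparse.coo_matrix((values, (user_ids, movie_ids)))
print(toy.shape)  # (4, 6): largest user ID + 1, largest movie ID + 1

# Two entries with the same (row, column) are added together on conversion.
dup = sparse.coo_matrix(
    (np.array([5.0, 3.0]), (np.array([1, 1]), np.array([2, 2])))
)
print(dup.toarray()[1, 2])  # 8.0
```

Row 0 and every column whose ID never occurs stay empty, so the matrix is wider than the number of distinct movies.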

In [81]:
svd = TruncatedSVD(n_components=20)
svd.fit(ratings_matrix)

TruncatedSVD(n_components=20)

In [82]:
movie_embedding = svd.components_.transpose()
movie_embedding.shape

(209172, 20)
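
To make the shapes concrete, here is a minimal sketch on a small random matrix (toy data, not MovieLens). `components_` has one row per component and one column per matrix column, so transposing it gives one embedding row per movie ID; `fit_transform` would give the matching per-user factors, which this notebook does not use.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Toy "ratings": 50 users x 30 movies, roughly 20% of the cells filled.
rng = np.random.default_rng(0)
dense = rng.random((50, 30)) * (rng.random((50, 30)) < 0.2)
toy_ratings = sparse.csr_matrix(dense)

toy_svd = TruncatedSVD(n_components=5, random_state=0)
user_factors = toy_svd.fit_transform(toy_ratings)
item_factors = toy_svd.components_.transpose()

print(user_factors.shape)  # (50, 5)
print(item_factors.shape)  # (30, 5)
```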

Let's look at the embedding for "Raiders of the Lost Ark" (movieId 1198)

In [83]:
movie_embedding[1198]

array([ 0.08744582,  0.01620714, -0.01203282, -0.07023505, -0.09860206,
        0.10747683,  0.06099423, -0.02860729,  0.02006373, -0.00695352,
       -0.04954247,  0.0076526 , -0.03609171, -0.02702496,  0.05481226,
        0.11156189,  0.11168829, -0.07143713,  0.00234318,  0.06138969])

Let's fit a nearest-neighbors index to search for similar movies

In [84]:
nn = NearestNeighbors(n_neighbors=20)
nn.fit(movie_embedding)

NearestNeighbors(n_neighbors=20)

Indiana Jones - similar movies
------------------

In [85]:
distances, movie_ids = nn.kneighbors([movie_embedding[1198]])
for distance, movie_id in zip(distances[0], movie_ids[0]):
    print(distance, movies_dict[movie_id])

0.0 Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)
0.11822598587882985 Star Wars: Episode V - The Empire Strikes Back (1980)
0.13933854201053533 Indiana Jones and the Last Crusade (1989)
0.1841531201994526 Star Wars: Episode VI - Return of the Jedi (1983)
0.2047188762724114 Princess Bride, The (1987)
0.22244455569077162 Die Hard (1988)
0.23319149426543237 Star Wars: Episode IV - A New Hope (1977)
0.23352187336807706 Terminator, The (1984)
0.23535671054241958 Aliens (1986)
0.23852108522732834 Monty Python and the Holy Grail (1975)
0.2418752281584256 Star Trek II: The Wrath of Khan (1982)
0.24188772184075155 Back to the Future (1985)
0.255201474604852 Glory (1989)
0.25533631441588045 Star Trek IV: The Voyage Home (1986)
0.25567052455738865 Indiana Jones and the Temple of Doom (1984)
0.25629137665715435 Alien (1979)
0.2593251429585787 Blade Runner (1982)
0.26229706265079655 Blues Brothers, The (1980)
0.26478479032091357 Star Trek III: The Search for Spock (

Naked Gun - similar movies
------------

In [99]:
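# NOTE: 26871 is not one of the Naked Gun movieIds found earlier (370, 3868, 3869, 128025),
# so the recorded output below starts from "My Father the Hero (1994)" instead.
# To query a Naked Gun movie, index by its movieId, e.g. movie_embedding[3868].
distances, movie_ids = nn.kneighbors([movie_embedding[26871]])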
for distance, movie_id in zip(distances[0], movie_ids[0]):
    print(distance, movies_dict[movie_id])

0.0 My Father the Hero (1994)
0.0008985587260327699 Good Dick (2008)
0.0009062983176811181 My Date with Drew (2004)
0.0009097100074937297 Home Room (2002)
0.0009868212762231428 Scorched (2003)
0.0010055508501584135 Gracie (2007)
0.001019302458954904 Margaret Cho: Assassin (2005)
0.0010289897107564393 Brief Interviews with Hideous Men (2009)
0.0010308597495533895 Puccini for Beginners (2006)
0.0010364846461040614 Greedy (1994)
0.0010499723436063993 November (2004)
0.0010609187728589766 Vamps (2012)
0.001065055799468779 Ellie Parker (2005)
0.0010734153624753745 Career Opportunities (1991)
0.0010756035078611846 Gray Matters (2006)
0.0010812815505772079 Noel (2004)
0.0010843504289125734 Buying the Cow (2002)
0.0011067544081203738 Wedding Daze (2006)
0.001115525954821114 Adam & Steve (2005)
0.0011246646057088449 Waking Up in Reno (2002)
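
Both lookups above repeat the same few lines, and it is easy to pass the wrong index. One way to tidy this up is a small helper; the sketch below is only a suggestion, and `similar_items`, the toy embedding and the toy titles are hypothetical names and data, not part of the notebook.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def similar_items(nn_index, embedding, item_id, titles):
    """Return (distance, title) pairs for the nearest neighbours of one item."""
    distances, ids = nn_index.kneighbors([embedding[item_id]])
    return [
        # IDs without a title (unused rows) get a placeholder instead of a KeyError.
        (float(d), titles.get(int(i), f"<unknown id {i}>"))
        for d, i in zip(distances[0], ids[0])
    ]


# Toy embedding: row index plays the role of movieId; row 0 is an unused ID.
toy_embedding = np.array([[0.0, 0.0], [1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
toy_titles = {1: "Movie A", 2: "Movie B", 3: "Movie C"}

toy_nn = NearestNeighbors(n_neighbors=2).fit(toy_embedding)
result = similar_items(toy_nn, toy_embedding, 1, toy_titles)
for distance, title in result:
    print(round(distance, 3), title)
```

With the notebook's objects the call would look like `similar_items(nn, movie_embedding, 3868, movies_dict)`. The query item itself comes back first at distance 0, as in the outputs above.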


Exercises:
==========
    1. Find your favourite movie and check the similar movies. Do the recommendations make sense?
    2. Try a different number of SVD components (more components may give more precise results).
    3. Try other decomposition algorithms from scikit-learn, such as NMF.
    4. Reduce the number of components to 2 and plot the embeddings, coloured by movie genre. Do they form clusters?
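
As a possible starting point for exercise 2, the sketch below refits the decomposition for several component counts and keeps each movie embedding. It runs on made-up toy data; replace `toy_ratings` with `ratings_matrix` to try it on MovieLens. The component counts are arbitrary choices, and the summed explained variance ratio is only a rough indicator, not a measure of recommendation quality.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Toy ratings in 1..5 for 200 users x 40 movies, about 10% of cells filled.
rng = np.random.default_rng(0)
dense = rng.integers(1, 6, size=(200, 40)) * (rng.random((200, 40)) < 0.1)
toy_ratings = sparse.csr_matrix(dense.astype(float))

embeddings = {}
for n_components in (2, 5, 10):
    svd = TruncatedSVD(n_components=n_components, random_state=0)
    svd.fit(toy_ratings)
    # One row per movie, one column per component.
    embeddings[n_components] = svd.components_.transpose()
    explained = svd.explained_variance_ratio_.sum()
    print(n_components, embeddings[n_components].shape, round(float(explained), 3))
```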