Exercise for Kaggle Days Ulaanbaatar
================================

Download the ratings file from the MovieLens dataset.

More information is available at https://grouplens.org/datasets/movielens/

In [1]:
!wget http://files.grouplens.org/datasets/movielens/ml-25m.zip
!unzip ml-25m.zip

--2020-11-06 23:14:59--  http://files.grouplens.org/datasets/movielens/ml-25m.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 261978986 (250M) [application/zip]
Saving to: `ml-25m.zip.1'


2020-11-06 23:15:38 (6.50 MB/s) - `ml-25m.zip.1' saved [261978986/261978986]

Archive:  ml-25m.zip
replace ml-25m/tags.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: ^C


In [2]:
import pandas as pd
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import NearestNeighbors

In [3]:
ratings = pd.read_csv("ml-25m/ratings.csv").drop_duplicates(["userId", "movieId"])
movies = pd.read_csv("ml-25m/movies.csv")
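
Dropping duplicate (userId, movieId) pairs matters later on: when the same coordinate appears twice in a `coo_matrix`, SciPy sums the values once the matrix is converted to CSR or to a dense array, which would inflate a rating. A minimal sketch on made-up data (the tiny frame and the `to_dense` helper are illustrative, not part of MovieLens or of this notebook):

```python
import pandas as pd
from scipy import sparse

# Toy ratings with one duplicated (userId, movieId) pair; values are made up.
toy = pd.DataFrame({
    "userId": [0, 0, 1],
    "movieId": [2, 2, 1],
    "rating": [4.0, 3.0, 5.0],
})

def to_dense(df):
    """Build a COO matrix from (userId, movieId) coordinates and densify it."""
    m = sparse.coo_matrix(
        (df["rating"].values, (df["userId"].values, df["movieId"].values))
    )
    return m.tocsr().toarray()

with_dupes = to_dense(toy)
deduped = to_dense(toy.drop_duplicates(["userId", "movieId"]))

print(with_dupes[0, 2])  # the two duplicate ratings are summed
print(deduped[0, 2])     # only the first occurrence is kept
```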

In [17]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817
2,1,307,5.0,1147868828
3,1,665,5.0,1147878820
4,1,899,3.5,1147868510


Let's find my favourite movie series, Indiana Jones.

In [4]:
movies_dict = dict(zip(movies["movieId"], movies["title"]))
movies[movies["title"].str.find("Indiana Jones") >= 0]

Unnamed: 0,movieId,title,genres
1168,1198,Raiders of the Lost Ark (Indiana Jones and the...,Action|Adventure
1258,1291,Indiana Jones and the Last Crusade (1989),Action|Adventure
2025,2115,Indiana Jones and the Temple of Doom (1984),Action|Adventure|Fantasy
12357,59615,Indiana Jones and the Kingdom of the Crystal S...,Action|Adventure|Comedy|Sci-Fi
57380,196241,The Adventures of Young Indiana Jones: Adventu...,Action|Adventure|Drama
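
The title search can also be wrapped in a small helper built on `str.contains`, which reads a little more directly than `str.find(...) >= 0`. This is only a sketch: `find_movies` is a hypothetical name, and the toy frame stands in for the real `movies.csv`, using three titles copied from the outputs in this notebook.

```python
import pandas as pd

# Stand-in for movies.csv with a handful of rows.
movies_toy = pd.DataFrame({
    "movieId": [1291, 2115, 3869],
    "title": [
        "Indiana Jones and the Last Crusade (1989)",
        "Indiana Jones and the Temple of Doom (1984)",
        "Naked Gun 2 1/2: The Smell of Fear, The (1991)",
    ],
})

def find_movies(movies, text):
    """Return rows whose title contains `text`, ignoring case (hypothetical helper)."""
    # regex=False treats `text` literally; na=False skips missing titles.
    mask = movies["title"].str.contains(text, case=False, regex=False, na=False)
    return movies[mask]

print(find_movies(movies_toy, "indiana jones"))
```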


Let's find a funny comedy

In [5]:
movies[movies["title"].str.find("Naked Gun") >= 0]

Unnamed: 0,movieId,title,genres
365,370,Naked Gun 33 1/3: The Final Insult (1994),Action|Comedy
3765,3868,"Naked Gun: From the Files of Police Squad!, Th...",Action|Comedy|Crime|Romance
3766,3869,"Naked Gun 2 1/2: The Smell of Fear, The (1991)",Comedy
26982,128025,Naked Gun (1956),Western


In [7]:
ratings.nunique()

userId         162541
movieId         59047
rating             10
timestamp    20115267
dtype: int64

Let's build a matrix with ratings
-----------

Because users and movies are identified by integer IDs, we can use those IDs directly as row and column coordinates in the matrix.

We use `scipy.sparse.coo_matrix`, which is easy to build from (row, column) coordinates.

In [8]:
ratings_matrix = sparse.coo_matrix((ratings.rating.values, (ratings.userId.values, ratings.movieId.values)))
ratings_matrix

<162542x209172 sparse matrix of type '<class 'numpy.float64'>'
	with 25000095 stored elements in COOrdinate format>
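
Using raw IDs as coordinates is simple, but the matrix gets a row or column for every integer up to the largest ID, which is why it has 209172 columns for only 59047 distinct movies. The sketch below shows the effect on made-up data, together with one possible alternative that remaps IDs to dense indices with `pandas.factorize`. The remapping is an assumption for illustration, not what this notebook does.

```python
import pandas as pd
from scipy import sparse

# Made-up ratings with gaps in the ID ranges.
toy = pd.DataFrame({
    "userId": [1, 1, 3],
    "movieId": [10, 50, 10],
    "rating": [5.0, 3.5, 4.0],
})

# Raw IDs as coordinates: shape is (max userId + 1, max movieId + 1).
raw = sparse.coo_matrix(
    (toy["rating"].values, (toy["userId"].values, toy["movieId"].values))
)
print(raw.shape)

# Alternative: map each ID to a dense 0..n-1 index and keep the lookup arrays.
user_idx, user_ids = pd.factorize(toy["userId"])
movie_idx, movie_ids = pd.factorize(toy["movieId"])
compact = sparse.coo_matrix((toy["rating"].values, (user_idx, movie_idx)))
print(compact.shape)
print(movie_ids[1])  # original movieId stored in column 1 of `compact`
```

With the compact layout, a column index is no longer a movieId, so lookups such as `movie_embedding[1198]` would need to go through the `movie_ids` mapping first.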

In [9]:
svd = TruncatedSVD(n_components=20)
svd.fit(ratings_matrix)

TruncatedSVD(n_components=20)

In [10]:
movie_embedding = svd.components_.transpose()
movie_embedding.shape

(209172, 20)
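
`svd.components_` has shape (n_components, n_columns), so transposing it yields one row per matrix column, that is, one embedding per movie ID. A small sketch on a random sparse matrix; the sizes, the density, and `random_state=0` are chosen purely for illustration.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Random stand-in for the ratings matrix: 30 "users" x 12 "movies".
toy_matrix = sparse.random(30, 12, density=0.3, format="csr", random_state=0)

svd_toy = TruncatedSVD(n_components=4, random_state=0)
user_factors = svd_toy.fit_transform(toy_matrix)   # one row per user
movie_factors = svd_toy.components_.transpose()    # one row per movie

print(user_factors.shape, movie_factors.shape)

# The rows of components_ are orthonormal, so their Gram matrix is the identity.
gram = svd_toy.components_ @ svd_toy.components_.T
print(np.allclose(gram, np.eye(4), atol=1e-6))
```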

Let's look at the embedding for "Raiders of the Lost Ark" (movieId 1198).

In [11]:
movie_embedding[1198]

array([ 0.08744582,  0.01620713, -0.01203286, -0.07023537, -0.0986071 ,
        0.10747828,  0.06099655, -0.02860278,  0.01984636, -0.00688183,
       -0.04975415,  0.00932437, -0.03605594, -0.02638384,  0.06015141,
        0.11285749,  0.11305136, -0.07237385, -0.00484498,  0.0711646 ])

Let's fit a nearest-neighbor search index on the movie embeddings. It will take a while.

For details on the decomposition used above, see the TruncatedSVD documentation: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html#sklearn.decomposition.TruncatedSVD

In [12]:
nn = NearestNeighbors(n_neighbors=20)
nn.fit(movie_embedding)

NearestNeighbors(n_neighbors=20)
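
By default `NearestNeighbors` uses Euclidean distance (Minkowski with p=2), so both the direction and the length of an embedding vector affect the result. The sketch below uses four hand-made 2-D vectors to show what `kneighbors` returns, and compares it with `metric="cosine"`, which ignores vector length. The cosine variant is shown only as a possible alternative, not as what this notebook uses.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Four made-up 2-D "embeddings"; the row index plays the role of movieId.
emb = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [5.0, 0.0],   # same direction as row 0, but a much larger norm
])

# Euclidean (default): the query itself comes first, then the closest row.
euclid = NearestNeighbors(n_neighbors=2).fit(emb)
dist, idx = euclid.kneighbors([emb[0]])
print(idx[0], dist[0])

# Cosine: rows 0 and 3 point the same way, so both are at distance 0.
cosine = NearestNeighbors(n_neighbors=2, metric="cosine").fit(emb)
dist_c, idx_c = cosine.kneighbors([emb[0]])
print(idx_c[0], dist_c[0])
```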

Indiana Jones - similar movies
------------------

In [13]:
distances, movie_ids = nn.kneighbors([movie_embedding[1198]])
for distance, movie_id in zip(distances[0], movie_ids[0]):
    print(distance, movies_dict[movie_id])

0.0 Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)
0.11319057161491887 Star Wars: Episode V - The Empire Strikes Back (1980)
0.1399873133584752 Indiana Jones and the Last Crusade (1989)
0.18465128000789008 Star Wars: Episode VI - Return of the Jedi (1983)
0.20601690983492293 Princess Bride, The (1987)
0.22300411992879432 Die Hard (1988)
0.22947689381572411 Star Wars: Episode IV - A New Hope (1977)
0.23457487403574817 Terminator, The (1984)
0.23775986844616487 Back to the Future (1985)
0.23932734821637136 Aliens (1986)
0.24415344256426236 Monty Python and the Holy Grail (1975)
0.247865008871006 Star Trek II: The Wrath of Khan (1982)
0.25897859521168565 Indiana Jones and the Temple of Doom (1984)
0.259486195969495 Star Trek IV: The Voyage Home (1986)
0.2602243913301259 Alien (1979)
0.2625297208591597 Glory (1989)
0.26429001998870344 Blade Runner (1982)
0.268141922337746 Blues Brothers, The (1980)
0.2692909863324885 Star Trek III: The Search for Spock (1984)

Naked Gun - similar movies
------------

In [16]:
distances, movie_ids = nn.kneighbors([movie_embedding[3868]])
for distance, movie_id in zip(distances[0], movie_ids[0]):
    print(distance, movies_dict[movie_id])

0.0 Naked Gun: From the Files of Police Squad!, The (1988)
0.039020327960735054 Naked Gun 2 1/2: The Smell of Fear, The (1991)
0.03913172612790046 Coming to America (1988)
0.040023118822807145 Wayne's World 2 (1993)
0.041115292725337406 Revenge of the Nerds (1984)
0.042002291072521174 Airplane II: The Sequel (1982)
0.04260660165273912 History of the World: Part I (1981)
0.04262211063984662 ¡Three Amigos! (1986)
0.04262380690807967 Major League (1989)
0.04415163152833371 Planes, Trains & Automobiles (1987)
0.044220495439206614 Big Trouble in Little China (1986)
0.04441460992329794 Eddie Murphy Raw (1987)
0.0446007090173688 Scrooged (1988)
0.044657483181479794 What About Bob? (1991)
0.04477692956356961 Dirty Rotten Scoundrels (1988)
0.04509686318552511 European Vacation (aka National Lampoon's European Vacation) (1985)
0.04568472382446882 Beverly Hills Cop II (1987)
0.04579787421252706 Top Secret! (1984)
0.045910160640419674 White Men Can't Jump (1992)
0.04594901936555633 Uncle Buck (198
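
The two lookups above repeat the same pattern, so it could be wrapped in a function. A minimal sketch: `similar_movies` and the toy data are hypothetical. Unlike the cells above, it skips the query movie itself and any row index that has no entry in the title dictionary (the real embedding has a row for every integer up to the largest movieId, including IDs that do not exist in `movies.csv`), so it may return fewer than `k` results.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def similar_movies(movie_id, embedding, index, titles, k=3):
    """Return up to k (distance, title) pairs nearest to movie_id, excluding itself."""
    # Ask for one extra neighbor because the query is its own nearest neighbor.
    distances, ids = index.kneighbors([embedding[movie_id]], n_neighbors=k + 1)
    results = []
    for distance, other_id in zip(distances[0], ids[0]):
        if other_id == movie_id or other_id not in titles:
            continue
        results.append((float(distance), titles[other_id]))
    return results[:k]

# Toy data: four made-up embeddings with placeholder titles.
toy_embedding = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.3], [1.0, 1.0]])
toy_titles = {0: "Movie A", 1: "Movie B", 2: "Movie C", 3: "Movie D"}
toy_index = NearestNeighbors(n_neighbors=2).fit(toy_embedding)

for distance, title in similar_movies(0, toy_embedding, toy_index, toy_titles, k=2):
    print(round(distance, 2), title)
```

With the real objects from this notebook, the call would look like `similar_movies(1198, movie_embedding, nn, movies_dict, k=19)`.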

Exercises:
==========
    1. Find your favourite movie and check the similar movies. Do the recommendations make sense?
    2. Try a different number of SVD components (more components may give more precise results).
    3. Try other decomposition algorithms from scikit-learn, such as NMF.
    4. Reduce the number of components to 2 and plot the embeddings, colored by movie genre. Do they form clusters?