In [1]:
import numpy as np
import pandas as pd

from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

%matplotlib inline

# Prepare data

We use the Movielens-1M dataset ([Movielens](https://movielens.org/) is a movie recommendation service run by the GroupLens research group).

In [2]:
movies = pd.read_csv('movies.gz', index_col='movieid', header=0, encoding='unicode-escape')[['movienm', 'genreid']]
ratings = pd.read_csv('ratings.gz', header=0)

We have descriptions of the movies

In [3]:
movies.head()

movieid,movienm,genreid
1,Toy Story (1995),"Animation, Children's, Comedy"
2,Jumanji (1995),"Adventure, Children's, Fantasy"
3,Grumpier Old Men (1995),"Comedy, Romance"
4,Waiting to Exhale (1995),"Comedy, Drama"
5,Father of the Bride Part II (1995),Comedy


and ratings data

In [4]:
ratings.head()

,userid,movieid,rating
0,1,1193,5
1,1,661,3
2,1,914,3
3,1,3408,4
4,1,2355,5


Total number of ratings: 

In [5]:
ratings.shape[0]

1000209

Number of users and items:

In [6]:
ratings[['userid', 'movieid']].apply(pd.Series.nunique)

userid     6040
movieid    3706
dtype: int64

Data density (the fraction of user-movie pairs with a known rating; sparsity is one minus this value):

In [7]:
ratings.shape[0] / np.prod(ratings[['userid', 'movieid']].apply(pd.Series.nunique))

0.044683625622312845

Select favorite movies (to generate recommendations based on them):

In [8]:
movies.loc[movies.movienm.str.contains('blade runner|matrix', case=False)]  # case-insensitive regex match

movieid,movienm,genreid
541,Blade Runner (1982),"Film-Noir, Sci-Fi"
2571,"Matrix, The (1999)","Action, Sci-Fi, Thriller"


In [9]:
favorite_movies_ids = [541, 2571] 

Check the selection:

In [10]:
movies.loc[favorite_movies_ids]

movieid,movienm,genreid
541,Blade Runner (1982),"Film-Noir, Sci-Fi"
2571,"Matrix, The (1999)","Action, Sci-Fi, Thriller"


# Recsys model in 3 lines of code

1 Build a sparse matrix from the ratings data

In [11]:
data_matrix = csr_matrix((ratings.rating.values.astype('f8'), (ratings.userid.values, ratings.movieid.values)))
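
The constructor call above uses the `(data, (row, col))` form of `csr_matrix`, with user and movie ids serving directly as row and column indices. A minimal sketch on made-up triplets (the ids and ratings below are hypothetical, not taken from Movielens) shows what that implies for the shape:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical toy ratings given as (userid, movieid, rating) triplets with 1-based ids.
user_ids = np.array([1, 1, 2, 3])
movie_ids = np.array([2, 5, 5, 1])
values = np.array([5, 3, 4, 1], dtype='f8')

# Ids are used directly as indices, so the inferred shape is
# (max userid + 1, max movieid + 1); row 0 and column 0 stay empty.
toy_matrix = csr_matrix((values, (user_ids, movie_ids)))

print(toy_matrix.shape)  # (4, 6)
print(toy_matrix.nnz)    # 4 stored ratings; everything else is an implicit zero
print(toy_matrix[1, 5])  # 3.0
```

Because ids double as indices, a row or column of the matrix can be looked up by id without any remapping, at the cost of some all-zero rows and columns for ids that never occur.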

2 Compute a truncated sparse SVD

In [12]:
_, S, Vt = svds(data_matrix, k=50, return_singular_vectors='vh')
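
To get a feel for what `svds` returns, here is a small sketch on a randomly generated matrix (a hypothetical stand-in for the ratings data). It checks the shape of `Vt` and the orthonormality of its rows, which the recommendation step relies on:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Hypothetical small sparse matrix standing in for the ratings data.
rng = np.random.default_rng(0)
dense = rng.random((30, 20)) * (rng.random((30, 20)) < 0.3)
toy = csr_matrix(dense)

k = 5
_, s, Vt = svds(toy, k=k, return_singular_vectors='vh')

# Vt holds k right singular vectors as rows, with one column per item.
print(Vt.shape)  # (5, 20)

# The rows of Vt are orthonormal, so Vt @ Vt.T is the k x k identity
# up to numerical error.
print(np.allclose(Vt @ Vt.T, np.eye(k), atol=1e-6))
```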

3 Generate top-$n$ recommendations based on the known user preferences

In [13]:
movies.loc[np.argsort(-Vt.T @ Vt[:, favorite_movies_ids].sum(axis=1))[:15]] # assuming binary preference vector

movieid,movienm,genreid
2571,"Matrix, The (1999)","Action, Sci-Fi, Thriller"
541,Blade Runner (1982),"Film-Noir, Sci-Fi"
589,Terminator 2: Judgment Day (1991),"Action, Sci-Fi, Thriller"
260,Star Wars: Episode IV - A New Hope (1977),"Action, Adventure, Fantasy, Sci-Fi"
1214,Alien (1979),"Action, Horror, Sci-Fi, Thriller"
1240,"Terminator, The (1984)","Action, Sci-Fi, Thriller"
1617,L.A. Confidential (1997),"Crime, Film-Noir, Mystery, Thriller"
1199,Brazil (1985),Sci-Fi
1206,"Clockwork Orange, A (1971)",Sci-Fi
1196,Star Wars: Episode V - The Empire Strikes Back...,"Action, Adventure, Drama, Sci-Fi, War"
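
The one-liner in step 3 packs several operations together. The sketch below unpacks them on a small random matrix; the data, the item ids in `favorite_items`, and the variable names are hypothetical and only serve to illustrate the steps:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Hypothetical toy data: 40 users, 12 items, random ratings from 1 to 5.
rng = np.random.default_rng(0)
dense = rng.integers(1, 6, size=(40, 12)) * (rng.random((40, 12)) < 0.4)
toy_matrix = csr_matrix(dense.astype('f8'))

_, s, Vt = svds(toy_matrix, k=4, return_singular_vectors='vh')
V = Vt.T  # items x latent features

favorite_items = [3, 7]  # hypothetical favorite item ids
n_items = V.shape[0]

# Binary preference vector: 1 for each favorite item, 0 elsewhere.
p = np.zeros(n_items)
p[favorite_items] = 1.0

# Relevance scores V V^T p, evaluated right-to-left so that the dense
# items x items matrix V V^T is never formed.
scores = V @ (Vt @ p)

# For a binary p, Vt @ p is just the sum of the selected columns of Vt,
# which is the shortcut taken by the one-liner.
shortcut = V @ Vt[:, favorite_items].sum(axis=1)
print(np.allclose(scores, shortcut))  # True

# Indices of the 5 highest-scoring items, best first.
top_n = np.argsort(-scores)[:5]
print(top_n)
```

Note that nothing here excludes the favorite items themselves from the ranking; filtering out already-known items would be a separate step.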


# What the NLA has happened?

<img src="http://cdn-static.denofgeek.com/sites/denofgeek/files/styles/main_wide/public/2016/06/silicon_valley_s3.jpg">
<div style="text-align: right">image credit http://www.denofgeek.com</div>

SVD of the ratings matrix (imputed with zeros):

$$
A \approx U \Sigma V^T
$$

gives a compact *representation of users and movies in terms of some hidden (latent) features* encoded by $U$ and $V$ respectively.  
Recommendations are defined by an *orthogonal projection of a user's preferences onto the latent features space of movies*:

$$
\boldsymbol{r} = VV^T \boldsymbol{p},
$$

where $\boldsymbol{r}$ is a vector of predicted relevance scores for all movies and $\boldsymbol{p}$ is a vector of user preferences.  
Top-$n$ recommendations are generated as 

$$
\text{arg}\max_n\,\boldsymbol{r},
$$

i.e., the indices of the $n$ largest entries of $\boldsymbol{r}$.
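
The projection formula $\boldsymbol{r} = VV^T \boldsymbol{p}$ can be motivated by a short folding-in argument, sketched here under the assumption that the retained singular values are nonzero. The symbols $\boldsymbol{a}$ (a row of $A$) and $\boldsymbol{u}$ (a user's latent vector) are introduced only for this illustration. For a known user, $\boldsymbol{a}^T \approx \boldsymbol{u}^T \Sigma V^T$, and since $V^T V = I$, multiplying by $V \Sigma^{-1}$ on the right gives $\boldsymbol{u}^T \approx \boldsymbol{a}^T V \Sigma^{-1}$. Applying the same map to an arbitrary preference vector $\boldsymbol{p}$ and then reconstructing the scores yields

$$
\boldsymbol{u}^T = \boldsymbol{p}^T V \Sigma^{-1},
\qquad
\boldsymbol{r}^T = \boldsymbol{u}^T \Sigma V^T = \boldsymbol{p}^T V \Sigma^{-1} \Sigma V^T = \boldsymbol{p}^T V V^T,
$$

which is the transpose of $\boldsymbol{r} = VV^T \boldsymbol{p}$. The matrix $VV^T$ is symmetric and satisfies $(VV^T)^2 = V (V^T V) V^T = VV^T$, which is why it acts as an orthogonal projector; note also that neither $U$ nor $\Sigma$ is needed to compute the scores.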

The model is known as *PureSVD*; see Cremonesi, P., Koren, Y., and Turrin, R., [*Performance of recommender algorithms on top-n recommendation tasks*](https://dl.acm.org/citation.cfm?id=1864721), Proceedings of the Fourth ACM Conference on Recommender Systems, 2010.